# Introduction to Spark

## Objective

Analyze the guests of The Daily Show using basic Spark concepts.

## Data Set

The data set contains the names of all of the guests who have appeared on [The Daily Show](https://en.wikipedia.org/wiki/The_Daily_Show).

The data set is formatted as TSV (tab-separated values) and can be downloaded from [FiveThirtyEight's data repository](https://github.com/fivethirtyeight/data/tree/master/daily-show-guests).

The columns of the data set are described below:

| Header                    | Definition                                                                                                                                              |
|---------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|
| YEAR                      | The year the episode aired                                                                                                                              |
| GoogleKnowlege_Occupation | Their occupation or office, according to Google's Knowledge Graph or, if they're not in there, how Stewart introduced them on the program.              |
| Show                      | Air date of episode. Not unique, as some shows had more than one guest                                                                                  |
| Group                     | A larger group designation for the occupation. For instance, US senators, US presidents, and former presidents are all under "politicians"              |
| Raw_Guest_List            | The person or list of people who appeared on the show, according to Wikipedia. The GoogleKnowlege_Occupation only refers to one of them in a given row. |


## Reading In The Data

In [2]:
import findspark
findspark.init()

import pyspark
sc = pyspark.SparkContext()

raw_data = sc.textFile('daily_show_guests.tsv')
raw_data.take(5)

['YEAR\tGoogleKnowlege_Occupation\tShow\tGroup\tRaw_Guest_List',
 '1999\tactor\t1/11/1999\tActing\tMichael J. Fox',
 '1999\tComedian\t1/12/1999\tComedy\tSandra Bernhard',
 '1999\ttelevision actress\t1/13/99\tActing\tTracey Ullman',
 '1999\tfilm actress\t1/14/99\tActing\tGillian Anderson']

In [4]:
daily_show = raw_data.map(lambda line: line.split('\t'))
daily_show.take(5)

[['YEAR', 'GoogleKnowlege_Occupation', 'Show', 'Group', 'Raw_Guest_List'],
 ['1999', 'actor', '1/11/1999', 'Acting', 'Michael J. Fox'],
 ['1999', 'Comedian', '1/12/1999', 'Comedy', 'Sandra Bernhard'],
 ['1999', 'television actress', '1/13/99', 'Acting', 'Tracey Ullman'],
 ['1999', 'film actress', '1/14/99', 'Acting', 'Gillian Anderson']]
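
The cells in this walkthrough address fields by position (`x[0]`, `line[1]`). As a minimal sketch of a more readable alternative, each split line can be wrapped in a named record. `GuestRow`, its field names, and `parse_line` are hypothetical names chosen for this illustration; they are not part of the data set or of Spark.

```python
from collections import namedtuple

# Hypothetical record type: the field names are assumptions made for this
# sketch and simply mirror the five columns of the header row.
GuestRow = namedtuple("GuestRow", ["year", "occupation", "show", "group", "guests"])


def parse_line(line):
    """Split one tab-separated line into a GuestRow.

    Raises TypeError if the line does not contain exactly five fields.
    """
    return GuestRow(*line.split("\t"))


# A line copied from the raw_data.take(5) output above.
row = parse_line("1999\tactor\t1/11/1999\tActing\tMichael J. Fox")
print(row.year, row.occupation, row.guests)  # 1999 actor Michael J. Fox
```

A function like this could presumably be passed to `raw_data.map(...)` in place of the lambda, so that later steps read `row.occupation` instead of `row[1]`.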

### Number of Guests Who Appeared on The Daily Show Each Year

In [5]:
tally = daily_show.map(lambda x: (x[0], 1)).reduceByKey(lambda x,y: x+y)
print(tally)

PythonRDD[8] at RDD at PythonRDD.scala:43

Printing `tally` displays only the RDD object, not its contents: Spark transformations such as `map` and `reduceByKey` are lazy, so nothing is computed until an action such as `take` runs.


In [7]:
tally.collect()

[('YEAR', 1),
 ('2012', 164),
 ('2013', 166),
 ('2004', 164),
 ('2011', 163),
 ('2014', 163),
 ('2002', 159),
 ('2007', 141),
 ('2015', 100),
 ('2003', 166),
 ('2010', 165),
 ('2001', 157),
 ('2000', 169),
 ('2008', 164),
 ('2005', 162),
 ('2009', 163),
 ('1999', 166),
 ('2006', 161)]
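
To make the `map` and `reduceByKey` steps above concrete, here is a plain-Python sketch of the same computation on the five rows shown earlier. The helper `reduce_by_key` is a hypothetical stand-in written for this illustration, not Spark's implementation; it only mimics the idea of combining all values that share a key.

```python
def reduce_by_key(pairs, func):
    """Combine the values of equal keys with func (illustrative helper only)."""
    result = {}
    for key, value in pairs:
        # The first value for a key is stored as-is; later ones are merged in.
        result[key] = func(result[key], value) if key in result else value
    return result


# The five rows returned by daily_show.take(5), header included.
rows = [
    ['YEAR', 'GoogleKnowlege_Occupation', 'Show', 'Group', 'Raw_Guest_List'],
    ['1999', 'actor', '1/11/1999', 'Acting', 'Michael J. Fox'],
    ['1999', 'Comedian', '1/12/1999', 'Comedy', 'Sandra Bernhard'],
    ['1999', 'television actress', '1/13/99', 'Acting', 'Tracey Ullman'],
    ['1999', 'film actress', '1/14/99', 'Acting', 'Gillian Anderson'],
]

# Same shape as the Spark cell: emit (year, 1), then sum the ones per year.
pairs = [(row[0], 1) for row in rows]
tally_local = reduce_by_key(pairs, lambda x, y: x + y)
print(tally_local)  # {'YEAR': 1, '1999': 4}
```

The header row is counted like any other record, which is why `('YEAR', 1)` shows up in the Spark output as well.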

## Removing the Column Header

In [8]:
def filter_year(line):
    # Keep every row except the header, whose first field is 'YEAR'
    return line[0] != 'YEAR'

filtered_daily_show = daily_show.filter(filter_year)
filtered_daily_show

PythonRDD[13] at RDD at PythonRDD.scala:43

In [10]:
filtered_daily_show.take(5)

[['1999', 'actor', '1/11/1999', 'Acting', 'Michael J. Fox'],
 ['1999', 'Comedian', '1/12/1999', 'Comedy', 'Sandra Bernhard'],
 ['1999', 'television actress', '1/13/99', 'Acting', 'Tracey Ullman'],
 ['1999', 'film actress', '1/14/99', 'Acting', 'Gillian Anderson'],
 ['1999', 'actor', '1/18/99', 'Acting', 'David Alan Grier']]

## Pipelines

Spark lets you chain a series of data transformations into a pipeline.

The pipeline below filters out guests whose profession is blank, lowercases each profession, counts how often each profession occurs, and outputs the first five tuples of the resulting histogram.

In [11]:
filtered_daily_show.filter(lambda line: line[1] != '') \
                   .map(lambda line: (line[1].lower(), 1)) \
                   .reduceByKey(lambda x,y: x+y) \
                   .take(5)


[('radio personality', 3),
 ('former governor of new york', 1),
 ('illustrator', 1),
 ('presidnet', 3),
 ('former united states secretary of state', 6)]
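
The three pipeline stages above (filter, map, count) can be sketched in plain Python to check the logic on a handful of rows without Spark. The first three rows are copied from the output shown earlier; the last row is hypothetical and was invented only to exercise the blank-profession filter.

```python
from collections import Counter

rows = [
    ['1999', 'actor', '1/11/1999', 'Acting', 'Michael J. Fox'],
    ['1999', 'Comedian', '1/12/1999', 'Comedy', 'Sandra Bernhard'],
    ['1999', 'actor', '1/18/99', 'Acting', 'David Alan Grier'],
    # Hypothetical row (not from the data set) with a blank profession.
    ['1999', '', '1/1/99', 'Misc', 'Example Guest'],
]

# Stage 1: drop rows whose profession is blank.
with_profession = (row for row in rows if row[1] != '')
# Stage 2: lowercase each profession.
professions = (row[1].lower() for row in with_profession)
# Stage 3: count occurrences of each profession.
histogram = Counter(professions)

print(histogram.most_common(5))  # [('actor', 2), ('comedian', 1)]
```

Unlike `take(5)`, which returns five tuples in no particular order, `most_common(5)` sorts by count. In PySpark, a comparable result could likely be obtained by ending the pipeline with `takeOrdered(5, key=lambda kv: -kv[1])` instead of `take(5)`.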