### Analysis of the Saturday Night Live database

If you have downloaded the SNL database, you have the following files available:

* snl_season (sid, year)
* snl_episode (sid, eid, year, aired, host)
* snl_title (sid, eid, tid, title, titleType)
* snl_actor (aid, name, isCast)
* snl_actor_title (sid, eid, tid, aid, actorType)
* snl_rating (lots of rating data from IMDb)

In this notebook I want to take a first look at the data and show some of the analysis that this dataset makes possible. Feel free to explore it yourself.

#### Imports & setup

In [1]:
import pandas as pd
import numpy as np
import bokeh
from bokeh.io import output_notebook
from bokeh.plotting import figure, show
output_notebook()

#### Load the data

In [2]:
dfs = pd.read_csv('./db/snl_season.csv', encoding="utf-8")
dfe = pd.read_csv('./db/snl_episode.csv', encoding="utf-8",parse_dates=['aired'])
dft = pd.read_csv('./db/snl_title.csv', encoding="utf-8")
dfa = pd.read_csv('./db/snl_actor.csv', encoding="utf-8")
dfat = pd.read_csv('./db/snl_actor_title.csv', encoding="utf-8")
dfr = pd.read_csv('./db/snl_rating.csv', encoding="utf-8")

#### Have a look at the data

In [3]:
dfs.head(2)

Unnamed: 0,sid,year
0,1,1975
1,2,1976


In [4]:
dfe.head(2)

Unnamed: 0,sid,eid,year,aired,host
0,3,20,1977,1978-05-20,Buck Henry
1,3,19,1977,1978-05-13,Richard Dreyfuss


In [5]:
dft.head(2)

Unnamed: 0,sid,eid,tid,title,titleType
0,5,18,1980051011,,Goodnights
1,2,3,1976100224,,Goodnights


In [6]:
dfa.head(2)

Unnamed: 0,aid,name,isCast
0,Bob Newhart,Bob Newhart,0
1,JaCu,Jane Curtin,1


In [7]:
dfat.head(2)

Unnamed: 0,sid,eid,tid,aid,actorType
0,5,18,1980051010,Bob Newhart,host
1,5,18,1980051010,JaCu,cast


In [8]:
dfr.head(2)

Unnamed: 0,sid,eid,1,10,2,3,4,5,6,7,...,Males Aged 45+_avg,Males under 18,Males under 18_avg,Males_avg,Non-US users,Non-US users_avg,Top 1000 voters,Top 1000 voters_avg,US users,US users_avg
0,2,21,0,10,0,0,4,2,4,9,...,7.2,0,,7.5,12,7.8,13,7.6,21,7.4
1,5,1,1,13,0,2,0,2,6,11,...,7.9,0,,7.9,11,7.2,16,6.9,28,7.8


#### Combine episodes and ratings
Since the ratings are per episode, we combine the two dataframes on season and episode ID.

In [9]:
dfer = pd.merge(dfe, dfr, on=['sid', 'eid'])
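An inner merge silently drops episodes that have no rating row, and rating rows that have no episode. A minimal sketch of how one could check for that, using small made-up frames in place of `dfe` and `dfr` (the values are illustrative and not taken from the dataset):

```python
import pandas as pd

# Toy stand-ins for dfe and dfr; all values are made up for illustration.
episodes = pd.DataFrame({"sid": [1, 1, 2], "eid": [1, 2, 1], "host": ["A", "B", "C"]})
ratings = pd.DataFrame({"sid": [1, 2], "eid": [1, 1], "IMDb users_avg": [7.5, 8.0]})

# An outer merge with indicator=True adds a "_merge" column that records
# whether each row came from the left frame, the right frame, or both.
check = pd.merge(episodes, ratings, on=["sid", "eid"], how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]
print(unmatched[["sid", "eid", "_merge"]])

# validate="one_to_one" raises MergeError if (sid, eid) is duplicated on either side.
merged = pd.merge(episodes, ratings, on=["sid", "eid"], validate="one_to_one")
```

If `unmatched` is empty for the real data, the inner merge loses nothing.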

#### Ratings over time (per episode)
Now we can create our first graph. Let us look at the ratings over time. First sort the dataframe by season and episode.

In [10]:
dfer = dfer.sort_values(['sid', 'eid'], ascending=[True, True]).reset_index(drop=True)

In [11]:
# plot a trend line, too
trend = np.polyfit(dfer.index, dfer["IMDb users_avg"].values, 10)
trend_func = np.poly1d(trend)

p = figure(plot_width=800, plot_height=200, y_range=(0,10))
r = p.multi_line([dfer.index, dfer.index],[dfer["IMDb users_avg"].values, trend_func(dfer.index)], color=['blue', 'red'])
t = show(p, notebook_handle=True)
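A degree-10 polynomial fitted against raw index values can be poorly conditioned (NumPy may warn about this) and tends to swing at the ends of the range. As an alternative sketch, not the method used in this notebook, a centered rolling mean gives a comparable trend line. The series below is synthetic and the window of 20 episodes is an arbitrary assumption:

```python
import numpy as np
import pandas as pd

# Synthetic ratings standing in for dfer["IMDb users_avg"].
rng = np.random.default_rng(0)
ratings = pd.Series(7.0 + 0.5 * np.sin(np.linspace(0, 6, 200)) + rng.normal(0, 0.3, 200))

# Centered rolling mean; min_periods=1 keeps values at both ends of the series.
trend = ratings.rolling(window=20, center=True, min_periods=1).mean()
```

`trend.values` could then be passed to `p.multi_line` in place of `trend_func(dfer.index)`.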

#### Ratings over time (per season)
It is also interesting to see how the average ratings of the season developed over the years.

In [12]:
sSeasonRatingAverage = dfer.groupby("sid")["IMDb users_avg"].mean()

In [13]:
p = figure(plot_width=800, plot_height=200, y_range=(0,10))
# Use the groupby result's own index so x and y are guaranteed to line up
r = p.line(sSeasonRatingAverage.index, sSeasonRatingAverage.values)
t = show(p, notebook_handle=True)
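A season mean hides how many rated episodes it is based on and how much they vary. A small sketch of reporting both alongside the mean, on made-up numbers:

```python
import pandas as pd

# Made-up episode ratings for two seasons.
toy = pd.DataFrame({
    "sid": [1, 1, 1, 2, 2],
    "IMDb users_avg": [7.0, 8.0, 9.0, 6.0, 7.0],
})

# One row per season with the mean, sample standard deviation, and episode count.
season_stats = toy.groupby("sid")["IMDb users_avg"].agg(["mean", "std", "count"])
print(season_stats)
```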

#### Ratings over time (conclusion)

As the graphs show, ratings rose steeply between seasons 28 and 33 and have been fairly constant since then. There were also some highs in the mid-90s and in the 80s.

#### Moving on to the actors
Now let us take a look at the actors. First, it would be interesting to know which actors appeared in the most sketches and which of them were most present during their time on the show (most sketches per episode). To do that, we have to merge most of the dataframes.

In [14]:
dfactors = pd.merge(pd.merge(dfat, dfer, on=['sid', 'eid']), dfa, on='aid')

Now let's take a look at the Top 10 actors of SNL when it comes to appearances.

In [15]:
sActorsAppearances = dfactors['name'].value_counts()
sActorsAppearances.head(10)

Kenan Thompson     933
Phil Hartman       913
Darrell Hammond    768
Fred Armisen       739
Bill Hader         696
Amy Poehler        687
Will Ferrell       654
Kevin Nealon       646
Bobby Moynihan     644
Kristen Wiig       633
Name: name, dtype: int64

The Top 3 are Kenan Thompson, Phil Hartman and Darrell Hammond. Since Kenan is still on the show, he can further increase his lead. But does he also have the most appearances per episode?

In [16]:
dfActorsEpisodes = pd.DataFrame(dfactors.groupby(['name','sid', 'eid'])['aid'].count().sort_values(ascending=False)).reset_index()
dfActorsEpisodes.head(10)

Unnamed: 0,name,sid,eid,aid
0,Ray Charles,3,5,12
1,Richard Pryor,1,7,12
2,Betty White,35,21,12
3,Ludacris,32,6,12
4,Justin Bieber,38,13,11
5,Josh Brolin,34,5,11
6,Steve Martin,31,12,11
7,Willie Nelson,12,12,11
8,Phil Hartman,17,7,11
9,Jennifer Lopez,35,15,11


In this category four actors share first place: Ludacris, Richard Pryor, Ray Charles and Betty White. Each of them appeared in 12 titles in a single episode. But which actor had the biggest presence across several episodes? Of course, it only makes sense to look at actors who appeared in more than one episode.

In [17]:
# Per actor: number of episodes appeared in and total titles across those episodes.
# dfActorsEpisodes has one row per (name, sid, eid); 'aid' holds the title count for that episode.
# Named aggregation is used because newer pandas versions reject the nested-dict renaming form.
dfActorsTitlePerEpisode = dfActorsEpisodes.groupby('name')['aid'].agg(episodes='count', titles='sum')

In [18]:
dfActorsTitlePerEpisode["title_avg"] = dfActorsTitlePerEpisode["titles"] / dfActorsTitlePerEpisode["episodes"]
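To make the calculation above concrete, here is a small worked sketch of the same three steps on invented data. The actor names and IDs are made up, and `per_episode` and `per_actor` are illustrative names, not variables from this notebook:

```python
import pandas as pd

# One row per (actor, title) appearance; names and values are made up.
appearances = pd.DataFrame({
    "name": ["Ann", "Ann", "Ann", "Bob", "Bob"],
    "sid":  [1, 1, 1, 1, 1],
    "eid":  [1, 1, 2, 1, 2],
    "aid":  ["An", "An", "An", "Bo", "Bo"],
})

# Step 1: titles per actor per episode.
per_episode = (
    appearances.groupby(["name", "sid", "eid"])["aid"].count().reset_index(name="titles")
)

# Step 2: per actor, number of episodes and total titles.
per_actor = per_episode.groupby("name")["titles"].agg(episodes="count", titles="sum")

# Step 3: average titles per episode.
per_actor["title_avg"] = per_actor["titles"] / per_actor["episodes"]
print(per_actor)
```

Ann appears in 3 titles over 2 episodes (average 1.5) and Bob in 2 titles over 2 episodes (average 1.0).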

Let's take a look at the actors with appearances in at least 3 episodes.

In [19]:
dfActorsTitlePerEpisode[dfActorsTitlePerEpisode.episodes>=3].sort_values('title_avg', ascending=False).head(10)

name,episodes,titles,title_avg
Charles Barkley,3,25,8.333333
Zach Galifianakis,3,24,8.0
Louis C.K.,3,23,7.666667
Jack Black,3,23,7.666667
Drake,3,22,7.333333
Jennifer Lopez,3,22,7.333333
Garth Brooks,3,21,7.0
Lily Tomlin,4,28,7.0
Charles Rocket,12,83,6.916667
Jonah Hill,4,27,6.75


Charles Barkley wins with 8.3 titles per episode. What about actors with at least 10 episodes?

In [20]:
dfActorsTitlePerEpisode[dfActorsTitlePerEpisode.episodes>=10].sort_values('title_avg', ascending=False).head(10)

name,episodes,titles,title_avg
Charles Rocket,12,83,6.916667
Phil Hartman,163,913,5.601227
Joe Piscopo,72,364,5.055556
Bill Murray,78,389,4.987179
Gail Matthius,13,62,4.769231
Amy Poehler,148,687,4.641892
Will Ferrell,143,654,4.573427
John Goodman,19,86,4.526316
Dan Aykroyd,92,415,4.51087
Kristen Wiig,141,633,4.489362


Now let's look at people with at least 50 episodes under their belt. These are mostly cast members.

In [21]:
dfActorsTitlePerEpisode[dfActorsTitlePerEpisode.episodes>=50].sort_values('title_avg', ascending=False).head(10)

name,episodes,titles,title_avg
Phil Hartman,163,913,5.601227
Joe Piscopo,72,364,5.055556
Bill Murray,78,389,4.987179
Amy Poehler,148,687,4.641892
Will Ferrell,143,654,4.573427
Dan Aykroyd,92,415,4.51087
Kristen Wiig,141,633,4.489362
Gilda Radner,106,464,4.377358
Tim Kazurinsky,60,260,4.333333
Bill Hader,162,696,4.296296


Here we see Phil Hartman's impressive record: an average of 5.6 titles per episode across more than 160 episodes.
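The three threshold queries above differ only in the minimum episode count, so they could be wrapped in a small helper. This is a sketch only: `top_by_title_avg` is a hypothetical name, and the sample frame holds just three rows copied from the tables above.

```python
import pandas as pd

def top_by_title_avg(df, min_episodes, n=10):
    """Return the n rows with the highest title_avg among actors with at least min_episodes."""
    eligible = df[df["episodes"] >= min_episodes]
    return eligible.sort_values("title_avg", ascending=False).head(n)

# Three rows copied from the tables above, in the shape of dfActorsTitlePerEpisode.
sample = pd.DataFrame(
    {"episodes": [3, 12, 163], "titles": [25, 83, 913]},
    index=pd.Index(["Charles Barkley", "Charles Rocket", "Phil Hartman"], name="name"),
)
sample["title_avg"] = sample["titles"] / sample["episodes"]

print(top_by_title_avg(sample, min_episodes=10))
```

With the full data, `top_by_title_avg(dfActorsTitlePerEpisode, 50)` would correspond to the last query.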

#### End of the initial analysis
I hope this has sparked your interest in the dataset. Maybe you have ideas for other interesting things to analyse about this TV show, which is currently in its 42nd season. I will also add more data to the dataset if you point me towards a source of interesting data that would fit into it.