<a href="https://colab.research.google.com/github/ErikSeguinte/movie_data/blob/master/processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import pandas as pd
import cufflinks as cf
import numpy as np
import plotly.graph_objects as go
import altair as alt
import plotly.express as px

In [0]:
merge_keys = {'left_index':True, 'right_index':True}

* I previously pulled CSV files from Kaggle, but the files were too big to host on GitHub.
* I imported the files I wanted into pandas, and then exported them back out as compressed pickles.
* I was able to compress a 700 MB CSV to a 3 MB pickle.
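
The round trip described above can be sketched on a tiny made-up frame instead of the Kaggle data. The file name `toy.pkl.xz` and the column values are hypothetical; the sketch assumes pandas' default `compression='infer'`, which picks xz from the `.xz` suffix.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the Kaggle data (hypothetical columns and values).
toy = pd.DataFrame({"movieId": [1, 2, 3], "rating": [4.0, 3.5, 5.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.pkl.xz")
    # compression defaults to 'infer', so the .xz suffix selects xz (LZMA).
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# The round trip preserves values and dtypes.
same = toy.equals(restored)
```

How much a file shrinks depends heavily on the data, so the ratio seen on a toy frame says nothing about the real files.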

In [3]:
#try: 
#    movies = pd.read_pickle('data/movies.pkl.xz')
#    ratings = pd.read_pickle('data/ratings2.pkl.xz')
#    cpi = pd.read_csv('data/cpi.csv')
    
#except FileNotFoundError                         :
    
# Download pickles from github
cpi = pd.read_csv('https://datahub.io/core/cpi/r/cpi.csv')

!wget 'https://github.com/ErikSeguinte/movie_data/raw/master/data/movies.pkl.xz'
!wget 'https://github.com/ErikSeguinte/movie_data/raw/master/data/ratings2.pkl.xz'
movies = pd.read_pickle('movies.pkl.xz')
ratings = pd.read_pickle('ratings2.pkl.xz')

--2020-03-06 03:41:35--  https://github.com/ErikSeguinte/movie_data/raw/master/data/movies.pkl.xz
Resolving github.com (github.com)... 52.74.223.119
Connecting to github.com (github.com)|52.74.223.119|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/ErikSeguinte/movie_data/master/data/movies.pkl.xz [following]
--2020-03-06 03:41:36--  https://raw.githubusercontent.com/ErikSeguinte/movie_data/master/data/movies.pkl.xz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7917036 (7.5M) [application/octet-stream]
Saving to: ‘movies.pkl.xz’


2020-03-06 03:41:38 (131 MB/s) - ‘movies.pkl.xz’ saved [7917036/7917036]

--2020-03-06 03:41:40--  https://github.com/ErikSeguinte/movie_data/raw/master/data/ratings2.pk

## Clean Movie DF

In [4]:
movies.dtypes

adult                     object
belongs_to_collection     object
budget                    object
genres                    object
homepage                  object
id                        object
imdb_id                   object
original_language         object
original_title            object
overview                  object
popularity                object
poster_path               object
production_companies      object
production_countries      object
release_date              object
revenue                  float64
runtime                  float64
spoken_languages          object
status                    object
tagline                   object
title                     object
video                     object
vote_average             float64
vote_count               float64
dtype: object

* The movies DataFrame has malformed data. `id` should be numeric.
* After inspection, it looks like some rows are missing a comma somewhere, so their columns don't line up and values land in the wrong columns. Let's clean those up.
* All malformed rows have strings for IDs instead of numbers, so we will coerce the column to numeric; strings become `NaN`, and we then drop those rows.

* `budget` and `revenue` should also be numeric, but rows with `NaN` in those columns won't be dropped.
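
As a small illustration of the coercion step, here is a sketch on a made-up `id` column. The values are hypothetical and only stand in for a row whose columns have shifted.

```python
import pandas as pd

# Hypothetical 'id' column: one value is text that slid in from a shifted row.
ids = pd.Series(["862", "8844", "1997-08-20", "15602"])

# errors='coerce' turns anything unparseable into NaN instead of raising.
numeric_ids = pd.to_numeric(ids, errors="coerce")

# Keep only the rows whose id parsed cleanly.
clean_ids = numeric_ids[numeric_ids.notnull()]
```

Because `NaN` is a float, the coerced column ends up as `float64`, which is why the ids print as `5.0`, `6.0`, and so on later in the notebook.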






In [0]:
movies['id'] = pd.to_numeric(movies['id'], errors='coerce')
movies = movies[movies['id'].notnull()]
movies = movies.set_index('id')

In [0]:
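def to_numeric(df, labels):
    # Coerce each listed column to numeric; unparseable values become NaN.
    # Read from df (not the global `movies`) so the helper works on any frame.
    for label in labels:
        df[label] = pd.to_numeric(df[label], errors='coerce')
    return df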

In [0]:
movies = to_numeric(movies, ['budget', 'revenue', 'vote_average'])

In [0]:
movies['release_date'] =pd.to_datetime(movies['release_date'], infer_datetime_format= True)

In [0]:
movies['year'] = movies['release_date'].dt.year

In [0]:
clean_movies = movies[['title','genres', 'release_date','budget', 'revenue','year' ,'runtime', 'vote_average', 'vote_count']]

## Process User Reviews
* User reviews come in a collection of individual reviews where a review gives a movie a score of 1 to 5.
* We will take the mean ratings for each movie

In [0]:
# Aggregate mean ratings and number of votes per movie
movie_ratings =pd.DataFrame(ratings.groupby('movieId')[['rating']].agg(['mean', 'count']))['rating']
movie_ratings = movie_ratings.rename({'mean': 'rating', 'count': 'num_votes'}, axis = 1)


* Let's drop any movies with fewer than 100 votes. Their averages are more easily swayed by outliers and aren't reliable.

In [0]:
movie_ratings = movie_ratings[(movie_ratings['num_votes'] >= 100)]

* And now we merge the averaged ratings back with the movie database.
* Note that not all movies are present in the user ratings, so the merge keeps only movies that have ratings.
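
To make the effect of the merge concrete, here is a sketch on two made-up frames indexed by movie id. The names `toy_movies` and `toy_ratings` are hypothetical; the point is that `merge` defaults to an inner join.

```python
import pandas as pd

# Hypothetical toy frames indexed by movie id.
toy_movies = pd.DataFrame({"title": ["A", "B", "C"]}, index=[1, 2, 3])
toy_ratings = pd.DataFrame({"rating": [3.5, 4.0]}, index=[2, 3])

# merge() defaults to how='inner', so id 1 (no ratings) is dropped.
inner = toy_movies.merge(toy_ratings, left_index=True, right_index=True)

# how='left' would keep every movie and fill missing ratings with NaN.
left = toy_movies.merge(
    toy_ratings, left_index=True, right_index=True, how="left"
)
```

The inner join suits this analysis because every later step needs a rating; a left join would be the choice if unrated movies had to stay in the table.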

In [0]:
clean_movies = clean_movies.merge(movie_ratings, left_index = True, right_index=True)

In [65]:
clean_movies[['title', 'rating', 'num_votes']].nlargest(10, 'rating').style.hide_index().format(
    {'rating':"{:.2f}",
     'num_votes': "{:,.0f}"}
)

title,rating,num_votes
Sleepless in Seattle,4.34,57070
Once Were Warriors,4.27,67662
Hard Target,4.26,13994
License to Wed,4.23,60024
The Talented Mr. Ripley,4.18,33987
Galaxy Quest,4.17,5453
Terminator 3: Rise of the Machines,4.17,87901
Local Color,4.17,25245
Hannibal Rising,4.16,5199
Ice Age: The Meltdown,4.15,3628


* The Movie Database (TMDB) also provides a rating, which suffers from a similar problem: some movies have a tiny sample size.

In [72]:
(clean_movies[clean_movies['vote_count'] >= 100]
    [['title', 'vote_average', 'vote_count']]
    .nlargest(10, 'vote_average')
    .style
    .hide_index()
    .format({'vote_average':'{:.1f}', 'vote_count':'{:,.0f}'})
)
 


title,vote_average,vote_count
The Godfather,8.5,6024
The Shawshank Redemption,8.5,8358
Spirited Away,8.3,3968
The Dark Knight,8.3,12269
The Godfather: Part II,8.3,3418
Schindler's List,8.3,4436
One Flew Over the Cuckoo's Nest,8.3,3001
Psycho,8.3,2405
Fight Club,8.3,9678
Life Is Beautiful,8.3,3643


In [74]:
(clean_movies[['title', 'revenue']]
 .nlargest(10, 'revenue')
 .style
 .hide_index()
 .format({
     "revenue":"${:,.0f}"
 }
 )
)

title,revenue
Titanic,"$1,845,034,188"
The Lord of the Rings: The Return of the King,"$1,118,888,979"
Pirates of the Caribbean: Dead Man's Chest,"$1,065,659,812"
Pirates of the Caribbean: On Stranger Tides,"$1,045,713,802"
The Dark Knight,"$1,004,558,444"
Harry Potter and the Philosopher's Stone,"$976,475,550"
Finding Nemo,"$940,335,536"
Harry Potter and the Half-Blood Prince,"$933,959,197"
The Lord of the Rings: The Two Towers,"$926,287,400"
Star Wars: Episode I - The Phantom Menace,"$924,317,558"


## Inflation
* Inflation means that a 1940 dollar is worth more than a 2020 dollar. Let's adjust revenue for that.
* The Consumer Price Index (CPI) can be used to convert to standardized dollars.
* Here, we'll be using 2014 dollars.
* Years later than 2014 will not be adjusted.
$$ \textrm{adjusted dollars} = \textrm{original dollars} \times \frac{\textrm{CPI}_{2014}}{\textrm{CPI}_{\textrm{release year}}}$$
* where $\textrm{CPI}_{2014}$ is the CPI of the target year (2014) and $\textrm{CPI}_{\textrm{release year}}$ is the CPI of the year the movie was released.
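
A worked example of the conversion, using made-up CPI values rather than the datahub.io series. The helper name `to_2014_dollars` and the numbers in `toy_cpi` are hypothetical and chosen so the arithmetic is easy to check.

```python
# Hypothetical CPI values, for illustration only (not the datahub.io series).
toy_cpi = {1980: 50.0, 2014: 100.0}


def to_2014_dollars(value, year, cpi_by_year=toy_cpi, target_year=2014):
    """Scale value by CPI(target_year) / CPI(year).

    Years with no CPI entry are returned unchanged.
    """
    if year not in cpi_by_year:
        return value
    return value * cpi_by_year[target_year] / cpi_by_year[year]


# The CPI doubled between 1980 and 2014, so the value doubles.
adjusted_1980 = to_2014_dollars(1_000_000, 1980)
# 2020 has no CPI entry here, so the value comes back as-is.
unadjusted_2020 = to_2014_dollars(1_000_000, 2020)
```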

In [0]:
cpi = cpi[cpi['Country Name'] == 'United States'][['Year', 'CPI']]

In [0]:
cpi = cpi.set_index(cpi['Year'])

In [0]:
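def adjust_dollars(value, year):
    # Convert `value` from `year` dollars to 2014 dollars using the CPI table.
    year = int(year)
    try:
        current = cpi.loc[2014, 'CPI']
        base = cpi.loc[year, 'CPI']
        adjusted_value = value * (current / base)
        return adjusted_value
    except KeyError:
        # No CPI entry for this year: leave the value unadjusted.
        return value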

In [0]:
clean_movies['year']= clean_movies['release_date'].dt.year

In [0]:
df = clean_movies[clean_movies['revenue'].notnull() & clean_movies['year'].notnull()]

In [0]:
adjusted = pd.DataFrame([adjust_dollars(x,y) for x,y in zip(df['revenue'], df['year'])], index = df.index, columns = ['adjusted_revenue'])

In [0]:
clean_movies = clean_movies.merge(adjusted, left_index=True, right_index=True)

In [76]:
(clean_movies[['title', 'adjusted_revenue']]
 .nlargest(10, 'adjusted_revenue')
 .style
 .hide_index()
 .format({
     "adjusted_revenue":"${:,.0f}"
 }
 )
)

title,adjusted_revenue
Star Wars,"$3,028,727,803"
Titanic,"$2,721,127,532"
E.T. the Extra-Terrestrial,"$1,945,321,985"
The Empire Strikes Back,"$1,546,829,516"
Jurassic Park,"$1,507,846,186"
The Lord of the Rings: The Return of the King,"$1,439,899,368"
The Godfather,"$1,387,267,307"
Return of the Jedi,"$1,361,232,958"
Star Wars: Episode I - The Phantom Menace,"$1,313,638,874"
Harry Potter and the Philosopher's Stone,"$1,305,536,965"


In [0]:
clean_movies['decade'] = [x - (x%10) for x in clean_movies['year']]

In [78]:
px.line(clean_movies.groupby('year')['vote_average'].mean().reset_index(), 
       x ='year',
       y = 'vote_average' )

In [79]:
px.line(clean_movies.groupby('year')['rating'].mean().reset_index(), 
       x ='year',
       y = 'rating' )

* I'd like to compare the votes from TMDB to the user ratings, but they are on different scales. We'll use `StandardScaler` to standardize both (zero mean, unit variance) so we can compare them more easily.
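
To show what standardizing does, here is a sketch in plain Python on made-up scores. The helper `standardize` is hypothetical; it uses the population standard deviation, which is what `StandardScaler` uses as well.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical scores on two different scales.
tmdb_votes = [6.0, 7.0, 8.0]    # TMDB-style scores
user_ratings = [3.0, 3.5, 4.0]  # star-style ratings

z_tmdb = standardize(tmdb_votes)
z_user = standardize(user_ratings)
```

Both lists come out with mean 0 and the same spread, so a movie that sits one standard deviation above average on either scale gets the same standardized score.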

In [0]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
to_scale = clean_movies[['vote_average', 'rating']].dropna()
scaled = scaler.fit_transform(to_scale)
clean_movies = clean_movies.merge(
    pd.DataFrame(
        scaled,
        index = to_scale.index,
        columns = ['scaled_tmdb_vote', 'scaled_user_rating']
    ),
    left_index = True,
    right_index = True,
)

In [34]:
traces = [
    go.Bar(name='TMDB rating',
        x = clean_movies.groupby('decade')['scaled_tmdb_vote'].mean().index,
        y = clean_movies.groupby('decade')['scaled_tmdb_vote'].mean()
    ),
        go.Bar(name='user rating',
        x = clean_movies.groupby('decade')['scaled_user_rating'].mean().index,
        y = clean_movies.groupby('decade')['scaled_user_rating'].mean()
    )
]

go.Figure(data = traces,
    layout_xaxis_tick0 = 1890
)

In [0]:
from sklearn.decomposition import PCA

pca = PCA(1)

to_scale = clean_movies[['vote_average', 'rating', 'adjusted_revenue']].dropna()
scaled = scaler.fit_transform(to_scale)

pca_df = pd.DataFrame(pca.fit_transform(scaled), index=to_scale.index, columns = ['PCA'])

clean_movies = clean_movies.merge(pca_df, left_index = True, right_index = True)

In [80]:
(clean_movies[['title', 'PCA']]
 .nlargest(10, 'PCA')
 .style
 .hide_index()
 .format({
     'PCA':'{:.2f}'
 })
)

title,PCA
Star Wars,8.87
Titanic,7.51
E.T. the Extra-Terrestrial,5.27
The Empire Strikes Back,4.82
The Godfather,4.76
The Lord of the Rings: The Return of the King,4.51
Jurassic Park,4.35
Return of the Jedi,4.22
The Lord of the Rings: The Two Towers,3.98
The Dark Knight,3.87


In [0]:
clean_movies['q_budget'] = pd.qcut(clean_movies['budget'], q = 5, 
labels = ['vlow', 'low', 'med', 'high', 'vhigh']
)

In [0]:
budget_ratings = clean_movies[['title','q_budget', 'budget', 'revenue', 'rating', 'vote_average']].dropna()

In [39]:
px.histogram(budget_ratings.dropna(),
       x = 'q_budget',
       y = 'vote_average')

In [40]:
alt.Chart(budget_ratings, width=720).mark_bar().encode(
    x='q_budget:N',
    y='mean(vote_average)'
)

In [41]:
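# Scatter of TMDB rating against budget (the axis title and plot title
# both refer to budget, so plot the budget column on x).
trace = go.Scatter(
    y = budget_ratings['vote_average'],
    x = budget_ratings['budget'],
    mode = 'markers'
)

go.Figure(
    trace,
    layout_xaxis_title = "Budget",
    layout_yaxis_title = "Movie Rating",
    layout_title = "Movie Ratings by budget",
)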

In [0]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

## Genre

In [0]:
# Copy so that adding a column later does not write into a slice of clean_movies
genres = clean_movies[['title','genres']].copy()

In [0]:
import ast

In [0]:
# Production-company names that leaked into the genres column
bad_genres = [
    'Aniplex',
    'BROSTA TV',
    'Carousel Productions',
    'GoHands',
    'Mardock Scramble Production Committee',
    'Odyssey Media',
    'Pulser Productions',
    'Rogue State',
    'Sentai Filmworks',
    'Telescene Film Group Productions',
    'The Cartel',
    'Vision View Entertainment',
]

#genre_set = genre_set.difference(bad_genres)

In [46]:
genres['genres_ls'] = [
                    [d['name'] for d in ast.literal_eval(x) if d['name'] not in bad_genres ]
                    for x in genres['genres']
                    ]
genres['genres_ls']

5.0                                   [Crime, Comedy]
6.0                         [Action, Thriller, Crime]
11.0             [Adventure, Action, Science Fiction]
12.0                              [Animation, Family]
13.0                         [Comedy, Drama, Romance]
                              ...                    
103299.0    [Drama, Action, Thriller, Crime, Mystery]
109161.0                      [Crime, Action, Comedy]
110669.0                                      [Music]
115210.0                                     [Horror]
116977.0          [Animation, Action, Comedy, Family]
Name: genres_ls, Length: 1552, dtype: object

In [0]:
from sklearn.preprocessing import MultiLabelBinarizer

In [0]:
mlb = MultiLabelBinarizer()

In [0]:
mlb_ = mlb.fit_transform(genres['genres_ls'])

In [50]:
encoded_genres = pd.DataFrame(mlb_, columns = mlb.classes_, index = genres.index)
encoded_genres

Unnamed: 0,Action,Adventure,Animation,Comedy,Crime,Documentary,Drama,Family,Fantasy,Foreign,History,Horror,Music,Mystery,Romance,Science Fiction,Thriller,War,Western
5.0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6.0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0
11.0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
12.0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
13.0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
103299.0,1,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,1,0,0
109161.0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
110669.0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
115210.0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0


In [51]:
encoded_genres = clean_movies.merge(encoded_genres, **merge_keys)
encoded_genres

Unnamed: 0,title,genres,release_date,budget,revenue,year,runtime,vote_average,vote_count,rating,num_votes,adjusted_revenue,decade,scaled_tmdb_vote,scaled_user_rating,PCA,q_budget,Action,Adventure,Animation,Comedy,Crime,Documentary,Drama,Family,Fantasy,Foreign,History,Horror,Music,Mystery,Romance,Science Fiction,Thriller,War,Western
5.0,Four Rooms,"[{'id': 80, 'name': 'Crime'}, {'id': 35, 'name...",1995-12-09,4000000.0,4300000.0,1995.0,98.0,6.5,539.0,3.079565,15258,6.680294e+06,1990.0,-0.172824,-0.423752,-0.604977,vlow,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6.0,Judgment Night,"[{'id': 28, 'name': 'Action'}, {'id': 53, 'nam...",1993-10-15,,12136938.0,1993.0,110.0,6.4,79.0,3.841764,27895,1.988983e+07,1990.0,-0.290514,1.081219,-0.533044,,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0
11.0,Star Wars,"[{'id': 12, 'name': 'Adventure'}, {'id': 28, '...",1977-05-25,11000000.0,775398007.0,1977.0,121.0,8.1,6778.0,3.660591,19475,3.028728e+09,1970.0,1.710224,0.723491,8.865144,low,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
12.0,Finding Nemo,"[{'id': 16, 'name': 'Animation'}, {'id': 10751...",2003-05-30,94000000.0,940335536.0,2003.0,100.0,7.6,6292.0,2.672179,4475,1.210119e+09,2000.0,1.121771,-1.228141,3.453371,vhigh,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
13.0,Forrest Gump,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",1994-07-06,55000000.0,677945399.0,1994.0,142.0,8.2,8147.0,3.326442,1838,1.082774e+09,1990.0,1.827914,0.063710,3.721754,vhigh,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
103299.0,Traces of Red,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...",1992-11-13,,3206713.0,1992.0,105.0,7.3,4.0,3.382979,141,5.410226e+06,1990.0,0.768700,0.175343,0.110876,,1,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,1,0,0
109161.0,The Squeeze,"[{'id': 80, 'name': 'Crime'}, {'id': 28, 'name...",1987-07-10,,2228951.0,1987.0,101.0,5.8,6.0,3.579167,120,4.640926e+06,1980.0,-0.996657,0.562718,-1.118563,,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
110669.0,Wabash Avenue,"[{'id': 10402, 'name': 'Music'}]",1950-03-31,2115000.0,2039000.0,1950.0,92.0,7.0,1.0,3.809249,173,2.039000e+06,1950.0,0.415628,1.017018,-0.082274,vlow,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
115210.0,Stitches,"[{'id': 27, 'name': 'Horror'}]",2012-05-19,100000.0,95000.0,2012.0,86.0,5.5,94.0,3.731075,3144,9.795528e+04,2010.0,-1.349729,0.862663,-1.358279,vlow,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0


In [52]:
for genre in mlb.classes_:
    print(genre, encoded_genres[encoded_genres[genre] == 1]['vote_average'].mean())

Action 6.368047337278103
Adventure 6.464166666666668
Animation 6.5181818181818185
Comedy 6.599013956303933
Crime 6.600525787915338
Documentary 7.06
Drama 6.6025728133283295
Family 6.493814432989691
Fantasy 6.527692307692308
Foreign 6.0
History 6.9728813559322065
Horror 6.329523809523811
Music 6.574999999999999
Mystery 6.828985507246377
Romance 6.59998176041643
Science Fiction 6.434319526627218
Thriller 6.599510324119252
War 7.116279069767442
Western 7.095238095238094


In [0]:
d = {g: [encoded_genres[encoded_genres[g] == 1]['PCA'].mean(),
         encoded_genres[encoded_genres[g] == 1]['adjusted_revenue'].sum()]
     for g in mlb.classes_}

In [54]:
df = pd.DataFrame.from_dict(d, orient='index',  columns = ['pca rating', 'total revenue'])
df

Unnamed: 0,pca rating,total revenue
Action,0.020301,91191070000.0
Adventure,0.378806,90421680000.0
Animation,0.261648,13721900000.0
Comedy,-0.347387,2908821000000.0
Crime,-0.346864,2883037000000.0
Documentary,-0.085474,282016600.0
Drama,-0.344216,2932161000000.0
Family,0.328039,34114010000.0
Fantasy,0.308037,43238110000.0
Foreign,-1.046721,4106328.0


In [55]:
trace = go.Bar(
    x=df.index,
    y=df['pca rating']
)

fig = go.Figure(
    data = trace,
    #layout_x)
)
fig.update_xaxes(tickangle=45)
fig.update_layout(
    title_text = "PCA Rating by Genre"
)

In [56]:
melted = df.reset_index().melt(id_vars='index')
melted

Unnamed: 0,index,variable,value
0,Action,pca rating,0.02030121
1,Adventure,pca rating,0.3788064
2,Animation,pca rating,0.2616476
3,Comedy,pca rating,-0.3473871
4,Crime,pca rating,-0.3468643
5,Documentary,pca rating,-0.0854742
6,Drama,pca rating,-0.3442159
7,Family,pca rating,0.328039
8,Fantasy,pca rating,0.3080365
9,Foreign,pca rating,-1.046721
