<a href="https://colab.research.google.com/github/ErikSeguinte/movie_data/blob/master/processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import pandas as pd
import cufflinks as cf
import numpy as np
import plotly.graph_objects as go
import altair as alt
import plotly.express as px

In [0]:
merge_keys = {'left_index':True, 'right_index':True}
html_keys = {'full_html':False, 'include_plotlyjs':"cdn"}

* I previously pulled CSV files from Kaggle, but the files were too big to host on GitHub.
* I imported the files I wanted into pandas and then exported them as compressed pickles.
* This compressed a 700 MB CSV into a 3 MB pickle.
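
The CSV-to-compressed-pickle round trip can be sketched roughly as follows. This is a minimal illustration on a small synthetic frame, not the original Kaggle data; the file names and the `demo` frame are hypothetical. pandas infers the compression from the `.xz` suffix.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Synthetic stand-in for a large ratings table (hypothetical data).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "movieId": rng.integers(1, 500, size=10_000),
    "rating": rng.integers(1, 6, size=10_000).astype("float64"),
})

tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "demo.csv")
pkl_path = os.path.join(tmp_dir, "demo.pkl.xz")

demo.to_csv(csv_path, index=False)
# compression defaults to 'infer', so the .xz suffix selects xz (LZMA) compression.
demo.to_pickle(pkl_path)

# Reading back also infers the compression from the suffix.
restored = pd.read_pickle(pkl_path)
print(f"CSV: {os.path.getsize(csv_path):,} bytes")
print(f"Compressed pickle: {os.path.getsize(pkl_path):,} bytes")
```

How much space is saved depends heavily on the data, so the sizes printed here say little about the real files.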

In [3]:
#try: 
#    movies = pd.read_pickle('data/movies.pkl.xz')
#    ratings = pd.read_pickle('data/ratings2.pkl.xz')
#    cpi = pd.read_csv('data/cpi.csv')
    
#except FileNotFoundError                         :
    
# Download pickles from github
cpi = pd.read_csv('https://datahub.io/core/cpi/r/cpi.csv')

!wget 'https://github.com/ErikSeguinte/movie_data/raw/master/data/movies.pkl.xz'
!wget 'https://github.com/ErikSeguinte/movie_data/raw/master/data/ratings2.pkl.xz'
movies = pd.read_pickle('movies.pkl.xz')
ratings = pd.read_pickle('ratings2.pkl.xz')

--2020-03-07 01:53:50--  https://github.com/ErikSeguinte/movie_data/raw/master/data/movies.pkl.xz
Resolving github.com (github.com)... 140.82.118.3
Connecting to github.com (github.com)|140.82.118.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/ErikSeguinte/movie_data/master/data/movies.pkl.xz [following]
--2020-03-07 01:53:50--  https://raw.githubusercontent.com/ErikSeguinte/movie_data/master/data/movies.pkl.xz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7917036 (7.5M) [application/octet-stream]
Saving to: ‘movies.pkl.xz’


2020-03-07 01:53:51 (137 MB/s) - ‘movies.pkl.xz’ saved [7917036/7917036]

--2020-03-07 01:53:53--  https://github.com/ErikSeguinte/movie_data/raw/master/data/ratings2.pkl.

## Clean Movie DF

In [4]:
movies.dtypes

adult                     object
belongs_to_collection     object
budget                    object
genres                    object
homepage                  object
id                        object
imdb_id                   object
original_language         object
original_title            object
overview                  object
popularity                object
poster_path               object
production_companies      object
production_countries      object
release_date              object
revenue                  float64
runtime                  float64
spoken_languages          object
status                    object
tagline                   object
title                     object
video                     object
vote_average             float64
vote_count               float64
dtype: object

* The movies DataFrame has malformed data: `id` should be numeric.
* After inspection, it looks like some rows are missing a comma somewhere, so their values no longer line up with the columns and land in the wrong ones. Let's clean those up.
* All malformed rows have strings for IDs instead of numbers, so we coerce `id` to numeric. Strings become `NaN`, and we then drop those rows.

* `budget` and `revenue` should also be numeric, but rows where they are `NaN` won't be dropped.
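
The coerce-then-drop approach can be illustrated on a toy frame. This is a sketch with made-up values; only the `id` and `budget` column names come from the dataset.

```python
import pandas as pd

# Toy frame: the second row is "shifted", so its id cell holds a date string.
raw = pd.DataFrame({
    "id": ["862", "1997-08-20", "8844"],
    "budget": ["30000000", "12", "/poster.jpg"],
})

# Strings that cannot be parsed become NaN instead of raising an error.
raw["id"] = pd.to_numeric(raw["id"], errors="coerce")
raw["budget"] = pd.to_numeric(raw["budget"], errors="coerce")

# Drop only the rows whose id failed to parse; a NaN budget is kept.
cleaned = raw[raw["id"].notnull()]
print(cleaned)
```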






In [0]:
movies['id'] = pd.to_numeric(movies['id'], errors='coerce')
movies = movies[movies['id'].notnull()]
movies = movies.set_index('id')

In [0]:
def to_numeric(df, labels):
    # Coerce each listed column of df to numeric; unparseable values become NaN.
    for label in labels:
        df[label] = pd.to_numeric(df[label], errors='coerce')
    return df

In [0]:
movies = to_numeric(movies, ['budget', 'revenue', 'vote_average'])

In [0]:
movies['release_date'] = pd.to_datetime(movies['release_date'], infer_datetime_format=True)

In [0]:
movies['year'] = movies['release_date'].dt.year

In [0]:
clean_movies = movies[['title','genres', 'release_date','budget', 'revenue','year' ,'runtime', 'vote_average', 'vote_count']]

## Process User Reviews
* User reviews come in a collection of individual reviews where a review gives a movie a score of 1 to 5.
* We will take the mean ratings for each movie
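
As a small illustration of collapsing individual reviews into one row per movie, here is a sketch using pandas named aggregation on made-up reviews. The `reviews` frame is hypothetical; the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Hypothetical individual reviews: one row per rating.
reviews = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 2],
    "rating": [5.0, 4.0, 3.0, 2.0, 4.0],
})

# Named aggregation produces flat, already-renamed columns.
per_movie = reviews.groupby("movieId").agg(
    rating=("rating", "mean"),
    num_votes=("rating", "count"),
)
print(per_movie)
```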

In [0]:
# Aggregate mean ratings and number of votes per movie
movie_ratings =pd.DataFrame(ratings.groupby('movieId')[['rating']].agg(['mean', 'count']))['rating']
movie_ratings = movie_ratings.rename({'mean': 'rating', 'count': 'num_votes'}, axis = 1)


* Let's drop any movies with fewer than 100 votes. Their averages are more easily swayed by outliers and aren't reliable.

In [0]:
movie_ratings = movie_ratings[(movie_ratings['num_votes'] >= 100)]

* Now we merge the averaged ratings back into the movie DataFrame.
* Note that not all movies are present in the user ratings. Because `merge` defaults to an inner join, movies without ratings are dropped.
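
To make the effect of an index merge on missing movies concrete, here is a minimal sketch with two made-up frames. The frames and their values are hypothetical.

```python
import pandas as pd

# Hypothetical frames that share a movie-id index.
titles = pd.DataFrame({"title": ["A", "B", "C"]}, index=[1, 2, 3])
scores = pd.DataFrame({"rating": [4.1, 3.2]}, index=[1, 3])

# The default how='inner' keeps only ids present in both frames.
inner = titles.merge(scores, left_index=True, right_index=True)

# how='left' keeps every movie and leaves missing ratings as NaN.
left = titles.merge(scores, left_index=True, right_index=True, how="left")
print(len(inner), len(left))
```

Whether dropping unrated movies is desirable depends on the analysis; `how="left"` is the alternative if they should be kept.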

In [0]:
clean_movies = clean_movies.merge(movie_ratings, left_index = True, right_index=True)

In [14]:
clean_movies[['title', 'rating', 'num_votes']].nlargest(10, 'rating').style.hide_index().format(
    {'rating':"{:.2f}",
     'num_votes': "{:,.0f}"}
)

title,rating,num_votes
The Million Dollar Hotel,4.43,91082
Sleepless in Seattle,4.34,57070
Once Were Warriors,4.27,67662
Hard Target,4.26,13994
License to Wed,4.23,60024
Five Dances,4.22,273
The Thomas Crown Affair,4.21,30043
Murder She Said,4.21,28280
"Cousin, Cousine",4.2,20855
Dead Man,4.2,7930


* The Movie Database (TMDB) also provides a rating, and it suffers from a similar problem: some movies have a tiny sample size.

In [15]:
(clean_movies[clean_movies['vote_count'] >= 100]
    [['title', 'vote_average', 'vote_count']]
    .nlargest(10, 'vote_average')
    .style
    .hide_index()
    .format({'vote_average':'{:.1f}', 'vote_count':'{:,.0f}'})
)
 


title,vote_average,vote_count
The Godfather,8.5,6024
The Shawshank Redemption,8.5,8358
Spirited Away,8.3,3968
The Dark Knight,8.3,12269
The Godfather: Part II,8.3,3418
Schindler's List,8.3,4436
One Flew Over the Cuckoo's Nest,8.3,3001
Psycho,8.3,2405
Fight Club,8.3,9678
Life Is Beautiful,8.3,3643


In [16]:
(clean_movies[['title', 'revenue']]
 .nlargest(10, 'revenue')
 .style
 .hide_index()
 .format({
     "revenue":"${:,.0f}"
 }
 )
)

title,revenue
Titanic,"$1,845,034,188"
The Lord of the Rings: The Return of the King,"$1,118,888,979"
Pirates of the Caribbean: Dead Man's Chest,"$1,065,659,812"
Pirates of the Caribbean: On Stranger Tides,"$1,045,713,802"
The Dark Knight,"$1,004,558,444"
Harry Potter and the Philosopher's Stone,"$976,475,550"
Finding Nemo,"$940,335,536"
Harry Potter and the Half-Blood Prince,"$933,959,197"
The Lord of the Rings: The Two Towers,"$926,287,400"
Star Wars: Episode I - The Phantom Menace,"$924,317,558"


## Inflation
* Inflation means that a 1940 dollar is worth more than a 2020 dollar. Let's adjust revenue for that.
* The Consumer Price Index (CPI) can be used to convert to standardized dollars.
* Here, we'll be using 2014 dollars.
* Years later than 2014 will not be adjusted.
$$ \textrm{adjusted dollars} = \textrm{original dollars} \times \frac{\textrm{CPI}_{2014}}{\textrm{CPI}_{\textrm{release year}}} $$
* where $\textrm{CPI}_{2014}$ is the CPI of the target year and $\textrm{CPI}_{\textrm{release year}}$ is the CPI of the year the movie was released.
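
A worked example of the conversion: the value is scaled by the ratio of the target-year CPI to the CPI of the original year. The CPI numbers below are invented round figures for illustration only, and `to_2014_dollars` is a hypothetical helper, not the function used later in this notebook.

```python
import pandas as pd

# Invented CPI values, chosen to keep the arithmetic easy to follow.
cpi_demo = pd.Series({1977: 25.0, 1997: 75.0, 2014: 100.0}, name="CPI")

def to_2014_dollars(value, year, cpi_series=cpi_demo, target_year=2014):
    """Scale value by CPI[target_year] / CPI[year]; leave it unchanged if the year is missing."""
    if year not in cpi_series.index:
        return value
    return value * cpi_series.loc[target_year] / cpi_series.loc[year]

# With these invented values, a 1977 dollar counts as four 2014 dollars.
print(to_2014_dollars(100.0, 1977))
# A year with no CPI entry is returned unadjusted.
print(to_2014_dollars(100.0, 2020))
```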

In [0]:
cpi = cpi[cpi['Country Name'] == 'United States'][['Year', 'CPI']]

In [0]:
cpi = cpi.set_index(cpi['Year'])

In [0]:
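def adjust_dollars(value, year):
    # Convert `value` from `year` dollars to 2014 dollars using the CPI ratio.
    year = int(year)
    try:
        current = cpi.loc[2014,'CPI']
        base = cpi.loc[year,'CPI']
        adjusted_value = value * (current/base)
        return adjusted_value
    except KeyError:
        # No CPI entry for this year: leave the value unadjusted.
        return value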

In [0]:
clean_movies['year']= clean_movies['release_date'].dt.year

In [0]:
df = clean_movies[clean_movies['revenue'].notnull() & clean_movies['year'].notnull()]

In [0]:
adjusted = pd.DataFrame([adjust_dollars(x,y) for x,y in zip(df['revenue'], df['year'])], index = df.index, columns = ['adjusted_revenue'])

In [0]:
clean_movies = clean_movies.merge(adjusted, left_index=True, right_index=True)

In [24]:
(clean_movies[['title', 'adjusted_revenue']]
 .nlargest(10, 'adjusted_revenue')
 .style
 .hide_index()
 .format({
     "adjusted_revenue":"${:,.0f}"
 }
 )
)

title,adjusted_revenue
Star Wars,"$3,028,727,803"
Titanic,"$2,721,127,532"
E.T. the Extra-Terrestrial,"$1,945,321,985"
The Empire Strikes Back,"$1,546,829,516"
Jurassic Park,"$1,507,846,186"
The Lord of the Rings: The Return of the King,"$1,439,899,368"
The Godfather,"$1,387,267,307"
Return of the Jedi,"$1,361,232,958"
Star Wars: Episode I - The Phantom Menace,"$1,313,638,874"
Harry Potter and the Philosopher's Stone,"$1,305,536,965"


In [0]:
clean_movies['decade'] = [x - (x%10) for x in clean_movies['year']]

In [26]:
px.line(clean_movies.groupby('year')['vote_average'].mean().reset_index(), 
       x ='year',
       y = 'vote_average' )

In [27]:
px.line(clean_movies.groupby('year')['rating'].mean().reset_index(), 
       x ='year',
       y = 'rating' )

* I'd like to compare the votes from TMDB to the user ratings, but they are on different scales. We'll use `StandardScaler` to standardize both so they are easier to compare.
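
For intuition, standardization can be written out by hand: subtract each column's mean and divide by its population standard deviation (`ddof=0`), which is the same computation `StandardScaler` performs. The sketch below uses made-up scores and plain pandas rather than scikit-learn.

```python
import pandas as pd

# Hypothetical scores on two different scales (0-10 and 0-5).
scores = pd.DataFrame({
    "vote_average": [6.0, 7.0, 8.0],
    "rating": [3.0, 3.5, 4.0],
})

# z-score per column: (x - mean) / population standard deviation.
z = (scores - scores.mean()) / scores.std(ddof=0)
print(z)
```

After scaling, both columns have mean 0 and unit variance, so a movie's position relative to its peers can be compared across the two sources.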

In [0]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
to_scale = clean_movies[['vote_average', 'rating']].dropna()
scaled = scaler.fit_transform(to_scale)
clean_movies = clean_movies.merge(
    pd.DataFrame(
        scaled,
        index = to_scale.index,
        columns = ['scaled_tmdb_vote', 'scaled_user_rating']
    ),
    left_index = True,
    right_index = True,
)

In [29]:
traces = [
    go.Bar(name='TMDB rating',
        x = clean_movies.groupby('decade')['scaled_tmdb_vote'].mean().index,
        y = clean_movies.groupby('decade')['scaled_tmdb_vote'].mean()
    ),
        go.Bar(name='user rating',
        x = clean_movies.groupby('decade')['scaled_user_rating'].mean().index,
        y = clean_movies.groupby('decade')['scaled_user_rating'].mean()
    )
]

go.Figure(data = traces,
    layout_xaxis_tick0 = 1890
)

In [0]:
from sklearn.decomposition import PCA

pca = PCA(1)

to_scale = clean_movies[['vote_average', 'rating', 'adjusted_revenue']].dropna()
scaled = scaler.fit_transform(to_scale)

pca_df = pd.DataFrame(pca.fit_transform(scaled), index=to_scale.index, columns = ['PCA'])

clean_movies = clean_movies.merge(pca_df, left_index = True, right_index = True)

In [31]:
(clean_movies[['title', 'PCA']]
 .nlargest(10, 'PCA')
 .style
 .hide_index()
 .format({
     'PCA':'{:.2f}'
 })
)

title,PCA
Star Wars,8.87
Titanic,7.51
E.T. the Extra-Terrestrial,5.27
The Empire Strikes Back,4.82
The Godfather,4.76
The Lord of the Rings: The Return of the King,4.51
Jurassic Park,4.35
Return of the Jedi,4.22
The Lord of the Rings: The Two Towers,3.98
The Dark Knight,3.87


## Budget


In [32]:
budget_dict = {'no-budget':[0, 500000],
 'low-budget':[500000, 40000000],
 'medium-budget': [40000000, 100000000],
 'High-budget:':[100000000, 1000000000000]
 }

(pd.DataFrame.from_dict(budget_dict, orient='index', columns = ['low-end', 'high-end'])
 .style
 .format('{:,.0f}')
)

Unnamed: 0,low-end,high-end
no-budget,0,500000
low-budget,500000,40000000
medium-budget,40000000,100000000
High-budget:,100000000,1000000000000


In [0]:
low = []
hi = []
for v in budget_dict.values():
    low.append(v[0])
    hi.append(v[1])

In [0]:
budget_intervals = pd.IntervalIndex.from_arrays(low, hi, closed = 'left')

In [35]:
list(budget_dict.keys())

['no-budget', 'low-budget', 'medium-budget', 'High-budget:']

In [0]:
clean_movies['budget-cat'] = pd.cut(clean_movies['budget'], budget_intervals, labels = list(budget_dict.keys()))
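
Note that `pd.cut` ignores `labels` when `bins` is an `IntervalIndex`, so the call above produces interval categories rather than the names from `budget_dict`. One possible way to get named categories, sketched here on made-up budgets, is to pass plain bin edges with `right=False`. The `edges`, `names`, and `budgets` variables are illustrative.

```python
import pandas as pd

# Same boundaries as the budget table, expressed as bin edges.
edges = [0, 500_000, 40_000_000, 100_000_000, 1_000_000_000_000]
names = ["no-budget", "low-budget", "medium-budget", "high-budget"]

# Hypothetical budgets, one per category.
budgets = pd.Series([100_000, 500_000, 60_000_000, 250_000_000])

# right=False makes every bin left-closed, like closed='left' on the IntervalIndex.
categories = pd.cut(budgets, bins=edges, labels=names, right=False)
print(categories.tolist())
```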

In [37]:
clean_movies[['title', 'budget-cat', 'PCA']].groupby('budget-cat').mean()

Unnamed: 0_level_0,PCA
budget-cat,Unnamed: 1_level_1
"[0, 500000)",-0.149724
"[500000, 40000000)",-0.059581
"[40000000, 100000000)",0.038098
"[100000000, 1000000000000)",0.700114


In [38]:
budget_quantiles = pd.qcut(clean_movies[clean_movies['budget']>500000]['budget'], q = 5, 
#labels = ['vlow', 'low', 'med', 'high', 'vhigh']
)
budget_quantiles.values

[(549999.999, 8000000.0], (8000000.0, 23400000.0], (51400000.0, 380000000.0], (51400000.0, 380000000.0], (8000000.0, 23400000.0], ..., (23400000.0, 30000000.0], (549999.999, 8000000.0], (30000000.0, 51400000.0], (549999.999, 8000000.0], (51400000.0, 380000000.0]]
Length: 1327
Categories (5, interval[float64]): [(549999.999, 8000000.0] < (8000000.0, 23400000.0] <
                                    (23400000.0, 30000000.0] < (30000000.0, 51400000.0] <
                                    (51400000.0, 380000000.0]]

In [0]:
budget_ratings = clean_movies[['title','budget-cat', 'budget', 'revenue','PCA']].dropna()

In [0]:
quantiles=budget_ratings.groupby('budget-cat').mean()



In [69]:
(quantiles
 .style
 .format({
     "budget":"${:,.0f}",
     "revenue":"${:,.0f}",
     "PCA":"{:.2f}"
 })
 #.render()
)

Unnamed: 0_level_0,budget,revenue,PCA
budget-cat,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"[0, 500000)","$197,962","$15,368,653",-0.15
"[500000, 40000000)","$18,500,755","$61,334,171",-0.06
"[40000000, 100000000)","$61,415,816","$186,069,871",0.04
"[100000000, 1000000000000)","$141,292,135","$436,756,768",0.7


In [68]:
(px.bar(quantiles.reset_index(),
       x = list(budget_dict.keys()),
       y = 'PCA').update_xaxes(
           categoryorder='array',
           #categoryarray=['vlow', 'low', 'med', 'high', 'vhigh'],
           title = "Budget vs PCA rating"
           )
       #.to_html(full_html=False, include_plotlyjs="cdn")

)

In [0]:
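trace = go.Scatter(
    # budget_ratings has no 'vote_average' column, so plot the PCA rating
    y = budget_ratings['PCA'],
    # the axis is labeled "Budget", so plot budget rather than revenue
    x = budget_ratings['budget'],
    mode = 'markers'
)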

go.Figure(
    trace,
    layout_xaxis_title = "Budget",
    layout_yaxis_title = "Movie Rating",
    layout_title = "Movie Ratings by budget",
    
)

In [0]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

## Genre

In [0]:
# Copy so that adding a column later does not trigger SettingWithCopyWarning
genres = clean_movies[['title','genres']].copy()

In [0]:
import ast

In [0]:
bad_genres = [
    'Aniplex',
    'BROSTA TV',
    'Carousel Productions',
    'GoHands',
    'Mardock Scramble Production Committee',
    'Odyssey Media',
    'Pulser Productions',
    'Rogue State',
    'Sentai Filmworks',
    'Telescene Film Group Productions',
    'The Cartel',
    'Vision View Entertainment',
]

#genre_set = genre_set.difference(bad_genres)

In [0]:
genres['genres_ls'] = [
                    [d['name'] for d in ast.literal_eval(x) if d['name'] not in bad_genres ]
                    for x in genres['genres']
                    ]
genres['genres_ls']

In [0]:
from sklearn.preprocessing import MultiLabelBinarizer

In [0]:
mlb = MultiLabelBinarizer()

In [0]:
mlb_ = mlb.fit_transform(genres['genres_ls'])

In [0]:
encoded_genres = pd.DataFrame(mlb_, columns = mlb.classes_, index = genres.index)
encoded_genres

In [0]:
encoded_genres = clean_movies.merge(encoded_genres, **merge_keys)
encoded_genres

In [0]:
for genre in mlb.classes_:
    print(genre, encoded_genres[encoded_genres[genre] == 1]['vote_average'].mean())

In [0]:
d = {g: [encoded_genres[encoded_genres[g] == 1]['PCA'].mean(),
         encoded_genres[encoded_genres[g] == 1]['adjusted_revenue'].sum()]
     for g in mlb.classes_}

In [0]:
df = pd.DataFrame.from_dict(d, orient='index',  columns = ['pca rating', 'total revenue'])
df

In [0]:
trace = go.Bar(
    x=df.index,
    y=df['pca rating']
)

fig = go.Figure(
    data = trace,
    #layout_x)
)
fig.update_xaxes(tickangle=45)
fig.update_layout(
    title_text = "PCA Rating by Genre"
)

In [0]:
melted = df.reset_index().melt(id_vars='index')
melted