### Task #2: Sequel Success Prediction
> 1. Given the franchises discovered from task #1, suppose someone wants to predict the success of a sequel that is to be launched, and she/he will partner with you for the task. The ask here is not to develop a prediction model, rather, you need to develop data analysis notebooks for the following subtasks:
> 2. Create ground truth for training dataset, similar to the following output: movieId, franchiseId, 0 or 1
> 3. 0 means “not successful”, “1” means successful.
> 4. You need to provide a clear and reasonable definition of “success”.
> 5. Come up with features that can be predictive of sequel success, for each feature, demonstrate why it is predictive or not.
> 6. Create training dataset that consists of feature vector and ground truth. No need to do train-test split.
> 7. Note: if you choose to skip task #1, you can use “belong_to_collection” column as input to task #2.

### Notes

1. Add categorical variables for genres and keywords.
2. Run an OLS summary of revenue against the rating metrics.

In [8]:
from IPython.display import display, HTML
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
display(HTML("<style>.container { width:100% !important; }</style>"))

import json
import ast

import datetime
from dateutil.relativedelta import relativedelta

import pandas as pd
from pandas import Timestamp
pd.set_option('mode.chained_assignment', None)
import seaborn as sns
import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score, accuracy_score
from sklearn.model_selection import train_test_split
np.random.seed(42)

import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import scipy as sp

In [9]:
master = pd.read_csv('franchises.csv', engine='python', encoding='utf8', parse_dates=['release_date'])

In [10]:
additional_features = master.copy()

In [11]:
additional_features = additional_features.drop(columns=['Unnamed: 0'])

In [12]:
additional_features['release_date'] = pd.to_datetime(additional_features['release_date'], errors='coerce').dt.strftime("%Y%m%d")
additional_features['rating_timestamp'] = pd.to_datetime(additional_features['rating_timestamp'], errors='coerce').dt.strftime("%Y%m%d")

In [13]:
additional_features.columns  

Index(['budget', 'id', 'imdb_id', 'original_language', 'title', 'popularity',
       'release_date', 'revenue', 'runtime', 'vote_average', 'collection_id',
       'collection_name', 'production companies', 'production_countries',
       'genres', 'rating', 'rating_timestamp', 'key_agg'],
      dtype='object')

#### Filter the dataframe to prequel/sequel pairs (same collection id)

In [14]:
additional_features = additional_features.sort_values('revenue', ascending=False).head(500)

In [15]:
additional_features.columns

Index(['budget', 'id', 'imdb_id', 'original_language', 'title', 'popularity',
       'release_date', 'revenue', 'runtime', 'vote_average', 'collection_id',
       'collection_name', 'production companies', 'production_countries',
       'genres', 'rating', 'rating_timestamp', 'key_agg'],
      dtype='object')

In [16]:
get_sequels = additional_features.sort_values(by=['collection_id', 'release_date'], ascending=True)

In [17]:
get_sequels['prev_title'] = get_sequels.title.shift(1)   # previous row's title; valid because rows are sorted by collection_id, then release_date ascending
get_sequels['prev_collection_id'] = get_sequels.collection_id.shift(1)   # previous row's collection id
get_sequels['prev_collection_revenue'] = get_sequels.revenue.shift(1)   # previous row's revenue
get_sequels['prev_collection_release_date'] = get_sequels.release_date.shift(1)   # previous row's release date

In [18]:
get_sequels['prev_collection_rating'] = get_sequels.rating.shift(1)   # previous row's rating
get_sequels['prev_collection_popularity'] = get_sequels.popularity.shift(1)   # previous row's popularity
get_sequels['prev_collection_vote_average'] = get_sequels.vote_average.shift(1)   # previous row's vote average

In [19]:
get_sequels['prev_collection_keywords'] = get_sequels.key_agg.shift(1)   # get previous title keywords
get_sequels['prev_collection_genres'] = get_sequels.genres.shift(1)   # get previous title genres

In [20]:
get_sequels['prequel_title'] = get_sequels.apply(lambda x: x['prev_title'] if x['collection_id'] == x['prev_collection_id'] else None, axis=1)  # validating previous/sequel
get_sequels.head(1)  # prequel_title is None when the previous row belongs to a different collection, i.e. the film has no prequel in its franchise

Unnamed: 0,budget,id,imdb_id,original_language,title,popularity,release_date,revenue,runtime,vote_average,...,prev_title,prev_collection_id,prev_collection_revenue,prev_collection_release_date,prev_collection_rating,prev_collection_popularity,prev_collection_vote_average,prev_collection_keywords,prev_collection_genres,prequel_title
27,11000000,11,tt0076759,en,Star Wars,42.149697,19770525,775398007,121.0,8.1,...,,,,,,,,,,


In [21]:
get_sequels['prequel_revenue'] = get_sequels.apply(lambda x: x['prev_collection_revenue'] if x['collection_id'] == x['prev_collection_id'] else None, axis=1)  # validating previous/sequel
get_sequels.head(1)

Unnamed: 0,budget,id,imdb_id,original_language,title,popularity,release_date,revenue,runtime,vote_average,...,prev_collection_id,prev_collection_revenue,prev_collection_release_date,prev_collection_rating,prev_collection_popularity,prev_collection_vote_average,prev_collection_keywords,prev_collection_genres,prequel_title,prequel_revenue
27,11000000,11,tt0076759,en,Star Wars,42.149697,19770525,775398007,121.0,8.1,...,,,,,,,,,,


In [22]:
get_sequels['prequel_release_date'] = get_sequels.apply(lambda x: x['prev_collection_release_date'] if x['collection_id'] == x['prev_collection_id'] else None, axis=1)  # validating previous/sequel
get_sequels.head(1)  

Unnamed: 0,budget,id,imdb_id,original_language,title,popularity,release_date,revenue,runtime,vote_average,...,prev_collection_revenue,prev_collection_release_date,prev_collection_rating,prev_collection_popularity,prev_collection_vote_average,prev_collection_keywords,prev_collection_genres,prequel_title,prequel_revenue,prequel_release_date
27,11000000,11,tt0076759,en,Star Wars,42.149697,19770525,775398007,121.0,8.1,...,,,,,,,,,,


In [23]:
get_sequels['prequel_rating'] = get_sequels.apply(lambda x: x['prev_collection_rating'] if x['collection_id'] == x['prev_collection_id'] else None, axis=1)  # validating previous/sequel
get_sequels.head(1)  

Unnamed: 0,budget,id,imdb_id,original_language,title,popularity,release_date,revenue,runtime,vote_average,...,prev_collection_release_date,prev_collection_rating,prev_collection_popularity,prev_collection_vote_average,prev_collection_keywords,prev_collection_genres,prequel_title,prequel_revenue,prequel_release_date,prequel_rating
27,11000000,11,tt0076759,en,Star Wars,42.149697,19770525,775398007,121.0,8.1,...,,,,,,,,,,


In [24]:
get_sequels['prequel_popularity'] = get_sequels.apply(lambda x: x['prev_collection_popularity'] if x['collection_id'] == x['prev_collection_id'] else None, axis=1)  # validating previous/sequel
get_sequels.head(1)  

Unnamed: 0,budget,id,imdb_id,original_language,title,popularity,release_date,revenue,runtime,vote_average,...,prev_collection_rating,prev_collection_popularity,prev_collection_vote_average,prev_collection_keywords,prev_collection_genres,prequel_title,prequel_revenue,prequel_release_date,prequel_rating,prequel_popularity
27,11000000,11,tt0076759,en,Star Wars,42.149697,19770525,775398007,121.0,8.1,...,,,,,,,,,,


In [25]:
get_sequels['prequel_vote_average'] = get_sequels.apply(lambda x: x['prev_collection_vote_average'] if x['collection_id'] == x['prev_collection_id'] else None, axis=1)  # validating previous/sequel
get_sequels.head(1)  

Unnamed: 0,budget,id,imdb_id,original_language,title,popularity,release_date,revenue,runtime,vote_average,...,prev_collection_popularity,prev_collection_vote_average,prev_collection_keywords,prev_collection_genres,prequel_title,prequel_revenue,prequel_release_date,prequel_rating,prequel_popularity,prequel_vote_average
27,11000000,11,tt0076759,en,Star Wars,42.149697,19770525,775398007,121.0,8.1,...,,,,,,,,,,


In [26]:
get_sequels['prequel_keywords'] = get_sequels.apply(lambda x: x['prev_collection_keywords'] if x['collection_id'] == x['prev_collection_id'] else None, axis=1)  # validating previous/sequel
get_sequels.head(1)  

Unnamed: 0,budget,id,imdb_id,original_language,title,popularity,release_date,revenue,runtime,vote_average,...,prev_collection_vote_average,prev_collection_keywords,prev_collection_genres,prequel_title,prequel_revenue,prequel_release_date,prequel_rating,prequel_popularity,prequel_vote_average,prequel_keywords
27,11000000,11,tt0076759,en,Star Wars,42.149697,19770525,775398007,121.0,8.1,...,,,,,,,,,,


In [27]:
get_sequels['prequel_genres'] = get_sequels.apply(lambda x: x['prev_collection_genres'] if x['collection_id'] == x['prev_collection_id'] else None, axis=1)  # validating previous/sequel
get_sequels.head(1)  

Unnamed: 0,budget,id,imdb_id,original_language,title,popularity,release_date,revenue,runtime,vote_average,...,prev_collection_keywords,prev_collection_genres,prequel_title,prequel_revenue,prequel_release_date,prequel_rating,prequel_popularity,prequel_vote_average,prequel_keywords,prequel_genres
27,11000000,11,tt0076759,en,Star Wars,42.149697,19770525,775398007,121.0,8.1,...,,,,,,,,,,


In [28]:
get_sequels = get_sequels.rename(columns={"prev_collection_id": "prequel_id"})
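
The shift-and-validate cells above could also be written as a single grouped shift. The following is a minimal sketch on a made-up toy frame (the `films` data and the `prequel_` prefix are illustrative assumptions, not the notebook's actual data); shifting inside each `collection_id` group means the first film of a collection never picks up values from the previous collection, so no row-wise check is needed.

```python
import pandas as pd

# Toy frame standing in for `additional_features`; values are invented for illustration.
films = pd.DataFrame({
    "collection_id": [10, 10, 10, 84, 84],
    "title": ["A1", "A2", "A3", "B1", "B2"],
    "release_date": ["19770525", "19800517", "19830523", "19810612", "19840523"],
    "revenue": [775, 538, 572, 389, 333],
})

films = films.sort_values(["collection_id", "release_date"])

# Shift within each collection: the first film of every collection gets NaN
# instead of inheriting the last film of the previous collection.
lagged = films.groupby("collection_id")[["title", "revenue", "release_date"]].shift(1)
lagged.columns = ["prequel_" + c for c in lagged.columns]

pairs = films.join(lagged)

# Keep only films that actually have a prequel in the same collection.
sequels_only = pairs[pairs["prequel_title"].notna()]
print(sequels_only[["title", "prequel_title", "prequel_revenue"]])
```

Because the lagged columns are built in one call, adding another prequel attribute only requires adding its name to the column list.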

In [29]:
get_sequels.columns

Index(['budget', 'id', 'imdb_id', 'original_language', 'title', 'popularity',
       'release_date', 'revenue', 'runtime', 'vote_average', 'collection_id',
       'collection_name', 'production companies', 'production_countries',
       'genres', 'rating', 'rating_timestamp', 'key_agg', 'prev_title',
       'prequel_id', 'prev_collection_revenue', 'prev_collection_release_date',
       'prev_collection_rating', 'prev_collection_popularity',
       'prev_collection_vote_average', 'prev_collection_keywords',
       'prev_collection_genres', 'prequel_title', 'prequel_revenue',
       'prequel_release_date', 'prequel_rating', 'prequel_popularity',
       'prequel_vote_average', 'prequel_keywords', 'prequel_genres'],
      dtype='object')

In [30]:
get_sequels = get_sequels.drop(columns=[
    'prev_title', 'prev_collection_revenue', 'prev_collection_release_date',
    'prev_collection_rating', 'prev_collection_popularity', 'prev_collection_vote_average',
    'prev_collection_genres', 'prev_collection_keywords',
])

In [31]:
get_sequels = get_sequels[get_sequels['prequel_title'].notna()]  # keep only films that have a prequel in the same collection

In [32]:
get_sequels.head(1)

Unnamed: 0,budget,id,imdb_id,original_language,title,popularity,release_date,revenue,runtime,vote_average,...,key_agg,prequel_id,prequel_title,prequel_revenue,prequel_release_date,prequel_rating,prequel_popularity,prequel_vote_average,prequel_keywords,prequel_genres
49,18000000,1891,tt0080684,en,The Empire Strikes Back,19.470959,19800517,538400000,124.0,8.2,...,"'rebel', 'android', 'asteroid', 'space battle'...",10.0,Star Wars,775398007.0,19770525,3.660591,42.149697,8.1,"'android', 'galaxy', 'hermit', 'death star', '...","'Adventure', 'Action', 'Science Fiction'"
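
With each sequel now paired to its prequel, one possible way to build the ground truth requested in the task (movieId, franchiseId, 0 or 1) is sketched below. The definition of success used here, a sequel grossing at least as much as its prequel, is only an assumption chosen for illustration; the notebook itself does not fix a definition, and revenue here is not adjusted for inflation or budget. The toy rows are shaped like `get_sequels` but the values are illustrative.

```python
import pandas as pd

# Toy rows shaped like `get_sequels`; the numbers are illustrative only.
pairs = pd.DataFrame({
    "id": [1891, 1892, 87],
    "collection_id": [10, 10, 84],
    "revenue": [538_400_000, 572_700_000, 333_000_000],
    "prequel_revenue": [775_398_007.0, 538_400_000.0, 389_925_971.0],
})

# Hypothetical definition: a sequel is "successful" (1) if it grosses
# at least as much as its prequel, otherwise "not successful" (0).
pairs["success"] = (pairs["revenue"] >= pairs["prequel_revenue"]).astype(int)

# Rename to the output layout named in the task description.
ground_truth = pairs.rename(columns={"id": "movieId", "collection_id": "franchiseId"})[
    ["movieId", "franchiseId", "success"]
]
print(ground_truth)
```

A ratio threshold, for example revenue relative to budget, could be swapped in by changing only the line that computes `success`.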


#### Get dummies for the categorical variables (keywords and genres), separately for sequels and prequels

In [33]:
get_sequels.columns

Index(['budget', 'id', 'imdb_id', 'original_language', 'title', 'popularity',
       'release_date', 'revenue', 'runtime', 'vote_average', 'collection_id',
       'collection_name', 'production companies', 'production_countries',
       'genres', 'rating', 'rating_timestamp', 'key_agg', 'prequel_id',
       'prequel_title', 'prequel_revenue', 'prequel_release_date',
       'prequel_rating', 'prequel_popularity', 'prequel_vote_average',
       'prequel_keywords', 'prequel_genres'],
      dtype='object')

In [34]:
sequels = get_sequels[['budget', 'id', 'imdb_id', 'original_language', 'title', 'popularity',
       'release_date', 'revenue', 'runtime', 'vote_average', 'collection_id',
       'collection_name', 'production companies', 'production_countries',
       'genres', 'rating', 'rating_timestamp', 'key_agg']]

In [35]:
prequels = get_sequels[['prequel_id', 'prequel_title', 'prequel_revenue', 'prequel_release_date',
           'prequel_rating', 'prequel_popularity', 'prequel_vote_average',
           'prequel_keywords', 'prequel_genres']]

#### Get only the first genre (first whitespace-separated token)

In [37]:
sequels['genres'] = sequels['genres'].map(lambda x: ' '.join(x.split()[:1]))

Unnamed: 0,budget,id,imdb_id,original_language,title,popularity,release_date,revenue,runtime,vote_average,collection_id,collection_name,production companies,production_countries,genres,rating,rating_timestamp,key_agg
49,18000000,1891,tt0080684,en,The Empire Strikes Back,19.470959,19800517,538400000,124.0,8.2,10,Star Wars Collection,"'Lucasfilm', 'Twentieth Century Fox Film Corpo...",'United States of America',"'Adventure',",2.464912,19980701,"'rebel', 'android', 'asteroid', 'space battle'..."


In [38]:
prequels['prequel_genres'] = prequels['prequel_genres'].map(lambda x: ' '.join(x.split()[:1]))
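
Splitting on whitespace keeps the surrounding quote and comma and cuts multi-word names, which is why later counts show `'Science` and treat `'Drama',` and `'Drama'` as different values. A minimal sketch of an alternative follows, assuming the column holds strings of single-quoted items such as `'Adventure', 'Action', 'Science Fiction'`; the helper name `first_quoted_item` is hypothetical, and items that themselves contain an apostrophe would need a more careful parser.

```python
import re
import pandas as pd

def first_quoted_item(text):
    """Return the first single-quoted item, e.g. "'Science Fiction', 'Action'" -> "Science Fiction"."""
    if not isinstance(text, str):
        return None  # missing values (None / NaN) stay missing
    match = re.search(r"'([^']*)'", text)
    return match.group(1) if match else None

# Illustrative values in the same shape as the `genres` column.
genres = pd.Series([
    "'Adventure', 'Action', 'Science Fiction'",
    "'Science Fiction', 'Drama'",
    "'Drama'",
    None,
])

first_genre = genres.map(first_quoted_item)
print(first_genre.value_counts())
```

The same helper could be applied to the keyword columns, since they appear to use the same quoted, comma-separated layout.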

#### Get only the first keyword (first whitespace-separated token)

In [39]:
sequels['key_agg'] = sequels['key_agg'].replace(np.nan, " ")

In [40]:
sequels['keywords'] = sequels['key_agg'].map(lambda x: ' '.join(x.split()[:1]))

In [41]:
sequels = sequels.drop(columns=['key_agg'])

In [42]:
prequels['prequel_keywords'] = prequels['prequel_keywords'].replace(np.nan, " ")

In [43]:
prequels['prequel_keywords'] = prequels['prequel_keywords'].map(lambda x: ' '.join(x.split()[:1]))

In [44]:
sequels['keywords'].value_counts()

'london          10
'saving          10
'paris',          7
'riddle',         4
'cyborg',         3
                 ..
'spacecraft',     1
'vampire',        1
'hotel',          1
'press',          1
'dc               1
Name: keywords, Length: 120, dtype: int64

In [45]:
prequels['prequel_keywords'].value_counts()

'saving          10
'paris',          8
'london           8
'riddle',         4
'witch',          4
                 ..
'dead             1
'flying           1
'salesclerk',     1
'halloween',      1
'cia',            1
Name: prequel_keywords, Length: 123, dtype: int64

In [46]:
sequels_keywords_dummies = pd.get_dummies(sequels['keywords'])

In [47]:
sequels = sequels.join(sequels_keywords_dummies)
sequels.head(1)

Unnamed: 0,budget,id,imdb_id,original_language,title,popularity,release_date,revenue,runtime,vote_average,...,"'terrorist',","'transporter',","'usa',",'uss,"'vampire',","'venice',","'waitress',",'washington,"'witch',",'world
49,18000000,1891,tt0080684,en,The Empire Strikes Back,19.470959,19800517,538400000,124.0,8.2,...,0,0,0,0,0,0,0,0,0,0


In [48]:
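# NOTE: despite the variable name, this builds dummies from 'prequel_genres', not 'prequel_keywords';
# the output below accordingly shows genre columns.
prequels_keywords_dummies = pd.get_dummies(prequels['prequel_genres'])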

In [49]:
prequels = prequels.join(prequels_keywords_dummies)
prequels.head(1)

Unnamed: 0,prequel_id,prequel_title,prequel_revenue,prequel_release_date,prequel_rating,prequel_popularity,prequel_vote_average,prequel_keywords,prequel_genres,"'Action',",...,"'Family',","'Fantasy',",'Horror',"'Horror',","'Music',","'Mystery',",'Science,"'Thriller',","'War',",'Western'
49,10.0,Star Wars,775398007.0,19770525,3.660591,42.149697,8.1,"'android',","'Adventure',",0,...,0,0,0,0,0,0,0,0,0,0


In [50]:
prequels['prequel_genres'].value_counts()

'Adventure',    55
'Action',       42
'Science        12
'Horror',       12
'Comedy',       11
'Drama',         9
'Drama'          8
'Comedy'         7
'Fantasy',       6
'Horror'         5
'Thriller',      4
'Crime',         3
'Western'        2
'Animation',     2
'Music',         2
'Family',        2
'Mystery',       1
'War',           1
Name: prequel_genres, dtype: int64

In [51]:
sequels['genres'].value_counts()

'Action',       48
'Adventure',    48
'Comedy',       13
'Horror',       12
'Science        12
'Drama',         9
'Fantasy',       8
'Drama'          8
'Crime',         6
'Comedy'         6
'Animation',     3
'Horror'         2
'Music',         2
'Thriller',      2
'Western'        2
'Romance',       1
'Mystery',       1
'Family',        1
Name: genres, dtype: int64

In [52]:
sequels_genres_dummies = pd.get_dummies(sequels['genres'])

In [53]:
sequels = sequels.join(sequels_genres_dummies)
sequels.head(1)

Unnamed: 0,budget,id,imdb_id,original_language,title,popularity,release_date,revenue,runtime,vote_average,...,"'Family',","'Fantasy',",'Horror',"'Horror',","'Music',","'Mystery',","'Romance',",'Science,"'Thriller',",'Western'
49,18000000,1891,tt0080684,en,The Empire Strikes Back,19.470959,19800517,538400000,124.0,8.2,...,0,0,0,0,0,0,0,0,0,0


In [54]:
prequels_genres_dummies = pd.get_dummies(prequels['prequel_genres']) 

In [56]:
# prequels = prequels.join(prequels_genres_dummies)
# prequels.head(1)
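
Joining a second set of unprefixed dummy columns onto a frame that already contains columns with the same names fails in `DataFrame.join`, because the column names overlap. A minimal sketch of one way around this, on invented toy data, is to give every dummy family its own prefix; the prefixes `genre`, `keyword`, and `prequel_genre` are assumptions for illustration.

```python
import pandas as pd

# Toy frames standing in for `sequels` and `prequels`; values are illustrative.
sequels = pd.DataFrame({
    "genres": ["Action", "Drama", "Action"],
    "keywords": ["london", "paris", "london"],
})
prequels = pd.DataFrame({"prequel_genres": ["Action", "Action", "Horror"]})

# A distinct prefix per source column keeps names unique, so sequel and
# prequel genres can sit side by side in one feature matrix.
sequel_genre_dummies = pd.get_dummies(sequels["genres"], prefix="genre")
sequel_keyword_dummies = pd.get_dummies(sequels["keywords"], prefix="keyword")
prequel_genre_dummies = pd.get_dummies(prequels["prequel_genres"], prefix="prequel_genre")

features = pd.concat(
    [sequel_genre_dummies, sequel_keyword_dummies, prequel_genre_dummies], axis=1
)
print(features.columns.tolist())
```

The concatenation relies on the frames sharing the same index, which holds in the notebook because `sequels` and `prequels` are both column slices of `get_sequels`.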

> #### TODO: Merge Dataframes and combine categorical variables to ground truth

#### OLS Summary on revenue with rated metrics

In [59]:
import statsmodels.api as sm

In [60]:
get_sequels['intercept'] = 1

In [61]:
lm = sm.OLS(get_sequels['revenue'], get_sequels[['intercept', 'rating', 'popularity', 'vote_average']])
results = lm.fit()
results.summary()

Dep. Variable:,revenue,R-squared:,0.305
Model:,OLS,Adj. R-squared:,0.293
Method:,Least Squares,F-statistic:,26.29
Date:,"Thu, 19 Sep 2019",Prob (F-statistic):,3.76e-14
Time:,08:19:55,Log-Likelihood:,-3803.9
No. Observations:,184,AIC:,7616.0
Df Residuals:,180,BIC:,7629.0
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
intercept,-8.834e+07,1.45e+08,-0.609,0.543,-3.75e+08,1.98e+08
rating,-2.889e+07,3.12e+07,-0.925,0.356,-9.05e+07,3.28e+07
popularity,1.314e+07,1.8e+06,7.299,0.000,9.58e+06,1.67e+07
vote_average,4.212e+07,2.01e+07,2.094,0.038,2.44e+06,8.18e+07

Omnibus:,23.998,Durbin-Watson:,1.294
Prob(Omnibus):,0.0,Jarque-Bera (JB):,32.513
Skew:,0.79,Prob(JB):,8.71e-08
Kurtosis:,4.32,Cond. No.,151.0
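
The same kind of regression can be extended to a categorical variable by dummy-coding it and dropping one level as the baseline. The sketch below uses made-up revenue figures and plain `numpy.linalg.lstsq` so that it is self-contained; in the notebook the same design matrix `X` could instead be passed to `sm.OLS` to obtain standard errors and p-values. The column name `genre` and all values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Invented toy data: one genre label and one revenue figure per film.
toy = pd.DataFrame({
    "genre": ["Action", "Action", "Drama", "Drama", "Horror", "Horror"],
    "revenue": [500.0, 700.0, 100.0, 300.0, 200.0, 200.0],
})

# drop_first=True leaves "Action" as the baseline level, which avoids
# perfect collinearity between the dummies and the intercept column.
X = pd.get_dummies(toy["genre"], prefix="genre", drop_first=True).astype(float)
X.insert(0, "intercept", 1.0)

# Ordinary least squares fit via numpy.
coef, *_ = np.linalg.lstsq(X.to_numpy(), toy["revenue"].to_numpy(), rcond=None)
coefficients = pd.Series(coef, index=X.columns)
print(coefficients)
```

With this coding, the intercept is the mean revenue of the baseline genre and each dummy coefficient is the difference between that genre's mean revenue and the baseline's.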


> ####  TODO: OLS summary on categorical variables


#### Export Dataset to CSV

In [63]:
# categorical_vars.to_csv('categorical_vars.csv', encoding='utf-8')