### Task #2: Sequel Success Prediction
> 1. Given the franchises discovered from task #1, suppose someone wants to predict the success of a sequel that is to be launched, and she/he will partner with you for the task. The ask here is not to develop a prediction model, rather, you need to develop data analysis notebooks for the following subtasks:
> 2. Create ground truth for training dataset, similar to the following output: movieId, franchiseId, 0 or 1
> 3. 0 means “not successful”, “1” means successful.
> 4. You need to provide a clear and reasonable definition of “success”.
> 5. Come up with features that can be predictive of sequel success, for each feature, demonstrate why it is predictive or not.
> 6. Create training dataset that consists of feature vector and ground truth. No need to do train-test split.
> 7. Note: if you choose to skip task #1, you can use “belong_to_collection” column as input to task #2.

### Notes

1. filter dataset to retrive prequel and sequels within same collection
2. sequel success definition:
     * sequel revenue > prequel revenue 
     * sequel metrics mean > prequel metrics mean 
         * metrics: (rating, popularity, vote_average) 
     * success indicator based on sequels that meet criteria of 1 and 2 with a release date diff of < 2 years

In [45]:
from IPython.display import display, HTML
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
display(HTML("<style>.container { width:100% !important; }</style>"))

import json
import ast

import datetime
from dateutil.relativedelta import relativedelta

import pandas as pd
pd.set_option('mode.chained_assignment', None)
import seaborn as sns
import numpy as np

import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import scipy as sp

In [46]:
master = pd.read_csv('franchises.csv', engine='python', encoding='utf8', parse_dates=['release_date'])

In [47]:
franchises = master.copy()

In [48]:
franchises = franchises.drop(columns=['Unnamed: 0'])

In [49]:
franchises['release_date'] = pd.to_datetime(franchises['release_date'], errors='coerce').dt.strftime("%Y%m%d")
franchises['rating_timestamp'] = pd.to_datetime(franchises['rating_timestamp'], errors='coerce').dt.strftime("%Y%m%d")

In [50]:
franchises.columns  

Index(['budget', 'id', 'imdb_id', 'original_language', 'title', 'popularity',
       'release_date', 'revenue', 'runtime', 'vote_average', 'collection_id',
       'collection_name', 'production companies', 'production_countries',
       'genres', 'rating', 'rating_timestamp', 'key_agg'],
      dtype='object')

#### Filtering dataframe to include only prequel and sequel (w/ same collection id)

In [51]:
franchises = franchises.sort_values('revenue', ascending=False).head(500)

In [52]:
franchises.shape

(500, 18)

In [53]:
collection_id_sort = franchises.sort_values(by=['collection_id', 'release_date'], ascending=True)

In [54]:
collection_id_sort['prev_title'] = collection_id_sort.title.shift(1)   # get previous row using shift because we grouped by id and sored date asc
collection_id_sort['prev_collection_id'] = collection_id_sort.collection_id.shift(1)   # get previous title id  
collection_id_sort['prev_collection_revenue'] = collection_id_sort.revenue.shift(1)   # get previous title revenue
collection_id_sort['prev_collection_release_date'] = collection_id_sort.release_date.shift(1)   # get previous title revenue

In [55]:
collection_id_sort['prev_collection_rating'] = collection_id_sort.rating.shift(1)   # get previous title revenue
collection_id_sort['prev_collection_popularity'] = collection_id_sort.popularity.shift(1)   # get previous title revenue
collection_id_sort['prev_collection_vote_average'] = collection_id_sort.vote_average.shift(1)   # get previous title revenue

In [56]:
collection_id_sort['prequel_title'] = collection_id_sort.apply(lambda x: x['prev_title'] if x['collection_id'] == x['prev_collection_id'] else None, axis=1)  # validating previous/sequel
collection_id_sort.head(1) # if prequel title is None it's doesn't have a sequel (here we are validating with #id of collection/ franchise group)

Unnamed: 0,budget,id,imdb_id,original_language,title,popularity,release_date,revenue,runtime,vote_average,...,rating_timestamp,key_agg,prev_title,prev_collection_id,prev_collection_revenue,prev_collection_release_date,prev_collection_rating,prev_collection_popularity,prev_collection_vote_average,prequel_title
27,11000000,11,tt0076759,en,Star Wars,42.149697,19770525,775398007,121.0,8.1,...,19960129,"'android', 'galaxy', 'hermit', 'death star', '...",,,,,,,,


In [57]:
collection_id_sort['prequel_revenue'] = collection_id_sort.apply(lambda x: x['prev_collection_revenue'] if x['collection_id'] == x['prev_collection_id'] else None, axis=1)  # validating previous/sequel
collection_id_sort.head(1)

Unnamed: 0,budget,id,imdb_id,original_language,title,popularity,release_date,revenue,runtime,vote_average,...,key_agg,prev_title,prev_collection_id,prev_collection_revenue,prev_collection_release_date,prev_collection_rating,prev_collection_popularity,prev_collection_vote_average,prequel_title,prequel_revenue
27,11000000,11,tt0076759,en,Star Wars,42.149697,19770525,775398007,121.0,8.1,...,"'android', 'galaxy', 'hermit', 'death star', '...",,,,,,,,,


In [58]:
collection_id_sort['prequel_release_date'] = collection_id_sort.apply(lambda x: x['prev_collection_release_date'] if x['collection_id'] == x['prev_collection_id'] else None, axis=1)  # validating previous/sequel
collection_id_sort.head(1)  

Unnamed: 0,budget,id,imdb_id,original_language,title,popularity,release_date,revenue,runtime,vote_average,...,prev_title,prev_collection_id,prev_collection_revenue,prev_collection_release_date,prev_collection_rating,prev_collection_popularity,prev_collection_vote_average,prequel_title,prequel_revenue,prequel_release_date
27,11000000,11,tt0076759,en,Star Wars,42.149697,19770525,775398007,121.0,8.1,...,,,,,,,,,,


In [59]:
collection_id_sort['prequel_rating'] = collection_id_sort.apply(lambda x: x['prev_collection_rating'] if x['collection_id'] == x['prev_collection_id'] else None, axis=1)  # validating previous/sequel
collection_id_sort.head(1)  

Unnamed: 0,budget,id,imdb_id,original_language,title,popularity,release_date,revenue,runtime,vote_average,...,prev_collection_id,prev_collection_revenue,prev_collection_release_date,prev_collection_rating,prev_collection_popularity,prev_collection_vote_average,prequel_title,prequel_revenue,prequel_release_date,prequel_rating
27,11000000,11,tt0076759,en,Star Wars,42.149697,19770525,775398007,121.0,8.1,...,,,,,,,,,,


In [60]:
collection_id_sort['prequel_popularity'] = collection_id_sort.apply(lambda x: x['prev_collection_popularity'] if x['collection_id'] == x['prev_collection_id'] else None, axis=1)  # validating previous/sequel
collection_id_sort.head(1)  

Unnamed: 0,budget,id,imdb_id,original_language,title,popularity,release_date,revenue,runtime,vote_average,...,prev_collection_revenue,prev_collection_release_date,prev_collection_rating,prev_collection_popularity,prev_collection_vote_average,prequel_title,prequel_revenue,prequel_release_date,prequel_rating,prequel_popularity
27,11000000,11,tt0076759,en,Star Wars,42.149697,19770525,775398007,121.0,8.1,...,,,,,,,,,,


In [61]:
collection_id_sort['prequel_vote_average'] = collection_id_sort.apply(lambda x: x['prev_collection_vote_average'] if x['collection_id'] == x['prev_collection_id'] else None, axis=1)  # validating previous/sequel
collection_id_sort.head(1)  

Unnamed: 0,budget,id,imdb_id,original_language,title,popularity,release_date,revenue,runtime,vote_average,...,prev_collection_release_date,prev_collection_rating,prev_collection_popularity,prev_collection_vote_average,prequel_title,prequel_revenue,prequel_release_date,prequel_rating,prequel_popularity,prequel_vote_average
27,11000000,11,tt0076759,en,Star Wars,42.149697,19770525,775398007,121.0,8.1,...,,,,,,,,,,


In [62]:
collection_id_sort = collection_id_sort.rename(columns={"prev_collection_id": "prequel_id"})

In [63]:
collection_id_sort.columns

Index(['budget', 'id', 'imdb_id', 'original_language', 'title', 'popularity',
       'release_date', 'revenue', 'runtime', 'vote_average', 'collection_id',
       'collection_name', 'production companies', 'production_countries',
       'genres', 'rating', 'rating_timestamp', 'key_agg', 'prev_title',
       'prequel_id', 'prev_collection_revenue', 'prev_collection_release_date',
       'prev_collection_rating', 'prev_collection_popularity',
       'prev_collection_vote_average', 'prequel_title', 'prequel_revenue',
       'prequel_release_date', 'prequel_rating', 'prequel_popularity',
       'prequel_vote_average'],
      dtype='object')

In [64]:
collection_id_sort = collection_id_sort.drop(columns=['prev_title', 'prev_collection_revenue', 'prev_collection_release_date', 
                                                     'prev_collection_rating', 'prev_collection_popularity','prev_collection_vote_average' ], axis=1)

In [65]:
collection_id_sort = collection_id_sort[collection_id_sort['prequel_title'] == collection_id_sort['prequel_title']] 

In [66]:
collection_id_sort.head(1)

Unnamed: 0,budget,id,imdb_id,original_language,title,popularity,release_date,revenue,runtime,vote_average,...,rating,rating_timestamp,key_agg,prequel_id,prequel_title,prequel_revenue,prequel_release_date,prequel_rating,prequel_popularity,prequel_vote_average
49,18000000,1891,tt0080684,en,The Empire Strikes Back,19.470959,19800517,538400000,124.0,8.2,...,2.464912,19980701,"'rebel', 'android', 'asteroid', 'space battle'...",10.0,Star Wars,775398007.0,19770525,3.660591,42.149697,8.1


#### Generate success indicator

###### 1. retrieve sequels with revenue greater than prequel

In [67]:
sequel_revenue_greater = collection_id_sort.query('prequel_revenue < revenue')

In [68]:
sequel_revenue_greater.columns

Index(['budget', 'id', 'imdb_id', 'original_language', 'title', 'popularity',
       'release_date', 'revenue', 'runtime', 'vote_average', 'collection_id',
       'collection_name', 'production companies', 'production_countries',
       'genres', 'rating', 'rating_timestamp', 'key_agg', 'prequel_id',
       'prequel_title', 'prequel_revenue', 'prequel_release_date',
       'prequel_rating', 'prequel_popularity', 'prequel_vote_average'],
      dtype='object')

In [69]:
sequel_revenue_greater.head(1)

Unnamed: 0,budget,id,imdb_id,original_language,title,popularity,release_date,revenue,runtime,vote_average,...,rating,rating_timestamp,key_agg,prequel_id,prequel_title,prequel_revenue,prequel_release_date,prequel_rating,prequel_popularity,prequel_vote_average
45,32350000,1892,tt0086190,en,Return of the Jedi,14.586087,19830523,572700000,135.0,7.9,...,3.381859,19980610,"'rebel', 'brother sister relationship', 'emper...",10.0,The Empire Strikes Back,538400000.0,19800517,2.464912,19.470959,8.2


###### 2. retrieve sequels with metrics (combination of rating, popularity, vote average) greater than prequel

In [70]:
metric_cols = [ 'popularity', 'vote_average', 'rating', 'prequel_rating', 'prequel_popularity', 'prequel_vote_average']

In [71]:
sequel_revenue_greater['prequel_score'] = sequel_revenue_greater[[ 'popularity', 'vote_average', 'rating']].apply(lambda x: x.mean(), axis=1)

In [72]:
sequel_revenue_greater['sequel_score'] = sequel_revenue_greater[[ 'prequel_popularity', 'prequel_vote_average', 'prequel_rating']].apply(lambda x: x.sum(), axis=1)

In [73]:
sequel_ratings_greater = sequel_revenue_greater.query('sequel_score > prequel_score')

In [74]:
sequel_ratings_greater.head(1)

Unnamed: 0,budget,id,imdb_id,original_language,title,popularity,release_date,revenue,runtime,vote_average,...,key_agg,prequel_id,prequel_title,prequel_revenue,prequel_release_date,prequel_rating,prequel_popularity,prequel_vote_average,prequel_score,sequel_score
45,32350000,1892,tt0086190,en,Return of the Jedi,14.586087,19830523,572700000,135.0,7.9,...,"'rebel', 'brother sister relationship', 'emper...",10.0,The Empire Strikes Back,538400000.0,19800517,2.464912,19.470959,8.2,8.622649,30.135871


##### 3. retrieve sequels with above features that were released within 2 years of prequel 

In [75]:
sequel_ratings_greater['date_diff'] = pd.to_datetime(sequel_ratings_greater['release_date']) - pd.to_datetime(sequel_ratings_greater['prequel_release_date'])
sequel_ratings_greater['date_diff'] = sequel_ratings_greater['date_diff']/np.timedelta64(1,'Y')

In [76]:
sequel_ratings_greater.sample(1)

Unnamed: 0,budget,id,imdb_id,original_language,title,popularity,release_date,revenue,runtime,vote_average,...,prequel_id,prequel_title,prequel_revenue,prequel_release_date,prequel_rating,prequel_popularity,prequel_vote_average,prequel_score,sequel_score,date_diff
25,185000000,217,tt0367882,en,Indiana Jones and the Kingdom of the Crystal S...,12.577266,20080521,786636033,122.0,5.7,...,84.0,Indiana Jones and the Last Crusade,474171806.0,19890524,3.136548,14.788987,7.6,7.002991,25.525535,18.992861


#### Success Predictor (0,1) 

In [77]:
sequel_ratings_greater['success'] = sequel_ratings_greater.apply(lambda x: 1 if x['date_diff'] < 2 else 0, axis=1) 

In [78]:
sequel_ratings_greater.columns

Index(['budget', 'id', 'imdb_id', 'original_language', 'title', 'popularity',
       'release_date', 'revenue', 'runtime', 'vote_average', 'collection_id',
       'collection_name', 'production companies', 'production_countries',
       'genres', 'rating', 'rating_timestamp', 'key_agg', 'prequel_id',
       'prequel_title', 'prequel_revenue', 'prequel_release_date',
       'prequel_rating', 'prequel_popularity', 'prequel_vote_average',
       'prequel_score', 'sequel_score', 'date_diff', 'success'],
      dtype='object')

In [79]:
sequel_success_ground_truth = sequel_ratings_greater

In [80]:
sequel_success_ground_truth = sequel_success_ground_truth.filter(['id', 'collection_id', 'success'])

In [81]:
sequel_success_ground_truth.rename(columns={'id': 'movieId', 'collection_id': 'franchiseId'})

Unnamed: 0,movieId,franchiseId,success
45,1892,10,0
11,1893,10,0
18,1895,10,0
59,89,84,0
25,217,84,0
...,...,...,...
377,2295,182813,0
312,2332,211721,1
307,2334,211721,0
299,1251,261382,1


#### Export Dataset to CSV

In [407]:
sequel_success_ground_truth.to_csv('sequel_success_ground_truth.csv', encoding='utf-8')