### Task #2: Sequel Success Prediction
> 1. Given the franchises discovered from task #1, suppose someone wants to predict the success of a sequel that is to be launched, and she/he will partner with you for the task. The ask here is not to develop a prediction model, rather, you need to develop data analysis notebooks for the following subtasks:
> 2. Create ground truth for training dataset, similar to the following output: movieId, franchiseId, 0 or 1
> 3. “0” means “not successful”; “1” means “successful”.
> 4. You need to provide a clear and reasonable definition of “success”.
> 5. Come up with features that can be predictive of sequel success, for each feature, demonstrate why it is predictive or not.
> 6. Create training dataset that consists of feature vector and ground truth. No need to do train-test split.
> 7. Note: if you choose to skip task #1, you can use the “belongs_to_collection” column as input to task #2.
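
To make the requested `movieId, franchiseId, 0 or 1` output concrete, here is a minimal sketch on toy data. The success rule used here (a sequel counts as successful when its mean rating is at least the mean rating of the earlier films in the same franchise) is only an illustrative assumption, not the definition adopted in this notebook, and the `release_order` and `mean_rating` columns are hypothetical names.

```python
import pandas as pd

# Hypothetical toy data: one row per movie, with its franchise and mean rating.
movies = pd.DataFrame({
    "movieId": [1, 2, 3, 4, 5],
    "franchiseId": [10, 10, 10, 20, 20],
    "release_order": [1, 2, 3, 1, 2],
    "mean_rating": [4.0, 3.5, 4.2, 3.0, 3.4],
}).sort_values(["franchiseId", "release_order"])

# Mean rating of all earlier films in the same franchise (NaN for the first film).
movies["prior_mean"] = (
    movies.groupby("franchiseId")["mean_rating"]
    .transform(lambda s: s.expanding().mean().shift(1))
)

# Only sequels (films with at least one predecessor) receive a label.
sequels = movies.dropna(subset=["prior_mean"]).copy()

# Assumed rule: success = 1 if the sequel rates at least as well as its predecessors.
sequels["success"] = (sequels["mean_rating"] >= sequels["prior_mean"]).astype(int)

ground_truth = sequels[["movieId", "franchiseId", "success"]]
print(ground_truth)
```

Any other definition (revenue relative to budget, for instance) would slot into the same shape: compute a per-sequel quantity, compare it with a baseline, and cast the result to 0 or 1.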

### Notes
> **Ratings Dataset**
1. `userId` `movieId`: convert to int
2. `rating`: convert float to int
3. `timestamp`: convert epoch seconds to datetime

> **Keywords Dataset**
1. `keywords`: extract keyword

> **Merge Datasets**

In [105]:
from IPython.display import display, HTML
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
display(HTML("<style>.container { width:100% !important; }</style>"))

import json
import ast
import datetime 
import pandas as pd
import numpy as np

#### Ratings Dataset

In [107]:
rate = pd.read_csv('ratings.csv', engine='python', encoding='utf8')

In [108]:
ratings = rate.copy()

In [109]:
ratings.head(1)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,110,1.0,1425941529


In [110]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26024289 entries, 0 to 26024288
Data columns (total 4 columns):
userId       int64
movieId      int64
rating       float64
timestamp    int64
dtypes: float64(1), int64(3)
memory usage: 794.2 MB


#### Convert epoch timestamp to datetime

In [111]:
# Interpret the integers as seconds since the Unix epoch; unparseable values become NaT
ratings['rating_timestamp'] = pd.to_datetime(ratings['timestamp'], unit='s', errors='coerce')

In [112]:
ratings['rating_timestamp'].sample(1)  

9088710   2007-12-17 19:36:54
Name: rating_timestamp, dtype: datetime64[ns]
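
As a small self-contained check of the conversion above, the sketch below converts the epoch value shown in `ratings.head(1)` and then formats it as a date string. The `%Y%m%d` format and the second epoch value are assumptions chosen for illustration; any valid `strftime` pattern works the same way.

```python
import pandas as pd

# Epoch seconds; the first value is the timestamp shown in ratings.head(1),
# the second is an arbitrary illustrative value.
epoch_seconds = pd.Series([1425941529, 1260759144])

# unit="s" treats the integers as seconds since 1970-01-01 (UTC).
as_datetime = pd.to_datetime(epoch_seconds, unit="s", errors="coerce")

# Optional: render the datetimes as compact YYYYMMDD strings.
as_text = as_datetime.dt.strftime("%Y%m%d")

print(as_datetime.iloc[0])  # 2015-03-09 22:52:09
print(as_text.iloc[0])      # 20150309
```

Keeping the column as `datetime64` rather than a string makes later operations such as `min` or date arithmetic straightforward; string formatting is best left to the display step.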

#### Keywords Dataset

In [113]:
keyw = pd.read_csv('keywords.csv', engine='python', encoding='utf8')

In [114]:
keywords = keyw.copy()

In [115]:
keywords.head(1)

Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."


In [116]:
keywords.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46419 entries, 0 to 46418
Data columns (total 2 columns):
id          46419 non-null int64
keywords    46419 non-null object
dtypes: int64(1), object(1)
memory usage: 725.4+ KB


#### Extract "name" 

In [117]:
# Parse each stringified list of dicts into real Python objects
keywords['key_list_dict'] = keywords['keywords'].apply(ast.literal_eval)

In [118]:
keywords['key_list_dict'].head(1)

0    [{'id': 931, 'name': 'jealousy'}, {'id': 4290,...
Name: key_list_dict, dtype: object

In [119]:
# Keep only the keyword names from each movie's list of {'id', 'name'} dicts
keywords['key_agg'] = [[kw['name'] for kw in kw_list] for kw_list in keywords['key_list_dict']]

In [120]:
keywords['key_agg'].head(1)

0    [jealousy, toy, boy, friendship, friends, riva...
Name: key_agg, dtype: object

In [121]:
keywords.head(1)

Unnamed: 0,id,keywords,key_list_dict,key_agg
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,...","[jealousy, toy, boy, friendship, friends, riva..."


In [122]:
keywords = keywords.drop(columns=['keywords', 'key_list_dict'])
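
The keyword extraction steps above can be sketched end to end on a two-row toy frame. The second row (`id` 999 with an empty keyword list) is a made-up example added to show that movies without keywords yield an empty list rather than an error.

```python
import ast

import pandas as pd

# Toy input mimicking keywords.csv: each cell is a string that looks like a list of dicts.
raw = pd.DataFrame({
    "id": [862, 999],  # 999 is a hypothetical id with no keywords
    "keywords": [
        "[{'id': 931, 'name': 'jealousy'}, {'id': 4290, 'name': 'toy'}]",
        "[]",
    ],
})

# Step 1: safely evaluate the string into Python lists of dicts.
parsed = raw["keywords"].apply(ast.literal_eval)

# Step 2: keep only the 'name' field of each dict.
raw["key_agg"] = parsed.apply(lambda items: [item["name"] for item in items])

print(raw[["id", "key_agg"]])
```

`ast.literal_eval` only accepts Python literals, so it is a safer choice than `eval` for data read from a file.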

#### Rename movie IDs

In [123]:
ratings = ratings.rename(columns={"movieId": "id"})

In [124]:
keywords.shape, ratings.shape

((46419, 2), (26024289, 5))

#### Group by "id" 

In [129]:
ratings = ratings.drop(columns=['userId', 'timestamp'])

In [131]:
ratings.head(1)

Unnamed: 0,id,rating,rating_timestamp
0,110,1.0,2015-03-09 22:52:09


In [152]:
# Per movie: mean rating and the earliest rating timestamp
ratings_grouped = ratings.groupby('id', as_index=False).agg({'rating': 'mean', 'rating_timestamp': 'min'})
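
For illustration, the same per-movie aggregation can be written with pandas named aggregation, which also gives the output columns explicit names. The toy values and the extra `rating_count` column are assumptions, not part of the original notebook; the count is included only because a mean over very few ratings is noisy.

```python
import pandas as pd

# Toy ratings: two ratings for movie 110, one for movie 147.
toy = pd.DataFrame({
    "id": [110, 110, 147],
    "rating": [1.0, 4.0, 4.5],
    "rating_timestamp": pd.to_datetime([
        "2015-03-09 22:52:09",
        "2010-01-01 00:00:00",
        "2015-03-09 23:07:15",
    ]),
})

# Named aggregation: output_column=(input_column, aggregation_function).
grouped = toy.groupby("id", as_index=False).agg(
    rating_mean=("rating", "mean"),
    rating_count=("rating", "count"),
    first_rated=("rating_timestamp", "min"),
)

print(grouped)
```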

#### View dataframes before merging

In [157]:
ratings_grouped.tail(1)

Unnamed: 0,id,rating,rating_timestamp
45114,176275,3.0,2017-08-03 15:59:28


In [158]:
keywords.head(1)

Unnamed: 0,id,key_agg
0,862,"[jealousy, toy, boy, friendship, friends, riva..."


#### Merge DataFrames

In [159]:
keywords.reset_index(drop=True, inplace=True)
ratings_grouped.reset_index(drop=True, inplace=True)

In [160]:
ratings_grouped.head(1)

Unnamed: 0,id,rating,rating_timestamp
0,1,3.888157,1996-01-29


In [161]:
master_keys_ratings = pd.merge(ratings_grouped, keywords, on=['id'], how='left')

In [162]:
master_keys_ratings.shape

(45218, 4)
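
Note that `ratings_grouped` ends at index 45114 (45,115 rows) while the merged frame has 45,218 rows, which suggests that some `id` values appear more than once in `keywords`. The sketch below, on made-up toy data, shows how duplicate keys on the right side inflate a left merge and one possible way to guard against it; whether dropping duplicates is appropriate for the real data is an assumption that should be checked first.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "rating": [3.9, 3.2, 4.0]})

# The right-hand frame contains id 2 twice.
right = pd.DataFrame({
    "id": [1, 2, 2],
    "key_agg": [["toy"], ["board game"], ["board game"]],
})

# A plain left merge repeats the left row once per matching right row.
inflated = pd.merge(left, right, on="id", how="left")
print(len(inflated))  # 4 rows instead of 3

# Keep the first occurrence of each id, then let pandas verify key uniqueness.
deduped = right.drop_duplicates(subset="id")
clean = pd.merge(left, deduped, on="id", how="left", validate="one_to_one")
print(len(clean))  # 3 rows
```

With `validate="one_to_one"`, pandas raises a `MergeError` if either side still has duplicate keys, so the row count cannot grow silently.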

In [163]:
master_keys_ratings.sample(1)

Unnamed: 0,id,rating,rating_timestamp,key_agg
35747,150586,4.0,2016-01-11 12:35:25,


####  Export Dataset to CSV

In [164]:
master_keys_ratings.to_csv('master_key_ratings.csv', encoding='utf-8')
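
One caveat when reloading this file: CSV has no list type, so `key_agg` comes back as a string, and the saved index comes back as an `Unnamed: 0` column unless it is skipped. The following sketch round-trips a toy frame through an in-memory buffer; the use of `index=False` and the choice to map missing keywords to an empty list are assumptions for illustration.

```python
import ast
import io

import numpy as np
import pandas as pd

# Toy frame: one movie with keywords, one without (NaN after a left merge).
df = pd.DataFrame({"id": [1, 2], "key_agg": [["jealousy", "toy"], np.nan]})

# Write to an in-memory buffer instead of disk; index=False avoids an extra column.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

restored = pd.read_csv(buffer)

# Lists were serialized as strings; parse them back and treat missing cells as empty lists.
restored["key_agg"] = restored["key_agg"].apply(
    lambda cell: ast.literal_eval(cell) if isinstance(cell, str) else []
)

print(restored)
```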