### Task #2: Sequel Success Prediction
> Data Exploration, Data Cleaning, Model Building (categorical variables), Visualizations
> 1. Given the franchises discovered in task #1, suppose someone wants to predict the success of a sequel that is about to launch, and they will partner with you on the task. The ask is not to develop a prediction model; rather, you need to develop data analysis notebooks for the following subtasks:
> 2. Create ground truth for the training dataset, similar to the following output: movieId, franchiseId, 0 or 1.
> 3. 0 means “not successful”; 1 means “successful”.
> 4. You need to provide a clear and reasonable definition of “success”.
> 5. Come up with features that can be predictive of sequel success. For each feature, demonstrate why it is or is not predictive.
> 6. Create a training dataset that consists of feature vectors and ground truth. No need to do a train-test split.
> 7. Note: if you choose to skip task #1, you can use the “belong_to_collection” column as input to task #2.
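
One possible way to turn a definition of “success” into ground-truth labels is sketched below. It is only an illustration: the `budget` and `revenue` columns, the sample values, the `label_success` helper, and the revenue-to-budget threshold are all assumptions, not something the task statement prescribes.

```python
import pandas as pd

# Hypothetical sample; column names and values are illustrative assumptions only.
movies = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "franchiseId": [10, 10, 20, 20],
    "budget": [100.0, 150.0, 50.0, 0.0],
    "revenue": [300.0, 120.0, 200.0, 80.0],
})

def label_success(df, min_ratio=1.0):
    """Label a movie 1 if revenue / budget exceeds min_ratio, else 0.

    Rows with a zero or missing budget cannot be scored and are dropped.
    """
    scored = df[df["budget"] > 0].copy()
    ratio = scored["revenue"] / scored["budget"]
    scored["success"] = (ratio > min_ratio).astype(int)
    return scored[["movieId", "franchiseId", "success"]].reset_index(drop=True)

ground_truth = label_success(movies)
print(ground_truth)
```

Any other definition (ratings, profit relative to the previous installment, and so on) would slot into the same shape: one row per movie, the franchise identifier, and a 0/1 label.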

### Notes
> Credits Dataset
1. `cast`: extract character, name, credit_id, id
2. `crew`: extract name, job, credit_id, id
3. `id`: convert to int

In [6]:
from IPython.display import display, HTML
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
display(HTML("<style>.container { width:100% !important; }</style>"))

import json
import ast
import pandas as pd
import numpy as np

In [7]:
cred = pd.read_csv('credits.csv', encoding='utf8', engine='python')

In [8]:
credits = cred.copy()

In [9]:
credits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45476 entries, 0 to 45475
Data columns (total 3 columns):
cast    45476 non-null object
crew    45476 non-null object
id      45476 non-null int64
dtypes: int64(1), object(2)
memory usage: 1.0+ MB


In [10]:
credits = credits.rename(columns={"id": "join_id"})

In [11]:
credits.head()

| | cast | crew | join_id |
|---|---|---|---|
| 0 | [{'cast_id': 14, 'character': 'Woody (voice)',... | [{'credit_id': '52fe4284c3a36847f8024f49', 'de... | 862 |
| 1 | [{'cast_id': 1, 'character': 'Alan Parrish', '... | [{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de... | 8844 |
| 2 | [{'cast_id': 2, 'character': 'Max Goldman', 'c... | [{'credit_id': '52fe466a9251416c75077a89', 'de... | 15602 |
| 3 | [{'cast_id': 1, 'character': "Savannah 'Vannah... | [{'credit_id': '52fe44779251416c91011acb', 'de... | 31357 |
| 4 | [{'cast_id': 1, 'character': 'George Banks', '... | [{'credit_id': '52fe44959251416c75039ed7', 'de... | 11862 |


In [12]:
print(type(credits.crew), type(credits.cast))

<class 'pandas.core.series.Series'> <class 'pandas.core.series.Series'>


#### Extract "Character", "Gender", "Name", "Order" from cast_list_dict column

In [13]:
credits['cast_list_dict'] = credits['cast'].apply(ast.literal_eval)  # parse each stringified list into a list of dictionaries

In [14]:
credits.cast_list_dict[0][:1]

[{'cast_id': 14,
  'character': 'Woody (voice)',
  'credit_id': '52fe4284c3a36847f8024f95',
  'gender': 2,
  'id': 31,
  'name': 'Tom Hanks',
  'order': 0,
  'profile_path': '/pQFoyx7rp09CJTAb932F2g8Nlho.jpg'}]
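
As a minimal sketch of the extraction listed in the notes (`character`, `name`, `credit_id`, `id`), the snippet below parses one stringified cast value and keeps only those keys. The `extract_cast_fields` helper and `CAST_FIELDS` constant are hypothetical names introduced for illustration; the sample string is a shortened copy of the record shown above.

```python
import ast

# A shortened stand-in for one value of the `cast` column (a stringified list).
raw_cast = (
    "[{'cast_id': 14, 'character': 'Woody (voice)', "
    "'credit_id': '52fe4284c3a36847f8024f95', 'gender': 2, 'id': 31, "
    "'name': 'Tom Hanks', 'order': 0, "
    "'profile_path': '/pQFoyx7rp09CJTAb932F2g8Nlho.jpg'}]"
)

CAST_FIELDS = ["character", "name", "credit_id", "id"]

def extract_cast_fields(raw, fields=CAST_FIELDS):
    """Parse a stringified cast list and keep only the requested keys."""
    members = ast.literal_eval(raw)
    # .get() returns None for any key a record happens to lack.
    return [{key: member.get(key) for key in fields} for member in members]

cast_records = extract_cast_fields(raw_cast)
print(cast_records[0])
```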

In [15]:
credits.head(1)

| | cast | crew | join_id | cast_list_dict |
|---|---|---|---|---|
| 0 | [{'cast_id': 14, 'character': 'Woody (voice)',... | [{'credit_id': '52fe4284c3a36847f8024f49', 'de... | 862 | [{'cast_id': 14, 'character': 'Woody (voice)',... |


#### Extract "Job", "Gender", and "Name" from crew_list_dict column

In [16]:
credits['crew_list_dict'] = credits['crew'].apply(ast.literal_eval)  # parse each stringified list into a list of dictionaries

In [17]:
credits.crew_list_dict[0][:1]

[{'credit_id': '52fe4284c3a36847f8024f49',
  'department': 'Directing',
  'gender': 2,
  'id': 7879,
  'job': 'Director',
  'name': 'John Lasseter',
  'profile_path': '/7EdqiNbr4FRjIhKHyPPdFfEEEFG.jpg'}]
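
Since each crew entry carries a `job` key, picking out the director can be done by filtering on that key rather than relying on list position. The sketch below assumes a hypothetical helper, `find_crew_by_job`, and a two-entry sample built from values that appear in this notebook's outputs.

```python
def find_crew_by_job(crew_list, job="Director"):
    """Return the first crew member dict whose 'job' matches, or None."""
    return next((member for member in crew_list if member.get("job") == job), None)

# Small sample in the same shape as one parsed `crew_list_dict` value.
crew_sample = [
    {"department": "Writing", "gender": 2, "id": 12891,
     "job": "Screenplay", "name": "Joss Whedon"},
    {"department": "Directing", "gender": 2, "id": 7879,
     "job": "Director", "name": "John Lasseter"},
]

director = find_crew_by_job(crew_sample)
print(director["name"], director["gender"])

# A job that is absent from the list yields None instead of raising.
print(find_crew_by_job(crew_sample, job="Producer"))
```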

#### Iterate over dictionaries and extract to join on movie "id" 

In [18]:
credits['crew_list_dict'][0][5]['name']

'Bonnie Arnold'

In [19]:
director_actor_df = pd.DataFrame(columns=['movie_id', 'talent_character', 'talent_name', 'talent_gender', 'director_gender', 'director_name'])

In [20]:
# categories = ['gender', 'order', 'name', 'job', 'character']
# creds_dicts = ['cast_list_dict', 'crew_list_dict' ]

# for item, row in credits.iterrows():
#     dir_talent_row = {'movie_id': np.nan, 'movie_id': np.nan,
#             'talent_character': np.nan,'talent_name': np.nan,
#             'talent_gender': np.nan, 'director_gender': np.nan,
#             'director_name':np.nan}
    
#     dir_talent_row['movie_id'] = int(row['join_id'])
    
#     talent=0
#     for item in row['cast_list_dict']:
#         if talent==1:
#             break
#         if item in categories: 
#             print('item talent:', item)
#             dir_talent_row['gender'] = item['gender']
#             dir_talent_row['order'] = item['order']
#             dir_talent_row['name'] = item['name']
#             dir_talent_row['character'] = item['character']
#             talent+=1
    
#     crew=0
#     for item in row['crew_list_dict']:
#         if crew==1:
#             break     # gender, name
#         if item in categories and categories[3] == 'Director':
#             print('item director: ', item)
#             dir_talent_row['gender'] = item['gender']
#             dir_talent_row['name'] = item['name']
#             crew+=1 
    
#     credits_new = director_actor_df.append(dir_talent_row, ignore_index=True)
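
The commented-out loop above tests `item in categories` against whole dictionaries and overwrites `credits_new` on every pass. A hedged alternative is sketched below: build one plain dict per movie, collect them in a list, and create the frame once at the end. The two-movie `sample` frame, the `lead_and_director` helper, and the "Placeholder Writer" entry are assumptions for illustration, not data from `credits.csv`.

```python
import pandas as pd

# Hypothetical two-movie sample mirroring the parsed columns above.
sample = pd.DataFrame({
    "join_id": [862, 8844],
    "cast_list_dict": [
        [{"character": "Woody (voice)", "gender": 2, "name": "Tom Hanks", "order": 0}],
        [],  # a movie with no cast listed
    ],
    "crew_list_dict": [
        [{"job": "Director", "gender": 2, "name": "John Lasseter"}],
        [{"job": "Screenplay", "gender": 0, "name": "Placeholder Writer"}],
    ],
})

def lead_and_director(row):
    """Collect the top-billed cast member and the director for one movie."""
    cast = sorted(row["cast_list_dict"], key=lambda m: m.get("order", 0))
    lead = cast[0] if cast else {}
    director = next(
        (m for m in row["crew_list_dict"] if m.get("job") == "Director"), {}
    )
    return {
        "movie_id": int(row["join_id"]),
        "talent_character": lead.get("character"),
        "talent_name": lead.get("name"),
        "talent_gender": lead.get("gender"),
        "director_gender": director.get("gender"),
        "director_name": director.get("name"),
    }

columns = ["movie_id", "talent_character", "talent_name",
           "talent_gender", "director_gender", "director_name"]
rows = [lead_and_director(row) for _, row in sample.iterrows()]
director_actor_sketch = pd.DataFrame(rows, columns=columns)
print(director_actor_sketch)
```

Building the frame from a list of dicts also avoids calling `DataFrame.append` inside a loop, which copies the frame on every call.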

In [21]:
# director_actor_df.head()

In [22]:
# credits_new.head()

In [327]:
credits_new.shape

(1, 6)

In [55]:
credits = credits.rename(columns={"id": "join_id"})
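# cast_ and crew_ are built in cells not shown in this export.
# NOTE: the credit_id prefixes are swapped (cast gets "cr_", crew gets "ca_"); kept as-is so the outputs below still match.
cast_ = cast_.rename(columns={"id": "cast_ref_id", "credit_id": "cr_credit_id", "gender": "ca_gender", "name": "cast_name"})
crew_ = crew_.rename(columns={"id": "crew_ref_id", "credit_id": "ca_credit_id", "gender": "cr_gender", "name": "crew_name"})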

### Merge DataFrames

In [73]:
credits.reset_index(drop=True, inplace=True)
cast_.reset_index(drop=True, inplace=True)
crew_.reset_index(drop=True, inplace=True)

In [74]:
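# NOTE: axis=1 aligns the three frames by row position, not by movie, so a cast row,
# a crew row, and a join_id that share an index do not necessarily belong to the same film.
master_credits = pd.concat([cast_, crew_, credits], axis=1)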

In [75]:
master_credits.head(1)

| | 0 | cast_id | character | cr_credit_id | ca_gender | cast_ref_id | cast_name | order | profile_path | 0 | ... | cr_gender | crew_ref_id | job | crew_name | profile_path | cast | crew | join_id | cast_list_dict | crew_list_dict |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | 14 | Woody (voice) | 52fe4284c3a36847f8024f95 | 2 | 31 | Tom Hanks | 0 | /pQFoyx7rp09CJTAb932F2g8Nlho.jpg | | ... | 2 | 7879 | Director | John Lasseter | /7EdqiNbr4FRjIhKHyPPdFfEEEFG.jpg | [{'cast_id': 14, 'character': 'Woody (voice)',... | [{'credit_id': '52fe4284c3a36847f8024f49', 'de... | 862.0 | [{'cast_id': 14, 'character': 'Woody (voice)',... | [{'credit_id': '52fe4284c3a36847f8024f49', 'de... |
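
A column-wise `pd.concat` pairs rows purely by index, so a flattened cast table and the per-movie `join_id` column line up only by coincidence. One hedged alternative is to carry the movie id into every flattened record while unpacking the lists, as sketched here. The three-record `sample` and the `cast_long` name are illustrative assumptions; the character names are taken from the outputs above.

```python
import pandas as pd

# Hypothetical sample: two movies, each with a parsed cast list.
sample = pd.DataFrame({
    "join_id": [862, 8844],
    "cast_list_dict": [
        [{"character": "Woody (voice)", "order": 0},
         {"character": "Buzz Lightyear (voice)", "order": 1}],
        [{"character": "Alan Parrish", "order": 0}],
    ],
})

# One output row per cast member, each tagged with the movie it came from.
cast_long = pd.DataFrame([
    {"join_id": movie_id, **member}
    for movie_id, members in zip(sample["join_id"], sample["cast_list_dict"])
    for member in members
])
print(cast_long)
```

With the id attached to every row, cast and crew tables can later be combined with a key-based `merge` on `join_id` instead of a positional concat.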


### Drop Columns

In [76]:
master_credits.columns

Index([               0,        'cast_id',      'character',   'cr_credit_id',
            'ca_gender',    'cast_ref_id',      'cast_name',          'order',
         'profile_path',                0,   'ca_credit_id',     'department',
            'cr_gender',    'crew_ref_id',            'job',      'crew_name',
         'profile_path',           'cast',           'crew',        'join_id',
       'cast_list_dict', 'crew_list_dict'],
      dtype='object')

In [77]:
# Dropping by label removes every column with that name, so duplicates need listing only once.
master_credits = master_credits.drop(columns=[0, 'profile_path', 'cast_list_dict', 'cast', 'crew_list_dict', 'crew'])

In [80]:
master_credits.head()

| | cast_id | character | cr_credit_id | ca_gender | cast_ref_id | cast_name | order | ca_credit_id | department | cr_gender | crew_ref_id | job | crew_name | join_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14 | Woody (voice) | 52fe4284c3a36847f8024f95 | 2 | 31 | Tom Hanks | 0 | 52fe4284c3a36847f8024f49 | Directing | 2 | 7879 | Director | John Lasseter | 862.0 |
| 1 | 15 | Buzz Lightyear (voice) | 52fe4284c3a36847f8024f99 | 2 | 12898 | Tim Allen | 1 | 52fe4284c3a36847f8024f4f | Writing | 2 | 12891 | Screenplay | Joss Whedon | 8844.0 |
| 2 | 16 | Mr. Potato Head (voice) | 52fe4284c3a36847f8024f9d | 2 | 7167 | Don Rickles | 2 | 52fe4284c3a36847f8024f55 | Writing | 2 | 7 | Screenplay | Andrew Stanton | 15602.0 |
| 3 | 17 | Slinky Dog (voice) | 52fe4284c3a36847f8024fa1 | 2 | 12899 | Jim Varney | 3 | 52fe4284c3a36847f8024f5b | Writing | 2 | 12892 | Screenplay | Joel Cohen | 31357.0 |
| 4 | 18 | Rex (voice) | 52fe4284c3a36847f8024fa5 | 2 | 12900 | Wallace Shawn | 4 | 52fe4284c3a36847f8024f61 | Writing | 0 | 12893 | Screenplay | Alec Sokolow | 11862.0 |


In [85]:
master_credits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 564892 entries, 0 to 564891
Data columns (total 14 columns):
cast_id         564892 non-null object
character       564892 non-null object
cr_credit_id    564892 non-null object
ca_gender       564892 non-null object
cast_ref_id     564892 non-null object
cast_name       564892 non-null object
order           564892 non-null object
ca_credit_id    465085 non-null object
department      465085 non-null object
cr_gender       465085 non-null object
crew_ref_id     465085 non-null object
job             465085 non-null object
crew_name       465085 non-null object
join_id         45476 non-null float64
dtypes: float64(1), object(13)
memory usage: 60.3+ MB


### Correct Data Types

In [84]:
# master_credits['join_id'] = master_credits['join_id'].astype(str).str.extract(r'(\d+)', expand=False).astype(float)  # stays float: rows without a join_id are NaN and cannot be cast to int

In [65]:
categorical_dtypes = ['character', 'ca_gender', 'cast_name', 'cr_gender', 'job', 'crew_name']

In [66]:
for var in categorical_dtypes:
    # Let pandas infer the categories from the observed values; passing the
    # column names as the category list would turn every value into NaN.
    master_credits[var] = master_credits[var].astype('category')
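
The effect of the category list on a categorical cast can be checked on a toy series. In this sketch, `jobs`, `wrong`, and `right` are hypothetical names; the point is only that values absent from an explicit `categories` list become NaN, whereas `astype("category")` infers the categories from the data.

```python
import pandas as pd

jobs = pd.Series(["Director", "Screenplay", "Director"])

# An explicit category list that does not contain the observed values:
# every entry is treated as "not a known category" and becomes NaN.
mismatched = pd.api.types.CategoricalDtype(categories=["job", "crew_name"], ordered=True)
wrong = jobs.astype(mismatched)

# Inferring the categories from the data keeps all observed values.
right = jobs.astype("category")

print(wrong.isna().sum(), right.isna().sum())
print(list(right.cat.categories))
```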

### Export Tidy Dataset to CSV

In [None]:
master_credits.to_csv("master_credits.csv", encoding='utf-8', index=False)  # index=False avoids an extra "Unnamed: 0" column on reload