### Task #2: Sequel Success Prediction
> 1. Given the franchises discovered from task #1, suppose someone wants to predict the success of a sequel that is to be launched, and she/he will partner with you for the task. The ask here is not to develop a prediction model, rather, you need to develop data analysis notebooks for the following subtasks:
> 2. Create ground truth for training dataset, similar to the following output: movieId, franchiseId, 0 or 1
> 3. 0 means “not successful”, “1” means successful.
> 4. You need to provide a clear and reasonable definition of “success”.
> 5. Come up with features that can be predictive of sequel success, for each feature, demonstrate why it is predictive or not.
> 6. Create training dataset that consists of feature vector and ground truth. No need to do train-test split.
> 7. Note: if you choose to skip task #1, you can use the `belongs_to_collection` column as input to task #2.
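
The ground-truth output requested in the task (movieId, franchiseId, 0 or 1) can be sketched on toy data. The success rule below (revenue at least twice the budget) is only an assumed placeholder, not a definition taken from this notebook, and the values are made up for illustration.

```python
import pandas as pd

# Toy data: column names follow the requested output, values are made up
films = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "franchiseId": [10, 10, 20, 20],
    "budget": [100, 150, 50, 0],
    "revenue": [300, 120, 60, 500],
})

# Assumed rule: successful when revenue is at least twice the budget.
# Rows without a reported budget cannot be judged, so they are left out.
known = films[films["budget"] > 0].copy()
known["success"] = (known["revenue"] >= 2 * known["budget"]).astype(int)

ground_truth = known[["movieId", "franchiseId", "success"]]
print(ground_truth)
```

Any other threshold, or a rating-based rule, would slot into the same place; only the line that computes `success` changes.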

### Notes

1. `release_date`: convert to datetime 
2. `production_companies`: extract company names to list
3. `genres`: extract genres to list
4. variables: `release_date`, `keywords`, `revenue`, `genres`, `original_language`, `runtime`, `vote_average`
5. drop columns after data cleansing: `adult`, `vote_count`, `homepage`, `tagline`, `title`, `spoken_languages`, `production_countries`, `production_companies`

In [635]:
from IPython.display import display, HTML
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
display(HTML("<style>.container { width:100% !important; }</style>"))

import os
import datetime as dt
import numpy as np
import pandas as pd
import json
import ast

In [636]:
meta = pd.read_csv('movies_metadata.csv', engine='python', encoding='utf8')

#### Copy of dataframe

In [637]:
movies = meta.copy()

#### Brief analysis of data, nulls and data types

In [638]:
# movies.info(1)

In [640]:
movies.isnull().sum()

adult                        0
belongs_to_collection    40972
budget                       0
genres                       0
homepage                 37684
id                           0
imdb_id                     17
original_language           11
original_title               0
overview                   954
popularity                   5
poster_path                386
production_companies         3
production_countries         3
release_date                87
revenue                      6
runtime                    263
spoken_languages             6
status                      87
tagline                  25054
title                        6
video                        6
vote_average                 6
vote_count                   6
dtype: int64

In [642]:
# movies.dtypes

#### Convert column data to correct data types

In [643]:
# Unparseable values become NaN, then 0; budget is stored as an integer
movies['budget'] = pd.to_numeric(movies['budget'], errors='coerce').fillna(0).astype('int64')
movies['popularity'] = pd.to_numeric(movies['popularity'], errors='coerce').fillna(0)

In [644]:
movies['release_date'] = pd.to_datetime(movies['release_date'], errors='coerce').dt.date

In [645]:
# Keep the first run of digits in each id; raw string avoids the invalid '\d' escape warning
movies['id'] = movies['id'].str.extract(r'(\d+)', expand=False).astype(int)

In [646]:
movies['revenue'] = pd.to_numeric(movies['revenue'], errors='coerce').fillna(0).astype('int64')
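
The coercion pattern applied to `budget`, `popularity`, and `revenue` can be seen in isolation on a toy Series. The sample values here are made up, not taken from the dataset.

```python
import pandas as pd

# Toy values: two valid numbers, one stray string, one missing entry
raw = pd.Series(["30000000", "0", "/poster.jpg", None])

# errors='coerce' turns anything unparseable into NaN instead of raising
numeric = pd.to_numeric(raw, errors="coerce")

# Fill the gaps with 0 and store the result as integers
cleaned = numeric.fillna(0).astype("int64")
print(cleaned.tolist())
```

After this step a filled-in 0 cannot be told apart from a genuinely reported 0, which matters when a column such as `budget` or `revenue` is later used in a definition of success.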

In [647]:
movies.describe()

| | budget | id | popularity | revenue | runtime | vote_average | vote_count |
|---|---|---|---|---|---|---|---|
| count | 45466.0 | 45466.0 | 45466.0 | 45466.0 | 45203.0 | 45460.0 | 45460.0 |
| mean | 4224300.0 | 108352.901333 | 2.921093 | 11207870.0 | 94.128199 | 5.618207 | 109.897338 |
| std | 17423590.0 | 112460.356937 | 6.005112 | 64328130.0 | 38.40781 | 1.924216 | 491.310374 |
| min | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 26444.75 | 0.385804 | 0.0 | 85.0 | 5.0 | 3.0 |
| 50% | 0.0 | 59999.0 | 1.12741 | 0.0 | 95.0 | 6.0 | 10.0 |
| 75% | 0.0 | 157316.25 | 3.678343 | 0.0 | 107.0 | 6.8 | 34.0 |
| max | 380000000.0 | 469172.0 | 547.488298 | 2787965000.0 | 1256.0 | 10.0 | 14075.0 |


#### Confirm Series type

In [648]:
print(type(movies.belongs_to_collection), type(movies.spoken_languages)) 

<class 'pandas.core.series.Series'> <class 'pandas.core.series.Series'>


#### Extract "Collection" from "Belongs to Collection" and add new column labeled "Franchise"

In [649]:
movies['belongs_to_collection'].isnull().sum()

40972

In [650]:
movies['belongs_to_collection'].sample(1)

21720    NaN
Name: belongs_to_collection, dtype: object

#### Custom function for error handling

In [605]:
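def evaluator(x):
    """Parse a stringified dict; return NaN for missing values and non-dict results."""
    if pd.isnull(x):
        return np.nan
    evaluated = ast.literal_eval(x)
    # Anything that parses to a non-dict literal is treated as missing
    return evaluated if isinstance(evaluated, dict) else np.nan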
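
To make the three branches concrete, here is a small self-contained check using a function with the same branches as `evaluator`. The first sample string reuses the Toy Story collection id that appears in this notebook's output; the numeric string is a made-up stand-in for a malformed row.

```python
import ast

import numpy as np
import pandas as pd


def evaluator(x):
    """Dict strings become dicts; missing values and non-dict literals become NaN."""
    if pd.isnull(x):
        return np.nan
    evaluated = ast.literal_eval(x)
    return evaluated if isinstance(evaluated, dict) else np.nan


samples = pd.Series([
    "{'id': 10194, 'name': 'Toy Story Collection'}",  # well-formed dict string
    np.nan,                                           # missing value
    "0.065736",                                       # parses, but to a float
])

parsed = samples.apply(evaluator)
print(parsed.notna().tolist())
```

Note that a string which is not a valid Python literal at all would still raise inside `ast.literal_eval`; wrapping the call in `try`/`except (ValueError, SyntaxError)` would be one way to cover that case if it turns up.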

In [651]:
movies['collection_list_dict'] = movies['belongs_to_collection'].apply(evaluator)  # converting to dict with ast.literal_eval

In [652]:
movies = movies[movies['collection_list_dict'].notna()].copy()  # keep rows with a parsed collection; copy avoids SettingWithCopyWarning below

In [653]:
movies['collection_id'] = movies['collection_list_dict'].apply(lambda x: x['id'])
movies['collection_name'] = movies['collection_list_dict'].apply(lambda x: x['name'])

In [654]:
movies.sample(1)

| | adult | belongs_to_collection | budget | genres | homepage | id | imdb_id | original_language | original_title | overview | ... | spoken_languages | status | tagline | title | video | vote_average | vote_count | collection_list_dict | collection_id | collection_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 47 | False | {'id': 136214, 'name': 'Pocahontas Collection'... | 55000000 | [{'id': 12, 'name': 'Adventure'}, {'id': 16, '... | | 10530 | tt0114148 | en | Pocahontas | History comes gloriously to life in Disney's e... | ... | [{'iso_639_1': 'en', 'name': 'English'}] | Released | An American legend comes to life. | Pocahontas | False | 6.7 | 1509.0 | {'id': 136214, 'name': 'Pocahontas Collection'... | 136214 | Pocahontas Collection |


In [655]:
# Substring check on the raw string: only rows mentioning "Collection" are flagged
movies['franchise'] = movies['belongs_to_collection'].map(lambda x: "Franchise" if "Collection" in x else "")

#### Extract "name" from "production companies" to new column

In [656]:
movies['production_companies'].isnull().sum()

1

In [657]:
movies['production_co_names'] = movies['production_companies'].fillna('[]').str.strip().apply(ast.literal_eval)

In [658]:
movies['production_co_names'] = movies['production_co_names'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

In [659]:
movies['production_co_names'].sample(1)

20714    [Filmkraft Productions Pvt. Ltd]
Name: production_co_names, dtype: object

#### Extract "name" from "production countries" to new column

In [660]:
movies['production_countries'].isnull().sum()

1

In [661]:
movies['production_countries_name'] = movies['production_countries'].fillna('[]').str.strip().apply(ast.literal_eval)

In [662]:
movies['production_countries_name'].sample(1)

6605    [{'iso_3166_1': 'DE', 'name': 'Germany'}, {'is...
Name: production_countries_name, dtype: object

In [663]:
movies['production_countries_name'] = movies['production_countries_name'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

In [664]:
movies['production_countries_name'].sample(1)

643    [United States of America]
Name: production_countries_name, dtype: object

#### Extract "name" from "genres" column to list

In [665]:
movies['genres'].sample()

11152    [{'id': 28, 'name': 'Action'}, {'id': 53, 'nam...
Name: genres, dtype: object

In [666]:
movies['genres_list_dict'] = movies['genres'].apply(ast.literal_eval)

In [667]:
movies['genres_list_dict'].head(1)

0    [{'id': 16, 'name': 'Animation'}, {'id': 35, '...
Name: genres_list_dict, dtype: object

In [668]:
movies['genres_list'] = movies['genres_list_dict'].apply(lambda genres: [g['name'] for g in genres])

In [669]:
movies['genres_list'].sample(1)

9484    [Action, Crime]
Name: genres_list, dtype: object
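
For the feature step of the task, a list-valued column such as `genres_list` usually has to become numeric columns first. The following is a minimal sketch on toy lists, assuming one 0/1 indicator column per genre is wanted; the notebook itself does not perform this step.

```python
import pandas as pd

# Toy genre lists in the same shape as genres_list
genres = pd.Series([
    ["Animation", "Comedy", "Family"],
    ["Action", "Crime"],
    [],  # a film with no genres listed
])

# explode() gives one row per (film, genre); get_dummies() one-hot encodes them;
# groupby(level=0).max() collapses back to one row per film
indicators = (
    pd.get_dummies(genres.explode())
    .groupby(level=0)
    .max()
    .astype(int)
)
print(indicators)
```

A film with an empty list ends up as a row of zeros, so no rows are lost in the round trip through `explode()`.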

#### Drop columns

In [671]:
movies.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count', 'collection_list_dict', 'collection_id',
       'collection_name', 'franchise', 'production_co_names',
       'production_countries_name', 'genres_list_dict', 'genres_list'],
      dtype='object')

In [672]:
movies = movies.drop(columns=['adult', 'belongs_to_collection', 'genres', 'homepage',
                              'overview', 'poster_path', 'production_companies', 'production_countries',
                              'tagline', 'title', 'video', 'status', 'vote_count', 'spoken_languages',
                              'genres_list_dict', 'collection_list_dict'])

In [673]:
movies.head(1)

| | budget | id | imdb_id | original_language | original_title | popularity | release_date | revenue | runtime | vote_average | collection_id | collection_name | franchise | production_co_names | production_countries_name | genres_list |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30000000 | 862 | tt0114709 | en | Toy Story | 21.946943 | 1995-10-30 | 373554033 | 81.0 | 7.7 | 10194 | Toy Story Collection | Franchise | [Pixar Animation Studios] | [United States of America] | [Animation, Comedy, Family] |


In [674]:
movies_franchises = movies[movies['franchise']=='Franchise']

In [675]:
movies_franchises.shape

(3733, 16)

In [676]:
movies_franchises.head(1)

| | budget | id | imdb_id | original_language | original_title | popularity | release_date | revenue | runtime | vote_average | collection_id | collection_name | franchise | production_co_names | production_countries_name | genres_list |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30000000 | 862 | tt0114709 | en | Toy Story | 21.946943 | 1995-10-30 | 373554033 | 81.0 | 7.7 | 10194 | Toy Story Collection | Franchise | [Pixar Animation Studios] | [United States of America] | [Animation, Comedy, Family] |


In [677]:
movies_franchises = movies_franchises.drop(columns=['franchise'])

#### Reset index and rename columns

In [678]:
movies_franchises.reset_index(drop=True, inplace=True)

In [679]:
movies_franchises = movies_franchises.rename(columns={"original_title": "title",
                                                      "production_co_names": "production companies",
                                                      "production_countries_name": "production_countries",
                                                      "genres_list": "genres"})

In [680]:
movies_franchises.head(1)

| | budget | id | imdb_id | original_language | title | popularity | release_date | revenue | runtime | vote_average | collection_id | collection_name | production companies | production_countries | genres |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30000000 | 862 | tt0114709 | en | Toy Story | 21.946943 | 1995-10-30 | 373554033 | 81.0 | 7.7 | 10194 | Toy Story Collection | [Pixar Animation Studios] | [United States of America] | [Animation, Comedy, Family] |


#### Export Dataset to CSV

In [681]:
movies_franchises.to_csv('movies_franchises.csv', encoding='utf-8')
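
One thing to keep in mind when this file is loaded again: list-valued columns are written to CSV as their string representation, so they come back as plain strings. A minimal sketch of the round trip on a one-row toy frame follows; it writes to an in-memory buffer and passes `index=False`, both of which are choices made for the sketch rather than what the cell above does.

```python
import ast
import io

import pandas as pd

df = pd.DataFrame({
    "title": ["Toy Story"],
    "genres": [["Animation", "Comedy", "Family"]],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# A plain read returns the list column as a string
buffer.seek(0)
plain = pd.read_csv(buffer)

# A converter re-parses the string into a real list while reading
buffer.seek(0)
restored = pd.read_csv(buffer, converters={"genres": ast.literal_eval})

print(type(plain.loc[0, "genres"]), type(restored.loc[0, "genres"]))
```

Without `index=False`, the exported file also carries the row index as an extra unnamed first column, which shows up as `Unnamed: 0` on a plain `read_csv`.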