# Leveraging Video-On-Demand streaming data for early forecast of a movie's success using Gradient Boosting Machines and advanced feature engineering techniques.
## Introduction
The goal of this project is to build a decision support system that aids movie investment decisions at the early stage of a movie's production. The system predicts a movie's success, measured by a streaming rank score, by leveraging historical data from various sources. Using social network analysis and natural language processing (NLP) techniques, the system automatically extracts several groups of features: the “who” (cast and crew), the “what” (the plot), and “hybrid” features that match “who” with “what”. To support investment decisions, the model can only use information that is available while a movie is still being planned. Predictions made right before or after the official release may have more data to draw on and be more accurate, but they come too late for investors to make any meaningful decision.

# Setup

## Python libraries

In [11]:
import pandas as pd
import numpy as np
from random import choice
import nltk
import networkx as nx # Graph analyses
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
import lightgbm as lgb # Prediction model
from utils import *

data_path = './data/'
countries = ['Mexico', 'Brazil', 'United States'] # Can be expanded to other countries

# Data Loading

### Training Set

In [12]:
main_df = pd.read_pickle(data_path + 'netflixmoviemain_df.pkl')
main_df

Unnamed: 0,country,jw_entity_id,rank,is_nflx_original,score,date,age_certification,object_type,original_release_year,original_title,...,genre_3,genre_4,genre_5,genre_6,genre_7,genre_8,genre_9,genre_10,genre_11,genre_12
0,Argentina,tm1000599,,,1.0,2021-11-07,,movie,2021.0,A Última Floresta,...,,,,,,,,,,
1,Argentina,tm1000619,,,1.0,2022-05-07,,movie,2022.0,రాధే శ్యామ్,...,,,,,,,,,,
2,Argentina,tm1001097,,,1.0,2022-06-29,R,movie,2022.0,Beauty,...,romance,,,,,,,,,
3,Argentina,tm1001912,,,1.0,2022-03-02,,movie,2021.0,Trust,...,romance,,,,,,,,,
4,Argentina,tm1003034,,,1.0,2021-08-23,,movie,2021.0,The Witcher: Nightmare of the Wolf,...,scifi,animation,action,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25831,Venezuela,tm996762,,,1.0,2022-07-07,,movie,2022.0,మేజర్,...,,,,,,,,,,
25832,Venezuela,tm998033,,,1.0,2021-11-18,,movie,2021.0,டாக்டர்,...,comedy,crime,,,,,,,,
25833,Venezuela,tm998992,,,1.0,2022-09-07,PG,movie,2021.0,竜とそばかすの姫,...,fantasy,music,scifi,,,,,,,
25834,Venezuela,tm999817,,,1.0,2021-12-01,,movie,2021.0,白蛇 II：青蛇劫起,...,action,,,,,,,,,


In [13]:
titles = pd.read_pickle(data_path + 'titles.pkl')
talent = pd.read_pickle(data_path + 'talent.pkl')
talent = talent.merge(titles[['jw_entity_id', 'original_release_year', 'genre_1']], on='jw_entity_id', how='left')

# Create a feature called 'tenure' to measure the number of years between the earliest movie and the latest movie of each talent
talent['tenure'] = talent.groupby('person_id')['original_release_year'].transform(lambda x: x.max() - x.min())
talent

Unnamed: 0,role,character_name,person_id,name,title,jw_entity_id,original_release_year,genre_1,tenure
0,ACTOR,Janaki,68294,Meena,Avvai Shanmugi,tm110160,1996.0,drama,35.0
1,ACTOR,Joseph,145348,Nagesh,Avvai Shanmugi,tm110160,1996.0,drama,46.0
2,ACTOR,Bhai,45436,Nassar,Avvai Shanmugi,tm110160,1996.0,drama,35.0
3,ACTOR,Rathna,432833,Heera Rajgopal,Avvai Shanmugi,tm110160,1996.0,drama,6.0
4,ACTOR,Kousi,471253,Rani,Avvai Shanmugi,tm110160,1996.0,drama,6.0
...,...,...,...,...,...,...,...,...,...
959963,EDITOR,,15210,Larry Bock,Remember the Daze,tm73324,2008.0,comedy,30.0
959964,EXECUTIVE_PRODUCER,,618188,Kevin Loughery,Remember the Daze,tm73324,2008.0,comedy,1.0
959965,ORIGINAL_MUSIC_COMPOSER,,33249,Dustin O'Halloran,Remember the Daze,tm73324,2008.0,comedy,14.0
959966,PRODUCER,,17275,Matthew Rhodes,Remember the Daze,tm73324,2008.0,comedy,16.0
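
To make the `tenure` feature concrete, here is a minimal sketch on a toy table. The person IDs and years are invented for illustration; only the `groupby(...).transform(...)` pattern mirrors the cell above.

```python
import pandas as pd

# Toy credits table; person IDs and release years are made up for illustration.
credits = pd.DataFrame({
    "person_id": [1, 1, 1, 2, 3, 3],
    "original_release_year": [1996.0, 2010.0, 2021.0, 2008.0, 2015.0, None],
})

# transform() broadcasts each person's (max - min) back to every row of that person.
# Missing years are skipped by max() and min().
credits["tenure"] = (
    credits.groupby("person_id")["original_release_year"]
    .transform(lambda years: years.max() - years.min())
)
print(credits)
```

A person with a single dated credit gets a tenure of 0, which is consistent with the zero values that appear in the tables of this notebook.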


# Network Analysis

For demonstration purposes, let's start by focusing on Mexico.

In [49]:
local_df, local_talent = get_country_dfs('Mexico', main_df, talent)
local_talent

Unnamed: 0,role,character_name,person_id,name,title,jw_entity_id,original_release_year,genre_1,tenure,score,...,talent_average_score,talent_max_score,talent_min_score,talent_median_score,talent_std_score,talent_count,talent_total_role_score,talent_average_role_score,talent_total_genre_role_score,talent_average_genre_role_score
0,ACTOR,Charlie St. Cloud,12678,Zac Efron,Charlie St. Cloud,tm96023,2010.0,drama,17.0,1.0,...,1.0,1.0,1.0,1.0,0.0,2,2.0,1.0,2.0,1.0
1,ACTOR,Tess Carroll,23864,Amanda Crew,Charlie St. Cloud,tm96023,2010.0,drama,14.0,1.0,...,1.0,1.0,1.0,1.0,0.0,2,2.0,1.0,1.0,1.0
2,ACTOR,Tink Weatherbee,10698,Donal Logue,Charlie St. Cloud,tm96023,2010.0,drama,29.0,1.0,...,1.0,1.0,1.0,1.0,,1,1.0,1.0,1.0,1.0
3,ACTOR,Florio Ferrente,5313,Ray Liotta,Charlie St. Cloud,tm96023,2010.0,drama,36.0,1.0,...,1.0,1.0,1.0,1.0,0.0,2,2.0,1.0,1.0,1.0
4,ACTOR,Connors,92289,Matt Ward,Charlie St. Cloud,tm96023,2010.0,drama,6.0,1.0,...,1.0,1.0,1.0,1.0,,1,1.0,1.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35526,EXECUTIVE_PRODUCER,,80065,Ashwin Rajan,Split,tm236706,2016.0,thriller,11.0,1.0,...,1.0,1.0,1.0,1.0,,1,1.0,1.0,1.0,1.0
35527,ORIGINAL_MUSIC_COMPOSER,,76651,West Dylan Thordson,Split,tm236706,2016.0,thriller,10.0,1.0,...,1.0,1.0,1.0,1.0,,1,1.0,1.0,1.0,1.0
35528,PRODUCER,,2243,Jason Blum,Split,tm236706,2016.0,thriller,22.0,1.0,...,1.0,1.0,1.0,1.0,0.0,11,11.0,1.0,6.0,1.0
35529,PRODUCTION_DESIGN,,120673,Mara LePere-Schloop,Split,tm236706,2016.0,thriller,6.0,1.0,...,1.0,1.0,1.0,1.0,,1,1.0,1.0,1.0,1.0


In [15]:
colab_matrix = get_colab_matrix(local_df, local_talent)
G, talent_centrality_measures = graph(colab_matrix)
talent_centrality_measures

Unnamed: 0,person_id,deg_cent,ein_cent,prank_cent
0,16,0.000698,1.568679e-12,0.000031
1,18,0.000698,1.568679e-12,0.000031
2,22939,0.000698,1.568679e-12,0.000031
3,138747,0.002277,8.495536e-11,0.000062
4,222873,0.000698,1.568679e-12,0.000031
...,...,...,...,...
27230,840825,0.000073,7.241075e-68,0.000037
27231,2360211,0.000073,7.241075e-68,0.000037
27232,2352928,0.000624,3.381036e-07,0.000033
27233,2363864,0.000991,1.374234e-07,0.000035
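
The helpers `get_colab_matrix` and `graph` live in `utils` and are not shown here. The sketch below is one plausible way to obtain similar columns with `networkx`, assuming that `deg_cent`, `ein_cent` and `prank_cent` stand for degree, eigenvector and PageRank centrality, and that two people are linked when they share a title. The toy data and the name `G_toy` are hypothetical.

```python
from itertools import combinations

import networkx as nx
import pandas as pd

# Toy credits: which (hypothetical) person worked on which title.
credits = pd.DataFrame({
    "jw_entity_id": ["m1", "m1", "m1", "m2", "m2"],
    "person_id": ["a", "b", "c", "c", "d"],
})

# Assumption: an edge connects every pair of people credited on the same title.
G_toy = nx.Graph()
for _, people in credits.groupby("jw_entity_id")["person_id"]:
    G_toy.add_edges_from(combinations(sorted(set(people)), 2))

# One row per person, one column per centrality measure.
centrality = pd.DataFrame({
    "deg_cent": nx.degree_centrality(G_toy),
    "ein_cent": nx.eigenvector_centrality(G_toy, max_iter=1000),
    "prank_cent": nx.pagerank(G_toy),
}).rename_axis("person_id").reset_index()
print(centrality)
```

In this toy graph, person `c` bridges both titles and therefore has the highest degree and PageRank centrality.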


In [50]:
local_talent.drop(columns=['character_name', 'original_release_year', 'genre_1', 'score'], inplace=True)
local_talent.drop_duplicates(inplace=True)

# Group by 'person_id' and 'role' and take the first value of the rest of the columns
local_talent = local_talent.groupby(['person_id', 'role']).first().reset_index()

local_talent = local_talent.merge(talent_centrality_measures, on='person_id', how='left')

local_talent = local_talent.fillna(0)

# Group by 'jw_entity_id' and 'role' and take the mean of the rest of the columns
local_talent_title_agg = local_talent.drop(columns=['person_id', 'name', 'title']).groupby(['jw_entity_id', 'role']).mean().reset_index()

# Pivot the table to have 'jw_entity_id' as index, 'role' as columns and the rest of the columns as values
local_talent_title_agg = local_talent_title_agg.pivot(index='jw_entity_id', columns='role').reset_index()
local_talent_title_agg.columns = ['_'.join(col) for col in local_talent_title_agg.columns.values]
local_talent_title_agg.rename(columns={'jw_entity_id_': 'jw_entity_id'}, inplace=True)
local_talent_title_agg

Unnamed: 0,jw_entity_id,tenure_ACTOR,tenure_ASSISTANT_DIRECTOR,tenure_AUTHOR,tenure_CO_EXECUTIVE_PRODUCER,tenure_CO_PRODUCER,tenure_CO_WRITER,tenure_DIRECTOR,tenure_EDITOR,tenure_EXECUTIVE_PRODUCER,...,prank_cent_EDITOR,prank_cent_EXECUTIVE_PRODUCER,prank_cent_MUSIC,prank_cent_ORIGINAL_MUSIC_COMPOSER,prank_cent_PRODUCER,prank_cent_PRODUCTION_DESIGN,prank_cent_SCREENPLAY,prank_cent_SONGS,prank_cent_VISUAL_EFFECTS,prank_cent_WRITER
0,tm1000037,2.833333,,,,,,6.0,6.0,,...,0.000053,,0.000029,,0.000029,0.000053,,,,0.000029
1,tm1000599,0.000000,,,,,,20.0,4.0,,...,0.000036,,,0.000036,0.000037,,,,,0.000039
2,tm1000619,18.166667,,,,,,7.0,45.0,,...,0.000032,,,0.000032,0.000032,0.000032,,,,0.000033
3,tm1001097,11.000000,,,,,,9.0,0.0,,...,0.000026,,,,,,,,,0.000050
4,tm1002815,5.076923,,,,0.0,,7.0,8.0,0.00,...,0.000045,0.000024,,,0.000074,0.000024,0.000045,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
962,tm996762,14.285714,,,,,,0.0,0.0,0.00,...,0.000029,0.000029,,0.000029,0.000029,0.000059,0.000031,,,
963,tm998188,9.650000,,,,,,3.0,,,...,,,,,0.000033,,,,,0.000046
964,tm998992,6.500000,,,,,,15.0,28.0,2.25,...,0.000033,0.000041,,0.000033,0.000033,0.000033,0.000034,0.000033,,
965,tm999817,0.812500,,,,,,2.0,,,...,,,,,0.000037,,,,,0.000037
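
The aggregate, pivot and flatten steps above can be traced on a small table. This is a sketch with invented values; the column names follow the notebook, but the data is hypothetical.

```python
import pandas as pd

# Toy talent table: one row per (title, person), already reduced to numeric features.
toy = pd.DataFrame({
    "jw_entity_id": ["tm1", "tm1", "tm1", "tm2"],
    "role": ["ACTOR", "ACTOR", "DIRECTOR", "ACTOR"],
    "tenure": [10.0, 20.0, 6.0, 4.0],
    "deg_cent": [0.2, 0.4, 0.1, 0.3],
})

# 1) Average every feature over the people holding the same role in a title.
per_title_role = toy.groupby(["jw_entity_id", "role"]).mean().reset_index()

# 2) Spread roles into columns; the result has (feature, role) MultiIndex columns.
wide = per_title_role.pivot(index="jw_entity_id", columns="role").reset_index()

# 3) Flatten the MultiIndex into names such as 'tenure_ACTOR'.
wide.columns = ["_".join(col) for col in wide.columns.values]
wide = wide.rename(columns={"jw_entity_id_": "jw_entity_id"})
print(wide)
```

Titles without a given role (here `tm2` has no director) end up with `NaN` in that role's columns, which explains the many empty cells in the wide table above.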


In [51]:
# Merge the 'local_df' with the 'local_talent_title_agg' table on 'jw_entity_id'
local_df = local_df.merge(local_talent_title_agg, on='jw_entity_id', how='left')
local_df

Unnamed: 0,country,jw_entity_id,rank,is_nflx_original,score,date,age_certification,object_type,original_release_year,original_title,...,prank_cent_EDITOR,prank_cent_EXECUTIVE_PRODUCER,prank_cent_MUSIC,prank_cent_ORIGINAL_MUSIC_COMPOSER,prank_cent_PRODUCER,prank_cent_PRODUCTION_DESIGN,prank_cent_SCREENPLAY,prank_cent_SONGS,prank_cent_VISUAL_EFFECTS,prank_cent_WRITER
0,Mexico,tm1000037,,,1.0,2021-09-23,R,movie,2021.0,Je suis Karl,...,0.000053,,0.000029,,0.000029,0.000053,,,,0.000029
1,Mexico,tm1000599,,,1.0,2021-11-07,,movie,2021.0,A Última Floresta,...,0.000036,,,0.000036,0.000037,,,,,0.000039
2,Mexico,tm1000619,,,1.0,2022-05-06,,movie,2022.0,రాధే శ్యామ్,...,0.000032,,,0.000032,0.000032,0.000032,,,,0.000033
3,Mexico,tm1001097,,,1.0,2022-06-29,R,movie,2022.0,Beauty,...,0.000026,,,,,,,,,0.000050
4,Mexico,tm1002815,,,1.0,2021-09-15,,movie,2021.0,Nightbooks,...,0.000045,0.000024,,,0.000074,0.000024,0.000045,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
976,Mexico,tm996762,,,1.0,2022-07-07,,movie,2022.0,మేజర్,...,0.000029,0.000029,,0.000029,0.000029,0.000059,0.000031,,,
977,Mexico,tm998188,,,1.0,2021-12-01,,movie,2021.0,Donde caben dos,...,,,,,0.000033,,,,,0.000046
978,Mexico,tm998992,,,1.0,2022-09-07,PG,movie,2021.0,竜とそばかすの姫,...,0.000033,0.000041,,0.000033,0.000033,0.000033,0.000034,0.000033,,
979,Mexico,tm999817,,,1.0,2021-12-01,,movie,2021.0,白蛇 II：青蛇劫起,...,,,,,0.000037,,,,,0.000037


# Plot scoring

In [52]:
# create a pandas series called 'plots' with the index as the movie's 'jw_entity_id' and the value as the movie's 'short_description'
plots = pd.Series(main_df['short_description'].values, index=main_df['jw_entity_id'])

# Drop duplicate index values from the plots series
plots = plots[~plots.index.duplicated(keep='first')]

# Drop null values from the plots series
plots.dropna(inplace=True)

scored_keywords = score_keywords(plots, main_df)
# Create a dataframe called 'scored_plots' with the 'node_weight_scored_by_keyword_and_country' column summed by 'jw_entity_id' and 'country'
scored_plots = scored_keywords.groupby(['jw_entity_id', 'country'])['node_weight_scored_by_keyword_and_country'].sum().reset_index()
scored_plots

Extracting keywords from the plot of each of 4124 movies...


4124it [00:46, 89.24it/s] 


Total number of keywords extracted: 41469


Unnamed: 0,jw_entity_id,country,node_weight_scored_by_keyword_and_country
0,tm10,Hungary,22.382919
1,tm10,India,153.821716
2,tm10,South Africa,115.377949
3,tm1000037,France,71.069444
4,tm1000037,Greece,91.090437
...,...,...,...
25788,tm999927,South Africa,18.501730
25789,tm999927,Thailand,109.780915
25790,tm999927,United Kingdom,26.219853
25791,tm999927,Venezuela,43.937666


In [53]:
# Merge the 'local_df' with the 'scored_plots' table on 'jw_entity_id' and 'country'
local_df = local_df.merge(scored_plots, on=['jw_entity_id', 'country'], how='left')
local_df.rename(columns={'node_weight_scored_by_keyword_and_country': 'plot_score'}, inplace=True)
local_df

Unnamed: 0,country,jw_entity_id,rank,is_nflx_original,score,date,age_certification,object_type,original_release_year,original_title,...,prank_cent_EXECUTIVE_PRODUCER,prank_cent_MUSIC,prank_cent_ORIGINAL_MUSIC_COMPOSER,prank_cent_PRODUCER,prank_cent_PRODUCTION_DESIGN,prank_cent_SCREENPLAY,prank_cent_SONGS,prank_cent_VISUAL_EFFECTS,prank_cent_WRITER,plot_score
0,Mexico,tm1000037,,,1.0,2021-09-23,R,movie,2021.0,Je suis Karl,...,,0.000029,,0.000029,0.000053,,,,0.000029,100.386991
1,Mexico,tm1000599,,,1.0,2021-11-07,,movie,2021.0,A Última Floresta,...,,,0.000036,0.000037,,,,,0.000039,3.430855
2,Mexico,tm1000619,,,1.0,2022-05-06,,movie,2022.0,రాధే శ్యామ్,...,,,0.000032,0.000032,0.000032,,,,0.000033,139.666610
3,Mexico,tm1001097,,,1.0,2022-06-29,R,movie,2022.0,Beauty,...,,,,,,,,,0.000050,139.502797
4,Mexico,tm1002815,,,1.0,2021-09-15,,movie,2021.0,Nightbooks,...,0.000024,,,0.000074,0.000024,0.000045,,,,70.212973
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
976,Mexico,tm996762,,,1.0,2022-07-07,,movie,2022.0,మేజర్,...,0.000029,,0.000029,0.000029,0.000059,0.000031,,,,61.530119
977,Mexico,tm998188,,,1.0,2021-12-01,,movie,2021.0,Donde caben dos,...,,,,0.000033,,,,,0.000046,105.690547
978,Mexico,tm998992,,,1.0,2022-09-07,PG,movie,2021.0,竜とそばかすの姫,...,0.000041,,0.000033,0.000033,0.000033,0.000034,0.000033,,,444.496586
979,Mexico,tm999817,,,1.0,2021-12-01,,movie,2021.0,白蛇 II：青蛇劫起,...,,,,0.000037,,,,,0.000037,81.350659


### Prediction Set

In [54]:
pred_set = pd.read_csv(data_path + 'project_form - movie.csv')

# make all column names lowercase
pred_set.columns = map(str.lower, pred_set.columns)

mask = ~pred_set['title'].isna()
pred_set['title'] = pred_set['title'].ffill()

# Create a 'pred_set_talent' dataframe that contains only the 'title', 'name', and 'role' columns.
# dropna() returns a new dataframe, which avoids the SettingWithCopyWarning raised by an in-place drop on a slice.
pred_set_talent = pred_set[['title', 'name', 'role']].dropna()
pred_set_talent


Unnamed: 0,title,name,role
0,WAY DOWN,Liam Cunningham,ACTOR
1,WAY DOWN,Astrid Bergès-Frisbey,ACTOR
2,WAY DOWN,Freddie Highmore,ACTOR
3,WAY DOWN,Jaume Balagueró,DIRECTOR
4,WAY DOWN,Álvaro Augustín,PRODUCER
...,...,...,...
5561,MUSK,ALEX GIBNEY,DIRECTOR
5562,MUSK,Black Bear,PRODUCER
5563,MUSK,JIGSAW PRODUCTIONS,PRODUCER
5564,MUSK,CLOSER MEDIA,PRODUCER
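
The `mask` plus `ffill` pattern in this cell suggests that the form export stores a project's title and plot only on its first row, with further credits on the rows below. Under that assumption, the following sketch (toy data, hypothetical names) shows why the mask has to be computed before the forward fill.

```python
import pandas as pd

# Toy form export: title and plot appear only on the first row of each project.
form = pd.DataFrame({
    "title": ["PROJECT X", None, None, "PROJECT Y", None],
    "plot": ["A heist...", None, None, "A thriller...", None],
    "name": ["Person A", "Person B", "Person C", "Person D", None],
    "role": ["ACTOR", "ACTOR", "DIRECTOR", "PRODUCER", None],
})

# Remember which rows start a project; after ffill this information is gone.
first_rows = ~form["title"].isna()

# Give every credit row the title of the project it belongs to.
form["title"] = form["title"].ffill()

talent_rows = form[["title", "name", "role"]].dropna()           # one row per credit
movie_rows = form[first_rows].drop(columns=["name", "role"])     # one row per project
print(talent_rows)
print(movie_rows)
```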


In [55]:
local_talent_crew_agg = local_talent.drop(columns=['person_id', 'title', 'jw_entity_id']).drop_duplicates()
local_talent_crew_agg = local_talent_crew_agg.groupby(['name', 'role']).mean().reset_index()
local_talent_crew_agg

Unnamed: 0,name,role,tenure,talent_total_score,talent_average_score,talent_max_score,talent_min_score,talent_median_score,talent_std_score,talent_count,talent_total_role_score,talent_average_role_score,talent_total_genre_role_score,talent_average_genre_role_score,deg_cent,ein_cent,prank_cent
0,50 Cent,ACTOR,17.0,2.0,1.0,1.0,1.0,1.0,0.0,2.0,2.0,1.0,1.0,1.0,0.003341,1.645165e-04,0.000049
1,A Martinez,ACTOR,50.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.000881,2.795196e-06,0.000025
2,A. C. Murali Mohan,ACTOR,7.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.001102,5.724415e-08,0.000033
3,A. Demetrius Brown,EXECUTIVE_PRODUCER,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.001248,8.558134e-05,0.000026
4,A. Jay Radcliff,ACTOR,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.000991,9.345656e-05,0.000019
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28888,Игорь Павлов,ACTOR,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.001763,1.772953e-07,0.000035
28889,杨轶,ACTOR,8.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.000991,6.601263e-06,0.000033
28890,王俊凯,ACTOR,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.001395,6.468249e-04,0.000024
28891,陆树铭,ACTOR,0.0,2.0,1.0,1.0,1.0,1.0,0.0,2.0,2.0,1.0,1.0,1.0,0.000441,1.905546e-09,0.000030


In [56]:
# Merge the 'pred_set_talent' dataframe with the 'local_talent_crew_agg' dataframe on the 'name' and 'role' columns
pred_set_talent = pred_set_talent.merge(local_talent_crew_agg, on=['name', 'role'], how='left')
pred_set_talent

Unnamed: 0,title,name,role,tenure,talent_total_score,talent_average_score,talent_max_score,talent_min_score,talent_median_score,talent_std_score,talent_count,talent_total_role_score,talent_average_role_score,talent_total_genre_role_score,talent_average_genre_role_score,deg_cent,ein_cent,prank_cent
0,WAY DOWN,Liam Cunningham,ACTOR,26.0,13.0,4.333333,11.0,1.0,1.0,5.773503,3.0,13.0,4.333333,1.0,1.0,0.005618,0.000134,0.000082
1,WAY DOWN,Astrid Bergès-Frisbey,ACTOR,10.0,11.0,11.000000,11.0,11.0,11.0,0.000000,1.0,11.0,11.000000,11.0,11.0,0.002020,0.000003,0.000036
2,WAY DOWN,Freddie Highmore,ACTOR,20.0,24.0,6.000000,11.0,1.0,6.0,5.773503,4.0,13.0,4.333333,2.0,1.0,0.004773,0.000187,0.000073
3,WAY DOWN,Jaume Balagueró,DIRECTOR,16.0,11.0,11.000000,11.0,11.0,11.0,0.000000,1.0,11.0,11.000000,11.0,11.0,0.002020,0.000003,0.000036
4,WAY DOWN,Álvaro Augustín,PRODUCER,16.0,14.0,3.500000,11.0,1.0,1.0,5.000000,4.0,14.0,3.500000,3.0,1.0,0.004443,0.000004,0.000098
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5529,MUSK,ALEX GIBNEY,DIRECTOR,,,,,,,,,,,,,,,
5530,MUSK,Black Bear,PRODUCER,,,,,,,,,,,,,,,
5531,MUSK,JIGSAW PRODUCTIONS,PRODUCER,,,,,,,,,,,,,,,
5532,MUSK,CLOSER MEDIA,PRODUCER,,,,,,,,,,,,,,,
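
Several unmatched rows above carry names in upper case, so exact string matching may be one reason for the missing values (others are company names, which would not be expected to match a person). A possible mitigation, sketched here with a hypothetical `name_key` helper and invented data, is to join on a normalized key and to measure the match rate with `indicator=True`.

```python
import unicodedata

import pandas as pd


def name_key(name: str) -> str:
    """Hypothetical join key: Unicode-normalized, case-folded, whitespace-collapsed."""
    normalized = unicodedata.normalize("NFKC", name)
    return " ".join(normalized.casefold().split())


# Historical talent features (toy values).
known = pd.DataFrame({
    "name": ["Jane Doe", "John Roe"],
    "role": ["DIRECTOR", "ACTOR"],
    "tenure": [12.0, 5.0],
})
# Credits of new projects, typed by hand with inconsistent casing and spacing.
wanted = pd.DataFrame({
    "title": ["PROJECT X", "PROJECT X", "PROJECT Y"],
    "name": ["JANE DOE", " john  roe", "Unknown Person"],
    "role": ["DIRECTOR", "ACTOR", "ACTOR"],
})

known["name_key"] = known["name"].map(name_key)
wanted["name_key"] = wanted["name"].map(name_key)

matched = wanted.merge(
    known.drop(columns="name"), on=["name_key", "role"], how="left", indicator=True
)
match_rate = (matched["_merge"] == "both").mean()
print(matched)
print(f"match rate: {match_rate:.2f}")
```

Case folding can also merge distinct people who share a name, so the match rate is best read as a diagnostic rather than a guarantee of correct links.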


In [57]:
pred_set_talent = pred_set_talent.drop(columns=['name']).groupby(['title', 'role']).mean().reset_index()

# Pivot the table to have 'title' as index, 'role' as columns and the rest of the columns as values
pred_set_talent = pred_set_talent.pivot(index='title', columns='role').reset_index()
pred_set_talent.columns = ['_'.join(col) for col in pred_set_talent.columns.values]
pred_set_talent.rename(columns={'title_': 'title'}, inplace=True)
pred_set_talent

Unnamed: 0,title,tenure_ACTOR,tenure_CO_PRODUCER,tenure_DIRECTOR,tenure_EDITOR,tenure_EXECUTIVE_PRODUCER,tenure_Elijah Wood,tenure_PRODUCER,tenure_PRODUCTION_DESIGN,tenure_SCREENPLAY,...,prank_cent_ACTOR,prank_cent_CO_PRODUCER,prank_cent_DIRECTOR,prank_cent_EDITOR,prank_cent_EXECUTIVE_PRODUCER,prank_cent_Elijah Wood,prank_cent_PRODUCER,prank_cent_PRODUCTION_DESIGN,prank_cent_SCREENPLAY,prank_cent_WRITER
0,10 Lives,,,,,,,,,,...,,,,,,,,,,
1,100 MINUTES,,,,,,,8.0,,,...,,,,,,,0.000062,,,
2,2 WIN,18.0,,,,,,44.0,,,...,0.000036,,,,,,0.000028,,,
3,2067,16.0,,,,,,,,,...,0.000037,,,,,,,,,
4,3 DAYS IN MALAY,31.0,,,,,,,,,...,0.000029,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
881,YOUTH,42.0,,,,,,32.0,,,...,0.000053,,,,,,0.000025,,,
882,You Can't Run Forever,40.5,,,,,,,,,...,0.000133,,,,,,,,,
883,ZERO CONTACT,24.6,,,,,,,,,...,0.000033,,,,,,,,,
884,ZEROS AND ONES,37.0,,,,,,,,,...,0.000101,,,,,,,,,
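
The pivoted table contains columns such as `tenure_Elijah Wood`, which indicates that at least one `role` cell in the form holds a person's name instead of a role. A simple guard, sketched below, is to check roles against an allow-list before pivoting. `KNOWN_ROLES` is a hypothetical constant built from the role labels visible in the training-side output above, and the rows are invented.

```python
import pandas as pd

# Role labels visible in the training-side tables above (possibly incomplete).
KNOWN_ROLES = {
    "ACTOR", "ASSISTANT_DIRECTOR", "AUTHOR", "CO_EXECUTIVE_PRODUCER", "CO_PRODUCER",
    "CO_WRITER", "DIRECTOR", "EDITOR", "EXECUTIVE_PRODUCER", "MUSIC",
    "ORIGINAL_MUSIC_COMPOSER", "PRODUCER", "PRODUCTION_DESIGN", "SCREENPLAY",
    "SONGS", "VISUAL_EFFECTS", "WRITER",
}

rows = pd.DataFrame({
    "title": ["PROJECT X", "PROJECT X", "PROJECT Y"],
    "name": ["Jane Doe", "John Roe", "Sam Poe"],
    "role": ["ACTOR", "Elijah Wood", "director "],
})

# Normalize harmless formatting differences, then separate valid from invalid roles.
rows["role"] = rows["role"].str.strip().str.upper()
is_valid = rows["role"].isin(KNOWN_ROLES)
bad = rows[~is_valid]          # rows to review by hand
clean = rows[is_valid].copy()  # rows that are safe to pivot
print(bad)
```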


In [58]:
pred_set = pred_set[mask].drop(columns=['name', 'role'])

# Drop rows with missing values for 'plot'
pred_set = pred_set.dropna(subset=['plot'])

pred_set

Unnamed: 0,title,plot,age_certification,genre_1,genre_2,genre_3,comentarios_vivi,budget,ask,sales,market,status
0,WAY DOWN,The Bank of Spain is like no other. An absolut...,PG-13,action,thriller,,,,,TF1,,
7,THE GOOD BOSS,It’s a sharp and nuanced dark comedy about the...,PG-13,comedy,drama,,"Es una comedia negra, por momentos bastante di...",3000000,250000,MK2,,
9,LOIS WAIN,Louis Wain: Unconventional while iconic. Candi...,R,drama,,,,,,,,
10,RIO,"Set against the exotic backdrop of Brazil, thi...",R,thriller,action,,,,,STUDIOCANAL,EFM 2022,
17,EMILY,Emily (Emma Mackey) wears a mask. The world te...,R,drama,,,Tipica historia tipo Pride and Prejudice. Se p...,Budget £8m,Asking para Latam: US$600k.,EMBANKMENT,Cannes 2022,Status Post-Production. Delivery Q1 2022
...,...,...,...,...,...,...,...,...,...,...,...,...
5540,CINNAMON,This darkly comedic heist thriller follows asp...,PG-13,thriller,comedy,,,,Ask: 75K,VILLAGE ROADSHOW,CANNES 2023,
5544,THE SALTED PATH,An honest and life-affirming true story of the...,R,drama,,,,,Ask: 450K,ROCKET SCIENCE,CANNES 2023,Pre prod. Shooting Date: 5th June 2023
5549,CONTROL,"Wallace Conway, a troubled doctor who increasi...",PG-13,thriller,drama,,"Lei las primeras 40 paginas, es excelente, atr...",,,STUDIOCANAL,CANNES 2023,
5553,CLIFFHANGER,Sylvester Stallone will reprise his character ...,PG-13,action,,,,,,ROCKET SCIENCE,CANNES 2023,Shooting Date: September 2023


In [59]:
# Merge the 'pred_set' dataframe with the 'pred_set_talent' dataframe on the 'title' column
pred_set = pred_set.merge(pred_set_talent, on='title', how='left')
pred_set

Unnamed: 0,title,plot,age_certification,genre_1,genre_2,genre_3,comentarios_vivi,budget,ask,sales,...,prank_cent_ACTOR,prank_cent_CO_PRODUCER,prank_cent_DIRECTOR,prank_cent_EDITOR,prank_cent_EXECUTIVE_PRODUCER,prank_cent_Elijah Wood,prank_cent_PRODUCER,prank_cent_PRODUCTION_DESIGN,prank_cent_SCREENPLAY,prank_cent_WRITER
0,WAY DOWN,The Bank of Spain is like no other. An absolut...,PG-13,action,thriller,,,,,TF1,...,0.000064,,0.000036,,,,0.000086,,,
1,THE GOOD BOSS,It’s a sharp and nuanced dark comedy about the...,PG-13,comedy,drama,,"Es una comedia negra, por momentos bastante di...",3000000,250000,MK2,...,0.000095,,,,,,,,,
2,LOIS WAIN,Louis Wain: Unconventional while iconic. Candi...,R,drama,,,,,,,...,0.000090,,,,,,,,,
3,RIO,"Set against the exotic backdrop of Brazil, thi...",R,thriller,action,,,,,STUDIOCANAL,...,0.000083,,,,,,0.000077,,,
4,EMILY,Emily (Emma Mackey) wears a mask. The world te...,R,drama,,,Tipica historia tipo Pride and Prejudice. Se p...,Budget £8m,Asking para Latam: US$600k.,EMBANKMENT,...,0.000053,,,,,,0.000026,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
900,CINNAMON,This darkly comedic heist thriller follows asp...,PG-13,thriller,comedy,,,,Ask: 75K,VILLAGE ROADSHOW,...,0.000031,,,,,,,,,
901,THE SALTED PATH,An honest and life-affirming true story of the...,R,drama,,,,,Ask: 450K,ROCKET SCIENCE,...,0.000032,,,,,,,,,
902,CONTROL,"Wallace Conway, a troubled doctor who increasi...",PG-13,thriller,drama,,"Lei las primeras 40 paginas, es excelente, atr...",,,STUDIOCANAL,...,,,,,,,,,,
903,CLIFFHANGER,Sylvester Stallone will reprise his character ...,PG-13,action,,,,,,ROCKET SCIENCE,...,0.000203,,,,,,0.000177,,,


In [60]:
# Create a pandas series called 'pred_set_plots' with the index as the movie's 'title' and the value as the movie's 'plot'
pred_set_plots = pd.Series(pred_set['plot'].values, index=pred_set['title'])
pred_set_keywords = extract_keywords(pred_set_plots)
pred_set_keywords.rename(columns={'jw_entity_id':'title'}, inplace=True)
pred_set_keywords

Extracting keywords from the plot of each of 905 movies...


905it [00:17, 52.84it/s]

Total number of keywords extracted: 15808





Unnamed: 0,keyword,node_weight,title,node_weight_normalized
0,bank,1.842171,WAY DOWN,0.060202
1,blueprints,0.150000,WAY DOWN,0.004902
2,maps,0.150000,WAY DOWN,0.004902
3,data,0.798623,WAY DOWN,0.026099
4,vault,1.865069,WAY DOWN,0.060950
...,...,...,...,...
15914,Inventor,1.030104,MUSK,0.147793
15915,Blood,0.787500,MUSK,0.112986
15916,sister,0.868368,MUSK,0.124588
15917,company,0.868368,MUSK,0.124588


In [61]:
# Create a dataframe called 'pred_set_keywords_scored' which is the inner merge of 'pred_set_keywords' and 'scored_keywords' on 'keyword'
pred_set_keywords_scored = pred_set_keywords.drop('node_weight', axis=1).merge(scored_keywords[['keyword', 'country', 'node_weight_scored_by_keyword_and_country']], on=['keyword'], how='inner')
pred_set_keywords_scored['node_weight_scored_by_keyword_and_country'] = pred_set_keywords_scored['node_weight_scored_by_keyword_and_country'] * pred_set_keywords_scored['node_weight_normalized']
pred_set_keywords_scored.drop(columns=['node_weight_normalized'], inplace=True)
pred_set_keywords_scored.drop_duplicates(inplace=True)
pred_set_keywords_scored

Unnamed: 0,keyword,title,country,node_weight_scored_by_keyword_and_country
0,bank,WAY DOWN,Argentina,0.085108
13,bank,WAY DOWN,Brazil,0.141587
25,bank,WAY DOWN,Chile,0.071919
36,bank,WAY DOWN,Colombia,0.068864
47,bank,WAY DOWN,Czech Republic,0.033124
...,...,...,...,...
6168742,Blood,MUSK,Czech Republic,0.018831
6168743,Blood,MUSK,Hungary,0.018831
6168744,Blood,MUSK,Indonesia,0.075324
6168745,Blood,MUSK,Romania,0.922718


In [62]:
# Create a dataframe called 'pred_set_plots_scored' which is the sum of 'node_weight_scored_by_keyword_and_country' grouped by 'title' and 'country'
pred_set_plots_scored = pred_set_keywords_scored.groupby(['title', 'country'])['node_weight_scored_by_keyword_and_country'].sum().reset_index()
pred_set_plots_scored

Unnamed: 0,title,country,node_weight_scored_by_keyword_and_country
0,10 Lives,Argentina,5.331058
1,10 Lives,Austria,5.703408
2,10 Lives,Belgium,3.989301
3,10 Lives,Brazil,4.191144
4,10 Lives,Canada,4.020865
...,...,...,...
32324,ZOYA,Thailand,17.698574
32325,ZOYA,Turkey,0.502802
32326,ZOYA,United Kingdom,3.809553
32327,ZOYA,United States,6.211883
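
The last two cells can be summarized as: match each keyword of a new plot to its historical score per country, weight that score by the keyword's normalized importance within the plot, and sum per title and country. The sketch below reproduces the arithmetic on toy values. It assumes one historical row per keyword and country; the notebook's `scored_keywords` table may hold several, which is presumably why it drops duplicates after the merge.

```python
import pandas as pd

# Keywords of a new plot with weights normalized within the title (toy values).
new_keywords = pd.DataFrame({
    "title": ["PROJECT X", "PROJECT X", "PROJECT X"],
    "keyword": ["bank", "vault", "unseen"],
    "node_weight_normalized": [0.5, 0.25, 0.25],
})
# Historical keyword scores per country (toy values, hypothetical column name 'score').
history = pd.DataFrame({
    "keyword": ["bank", "bank", "vault"],
    "country": ["Mexico", "Brazil", "Mexico"],
    "score": [4.0, 2.0, 8.0],
})

# The inner merge drops keywords that never appeared in historical plots ('unseen').
scored = new_keywords.merge(history, on="keyword", how="inner")
scored["weighted"] = scored["score"] * scored["node_weight_normalized"]

plot_scores = scored.groupby(["title", "country"])["weighted"].sum().reset_index()
print(plot_scores)
```

For Mexico the score is 4.0 × 0.5 + 8.0 × 0.25 = 4.0, and for Brazil it is 2.0 × 0.5 = 1.0. Keywords with no history contribute nothing, so plots with mostly unseen vocabulary receive low scores by construction.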


In [63]:
# Chaining the filter, drop and rename returns a new dataframe and avoids the SettingWithCopyWarning
pred_set_plots_scored = (
    pred_set_plots_scored[pred_set_plots_scored['country'] == 'Mexico']
    .drop(columns=['country'])
    .rename(columns={'node_weight_scored_by_keyword_and_country': 'plot_score'})
)
pred_set_plots_scored


Unnamed: 0,title,plot_score
20,10 Lives,5.415733
57,100 MINUTES,5.271929
94,2 WIN,1.067359
131,2067,5.689762
168,3 DAYS IN MALAY,2.735496
...,...,...
32165,YOUTH,15.430795
32202,You Can't Run Forever,13.012829
32238,ZERO CONTACT,7.621279
32275,ZEROS AND ONES,6.770530


In [64]:
# Merge the 'pred_set' dataframe with the 'pred_set_plots_scored' dataframe on the 'title' column
pred_set = pred_set.merge(pred_set_plots_scored, on='title', how='left')
pred_set

Unnamed: 0,title,plot,age_certification,genre_1,genre_2,genre_3,comentarios_vivi,budget,ask,sales,...,prank_cent_CO_PRODUCER,prank_cent_DIRECTOR,prank_cent_EDITOR,prank_cent_EXECUTIVE_PRODUCER,prank_cent_Elijah Wood,prank_cent_PRODUCER,prank_cent_PRODUCTION_DESIGN,prank_cent_SCREENPLAY,prank_cent_WRITER,plot_score
0,WAY DOWN,The Bank of Spain is like no other. An absolut...,PG-13,action,thriller,,,,,TF1,...,,0.000036,,,,0.000086,,,,3.516045
1,THE GOOD BOSS,It’s a sharp and nuanced dark comedy about the...,PG-13,comedy,drama,,"Es una comedia negra, por momentos bastante di...",3000000,250000,MK2,...,,,,,,,,,,0.243460
2,LOIS WAIN,Louis Wain: Unconventional while iconic. Candi...,R,drama,,,,,,,...,,,,,,,,,,21.915715
3,RIO,"Set against the exotic backdrop of Brazil, thi...",R,thriller,action,,,,,STUDIOCANAL,...,,,,,,0.000077,,,,4.069250
4,EMILY,Emily (Emma Mackey) wears a mask. The world te...,R,drama,,,Tipica historia tipo Pride and Prejudice. Se p...,Budget £8m,Asking para Latam: US$600k.,EMBANKMENT,...,,,,,,0.000026,,,,8.052318
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
900,CINNAMON,This darkly comedic heist thriller follows asp...,PG-13,thriller,comedy,,,,Ask: 75K,VILLAGE ROADSHOW,...,,,,,,,,,,9.628737
901,THE SALTED PATH,An honest and life-affirming true story of the...,R,drama,,,,,Ask: 450K,ROCKET SCIENCE,...,,,,,,,,,,23.672776
902,CONTROL,"Wallace Conway, a troubled doctor who increasi...",PG-13,thriller,drama,,"Lei las primeras 40 paginas, es excelente, atr...",,,STUDIOCANAL,...,,,,,,,,,,12.196133
903,CLIFFHANGER,Sylvester Stallone will reprise his character ...,PG-13,action,,,,,,ROCKET SCIENCE,...,,,,,,0.000177,,,,0.047428
