# Box Office Revenue Predictive Models

### Data Wrangling Notebook


The purpose of this project is to get more practice building and evaluating my own models, in this case models that predict a movie's revenue as a function of the features associated with it.

These models will be evaluated against relevant KPIs (R-squared, Mean Absolute Error, MAE^2), and the best model will be used to predict revenue for movies in a hypothetical world where theaters stayed open through 2020 and 2021.
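As a quick illustration of the first two metrics, here is a minimal sketch that computes Mean Absolute Error and R-squared with numpy. The function names and the toy values are made up for this example and are not taken from the Kaggle data.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Toy values (hypothetical), only to show how the functions are called
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 10.0]

mae = mean_absolute_error(actual, predicted)
r2 = r_squared(actual, predicted)
print(mae, r2)
```

A lower MAE and an R-squared closer to 1 both indicate predictions that sit closer to the actual values.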

This notebook's purpose is to prepare the data set(s) for exploratory analysis in a subsequent notebook. 

In [82]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import os
import ast

**Load Box office data**

Here I load in both .csv files with the box-office data. The data was pre-split for a Kaggle competition years ago. In this instance I chose to merge both files into a single dataframe, and later I will use scikit-learn to create a new train/test split for modeling.

Data Source: 

Kaggle. (May 2019). TMDB Box Office Prediction, V1. Retrieved 04/16/2021 from https://www.kaggle.com/c/tmdb-box-office-prediction/data.

In [78]:
boxoffice = pd.read_csv(r'C:\Users\deann\Documents\Data\Box Office Prediction Data\train.csv')

pd.set_option("display.max_colwidth", 20)

boxoffice.head()

Unnamed: 0,id,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,original_title,overview,popularity,...,release_date,runtime,spoken_languages,status,tagline,title,Keywords,cast,crew,revenue
0,1,"[{'id': 313576, ...",14000000,"[{'id': 35, 'nam...",,tt2637294,en,Hot Tub Time Mac...,"When Lou, who ha...",6.58,...,2/20/15,93.0,[{'iso_639_1': '...,Released,The Laws of Spac...,Hot Tub Time Mac...,"[{'id': 4379, 'n...","[{'cast_id': 4, ...",[{'credit_id': '...,12314651
1,2,"[{'id': 107674, ...",40000000,"[{'id': 35, 'nam...",,tt0368933,en,The Princess Dia...,Mia Thermopolis ...,8.25,...,8/6/04,113.0,[{'iso_639_1': '...,Released,It can take a li...,The Princess Dia...,"[{'id': 2505, 'n...","[{'cast_id': 1, ...",[{'credit_id': '...,95149435
2,3,,3300000,"[{'id': 18, 'nam...",http://sonyclass...,tt2582802,en,Whiplash,Under the direct...,64.3,...,10/10/14,105.0,[{'iso_639_1': '...,Released,The road to grea...,Whiplash,"[{'id': 1416, 'n...","[{'cast_id': 5, ...",[{'credit_id': '...,13092000
3,4,,1200000,"[{'id': 53, 'nam...",http://kahaanith...,tt1821480,hi,Kahaani,Vidya Bagchi (Vi...,3.17,...,3/9/12,122.0,[{'iso_639_1': '...,Released,,Kahaani,"[{'id': 10092, '...","[{'cast_id': 1, ...",[{'credit_id': '...,16000000
4,5,,0,"[{'id': 28, 'nam...",,tt1380152,ko,마린보이,Marine Boy is th...,1.15,...,2/5/09,118.0,[{'iso_639_1': '...,Released,,Marine Boy,,"[{'cast_id': 3, ...",[{'credit_id': '...,3923970
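The cell above reads only train.csv. A minimal sketch of how the two files could be merged into one dataframe is shown below. The tiny in-memory frames stand in for train.csv and test.csv (the test values are invented), and the helper name `combine_splits` and the `source` column are assumptions added for illustration.

```python
import pandas as pd

def combine_splits(train, test):
    """Stack the two Kaggle splits; rows from the test split get NaN revenue."""
    train = train.assign(source="train")   # hypothetical column tracking each row's origin
    test = test.assign(source="test")
    return pd.concat([train, test], ignore_index=True, sort=False)

# Stand-ins for pd.read_csv(...) on the real files
train_demo = pd.DataFrame({
    "id": [1, 2],
    "budget": [14000000, 40000000],
    "revenue": [12314651, 95149435],
})
test_demo = pd.DataFrame({
    "id": [3001, 3002],
    "budget": [0, 88000],
})

combined = combine_splits(train_demo, test_demo)
print(combined)
```

Keeping a `source` column is one way to separate the rows again later, since the test split has no revenue to learn from.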


Pulling up basic summary statistics for both .csv files:

In [77]:
boxoffice.describe()

Unnamed: 0,id,budget,popularity,runtime,revenue
count,3000.0,3000.0,3000.0,2998.0,3000.00
mean,1500.5,22531334.11,8.46,107.86,66725851.89
std,866.17,37026086.41,12.1,22.09,137532326.34
min,1.0,0.0,0.0,0.0,1.00
25%,750.75,0.0,4.02,94.0,2379808.25
50%,1500.5,8000000.0,7.37,104.0,16807068.00
75%,2250.25,29000000.0,10.89,118.0,68919203.50
max,3000.0,380000000.0,294.34,338.0,"1,519,557,91..."


Here I can see that this dataset has few numeric features. Determining what other data types I'll be working with is going to be important before going too far with the data.

I'm also going to turn off scientific notation in pandas for this dataset. The maximum numeric value in the entire set is about 1.5 billion, which takes more characters to write in scientific notation than without it. In a different scenario the notation might be helpful, but since it's not something I use regularly I'll remove it for the time being to make the data more readable.

In [5]:
pd.options.display.float_format = '{:,.2f}'.format

boxoffice.describe()

Unnamed: 0,id,budget,popularity,runtime,revenue
count,7398.0,7398.0,7398.0,7392.0,3000.0
mean,3699.5,22601457.78,8.51,107.72,66725851.89
std,2135.76,36948673.26,12.17,21.48,137532326.34
min,1.0,0.0,0.0,0.0,1.0
25%,1850.25,0.0,3.93,94.0,2379808.25
50%,3699.5,7500000.0,7.44,104.0,16807068.0
75%,5548.75,28000000.0,10.92,118.0,68919203.5
max,7398.0,380000000.0,547.49,338.0,1519557910.0


While removing scientific notation does make this more readable, the floats have an unnecessary degree of precision, and I'd like to include comma separators for more readability, so I'll adjust that pandas setting here as well.
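To see what the format string does on its own, here is a small sketch using plain Python string formatting. The value is the maximum revenue from the summary table; the variable names are just for illustration.

```python
value = 1519557910.0  # maximum revenue from the describe() output

scientific = "{:.6e}".format(value)    # scientific notation with six decimals
with_commas = "{:,.2f}".format(value)  # comma separators, two decimal places
whole = "{:,.0f}".format(value)        # comma separators, no decimals

print(scientific)
print(with_commas)
print(whole)
```

The second format string is the one handed to `pd.options.display.float_format`, which applies it to every float that pandas displays.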

It's time to take a look at the first portion of the combined data frame as well. 

In [79]:
pd.options.display.float_format = '{:,.2f}'.format  # show floats with comma separators and two decimal places

boxoffice.head(20)     

Unnamed: 0,id,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,original_title,overview,popularity,...,release_date,runtime,spoken_languages,status,tagline,title,Keywords,cast,crew,revenue
0,1,"[{'id': 313576, ...",14000000,"[{'id': 35, 'nam...",,tt2637294,en,Hot Tub Time Mac...,"When Lou, who ha...",6.58,...,2/20/15,93.0,[{'iso_639_1': '...,Released,The Laws of Spac...,Hot Tub Time Mac...,"[{'id': 4379, 'n...","[{'cast_id': 4, ...",[{'credit_id': '...,12314651
1,2,"[{'id': 107674, ...",40000000,"[{'id': 35, 'nam...",,tt0368933,en,The Princess Dia...,Mia Thermopolis ...,8.25,...,8/6/04,113.0,[{'iso_639_1': '...,Released,It can take a li...,The Princess Dia...,"[{'id': 2505, 'n...","[{'cast_id': 1, ...",[{'credit_id': '...,95149435
2,3,,3300000,"[{'id': 18, 'nam...",http://sonyclass...,tt2582802,en,Whiplash,Under the direct...,64.3,...,10/10/14,105.0,[{'iso_639_1': '...,Released,The road to grea...,Whiplash,"[{'id': 1416, 'n...","[{'cast_id': 5, ...",[{'credit_id': '...,13092000
3,4,,1200000,"[{'id': 53, 'nam...",http://kahaanith...,tt1821480,hi,Kahaani,Vidya Bagchi (Vi...,3.17,...,3/9/12,122.0,[{'iso_639_1': '...,Released,,Kahaani,"[{'id': 10092, '...","[{'cast_id': 1, ...",[{'credit_id': '...,16000000
4,5,,0,"[{'id': 28, 'nam...",,tt1380152,ko,마린보이,Marine Boy is th...,1.15,...,2/5/09,118.0,[{'iso_639_1': '...,Released,,Marine Boy,,"[{'cast_id': 3, ...",[{'credit_id': '...,3923970
5,6,,8000000,"[{'id': 16, 'nam...",,tt0093743,en,Pinocchio and th...,Pinocchio and hi...,0.74,...,8/6/87,83.0,[{'iso_639_1': '...,Released,,Pinocchio and th...,,"[{'cast_id': 6, ...",[{'credit_id': '...,3261638
6,7,,14000000,"[{'id': 27, 'nam...",http://www.thepo...,tt0431021,en,The Possession,A young girl buy...,7.29,...,8/30/12,92.0,[{'iso_639_1': '...,Released,Fear The Demon T...,The Possession,,"[{'cast_id': 23,...",[{'credit_id': '...,85446075
7,8,,0,"[{'id': 99, 'nam...",,tt0391024,en,Control Room,A chronicle whic...,1.95,...,1/15/04,84.0,[{'iso_639_1': '...,Released,Different channe...,Control Room,"[{'id': 917, 'na...","[{'cast_id': 2, ...",[{'credit_id': '...,2586511
8,9,"[{'id': 256377, ...",0,"[{'id': 28, 'nam...",,tt0117110,en,Muppet Treasure ...,After telling th...,6.9,...,2/16/96,100.0,[{'iso_639_1': '...,Released,Set sail for Mup...,Muppet Treasure ...,"[{'id': 2041, 'n...","[{'cast_id': 1, ...",[{'credit_id': '...,34327391
9,10,,6000000,"[{'id': 35, 'nam...",,tt0310281,en,A Mighty Wind,"In ""A Mighty Win...",4.67,...,4/16/03,91.0,[{'iso_639_1': '...,Released,Back together fo...,A Mighty Wind,"[{'id': 11800, '...","[{'cast_id': 24,...",[{'credit_id': '...,18750246


***Dictionary Problem***

It seems that there are several columns whose values look like dictionaries: ***Cast, Crew, Keywords, Genres, 'Belongs to collection', and 'Spoken Languages'***.  All of these seem to be features which could conceivably contribute to or indicate revenue potential for a movie, either by providing exposure (casting Tom Hanks as your lead), quality production value (bringing in Steven Spielberg as your EP), or expanding/adopting an existing market (action movies, zombie films, existing franchises like Star Wars, or a new country like China).

For now I'm going to leave these in as-is. Later on in this notebook, however, I would like to be able to identify whether a given collection, genre, actor, producer, or language is associated with higher revenue.  This will likely be served best by transforming these dictionaries into lists, but I would like to do some more data cleaning before taking that step.

***Missing Values:***

In this section I'll take a look at missing values for the dataset and determine which, if any, columns to drop from the dataset.  

In [7]:
missing = pd.concat([boxoffice.isnull().sum(),100*boxoffice.isnull().mean()], axis=1)
missing.columns = ['Count', '% Missing']
missing.sort_values(by='% Missing', ascending=False) #creates a data frame displaying missing values by column

Unnamed: 0,Count,% Missing
belongs_to_collection,5917,79.98
homepage,5032,68.02
revenue,4398,59.45
tagline,1460,19.74
Keywords,669,9.04
production_companies,414,5.6
production_countries,157,2.12
spoken_languages,62,0.84
crew,38,0.51
cast,26,0.35


Based on the above it seems that missing values will not be a large impediment for this project.  

The test data set with 4398 rows does not have any revenue information (this is what we'll be predicting), so those missing values were expected.  It is worth noting that this will alter the data I can work with for exploratory analysis and it may be prudent to import revenue data for those films at the end of this notebook. 

There are significant missing values for both ***'Belongs to collection' and 'homepage'***, with 80% and 68% missing respectively.  These columns definitely warrant further review.

The 'tagline' and 'Keywords' columns are missing less than 20% of their data and for now this seems workable.  Previously I've built accurate models with more data missing than that so I'll leave these columns alone for now.  
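The 20% cutoff can also be checked programmatically. Below is a minimal sketch, assuming a hypothetical helper `columns_over_threshold` and a small made-up frame in place of the real data.

```python
import pandas as pd

def columns_over_threshold(df, threshold=0.20):
    """Return the names of columns whose missing fraction is at or above threshold."""
    missing_fraction = df.isnull().mean()
    return missing_fraction[missing_fraction >= threshold].index.tolist()

# Made-up frame: 'homepage' is 70% missing, 'tagline' 10%, 'runtime' 0%
demo = pd.DataFrame({
    "homepage": [None] * 7 + ["http://example.com"] * 3,
    "tagline": [None] * 1 + ["some tagline"] * 9,
    "runtime": [100.0] * 10,
})

flagged = columns_over_threshold(demo)
print(flagged)
```

Returning the column names rather than dropping them right away leaves room to review each flagged column before deciding what to do with it.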

***Dealing with 'belongs_to_collection'***

The column *boxoffice['belongs_to_collection']* has the largest number of missing values.  I need to determine what to do with this column, potentially turning it into a boolean value representing whether the film belongs to a collection at all.

Additionally this is a good opportunity to take a look at values which are lists or dictionaries in this data set since that will present interesting opportunities for analysis. 

In [80]:
has_collection = boxoffice['belongs_to_collection'].notnull()

boxoffice['belongs_to_collection'][has_collection]   #I want to look at only actual values to better understand what they are

0       [{'id': 313576, ...
1       [{'id': 107674, ...
8       [{'id': 256377, ...
10      [{'id': 1575, 'n...
11      [{'id': 48190, '...
               ...         
2967    [{'id': 387219, ...
2968    [{'id': 97307, '...
2974    [{'id': 149704, ...
2984    [{'id': 221111, ...
2991    [{'id': 107469, ...
Name: belongs_to_collection, Length: 604, dtype: object

In [81]:
pd.set_option("display.max_colwidth", None)    #updating pandas setting so that columns aren't truncated

boxoffice['belongs_to_collection'][has_collection]

0       [{'id': 313576, 'name': 'Hot Tub Time Machine Collection', 'poster_path': '/iEhb00TGPucF0b4joM1ieyY026U.jpg', 'backdrop_path': '/noeTVcgpBiD48fDjFVic1Vz7ope.jpg'}]
1       [{'id': 107674, 'name': 'The Princess Diaries Collection', 'poster_path': '/wt5AMbxPTS4Kfjx7Fgm149qPfZl.jpg', 'backdrop_path': '/zSEtYD77pKRJlUPx34BJgUG9v1c.jpg'}]
8                 [{'id': 256377, 'name': 'The Muppet Collection', 'poster_path': '/8Ew8EIdFFurMMYjSbWPu1Hl4vLX.jpg', 'backdrop_path': '/1AWd3MM90G47mxtD112gRDxSXY9.jpg'}]
10                       [{'id': 1575, 'name': 'Rocky Collection', 'poster_path': '/mCY5dMkSSFQufGCViI6jNUU6pXq.jpg', 'backdrop_path': '/w4h6gjdWPvmu5R9H6zeGDPo1ZuV.jpg'}]
11                                     [{'id': 48190, 'name': 'Revenge of the Nerds Collection', 'poster_path': '/qOnoXEdrSnBuS3FMAFRIgyJSM2r.jpg', 'backdrop_path': None}]
                                                                                       ...                                                  

Now I can see that each value is a list containing a dictionary with the ID, name, path to poster, and path to backdrop for each collection.

While it's possible that the backdrop and/or poster could have an impact on revenue for marketing reasons, I would need to assess them using computer vision, which is not in the scope of this project.

For now I'm going to replace this dictionary with only the name of each collection since the other values aren't ones that I want to work with.

In [83]:
print(type(boxoffice.loc[0, 'belongs_to_collection']))


solution = boxoffice['belongs_to_collection'].fillna('NA')  #fills nan values with NA

<class 'str'>


***Extracting film collection names from the 'belongs_to_collection' column***

After a lot of tinkering I discovered that individual values in this column are lists of dictionaries stored as strings.  This was a huge problem, and after consulting with some colleagues I came up with the following solution, which extracts a numpy array holding the name of each collection, or 'NA' if there is no collection.

For the other columns like this (Genre, Language Spoken, Cast) I'll need to write an additional for loop to iterate through the list of dictionaries and pull out each of the names as well. 

In [84]:
names = np.array([ast.literal_eval(item)[0].get('name', 'NA') if item != 'NA' else 'NA' for item in solution.values])

names[0]       #checks that the collection name for the first film was extracted successfully

'Hot Tub Time Machine Collection'

I'm able to use the parsing library ast to read each string, take the first dictionary in the list, then extract the value representing the name of each film collection.  I'll need to replicate this at a more advanced level to extract the names for all spoken languages, genres, cast, and crew members.
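For the columns that can hold more than one dictionary per row, the same idea can be extended to collect every name. The sketch below is one possible approach; the helper name `extract_names` and the sample string are assumptions for illustration, not values copied from the dataset.

```python
import ast

def extract_names(raw, key="name"):
    """Parse a stringified list of dicts and return every value stored under key."""
    # Missing entries show up as NaN (a float) or as the 'NA' placeholder string
    if not isinstance(raw, str) or raw == "NA":
        return []
    records = ast.literal_eval(raw)
    return [record[key] for record in records if key in record]

# Made-up sample in the same shape as the 'genres' column
sample = "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}]"

genres = extract_names(sample)
print(genres)
```

Applied to a whole column, this could look like `boxoffice['genres'].apply(extract_names)`, which would give one list of names per film.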

In [87]:
boxoffice['belongs_to_collection'] = names    #reassigning the 'belongs_to_collection' column to the array of names

boxoffice['belongs_to_collection']

0       Hot Tub Time Machine Collection
1       The Princess Diaries Collection
2                                    NA
3                                    NA
4                                    NA
                     ...               
2995                                 NA
2996                                 NA
2997                                 NA
2998                                 NA
2999                                 NA
Name: belongs_to_collection, Length: 3000, dtype: object

In [88]:
boxoffice['collectionbool'] = has_collection #create a column of boolean values identifying if a movie is in a collection at all

list(boxoffice.columns)

['id',
 'belongs_to_collection',
 'budget',
 'genres',
 'homepage',
 'imdb_id',
 'original_language',
 'original_title',
 'overview',
 'popularity',
 'poster_path',
 'production_companies',
 'production_countries',
 'release_date',
 'runtime',
 'spoken_languages',
 'status',
 'tagline',
 'title',
 'Keywords',
 'cast',
 'crew',
 'revenue',
 'collectionbool']

In [90]:
pd.set_option("display.max_colwidth", 20)

boxoffice.head()

Unnamed: 0,id,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,original_title,overview,popularity,...,runtime,spoken_languages,status,tagline,title,Keywords,cast,crew,revenue,collectionbool
0,1,Hot Tub Time Mac...,14000000,"[{'id': 35, 'nam...",,tt2637294,en,Hot Tub Time Mac...,"When Lou, who ha...",6.58,...,93.0,[{'iso_639_1': '...,Released,The Laws of Spac...,Hot Tub Time Mac...,"[{'id': 4379, 'n...","[{'cast_id': 4, ...",[{'credit_id': '...,12314651,True
1,2,The Princess Dia...,40000000,"[{'id': 35, 'nam...",,tt0368933,en,The Princess Dia...,Mia Thermopolis ...,8.25,...,113.0,[{'iso_639_1': '...,Released,It can take a li...,The Princess Dia...,"[{'id': 2505, 'n...","[{'cast_id': 1, ...",[{'credit_id': '...,95149435,True
2,3,,3300000,"[{'id': 18, 'nam...",http://sonyclass...,tt2582802,en,Whiplash,Under the direct...,64.3,...,105.0,[{'iso_639_1': '...,Released,The road to grea...,Whiplash,"[{'id': 1416, 'n...","[{'cast_id': 5, ...",[{'credit_id': '...,13092000,False
3,4,,1200000,"[{'id': 53, 'nam...",http://kahaanith...,tt1821480,hi,Kahaani,Vidya Bagchi (Vi...,3.17,...,122.0,[{'iso_639_1': '...,Released,,Kahaani,"[{'id': 10092, '...","[{'cast_id': 1, ...",[{'credit_id': '...,16000000,False
4,5,,0,"[{'id': 28, 'nam...",,tt1380152,ko,마린보이,Marine Boy is th...,1.15,...,118.0,[{'iso_639_1': '...,Released,,Marine Boy,,"[{'cast_id': 3, ...",[{'credit_id': '...,3923970,False


### Summary

- In the data provided, "Belongs to collection" and "homepage" are missing 80% and 68% of their values, significantly more than any other column in the set.

- I decided to keep all columns with less than 20% of their data missing.

- I added a new column, boxoffice.collectionbool, that identifies whether a film is part of a collection.
