# Box Office Revenue Predictive Models

### Data Wrangling Notebook


The purpose of this project is to practice building and evaluating models that predict a movie's revenue as a function of features associated with it.

These models will be evaluated against relevant KPIs (R-squared, Mean Absolute Error, MAE^2), and the best model will be used to predict revenue for movies in a hypothetical world where theaters stayed open in 2020 and 2021.

This notebook's purpose is to prepare the data set(s) for exploratory analysis in a subsequent notebook. 

In [4]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import os

**Load Box office data**

Here I load both .csv files containing the box office data. The data was pre-split for a Kaggle competition years ago. In this instance I chose to merge both files into a single dataframe; later I will use scikit-learn to create a new train/test split for modeling.

Data Source: 

Kaggle. (May 2019). TMDB Box Office Prediction, V1. Retrieved 04/16/2021 from https://www.kaggle.com/c/tmdb-box-office-prediction/data.

In [5]:
box_office_train_data = pd.read_csv(r'C:\Users\deann\Documents\Data\Box Office Prediction Data\train.csv')
box_office_test_data = pd.read_csv(r'C:\Users\deann\Documents\Data\Box Office Prediction Data\test.csv')

box_office_train_data.head()

Unnamed: 0,id,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,original_title,overview,popularity,...,release_date,runtime,spoken_languages,status,tagline,title,Keywords,cast,crew,revenue
0,1,"[{'id': 313576, 'name': 'Hot Tub Time Machine ...",14000000,"[{'id': 35, 'name': 'Comedy'}]",,tt2637294,en,Hot Tub Time Machine 2,"When Lou, who has become the ""father of the In...",6.575393,...,2/20/15,93.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,The Laws of Space and Time are About to be Vio...,Hot Tub Time Machine 2,"[{'id': 4379, 'name': 'time travel'}, {'id': 9...","[{'cast_id': 4, 'character': 'Lou', 'credit_id...","[{'credit_id': '59ac067c92514107af02c8c8', 'de...",12314651
1,2,"[{'id': 107674, 'name': 'The Princess Diaries ...",40000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,tt0368933,en,The Princess Diaries 2: Royal Engagement,Mia Thermopolis is now a college graduate and ...,8.248895,...,8/6/04,113.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,It can take a lifetime to find true love; she'...,The Princess Diaries 2: Royal Engagement,"[{'id': 2505, 'name': 'coronation'}, {'id': 42...","[{'cast_id': 1, 'character': 'Mia Thermopolis'...","[{'credit_id': '52fe43fe9251416c7502563d', 'de...",95149435
2,3,,3300000,"[{'id': 18, 'name': 'Drama'}]",http://sonyclassics.com/whiplash/,tt2582802,en,Whiplash,"Under the direction of a ruthless instructor, ...",64.29999,...,10/10/14,105.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,The road to greatness can take you to the edge.,Whiplash,"[{'id': 1416, 'name': 'jazz'}, {'id': 1523, 'n...","[{'cast_id': 5, 'character': 'Andrew Neimann',...","[{'credit_id': '54d5356ec3a3683ba0000039', 'de...",13092000
3,4,,1200000,"[{'id': 53, 'name': 'Thriller'}, {'id': 18, 'n...",http://kahaanithefilm.com/,tt1821480,hi,Kahaani,Vidya Bagchi (Vidya Balan) arrives in Kolkata ...,3.174936,...,3/9/12,122.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,,Kahaani,"[{'id': 10092, 'name': 'mystery'}, {'id': 1054...","[{'cast_id': 1, 'character': 'Vidya Bagchi', '...","[{'credit_id': '52fe48779251416c9108d6eb', 'de...",16000000
4,5,,0,"[{'id': 28, 'name': 'Action'}, {'id': 53, 'nam...",,tt1380152,ko,마린보이,Marine Boy is the story of a former national s...,1.14807,...,2/5/09,118.0,"[{'iso_639_1': 'ko', 'name': '한국어/조선말'}]",Released,,Marine Boy,,"[{'cast_id': 3, 'character': 'Chun-soo', 'cred...","[{'credit_id': '52fe464b9251416c75073b43', 'de...",3923970


Pulling up basic summary statistics for both .csv files:

In [6]:
box_office_train_data.describe()

Unnamed: 0,id,budget,popularity,runtime,revenue
count,3000.0,3000.0,3000.0,2998.0,3000.0
mean,1500.5,22531330.0,8.463274,107.856571,66725850.0
std,866.169729,37026090.0,12.104,22.086434,137532300.0
min,1.0,0.0,1e-06,0.0,1.0
25%,750.75,0.0,4.018053,94.0,2379808.0
50%,1500.5,8000000.0,7.374861,104.0,16807070.0
75%,2250.25,29000000.0,10.890983,118.0,68919200.0
max,3000.0,380000000.0,294.337037,338.0,1519558000.0


In [7]:
box_office_test_data.describe()

Unnamed: 0,id,budget,popularity,runtime
count,4398.0,4398.0,4398.0,4394.0
mean,5199.5,22649290.0,8.55023,107.622212
std,1269.737571,36899910.0,12.209014,21.05829
min,3001.0,0.0,1e-06,0.0
25%,4100.25,0.0,3.895186,94.0
50%,5199.5,7450000.0,7.482241,104.0
75%,6298.75,28000000.0,10.938524,118.0
max,7398.0,260000000.0,547.488298,320.0


Here I can see that this dataset has few numeric features. Determining what other data types I'll be working with is going to be important before going too far with the data.

At this point it seems that the train and test data provided really should be combined into a single data frame. There isn't a compelling reason to keep the data partitioned while I go through data wrangling and EDA. I can separate out the rows without revenue data to recreate my train/test split later on. That work is significantly less than duplicating all of the data wrangling and exploratory analysis.

I'm also going to turn off scientific notation in pandas for this dataset. The largest numeric value in the entire set is a revenue of about 1.5 billion, which takes more characters to write in scientific notation than without. In a different scenario the notation might be helpful, but since it's not something I use regularly I'll remove it for the time being to make my data more readable.
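As a minimal sketch of how the original partition could be recovered later, the rows can be split on whether `revenue` is present. The toy frame and the names `combined`, `labeled`, and `unlabeled` are illustrative assumptions, and the approach assumes revenue is missing only for rows that came from test.csv.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the combined frame (hypothetical subset of columns and rows).
combined = pd.DataFrame({
    "id": [1, 2, 3001, 3002],
    "budget": [14000000, 40000000, 0, 88000],
    "revenue": [12314651.0, 95149435.0, np.nan, np.nan],
})

# Assumption: a row has revenue if and only if it came from train.csv.
has_revenue = combined["revenue"].notna()
labeled = combined[has_revenue].reset_index(drop=True)     # usable for modeling
unlabeled = combined[~has_revenue].reset_index(drop=True)  # rows to predict

print(len(labeled), len(unlabeled))
```

Resetting the index on each piece avoids carrying over the duplicate index labels that a plain concatenation of two frames produces.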

In [14]:
boxoffice = pd.concat([box_office_test_data, box_office_train_data])

pd.set_option('display.float_format', lambda x: '%.5f' % x)

boxoffice.describe()

Unnamed: 0,id,budget,popularity,runtime,revenue
count,7398.0,7398.0,7398.0,7392.0,3000.0
mean,3699.5,22601457.78102,8.51497,107.71726,66725851.88867
std,2135.76298,36948673.26211,12.16579,21.48004,137532326.33602
min,1.0,0.0,0.0,0.0,1.0
25%,1850.25,0.0,3.93312,94.0,2379808.25
50%,3699.5,7500000.0,7.43584,104.0,16807068.0
75%,5548.75,28000000.0,10.92,118.0,68919203.5
max,7398.0,380000000.0,547.4883,338.0,1519557910.0


While removing scientific notation does make this more readable, the floats have an unnecessary degree of precision, and I'd like to include comma separators for more readability, so I'll adjust that pandas setting here as well.

It's time to take a look at the first portion of the combined data frame as well. 
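To make the difference between the two display formats concrete, here is a small sketch that applies both format strings to a single number in plain Python. The value is the maximum revenue from the summary table above; the variable names are illustrative.

```python
value = 1519557910.0  # largest revenue in the combined summary statistics

# Fixed-point with five decimals, no thousands separator.
fixed = '%.5f' % value

# Two decimals with comma separators.
grouped = '{:,.2f}'.format(value)

print(fixed)    # 1519557910.00000
print(grouped)  # 1,519,557,910.00
```

Either callable can be handed to the pandas `display.float_format` option, since the option expects a function that takes a float and returns a string.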

In [15]:
pd.options.display.float_format = '{:,.2f}'.format

boxoffice.head(20)

Unnamed: 0,id,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,original_title,overview,popularity,...,release_date,runtime,spoken_languages,status,tagline,title,Keywords,cast,crew,revenue
0,3001,"[{'id': 34055, 'name': 'Pokémon Collection', '...",0,"[{'id': 12, 'name': 'Adventure'}, {'id': 16, '...",http://www.pokemon.com/us/movies/movie-pokemon...,tt1226251,ja,ディアルガVSパルキアVSダークライ,Ash and friends (this time accompanied by newc...,3.85,...,7/14/07,90.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Somewhere Between Time & Space... A Legend Is ...,Pokémon: The Rise of Darkrai,"[{'id': 11451, 'name': 'pok√©mon'}, {'id': 115...","[{'cast_id': 3, 'character': 'Tonio', 'credit_...","[{'credit_id': '52fe44e7c3a368484e03d683', 'de...",
1,3002,,88000,"[{'id': 27, 'name': 'Horror'}, {'id': 878, 'na...",,tt0051380,en,Attack of the 50 Foot Woman,When an abused wife grows to giant size becaus...,3.56,...,5/19/58,65.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,A titanic beauty spreads a macabre wave of hor...,Attack of the 50 Foot Woman,"[{'id': 9748, 'name': 'revenge'}, {'id': 9951,...","[{'cast_id': 2, 'character': 'Nancy Fowler Arc...","[{'credit_id': '55807805c3a3685b1300060b', 'de...",
2,3003,,0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10749, '...",,tt0118556,en,Addicted to Love,Good-natured astronomer Sam is devastated when...,8.09,...,5/23/97,100.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,A Comedy About Lost Loves And Last Laughs,Addicted to Love,"[{'id': 931, 'name': 'jealousy'}, {'id': 9673,...","[{'cast_id': 11, 'character': 'Maggie', 'credi...","[{'credit_id': '52fe4330c3a36847f8041367', 'de...",
3,3004,,6800000,"[{'id': 18, 'name': 'Drama'}, {'id': 10752, 'n...",http://www.sonyclassics.com/incendies/,tt1255953,fr,Incendies,A mother's last wishes send twins Jeanne and S...,8.6,...,9/4/10,130.0,"[{'iso_639_1': 'fr', 'name': 'Français'}, {'is...",Released,The search began at the opening of their mothe...,Incendies,"[{'id': 378, 'name': 'prison'}, {'id': 539, 'n...","[{'cast_id': 6, 'character': 'Nawal', 'credit_...","[{'credit_id': '56478092c3a36826140043af', 'de...",
4,3005,,2000000,"[{'id': 36, 'name': 'History'}, {'id': 99, 'na...",,tt0418753,en,Inside Deep Throat,"In 1972, a seemingly typical shoestring budget...",3.22,...,2/11/05,92.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,It was filmed in 6 days for 25 thousand dollar...,Inside Deep Throat,"[{'id': 279, 'name': 'usa'}, {'id': 1228, 'nam...","[{'cast_id': 1, 'character': 'Narrator (voice)...","[{'credit_id': '52fe44ce9251416c75041967', 'de...",
5,3006,,0,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,tt0120238,en,SubUrbia,A group of suburban teenagers try to support e...,8.68,...,2/7/96,121.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,SubUrbia,"[{'id': 10183, 'name': 'independent film'}]","[{'cast_id': 4, 'character': 'Pony', 'credit_i...","[{'credit_id': '52fe4576c3a368484e05c901', 'de...",
6,3007,,0,"[{'id': 10749, 'name': 'Romance'}, {'id': 18, ...",,tt1517177,de,Drei,Hanna and Simon are in a 20 year marriage with...,4.9,...,12/23/10,119.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Imagine the possibilities.,Three,"[{'id': 572, 'name': 'sex'}, {'id': 154937, 'n...","[{'cast_id': 2, 'character': 'Hanna', 'credit_...","[{'credit_id': '52fe485bc3a36847f816358d', 'de...",
7,3008,,30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 10751...",http://www.tigger.com,tt0220099,en,The Tigger Movie,"As it happens, everybody - Pooh, Piglet, Eeyor...",7.02,...,2/11/00,77.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,The Tigger Movie,"[{'id': 3905, 'name': 'owl'}, {'id': 4144, 'na...","[{'cast_id': 2, 'character': 'Tigger / Winnie ...","[{'credit_id': '59121da0c3a3686519043247', 'de...",
8,3009,,16500000,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",http://becomingjane-themovie.com/,tt0416508,en,Becoming Jane,A biographical portrait of a pre-fame Jane Aus...,7.83,...,3/2/07,120.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Her own life is her greatest inspiration.,Becoming Jane,"[{'id': 392, 'name': 'england'}, {'id': 934, '...","[{'cast_id': 10, 'character': 'Jane Austen', '...","[{'credit_id': '53569575c3a3687f54000051', 'de...",
9,3010,"[{'id': 10194, 'name': 'Toy Story Collection',...",90000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story-2,tt0120363,en,Toy Story 2,"Andy heads off to Cowboy Camp, leaving his toy...",17.55,...,10/30/99,92.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,The toys are back!,Toy Story 2,"[{'id': 2598, 'name': 'museum'}, {'id': 3246, ...","[{'cast_id': 18, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8025073', 'de...",


It seems that there are several columns whose values are lists of dictionaries rather than single values: ***Cast, Crew, Keywords, Genres, 'Belongs to collection', and 'Spoken Languages'***. All of these seem to be features which could conceivably contribute to or indicate revenue potential for a movie, either by providing exposure (casting Tom Hanks as your lead), quality production value (bringing in Steven Spielberg as your EP), or expanding/adopting an existing market (action movies, zombie films, existing franchises like Star Wars, or a new country like China).

For now I'm going to leave these as-is. Later in this notebook, however, I would like to identify whether a given collection, genre, actor, producer, or language is associated with higher revenue. That will likely be served best by transforming these dictionaries into lists. However, I would like to do some more data cleaning before taking that step.
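One possible way to do that transformation later is sketched below. It assumes the columns arrive from `read_csv` as strings holding Python-literal lists of dictionaries, which is how they appear in the preview above; `names_from_field` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import ast

# A genres value written out in full, in the same shape as the preview above.
raw_genres = "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}]"

def names_from_field(raw):
    """Return the 'name' of each dictionary in a stringified list (hypothetical helper)."""
    if not isinstance(raw, str):
        return []  # missing values arrive as NaN (a float), not as a string
    # literal_eval parses Python literals only, so it does not execute arbitrary code.
    return [item["name"] for item in ast.literal_eval(raw)]

print(names_from_field(raw_genres))
print(names_from_field(float("nan")))
```

The same helper could be passed to `Series.apply` for any of the dictionary-style columns, provided they all share the `'name'` key.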

***Missing Values:***

In this section I'll take a look at missing values for the dataset and determine which, if any, columns to drop from the dataset.  

In [24]:
missing = pd.concat([boxoffice.isnull().sum(),100*boxoffice.isnull().mean()], axis=1)
missing.columns = ['Count', '% Missing']
missing.sort_values(by='% Missing', ascending=False)

Unnamed: 0,Count,% Missing
belongs_to_collection,5917,79.98
homepage,5032,68.02
revenue,4398,59.45
tagline,1460,19.74
Keywords,669,9.04
production_companies,414,5.6
production_countries,157,2.12
spoken_languages,62,0.84
crew,38,0.51
cast,26,0.35


Based on the above it seems that missing values will not be a large impediment for this project.  

The test data set with 4398 rows does not have any revenue information (this is what we'll be predicting), so those missing values were expected.  It is worth noting that this will alter the data I can work with for exploratory analysis and it may be prudent to import revenue data for those films at the end of this notebook. 

There are significant missing values for both ***'Belongs to collection' and 'homepage'***, with 80% and 68% missing respectively. These columns definitely warrant further review.

The 'tagline' and 'Keywords' columns are missing less than 20% of their data and for now this seems workable.  Previously I've built accurate models with more data missing than that so I'll leave these columns alone for now.  
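The 20% cutoff can also be applied programmatically. The sketch below is one way to do it on a toy frame; the column values are made up for illustration, and `revenue` is excluded by hand because it is the prediction target and is expected to be missing for the test rows.

```python
import numpy as np
import pandas as pd

# Toy frame with ten rows; missing shares are 70%, 10%, and 60% respectively.
toy = pd.DataFrame({
    "homepage": [np.nan] * 7 + ["http://www.tigger.com"] * 3,
    "tagline": [np.nan] * 1 + ["The toys are back!"] * 9,
    "revenue": [np.nan] * 6 + [1.0] * 4,
})

# Share of missing values per column, as a fraction between 0 and 1.
missing_share = toy.isna().mean()

# Feature columns at or above the 20% cutoff, leaving out the target column.
needs_review = missing_share[missing_share >= 0.20].index.difference(["revenue"]).tolist()

print(needs_review)
```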

***Dealing with 'belongs_to_collection'***

This is the column with the largest number of missing values. I need to determine what to do with this column, potentially turning it into a boolean value representing whether the film belongs to a collection at all.

Additionally, this is a good opportunity to take a look at values which are lists or dictionaries in this data set, since they will present interesting opportunities for analysis.
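A minimal sketch of the boolean option, using a toy frame: the column name `in_collection` is an assumption for illustration, and the collection value is abbreviated relative to the real data.

```python
import numpy as np
import pandas as pd

# Toy frame: one film with a collection entry, two without.
toy = pd.DataFrame({
    "title": ["Toy Story 2", "Whiplash", "Kahaani"],
    "belongs_to_collection": [
        "[{'id': 10194, 'name': 'Toy Story Collection'}]",  # abbreviated value
        np.nan,
        np.nan,
    ],
})

# True where a collection entry exists, False where the value is missing.
toy["in_collection"] = toy["belongs_to_collection"].notna()

print(toy[["title", "in_collection"]])
```

A flag like this keeps the information that is present for every row (whether a collection exists) without needing to impute the roughly 80% of rows that have no collection.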

In [36]:
has_collection = boxoffice['belongs_to_collection'].notnull()

boxoffice['belongs_to_collection'][has_collection]   #I want to look at only actual values to better understand what they are

0       [{'id': 34055, 'name': 'Pokémon Collection', '...
9       [{'id': 10194, 'name': 'Toy Story Collection',...
12      [{'id': 87805, 'name': 'The Gods Must Be Crazy...
16      [{'id': 313576, 'name': 'Hot Tub Time Machine ...
28      [{'id': 409138, 'name': 'Yossi Collection', 'p...
                              ...                        
2967    [{'id': 387219, 'name': 'The Hustler Collectio...
2968    [{'id': 97307, 'name': 'BloodRayne Collection'...
2974    [{'id': 149704, 'name': 'Alone in the Dark Col...
2984    [{'id': 221111, 'name': 'S.W.A.T. Collection',...
2991    [{'id': 107469, 'name': 'Save The Last Dance C...
Name: belongs_to_collection, Length: 1481, dtype: object

In [38]:
pd.set_option("display.max_colwidth", None)    # update the pandas setting so that column values aren't truncated

boxoffice['belongs_to_collection'][has_collection]

0                      [{'id': 34055, 'name': 'Pokémon Collection', 'poster_path': '/j5te0YNZAMXDBnsqTUDKIBEt8iu.jpg', 'backdrop_path': '/iGoYKA0TFfgSoZpG2u5viTJMGfK.jpg'}]
9                    [{'id': 10194, 'name': 'Toy Story Collection', 'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg', 'backdrop_path': '/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg'}]
12      [{'id': 87805, 'name': 'The Gods Must Be Crazy Collection', 'poster_path': '/3KAJpE2OOimXE5Z15LHARbeA0eC.jpg', 'backdrop_path': '/3u6As0EKP2KKPNhzQObCgdYvm7f.jpg'}]
16       [{'id': 313576, 'name': 'Hot Tub Time Machine Collection', 'poster_path': '/iEhb00TGPucF0b4joM1ieyY026U.jpg', 'backdrop_path': '/noeTVcgpBiD48fDjFVic1Vz7ope.jpg'}]
28                       [{'id': 409138, 'name': 'Yossi Collection', 'poster_path': '/y30egmXtXggkHA2yfcDQcmM9D90.jpg', 'backdrop_path': '/w5u8mlzQqpVWcZlCgHzgzHOwGm.jpg'}]
                                                                                        ...                                            

Now I can see that each value is a list holding a single dictionary with the ID, name, path to poster, and path to backdrop for each collection.

While it's possible that the backdrop and/or poster could have an impact on revenue for marketing reasons, I would need to assess them using computer vision, which is not in the scope of this project.

For now I'm going to replace this dictionary with only the name of each collection. 
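A sketch of that replacement is shown below, under the assumption that each non-missing value is a string holding a one-element list of dictionaries, as in the output above. `collection_name` is a hypothetical helper, and the two sample strings are copied from that output.

```python
import ast

import numpy as np
import pandas as pd

def collection_name(raw):
    """Return the collection's name, or NaN when there is no collection (hypothetical helper)."""
    if not isinstance(raw, str):
        return np.nan  # missing values stay missing
    parsed = ast.literal_eval(raw)
    return parsed[0]["name"] if parsed else np.nan

collections = pd.Series([
    "[{'id': 10194, 'name': 'Toy Story Collection', 'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg', 'backdrop_path': '/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg'}]",
    np.nan,
    "[{'id': 409138, 'name': 'Yossi Collection', 'poster_path': '/y30egmXtXggkHA2yfcDQcmM9D90.jpg', 'backdrop_path': '/w5u8mlzQqpVWcZlCgHzgzHOwGm.jpg'}]",
])

names = collections.apply(collection_name)
print(names)
```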

### Summary

- In the data provided, "Belongs to collection" and "homepage" had 80% and 68% missing values respectively, significantly more than any other feature column (revenue is missing only for the test rows).

- I decided to keep all columns with less than 20% of their data missing.
