## Movie Revenue Prediction 

### Objective: Your client is a movie studio and they need to be able to predict movie revenue in order to greenlight the project and assign a budget to it. 
- Most of the data consists of categorical variables. 
- While the budget for the movie is known in the dataset, it is often an unknown variable during the greenlighting process. 

In [1]:
%load_ext watermark
%watermark -a "Emily Schoof" -d -t -v -p numpy,pandas,matplotlib

Emily Schoof 2019-08-14 21:58:27 

CPython 3.7.3
IPython 7.4.0

numpy 1.16.2
pandas 0.24.2
matplotlib 3.0.3


## Section 1: Data Preprocessing
- Load Movie_Revenue_Predictions.csv data
- Cleaning Data and Exploration

In [2]:
# Import necessary modules
import pandas as pd
from pandas import DataFrame
import numpy as np
import matplotlib.pyplot as plt
import datetime

In [3]:
# Load the dataset
movie_df = pd.read_csv('Movie_Revenue_Predictions.csv')  # read_csv already returns a DataFrame
movie_df.head(2)

Unnamed: 0,title,tagline,revenue,budget,genres,homepage,id,keywords,original_language,overview,production_companies,production_countries,release_date,runtime,spoken_languages,status
0,Avatar,Enter the World of Pandora.,2787965087,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,"In the 22nd century, a paraplegic Marine is di...","[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",12/10/09,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released
1,Pirates of the Caribbean: At World's End,"At the end of the world, the adventure begins.",961000000,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,"Captain Barbossa, long believed to be dead, ha...","[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",5/19/07,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released


In [4]:
# Assess shape of data
movie_df.shape

(4803, 16)

In [5]:
# Assess dataframe
movie_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 16 columns):
title                   4803 non-null object
tagline                 3959 non-null object
revenue                 4803 non-null int64
budget                  4803 non-null int64
genres                  4803 non-null object
homepage                1712 non-null object
id                      4803 non-null int64
keywords                4803 non-null object
original_language       4803 non-null object
overview                4800 non-null object
production_companies    4803 non-null object
production_countries    4803 non-null object
release_date            4802 non-null object
runtime                 4801 non-null float64
spoken_languages        4803 non-null object
status                  4803 non-null object
dtypes: float64(1), int64(3), object(12)
memory usage: 600.5+ KB


In [6]:
movie_df.isnull().sum()

title                      0
tagline                  844
revenue                    0
budget                     0
genres                     0
homepage                3091
id                         0
keywords                   0
original_language          0
overview                   3
production_companies       0
production_countries       0
release_date               1
runtime                    2
spoken_languages           0
status                     0
dtype: int64

## Resolution of NaN Values
#### 1. Numerical columns:
   - **runtime** (2 NaN)
       - since there are so few instances, dropping the NaN rows will not interfere with the analysis of movie revenue. 
        
#### 2. Categorical/Object columns:
Filling in NaN categorical values in the remaining columns is a bit tricky since there is no easily-applied statistical method.
   - **homepage** (3091 NaN)
        - roughly 64% of the data is missing (3091 of the total 4803), so this column cannot be effectively utilized for this model and should be dropped.
   - **overview** (3 NaN)
        - since there are so few instances, dropping the NaN rows will not interfere with the analysis of movie revenue. 
   - **release_date** (1 NaN)
        - since there are so few instances, dropping the NaN rows will not interfere with the analysis of movie revenue. 
   - **tagline** (844 NaN)
        - 844 is well under 1/4 of the total data, but the tagline for a movie may have a significant impact on movie revenue due to its marketing implications. Therefore, I will attempt to predict the missing tagline values with a random forest, as documented in this source article: https://www.mikulskibartosz.name/fill-missing-values-using-random-forest/. *Since this requires converting the dataset into numerical/encoded values, this resolution will be reserved for after all other values have been resolved.*
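
The drop-or-keep reasoning above rests on the share of missing values per column. A minimal sketch of that check follows; `toy_df` and the 0.5 cutoff are illustrative assumptions and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for movie_df (values are made up for illustration)
toy_df = pd.DataFrame({
    "homepage": ["http://a.example", None, None, None],
    "tagline": ["One", "Two", None, "Four"],
    "runtime": [90.0, 100.0, 110.0, np.nan],
})

# Fraction of missing entries per column, largest first
missing_share = toy_df.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Hypothetical cutoff: columns missing more than half of their values
# are candidates for dropping rather than filling
drop_candidates = missing_share[missing_share > 0.5].index.tolist()
print(drop_candidates)
```

Running the same `isnull().mean()` call on `movie_df` would give the exact share for each column.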

In [7]:
# Drop rows with NaN in the runtime, overview, or release_date columns
movie_df = movie_df.dropna(subset=['runtime', 'overview', 'release_date'])
len(movie_df)

4799

In [8]:
# Convert Dates to Datetime Objects
movie_df['release_date_dt'] = pd.to_datetime(movie_df['release_date'], infer_datetime_format=True)
movie_df['release_date_dt'].head(2)

0   2009-12-10
1   2007-05-19
Name: release_date_dt, dtype: datetime64[ns]

In [9]:
# Drop homepage (> 50% of entries are NaN) and release_date (replaced by release_date_dt)
movie_df = movie_df.drop(columns=['homepage', 'release_date'])
movie_df.head(2)

Unnamed: 0,title,tagline,revenue,budget,genres,id,keywords,original_language,overview,production_companies,production_countries,runtime,spoken_languages,status,release_date_dt
0,Avatar,Enter the World of Pandora.,2787965087,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,"In the 22nd century, a paraplegic Marine is di...","[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,2009-12-10
1,Pirates of the Caribbean: At World's End,"At the end of the world, the adventure begins.",961000000,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,"Captain Barbossa, long believed to be dead, ha...","[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,2007-05-19


In [10]:
# Test output
movie_df.isnull().sum()

title                     0
tagline                 840
revenue                   0
budget                    0
genres                    0
id                        0
keywords                  0
original_language         0
overview                  0
production_companies      0
production_countries      0
runtime                   0
spoken_languages          0
status                    0
release_date_dt           0
dtype: int64

## Explore the Values within Each Column

In [11]:
# Import helpful modules
from collections import Counter

**Title**

In [12]:
print(len(movie_df.title.value_counts(ascending=False)))
print(movie_df.title.describe())
movie_df.title.value_counts(ascending=False).nlargest()

4796
count                4799
unique               4796
top       Out of the Blue
freq                    2
Name: title, dtype: object


Out of the Blue    2
Batman             2
The Host           2
Before Sunset      1
Marvin's Room      1
Name: title, dtype: int64

*Aside from two entries each for Batman, Out of the Blue, and The Host, every movie title within the dataset appears only once.*

**Revenue**

In [13]:
print(len(movie_df.revenue.value_counts(ascending=False)))
print(movie_df.revenue.describe())
movie_df.revenue.value_counts(ascending=False).nlargest()

3297
count    4.799000e+03
mean     8.232920e+07
std      1.629076e+08
min      0.000000e+00
25%      0.000000e+00
50%      1.918402e+07
75%      9.295652e+07
max      2.787965e+09
Name: revenue, dtype: float64


0           1423
7000000        6
8000000        6
6000000        5
12000000       5
Name: revenue, dtype: int64

*There are 1423 movies within the dataset that have a reported $0 for revenue. This is most likely due to missing data. Since movie revenue is the target variable, these instances need to be kept in mind.*
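
One possible way to keep these instances in mind is to mark a reported 0 as missing instead of treating it as a real value. The sketch below assumes that a 0 means "not reported"; `toy_df` and the `revenue_clean` column are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for movie_df (values are made up for illustration)
toy_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "revenue": [0, 7000000, 0, 12000000],
})

# Assumption: a reported 0 means "unknown", so flag it as NaN
toy_df["revenue_clean"] = toy_df["revenue"].replace(0, np.nan)

# Count the rows without a usable target value
n_missing = int(toy_df["revenue_clean"].isnull().sum())
print(n_missing)

# Rows that could be used to train a revenue model
labeled = toy_df.dropna(subset=["revenue_clean"])
print(labeled)
```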

**Budget**

In [14]:
print(len(movie_df.budget.value_counts(ascending=False)))
print(movie_df.budget.describe())
movie_df.budget.value_counts(ascending=False).nlargest()

434
count    4.799000e+03
mean     2.906593e+07
std      4.073251e+07
min      0.000000e+00
25%      8.000000e+05
50%      1.500000e+07
75%      4.000000e+07
max      3.800000e+08
Name: budget, dtype: float64


0           1036
20000000     144
30000000     128
25000000     126
40000000     123
Name: budget, dtype: int64

*Similar to revenue, there are 1036 movies within the dataset that have a reported $0 for budget. This is most likely due to missing data. Since budget is a natural predictor of revenue, these instances need to be kept in mind.*

**Genres**

In [21]:
print(movie_df.genres.describe())
print(len(movie_df.genres.value_counts(ascending=False)))
most_common, num_most_common = Counter(movie_df.genres).most_common(1)[0]
print(most_common, num_most_common)
movie_df.genres.value_counts(ascending=False).nlargest()

count                              4799
unique                             1175
top       [{"id": 18, "name": "Drama"}]
freq                                369
Name: genres, dtype: object
1175
[{"id": 18, "name": "Drama"}] 369


[{"id": 18, "name": "Drama"}]                                       369
[{"id": 35, "name": "Comedy"}]                                      282
[{"id": 18, "name": "Drama"}, {"id": 10749, "name": "Romance"}]     164
[{"id": 35, "name": "Comedy"}, {"id": 10749, "name": "Romance"}]    144
[{"id": 35, "name": "Comedy"}, {"id": 18, "name": "Drama"}]         142
Name: genres, dtype: int64

**ID**

In [16]:
print(len(movie_df.id.value_counts(ascending=False)))
print(movie_df.id.describe())
movie_df.id.value_counts(ascending=False).nlargest()

4799
count      4799.000000
mean      56899.920192
std       88236.500208
min           5.000000
25%        9012.500000
50%       14623.000000
75%       58461.500000
max      447027.000000
Name: id, dtype: float64


45054     1
109417    1
13187     1
8849      1
29339     1
Name: id, dtype: int64

**Keywords**

In [20]:
print(len(movie_df.keywords.value_counts(ascending=False)))
print(movie_df.keywords.describe())
most_common, num_most_common = Counter(movie_df.keywords).most_common(1)[0]
print(most_common, num_most_common)
movie_df.keywords.value_counts(ascending=False).nlargest()

4220
count     4799
unique    4220
top         []
freq       410
Name: keywords, dtype: object
[] 410


[]                                                  410
[{"id": 10183, "name": "independent film"}]          55
[{"id": 187056, "name": "woman director"}]           42
[{"id": 179431, "name": "duringcreditsstinger"}]     15
[{"id": 6075, "name": "sport"}]                      13
Name: keywords, dtype: int64

*This column appears to have nested information that needs to be extracted, cleaned, then added back to the movie_df dataset. In addition, the most common value is the empty list '[]', which holds no information.*
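
The nested values look like JSON-encoded lists of dictionaries, so one way to extract them is with the standard `json` module. This is a minimal sketch under that assumption; `extract_names` is a hypothetical helper, not part of the notebook.

```python
import json

def extract_names(raw):
    """Return the 'name' field of each entry in a JSON-encoded list of dicts."""
    return [entry["name"] for entry in json.loads(raw)]

# Sample string in the same shape as the keywords column
sample = ('[{"id": 1463, "name": "culture clash"}, '
          '{"id": 10183, "name": "independent film"}]')

print(extract_names(sample))
print(extract_names("[]"))  # an empty list stays empty
```

On the full dataset, something like `movie_df['keywords'].apply(extract_names)` would apply the helper to every row.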

**Original Language**

In [18]:
print(len(movie_df.original_language.value_counts(ascending=False)))
print(movie_df.original_language.describe())
movie_df.original_language.value_counts(ascending=False).nlargest()

37
count     4799
unique      37
top         en
freq      4503
Name: original_language, dtype: object


en    4503
fr      70
es      32
zh      27
de      26
Name: original_language, dtype: int64

**Overview**

In [18]:
#movie_df.overview.value_counts(ascending=False)
print(len(movie_df.overview.value_counts(ascending=False)))
print(movie_df.overview.describe())
movie_df.overview[0]

4799
count                                                  4799
unique                                                 4799
top       On a trip to the beach, a teenage girl named T...
freq                                                      1
Name: overview, dtype: object


'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.'

**Production Companies**

In [19]:
print(len(movie_df.production_companies.value_counts(ascending=False)))
print(movie_df.production_companies.describe())
most_common, num_most_common = Counter(movie_df.production_companies).most_common(1)[0]
print(most_common, num_most_common)
movie_df.production_companies.value_counts(ascending=False).nlargest()

3695
count     4799
unique    3695
top         []
freq       349
Name: production_companies, dtype: object
[] 349


[]                                            349
[{"name": "Paramount Pictures", "id": 4}]      58
[{"name": "Universal Pictures", "id": 33}]     45
[{"name": "New Line Cinema", "id": 12}]        38
[{"name": "Columbia Pictures", "id": 5}]       37
Name: production_companies, dtype: int64

*This column appears to have nested information that needs to be extracted, cleaned, then added back to the movie_df dataset. In addition, the most common production_companies value is the empty list '[]', which holds no information.*

In [30]:
no_company = movie_df[movie_df['production_companies'] == '[]']
print(len(no_company))
no_company.head()

349


Unnamed: 0,title,tagline,revenue,budget,genres,id,keywords,original_language,overview,production_companies,production_countries,runtime,spoken_languages,status,release_date_dt
1011,The Tooth Fairy,,0,0,"[{""id"": 27, ""name"": ""Horror""}]",53953,"[{""id"": 10292, ""name"": ""gore""}, {""id"": 12339, ...",de,A woman and her daughter (Nicole Muñoz) encoun...,[],[],0.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,2006-08-08
1360,There Be Dragons,,0,0,"[{""id"": 18, ""name"": ""Drama""}]",45054,"[{""id"": 5509, ""name"": ""spanish civil war""}, {""...",en,Arising out of the horror of the Spanish Civil...,[],[],112.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,2011-03-25
1669,The Promise,the promise,0,0,"[{""id"": 14, ""name"": ""Fantasy""}, {""id"": 18, ""na...",2008,"[{""id"": 964, ""name"": ""servant""}, {""id"": 2280, ...",zh,"An orphaned girl, driven by poverty at such a ...",[],"[{""iso_3166_1"": ""CN"", ""name"": ""China""}, {""iso_...",98.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,2005-12-15
1754,The Sisterhood of the Traveling Pants 2,Some friends just fit together.,44352417,27000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 35, ""...",10188,"[{""id"": 5248, ""name"": ""female friendship""}, {""...",en,Four young women continue the journey toward a...,[],"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",117.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,2008-08-06
1898,Unaccompanied Minors,"No plane, no parents, no problem!",0,26000000,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 10751, ""...",18147,"[{""id"": 65, ""name"": ""holiday""}]",en,Five disparate kids snowed in at the airport o...,[],[],90.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,2006-12-08


**Production Countries**

In [22]:
print(len(movie_df.production_countries.value_counts(ascending=False)))
print(movie_df.production_countries.describe())
most_common, num_most_common = Counter(movie_df.production_countries).most_common(1)[0]
print(most_common, num_most_common)
movie_df.production_countries.value_counts(ascending=False).nlargest()

469
count                                                  4799
unique                                                  469
top       [{"iso_3166_1": "US", "name": "United States o...
freq                                                   2977
Name: production_countries, dtype: object
[{"iso_3166_1": "US", "name": "United States of America"}] 2977


[{"iso_3166_1": "US", "name": "United States of America"}]                                                    2977
[{"iso_3166_1": "GB", "name": "United Kingdom"}, {"iso_3166_1": "US", "name": "United States of America"}]     181
[]                                                                                                             172
[{"iso_3166_1": "GB", "name": "United Kingdom"}]                                                               130
[{"iso_3166_1": "DE", "name": "Germany"}, {"iso_3166_1": "US", "name": "United States of America"}]            119
Name: production_countries, dtype: int64

*This column appears to have nested information that needs to be extracted, cleaned, then added back to the movie_df dataset. 'US' on its own is the most common production country, making up around 62% of the dataset (2977 out of 4799). In addition, there are 172 entries of '[]' that hold no information.*
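
Because `value_counts` treats each whole string as one category, "US" alone and "GB, US" are counted separately. A rough sketch of counting individual countries instead is shown below; the `rows` list is a small hand-made sample in the column's format, not real output.

```python
import json
from collections import Counter

# Hand-made sample in the same shape as the production_countries column
rows = [
    '[{"iso_3166_1": "US", "name": "United States of America"}]',
    '[{"iso_3166_1": "GB", "name": "United Kingdom"}, '
    '{"iso_3166_1": "US", "name": "United States of America"}]',
    '[]',
]

# Count each country code once per movie it appears in
country_counts = Counter(
    entry["iso_3166_1"] for raw in rows for entry in json.loads(raw)
)
print(country_counts.most_common())

# Count the rows whose list is empty
n_empty = sum(1 for raw in rows if not json.loads(raw))
print(n_empty)
```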

**Runtime**

In [28]:
print(len(movie_df.runtime.value_counts(ascending=False)))
print(movie_df.runtime.describe())
movie_df.runtime.value_counts(ascending=False).nlargest()

156
count    4799.000000
mean      106.903105
std        22.561305
min         0.000000
25%        94.000000
50%       103.000000
75%       118.000000
max       338.000000
Name: runtime, dtype: float64


90.0     163
100.0    149
98.0     140
97.0     133
95.0     123
Name: runtime, dtype: int64

*There appear to be movies with a '0.0' runtime, which doesn't make sense, since this would imply that the movie was 0 minutes long.*

In [29]:
no_runtime = movie_df[movie_df['runtime'] == 0.0]
print(len(no_runtime))
no_runtime.head()

34


Unnamed: 0,title,tagline,revenue,budget,genres,id,keywords,original_language,overview,production_companies,production_countries,runtime,spoken_languages,status,release_date_dt
1011,The Tooth Fairy,,0,0,"[{""id"": 27, ""name"": ""Horror""}]",53953,"[{""id"": 10292, ""name"": ""gore""}, {""id"": 12339, ...",de,A woman and her daughter (Nicole Muñoz) encoun...,[],[],0.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,2006-08-08
3112,Blood Done Sign My Name,No one changes the world alone.,0,0,"[{""id"": 18, ""name"": ""Drama""}, {""id"": 80, ""name...",41894,[],en,A drama based on the true story in which a bla...,[],[],0.0,[],Released,2010-02-01
3669,Should've Been Romeo,Even Shakespeare didn't see this one coming.,0,0,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 18, ""nam...",113406,[],en,"A self-centered, middle-aged pitchman for a po...","[{""name"": ""Phillybrook Films"", ""id"": 65147}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",0.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,2012-04-28
3809,How to Fall in Love,,0,4000000,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 10749, ""...",158150,[],en,"An accountant, who never quite grew out of his...","[{""name"": ""Annuit Coeptis Entertainment Inc."",...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",0.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,2012-07-21
3953,Fort McCoy,,0,0,"[{""id"": 10752, ""name"": ""War""}, {""id"": 18, ""nam...",281230,"[{""id"": 187056, ""name"": ""woman director""}]",en,Unable to serve in World War II because of a h...,[],[],0.0,[],Released,2014-01-01


**Spoken Languages**

In [31]:
print(len(movie_df.spoken_languages.value_counts(ascending=False)))
print(movie_df.spoken_languages.describe())
most_common, num_most_common = Counter(movie_df.spoken_languages).most_common(1)[0]
print(most_common, num_most_common)
movie_df.spoken_languages.value_counts(ascending=False).nlargest()

544
count                                         4799
unique                                         544
top       [{"iso_639_1": "en", "name": "English"}]
freq                                          3170
Name: spoken_languages, dtype: object
[{"iso_639_1": "en", "name": "English"}] 3170


[{"iso_639_1": "en", "name": "English"}]                                                  3170
[{"iso_639_1": "en", "name": "English"}, {"iso_639_1": "es", "name": "Espa\u00f1ol"}]      127
[{"iso_639_1": "en", "name": "English"}, {"iso_639_1": "fr", "name": "Fran\u00e7ais"}]     114
[]                                                                                          84
[{"iso_639_1": "es", "name": "Espa\u00f1ol"}, {"iso_639_1": "en", "name": "English"}]       54
Name: spoken_languages, dtype: int64

*This column appears to have nested information that needs to be extracted, cleaned, then added back to the movie_df dataset. 'English' on its own is the most common spoken language, making up around 66% of the dataset (3170 out of 4799). There are also 84 entries with a '[]'.*

In [33]:
no_language = movie_df[movie_df['spoken_languages'] == '[]']
print(len(no_language))
no_language.head()

84


Unnamed: 0,title,tagline,revenue,budget,genres,id,keywords,original_language,overview,production_companies,production_countries,runtime,spoken_languages,status,release_date_dt
492,Top Cat Begins,,0,8000000,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 16, ""nam...",293644,"[{""id"": 209714, ""name"": ""3d""}]",es,Top Cat has arrived to charm his way into your...,"[{""name"": ""Anima Estudios"", ""id"": 9965}, {""nam...","[{""iso_3166_1"": ""IN"", ""name"": ""India""}, {""iso_...",89.0,[],Released,2015-10-30
1169,42,The True Story Of An American Legend,95020213,40000000,"[{""id"": 18, ""name"": ""Drama""}]",109410,"[{""id"": 1480, ""name"": ""baseball""}, {""id"": 5565...",en,"The powerful story of Jackie Robinson, the leg...","[{""name"": ""Warner Bros."", ""id"": 6194}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",128.0,[],Released,2013-04-12
2590,VeggieTales: The Pirates Who Don't Do Anything,,0,0,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 16, ""...",15511,"[{""id"": 380, ""name"": ""brother brother relation...",en,Set Sail For Adventure! A boatload of beloved ...,"[{""name"": ""Starz Animation"", ""id"": 2885}, {""na...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",85.0,[],Released,2008-01-11
2614,The Love Letter,A letter from the past would change their futu...,0,0,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 18, ""nam...",57943,[],en,20th century computer games designer Scott exc...,[],[],98.0,[],Released,1998-02-01
2631,The Company,,0,0,"[{""id"": 18, ""name"": ""Drama""}]",112430,"[{""id"": 11162, ""name"": ""miniseries""}]",en,Real-life figures from the Cold War era mix wi...,[],[],276.0,[],Released,2007-08-05


**Status**

In [24]:
print(len(movie_df.status.value_counts(ascending=False)))
print(movie_df.status.describe())
print(movie_df.status.value_counts(ascending=False))
most_common, num_most_common = Counter(movie_df.status).most_common(1)[0]
most_common, num_most_common

3
count         4799
unique           3
top       Released
freq          4791
Name: status, dtype: object
Released           4791
Rumored               5
Post Production       3
Name: status, dtype: int64


('Released', 4791)

**Release Date**

In [32]:
print(len(movie_df.release_date_dt.value_counts(ascending=False)))
print(movie_df.release_date_dt.describe()) 
most_common, num_most_common = Counter(movie_df.release_date_dt).most_common(1)[0]
least_common, num_least_common = Counter(movie_df.release_date_dt).most_common()[-1]
print(most_common, num_most_common, least_common, num_least_common)
movie_df.release_date_dt.value_counts(ascending=False).nlargest()

3278
count                    4799
unique                   3278
top       2006-01-01 00:00:00
freq                       10
first     1969-01-01 00:00:00
last      2068-12-21 00:00:00
Name: release_date_dt, dtype: object
2006-01-01 00:00:00 10 2012-05-03 00:00:00 1


2006-01-01    10
2002-01-01     8
2014-12-25     7
2013-07-18     7
1999-10-22     7
Name: release_date_dt, dtype: int64

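*The parsed dates range from 1969 to 2068. The future dates are most likely a parsing artifact rather than real release dates: the source column stores two-digit years (e.g. 12/10/09), and two-digit years from 00 to 68 are conventionally mapped to 2000-2068, so a film released in 1968 would be read as 2068. Either way, revenue cannot have been recorded for a future date, so the movies dated 2020 or later are dropped before moving forward.*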
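
If the future dates do come from two-digit years, an alternative to dropping those rows would be to shift them back by a century. The sketch below is not what this notebook does; it assumes the `%m/%d/%y` format seen in the raw column and that no genuine release in the data falls on or after the 2020 cutoff.

```python
import pandas as pd

# Sample strings in the raw release_date format (month/day/two-digit year)
raw_dates = pd.Series(["12/10/09", "5/19/07", "12/21/68", "1/1/69"])

# %y maps 69-99 to 1969-1999 and 00-68 to 2000-2068
parsed = pd.to_datetime(raw_dates, format="%m/%d/%y")
print(parsed)

# Assumption: anything on or after the cutoff is really a 20th-century date
cutoff = pd.Timestamp("2020-01-01")
corrected = parsed.where(parsed < cutoff, parsed - pd.DateOffset(years=100))
print(corrected)
```

Whether to correct or drop these rows is a judgment call: correcting keeps older films in the data, while dropping avoids relying on a guessed century.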

In [26]:
movie_df = movie_df.loc[movie_df.release_date_dt < '2020-01-01 00:00:00']
print(len(movie_df.release_date_dt.value_counts(ascending=False)))
print(movie_df.release_date_dt.describe())

3147
count                    4667
unique                   3147
top       2006-01-01 00:00:00
freq                       10
first     1969-01-01 00:00:00
last      2017-02-03 00:00:00
Name: release_date_dt, dtype: object


This looks much better. Data collection for this dataset appears to have stopped in early 2017: the latest release date is 2017-02-03, and no movies from 2018 or 2019 are present even though the cutoff allows dates through 2019.

### Drop rows that meet all 6 of the following conditions:
    1. movie_df.revenue == 0
    2. movie_df.budget == 0
    3. movie_df.keywords == '[]'
    4. movie_df.production_companies == '[]'
    5. movie_df.runtime == 0.0
    6. movie_df.tagline.isnull()
A row that meets all 6 of these conditions carries almost no usable information and would only weaken a model built to predict a movie's revenue; thus, these rows should be removed entirely.

In [27]:
# Drop rows that meet all six conditions listed above
movie_df = movie_df.drop(
    movie_df[(movie_df.revenue == 0) &
             (movie_df.budget == 0) &
             (movie_df.keywords == "[]") &
             (movie_df.production_companies == "[]") &
             (movie_df.runtime == 0.0) &
             (movie_df.tagline.isnull())].index)

# Count the remaining rows
len(movie_df)

4651
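
As a sanity check on the filter logic, the sketch below applies the same six conditions to a toy frame and confirms that only a row failing every check is removed. `toy_df` and its values are made up for illustration.

```python
import pandas as pd

# Toy frame: row 0 meets all six conditions, rows 1 and 2 each miss one
toy_df = pd.DataFrame({
    "revenue": [0, 0, 100],
    "budget": [0, 0, 50],
    "keywords": ["[]", "[]", "[]"],
    "production_companies": ["[]", "[]", "[]"],
    "runtime": [0.0, 95.0, 0.0],
    "tagline": [None, None, None],
})

# True only where every condition holds at once
all_missing = (
    (toy_df.revenue == 0)
    & (toy_df.budget == 0)
    & (toy_df.keywords == "[]")
    & (toy_df.production_companies == "[]")
    & (toy_df.runtime == 0.0)
    & toy_df.tagline.isnull()
)

# Keep everything except the fully empty rows
kept = toy_df[~all_missing]
print(kept.index.tolist())
```

Building the mask as a named variable also makes it easy to inspect `all_missing.sum()` before dropping anything.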

In [28]:
# Store dataframe globally
%store movie_df

Stored 'movie_df' (DataFrame)
