## Movie Revenue Prediction 

### Objective: Your client is a movie studio that needs to predict a movie's revenue before greenlighting the project and assigning it a budget. 
- Most of the columns in the data are categorical variables. 
- While the budget for the movie is known in the dataset, it is often an unknown variable during the greenlighting process. 

## Section 1: Data Preprocessing
- Load Movie_Revenue_Predictions.csv data
- Cleaning Data and Exploration

In [1]:
# Import necessary modules
import pandas as pd
from pandas import DataFrame
import numpy as np
import matplotlib.pyplot as plt
import datetime

In [2]:
# Load the dataset (pd.read_csv already returns a DataFrame)
movie_df = pd.read_csv('Movie_Revenue_Predictions.csv')
movie_df.head(2)

Unnamed: 0,title,tagline,revenue,budget,genres,homepage,id,keywords,original_language,overview,production_companies,production_countries,release_date,runtime,spoken_languages,status
0,Avatar,Enter the World of Pandora.,2787965087,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,"In the 22nd century, a paraplegic Marine is di...","[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",12/10/09,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released
1,Pirates of the Caribbean: At World's End,"At the end of the world, the adventure begins.",961000000,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,"Captain Barbossa, long believed to be dead, ha...","[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",5/19/07,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released


In [3]:
# Assess shape of data
movie_df.shape

(4803, 16)

In [4]:
# Assess dataframe
movie_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 16 columns):
title                   4803 non-null object
tagline                 3959 non-null object
revenue                 4803 non-null int64
budget                  4803 non-null int64
genres                  4803 non-null object
homepage                1712 non-null object
id                      4803 non-null int64
keywords                4803 non-null object
original_language       4803 non-null object
overview                4800 non-null object
production_companies    4803 non-null object
production_countries    4803 non-null object
release_date            4802 non-null object
runtime                 4801 non-null float64
spoken_languages        4803 non-null object
status                  4803 non-null object
dtypes: float64(1), int64(3), object(12)
memory usage: 600.5+ KB


In [5]:
movie_df.isnull().sum()

title                      0
tagline                  844
revenue                    0
budget                     0
genres                     0
homepage                3091
id                         0
keywords                   0
original_language          0
overview                   3
production_companies       0
production_countries       0
release_date               1
runtime                    2
spoken_languages           0
status                     0
dtype: int64

## Resolution of NaN Values
#### 1. Numerical columns:
   - **runtime** (2 NaN)
       - since there are so few instances, dropping the NaN rows will not interfere with the analysis of movie revenue. 
        
#### 2. Categorical/Object columns:
Filling in NaN categorical values in the remaining columns is a bit tricky since there is no easily-applied statistical method.
   - **homepage** (3091 NaN)
        - About two-thirds of the values are missing (3091 of 4803, roughly 64%), so this column cannot be used effectively in this model and should be dropped.
   - **overview** (3 NaN)
        - since there are so few instances, dropping the NaN rows will not interfere with the analysis of movie revenue. 
   - **release_date** (1 NaN)
        - since there are so few instances, dropping the NaN rows will not interfere with the analysis of movie revenue. 
   - **tagline** (844 NaN)
        - while 844 is under 1/4 of the rows (about 18% of 4803), the tagline for a movie may in fact have a significant impact on movie revenue due to its marketing implications. Therefore, I will attempt to predict the missing tagline values with a random forest, as documented in this source article: https://www.mikulskibartosz.name/fill-missing-values-using-random-forest/. *Since this will require converting the dataset into numerical/encoded values, this step is reserved until all other values have been resolved.*
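
The drop-or-keep decisions above rest on how much of each column is missing. A minimal sketch of that check on a toy frame, assuming a "more than half missing" rule for dropping a column; `toy_df`, `DROP_THRESHOLD`, and the sample values are illustrative and not taken from the dataset:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for movie_df (values are illustrative only).
toy_df = pd.DataFrame({
    "homepage": ["http://a.example", np.nan, np.nan, np.nan],
    "tagline": ["A tagline", "Another", np.nan, "Third"],
    "runtime": [100.0, 95.0, 120.0, 88.0],
})

# Fraction of missing values per column (mean of the boolean null mask).
missing_fraction = toy_df.isnull().mean()

# Hypothetical rule: drop any column with more than half of its values missing.
DROP_THRESHOLD = 0.5
columns_to_drop = missing_fraction[missing_fraction > DROP_THRESHOLD].index.tolist()

print(missing_fraction)
print(columns_to_drop)
```

Run against `movie_df`, the same two lines would list the missing share for every column at once, which makes it easier to state thresholds precisely.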

In [6]:
# Drop rows with NaNs in runtime, overview, or release_date - keep only rows where all three columns are "not null"
movie_df = movie_df.dropna(subset=['runtime', 'overview', 'release_date'])
len(movie_df)

4799

In [7]:
# Convert Dates to Datetime Objects
movie_df['release_date_dt'] = pd.to_datetime(movie_df['release_date'], infer_datetime_format=True)
movie_df['release_date_dt'].head(2)

0   2009-12-10
1   2007-05-19
Name: release_date_dt, dtype: datetime64[ns]

In [8]:
# Drop homepage (too many NaNs to resolve, > 50% of entries) and release_date (replaced by release_date_dt)
movie_df = movie_df.drop(columns=['homepage', 'release_date'])
movie_df.head(2)

Unnamed: 0,title,tagline,revenue,budget,genres,id,keywords,original_language,overview,production_companies,production_countries,runtime,spoken_languages,status,release_date_dt
0,Avatar,Enter the World of Pandora.,2787965087,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,"In the 22nd century, a paraplegic Marine is di...","[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,2009-12-10
1,Pirates of the Caribbean: At World's End,"At the end of the world, the adventure begins.",961000000,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,"Captain Barbossa, long believed to be dead, ha...","[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,2007-05-19


In [9]:
# Test output
movie_df.isnull().sum()

title                     0
tagline                 840
revenue                   0
budget                    0
genres                    0
id                        0
keywords                  0
original_language         0
overview                  0
production_companies      0
production_countries      0
runtime                   0
spoken_languages          0
status                    0
release_date_dt           0
dtype: int64

## Explore the Values within Each Column

In [26]:
# Import helpful modules
from collections import Counter

In [10]:
# Title
print(len(movie_df.title.value_counts(ascending=False)))
print(movie_df.title.describe())
movie_df.title.value_counts(ascending=False)

4796
count         4799
unique        4796
top       The Host
freq             2
Name: title, dtype: object


The Host                                 2
Out of the Blue                          2
Batman                                   2
Cargo                                    1
The Life of David Gale                   1
Pitch Perfect 2                          1
Rio 2                                    1
The Last Sin Eater                       1
Little Nicholas                          1
Eraserhead                               1
Rapa Nui                                 1
Spectre                                  1
East Is East                             1
You Don't Mess with the Zohan            1
Lincoln                                  1
Love Ranch                               1
Stake Land                               1
A Simple Wish                            1
Babel                                    1
Snakes on a Plane                        1
Albino Alligator                         1
Dragonslayer                             1
PCU                                      1
Anomalisa  

*Aside from Batman, Out of the Blue, and The Host, which each appear twice, every movie title in the dataset appears only once.*

In [11]:
# Revenue
print(len(movie_df.revenue.value_counts(ascending=False)))
print(movie_df.revenue.describe())
movie_df.revenue.value_counts(ascending=False)

3297
count    4.799000e+03
mean     8.232920e+07
std      1.629076e+08
min      0.000000e+00
25%      0.000000e+00
50%      1.918402e+07
75%      9.295652e+07
max      2.787965e+09
Name: revenue, dtype: float64


0            1423
7000000         6
8000000         6
6000000         5
12000000        5
10000000        5
100000000       5
14000000        4
25000000        4
11000000        4
5000000         4
32000000        3
13000000        3
60000000        3
7800000         3
14400000        3
4000000         3
17000000        3
30000000        3
77000000        2
20000000        2
29000000        2
42000000        2
2200000         2
8500000         2
24000000        2
102000000       2
9000000         2
15000000        2
33400000        2
             ... 
157887643       1
18195610        1
482860185       1
10501938        1
325233863       1
16633035        1
53191886        1
273552592       1
47042000        1
275293450       1
786636033       1
538400000       1
32204030        1
30411183        1
91636986        1
82087155        1
82150642        1
48190704        1
104303851       1
2401510         1
15200000        1
109906372       1
83719388        1
193355800       1
136621271 

*There are 1423 movies (about 30% of the 4799 rows) with a reported revenue of $0. This is most likely due to missing data. Since movie revenue is the target variable, these rows cannot be trusted as training targets as they stand and need to be kept in mind.*

In [12]:
# Budget
print(len(movie_df.budget.value_counts(ascending=False)))
print(movie_df.budget.describe())
movie_df.budget.value_counts(ascending=False)

434
count    4.799000e+03
mean     2.906593e+07
std      4.073251e+07
min      0.000000e+00
25%      8.000000e+05
50%      1.500000e+07
75%      4.000000e+07
max      3.800000e+08
Name: budget, dtype: float64


0            1036
20000000      144
30000000      128
25000000      126
40000000      123
15000000      119
35000000      102
50000000      101
10000000      101
60000000       86
5000000        84
12000000       79
8000000        62
70000000       60
80000000       59
18000000       59
6000000        55
7000000        55
2000000        54
45000000       52
3000000        51
4000000        49
1000000        48
75000000       47
55000000       45
28000000       42
150000000      41
13000000       41
11000000       41
100000000      41
             ... 
176000003       1
180000          1
97250400        1
16800000        1
1350000         1
777000          1
127000000       1
2686000         1
3100000         1
1650000         1
1455000         1
12516654        1
5952000         1
14200000        1
12305523        1
78146652        1
3705538         1
3730500         1
46000           1
7347125         1
85000           1
4300000         1
19500000        1
23600000        1
41677699  

*Similar to revenue, there are 1036 movies (about 22% of the 4799 rows) with a reported budget of $0. This is most likely due to missing data. Since budget is a candidate predictor of revenue, these instances need to be kept in mind.*
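
One way to keep these rows in mind is to mark the zeros as missing so they are not mistaken for real values later. A minimal sketch on a toy frame, assuming a reported 0 means "not recorded" (the notes above only say this is most likely); `toy_df`, `money_cols`, and `cleaned_df` are illustrative names:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for movie_df; the zero entries are illustrative.
toy_df = pd.DataFrame({
    "revenue": [2787965087, 0, 961000000, 0],
    "budget": [237000000, 0, 300000000, 15000000],
})

money_cols = ["revenue", "budget"]

# Share of rows reporting exactly 0 in each column.
zero_share = (toy_df[money_cols] == 0).mean()

# Assumption: 0 means "missing", so replace it with NaN on a copy.
cleaned_df = toy_df.copy()
cleaned_df[money_cols] = cleaned_df[money_cols].replace(0, np.nan)

print(zero_share)
print(cleaned_df.isnull().sum())
```

Working on a copy leaves the original counts available for comparison; whether to impute or drop the resulting NaNs is a separate modeling decision.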

In [27]:
# Genres 
#movie_df.genres.value_counts(ascending=False)
print(movie_df.genres.describe())
print(len(movie_df.genres.value_counts(ascending=False)))

most_common, num_most_common = Counter(movie_df.genres).most_common(1)[0]
most_common, num_most_common

count                              4799
unique                             1175
top       [{"id": 18, "name": "Drama"}]
freq                                369
Name: genres, dtype: object
1175


('[{"id": 18, "name": "Drama"}]', 369)

In [15]:
# ID
print(len(movie_df.id.value_counts(ascending=False)))
print(movie_df.id.describe())
movie_df.id.value_counts(ascending=False)

4799
count      4799.000000
mean      56899.920192
std       88236.500208
min           5.000000
25%        9012.500000
50%       14623.000000
75%       58461.500000
max      447027.000000
Name: id, dtype: float64


45054     1
109417    1
13187     1
8849      1
29339     1
286939    1
673       1
10368     1
14438     1
68202     1
8869      1
72358     1
299687    1
681       1
260778    1
4518      1
55306     1
41144     1
19084     1
621       1
50942     1
218       1
39538     1
4723      1
51828     1
13816     1
10707     1
37495     1
25209     1
332411    1
         ..
50839     1
9582      1
11631     1
13680     1
840       1
50546     1
11635     1
13688     1
1402      1
13685     1
9598      1
11619     1
9570      1
11615     1
27983     1
107846    1
25678     1
1777      1
15394     1
7501      1
9550      1
46420     1
9566      1
36739     1
1366      1
887       1
367961    1
9562      1
19803     1
65203     1
Name: id, Length: 4799, dtype: int64

In [28]:
# Keywords
print(len(movie_df.keywords.value_counts(ascending=False)))
print(movie_df.keywords.describe())
#movie_df.keywords.value_counts(ascending=False)
most_common, num_most_common = Counter(movie_df.keywords).most_common(1)[0]
most_common, num_most_common

4220
count     4799
unique    4220
top         []
freq       410
Name: keywords, dtype: object


('[]', 410)

*This column appears to have nested information that needs to be extracted, cleaned, then added back to the movie_df dataset. In addition, the most common value is '[]' (410 rows), an empty list that holds no information.*
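
A minimal sketch of how the nested values could be extracted, assuming each cell is a valid JSON string holding a list of objects with a `name` key (as the `genres` sample output suggests). The helper `extract_names` and the columns `keyword_names` and `has_keywords` are hypothetical names introduced here:

```python
import json

import pandas as pd

# Toy frame standing in for movie_df.keywords (two illustrative rows).
toy_df = pd.DataFrame({
    "keywords": [
        '[{"id": 1463, "name": "culture clash"}, {"id": 270, "name": "ocean"}]',
        "[]",
    ]
})


def extract_names(raw):
    """Parse a JSON-encoded list of objects and return their 'name' values."""
    return [item["name"] for item in json.loads(raw)]


# One Python list of names per row; empty lists stay empty.
toy_df["keyword_names"] = toy_df["keywords"].apply(extract_names)

# Flag rows whose list is empty so they can be handled separately.
toy_df["has_keywords"] = toy_df["keyword_names"].apply(len) > 0

print(toy_df[["keyword_names", "has_keywords"]])
```

The same helper would apply unchanged to the other nested columns, since they also carry a `name` field in the samples shown.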

In [17]:
# Original Language
print(len(movie_df.original_language.value_counts(ascending=False)))
print(movie_df.original_language.describe())
movie_df.original_language.value_counts(ascending=False)

37
count     4799
unique      37
top         en
freq      4503
Name: original_language, dtype: object


en    4503
fr      70
es      32
zh      27
de      26
hi      19
ja      16
it      13
cn      12
ru      11
ko      11
pt       9
da       7
sv       5
fa       4
nl       4
he       3
th       3
ar       2
cs       2
ta       2
id       2
ro       2
sl       1
el       1
is       1
hu       1
pl       1
ps       1
af       1
ky       1
tr       1
no       1
xx       1
nb       1
vi       1
te       1
Name: original_language, dtype: int64

In [20]:
# Overview
#movie_df.overview.value_counts(ascending=False)
print(len(movie_df.overview.value_counts(ascending=False)))
print(movie_df.overview.describe())
movie_df.overview[0]

4799
count                                                  4799
unique                                                 4799
top       When a half-Chechen, half-Russian, tortured ha...
freq                                                      1
Name: overview, dtype: object


'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.'

In [29]:
# Production_companies
print(len(movie_df.production_companies.value_counts(ascending=False)))
print(movie_df.production_companies.describe())
#movie_df.production_companies.value_counts(ascending=False)
most_common, num_most_common = Counter(movie_df.production_companies).most_common(1)[0]
most_common, num_most_common

3695
count     4799
unique    3695
top         []
freq       349
Name: production_companies, dtype: object


('[]', 349)

*This column appears to have nested information that needs to be extracted, cleaned, then added back to the movie_df dataset. In addition, the most common production_companies value is '[]' (349 rows), an empty list that holds no information.*

In [30]:
# Production_countries
print(len(movie_df.production_countries.value_counts(ascending=False)))
print(movie_df.production_countries.describe())
#movie_df.production_countries.value_counts(ascending=False)
most_common, num_most_common = Counter(movie_df.production_countries).most_common(1)[0]
most_common, num_most_common

469
count                                                  4799
unique                                                  469
top       [{"iso_3166_1": "US", "name": "United States o...
freq                                                   2977
Name: production_countries, dtype: object


('[{"iso_3166_1": "US", "name": "United States of America"}]', 2977)

*This column appears to have nested information that needs to be extracted, cleaned, then added back to the movie_df dataset. In addition, 'US' on its own is the most common production country value, making up about 62% of the dataset (2977 out of 4799).*

In [23]:
# Runtime  
print(len(movie_df.runtime.value_counts(ascending=False)))
print(movie_df.runtime.describe())
movie_df.runtime.value_counts(ascending=False)

156
count    4799.000000
mean      106.903105
std        22.561305
min         0.000000
25%        94.000000
50%       103.000000
75%       118.000000
max       338.000000
Name: runtime, dtype: float64


90.0     163
100.0    149
98.0     140
97.0     133
95.0     123
99.0     119
94.0     116
96.0     115
101.0    114
93.0     113
104.0    109
92.0     107
91.0     106
105.0    106
106.0    105
110.0    101
102.0    100
103.0     99
107.0     99
108.0     94
88.0      90
89.0      87
120.0     85
109.0     83
113.0     79
87.0      77
111.0     74
114.0     73
112.0     72
115.0     72
        ... 
225.0      1
202.0      1
47.0       1
186.0      1
254.0      1
242.0      1
187.0      1
59.0       1
66.0       1
174.0      1
63.0       1
238.0      1
276.0      1
166.0      1
25.0       1
42.0       1
201.0      1
184.0      1
41.0       1
216.0      1
248.0      1
185.0      1
219.0      1
67.0       1
173.0      1
338.0      1
53.0       1
214.0      1
194.0      1
179.0      1
Name: runtime, Length: 156, dtype: int64

In [32]:
# Spoken_languages
print(len(movie_df.spoken_languages.value_counts(ascending=False)))
print(movie_df.spoken_languages.describe())
#movie_df.spoken_languages.value_counts(ascending=False)
most_common, num_most_common = Counter(movie_df.spoken_languages).most_common(1)[0]
most_common, num_most_common

544
count                                         4799
unique                                         544
top       [{"iso_639_1": "en", "name": "English"}]
freq                                          3170
Name: spoken_languages, dtype: object


('[{"iso_639_1": "en", "name": "English"}]', 3170)

*This column appears to have nested information that needs to be extracted, cleaned, then added back to the movie_df dataset. In addition, 'English' on its own is the most common spoken language value, making up about 66% of the dataset (3170 out of 4799).*
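
Shares like the one quoted above can be read directly from `value_counts(normalize=True)`, which returns fractions instead of raw counts. A small sketch on made-up values (the `toy_languages` series is hypothetical):

```python
import pandas as pd

# Toy series standing in for a categorical column of movie_df.
toy_languages = pd.Series(
    ["en", "en", "en", "en|es", "fr", "en"], name="spoken_languages"
)

# normalize=True divides each count by the number of non-null rows.
shares = toy_languages.value_counts(normalize=True)

top_value = shares.index[0]
top_share = shares.iloc[0]

print(f"{top_value}: {top_share:.0%}")
```

For the counts reported in the output above, 3170 / 4799 is roughly 0.66.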

In [43]:
# Status
print(len(movie_df.status.value_counts(ascending=False)))
print(movie_df.status.describe())
print(movie_df.status.value_counts(ascending=False))
most_common, num_most_common = Counter(movie_df.status).most_common(1)[0]
most_common, num_most_common

3
count         4725
unique           3
top       Released
freq          4719
Name: status, dtype: object
Released           4719
Post Production       3
Rumored               3
Name: status, dtype: int64


('Released', 4719)

In [39]:
# Release Date
print(len(movie_df.release_date_dt.value_counts(ascending=False)))
print(movie_df.release_date_dt.describe())
#movie_df.release_date_dt.value_counts(ascending=False) 
most_common, num_most_common = Counter(movie_df.release_date_dt).most_common(1)[0]
least_common, num_least_common = Counter(movie_df.release_date_dt).most_common()[-1]
most_common, num_most_common, least_common, num_least_common

3253
count                    4725
unique                   3253
top       2006-01-01 00:00:00
freq                        8
first     1969-01-01 00:00:00
last      2068-12-21 00:00:00
Name: release_date_dt, dtype: object


(Timestamp('2006-01-01 00:00:00'), 8, Timestamp('2012-05-03 00:00:00'), 1)

*The dates range from 1969 to 2068. Revenue cannot have been recorded for movies released in the future, and this span matches exactly the window that two-digit years are mapped to when parsed (69-99 become 1969-1999, 00-68 become 2000-2068). The late dates are therefore most likely older releases pushed forward a century rather than genuine future releases. Before moving forward, the movies dated 2020 or later are dropped.*
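
A minimal sketch of an alternative to dropping: if the late dates come from two-digit years being resolved into the 1969 to 2068 window, they could be shifted back a century instead. It assumes every parsed date after a hypothetical cutoff of 2017-12-31 is such an artifact, which has not been verified against the dataset. `raw_dates` and `CUTOFF` are illustrative, and the last two sample dates are made up:

```python
import pandas as pd

# Date strings in the m/d/yy style seen in release_date.
raw_dates = pd.Series(["12/10/09", "5/19/07", "3/6/27", "1/1/69"])

# %y resolves two-digit years into the 1969-2068 window.
parsed = pd.to_datetime(raw_dates, format="%m/%d/%y")

# Hypothetical cutoff: nothing in the data should be released after this.
CUTOFF = pd.Timestamp("2017-12-31")

# Keep dates on or before the cutoff; move later ones back 100 years.
corrected = parsed.where(parsed <= CUTOFF, parsed - pd.DateOffset(years=100))

print(pd.DataFrame({"parsed": parsed, "corrected": corrected}))
```

This keeps the rows in the dataset, at the cost of trusting the cutoff; releases from 1968 or earlier that land before the cutoff after parsing (for example a 1915 film read as 2015) would still be misdated.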

In [50]:
movie_df = movie_df.loc[movie_df.release_date_dt < '2020-01-01 00:00:00']
print(len(movie_df.release_date_dt.value_counts(ascending=False)))
print(movie_df.release_date_dt.describe())

3122
count                    4593
unique                   3122
top       2006-01-01 00:00:00
freq                        8
first     1969-01-01 00:00:00
last      2017-02-03 00:00:00
Name: release_date_dt, dtype: object


This looks much better. It appears data collection for this dataset stopped in early 2017: the latest release date is 2017-02-03, and no movies released in 2018 or 2019 are present even though the filter allows dates through 2019.

### Drop rows that meet all 5 of the following conditions:
    1. movie_df.revenue == 0
    2. movie_df.budget == 0
    3. movie_df.keywords == "[]"
    4. movie_df.production_companies == "[]"
    5. movie_df.tagline.isnull() == True
A row that meets all 5 of these conditions carries almost no usable information and would only weaken a model built to predict a movie's revenue; thus, these rows should be removed entirely.

In [51]:
# Drop rows that meet all five sub-optimal conditions
uninformative_rows = (
    (movie_df.revenue == 0)
    & (movie_df.budget == 0)
    & (movie_df.keywords == "[]")
    & (movie_df.production_companies == "[]")
    & movie_df.tagline.isnull()
)
movie_df = movie_df.drop(movie_df[uninformative_rows].index)
len(movie_df)

4593

In [52]:
# Store dataframe globally
%store movie_df

Stored 'movie_df' (DataFrame)
