This project aims to predict movie revenue from production budget. The data was extracted from the website 'https://www.the-numbers.com'.

# Imports

In [38]:
# data wrangling
import pandas as pd

# data visualization
import plotly.express as px

# machine learning
from sklearn.linear_model import LinearRegression

# Get Data

In [6]:
# read the data
df = pd.read_csv('cost_revenue.csv')

# Clean the data

In [18]:
# displaying the first and last rows of the dataframe
df

Unnamed: 0,Rank,Release Date,Movie Title,Production Budget ($),Worldwide Gross ($),Domestic Gross ($)
0,5293,8/2/1915,The Birth of a Nation,"$110,000","$11,000,000","$10,000,000"
1,5140,5/9/1916,Intolerance,"$385,907",$0,$0
2,5230,12/24/1916,"20,000 Leagues Under the Sea","$200,000","$8,000,000","$8,000,000"
3,5299,9/17/1920,Over the Hill to the Poorhouse,"$100,000","$3,000,000","$3,000,000"
4,5222,1/1/1925,The Big Parade,"$245,000","$22,000,000","$11,000,000"
...,...,...,...,...,...,...
5386,2950,10/8/2018,Meg,"$15,000,000",$0,$0
5387,126,12/18/2018,Aquaman,"$160,000,000",$0,$0
5388,96,12/31/2020,Singularity,"$175,000,000",$0,$0
5389,1119,12/31/2020,Hannibal the Conqueror,"$50,000,000",$0,$0


Movies whose Worldwide Gross and Domestic Gross are both zero either had not been released yet or were canceled during production. Since our goal is to predict revenue, these movies are dropped.

In [23]:
# filtering the gross columns and storing this in another dataframe 
df2 = df[(df['Worldwide Gross ($)'] != '$0') | (df['Domestic Gross ($)'] != '$0')]
df2

Unnamed: 0,Rank,Release Date,Movie Title,Production Budget ($),Worldwide Gross ($),Domestic Gross ($)
0,5293,8/2/1915,The Birth of a Nation,"$110,000","$11,000,000","$10,000,000"
2,5230,12/24/1916,"20,000 Leagues Under the Sea","$200,000","$8,000,000","$8,000,000"
3,5299,9/17/1920,Over the Hill to the Poorhouse,"$100,000","$3,000,000","$3,000,000"
4,5222,1/1/1925,The Big Parade,"$245,000","$22,000,000","$11,000,000"
5,4250,12/30/1925,Ben-Hur,"$3,900,000","$9,000,000","$9,000,000"
...,...,...,...,...,...,...
5378,914,10/2/2017,Fifty Shades Darker,"$55,000,000","$376,856,949","$114,434,010"
5379,1295,10/2/2017,John Wick: Chapter Two,"$40,000,000","$166,893,990","$92,029,184"
5380,70,10/3/2017,Kong: Skull Island,"$185,000,000","$561,137,727","$168,052,812"
5381,94,12/5/2017,King Arthur: Legend of the Sword,"$175,000,000","$140,012,608","$39,175,066"
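The filter above keeps a row when at least one gross column is non-zero, which is the same as dropping rows where both are zero. A minimal sketch of that equivalence follows; the three sample rows and their titles are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Illustrative rows only (hypothetical titles and values, not from the dataset)
sample = pd.DataFrame({
    'Movie Title': ['Released film', 'Unreleased film', 'Domestic-only film'],
    'Worldwide Gross ($)': ['$11,000,000', '$0', '$0'],
    'Domestic Gross ($)': ['$10,000,000', '$0', '$500,000'],
})

worldwide_zero = sample['Worldwide Gross ($)'] == '$0'
domestic_zero = sample['Domestic Gross ($)'] == '$0'

# "Drop" form: remove a row only when both gross columns are zero
kept = sample[~(worldwide_zero & domestic_zero)]

# "Keep" form: retain a row when at least one gross column is non-zero
kept_alt = sample[~worldwide_zero | ~domestic_zero]

print(kept['Movie Title'].tolist())
```

Note that a row with a zero worldwide gross but a non-zero domestic gross survives this filter. Because the target is the worldwide gross, a stricter filter on that column alone might be worth considering.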


The target column is the worldwide gross, which represents the revenue, and the production budget is the feature used to predict it. The other columns won't be used in the machine learning model, so we drop them.

In [25]:
# selecting feature and target columns and storing into a dataframe
df3 = df2[['Production Budget ($)', 'Worldwide Gross ($)']]
df3

Unnamed: 0,Production Budget ($),Worldwide Gross ($)
0,"$110,000","$11,000,000"
2,"$200,000","$8,000,000"
3,"$100,000","$3,000,000"
4,"$245,000","$22,000,000"
5,"$3,900,000","$9,000,000"
...,...,...
5378,"$55,000,000","$376,856,949"
5379,"$40,000,000","$166,893,990"
5380,"$185,000,000","$561,137,727"
5381,"$175,000,000","$140,012,608"


It is also important to remove the dollar signs and commas from our data because the machine learning model only understands numbers.

In [26]:
# removing special characters and storing into a dataframe
df4 = df3.replace({r'\$': '', ',': ''}, regex=True)
df4

Unnamed: 0,Production Budget ($),Worldwide Gross ($)
0,110000,11000000
2,200000,8000000
3,100000,3000000
4,245000,22000000
5,3900000,9000000
...,...,...
5378,55000000,376856949
5379,40000000,166893990
5380,185000000,561137727
5381,175000000,140012608


In [27]:

# renaming columns to be easier to handle
df5 = df4.rename(columns = {'Production Budget ($)':'production_budget_usd',
                            'Worldwide Gross ($)':'worldwide_gross_usd'
})
df5

Unnamed: 0,production_budget_usd,worldwide_gross_usd
0,110000,11000000
2,200000,8000000
3,100000,3000000
4,245000,22000000
5,3900000,9000000
...,...,...
5378,55000000,376856949
5379,40000000,166893990
5380,185000000,561137727
5381,175000000,140012608


In [29]:
# describe the type of each column
df5.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5034 entries, 0 to 5382
Data columns (total 2 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   production_budget_usd  5034 non-null   object
 1   worldwide_gross_usd    5034 non-null   object
dtypes: object(2)
memory usage: 118.0+ KB


As the output shows, both columns have the object type. They need to be converted to integers so the model can work with the data.

In [31]:
# changing data types from object to int
df5['production_budget_usd'] = df5['production_budget_usd'].astype('int64')
df5['worldwide_gross_usd'] = df5['worldwide_gross_usd'].astype('int64')

# checking data types
df5.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5034 entries, 0 to 5382
Data columns (total 2 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   production_budget_usd  5034 non-null   int64
 1   worldwide_gross_usd    5034 non-null   int64
dtypes: int64(2)
memory usage: 118.0 KB
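The character removal and the type conversion can also be done in one step per column. The sketch below assumes every value is a string such as `$110,000`; `currency_to_int` is a hypothetical helper name, not something defined in this notebook.

```python
import pandas as pd

def currency_to_int(series: pd.Series) -> pd.Series:
    """Strip '$' and ',' from currency strings and cast the result to int64."""
    # Inside a character class, '$' is a literal dollar sign
    cleaned = series.str.replace(r'[$,]', '', regex=True)
    return cleaned.astype('int64')

# Budget strings copied from the first rows of the table shown earlier
budgets = pd.Series(['$110,000', '$200,000', '$100,000'])
converted = currency_to_int(budgets)

print(converted.tolist())
print(converted.dtype)
```

A helper like this would raise an error on any value that is not a clean number after stripping, which can be useful for spotting unexpected entries early.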


# Exploratory Data Analysis and Visualization

In this step, we evaluate the dependency between the two columns and visualize the data.

In [34]:
# plotting heatmap to analyze the correlation between feature and target values
fig = px.imshow(df5.corr().round(3), text_auto=True, template = 'plotly_dark')

fig.update_layout(
    title = {
        'text': 'Heatmap of variables'},
    font_family="Arial",
    font_color="White",
    font=dict(size = 18),
    title_font_family="Arial",
    title_font_color= "White")

fig.show()

The heatmap shows Pearson's correlation between the variables. They're strongly correlated, which suggests that a linear regression algorithm could be used to predict the target column.
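For reference, Pearson's correlation can be computed directly from its definition. The sketch below is a plain-Python illustration; `pearson_r` is a hypothetical helper, and the inputs are only the first five cleaned rows shown earlier, so the printed value is not representative of the full dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# First five cleaned rows from the table above (a tiny, non-representative sample)
budgets = [110000, 200000, 100000, 245000, 3900000]
grosses = [11000000, 8000000, 3000000, 22000000, 9000000]

print(round(pearson_r(budgets, grosses), 3))
```

The value always lies between -1 and 1, and `df5.corr()` uses the same measure by default.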

In [37]:
# plotting scatter chart between variables
fig = px.scatter(data_frame = df5, x = 'production_budget_usd',
                y = 'worldwide_gross_usd', template = 'plotly_dark')

fig.update_layout(
    title = {
        'text': 'Film cost vs global revenue'},
    xaxis_title = 'Production budget',
    yaxis_title = 'Worldwide gross',
    font_family = "Arial",
    font_color = "White",
    font = dict(size = 18),
    title_font_family = "Arial",
    title_font_color = "White")

fig.show()

# Building the model

In [40]:
# separating the feature (X) from the target (y)
X = df5[['production_budget_usd']]
y = df5[['worldwide_gross_usd']]

In [41]:
# creating the regressor
regression = LinearRegression()

# fitting model
regression.fit(X, y)

LinearRegression()

In [45]:
# obtaining slope coefficient
a = regression.coef_

# obtaining intercept coefficient
b = regression.intercept_

print('the slope coefficient is:', a.round(3))
print('the intercept coefficient is:', b.round(3))

the slope coefficient is: [[3.112]]
the intercept coefficient is: [-7236192.729]
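These two numbers define the fitted line: predicted gross = slope × budget + intercept. As a rough worked example, the sketch below plugs the rounded values printed above into that formula; `predict_gross` is a hypothetical helper, and the result differs slightly from `regression.predict` because of the rounding.

```python
# Rounded coefficients taken from the printed output above
SLOPE = 3.112
INTERCEPT = -7236192.729

def predict_gross(budget_usd: float) -> float:
    """Predicted worldwide gross from the fitted line y = SLOPE * x + INTERCEPT."""
    return SLOPE * budget_usd + INTERCEPT

# A $50,000,000 budget: 3.112 * 50,000,000 - 7,236,192.729
print(round(predict_gross(50_000_000)))  # 148363807
```

Because the intercept is negative, budgets below roughly $2.3 million produce a negative predicted gross, which is one sign that a straight line fits low-budget films poorly.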


In [52]:
# plotting scatter chart with an OLS trendline (plotly's 'ols' trendline requires statsmodels)
fig = px.scatter(data_frame = df5, x = 'production_budget_usd',
                y = 'worldwide_gross_usd', template = 'plotly_dark',
                trendline = 'ols', trendline_color_override = 'red')

fig.update_layout(
    title = {
        'text': 'Film cost vs global revenue'},
    xaxis_title = 'Production budget',
    yaxis_title = 'Worldwide gross',
    font_family = "Arial",
    font_color = "White",
    font = dict(size = 18),
    title_font_family = "Arial",
    title_font_color = "White")

fig.show()

## Evaluating the model

In [54]:
regression.score(X, y)

0.5496485356985729
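For a regressor, `score` returns the coefficient of determination R², so the budget explains about 55% of the variance in worldwide gross. The score here is measured on the same data the model was fitted on. A minimal plain-Python sketch of how R² is defined follows; `r_squared` is a hypothetical helper and the numbers are illustrative only.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot

# Illustrative values, not taken from the dataset
observed = [1.0, 2.0, 3.0, 4.0]
perfect = [1.0, 2.0, 3.0, 4.0]      # predictions that match exactly
mean_only = [2.5, 2.5, 2.5, 2.5]    # always predicting the mean

print(r_squared(observed, perfect))    # 1.0
print(r_squared(observed, mean_only))  # 0.0
```

A model that always predicts the mean scores 0 and a perfect model scores 1, so 0.55 sits roughly halfway between the two.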
