The first input cell imports the libraries needed for cleaning and plotting our data, plus the "%matplotlib inline" line, a built-in IPython magic command that makes plots appear and be stored within the notebook.

Next we load the tn.movie_budgets.csv.gz file and take a quick look at its first and last 5 rows via .head() and .tail() to start defining a cleaning workflow.

In [1]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
df_tn = pd.read_csv('./zippedData/tn.movie_budgets.csv.gz')
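
As a small aside on the loading step: pd.read_csv infers gzip compression from the ".gz" suffix by default, so no manual decompression is needed. The sketch below illustrates this on a made-up one-row file written to a temporary directory; the file name sample_budgets.csv.gz and the variable df_sample are assumptions for illustration only.

```python
import gzip
import os
import tempfile

import pandas as pd

# Hypothetical miniature of the budgets file (one row, four columns).
sample = (
    "id,release_date,movie,production_budget\n"
    '1,"Dec 18, 2009",Avatar,"$425,000,000"\n'
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_budgets.csv.gz")
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        fh.write(sample)
    # compression="infer" is the default, so the .gz suffix is enough.
    df_sample = pd.read_csv(path)

print(df_sample)
```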

In [3]:
df_tn.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"


In [4]:
df_tn.tail()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
5777,78,"Dec 31, 2018",Red 11,"$7,000",$0,$0
5778,79,"Apr 2, 1999",Following,"$6,000","$48,482","$240,495"
5779,80,"Jul 13, 2005",Return to the Land of Wonders,"$5,000","$1,338","$1,338"
5780,81,"Sep 29, 2015",A Plague So Pleasant,"$1,400",$0,$0
5781,82,"Aug 5, 2005",My Date With Drew,"$1,100","$181,041","$181,041"


Looking at head() and tail(), the first clear changes are to remove the "$" and "," characters from the production_budget, domestic_gross and worldwide_gross columns, and to drop the redundant id column.

In [5]:
df_tn.drop('id', axis=1, inplace=True)

In [6]:
# regex=False treats "$" as a literal character rather than a regex anchor
df_tn.production_budget = df_tn['production_budget'].str.replace("$", "", regex=False)
df_tn.production_budget = df_tn['production_budget'].str.replace(",", "", regex=False)
df_tn.domestic_gross = df_tn['domestic_gross'].str.replace("$", "", regex=False)
df_tn.domestic_gross = df_tn['domestic_gross'].str.replace(",", "", regex=False)
df_tn.worldwide_gross = df_tn['worldwide_gross'].str.replace("$", "", regex=False)
df_tn.worldwide_gross = df_tn['worldwide_gross'].str.replace(",", "", regex=False)

This could have been done in one line, but for some reason it wasn't working for me. I restarted the kernel and the error persisted, so I decided to do it separately rather than lose any more time on this.
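
For reference, one possible single-statement version is sketched below on a made-up two-row frame. It is not a diagnosis of the error above, only an alternative that may work: DataFrame.replace with regex=True and the character class [$,] strips both characters from all three columns at once. The names df_demo and money_cols are illustrative and not part of the notebook.

```python
import pandas as pd

# Hypothetical two-row frame in the same string format as df_tn.
df_demo = pd.DataFrame({
    "production_budget": ["$425,000,000", "$7,000"],
    "domestic_gross": ["$760,507,625", "$0"],
    "worldwide_gross": ["$2,776,345,279", "$0"],
})

money_cols = ["production_budget", "domestic_gross", "worldwide_gross"]

# Inside a character class "$" is matched literally, not as end-of-string.
df_demo[money_cols] = df_demo[money_cols].replace(r"[$,]", "", regex=True)

print(df_demo)
```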

In [7]:
df_tn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 5 columns):
release_date         5782 non-null object
movie                5782 non-null object
production_budget    5782 non-null object
domestic_gross       5782 non-null object
worldwide_gross      5782 non-null object
dtypes: object(5)
memory usage: 226.0+ KB


In [8]:
df_tn.tail()

Unnamed: 0,release_date,movie,production_budget,domestic_gross,worldwide_gross
5777,"Dec 31, 2018",Red 11,7000,0,0
5778,"Apr 2, 1999",Following,6000,48482,240495
5779,"Jul 13, 2005",Return to the Land of Wonders,5000,1338,1338
5780,"Sep 29, 2015",A Plague So Pleasant,1400,0,0
5781,"Aug 5, 2005",My Date With Drew,1100,181041,181041


In [9]:
df_tn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 5 columns):
release_date         5782 non-null object
movie                5782 non-null object
production_budget    5782 non-null object
domestic_gross       5782 non-null object
worldwide_gross      5782 non-null object
dtypes: object(5)
memory usage: 226.0+ KB


The .info() output shows that the columns production_budget, domestic_gross and worldwide_gross have dtype object when they should be int64. This will be fixed before we continue:

In [10]:
df_tn['domestic_gross'] = pd.to_numeric(df_tn['domestic_gross'])
df_tn['worldwide_gross'] = pd.to_numeric(df_tn['worldwide_gross'])
df_tn['production_budget'] = pd.to_numeric(df_tn['production_budget'])
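
If a column might still contain a non-numeric string, pd.to_numeric raises an error by default. A minimal sketch of a more defensive variant, assuming a made-up series with one bad entry ("n/a" does not come from the dataset): errors="coerce" turns unparseable values into NaN so they can be counted before deciding how to handle them.

```python
import pandas as pd

# "n/a" is a made-up bad entry used only to show the behaviour.
raw = pd.Series(["425000000", "7000", "n/a"])

# Unparseable strings become NaN instead of raising a ValueError.
converted = pd.to_numeric(raw, errors="coerce")
n_failed = int(converted.isna().sum())

print(converted)
print("values that failed to parse:", n_failed)
```

Note that a column containing NaN ends up as float64 rather than int64.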

In [11]:
df_tn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 5 columns):
release_date         5782 non-null object
movie                5782 non-null object
production_budget    5782 non-null int64
domestic_gross       5782 non-null int64
worldwide_gross      5782 non-null int64
dtypes: int64(3), object(2)
memory usage: 226.0+ KB


The release_date column holds strings such as "Dec 18, 2009", so it will be converted to the standard pandas datetime64 type.

In [12]:
df_tn['release_date'] = pd.to_datetime(df_tn.release_date)
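
The call above lets pandas infer the date layout. As a sketch, the layout can also be spelled out with the format argument, which makes the parse explicit; "%b %d, %Y" is our reading of strings like "Dec 18, 2009" and assumes English month abbreviations.

```python
import pandas as pd

# Two date strings copied from the head() output shown earlier.
dates = pd.Series(["Dec 18, 2009", "Jun 7, 2019"])

# %b = abbreviated month name, %d = day of month, %Y = four-digit year.
parsed = pd.to_datetime(dates, format="%b %d, %Y")

print(parsed)
```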

With these last modifications, df_tn is ready to be checked for duplicates and summarized statistically. First let's look at the df and its dimensions, then check for missing values and duplicates, and finish with summary statistics, which will help identify possible outliers.

In [13]:
df_tn.head()

Unnamed: 0,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,2009-12-18,Avatar,425000000,760507625,2776345279
1,2011-05-20,Pirates of the Caribbean: On Stranger Tides,410600000,241063875,1045663875
2,2019-06-07,Dark Phoenix,350000000,42762350,149762350
3,2015-05-01,Avengers: Age of Ultron,330600000,459005868,1403013963
4,2017-12-15,Star Wars Ep. VIII: The Last Jedi,317000000,620181382,1316721747


In [14]:
df_tn.shape

(5782, 5)

In [15]:
df_tn.isna().any() # As the results suggested above, no missing values. 

release_date         False
movie                False
production_budget    False
domestic_gross       False
worldwide_gross      False
dtype: bool

In [16]:
duplicates = df_tn[df_tn.duplicated()]
print(len(duplicates))

0
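
df_tn.duplicated() only flags rows that match in every column. A possible extra check, sketched on made-up data, is to look for repeats on a subset of key columns; the choice of movie plus release_date as the key is an assumption, not something the notebook does.

```python
import pandas as pd

# Hypothetical frame: the first two rows differ only in the budget.
df_demo = pd.DataFrame({
    "release_date": ["2009-12-18", "2009-12-18", "1997-12-19"],
    "movie": ["Avatar", "Avatar", "Titanic"],
    "production_budget": [425000000, 425000001, 200000000],
})

# Full-row comparison finds nothing because the budgets differ.
full_dupes = int(df_demo.duplicated().sum())

# Comparing only the key columns flags the second "Avatar" row.
key_dupes = int(df_demo.duplicated(subset=["movie", "release_date"]).sum())

print(full_dupes, key_dupes)
```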


In [17]:
df_tn.describe()

Unnamed: 0,production_budget,domestic_gross,worldwide_gross
count,5782.0,5782.0,5782.0
mean,31587760.0,41873330.0,91487460.0
std,41812080.0,68240600.0,174720000.0
min,1100.0,0.0,0.0
25%,5000000.0,1429534.0,4125415.0
50%,17000000.0,17225940.0,27984450.0
75%,40000000.0,52348660.0,97645840.0
max,425000000.0,936662200.0,2776345000.0
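
One common way to turn these quartiles into an outlier check is the 1.5 × IQR rule. The sketch below is an illustration only and is not applied to df_tn in this notebook; the five values are the min, 25%, 50%, 75% and max of production_budget from the table above, used as a toy series.

```python
import pandas as pd

# Toy series built from the five-number summary of production_budget.
budgets = pd.Series([1_100, 5_000_000, 17_000_000, 40_000_000, 425_000_000])

q1, q3 = budgets.quantile([0.25, 0.75])
iqr = q3 - q1

# Values above q3 + 1.5 * IQR are flagged as possible high outliers.
upper_fence = q3 + 1.5 * iqr
outliers = budgets[budgets > upper_fence]

print(upper_fence)
print(outliers)
```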


In [18]:
df_tn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 5 columns):
release_date         5782 non-null datetime64[ns]
movie                5782 non-null object
production_budget    5782 non-null int64
domestic_gross       5782 non-null int64
worldwide_gross      5782 non-null int64
dtypes: datetime64[ns](1), int64(3), object(1)
memory usage: 226.0+ KB

The describe() output shows a minimum of 0 for both domestic_gross and worldwide_gross. In the next cells, each 0 is replaced with the mean of the non-zero values of its column, starting with domestic_gross.


In [19]:
df_tn.loc[df_tn['domestic_gross'] == 0]

Unnamed: 0,release_date,movie,production_budget,domestic_gross,worldwide_gross
194,2020-12-31,Moonfall,150000000,0,0
479,2017-12-13,Bright,90000000,0,0
480,2019-12-31,Army of the Dead,90000000,0,0
535,2020-02-21,Call of the Wild,82000000,0,0
617,2012-12-31,Astérix et Obélix: Au service de Sa Majesté,77600000,0,60680125
...,...,...,...,...,...
5761,2014-12-31,Stories of Our Lives,15000,0,0
5764,2007-12-31,Tin Can Man,12000,0,0
5771,2015-05-19,Family Motocross,10000,0,0
5777,2018-12-31,Red 11,7000,0,0


In [20]:
temp = df_tn.loc[df_tn['domestic_gross'] != 0]

In [21]:
domestic_gross_mean = np.mean(temp.domestic_gross)
print(domestic_gross_mean)

46257465.79002675


In [22]:
domestic_gross_mean = domestic_gross_mean.astype(int)
print(type(domestic_gross_mean))

<class 'numpy.int64'>


In [23]:
df_tn.loc[df_tn['domestic_gross'] == 0,'domestic_gross'] = domestic_gross_mean

In [24]:
df_tn.loc[df_tn['domestic_gross'] == 0]

Unnamed: 0,release_date,movie,production_budget,domestic_gross,worldwide_gross


In [25]:
df_tn.tail()

Unnamed: 0,release_date,movie,production_budget,domestic_gross,worldwide_gross
5777,2018-12-31,Red 11,7000,46257465,0
5778,1999-04-02,Following,6000,48482,240495
5779,2005-07-13,Return to the Land of Wonders,5000,1338,1338
5780,2015-09-29,A Plague So Pleasant,1400,46257465,0
5781,2005-08-05,My Date With Drew,1100,181041,181041


In [26]:
df_tn.loc[df_tn['worldwide_gross'] == 0]

Unnamed: 0,release_date,movie,production_budget,domestic_gross,worldwide_gross
194,2020-12-31,Moonfall,150000000,46257465,0
479,2017-12-13,Bright,90000000,46257465,0
480,2019-12-31,Army of the Dead,90000000,46257465,0
535,2020-02-21,Call of the Wild,82000000,46257465,0
670,2019-08-30,PLAYMOBIL,75000000,46257465,0
...,...,...,...,...,...
5761,2014-12-31,Stories of Our Lives,15000,46257465,0
5764,2007-12-31,Tin Can Man,12000,46257465,0
5771,2015-05-19,Family Motocross,10000,46257465,0
5777,2018-12-31,Red 11,7000,46257465,0


In [27]:
temp1 = df_tn.loc[df_tn['worldwide_gross'] != 0]

In [28]:
worldwide_gross_mean = np.mean(temp1.worldwide_gross)
print(worldwide_gross_mean)

97687996.11468144


In [29]:
worldwide_gross_mean = worldwide_gross_mean.astype(int)
print(type(worldwide_gross_mean))

<class 'numpy.int64'>


In [30]:
df_tn.loc[df_tn['worldwide_gross'] == 0,'worldwide_gross'] = worldwide_gross_mean
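
The same zero-replacement steps were applied to both gross columns. As a sketch, they could be wrapped in a small helper; fill_zeros_with_nonzero_mean is a hypothetical name, and the toy values are the domestic_gross figures from the tail() output shown earlier.

```python
import pandas as pd


def fill_zeros_with_nonzero_mean(series):
    """Return a copy where zeros are replaced by the truncated mean of the non-zero values."""
    nonzero_mean = int(series[series != 0].mean())
    # mask() replaces the entries where the condition is True.
    return series.mask(series == 0, nonzero_mean)


gross = pd.Series([0, 48482, 1338, 0, 181041])
filled = fill_zeros_with_nonzero_mean(gross)

print(filled.tolist())
```

On df_tn this would be called once per column, for example on df_tn['domestic_gross'] and df_tn['worldwide_gross'].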

In [31]:
df_tn.loc[df_tn['worldwide_gross'] == 0]

Unnamed: 0,release_date,movie,production_budget,domestic_gross,worldwide_gross


In [32]:
df_tn.tail()

Unnamed: 0,release_date,movie,production_budget,domestic_gross,worldwide_gross
5777,2018-12-31,Red 11,7000,46257465,97687996
5778,1999-04-02,Following,6000,48482,240495
5779,2005-07-13,Return to the Land of Wonders,5000,1338,1338
5780,2015-09-29,A Plague So Pleasant,1400,46257465,97687996
5781,2005-08-05,My Date With Drew,1100,181041,181041


In [33]:
df_tn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 5 columns):
release_date         5782 non-null datetime64[ns]
movie                5782 non-null object
production_budget    5782 non-null int64
domestic_gross       5782 non-null int64
worldwide_gross      5782 non-null int64
dtypes: datetime64[ns](1), int64(3), object(1)
memory usage: 226.0+ KB


At this point the df is clean, but we decided to add two more columns: foreign_gross, calculated as worldwide_gross minus domestic_gross, and profit, calculated as worldwide_gross minus production_budget.

In [34]:
# foreign_gross
df_tn['foreign_gross'] = df_tn['worldwide_gross'] - df_tn['domestic_gross']

In [35]:
# profit
df_tn['profit'] = df_tn['worldwide_gross'] - df_tn['production_budget']
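
As a quick worked check of the two formulas, the sketch below applies them to a single row holding the Avatar figures from the head() output; the one-row frame named row is only for illustration.

```python
import pandas as pd

# Avatar's figures, copied from the cleaned head() output.
row = pd.DataFrame({
    "production_budget": [425000000],
    "domestic_gross": [760507625],
    "worldwide_gross": [2776345279],
})

row["foreign_gross"] = row["worldwide_gross"] - row["domestic_gross"]
row["profit"] = row["worldwide_gross"] - row["production_budget"]

print(row[["foreign_gross", "profit"]])
```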

In [36]:
df_tn.tail()

Unnamed: 0,release_date,movie,production_budget,domestic_gross,worldwide_gross,foreign_gross,profit
5777,2018-12-31,Red 11,7000,46257465,97687996,51430531,97680996
5778,1999-04-02,Following,6000,48482,240495,192013,234495
5779,2005-07-13,Return to the Land of Wonders,5000,1338,1338,0,-3662
5780,2015-09-29,A Plague So Pleasant,1400,46257465,97687996,51430531,97686596
5781,2005-08-05,My Date With Drew,1100,181041,181041,0,179941


Before displaying the results we will also rename the column 'movie' to 'movie_title' to be consistent with the other dataframes and make the visualization of the map easier for us.

In [37]:
df_tn = df_tn.rename(columns = {'movie': 'movie_title'})

In [38]:
df_tn.sort_values(by=['profit'], inplace=True, ascending=False)
df_tn.head(10)

Unnamed: 0,release_date,movie_title,production_budget,domestic_gross,worldwide_gross,foreign_gross,profit
0,2009-12-18,Avatar,425000000,760507625,2776345279,2015837654,2351345279
42,1997-12-19,Titanic,200000000,659363944,2208208395,1548844451,2008208395
6,2018-04-27,Avengers: Infinity War,300000000,678815482,2048134200,1369318718,1748134200
5,2015-12-18,Star Wars Ep. VII: The Force Awakens,306000000,936662225,2053311220,1116648995,1747311220
33,2015-06-12,Jurassic World,215000000,652270625,1648854864,996584239,1433854864
66,2015-04-03,Furious 7,190000000,353007020,1518722794,1165715774,1328722794
26,2012-05-04,The Avengers,225000000,623279547,1517935897,894656350,1292935897
260,2011-07-15,Harry Potter and the Deathly Hallows: Part II,125000000,381193157,1341693157,960500000,1216693157
41,2018-02-16,Black Panther,200000000,700059566,1348258224,648198658,1148258224
112,2018-06-22,Jurassic World: Fallen Kingdom,170000000,417719760,1305772799,888053039,1135772799
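
If only the top rows are needed and the frame itself should keep its original order, DataFrame.nlargest is a possible alternative to sorting in place. A minimal sketch on a made-up three-row frame, with values taken from the outputs above:

```python
import pandas as pd

# Hypothetical miniature of the cleaned frame.
df_demo = pd.DataFrame({
    "movie_title": ["Following", "Avatar", "Titanic"],
    "profit": [234495, 2351345279, 2008208395],
})

# Returns the n rows with the largest profit without reordering df_demo.
top2 = df_demo.nlargest(2, "profit")

print(top2)
```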


In [39]:
df_tn.to_csv('./Cleaned_CSV_files/tn_movies_budgets.csv', encoding='utf-8')
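
to_csv fails if the target folder does not exist, and dates are written out as plain text. The sketch below shows one way to handle both, using a temporary directory and a made-up one-row frame; index=False and parse_dates are choices made for this example, whereas the cell above keeps the index.

```python
import os
import tempfile

import pandas as pd

# Hypothetical one-row version of the cleaned frame.
df_demo = pd.DataFrame({
    "release_date": pd.to_datetime(["2009-12-18"]),
    "movie_title": ["Avatar"],
    "profit": [2351345279],
})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "Cleaned_CSV_files")
    os.makedirs(out_dir, exist_ok=True)  # create the folder if it is missing
    out_path = os.path.join(out_dir, "tn_movies_budgets.csv")

    df_demo.to_csv(out_path, index=False, encoding="utf-8")

    # parse_dates restores the datetime dtype when reading the file back.
    reloaded = pd.read_csv(out_path, parse_dates=["release_date"])

print(reloaded.dtypes)
```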

DataFrame df_tn HAS BEEN CLEANED!!!