The first input imports the libraries needed for cleaning and plotting our data, plus the "%matplotlib inline" command, a predefined IPython magic function that lets plots appear and be stored within the notebook.

This is followed by loading the bom.movie_gross.csv.gz file and taking a quick look at its first and last 5 rows via .head() and .tail(), to start defining a cleaning workflow.

In [1]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
df_bm = pd.read_csv('./zippedData/bom.movie_gross.csv.gz') 

In [3]:
df_bm.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


In [4]:
df_bm.tail()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
3382,The Quake,Magn.,6200.0,,2018
3383,Edward II (2018 re-release),FM,4800.0,,2018
3384,El Pacto,Sony,2500.0,,2018
3385,The Swan,Synergetic,2400.0,,2018
3386,An Actor Prepares,Grav.,1700.0,,2018


The .tail() output shows that the foreign_gross column contains NaN values that will need to be addressed. But before any decisions are made, more information about this df is needed.

The .shape attribute gives us the dimensions of the df, and .info() gives us a concise summary of it:

In [5]:
df_bm.shape

(3387, 5)

In [6]:
df_bm.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
title             3387 non-null object
studio            3382 non-null object
domestic_gross    3359 non-null float64
foreign_gross     2037 non-null object
year              3387 non-null int64
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


From the summary of the df above we can extract the following information:

- number of rows is 3387
- three columns have fewer non-null values than there are rows (studio, domestic_gross, foreign_gross)
- foreign_gross is defined as an object when it should in fact be a float64, like domestic_gross.
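
As a side note on the last point, here is a minimal sketch of how a numeric column stored as text could be converted to float64. It assumes the column is an object because some entries are strings containing thousands separators; the summary above does not confirm that, and the sample values below are hypothetical.

```python
import pandas as pd

# Hypothetical sample: numbers stored as text, one with a thousands separator.
foreign_gross = pd.Series(["652000000", "1,131.6", None], name="foreign_gross")

# Remove the separators, then coerce to numbers.
# errors="coerce" turns anything that still cannot be parsed into NaN.
cleaned = foreign_gross.str.replace(",", "", regex=False)
as_float = pd.to_numeric(cleaned, errors="coerce")

print(as_float)
```

Since foreign_gross is dropped later in this notebook, this conversion is not needed here; it would only matter if the column were kept.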

Let's quickly check the number of missing values per column:

In [7]:
df_bm.isna().sum()

title                0
studio               5
domestic_gross      28
foreign_gross     1350
year                 0
dtype: int64
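
The counts above are easier to judge as percentages of the total row count. A minimal sketch, using a small hypothetical frame (toy) in place of df_bm:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_bm.
toy = pd.DataFrame({
    "studio": ["BV", None, "WB", "FM"],
    "domestic_gross": [415000000.0, 6200.0, np.nan, np.nan],
})

# isna() gives a boolean frame; its column-wise mean is the fraction missing.
pct_missing = toy.isna().mean().mul(100).round(2)

print(pct_missing)
```

The same expression applied to df_bm would show each column's share of missing values directly.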

The "foreign_gross" column has by far the most NaNs (1350 of 3387 rows, about 40%), and it can easily be calculated from another table that contains considerably more rows, so we will drop it from this df entirely:

In [8]:
df_bm.drop(['foreign_gross'], axis = 1, inplace = True)

Columns "studio" and "domestic_gross" have very few NaNs percentage-wise (5 and 28 of 3387 rows, under 1% each), so the rows with a NaN in either column will be dropped:

In [9]:
df_bm.dropna(subset = ["studio", "domestic_gross"], inplace=True)

Now let's run a quick check to confirm that the previous code has eliminated the identified NaN values:

In [10]:
df_bm.isna().sum()

title             0
studio            0
domestic_gross    0
year              0
dtype: int64

In [11]:
df_bm.head() 

Unnamed: 0,title,studio,domestic_gross,year
0,Toy Story 3,BV,415000000.0,2010
1,Alice in Wonderland (2010),BV,334200000.0,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,2010
3,Inception,WB,292600000.0,2010
4,Shrek Forever After,P/DW,238700000.0,2010


Now that we have gotten rid of the NaN values, let's look at potential duplicates within the data:

In [12]:
duplicates = df_bm[df_bm.duplicated()]
print(len(duplicates))

0
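
Note that .duplicated() with no arguments only flags rows that match on every column. As a hedged sketch of a stricter check, the subset argument can restrict the comparison to chosen columns; the toy frame and the choice of "title" as the key are assumptions for illustration, not part of the original workflow.

```python
import pandas as pd

# Hypothetical toy frame: the same title appears twice with different years.
toy = pd.DataFrame({
    "title": ["Inception", "Inception", "The Swan"],
    "studio": ["WB", "WB", "Synergetic"],
    "domestic_gross": [292600000.0, 292600000.0, 2400.0],
    "year": [2010, 2011, 2018],
})

# Full-row comparison: no row is an exact copy of another.
full_row_dupes = int(toy.duplicated().sum())

# Title-only comparison: keep=False marks every row that shares a title.
title_dupes = int(toy.duplicated(subset=["title"], keep=False).sum())

print(full_row_dupes, title_dupes)
```

Whether repeated titles are true duplicates or legitimate re-releases would still need to be judged case by case.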


At this point the data has been cleaned. The last step we will perform is to rename the "title" column to "movie_title" so that it matches the SQL schema better:

In [13]:
df_bm = df_bm.rename(columns = {'title': 'movie_title'})

This finalizes the cleaning of this df. Here is one more view of it:

In [14]:
df_bm.head()

Unnamed: 0,movie_title,studio,domestic_gross,year
0,Toy Story 3,BV,415000000.0,2010
1,Alice in Wonderland (2010),BV,334200000.0,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,2010
3,Inception,WB,292600000.0,2010
4,Shrek Forever After,P/DW,238700000.0,2010


In [16]:
df_bm.to_csv('./Cleaned_CSV_files/bom_movie_gross.csv', encoding='utf-8')
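
Because to_csv writes the index by default, the saved file carries an extra unnamed first column. A minimal sketch of what that means when the file is read back, using a hypothetical toy frame and a temporary directory rather than the real output path:

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame with a gap in the index, as left behind by dropna().
toy = pd.DataFrame(
    {"movie_title": ["Toy Story 3", "Inception"], "year": [2010, 2010]},
    index=[0, 3],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bom_movie_gross.csv")
    toy.to_csv(path, encoding="utf-8")

    # Reading without index_col turns the saved index into a data column.
    naive = pd.read_csv(path)
    # Reading with index_col=0 restores it as the index instead.
    restored = pd.read_csv(path, index_col=0)

print(list(naive.columns))
print(list(restored.columns))
```

Passing index=False to to_csv would be an alternative if the original row labels are not needed downstream.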

DataFrame df_bm HAS BEEN CLEANED!!!