The first input cell imports the libraries needed to clean and plot our data, plus the "%matplotlib inline" line, a predefined magic function in IPython that makes plots appear and be stored within the notebook.

Next, the tmdb.movies.csv.gz file is loaded and its first and last 5 rows are inspected via .head() and .tail() to start defining a cleaning workflow.

In [1]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
df_tm = pd.read_csv('./zippedData/tmdb.movies.csv.gz')

In [3]:
df_tm.head()

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


In [4]:
df_tm.tail()

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
26512,26512,"[27, 18]",488143,en,Laboratory Conditions,0.6,2018-10-13,Laboratory Conditions,0.0,1
26513,26513,"[18, 53]",485975,en,_EXHIBIT_84xxx_,0.6,2018-05-01,_EXHIBIT_84xxx_,0.0,1
26514,26514,"[14, 28, 12]",381231,en,The Last One,0.6,2018-10-01,The Last One,0.0,1
26515,26515,"[10751, 12, 28]",366854,en,Trailer Made,0.6,2018-06-22,Trailer Made,0.0,1
26516,26516,"[53, 27]",309885,en,The Church,0.6,2018-10-05,The Church,0.0,1


An additional index column ('Unnamed: 0') is present, so we will delete it before calling .info():

In [5]:
df_tm.drop('Unnamed: 0', axis=1, inplace=True)
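
As an aside, a column like 'Unnamed: 0' usually appears when a CSV was saved together with its index. The sketch below shows one way to avoid the extra drop step by passing index_col=0 to read_csv. It uses a tiny in-memory CSV as a stand-in (an assumption for illustration, not the real tmdb.movies.csv.gz file).

```python
import io
import pandas as pd

# Tiny stand-in for the real file: the first, unnamed column is a saved index.
csv_text = """,id,title,vote_average
0,12444,Harry Potter and the Deathly Hallows: Part 1,7.7
1,10191,How to Train Your Dragon,7.7
2,10138,Iron Man 2,6.8
"""

# Default read: the unnamed column is kept and labelled 'Unnamed: 0'.
df_default = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses that first column as the index instead of as data.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(df_default.columns))
print(list(df_indexed.columns))
```

Either approach should leave the same data columns; dropping the column afterwards, as done above, is equally valid.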

In [6]:
df_tm.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26517 entries, 0 to 26516
Data columns (total 9 columns):
genre_ids            26517 non-null object
id                   26517 non-null int64
original_language    26517 non-null object
original_title       26517 non-null object
popularity           26517 non-null float64
release_date         26517 non-null object
title                26517 non-null object
vote_average         26517 non-null float64
vote_count           26517 non-null int64
dtypes: float64(2), int64(2), object(5)
memory usage: 1.8+ MB


Nothing obviously wrong stands out: every column reports 26517 non-null entries. I'll run a quick check for any missing data, look for possible duplicates, and review the summary statistics to identify possible outliers.

In [7]:
df_tm.isna().any()

genre_ids            False
id                   False
original_language    False
original_title       False
popularity           False
release_date         False
title                False
vote_average         False
vote_count           False
dtype: bool

No missing data has been identified so far.
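
One caveat: .isna() only flags true missing values (NaN/None). Placeholders such as an empty string or an empty list stored as text are not counted. The sketch below illustrates this on a made-up DataFrame; whether the real file contains such placeholders has not been checked here, so treat the column contents as hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the TMDB file.
toy = pd.DataFrame({
    "genre_ids": ["[12, 14]", "[]", "[18]"],   # "[]" is text, not NaN
    "title": ["A", "", "C"],                   # "" is text, not NaN
    "vote_average": [7.7, np.nan, 0.0],        # only NaN is truly missing
})

# Count of true missing values per column.
missing_per_column = toy.isna().sum()

# Placeholder checks that isna() would not catch.
empty_genres = (toy["genre_ids"] == "[]").sum()
empty_titles = (toy["title"].str.strip() == "").sum()

print(missing_per_column)
print(empty_genres, empty_titles)
```

Using .isna().sum() instead of .isna().any() also gives a per-column count rather than just True/False.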

In [8]:
duplicates = df_tm[df_tm.duplicated()]
print(len(duplicates))

1020


In [9]:
duplicates.head()

Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
2473,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
2477,"[16, 35, 10751]",863,en,Toy Story 2,22.698,1999-11-24,Toy Story 2,7.5,7553
2536,"[12, 28, 878]",20526,en,TRON: Legacy,13.459,2010-12-10,TRON: Legacy,6.3,4387
2673,"[18, 10749]",46705,en,Blue Valentine,8.994,2010-12-29,Blue Valentine,6.9,1677
2717,"[35, 18, 14, 27, 9648]",45649,en,Rubber,8.319,2010-09-01,Rubber,5.9,417


Let's remove the duplicates. I will then double-check the result against some of the best-known titles, such as Iron Man 2 (row 2), Toy Story (row 3) and Inception (row 4).

In [10]:
df_tm = df_tm.drop_duplicates()

In [11]:
df_tm.head()

Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186
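
Looking at the first rows is a quick visual check. A more direct one is to count how often each of those titles remains after drop_duplicates(). The sketch below does this on a small made-up DataFrame (an assumption for illustration); on the real data the row count would be expected to fall from 26517 to 26517 - 1020 = 25497.

```python
import pandas as pd

# Toy data with one fully duplicated row (Toy Story appears twice).
toy = pd.DataFrame({
    "id": [862, 10138, 27205, 862],
    "title": ["Toy Story", "Iron Man 2", "Inception", "Toy Story"],
    "vote_count": [10174, 12368, 22186, 10174],
})

# drop_duplicates() keeps the first occurrence of each identical row.
deduped = toy.drop_duplicates()

# How many times does each well-known title remain?
counts = deduped["title"].value_counts()
check = counts.reindex(["Iron Man 2", "Toy Story", "Inception"])
print(check)

# Stricter check: rows that share an id even if other columns differ.
same_id = deduped.duplicated(subset="id").sum()
print(same_id)
```

The subset="id" check is optional; it would catch rows for the same film whose popularity or vote figures differ, which a full-row comparison treats as distinct.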


In [12]:
df_tm.describe()

Unnamed: 0,id,popularity,vote_average,vote_count
count,25497.0,25497.0,25497.0,25497.0
mean,294203.960505,3.043279,5.979331,178.79578
std,154690.24966,4.261045,1.866094,914.150311
min,27.0,0.6,0.0,1.0
25%,154770.0,0.6,5.0,1.0
50%,307125.0,1.321,6.0,5.0
75%,420707.0,3.49,7.0,25.0
max,608444.0,80.773,10.0,22186.0
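
The statistics show strongly right-skewed columns: for vote_count the 75th percentile is 25 while the maximum is 22186. One common way to flag candidates for outliers is the 1.5 × IQR rule; with the quartiles above (25% = 1, 75% = 25) the upper fence would be 25 + 1.5 × 24 = 61. The sketch below applies the rule to made-up values. It is only an illustration, not a step this notebook performs, and a high vote count can be perfectly legitimate for a popular film.

```python
import pandas as pd

# Hypothetical toy values mimicking a right-skewed vote_count column.
toy = pd.DataFrame({"vote_count": [1, 1, 5, 25, 40, 22186]})

# Interquartile range and the conventional upper fence.
q1 = toy["vote_count"].quantile(0.25)
q3 = toy["vote_count"].quantile(0.75)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Rows above the fence are candidates for review, not automatic removals.
flagged = toy[toy["vote_count"] > upper_fence]
print(flagged)
```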


In [13]:
df_tm.to_csv('./Cleaned_CSV_files/tmdb_movies.csv', encoding='utf-8')
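
Two details are worth keeping in mind when saving. First, to_csv does not create missing folders, so the target directory has to exist. Second, writing the index (the default) means the file gets an unnamed first column again when it is reloaded. The sketch below shows one possible variant using os.makedirs and index=False; it writes a made-up DataFrame to a temporary folder (both assumptions for illustration) rather than to the real ./Cleaned_CSV_files/ path.

```python
import os
import tempfile
import pandas as pd

# Toy data with gaps in the index, as left behind by drop_duplicates().
toy = pd.DataFrame(
    {"id": [12444, 862], "title": ["Iron Man 2", "Toy Story"]},
    index=[0, 3],
)

with tempfile.TemporaryDirectory() as tmp:
    # Create the output folder if it is not there yet.
    out_dir = os.path.join(tmp, "Cleaned_CSV_files")
    os.makedirs(out_dir, exist_ok=True)

    # index=False keeps the row labels out of the file.
    out_path = os.path.join(out_dir, "tmdb_movies.csv")
    toy.to_csv(out_path, index=False, encoding="utf-8")

    # Reloading shows only the data columns, no 'Unnamed: 0'.
    reloaded = pd.read_csv(out_path)

print(list(reloaded.columns))
```

If the original row labels are needed later, keeping the index in the file and reading it back with index_col=0 is a reasonable alternative.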

df_tm has been cleaned and saved!