The first input cell lists the libraries needed for cleaning and plotting our data, plus the "%matplotlib inline" command, a built-in IPython magic function that makes plots appear and be stored within the notebook.

This is followed by loading the rt.movie_info.tsv.gz file and a quick look at its first and last 5 rows via .head() and .tail(), to start defining a cleaning workflow.

In [1]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
df_rtm = pd.read_csv('./zippedData/rt.movie_info.tsv.gz', delimiter='\t')
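
As a side note, `pd.read_csv` infers gzip compression from the `.gz` extension, which is why no `compression` argument is needed in the cell above. A minimal, self-contained sketch with a made-up two-row file (the file name and its contents are illustrative only, not the real data):

```python
import gzip
import os
import tempfile

import pandas as pd

# Hypothetical two-row sample, written as a gzipped TSV in a temp folder.
sample = "id\trating\truntime\n1\tR\t104 minutes\n3\tR\t108 minutes\n"
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.tsv.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
    handle.write(sample)

# compression="infer" is the default, so the .gz suffix is enough.
df_sample = pd.read_csv(sample_path, sep="\t")
print(df_sample.shape)
```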

In [3]:
df_rtm.head()

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
0,1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One
2,5,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,"Sep 13, 1996","Apr 18, 2000",,,116 minutes,
3,6,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994","Aug 27, 1997",,,128 minutes,
4,7,,NR,Drama|Romance,Rodney Bennett,Giles Cooper,,,,,200 minutes,


In [4]:
df_rtm.tail()

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
1555,1996,Forget terrorists or hijackers -- there's a ha...,R,Action and Adventure|Horror|Mystery and Suspense,,,"Aug 18, 2006","Jan 2, 2007",$,33886034.0,106 minutes,New Line Cinema
1556,1997,The popular Saturday Night Live sketch was exp...,PG,Comedy|Science Fiction and Fantasy,Steve Barron,Terry Turner|Tom Davis|Dan Aykroyd|Bonnie Turner,"Jul 23, 1993","Apr 17, 2001",,,88 minutes,Paramount Vantage
1557,1998,"Based on a novel by Richard Powell, when the l...",G,Classics|Comedy|Drama|Musical and Performing Arts,Gordon Douglas,,"Jan 1, 1962","May 11, 2004",,,111 minutes,
1558,1999,The Sandlot is a coming-of-age story about a g...,PG,Comedy|Drama|Kids and Family|Sports and Fitness,David Mickey Evans,David Mickey Evans|Robert Gunter,"Apr 1, 1993","Jan 29, 2002",,,101 minutes,
1559,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures


Issues identified in this df that will be addressed during the cleaning process:

- Additional index column "id"
- Numerous columns with NaN values
- Two date columns (theater_date and dvd_date) whose format can be modified, if they are considered for further analysis
- The word "minutes" within the "runtime" column

Before calling .info(), we will delete the additional index column:

In [5]:
df_rtm.drop('id', axis=1, inplace=True)

In [6]:
df_rtm.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1560 entries, 0 to 1559
Data columns (total 11 columns):
synopsis        1498 non-null object
rating          1557 non-null object
genre           1552 non-null object
director        1361 non-null object
writer          1111 non-null object
theater_date    1201 non-null object
dvd_date        1201 non-null object
currency        340 non-null object
box_office      340 non-null object
runtime         1530 non-null object
studio          494 non-null object
dtypes: object(11)
memory usage: 134.2+ KB


Columns currency, box_office, and studio will be dropped because they hold values in only a small share of the 1560 rows (340, 340, and 494 non-null entries, respectively):

In [7]:
df_rtm = df_rtm.drop(['currency', 'box_office', 'studio'], axis=1)
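
From the .info() output above, currency and box_office are filled in roughly 22% of the rows (340 of 1560) and studio in roughly 32% (494 of 1560). One way to get those shares directly is sketched below on a small made-up frame; the 0.5 cut-off is a hypothetical choice for illustration, not a rule used in this notebook:

```python
import numpy as np
import pandas as pd

# Small made-up frame; in the notebook this would be df_rtm.
df_demo = pd.DataFrame({
    "rating": ["R", "PG", "G", "NR"],
    "box_office": [np.nan, 600000.0, np.nan, np.nan],
    "studio": [np.nan, "Entertainment One", np.nan, np.nan],
})

# Fraction of rows that hold a value, per column.
filled_share = df_demo.notna().mean()

# Hypothetical cut-off: flag columns that are less than half populated.
threshold = 0.5
sparse_cols = filled_share[filled_share < threshold].index.tolist()
print(filled_share)
print(sparse_cols)
```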

In [8]:
df_rtm.head()

Unnamed: 0,synopsis,rating,genre,director,writer,theater_date,dvd_date,runtime
0,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",104 minutes
1,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",108 minutes
2,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,"Sep 13, 1996","Apr 18, 2000",116 minutes
3,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994","Aug 27, 1997",128 minutes
4,,NR,Drama|Romance,Rodney Bennett,Giles Cooper,,,200 minutes


The date format of "theater_date" and "dvd_date" will not be modified, as both columns will be dropped as well. In principle, these two should not affect the analysis to come.

In [9]:
df_rtm = df_rtm.drop(['theater_date', 'dvd_date'], axis=1)
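
If the two date columns were kept for a later analysis instead, strings such as "Oct 9, 1971" could be converted with `pd.to_datetime`. This is only a sketch on sample values copied from the tables above; it assumes English month abbreviations, and missing entries simply become NaT:

```python
import pandas as pd

# Two date strings in the same style as theater_date / dvd_date, plus a gap.
dates = pd.Series(["Oct 9, 1971", "Aug 17, 2012", None])

# %b = abbreviated month name, %d = day of month, %Y = four-digit year.
parsed = pd.to_datetime(dates, format="%b %d, %Y")
print(parsed)
```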

The column 'runtime' has the word 'minutes' after each value, so it needs to be removed. Furthermore, the type needs to change from object to a numeric type (float64 here, since the column contains NaN values), and the column is renamed from runtime to runtime_minutes.

In [10]:
df_rtm['runtime'] = df_rtm['runtime'].str.replace("minutes", "")
df_rtm = df_rtm.rename(columns = {'runtime': 'runtime_minutes'})
df_rtm['runtime_minutes'] = pd.to_numeric(df_rtm['runtime_minutes'])
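
An alternative sketch of the same step, shown on made-up sample values: the digits are extracted with a regular expression, and pandas' nullable "Int64" dtype is used to keep whole numbers even when some rows are missing. This is not what the notebook does (it keeps float64), and it assumes a pandas version recent enough to support the "Int64" dtype:

```python
import pandas as pd

# Sample values in the same style as the runtime column, with one gap.
runtime = pd.Series(["104 minutes", "108 minutes", None], dtype="object")

# Pull out the leading digits; rows without a match stay missing.
digits = runtime.str.extract(r"(\d+)", expand=False)

# to_numeric yields floats here because of the missing value...
as_float = pd.to_numeric(digits)

# ...while the nullable "Int64" dtype keeps integers alongside the gap.
as_int = as_float.astype("Int64")
print(as_float.dtype, as_int.dtype)
```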

With the previous cleaning done, we proceed with a quick duplicate search:

In [11]:
duplicates = df_rtm[df_rtm.duplicated()]
print(len(duplicates))

4


In [12]:
df_rtm = df_rtm.drop_duplicates() # drop the 4 duplicated rows
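
Two details of this step are easy to miss: `duplicated()` flags only the repeats (not the first occurrence), and `drop_duplicates()` keeps the original row labels rather than renumbering them. A small sketch on a made-up frame:

```python
import pandas as pd

# Made-up frame where rows 0 and 2 are identical.
df_demo = pd.DataFrame({
    "genre": ["Drama", "Comedy", "Drama", "Horror"],
    "runtime_minutes": [104.0, 88.0, 104.0, 106.0],
})

# keep=False flags every member of a duplicated group, not only the repeats.
all_copies = df_demo[df_demo.duplicated(keep=False)]

# Default behaviour: the first occurrence is kept, later repeats are removed.
deduped = df_demo.drop_duplicates()
print(len(all_copies), len(deduped))
print(deduped.index.tolist())
```

Because the labels are preserved, a gap is left in the index wherever a row was removed; chaining `.reset_index(drop=True)` would renumber the rows if that is preferred.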

In [13]:
df_rtm.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1556 entries, 0 to 1559
Data columns (total 6 columns):
synopsis           1497 non-null object
rating             1555 non-null object
genre              1550 non-null object
director           1361 non-null object
writer             1110 non-null object
runtime_minutes    1529 non-null float64
dtypes: float64(1), object(5)
memory usage: 85.1+ KB


In [14]:
df_rtm.isna().sum()

synopsis            59
rating               1
genre                6
director           195
writer             446
runtime_minutes     27
dtype: int64

Way forward regarding the NaN values in some of the columns:
- synopsis 59 rows, will be dropped
- rating 1 row, will be dropped
- genre 6 rows, will be dropped
- runtime_minutes 27 rows, will be dropped

The remaining columns (director and writer) will have NaN replaced with 'missing' to be consistent.

The output shows that these four columns have few missing values relative to the 1556 rows, so dropping those rows should not affect our analysis, and if it does, the impact would be insignificant.

In [15]:
df_rtm.dropna(subset=['synopsis', 'rating', 'genre', 'runtime_minutes'], inplace=True)
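
The `subset` argument restricts the check to the listed columns, so rows with gaps in other columns (director or writer here) are kept. A minimal sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

# Made-up frame: one gap in synopsis, one gap in director.
df_demo = pd.DataFrame({
    "synopsis": ["A", np.nan, "C"],
    "director": [np.nan, "X", "Y"],
})

rows_before = len(df_demo)

# Only a missing synopsis removes a row; the missing director is ignored.
trimmed = df_demo.dropna(subset=["synopsis"])
print(rows_before - len(trimmed))
```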

In [16]:
df_rtm = df_rtm.replace(np.nan, 'missing')
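
The `replace(np.nan, 'missing')` call above applies to every column. That is harmless here, since the rows with a missing runtime were already dropped, but in general a string placed in a numeric column would turn it into an object column. A more targeted sketch, on a made-up frame, fills only the text columns:

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps in the two text columns only.
df_demo = pd.DataFrame({
    "director": ["William Friedkin", np.nan],
    "writer": [np.nan, "Luc Besson"],
    "runtime_minutes": [104.0, 94.0],
})

# Fill only the text columns so a numeric column can never receive a string.
text_cols = ["director", "writer"]
df_demo[text_cols] = df_demo[text_cols].fillna("missing")
print(df_demo)
```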

In [17]:
df_rtm.isna().sum()

synopsis           0
rating             0
genre              0
director           0
writer             0
runtime_minutes    0
dtype: int64

In [18]:
df_rtm.head()

Unnamed: 0,synopsis,rating,genre,director,writer,runtime_minutes
0,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,104.0
1,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,108.0
2,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,116.0
3,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,128.0
5,The year is 1942. As the Allies unite overseas...,PG,Drama|Kids and Family,Jay Russell,Gail Gilchriest,95.0


In [19]:
df_rtm.to_csv('./Cleaned_CSV_files/rt_movies_info.csv', encoding='utf-8')
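
Note that `to_csv` does not create missing folders, and by default it also writes the row index as an unnamed first column, which shows up as "Unnamed: 0" when the file is read back. A sketch of a variant that handles both, using a temporary folder as a stand-in for ./Cleaned_CSV_files/ and a made-up frame:

```python
import os
import tempfile

import pandas as pd

df_demo = pd.DataFrame({"rating": ["R", "PG"], "runtime_minutes": [104.0, 95.0]})

# A temp folder stands in for ./Cleaned_CSV_files/ in this sketch.
out_dir = os.path.join(tempfile.mkdtemp(), "Cleaned_CSV_files")
os.makedirs(out_dir, exist_ok=True)  # to_csv does not create folders itself
out_path = os.path.join(out_dir, "rt_movies_info.csv")

# index=False leaves the row labels out of the file.
df_demo.to_csv(out_path, index=False, encoding="utf-8")

reloaded = pd.read_csv(out_path)
print(reloaded.columns.tolist())
```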

DataFrame df_rtm HAS BEEN CLEANED!!!