The first input cell imports the libraries needed for cleaning and plotting our data, plus the "%matplotlib inline" command, a built-in magic function in IPython that allows plots to appear and be stored within the notebook.

This is followed by loading the rt.reviews.tsv.gz file and a quick look at its first and last 5 rows via .head() and .tail(), to start defining a cleaning workflow.

In [1]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
df_rtr = pd.read_csv('./zippedData/rt.reviews.tsv.gz', delimiter='\t', encoding = 'unicode_escape')

In [3]:
df_rtr.head()

Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
0,3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
1,3,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
2,3,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,3,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
4,3,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"


In [4]:
df_rtr.tail()

Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
54427,2000,The real charm of this trifle is the deadpan c...,,fresh,Laura Sinagra,1,Village Voice,"September 24, 2002"
54428,2000,,1/5,rotten,Michael Szymanski,0,Zap2it.com,"September 21, 2005"
54429,2000,,2/5,rotten,Emanuel Levy,0,EmanuelLevy.Com,"July 17, 2005"
54430,2000,,2.5/5,rotten,Christopher Null,0,Filmcritic.com,"September 7, 2003"
54431,2000,,3/5,fresh,Nicolas Lacroix,0,Showbizz.net,"November 12, 2002"


Issues identified in this df that will be addressed during the cleaning process:

- Additional index column "id"
- Numerous columns with NaN values
- Date format that can be modified (date column), in case it is considered for further analysis


In [5]:
df_rtr.drop('id', axis=1, inplace=True)

In [6]:
df_rtr['date'] = pd.to_datetime(df_rtr.date)
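
The conversion above lets pandas infer the date format. As a minimal sketch on made-up strings in the same "Month day, year" style shown by .head(), an explicit format can be passed instead, so that a value that does not match raises an error rather than being guessed. The toy Series and the names raw_dates and parsed are illustrative only and are not part of the notebook.

```python
import pandas as pd

# Toy strings in the same style as the 'date' column preview (not the real data)
raw_dates = pd.Series(["November 10, 2018", "May 23, 2018", "January 4, 2018"])

# %B = full month name, %d = day of month, %Y = four-digit year
parsed = pd.to_datetime(raw_dates, format="%B %d, %Y")

print(parsed.dt.strftime("%Y-%m-%d").tolist())
```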

Before addressing the NaN issues, let's get a summary of the df first:

In [7]:
df_rtr.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54432 entries, 0 to 54431
Data columns (total 7 columns):
review        48869 non-null object
rating        40915 non-null object
fresh         54432 non-null object
critic        51710 non-null object
top_critic    54432 non-null int64
publisher     54123 non-null object
date          54432 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(5)
memory usage: 2.9+ MB


There seem to be a few columns with missing values, so let's look into that in more detail.

In [8]:
df_rtr.isna().any()

review         True
rating         True
fresh         False
critic         True
top_critic    False
publisher      True
date          False
dtype: bool

4 columns out of the 7 have missing values: review, rating, critic, and publisher. Let's look at the duplicates first, then at the NaN count per column, and define the way forward based on that.

In [9]:
duplicates = df_rtr[df_rtr.duplicated()]
print(len(duplicates))

2123


In [10]:
df_rtr = df_rtr.drop_duplicates()
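
To make the duplicate count above easier to interpret, here is a small sketch on a made-up frame (the toy data and the names toy, dup_mask, n_extra and deduped are assumptions for illustration). With the default keep='first', .duplicated() flags only the second and later copies of a row, so the printed length counts the extra copies that .drop_duplicates() removes, not every row that has a twin.

```python
import pandas as pd

# Toy frame: the row ("Great", "fresh") appears three times
toy = pd.DataFrame({
    "review": ["Great", "Great", "Dull", "Great"],
    "fresh": ["fresh", "fresh", "rotten", "fresh"],
})

# Default keep='first': the first occurrence is NOT flagged, later copies are
dup_mask = toy.duplicated()
n_extra = int(dup_mask.sum())

# drop_duplicates keeps exactly the rows that were not flagged
deduped = toy.drop_duplicates()

print(n_extra, len(toy), len(deduped))
```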

In [11]:
print(df_rtr.isna().sum().sum()) # Total number of NaN values
print("-------------------------------")
print(df_rtr.isna().sum()) # Total number of NaN per column

19740
-------------------------------
review         3542
rating        13484
fresh             0
critic         2406
top_critic        0
publisher       308
date              0
dtype: int64


Way forward: We decided to be consistent and fill the NaN values of the remaining columns with the string 'missing'. We might drop some of them after we create our SQL database, but for the time being we will leave them as 'missing'.


In [12]:
df_rtr = df_rtr.replace(np.nan, 'missing')
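
As a hedged aside, the same fill can be sketched with .fillna(), which is a common alternative to .replace(np.nan, ...). The toy frame below is made up for illustration and the names toy and filled are assumptions. It shows that only the NaN cells change, while a column without NaN (here the integer top_critic) keeps its values.

```python
import numpy as np
import pandas as pd

# Toy frame with NaN in two text columns and none in the integer column
toy = pd.DataFrame({
    "rating": ["3/5", np.nan],
    "critic": [np.nan, "A. Critic"],
    "top_critic": [0, 1],
})

# Fill every NaN with the placeholder string
filled = toy.fillna("missing")

print(filled)
```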

In [13]:
df_rtr.isna().sum()

review        0
rating        0
fresh         0
critic        0
top_critic    0
publisher     0
date          0
dtype: int64

To be able to use the rating in part of our analysis, we replace the "/5" with "" and change the string "missing" to "-1". The idea is that when we use the ratings, a for loop will only take into account values between 1 and 5 and will skip any -1.

In [14]:
df_rtr['rating'] = df_rtr['rating'].str.replace("/5", "")
df_rtr['rating'] = df_rtr['rating'].str.replace("missing", "-1")
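
The filtering idea described above can also be sketched without an explicit for loop. This is only one possible approach on made-up values, not the notebook's later code: the names toy, cleaned, numeric, valid and mean_rating are assumptions. The strings are converted with pd.to_numeric (anything that is still not numeric becomes NaN because of errors="coerce"), and a boolean mask keeps only values between 1 and 5, so the -1 sentinel is ignored.

```python
import pandas as pd

# Toy ratings in the same style as the column after the NaN fill
toy = pd.DataFrame({"rating": ["3/5", "missing", "2.5/5", "1/5"]})

# Same two replacements as in the cell above, treated as literal strings
cleaned = (
    toy["rating"]
    .str.replace("/5", "", regex=False)
    .str.replace("missing", "-1", regex=False)
)

# Convert to numbers; non-numeric leftovers would become NaN
numeric = pd.to_numeric(cleaned, errors="coerce")

# Keep only ratings in the 1 to 5 range, which drops the -1 sentinel
valid = numeric[numeric.between(1, 5)]
mean_rating = valid.mean()

print(valid.tolist(), mean_rating)
```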

In [15]:
df_rtr.head()

Unnamed: 0,review,rating,fresh,critic,top_critic,publisher,date
0,A distinctly gallows take on contemporary fina...,3,fresh,PJ Nabarro,0,Patrick Nabarro,2018-11-10
1,It's an allegory in search of a meaning that n...,-1,rotten,Annalee Newitz,0,io9.com,2018-05-23
2,... life lived in a bubble in financial dealin...,-1,fresh,Sean Axmaker,0,Stream on Demand,2018-01-04
3,Continuing along a line introduced in last yea...,-1,fresh,Daniel Kasman,0,MUBI,2017-11-16
4,... a perverse twist on neorealism...,-1,fresh,missing,0,Cinema Scope,2017-10-12


In [16]:
df_rtr.to_csv('./Cleaned_CSV_files/rt_reviews.csv', encoding='utf-8')
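
A small round-trip sketch, assuming a temporary folder instead of the real ./Cleaned_CSV_files/ path and a made-up two-row frame (toy, out_dir, out_path and round_trip are illustrative names). It creates the output folder if it does not exist and passes index=False, so the row index (which has gaps after dropping duplicates) is not written as an extra unnamed column. Whether to keep the index is a choice; the cell above keeps it.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the cleaned data
toy = pd.DataFrame({"rating": ["3", "-1"], "fresh": ["fresh", "rotten"]})

# Temporary stand-in for the output folder; create it if needed
out_dir = os.path.join(tempfile.mkdtemp(), "Cleaned_CSV_files")
os.makedirs(out_dir, exist_ok=True)
out_path = os.path.join(out_dir, "rt_reviews.csv")

# index=False leaves the row index out of the file
toy.to_csv(out_path, encoding="utf-8", index=False)

# Read back as strings so "-1" and "3" are not converted to integers
round_trip = pd.read_csv(out_path, dtype=str)

print(list(round_trip.columns))
```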

DataFrame df_rtr HAS BEEN CLEANED!!!