# 2. Preprocessing

## 2.0 Import libraries

In [289]:
import pandas as pd
import numpy as np
import re

## 2.1 Preprocessing + Exploring data

In [290]:
df = pd.read_csv('../../data/raw/data.csv')
df.head()

Unnamed: 0,name,genre,tomatometer_score,tomatometer_count,audience_score,audience_count,classification,runtime,release_year,original_language,url
0,A Castle for Christmas,"Holiday, Romance, Comedy",74%,23,40%,100+,,1h 38m,2021,English,https://www.rottentomatoes.com/m/a_castle_for_...
1,Pinocchio,"Kids & family, Fantasy, Animation",100%,61,73%,"250,000+",G,1h 27m,1940,English,https://www.rottentomatoes.com/m/pinocchio_1940
2,The Informer,"Mystery & thriller, Crime, Drama",64%,58,60%,250+,R (Strong Violence|Pervasive Language),1h 53m,2019,English,https://www.rottentomatoes.com/m/the_informer_...
3,They Cloned Tyrone,"Sci-fi, Comedy",95%,129,100%,Fewer,R (Violence|Drug Use|Some Sexual Material|Perv...,2h 2m,2023,English,https://www.rottentomatoes.com/m/they_cloned_t...
4,1917,"War, History, Drama",89%,472,88%,"25,000+",R (Some Disturbing Images|Language|Violence),1h 59m,2019,English,https://www.rottentomatoes.com/m/1917_2019


How many rows and columns are there?

In [291]:
print("Number of columns:", df.shape[1])
print("Number of rows:", df.shape[0])

Number of columns: 11
Number of rows: 1215


What is the meaning of each row?
- Each row describes one movie: its attributes (genre, language, runtime, and so on) and its ratings.

What is the meaning of each column?

| Column | Meaning |
|--------|---------|
| name | Title of the movie |
| genre | Genres of the movie |
| tomatometer_score | Rating score of the movie according to Rotten Tomatoes critics |
| tomatometer_count | Number of critic reviews of the movie |
| audience_score | Rating score of the movie according to the Rotten Tomatoes audience |
| audience_count | Number of audience reviews of the movie |
| classification | The movie's suitability rating |
| runtime | Length of the movie |
| release_year | Year of the movie's release |
| original_language | Original language the movie was filmed in |
| url | URL of the movie's Rotten Tomatoes page |

What is the current datatype of each column? Are there any inappropriate datatypes?

In [292]:
df.dtypes

name                 object
genre                object
tomatometer_score    object
tomatometer_count    object
audience_score       object
audience_count       object
classification       object
runtime              object
release_year          int64
original_language    object
url                  object
dtype: object

```tomatometer_score, tomatometer_count, audience_score, audience_count``` and ```runtime``` should be numeric.  

- We'll convert ```tomatometer_score``` and ```audience_score``` to their float equivalents (for example, `74%` becomes `0.74`).

In [293]:
df['tomatometer_score'] = df['tomatometer_score'].str.strip('%')
df['audience_score'] = df['audience_score'].str.strip('%')

#Replace the '--' placeholder with NaN (missing score)
df.loc[df['tomatometer_score'] == '--', 'tomatometer_score'] = np.nan
df.loc[df['audience_score'] == '--', 'audience_score'] = np.nan

df[['tomatometer_score','audience_score']] = df[['tomatometer_score','audience_score']].astype(float) / 100
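
The same conversion can be sketched in one step with `pd.to_numeric`, which turns any value it cannot parse (such as `'--'`) into NaN when `errors='coerce'` is passed. The helper name `percent_to_fraction` and the sample values are hypothetical, used only for illustration.

```python
import pandas as pd

# Hypothetical sample mimicking the raw score columns
scores = pd.Series(['74%', '100%', '--', None])

def percent_to_fraction(s: pd.Series) -> pd.Series:
    """Strip a trailing '%' and scale to [0, 1]; unparseable entries become NaN."""
    return pd.to_numeric(s.str.rstrip('%'), errors='coerce') / 100

fractions = percent_to_fraction(scores)
print(fractions.tolist())
```

One trade-off of this variant is that it silently coerces every unparseable string, not only `'--'`, so it may hide unexpected placeholders.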

- ```audience_count``` and ```tomatometer_count``` contain the value 'Fewer'.
- To decide what number 'Fewer' should map to, we'll find the numerical minimum of each column.

In [294]:
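#Remove thousands separators and the trailing '+'
#regex=False treats '+' as a literal character (as a regex it is an invalid pattern)
df['audience_count'] = df['audience_count'].str.replace(',', '', regex=False).str.replace('+', '', regex=False)
df['tomatometer_count'] = df['tomatometer_count'].str.replace(',', '', regex=False).str.replace('+', '', regex=False)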

In [295]:
#Get unique values except for 'Fewer' and nan
unique_vals = df[~(df['tomatometer_count'] == 'Fewer')]['tomatometer_count'].unique().astype(float)
unique_vals = unique_vals[~np.isnan(unique_vals)]
print("Tomatometer min:",unique_vals.min())

#Get unique values except for 'Fewer' and nan
unique_vals = df[~(df['audience_count'] == 'Fewer')]['audience_count'].unique().astype(float)
unique_vals = unique_vals[~np.isnan(unique_vals)]
print("Audience min:",unique_vals.min())

Tomatometer min: 1.0
Audience min: 50.0


The smallest numeric values are 1 and 50, so we'll encode **'Fewer'** as a value below each minimum: `tomatometer_count = 0` and `audience_count = 25`.
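
As a more compact way to find the numeric minimum while ignoring 'Fewer' and missing entries, here is a minimal sketch using `pd.to_numeric`. The sample series is hypothetical and assumes the `,` and `+` symbols have already been removed.

```python
import pandas as pd

# Hypothetical sample of a cleaned count column (symbols already removed)
counts = pd.Series(['100', '250000', 'Fewer', None, '50'])

# errors='coerce' turns 'Fewer' into NaN, and min() skips NaN by default
numeric_min = pd.to_numeric(counts, errors='coerce').min()
print("Min:", numeric_min)
```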

In [296]:
#Replace 'Fewer' with the chosen numeric values
df.loc[df['tomatometer_count'] == 'Fewer', 'tomatometer_count'] = 0
df.loc[df['audience_count'] == 'Fewer', 'audience_count'] = 25

df['audience_count'] = df['audience_count'].astype(float)
df['tomatometer_count'] = df['tomatometer_count'].astype(float)

- ```runtime``` will be converted to minutes.

In [297]:
def convert_to_mins(x):
    """Convert a runtime string such as '1h38m' to total minutes."""
    #Both parts are optional, so '2h', '45m' and '1h38m' are all handled
    regex = r'(?:(\d+)h)?(?:(\d+)m)?'
    r = re.match(regex, x)
    hours = int(r.group(1)) if r.group(1) is not None else 0
    mins = int(r.group(2)) if r.group(2) is not None else 0
    return hours*60 + mins

df['runtime'] = df['runtime'].str.replace(' ','').apply(convert_to_mins)
#A runtime of 0 means the string could not be parsed, so treat it as missing
df.loc[df['runtime'] == 0, 'runtime'] = np.nan
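
The runtime conversion can also be sketched without a row-wise `apply`, using `Series.str.extract` to pull the hour and minute parts into separate columns. The sample strings below are hypothetical, and the group names `h` and `m` are chosen only for this illustration.

```python
import pandas as pd

# Hypothetical sample of raw runtime strings
runtime = pd.Series(['1h 38m', '2h 2m', '45m', '2h', '--', None])

# Named groups capture the optional hour and minute parts
pattern = r'^(?:(?P<h>\d+)h)?(?:(?P<m>\d+)m)?$'
parts = runtime.str.replace(' ', '', regex=False).str.extract(pattern)

# An absent part counts as 0
hours = pd.to_numeric(parts['h']).fillna(0)
mins = pd.to_numeric(parts['m']).fillna(0)
minutes = hours * 60 + mins

# Rows where nothing was parsed (0 minutes) are treated as missing
minutes = minutes.where(minutes > 0)
print(minutes.tolist())
```

Unlike the function above, this variant also tolerates missing input values, since the string accessor propagates them instead of raising.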

In [298]:
df.head()

Unnamed: 0,name,genre,tomatometer_score,tomatometer_count,audience_score,audience_count,classification,runtime,release_year,original_language,url
0,A Castle for Christmas,"Holiday, Romance, Comedy",0.74,23.0,0.4,100.0,,98.0,2021,English,https://www.rottentomatoes.com/m/a_castle_for_...
1,Pinocchio,"Kids & family, Fantasy, Animation",1.0,61.0,0.73,250000.0,G,87.0,1940,English,https://www.rottentomatoes.com/m/pinocchio_1940
2,The Informer,"Mystery & thriller, Crime, Drama",0.64,58.0,0.6,250.0,R (Strong Violence|Pervasive Language),113.0,2019,English,https://www.rottentomatoes.com/m/the_informer_...
3,They Cloned Tyrone,"Sci-fi, Comedy",0.95,129.0,1.0,25.0,R (Violence|Drug Use|Some Sexual Material|Perv...,122.0,2023,English,https://www.rottentomatoes.com/m/they_cloned_t...
4,1917,"War, History, Drama",0.89,472.0,0.88,25000.0,R (Some Disturbing Images|Language|Violence),119.0,2019,English,https://www.rottentomatoes.com/m/1917_2019


For each numerical column, how are the values distributed?
- Percentage of missing values?

In [299]:
df.select_dtypes('number').isna().sum() / len(df)

tomatometer_score    0.043621
tomatometer_count    0.005761
audience_score       0.018930
audience_count       0.018930
runtime              0.004115
release_year         0.000000
dtype: float64

- Describe the values.

In [300]:
df.select_dtypes('number').describe()

Unnamed: 0,tomatometer_score,tomatometer_count,audience_score,audience_count,runtime,release_year
count,1162.0,1208.0,1192.0,1192.0,1210.0,1215.0
mean,0.730077,166.447848,0.742391,41565.247483,112.141322,2015.186008
std,0.238015,130.74781,0.187176,79940.738938,22.256881,12.200412
min,0.0,0.0,0.0,25.0,39.0,1936.0
25%,0.6,53.0,0.64,250.0,96.0,2015.0
50%,0.81,143.0,0.79,2500.0,109.0,2019.0
75%,0.92,256.0,0.89,25000.0,124.0,2022.0
max,1.0,602.0,1.0,250000.0,242.0,2023.0


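All values fall within plausible ranges: scores lie in [0, 1], runtimes range from 39 to 242 minutes, and release years from 1936 to 2023.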
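
This visual check can be made explicit with a few range tests. The sketch below is only illustrative: the helper `check_ranges`, the chosen bounds, and the two-row sample are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def check_ranges(df: pd.DataFrame) -> dict:
    """Map each column to True when all of its non-missing values are in range."""
    checks = {}
    for col in ('tomatometer_score', 'audience_score'):
        checks[col] = bool(df[col].dropna().between(0, 1).all())
    for col in ('tomatometer_count', 'audience_count'):
        checks[col] = bool((df[col].dropna() >= 0).all())
    checks['runtime'] = bool((df['runtime'].dropna() > 0).all())
    return checks

# Hypothetical two-row sample in the processed format
sample = pd.DataFrame({
    'tomatometer_score': [0.74, np.nan],
    'tomatometer_count': [23.0, 0.0],
    'audience_score': [0.40, 1.0],
    'audience_count': [100.0, 25.0],
    'runtime': [98.0, np.nan],
})
result = check_ranges(sample)
print(result)
```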

In [301]:
df.to_csv('../../data/processed/data.csv')