## Setup environment

In [38]:
import numpy as np
import pandas as pd


## Data Preprocessing

In [39]:
df = pd.read_csv('../../data/raw/data.csv')


In [40]:
df.head()

Unnamed: 0,name,genre,tomatometer_score,tomatometer_count,audience_score,audience_count,classification,runtime,release_year,original_language,url
0,A Castle for Christmas,"Holiday, Romance, Comedy",74%,23,40%,100+,,1h 38m,2021,English,https://www.rottentomatoes.com/m/a_castle_for_...
1,Pinocchio,"Kids & family, Fantasy, Animation",100%,61,73%,"250,000+",G,1h 27m,1940,English,https://www.rottentomatoes.com/m/pinocchio_1940
2,The Informer,"Mystery & thriller, Crime, Drama",64%,58,60%,250+,R (Strong Violence|Pervasive Language),1h 53m,2019,English,https://www.rottentomatoes.com/m/the_informer_...
3,They Cloned Tyrone,"Sci-fi, Comedy",95%,129,100%,Fewer,R (Violence|Drug Use|Some Sexual Material|Perv...,2h 2m,2023,English,https://www.rottentomatoes.com/m/they_cloned_t...
4,1917,"War, History, Drama",89%,472,88%,"25,000+",R (Some Disturbing Images|Language|Violence),1h 59m,2019,English,https://www.rottentomatoes.com/m/1917_2019


The `url` column only links back to the source page and is not useful for our analysis, so we drop it.

In [41]:
df = df.drop(columns='url')

In [42]:
df.head()

Unnamed: 0,name,genre,tomatometer_score,tomatometer_count,audience_score,audience_count,classification,runtime,release_year,original_language
0,A Castle for Christmas,"Holiday, Romance, Comedy",74%,23,40%,100+,,1h 38m,2021,English
1,Pinocchio,"Kids & family, Fantasy, Animation",100%,61,73%,"250,000+",G,1h 27m,1940,English
2,The Informer,"Mystery & thriller, Crime, Drama",64%,58,60%,250+,R (Strong Violence|Pervasive Language),1h 53m,2019,English
3,They Cloned Tyrone,"Sci-fi, Comedy",95%,129,100%,Fewer,R (Violence|Drug Use|Some Sexual Material|Perv...,2h 2m,2023,English
4,1917,"War, History, Drama",89%,472,88%,"25,000+",R (Some Disturbing Images|Language|Violence),1h 59m,2019,English


### How many rows and how many columns does the raw data have? (0.25 points)

In [43]:
shape = df.shape

In [44]:
print(f'Number of rows: {shape[0]}')
print(f'Number of columns: {shape[1]}')

Number of rows: 1215
Number of columns: 10


### What does each line mean?

Each line in the dataset describes one movie/series: its title, release year, genre, and so on.

### Does the dataset have duplicated rows? 

In [45]:
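# Duplicates = total rows minus distinct rows (the reverse order would give a negative count)
num_duplicates = shape[0] - df.drop_duplicates().shape[0]
print(f'Number of duplicate rows: {num_duplicates}')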

Number of duplicate rows: 0
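
As a cross-check, pandas can also count duplicates directly with `DataFrame.duplicated`, which flags every row that repeats an earlier one. The sketch below runs on a small made-up frame (`toy` is hypothetical, not the movie data) and shows that both approaches agree:

```python
import pandas as pd

# Hypothetical toy frame: the third row repeats the first one exactly.
toy = pd.DataFrame({
    "name": ["Pinocchio", "1917", "Pinocchio"],
    "release_year": [1940, 2019, 1940],
})

# duplicated() marks each row that repeats an earlier row (keep='first' by default).
num_duplicates = int(toy.duplicated().sum())

# Same count derived from drop_duplicates(): total rows minus distinct rows.
num_duplicates_alt = len(toy) - len(toy.drop_duplicates())

print(num_duplicates, num_duplicates_alt)
```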


### What does each column mean?

| Column Name | Description |
| --- | --- |
| Name | Title of the movie/series |
| Genre | Genre(s) of the movie/series, comma-separated |
| Tomatometer Score | Percentage of critic reviews that are positive |
| Tomatometer Count | Number of critic reviews behind the Tomatometer score |
| Audience Score | Percentage of audience ratings that are positive |
| Audience Count | Approximate number of audience ratings, given as a bucket (e.g. `250+`) |
| Classification | Age rating of the movie/series (e.g. `G`, `R`), optionally followed by the reasons |
| Runtime | Length of the movie/series (e.g. `1h 38m`) |
| Release Year | Year the movie/series was released |
| Original Language | Original language of the movie/series |

### Data type of each column

In [46]:
dtypes = df.dtypes

In [47]:
dtypes

name                 object
genre                object
tomatometer_score    object
tomatometer_count    object
audience_score       object
audience_count       object
classification       object
runtime              object
release_year          int64
original_language    object
dtype: object

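Here, `release_year` acts as a label rather than a quantity we do arithmetic on, so it should be a string instead of an integer. The `runtime` strings (e.g. `1h 38m`) are durations, so we also convert them to a time-based type.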

In [48]:
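df['release_year'] = df['release_year'].astype(str)
# '1h 38m' is a duration, not a timestamp, so parse it as a timedelta
# (astype('datetime64[as]') cannot parse these strings).
df['runtime'] = pd.to_timedelta(df['runtime'])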
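
If a plain numeric column is easier to work with than a time type, the runtime strings can instead be turned into total minutes. The following is only a sketch: `runtime_to_minutes` is a hypothetical helper, and it assumes every value looks like `1h 38m`, `2h`, or `45m`, which has not been verified against the full dataset.

```python
import re
import pandas as pd

def runtime_to_minutes(text):
    """Convert strings such as '1h 38m', '2h' or '45m' to total minutes."""
    if pd.isna(text):
        return pd.NA
    # Optional hours part, optional minutes part, arbitrary spacing.
    match = re.fullmatch(r"\s*(?:(\d+)h)?\s*(?:(\d+)m)?\s*", str(text))
    if match is None or not any(match.groups()):
        return pd.NA  # unrecognised format is treated as missing
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes

# Sample values in the format shown by df.head(), plus one missing entry.
sample = pd.Series(["1h 38m", "2h 2m", "45m", None])

# The nullable Int64 dtype keeps the column integer-typed despite missing values.
minutes = sample.map(runtime_to_minutes).astype("Int64")
print(minutes)
```

Unrecognised strings become missing values here rather than raising, so it is worth checking the missing ratio of the result before relying on it.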

The `genre` column lists several genres for each movie/series; we keep the first one as the main genre.
Likewise, we keep only the first token of the `classification` column (the rating itself, e.g. `R`) and drop the reasons in parentheses.

In [54]:
df['genre'] = df['genre'].str.split(',').str[0]
df['classification'] = df['classification'].str.split(' ').str[0]
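
To make the effect of the two splits concrete, here is a minimal sketch on a made-up frame (`toy` is hypothetical; its values mimic rows shown by `df.head()`). The extra `.str.strip()` is an assumption added for safety and is not part of the cell above.

```python
import pandas as pd

# Hypothetical two-row frame; the second classification is missing.
toy = pd.DataFrame({
    "genre": ["Holiday, Romance, Comedy", "War, History, Drama"],
    "classification": ["R (Strong Violence|Pervasive Language)", None],
})

# First comma-separated entry is taken as the main genre.
main_genre = toy["genre"].str.split(",").str[0].str.strip()

# First space-separated token is the rating; missing values stay missing.
rating = toy["classification"].str.split(" ").str[0]

print(main_genre.tolist())
print(rating.tolist())
```

Because the `.str` accessor propagates missing values, rows without a classification are left untouched by this step.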


### Missing ratios of categorical columns

In [59]:
# Keep only the categorical columns.
df_copy = df.drop(columns=['tomatometer_score', 'tomatometer_count', 'audience_score', 'audience_count', 'runtime', 'release_year'])

def missing_ratio(s):
    """Percentage of missing values in the column."""
    return (s.isna().mean() * 100).round(1)

def num_values(s):
    """Number of distinct non-missing values (a cell may hold a ';'-separated list)."""
    s = s.str.split(';').explode()
    return s.nunique()

def value_ratios(s):
    """Percentage share of each value among the non-missing entries."""
    s = s.str.split(';').explode()
    total_count = s.notna().sum()
    return (s.value_counts() / total_count * 100).round(1).to_dict()

cat_col_info_df = df_copy.agg([missing_ratio, num_values, value_ratios])
cat_col_info_df

Unnamed: 0,name,genre,classification,original_language
missing_ratio,0.0,0.0,19.0,0.0
num_values,1207,21,8,27
value_ratios,"{'Risen': 0.2, 'Pinocchio': 0.2, 'Halloween': ...","{'Kids & family': 12.8, 'Comedy': 8.8, 'Action...","{'R': 44.4, 'PG-13': 31.5, 'PG': 17.8, 'G': 4....","{'English': 89.0, 'Japanese': 4.0, 'English (U..."


The `classification` column has a missing ratio of 19%, which is low enough to keep the column. We fill the missing values with a new category called "Not Rated".

In [65]:
df['classification'] = df['classification'].fillna('Not Rated')

Unnamed: 0,name,genre,classification,original_language
missing_ratio,0.0,0.0,0.0,0.0
num_values,1207,21,9,27
value_ratios,"{'Risen': 0.2, 'Pinocchio': 0.2, 'Halloween': ...","{'Kids & family': 12.8, 'Comedy': 8.8, 'Action...","{'R': 36.0, 'PG-13': 25.5, 'Not Rated': 19.0, ...","{'English': 89.0, 'Japanese': 4.0, 'English (U..."
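
The shares of the existing ratings drop after the fill because they are now computed over all rows rather than only the non-missing ones. A quick arithmetic check, using the figures reported in the two tables above (the variable names are illustrative):

```python
# Shares among non-missing rows, taken from the first summary table.
old_share = {"R": 44.4, "PG-13": 31.5}
missing_ratio = 19.0  # percent of rows that had no classification

# After filling, each share is rescaled by the fraction of rows that were non-missing.
new_share = {
    rating: round(share * (1 - missing_ratio / 100), 1)
    for rating, share in old_share.items()
}

print(new_share)  # {'R': 36.0, 'PG-13': 25.5}
```

These match the values in the second table, and "Not Rated" takes the remaining 19.0%.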


### Save processed data

In [66]:
df.to_csv('../../data/processed/data.csv', index=False)