## Setup

In [126]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


## Exploring data

In [127]:
df = pd.read_csv('../../data/processed/data.csv')
df.sample(10)

Unnamed: 0,name,genre,tomatometer_score,tomatometer_count,audience_score,audience_count,classification,runtime,release_year,original_language
700,Sully,Drama,0.85,348.0,0.84,25000.0,PG-13,96.0,2016,English
1174,Zombies 3,Adventure,0.75,8.0,0.6,100.0,Not Rated,91.0,2022,English
951,Black Hawk Down,War,0.77,175.0,0.88,250000.0,R,144.0,2001,English
838,Once Upon a Time... In Hollywood,Comedy,0.85,584.0,0.7,25000.0,R,159.0,2019,English
690,Vice,Biography,0.65,369.0,0.6,5000.0,R,132.0,2018,English
634,Widows,Mystery & thriller,0.91,427.0,0.61,5000.0,R,128.0,2018,English
45,Edward Scissorhands,Holiday,0.89,66.0,0.91,250000.0,PG-13,105.0,1990,English
211,Blaze,Biography,0.95,92.0,0.67,500.0,R,127.0,2018,English
438,The Devil on Trial,Crime,0.36,14.0,0.46,50.0,Not Rated,81.0,2023,English
1147,Flags of Our Fathers,War,0.76,240.0,0.69,250000.0,R,132.0,2006,English


In [128]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1215 entries, 0 to 1214
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   name               1215 non-null   object 
 1   genre              1215 non-null   object 
 2   tomatometer_score  1162 non-null   float64
 3   tomatometer_count  1208 non-null   float64
 4   audience_score     1192 non-null   float64
 5   audience_count     1192 non-null   float64
 6   classification     1215 non-null   object 
 7   runtime            1210 non-null   float64
 8   release_year       1215 non-null   int64  
 9   original_language  1215 non-null   object 
dtypes: float64(5), int64(1), object(4)
memory usage: 95.1+ KB


**Question**: For each genre, what is the correlation between the tomatometer score and the audience score?

**Purpose**: To find genres where critics and audiences tend to agree.

### Prepare data for exploration

We first need the missing ratio for the tomatometer score and the audience score.

In [129]:
tomato_missing = df["tomatometer_score"].isna().sum() / len(df)
audience_missing = df["audience_score"].isna().sum() / len(df)
print(f"Tomatometer missing: {tomato_missing:.2%}")
print(f"Audience missing: {audience_missing:.2%}")

Tomatometer missing: 4.36%
Audience missing: 1.89%


Both ratios are small (under 5%), so we fill the missing values with the mean of each column.

In [130]:
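# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which is chained assignment and is not guaranteed to update df in recent pandas versions.
df["tomatometer_score"] = df["tomatometer_score"].fillna(df["tomatometer_score"].mean())
df["audience_score"] = df["audience_score"].fillna(df["audience_score"].mean())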
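
Mean imputation is simple, but every filled value is the same constant, which can distort correlations, especially in small genres. As a point of comparison, here is a minimal sketch on a toy frame (the values are invented for illustration, not taken from the dataset). It contrasts mean filling with leaving the gaps in place, relying on the fact that `Series.corr` skips pairs where either value is missing.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "tomatometer_score": [0.90, 0.80, np.nan, 0.40, 0.30],
    "audience_score":    [0.85, 0.75, 0.20, 0.45, np.nan],
})

# Option A: fill each column with its own mean.
filled = toy.copy()
for col in ["tomatometer_score", "audience_score"]:
    filled[col] = filled[col].fillna(filled[col].mean())
corr_filled = filled["tomatometer_score"].corr(filled["audience_score"])

# Option B: keep the NaN values; Series.corr ignores incomplete pairs.
corr_pairwise = toy["tomatometer_score"].corr(toy["audience_score"])

print(f"mean-filled:       {corr_filled:.3f}")
print(f"pairwise-complete: {corr_pairwise:.3f}")
```

On this toy data the two approaches give noticeably different coefficients, so it may be worth checking whether the choice matters for the real dataset as well.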

In [131]:
genres = np.unique(df['genre'])
print(f'Number of genres: {len(genres)}')
print(f'Genres: {genres}')

Number of genres: 21
Genres: ['Action' 'Adventure' 'Anime' 'Biography' 'Comedy' 'Crime' 'Documentary'
 'Drama' 'Fantasy' 'Game show' 'History' 'Holiday' 'Horror'
 'Kids & family' 'Music' 'Musical' 'Mystery & thriller' 'Romance' 'Sci-fi'
 'War' 'Western']


There are 21 genres in total, and two of them, "Music" and "Musical", are closely related. We don't need both, so we merge them into a single genre called "Music/ Musical".

In [132]:
df['genre'] = df['genre'].replace(["Music", "Musical"], "Music/ Musical")
genres = np.unique(df['genre'])


Now, we need to count the number of movies in each genre. We will use this information to see if there are any genres that should be removed from the dataset.

In [135]:
for genre in genres:
    print(f'Number of {genre} movies: {len(df[df["genre"] == genre])}')

Number of Action movies: 107
Number of Adventure movies: 22
Number of Anime movies: 2
Number of Biography movies: 71
Number of Comedy movies: 107
Number of Crime movies: 45
Number of Documentary movies: 95
Number of Drama movies: 68
Number of Fantasy movies: 30
Number of Game show movies: 1
Number of History movies: 42
Number of Holiday movies: 75
Number of Horror movies: 79
Number of Kids & family movies: 155
Number of Music/ Musical movies: 42
Number of Mystery & thriller movies: 67
Number of Romance movies: 60
Number of Sci-fi movies: 73
Number of War movies: 36
Number of Western movies: 38
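
The counts above show that a few genres (Anime with 2 movies, Game show with 1) are too small for a meaningful correlation. A minimal sketch of how such genres could be filtered out follows. It uses a toy frame invented for illustration, and `MIN_MOVIES` is a hypothetical threshold, not a value chosen in this notebook.

```python
import pandas as pd

# Toy frame, invented for illustration; the notebook itself would use df.
toy = pd.DataFrame({
    "genre": ["Action"] * 4 + ["Drama"] * 3 + ["Anime"] * 2 + ["Game show"],
})

MIN_MOVIES = 3  # hypothetical cutoff, to be tuned for the real data

# value_counts gives the per-genre counts in one call.
counts = toy["genre"].value_counts()

# Keep only rows whose genre has at least MIN_MOVIES movies.
keep = counts[counts >= MIN_MOVIES].index
filtered = toy[toy["genre"].isin(keep)]

print(counts)
print(sorted(filtered["genre"].unique()))
```

`value_counts` also replaces the explicit loop when only the counts are needed.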


Next, we create a dataframe that contains the correlation between tomatometer score and audience score for each genre. We will use this dataframe to create a bar chart.

In [133]:
df_genre = pd.DataFrame(columns=['Correlation'], index=genres)

for genre in genres:
    # Filter once per genre and reuse the subset
    subset = df[df['genre'] == genre]
    df_genre.loc[genre] = subset['tomatometer_score'].corr(subset['audience_score'])
    # Inspect Anime, whose correlation comes out as NaN
    if genre == 'Anime':
        print(subset['tomatometer_score'])
        print(subset['audience_score'])
df_genre

17     0.730077
287    0.730077
Name: tomatometer_score, dtype: float64
17     1.00
287    0.83
Name: audience_score, dtype: float64


  c = cov(x, y, rowvar, dtype=dtype)
  c *= np.true_divide(1, fact)
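
The Anime output suggests why some correlations come out empty: both tomatometer values are 0.730077 (apparently the imputed column mean), so that series has zero variance and the Pearson correlation is undefined. Game show has a single movie, which is not enough either. The sketch below, on invented toy data, computes the per-genre correlation with `groupby` and returns NaN explicitly for such groups. `genre_corr` and `MIN_MOVIES` are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "genre": ["Action"] * 4 + ["Anime"] * 2 + ["Game show"],
    "tomatometer_score": [0.90, 0.70, 0.50, 0.30, 0.73, 0.73, 0.60],
    "audience_score":    [0.80, 0.75, 0.40, 0.35, 1.00, 0.83, 0.50],
})

MIN_MOVIES = 3  # hypothetical minimum group size


def genre_corr(group: pd.DataFrame) -> float:
    """Pearson correlation for one genre, or NaN when it is undefined."""
    x, y = group["tomatometer_score"], group["audience_score"]
    # Too few rows or a constant column: correlation is not meaningful.
    if len(group) < MIN_MOVIES or x.nunique() < 2 or y.nunique() < 2:
        return np.nan
    return x.corr(y)


corr_by_genre = pd.Series(
    {genre: genre_corr(group) for genre, group in toy.groupby("genre")},
    name="Correlation",
)
print(corr_by_genre)
```

Checking for these cases up front avoids passing degenerate groups to the correlation routine at all.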


Unnamed: 0,Correlation
Action,0.611406
Adventure,-0.009375
Anime,
Biography,0.397751
Comedy,0.447956
Crime,0.62482
Documentary,0.364369
Drama,0.339137
Fantasy,0.652367
Game show,
