**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update/change the relevant sections where you lost those points, make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

# COGS 108 - EDA Checkpoint

# Names

- Celeste Walstrom-Vangor
- Rui Wang
- Jheel Gandhi
- Howard Ma
- Kenny Qiu

# **Research Question**

Can we predict the Letterboxd ratings for upcoming movies based solely on actors in those movies and the ratings of their previous films, focusing on the ratings of Gen Z?

## Background and Prior Work

By figuring out if there is a correlation between actors' previous movies and the forecasted Letterboxd rating of their next movie, we will be able to determine whether a new movie is likely to be highly rated based only on the actors who are starring in it. Although many factors go into the rating of a movie, an accurate algorithm that focuses on actors alone would be highly convenient, because people could estimate whether a movie will be highly rated before going to see it. As busy college students, we want to dedicate our time to helping people avoid low-rated movies before the ratings come out. If we can find a correlation between the ratings of previous movies with certain actors and the ratings of new movies with those same actors, it could save people a lot of time.

Here is some prior work we found online:

1. Vox analysis of the actors and actresses who most consistently appear in terrible movies

  One previous analysis from Vox looked at the actors who starred in the best and worst rated films based on Metacritic. They restricted the dataset to only include actors that meet the following criteria:
    1. The actor/actress must have performed in at least 10 films (writing, directing, and producing credits were omitted).
    2. At least one of these films had to have grossed $30 million or more at the box office, adjusted for inflation.
    3. At least one of these films had to be within the past five years (we only wanted semi-active performers).

  From the blog post analysis of the top 10 worst rated actors, we can see familiar names like Adam Sandler and Jennifer Love Hewitt, who are often known for being in worse rated movies. On the other hand, for the top 10 best rated actors, familiar names that are known for having good movies, like Leonardo DiCaprio and Jennifer Lawrence, make the list.

  Something interesting to note is that genres like action and comedy are rated more harshly than genres like documentaries and drama. This is hypothesized to be due to the subjective nature of comedy films.

  This analysis only looks at single actors rather than taking into account all the actors in a film, but it does conclude that there’s at least some correlation between an actor and their movie’s rating.

2. The Hustle - The actors who are the best (and worst) at their job:

  The Hustle did a similar analysis on actors and the critical ratings they get.
  Here are the datasets used by the blog post:
    1. Average Metacritic scores (a measurement of critical ratings) across all films for 35k+ actors
    2. Average domestic box office data across all films an actor has played a prominent role in over their career

  They chose domestic box office because international box office is biased towards franchise films. Note that the box office dataset is not adjusted for inflation, so it favors newer films.

  Similar to the first analysis, critics are biased against comedy and love films, with those films getting lower than average ratings.

  They also included a beloved actors matrix where they plotted the average box office of an actor (how loved they are by the audience) against the average Metacritic score (how loved they are by the critics).

  They then computed a composite score taking into account the percentiles of both the box office and the Metacritic score. In this final list, actors like Leonardo DiCaprio and Tom Hanks make the list of the best combination and actors like Bella Thorne and Chad Michael Murray make the list of the worst combination.

  This analysis also seems to support the idea that some actors are more likely to have better rated movies, but like the first analysis it only considers single actors, so there’s more work to be done on analyzing multiple actors for a single film.

  Something to note here is that just because a movie has bad critic ratings does not mean that the audience doesn’t like it, as supported by big box offices for some of the worst rated actors.



References:
* https://www.vox.com/2016/4/11/11381206/worst-actors-hollywood
* https://thehustle.co/the-actors-who-are-the-best-and-worst-at-their-job/

# **Hypothesis**


We predict that there exists a positive correlation between the Letterboxd ratings of past movies with the top 1000 most relevant actors and the ratings of their upcoming films. This assumption stems from the idea that actors with a history of highly rated performances are likely to continue appearing in well-received films, contributing to positive Letterboxd reviews.

# **Data**

## Data overview

For each dataset include the following information
- Dataset #1
  - Dataset Name: actors
  - Link to the dataset: https://www.kaggle.com/datasets/gsimonx37/letterboxd?select=actors.csv
  - Number of observations: 5523327
  - Number of variables: 2
- Dataset #2
  - Dataset Name: genres
  - Link to the dataset: https://www.kaggle.com/datasets/gsimonx37/letterboxd?select=genres.csv
  - Number of observations: 990770
  - Number of variables: 2
- Dataset #3
  - Dataset Name: languages
  - Link to the dataset: https://www.kaggle.com/datasets/gsimonx37/letterboxd?select=languages.csv
  - Number of observations: 988826
  - Number of variables: 3
- Dataset #4
  - Dataset Name: movies
  - Link to the dataset: https://www.kaggle.com/datasets/gsimonx37/letterboxd?select=movies.csv
  - Number of observations: 896400
  - Number of variables: 7


We are planning to utilize four datasets: **actors.csv**, **genres.csv**, **languages.csv**, and **movies.csv**.

**actors.csv**: This dataset comprises two variables: **id** and **name**. The **id** column contains numerous instances of identical numbers each associated with different **name** entries, representing actors. We intend to consolidate identical **id** numbers and compile a corresponding list of actor names. This process will involve groupby() to handle repeated **id** values.

**genres.csv**: Similar to the **actors.csv**, it contains two variables: **id** and **genre**. We will merge identical **id** numbers, creating a comprehensive list of genres associated with each unique identifier. This dataset will help us understand the distribution of movie genres and their relationships with other variables.

**languages.csv**: This dataset includes three variables: **id**, **type**, and **language**. Our approach will not only groupby() the same **id** but also filter entries in the **type** column to retain only 'Language' and 'Primary language'.

**movies.csv**: This dataset contains seven variables. We plan to discard **tagline**, **description**, and **minute** from our analysis and focus on **id**, **name**, **date**, and **rating**, which are more aligned with our objectives.

For data cleaning, we will include only movies with a **date** later than 1970 and with 'English' among their languages, and we will eliminate any rows containing **'NaN'** to maintain data integrity.

The combination of these datasets will involve merging based on the **id** field, ensuring that each movie's data is enriched with corresponding actors, genres, and language information. The final dataset will feature variables such as **id**, **movie**, **date**, **rating**, **actors**, **genre**, **language**, and **director**. Director information is not part of the four datasets above; it comes from an additional file, **crew.csv**. This cohesive dataset will form the foundation of our project's analysis.


In [1]:
# Imports
import pandas as pd

## Dataset #1: actors

In [2]:
actors_df = pd.read_csv('actors.csv')
actors_df

FileNotFoundError: [Errno 2] No such file or directory: 'actors.csv'

## Dataset #2: genres

In [None]:
genres_df = pd.read_csv('genres.csv')
genres_df

## Dataset #3: languages

In [None]:
languages_df = pd.read_csv('languages.csv')
languages_df

## Dataset #4: movies

In [None]:
movies_df = pd.read_csv('movies.csv')
movies_df

# **EDA part 1: Data cleaning**

Our raw data consists of multiple datasets and isn't very clean, so we will need to do some wrangling. First, let's load in the datasets.

In [None]:
# Imports dependencies
%matplotlib inline

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import ast
import seaborn as sns
sns.set()
sns.set_context('talk')
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

import patsy
import statsmodels.api as sm
import scipy.stats as stats
from scipy.stats import ttest_ind, chisquare, normaltest
# Note: the statsmodels import may print out a 'FutureWarning'. That's fine.

In [None]:
# Read CSV file into DataFrame
actors_df = pd.read_csv('actors.csv')

# Group by 'id' and create a list of 'name' for each group
grouped_actors = actors_df.groupby('id')['name'].apply(list).reset_index()

# Display the first few rows of the grouped DataFrame
grouped_actors.head()

In [None]:
# Read CSV file 'genres.csv' into DataFrame
genres_df = pd.read_csv('genres.csv')

# Group by 'id' and create a list of 'genre' for each group
grouped_genres = genres_df.groupby('id')['genre'].apply(list).reset_index()

# Display the first five rows of the grouped DataFrame
grouped_genres.head(5)

In [None]:
# Read CSV file 'languages.csv' into DataFrame
languages_df = pd.read_csv('languages.csv')

# Filter rows where 'type' is either 'Language' or 'Primary language'
filtered_languages_df = languages_df[languages_df['type'].isin(['Language', 'Primary language'])]

# Display the first 20 rows of the filtered DataFrame
filtered_languages_df.head(20)
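
The data overview describes grouping the language rows by `id`, whereas the cell above keeps one row per (movie, language) pair, which can yield several rows per movie after merging. A minimal sketch of what that grouping step might look like, on toy data; `grouped_languages` and the toy rows are hypothetical and not part of the original notebook:

```python
import pandas as pd

# Toy stand-in for languages.csv (hypothetical rows, for illustration only)
languages_df = pd.DataFrame({
    'id': [1, 1, 1, 2],
    'type': ['Primary language', 'Language', 'Spoken language', 'Language'],
    'language': ['English', 'French', 'French', 'Japanese'],
})

# Keep only the two language types of interest
filtered = languages_df[languages_df['type'].isin(['Language', 'Primary language'])]

# Collapse to one row per movie id, with a sorted list of its distinct languages
grouped_languages = (
    filtered.groupby('id')['language']
    .apply(lambda s: sorted(set(s)))
    .reset_index()
)
print(grouped_languages)
```

With a list-valued `language` column, a later check such as `'English' in x` becomes a list membership test rather than a substring test.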

In [None]:
# Read CSV file 'movies.csv' into DataFrame
movies_df = pd.read_csv('movies.csv')

# Rename the 'name' column to 'movie' in DataFrame movies_df
movies_df.rename(columns={'name': 'movie'}, inplace=True)

# Merge movies_df with grouped_actors on the 'id' column using a left join
merged_movies = pd.merge(movies_df, grouped_actors, on='id', how='left')

In [None]:
# Merge merged_movies with grouped_genres on the 'id' column using a left join
merged_movies_actors_genres = pd.merge(merged_movies, grouped_genres, on='id', how='left')

In [None]:
# Merge merged_movies_actors_genres with filtered_languages_df on the 'id' column using a left join
merged_movies_actors_genres_languages = pd.merge(merged_movies_actors_genres, filtered_languages_df, on='id', how='left')

# Display the first 5 rows of the merged DataFrame
merged_movies_actors_genres_languages.head(5)

In [None]:
# Read CSV file 'crew.csv' into DataFrame
crew_df = pd.read_csv('crew.csv')

# Filter rows where 'role' is 'Director' (copy so the in-place edits below do not act on a slice of crew_df)
directors_df = crew_df[crew_df['role'] == 'Director'].copy()

# Rename the 'name' column to 'Director' in the filtered DataFrame
directors_df.rename(columns={'name': 'Director'}, inplace=True)

# Drop the 'role' column from the filtered DataFrame
directors_df.drop(columns='role', inplace=True)

# Display the first 5 rows of the directors DataFrame
directors_df.head(5)

In [None]:
# Group directors_df by 'id' and create a list of 'Director' for each group
grouped_directors = directors_df.groupby('id')['Director'].apply(list).reset_index()

# Display the first 15 rows of the grouped DataFrame
grouped_directors.head(15)

In [None]:
# Merge merged_movies_actors_genres_languages with grouped_directors on the 'id' column using a left join
total_merged_movies = pd.merge(merged_movies_actors_genres_languages, grouped_directors, on='id', how='left')

# Display the first 5 rows of the merged DataFrame
total_merged_movies.head(5)

In [None]:
total_merged_movies.drop(columns=['tagline', 'description', 'minute', 'type'], inplace=True)
total_merged_movies.head(15)

In [None]:
# We want to eliminate movies that are too old, and we decided that 1970 is a good year to draw the line
total_merged_movies['date'] = pd.to_numeric(total_merged_movies['date'], errors='coerce')
filtered_movies_df = total_merged_movies[total_merged_movies['date'] > 1970.0]
filtered_movies_df

In [None]:
filtered_movies_df = filtered_movies_df.dropna()
filtered_movies_df

In [None]:
filtered_movies_eng_df = filtered_movies_df[filtered_movies_df['language'].apply(lambda x: 'English' in x)]
filtered_movies_eng_df.to_csv('filtered_movies.csv', index=False)

Now that we have merged our loose datasets and dropped the null values, here's what the final dataset looks like:

In [None]:
filtered_movies_df = pd.read_csv('filtered_movies.csv')
filtered_movies_df.head()

Rename the column that holds the actor lists (currently 'name') to 'actors' to make it easier to read

In [None]:
filtered_movies_df = filtered_movies_df.rename(columns={'name': 'actors'})

Let's also drop all the null values in columns 'movie', 'rating', and 'actors'

In [None]:
# Drop rows with NaN values in 'movie', 'rating', or 'actors' columns
filtered_movies_df = filtered_movies_df.dropna(subset=['movie', 'rating', 'actors'])

# Display the first few rows of the cleaned DataFrame
filtered_movies_df.head()

Remove the remaining columns so we only have the movie name, the rating, and the names of the actors

In [None]:
columns_to_keep = ['movie', 'rating', 'actors']

# Drop columns that are not in the 'columns_to_keep' list
filtered_movies_df = filtered_movies_df[columns_to_keep]

# Print the number of remaining rows
print(len(filtered_movies_df))

# Display the first few rows of the modified DataFrame
filtered_movies_df.head()

Looking at rows 2 and 3 of the previous table, it seems there are duplicate entries, so let's dedupe

In [None]:
filtered_movies_df = filtered_movies_df.drop_duplicates(subset='movie', keep='first')

# Display the first few rows of the DataFrame after removing duplicates
filtered_movies_df.head()

That looks better and can be used for analysis!

# **EDA part 2: Data Exploring and visualization**

Now let's create a new table with two columns, actor name and the average rating of all the movies they were in

First let's have a dictionary actor_ratings which stores rating info for each actor

In [None]:
from ast import literal_eval

# Work on a copy of the cleaned DataFrame under the shorter name 'df'
# (assumption: 'df' is never defined above; filtered_movies_df is the intended table)
df = filtered_movies_df.copy()

# Create an empty dictionary to store the sum and count of ratings for each actor
actor_ratings = {}

# Iterate through each row of the DataFrame
for index, row in df.iterrows():
  # The 'actors' column was read back from CSV, so each entry is a stringified list
  actors_list_arr = literal_eval(row['actors'])
  # Extract the rating for the movie
  rating = row['rating']

  # Check that the parsed value is a list
  if isinstance(actors_list_arr, list):
    # Iterate through each actor in the list
    for actor in actors_list_arr:
      # If the actor is not in the dictionary, add a new entry
      if actor not in actor_ratings:
        actor_ratings[actor] = {'sum': rating, 'count': 1}
      else:
        # If the actor is already in the dictionary, update the sum and count
        actor_ratings[actor]['sum'] += rating
        actor_ratings[actor]['count'] += 1

Then let's convert it into a table

In [None]:
# Create a list of dictionaries for actor names and their average ratings
data_list = [{'actor_name': actor, 'average_rating': round(data['sum'] / data['count'], 2), 'movie_count': data['count']} for actor, data in actor_ratings.items()]

# Create a new DataFrame from the list of dictionaries
average_ratings_df = pd.DataFrame(data_list)

# Drop duplicate rows
average_ratings_df = average_ratings_df.drop_duplicates()

# Filter actors with more than 20 movies
average_ratings_df_filtered = average_ratings_df[average_ratings_df['movie_count'] > 20]

# Display the new table with actor names and their average ratings
average_ratings_df_filtered.head()
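
The same per-actor aggregation can also be expressed without an explicit loop. A minimal sketch on toy data, assuming the `actors` column has already been parsed into Python lists; the names `toy_df` and `actor_stats` are hypothetical:

```python
import pandas as pd

# Toy stand-in for the cleaned movie table (hypothetical values)
toy_df = pd.DataFrame({
    'movie': ['A', 'B', 'C'],
    'rating': [4.0, 3.0, 2.0],
    'actors': [['X', 'Y'], ['X'], ['Y', 'Z']],
})

# One row per (movie, actor) pair, then aggregate the ratings per actor
actor_stats = (
    toy_df.explode('actors')
    .groupby('actors')['rating']
    .agg(average_rating='mean', movie_count='count')
    .round({'average_rating': 2})
    .rename_axis('actor_name')
    .reset_index()
)
print(actor_stats)
```

This produces the same three columns as the table above (`actor_name`, `average_rating`, `movie_count`) and tends to be faster than `iterrows()` on large tables.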

Top 10 actors/actresses are below

In [None]:
# Sort the DataFrame by 'average_rating' in descending order and get the top 10
top_10_high_ratings = average_ratings_df_filtered.sort_values(by='average_rating', ascending=False).head(10)

# Display the top 10 actors with the highest ratings
print("Top 10 actors with the highest ratings:")
top_10_high_ratings

Here's Ryan Gosling's number

In [None]:
average_ratings_df_filtered[average_ratings_df_filtered['actor_name'] == 'Ryan Gosling']

Bottom 10 actors

In [None]:
# Sort the DataFrame by 'average_rating' in ascending order and get the bottom 10
top_10_low_ratings = average_ratings_df_filtered.sort_values(by='average_rating', ascending=True).head(10)

# Display the 10 actors with the lowest ratings
print("Top 10 actors with the lowest ratings:")
top_10_low_ratings

Let's see the general trend of movie ratings

In [None]:
sns.histplot(x=df['rating'], bins=10)
plt.title('Ratings vs number of movies')

We can see that the movie ratings look roughly normally distributed
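
A visual check like this can be backed up with a formal test. The notebook already imports `normaltest` from `scipy.stats`; below is a minimal sketch of how it could be applied. The synthetic `ratings` array is a stand-in for `df['rating']`, and the 0.05 threshold is an assumed convention, not something fixed by the project.

```python
import numpy as np
from scipy.stats import normaltest

# Synthetic stand-in for df['rating'] (hypothetical values)
rng = np.random.default_rng(0)
ratings = rng.normal(loc=3.2, scale=0.5, size=500)

# D'Agostino-Pearson test: the null hypothesis is that the sample is normal
statistic, p_value = normaltest(ratings)
print(f"statistic={statistic:.3f}, p-value={p_value:.3f}")

alpha = 0.05  # assumed significance level
if p_value < alpha:
    print("Reject normality at the chosen level")
else:
    print("No evidence against normality at the chosen level")
```

With a very large number of movies, even small departures from normality can produce a tiny p-value, so the histogram is still worth reading alongside the test.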

Now let's look at each actor's average movie rating vs their movie count

In [None]:
sns.scatterplot(x=average_ratings_df_filtered['average_rating'], y=average_ratings_df_filtered['movie_count'])

### How accurate are our predictions compared with the actual ratings the movies received?


We want to come up with rating predictions and then compare them with the actual ratings to see how accurate we are

***We first want to look at the relationship between a movie's rating and the average ratings of every actor in the movie***

We add a column to the original df that holds the average rating of all the actors in that row, using our dictionary to look up the average rating of each actor

In [None]:
for index, row in average_ratings_df_filtered.iterrows():
    actor_name = row['actor_name']
    if actor_name in actor_ratings:
        actor_ratings[actor_name]['average_rating'] = row['average_rating']

Now let's create a new column for average rating of actors for a movie

In [None]:
import ast

# Function to calculate the average rating of a list of actors.
# Actors without an 'average_rating' entry (those with 20 or fewer movies) are skipped.
def calculate_avg_rating(actors, actor_ratings):
    if isinstance(actors, list):
        valid_ratings = [actor_ratings[actor]['average_rating'] for actor in actors
                         if 'average_rating' in actor_ratings.get(actor, {})]
        if valid_ratings:
            return round(sum(valid_ratings) / len(valid_ratings), 2)
    return None

# The 'actors' column holds stringified lists (it was read back from CSV),
# so parse each entry with ast.literal_eval before averaging
df['avg_rating_of_actors'] = df['actors'].apply(
    lambda actors: calculate_avg_rating(ast.literal_eval(actors), actor_ratings))

# Drop rows with missing values in 'avg_rating_of_actors' or 'rating'
df = df.dropna(subset=['avg_rating_of_actors', 'rating'])

# Display the first few rows of the cleaned DataFrame
df.head()

In [None]:
import patsy
import statsmodels.api as sm

# Use patsy to create design matrices
outcome_1, predictors_1 = patsy.dmatrices('rating ~ avg_rating_of_actors', df)

# Create an OLS model
mod_1 = sm.OLS(outcome_1, predictors_1)

# Fit the model
res_1 = mod_1.fit()

# Display the summary statistics
print(res_1.summary())

In [None]:
# Set up the figure size for the plot
plt.figure(figsize=(14, 8))

# Create a box plot using Seaborn
sns.boxplot(x='avg_rating_of_actors', y='rating', data=df, palette='viridis')

# Set the title of the plot
plt.title('Box Plot of avg_rating_of_actors vs. Rating', fontsize=16)

# Set the labels for the x and y axes
plt.xlabel('avg_rating_of_actors', fontsize=14)
plt.ylabel('Rating', fontsize=14)

# Rotate x-axis labels for better visibility
plt.xticks(rotation=45, ha='right')

# Adjust the layout for better spacing
plt.tight_layout()

# Display the plot
plt.show()

Let's look at the relationship between average rating of actors in a movie vs the movie's rating

In [None]:
# Create a scatter plot using Matplotlib
plt.scatter(df['avg_rating_of_actors'], df['rating'])

# Set the label for the x-axis
plt.xlabel('Average Rating of Actors')

# Set the label for the y-axis
plt.ylabel('Movie Rating')

# Set the title of the plot
plt.title('Average Rating of Actors vs Movie Rating')

# Display the plot
plt.show()

There seem to be some signs of correlation here, but is it strong enough?
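
One way to put a number on "strong enough" is the Pearson correlation coefficient. A minimal sketch using pandas on toy values; `toy_df` is hypothetical and stands in for `df`:

```python
import pandas as pd

# Toy stand-in for df (hypothetical values)
toy_df = pd.DataFrame({
    'avg_rating_of_actors': [2.8, 3.0, 3.2, 3.4, 3.6],
    'rating': [2.5, 3.1, 3.0, 3.6, 3.7],
})

# Series.corr computes the Pearson correlation by default
r = toy_df['avg_rating_of_actors'].corr(toy_df['rating'])
print(f"Pearson r = {r:.3f}, r^2 = {r**2:.3f}")
```

For a regression with a single predictor and an intercept, `r**2` equals the R-squared value reported in the OLS summary, so the two can be cross-checked.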

## ***Let's train a linear regression Model***


In [None]:
# Import necessary modules from scikit-learn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Extract features (X) and target variable (y) from the DataFrame 'df'
X = df['avg_rating_of_actors'].values.reshape(-1, 1)
y = df['rating']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# Create a Linear Regression model
model = LinearRegression()

# Train the model using the training sets
model.fit(X_train, y_train)

# Predict the target variable for the testing set
y_pred = model.predict(X_test)

# Plot data points and regression line
plt.scatter(X_test, y_test, label='Testing Data')    # Plot testing data points
plt.plot(X_test, y_pred, color='red', label='Linear Regression')    # Plot regression line
plt.xlabel('Average rating of actors')
plt.ylabel('Rating')
plt.title('Average rating of actors vs movie rating')
plt.legend()
plt.show()

# Display the coefficient(s) and intercept of the linear regression model
print("Coefficient(s):", model.coef_)
print("Intercept:", model.intercept_)

# Calculate and display the Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
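
An MSE value is easier to interpret next to a baseline. A minimal sketch, assuming a baseline that always predicts the mean training rating; `compare_to_baseline` and the toy arrays are hypothetical and not part of the original notebook:

```python
import numpy as np

def compare_to_baseline(y_train, y_test, y_pred):
    """Return (model MSE, baseline MSE), where the baseline predicts mean(y_train)."""
    y_train, y_test, y_pred = map(np.asarray, (y_train, y_test, y_pred))
    model_mse = float(np.mean((y_test - y_pred) ** 2))
    baseline_mse = float(np.mean((y_test - y_train.mean()) ** 2))
    return model_mse, baseline_mse

# Toy values standing in for the notebook's y_train, y_test and y_pred
model_mse, baseline_mse = compare_to_baseline(
    y_train=[3.0, 3.5, 2.5, 4.0],
    y_test=[3.2, 2.8, 3.9],
    y_pred=[3.1, 3.0, 3.6],
)
print("Model MSE:", model_mse)
print("Baseline MSE:", baseline_mse)
```

If the model's MSE is not clearly below the baseline's, the actors' average rating adds little beyond simply guessing the mean.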

**Let's try to make some predictions based on one actor!**

In [None]:
def One_actor_prediction():
    # Prompt user to enter an actor's name
    name = input("Enter an actor of your choice: ").strip()

    # Retrieve the average rating of the specified actor from the DataFrame
    matches = average_ratings_df_filtered.loc[average_ratings_df_filtered['actor_name'] == name, 'average_rating']

    # The name may be misspelled, or the actor may not have more than 20 movies
    if matches.empty:
        print(f"No rating data found for '{name}'")
        return None

    # Build a 2D list of shape (1, 1), the input format the model expects
    data_points = [[matches.iloc[0]]]

    # Use the trained model to predict the rating for the specified actor
    predictions = model.predict(data_points)

    # Return the predicted rating
    return predictions

In [None]:
for i in range(3):
  prediction = One_actor_prediction()
  print("Predicted score of this actor:", prediction)

This is the prototype of our model. For now it takes only one data point (a single actor), but we plan to improve it so that it takes in more actors
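
One possible way to extend the prototype to a full cast is sketched below, under the assumption that a movie's feature is the mean of its known actors' average ratings (the same feature the regression was trained on). `predict_from_cast` and `ToyModel` are hypothetical; `ToyModel` only mimics the `predict` interface of the fitted scikit-learn model, and its coefficients are made up.

```python
class ToyModel:
    """Stand-in for the fitted LinearRegression (made-up coefficients)."""
    def predict(self, rows):
        return [0.9 * row[0] + 0.3 for row in rows]

def predict_from_cast(cast, actor_avg, model):
    """Predict a movie rating from the mean of its actors' average ratings.

    Actors missing from actor_avg are skipped; returns None if none are known.
    """
    known = [actor_avg[name] for name in cast if name in actor_avg]
    if not known:
        return None
    feature = sum(known) / len(known)
    return model.predict([[feature]])[0]

# Hypothetical actor averages
actor_avg = {'Actor A': 3.0, 'Actor B': 4.0}
print(predict_from_cast(['Actor A', 'Actor B', 'Unknown'], actor_avg, ToyModel()))
print(predict_from_cast(['Unknown'], actor_avg, ToyModel()))
```

In the notebook, `actor_avg` could be built from the `actor_name` and `average_rating` columns of `average_ratings_df_filtered`, and `model` would be the fitted `LinearRegression`.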

# **Ethics & Privacy**

As a group of students enrolled at UCSD, we all acknowledge that we have aspects of our lives in common that could create biases in our research, and therefore in our data. We are all members of Gen Z, students, live/lived in San Diego, privileged enough to receive a higher education, etc. These aspects of our lives could create an underlying bias that we need to address. Knowing this, we need to ensure that we are accessing data that reaches beyond our generation and using sources that address movies and actors of all ages and backgrounds, basing their popularity on one designated, impartial scale.

Additionally, we chose to use Letterboxd for the ratings of the movies in our datasets. We know that half of Letterboxd users are under the age of 35, and more than half are between 16 and 24 (https://variety.com/2023/film/news/letterboxd-martin-scorsese-younger-audience-classic-films-1235804153/). Since we know this to be true, we have decided to state our intention explicitly: we aim to predict the movie ratings of Letterboxd users only, meaning mostly Generation Z movie ratings. If this is made clear to the audience, the skew toward younger users becomes a stated scope of the research rather than a hidden bias in the data. We are also aware that choosing Letterboxd in the first place is likely a result of our generation, which means that the scale we are using to rate movies is based more than half on the younger portion of the population. Although this is true, Letterboxd offers an impartial scale and a vast amount of data to access. As a warning, we will provide this context to the audience so they are well informed that the ratings may primarily reflect the views of the younger generations, and may not accurately depict how older populations feel about the movies.

We have intentionally chosen to work with a dataset that is public information accessible from Kaggle to avoid issues with privacy. Since this is a public dataset, we will not have to worry about terms-of-use or privacy issues.

# **Team Expectations**



*   We will meet at least once a week, and additionally when necessary, at times decided upon in advance
*   We want to ensure that there is a hybrid meeting system available so group members can join from anywhere
*   We want to make sure that deadlines are clearly communicated and that people are gently reminded to meet those deadlines
*   Judgment free zone!











# **Project Timeline Proposal**

In [None]:
from IPython.display import Image
Image('/content/Screenshot 2024-02-25 at 6.31.49 PM.png')