# COGS 118A - Project Checkpoint

# Names


- Eric Lin
- Cecilia Martinez
- Jared Singletary
- Finn St-John

# Abstract 
We will investigate the connection between the features of a movie and its rating on the popular movie review website Rotten Tomatoes. The website assigns each movie one of three statuses: rotten, fresh, or certified fresh. We plan to use the non-cinematic features of movies to classify them into one of these three categories. Each data point represents a movie and many of its characteristics, such as director, genre, synopsis, runtime, and actors. Almost all of these features are nominal or interval. For model selection, we will perform grid-search cross-validation and select the model that generalizes best, then use it to predict the Rotten Tomatoes status of the movies in our test data. Since neither false positives nor false negatives pose more risk than the other in the context of our problem, we will evaluate our model primarily using its F1 score, which balances the impact of false negatives and false positives.


# Background

Rotten Tomatoes is a popular website that many people use to look up movie ratings. It provides an overall freshness score for each movie based on the percentage of positive critic reviews. Based on that score, a movie is assigned one of three statuses: rotten (below 60%), fresh (60% or higher), or certified fresh (75% or higher with at least 5 top critic reviews). The site also provides an audience score. (https://www.studiobinder.com/blog/rotten-tomatoes-ratings-system/) We want to classify movies into the rotten, fresh, and certified fresh categories.

(https://towardsdatascience.com/can-we-predict-rotten-tomatoes-ratings-8b5f5b7d7eff)
This article argues that audience rating scores do not reflect audience reviews. The author concludes that movie rating scores do not represent the general public's opinion.

(https://towardsdatascience.com/what-makes-for-a-good-movie-8e10896e0f1b)
In this article the author analyzes movie popularity broken down by both critic and audience reviews. Some genres, such as horror, comedy, mystery, and action, score much lower than high-performing genres such as documentary, classics, and animation. We expect our machine learning model to reflect this pattern.




# Problem Statement

The problem for this project is to predict whether a movie will be rated rotten, fresh, or certified fresh on Rotten Tomatoes based on data such as its genre, content rating, and directors.

# Data

Since we are analyzing how the features of movies relate to their reviews, an ideal dataset for our project allows us to compare several movie features with their review scores.

Rotten Tomatoes Movies dataset
- https://www.kaggle.com/datasets/stefanoleone992/rotten-tomatoes-movies-and-critic-reviews-dataset?select=rotten_tomatoes_movies.csv
- This dataset has 17,712 observations and 22 variables per observation. 
- An observation in this dataset describes a single movie with many pieces of information about it.
- Some important variables are the Rotten Tomatoes score, the genre of the movie, and the director.
- Since many of the variables are categorical, such as genre and director, we will likely need to use one-hot encoding to prepare that data.


In [42]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import sklearn

We read in our Rotten Tomatoes CSV and select only the columns that we consider critical to our machine learning algorithm.

In [73]:
movies = pd.read_csv('data/rotten_tomatoes_movies.csv')
# 'production_company' is included because it appears in the outputs below
critical_columns = ['movie_title', 'content_rating', 'genres', 'directors', 'runtime', 'production_company',
                    'tomatometer_status', 'tomatometer_rating', 'tomatometer_top_critics_count',
                    'tomatometer_fresh_critics_count', 'tomatometer_rotten_critics_count', 'original_release_date']
movies = movies[critical_columns]
movies.head()

| | movie_title | content_rating | genres | directors | runtime | production_company | tomatometer_status | tomatometer_rating | tomatometer_top_critics_count | tomatometer_fresh_critics_count | tomatometer_rotten_critics_count | original_release_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Percy Jackson & the Olympians: The Lightning T... | PG | Action & Adventure, Comedy, Drama, Science Fic... | Chris Columbus | 119.0 | 20th Century Fox | Rotten | 49.0 | 43 | 73 | 76 | 2010-02-12 |
| 1 | Please Give | R | Comedy | Nicole Holofcener | 90.0 | Sony Pictures Classics | Certified-Fresh | 87.0 | 44 | 123 | 19 | 2010-04-30 |
| 2 | 10 | R | Comedy, Romance | Blake Edwards | 122.0 | Waner Bros. | Fresh | 67.0 | 2 | 16 | 8 | 1979-10-05 |
| 3 | 12 Angry Men (Twelve Angry Men) | NR | Classics, Drama | Sidney Lumet | 95.0 | Criterion Collection | Certified-Fresh | 100.0 | 6 | 54 | 0 | 1957-04-13 |
| 4 | 20,000 Leagues Under The Sea | G | Action & Adventure, Drama, Kids & Family | Richard Fleischer | 127.0 | Disney | Fresh | 89.0 | 5 | 24 | 3 | 1954-01-01 |


We expect the genre of a movie to be an important predictor, so we collect all the different genres into a set to see the unique values.

In [74]:
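genre_set = set()
for genres in movies['genres']:
    # Skip missing values, which pandas reads in as float NaN
    if isinstance(genres, str):
        # Each entry is a comma-separated list of genre names
        for genre in genres.split(','):
            genre_set.add(genre.strip())
print(genre_set)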

{'Cult Movies', 'Action & Adventure', 'Gay & Lesbian', 'Documentary', 'Comedy', 'Kids & Family', 'Science Fiction & Fantasy', 'Faith & Spirituality', 'Animation', 'Special Interest', 'Mystery & Suspense', 'Classics', 'Horror', 'Romance', 'Drama', 'Anime & Manga', 'Western', 'Television', 'Musical & Performing Arts', 'Art House & International', 'Sports & Fitness'}


Null values can cause errors in our machine learning algorithms, so we check how many null values each column of our data contains.

In [75]:
for i in movies.columns:
    print(i,movies[i].isnull().values.sum())

movie_title 0
content_rating 0
genres 19
directors 194
runtime 314
production_company 499
tomatometer_status 44
tomatometer_rating 44
tomatometer_top_critics_count 0
tomatometer_fresh_critics_count 0
tomatometer_rotten_critics_count 0
original_release_date 1166


Since relatively few values are null in any column, we drop every row that contains a null value to avoid problems later on. This leaves 15,907 of the original 17,712 movies.

In [76]:
movies = movies.dropna()
movies.shape

(15907, 12)
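
Before dropping rows, it can help to look at the share of missing values per column and the share of rows that `dropna()` would remove. The following is a minimal sketch on a made-up toy table (the `toy` frame and its values are assumptions for illustration, not our dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the movies table; the values are made up for illustration.
toy = pd.DataFrame({
    "genres": ["Comedy", None, "Drama", "Horror"],
    "runtime": [90.0, 122.0, np.nan, 95.0],
    "original_release_date": ["2010-04-30", None, None, "1957-04-13"],
})

# Fraction of missing values in each column.
null_fraction = toy.isna().mean()

# Fraction of rows that dropna() would remove (rows with at least one null).
rows_lost = toy.isna().any(axis=1).mean()

print(null_fraction)
print(f"rows lost by dropna: {rows_lost:.0%}")
```

If a single column accounts for most of the lost rows, dropping that column or imputing it may be a reasonable alternative to dropping the rows.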

The genres column is a categorical variable with no order, so we one-hot encode it.
To do this, we first split each genres string into a new 'genres list' column, then apply `pd.Series` to each 'genres list' element so that every genre can be encoded separately.

In [90]:
# Split the comma-separated genre string into a list of genre names
movies['genres list'] = movies['genres'].str.split(', ', expand=False)
# Expand each list so that get_dummies sees one genre per row
genres_series = movies['genres list'].apply(pd.Series).stack()
# One-hot encode each genre, then sum per movie to get one multi-hot row
genres_encoded = pd.get_dummies(genres_series).groupby(level=0).sum()

# Store each movie's multi-hot vector as a list in a single column
movies['genres encoded'] = genres_encoded.values.tolist()
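
As an alternative sketch, pandas can build the same kind of multi-hot encoding in one call with `Series.str.get_dummies`, which yields one indicator column per genre instead of a list stored in a single column. The toy frame and the `genre_` prefix below are assumptions for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    "movie_title": ["Please Give", "10", "12 Angry Men (Twelve Angry Men)"],
    "genres": ["Comedy", "Comedy, Romance", "Classics, Drama"],
})

# Series.str.get_dummies splits each string on the separator and returns
# one 0/1 indicator column per distinct genre (columns sorted alphabetically).
genre_indicators = toy["genres"].str.get_dummies(sep=", ")

# Prefix the new columns so they are easy to select later.
genre_indicators = genre_indicators.add_prefix("genre_")
toy = pd.concat([toy, genre_indicators], axis=1)
print(toy.drop(columns="genres"))
```

Separate indicator columns can be passed directly to most scikit-learn estimators, whereas a column of lists usually needs to be unpacked first.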

Content ratings have an order, so we map them to ordinal numerical values.

In [91]:
ratings = {'G': 0, 'PG': 1, 'PG-13': 2, 'R': 3, 'NR': 4, 'NC17': 5}

movies['content_rating_id'] = movies['content_rating'].map(ratings)
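
One thing worth checking with this approach is that `Series.map` silently produces `NaN` for any label that is missing from the dictionary. A minimal sketch of such a check, using a toy column in which `"TV-MA"` is a made-up label chosen only to trigger the case:

```python
import pandas as pd

ratings = {'G': 0, 'PG': 1, 'PG-13': 2, 'R': 3, 'NR': 4, 'NC17': 5}

# Toy column; "TV-MA" is a made-up label that is not in the mapping.
content_rating = pd.Series(["PG", "R", "NR", "TV-MA"])
rating_id = content_rating.map(ratings)

# Series.map returns NaN for keys missing from the dictionary,
# so selecting the NaN positions reveals the unmapped labels.
unmapped = content_rating[rating_id.isna()].unique().tolist()
print(unmapped)
```

An empty list here would confirm that every content rating received an integer code.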

We convert the original release date into a pandas datetime object so that release dates can easily be compared with each other.

In [89]:
movies['original_release_date'] = pd.to_datetime(movies['original_release_date'])
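
Most classifiers need numeric inputs, so one possible next step is to derive integer features from the datetime column. This is a sketch only; the `release_year` and `release_month` column names are hypothetical and not part of our pipeline:

```python
import pandas as pd

toy = pd.DataFrame({"original_release_date": ["2010-02-12", "1979-10-05", "1957-04-13"]})
toy["original_release_date"] = pd.to_datetime(toy["original_release_date"])

# The .dt accessor exposes date parts as plain integers.
toy["release_year"] = toy["original_release_date"].dt.year
toy["release_month"] = toy["original_release_date"].dt.month
print(toy)
```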


In [95]:
# Reorder the columns
reordered_columns = ['movie_title', 'content_rating', 'content_rating_id', 'genres', 'genres encoded', 'directors',
                    'runtime', 'original_release_date', 'tomatometer_status', 'tomatometer_rating', 
                    'tomatometer_top_critics_count', 'tomatometer_fresh_critics_count', 'tomatometer_rotten_critics_count']
movies = movies[reordered_columns]
movies.head()

| | movie_title | content_rating | content_rating_id | genres | genres encoded | directors | runtime | original_release_date | tomatometer_status | tomatometer_rating | tomatometer_top_critics_count | tomatometer_fresh_critics_count | tomatometer_rotten_critics_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Percy Jackson & the Olympians: The Lightning T... | PG | 1 | Action & Adventure, Comedy, Drama, Science Fic... | [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, ... | Chris Columbus | 119.0 | 2010-02-12 | Rotten | 49.0 | 43 | 73 | 76 |
| 1 | Please Give | R | 3 | Comedy | [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | Nicole Holofcener | 90.0 | 2010-04-30 | Certified-Fresh | 87.0 | 44 | 123 | 19 |
| 2 | 10 | R | 3 | Comedy, Romance | [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | Blake Edwards | 122.0 | 1979-10-05 | Fresh | 67.0 | 2 | 16 | 8 |
| 3 | 12 Angry Men (Twelve Angry Men) | NR | 4 | Classics, Drama | [0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ... | Sidney Lumet | 95.0 | 1957-04-13 | Certified-Fresh | 100.0 | 6 | 54 | 0 |
| 4 | 20,000 Leagues Under The Sea | G | 0 | Action & Adventure, Drama, Kids & Family | [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, ... | Richard Fleischer | 127.0 | 1954-01-01 | Fresh | 89.0 | 5 | 24 | 3 |


# Proposed Solution

We predict that these features, while not part of the content of the movie itself, will be strong predictors of the movie’s rating. More specifically, we predict that the director will have a large influence on the overall rating given to the movie, and that certain genres will be rated more highly than others across the board. We believe that there is a connection between these non-cinematic characteristics and the rating of the movie, whether large or small, and that a classification model will be able to identify it. We will evaluate each candidate model using our selected metric, the F1 score.
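
The model selection procedure described here could look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the data is synthetic, logistic regression is only a placeholder classifier, and the grid over `C` is arbitrary. The macro-averaged F1 score is one common way to extend F1 to three classes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data: 300 "movies", 6 numeric features, 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = np.digitize(X[:, 0] + 0.5 * X[:, 1], bins=[-0.5, 0.5])  # labels 0, 1, 2

# Hold out a test set that the grid search never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical hyperparameter grid; these values are placeholders.
param_grid = {"C": [0.1, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="f1_macro",  # unweighted mean of the per-class F1 scores
    cv=5,                # 5-fold cross-validation on the training split
)
search.fit(X_train, y_train)

# Score the refit best model on the held-out test split.
test_f1 = f1_score(y_test, search.predict(X_test), average="macro")
print(search.best_params_)
print(f"held-out macro F1: {test_f1:.3f}")
```

Keeping the test split out of the grid search means the final score estimates generalization rather than the fit to data that was used to choose hyperparameters.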


# Evaluation Metrics

The F1 score is likely the best evaluation metric for this problem. In this situation, a false positive isn’t necessarily better or worse than a false negative. Because the F1 score is the harmonic mean of precision and recall, it is high only when both false positives and false negatives are low, so it summarizes both kinds of error in a single number.

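$$\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}$$

$$\text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}$$

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$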
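
Since our problem has three classes, these formulas apply to one class at a time, with that class treated as positive and the other two as negative. The sketch below works through a small made-up example and averages the per-class scores (macro averaging, which is an assumption here; we have not fixed an averaging scheme). The labels and predictions are invented for illustration.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 for one class, treating `label` as positive and all other labels as negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0  # also covers the cases where precision or recall is undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up true and predicted statuses for six movies.
y_true = ["Rotten", "Rotten", "Fresh", "Fresh", "Certified-Fresh", "Certified-Fresh"]
y_pred = ["Rotten", "Fresh", "Fresh", "Fresh", "Certified-Fresh", "Rotten"]

labels = ["Rotten", "Fresh", "Certified-Fresh"]
scores = {label: f1_per_class(y_true, y_pred, label) for label in labels}

# Macro average: every class counts equally, regardless of its size.
macro_f1 = sum(scores.values()) / len(scores)
print(scores)
print(round(macro_f1, 3))
```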


# Preliminary results




# Ethics & Privacy

Since our dataset contains only publicly available information about Hollywood movies from the Rotten Tomatoes website, our machine learning analysis should not run into major ethical concerns. One consideration is that the dataset includes the names of the movie reviewers, which could be an ethical concern if we included their names in our analysis. We will avoid this by ignoring the names of the critics entirely and dropping them from the data, since they are irrelevant to our project.

# Team Expectations 

* We will communicate on Discord and let each other know if we have a time conflict that interferes with working on the project. 
* Team members will complete the work they agree to ahead of any deadlines. 
* All team members will follow academic integrity rules. 

# Project Timeline Proposal


| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/21  |  4 PM |  Brainstorm ideas for the project and look for datasets.  | Discuss ideas for what we want to do for our project and start filling out the proposal.  | 
| 2/22  |  6 PM |  Work on our respective sections of the proposal. | Finish up the project proposal by combining our sections and pushing to GitHub. | 
| 3/1  | 6 PM  | Everyone should look over the Project Checkpoint and understand what needs to be done for it.  | Assign specific parts of the Project Checkpoint to the group members and discuss together what will go into it.   |
| 3/4  | 4 PM  | Begin working on our respective sections of the Checkpoint. | Discuss our sections and get help if needed. Make sure everyone understands what they should get done by the Checkpoint deadline. |
| 3/7  | 4 PM  | Try to have our sections mostly finished up. | Check in the day before Checkpoint deadline to make sure we're all good to have it finished by tomorrow night. |
| 3/8  | 6 PM  | Have all our sections finished. | Combine our sections of the Checkpoint, do some last minute revising, and submit. |
| 3/15  | 6 PM  | Look over the final project specifications and make sure we all understand them. | Start talking about what needs to get done for the final project. Assign sections to each other.  |
| 3/19  | 4 PM  | Start working on our sections or at least put some time into brainstorming them. | Work together on the project for a bit and discuss any challenges we're running into. Make sure we have a clear plan for finishing the project by 3/22.  |
| 3/21  | 4 PM  | Finished or mostly finished our sections of the project. | Discuss the project and start putting our pieces together. Talk about any problems and make sure we're going to finish.  |
| 3/22  | Before 11:59 PM  | Project finished | Turn in Final Project  |

# Footnotes

1. Liao, E. (2018, December 8). Can We Predict Rotten Tomatoes Ratings? Towards Data Science. (https://towardsdatascience.com/can-we-predict-rotten-tomatoes-ratings-8b5f5b7d7eff)

2. Maio, A. (2020, March 4). How Does Rotten Tomatoes Actually Work? Studio Binder. (https://www.studiobinder.com/blog/rotten-tomatoes-ratings-system/)

3. Roper, H. (2021, January 20). What makes for a good movie? Towards Data Science. (https://towardsdatascience.com/what-makes-for-a-good-movie-8e10896e0f1b)