# COGS 118A - Project Checkpoint

# Names
- Banso Nguyen
- Kirsten Nino
- Rufeng Chen
- Shan He

# Abstract 

Our goal is to identify the factors that influence movie ratings and use them to predict the performance of upcoming movies. The dataset contains variables such as production details, release date, director, and other related information about 119,000 movies and TV shows released around the world. After cleaning the dataset, we will fit two separate linear regressions, one for each measure of performance: revenue and vote average (audience score).

# Background

Many factors can influence how a movie is rated, such as its budget, director, actors and actresses, and genre. How these factors influence movie ratings has been an area of interest for researchers.

The budget of a movie is generally thought to be the main determinant of its rating. Higher budgets often lead to higher production quality, top-notch actors, and more extensive advertising. These could potentially draw larger audiences and better reviews. A 2013 study by Eliashberg, Elberse, and Leenders found that higher budgets often lead to more promising film projects.<a name="Eliashberg"></a>[<sup>[1]</sup>](#Eliashberg)

People often anticipate movies by their favorite directors, and directors who've produced great movies in the past often receive higher ratings for their new works. Wallace, Seigerman, and Holbrook found that directors with previous successes are more likely to have higher box-office sales as well as better reviews.<a name="Wallace"></a>[<sup>[2]</sup>](#Wallace)

Moreover, genre is another big contributing factor to movie ratings. Some genres just seem to resonate more with audiences and critics. De Vany and Walls' study showed that action flicks and dramas usually get higher ratings than comedies or horror movies.<a name="De Vany"></a>[<sup>[3]</sup>](#De) 
	
Many other factors can also alter the performance of a movie, such as its cast, release date, and marketing strategy. We will investigate which factors influence movie ratings and predict a movie's rating based on those factors.



# Problem Statement

In the rapidly expanding world of film and television, it is hard for average audiences to tell which movies will be hits and which will be misses. Typically, we wait for critic reviews or box office results, but what if we could predict the outcome of a movie? Our goal is to build a way of predicting a movie's success before it hits the screens. We will create a model that takes into account a range of information, such as budget, director, genre, and release date, and uses these variables to predict a movie's success. All of these factors can be represented numerically; the categorical ones can be encoded.

The problem is measurable because success is defined by two metrics: revenue and audience score (vote average). The model's performance can be measured by comparing its predictions against the film's actual post-release performance, which offers an objective way to evaluate the model's accuracy and reliability. The problem is also replicable: the same model can be applied to any movie, and with a large dataset of 119,000 movies and TV shows from around the world, its training and predictions can be reproduced and improved upon over time.

To solve this problem, we'll experiment with several machine learning models, specifically linear regression models, compare them, and pick the most accurate. The selected model would then be used to predict the performance of upcoming movies, providing audiences with a guide to potential movie success.

# Data
Movie data (100K+ titles with budget, credits)
https://www.kaggle.com/datasets/kakarlaramcharan/tmdb-data-0920

The dataset used for this project is obtained from TMDB (The Movie Database), a community-built movie and TV database. The dataset contains information on more than 119,000 movies and TV shows released internationally. The dataset is in CSV format and is pipe-delimited.
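Because the file is pipe-delimited, it has to be read with an explicit separator. The snippet below is a minimal sketch of that step; the in-memory sample and its column subset are hypothetical stand-ins for the real file, and only the `sep="|"` argument is the point.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for the pipe-delimited TMDB export.
sample = io.StringIO(
    "title|budget|revenue|vote_average\n"
    "Toy Story|30000000|373554033|7.7\n"
    "Jumanji|65000000|262797249|6.9\n"
)

# sep="|" tells pandas to split columns on the pipe character instead of a comma.
tmdb_df = pd.read_csv(sample, sep="|")
print(tmdb_df.shape)
```

With the real dataset, the `io.StringIO` object would be replaced by the path to the downloaded file.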

Size of the Dataset:
The dataset consists of approximately 119,000 records and has 27 features (variables) that provide comprehensive information about each movie or TV show.

Observation:
Each observation in the dataset represents a movie or TV show. It contains various details about the production, release, cast, crew, and other relevant information associated with the title.

Critical Variables:
Some critical variables in the dataset include:

|Variable|Description|
|---|---|
|belongs_to_collection | Indicates whether the movie belongs to a collection, with the collection specified if it exists. |
|budget | Represents the budget of the movie. |
|original_language | Denotes the original language in which the movie is produced. |
|original_title | Specifies the original title of the movie. |
|overview | Provides a summary or synopsis of the movie. |
|popularity | Represents the popularity index of the movie. |
|production_companies | Lists the companies involved in producing the movie. |
|production_countries | Indicates the country where the movie is produced. |
|release_date | Represents the release date of the movie. |
|revenue | Represents the revenue generated by the movie. If missing, it is represented by 0. |
|runtime | Denotes the duration of the movie in minutes. |
|status | Indicates whether the movie is released or not. |
|tagline | Provides the tagline associated with the movie. |
|title | Specifies the English alias title of the movie. |
|vote_average | Represents the average vote rating given by viewers. |
|cast | Represents the cast credits (Actors). |
|directors | Represents the director credits. |

Handling, Transformations, Cleaning:
* *Handling Missing Values: Check for missing values in the budget, popularity, and revenue variables, keeping in mind that a missing revenue is recorded as 0. We will use imputation (replace missing values with the mean, median, or mode) or remove observations with missing values.*
* *Handling Categorical Variables: Since the dataset contains categorical variables such as original_language or production_companies, we need to encode them with one-hot encoding to convert them into a suitable numerical representation.*
* *Data type conversion: Ensure budget, popularity, and revenue variables are in numeric format. Convert them from string or object data types to numeric data types (such as integers or floating-point numbers) to perform mathematical operations and analysis on them.*
* *Scaling or normalization: Since we may use regression models or clustering algorithms that are sensitive to the scale of the data, we need to normalize the budget, popularity, and revenue variables using z-score normalization or min-max scaling, making them comparable in size.*
* *Remove irrelevant features: Identify any irrelevant or redundant features that do not contribute significantly to the analysis or prediction. Removing these features simplifies the dataset and reduces noise.*
* *Data Splitting: Since we plan to build a predictive model of revenue or vote average, we need to split the dataset into training and testing sets to evaluate the performance of the model. This allows us to train the model on part of the data and evaluate its accuracy on unseen data.*
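
The steps listed above can be sketched with pandas and scikit-learn as follows. This is only an illustration under stated assumptions: the toy rows, the choice of median imputation and z-score scaling, and the 80/20 split are placeholders, not decisions the project has finalized.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows: numbers arrive as strings and 0 marks a missing value.
toy = pd.DataFrame({
    "budget": ["30000000", "65000000", "0", "16000000", "0", "60000000"],
    "popularity": ["21.946943", "17.015539", "11.7129", "3.859495", "8.387519", "17.924927"],
    "original_language": ["en", "en", "en", "fr", "en", "ta"],
    "revenue": [373554033.0, 262797249.0, 0.0, 81452156.0, 76578911.0, 187436818.0],
})
numeric_cols = ["budget", "popularity"]
categorical_cols = ["original_language"]

# Data type conversion: coerce strings to numbers; unparseable values become NaN.
for col in numeric_cols:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

# Missing values: a budget of 0 means "unknown", so mark it as NaN for imputation.
toy["budget"] = toy["budget"].replace(0, np.nan)
# Rows whose target (revenue) is unknown cannot be used for supervised training.
toy = toy[toy["revenue"] > 0]

X = toy[numeric_cols + categorical_cols]
y = toy["revenue"]

# Data splitting: hold out a test set before fitting any transformer.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Impute and scale numeric columns; one-hot encode categorical columns.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_train_prepared = preprocess.fit_transform(X_train)
X_test_prepared = preprocess.transform(X_test)
print(X_train_prepared.shape, X_test_prepared.shape)
```

Fitting the transformers on the training split only, and then applying them to the test split, keeps information from the held-out data from leaking into the imputation and scaling statistics.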

# Setup

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import wget
import zipfile

In [2]:
movie_df = pd.read_csv('movies_metadata.csv')
movie_df


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45461,False,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...",http://www.imdb.com/title/tt6209470/,439050,tt6209470,fa,رگ خواب,Rising and falling between a man and woman.,...,,0.0,90.0,"[{'iso_639_1': 'fa', 'name': 'فارسی'}]",Released,Rising and falling between a man and woman,Subdue,False,4.0,1.0
45462,False,,0,"[{'id': 18, 'name': 'Drama'}]",,111109,tt2028550,tl,Siglo ng Pagluluwal,An artist struggles to finish his work while a...,...,2011-11-17,0.0,360.0,"[{'iso_639_1': 'tl', 'name': ''}]",Released,,Century of Birthing,False,9.0,3.0
45463,False,,0,"[{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...",,67758,tt0303758,en,Betrayal,"When one of her hits goes wrong, a professiona...",...,2003-08-01,0.0,90.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,A deadly game of wits.,Betrayal,False,3.8,6.0
45464,False,,0,[],,227506,tt0008536,en,Satana likuyushchiy,"In a small town live two brothers, one a minis...",...,1917-10-21,0.0,87.0,[],Released,,Satan Triumphant,False,0.0,0.0


In [3]:
selected_columns = ['budget', 'genres', 'original_language', 'popularity', 'production_companies', 'production_countries', 'revenue', 'runtime', 'spoken_languages', 'vote_average', 'vote_count']
movie_df = movie_df[selected_columns]

In [4]:
movie_df

Unnamed: 0,budget,genres,original_language,popularity,production_companies,production_countries,revenue,runtime,spoken_languages,vote_average,vote_count
0,30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",en,21.946943,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",7.7,5415.0
1,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",en,17.015539,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",6.9,2413.0
2,0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",en,11.7129,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",6.5,92.0
3,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",en,3.859495,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",6.1,34.0
4,0,"[{'id': 35, 'name': 'Comedy'}]",en,8.387519,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",5.7,173.0
...,...,...,...,...,...,...,...,...,...,...,...
45461,0,"[{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...",fa,0.072051,[],"[{'iso_3166_1': 'IR', 'name': 'Iran'}]",0.0,90.0,"[{'iso_639_1': 'fa', 'name': 'فارسی'}]",4.0,1.0
45462,0,"[{'id': 18, 'name': 'Drama'}]",tl,0.178241,"[{'name': 'Sine Olivia', 'id': 19653}]","[{'iso_3166_1': 'PH', 'name': 'Philippines'}]",0.0,360.0,"[{'iso_639_1': 'tl', 'name': ''}]",9.0,3.0
45463,0,"[{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...",en,0.903007,"[{'name': 'American World Pictures', 'id': 6165}]","[{'iso_3166_1': 'US', 'name': 'United States o...",0.0,90.0,"[{'iso_639_1': 'en', 'name': 'English'}]",3.8,6.0
45464,0,[],en,0.003503,"[{'name': 'Yermoliev', 'id': 88753}]","[{'iso_3166_1': 'RU', 'name': 'Russia'}]",0.0,87.0,[],0.0,0.0


In [9]:
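# movie_df = movie_df.drop(movie_df[movie_df['budget'] == 0].index)
# movie_df = movie_df.dropna()
# Zeros and empty lists act as "missing" placeholders in this dataset,
# so keep only the rows in which no selected column holds one.
df = movie_df[(movie_df != 0).all(axis=1)]   # numeric 0 (this also covers 0.0)
df = df[(df != '0').all(axis=1)]             # zero stored as a string
df = df[(df != '0.0').all(axis=1)]
df = df[(df != '[]').all(axis=1)]            # empty list stored as a string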

In [10]:
df

Unnamed: 0,budget,genres,original_language,popularity,production_companies,production_countries,revenue,runtime,spoken_languages,vote_average,vote_count
0,30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",en,21.946943,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",7.7,5415.0
1,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",en,17.015539,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",6.9,2413.0
3,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",en,3.859495,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",6.1,34.0
5,60000000,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",en,17.924927,"[{'name': 'Regency Enterprises', 'id': 508}, {...","[{'iso_3166_1': 'US', 'name': 'United States o...",187436818.0,170.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",7.7,1886.0
8,35000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",en,5.23158,"[{'name': 'Universal Pictures', 'id': 33}, {'n...","[{'iso_3166_1': 'US', 'name': 'United States o...",64350171.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",5.5,174.0
...,...,...,...,...,...,...,...,...,...,...,...
45014,60000000,"[{'id': 28, 'name': 'Action'}, {'id': 37, 'nam...",en,50.903593,"[{'name': 'Imagine Entertainment', 'id': 23}, ...","[{'iso_3166_1': 'ZA', 'name': 'South Africa'},...",71000000.0,95.0,"[{'iso_639_1': 'en', 'name': 'English'}]",5.7,688.0
45139,50000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 10751, '...",en,33.694599,"[{'name': 'Columbia Pictures', 'id': 5}, {'nam...","[{'iso_3166_1': 'US', 'name': 'United States o...",66913939.0,86.0,"[{'iso_639_1': 'en', 'name': 'English'}]",5.8,327.0
45167,11000000,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",en,40.796775,"[{'name': 'Thunder Road Pictures', 'id': 3528}...","[{'iso_3166_1': 'GB', 'name': 'United Kingdom'...",184770205.0,111.0,"[{'iso_639_1': 'en', 'name': 'English'}]",7.4,181.0
45250,12000000,"[{'id': 28, 'name': 'Action'}, {'id': 35, 'nam...",ta,1.323587,"[{'name': 'AVM Productions', 'id': 16424}]","[{'iso_3166_1': 'IN', 'name': 'India'}]",19000000.0,185.0,"[{'iso_639_1': 'ta', 'name': 'தமிழ்'}, {'iso_6...",6.9,25.0
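
The genres column shown above holds stringified lists of dictionaries, so it cannot be one-hot encoded directly. The following is a rough sketch of one way to parse it; the two sample strings are hypothetical and only mimic the shape of the real cells, and `genre_names` is a helper name introduced here for illustration.

```python
import ast
import pandas as pd

# Hypothetical cells in the same shape as the genres column above.
genres_raw = pd.Series([
    "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
    "[{'id': 35, 'name': 'Comedy'}]",
])

def genre_names(cell):
    """Parse a stringified list of dicts and return just the genre names."""
    return [entry["name"] for entry in ast.literal_eval(cell)]

genre_lists = genres_raw.apply(genre_names)

# Multi-hot encode: one 0/1 column per genre, since a movie can have several.
genre_dummies = genre_lists.str.join("|").str.get_dummies(sep="|")
print(genre_dummies)
```

`ast.literal_eval` is used instead of `eval` because it only accepts Python literals, which is safer for text read from a file.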


# Proposed Solution

To predict a movie's success, we're focusing on two key indicators: revenue and audience score (also known as vote average). We propose employing the following machine learning models to accomplish this - Linear Regression and Logistic Regression.

We will fit two separate linear regressions, one per target. Because both targets (revenue and vote average) are continuous, linear regression is a natural first choice for understanding how factors like budget, director, genre, and release date relate to them. The model fits a linear function of the features that minimizes the squared error between predicted and observed values, and the fitted function can then be used to predict the performance of upcoming movies. We will also use L1 regularization (Lasso), which shrinks the coefficients of the least important features to exactly zero and leaves the most relevant ones. To implement this, we will use scikit-learn's LinearRegression and Lasso models. The loss being minimized is the squared error; note that scikit-learn's LinearRegression solves the least-squares problem directly rather than by gradient descent, and Lasso uses coordinate descent, so if we want an explicitly gradient-based fit we would use SGDRegressor instead.
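
As a minimal sketch of the difference between plain least squares and Lasso, the example below fits both on synthetic data. The data, the three features, and `alpha=0.5` are assumptions chosen for illustration and are not taken from the movie dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data (an assumption for illustration): three features,
# of which only the first two actually influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Ordinary least squares: every feature receives a coefficient.
ols = LinearRegression().fit(X, y)

# Lasso adds an L1 penalty; alpha controls how strongly coefficients shrink.
lasso = Lasso(alpha=0.5).fit(X, y)

print("OLS coefficients:  ", ols.coef_)
print("Lasso coefficients:", lasso.coef_)
```

In practice `alpha` would be tuned by cross-validation (for example with `LassoCV`) rather than fixed by hand.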

Another approach is logistic regression, which classifies a movie as successful or unsuccessful based on a threshold on revenue or audience score. For instance, a movie could be labeled successful if its vote average is at least 7 (which side the exact value 7 falls on is a choice we still need to make). Logistic regression learns a decision boundary in feature space; the farther a movie's features are from that boundary, the more confident the model is that the movie is successful or unsuccessful, depending on which side the point falls on. We plan to use scikit-learn's LogisticRegression. The model is trained by minimizing the log loss (cross-entropy); scikit-learn's default solver is L-BFGS, a gradient-based optimizer, rather than plain gradient descent.
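
The sketch below illustrates this idea on synthetic data: the target is binarized at 7, a logistic regression is fit, and the signed score relative to the decision boundary is compared with the predicted probability. The single feature and the way the vote average is generated are assumptions made only for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Synthetic stand-in: one feature that loosely drives the vote average.
rng = np.random.default_rng(0)
feature = rng.normal(size=(300, 1))
vote_average = 6.5 + 1.0 * feature[:, 0] + rng.normal(scale=0.5, size=300)

# Binarize the target at the threshold discussed above (7).
is_success = (vote_average >= 7).astype(int)

clf = LogisticRegression().fit(feature, is_success)

# decision_function returns the signed score w.x + b: zero on the decision
# boundary, positive on the "successful" side, negative on the other side.
scores = clf.decision_function(feature)
# predict_proba applies the sigmoid to that score.
probs = clf.predict_proba(feature)[:, 1]

print("log loss:", log_loss(is_success, probs))
```

Points with a large positive score get a probability close to 1, and points with a large negative score get a probability close to 0.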

We may exclude certain features, such as actors (one-hot encoding them would produce too many columns) or overview (plot summaries are free text and too unique to compare directly). As mentioned in the Data section, we will handle missing values with imputation, one-hot encode the appropriate non-numerical features, and evaluate with k-fold cross-validation (k to be determined). With the remaining variables we can run the methods above using the NumPy and scikit-learn libraries. A potential benchmark model is k-NN, and we can build two: one predicting revenue and the other predicting vote average. Comparing these benchmarks with the corresponding linear and logistic regressions will help us assess the relative effectiveness of our chosen methods.
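
A comparison of this kind could look roughly like the sketch below, which scores a k-NN benchmark and a linear regression with 5-fold cross-validation. The synthetic data, `n_splits=5`, and `n_neighbors=5` are placeholder assumptions; the project's actual k values are still to be determined.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=200)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "k-NN benchmark": make_pipeline(
        StandardScaler(), KNeighborsRegressor(n_neighbors=5)
    ),
    "linear regression": make_pipeline(StandardScaler(), LinearRegression()),
}

cv_mse = {}
for name, model in models.items():
    # scikit-learn reports negated MSE so that higher is always better.
    scores = cross_val_score(
        model, X, y, cv=cv, scoring="neg_mean_squared_error"
    )
    cv_mse[name] = -scores.mean()
    print(f"{name}: mean CV MSE = {cv_mse[name]:.3f}")
```

Putting the scaler inside each pipeline means it is refit on every training fold, so the validation fold never influences the scaling.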



# Evaluation Metrics

One evaluation metric for the linear regression models is mean squared error (MSE), the average squared difference between the predicted and actual values of revenue or vote average. A lower MSE means the predictions are closer to the actual values.

Another possible metric, applicable when the problem is framed as a binary prediction of successful versus not successful, is positive predictive value (PPV), also called precision: the fraction of movies predicted to be successful that actually are. This framing is simpler and provides less insight than predicting the continuous values. We could also compute the ROC-AUC to evaluate how well the classifier ranks successful movies above unsuccessful ones.

These metrics are simple and map naturally onto linear and logistic regression respectively: MSE evaluates the predictions of a linear regression model, and PPV evaluates the binary classification output of logistic regression.
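
To make the three metrics concrete, here is a small worked sketch using scikit-learn. All numbers are hypothetical and chosen by hand so the results are easy to check; they are not results from our models.

```python
from sklearn.metrics import mean_squared_error, precision_score, roc_auc_score

# Hypothetical regression case: actual vs. predicted vote averages.
y_true = [7.7, 6.9, 6.1, 5.7]
y_pred = [7.2, 6.9, 6.6, 5.7]
# Squared errors are 0.25, 0, 0.25, 0, so the mean is 0.125.
mse = mean_squared_error(y_true, y_pred)

# Hypothetical classification case: 1 = successful, 0 = not successful.
labels = [1, 0, 1, 0, 1]
predicted_probs = [0.9, 0.6, 0.7, 0.2, 0.4]
predicted_labels = [int(p >= 0.5) for p in predicted_probs]

# Precision (PPV): 2 of the 3 movies predicted successful really are.
ppv = precision_score(labels, predicted_labels)

# ROC-AUC: share of (positive, negative) pairs ranked correctly, here 5 of 6.
auc = roc_auc_score(labels, predicted_probs)

print(f"MSE={mse:.3f}, PPV={ppv:.3f}, ROC-AUC={auc:.3f}")
```

Note that precision depends on the chosen probability cutoff (0.5 here), while ROC-AUC summarizes ranking quality across all cutoffs.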

# Preliminary results




# Ethics & Privacy

Personal Information Privacy: The dataset contains the names of people involved in the film industry, such as actors, directors, and production staff. We will anonymize the personal information in the data by removing the relevant variables.

Fairness in revenue forecasting: This dataset will be used for forecasting, so inadvertent unfairness may arise. For example, certain genres, production companies, or countries of origin may be favored in revenue projections, leading to potential discrepancies or unequal opportunities.

Unintended analysis and discrimination: When analyzing datasets, there is a risk of unintended analysis and discrimination. Models or algorithms trained on the data may inadvertently learn biases against certain groups based on factors such as language, genre, or country of production.

# Team Expectations 

* *Communicate through Discord and regularly schedule weekly remote meetings*
* *Make major project decisions and review completed work during Zoom or Discord meetings*
* *Work is to be divided equally, individually completed before next meeting*

# Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 5/17  |  4:30 |  Each part of the project proposal  | Finalize proposal and push to GitHub | 
| 5/23  |  6:15 |  Read the proposals | Complete peer review, discuss parts of the checkpoint and assign | 
| 5/30  | 6:15  | Assigned checkpoint parts  | Finalize the checkpoint and push to GitHub. Talk about what is left for final project and assign parts |
| 6/8  | 6:15  | Make good progress/finish final project parts | Discuss what is completed and make edits. |
| 6/13  | 6:15  | Finish assigned final parts | Finalize and push to GitHub |

# Footnotes
<a name="Eliashberg"></a>1. [^](#Eliashberg): Eliashberg, J., Elberse, A., & Leenders, M. A. (2013). The Motion Picture Industry: Critical Issues in Practice, Current Research, and New Research Directions. https://repository.upenn.edu/cgi/viewcontent.cgi?article=1179&context=oid_papers<br>

<a name="Wallace"></a>2. [^](#Wallace): Wallace, W. T., Seigerman, A., & Holbrook, M. B. (1993). The Role of Actors and Actresses in the Success of Films: How Much Is a Movie Star Worth? https://link.springer.com/article/10.1007/BF00820765<br>

<a name="De Vany"></a>3. [^](#De): De Vany, A., & Walls, W. D. (1996). Bose–Einstein dynamics and adaptive contracting in the motion picture industry. https://econpapers.repec.org/article/ecjeconjl/v_3a106_3ay_3a1996_3ai_3a439_3ap_3a1493-1514.htm<br>
