# Predicting movie sales from Metacritic data

## 0. Business problem

The movie industry faces high financial risk because of rising production and marketing costs and considerable uncertainty about audience reactions (Escoffier & McKelvey, 2015).
While Metacritic provides rich information about movies, such as critic scores, user scores, review texts, and metadata, it is unclear how well these features predict monetary success. This project therefore uses historical data from Metacritic together with movie sales information to build several machine learning models that forecast whether a movie will achieve low, medium, or high sales. It also focuses on explaining which features drive these predictions.
The final goal is to provide the movie publisher with valuable information on where to spend the marketing budget.

**Business Question**: How can we predict the box office performance of a movie in order to allocate the marketing budget accurately?

**Source**:
Escoffier, N., & McKelvey, B. (2015). The Wisdom of Crowds in the Movie Industry: Towards New Solutions to Reduce Uncertainties. International Journal of Arts Management, 17(2), 52–63. http://www.jstor.org/stable/24587073

### 0.1 Main research question & subquestions
**Main research question**:
How accurately can we predict a movie's first-year sales using Metacritic ratings, metadata, and review texts, and which predictive factors are most influential?

**Subquestions**
1. How are critic scores, user scores, genres, platforms, and release years related to the sales tiers of movies?
2. How well can different machine learning models predict the sales tier of a movie, based on structured features?
3. To what extent does adding transformer-based embeddings of review titles and/or movie summaries improve prediction performance compared to models using only structured features?
4. Which features are most influential in predicting high versus low sales according to SHAP?
5. Can we identify review topics and/or movie clusters (e.g., using BERTopic and clustering methods) that are particularly associated with high or low sales tiers, and do these insights reveal distinct market segments?

The movie sales prediction dataset is stored in the `datasets` folder of the repository. We will read the data and clean it to prepare it for analysis.

The following information is provided on the dataset variables selected to address the research questions:

This research employs a continuous numerical variable, **Worldwide Box Office**, as the response variable. This represents the total revenue generated globally (in USD).
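
Because the models predict a sales tier rather than a dollar amount, the continuous box-office figure has to be discretized at some point. The sketch below shows one possible way to do this. It assumes tertile cut points and a hypothetical column name `worldwide_box_office`; the thresholds and column names actually used in this project may differ.

```python
import pandas as pd

# Toy data: the column name and values are illustrative assumptions,
# not taken from the project's sales file.
toy_sales = pd.DataFrame(
    {"worldwide_box_office": [1.2e6, 8.5e6, 4.0e7, 9.5e7, 3.1e8, 1.1e9]}
)

# Assumption: three equally sized tiers based on quantiles (tertiles).
toy_sales["sales_tier"] = pd.qcut(
    toy_sales["worldwide_box_office"],
    q=3,
    labels=["low", "medium", "high"],
)

print(toy_sales)
```

Quantile-based cuts keep the three classes balanced, which simplifies model evaluation; fixed dollar thresholds would be an alternative if business-defined tiers are preferred.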

Based on a review of the literature, this study uses the following 10 explanatory variables:

- **X1**: Metascore
  - A weighted average of critic reviews (Scale: 0 - 100).
- **X2**: User Score
  - Average score provided by general users (Scale: 0 - 10).
- **X3**: Production Budget
  - The estimated financial cost to produce the film (USD).
- **X4**: Genre
  - Categorical variable indicating the primary classification of the movie (e.g., Action, Comedy, Drama).
  - Movies with multiple genres are processed using One-Hot Encoding.
- **X5**: Release Date
  - Used to extract the specific month and year of release to account for seasonal market trends and inflation adjustments.
- **X6**: Runtime
  - The duration of the movie in minutes.
- **X7**: Theatre Count
  - The number of theatres showing the movie during its opening weekend, serving as a proxy for distribution width.
- **X8**: MPAA Rating
  - Categorical certification defining the target audience scope:
    - G = General Audiences
    - PG = Parental Guidance Suggested
    - PG-13 = Parents Strongly Cautioned
    - R = Restricted
    - NC-17 = Adults Only
- **X9**: Movie Summary
  - The textual plot summary of the film.
  - Used to generate semantic embeddings via Transformers to capture narrative elements.
- **X10**: Review Text
  - The raw text body of expert and user reviews.
  - Used for BERTopic modeling to identify dominant discourse topics associated with sales performance.
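
The preprocessing described for **X4** (Genre) and **X5** (Release Date) can be sketched roughly as follows. This is a minimal illustration on toy rows: the column names `genre` and `release_date`, the comma separator, and the date format are assumptions and may not match the actual files.

```python
import pandas as pd

# Toy rows: column names, separator, and values are illustrative assumptions.
toy_movies = pd.DataFrame(
    {
        "title": ["Movie A", "Movie B", "Movie C"],
        "genre": ["Action,Comedy", "Drama", "Action,Drama"],
        "release_date": ["2015-07-03", "2016-12-16", "2014-02-28"],
    }
)

# X4: one indicator column per genre; a movie with several genres gets several 1s.
genre_dummies = toy_movies["genre"].str.get_dummies(sep=",").add_prefix("genre_")

# X5: parse the date, then extract the release month and year.
release = pd.to_datetime(toy_movies["release_date"], format="%Y-%m-%d")
toy_features = pd.concat(
    [
        toy_movies[["title"]],
        genre_dummies,
        release.dt.month.rename("release_month"),
        release.dt.year.rename("release_year"),
    ],
    axis=1,
)

print(toy_features)
```

Strictly speaking, a movie with several genres ends up with a multi-hot row (more than one indicator set to 1), which is what `str.get_dummies` produces here.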

In [None]:
import numpy as np
import pandas as pd
from plotnine import *  # wildcard imports are generally discouraged, but acceptable for plotting in an analysis notebook
import os
import scipy
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all" # to make jupyter print all outputs, not just the last one
from IPython.display import HTML  # to pretty-print pandas DataFrames so they can be copied (e.g. into slides)

In [None]:
# define file paths relative to the notebook
data_folder = "datasets"

sales_path = os.path.join(data_folder, "sales.xlsx")
userreviews_path = os.path.join(data_folder, "UserReviews.xlsx")
expertreviews_path = os.path.join(data_folder, "ExpertReviews.xlsx")
meta_path = os.path.join(data_folder, "metaClean43Brightspace.xlsx")

datasets\metaClean43Brightspace.xlsx


In [None]:
# Load the four Excel files
UserReviews_raw = pd.read_excel(userreviews_path)
ExpertReviews_raw = pd.read_excel(expertreviews_path)
sales_raw       = pd.read_excel(sales_path)
meta_raw        = pd.read_excel(meta_path)

print(f"There are {UserReviews_raw.shape[0]} rows and {UserReviews_raw.shape[1]} columns in the user review dataset")
print(f"There are {ExpertReviews_raw.shape[0]} rows and {ExpertReviews_raw.shape[1]} columns in the expert review dataset")
print(f"There are {sales_raw.shape[0]} rows and {sales_raw.shape[1]} columns in the sales dataset")
print(f"There are {meta_raw.shape[0]} rows and {meta_raw.shape[1]} columns in the meta dataset")


