# Predicting movie sales from Metacritic data

## 0. Business problem

The movie industry faces high financial risk because of rising production and marketing costs and considerable uncertainty about audience reactions (Escoffier & McKelvey, 2015).
Although Metacritic provides rich information about movies, such as critic scores, user scores, review texts and metadata, it is unclear how well these features predict monetary success. This project therefore uses historical data from Metacritic together with movie sales information to build several machine learning models that forecast whether a movie will achieve low, medium or high sales. It also explains which features drive these predictions.
The final goal is to give the movie publisher useful information on where to spend the marketing budget.

**Business Question**: How can we predict the box-office performance of a movie in order to allocate the marketing budget accurately?

**Source**:
Escoffier, N., & McKelvey, B. (2015). The Wisdom of Crowds in the Movie Industry: Towards New Solutions to Reduce Uncertainties. International Journal of Arts Management, 17(2), 52–63. http://www.jstor.org/stable/24587073

### 0.1 Main research question & subquestions
**Main research question**:
How accurately can we predict a movie's box-office sales using Metacritic ratings, metadata and review texts, with a particular focus on identifying the most influential predictive factors?

**Subquestions**
1. How are critic scores, user scores, genres, platforms, and release years related to the sales tiers of movies?
2. How well can different machine learning models predict the sales tier of a movie, based on structured features?
3. To what extent does adding transformer-based embeddings of review titles and/or movie summaries improve prediction performance compared to models using only structured features?
4. Which features are most influential in predicting high versus low sales according to SHAP?
5. Can we identify review topics and/or movie clusters (e.g., using BERTopic and clustering methods) that are particularly associated with high or low sales tiers, and do these insights reveal distinct market segments?

The movie sales prediction data is stored in the `datasets` folder of the repository. We read the data and clean it to make it ready for analysis.

The following information is provided on the dataset variables selected to address the research questions:

This research employs a continuous numerical variable, **Worldwide Box Office**, as the response variable. This represents the total revenue generated globally (in USD).
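
The business problem speaks of low, medium and high sales, while the response variable here is continuous. One possible bridge between the two is sketched below. It assumes the tiers are defined by terciles of worldwide box office; that cutoff rule and the function name `add_sales_tier` are illustrative assumptions, not something the project specifies.

```python
import pandas as pd

def add_sales_tier(df, revenue_col="worldwide_box_office"):
    """Return a copy with a 'sales_tier' column based on revenue terciles.

    The tercile rule is an assumption made for illustration only.
    Rows with missing revenue keep a missing tier.
    """
    out = df.copy()
    out["sales_tier"] = pd.qcut(
        out[revenue_col], q=3, labels=["low", "medium", "high"]
    )
    return out

# Toy revenues taken from the sales preview; None marks a missing value.
toy = pd.DataFrame(
    {"worldwide_box_office": [76576.0, 17865209.0, 46060861.0, None, 968849.0, 33396899.0]}
)
tiers = add_sales_tier(toy)
print(tiers)
```

Quantile-based tiers give roughly balanced classes, which tends to make classification metrics easier to interpret; fixed revenue thresholds would be an alternative if the business defines them.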

This study reviewed the literature and used the following 10 variables as explanatory variables:

- **X1**: Metascore
  - A weighted average of critic reviews (Scale: 0 - 100).
- **X2**: Userscore
  - Average score provided by general users (Scale: 0 - 10).
- **X3**: Production Budget
  - The estimated financial cost to produce the film (USD).
- **X4**: Genre
  - Categorical variable indicating the primary classification of the movie (e.g., Action, Comedy, Drama).
  - Movies with multiple genres are processed using One-Hot Encoding.
- **X5**: Release Date
  - Used to extract the specific month and year of release to account for seasonal market trends and inflation adjustments.
- **X6**: Runtime
  - The duration of the movie in minutes.
- **X7**: Theatre Count
  - The number of theatres showing the movie during its opening weekend, serving as a proxy for distribution width.
- **X8**: MPAA Rating
  - Categorical certification defining the target audience scope:
    - G = General Audiences
    - PG = Parental Guidance Suggested
    - PG-13 = Parents Strongly Cautioned
    - R = Restricted
    - NC-17 = Adults Only
- **X9**: Movie Summary
  - The textual plot summary of the film.
  - Used to generate semantic embeddings via Transformers to capture narrative elements.
- **X10**: Review Text
  - The raw text body of expert and user reviews.
  - Used for BERTopic modeling to identify dominant discourse topics associated with sales performance.
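
The encoding of multi-genre movies (X4) and the month and year extraction (X5) described above could look roughly like the following. This is a minimal sketch on two toy rows copied from the meta preview; the output column names (`genre_*`, `release_year`, `release_month`) are assumptions chosen for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["10 Items or Less", "100 Bloody Acres"],
    "genre": ["Drama,Comedy,Romance", "Horror,Comedy"],
    "RelDate": ["2006-12-01", "2013-06-28"],
})

# One indicator column per genre; a movie with several genres gets several 1s.
genre_dummies = toy["genre"].fillna("").str.get_dummies(sep=",").add_prefix("genre_")

# Release year and month as separate numeric features.
rel = pd.to_datetime(toy["RelDate"], errors="coerce")
features = pd.concat(
    [
        toy[["title"]],
        genre_dummies,
        rel.dt.year.rename("release_year"),
        rel.dt.month.rename("release_month"),
    ],
    axis=1,
)
print(features)
```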

## 1. EDA (exploratory data analysis)

In [None]:
import numpy as np
import pandas as pd
from plotnine import * # generally not a good thing to do to import everything from a package. However it's ok for visualization purposes in an analysis.
import os
import scipy
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all" # to make jupyter print all outputs, not just the last one
from IPython.display import HTML, display # HTML pretty-prints pandas df so they can be copied (e.g. to ppt slides); display is used in later cells

### 1.1 Basic structure of the datasets

In [None]:
# define file paths relative to the notebook
data_folder = "datasets"

sales_path = os.path.join(data_folder, "sales.xlsx")
userreviews_path = os.path.join(data_folder, "UserReviews.xlsx")
expertreviews_path = os.path.join(data_folder, "ExpertReviews.xlsx")
meta_path = os.path.join(data_folder, "metaClean43Brightspace.xlsx")

datasets\metaClean43Brightspace.xlsx


In [None]:
# Load the four Excel files
UserReviews_raw = pd.read_excel(userreviews_path)
ExpertReviews_raw = pd.read_excel(expertreviews_path)
sales_raw       = pd.read_excel(sales_path)
meta_raw        = pd.read_excel(meta_path)


In [8]:
print("There are {} rows and {} columns in the user review dataset".format(UserReviews_raw.shape[0], UserReviews_raw.shape[1]))
print("There are {} rows and {} columns in the expert review dataset".format(ExpertReviews_raw.shape[0], ExpertReviews_raw.shape[1]))
print("There are {} rows and {} columns in the sales dataset".format(sales_raw.shape[0], sales_raw.shape[1]))
print("There are {} rows and {} columns in the meta dataset".format(meta_raw.shape[0], meta_raw.shape[1]))

There are 319662 rows and 7 columns in the user review dataset
There are 238973 rows and 5 columns in the expert review dataset
There are 30612 rows and 16 columns in the sales dataset
There are 11364 rows and 13 columns in the meta dataset


In [13]:
print("UserReview ")
UserReviews_raw.head()
UserReviews_raw.tail()
print("Expert Review")
ExpertReviews_raw.head()
ExpertReviews_raw.tail()
print("Meta data")
meta_raw.head()
meta_raw.tail()
print("Sales data")
sales_raw.head()
sales_raw.tail()


UserReview 


Unnamed: 0,url,idvscore,reviewer,dateP,Rev,thumbsUp,thumbsTot
0,https://www.metacritic.com/movie/bronson,8,'Longbottom94',"'Apr 25, 2013'",'Many have dismissed this film for not explor...,2,2
1,https://www.metacritic.com/movie/bronson,9,'MartinB',"'Oct 13, 2009'",'Anyone who doesn t like this movie simply ju...,0,1
2,https://www.metacritic.com/movie/bronson,10,'Jaakko',"'Jul 19, 2012'",'Not sure what to think at this film at first...,1,1
3,https://www.metacritic.com/movie/bronson,6,'CapoR',"'Oct 13, 2009'",'Nicely portrayed but it lacks the elements t...,0,1
4,https://www.metacritic.com/movie/bronson,8,'OrwellB.',"'Oct 10, 2009'",'Bronson is more than entertainment. It is ar...,0,0


Unnamed: 0,url,idvscore,reviewer,dateP,Rev,thumbsUp,thumbsTot
319657,https://www.metacritic.com/movie/spirited-away,10,'Zenflar',"'Sep 6, 2021'",'A+A+A+A+A+A+A+A+A+A+A+A+A+A+A+A+A+A+A+A+A+A+...,0,0
319658,https://www.metacritic.com/movie/spirited-away,10,'juiliopaublito',"'Nov 25, 2021'",'in my opinion spirited away is one of the be...,0,0
319659,https://www.metacritic.com/movie/spirited-away,10,'PorridgeBoy3000',"'Dec 31, 2021'",'Usually I don t like watching movies that mu...,0,0
319660,https://www.metacritic.com/movie/spirited-away,10,'jamesfhall',"'Jan 20, 2022'",'Studio Ghibli s gripping masterpiece Spirite...,0,0
319661,https://www.metacritic.com/movie/spirited-away,10,'Motoritz',"'Feb 7, 2022'",'Words cannot describe how amazing this movie...,0,0


Expert Review


Unnamed: 0,url,idvscore,reviewer,dateP,Rev
0,https://www.metacritic.com/movie/bronson,100.0,"""Andrew O'Hehir""",,'Bronson owes a little or a lot to Kubrick s ...
1,https://www.metacritic.com/movie/bronson,90.0,'A.O. Scott',,'Bronson invites you to admire its protagonis...
2,https://www.metacritic.com/movie/bronson,90.0,,,'Whether it s Peterson/Bronson s more theatri...
3,https://www.metacritic.com/movie/bronson,83.0,'Noel Murray',,'There are two Bronsons on display here: the ...
4,https://www.metacritic.com/movie/bronson,80.0,'Joshua Rothkopf',,'Refn has somehow found his way to an authent...


Unnamed: 0,url,idvscore,reviewer,dateP,Rev
238968,https://www.metacritic.com/movie/spirited-away,80.0,'Desson Thomson',,'This movie -- which is equally appealing to ...
238969,https://www.metacritic.com/movie/spirited-away,75.0,'David Sterritt',,"'Too intense for the youngest viewers, but te..."
238970,https://www.metacritic.com/movie/spirited-away,75.0,'Wesley Morris',,'Delivers chunks of Yellow Submarine and The ...
238971,https://www.metacritic.com/movie/spirited-away,75.0,'William Arnold',,'Has the power to transport us to a different...
238972,https://www.metacritic.com/movie/spirited-away,75.0,'C.W. Nevius',,"'A lovely, evocative tour de force. So why do..."


Meta data


Unnamed: 0,url,title,studio,rating,runtime,cast,director,genre,summary,awards,metascore,userscore,RelDate
0,https://www.metacritic.com/movie/!women-art-re...,!Women Art Revolution,Hotwire Productions,| Not Rated,83.0,,Lynn Hershman-Leeson,Documentary,,,70,,2011-06-01
1,https://www.metacritic.com/movie/10-cloverfiel...,10 Cloverfield Lane,Paramount Pictures,| PG-13,104.0,"John Gallagher Jr.,John Goodman,Mary Elizabeth...",Dan Trachtenberg,"Action,Sci-Fi,Drama,Mystery,Thriller,Horror","Waking up from a car accident, a young woman (...","#18MostDiscussedMovieof2016 , #1MostSharedMovi...",76,7.7,2016-03-11
2,https://www.metacritic.com/movie/10-items-or-less,10 Items or Less,Click Star,| R,82.0,"Jonah Hill,Morgan Freeman,Paz Vega",Brad Silberling,"Drama,Comedy,Romance",While researching a role as a supermarket mana...,,54,5.8,2006-12-01
3,https://www.metacritic.com/movie/10-years,10 Years,Anchor Bay Entertainment,| R,100.0,"Channing Tatum,Chris Pratt,Jenna Dewan",Jamie Linden,"Drama,Comedy,Romance",,,61,6.9,2012-09-14
4,https://www.metacritic.com/movie/100-bloody-acres,100 Bloody Acres,Music Box Films,| Not Rated,91.0,,Cameron Cairnes,"Horror,Comedy",Reg and Lindsay run an organic fertilizer busi...,,63,7.5,2013-06-28


Unnamed: 0,url,title,studio,rating,runtime,cast,director,genre,summary,awards,metascore,userscore,RelDate
11359,https://www.metacritic.com/movie/zoolander-2,Zoolander 2,Paramount Pictures,| PG-13,102.0,"Ben Stiller,Kristen Wiig,Owen Wilson,Penélope...",Ben Stiller,Comedy,Derek (Ben Stiller) and Hansel (Owen Wilson) a...,"#87MostDiscussedMovieof2016 , #80MostSharedMov...",34,4.1,2016-02-12
11360,https://www.metacritic.com/movie/zoom,Zoom,Columbia Pictures,| PG,83.0,"Chevy Chase,Courteney Cox,Tim Allen",Peter Hewitt,"Action,Adventure,Sci-Fi,Family",A former superhero (Allen) is called back into...,,26,4.4,2006-08-11
11361,https://www.metacritic.com/movie/zoom-2016,Zoom,Screen Media Films,| Not Rated,96.0,,Pedro Morelli,"Drama,Comedy,Animation",A multi-dimensional interface between a comic ...,,55,5.7,2016-09-02
11362,https://www.metacritic.com/movie/zootopia,Zootopia,Walt Disney Studios Motion Pictures,| PG,108.0,,Byron Howard,"Action,Adventure,Comedy,Crime,Animation,Family",,"#80BestMovieof2016 , #11MostDiscussedMovieof20...",78,8.6,2016-03-04
11363,https://www.metacritic.com/movie/zus-zo,Zus & zo,Lifesize Entertainment,,106.0,,Paula van der Oest,"Fantasy,Comedy,Romance",A quirky romantic comedy about 3 sisters who c...,,50,7.2,2003-02-07


Sales data


Unnamed: 0,year,release_date,title,genre,international_box_office,domestic_box_office,worldwide_box_office,production_budget,Unnamed: 8,opening_weekend,theatre_count,avg run per theatre,runtime,keywords,creative_type,url
0,2000,January 1st,Bakha Satang,Drama,76576.0,,76576.0,,,,,,129.0,,Contemporary Fiction,https://www.the-numbers.com/movie/Bakha-Satang...
1,2001,January 12th,Antitrust,Thriller/Suspense,6900000.0,10965209.0,17865209.0,30000000.0,,5486209.0,2433.0,3.1,,,Contemporary Fiction,https://www.the-numbers.com/movie/Antitrust
2,2000,January 28th,Santitos,,,378562.0,,,,,,,105.0,,,https://www.the-numbers.com/movie/Santitos
3,2002,2002 (Wide) by,Frank McKlusky C.I.,,,,,,,,,,,,,https://www.the-numbers.com/movie/Frank-McKlus...
4,2002,January 25th,A Walk to Remember,Drama,4833792.0,41227069.0,46060861.0,11000000.0,,12177488.0,2411.0,5.3,,Coming of Age,Contemporary Fiction,https://www.the-numbers.com/movie/Walk-to-Reme...


Unnamed: 0,year,release_date,title,genre,international_box_office,domestic_box_office,worldwide_box_office,production_budget,Unnamed: 8,opening_weekend,theatre_count,avg run per theatre,runtime,keywords,creative_type,url
30607,2021,January 1st,Jokbeoldu sinmun iyagi,Documentary,12356.0,,12356.0,,,,,,168.0,,Factual,https://www.the-numbers.com/movie/Jokbeoldu-si...
30608,2021,March 5th,My Salinger Year,Drama,914119.0,54730.0,968849.0,,,28851.0,123.0,2.0,101.0,Set in New York City,Contemporary Fiction,https://www.the-numbers.com/movie/My-Salinger-...
30609,2021,January 1st,Escort Vehicle 36,Action,240000.0,,240000.0,,,,,,85.0,,Historical Fiction,https://www.the-numbers.com/movie/Escort-Vehic...
30610,2021,May 21st,The Dry,Thriller/Suspense,16987526.0,364397.0,17351923.0,,,119364.0,186.0,2.5,118.0,Crime Thriller,Contemporary Fiction,https://www.the-numbers.com/movie/Dry-The-(Aus...
30611,2021,January 1st,Posledniy bogatyr. Koren’ Zla,Adventure,33396899.0,,33396899.0,,,,,,,,Super Hero,https://www.the-numbers.com/movie/Posledniy-bo...


In [24]:
def print_basic_info(name, df):
    print(f"\n{name}")
    print("-" * len(name))
    print("Columns:", list(df.columns))
    print("\nMissing values per column:")
    display(df.isna().sum().sort_values(ascending=False).head(15))

print_basic_info("UserReviews_raw", UserReviews_raw)
print_basic_info("ExpertReviews_raw", ExpertReviews_raw)
print_basic_info("sales_raw", sales_raw)
print_basic_info("meta_raw", meta_raw)


UserReviews_raw
---------------
Columns: ['url', 'idvscore', 'reviewer', 'dateP', 'Rev', 'thumbsUp', 'thumbsTot']

Missing values per column:


thumbsUp     3580
thumbsTot    3576
Rev          3413
dateP        3413
reviewer     3407
idvscore     3404
url             0
dtype: int64


ExpertReviews_raw
-----------------
Columns: ['url', 'idvscore', 'reviewer', 'dateP', 'Rev']

Missing values per column:


idvscore    2
dateP       2
reviewer    2
Rev         2
url         0
dtype: int64


sales_raw
---------
Columns: ['year', 'release_date', 'title', 'genre', 'international_box_office', 'domestic_box_office', 'worldwide_box_office', 'production_budget', 'Unnamed: 8', 'opening_weekend', 'theatre_count', 'avg run per theatre', 'runtime', 'keywords', 'creative_type', 'url']

Missing values per column:


Unnamed: 8                  30612
production_budget           26132
opening_weekend             19683
avg run per theatre         19660
theatre_count               19649
domestic_box_office         18728
keywords                    18095
worldwide_box_office         9037
international_box_office     9037
runtime                      6053
creative_type                3945
genre                        1704
title                           8
year                            0
release_date                    0
dtype: int64


meta_raw
--------
Columns: ['url', 'title', 'studio', 'rating', 'runtime', 'cast', 'director', 'genre', 'summary', 'awards', 'metascore', 'userscore', 'RelDate']

Missing values per column:


awards       6977
summary      5897
cast         3702
userscore    2105
rating       1067
studio        350
runtime       255
genre          20
director       14
title           0
url             0
metascore       0
RelDate         0
dtype: int64

### Reflection on the raw datasets

Based on the first EDA, we can already see which problems exist in each table and what type of cleaning steps we need next.

#### UserReviews_raw

- Shape: 319,662 rows and 7 columns. This is a large table with one row per user review.
- Important columns:  
  - `url` links the review to a specific movie. There are no missing values here, so this is a good key.  
  - `idvscore` is a user score, but currently comes from Excel as text with quotes. We need it as a clean numeric column.  
  - `reviewer` contains usernames, but there are about 3,400 missing values. We do not want empty reviewer names in the final data.  
  - `dateP` is a text date (for example `'Apr 25, 2013'`) with quotes and also has about 3,400 missing values.  
  - `Rev` is the review text. We will mainly keep this as a text field, but we want to know where reviews are missing.  
  - `thumbsUp` and `thumbsTot` are sometimes missing (around 3,500 rows). Later we want to create an extra variable `thumbsDown` from these.

Cleaning implications:
- Convert `dateP` into a proper `datetime` column and keep invalid dates as `NaT`.  
- Clean the `reviewer` column and replace missing or empty names with `"Unknown Reviewer"`.  
- Convert `idvscore`, `thumbsUp` and `thumbsTot` into numeric using `get_numeric_value` and `clean_numeric_columns`.  
- Create a new variable `thumbsDown` from `thumbsTot - thumbsUp` using `calculate_thumbs_down`.  
- For this we use helpers like `clean_user_review_dates`, `clean_reviewer_column`, `clean_numeric_columns` and `calculate_thumbs_down`.
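
The cleaning steps for the user reviews can be sketched as follows. The function `clean_user_reviews_sketch` is a hypothetical stand-in written for this illustration; it is not the implementation of the project's helpers (`clean_user_review_dates`, `clean_reviewer_column`, `clean_numeric_columns`, `calculate_thumbs_down`), whose code is not shown here. It assumes dates follow the `'Apr 25, 2013'` pattern seen in the preview.

```python
import pandas as pd

def clean_user_reviews_sketch(df):
    """Hypothetical one-pass version of the user-review cleaning steps."""
    out = df.copy()

    # Strip the literal quotes seen in the raw export, e.g. "'Apr 25, 2013'".
    date_text = out["dateP"].fillna("").astype(str).str.strip("'\" ")
    # Unparseable or empty dates become NaT.
    out["dateP"] = pd.to_datetime(date_text, format="%b %d, %Y", errors="coerce")

    # Missing or empty reviewer names get a placeholder.
    reviewer = out["reviewer"].fillna("").astype(str).str.strip("'\" ")
    out["reviewer"] = reviewer.replace("", "Unknown Reviewer")

    # Force score and thumbs columns to numeric; bad values become NaN.
    for col in ["idvscore", "thumbsUp", "thumbsTot"]:
        out[col] = pd.to_numeric(out[col], errors="coerce")

    out["thumbsDown"] = out["thumbsTot"] - out["thumbsUp"]
    return out

toy = pd.DataFrame({
    "url": ["https://www.metacritic.com/movie/bronson"] * 3,
    "idvscore": [8, "9", None],
    "reviewer": ["'Longbottom94'", None, "''"],
    "dateP": ["'Apr 25, 2013'", "'Sep 6, 2021'", None],
    "thumbsUp": [2, 0, None],
    "thumbsTot": [2, 1, None],
})
cleaned = clean_user_reviews_sketch(toy)
print(cleaned)
```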

#### ExpertReviews_raw

- Shape: 238,973 rows and 5 columns, similar to the user reviews but without thumbs columns.  
- Almost no missing values: only 2 missing for `idvscore`, `reviewer`, `dateP` and `Rev`.  
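- `dateP` is empty in every previewed row although only 2 values are counted as missing, so it likely holds placeholder values such as `None` rather than real dates; it must be converted to `datetime` just like in the user table.  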
- `reviewer` has a few missing values that we again want to replace with `"Unknown Reviewer"`.

Cleaning implications:
- We want the same behaviour as in `UserReviews_raw` so both tables are consistent and easier to combine later.  
- We mainly use `clean_expert_review_dates`, `clean_reviewer_column` and `clean_numeric_columns`.  
- We also create unique IDs for reviewers and reviews with `create_expert_ids` and `create_review_ids`.
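
One way to generate such IDs is sketched below. The function name `add_ids_sketch`, the ID formats and the `E` prefix are assumptions made for illustration; the project's `create_expert_ids` and `create_review_ids` may work differently. The sketch assumes the `reviewer` column has already been cleaned, so it contains no missing values.

```python
import pandas as pd

def add_ids_sketch(df, prefix):
    """Hypothetical ID builder: one ID per reviewer name, one ID per row."""
    out = df.reset_index(drop=True).copy()

    # factorize numbers the distinct names 0, 1, 2, ... in order of first appearance.
    codes, _ = pd.factorize(out["reviewer"])
    out["reviewer_id"] = [f"{prefix}{code + 1:06d}" for code in codes]

    # Every row is one review, so the row position gives a unique review ID.
    out["review_id"] = [f"{prefix}R{i + 1:07d}" for i in range(len(out))]
    return out

toy = pd.DataFrame({
    "url": ["https://www.metacritic.com/movie/bronson"] * 3,
    "reviewer": ["A.O. Scott", "Noel Murray", "A.O. Scott"],
})
ids = add_ids_sketch(toy, prefix="E")
print(ids)
```

IDs built this way depend on row order, so they stay the same across runs only if the input is loaded and sorted the same way each time.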

#### sales_raw

- Shape: 30,612 rows and 16 columns. This is our box office / financial table.  
- A lot of missing values in `production_budget`, `opening_weekend`, `theatre_count`, `avg run per theatre` and also in the box office columns.  
- Column `Unnamed: 8` is completely empty and can be dropped later.  
- `release_date` is a free-form text date (for example `"January 1st"`) and needs to be converted to `datetime`.  
- `international_box_office`, `domestic_box_office`, `worldwide_box_office` and `production_budget` are numeric amounts but come as text from Excel.  
- `url` is present and has no missing values, which is useful as a link to the movie.

Cleaning implications:
- Convert all monetary and count columns to numeric using `get_numeric_value` and `clean_numeric_columns`.  
- Create one total revenue measure `total_box_office` using `calculate_total_box_office`.  
- Convert `release_date` to a clean `releasedate` column via `clean_sales_dates`, with a fallback on `year` where needed.  
- Drop the completely empty `Unnamed: 8` column.
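
The sales cleaning steps can be sketched as follows. `clean_sales_sketch` is a hypothetical stand-in, not the project's `clean_sales_dates` or `calculate_total_box_office`. Two rules in it are assumptions: unparseable release dates fall back to 1 January of `year`, and `total_box_office` uses the worldwide figure when present and otherwise the sum of whatever domestic and international figures exist.

```python
import pandas as pd

def clean_sales_sketch(df):
    """Hypothetical version of the sales cleaning steps."""
    out = df.drop(columns=["Unnamed: 8"], errors="ignore").copy()

    # Remove ordinal suffixes ("1st" -> "1"), then append the year column.
    day_month = out["release_date"].astype(str).str.replace(
        r"(\d+)(st|nd|rd|th)\b", r"\1", regex=True
    )
    year_text = out["year"].astype(str)
    full = pd.to_datetime(day_month + " " + year_text, format="%B %d %Y", errors="coerce")

    # Assumed fallback: entries like "2002 (Wide) by" become 1 January of `year`.
    fallback = pd.to_datetime(year_text, format="%Y", errors="coerce")
    out["releasedate"] = full.fillna(fallback)

    money_cols = ["international_box_office", "domestic_box_office", "worldwide_box_office"]
    for col in money_cols:
        out[col] = pd.to_numeric(out[col], errors="coerce")

    # Assumed rule: prefer worldwide, else sum the parts that are available.
    parts = out[["domestic_box_office", "international_box_office"]].sum(axis=1, min_count=1)
    out["total_box_office"] = out["worldwide_box_office"].fillna(parts)
    return out

toy = pd.DataFrame({
    "year": [2000, 2000, 2002],
    "release_date": ["January 1st", "January 28th", "2002 (Wide) by"],
    "international_box_office": [76576.0, None, None],
    "domestic_box_office": [None, 378562.0, None],
    "worldwide_box_office": [76576.0, None, None],
    "Unnamed: 8": [None, None, None],
})
cleaned = clean_sales_sketch(toy)
print(cleaned[["releasedate", "total_box_office"]])
```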

#### meta_raw

- Shape: 11,364 rows and 13 columns. This is our movie-level metadata table.  
- Important columns: `url`, `title`, `studio`, `rating`, `runtime`, `metascore`, `userscore`, `RelDate`.  
- `url`, `title`, `metascore` and `RelDate` have no missing values, which is good for linking and modelling.  
- Many missing values in `awards`, `summary` and `cast`. These are mainly descriptive and not strictly required for a first prediction model.  
- `RelDate` is already in a clean date format.

Cleaning implications:
- Standardise the release date into a single `releasedate` column using `standardize_meta_dates` so it aligns with the other tables.  
- Keep text fields like `summary`, `awards` and `cast` as they are for now, but we know they contain many missings.  
- Treat `runtime`, `metascore` and `userscore` as important numeric features for later modelling.

#### Across all tables

- All tables have a `url` column, and it is never missing. This makes `url` a natural starting point for linking movies across datasets.  
- However, the sales URLs point to the-numbers.com while the other three tables use metacritic.com URLs, and URLs and titles are not always written in exactly the same way. Small spelling or formatting differences can cause duplicate or mismatched movies.  
  Because of this, we need a separate cleaning step to standardise movie names and create a single consistent `movie_id`.

Therefore we introduce the following helper functions:

- `clean_movie_name` and `extract_title_from_url` to standardise movie names and extract readable titles from URLs.  
- `collect_all_movie_records`, `create_movie_dimension_table` and `add_movie_ids_to_datasets` to build a movie dimension table (`movie_key`) with a unique `movie_id` per cleaned movie name and to merge that ID back into all datasets.  
- `create_all_ids` to create stable IDs for users, experts and individual reviews.
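
The idea behind the movie dimension table can be sketched as follows. The `_sketch` functions are hypothetical stand-ins for `clean_movie_name` and `extract_title_from_url`, and the normalisation rule (lowercase, punctuation removed, whitespace collapsed) is an assumption; the project's helpers may apply different rules.

```python
import re
import pandas as pd

def clean_movie_name_sketch(name):
    """Lowercase, replace anything that is not a letter or digit by a space, trim."""
    return re.sub(r"[^a-z0-9]+", " ", str(name).lower()).strip()

def extract_title_from_url_sketch(url):
    """Take the last path segment of a URL and turn hyphens into spaces."""
    slug = str(url).rstrip("/").rsplit("/", 1)[-1]
    return slug.replace("-", " ")

meta = pd.DataFrame({
    "url": ["https://www.metacritic.com/movie/10-items-or-less"],
    "title": ["10 Items or Less"],
})
reviews = pd.DataFrame({
    "url": [
        "https://www.metacritic.com/movie/10-items-or-less",
        "https://www.metacritic.com/movie/zootopia",
    ]
})

# Build the same cleaned name in every table.
meta["movie_name"] = meta["title"].map(clean_movie_name_sketch)
reviews["movie_name"] = (
    reviews["url"].map(extract_title_from_url_sketch).map(clean_movie_name_sketch)
)

# Dimension table: one row and one integer ID per distinct cleaned name.
names = pd.concat([meta["movie_name"], reviews["movie_name"]]).drop_duplicates().sort_values()
movie_key = pd.DataFrame({"movie_name": names.to_numpy()})
movie_key["movie_id"] = range(1, len(movie_key) + 1)

# Merge the ID back into each dataset.
meta = meta.merge(movie_key, on="movie_name", how="left")
reviews = reviews.merge(movie_key, on="movie_name", how="left")
print(movie_key)
```

A key built from the name alone would merge different films that share a title, such as the two movies called Zoom (2006 and 2016) in the meta preview, so adding the release year to the key may be needed.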

After these cleaning steps:

- dates use a consistent `datetime` format across tables,  
- numeric fields are properly numeric instead of messy text,  
- reviewers and reviews have unique IDs,  
- and all tables can be joined via a single `movie_id`.

This cleaned structure is then ready for feature engineering and machine learning models (for example regression or KNN) at the movie level.
