# Modelling setup

In this notebook we work with the cleaned and feature-engineered Metacritic dataset
that was created in the EDA and cleaning notebook.

The main goals of this section are:

1. Load the modelling dataset from disk.
2. Perform sanity checks on the data types and missing values.
3. Define the feature matrix `X` and the target vector `y`.
4. Create a train/test split that we can reuse for different models later on.


In [2]:
# Load the cleaned and feature-engineered dataset
import pandas as pd
from IPython.display import display

df = pd.read_parquet("datasets/computed/metacritic_sales_tier_modelling.parquet")

print("Shape:", df.shape)
print("\nColumns:")
print(df.columns.tolist())

print("\nHead (first 5 rows):")
display(df.head())

print("\nTail (last 5 rows):")
display(df.tail())


Shape: (21770, 75)

Columns:
['movie_id', 'metascore', 'userscore', 'runtime', 'production_budget_log', 'theatre_count_log', 'release_year', 'genre_list', 'genre_Action', 'genre_Adult', 'genre_Adventure', 'genre_Animation', 'genre_Biography', 'genre_Black Comedy', 'genre_Comedy', 'genre_Concert/Performance', 'genre_Crime', 'genre_Documentary', 'genre_Drama', 'genre_Educational', 'genre_Family', 'genre_Fantasy', 'genre_History', 'genre_Horror', 'genre_Multiple Genres', 'genre_Music', 'genre_Musical', 'genre_Mystery', 'genre_News', 'genre_Reality', 'genre_Romance', 'genre_Romantic Comedy', 'genre_Sci-Fi', 'genre_Short', 'genre_Sport', 'genre_Thriller', 'genre_Thriller/Suspense', 'genre_Unknown', 'genre_War', 'genre_Western', 'rating_missing', 'rating_clean', 'rating_G', 'rating_NC-17', 'rating_Not Rated', 'rating_PG', 'rating_PG-13', 'rating_R', 'season_Fall', 'season_Spring', 'season_Summer', 'season_Winter', 'summer_release', 'holiday_release', 'user_embed_1', 'user_embed_2', 'user_emb

Head (first 5 rows):


Unnamed: 0,movie_id,metascore,userscore,runtime,production_budget_log,theatre_count_log,release_year,genre_list,genre_Action,genre_Adult,...,expert_embed_2,expert_embed_3,expert_embed_4,expert_embed_5,expert_embed_6,expert_embed_7,expert_embed_8,expert_embed_9,expert_embed_10,sales_tier_encoded
0,6305dc82622a,59.0,6.7,129.0,16.993564,4.26268,2000.0,[Drama],0,0,...,-0.013454,-0.028619,0.025156,-0.031859,-0.024816,-0.012283,-0.021413,0.034195,-0.025543,0
1,662bc1e3cf57,31.0,8.7,109.0,17.216708,7.797291,2001.0,"[Drama, Thriller]",0,0,...,-5.644859,6.622465,5.285645,-4.129042,-17.439384,-4.43283,6.081197,-1.913313,3.629471,2
2,dfc233d7a2f9,59.0,6.7,104.0,16.213406,7.788212,2002.0,[Drama],0,0,...,-0.013454,-0.028619,0.025156,-0.031859,-0.024816,-0.012283,-0.021413,0.034195,-0.025543,2
3,ed1dd3e75880,41.0,6.4,104.0,16.906553,7.812378,2008.0,"[Thriller, Comedy, Romance, Crime]",0,0,...,-6.086373,-7.251342,6.611788,1.355806,1.178939,-1.492724,4.36909,3.527283,1.824804,2
4,8e3d5b8714f4,30.0,5.1,95.0,16.118096,7.589842,2008.0,"[Fantasy, Comedy, Romance]",0,0,...,-4.169212,-3.627316,4.190808,-9.552444,-1.246047,4.286777,-1.451442,5.136322,2.053113,2



Tail (last 5 rows):


Unnamed: 0,movie_id,metascore,userscore,runtime,production_budget_log,theatre_count_log,release_year,genre_list,genre_Action,genre_Adult,...,expert_embed_2,expert_embed_3,expert_embed_4,expert_embed_5,expert_embed_6,expert_embed_7,expert_embed_8,expert_embed_9,expert_embed_10,sales_tier_encoded
21765,27e1b584f110,59.0,6.7,168.0,16.993564,4.26268,2021.0,[Documentary],0,0,...,-0.013454,-0.028619,0.025156,-0.031859,-0.024816,-0.012283,-0.021413,0.034195,-0.025543,0
21766,fdf6e4d27c7d,50.0,6.2,101.0,16.993564,4.820282,2021.0,[Drama],0,0,...,9.797144,-8.735904,2.323131,-0.29595,2.524684,4.405437,0.634068,2.040505,-1.448089,1
21767,aa14c9a74a24,59.0,6.7,85.0,16.993564,4.26268,2021.0,[Action],1,0,...,-0.013454,-0.028619,0.025156,-0.031859,-0.024816,-0.012283,-0.021413,0.034195,-0.025543,1
21768,9e4d2fe6bc75,59.0,6.7,118.0,16.993564,5.231109,2021.0,[Thriller/Suspense],0,0,...,-0.013454,-0.028619,0.025156,-0.031859,-0.024816,-0.012283,-0.021413,0.034195,-0.025543,2
21769,c38d8dba21ac,59.0,6.7,90.0,16.993564,4.26268,2021.0,[Adventure],0,0,...,-0.013454,-0.028619,0.025156,-0.031859,-0.024816,-0.012283,-0.021413,0.034195,-0.025543,2


## 1.1 Sanity check: data types and missing values

Before training any models, we first verify that:

- The target column is present and correctly encoded.
- All feature columns that we will use are numerical.
- Non-numerical identifier columns (such as `movie_id`) are not used as features.
- There are no unexpected missing values that could break our models.

We start by inspecting the data types and by listing which columns are still
of type `object`. These object columns will either be ignored as features or
have already been encoded into numerical dummy variables.
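
The checks listed above can also be written as explicit assertions, so that the notebook fails loudly when an assumption no longer holds. The following is a minimal sketch on a small made-up frame; the helper name `check_modelling_frame` and the toy values are hypothetical and not part of the original notebook.

```python
import pandas as pd


def check_modelling_frame(df, target, id_cols):
    """Collect basic sanity-check results (hypothetical helper, for illustration)."""
    target_present = target in df.columns
    # Everything except the target and the identifier columns is a candidate feature.
    feature_df = df.drop(columns=[target, *id_cols], errors="ignore")
    non_numeric = feature_df.select_dtypes(exclude=["number", "bool"]).columns.tolist()
    return {
        "target_present": target_present,
        "target_is_integer": target_present and pd.api.types.is_integer_dtype(df[target]),
        "non_numeric_features": non_numeric,
        "total_missing": int(df.isna().sum().sum()),
    }


# Toy frame that mimics the structure of the modelling dataset (values are made up).
toy = pd.DataFrame({
    "movie_id": ["a1", "b2", "c3"],
    "metascore": [59.0, 31.0, 41.0],
    "genre_list": [["Drama"], ["Drama", "Thriller"], ["Comedy"]],
    "genre_Drama": [1, 1, 0],
    "sales_tier_encoded": [0, 2, 2],
})

report = check_modelling_frame(toy, target="sales_tier_encoded", id_cols=["movie_id"])
print(report)
```

Here `genre_list` is reported as a non-numeric candidate feature, which is the signal to exclude it from `X` explicitly.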


In [6]:
# Inspect data types
print("Data types:")
print(df.dtypes)

# Object type columns (these should not be used directly as features)
print("\nObject type columns:")
print(df.dtypes[df.dtypes == "object"])

# Quick overview of missing values
print("\nMissing values per column (top 20):")
missing_counts = df.isna().sum().sort_values(ascending=False)
display(missing_counts.head(20))


Data types:
movie_id                  object
metascore                float64
userscore                float64
runtime                  float64
production_budget_log    float64
                          ...   
expert_embed_7           float32
expert_embed_8           float32
expert_embed_9           float32
expert_embed_10          float32
sales_tier_encoded         int64
Length: 75, dtype: object

Object type columns:
movie_id        object
genre_list      object
rating_clean    object
dtype: object

Missing values per column (top 20):


movie_id            0
user_embed_3        0
user_embed_1        0
holiday_release     0
summer_release      0
season_Winter       0
season_Summer       0
season_Spring       0
season_Fall         0
rating_R            0
rating_PG-13        0
rating_PG           0
rating_Not Rated    0
rating_NC-17        0
rating_G            0
rating_clean        0
rating_missing      0
user_embed_2        0
user_embed_4        0
genre_War           0
dtype: int64

### Interpretation of the sanity check

From the data type summary we observe that:

- The target column `sales_tier_encoded` is stored as an integer (`int64`), which is suitable for a multiclass classification task.
- All feature columns that we will actually use are numeric (`int` or `float`), for example:
  `metascore`, `userscore`, `runtime`, `production_budget_log`, `theatre_count_log`,
  the one-hot encoded dummy columns, and the reduced Transformer embeddings.
- The remaining `object` type columns are:
  - `movie_id` (a pure identifier, not a feature)
  - `genre_list` (original list of genres, already encoded into dummy columns)
  - `rating_clean` (human-readable rating, already encoded into dummy columns)

These object columns will not be included in the feature matrix `X`.

Regarding missing values:

- The counts are sorted in descending order and the largest value shown is 0, so no column
  contains NaN values. This covers all numerical features and the target column `sales_tier_encoded`.
- No imputation or row removal is therefore needed at this stage. If a later version of the
  dataset does contain missing feature values, we will handle them inside a preprocessing pipeline
  (for example with a `SimpleImputer`), and rows with a missing target will be removed.

Overall, the dataset structure looks suitable for model training, as long as we
explicitly exclude the non-numerical identifier and helper columns from `X`.
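
To illustrate the pipeline option mentioned above, here is a small sketch that places a `SimpleImputer` in front of a classifier. The toy arrays, the median strategy, and the choice of `LogisticRegression` are assumptions made for the example, not decisions taken in this notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data with a single missing value in the second feature (made-up numbers).
X_toy = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])
y_toy = np.array([0, 0, 1, 1])

# The imputer is fitted together with the model, so its statistics
# come only from the data passed to fit().
pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(),
)
pipe.fit(X_toy, y_toy)

# Inspect what the fitted imputer does to the toy data.
imputed = pipe.named_steps["simpleimputer"].transform(X_toy)
print(imputed)
```

Keeping the imputer inside the pipeline means that, during cross-validation, the imputation statistics are recomputed on each training fold instead of leaking information from the validation fold.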


## 1.2 Define feature matrix `X` and target vector `y`

Next, we build the actual feature matrix `X` and the target vector `y`.

To avoid accidentally using identifiers or raw text as features, we take the
following approach:

1. Select all numerical columns in the dataframe.
2. Remove the target column `sales_tier_encoded` from this list.
3. Use the remaining numerical columns as features.
4. Use `sales_tier_encoded` as the target.

This ensures that `movie_id`, `genre_list`, and `rating_clean` are not used
as input features, and that all features are numeric and ready for modelling.


In [7]:
import numpy as np

target = "sales_tier_encoded"

# Select all numerical columns
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
print("Numeric columns including target:", len(numeric_cols))

# Remove the target from the feature list
feature_cols = [c for c in numeric_cols if c != target]

print("Number of feature columns:", len(feature_cols))
print("First 10 feature columns:", feature_cols[:10])

# Build X and y
X = df[feature_cols].copy()
y = df[target].copy()

print("\nShapes:")
print("X shape:", X.shape)
print("y shape:", y.shape)

# Sanity check: ensure there are no object dtypes in X
print("\nObject dtypes in X (should be empty):")
print(X.dtypes[X.dtypes == "object"])

# Quick look at X and y
print("\nHead of X:")
display(X.head())

print("\nHead of y:")
display(y.head())


Numeric columns including target: 62
Number of feature columns: 61
First 10 feature columns: ['metascore', 'userscore', 'runtime', 'production_budget_log', 'theatre_count_log', 'release_year', 'genre_Action', 'genre_Adult', 'genre_Adventure', 'genre_Animation']

Shapes:
X shape: (21770, 61)
y shape: (21770,)

Object dtypes in X (should be empty):
Series([], dtype: object)

Head of X:


Unnamed: 0,metascore,userscore,runtime,production_budget_log,theatre_count_log,release_year,genre_Action,genre_Adult,genre_Adventure,genre_Animation,...,expert_embed_1,expert_embed_2,expert_embed_3,expert_embed_4,expert_embed_5,expert_embed_6,expert_embed_7,expert_embed_8,expert_embed_9,expert_embed_10
0,59.0,6.7,129.0,16.993564,4.26268,2000.0,0,0,0,0,...,5.248415,-0.013454,-0.028619,0.025156,-0.031859,-0.024816,-0.012283,-0.021413,0.034195,-0.025543
1,31.0,8.7,109.0,17.216708,7.797291,2001.0,0,0,0,0,...,-18.119719,-5.644859,6.622465,5.285645,-4.129042,-17.439384,-4.43283,6.081197,-1.913313,3.629471
2,59.0,6.7,104.0,16.213406,7.788212,2002.0,0,0,0,0,...,5.248415,-0.013454,-0.028619,0.025156,-0.031859,-0.024816,-0.012283,-0.021413,0.034195,-0.025543
3,41.0,6.4,104.0,16.906553,7.812378,2008.0,0,0,0,0,...,-19.68395,-6.086373,-7.251342,6.611788,1.355806,1.178939,-1.492724,4.36909,3.527283,1.824804
4,30.0,5.1,95.0,16.118096,7.589842,2008.0,0,0,0,0,...,-22.277643,-4.169212,-3.627316,4.190808,-9.552444,-1.246047,4.286777,-1.451442,5.136322,2.053113



Head of y:


0    0
1    2
2    2
3    2
4    2
Name: sales_tier_encoded, dtype: int64

### Interpretation of the feature matrix and target vector

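The shapes of `X` and `y` confirm that:

- `X` contains one row per movie and 61 numerical features
  (metadata, engineered calendar variables, dummy variables,
  and Transformer embedding components).
- `y` is a one-dimensional vector with the encoded sales tier
  (`0 = Low`, `1 = Medium`, `2 = High`).

Note that the dataframe has 75 columns, of which 3 are `object` columns, yet only 62
are selected as numeric. The 10 remaining columns are therefore neither `object` nor
numeric and are silently left out of `X`. A likely explanation, which should be verified
with `df.dtypes`, is that the six rating dummies (`rating_G` to `rating_R`) and the four
`season_*` dummies are stored as `bool`, a dtype that `np.number` does not match.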

The final check for object dtypes in `X` is empty, which means that all features
are numerical. This is exactly what we need for both classical machine learning
models and neural networks.
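
One caveat of `select_dtypes(include=[np.number])` is that it does not match `bool` columns, so dummy variables stored as `bool` (the default output of `pd.get_dummies` in pandas 2.x) would be left out of `X`. The sketch below shows the effect on a made-up frame and one possible way to keep such columns; the toy data and the cast to `float32` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with a categorical rating column (values are made up).
toy = pd.DataFrame({
    "movie_id": ["a1", "b2", "c3"],
    "runtime": [129.0, 109.0, 104.0],
    "rating_clean": ["R", "PG", "R"],
    "sales_tier_encoded": [0, 2, 2],
})

# Explicit dtype=bool so the example behaves the same across pandas versions.
dummies = pd.get_dummies(toy["rating_clean"], prefix="rating", dtype=bool)
toy = pd.concat([toy, dummies], axis=1)

target = "sales_tier_encoded"

# np.number alone skips the bool dummies.
numeric_only = toy.select_dtypes(include=[np.number]).columns.drop(target).tolist()

# Adding "bool" to the selection keeps them.
with_bool = toy.select_dtypes(include=[np.number, "bool"]).columns.drop(target).tolist()

# Cast to a single float dtype so every feature is numeric for the model.
X_toy = toy[with_bool].astype("float32")

print("Numeric only:", numeric_only)
print("Numeric + bool:", with_bool)
```

Comparing the two lists on the real dataframe would show directly which columns the numeric-only selection drops.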

We are now ready to split the data into a training and test set.
This will allow us to evaluate our models on unseen data.


## 1.3 Train/test split

To evaluate our models fairly, we split the dataset into a training set and a test set.

We use a stratified split on `sales_tier_encoded` so that the class distribution
(Low, Medium, High) is approximately the same in both the training and test sets.

The test set will be used only for final evaluation. All model training and tuning
will be performed on the training set (optionally with cross-validation).
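
As a rough sketch of how tuning can stay inside the training set, the example below combines a stratified hold-out split with stratified 5-fold cross-validation. The synthetic data, the number of folds, and the use of `LogisticRegression` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic three-class data standing in for the real features (made up).
rng = np.random.default_rng(42)
y_toy = np.repeat([0, 1, 2], 100)
X_toy = rng.normal(size=(300, 5))
X_toy[:, 0] += y_toy  # make the first feature informative

# Hold out a stratified test set that is not touched during tuning.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Stratified cross-validation on the training portion only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_tr, y_tr, cv=cv, scoring="accuracy"
)

print("Fold accuracies:", np.round(scores, 3))
print("Mean CV accuracy:", round(float(scores.mean()), 3))
```

The held-out arrays `X_te` and `y_te` are deliberately unused here; they would only be scored once, after model selection is finished.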


In [8]:
from sklearn.model_selection import train_test_split

# Perform a stratified train test split
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print("Train set shape:", X_train.shape)
print("Test set shape:", X_test.shape)

print("\nClass distribution in full data:")
print(y.value_counts(normalize=True).sort_index())

print("\nClass distribution in train set:")
print(y_train.value_counts(normalize=True).sort_index())

print("\nClass distribution in test set:")
print(y_test.value_counts(normalize=True).sort_index())


Train set shape: (17416, 61)
Test set shape: (4354, 61)

Class distribution in full data:
sales_tier_encoded
0    0.329995
1    0.340009
2    0.329995
Name: proportion, dtype: float64

Class distribution in train set:
sales_tier_encoded
0    0.329984
1    0.340032
2    0.329984
Name: proportion, dtype: float64

Class distribution in test set:
sales_tier_encoded
0    0.330041
1    0.339917
2    0.330041
Name: proportion, dtype: float64


### Interpretation of the train/test split

The shapes show that:

- The training set contains 17,416 movies (80 percent).
- The test set contains the remaining 4,354 movies (20 percent).

The class proportions in the full dataset, the training set, and the test set
agree to within 0.0001, which means that the stratified split worked as intended.
All three sales tiers (Low, Medium, High) are represented in both sets.

From this point onwards we can start training baseline models such as:

- A dummy classifier (predicting the most frequent class).
- A logistic regression model using only structured features.
- A model that also includes the Transformer-based text embeddings.

These models will allow us to quantify how much predictive power we gain
from the different types of features in our dataset.
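
The comparison described above could be organised along the following lines. This is only a sketch on synthetic data: the feature groups, the column positions, and the helper `fit_and_score` are hypothetical stand-ins, and the printed accuracies say nothing about the real dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (all values are made up).
rng = np.random.default_rng(0)
y_toy = np.repeat([0, 1, 2], 200)
X_structured = rng.normal(size=(600, 3)) + y_toy[:, None]    # mimics structured features
X_embed = rng.normal(size=(600, 4)) + 0.5 * y_toy[:, None]   # mimics embedding features
X_toy = np.hstack([X_structured, X_embed])

structured_idx = [0, 1, 2]                # hypothetical positions of structured columns
all_idx = list(range(X_toy.shape[1]))     # structured + embedding columns

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)


def fit_and_score(model, cols):
    """Fit on the chosen training columns and return test accuracy."""
    model.fit(X_tr[:, cols], y_tr)
    return accuracy_score(y_te, model.predict(X_te[:, cols]))


def make_logreg():
    # Scaling first helps the solver converge on features with different ranges.
    return make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))


results = {
    "dummy": fit_and_score(DummyClassifier(strategy="most_frequent"), all_idx),
    "structured": fit_and_score(make_logreg(), structured_idx),
    "structured+embeddings": fit_and_score(make_logreg(), all_idx),
}

for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

With three nearly balanced classes, the dummy classifier sits close to one third accuracy, which gives a concrete floor against which the other models can be judged.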
