# Predicting Movie Revenue from Pre-Release Features (The Movies Dataset / Kaggle)

## Introduction
This Jupyter notebook is part of the final project for the **Machine Learning** course at **GISMA University of Applied Sciences**.  
The goal is to predict a movie's **box-office revenue** using only **pre-release information** (features available before the movie is released), such as budget, runtime, genres, release year, and historical "track record" features for cast and key crew roles.

The dataset used in this project is publicly available on Kaggle: **"The Movies Dataset" (Rounak Banik)**. The metadata includes movie attributes (e.g., budget, runtime, genres, release date) plus cast/crew lists.

Predicting revenue is challenging because revenues are **highly skewed** and influenced by many interacting factors. To handle skewness, I model **log1p(revenue)** and later convert predictions back to the revenue scale.  
Finally, I interpret the model output as a **revenue interval** (a practical range) rather than a single point estimate. For example, the model does not say that the revenue will be exactly 10.5 million dollars. Instead, it might say that the expected revenue is around 10 million dollars with an uncertainty of about ±3 million, which means the value is likely to be between roughly 7 and 13 million dollars.

I explain all details and decisions, discuss the results, and outline what can be improved in future versions.



## Abstract (What this notebook delivers)
- **Task:** Supervised regression to predict **log1p(revenue)**.
- **Input features:** Only **pre-release** attributes + **time-aware historical aggregates** for top cast and key crew roles.
- **Train/Test split:** **Temporal split** to mimic real forecasting (train: release_year < 2015, test: ≥ 2015).
- **Model selection:** Compare multiple regressors using cross-validation on the **train** period only.
- **Final model:** `HistGradientBoostingRegressor` (best overall on the held-out test set in this project).
- **Final test performance (log scale):** MAE ≈ **1.39** (typical multiplicative error about **×/÷ 4.02**, as reported in Section 8.1).


## 1. Install Dependencies

In [None]:
!pip install -q kagglehub
!pip install -q pandas
!pip install -q numpy
!pip install -q scikit-learn
!pip install -q matplotlib
!pip install -q seaborn
!pip install -q tqdm joblib tqdm-joblib


## 2. Import Libraries

In [None]:
import kagglehub

import os
import ast

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from tqdm.auto import tqdm
from joblib import parallel_backend
from tqdm_joblib import tqdm_joblib

from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingRegressor


## 3. Data Preprocessing

### 3.1. File (Data) Loading

The dataset is loaded programmatically from Kaggle using Kaggle Hub to ensure full reproducibility without requiring authentication or local file dependencies.

In [None]:
dataset_path = kagglehub.dataset_download("rounakbanik/the-movies-dataset")
print("Path to dataset files:", dataset_path)
print("Dataset Filename:", os.listdir(dataset_path)[0])

### 3.2 Load Dataset



In [None]:
movies_path  = os.path.join(dataset_path, "movies_metadata.csv")
credits_path = os.path.join(dataset_path, "credits.csv")

movies = pd.read_csv(movies_path, low_memory=False)
credits = pd.read_csv(credits_path)

print("movies:", movies.shape)
print("credits:", credits.shape)


### 3.3. Clean and normalize the join key (id)

The `movies_metadata.csv` file sometimes contains non-numeric IDs, so I convert the `id` column safely with `errors="coerce"` and drop the rows that cannot be parsed.
When I ran this step, it removed only 3 of the 45,466 movies, but I keep it anyway because it is a good defensive practice.

In [None]:
movies["id"] = pd.to_numeric(movies["id"], errors="coerce")
credits["id"] = pd.to_numeric(credits["id"], errors="coerce")

movies = movies.dropna(subset=["id"]).copy()
credits = credits.dropna(subset=["id"]).copy()

movies["id"] = movies["id"].astype(int)
credits["id"] = credits["id"].astype(int)

print("movies after id cleaning:", movies.shape)
print("credits after id cleaning:", credits.shape)
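
As a small illustration of the coercion step above, the sketch below applies the same `pd.to_numeric(..., errors="coerce")` pattern to a hypothetical three-row table. The toy values, including the date-like string standing in for a malformed id, are assumptions made for this example and are not rows quoted from the dataset.

```python
import pandas as pd

# Toy frame with one malformed id (a date-like string), mimicking a corrupted row.
toy = pd.DataFrame({"id": ["862", "8844", "1997-08-20"], "title": ["A", "B", "C"]})

# errors="coerce" turns values that cannot be parsed into NaN instead of raising.
toy["id"] = pd.to_numeric(toy["id"], errors="coerce")
n_bad = int(toy["id"].isna().sum())

# Drop the unparseable rows, then cast the remaining ids to integers.
clean = toy.dropna(subset=["id"]).copy()
clean["id"] = clean["id"].astype(int)

print("rows dropped:", n_bad)
print(clean)
```

Counting the NaN values before dropping them makes it easy to report how many rows the cleaning step removes.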



### 3.4. Check for duplicates and remove them

In [None]:
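print("Duplicate ids in movies:", movies["id"].duplicated().sum())
print("Duplicate ids in credits:", credits["id"].duplicated().sum())

# Keep the first row per id in both tables so the merge cannot multiply rows.
movies = movies.drop_duplicates(subset=["id"]).copy()
credits = credits.drop_duplicates(subset=["id"]).copy()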


### 3.5. Merging credits with movie metadata

I chose a left join because it keeps all movies from the metadata and adds cast/crew where available, giving a unified table with all the features that I consider useful.

In [None]:
df = movies.merge(credits, on="id", how="left")

print("Merged df:", df.shape)
df.head(2)
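
One way to sanity-check a left join like the one above is to let pandas verify the key relationship and flag unmatched rows. The sketch below uses the `validate` and `indicator` options of `DataFrame.merge` on two hypothetical toy tables; `movies_toy` and `credits_toy` are illustrative names and are not part of this notebook.

```python
import pandas as pd

# Hypothetical toy tables: movie 2 has no credits row.
movies_toy = pd.DataFrame({"id": [1, 2, 3], "title": ["A", "B", "C"]})
credits_toy = pd.DataFrame({"id": [1, 3], "cast": ["cast_a", "cast_c"]})

# validate="many_to_one" raises if the right-hand keys are not unique;
# indicator=True adds a "_merge" column showing where each row came from.
merged = movies_toy.merge(
    credits_toy, on="id", how="left", validate="many_to_one", indicator=True
)

n_without_credits = int((merged["_merge"] == "left_only").sum())
print("rows after merge:", len(merged))
print("movies without credits:", n_without_credits)
```

With unique keys on the right-hand side, a left join keeps exactly one row per left-hand row, so the row count should not change.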


### 3.6. Data profiling and initial cleaning

In [None]:
print("Dataset shape:", df.shape)
df.info()

### 3.6.0 A major issue in the dataset

The raw dataset has a serious limitation: **around 84% of movies have a recorded revenue of zero**. In many cases, this does not mean that the movie truly earned zero, but rather that the revenue is missing or not reported. This makes the raw revenue column unreliable for a large part of the dataset.

In the rest of the project, I continue working with the data but focus on the subset of movies with **valid (non-zero) revenue**, and apply cleaning and feature extraction on this reduced sample to train the model. This issue and its impact on the results are discussed later in Section 9 (Discussion and limitations).


In [None]:

total_rows = len(df)
zero_revenue = (df["revenue"] == 0).sum()
pos_revenue = (df["revenue"] > 0).sum()

print(f"Total movies:          {total_rows}")
print(f"Revenue > 0:           {pos_revenue} ({pos_revenue / total_rows:.2%})")
print(f"Revenue == 0:          {zero_revenue} ({zero_revenue / total_rows:.2%})")




### 3.6.1 Missing values

In [None]:
missing = df.isnull().mean().sort_values(ascending=False)
missing[missing > 0].head(20)

### 3.6.2 Dropping irrelevant and leakage-prone columns (before feature engineering)

Before building features, I remove columns that are either:
- **Irrelevant** for pre-release forecasting (e.g., poster paths, free-text tagline), or
- **Leakage / post-release proxies**, such as vote statistics that strongly depend on audience reactions after release.


In [None]:
IRRELEVANT_COLS = [
    "homepage",
    "belongs_to_collection",
    "overview",
    "poster_path",
    "imdb_id",
    "original_title",
    "video",
    "tagline",
]

LEAKING_COLS = [
    "popularity",
    "vote_average",
    "vote_count",
]

CLEANING_DROP_COLS = IRRELEVANT_COLS + LEAKING_COLS

df = df.drop(columns=[c for c in CLEANING_DROP_COLS if c in df.columns])

## 4. Feature Engineering and Target Definition

### 4.1 Fix data types

In this step, I make sure that the core columns used in the model have suitable types and no missing values. The `budget` column is stored as text in the raw data, so I first convert it to a numeric type. Similarly, `release_date` is converted from text to a proper datetime type. Both conversions use `errors="coerce"`, which replaces invalid or unexpected values with missing values instead of raising an error.

After these conversions, I drop any rows where `budget`, `revenue`, `release_date`, or `runtime` are missing. These four columns are essential for the project: `revenue` is the target variable, and `budget`, `runtime`, and `release_date` (later used to create `release_year`) are important input features. Working only with rows where these fields are present ensures that the later feature engineering and modeling steps are built on a consistent and reliable subset of the data.


In [None]:
df["budget"] = pd.to_numeric(df["budget"], errors="coerce")
df["release_date"] = pd.to_datetime(df["release_date"], errors="coerce")
df = df.dropna(subset=["budget", "revenue", "release_date", "runtime"]).copy()


### 4.2 Genre multi-hot encoding
In this step, I use genre information as a pre-release feature: a movie's genres are usually known before it comes out, and they can strongly influence revenue. For example, one might expect horror movies to have relatively good revenue on average, so genre can be an important signal.

The original `genres` column is a JSON-like string, so I first parse it and extract the list of genre names for each movie. Since a movie can have more than one genre, I apply multi-hot encoding with `MultiLabelBinarizer` to create one binary column per genre (for example `genre_action`, `genre_comedy`, `genre_drama`), where each column is 1 if the movie belongs to that genre and 0 otherwise. This converts the multi-genre data into a clean numeric format that the model can use. After that, the original `genres` and `genre_list` columns are no longer needed and are removed.

At the end, I show how many movies fall into each genre.

In [None]:
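def extract_genre_names(genres_str):
    """Parse the JSON-like genres string and return the list of genre names."""
    try:
        genres = ast.literal_eval(genres_str)
        return [g["name"] for g in genres]
    except Exception:
        # Malformed or missing entries are treated as "no genres".
        return []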

df["genre_list"] = df["genres"].apply(extract_genre_names)

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
genre_encoded = mlb.fit_transform(df["genre_list"])

genre_df = pd.DataFrame(
    genre_encoded,
    columns=[f"genre_{g.lower()}" for g in mlb.classes_],
    index=df.index
)

df = pd.concat([df, genre_df], axis=1)
df = df.drop(columns=["genres", "genre_list"])

# Find all genre columns
genre_cols = [c for c in df.columns if c.startswith("genre_")]

# Count how many movies have each genre (1 = present)
genre_counts = df[genre_cols].sum().sort_values(ascending=False)

genre_counts
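
To make the multi-hot encoding concrete, here is a minimal sketch on three hypothetical genre strings written in the same JSON-like style as the raw column. The sample strings and the names `raw`, `genre_lists`, `mlb_toy`, and `encoded` are assumptions for illustration only.

```python
import ast

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical raw values in the JSON-like format of the genres column.
raw = pd.Series([
    "[{'id': 28, 'name': 'Action'}, {'id': 35, 'name': 'Comedy'}]",
    "[{'id': 18, 'name': 'Drama'}]",
    "[]",
])

# Parse each string into a list of genre names.
genre_lists = raw.apply(lambda s: [g["name"] for g in ast.literal_eval(s)])

# One binary column per genre; a movie with no genres gets an all-zero row.
mlb_toy = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb_toy.fit_transform(genre_lists),
    columns=[f"genre_{g.lower()}" for g in mlb_toy.classes_],
)
print(encoded)
```

The first movie has two columns set to 1, which is the difference between multi-hot and ordinary one-hot encoding.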



### 4.3 Cast star-power features (top-5 actors)

In this step, I introduce cast star-power features based on the historical performance of the top actors in each movie.

For each movie, only the first five actors listed in the cast are considered. (This is my own decision; it could be more, but five is a reasonable choice for the scale of this project.)

The dataset lists actors in billing order, where more famous actors usually appear first.

To avoid target leakage, actor performance is computed in a time-aware manner. For each actor, only revenues from movies released before the current movie are used. From this history, two statistics are derived:


* the median past revenue, which provides a robust estimate of typical commercial performance, and
* the number of past movies, which captures experience and visibility.

I chose the median instead of the mean to reduce bias. The mean can be strongly affected by a single extreme case, such as one very unsuccessful movie or one extreme blockbuster in an actor's past. In some cases, a movie may succeed because of other leading actors, while the selected actor appears only in a lower billing position (for example, fifth in the cast, since I keep the first five actors). To avoid these misleading effects and obtain a more stable representation of typical performance, I use the median.

These actor-level statistics are then aggregated at the movie level across the top-five actors. The median is used for revenue aggregation to reduce the influence of extreme blockbuster outliers, while past movie counts are summed to reflect overall cast experience.

The result is a compact set of numerical features that encode cast strength using only information that would have been available prior to a movie's release.

In [None]:
required_cols = ["id", "cast", "revenue", "release_date"]
missing_required = [c for c in required_cols if c not in df.columns]
if missing_required:
    raise ValueError(f"df is missing required columns: {missing_required}")

# keep only movies that were actually released (not rumored or otherwise unreleased)
if "status" in df.columns:
    df = df[df["status"] == "Released"].copy()

df["release_date"] = pd.to_datetime(df["release_date"], errors="coerce")
df = df.dropna(subset=["release_date"]).copy()
df["release_year"] = df["release_date"].dt.year.astype(int)


df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce").fillna(0)

# extract top 5 actors of each movie
def parse_top5_actor_ids(cast_str, k=5):
    if pd.isna(cast_str):
        return []
    try:
        cast_list = ast.literal_eval(cast_str)
        if not isinstance(cast_list, list):
            return []
        ids = []
        for p in cast_list[:k]:
            if isinstance(p, dict) and p.get("id") is not None:
                ids.append(int(p["id"]))
        return ids
    except Exception:
        return []

df["top5_actor_ids"] = df["cast"].apply(parse_top5_actor_ids)


# build actor_movies table (movie x actor rows) from top5 actors
actor_movies = df[["id", "release_year", "revenue", "top5_actor_ids"]].explode("top5_actor_ids")
actor_movies = actor_movies.dropna(subset=["top5_actor_ids"]).copy()
actor_movies = actor_movies.rename(columns={"top5_actor_ids": "actor_id"})
actor_movies["actor_id"] = actor_movies["actor_id"].astype(int)

actor_movies_hist = actor_movies[actor_movies["revenue"] > 0].copy()
actor_movies_hist = actor_movies_hist.sort_values(["actor_id", "release_year", "id"])

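# actor past median revenue: expanding median over each actor's earlier movies.
# The shift(1) runs inside each actor's group, so an actor's first movie gets NaN
# instead of inheriting the last value of the previous actor.
actor_movies_hist["actor_past_median_revenue"] = (
    actor_movies_hist
    .groupby("actor_id")["revenue"]
    .transform(lambda s: s.expanding().median().shift(1))
)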

# past movie count for each actor (how many prior movies observed)
actor_movies_hist["actor_past_movie_count"] = actor_movies_hist.groupby("actor_id").cumcount()


movie_star_power = actor_movies_hist.groupby("id").agg(
    top5_cast_median_past_revenue=("actor_past_median_revenue", "median"),
    top5_cast_total_past_movies=("actor_past_movie_count", "sum"),
).reset_index()

# merge actors table to their movies
df = df.merge(movie_star_power, on="id", how="left")

df["top5_cast_median_past_revenue"] = df["top5_cast_median_past_revenue"].fillna(0)
df["top5_cast_total_past_movies"] = df["top5_cast_total_past_movies"].fillna(0)
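
The time-aware statistic described in this section can be checked by hand on a tiny example. The sketch below uses a hypothetical five-row history for two actors (all numbers are invented) and assumes that the shift is applied within each actor's group, so that a first appearance has no history at all.

```python
import pandas as pd

# Hypothetical actor history: actor 1 has three movies, actor 2 has two.
hist = pd.DataFrame({
    "actor_id":     [1, 1, 1, 2, 2],
    "release_year": [2000, 2005, 2010, 2003, 2008],
    "revenue":      [10.0, 30.0, 20.0, 100.0, 300.0],
}).sort_values(["actor_id", "release_year"])

# Expanding median per actor, shifted by one row inside each group so that
# the current movie's own revenue never enters its feature.
hist["past_median"] = hist.groupby("actor_id")["revenue"].transform(
    lambda s: s.expanding().median().shift(1)
)

# Number of earlier movies observed for the same actor.
hist["past_count"] = hist.groupby("actor_id").cumcount()

print(hist)
```

Actor 2's first movie has no past median here. A shift applied to the whole column instead of per group would hand it actor 1's final value, which is a subtle source of incorrect features.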





### 4.4 Crew track-record features (director, writer, producer)

In this step, I apply the same approach used in Section 4.3 to key crew roles, specifically the director, writer, and producer. I believe these roles have a strong influence on movie revenue; for example, we would normally expect movies directed by Christopher Nolan to generate higher revenue.

For each movie, crew members in these roles are identified, and their historical performance is calculated in a time-aware manner, using only movies released before the current one to avoid target leakage, as discussed in the previous section.

In [None]:


DROP_PREV_CREW = [
    "director_past_median_revenue",
    "director_past_movie_count",
    "writer_past_median_revenue",
    "writer_past_movie_count",
    "producer_past_median_revenue",
    "producer_past_movie_count",
]
df = df.drop(columns=[c for c in DROP_PREV_CREW if c in df.columns], errors="ignore")


def parse_crew(crew_str):
    if pd.isna(crew_str):
        return []
    try:
        crew = ast.literal_eval(crew_str)
        return crew if isinstance(crew, list) else []
    except Exception:
        return []

df["crew_parsed"] = df["crew"].apply(parse_crew)

# extract those roles ids
def extract_role_ids(crew, roles):
    ids = []
    for p in crew:
        if isinstance(p, dict) and p.get("job") in roles and p.get("id") is not None:
            ids.append(int(p["id"]))
    return ids

DIRECTOR_ROLES = {"Director"}
WRITER_ROLES = {"Writer", "Screenplay", "Story", "Original Story"}
PRODUCER_ROLES = {"Producer", "Executive Producer", "Co-Producer", "Associate Producer"}

df["director_ids"] = df["crew_parsed"].apply(lambda x: extract_role_ids(x, DIRECTOR_ROLES))
df["writer_ids"]   = df["crew_parsed"].apply(lambda x: extract_role_ids(x, WRITER_ROLES))
df["producer_ids"] = df["crew_parsed"].apply(lambda x: extract_role_ids(x, PRODUCER_ROLES))


def build_role_history(df, id_col, role_name):
    role_movies = df[["id", "release_year", "revenue", id_col]].explode(id_col)
    role_movies = role_movies.dropna(subset=[id_col]).copy()
    role_movies = role_movies.rename(columns={id_col: "person_id"})
    role_movies["person_id"] = role_movies["person_id"].astype(int)

    role_movies = role_movies[role_movies["revenue"] > 0].copy()
    role_movies = role_movies.sort_values(["person_id", "release_year", "id"])

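    # Expanding median over each person's earlier movies; shift(1) is applied
    # inside each person's group so a first credit has no history (NaN).
    role_movies[f"{role_name}_past_median_revenue"] = (
        role_movies
        .groupby("person_id")["revenue"]
        .transform(lambda s: s.expanding().median().shift(1))
    )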

    role_movies[f"{role_name}_past_movie_count"] = role_movies.groupby("person_id").cumcount()

    agg = role_movies.groupby("id").agg(
        **{
            f"{role_name}_past_median_revenue": (f"{role_name}_past_median_revenue", "median"),
            f"{role_name}_past_movie_count": (f"{role_name}_past_movie_count", "sum"),
        }
    ).reset_index()

    return agg


director_agg = build_role_history(df, "director_ids", "director")
writer_agg   = build_role_history(df, "writer_ids",   "writer")
producer_agg = build_role_history(df, "producer_ids", "producer")

# merge tables of each role to the original movie table
df = df.merge(director_agg, on="id", how="left")
df = df.merge(writer_agg,   on="id", how="left")
df = df.merge(producer_agg, on="id", how="left")


for col in DROP_PREV_CREW:
    if col in df.columns:
        df[col] = df[col].fillna(0)


df = df.drop(
    columns=["crew_parsed", "director_ids", "writer_ids", "producer_ids"],
    errors="ignore"
)
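
To show what the role extraction returns, the following sketch runs a simplified helper on one hypothetical crew string. The helper name `role_ids`, the person ids, and the names in the string are invented for illustration; only the `job` and `id` keys are taken from the code above.

```python
import ast


def role_ids(crew_str, roles):
    """Return ids of crew members whose 'job' is in roles; [] if parsing fails."""
    try:
        crew = ast.literal_eval(crew_str)
    except (ValueError, SyntaxError):
        return []
    if not isinstance(crew, list):
        return []
    return [
        int(p["id"])
        for p in crew
        if isinstance(p, dict) and p.get("job") in roles and p.get("id") is not None
    ]


# Hypothetical crew entry with three people in different jobs.
crew_example = (
    "[{'id': 525, 'job': 'Director', 'name': 'Person A'}, "
    "{'id': 901, 'job': 'Screenplay', 'name': 'Person B'}, "
    "{'id': 902, 'job': 'Editor', 'name': 'Person C'}]"
)

directors = role_ids(crew_example, {"Director"})
writers = role_ids(crew_example, {"Writer", "Screenplay", "Story"})
broken = role_ids("not a list", {"Director"})

print(directors, writers, broken)
```

Grouping several job titles into one role set is what lets a "Screenplay" credit count towards the writer history.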


## 5. Feature Matrix Preparation

In this section I build the final modeling dataset (**df_model**), define the target label (**y**), and construct the feature matrix (**X**).
At this point, all cleaning and feature engineering steps are complete, so the remaining work is about turning the table into a purely numerical format that our model can learn from.

**What happens in this section**
- Select / drop non-model columns (IDs, raw JSON strings, and high-cardinality text fields)
- Define the target as **log1p(revenue)** to reduce skewness
- One-hot encode remaining low-cardinality categorical features


### 5.1 Target distribution and motivation for log transform

From practice and common sense, movie revenue is usually not evenly distributed. Most movies earn a small or medium amount, and only a few become very successful blockbusters. This means the revenue values are likely to be **strongly right-skewed**.

To confirm this in the current dataset, I first look at basic statistics, the skewness value, and a histogram of the raw revenue (for movies with positive revenue). After that, I apply a `log(1 + revenue)` transform and check the new distribution again.

The log transform compresses very large values much more than small ones and makes the distribution more balanced. Based on this behaviour in the plots and skewness values, I decide to use `log(1 + revenue)` as the target (`log_revenue`) for the regression model, because it helps the model learn more stable patterns and reduces the influence of extreme outliers.


In [None]:
rev = df["revenue"]
rev_pos = rev[rev > 0]

print("\nSkewness of raw revenue:", rev_pos.skew())

plt.figure(figsize=(8, 4))
plt.hist(rev_pos / 1e6, bins=100)
plt.xlabel("Revenue (millions of USD)")
plt.ylabel("Number of movies")
plt.title("Distribution of movie revenue (raw, > 0)")
plt.show()

rev_log = np.log1p(rev_pos)

print("\nSkewness of log(1 + revenue):", rev_log.skew())

plt.figure(figsize=(8, 4))
plt.hist(rev_log, bins=50)
plt.xlabel("log(1 + revenue)")
plt.ylabel("Number of movies")
plt.title("Distribution of log-transformed movie revenue")
plt.show()
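
The effect of the transform can also be reproduced on synthetic data. The sketch below draws hypothetical revenues from a log-normal distribution (an assumption chosen only because it is strongly right-skewed; it is not a claim about the real revenue distribution) and compares the skewness before and after `np.log1p`.

```python
import numpy as np
import pandas as pd

# Synthetic, strongly right-skewed "revenues"; the parameters are arbitrary.
rng = np.random.default_rng(0)
synthetic_revenue = pd.Series(rng.lognormal(mean=15.0, sigma=1.5, size=5000))

raw_skew = synthetic_revenue.skew()
log_skew = np.log1p(synthetic_revenue).skew()

print(f"skewness of raw values:      {raw_skew:.2f}")
print(f"skewness of log1p of values: {log_skew:.2f}")
```

A skewness close to zero after the transform suggests a roughly symmetric target, which is the behaviour the plots above are meant to confirm on the real data.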


### 5.2 Final Feature Matrix and Target Construction



In [None]:
df_model = df.copy()
df_model = df_model[df_model["revenue"] > 0].copy()

# target: log-transformed revenue
df_model["log_revenue"] = np.log1p(df_model["revenue"])
y = df_model["log_revenue"]

DROP_COLS = [
    # target
    "revenue", "log_revenue",

    # identifiers / raw text
    "id", "title", "release_date",
    "cast", "crew",

    # high-cardinality categoricals (kept simple for this course project)
    "production_companies", "production_countries", "spoken_languages",

    # raw JSON column (genres already expanded into multi-hot columns earlier)
    "genres",
]

X = df_model.drop(columns=[c for c in DROP_COLS if c in df_model.columns], errors="ignore").copy()


bad_cols = [
    c for c in X.columns
    if X[c].apply(lambda v: isinstance(v, (list, dict, set, tuple))).any()
]
X = X.drop(columns=bad_cols, errors="ignore")
print("Dropped non-tabular columns:", bad_cols)

# handle missing numeric values
num_cols = X.select_dtypes(include=["int64", "float64"]).columns
X[num_cols] = X[num_cols].fillna(0)

# One-hot encode remaining categorical features (low-cardinality)
cat_cols = X.select_dtypes(include=["object"]).columns
print("Categorical columns encoded:", cat_cols.tolist())
X = pd.get_dummies(X, columns=cat_cols, drop_first=True)

print("✅ Final shapes -> X:", X.shape, "| y:", y.shape)
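
As a brief illustration of the one-hot step, the sketch below encodes a single hypothetical categorical column with `drop_first=True`. The column name `original_language` and its values are assumptions for the example; the columns actually encoded in this notebook are the ones printed by the cell above.

```python
import pandas as pd

# Hypothetical feature table with one numeric and one categorical column.
X_toy = pd.DataFrame({
    "budget": [1.0, 2.0, 3.0, 4.0],
    "original_language": ["en", "fr", "en", "de"],
})

# drop_first=True removes the first category in sorted order ("de"),
# which then acts as the implicit baseline.
encoded_toy = pd.get_dummies(X_toy, columns=["original_language"], drop_first=True)

print(encoded_toy.columns.tolist())
```

Dropping one level avoids a redundant column, which matters mainly for the linear baseline; tree-based models are largely indifferent to it.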


## 6. Split Train and Test
In this step, the dataset is split into training and test sets using a time-based split rather than a random split.

Movies released before 2015 are used for training, while movies released in 2015 and later are kept for testing. This choice reflects a realistic prediction scenario, where a model is trained on historical data and then used to estimate the revenue of future movies.

Using a time-based split helps prevent data leakage, since information from newer movies is not allowed to influence the training process. It also better matches the real-world use case of this project, where revenue predictions are made for movies that have not yet been released.

I did not tune the split year to optimize performance. The goal was to simulate a realistic forecasting scenario with a clear time boundary, and 2015 provided a good balance between historical coverage and test set size: as the cell below shows, the test set is about 9.55% of the modeling dataset.

In [None]:
train_idx = df_model["release_year"] < 2015
test_idx  = df_model["release_year"] >= 2015

total_samples = len(df_model)
test_samples = test_idx.sum()

test_percentage = (test_samples / total_samples) * 100

print("Test percentage:", test_percentage)

X_train = X.loc[train_idx].copy()
X_test  = X.loc[test_idx].copy()
y_train = y.loc[train_idx].copy()
y_test  = y.loc[test_idx].copy()

print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)
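
The split logic can be wrapped in a small helper and verified on toy data. In the sketch below, `temporal_split` is a hypothetical function name and the five rows are invented; the point is only to show that every training row precedes the cutoff and that no row is lost or duplicated.

```python
import pandas as pd


def temporal_split(frame, year_col, cutoff_year):
    """Rows before cutoff_year go to train, all remaining rows go to test."""
    train_mask = frame[year_col] < cutoff_year
    return frame.loc[train_mask], frame.loc[~train_mask]


# Hypothetical data spanning the cutoff year.
toy = pd.DataFrame({
    "release_year": [2010, 2013, 2014, 2015, 2016],
    "budget": [5, 7, 9, 11, 13],
})

train_toy, test_toy = temporal_split(toy, "release_year", 2015)
test_share = len(test_toy) / len(toy) * 100

print("train years:", train_toy["release_year"].tolist())
print("test years: ", test_toy["release_year"].tolist())
print(f"test share: {test_share:.1f}%")
```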


## 7. Model selection and evaluation

In this section, I compare a small set of models using cross-validation only on the training period, then evaluate the best model once on the held-out test period.


### 7.1 Model selection with cross-validation (GridSearchCV)

**Model selection rationale**

Here multiple regression models are evaluated to understand how different modeling assumptions affect revenue prediction. Each model is chosen for a specific reason, ranging from simple baselines to more expressive non-linear methods.

Ridge Regression is used as a linear baseline model. It extends standard linear regression by adding regularization, which helps control overfitting when the number of features is large. Ridge regression also provides a strong reference point to compare more complex models against a simple, interpretable approach.

Random Forest Regressor is included to capture non-linear relationships and feature interactions that linear models cannot represent. By combining many decision trees, random forests are robust to noise and can model complex patterns in tabular data without requiring strong assumptions about feature distributions.

Histogram-based Gradient Boosting Regressor is selected as a more advanced ensemble method that incrementally improves predictions by correcting previous errors. This model is well-suited for large tabular datasets, handles non-linearities efficiently, and often achieves strong performance with relatively little feature scaling or manual tuning.

Together, these models provide a balanced comparison between interpretability, flexibility, and predictive performance, allowing the final model choice to be based on empirical results rather than assumptions.

It is also worth noting that tree-based models (Random Forest, Gradient Boosting) do not require feature scaling.
For the linear model (here Ridge), scaling is applied inside the sklearn `Pipeline` during model selection, so the scaler is fit only on the training folds and no information leaks from the validation folds.

For validation, I chose K-Fold cross-validation as a balanced and practical approach. A single hold-out split can give unstable results depending on how the data is divided, while K-Fold reduces this risk by averaging performance across multiple folds. Leave-one-out validation would be computationally expensive for a dataset of this size and does not provide clear advantages here. Although the data has a time component, the temporal dependency is already handled by the train–test split, and K-Fold cross-validation within the training set offers a good trade-off between reliability and efficiency for model selection.
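
The point about scaling inside the pipeline can be sketched on synthetic data. In the example below, the arrays `X_demo` and `y_demo` are randomly generated placeholders with deliberately different feature scales, not the project data; because the scaler is a pipeline step, `cross_val_score` refits it on the training part of every fold.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales, with a linear target plus noise.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(120, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y_demo = X_demo @ np.array([1.0, 0.1, 0.01, 0.001]) + rng.normal(scale=0.1, size=120)

# The scaler is part of the estimator, so it only ever sees training folds.
pipe = Pipeline([("scaler", StandardScaler()), ("ridge", Ridge(alpha=1.0))])
cv_demo = KFold(n_splits=3, shuffle=True, random_state=42)

scores = cross_val_score(
    pipe, X_demo, y_demo, cv=cv_demo, scoring="neg_mean_absolute_error"
)
cv_mae = -scores.mean()

print("MAE per fold:", (-scores).round(3))
print("mean CV MAE: ", round(cv_mae, 3))
```

Scaling the full matrix once before cross-validation would let validation-fold statistics influence the training folds, which is the leakage the pipeline avoids.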

In [None]:

models = []

# ridge
pipe_ridge = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge", Ridge(random_state=42))
])
param_grid_ridge = {"ridge__alpha": [0.1, 1.0, 10.0, 50.0]}
models.append(("Ridge", pipe_ridge, param_grid_ridge))

# random forest
pipe_rf = Pipeline([
    ("rf", RandomForestRegressor(random_state=42, n_jobs=-1))
])
param_grid_rf = {
    "rf__n_estimators": [200, 400],
    "rf__max_depth": [10, 15, 20],
    "rf__min_samples_leaf": [5, 10, 20],
    "rf__max_features": ["sqrt", 0.5],
}
models.append(("Random Forest", pipe_rf, param_grid_rf))

# histGradientBoosting
pipe_hgb = Pipeline([
    ("hgb", HistGradientBoostingRegressor(random_state=42))
])
param_grid_hgb = {
    "hgb__max_depth": [4, 6, 8],
    "hgb__learning_rate": [0.03, 0.05, 0.08],
    "hgb__max_iter": [200, 300, 500],
}
models.append(("HistGradientBoosting", pipe_hgb, param_grid_hgb))


cv = KFold(n_splits=3, shuffle=True, random_state=42)

def count_grid_fits(param_grid, cv):
    n = 1
    for v in param_grid.values():
        n *= len(v)
    return n * cv.get_n_splits()

selection_rows = []
best_estimators = {}

for name, pipe, param_grid in models:
    print(f"\n===== {name} =====")

    grid = GridSearchCV(
        estimator=pipe,
        param_grid=param_grid,
        cv=cv,
        scoring="neg_mean_absolute_error",
        n_jobs=-1,
        refit=True,
        verbose=0
    )

    total = count_grid_fits(param_grid, cv)
    with tqdm_joblib(tqdm(total=total, desc=f"{name} fits")):
        grid.fit(X_train, y_train)

    best_cv_mae = -grid.best_score_
    best_estimators[name] = grid.best_estimator_

    y_pred = grid.best_estimator_.predict(X_test)
    test_mae = mean_absolute_error(y_test, y_pred)
    test_rmse = np.sqrt(mean_squared_error(y_test, y_pred))

    selection_rows.append({
        "model": name,
        "best_params": grid.best_params_,
        "cv_mae_log": best_cv_mae,
        "test_mae_log": test_mae,
        "test_rmse_log": test_rmse,
        "typical_mult_error": float(np.exp(test_mae)),
    })

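# Rank by cross-validated MAE so the test period plays no part in model selection;
# the test columns are reported for reference only.
results_df = pd.DataFrame(selection_rows).sort_values("cv_mae_log")
display(results_df)

best_name = results_df.iloc[0]["model"]
best_estimator = best_estimators[best_name]

print("\n✅ Selected best model:", best_name, best_estimator)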
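
The number of fits used for the progress bars can be cross-checked with scikit-learn's `ParameterGrid`. The sketch below repeats the gradient-boosting grid from the cell above under the illustrative names `grid_hgb` and `cv_check`; it counts only the cross-validation fits and leaves out the single final refit that `GridSearchCV` performs when `refit=True`.

```python
from sklearn.model_selection import KFold, ParameterGrid

# Same values as the HistGradientBoosting grid used in the search.
grid_hgb = {
    "hgb__max_depth": [4, 6, 8],
    "hgb__learning_rate": [0.03, 0.05, 0.08],
    "hgb__max_iter": [200, 300, 500],
}
cv_check = KFold(n_splits=3, shuffle=True, random_state=42)

# ParameterGrid enumerates every combination; each one is fit once per fold.
n_candidates = len(ParameterGrid(grid_hgb))
n_fits = n_candidates * cv_check.get_n_splits()

print("candidates:", n_candidates)
print("CV fits:   ", n_fits)
```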


## 8. Final model
In this section, based on the earlier model comparison, I use the `HistGradientBoostingRegressor` with the best hyperparameters found in cross-validation, fit it on the training set (`X_train`, `y_train`), and then predict the log-revenue for the test set (`X_test`). From these predictions, I calculate the Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) in log space to measure how far the predictions are from the true values.

Finally, I convert the MAE in log space into a multiplicative error factor using the exponential function, `exp(MAE)`. This shows how much the model's predictions typically differ from the true revenue in relative terms and makes the results easier to interpret on the original scale.

In the next section, I will discuss the rationale behind these decisions in more detail.

In [None]:

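# GridSearchCV (refit=True) has already refit this estimator on the full training
# period; fitting again here is redundant but keeps the cell self-contained.
final_model = best_estimator

final_model.fit(X_train, y_train)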


y_test_pred = final_model.predict(X_test)

final_mae = mean_absolute_error(y_test, y_test_pred)
final_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))

print("🎯 FINAL MODEL PERFORMANCE")
print(f"MAE (log revenue):  {final_mae:.4f}")
print(f"RMSE (log revenue): {final_rmse:.4f}")

# exp(MAE) turns the average absolute log error into a multiplicative factor.
mult_error = np.exp(final_mae)
print(f"Typical multiplicative error: ÷{mult_error:.2f} to ×{mult_error:.2f}")
print(f"(On average, the prediction is off by a factor of about {mult_error:.2f} compared to the true revenue.)")



### 8.1 Interpreting predictions as a revenue range

These values are computed on the log-transformed target log(1 + revenue).
An MAE of 1.3914 in log space means that, on average, the absolute difference between the predicted and true log(1 + revenue) is about 1.39. When we map this back to the original revenue scale, this corresponds to a multiplicative error of about exp(1.3914) ≈ 4.02. In practical terms, a typical prediction is within a factor of about 4 of the true revenue. For example, if the model predicts 10 million dollars for a movie, the true revenue is often somewhere in a range roughly between 10 / 4 ≈ 2.5 million and 10 × 4 ≈ 40 million.

The RMSE of 1.9353 in log space is larger than the MAE, which is expected, because RMSE gives more weight to larger errors. It mainly confirms that there are some movies where the model makes bigger mistakes, but for the typical case the multiplicative error factor ≈ 4.02 (derived from MAE) is the main quantity I use to interpret and present the model's performance.
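
The arithmetic behind this range can be written as a small helper. The sketch below is one possible way to turn a prediction on the log(1 + revenue) scale into a low / point / high triple by moving one MAE in each direction and inverting the transform with `expm1`; `revenue_range` is a hypothetical function name, and the resulting range is a rule of thumb based on the typical error, not a calibrated prediction interval.

```python
import math


def revenue_range(pred_log1p, mae_log):
    """Map a log1p-scale prediction and a log-scale MAE to (low, point, high)."""
    low = math.expm1(pred_log1p - mae_log)
    point = math.expm1(pred_log1p)
    high = math.expm1(pred_log1p + mae_log)
    return low, point, high


# A prediction equivalent to 10 million dollars, with the MAE reported above.
pred = math.log1p(10_000_000)
low, point, high = revenue_range(pred, 1.3914)
factor = math.exp(1.3914)

print(f"factor: {factor:.2f}")
print(f"low:    {low / 1e6:.1f} M")
print(f"point:  {point / 1e6:.1f} M")
print(f"high:   {high / 1e6:.1f} M")
```

For revenues in the millions, the "+1" inside the transform is negligible, so the range is effectively the point estimate divided and multiplied by `exp(MAE)`.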


**Evaluation metrics**

I did not use accuracy, because it is a metric for classification, not regression. Accuracy measures how often the predicted class label is correct (for example "hit vs. flop"). In this project, the target is a continuous log-revenue value, not a class label, so accuracy is not meaningful here. There is no "correct" or "incorrect" class, only predictions that are closer to or farther from the true value.

I use **MAE** as the main evaluation metric, because it is more robust to extreme values in this skewed, log-transformed target and directly measures the typical absolute prediction error. I still compute **RMSE** as a secondary metric to report a squared-error perspective, but I do not use it for model selection or for building the revenue range, since RMSE is highly sensitive to a few very large residuals. I also do not rely on **R²** as the primary metric, because it measures the proportion of variance explained and is scale-free, so it does not directly express the typical size of an error.
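
The different sensitivity of the two metrics is easy to see on a handful of numbers. The sketch below compares MAE and RMSE on two hypothetical sets of residuals that differ in a single large error; the values are invented purely for illustration.

```python
import math


def mae(errors):
    """Mean absolute error of a list of residuals."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    """Root mean squared error of a list of residuals."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Hypothetical residuals: identical except for one large miss.
even = [1.0, 1.0, 1.0, 1.0]
with_outlier = [1.0, 1.0, 1.0, 5.0]

print("even:         MAE =", mae(even), " RMSE =", rmse(even))
print("with outlier: MAE =", mae(with_outlier), " RMSE =", round(rmse(with_outlier), 4))
```

When all errors are equal, the two metrics coincide; a single large residual moves RMSE further than MAE, which is the reason MAE is used for the revenue range.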

## 9. Discussion and limitations

- **Data quality and coverage.** The raw dataset has a major limitation: around 84% of movies have a recorded revenue of zero, which in many cases likely means that the true revenue is missing or not reported (shown in Section 3.6.0). For this reason, the modeling dataset is restricted to movies with strictly positive revenue. This improves target quality but reduces the effective sample size and may bias the model towards better-documented and more commercially visible titles. In addition, some core features such as `budget` and `runtime` also contain missing or noisy values and require filtering, which further reduces the dataset.

- **Limited pre-release information.** The model is intentionally restricted to features that are available before release (budget, genre, release year, cast and crew history). However, several strong drivers of revenue are not included, such as marketing spend, distribution strategy, release window, competition from other releases, and brand/franchise effects. As a result, the model can only capture part of the true signal, and its predictions should be seen as coarse forecasts rather than precise financial estimates.

- **Approximate modeling of star power and track record.** The cast and crew features rely on approximations: actors are selected based on billing order (top 5 cast members), and historical performance for directors, writers, and producers is summarized via median past revenue and past movie counts. These features reduce sparsity and capture some notion of "track record", but they cannot fully represent how audience perception, role importance, or uncredited work influence revenue in the real world.

- **Model accuracy and uncertainty.** Even after log transformation, the final model's typical multiplicative error is around a factor of 4 on the original revenue scale. This means that predictions are often within a broad revenue range rather than close to the exact value. For this reason, model outputs are interpreted as revenue intervals based on the typical error, instead of precise point estimates.