
# Movies Success Prediction – Jupyter Notebook

This notebook reproduces the full pipeline of the project:

1. Load and prepare the **Movies Dataset** (`movies_metadata.csv`)
2. Perform basic **Exploratory Data Analysis (EDA)**
3. Split the data into train and test sets
4. Train and evaluate three models:
   - Decision Tree  
   - Random Forest  
   - AdaBoost  
5. Compare the accuracy of the three models

> **Authors:** Ron Estrin (318375755), Leedan Bayley (209876457)


In [1]:
# Standard imports
import os
import sys

import numpy as np
import pandas as pd

# Import project modules from src/ package
from src.config import DATA_PATH, RANDOM_SEED
from src.data_preparation import load_and_prepare_movies_dataset
from src import eda, models


## 1. Load and prepare the dataset

In [2]:

# Load and prepare the dataset
df_model = load_and_prepare_movies_dataset()

print("Prepared dataset shape:", df_model.shape)
print("Columns:", df_model.columns.tolist())


Prepared dataset shape: (5380, 13)
Columns: ['budget', 'popularity', 'runtime', 'n_genres', 'release_year', 'release_month', 'lang_en', 'lang_hi', 'lang_fr', 'lang_ru', 'lang_ja', 'lang_other', 'success']


## 2. Exploratory Data Analysis (EDA)

In [3]:

# Run basic EDA summaries
eda.print_basic_info(df_model)
eda.summarize_success_distribution(df_model)
eda.summarize_by_year(df_model)
eda.summarize_by_decade(df_model)


=== Basic Dataset Info ===
Number of rows: 5380
Number of columns: 13

Columns:
['budget', 'popularity', 'runtime', 'n_genres', 'release_year', 'release_month', 'lang_en', 'lang_hi', 'lang_fr', 'lang_ru', 'lang_ja', 'lang_other', 'success']

First 5 rows:
       budget  popularity  runtime  n_genres  release_year  release_month  \
0  30000000.0   21.946943     81.0         0          1995             10   
1  65000000.0   17.015539    104.0         0          1995             12   
3  16000000.0    3.859495    127.0         0          1995             12   
5  60000000.0   17.924927    170.0         0          1995             12   
8  35000000.0    5.231580    106.0         0          1995             12   

   lang_en  lang_hi  lang_fr  lang_ru  lang_ja  lang_other  success  
0        1        0        0        0        0           0        1  
1        1        0        0        0        0           0        1  
3        1        0        0        0        0           0        1  
5
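Every row in the preview above shows `n_genres` equal to 0, which may be worth checking against the raw data. In the Kaggle `movies_metadata.csv`, the `genres` column holds Python-literal strings (single-quoted lists of dicts), so `json.loads` rejects them while `ast.literal_eval` parses them. The sketch below shows one way to count genres under that assumption; `count_genres` and `sample_genres` are hypothetical names, not part of `src.data_preparation`.

```python
import ast


def count_genres(raw):
    """Count genre entries in one raw `genres` cell (hypothetical helper)."""
    if not isinstance(raw, str) or not raw.strip():
        return 0  # missing (NaN) or empty cell
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return 0  # malformed cell
    if not isinstance(parsed, list):
        return 0
    # Only count well-formed entries that carry a genre name.
    return sum(1 for g in parsed if isinstance(g, dict) and "name" in g)


# Illustrative value in the dataset's format (assumed, not read from the file).
sample_genres = (
    "[{'id': 16, 'name': 'Animation'}, "
    "{'id': 35, 'name': 'Comedy'}, "
    "{'id': 10751, 'name': 'Family'}]"
)
print(count_genres(sample_genres))  # 3
```

Applied to a DataFrame, this would be something like `df["genres"].apply(count_genres)`, assuming the raw column is still available at that point in the preparation step.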

## 3. Train / Test split

In [4]:

# Split the data into train and test sets
X_train, X_test, y_train, y_test = models.train_test_split_movies(df_model)
X_train.shape, X_test.shape


((4304, 12), (1076, 12))
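The shapes above are what an 80/20 split of 5380 rows produces (4304 + 1076). The body of `models.train_test_split_movies` is not shown here, so the following is only a minimal sketch of such a split with scikit-learn. The 0.2 test fraction, the stratification, and the seed value are assumptions, and the data is a synthetic stand-in with the same shape as the prepared dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

RANDOM_SEED = 42  # assumed value; the project reads it from src.config

# Synthetic stand-in: 5380 rows, 12 features, binary target.
rng = np.random.default_rng(RANDOM_SEED)
X = rng.normal(size=(5380, 12))
y = rng.integers(0, 2, size=5380)

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,            # assumed fraction, consistent with 1076 test rows
    stratify=y,               # assumed: keep the class ratio equal in both parts
    random_state=RANDOM_SEED,
)
print(X_train.shape, X_test.shape)  # (4304, 12) (1076, 12)
```

Stratifying on the target keeps the share of successful movies nearly identical in both parts, which makes test accuracy easier to compare against a majority-class baseline.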

## 4. Train and evaluate models

In [5]:

# Decision Tree
dt_clf = models.train_decision_tree(
    X_train,
    y_train,
    max_depth=8,
    min_samples_leaf=10,
    random_state=RANDOM_SEED,
)
dt_results = models.evaluate_model(dt_clf, X_test, y_test, model_name="Decision Tree")

# Random Forest
rf_clf = models.train_random_forest(
    X_train,
    y_train,
    n_estimators=200,
    max_depth=12,
    random_state=RANDOM_SEED,
)
rf_results = models.evaluate_model(rf_clf, X_test, y_test, model_name="Random Forest")

# AdaBoost
ab_clf = models.train_adaboost(
    X_train,
    y_train,
    n_estimators=200,
    learning_rate=0.3,
    random_state=RANDOM_SEED,
)
ab_results = models.evaluate_model(ab_clf, X_test, y_test, model_name="AdaBoost")
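The `models` helpers are not shown in this notebook. Judging by the printed output format and the `model_name` / `accuracy` keys read in the comparison step, a decision-tree trainer and an evaluator could look roughly like the sketch below. It assumes scikit-learn; the function bodies and the synthetic data are illustrative and are not the project's code.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report
from sklearn.tree import DecisionTreeClassifier


def train_decision_tree(X, y, max_depth=8, min_samples_leaf=10, random_state=0):
    """Fit a depth-limited decision tree (hypothetical stand-in for the helper)."""
    clf = DecisionTreeClassifier(
        max_depth=max_depth,
        min_samples_leaf=min_samples_leaf,
        random_state=random_state,
    )
    return clf.fit(X, y)


def evaluate_model(clf, X_test, y_test, model_name="model"):
    """Print accuracy and a per-class report, and return a small result dict."""
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"=== Evaluation for {model_name} ===")
    print(f"Accuracy: {accuracy:.4f}")
    print("Classification report:")
    print(classification_report(y_test, y_pred, digits=3))
    return {"model_name": model_name, "accuracy": accuracy}


# Tiny synthetic problem, only to show the call pattern.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)
clf = train_decision_tree(X[:300], y[:300])
results = evaluate_model(clf, X[300:], y[300:], model_name="Decision Tree (toy data)")
```

Limiting `max_depth` and requiring a minimum leaf size are the usual ways to keep a single tree from memorising the training set; the random forest and AdaBoost helpers would presumably wrap `RandomForestClassifier` and `AdaBoostClassifier` in the same manner.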



=== Evaluation for Decision Tree ===
Accuracy: 0.6264
Classification report:
              precision    recall  f1-score   support

           0      0.559     0.367     0.443       436
           1      0.651     0.803     0.719       640

    accuracy                          0.626      1076
   macro avg      0.605     0.585     0.581      1076
weighted avg      0.614     0.626     0.607      1076
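The figures in the Decision Tree report above can be cross-checked by hand. The confusion-matrix counts below are not printed by the notebook; they are inferred from the reported supports (436 and 640) and recalls (0.367 and 0.803), so treat them as a reconstruction. The helper name `precision_recall_f1` is hypothetical.

```python
# Counts inferred from the Decision Tree report, not printed by the notebook.
tn, fp = 160, 276   # true class 0: predicted 0, predicted 1
fn, tp = 126, 514   # true class 1: predicted 0, predicted 1


def precision_recall_f1(correct, predicted_total, actual_total):
    """Return (precision, recall, f1) for one class from raw counts."""
    precision = correct / predicted_total
    recall = correct / actual_total
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


class0 = precision_recall_f1(tn, tn + fn, tn + fp)
class1 = precision_recall_f1(tp, tp + fp, tp + fn)
accuracy = (tn + tp) / (tn + fp + fn + tp)

print("class 0:", [round(v, 3) for v in class0])  # [0.559, 0.367, 0.443]
print("class 1:", [round(v, 3) for v in class1])  # [0.651, 0.803, 0.719]
print("accuracy:", round(accuracy, 4))            # 0.6264
```

The low recall for class 0 means the tree labels most unsuccessful movies as successful, which is why its accuracy sits only slightly above what always predicting class 1 would give.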


=== Evaluation for Random Forest ===
Accuracy: 0.6636
Classification report:
              precision    recall  f1-score   support

           0      0.619     0.443     0.516       436
           1      0.682     0.814     0.742       640

    accuracy                          0.664      1076
   macro avg      0.650     0.628     0.629      1076
weighted avg      0.656     0.664     0.651      1076


=== Evaluation for AdaBoost ===
Accuracy: 0.6441
Classification report:
              precision    recall  f1-score   support

           0      0.608     0.342     0.438       436
        

## 5. Compare model performance

In [7]:

# Build a small summary table for accuracy comparison
results = [
    dt_results,
    rf_results,
    ab_results,
]

summary = pd.DataFrame(
    {
        "Model": [r["model_name"] for r in results],
        "Accuracy": [r["accuracy"] for r in results],
    }
)
summary


           Model  Accuracy
0  Decision Tree  0.626394
1  Random Forest  0.663569
2       AdaBoost  0.644052
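Accuracy alone is easier to interpret next to a trivial baseline. The sketch below takes the test-set class supports from the classification reports above (436 and 640) and the accuracies from the summary table, then computes what a classifier that always predicts the majority class would score. The variable names are illustrative.

```python
# Test-set class supports from the classification reports.
support = {0: 436, 1: 640}
# Accuracies from the summary table.
accuracy = {
    "Decision Tree": 0.626394,
    "Random Forest": 0.663569,
    "AdaBoost": 0.644052,
}

# A classifier that always predicts the majority class scores this much.
baseline = max(support.values()) / sum(support.values())
print(f"Majority-class baseline: {baseline:.4f}")  # 0.5948

# Rank the models by accuracy and show the gain over the baseline.
for name, acc in sorted(accuracy.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<14} accuracy={acc:.4f}  gain={acc - baseline:+.4f}")
```

All three models beat the baseline, with Random Forest ahead by roughly seven percentage points and the single Decision Tree by about three.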
