# Movie Success Predictor  
## Phase 5: Model Building

### Objective

The objective of this phase is to train and evaluate predictive models
using the final, leakage-free feature dataset prepared in Phase 4.

Two modeling tasks are addressed:

1. **Hit / Flop Classification**
   - Predicting whether a movie will be commercially successful

2. **IMDb Rating Prediction**
   - Predicting audience ratings on a 0–10 scale

All preprocessing steps, such as scaling and encoding, are performed
inside scikit-learn Pipelines to prevent data leakage and ensure
deployment readiness.
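
As a minimal sketch of why preprocessing belongs inside a Pipeline, the example below cross-validates a scaler and a classifier as one estimator. The data is synthetic, and `LogisticRegression` is only a stand-in assumption, not the model this project uses.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical); the real features come from Phase 4.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

# The scaler is a pipeline step, so each CV fold refits it on that fold's
# training portion only; the held-out portion never influences the scaling.
model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])

scores = cross_val_score(model, X_demo, y_demo, cv=5)
print(f"Mean CV accuracy: {scores.mean():.3f}")
```

Scaling the full dataset before splitting would instead let held-out rows influence the mean and variance used for training, which is the leakage the Pipeline avoids.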

## Scope of This Notebook

This notebook focuses on preparing the modeling infrastructure only.
Model training and evaluation are intentionally separated into dedicated
notebooks for classification and regression.

No models are trained or evaluated in this notebook.


## Step 1: Load Final Feature Dataset

In this step, we load the final, frozen feature dataset created at the end
of Phase 4. This dataset contains raw numerical and categorical features
along with target variables and will not be modified further in this phase.

All preprocessing, training, and evaluation steps in subsequent notebooks
will rely on this dataset as the single source of truth.


In [None]:
# Core libraries
import pandas as pd
import numpy as np

# Display settings
pd.set_option('display.max_columns', None)

# Load final feature dataset
df = pd.read_csv("../data/processed/features_final.csv")

print("Final dataset loaded successfully.")
print(f"Shape: rows={df.shape[0]}, columns={df.shape[1]}")

df.head(2)

Final dataset loaded successfully.
Shape: rows=3228, columns=13


,budget,original_language,popularity,runtime,vote_average,vote_count,release_year,release_month,release_day,release_season,director,hit,primary_genre
0,237000000,en,150.437577,162.0,7.2,11800,2009,12,10,Winter,James Cameron,1,Action
1,300000000,en,139.082615,169.0,6.9,4500,2007,5,19,Spring,Gore Verbinski,1,Adventure


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3228 entries, 0 to 3227
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   budget             3228 non-null   int64  
 1   original_language  3228 non-null   object 
 2   popularity         3228 non-null   float64
 3   runtime            3228 non-null   float64
 4   vote_average       3228 non-null   float64
 5   vote_count         3228 non-null   int64  
 6   release_year       3228 non-null   int64  
 7   release_month      3228 non-null   int64  
 8   release_day        3228 non-null   int64  
 9   release_season     3228 non-null   object 
 10  director           3228 non-null   object 
 11  hit                3228 non-null   int64  
 12  primary_genre      3228 non-null   object 
dtypes: float64(3), int64(6), object(4)
memory usage: 328.0+ KB
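
Because later notebooks treat this file as the single source of truth, a lightweight schema check can catch an accidental change early. The sketch below is one possible approach: `validate_frame` is a hypothetical helper, the column lists are copied from the `df.info()` output above, and the sample rows are the two rows shown by `df.head(2)`.

```python
import pandas as pd

# Column names as reported by df.info() above.
EXPECTED_COLUMNS = [
    "budget", "original_language", "popularity", "runtime", "vote_average",
    "vote_count", "release_year", "release_month", "release_day",
    "release_season", "director", "hit", "primary_genre",
]
NUMERIC_COLUMNS = [
    "budget", "popularity", "runtime", "vote_average", "vote_count",
    "release_year", "release_month", "release_day", "hit",
]


def validate_frame(frame: pd.DataFrame) -> list:
    """Return a list of human-readable problems; an empty list means the frame passed."""
    problems = []
    missing = sorted(set(EXPECTED_COLUMNS) - set(frame.columns))
    if missing:
        problems.append(f"missing columns: {missing}")
    for col in NUMERIC_COLUMNS:
        if col in frame.columns and not pd.api.types.is_numeric_dtype(frame[col]):
            problems.append(f"{col} is not numeric")
    null_counts = frame.isna().sum()
    with_nulls = null_counts[null_counts > 0]
    if not with_nulls.empty:
        problems.append(f"null values in: {with_nulls.index.tolist()}")
    return problems


# The two rows displayed by df.head(2) above.
sample = pd.DataFrame({
    "budget": [237000000, 300000000],
    "original_language": ["en", "en"],
    "popularity": [150.437577, 139.082615],
    "runtime": [162.0, 169.0],
    "vote_average": [7.2, 6.9],
    "vote_count": [11800, 4500],
    "release_year": [2009, 2007],
    "release_month": [12, 5],
    "release_day": [10, 19],
    "release_season": ["Winter", "Spring"],
    "director": ["James Cameron", "Gore Verbinski"],
    "hit": [1, 1],
    "primary_genre": ["Action", "Adventure"],
})

print(validate_frame(sample))
print(validate_frame(sample.drop(columns=["hit"])))
```

In the notebook itself, the same helper could be called on `df` right after loading, raising an error if the returned list is not empty.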


## Step 2: Feature and Target Separation

In this step, input features and target variables are explicitly separated.
This ensures clear task definitions for classification and regression and
prevents accidental target leakage during preprocessing and modeling.


In [3]:
# Classification target: Hit / Flop
y_classification = df['hit']

# Regression target: IMDb rating
y_regression = df['vote_average']

# Feature matrix (remove targets)
X = df.drop(columns=['hit', 'vote_average'])

print("Feature and target separation completed.")
print(f"X shape: {X.shape}")
print(f"y_classification shape: {y_classification.shape}")
print(f"y_regression shape: {y_regression.shape}")

X.head()

Feature and target separation completed.
X shape: (3228, 11)
y_classification shape: (3228,)
y_regression shape: (3228,)


,budget,original_language,popularity,runtime,vote_count,release_year,release_month,release_day,release_season,director,primary_genre
0,237000000,en,150.437577,162.0,11800,2009,12,10,Winter,James Cameron,Action
1,300000000,en,139.082615,169.0,4500,2007,5,19,Spring,Gore Verbinski,Adventure
2,245000000,en,107.376788,148.0,4466,2015,10,26,Fall,Sam Mendes,Action
3,250000000,en,112.31295,165.0,9106,2012,7,16,Summer,Christopher Nolan,Action
4,260000000,en,43.926995,132.0,2124,2012,3,7,Spring,Andrew Stanton,Action
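
The separation above can be backed by an explicit guard, so that a later refactor cannot silently reintroduce a target column into the features. This is only a sketch on a toy frame with illustrative values; it also prints the class proportions of `hit`, which is worth knowing before the split.

```python
import pandas as pd

TARGET_COLUMNS = ["hit", "vote_average"]

# Toy frame with the same target columns as the real dataset (values are illustrative).
toy = pd.DataFrame({
    "budget": [237_000_000, 300_000_000, 1_500_000, 8_000_000],
    "vote_average": [7.2, 6.9, 5.1, 6.0],
    "hit": [1, 1, 0, 1],
})

X_toy = toy.drop(columns=TARGET_COLUMNS)

# Guard: fail fast if a target column is still present among the features.
leaked = sorted(set(TARGET_COLUMNS) & set(X_toy.columns))
if leaked:
    raise ValueError(f"Target columns found in feature matrix: {leaked}")

# Class balance of the classification target, as proportions.
class_share = toy["hit"].value_counts(normalize=True)
print(class_share)
```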


## Step 3: Identify Numerical and Categorical Features

In this step, numerical and categorical input features are identified.
This separation is required to apply appropriate preprocessing steps
using a ColumnTransformer in later modeling notebooks.


In [None]:
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X.select_dtypes(include=['object', 'category']).columns.tolist()

print("Numerical features:")
print(numerical_features)

print("\nCategorical features:")
print(categorical_features)

Numerical features:
['budget', 'popularity', 'runtime', 'vote_count', 'release_year', 'release_month', 'release_day']

Categorical features:
['original_language', 'release_season', 'director', 'primary_genre']
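
Since the categorical features will be one-hot encoded, their cardinality decides how many columns the encoder produces. The sketch below shows one way to inspect it on a toy frame; the values are illustrative, and whether `director` is high-cardinality in the real data is an assumption that should be checked on `X` itself.

```python
import pandas as pd

# Toy categorical frame (illustrative values, not the real dataset).
toy = pd.DataFrame({
    "original_language": ["en", "en", "en", "en", "en", "fr"],
    "release_season": ["Winter", "Spring", "Fall", "Summer", "Spring", "Summer"],
    "director": [
        "James Cameron", "Gore Verbinski", "Sam Mendes",
        "Christopher Nolan", "Andrew Stanton", "Christopher Nolan",
    ],
    "primary_genre": ["Action", "Adventure", "Action", "Action", "Action", "Drama"],
})

# Number of distinct categories per feature, i.e. roughly the one-hot width.
cardinality = toy.nunique().sort_values(ascending=False)

# Share of rows whose category occurs exactly once; such categories are
# likely to be unseen when the model meets new data.
singleton_share = {
    col: float((toy[col].map(toy[col].value_counts()) == 1).mean())
    for col in toy.columns
}

print(cardinality)
print(singleton_share)
```

A feature with many singleton categories tends to yield wide, sparse encodings, which is a reason to configure the encoder to tolerate unseen categories.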


## Step 4: Train–Test Split

The dataset is split into training (80%) and testing (20%) sets before any
preprocessing or modeling steps. This ensures that all transformations are
learned exclusively from the training data and prevents data leakage. The
split is stratified on the `hit` label so that both sets keep the same
hit/flop ratio.

Separate target variables are maintained for classification and regression
tasks.


In [5]:
from sklearn.model_selection import train_test_split

# Split features and both targets in one call so every task shares the same rows
X_train, X_test, y_class_train, y_class_test, y_reg_train, y_reg_test = train_test_split(
    X, y_classification, y_regression,
    test_size=0.2, random_state=42, stratify=y_classification,
)
print("Train-test split completed.")
print(f"X_train shape: {X_train.shape}, X_test shape: {X_test.shape}")
print(f"y_class_train shape: {y_class_train.shape}, y_class_test shape: {y_class_test.shape}")
print(f"y_reg_train shape: {y_reg_train.shape}, y_reg_test shape: {y_reg_test.shape}")


Train-test split completed.
X_train shape: (2582, 11), X_test shape: (646, 11)
y_class_train shape: (2582,), y_class_test shape: (646,)
y_reg_train shape: (2582,), y_reg_test shape: (646,)


Note: A single train–test split is applied to features and both target
variables simultaneously to ensure consistent sample separation across
classification and regression tasks.
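
The two properties claimed for the split, aligned rows across targets and a preserved class ratio, can be verified directly. The sketch below does so on synthetic data with the same `train_test_split` arguments; the `*_demo` names and the 30% positive rate are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for X, y_classification and y_regression.
rng = np.random.default_rng(0)
n = 1000
X_demo = pd.DataFrame({"budget": rng.integers(1_000_000, 300_000_000, size=n)})
y_class_demo = pd.Series((rng.random(n) < 0.3).astype(int), name="hit")
y_reg_demo = pd.Series(rng.uniform(0, 10, size=n), name="vote_average")

X_tr, X_te, yc_tr, yc_te, yr_tr, yr_te = train_test_split(
    X_demo, y_class_demo, y_reg_demo,
    test_size=0.2, random_state=42, stratify=y_class_demo,
)

# The same rows must appear in the features and in both targets.
same_rows = (
    X_tr.index.equals(yc_tr.index)
    and X_tr.index.equals(yr_tr.index)
    and X_te.index.equals(yc_te.index)
    and X_te.index.equals(yr_te.index)
)

# Stratification should keep the positive rate nearly identical in both sets.
rate_gap = abs(yc_tr.mean() - yc_te.mean())

print(f"Rows aligned: {same_rows}")
print(f"Hit-rate gap between train and test: {rate_gap:.4f}")
```

Note that stratification applies to the classification label only; the regression target simply follows the same row assignment.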


## Step 5: Preprocessing Pipeline

In this step, a preprocessing pipeline is defined using a ColumnTransformer.
Numerical features are scaled and categorical features are encoded inside
the pipeline to ensure that all transformations are learned exclusively
from the training data.

This preprocessing setup will be reused across both classification and
regression models to maintain consistency and prevent data leakage.


In [7]:
# Skewed numerical features: log1p-transformed, then standardized
log_features = ['budget', 'popularity', 'vote_count']

# Remaining numerical features: standardized only
normal_features = [
    col for col in numerical_features if col not in log_features
]
print("Log-scaled numerical features:")
print(log_features)
print("\nNormally scaled numerical features:")
print(normal_features)

Log-scaled numerical features:
['budget', 'popularity', 'vote_count']

Normally scaled numerical features:
['runtime', 'release_year', 'release_month', 'release_day']
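
The choice of which features to log-transform can be supported by comparing skewness before and after `log1p`. The sketch below uses synthetic data as an assumption: a log-normal column stands in for a heavy-tailed feature such as `budget`, and a normal column for a roughly symmetric one such as `runtime`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins (hypothetical distributions, not the real columns).
rng = np.random.default_rng(7)
demo = pd.DataFrame({
    "budget": rng.lognormal(mean=17, sigma=1.2, size=2000),
    "runtime": rng.normal(loc=110, scale=20, size=2000),
})

# Sample skewness of the raw columns.
skew_raw = demo.skew()

# Sample skewness of the heavy-tailed column after log1p.
skew_log = np.log1p(demo[["budget"]]).skew()

print("Raw skewness:")
print(skew_raw)
print("\nSkewness after log1p:")
print(skew_log)
```

Running `X_train[numerical_features].skew()` in the notebook would give the corresponding figures for the real features, computed on training data only.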


In [8]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, FunctionTransformer

# Log + scale pipeline
log_numeric_transformer = Pipeline(
    steps=[
        ('log', FunctionTransformer(np.log1p, validate=False)),
        ('scaler', StandardScaler())
    ]
)

# Scale-only pipeline
normal_numeric_transformer = Pipeline(
    steps=[
        ('scaler', StandardScaler())
    ]
)

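# Categorical pipeline
# No `drop` argument: with handle_unknown='ignore', an unseen category is
# encoded as all zeros, which drop='first' would make indistinguishable
# from the dropped reference category.
categorical_transformer = Pipeline(
    steps=[
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ]
)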


In [9]:
preprocessor = ColumnTransformer(
    transformers=[
        ("log_num", log_numeric_transformer, log_features),
        ("norm_num", normal_numeric_transformer, normal_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)
print("Preprocessing pipeline created successfully.")

Preprocessing pipeline created successfully.


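Numerical features were divided into skewed and non-skewed groups.
Skewed features (`budget`, `popularity`, `vote_count`) are passed through
`log1p` to reduce skew and stabilize variance before standardization,
while the remaining numerical features are standardized directly. Because
both steps live inside the pipeline, the scaling statistics are learned
from the training data only, which prevents data leakage.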
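
To illustrate how such a preprocessor behaves once fitted, the sketch below builds a reduced version on toy data, fits it on a training frame, and transforms a test row containing a genre never seen during fitting. The frames are illustrative, and the encoder here uses `handle_unknown="ignore"` without a `drop` argument, so an unseen category becomes an all-zero block. `sparse_threshold=0` is set only to force a dense array for easy inspection.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Toy training and test frames (illustrative values).
train = pd.DataFrame({
    "budget": [237_000_000, 300_000_000, 245_000_000, 250_000_000],
    "runtime": [162.0, 169.0, 148.0, 165.0],
    "primary_genre": ["Action", "Adventure", "Action", "Action"],
})
test = pd.DataFrame({
    "budget": [260_000_000],
    "runtime": [132.0],
    "primary_genre": ["Western"],  # never seen during fit
})

log_then_scale = Pipeline(steps=[
    ("log", FunctionTransformer(np.log1p)),
    ("scaler", StandardScaler()),
])

demo_preprocessor = ColumnTransformer(
    transformers=[
        ("log_num", log_then_scale, ["budget"]),
        ("norm_num", StandardScaler(), ["runtime"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["primary_genre"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Statistics and categories are learned from the training frame only.
train_matrix = demo_preprocessor.fit_transform(train)
# The test frame is transformed with those learned values.
test_matrix = demo_preprocessor.transform(test)

print(train_matrix.shape, test_matrix.shape)
print(test_matrix)
```

The output columns follow the order of the `transformers` list: the log-scaled column, the scaled column, then one column per genre learned from the training frame.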


## Phase 5 Setup Completion Summary

In this notebook, the modeling infrastructure was prepared without
training or evaluating any models.

The following steps were completed:
- Loading the finalized, leakage-free feature dataset
- Separating features and target variables
- Identifying numerical and categorical feature types
- Performing a train–test split prior to preprocessing
- Defining a preprocessing pipeline using ColumnTransformer, including:
  - Log transformation for skewed numerical features
  - Feature scaling for numerical variables
  - One-hot encoding for categorical variables

No models were trained or evaluated in this notebook.
All preprocessing and modeling steps will be executed inside dedicated
classification and regression notebooks using the pipelines defined here.


In [10]:
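import os
import joblib

# Make sure the output directories exist before writing
os.makedirs("../data/processed", exist_ok=True)
os.makedirs("../models", exist_ok=True)

# Save train-test splits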
joblib.dump(X_train, "../data/processed/X_train.joblib")
joblib.dump(X_test, "../data/processed/X_test.joblib")

joblib.dump(y_class_train, "../data/processed/y_class_train.joblib")
joblib.dump(y_class_test, "../data/processed/y_class_test.joblib")

joblib.dump(y_reg_train, "../data/processed/y_reg_train.joblib")
joblib.dump(y_reg_test, "../data/processed/y_reg_test.joblib")

print("Train-test splits saved successfully.")

# Save preprocessing pipeline blueprint
joblib.dump(preprocessor, "../models/preprocessor.joblib")

print("Preprocessing pipeline saved successfully.")


Train-test splits saved successfully.
Preprocessing pipeline saved successfully.


## Saved Modeling Artifacts

To ensure reproducibility and consistent modeling across notebooks,
the following artifacts were saved:

- Raw training and testing splits for both classification and regression
- An unfitted preprocessing pipeline to be reused during model training
  and deployment

No data transformations were applied at this stage. All preprocessing
will be fitted exclusively on training data inside the model-specific
notebooks.
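
A downstream notebook might consume the saved blueprint roughly as sketched below: load it, clone it so the stored object stays unfitted, and chain it with an estimator. The toy frame, the temporary directory, and `LogisticRegression` are assumptions for illustration; the real notebooks would load `../models/preprocessor.joblib` and the saved splits instead.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-ins for the saved training split (illustrative values).
X_train_demo = pd.DataFrame({
    "runtime": [162.0, 169.0, 148.0, 165.0, 132.0, 95.0],
    "primary_genre": ["Action", "Adventure", "Action", "Action", "Action", "Drama"],
})
y_train_demo = pd.Series([1, 1, 1, 1, 0, 0], name="hit")

# Reduced, unfitted blueprint standing in for the saved preprocessor.
blueprint = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["runtime"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["primary_genre"]),
])

# Round-trip through joblib, as the save cell above does with real paths.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "preprocessor.joblib"
    joblib.dump(blueprint, path)
    loaded = joblib.load(path)

# clone() gives each model its own copy, so fitting never mutates `loaded`.
clf_pipeline = Pipeline(steps=[
    ("preprocess", clone(loaded)),
    ("clf", LogisticRegression(max_iter=1000)),
])
clf_pipeline.fit(X_train_demo, y_train_demo)
predictions = clf_pipeline.predict(X_train_demo)

print(predictions)
```

Cloning per task keeps the classification and regression pipelines independent while both start from the same preprocessing definition.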
