# Movie Success Predictor  
## Phase 4: Feature Engineering

### Objective

The objective of this phase is to transform the cleaned dataset into a
model-ready format for predictive modeling.

This phase focuses on:
- Creating target variables for classification and regression
- Engineering meaningful numerical features
- Preparing categorical variables for safe encoding in the modeling phase
- Preventing data leakage
- Preparing final datasets for model training

No models are trained in this phase.

## Problem Context

This project addresses two predictive tasks:

1. **Hit / Flop Classification**
   - A movie is classified as a *Hit* if its Return on Investment (ROI)
     is 100% or higher, otherwise as a *Flop*.

2. **IMDb Rating Prediction**
   - Predicting audience ratings (`vote_average`) on a 0–10 scale.

Feature engineering in this phase is designed to support both tasks
while maintaining interpretability and deployment readiness.


### Step 1: Load Processed Dataset

The cleaned dataset produced in Phase 2 is loaded.
This dataset serves as the foundation for all feature engineering steps.


In [14]:
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)
df=pd.read_csv("../data/processed/processed_movies.csv")
print(f"Dataframe loaded successfully with rows: {df.shape[0]}, columns: {df.shape[1]}")

Dataframe loaded successfully with rows: 3228, columns: 15


### Step 2: Return on Investment (ROI)

Return on Investment (ROI) is calculated to quantify a movie’s financial
performance relative to its production budget. This feature will be used
to define the hit/flop classification target.


In [15]:
# ROI in percent: profit relative to budget, rounded to two decimals
df["roi"] = ((df["revenue"] - df["budget"]) / df["budget"] * 100).round(2)
print("ROI feature created successfully.")
print(df[["budget","revenue","roi"]].head())

ROI feature created successfully.
      budget     revenue      roi
0  237000000  2787965087  1076.36
1  300000000   961000000   220.33
2  245000000   880674609   259.46
3  250000000  1084939099   333.98
4  260000000   284139100     9.28
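
The ROI values above can be reproduced by hand. The following is a minimal sketch, not the notebook's own code: `compute_roi` is a hypothetical helper, and the zero-budget row is invented to show one way of guarding the division, on the assumption that such rows could exist in other versions of the data.

```python
import numpy as np
import pandas as pd


def compute_roi(budget: pd.Series, revenue: pd.Series) -> pd.Series:
    """Return ROI in percent; rows with a non-positive budget become NaN."""
    # Mask non-positive budgets so the division cannot produce inf.
    safe_budget = budget.where(budget > 0)
    return (revenue - budget) / safe_budget * 100


# First and fifth rows of the output above, plus one hypothetical
# zero-budget row (not from the dataset).
toy = pd.DataFrame({
    "budget": [237_000_000, 260_000_000, 0],
    "revenue": [2_787_965_087, 284_139_100, 1_000_000],
})
toy["roi"] = compute_roi(toy["budget"], toy["revenue"]).round(2)
print(toy)
```

Returning `NaN` rather than `inf` keeps such rows easy to find with `isna()` and lets a later step decide whether to drop them.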


### Step 3: Hit / Flop Target Creation

Movies with an ROI of 100% or higher are labeled as Hits (1),
while all others are labeled as Flops (0).
This binary target will be used for classification modeling.


In [16]:
# ROI of 100% or more (revenue at least double the budget) counts as a hit
df["hit"] = (df["roi"] >= 100).astype(int)
print("Hit/Flop classification target created successfully.")
print(df[["roi","hit"]].head())

Hit/Flop classification target created successfully.
       roi  hit
0  1076.36    1
1   220.33    1
2   259.46    1
3   333.98    1
4     9.28    0


### Hit / Flop Target Definition

Return on Investment (ROI) is used to define commercial success.

A movie is classified as a **Hit** if its revenue is at least double
its production budget, corresponding to an ROI of **100% or higher**.
Movies with ROI below 100% are classified as **Flops**.

This definition reflects a clear and intuitive business rule:
a movie that recovers its budget and generates equivalent profit
is considered financially successful.
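
The rule can also be written directly in terms of revenue and budget. The sketch below uses a hypothetical helper, `label_hit`, and invented toy values. It is an alternative formulation, not the notebook's method. The notebook thresholds the ROI after rounding it to two decimals, so a raw ROI of 99.996% becomes 100.00 and is labeled a Hit. Comparing revenue against twice the budget avoids that edge case.

```python
import pandas as pd


def label_hit(budget: pd.Series, revenue: pd.Series,
              roi_threshold: float = 100.0) -> pd.Series:
    """Return 1 where ROI >= roi_threshold (in percent), else 0."""
    # For budget > 0, ROI >= t is equivalent to
    # revenue >= budget * (1 + t / 100).
    return (revenue >= budget * (1 + roi_threshold / 100)).astype(int)


# Invented values: exactly double, just under double, and well under.
toy = pd.DataFrame({
    "budget": [100_000, 100_000, 100_000],
    "revenue": [200_000, 199_996, 150_000],
})
toy["hit"] = label_hit(toy["budget"], toy["revenue"])

# For comparison: the rounded ROI used in the notebook cell above.
toy["roi_rounded"] = (
    (toy["revenue"] - toy["budget"]) / toy["budget"] * 100
).round(2)
print(toy)
```

Whether the rounded or unrounded rule is preferable is a design choice. The difference only affects movies whose ROI falls within 0.005 percentage points of the threshold.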


### Step 4: Primary Genre Feature

Movies often belong to multiple genres. To maintain interpretability and
deployment simplicity, a primary genre is extracted using the first listed
genre. This feature captures the dominant category of each movie and will
be used as a categorical input for modeling.


In [17]:
import ast

# Convert genres_list from string to list (CSV safety)
df['genres_list'] = df['genres_list'].apply(ast.literal_eval)
print("Genres list converted successfully.")
# Use the first listed genre; fall back to "Unknown" for empty lists
df["primary_genre"] = df["genres_list"].apply(lambda x: x[0] if len(x) > 0 else "Unknown")
print("Primary genre extracted successfully.")
print(df[["genres_list","primary_genre"]].head())

Genres list converted successfully.
Primary genre extracted successfully.
                                     genres_list primary_genre
0  [Action, Adventure, Fantasy, Science Fiction]        Action
1                   [Adventure, Fantasy, Action]     Adventure
2                     [Action, Adventure, Crime]        Action
3               [Action, Crime, Drama, Thriller]        Action
4           [Action, Adventure, Science Fiction]        Action
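
The cell above assumes every `genres_list` value is a well-formed stringified list. As a hedged sketch, the parsing step could be made tolerant of missing or malformed values. `parse_genres` and `primary_genre` are hypothetical helpers, and the toy values are invented for illustration.

```python
import ast

import pandas as pd


def parse_genres(value) -> list:
    """Turn a stringified list such as "['Action', 'Drama']" into a list."""
    if isinstance(value, list):
        return value  # already parsed
    if not isinstance(value, str):
        return []  # NaN or any other non-string value
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []  # malformed string
    return parsed if isinstance(parsed, list) else []


def primary_genre(genres: list) -> str:
    """Return the first genre, or "Unknown" when the list is empty."""
    return genres[0] if genres else "Unknown"


# Invented inputs: a valid list, an empty list, a missing value,
# and a malformed string.
raw = pd.Series(["['Action', 'Adventure']", "[]", float("nan"), "[Action, Drama"])
result = raw.apply(parse_genres).apply(primary_genre)
print(result.tolist())
```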


In [18]:
df.shape

(3228, 18)

## Feature Engineering Rationale

At this stage, the dataset contains both original attributes and newly
engineered targets. The purpose of feature engineering is to prepare
a model-ready dataset that reflects real-world prediction conditions.

Key principles guiding the next steps:

- Target variables must not appear as input features
- Features should represent information available before a movie's release
- Any variables used to derive the target must be removed afterward
- The final dataset should support both classification and regression tasks

The remaining step in this phase removes leakage-prone columns.
Transforming skewed numerical features and encoding categorical variables
are deferred to the modeling phase, where they can be fitted on the training
split only, so that models learn genuine patterns rather than relying on
outcome information.
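
The first and third principles can be enforced in code when features and targets are separated. The following is a minimal sketch: `select_features` is a hypothetical helper, the column sets follow the names used in this notebook, and the toy rows are invented.

```python
import pandas as pd

# Column names follow this notebook; adjust them if the schema differs.
LEAKAGE_COLS = {"revenue", "roi"}      # used to derive the `hit` target
TARGET_COLS = {"hit", "vote_average"}  # classification and regression targets


def select_features(df: pd.DataFrame, target: str):
    """Return (X, y) for one task, with every target and leakage column removed from X."""
    if target not in TARGET_COLS:
        raise ValueError(f"Unknown target: {target!r}")
    forbidden = LEAKAGE_COLS | TARGET_COLS
    X = df.drop(columns=[c for c in df.columns if c in forbidden])
    return X, df[target]


# Invented toy rows for illustration only.
toy = pd.DataFrame({
    "budget": [100, 200],
    "revenue": [250, 210],
    "roi": [150.0, 5.0],
    "hit": [1, 0],
    "vote_average": [7.1, 5.4],
    "primary_genre": ["Action", "Drama"],
})
X, y = select_features(toy, "hit")
print(list(X.columns))
```

Dropping both targets from `X` regardless of the task is an assumption made here. It keeps the classification model from seeing `vote_average` and the regression model from seeing `hit`.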


### Step 5: Leakage Prevention and Feature Selection

To ensure realistic and deployable models, columns that directly or indirectly
reveal the target variables are removed. Revenue and ROI are dropped after
target creation to prevent data leakage. The identifier columns `id` and
`title` and the raw `genres_list` (now summarized by `primary_genre`) are
also dropped.

The resulting dataset keeps the remaining input features along with the two
target variables, `hit` and `vote_average`. Note that `popularity` and
`vote_count` are still present at this stage; because they may reflect
post-release activity, they should be reviewed before being used as inputs.


In [19]:
# Step 5: Drop leakage and non-predictive columns
drop_cols = [
    'revenue',
    'roi',
    'id',
    'title',
    'genres_list'
]

df = df.drop(columns=drop_cols)

print("Leakage and irrelevant columns removed.")
print(f"Final dataframe shape: rows={df.shape[0]}, columns={df.shape[1]}")

Leakage and irrelevant columns removed.
Final dataframe shape: rows=3228, columns=13


In [20]:
df.head(1)

|   | budget | original_language | popularity | runtime | vote_average | vote_count | release_year | release_month | release_day | release_season | director | hit | primary_genre |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 237000000 | en | 150.437577 | 162.0 | 7.2 | 11800 | 2009 | 12 | 10 | Winter | James Cameron | 1 | Action |


## Saved Output

The final output of Phase 4 is a cleaned, leakage-free dataset containing
raw numerical and categorical features along with target variables.
This dataset is saved as `features_final.csv` and will be used as input
for model training and evaluation in Phase 5.


In [21]:
# Save final feature dataset
final_path = "../data/processed/features_final.csv"
df.to_csv(final_path, index=False)

print(f"Final feature dataset saved at: {final_path}")

Final feature dataset saved at: ../data/processed/features_final.csv


## Phase 4 Completion Summary

In this phase, target variables were defined and a clean feature set was
prepared for modeling. Revenue and ROI, which directly reveal the financial
outcome, were removed, and no transformations were applied prior to the
train–test split.

Feature scaling and categorical encoding will be performed in the
modeling phase using scikit-learn Pipelines to ensure reproducibility
and prevent data leakage.
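
As a hedged sketch of what that preprocessing could look like, the example below fits a scikit-learn `ColumnTransformer` on the training split only and then applies it to the test split. The choice of columns, the toy values, and the split parameters are assumptions for illustration. No model is trained here.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed column groups; the real selection belongs to the modeling phase.
numeric_cols = ["budget", "runtime"]
categorical_cols = ["primary_genre", "release_season"]

# Invented toy rows for illustration only.
toy = pd.DataFrame({
    "budget": [237e6, 300e6, 245e6, 250e6, 260e6, 1e6, 5e6, 20e6],
    "runtime": [162.0, 169.0, 148.0, 165.0, 132.0, 95.0, 101.0, 110.0],
    "primary_genre": ["Action", "Adventure", "Action", "Action",
                      "Action", "Drama", "Comedy", "Drama"],
    "release_season": ["Winter", "Spring", "Fall", "Summer",
                       "Spring", "Fall", "Summer", "Winter"],
})

X_train, X_test = train_test_split(toy, test_size=0.25, random_state=42)

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    # Categories unseen during fit are encoded as all zeros instead of raising.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Fit on the training split only, then reuse the fitted statistics on the test split.
train_matrix = preprocess.fit_transform(X_train)
test_matrix = preprocess.transform(X_test)
print(train_matrix.shape, test_matrix.shape)
```

Fitting the scaler and encoder on the training rows alone means the test rows never influence the learned means, variances, or category lists.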
