# Notebook 3: Feature Engineering & Deep EDA

## Purpose
This notebook creates derived features from the cleaned data and performs deep exploratory data analysis to understand relationships between features and the target variable (box office revenue).

## Objectives
1. Engineer Tier 1 features (cast/crew metrics, temporal, competition)
2. Engineer Tier 2 features (categories, franchise indicators)
3. Engineer Tier 3 features (3D/IMAX, recent hits, awards)
4. Calculate star power indices and director historical averages
5. Perform bivariate analysis (features vs revenue)
6. Create correlation matrix and identify multicollinearity
7. Answer key business questions from the project plan
8. Validate that engineered features make intuitive sense
9. Save feature-engineered dataset for modeling

## Key Questions to Answer
- What is the typical budget-to-revenue ratio?
- Which genre has the highest average revenue? Best ROI?
- What percentage of movies are profitable?
- How much does release timing matter?
- Do sequels reliably outperform originals?
- What's the relationship between runtime and revenue?
- Which features are most correlated with revenue?

## Outputs
- `data/processed/movies_features.csv`
- Correlation heatmap
- Key visualizations showing feature-target relationships

## Notes
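- Avoid data leakage: use only information available before a movie's release
- For historical averages (e.g., director or lead-actor average gross), use only films released before the current movie, which also excludes the current movie itself
- Document any assumptions made during feature engineering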

---
## Setup and Imports

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 8)

# Display settings
pd.set_option('display.max_columns', None)

---
## Load Cleaned Data

In [None]:
# Load cleaned dataset from previous notebook
# df = pd.read_csv('data/processed/movies_cleaned.csv')

---
## Tier 1 Feature Engineering

### Temporal Features

In [None]:
# Release_month, Release_quarter, Is_summer_release, Is_holiday_release, Day_of_week
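
A minimal sketch of the temporal features listed above, run on toy data. The column name `release_date`, the helper `add_temporal_features`, and the definitions of "summer" (May through August) and "holiday" (November and December) are assumptions for illustration, not decisions taken from the project plan.

```python
import pandas as pd

# Toy data; `release_date` is an assumed column name.
df_demo = pd.DataFrame({
    "title": ["A", "B", "C"],
    "release_date": pd.to_datetime(["2015-07-03", "2016-12-21", "2017-03-10"]),
})

def add_temporal_features(df, date_col="release_date"):
    """Return a copy of df with calendar features derived from the release date."""
    out = df.copy()
    dates = out[date_col]
    out["Release_month"] = dates.dt.month
    out["Release_quarter"] = dates.dt.quarter
    out["Day_of_week"] = dates.dt.dayofweek  # Monday=0 ... Sunday=6
    # Assumed season definitions; adjust to the project's own.
    out["Is_summer_release"] = out["Release_month"].between(5, 8).astype(int)
    out["Is_holiday_release"] = out["Release_month"].isin([11, 12]).astype(int)
    return out

temporal_demo = add_temporal_features(df_demo)
```

All of these depend only on the scheduled release date, so they are known before release and carry no leakage risk.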

### Cast and Crew Features

In [None]:
# Director_avg_gross, Lead_actor_avg_gross, Num_a_list_actors, Cast_total_experience
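
One possible way to compute a leakage-safe historical average such as `Director_avg_gross`: sort by release date, then average only the films that came earlier for the same person. The column names (`director`, `release_date`, `revenue`) and the helper `prior_avg_gross` are hypothetical. First films get `NaN`, which has to be imputed or flagged separately.

```python
import pandas as pd

# Toy data; column names are assumptions.
df_demo = pd.DataFrame({
    "director": ["X", "X", "X", "Y"],
    "release_date": pd.to_datetime(
        ["2010-01-01", "2012-01-01", "2014-01-01", "2011-06-01"]
    ),
    "revenue": [100.0, 300.0, 200.0, 50.0],
})

def prior_avg_gross(df, person_col, date_col="release_date", target_col="revenue"):
    """Mean revenue of each person's earlier films, excluding the current one."""
    ordered = df.sort_values(date_col)
    # shift() drops the current film; expanding().mean() averages everything before it.
    prior = ordered.groupby(person_col)[target_col].transform(
        lambda s: s.shift().expanding().mean()
    )
    return prior.reindex(df.index)

df_demo["Director_avg_gross"] = prior_avg_gross(df_demo, "director")
```

The same helper could be reused for `Lead_actor_avg_gross` by passing a different `person_col`. Films by the same person that share a release date are ordered arbitrarily here, so one of them would see the other's revenue; that edge case would need explicit handling.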

### Competition Features

In [None]:
# Num_releases_same_weekend, Num_opening_same_month
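
A sketch of the two competition counts, assuming a `release_date` column and treating a "weekend" as the Friday-to-Thursday week that contains the release date (an assumption; other definitions are reasonable). Each count excludes the film itself.

```python
import pandas as pd

# Toy data; column names are assumptions.
df_demo = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "release_date": pd.to_datetime(
        ["2015-07-03", "2015-07-04", "2015-07-17", "2015-08-07"]
    ),
})

dates = df_demo["release_date"]
week_key = dates.dt.to_period("W-THU")  # weekly periods ending Thursday (Fri-Thu)
month_key = dates.dt.to_period("M")

# Group size minus one = number of *other* films in the same window.
df_demo["Num_releases_same_weekend"] = (
    df_demo.groupby(week_key)["title"].transform("size") - 1
)
df_demo["Num_opening_same_month"] = (
    df_demo.groupby(month_key)["title"].transform("size") - 1
)
```

These counts only reflect films present in the dataset, so they understate competition if the data covers a subset of all releases.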

---
## Tier 2 Feature Engineering

In [None]:
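# Budget_category, Genre_count, Is_franchise, Franchise_number,
# Years_since_last_installment, Runtime_category, Is_based_on_book,
# Is_based_on_comic, Director_genre_match, Release_month_avg_revenue
# Note: Release_month_avg_revenue is derived from the target; compute it only
# from films released before the current movie to avoid leakage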
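
A small sketch of three of the Tier 2 features (`Budget_category`, `Runtime_category`, `Genre_count`). The bin edges, the labels, and the pipe-delimited `genres` format are illustrative assumptions; the real thresholds should come from the project plan or the observed distributions.

```python
import pandas as pd

# Toy data; column names and the "|" genre delimiter are assumptions.
df_demo = pd.DataFrame({
    "budget": [2_000_000, 40_000_000, 150_000_000],
    "runtime": [85, 110, 165],
    "genres": ["Drama", "Action|Adventure", "Action|Sci-Fi|Thriller"],
})

# Assumed bin edges (USD and minutes); intervals are right-inclusive.
budget_bins = [0, 10e6, 50e6, 100e6, float("inf")]
budget_labels = ["low", "medium", "high", "blockbuster"]
runtime_bins = [0, 90, 120, 150, float("inf")]
runtime_labels = ["short", "standard", "long", "epic"]

df_demo["Budget_category"] = pd.cut(
    df_demo["budget"], bins=budget_bins, labels=budget_labels
)
df_demo["Runtime_category"] = pd.cut(
    df_demo["runtime"], bins=runtime_bins, labels=runtime_labels
)
# Number of genres attached to each film.
df_demo["Genre_count"] = df_demo["genres"].str.split("|").str.len()
```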

---
## Tier 3 Feature Engineering

In [None]:
# Is_3d, Is_imax, Lead_actor_recent_hit, Director_awards,
# Spoken_language_english, Production_countries_count
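
`Lead_actor_recent_hit` needs the same care as the historical averages: only earlier films may count. The sketch below flags a film when its lead actor had a strictly earlier film within a lookback window that grossed at least a threshold. The column names, the 3-year window, and the 100M threshold are all assumptions for illustration.

```python
import pandas as pd

# Toy data; column names are assumptions.
df_demo = pd.DataFrame({
    "film_id": [1, 2, 3, 4],
    "lead_actor": ["P", "P", "P", "Q"],
    "release_date": pd.to_datetime(
        ["2010-01-01", "2011-06-01", "2016-01-01", "2012-01-01"]
    ),
    "revenue": [200e6, 10e6, 50e6, 300e6],
})

HIT_THRESHOLD = 100e6                    # assumed definition of a "hit"
LOOKBACK = pd.Timedelta(days=3 * 365)    # assumed lookback window

# Pair every film with every film by the same lead actor.
pairs = df_demo.merge(df_demo, on="lead_actor", suffixes=("", "_prev"))
gap = pairs["release_date"] - pairs["release_date_prev"]

# Keep pairs where the other film is strictly earlier, recent enough, and a hit.
is_recent_hit = (
    (gap > pd.Timedelta(0))
    & (gap <= LOOKBACK)
    & (pairs["revenue_prev"] >= HIT_THRESHOLD)
)
hit_ids = pairs.loc[is_recent_hit, "film_id"].unique()
df_demo["Lead_actor_recent_hit"] = df_demo["film_id"].isin(hit_ids).astype(int)
```

The self-merge grows quadratically with the number of films per actor, which is usually acceptable for filmographies but worth keeping in mind.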

---
## Bivariate Analysis

### Budget vs Revenue

In [None]:
# Scatter plot with regression line
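
A sketch of the budget-versus-revenue scatter with a fitted line, using synthetic data so it runs on its own. Fitting on log10 scales is a choice made here because budgets and revenues are typically heavily right-skewed; it is not something the plan prescribes. The `budget` and `revenue` column names are assumptions.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic data for illustration only; not real box office figures.
rng = np.random.default_rng(42)
log_budget = rng.uniform(6.0, 8.5, size=200)
log_revenue = log_budget + 0.4 + rng.normal(0.0, 0.3, size=200)
df_demo = pd.DataFrame({"budget": 10 ** log_budget, "revenue": 10 ** log_revenue})

# Work on log10 scales; real data needs budget > 0 and revenue > 0 first.
x = np.log10(df_demo["budget"])
y = np.log10(df_demo["revenue"])
slope, intercept = np.polyfit(x, y, deg=1)  # ordinary least-squares line

fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(x, y, alpha=0.4, s=15)
x_line = np.linspace(x.min(), x.max(), 100)
ax.plot(x_line, slope * x_line + intercept, color="red",
        label=f"OLS fit, slope = {slope:.2f}")
ax.set_xlabel("log10(budget)")
ax.set_ylabel("log10(revenue)")
ax.set_title("Budget vs Revenue (log-log)")
ax.legend()
```

On a log-log plot, a slope near 1 means revenue scales roughly in proportion to budget; a slope below 1 would suggest diminishing returns.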

### Genre vs Average Revenue

In [None]:
# Bar chart showing average revenue by genre
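
A sketch of the per-genre summary behind the bar chart, which also covers the "best ROI" question. It assumes a pipe-delimited `genres` column and defines ROI as `(revenue - budget) / budget`; both are assumptions. A film with several genres counts once under each of them.

```python
import pandas as pd

# Toy data; values are made up for illustration.
df_demo = pd.DataFrame({
    "genres": ["Action|Adventure", "Drama", "Action", "Horror"],
    "budget": [100.0, 10.0, 50.0, 5.0],
    "revenue": [300.0, 15.0, 100.0, 40.0],
})

df_demo["roi"] = (df_demo["revenue"] - df_demo["budget"]) / df_demo["budget"]

# One row per (film, genre) pair.
long = df_demo.assign(genre=df_demo["genres"].str.split("|")).explode("genre")

genre_stats = long.groupby("genre").agg(
    n_films=("revenue", "size"),
    mean_revenue=("revenue", "mean"),
    median_roi=("roi", "median"),
)

top_revenue_genre = genre_stats["mean_revenue"].idxmax()
top_roi_genre = genre_stats["median_roi"].idxmax()
```

Reporting `n_films` next to the averages helps avoid over-reading genres with only a handful of titles; `genre_stats["mean_revenue"].sort_values().plot.barh()` would give the bar chart.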

### Release Timing Analysis

In [None]:
# Monthly revenue patterns, summer vs non-summer, holiday effects
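
A sketch of the monthly and seasonal summaries, again on toy data. Medians are used instead of means here because a few blockbusters can dominate a monthly average; that choice, the column names, and the May-to-August summer window are assumptions.

```python
import pandas as pd

# Toy data; values are made up for illustration.
df_demo = pd.DataFrame({
    "release_date": pd.to_datetime(
        ["2015-06-12", "2015-07-03", "2015-02-06", "2016-02-12", "2016-12-16"]
    ),
    "revenue": [500.0, 300.0, 50.0, 150.0, 400.0],
})

month = df_demo["release_date"].dt.month

# Film count and median revenue per calendar month; empty months show NaN.
monthly = (
    df_demo.groupby(month)["revenue"]
    .agg(["count", "median"])
    .reindex(range(1, 13))
)

# Summer vs the rest of the year.
season = month.between(5, 8).map({True: "summer", False: "other"})
season_median = df_demo.groupby(season)["revenue"].median()
```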

### Sequel vs Original

In [None]:
# Compare revenue distributions
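
To judge whether sequels "reliably" outperform originals, comparing quartiles is more informative than comparing a single average. The sketch below assumes the `Is_franchise` flag from Tier 2 and a `revenue` column; the numbers are toy values.

```python
import pandas as pd

# Toy data; values are made up for illustration.
df_demo = pd.DataFrame({
    "Is_franchise": [0, 0, 0, 1, 1, 1],
    "revenue": [10.0, 20.0, 30.0, 40.0, 100.0, 60.0],
})

# Quartiles of revenue for originals (0) and franchise films (1).
quartiles = (
    df_demo.groupby("Is_franchise")["revenue"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
    .rename(columns={0.25: "q25", 0.5: "median", 0.75: "q75"})
)
```

If the sequels' lower quartile sits above the originals' median, that supports "reliably"; heavily overlapping ranges would suggest the average gap is driven by a few outliers.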

---
## Correlation Analysis

In [None]:
# Correlation matrix heatmap
# Identify top 10 features most correlated with revenue
# Check for multicollinearity
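
A sketch of the three steps above on synthetic data: build the correlation matrix, rank features by absolute correlation with revenue, and list feature pairs that are nearly collinear. Spearman correlation and the 0.9 cutoff are assumptions chosen here for skewed monetary data; `budget_millions` is a deliberately redundant column added to trigger the check.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
n = 200
budget = rng.lognormal(mean=17.0, sigma=1.0, size=n)
df_demo = pd.DataFrame({
    "budget": budget,
    "budget_millions": budget / 1e6,  # redundant on purpose
    "runtime": rng.normal(110.0, 15.0, size=n),
})
df_demo["revenue"] = 2.5 * df_demo["budget"] * rng.lognormal(0.0, 0.5, size=n)

corr = df_demo.corr(method="spearman")

# Features ranked by absolute correlation with the target.
top_features = (
    corr["revenue"].drop("revenue").abs().sort_values(ascending=False).head(10)
)

# Multicollinearity: keep the upper triangle so each feature pair appears once.
features = corr.drop(index="revenue", columns="revenue")
upper = features.where(np.triu(np.ones(features.shape, dtype=bool), k=1))
pair_corr = upper.stack()
high_pairs = pair_corr[pair_corr.abs() > 0.9]
```

The `corr` frame can be passed to `sns.heatmap` for the heatmap output. Pairs in `high_pairs` are candidates for dropping or combining before fitting linear models.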

---
## Answer Key Business Questions

In [None]:
# Budget-to-revenue ratio
# Profitability percentage
# Genre performance
# Runtime impact
# Other insights
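
A sketch for the first two questions. "Profitable" is taken here to mean revenue greater than production budget, which ignores marketing costs and the exhibitors' share; that definition and the column names are assumptions that should be documented alongside the result.

```python
import pandas as pd

# Toy data; values are made up for illustration.
df_demo = pd.DataFrame({
    "budget": [10.0, 50.0, 100.0, 20.0],
    "revenue": [30.0, 40.0, 250.0, 20.0],
})

# Drop rows where either figure is missing or recorded as zero.
valid = df_demo[(df_demo["budget"] > 0) & (df_demo["revenue"] > 0)]

# Typical revenue-to-budget multiple; the median resists extreme outliers.
ratio = valid["revenue"] / valid["budget"]
median_ratio = ratio.median()

# Share of films whose revenue exceeds their budget, in percent.
pct_profitable = (valid["revenue"] > valid["budget"]).mean() * 100
```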

---
## Save Feature-Engineered Dataset

In [None]:
# Save final feature set
# df_features.to_csv('data/processed/movies_features.csv', index=False)

---
## Summary of Feature Engineering

In [None]:
# Document:
# - Number of features created
# - Most promising features based on correlation
# - Features with high multicollinearity
# - Key insights from deep EDA