# Notebook 02 - Preprocessing & Feature Engineering

**Objective:** Clean, filter, and build **4 processed datasets** per prediction target (profitability and vote average), each using a different combination of data sources.

---

**4 Datasets Produced:**

| Dataset | Sources | Key |
|---|---|---|
| `movies_processed_metadata.csv` | movies_metadata only | `metadata` |
| `movies_processed_meta_credits.csv` | movies_metadata + credits | `meta_credits` |
| `movies_processed_meta_keywords.csv` | movies_metadata + keywords | `meta_keywords` |
| `movies_processed.csv` | all three combined | `all` |

**Preprocessing:**
1. Load & parse JSON columns in all 3 raw datasets
2. Type conversions, drop corrupted rows (non-numeric IDs)
3. Remove duplicates from all 3 datasets
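4. Filter: Released status, then per target: budget > 0 and revenue > 0 (profitability) or vote_count > 0 (vote average); both also require runtime > 0 and a valid release date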

**Feature Engineering (applied to base metadata before merging):**
5. Binary: `is_collection`, `is_english`
6. Temporal: `release_year`, `release_month`
7. Financial: `profitable` target (1 if revenue exceeds budget); raw `revenue` and the intermediate `profit` are dropped to prevent leakage
8. Counts: `num_genres`, `num_production_companies`, etc.
9. `primary_genre` categorical

**Additional Features (for datasets with credits):**
10. `num_cast`, `num_crew`
11. Top post-2010 directors & actors: `has_top_director`, `num_top_actors`, `has_top_lead_actor`

**Additional Features (for datasets with keywords):**
12. `num_keywords`

In [48]:
# 1. Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
from ast import literal_eval
import os
import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 6)
plt.rcParams['font.size'] = 11

print('Libraries loaded')

Libraries loaded


In [49]:
# Load all three raw datasets
movies_df = pd.read_csv('../data/raw/movies_metadata.csv', low_memory=False)
credits_df = pd.read_csv('../data/raw/credits.csv')
keywords_df = pd.read_csv('../data/raw/keywords.csv')

print(f'movies_metadata : {movies_df.shape[0]:,} rows x {movies_df.shape[1]} cols')
print(f'credits         : {credits_df.shape[0]:,} rows x {credits_df.shape[1]} cols')
print(f'keywords        : {keywords_df.shape[0]:,} rows x {keywords_df.shape[1]} cols')

movies_metadata : 45,466 rows x 24 cols
credits         : 45,476 rows x 3 cols
keywords        : 46,419 rows x 2 cols


In [50]:
# JSON parser for stringified list-of-dicts columns
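def parse_json_column(val):
    """Return a list parsed from a stringified list of dicts, or [] if that fails."""
    if isinstance(val, list):        # already parsed (e.g. the cell was re-run)
        return val
    if not isinstance(val, str):     # NaN / None / other non-string values
        return []
    try:
        parsed = json.loads(val)
    except Exception:
        try:
            parsed = literal_eval(val)   # fallback for Python-literal strings
        except Exception:
            return []
    return parsed if isinstance(parsed, list) else []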

# Parse movies JSON columns
print('Parsing JSON columns...')
for col in ['genres', 'production_companies', 'production_countries', 'spoken_languages']:
    movies_df[col] = movies_df[col].apply(parse_json_column)

# Parse credits JSON columns
credits_df['cast_parsed'] = credits_df['cast'].apply(parse_json_column)
credits_df['crew_parsed'] = credits_df['crew'].apply(parse_json_column)

# Parse keywords
keywords_df['keywords_parsed'] = keywords_df['keywords'].apply(parse_json_column)

print('Done')

Parsing JSON columns...
Done
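
The parser above tries `json.loads` first and falls back to `literal_eval`. A minimal sketch of why the fallback can matter, assuming the raw columns store lists the way Python prints them (single quotes); the sample string below is hypothetical, not a row from the dataset:

```python
import json
from ast import literal_eval

# Hypothetical sample value written as a Python literal (single quotes),
# which is not valid JSON.
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

try:
    json.loads(raw)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

parsed = literal_eval(raw)               # evaluates Python literals only
genre_names = [g['name'] for g in parsed]

print(json_ok)       # False: json.loads rejects single-quoted strings
print(genre_names)   # ['Animation', 'Comedy']
```

`literal_eval` only accepts literals (strings, numbers, lists, dicts, and so on), so it is a safer choice than `eval` for untrusted file contents.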


---
## 2. Cleaning & Filtering

In [51]:
# 2.1  Type conversions + drop corrupted rows
n_start = len(movies_df)
print(f'Starting rows: {n_start:,}')

# Drop rows with non-numeric IDs (3 corrupted rows found in EDA)
movies_df['id'] = pd.to_numeric(movies_df['id'], errors='coerce')
movies_df = movies_df.dropna(subset=['id'])
movies_df['id'] = movies_df['id'].astype(int)

# Convert numeric columns stored as strings
movies_df['budget'] = pd.to_numeric(movies_df['budget'], errors='coerce')
movies_df['popularity'] = pd.to_numeric(movies_df['popularity'], errors='coerce')
movies_df['revenue'] = pd.to_numeric(movies_df['revenue'], errors='coerce')
movies_df['runtime'] = pd.to_numeric(movies_df['runtime'], errors='coerce')
movies_df['vote_average'] = pd.to_numeric(movies_df['vote_average'], errors='coerce')
movies_df['vote_count'] = pd.to_numeric(movies_df['vote_count'], errors='coerce')

# Parse release date
movies_df['release_date'] = pd.to_datetime(movies_df['release_date'], errors='coerce')

n_after = len(movies_df)
print(f'After type cleanup: {n_after:,}  (dropped {n_start - n_after:,} corrupted rows)')

Starting rows: 45,466
After type cleanup: 45,463  (dropped 3 corrupted rows)
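
As a small illustration of the ID cleanup, the sketch below shows how `errors='coerce'` turns non-numeric values into `NaN` so they can be dropped. The sample IDs are made up for the example:

```python
import pandas as pd

# Hypothetical IDs: two valid values and one corrupted, date-like string
ids = pd.Series(['862', '1997-08-20', '8844'])

numeric_ids = pd.to_numeric(ids, errors='coerce')   # invalid entries become NaN
n_corrupted = int(numeric_ids.isna().sum())
clean_ids = numeric_ids.dropna().astype(int).tolist()

print(n_corrupted)   # 1
print(clean_ids)     # [862, 8844]
```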


In [52]:
# 2.2  Remove duplicates from all 3 datasets
print('--- Removing duplicates ---')

n1 = len(movies_df)
movies_df = movies_df.drop_duplicates(subset=['id'], keep='first')
print(f'Movies:   {n1:,} -> {len(movies_df):,}  (dropped {n1 - len(movies_df):,})')

n2 = len(credits_df)
credits_df = credits_df.drop_duplicates(subset=['id'], keep='first')
print(f'Credits:  {n2:,} -> {len(credits_df):,}  (dropped {n2 - len(credits_df):,})')

n3 = len(keywords_df)
keywords_df = keywords_df.drop_duplicates(subset=['id'], keep='first')
print(f'Keywords: {n3:,} -> {len(keywords_df):,}  (dropped {n3 - len(keywords_df):,})')

--- Removing duplicates ---
Movies:   45,463 -> 45,433  (dropped 30)
Credits:  45,476 -> 45,432  (dropped 44)
Keywords: 46,419 -> 45,432  (dropped 987)


In [53]:
print('--- Filtering movies ---')
print(f'Start: {len(movies_df):,}')

# Keep only Released movies
movies_df = movies_df[movies_df['status'] == 'Released']
print(f'After status=Released: {len(movies_df):,}')

# ==============================
# DATASET 1: Revenue prediction
# ==============================

movies_df_revenue = movies_df[(movies_df['budget'] > 0) & (movies_df['revenue'] > 0)]

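# Remove invalid runtime (NaN > 0 evaluates to False, so rows with a missing
# runtime are dropped here as well)
movies_df_revenue = movies_df_revenue[movies_df_revenue['runtime'] > 0].copy()

# Safeguard: fill any remaining missing runtime with this dataset's median
# (a no-op after the filter above, which already removed missing runtimes)
med_runtime_rev = movies_df_revenue['runtime'].median()
movies_df_revenue['runtime'] = movies_df_revenue['runtime'].fillna(med_runtime_rev)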

# Drop missing release dates
movies_df_revenue = movies_df_revenue.dropna(subset=['release_date'])

print(f'Final revenue dataset: {len(movies_df_revenue):,}')


# ==============================
# DATASET 2: Vote prediction
# ==============================

movies_df_vote = movies_df[movies_df['vote_count'] > 0]

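# Remove invalid runtime (NaN > 0 evaluates to False, so rows with a missing
# runtime are dropped here as well)
movies_df_vote = movies_df_vote[movies_df_vote['runtime'] > 0].copy()

# Safeguard: fill any remaining missing runtime with this dataset's median
# (a no-op after the filter above, which already removed missing runtimes)
med_runtime_vote = movies_df_vote['runtime'].median()
movies_df_vote['runtime'] = movies_df_vote['runtime'].fillna(med_runtime_vote)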

# Drop missing release dates
movies_df_vote = movies_df_vote.dropna(subset=['release_date'])

print(f'Final vote dataset: {len(movies_df_vote):,}')


--- Filtering movies ---
Start: 45,433
After status=Released: 44,985
Final revenue dataset: 5,359
Final vote dataset: 40,804
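
A short sketch of how a `runtime > 0` mask treats missing values, using made-up runtimes. The `lenient_mask` variant is a hypothetical alternative for keeping missing rows so they can be imputed, not what the cell above does:

```python
import numpy as np
import pandas as pd

# Hypothetical runtimes: valid, zero, and missing
runtime = pd.Series([95.0, 0.0, np.nan, 120.0])

strict_mask = runtime > 0                       # NaN compares as False
lenient_mask = (runtime > 0) | runtime.isna()   # keeps NaN for later imputation

kept_strict = runtime[strict_mask]
kept_lenient = runtime[lenient_mask]
imputed = kept_lenient.fillna(kept_lenient.median())

print(len(kept_strict), int(kept_strict.isna().sum()))   # 2 0
print(imputed.tolist())                                  # [95.0, 107.5, 120.0]
```

With the strict mask nothing is left to impute; with the lenient mask the missing value is replaced by the median of the remaining valid runtimes.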


---
## 3. Base Feature Engineering (Metadata)

In [54]:
print("\n" + "="*60)
print("DATASET 1 — PROFITABILITY PREDICTION")
print("="*60)

movies_df_revenue = movies_df_revenue.copy()

print(f"\nInitial shape: {movies_df_revenue.shape}")

# -----------------------------
# 1️⃣ Target Variable
# -----------------------------
print("\n--- Creating Target Variable (Profitable) ---")

movies_df_revenue['profit'] = movies_df_revenue['revenue'] - movies_df_revenue['budget']
movies_df_revenue['profitable'] = (movies_df_revenue['profit'] > 0).astype(int)

print("Target distribution:")
print(movies_df_revenue['profitable'].value_counts())
print("Target percentage:")
print(movies_df_revenue['profitable'].value_counts(normalize=True).round(3))


# -----------------------------
# 2️⃣ Binary Features
# -----------------------------
print("\n--- Creating Binary Features ---")

movies_df_revenue['is_collection'] = movies_df_revenue['belongs_to_collection'].apply(
    lambda x: 0 if pd.isna(x) or x == '' else 1)

movies_df_revenue['is_english'] = (movies_df_revenue['original_language'] == 'en').astype(int)

print("is_collection:", movies_df_revenue['is_collection'].sum())
print("is_english:", movies_df_revenue['is_english'].sum())


# -----------------------------
# 3️⃣ Temporal Features
# -----------------------------
print("\n--- Creating Temporal Features ---")

movies_df_revenue['release_year'] = movies_df_revenue['release_date'].dt.year.astype(int)
movies_df_revenue['release_month'] = movies_df_revenue['release_date'].dt.month.astype(int)

print("Year range:",
      movies_df_revenue['release_year'].min(),
      "-",
      movies_df_revenue['release_year'].max())


# -----------------------------
# 4️⃣ Count Features
# -----------------------------
print("\n--- Creating Count Features ---")

movies_df_revenue['num_genres'] = movies_df_revenue['genres'].apply(
    lambda x: len(x) if isinstance(x, list) else 0)

movies_df_revenue['num_production_companies'] = movies_df_revenue['production_companies'].apply(
    lambda x: len(x) if isinstance(x, list) else 0)

movies_df_revenue['num_production_countries'] = movies_df_revenue['production_countries'].apply(
    lambda x: len(x) if isinstance(x, list) else 0)

movies_df_revenue['num_spoken_languages'] = movies_df_revenue['spoken_languages'].apply(
    lambda x: len(x) if isinstance(x, list) else 0)

print(movies_df_revenue[
    ['num_genres',
     'num_production_companies',
     'num_production_countries',
     'num_spoken_languages']
].describe().round(2))


# -----------------------------
# 5️⃣ Primary Genre
# -----------------------------
print("\n--- Extracting Primary Genre ---")

def get_primary_genre(genres_list):
    if isinstance(genres_list, list) and len(genres_list) > 0:
        return genres_list[0].get('name', 'Unknown')
    return 'Unknown'

movies_df_revenue['primary_genre'] = movies_df_revenue['genres'].apply(get_primary_genre)

print("Number of unique genres:",
      movies_df_revenue['primary_genre'].nunique())

print("Top 5 genres:")
print(movies_df_revenue['primary_genre'].value_counts().head())


# -----------------------------
# 6️⃣ Prevent Data Leakage
# -----------------------------
print("\n--- Removing Leakage Columns ---")

movies_df_revenue = movies_df_revenue.drop(columns=['revenue', 'profit'])

print("Shape after dropping leakage columns:", movies_df_revenue.shape)


# -----------------------------
# 7️⃣ Define X and y
# -----------------------------
print("\n--- Final X and y ---")

y_profit = movies_df_revenue['profitable']

X_profit = movies_df_revenue.drop(columns=[
    'profitable',
    'release_date',
    'belongs_to_collection',
    'genres',
    'production_companies',
    'production_countries',
    'spoken_languages'
])

print("Final X shape:", X_profit.shape)
print("Final y shape:", y_profit.shape)



DATASET 1 — PROFITABILITY PREDICTION

Initial shape: (5359, 24)

--- Creating Target Variable (Profitable) ---
Target distribution:
profitable
1    3745
0    1614
Name: count, dtype: int64
Target percentage:
profitable
1    0.699
0    0.301
Name: proportion, dtype: float64

--- Creating Binary Features ---
is_collection: 1220
is_english: 4790

--- Creating Temporal Features ---
Year range: 1915 - 2017

--- Creating Count Features ---
       num_genres  num_production_companies  num_production_countries  \
count     5359.00                   5359.00                   5359.00   
mean         2.61                      2.94                      1.36   
std          1.12                      2.19                      0.79   
min          0.00                      0.00                      0.00   
25%          2.00                      1.00                      1.00   
50%          3.00                      2.00                      1.00   
75%          3.00                      4.00       

In [55]:
print("\n" + "="*60)
print("DATASET 2 — VOTE AVERAGE PREDICTION")
print("="*60)

movies_df_vote = movies_df_vote.copy()

print(f"\nInitial shape: {movies_df_vote.shape}")


# -----------------------------
# 1️⃣ Target Variable
# -----------------------------
print("\n--- Target: vote_average ---")

y_vote = movies_df_vote['vote_average']

print("Vote statistics:")
print(y_vote.describe().round(2))


# -----------------------------
# 2️⃣ Binary Features
# -----------------------------
print("\n--- Creating Binary Features ---")

movies_df_vote['is_collection'] = movies_df_vote['belongs_to_collection'].apply(
    lambda x: 0 if pd.isna(x) or x == '' else 1)

movies_df_vote['is_english'] = (movies_df_vote['original_language'] == 'en').astype(int)

print("is_collection:", movies_df_vote['is_collection'].sum())
print("is_english:", movies_df_vote['is_english'].sum())


# -----------------------------
# 3️⃣ Temporal Features
# -----------------------------
print("\n--- Creating Temporal Features ---")

movies_df_vote['release_year'] = movies_df_vote['release_date'].dt.year.astype(int)
movies_df_vote['release_month'] = movies_df_vote['release_date'].dt.month.astype(int)

print("Year range:",
      movies_df_vote['release_year'].min(),
      "-",
      movies_df_vote['release_year'].max())


# -----------------------------
# 4️⃣ Count Features
# -----------------------------
print("\n--- Creating Count Features ---")

movies_df_vote['num_genres'] = movies_df_vote['genres'].apply(
    lambda x: len(x) if isinstance(x, list) else 0)

movies_df_vote['num_production_companies'] = movies_df_vote['production_companies'].apply(
    lambda x: len(x) if isinstance(x, list) else 0)

movies_df_vote['num_production_countries'] = movies_df_vote['production_countries'].apply(
    lambda x: len(x) if isinstance(x, list) else 0)

movies_df_vote['num_spoken_languages'] = movies_df_vote['spoken_languages'].apply(
    lambda x: len(x) if isinstance(x, list) else 0)

print(movies_df_vote[
    ['num_genres',
     'num_production_companies',
     'num_production_countries',
     'num_spoken_languages']
].describe().round(2))


# -----------------------------
# 5️⃣ Primary Genre
# -----------------------------
print("\n--- Extracting Primary Genre ---")

movies_df_vote['primary_genre'] = movies_df_vote['genres'].apply(get_primary_genre)

print("Number of unique genres:",
      movies_df_vote['primary_genre'].nunique())

print("Top 5 genres:")
print(movies_df_vote['primary_genre'].value_counts().head())


# -----------------------------
# 6️⃣ Define X
# -----------------------------
print("\n--- Final X and y ---")

X_vote = movies_df_vote.drop(columns=[
    'vote_average',
    'release_date',
    'belongs_to_collection',
    'genres',
    'production_companies',
    'production_countries',
    'spoken_languages'
])

print("Final X shape:", X_vote.shape)
print("Final y shape:", y_vote.shape)



DATASET 2 — VOTE AVERAGE PREDICTION

Initial shape: (40804, 24)

--- Target: vote_average ---
Vote statistics:
count    40804.00
mean         6.01
std          1.27
min          0.00
25%          5.30
50%          6.10
75%          6.90
max         10.00
Name: vote_average, dtype: float64

--- Creating Binary Features ---
is_collection: 4327
is_english: 29296

--- Creating Temporal Features ---
Year range: 1874 - 2017

--- Creating Count Features ---
       num_genres  num_production_companies  num_production_countries  \
count    40804.00                  40804.00                  40804.00   
mean         2.08                      1.65                      1.13   
std          1.11                      1.77                      0.75   
min          0.00                      0.00                      0.00   
25%          1.00                      1.00                      1.00   
50%          2.00                      1.00                      1.00   
75%          3.00                

---
## 4. Create 4 Processed Datasets

In [56]:
print("\n" + "="*60)
print("PREPARING CREDITS & KEYWORDS FEATURES")
print("="*60)

# -----------------------------
# Credits Features
# -----------------------------
credits_merge = credits_df[['id', 'cast_parsed', 'crew_parsed']].copy()

credits_merge['id'] = pd.to_numeric(credits_merge['id'], errors='coerce')
credits_merge = credits_merge.dropna(subset=['id'])
credits_merge['id'] = credits_merge['id'].astype(int)

credits_merge['num_cast'] = credits_merge['cast_parsed'].apply(
    lambda x: len(x) if isinstance(x, list) else 0)

credits_merge['num_crew'] = credits_merge['crew_parsed'].apply(
    lambda x: len(x) if isinstance(x, list) else 0)

print(f'Credits features ready: {len(credits_merge):,} rows')


# -----------------------------
# Keywords Features
# -----------------------------
keywords_merge = keywords_df[['id', 'keywords_parsed']].copy()

keywords_merge['id'] = pd.to_numeric(keywords_merge['id'], errors='coerce')
keywords_merge = keywords_merge.dropna(subset=['id'])
keywords_merge['id'] = keywords_merge['id'].astype(int)

keywords_merge['num_keywords'] = keywords_merge['keywords_parsed'].apply(
    lambda x: len(x) if isinstance(x, list) else 0)

print(f'Keywords features ready: {len(keywords_merge):,} rows')



PREPARING CREDITS & KEYWORDS FEATURES
Credits features ready: 45,432 rows
Keywords features ready: 45,432 rows


In [57]:
print("\n" + "="*60)
print("PROFITABILITY DATASETS (4 VERSIONS)")
print("="*60)

# 1️⃣ Metadata Only
df_profit_metadata = movies_df_revenue.copy()
print(f'1. Metadata only:         {len(df_profit_metadata):,} rows')

# 2️⃣ Metadata + Credits
df_profit_meta_credits = movies_df_revenue.merge(
    credits_merge[['id', 'num_cast', 'num_crew']],
    on='id',
    how='inner'
)

print(f'2. Metadata + Credits:    {len(df_profit_meta_credits):,} rows '
      f'(lost {len(movies_df_revenue) - len(df_profit_meta_credits):,})')

# 3️⃣ Metadata + Keywords
df_profit_meta_keywords = movies_df_revenue.merge(
    keywords_merge[['id', 'num_keywords']],
    on='id',
    how='inner'
)

print(f'3. Metadata + Keywords:   {len(df_profit_meta_keywords):,} rows '
      f'(lost {len(movies_df_revenue) - len(df_profit_meta_keywords):,})')

# 4️⃣ All Combined
df_profit_all = movies_df_revenue.merge(
    credits_merge[['id', 'num_cast', 'num_crew']],
    on='id',
    how='inner'
)

df_profit_all = df_profit_all.merge(
    keywords_merge[['id', 'num_keywords']],
    on='id',
    how='inner'
)

print(f'4. All Combined:          {len(df_profit_all):,} rows '
      f'(lost {len(movies_df_revenue) - len(df_profit_all):,})')



PROFITABILITY DATASETS (4 VERSIONS)
1. Metadata only:         5,359 rows
2. Metadata + Credits:    5,359 rows (lost 0)
3. Metadata + Keywords:   5,359 rows (lost 0)
4. All Combined:          5,359 rows (lost 0)


In [58]:
print("\n" + "="*60)
print("VOTE DATASETS (4 VERSIONS)")
print("="*60)

# 1️⃣ Metadata Only
df_vote_metadata = movies_df_vote.copy()
print(f'1. Metadata only:         {len(df_vote_metadata):,} rows')

# 2️⃣ Metadata + Credits
df_vote_meta_credits = movies_df_vote.merge(
    credits_merge[['id', 'num_cast', 'num_crew']],
    on='id',
    how='inner'
)

print(f'2. Metadata + Credits:    {len(df_vote_meta_credits):,} rows '
      f'(lost {len(movies_df_vote) - len(df_vote_meta_credits):,})')

# 3️⃣ Metadata + Keywords
df_vote_meta_keywords = movies_df_vote.merge(
    keywords_merge[['id', 'num_keywords']],
    on='id',
    how='inner'
)

print(f'3. Metadata + Keywords:   {len(df_vote_meta_keywords):,} rows '
      f'(lost {len(movies_df_vote) - len(df_vote_meta_keywords):,})')

# 4️⃣ All Combined
df_vote_all = movies_df_vote.merge(
    credits_merge[['id', 'num_cast', 'num_crew']],
    on='id',
    how='inner'
)

df_vote_all = df_vote_all.merge(
    keywords_merge[['id', 'num_keywords']],
    on='id',
    how='inner'
)

print(f'4. All Combined:          {len(df_vote_all):,} rows '
      f'(lost {len(movies_df_vote) - len(df_vote_all):,})')



VOTE DATASETS (4 VERSIONS)
1. Metadata only:         40,804 rows
2. Metadata + Credits:    40,804 rows (lost 0)
3. Metadata + Keywords:   40,804 rows (lost 0)
4. All Combined:          40,804 rows (lost 0)
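
The "lost 0" counts above rely on `id` being unique on both sides of each merge. One way to make that assumption explicit is sketched below with pandas' `validate` and `indicator` options; the two small frames are hypothetical stand-ins, not the notebook's data:

```python
import pandas as pd

# Hypothetical stand-ins for the base frame and a credits feature frame
base = pd.DataFrame({'id': [1, 2, 3], 'budget': [10, 20, 30]})
credits_feats = pd.DataFrame({'id': [1, 2, 4], 'num_cast': [5, 8, 2]})

checked = base.merge(
    credits_feats,
    on='id',
    how='left',
    validate='one_to_one',   # raises MergeError if 'id' repeats on either side
    indicator=True,          # adds a '_merge' column with each row's origin
)

unmatched_ids = checked.loc[checked['_merge'] == 'left_only', 'id'].tolist()
matched = checked[checked['_merge'] == 'both'].drop(columns='_merge')

print(unmatched_ids)   # [3]
print(len(matched))    # 2
```

Keeping only the `both` rows gives the same result as an inner merge, while the `left_only` rows show exactly which IDs would have been lost.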


In [59]:
# parse_json_column is reused from the JSON parsing cell above; redefining it here
# with a bare `except:` would shadow that more robust version.

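# --- Merge Credits (keep parsed columns) ---
# Note: this rebuilds the two credits-based profitability frames from the parsed
# columns, so the num_cast / num_crew / num_keywords counts merged in the
# previous cells are not carried over into them.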
df_profit_meta_credits = df_profit_metadata.merge(
    credits_df[['id', 'cast_parsed', 'crew_parsed']], on='id', how='inner'
)

df_profit_all = df_profit_meta_credits.merge(
    keywords_df[['id', 'keywords_parsed']], on='id', how='inner'
)

# --- Ensure parsed columns exist ---
for df_ref in [df_profit_meta_credits, df_profit_all]:
    for col in ['cast_parsed', 'crew_parsed']:
        if col not in df_ref.columns and col.replace('_parsed','') in df_ref.columns:
            df_ref[col] = df_ref[col.replace('_parsed','')].apply(parse_json_column)

# --- Extract Top Directors & Actors (post-2010) ---
def extract_top_industry_features(df_ref):
    # Directors
    dir_year_pairs = []
    for _, row in df_ref.iterrows():
        year = row.get('release_year')
        crew = row.get('crew_parsed', [])
        if pd.notna(year) and isinstance(crew, list):
            for m in crew:
                if isinstance(m, dict) and m.get('job')=='Director' and m.get('name'):
                    dir_year_pairs.append((m['name'], year))
    dir_year_df = pd.DataFrame(dir_year_pairs, columns=['director','year'])
    post2010_dirs = dir_year_df[dir_year_df['year']>=2010]['director'].value_counts()
    top_directors = set(post2010_dirs.head(50).index)

    # Actors
    act_data = []
    for _, row in df_ref.iterrows():
        year = row.get('release_year')
        cast = row.get('cast_parsed', [])
        if pd.notna(year) and isinstance(cast, list):
            for m in cast:
                if isinstance(m, dict) and m.get('name') and 'order' in m:
                    act_data.append((m['name'], year, m['order']))
    act_df = pd.DataFrame(act_data, columns=['actor','year','order'])
    post2010_acts = act_df[act_df['year']>=2010]
    top_actors = set(post2010_acts['actor'].value_counts().head(50).index)
    top_leads = set(post2010_acts[post2010_acts['order']==0]['actor'].value_counts().head(50).index)

    # Feature functions
    def has_top_dir(crew):
        if not isinstance(crew,list): return 0
        for m in crew:
            if isinstance(m, dict) and m.get('job')=='Director' and m.get('name','') in top_directors:
                return 1
        return 0

    def count_top_actors(cast):
        if not isinstance(cast,list): return 0
        return sum(1 for m in cast if isinstance(m,dict) and m.get('name','') in top_actors)

    def has_top_lead_actor(cast):
        if not isinstance(cast,list): return 0
        for m in cast:
            if isinstance(m, dict) and m.get('order')==0 and m.get('name','') in top_leads:
                return 1
        return 0

    # Apply features
    df_ref['has_top_director'] = df_ref['crew_parsed'].apply(has_top_dir)
    df_ref['num_top_actors'] = df_ref['cast_parsed'].apply(count_top_actors)
    df_ref['has_top_lead_actor'] = df_ref['cast_parsed'].apply(has_top_lead_actor)

    print("Top director count:", df_ref['has_top_director'].sum())
    print("Top actor mean:", round(df_ref['num_top_actors'].mean(),2))
    print("Top lead count:", df_ref['has_top_lead_actor'].sum())

# Apply to the profitability datasets (the vote datasets are not passed through this step)
extract_top_industry_features(df_profit_meta_credits)
extract_top_industry_features(df_profit_all)


Top director count: 475
Top actor mean: 0.3
Top lead count: 794
Top director count: 475
Top actor mean: 0.3
Top lead count: 794
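
The director ranking above is built with `iterrows`. As a possible alternative, the same counting can be sketched with `DataFrame.explode`; the miniature frame and names below are hypothetical and only reuse the notebook's column names:

```python
import pandas as pd

# Hypothetical miniature frame with the same column names as the notebook
films = pd.DataFrame({
    'release_year': [2008, 2012, 2014],
    'crew_parsed': [
        [{'job': 'Director', 'name': 'A'}],
        [{'job': 'Director', 'name': 'B'}, {'job': 'Editor', 'name': 'C'}],
        [{'job': 'Director', 'name': 'B'}],
    ],
})

# One row per crew member; empty crew lists become NaN and are dropped
crew_long = films[['release_year', 'crew_parsed']].explode('crew_parsed').dropna()
crew_long['job'] = crew_long['crew_parsed'].map(lambda m: m.get('job'))
crew_long['name'] = crew_long['crew_parsed'].map(lambda m: m.get('name'))

# Keep directors of films released in 2010 or later, then count appearances
recent_directors = crew_long[
    (crew_long['job'] == 'Director') & (crew_long['release_year'] >= 2010)
]
director_counts = recent_directors['name'].value_counts()
top_directors = set(director_counts.head(50).index)

print(director_counts.to_dict())   # {'B': 2}
print(top_directors)               # {'B'}
```

This sketch assumes every crew entry is a dict; the notebook's loop additionally guards against non-dict entries.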


In [60]:
import os
import numpy as np

# -----------------------------
# Define columns to drop
# -----------------------------
meta_drop = [
    'adult', 'video', 'status',
    'belongs_to_collection', 'homepage', 'tagline', 'poster_path',
    'imdb_id', 'original_title', 'overview',
    'original_language', 'release_date',
    'genres', 'production_companies', 'production_countries', 'spoken_languages',
]

# -----------------------------
# Function to drop raw/unused columns
# -----------------------------
def drop_raw_columns(datasets, target_name):
    print(f"\nDropping raw/unused columns for {target_name} datasets...")
    df_metadata, df_meta_credits, df_meta_keywords, df_all = datasets
    
    df_metadata = df_metadata.drop(columns=[c for c in meta_drop if c in df_metadata.columns])
    df_meta_credits = df_meta_credits.drop(columns=[c for c in meta_drop + ['cast_parsed', 'crew_parsed'] 
                                                    if c in df_meta_credits.columns])
    df_meta_keywords = df_meta_keywords.drop(columns=[c for c in meta_drop + ['keywords_parsed'] 
                                                      if c in df_meta_keywords.columns])
    df_all = df_all.drop(columns=[c for c in meta_drop + ['cast_parsed','crew_parsed','keywords_parsed'] 
                                  if c in df_all.columns])
    
    # Print column info
    for name, df_ref in [('metadata', df_metadata), ('meta_credits', df_meta_credits),
                         ('meta_keywords', df_meta_keywords), ('all', df_all)]:
        print(f'\n  {name} ({df_ref.shape[0]:,} rows x {df_ref.shape[1]} cols):')
        for i, col in enumerate(df_ref.columns, 1):
            print(f'    {i:2d}. {col}')
            
    return df_metadata, df_meta_credits, df_meta_keywords, df_all

# -----------------------------
# Function to show dataset comparison
# -----------------------------
def dataset_overview(datasets, dataset_labels):
    print('\n' + '='*80)
    print('DATASET COMPARISON')
    print('='*80)
    
    comparison_rows = []
    for label, df_ref in zip(dataset_labels, datasets):
        numeric_df = df_ref.select_dtypes(include=[np.number])
        comparison_rows.append({
            'Dataset': label,
            'Rows': len(df_ref),
            'Columns': df_ref.shape[1],
            'Numeric Features': numeric_df.shape[1],
            'ROI Median': round(df_ref['roi'].median(),1) if 'roi' in df_ref.columns else None,
            # 'roi' is never created in this notebook, so read the binary target instead
            'Profitable %': round(df_ref['profitable'].mean()*100,1) if 'profitable' in df_ref.columns else None,
        })
    comp_df = pd.DataFrame(comparison_rows)
    print(comp_df.to_string(index=False))
    
    # Missing values
    print('\n--- Missing Values ---')
    for label, df_ref in zip(dataset_labels, datasets):
        n_missing = df_ref.isnull().sum().sum()
        print(f'{label}: {n_missing:,} total missing values')

# -----------------------------
# Function to save datasets
# -----------------------------
def save_datasets(datasets, dataset_labels, save_dir):
    os.makedirs(save_dir, exist_ok=True)
    print(f'\nSaving processed datasets to {save_dir}:')
    for df, label in zip(datasets, dataset_labels):
        path = os.path.join(save_dir, f'movies_processed_{label.replace(" ","_").lower()}.csv')
        df.to_csv(path, index=False)
        print(f'  ✓ {label:15s} → {path}  ({len(df):,} rows x {df.shape[1]} cols)')

# -----------------------------
# APPLY TO PROFITABILITY DATASETS
# -----------------------------
profit_datasets = [df_profit_metadata, df_profit_meta_credits, df_profit_meta_keywords, df_profit_all]
profit_labels = ['Metadata Only', 'Meta + Credits', 'Meta + Keywords', 'All Combined']

# Drop raw columns
df_profit_metadata, df_profit_meta_credits, df_profit_meta_keywords, df_profit_all = drop_raw_columns(profit_datasets, "Profitability")

# Overview
dataset_overview([df_profit_metadata, df_profit_meta_credits, df_profit_meta_keywords, df_profit_all], profit_labels)

# Save
save_datasets([df_profit_metadata, df_profit_meta_credits, df_profit_meta_keywords, df_profit_all], profit_labels, '../data/processed/profit')

# -----------------------------
# APPLY TO VOTE AVERAGE DATASETS
# -----------------------------
vote_datasets = [df_vote_metadata, df_vote_meta_credits, df_vote_meta_keywords, df_vote_all]
vote_labels = ['Metadata Only', 'Meta + Credits', 'Meta + Keywords', 'All Combined']

# Drop raw columns
df_vote_metadata, df_vote_meta_credits, df_vote_meta_keywords, df_vote_all = drop_raw_columns(vote_datasets, "Vote Average")

# Overview
dataset_overview([df_vote_metadata, df_vote_meta_credits, df_vote_meta_keywords, df_vote_all], vote_labels)

# Save
save_datasets([df_vote_metadata, df_vote_meta_credits, df_vote_meta_keywords, df_vote_all], vote_labels, '../data/processed/vote')



Dropping raw/unused columns for Profitability datasets...

  metadata (5,359 rows x 17 cols):
     1. budget
     2. id
     3. popularity
     4. runtime
     5. title
     6. vote_average
     7. vote_count
     8. profitable
     9. is_collection
    10. is_english
    11. release_year
    12. release_month
    13. num_genres
    14. num_production_companies
    15. num_production_countries
    16. num_spoken_languages
    17. primary_genre

  meta_credits (5,359 rows x 20 cols):
     1. budget
     2. id
     3. popularity
     4. runtime
     5. title
     6. vote_average
     7. vote_count
     8. profitable
     9. is_collection
    10. is_english
    11. release_year
    12. release_month
    13. num_genres
    14. num_production_companies
    15. num_production_countries
    16. num_spoken_languages
    17. primary_genre
    18. has_top_director
    19. num_top_actors
    20. has_top_lead_actor

  meta_keywords (5,359 rows x 18 cols):
     1. budget
     2. id
     3. popular
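
`save_datasets` derives file names from the display labels, so the saved names differ from the short keys listed in the table at the top of the notebook. The sketch below compares the two naming rules. `LABEL_TO_KEY`, `label_slug`, and `key_path` are hypothetical helpers introduced for illustration, and the mapping assumes the short keys are the intended suffixes:

```python
import os

# Assumption: the short keys from the table at the top of the notebook are the
# intended file-name suffixes. This mapping is illustrative only.
LABEL_TO_KEY = {
    'Metadata Only': 'metadata',
    'Meta + Credits': 'meta_credits',
    'Meta + Keywords': 'meta_keywords',
    'All Combined': 'all',
}

def label_slug(label):
    """File name produced by the label-based rule used in save_datasets."""
    return f'movies_processed_{label.replace(" ", "_").lower()}.csv'

def key_path(label, save_dir):
    """Hypothetical alternative: build the path from the short key instead."""
    return os.path.join(save_dir, f'movies_processed_{LABEL_TO_KEY[label]}.csv')

print(label_slug('Meta + Credits'))
# movies_processed_meta_+_credits.csv
print(os.path.basename(key_path('Meta + Credits', 'processed')))
# movies_processed_meta_credits.csv
```

Note that the table lists the combined file as `movies_processed.csv`, which neither rule produces as written, so whichever convention Notebook 03 expects should be confirmed before changing the save step.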

---
## Summary

### 4 Processed Datasets Created

| Dataset | File | Sources |
|---|---|---|
| Metadata Only | `movies_processed_metadata.csv` | movies_metadata |
| Meta + Credits | `movies_processed_meta_credits.csv` | movies_metadata + credits |
| Meta + Keywords | `movies_processed_meta_keywords.csv` | movies_metadata + keywords |
| All Combined | `movies_processed.csv` | all three |

### Features Per Dataset

| Feature | Metadata | +Credits | +Keywords | All |
|---|:---:|:---:|:---:|:---:|
| `budget`, `popularity`, `runtime`, `vote_average`, `vote_count` | ✓ | ✓ | ✓ | ✓ |
| `is_collection`, `is_english` | ✓ | ✓ | ✓ | ✓ |
| `release_year`, `release_month` | ✓ | ✓ | ✓ | ✓ |
| `profitable` (target; profitability datasets only) | ✓ | ✓ | ✓ | ✓ |
| `num_genres`, `num_production_*`, `num_spoken_languages` | ✓ | ✓ | ✓ | ✓ |
| `primary_genre` | ✓ | ✓ | ✓ | ✓ |
| `num_cast`, `num_crew` (vote datasets only) | | ✓ | | ✓ |
| `has_top_director`, `num_top_actors`, `has_top_lead_actor` (profitability datasets only) | | ✓ | | ✓ |
| `num_keywords` (absent from the profitability All Combined dataset) | | | ✓ | ✓ |

### Next Steps
- **Notebook 03:** Prepare all 4 datasets for modeling: encode, split, scale, and fit baselines