# 🎯 Feature Engineering – Movie Success Predictor

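In this notebook, we engineer features from the cleaned movie data so it is ready for machine learning. The steps are:
- Extracting the main (first listed) genre from the JSON-like `genres` column
- Extracting the director's name from the `crew` column
- Creating a binary target variable (`success`), set to 1 when revenue exceeds budget
- Saving the engineered dataset to CSV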

In [1]:
import pandas as pd
import numpy as np
import ast


In [7]:
# Load the cleaned dataset
df = pd.read_csv("../data/cleaned_movies.csv")
df.head(2)

Unnamed: 0,budget,genres,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Pirates of the Caribbean: At World's End,6.9,4500,285,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


## 🎬 Extract Main Genre

In [3]:
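def extract_main_genre(genre_str):
    """Return the name of the first listed genre, or 'Unknown'."""
    try:
        genres = ast.literal_eval(genre_str)
        if genres and isinstance(genres, list):
            return genres[0]['name']
    except (ValueError, SyntaxError, TypeError, KeyError):
        pass
    # Empty list, missing value, or unparseable string
    return 'Unknown'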

df['main_genre'] = df['genres'].apply(extract_main_genre)
df['main_genre'].value_counts().head()

main_genre
Drama        1206
Comedy       1042
Action        754
Adventure     339
Horror        299
Name: count, dtype: int64
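As a quick check of the first-genre idea, here is a minimal, self-contained sketch run on hand-written sample strings (not rows from the dataset). The helper name `main_genre_or_unknown` is hypothetical. It shows the three cases worth covering: a normal list, an empty list, and a string that cannot be parsed.

```python
import ast

def main_genre_or_unknown(genre_str):
    """Return the first genre name, or 'Unknown' when none can be read."""
    try:
        genres = ast.literal_eval(genre_str)
    except (ValueError, SyntaxError, TypeError):
        # Unparseable string or a non-string value such as NaN
        return "Unknown"
    if isinstance(genres, list) and genres and isinstance(genres[0], dict):
        return genres[0].get("name", "Unknown")
    # Parsed fine, but there is no first genre to report
    return "Unknown"

# Hand-written samples (assumed shapes, for illustration only)
samples = [
    '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
    '[]',
    'not a list',
]
results = [main_genre_or_unknown(s) for s in samples]
print(results)  # ['Action', 'Unknown', 'Unknown']
```

The empty-list case matters because parsing succeeds there, so a fallback placed only in an `except` branch would never run for it.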

## 🎥 Extract Director from Crew

In [4]:
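def extract_director(crew_str):
    """Return the name of the first crew member whose job is 'Director'."""
    try:
        crew = ast.literal_eval(crew_str)
        for person in crew:
            if person.get('job') == 'Director':
                return person.get('name')
    except (ValueError, SyntaxError, TypeError, AttributeError):
        pass
    # No director listed, missing value, or unparseable string
    return 'Unknown'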

df['director'] = df['crew'].apply(extract_director)
df['director'].value_counts().head()

director
Steven Spielberg    27
Woody Allen         21
Clint Eastwood      20
Martin Scorsese     20
Ridley Scott        16
Name: count, dtype: int64
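Because the `crew` strings look like JSON, one possible alternative is to parse them with `json.loads` and pick the first matching entry with `next`. The sketch below assumes the column is valid JSON, which this notebook does not verify. The helper name `director_or_unknown` and the sample crew entries are hypothetical.

```python
import json

def director_or_unknown(crew_str):
    """Return the first crew member with job 'Director', else 'Unknown'."""
    try:
        crew = json.loads(crew_str)
    except (TypeError, ValueError):
        # TypeError: non-string input such as NaN; ValueError: invalid JSON
        return "Unknown"
    if not isinstance(crew, list):
        return "Unknown"
    # next() with a default avoids returning None when nobody matches
    return next(
        (person.get("name", "Unknown")
         for person in crew
         if isinstance(person, dict) and person.get("job") == "Director"),
        "Unknown",
    )

# Made-up sample strings, for illustration only
with_director = '[{"job": "Editor", "name": "Person A"}, {"job": "Director", "name": "Person B"}]'
without_director = '[{"job": "Editor", "name": "Person A"}]'

found = director_or_unknown(with_director)
missing = director_or_unknown(without_director)
print(found, missing)  # Person B Unknown
```

If a film lists several directors, this approach (like the loop above it) keeps only the first one.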

## 🎯 Create Target Variable (`success`)

In [5]:
# Label a movie as successful (1) if its revenue exceeds its budget, otherwise 0
df['success'] = np.where(df['revenue'] > df['budget'], 1, 0)
df['success'].value_counts(normalize=True)

success
1    0.538446
0    0.461554
Name: proportion, dtype: float64
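To make the labeling rule concrete, here is a small sketch on a toy DataFrame with made-up numbers. It applies the same `revenue > budget` comparison and shows that a break-even movie is labeled 0, since the comparison is strict. The boolean-cast form is an equivalent alternative, not something the notebook uses.

```python
import numpy as np
import pandas as pd

# Toy data with invented values, for illustration only
toy = pd.DataFrame({
    "title": ["Profit", "Break-even", "Loss"],
    "budget": [100, 100, 100],
    "revenue": [250, 100, 40],
})

# Same rule as the notebook: 1 when revenue strictly exceeds budget
toy["success"] = np.where(toy["revenue"] > toy["budget"], 1, 0)

# Equivalent form: cast the boolean comparison to integers
same_labels = (toy["revenue"] > toy["budget"]).astype(int)

print(toy[["title", "success"]])
```

Here `success` is 1 for "Profit" and 0 for both "Break-even" and "Loss".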

## 💾 Save Engineered Dataset

In [6]:
df.to_csv("../data/engineered_movies.csv", index=False)
print("Saved to ../data/engineered_movies.csv")

Saved to ../data/engineered_movies.csv
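The `index=False` argument is what keeps the row index out of the saved file. The `Unnamed: 0` column visible in the loaded data preview is the typical result of writing a CSV with the index included and reading it back. A minimal sketch of the difference, using an in-memory buffer and a toy DataFrame rather than the real files:

```python
import io
import pandas as pd

# Toy data with invented values, for illustration only
toy = pd.DataFrame({"title": ["A", "B"], "success": [1, 0]})

# Default behavior: the row index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the real columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'title', 'success']
print(cols_without_index)  # ['title', 'success']
```

Note that the engineered file still carries the `Unnamed: 0` column inherited from the input unless it is dropped before saving.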
