# TMDB Movie Success – Mini Project (V7)

## V7 improvement over V6: **Optimal decision threshold** (GroupKFold, no leakage)

### What’s new vs V6
- We still use **GroupKFold by director** and rare-category grouping.
- After tuning each model with GridSearchCV, we compute **out-of-fold probabilities** on the training set.
- We choose the **threshold that maximizes F1_weighted** on those out-of-fold predictions.
- Then we evaluate once on the test set using that threshold.

✅ This can improve F1 without changing the model, and the test set stays untouched because the threshold is learned only from the training data via CV.


## 1. Imports

In [None]:
import pandas as pd
import numpy as np
import ast

from sklearn.model_selection import train_test_split, GridSearchCV, GroupKFold
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (
    accuracy_score, classification_report, roc_auc_score,
    ConfusionMatrixDisplay, RocCurveDisplay, f1_score
)
from sklearn.base import clone
import matplotlib.pyplot as plt

## 2. Load datasets

In [None]:
movies = pd.read_csv('tmdb_5000_movies.csv')
credits = pd.read_csv('tmdb_5000_credits.csv')

print('movies shape :', movies.shape)
print('credits shape:', credits.shape)
movies.head()

## 3. Merge movies + credits (clean column names)

In [None]:
df = movies.merge(credits, left_on='id', right_on='movie_id', how='left')

# Clean duplicate title columns created by merge
if 'title_x' in df.columns:
    df = df.rename(columns={'title_x': 'title'})
if 'title_y' in df.columns:
    df = df.drop(columns=['title_y'])

print('merged shape:', df.shape)
df[['id', 'title', 'movie_id']].head()
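
A left merge silently keeps movies that have no credits row and silently duplicates rows if a key repeats. The sketch below, on toy frames rather than the real CSV files, shows how pandas' `validate` and `indicator` options can surface both cases. The toy column values are assumptions made for illustration only.

```python
import pandas as pd

# Toy stand-ins for the two TMDB tables (hypothetical rows, not real data).
movies_toy = pd.DataFrame({"id": [1, 2, 3], "title": ["A", "B", "C"]})
credits_toy = pd.DataFrame(
    {"movie_id": [1, 2], "title": ["A", "B"], "cast": ["[]", "[]"]}
)

merged = movies_toy.merge(
    credits_toy,
    left_on="id",
    right_on="movie_id",
    how="left",
    validate="one_to_one",  # raises MergeError if either key is duplicated
    indicator=True,         # adds a `_merge` column: 'both' or 'left_only'
)

# Movies without a matching credits row show up as 'left_only'.
n_unmatched = int((merged["_merge"] == "left_only").sum())
print("rows:", len(merged), "unmatched:", n_unmatched)
print(merged.columns.tolist())  # note the title_x / title_y pair
```

If the unmatched count is non-zero on the real data, the `cast` and `crew` columns are missing for those rows, and the helper functions in the next section fall back to empty lists.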

## 4. Helper functions (safe parsing)

In [None]:
def safe_eval_list(x):
    if not isinstance(x, str) or x.strip() == '':
        return []
    try:
        v = ast.literal_eval(x)
        return v if isinstance(v, list) else []
    except (ValueError, SyntaxError):
        return []

def first_name_from_json_list(x, default='Unknown'):
    v = safe_eval_list(x)
    if len(v) > 0 and isinstance(v[0], dict) and 'name' in v[0]:
        return v[0]['name']
    return default

def list_len(x):
    return len(safe_eval_list(x))

def extract_director(crew_str):
    crew = safe_eval_list(crew_str)
    for person in crew:
        if isinstance(person, dict) and person.get('job') == 'Director':
            return person.get('name', 'Unknown')
    return 'Unknown'
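
The helpers above parse each cell with `ast.literal_eval`, which accepts Python literals only. Assuming the columns hold JSON-encoded lists (as the name `first_name_from_json_list` suggests), a cell containing `null`, `true` or `false` would be rejected and quietly turned into an empty list. The sketch below compares that behaviour with a hypothetical `json.loads` variant. Both function names are illustrative and are not part of the notebook.

```python
import ast
import json

def parse_list_literal(x):
    """Same idea as safe_eval_list: parse with ast.literal_eval, else []."""
    if not isinstance(x, str) or x.strip() == "":
        return []
    try:
        v = ast.literal_eval(x)
        return v if isinstance(v, list) else []
    except (ValueError, SyntaxError):
        return []

def parse_list_json(x):
    """Hypothetical alternative: parse with json.loads, else []."""
    if not isinstance(x, str) or x.strip() == "":
        return []
    try:
        v = json.loads(x)
        return v if isinstance(v, list) else []
    except json.JSONDecodeError:
        return []

plain = '[{"id": 28, "name": "Action"}]'
with_null = '[{"id": 28, "name": "Action", "order": null}]'

print(parse_list_literal(plain))      # parsed as a list with one dict
print(parse_list_literal(with_null))  # literal_eval rejects `null`, so []
print(parse_list_json(with_null))     # json.loads maps `null` to None
```

Whether this matters depends on the actual file contents. Comparing `list_len` results from both parsers on the real columns would show if any rows are affected.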

## 5. Pre-release feature engineering (safe)

In [None]:
df['release_date'] = pd.to_datetime(df['release_date'], errors='coerce')
df['release_year'] = df['release_date'].dt.year
df['release_month'] = df['release_date'].dt.month

df['main_genre'] = df['genres'].apply(first_name_from_json_list)
df['num_genres'] = df['genres'].apply(list_len)

df['top_company'] = df['production_companies'].apply(first_name_from_json_list)
df['num_production_companies'] = df['production_companies'].apply(list_len)

df['is_english'] = (df['original_language'] == 'en').astype(int)

df['cast_size'] = df['cast'].apply(list_len)
df['crew_size'] = df['crew'].apply(list_len)
df['director_name'] = df['crew'].apply(extract_director)

df[['main_genre','num_genres','top_company','num_production_companies','original_language','is_english','cast_size','crew_size','director_name']].head()

## 6. Define composite success score (TARGET ONLY)

In [None]:
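# Post-release columns (revenue, vote_count, popularity, vote_average) are used
# only to build the target; they must never appear in the feature set.
df['profit'] = df['revenue'] - df['budget']
df['profit_pos'] = df['profit'].clip(lower=0)  # losses count as zero profit

P = np.log(df['profit_pos'] + 1)    # log-scaled profit
V = np.log(df['vote_count'] + 1)    # log-scaled number of votes
Pop = np.log(df['popularity'] + 1)  # log-scaled popularity
Q = df['vote_average'] / 10         # rating on a 0-10 scale, rescaled to [0, 1]

df['FilmSuccessScore'] = 0.4*P + 0.3*Q + 0.2*V + 0.1*Pop

# Median split: by construction, about half of the films are labelled success (1)
threshold_target = df['FilmSuccessScore'].median()
df['success'] = (df['FilmSuccessScore'] >= threshold_target).astype(int)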

df['success'].value_counts(normalize=True)
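
The weights 0.4 / 0.3 / 0.2 / 0.1 apply to terms on very different scales: the three log terms are unbounded, while `Q` stays between 0 and 1. The worked example below evaluates the same formula for one hypothetical film (the input numbers are invented purely for illustration) to show how much each term contributes.

```python
import math

# Hypothetical film, chosen only to illustrate the scale of each term.
budget, revenue = 100e6, 300e6
vote_count, popularity, vote_average = 2000, 50.0, 7.0

profit_pos = max(revenue - budget, 0)
terms = {
    "profit":     0.4 * math.log(profit_pos + 1),
    "quality":    0.3 * (vote_average / 10),
    "votes":      0.2 * math.log(vote_count + 1),
    "popularity": 0.1 * math.log(popularity + 1),
}
score = sum(terms.values())

for name, value in terms.items():
    print(f"{name:<10} {value:6.3f}  ({value / score:5.1%} of the score)")
print(f"{'total':<10} {score:6.3f}")
```

For these inputs the profit term is larger than the other three combined, and the rating term can never exceed 0.3. Standardizing each component before weighting would be one way to make the weights reflect relative importance, if that is the intent.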

## 7. Reduce rare categories (to lower overfitting)

In [None]:
def keep_top_n(series, n=50):
    top = series.value_counts().head(n).index
    return series.where(series.isin(top), other='Other')

df['director_group'] = keep_top_n(df['director_name'], n=80)
df['company_group']  = keep_top_n(df['top_company'], n=80)
df['lang_group']     = keep_top_n(df['original_language'], n=30)
df['genre_group']    = keep_top_n(df['main_genre'], n=20)

df[['director_name','director_group','top_company','company_group','original_language','lang_group','main_genre','genre_group']].head()
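
`keep_top_n` is applied to the full dataframe, so the list of "top" categories is chosen with the test rows included. It uses feature frequencies only, not the target, so the effect is probably small. A stricter variant would learn the top categories from the training rows and then apply them everywhere. The sketch below shows that split into two steps. The names `fit_top_categories` and `apply_top_categories` and the toy labels are hypothetical.

```python
import pandas as pd

def fit_top_categories(train_series, n=50):
    """Return the n most frequent categories seen in the training data."""
    return train_series.value_counts().head(n).index

def apply_top_categories(series, top):
    """Map every category outside `top` to 'Other'."""
    return series.where(series.isin(top), other="Other")

# Toy director labels (placeholders, not real data).
train_dirs = pd.Series(["dir_a", "dir_a", "dir_a", "dir_b", "dir_b", "dir_c"])
test_dirs = pd.Series(["dir_a", "dir_z", "dir_b"])

top = fit_top_categories(train_dirs, n=2)
train_grouped = apply_top_categories(train_dirs, top)
test_grouped = apply_top_categories(test_dirs, top)

print(train_grouped.tolist())
print(test_grouped.tolist())
```

Using this in the notebook would mean moving the grouping step after the train/test split.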

## 8. Build X / y + train/test split (with director groups)

In [None]:
numeric_features = [
    'budget', 'runtime', 'release_year', 'release_month',
    'num_genres', 'num_production_companies',
    'cast_size', 'crew_size', 'is_english'
]

categorical_features = [
    'genre_group', 'company_group', 'lang_group', 'director_group'
]

X = df[numeric_features + categorical_features].copy()
y = df['success'].copy()
groups = df['director_group']

X_train, X_test, y_train, y_test, g_train, g_test = train_test_split(
    X, y, groups, test_size=0.2, random_state=42, stratify=y
)

print('Train size:', X_train.shape, ' Test size:', X_test.shape)
print('Unique director groups (train):', pd.Series(g_train).nunique())
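
`train_test_split` stratifies on `y` here but does not look at the groups, so the same director can end up on both sides of the split. GroupKFold only protects the cross-validation folds inside the training set. If a group-disjoint test set is wanted, scikit-learn's `GroupShuffleSplit` is one option. The sketch below uses toy data with hypothetical director labels.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 10 films by 5 hypothetical directors.
X_toy = pd.DataFrame({"budget": np.arange(10, dtype=float)})
y_toy = pd.Series([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
directors = pd.Series(["d1", "d1", "d2", "d2", "d3", "d3", "d4", "d4", "d5", "d5"])

gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(X_toy, y_toy, groups=directors))

train_dirs = set(directors.iloc[train_idx])
test_dirs = set(directors.iloc[test_idx])
print("train directors:", sorted(train_dirs))
print("test directors :", sorted(test_dirs))
print("overlap        :", train_dirs & test_dirs)
```

Two caveats: `GroupShuffleSplit` does not stratify on the target, and with `director_group` as the grouping key the large `'Other'` bucket would fall entirely on one side. Grouping on `director_name` may be the more natural choice for this purpose.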

## 9. Preprocessing pipeline

In [None]:
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocess = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])
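
To make the behaviour of this preprocessing concrete, the sketch below rebuilds a reduced version of it (one numeric and one categorical column, toy values) and transforms a row with a missing budget and a genre that was never seen during fitting. The name `toy_preprocess` and the data are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols, cat_cols = ["budget"], ["genre_group"]
toy_preprocess = ColumnTransformer(transformers=[
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                      ("scaler", StandardScaler())]), num_cols),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])

train = pd.DataFrame({"budget": [1.0, 2.0, np.nan, 4.0],
                      "genre_group": ["Action", "Drama", "Action", "Drama"]})
# New row: missing budget and a genre that was not present when fitting.
new = pd.DataFrame({"budget": [np.nan], "genre_group": ["Western"]})

toy_preprocess.fit(train)
out = toy_preprocess.transform(new)
out = out.toarray() if hasattr(out, "toarray") else np.asarray(out)
print(out)  # columns: scaled budget, then one-hot columns for Action and Drama
```

The missing budget is replaced by the training median before scaling, and the unseen genre becomes an all-zero one-hot block instead of raising an error.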

## 10. Helper: best threshold from out-of-fold predictions (GroupKFold)

In [None]:
def oof_proba_groupkfold(estimator, X, y, groups, n_splits=5):
    """Out-of-fold predicted probabilities for class 1 using GroupKFold."""
    gkf = GroupKFold(n_splits=n_splits)
    oof = np.zeros(len(X), dtype=float)

    Xr = X.reset_index(drop=True)
    yr = y.reset_index(drop=True)
    gr = pd.Series(groups).reset_index(drop=True)

    for train_idx, val_idx in gkf.split(Xr, yr, gr):
        est = clone(estimator)
        est.fit(Xr.iloc[train_idx], yr.iloc[train_idx])
        oof[val_idx] = est.predict_proba(Xr.iloc[val_idx])[:, 1]
    return oof

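def best_threshold_from_proba(y_true, proba):
    """Return the threshold maximizing weighted F1, plus the full score table."""
    # 0.05, 0.10, ..., 0.95; rounded so the reported thresholds print cleanly
    thresholds = np.round(np.linspace(0.05, 0.95, 19), 2)
    rows = []
    for t in thresholds:
        y_pred = (proba >= t).astype(int)
        score = f1_score(y_true, y_pred, average='weighted')
        rows.append((t, score))
    res = pd.DataFrame(rows, columns=['threshold', 'f1_weighted'])
    # idxmax returns the first maximum, so ties resolve to the lowest threshold
    best_row = res.loc[res['f1_weighted'].idxmax()]
    return float(best_row['threshold']), res.sort_values('f1_weighted', ascending=False)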
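
To see what the threshold sweep optimizes, the sketch below computes weighted F1 by hand (the support-weighted mean of the per-class F1 scores) and runs the same 19-point grid on a small toy example. The function `weighted_f1` and the toy probabilities are illustrative and stand in for `f1_score(..., average='weighted')` and the real out-of-fold predictions.

```python
import numpy as np

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of the per-class F1 scores (binary labels 0/1)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    total = 0.0
    for c in (0, 1):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        f1_c = 2 * tp / denom if denom > 0 else 0.0
        total += f1_c * np.mean(y_true == c)  # weight = share of class c
    return float(total)

# Toy labels and class-1 probabilities (invented for illustration).
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])
proba_toy = np.array([0.12, 0.22, 0.31, 0.58, 0.67, 0.72, 0.83, 0.91])

thresholds = np.linspace(0.05, 0.95, 19)
scores = [weighted_f1(y_toy, (proba_toy >= t).astype(int)) for t in thresholds]
best = int(np.argmax(scores))  # first maximum, i.e. lowest best threshold

print("F1_weighted at 0.50:", round(scores[9], 3))
print("best threshold     :", round(float(thresholds[best]), 2))
print("F1_weighted at best:", round(scores[best], 3))
```

In this toy case one negative sits at 0.58, so the default cut at 0.5 produces a false positive that a slightly higher threshold avoids. On real data the gain is usually smaller and is not guaranteed to carry over to the test set.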

## 11. Model 1 — Logistic Regression (GridSearch + optimal threshold)

In [None]:
log_reg = Pipeline(steps=[
    ('preprocess', preprocess),
    ('model', LogisticRegression(max_iter=4000, class_weight='balanced', solver='liblinear'))
])

param_grid_lr = {'model__C': [0.01, 0.1, 1, 10, 50]}
cv_group = GroupKFold(n_splits=5)

gs_lr = GridSearchCV(
    log_reg,
    param_grid=param_grid_lr,
    scoring='f1_weighted',
    cv=cv_group,
    n_jobs=-1
)
gs_lr.fit(X_train, y_train, groups=g_train)
best_lr = gs_lr.best_estimator_

print('Best LR params:', gs_lr.best_params_)
print('Best CV F1_weighted (thr=0.5):', gs_lr.best_score_)

# Find optimal threshold from OOF train probabilities
oof_lr = oof_proba_groupkfold(best_lr, X_train, y_train, g_train, n_splits=5)
best_t_lr, table_lr = best_threshold_from_proba(y_train.reset_index(drop=True), oof_lr)

print('\nBest threshold for LR (from OOF train):', best_t_lr)
table_lr.head(10)
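
The out-of-fold probabilities computed with `oof_proba_groupkfold` can also be obtained from scikit-learn's `cross_val_predict` with `method='predict_proba'`. The sketch below checks, on synthetic data with hypothetical group labels, that the built-in call and a manual GroupKFold loop give the same values. It is a cross-check under those assumptions, not a replacement for the notebook's helper.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_predict

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=60) > 0).astype(int)
groups_toy = np.repeat(np.arange(12), 5)  # 12 hypothetical directors, 5 films each

est = LogisticRegression(max_iter=1000)
cv = GroupKFold(n_splits=4)

# Built-in route: one call returns the out-of-fold class-1 probabilities.
oof_builtin = cross_val_predict(
    est, X_toy, y_toy, groups=groups_toy, cv=cv, method="predict_proba"
)[:, 1]

# Manual route, mirroring the loop in oof_proba_groupkfold.
oof_manual = np.zeros(len(X_toy))
for tr, va in cv.split(X_toy, y_toy, groups_toy):
    fitted = clone(est).fit(X_toy[tr], y_toy[tr])
    oof_manual[va] = fitted.predict_proba(X_toy[va])[:, 1]

print("max abs difference:", np.abs(oof_builtin - oof_manual).max())
```

Because `GroupKFold` is deterministic by default, both routes see identical folds, which is why the results agree.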

In [None]:
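# GridSearchCV (refit=True by default) has already refit best_lr on the full training set,
# so this explicit fit is redundant but harmless. Evaluate on test with both thresholds.
best_lr.fit(X_train, y_train)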

proba_test_lr = best_lr.predict_proba(X_test)[:, 1]
pred_test_lr_05  = (proba_test_lr >= 0.5).astype(int)
pred_test_lr_opt = (proba_test_lr >= best_t_lr).astype(int)

print('--- LR TEST (thr=0.5) ---')
print('F1_weighted:', f1_score(y_test, pred_test_lr_05, average='weighted'))
print('ROC AUC    :', roc_auc_score(y_test, proba_test_lr))
print('Accuracy   :', accuracy_score(y_test, pred_test_lr_05))

print('\n--- LR TEST (optimal thr) ---')
print('Threshold  :', best_t_lr)
print('F1_weighted:', f1_score(y_test, pred_test_lr_opt, average='weighted'))
print('ROC AUC    :', roc_auc_score(y_test, proba_test_lr))
print('Accuracy   :', accuracy_score(y_test, pred_test_lr_opt))

print('\nClassification report (optimal thr):')
print(classification_report(y_test, pred_test_lr_opt))

ConfusionMatrixDisplay.from_predictions(y_test, pred_test_lr_opt)
plt.title('Logistic Regression – Confusion Matrix (V7, optimal threshold)')
plt.show()

RocCurveDisplay.from_predictions(y_test, proba_test_lr)
plt.title('Logistic Regression – ROC Curve (V7)')
plt.show()

## 12. Model 2 — Decision Tree (GridSearch + optimal threshold)

In [None]:
tree = Pipeline(steps=[
    ('preprocess', preprocess),
    ('model', DecisionTreeClassifier(class_weight='balanced', random_state=42))
])

param_grid_tree = {
    'model__max_depth': [3, 4, 5, 6, None],
    'model__min_samples_leaf': [5, 10, 20, 40],
    'model__min_samples_split': [10, 30, 50, 100],
    'model__ccp_alpha': [0.0, 0.0005, 0.001, 0.005]
}

gs_tree = GridSearchCV(
    tree,
    param_grid=param_grid_tree,
    scoring='f1_weighted',
    cv=cv_group,
    n_jobs=-1
)
gs_tree.fit(X_train, y_train, groups=g_train)
best_tree = gs_tree.best_estimator_

print('Best Tree params:', gs_tree.best_params_)
print('Best CV F1_weighted (thr=0.5):', gs_tree.best_score_)

oof_tree = oof_proba_groupkfold(best_tree, X_train, y_train, g_train, n_splits=5)
best_t_tree, table_tree = best_threshold_from_proba(y_train.reset_index(drop=True), oof_tree)

print('\nBest threshold for Tree (from OOF train):', best_t_tree)
table_tree.head(10)

In [None]:
# best_tree was already refit on the full training set by GridSearchCV; refitting is harmless
best_tree.fit(X_train, y_train)

proba_test_tree = best_tree.predict_proba(X_test)[:, 1]
pred_test_tree_05  = (proba_test_tree >= 0.5).astype(int)
pred_test_tree_opt = (proba_test_tree >= best_t_tree).astype(int)

print('--- Tree TEST (thr=0.5) ---')
print('F1_weighted:', f1_score(y_test, pred_test_tree_05, average='weighted'))
print('ROC AUC    :', roc_auc_score(y_test, proba_test_tree))
print('Accuracy   :', accuracy_score(y_test, pred_test_tree_05))

print('\n--- Tree TEST (optimal thr) ---')
print('Threshold  :', best_t_tree)
print('F1_weighted:', f1_score(y_test, pred_test_tree_opt, average='weighted'))
print('ROC AUC    :', roc_auc_score(y_test, proba_test_tree))
print('Accuracy   :', accuracy_score(y_test, pred_test_tree_opt))

print('\nClassification report (optimal thr):')
print(classification_report(y_test, pred_test_tree_opt))

ConfusionMatrixDisplay.from_predictions(y_test, pred_test_tree_opt)
plt.title('Decision Tree – Confusion Matrix (V7, optimal threshold)')
plt.show()

RocCurveDisplay.from_predictions(y_test, proba_test_tree)
plt.title('Decision Tree – ROC Curve (V7)')
plt.show()

## 13. Compare models (Test, optimal threshold)

In [None]:
results = pd.DataFrame({
    'Model': ['Logistic Regression', 'Decision Tree'],
    'Best_threshold': [best_t_lr, best_t_tree],
    'F1_weighted_test_opt': [
        f1_score(y_test, pred_test_lr_opt, average='weighted'),
        f1_score(y_test, pred_test_tree_opt, average='weighted')
    ],
    'ROC_AUC_test': [
        roc_auc_score(y_test, proba_test_lr),
        roc_auc_score(y_test, proba_test_tree)
    ],
    'Accuracy_test_opt': [
        accuracy_score(y_test, pred_test_lr_opt),
        accuracy_score(y_test, pred_test_tree_opt)
    ]
})
results

## 14. Mini search engine (title → Success/Failure) using the chosen threshold

In [None]:
# Choose the final model here
final_model = best_lr
final_threshold = best_t_lr

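def predict_from_title(title: str):
    """Look up a film by title (exact match first, then substring) and predict its class."""
    query = title.strip().lower()
    titles = df['title'].astype(str).str.lower()
    mask = titles == query
    if mask.sum() == 0:
        # regex=False: treat the query as literal text so punctuation cannot break the search
        mask = titles.str.contains(query, na=False, regex=False)
        if mask.sum() == 0:
            return None, 'Title not found in dataset.'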

    row = df.loc[mask].iloc[0]

    x_row = pd.DataFrame([{
        **{c: row.get(c, np.nan) for c in numeric_features},
        **{c: row.get(c, 'Other') for c in categorical_features}
    }])

    proba = float(final_model.predict_proba(x_row)[:, 1][0])
    pred = int(proba >= final_threshold)

    label = 'SUCCESS (good film)' if pred == 1 else 'FAILURE (bad film)'
    return {
        'title': row['title'],
        'proba_success': round(proba, 4),
        'threshold_used': float(final_threshold),
        'prediction': label
    }, None

# Example
predict_from_title('Avatar')
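
In the substring fallback of the title search, it matters whether the query is interpreted as a regular expression or as literal text. `Series.str.contains` defaults to regular-expression matching, so a query containing brackets or other special characters can match unrelated titles or raise an error. The sketch below illustrates the difference on three toy titles, which are assumptions and not taken from the dataset.

```python
import pandas as pd

# Toy titles for illustration only.
titles = pd.Series(["[REC]", "Avatar", "Up"])
query = "[rec]"

lowered = titles.str.lower()
# Default: "[rec]" is a character class matching any of r, e or c.
as_regex = lowered.str.contains(query, na=False)
# regex=False: plain substring search for the five characters "[rec]".
as_literal = lowered.str.contains(query, na=False, regex=False)

print("regex match  :", titles[as_regex].tolist())
print("literal match:", titles[as_literal].tolist())
```

With the default setting the query also matches "Avatar" (it contains an "r"), while the literal search returns only the intended title.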

## 15. Notes for your report

- V7 keeps the V6 anti-leakage strategy (GroupKFold by director + rare-category grouping).
- V7 also optimizes the **decision threshold** using only training data via out-of-fold probabilities.
- This can improve the SUCCESS/FAILURE decision as measured by weighted F1, and the test set is never used to choose the threshold.
