# Error Analysis

This notebook examines the errors made by our best model (Logistic Regression). We identify False Positives (Flops predicted as Hits) and False Negatives (Hits predicted as Flops), then inspect individual movies.

## Goals
1. Load Data & Model
2. Generate Predictions
3. Identify & Visualize Errors
4. Inspect Specific Errors (False Positives and False Negatives)
5. Compare Budget Distribution by Error Type

In [1]:
import pandas as pd
import numpy as np
import joblib
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

%matplotlib inline
sns.set(style="whitegrid")

## 1. Load Data & Model

In [2]:
# Load processed data
df_processed = pd.read_csv('../data/processed/train_processed.csv')

# Load Model
pipeline = joblib.load('../models/movie_hit_flop_pipeline.joblib')

print(f"Loaded data shape: {df_processed.shape}")

https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations


Loaded data shape: (2596, 49)




## 2. Generate Predictions

In [3]:
# Prepare features: drop identifier, target, and date columns if present
drop_cols = ['id', 'label', 'release_date']
X = df_processed.drop(columns=[c for c in drop_cols if c in df_processed.columns])
# Keep numeric columns only and fill missing values with column means
X = X.select_dtypes(include=[np.number])
X = X.fillna(X.mean())

# Get True Labels
y_true = df_processed['label'].map({'Hit': 1, 'Flop': 0}) # 1=Hit, 0=Flop

# Predict Probabilities
y_proba = pipeline.predict_proba(X)[:, 1]

# Apply Best Threshold (from training report ~0.31)
threshold = 0.31
y_pred = (y_proba >= threshold).astype(int)

# Add to dataframe
df_processed['prob_hit'] = y_proba
df_processed['pred_label'] = y_pred
df_processed['true_label'] = y_true

# Error Categorization
def categorize_error(row):
    """Label a row as TP, TN, FP, or FN from its true and predicted labels."""
    if row['true_label'] == 1 and row['pred_label'] == 1:
        return 'TP' # True Hit
    elif row['true_label'] == 0 and row['pred_label'] == 0:
        return 'TN' # True Flop
    elif row['true_label'] == 0 and row['pred_label'] == 1:
        return 'FP' # Predicted Hit but was Flop (Costly Mistake)
    elif row['true_label'] == 1 and row['pred_label'] == 0:
        return 'FN' # Predicted Flop but was Hit (Missed Opportunity)
    return None # True label missing or not 'Hit'/'Flop'

df_processed['error_type'] = df_processed.apply(categorize_error, axis=1)

AttributeError: 'LogisticRegression' object has no attribute 'multi_class'
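
The `AttributeError` above, together with the model persistence link printed when the pipeline was loaded, is consistent with the pipeline having been saved under a different scikit-learn version than the one running this notebook, although the notebook records neither version. Below is a minimal sketch of a guard that could run before `joblib.load`, assuming the training version was written down somewhere. The helpers `versions_match` and `installed_version` and the version string `"1.4.2"` are hypothetical, not part of this project.

```python
from importlib import metadata


def versions_match(trained_with: str, installed: str) -> bool:
    """Return True when the major and minor version numbers agree."""
    return trained_with.split(".")[:2] == installed.split(".")[:2]


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Hypothetical value: the version recorded when the pipeline was trained.
trained_with = "1.4.2"
current = installed_version("scikit-learn")

if current is None:
    print("scikit-learn is not installed")
elif versions_match(trained_with, current):
    print(f"OK: scikit-learn {current} matches training version {trained_with}")
else:
    print(f"Mismatch: trained with {trained_with}, running {current}")
```

If the versions differ, the usual options are to install the version used for training or to retrain and re-save the pipeline in the current environment.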

## 3. Analyze Errors

In [None]:
plt.figure(figsize=(8, 5))
sns.countplot(x='error_type', data=df_processed, order=['TP', 'TN', 'FP', 'FN'], palette='coolwarm')
plt.title('Count of Prediction Types')
plt.show()

print(df_processed['error_type'].value_counts())
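
The counts above depend on the decision threshold: lowering it from the conventional 0.5 to 0.31 should catch more Hits at the cost of more False Positives. Here is a minimal sketch of that trade-off on invented labels and probabilities, not the notebook's data. The helper `confusion_counts` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy labels and probabilities, invented for illustration only.
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_proba = np.array([0.90, 0.40, 0.20, 0.35, 0.10, 0.60, 0.55, 0.30])


def confusion_counts(y_true, y_proba, threshold):
    """Return TP/TN/FP/FN counts for a given decision threshold."""
    y_pred = (y_proba >= threshold).astype(int)
    return {
        "TP": int(((y_true == 1) & (y_pred == 1)).sum()),
        "TN": int(((y_true == 0) & (y_pred == 0)).sum()),
        "FP": int(((y_true == 0) & (y_pred == 1)).sum()),
        "FN": int(((y_true == 1) & (y_pred == 0)).sum()),
    }


# Compare the notebook's 0.31 threshold with the conventional 0.5 default.
summary = pd.DataFrame(
    {t: confusion_counts(y_true, y_proba, t) for t in (0.31, 0.5)}
).T
summary.index.name = "threshold"
print(summary)
```

On the real data, the same function could be called with `y_true.to_numpy()` and `y_proba` once the prediction cell runs successfully.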

## 4. Inspect Specific Errors

### False Positives (Risky Bets)
Movies the model predicted to be Hits that were actually Flops. Backing them would lose money.

In [None]:
cols_to_show = ['budget', 'revenue', 'prob_hit', 'error_type']

# Top False Positives (Highest probability of being a Hit, but was a Flop)
fp_df = df_processed[df_processed['error_type'] == 'FP'].sort_values('prob_hit', ascending=False)
print("Top 10 False Positives (Predicted Hit, Actual Flop):")
fp_df[cols_to_show].head(10)
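
To put a number on how costly these mistakes are, one option is to compare `revenue` against `budget` for each False Positive. The following is a minimal sketch on invented rows. It assumes both columns are in the same currency units, and the `shortfall` column is a hypothetical addition rather than part of the processed data.

```python
import pandas as pd

# Invented rows for illustration; the real data comes from df_processed.
toy = pd.DataFrame(
    {
        "budget": [50_000_000, 20_000_000, 80_000_000, 5_000_000],
        "revenue": [30_000_000, 25_000_000, 10_000_000, 1_000_000],
        "prob_hit": [0.82, 0.40, 0.65, 0.35],
        "error_type": ["FP", "TP", "FP", "FP"],
    }
)

fp = toy[toy["error_type"] == "FP"].copy()

# Shortfall = budget not recovered by revenue (positive means a loss).
fp["shortfall"] = fp["budget"] - fp["revenue"]

# Rank false positives by how much money each one would have lost.
ranked = fp.sort_values("shortfall", ascending=False)
print(ranked[["budget", "revenue", "prob_hit", "shortfall"]])
print(f"Total shortfall across false positives: {int(fp['shortfall'].sum()):,}")
```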

### False Negatives (Missed Gems)
Movies the model predicted to be Flops that were actually Hits.

In [None]:
# Top False Negatives (Lowest probability of being a Hit, but was a Hit)
fn_df = df_processed[df_processed['error_type'] == 'FN'].sort_values('prob_hit', ascending=True)
print("Top 10 False Negatives (Predicted Flop, Actual Hit):")
fn_df[cols_to_show].head(10)
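
Sorting in ascending order surfaces the Hits the model was most confident were Flops. It can also help to separate those from borderline cases that fell just under the threshold. Here is a minimal sketch on invented probabilities. The 0.05 margin and the `miss_kind` labels are hypothetical choices, not values from the training report.

```python
import pandas as pd

threshold = 0.31  # decision threshold used earlier in the notebook
near_margin = 0.05  # hypothetical cutoff for calling a miss "near"

# Invented false negatives for illustration only.
toy_fn = pd.DataFrame({"prob_hit": [0.30, 0.05, 0.27, 0.12]})

# Distance below the threshold: small values are borderline calls.
toy_fn["margin"] = threshold - toy_fn["prob_hit"]
toy_fn["miss_kind"] = toy_fn["margin"].le(near_margin).map(
    {True: "near miss", False: "confident miss"}
)
print(toy_fn.sort_values("prob_hit"))
```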

## 5. Budget Distribution by Error Type

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(x='error_type', y='budget', data=df_processed, order=['TP', 'TN', 'FP', 'FN'], palette='viridis')
plt.title('Budget Distribution by Prediction Outcome')
plt.yscale('log')
plt.show()
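
A log-scaled axis has no position for zero, so any rows with `budget` equal to 0 cannot be drawn faithfully in the boxplot. Below is a minimal sketch on invented data that counts such rows and reports per-group medians as a numeric companion to the plot. Whether the real `budget` column contains zeros is an assumption to check.

```python
import pandas as pd

# Invented rows for illustration; the real data comes from df_processed.
toy = pd.DataFrame(
    {
        "error_type": ["TP", "TP", "TN", "TN", "FP", "FP", "FN", "FN"],
        "budget": [90e6, 60e6, 0, 4e6, 40e6, 30e6, 2e6, 8e6],
    }
)

order = ["TP", "TN", "FP", "FN"]

# Rows that a log axis cannot display faithfully.
non_positive = int((toy["budget"] <= 0).sum())
print(f"Rows with non-positive budget: {non_positive}")

# Median budget per outcome, computed on positive budgets only.
medians = (
    toy[toy["budget"] > 0]
    .groupby("error_type")["budget"]
    .median()
    .reindex(order)
)
print(medians)
```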