# 🎬 Movie Revenue Prediction: Full Pipeline & Comparison

This notebook runs the pipeline **end to end** and directly compares the performance of the **old version (V1)** with the **advanced version (V2)**.

### 📌 Workflow:
1.  **Data Loading**: Load the movie data (2010-2024).
2.  **Experiment 1 (Baseline)**: Run Preprocessing V1 (old strategy).
3.  **Experiment 2 (Advanced)**: Run Preprocessing V2 (new strategy: KNN, RobustScaler, BGE embeddings).
4.  **Comparison**: Compare R2 score and MAE.
5.  **Final Model & Data**: Save the model and the data (train/test) from V2.
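
Step 3 names KNN and RobustScaler, but the internals of `DataPreprocessorV2` are not part of this notebook. The following is only a minimal sketch, assuming "KNN" refers to KNN-based imputation of missing values (scikit-learn's `KNNImputer`) followed by `RobustScaler`. The toy budget and runtime columns are hypothetical.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Hypothetical toy matrix: columns are [budget, runtime]; NaN marks a missing value.
X = np.array([
    [1.0e6, 90.0],
    [2.0e6, np.nan],
    [3.0e6, 110.0],
    [4.0e6, 120.0],
    [5.0e8, 100.0],   # extreme budget, the kind of value a median/IQR scaler tolerates
])

# Assumption: gaps are filled from the nearest rows, then each column is
# centred on its median and divided by its interquartile range.
numeric_pipeline = make_pipeline(KNNImputer(n_neighbors=2), RobustScaler())
X_scaled = numeric_pipeline.fit_transform(X)

print(np.isnan(X_scaled).any())      # whether any missing values remain
print(np.median(X_scaled, axis=0))   # per-column median after scaling
```

Unlike `StandardScaler`, which the V1 preprocessor reports using, `RobustScaler` does not let a single very large value set the scale of the whole column.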

In [1]:
import os
import sys
from pathlib import Path
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore') # Suppress warnings to keep the output tidy

# Setup Path
current_dir = Path(os.getcwd())
if current_dir.name == 'notebooks':
    project_root = current_dir.parent
else:
    project_root = current_dir

if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

# Import Modules
from src.data_loader import TMDbDataLoader
from src.preprocessing import DataPreprocessor        # V1
from src.preprocessing_v2 import DataPreprocessorV2   # V2 (New)
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import xgboost as xgb
import joblib

print(f"Project Root: {project_root}")

Project Root: d:\Movie_Revenue_Prediction_v2


## 1. Load Data

In [2]:
config_path = project_root / "configs" / "config.yaml"
raw_data_path = project_root / "data" / "raw" / "movies_2020_2024.csv"

loader = TMDbDataLoader(config_path=str(config_path))

if not raw_data_path.exists():
    print("Fetching data from API...")
    loader.fetch_data()
    loader.save_data(str(raw_data_path))
else:
    print("Loading existing data...")

df_raw = loader.load_data(str(raw_data_path))
print(f"Data Shape: {df_raw.shape}")

2025-12-09 14:26:17,721 - src.data_loader - INFO - TMDbDataLoader initialized successfully


2025-12-09 14:26:17,761 - src.data_loader - INFO - Loaded 2708 rows from d:\Movie_Revenue_Prediction_v2\data\raw\movies_2020_2024.csv


Loading existing data...
Data Shape: (2708, 14)


## 2. Setting Up Experiments (V1 vs V2)

In [3]:
def run_experiment(name, preprocessor, model, df):
    """Preprocess df, train the model on an 80/20 split, and return metrics plus artifacts."""
    print(f"\n{'='*20} RUNNING: {name} {'='*20}")
    
    # 1. Preprocessing
    print(" Running fit_transform...")
    X, y = preprocessor.fit_transform(df)
    print(f"Features Shape: {X.shape}")
    
    # 2. Split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # 3. Train
    print("Training Model...")
    model.fit(X_train, y_train)
    
    # 4. Evaluate
    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    
    print(f"{name} Results:")
    print(f"   R2 Score: {r2:.4f}")
    print(f"   MAE:      ${mae:,.2f}")
    
    return {
        "Experiment": name,
        "R2": r2,
        "MAE": mae,
        "Features": X.shape[1],
        "Test_Samples": len(y_test),
        "Model": model,
        "Preprocessor": preprocessor,
        "Data": (X_train, X_test, y_train, y_test) # Save data for later
    }
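
`run_experiment` calls `fit_transform` on the full frame before `train_test_split`, so any statistics the preprocessor learns (scaler centres, imputation neighbours, text vocabulary) also reflect the test rows. A minimal sketch of the split-first alternative follows. It uses a generic scikit-learn pipeline on synthetic data rather than the project's preprocessors, whose separate `fit` and `transform` behaviour is not shown here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the movie features and revenue target (not the real data).
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=400)

# Split first, so the held-out rows never influence fitted preprocessing statistics.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on X_train only and reuses those statistics on X_test.
pipeline = make_pipeline(
    RobustScaler(),
    RandomForestRegressor(n_estimators=50, random_state=42),
)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

print(f"R2:  {r2_score(y_test, y_pred):.4f}")
print(f"MAE: {mean_absolute_error(y_test, y_pred):.4f}")
```

Whether the leak changes the numbers reported in this notebook is not something the outputs here can show; the sketch only illustrates the ordering.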

## 3. Run V1 (Baseline)

In [4]:
# V1: Basic Preprocessing + RandomForest
prep_v1 = DataPreprocessor(config_path=str(config_path))
model_v1 = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)

res_v1 = run_experiment("V1 (Basic)", prep_v1, model_v1, df_raw)

2025-12-09 14:26:17,795 - src.preprocessing - INFO - DataPreprocessor initialized successfully (target_col=revenue, scaler=StandardScaler)
2025-12-09 14:26:17,804 - src.preprocessing - INFO - Starting to fit preprocessing pipeline...
2025-12-09 14:26:17,814 - src.preprocessing - INFO - Handled missing values
2025-12-09 14:26:17,824 - src.preprocessing - INFO - Removed 344 outliers (12.70%)
2025-12-09 14:26:17,853 - src.preprocessing - INFO - Created 5 date features
2025-12-09 14:26:17,861 - src.preprocessing - INFO - Encoded 19 genres



 Running fit_transform...


2025-12-09 14:26:18,097 - src.preprocessing - INFO - Created 30 TF-IDF features from overview
2025-12-09 14:26:18,103 - src.preprocessing - INFO - Created derived features
2025-12-09 14:26:18,109 - src.preprocessing - INFO - Fitted preprocessor with 65 features
2025-12-09 14:26:18,110 - src.preprocessing - INFO - Starting to transform data...
2025-12-09 14:26:18,114 - src.preprocessing - INFO - Handled missing values
2025-12-09 14:26:18,124 - src.preprocessing - INFO - Removed 344 outliers (12.70%)
2025-12-09 14:26:18,130 - src.preprocessing - INFO - Created 5 date features
2025-12-09 14:26:18,137 - src.preprocessing - INFO - Encoded 19 genres
2025-12-09 14:26:18,223 - src.preprocessing - INFO - Created 30 TF-IDF features from overview
2025-12-09 14:26:18,226 - src.preprocessing - INFO - Created derived features
2025-12-09 14:26:18,230 - src.preprocessing - INFO - Transformed data with shape: (2364, 65)


Features Shape: (2364, 65)
Training Model...
V1 (Basic) Results:
   R2 Score: 0.5896
   MAE:      $28,956,918.54
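
The V1 log reports 344 rows dropped as outliers, which is why the feature matrix has 2364 rows instead of 2708. The rule `DataPreprocessor` applies is not shown in this notebook. A common choice is the 1.5 × IQR fence, sketched below with NumPy on made-up revenue values; treat both the rule and the numbers as assumptions.

```python
import numpy as np

def iqr_mask(values, k=1.5):
    """Return a boolean mask that is True for values inside the k*IQR fences."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values >= q1 - k * iqr) & (values <= q3 + k * iqr)

# Made-up revenues in USD; the last one is a blockbuster-sized value.
revenue = np.array([1e6, 2e6, 3e6, 4e6, 5e6, 6e6, 7e6, 8e6, 9e8])

keep = iqr_mask(revenue)
removed = int((~keep).sum())
print(f"Removed {removed} outliers ({removed / len(revenue):.2%})")
```

Because the filter runs before the train/test split in `run_experiment`, the V1 test set contains no such rows, so V1 is scored on a narrower revenue range than a run that keeps every film.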


## 4. Run V2 (Advanced with Embeddings)

In [5]:
# V2: Advanced Preprocessing (KNN, RobustScaler, BGE) + XGBoost (typically stronger than RF)
prep_v2 = DataPreprocessorV2(config_path=str(config_path))
# Use XGBoost for V2 because it usually handles these features better than RF
model_v2 = xgb.XGBRegressor(n_estimators=200, learning_rate=0.05, n_jobs=-1, random_state=42)

res_v2 = run_experiment("V2 (Advanced)", prep_v2, model_v2, df_raw)

2025-12-09 14:26:19,006 - sentence_transformers.SentenceTransformer - INFO - Use pytorch device_name: cpu
2025-12-09 14:26:19,006 - sentence_transformers.SentenceTransformer - INFO - Load pretrained SentenceTransformer: BAAI/bge-small-en-v1.5
2025-12-09 14:26:23,924 - src.preprocessing_v2 - INFO - Using Embedding Model: BAAI/bge-small-en-v1.5
2025-12-09 14:26:23,924 - src.preprocessing_v2 - INFO - DataPreprocessorV2 (Advanced) initialized.
2025-12-09 14:26:23,926 - src.preprocessing_v2 - INFO - Starting to fit DataPreprocessorV2 (Advanced)...
2025-12-09 14:26:23,949 - src.preprocessing_v2 - INFO - Encoding overview with BGE Embeddings...



 Running fit_transform...


2025-12-09 14:27:43,387 - src.preprocessing_v2 - INFO - Created 384 features from BGE Embeddings
2025-12-09 14:27:43,520 - src.preprocessing_v2 - INFO - Fitted V2 with 419 features.
2025-12-09 14:27:43,545 - src.preprocessing_v2 - INFO - Encoding overview with BGE Embeddings...
2025-12-09 14:29:03,808 - src.preprocessing_v2 - INFO - Created 384 features from BGE Embeddings


Features Shape: (2708, 419)
Training Model...
V2 (Advanced) Results:
   R2 Score: 0.7671
   MAE:      $53,511,084.00


## 5. Final Comparison & Save Data

In [6]:
# Summarize the results
df_results = pd.DataFrame([res_v1, res_v2])
cols = ['Experiment', 'R2', 'MAE', 'Features']
print("\nLEADERBOARD")
print(df_results[cols])

print("\nSaving Best Model & Data (V2)...")
models_dir = project_root / "models"
processed_dir = project_root / "data" / "processed"
models_dir.mkdir(parents=True, exist_ok=True)
processed_dir.mkdir(parents=True, exist_ok=True)

# 1. Save the model & preprocessor
joblib.dump(res_v2['Model'], models_dir / "best_model.pkl")
res_v2['Preprocessor'].save_preprocessor(str(models_dir / "preprocessor.pkl"))

# 2. Save the V2 processed data (train/test)
X_train_v2, X_test_v2, y_train_v2, y_test_v2 = res_v2['Data']

feature_names = res_v2['Preprocessor'].get_feature_names()
# Save X_train, X_test
pd.DataFrame(X_train_v2, columns=feature_names).to_csv(processed_dir / "X_train.csv", index=False)
pd.DataFrame(X_test_v2, columns=feature_names).to_csv(processed_dir / "X_test.csv", index=False)
# Save y_train, y_test
pd.DataFrame(y_train_v2, columns=['revenue']).to_csv(processed_dir / "y_train.csv", index=False)
pd.DataFrame(y_test_v2, columns=['revenue']).to_csv(processed_dir / "y_test.csv", index=False)

print(f"Done! Data saved to '{processed_dir}'")


LEADERBOARD
      Experiment        R2           MAE  Features
0     V1 (Basic)  0.589595  2.895692e+07        65
1  V2 (Advanced)  0.767092  5.351108e+07       419

Saving Best Model & Data (V2)...


2025-12-09 14:29:15,062 - src.preprocessing_v2 - INFO - Saved preprocessor to: d:\Movie_Revenue_Prediction_v2\models\preprocessor.pkl


Done! Data saved to 'd:\Movie_Revenue_Prediction_v2\data\processed'
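
The leaderboard changes several things at once. V1 and V2 differ in preprocessing and in model (RandomForest versus XGBoost), and they are scored on different test rows, since V1's matrix has 2364 rows after outlier removal while V2's has 2708. That is consistent with V2 showing both a higher R2 and a higher MAE, because R2 is measured relative to the variance of each test target. A small NumPy sketch with invented numbers illustrates the mechanism; it does not reproduce the notebook's data.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

# Invented revenues in millions of USD.
narrow = np.array([10.0, 20.0, 30.0, 40.0])         # large values already removed
wide = np.array([10.0, 20.0, 30.0, 40.0, 900.0])    # one blockbuster kept

# Invented predictions: off by 8 on the narrow set, off by 20 on the wide set.
pred_narrow = narrow + np.array([8.0, -8.0, 8.0, -8.0])
pred_wide = wide + np.array([20.0, -20.0, 20.0, -20.0, 20.0])

print(f"narrow: R2={r2(narrow, pred_narrow):.3f}  MAE={mae(narrow, pred_narrow):.1f}")
print(f"wide:   R2={r2(wide, pred_wide):.3f}  MAE={mae(wide, pred_wide):.1f}")
```

For a like-for-like comparison, one option is to fix a single set of test rows up front and score both variants on exactly those rows.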


In [7]:
import datetime
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error
import numpy as np

def get_extra_metrics(res):
    model = res['Model']
    _, X_test, _, y_test = res['Data']
    y_pred = model.predict(X_test)
    
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    mape = mean_absolute_percentage_error(y_test, y_pred) * 100 
    return rmse, mape

rmse_v1, mape_v1 = get_extra_metrics(res_v1)
rmse_v2, mape_v2 = get_extra_metrics(res_v2)

df_results['RMSE'] = [rmse_v1, rmse_v2]
df_results['MAPE'] = [mape_v1, mape_v2]

experiments_path = project_root / "results" / "experiments.csv"
experiments_path.parent.mkdir(parents=True, exist_ok=True)

df_results['Timestamp'] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")

# Columns written to the experiment log
log_cols = ['Experiment', 'RMSE', 'MAE', 'R2', 'MAPE', 'Timestamp']

if not experiments_path.exists():
    df_results[log_cols].to_csv(experiments_path, index=False)
else:
    df_results[log_cols].to_csv(experiments_path, mode='a', header=False, index=False)

print("\nFull Metrics:\n", df_results[log_cols])
print(f"Full experiment results logged to: {experiments_path}")


Full Metrics:
       Experiment          RMSE           MAE        R2          MAPE  \
0     V1 (Basic)  4.768314e+07  2.895692e+07  0.589595   1787.022856   
1  V2 (Advanced)  1.152460e+08  5.351108e+07  0.767092  10045.149994   

             Timestamp  
0  2025-12-09 14:29:16  
1  2025-12-09 14:29:16  
Full experiment results logged to: d:\Movie_Revenue_Prediction_v2\results\experiments.csv
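
The MAPE values above (roughly 1,787% and 10,045%) are too large to read as a typical percentage error. MAPE divides each error by the true value, so a few films with near-zero reported revenue can dominate the mean. Whether that is what happens here is an assumption, since the raw revenues are not shown. The sketch below only demonstrates the mechanism on invented numbers and contrasts it with a median-based variant.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return np.mean(np.abs(y_true - y_pred) / np.abs(y_true)) * 100

def mdape(y_true, y_pred):
    """Median absolute percentage error, in percent (less sensitive to tiny targets)."""
    return np.median(np.abs(y_true - y_pred) / np.abs(y_true)) * 100

# Invented revenues in USD; the first film reports almost no revenue.
y_true = np.array([100.0, 1e7, 2e7, 5e7, 1e8])
y_pred = np.array([1e6, 1.2e7, 1.8e7, 5.5e7, 0.9e8])

print(f"MAPE:  {mape(y_true, y_pred):,.1f}%")
print(f"MdAPE: {mdape(y_true, y_pred):,.1f}%")
```

If tiny revenues do turn out to be the cause, reporting a median-based error or filtering implausibly small revenues before evaluation would make the percentage metric easier to interpret.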
