# **Predictive Model**

## Objectives

* Train and compare regression models (Linear Regression, Random Forest, XGBoost) that predict movie revenue, then tune XGBoost with GridSearchCV on both the raw and the log-transformed target

## Inputs

* The processed dataset Data/PROCESSED/movies_ready_for_EDA.csv

## Outputs

* R² and RMSE scores for each model, printed in the notebook; no files are written

## Additional Comments

* If you have any additional comments that don't fit in the previous bullets, please state them here. 



---

# Change working directory

* The notebooks are stored in a subfolder, so when running a notebook in the editor you need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Hamas\\AI\\AI_Projects\\Code_Institute_Projects\\hackathon2_team1\\Team1_TMDb_Hackathon_2\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Hamas\\AI\\AI_Projects\\Code_Institute_Projects\\hackathon2_team1\\Team1_TMDb_Hackathon_2'
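Re-running the `os.chdir` cell moves the working directory up one more level each time. A minimal sketch of a guard against that, assuming the notebooks folder is named `jupyter_notebooks` as in the path shown above (the helper name `project_root` is hypothetical):

```python
import os
from pathlib import Path


def project_root(cwd, notebooks_dir="jupyter_notebooks"):
    """Return the parent of cwd only when cwd is the notebooks folder."""
    cwd = Path(cwd)
    # If we are already in the project root, leave the path unchanged.
    return cwd.parent if cwd.name == notebooks_dir else cwd


# Safe to run more than once: the second call is a no-op.
os.chdir(project_root(os.getcwd()))
```

With this guard, running the cell twice leaves the working directory at the project root instead of climbing a second level.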

---

In [22]:
import pandas as pd
import numpy as np
import joblib
import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV


In [23]:
df = pd.read_csv('Data/PROCESSED/movies_ready_for_EDA.csv')
df.head()

Unnamed: 0,Budget,Genres,Homepage,Id,Keywords,Original_language,Original_title,Overview,Popularity,Production_companies,...,Has_tagline,ROI,Log_budget,Log_revenue,Decade,Runtime_bucket,Language_full,Primary_genre,Primary_production_country,Primary_production_company
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,1,11.763566,19.283571,21.748578,2000.0,epic,English,Action,United States of America,Ingenious Film Partners
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,1,3.203333,19.519293,20.683485,2000.0,epic,English,Adventure,United States of America,Walt Disney Pictures
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",...,1,3.59459,19.316769,20.596199,2010.0,very_long,English,Action,United Kingdom,Columbia Pictures
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...",...,1,4.339756,19.336971,20.80479,2010.0,epic,English,Action,United States of America,Legendary Pictures
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]",...,1,1.092843,19.376192,19.464974,2010.0,very_long,English,Action,United States of America,Walt Disney Pictures


In [24]:
# Prepare features and target variable
features = ['Budget', 'Genres', 'Language_full', 'Primary_production_country',
            'Primary_production_company', 'Runtime']
target = 'Revenue'

X = df[features]
y = df[target]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training samples: {len(X_train)}, Test samples: {len(X_test)}")

Training samples: 3842, Test samples: 961


In [25]:
#Preprocessing Pipeline
numeric_features = ['Budget', 'Runtime']
categorical_features = ['Genres', 'Language_full', 
                        'Primary_production_country', 'Primary_production_company']

numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])
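The `Genres` column holds JSON strings, so `OneHotEncoder` treats each distinct full string as a single category rather than encoding individual genres. The dataset also has a `Primary_genre` column that could be used instead. As an alternative, the sketch below extracts the genre names from one such string; the helper name `genre_names` and the sample string are illustrative assumptions, not part of the notebook.

```python
import json


def genre_names(raw):
    """Return the genre names contained in a TMDb-style JSON string."""
    try:
        items = json.loads(raw)
    except (TypeError, ValueError):
        # Missing values (None, NaN) or malformed JSON yield no genres.
        return []
    if not isinstance(items, list):
        return []
    return [item["name"] for item in items
            if isinstance(item, dict) and "name" in item]


# Illustrative string in the same format as the Genres column.
sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
print(genre_names(sample))  # ['Action', 'Adventure']
```

Applied with `df['Genres'].apply(genre_names)`, this would give one list of names per film, which could then be turned into one indicator column per genre.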

In [26]:
# Train Models
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(
        random_state=42, n_estimators=200, max_depth=15
    ),
    "XGBoost": XGBRegressor(
        random_state=42, n_estimators=300, learning_rate=0.1, max_depth=8
    )
}

results = {}

for name, model in models.items():
    pipe = Pipeline(steps=[('preprocessor', preprocessor),
                           ('model', model)])
    
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    
    r2 = r2_score(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    
    results[name] = {'R2': r2, 'RMSE': rmse}
    
    print(f"\n{name} Results:")
    print(f"R²: {r2:.3f}")
    print(f"RMSE: {rmse:.2f}")


Linear Regression Results:
R²: 0.494
RMSE: 114842858.25

Random Forest Results:
R²: 0.618
RMSE: 99742693.90

XGBoost Results:
R²: 0.651
RMSE: 95314356.18
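For reference, the two metrics printed above follow the standard definitions, where $y_i$ is the actual revenue of test film $i$, $\hat{y}_i$ is the prediction, $\bar{y}$ is the mean actual revenue in the test set, and $n$ is the number of test samples:

$$
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
$$

RMSE is expressed in the same units as `Revenue`, so the XGBoost value corresponds to a typical error of roughly 95 million revenue units.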


In [27]:
# Model Comparison
results_df = pd.DataFrame(results).T.sort_values(by='R2', ascending=False)
print("\nModel Comparison:")
print(results_df)

best_model_name = results_df.index[0]
print(f"\n Best Model: {best_model_name}")


Model Comparison:
                         R2          RMSE
XGBoost            0.651202  9.531436e+07
Random Forest      0.618038  9.974269e+07
Linear Regression  0.493632  1.148429e+08

 Best Model: XGBoost
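The ranking above rests on a single train/test split. `cross_val_score` is imported in this notebook but not used, so the following is a minimal sketch of how the same comparison could be cross-validated. It runs on synthetic data from `make_regression` as a stand-in for the movie features, and the helper name `compare_with_cv` is hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score


def compare_with_cv(models, X, y, n_splits=5, seed=42):
    """Return {model name: (mean R², std R²)} over shuffled K-fold splits."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    summary = {}
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
        summary[name] = (float(scores.mean()), float(scores.std()))
    return summary


# Synthetic stand-in data; in the notebook, pass the full pipelines with
# X_train and y_train instead.
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)
demo_models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}
cv_summary = compare_with_cv(demo_models, X_demo, y_demo)
for name, (mean_r2, std_r2) in cv_summary.items():
    print(f"{name}: R² = {mean_r2:.3f} ± {std_r2:.3f}")
```

Reporting the standard deviation alongside the mean shows whether the gap between two models is larger than the fold-to-fold variation.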


In [28]:
# Hyperparameter Tuning with GridSearchCV

xgb_pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', XGBRegressor(random_state=42))
])

# Parameter grid
param_grid = {
    'model__n_estimators': [200, 300, 400],
    'model__max_depth': [6, 8, 10],
    'model__learning_rate': [0.05, 0.1, 0.2],
    'model__subsample': [0.8, 1.0],
    'model__colsample_bytree': [0.8, 1.0]
}


In [29]:
# Initialize GridSearchCV
grid_search = GridSearchCV(
    xgb_pipe,
    param_grid,
    cv=5,
    n_jobs=-1,
    scoring='r2',
    verbose=2
)

grid_search.fit(X_train, y_train)

# Best parameters and score
print("\n Best Parameters Found:")
print(grid_search.best_params_)

print(f"\nBest Cross-Validated R²: {grid_search.best_score_:.4f}")

best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print("\n Test Set Performance (Tuned XGBoost):")
print(f"R²: {r2:.3f}")
print(f"RMSE: {rmse:.2f}")

Fitting 5 folds for each of 108 candidates, totalling 540 fits

 Best Parameters Found:
{'model__colsample_bytree': 0.8, 'model__learning_rate': 0.05, 'model__max_depth': 6, 'model__n_estimators': 200, 'model__subsample': 0.8}

Best Cross-Validated R²: 0.5312

 Test Set Performance (Tuned XGBoost):
R²: 0.670
RMSE: 92764917.41
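The log line "Fitting 5 folds for each of 108 candidates, totalling 540 fits" follows directly from the size of the parameter grid. The sketch below reproduces that arithmetic with the standard library; the helper name `count_fits` is hypothetical.

```python
from itertools import product

# Same grid as the one passed to GridSearchCV in this notebook.
param_grid = {
    'model__n_estimators': [200, 300, 400],
    'model__max_depth': [6, 8, 10],
    'model__learning_rate': [0.05, 0.1, 0.2],
    'model__subsample': [0.8, 1.0],
    'model__colsample_bytree': [0.8, 1.0],
}


def count_fits(grid, n_folds):
    """Return (number of candidate settings, number of cross-validation fits)."""
    n_candidates = len(list(product(*grid.values())))
    return n_candidates, n_candidates * n_folds


n_candidates, n_fits = count_fits(param_grid, n_folds=5)
print(n_candidates, n_fits)  # 108 540
```

With the default `refit=True`, GridSearchCV then performs one additional fit of the best candidate on the full training set. Adding one more three-value parameter to the grid would triple the number of fits, which is worth checking before starting a long search.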


In [30]:
df = pd.read_csv("Data/PROCESSED/movies_ready_for_EDA.csv")

# Now using Log_budget instead of Budget and Log_revenue instead of Revenue
features = ['Log_budget', 'Genres', 'Language_full', 
            'Primary_production_country', 'Primary_production_company', 'Runtime']
target = 'Log_revenue'

X = df[features]
y = df[target]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training samples: {len(X_train)}, Test samples: {len(X_test)}")

Training samples: 3842, Test samples: 961


In [31]:
# Preprocessing 
numeric_features = ['Log_budget', 'Runtime']
categorical_features = ['Genres', 'Language_full', 
                        'Primary_production_country', 'Primary_production_company']

numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

In [32]:
# Hyperparameter Tuning on log-transformed data

xgb_pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', XGBRegressor(random_state=42))
])

param_grid = {
    'model__n_estimators': [200, 300, 400],
    'model__max_depth': [6, 8, 10],
    'model__learning_rate': [0.05, 0.1, 0.2],
    'model__subsample': [0.8, 1.0],
    'model__colsample_bytree': [0.8, 1.0]
}

grid_search = GridSearchCV(
    xgb_pipe,
    param_grid,
    cv=5,
    n_jobs=-1,
    scoring='r2',
    verbose=2
)

grid_search.fit(X_train, y_train)

print("\n🔍 Best Parameters Found:")
print(grid_search.best_params_)
print(f"Best CV R²: {grid_search.best_score_:.4f}")

Fitting 5 folds for each of 108 candidates, totalling 540 fits

🔍 Best Parameters Found:
{'model__colsample_bytree': 0.8, 'model__learning_rate': 0.05, 'model__max_depth': 6, 'model__n_estimators': 200, 'model__subsample': 1.0}
Best CV R²: 0.5392


In [33]:

best_model = grid_search.best_estimator_
y_pred_log = best_model.predict(X_test)

#Metrics in log scale
r2_log = r2_score(y_test, y_pred_log)
rmse_log = np.sqrt(mean_squared_error(y_test, y_pred_log))

print("\n📊 Log-Scale Evaluation:")
print(f"R² (log): {r2_log:.3f}")
print(f"RMSE (log): {rmse_log:.3f}")

# Convert back to original scale
y_test_exp = np.expm1(y_test)
y_pred_exp = np.expm1(y_pred_log)

r2_real = r2_score(y_test_exp, y_pred_exp)
rmse_real = np.sqrt(mean_squared_error(y_test_exp, y_pred_exp))

print("\n💰 Original Revenue Scale Evaluation:")
print(f"R²: {r2_real:.3f}")
print(f"RMSE: {rmse_real:,.2f}")



📊 Log-Scale Evaluation:
R² (log): 0.512
RMSE (log): 5.778

💰 Original Revenue Scale Evaluation:
R²: 0.396
RMSE: 125,478,252.69
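One likely contributor to the lower R² on the original scale is that a fixed error in log space becomes a proportional error after `expm1`, so high-revenue films dominate the squared error. The sketch below illustrates this with made-up revenues and an assumed log error of 0.5; it assumes `Log_revenue` was built with `log1p`, as the use of `np.expm1` above implies, and the helper name `revenue_error` is hypothetical.

```python
import math


def revenue_error(true_revenue, log_error):
    """Revenue-scale error caused by a given additive error in log1p space."""
    true_log = math.log1p(true_revenue)
    predicted_revenue = math.expm1(true_log + log_error)
    return predicted_revenue - true_revenue


# Illustrative revenues only: the same log error costs far more on a large film.
for revenue in (1e6, 1e9):
    error = revenue_error(revenue, log_error=0.5)
    print(f"revenue {revenue:,.0f} -> error {error:,.0f}")
```

The error grows in proportion to the revenue itself, so a model that is reasonable in log space can still score poorly on RMSE and R² measured in revenue units.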


---

NOTE

* You may add as many sections as you want, as long as they support your project workflow.
* All notebook cells should run top-down: no step should require going back to an earlier cell to perform a task, such as refreshing a variable's contents.

---

# Push files to Repo

* In cases where you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.