In [7]:
import os
import joblib
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold, RandomizedSearchCV
import seaborn as sns
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Load preprocessed data
preprocessing_objects = joblib.load("../outputs/preprocessing_all.pkl")

df_final_encoded = preprocessing_objects["data"]
binary_encoder = preprocessing_objects["binary_encoder"]
multi_encoder = preprocessing_objects["multi_encoder"]

print("Preprocessed data loaded successfully!")
df_final_encoded.head()

# Load model training data with feature selection
training_data = joblib.load("../outputs/model_training_data_with_features.pkl")

X_final = training_data["X_final"]  # Only feature-selected columns
y = training_data["y"]              # Target variable

print("Feature-selected data loaded successfully!")

Preprocessed data loaded successfully!
Feature-selected data loaded successfully!


In [8]:
X_train, X_test, y_train, y_test = train_test_split(
    X_final, y, test_size=0.2, random_state=42
)


In [9]:
import xgboost as xgb

# Initialize XGBoost Regressor
xgb_model = xgb.XGBRegressor(
    n_estimators=200,
    random_state=42,
    objective='reg:squarederror'  # Use for regression
)

# Train the model
xgb_model.fit(X_train, y_train)

# Predict on test set
y_pred_xgb = xgb_model.predict(X_test)

# Evaluate performance
mse_xgb = mean_squared_error(y_test, y_pred_xgb)
rmse_xgb = np.sqrt(mse_xgb)
r2_xgb = r2_score(y_test, y_pred_xgb)

print("XGBoost Results:")
print(f"MSE: {mse_xgb:.4f}")
print(f"RMSE: {rmse_xgb:.4f}")
print(f"R² Score: {r2_xgb:.4f}")


XGBoost Results:
MSE: 0.2320
RMSE: 0.4817
R² Score: 0.9073


**Code Explanation**
- The model is initialized with `xgb.XGBRegressor()`, using `n_estimators=200` (number of trees) and `objective='reg:squarederror'` (squared-error loss for regression).
- `random_state=42` ensures **reproducibility** of results.
- The model **learns patterns** from the training data (`X_train`, `y_train`).
- After training, it predicts addiction scores for the **unseen test data** (`X_test`).
- `y_pred_xgb` is an array of predicted values for each test sample.
- **MSE (Mean Squared Error):** Average squared difference between actual and predicted values.
- **RMSE (Root Mean Squared Error):** Square root of MSE, giving error in the same scale as the target.
- **R² Score (Coefficient of Determination):** Measures how well the model explains variance in the target.
    - R² = 1 → perfect prediction
    - R² = 0 → model does no better than always predicting the mean of the target
    - R² < 0 → model does worse than predicting the mean
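
To make the three metric definitions concrete, here is a minimal dependency-free sketch that computes them by hand. The helper name `regression_metrics` and the toy values are hypothetical and are not taken from the addiction dataset; in the notebook itself, `mean_squared_error` and `r2_score` from scikit-learn do this work.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (mse, rmse, r2) for two equal-length sequences of numbers."""
    n = len(y_true)
    # Sum of squared residuals between actual and predicted values
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mse = ss_res / n
    rmse = math.sqrt(mse)
    # Total variation of the actual values around their own mean
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return mse, rmse, r2

# Hypothetical toy values, chosen only to keep the arithmetic easy to follow
y_true = [2.0, 4.0, 6.0, 8.0]
y_pred = [2.5, 3.5, 6.5, 7.5]

mse, rmse, r2 = regression_metrics(y_true, y_pred)
print(f"MSE={mse:.4f}  RMSE={rmse:.4f}  R2={r2:.4f}")

# Always predicting the mean of y_true gives R² = 0
_, _, r2_mean = regression_metrics(y_true, [5.0] * 4)

# Predictions worse than the mean give a negative R²
_, _, r2_bad = regression_metrics(y_true, [8.0, 6.0, 4.0, 2.0])
print(f"R2 (mean only)={r2_mean:.4f}  R2 (reversed)={r2_bad:.4f}")
```

Every toy prediction is off by exactly 0.5, so RMSE comes out as 0.5, which shows why RMSE reads in the same units as the target.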

**Output Explanation**
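- **MSE = 0.2320** → The **mean squared error** between predicted and actual addiction scores is 0.232 (in squared score units).
- **RMSE = 0.4817** → A typical prediction is off by about **0.48 points** from the true score. Because errors are squared before averaging, large misses count for more than they would in a plain average of absolute errors.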
- **R² = 0.9073** → The model explains **90.7% of the variance** in addiction scores.
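
As a quick consistency check on the reported numbers, the sketch below recomputes RMSE from MSE and backs out the variance of the test targets implied by R², using the identity R² = 1 − MSE / Var(y_test). The variable names are illustrative, and the implied variance is only approximate because the inputs are rounded to four decimals.

```python
import math

# Values copied from the printed XGBoost results
mse = 0.2320
rmse = 0.4817
r2 = 0.9073

# RMSE should equal sqrt(MSE) up to rounding
rmse_from_mse = math.sqrt(mse)

# R² = 1 - MSE / Var(y_test)  =>  Var(y_test) = MSE / (1 - R²)
implied_variance = mse / (1 - r2)
implied_std = math.sqrt(implied_variance)

# RMSE relative to the spread of the target
error_to_spread = rmse / implied_std

print(f"sqrt(MSE)          = {rmse_from_mse:.4f}")
print(f"implied Var(y)     = {implied_variance:.2f}")
print(f"implied std(y)     = {implied_std:.2f}")
print(f"RMSE / std(y)      = {error_to_spread:.2f}")
```

Under these assumptions the test targets have a standard deviation of roughly 1.6 points, so an RMSE of 0.48 is about 30% of the target's natural spread.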

✅ **Interpretation:**

- XGBoost performed **well**, but not as accurately as Gradient Boosting or Random Forest on this dataset.
- Its higher RMSE and lower R² compared to Gradient Boosting indicate that, with these settings, it **captures patterns slightly less effectively** here.
- It is still a good model for prediction, but the **Gradient Boosting** model worked better on this data.

In [10]:
# Initialize base XGBoost Regressor
xgb_model = xgb.XGBRegressor(
    objective='reg:squarederror',  # regression task
    random_state=42
)

# Define hyperparameter grid for tuning
param_grid = {
    'n_estimators': [100, 200, 300, 400],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'max_depth': [3, 4, 5, 6, 8],
    'min_child_weight': [1, 3, 5],
    'subsample': [0.7, 0.8, 0.9, 1.0],
    'colsample_bytree': [0.7, 0.8, 0.9, 1.0],
    'gamma': [0, 0.1, 0.2, 0.3]  # minimum loss reduction
}

# 5-Fold Cross-Validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Setup RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=xgb_model,
    param_distributions=param_grid,
    n_iter=50,  # number of random combinations to try
    scoring='neg_mean_squared_error',  # use MSE for regression
    cv=kf,
    verbose=2,
    n_jobs=-1,
    random_state=42
)

# Fit RandomizedSearchCV on training data
random_search.fit(X_train, y_train)

# Best hyperparameters
print("Best Hyperparameters:", random_search.best_params_)

# Best tuned model
best_xgb = random_search.best_estimator_

# Predict on test set
y_pred = best_xgb.predict(X_test)

# Evaluate performance
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("\nXGBoost (Tuned) Results:")
print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"R² Score: {r2:.4f}")


Fitting 5 folds for each of 50 candidates, totalling 250 fits
Best Hyperparameters: {'subsample': 1.0, 'n_estimators': 300, 'min_child_weight': 3, 'max_depth': 4, 'learning_rate': 0.1, 'gamma': 0.1, 'colsample_bytree': 1.0}

XGBoost (Tuned) Results:
MSE: 0.1367
RMSE: 0.3697
R² Score: 0.9454


**Code Explanation**
- **Base model:** `xgb.XGBRegressor` is created without any tuned values.
    - `objective='reg:squarederror'` → trains with squared-error loss, the standard regression objective.
    - `random_state=42` → reproducible results.
    - At this stage, every other hyperparameter is left at its **default**.
- **Search space (`param_grid`):** each key is a hyperparameter to tune, and each list holds its candidate values.
    - **`n_estimators`** → Number of trees (boosting rounds) in the model.
    - **`learning_rate`** → Shrinks the contribution of each tree, which helps prevent overfitting.
    - **`max_depth`** → Maximum depth allowed for each individual tree.
    - **`min_child_weight`** → Minimum sum of instance weights required in a child node; higher values make splitting more conservative.
    - **`subsample`** → Fraction of training samples used to grow each tree.
    - **`colsample_bytree`** → Fraction of features sampled for each tree.
    - **`gamma`** → Minimum loss reduction required to make a split (controls tree splitting).
- **Cross-validation (`KFold`):** splits the training data into 5 parts, 4 for training and 1 for validation, and repeats 5 times so that each part serves as the validation fold once.
    - `shuffle=True` → shuffles the rows before splitting; `random_state=42` makes the shuffle reproducible.
- **`RandomizedSearchCV`:** samples `n_iter=50` random hyperparameter combinations from the grid and evaluates each one with 5-fold CV (`cv=kf`), for 250 fits in total.
    - `scoring='neg_mean_squared_error'` → scikit-learn always maximizes the score, so MSE is negated; the highest score corresponds to the lowest MSE.
    - `n_jobs=-1` → uses all CPU cores for faster computation.
    - `best_params_` → the hyperparameter combination with the **lowest mean CV MSE**.
    - `best_estimator_` → an XGBoost model with those hyperparameters, **refit on the full training set** (the default `refit=True` behavior).
- **Evaluation:** the tuned model predicts addiction scores for the unseen test data and is scored with the same metrics as before.
    - **MSE** → average squared error (lower is better).
    - **RMSE** → error in the same units as the target.
    - **R² score** → proportion of variance explained by the model (closer to 1 = better fit).
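
To put the randomized search in perspective, the following sketch counts how many combinations the grid contains and how small a share `n_iter=50` covers. It uses only the standard library; the single random draw at the end is an illustration of what "one candidate" looks like and is not scikit-learn's own sampler.

```python
import math
import random

# Same candidate values as the param_grid in the tuning cell
param_grid = {
    "n_estimators": [100, 200, 300, 400],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "max_depth": [3, 4, 5, 6, 8],
    "min_child_weight": [1, 3, 5],
    "subsample": [0.7, 0.8, 0.9, 1.0],
    "colsample_bytree": [0.7, 0.8, 0.9, 1.0],
    "gamma": [0, 0.1, 0.2, 0.3],
}

n_iter = 50
n_folds = 5

# Size of the full grid: product of the number of candidates per parameter
n_combinations = math.prod(len(values) for values in param_grid.values())

# Model fits needed by an exhaustive grid search vs. the randomized search
fits_exhaustive = n_combinations * n_folds
fits_randomized = n_iter * n_folds
coverage = n_iter / n_combinations

print(f"Combinations in grid : {n_combinations}")
print(f"Fits (exhaustive)    : {fits_exhaustive}")
print(f"Fits (randomized)    : {fits_randomized}")
print(f"Share of grid tried  : {coverage:.2%}")

# Illustrative draw of one candidate (not scikit-learn's sampling code)
rng = random.Random(42)
candidate = {name: rng.choice(values) for name, values in param_grid.items()}
print("Example candidate    :", candidate)
```

The randomized search tries well under 1% of the grid, which is why the reported best hyperparameters are the best of the sampled candidates and not necessarily the best in the whole grid.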

### **Output comparison**
### 1. **Mean Squared Error (MSE)**

- **Before Tuning**: `0.2320`
- **After Tuning**: `0.1367`
    
    ✅ Big improvement: MSE fell by about 41%, so the tuned model's predictions are much closer to the actual values.
    

### 2. **Root Mean Squared Error (RMSE)**

- **Before Tuning**: `0.4817`
- **After Tuning**: `0.3697`
    
    ✅ Clear reduction: the typical prediction error dropped by about 0.11 points (roughly 23%).
    

### 3. **R² Score**

- **Before Tuning**: `0.9073`
- **After Tuning**: `0.9454`
    
    ✅ Noticeable gain: R² rose by about 0.038, which cuts the unexplained share of variance from about 9.3% to about 5.5%.
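
The size of these changes can be worked out directly from the printed results. The sketch below is plain arithmetic on the reported values; the dictionary and variable names are illustrative only.

```python
# Metrics copied from the two printed result blocks
before = {"MSE": 0.2320, "RMSE": 0.4817, "R2": 0.9073}
after = {"MSE": 0.1367, "RMSE": 0.3697, "R2": 0.9454}

# Relative reduction in the error metrics
mse_drop = (before["MSE"] - after["MSE"]) / before["MSE"]
rmse_drop = (before["RMSE"] - after["RMSE"]) / before["RMSE"]

# Absolute gain in R² and the unexplained share of variance (1 - R²)
r2_gain = after["R2"] - before["R2"]
unexplained_before = 1 - before["R2"]
unexplained_after = 1 - after["R2"]

print(f"MSE reduction        : {mse_drop:.1%}")
print(f"RMSE reduction       : {rmse_drop:.1%}")
print(f"R2 gain              : {r2_gain:.4f}")
print(f"Unexplained variance : {unexplained_before:.1%} -> {unexplained_after:.1%}")

# Both models are scored on the same test set, so the ratio of unexplained
# variance should match the ratio of MSE values (up to rounding).
mse_ratio = after["MSE"] / before["MSE"]
unexplained_ratio = unexplained_after / unexplained_before
```

The two ratios agree closely, which is expected because R² and MSE are computed against the same test targets.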
    

### **Interpretation**

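- **Before tuning**, XGBoost was **weaker** (R² ≈ 0.91, decent but lower than Gradient Boosting).
- **After tuning**, XGBoost improved **substantially** (R² ≈ 0.95), almost matching the tuned Gradient Boosting model.
- This suggests that XGBoost's performance on this dataset is **sensitive to hyperparameters**: with mostly default settings it underperformed, and with tuning it approached the top-performing model. Both results come from a single 80/20 train/test split, so the size of the gap may differ on another split.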