# Task 2.3: Advanced Model - XGBoost Classifier Implementation

## Objective

The goal of this task is to implement an XGBoost (Extreme Gradient Boosting) Classifier with hyperparameter tuning. XGBoost is a powerful gradient boosting algorithm that often outperforms Random Forest by sequentially building trees that correct errors from previous trees. We will compare its performance with both the baseline Logistic Regression and Random Forest models.

## Understanding Key Metrics

Since this is a supervised classification task, we use specific metrics to evaluate how well our model performs:

1. **Accuracy:** Measures the overall correctness of predictions. Ranges from 0 to 1, where 1 indicates perfect classification.
2. **Precision (Macro):** Unweighted average of per-class precision, where precision is the proportion of predictions for a class that are correct.
3. **Recall (Macro):** Unweighted average of per-class recall, where recall is the proportion of a class's actual samples that are correctly identified.
4. **F1-Score (Macro):** Unweighted average of per-class F1, where F1 is the harmonic mean of that class's precision and recall. Ranges from 0 to 1.
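
To make the macro averages concrete, here is a minimal sketch that computes them by hand. The `y_true` and `y_pred` values are invented for illustration and are not taken from the dataset. The calculation is meant to mirror what `precision_score`, `recall_score`, and `f1_score` return with `average='macro'`.

```python
# Toy 3-class example; these labels are made up for illustration.
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 0, 1, 2, 2, 2, 0, 2]


def per_class_scores(y_true, y_pred, cls):
    """Return (precision, recall, f1), treating `cls` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


classes = sorted(set(y_true))
scores = [per_class_scores(y_true, y_pred, c) for c in classes]

# Accuracy counts exact matches; macro metrics average the per-class values.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_precision = sum(s[0] for s in scores) / len(classes)
macro_recall = sum(s[1] for s in scores) / len(classes)
macro_f1 = sum(s[2] for s in scores) / len(classes)

print(f"accuracy={accuracy:.4f}  macro_f1={macro_f1:.4f}")
```

Note that macro F1 is the average of the per-class F1 values, not the harmonic mean of macro precision and macro recall.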

## Why XGBoost?

XGBoost offers several advantages over Random Forest and other algorithms:

1. **Sequential Learning:** Unlike Random Forest (parallel trees), XGBoost builds trees sequentially, where each new tree corrects errors from previous ones.
2. **Gradient Boosting:** Each new tree is fit to the gradient of the loss with respect to the current predictions, so adding it moves the ensemble in the direction that reduces the loss.
3. **Regularization:** Built-in L1 and L2 penalties on leaf weights, plus a penalty on tree complexity, help control overfitting.
4. **Handling Missing Values:** At each split, learns a default direction to send samples whose value for that feature is missing.
5. **Speed and Performance:** Optimized implementation with parallel processing and tree pruning.
6. **Feature Importance:** Provides multiple ways to measure feature importance (gain, cover, frequency).
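
The regularization in item 3 can be written out. The following is a sketch of the regularized objective as it is usually presented for XGBoost, with the L1 term added to match the library's `reg_alpha` parameter. Here $l$ is the training loss, $f_k$ is the $k$-th tree, $T$ is the number of leaves in a tree, and $w_j$ is the weight of leaf $j$.

```latex
\mathrm{Obj} = \sum_{i=1}^{n} l\bigl(y_i, \hat{y}_i\bigr) + \sum_{k=1}^{K} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2} + \alpha \sum_{j=1}^{T} \lvert w_j \rvert
```

In the Python API, $\gamma$, $\lambda$, and $\alpha$ correspond to the `gamma`, `reg_lambda`, and `reg_alpha` parameters. This notebook does not set them, so they stay at the library defaults.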

## XGBoost vs Random Forest:

| Aspect | Random Forest | XGBoost |
|--------|--------------|----------|
| Tree Building | Parallel (independent) | Sequential (corrective) |
| Learning Method | Bagging (averaging) | Boosting (error correction) |
| Overfitting Risk | Lower | Higher (needs tuning) |
| Training Speed | Often faster (trees can be built in parallel) | Often slower (trees are built in sequence) |
| Prediction Accuracy | Good | Often Better |

## Step 1: Environment Setup and Data Discovery

In this step, we import the required libraries and load the preprocessed dataset from the project directory.

### Libraries Used:
- **pandas & numpy:** For data manipulation and numerical operations
- **XGBClassifier:** The main algorithm, from the xgboost library
- **sklearn.metrics:** For evaluating model performance
- **LabelEncoder:** To convert categorical target labels to numeric format
- **pickle:** For saving the trained model

### Data Loading:
We load the scaled training and testing sets that were prepared in previous tasks. Tree-based models such as XGBoost split on feature thresholds, so rescaling a feature has little or no effect on the splits they find. We use the scaled features here only so that all models are trained on the same inputs.
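
The claim that XGBoost can work with unscaled data can be checked on a toy example. The sketch below is a simplified stand-in for tree split finding (a single Gini-based split on one feature, not XGBoost's actual algorithm). The feature values and labels are made up. It shows that standardizing the feature leaves the chosen partition of samples unchanged, because only the ordering of values matters.

```python
from statistics import mean, pstdev


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split_mask(values, labels):
    """Return a tuple of booleans marking which samples go to the left
    child of the split with the lowest weighted Gini impurity."""
    best_mask, best_score = None, float("inf")
    uniq = sorted(set(values))
    for lo, hi in zip(uniq, uniq[1:]):
        thr = (lo + hi) / 2  # candidate threshold between neighbouring values
        left = [y for x, y in zip(values, labels) if x <= thr]
        right = [y for x, y in zip(values, labels) if x > thr]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_mask, best_score = tuple(x <= thr for x in values), score
    return best_mask


# Made-up feature values (for example, a price-like column) and binary labels.
raw = [50, 80, 120, 200, 400, 950]
labels = [0, 0, 1, 0, 1, 1]

# Standardize the feature: subtract the mean, divide by the standard deviation.
mu, sigma = mean(raw), pstdev(raw)
scaled = [(v - mu) / sigma for v in raw]

mask_raw = best_split_mask(raw, labels)
mask_scaled = best_split_mask(scaled, labels)
print(mask_raw == mask_scaled)
```

The threshold value itself changes after scaling, but the samples on each side of it do not.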

### Important Note:
XGBoost requires target labels to be in the range [0, num_classes-1]. Our LabelEncoder ensures this format.
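
As a rough illustration of what the encoding step produces, the sketch below maps string categories to consecutive integers starting at 0, using the sorted order of the distinct labels as `LabelEncoder` does. The category names are hypothetical; the real ones come from the `value_category` column.

```python
# Hypothetical category names, used only for illustration.
labels = ["Good", "Poor", "Fair", "Good", "Poor"]

# Sort the distinct labels, then map each one to its position in that order.
classes = sorted(set(labels))
to_index = {name: i for i, name in enumerate(classes)}
encoded = [to_index[name] for name in labels]

# Decoding is a lookup back into the sorted class list.
decoded = [classes[i] for i in encoded]

print(classes)
print(encoded)
```

The encoded values always cover 0 to `len(classes) - 1` with no gaps, which is the format the note above describes.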

In [None]:
import pandas as pd
import numpy as np
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.preprocessing import LabelEncoder
import pickle
import warnings
warnings.filterwarnings('ignore')

# Load data
X_train = pd.read_csv('../../data/processed/X_train_scaled.csv')
X_test = pd.read_csv('../../data/processed/X_test_scaled.csv')
y_train = pd.read_csv('../../data/processed/y_train.csv')
y_test = pd.read_csv('../../data/processed/y_test.csv')

if 'id' in X_train.columns:
    X_train = X_train.drop('id', axis=1)
if 'id' in X_test.columns:
    X_test = X_test.drop('id', axis=1)

# Encode target labels
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train['value_category'])
y_test_encoded = label_encoder.transform(y_test['value_category'])


print(f"Training samples: {X_train.shape[0]}")
print(f"Testing samples: {X_test.shape[0]}")
print(f"Number of features: {X_train.shape[1]}")
print(f"\nTarget distribution (Training):")
unique, counts = np.unique(y_train_encoded, return_counts=True)
for val, count in zip(unique, counts):
    category = label_encoder.classes_[val]
    print(f"  Class {val} ({category}): {count} samples ({count/len(y_train_encoded)*100:.2f}%)")

## Removing ID Columns from Features

**Why we remove ID columns:**

1. **IDs are unique identifiers, not predictive features** - They don't contain information about value categories
2. **Including IDs can encourage overfitting** - The model may memorize specific listings instead of learning patterns
3. **IDs have no relationship with value categories** - A listing's ID number doesn't determine its value

This is a critical preprocessing step to ensure our model learns meaningful patterns rather than memorizing training data.

## Step 2: Model Training with Hyperparameter Tuning

We train an XGBoost Classifier with a fixed set of manually chosen hyperparameters. No automated search is run in this notebook; the values below are common starting points chosen to balance model performance against overfitting.
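
This notebook does not show a search over hyperparameter values. If one were added, the candidate settings could be enumerated as in the sketch below. The search ranges are assumptions chosen around the values used in this task, and `iter_param_combinations` is a hypothetical helper.

```python
from itertools import product

# Assumed search space around the values used in this notebook.
param_grid = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [4, 6, 8],
    "n_estimators": [100, 200],
}


def iter_param_combinations(grid):
    """Yield one dict per combination of the values in `grid`."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


candidates = list(iter_param_combinations(param_grid))
print(f"{len(candidates)} candidate settings")
print(candidates[0])
```

Each candidate would then be scored with cross-validation on the training data only, for example with scikit-learn's `GridSearchCV`, so that the test set stays untouched until the final evaluation.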

### Hyperparameter Explanation:

1. **learning_rate=0.1** (also called eta)
   - Controls how much each tree contributes to the final prediction
   - Lower values (0.01-0.1) make the model more robust but require more trees
   - 0.1 is a good balance between training speed and accuracy
   - Think of it as "step size" - smaller steps are more careful but slower

2. **max_depth=6**
   - Maximum depth of each tree
   - Deeper trees can capture more complex patterns but risk overfitting
   - 6 is a common default that works well for most problems
   - Shallower than the Random Forest setting (20) because boosting combines many small trees, each making only a modest correction

3. **n_estimators=200**
   - Number of boosting rounds (trees to build)
   - More trees generally improve performance but increase training time
   - 200 is higher than Random Forest (100) because each tree is simpler
   - With learning_rate=0.1, 200 trees provides good convergence

4. **subsample=0.8**
   - Fraction of training samples used for each tree (80%)
   - Each tree sees a random subset of about 80% of the rows, drawn without replacement, which helps reduce overfitting
   - Plays a similar role to Random Forest's bootstrap sampling, which draws rows with replacement
   - Values between 0.5 and 0.9 are common choices; 0.8 is used here

5. **colsample_bytree=0.8**
   - Fraction of features used for each tree (80%)
   - Randomly samples 80% of features, adding diversity to trees
   - Helps prevent overfitting and speeds up training
   - Similar to Random Forest's feature sampling

6. **objective='multi:softmax'**
   - Multi-class objective based on the softmax (cross-entropy) loss
   - Predictions are class labels (0, 1, 2) rather than probabilities
   - Alternative: 'multi:softprob' uses the same loss but outputs a probability for each class

7. **eval_metric='mlogloss'**
   - Evaluation metric: multi-class log loss
   - Measures how well predicted probabilities match true labels
   - Lower values indicate better performance

8. **random_state=42**
   - Ensures reproducibility across runs

9. **n_jobs=-1**
   - Uses all available CPU cores for parallel processing
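
The softmax objective and the mlogloss metric from items 6 and 7 can be illustrated with a few lines of arithmetic. This is a sketch on made-up raw scores, not output from the trained model. It converts per-class scores into probabilities, takes the most likely class, and computes the multi-class log loss.

```python
import math


def softmax(scores):
    """Convert raw per-class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mlogloss(y_true, probas):
    """Mean negative log of the probability assigned to the true class."""
    return -sum(math.log(p[t]) for t, p in zip(y_true, probas)) / len(y_true)


# Made-up raw scores for three samples and three classes.
raw_scores = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3], [-0.5, 1.5, 0.0]]
y_true = [0, 2, 1]

probas = [softmax(s) for s in raw_scores]
predicted = [p.index(max(p)) for p in probas]  # class with the highest probability
loss = mlogloss(y_true, probas)

print(predicted)
print(f"mlogloss={loss:.4f}")
```

All three predictions are correct here, yet the loss is not zero: the second sample's scores are close together, so the model is penalized for being unsure. A model that always predicts equal probabilities for three classes scores `log(3)`, about 1.0986, which is a useful reference point.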

### How XGBoost Works:

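1. **Start:** Begin with a constant initial prediction and build a first tree to improve it
2. **Calculate Errors:** Measure how far the current predictions are from the targets, using the gradient of the loss
3. **Build Next Tree:** Fit a new tree that targets those errors
4. **Repeat:** Each new tree corrects errors left by the ensemble so far
5. **Final Prediction:** Sum the outputs of all trees, each scaled by the learning rate, then convert the per-class scores into a class prediction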

This sequential error-correction approach often leads to better accuracy than Random Forest's parallel approach.
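
The loop described above can be sketched in plain Python. This is a deliberately simplified version, assuming squared-error regression on one feature with one-split trees (stumps). XGBoost itself also uses second-order gradient information and regularization, and for multi-class problems builds one tree per class in each round. The data and the helper names `fit_stump`, `boost`, and `predict` are made up for illustration.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree to the residuals by minimizing squared error."""
    best = None
    uniq = sorted(set(xs))
    for lo, hi in zip(uniq, uniq[1:]):
        thr = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lmean, rmean)
    _, thr, lmean, rmean = best
    return lambda x: lmean if x <= thr else rmean


def boost(xs, ys, n_rounds=20, learning_rate=0.1):
    """Sequentially add stumps, each fit to the current residuals."""
    base = sum(ys) / len(ys)  # constant starting prediction
    preds = [base] * len(ys)
    trees, losses = [], []
    for _ in range(n_rounds):
        # For squared error, the residual is the negative gradient of the loss.
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        # Shrink each tree's contribution by the learning rate.
        preds = [p + learning_rate * tree(x) for x, p in zip(xs, preds)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, trees, losses


def predict(x, base, trees, learning_rate=0.1):
    """Final prediction: starting value plus the scaled sum of all tree outputs."""
    return base + learning_rate * sum(tree(x) for tree in trees)


# Made-up one-dimensional data with three rough levels.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.8, 5.2, 5.0]

base, trees, losses = boost(xs, ys)
print(f"MSE after round 1: {losses[0]:.4f}, after round 20: {losses[-1]:.4f}")
```

The training error shrinks with every round, which is also why training accuracy alone says little about how well a boosted model generalizes.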

In [None]:
# Train the model with the manually chosen hyperparameters described above
xgb_model = XGBClassifier(
    learning_rate=0.1,
    max_depth=6,
    n_estimators=200,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='multi:softmax',
    eval_metric='mlogloss',
    random_state=42,
    n_jobs=-1
)
xgb_model.fit(X_train, y_train_encoded)

print("Model training complete!")

## Step 3: Model Evaluation

We evaluate accuracy on both the training and testing sets, and macro-averaged precision, recall, and F1 on the testing set.

### Understanding the Metrics:

**Training vs Testing Accuracy:**
- **Training Accuracy:** How well the model performs on data it has seen during training
- **Testing Accuracy:** How well the model generalizes to new, unseen data
- **Gap between them:** A large gap indicates overfitting (model memorized training data)
- **XGBoost Note:** Due to boosting, training accuracy may be very high; focus on test accuracy

**Macro-Averaged Metrics:**
- Calculate metric for each class separately, then average them
- Treats all classes equally, regardless of their size
- Important for imbalanced datasets to ensure all classes are considered

**What to Look For:**
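- Testing accuracy is expected, but not guaranteed, to be higher than for Random Forest and Logistic Regression
- Training accuracy may be higher than Random Forest's, since each boosting round fits the training data more closely
- Macro F1-Score is a useful single summary because it balances precision and recall across all classes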
- Compare with previous models to see if XGBoost provides improvement

In [None]:
# Predictions
y_train_pred = xgb_model.predict(X_train)
y_test_pred = xgb_model.predict(X_test)

# Calculate metrics
train_acc = accuracy_score(y_train_encoded, y_train_pred)
test_acc = accuracy_score(y_test_encoded, y_test_pred)
test_precision = precision_score(y_test_encoded, y_test_pred, average='macro')
test_recall = recall_score(y_test_encoded, y_test_pred, average='macro')
test_f1 = f1_score(y_test_encoded, y_test_pred, average='macro')

print("\nModel Performance:")
print(f"  Training Accuracy: {train_acc:.4f}")
print(f"  Testing Accuracy: {test_acc:.4f}")
print(f"  Precision (Macro): {test_precision:.4f}")
print(f"  Recall (Macro): {test_recall:.4f}")
print(f"  F1-Score (Macro): {test_f1:.4f}")

## Step 4: Detailed Classification Report

The classification report provides per-class performance metrics, helping us understand which value categories the model predicts well and which ones it struggles with.

### Reading the Report:

**For Each Class:**
- **Precision:** Of all listings predicted as this class, what percentage were correct?
- **Recall:** Of all actual listings in this class, what percentage did we correctly identify?
- **F1-Score:** Harmonic mean of precision and recall for this class
- **Support:** Number of actual samples in this class

**Overall Metrics:**
- **Accuracy:** Overall correctness across all classes
- **Macro avg:** Simple average of metrics across all classes (treats each class equally)
- **Weighted avg:** Average weighted by the number of samples in each class
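
The difference between the macro and weighted averages is easiest to see with numbers. The per-class recall and support values below are invented for illustration and do not come from this model.

```python
# Made-up per-class recall and support for three classes.
recall = [0.90, 0.80, 0.40]
support = [700, 250, 50]

# Macro: every class counts equally.
macro_avg = sum(recall) / len(recall)

# Weighted: each class counts in proportion to its number of samples.
weighted_avg = sum(r * s for r, s in zip(recall, support)) / sum(support)

print(f"macro avg:    {macro_avg:.2f}")
print(f"weighted avg: {weighted_avg:.2f}")
```

Here the macro average is about 0.70 while the weighted average is about 0.85. The small third class is predicted poorly, and only the macro average makes that visible.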

### What to Look For:
- Compare with Random Forest: Is XGBoost better at predicting all classes or just some?
- Are all three classes performing similarly, or is one class much harder to predict?
- Low recall for a class means we're missing many actual instances of that class
- Low precision for a class means we're incorrectly labeling other classes as this one

### Expected Improvement:
XGBoost may show more balanced performance across the three value categories than the previous models; the per-class rows of the report show whether it does.

In [None]:
print(classification_report(y_test_encoded, y_test_pred, 
                    target_names=label_encoder.classes_))

## Step 5: Comparison with Random Forest

Let's load the Random Forest results and compare them with XGBoost to see which model performs better.

### Why Compare?

1. **Validate Improvement:** Confirm that XGBoost's complexity is justified by better performance
2. **Understand Trade-offs:** XGBoost may be slower to train but more accurate
3. **Model Selection:** Choose the best model for deployment based on performance and requirements
4. **Learning Insights:** Understand which algorithm works better for this specific problem

### Key Comparison Points:

- **Test Accuracy:** Which model generalizes better to unseen data?
- **F1-Score:** Which model has better overall balance of precision and recall?
- **Training Time:** Is the performance gain worth the extra training time?
- **Overfitting:** Which model has a smaller gap between training and testing accuracy?

### Expected Outcome:

XGBoost often outperforms Random Forest on structured/tabular data like our Airbnb dataset, but the improvement may be modest (on the order of 1-5% in accuracy) and is not guaranteed.
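
When reading an accuracy gain, it helps to separate the absolute difference from the relative change. The sketch below uses made-up accuracies, and `improvement` is a hypothetical helper, not part of the notebook.

```python
def improvement(new, baseline):
    """Return (absolute difference, relative change in percent)."""
    absolute = new - baseline
    relative_pct = absolute / baseline * 100
    return absolute, relative_pct


# Made-up test accuracies for illustration.
abs_gain, rel_gain = improvement(0.82, 0.80)
print(f"absolute: {abs_gain:+.4f}, relative: {rel_gain:+.2f}%")
```

An increase from 0.80 to 0.82 is a gain of 2 percentage points, or 2.5% relative to the baseline. Stating which of the two is meant avoids confusion when comparing models.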

In [None]:
# Load Random Forest results for comparison
try:
    rf_results = pd.read_csv('../../data/processed/random_forest_results.csv')
    
    print("\n" + "="*60)
    print("MODEL COMPARISON: XGBoost vs Random Forest")
    print("="*60)
    
    print("\nRandom Forest Performance:")
    print(f"  Training Accuracy: {rf_results['train_accuracy'].values[0]:.4f}")
    print(f"  Testing Accuracy:  {rf_results['test_accuracy'].values[0]:.4f}")
    print(f"  Precision (Macro): {rf_results['precision_macro'].values[0]:.4f}")
    print(f"  Recall (Macro):    {rf_results['recall_macro'].values[0]:.4f}")
    print(f"  F1-Score (Macro):  {rf_results['f1_macro'].values[0]:.4f}")
    
    print("\nXGBoost Performance:")
    print(f"  Training Accuracy: {train_acc:.4f}")
    print(f"  Testing Accuracy:  {test_acc:.4f}")
    print(f"  Precision (Macro): {test_precision:.4f}")
    print(f"  Recall (Macro):    {test_recall:.4f}")
    print(f"  F1-Score (Macro):  {test_f1:.4f}")
    
    # Pull the Random Forest baselines into named variables for readability
    rf_test_acc = rf_results['test_accuracy'].values[0]
    rf_f1 = rf_results['f1_macro'].values[0]
    acc_diff = test_acc - rf_test_acc
    f1_diff = test_f1 - rf_f1

    # Report the absolute difference and the change relative to Random Forest
    print("\nImprovement (XGBoost - Random Forest):")
    print(f"  Testing Accuracy:  {acc_diff:+.4f} ({acc_diff / rf_test_acc * 100:+.2f}% relative)")
    print(f"  F1-Score (Macro):  {f1_diff:+.4f} ({f1_diff / rf_f1 * 100:+.2f}% relative)")
    
    if test_acc > rf_test_acc:
        print("\nXGBoost outperforms Random Forest on test accuracy.")
    elif test_acc == rf_test_acc:
        print("\nXGBoost and Random Forest perform equally on test accuracy.")
    else:
        print("\nRandom Forest performs better than XGBoost on test accuracy.")
        print("   Consider: 1) Different hyperparameters, 2) More training data, 3) Feature engineering")
    
    print("="*60)
    
except FileNotFoundError:
    print("\nRandom Forest results not found. Please run Task 2.2 first.")
    print("Comparison will be skipped.")

## Step 6: Feature Importance Analysis

XGBoost provides feature importance scores that indicate which features contribute most to the model's predictions.

### How XGBoost Feature Importance Works:

XGBoost offers three types of feature importance:

1. **Gain:** Average gain of splits using this feature
   - Measures the average reduction in the training loss from splits on this feature
   - Higher gain = the feature's splits contribute more to fitting the data
   - This is the default for the feature_importances_ attribute on tree-based XGBoost models

2. **Cover:** Average coverage of splits using this feature
   - Roughly the number of samples affected by splits on this feature
   - Shows how broadly a feature is used

3. **Frequency (Weight):** Number of times a feature is used in splits
   - How often the feature appears in trees
   - High frequency doesn't always mean high importance

We use **Gain** because it reflects how much each feature's splits reduce the loss, rather than only how often the feature is used.
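
The three importance types can be compared on a toy set of splits. The sketch below is a simplification: the split records and feature names are hypothetical, and cover is approximated by the number of samples reaching each split. It shows how a feature can rank first by frequency but not by gain.

```python
from collections import defaultdict

# Hypothetical split records: (feature, loss reduction, samples reaching the split).
splits = [
    ("price", 12.0, 1000),
    ("review_score", 8.0, 600),
    ("price", 4.0, 400),
    ("bedrooms", 1.0, 150),
    ("price", 2.0, 100),
]

# Accumulate per-feature totals over all splits.
totals = defaultdict(lambda: {"weight": 0, "total_gain": 0.0, "total_cover": 0.0})
for feature, gain, cover in splits:
    totals[feature]["weight"] += 1
    totals[feature]["total_gain"] += gain
    totals[feature]["total_cover"] += cover

# Gain and cover are averages per split; weight is the raw split count.
importance = {
    f: {
        "weight": t["weight"],
        "gain": t["total_gain"] / t["weight"],
        "cover": t["total_cover"] / t["weight"],
    }
    for f, t in totals.items()
}

top_by_weight = max(importance, key=lambda f: importance[f]["weight"])
top_by_gain = max(importance, key=lambda f: importance[f]["gain"])
print(f"top by weight: {top_by_weight}, top by gain: {top_by_gain}")
```

In this toy data `price` is used most often, but its average gain is pulled down by two weak splits, so `review_score` ranks first by gain.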

### Why This Matters:

- **Model Interpretability:** Understand what drives the model's decisions
- **Feature Selection:** Identify which features could potentially be removed
- **Business Insights:** Learn which listing characteristics most affect value perception
- **Validation:** Ensure the model is using sensible features (not just noise)
- **Comparison with Random Forest:** See if both models agree on important features

### Expected Important Features:

We expect features like review scores, price-related features, location, and amenities to be highly important, as these directly relate to a listing's value proposition.

In [None]:
# Get feature importances (gain-based by default for tree boosters, normalized to sum to 1)
feature_importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': xgb_model.feature_importances_
}).sort_values('importance', ascending=False)

print("\nTop 10 Most Important Features (XGBoost):")
print(feature_importance.head(10).to_string(index=False))

# Save feature importance
feature_importance.to_csv('../../data/processed/xgboost_feature_importance.csv', index=False)
print("\nFeature importance saved to: ../../data/processed/xgboost_feature_importance.csv")

# Compare with Random Forest feature importance if available
try:
    rf_importance = pd.read_csv('../../data/processed/random_forest_feature_importance.csv')
    
    print("\n" + "="*60)
    print("FEATURE IMPORTANCE COMPARISON")
    print("="*60)
    print("\nTop 5 Features - Random Forest:")
    print(rf_importance.head(5)['feature'].to_string(index=False))
    
    print("\nTop 5 Features - XGBoost:")
    print(feature_importance.head(5)['feature'].to_string(index=False))
    
    # Find common important features
    rf_top10 = set(rf_importance.head(10)['feature'])
    xgb_top10 = set(feature_importance.head(10)['feature'])
    common_features = rf_top10.intersection(xgb_top10)
    
    print(f"\nCommon features in both models' top 10: {len(common_features)}")
    if common_features:
        print("Features:", ', '.join(sorted(common_features)))
    print("="*60)
    
except FileNotFoundError:
    print("\nRandom Forest feature importance not found. Comparison skipped.")

## Step 7: Saving Results

We save all important outputs for future reference and comparison with other models.

### Files Saved:

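1. **xgboost_model.pkl** - The trained model object
   - Can be loaded later for making predictions without retraining
   - Preserves all learned parameters and tree structures
   - Includes all hyperparameter settings
   - Should be loaded with the same Python and xgboost versions, since pickle files are not guaranteed to be portable across versions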

2. **xgboost_results.csv** - Summary of model performance metrics
   - Allows easy comparison with other models (Logistic Regression, Random Forest)
   - Contains all key metrics in one place
   - Essential for final model selection

3. **xgboost_predictions.csv** - Actual vs predicted values
   - Useful for error analysis and understanding misclassifications
   - Can be used to create confusion matrices
   - Helps identify which listings are hardest to classify

4. **xgboost_feature_importance.csv** - Feature importance rankings
   - Documents which features the model relies on most
   - Useful for feature selection and model interpretation
   - Can be compared with Random Forest importance
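
As an example of the error analysis mentioned for the predictions file, a confusion matrix can be built directly from the true and predicted labels. The sketch below uses made-up labels and a hypothetical `confusion_matrix` helper written in plain Python. With the real file, the `y_true` and `y_pred` columns would take the place of the toy lists.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Count samples per (true class, predicted class) pair.
    Rows are true classes; columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Made-up encoded labels for three classes.
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]

cm = confusion_matrix(y_true, y_pred, n_classes=3)
for row in cm:
    print(row)
```

Diagonal entries are correct predictions. Each off-diagonal entry shows which pair of classes the model confuses and how often.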

### Why Save These Files:

- **Reproducibility:** Can recreate results without rerunning the entire notebook
- **Comparison:** Easy to compare XGBoost with Random Forest and Logistic Regression
- **Documentation:** Provides a record of model performance for reports
- **Deployment:** The saved model can be used in production applications
- **Model Selection:** Having all results saved makes final model selection easier

In [None]:
# Save results
results_df = pd.DataFrame({
    'model': ['XGBoost'],
    'train_accuracy': [train_acc],
    'test_accuracy': [test_acc],
    'precision_macro': [test_precision],
    'recall_macro': [test_recall],
    'f1_macro': [test_f1]
})
results_df.to_csv('../../data/processed/xgboost_results.csv', index=False)

# Save predictions
predictions_df = pd.DataFrame({
    'y_true': y_test_encoded,
    'y_pred': y_test_pred
})
predictions_df.to_csv('../../data/processed/xgboost_predictions.csv', index=False)

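# Save model (create the models directory first if it does not exist)
import os
os.makedirs('../../models', exist_ok=True)
with open('../../models/xgboost_model.pkl', 'wb') as f:
    pickle.dump(xgb_model, f)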

print("\nFiles saved successfully:")
print("  - ../../models/xgboost_model.pkl")
print("  - ../../data/processed/xgboost_results.csv")
print("  - ../../data/processed/xgboost_predictions.csv")
print("  - ../../data/processed/xgboost_feature_importance.csv")

## Conclusion

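### Expected Improvements Over Random Forest:

XGBoost is expected, though not guaranteed, to outperform Random Forest because:
1. **Sequential Error Correction:** Each tree is fit to the errors left by the previous trees
2. **Gradient Optimization:** Each boosting round takes a gradient-based step that directly reduces the training loss
3. **Built-in Regularization:** L1/L2 penalties and tree-complexity control help limit overfitting
4. **Optimized Splits:** Split finding uses both first- and second-order gradient information

Whether these advantages translate into better results on this dataset is determined by the test metrics and the comparison in Step 5.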