# Task 2.6: Final Model Selection - Supervised Learning

---

## Overview

This notebook performs **comprehensive model comparison and selection** for all supervised learning models trained in Week 2. Our goal is to:

1. **Load and compare** all trained supervised models
2. **Evaluate** each model using multiple metrics (Accuracy, F1-Score, Precision, Recall)
3. **Select the top 2 models** with detailed justification
4. **Create a summary comparison table** for easy reference

---

## Critical: Data Leakage Fix Applied

All models in this comparison use **landlord-controlled features only**:
- No review-based features (`reviews_per_month`, `review_scores_*`, etc.)
- No target leakage (`fp_score` and `value_category` are removed from `X`)
- Predictions are possible for new listings that have no reviews
- Realistic test accuracy (~95%) instead of the inflated 99%
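
The leakage rules above can be checked mechanically before any evaluation. The following is a minimal sketch, not the project's actual check: the helper name `find_leaky_columns` and the pattern list are assumptions built only from the column names mentioned in the list, and the "etc." there means the real list is probably longer.

```python
from fnmatch import fnmatchcase

# Patterns taken from the leakage list above (assumed incomplete; extend as needed).
LEAKY_PATTERNS = ["reviews_per_month", "review_scores_*", "fp_score", "value_category"]


def find_leaky_columns(columns, patterns=LEAKY_PATTERNS):
    """Return the column names that match any leaky pattern (case-sensitive)."""
    return [col for col in columns if any(fnmatchcase(col, pat) for pat in patterns)]


# Hypothetical column names, for illustration only.
example_columns = ["price", "bedrooms", "review_scores_rating", "fp_score"]
print(find_leaky_columns(example_columns))
```

Once the test data is loaded, the same helper could be applied to `X_test.columns`; an empty result would be consistent with the fix having been applied.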

---

## 1. Import Libraries and Load Data

We'll import all necessary libraries for model loading, evaluation, and visualization.

In [None]:
# Standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import warnings
warnings.filterwarnings('ignore')

# Sklearn metrics
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score,
    classification_report, confusion_matrix
)

# Set style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')



## 2. Load Test Data

We load the `_landlord`-suffixed files created in T1.5.

In [None]:
# Load landlord-only test data
X_test = pd.read_csv('../../data/processed/X_test_landlord.csv')
y_test = pd.read_csv('../../data/processed/y_test_landlord.csv')

# read_csv always returns a DataFrame; keep the first column as a 1-D array
y_test = y_test.iloc[:, 0].values

print(f'Test set size: {len(X_test)} samples')
print(f'Number of features: {X_test.shape[1]}')
print('\nClass distribution in test set:')
print(pd.Series(y_test).value_counts().sort_index())

## 3. Load All Trained Models

We load all supervised learning models trained in Tasks 2.1-2.5.

In [None]:
# Dictionary to store models
models = {}

# Model file paths
model_files = {
    'Logistic Regression': '../../models/logistic_regression_landlord.pkl',
    'Random Forest': '../../models/random_forest_model.pkl',
    'XGBoost': '../../models/xgboost_model.pkl',
    'MLP Classifier': '../../models/best_mlp_model.pkl',
    'SVM (RBF)': '../../models/svm_rbf_model.pkl'
}

# Load each model
for name, path in model_files.items():
    try:
        with open(path, 'rb') as f:
            models[name] = pickle.load(f)
        print(f'{name} loaded successfully')
    except FileNotFoundError:
        # Skip missing files so the remaining models can still be compared
        print(f'{name} not found at {path}')

print(f'\nTotal models loaded: {len(models)}')

## 4. Model Performance Summary (From Individual Tasks)

The values below are copied from the actual outputs of T2.1-T2.5; they are not recomputed in this notebook:

In [None]:
# Actual performance results from individual task outputs
performance_data = {
    'Model': ['XGBoost', 'Random Forest', 'MLP Classifier', 'Logistic Regression', 'SVM (RBF)'],
    'Training Accuracy': [0.9900, 0.9640, 0.9508, 0.9513, 0.9622],
    'Testing Accuracy': [0.9551, 0.9536, 0.9498, 0.9536, 0.9282],
    'F1-Score (Macro)': [0.9553, 0.9538, 0.9500, 0.9539, 0.9286],
    'Precision (Macro)': [0.9565, 0.9553, 0.9500, 0.9548, 0.9290],
    'Recall (Macro)': [0.9551, 0.9535, 0.9500, 0.9535, 0.9284],
    'Train-Test Gap': [0.0349, 0.0104, 0.0010, -0.0022, 0.0340]
}

results_df = pd.DataFrame(performance_data)

print('='*80)
print('Model Performance Comparison')
print('='*80)
print(results_df.to_string(index=False))
print('='*80)
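
The three "(Macro)" columns in the table are macro averages: each metric is computed per class and the per-class values are averaged with equal weight. The pure-Python sketch below illustrates that definition on a toy example. It is intended to mirror what scikit-learn's `precision_score`, `recall_score`, and `f1_score` compute with `average='macro'`, but the function name `macro_scores` and the toy labels are assumptions for illustration, not the code used in T2.1-T2.5.

```python
def macro_scores(y_true, y_pred):
    """Macro-averaged precision, recall and F1: per-class scores, then an unweighted mean."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Undefined ratios (zero denominator) are scored as 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy 3-class labels, for illustration only (not project data).
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
toy_scores = macro_scores(toy_true, toy_pred)
print({metric: round(value, 4) for metric, value in toy_scores.items()})
```

Because every class counts equally, a model that does poorly on a small class is penalized more under macro averaging than under plain accuracy.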

## 5. Detailed Performance Analysis

In [None]:
# Ranking by Test Accuracy
print('\nRanking by Test Accuracy:')
print('-' * 60)
ranked = results_df.sort_values('Testing Accuracy', ascending=False)
# Use enumerate for the rank: after sorting, the index still holds the pre-sort row position
for rank, (_, row) in enumerate(ranked.iterrows(), start=1):
    print(f"{rank}. {row['Model']:20s} {row['Testing Accuracy']:.4f} ({row['Testing Accuracy']*100:.2f}%)")

# Ranking by Generalization (smallest absolute train-test gap)
print('\nRanking by Generalization (Smallest Train-Test Gap):')
print('-' * 60)
ranked_gen = results_df.copy()
ranked_gen['Gap_Abs'] = ranked_gen['Train-Test Gap'].abs()
ranked_gen = ranked_gen.sort_values('Gap_Abs')
for rank, (_, row) in enumerate(ranked_gen.iterrows(), start=1):
    print(f"{rank}. {row['Model']:20s} {row['Train-Test Gap']:+.4f} gap")
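
The train-test gap used for the generalization ranking is simply training accuracy minus testing accuracy. The sketch below recomputes it from the rounded accuracies in the performance table, as an illustration rather than a replacement for the recorded values. Note that for Logistic Regression the rounded inputs give -0.0023 where the table lists -0.0022, presumably because the table was computed from unrounded accuracies.

```python
# Accuracies copied from the performance table above.
training_accuracy = {
    "XGBoost": 0.9900,
    "Random Forest": 0.9640,
    "MLP Classifier": 0.9508,
    "Logistic Regression": 0.9513,
    "SVM (RBF)": 0.9622,
}
testing_accuracy = {
    "XGBoost": 0.9551,
    "Random Forest": 0.9536,
    "MLP Classifier": 0.9498,
    "Logistic Regression": 0.9536,
    "SVM (RBF)": 0.9282,
}

# Gap = training accuracy - testing accuracy; a positive gap means the model did better on training data.
gaps = {
    model: round(training_accuracy[model] - testing_accuracy[model], 4)
    for model in training_accuracy
}

# Smallest absolute gap first.
by_abs_gap = sorted(gaps, key=lambda model: abs(gaps[model]))
for rank, model in enumerate(by_abs_gap, start=1):
    print(f"{rank}. {model:20s} {gaps[model]:+.4f}")
```

Ranked this way, the MLP Classifier and Logistic Regression have the smallest gaps, followed by Random Forest, with SVM (RBF) and XGBoost showing the largest.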

## 6. Visual Comparison

In [None]:
# Create comparison visualizations
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('Model Performance Comparison', fontsize=16, fontweight='bold')

# 1. Test Accuracy Comparison
ax1 = axes[0, 0]
results_df_sorted = results_df.sort_values('Testing Accuracy', ascending=True)
colors = ['#2ecc71' if x >= 0.95 else '#f39c12' for x in results_df_sorted['Testing Accuracy']]
ax1.barh(results_df_sorted['Model'], results_df_sorted['Testing Accuracy'], color=colors)
ax1.set_xlabel('Test Accuracy', fontweight='bold')
ax1.set_title('Test Accuracy by Model', fontweight='bold')
ax1.set_xlim(0.90, 0.97)  # leave room for the value labels drawn right of each bar
for i, v in enumerate(results_df_sorted['Testing Accuracy']):
    ax1.text(v + 0.001, i, f'{v:.4f}', va='center')

# 2. F1-Score Comparison
ax2 = axes[0, 1]
results_df_sorted_f1 = results_df.sort_values('F1-Score (Macro)', ascending=True)
ax2.barh(results_df_sorted_f1['Model'], results_df_sorted_f1['F1-Score (Macro)'], color='#3498db')
ax2.set_xlabel('F1-Score (Macro)', fontweight='bold')
ax2.set_title('F1-Score by Model', fontweight='bold')
ax2.set_xlim(0.90, 0.97)  # leave room for the value labels drawn right of each bar
for i, v in enumerate(results_df_sorted_f1['F1-Score (Macro)']):
    ax2.text(v + 0.001, i, f'{v:.4f}', va='center')

# 3. Train vs Test Accuracy
ax3 = axes[1, 0]
x = np.arange(len(results_df))
width = 0.35
ax3.bar(x - width/2, results_df['Training Accuracy'], width, label='Training', color='#e74c3c')
ax3.bar(x + width/2, results_df['Testing Accuracy'], width, label='Testing', color='#2ecc71')
ax3.set_xlabel('Model', fontweight='bold')
ax3.set_ylabel('Accuracy', fontweight='bold')
ax3.set_title('Training vs Testing Accuracy', fontweight='bold')
ax3.set_xticks(x)
ax3.set_xticklabels(results_df['Model'], rotation=45, ha='right')
ax3.legend()
ax3.set_ylim(0.90, 1.0)

# 4. Precision-Recall-F1 Comparison
ax4 = axes[1, 1]
x = np.arange(len(results_df))
width = 0.25
ax4.bar(x - width, results_df['Precision (Macro)'], width, label='Precision', color='#9b59b6')
ax4.bar(x, results_df['Recall (Macro)'], width, label='Recall', color='#1abc9c')
ax4.bar(x + width, results_df['F1-Score (Macro)'], width, label='F1-Score', color='#f39c12')
ax4.set_xlabel('Model', fontweight='bold')
ax4.set_ylabel('Score', fontweight='bold')
ax4.set_title('Precision, Recall, and F1-Score Comparison', fontweight='bold')
ax4.set_xticks(x)
ax4.set_xticklabels(results_df['Model'], rotation=45, ha='right')
ax4.legend()
ax4.set_ylim(0.90, 0.97)

plt.tight_layout()
plt.savefig('../../outputs/figures/model_comparison_landlord_features.png', dpi=300, bbox_inches='tight')
print('\nVisualization saved to: outputs/figures/model_comparison_landlord_features.png')
plt.show()

## 7. Top 2 Model Selection

### Winner: **XGBoost**
- **Test Accuracy**: 95.51%
- **F1-Score**: 0.9553
- **Strengths**:
  - Highest test accuracy
  - Best macro F1-score, precision, and recall
  - Provides feature importances, which aid interpretation
  - Excellent for production deployment
- **Considerations**:
  - Largest train-test gap of the five models (3.49 percentage points), which indicates some overfitting
  - Test accuracy is nevertheless the highest of all models

### Runner-up: **Random Forest**
- **Test Accuracy**: 95.36%
- **F1-Score**: 0.9538
- **Strengths**:
  - Very close performance to XGBoost
  - **Smaller train-test gap than XGBoost** (1.04 vs 3.49 percentage points)
  - Robust and stable
  - Also provides feature importance
- **Why chosen as backup**:
  - Better generalization than XGBoost
  - More stable predictions
  - Excellent fallback option

### Honorable Mention: **Logistic Regression**
- **Test Accuracy**: 95.36% (tied with Random Forest)
- **Very small train-test gap**: -0.22 percentage points (test accuracy is marginally higher than training accuracy, a difference that is likely within sampling noise)
- **Most interpretable** model
- A strong baseline, and useful for understanding feature relationships

In [None]:
# Save comparison table
results_df.to_csv('../../outputs/model_comparison_summary_landlord.csv', index=False)
print('\nComparison table saved to outputs/model_comparison_summary_landlord.csv')

## 8. Key Insights

### Model Performance Insights
1. **All five models perform well** (92.8-95.5% test accuracy)
2. **XGBoost leads** in raw performance, but has the largest train-test gap (3.49 points)
3. **Random Forest** generalizes better than XGBoost (1.04-point gap); the MLP Classifier and Logistic Regression have the smallest gaps overall
4. **Logistic Regression** is surprisingly strong and ties Random Forest on test accuracy (95.36%)
5. **MLP Classifier** is competitive but slightly behind (94.98%)
6. **SVM (RBF)** is lowest but still strong (92.82%)

### Recommendation for Production
- **Primary Model**: XGBoost (highest accuracy)
- **Backup Model**: Random Forest (near-identical accuracy with a smaller train-test gap than XGBoost)
- **Interpretability**: Logistic Regression (for stakeholder explanations)

---

## 9. Conclusion

After comprehensive evaluation of 5 supervised learning models, we select:

1. **XGBoost** as the primary model (95.51% accuracy)
2. **Random Forest** as the backup model (95.36% accuracy, smaller train-test gap than XGBoost)

Both models are production-ready and can accurately predict Airbnb value categories for **new listings without any review history**.

---