# Salary Prediction Model Comparison

This notebook compares the machine learning models used for salary prediction. We'll analyze the performance of:
1. Random Forest
2. XGBoost (two versions)
3. Decision Tree
4. Lasso Regression
5. Linear Regression
6. KNN

We'll evaluate these models using various metrics and visualizations to determine the best performing model.

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import joblib

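# Set style for visualizations
# The 'seaborn' style was renamed to 'seaborn-v0_8' in Matplotlib 3.6, so pick whichever exists
style_name = 'seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn'
plt.style.use(style_name)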
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = [12, 8]

## Loading Data and Model Results

Let's load our dataset and record the results obtained from the different models. The metric values are hard-coded in a dictionary and then converted to a DataFrame so the models can be compared side by side.
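
The values in the dictionary below are entered by hand rather than computed in this notebook. As a minimal sketch of how figures of this kind could be produced, assuming a scikit-learn regressor and a held-out test split, the hypothetical helper `evaluate_model` returns train/test R², RMSE, MAE, and mean cross-validated R². The data comes from `make_regression` as a stand-in and is not the salary dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split


def evaluate_model(model, X_train, X_test, y_train, y_test, cv=5):
    """Fit `model` and return the metrics used in the comparison table (hypothetical helper)."""
    model.fit(X_train, y_train)
    test_pred = model.predict(X_test)
    # cross_val_score clones the estimator, so the fitted model above is left untouched
    cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="r2")
    return {
        "Train R²": r2_score(y_train, model.predict(X_train)),
        "Test R²": r2_score(y_test, test_pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_test, test_pred))),
        "MAE": mean_absolute_error(y_test, test_pred),
        "Cross-val R²": float(cv_scores.mean()),
    }


# Synthetic stand-in data (assumption: not the notebook's salary data)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

example_metrics = evaluate_model(
    RandomForestRegressor(n_estimators=50, random_state=0),
    X_train, X_test, y_train, y_test,
)
print(example_metrics)
```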

In [None]:
# Load the dataset
df = pd.read_csv("../data/Salary_Data.csv")

# Create a dictionary of model performance metrics
model_metrics = {
    'Random Forest': {
        'Train R²': 0.99,
        'Test R²': 0.98,
        'RMSE': 6849.00,
        'Cross-val R²': 0.98
    },
    'XGBoost (model3)': {
        'Train R²': 0.925,
        'Test R²': 0.925,
        'RMSE': 14075.99,
        'MAE': 8648.08,
        'Cross-val R²': 0.666
    },
    'XGBoost (model2)': {
        'Train R²': 0.911,
        'Test R²': 0.911,
        'RMSE': 15405.76,
        'MAE': 10015.54,
        'Cross-val R²': 0.698
    },
    'Decision Tree': {
        'Train R²': 0.978,
        'Test R²': 0.978,
        'MSE': 62880281.55,
        'RMSE': np.sqrt(62880281.55),
        'Cross-val R²': 0.95
    },
    'Linear Regression': {
        'Train R²': 0.85,
        'Test R²': 0.83,
        'RMSE': 21234.56,
        'Cross-val R²': 0.82
    },
    'KNN': {
        'Train R²': 0.89,
        'Test R²': 0.87,
        'RMSE': 18965.32,
        'Cross-val R²': 0.86
    },
    'Lasso Regression': {
        'Train R²': 0.71,
        'Test R²': 0.71,
        'MSE': 774879223.20,
        'RMSE': np.sqrt(774879223.20),
        'Cross-val R²': 0.65
    }
}

# Convert to DataFrame for easier visualization
metrics_df = pd.DataFrame(model_metrics).T

## Model Performance Visualization

Let's create various visualizations to compare the performance of different models:
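
The radar plot uses an inverted, normalized RMSE so that higher is better on every axis. With the transformation `1 - RMSE / max(RMSE)`, the worst model maps to exactly 0 and the best model to `1 - min/max`, so no model reaches 1. A small worked example, using three RMSE values copied from the metrics dictionary (the variable names `rmse` and `minmax_rmse` are illustrative, and the min-max variant is an alternative that the notebook does not use):

```python
import numpy as np
import pandas as pd

# RMSE values copied from the metrics dictionary
rmse = pd.Series({
    "Random Forest": 6849.00,
    "XGBoost (model3)": 14075.99,
    "Lasso Regression": np.sqrt(774879223.20),
})

# Same transformation as in the radar-plot cell: 1 - RMSE / max(RMSE)
normalized_rmse = 1 - rmse / rmse.max()

# Alternative (not used in the notebook): min-max scaling spans the full [0, 1] range
minmax_rmse = (rmse.max() - rmse) / (rmse.max() - rmse.min())

print(normalized_rmse.round(3))
print(minmax_rmse.round(3))
```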

In [None]:
# Create a radar plot to compare models
categories = ['Test R²', 'Cross-val R²', 'Normalized RMSE']

# Normalize RMSE values to be between 0 and 1 (inverted so higher is better)
max_rmse = metrics_df['RMSE'].max()
normalized_rmse = 1 - (metrics_df['RMSE'] / max_rmse)

# Prepare data for radar plot
fig = go.Figure()

for model in metrics_df.index:
    fig.add_trace(go.Scatterpolar(
        r=[metrics_df.loc[model, 'Test R²'], 
           metrics_df.loc[model, 'Cross-val R²'],
           normalized_rmse[model]],
        theta=categories,
        name=model
    ))

fig.update_layout(
    polar=dict(
        radialaxis=dict(
            visible=True,
            range=[0, 1]
        )),
    title='Model Performance Comparison (Radar Plot)',
    showlegend=True
)

fig.show()

In [None]:
# Create RMSE comparison plot
fig = go.Figure()

fig.add_trace(go.Bar(
    x=metrics_df.index,
    y=metrics_df['RMSE'],
    marker_color='rgb(158,202,225)',
    text=metrics_df['RMSE'].round(2),
    textposition='auto',
))

fig.update_layout(
    title='Model Performance Comparison: RMSE',
    xaxis_title='Models',
    yaxis_title='RMSE Value',
    template='plotly_white'
)

fig.show()

## Model Ranking and Analysis

Based on the performance metrics, we can rank the models from best to worst:

1. **Random Forest**
   - Highest test R² score (0.98)
   - Lowest RMSE (6,849.00)
   - Small gap between train R² (0.99) and test R² (0.98)
   - Highest cross-validation R² (0.98), matching the test score

2. **Decision Tree**
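   - Second-highest test R² (0.978)
   - Second-lowest RMSE (about 7,929.71)
   - Single trees are prone to overfitting, although train and test R² are identical here; cross-validation R² is slightly lower (0.95)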

3. **XGBoost (model3)**
   - Good R² score (0.925)
   - Moderate RMSE (14,075.99)
   - Cross-validation R² (0.666) is far below test R² (0.925), which indicates less stable performance across data splits

4. **XGBoost (model2)**
   - Good R² score (0.911)
   - Moderate RMSE (15,405.76)
   - Cross-validation R² (0.698) is well below test R², the same stability issue as model3

5. **KNN**
   - Good R² score (0.87)
   - Higher RMSE (18,965.32)
   - Consistent performance across train and test sets

6. **Linear Regression**
   - Decent R² score (0.83)
   - Higher RMSE (21,234.56)
   - Simple and interpretable model

7. **Lasso Regression**
   - Lowest R² score (0.71)
   - Highest RMSE (about 27,836.65)
   - No gap between train and test R², but the least accurate model
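
As a cross-check, a ranking can be derived directly from the numbers in the metrics dictionary. The sketch below sorts by test R² with RMSE as a tie-breaker, which is one possible criterion; a ranking that weights cross-validated R² more heavily would order the models differently. It also computes the gap between test R² and cross-validated R² (the column name `R² gap` is illustrative), which makes the stability concerns for the two XGBoost models explicit.

```python
import numpy as np
import pandas as pd

# Test R², RMSE and cross-validated R² copied from the metrics dictionary
scores = pd.DataFrame(
    {
        "Test R²": [0.98, 0.925, 0.911, 0.978, 0.83, 0.87, 0.71],
        "RMSE": [6849.00, 14075.99, 15405.76, np.sqrt(62880281.55),
                 21234.56, 18965.32, np.sqrt(774879223.20)],
        "Cross-val R²": [0.98, 0.666, 0.698, 0.95, 0.82, 0.86, 0.65],
    },
    index=["Random Forest", "XGBoost (model3)", "XGBoost (model2)", "Decision Tree",
           "Linear Regression", "KNN", "Lasso Regression"],
)

# Gap between held-out and cross-validated R²; a large gap hints at unstable estimates
scores["R² gap"] = scores["Test R²"] - scores["Cross-val R²"]

# Rank by test R² (higher is better), breaking ties with RMSE (lower is better)
ranking = scores.sort_values(["Test R²", "RMSE"], ascending=[False, True])
print(ranking.round(3))
```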

## Conclusions and Recommendations

1. **Best Model Choice**: The Random Forest model is the best choice for salary prediction because it has:
   - The highest predictive accuracy (test R² = 0.98)
   - Good generalization (train R² = 0.99 versus test R² = 0.98)
   - The lowest prediction error (RMSE = 6,849.00)
   - A cross-validation R² (0.98) that matches its test score

2. **Model Comparison**:
   - Traditional models (Linear Regression, KNN) perform reasonably well (test R² of 0.83 and 0.87) but not as well as the ensemble methods
   - Tree-based models (Random Forest, Decision Tree) show the best performance, with test R² of about 0.98
   - Regularized models (Lasso) show lower performance but might be useful for feature selection, since the L1 penalty can shrink coefficients to exactly zero

3. **Model Deployment Recommendations**:
   - Use the Random Forest model for production deployment
   - Consider KNN or Linear Regression as backup models for their simplicity
   - Implement regular model monitoring
   - Consider periodic retraining to maintain performance
   - Store model predictions for future performance analysis

4. **Future Improvements**:
   - Feature engineering to improve linear models' performance
   - Ensemble methods combining Random Forest with other well-performing models
   - Hyperparameter tuning for the lower-performing models
   - Collect more training data to improve model stability
   - Experiment with other algorithms like LightGBM or CatBoost
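
The deployment recommendations above (persisting the model, monitoring it, and retraining when performance degrades) can be sketched as follows. This is an illustration under stated assumptions, not the project's deployment code: the data is synthetic, the file name `rf_salary_model.joblib`, the helper `needs_retraining`, and the tolerance factor of 1.25 are all hypothetical choices.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data (assumption: not the notebook's salary data)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, y_train = X[:200], y[:200]
X_new, y_new = X[200:], y[200:]  # plays the role of newly labelled production data

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# Persist and reload the fitted model, as a deployment step might
model_path = os.path.join(tempfile.mkdtemp(), "rf_salary_model.joblib")  # hypothetical path
joblib.dump(model, model_path)
deployed = joblib.load(model_path)


def needs_retraining(model, X_recent, y_recent, baseline_rmse, tolerance=1.25):
    """Return (flag, rmse); flag is True if recent RMSE exceeds baseline_rmse * tolerance."""
    rmse = float(np.sqrt(mean_squared_error(y_recent, model.predict(X_recent))))
    return rmse > baseline_rmse * tolerance, rmse


# Take the baseline from the first batch of new data, then compare a later batch against it
_, baseline_rmse = needs_retraining(deployed, X_new[:50], y_new[:50], baseline_rmse=np.inf)
retrain, recent_rmse = needs_retraining(deployed, X_new[50:], y_new[50:], baseline_rmse)
print(f"baseline RMSE={baseline_rmse:.2f}, recent RMSE={recent_rmse:.2f}, retrain={retrain}")
```

Storing each batch's RMSE alongside the predictions would give the history needed for the periodic performance analysis mentioned above.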