# Salary Prediction Model Comparison

This notebook compares the machine learning models trained for salary prediction. We'll analyze the performance of:
1. Random Forest
2. XGBoost (two versions)
3. Decision Tree
4. Lasso Regression
5. Linear Regression
6. KNN
7. SVR

We'll evaluate these models on train and test R², cross-validated R², RMSE, and (where recorded) MAE and MSE, then use a radar plot and an RMSE bar chart to determine the best-performing model.
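
For reference, the metrics compared in this notebook can be computed from true and predicted salaries as in the sketch below. It uses only the standard library so that the formulas are visible; the function name `regression_metrics` and the sample salaries are hypothetical and do not come from the dataset. In the notebook itself the same quantities would normally come from the `sklearn.metrics` functions imported in the next cell.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return R², RMSE and MAE for two equal-length sequences of numbers.

    Assumes y_true is not constant (otherwise the total sum of squares is zero).
    """
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual and total sums of squares
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "R²": 1 - ss_res / ss_tot,
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }

# Hypothetical salaries, for illustration only
y_true = [50000, 60000, 70000, 80000]
y_pred = [52000, 58000, 71000, 79000]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Because RMSE squares each error before averaging, it is never smaller than MAE on the same predictions, which helps when sanity-checking recorded values.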

In [5]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import joblib

# Set style for visualizations
# Use seaborn's theme functions instead of matplotlib style 'seaborn' which may not be available.
sns.set_theme(style='darkgrid')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = [12, 8]

## Loading Data and Model Results

Let's load the dataset and collect the metrics recorded for each model in a dictionary, then convert it to a DataFrame so the models can be compared side by side.

In [None]:
# Load the dataset
df = pd.read_csv("../data/Salary_Data.csv")

# Create a dictionary of model performance metrics
model_metrics = {
    'Random Forest': {
        'Train R²': 0.885,
        'Test R²': 0.881,
        'RMSE': 14356,
        'Cross-val R²': 0.879
    },
    'KNN': {
        'Train R²': 0.89,
        'Test R²': 0.87,
        'RMSE': 18965.32,
        'Cross-val R²': 0.86
    },
    'XGBoost': {
        'Train R²': 0.9816594635946724,
        'Test R²': 0.931231247301555,
        'RMSE': 66986.5809956585,
        'MAE': 40941.674233490565,
        'Cross-val R²': 0.8087096331731203
    },
    'Decision Tree': {
        'Train R²': 0,
        'Test R²': 0,
        'MSE': 0,
        'RMSE': 0,
        'Cross-val R²': 0
    },
    'Linear Regression': {
        'Train R²': 0,
        'Test R²': 0,
        'RMSE': 0,
        'Cross-val R²': 0
    },
    'Lasso Regression': {
        'Train R²': 0.71,
        'Test R²': 0.71,
        'MSE': 774879223.20,
        'RMSE': np.sqrt(774879223.20),
        'Cross-val R²': 0.65
    },
    'SVR': {
        'Train R²': 0.9798,
        'Test R²': 0.9660,
        'MAE': 6076.31,
        'RMSE': 9849.53,
        'Cross-val R²': 0.602
    }
}

# Convert to DataFrame for easier visualization
metrics_df = pd.DataFrame(model_metrics).T
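
Some entries in the dictionary (Decision Tree and Linear Regression) have every metric set to 0. An RMSE of 0 would mean a perfect fit, which contradicts an R² of 0, so these look like placeholders that were never filled in. Left as they are, they would score best on RMSE in the plots. The sketch below is one possible way to exclude them before building the DataFrame; the helper name `drop_placeholder_rows` is hypothetical, and it assumes that an all-zero entry always means "not recorded".

```python
def drop_placeholder_rows(metrics):
    """Keep only models that have at least one non-zero metric."""
    return {
        name: values
        for name, values in metrics.items()
        if any(v != 0 for v in values.values())
    }

# Small subset of the metrics dictionary, for illustration
sample = {
    "Random Forest": {"Test R²": 0.881, "RMSE": 14356},
    "Decision Tree": {"Test R²": 0, "RMSE": 0},  # assumed placeholder
}
cleaned = drop_placeholder_rows(sample)
print(list(cleaned))
```

The filtered dictionary can then be passed to `pd.DataFrame(...).T` in the same way as the original.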

## Model Performance Visualization

The plots below compare the models: a radar plot of test R², cross-validated R², and inverted normalized RMSE, followed by a bar chart of raw RMSE per model.

In [8]:
# Create a radar plot to compare models
categories = ['Test R²', 'Cross-val R²', 'Normalized RMSE']

# Normalize RMSE values to be between 0 and 1 (inverted so higher is better)
max_rmse = metrics_df['RMSE'].max()
normalized_rmse = 1 - (metrics_df['RMSE'] / max_rmse)

# Prepare data for radar plot
fig = go.Figure()

for model in metrics_df.index:
    fig.add_trace(go.Scatterpolar(
        r=[metrics_df.loc[model, 'Test R²'], 
           metrics_df.loc[model, 'Cross-val R²'],
           normalized_rmse[model]],
        theta=categories,
        name=model
    ))

fig.update_layout(
    polar=dict(
        radialaxis=dict(
            visible=True,
            range=[0, 1]
        )),
    title='Model Performance Comparison (Radar Plot)',
    showlegend=True
)

fig.show()
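
The inverted normalization used for the radar plot can be checked on a few of the recorded RMSE values. The sketch below reproduces the formula `1 - rmse / max_rmse` on a plain dictionary; the function name `invert_and_scale` is hypothetical.

```python
def invert_and_scale(rmse_by_model):
    """Map each RMSE to 1 - rmse / max_rmse, so that higher means better."""
    max_rmse = max(rmse_by_model.values())
    return {name: 1 - rmse / max_rmse for name, rmse in rmse_by_model.items()}

# RMSE values taken from the metrics dictionary above
rmse = {"Random Forest": 14356, "KNN": 18965.32, "XGBoost": 66986.5809956585}
scores = invert_and_scale(rmse)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

With this scaling the model with the largest RMSE always scores 0, so the scores are relative to the worst model in the comparison and change whenever a model is added or removed.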

In [9]:
# Create RMSE comparison plot
fig = go.Figure()

fig.add_trace(go.Bar(
    x=metrics_df.index,
    y=metrics_df['RMSE'],
    marker_color='rgb(158,202,225)',
    text=metrics_df['RMSE'].round(2),
    textposition='auto',
))

fig.update_layout(
    title='Model Performance Comparison: RMSE',
    xaxis_title='Models',
    yaxis_title='RMSE Value',
    template='plotly_white'
)

fig.show()

## Model Ranking and Analysis

Based on the performance metrics, we can rank the models from best to worst:

1. **Random Forest**

2. **Decision Tree**

3. **XGBoost (model3)**

4. **KNN**

5. **Linear Regression**

6. **XGBoost (model2)**

7. **Lasso Regression**
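
A ranking like this can also be derived from the recorded metrics instead of being maintained by hand. The sketch below sorts models by a single metric; the notebook does not state which metric drove the ranking above, so the default of Test R² is an assumption, and the helper name `rank_models` is hypothetical.

```python
def rank_models(metrics, key="Test R²", higher_is_better=True):
    """Return model names sorted best-first by one metric.

    Models that do not report the metric are skipped.
    """
    scored = {name: m[key] for name, m in metrics.items() if key in m}
    return sorted(scored, key=scored.get, reverse=higher_is_better)

# Subset of the metrics dictionary, for illustration
sample = {
    "KNN": {"Test R²": 0.87, "RMSE": 18965.32},
    "Lasso Regression": {"Test R²": 0.71, "RMSE": 774879223.20 ** 0.5},
    "Random Forest": {"Test R²": 0.881, "RMSE": 14356},
}
by_r2 = rank_models(sample)
by_rmse = rank_models(sample, key="RMSE", higher_is_better=False)
print(by_r2)
print(by_rmse)
```

Sorting by more than one metric and comparing the resulting orders is a quick way to see whether the ranking depends on the choice of metric.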
   

## Conclusions and Recommendations

1. **Best Model Choice**: The top-ranked model remains the best choice for salary prediction because:
   
2. **Model Comparison**:
   
3. **Model Deployment Recommendations**:
  
4. **Future Improvements**:
  