# Project 3 - Developer Salary Prediction in 2024 - Scikit-Learn - Top3 Models

- Modelling Developer Salary Prediction in 2024 with **Scikit-Learn**

- Course Name :         Applied Machine Learning
- Course instructor:    Sohail Tehranipour
- Student Name :        Afshin Masoudi Ashtiani
- Project 3 -           Developer Salary Prediction in 2024
- Date :                September 2024
- File(ipynb) :         5/5

## Step 1 : Install required libraries

In [6]:
%pip install pandas numpy joblib scikit-learn
%pip install lightgbm xgboost catboost

Note: you may need to restart the kernel to use updated packages.
Collecting catboost
  Using cached catboost-1.2.7-cp312-cp312-win_amd64.whl.metadata (1.2 kB)
Collecting graphviz (from catboost)
  Using cached graphviz-0.20.3-py3-none-any.whl.metadata (12 kB)
Collecting numpy>=1.17.0 (from lightgbm)
  Using cached numpy-1.26.4-cp312-cp312-win_amd64.whl.metadata (61 kB)
Using cached catboost-1.2.7-cp312-cp312-win_amd64.whl (101.7 MB)
Using cached numpy-1.26.4-cp312-cp312-win_amd64.whl (15.5 MB)
Using cached graphviz-0.20.3-py3-none-any.whl (47 kB)
Note: you may need to restart the kernel to use updated packages.


ERROR: THESE PACKAGES DO NOT MATCH THE HASHES FROM THE REQUIREMENTS FILE. If you have updated the package versions, please update the hashes. Otherwise, examine the package contents carefully; someone may have tampered with them.
    unknown package:
        Expected sha256 645082f23762c281a7e14fdc23b88e47a3e3bbf8655f5246d80194b104a8ada9
             Got        a6576c510a804292d5087144de0cd58c70d70b5f3e4b871b051d606fc5d09ba1



## Step 2 : Import required libraries

In [7]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

# from google.colab import drive
# drive.mount('/content/drive', force_remount=True)

## Step 3 : Load and Prepare data

In [8]:
# Load the dataset
# dataset_path = '/content/drive/My Drive/Applied Machine Learning/Project 3 : Developer Salary Prediction/datasets/cleaned_survey_results_public_v2.csv'
dataset_path = r'C:\Users\Afshin\Desktop\10_Projects\Project_3_Developer_Salary_Prediction\datasets\cleaned_survey_results_public_v2.csv'

df = pd.read_csv(dataset_path)
df.info()

# Define features and target variables
X = df.drop(columns=['Salary'])  # Features
y = df['Salary']

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13908 entries, 0 to 13907
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   CompTotal          13908 non-null  float64
 1   YearsOfExperience  13908 non-null  float64
 2   MainBranch         13908 non-null  object 
 3   Country            13908 non-null  object 
 4   EducationLevel     13908 non-null  object 
 5   RemoteWork         13908 non-null  object 
 6   Salary             13908 non-null  float64
dtypes: float64(3), object(4)
memory usage: 760.7+ KB


In [9]:
df.describe()

| | CompTotal | YearsOfExperience | Salary |
|---|---|---|---|
| count | 13908.0 | 13908.0 | 13908.0 |
| mean | 96409.73109 | 9.865078 | 78880.216997 |
| std | 67277.52275 | 7.408676 | 43629.140861 |
| min | 5000.0 | 0.5 | 10000.0 |
| 25% | 50000.0 | 4.0 | 45720.25 |
| 50% | 80000.0 | 8.0 | 71962.0 |
| 75% | 124000.0 | 14.0 | 105884.5 |
| max | 410000.0 | 31.0 | 199000.0 |


## Step 4 : Data Preprocessing

- Create pipelines

In [10]:
# Identify feature types
numeric_features = X.select_dtypes(include=['number']).columns.tolist()
boolean_features = X.select_dtypes(include=['bool']).columns.tolist()
categorical_features = X.select_dtypes(include=['object']).columns.tolist()

print(f'> Numeric Features     is {numeric_features}')
print(f'> Boolean Features     is {boolean_features}')
print(f'> Categorical Features is {categorical_features}')

> Numeric Features     is ['CompTotal', 'YearsOfExperience']
> Boolean Features     is []
> Categorical Features is ['MainBranch', 'Country', 'EducationLevel', 'RemoteWork']


In [11]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

numeric_pipeline = Pipeline([('scaler', MinMaxScaler())])
categorical_pipeline = Pipeline([('onehot', OneHotEncoder(handle_unknown='ignore'))])
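
A minimal sketch of what the two preprocessing steps do, on toy values that are assumptions and not taken from the survey data. `MinMaxScaler` maps the fitted minimum and maximum to 0 and 1, and `OneHotEncoder(handle_unknown='ignore')` encodes a category it never saw during `fit` as an all-zero row instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy numeric column (hypothetical values)
years = np.array([[0.5], [8.0], [31.0]])
scaler = MinMaxScaler().fit(years)
scaled = scaler.transform(years)  # smallest value -> 0.0, largest -> 1.0

# Toy categorical column; 'Hybrid' is deliberately absent from the fit data
encoder = OneHotEncoder(handle_unknown='ignore').fit([['Remote'], ['In-person']])
seen = encoder.transform([['Remote']]).toarray()
unseen = encoder.transform([['Hybrid']]).toarray()  # all zeros, no exception

print(scaled.ravel())
print(seen, unseen)
```

The all-zero row matters at prediction time: a country or education level missing from the training data will not crash the pipeline, but the model receives no categorical signal for that column.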

- Create a transformer

In [12]:
from sklearn.compose import ColumnTransformer

# Column transformer for preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_pipeline, numeric_features),
        ('cat', categorical_pipeline, categorical_features)
    ])
preprocessor

- Split the dataset into **train** and **test** sets

In [13]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.1, random_state= 123)
print(f'> Shape of X_train: {X_train.shape}, X_test: {X_test.shape}, y_train: {y_train.shape}, y_test: {y_test.shape}')

> Shape of X_train: (12517, 6), X_test: (1391, 6), y_train: (12517,), y_test: (1391,)
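
The split sizes follow from the row count: with `test_size=0.1`, scikit-learn rounds the test share of 13908 rows up to 1391 and leaves 12517 for training. A quick check on a plain index array (no survey data needed):

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 13908  # row count reported by df.info() above
idx_train, idx_test = train_test_split(np.arange(n_rows), test_size=0.1, random_state=123)

print(len(idx_train), len(idx_test))  # 12517 1391
```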


## Step 5 : Train model and Tune **Hyperparameters**

- Define a function using **GridSearchCV**

In [14]:
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, GridSearchCV

def tune_model(model, param_grid, X_train, y_train, X_test, y_test) -> pd.DataFrame:

    kf = KFold(n_splits= 10, shuffle= True, random_state= 123)

    # GridSearchCV for hyperparameter tuning
    grid_search = GridSearchCV(model, param_grid, scoring= 'r2', refit= True, cv= kf, n_jobs= -1, verbose=1)
    grid_search.fit(X_train, y_train)

    best_estimator = grid_search.best_estimator_
    best_params = grid_search.best_params_
    best_score = grid_search.best_score_

    # Make predictions
    y_train_pred = grid_search.predict(X_train)
    y_test_pred = grid_search.predict(X_test)

    # Evaluate the best estimator
    train_r2 = r2_score(y_train, y_train_pred)
    test_r2 = r2_score(y_test, y_test_pred)
    diff_r2 = np.abs(train_r2 - test_r2)

    print(f">> Results from Grid Search " + "--" * 10)
    print(f"> Best Score       : {(best_score * 100):0.2f}%")
    print(f"> Train R2         : {(train_r2 * 100):0.2f}%")
    print(f"> Test R2          : {(test_r2 * 100):0.2f}%")
    print(f"> Difference R2    : {(diff_r2 * 100):0.2f}%")
    print(f"> Best Params      : {best_params}")

    return pd.DataFrame([{
        'Best_Score' : best_score,
        'Train_R2' : train_r2,
        'Test_R2' : test_r2,
        'Difference_R2' : diff_r2,
        'Best_Estimator' : best_estimator
    }])
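
The same tuning pattern can be tried on a small synthetic problem before running the expensive grids. This is only a sketch: the `Ridge` model, the `alpha` grid, and the `make_regression` data are stand-ins chosen to run in a second, not part of the project.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# Synthetic regression data (assumption: stands in for the salary dataset)
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.1, random_state=123)

kf = KFold(n_splits=5, shuffle=True, random_state=123)
search = GridSearchCV(Ridge(), {'alpha': [0.1, 1.0, 10.0]}, scoring='r2', cv=kf, refit=True)
search.fit(X_tr, y_tr)

cv_r2 = search.best_score_                       # mean R2 over the validation folds
train_r2 = r2_score(y_tr, search.predict(X_tr))  # R2 of the refit model on all training rows
test_r2 = r2_score(y_te, search.predict(X_te))   # R2 on the held-out test rows
gap = abs(train_r2 - test_r2)                    # rough indicator of overfitting

print(search.best_params_, round(cv_r2, 4), round(train_r2, 4), round(test_r2, 4), round(gap, 4))
```

`best_score_` comes from cross-validation on the training set only, so it is a separate estimate from both the train and the test R2.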

- Define models and their respective parameter grids

In [15]:
from sklearn.pipeline import Pipeline
# from catboost import CatBoostRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

# Define a function to create a model pipeline
def create_pipeline(model):
    return Pipeline(steps=[('preprocessor', preprocessor), ('regressor', model)])

# Define models and their respective parameter grids
models_param_grids = {
    #'CatBoost Regressor': (create_pipeline(CatBoostRegressor(verbose=0)), {
    #    'regressor__n_estimators': [100, 200, 300],
    #    'regressor__learning_rate': [0.01, 0.05, 0.1, 0.2],
    #    'regressor__depth': [6, 8, 10],
    #    'regressor__l2_leaf_reg': [3, 5, 7]
    #}),

    'XGBoost Regressor': (create_pipeline(XGBRegressor()), {
        'regressor__n_estimators': [100, 200, 300],
        'regressor__learning_rate': [0.01, 0.05, 0.1, 0.2],
        'regressor__max_depth': [6, 10, 20, 30],  # XGBoost requires max_depth >= 0; 6 is its default
        'regressor__subsample': [0.8, 1.0]
    }),

    'LGBM Regressor': (create_pipeline(LGBMRegressor()), {
        'regressor__n_estimators': [100, 200, 300],
        'regressor__learning_rate': [0.01, 0.05, 0.1, 0.2],
        'regressor__max_depth': [-1, 10, 20, 30],
        'regressor__num_leaves': [31, 63, 127],
        'regressor__boosting_type': ['gbdt', 'dart']
    }),

    'Random Forest Regressor': (create_pipeline(RandomForestRegressor()), {
        'regressor__n_estimators': [100, 200, 300],
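        'regressor__max_features': [1.0, 'sqrt', 'log2'],  # 'auto' was removed in scikit-learn 1.3; 1.0 uses all features
        'regressor__max_depth': [None, 10, 20, 30],  # None means unlimited depth; -1 is not accepted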
        'regressor__bootstrap': [True, False]
    }),
}
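
The cost of each grid is the product of its list lengths times the number of folds. A short check with `ParameterGrid`, using grids of the same shape as the ones above (the values are illustrative; only the list lengths affect the count):

```python
from sklearn.model_selection import ParameterGrid

# Same number of values per hyperparameter as the grids above
grid_sizes = {
    'XGBoost': len(ParameterGrid({
        'n_estimators': [100, 200, 300],
        'learning_rate': [0.01, 0.05, 0.1, 0.2],
        'max_depth': [6, 10, 20, 30],
        'subsample': [0.8, 1.0],
    })),
    'LGBM': len(ParameterGrid({
        'n_estimators': [100, 200, 300],
        'learning_rate': [0.01, 0.05, 0.1, 0.2],
        'max_depth': [-1, 10, 20, 30],
        'num_leaves': [31, 63, 127],
        'boosting_type': ['gbdt', 'dart'],
    })),
    'Random Forest': len(ParameterGrid({
        'n_estimators': [100, 200, 300],
        'max_features': [1.0, 'sqrt', 'log2'],
        'max_depth': [None, 10, 20, 30],
        'bootstrap': [True, False],
    })),
}

n_splits = 10  # folds used by the KFold in tune_model
total_fits = {name: size * n_splits for name, size in grid_sizes.items()}

print(grid_sizes)  # {'XGBoost': 96, 'LGBM': 288, 'Random Forest': 72}
print(total_fits)  # {'XGBoost': 960, 'LGBM': 2880, 'Random Forest': 720}
```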

- Tune each model

In [16]:
# Store results
results_df = pd.DataFrame()

# Tune each model
for model_name, (model_pipeline, param_grid) in models_param_grids.items():
    print(f"\n>> Tuning {model_name} ...")
    new_row = tune_model(model_pipeline, param_grid, X_train, y_train, X_test, y_test)
    new_row.insert(0, 'Name', [model_name])
    results_df = pd.concat([results_df, new_row], axis=0, ignore_index=True)

results_df


>> Tuning XGBoost Regressor ...
Fitting 10 folds for each of 96 candidates, totalling 960 fits
>> Results from Grid Search --------------------
> Best Score       : 93.51%
> Train R2         : 98.50%
> Test R2          : 92.20%
> Difference R2    : 6.30%
> Best Params      : {'regressor__learning_rate': 0.05, 'regressor__max_depth': 10, 'regressor__n_estimators': 200, 'regressor__subsample': 0.8}

>> Tuning LGBM Regressor ...
Fitting 10 folds for each of 288 candidates, totalling 2880 fits
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001051 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 362
[LightGBM] [Info] Number of data points in the train set: 12517, number of used features: 45
[LightGBM] [Info] Start training from score 78868.642966
>> Results from Grid Search --------------------
> Best Score       : 93.71%
> Train R2         : 96.10%

| | Name | Best_Score | Train_R2 | Test_R2 | Difference_R2 | Best_Estimator |
|---|---|---|---|---|---|---|
| 0 | XGBoost Regressor | 0.935136 | 0.985008 | 0.922006 | 0.063002 | (ColumnTransformer(transformers=[('num',\n ... |
| 1 | LGBM Regressor | 0.937138 | 0.960991 | 0.923787 | 0.037204 | (ColumnTransformer(transformers=[('num',\n ... |
| 2 | Random Forest Regressor | 0.909973 | 0.983672 | 0.893296 | 0.090376 | (ColumnTransformer(transformers=[('num',\n ... |


- Sort the results by train/test R2 gap and pick the best model for each metric

In [17]:
# Sort the results in a DataFrame for analysis
sorted_results_df = results_df[results_df.Best_Score > 0.8].sort_values(by=['Difference_R2'], ascending=[True])
best_score_index = results_df.Best_Score.idxmax()
train_r2_index = results_df.Train_R2.idxmax()
test_r2_index = results_df.Test_R2.idxmax()
difference_r2_index = results_df.Difference_R2.idxmin()
best_score_model = results_df.loc[best_score_index, 'Best_Estimator']
train_r2_model = results_df.loc[train_r2_index, 'Best_Estimator']
test_r2_model = results_df.loc[test_r2_index, 'Best_Estimator']
difference_r2_model = results_df.loc[difference_r2_index, 'Best_Estimator']

- Summary of Model Tuning Results

In [18]:
print(">> Summary of Model Tuning Results :")
print(f"> Maximum Best Score    Model : {results_df.loc[best_score_index, 'Name']}")
print(f"> Maximum Train R2      Model : {results_df.loc[train_r2_index, 'Name']}")
print(f"> Maximum Test R2       Model : {results_df.loc[test_r2_index, 'Name']}")
print(f"> Minimum Difference R2 Model : {results_df.loc[difference_r2_index, 'Name']}")
sorted_results_df

>> Summary of Model Tuning Results :
> Maximum Best Score    Model : LGBM Regressor
> Maximum Train R2      Model : XGBoost Regressor
> Maximum Test R2       Model : LGBM Regressor
> Minimum Difference R2 Model : LGBM Regressor


| | Name | Best_Score | Train_R2 | Test_R2 | Difference_R2 | Best_Estimator |
|---|---|---|---|---|---|---|
| 1 | LGBM Regressor | 0.937138 | 0.960991 | 0.923787 | 0.037204 | (ColumnTransformer(transformers=[('num',\n ... |
| 0 | XGBoost Regressor | 0.935136 | 0.985008 | 0.922006 | 0.063002 | (ColumnTransformer(transformers=[('num',\n ... |
| 2 | Random Forest Regressor | 0.909973 | 0.983672 | 0.893296 | 0.090376 | (ColumnTransformer(transformers=[('num',\n ... |
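
The ranking can be reproduced from the reported numbers alone. The sketch below rebuilds a small frame from the Train_R2 and Test_R2 values printed above, recomputes the gap, and sorts by it; `reported` and `ranked` are names introduced here for illustration.

```python
import pandas as pd

# Train/test R2 values copied from the results table above
reported = pd.DataFrame({
    'Name': ['XGBoost Regressor', 'LGBM Regressor', 'Random Forest Regressor'],
    'Train_R2': [0.985008, 0.960991, 0.983672],
    'Test_R2': [0.922006, 0.923787, 0.893296],
})

# Gap between training and test R2: a smaller gap suggests less overfitting
reported['Difference_R2'] = (reported['Train_R2'] - reported['Test_R2']).abs().round(6)

ranked = reported.sort_values(by='Difference_R2')
smallest_gap = reported.loc[reported['Difference_R2'].idxmin(), 'Name']

print(ranked[['Name', 'Difference_R2']])
print(smallest_gap)  # LGBM Regressor
```

LGBM has both the highest test R2 and the smallest gap, while XGBoost and Random Forest fit the training set more closely but generalise slightly worse.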


## Step 6 : Save the tuned models

- Save every tuned pipeline, one `.joblib` file per model

In [19]:
import joblib

for index, row in sorted_results_df.iterrows():
    tuned_best_pipeline_name = row.Name.replace(' ', '')
    tuned_best_pipeline= row.Best_Estimator

    # pipeline_path = f'/content/drive/My Drive/Applied Machine Learning/Project 3 : Developer Salary Prediction/models/{tuned_best_pipeline_name}.joblib'
    pipeline_path = f"C:\\Users\\Afshin\\Desktop\\10_Projects\\Project_3_Developer_Salary_Prediction\\models\\{tuned_best_pipeline_name}.joblib"

    joblib.dump(tuned_best_pipeline, pipeline_path)
    print(f"> Tuned {row.Name} pipeline Saved to {pipeline_path}")

> Tuned LGBM Regressor pipeline Saved to C:\Users\Afshin\Desktop\10_Projects\Project_3_Developer_Salary_Prediction\models\LGBMRegressor.joblib
> Tuned XGBoost Regressor pipeline Saved to C:\Users\Afshin\Desktop\10_Projects\Project_3_Developer_Salary_Prediction\models\XGBoostRegressor.joblib
> Tuned Random Forest Regressor pipeline Saved to C:\Users\Afshin\Desktop\10_Projects\Project_3_Developer_Salary_Prediction\models\RandomForestRegressor.joblib
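
A portable variant of the save step, sketched with `pathlib` and a temporary directory so it runs on any machine. The toy pipeline, the `models` folder, and the `ToyRegressor.joblib` file name are assumptions for illustration; the round trip checks that the reloaded pipeline predicts the same values as the original.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Tiny stand-in for a tuned pipeline (hypothetical data)
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([10.0, 20.0, 30.0, 40.0])
toy_pipeline = Pipeline([('scaler', MinMaxScaler()), ('regressor', LinearRegression())]).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    models_dir = Path(tmp_dir) / 'models'
    models_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    pipeline_path = models_dir / 'ToyRegressor.joblib'

    joblib.dump(toy_pipeline, pipeline_path)
    restored = joblib.load(pipeline_path)

# The reloaded pipeline should reproduce the original predictions
same_predictions = np.allclose(toy_pipeline.predict(X_toy), restored.predict(X_toy))
print(same_predictions)
```

Building paths with `Path` avoids hand-escaped backslashes and lets the same cell work on Windows, Linux, and Colab by changing only the base directory.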


## Step 7 : Make Predictions

- Load one of the saved pipelines (here the XGBoost one)

In [20]:
pipeline_path = 'C:\\Users\\Afshin\\Desktop\\10_Projects\\Project_3_Developer_Salary_Prediction\\models\\XGBoostRegressor.joblib'
# pipeline_path = '/content/drive/My Drive/Applied Machine Learning/Project 3 : Developer Salary Prediction/models/XGBoostRegressor.joblib'

loaded_tuned_best_pipeline = joblib.load(pipeline_path)
loaded_tuned_best_pipeline

- Predict a random sample from the dataset

In [21]:
# Make predictions
df_sample = df.sample(1)
df_sample['Predicted Salary'] = loaded_tuned_best_pipeline.predict(df_sample.drop('Salary', axis= 1))
df_sample[['Salary','Predicted Salary']]

| | Salary | Predicted Salary |
|---|---|---|
| 5543 | 29405.0 | 28403.054688 |


- Predict a new, unseen sample

In [None]:
# Make predictions on new data
new_data = pd.DataFrame([{
    'CompTotal' : 195000.0,
    'YearsOfExperience' : 23.0,
    'MainBranch' : "Profession",
    'Country'  : "Germany",
    'EducationLevel' : "Master's degree",
    'RemoteWork' : "In-person",
}])
new_data['Predicted Salary'] = loaded_tuned_best_pipeline.predict(new_data)
new_data

| | CompTotal | YearsOfExperience | MainBranch | Country | EducationLevel | RemoteWork | Predicted Salary |
|---|---|---|---|---|---|---|---|
| 0 | 195000.0 | 23.0 | Profession | Germany | Master's degree | In-person | 174310.75 |
