# ⚖️ DeepCompNet: ML Salary Prediction & SHAP Explainability

───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────    
📘 **Author:** Teslim Uthman Adeyanju  
📫 **Email:** [info@adeyanjuteslim.co.uk](mailto:info@adeyanjuteslim.co.uk)  
🔗 **LinkedIn:** [linkedin.com/in/adeyanjuteslimuthman](https://www.linkedin.com/in/adeyanjuteslimuthman)  
🌐 **Website & Blog:** [adeyanjuteslim.co.uk](https://adeyanjuteslim.co.uk)  
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────    


### Table of Contents

By following this roadmap, readers can gain a comprehensive understanding of the DeepCompNet project. The roadmap is divided into the following sections:

1. [Introduction](#1-introduction)  
   - 1.1 [Project Overview](#11-project-overview)  
   - 1.2 [Project Objectives](#12-project-objectives)  
   - 1.3 [Business Value](#13-business-value)  
   - 1.4 [Tools and Technologies](#14-tools-and-technologies)

2. [Data Collection & Understanding](#2-data-collection--understanding)  
   - 2.1 [Dataset Overview](#21-dataset-overview)  
   - 2.2 [Data Fields and Definitions](#22-data-fields-and-definitions)  
   - 2.3 [Assumptions and Limitations](#23-assumptions-and-limitations)  
   - 2.4 [Initial Exploration](#24-initial-exploration)

3. [Exploratory Data Analysis (EDA)](#3-exploratory-data-analysis-eda)  
   - 3.1 [Salary Distribution Analysis](#31-salary-distribution-analysis)  
   - 3.2 [Categorical Feature Breakdown](#32-categorical-feature-breakdown)  
   - 3.3 [Geographical and Remote Work Trends](#33-geographical-and-remote-work-trends)  
   - 3.4 [Outlier Detection and Decisions](#34-outlier-detection-and-decisions)  
   - 3.5 [Variance and Correlation Analysis](#35-variance-and-correlation-analysis)

4. [Feature Engineering](#4-feature-engineering)  
   - 4.1 [Categorical Embeddings for High-Cardinality Variables](#41-categorical-embeddings-for-high-cardinality-variables)  
   - 4.2 [Statistical Feature Creation](#42-statistical-feature-creation)  
   - 4.3 [Interaction Terms and Cross Features](#43-interaction-terms-and-cross-features)  
   - 4.4 [Feature Scaling and Transformation](#44-feature-scaling-and-transformation)  
   - 4.5 [Final Feature Set for Modeling](#45-final-feature-set-for-modeling)

5. [Model Development: Deep Learning](#5-model-development-deep-learning)  
   - 5.1 [Neural Network Architecture](#51-neural-network-architecture)  
   - 5.2 [Embedding Layers and Dense Connections](#52-embedding-layers-and-dense-connections)  
   - 5.3 [Regularization: Dropout & Batch Normalization](#53-regularization-dropout--batch-normalization)  
   - 5.4 [Training Strategy and Early Stopping](#54-training-strategy-and-early-stopping)  
   - 5.5 [Baseline Comparison with LightGBM or CatBoost](#55-baseline-comparison-with-lightgbm-or-catboost)

6. [Model Evaluation](#6-model-evaluation)  
   - 6.1 [Performance Metrics: MAE, RMSE, R²](#61-performance-metrics-mae-rmse-r²)  
   - 6.2 [Residual Analysis and Interpretation](#62-residual-analysis-and-interpretation)  
   - 6.3 [Error Analysis Across Job Segments](#63-error-analysis-across-job-segments)

7. [Model Explainability](#7-model-explainability)  
   - 7.1 [Global SHAP Summary Analysis](#71-global-shap-summary-analysis)  
   - 7.2 [Local SHAP Interpretations (Case Studies)](#72-local-shap-interpretations-case-studies)  
   - 7.3 [Feature Importance and Insights](#73-feature-importance-and-insights)

8. [Model Deployment](#8-model-deployment)  
   - 8.1 [Prepare Model for Inference](#81-prepare-model-for-inference)  
   - 8.2 [Designing Streamlit Interface](#82-designing-streamlit-interface)  
   - 8.3 [Integrating SHAP in Streamlit App](#83-integrating-shap-in-streamlit-app)  
   - 8.4 [Hosting: Streamlit Cloud / Hugging Face Spaces](#84-hosting-streamlit-cloud--hugging-face-spaces)

9. [MLflow Tracking](#9-mlflow-tracking)  
   - 9.1 [Tracking Experiments](#91-tracking-experiments)  
   - 9.2 [Logging Parameters and Artifacts](#92-logging-parameters-and-artifacts)  
   - 9.3 [Model Registry and Version Control](#93-model-registry-and-version-control)

10. [Conclusion and Recommendations](#10-conclusion-and-recommendations)  
    - 10.1 [Summary of Findings](#101-summary-of-findings)  
    - 10.2 [Business Implications](#102-business-implications)  
    - 10.3 [Lim]()


## 📚 1.0 Introduction


<div style="font-family: Avenir, sans-serif; font-size: 16px; line-height: 1.6; color: white; background-color: #333; padding: 10px; border-radius: 5px;">
This section introduces the project and the problem it addresses. It covers the project overview, objectives, business value, and the tools (libraries) used to solve the problem.

</div>



### 1.1 Project Overview
This project, **DeepCompNet**, aims to build an advanced salary prediction model for machine learning roles using deep learning techniques. The dataset consists of global compensation data for ML professionals and captures various features like job title, experience, location, remote work ratio, and company size.

We leverage neural networks for modeling, embedding layers for categorical variables, and SHAP for explainability. The final solution is deployed via a Streamlit app to enable real-time salary prediction and interpretation.
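
To make the idea of an embedding layer concrete, here is a minimal NumPy sketch of the lookup it performs. The toy vocabulary, the embedding size of 4, and the helper name `embed` are assumptions for illustration only; in a real model the matrix rows would be learned during training rather than drawn at random.

```python
import numpy as np

# Hypothetical toy vocabulary; a real pipeline would build it from the training data.
job_titles = ["Data Scientist", "ML Engineer", "Data Science Manager"]
title_to_index = {title: i for i, title in enumerate(job_titles)}

embedding_dim = 4  # assumed size, chosen only for illustration
rng = np.random.default_rng(seed=0)
# One row per category; random here, trainable weights in an actual network.
embedding_matrix = rng.normal(size=(len(job_titles), embedding_dim))


def embed(titles):
    """Return the dense vector for each job title via integer indexing."""
    indices = np.array([title_to_index[t] for t in titles])
    return embedding_matrix[indices]


batch = embed(["ML Engineer", "Data Scientist", "ML Engineer"])
print(batch.shape)  # (3, 4)
```

Identical categories always map to the same row, which is what lets the network share what it learns about a job title across every record that carries it.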

---

### 1.2 Project Objectives
- Predict salary compensation (`salary_in_usd`) for ML-related job roles.
- Identify and rank key features influencing salary outcomes.
- Build a deep learning model with categorical embeddings.
- Compare deep learning model performance with LightGBM/CatBoost baselines.
- Apply SHAP for global and local feature explainability.
- Deploy the model using Streamlit for public access and interaction.
- Track all experiments and hyperparameters using MLflow.

---

### 1.3 Business Value
This project provides a practical solution for:
- **HR and Recruitment** teams: To set competitive salary benchmarks.
- **Job Seekers**: To assess fair compensation expectations based on job attributes.
- **Compensation Analysts**: To gain insights into salary drivers across roles and regions.
- **Executives and Policy Makers**: To support equitable compensation frameworks in tech hiring.

---

### 1.4 Tools and Technologies
Below are the key tools and libraries used throughout this project:

- **Data Handling & Visualization**:  
  `pandas`, `numpy`, `matplotlib`, `seaborn`

- **Deep Learning**:  
  `PyTorch` *(or TensorFlow)*, `scikit-learn`

- **Explainability**:  
  `SHAP`, `PermutationImportance`

- **Experiment Tracking**:  
  `MLflow`

- **Deployment**:  
  `Streamlit`, optionally `Hugging Face Spaces`



### 🔍 Library Tools
___

In [1]:
# Core Libraries
import numpy as np
import pandas as pd
import warnings

# Visualization Libraries
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.cm import ScalarMappable, coolwarm
from matplotlib.colors import Normalize
from matplotlib.gridspec import GridSpec
from mpl_toolkits.mplot3d import Axes3D
plt.style.use('ggplot')

# Data Preprocessing
from sklearn.preprocessing import (
    StandardScaler, OneHotEncoder, LabelEncoder, OrdinalEncoder,
    PolynomialFeatures, PowerTransformer
)
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA

# Feature Selection
from sklearn.feature_selection import SelectKBest, f_regression

# Model Selection and Evaluation
from sklearn.model_selection import (
    train_test_split, GridSearchCV, cross_val_score, learning_curve
)
from sklearn.metrics import (
    mean_squared_error, r2_score, mean_absolute_error
)

# Regression Models
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
import lightgbm as lgb
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

# xgboost is optional here; guard the import so the cell still runs
# when the package is not installed (`pip install xgboost` enables it).
try:
    from xgboost import XGBRegressor
except ModuleNotFoundError:
    XGBRegressor = None

# Pipeline Components
from sklearn.pipeline import Pipeline

# Statistical Tools
from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy.stats import chi2_contingency


## 📚 2.0 Data Collection & Understanding


<div style="font-family: Avenir, sans-serif; font-size: 16px; line-height: 1.6; color: white; background-color: #333; padding: 10px; border-radius: 5px;">

This section focuses on loading the dataset and performing data preprocessing tasks such as handling missing values, changing data types, and confirming the absence of duplicates. This will make our dataset ready for exploratory data analysis and model development.
</div>


### 2.1 Dataset Overview


The dataset used in this project is sourced from Kaggle and includes machine learning-related salary records for the year 2024. Each row represents an individual job role with attributes such as job title, location, experience level, employment type, company size, and more.

- **Source**: [Kaggle – ML Engineer Salary (2024)](https://www.kaggle.com/datasets/chopper53/machine-learning-engineer-salary-in-2024)  
- **Observations**: 16,494 records  
- **Format**: CSV  
- **Target Variable**: `salary_in_usd`  

This dataset provides a strong foundation for understanding salary distributions and building predictive models.

### 2.2 Data Fields and Definitions



| Column Name         | Description                                                                 |
|---------------------|-----------------------------------------------------------------------------|
| `job_title`         | Title of the job (e.g., Data Scientist, ML Engineer)                        |
| `salary_in_usd`     | Annual salary in USD (standardized across all countries)                    |
| `salary`            | Salary amount in the currency given by `salary_currency`                    |
| `salary_currency`   | Currency code of the reported salary (e.g., USD)                            |
| `employee_residence`| Country of the employee's primary residence                                 |
| `experience_level`  | Seniority level (EN = Entry, MI = Mid, SE = Senior, EX = Executive)         |
| `employment_type`   | Full-time, Part-time, Freelance, or Contract                                |
| `company_size`      | Size of the company (S = Small, M = Medium, L = Large)                      |
| `remote_ratio`      | % of remote work (0 = Onsite, 50 = Hybrid, 100 = Fully Remote)              |
| `company_location`  | Country where the company is headquartered                                  |
| `work_year`         | Year the salary was paid (should be 2024 for all entries)                   |
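
The coded fields in the table can be checked against their documented values before modeling. The sketch below is one possible way to do that; the helper name `find_invalid_values` and the three-row sample are hypothetical, and only the codes spelled out in the table are validated.

```python
import pandas as pd

# Allowed codes taken from the field definitions in the table above.
ALLOWED = {
    "experience_level": {"EN", "MI", "SE", "EX"},
    "company_size": {"S", "M", "L"},
    "remote_ratio": {0, 50, 100},
}


def find_invalid_values(frame: pd.DataFrame) -> dict:
    """Return, per column, the set of values outside the documented codes."""
    problems = {}
    for column, allowed in ALLOWED.items():
        unexpected = set(frame[column].tolist()) - allowed
        if unexpected:
            problems[column] = unexpected
    return problems


# Tiny illustrative sample (not the real dataset); the last row is deliberately wrong.
sample = pd.DataFrame({
    "experience_level": ["MI", "SE", "XX"],
    "company_size": ["S", "M", "L"],
    "remote_ratio": [0, 100, 50],
})
print(find_invalid_values(sample))  # {'experience_level': {'XX'}}
```

On the full dataset the same call would be `find_invalid_values(df)`; an empty dictionary means every coded column matches its definition.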



### 2.3 Assumptions and Limitations

**Assumptions**
- `salary_in_usd` is assumed to be standardized across all countries regardless of local currency.
- The dataset is considered representative of the global ML job market for the year 2024.
- High-cardinality categorical features such as `job_title` can be effectively captured using embedding layers in a deep learning model.
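
The first assumption can be probed with a simple consistency check: for records already paid in USD, `salary` and `salary_in_usd` should agree. This is only a sketch. The USD rows mirror the first records of the dataset, the EUR row is hypothetical, and the check says nothing about the exchange rates applied to other currencies.

```python
import pandas as pd

# Three USD rows as they appear at the top of the dataset, plus one
# hypothetical EUR row to show that non-USD records are left out of the check.
sample = pd.DataFrame({
    "salary": [120000, 70000, 130000, 90000],
    "salary_currency": ["USD", "USD", "USD", "EUR"],
    "salary_in_usd": [120000, 70000, 130000, 97000],
})

# Keep only records whose original currency is already USD.
usd_rows = sample[sample["salary_currency"] == "USD"]
# Any row where the two salary columns disagree is a potential data issue.
mismatches = usd_rows[usd_rows["salary"] != usd_rows["salary_in_usd"]]
print(len(mismatches))  # 0
```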

**Limitations**
- Potential **regional bias** due to overrepresentation from countries like the USA and India.
- The dataset lacks **industry/sector information** (e.g., tech, finance, healthcare), which may impact salary variance.
- **Total compensation** may be underestimated due to missing data on bonuses, equity, or benefits.
- **Temporal limitation**: The dataset only covers the year 2024 and does not account for salary trends over time.
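
The regional and temporal limitations can be quantified directly from the data. The following sketch assumes a hypothetical helper `location_shares` and a five-row sample invented for illustration; it is not a result from the real dataset.

```python
import pandas as pd


def location_shares(frame: pd.DataFrame, column: str = "company_location") -> pd.Series:
    """Share of records per country, largest first."""
    return frame[column].value_counts(normalize=True)


# Invented sample, used only to show the shape of the output.
sample = pd.DataFrame({
    "company_location": ["US", "US", "US", "AU", "IN"],
    "work_year": [2024] * 5,
})

shares = location_shares(sample)
print(shares.idxmax())                # US
print(sample["work_year"].nunique())  # 1
```

A single dominant country in `shares`, or a single value in `work_year`, would confirm the corresponding limitation for the full dataset.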


### 2.4 Initial Exploration

In [15]:
df = pd.read_csv('data.csv', engine='pyarrow')

In [21]:
df.head()

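|   | work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
|---|-----------|------------------|-----------------|-----------|--------|-----------------|---------------|--------------------|--------------|------------------|--------------|
| 0 | 2024 | MI | FT | Data Scientist | 120000 | USD | 120000 | AU | 0 | AU | S |
| 1 | 2024 | MI | FT | Data Scientist | 70000 | USD | 70000 | AU | 0 | AU | S |
| 2 | 2024 | MI | CT | Data Scientist | 130000 | USD | 130000 | US | 0 | US | M |
| 3 | 2024 | MI | CT | Data Scientist | 110000 | USD | 110000 | US | 0 | US | M |
| 4 | 2024 | MI | FT | Data Science Manager | 240000 | USD | 240000 | US | 0 | US | M |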


In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16494 entries, 0 to 16493
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   work_year           16494 non-null  int64 
 1   experience_level    16494 non-null  object
 2   employment_type     16494 non-null  object
 3   job_title           16494 non-null  object
 4   salary              16494 non-null  int64 
 5   salary_currency     16494 non-null  object
 6   salary_in_usd       16494 non-null  int64 
 7   employee_residence  16494 non-null  object
 8   remote_ratio        16494 non-null  int64 
 9   company_location    16494 non-null  object
 10  company_size        16494 non-null  object
dtypes: int64(4), object(7)
memory usage: 1.4+ MB


In [18]:
# check the dataframe shape
df.shape

(16494, 11)

In [22]:
# Percentage of missing values per column
df.isnull().sum() / df.shape[0] * 100

work_year             0.0
experience_level      0.0
employment_type       0.0
job_title             0.0
salary                0.0
salary_currency       0.0
salary_in_usd         0.0
employee_residence    0.0
remote_ratio          0.0
company_location      0.0
company_size          0.0
dtype: float64
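
The section introduction also mentions confirming the absence of duplicates and changing data types. A minimal sketch of both steps follows, using an invented three-row sample; the choice of columns to convert is an assumption, not the notebook's actual preprocessing.

```python
import pandas as pd

# Invented sample in which the first two rows are exact duplicates.
sample = pd.DataFrame({
    "experience_level": ["MI", "MI", "SE"],
    "company_size": ["S", "S", "M"],
    "salary_in_usd": [120000, 120000, 240000],
})

# Count fully identical rows, then drop them.
n_duplicates = int(sample.duplicated().sum())
deduplicated = sample.drop_duplicates().reset_index(drop=True)

# Store low-cardinality string columns as pandas categoricals.
categorical_columns = ["experience_level", "company_size"]
typed = deduplicated.astype({col: "category" for col in categorical_columns})

print(n_duplicates, len(typed))  # 1 2
```

In salary data, identical rows can be legitimate (two people with the same role, level, and pay), so dropping duplicates is a judgment call worth documenting rather than an automatic step.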

In [25]:
numerical_data = df.select_dtypes(include=['float64', 'int64']).columns
categorical_data = df.select_dtypes(include=['object', 'category']).columns

In [26]:
# It is important to check the number of unique values in each categorical column
for col in categorical_data:
    # Print the number of unique values
    print(f"Number of unique values in {col} column: ", df[col].nunique())

Number of unique values in experience_level column:  4
Number of unique values in employment_type column:  4
Number of unique values in job_title column:  155
Number of unique values in salary_currency column:  23
Number of unique values in employee_residence column:  88
Number of unique values in company_location column:  77
Number of unique values in company_size column:  3
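
These cardinalities can guide how each categorical column is encoded. The sketch below applies a common rule of thumb (half the cardinality, rounded up and capped at 50) to size embeddings. The cut-off of 10 and the names `embedding_dim` and `plan` are hypothetical choices, not the project's final configuration.

```python
# Cardinalities copied from the output above.
cardinalities = {
    "experience_level": 4,
    "employment_type": 4,
    "job_title": 155,
    "salary_currency": 23,
    "employee_residence": 88,
    "company_location": 77,
    "company_size": 3,
}

HIGH_CARDINALITY_THRESHOLD = 10  # hypothetical cut-off between one-hot and embedding


def embedding_dim(n_categories: int, max_dim: int = 50) -> int:
    """Rule of thumb: half the cardinality (rounded up), capped at max_dim."""
    return min(max_dim, (n_categories + 1) // 2)


# Decide, per column, between an embedding of a given size and one-hot encoding.
plan = {
    col: ("embedding", embedding_dim(n)) if n > HIGH_CARDINALITY_THRESHOLD else ("one-hot", n)
    for col, n in cardinalities.items()
}

for col, (encoding, size) in plan.items():
    print(f"{col}: {encoding} ({size})")
```

Under these assumptions `job_title` would receive a 50-dimensional embedding, while `experience_level`, `employment_type`, and `company_size` stay one-hot encoded.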


## 📚 3.0 Exploratory Data Analysis (EDA)