# **Salary Predictive Model.**

---

## **Table of Contents:**

---

### [1 Imports, Data Loading and Preprocessing](#1)
- #### [1.1 Imports](#11)
- #### [1.2 Loading Data](#12)
- #### [1.3 Data Cleansing and Imputation](#13)
- #### [1.4 Data Visualization](#14)
- #### [1.5 Data Visualization Results](#15)

### [2 Feature Engineering and Data Splitting](#2)
- #### [2.1 Splitting Data](#21)
- #### [2.2 Data Normalization and Scaling](#22)
- #### [2.3 Feature Relationship Visualization](#23)

### [3 Model Training and Evaluation](#3)
- #### [3.1 Baseline Model: Dummy Regressor](#31)
- #### [3.2 Random Forest Regressor](#32)
- #### [3.3 RFR Initial Testing](#33)
- #### [3.4 Feature Selection Process](#34)
- #### [3.5 Neural Network Model](#35)
- #### [3.6 Neural Network Model Initial Testing](#36)

### [4 Model Comparison](#4)
- #### [4.1 Final Model Selection](#41)

### [5 Inference](#5)

---

<a id="1"></a> 
# **1 Imports, Data Loading and Preprocessing**

The data preprocessing pipeline consists of several key stages to prepare the dataset for modeling:

---

<a id="11"></a> 
## **1.1 Imports**

All the imports the notebook needs to run properly.

---

In [None]:
# Necessary imports

import matplotlib.font_manager as fm
import logging
import pickle
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
from tensorflow import keras
import joblib
fm._log.setLevel(logging.WARNING)
from src import data_loading
from src import preprocessing
from src import visualize_data
from src import feature_engenieering
from src import modeling
from src import evaluation
from src import model_compare
from src import inference
from src.inference_jupyter_form import create_input_form
from src import feature_visualization

---

<a id="12"></a>
## **1.2 Loading Data**



 The `load_data` function in `data_loading.py` loads multiple CSV files and merges them into a single DataFrame using the 'id' column as the merging key.


 ---



In [3]:

#files path for the raw dataset:

data_files = ['./data/people.csv','./data/descriptions.csv','./data/salary.csv',]


#merge datasets in a cohesive Dataframe

full_dataset = data_loading.load_data(data_files)
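
To illustrate what merging on `id` means, here is a minimal, dependency-free sketch. The column names and sample rows are invented for the example, and the join semantics are an assumption: this version keeps only the ids present in every file (an inner join), whereas the real `load_data` may use pandas and behave differently.

```python
import csv
import io

def merge_on_id(csv_texts, key="id"):
    """Inner-join several CSV tables (given as text) on a shared key column."""
    merged = None
    for text in csv_texts:
        rows = {row[key]: row for row in csv.DictReader(io.StringIO(text))}
        if merged is None:
            merged = rows
        else:
            # Keep only ids present in both tables and combine their columns.
            merged = {k: {**merged[k], **rows[k]} for k in merged if k in rows}
    return list(merged.values()) if merged else []

# Hypothetical file contents; id 2 has no salary and id 3 has no person record.
people = "id,Age\n1,32\n2,45\n"
descriptions = "id,Description\n1,Engineer with 5 years\n2,Senior manager\n"
salary = "id,Salary\n1,90000\n3,120000\n"

merged_rows = merge_on_id([people, descriptions, salary])
print(merged_rows)
```

Only id `1` survives here because it is the only id present in all three tables.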



<a id="13"></a>
## **1.3 Data Cleansing and Imputation**

In this step, I detect statistical outliers with the interquartile range (IQR), remove duplicate entries, and fill in missing entries using the free-text descriptions.

The `infer_missing_values_in_dataframe` function in `llm_dataset_filler.py` leverages a local LLM to infer missing values for specific fields (e.g., Age, Gender, Education Level) based on the provided descriptions. It utilizes asynchronous calls to the LLM API and updates the DataFrame with inferred values.

Outliers and duplicates are removed to avoid overfitting on particular data points. Note that although this functionality is implemented, the dataset does not contain duplicates or significant outliers detectable with this method.
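
The IQR rule and the duplicate removal can be sketched as follows. This is an illustration only: the 1.5 multiplier, the quartile method, and the sample rows are assumptions, and the actual code in `preprocessing` may differ.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def drop_duplicates_and_outliers(rows, column):
    """Remove exact duplicate rows, then rows outside the IQR fences of one column."""
    unique_rows = list(dict.fromkeys(rows))  # keeps the first occurrence, preserves order
    low, high = iqr_bounds([row[column] for row in unique_rows])
    return [row for row in unique_rows if low <= row[column] <= high]

# Hypothetical (age, salary) rows: one exact duplicate and one extreme salary.
rows = [(25, 40_000), (30, 50_000), (30, 50_000), (35, 60_000),
        (40, 70_000), (45, 80_000), (50, 1_000_000)]
clean_rows = drop_duplicates_and_outliers(rows, column=1)
print(clean_rows)
```

With these sample values the fences are 15,000 and 115,000, so the duplicate row and the 1,000,000 salary are dropped.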

---

In [None]:

BASE_URL = 'http://localhost:11434'  # Replace with your server URL and port; any OpenAI-compatible API endpoint is accepted
MODEL_NAME = 'hermes3:8b-llama3.1-q6_K'  # Replace with your model name
API_KEY = ""  # Replace with your API key

cleansed_dataset = await preprocessing.preprocess(full_dataset, BASE_URL, MODEL_NAME, API_KEY)


<a id="14"></a> 
## **1.4 Data Visualization**

I'm using simple visualizations to see how the dataset is distributed.

---

In [5]:

visualize_data.visualize_dataset(cleansed_dataset)

<a id="15"></a> 

## **1.5 Data Visualization Results**


---
#### **Scatter Plots**

The following scatter plots visualize the relationships between different pairs of numerical variables in the dataset:

1. **Scatter Plot of Salary vs Age**


    ![Scatter Plot of Salary vs Age](./plots/scatter_Salary_vs_Age.png)

    - This plot shows the relationship between Salary and Age. The relationship is fairly linear, with more variance at the upper end of the distribution, which makes Age a strong feature for the model.


2. **Scatter Plot of Salary vs Years of Experience**

    ![Scatter Plot of Salary vs Years of Experience](./plots/scatter_Salary_vs_Years_of_Experience.png)

    - This plot illustrates the relationship between Salary and Years of Experience. It shows a fairly linear relationship, similar to Age.


3. **Scatter Plot of Years of Experience vs Age**


    ![Scatter Plot of Years of Experience vs Age](./plots/scatter_Years_of_Experience_vs_Age.png)

    - This plot displays the relationship between Years of Experience and Age. It shows the expected distribution, suggesting the dataset is not particularly skewed or misrepresented.

---

#### **Box Plots**

Box plots provide a summary of the distribution of numerical variables, highlighting the median, quartiles, and potential outliers:



1. **Box Plot of Age**

    ![Box Plot of Age](./plots/boxplot_Age.png)
    - This plot shows the distribution of the Age variable, including the median, quartiles, and any outliers.

2. **Box Plot of Years of Experience**

    ![Box Plot of Years of Experience](./plots/boxplot_Years_of_Experience.png)
    - This plot illustrates the distribution of the Years of Experience variable, highlighting the central tendency and spread.

3. **Box Plot of Salary**

    ![Box Plot of Salary](./plots/boxplot_Salary.png)
    - This plot presents the distribution of the Salary variable, showing the median, quartiles, and outliers.

---

#### **Histograms**

Histograms provide a visual representation of the distribution of numerical variables:

1. **Histogram of Age**

    ![Histogram of Age](./plots/histogram_Age.png)
    - This histogram shows the frequency distribution of the Age variable. It follows the expected normal curve for age, truncated because of the nature of working life.

2. **Histogram of Years of Experience**

    ![Histogram of Years of Experience](./plots/histogram_Years_of_Experience.png)
    - This histogram illustrates the frequency distribution of the Years of Experience variable. It shows a clear, roughly linear decline in representation as experience increases, which is expected.

3. **Histogram of Salary**

    ![Histogram of Salary](./plots/histogram_Salary.png)
    - This histogram presents the frequency distribution of the Salary variable. The distribution is not normal and suggests the dataset is biased towards the 140k range, which can hurt model performance. Given the size of the dataset, pruning examples could hinder the model's ability to converge, so the issue is addressed instead with dropout layers for regularization in the NN model and Huber loss to reduce the impact of outlier values.

---
#### **Count Plots**

Count plots visualize the frequency of categorical variables:

1. **Count Plot of Gender**

    ![Count Plot of Gender](./plots/countplot_Gender.png)
    - This plot shows the frequency distribution of the Gender variable, providing insights into the gender composition of the dataset.

2. **Count Plot of Education Level**

    ![Count Plot of Education Level](./plots/countplot_Education_Level.png)
    - This plot illustrates the frequency distribution of the Education Level variable, revealing the educational background of the dataset.


---
#### **Top 10 Job Titles**

The following plot visualizes the top 10 job titles in the dataset:

1. **Top 10 Job Titles**

    ![Top 10 Job Titles](./plots/top_10_job_titles.png)
    - This plot shows the frequency of the top 10 job titles, highlighting the most common job titles in the dataset.

---

<a id="2"></a> 
# **2 Feature Engineering and Data Splitting**

This step processes the dataset for training. Here I also create the objects needed for inference on new data.

---

<a id="21"></a> 
## **2.1 Splitting Data**

The dataset is divided into training and testing subsets so that model performance is evaluated on data the model has not seen, which gives a less biased assessment of its capabilities.

---

In [None]:

#split the dataset into an 80 / 20 ratio for training and testing.

X_train, X_test, y_train, y_test = feature_engenieering.split_data(cleansed_dataset)
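
A minimal sketch of an 80/20 shuffle split, using only the standard library. The fixed seed, the helper name, and the toy data are assumptions for illustration; `split_data` itself may rely on scikit-learn and a different seed.

```python
import random

def train_test_split_rows(features, targets, test_ratio=0.2, seed=42):
    """Shuffle the row indices once, then slice them into train and test partitions."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(indices) * test_ratio))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(data, idx):
        return [data[i] for i in idx]

    return (pick(features, train_idx), pick(features, test_idx),
            pick(targets, train_idx), pick(targets, test_idx))

# Hypothetical rows: [Age, Years of Experience] and a made-up salary.
X = [[age, age - 22] for age in range(25, 45)]
y = [30_000 + 2_500 * row[1] for row in X]

X_tr, X_te, y_tr, y_te = train_test_split_rows(X, y)
print(len(X_tr), len(X_te))
```

Features and targets are selected with the same shuffled indices, so each row stays paired with its own salary.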



<a id="22"></a> 
## **2.2 Data Normalization and Scaling**

 Normalizing and scaling the data standardizes the features, which helps model convergence and performance. Min-Max scaling is used for the RFR model and `RobustScaler` for the NN model. Categorical features are handled case by case:

 - **Education Level** is modeled as a linear (ordinal) relationship: 0 for Bachelor's, 1 for Master's and 2 for PhD.
 - **Job Title** correlates strongly with salary but has no natural ordering, so target encoding appears to be the best fit.

 Once created, the datasets are saved in pkl format together with the scalers and the job title target-encoding table, to be used later for training and inference respectively.

 ---

In [None]:

# Normalize and scale the datasets using MinMaxScaler and a target encoder for the Random Forest

normalized_X_train, te, scaler = feature_engenieering.normalize_train_data(X_train, y_train, MinMaxScaler())

normalized_X_test = feature_engenieering.normalize_test_data(X_test, te, scaler)


# Normalize and scale the datasets using RobustScaler and a target encoder for the Neural Network

normalized_X_train_nn, te_nn, scaler_nn = feature_engenieering.normalize_train_data(X_train, y_train, RobustScaler(), "nn_")

normalized_X_test_nn = feature_engenieering.normalize_test_data(X_test, te_nn, scaler_nn)

feature_engenieering.save_datasets(X_train, X_test, y_train, y_test, normalized_X_train, normalized_X_train_nn, normalized_X_test, normalized_X_test_nn)
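
The encodings used in this step can be sketched in plain Python. This is a simplified illustration, not the project's implementation: the target encoder here uses plain per-title means without smoothing, the global-mean fallback for unseen titles is an assumption, and all names and sample values are hypothetical. The key point it shows is that the encoder and the scaler are fitted on the training data only and then reused on the test data.

```python
from collections import defaultdict

# Ordinal codes for Education Level.
EDUCATION_ORDER = {"Bachelor's": 0, "Master's": 1, "PhD": 2}

def fit_target_encoder(job_titles, salaries):
    """Map each job title to its mean training salary; also return the global mean."""
    totals = defaultdict(lambda: [0.0, 0])
    for title, salary in zip(job_titles, salaries):
        totals[title][0] += salary
        totals[title][1] += 1
    table = {title: total / count for title, (total, count) in totals.items()}
    return table, sum(salaries) / len(salaries)

def encode_titles(job_titles, table, fallback):
    """Look up each title; unseen titles fall back to the global mean (assumption)."""
    return [table.get(title, fallback) for title in job_titles]

def min_max_transform(values, low, high):
    span = (high - low) or 1.0  # avoid dividing by zero for constant columns
    return [(v - low) / span for v in values]

train_titles = ["Engineer", "Engineer", "Manager", "Analyst"]
train_salaries = [80_000, 100_000, 120_000, 60_000]
test_titles = ["Manager", "Director"]  # "Director" never appears in training

table, fallback = fit_target_encoder(train_titles, train_salaries)
encoded_train = encode_titles(train_titles, table, fallback)
low, high = min(encoded_train), max(encoded_train)  # fitted on training data only

scaled_train = min_max_transform(encoded_train, low, high)
scaled_test = min_max_transform(encode_titles(test_titles, table, fallback), low, high)
education_codes = [EDUCATION_ORDER[level] for level in ["PhD", "Bachelor's"]]
print(scaled_train, scaled_test, education_codes)
```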



<a id="23"></a> 
## **2.3 Feature Relationship Visualization**

This step creates visual plots to assess the distribution and relationships of the selected features.

---

In [None]:
feature_visualization.plot_feature_relationships(x_train=normalized_X_train, y_train=y_train)

1. **Job title encoded vs Salary**

    ![Job title encoded vs Salary](./plots/job_title_encoded_vs_salary.png)
    
    - This plot visualizes the relationship between the target-encoded job title and salary, which is especially important for gauging the feature-modeling choice.

2. **Pairplot of Training Data**

    ![Pairplot](./plots/pairplot.png)
    
    - This pairplot visualizes the relationships between different pairs of features in the training data. It helps to understand the feature interactions and distributions.

3. **Correlation Heatmap**

    ![Correlation Heatmap](./plots/correlation_heatmap.png)
    
    - This heatmap shows the correlation between different features in the training data. It helps to identify highly correlated features and potential multicollinearity issues.

<a id="3"></a> 
# **3 Model Training and Evaluation**

This section covers the training of a Random Forest Regressor and a Neural Network model, followed by feature and performance analysis of both.

---

<a id="31"></a> 
## **3.1 Baseline Model: Dummy Regressor**

Establishing a baseline performance using a dummy regressor. This simple model provides a reference point for comparing advanced models.

With the datasets created, the Min-Max-scaled splits are used to train the models with the scikit-learn framework.

I started by training a Dummy Regressor as a baseline for model performance comparison, and then trained a Random Forest Regressor with hyperparameter tuning. The trained model is evaluated with metrics such as mean absolute error (MAE), root mean squared error (RMSE) and R-squared (R²), together with a scatter plot of predicted vs. actual salaries.

---

In [None]:
#Load generated datasets as pkl

X_train, X_test, y_train, y_test, normalized_X_train, normalized_X_train_nn, normalized_X_test, normalized_X_test_nn = feature_engenieering.load_datasets()


In [None]:

dummy = modeling.train_dummy_regressor(normalized_X_train, y_train)
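
As a minimal sketch of what a mean-predicting baseline and the metrics above look like, assuming the dummy regressor uses the "predict the training mean" strategy (the salary values are made up):

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE and R² for paired actual and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        "r2": 1 - ss_res / ss_tot,
    }

# Hypothetical salaries.
y_train_demo = [50_000, 60_000, 70_000, 100_000]
y_test_demo = [55_000, 65_000, 95_000]

# The baseline ignores the features and always predicts the training mean.
train_mean = sum(y_train_demo) / len(y_train_demo)
baseline_pred = [train_mean] * len(y_test_demo)

baseline_metrics = regression_metrics(y_test_demo, baseline_pred)
print(baseline_metrics)
```

A constant prediction can never score above R² = 0 on the test set, and it scores slightly below 0 whenever the training mean differs from the test mean. Any useful model should beat these numbers.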

<a id="32"></a> 
## **3.2 Random Forest Regressor**

This step trains a Random Forest Regressor, a robust ensemble model that aggregates the results of multiple decision trees, to predict the target variable. It also tunes the hyperparameters and selects the most promising model with a grid search (`GridSearchCV`).

---

In [None]:
#train a model using a random forest regressor algorithm and print out the predictions for the normalized test data.

rf_model = modeling.train_model(normalized_X_train, y_train)
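
To show the idea behind an exhaustive grid search with cross-validation, here is a dependency-free sketch. It deliberately swaps the random forest for a tiny k-nearest-neighbour regressor so the example stays short; the hyperparameter names (`k`, `weighted`), the fold scheme, and the data are all assumptions, and it minimizes MAE rather than using whatever scoring `train_model` configures.

```python
import itertools
import statistics

def knn_predict(train_x, train_y, x, k, weighted):
    """Predict from the k training points closest to x (toy stand-in model)."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - x))[:k]
    if not weighted:
        return statistics.fmean([y for _, y in nearest])
    weights = [1.0 / (abs(nx - x) + 1e-9) for nx, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

def cross_val_mae(xs, ys, k, weighted, n_folds=4):
    """Mean absolute error averaged over n_folds interleaved folds."""
    fold_scores = []
    for fold in range(n_folds):
        pairs = list(enumerate(zip(xs, ys)))
        train = [p for i, p in pairs if i % n_folds != fold]
        test = [p for i, p in pairs if i % n_folds == fold]
        train_x, train_y = zip(*train)
        errors = [abs(y - knn_predict(train_x, train_y, x, k, weighted)) for x, y in test]
        fold_scores.append(statistics.fmean(errors))
    return statistics.fmean(fold_scores)

def grid_search(xs, ys, param_grid):
    """Score every combination in param_grid; return (best_params, all_scores)."""
    names = list(param_grid)
    scores = {}
    for combo in itertools.product(*(param_grid[name] for name in names)):
        scores[combo] = cross_val_mae(xs, ys, **dict(zip(names, combo)))
    best_combo = min(scores, key=scores.get)  # lowest cross-validated MAE wins
    return dict(zip(names, best_combo)), scores

xs = list(range(1, 21))                  # hypothetical years of experience
ys = [30_000 + 4_000 * x for x in xs]    # hypothetical salaries
best_params, scores = grid_search(xs, ys, {"k": [1, 3, 5], "weighted": [False, True]})
print(best_params)
```

The `scores` dictionary holds one cross-validated score per combination, which is the kind of data a grid search heatmap or line plot is drawn from.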


### **Model Training and Evaluation Plots**

The following plots provide insights into the performance and behavior of the trained models during the training and evaluation phases:

---

1. **Grid Search Heatmap**

    ![Grid Search Heatmap](./plots/grid_search_heatmap.png)
    
    - This heatmap visualizes the performance of different hyperparameter combinations during the grid search process. It helps to identify the optimal hyperparameters that yield the best model performance.

---

2. **Grid Search Line Plot**

    ![Grid Search Line Plot](./plots/grid_search_lineplot.png)

    - This line plot shows how the model's score changes across the hyperparameter combinations tried during the grid search. It helps to understand the trend and stability of the model performance across different hyperparameters.

---

These plots collectively provide a comprehensive view of the model's performance during the hyperparameter tuning process, highlighting the best-performing configurations and areas for potential improvement.

---

<a id="33"></a> 
## **3.3 RFR Initial Testing**

To quickly assess the validity of the training process, a first fast evaluation is run, followed by a bootstrapped test run to properly assess confidence intervals on the predictions.

---

In [None]:
# Load the trained model from a pickle file (joblib accepts a path, so no file handle is left open)
rf_model = joblib.load('./models/random_forest_model.pkl')

# Use the test dataset to predict salaries with the trained model for a first fast evaluation.
# The results are named rf_eval / rf_metrics to avoid shadowing the built-in eval().
rf_eval = evaluation.evaluate_model(normalized_X_test, y_test, normalized_X_train, y_train, rf_model)

# Calculate metrics for the model
rf_metrics = evaluation.calculate_metrics(normalized_X_test, y_test, rf_model)
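
The bootstrapped confidence intervals can be sketched as a percentile bootstrap over the test set. This is an assumption about the method, since the exact procedure in `evaluation` is not shown here; the resample count, seed, and sample values are also illustrative.

```python
import random

def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def bootstrap_ci(y_true, y_pred, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample (actual, predicted) pairs with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[round((alpha / 2) * n_resamples)]
    upper = scores[round((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical actual and predicted salaries for a small test set.
y_true_demo = [52_000, 61_000, 75_000, 88_000, 93_000, 110_000, 124_000, 150_000]
y_pred_demo = [50_000, 65_000, 70_000, 90_000, 99_000, 105_000, 130_000, 140_000]

point_estimate = mean_absolute_error(y_true_demo, y_pred_demo)
ci_low, ci_high = bootstrap_ci(y_true_demo, y_pred_demo, mean_absolute_error)
print(f"MAE: {point_estimate:.2f} (95% CI: [{ci_low:.2f}, {ci_high:.2f}])")
```

With a small test set the interval is wide, which is why the confidence intervals reported for this dataset span a broad range.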

### **RFR Model Evaluation Results**

The following plots provide insights into the performance and behavior of the trained Random Forest Regressor model during the evaluation phase:

---

1. **Distribution of Residuals**

    ![Distribution of Residuals](./plots/distribution_of_residuals.png)
    
    - This plot shows the distribution of residuals (the difference between actual and predicted values). It helps to identify any patterns or biases in the model's predictions.

2. **Residuals vs. Predicted Salaries**

    ![Residuals vs. Predicted Salaries](./plots/residuals_vs_predicted_salaries.png)
    
    - This scatter plot illustrates the relationship between residuals and predicted salaries. It helps to assess the model's accuracy and identify any systematic errors.

3. **SHAP Feature Analysis**

    ![SHAP Feature Analysis](./plots/shap_summary_plot_rf.png)

    - This bar chart shows roughly how much each feature contributes to the model's output.

---

#### **Model Performance with Confidence Intervals:**

Mean Squared Error (MSE): 600793386.21 (95% CI: [290532348.25, 1021638682.27])

Mean Absolute Error (MAE): 15023.79 (95% CI: [11485.43, 19054.41])

R-squared Score (R²): 0.76 (95% CI: [0.62, 0.87])

---

<a id="34"></a> 
## **3.4 Feature Selection Process**

In the feature selection process, I initially used a Random Forest Regressor to capture non-linear relationships with relatively low data quantity for training. This approach provides fast training and explainability.

---

### **Initial Model Performance with All Features**

The initial model was trained with all features to gather information on relationships and model error (R-squared Score, R²). The heteroscedasticity observed in the middle of the distribution is likely due to the dataset's lack of examples to fill the distribution appropriately. However, the residuals distribution is acceptable, with no statistically significant outliers.

**Performance Metrics:**
- Mean Squared Error (MSE): 866,064,920.96
- Mean Absolute Error (MAE): 17,982.33
- R-squared Score (R²): 0.65

**Residuals Distribution:**

![img](./plots/residuals_distribution.png)

**SHAP Values:**

![Feature Importance](./plots/feature_importance.png)

**Feature Correlation Heatmap:**

![feature correlation heatmap](./plots/feature_correlation_heatmap_1.png)

---

### **Dropping Gender**

Gender was dropped as it showed a low correlation with salary and low feature importance. This improved the model's performance.

**Performance Metrics:**
- Mean Squared Error (MSE): 600,148,032.12
- Mean Absolute Error (MAE): 14,748.68
- R-squared Score (R²): 0.76

**Residuals Distribution:**

![img](./plots/residuals_distribution_2.png)

**SHAP Values:**

![Feature Importance](./plots/feature_importance_2.png)

**Feature Correlation Heatmap:**

![feature correlation heatmap](./plots/feature_correlation_heatmap_2.png)

---

### **Dropping Education Level**

Surprisingly, dropping the education level, which had a low correlation with salary, resulted in worse model performance.

**Performance Metrics:**
- Mean Squared Error (MSE): 808,374,516.33
- Mean Absolute Error (MAE): 17,240.95
- R-squared Score (R²): 0.67

**Residuals Distribution:**

![img](./plots/residuals_distribution_3.png)

**SHAP Values:**

![Feature Importance](./plots/feature_importance_3.png)

**Feature Correlation Heatmap:**

![feature correlation heatmap](./plots/feature_correlation_heatmap_3.png)

---

### **Conclusion**

The main feature to drop is gender due to its high noise-to-signal ratio and low correlation with the target variable. The education level, although loosely correlated, still provides an advantage to the model, indicating a relevant relationship with the target.

After the initial transformations and feature selection, the final model includes the following features in order of relevance based on SHAP scores and correlation maps:
- Job Title
- Age
- Years of Experience
- Education Level

---

<a id="35"></a> 

## **3.5 Neural Network Model**

In this step I create a second model to test how a different approach performs on this specific problem. I'm using the same dataset and features as the Random Forest Regressor, but scaled for the network (see the design choices in 3.5.3), which gives better performance on sequential NN models. The hypothesis is that, with sufficient parameters, this approach can generalize better on the dataset.

---



### **3.5.1 Model Overview**
The neural network is structured as a feed-forward sequential model with a funnel architecture, designed for regression tasks. The model processes standardized features through multiple dense layers with decreasing neuron counts.

---

### **3.5.2 Architecture Diagram**

```
Input Layer (shape: features) →
Dense(64, ReLU) → Dropout(0.2) →
Dense(32, ReLU) → Dropout(0.2) →
Dense(16, ReLU) →
Dense(1, linear)
```
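
As a quick worked example, the size of this architecture can be computed by hand: each dense layer has `inputs × units` weights plus `units` biases, and dropout layers add no parameters. The sketch assumes 4 input features (Job Title, Age, Years of Experience, Education Level, per the feature selection in 3.4); the helper name is hypothetical.

```python
def dense_param_count(n_inputs, layer_sizes):
    """Trainable parameters of a stack of fully connected layers (weights + biases)."""
    total = 0
    for units in layer_sizes:
        total += n_inputs * units + units
        n_inputs = units  # the next layer consumes this layer's outputs
    return total

n_features = 4  # assumption: the four features kept after feature selection
total_params = dense_param_count(n_features, [64, 32, 16, 1])
print(total_params)  # 320 + 2080 + 528 + 17 = 2945
```

A network of roughly three thousand parameters is small, which is consistent with the limited size of the dataset.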

---

### **3.5.3 Design Choices**

1. **Input Processing**
   - Uses RobustScaler-normalized features (in testing this gave a slight advantage in R² metrics)
   - Transforms the target to a log representation
   - Adapts to input shape automatically

2. **Hidden Layers**
   - Funnel architecture: 64 → 32 → 16 neurons
   - ReLU activation for non-linearity
   - Dropout layers (0.2) for regularization

3. **Training Configuration**
   - Adam optimizer (lr=0.001)
   - Mean Squared Error loss
   - Early stopping (patience=10)
   - Validation split: 10%
   - Batch size: 32
   - Max epochs: 1000
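
The log transformation of the target can be sketched as a simple round trip. Which logarithm the project uses is not stated, so `log1p`/`expm1` is an assumption here, and the salaries are made up. The point is that a model trained on log-salaries produces log-scale outputs, which have to be mapped back before they can be read as salaries.

```python
import math

def to_log_target(salaries):
    """Compress the salary scale so large values dominate the loss less."""
    return [math.log1p(s) for s in salaries]

def from_log_target(log_values):
    """Invert the transform to get values back in currency units."""
    return [math.expm1(v) for v in log_values]

salaries = [35_000.0, 90_000.0, 250_000.0]
log_salaries = to_log_target(salaries)
restored = from_log_target(log_salaries)
print(log_salaries, restored)
```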

---

### **3.5.4 Comparison with Random Forest**

| Aspect | Neural Network | Random Forest |
|--------|---------------|---------------|
| Scaling | Requires normalization | Scale-invariant |
| Training | Iterative, gradient-based | Ensemble, parallel |
| Hyperparameters | Layer sizes, learning rate | Trees, depth, features |
| Interpretability | Less interpretable | More interpretable |

---

### **3.5.5 Hypothesis**
The neural network approach may capture complex non-linear relationships in the salary data that tree-based methods might miss, potentially leading to better generalization on unseen data.


---

In [None]:
# Train the Neural Network model
modeling.train_NN_model(normalized_X_train_nn, y_train)

![img](./plots/nn_training_loss.png)

The plot shows the loss over time during model training. The model converged at around 175 epochs, triggering early stopping to prevent overfitting.
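
The early-stopping rule with `patience=10` can be sketched as follows. It assumes the monitored quantity is the validation loss and that any decrease counts as an improvement; the loss curve is synthetic.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return the 1-based epoch at which training stops, or None if it never triggers."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Synthetic curve: the loss improves for 20 epochs, then plateaus at a worse value.
val_losses = [1 / epoch for epoch in range(1, 21)] + [0.06] * 15
stop_epoch = early_stopping_epoch(val_losses, patience=10)
print(stop_epoch)  # 30: ten epochs after the best loss at epoch 20
```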

---

<a id="36"></a> 
## **3.6 Neural Network Model Initial Testing**

To quickly assess the validity of the training process, a first fast evaluation is run, followed by a bootstrapped test run to properly assess confidence intervals on the predictions, following the same methodology as for the RFR model.

---

In [None]:
# Load the trained model from a keras file
nn_model = keras.models.load_model('./models/neural_network_model.keras')

# Use the test dataset to predict salaries with the trained model for a first fast evaluation.
# The result is named nn_eval to avoid shadowing the built-in eval().
nn_eval = evaluation.evaluate_NN_model(normalized_X_test_nn, y_test, normalized_X_train_nn, nn_model)

# Calculate metrics for the model
metrics = evaluation.calculate_metrics(normalized_X_test_nn, y_test, nn_model)

### **NN Model Evaluation Results**

The following plots provide insights into the performance and behavior of the trained Neural Network model during the evaluation phase:

---

1. **Distribution of Residuals**

    ![Distribution of Residuals](./plots/distribution_of_residuals_nn.png)
    
    - This plot shows the distribution of residuals (the difference between actual and predicted values). The residuals are approximately normally distributed around 0, which means the model is behaving as expected.

2. **Residuals vs. Predicted Salaries**

    ![Residuals vs. Predicted Salaries](./plots/residuals_vs_predicted_salaries_nn.png)
    
    - This scatter plot illustrates the relationship between residuals and predicted salaries. It shows less heteroscedasticity than the RFR model, consistent with the expected behavior.


3. **SHAP Feature Analysis**

    ![SHAP Feature Analysis](./plots/shap_summary_plot_nn.png)

    - This bar chart shows roughly how much each feature contributes to the model's output.

---

#### **Model Performance with Confidence Intervals:**

Mean Squared Error (MSE): 491423944.87 (95% CI: [260609071.86, 850637893.20])

Mean Absolute Error (MAE): 15287.96 (95% CI: [12594.30, 18564.54])

R-squared Score (R²): 0.80 (95% CI: [0.68, 0.89])

---

<a id="4"></a> 
# **4 Model Comparison**

The performance of all models is compared using mean squared error (MSE), mean absolute error (MAE) and R-squared, each reported with confidence intervals, to determine the best-performing model.

---

In [None]:
# Load the three models: Dummy Regressor, Random Forest Regressor, and Neural Network
dummy_model = joblib.load('./models/dummy_reggresor_model.pkl')
rf_model = joblib.load('./models/random_forest_model.pkl')
nn_model = keras.models.load_model('./models/neural_network_model.keras')

# Prepare the models data dictionary
models_data = {
    'Dummy Regressor': (dummy_model, normalized_X_test, y_test),
    'Random Forest Regressor': (rf_model, normalized_X_test, y_test),
    'Neural Network': (nn_model, normalized_X_test_nn, y_test)
}

# Call the model comparison module
metrics = model_compare.compare_models(models_data, y_test)

<a id="41"></a> 

## **4.1 Final Model Selection**


After building both modeling approaches, I settled on the neural network, since it seems to capture the relationships between the features better and is generally more precise.

---

The comparisons can be seen in the following charts:

![img](./plots/predicted_vs_actual_values_model_comparison.png)


![img](./plots/error_model_comparison.png)


![img](./plots/residuals_distribution_model_comparison.png)

---


### **Model Performance Summary with Confidence Intervals**:


---

#### **Random Forest**:

MSE: 591312507.301 (95% CI: [266726953.202, 1092637456.989])

MAE: 14592.480 (95% CI: [10587.479, 19339.949])

R2: 0.761 (95% CI: [0.635, 0.872])

---

#### **Neural Network**:

MSE: 410098860.336 (95% CI: [172791209.008, 806977256.100])

MAE: 12925.254 (95% CI: [9796.850, 17131.055])

R2: 0.831 (95% CI: [0.724, 0.912])

---

#### **Dummy Regressor**:

MSE: 2461198619.228 (95% CI: [1834308694.379, 3231299942.415])

MAE: 41588.630 (95% CI: [35333.317, 48193.295])

R2: -0.015 (95% CI: [-0.080, -0.000])

---

<a id="5"></a> 

# **5 Inference**

The inference process involves using the trained neural network model to make predictions on new or unseen data. This section outlines the steps to load the model, preprocess the input data, and generate predictions.

---

### **5.1 Loading the Model**

First, we load the trained neural network model from the saved file. This model has been trained and validated on the dataset, and it is now ready to be used for inference.

---

### **5.2 Preparing the Input Data**

The input data must be preprocessed in the same way as the training data. This includes encoding categorical variables and scaling numerical features. The preprocessing steps ensure that the input data is in the correct format for the model to make accurate predictions.

---

### **5.3 Making Predictions**

Once the model is loaded and the input data is preprocessed, we can use the model to make predictions. The predictions will be the estimated salaries based on the input features.

---

### **5.4 Interactive Input Form**

To facilitate the inference process, I have created an interactive input form using ipywidgets. This form allows users to input the necessary features and obtain salary predictions in real-time.


---

In [None]:
#This code cell is self-contained and can be run independently from the rest of the notebook. It will load the trained models and the test dataset, and then it will display a form to input the job title, company name, and job description. After filling the form, the user can click the "Predict Salary" button to get the salary prediction for the input data.

from src import inference
from src.inference_jupyter_form import create_input_form

job_titles = inference.get_unique_job_titles(prefix="nn_")
form = create_input_form(job_titles)
display(form)