# Salary Prediction Model

# Table of Contents

## [1 Imports, Data Loading, and Preprocessing](#1) 
### [1.1 Loading Data](#11)
### [1.2 Data Cleansing and Imputation](#12)
### [1.3 Data Visualization](#13)

## [2 Feature Engineering and Data Splitting](#2)
### [2.1 Splitting Data](#21)
### [2.2 Data Normalization and Scaling](#22)

## [3 Model Training and Evaluation](#3)
### [3.1 Baseline Model: Dummy Regressor](#31)
### [3.2 Random Forest Regressor](#32)
### [3.3 RFR Initial Testing](#33)
### [3.4 Feature Selection Process](#34)
### [3.5 Neural Network Model](#35)
### [3.6 Neural Network Model Initial Testing](#36)

## [4 Model Comparison](#4)
### [4.1 Final Model Selection](#41)

## [5 Inference](#5)

<a id="1"></a> 
# 1 Imports, Data Loading, and Preprocessing

The data preprocessing pipeline prepares the dataset for modeling in three stages: loading the data, cleansing and imputing missing values, and visualizing the result.

In [1]:
# Necessary imports

import logging
import pickle

import joblib
import matplotlib.font_manager as fm
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from tensorflow import keras

# Silence matplotlib's font-manager log messages below WARNING level.
fm._log.setLevel(logging.WARNING)

from src import data_loading
from src import preprocessing
from src import visualize_data
from src import feature_engenieering
from src import modeling
from src import evaluation
from src import model_compare
from src import inference
from src.inference_jupyter_form import create_input_form

2025-01-17 15:34:00.771510: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


<a id="11"></a> 
## 1.1 Loading Data



 The `load_data` function in `data_loading.py` loads multiple CSV files and merges them into a single DataFrame using the 'id' column as the merging key.
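As a rough illustration of the merge described above, here is a minimal sketch. It is not the actual `load_data` implementation: the helper name `load_and_merge`, the outer join, and the tiny in-memory CSVs are all assumptions made for the example.

```python
from functools import reduce
from io import StringIO

import pandas as pd


def load_and_merge(sources, key="id"):
    """Read each CSV source and merge them one by one on a shared key column."""
    frames = [pd.read_csv(src) for src in sources]
    # An outer join (an assumption) keeps rows that are missing from some files.
    return reduce(lambda left, right: left.merge(right, on=key, how="outer"), frames)


# Tiny in-memory stand-ins for the real CSV files.
people = StringIO("id,Age,Gender\n0,32,Male\n1,28,Female\n")
salary = StringIO("id,Salary\n0,90000\n1,65000\n")

merged = load_and_merge([people, salary])
print(merged)
```

With an outer join, a row that appears in only one file survives the merge with NaN in the other columns, which is what the imputation step in the next section would then have to deal with.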



In [2]:

# File paths for the raw dataset

data_files = ['./data/people.csv', './data/descriptions.csv', './data/salary.csv']


# Merge the datasets into a single DataFrame

full_dataset = data_loading.load_data(data_files)


<a id="12"></a> 
## 1.2 Data Cleansing and Imputation

The `infer_missing_values_in_dataframe` function in `llm_dataset_filler.py` uses a local LLM to infer missing values for specific fields (e.g., Age, Gender, Education Level) based on the provided descriptions. It utilizes asynchronous calls to the LLM API and updates the DataFrame with inferred values.
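The concurrency pattern described above can be sketched as follows. This is not the code in `llm_dataset_filler.py`: the hypothetical `infer_age` coroutine replaces the real LLM call with a regular expression so the example runs offline, and only the Age field is handled.

```python
import asyncio
import re

import pandas as pd


async def infer_age(description):
    """Hypothetical stand-in for an LLM call: pull an age out of free text."""
    await asyncio.sleep(0)  # yield control, as a real network request would
    match = re.search(r"(\d+)-year-old", description)
    return float(match.group(1)) if match else None


async def fill_missing_ages(df):
    """Infer Age concurrently for every row where it is missing."""
    missing = df.index[df["Age"].isna() & df["Description"].notna()]
    results = await asyncio.gather(*(infer_age(df.at[i, "Description"]) for i in missing))
    filled = df.copy()
    for i, value in zip(missing, results):
        if value is not None:  # leave the gap when nothing could be inferred
            filled.at[i, "Age"] = value
    return filled


df = pd.DataFrame({
    "Age": [32.0, None, None],
    "Description": [
        "I am a 32-year-old male.",
        "I am a 28-year-old data analyst.",
        "No age given.",
    ],
})

# In a notebook cell, use `filled = await fill_missing_ages(df)` instead.
filled = asyncio.run(fill_missing_ages(df))
print(filled)
```

Rows where inference fails keep their NaN, which mirrors the "Not found" entries in the debug log below and explains why some rows are still incomplete after inference.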

In [3]:


# Preprocess the DataFrame: fill missing values via LLM inference over each row's description, drop rows that are still incomplete, and clean up the data.

BASE_URL = 'http://localhost:11434'  # Replace with your actual server URL and port
MODEL_NAME = 'hermes3:8b-llama3.1-q6_K'       # Replace with your model name
API_KEY = ""  # Replace with your API key

cleansed_dataset = await preprocessing.preprocess(full_dataset, BASE_URL, MODEL_NAME, API_KEY)


DEBUG:src.llm_dataset_filler:Total rows with missing values: 14



Full Merged Dataset:
   id   Age  Gender Education Level          Job Title  Years of Experience  \
0   0  32.0    Male      Bachelor's  Software Engineer                  5.0   
1   1  28.0  Female        Master's       Data Analyst                  3.0   
2   2  45.0    Male             PhD     Senior Manager                 15.0   
3   3  36.0  Female      Bachelor's    Sales Associate                  7.0   
4   4  52.0    Male        Master's           Director                 20.0   

                                         Description    Salary  
0  I am a 32-year-old male working as a Software ...   90000.0  
1  I am a 28-year-old data analyst with a Master'...   65000.0  
2  I am a 45-year-old Senior Manager with a PhD a...  150000.0  
3  I am a 36-year-old female Sales Associate with...   60000.0  
4  I am a 52-year-old male with over two decades ...  200000.0  
      id   Age  Gender Education Level                      Job Title  \
370  370  35.0  Female      Bachelor's  

DEBUG:src.llm_dataset_filler:Index 172: Received NDJSON response for Salary.
DEBUG:src.llm_dataset_filler:Index 221: Received NDJSON response for Age.
DEBUG:src.llm_dataset_filler:Index 261: Received NDJSON response for Education Level.
DEBUG:src.llm_dataset_filler:Index 260: Received NDJSON response for Job Title.
DEBUG:src.llm_dataset_filler:Index 172: Inferred Salary: Not found
DEBUG:src.llm_dataset_filler:Index 221: Inferred Age: 31
DEBUG:src.llm_dataset_filler:Index 261: Inferred Education Level: Bachelor's
DEBUG:src.llm_dataset_filler:Index 260: Inferred Job Title: Not found
DEBUG:src.llm_dataset_filler:Index 139: Received NDJSON response for Education Level.
DEBUG:src.llm_dataset_filler:Index 51: Received NDJSON response for Job Title.
DEBUG:src.llm_dataset_filler:Index 366: Received NDJSON response for Education Level.
DEBUG:src.llm_dataset_filler:Index 172: Received NDJSON response for Years of Experience.
DEBUG:src.llm_dataset_filler:Index 139: Inferred Education Level: Maste


Missing Values after LLM inference:
id                     0
Age                    3
Gender                 2
Education Level        2
Job Title              2
Years of Experience    2
Description            3
Salary                 2
dtype: int64

Rows with Missing Values after LLM inference:
      id   Age Gender Education Level                 Job Title  \
111  111  37.0   Male      Bachelor's  Software Project Manager   
125  125  26.0   Male      Bachelor's         Junior Accountant   
172  172   NaN    NaN             NaN                       NaN   
177  177  31.0   Male      Bachelor's         Junior Accountant   
260  260   NaN    NaN             NaN                       NaN   
315  315   NaN   Male      Bachelor's  Senior Software Engineer   

     Years of Experience                                        Description  \
111                  9.0                                                NaN   
125                  2.0                                                NaN

<a id="13"></a> 
## 1.3 Data Visualization

I use simple visualizations to see how the dataset is distributed.

In [4]:

visualize_data.visualize_dataset(cleansed_dataset)

## Data Visualization Results

### Scatter Plots

The following scatter plots visualize the relationships between different pairs of numerical variables in the dataset:

1. **Scatter Plot of Salary vs Age**

    ![Scatter Plot of Salary vs Age](./plots/scatter_Salary_vs_Age.png)
    - This plot shows the relationship between Salary and Age. It helps to identify any trends or patterns between these two variables.

2. **Scatter Plot of Salary vs Years of Experience**

    ![Scatter Plot of Salary vs Years of Experience](./plots/scatter_Salary_vs_Years_of_Experience.png)
    - This plot illustrates the relationship between Salary and Years of Experience. It can reveal how experience impacts salary levels.

3. **Scatter Plot of Years of Experience vs Age**

    ![Scatter Plot of Years of Experience vs Age](./plots/scatter_Years_of_Experience_vs_Age.png)
    - This plot displays the relationship between Years of Experience and Age. It helps to understand how age correlates with the amount of experience.

### Box Plots

Box plots provide a summary of the distribution of numerical variables, highlighting the median, quartiles, and potential outliers:

1. **Box Plot of Age**

    ![Box Plot of Age](./plots/boxplot_Age.png)
    - This plot shows the distribution of the Age variable, including the median, quartiles, and any outliers.

2. **Box Plot of Years of Experience**

    ![Box Plot of Years of Experience](./plots/boxplot_Years_of_Experience.png)
    - This plot illustrates the distribution of the Years of Experience variable, highlighting the central tendency and spread.

3. **Box Plot of Salary**

    ![Box Plot of Salary](./plots/boxplot_Salary.png)
    - This plot presents the distribution of the Salary variable, showing the median, quartiles, and outliers.

### Histograms

Histograms provide a visual representation of the distribution of numerical variables:

1. **Histogram of Age**

    ![Histogram of Age](./plots/histogram_Age.png)
    - This histogram shows the frequency distribution of the Age variable, providing insights into its distribution shape.

2. **Histogram of Years of Experience**

    ![Histogram of Years of Experience](./plots/histogram_Years_of_Experience.png)
    - This histogram illustrates the frequency distribution of the Years of Experience variable, revealing its distribution pattern.

3. **Histogram of Salary**

    ![Histogram of Salary](./plots/histogram_Salary.png)
    - This histogram presents the frequency distribution of the Salary variable, highlighting its distribution characteristics.

### Count Plots

Count plots visualize the frequency of categorical variables:

1. **Count Plot of Gender**

    ![Count Plot of Gender](./plots/countplot_Gender.png)
    - This plot shows the frequency distribution of the Gender variable, providing insights into the gender composition of the dataset.

2. **Count Plot of Education Level**

    ![Count Plot of Education Level](./plots/countplot_Education_Level.png)
    - This plot illustrates the frequency distribution of the Education Level variable, revealing the educational background of the dataset.

### Top 10 Job Titles

The following plot visualizes the top 10 job titles in the dataset:

1. **Top 10 Job Titles**

    ![Top 10 Job Titles](./plots/top_10_job_titles.png)
    - This plot shows the frequency of the top 10 job titles, highlighting the most common job titles in the dataset.


<a id="2"></a> 
# 2 Feature Engineering and Data Splitting

<a id="21"></a> 
## 2.1 Splitting Data

The dataset is divided into training and testing subsets. Evaluating on the held-out test subset gives an unbiased assessment of the model's capabilities.
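A minimal sketch of an 80/20 split with scikit-learn, on a toy DataFrame. Whether `split_data` uses `train_test_split` internally is an assumption, and so is the `random_state` seed.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the cleansed dataset.
df = pd.DataFrame({
    "Age": range(20, 40),
    "Years of Experience": range(0, 20),
    "Salary": range(40000, 60000, 1000),
})

X = df.drop(columns=["Salary"])
y = df["Salary"]

# test_size=0.2 gives the 80/20 split; random_state is an assumed seed for reproducibility.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_tr), len(X_te))  # 16 4
```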

In [5]:

# Split the dataset 80/20 into training and testing sets.

X_train, X_test, y_train, y_test = feature_engenieering.split_data(cleansed_dataset)



<a id="22"></a> 
## 2.2 Data Normalization and Scaling

Features are normalized and scaled to put them on a common scale, which helps model convergence and performance. Min-Max scaling is used for the Random Forest Regressor (RFR) and standardization for the neural network (NN). Categorical features are handled case by case:

- Education level is modeled as a linear relationship: 0 for Bachelor's, 1 for Master's, and 2 for PhD.
- Job titles correlate strongly with salary but have no natural order, so target encoding appears to be the best fit.

Once the datasets are created, they are saved in pkl format together with the scalers and the job title target-encoding table, to be used later for training and inference respectively.
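The encoding and scaling steps can be sketched like this. It is a simplified illustration, not the code in `feature_engenieering.py`: the target encoding here is a plain per-title mean with a global-mean fallback (real target encoders usually add smoothing), and the toy rows are made up.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

train = pd.DataFrame({
    "Education Level": ["Bachelor's", "Master's", "PhD", "Bachelor's"],
    "Job Title": ["Data Analyst", "Data Analyst", "Director", "Director"],
    "Age": [25.0, 30.0, 50.0, 45.0],
    "Salary": [60000.0, 70000.0, 200000.0, 180000.0],
})
test = pd.DataFrame({
    "Education Level": ["Master's", "PhD"],
    "Job Title": ["Director", "Intern"],  # "Intern" never appears in train
    "Age": [40.0, 22.0],
})

EDUCATION_ORDER = {"Bachelor's": 0, "Master's": 1, "PhD": 2}

# Target-encoding table: mean training salary per job title.
title_means = train.groupby("Job Title")["Salary"].mean()
global_mean = train["Salary"].mean()  # fallback for unseen titles


def encode(frame):
    """Map categorical columns to numbers using statistics learned on train only."""
    out = pd.DataFrame(index=frame.index)
    out["Education Level"] = frame["Education Level"].map(EDUCATION_ORDER)
    out["Job Title"] = frame["Job Title"].map(title_means).fillna(global_mean)
    out["Age"] = frame["Age"]
    return out


scaler = MinMaxScaler().fit(encode(train))    # fit on train only
scaled_train = scaler.transform(encode(train))
scaled_test = scaler.transform(encode(test))  # reuse the train statistics
print(scaled_test)
```

Fitting both the encoding table and the scaler on the training split only, and reusing them on the test split, keeps test-set salaries from leaking into the features. Test values can fall outside [0, 1] when they lie outside the training range, as the age of 22 does here.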

In [None]:

# Normalize and scale the datasets with MinMaxScaler and a target encoder for the Random Forest

normalized_X_train, te, scaler = feature_engenieering.normalize_train_data(X_train, y_train, MinMaxScaler())

normalized_X_test = feature_engenieering.normalize_test_data(X_test, te, scaler)


# Normalize and scale the datasets with StandardScaler and a target encoder for the Neural Network

normalized_X_train_nn, te_nn, scaler_nn = feature_engenieering.normalize_train_data(X_train, y_train, StandardScaler(), "nn_")

normalized_X_test_nn = feature_engenieering.normalize_test_data(X_test, te_nn, scaler_nn)

feature_engenieering.save_datasets(X_train, y_train, X_test, y_test, normalized_X_train, normalized_X_train_nn, normalized_X_test, normalized_X_test_nn)



<a id="3"></a> 
# 3 Model Training and Evaluation

The first step is to load all the relevant files in memory.

In [None]:
# Load the generated datasets from pkl files

X_train, X_test, y_train, y_test, normalized_X_train, normalized_X_train_nn, normalized_X_test, normalized_X_test_nn = feature_engenieering.load_datasets()


<a id="31"></a> 
## 3.1 Baseline Model: Dummy Regressor

A dummy regressor establishes a baseline. This simple model provides a reference point for comparing the more advanced models.

With the datasets created, the Min-Max scaled training split is used to train a Dummy Regressor as the baseline, followed by a Random Forest Regressor built with scikit-learn and tuned via hyperparameter search. Each trained model is evaluated with mean absolute error (MAE), root mean squared error (RMSE), and R-squared (R²), along with a scatter plot of predicted vs. actual salaries.
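A minimal sketch of such a baseline on synthetic data. The `strategy="mean"` setting is scikit-learn's default and is an assumption about what `train_dummy_regressor` does.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# Synthetic stand-in data: salary depends linearly on the first feature.
rng = np.random.default_rng(0)
X_fit = rng.random((80, 3))
y_fit = 50000 + 100000 * X_fit[:, 0]
X_eval = rng.random((20, 3))
y_eval = 50000 + 100000 * X_eval[:, 0]

# The baseline ignores the features and always predicts the training mean.
baseline = DummyRegressor(strategy="mean").fit(X_fit, y_fit)
pred = baseline.predict(X_eval)

print("MAE:", mean_absolute_error(y_eval, pred))
print("R2:", r2_score(y_eval, pred))
```

Because the prediction is a constant, R² on held-out data is zero at best and usually slightly negative, so any useful model should score clearly above it.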

In [None]:

dummy = modeling.train_dummy_regressor(normalized_X_train, y_train)

<a id="32"></a> 
## 3.2 Random Forest Regressor

A Random Forest Regressor is a robust ensemble model that aggregates the predictions of multiple decision trees. This step also tunes the hyperparameters with a grid search (`GridSearchCV`) and keeps the most promising model.
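A minimal sketch of grid search over a random forest, on synthetic data. The parameter grid, the number of folds, and the scoring metric are illustrative assumptions, not the values used in `modeling.train_model`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the normalized training data.
rng = np.random.default_rng(0)
X = rng.random((120, 4))
y = 40000 + 90000 * X[:, 0] + 20000 * X[:, 1] ** 2

param_grid = {  # illustrative values only
    "n_estimators": [25, 50],
    "max_depth": [3, None],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,                                # 3-fold cross-validation per combination
    scoring="neg_mean_absolute_error",   # higher is better, hence the negated MAE
)
search.fit(X, y)

print(search.best_params_)
best_model = search.best_estimator_  # refit on all of X, y by default
```

Cross-validating inside the training split means the test split is never used to pick hyperparameters, so it stays valid for the final evaluation.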

In [None]:
# Train a model with the Random Forest Regressor algorithm and print its predictions for the normalized test data.

rf_model = modeling.train_model(normalized_X_train, y_train)


<a id="33"></a> 
## 3.3 RFR Initial Testing

In [None]:
# Predict salaries on the test dataset with the trained model for a quick first evaluation.


rf_model = joblib.load('./models/random_forest_model.pkl')

evaluation.evaluate_model(normalized_X_test, y_test, normalized_X_train, y_train, rf_model)

Next, bootstrapped confidence intervals are used to further test the model's performance.
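The idea can be sketched with a percentile bootstrap over the test predictions. This is an assumption about the method, not the code in `evaluation.calculate_metrics`; the function name `bootstrap_ci`, the 1000 resamples, and the synthetic data are all illustrative.

```python
import numpy as np


def bootstrap_ci(y_true, y_pred, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a metric computed on test predictions."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    scores = np.empty(n_resamples)
    for b in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # sample test rows with replacement
        scores[b] = metric(y_true[idx], y_pred[idx])
    lower, upper = np.quantile(scores, [alpha / 2, 1 - alpha / 2])
    return metric(y_true, y_pred), lower, upper


def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))


# Synthetic salaries and noisy predictions standing in for real model output.
rng = np.random.default_rng(1)
y_true = rng.normal(100000, 30000, size=75)
y_pred = y_true + rng.normal(0, 15000, size=75)

point, low, high = bootstrap_ci(y_true, y_pred, mae)
print(f"MAE: {point:.1f} (95% CI: [{low:.1f}, {high:.1f}])")
```

Only the predictions are resampled, so the model is not retrained. The interval therefore reflects uncertainty from the small test set, not from the training procedure.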

In [None]:

rf_model = joblib.load('./models/random_forest_model.pkl')

evaluation.calculate_metrics(normalized_X_test, y_test, rf_model)

<a id="34"></a> 
## 3.4 Feature Selection Process

In the feature selection process, I initially used a Random Forest Regressor to capture non-linear relationships with relatively low data quantity for training. This approach provides fast training and explainability.

### Initial Model Performance with All Features

The initial model was trained with all features to gather information on relationships and model error (R-squared Score, R²). The heteroscedasticity observed in the middle of the distribution is likely due to the dataset's lack of examples to fill the distribution appropriately. However, the residuals distribution is acceptable, with no statistically significant outliers.

**Performance Metrics:**
- Mean Squared Error (MSE): 866,064,920.96
- Mean Absolute Error (MAE): 17,982.33
- R-squared Score (R²): 0.65
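For reference, these are the standard definitions of the metrics used throughout this section, where $y_i$ is the actual salary, $\hat{y}_i$ the predicted salary, $\bar{y}$ the mean actual salary, and $n$ the number of test rows:

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}, \qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|, \qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
```

MSE is in squared salary units, so its square root is easier to read: for the metrics above, RMSE = √866,064,920.96 ≈ 29,429, in the same units as Salary.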

**Residuals Distribution:**

![img](./plots/residuals_distribution.png)

**SHAP Values:**

![Feature Importance](./plots/feature_importance.png)

**Feature Correlation Heatmap:**

![feature correlation heatmap](./plots/feature_correlation_heatmap.png)

### Dropping Gender

Gender was dropped as it showed a low correlation with salary and low feature importance. This improved the model's performance: R² rose from 0.65 to 0.76 and MAE fell from 17,982.33 to 14,748.68.

**Performance Metrics:**
- Mean Squared Error (MSE): 600,148,032.12
- Mean Absolute Error (MAE): 14,748.68
- R-squared Score (R²): 0.76

**Residuals Distribution:**

![img](./plots/residuals_distribution_2.png)

**SHAP Values:**

![Feature Importance](./plots/feature_importance_2.png)

**Feature Correlation Heatmap:**

![feature correlation heatmap](./plots/feature_correlation_heatmap_2.png)

### Dropping Education Level

Surprisingly, dropping the education level, which had a low correlation with salary, resulted in worse model performance.

**Performance Metrics:**
- Mean Squared Error (MSE): 808,374,516.33
- Mean Absolute Error (MAE): 17,240.95
- R-squared Score (R²): 0.67

**Residuals Distribution:**

![img](./plots/residuals_distribution_3.png)

**SHAP Values:**

![Feature Importance](./plots/feature_importance_3.png)

**Feature Correlation Heatmap:**

![feature correlation heatmap](./plots/feature_correlation_heatmap_3.png)

### Conclusion

The main feature to drop is gender due to its high noise-to-signal ratio and low correlation with the target variable. The education level, although loosely correlated, still provides an advantage to the model, indicating a relevant relationship with the target.

After the initial transformations and feature selection, the final model includes the following features in order of relevance based on SHAP scores and correlation maps:
- Job Title
- Age
- Years of Experience
- Education Level
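The drop-one-feature comparison used in this section can be sketched as a small ablation loop. The data below is synthetic, with Gender generated as pure noise by construction, so it illustrates the procedure only and says nothing about the real dataset. The helper `score_without` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 300
data = pd.DataFrame({
    "Years of Experience": rng.uniform(0, 25, n),
    "Education Level": rng.integers(0, 3, n),
    "Gender": rng.integers(0, 2, n),  # pure noise in this synthetic data
})
salary = 40000 + 5000 * data["Years of Experience"] + 15000 * data["Education Level"]
salary = salary + rng.normal(0, 5000, n)

X_tr, X_te, y_tr, y_te = train_test_split(data, salary, test_size=0.2, random_state=0)


def score_without(dropped):
    """Train on every column except `dropped` and return the test R2."""
    cols = [c for c in data.columns if c not in dropped]
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_tr[cols], y_tr)
    return r2_score(y_te, model.predict(X_te[cols]))


results = {
    "all features": score_without([]),
    "without Gender": score_without(["Gender"]),
    "without Education Level": score_without(["Education Level"]),
}
for label, r2 in results.items():
    print(f"{label}: R2 = {r2:.3f}")
```

Keeping the split and the random seed fixed across runs means any change in R² comes from the dropped feature and not from a different partition of the data.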

<a id="35"></a> 
## 3.5 Neural Network Model

In this step I build a second model to test how a different approach performs on this specific problem. It uses the same dataset and features as the Random Forest Regressor, but normalized with standard scaling, which provides better performance on sequential NN models. The hypothesis is that, with sufficient parameters, this approach can generalize better on this dataset.

### 3.5.1 Model Overview
The neural network is structured as a feed-forward sequential model with a funnel architecture, designed for regression tasks. The model processes standardized features through multiple dense layers with decreasing neuron counts.

### 3.5.2 Architecture Diagram
```
Input Layer (shape: features) →
Dense(64, ReLU) → Dropout(0.2) →
Dense(32, ReLU) → Dropout(0.2) →
Dense(16, ReLU) →
Dense(1, linear)
```

### 3.5.3 Design Choices

1. **Input Processing**
   - Uses StandardScaler normalized features
   - Adapts to input shape automatically

2. **Hidden Layers**
   - Funnel architecture: 64 → 32 → 16 neurons
   - ReLU activation for non-linearity
   - Dropout layers (0.2) for regularization

3. **Training Configuration**
   - Adam optimizer (lr=0.001)
   - Mean Squared Error loss
   - Early stopping (patience=10)
   - Validation split: 10%
   - Batch size: 32
   - Max epochs: 1000
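The architecture and training configuration listed above can be sketched in Keras as follows. This is a reconstruction from the description, not the code in `modeling.train_NN_model`: the function name `build_funnel_model`, the input width of 4 features, and `restore_best_weights=True` are assumptions.

```python
from tensorflow import keras


def build_funnel_model(n_features):
    """Feed-forward regressor: 64 -> 32 -> 16 -> 1, with dropout after the first two layers."""
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dropout(0.2),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dropout(0.2),
        keras.layers.Dense(16, activation="relu"),
        keras.layers.Dense(1, activation="linear"),  # single continuous output
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), loss="mse")
    return model


model = build_funnel_model(n_features=4)  # assumed: the four selected features

early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=10,
    restore_best_weights=True,  # assumption, not stated in the configuration above
)

# Training call matching the listed configuration (X and y are placeholders):
# model.fit(X, y, validation_split=0.1, batch_size=32, epochs=1000, callbacks=[early_stop])
model.summary()
```

With early stopping, the 1000-epoch limit acts as an upper bound. Training ends once the validation loss has not improved for 10 consecutive epochs.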

### 3.5.4 Comparison with Random Forest

| Aspect | Neural Network | Random Forest |
|--------|---------------|---------------|
| Scaling | Requires normalization | Scale-invariant |
| Training | Iterative, gradient-based | Ensemble, parallel |
| Hyperparameters | Layer sizes, learning rate | Trees, depth, features |
| Interpretability | Less interpretable | More interpretable |

### 3.5.5 Hypothesis
The neural network approach may capture complex non-linear relationships in the salary data that tree-based methods might miss, potentially leading to better generalization on unseen data.

In [None]:
modeling.train_NN_model(normalized_X_train_nn,y_train)

![img](./plots/nn_training_loss.png)

<a id="36"></a> 
## 3.6 Neural Network Model Initial Testing

In [None]:
# Predict salaries on the test dataset with the trained model for a quick first evaluation.

nn_model = keras.models.load_model('./models/neural_network_model.keras')

evaluation.evaluate_NN_model(normalized_X_test_nn, y_test,normalized_X_train_nn, nn_model)

In [None]:
nn_model = keras.models.load_model('./models/neural_network_model.keras')


evaluation.calculate_metrics(normalized_X_test_nn, y_test, nn_model)

<a id="4"></a> 
# 4 Model Comparison
The performance of all models is compared using mean squared error (MSE), mean absolute error (MAE), and R-squared (R²) to determine the best-performing model.

In [None]:
nn_model = keras.models.load_model('./models/neural_network_model.keras')
dummy = joblib.load('./models/dummy_reggresor_model.pkl')
rf_model = joblib.load('./models/random_forest_model.pkl')



models_data = {
    'Random Forest': (rf_model, normalized_X_test, y_test),
    'Neural Network': (nn_model, normalized_X_test_nn, y_test),
    'Dummy Regresson': (dummy, normalized_X_test, y_test)
}

comparison_results = model_compare.compare_models(models_data, y_test)


<a id="41"></a> 
## 4.1 Final Model Selection

After building both modeling approaches, I settled on the neural network, since it seems to capture the relationships between the features better and is generally more precise.

The comparison can be seen in the following charts:

![img](./plots/predicted_vs_actual_values_model_comparison.png)


![img](./plots/error_model_comparison.png)


![img](./plots/residuals_distribution_model_comparison.png)




### Model Performance Summary with Confidence Intervals

| Model | MSE (95% CI) | MAE (95% CI) | R2 (95% CI) |
|-------|--------------|--------------|-------------|
| Random Forest | 591312507.301 [266726953.202, 1092637456.989] | 14592.480 [10587.479, 19339.949] | 0.761 [0.635, 0.872] |
| Neural Network | 451365450.081 [193911315.376, 891479517.642] | 13879.159 [10770.149, 17989.091] | 0.820 [0.713, 0.904] |
| Dummy Regresson | 2461198619.228 [1834308694.379, 3231299942.415] | 41588.630 [35333.317, 48193.295] | -0.015 [-0.080, -0.000] |



<a id="5"></a> 
# 5 Inference

Using the best-performing model to make predictions on new or unseen data. This step involves applying the trained model to real-world scenarios or test cases.

In [None]:
job_titles = inference.get_unique_job_titles(prefix="nn_")
form = create_input_form(job_titles)
display(form)