---
## 0. Setup Environment

In [9]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [26]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
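from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
import os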

---
## A. Project Description


The objective of this project is to develop a predictive model that estimates employee salaries based on a variety of personal, professional, and organizational attributes. By leveraging features such as years of experience, education level, job role, department, performance ratings, and other relevant factors, the model aims to provide accurate salary predictions for employees. This will assist HR professionals, managers, and organizations in making informed decisions regarding compensation planning, talent management, and recruitment strategies. The project involves comprehensive data exploration, feature engineering, and model evaluation to ensure robust and reliable predictions that reflect real-world compensation dynamics.

---
## C. Data Understanding

### C.1   Load Datasets



In [11]:
# Load training data
training_df = pd.read_csv("../data/raw/Extended_Employee_Performance_and_Productivity_Data")

### C.2 Explore Training Set


In [12]:
training_df.head()

Unnamed: 0,Employee_ID,Department,Gender,Age,Job_Title,Hire_Date,Years_At_Company,Education_Level,Performance_Score,Monthly_Salary,Work_Hours_Per_Week,Projects_Handled,Overtime_Hours,Sick_Days,Remote_Work_Frequency,Team_Size,Training_Hours,Promotions,Employee_Satisfaction_Score,Resigned
0,1,IT,Male,55,Specialist,2022-01-19 08:03:05.556036,2,High School,5,6750.0,33,32,22,2,0,14,66,0,2.63,False
1,2,Finance,Male,29,Developer,2024-04-18 08:03:05.556036,0,High School,5,7500.0,34,34,13,14,100,12,61,2,1.72,False
2,3,Finance,Male,55,Specialist,2015-10-26 08:03:05.556036,8,High School,3,5850.0,37,27,6,3,50,10,1,0,3.17,False
3,4,Customer Support,Female,48,Analyst,2016-10-22 08:03:05.556036,7,Bachelor,2,4800.0,52,10,28,12,100,10,0,1,1.86,False
4,5,Engineering,Female,36,Analyst,2021-07-23 08:03:05.556036,3,Bachelor,2,4800.0,38,11,29,13,100,15,9,1,1.25,False


In [13]:
training_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 20 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   Employee_ID                  100000 non-null  int64  
 1   Department                   100000 non-null  object 
 2   Gender                       100000 non-null  object 
 3   Age                          100000 non-null  int64  
 4   Job_Title                    100000 non-null  object 
 5   Hire_Date                    100000 non-null  object 
 6   Years_At_Company             100000 non-null  int64  
 7   Education_Level              100000 non-null  object 
 8   Performance_Score            100000 non-null  int64  
 9   Monthly_Salary               100000 non-null  float64
 10  Work_Hours_Per_Week          100000 non-null  int64  
 11  Projects_Handled             100000 non-null  int64  
 12  Overtime_Hours               100000 non-null  int64  
 13  

In [27]:
training_df.isnull().sum().sort_values(ascending=False)

Employee_ID                    0
Department                     0
Gender                         0
Age                            0
Job_Title                      0
Hire_Date                      0
Years_At_Company               0
Education_Level                0
Performance_Score              0
Monthly_Salary                 0
Work_Hours_Per_Week            0
Projects_Handled               0
Overtime_Hours                 0
Sick_Days                      0
Remote_Work_Frequency          0
Team_Size                      0
Training_Hours                 0
Promotions                     0
Employee_Satisfaction_Score    0
Resigned                       0
dtype: int64

In [28]:
training_df.duplicated().sum()

np.int64(0)

Observations:
- There are no null values in this dataset.
- The dataset contains int, float, and object dtypes.
- There are no duplicate rows.

### C.3 Explore Numerical Features


In [None]:
training_df.describe(include='number')

Unnamed: 0,Employee_ID,Age,Years_At_Company,Performance_Score,Monthly_Salary,Work_Hours_Per_Week,Projects_Handled,Overtime_Hours,Sick_Days,Remote_Work_Frequency,Team_Size,Training_Hours,Promotions,Employee_Satisfaction_Score
count,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0
mean,50000.5,41.02941,4.47607,2.99543,6403.211,44.95695,24.43117,14.51493,7.00855,50.0905,10.01356,49.50606,0.99972,2.999088
std,28867.657797,11.244121,2.869336,1.414726,1372.508717,8.942003,14.469584,8.664026,4.331591,35.351157,5.495405,28.890383,0.815872,1.150719
min,1.0,22.0,0.0,1.0,3850.0,30.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
25%,25000.75,31.0,2.0,2.0,5250.0,37.0,12.0,7.0,3.0,25.0,5.0,25.0,0.0,2.01
50%,50000.5,41.0,4.0,3.0,6500.0,45.0,24.0,15.0,7.0,50.0,10.0,49.0,1.0,3.0
75%,75000.25,51.0,7.0,4.0,7500.0,53.0,37.0,22.0,11.0,75.0,15.0,75.0,2.0,3.99
max,100000.0,60.0,10.0,5.0,9000.0,60.0,49.0,29.0,14.0,100.0,19.0,99.0,2.0,5.0


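There are 14 numerical columns in total.
- All numerical columns have the same count (100,000), confirming there are no missing values.
- The mean and median are close for most features (for example, `Age` has a mean of 41.03 and a median of 41.0), suggesting roughly symmetric distributions.
- Some features span a wide range (for example, `Remote_Work_Frequency` from 0 to 100 and `Monthly_Salary` from 3850 to 9000). A wide range alone does not establish outliers, so this needs a direct check.
- Leaving aside `Employee_ID`, which runs from 1 to 100,000 and looks like a row identifier, standard deviations range from 0.82 (`Promotions`) to 1372.51 (`Monthly_Salary`), so the features are on very different scales.
- The range (min to max) shows the scale of each feature.
- Quartiles (25%, 50%, 75%) describe the distribution and hint at skewness. For `Age` they are 31, 41, and 51, evenly spaced around the median.
- Seven features have a minimum of zero (`Years_At_Company`, `Projects_Handled`, `Overtime_Hours`, `Sick_Days`, `Remote_Work_Frequency`, `Training_Hours`, `Promotions`). These zeros may be meaningful or may require further investigation.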
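The observations above can be checked numerically rather than by eye. The sketch below is a minimal example that uses only the five rows shown by `head()` earlier. The `summarize_numeric` helper and the 1.5 x IQR fence rule are illustrative assumptions, not part of the notebook. It reports the mean-median gap, the sample skewness, and the number of values outside the fences for each numeric column.

```python
import pandas as pd

def summarize_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column mean/median gap, skewness, and 1.5*IQR outlier counts."""
    num = df.select_dtypes(include="number")
    q1 = num.quantile(0.25)
    q3 = num.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Comparisons align the fence Series with the DataFrame columns.
    n_outliers = ((num < lower) | (num > upper)).sum()
    return pd.DataFrame({
        "mean": num.mean(),
        "median": num.median(),
        "mean_minus_median": num.mean() - num.median(),
        "skew": num.skew(),
        "n_outliers": n_outliers,
    })

# Two columns from the five rows printed by training_df.head().
sample_df = pd.DataFrame({
    "Monthly_Salary": [6750.0, 7500.0, 5850.0, 4800.0, 4800.0],
    "Sick_Days": [2, 14, 3, 12, 13],
})

summary = summarize_numeric(sample_df)
print(summary)
```

On the full dataset the same helper could be called as `summarize_numeric(training_df)`. A column with a large `mean_minus_median` or a non-zero `n_outliers` would be a candidate for closer inspection.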
  

### C.4 Explore Categorical Features


In [15]:
training_df.describe(include='object')


Unnamed: 0,Department,Gender,Job_Title,Hire_Date,Education_Level
count,100000,100000,100000,100000,100000
unique,9,3,7,3650,4
top,Marketing,Male,Specialist,2020-09-29 08:03:05.556036,Bachelor
freq,11216,48031,14507,46,50041


In [16]:
categorical_cols = training_df.select_dtypes(include='object').columns
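The categorical summary reports `Hire_Date` as an object column with 3650 distinct values, so it behaves like a timestamp stored as text rather than a category. The sketch below shows one way it could be parsed, using the first three `Hire_Date` strings from the `head()` output. The derived names `hire_year` and `hire_month` are hypothetical and are not used elsewhere in the notebook.

```python
import pandas as pd

# First three Hire_Date strings from the head() output.
hire_dates = pd.Series(
    [
        "2022-01-19 08:03:05.556036",
        "2024-04-18 08:03:05.556036",
        "2015-10-26 08:03:05.556036",
    ],
    name="Hire_Date",
)

# Convert the text column to a datetime dtype.
parsed = pd.to_datetime(hire_dates)

# Calendar parts become ordinary integer columns.
hire_year = parsed.dt.year
hire_month = parsed.dt.month

print(parsed.dtype)
print(hire_year.tolist(), hire_month.tolist())
```

If the column were parsed this way, it would drop out of `select_dtypes(include='object')`, which changes what `categorical_cols` contains.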

### C.5 Explore Target Variable




In [None]:
target_name = 'Monthly_Salary'
training_df[target_name].head()

0    6750.0
1    7500.0
2    5850.0
3    4800.0
4    4800.0
Name: Monthly_Salary, dtype: float64

In [18]:
numerical_cols = training_df.select_dtypes(include='number').columns

---
## D. Feature Selection


### D.1 Approach 1

### D.z Final Selection of Features


In [19]:
features_list = []

---
## E. Data Cleaning

### E.1 Copy Datasets



In [20]:
training_df_clean = training_df.copy()

### E.2 Fixing "Missing data"




### E.3 Fixing "Outliers in the final list of features"



---
## F. Feature Engineering

### F.1 Copy Datasets



In [21]:
# Create copy of datasets

training_df_eng = training_df_clean.copy()


### F.2 New Feature ""






---
## G. Data Transformation

### G.1 Copy Datasets



In [22]:
# Create copy of datasets

training_df_trans = training_df_eng.copy()


### G.2 Data Transformation: Encoding the Categorical Features
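The notebook does not yet specify an encoding. As one possible approach, the sketch below one-hot encodes two of the categorical columns with `pd.get_dummies`, using rows 0, 1, and 3 of the `head()` output. The choice of one-hot encoding and the `sample`/`encoded` names are assumptions made for illustration.

```python
import pandas as pd

# Rows 0, 1 and 3 of the head() output, restricted to three columns.
sample = pd.DataFrame({
    "Department": ["IT", "Finance", "Customer Support"],
    "Gender": ["Male", "Male", "Female"],
    "Age": [55, 29, 48],
})

# Each category value becomes its own 0/1 indicator column.
encoded = pd.get_dummies(sample, columns=["Department", "Gender"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Because `get_dummies` derives its columns from whatever data it sees, applying it separately to the train, validation, and test splits can produce mismatched columns. Encoding before the split, or fitting scikit-learn's `OneHotEncoder` on the training split only, avoids that.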



---
## H. Data Preparation for Modeling

### H.1 Split Datasets



In [23]:


# Split the transformed data into train (70%) and temp (30%)
train_df, temp_df = train_test_split(training_df_trans, test_size=0.3, random_state=42)

# Split temp into validation (15%) and test (15%)
val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)

print(f"Train shape: {train_df.shape}")
print(f"Validation shape: {val_df.shape}")
print(f"Test shape: {test_df.shape}")

Train shape: (70000, 20)
Validation shape: (15000, 20)
Test shape: (15000, 20)
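The two-step split works because half of the 30% holdout is 15% of the original data, giving a 70/15/15 partition. The sketch below reproduces the same two calls on a stand-in frame with 100,000 rows and checks that the three parts are disjoint and cover every row. The `demo_*` names and the `row_id` column are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in frame with the same number of rows as the dataset.
demo_df = pd.DataFrame({"row_id": np.arange(100_000)})

# Step 1: hold out 30% of the rows.
demo_train, demo_temp = train_test_split(demo_df, test_size=0.3, random_state=42)
# Step 2: split the holdout in half, i.e. 15% + 15% of the original.
demo_val, demo_test = train_test_split(demo_temp, test_size=0.5, random_state=42)

fractions = {
    name: len(part) / len(demo_df)
    for name, part in [("train", demo_train), ("val", demo_val), ("test", demo_test)]
}
print(fractions)

# The three parts should be disjoint and together contain every row once.
all_ids = pd.concat([demo_train, demo_val, demo_test])["row_id"]
print(all_ids.is_unique, len(all_ids) == len(demo_df))
```

Fixing `random_state` in both calls makes the partition reproducible across runs.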


### H.2 Split Features and Target Variables

In [24]:

X_train = train_df.drop(columns=[target_name])
y_train = train_df[target_name]

X_val = val_df.drop(columns=[target_name])
y_val = val_df[target_name]

X_test = test_df.drop(columns=[target_name])
y_test = test_df[target_name]

---
## I. Save Datasets

> Do not change this code

In [None]:
# Ensure the processed data directory exists
os.makedirs('../data/processed', exist_ok=True)


In [25]:

X_train.to_csv('../data/processed/X_train.csv', index=False)
y_train.to_csv('../data/processed/y_train.csv', index=False)

OSError: Cannot save file into a non-existent directory: '..\data\processed'

---
## J. Assess Baseline Model

### J.1 Generate Predictions with Baseline Model

In [None]:


dummy_regressor = DummyRegressor(strategy="mean")  # Predicts the mean of the target values
dummy_regressor.fit(X_train, y_train)
y_pred = dummy_regressor.predict(X_val)
y_pred



array([574.44604064, 574.44604064, 574.44604064, ..., 574.44604064,
       574.44604064, 574.44604064], shape=(1320,))
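With `strategy="mean"`, `DummyRegressor` ignores the features and returns the mean of the training target for every row. A quick sanity check is therefore that the prediction array has one entry per validation row and that each entry equals `y_train.mean()`. The sketch below demonstrates this on made-up arrays. The `*_demo` names and values are illustrative assumptions.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Feature values are irrelevant to a mean-strategy dummy model.
X_train_demo = np.zeros((4, 1))
y_train_demo = np.array([4800.0, 5850.0, 6750.0, 7500.0])
X_val_demo = np.zeros((3, 1))

baseline = DummyRegressor(strategy="mean")
baseline.fit(X_train_demo, y_train_demo)
pred_demo = baseline.predict(X_val_demo)

print(pred_demo)

# One prediction per validation row, each equal to the training mean.
print(pred_demo.shape == (len(X_val_demo),))
print(np.allclose(pred_demo, y_train_demo.mean()))
```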

### J.2 Selection of Performance Metrics




In [None]:

rmse = np.sqrt(mean_squared_error(y_val, y_pred))
print(f"RMSE: {rmse}")


RMSE: 71.41122641110427


In [None]:
mae = mean_absolute_error(y_val, y_pred)
print(f"MAE: {mae}")

MAE: 28.384357945254933
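Both metrics are expressed in the units of the target. RMSE squares the errors before averaging, so large misses weigh more heavily, and it is never smaller than MAE. The sketch below computes both by hand on made-up values and compares them with the scikit-learn functions. The `y_true_demo` and `y_hat_demo` arrays are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true_demo = np.array([3.0, 5.0, 7.0])
y_hat_demo = np.array([5.0, 5.0, 5.0])  # constant prediction, like a mean baseline

errors = y_true_demo - y_hat_demo

# RMSE: square root of the mean squared error.
rmse_manual = np.sqrt(np.mean(errors ** 2))
# MAE: mean of the absolute errors.
mae_manual = np.mean(np.abs(errors))

rmse_sklearn = np.sqrt(mean_squared_error(y_true_demo, y_hat_demo))
mae_sklearn = mean_absolute_error(y_true_demo, y_hat_demo)

print(rmse_manual, rmse_sklearn)
print(mae_manual, mae_sklearn)
```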


### J.3 Baseline Model Performance




In [None]:

y_pred = dummy_regressor.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE: {rmse}")

RMSE: 86.92117365503061


In [None]:
# Calculating accuracy within RMSE range
allowed_range_lower = y_test - rmse
allowed_range_upper = y_test + rmse

within_range = np.logical_and(y_pred >= allowed_range_lower, y_pred <= allowed_range_upper)
accuracy = np.sum(within_range) / len(y_test) * 100

print(f"Accuracy within RMSE range: {accuracy:.2f}%")


Accuracy within RMSE range: 87.83%


In [None]:
mae = mean_absolute_error(y_test, y_pred)
print(f"MAE: {mae}")

MAE: 39.73232608655444


In [None]:
# Calculating accuracy within MAE range
allowed_range_lower = y_test - mae
allowed_range_upper = y_test + mae

within_range = np.logical_and(y_pred >= allowed_range_lower, y_pred <= allowed_range_upper)
accuracy = np.sum(within_range) / len(y_test) * 100

print(f"Accuracy within mae range: {accuracy:.2f}%")


Accuracy within mae range: 76.61%
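The two "accuracy within range" cells apply the same rule with different tolerances: a prediction counts as a hit when it lies between `y_test - tol` and `y_test + tol`, which is the same as requiring the absolute error to be at most `tol`. The sketch below expresses that rule as one helper. The function name `share_within` and the sample values are hypothetical.

```python
import numpy as np

def share_within(y_true, y_pred, tolerance):
    """Percentage of predictions whose absolute error is <= tolerance."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    hits = np.abs(y_pred - y_true) <= tolerance
    return float(np.mean(hits) * 100)

y_demo = [10, 20, 30, 40]
pred_demo = [25, 25, 25, 25]  # constant prediction, like a mean baseline

# Absolute errors are 15, 5, 5, 15, so two of four fall within 10.
print(share_within(y_demo, pred_demo, tolerance=10))
```

With such a helper, the two cells above would reduce to `share_within(y_test, y_pred, rmse)` and `share_within(y_test, y_pred, mae)`. Because the tolerance is computed from the same errors it is then applied to, this figure describes how the errors are distributed around their typical size and is not an independent measure of accuracy.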
