<h1><center><font size="5">Salary Prediction App</font></center></h1>
<center><img src="../images/image.png" width="600"></img></center>


# Table of Contents

<a id="toc"></a>
- [1. Set-up](#1)
    - [1.1 Import Libraries](#1.1)
    - [1.2 Import Data](#1.2)
- [2. Feature Engineering](#2)
- [3. Encoding Categorical Variables](#3)
- [4. Scaling Numeric Features](#4)
- [5. Train-Test Split](#5)
- [6. Save Processed Data](#6)
- [7. Load Processed Train and Test Sets](#7)
- [8. Define and Train Models](#8)
- [9. Select and Save the Best Model](#9)

<a id="1"></a>
## <b>1 <span style='color:#2b4f92'>Set-up</span></b>

<a id="1.1"></a>
## <b>1.1 <span style='color:#2b4f92'>Import Libraries</span></b> 

In [32]:
import pandas as pd
import numpy as np
import json
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
import joblib
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score

<a id="1.2"></a>
## <b>1.2 <span style='color:#2b4f92'>Import Data</span></b> 

In [7]:
# Load the cleaned dataset
data_path = r"D:\SpaceCode_GraduationProject\Salary_After_Cleaning.csv"
df = pd.read_csv(data_path)

<a id="2"></a>
## <b>2 <span style='color:#2b4f92'>Feature Engineering</span></b> 

In [8]:
# Create new feature: Experience-to-Age Ratio
df['Experience_to_Age_Ratio'] = df['Years of Experience'] / df['Age']
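
As a quick sanity check, the ratio can be recomputed by hand for a few rows. The sketch below uses a small hand-built frame (`sample` is a hypothetical name) holding the `Age` and `Years of Experience` values of the first three rows shown in the `df.head()` output of this section. Note that a row with `Age` equal to 0 would produce an infinite ratio; the sketch assumes no such rows exist in the cleaned data.

```python
import pandas as pd

# Hand-copied values from the first three rows of the dataset preview
sample = pd.DataFrame({
    "Age": [32.0, 28.0, 45.0],
    "Years of Experience": [5.0, 3.0, 15.0],
})

# Same element-wise division as in the cell above
sample["Experience_to_Age_Ratio"] = sample["Years of Experience"] / sample["Age"]

ratios = sample["Experience_to_Age_Ratio"].round(6).tolist()
print(ratios)
```

The rounded values should agree with the `Experience_to_Age_Ratio` column of the preview.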

In [9]:
df.head()

| Unnamed: 0 | Age | Gender | Education Level | Job Title | Years of Experience | Salary | Country | Race | Senior | Experience_to_Age_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32.0 | Male | 1 | Software Engineer | 5.0 | 90000.0 | UK | White | 0 | 0.15625 |
| 1 | 28.0 | Female | 2 | Data Analyst | 3.0 | 65000.0 | USA | Hispanic | 0 | 0.107143 |
| 2 | 45.0 | Male | 3 | Manager | 15.0 | 150000.0 | Canada | White | 1 | 0.333333 |
| 3 | 36.0 | Female | 1 | Sales Associate | 7.0 | 60000.0 | USA | Hispanic | 0 | 0.194444 |
| 4 | 52.0 | Male | 2 | Director | 20.0 | 200000.0 | USA | Asian | 0 | 0.384615 |


<a id="3"></a>
## <b>3 <span style='color:#2b4f92'>Encoding Categorical Variables</span></b> 

In [10]:
# Label Encoding for Gender (Female=0, Male=1; LabelEncoder sorts classes alphabetically)
le = LabelEncoder()
df['Gender_Encoded'] = le.fit_transform(df['Gender'])
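
`LabelEncoder` assigns integer codes according to the sorted order of the labels it sees, so it is worth confirming the mapping before relying on it in the app. The following is a minimal sketch on toy values (the list `genders` and the name `le_demo` are illustrative, and it assumes only the two labels `Male` and `Female` occur):

```python
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the 'Gender' column (assumption: only these two labels)
genders = ["Male", "Female", "Male"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(genders)

# classes_ is stored in sorted order, so its index is the integer code
mapping = {label: int(code) for code, label in enumerate(le_demo.classes_)}
print(mapping)  # {'Female': 0, 'Male': 1}
```

The same dictionary comprehension can be run on the fitted `le` object to inspect the real mapping.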

In [12]:
# One-Hot Encoding for Country and Race
df = pd.get_dummies(df, columns=['Country', 'Race'], drop_first=True)
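
One practical consequence of `pd.get_dummies` is that a single new record, encoded on its own, will not produce the same set of columns as the training data. A possible way to handle this at prediction time, sketched here on toy frames (`train`, `new_row`, and `feature_columns` are hypothetical names, not part of this notebook), is to store the training column list and reindex new data against it:

```python
import pandas as pd

# Toy training frame with the two one-hot encoded columns
train = pd.DataFrame({
    "Country": ["UK", "USA", "Canada"],
    "Race": ["White", "Hispanic", "White"],
})
train_encoded = pd.get_dummies(train, columns=["Country", "Race"], drop_first=True)

# Remember the exact column layout the model would be trained on
feature_columns = list(train_encoded.columns)

# A single new record only yields dummies for the categories it contains
new_row = pd.DataFrame({"Country": ["UK"], "Race": ["White"]})
new_encoded = pd.get_dummies(new_row, columns=["Country", "Race"])

# Align to the training layout: missing dummies become 0, extras are dropped
aligned = new_encoded.reindex(columns=feature_columns, fill_value=0).astype(int)
print(aligned)
```

Because `drop_first=True` removes the first category of each column, a record belonging to that dropped category is represented by all zeros after alignment.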

In [13]:
# Label Encoding for Job Title (assigns a unique integer to each of the 129 job titles)
le_job_title = LabelEncoder()
df['Job_Title_Encoded'] = le_job_title.fit_transform(df['Job Title'])

In [19]:
# Save the Job Title mapping for use in the app
job_title_mapping = dict(zip(le_job_title.classes_, range(len(le_job_title.classes_))))
with open(r"D:\SpaceCode_GraduationProject\job_title_mapping.json", 'w') as f:
    json.dump(job_title_mapping, f)
print("Job Title mapping saved to: D:\\SpaceCode_GraduationProject\\job_title_mapping.json")

Job Title mapping saved to: D:\SpaceCode_GraduationProject\job_title_mapping.json
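
On the app side, the saved JSON can be read back and used as a plain dictionary lookup. The sketch below is an illustration only: it writes a two-entry stand-in mapping to a temporary directory rather than touching the real file, and `encode_job_title` is a hypothetical helper that returns `None` for titles that were not present during training.

```python
import json
import os
import tempfile

# Hypothetical miniature mapping; the real file holds all job titles
demo_mapping = {"Data Analyst": 0, "Software Engineer": 1}

# Round-trip through JSON in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "job_title_mapping.json")
    with open(path, "w") as f:
        json.dump(demo_mapping, f)
    with open(path) as f:
        loaded = json.load(f)


def encode_job_title(title, mapping):
    """Return the integer code for a title, or None if it is unknown."""
    return mapping.get(title)


print(encode_job_title("Software Engineer", loaded))
print(encode_job_title("Astronaut", loaded))
```

How the app should react to an unknown title (reject the input, fall back to a default) is a design decision this notebook does not specify.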


In [42]:
# Save Label Encoders
joblib.dump(le, r"D:\SpaceCode_GraduationProject\gender_encoder.pkl")
joblib.dump(le_job_title, r"D:\SpaceCode_GraduationProject\job_title_encoder.pkl")
print("Gender and Job Title encoders saved.")

Gender and Job Title encoders saved.


In [22]:
# Drop original categorical columns after encoding
df = df.drop(['Gender', 'Job Title'], axis=1)

In [23]:
df.head()

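| Unnamed: 0 | Age | Education Level | Years of Experience | Salary | Senior | Experience_to_Age_Ratio | Gender_Encoded | Country_Canada | Country_China | Country_UK | ... | Race_Asian | Race_Australian | Race_Black | Race_Chinese | Race_Hispanic | Race_Korean | Race_Mixed | Race_Welsh | Race_White | Job_Title_Encoded |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32.0 | 1 | 5.0 | 90000.0 | 0 | 0.15625 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 112 |
| 1 | 28.0 | 2 | 3.0 | 65000.0 | 0 | 0.107143 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 24 |
| 2 | 45.0 | 3 | 15.0 | 150000.0 | 1 | 0.333333 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 72 |
| 3 | 36.0 | 1 | 7.0 | 60000.0 | 0 | 0.194444 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 100 |
| 4 | 52.0 | 2 | 20.0 | 200000.0 | 0 | 0.384615 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 34 |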


<a id="4"></a>
## <b>4 <span style='color:#2b4f92'>Scaling Numeric Features</span></b> 

In [24]:
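# Standardize numeric features to zero mean and unit variance.
# Note: the scaler is fit on the full dataset, before the train-test split.
scaler = StandardScaler()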
numeric_cols = ['Age', 'Years of Experience', 'Education Level', 'Experience_to_Age_Ratio']
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
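
The cell above fits the scaler on every row, including rows that later end up in the test set. A common alternative ordering, shown here only as a sketch on synthetic data and not as what this notebook does, is to split first and fit the scaler on the training rows alone, so that test-set statistics do not influence preprocessing. The frame `toy` and the name `demo_scaler` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two of the numeric columns
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "Age": rng.integers(22, 60, size=50).astype(float),
    "Years of Experience": rng.integers(0, 30, size=50).astype(float),
})
cols = ["Age", "Years of Experience"]

# Split first, then work on explicit copies
train, test = train_test_split(toy, test_size=0.2, random_state=42)
train, test = train.copy(), test.copy()

# Fit on the training rows only; reuse those statistics for the test rows
demo_scaler = StandardScaler()
train[cols] = demo_scaler.fit_transform(train[cols])
test[cols] = demo_scaler.transform(test[cols])

print(train[cols].mean().round(6).tolist())
```

With this ordering the training columns have mean zero by construction, while the test columns generally do not, which is expected.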

In [43]:
# Save the Scaler
joblib.dump(scaler, r"D:\SpaceCode_GraduationProject\scaler.pkl")
print("Scaler saved to: D:\\SpaceCode_GraduationProject\\scaler.pkl")

Scaler saved to: D:\SpaceCode_GraduationProject\scaler.pkl


In [25]:
df.head()

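| Unnamed: 0 | Age | Education Level | Years of Experience | Salary | Senior | Experience_to_Age_Ratio | Gender_Encoded | Country_Canada | Country_China | Country_UK | ... | Race_Asian | Race_Australian | Race_Black | Race_Chinese | Race_Hispanic | Race_Korean | Race_Mixed | Race_Welsh | Race_White | Job_Title_Encoded |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.257812 | -0.726009 | -0.535323 | 90000.0 | 0 | -0.51657 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 112 |
| 1 | -0.773134 | 0.406909 | -0.856198 | 65000.0 | 0 | -0.910723 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 24 |
| 2 | 1.416988 | 1.539827 | 1.069056 | 150000.0 | 1 | 0.904768 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 72 |
| 3 | 0.257511 | -0.726009 | -0.214447 | 60000.0 | 0 | -0.210007 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 100 |
| 4 | 2.318803 | 0.406909 | 1.871245 | 200000.0 | 0 | 1.316378 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 34 |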


<a href="#toc" role="button" aria-pressed="true" >🔝Back to Table of Contents🔝</a>

<a id="5"></a>
## <b>5 <span style='color:#2b4f92'>Train-Test Split</span></b> 

In [26]:
# Define features (X) and target (y)
X = df.drop('Salary', axis=1)
y = df['Salary']

In [27]:
# Split data into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

<a id="6"></a>
## <b>6 <span style='color:#2b4f92'>Save Processed Data</span></b>

In [28]:
processed_data_path = r"D:\SpaceCode_GraduationProject\Salary_Processed_LabelEncoded.csv"
df.to_csv(processed_data_path, index=False)
print(f"\nProcessed dataset saved to: {processed_data_path}")


Processed dataset saved to: D:\SpaceCode_GraduationProject\Salary_Processed_LabelEncoded.csv


In [29]:
# Save train and test sets for modeling
X_train.to_csv(r"D:\SpaceCode_GraduationProject\X_train_LabelEncoded.csv", index=False)
X_test.to_csv(r"D:\SpaceCode_GraduationProject\X_test_LabelEncoded.csv", index=False)
y_train.to_csv(r"D:\SpaceCode_GraduationProject\y_train_LabelEncoded.csv", index=False)
y_test.to_csv(r"D:\SpaceCode_GraduationProject\y_test_LabelEncoded.csv", index=False)
print("Train and test sets saved to: D:\\SpaceCode_GraduationProject\\")

Train and test sets saved to: D:\SpaceCode_GraduationProject\


<a id="7"></a>
## <b>7 <span style='color:#2b4f92'>Load Processed Train and Test Sets</span></b>

In [33]:
X_train = pd.read_csv(r"D:\SpaceCode_GraduationProject\X_train_LabelEncoded.csv")
X_test = pd.read_csv(r"D:\SpaceCode_GraduationProject\X_test_LabelEncoded.csv")
y_train = pd.read_csv(r"D:\SpaceCode_GraduationProject\y_train_LabelEncoded.csv")
y_test = pd.read_csv(r"D:\SpaceCode_GraduationProject\y_test_LabelEncoded.csv")

In [35]:
y_train = y_train.values.ravel()
y_test = y_test.values.ravel()
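
The `ravel()` step is needed because `pd.read_csv` returns the target as a one-column DataFrame, while scikit-learn estimators expect a one-dimensional target array. A small sketch with an in-memory CSV (the text in `csv_text` is made up for illustration) shows the change in shape:

```python
import io

import pandas as pd

# In-memory stand-in for a saved y_train file with a single 'Salary' column
csv_text = "Salary\n90000.0\n65000.0\n150000.0\n"
y_frame = pd.read_csv(io.StringIO(csv_text))

# DataFrame values are 2-D: (n_rows, 1)
print(y_frame.values.shape)

# ravel() flattens to 1-D: (n_rows,)
y_flat = y_frame.values.ravel()
print(y_flat.shape)
```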

<a href="#toc" role="button" aria-pressed="true" >🔝Back to Table of Contents🔝</a>

<a id="8"></a>
## <b>8 <span style='color:#2b4f92'>Define and Train Models</span></b> 

In [44]:
try:
    scaler = joblib.load(r"D:\SpaceCode_GraduationProject\scaler.pkl")
    le_gender = joblib.load(r"D:\SpaceCode_GraduationProject\gender_encoder.pkl")
    le_job_title = joblib.load(r"D:\SpaceCode_GraduationProject\job_title_encoder.pkl")
    print("Scaler and Encoders loaded successfully.")
except FileNotFoundError:
    print("Error: Scaler or Encoder files not found. Please run the preprocessing script first.")
    raise

Scaler and Encoders loaded successfully.


In [36]:
# Initialize models
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'XGBoost': XGBRegressor(n_estimators=100, random_state=42)
}

In [37]:
# Dictionary to store results
results = {}

# Train and evaluate each model
for name, model in models.items():
    # Train the model
    model.fit(X_train, y_train)
    
    # Predict on test set
    y_pred = model.predict(X_test)
    
    # Calculate metrics
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    
    # Perform 5-fold cross-validation
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='r2')
    
    # Store results
    results[name] = {
        'RMSE': rmse,
        'R2': r2,
        'Cross-Validation R2 (Mean)': np.mean(cv_scores),
        'Cross-Validation R2 (Std)': np.std(cv_scores)
    }
    
    # Print results
    print(f"\nResults for {name}:")
    print(f"RMSE: {rmse:.2f}")
    print(f"R2 Score: {r2:.2f}")
    print(f"Cross-Validation R2: {np.mean(cv_scores):.2f} (+/- {np.std(cv_scores):.2f})")


Results for Linear Regression:
RMSE: 25349.14
R2 Score: 0.76
Cross-Validation R2: 0.77 (+/- 0.01)

Results for Random Forest:
RMSE: 11074.84
R2 Score: 0.95
Cross-Validation R2: 0.95 (+/- 0.01)

Results for XGBoost:
RMSE: 9560.87
R2 Score: 0.97
Cross-Validation R2: 0.96 (+/- 0.01)
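
To make the two reported metrics concrete, they can be computed by hand: RMSE is the square root of the mean squared residual, and R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below checks both against scikit-learn on a tiny example; the prediction values in `y_hat` are invented for illustration and are not model output from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative targets and made-up predictions
y_true = np.array([90000.0, 65000.0, 150000.0, 60000.0])
y_hat = np.array([85000.0, 70000.0, 140000.0, 65000.0])

# RMSE: square root of the mean squared residual
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# Same quantities via scikit-learn, as used in the training loop
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))
r2_sklearn = r2_score(y_true, y_hat)

print(np.isclose(rmse_manual, rmse_sklearn), np.isclose(r2_manual, r2_sklearn))
```

Since RMSE is expressed in the units of the target, the values reported above can be read directly as typical salary errors.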


<a id="9"></a>
## <b>9 <span style='color:#2b4f92'>Select and Save the Best Model</span></b> 

In [38]:
# Find the model with the lowest RMSE
best_model_name = min(results, key=lambda x: results[x]['RMSE'])
best_model = models[best_model_name]
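
The selection relies on `min` with a `key` function, which returns the dictionary key whose RMSE is smallest. As a standalone illustration, the sketch below applies the same expression to a dictionary (`results_demo`, a hypothetical name) filled with the rounded RMSE and R² values printed in section 8:

```python
# Rounded metrics copied from the printed results in section 8
results_demo = {
    "Linear Regression": {"RMSE": 25349.14, "R2": 0.76},
    "Random Forest": {"RMSE": 11074.84, "R2": 0.95},
    "XGBoost": {"RMSE": 9560.87, "R2": 0.97},
}

# Key with the lowest RMSE
best_name = min(results_demo, key=lambda name: results_demo[name]["RMSE"])

# Full ranking from best to worst, useful for reporting
ranking = sorted(results_demo, key=lambda name: results_demo[name]["RMSE"])

print(best_name)  # XGBoost
print(ranking)
```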

In [39]:
# Save the best model
joblib.dump(best_model, r"D:\SpaceCode_GraduationProject\best_model.pkl")
print(f"\nBest model ({best_model_name}) saved to: D:\\SpaceCode_GraduationProject\\best_model.pkl")


Best model (XGBoost) saved to: D:\SpaceCode_GraduationProject\best_model.pkl
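
The saved file can later be restored with `joblib.load` and used for prediction. The following sketch demonstrates the save and load round trip with a small `LinearRegression` as a stand-in for the real model, written to a temporary directory; the data and the names `stand_in` and `restored` are assumptions for illustration. In the app, the input row would additionally need the same encoding, scaling, and column order as `X_train`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny stand-in model trained on a perfectly linear toy relationship
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([10.0, 20.0, 30.0, 40.0])
stand_in = LinearRegression().fit(X_demo, y_demo)

# Save and restore, mirroring the joblib.dump call above
with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "best_model.pkl")
    joblib.dump(stand_in, model_path)
    restored = joblib.load(model_path)

# The restored model predicts like the original
prediction = restored.predict(np.array([[5.0]]))
print(prediction)
```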


In [45]:
# Re-save the scaler and label encoders so the app loads them alongside the best model
joblib.dump(scaler, r"D:\SpaceCode_GraduationProject\scaler.pkl")
joblib.dump(le_gender, r"D:\SpaceCode_GraduationProject\gender_encoder.pkl")
joblib.dump(le_job_title, r"D:\SpaceCode_GraduationProject\job_title_encoder.pkl")
print("Scaler and encoders saved for app usage.")

Scaler and encoders saved for app usage.


<a href="#toc" role="button" aria-pressed="true" >🔝Back to Table of Contents🔝</a>