# Salary Prediction Model

## Data Collection and Preprocessing

Importing the required libraries

- pandas: for holding the data in a DataFrame.
- matplotlib's pyplot module: for plotting the data.
- scikit-learn's LabelEncoder: for encoding categorical columns as integers.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder

Sampling

- We load an external dataset from Stack Overflow's 2023 survey of its users, which covers information such as salary, education, and location.
- We use this dataset to make salary predictions.

In [None]:
df = pd.read_csv('Datasets/survey_results_public.csv')
df.head()

Feature Selection

- We use the columns 'Country', 'EdLevel', 'YearsCodePro', 'Employment', and 'ConvertedCompYearly' for training our model ('Employment' is used only for filtering and is dropped later).
- We therefore keep only these columns and discard the rest.
- We also rename 'EdLevel' to 'Education', 'YearsCodePro' to 'Experience', and 'ConvertedCompYearly' to 'Salary' for easier use later on.

In [None]:
df = df[['Country', 'EdLevel', 'YearsCodePro', 'Employment', 'ConvertedCompYearly']]
df = df.rename(columns={
    'EdLevel': 'Education',
    'ConvertedCompYearly': 'Salary',
    'YearsCodePro': 'Experience',
})
df.head(3)

Data cleaning

- Null values in the dataset are not useful for training the model, so we drop the rows that contain them.
- We keep only the rows where the type of employment is 'Employed, full-time' to better suit our needs.

In [None]:
df = df[df['Salary'].notnull()]
df.head()

In [None]:
df.info()

In [None]:
df.isnull().sum()

In [None]:
df = df.dropna()
df.isnull().sum()

In [None]:
df = df[df['Employment'] == 'Employed, full-time']
df = df.drop('Employment', axis = 1)
df.info()

In [None]:
df['Country'].value_counts()

- Countries with too few samples could lead to unreliable predictions.
- So we remove the countries that have fewer samples than a threshold (500 here) that seems sufficient for prediction.

In [None]:
threshold = 500
country_counts = df['Country'].value_counts()
countries_to_keep = country_counts[country_counts >= threshold].index
df = df[df['Country'].isin(countries_to_keep)]
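
To make the filtering step concrete, here is a minimal sketch of the same pattern on toy data. The country names, counts, and the `toy_threshold` value are made up for illustration and are not taken from the survey.

```python
import pandas as pd

# Toy data standing in for the survey; names and counts are hypothetical.
toy = pd.DataFrame({
    "Country": ["A"] * 5 + ["B"] * 2 + ["C"] * 3,
    "Salary": range(10),
})

toy_threshold = 3  # hypothetical cutoff for this toy example

# Count rows per country, keep the labels that meet the cutoff,
# then keep only the rows whose country is among those labels.
counts = toy["Country"].value_counts()
keep = counts[counts >= toy_threshold].index
filtered = toy[toy["Country"].isin(keep)]

print(sorted(filtered["Country"].unique()))
```

Country "B" has only two rows, so it is the one that gets dropped here.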


In [None]:
df.head(3)

In [None]:
df.Country.value_counts()

In [None]:
df['Country'].unique()

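Removing outliers from the DataFrame: we plot salary by country, drop salaries outside the 10,000 to 600,000 range, and plot again to check the result.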

In [None]:
fig,ax =  plt.subplots(1,1, figsize = (12, 8))
df.boxplot('Salary', 'Country', ax = ax)
plt.title('Salary ($) vs Country')
plt.suptitle('Employee Salaries in each country')
plt.xticks(rotation = 87)
plt.ylabel('Salary')
plt.show()

In [None]:
df = df[df['Salary'] <= 600000] 
df = df[df['Salary'] >= 10000]
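
As a side note, the two range filters can probably be written as a single step with `Series.between`, which includes both bounds by default. The sketch below compares both forms on toy salaries; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical salaries: two fall outside the range, two sit on the bounds.
toy = pd.DataFrame({"Salary": [5_000, 10_000, 85_000, 600_000, 750_000]})

# Two-step filter, as in the cell above.
two_step = toy[toy["Salary"] <= 600_000]
two_step = two_step[two_step["Salary"] >= 10_000]

# One-step filter; between() is inclusive on both ends by default.
one_step = toy[toy["Salary"].between(10_000, 600_000)]

print(one_step["Salary"].tolist())
```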

In [None]:
fig,ax =  plt.subplots(1,1, figsize = (12, 8))
df.boxplot('Salary', 'Country', ax = ax)
plt.title('Salary ($) vs Country')
plt.suptitle('Employee Salaries in each country')
plt.xticks(rotation = 87)
plt.ylabel('Salary')
plt.show()

In [None]:
df['Experience'].unique()

In [None]:
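def experience_cleaner(value):
    """Convert a survey experience answer to a number of years."""
    # The two text answers are mapped to representative numeric values
    if value == "More than 50 years":
        return 50
    elif value == "Less than 1 year":
        return 0.5
    # Every other answer is a numeric string such as "7"
    return float(value)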

In [None]:
df['Experience'] = df['Experience'].apply(experience_cleaner)
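
A quick way to check the cleaner is to run it on a handful of answers. The sketch below repeats the function so that it runs on its own; the sample answers are illustrative.

```python
import pandas as pd

def experience_cleaner(value):
    # Same logic as the cell above, repeated so this sketch is self-contained.
    if value == "More than 50 years":
        return 50
    elif value == "Less than 1 year":
        return 0.5
    return float(value)

# Illustrative answers in the string format used by the 'Experience' column.
raw = pd.Series(["Less than 1 year", "3", "More than 50 years", "12"])
cleaned = raw.apply(experience_cleaner)

print(cleaned.tolist())
```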

In [None]:
df['Education'].unique()

In [None]:
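def education_cleaner(education_level):
    """Collapse the survey's education answers into four groups."""
    # The matched strings use a typographic apostrophe (’), as in the dataset
    if "Bachelor’s degree" in education_level:
        return 'Bachelors degree'
    if "Master’s degree" in education_level:
        return 'Masters degree'
    if "Professional degree" in education_level:
        return 'Post graduate'
    # Anything else is grouped below a Bachelor's degree
    return 'Less than Bachelors'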

In [None]:
df['Education'] = df['Education'].apply(education_cleaner)
df['Education'].unique()
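
The substring checks depend on the exact characters in the data. The sketch below runs the cleaner on a few sample labels, which are assumed to resemble the survey's wording, and shows that a label typed with a straight apostrophe would fall through to the default group.

```python
def education_cleaner(education_level):
    # Same logic as the cell above, repeated so this sketch is self-contained.
    if "Bachelor’s degree" in education_level:
        return 'Bachelors degree'
    if "Master’s degree" in education_level:
        return 'Masters degree'
    if "Professional degree" in education_level:
        return 'Post graduate'
    return 'Less than Bachelors'

# Sample labels; assumed to resemble the survey's answers, not copied from it.
samples = [
    "Bachelor’s degree (B.A., B.S., B.Eng., etc.)",
    "Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",
    "Professional degree (JD, MD, Ph.D, Ed.D, etc.)",
    "Associate degree (A.A., A.S., etc.)",
    "Bachelor's degree (straight apostrophe)",  # does not match the ’ check
]
cleaned = [education_cleaner(s) for s in samples]

for label, group in zip(samples, cleaned):
    print(f"{label!r} -> {group}")
```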

In [None]:
le_education = LabelEncoder()
df['Education'] =  le_education.fit_transform(df['Education'])
df['Education'].unique()

In [None]:
le_country = LabelEncoder()
df['Country'] =  le_country.fit_transform(df['Country'])
df['Country'].unique()
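
To see what the encoders store, here is a small sketch on the four education groups. `LabelEncoder` sorts the distinct labels and assigns codes by position, so the integers follow alphabetical order rather than any ranking of the categories. The variable name `le_demo` is ours.

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()
codes = le_demo.fit_transform([
    "Masters degree",
    "Bachelors degree",
    "Post graduate",
    "Less than Bachelors",
    "Bachelors degree",
])

# classes_ holds the sorted labels; a label's index is its integer code.
classes = list(le_demo.classes_)
print(classes)
print(codes.tolist())

# inverse_transform maps codes back to the original labels.
decoded = le_demo.inverse_transform([0, 2]).tolist()
print(decoded)
```

A label that was not present during fitting makes `transform` raise a `ValueError`, which is worth keeping in mind when encoding new inputs for prediction.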

## Model building

Because we used a LabelEncoder, which assigns integer codes that carry no meaningful order, it is better to use an algorithm that is less sensitive to such encodings.
- Tree-based models such as RandomForestRegressor and DecisionTreeRegressor are good examples.

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
import numpy as np
from sklearn.model_selection import GridSearchCV

In [None]:
X = df.drop('Salary', axis=1) # Features
y = df['Salary'] # Target


In [None]:
dec_tree_reg = DecisionTreeRegressor(random_state = 9)
dec_tree_reg.fit(X, y.values)

In [None]:
y_pred = dec_tree_reg.predict(X)

In [None]:
# Root mean squared error, measured on the same data the model was trained on
error = np.sqrt(mean_squared_error(y, y_pred))
print("${:,.2f}".format(error))

In [None]:
random_forest_Reg = RandomForestRegressor(random_state= 9)
random_forest_Reg.fit(X, y.values)

In [None]:
y_pred = random_forest_Reg.predict(X)

In [None]:
# Root mean squared error, measured on the same data the model was trained on
error = np.sqrt(mean_squared_error(y, y_pred))
print("${:,.2f}".format(error))
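
The errors above are computed on the data used for fitting, so they are likely to look better than the error on unseen rows. The sketch below shows one way to hold out a test set with `train_test_split`. It uses synthetic data, and the names `X_demo` and `y_demo` and the 80/20 split are assumptions for illustration, not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: three numeric features and a noisy target.
rng = np.random.default_rng(9)
X_demo = rng.uniform(0, 10, size=(500, 3))
y_demo = 50_000 + 4_000 * X_demo[:, 2] + rng.normal(0, 10_000, size=500)

# Hold out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=9
)

model = DecisionTreeRegressor(random_state=9).fit(X_train, y_train)

# RMSE on the rows the tree has seen versus the held-out rows.
train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))

print("train RMSE: ${:,.2f}".format(train_rmse))
print("test RMSE:  ${:,.2f}".format(test_rmse))
```

An unconstrained tree can memorise its training rows, so the gap between the two numbers gives a rough sense of how much the training error understates the real error.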

In [None]:
max_depth = [None, 1, 2, 4, 6, 8, 10, 12, 14]
parameters = {'max_depth' : max_depth}

regressor = DecisionTreeRegressor(random_state = 9)
gs = GridSearchCV(regressor, parameters, scoring='neg_mean_squared_error')
gs.fit(X, y.values)
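
After fitting, the search object exposes the chosen parameters and the cross-validated score. The sketch below runs a smaller grid on synthetic data and converts the negated MSE back to an RMSE. The data and the names `gs_demo`, `X_demo`, and `y_demo` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data.
rng = np.random.default_rng(9)
X_demo = rng.uniform(0, 10, size=(300, 3))
y_demo = 50_000 + 4_000 * X_demo[:, 2] + rng.normal(0, 10_000, size=300)

gs_demo = GridSearchCV(
    DecisionTreeRegressor(random_state=9),
    {"max_depth": [None, 2, 4]},
    scoring="neg_mean_squared_error",  # higher is better, so MSE is negated
)
gs_demo.fit(X_demo, y_demo)

best_depth = gs_demo.best_params_["max_depth"]
# best_score_ is the mean cross-validated negated MSE of the best setting.
cv_rmse = np.sqrt(-gs_demo.best_score_)

print("best max_depth:", best_depth)
print("cross-validated RMSE: ${:,.2f}".format(cv_rmse))
```

Because the score is averaged over held-out folds, this RMSE is usually a more cautious estimate than one computed on the training data.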

In [None]:
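# GridSearchCV refits the best estimator on the full data by default (refit=True),
# so best_estimator_ is already fitted and needs no further call to fit
regressor = gs.best_estimator_

y_pred = regressor.predict(X)
# Root mean squared error on the training data
error = np.sqrt(mean_squared_error(y, y_pred))
print("${:,.2f}".format(error))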

In [None]:
X = np.array([["Germany", "Masters degree", 1]])
X

In [None]:

# Encode the 'Country' column with the LabelEncoder fitted earlier
X[:, 0] = le_country.transform(X[:, 0])

# Encode the 'Education' column with its fitted LabelEncoder
X[:, 1] = le_education.transform(X[:, 1])

In [None]:
X = X.astype(float)
y_pred = regressor.predict(X)
y_pred
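
The encode-then-predict steps can also be wrapped in a small helper that builds a one-row DataFrame with the training column names. The sketch below is self-contained and uses a tiny made-up training set; `predict_salary`, `le_c`, `le_e`, and the toy salaries are hypothetical names and values, not part of the notebook above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeRegressor

# Tiny made-up training set with the same column layout as the notebook.
toy = pd.DataFrame({
    "Country": ["Germany", "India", "Germany", "India"],
    "Education": ["Masters degree", "Bachelors degree",
                  "Bachelors degree", "Masters degree"],
    "Experience": [1.0, 3.0, 5.0, 7.0],
    "Salary": [60_000, 20_000, 70_000, 30_000],
})

le_c, le_e = LabelEncoder(), LabelEncoder()
toy["Country"] = le_c.fit_transform(toy["Country"])
toy["Education"] = le_e.fit_transform(toy["Education"])

model = DecisionTreeRegressor(random_state=9)
model.fit(toy.drop("Salary", axis=1), toy["Salary"])

def predict_salary(country, education, experience):
    # Build one row with the same column names and order used for training.
    row = pd.DataFrame({
        "Country": le_c.transform([country]),
        "Education": le_e.transform([education]),
        "Experience": [float(experience)],
    })
    return float(model.predict(row)[0])

estimate = predict_salary("Germany", "Masters degree", 1)
print("${:,.2f}".format(estimate))
```

Passing a DataFrame with matching column names keeps the input consistent with how the model was fitted, which a plain NumPy array does not guarantee.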

## Saving the model

In [None]:
import pickle

In [None]:
data = {'model' : regressor,
        'le_country' : le_country,
        'le_education' : le_education}
with open('saved_steps.pkl', 'wb') as file:
    pickle.dump(data, file)

In [None]:
with open('saved_steps.pkl', 'rb') as file:
    data = pickle.load(file)

In [None]:
regressor_loaded = data['model']
le_country = data['le_country']
le_education = data['le_education']

In [None]:
y_pred = regressor_loaded.predict(X)
y_pred
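
One simple check after reloading is to confirm that the restored model gives the same predictions as the original. The sketch below does this with a toy model and a temporary file; the file name `saved_steps_demo.pkl` and the data are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy model standing in for the tuned regressor.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([10.0, 20.0, 30.0, 40.0])
model = DecisionTreeRegressor(random_state=9).fit(X_demo, y_demo)

# Save to a temporary directory and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "saved_steps_demo.pkl")
    with open(path, "wb") as file:
        pickle.dump({"model": model}, file)
    with open(path, "rb") as file:
        restored = pickle.load(file)["model"]

# The restored model should reproduce the original predictions exactly.
same = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same)
```

Pickle files should only be loaded from sources you trust, and they are best loaded with the same scikit-learn version that wrote them.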