# Thapar Summer School - Employee Salary Prediction

In this lab, we predict the salaries of employees from several fictional companies.

The features are explained below:

- `id`: Identity of the employee
- `salary`: (Target Column) Salary of the employee
- `company`: Current Company of the employee
- `department`: Department of the employee
- `age`: Current age of the employee
- `age_when_joined`: Employee's age when they joined the company
- `years_in_the_company`: Number of years the employee has spent in the company
- `prior_years_experience`: Employee's years of experience prior to joining the company
- `annual_bonus`: Annual bonus of the employee

## Tools

Before we get started, let's import the libraries we need.

In [24]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, AdaBoostRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor

ModuleNotFoundError: No module named 'catboost'
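The `ModuleNotFoundError` above means `catboost` was not installed in the environment where this cell ran. One possible way to keep the rest of the notebook usable is to treat it as an optional dependency. The sketch below is only an illustration; the flag name `HAS_CATBOOST` is a hypothetical helper, not part of the original lab.

```python
# Guarded import: catboost is treated as an optional dependency here.
# HAS_CATBOOST is a hypothetical flag introduced for this sketch.
try:
    from catboost import CatBoostRegressor
    HAS_CATBOOST = True
except ImportError:  # ModuleNotFoundError is a subclass of ImportError
    CatBoostRegressor = None
    HAS_CATBOOST = False

print(f"catboost available: {HAS_CATBOOST}")
```

Later cells can then check `HAS_CATBOOST` before building a `CatBoostRegressor`.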

## Viewing the data

Let's take a look at our training dataset and get more familiar with it. Remember that the training and test datasets are already separated.

In [2]:
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

Looking at the first five examples of the dataset.

In [3]:
df_train.head()

Unnamed: 0,id,company,department,age,age_when_joined,years_in_the_company,salary,annual_bonus,prior_years_experience,full_time,part_time,contractor
0,1,Cheerper,Support,40,36,4,69420.46872,22586.99591,2,0.0,0.893809,0.328591
1,2,Cheerper,BigData,40,34,6,88407.04974,18676.07837,3,0.205947,0.756632,0.03687
2,3,Pear,Sales,41,39,2,97831.84885,19287.87365,2,0.942309,0.0,0.514457
3,4,Glasses,Search Engine,39,33,6,93905.86813,17936.39297,3,0.484373,0.236922,0.278535
4,5,Glasses,AI,39,35,3,105983.9752,16854.92943,3,0.835346,0.308958,0.0


Checking the shape of our dataset.

In [4]:
n_rows = df_train.shape[0]
n_columns = df_train.shape[1]

print(f"The number of rows is: {n_rows}")
print(f"The number of columns is: {n_columns}")

The number of rows is: 100000
The number of columns is: 12


In [5]:
df_train.dtypes

id                          int64
company                    object
department                 object
age                         int64
age_when_joined             int64
years_in_the_company        int64
salary                    float64
annual_bonus              float64
prior_years_experience      int64
full_time                 float64
part_time                 float64
contractor                float64
dtype: object

Checking whether the dataframe has any null values.

In [6]:
df_train.isnull().sum()

id                        0
company                   0
department                0
age                       0
age_when_joined           0
years_in_the_company      0
salary                    0
annual_bonus              0
prior_years_experience    0
full_time                 0
part_time                 0
contractor                0
dtype: int64

That's good. Now, we notice that just two features are categorical (`company` and `department`), while the others are numerical (not counting `id`). Let's see the different values that each of these two features contains.

#### Values in _company_ and _department_

In [7]:
df_train['company'].value_counts()

Glasses     47734
Cheerper    28583
Pear        23683
Name: company, dtype: int64

In [8]:
df_train['department'].value_counts()

Search Engine    21915
AI               21642
BigData          15777
Design           15713
Sales            12535
Support          12418
Name: department, dtype: int64

In [9]:
df_train.describe(include=object)

Unnamed: 0,company,department
count,100000,100000
unique,3,6
top,Glasses,Search Engine
freq,47734,21915


#### Summary measures of each numerical feature

In [10]:
df_train.describe(include=[int, float])

Unnamed: 0,id,age,age_when_joined,years_in_the_company,salary,annual_bonus,prior_years_experience,full_time,part_time,contractor
count,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0
mean,50000.5,38.19587,33.53706,4.66778,87389.018245,18582.064815,2.50554,0.383531,0.383279,0.382016
std,28867.657797,6.013073,7.719078,2.531773,28478.542805,4373.431365,1.207222,0.340638,0.339783,0.339621
min,1.0,30.0,22.0,1.0,40000.0,9000.0,1.0,0.0,0.0,0.0
25%,25000.75,33.0,27.0,3.0,66027.96136,15448.885482,1.0,0.016171,0.015702,0.013575
50%,50000.5,38.0,33.0,5.0,86554.20499,18821.651785,2.0,0.333278,0.334739,0.331437
75%,75000.25,43.0,39.0,7.0,107269.920325,22168.52263,3.0,0.653879,0.650733,0.649338
max,100000.0,49.0,48.0,9.0,153000.0,24792.91,5.0,1.0,1.0,1.0


Before we start the data pre-processing, note that the categorical features **company** and **department** have low cardinality (3 and 6 distinct values), so they can be handled with simple **encoding** techniques.

Also, the numerical features sit on **very different scales**: `salary` and `annual_bonus` are in the tens of thousands, while `full_time`, `part_time`, and `contractor` lie between 0 and 1. Such differences can let large-valued features dominate and can slow the convergence of gradient-based models, so we will rescale the numerical features.

## Data pre-processing

Notice that the column `id` only represents the _index_ of an employee. Since it appears to be irrelevant to the analysis, we can simply **drop the feature**.

In [11]:
df_train = df_train.drop(columns='id')

#### Separating numerical features and categorical features

In [12]:
# taking only numerical features
# (note: this selection still includes the target column `salary`)
num_feat = df_train.drop(columns=['company', 'department'])

# taking only categorical features
cat_feat = df_train.loc[:, ['company', 'department']]
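As an alternative to listing the column names by hand, the split can be driven by dtype. The following is a minimal sketch on a small made-up frame (`toy` and its values are illustrative only, not taken from `train.csv`), using `DataFrame.select_dtypes`.

```python
import pandas as pd

# Toy frame with the same kind of columns as the lab data (values are made up).
toy = pd.DataFrame({
    "company": ["Glasses", "Pear"],
    "department": ["AI", "Sales"],
    "age": [39, 41],
    "annual_bonus": [16854.9, 19287.9],
})

# Numerical columns are selected by dtype; everything else is treated as categorical.
num_cols = toy.select_dtypes(include="number").columns.tolist()
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(num_cols)
print(cat_cols)
```

This keeps the split correct if more columns are added later, at the cost of relying on the dtypes being inferred as expected.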

#### Processing the categorical features using OneHotEncoder

In [13]:
# creating instance of one-hot-encoder
enc = OneHotEncoder(handle_unknown='ignore')

# one-hot encoded values of company and department
# (get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2)
enc_df = pd.DataFrame(enc.fit_transform(cat_feat[['company', 'department']]).toarray(), columns=enc.get_feature_names_out(['company', 'department']))
enc_df.head()



Unnamed: 0,company_Cheerper,company_Glasses,company_Pear,department_AI,department_BigData,department_Design,department_Sales,department_Search Engine,department_Support
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


#### Feature scaling using StandardScaler

Notice that some numerical features, such as `annual_bonus`, take much larger values than others. We standardize the numerical features so that these differences in scale do not hurt model performance or cause convergence problems in gradient-based models.

In [14]:
# creating an instance of StandardScaler
scaler = StandardScaler()

# applying StandardScaler only to the numerical features
num_feat_norm = scaler.fit_transform(num_feat)

# converting normalized output to a new DataFrame
feat_norm_df = pd.DataFrame(num_feat_norm, columns=num_feat.columns)
feat_norm_df.head()

Unnamed: 0,age,age_when_joined,years_in_the_company,salary,annual_bonus,prior_years_experience,full_time,part_time,contractor
0,0.300036,0.319073,-0.263761,-0.630954,0.915746,-0.418765,-1.125925,1.502528,-0.157308
1,0.300036,0.059974,0.526203,0.035747,0.021497,0.409587,-0.521329,1.098805,-1.016274
2,0.466341,0.707723,-1.053725,0.366693,0.161386,-0.418765,1.640396,-1.128018,0.389967
3,0.133731,-0.069576,0.526203,0.228835,-0.147636,0.409587,0.29604,-0.430738,-0.304698
4,0.133731,0.189524,-0.658743,0.652949,-0.394917,0.409587,1.326385,-0.218732,-1.124837
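For reference, `StandardScaler` replaces each value with `(x - mean) / std`, computed per column and using the population standard deviation. The sketch below checks this on a few made-up bonus values (not taken from the dataset).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A single made-up column of bonus-like values.
x = np.array([[9000.0], [15000.0], [18000.0], [24000.0]])

scaled = StandardScaler().fit_transform(x)

# Manual z-score; np.std uses ddof=0 by default, matching StandardScaler.
manual = (x - x.mean(axis=0)) / x.std(axis=0)

print(np.allclose(scaled, manual))
```

After scaling, each column has mean 0 and standard deviation 1, which is why the outputs above are small values centred on zero.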


Extracting the target `salary` from `df_train` so it can be concatenated with the other DataFrames.

In [15]:
# taking the target
target_df = df_train.loc[:,'salary']

# concatenating the df's of preprocessed numerical and categorical features
all_features = pd.concat([enc_df, feat_norm_df], axis=1)

# concatenating the df's of all features with the target
df_train = pd.concat([all_features, target_df], axis=1)
df_train.head()

Unnamed: 0,company_Cheerper,company_Glasses,company_Pear,department_AI,department_BigData,department_Design,department_Sales,department_Search Engine,department_Support,age,age_when_joined,years_in_the_company,salary,annual_bonus,prior_years_experience,full_time,part_time,contractor,salary.1
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.300036,0.319073,-0.263761,-0.630954,0.915746,-0.418765,-1.125925,1.502528,-0.157308,69420.46872
1,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.300036,0.059974,0.526203,0.035747,0.021497,0.409587,-0.521329,1.098805,-1.016274,88407.04974
2,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.466341,0.707723,-1.053725,0.366693,0.161386,-0.418765,1.640396,-1.128018,0.389967,97831.84885
3,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.133731,-0.069576,0.526203,0.228835,-0.147636,0.409587,0.29604,-0.430738,-0.304698,93905.86813
4,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.133731,0.189524,-0.658743,0.652949,-0.394917,0.409587,1.326385,-0.218732,-1.124837,105983.9752


In [16]:
df_train.shape

(100000, 19)

In [17]:
df_train.dtypes

company_Cheerper            float64
company_Glasses             float64
company_Pear                float64
department_AI               float64
department_BigData          float64
department_Design           float64
department_Sales            float64
department_Search Engine    float64
department_Support          float64
age                         float64
age_when_joined             float64
years_in_the_company        float64
salary                      float64
annual_bonus                float64
prior_years_experience      float64
full_time                   float64
part_time                   float64
contractor                  float64
salary                      float64
dtype: object

#### Separating _X_train_ and _y_train_

In [18]:
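# train: df_train has two `salary` columns (scaled copy and raw target),
# and dropping by label removes both from the features
X_train = df_train.drop(columns='salary')
# use the raw target Series from above; .loc[:, 'salary'] would return both columns
y_train = target_df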
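Because `df_train` carries two columns labelled `salary` at this point (see the `dtypes` output above), label-based selection behaves differently from the single-column case. The sketch below reproduces this on a tiny made-up frame; `demo` is purely illustrative.

```python
import pandas as pd

# Tiny frame with a duplicated column label, mimicking the scaled and raw `salary`.
demo = pd.DataFrame(
    [[0.5, 1.2, 100.0], [-0.5, 0.3, 200.0]],
    columns=["salary", "age", "salary"],
)

both = demo.loc[:, "salary"]             # DataFrame with BOTH `salary` columns
features = demo.drop(columns="salary")   # drops every column labelled `salary`
raw = demo.iloc[:, -1]                   # positional selection returns one Series

print(type(both).__name__, both.shape)
print(list(features.columns))
print(raw.tolist())
```

Selecting by position, or keeping the target in its own variable, avoids passing a two-column target to a regressor by accident.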

#### Applying the same preprocessing to the test DataFrame to obtain X_test

In [19]:
# drop id
df_test = df_test.drop(columns='id')

# separate the numerical features and categorical features
num_feat_test = df_test.drop(columns=['company', 'department'])
cat_feat_test = df_test.loc[:, ['company', 'department']]

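# OneHotEncoder: reuse the encoder fitted on the training data so the test
# columns match the training columns
enc_test_df = pd.DataFrame(enc.transform(cat_feat_test[['company', 'department']]).toarray(), columns=enc.get_feature_names_out(['company', 'department']))

# Feature scaling
# NOTE: fit_transform refits the scaler on the test data; the training-time fit
# cannot be reused as-is because it included the `salary` column
num_feat_norm_test = scaler.fit_transform(num_feat_test)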
feat_norm_df_test = pd.DataFrame(num_feat_norm_test, columns=num_feat_test.columns)

# concatenating the df's of preprocessed numerical and categorical features of test
X_test = pd.concat([enc_test_df, feat_norm_df_test], axis=1)
X_test.head()



Unnamed: 0,company_Cheerper,company_Glasses,company_Pear,department_AI,department_BigData,department_Design,department_Sales,department_Search Engine,department_Support,age,age_when_joined,years_in_the_company,annual_bonus,prior_years_experience,full_time,part_time,contractor
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.117167,1.859593,-1.442821,-2.176168,1.239771,-1.127529,-0.653535,1.811299
1,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.452452,1.083241,-1.442821,-0.367324,1.239771,0.254885,-1.126316,0.999501
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,-0.876979,-1.375206,1.718193,-0.898164,1.239771,-0.129338,1.118517,-1.129061
3,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,-0.7108,-0.598855,0.137686,0.977069,-0.415398,1.799037,-1.126316,0.075396
4,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.618631,0.953849,-1.442821,-0.665359,-0.415398,-1.127529,0.99743,0.616834
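A stricter variant of this preprocessing learns every statistic (categories, means, standard deviations) from the training features only and then applies them unchanged to the test set. One way to sketch that is with scikit-learn's `ColumnTransformer`. The frames below are made up, and the names `preprocess`, `X_train_demo`, and `X_test_demo` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up train/test frames; only the train frame carries the target `salary`.
train = pd.DataFrame({
    "company": ["Glasses", "Pear", "Cheerper", "Glasses"],
    "age": [39, 41, 40, 35],
    "annual_bonus": [16854.9, 19287.9, 22587.0, 18676.1],
    "salary": [105984.0, 97831.8, 69420.5, 88407.0],
})
test = pd.DataFrame({
    "company": ["Pear", "Glasses"],
    "age": [30, 45],
    "annual_bonus": [12000.0, 21000.0],
})

cat_cols = ["company"]
num_cols = ["age", "annual_bonus"]  # the target is deliberately left out

preprocess = ColumnTransformer(
    [
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("num", StandardScaler(), num_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Fit on training features only, then reuse the fitted transformers on test.
X_train_demo = preprocess.fit_transform(train[cat_cols + num_cols])
X_test_demo = preprocess.transform(test[cat_cols + num_cols])

print(X_train_demo.shape, X_test_demo.shape)
```

With this layout the train and test matrices are guaranteed to share the same columns, and no test-set statistics influence the scaling.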


## Model Creation & Evaluation

We will create a function that trains a model and evaluates it, so we can reuse it with different regression algorithms.

In [20]:
r2_value = []
adjusted_r2_value = []
mae_value = []
mse_value = []
rmse_value = []

In [21]:
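def model_evaluation(model):
    # NOTE: y_test is never defined above (the test set shown has no `salary`
    # column); it must hold the true salaries for the rows of X_test.
    model.fit(X_train, y_train)
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Metrics calculation.
    mae = mean_absolute_error(y_test, y_test_pred)
    mse = mean_squared_error(y_test, y_test_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_test_pred)
    # adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1), with n rows and p features
    n, p = X_test.shape
    adjusted_r2 = 1 - ((1 - r2) * (n - 1) / (n - p - 1))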
   
    mae_value.append(mae)
    mse_value.append(mse)
    rmse_value.append(rmse)
    r2_value.append(r2)
    adjusted_r2_value.append(adjusted_r2) 
    
    print(f"R2 Score of the {model} model is=>",r2)
    print(f"Adjusted R2 Score of the {model} model is=>",adjusted_r2)
    print()
    print(f"MAE of {model} model is=>",mae)
    print(f"MSE of {model} model is=>",mse)
    print(f"RMSE of {model} model is=>",rmse)
    

    # Scatter plot.
    plt.figure(figsize=(15,5))
    plt.subplot(1,2,1)    
    plt.scatter(y_train, y_train_pred, color='blue', label='Train')
    plt.scatter(y_test, y_test_pred, color='red', label='Test')
    plt.xlabel('True values')
    plt.ylabel('Predicted values')
    plt.legend()
    plt.title('Scatter Plot',fontweight="black",size=20,pad=10)
    
    # Residual plot.
    plt.subplot(1,2,2)
    plt.scatter(y_train_pred, y_train_pred - y_train, color='blue', label='Train')
    plt.scatter(y_test_pred, y_test_pred - y_test, color='red', label='Test')
    plt.axhline(y=0, color='black', linestyle='--')
    plt.xlabel('Predicted values')
    plt.ylabel('Residuals')
    plt.legend()
    plt.title('Residual Plot',fontweight="black",size=20,pad=10)
    plt.show()
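The function above needs true target values for the rows it evaluates on. When the provided test set has no labels, one common option is to hold out part of the training data as a validation set. The sketch below shows that idea on synthetic data; the data, the 80/20 split, and the names `X_val` and `y_val` are assumptions made for illustration, not part of the original lab.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the preprocessed features and salary.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 50000 + X @ np.array([3000.0, -1500.0, 800.0]) + rng.normal(scale=100.0, size=200)

# Hold out 20% of the labelled data for validation.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_tr, y_tr)
y_val_pred = model.predict(X_val)

# Same metrics as in model_evaluation, computed on the held-out rows.
mae = mean_absolute_error(y_val, y_val_pred)
rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
r2 = r2_score(y_val, y_val_pred)
n, p = X_val.shape
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"MAE={mae:.2f} RMSE={rmse:.2f} R2={r2:.4f} adjusted R2={adjusted_r2:.4f}")
```

The held-out arrays can then play the role of `X_test` and `y_test` when comparing the different regressors.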