This project focuses on salary prediction. The data set contains job descriptions and salaries. The aim is to predict the salary of a job posting (the `Salary` column) from its job description. This matters because such a model could recommend a salary as soon as a job description is entered into a system.

## Description of Variables

The variables are described in "Jobs - Data Dictionary.docx".

## Data Prep 

In [1]:
import pandas as pd
import numpy as np

In [2]:
df_jobs = pd.read_csv('jobs_alldata.csv')
df_jobs

Unnamed: 0,Salary,Job Description,Location,Min_years_exp,Technical,Comm,Travel
0,67206,Civil Service Title: Regional Director Mental ...,Remote,5,2,3,0
1,88313,The New York City Comptroller's Office Burea...,Remote,5,2,4,10-15
2,81315,With minimal supervision from the Deputy Commi...,East campus,5,3,3,5-10
3,76426,OPEN TO CURRENT BUSINESS PROMOTION COORDINATOR...,East campus,1,1,3,0
4,55675,Only candidates who are permanent in the Princ...,Southeast campus,1,1,3,5-10
...,...,...,...,...,...,...,...
2408,79812,"Section 8, also known as the Housing Choice Vo...",Southeast campus,5,2,1,0
2409,108122,The NYC Department of Environmental Protection...,West campus,5,1,1,0
2410,55711,The NYC Department of Environmental Protection...,HQ,5,5,4,0
2411,64420,"Under general supervision, with some latitude ...",East campus,4,1,5,0


In [4]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(df_jobs, test_size=0.3)
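
The split above is not seeded, so the rows in each set (and every RMSE further down) can change from run to run. A minimal sketch of a reproducible split, assuming a toy frame in place of `df_jobs`; the seed value 42 is an arbitrary choice for illustration, not something the original notebook uses:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for df_jobs; only the row count matters here
toy = pd.DataFrame({'Salary': range(10), 'Min_years_exp': range(10)})

# random_state=42 is an arbitrary seed chosen for illustration
first_train, first_test = train_test_split(toy, test_size=0.3, random_state=42)
second_train, second_test = train_test_split(toy, test_size=0.3, random_state=42)

# With the same seed, both calls select the same rows for the test set
same_split = first_test.index.equals(second_test.index)
print(same_split)
```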

## Separating the Target Variable

In [5]:
train_y = train_set['Salary']
test_y = test_set['Salary']

train_inputs = train_set.drop(['Salary'], axis=1)
test_inputs = test_set.drop(['Salary'], axis=1)

### Separating the Job Description column to perform Text Mining for Train Dataset

In [6]:
Train_Text_inputs=train_inputs['Job Description']
Train_Text_inputs

2012    The Division of Sidewalk & Inspection Manageme...
1648    The Real Estate Services (RES) division of DCA...
1026    The Mental Hygiene Division is responsible for...
2281    The New York City Housing Authority (NYCHA) is...
2321    The Commission on Human Rights (the Commission...
                              ...                        
1380    The New York City Department of Environmental ...
267     The NYC Department of Design and Construction,...
1454    The New York City Department of Environmental ...
1912    The NYC Department of Environmental Protection...
379     The mission of the Bureau of Environmental Dis...
Name: Job Description, Length: 1689, dtype: object

In [7]:
Train_NonText_inputs=train_inputs.drop('Job Description',axis=1)
Train_NonText_inputs

Unnamed: 0,Location,Min_years_exp,Technical,Comm,Travel
2012,HQ,1,1,3,1-5
1648,HQ,4,2,2,0
1026,HQ,1,3,2,0
2281,HQ,4,4,4,0
2321,HQ,4,4,3,5-10
...,...,...,...,...,...
1380,HQ,1,2,3,0
267,Remote,5,1,4,0
1454,Southeast campus,1,3,3,0
1912,HQ,4,4,1,1-5


### Separating the Job Description column to perform Text Mining for Test Dataset

In [8]:
Test_Text_inputs=test_inputs['Job Description']
Test_Text_inputs

1334    TASK FORCE: \t\tFIRE, PARKS, AND SANITATION  U...
1960    Please read this posting carefully to make cer...
1841    The New York City Law Department seeks applica...
95      The NYC Department of Buildings is responsible...
344     The New York City Department of Investigation ...
                              ...                        
2154    REPOST The candidate will be responsible for t...
1453    The mission of the Bureau of Environmental Dis...
1631    THIS IS A TEMPORARY ASSIGNMENT  New York City ...
115     NYC Department of Finance (DOF) is responsible...
466     *** IN ORDER TO BE CONSIDERED FOR THIS POSITIO...
Name: Job Description, Length: 724, dtype: object

In [9]:
Test_NonText_inputs=test_inputs.drop('Job Description',axis=1)
Test_NonText_inputs

Unnamed: 0,Location,Min_years_exp,Technical,Comm,Travel
1334,West campus,1,5,3,0
1960,East campus,2,4,3,0
1841,HQ,5,5,3,0
95,HQ,1,2,4,0
344,Remote,1,1,3,1-5
...,...,...,...,...,...
2154,West campus,5,3,3,0
1453,West campus,1,2,3,0
1631,East campus,2,3,2,5-10
115,East campus,1,1,3,0


## Feature Engineering

### Creating a new binary column called travel_hours: the value is 0 if Travel is '0' and 1 otherwise (this column helps determine the effect of travel on salary)

In [11]:
def new_col(df):
    #Create a copy so that we don't overwrite the existing dataframe
    df1 = df.copy()
    
    df1['travel_hours'] = np.where(df1['Travel'] == '0', 0, 1)
    
    return df1[['travel_hours']]

In [12]:
new_col(Train_NonText_inputs)

Unnamed: 0,travel_hours
2012,1
1648,0
1026,0
2281,0
2321,1
...,...
1380,0
267,0
1454,0
1912,1
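
To make the behaviour of `new_col` easier to check, here is a small sketch that runs the same function on a few made-up rows (the toy values are hypothetical, not taken from `jobs_alldata.csv`). Because the comparison is against the string `'0'`, a missing `Travel` value is also flagged as 1, which may or may not be what is wanted:

```python
import numpy as np
import pandas as pd

def new_col(df):
    # Same logic as the notebook cell above
    df1 = df.copy()
    df1['travel_hours'] = np.where(df1['Travel'] == '0', 0, 1)
    return df1[['travel_hours']]

# Hypothetical toy rows, including one missing value
toy_travel = pd.DataFrame({'Travel': ['0', '1-5', '5-10', None]})
toy_flags = new_col(toy_travel)['travel_hours'].tolist()
print(toy_flags)
```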


In [16]:
Train_NonText_inputs

Unnamed: 0,Location,Min_years_exp,Technical,Comm,Travel
2012,HQ,1,1,3,1-5
1648,HQ,4,2,2,0
1026,HQ,1,3,2,0
2281,HQ,4,4,4,0
2321,HQ,4,4,3,5-10
...,...,...,...,...,...
1380,HQ,1,2,3,0
267,Remote,5,1,4,0
1454,Southeast campus,1,3,3,0
1912,HQ,4,4,1,1-5


In [17]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

from sklearn.preprocessing import FunctionTransformer

## Identify the numerical and categorical columns

In [18]:
Train_NonText_inputs.dtypes

Location         object
Min_years_exp     int64
Technical         int64
Comm              int64
Travel           object
dtype: object

In [19]:
numeric_columns=['Min_years_exp','Technical','Comm']

In [20]:
categorical_columns=['Location']

In [21]:
feat_eng_columns=['Travel']

## Pipeline

In [22]:
numeric_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])

In [23]:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [24]:
my_new_column = Pipeline(steps=[('my_new_column', FunctionTransformer(new_col))])

In [25]:
preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns),
        ('trans', my_new_column, feat_eng_columns)],
        remainder='passthrough')

### Transform: fit_transform() for TRAIN_NonText

In [26]:
train_x_nontext = preprocessor.fit_transform(Train_NonText_inputs)

train_x_nontext

array([[-1.11602173, -1.0304714 , -0.14682133, ...,  0.        ,
         0.        ,  1.        ],
       [ 0.54620111, -0.19330632, -1.27400876, ...,  0.        ,
         0.        ,  0.        ],
       [-1.11602173,  0.64385876, -1.27400876, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [-1.11602173,  0.64385876, -0.14682133, ...,  1.        ,
         0.        ,  0.        ],
       [ 0.54620111,  1.48102384, -2.40119619, ...,  0.        ,
         0.        ,  1.        ],
       [-1.11602173, -0.19330632, -0.14682133, ...,  0.        ,
         0.        ,  0.        ]])

In [27]:
train_x_nontext.shape

(1689, 9)

### Transform: transform() for TEST_NonText

In [28]:
test_x_nontext = preprocessor.transform(Test_NonText_inputs)

test_x_nontext

array([[-1.11602173,  2.31818892, -0.14682133, ...,  0.        ,
         1.        ,  0.        ],
       [-0.56194745,  1.48102384, -0.14682133, ...,  0.        ,
         0.        ,  0.        ],
       [ 1.10027539,  2.31818892, -0.14682133, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [-0.56194745,  0.64385876, -1.27400876, ...,  0.        ,
         0.        ,  1.        ],
       [-1.11602173, -1.0304714 , -0.14682133, ...,  0.        ,
         0.        ,  0.        ],
       [ 1.10027539, -1.0304714 ,  0.98036609, ...,  0.        ,
         0.        ,  0.        ]])

In [29]:
test_x_nontext.shape

(724, 9)
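
The preprocessor is fitted on the training inputs only and then applied to the test inputs. A minimal sketch of why that matters, using a bare `StandardScaler` on made-up numbers: the test values are scaled with the mean and standard deviation learned from the training data, not their own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data for illustration
toy_train = np.array([[1.0], [3.0], [5.0]])
toy_test = np.array([[3.0], [7.0]])

toy_scaler = StandardScaler()
toy_scaler.fit(toy_train)                       # learns mean and std from train only
toy_scaled_test = toy_scaler.transform(toy_test)  # reuses the train statistics

print(toy_scaler.mean_, toy_scaled_test.ravel())
```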

## Sklearn: Text preparation

In [30]:
#TfidfVectorizer includes pre-processing, tokenization, filtering stop words
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(stop_words='english')

train_x_tr = tfidf_vect.fit_transform(Train_Text_inputs)

In [31]:
test_x_tr = tfidf_vect.transform(Test_Text_inputs)

In [32]:
train_x_tr.shape, test_x_tr.shape

((1689, 9837), (724, 9837))
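
Both matrices have 9837 columns because the vocabulary is learned from the training text only. A small sketch with made-up sentences illustrates this: English stop words are dropped, and a word that appears only in the test text gets no column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents for illustration
demo_train_docs = ["manage the budget office", "inspect the city sidewalk"]
demo_test_docs = ["manage the sidewalk permit"]

demo_vect = TfidfVectorizer(stop_words='english')
demo_train_m = demo_vect.fit_transform(demo_train_docs)  # learns the vocabulary
demo_test_m = demo_vect.transform(demo_test_docs)        # reuses it unchanged

print(sorted(demo_vect.vocabulary_))
print(demo_train_m.shape, demo_test_m.shape)
```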

In [33]:
train_x_tr.toarray()

array([[0.        , 0.01949644, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.01869492, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.06638992, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.16116157, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

## Latent Semantic Analysis (Singular Value Decomposition)

In [34]:
from sklearn.decomposition import TruncatedSVD

In [43]:
svd = TruncatedSVD(n_components=1000, n_iter=10)

## Fit Transform for Train Data

In [47]:
train_x_lsa = svd.fit_transform(train_x_tr)

In [48]:
train_x_lsa.shape

(1689, 1000)

In [49]:
svd.explained_variance_.sum() ## Sum of the per-component variances (~0.93); the fraction of total variance is svd.explained_variance_ratio_.sum()

0.9304034712508663

## Transforming the Test data

In [50]:
test_x_lsa = svd.transform(test_x_tr)

In [51]:
train_x=np.hstack((train_x_nontext,train_x_lsa))
train_x.shape

(1689, 1009)

In [52]:
test_x=np.hstack((test_x_nontext,test_x_lsa))
test_x.shape

(724, 1009)

In [55]:
from sklearn.metrics import mean_squared_error

## Decision Tree

In [56]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(min_samples_leaf=10) 

tree_reg.fit(train_x, train_y)

DecisionTreeRegressor(min_samples_leaf=10)

In [57]:
train_pred = tree_reg.predict(train_x)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 13582.28848342736


In [58]:
test_pred = tree_reg.predict(test_x)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 25303.495305072018
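
The same four lines for computing RMSE recur for every model below. A small helper could replace them; `report_rmse` is a hypothetical name, and the data here is synthetic so that the sketch runs on its own:

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

def report_rmse(model, x, y, label):
    """Print and return the RMSE of a fitted model on (x, y)."""
    rmse = np.sqrt(mean_squared_error(y, model.predict(x)))
    print('{} RMSE: {:.2f}'.format(label, rmse))
    return rmse

# Synthetic data, only to make the sketch self-contained
rng = np.random.default_rng(0)
demo_x = rng.normal(size=(200, 5))
demo_y = 3.0 * demo_x[:, 0] + rng.normal(size=200)

demo_tree = DecisionTreeRegressor(min_samples_leaf=10)
demo_tree.fit(demo_x[:150], demo_y[:150])

demo_train_rmse = report_rmse(demo_tree, demo_x[:150], demo_y[:150], 'Train')
demo_test_rmse = report_rmse(demo_tree, demo_x[150:], demo_y[150:], 'Test')
```

In the notebook this would be called as `report_rmse(tree_reg, train_x, train_y, 'Train')`, and likewise for the test set.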


## Reducing Overfitting

In [59]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(min_samples_leaf=40) 

tree_reg.fit(train_x, train_y)

DecisionTreeRegressor(min_samples_leaf=40)

In [60]:
train_pred = tree_reg.predict(train_x)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 21574.828043602367


In [61]:
test_pred = tree_reg.predict(test_x)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 26232.08881195226


## Voting regressor 


In [80]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import SGDRegressor 
from sklearn.svm import SVR 
from sklearn.ensemble import VotingRegressor


dtree_reg = DecisionTreeRegressor(max_depth=20)
svm_reg = SVR(kernel="rbf", C=10, epsilon=0.01, gamma='scale') 
sgd_reg = SGDRegressor(max_iter=10000, tol=1e-3)

voting_reg = VotingRegressor(
            estimators=[('dt', dtree_reg), 
                        ('svr', svm_reg), 
                        ('sgd', sgd_reg)])

voting_reg.fit(train_x, train_y)



VotingRegressor(estimators=[('dt', DecisionTreeRegressor(max_depth=20)),
                            ('svr', SVR(C=10, epsilon=0.01)),
                            ('sgd', SGDRegressor(max_iter=10000))])

In [81]:
train_pred = voting_reg.predict(train_x)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 11959.080539809298


In [82]:
test_pred = voting_reg.predict(test_x)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 19472.956607521635
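
With default weights, a `VotingRegressor` predicts the plain average of its fitted members' predictions. The sketch below checks this on synthetic data; `LinearRegression` is swapped in for the SVR and SGD models purely to keep the example small and fast:

```python
import numpy as np
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration
rng = np.random.default_rng(0)
demo_x = rng.normal(size=(100, 3))
demo_y = demo_x @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

demo_vote = VotingRegressor(estimators=[
    ('dt', DecisionTreeRegressor(max_depth=3, random_state=0)),
    ('lin', LinearRegression())])
demo_vote.fit(demo_x, demo_y)

# Average the members' predictions by hand and compare with the ensemble
manual_mean = np.mean(
    [est.predict(demo_x) for est in demo_vote.estimators_], axis=0)
matches = np.allclose(manual_mean, demo_vote.predict(demo_x))
print(matches)
```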


## Reducing Overfitting

In [83]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import SGDRegressor 
from sklearn.svm import SVR 
from sklearn.ensemble import VotingRegressor


dtree_reg = DecisionTreeRegressor(max_depth=3)
svm_reg = SVR(kernel="rbf", C=10, epsilon=0.01, gamma='scale') 
sgd_reg = SGDRegressor(max_iter=10000, tol=1e-3)

voting_reg = VotingRegressor(
            estimators=[('dt', dtree_reg), 
                        ('svr', svm_reg), 
                        ('sgd', sgd_reg)])

voting_reg.fit(train_x, train_y)



VotingRegressor(estimators=[('dt', DecisionTreeRegressor(max_depth=3)),
                            ('svr', SVR(C=10, epsilon=0.01)),
                            ('sgd', SGDRegressor(max_iter=10000))])

In [84]:
train_pred = voting_reg.predict(train_x)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 19687.67909655768


In [85]:
test_pred = voting_reg.predict(test_x)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 23113.521337891932


## A Boosting model


In [352]:
from sklearn.ensemble import AdaBoostRegressor 

#Create Adaptive Boosting with Decision Stumps (depth=1)
ada_reg = AdaBoostRegressor( 
            DecisionTreeRegressor(max_depth=1), n_estimators=500, 
            learning_rate=0.1) 

ada_reg.fit(train_x, train_y)

AdaBoostRegressor(base_estimator=DecisionTreeRegressor(max_depth=1),
                  learning_rate=0.1, n_estimators=500)

In [353]:
train_pred = ada_reg.predict(train_x)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 29926.098016920172


In [354]:
test_pred = ada_reg.predict(test_x)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 29818.44963353811
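
To see how the error changes as stumps are added, `staged_predict` yields the ensemble's prediction after each boosting round. A minimal sketch on synthetic data, with a smaller `n_estimators` than the notebook uses so that it runs quickly:

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration
rng = np.random.default_rng(0)
demo_x = rng.normal(size=(300, 4))
demo_y = np.sin(demo_x[:, 0]) + 0.5 * demo_x[:, 1] + rng.normal(scale=0.1, size=300)
x_fit, x_val = demo_x[:200], demo_x[200:]
y_fit, y_val = demo_y[:200], demo_y[200:]

demo_ada = AdaBoostRegressor(
    DecisionTreeRegressor(max_depth=1), n_estimators=50,
    learning_rate=0.1, random_state=0)
demo_ada.fit(x_fit, y_fit)

# RMSE on held-out data after each boosting round
stage_rmse = [np.sqrt(mean_squared_error(y_val, pred))
              for pred in demo_ada.staged_predict(x_val)]
best_stage = int(np.argmin(stage_rmse)) + 1
print(best_stage, min(stage_rmse))
```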


## Neural network:

In [68]:
from sklearn.neural_network import MLPRegressor

#One hidden layer with 700 neurons (the default is 1 hidden layer with 100 neurons)
mlp_reg = MLPRegressor(hidden_layer_sizes=(700,),max_iter=1000)

mlp_reg.fit(train_x, train_y)



MLPRegressor(hidden_layer_sizes=(700,), max_iter=1000)

In [71]:
#Train RMSE
train_pred = mlp_reg.predict(train_x)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 20086.158492835628


In [72]:
test_pred = mlp_reg.predict(test_x)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 23422.445560771164


## Randomized search


In [73]:
from sklearn.model_selection import RandomizedSearchCV

param_grid = [
    {'min_samples_leaf': np.arange(10, 30), 
     'max_depth': np.arange(10,30)}
  ]

tree_reg = DecisionTreeRegressor()

grid_search = RandomizedSearchCV(tree_reg, param_grid, cv=5, n_iter=50,
                           scoring='neg_mean_squared_error', verbose=1,
                           return_train_score=True)

grid_search.fit(train_x, train_y)

Fitting 5 folds for each of 50 candidates, totalling 250 fits


RandomizedSearchCV(cv=5, estimator=DecisionTreeRegressor(), n_iter=50,
                   param_distributions=[{'max_depth': array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
       27, 28, 29]),
                                         'min_samples_leaf': array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
       27, 28, 29])}],
                   return_train_score=True, scoring='neg_mean_squared_error',
                   verbose=1)

In [74]:
cvres = grid_search.cv_results_

for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

26406.948113326107 {'min_samples_leaf': 20, 'max_depth': 13}
26296.013902392748 {'min_samples_leaf': 25, 'max_depth': 26}
25585.506562415663 {'min_samples_leaf': 12, 'max_depth': 21}
26406.948113326107 {'min_samples_leaf': 20, 'max_depth': 14}
25929.519939245303 {'min_samples_leaf': 27, 'max_depth': 25}
26152.796362986923 {'min_samples_leaf': 22, 'max_depth': 20}
26186.436031258447 {'min_samples_leaf': 29, 'max_depth': 10}
26152.796362986923 {'min_samples_leaf': 22, 'max_depth': 12}
26043.87733412103 {'min_samples_leaf': 10, 'max_depth': 17}
26406.948113326107 {'min_samples_leaf': 20, 'max_depth': 22}
26308.709423075867 {'min_samples_leaf': 19, 'max_depth': 17}
25815.022714929208 {'min_samples_leaf': 16, 'max_depth': 20}
26249.367267005433 {'min_samples_leaf': 13, 'max_depth': 27}
26281.00559358257 {'min_samples_leaf': 26, 'max_depth': 20}
26167.901413454667 {'min_samples_leaf': 24, 'max_depth': 22}
26512.316381642504 {'min_samples_leaf': 15, 'max_depth': 20}
26187.45474811433 {'min_sa

In [75]:
grid_search.best_params_

{'min_samples_leaf': 12, 'max_depth': 21}

In [76]:
grid_search.best_estimator_

DecisionTreeRegressor(max_depth=21, min_samples_leaf=12)

In [77]:
train_pred = grid_search.best_estimator_.predict(train_x)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 15206.656527537765


In [78]:
test_pred = grid_search.best_estimator_.predict(test_x)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 25885.050485708485
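
The search can also score candidates directly in RMSE units with `scoring='neg_root_mean_squared_error'`, which avoids the manual square root. A minimal sketch on synthetic data; the parameter ranges, `n_iter`, and `cv` are smaller than in the notebook and chosen only for illustration:

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration
rng = np.random.default_rng(0)
demo_x = rng.normal(size=(200, 4))
demo_y = 2.0 * demo_x[:, 0] - demo_x[:, 1] + rng.normal(scale=0.5, size=200)

demo_search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=0),
    {'min_samples_leaf': np.arange(5, 15), 'max_depth': np.arange(2, 8)},
    n_iter=5, cv=3, scoring='neg_root_mean_squared_error', random_state=0)
demo_search.fit(demo_x, demo_y)

# best_score_ is the negated mean cross-validated RMSE of the best candidate
best_cv_rmse = -demo_search.best_score_
print(demo_search.best_params_, best_cv_rmse)
```

By default the best candidate is refit on the whole training set, which is why `best_estimator_` can be used for prediction straight away.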


### Selecting the best model
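
The test RMSE values printed in the cells above can be collected and ranked in one place. A minimal sketch, with the numbers copied from those outputs and rounded to two decimals (the dictionary labels are descriptive names chosen here, not identifiers from the notebook):

```python
# Test RMSE values copied from the printed outputs above, rounded to two decimals
reported_test_rmse = {
    'Decision tree (min_samples_leaf=10)': 25303.50,
    'Decision tree (min_samples_leaf=40)': 26232.09,
    'Voting regressor (max_depth=20)': 19472.96,
    'Voting regressor (max_depth=3)': 23113.52,
    'AdaBoost': 29818.45,
    'Neural network': 23422.45,
    'Randomized search decision tree': 25885.05,
}

# Sort from lowest to highest test RMSE
ranked = sorted(reported_test_rmse.items(), key=lambda item: item[1])
for name, rmse in ranked:
    print('{:<40} {:>10.2f}'.format(name, rmse))

best_name = ranked[0][0]
```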

# Interpretation

The **Voting Regressor** performs best: its test RMSE is the lowest of all the models, at 19,472.96 with `max_depth=20` and 23,113.52 after regularization with `max_depth=3`.
It can be used to predict a job's salary from the job description and other factors such as location and minimum experience.

Note that the RMSE is still high. An RMSE of about 23,000 means that predicted salaries typically miss the actual salary by roughly that amount. Although the Voting Regressor is the best model here, it is not accurate enough for real-world use. The RMSE suggests that the model and the available features do not capture all the information needed to predict salary. More data and additional features would likely be needed to make useful salary predictions.
