# Group work - Assessment 2

In this assignment, we will focus on salary prediction. The data set for this assignment includes information on job descriptions and salaries. Use this data set to see if you can predict the salary of a job posting (i.e., the `Salary` column in the data set) based on the job description. This is important, because this model can make a salary recommendation as soon as a job description is entered into a system.

## Description of Variables

The descriptions of the variables are provided in "Jobs - Data Dictionary.docx".

## Goal

Use the **jobs_alldata.csv** data set and build models to predict **salary**.

**Be careful: this is a REGRESSION task**

## Submission:

Please save and submit this Jupyter notebook file. The correctness of the code matters for your grade. **Readability and organization of your code are also important.** You may lose points for submitting unreadable/undecipherable code. Therefore, use markdown cells to create sections, and use comments where necessary.


## Recommended roles for group members:

**Section 1:** to be completed by both group members

**Section 2:** first three models to be completed by the first group member and checked by the second; last two models to be completed by the second group member and checked by the first.

**Discussion:** to be completed by both group members

**Important notes:**
- Both group members will get the same grade. Therefore, you should check the work of your group member. If they make a mistake, you will be responsible for that mistake too.
- Both group members must put in their fair share of effort. Otherwise, those who don't contribute to the assignment will not receive any grade.


# Section 1: (8 points in total)

## Data Prep (6 points)

In [363]:
#Importing required libraries

import pandas as pd
import numpy as np
np.random.seed(42)
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import FunctionTransformer

In [364]:
# Read the CSV file
job_salary = pd.read_csv('jobs_alldata.csv')

In [365]:
#Checking data set
job_salary.head()

Unnamed: 0,Salary,Job Description,Location,Min_years_exp,Technical,Comm,Travel
0,67206,Civil Service Title: Regional Director Mental ...,Remote,5,2,3,0
1,88313,The New York City Comptroller’s Office Burea...,Remote,5,2,4,10-15
2,81315,With minimal supervision from the Deputy Commi...,East campus,5,3,3,5-10
3,76426,OPEN TO CURRENT BUSINESS PROMOTION COORDINATOR...,East campus,1,1,3,0
4,55675,Only candidates who are permanent in the Princ...,Southeast campus,1,1,3,5-10


In [366]:
#Checking Data Type
job_salary.dtypes

Salary              int64
Job Description    object
Location           object
Min_years_exp       int64
Technical           int64
Comm                int64
Travel             object
dtype: object

In [367]:
#Splitting our data set into train and test sets with a 70/30 ratio
#(train_test_split is already imported above; np.random.seed(42) keeps the split repeatable when the notebook is run top to bottom)
train_set, test_set = train_test_split(job_salary, test_size=0.3)

In [368]:
#Creating test and train sets for the text mining process
text_min_train = train_set['Job Description']
text_min_test = test_set['Job Description']

text_min_test

765     The New York City Housing Authority (NYCHA) is...
2387    Hiring Rate:  $62,272.00  (Flat Rate-Annual)  ...
2162    The Executive Director for Regulatory Reform w...
1833    The NYC Department of Environmental Protection...
1814    The Department of Transportation’s (DOT) mis...
                              ...                        
2333    The Family Independence Administration/ Office...
998     In order to be considered for this position ca...
891     In accordance to Local Law 196 established in ...
1866    About New York City Cyber Command NYC Cyber Co...
1731    Only candidates who are permanent in the Civil...
Name: Job Description, Length: 724, dtype: object

In [369]:
# Check for missing values
job_salary[['Job Description']].isna().sum()

Job Description    0
dtype: int64

### Text Mining

In [370]:
#TfidfVectorizer includes pre-processing, tokenization, filtering stop words
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(stop_words='english')

#Fit the vectorizer on the train set only, then transform both sets with it
input_data_TM_train = tfidf_vect.fit_transform(text_min_train)
input_data_TM_test = tfidf_vect.transform(text_min_test)

In [371]:
input_data_TM_train.shape, input_data_TM_test.shape

((1689, 9914), (724, 9914))

In [372]:
# These data sets are sparse matrices. toarray() converts them to a dense array so that the values can be displayed
input_data_TM_test.toarray()

array([[0.        , 0.05710745, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.05301415, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [373]:
tfidf_vect.vocabulary_

{'candidates': 1507,
 'permanent': 6504,
 'computer': 2036,
 'systems': 8779,
 'manager': 5450,
 'title': 8994,
 'provide': 7045,
 'proof': 6999,
 'successful': 8652,
 'registration': 7408,
 'october': 6080,
 '2018': 112,
 'open': 6155,
 'competitive': 1977,
 'promotional': 6991,
 'exam': 3569,
 'apply': 816,
 'failure': 3757,
 'result': 7678,
 'disqualification': 2944,
 'department': 2691,
 'design': 2734,
 'construction': 2165,
 'division': 2984,
 'public': 7080,
 'buildings': 1420,
 'seeks': 8020,
 'director': 2858,
 'data': 2518,
 'analytics': 749,
 'team': 8859,
 'responsible': 7661,
 'providing': 7051,
 'descriptive': 2733,
 'diagnostic': 2814,
 'predictive': 6799,
 'insights': 4766,
 'based': 1143,
 'agency': 633,
 'external': 3711,
 'sources': 8321,
 'dataset': 2524,
 'includes': 4631,
 'basic': 1148,
 'project': 6977,
 'management': 5448,
 'schedule': 7923,
 'budget': 1409,
 'internal': 4862,
 'information': 4713,
 'sensors': 8051,
 'including': 4632,
 'uncensored': 9246,
 'le

## Latent Semantic Analysis (Singular Value Decomposition)


In [374]:
from sklearn.decomposition import TruncatedSVD

In [375]:
#If you are performing Latent Semantic Analysis, recommended number of components is 100

svd = TruncatedSVD(n_components=100, n_iter=10)

In [376]:
#Fit_transform on train set
train_data_TM_lsa = svd.fit_transform(input_data_TM_train)

In [377]:
train_data_TM_lsa.shape

(1689, 100)
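How much of the TF-IDF variance the chosen number of components retains can be checked with `explained_variance_ratio_`. The sketch below uses a tiny made-up corpus and 2 components purely for illustration; the corpus and the names `docs`, `toy_svd`, and `toy_lsa` are assumptions, not part of the assignment data.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for the job descriptions
docs = [
    "manage the data analytics team",
    "design public buildings and construction projects",
    "provide budget and schedule management",
    "analyze data from internal and external sources",
    "supervise construction budget and project schedule",
]

toy_tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Keep the number of components well below the vocabulary size
toy_svd = TruncatedSVD(n_components=2, n_iter=10, random_state=42)
toy_lsa = toy_svd.fit_transform(toy_tfidf)

# Share of the total variance captured by the retained components
retained = toy_svd.explained_variance_ratio_.sum()
print(f"Variance retained by {toy_svd.n_components} components: {retained:.2%}")
```

On the real data the same attribute is available on the fitted `svd` object, so the choice of 100 components can be checked rather than taken on trust.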

In [378]:
train_data_TM_lsa

array([[ 0.22541013, -0.18772174, -0.1234057 , ...,  0.01584198,
        -0.02776014,  0.01248863],
       [ 0.14646783, -0.09181437, -0.03610346, ..., -0.10436983,
         0.02390888, -0.04682402],
       [ 0.1838085 , -0.10639539, -0.06984518, ...,  0.00791176,
         0.06297532, -0.02738126],
       ...,
       [ 0.25626826,  0.00734499, -0.03820685, ..., -0.0314851 ,
         0.01051491,  0.04303647],
       [ 0.15566608, -0.0875817 , -0.04037813, ..., -0.05153492,
        -0.00112513,  0.0654764 ],
       [ 0.11782855, -0.06243145, -0.04656153, ...,  0.01712884,
        -0.00624332,  0.00788095]])

In [379]:
#transform on test set
test_data_TM_lsa = svd.transform(input_data_TM_test)

In [380]:
test_data_TM_lsa.shape

(724, 100)

In [381]:
test_data_TM_lsa

array([[ 0.22442943, -0.15304556, -0.07083585, ..., -0.01055468,
         0.06329797, -0.00842715],
       [ 0.18129172, -0.14482415, -0.05639426, ..., -0.0240798 ,
        -0.00436142, -0.09301763],
       [ 0.25478046, -0.19879586, -0.07065608, ...,  0.02435736,
         0.01206157,  0.02165181],
       ...,
       [ 0.27755586, -0.18287792,  0.03599143, ..., -0.04275712,
        -0.01386028,  0.02087426],
       [ 0.14498797, -0.11293755, -0.07782072, ...,  0.03625459,
         0.00491215,  0.0390706 ],
       [ 0.26692051, -0.08935042,  0.44355519, ...,  0.03461096,
        -0.00869083,  0.01366206]])

In [382]:
#Dropping the 'Job Description' column since it is no longer required
#(reassigning instead of using inplace=True avoids pandas' SettingWithCopyWarning)
train_set = train_set.drop(['Job Description'], axis=1)
test_set = test_set.drop(['Job Description'], axis=1)


In [383]:
# Let's separate the features and the target for the train and test data sets
train_target = train_set['Salary']
test_target = test_set['Salary']

In [384]:
train_data = train_set.drop(['Salary'], axis=1)
test_data = test_set.drop(['Salary'], axis=1)

## Feature Engineering (1 point)

Create one NEW feature from existing data. You can either transform a single variable or create a new variable from existing ones.

Grading: 
- 0.5 points for creating the new feature correctly
- 0.5 points for the justification of the new feature (i.e., why did you create this new feature)

In [385]:
# We encode the Travel column as an ordinal variable:
# 0 for "0" days of travel, 1 for "1-5", 2 for "5-10", and 3 for "10-15".
# An ordinal code keeps the natural order of the travel bands (more travel, larger number).
def new_col(df):
    """Return a one-column DataFrame with Travel mapped to ordinal codes."""
    travel_order = {"0": 0, "1-5": 1, "5-10": 2, "10-15": 3}
    df1 = df.copy()
    # Values outside the four known bands become NaN instead of shifting the rows
    df1['Travel'] = df1['Travel'].map(travel_order)
    return df1[['Travel']]
    

In [386]:
new_col(train_data)

Unnamed: 0,Travel
429,0
1185,3
2116,0
2127,0
458,0
...,...
1638,0
1095,0
1130,2
1294,0
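
An alternative to a hand-written mapping is scikit-learn's `OrdinalEncoder` with the category order spelled out. This is a sketch on a made-up frame (`toy` is hypothetical), not the method used in this notebook, and it assumes the four travel bands are the only values in the column.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample of the Travel column
toy = pd.DataFrame({"Travel": ["0", "10-15", "5-10", "1-5", "0"]})

# Listing the categories explicitly fixes the order 0 < 1-5 < 5-10 < 10-15
travel_encoder = OrdinalEncoder(categories=[["0", "1-5", "5-10", "10-15"]])
travel_codes = travel_encoder.fit_transform(toy[["Travel"]])

print(travel_codes.ravel())  # [0. 3. 2. 1. 0.]
```

Because the encoder is a regular transformer, it could replace the `FunctionTransformer` step inside a pipeline without a custom function.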


## Data Preparation for Pipelining

In [387]:
# Identify the numerical columns
numeric_columns = ['Min_years_exp','Technical','Comm']

# Identify the categorical columns
categorical_columns =  ['Location','Travel']

In [388]:
numeric_columns

['Min_years_exp', 'Technical', 'Comm']

In [389]:
categorical_columns

['Location', 'Travel']

In [390]:
feat_eng_columns = ['Travel']

### PIPELINE

In [391]:
numeric_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])

In [392]:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [393]:
my_new_column = Pipeline(steps=[('my_new_column', FunctionTransformer(new_col)),
                               ('scaler', StandardScaler())])

In [394]:
preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns),
        ('trans', my_new_column, feat_eng_columns)],
        remainder='passthrough')


In [395]:
#Fit_transforming our train set
train_processed = preprocessor.fit_transform(train_data)

train_processed

array([[-1.12955755,  0.59306224,  0.9780514 , ...,  0.        ,
         0.        , -0.45335186],
       [ 1.0979062 ,  0.59306224, -1.25579419, ...,  1.        ,
         0.        ,  2.99062696],
       [-1.12955755,  0.59306224,  2.09497419, ...,  0.        ,
         0.        , -0.45335186],
       ...,
       [-1.12955755,  0.59306224, -1.25579419, ...,  0.        ,
         1.        ,  1.84263402],
       [ 0.54104026, -0.23136749, -0.1388714 , ...,  0.        ,
         0.        , -0.45335186],
       [-1.12955755, -0.23136749, -0.1388714 , ...,  0.        ,
         0.        , -0.45335186]])

In [396]:
#Transforming our Test set
test_processed = preprocessor.transform(test_data)

In [397]:
test_processed

array([[-1.12955755, -0.23136749, -0.1388714 , ...,  0.        ,
         0.        , -0.45335186],
       [ 0.54104026,  0.59306224, -2.37271699, ...,  0.        ,
         1.        ,  1.84263402],
       [ 0.54104026,  0.59306224, -1.25579419, ...,  0.        ,
         0.        , -0.45335186],
       ...,
       [-1.12955755, -0.23136749,  0.9780514 , ...,  0.        ,
         0.        , -0.45335186],
       [ 0.54104026,  2.24192171, -1.25579419, ...,  0.        ,
         0.        ,  0.69464108],
       [ 1.0979062 , -1.05579722,  0.9780514 , ...,  0.        ,
         0.        , -0.45335186]])

In [398]:
test_processed.shape, train_processed.shape

((724, 13), (1689, 13))

### Concatenating our 2 data sets

In [399]:
final_train_x = np.concatenate((train_processed, train_data_TM_lsa), axis=1)
final_test_x = np.concatenate((test_processed, test_data_TM_lsa), axis=1)

In [400]:
final_train_x.shape, final_test_x.shape

((1689, 113), (724, 113))
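
The manual steps above (TF-IDF, SVD, tabular preprocessing, concatenation) could also be folded into a single `ColumnTransformer`, so that one `fit_transform`/`transform` pair handles everything. The following is a minimal sketch on made-up rows, not the notebook's actual pipeline; the frame `toy`, the 2-component SVD, and the step names are assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical rows mimicking some of the assignment's columns
toy = pd.DataFrame({
    "Job Description": [
        "manage the data analytics team",
        "design public buildings and construction projects",
        "provide budget and schedule management",
        "analyze data from internal and external sources",
    ],
    "Location": ["Remote", "East campus", "Remote", "Southeast campus"],
    "Min_years_exp": [5, 1, 3, 2],
})

# Text branch: TF-IDF followed by a low-rank projection (LSA)
text_pipeline = Pipeline(steps=[
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("svd", TruncatedSVD(n_components=2, random_state=42)),
])

combined = ColumnTransformer([
    # A plain string (not a list) passes the column as 1-D text,
    # which is what TfidfVectorizer expects
    ("text", text_pipeline, "Job Description"),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Location"]),
    ("num", StandardScaler(), ["Min_years_exp"]),
])

toy_features = combined.fit_transform(toy)
print(toy_features.shape)  # (4, 6): 2 SVD + 3 one-hot + 1 numeric
```

With this layout the text column stays in the frame, and the test set is transformed with a single `combined.transform(...)` call.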

## Find the Baseline (1 point)

In [401]:
from sklearn.metrics import mean_squared_error

In [402]:
#First find the average value of the target

mean_value = np.mean(train_target)

mean_value

78566.0307874482

In [403]:
# Predict all values as the mean

baseline_pred = np.repeat(mean_value, len(test_target))

baseline_pred

array([78566.03078745, 78566.03078745, 78566.03078745, 78566.03078745,
       78566.03078745, 78566.03078745, 78566.03078745, 78566.03078745,
       78566.03078745, 78566.03078745, 78566.03078745, 78566.03078745,
       78566.03078745, 78566.03078745, 78566.03078745, 78566.03078745,
       78566.03078745, 78566.03078745, 78566.03078745, 78566.03078745,
       78566.03078745, 78566.03078745, 78566.03078745, 78566.03078745,
       78566.03078745, 78566.03078745, 78566.03078745, 78566.03078745,
       78566.03078745, 78566.03078745, 78566.03078745, 78566.03078745,
       78566.03078745, 78566.03078745, 78566.03078745, 78566.03078745,
       78566.03078745, 78566.03078745, 78566.03078745, 78566.03078745,
       78566.03078745, 78566.03078745, 78566.03078745, 78566.03078745,
       78566.03078745, 78566.03078745, 78566.03078745, 78566.03078745,
       78566.03078745, 78566.03078745, 78566.03078745, 78566.03078745,
       78566.03078745, 78566.03078745, 78566.03078745, 78566.03078745,
      

In [404]:
baseline_mse = mean_squared_error(test_target, baseline_pred)

baseline_rmse = np.sqrt(baseline_mse)

print('Baseline RMSE: {}' .format(baseline_rmse))

Baseline RMSE: 28294.892856870818


#### As shown above, the baseline RMSE is approximately 28,300 dollars: predicting the mean training salary for every posting, with no modeling at all, gives a typical error of about 28,300 dollars. Our models should predict salary with a lower error than this.
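
The same mean-prediction baseline can be obtained with scikit-learn's `DummyRegressor`, which fits into the same `fit`/`predict` workflow as the real models. A small sketch with made-up salaries (`toy_train_y` and `toy_test_y` are hypothetical stand-ins for `train_target` and `test_target`):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical salaries standing in for train_target / test_target
toy_train_y = np.array([50000.0, 70000.0, 90000.0, 110000.0])
toy_test_y = np.array([60000.0, 100000.0])

# DummyRegressor ignores the features, so placeholder columns are enough
dummy = DummyRegressor(strategy="mean")
dummy.fit(np.zeros((len(toy_train_y), 1)), toy_train_y)
dummy_pred = dummy.predict(np.zeros((len(toy_test_y), 1)))

dummy_rmse = np.sqrt(mean_squared_error(toy_test_y, dummy_pred))
print(f"Baseline RMSE: {dummy_rmse:.2f}")  # Baseline RMSE: 20000.00
```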

# Section 2: (7 points in total)

Build the following models:


## Decision Tree: (1 point)

In [405]:
#Importing and training DT model
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(min_samples_leaf =10 ) 
tree_reg.fit(final_train_x, train_target)

DecisionTreeRegressor(min_samples_leaf=10)

In [406]:
#Train RMSE
train_pred = tree_reg.predict(final_train_x)

train_mse = mean_squared_error(train_target, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 15021.25338382824


In [407]:
#Test RMSE
test_pred = tree_reg.predict(final_test_x)

test_mse = mean_squared_error(test_target, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 23675.50530517647


#### The decision tree performs better than the baseline: its test RMSE is approximately 23,700 dollars, versus 28,300 for the baseline. There are also signs of overfitting, since the train RMSE (approximately 15,000 dollars) is well below the test RMSE.
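
The train and test RMSE cells are repeated for every model in this section, so a small helper can cut the duplication. A sketch, assuming a hypothetical helper name `report_rmse` and synthetic data from `make_regression` in place of `final_train_x` and `final_test_x`:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def report_rmse(model, X_train, y_train, X_test, y_test):
    """Return (train_rmse, test_rmse) for an already fitted model."""
    train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
    test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    return train_rmse, test_rmse


# Synthetic stand-in data (not the salary data)
X, y = make_regression(n_samples=300, n_features=10, noise=20.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

toy_tree = DecisionTreeRegressor(min_samples_leaf=10, random_state=42).fit(X_tr, y_tr)
toy_train_rmse, toy_test_rmse = report_rmse(toy_tree, X_tr, y_tr, X_te, y_te)

# A test RMSE well above the train RMSE is the usual sign of overfitting
print(f"Train RMSE: {toy_train_rmse:.2f}  Test RMSE: {toy_test_rmse:.2f}")
```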

## Voting regressor (2 points):

The voting regressor should have at least 3 individual models

In [408]:
#Importing and training Voting Regressor model
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import SGDRegressor 
from sklearn.svm import SVR 
from sklearn.ensemble import VotingRegressor


dtree_reg = DecisionTreeRegressor(max_depth=20)
svm_reg = SVR(kernel="rbf", C=10, epsilon=0.01, gamma='scale') 
sgd_reg = SGDRegressor(max_iter=10000, tol=1e-3)

voting_reg = VotingRegressor(
            estimators=[('dt', dtree_reg), 
                        ('svr', svm_reg), 
                        ('sgd', sgd_reg)])

voting_reg.fit(final_train_x, train_target)

VotingRegressor(estimators=[('dt', DecisionTreeRegressor(max_depth=20)),
                            ('svr', SVR(C=10, epsilon=0.01)),
                            ('sgd', SGDRegressor(max_iter=10000))])

In [409]:
#Train RMSE
train_pred = voting_reg.predict(final_train_x)

train_mse = mean_squared_error(train_target, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 16137.952509743462


In [410]:
#Test RMSE
test_pred = voting_reg.predict(final_test_x)

test_mse = mean_squared_error(test_target, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 20540.15942330933


#### The voting regressor performs better than the single decision tree: its test RMSE is approximately 20,500 dollars, compared with 23,700 for the decision tree.
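
With no weights given, a `VotingRegressor` predicts the plain average of its members' predictions. The sketch below checks this on synthetic data; `LinearRegression` and `KNeighborsRegressor` are swapped in for the SVR and SGD members only to keep the example fast, and all names prefixed with `toy_` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the salary data)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

toy_voting = VotingRegressor(estimators=[
    ("dt", DecisionTreeRegressor(max_depth=5, random_state=42)),
    ("lr", LinearRegression()),
    ("knn", KNeighborsRegressor(n_neighbors=5)),
])
toy_voting.fit(X, y)

# estimators_ holds the fitted copies of the three members
member_preds = np.array([est.predict(X) for est in toy_voting.estimators_])
manual_average = member_preds.mean(axis=0)

print(np.allclose(manual_average, toy_voting.predict(X)))  # True
```

Averaging is why one weak member can pull the ensemble down, so it is worth checking each member's test RMSE individually as well.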

## A Boosting model: (1 point)

Build either an Adaboost or a GradientBoost model

In [411]:
#Importing and training AdaBoost model
from sklearn.ensemble import AdaBoostRegressor 

#Create Adaptive Boosting with Decision Stumps (depth=1)
ada_reg = AdaBoostRegressor( 
            DecisionTreeRegressor(max_depth=1), n_estimators=500, 
            learning_rate=0.1) 

ada_reg.fit(final_train_x, train_target)

AdaBoostRegressor(base_estimator=DecisionTreeRegressor(max_depth=1),
                  learning_rate=0.1, n_estimators=500)

In [412]:
#Train RMSE
train_pred = ada_reg.predict(final_train_x)

train_mse = mean_squared_error(train_target, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 28614.35461017554


In [413]:
#Test RMSE
test_pred = ada_reg.predict(final_test_x)
test_mse = mean_squared_error(test_target, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 28960.309820648974


#### The RMSE results show that the AdaBoost model does not outperform the previous two models. Its test RMSE (approximately 29,000 dollars) is about the same as the baseline (28,300 dollars), and in fact slightly higher.
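
One way to see whether the number of boosting stages is part of the problem is `staged_predict`, which yields the ensemble's prediction after each stage. This is a sketch on synthetic data with only 50 stumps; the data, the settings, and the `toy_` names are assumptions and say nothing about how AdaBoost behaves on the salary data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the salary data)
X, y = make_regression(n_samples=300, n_features=10, noise=20.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

toy_ada = AdaBoostRegressor(
    DecisionTreeRegressor(max_depth=1),
    n_estimators=50, learning_rate=0.1, random_state=42)
toy_ada.fit(X_tr, y_tr)

# One test RMSE per boosting stage
stage_rmse = [np.sqrt(mean_squared_error(y_te, pred))
              for pred in toy_ada.staged_predict(X_te)]

best_stage = int(np.argmin(stage_rmse)) + 1
print(f"Lowest test RMSE after {best_stage} of {len(stage_rmse)} stages")
```

If the curve is flat from the first stage on, the base learner (a depth-1 stump) is a more likely bottleneck than the number of estimators.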

## Neural network: (1 point)

In [414]:
#Importing and training neural network
from sklearn.neural_network import MLPRegressor

#hidden_layer_sizes=(1,) creates 1 hidden layer with a single neuron (the default setting is 1 hidden layer with 100 neurons)
mlp_reg = MLPRegressor(hidden_layer_sizes=(1,))

mlp_reg.fit(final_train_x, train_target)



MLPRegressor(hidden_layer_sizes=(1,))

In [415]:
#Train RMSE
train_pred = mlp_reg.predict(final_train_x)

train_mse = mean_squared_error(train_target, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 83921.48340097268


In [416]:
#Test RMSE
test_pred = mlp_reg.predict(final_test_x)

test_mse = mean_squared_error(test_target, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 81650.80594651504


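#### The neural network is the worst model for this data: its test RMSE is approximately 81,700 dollars, nearly three times the baseline. Changing the hidden layer size did not change the outcome. An RMSE this close to the mean salary (about 78,600 dollars) suggests that the predictions are still near zero, which can happen when the target is left unscaled and training stops before convergence.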
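
A common remedy when a neural network regressor is trained on a large, unscaled target is to standardize the target during training and convert predictions back afterwards, which `TransformedTargetRegressor` does automatically. The sketch below compares the two setups on synthetic salaries; the data, the layer size of 50, and `max_iter=500` are assumptions, and the effect has not been verified on the assignment data.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import mean_squared_error
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic salaries on the same order of magnitude as the assignment's target
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
y = 80000 + 20000 * X[:, 0] + rng.normal(scale=2000, size=300)

# Same architecture and training settings, with and without target scaling
raw_mlp = MLPRegressor(hidden_layer_sizes=(50,), max_iter=500, random_state=42)
scaled_mlp = TransformedTargetRegressor(
    regressor=MLPRegressor(hidden_layer_sizes=(50,), max_iter=500, random_state=42),
    transformer=StandardScaler())

raw_mlp.fit(X, y)      # may emit a ConvergenceWarning
scaled_mlp.fit(X, y)   # predictions are mapped back to dollars automatically

raw_rmse = np.sqrt(mean_squared_error(y, raw_mlp.predict(X)))
scaled_rmse = np.sqrt(mean_squared_error(y, scaled_mlp.predict(X)))
print(f"Raw target RMSE: {raw_rmse:.0f}  Scaled target RMSE: {scaled_rmse:.0f}")
```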

## Grid search (2 points)

Perform either a full or randomized grid search on any model you want. There has to be at least two parameters for the search. 

In [417]:
#Importing and training Grid Search
from sklearn.model_selection import RandomizedSearchCV

param_grid = [
    {'min_samples_leaf': np.arange(10, 30), 
     'max_depth': np.arange(10,30)}
  ]

tree_reg = DecisionTreeRegressor()

grid_search = RandomizedSearchCV(tree_reg, param_grid, cv=5, n_iter=10,
                           scoring='neg_mean_squared_error', verbose=1,
                           return_train_score=True)

grid_search.fit(final_train_x, train_target)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


RandomizedSearchCV(cv=5, estimator=DecisionTreeRegressor(),
                   param_distributions=[{'max_depth': array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
       27, 28, 29]),
                                         'min_samples_leaf': array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
       27, 28, 29])}],
                   return_train_score=True, scoring='neg_mean_squared_error',
                   verbose=1)

In [418]:
cvres = grid_search.cv_results_

for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

27035.562596480555 {'min_samples_leaf': 18, 'max_depth': 14}
26921.236325977774 {'min_samples_leaf': 20, 'max_depth': 28}
26348.07186023829 {'min_samples_leaf': 23, 'max_depth': 24}
26452.417115478584 {'min_samples_leaf': 25, 'max_depth': 28}
26804.81785538773 {'min_samples_leaf': 11, 'max_depth': 21}
26815.48598807801 {'min_samples_leaf': 12, 'max_depth': 13}
26832.10722491942 {'min_samples_leaf': 12, 'max_depth': 19}
26794.596337478477 {'min_samples_leaf': 11, 'max_depth': 13}
26359.996304308013 {'min_samples_leaf': 24, 'max_depth': 17}
27052.84788953683 {'min_samples_leaf': 16, 'max_depth': 13}


In [419]:
grid_search.best_params_

{'min_samples_leaf': 23, 'max_depth': 24}

In [420]:
grid_search.best_estimator_

DecisionTreeRegressor(max_depth=24, min_samples_leaf=23)

In [421]:
#Train RMSE
train_pred = grid_search.best_estimator_.predict(final_train_x)

train_mse = mean_squared_error(train_target, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 20490.32891641276


In [422]:
#Test RMSE
test_pred = grid_search.best_estimator_.predict(final_test_x)

test_mse = mean_squared_error(test_target, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 24767.860706296036


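#### Even though the tuned decision tree from the randomized search (test RMSE of approximately 24,800 dollars) beats the baseline, AdaBoost, and the neural network, it does not beat the voting regressor (approximately 20,500 dollars), and it is also slightly worse on the test set than the untuned decision tree (approximately 23,700 dollars).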
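
Two small adjustments can make a randomized search easier to read and repeat: a fixed `random_state`, so the same candidates are sampled on every run, and the `neg_root_mean_squared_error` scorer, so no manual square root is needed. A sketch on synthetic data (the data and the `toy_search` name are assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the salary data)
X, y = make_regression(n_samples=300, n_features=10, noise=20.0, random_state=42)

param_distributions = {
    "min_samples_leaf": np.arange(10, 30),
    "max_depth": np.arange(10, 30),
}

toy_search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_distributions,
    n_iter=10, cv=5,
    scoring="neg_root_mean_squared_error",  # scores are negated RMSE values
    random_state=42)                        # makes the sampled candidates repeatable
toy_search.fit(X, y)

# Print the three best candidates, lowest cross-validated RMSE first
order = np.argsort(toy_search.cv_results_["rank_test_score"])
for idx in order[:3]:
    print(-toy_search.cv_results_["mean_test_score"][idx],
          toy_search.cv_results_["params"][idx])
```

Note that this scorer averages the per-fold RMSE, which can differ slightly from taking the square root of the averaged MSE.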

# Discussion (5 points in total)


## List the train and test values of each model you built (2 points)
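
One way to lay out these values is a small pandas table. The sketch below copies the RMSE figures from the cell outputs in Sections 1 and 2 (rounded to two decimals); the frame name `results` and the extra gap column are assumptions added for illustration.

```python
import pandas as pd

# RMSE values copied from the cell outputs above (rounded to 2 decimals).
# The baseline has no train RMSE in this notebook, hence NaN.
results = pd.DataFrame(
    [
        ("Baseline (mean)", float("nan"), 28294.89),
        ("Decision tree", 15021.25, 23675.51),
        ("Voting regressor", 16137.95, 20540.16),
        ("AdaBoost", 28614.35, 28960.31),
        ("Neural network", 83921.48, 81650.81),
        ("Tuned decision tree", 20490.33, 24767.86),
    ],
    columns=["Model", "Train RMSE", "Test RMSE"],
).set_index("Model")

# Gap between test and train error; a large positive gap hints at overfitting
results["Test - Train"] = results["Test RMSE"] - results["Train RMSE"]

print(results.sort_values("Test RMSE"))
best_model = results["Test RMSE"].idxmin()
print("Lowest test RMSE:", best_model)  # Lowest test RMSE: Voting regressor
```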

## Which model performs the best and why? (0.5 points) 
## How does it compare to baseline? (0.5 points)

Hint: The best model is the one that has the best TEST score (for an error metric such as RMSE, that means the lowest test value), regardless of any of the training values. If you select your model based on TRAIN values, you will lose points.

## Is there any evidence of overfitting in the best model, why or why not? If there is, what did you do about it? (1 point)

## Is there any overfitting in the other models (besides the best model), why or why not? If there is, what did you do about it? (1 point)