# Salary Prediction


This analysis focuses on salary prediction. The data set contains job descriptions and salaries, and the goal is to predict the salary of a job posting (i.e., the `Salary` column in the data set) from its job description and related attributes. Such a model is useful because it can recommend a salary as soon as a job description is entered into a system.

## Description of Variables

The variables are described in "Jobs - Data Dictionary.docx".

## Goal

Use the **jobs_alldata.csv** data set and build models to predict **salary**.

**Be careful: this is a REGRESSION task**

# Section 1: 

## Data Prep 

In [108]:
# Common imports
import numpy as np
import pandas as pd

np.random.seed(42)

In [109]:
jobs = pd.read_csv("jobs_alldata.csv")
jobs.head()

Unnamed: 0,Salary,Job Description,Location,Min_years_exp,Technical,Comm,Travel
0,67206,Civil Service Title: Regional Director Mental ...,Remote,5,2,3,0
1,88313,The New York City Comptroller’s Office Burea...,Remote,5,2,4,10-15
2,81315,With minimal supervision from the Deputy Commi...,East campus,5,3,3,5-10
3,76426,OPEN TO CURRENT BUSINESS PROMOTION COORDINATOR...,East campus,1,1,3,0
4,55675,Only candidates who are permanent in the Princ...,Southeast campus,1,1,3,5-10


In [110]:
jobs.shape

(2413, 7)

In [111]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(jobs, test_size=0.3)
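The split above relies on the global NumPy seed set earlier. A minimal sketch, assuming a small hypothetical frame in place of the jobs data, of how passing `random_state` directly makes the split reproducible no matter what other random calls run before it:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the jobs data
toy = pd.DataFrame({"Salary": np.arange(10) * 1000,
                    "Min_years_exp": np.arange(10)})

# Passing random_state fixes the shuffle independently of np.random.seed
train_a, test_a = train_test_split(toy, test_size=0.3, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.3, random_state=42)

# Both calls select exactly the same rows
print(train_a.index.equals(train_b.index))
print(len(train_a), len(test_a))
```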

In [112]:
train_set.isna().sum()

Salary             0
Job Description    0
Location           0
Min_years_exp      0
Technical          0
Comm               0
Travel             0
dtype: int64

In [113]:
test_set.isna().sum()

Salary             0
Job Description    0
Location           0
Min_years_exp      0
Technical          0
Comm               0
Travel             0
dtype: int64

In [114]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

from sklearn.preprocessing import FunctionTransformer

In [115]:
train_y = train_set[['Salary']]
test_y = test_set[['Salary']]

train_inputs = train_set.drop(['Salary'], axis=1)
test_inputs = test_set.drop(['Salary'], axis=1)

In [116]:
train_inputs.dtypes

Job Description    object
Location           object
Min_years_exp       int64
Technical           int64
Comm                int64
Travel             object
dtype: object

In [117]:
# Identify the numerical columns
numeric_columns = train_inputs.select_dtypes(include=[np.number]).columns.to_list()

# Identify the categorical columns
categorical_columns = train_inputs.select_dtypes('object').columns.to_list()

In [118]:
numeric_columns

['Min_years_exp', 'Technical', 'Comm']

In [119]:
categorical_columns

['Job Description', 'Location', 'Travel']

In [120]:
# Not passing the text column through the pipeline
categorical_columns.remove('Job Description')

In [121]:
categorical_columns

['Location', 'Travel']

## Feature Engineering 

Create one NEW feature from existing data. You can either transform a single variable or create a new variable from existing ones.


We created a new binary feature called `travel_required`. Its value is 0 if the position requires no travel and 1 if it requires any travel. The feature is meant to capture the impact of travel on salary, since positions that require travel may come with travel allowances.

In [122]:
def new_col(df):
    #Create a copy so that we don't overwrite the existing dataframe
    df1 = df.copy()
    
    df1['travel_required'] = np.where(df1['Travel'] == '0', 0, 1)
    
    return df1[['travel_required']]
    # You can use this to check whether the calculation is made correctly:
    #return df1

In [123]:
#Let's test the new function:

# Send train set to the function we created
new_col(train_set)

Unnamed: 0,travel_required
429,0
1185,1
2116,0
2127,0
458,0
...,...
1638,0
1095,0
1130,1
1294,0


In [124]:
feat_eng_columns = ['Travel']

## Pipeline
#### Data Preparation Continued.

In [125]:
numeric_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])

In [126]:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [127]:
my_new_column = Pipeline(steps=[('my_new_column', FunctionTransformer(new_col))])
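To make the wrapped step easier to inspect, here is a minimal sketch of the same idea on a three-row toy frame. The function name `travel_flag` and the toy data are assumptions for illustration; the logic mirrors `new_col` above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def travel_flag(df):
    """Return a one-column frame: 0 when Travel is '0', otherwise 1."""
    out = df.copy()  # avoid modifying the caller's frame
    out["travel_required"] = np.where(out["Travel"] == "0", 0, 1)
    return out[["travel_required"]]

# Hypothetical toy input with the same string coding as the Travel column
toy = pd.DataFrame({"Travel": ["0", "5-10", "10-15"]})

# FunctionTransformer turns the plain function into a pipeline-compatible step
flagger = FunctionTransformer(travel_flag)
flags = flagger.fit_transform(toy)
print(flags)
```

Note that `Travel` is also listed in `categorical_columns`, so the one-hot encoder already produces an indicator for the `0` category; the binary flag may therefore carry little extra information for the models.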

In [128]:
preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns),
        ('trans', my_new_column, feat_eng_columns)],
        remainder='passthrough')

# remainder='passthrough' is optional; it keeps any columns not listed above unchanged.

In [129]:
new_train_inputs=train_inputs.drop('Job Description', axis=1)

In [130]:
new_test_inputs = test_inputs.drop('Job Description', axis=1)

In [131]:
#Fit and transform the train data for all columns excluding the Job Description

train_x = preprocessor.fit_transform(new_train_inputs)
train_x

array([[-1.12955755,  0.59306224,  0.9780514 , ...,  0.        ,
         0.        ,  0.        ],
       [ 1.0979062 ,  0.59306224, -1.25579419, ...,  1.        ,
         0.        ,  1.        ],
       [-1.12955755,  0.59306224,  2.09497419, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [-1.12955755,  0.59306224, -1.25579419, ...,  0.        ,
         1.        ,  1.        ],
       [ 0.54104026, -0.23136749, -0.1388714 , ...,  0.        ,
         0.        ,  0.        ],
       [-1.12955755, -0.23136749, -0.1388714 , ...,  0.        ,
         0.        ,  0.        ]])

In [132]:
train_x.shape

(1689, 13)

In [133]:
# Transform the test data for all columns excluding the Job Description
test_x = preprocessor.transform(new_test_inputs)

test_x

array([[-1.12955755, -0.23136749, -0.1388714 , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.54104026,  0.59306224, -2.37271699, ...,  0.        ,
         1.        ,  1.        ],
       [ 0.54104026,  0.59306224, -1.25579419, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [-1.12955755, -0.23136749,  0.9780514 , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.54104026,  2.24192171, -1.25579419, ...,  0.        ,
         0.        ,  1.        ],
       [ 1.0979062 , -1.05579722,  0.9780514 , ...,  0.        ,
         0.        ,  0.        ]])

In [134]:
test_x.shape

(724, 13)

## Performing text mining and creating SVDs on the column "Job Description"

In [135]:
import nltk
from nltk.corpus import stopwords
import re

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to C:\Users\Ravi
[nltk_data]     Teja\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\Ravi
[nltk_data]     Teja\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to C:\Users\Ravi
[nltk_data]     Teja\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [136]:
train_set_text = train_inputs['Job Description']

In [137]:
test_set_text = test_inputs['Job Description']

In [138]:
test_set_text

765     The New York City Housing Authority (NYCHA) is...
2387    Hiring Rate:  $62,272.00  (Flat Rate-Annual)  ...
2162    The Executive Director for Regulatory Reform w...
1833    The NYC Department of Environmental Protection...
1814    The Department of Transportation’s (DOT) mis...
                              ...                        
2333    The Family Independence Administration/ Office...
998     In order to be considered for this position ca...
891     In accordance to Local Law 196 established in ...
1866    About New York City Cyber Command NYC Cyber Co...
1731    Only candidates who are permanent in the Civil...
Name: Job Description, Length: 724, dtype: object

In [139]:
#Create a blank list

new_train = []


# For each row in train_set, we will read the text, tokenize it, remove stopwords, lemmatize it, 
# and save it to the new list

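# Build the stop word set and the lemmatizer once, outside the loop
stop_words = set(stopwords.words('english'))
lemmatizer = nltk.stem.WordNetLemmatizer()

for text in train_set_text:
    # Replace punctuation with spaces and lowercase the text
    text = re.sub(r'[!"#$%&()*+,-./:;<=>?[\]^_`{|}~]', ' ', text).lower()

    words = nltk.tokenize.word_tokenize(text)
    # Keep alphabetic tokens longer than two characters that are not stop words
    words = [w for w in words if w.isalpha() and len(w) > 2 and w not in stop_words]

    words = [lemmatizer.lemmatize(w) for w in words]
    new_train.append(' '.join(words))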

In [140]:
new_train

['candidate permanent computer system manager title provide proof successful registration october open competitive promotional exam may apply failure result disqualification department design construction division public building seek director data analytics data analytics team responsible providing descriptive diagnostic predictive data insight based data across agency external source dataset includes basic project management data schedule budget well external internal data information system sensor including uncensored data director data analytics leverage business acumen knowledge data science modern technical tool develop data informed strategy action related strengthening agency business performance related delivery quality capital project time within budget safe manner director aggregate diagnose data identified trend risk performance metric also oversee internal external agency reporting ensure effective data flow external dashboard capital project dashboard open data portal dir

In [141]:
# Copy the original train_set so the cleaned text can be added as a new column

train_text_df = train_set.copy()

train_text_df['new_text'] = new_train

train_text_df

Unnamed: 0,Salary,Job Description,Location,Min_years_exp,Technical,Comm,Travel,new_text
429,63407,Only candidates who are permanent in the Compu...,HQ,1,3,4,0,candidate permanent computer system manager ti...
1185,111372,NYCERS is seeking a Business Analyst with a te...,HQ,5,3,2,10-15,nycers seeking business analyst technical back...
2116,79054,The NYC Department of Environmental Protection...,HQ,1,3,5,0,nyc department environmental protection dep pr...
2127,67864,Only Candidates permanent in the Assistant Civ...,Southeast campus,5,4,3,0,candidate permanent assistant civil engineer t...
458,88695,Please read this posting carefully to make cer...,Remote,1,1,3,0,please read posting carefully make certain mee...
...,...,...,...,...,...,...,...,...
1638,44366,NYC Civilian Complaint Review Board The Civil...,HQ,2,3,3,0,nyc civilian complaint review board civilian c...
1095,52753,The NYC Department of Environmental Protection...,HQ,1,3,4,0,nyc department environmental protection dep en...
1130,78345,The NYC Office of Payroll Administration is re...,HQ,1,3,2,5-10,nyc office payroll administration recruiting i...
1294,87196,HPDTech is the IT division within HPD. Its mis...,Remote,4,2,3,0,hpdtech division within hpd mission identify a...


In [142]:
# Let's do the same for test data 

new_test = []

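# Build the stop word set and the lemmatizer once, outside the loop
stop_words = set(stopwords.words('english'))
lemmatizer = nltk.stem.WordNetLemmatizer()

for text in test_set_text:
    # Replace punctuation with spaces and lowercase the text
    text = re.sub(r'[!"#$%&()*+,-./:;<=>?[\]^_`{|}~]', ' ', text).lower()

    words = nltk.tokenize.word_tokenize(text)
    # Keep alphabetic tokens longer than two characters that are not stop words
    words = [w for w in words if w.isalpha() and len(w) > 2 and w not in stop_words]

    words = [lemmatizer.lemmatize(w) for w in words]
    new_test.append(' '.join(words))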

In [143]:
# Copy the original test_set so the cleaned text can be added as a new column
test_text_df = test_set.copy()

test_text_df['new_text'] = new_test

test_text_df

Unnamed: 0,Salary,Job Description,Location,Min_years_exp,Technical,Comm,Travel,new_text
765,149752,The New York City Housing Authority (NYCHA) is...,HQ,1,2,3,0,new york city housing authority nycha largest ...
2387,64653,"Hiring Rate: $62,272.00 (Flat Rate-Annual) ...",HQ,4,3,1,5-10,hiring rate flat rate annual mission bureau hi...
2162,69530,The Executive Director for Regulatory Reform w...,East campus,4,3,2,0,executive director regulatory reform assist im...
1833,51932,The NYC Department of Environmental Protection...,East campus,2,3,2,0,nyc department environmental protection dep pr...
1814,60218,The Department of Transportation’s (DOT) mis...,East campus,2,2,2,0,department dot mission provide safe efficient ...
...,...,...,...,...,...,...,...,...
2333,58452,The Family Independence Administration/ Office...,HQ,5,4,1,0,family independence administration office rese...
998,90220,In order to be considered for this position ca...,West campus,5,2,3,0,order considered position candidate must servi...
891,68328,In accordance to Local Law 196 established in ...,Remote,1,2,4,0,accordance local law established late sb devel...
1866,92478,About New York City Cyber Command NYC Cyber Co...,Southeast campus,4,5,2,1-5,new york city cyber command nyc cyber command ...


In [144]:
#TfidfVectorizer includes pre-processing, tokenization, filtering stop words
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(stop_words='english',max_features=101)

train_x_tfidf = tfidf_vect.fit_transform(train_text_df['new_text'])

In [145]:
# Perform the TfidfVectorizer transformation
# Be careful: we use the vectorizer fitted on the train set to transform the test set.
# Otherwise, the test features would be different and would not match the train set.

test_x_tfidf = tfidf_vect.transform(test_text_df['new_text'])

In [146]:
train_x_tfidf.shape, test_x_tfidf.shape

((1689, 101), (724, 101))
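The fit-on-train, transform-on-test pattern above can be sketched on a few toy sentences. The documents below are hypothetical; the point is that the vocabulary is learned from the training text only, so test-only words are ignored and both matrices end up with the same columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents
train_docs = ["data analyst city budget", "civil engineer city project"]
test_docs = ["data engineer with unseen words"]

vect = TfidfVectorizer()
train_mat = vect.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_mat = vect.transform(test_docs)        # reuses them; new words are dropped

# Same number of columns for train and test
print(train_mat.shape, test_mat.shape)
# Words that only occur in the test text are not part of the vocabulary
print("unseen" in vect.vocabulary_)
```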

In [147]:
from sklearn.decomposition import TruncatedSVD

In [148]:
#If you are performing Latent Semantic Analysis, recommended number of components is 100

svd = TruncatedSVD(n_components=100, n_iter=10)

In [149]:
train_x_lsa = svd.fit_transform(train_x_tfidf)

In [150]:
test_x_lsa = svd.transform(test_x_tfidf)

In [151]:
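# Total variance of the SVD components (an absolute value, not a share of the total;
# explained_variance_ratio_ gives the share)
svd.explained_variance_.sum()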

0.7708492237317928
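`explained_variance_` holds the variance of each component in absolute terms, while `explained_variance_ratio_` expresses it as a share of the total variance of the input. A minimal sketch on random data (a hypothetical stand-in for the TF-IDF matrix) showing how the two relate:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.RandomState(0)
X = rng.rand(50, 20)  # hypothetical stand-in for a TF-IDF matrix

svd_demo = TruncatedSVD(n_components=5, n_iter=10, random_state=0)
svd_demo.fit(X)

# Absolute variance captured by the five components
raw = svd_demo.explained_variance_.sum()
# Share of the total input variance captured (between 0 and 1)
share = svd_demo.explained_variance_ratio_.sum()

print(raw, share)
```

The share is usually the more interpretable number when deciding how many components to keep.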

In [152]:
train_x_lsa

array([[ 3.67682851e-01, -1.90896506e-01,  1.14650068e-01, ...,
         4.73456606e-03, -2.34230761e-02,  1.58256928e-04],
       [ 2.84165120e-01, -1.49857137e-01,  6.60315472e-02, ...,
         5.38898773e-03,  3.01154278e-02,  2.44126764e-02],
       [ 6.67244749e-01,  6.12526058e-01,  1.48533886e-01, ...,
        -5.00592291e-02,  1.07749590e-02,  1.85704805e-02],
       ...,
       [ 4.28626098e-01, -1.57392069e-01, -2.73227936e-02, ...,
         1.40852257e-02, -3.29239204e-02,  2.57191103e-03],
       [ 3.73985348e-01, -1.88020187e-01,  7.39386458e-03, ...,
         9.68662321e-03,  1.55949692e-02,  3.26640003e-04],
       [ 3.82069800e-01,  8.50045275e-02,  3.75757633e-01, ...,
        -2.44920926e-03, -1.50552807e-02,  3.61550241e-03]])

In [153]:
test_x_lsa

array([[ 0.58682171, -0.26851759, -0.08228445, ...,  0.02429357,
         0.01271428,  0.01433828],
       [ 0.5135123 , -0.25452156, -0.05854773, ..., -0.00583632,
         0.01124775,  0.0020083 ],
       [ 0.53934404, -0.28640543, -0.11497648, ...,  0.02433296,
         0.07163504, -0.01630139],
       ...,
       [ 0.45716255, -0.14683424,  0.09726571, ..., -0.01859526,
        -0.03006875, -0.00162078],
       [ 0.40308908, -0.12395804, -0.04931821, ..., -0.01210091,
        -0.02216812, -0.0103851 ],
       [ 0.47391276,  0.04512944,  0.51118103, ...,  0.02002575,
        -0.0185134 ,  0.01614024]])

In [154]:
train_x_lsa.shape

(1689, 100)

In [155]:
test_x_lsa.shape

(724, 100)

In [156]:
train_x.shape

(1689, 13)

In [157]:
test_x.shape

(724, 13)

## Combining the pipeline columns, and SVDs

In [158]:
train_x = np.column_stack((train_x, train_x_lsa))

In [159]:
train_x.shape

(1689, 113)

In [160]:
test_x = np.column_stack((test_x, test_x_lsa))

In [161]:
test_x.shape

(724, 113)

## Find the Baseline

In [162]:
from sklearn.metrics import mean_squared_error

In [163]:
#First find the average value of the target

mean_value = np.mean(train_y['Salary'])

mean_value

78566.0307874482

In [164]:
# Predict all values as the mean

baseline_pred = np.repeat(mean_value, len(test_y))

baseline_pred

array([78566.03078745, 78566.03078745, 78566.03078745, ...,
       78566.03078745, 78566.03078745, 78566.03078745])

In [165]:
baseline_mse = mean_squared_error(test_y, baseline_pred)

baseline_rmse = np.sqrt(baseline_mse)

print('Baseline RMSE: {}' .format(baseline_rmse))

Baseline RMSE: 28294.892856870818
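The same mean baseline can also be expressed with scikit-learn's `DummyRegressor`, which keeps the baseline in the same fit/predict interface as the other models. A minimal sketch with made-up salary values (the numbers are assumptions, not taken from the data set):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical targets; the features are ignored by the dummy model
y_train_demo = np.array([50000.0, 70000.0, 90000.0])
y_test_demo = np.array([60000.0, 100000.0])
X_train_demo = np.zeros((3, 1))
X_test_demo = np.zeros((2, 1))

# strategy="mean" always predicts the mean of the training target
dummy = DummyRegressor(strategy="mean").fit(X_train_demo, y_train_demo)
dummy_pred = dummy.predict(X_test_demo)

dummy_rmse = np.sqrt(mean_squared_error(y_test_demo, dummy_pred))
print(dummy_pred, dummy_rmse)
```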


# Section 2:

Build the following models:


## Decision Tree:

### Decision tree with regularization

In [166]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(max_depth=16, min_samples_leaf = 5) 

tree_reg.fit(train_x, train_y)

DecisionTreeRegressor(max_depth=16, min_samples_leaf=5)

In [167]:
#Train RMSE
train_pred = tree_reg.predict(train_x)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 12669.192634025645


In [168]:
#Test RMSE
test_pred = tree_reg.predict(test_x)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 23794.28300872655


## Voting regressor :

The voting regressor should have at least 3 individual models

In [169]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import SGDRegressor 
from sklearn.svm import SVR 
from sklearn.ensemble import VotingRegressor


dtree_reg = DecisionTreeRegressor(max_depth=20)
svm_reg = SVR(kernel="rbf", C=10, epsilon=0.01, gamma='scale') 
sgd_reg = SGDRegressor(max_iter=10000, tol=1e-3)

# Weighted average of the three models; the decision tree gets the largest weight
voting_reg = VotingRegressor(
            estimators=[('dt', dtree_reg), ('svr', svm_reg), ('sgd', sgd_reg)],
                        weights=[0.6, 0.2, 0.2])

# Pass the target as a 1-D Series to avoid the column-vector DataConversionWarning
voting_reg.fit(train_x, train_y['Salary'])


VotingRegressor(estimators=[('dt', DecisionTreeRegressor(max_depth=20)),
                            ('svr', SVR(C=10, epsilon=0.01)),
                            ('sgd', SGDRegressor(max_iter=10000))],
                weights=[0.6, 0.2, 0.2])

In [170]:
#Train RMSE
train_pred = voting_reg.predict(train_x)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 10897.467906391394


In [171]:
#Test RMSE
test_pred = voting_reg.predict(test_x)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 21795.218534754113


In [172]:
for reg in (dtree_reg, svm_reg, sgd_reg, voting_reg):
    reg.fit(train_x, train_y['Salary'])
    test_y_pred = reg.predict(test_x)
    print(reg.__class__.__name__, 'Test rmse=', np.sqrt(mean_squared_error(test_y, test_y_pred)))

DecisionTreeRegressor Test rmse= 25568.61284826046
SVR Test rmse= 28420.432339772527
SGDRegressor Test rmse= 24466.66228803418
VotingRegressor Test rmse= 20394.761439497594
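A voting regressor predicts the (optionally weighted) average of its members' predictions, which is why it can beat each individual model. The sketch below checks this on synthetic data; the two member models and the weights are assumptions chosen for brevity, not the ones used above.

```python
import numpy as np
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data
rng = np.random.RandomState(0)
X_vote = rng.rand(100, 2)
y_vote = 2 * X_vote[:, 0] - X_vote[:, 1] + rng.normal(scale=0.1, size=100)

weights = [0.6, 0.4]
vote_demo = VotingRegressor(
    estimators=[('dt', DecisionTreeRegressor(max_depth=3, random_state=0)),
                ('lr', LinearRegression())],
    weights=weights)
vote_demo.fit(X_vote, y_vote)

# Recompute the ensemble prediction by hand from the fitted members
member_preds = np.column_stack([est.predict(X_vote) for est in vote_demo.estimators_])
manual_pred = np.average(member_preds, axis=1, weights=weights)

print(np.allclose(manual_pred, vote_demo.predict(X_vote)))
```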


## A Boosting model: 

Build either an Adaboost or a GradientBoost model

In [173]:
#Adaptive Boosting with Decision Stumps (depth=1) was giving test RMSE: 29569.37

In [174]:
from sklearn.ensemble import AdaBoostRegressor 


ada_reg = AdaBoostRegressor( 
            DecisionTreeRegressor(max_depth=10), n_estimators=500, 
            learning_rate=0.1) 

# Pass the target as a 1-D Series to avoid the column-vector DataConversionWarning
ada_reg.fit(train_x, train_y['Salary'])


AdaBoostRegressor(base_estimator=DecisionTreeRegressor(max_depth=10),
                  learning_rate=0.1, n_estimators=500)

In [175]:
#Train RMSE
train_pred = ada_reg.predict(train_x)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 4907.548404719151


In [176]:
#Test RMSE
test_pred = ada_reg.predict(test_x)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 16089.29562624076


## Neural network: 

In [177]:
from sklearn.neural_network import MLPRegressor

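#The default is 1 hidden layer with 100 neurons; here we use 3 hidden layers of 100 neurons each
mlp_reg = MLPRegressor(hidden_layer_sizes=(100,100,100), max_iter=1000, alpha = 0.1)

# Pass the target as a 1-D Series to avoid the column-vector DataConversionWarning
mlp_reg.fit(train_x, train_y['Salary'])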


MLPRegressor(alpha=0.1, hidden_layer_sizes=(100, 100, 100), max_iter=1000)

In [178]:
#Train RMSE
train_pred = mlp_reg.predict(train_x)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 15477.501798390376


In [179]:
#Test RMSE
test_pred = mlp_reg.predict(test_x)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 23086.73497853427


## Grid search

Perform either a full or randomized grid search on any model you want. There has to be at least two parameters for the search. 

In [180]:
from sklearn.model_selection import RandomizedSearchCV

param_grid = [
    {'min_samples_leaf': np.arange(10, 30), 
     'max_depth': np.arange(10,20)}
  ]

tree_reg = DecisionTreeRegressor()

grid_search = RandomizedSearchCV(tree_reg, param_grid, cv=5, n_iter=10,
                           scoring='neg_mean_squared_error', verbose=1,
                           return_train_score=True)

grid_search.fit(train_x, train_y)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


RandomizedSearchCV(cv=5, estimator=DecisionTreeRegressor(),
                   param_distributions=[{'max_depth': array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19]),
                                         'min_samples_leaf': array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
       27, 28, 29])}],
                   return_train_score=True, scoring='neg_mean_squared_error',
                   verbose=1)

In [181]:
cvres = grid_search.cv_results_

for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

27322.171260670224 {'min_samples_leaf': 15, 'max_depth': 14}
27516.576395239637 {'min_samples_leaf': 29, 'max_depth': 13}
27954.84098892312 {'min_samples_leaf': 11, 'max_depth': 14}
27667.03574039121 {'min_samples_leaf': 27, 'max_depth': 14}
28095.046600057663 {'min_samples_leaf': 12, 'max_depth': 17}
27372.379374414322 {'min_samples_leaf': 13, 'max_depth': 19}
27516.576395239637 {'min_samples_leaf': 29, 'max_depth': 17}
27693.9466479276 {'min_samples_leaf': 26, 'max_depth': 18}
27439.059189549884 {'min_samples_leaf': 23, 'max_depth': 19}
27326.470850231264 {'min_samples_leaf': 17, 'max_depth': 14}


In [182]:
grid_search.best_params_

{'min_samples_leaf': 15, 'max_depth': 14}

In [183]:
grid_search.best_estimator_

DecisionTreeRegressor(max_depth=14, min_samples_leaf=15)
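A randomized search can also sample from distributions instead of fixed arrays, and the best cross-validated score is available directly as `best_score_`. The sketch below uses `scipy.stats.randint` over the same ranges on synthetic data; the data, the number of folds and `n_iter` are assumptions chosen to keep it fast.

```python
import numpy as np
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data
rng = np.random.RandomState(0)
X_rs = rng.rand(200, 3)
y_rs = 4 * X_rs[:, 0] + rng.normal(scale=0.2, size=200)

# randint(low, high) samples integers from low up to high - 1
param_dist = {'min_samples_leaf': randint(10, 30),
              'max_depth': randint(10, 20)}

search_demo = RandomizedSearchCV(DecisionTreeRegressor(random_state=0),
                                 param_distributions=param_dist,
                                 n_iter=5, cv=3,
                                 scoring='neg_mean_squared_error',
                                 random_state=0)
search_demo.fit(X_rs, y_rs)

# best_score_ is the negated MSE, so flip the sign before taking the root
best_cv_rmse = np.sqrt(-search_demo.best_score_)
print(search_demo.best_params_, best_cv_rmse)
```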

In [184]:
#Train RMSE
train_pred = grid_search.best_estimator_.predict(train_x)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 18733.032193454277


In [185]:
#Test RMSE
test_pred = grid_search.best_estimator_.predict(test_x)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 25718.83485144365


# Discussion 


## Train and test values of each model
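The RMSE values printed in Section 2 can be collected into one table for comparison. The sketch below copies those printed values by hand, rounded to two decimals. For the voting regressor it uses the dedicated test cell (21795.22); the later refit in the comparison loop printed 20394.76. The column names are assumptions for illustration.

```python
import pandas as pd

# RMSE values copied from the notebook outputs above (rounded)
results = pd.DataFrame(
    [
        ("Decision tree (max_depth=16, min_samples_leaf=5)", 12669.19, 23794.28),
        ("Voting regressor", 10897.47, 21795.22),
        ("AdaBoost", 4907.55, 16089.30),
        ("Neural network (MLP)", 15477.50, 23086.73),
        ("Decision tree (randomized search)", 18733.03, 25718.83),
    ],
    columns=["model", "train_rmse", "test_rmse"],
)

baseline_rmse_reported = 28294.89  # baseline test RMSE printed in Section 1

# A large gap between test and train RMSE is a sign of overfitting
results["test_minus_train"] = results["test_rmse"] - results["train_rmse"]
# Positive values mean the model beats the mean baseline on the test set
results["improvement_vs_baseline"] = baseline_rmse_reported - results["test_rmse"]

results = results.sort_values("test_rmse").reset_index(drop=True)
print(results)
```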

## Which model performs the best and why?

## How does it compare to baseline? 

Hint: The best model is the one with the best TEST score, which for RMSE means the lowest test RMSE (regardless of any of the training values). If you select your model based on TRAIN values, you will lose points.

## Is there any evidence of overfitting in the best model, why or why not? If there is, what did you do about it? 

## Is there any overfitting in the other models (besides the best model), why or why not? If there is, what did you do about it? 