# Salary Recommendation - Text Mining

The data set for this exercise includes information on job descriptions and salaries. Use this data set to see if you can predict the salary of a job posting (i.e., the `Salary` column in the data set) based on the job description. This is important, because this model can make a salary recommendation as soon as a job description is entered into a system.

## Description of Variables

Use the **jobs.csv** file as the data set. 

There are only two columns:<br>
**Salary:** The salary of that specific job<br>
**Job Description:** The description of the job ad<br>

## Goal

Use the **jobs.csv** data set and build a model to predict **Salary**. <br>


# Read and Prepare the Data

In [1]:
import pandas as pd
import numpy as np

In [2]:
jobs = pd.read_csv('jobs.csv')

In [3]:
jobs.head(5)

Unnamed: 0,Salary,Job Description
0,67206,Civil Service Title: Regional Director Mental ...
1,88313,The New York City Comptrollerâ€™s Office Burea...
2,81315,With minimal supervision from the Deputy Commi...
3,76426,OPEN TO CURRENT BUSINESS PROMOTION COORDINATOR...
4,55675,Only candidates who are permanent in the Princ...


In [4]:
target=jobs['Salary']

### Select the "text" (input) variable

In [5]:
jobs[['Job Description']].isna().sum()

Job Description    0
dtype: int64

In [6]:
##jobs['Job Description'].fillna('missing', inplace=True)

In [7]:
input_data = jobs['Job Description']

### Baseline

In [82]:
#First find the average value of the target

mean_value = np.mean(jobs['Salary'])

mean_value

77990.33029423954

In [84]:
# Predict all values as the mean

baseline_pred = np.repeat(mean_value, len(jobs))

baseline_pred

array([77990.33029424, 77990.33029424, 77990.33029424, ...,
       77990.33029424, 77990.33029424, 77990.33029424])

In [88]:
baseline_mse = mean_squared_error(jobs['Salary'], baseline_pred)

baseline_rmse = np.sqrt(baseline_mse)

print('Baseline RMSE: {}' .format(baseline_rmse))

Baseline RMSE: 29196.68788150113


### Split the data

In [8]:
from sklearn.model_selection import train_test_split

train_set, test_set, train_y, test_y = train_test_split(input_data, target, test_size=0.3, random_state=42)

In [9]:
train_set.shape, train_y.shape

((1689,), (1689,))

In [10]:
test_set.shape, test_y.shape

((724,), (724,))

In [94]:
#Countvectorizer includes pre-processing, tokenization, filtering stop words
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(stop_words='english')

train_x_tr = count_vect.fit_transform(train_set)

In [95]:
# Perform the CountVectorizer transformation
# Be careful: We are using the train fit to transform the test data set. Otherwise, the test data 
# features will be very different and match the train set!!!

test_x_tr = count_vect.transform(test_set)


In [96]:
from sklearn.feature_extraction.text import TfidfTransformer

tf_transformer1 = TfidfTransformer()

train_x_tfidf = tf_transformer1.fit_transform(train_x_tr)

train_x_tfidf.shape

(1689, 9914)

In [97]:
# Now we need to perform the tf-idf transformation on the test data set

test_x_tfidf = tf_transformer1.transform(test_x_tr)

test_x_tfidf.shape

(724, 9914)

In [15]:
# X_train_tf is a sparse matrix. We can't see it unless we convert using toarray()
train_x_tfidf[:,:].toarray()

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.02913336, 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

### Latent Semantic Analysis (Singular Value Decomposition)

In [16]:
from sklearn.decomposition import TruncatedSVD

In [17]:
#If you are performing Latent Semantic Analysis, recommended number of components is 100  (500)

svd = TruncatedSVD(n_components=1000, n_iter=10)

In [18]:
train_x_lsa = svd.fit_transform(train_x_tfidf)

In [19]:
train_x_lsa.shape

(1689, 1000)

In [20]:
train_x_lsa

array([[ 2.47205981e-01, -2.03271254e-01,  1.14899112e-01, ...,
         1.06165287e-02, -1.77662225e-03, -2.03356930e-03],
       [ 1.72948207e-01, -1.30748862e-01,  4.77585353e-03, ...,
         3.61996907e-03,  4.47720592e-03, -7.32329869e-03],
       [ 5.87776099e-01,  3.65734004e-01,  1.34432287e-01, ...,
         4.43027643e-03, -1.34471247e-03,  9.29132021e-05],
       ...,
       [ 1.33857761e-01, -1.06052106e-01, -4.01844658e-02, ...,
         7.29242320e-04,  2.31761932e-03,  5.95222660e-04],
       [ 1.53406212e-01, -1.23536443e-01, -4.28305610e-02, ...,
        -1.40194615e-03,  1.58870785e-03,  6.63586443e-04],
       [ 2.17227521e-01, -5.64243368e-02,  3.15270243e-01, ...,
         5.86871013e-04,  1.06747523e-02,  2.32311719e-02]])

### Explore the SVDs

In [21]:
svd.explained_variance_.sum()

0.9313251924116244

In [22]:
#These are the all the components:
svd.components_

array([[ 6.88496587e-04,  8.57475540e-02,  9.35138953e-05, ...,
         2.58331510e-04,  2.64894838e-04,  4.74548918e-04],
       [-4.24457233e-04,  9.21278086e-02, -8.70152800e-05, ...,
        -5.37307789e-04,  4.90695772e-04, -9.88739389e-04],
       [-5.92600177e-04, -6.23455368e-02, -1.68140617e-04, ...,
        -3.77315274e-04, -3.92896112e-04, -5.15104924e-04],
       ...,
       [ 9.82621871e-03, -2.25187047e-03, -7.70233288e-05, ...,
        -1.27598887e-03,  3.40471294e-03,  1.02321257e-03],
       [-2.39700044e-02, -3.29468896e-03, -2.28909368e-03, ...,
        -5.54934288e-04,  2.39321172e-03, -3.32759719e-04],
       [ 3.92375304e-03, -3.25284824e-03, -4.13820717e-03, ...,
        -9.70313124e-04, -4.90217368e-03, -3.82991108e-03]])

In [23]:
svd.components_.shape

(1000, 9914)

In [24]:
#Let's select the first component:

first_component = svd.components_[0,:]

In [25]:
# Sort the weights in the first component, and get the indeces

indeces = np.argsort(first_component).tolist()

In [26]:
#Be careful, indeces are in descending order (least important first)

print(indeces)

[6450, 7432, 9466, 1166, 6843, 1749, 4530, 5531, 6584, 2920, 7336, 7992, 2383, 2467, 1194, 7386, 3263, 7532, 9158, 719, 7831, 2970, 5785, 9193, 7677, 5409, 1535, 2267, 8404, 9103, 5804, 5334, 7140, 5122, 5198, 9591, 9579, 1934, 2539, 9887, 1719, 9889, 3450, 9888, 3742, 88, 7775, 2939, 1439, 7494, 2588, 3826, 6239, 1646, 6276, 9495, 5498, 9706, 6310, 8386, 5655, 4935, 2157, 691, 692, 8148, 2811, 8072, 4528, 3010, 3478, 702, 8070, 8069, 8723, 4625, 5633, 5631, 9728, 2565, 6235, 5912, 3739, 7945, 5586, 2566, 3803, 9641, 4725, 4687, 2925, 1310, 1201, 9091, 9708, 149, 9696, 361, 8207, 8182, 1195, 5413, 7180, 8161, 8447, 6377, 9729, 8843, 3502, 3501, 7489, 3847, 9723, 9730, 6371, 8803, 5588, 9661, 4709, 4703, 434, 4166, 3302, 5399, 9612, 9790, 2929, 8036, 4120, 6128, 1447, 9604, 7531, 84, 1795, 2927, 3943, 8993, 8375, 1237, 7546, 5977, 5123, 4311, 5737, 651, 3162, 3900, 9284, 4070, 8941, 7379, 8201, 8001, 3604, 2608, 5927, 2135, 7871, 4310, 104, 7436, 9229, 7306, 2593, 4643, 6592, 1858, 5861

In [27]:
#Let's get the feature names from the count vectorizer:
feat_names = count_vect.get_feature_names()

In [28]:
#Print the last 10 terms (i.e., the 10 terms that have the highest weigths)

for index in indeces[-10:]:
    print(feat_names[index], "\t\tweight =", first_component[index])

bureau 		weight = 0.1071793187980578
management 		weight = 0.10960765437645041
new 		weight = 0.12230386159112752
design 		weight = 0.12986195254896937
city 		weight = 0.1360852112302978
project 		weight = 0.13846106935767966
dep 		weight = 0.14965285958788319
construction 		weight = 0.15479839429867406
wastewater 		weight = 0.1584152988456564
water 		weight = 0.2642504496809606


In [29]:
test_x_lsa = svd.transform(test_x_tfidf)

In [30]:
test_x_lsa.shape

(724, 1000)

# Model 1-Using Scikit Learn 
### Use any model that we have covered so far

### Random Forest Regressor

In [31]:
from sklearn.ensemble import RandomForestRegressor

In [32]:
rnd_regsc=RandomForestRegressor(n_estimators=100, max_leaf_nodes=16, n_jobs=-1)

rnd_regsc.fit(train_x_lsa, train_y)

RandomForestRegressor(max_leaf_nodes=16, n_jobs=-1)

In [33]:
from sklearn.metrics import mean_squared_error

In [34]:
#Train RMSE
reg_train_pred = rnd_regsc.predict(train_x_lsa)

train_mse = mean_squared_error(train_y, reg_train_pred)

train_rmse = np.sqrt(mean_squared_error (train_y, reg_train_pred))

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 19900.54785554696


In [35]:
#Test RMSE
reg_test_pred = rnd_regsc.predict(test_x_lsa)

test_mse = mean_squared_error (test_y, reg_test_pred)

test_rmse = np.sqrt(mean_squared_error (test_y, reg_test_pred))

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 22513.852145500736


### Stochastic Gradient Regressor 

In [36]:
from sklearn.linear_model import SGDRegressor

sgd_regsc = SGDRegressor(max_iter=10000, tol=1e-3,penalty='l2')

In [37]:
sgd_regsc.fit(train_x_lsa, train_y)



SGDRegressor(max_iter=10000)

In [38]:
#Train RMSE
reg_train_pred = sgd_regsc.predict(train_x_lsa)

train_mse = mean_squared_error(train_y, reg_train_pred)

train_rmse = np.sqrt(mean_squared_error (train_y, reg_train_pred))

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 9603.516929049772


In [39]:
reg_test_pred = sgd_regsc.predict(test_x_lsa)

test_mse = mean_squared_error (test_y, reg_test_pred)

test_rmse = np.sqrt(mean_squared_error (test_y, reg_test_pred))

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 16461.2289693097


### Hard voting Regressor

In [40]:
from sklearn.tree import DecisionTreeRegressor 
from sklearn.linear_model import SGDRegressor 
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingRegressor
from sklearn.svm import SVR 

In [41]:
dtree_reg = DecisionTreeRegressor(max_depth=100)
log_reg = LogisticRegression(#multi_class='multinomial',
    solver = 'lbfgs', C=8, max_iter=1000)
sgd_reg = SGDRegressor(max_iter=10000, tol=1e-3)
svr_reg=SVR(kernel="poly", degree=2, coef0=1, C=1, gamma='scale')
voting_regsc = VotingRegressor(
            estimators=[('dt', dtree_reg), 
                        #('lr', log_reg), 
                        ('sgd', sgd_reg),
                       ('svr',svr_reg)],
            )

voting_regsc.fit(train_x_lsa, train_y)



VotingRegressor(estimators=[('dt', DecisionTreeRegressor(max_depth=100)),
                            ('sgd', SGDRegressor(max_iter=10000)),
                            ('svr',
                             SVR(C=1, coef0=1, degree=2, kernel='poly'))])

In [42]:
train_y_pred = voting_regsc.predict(train_x_lsa)

train_mse = mean_squared_error(train_y, train_y_pred)
train_mse=np.sqrt(mean_squared_error (train_y, train_y_pred))

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 9603.516929049772


In [43]:
reg_test_pred = voting_regsc.predict(test_x_lsa)

test_mse = mean_squared_error (test_y, reg_test_pred)

test_rmse = np.sqrt(mean_squared_error (test_y, reg_test_pred))

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 18254.172311798484


# Model 2 Using Keras
### Use any model that we have covered so far

In [44]:
# import tokenizer (after installing Tensorflow)
from tensorflow.keras.preprocessing.text import Tokenizer

# When initializing a tokenizer, "num_words" selects the most frequently occuring N terms only
# If you make it "num_words=None" then all terms are included
keras_tokenizer = Tokenizer(num_words=500, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n1234567890', lower=True)

keras_tokenizer.fit_on_texts(train_set)

In [45]:
# After identifying the terms to be used in the term-by-document matrix, 
# create the matrix using one of the below


#train_binary_matrix = keras_tokenizer.texts_to_matrix(train_set, mode='binary')
train_tfidf_matrix = keras_tokenizer.texts_to_matrix(train_set, mode='tfidf')
#train_binary_matrix = keras_tokenizer.texts_to_matrix(train_set, mode='freq')  # ratio of terms in a document
#train_binary_matrix = keras_tokenizer.texts_to_matrix(train_set, mode='count')


train_tfidf_matrix.shape

(1689, 500)

In [46]:
# Now we need to perform the test data set

#test_binary_matrix = keras_tokenizer.texts_to_matrix(test_set, mode='binary')
test_tfidf_matrix = keras_tokenizer.texts_to_matrix(test_set, mode='tfidf')
#test_binary_matrix = keras_tokenizer.texts_to_matrix(test_set, mode='freq')  # ratio of terms in a document
#test_binary_matrix = keras_tokenizer.texts_to_matrix(test_set, mode='count')


test_tfidf_matrix.shape

(724, 500)

In [47]:
print(keras_tokenizer.word_counts)



### Random Forest Regressor 

In [48]:
rnd_regk=RandomForestRegressor(n_estimators=1000, max_leaf_nodes=16, n_jobs=-1)

rnd_regk.fit(train_tfidf_matrix, train_y)

RandomForestRegressor(max_leaf_nodes=16, n_estimators=1000, n_jobs=-1)

In [49]:
#Train RMSE
reg_train_pred = rnd_regk.predict(train_tfidf_matrix)

train_mse = mean_squared_error(train_y, reg_train_pred)

train_rmse = np.sqrt(mean_squared_error (train_y, reg_train_pred))

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 20536.53362266673


In [50]:
#Test RMSE
reg_test_pred = rnd_regk.predict(test_tfidf_matrix)

test_mse = mean_squared_error(test_y, reg_test_pred)

test_rmse = np.sqrt(mean_squared_error (test_y, reg_test_pred))

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 22356.085831017943


### Stochastic Gradient Regressor

In [51]:
sgd_regk = SGDRegressor(max_iter=10000, tol=1e-3)

In [52]:
sgd_regk.fit(train_tfidf_matrix, train_y)

SGDRegressor(max_iter=10000)

In [53]:
#Train RMSE
reg_train_pred = sgd_regk.predict(train_tfidf_matrix)

train_mse = mean_squared_error(train_y, reg_train_pred)

train_rmse = np.sqrt(mean_squared_error (train_y, reg_train_pred))

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 16158.715342210153


In [54]:
#Test RMSE
reg_test_pred = sgd_regk.predict(test_tfidf_matrix)

test_mse = mean_squared_error(test_y, reg_test_pred)

test_rmse = np.sqrt(mean_squared_error (test_y, reg_test_pred))

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 24956.498109519787


### Hard Voting Regressor

In [56]:
voting_regk = VotingRegressor(
            estimators=[('dt', dtree_reg), 
                        #('lr', log_reg), 
                        ('sgd', sgd_reg),
                       ('svr',svr_reg)],
            )


In [57]:
voting_regk.fit(train_tfidf_matrix, train_y)

VotingRegressor(estimators=[('dt', DecisionTreeRegressor(max_depth=100)),
                            ('sgd', SGDRegressor(max_iter=10000)),
                            ('svr',
                             SVR(C=1, coef0=1, degree=2, kernel='poly'))])

In [58]:
#Train RMSE
reg_train_pred = voting_regk.predict(train_tfidf_matrix)

train_mse = mean_squared_error(train_y, reg_train_pred)

train_rmse = np.sqrt(mean_squared_error (train_y, reg_train_pred))

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 13922.305807305494


In [59]:
#Test RMSE
reg_test_pred = voting_regk.predict(test_tfidf_matrix)

test_mse = mean_squared_error(test_y, reg_test_pred)

test_rmse = np.sqrt(mean_squared_error (test_y, reg_test_pred))

print('Test RMSE: {}' .format(test_rmse))##good test rmse 19834

Test RMSE: 18372.21391100678


# NLTK  

In [60]:
import nltk
from nltk.corpus import stopwords
import re

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\anush\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\anush\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\anush\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [61]:
train_set

429     Only candidates who are permanent in the Compu...
1185    NYCERS is seeking a Business Analyst with a te...
2116    The NYC Department of Environmental Protection...
2127    Only Candidates permanent in the Assistant Civ...
458     Please read this posting carefully to make cer...
                              ...                        
1638    NYC Civilian Complaint Review Board  The Civil...
1095    The NYC Department of Environmental Protection...
1130    The NYC Office of Payroll Administration is re...
1294    HPDTech is the IT division within HPD. Its mis...
860     Only Candidates permanent in the Assistant Civ...
Name: Job Description, Length: 1689, dtype: object

In [62]:
#Create a blank list

new_train = []


# For each row in train_set, we will read the text, tokenize it, remove stopwords, lemmatize it, 
# and save it to the new list

for text in train_set:
    text = re.sub(r'[!"#$%&()*+,-./:;<=>?[\]^_`{|}~]', ' ', text).lower()
        
    words= nltk.tokenize.word_tokenize(text)
    words = [w for w in words if w.isalpha()]
    words = [w for w in words if len(w)>2 and w not in stopwords.words('english')]
        
    lemmatizer = nltk.stem.WordNetLemmatizer()
    words = [lemmatizer.lemmatize(w) for w in words]
    new_train.append(' '.join(words))

In [63]:

new_train

['candidate permanent computer system manager title provide proof successful registration october open competitive promotional exam may apply failure result disqualification department design construction division public building seek director data analytics data analytics team responsible providing descriptive diagnostic predictive data insight based data across agency external source dataset includes basic project management data schedule budget well external internal data information system sensor including uncensored data director data analytics leverage business acumen knowledge data science modern technical tool develop data informed strategy action related strengthening agency business performance related delivery quality capital project time within budget safe manner director aggregate diagnose data identified trend risk performance metric also oversee internal external agency reporting ensure effective data flow external dashboard capital project dashboard open data portal dir

In [64]:
# Let's convert the original train_set to a dataframe

train_set_df = pd.DataFrame(train_set)

train_set_df['new_text'] = new_train

train_set_df

Unnamed: 0,Job Description,new_text
429,Only candidates who are permanent in the Compu...,candidate permanent computer system manager ti...
1185,NYCERS is seeking a Business Analyst with a te...,nycers seeking business analyst technical back...
2116,The NYC Department of Environmental Protection...,nyc department environmental protection dep pr...
2127,Only Candidates permanent in the Assistant Civ...,candidate permanent assistant civil engineer t...
458,Please read this posting carefully to make cer...,please read posting carefully make certain mee...
...,...,...
1638,NYC Civilian Complaint Review Board The Civil...,nyc civilian complaint review board civilian c...
1095,The NYC Department of Environmental Protection...,nyc department environmental protection dep en...
1130,The NYC Office of Payroll Administration is re...,nyc office payroll administration recruiting i...
1294,HPDTech is the IT division within HPD. Its mis...,hpdtech division within hpd mission identify a...


In [65]:
# Let's do the same for test data 

new_test = []

for text in test_set:
    text = re.sub(r'[!"#$%&()*+,-./:;<=>?[\]^_`{|}~]', ' ', text).lower()
        
    words= nltk.tokenize.word_tokenize(text)
    words = [w for w in words if w.isalpha()]
    words = [w for w in words if len(w)>2 and w not in stopwords.words('english')]
        
    lemmatizer = nltk.stem.WordNetLemmatizer()
    words = [lemmatizer.lemmatize(w) for w in words]
    new_test.append(' '.join(words))



test_set_df = pd.DataFrame(test_set)

test_set_df['new_text'] = new_test

test_set_df

Unnamed: 0,Job Description,new_text
765,The New York City Housing Authority (NYCHA) is...,new york city housing authority nycha largest ...
2387,"Hiring Rate: $62,272.00 (Flat Rate-Annual) ...",hiring rate flat rate annual mission bureau hi...
2162,The Executive Director for Regulatory Reform w...,executive director regulatory reform assist im...
1833,The NYC Department of Environmental Protection...,nyc department environmental protection dep pr...
1814,The Department of Transportationâ€™s (DOT) mis...,department dot mission provide safe efficient ...
...,...,...
2333,The Family Independence Administration/ Office...,family independence administration office rese...
998,In order to be considered for this position ca...,order considered position candidate must servi...
891,In accordance to Local Law 196 established in ...,accordance local law established late sb devel...
1866,About New York City Cyber Command NYC Cyber Co...,new york city cyber command nyc cyber command ...


In [66]:
#Countvectorizer includes pre-processing, tokenization, filtering stop words
from sklearn.feature_extraction.text import CountVectorizer

# Let's see if we can limite the features and achieve a good accuracy
count_vectnl = CountVectorizer(max_features=500)

train_x_tr = count_vectnl.fit_transform(train_set_df['new_text'])

In [67]:
test_x_tr = count_vectnl.transform(test_set_df['new_text'])

In [68]:
train_x_tr, test_x_tr

(<1689x500 sparse matrix of type '<class 'numpy.int64'>'
 	with 146093 stored elements in Compressed Sparse Row format>,
 <724x500 sparse matrix of type '<class 'numpy.int64'>'
 	with 64237 stored elements in Compressed Sparse Row format>)

In [69]:
from sklearn.feature_extraction.text import TfidfTransformer

tf_transformer = TfidfTransformer()

train_x_tfidf = tf_transformer.fit_transform(train_x_tr)

train_x_tfidf.shape

(1689, 500)

In [70]:
# Now we need to perform the tf-idf transformation on the test data set

test_x_tfidf = tf_transformer.transform(test_x_tr)

test_x_tfidf.shape

(724, 500)

### Random Forest Regressor

In [71]:
rnd_regnlk=RandomForestRegressor(n_estimators=1000, max_leaf_nodes=16, n_jobs=-1)

rnd_regnlk.fit(train_x_tfidf, train_y)

RandomForestRegressor(max_leaf_nodes=16, n_estimators=1000, n_jobs=-1)

In [72]:
#Train RMSE
reg_train_pred = rnd_regnlk.predict(train_x_tfidf)

train_mse = mean_squared_error(train_y, reg_train_pred)

train_rmse = np.sqrt(mean_squared_error (train_y, reg_train_pred))

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 19776.868090755408


In [73]:
#Test RMSE
reg_test_pred = rnd_regnlk.predict(test_x_tfidf)

test_mse = mean_squared_error(test_y, reg_test_pred)

test_rmse = np.sqrt(mean_squared_error (test_y, reg_test_pred))

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 22104.035800587193


### Stochastic Gradient Regressor

In [74]:
sgd_regnlk = SGDRegressor(max_iter=10000, tol=1e-3)

In [75]:
sgd_regnlk.fit(train_x_tfidf, train_y)



SGDRegressor(max_iter=10000)

In [76]:
#Train RMSE
reg_train_pred = sgd_regnlk.predict(train_x_tfidf)

train_mse = mean_squared_error(train_y, reg_train_pred)

train_rmse = np.sqrt(mean_squared_error (train_y, reg_train_pred))

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 17163.199898763058


In [77]:
#Test RMSE
reg_test_pred = sgd_regnlk.predict(test_x_tfidf)

test_mse = mean_squared_error(test_y, reg_test_pred)

test_rmse = np.sqrt(mean_squared_error (test_y, reg_test_pred))

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 21527.883304634895


### Hard voting Regressor 

In [78]:
voting_regnlk = VotingRegressor(
            estimators=[('dt', dtree_reg), 
                        #('lr', log_reg), 
                        ('sgd', sgd_reg),
                       ('svr',svr_reg)],
            )

In [79]:
voting_regnlk.fit(train_tfidf_matrix, train_y)

VotingRegressor(estimators=[('dt', DecisionTreeRegressor(max_depth=100)),
                            ('sgd', SGDRegressor(max_iter=10000)),
                            ('svr',
                             SVR(C=1, coef0=1, degree=2, kernel='poly'))])

In [80]:
#Train RMSE
reg_train_pred = voting_regnlk.predict(train_tfidf_matrix)

train_mse = mean_squared_error(train_y, reg_train_pred)

train_rmse = np.sqrt(mean_squared_error (train_y, reg_train_pred))

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 13597.003041387914


In [81]:
#Test RMSE
reg_test_pred = voting_regnlk.predict(test_tfidf_matrix)

test_mse = mean_squared_error(test_y, reg_test_pred)

test_rmse = np.sqrt(mean_squared_error (test_y, reg_test_pred))

print('Test RMSE: {}' .format(test_rmse))##good test rmse 19834

Test RMSE: 18483.920238470077


# Discussion

1) Which model performs the best (and why)?<br>
2) What is the baseline?<br>
3) Does the best model perform better than the baseline (and why)?<br>
4) Does the best model exhibit any overfitting; what did you do about it?

1) The Stochastic Gradient Regressor model with Scikit text mining is best model with test RMSE of 16461
2) The Baseline RMSE is 29197
3) The model(RMSE 16461) performs better than baseline (RMSE 29197)
4) The model exhibits overfitting, tried to add l2 regularization and alpha value to prevent overfitting( but adding alpha value increased test RMSE and hence removed it from the equation).


### Extra  

In [122]:
import pandas as pd
import numpy as np
jobs_competition = pd.read_csv("jobs_competition.csv")

In [123]:
competition_data=jobs_competition['Job Description']

In [124]:
competition_data

0     TASK FORCE: \t\tFMS Financial Planning and Mai...
1     The NYC Department of Probation (DOP) is a wor...
2     The New York City Taxi and Limousine Commissio...
3     The NYC Department of Environmental Protection...
4     Job Description  The New York City Taxi and Li...
                            ...                        
70    ** THIS IS A TEMPORARY POSITION**  In 2019, Ma...
71    Please note: Only candidates currently serving...
72    The Financial Information Services Agency and ...
73    The Union Services Compliance Unit is responsi...
74    SBS seeks a Director to work in our Program Ma...
Name: Job Description, Length: 75, dtype: object

In [129]:
test_x_tr=count_vect.transform(competition_data)

In [130]:
test_x_tr

<75x9914 sparse matrix of type '<class 'numpy.int64'>'
	with 11417 stored elements in Compressed Sparse Row format>

In [131]:
test_x_tfidf = tf_transformer1.transform(test_x_tr)


In [132]:
test_x_lsa = svd.transform(test_x_tfidf)

In [133]:
bestprediction=sgd_regsc.predict(test_x_lsa)

In [134]:
bestpredictiondf=pd.DataFrame(bestprediction,columns=['Salary'])

In [135]:
bestpredictiondf['Salary']=bestpredictiondf['Salary'].apply(np.ceil)

In [136]:
bestpredictiondf.head()

Unnamed: 0,Salary
0,54188.0
1,82759.0
2,93214.0
3,76864.0
4,73119.0


In [137]:
bestpredictiondf.insert(1, "ID", np.arange(1, 76, 1).tolist(), False)

In [138]:
bestpredictiondf.to_csv('salian_dt_competition.csv',index=False)