# Predicting Salaries using Random Forest


Loading the data for the cleaned dataframe

In [4]:
import pandas as pd

In [12]:
predict_df = pd.read_csv('./data/dsjob-salary.csv', index_col=0)
predict_df = predict_df.reset_index(drop=True)
predict_df.head(3)

Unnamed: 0,city,job_title,company_name,summary,salary_lower,salary_upper,mean_salary
0,Sydney,Data Scientist,Domain Group,As a Data Scientist,82500.0,112500.0,97500.0
1,Sydney,Data Scientist,ANZ Banking Group,As the Data Scientist,82500.0,112500.0,97500.0
2,Sydney,Junior Data Scientist,Intellify,We also believe great Data Scientist,80000.0,100000.0,90000.0


We want to predict a binary variable: whether the salary is low or high. To do this, we compute the median salary and create a new binary variable that is true when the salary is high (above the median).
We could also use linear regression (or any other regression) to predict the salary value directly. Instead, we convert this into a binary classification problem with two classes, HIGH and LOW salary.

While regression may give a more detailed prediction, classification can help remove some of the noise from extreme salaries. That said, this data set has no extremely high salaries, and the minimum of 55K is still in a reasonable range.

In [13]:
median = predict_df['mean_salary'].median()
print ('The median salary for our data set is $' + str(median))

The median salary for our data set is $80000.0


In [14]:
def above_median(x):
    if x > median:
        return 1
    return 0

predict_df['Above Median'] = predict_df['mean_salary'].apply(above_median)
predict_df.head()

Unnamed: 0,city,job_title,company_name,summary,salary_lower,salary_upper,mean_salary,Above Median
0,Sydney,Data Scientist,Domain Group,As a Data Scientist,82500.0,112500.0,97500.0,1
1,Sydney,Data Scientist,ANZ Banking Group,As the Data Scientist,82500.0,112500.0,97500.0,1
2,Sydney,Junior Data Scientist,Intellify,We also believe great Data Scientist,80000.0,100000.0,90000.0,1
3,Sydney,Junior Data Scientist,The Eclair Group,Industry experience as a Data Analyst or Junio...,70000.0,90000.0,80000.0,0
4,Sydney,Data Scientist,Investigations & Counter Terrorism,Data Scientist,82500.0,112500.0,97500.0,1
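The same flag can also be built without `apply`, by comparing the whole column to the median at once. The sketch below uses a small hypothetical frame (`demo_df` is not part of the notebook) with salary levels like those in the output above. Note that a salary exactly equal to the median is labelled 0, as in `above_median`.

```python
import pandas as pd

# Hypothetical stand-in for predict_df; only the mean_salary column matters here.
demo_df = pd.DataFrame({"mean_salary": [97500.0, 90000.0, 80000.0, 60000.0, 80000.0]})

demo_median = demo_df["mean_salary"].median()

# Strictly greater than the median -> 1, otherwise 0 (ties count as "low").
demo_df["Above Median"] = (demo_df["mean_salary"] > demo_median).astype(int)

print(demo_median)
print(demo_df["Above Median"].tolist())
```

Because ties go to the low class, a median that many postings share (80000 here) makes the two classes unequal in size.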


What is the baseline accuracy for this model?

In [15]:
predict_df['Above Median'].value_counts()

0    378
1    210
Name: Above Median, dtype: int64

The majority class is 0 (at or below the median) with 378 of 588 postings, so always predicting "low" gives a baseline accuracy of about 64%. This matches what we know about the data: there are more Data Analyst and Analyst roles than Data Scientist roles, and the Data Scientist roles tend to fall above the median salary.
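The baseline can be computed directly from the class counts printed above: it is the accuracy of a model that always predicts the majority class. A minimal sketch, with the counts copied by hand rather than read from `predict_df`:

```python
# Class counts copied from the value_counts() output above.
class_counts = {0: 378, 1: 210}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = class_counts[majority_class] / total

print(majority_class, round(baseline_accuracy, 4))
```

Any model worth keeping should beat this number on held-out data.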

Create a Random Forest model to predict High/Low salary using Sklearn. Start by ONLY using the Job Title as a feature.

In [10]:
from io import StringIO  # sklearn.externals.six is gone from recent scikit-learn releases
from IPython.display import Image, display
from sklearn.tree import export_graphviz
import pydotplus
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats
from ipywidgets import *
from sklearn import metrics
# sklearn.cross_validation was replaced by sklearn.model_selection
from sklearn.model_selection import cross_val_score, cross_val_predict, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier



In [16]:
job_title_dummy = pd.get_dummies(predict_df['job_title'])

In [17]:
job_title_dummy.head(3)

Unnamed: 0,Analyst,Data Analyst,Data Scientist,Junior Data Scientist
0,0,0,1,0
1,0,0,1,0
2,0,0,0,1


In [19]:
model = RandomForestClassifier(n_estimators=100, oob_score=True)
X = job_title_dummy
y = predict_df['Above Median']
cv_model = cross_val_score(model, X, y, cv=6)
print ('Cross-validated scores:', cv_model)
print ('Average score:', cv_model.mean())
print ('Standard deviation of score:', cv_model.std())
model.fit(X, y)
print (model.oob_score_)

Cross-validated scores: [0.55102041 0.95918367 0.96938776 0.96938776 0.7755102  0.64285714]
Average score: 0.8112244897959183
Standard deviation of score: 0.16795455478367016
0.8809523809523809
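The fold scores above range from 0.55 to 0.97. One possible cause (an assumption, not something verified on this data) is that the rows are grouped by job title and that, for a classifier with an integer `cv`, scikit-learn uses stratified folds without shuffling. The sketch below shows how shuffled, seeded folds could be passed instead; the data is synthetic and the `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data: one binary feature that mostly agrees with the label.
rng = np.random.RandomState(0)
y_demo = np.array([1] * 60 + [0] * 120)
noise = rng.rand(len(y_demo)) < 0.1           # flip about 10% of the feature values
X_demo = np.where(noise, 1 - y_demo, y_demo).reshape(-1, 1)

# Shuffled, stratified folds with a fixed seed so the split is reproducible.
cv = StratifiedKFold(n_splits=6, shuffle=True, random_state=0)
model_demo = RandomForestClassifier(n_estimators=100, random_state=0)

scores_demo = cross_val_score(model_demo, X_demo, y_demo, cv=cv)
print(scores_demo.mean(), scores_demo.std())
```

If the spread between folds shrinks after shuffling, row ordering was probably contributing to it.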


Create a few new variables in the dataframe to represent interesting features of a job title.
- Create features that represent whether 'Scientist' or 'Analyst' is in the title.
- Then build a new Random Forest with these features. Do they add any value?
- After creating these variables, use CountVectorizer to create features based on the words in the job titles.
- Build a new Random Forest model with the job title and these new features included. Note that the job titles were already categorised during data cleaning, so there are not many distinct categories.

In [20]:
def scientist(x):
    if 'Scientist' in x:
        return 1
    return 0

predict_df['Scientist'] = predict_df['job_title'].apply(scientist)

In [21]:
predict_df[predict_df.Scientist !=0].head()

Unnamed: 0,city,job_title,company_name,summary,salary_lower,salary_upper,mean_salary,Above Median,Scientist
0,Sydney,Data Scientist,Domain Group,As a Data Scientist,82500.0,112500.0,97500.0,1,1
1,Sydney,Data Scientist,ANZ Banking Group,As the Data Scientist,82500.0,112500.0,97500.0,1,1
2,Sydney,Junior Data Scientist,Intellify,We also believe great Data Scientist,80000.0,100000.0,90000.0,1,1
3,Sydney,Junior Data Scientist,The Eclair Group,Industry experience as a Data Analyst or Junio...,70000.0,90000.0,80000.0,0,1
4,Sydney,Data Scientist,Investigations & Counter Terrorism,Data Scientist,82500.0,112500.0,97500.0,1,1


In [22]:
def analyst(x):
    if 'Analyst' in x:
        return 1
    return 0

predict_df['Analyst'] = predict_df['job_title'].apply(analyst)

In [23]:
predict_df[predict_df.Analyst !=0].head()

Unnamed: 0,city,job_title,company_name,summary,salary_lower,salary_upper,mean_salary,Above Median,Scientist,Analyst
196,Sydney,Analyst,WEST 1 Australia,"Data organization, processing and interpretati...",55000.0,65000.0,60000.0,0,0,1
197,Sydney,Data Analyst,ANZ Banking Group,Our Data Analyst,70000.0,90000.0,80000.0,0,0,1
198,Sydney,Data Analyst,AMP Ltd,We are currently looking for two Data Analyst,70000.0,90000.0,80000.0,0,0,1
199,Sydney,Data Analyst,Amazon.com,As a Data Analyst,70000.0,90000.0,80000.0,0,0,1
200,Sydney,Data Analyst,"Department of Finance, Services & Innovation",Cleanse and prepare multiple sources of data f...,70000.0,90000.0,80000.0,0,0,1


In [24]:
def junior(x):
    if 'Junior' in x:
        return 1
    return 0

predict_df['Junior'] = predict_df['job_title'].apply(junior)

In [26]:
def data_scientist(x):
    if 'Data Scientist' in x:
        return 1
    return 0

predict_df['Data Scientist'] = predict_df['job_title'].apply(data_scientist)

In [27]:
def data_analyst(x):
    if 'Data Analyst' in x:
        return 1
    return 0

predict_df['Data Analyst'] = predict_df['job_title'].apply(data_analyst)

In [29]:
feature_matrix = predict_df.copy(deep=True)
feature_matrix.drop(['city', 'job_title', 'company_name', 'summary', 'salary_lower', 'salary_upper', 'mean_salary', 'Above Median'], axis=1, inplace=True)
print (feature_matrix.shape)
feature_matrix.head()

(588, 5)


Unnamed: 0,Scientist,Analyst,Junior,Data Scientist,Data Analyst
0,1,0,0,1,0
1,1,0,0,1,0
2,1,0,1,1,0
3,1,0,1,1,0
4,1,0,0,1,0


In [30]:
model = RandomForestClassifier(n_estimators=100, oob_score=True)
X_features = feature_matrix
y = predict_df['Above Median']
cv_model = cross_val_score(model, X_features, y, cv=6)
print ('Cross-validated scores:', cv_model)
print ('Average score:', cv_model.mean())
print ('Standard deviation of score:', cv_model.std())
model.fit(X_features, y)
print (model.oob_score_)

Cross-validated scores: [0.55102041 0.95918367 0.96938776 0.96938776 0.7755102  0.64285714]
Average score: 0.8112244897959183
Standard deviation of score: 0.16795455478367016
0.8809523809523809


After analysing the words in the title separately, the scores are unchanged, and the accuracy is still fairly high at about 81%. We can make another attempt using a CountVectorizer on the job_title column.
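One way to see why the scores do not move (my interpretation, not something the notebook states): every flag is a deterministic function of `job_title`, so the five flags split the rows into exactly the same groups as the four title dummies. A model can then do no better than predict one class per group. A minimal sketch using the four title categories shown earlier:

```python
# The four title categories shown in the dummy-variable output above.
titles = ["Analyst", "Data Analyst", "Data Scientist", "Junior Data Scientist"]

flag_words = ["Scientist", "Analyst", "Junior", "Data Scientist", "Data Analyst"]

def title_flags(title):
    """Return the five substring flags for one title as a tuple of 0/1."""
    return tuple(int(word in title) for word in flag_words)

flags_by_title = {title: title_flags(title) for title in titles}
for title, flags in flags_by_title.items():
    print(title, flags)

# Each title maps to exactly one flag pattern, and the patterns are all different,
# so the flags partition the rows exactly as the title dummies do.
distinct_patterns = set(flags_by_title.values())
print(len(distinct_patterns) == len(titles))
```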

In [52]:
from sklearn.feature_extraction.text import CountVectorizer
cvec = CountVectorizer(stop_words='english', max_features=10, ngram_range=(2,2))
vectorizers = cvec.fit_transform(predict_df['job_title'].values)

df_vec  = pd.DataFrame(vectorizers.todense(), columns=cvec.get_feature_names())
print (df_vec.shape)
df_vec.head()

(588, 3)


Unnamed: 0,data analyst,data scientist,junior data
0,0,1,0
1,0,1,0
2,0,1,1
3,0,1,1
4,0,1,0
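For these bigram counts to reach a model, they have to be joined column-wise to the other features. A minimal sketch with a few hypothetical titles (`titles_demo`, `flags_demo` and `X_demo` are illustration names only); `get_feature_names_out` is used when the installed scikit-learn provides it, with a fallback to `get_feature_names` for older releases.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample of titles standing in for predict_df['job_title'].
titles_demo = pd.Series(["Data Scientist", "Junior Data Scientist", "Data Analyst", "Analyst"])

cvec_demo = CountVectorizer(stop_words="english", max_features=10, ngram_range=(2, 2))
counts = cvec_demo.fit_transform(titles_demo)

# Newer scikit-learn exposes get_feature_names_out; older releases only have get_feature_names.
if hasattr(cvec_demo, "get_feature_names_out"):
    names = list(cvec_demo.get_feature_names_out())
else:
    names = list(cvec_demo.get_feature_names())

bigrams_demo = pd.DataFrame(counts.toarray(), columns=names, index=titles_demo.index)

# A hand-made flag column, joined column-wise with the bigram counts.
flags_demo = pd.DataFrame({"Junior": titles_demo.str.contains("Junior").astype(int)})
X_demo = pd.concat([flags_demo, bigrams_demo], axis=1)
print(X_demo)
```

Both frames share the same index, which is what `pd.concat(..., axis=1)` aligns on.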


In [35]:
X_cvec = feature_matrix
X_cvec.head()

Unnamed: 0,Scientist,Analyst,Junior,Data Scientist,Data Analyst
0,1,0,0,1,0
1,1,0,0,1,0
2,1,0,1,1,0
3,1,0,1,1,0
4,1,0,0,1,0


In [36]:
model = RandomForestClassifier(n_estimators=100, oob_score=True)
y = predict_df['Above Median']
cv_model = cross_val_score(model, X_cvec, y, cv=6)
print ('Cross-validated scores:', cv_model)
print ('Average score:', cv_model.mean())
print ('Standard deviation of score:', cv_model.std())
model.fit(X_cvec, y)
print (model.oob_score_)
importance_dataframe = pd.DataFrame(model.feature_importances_, index = X_cvec.columns, columns=['importance']).sort_values('importance', ascending=False)
importance_dataframe.head(20)

Cross-validated scores: [0.55102041 0.95918367 0.96938776 0.96938776 0.7755102  0.64285714]
Average score: 0.8112244897959183
Standard deviation of score: 0.16795455478367016
0.8809523809523809


Unnamed: 0,importance
Junior,0.29494
Data Scientist,0.28416
Analyst,0.229813
Scientist,0.153296
Data Analyst,0.037791


The most important features are Junior and Data Scientist. The job titles were already reduced to a few categories during data cleaning, and it is not clear whether they should have been left broader rather than reduced so much; on the other hand, properly cleaned data with fewer categories would normally help accuracy. Note that X_cvec is assigned directly from feature_matrix, so the bigram counts in df_vec are not part of this model, and the average score stays at about 81% with no improvement.

Repeat the model-building process with a non-tree-based method.

In [40]:
from sklearn import linear_model
log_reg = linear_model.LogisticRegression()
scores_log = cross_val_score(log_reg, X_cvec, y, cv=6)
print ('Cross-validated scores:', scores_log)
print ('Average score:', scores_log.mean())
print ('Standard deviation of score:', scores_log.std())
log_model = log_reg.fit(X_cvec, y)

Cross-validated scores: [0.55102041 0.95918367 0.96938776 0.96938776 0.7755102  0.64285714]
Average score: 0.8112244897959183
Standard deviation of score: 0.16795455478367016


A logistic regression gets an average accuracy score of 81%, exactly the same as the Random Forest classifier, so there is no further improvement.

In [41]:
from sklearn.svm import SVC
model_svmrbf = SVC(kernel='rbf')
scores_svm = cross_val_score(model_svmrbf, X_cvec, y, cv=6)
print ('Cross-validated scores:', scores_svm)
print ('Average score:', scores_svm.mean())
print ('Standard deviation of score:', scores_svm.std())
svm_model = model_svmrbf.fit(X_cvec, y)

Cross-validated scores: [0.55102041 0.95918367 0.96938776 0.96938776 0.7755102  0.64285714]
Average score: 0.8112244897959183
Standard deviation of score: 0.16795455478367016


In [42]:
model_svmlm = SVC(kernel='linear')
scores_svmlm = cross_val_score(model_svmlm, X_cvec, y, cv=6)
print ('Cross-validated scores:', scores_svmlm)
print ('Average score:', scores_svmlm.mean())
print ('Standard deviation of score:', scores_svmlm.std())
svm_model = model_svmlm.fit(X_cvec, y)

Cross-validated scores: [0.55102041 0.95918367 0.96938776 0.96938776 0.7755102  0.64285714]
Average score: 0.8112244897959183
Standard deviation of score: 0.16795455478367016


Not much better: both SVM kernels give exactly the same scores as the previous models.

In [46]:
import warnings

### Neural Network Classifier

In [48]:
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(solver='sgd', alpha=1e-5, hidden_layer_sizes=(100,), random_state=1, activation='relu')
cv_model = cross_val_score(clf, X_cvec, y, cv=5)
print ('Cross-validated scores:', cv_model)
print ('Average score:', cv_model.mean())
print ('Standard deviation of score:', cv_model.std())   
clf.fit(feature_matrix, y)
#clf.score(feature_matrix, y)


Cross-validated scores: [0.61864407 0.94915254 0.94915254 0.88034188 0.76923077]
Average score: 0.8333043604230046
Standard deviation of score: 0.12588772547077107


MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=1, shuffle=True,
       solver='sgd', tol=0.0001, validation_fraction=0.1, verbose=False,
       warm_start=False)

The code above is my attempt at a back-propagation neural network. Despite my limited experience with these models, it gave a small improvement: the average accuracy was about 83%, a couple of percentage points higher than the Random Forest and SVM models. Note that this run used cv=5 rather than cv=6, so the scores are not strictly comparable. The processing time was also a bit longer than for the previous models.

In [49]:
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0).fit(X_cvec, y)
cv_model = cross_val_score(clf, X_cvec, y, cv=6)
print ('Cross-validated scores:', cv_model)
print ('Average score:', cv_model.mean())
print ('Standard deviation of score:', cv_model.std())

Cross-validated scores: [0.55102041 0.95918367 0.96938776 0.96938776 0.7755102  0.64285714]
Average score: 0.8112244897959183
Standard deviation of score: 0.16795455478367016


In [50]:
from xgboost import XGBClassifier
model = XGBClassifier(booster='gbtree')
cv_model = cross_val_score(model, X_cvec, y, cv=6)
print ('Cross-validated scores:', cv_model)
print ('Average score:', cv_model.mean())
print ('Standard deviation of score:', cv_model.std())

Cross-validated scores: [0.55102041 0.95918367 0.96938776 0.96938776 0.7755102  0.64285714]
Average score: 0.8112244897959183
Standard deviation of score: 0.16795455478367016


## Use Count Vectorizer from scikit-learn to create features from the job descriptions (summary).
- Examine using count or binary features in the model
- Re-evaluate your models using these. Does this improve the model performance?
- What text features are the most valuable?

In [53]:
cvec_synos = CountVectorizer(stop_words='english', max_features=50)
vectorizers_synos = cvec_synos.fit_transform(predict_df['summary']).toarray()
df_vec_synos  = pd.DataFrame(vectorizers_synos, columns=cvec_synos.get_feature_names())
print (df_vec_synos.shape)
df_vec_synos.head()

(588, 50)


Unnamed: 0,adwords,analysis,analyst,believe,business,cross,customer,data,experience,great,...,team,teams,test,tools,training,upskilling,using,validate,work,working
0,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,1,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
3,0,0,1,0,0,0,0,2,1,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
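Row 3 above has a value of 2 for `data`, so these are word counts. The "binary features" option asks only whether a word is present, which `CountVectorizer` supports through `binary=True`. A minimal sketch on two hypothetical summaries:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical summaries; the second one repeats the word "data".
summaries_demo = ["As a Data Scientist", "Data Analyst working with customer data"]

count_vec = CountVectorizer(stop_words="english")
binary_vec = CountVectorizer(stop_words="english", binary=True)

count_matrix = count_vec.fit_transform(summaries_demo).toarray()
binary_matrix = binary_vec.fit_transform(summaries_demo).toarray()

data_col = count_vec.vocabulary_["data"]
print(count_matrix[:, data_col])   # occurrences of "data" per summary
print(binary_matrix[:, data_col])  # presence/absence of "data" per summary
```

With short summaries like these most counts are already 0 or 1, so the two encodings may give very similar models.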


In [54]:
X_cvec_synos = pd.concat([feature_matrix, df_vec_synos], axis=1)
X_cvec_synos.head()

Unnamed: 0,Scientist,Analyst,Junior,Data Scientist,Data Analyst,adwords,analysis,analyst,believe,business,...,team,teams,test,tools,training,upskilling,using,validate,work,working
0,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,1,1,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,1,1,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [55]:

X_cvec_synos.shape

(588, 55)

In [56]:
model = RandomForestClassifier(n_estimators=100, oob_score=True)
y = predict_df['Above Median']
model.fit(X_cvec_synos, y)
print (model.oob_score_)
cv_model = cross_val_score(model, X_cvec_synos, y, cv=6)
print ('Cross-validated scores:', cv_model)
print ('Average score:', cv_model.mean())
print ('Standard deviation of score:', cv_model.std())
importance_dataframe_big = pd.DataFrame(model.feature_importances_, index = X_cvec_synos.columns, columns=['importance']).sort_values('importance', ascending=False)
importance_dataframe_big.head(20)

1.0
Cross-validated scores: [0.71428571 1.         1.         1.         1.         1.        ]
Average score: 0.9523809523809524
Standard deviation of score: 0.10647942749999


Unnamed: 0,importance
Junior,0.136972
scientist,0.111309
analyst,0.107953
Data Scientist,0.100571
Scientist,0.090508
Analyst,0.09036
migration,0.0509
test,0.034771
team,0.03432
Data Analyst,0.032006


After using CountVectorizer on the description (job summary), the average accuracy of the model increased from 81% to 95%, which is a significant boost.

In [57]:
from sklearn import linear_model
log_reg = linear_model.LogisticRegression()
log_model = log_reg.fit(X_cvec_synos, y)
scores_log = cross_val_score(log_reg, X_cvec_synos, y, cv=6)
print ('Cross-validated scores:', scores_log)
print ('Average score:', scores_log.mean())
print ('Standard deviation of score:', scores_log.std())

Cross-validated scores: [0.57142857 1.         1.         1.         1.         1.        ]
Average score: 0.9285714285714285
Standard deviation of score: 0.15971914124998499


In [58]:
from sklearn.svm import SVC
model_svmrbf = SVC(kernel='rbf')
scores_svm = cross_val_score(model_svmrbf, X_cvec_synos, y, cv=6)
print ('Cross-validated scores:', scores_svm)
print ('Average score:', scores_svm.mean())
print ('Standard deviation of score:', scores_svm.std())
svm_model = model_svmrbf.fit(X_cvec_synos, y)

Cross-validated scores: [0.57142857 1.         1.         1.         1.         0.82653061]
Average score: 0.8996598639455783
Standard deviation of score: 0.1598729914939595


After adding the job summary as a predictor, Random Forest and logistic regression have the highest average accuracy of all the models, at 95% and 93% respectively.
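One way to keep a comparison like this strictly like-for-like is to score every model on the same folds by reusing a single splitter object. The sketch below does that on synthetic data; the `_demo` arrays, the `models` dictionary and the seeds are hypothetical and not taken from the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data: 5 binary features, label driven mostly by the first.
rng = np.random.RandomState(1)
X_demo = rng.randint(0, 2, size=(240, 5))
flip = rng.rand(240) < 0.15
y_demo = np.where(flip, 1 - X_demo[:, 0], X_demo[:, 0])

# One splitter object reused for every model, so all are scored on identical folds.
cv = StratifiedKFold(n_splits=6, shuffle=True, random_state=1)
models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=1),
    "logistic_regression": LogisticRegression(),
    "svm_rbf": SVC(kernel="rbf"),
}

results = {}
for name, estimator in models.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=cv)
    results[name] = (scores.mean(), scores.std())
    print(name, round(scores.mean(), 3), round(scores.std(), 3))
```

Reporting the standard deviation next to the mean helps judge whether a gap of a few points between models is meaningful.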

### Finding the factors that influence pay scale
I will be using a few logistic regressions to determine how the important features affect salary (above or below the median). I will use the count-vectorized job summaries together with the title flags as the feature matrix.

In [64]:
X_cvec_synos.head()

Unnamed: 0,Scientist,Analyst,Junior,Data Scientist,Data Analyst,adwords,analysis,analyst,believe,business,...,team,teams,test,tools,training,upskilling,using,validate,work,working
0,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,1,1,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,1,1,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [94]:
log_reg = linear_model.LogisticRegression()
log_model = log_reg.fit(X_cvec_synos, y)
scores_log = cross_val_score(log_reg, X_cvec_synos, y, cv=6)
print ('Cross-validated scores:', scores_log)
print ('Average score:', scores_log.mean())
print ('Standard deviation of score:', scores_log.std())

Cross-validated scores: [0.57142857 1.         1.         1.         1.         1.        ]
Average score: 0.9285714285714285
Standard deviation of score: 0.15971914124998499


In [95]:
X_cvec_synos.columns

Index(['Scientist', 'Analyst', 'Junior', 'Data Scientist', 'Data Analyst',
       'adwords', 'analysis', 'analyst', 'believe', 'business', 'cross',
       'customer', 'data', 'experience', 'great', 'industry', 'invoice',
       'learn', 'maintaining', 'manager', 'manipulation', 'matching',
       'migration', 'multiple', 'notices', 'organization', 'outstanding',
       'prepare', 'processes', 'processing', 'pull', 'questions',
       'recommendations', 'regulatory', 'responses', 'results', 'review',
       'role', 'science', 'scientist', 'scientists', 'sets', 'skills', 'sound',
       'sources', 'team', 'teams', 'test', 'tools', 'training', 'upskilling',
       'using', 'validate', 'work', 'working'],
      dtype='object')

In [69]:
log_model.coef_

array([[ 1.16762313, -1.86855952, -4.29408255,  1.16762313, -1.30324356,
        -0.05699333,  0.15739409,  0.15185664,  1.62732303, -0.11398666,
         0.29276922,  0.19095372, -0.15347494,  0.72210354,  1.92009225,
        -1.81456477, -0.28617084,  0.29276922, -0.14308542, -0.16095753,
        -0.31973047, -0.21218506,  2.69644364, -0.21218506, -0.16095753,
        -0.24558549, -0.31973047, -0.21218506, -0.05699333, -0.24558549,
        -0.05699333, -0.05699333, -0.31973047, -0.16095753, -0.16095753,
        -0.31973047, -0.05699333, -0.16095753,  0.29276922,  1.51571953,
         0.29276922, -0.49117098, -0.46281589, -0.14308542, -0.21218506,
        -1.16087019,  0.29276922,  2.69644364, -0.24558549,  0.29276922,
         0.29276922, -0.05699333, -0.16095753, -0.16095753,  0.02175484]])
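In scikit-learn, `coef_` for a binary logistic regression has shape (1, n_features), with one entry per column of the training matrix in the same order. The 55 values above therefore line up with the 55 names in `X_cvec_synos.columns`, not with the 50 names from `cvec_synos` alone. A minimal sketch of pairing by position, using only the first five names and values copied from the outputs above:

```python
# First five column names of X_cvec_synos and first five coefficients,
# copied from the outputs above.
columns = ["Scientist", "Analyst", "Junior", "Data Scientist", "Data Analyst"]
coefs = [1.16762313, -1.86855952, -4.29408255, 1.16762313, -1.30324356]

# Pair by position, then sort by absolute size so the strongest effects come first.
paired = sorted(zip(columns, coefs), key=lambda item: abs(item[1]), reverse=True)
for name, value in paired:
    print(f"{name:15s} {value:+.3f}")
```

Sorting by absolute value makes it easier to spot the features that matter most in either direction.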

In [97]:
# coef_ follows the column order of X_cvec_synos (5 title flags, then 50 summary words)
for a, b in zip(X_cvec_synos.columns, log_model.coef_[0]):
    print(a, b)

Scientist 1.1676231275752855
Analyst -1.8685595215526283
Junior -4.294082548685262
Data Scientist 1.1676231275752855
Data Analyst -1.3032435622544511
adwords -0.05699332912642244
analysis 0.1573940868175099
analyst 0.15185664283982742
believe 1.6273230293780976
business -0.11398665825284487
cross 0.2927692247381467
customer 0.19095371871853348
data -0.1534749431879848
experience 0.7221035429569953
great 1.920092254116247
industry -1.8145647692665672
invoice -0.28617083670074733
learn 0.2927692247381467
maintaining -0.14308541835037367
manager -0.160957526921707
manipulation -0.31973046860177096
matching -0.21218506171049786
migration 2.69644363825184
multiple -0.21218506171049786
notices -0.160957526921707
organization -0.2455854906964112
outstanding -0.31973046860177096
prepare -0.21218506171049786
processes -0.05699332912642244
processing -0.2455854906964112
pull -0.05699332912642244
questions -0.05699332912642244
recommendations -0.31973046860177096
regulatory -0.160957526921707
responses -0.16095

With each coefficient paired to its own column, the title flags Junior (-4.29), Analyst (-1.87) and Data Analyst (-1.30) push a posting below the median salary, while Scientist and Data Scientist (+1.17 each) push it above. Among the summary words shown, 'migration' (+2.70) and 'great' (+1.92) are the strongest positive terms and 'industry' (-1.81) is the strongest negative one.

### We'll try with Just the Job Titles to predict

In [87]:
log_reg = linear_model.LogisticRegression()
log_model = log_reg.fit(X_cvec, y)
scores_log = cross_val_score(log_reg, X_cvec, y, cv=6)
print ('Cross-validated scores:', scores_log)
print ('Average score:', scores_log.mean())
print ('Standard deviation of score:', scores_log.std())


Cross-validated scores: [0.55102041 0.95918367 0.96938776 0.96938776 0.7755102  0.64285714]
Average score: 0.8112244897959183
Standard deviation of score: 0.16795455478367016


In [88]:
X_cvec.columns

Index(['Scientist', 'Analyst', 'Junior', 'Data Scientist', 'Data Analyst'], dtype='object')

In [89]:
log_model.coef_

array([[ 1.85690027, -2.36279108, -3.58428017,  1.85690027,  0.64557271]])
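Logistic regression coefficients are on the log-odds scale, so exponentiating one gives the multiplicative change in the odds of "above median" when that flag switches from 0 to 1. The sketch below applies this to the names and values from the two outputs above; the intercept is not shown in the notebook, so these are relative effects only.

```python
import math

# Column names and coefficients copied from the two outputs above.
columns = ["Scientist", "Analyst", "Junior", "Data Scientist", "Data Analyst"]
coefs = [1.85690027, -2.36279108, -3.58428017, 1.85690027, 0.64557271]
coef_by_name = dict(zip(columns, coefs))

# exp(coefficient) is the multiplicative change in the odds of "above median"
# when that flag goes from 0 to 1, holding the other flags fixed.
odds_ratios = {name: math.exp(value) for name, value in coef_by_name.items()}
for name, ratio in odds_ratios.items():
    print(name, round(ratio, 3))

# A "Data Analyst" title sets both the Analyst and Data Analyst flags,
# so its log-odds contribution is the sum of the two coefficients.
data_analyst_log_odds = coef_by_name["Analyst"] + coef_by_name["Data Analyst"]
print(round(data_analyst_log_odds, 4))
```

Because the flags overlap (a Data Analyst title also contains "Analyst"), single coefficients should be read together rather than one at a time.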

In [104]:
# coef_ follows the column order of X_cvec; values rounded to match the array above
for a, b in zip(X_cvec.columns, log_model.coef_[0]):
    print(a, round(b, 8))

Scientist 1.85690027
Analyst -2.36279108
Junior -3.58428017
Data Scientist 1.85690027
Data Analyst 0.64557271


With just the title, the coefficients point in the expected direction: Scientist and Data Scientist are positive (+1.86 each), while Analyst (-2.36) and Junior (-3.58) are negative. Data Analyst has a small positive coefficient (+0.65), but every Data Analyst title also sets the Analyst flag, so the combined effect is still negative (about -1.72). With so few title categories, and with Analyst roles outnumbering Data Scientist roles in the data set, the title-only model is dominated by this Analyst versus Scientist split.