## QUESTION 2: Factors that distinguish job category
Using the job postings you scraped for part 1 (or potentially new job postings from a second round of scraping), identify features in the data related to job postings that can distinguish job titles from each other. There are a variety of interesting ways you can frame the target variable, for example:

- What components of a job posting distinguish data scientists from other data jobs?

- What features are important for distinguishing junior vs. senior positions?

- Do the requirements for titles vary significantly with industry (e.g. healthcare vs. government)?

You may end up making multiple classification models to tackle different questions. Be sure to clearly explain your hypotheses and framing, any feature engineering, and what your target variables are. The type of classification model you choose is up to you. Be sure to interpret your results and evaluate your models' performance.

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style('whitegrid')
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

In [2]:
df = pd.read_csv('./salary_df_car_fut.csv')

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,Company,Title,Address,Emp_type,Seniority,Industry,Salary,Responsibility,Requirements
0,0,SKILLSFUTURE SINGAPORE AGENCY,"Executive, (Quality Management Division) (6-mo...","ONE MARINA BOULEVARD, 1 MARINA BOULEVARD 018989",Contract,NONE,Public / Civil Service,NONE,Roles & ResponsibilitiesResponsibilities\r\n\r...,NONE
1,1,HP PPS ASIA PACIFIC PTE. LTD.,Business Analyst,1 DEPOT CLOSE 109841,"Permanent, Full Time",Professional,Others,"$8,400to$12,900",Roles & ResponsibilitiesHP is the world’s lead...,RequirementsEducation and Experience Required:...
2,2,SHOPEE SINGAPORE PRIVATE LIMITED,Software Engineer,"GALAXIS, 1 FUSIONOPOLIS PLACE 138522","Permanent, Full Time",Executive,Information Technology,"$4,400to$8,000",Roles & ResponsibilitiesResponsibilities: De...,RequirementsRequirements: Minimum B.S. degre...
3,3,PRICEWATERHOUSECOOPERS CONSULTING (SINGAPORE) ...,Data Analytics –Manager,"MARINA ONE EAST TOWER, 7 STRAITS VIEW 018936","Permanent, Contract, Full Time",Manager,Consulting,"$6,200to$9,500",Roles & Responsibilities Advisory - Consulting...,Requirements A good Degree in a quantitative ...
4,4,Company Undisclosed,Research Fellow,NONE,"Contract, Full Time","Professional, Executive",Others,"$4,000to$5,000",Roles & ResponsibilitiesData Scientist / Progr...,Requirements Doctorate degree in a relevant fi...


In [4]:
df = df.replace('NONE',np.nan)
df = df.drop(['Unnamed: 0'], axis=1 )
df.head()

Unnamed: 0,Company,Title,Address,Emp_type,Seniority,Industry,Salary,Responsibility,Requirements
0,SKILLSFUTURE SINGAPORE AGENCY,"Executive, (Quality Management Division) (6-mo...","ONE MARINA BOULEVARD, 1 MARINA BOULEVARD 018989",Contract,,Public / Civil Service,,Roles & ResponsibilitiesResponsibilities\r\n\r...,
1,HP PPS ASIA PACIFIC PTE. LTD.,Business Analyst,1 DEPOT CLOSE 109841,"Permanent, Full Time",Professional,Others,"$8,400to$12,900",Roles & ResponsibilitiesHP is the world’s lead...,RequirementsEducation and Experience Required:...
2,SHOPEE SINGAPORE PRIVATE LIMITED,Software Engineer,"GALAXIS, 1 FUSIONOPOLIS PLACE 138522","Permanent, Full Time",Executive,Information Technology,"$4,400to$8,000",Roles & ResponsibilitiesResponsibilities: De...,RequirementsRequirements: Minimum B.S. degre...
3,PRICEWATERHOUSECOOPERS CONSULTING (SINGAPORE) ...,Data Analytics –Manager,"MARINA ONE EAST TOWER, 7 STRAITS VIEW 018936","Permanent, Contract, Full Time",Manager,Consulting,"$6,200to$9,500",Roles & Responsibilities Advisory - Consulting...,Requirements A good Degree in a quantitative ...
4,Company Undisclosed,Research Fellow,,"Contract, Full Time","Professional, Executive",Others,"$4,000to$5,000",Roles & ResponsibilitiesData Scientist / Progr...,Requirements Doctorate degree in a relevant fi...


## Combining all text columns except "Address" and "Title"

In [5]:
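# Concatenate the text columns into a single field.
# Note: a NaN in any column makes the whole row NaN, and no separator is added between fields.
df["text"] = df["Company"] + df["Emp_type"] + df["Seniority"] + df["Industry"] + df["Salary"] + df["Responsibility"] + df["Requirements"]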
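The concatenation above drops every row with a missing field and fuses neighbouring fields together (the cleaned output later shows tokens such as `ltdpermanent`). A minimal sketch of one alternative, assuming pandas is available, fills missing fields with empty strings and joins with a space. The toy frame and its values are hypothetical stand-ins for the scraped postings, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the scraped postings.
toy = pd.DataFrame({
    "Company": ["ACME PTE. LTD.", "Company Undisclosed"],
    "Emp_type": ["Permanent", "Contract"],
    "Seniority": ["Executive", np.nan],
})
text_cols = ["Company", "Emp_type", "Seniority"]

# Plain "+" propagates NaN and inserts no separator between fields.
fused = toy["Company"] + toy["Emp_type"] + toy["Seniority"]

# Fill missing fields with "" and join the fields with a single space.
joined = (
    toy[text_cols]
    .fillna("")
    .apply(lambda row: " ".join(row), axis=1)
    .str.strip()
)

print(fused.tolist())
print(joined.tolist())
```

Keeping rows with missing fields changes the sample size, so any counts or scores reported for the original approach would need to be recomputed if this variant were adopted.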

In [6]:
df["text"].isnull().sum()
# keep only rows with text; .copy() avoids modifying a view of df in later cells
df_clean = df[~(df["text"].isnull())][['text','Title']].copy()

df_clean.isnull().sum()

text     0
Title    0
dtype: int64

In [7]:
# remove punctuation and lowercase the text
import re
df_clean["text"] = df_clean["text"].map(lambda x: re.sub('[^ a-zA-Z0-9]', '', x).lower())
df_clean["text"].values

array(['hp pps asia pacific pte ltdpermanent full timeprofessionalothers8400to12900roles  responsibilitieshp is the worlds leading personal systems and printing company we create technology that makes life better for everyone everywhere our innovation springs from a team of individuals each collaborating and contributing their own perspectives knowledge and experience to advance the way the world works and lives we are looking for visionaries like you who are ready to make a purposeful impact on the way the world works  at hp the future is yours to create if you are our business analyst in singapore you will have a chance to   provide analysis to support businessteams needs by ensuring data integrity  accuracy be the interface to sales operations be the interface to external analyst canalys  idc establish  enable quarterly published reports drive regular  scheduled review of analysis with team  cross team members update market share quarterly  provide insights small and medium business
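The pattern used above deletes punctuation and line breaks outright, so words separated only by `\r\n` or a hyphen are glued together. A small sketch of a variant that replaces each run of non-alphanumeric characters with a single space is shown below. The helper name `clean_text` and the sample string are illustrative assumptions.

```python
import re

def clean_text(raw):
    """Lowercase the text and turn each run of non-alphanumerics into one space."""
    spaced = re.sub(r"[^a-zA-Z0-9]+", " ", raw)
    return spaced.strip().lower()

# Hypothetical sample with an ampersand, line breaks, and a hyphen.
sample = "Roles & Responsibilities\r\n\r\nData-driven analysis"

# Original approach: characters are removed, so neighbouring words fuse.
print(re.sub("[^ a-zA-Z0-9]", "", sample).lower())
# Variant: separators become spaces, so word boundaries survive.
print(clean_text(sample))
```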

In [8]:
df_clean.head()

Unnamed: 0,text,Title
1,hp pps asia pacific pte ltdpermanent full time...,Business Analyst
2,shopee singapore private limitedpermanent full...,Software Engineer
3,pricewaterhousecoopers consulting singapore pt...,Data Analytics –Manager
4,company undisclosedcontract full timeprofessio...,Research Fellow
6,jpmorgan chase bank nafull timeexecutiveinform...,"Application Support Analyst, Reference Data Team"


In [9]:
import pandas as pd, seaborn as sns, numpy as np, matplotlib.pyplot as plt

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer 
from sklearn.model_selection import learning_curve
from sklearn.model_selection import train_test_split, cross_val_score, ShuffleSplit
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB
from sklearn.linear_model import LogisticRegression

from sklearn.pipeline import Pipeline

%matplotlib inline

sns.set_style("darkgrid")

In [11]:
cvt      =  CountVectorizer( strip_accents='unicode', stop_words="english", min_df=51)
X_all    =  cvt.fit_transform(df_clean['text'])
columns  =  np.array(cvt.get_feature_names()) 

X_all


<1283x908 sparse matrix of type '<type 'numpy.int64'>'
	with 136877 stored elements in Compressed Sparse Row format>

In [13]:
def get_freq_words(sparse_counts, columns):
    # sparse_counts is a sparse matrix, so sum() returns a 'matrix' datatype ...
    #   which we then convert into a 1-D ndarray for sorting
    word_counts = np.asarray(sparse_counts.sum(axis=0)).reshape(-1)

    # argsort() returns smallest first, so we reverse the result
    largest_count_indices = word_counts.argsort()[::-1]

    # pretty-print the results! Remember to always ask whether they make sense ...
    freq_words = pd.Series(word_counts[largest_count_indices], 
                           index=columns[largest_count_indices])

    return freq_words


freq_words = get_freq_words(X_all, columns)
freq_words[:20]

data            4960
experience      4397
business        4067
team            2458
management      2441
work            2203
skills          2175
analytics       1723
requirements    1691
development     1634
strong          1543
knowledge       1479
support         1460
working         1381
project         1291
solutions       1258
ability         1243
analysis        1198
design          1187
years           1141
dtype: int64

In [67]:
cvt = CountVectorizer(stop_words="english",ngram_range=(1,3))
X_all = cvt.fit_transform(df_clean['text'])
columns  =  np.array(cvt.get_feature_names())

freq_words = get_freq_words(X_all, columns)
print(freq_words.shape)
freq_words.head()



(363334L,)


data          4960
experience    4397
business      4067
team          2458
management    2441
dtype: int64
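With `ngram_range=(1,3)` the vectorizer counts every run of one, two, or three consecutive tokens, which is why the vocabulary grows to 363,334 columns for only 1,283 postings. The sketch below illustrates the idea in plain Python. It is a simplification (no stop-word removal or tokenization rules), and `word_ngrams` is a hypothetical helper, not a scikit-learn function.

```python
def word_ngrams(tokens, n_min=1, n_max=3):
    """Return all contiguous word n-grams with n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = "strong data analysis skills".split()
grams = word_ngrams(tokens)

print(len(grams))  # 4 unigrams + 3 bigrams + 2 trigrams = 9
print(grams)
```

Most of these longer n-grams occur in a single posting, so a `min_df` threshold like the one used in the first vectorizer is one way to keep the feature matrix manageable.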

In [33]:
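# lowercase the titles, then flag rows by substring match on the title:
#   Data_Analyst is 1 for any title containing 'analyst' (e.g. 'business analyst'),
#   Data_Scientist is 1 for any title containing 'scientist'
df_clean['Title'] = df_clean['Title'].astype('str').str.lower()
df_clean['Data_Analyst'] = df_clean['Title'].str.contains('analyst')*1
df_clean['Data_Scientist'] = df_clean['Title'].str.contains('scientist')*1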
df_clean.head()

Unnamed: 0,text,Title,Data_Analyst,Data_Scientist
1,hp pps asia pacific pte ltdpermanent full time...,business analyst,1,0
2,shopee singapore private limitedpermanent full...,software engineer,0,0
3,pricewaterhousecoopers consulting singapore pt...,data analytics –manager,0,0
4,company undisclosedcontract full timeprofessio...,research fellow,0,0
6,jpmorgan chase bank nafull timeexecutiveinform...,"application support analyst, reference data team",1,0
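The two flags above are independent binary targets built from substring matches. If a single multi-class target were preferred, one possible framing is sketched below. The function name `label_title`, the category names, and the precedence (scientist before analyst) are assumptions for illustration only.

```python
import re

def label_title(title):
    """Map a job title to one coarse category using whole-word matches."""
    lowered = title.lower()
    if re.search(r"\bscientist\b", lowered):
        return "scientist"
    if re.search(r"\banalyst\b", lowered):
        return "analyst"
    return "other"

# Titles taken from the preview rows above, plus one hypothetical title.
for title in ["Business Analyst", "Software Engineer",
              "Research Fellow", "Senior Data Scientist"]:
    print(title, "->", label_title(title))
```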


## Using Logistic Regression


In [68]:
df_clean['Data_Analyst'].value_counts()

0    916
1    367
Name: Data_Analyst, dtype: int64

In [69]:
df_clean['Data_Scientist'].value_counts()

0    1174
1     109
Name: Data_Scientist, dtype: int64

### LogReg for "Data Analyst" 

In [70]:
#Running logreg for Data Analyst
X = X_all
y = df_clean['Data_Analyst']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [71]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [72]:
y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))

Accuracy of logistic regression classifier on test set: 0.86
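Accuracy alone does not say which words separate the classes. A hedged sketch of how the coefficients of a fitted logistic regression could be mapped back to vocabulary terms is given below. It runs on a tiny made-up corpus, so the terms it prints say nothing about the real postings. Feature names are rebuilt from `vocabulary_` so the sketch does not depend on a particular scikit-learn version.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up corpus: 1 = scientist-like posting, 0 = other.
docs = [
    "build machine learning models in python",
    "research deep learning algorithms and statistics",
    "train predictive models using python and statistics",
    "prepare excel reports for business stakeholders",
    "maintain dashboards and business reports",
    "gather requirements and prepare excel dashboards",
]
is_scientist = [1, 1, 1, 0, 0, 0]

vec = CountVectorizer(stop_words="english")
X_toy = vec.fit_transform(docs)
clf = LogisticRegression(solver="liblinear").fit(X_toy, is_scientist)

# vocabulary_ maps term -> column index; sorting by index recovers column order.
names = np.array(sorted(vec.vocabulary_, key=vec.vocabulary_.get))
order = np.argsort(clf.coef_[0])  # most negative first, most positive last

print("push toward class 0:", list(names[order[:3]]))
print("push toward class 1:", list(names[order[-3:]]))
```

With raw counts as features, coefficient size depends on how often a term occurs, so rankings like this are a starting point for interpretation rather than a measure of importance.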


### Running 10 fold Cross Validation for "Data Analyst"

In [73]:
from sklearn import model_selection
from sklearn.model_selection import cross_val_score
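# random_state only has an effect when shuffle=True; without shuffling it is ignored
# (and rejected by newer scikit-learn releases), so it is omitted here
kfold = model_selection.KFold(n_splits=10)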
modelCV = LogisticRegression()
scoring = 'accuracy'
results = model_selection.cross_val_score(modelCV, X_train, y_train, cv=kfold, scoring=scoring)
print("10-fold cross validation average accuracy: %.3f" % (results.mean()))

10-fold cross validation average accuracy: 0.864
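The cross-validation above runs on a matrix whose vocabulary was learned from every posting, including rows that end up in the validation folds. One way to keep vocabulary fitting inside each fold, sketched here on a made-up corpus, is to wrap the vectorizer and classifier in a `Pipeline`. The documents, labels, and the choice of a stratified 3-fold split are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Made-up corpus: 1 = analyst-like posting, 0 = other.
docs = [
    "analyse sales data and prepare reports",
    "business analysis and stakeholder reporting",
    "prepare dashboards and analyse market data",
    "reporting and data analysis for finance",
    "analyse customer data and build reports",
    "market analysis and quarterly reporting",
    "develop backend services in java",
    "design and deploy cloud infrastructure",
    "write unit tests for backend services",
    "maintain servers and deploy applications",
    "develop mobile applications in java",
    "design network infrastructure and servers",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

pipe = Pipeline([
    ("vec", CountVectorizer(stop_words="english")),  # refit on each training fold
    ("clf", LogisticRegression(solver="liblinear")),
])

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(pipe, docs, labels, cv=cv, scoring="accuracy")
print(scores.mean())
```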


In [74]:
# baseline: class shares in the training set (always predicting 0 would score about 72%)
y_train.value_counts()/len(y_train)*100

0    72.160356
1    27.839644
Name: Data_Analyst, dtype: float64
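The baseline is the accuracy of a model that always predicts the majority class, so a classifier is only useful to the extent that it beats this number. A short worked version of the arithmetic, using the full-data `Data_Analyst` counts printed earlier (916 vs. 367), is shown below. The helper name `majority_baseline` is illustrative.

```python
def majority_baseline(class_counts):
    """Accuracy obtained by always predicting the most frequent class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / float(total)

# Counts from df_clean['Data_Analyst'].value_counts() on the full data.
analyst_counts = {0: 916, 1: 367}
baseline = majority_baseline(analyst_counts)

print(round(baseline, 3))  # 916 / 1283 = 0.714
```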

### Confusion Matrix and Score for "Data Analyst"

In [41]:
from sklearn.metrics import confusion_matrix
conmat = np.array(confusion_matrix(y_test, y_pred, labels=[0,1]))

confusion = pd.DataFrame(conmat, index=[0,1],
                         columns=['predicted_0','predicted_1'])
print(confusion)
# precision, recall and f1
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

   predicted_0  predicted_1
0          245           23
1           37           80
             precision    recall  f1-score   support

          0       0.87      0.91      0.89       268
          1       0.78      0.68      0.73       117

avg / total       0.84      0.84      0.84       385



### LogReg for "Data Scientist" 

In [75]:
#Running logreg for Data Scientist
X = X_all
y = df_clean['Data_Scientist']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [76]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [77]:
y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))

Accuracy of logistic regression classifier on test set: 0.97


### Running 10 fold Cross Validation for "Data Scientist"

In [78]:
from sklearn import model_selection
from sklearn.model_selection import cross_val_score
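# random_state only has an effect when shuffle=True; without shuffling it is ignored
# (and rejected by newer scikit-learn releases), so it is omitted here
kfold = model_selection.KFold(n_splits=10)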
modelCV = LogisticRegression()
scoring = 'accuracy'
results = model_selection.cross_val_score(modelCV, X_train, y_train, cv=kfold, scoring=scoring)
print("10-fold cross validation average accuracy: %.3f" % (results.mean()))

10-fold cross validation average accuracy: 0.947


In [79]:
# baseline: class shares in the training set (always predicting 0 would score about 90%)
y_train.value_counts()/len(y_train)*100

0    90.200445
1     9.799555
Name: Data_Scientist, dtype: float64

### Confusion Matrix and Score for "Data Scientist"

In [80]:
from sklearn.metrics import confusion_matrix
conmat = np.array(confusion_matrix(y_test, y_pred, labels=[0,1]))

confusion = pd.DataFrame(conmat, index=[0,1],
                         columns=['predicted_0','predicted_1'])
print(confusion)
# precision, recall and f1
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

   predicted_0  predicted_1
0          362            2
1           10           11
             precision    recall  f1-score   support

          0       0.97      0.99      0.98       364
          1       0.85      0.52      0.65        21

avg / total       0.97      0.97      0.97       385
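With only 21 positive rows in the test set, overall accuracy is dominated by the negative class, so the per-class numbers are more informative. The sketch below recomputes them by hand from the confusion matrix printed above (362, 2, 10, 11). The helper name `binary_metrics` is an assumption.

```python
def binary_metrics(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 for the positive class."""
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / float(tn + fp + fn + tp)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Cells of the "Data Scientist" confusion matrix shown above.
m = binary_metrics(tn=362, fp=2, fn=10, tp=11)
for name in ("accuracy", "precision", "recall", "f1"):
    print("%s: %.2f" % (name, m[name]))
```

The recall of 0.52 means roughly half of the scientist postings in the test set are missed, even though accuracy is 0.97.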



In [65]:
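from sklearn.model_selection import GridSearchCV

# search over penalty type and regularisation strength, scored by ROC AUC
gs_params = {'penalty':['l1','l2'],
             'solver':['liblinear'],
             'C': np.linspace(0.0001, 10, 100)}

logreg_gridsearch = GridSearchCV(logreg, gs_params, cv=10, verbose=1, n_jobs=-1, scoring='roc_auc')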

In [66]:
logreg_gridsearch.fit(X_train, y_train)

Fitting 10 folds for each of 200 candidates, totalling 2000 fits


[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    8.9s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed:  3.2min
[Parallel(n_jobs=-1)]: Done 776 tasks      | elapsed:  6.1min
[Parallel(n_jobs=-1)]: Done 1226 tasks      | elapsed: 10.0min
[Parallel(n_jobs=-1)]: Done 1776 tasks      | elapsed: 14.8min
[Parallel(n_jobs=-1)]: Done 2000 out of 2000 | elapsed: 16.7min finished


GridSearchCV(cv=10, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'penalty': ['l1', 'l2'], 'C': array([1.00000e-04, 1.01109e-01, ..., 9.89899e+00, 1.00000e+01]), 'solver': ['liblinear']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=1)
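Once a grid search has been fit, the selected hyperparameters and the mean cross-validated score are available as attributes. A minimal sketch on synthetic data follows. The dataset from `make_classification` and the much smaller grid are assumptions chosen so the example runs quickly, and the printed values say nothing about the job-posting models.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, not the job postings.
X_toy, y_toy = make_classification(n_samples=120, n_features=10, random_state=0)

param_grid = {"penalty": ["l1", "l2"], "C": np.logspace(-2, 1, 4)}
search = GridSearchCV(LogisticRegression(solver="liblinear"),
                      param_grid, cv=3, scoring="roc_auc")
search.fit(X_toy, y_toy)

print(search.best_params_)  # hyperparameters with the highest mean CV score
print(search.best_score_)   # that mean cross-validated ROC AUC
best_model = search.best_estimator_  # refit on all of X_toy with best_params_
```

A logarithmic grid for `C` usually covers the useful range with far fewer candidates than a dense linear grid, which shortens the search considerably.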

In [51]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

In [52]:
forest = RandomForestClassifier(class_weight='balanced')

gs_params = {'n_estimators': [500, 1000],
             'max_features':['log2','sqrt'],
             'max_depth': [5, 10, 20]}

RF_gridsearch = GridSearchCV(forest, gs_params, cv=10, verbose=1, n_jobs=-1, scoring='roc_auc')

In [54]:
RF_gridsearch.fit(X_train, y_train)

Fitting 10 folds for each of 12 candidates, totalling 120 fits


[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed: 11.0min finished


GridSearchCV(cv=10, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'n_estimators': [500, 1000], 'max_features': ['log2', 'sqrt'], 'max_depth': [5, 10, 20]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=1)

In [55]:
RF_model = RF_gridsearch.best_estimator_

RF_gridsearch.best_params_

{'max_depth': 20, 'max_features': 'sqrt', 'n_estimators': 1000}

In [56]:
RF_model.score(X_test, y_test)

0.9558441558441558

In [63]:
# predicted labels from the tuned random forest on the held-out set
rf_pred = RF_model.predict(X_test)
print(classification_report(y_test, rf_pred))

             precision    recall  f1-score   support

          0       0.96      0.99      0.98       364
          1       0.70      0.33      0.45        21

avg / total       0.95      0.96      0.95       385

