## Analyzing Job Description Data: MBA vs. Data Scientist

* Introduction
  * Data Sources
  * EDA
* Text Cleaning
* TFIDF/Count Vectorization
* Clustering
* Classification with Logistic Regression
* Nearest Neighbors

In [134]:
import sqlalchemy
import pymysql
import pandas as pd
import nltk
import re

### Data Sources

The data we will be analyzing was collected from Indeed.com for _____ major cities in the United States. The Indeed.com job search API was used over a period of _____ in mid-2017. A notebook outlining the methodology and code for collecting and storing the data on a daily basis can be found here _____ for those who are interested.

The total number of unique townships spanning the data is:

In [135]:
from sqlalchemy import create_engine
engine = create_engine('mysql+pymysql://collier:barkley07@dkingpc/jobs',echo=False)

In [136]:
df_data_sci = pd.read_sql_query('SELECT * FROM indeed_datasci',engine)
df_data_sci['category'] = "data_science"

In [137]:
df_mba = pd.read_sql_query('SELECT * FROM indeed_mba',engine)
df_mba['category'] = "mba"

Let's take a quick glance at the data.

Looking at the proportions of job categories in our data, we see that 'mba' jobs are more than twice as common as 'data scientist' jobs.

In [138]:
# df_combined.groupby('category').size()

Let's randomly sample an equal number of postings from each group.

In [139]:
df_data_sci = df_data_sci.sample(n=6000)
df_mba = df_mba.sample(n=6000)

For the sake of genuine classification, we will remove some key identifying words from each category's description, so that the model cannot simply key on the category name.

In [140]:
# df_mba[df_mba['Snippet'].str.contains("mba")].head()

In [141]:
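# Strip the identifying phrases with a case-insensitive regex. A word-by-word
# split cannot match multi-word phrases and would turn each snippet into a list.
stop_mba = ["mba", "Masters of Business Administration"]
pattern_mba = r'\b(?:' + '|'.join(re.escape(s) for s in stop_mba) + r')\b'
df_mba['Snippet'] = df_mba['Snippet'].str.replace(pattern_mba, '', case=False, regex=True)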

In [142]:
# df_mba.head()

In [143]:
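# Same approach as for the MBA snippets: remove whole phrases, ignoring case
stop_datasci = ["data scientist", "data science"]
pattern_datasci = r'\b(?:' + '|'.join(re.escape(s) for s in stop_datasci) + r')\b'
df_data_sci['Snippet'] = df_data_sci['Snippet'].str.replace(pattern_datasci, '', case=False, regex=True)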
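
The phrase removal described above can be sketched on a couple of made-up snippets. This is only an illustration, using the standard `re` module; the snippets and the helper name `remove_phrases` are assumptions, not part of the collected data. It also shows why filtering word by word leaves the identifying phrases in place: single tokens never equal a multi-word phrase, and `"MBA"` does not equal `"mba"`.

```python
import re

# Hypothetical snippets, made up for illustration only
snippets = [
    "MBA preferred. Masters of Business Administration a plus.",
    "Seeking a Data Scientist with data science experience.",
]
stop_phrases = ["mba", "masters of business administration",
                "data scientist", "data science"]

# Word-level filtering: no single token equals a multi-word phrase,
# and the comparison is case-sensitive, so nothing is removed here.
word_level = [w for w in snippets[0].split() if w not in stop_phrases]

# Phrase-level removal: one case-insensitive alternation on word boundaries,
# longest phrases first so that longer matches are preferred.
ordered = sorted(stop_phrases, key=len, reverse=True)
pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b",
    flags=re.IGNORECASE,
)

def remove_phrases(text):
    """Delete every stop phrase, then collapse the leftover whitespace."""
    return re.sub(r"\s+", " ", pattern.sub("", text)).strip()

cleaned = [remove_phrases(s) for s in snippets]
print(word_level)
print(cleaned)
```

Near-variants such as "Master's Degree in Business" are not covered by this list and would need their own entries.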

In [144]:
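# DataFrame.append was removed in pandas 2.0; ignore_index gives every row a unique
# label, which the index-based train/test split further down relies on
df_combined = pd.concat([df_mba, df_data_sci], ignore_index=True)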
# df_combined.head()

In [145]:
df_combined['Snippet'] = df_combined['Snippet'].astype(str)

In [146]:
descriptions = df_combined['Snippet'].tolist()

In [147]:
categories = df_combined['category'].tolist()

### Text Cleaning

Before we can dive into analyzing our text corpora, we need to perform some text cleaning.  First we will remove punctuation and symbols from the text.

In [27]:
stopwords = nltk.corpus.stopwords.words('english')

In [28]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

In [29]:
def tokenize_and_stem(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems

def tokenize_only(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens
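
To make the filtering step in the two functions above concrete, here is a minimal sketch that runs without NLTK. The regex-based `simple_tokenize` is an assumed stand-in for `nltk.word_tokenize` and will not split text identically; only the letter filter (`re.search('[a-zA-Z]', token)`) and the lower-casing mirror the notebook code. The sample sentence is made up.

```python
import re

def simple_tokenize(text):
    # Assumed stand-in for nltk.word_tokenize: alphanumeric runs (optionally
    # joined by - or ') or single punctuation characters.
    return re.findall(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*|[^\sA-Za-z0-9]", text)

def keep_alpha_tokens(tokens):
    # Same rule as in tokenize_only: lower-case, then keep only tokens
    # that contain at least one letter.
    return [t.lower() for t in tokens if re.search('[a-zA-Z]', t)]

text = "PhD level work; 3+ years of C++ (or 1 year of Python)."
tokens = simple_tokenize(text)
filtered = keep_alpha_tokens(tokens)

print(tokens)
print(filtered)
```

Pure numbers and punctuation are dropped, while anything with a letter survives. Note that a token such as `C++` is reduced to `c` by this stand-in tokenizer.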

In [40]:
row['Snippet']

'In Computer Science, Computer or Electrical Engineering, Mathematics, or a related field plus at least 1 year of experience in software development (*Bachelor\x92s degree plus 3 years of progressively responsible software development experience may also be accepted). Phd level work in quantitative field. 1 year of your experience should involve designing and implementing large-scale distributed...'

In [41]:
totalvocab_stemmed = []
totalvocab_tokenized = []
# for i in descriptions:
for idx,row in df_combined.iterrows():
    allwords_stemmed = tokenize_and_stem(row['Snippet']) # tokenize and stem each job snippet
    totalvocab_stemmed.extend(allwords_stemmed) #extend the 'totalvocab_stemmed' list
    allwords_tokenized = tokenize_only(row['Snippet'])
    totalvocab_tokenized.extend(allwords_tokenized)

In [42]:
vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, 
                           index = totalvocab_stemmed)

In [43]:
from sklearn.feature_extraction.text import TfidfVectorizer

#define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                 min_df=0.2, stop_words='english',
                                 use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3))

%time tfidf_matrix = tfidf_vectorizer.fit_transform(descriptions) # fit the vectorizer to the job snippets

print(tfidf_matrix.shape)

CPU times: user 40.4 s, sys: 1.33 s, total: 41.7 s
Wall time: 50.6 s
(12000, 18)
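
The small feature count is consistent with the document-frequency bounds: with `min_df=0.2` and `max_df=0.8`, a term is kept only if it appears in at least 20% and at most 80% of the snippets. The sketch below applies the same kind of filter to a made-up, pre-tokenized corpus in plain Python. The function name and the toy data are assumptions for illustration, not scikit-learn's implementation.

```python
from collections import Counter

# Hypothetical toy corpus, already tokenized
docs = [
    ["data", "analysis", "python"],
    ["data", "sales", "strategy"],
    ["data", "python", "statistics"],
    ["data", "marketing", "strategy"],
    ["data", "python", "sql"],
]

def filter_by_document_frequency(docs, min_df, max_df):
    """Keep terms whose share of documents lies within [min_df, max_df]."""
    n_docs = len(docs)
    # Count each term at most once per document
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return sorted(t for t, c in doc_freq.items()
                  if min_df <= c / n_docs <= max_df)

vocab = filter_by_document_frequency(docs, min_df=0.4, max_df=0.8)
print(vocab)
```

Here `data` is dropped for being in every document and the one-off terms are dropped for being too rare. Lowering `min_df` is the usual way to keep a larger vocabulary.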


In [44]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
terms = tfidf_vectorizer.get_feature_names_out()

In [45]:
from sklearn.metrics.pairwise import cosine_similarity
dist = 1 - cosine_similarity(tfidf_matrix)
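
For reference, the quantity computed above is one minus the cosine similarity of each pair of TF-IDF rows. A minimal pure-Python sketch for a single pair of vectors follows; the vectors are made up, and returning a distance of 1 for an all-zero vector is an assumption of this sketch. Keep in mind that a dense 12000 × 12000 matrix of 64-bit floats occupies roughly 1.15 GB.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption for this sketch: an all-zero vector has similarity 0
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

a = [1.0, 0.0, 2.0]
b = [2.0, 0.0, 4.0]  # same direction as a
c = [0.0, 3.0, 0.0]  # shares no terms with a

print(cosine_distance(a, b))
print(cosine_distance(a, c))
```

Vectors pointing the same way have a distance near 0 regardless of length, and vectors with no terms in common have a distance of 1.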

### K-means clustering

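We need to choose the number of clusters. One common heuristic is the elbow method: fit k-means for a range of cluster counts, plot the within-cluster sum of squares (distortion) against the count, and look for the point where the improvement levels off.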

In [None]:
# from sklearn.cluster import KMeans
# distortions = []
# for i in range(1, 6):
#     km = KMeans(n_clusters=i, 
#                 init='k-means++', 
#                 n_init=10, 
#                 max_iter=300, 
#                 random_state=0)
#     km.fit(tfidf_matrix)
#     distortions.append(km.inertia_)
# plt.plot(range(1, 6), distortions, marker='o')
# plt.xlabel('Number of clusters')
# plt.ylabel('Distortion')
# plt.tight_layout()
# plt.show()

In [None]:
from sklearn.cluster import KMeans

num_clusters = 3

km = KMeans(n_clusters=num_clusters)

%time km.fit(tfidf_matrix)

clusters = km.labels_.tolist()

In [None]:
# sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly
import joblib

joblib.dump(km,  'doc_cluster.pkl')

# km = joblib.load('doc_cluster.pkl')
clusters = km.labels_.tolist()

In [None]:
clusters2 = pd.DataFrame(clusters,columns=["clusters"])

In [None]:
clusters2.groupby("clusters").size()

In [None]:
from __future__ import print_function

print("Top terms per cluster:")
print()
# for each cluster, sort term indices by descending centroid weight
order_centroids = km.cluster_centers_.argsort()[:, ::-1]

for i in range(num_clusters):
    print("Cluster %d words:" % i, end='')
    for ind in order_centroids[i, :6]: # replace 6 with n words per cluster
        # map the first stem of the term back to an unstemmed word (.ix was removed from pandas; use .loc)
        print(' %s' % vocab_frame.loc[terms[ind].split(' ')].values.tolist()[0][0], end=',')
    print() # add whitespace
    print() # add whitespace
        
print()
print()

### Hierarchical Clustering

### K-Nearest Neighbors

In [46]:
# Randomly sample 70% of your dataframe for training data
df_070 = df_combined.sample(frac=0.7)

df_rest_70 = df_combined.loc[~df_combined.index.isin(df_070.index)]

In [47]:
X_train = df_070['Snippet'].values
y_train = df_070['category'].values
X_test = df_rest_70['Snippet'].values
y_test = df_rest_70['category'].values

In [50]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

In [51]:
# creating odd list of K for KNN
myList = list(range(1,50))

# subsetting just the odd ones
# use a list rather than filter(): it is indexed and plotted after the loop below
neighbors = [k for k in myList if k % 2 != 0]

# empty list that will hold cv scores
cv_scores = []

# perform 10-fold cross validation
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())

ValueError: could not convert string to float: 'MBA or Master\x92s Degree in Business, Human Resource Management, or relevant field or job experience is preferred. Work with Marketing to develop library of sales enablement tools to support key stages of the sales process. Pragmatic Marketing and/or Scrum Product Owner training and certification a big plus. Define, own and communicate the multi-year, product strategy and roadmap for the HR...'
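
The `ValueError` above arises because `KNeighborsClassifier` expects numeric features, while `X_train` holds raw strings. One possible fix, sketched here on made-up data, is to put a `TfidfVectorizer` in front of the classifier inside a `Pipeline`, so that each cross-validation fold is vectorized on its own training portion. The toy texts, labels, and the choice of `n_neighbors=3` and `cv=2` are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy snippets, six per category
texts = [
    "develop marketing strategy and manage sales teams",
    "lead business development and financial planning",
    "manage product roadmap and stakeholder budgets",
    "own sales strategy and client relationships",
    "financial analysis and business operations management",
    "marketing leadership and product strategy",
    "build machine learning models in python",
    "statistical modeling and predictive analytics",
    "deep learning and python programming experience",
    "machine learning pipelines and sql queries",
    "predictive models with python and statistics",
    "analytics with sql and machine learning",
]
labels = ["mba"] * 6 + ["data_science"] * 6

# Vectorize the text first, then classify the resulting numeric features
knn_pipeline = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", KNeighborsClassifier(n_neighbors=3)),
])

scores = cross_val_score(knn_pipeline, texts, labels, cv=2, scoring="accuracy")
print(scores.mean())
```

In the notebook, the same pipeline object could be passed to `cross_val_score` in place of the bare `knn`, with `clf__n_neighbors` varied in the loop.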

In [None]:
# changing to misclassification error
MSE = [1 - x for x in cv_scores]

# determining best k
optimal_k = neighbors[MSE.index(min(MSE))]
print("The optimal number of neighbors is %d" % optimal_k)

# plot misclassification error vs k
import matplotlib.pyplot as plt
plt.plot(neighbors, MSE)
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error')
plt.show()

#### Choose K

In [None]:
knn = KNeighborsClassifier(n_neighbors=3)

# fitting the model
knn.fit(X_train, y_train)

# predict the response
pred = knn.predict(X_test)

# evaluate accuracy
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, pred))

### Multidimensional scaling

In [17]:
# import matplotlib.pyplot as plt
# import matplotlib as mpl

# from sklearn.manifold import MDS

# MDS()

# # convert two components as we're plotting points in a two-dimensional plane
# # "precomputed" because we provide a distance matrix
# # we will also specify `random_state` so the plot is reproducible.
# mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)

# pos = mds.fit_transform(dist)  # shape (n_samples, n_components)

# xs, ys = pos[:, 0], pos[:, 1]
# print()
# print()

## Classification Model

In [148]:
# Randomly sample 70% of your dataframe
df_070 = df_combined.sample(frac=0.7)

df_rest_70 = df_combined.loc[~df_combined.index.isin(df_070.index)]

In [149]:
X_train = df_070['Snippet'].values
y_train = df_070['category'].values
X_test = df_rest_70['Snippet'].values
y_test = df_rest_70['category'].values

In [150]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

In [151]:
tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

In [152]:
from nltk.corpus import stopwords

stop = stopwords.words('english')

In [153]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer(text):
    return text.split()


def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

In [154]:
param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]
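
The size of this search can be checked by hand: each dictionary in `param_grid` contributes the product of its list lengths. The sketch below counts the combinations for a grid of the same shape, with placeholder strings standing in for the stop-word list and the tokenizer functions. The helper name `count_candidates` is an assumption.

```python
from itertools import product

def count_candidates(param_grid):
    """Count parameter combinations across a list of grid dictionaries."""
    total = 0
    for grid in param_grid:
        # Cartesian product of all value lists within one dictionary
        total += len(list(product(*grid.values())))
    return total

# Same shape as param_grid above; strings are placeholders for the real objects
toy_grid = [
    {"vect__ngram_range": [(1, 1)],
     "vect__stop_words": ["stop", None],
     "vect__tokenizer": ["tokenizer", "tokenizer_porter"],
     "clf__penalty": ["l1", "l2"],
     "clf__C": [1.0, 10.0, 100.0]},
    {"vect__ngram_range": [(1, 1)],
     "vect__stop_words": ["stop", None],
     "vect__tokenizer": ["tokenizer", "tokenizer_porter"],
     "vect__use_idf": [False],
     "vect__norm": [None],
     "clf__penalty": ["l1", "l2"],
     "clf__C": [1.0, 10.0, 100.0]},
]

n_candidates = count_candidates(toy_grid)
n_fits = n_candidates * 5  # 5-fold cross-validation
print(n_candidates, n_fits)
```

Each dictionary yields 24 combinations, so the search covers 48 candidates and, with five folds, 240 fits.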

In [155]:
# liblinear supports both the 'l1' and 'l2' penalties searched in param_grid
lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0, solver='liblinear'))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)

In [156]:
gs_lr_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  8.1min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed: 26.5min
[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed: 33.2min finished


GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=False, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ...nalty='l2', random_state=0, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid=[{'clf__C': [1.0, 10.0, 100.0], 'vect__tokenizer': [<function tokenizer at 0x118c88488>, <function tokenizer_porter at 0x118c88bf8>], 'vect__ngram_range': [(1, 1)], 'vect__stop_words': [['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'your...sn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn'], None]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
    

In [157]:
print('Best parameter set: %s ' % gs_lr_tfidf.best_params_)
print('CV Accuracy: %.3f' % gs_lr_tfidf.best_score_)

Best parameter set: {'clf__C': 10.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not

In [158]:
clf = gs_lr_tfidf.best_estimator_
print('Test Accuracy: %.3f' % clf.score(X_test, y_test))

Test Accuracy: 0.910
