## Analyzing Job Description Data: MBA vs. Data Scientist

* Introduction
  * Data Sources
  * EDA
* Text Cleaning
* TFIDF/Count Vectorization
* Clustering
* Classification with Logistic Regression
* Nearest Neighbors

In [3]:
import sqlalchemy
import pymysql
import pandas as pd

### Data Sources

The data we will be analyzing was collected from Indeed.com for _____ major cities in the United States. The Indeed.com job search API was queried over a period of _____ in mid-2017. A notebook outlining the methodology and code for collecting and storing the data on a daily basis can be found here _____ for those who are interested.

The total number of unique locations spanned by the data is:

In [283]:
df_combined['Location'].nunique()

522

In [4]:
from sqlalchemy import create_engine
engine = create_engine('mysql+pymysql://collier:barkley07@dkingpc/jobs',echo=False)

In [37]:
df_data_sci = pd.read_sql_query('SELECT * FROM indeed_datasci',engine)
df_data_sci['category'] = "data_science"

In [38]:
df_mba = pd.read_sql_query('SELECT * FROM indeed_mba',engine)
df_mba['category'] = "mba"

Let's take a quick glance at the data.

In [250]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two frames the same way
df_combined = pd.concat([df_mba, df_data_sci])
df_combined.head()

| | Location | Company | Date | Job_Key | Job_Title | Snippet | category |
|---|---|---|---|---|---|---|---|
| 0 | Austin, TX | Kasasa | 2017-05-24 | a44def7958691957 | Executive Strategist | Must be service oriented and willing to selfle... | mba |
| 1 | Austin, TX | Texas Windstorm Insurance Association | 2017-05-25 | 382cc710feb5f483 | Chief Actuary | A bachelors degree, preferably in the areas o... | mba |
| 2 | Austin, TX | Otto Bock Healthcare LP | 2017-05-24 | 568c85b4c93984f3 | Senior Director, Commercial Services | BA in Business, Marketing or related field, MB... | mba |
| 3 | Austin, TX | Otto Bock Healthcare LP | 2017-05-24 | 849868dc1a367512 | Market Manager - Orthotics | Responsible for working with Marketing Communi... | mba |
| 4 | Austin, TX | Otto Bock Healthcare LP | 2017-05-24 | fde27361782c9f0b | Market Manager Prosthetics | Responsible for working with Marketing Communi... | mba |


Looking at the proportions of job categories in our data, we see that 'mba' postings are a little more than twice as common as 'data_science' postings.

In [280]:
df_combined.groupby('category').size()

category
data_science     5804
mba             12039
dtype: int64
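
The class balance matters when reading accuracy figures later on. As a quick check on the counts above, the following sketch computes each category's share and the ratio between the two; the variable names are illustrative and not part of the original notebook.

```python
# Category counts copied from the groupby output above.
counts = {"data_science": 5804, "mba": 12039}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}
ratio = counts["mba"] / counts["data_science"]

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"mba / data_science ratio: {ratio:.2f}")
```

A classifier that always predicts 'mba' would be right roughly two thirds of the time on data with this mix, which is a useful floor when judging accuracy.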

### Text Cleaning

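Before we can dive into analyzing our text corpus, we need to clean it. First we strip HTML markup, lowercase the text, and replace punctuation and other non-word characters with spaces; any emoticons found are appended to the end of the cleaned string.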

In [251]:
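import re

def preprocessor(text):
    # Strip HTML tags
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :) ;-( =D before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, collapse runs of non-word characters into single spaces,
    # then append the emoticons (with any '-' removed)
    text = re.sub(r'[\W]+', ' ', text.lower()) +\
        ' '.join(emoticons).replace('-', '')
    return text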
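
To make the behaviour of `preprocessor` concrete, here is a small self-contained run on a made-up snippet. The sample string is an assumption for illustration and is not taken from the Indeed data; the function is repeated so the block runs on its own.

```python
import re

def preprocessor(text):
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return text

sample = "<p>Must have an MBA :-) and 5+ years!</p>"  # hypothetical input
cleaned = preprocessor(sample)
print(cleaned)  # must have an mba and 5 years :)
```

Note that the emoticon is moved to the end of the string rather than kept in its original position.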

We will also remove stopwords.

In [252]:
import nltk

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/Collier/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [253]:
from nltk.corpus import stopwords

stop = stopwords.words('english')
# Clean each snippet, then drop stopwords (this leaves a list of tokens per row)
df_combined['Snippet'] = df_combined['Snippet'].apply(preprocessor)
df_combined['Snippet'] = df_combined['Snippet'].apply(lambda x: [item for item in x.split() if item not in stop])

In [254]:
df_combined['Snippet'] = df_combined['Snippet'].apply(lambda x: ' '.join(x))
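
The split, filter, and re-join steps above can be illustrated on a single string. This is a minimal sketch that uses a tiny stand-in stopword set (an assumption made so the block needs no NLTK download); the notebook itself uses NLTK's English list.

```python
# A tiny stand-in stopword set (hypothetical); the notebook uses NLTK's English list.
stop_small = {"be", "and", "to", "the", "a"}

snippet = "must be service oriented and willing to learn"  # hypothetical cleaned snippet
tokens = [word for word in snippet.split() if word not in stop_small]
filtered = " ".join(tokens)
print(filtered)  # must service oriented willing learn
```

Holding the stopwords in a `set` rather than a list makes each membership test constant time, which can matter when the filter runs over many thousands of snippets.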

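We will also "stem" the words in our corpus, that is, reduce each word to its root form so that variants such as "manager" and "managers" are treated as the same token. The Porter stemmer cells below are left commented out; a Snowball stemmer is set up after them.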

In [255]:
# from nltk.stem.porter import PorterStemmer

# porter = PorterStemmer()

# def tokenizer_porter(text):
#     return [porter.stem(word) for word in text.split()]

In [256]:
# df_combined['Snippet'] = df_combined['Snippet'].apply(tokenizer_porter)

In [318]:
stopwords = nltk.corpus.stopwords.words('english')

In [319]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

In [320]:
# Define a tokenizer and stemmer that returns the list of stems in the text it is passed

def tokenize_and_stem(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems

def tokenize_only(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens

In [326]:
descriptions = df_combined["Snippet"].tolist()
categories = df_combined["category"].tolist()

### Count Vectorizer

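We will also create a count vectorizer, which builds a vocabulary from the snippets and converts each snippet into a vector of raw token counts (a bag-of-words representation). The counts are then reweighted with TF-IDF and normalized.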

In [294]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

In [295]:
count = CountVectorizer()
docs = df_combined['Snippet'].values  # 1-D array of cleaned snippet strings

In [296]:
bag = count.fit_transform(docs.ravel())
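
As a small illustration of what the count vectorizer produces, the sketch below fits `CountVectorizer` on two made-up snippets. The names `toy_docs`, `toy_count`, and `toy_bag` are hypothetical and chosen so they do not clash with the notebook's variables.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["data science data", "mba finance"]  # hypothetical snippets
toy_count = CountVectorizer()
toy_bag = toy_count.fit_transform(toy_docs)  # sparse matrix: one row per document

# vocabulary_ maps each term to its column index (terms are ordered alphabetically)
print(sorted(toy_count.vocabulary_.items(), key=lambda kv: kv[1]))
# Each row holds the raw counts of every vocabulary term in that document
print(toy_bag.toarray())
```

The first row has a count of 2 in the `data` column because the word appears twice in the first snippet.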

In [304]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(use_idf=True, norm=None, smooth_idf=True)
raw_tfidf = tfidf.fit_transform(count.fit_transform(docs.ravel())).toarray()

In [308]:
from sklearn.preprocessing import normalize
raw_tfidf = normalize(raw_tfidf)
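
With `use_idf=True`, `smooth_idf=True`, and `norm=None`, scikit-learn weights each raw count by idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t; `normalize` then rescales each row to unit L2 length. The following pure-Python sketch reproduces that arithmetic on three made-up tokenized documents. It is an illustration of the formula under those settings, not the library's implementation.

```python
import math

# Toy tokenized documents (hypothetical).
docs = [["data", "science", "data"], ["mba", "finance"], ["data", "mba"]]
n_docs = len(docs)
vocab = sorted({term for doc in docs for term in doc})

# Document frequency: number of documents in which each term occurs.
df = {term: sum(term in doc for doc in docs) for term in vocab}
# Smoothed inverse document frequency.
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(doc):
    """Raw count times idf for every vocabulary term, scaled to unit L2 norm."""
    raw = [doc.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

rows = [tfidf_row(doc) for doc in docs]
for doc, row in zip(docs, rows):
    print(doc, [round(value, 3) for value in row])
```

Terms that occur in fewer documents receive a larger idf, so a rare word counts for more than a common word with the same raw count.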

In [309]:
from sklearn.cluster import KMeans

def bipartition(cluster, maxiter=400, num_runs=4, seed=None):
    '''cluster: should be a dictionary containing the following keys
                * dataframe: original dataframe
                * matrix:    same data, in matrix format
                * centroid:  centroid for this particular cluster'''
    data_matrix = cluster['matrix']
    dataframe   = cluster['dataframe']
    # Run k-means on the data matrix with k=2. We use scikit-learn here to simplify workflow.
    # (n_jobs is no longer accepted by KMeans in recent scikit-learn releases, so it is not passed.)
    kmeans_model = KMeans(n_clusters=2, max_iter=maxiter, n_init=num_runs, random_state=seed)
    kmeans_model.fit(data_matrix)
    centroids, cluster_assignment = kmeans_model.cluster_centers_, kmeans_model.labels_
    # Divide the data matrix into two parts using the cluster assignments.
    data_matrix_left_child, data_matrix_right_child = data_matrix[cluster_assignment==0], \
                                                      data_matrix[cluster_assignment==1]
    # Divide the dataframe into two parts, again using the cluster assignments.
    dataframe['cluster_assignment'] = cluster_assignment # minor format conversion
    dataframe_left_child, dataframe_right_child     = dataframe[dataframe['cluster_assignment']==0], \
                                                      dataframe[dataframe['cluster_assignment']==1]
    # Package relevant variables for the child clusters
    cluster_left_child  = {'matrix': data_matrix_left_child,
                           'dataframe': dataframe_left_child,
                           'centroid': centroids[0]}
    cluster_right_child = {'matrix': data_matrix_right_child,
                           'dataframe': dataframe_right_child,
                           'centroid': centroids[1]}
    return (cluster_left_child, cluster_right_child)

In [317]:
wiki_data = {'matrix': raw_tfidf, 'dataframe': df_combined} # no 'centroid' for the root cluster
left_child, right_child = bipartition(wiki_data, maxiter=1, num_runs=1, seed=1)

### Clustering

In [285]:
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances

### Classification Model

In [267]:
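# Randomly sample 50% of the dataframe for training; the remaining rows form the test set
df_050 = df_combined.sample(frac=0.5)

# Caution: df_combined keeps the row labels of both source tables, so labels repeat;
# isin() on the index then also excludes unsampled rows that share a label with a sampled row.
df_rest_50 = df_combined.loc[~df_combined.index.isin(df_050.index)]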

In [273]:
X_train = df_050['Snippet'].values
y_train = df_050['category'].values
X_test = df_rest_50['Snippet'].values
y_test = df_rest_50['category'].values
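
Because the two tables are stacked with their original row labels, each label can occur twice in `df_combined`, and taking the complement by index may then drop more rows than intended. The sketch below shows the effect on toy data and one possible remedy, resetting the index before sampling. The frames and names are hypothetical, and this is an alternative to the split above rather than what the notebook ran.

```python
import pandas as pd

# Two toy frames standing in for df_mba and df_data_sci (hypothetical data).
mba = pd.DataFrame({"Snippet": ["a", "b", "c", "d"], "category": "mba"})
ds = pd.DataFrame({"Snippet": ["e", "f", "g", "h"], "category": "data_science"})

combined = pd.concat([mba, ds])           # row labels 0..3 each appear twice
unique = combined.reset_index(drop=True)  # row labels 0..7, all distinct

# With repeated labels the two halves may not cover every row.
train_dup = combined.sample(frac=0.5, random_state=0)
rest_dup = combined.loc[~combined.index.isin(train_dup.index)]
print(len(train_dup) + len(rest_dup), "of", len(combined), "rows kept")

# With distinct labels the halves partition the data exactly.
train = unique.sample(frac=0.5, random_state=0)
test = unique.loc[~unique.index.isin(train.index)]
print(len(train) + len(test), "of", len(unique), "rows kept")
```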

In [269]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

In [270]:
tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)
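
The parameter grid in the next cell refers to two tokenizer callables, `tokenizer` and `tokenizer_porter`, neither of which is defined in an active cell of this notebook. A minimal sketch of what they might look like follows: `tokenizer_porter` mirrors the commented-out Porter cell earlier on, while the plain whitespace `tokenizer` is an assumption, since its body does not appear in the source.

```python
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer(text):
    # Assumed definition: split on whitespace without stemming.
    return text.split()

def tokenizer_porter(text):
    # Same as the commented-out cell above: split on whitespace, then Porter-stem each word.
    return [porter.stem(word) for word in text.split()]

print(tokenizer("running models"))
print(tokenizer_porter("running models"))
```

Passing both callables to the grid search lets cross-validation decide whether stemming helps on this data.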

In [271]:
param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]

In [272]:
lr_tfidf = Pipeline([('vect', tfidf),
                     # liblinear supports both the 'l1' and 'l2' penalties used in the grid
                     ('clf', LogisticRegression(random_state=0, solver='liblinear'))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)

In [277]:
gs_lr_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits
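
The figures in that log line follow directly from the parameter grid: each of the two dictionaries contributes the product of its option counts, and every candidate is fitted once per fold. A short sketch of the arithmetic, with the option counts read off `param_grid` (the dictionary names here are illustrative):

```python
from math import prod

# Number of options per hyperparameter in each of the two grids above.
grid_1 = {"ngram_range": 1, "stop_words": 2, "tokenizer": 2, "penalty": 2, "C": 3}
grid_2 = {"ngram_range": 1, "stop_words": 2, "tokenizer": 2,
          "use_idf": 1, "norm": 1, "penalty": 2, "C": 3}

candidates = prod(grid_1.values()) + prod(grid_2.values())
fits = candidates * 5  # cv=5 folds per candidate
print(candidates, fits)  # 48 240
```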


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  5.4min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed: 24.1min
[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed: 30.7min finished


GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=False, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ...nalty='l2', random_state=0, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid=[{'vect__stop_words': [['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', '...kenizer at 0x110632b70>, <function tokenizer_porter at 0x10df31d08>], 'clf__penalty': ['l1', 'l2']}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
    

In [278]:
print('Best parameter set: %s ' % gs_lr_tfidf.best_params_)
print('CV Accuracy: %.3f' % gs_lr_tfidf.best_score_)

Best parameter set: {'clf__C': 10.0, 'vect__stop_words': ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'ver

In [279]:
clf = gs_lr_tfidf.best_estimator_
print('Test Accuracy: %.3f' % clf.score(X_test, y_test))

Test Accuracy: 0.924
