# Job Fiction - Building a training model

### Goal

Overall goal of this notebook is to build a training model based on JOBFICTION database. We would expect to have a corpus, job classifiers, keywords dictionary and model results. All the results will be persisted and updated with the new jobs being collected.

Based on the input from job seekers i.e. job descriptions submitted we will able to determine 
- job titles closest to the job description or keywords submitted (based on the weights associated)
- closest job posts
- keywords to search for the right job posts

The first part of this notebook will explore how jobs in the JOBFICTION database can be classified. 

**Why do we have to classify the job posts?**

A `truck driver` job post is way different from a `database administrator` job post. With the help of clustering algorithms I would expect we can categorize similar jobs into same class or cluster based purely on the job description. Similar to movie genres this classifier is expected to create job categories based on similarity of job descriptions. We can then study the job titles under the same cluster to see how true clusters. The challenge is defining the number of clusters. 

## Approach
- export job descriptions, job title, company and job id from JOBFICTION database
- tokenizing and stemming each job description
- transforming the corpus into vector space using tf-idf
- calculating cosine distance between each job as a measure of similarity
- clustering the documents using the k-means algorithm
- using multidimensional scaling to reduce dimensionality within the corpus
- plot the clusters
- Hierarchical clustering on the corpus using [Ward clustering](http://en.wikipedia.org/wiki/Ward%27s_method)
- plot the clusters with hierarchial claustering
- topic modeling using Latent Dirichlet Allocation (LDA)

## Imports

In [26]:
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from gensim import corpora, models
import gensim
import nltk
import re
import pandas as pd

## 1. Export data from JOBFICTION database

Let's extract jobs from JOBFICTION database

In the jobs table, job description is an array of sentences. In order to export job description, this mongo javasript will be run to combine array elements as a string. For traceback we will add __id field to every record.

In [2]:
%%writefile export_jobs_with_title.js
db.jobs.find({}, { _id: 1, jobtitle: 1, company: 1, summary: 1}).forEach( function (x) 
    {     
        var jobdesc = '';
        x.summary.forEach( function (y) { 
            jobdesc += y.replace(/\n/g, ' '); 
        });     
        print(x._id + "!@#" + x.jobtitle + "!@#" + x.company + "!@#" + jobdesc);
    });

Overwriting export_jobs_with_title.js


In [13]:
!mkdir ./data

mkdir: cannot create directory ‘./data’: File exists


** Run export script to dump data to text file **

In [14]:
!time mongo JOBFICTION --quiet export_jobs_with_title.js > ./data/export_jobs_w_title.txt


real	0m7.431s
user	0m4.135s
sys	0m0.640s


In [3]:
!wc -l /home/rt/wrk/jobs/export_jobs_w_title.txt
!head -1 /home/rt/wrk/jobs/export_jobs_w_title.txt

575279 /home/rt/wrk/jobs/export_jobs_w_title.txt
indeed_2248d91cabe9f4c1!@#Records Clerk I!@#Swedish Covenant Hospital!@#To name, scan and file medical records in our EMR, eClinical Works (eCW) as directed in order to maintain the organization of the departmentâs daily operation. Back up inbound and outbound faxing needs, mail drop off and collection and to Foster Medical Pavilion post box as well as main hospital mail room.  RESPONSIBILITIES  Essential Functions   Demonstrates a commitment to the mission of Swedish Covenant Hospital and  demonstrates a service orientation and adheres to all responsibilities and standards  of the Hospital.  1. Names, scans and files medical records into patient charts in eCW following accepted naming conventions and protocols on where various medical records are filed in the chart. This is high volume work.  2. Works with incoming faxes as well as outgoing faxes ensuring they reach the appropriate provider or area following standardized protocols in 

## 2. Create training data set

We will just export top 100000 job descriptions (remember it's only driver related job posts) and let's build LDA model. Once LDA model is built, we will perform online training for new job posts and keep updating corpus.

In [7]:
!head -100000 /home/rt/wrk/jobs/export_jobs_w_title.txt | awk -F'!@#' '{print $4}' > /home/rt/wrk/jobs/train.txt

In [9]:
!head -100000 /home/rt/wrk/jobs/export_jobs_w_title.txt | awk -F'!@#' 'BEGIN{OFS="|"}{print $1, $2, $3}' > /home/rt/wrk/jobs/train_labels.txt

In [11]:
!head /home/rt/wrk/jobs/train_labels.txt

indeed_2248d91cabe9f4c1|Records Clerk I|Swedish Covenant Hospital
indeed_6ed966da9f33ffc1|Associate|Potbelly Sandwich Shop
indeed_8c97fa9b508897d5|MAIL HANDLER ASSISTANT|United States Postal Service
indeed_8a31ad4b101a0017|In-Airport Sales Representative-Midway|Stratmar Retail Services
indeed_7896face148b0248|Public Aid Investigator Trainee|State of Illinois
indeed_7faa4d3e581b1fbf|Installation Technician|Comcast-Xfinity
indeed_971fe05d25827953|Natural Resources Advanced Specialist - Opt 2|State of Illinois
indeed_7925061549fda745|Office Dispatcher|Allan E Power Plumbing Heating and Cooling
indeed_676c8a0ba858e943|Customer Service Associate|Lowes Home Improvment
indeed_fcae39b731c906e9|Macy's Water Tower Place, Chicago, IL: Merchandise Team Manager|Macy's


## 3. Cleansing Data - Stop words, Tokenizing and Stemming

Let's define few functions before we take off

In [4]:
# replace forward and back slash, hyphen, underscores and other characters
def preprocess(text):
    return text.replace('/', ' ').replace('\\', ' ').replace('_', ' ').replace('-', ' ').decode("utf-8")

In [8]:
# define a tokenizer and stemmer to returns the set of stems in the text passed

def tokenize_and_stem(text):
    # tokenize by sentence, then by word to catch any punctuations
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out tokens not containing alphanumeric
    for token in tokens:
        if re.search('[a-zA-Z0-9]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems


def tokenize_only(text):
    # tokenize by sentence, then by word to catch any punctuations
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out tokens not containing alphanumeric
    for token in tokens:
        if re.search('[a-zA-Z0-9]', token):
            filtered_tokens.append(token)
    return filtered_tokens

In [24]:
# create p_stemmer of class PorterStemmer
#stemmer = PorterStemmer()

# create p_stemmer of class SnowballStemmer
stemmer = SnowballStemmer("english")

### Read training data

In [13]:
# compile training docs into a list
train = [ preprocess(line) for line in open('/home/rt/wrk/jobs/train.txt', 'r') ]

In [18]:
# compile training labels for tracking and debugging purposes only
train_labels = [ line.strip('\n').split('|') for line in open('/home/rt/wrk/jobs/train_labels.txt', 'r') ]

In [19]:
train_labels[0]

['indeed_2248d91cabe9f4c1', 'Records Clerk I', 'Swedish Covenant Hospital']

### Tokenizing and Stemming

In [25]:
vocab_stemmed = []
vocab_tokenized = []

for jobdesc in train:
    stemmed = tokenize_and_stem(jobdesc) 
    vocab_stemmed.extend(stemmed)
    
    tokenized = tokenize_only(jobdesc)
    vocab_tokenized.extend(tokenized)

In [28]:
df_vocab = pd.DataFrame({'words': vocab_tokenized}, index = vocab_stemmed)
print 'there are ' + str(df_vocab.shape[0]) + ' items in vocab_frame'

there are 29321418 items in vocab_frame


In [30]:
print df_vocab.head(20)

              words
to               to
name           name
scan           scan
and             and
file           file
medic       medical
record      records
in               in
our             our
emr             emr
eclin     eclinical
work          works
ecw             ecw
as               as
direct     directed
in               in
order         order
to               to
maintain   maintain
the             the


## 4. TF-IDF and Document Similarity

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer

#define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(
    max_df=0.8, 
    max_features=200000,
    min_df=0.2, 
    stop_words='english',
    use_idf=True, 
    tokenizer=tokenize_and_stem, 
    ngram_range=(1, 4)
)

%time tfidf_matrix = tfidf_vectorizer.fit_transform(train) #fit the vectorizer to synopses
print(tfidf_matrix.shape)

MemoryError: 

In [20]:
tokenizer = RegexpTokenizer(r'\w+')

# create english stop words list
en_stop = get_stop_words('en')

    

# list for tokenized documents in loop
texts = []

# loop through document list
for i in doc_set:
    
    # clean and tokenize document string
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)

    # remove stop words from tokens
    stopped_tokens = [i for i in tokens if not i in en_stop]
    
    # stem tokens
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
    
    # add tokens to list
    texts.append(stemmed_tokens)

# turn our tokenized documents into a id <-> term dictionary
dictionary = corpora.Dictionary(texts)
    
# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]

# generate LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=10, id2word = dictionary, passes=20)


[(0, u'0.014*requir + 0.014*custom + 0.012*work + 0.011*must + 0.010*driver + 0.009*deliveri + 0.009*job + 0.008*abil + 0.007*s + 0.007*time'), (1, u'0.040*driver + 0.038*pay + 0.032*truck + 0.019*offer + 0.019*transport + 0.017*paid + 0.017*compani + 0.016*weekli + 0.016*drive + 0.015*cpm'), (2, u'0.037*driver + 0.017*team + 0.016*year + 0.014*truck + 0.014*per + 0.014*mile + 0.011*paid + 0.010*compani + 0.010*000 + 0.009*pay'), (3, u'0.040*truck + 0.035*driver + 0.019*pay + 0.018*roehl + 0.015*ll + 0.014*get + 0.013*drive + 0.013*home + 0.013*cdl + 0.013*train'), (4, u'0.021*haul + 0.020*oper + 0.018*owner + 0.015*move + 0.015*graebel + 0.014*high + 0.014*ca + 0.011*revenu + 0.011*long + 0.011*opportun'), (5, u'0.022*oper + 0.020*owner + 0.018*program + 0.016*knight + 0.015*driver + 0.015*year + 0.014*fuel + 0.013*truck + 0.013*leas + 0.011*busi'), (6, u'0.045*driver + 0.041*pay + 0.026*mile + 0.024*barr + 0.024*nunn + 0.021*per + 0.019*truck + 0.018*cdl + 0.018*benefit + 0.017*hire'

In [23]:
for topic, keywords in ldamodel.print_topics(num_topics=10, num_words=5):
    print topic, keywords

0 0.014*requir + 0.014*custom + 0.012*work + 0.011*must + 0.010*driver
1 0.040*driver + 0.038*pay + 0.032*truck + 0.019*offer + 0.019*transport
2 0.037*driver + 0.017*team + 0.016*year + 0.014*truck + 0.014*per
3 0.040*truck + 0.035*driver + 0.019*pay + 0.018*roehl + 0.015*ll
4 0.021*haul + 0.020*oper + 0.018*owner + 0.015*move + 0.015*graebel
5 0.022*oper + 0.020*owner + 0.018*program + 0.016*knight + 0.015*driver
6 0.045*driver + 0.041*pay + 0.026*mile + 0.024*barr + 0.024*nunn
7 0.032*driver + 0.020*drive + 0.017*compani + 0.014*hazmat + 0.014*year
8 0.023*year + 0.019*driver + 0.016*cdl + 0.015*month + 0.015*class
9 0.036*driver + 0.022*pay + 0.018*compani + 0.016*flatb + 0.014*offer


### 4. Let's TEST the LDA model with new job description

In [26]:
!tail -1 ./data/export.txt | awk -F'!@#' '{print $2}' > ./data/test.txt
!cat ./data/test.txt

Now Hiring Company Truck Drivers. At Transport America We Raised Pay!  Company Truck Driver Benefits: Top 10% Industry Pay Year-Round Steady Freight Performance Pay - Experienced Drivers Earn Top Scale in 2 Years Flexible Home Time, Including Get Home Certificates 24/7 Support, 365 Days A Year Pick Your Schedule Option Lease Purchase Options Day 1 Medical/Dental/Vision/Disability Benefits Package Transfer Opportunities Available E-Logs and an InCab Communication Hub Roll Stability and OnGuard System CSA Safe Carrier New Fleet of Equipment- New Kenworths In Delivery  At Transport America, our goal is to deliver excellence in all that we do. At a time when others are moving to asset lite models, we are committed to running assets in networks, which gives you reliable capacity with an excellence of service unsurpassed in the transportation industry. We are big enough to create meaningful solutions, but small enough to provide you the level of customer service you deserve. We believe in hi

In [51]:
# compile sample documents into a list
test_set = [ preprocess(line) for line in open('./data/test.txt', 'r') ]

# list for tokenized documents in loop
texts = []

# loop through document list
for i in test_set:
    
    # clean and tokenize document string
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)

    # remove stop words from tokens
    stopped_tokens = [i for i in tokens if not i in en_stop]
    
    # stem tokens
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
    
    # add tokens to list
    texts.append(stemmed_tokens)

# turn our tokenized documents into a id <-> term dictionary
dictionary = corpora.Dictionary(texts)
    
# convert tokenized documents into a document-term matrix
bow = [dictionary.doc2bow(text) for text in texts]

** Let's print topics from the model **

In [57]:
for topic, keywords in ldamodel.print_topics(num_topics=10, num_words=5):
    print topic, keywords

0 0.014*requir + 0.014*custom + 0.012*work + 0.011*must + 0.010*driver
1 0.040*driver + 0.038*pay + 0.032*truck + 0.019*offer + 0.019*transport
2 0.037*driver + 0.017*team + 0.016*year + 0.014*truck + 0.014*per
3 0.040*truck + 0.035*driver + 0.019*pay + 0.018*roehl + 0.015*ll
4 0.021*haul + 0.020*oper + 0.018*owner + 0.015*move + 0.015*graebel
5 0.022*oper + 0.020*owner + 0.018*program + 0.016*knight + 0.015*driver
6 0.045*driver + 0.041*pay + 0.026*mile + 0.024*barr + 0.024*nunn
7 0.032*driver + 0.020*drive + 0.017*compani + 0.014*hazmat + 0.014*year
8 0.023*year + 0.019*driver + 0.016*cdl + 0.015*month + 0.015*class
9 0.036*driver + 0.022*pay + 0.018*compani + 0.016*flatb + 0.014*offer


** Let's see what topis test document belongs to **

In [56]:
for topics in ldamodel[bow]:
    print topics

[(0, 0.85813606763953532), (2, 0.038124342565039944), (3, 0.052171292370160209), (7, 0.011156941430638388), (9, 0.03004651314823252)]


** So the test document belongs to topics 0, 2, 3, 7 and 9 **

In [58]:
print test_set

[u"Now Hiring Company Truck Drivers. At Transport America We Raised Pay!  Company Truck Driver Benefits: Top 10% Industry Pay Year Round Steady Freight Performance Pay   Experienced Drivers Earn Top Scale in 2 Years Flexible Home Time, Including Get Home Certificates 24 7 Support, 365 Days A Year Pick Your Schedule Option Lease Purchase Options Day 1 Medical Dental Vision Disability Benefits Package Transfer Opportunities Available E Logs and an InCab Communication Hub Roll Stability and OnGuard System CSA Safe Carrier New Fleet of Equipment  New Kenworths In Delivery  At Transport America, our goal is to deliver excellence in all that we do. At a time when others are moving to asset lite models, we are committed to running assets in networks, which gives you reliable capacity with an excellence of service unsurpassed in the transportation industry. We are big enough to create meaningful solutions, but small enough to provide you the level of customer service you deserve. We believe in