## Topic Modeling with Latent Dirichlet Allocation (LDA) - gensim package

### 1. Export data from JOBFICTION database

Let's extract jobs where job title has driver in it. Ideally, all job posts must be classified into multiple categories using K-Means or other unsupervised classification/clustering algorithms.

In the jobs table, job desription is an array of sentences. In order to export job description, this mongo javasript will be run to combine array elements as a string. For traceback we will add __id field to every record.

In [12]:
%%writefile export_job_desc.js
db.jobs.find( { "jobtitle" : /driver/i }, { _id: 1, summary: 1}).forEach( function (x) 
    {     
        var jobdesc = '';
        x.summary.forEach( function (y) { 
            jobdesc += y.replace(/\n/g, ' '); 
        });     
        print(x._id + "!@#" +jobdesc);
    });

Overwriting export_job_desc.js


In [13]:
!mkdir ./data

mkdir: cannot create directory ‘./data’: File exists


** Run export script to dump data to text file **

In [14]:
!time mongo JOBFICTION --quiet export_job_desc.js > ./data/export.txt


real	0m7.431s
user	0m4.135s
sys	0m0.640s


In [16]:
!wc -l ./data/export.txt
!head -1 ./data/export.txt

62204 ./data/export.txt
indeed_65ec17d9c2a5ce01!@#Because moving is stressful, we seek candidates with a commitment to customer service and an appreciation for variety in your job!  What is the job?  Skillfully and carefully drive a moving vehicle (truck or van) to move customersâ possessions. Position requires a unique combination of general labor and problem solving skills with excellent customer service talents. You will be part of a moving crew that provides an exceptional service and experience. Movers are exposed to a variety of physical work in a dynamic environment.  What will you be doing?   When driving, you will make $13+ per hour (depending on experience) with growth potential into higher pay grades, warehouse, office or leadership positions Will earn tips in addition to the hourly wage! Perform general physical activities related to the handling/moving of objects: drive a moving truck between headquarters & various client locations lift and move heavy objects load, unloa

### 2. Create training data set

We will just export top 10000 job descriptions (remember it's only driver related job posts) and let's build LDA model. Once LDA model is built, we will perform online training for new job posts and keep updating corpus.

In [18]:
!head -10000 ./data/export.txt | awk -F'!@#' '{print $2}' > ./data/train.txt

### 3. Train LDA Model

** *Layman's Explanation of LDA* **

Ref: https://www.quora.com/What-is-a-good-explanation-of-Latent-Dirichlet-Allocation

Suppose you've just moved to a new city. You're a hipster and an anime fan, so you want to know where the other hipsters and anime geeks tend to hang out. Of course, as a hipster, you know you can't just ask, so what do you do?

Here's the scenario: you scope out a bunch of different establishments (documents) across town, making note of the people (words) hanging out in each of them (e.g., Alice hangs out at the mall and at the park, Bob hangs out at the movie theater and the park, and so on). Crucially, you don't know the typical interest groups (topics) of each establishment, nor do you know the different interests of each person.

So you pick some number K of categories to learn (i.e., you want to learn the K most important kinds of categories people fall into), and start by making a guess as to why you see people where you do. For example, you initially guess that Alice is at the mall because people with interests in X like to hang out there; when you see her at the park, you guess it's because her friends with interests in Y like to hang out there; when you see Bob at the movie theater, you randomly guess it's because the Z people in this city really like to watch movies; and so on.

Of course, your random guesses are very likely to be incorrect (they're random guesses, after all!), so you want to improve on them. One way of doing so is to:

- Pick a place and a person (e.g., Alice at the mall).
- Why is Alice likely to be at the mall? Probably because other people at the mall with the same interests sent her a message telling her to come.
- In other words, the more people with interests in X there are at the mall and the stronger Alice is associated with interest X (at all the other places she goes to), the more likely it is that Alice is at the mall because of interest X.
- So make a new guess as to why Alice is at the mall, choosing an interest with some probability according to how likely you think it is.


Go through each place and person over and over again. Your guesses keep getting better and better (after all, if you notice that lots of geeks hang out at the bookstore, and you suspect that Alice is pretty geeky herself, then it's a good bet that Alice is at the bookstore because her geek friends told her to go there; and now that you have a better idea of why Alice is probably at the bookstore, you can use this knowledge in turn to improve your guesses as to why everyone else is where they are), and so eventually you can stop updating. Then take a snapshot (or multiple snapshots) of your guesses, and use it to get all the information you want:

- For each category, you can count the people assigned to that category to figure out what people have this particular interest. By looking at the people themselves, you can interpret the category as well (e.g., if category X contains lots of tall people wearing jerseys and carrying around basketballs, you might interpret X as the "basketball players" group).
- For each place P and interest category C, you can compute the proportions of people at P because of C (under the current set of assignments), and these give you a representation of P. For example, you might learn that the people who hang out at Barnes & Noble consist of 10% hipsters, 50% anime fans, 10% jocks, and 30% college students.

In [20]:
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim

def preprocess(text):
    return text.replace('/', ' ').replace('\\', ' ').replace('_', ' ').replace('-', ' ').decode("utf-8")

tokenizer = RegexpTokenizer(r'\w+')

# create english stop words list
en_stop = get_stop_words('en')

# create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()
    
# compile sample documents into a list
doc_set = [ preprocess(line) for line in open('./data/train.txt', 'r') ]

# list for tokenized documents in loop
texts = []

# loop through document list
for i in doc_set:
    
    # clean and tokenize document string
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)

    # remove stop words from tokens
    stopped_tokens = [i for i in tokens if not i in en_stop]
    
    # stem tokens
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
    
    # add tokens to list
    texts.append(stemmed_tokens)

# turn our tokenized documents into a id <-> term dictionary
dictionary = corpora.Dictionary(texts)
    
# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]

# generate LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=10, id2word = dictionary, passes=20)


[(0, u'0.014*requir + 0.014*custom + 0.012*work + 0.011*must + 0.010*driver + 0.009*deliveri + 0.009*job + 0.008*abil + 0.007*s + 0.007*time'), (1, u'0.040*driver + 0.038*pay + 0.032*truck + 0.019*offer + 0.019*transport + 0.017*paid + 0.017*compani + 0.016*weekli + 0.016*drive + 0.015*cpm'), (2, u'0.037*driver + 0.017*team + 0.016*year + 0.014*truck + 0.014*per + 0.014*mile + 0.011*paid + 0.010*compani + 0.010*000 + 0.009*pay'), (3, u'0.040*truck + 0.035*driver + 0.019*pay + 0.018*roehl + 0.015*ll + 0.014*get + 0.013*drive + 0.013*home + 0.013*cdl + 0.013*train'), (4, u'0.021*haul + 0.020*oper + 0.018*owner + 0.015*move + 0.015*graebel + 0.014*high + 0.014*ca + 0.011*revenu + 0.011*long + 0.011*opportun'), (5, u'0.022*oper + 0.020*owner + 0.018*program + 0.016*knight + 0.015*driver + 0.015*year + 0.014*fuel + 0.013*truck + 0.013*leas + 0.011*busi'), (6, u'0.045*driver + 0.041*pay + 0.026*mile + 0.024*barr + 0.024*nunn + 0.021*per + 0.019*truck + 0.018*cdl + 0.018*benefit + 0.017*hire'

In [23]:
for topic, keywords in ldamodel.print_topics(num_topics=10, num_words=5):
    print topic, keywords

0 0.014*requir + 0.014*custom + 0.012*work + 0.011*must + 0.010*driver
1 0.040*driver + 0.038*pay + 0.032*truck + 0.019*offer + 0.019*transport
2 0.037*driver + 0.017*team + 0.016*year + 0.014*truck + 0.014*per
3 0.040*truck + 0.035*driver + 0.019*pay + 0.018*roehl + 0.015*ll
4 0.021*haul + 0.020*oper + 0.018*owner + 0.015*move + 0.015*graebel
5 0.022*oper + 0.020*owner + 0.018*program + 0.016*knight + 0.015*driver
6 0.045*driver + 0.041*pay + 0.026*mile + 0.024*barr + 0.024*nunn
7 0.032*driver + 0.020*drive + 0.017*compani + 0.014*hazmat + 0.014*year
8 0.023*year + 0.019*driver + 0.016*cdl + 0.015*month + 0.015*class
9 0.036*driver + 0.022*pay + 0.018*compani + 0.016*flatb + 0.014*offer


### 4. Let's TEST the LDA model with new job description

In [26]:
!tail -1 ./data/export.txt | awk -F'!@#' '{print $2}' > ./data/test.txt
!cat ./data/test.txt

Now Hiring Company Truck Drivers. At Transport America We Raised Pay!  Company Truck Driver Benefits: Top 10% Industry Pay Year-Round Steady Freight Performance Pay - Experienced Drivers Earn Top Scale in 2 Years Flexible Home Time, Including Get Home Certificates 24/7 Support, 365 Days A Year Pick Your Schedule Option Lease Purchase Options Day 1 Medical/Dental/Vision/Disability Benefits Package Transfer Opportunities Available E-Logs and an InCab Communication Hub Roll Stability and OnGuard System CSA Safe Carrier New Fleet of Equipment- New Kenworths In Delivery  At Transport America, our goal is to deliver excellence in all that we do. At a time when others are moving to asset lite models, we are committed to running assets in networks, which gives you reliable capacity with an excellence of service unsurpassed in the transportation industry. We are big enough to create meaningful solutions, but small enough to provide you the level of customer service you deserve. We believe in hi

In [51]:
# compile sample documents into a list
test_set = [ preprocess(line) for line in open('./data/test.txt', 'r') ]

# list for tokenized documents in loop
texts = []

# loop through document list
for i in test_set:
    
    # clean and tokenize document string
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)

    # remove stop words from tokens
    stopped_tokens = [i for i in tokens if not i in en_stop]
    
    # stem tokens
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
    
    # add tokens to list
    texts.append(stemmed_tokens)

# turn our tokenized documents into a id <-> term dictionary
dictionary = corpora.Dictionary(texts)
    
# convert tokenized documents into a document-term matrix
bow = [dictionary.doc2bow(text) for text in texts]

** Let's print topics from the model **

In [57]:
for topic, keywords in ldamodel.print_topics(num_topics=10, num_words=5):
    print topic, keywords

0 0.014*requir + 0.014*custom + 0.012*work + 0.011*must + 0.010*driver
1 0.040*driver + 0.038*pay + 0.032*truck + 0.019*offer + 0.019*transport
2 0.037*driver + 0.017*team + 0.016*year + 0.014*truck + 0.014*per
3 0.040*truck + 0.035*driver + 0.019*pay + 0.018*roehl + 0.015*ll
4 0.021*haul + 0.020*oper + 0.018*owner + 0.015*move + 0.015*graebel
5 0.022*oper + 0.020*owner + 0.018*program + 0.016*knight + 0.015*driver
6 0.045*driver + 0.041*pay + 0.026*mile + 0.024*barr + 0.024*nunn
7 0.032*driver + 0.020*drive + 0.017*compani + 0.014*hazmat + 0.014*year
8 0.023*year + 0.019*driver + 0.016*cdl + 0.015*month + 0.015*class
9 0.036*driver + 0.022*pay + 0.018*compani + 0.016*flatb + 0.014*offer


** Let's see what topis test document belongs to **

In [56]:
for topics in ldamodel[bow]:
    print topics

[(0, 0.85813606763953532), (2, 0.038124342565039944), (3, 0.052171292370160209), (7, 0.011156941430638388), (9, 0.03004651314823252)]


** So the test document belongs to topics 0, 2, 3, 7 and 9 **

In [58]:
print test_set

[u"Now Hiring Company Truck Drivers. At Transport America We Raised Pay!  Company Truck Driver Benefits: Top 10% Industry Pay Year Round Steady Freight Performance Pay   Experienced Drivers Earn Top Scale in 2 Years Flexible Home Time, Including Get Home Certificates 24 7 Support, 365 Days A Year Pick Your Schedule Option Lease Purchase Options Day 1 Medical Dental Vision Disability Benefits Package Transfer Opportunities Available E Logs and an InCab Communication Hub Roll Stability and OnGuard System CSA Safe Carrier New Fleet of Equipment  New Kenworths In Delivery  At Transport America, our goal is to deliver excellence in all that we do. At a time when others are moving to asset lite models, we are committed to running assets in networks, which gives you reliable capacity with an excellence of service unsurpassed in the transportation industry. We are big enough to create meaningful solutions, but small enough to provide you the level of customer service you deserve. We believe in