<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Lesson 16 - Latent Variables and Natural Language Processing

---

In [1]:
# Imports
import json
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline

# Config
np.random.seed(1)

In [23]:
# spacy is used for pre-processing and traditional NLP
import spacy
from spacy.en import English

In [3]:
# Import data
df = pd.read_csv('stumbleupon.tsv', sep='\t')
df['title'] = df.boilerplate.map(lambda x: json.loads(x).get('title', ''))
df['body'] = df.boilerplate.map(lambda x: json.loads(x).get('body', ''))

df.head()

Unnamed: 0,url,urlid,boilerplate,alchemy_category,alchemy_category_score,avglinksize,commonlinkratio_1,commonlinkratio_2,commonlinkratio_3,commonlinkratio_4,...,linkwordscore,news_front_page,non_markup_alphanum_characters,numberOfLinks,numwords_in_url,parametrizedLinkRatio,spelling_errors_ratio,label,title,body
0,http://www.bloomberg.com/news/2010-12-23/ibm-p...,4042,"{""title"":""IBM Sees Holographic Calls Air Breat...",business,0.789131,2.055556,0.676471,0.205882,0.047059,0.023529,...,24,0,5424,170,8,0.152941,0.07913,0,IBM Sees Holographic Calls Air Breathing Batte...,A sign stands outside the International Busine...
1,http://www.popsci.com/technology/article/2012-...,8471,"{""title"":""The Fully Electronic Futuristic Star...",recreation,0.574147,3.677966,0.508021,0.28877,0.213904,0.144385,...,40,0,4973,187,9,0.181818,0.125448,1,The Fully Electronic Futuristic Starting Gun T...,And that can be carried on a plane without the...
2,http://www.menshealth.com/health/flu-fighting-...,1164,"{""title"":""Fruits that Fight the Flu fruits tha...",health,0.996526,2.382883,0.562016,0.321705,0.120155,0.042636,...,55,0,2240,258,11,0.166667,0.057613,1,Fruits that Fight the Flu fruits that fight th...,Apples The most popular source of antioxidants...
3,http://www.dumblittleman.com/2007/12/10-foolpr...,6684,"{""title"":""10 Foolproof Tips for Better Sleep ""...",health,0.801248,1.543103,0.4,0.1,0.016667,0.0,...,24,0,2737,120,5,0.041667,0.100858,1,10 Foolproof Tips for Better Sleep,There was a period in my life when I had a lot...
4,http://bleacherreport.com/articles/1205138-the...,9006,"{""title"":""The 50 Coolest Jerseys You Didn t Kn...",sports,0.719157,2.676471,0.5,0.222222,0.123457,0.04321,...,14,0,12032,162,10,0.098765,0.082569,0,The 50 Coolest Jerseys You Didn t Know Existed...,Jersey sales is a curious business Whether you...


## Demo: "LDA in gensim"

Gensim is a library of language processing tools focused on latent variable models for text. It was originally developed by grad students dissatisfied with the implementations of latent models available at the time. Documentation and tutorials are available on the [package’s website](https://radimrehurek.com/gensim/index.html).


Let’s first translate a set of documents (articles) into a matrix representation with a row per document and a column per feature (word or n-gram).

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

body_text = df.body.dropna()
vectorizer = CountVectorizer(binary=False,
                             stop_words='english',
                             min_df=3,
                             ngram_range=(1,2))
vectorizer.fit(body_text)
docs = vectorizer.transform(body_text) 
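
To make the row-per-document, column-per-feature layout concrete, the following sketch builds a tiny count matrix by hand using only the standard library. The three toy documents and the helper name `build_count_matrix` are made up for illustration, and the whitespace tokenizer is far simpler than what `CountVectorizer` does (no stop-word removal, no n-grams, no `min_df` filtering).

```python
from collections import Counter

def build_count_matrix(documents):
    """Return (vocabulary, matrix): one row per document, one column per word."""
    tokenized = [doc.lower().split() for doc in documents]
    # Sorted vocabulary so that column order is deterministic.
    vocabulary = sorted(set(word for tokens in tokenized for word in tokens))
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # A Counter returns 0 for words that do not occur in this document.
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

# Hypothetical documents, not taken from the StumbleUpon data.
toy_docs = [
    "add sugar and butter",
    "add butter",
    "sports news and sports images",
]
vocabulary, matrix = build_count_matrix(toy_docs)
```

Here `matrix[2]` holds the counts for the third document, with a 2 in the `sports` column. The real `docs` object stores the same kind of information as a sparse matrix, since most entries are zero.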

In [14]:
# Build a mapping of numerical ID to word
id2word = dict(enumerate(vectorizer.get_feature_names()))
print id2word[5000], id2word[10000], id2word[20000]

add mayonnaise bejeweled consistency want


- We want to learn which columns (words) tend to occur together, i.e. are likely to come from the same topic. This gives each topic's word distribution.
- We can also determine which topics make up each document. This is the document's topic distribution.
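
These two distributions fit together as follows: LDA assumes that the probability of seeing a word in a document is a mixture of the topics' word distributions, weighted by that document's topic distribution. Below is a minimal sketch with invented numbers (two topics and a four-word vocabulary). None of these values come from the fitted model, and the topic names are labels chosen for readability.

```python
# Hypothetical word distribution for each topic: P(word | topic).
topic_word = {
    "baking": {"sugar": 0.5, "butter": 0.4, "score": 0.05, "team": 0.05},
    "sports": {"sugar": 0.05, "butter": 0.05, "score": 0.4, "team": 0.5},
}

# Hypothetical topic distribution for one document: P(topic | document).
doc_topic = {"baking": 0.8, "sports": 0.2}

def word_probability(word):
    """P(word | document) = sum over topics of P(topic | document) * P(word | topic)."""
    return sum(doc_topic[topic] * topic_word[topic][word] for topic in doc_topic)

# Word distribution implied for this document by the mixture.
doc_word = {word: word_probability(word) for word in topic_word["baking"]}
```

A document that is mostly "baking" ends up giving most of its probability mass to `sugar` and `butter`, while still leaving some room for sports words. Fitting LDA runs this reasoning in reverse: it observes only the word counts and infers both sets of distributions.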

In [15]:
from gensim.models.ldamodel import LdaModel
from gensim.matutils import Sparse2Corpus

# First we convert our word-matrix into gensim's format
corpus = Sparse2Corpus(docs, documents_columns=False)

# Then we fit an LDA model
lda_model = LdaModel(corpus=corpus, id2word=id2word, num_topics=5)

In this model, we must explicitly specify the number of topics we want the model to uncover (`num_topics`). This is a critical parameter, but there isn’t much guidance on how to choose it. Use domain expertise where possible.


Now we need to assess how well our model fits. As with other unsupervised learning techniques, validation is mostly a matter of interpretation.

#### Use the following questions to guide you:

- Did we learn reasonable topics?
- Do the words that make up a topic make sense?
- Are the topics helpful towards our goal?

#### We can evaluate fit by viewing the top words in each topic.

- Gensim's `LdaModel` has a `show_topics()` method for this.

In [17]:
for ti, topic in enumerate(lda_model.show_topics(num_topics=5, num_words=10)):
    print "Topic: {}".format(ti)
    print topic
    print

Topic: 0
(0, u'0.004*"like" + 0.003*"people" + 0.003*"just" + 0.003*"health" + 0.003*"time" + 0.003*"food" + 0.003*"make" + 0.003*"body" + 0.002*"new" + 0.002*"said"')

Topic: 1
(1, u'0.005*"sports" + 0.005*"flashvars" + 0.004*"2011" + 0.004*"com" + 0.003*"image" + 0.003*"world" + 0.003*"images" + 0.002*"link" + 0.002*"jpg" + 0.002*"track"')

Topic: 2
(2, u'0.006*"com" + 0.004*"http" + 0.004*"online" + 0.004*"www" + 0.004*"http www" + 0.003*"new" + 0.003*"2009" + 0.003*"2010" + 0.003*"news" + 0.002*"like"')

Topic: 3
(3, u'0.008*"cup" + 0.008*"chocolate" + 0.007*"butter" + 0.006*"sugar" + 0.005*"baking" + 0.005*"recipe" + 0.005*"make" + 0.005*"minutes" + 0.004*"cake" + 0.004*"add"')

Topic: 4
(4, u'0.005*"recipe" + 0.004*"add" + 0.004*"chicken" + 0.004*"recipes" + 0.004*"cheese" + 0.004*"minutes" + 0.003*"cup" + 0.003*"make" + 0.003*"just" + 0.003*"like"')
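
Each topic above comes back as a formatted string, which is awkward to work with programmatically. The sketch below parses one of those strings into `(word, weight)` pairs with a regular expression. `parse_topic` is a hypothetical helper, not part of gensim, and it assumes the `weight*"word"` format shown in the output above.

```python
import re

# Matches one term such as 0.008*"cup"; words may contain spaces, e.g. "http www".
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Turn a string like '0.008*"cup" + 0.007*"butter"' into (word, weight) pairs."""
    return [(word, float(weight))
            for weight, word in TERM_PATTERN.findall(topic_string)]

# Topic 3 as printed above.
topic_3 = ('0.008*"cup" + 0.008*"chocolate" + 0.007*"butter" + 0.006*"sugar" + '
           '0.005*"baking" + 0.005*"recipe" + 0.005*"make" + 0.005*"minutes" + '
           '0.004*"cake" + 0.004*"add"')
terms = parse_topic(topic_3)
```

With the terms in list form it becomes easy to compare topics, for example by checking how many top words topics 3 and 4 share. Depending on your gensim version, `lda_model.show_topic()` may return such pairs directly, which would make the parsing step unnecessary.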



#### Let's now use our fitted model to predict topics for some new data

(examples taken from http://www.buzzfeed.com/babymantis/25-stupid-newspaper-headlines-1opu)

In [18]:
new_text = [
    "Japanese scientists grow frog eyes and ears",
    "Statistics show that teen pregnancy drops of significantly after age 25",
    "Bugs flying around with wings are flying bugs",
    "Federal agents raid gun shop, find weapons",
    "Marijuana issue sent to a joint committee"
]

# Transform the text into the bag-of-words (bow) space using our vectorizer
new_bow = vectorizer.transform(new_text)

# Transform into format expected by gensim
new_corpus = Sparse2Corpus(new_bow, documents_columns=False)

# Print out first entry + matching words
first_doc = list(new_corpus)[0]
print first_doc
print [(id2word[word_id], count) for word_id, count in first_doc]

[(28589, 1), (31548, 1), (35775, 1), (39060, 1), (45294, 1), (74370, 1)]
[(u'ears', 1), (u'eyes', 1), (u'frog', 1), (u'grow', 1), (u'japanese', 1), (u'scientists', 1)]
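
The printed entry is gensim's bag-of-words format: a list of `(word_id, count)` pairs that contains only the words which occur in the document and exist in the vocabulary. Here is a rough sketch of that conversion with a toy vocabulary. The IDs are invented and `to_bow` is a hypothetical helper, not a gensim function.

```python
from collections import Counter

# Toy vocabulary mapping word -> column ID (IDs are made up for this example).
word2id = {"ears": 0, "eyes": 1, "frog": 2, "grow": 3, "japanese": 4, "scientists": 5}

def to_bow(text, word2id):
    """Return sorted (word_id, count) pairs for the in-vocabulary words of `text`."""
    counts = Counter(token for token in text.lower().split() if token in word2id)
    return sorted((word2id[word], n) for word, n in counts.items())

bow = to_bow("Japanese scientists grow frog eyes and ears", word2id)
# "and" is not in the toy vocabulary, so it is silently dropped.
```

The same thing happens with the real vectorizer: stop words and words never seen during fitting contribute nothing. A short headline with few known words therefore gives the model very little evidence to infer topics from.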


#### Transform into LDA space by applying fitted LDA model to the corpus

In [19]:
lda_vector = lda_model[new_corpus]

#### For each entry, we can extract `(topic, weight)` tuples showing how strongly it belongs to each topic

In [20]:
[list(lda_vec) for lda_vec in lda_vector]

[[(0, 0.88390465062556722),
  (1, 0.029006546314757485),
  (2, 0.029291835346880516),
  (3, 0.028803645500547946),
  (4, 0.028993322212246781)],
 [(0, 0.4098334184376039),
  (1, 0.52257446875346902),
  (2, 0.022580739804032787),
  (3, 0.022565289783851226),
  (4, 0.022446083221043054)],
 [(0, 0.58809330216769051),
  (1, 0.034308279205680088),
  (2, 0.034544040128471384),
  (3, 0.033635485468756471),
  (4, 0.30941889302940156)],
 [(0, 0.029203262020429051),
  (1, 0.028776980917551143),
  (2, 0.88467240469170949),
  (3, 0.028662571631090851),
  (4, 0.028684780739219474)],
 [(0, 0.86450575101866878),
  (1, 0.034351822976431197),
  (2, 0.034149044751454895),
  (3, 0.033454408491716547),
  (4, 0.033538972761728555)]]
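
Each inner list is a probability distribution over the five topics, so its weights should add up to roughly 1. The sketch below checks this for the first two headlines, using the values printed above rounded to six decimals, and then keeps only the topics above a cutoff. The 0.1 cutoff and the helper name `dominant_topics` are arbitrary choices for illustration.

```python
# Topic weights copied from the output above and rounded to six decimals.
first_headline = [(0, 0.883905), (1, 0.029007), (2, 0.029292),
                  (3, 0.028804), (4, 0.028993)]
second_headline = [(0, 0.409833), (1, 0.522574), (2, 0.022581),
                   (3, 0.022565), (4, 0.022446)]

# Sum of the weights for the first headline; should be close to 1.
total = sum(weight for _, weight in first_headline)

def dominant_topics(doc_topics, cutoff=0.1):
    """Keep (topic_id, weight) pairs at or above `cutoff`, strongest first."""
    kept = [pair for pair in doc_topics if pair[1] >= cutoff]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

strong_first = dominant_topics(first_headline)
strong_second = dominant_topics(second_headline)
```

The first headline is dominated by a single topic, while the second is split between topics 1 and 0. Looking at all topics above a cutoff, rather than only the single largest one, makes this kind of ambiguity visible.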

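#### Extract most prominent LDA topics for each entry

Note that `lda_vector` is evaluated lazily: each pass over it re-runs inference, which starts from a random initialization, so the weights below can differ from those in the previous cell (the third headline even switches its top topic).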

In [21]:
top_topics = [max(x, key=lambda item: item[1]) for x in list(lda_vector)]
top_topics

[(0, 0.88389727440813137),
 (1, 0.52266317561652986),
 (2, 0.86331326205235503),
 (2, 0.88465899869202425),
 (0, 0.86450791791318971)]

#### Print out text + topic

In [22]:
for i, topic_tuple in enumerate(top_topics):
    print new_text[i]
    print "{0:.1f}% as topic #{1}:".format(100 * topic_tuple[1], topic_tuple[0])
    print lda_model.print_topic(topic_tuple[0],topn=10), "\n"

Japanese scientists grow frog eyes and ears
88.4% as topic #0:
0.004*"like" + 0.003*"people" + 0.003*"just" + 0.003*"health" + 0.003*"time" + 0.003*"food" + 0.003*"make" + 0.003*"body" + 0.002*"new" + 0.002*"said" 

Statistics show that teen pregnancy drops of significantly after age 25
52.3% as topic #1:
0.005*"sports" + 0.005*"flashvars" + 0.004*"2011" + 0.004*"com" + 0.003*"image" + 0.003*"world" + 0.003*"images" + 0.002*"link" + 0.002*"jpg" + 0.002*"track" 

Bugs flying around with wings are flying bugs
86.3% as topic #2:
0.006*"com" + 0.004*"http" + 0.004*"online" + 0.004*"www" + 0.004*"http www" + 0.003*"new" + 0.003*"2009" + 0.003*"2010" + 0.003*"news" + 0.002*"like" 

Federal agents raid gun shop, find weapons
88.5% as topic #2:
0.006*"com" + 0.004*"http" + 0.004*"online" + 0.004*"www" + 0.004*"http www" + 0.003*"new" + 0.003*"2009" + 0.003*"2010" + 0.003*"news" + 0.002*"like" 

Marijuana issue sent to a joint committee
86.5% as topic #0:
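0.004*"like" + 0.003*"people" + 0.003*"just" + 0.003*"health" + 0.003*"time" + 0.003*"food" + 0.003*"make" + 0.003*"body" + 0.002*"new" + 0.002*"said"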

For more examples of using LDA with gensim, see: http://radimrehurek.com/topic_modeling_tutorial/2%20-%20Topic%20Modeling.html