## Latent Dirichlet Allocation

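+ Most commonly used in natural language processing (NLP) to discover topics in a collection of documents
+ Sometimes an end in itself (exploring what a corpus is about)
+ Sometimes a variable reduction technique (topic proportions become features for a downstream model)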
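
To make the idea concrete, here is a minimal sketch of the generative story LDA assumes: each document has a mixture over topics, and each topic is a distribution over words. The topic names, vocabulary and probabilities below are hypothetical toy values, not output from any model in this notebook. In actual LDA both distributions are drawn from Dirichlet priors, and fitting works backwards from the observed words.

```python
import random

# Hypothetical toy parameters, chosen only for illustration.
topic_words = {
    "food":   {"taco": 0.5, "pizza": 0.4, "game": 0.1},
    "sports": {"game": 0.6, "ball": 0.3, "pizza": 0.1},
}
doc_topics = {"food": 0.8, "sports": 0.2}  # topic mixture of one document


def generate_document(doc_topics, topic_words, n_words, rng):
    """Sample words the way LDA assumes a document is written."""
    words = []
    for _ in range(n_words):
        # 1. Pick a topic according to the document's topic mixture.
        topic = rng.choices(list(doc_topics),
                            weights=list(doc_topics.values()))[0]
        # 2. Pick a word according to that topic's word distribution.
        dist = topic_words[topic]
        words.append(rng.choices(list(dist), weights=list(dist.values()))[0])
    return words


rng = random.Random(0)
print(" ".join(generate_document(doc_topics, topic_words, 10, rng)))
```

The 3-word-per-topic vocabulary is far smaller than anything realistic, but it shows why a document can be summarized by its short topic mixture instead of its full word counts.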


### Simple Example of LDA in NLP

Adapted from: http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-topics-extraction-with-nmf-lda-py

+ Authors: 
    + Olivier Grisel <olivier.grisel@ensta.org>
    + Lars Buitinck
    + Chyi-Kwei Yau <chyikwei.yau@gmail.com>
+ License: BSD 3 clause

In [1]:
from __future__ import print_function
from time import time

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups


### This code defines a custom function that we'll use later

In [2]:
n_samples = 2000
n_features = 1000
n_topics = 10
n_top_words = 20


def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()
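
The slice `topic.argsort()[:-n_top_words - 1:-1]` can look cryptic. The sketch below reproduces the same idea with plain Python lists: sort the indices by ascending weight, then walk backwards from the end for `n` steps. The helper name `top_indices` and the toy weights are assumptions made for illustration.

```python
def top_indices(weights, n):
    """Indices of the n largest weights, largest first."""
    # Ascending order of indices, analogous to what argsort() returns.
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    # Walk backwards from the last element, taking n items.
    return order[:-n - 1:-1]


feature_names = ["ball", "food", "game", "taco", "travel"]
topic = [0.1, 2.5, 0.3, 4.0, 0.2]  # hypothetical word weights for one topic

print(" ".join(feature_names[i] for i in top_indices(topic, 3)))
```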



### This code loads the dataset

In [3]:

# Load the 20 newsgroups dataset and vectorize it. We use a few heuristics
# to filter out useless terms early on: the posts are stripped of headers,
# footers and quoted replies, and common English words, words occurring in
# only one document or in at least 95% of the documents are removed.

print("Loading dataset...")
t0 = time()
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))
data_samples = dataset.data[:n_samples]
print("done in %0.3fs." % (time() - t0))


Loading dataset...
done in 1.913s.


In [4]:
dataset['data'][2]

u"Although I realize that principle is not one of your strongest\npoints, I would still like to know why do do not ask any question\nof this sort about the Arab countries.\n\n   If you want to continue this think tank charade of yours, your\nfixation on Israel must stop.  You might have to start asking the\nsame sort of questions of Arab countries as well.  You realize it\nwould not work, as the Arab countries' treatment of Jews over the\nlast several decades is so bad that your fixation on Israel would\nbegin to look like the biased attack that it is.\n\n   Everyone in this group recognizes that your stupid 'Center for\nPolicy Research' is nothing more than a fancy name for some bigot\nwho hates Israel."

In [5]:
# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
t0 = time()
tf = tf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))



Extracting tf features for LDA...
done in 0.402s.
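
As a rough illustration of what `min_df=2` and `max_df=0.95` do, the sketch below filters a vocabulary by document frequency using only the standard library. It assumes an integer `min_df` is an absolute document count and a float `max_df` is a proportion of documents, and it ignores `stop_words` and `max_features`. The function `filter_vocabulary` and the toy documents are hypothetical.

```python
from collections import Counter


def filter_vocabulary(tokenized_docs, min_df=2, max_df=0.95):
    """Keep terms whose document frequency lies in [min_df, max_df * n_docs]."""
    n_docs = len(tokenized_docs)
    # Count each term once per document, not once per occurrence.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return sorted(term for term, df in doc_freq.items()
                  if min_df <= df <= max_df * n_docs)


docs = [
    "food taco food".split(),
    "food pizza".split(),
    "food taco travel".split(),
    "food game".split(),
]
# "food" appears in every document (too common); "pizza", "travel" and
# "game" appear in only one document (too rare).
print(filter_vocabulary(docs))
```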


In [49]:

print("Fitting LDA models with tf features, "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
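# Note: scikit-learn 0.19+ calls this parameter n_components;
# n_topics is the older name used by the version that produced this notebook.
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)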




Fitting LDA models with tf features, n_samples=2000 and n_features=1000...


In [50]:
t0 = time()
lda.fit(tf)
print("done in %0.3fs." % (time() - t0))

done in 12.535s.


In [51]:
X_example = lda.transform(tf)

In [58]:
import pandas as pd

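# X_example has one column per topic (10 here), so keep the default
# integer column names and show only the first 10 documents.
pd.DataFrame(X_example).head(10)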

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.530256 | 0.003449 | 0.003449 | 0.003449 | 0.003449 | 0.240672 | 0.003449 | 0.003448 | 0.204932 | 0.003449 |
| 1 | 0.003125 | 0.003125 | 0.003125 | 0.003126 | 0.003127 | 0.551005 | 0.003126 | 0.003125 | 0.423990 | 0.003125 |
| 2 | 0.296628 | 0.284906 | 0.003573 | 0.003572 | 0.003572 | 0.003572 | 0.003572 | 0.003571 | 0.393462 | 0.003572 |
| 3 | 0.002778 | 0.244747 | 0.002779 | 0.429914 | 0.002778 | 0.002779 | 0.080891 | 0.002778 | 0.227779 | 0.002778 |
| 4 | 0.005264 | 0.005265 | 0.005266 | 0.005264 | 0.126397 | 0.307343 | 0.005264 | 0.005263 | 0.529411 | 0.005264 |
| 5 | 0.009092 | 0.009092 | 0.009092 | 0.009093 | 0.009092 | 0.506527 | 0.009092 | 0.009091 | 0.420736 | 0.009093 |
| 6 | 0.002632 | 0.086142 | 0.002632 | 0.002632 | 0.042762 | 0.002632 | 0.299677 | 0.002632 | 0.371743 | 0.186517 |
| 7 | 0.003450 | 0.003449 | 0.003450 | 0.003449 | 0.003449 | 0.520215 | 0.003449 | 0.003448 | 0.404932 | 0.050709 |
| 8 | 0.004762 | 0.325559 | 0.004764 | 0.004763 | 0.004764 | 0.004763 | 0.004763 | 0.004762 | 0.636338 | 0.004762 |
| 9 | 0.002439 | 0.002440 | 0.002439 | 0.449908 | 0.002439 | 0.002440 | 0.152455 | 0.002439 | 0.380560 | 0.002440 |
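
Each row of this output is a probability distribution over the 10 topics for one document. The short sketch below takes the first row shown above and checks that it sums to roughly 1, finds the dominant topic, and lists the topics that carry real weight. The 0.05 cutoff is an arbitrary choice for illustration.

```python
# First row of the document-topic output above.
row = [0.530256, 0.003449, 0.003449, 0.003449, 0.003449,
       0.240672, 0.003449, 0.003448, 0.204932, 0.003449]

total = sum(row)
# Index of the topic with the highest probability.
dominant = max(range(len(row)), key=lambda k: row[k])
# Topics carrying noticeably more weight than the near-uniform floor.
active = [k for k, p in enumerate(row) if p > 0.05]

print("sum=%.4f dominant=%d active=%s" % (total, dominant, active))
```

This is the sense in which LDA reduces variables: the 1000 term counts for this document are summarized by 10 numbers, only three of which are far from zero.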


In [53]:

print("\nTopics in LDA model:")
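# Note: scikit-learn 1.0+ provides get_feature_names_out() for this;
# get_feature_names() was removed in 1.2.
tf_feature_names = tf_vectorizer.get_feature_names()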
print_top_words(lda, tf_feature_names, 5)


Topics in LDA model:
Topic #0:
people gun armenian armenians war
Topic #1:
government people law mr use
Topic #2:
space program output entry data
Topic #3:
key car chip used keys
Topic #4:
edu file com available mail
Topic #5:
god people does jesus say
Topic #6:
windows use drive thanks does
Topic #7:
ax max b8f g9v a86
Topic #8:
just don like think know
Topic #9:
10 00 25 15 12



### In-class assignment

+ Load the training set (done for you below)
+ Re-run LDA and use the topics as input features for a model
+ Predict categories using some multinomial classifier

In [9]:
print("Loading dataset...")
t0 = time()
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'),
                             subset="train")

data = dataset.data

y = dataset.target

print("done in %0.3fs." % (time() - t0))


Loading dataset...
done in 1.876s.


In [21]:
import numpy as np

np.unique(y, return_counts=False)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

u"Well i'm not sure about the story nad it did seem biased. What\nI disagree with is your statement that the U.S. Media is out to\nruin Israels reputation. That is rediculous. The U.S. media is\nthe most pro-israeli media in the world. Having lived in Europe\nI realize that incidences such as the one described in the\nletter have occured. The U.S. media as a whole seem to try to\nignore them. The U.S. is subsidizing Israels existance and the\nEuropeans are not (at least not to the same degree). So I think\nthat might be a reason they report more clearly on the\natrocities.\n\tWhat is a shame is that in Austria, daily reports of\nthe inhuman acts commited by Israeli soldiers and the blessing\nreceived from the Government makes some of the Holocaust guilt\ngo away. After all, look how the Jews are treating other races\nwhen they got power. It is unfortunate.\n"

In [10]:
# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
t0 = time()
tf_vectorizer.fit(data)

tf = tf_vectorizer.transform(data)


print("done in %0.3fs." % (time() - t0))


Extracting tf features for LDA...
done in 4.024s.


In [12]:
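# Note: the message below reports the n_samples constant (2000) set earlier,
# but this model is fit on the full training set held in `tf`.
print("Fitting LDA models with tf features, "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))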
lda = LatentDirichletAllocation(n_topics=20, max_iter=50,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)


t0 = time()
lda.fit(tf)
print("done in %0.3fs." % (time() - t0))

Fitting LDA models with tf features, n_samples=2000 and n_features=1000...
done in 106.853s.


In [22]:
X = lda.transform(tf)

In [33]:
test = fetch_20newsgroups(shuffle=True, random_state=1,
                          remove=('headers', 'footers', 'quotes'),
                          subset="test")

testdata = test.data

y_test = test.target


In [36]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

In [38]:
clf2 = RandomForestClassifier()

clf2.fit(X, y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [42]:
from sklearn.model_selection import GridSearchCV, KFold

gs = GridSearchCV(estimator = RandomForestClassifier(), 
                 param_grid = {'n_estimators':np.arange(10, 21, 1)}, 
                 cv = KFold(n_splits=5))

gs.fit(X, y)

algo = gs.best_estimator_


In [30]:
clf = MultinomialNB(alpha=.01)
clf.fit(X, y)


MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)

In [62]:
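from sklearn.metrics import confusion_matrix

# `pred` holds the random forest's test-set predictions from cell In [39] below,
# which was executed before this cell.
confusion_matrix(y_test, pred)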

array([[ 46,   4,   4,   3,   3,   3,   3,  20,  16,  13,   5,   6,   4,
         27,  14,  90,  15,  14,   6,  23],
       [  4, 123,  58,  25,  19,  66,   7,  13,   4,   4,   0,  11,  15,
          7,  21,   4,   3,   3,   1,   1],
       [  1,  65, 101,  47,  31,  48,   6,  27,   9,   3,   3,   9,   8,
         14,   7,   3,   2,   3,   7,   0],
       [  1,  47,  60, 112,  76,  13,  24,  14,   8,   5,   2,   5,  15,
          0,   5,   0,   3,   0,   1,   1],
       [  3,  16,  34, 107,  94,  10,  28,  27,  14,   4,   8,   7,  16,
          4,   7,   0,   2,   2,   2,   0],
       [  4,  90,  72,  19,  11, 121,   6,  19,   6,   5,   2,   8,   6,
         11,   5,   2,   0,   3,   3,   2],
       [  2,  13,   9,  27,  32,   9, 222,  34,   5,   5,   5,   1,  14,
          4,   3,   1,   0,   2,   2,   0],
       [ 10,   9,   5,   9,  16,   9,  61, 131,  33,   5,   7,   6,  39,
         17,   4,   5,  16,   4,   9,   1],
       [ 15,  18,   8,  15,  19,  11,  15,  56, 102,  12,  13,  

In [39]:
X_test = tf_vectorizer.transform(testdata)
X_test = lda.transform(X_test)

pred = clf2.predict(X_test)
metrics.f1_score(y_test, pred, average='macro')

0.28736194294309109
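
For reference, `average='macro'` means the F1 score is computed separately for each class and then averaged without weighting by class size. The sketch below is a plain-Python version of that calculation on hypothetical toy labels. It is meant to show the arithmetic, not to replace `metrics.f1_score`.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    pairs = list(zip(y_true, y_pred))
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2.0 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels for three classes.
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0]
print("%.3f" % macro_f1(y_true_toy, y_pred_toy))
```

When reading the score above, keep in mind that there are 20 categories, so uniform random guessing on balanced classes would be right only about 5% of the time.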

In [44]:
algo.score(X_test, y_test)

0.31213489113117365

### In-class assignment

+ I'll divide you into 3 groups
+ Each group writes 100 sentences on the *same topic*
+ Save them as a JSON file and send it to me
+ We'll run them through LDA

In [68]:
CLASSLIST = ["My favorite type of food is tacos, but it used to be fried chicken.", "My favorite type of taco is al pastor.", "My favorite mexican resturant is El Rancho.", "I also like all the resturants in my immediate neighborhood.", "Corn dogs are quite nice as well however my friends make fun of me", "Yo dog I love dog, but not like the food", "Sup hot stuff, you like hot food or cold food", "I raise chickens in my farm that I dont eat", "Sometimes I go fishing with my father", "Working at the food bank is very fulfilling to me", "If I eat too much Im not going to feel like drinking", "The breweries in San Diego are plentiful", "The food in San Deigo are not as good as the stuff in SF", "LA has really solid mexican food which I love", "Im pretty hungry right now, where should we grab lunch?", "Is dinner going to be taken care of at the Reynold's house?", "If I pay for breakfast will you cover lunch or dinner babe?", "Fast Food is not good for health", "Indian food is spicy", "I like Thai food", "There are two new restaurants opened around the block", "Can I get this sweet dish?", "McBurger has 3000 calories", "Nuts are good for health", "Vegetables are bad", "Cheese cake is good", "He ate all the fried food", "In istanbul, a burger cost $30", "The new hotel chain offers free buffet for 2 days", "Can I get a diet coke?", "How to cook fiesta salsa?", "These fries are tasty but bad for health", "This chinese restaurant serves the best soup", "Please order a pizza for me?", "Dinner is ready", "Doing breakfast is good for health", "Please dont throw extra food, donate it to someone hungry", "chocolate chip cookies and best fresh from the oven.", "pumpkin pie is a good dessert for the fall season", "vegtables are an important part of any diet", "fruit is a healthy way to suffice your sweet tooth", "eggs are a filling way eat breakfast", "soda is a necessary evil.", "philz coffee is a great way to start your morning",
             "after making a big dinner with several courses, at least there are leftovers.", "turnkey is a great type of meat", "hot sauce makes everything better.", "hot dogs and garlic fries are best when watching a giants baseball game.", "I like ketchup more than mustard", "I wish a had a few more cook books.", "The worst part of cooking is cleaning the pots and pans afterwards.", "I had cereal with a banana every morning before school as a kid.", "Avocado is my favorite type of vegtable.", "I try to avoid fast food restaurants as much as possible.", "shrimp scampi is one of my all time favorite dishes.", "cooking is something I hope to do more of later in life.", "salmon is a great type of food", "You should eat well, but not like Charles Barkley well.", "There are like 17 cooking shows. All of them seem to be related to Top Chef.", "Guy Fearri is not a chef so much as the lead from Smashmouth pretending to be a chef.", "Salt is not a food. But it goes well on food.", "Vegetarians who still eat fish are not vegetarians. They are just against eating things that have eyes.", "Vegans are basically food Taliban. Do not make me feel bad because I have good things in my life.", "They say cows shitting causes global warming. That means we should eat less cows. Maybe more veal though. What is the shit to meat produced ratio where we can still enjoy meat, but not destroy the only planet we have.", "My mother said pre-heat the oven. Instead I turned on the microwave.", "Turkey is the worst of the bird dishes.", "Dog is a food someplaces.", "To make rice, you just get rice, and then add water.", "Food Trucks are not made of food.",
             "Instagram is mostly a forum for posting food photos. ALso for Smirnoff ICe ads.", "Pasta is a delicacy.", "I refused to believe that gushers are a food.", "If you travel exclusively for local dishes, you have too much money.", "Happiness: a good bank account, a good cook, and a good digestion.", "Food Porn and Porn Food are not the same thing, and you should google only one.", "France thinks it has the best food in Europe, but really Italy does. In Asia, Thailand is to France, as Vietnam is to Italy. I will not negotiate on this."]

# Wrap the sentences in a dict keyed by topic name so the file is self-describing.
food = {"Food": CLASSLIST}

import json

with open('Food.json', 'w') as fp:
    json.dump(food, fp, sort_keys=True, indent=4)


### Save to JSON

In [28]:
travel = {"Travel":travellist}

### JSON save:

import json
with open('Travel.json', 'w') as fp:
    json.dump(travel, fp, sort_keys=True, indent=4)



### Import from JSON

Here is where we import our data.

In [144]:
import numpy as np

# Load the sentence lists that each group saved as JSON

import json
with open('Food.json', 'r') as fp:
    food = json.load(fp)

with open('sports.json', 'r') as fp:
    sports = json.load(fp)
    
with open('Travel.json', 'r') as fp:  # same capitalization as the file saved above
    travel = json.load(fp)


sentencelist = []

sentencelist.extend(food['Food'])
sentencelist.extend(sports['Sports'])
sentencelist.extend(travel['Travel'])

### I'm using this code to create an outcome variable, so we can test our topic model

In [173]:
y = []
y.extend([0]*len(food['Food']))
y.extend([1]*len(sports['Sports']))
y.extend([2]*len(travel['Travel']))

label = []
label.extend(["Food"]*len(food['Food']))
label.extend(["Sports"]*len(sports['Sports']))
label.extend(["Travel"]*len(travel['Travel']))
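
The same three lists can be built in one loop, which keeps the numeric code and the text label in step if a group is added or reordered. This is only a sketch of an alternative: the `groups` list and its short sentences are hypothetical stand-ins for the three loaded JSON dictionaries.

```python
# Hypothetical stand-ins for the three loaded JSON dictionaries.
groups = [
    ("Food", ["I like Thai food", "Dinner is ready"]),
    ("Sports", ["The game went to overtime"]),
    ("Travel", ["We flew to Lisbon", "The island was quiet"]),
]

sentences_demo, y_demo, label_demo = [], [], []
for code, (name, sentences) in enumerate(groups):
    sentences_demo.extend(sentences)
    # One numeric code and one text label per sentence in this group.
    y_demo.extend([code] * len(sentences))
    label_demo.extend([name] * len(sentences))

# All three lists must stay aligned row for row.
assert len(sentences_demo) == len(y_demo) == len(label_demo)
print(list(zip(y_demo, label_demo)))
```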

### This organizes everything into a data frame

In [175]:
df = pd.DataFrame({"y":y, "sentence":sentencelist, "label":label})

### This splits our sentences and outcome var

#### Preprocessing! 

+ Here is the count vectorizer


In [215]:
# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=10000,
                                stop_words='english')
t0 = time()
tf_vectorizer.fit(sentencelist)

X = tf_vectorizer.transform(sentencelist)


print("done in %0.3fs." % (time() - t0))

Extracting tf features for LDA...
done in 0.011s.


### Define the LDA!

In [276]:
lda = LatentDirichletAllocation(n_topics=3, max_iter=50,
                                learning_method='online',
                                learning_offset=0.,
                                random_state=0)


In [277]:
lda.fit(X)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='online', learning_offset=0.0,
             max_doc_update_iter=100, max_iter=50, mean_change_tol=0.001,
             n_jobs=1, n_topics=3, perp_tol=0.1, random_state=0,
             topic_word_prior=None, total_samples=1000000.0, verbose=0)

In [278]:
docprobs = lda.transform(X)

docprobs[0]

array([ 0.06681848,  0.08079847,  0.85238305])

In [279]:
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, 5)

Topic #0:
traveling way best ball tour
Topic #1:
island time travelers worst ringle
Topic #2:
food travel rsquo like great



### Need to split into training and testing!



In [261]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(docprobs, y, test_size=0.10, random_state=5)

### Train the random forest classifier on X_train

In [262]:
rf = RandomForestClassifier()

rf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [263]:
predy = rf.predict(X_test)

In [273]:
predy

array([0, 2, 1, 0, 1, 2, 0, 1, 1, 2, 0, 1, 0, 0, 1, 1, 0, 2, 1, 2, 0, 0, 0,
       0, 2])

In [272]:
metrics.f1_score(y_test, predy, average='macro')


0.28406432748538007