## Latent Dirichlet Allocation

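+ Most commonly used in natural language processing, where each document is modeled as a mixture of topics
+ Sometimes used as an end in itself (discovering the topics)
+ Sometimes used as a dimensionality reduction technique (the topic proportions become features)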
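
To make the dimensionality-reduction use concrete, here is a minimal sketch on a made-up four-document corpus (the documents are hypothetical, not part of the course data). It assumes a recent scikit-learn release, where the number of topics is passed as `n_components`; the cells in this notebook use the older `n_topics` name.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus, made up for illustration only.
docs = [
    "tacos and burritos for dinner",
    "pizza and pasta for dinner",
    "the team won the football game",
    "the football team lost the game",
]

# Document-term matrix: one row per document, one column per vocabulary word.
counts = CountVectorizer().fit_transform(docs)

# Assumes a recent scikit-learn, where the argument is called n_components.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

print(counts.shape)            # (4, vocabulary size)
print(doc_topics.shape)        # (4, 2): one column per topic
print(doc_topics.sum(axis=1))  # each row is a probability distribution
```

Each document goes from one column per word to one column per topic, which is what lets the topic proportions serve as compact features for a downstream model.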


### Simple Example of LDA in NLP

Adapted from: http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-topics-extraction-with-nmf-lda-py

+ Authors: 
    + Olivier Grisel <olivier.grisel@ensta.org>
    + Lars Buitinck
    + Chyi-Kwei Yau <chyikwei.yau@gmail.com>
+ License: BSD 3 clause

In [1]:
from __future__ import print_function  # must come before any other statement
import pandas as pd
from time import time

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups


### This code defines a custom function that we'll use later

In [2]:
n_samples = 2000
n_features = 1000
n_topics = 10
n_top_words = 20


def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()
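
The slice `[:-n_top_words - 1:-1]` in `print_top_words` is easy to misread, so here is a small sketch of what it does. The topic weights and vocabulary below are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical weights for one topic over a five-word vocabulary.
feature_names = ["kernel", "file", "docker", "image", "thanks"]
topic = np.array([0.1, 2.5, 0.3, 4.0, 1.2])
n_top_words = 3

# argsort() returns indices ordered from smallest to largest weight.
# The slice [:-n_top_words - 1:-1] walks backwards from the end and stops
# after n_top_words entries, so the largest weights come first.
top_idx = topic.argsort()[:-n_top_words - 1:-1]
top_words = [feature_names[i] for i in top_idx]

print(top_words)  # ['image', 'file', 'thanks']
```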



### This code loads the dataset

In [5]:

# Load the issue comments from a CSV file and keep the first n_samples
# comments. Common English words, words occurring in only one document,
# and words occurring in at least 95% of the documents are filtered out
# later by the CountVectorizer.

print("Loading dataset...")
t0 = time()
df = pd.read_csv('issue_comments_jupyter_copy.csv')
df['org'] = df['org'].astype('str')
df['repo'] = df['repo'].astype('str')
df['comments'] = df['comments'].astype('str')
df['user'] = df['user'].astype('str')

data_samples = df.comments[:n_samples]
print("done in %0.3fs." % (time() - t0))


Loading dataset...
done in 0.096s.


In [6]:
df['comments'][2]

'same issue\n'

In [7]:
# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
t0 = time()
tf = tf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))



Extracting tf features for LDA...
done in 0.128s.


In [8]:

print("Fitting LDA models with tf features, "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
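# Note: scikit-learn 0.19 renamed n_topics to n_components, and n_topics
# was removed in 0.21. On newer versions, pass n_components=n_topics.
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)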




Fitting LDA models with tf features, n_samples=2000 and n_features=1000...


In [9]:
t0 = time()
lda.fit(tf)
print("done in %0.3fs." % (time() - t0))

done in 1.681s.


In [10]:
X_example = lda.transform(tf)

In [11]:
import pandas as pd

# X_example has one column per topic, so it needs n_topics column names.
pd.DataFrame(X_example, columns=["topic_%d" % i for i in range(n_topics)])

In [12]:

print("\nTopics in LDA model:")
# Note: scikit-learn 1.0 and later provide get_feature_names_out() instead.
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, 5)


Topics in LDA model:
Topic #0:
kernel file ipython jupyter py
Topic #1:
github com https jupyter ellisonbg
Topic #2:
think like kernel just things
Topic #3:
bash process setsid subprocess setpgrp
Topic #4:
https com assets png cloud
Topic #5:
pr use issue logo thanks
Topic #6:
thanks page looks great file
Topic #7:
jupyter docker image images demo
Topic #8:
message ipython js comm ll
Topic #9:
optional install str self function



### In class assignment

+ Load the training set (done for you below)
+ Re-run LDA and use the document-topic proportions as input features for a model
+ Predict the newsgroup categories using some multinomial (multi-class) classifier
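
As a rough sketch of how the three steps fit together, the vectorizer, LDA, and a classifier can be chained in a scikit-learn `Pipeline`. The toy documents and labels are hypothetical stand-ins for the newsgroup posts, `LogisticRegression` is just one possible classifier, and `n_components` assumes a recent scikit-learn release.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data standing in for the newsgroup posts and targets.
train_docs = [
    "tacos burritos dinner salsa",
    "pizza pasta dinner cheese",
    "football team game score",
    "football game team coach",
]
train_labels = [0, 0, 1, 1]

# Raw text -> term counts -> topic proportions -> class prediction.
pipe = make_pipeline(
    CountVectorizer(stop_words="english"),
    LatentDirichletAllocation(n_components=2, random_state=0),
    LogisticRegression(),
)
pipe.fit(train_docs, train_labels)

preds = pipe.predict(["the football game starts tonight"])
print(preds)
```

Wrapping the steps in one pipeline means the vectorizer and LDA are fitted only on the data passed to `fit`, and new text goes through the same transformations at prediction time.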

In [None]:
print("Loading dataset...")
t0 = time()
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'), 
                            subset="train")

data = dataset.data

y = dataset.target

print("done in %0.3fs." % (time() - t0))


In [None]:
import numpy as np

np.unique(y, return_counts=False)

In [None]:
# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
t0 = time()
tf_vectorizer.fit(data)

tf = tf_vectorizer.transform(data)


print("done in %0.3fs." % (time() - t0))


In [None]:
# Report the actual matrix size: the full training set is used here,
# not just the first n_samples documents.
print("Fitting LDA models with tf features, "
      "n_samples=%d and n_features=%d..."
      % tf.shape)
lda = LatentDirichletAllocation(n_topics=20, max_iter=50,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)


t0 = time()
lda.fit(tf)
print("done in %0.3fs." % (time() - t0))

In [None]:
X = lda.transform(tf)

In [None]:
test = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'), 
                            subset="test")

testdata = test.data

y_test = test.target


In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

In [None]:
clf2 = RandomForestClassifier()

clf2.fit(X, y)

In [None]:
from sklearn.model_selection import GridSearchCV, KFold

gs = GridSearchCV(estimator = RandomForestClassifier(), 
                 param_grid = {'n_estimators':np.arange(10, 21, 1)}, 
                 cv = KFold(n_splits=5))

gs.fit(X, y)

algo = gs.best_estimator_
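
After a grid search it is usually worth checking which setting won and how well it scored. The sketch below uses synthetic data from `make_classification` in place of the LDA features (an assumption made so the example runs on its own) and a shorter parameter grid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in for the document-topic features and labels.
X_demo, y_demo = make_classification(n_samples=100, n_features=5,
                                     random_state=0)

gs_demo = GridSearchCV(estimator=RandomForestClassifier(random_state=0),
                       param_grid={'n_estimators': [10, 15, 20]},
                       cv=KFold(n_splits=5))
gs_demo.fit(X_demo, y_demo)

# best_params_ holds the winning setting, best_score_ its mean CV accuracy,
# and best_estimator_ is that model refitted on all of X_demo.
print(gs_demo.best_params_)
print(gs_demo.best_score_)
```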


In [None]:
clf = MultinomialNB(alpha=.01)
clf.fit(X, y)


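In [None]:
# Transform the test documents with the vectorizer and LDA fitted on the
# training set, then predict. This must run before the confusion matrix.
X_test = tf_vectorizer.transform(testdata)
X_test = lda.transform(X_test)

pred = clf2.predict(X_test)
metrics.f1_score(y_test, pred, average='macro')

In [None]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, pred)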

In [None]:
algo.score(X_test, y_test)
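
To help read the two metrics used above, here is a small worked example on hand-made labels (hypothetical values, not results from the newsgroups data).

```python
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical true and predicted labels for three classes.
y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 1, 1, 1, 2, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)
# [[1 1 0]
#  [0 2 0]
#  [1 0 1]]

# Macro F1 is the unweighted mean of the per-class F1 scores,
# here (0.5 + 0.8 + 0.667) / 3.
macro_f1 = f1_score(y_true_demo, y_pred_demo, average='macro')
print(round(macro_f1, 3))  # 0.656
```

Because macro averaging weights every class equally, a class the model handles poorly pulls the score down even if that class is rare.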

### In class assignment:

+ I'll divide you into 3 segments
+ Each segment generates 100 sentences on the *same topic*
+ Save as a JSON and send to me
+ We'll run them through LDA

In [None]:
CLASSLIST = [
    "My favorite type of food is tacos, but it used to be fried chicken.",
    "My favorite type of taco is al pastor.",
    "My favorite mexican resturant is El Rancho.",
    "I also like all the resturants in my immediate neighborhood.",
    "Corn dogs are quite nice as well however my friends make fun of me",
    "Yo dog I love dog, but not like the food",
    "Sup hot stuff, you like hot food or cold food",
    "I raise chickens in my farm that I dont eat",
    "Sometimes I go fishing with my father",
    "Working at the food bank is very fulfilling to me",
    "If I eat too much Im not going to feel like drinking",
    "The breweries in San Diego are plentiful",
    "The food in San Deigo are not as good as the stuff in SF",
    "LA has really solid mexican food which I love",
    "Im pretty hungry right now, where should we grab lunch?",
    "Is dinner going to be taken care of at the Reynold's house?",
    "If I pay for breakfast will you cover lunch or dinner babe?",
    "Fast Food is not good for health",
    "Indian food is spicy",
    "I like Thai food",
    "There are two new restaurants opened around the block",
    "Can I get this sweet dish?",
    "McBurger has 3000 calories",
    "Nuts are good for health",
    "Vegetables are bad",
    "Cheese cake is good",
    "He ate all the fried food",
    "In istanbul, a burger cost $30",
    "The new hotel chain offers free buffet for 2 days",
    "Can I get a diet coke?",
    "How to cook fiesta salsa?",
    "These fries are tasty but bad for health",
    "This chinese restaurant serves the best soup",
    "Please order a pizza for me?",
    "Dinner is ready",
    "Doing breakfast is good for health",
    "Please dont throw extra food, donate it to someone hungry",
    "chocolate chip cookies and best fresh from the oven.",
    "pumpkin pie is a good dessert for the fall season",
    "vegtables are an important part of any diet",
    "fruit is a healthy way to suffice your sweet tooth",
    "eggs are a filling way eat breakfast",
    "soda is a necessary evil.",
    "philz coffee is a great way to start your morning",
    "after making a big dinner with several courses, at least there are leftovers.",
    "turnkey is a great type of meat",
    "hot sauce makes everything better.",
    "hot dogs and garlic fries are best when watching a giants baseball game.",
    "I like ketchup more than mustard",
    "I wish a had a few more cook books.",
    "The worst part of cooking is cleaning the pots and pans afterwards.",
    "I had cereal with a banana every morning before school as a kid.",
    "Avocado is my favorite type of vegtable.",
    "I try to avoid fast food restaurants as much as possible.",
    "shrimp scampi is one of my all time favorite dishes.",
    "cooking is something I hope to do more of later in life.",
    "salmon is a great type of food",
    "You should eat well, but not like Charles Barkley well.",
    "There are like 17 cooking shows. All of them seem to be related to Top Chef.",
    "Guy Fearri is not a chef so much as the lead from Smashmouth pretending to be a chef.",
    "Salt is not a food. But it goes well on food.",
    "Vegetarians who still eat fish are not vegetarians. They are just against eating things that have eyes.",
    "Vegans are basically food Taliban. Do not make me feel bad because I have good things in my life.",
    "They say cows shitting causes global warming. That means we should eat less cows. Maybe more veal though. What is the shit to meat produced ratio where we can still enjoy meat, but not destroy the only planet we have.",
    "My mother said pre-heat the oven. Instead I turned on the microwave.",
    "Turkey is the worst of the bird dishes.",
    "Dog is a food someplaces.",
    "To make rice, you just get rice, and then add water.",
    "Food Trucks are not made of food.",
    "Instagram is mostly a forum for posting food photos. ALso for Smirnoff ICe ads.",
    "Pasta is a delicacy.",
    "I refused to believe that gushers are a food.",
    "If you travel exclusively for local dishes, you have too much money.",
    "Happiness: a good bank account, a good cook, and a good digestion.",
    "Food Porn and Porn Food are not the same thing, and you should google only one.",
    "France thinks it has the best food in Europe, but really Italy does. In Asia, Thailand is to France, as Vietnam is to Italy. I will not negotiate on this.",
]

# Store the food sentences under the "Food" key.
food = {"Food": CLASSLIST}

import json

with open('Food.json', 'w') as fp:
    json.dump(food, fp, sort_keys=True, indent=4)


### Save to JSON

In [None]:
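# travellist is assumed to hold the travel sentences, built the same way
# as CLASSLIST above; it is not defined in this notebook.
travel = {"Travel":travellist}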

### JSON save:

import json
with open('Travel.json', 'w') as fp:
    json.dump(travel, fp, sort_keys=True, indent=4)


### Import from JSON

### Here is where we import our data

In [10]:
import numpy as np

# Load the sentence lists that each group saved as JSON

import json
with open('Food.json', 'r') as fp:
    food = json.load(fp)

with open('sports.json', 'r') as fp:
    sports = json.load(fp)
    
with open('Travel.json', 'r') as fp:  # same name as used when saving above
    travel = json.load(fp)


sentencelist = []

sentencelist.extend(food['Food'])
sentencelist.extend(sports['Sports'])
sentencelist.extend(travel['Travel'])
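
The save and load steps are meant to mirror each other. A minimal round-trip sketch, using a two-sentence dictionary and a temporary directory (both assumptions, so the example does not touch the class files):

```python
import json
import os
import tempfile

# Small stand-in for one group's sentences.
food_demo = {"Food": ["I like Thai food", "Indian food is spicy"]}

# Write to a temporary location rather than the working directory.
path = os.path.join(tempfile.mkdtemp(), "Food.json")
with open(path, "w") as fp:
    json.dump(food_demo, fp, sort_keys=True, indent=4)

# Reading the file back gives an equal dictionary.
with open(path, "r") as fp:
    loaded = json.load(fp)

sentences_demo = []
sentences_demo.extend(loaded["Food"])
print(len(sentences_demo))  # 2
```

The file name passed to `open` and the dictionary key have to match what was used when saving, including upper and lower case on case-sensitive file systems.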

### I'm using this code to create an outcome variable, so we can test our topic model

In [11]:
y = []
y.extend([0]*len(food['Food']))
y.extend([1]*len(sports['Sports']))
y.extend([2]*len(travel['Travel']))

label = []
label.extend(["Food"]*len(food['Food']))
label.extend(["Sports"]*len(sports['Sports']))
label.extend(["Travel"]*len(travel['Travel']))

### This organizes everything into a data frame

In [12]:
df = pd.DataFrame({"y":y, "sentence":sentencelist, "label":label})

### Preprocessing

+ The count vectorizer below turns each sentence into a row of raw term counts


In [13]:
# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=10000,
                                stop_words='english')
t0 = time()
tf_vectorizer.fit(sentencelist)

X = tf_vectorizer.transform(sentencelist)


print("done in %0.3fs." % (time() - t0))

Extracting tf features for LDA...
done in 0.016s.


### Define the LDA!

In [14]:
lda = LatentDirichletAllocation(n_topics=3, max_iter=50,
                                learning_method='online',
                                learning_offset=0.,
                                random_state=0)


In [15]:
lda.fit(X)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='online', learning_offset=0.0,
             max_doc_update_iter=100, max_iter=50, mean_change_tol=0.001,
             n_jobs=1, n_topics=3, perp_tol=0.1, random_state=0,
             topic_word_prior=None, total_samples=1000000.0, verbose=0)

In [18]:
docprobs = lda.transform(X)

docprobs

array([[ 0.06681848,  0.08079847,  0.85238305],
       [ 0.11134396,  0.75357425,  0.13508179],
       [ 0.11151966,  0.76358129,  0.12489904],
       [ 0.16675077,  0.16678902,  0.66646021],
       [ 0.8749538 ,  0.06928255,  0.05576365],
       [ 0.0563297 ,  0.05568371,  0.88798659],
       [ 0.04885048,  0.04771827,  0.90343125],
       [ 0.829315  ,  0.08508571,  0.08559929],
       [ 0.33333333,  0.33333333,  0.33333333],
       [ 0.1113184 ,  0.11142213,  0.77725948],
       [ 0.05938718,  0.0585844 ,  0.88202842],
       [ 0.11133115,  0.11758495,  0.7710839 ],
       [ 0.06781712,  0.06995786,  0.86222501],
       [ 0.05641614,  0.05778286,  0.885801  ],
       [ 0.06277123,  0.05801711,  0.87921165],
       [ 0.08539285,  0.08616942,  0.82843773],
       [ 0.85080342,  0.07051841,  0.07867817],
       [ 0.06681021,  0.55686836,  0.37632143],
       [ 0.16670708,  0.16685669,  0.66643623],
       [ 0.11115031,  0.11120946,  0.77764022],
       [ 0.0669114 ,  0.7465839 ,  0.186
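
Each row of `docprobs` is a probability distribution over the three topics, so one simple way to read it is to take the most probable topic per document. The sketch below copies four rows from the output above; treating a uniform row as uninformative is an interpretation, not something the notebook states.

```python
import numpy as np

# Four rows copied from the docprobs output above (columns = topics 0, 1, 2).
docprobs_sample = np.array([
    [0.06681848, 0.08079847, 0.85238305],
    [0.11134396, 0.75357425, 0.13508179],
    [0.8749538, 0.06928255, 0.05576365],
    [0.33333333, 0.33333333, 0.33333333],
])

# Dominant topic = column with the largest probability in each row.
dominant_topic = docprobs_sample.argmax(axis=1)
print(dominant_topic)  # [2 1 0 0]

# A uniform row carries no topic signal; argmax simply returns the first
# column there, so such rows may deserve separate handling.
n_topics_demo = docprobs_sample.shape[1]
is_uniform = np.isclose(docprobs_sample.max(axis=1), 1.0 / n_topics_demo)
print(is_uniform)  # [False False False  True]
```

A uniform row typically appears when none of a sentence's words survive the vectorizer's filtering, so LDA has nothing to base an assignment on.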

In [17]:
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, 5)

Topic #0:
traveling way best ball tour
Topic #1:
island time travelers worst ringle
Topic #2:
food travel rsquo like great



### Need to split into training and testing!



In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(docprobs, y, test_size=0.10, random_state=5)

### Train the random forest classifier on X_train

In [None]:
rf = RandomForestClassifier()

rf.fit(X_train, y_train)

In [None]:
predy = rf.predict(X_test)

In [None]:
predy

In [None]:
metrics.f1_score(y_test, predy, average='macro')
