# LDA (Latent Dirichlet Allocation) Demo
*Based on Jordan Barber's excellent LDA demo (https://goo.gl/XINIif)*

You'll need the `gensim` package to fit the LDA model. Open a command prompt or terminal and run:

`conda install gensim`

In [30]:
import nltk
import gensim

Set up the text processing pipeline

In [31]:
# Get a tokenizer
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')

# create English stop words list
# (the first use requires a one-time nltk.download('stopwords'))
en_stop = nltk.corpus.stopwords.words('english')

# Create p_stemmer of class PorterStemmer
p_stemmer = nltk.stem.porter.PorterStemmer()

Create some simple documents.

In [32]:
# create sample documents
doc_a = "Broccoli is good to eat. My brother likes to eat good broccoli, but not my mother."
doc_b = "My mother spends a lot of time driving my brother around to baseball practice."
doc_c = "Some health experts suggest that driving may cause increased tension and blood pressure."
doc_d = "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better."
doc_e = "Health professionals say that broccoli is good for your health." 

# compile sample documents into a list
doc_set = [doc_a, doc_b, doc_c, doc_d, doc_e]

Tokenize, clean, remove stopwords, and stem documents.

In [33]:
# list for tokenized documents in loop
texts = []

# loop through document list
for doc in doc_set:

    # clean and tokenize document string
    raw = doc.lower()
    tokens = tokenizer.tokenize(raw)

    # remove stop words from tokens
    stopped_tokens = [t for t in tokens if t not in en_stop]

    # stem tokens
    stemmed_tokens = [p_stemmer.stem(t) for t in stopped_tokens]

    # add tokens to list
    texts.append(stemmed_tokens)

print(texts)

[['broccoli', 'good', 'eat', 'brother', 'like', 'eat', 'good', 'broccoli', 'mother'], ['mother', 'spend', 'lot', 'time', 'drive', 'brother', 'around', 'basebal', 'practic'], ['health', 'expert', 'suggest', 'drive', 'may', 'caus', 'increas', 'tension', 'blood', 'pressur'], ['often', 'feel', 'pressur', 'perform', 'well', 'school', 'mother', 'never', 'seem', 'drive', 'brother', 'better'], ['health', 'profession', 'say', 'broccoli', 'good', 'health']]
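
The same steps can be wrapped in a function so they apply to any string. The sketch below is a dependency-free approximation rather than the notebook's pipeline: `re.findall(r"\w+", ...)` stands in for the `RegexpTokenizer`, the tiny `STOP_WORDS` set is a hypothetical stand-in for NLTK's English list, and stemming is left out, so its tokens are unstemmed (`likes` instead of `like`).

```python
import re

# Hypothetical, deliberately tiny stop list (an assumption for illustration);
# the notebook itself uses NLTK's full English stop word list.
STOP_WORDS = {"is", "to", "my", "but", "not", "a", "of", "for", "your", "that"}

def basic_preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, split on runs of word characters, and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [tok for tok in tokens if tok not in stop_words]

doc_a = "Broccoli is good to eat. My brother likes to eat good broccoli, but not my mother."
print(basic_preprocess(doc_a))
```

Keeping preprocessing in one function also makes it easier to apply identical cleaning to new text later on.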


Turn our tokenized documents into an id <-> term dictionary.

In [34]:
dictionary = gensim.corpora.Dictionary(texts)
print(dictionary)

Dictionary(32 unique tokens: ['broccoli', 'brother', 'eat', 'good', 'like']...)
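
To make the id <-> term mapping concrete, here is a minimal pure-Python sketch of building one. The helper name `build_token2id` is hypothetical, and the numbering rule (new tokens in each document get ids in alphabetical order) is an assumption inferred from the printed output, not a statement about gensim's internals.

```python
def build_token2id(tokenized_docs):
    """Assign an integer id to each distinct token, document by document."""
    token2id = {}
    for doc in tokenized_docs:
        # Assumption: unseen tokens in a document are numbered alphabetically.
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

# First two stemmed documents from the output above.
docs = [
    ["broccoli", "good", "eat", "brother", "like", "eat", "good", "broccoli", "mother"],
    ["mother", "spend", "lot", "time", "drive", "brother", "around", "basebal", "practic"],
]
token2id = build_token2id(docs)
id2token = {idx: tok for tok, idx in token2id.items()}  # reverse lookup
print(token2id["broccoli"], token2id["mother"], id2token[6])
```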


Convert tokenized documents into a document-term matrix, stored sparsely as one list of (token id, count) pairs per document.

In [35]:
corpus = [dictionary.doc2bow(text) for text in texts]
print(corpus)

[[(0, 2), (1, 1), (2, 2), (3, 2), (4, 1), (5, 1)], [(1, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1)], [(8, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1)], [(1, 1), (5, 1), (8, 1), (19, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1)], [(0, 1), (3, 1), (16, 2), (30, 1), (31, 1)]]
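
Each inner list is one sparse row of the document-term matrix. As a rough illustration, the sketch below expands two of the rows printed above into dense count vectors. `bow_to_dense` is a hypothetical helper, and the vocabulary size of 32 is taken from the `Dictionary` output.

```python
def bow_to_dense(bow, vocab_size):
    """Expand a list of (token_id, count) pairs into a dense count vector."""
    row = [0] * vocab_size
    for token_id, count in bow:
        row[token_id] = count
    return row

# Bag-of-words rows for doc_a and doc_e, copied from the printed corpus.
sparse_rows = [
    [(0, 2), (1, 1), (2, 2), (3, 2), (4, 1), (5, 1)],
    [(0, 1), (3, 1), (16, 2), (30, 1), (31, 1)],
]
matrix = [bow_to_dense(bow, vocab_size=32) for bow in sparse_rows]
print(matrix[0][:6])   # counts for token ids 0-5 in doc_a
print(matrix[1][16])   # count of token id 16 in doc_e
```

The sparse form matters for real corpora, where most documents use only a small fraction of the vocabulary.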


Generate LDA model.

In [36]:
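num_topics = 3
# LDA starts from a random initialization, so topics can differ between runs;
# pass random_state=<int> to LdaModel to make the results repeatable
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=num_topics, id2word = dictionary, passes=20)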

See what topics we found in the data

In [37]:
print(ldamodel.print_topics(num_topics=3, num_words=3))

[(0, '0.125*"health" + 0.050*"suggest" + 0.050*"tension"'), (1, '0.059*"drive" + 0.059*"pressur" + 0.059*"better"'), (2, '0.082*"broccoli" + 0.082*"good" + 0.081*"brother"')]
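
Each entry pairs a topic id with a string of `weight*"word"` terms, where the weight is that word's probability under the topic. If you want the numbers rather than the string, one option is to parse it, as in this small sketch (`parse_topic` is a hypothetical helper; gensim's `show_topic` method is another route to word and weight pairs).

```python
import re

def parse_topic(topic_string):
    """Turn '0.125*"health" + ...' into [('health', 0.125), ...]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

# Topic 0 as printed above.
topic_0 = '0.125*"health" + 0.050*"suggest" + 0.050*"tension"'
print(parse_topic(topic_0))
```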


See how topics map to some of our original texts

In [38]:
ldamodel[dictionary.doc2bow(texts[0])] #doc_a = "Broccoli is good to eat. My brother likes to eat good broccoli, but not my mother."

[(0, 0.034531236), (1, 0.034101687), (2, 0.9313671)]

In [39]:
ldamodel[dictionary.doc2bow(texts[3])] #doc_d = "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better."

[(0, 0.02628985), (1, 0.9465401), (2, 0.02717006)]
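
Each result is a list of (topic id, probability) pairs. A common next step is to pick the most probable topic for a document; the sketch below does that for the two outputs above. `dominant_topic` is a hypothetical helper name, not part of gensim.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])

# Topic distributions copied from the two outputs above.
doc_a_topics = [(0, 0.034531236), (1, 0.034101687), (2, 0.9313671)]
doc_d_topics = [(0, 0.02628985), (1, 0.9465401), (2, 0.02717006)]

print(dominant_topic(doc_a_topics)[0], dominant_topic(doc_d_topics)[0])
```

Here doc_a lands in the broccoli/good/brother topic and doc_d in the drive/pressur/better topic, which matches the top words printed earlier.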

We can also use this model to characterize new text.

In [40]:
def get_topics(new_string):
    # apply the same preprocessing used for the training documents
    new_text_tokens = tokenizer.tokenize(new_string.lower())
    new_stopped_tokens = [t for t in new_text_tokens if t not in en_stop]
    new_stemmed_tokens = [p_stemmer.stem(t) for t in new_stopped_tokens]
    return ldamodel[dictionary.doc2bow(new_stemmed_tokens)]

In [41]:
get_topics("My mother and brother drive a broccoli truck.")

[(0, 0.07121852), (1, 0.075055346), (2, 0.8537261)]

### Visualizing LDA results
We can use the pyLDAvis package to inspect the results.

First, you'll need to install it from the command prompt or terminal. Anaconda doesn't index the package, so you'll need to use pip or another package manager (be sure to install it for the right Python version).

`pip install pyldavis`

You can play with some examples from larger sets of models here: http://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/pyLDAvis_overview.ipynb#topic=0&lambda=1&term=

In [42]:
# As a reminder, here are the topics we derived
print(ldamodel.print_topics(num_topics=3, num_words=3))

[(0, '0.125*"health" + 0.050*"suggest" + 0.050*"tension"'), (1, '0.059*"drive" + 0.059*"pressur" + 0.059*"better"'), (2, '0.082*"broccoli" + 0.082*"good" + 0.081*"brother"')]


In [43]:
# In newer pyLDAvis releases this module is named pyLDAvis.gensim_models
import pyLDAvis.gensim

vis = pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)
pyLDAvis.display(vis)



# Twitter Data

Now let's try it with the data from our Twitter corpus.

In [44]:
import pandas as pd

In [45]:
df = pd.read_csv("V&A Qualitative Datathon - Twitter @neiltyson - Raw_Data.csv")
df.head(10)

Unnamed: 0,tweet-id,tweet-text,tweet-author,tweet-timestamp,tweet-timestamp-date
0,1.0,"Astrophysics is my first love, but I do occasi...",Neil deGrasse Tyson,1477279000000.0,10/24/16 3:10
1,2.0,JUST POSTED: @StarTalkRadio “Physics & Fantasy...,Neil deGrasse Tyson,1480000000000.0,10/22/16 3:21
2,3.0,A reminder that in a baseball game you cannot ...,Neil deGrasse Tyson,1477092000000.0,10/21/16 23:20
3,4.0,The physics of Light Sabers: A brief argument ...,Neil deGrasse Tyson,1480000000000.0,10/17/16 20:38
4,5.0,"Full Moon this eve, across all Earth's lands. ...",Neil deGrasse Tyson,1480000000000.0,10/15/16 23:10
5,6.0,If a Space Alien landed in the USA & requested...,Neil deGrasse Tyson,1480000000000.0,10/14/16 15:49
6,7.0,"Future headlines from the Multiverse: Nov 9, 2...",Neil deGrasse Tyson,1480000000000.0,10/9/16 14:53
7,8.0,Awww. That’s the nicest thing anybody has said...,Neil deGrasse Tyson,1480000000000.0,10/7/16 17:20
8,9.0,"If ComicCon people ruled the world, internatio...",Neil deGrasse Tyson,1480000000000.0,10/6/16 17:54
9,10.0,"On Pluto, with its 248-year orbit around the S...",Neil deGrasse Tyson,1475708000000.0,10/5/16 23:00


In [46]:
tweets_set = df['tweet-text']

In [47]:
# Twitter-specific stop words (URL fragments and site boilerplate)
twitter_stop = ["twitter","com","pic","http","https","www","status","bit","ly"]
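
These extra stop words are needed because a `\w+` tokenizer breaks URLs into pieces such as `https`, `bit`, and `ly`. The sketch below shows the effect on a made-up tweet and, as an alternative that this notebook does not use, strips URLs before tokenizing. The tweet text, `URL_PATTERN`, and `strip_urls` are all assumptions for illustration.

```python
import re

# Hypothetical tweet text with a shortened link.
tweet = "JUST POSTED: Physics & Fantasy https://bit.ly/abc123"

# A plain \w+ tokenizer splits the URL into fragments.
print(re.findall(r"\w+", tweet.lower()))

# Alternative sketch: remove URLs first so no fragments are produced.
URL_PATTERN = re.compile(r"https?://\S+")

def strip_urls(text):
    """Replace http(s) URLs with a space."""
    return URL_PATTERN.sub(" ", text)

print(re.findall(r"\w+", strip_urls(tweet).lower()))
```

Stripping URLs also drops the random link codes, which a stop word list cannot anticipate.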

In [48]:
# list for tokenized documents in loop
tweet_texts = []

# preprocessing: clean, tokenize, remove stopwords, and stem
for tweet in tweets_set:
    raw = tweet.lower()
    tokens = tokenizer.tokenize(raw)
    stopped_tokens = [t for t in tokens if (t not in en_stop and t not in twitter_stop)]
    stemmed_tokens = [p_stemmer.stem(t) for t in stopped_tokens]
    tweet_texts.append(stemmed_tokens)

# build our dictionary and matrix
tweet_dict = gensim.corpora.Dictionary(tweet_texts)
tweet_corpus = [tweet_dict.doc2bow(text) for text in tweet_texts]


In [49]:
tweet_texts

[['astrophys', 'first', 'love', 'occasion', 'give', 'schist', 'geolog'],
 ['post', 'startalkradio', 'physic', 'fantasi', 'time', 'travel', 'w', 'michiokaku', 'itunespodcast', '2eq226v'],
 ['remind', 'basebal', 'game', 'cannot', 'blame', 'umpir', 'enlarg', 'strike', 'zone', 'expand', 'univers'],
 ['physic', 'light', 'saber', 'brief', 'argument', 'profbriancox', 'lose', 'natgeo', '2de7ooz'],
 ['full', 'moon', 'eve', 'across', 'earth', 'land', 'rise', 'gentli', 'east', 'curtain', 'twilight', 'descend'],
 ['space', 'alien', 'land', 'usa', 'request', 'take', 'leader', 'wonder', 'pre', 'trump', 'would', 'react', 'vs', 'pre', 'clinton'],
 ['futur', 'headlin', 'multivers', 'nov', '9', '2016', 'trump', 'got', 'hillari', 'elect', 'dismantl', 'republican', 'parti'],
 ['awww', 'nicest', 'thing', 'anybodi', 'said', 'long', 'ayeshatron', '784441432652320769'],
 ...

In [50]:
# learn a topic model for the tweets
tweet_topics = 3
tweet_model = gensim.models.ldamodel.LdaModel(tweet_corpus, num_topics=tweet_topics, id2word = tweet_dict, passes=20)
print(tweet_model.print_topics(tweet_topics, num_words=3))

[(0, '0.007*"post" + 0.007*"follow" + 0.007*"cosmic"'), (1, '0.013*"startalkradio" + 0.012*"itunespodcast" + 0.011*"earth"'), (2, '0.008*"year" + 0.007*"earth" + 0.006*"wonder"')]


In [60]:
df.head(3)

Unnamed: 0,tweet-id,tweet-text,tweet-author,tweet-timestamp,tweet-timestamp-date,topics
0,1.0,"Astrophysics is my first love, but I do occasi...",Neil deGrasse Tyson,1477279000000.0,10/24/16 3:10,
1,2.0,JUST POSTED: @StarTalkRadio “Physics & Fantasy...,Neil deGrasse Tyson,1480000000000.0,10/22/16 3:21,
2,3.0,A reminder that in a baseball game you cannot ...,Neil deGrasse Tyson,1477092000000.0,10/21/16 23:20,


In [67]:
# attach each tweet's topic distribution, a list of (topic id, probability) pairs
topics = [tweet_model[tweet_dict.doc2bow(x)] for x in tweet_texts]
df['topics'] = topics
df.head()

Unnamed: 0,tweet-id,tweet-text,tweet-author,tweet-timestamp,tweet-timestamp-date,topics
0,1.0,"Astrophysics is my first love, but I do occasi...",Neil deGrasse Tyson,1477279000000.0,10/24/16 3:10,"[(0, 0.9141094), (1, 0.042994667), (2, 0.04289..."
1,2.0,JUST POSTED: @StarTalkRadio “Physics & Fantasy...,Neil deGrasse Tyson,1480000000000.0,10/22/16 3:21,"[(0, 0.9335337), (1, 0.035652414), (2, 0.03081..."
2,3.0,A reminder that in a baseball game you cannot ...,Neil deGrasse Tyson,1477092000000.0,10/21/16 23:20,"[(0, 0.9405498), (1, 0.031075275), (2, 0.02837..."
3,4.0,The physics of Light Sabers: A brief argument ...,Neil deGrasse Tyson,1480000000000.0,10/17/16 20:38,"[(0, 0.036839463), (1, 0.034266), (2, 0.9288945)]"
4,5.0,"Full Moon this eve, across all Earth's lands. ...",Neil deGrasse Tyson,1480000000000.0,10/15/16 23:10,"[(0, 0.026391737), (1, 0.028222496), (2, 0.945..."


In [68]:
df.iloc[0]["tweet-text"]

'Astrophysics is my first love, but I do occasionally give a schist about geology.'

In [69]:
tweet_vis = pyLDAvis.gensim.prepare(tweet_model, tweet_corpus, tweet_dict)
pyLDAvis.display(tweet_vis)



# Exercise
Try building a topic model using tweets from all six files together. 

Can you fit a set of topics that captures the variation in topics across the different users?

What parameters did you use?