# Latent Dirichlet Allocation

We will use Latent Dirichlet Allocation (LDA) to learn more about the hidden structure within 4551 news articles.
LDA is a probabilistic topic model that assumes each document is a mixture of topics and that each word in a document is attributable to one of the document's topics. There is quite a good high-level overview of probabilistic topic models.
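
To make the "mixture of topics" assumption concrete, here is a minimal sketch of the generative story LDA assumes, written with the standard library only. The vocabulary, the two hand-written topics, and the helper names `sample_dirichlet` and `generate_document` are hypothetical and chosen for illustration; they are not learned from the corpus and are not part of Gensim.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, vocab, n_words, rng):
    """Generate one toy document following the LDA generative story."""
    # 1. draw this document's mixture over topics
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # 2. pick a topic for this word position from the document's mixture
        z = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. pick a word from the chosen topic's word distribution
        words.append(rng.choices(vocab, weights=topic_word[z])[0])
    return theta, words

# hypothetical toy vocabulary and two hand-written topics (not learned from data)
vocab = ["music", "album", "band", "bank", "loan", "price"]
topic_word = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # a "music" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # a "finance" topic
]

rng = random.Random(0)
theta, words = generate_document(topic_word, alpha=[0.5, 0.5],
                                 vocab=vocab, n_words=8, rng=rng)
print([round(t, 3) for t in theta])
print(words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the per-document mixtures and the per-topic word distributions.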

#### Loading the Documents

Our sample corpus is a collection of news articles gathered in 2016. The articles are stored in a single file, formatted so that one article appears on each line. We will load these articles into a list, and also create a short snippet of text for each document.

In [1]:
import os.path
raw_documents = []
snippets = []
with open( os.path.join("data", "articles.txt") ,"r",encoding="utf8") as fin:
    for line in fin.readlines():
        text = line.strip()
        raw_documents.append( text )
        # keep a short snippet of up to 100 characters as a title for each article
        snippets.append( text[0:min(len(text),100)] )
print("Read %d raw text documents" % len(raw_documents))

Read 4551 raw text documents


For LDA, we will use the Gensim package. We are going to preprocess the articles a bit differently here: first we define a function to remove proper nouns.

In [2]:
#strip any proper names from a text...unfortunately right now this is yanking the first word from a sentence too.
import string
import numpy as np
import pandas as pd
import nltk
import re
import os
import codecs
from sklearn import feature_extraction
#import mpld3

def strip_proppers(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent) if word.islower()]
    return "".join([" "+i if not i.startswith("'") and i not in string.punctuation else i for i in tokens]).strip()
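
The filter in `strip_proppers` hinges on `word.islower()`. The sketch below isolates that test so its side effects are easy to see. It uses a crude regex tokenizer instead of NLTK (an assumption made so the snippet runs without NLTK data), and the name `strip_capitalized` is hypothetical.

```python
import re

def strip_capitalized(text):
    """Keep only tokens that are entirely lowercase, mimicking the word.islower() filter."""
    # crude tokenizer (an assumption for this sketch): runs of word characters
    # and apostrophes, or single punctuation marks
    tokens = re.findall(r"[\w']+|[^\w\s]", text)
    return [tok for tok in tokens if tok.islower()]

sample = "The bank paid a fine. Barclays said it would appeal in 2016."
print(strip_capitalized(sample))
```

Besides the proper noun "Barclays", the sentence-initial "The" is dropped, and so are "2016" and the full stops, because `str.islower()` is false for any token without a cased character.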

Here we will run the actual text processing (removal of proper nouns, tokenization, removal of stop words).

# Stopwords, stemming, and tokenizing

First, we load NLTK's list of English stop words. Stop words are words like "a", "the", or "in" which don't convey significant meaning.

In [3]:
# load nltk's English stopwords as variable called 'stopwords'
stopwords = nltk.corpus.stopwords.words('english')

print(stopwords[:10])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]


In [4]:
# load nltk's SnowballStemmer as a variable called 'stemmer'
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

Below we define two functions:

tokenize_and_stem: tokenizes the text (splits it into a list of its words, or tokens) and also stems each token

tokenize_only: tokenizes and lowercases the text, without stemming

In [5]:
# here we define a tokenizer and stemmer which returns the list of stems in the text that it is passed

def tokenize_and_stem(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems


def tokenize_only(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens
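
Both functions share the same filtering step: a token survives only if it contains at least one ASCII letter. A small standalone sketch of that step follows; the helper name `keep_tokens_with_letters` and the token list are hypothetical.

```python
import re

def keep_tokens_with_letters(tokens):
    """Drop tokens that contain no ASCII letter, such as numbers and bare punctuation."""
    return [tok for tok in tokens if re.search('[a-zA-Z]', tok)]

tokens = ["pre-financial", "crisis", "2016", "3-0", ",", "'s", "81st"]
print(keep_tokens_with_letters(tokens))
```

Tokens such as "81st" and "'s" pass the filter because they contain a letter, while "2016", "3-0" and "," are removed.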


In [6]:
import os.path
import re

import nltk
from nltk import ngrams
from nltk.stem.snowball import SnowballStemmer

stopwords = nltk.corpus.stopwords.words('english')
stemmer = SnowballStemmer("english")

raw_documents1 = []
my_documents1 = []

with open(os.path.join("data", "articles1.txt"), "r", encoding="utf8") as fin1:
    for line1 in fin1:
        # collect the stems for this article only; starting a fresh list per line
        # keeps one article's words from accumulating into the next
        stems = []
        for wo in line1.strip().split():
            # keep words that are not stop words and contain at least one letter
            if wo not in stopwords and re.search('[a-zA-Z]', wo):
                stems.append(stemmer.stem(wo))
        raw_documents1.append(" ".join(stems))


In [7]:
for document1 in raw_documents1:
    trigrams = ngrams(document1.split(), 3)
    for grams in trigrams:
        #print(grams)
        trigram=' '.join(grams)
        #print(s)
        my_documents1.append(trigram)
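
For reference, the sliding window behind the trigrams can be sketched without NLTK. The helper name `word_trigrams` is hypothetical; the input is the start of the first stemmed article shown in the output further down.

```python
def word_trigrams(text):
    """Return every run of three consecutive whitespace-separated words, joined by spaces."""
    words = text.split()
    # zip three staggered views of the word list to form a sliding window of width 3
    return [" ".join(gram) for gram in zip(words, words[1:], words[2:])]

print(word_trigrams("barclay defianc us fine merit"))
```

A text of n words yields n - 2 trigrams, so texts shorter than three words contribute nothing.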

In [8]:
texts = [[w] for w in my_documents1]

for i in range(20):
    print(texts[i])

['barclay defianc us']
['defianc us fine']
['us fine merit']
['fine merit barclay']
['merit barclay disgrac']
['barclay disgrac mani']
['disgrac mani way']
['mani way pre-financi']
['way pre-financi crisi']
['pre-financi crisi boom']
['crisi boom years.']
['boom years. so']
['years. so tempt']
['so tempt think']
['tempt think bank,']
['think bank, ask']
['bank, ask us']
['ask us depart']
['us depart justic']
['depart justic pay']


In [9]:
from gensim import corpora, models, similarities 

#remove proper names
%time preprocess = [strip_proppers(doc) for doc in raw_documents]


#tokenize
%time tokenized_text = [tokenize_and_stem(text) for text in preprocess]

#remove stop words
%time texts = [[word for word in text if word not in stopwords] for text in tokenized_text]



Wall time: 1min 21s
Wall time: 2min 1s
Wall time: 11.5 s


Below are some Gensim-specific conversions; we also filter out extreme words (see inline comment).

In [10]:
#create a Gensim dictionary from the texts
dictionary = corpora.Dictionary(texts)

#remove extremes (similar to the min/max df step used when creating the tf-idf matrix)
dictionary.filter_extremes(no_below=1, no_above=0.8)

#convert the dictionary to a bag of words corpus for reference
corpus = [dictionary.doc2bow(text) for text in texts]
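
The two steps above, filtering by document frequency and converting each document to a bag of words, can be illustrated in plain Python. This is a simplified sketch under stated assumptions, not Gensim's implementation: the helper names and the toy texts are hypothetical, and Gensim's exact rounding of the `no_above` bound and its default cap on vocabulary size are not reproduced.

```python
from collections import Counter

def filter_by_document_frequency(texts, no_below, no_above):
    """Return the tokens whose document frequency lies within the given bounds."""
    n_docs = len(texts)
    # count each token once per document
    doc_freq = Counter(tok for text in texts for tok in set(text))
    return {tok for tok, df in doc_freq.items()
            if df >= no_below and df <= no_above * n_docs}

def to_bag_of_words(text, token2id):
    """Convert one tokenized document into sorted (token_id, count) pairs."""
    # tokens that were filtered out of the vocabulary are skipped
    counts = Counter(token2id[tok] for tok in text if tok in token2id)
    return sorted(counts.items())

toy_texts = [
    ["bank", "loan", "said"],
    ["music", "album", "said"],
    ["bank", "said", "bank"],
    ["game", "goal", "said"],
    ["court", "law", "said"],
]
kept = filter_by_document_frequency(toy_texts, no_below=1, no_above=0.8)
token2id = {tok: i for i, tok in enumerate(sorted(kept))}
print(sorted(kept))
print(to_bag_of_words(toy_texts[2], token2id))
```

In this toy corpus "said" occurs in all five documents, which exceeds the 80% bound, so it is dropped. With `no_below=1`, as in the cell above, no rare token is removed.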

Our LDA model runs below. We fit 10 topics and make 50 passes over the corpus to give the model a chance to converge; as the timing below shows, this took 7 min 39 s on my machine. With chunksize=1000, the 4551 articles are processed in chunks of 1000 documents. We should optimize this, and Gensim has the capacity to run in parallel. I'll likely explore this further as I use the implementation on larger corpora.

In [11]:
# alternative configuration (disabled):
# %time lda = models.LdaModel(corpus, num_topics=5,
#                             id2word=dictionary,
#                             update_every=5,
#                             chunksize=10000,
#                             passes=100)
%time lda = models.LdaModel(corpus, num_topics=10, \
                            id2word=dictionary, \
                            update_every=5, \
                            chunksize=1000, \
                            passes=50, minimum_probability=0)

Wall time: 7min 39s


We pickle (save) the trained LDA model here so that it can be loaded later without repeating the training run.

In [12]:
# sklearn.externals.joblib has been removed from scikit-learn; use the standalone joblib package
import joblib
joblib.dump(lda, "lda-model.pkl")



['lda-model.pkl']

In [13]:
# sklearn.externals.joblib has been removed from scikit-learn; use the standalone joblib package
import joblib
ldaload = joblib.load("lda-model.pkl")

Each topic has a set of words that defines it, along with a certain probability.

In [14]:
ldaload.show_topics()

[(0,
  '0.012*"music" + 0.010*"song" + 0.010*"album" + 0.009*"like" + 0.007*"one" + 0.006*"band" + 0.006*"year" + 0.005*"record" + 0.005*"sound" + 0.005*"time"'),
 (1,
  '0.017*"site" + 0.014*"year" + 0.009*"loan" + 0.009*"compani" + 0.008*"sale" + 0.008*"includ" + 0.007*"share" + 0.007*"new" + 0.006*"price" + 0.006*"buy"'),
 (2,
  '0.015*"said" + 0.010*"campaign" + 0.008*"say" + 0.006*"peopl" + 0.006*"vote" + 0.006*"elect" + 0.006*"support" + 0.005*"parti" + 0.005*"state" + 0.005*"would"'),
 (3,
  '0.015*"court" + 0.014*"women" + 0.014*"abort" + 0.013*"law" + 0.012*"right" + 0.011*"case" + 0.008*"would" + 0.007*"rule" + 0.007*"state" + 0.006*"said"'),
 (4,
  '0.021*"said" + 0.020*"bank" + 0.007*"year" + 0.006*"govern" + 0.006*"would" + 0.006*"report" + 0.005*"account" + 0.005*"financi" + 0.005*"custom" + 0.004*"execut"'),
 (5,
  '0.014*"min" + 0.011*"game" + 0.009*"goal" + 0.008*"season" + 0.008*"team" + 0.008*"player" + 0.007*"play" + 0.007*"ball" + 0.006*"time" + 0.006*"win"'),
 (6,

Here, we print the top 20 words in each topic, together with their weights.

In [15]:
import numpy
topics_matrix = lda.show_topics(formatted=True, num_words=20)
topics_matrix = numpy.array(topics_matrix)

topic_words = topics_matrix[:,:]
for i in topic_words:
    #print(i)
    for w in i:
        print(w)
    #print([word for word in i])
    print()

0
0.012*"music" + 0.010*"song" + 0.010*"album" + 0.009*"like" + 0.007*"one" + 0.006*"band" + 0.006*"year" + 0.005*"record" + 0.005*"sound" + 0.005*"time" + 0.004*"pop" + 0.004*"play" + 0.004*"new" + 0.004*"first" + 0.003*"make" + 0.003*"show" + 0.003*"artist" + 0.003*"rock" + 0.003*"track" + 0.003*"releas"

1
0.017*"site" + 0.014*"year" + 0.009*"loan" + 0.009*"compani" + 0.008*"sale" + 0.008*"includ" + 0.007*"share" + 0.007*"new" + 0.006*"price" + 0.006*"buy" + 0.005*"releas" + 0.005*"film" + 0.005*"market" + 0.005*"open" + 0.005*"top" + 0.005*"last" + 0.004*"offer" + 0.004*"total" + 0.004*"profit" + 0.004*"earn"

2
0.015*"said" + 0.010*"campaign" + 0.008*"say" + 0.006*"peopl" + 0.006*"vote" + 0.006*"elect" + 0.006*"support" + 0.005*"parti" + 0.005*"state" + 0.005*"would" + 0.005*"presid" + 0.005*"one" + 0.005*"go" + 0.005*"candid" + 0.005*"polit" + 0.004*"like" + 0.004*"voter" + 0.004*"report" + 0.004*"call" + 0.004*"presidenti"

3
0.015*"court" + 0.014*"women" + 0.014*"abort" + 0.013
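
If a plain list of words per topic is wanted, the formatted strings printed above can be split with a regular expression. The sketch below is one way to do it; the helper name `parse_topic` is hypothetical, and the input is the start of topic 0 copied from the output. Gensim can also return word and probability pairs directly via `show_topics(formatted=False)`.

```python
import re

def parse_topic(formatted):
    """Split a formatted topic string into (word, weight) pairs."""
    return [(word, float(weight))
            for weight, word in re.findall(r'([0-9.]+)\*"([^"]+)"', formatted)]

topic0 = '0.012*"music" + 0.010*"song" + 0.010*"album" + 0.009*"like" + 0.007*"one"'
pairs = parse_topic(topic0)
print([word for word, _ in pairs])
```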

Printing the topics

In [16]:
# Prints the topics.
for top in lda.print_topics():
  print(top)
print()

(0, '0.012*"music" + 0.010*"song" + 0.010*"album" + 0.009*"like" + 0.007*"one" + 0.006*"band" + 0.006*"year" + 0.005*"record" + 0.005*"sound" + 0.005*"time"')
(1, '0.017*"site" + 0.014*"year" + 0.009*"loan" + 0.009*"compani" + 0.008*"sale" + 0.008*"includ" + 0.007*"share" + 0.007*"new" + 0.006*"price" + 0.006*"buy"')
(2, '0.015*"said" + 0.010*"campaign" + 0.008*"say" + 0.006*"peopl" + 0.006*"vote" + 0.006*"elect" + 0.006*"support" + 0.005*"parti" + 0.005*"state" + 0.005*"would"')
(3, '0.015*"court" + 0.014*"women" + 0.014*"abort" + 0.013*"law" + 0.012*"right" + 0.011*"case" + 0.008*"would" + 0.007*"rule" + 0.007*"state" + 0.006*"said"')
(4, '0.021*"said" + 0.020*"bank" + 0.007*"year" + 0.006*"govern" + 0.006*"would" + 0.006*"report" + 0.005*"account" + 0.005*"financi" + 0.005*"custom" + 0.004*"execut"')
(5, '0.014*"min" + 0.011*"game" + 0.009*"goal" + 0.008*"season" + 0.008*"team" + 0.008*"player" + 0.007*"play" + 0.007*"ball" + 0.006*"time" + 0.006*"win"')
(6, '0.020*"film" + 0.006*"o

In [17]:
# Assigns the topics to the documents in corpus
lda_corpus = lda[corpus]

Using the topic probabilities, we can set a threshold and use it as a clustering baseline.

In [18]:
from itertools import chain
# Use 1/(number of topics) as the threshold.
# As a sanity check, average all topic probabilities over all documents; the mean should be close to 1/10:
scores = list(chain(*[[score for topic_id,score in topic] \
                      for topic in [doc for doc in lda_corpus]]))
threshold = sum(scores)/len(scores)

print("Now Printing Threshold value which decides the Probability of given target value falls in which cluster")
print(threshold)
print()


#cluster2 = [j for i,j in zip(lda_corpus,raw_documents[:1000]) if i[1][1] > threshold]

#print(cluster1)



Now Printing Threshold value which decides the Probability of given target value falls in which cluster
0.100000000014
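
The value printed above is no accident. Each document's topic probabilities sum to 1, and because the model was built with `minimum_probability=0`, all 10 topics are reported for every document, so the mean over all entries is 1/10. The toy example below shows the same effect with 4 topics; the data and the helper name `mean_topic_score` are hypothetical.

```python
def mean_topic_score(doc_topics):
    """Average every (topic_id, score) entry across all documents."""
    scores = [score for doc in doc_topics for _, score in doc]
    return sum(scores) / len(scores)

# hypothetical per-document topic distributions over 4 topics, each summing to 1
toy_doc_topics = [
    [(0, 0.70), (1, 0.10), (2, 0.10), (3, 0.10)],
    [(0, 0.05), (1, 0.05), (2, 0.80), (3, 0.10)],
    [(0, 0.25), (1, 0.25), (2, 0.25), (3, 0.25)],
]
threshold = mean_topic_score(toy_doc_topics)
print(round(threshold, 6))
```

The average therefore carries no information about the data beyond the number of topics; it is simply the uniform baseline that a topic must exceed to count as over-represented in a document.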



# Printing Data of Cluster 1

In [19]:
cluster1 = [j for i,j in zip(lda_corpus,raw_documents[:1000]) if i[0][1] > threshold]
for i in cluster1:
    print(i)

Ariana Grande's donut-licking cost her a gig at White House, WikiLeaks reveals Licking donuts and saying “I hate America” cost Ariana Grande a prime gig performing for Barack Obama at the White House gala last September, according to several email exchanges exposed by WikiLeaks. Amid the thousands of DNC emails posted by WikiLeaks on Friday was a 10 September 2015 response to a request from the DNC finance chair, Zachary Allen, to vet the former Nickelodeon star to perform at a gala for the US president. “Ariana Butera-video caught her licking other peoples’ donuts while saying she hates America,” the DNC’s deputy compliance director wrote in response, referring to Grande’s real name. “Republican Congressman used this video and said it was a double standard that liberals were not upset with her like they are with Trump who criticized Mexicans; cursed out a person on Twitter after that person used an offensive word towards her brother.” A few months before the email exchange, on 4 July,

# Printing Data of Cluster 5

In [20]:
cluster5 = [j for i,j in zip(lda_corpus,raw_documents[:1000]) if i[5][1] > threshold]
for j in cluster5:
    print(j)

Manchester City pin hopes on key trio after West Ham expose flaws While José Mourinho picked the wrong moment to talk about fake results when Manchester City thumped Chelsea 3-0 in August, the recently unemployed one certainly made a valid point about how the final score can unfairly alter the narrative. Mourinho’s line came to mind after more ruthless finishing from Sergio Agüero rescued a barely deserved point for Manuel Pellegrini’s stuttering title challengers at Upton Park, leaving West Ham United with the rare sensation of feeling disappointed after failing to secure their first league double over City in 53 years. West Ham led twice, only for Agüero to cancel out Enner Valencia’s snappy double with two goals of his own, and there is a temptation to praise City for mounting spirited fightbacks from a goal down in both halves. Scoring an 81st minute equaliser at the home of a team who relish bloodying the noses of opponents with superior resources is usually interpreted as a sign 
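
The two cells above select documents with `i[0][1]` and `i[5][1]`, which works only if every topic is present, in order, in each document's list. That holds here because of `minimum_probability=0`, but it would break if low-probability topics were omitted. Note also that zipping with `raw_documents[:1000]` restricts the selection to the first 1000 articles. A more defensive variant looks the topic up by id, as sketched below; the helper name `documents_for_topic` and the toy data are hypothetical.

```python
def documents_for_topic(doc_topics, documents, topic_id, threshold):
    """Return the documents whose score for topic_id exceeds the threshold."""
    selected = []
    for topics, doc in zip(doc_topics, documents):
        # look the topic up by id instead of by list position;
        # a topic missing from the list counts as probability 0
        score = dict(topics).get(topic_id, 0.0)
        if score > threshold:
            selected.append(doc)
    return selected

# hypothetical sparse topic lists, as produced when small probabilities are omitted
toy_doc_topics = [
    [(0, 0.9), (5, 0.1)],
    [(5, 0.8), (0, 0.2)],
    [(2, 1.0)],
]
toy_documents = ["article about an album", "match report", "election coverage"]
print(documents_for_topic(toy_doc_topics, toy_documents, topic_id=5, threshold=0.1))
```

The first document is not selected because its score of 0.1 does not strictly exceed the threshold.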

###### pyLDAvis is a great way to visualize an LDA model. 
In short, the area of each circle represents the prevalence of its topic. The length of each bar on the right represents the membership of a term in a particular topic.

In [21]:
import pyLDAvis.gensim

pyLDAvis.enable_notebook()

pyLDAvis.gensim.prepare(ldaload, corpus, dictionary)

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  topic_term_dists = topic_term_dists.ix[topic_order]


In [None]:
p=pyLDAvis.gensim.prepare(ldaload, corpus, dictionary)
pyLDAvis.save_html(p, 'lda-visualisation-Big.html')
print("done")



done


In [None]:
import pyLDAvis.gensim as gensimvis
import pyLDAvis

vis_data = gensimvis.prepare(ldaload, corpus, dictionary)
pyLDAvis.show(vis_data)



Note: if you're in the IPython notebook, pyLDAvis.show() is not the best command
      to use. Consider using pyLDAvis.display(), or pyLDAvis.enable_notebook().
      See more information at http://pyLDAvis.github.io/quickstart.html .

You must interrupt the kernel to end this command

Serving to http://127.0.0.1:8889/    [Ctrl-C to exit]


127.0.0.1 - - [22/Apr/2018 09:24:49] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [22/Apr/2018 09:24:49] "GET /LDAvis.css HTTP/1.1" 200 -
127.0.0.1 - - [22/Apr/2018 09:24:49] "GET /d3.js HTTP/1.1" 200 -
127.0.0.1 - - [22/Apr/2018 09:24:49] "GET /LDAvis.js HTTP/1.1" 200 -
127.0.0.1 - - [22/Apr/2018 09:24:49] code 404, message Not Found
127.0.0.1 - - [22/Apr/2018 09:24:49] "GET /favicon.ico HTTP/1.1" 404 -
