## Latent Dirichlet Allocation (LDA)

##### LDA is an example of topic modeling, a statistical technique for discovering hidden (latent) topics in a collection of documents.

##### LDA builds a words-per-topic model and a topics-per-document model. Each is a probability distribution that is given a Dirichlet prior.
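##### To make the two models concrete, here is a minimal sketch of the generative story LDA assumes, using only the standard library. The vocabulary, the number of topics, and the Dirichlet parameter 0.5 are illustrative assumptions, not values used later in this notebook.

```python
import random

def sample_dirichlet(alphas, rng):
    # A Dirichlet draw can be built from independent Gamma draws normalized to sum to 1.
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)
vocab = ["car", "engine", "god", "church", "game", "team"]  # hypothetical vocabulary
num_topics = 3                                               # hypothetical topic count

# Words-per-topic model: one distribution over the vocabulary for each topic.
topic_word = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(num_topics)]

# Topics-per-document model: one distribution over topics for a single document.
doc_topic = sample_dirichlet([0.5] * num_topics, rng)

# Generate a document: pick a topic for each word slot, then a word from that topic.
document = []
for _ in range(8):
    z = rng.choices(range(num_topics), weights=doc_topic)[0]
    w = rng.choices(vocab, weights=topic_word[z])[0]
    document.append(w)

print(doc_topic)
print(document)
```

##### Fitting LDA runs this story in reverse: given only the documents, it estimates the topic and word distributions that could plausibly have produced them.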

##### Let's load relevant libraries. 

In [2]:
import nltk
#nltk.download('stopwords')
import re
import pandas as pd

In [4]:
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

##### We also load a list of English stopwords to remove from our documents.

In [5]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

##### We will use the 20 Newsgroups data, which contains roughly 18,000 USENET newsgroup posts spread approximately evenly across 20 topics. Here we load only the training subset (about 11,300 posts). We will use LDA to learn these latent topics.

In [6]:
from sklearn.datasets import fetch_20newsgroups
df = fetch_20newsgroups(subset='train',shuffle=True)

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


##### We will clean up the data a little by removing links, @-handles, and any character that is not a letter, digit, space, or tab.

In [7]:
data = df.data
data = [re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', sent) for sent in data]
data = [re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", sent) for sent in data]  # raw string avoids invalid-escape warnings

In [8]:
data[0]

'From  lerxst  umd edu  where s my thing  Subject  WHAT car is this   Nntp Posting Host  rac3 wam umd edu Organization  University of Maryland  College Park Lines  15   I was wondering if anyone out there could enlighten me on this car I saw the other day  It was a 2 door sports car  looked to be from the late 60s  early 70s  It was called a Bricklin  The doors were really small  In addition  the front bumper was separate from the rest of the body  This is  all I know  If anyone can tellme a model name  engine specs  years of production  where this car is made  history  or whatever info you have on this funky looking car  please e mail   Thanks    IL         brought to you by your neighborhood Lerxst          '
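##### To see what the two substitutions do, the sketch below applies the same patterns to a short made-up string (the sample text is an assumption for illustration). Note that the `@` pattern also swallows the first label after the `@`, which is why `wam` disappears from the sender address in the output above.

```python
import re

URL_PATTERN = r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*'
NOISE_PATTERN = r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"

sample = "Visit http://example.com/page now, or mail bob@site.org!"  # hypothetical text

# Step 1: delete links outright.
no_links = re.sub(URL_PATTERN, '', sample)

# Step 2: replace @-handles and non-alphanumeric characters with a space.
cleaned = re.sub(NOISE_PATTERN, " ", no_links)

print(repr(cleaned))
print(cleaned.split())
```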

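##### We will use a built-in gensim function, simple_preprocess, to convert each document into a list of lowercase word tokens. The tokenizer discards punctuation and digits, and by default drops tokens shorter than 2 or longer than 15 characters.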

In [9]:
def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)  # deacc=True strips accent marks; the tokenizer itself drops punctuation

data_words = list(sent_to_words(data))

print(data_words[:1])

[['from', 'lerxst', 'umd', 'edu', 'where', 'my', 'thing', 'subject', 'what', 'car', 'is', 'this', 'nntp', 'posting', 'host', 'rac', 'wam', 'umd', 'edu', 'organization', 'university', 'of', 'maryland', 'college', 'park', 'lines', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'saw', 'the', 'other', 'day', 'it', 'was', 'door', 'sports', 'car', 'looked', 'to', 'be', 'from', 'the', 'late', 'early', 'it', 'was', 'called', 'bricklin', 'the', 'doors', 'were', 'really', 'small', 'in', 'addition', 'the', 'front', 'bumper', 'was', 'separate', 'from', 'the', 'rest', 'of', 'the', 'body', 'this', 'is', 'all', 'know', 'if', 'anyone', 'can', 'tellme', 'model', 'name', 'engine', 'specs', 'years', 'of', 'production', 'where', 'this', 'car', 'is', 'made', 'history', 'or', 'whatever', 'info', 'you', 'have', 'on', 'this', 'funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'to', 'you', 'by', 'your', 'neighborhood', 'lerxst']]
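##### The sketch below is a rough standard-library approximation of this tokenization step, not gensim's actual code. The function name `rough_preprocess` and the sample sentence are hypothetical. It illustrates why `rac3` becomes `rac` and why `60s` vanishes in the output above.

```python
import re
import unicodedata

def rough_preprocess(text, min_len=2, max_len=15):
    # Strip accent marks: decompose characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    no_accents = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Keep runs of word characters that are not digits, in lowercase.
    tokens = re.findall(r"[^\W\d]+", no_accents.lower())
    # Drop tokens that are too short or too long.
    return [t for t in tokens if min_len <= len(t) <= max_len]

tokens = rough_preprocess("Café déjà vu, a 2-door car from the 60s: rac3!")
print(tokens)
```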


##### We will also remove all stopwords from the data

In [10]:
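# The documents are already tokenized, so filter the token lists directly.
stop_set = set(stop_words)  # set membership tests are faster than list lookups
data_words_nostops = [[word for word in doc if word not in stop_set] for doc in data_words]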
print(data_words_nostops[:1])

[['lerxst', 'umd', 'edu', 'thing', 'subject', 'car', 'nntp', 'posting', 'host', 'rac', 'wam', 'umd', 'edu', 'organization', 'university', 'maryland', 'college', 'park', 'lines', 'wondering', 'anyone', 'could', 'enlighten', 'car', 'saw', 'day', 'door', 'sports', 'car', 'looked', 'late', 'early', 'called', 'bricklin', 'doors', 'really', 'small', 'addition', 'front', 'bumper', 'separate', 'rest', 'body', 'know', 'anyone', 'tellme', 'model', 'name', 'engine', 'specs', 'years', 'production', 'car', 'made', 'history', 'whatever', 'info', 'funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'neighborhood', 'lerxst']]


##### Next, we will create a dictionary of every word in the documents. Each unique word gets an integer identifier (ID), so later computations work with IDs instead of raw strings.

In [11]:
id2word = corpora.Dictionary(data_words_nostops)

##### Next, we will count how often each word occurs in each document. This bag-of-words representation is our corpus: for every document, a list of (word ID, count) pairs.

In [12]:
texts = data_words_nostops
corpus = [id2word.doc2bow(text) for text in texts]

In [13]:
print(corpus[0])
print(id2word[13])

[(0, 1), (1, 2), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 5), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 2), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 2), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 2), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1)]
early
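##### The idea behind the dictionary and the bag-of-words corpus can be sketched with the standard library alone. This is an illustrative stand-in on a made-up two-document corpus; gensim's `Dictionary` and `doc2bow` may assign IDs differently.

```python
from collections import Counter

docs = [["car", "engine", "car"], ["engine", "bumper"]]  # hypothetical tokenized documents

# Dictionary step: give each unique token an integer ID in order of first appearance.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    # Bag-of-words step: count how often each token ID occurs in the document.
    counts = Counter(token2id[t] for t in doc)
    return sorted(counts.items())

bows = [to_bow(d) for d in docs]
print(token2id)
print(bows)
```

##### Each pair reads as (word ID, count), the same shape as the `corpus[0]` output above.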


##### Now we can use the corpus object to build our LDA model. We will provide the corpus, the dictionary (vocabulary), and the number of topics we want to identify as parameters.

In [None]:
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=20, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

##### We can then look at each topic. By default, the print_topics function shows the 10 highest-weighted words for each topic; these keywords describe what the topic is about.

In [None]:
from pprint import pprint  # pprint was not imported earlier
pprint(lda_model.print_topics())

##### It can be hard to judge whether the topics fit the data well. Doing so by hand would mean going through every document, assigning it a topic manually, and comparing those labels with the ones generated by LDA. Instead, you can compute the coherence of the model. A coherence score measures the degree of semantic similarity between the high-scoring words in a topic.

##### A high coherence score suggests that the topics are meaningful and fit the data reasonably well.

In [None]:
coherence_model_lda = CoherenceModel(model=lda_model, texts=texts, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
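##### The `c_v` measure used above is fairly involved. To give a feel for what a coherence score rewards, here is a minimal sketch of a simpler, different measure: a UMass-style score based on how often a topic's top words co-occur in documents. The toy corpus and topic word lists are made up, and this is not gensim's implementation (gensim's `CoherenceModel` offers its own `'u_mass'` option).

```python
import math

# Hypothetical toy corpus: each document is reduced to its set of unique words.
docs = [
    {"car", "engine", "wheel"},
    {"car", "engine"},
    {"car", "wheel", "engine"},
    {"god", "church"},
    {"god", "church", "faith"},
]

def doc_freq(*words):
    """Number of documents that contain every one of the given words."""
    return sum(1 for doc in docs if all(w in doc for w in words))

def umass_coherence(top_words):
    """Sum of log((D(w_m, w_l) + 1) / D(w_l)) over all pairs with l < m.

    Assumes every word in top_words occurs in at least one document.
    """
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            co_occurrences = doc_freq(top_words[m], top_words[l])
            score += math.log((co_occurrences + 1) / doc_freq(top_words[l]))
    return score

coherent = umass_coherence(["car", "engine", "wheel"])  # words that appear together
mixed = umass_coherence(["car", "god", "wheel"])        # words from unrelated posts
print(coherent, mixed)
```

##### The topic whose words tend to appear in the same documents receives the higher score, which is the intuition behind coherence measures in general.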

##### The lda_model object also gives you the topic distribution for each document in the data, so you can attach the most likely topic back to the original data to summarize findings or conduct additional analyses.
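##### As a minimal sketch of that last step, the helper below picks the most probable topic from a list of (topic ID, probability) pairs. The helper name `dominant_topic` and the example numbers are hypothetical; with the trained model, such a list would be expected to come from `lda_model.get_document_topics(corpus[i])`.

```python
def dominant_topic(topic_probs):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topic_probs, key=lambda pair: pair[1])

# Hypothetical distribution for one document, as (topic_id, probability) pairs.
example_doc_topics = [(0, 0.10), (3, 0.70), (5, 0.20)]

topic_id, prob = dominant_topic(example_doc_topics)
print(f"dominant topic: {topic_id} (p={prob:.2f})")
```

##### Applying the helper to every document yields one topic ID per post, which could then be stored alongside the original text, for example as a pandas column.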