# Latent Dirichlet Allocation Topic Modeling

LDA is used to assign the text in a document to one or more topics. It builds a topics-per-document model and a words-per-topic model, both with Dirichlet priors.
- Each document is modeled as a multinomial distribution of topics and each topic is modeled as a multinomial distribution of words.
- LDA assumes that every chunk of text we feed into it will contain words that are somehow related.
- It also assumes documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution.
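
The generative story described above can be sketched in a few lines. This is only a toy illustration using the standard library: the vocabulary, the two topic-word distributions and the `alpha` value are made-up assumptions, and the helper names (`sample_dirichlet`, `generate_document`) are hypothetical, not part of gensim.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, alpha, n_words, rng):
    """Generate one toy document following the LDA generative story."""
    n_topics = len(topic_word)
    # 1. Draw the document's own mixture of topics (theta).
    theta = sample_dirichlet([alpha] * n_topics, rng)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position according to theta.
        topic = rng.choices(range(n_topics), weights=theta)[0]
        # 3. Pick a word from that topic's word distribution.
        words.append(rng.choices(vocab, weights=topic_word[topic])[0])
    return theta, words

rng = random.Random(0)
vocab = ["rain", "drought", "court", "jail"]
# Assumed topic-word probabilities: topic 0 is "weather", topic 1 is "crime".
topic_word = [
    [0.45, 0.45, 0.05, 0.05],
    [0.05, 0.05, 0.45, 0.45],
]
theta, words = generate_document(topic_word, vocab, alpha=0.5, n_words=8, rng=rng)
print("topic mixture:", theta)
print("generated words:", words)
```

Training LDA runs this story in reverse: given only the words, it estimates the topic mixtures and the topic-word distributions that most plausibly produced them.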

## Step 1: Load the dataset

The dataset we'll use here is a list of over one million news headlines published over a period of 15 years. 

In [12]:
import numpy as np
import pandas as pd
data = pd.read_csv('~/abcnews-data.csv')

data_text = data[:300000][['headline_text']]
data_text['index'] = data_text.index

documents = data_text
documents.head()

|   | headline_text | index |
|---|---|---|
| 0 | aba decides against community broadcasting lic... | 0 |
| 1 | act fire witnesses must be aware of defamation | 1 |
| 2 | a g calls for infrastructure protection summit | 2 |
| 3 | air nz staff in aust strike for pay rise | 3 |
| 4 | air nz strike to affect australian travellers | 4 |


## Step 2: Data Preprocessing

- **Tokenization**: gensim.utils.simple_preprocess
- **Stopwords**: removing stopwords and tokens of 3 or fewer characters
- **Lemmatized**: Verbs in past or future tenses are changed into present tense. Words in third person are changed to first person.
- **Stemmed**

**Gensim** is a python library for topic modelling, document indexing and similarity retrieval with large corpora.

**gensim.utils**: provides various utility functions

In [15]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

**WordNet** is a large lexical database of English nouns, adjectives, adverbs and verbs. It is accessible through Python's Natural Language Toolkit (NLTK).

**Lemmatization** is the process of converting a word to its base form. The difference between stemming and lemmatization is that lemmatization considers the context and converts the word to its meaningful base form, whereas stemming just removes the last few characters.

There are several Python packages available to implement lemmatization, such as:
1. **WordNet Lemmatizer**: one of the earliest and most commonly used lemmatizers. We have to download the WordNet data (with `nltk.download('wordnet')`) before using it. The **POS** (part-of-speech) tag should usually be passed as an argument to improve its accuracy.

2. **TextBlob**: a powerful, fast and convenient NLP package. Using its `Word` and `TextBlob` objects, it is quite straightforward to parse and lemmatize words and sentences respectively. A POS tag is also needed.
...

**SnowballStemmer** is a stemming algorithm.

In [11]:
import nltk
nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\xiaoj\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [30]:
# This is to first lemmatize, then stem the text
stemmer = SnowballStemmer('english')
def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos = 'v'))

# Tokenize and lemmatize
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result 

In [38]:
# Preview a document after preprocessing
doc_sample = documents[documents['index'] == 4310].values[0][0]
print('Original document: ')
words = []
for word in doc_sample.split(' '):
    words.append(word)
print(words)

print('\n\nTokenized, Lemmatized and Stemmed documents: ')
print(preprocess(doc_sample))

Original document: 
['rain', 'helps', 'dampen', 'bushfires']


Tokenized, Lemmatized and Stemmed documents: 
['rain', 'help', 'dampen', 'bushfir']


In [34]:
# Preprocess (tokenize, lemmatize, stem) every headline in the dataset
processed_docs = documents['headline_text'].map(preprocess)
processed_docs.head()

0     [decid, communiti, broadcast, licenc]
1                        [wit, awar, defam]
2    [call, infrastructur, protect, summit]
3               [staff, aust, strike, rise]
4      [strike, affect, australian, travel]
Name: headline_text, dtype: object

## Step 3: Convert text into numeric vectors

Machines can't process text data in raw form. We need to break the text down into a numerical format that a machine can read easily.

Both bag-of-words (**BOW**) and **TF-IDF** are techniques that help us convert text sentences into numeric vectors.

**BOW**: First build a vocabulary from all the unique words; then take each of these words and mark its occurrences in each document.

- If new sentences contain new words, the vocabulary size increases, and the length of the vectors increases too;
- The vectors contain many 0s, which results in a sparse matrix;
- We retain no information on the grammar of the sentences or on the ordering of the words in the text.
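
As a minimal sketch of the idea, assuming two tiny hand-made documents that are already tokenized and stemmed, the vocabulary and the count vectors can be built with the standard library alone. The function name `to_bow_vector` is hypothetical.

```python
from collections import Counter

# Two toy documents, already tokenized and stemmed (made-up examples).
docs = [
    ["rain", "help", "dampen", "bushfir"],
    ["rain", "rain", "flood"],
]

# Vocabulary: every unique token, sorted so that vector positions are stable.
vocab = sorted({token for doc in docs for token in doc})

def to_bow_vector(doc, vocab):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(doc)
    return [counts[token] for token in vocab]

vectors = [to_bow_vector(doc, vocab) for doc in docs]
print(vocab)
for vector in vectors:
    print(vector)
```

Each vector has one slot per vocabulary word, so most slots are 0 once the vocabulary grows, and the word order inside each document is lost.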

**TF-IDF**: gives larger values to less frequent words.

## Step 3.1: Bag of words on the dataset

#### **gensim.corpora.Dictionary()**

Now let's create a dictionary from `processed_docs` that maps each unique word to an integer id and tracks how many times each word appears in the training set. To do that, we use `gensim.corpora.Dictionary()`.

In [39]:
# Create a dictionary from 'processed_docs' containing the number of times a word appears
dictionary = gensim.corpora.Dictionary(processed_docs)

In [80]:
# Check dictionary created
count = 0
for k, v in dictionary.items():
    print(k, v)
    count += 1
    if count > 10:
        break

0 broadcast
1 communiti
2 decid
3 licenc
4 awar
5 defam
6 wit
7 call
8 infrastructur
9 protect
10 summit


#### Gensim **filter_extremes(no_below = ?, no_above = ?, keep_n =? )**

This filters out tokens that appear in:
- fewer than `no_below` documents (an absolute number)
- more than `no_above` documents (a fraction of the total corpus size, not an absolute number)
- after the above two steps, only the `keep_n` most frequent tokens are kept
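
The three rules can be sketched in plain Python as follows. This is an illustration of the filtering logic as described above, not gensim's actual implementation; the function name `filter_vocabulary` and the toy documents are assumptions.

```python
from collections import Counter

def filter_vocabulary(docs, no_below, no_above, keep_n):
    """Return the tokens that survive the three filtering rules."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter(token for doc in docs for token in set(doc))
    kept = [
        token
        for token, df in doc_freq.items()
        if df >= no_below and df / n_docs <= no_above
    ]
    # Most frequent first (ties broken alphabetically), then truncate.
    kept.sort(key=lambda token: (-doc_freq[token], token))
    return kept[:keep_n]

docs = [
    ["rain", "flood"],
    ["rain", "court"],
    ["rain", "court", "jail"],
    ["rain", "jail", "drought"],
]
# "rain" is in every document (too common); "flood" and "drought" are in one (too rare).
print(filter_vocabulary(docs, no_below=2, no_above=0.5, keep_n=10))
```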

In [46]:
# apply the above dictionary filter_extremes() 
dictionary.filter_extremes(no_below = 15, no_above = 0.1, keep_n = 100000)

#### **gensim.corpora.Dictionary.doc2bow**

Converts a document into the bag-of-words (**BOW**) format: a list of `(token_id, token_count)` tuples. Each word is assumed to be a tokenized and normalized string. **Apply tokenization, stemming etc. before calling this method**.
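
To make the output format concrete, here is a rough standard-library sketch of a sparse `(token_id, token_count)` representation. It only mimics the shape of the result: the helper names are hypothetical, and ids are assigned in first-seen order here, which is an assumption and may differ from the ids gensim assigns.

```python
from collections import Counter

def build_token2id(docs):
    """Give every unique token an integer id, in first-seen order."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_sparse_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

docs = [
    ["rain", "help", "dampen", "bushfir"],
    ["rain", "rain", "flood"],
]
token2id = build_token2id(docs)
# "unknown" is not in the vocabulary, so it is silently dropped.
print(to_sparse_bow(["rain", "rain", "flood", "unknown"], token2id))
```

Unlike a dense count vector, only the words that actually occur in the document are stored, which keeps short headlines cheap to represent.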

In [48]:
# create the BOW model for each document.
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

In [50]:
# preview BOW for some sample
bow_doc_4310 = bow_corpus[4310]

for i in range(len(bow_doc_4310)):
    print('Word {}(\"{}")appears {} time.'.format(bow_doc_4310[i][0],
                                                 dictionary[bow_doc_4310[i][0]],
                                                 bow_doc_4310[i][1]))

Word 71("bushfir")appears 1 time.
Word 107("help")appears 1 time.
Word 462("rain")appears 1 time.
Word 3530("dampen")appears 1 time.


## Step 3.2: TF-IDF on our document set

While performing TF-IDF on the corpus is not necessary for an LDA implementation using the gensim model, it is recommended. TF-IDF expects a bag-of-words training corpus during initialization.

#### TF-IDF stands for "Term Frequency, Inverse Document Frequency".

- It is a way to score the importance of words in a document based on how frequently they appear across multiple documents.
- If a word appears frequently in a document, it is important and is given a high score. But if a word appears in many documents, it is not a unique identifier and is given a low score.


- **TF(w)** = (Number of times term w appears in a document) / (Total number of terms in the document)
- **IDF(w)** = log_e(Total number of documents / Number of documents with term w in it)
- **TF-IDF = TF * IDF**
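
The formulas above can be worked through by hand on a toy corpus. This sketch follows the three definitions literally, using the natural logarithm, and assumes the term occurs in at least one document; the documents are made up. gensim's `TfidfModel` applies its own weighting and normalization settings, so its scores need not match these hand-computed values.

```python
import math

def tf(term, doc):
    """Term frequency: share of the document's tokens equal to `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency (assumes `term` occurs somewhere in `docs`)."""
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

docs = [
    ["rain", "help", "dampen", "bushfir"],
    ["rain", "flood", "warn"],
    ["court", "jail", "charg"],
]
# "rain" occurs in 2 of 3 documents, "bushfir" in only 1,
# so "bushfir" gets the higher score within the first document.
print("rain:   ", tf_idf("rain", docs[0], docs))
print("bushfir:", tf_idf("bushfir", docs[0], docs))
```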

In [51]:
# Create tf-idf model object using models.TfidfModel on 'bow_corpus'
from gensim import corpora, models
tfidf = models.TfidfModel(bow_corpus)

In [59]:
# Apply transformation to the entire corpus
corpus_tfidf = tfidf[bow_corpus]

In [60]:
"""
Preview TF-IDF scores for first document
"""
from pprint import pprint
for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.5959813347777092),
 (1, 0.39204529549491984),
 (2, 0.48531419274988147),
 (3, 0.5055461098578569)]


## Step 4.1: Running LDA using Bag of Words

We are going for 10 topics in the document corpus.

Some of the parameters we will be tweaking are:
- **num_topics**: the number of requested latent topics to be extracted from the training corpus.

- **id2word**: a mapping from word ids (integers) to words (strings).

- **workers**: the number of extra processes to use for parallelization. By default, gensim uses all available cores minus one.

- **alpha** and **eta** are hyperparameters that affect the sparsity of the document-topic (theta) and topic-word (lambda) distributions. We will leave both at their default values for now (a symmetric prior of 1/num_topics).
    - Alpha is the prior on the per-document topic distribution.
        - High alpha: every document tends to be a mixture of most topics
        - Low alpha: every document tends to be a mixture of very few topics
    - Eta is the prior on the per-topic word distribution.
        - High eta: each topic tends to be a mixture of most words
        - Low eta: each topic tends to be a mixture of few words

- **passes**: is the number of training passes through the corpus.
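
The effect of alpha on sparsity can be checked with a small simulation. This is a sketch using only the standard library, independent of gensim: it draws topic mixtures from a symmetric Dirichlet prior and reports how much weight the single largest topic receives on average. The helper names and the sample sizes are arbitrary assumptions.

```python
import random

def sample_dirichlet(alpha, n_topics, rng):
    """Draw one symmetric Dirichlet sample by normalizing Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_topics)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, n_topics=10, n_samples=2000, seed=0):
    """Average weight of the dominant topic across many sampled mixtures."""
    rng = random.Random(seed)
    largest = [max(sample_dirichlet(alpha, n_topics, rng)) for _ in range(n_samples)]
    return sum(largest) / n_samples

low = mean_largest_share(alpha=0.1)    # sparse: one topic tends to dominate
high = mean_largest_share(alpha=10.0)  # dense: weight is spread over all topics
print("alpha = 0.1  -> mean largest topic share:", round(low, 3))
print("alpha = 10.0 -> mean largest topic share:", round(high, 3))
```

The same reasoning applies to eta, with words in a topic taking the place of topics in a document.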

In [82]:
# Train LDA model using gensim.models.LdaMulticore and save it to 'lda_model'

lda_model = gensim.models.LdaMulticore(bow_corpus,
                                      num_topics = 10,
                                      id2word = dictionary,
                                      passes = 2,
                                      workers = 2)

# For each topic, we will explore the words occurring in that topic and their relative weights

for idx, topic in lda_model.print_topics():
    print('Topic: {} \nWords: {}'.format(idx,topic))
    print('\n')

Topic: 0 
Words: 0.023*"closer" + 0.018*"talk" + 0.016*"market" + 0.016*"deal" + 0.014*"nuclear" + 0.012*"firefight" + 0.011*"trade" + 0.011*"year" + 0.011*"bush" + 0.010*"iran"


Topic: 1 
Words: 0.018*"open" + 0.015*"boost" + 0.014*"rain" + 0.013*"stand" + 0.012*"action" + 0.012*"centr" + 0.012*"worker" + 0.012*"campaign" + 0.011*"howard" + 0.011*"fall"


Topic: 2 
Words: 0.037*"charg" + 0.032*"court" + 0.030*"face" + 0.020*"jail" + 0.020*"accus" + 0.019*"murder" + 0.018*"drug" + 0.015*"polic" + 0.014*"case" + 0.014*"public"


Topic: 3 
Words: 0.038*"crash" + 0.026*"investig" + 0.023*"polic" + 0.017*"die" + 0.015*"victim" + 0.013*"dead" + 0.013*"leav" + 0.013*"train" + 0.012*"shoot" + 0.011*"famili"


Topic: 4 
Words: 0.020*"forc" + 0.019*"lead" + 0.015*"win" + 0.014*"take" + 0.014*"world" + 0.013*"troop" + 0.013*"final" + 0.013*"play" + 0.011*"storm" + 0.010*"aussi"


Topic: 5 
Words: 0.032*"council" + 0.019*"health" + 0.019*"minist" + 0.019*"servic" + 0.015*"opposit" + 0.015*"govt"

## Step 4.2: Running LDA using TF-IDF

In [67]:
# Define lda model using corpus_tfidf

lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf,
                                            num_topics = 10,
                                            id2word = dictionary,
                                            passes = 2,
                                            workers = 4)


for idx, topic in lda_model_tfidf.print_topics():
    print("Topic: {} \nWord: {}".format(idx, topic))
    print("\n")

Topic: 0 
Word: 0.008*"bird" + 0.007*"test" + 0.006*"england" + 0.006*"drug" + 0.006*"boat" + 0.005*"illeg" + 0.005*"kangaroo" + 0.005*"pakistan" + 0.005*"lanka" + 0.005*"polic"


Topic: 1 
Word: 0.023*"crash" + 0.022*"polic" + 0.013*"miss" + 0.013*"search" + 0.012*"investig" + 0.010*"die" + 0.010*"road" + 0.009*"fatal" + 0.008*"driver" + 0.008*"accid"


Topic: 2 
Word: 0.027*"closer" + 0.017*"charg" + 0.014*"court" + 0.011*"face" + 0.011*"jail" + 0.010*"assault" + 0.010*"polic" + 0.009*"murder" + 0.009*"kill" + 0.009*"blaze"


Topic: 3 
Word: 0.016*"water" + 0.009*"govt" + 0.008*"fund" + 0.007*"plan" + 0.007*"council" + 0.007*"boost" + 0.006*"rain" + 0.006*"urg" + 0.006*"coast" + 0.006*"drought"


Topic: 4 
Word: 0.007*"teacher" + 0.006*"strike" + 0.006*"union" + 0.006*"isra" + 0.006*"govt" + 0.006*"palestinian" + 0.006*"kill" + 0.006*"israel" + 0.006*"beazley" + 0.005*"concern"


Topic: 5 
Word: 0.008*"shortag" + 0.007*"retir" + 0.005*"murray" + 0.005*"staff" + 0.005*"season" + 0.004

As we can see, TF-IDF gives heavier weights to less frequent words, which brings more nouns into the topics. That makes the categories harder to identify, because nouns can be hard to categorize. This goes to show that the model we apply depends on the type of text corpus we are dealing with.

## Step 5.1: Performance evaluation by classifying sample document using LDA BOW model

In [66]:
# Check which topic our test document belongs to using the LDA BOW

# Each tup is (topic index, score); sorting on -1*tup[1] orders the topics by descending score
for index, score in sorted(lda_model[bow_corpus[4310]],key = lambda tup: -1* tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))   


Score: 0.4867934584617615	 
Topic: 0.025*"council" + 0.017*"rise" + 0.017*"price" + 0.016*"centr" + 0.013*"land" + 0.013*"mayor" + 0.013*"plan" + 0.012*"prompt" + 0.011*"studi" + 0.010*"bushfir"

Score: 0.3531844913959503	 
Topic: 0.035*"water" + 0.019*"servic" + 0.018*"farmer" + 0.018*"drought" + 0.014*"break" + 0.012*"nation" + 0.012*"rain" + 0.012*"park" + 0.011*"help" + 0.011*"health"

Score: 0.020003773272037506	 
Topic: 0.059*"govt" + 0.027*"urg" + 0.023*"plan" + 0.022*"fund" + 0.020*"council" + 0.016*"group" + 0.013*"closer" + 0.012*"consid" + 0.012*"boost" + 0.012*"defend"

Score: 0.020003309473395348	 
Topic: 0.074*"polic" + 0.031*"charg" + 0.027*"court" + 0.026*"face" + 0.022*"crash" + 0.020*"investig" + 0.019*"death" + 0.019*"miss" + 0.018*"jail" + 0.016*"murder"

Score: 0.02000252529978752	 
Topic: 0.030*"hospit" + 0.029*"hous" + 0.022*"sydney" + 0.020*"blaze" + 0.017*"leader" + 0.015*"firefight" + 0.014*"home" + 0.013*"blame" + 0.013*"hick" + 0.012*"drink"

Score: 0.02000

## Step 5.2: Performance evaluation by classifying sample document using LDA TF-IDF model

In [74]:
# Check which topic the test document belongs to using the LDA TF-IDF model
# (same sample document, index 4310, as in Step 5.1)
for index, score in sorted(lda_model_tfidf[bow_corpus[4310]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index,10)))


Score: 0.6447213292121887	 
Topic: 0.011*"price" + 0.010*"market" + 0.008*"bushfir" + 0.007*"rise" + 0.007*"nuclear" + 0.006*"plan" + 0.006*"govt" + 0.006*"rat" + 0.005*"bail" + 0.005*"council"

Score: 0.19524641335010529	 
Topic: 0.016*"water" + 0.009*"govt" + 0.008*"fund" + 0.007*"plan" + 0.007*"council" + 0.007*"boost" + 0.006*"rain" + 0.006*"urg" + 0.006*"coast" + 0.006*"drought"

Score: 0.02000492811203003	 
Topic: 0.009*"rudd" + 0.008*"govt" + 0.007*"develop" + 0.007*"guilti" + 0.006*"care" + 0.006*"climat" + 0.006*"chang" + 0.006*"council" + 0.005*"truck" + 0.005*"cancer"

Score: 0.020004484802484512	 
Topic: 0.008*"bird" + 0.007*"test" + 0.006*"england" + 0.006*"drug" + 0.006*"boat" + 0.005*"illeg" + 0.005*"kangaroo" + 0.005*"pakistan" + 0.005*"lanka" + 0.005*"polic"

Score: 0.02000434510409832	 
Topic: 0.023*"crash" + 0.022*"polic" + 0.013*"miss" + 0.013*"search" + 0.012*"investig" + 0.010*"die" + 0.010*"road" + 0.009*"fatal" + 0.008*"driver" + 0.008*"accid"

Score: 0.0200043

## Step 6: Testing model on unseen document

In [94]:
unseen_document = 'My favoriate sports are running and swimming.'

# Data preprocessing step for the unseen document
bow_vector = dictionary.doc2bow(preprocess(unseen_document))

print('LDA BOW model performance:')
for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))
print('\n')
print('LDA TF-IDF model performance:')
for index, score in sorted(lda_model_tfidf[bow_vector], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t Topic: {}".format(score, lda_model_tfidf.print_topic(index,5)))



LDA BOW model performance:

Score: 0.2749941647052765	 Topic: 0.020*"forc" + 0.019*"lead" + 0.015*"win" + 0.014*"take" + 0.014*"world"

Score: 0.2749866247177124	 Topic: 0.018*"open" + 0.015*"boost" + 0.014*"rain" + 0.013*"stand" + 0.012*"action"

Score: 0.2749808430671692	 Topic: 0.054*"govt" + 0.040*"urg" + 0.024*"help" + 0.023*"call" + 0.016*"fund"

Score: 0.02500712126493454	 Topic: 0.023*"closer" + 0.018*"talk" + 0.016*"market" + 0.016*"deal" + 0.014*"nuclear"

Score: 0.025006674230098724	 Topic: 0.032*"council" + 0.019*"health" + 0.019*"minist" + 0.019*"servic" + 0.015*"opposit"

Score: 0.025006195530295372	 Topic: 0.038*"crash" + 0.026*"investig" + 0.023*"polic" + 0.017*"die" + 0.015*"victim"

Score: 0.025004586204886436	 Topic: 0.037*"charg" + 0.032*"court" + 0.030*"face" + 0.020*"jail" + 0.020*"accus"

Score: 0.025004586204886436	 Topic: 0.044*"water" + 0.032*"plan" + 0.018*"council" + 0.016*"concern" + 0.014*"fear"

Score: 0.025004586204886436	 Topic: 0.047*"polic" + 0.029*"k