### Latent Dirichlet Allocation
LDA is used to assign the text in a document to particular topics. It builds a topics-per-document model and a words-per-topic model, both with Dirichlet priors.

* Each document is modeled as a multinomial distribution of topics and each topic is modeled as a multinomial distribution of words.
* LDA assumes that every chunk of text we feed into it contains words that are somehow related, so choosing the right corpus of data is crucial.
* It also assumes documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution.
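The generative story in the bullets above can be sketched with the standard library alone. This is a toy illustration, not gensim's implementation: the vocabulary, the two topic-word tables, and the `alpha` values are made-up assumptions, and the Dirichlet draw is done by normalizing Gamma samples.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, alpha, num_words, rng):
    """Generate one toy document following the LDA generative assumptions."""
    # 1. Draw the document's topic mixture (theta) from a Dirichlet prior
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(num_words):
        # 2. Pick a topic for this word position according to theta
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Pick a word according to that topic's word distribution
        word = rng.choices(vocab, weights=topic_word[topic])[0]
        words.append(word)
    return theta, words

# Hypothetical vocabulary and topics, chosen only for illustration
vocab = ["rain", "bushfir", "court", "charg"]
topic_word = [
    [0.5, 0.5, 0.0, 0.0],  # a "weather" topic
    [0.0, 0.0, 0.5, 0.5],  # a "crime" topic
]

rng = random.Random(0)
theta, words = generate_document(topic_word, vocab, alpha=[0.5, 0.5],
                                 num_words=6, rng=rng)
print("topic mixture:", theta)
print("generated words:", words)
```

Training LDA runs this story in reverse: given only the words, it estimates the topic mixtures and the topic-word distributions that most plausibly produced them.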

### Load the Dataset

In [1]:
import pandas as pd

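# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
data = pd.read_csv('Data/abcnews-date-text.csv', on_bad_lines='skip')

# .copy() avoids a SettingWithCopyWarning when the 'index' column is added
data_text = data[:300000][['headline_text']].copy()
data_text['index'] = data_text.index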

documents = data_text

In [2]:
documents.head()

| | headline_text | index |
|---|---|---|
| 0 | aba decides against community broadcasting lic... | 0 |
| 1 | act fire witnesses must be aware of defamation | 1 |
| 2 | a g calls for infrastructure protection summit | 2 |
| 3 | air nz staff in aust strike for pay rise | 3 |
| 4 | air nz strike to affect australian travellers | 4 |


In [3]:
print(len(documents))

300000


### Data Preprocessing

In [5]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
import numpy as np

np.random.seed(400)  # seed NumPy's global random number generator

In [6]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Anubhav\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


True

#### Lemmatizer Example

In [7]:
print(WordNetLemmatizer().lemmatize('went', pos='v')) # Past Tense to Present Tense

go


#### Stemmer Example

In [9]:
stemmer = SnowballStemmer("english")
original_words = ['caresses', 'flies', 'dies', 'mules', 'denied','died', 'agreed', 'owned', 
           'humbled', 'sized','meeting', 'stating', 'siezing', 'itemization','sensational', 
           'traditional', 'reference', 'colonizer','plotted']
singles = [stemmer.stem(plural) for plural in original_words]

pd.DataFrame(data={'original word':original_words, 'stemmed':singles })

| | original word | stemmed |
|---|---|---|
| 0 | caresses | caress |
| 1 | flies | fli |
| 2 | dies | die |
| 3 | mules | mule |
| 4 | denied | deni |
| 5 | died | die |
| 6 | agreed | agre |
| 7 | owned | own |
| 8 | humbled | humbl |
| 9 | sized | size |


In [10]:
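def lemmatize_stemming(text):
    # Lemmatize the token as a verb, then reduce it to its stem
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# Tokenize, drop stopwords and short tokens, then lemmatize and stem
def preprocess(text):
    result = []
    for token in simple_preprocess(text):
        # Keep tokens that are not stopwords and are longer than 3 characters
        if token not in STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result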

In [15]:
'''
Preview a document after preprocessing
'''
document_num = 4310
doc_sample = documents[documents['index'] == document_num].values[0][0]

print('Original Document: ')
print(doc_sample.split(' '))
print("\n\n Tokenized and Lemmatized Document: ")
print(preprocess(doc_sample))

Original Document: 
['rain', 'helps', 'dampen', 'bushfires']


 Tokenized and Lemmatized Document: 
['rain', 'help', 'dampen', 'bushfir']


In [16]:
### Preprocess all the headlines, saving the list of results as 'processed_docs'
processed_docs = documents['headline_text'].map(preprocess)

In [17]:
processed_docs[:10]

0            [decid, communiti, broadcast, licenc]
1                               [wit, awar, defam]
2           [call, infrastructur, protect, summit]
3                      [staff, aust, strike, rise]
4             [strike, affect, australian, travel]
5               [ambiti, olsson, win, tripl, jump]
6           [antic, delight, record, break, barca]
7    [aussi, qualifi, stosur, wast, memphi, match]
8            [aust, address, secur, council, iraq]
9                         [australia, lock, timet]
Name: headline_text, dtype: object

### Bag of Words on the Dataset

In [18]:
dictionary = gensim.corpora.Dictionary(processed_docs)

In [20]:
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

0 broadcast
1 communiti
2 decid
3 licenc
4 awar
5 defam
6 wit
7 call
8 infrastructur
9 protect
10 summit


In [21]:
'''Remove very rare and very common words:
-- tokens appearing in fewer than 15 documents
-- tokens appearing in more than 10% of all documents
After that, keep at most the 100,000 most frequent remaining tokens.
'''

dictionary.filter_extremes(no_below=15, no_above=0.1, keep_n=100000)
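As a rough illustration of what this filtering does, here is a minimal pure-Python sketch based on document frequency (the number of documents that contain a token). The function name `filter_extremes_sketch` and the toy documents are assumptions made for this example; gensim's own implementation may differ in details such as rounding and tie-breaking.

```python
from collections import Counter

def filter_extremes_sketch(docs, no_below, no_above, keep_n):
    """Return the tokens that survive document-frequency filtering."""
    # Document frequency: count each token once per document it appears in
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    kept = [token for token, freq in doc_freq.items()
            if no_below <= freq <= max_docs]
    # Keep only the keep_n most frequent of the surviving tokens
    kept.sort(key=lambda token: doc_freq[token], reverse=True)
    return kept[:keep_n]

# Toy corpus (hypothetical): 'rain' is in every document, 'fire' in two
toy_docs = [
    ["rain", "fire"],
    ["rain", "flood"],
    ["rain", "fire", "rare"],
    ["rain", "storm", "storm"],
]

kept_tokens = filter_extremes_sketch(toy_docs, no_below=2, no_above=0.75, keep_n=10)
print(kept_tokens)
```

Here `rain` is dropped for being too common (4 of 4 documents), while `flood`, `rare`, and `storm` are dropped for being too rare. Note that `storm` occurs twice but only in one document, so it still counts as rare.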

#### Gensim doc2bow
* `doc2bow` converts a document (a list of words) into the bag-of-words format: a list of (token_id, token_count) 2-tuples. Each word is assumed to be a tokenized and normalized string (either unicode or utf8-encoded). No further preprocessing is done on the words in the document, so apply tokenization, stemming, etc. before calling this method.
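The conversion can be approximated in a few lines of plain Python. This is a simplified stand-in for illustration only, not gensim's code: `doc2bow_sketch` is a hypothetical helper, and the `token2id` mapping reuses the four ids that this notebook's output reports for sample document 4310.

```python
from collections import Counter

def doc2bow_sketch(tokens, token2id):
    """Count known tokens and return sorted (token_id, token_count) pairs."""
    # Tokens missing from the dictionary are silently ignored
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Ids as reported for document 4310 in this notebook's output
token2id = {"bushfir": 71, "help": 107, "rain": 462, "dampen": 3530}

bow = doc2bow_sketch(["rain", "help", "dampen", "bushfir", "unknownword"], token2id)
print(bow)
```

The result is sparse: only tokens that occur in the document appear in the list, and word order is discarded.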

In [22]:
'''
Create the bag-of-words model for each document, i.e. for each document build a list of
(token_id, count) pairs recording which words appear and how many times. Save this to 'bow_corpus'
'''
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

In [23]:
bow_corpus[document_num]

[(71, 1), (107, 1), (462, 1), (3530, 1)]

In [24]:
'''
Preview BOW for our sample preprocessed document
'''
# document_num is 4310, the sample document previewed during preprocessing
bow_doc_4310 = bow_corpus[document_num]

for token_id, token_count in bow_doc_4310:
    print("Word {} (\"{}\") appears {} time.".format(token_id,
                                                     dictionary[token_id],
                                                     token_count))

Word 71 ("bushfir") appears 1 time.
Word 107 ("help") appears 1 time.
Word 462 ("rain") appears 1 time.
Word 3530 ("dampen") appears 1 time.


### TF-IDF on our document set
While performing TF-IDF on the corpus is not necessary for an LDA implementation using the gensim model, it is recommended. TF-IDF expects a bag-of-words (integer values) training corpus during initialization. During transformation, it takes a vector and returns another vector of the same dimensionality.

Please note: The author of Gensim states that the standard procedure for LDA is to use the Bag of Words model.

#### TF-IDF stands for "Term Frequency, Inverse Document Frequency".

* It is a way to score the importance of words (or "terms") in a document based on how frequently they appear across multiple documents.
* If a word appears frequently in a document, it's important. Give the word a high score. But if a word appears in many documents, it's not a unique identifier. Give the word a low score.
* Therefore, common words like "the" and "for", which appear in many documents, will be scaled down. Words that appear frequently in a single document will be scaled up.

In other words:

* TF(w) = (Number of times term w appears in a document) / (Total number of terms in the document).
* IDF(w) = log_e(Total number of documents / Number of documents with term w in it).

#### For example

* Consider a document containing 100 words wherein the word 'tiger' appears 3 times.
* The term frequency (i.e., tf) for 'tiger' is then:
    * TF = (3 / 100) = 0.03.
* Now, assume we have 10 million documents and the word 'tiger' appears in 1000 of these. Then, the inverse document frequency (i.e., idf) is calculated as:
    * IDF = log10(10,000,000 / 1,000) = 4, using a base-10 logarithm to keep the numbers round (with the natural logarithm the value would be about 9.21).
* Thus, the TF-IDF weight is the product of these quantities:
    * TF-IDF = 0.03 * 4 = 0.12.
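The arithmetic of the 'tiger' example can be checked with a short script. The variable names are chosen for this sketch only; it evaluates the IDF with both a base-10 and a natural logarithm, since the base changes the scale of the score but not the ranking of terms.

```python
import math

term_count = 3            # occurrences of 'tiger' in the document
total_terms = 100         # words in the document
total_docs = 10_000_000   # documents in the collection
docs_with_term = 1_000    # documents containing 'tiger'

tf = term_count / total_terms

# Same ratio, two logarithm bases
idf_base10 = math.log10(total_docs / docs_with_term)
idf_natural = math.log(total_docs / docs_with_term)

print("TF:", tf)
print("IDF (base 10):", round(idf_base10, 4))
print("IDF (natural log):", round(idf_natural, 4))
print("TF-IDF (base 10):", round(tf * idf_base10, 4))
print("TF-IDF (natural log):", round(tf * idf_natural, 4))
```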

In [27]:
from gensim import corpora, models

tfidf = models.TfidfModel(bow_corpus)

In [29]:
'''
Apply transformation to the entire corpus and call it 'corpus_tfidf'
'''

corpus_tfidf = tfidf[bow_corpus]

In [30]:
'''
Preview TF-IDF scores for our first document --> (token_id, tfidf score)
'''
from pprint import pprint
for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.5959813347777092),
 (1, 0.39204529549491984),
 (2, 0.48531419274988147),
 (3, 0.5055461098578569)]
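These scores do not look like raw TF times IDF products. They appear to have been scaled so that the document vector has unit Euclidean length, which matches my understanding that `TfidfModel` normalizes its output by default (treat that default as an assumption and check the gensim documentation for your version). A quick check on the printed values:

```python
import math

# TF-IDF vector for the first document, copied from the output above
first_doc_tfidf = [
    (0, 0.5959813347777092),
    (1, 0.39204529549491984),
    (2, 0.48531419274988147),
    (3, 0.5055461098578569),
]

# Euclidean (L2) norm of the score vector
l2_norm = math.sqrt(sum(score ** 2 for _, score in first_doc_tfidf))
print("L2 norm:", round(l2_norm, 6))
```

A norm very close to 1 means scores are comparable across documents of different lengths, but it also means a single score cannot be read as a plain TF times IDF value.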


### Running LDA using Bag of Words
We will extract 10 topics from the document corpus.

#### We will be running LDA on multiple CPU cores to parallelize and speed up model training.

Some of the parameters we will be tweaking are:

* `num_topics` is the number of requested latent topics to be extracted from the training corpus.
* `id2word` is a mapping from word ids (integers) to words (strings). It is used to determine the vocabulary size, as well as for debugging and topic printing.
* `workers` is the number of worker processes to use for parallelization. If left unset, gensim uses all available cores minus one.
* `alpha` and `eta` are hyperparameters that affect the sparsity of the document-topic (theta) and topic-word (lambda) distributions. We will leave these at their default values for now (the default value is 1/num_topics).

    * Alpha is the prior on the per-document topic distribution.

        * High alpha: Every document has a mixture of all topics (documents appear similar to each other).
        * Low alpha: Every document has a mixture of very few topics.
    * Eta is the prior on the per-topic word distribution.

        * High eta: Each topic has a mixture of most words (topics appear similar to each other).
        * Low eta: Each topic has a mixture of few words.
* `passes` is the number of training passes through the corpus. For example, if the training corpus has 50,000 documents, chunksize is 10,000, passes is 2, then online training is done in 10 updates:

    * #1 documents 0-9,999
    * #2 documents 10,000-19,999
    * #3 documents 20,000-29,999
    * #4 documents 30,000-39,999
    * #5 documents 40,000-49,999
    * #6 documents 0-9,999
    * #7 documents 10,000-19,999
    * #8 documents 20,000-29,999
    * #9 documents 30,000-39,999
    * #10 documents 40,000-49,999
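The update schedule listed above can be reproduced with a small helper. This is a simplified sketch of the chunking arithmetic only: `update_schedule` is a hypothetical function, and it ignores how `LdaMulticore` actually distributes chunks across worker processes.

```python
def update_schedule(num_docs, chunksize, passes):
    """List (update_number, first_doc, last_doc) for each online update."""
    schedule = []
    update_number = 0
    for _ in range(passes):
        # Each pass walks the whole corpus once, one chunk at a time
        for start in range(0, num_docs, chunksize):
            update_number += 1
            last = min(start + chunksize, num_docs) - 1
            schedule.append((update_number, start, last))
    return schedule

schedule = update_schedule(num_docs=50_000, chunksize=10_000, passes=2)
for update_number, first, last in schedule:
    print("#{} documents {:,}-{:,}".format(update_number, first, last))
```

The number of updates is `passes * ceil(num_docs / chunksize)`, so more passes or smaller chunks both mean more updates.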

In [32]:
# Train LDA Model
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary, passes=2, workers=2)
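The effect of `alpha` described in the parameter list can be seen by sampling topic mixtures from a symmetric Dirichlet prior and measuring how much weight the single largest topic receives. This sketch is independent of the trained model: the alpha values 0.1 and 10.0, the helper names, and the trial count are arbitrary choices for illustration.

```python
import random

def sample_symmetric_dirichlet(alpha, num_topics, rng):
    """Draw a topic mixture from a symmetric Dirichlet via Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, num_topics=10, trials=500, seed=0):
    """Average weight of the dominant topic over many sampled mixtures."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        theta = sample_symmetric_dirichlet(alpha, num_topics, rng)
        total += max(theta)
    return total / trials

low_alpha_share = mean_largest_share(alpha=0.1)
high_alpha_share = mean_largest_share(alpha=10.0)

print("low alpha, mean dominant-topic share:", round(low_alpha_share, 3))
print("high alpha, mean dominant-topic share:", round(high_alpha_share, 3))
```

With a low alpha, one topic tends to dominate each sampled mixture. With a high alpha, the weight is spread almost evenly and the dominant share sits close to 1/num_topics.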

In [33]:
'''
For each topic, we will explore the words occurring in that topic and their relative weights
'''
for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic))
    print("\n")

Topic: 0 
Words: 0.024*"open" + 0.021*"test" + 0.018*"world" + 0.017*"win" + 0.014*"lead" + 0.014*"south" + 0.012*"take" + 0.012*"timor" + 0.011*"strike" + 0.010*"east"


Topic: 1 
Words: 0.034*"report" + 0.030*"help" + 0.017*"deal" + 0.015*"urg" + 0.015*"blaze" + 0.015*"inquiri" + 0.012*"close" + 0.012*"firefight" + 0.012*"bushfir" + 0.011*"resid"


Topic: 2 
Words: 0.038*"crash" + 0.022*"closer" + 0.017*"die" + 0.016*"road" + 0.016*"coast" + 0.014*"train" + 0.012*"kill" + 0.012*"gold" + 0.012*"north" + 0.011*"hick"


Topic: 3 
Words: 0.041*"plan" + 0.034*"council" + 0.031*"govt" + 0.030*"water" + 0.018*"urg" + 0.016*"group" + 0.013*"reject" + 0.013*"fund" + 0.012*"concern" + 0.012*"consid"


Topic: 4 
Words: 0.024*"hospit" + 0.022*"labor" + 0.019*"defend" + 0.019*"elect" + 0.016*"protest" + 0.015*"minist" + 0.015*"power" + 0.014*"work" + 0.014*"govt" + 0.013*"begin"


Topic: 5 
Words: 0.045*"warn" + 0.020*"fight" + 0.017*"nuclear" + 0.017*"england" + 0.016*"year" + 0.014*"action" + 0

In [34]:
'''
Define lda model using corpus_tfidf
'''
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf,
                                             num_topics=10,
                                             id2word=dictionary,
                                             passes=2,
                                             workers=4)

In [35]:
'''
For each topic, we will explore the words occurring in that topic and their relative weights
'''
for idx, topic in lda_model_tfidf.print_topics(-1):
    print("Topic: {} Word: {}".format(idx, topic))
    print("\n")

Topic: 0 Word: 0.006*"australia" + 0.006*"liber" + 0.006*"export" + 0.006*"england" + 0.006*"takeov" + 0.005*"grower" + 0.005*"contract" + 0.005*"resign" + 0.005*"aussi" + 0.005*"lake"


Topic: 1 Word: 0.014*"court" + 0.014*"murder" + 0.012*"charg" + 0.010*"polic" + 0.009*"jail" + 0.009*"face" + 0.009*"child" + 0.007*"sentenc" + 0.007*"accus" + 0.006*"appeal"


Topic: 2 Word: 0.019*"kill" + 0.011*"bomb" + 0.010*"crash" + 0.009*"accid" + 0.008*"polic" + 0.008*"attack" + 0.008*"fatal" + 0.008*"iraq" + 0.007*"blast" + 0.007*"soldier"


Topic: 3 Word: 0.030*"closer" + 0.006*"climat" + 0.006*"final" + 0.004*"open" + 0.004*"comment" + 0.004*"hawk" + 0.004*"violenc" + 0.004*"croc" + 0.004*"socceroo" + 0.004*"plan"


Topic: 4 Word: 0.012*"water" + 0.010*"price" + 0.009*"rise" + 0.006*"council" + 0.006*"rate" + 0.006*"restrict" + 0.006*"plan" + 0.005*"doubt" + 0.005*"govt" + 0.005*"suppli"


Topic: 5 Word: 0.010*"govt" + 0.008*"nuclear" + 0.008*"plan" + 0.007*"labor" + 0.007*"urg" + 0.007*"hick

### Performance evaluation by classifying sample document using LDA Bag of Words model

In [36]:
'''
Text of sample document 4310
'''
processed_docs[4310]

['rain', 'help', 'dampen', 'bushfir']

In [37]:
'''
Check which topic our test document belongs to using the LDA Bag of Words model.
'''

# Our test document is document number 4310
for index, score in sorted(lda_model[bow_corpus[document_num]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))


Score: 0.6199511885643005	 
Topic: 0.034*"report" + 0.030*"help" + 0.017*"deal" + 0.015*"urg" + 0.015*"blaze" + 0.015*"inquiri" + 0.012*"close" + 0.012*"firefight" + 0.012*"bushfir" + 0.011*"resid"

Score: 0.2200343757867813	 
Topic: 0.019*"school" + 0.018*"drought" + 0.018*"farmer" + 0.014*"price" + 0.014*"fund" + 0.014*"market" + 0.013*"rise" + 0.012*"rain" + 0.012*"feder" + 0.012*"boost"

Score: 0.020001808181405067	 
Topic: 0.024*"open" + 0.021*"test" + 0.018*"world" + 0.017*"win" + 0.014*"lead" + 0.014*"south" + 0.012*"take" + 0.012*"timor" + 0.011*"strike" + 0.010*"east"

Score: 0.020001808181405067	 
Topic: 0.038*"crash" + 0.022*"closer" + 0.017*"die" + 0.016*"road" + 0.016*"coast" + 0.014*"train" + 0.012*"kill" + 0.012*"gold" + 0.012*"north" + 0.011*"hick"

Score: 0.020001808181405067	 
Topic: 0.041*"plan" + 0.034*"council" + 0.031*"govt" + 0.030*"water" + 0.018*"urg" + 0.016*"group" + 0.013*"reject" + 0.013*"fund" + 0.012*"concern" + 0.012*"consid"

Score: 0.02000180818140506

In [38]:
'''
Check which topic our test document belongs to using the LDA TF-IDF model.
'''
for index, score in sorted(lda_model_tfidf[bow_corpus[document_num]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))


Score: 0.6218549013137817	 
Topic: 0.014*"court" + 0.014*"murder" + 0.012*"charg" + 0.010*"polic" + 0.009*"jail" + 0.009*"face" + 0.009*"child" + 0.007*"sentenc" + 0.007*"accus" + 0.006*"appeal"

Score: 0.21810561418533325	 
Topic: 0.013*"search" + 0.012*"coast" + 0.010*"miss" + 0.010*"gold" + 0.008*"north" + 0.008*"break" + 0.007*"west" + 0.007*"rudd" + 0.007*"guilti" + 0.007*"plead"

Score: 0.020008515566587448	 
Topic: 0.012*"water" + 0.010*"price" + 0.009*"rise" + 0.006*"council" + 0.006*"rate" + 0.006*"restrict" + 0.006*"plan" + 0.005*"doubt" + 0.005*"govt" + 0.005*"suppli"

Score: 0.02000657096505165	 
Topic: 0.011*"polic" + 0.010*"fund" + 0.010*"govt" + 0.008*"blaze" + 0.008*"hospit" + 0.007*"road" + 0.007*"firefight" + 0.007*"plan" + 0.006*"stab" + 0.006*"council"

Score: 0.02000456675887108	 
Topic: 0.006*"australia" + 0.006*"liber" + 0.006*"export" + 0.006*"england" + 0.006*"takeov" + 0.005*"grower" + 0.005*"contract" + 0.005*"resign" + 0.005*"aussi" + 0.005*"lake"

Score: 0

In [39]:
unseen_document = "My favorite sports activities are running and swimming."

# Data preprocessing step for the unseen document
bow_vector = dictionary.doc2bow(preprocess(unseen_document))

for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

Score: 0.42000025510787964	 Topic: 0.024*"open" + 0.021*"test" + 0.018*"world" + 0.017*"win" + 0.014*"lead"
Score: 0.23616380989551544	 Topic: 0.024*"hospit" + 0.022*"labor" + 0.019*"defend" + 0.019*"elect" + 0.016*"protest"
Score: 0.20379015803337097	 Topic: 0.028*"iraq" + 0.018*"talk" + 0.016*"australia" + 0.015*"troop" + 0.012*"storm"
Score: 0.020009027794003487	 Topic: 0.045*"warn" + 0.020*"fight" + 0.017*"nuclear" + 0.017*"england" + 0.016*"year"
Score: 0.020008498802781105	 Topic: 0.041*"plan" + 0.034*"council" + 0.031*"govt" + 0.030*"water" + 0.018*"urg"
Score: 0.020006872713565826	 Topic: 0.019*"school" + 0.018*"drought" + 0.018*"farmer" + 0.014*"price" + 0.014*"fund"
Score: 0.020005330443382263	 Topic: 0.034*"report" + 0.030*"help" + 0.017*"deal" + 0.015*"urg" + 0.015*"blaze"
Score: 0.020005330443382263	 Topic: 0.038*"crash" + 0.022*"closer" + 0.017*"die" + 0.016*"road" + 0.016*"coast"
Score: 0.020005330443382263	 Topic: 0.073*"polic" + 0.031*"charg" + 0.027*"court" + 0.023*"f