
# Topic Modeling and Latent Dirichlet Allocation (LDA) in Python

> **Topic modeling** is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. **Latent Dirichlet Allocation (LDA)** is an example of a topic model and is used to assign the text in a document to one or more topics. It builds a topics-per-document model and a words-per-topic model, both modeled as **Dirichlet distributions**.
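
To make the two distributions concrete, here is a minimal sketch of the generative story LDA assumes: each document draws a topic mixture from a Dirichlet distribution, and each word is drawn from the word distribution of a topic picked from that mixture. The topics, the `alpha` values, and the helper names are made up for illustration; in full LDA the word distributions are themselves drawn from a Dirichlet, whereas here they are fixed by hand.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, length, rng):
    """Generate one toy document following the LDA generative story."""
    theta = sample_dirichlet(alpha, rng)  # topic mixture for this document
    words = []
    for _ in range(length):
        # Pick a topic according to the document's mixture...
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        # ...then pick a word according to that topic's word distribution.
        dist = topic_word[topic]
        words.append(rng.choices(list(dist), weights=list(dist.values()))[0])
    return theta, words

# Hypothetical topics, hand-written for illustration (not learned from the data).
toy_topics = [
    {"iraq": 0.5, "troop": 0.3, "attack": 0.2},
    {"rain": 0.4, "bushfir": 0.4, "farmer": 0.2},
]
rng = random.Random(2018)
theta, words = generate_document(toy_topics, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta)
print(words)
```

Training LDA runs this story in reverse: given only the words, it estimates the topic mixtures and word distributions that most plausibly produced them.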

In [7]:
# Import the necessary libraries.
#
import pandas as pd

# Read the first sheet of the workbook and save it as CSV.
data = pd.read_excel('abcnews-date-text.xlsx', 0)
data.to_csv("abcnews-date-text.csv")

In [11]:
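# on_bad_lines='skip' (pandas >= 1.3) replaces error_bad_lines=False, which was removed in pandas 2.0.
data = pd.read_csv('abcnews-date-text.csv', on_bad_lines='skip', sep=",")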
data.head()

Unnamed: 0.1,Unnamed: 0,"publish_date,headline_text"
0,0,"20030219,aba decides against community broadca..."
1,1,"20030219,act fire witnesses must be aware of d..."
2,2,"20030219,a g calls for infrastructure protecti..."
3,3,"20030219,air nz staff in aust strike for pay rise"
4,4,"20030219,air nz strike to affect australian tr..."


In [16]:
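# Split on the first comma only (n=1) so that headlines containing commas stay intact.
data[['publish_date','headline_text']] = data['publish_date,headline_text'].str.split(',', n=1, expand=True)
# Drop the leftover index column and the combined column.
data = data.iloc[:, 2:]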
data.head()

Unnamed: 0,publish_date,headline_text
0,20030219,aba decides against community broadcasting lic...
1,20030219,act fire witnesses must be aware of defamation
2,20030219,a g calls for infrastructure protection summit
3,20030219,air nz staff in aust strike for pay rise
4,20030219,air nz strike to affect australian travellers


In [17]:
# Work on a copy so that adding a column does not trigger pandas' SettingWithCopyWarning.
data_text = data[['headline_text']].copy()
data_text['index'] = data_text.index
documents = data_text

In [18]:
# Print the number of documents and preview the first five rows.
#
print(len(documents))
print(documents[:5])

9269
                                       headline_text  index
0  aba decides against community broadcasting lic...      0
1     act fire witnesses must be aware of defamation      1
2     a g calls for infrastructure protection summit      2
3           air nz staff in aust strike for pay rise      3
4      air nz strike to affect australian travellers      4


## Data Pre-processing

>>
**Tokenization**: Split the text into sentences and the sentences into words. Lowercase the words and remove punctuation.
>>
Words with 3 or fewer characters are removed.
>>
All stopwords are removed.
>>
**Words are lemmatized** — words in third person are changed to first person and verbs in past and future tenses are changed into present.
>>
**Words are stemmed** — words are reduced to their root form.

**Loading gensim and nltk libraries**

In [19]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
import numpy as np

# Fix the NumPy seed for reproducibility.
np.random.seed(2018)

# The WordNet corpus is required by WordNetLemmatizer.
nltk.download('wordnet')

stemmer = PorterStemmer()

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


**Lemmatization and stemming**

In [21]:
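def lemmatize_stemming(text):
    # Lemmatize the token as a verb, then reduce it to its stem.
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    # Tokenize and lowercase, drop stopwords and tokens of 3 or fewer characters,
    # then lemmatize and stem what remains.
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result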

In [23]:
documents.shape

(9269, 2)

In [25]:
doc_sample = documents[documents['index'] == 9000].values[0][0]
print('original document: ')
words = doc_sample.split(' ')
print(words)
print('\n\n tokenized and lemmatized document: ')
print(preprocess(doc_sample))

original document: 
['tasmanian', 'student', 'ends', 'anti', 'war', 'hunger', 'strike']


 tokenized and lemmatized document: 
['tasmanian', 'student', 'end', 'anti', 'hunger', 'strike']


> Preprocess the headline text, saving the results as ‘processed_docs’

In [26]:
processed_docs = documents['headline_text'].map(preprocess)
processed_docs[:10]

0               [decid, commun, broadcast, licenc]
1                               [wit, awar, defam]
2           [call, infrastructur, protect, summit]
3                      [staff, aust, strike, rise]
4             [strike, affect, australian, travel]
5               [ambiti, olsson, win, tripl, jump]
6           [antic, delight, record, break, barca]
7    [aussi, qualifi, stosur, wast, memphi, match]
8            [aust, address, secur, council, iraq]
9                         [australia, lock, timet]
Name: headline_text, dtype: object

## Bag of Words on the Data set

>>
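Create a dictionary from ‘processed_docs’ that maps each unique word in the training set to an integer id. The dictionary also keeps track of how many documents each word appears in, which the filtering step below relies on.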

In [27]:
dictionary = gensim.corpora.Dictionary(processed_docs)
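# Print the first 11 (id, word) pairs. items() is used because iteritems() is not available in gensim 4.x.
count = 0
for k, v in dictionary.items():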
    print(k, v)
    count += 1
    if count > 10:
        break

0 broadcast
1 commun
2 decid
3 licenc
4 awar
5 defam
6 wit
7 call
8 infrastructur
9 protect
10 summit
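
As a rough illustration of what the dictionary holds, the sketch below builds the same kind of word-to-id mapping in plain Python and counts document frequencies along the way. The function name `build_dictionary` is hypothetical and this is not gensim's implementation; new words are numbered in alphabetical order within each document, which is the ordering visible in the output above.

```python
def build_dictionary(tokenized_docs):
    """Map each unique token to an integer id and count its document frequency."""
    token2id = {}
    doc_freq = {}
    for doc in tokenized_docs:
        # Visit each distinct token once per document, in alphabetical order.
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
            token_id = token2id[token]
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1
    return token2id, doc_freq

# The first three preprocessed headlines from the notebook.
docs = [
    ['decid', 'commun', 'broadcast', 'licenc'],
    ['wit', 'awar', 'defam'],
    ['call', 'infrastructur', 'protect', 'summit'],
]
token2id, doc_freq = build_dictionary(docs)
for token, token_id in token2id.items():
    print(token_id, token)
```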


**Gensim filter_extremes**
>>
Filter out tokens that appear in:
>>
* fewer than 15 documents (an absolute number), or
>>
* more than 50% of the documents (no_above=0.5 is a fraction of the total corpus size, not an absolute number).
>>
* After these two steps, keep only the 100,000 most frequent tokens.

In [28]:
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
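
The filtering rule can be sketched in plain Python as follows. This is an illustration of the three steps listed above under the assumption that "frequency" means document frequency; the function name and the toy counts are hypothetical, and gensim's own implementation may differ in details such as rounding of the upper bound.

```python
def filter_extremes_sketch(doc_freq, num_docs, no_below=15, no_above=0.5, keep_n=100000):
    """Return the tokens that survive the three filtering steps."""
    max_docs = no_above * num_docs  # convert the fraction to a document count
    kept = [tok for tok, df in doc_freq.items() if no_below <= df <= max_docs]
    # Keep only the keep_n tokens that occur in the most documents.
    kept.sort(key=lambda tok: doc_freq[tok], reverse=True)
    return kept[:keep_n]

# Hypothetical document frequencies in a corpus of 100 documents.
toy_doc_freq = {"iraq": 40, "polic": 60, "say": 95, "olsson": 2, "rain": 15}
survivors = filter_extremes_sketch(toy_doc_freq, num_docs=100)
print(survivors)
```

In this toy corpus, "olsson" is too rare, while "polic" and "say" appear in more than half of the documents, so only "iraq" and "rain" remain.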

## Gensim doc2bow

>>
For each document, we create a bag-of-words representation: a list of (word id, count) pairs recording which dictionary words appear in the document and how many times. Save this to ‘bow_corpus’, then inspect one of the documents (index 4310).

In [29]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
bow_corpus[4310]

[(43, 1), (60, 1), (258, 1)]

**Preview Bag Of Words for our sample preprocessed document.**

In [32]:
bow_doc = bow_corpus[9000]
for token_id, token_count in bow_doc:
    print("Word {} (\"{}\") appears {} time.".format(token_id,
                                                     dictionary[token_id],
                                                     token_count))

Word 8 ("strike") appears 1 time.
Word 215 ("student") appears 1 time.
Word 261 ("anti") appears 1 time.
Word 292 ("tasmanian") appears 1 time.
Word 488 ("end") appears 1 time.
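
The preview above lists five words although the preprocessed headline has six: "hunger" is missing, presumably because it was dropped from the dictionary by the filtering step. A minimal sketch of the bag-of-words conversion shows the mechanism. The function name `doc2bow_sketch` is hypothetical, and the id mapping is copied from the preview rather than taken from the real dictionary object.

```python
from collections import Counter

def doc2bow_sketch(tokens, token2id):
    """Count known tokens and return sorted (token_id, count) pairs."""
    # Tokens that are not in the dictionary are silently ignored.
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())

# Ids as shown in the preview; 'hunger' has no entry.
token2id = {'strike': 8, 'student': 215, 'anti': 261, 'tasmanian': 292, 'end': 488}
tokens = ['tasmanian', 'student', 'end', 'anti', 'hunger', 'strike']
print(doc2bow_sketch(tokens, token2id))
# [(8, 1), (215, 1), (261, 1), (292, 1), (488, 1)]
```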


## TF-IDF

>>
Create a TF-IDF model object using models.TfidfModel on ‘bow_corpus’ and save it to ‘tfidf’, then apply the transformation to the entire corpus and call the result ‘corpus_tfidf’. Finally, preview the TF-IDF scores for the first document.

In [33]:
from pprint import pprint
from gensim import models

tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]

# Preview the TF-IDF scores of the first document only.
for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.6388391606013308), (1, 0.7693403192880164)]
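
The two scores above have squares that sum to 1, which is consistent with each document vector being scaled to unit length. The sketch below reproduces that behaviour in plain Python, assuming the weighting is term frequency times log2(N / document frequency) followed by L2 normalization. That assumption reflects my understanding of gensim's defaults rather than something stated in this notebook, and the function name and toy corpus are hypothetical.

```python
import math

def tfidf_sketch(bow_corpus):
    """Weight each (token_id, count) pair by tf * log2(N / df), then L2-normalize."""
    num_docs = len(bow_corpus)
    doc_freq = {}
    for bow in bow_corpus:
        for token_id, _ in bow:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    result = []
    for bow in bow_corpus:
        weights = [(tid, tf * math.log2(num_docs / doc_freq[tid])) for tid, tf in bow]
        norm = math.sqrt(sum(w * w for _, w in weights))
        # Drop zero weights; a document with no informative tokens becomes empty.
        result.append([(tid, w / norm) for tid, w in weights if norm > 0 and w > 0])
    return result

# Hypothetical bag-of-words corpus with four documents.
toy_corpus = [
    [(0, 1), (1, 1)],
    [(0, 1), (2, 1)],
    [(2, 1), (3, 1)],
    [(3, 2)],
]
toy_tfidf = tfidf_sketch(toy_corpus)
print(toy_tfidf[0])
```

In the first toy document, token 1 occurs in one document and token 0 in two, so the rarer token 1 receives the larger weight.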


## Running LDA using Bag of Words

>>
Train our LDA model using gensim.models.LdaMulticore and save it to ‘lda_model’.

In [34]:
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary, passes=2, workers=2)

>>
For each topic, we will explore the words occurring in that topic and their relative weights.

In [35]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.050*"plan" + 0.022*"welcom" + 0.021*"say" + 0.019*"support" + 0.015*"coast" + 0.015*"iraq" + 0.013*"hospit" + 0.011*"offer" + 0.011*"terror" + 0.010*"rail"
Topic: 1 
Words: 0.029*"warn" + 0.026*"iraqi" + 0.017*"forc" + 0.016*"kill" + 0.015*"govt" + 0.014*"dead" + 0.014*"iraq" + 0.013*"tour" + 0.012*"hit" + 0.012*"report"
Topic: 2 
Words: 0.036*"govt" + 0.017*"england" + 0.016*"bushfir" + 0.016*"rain" + 0.016*"fund" + 0.015*"iraq" + 0.014*"concern" + 0.014*"centr" + 0.013*"boost" + 0.013*"charg"
Topic: 3 
Words: 0.035*"iraq" + 0.031*"troop" + 0.017*"iraqi" + 0.017*"aust" + 0.016*"attack" + 0.016*"council" + 0.015*"kill" + 0.014*"confid" + 0.014*"suicid" + 0.012*"rise"
Topic: 4 
Words: 0.023*"council" + 0.020*"australian" + 0.016*"elect" + 0.016*"critic" + 0.015*"kill" + 0.014*"suspect" + 0.014*"charg" + 0.014*"minist" + 0.013*"crash" + 0.012*"polic"
Topic: 5 
Words: 0.051*"protest" + 0.040*"anti" + 0.022*"meet" + 0.017*"secur" + 0.016*"polic" + 0.015*"charg" + 0.014*"

> We can now distinguish different topics using the words in each topic and their corresponding weights.
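
When comparing topics, it can help to turn the printed strings into (word, weight) pairs. The helper below is a small sketch that parses the string format shown above with a regular expression; `parse_topic` is a hypothetical name, and the sample string is the first five terms of Topic 0 from the output.

```python
import re

def parse_topic(topic_string):
    """Parse '0.050*"plan" + 0.022*"welcom"' into [('plan', 0.05), ('welcom', 0.022)]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

topic_0 = '0.050*"plan" + 0.022*"welcom" + 0.021*"say" + 0.019*"support" + 0.015*"coast"'
for word, weight in parse_topic(topic_0):
    print(f"{word:<10}{weight:.3f}")
```

If the trained model is still in memory, gensim's `show_topic` method should return similar pairs directly, so parsing is mainly useful when only the printed output is available.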


## Running LDA using TF-IDF

In [36]:
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=10, id2word=dictionary, passes=2, workers=4)
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic))

Topic: 0 Word: 0.015*"rise" + 0.015*"rain" + 0.015*"death" + 0.013*"jail" + 0.012*"servic" + 0.012*"fund" + 0.010*"kill" + 0.010*"green" + 0.009*"farmer" + 0.008*"blue"
Topic: 1 Word: 0.032*"iraq" + 0.023*"say" + 0.013*"seek" + 0.012*"saddam" + 0.011*"howard" + 0.011*"strike" + 0.010*"team" + 0.009*"coast" + 0.009*"australia" + 0.009*"gold"
Topic: 2 Word: 0.020*"report" + 0.020*"govt" + 0.014*"aust" + 0.013*"deni" + 0.013*"warn" + 0.012*"claim" + 0.010*"iraq" + 0.010*"releas" + 0.010*"take" + 0.009*"titl"
Topic: 3 Word: 0.032*"plan" + 0.015*"urg" + 0.014*"minist" + 0.012*"govt" + 0.012*"offer" + 0.012*"remain" + 0.010*"confid" + 0.009*"crash" + 0.009*"die" + 0.009*"water"
Topic: 4 Word: 0.021*"miss" + 0.013*"injur" + 0.011*"port" + 0.011*"call" + 0.010*"hold" + 0.010*"council" + 0.010*"open" + 0.010*"hospit" + 0.009*"shark" + 0.009*"vote"
Topic: 5 Word: 0.014*"get" + 0.013*"polic" + 0.012*"worker" + 0.011*"return" + 0.010*"bushfir" + 0.010*"search" + 0.010*"break" + 0.009*"fund" + 0.00

> We can distinguish different topics using the words in each topic and their corresponding weights.

## Performance evaluation by classifying sample document using LDA Bag of Words model

In [39]:
processed_docs[9000]

['tasmanian', 'student', 'end', 'anti', 'hunger', 'strike']

In [40]:
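# Note: this scores document 4310 (the one inspected in the doc2bow step), not document 9000 shown above.
for index, score in sorted(lda_model[bow_corpus[4310]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))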


Score: 0.5248236060142517	 
Topic: 0.036*"govt" + 0.017*"england" + 0.016*"bushfir" + 0.016*"rain" + 0.016*"fund" + 0.015*"iraq" + 0.014*"concern" + 0.014*"centr" + 0.013*"boost" + 0.013*"charg"

Score: 0.2751636505126953	 
Topic: 0.044*"polic" + 0.026*"death" + 0.020*"world" + 0.018*"open" + 0.016*"get" + 0.016*"water" + 0.014*"look" + 0.014*"take" + 0.013*"help" + 0.012*"court"

Score: 0.025004133582115173	 
Topic: 0.051*"protest" + 0.040*"anti" + 0.022*"meet" + 0.017*"secur" + 0.016*"polic" + 0.015*"charg" + 0.014*"shoot" + 0.013*"raid" + 0.013*"return" + 0.012*"talk"

Score: 0.025003252550959587	 
Topic: 0.050*"plan" + 0.022*"welcom" + 0.021*"say" + 0.019*"support" + 0.015*"coast" + 0.015*"iraq" + 0.013*"hospit" + 0.011*"offer" + 0.011*"terror" + 0.010*"rail"

Score: 0.02500160224735737	 
Topic: 0.023*"council" + 0.020*"australian" + 0.016*"elect" + 0.016*"critic" + 0.015*"kill" + 0.014*"suspect" + 0.014*"charg" + 0.014*"minist" + 0.013*"crash" + 0.012*"polic"

Score: 0.0250012725

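> The topic with the highest score (about 0.52 here) is the one the model considers most likely for the test document. LDA is unsupervised, so there is no ground-truth label to check this against; judge the assignment by whether that topic's top words fit the headline.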
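
One simple way to summarize such a result is to rank the topics and look at the gap between the top two scores: a large gap suggests a clear assignment, a small one an ambiguous document. The sketch below works on a list of (topic id, score) pairs shaped like the model output. The ids and rounded scores are illustrative values, not taken from the model.

```python
def rank_topics(topic_scores):
    """Sort (topic_id, score) pairs from most to least likely."""
    return sorted(topic_scores, key=lambda pair: pair[1], reverse=True)

# Illustrative (topic_id, score) pairs; the ids are hypothetical.
scores = [(0, 0.025), (2, 0.525), (5, 0.025), (7, 0.275)]
ranked = rank_topics(scores)
best_id, best_score = ranked[0]
margin = best_score - ranked[1][1]
print(f"dominant topic: {best_id} (score {best_score:.3f}, margin {margin:.3f})")
```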

## Performance evaluation by classifying sample document using LDA TF-IDF model

In [41]:
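# Note: this also scores document 4310. The raw BoW vector is passed here;
# tfidf[bow_corpus[4310]] would match the representation the model was trained on.
for index, score in sorted(lda_model_tfidf[bow_corpus[4310]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))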


Score: 0.5064266920089722	 
Topic: 0.014*"get" + 0.013*"polic" + 0.012*"worker" + 0.011*"return" + 0.010*"bushfir" + 0.010*"search" + 0.010*"break" + 0.009*"fund" + 0.009*"anti" + 0.009*"match"

Score: 0.2935522496700287	 
Topic: 0.015*"rise" + 0.015*"rain" + 0.015*"death" + 0.013*"jail" + 0.012*"servic" + 0.012*"fund" + 0.010*"kill" + 0.010*"green" + 0.009*"farmer" + 0.008*"blue"

Score: 0.025004150345921516	 
Topic: 0.032*"plan" + 0.015*"urg" + 0.014*"minist" + 0.012*"govt" + 0.012*"offer" + 0.012*"remain" + 0.010*"confid" + 0.009*"crash" + 0.009*"die" + 0.009*"water"

Score: 0.025003215298056602	 
Topic: 0.021*"miss" + 0.013*"injur" + 0.011*"port" + 0.011*"call" + 0.010*"hold" + 0.010*"council" + 0.010*"open" + 0.010*"hospit" + 0.009*"shark" + 0.009*"vote"

Score: 0.02500317059457302	 
Topic: 0.020*"report" + 0.020*"govt" + 0.014*"aust" + 0.013*"deni" + 0.013*"warn" + 0.012*"claim" + 0.010*"iraq" + 0.010*"releas" + 0.010*"take" + 0.009*"titl"

Score: 0.025003159418702126	 
Topic: 0

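> Here too, the topic with the highest score (about 0.51) is the one the model considers most likely for the test document. As there are no ground-truth labels, the assignment can only be judged by how well that topic's top words fit the headline.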

## Testing model on unseen document

In [42]:
# unseen_document = 'How a Pentagon deal became an identity crisis for Google'

unseen_document = input("Enter a sentence to identify the topic: ")

bow_vector = dictionary.doc2bow(preprocess(unseen_document))
for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

Enter a sentence to identify the topic: The rebels in the Donbas region of eastern Ukraine also rely on Russian help. Western governments say Russia has deployed regular troops there, along with heavy weapons. That is denied by the Kremlin, which says Russian "volunteers" are helping the rebels, who seized a swathe of the Donetsk and Luhansk regions in April 2014
Score: 0.4207592010498047	 Topic: 0.044*"polic" + 0.026*"death" + 0.020*"world" + 0.018*"open" + 0.016*"get"
Score: 0.26123395562171936	 Topic: 0.083*"iraq" + 0.039*"baghdad" + 0.020*"report" + 0.016*"say" + 0.012*"australia"
Score: 0.24798600375652313	 Topic: 0.035*"iraq" + 0.031*"troop" + 0.017*"iraqi" + 0.017*"aust" + 0.016*"attack"
Score: 0.010005224496126175	 Topic: 0.036*"govt" + 0.017*"england" + 0.016*"bushfir" + 0.016*"rain" + 0.016*"fund"
Score: 0.010003713890910149	 Topic: 0.050*"plan" + 0.022*"welcom" + 0.021*"say" + 0.019*"support" + 0.015*"coast"
Score: 0.010002924129366875	 Topic: 0.046*"face" + 0.020*"govt" + 0