# **Topic Modelling**

Latent Dirichlet Allocation (LDA) is an algorithm used for topic modelling.

A Dirichlet distribution is a probability distribution over probability vectors, so each draw from it is itself a discrete probability distribution.

'Dirichlet' indicates that LDA assumes that the distribution of topics in a document and the distribution of words in a topic are each drawn from a Dirichlet prior.

LDA is used to assign the text in a document to one or more topics.

It builds a topics-per-document model and a words-per-topic model.

Each document is modeled as a multinomial distribution over topics, and each topic is modeled as a multinomial distribution over words.

LDA assumes that every chunk of text we feed into it contains words that are related.

It also assumes that documents are produced from a mixture of topics. Those topics then generate words according to their probability distributions.
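
The generative assumption above can be sketched in a few lines of standard-library Python. This is only an illustration, not how gensim trains a model: the toy vocabulary, the helper names `sample_dirichlet` and `generate_document`, and the values of `alpha` and `beta` are all assumptions made up for the example.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one probability vector from a Dirichlet using normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(vocab, num_topics, doc_length, alpha, beta, rng):
    # One word distribution per topic, each drawn from a symmetric Dirichlet(beta).
    topic_word = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(num_topics)]
    # One topic mixture for this document, drawn from a symmetric Dirichlet(alpha).
    doc_topic = sample_dirichlet([alpha] * num_topics, rng)
    words = []
    for _ in range(doc_length):
        # Pick a topic from the document's mixture, then a word from that topic.
        topic = rng.choices(range(num_topics), weights=doc_topic)[0]
        words.append(rng.choices(vocab, weights=topic_word[topic])[0])
    return doc_topic, topic_word, words

rng = random.Random(0)  # fixed seed so the sketch is repeatable
toy_vocab = ["drive", "disk", "game", "team", "space", "orbit"]  # hypothetical vocabulary
doc_topic, topic_word, words = generate_document(toy_vocab, 3, 8, 0.5, 0.5, rng)
print(doc_topic)  # topic proportions of the generated document (sums to 1)
print(words)      # the generated "document"
```

Training LDA runs this story in reverse: given only the words, it estimates the topic mixtures and the word distributions that most plausibly produced them.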

## **Context on the dataset.**

The dataset is a collection of newsgroup documents. The 20 Newsgroups collection has become a popular dataset for experiments in text applications of machine learning.

Machine learning techniques such as text classification and text clustering are commonly applied to this dataset.

### **Importing the built-in dataset loader (fetch_20newsgroups) from sklearn.**

In [1]:
from sklearn.datasets import fetch_20newsgroups

### **Importing warnings to ignore the warning messages.**

In [2]:
import warnings
warnings.filterwarnings('ignore')

### **Shuffling and splitting into train and test**

This dataset is already categorized into key topics.



In [3]:
newsgroups_train = fetch_20newsgroups(subset='train', shuffle=True)
newsgroups_test = fetch_20newsgroups(subset='test', shuffle=True)

### **Listing the names of the 20 newsgroups**

In [4]:
print(list(newsgroups_train.target_names))

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [5]:
newsgroups_train.data[:2]

["From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n",
 "From: guykuo@carson.u.washington.edu (Guy Kuo)\nSubject: SI Clock Poll - Final Call\nSummary: Final call for SI clock reports\nKeywords: SI,acceleration,clock,upgrade\nArticle-I.D.: shelley.1qvfo9INNc3s\nOrganization: University of Washington\nLines: 

In [6]:
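# Note: len() on the returned Bunch counts its fields, not the documents;
# use len(newsgroups_train.data) to count the training documents.
len(newsgroups_train)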

5

In [7]:
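# Note: as above, this counts the fields of the Bunch;
# use len(newsgroups_test.data) to count the test documents.
len(newsgroups_test)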

5

### **Importing the required libraries.**

**Gensim** processes raw, unstructured text using unsupervised machine learning algorithms.

**gensim.utils.simple_preprocess** - converts a document into a list of lowercase tokens.

**gensim.parsing.preprocessing** - provides text preprocessing utilities. Here it supplies the STOPWORDS set used to filter out common words.


In [8]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

### **WordNet**

WordNet is a large lexical database of English nouns, verbs, adjectives and adverbs.

It is used for word-sense disambiguation, information retrieval, automatic text classification and machine translation.

One of the most important uses of WordNet is to find the similarity between words.


In [9]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### **Importing the stemmers and the lemmatizer**

In [10]:
from nltk.stem import WordNetLemmatizer, SnowballStemmer, LancasterStemmer
from nltk.stem.porter import PorterStemmer
import numpy as np
np.random.seed(400)  # fix NumPy's global random seed

In [12]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

### **WordNetLemmatizer()**

NLTK's WordNetLemmatizer reduces a word to its lemma (its dictionary form) by looking it up in WordNet. Unlike a stemmer, it returns a real word.

WordNetLemmatizer().lemmatize('word', pos = 'v')

pos: the part of speech of the word. Here it is verb ('v').

In [13]:
print(WordNetLemmatizer().lemmatize('went', pos = 'v'))

go


### **SnowballStemmer()**

The Snowball stemmer is an improved version of the Porter stemmer.

Snowball generally stems to a more accurate root than Porter, although the result is not always a dictionary word.

E.g.:

    cared ----> care
    university ----> univers
    fairly ----> fair
    easily ----> easili
    singing ----> sing

In [14]:
import pandas as pd
stemmer = SnowballStemmer("english")

### **Running the Snowball stemmer on some sample words**

In [15]:
original_words = ['caresses', 'flies', 'dies', 'mules', 'denied','died', 'agreed', 'owned',
           'humbled', 'sized','meeting', 'stating', 'siezing', 'itemization','sensational',
           'traditional', 'reference', 'colonizer','plotted']
singles = [stemmer.stem(plural) for plural in original_words]

In [16]:
pd.DataFrame(data={'original word':original_words, 'stemmed':singles })

| | original word | stemmed |
|---|---|---|
| 0 | caresses | caress |
| 1 | flies | fli |
| 2 | dies | die |
| 3 | mules | mule |
| 4 | denied | deni |
| 5 | died | die |
| 6 | agreed | agre |
| 7 | owned | own |
| 8 | humbled | humbl |
| 9 | sized | size |


### **Lancaster Stemmer**

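The Lancaster stemmer is rule-based, and NLTK's implementation lets us pass our own custom rules to it.

In [17]:
from nltk.stem.lancaster import LancasterStemmer

# LancasterStemmer takes no language argument; its first parameter is an optional rule tuple.
p_stemmer = LancasterStemmer()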


In [18]:
original_words = ['carried', 'flew', 'died', 'mutes', 'denial', 'greedy', 'ownered',
           'humbling','greeting', 'stated', 'idealism','sensational',
           'traditional', 'reckless', 'condemnation','blended','beneficiary','notabily','materialistic']

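# Note: this still calls `stemmer` (Snowball), so the table below shows Snowball output;
# use p_stemmer.stem(word) to apply the Lancaster stemmer instead.
singles = [stemmer.stem(word) for word in original_words]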

In [20]:
pd.DataFrame(data={'original word':original_words, 'stemmed':singles})

| | original word | stemmed |
|---|---|---|
| 0 | carried | carri |
| 1 | flew | flew |
| 2 | died | die |
| 3 | mutes | mute |
| 4 | denial | denial |
| 5 | greedy | greedi |
| 6 | ownered | owner |
| 7 | humbling | humbl |
| 8 | greeting | greet |
| 9 | stated | state |


### **Lemmatizing a token as a verb and then stemming it**

In [21]:
def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

## **Tokenize and lemmatize**

We create a function for preprocessing the text.

It tokenizes the text with gensim.utils.simple_preprocess and keeps only the tokens that are not in gensim's STOPWORDS and are longer than 3 characters.

Each kept token is passed through lemmatize_stemming(token) and appended to the result.

In [22]:

def preprocess(text):
    result=[]
    for token in gensim.utils.simple_preprocess(text) :
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))

    return result

### **Choosing a document index and a sample sentence**

In [23]:
document_num = 20
doc_sample = 'This disk has failed many times. I would like to get it replaced.'

In [24]:
print("Original document: ")

Original document: 


### **Splitting the sentence into words (Tokenization)**


In [25]:
# Split on single spaces; punctuation stays attached to the words.
words = doc_sample.split(' ')

print(words)

['This', 'disk', 'has', 'failed', 'many', 'times.', 'I', 'would', 'like', 'to', 'get', 'it', 'replaced.']


In [26]:
print("\n\nTokenized and lemmatized document: ")
print(preprocess(doc_sample))



Tokenized and lemmatized document: 
['disk', 'fail', 'time', 'like', 'replac']


### **Preprocessing the documents**

In [27]:
processed_docs = []
 
for doc in newsgroups_train.data:
    processed_docs.append(preprocess(doc))
 
print(processed_docs[:2])


[['lerxst', 'thing', 'subject', 'nntp', 'post', 'host', 'organ', 'univers', 'maryland', 'colleg', 'park', 'line', 'wonder', 'enlighten', 'door', 'sport', 'look', 'late', 'earli', 'call', 'bricklin', 'door', 'small', 'addit', 'bumper', 'separ', 'rest', 'bodi', 'know', 'tellm', 'model', 'engin', 'spec', 'year', 'product', 'histori', 'info', 'funki', 'look', 'mail', 'thank', 'bring', 'neighborhood', 'lerxst'], ['guykuo', 'carson', 'washington', 'subject', 'clock', 'poll', 'final', 'summari', 'final', 'clock', 'report', 'keyword', 'acceler', 'clock', 'upgrad', 'articl', 'shelley', 'qvfo', 'innc', 'organ', 'univers', 'washington', 'line', 'nntp', 'post', 'host', 'carson', 'washington', 'fair', 'number', 'brave', 'soul', 'upgrad', 'clock', 'oscil', 'share', 'experi', 'poll', 'send', 'brief', 'messag', 'detail', 'experi', 'procedur', 'speed', 'attain', 'rat', 'speed', 'card', 'adapt', 'heat', 'sink', 'hour', 'usag', 'floppi', 'disk', 'function', 'floppi', 'especi', 'request', 'summar', 'day',

### **Defining the Dictionary and corpus**

### **corpora.Dictionary()**

gensim.corpora.Dictionary implements the concept of a dictionary: a mapping between normalized words and their integer ids.

Here the mapping is built from processed_docs.

In [31]:
dictionary = gensim.corpora.Dictionary(processed_docs)
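
To make the word-to-id mapping concrete, here is a minimal pure-Python sketch of the idea. It is not gensim's implementation: `build_token_ids` is a hypothetical helper, and assigning ids in sorted order within each document is an assumption inferred from the alphabetical ids in the listing printed in the next cell.

```python
def build_token_ids(docs):
    """Assign each new token the next free integer id, document by document."""
    token2id = {}
    for doc in docs:
        # Assumption: new tokens of a document receive ids in sorted order.
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

toy_docs = [["door", "bumper", "door", "engin"], ["clock", "engin", "upgrad"]]
token2id = build_token_ids(toy_docs)
print(token2id)
# {'bumper': 0, 'door': 1, 'engin': 2, 'clock': 3, 'upgrad': 4}
```

Repeated tokens such as 'door' and 'engin' keep the id they were given the first time they were seen.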

### **Printing the first 101 entries of the dictionary**

In [32]:
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 100:
        break


0 addit
1 bodi
2 bricklin
3 bring
4 bumper
5 call
6 colleg
7 door
8 earli
9 engin
10 enlighten
11 funki
12 histori
13 host
14 info
15 know
16 late
17 lerxst
18 line
19 look
20 mail
21 maryland
22 model
23 neighborhood
24 nntp
25 organ
26 park
27 post
28 product
29 rest
30 separ
31 small
32 spec
33 sport
34 subject
35 tellm
36 thank
37 thing
38 univers
39 wonder
40 year
41 acceler
42 adapt
43 answer
44 articl
45 attain
46 base
47 brave
48 brief
49 card
50 carson
51 clock
52 day
53 detail
54 disk
55 especi
56 experi
57 fair
58 final
59 floppi
60 function
61 guykuo
62 haven
63 heat
64 hour
65 innc
66 keyword
67 knowledg
68 messag
69 network
70 number
71 oscil
72 poll
73 procedur
74 qvfo
75 rat
76 report
77 request
78 send
79 share
80 shelley
81 sink
82 soul
83 speed
84 summar
85 summari
86 upgrad
87 usag
88 washington
89 access
90 activ
91 actual
92 advanc
93 anybodi
94 anymor
95 appear
96 better
97 breifli
98 bunch
99 convict
100 corner


### **dictionary.filter_extremes()**

Filter out tokens in the dictionary by their frequency.

**Parameters**

**no_below** (int, optional) – Keep tokens which are contained in at least no_below documents.

**no_above** (float, optional) – Keep tokens which are contained in no more than no_above documents (fraction of total corpus size, not an absolute number).

**keep_n** (int, optional) – Keep only the first keep_n most frequent tokens.

**keep_tokens** (iterable of str) – Iterable of tokens that must stay in dictionary after filtering.



In [33]:
dictionary.filter_extremes(no_below = 15, no_above = 0.1, keep_n = 10000)
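
The filtering rule can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than gensim's code: `filter_by_document_frequency` is a hypothetical helper, and the exact rounding of `no_above` and the tie-breaking among equally frequent tokens are choices made for this example.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below, no_above, keep_n):
    """Keep tokens that appear in enough, but not too many, documents."""
    num_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    kept = [tok for tok, df in doc_freq.items()
            if no_below <= df <= no_above * num_docs]
    # Most frequent first; ties broken alphabetically (an assumption of this sketch).
    kept.sort(key=lambda tok: (-doc_freq[tok], tok))
    return kept[:keep_n]

toy_docs = [["disk", "drive", "line"], ["disk", "line", "game"],
            ["team", "line", "game"], ["orbit", "line"]]
kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.5, keep_n=10)
print(kept)
# ['disk', 'game']
```

In the toy corpus, 'line' is dropped because it appears in every document, while 'drive', 'team' and 'orbit' are dropped because each appears in only one.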

### **Frequency of the Words in the documents.**

**doc2bow(doc)** - converts a document into the bag-of-words (BoW) format: a list of (token_id, token_count) tuples.
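
A bag-of-words conversion can be sketched with the standard library alone. The helper `to_bag_of_words` and the toy `token2id` mapping are hypothetical; the sketch assumes, as a simplification, that tokens missing from the mapping are simply skipped.

```python
from collections import Counter

def to_bag_of_words(doc, token2id):
    """Count how often each known token id occurs in the document."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())  # list of (token_id, token_count) tuples

token2id = {"clock": 0, "disk": 1, "upgrad": 2}  # toy mapping for illustration
bow = to_bag_of_words(["clock", "upgrad", "clock", "poll"], token2id)
print(bow)
# [(0, 2), (2, 1)]
```

Word order is discarded: only which tokens occur, and how often, is kept.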

In [34]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
 
document_num = 20
bow_doc_x = bow_corpus[document_num]
 
# Each entry of bow_doc_x is a (token_id, token_count) pair.
for token_id, token_count in bow_doc_x:
    print("Word {} (\"{}\") appears {} time.".format(token_id,
                                                     dictionary[token_id],
                                                     token_count))

Word 18 ("rest") appears 1 time.
Word 166 ("clear") appears 1 time.
Word 336 ("refer") appears 1 time.
Word 350 ("true") appears 1 time.
Word 391 ("technolog") appears 1 time.
Word 437 ("christian") appears 1 time.
Word 453 ("exampl") appears 1 time.
Word 476 ("jew") appears 1 time.
Word 480 ("lead") appears 1 time.
Word 482 ("littl") appears 3 time.
Word 520 ("wors") appears 2 time.
Word 721 ("keith") appears 3 time.
Word 732 ("punish") appears 1 time.
Word 803 ("california") appears 1 time.
Word 859 ("institut") appears 1 time.
Word 917 ("similar") appears 1 time.
Word 990 ("allan") appears 1 time.
Word 991 ("anti") appears 1 time.
Word 992 ("arriv") appears 1 time.
Word 993 ("austria") appears 1 time.
Word 994 ("caltech") appears 2 time.
Word 995 ("distinguish") appears 1 time.
Word 996 ("german") appears 1 time.
Word 997 ("germani") appears 3 time.
Word 998 ("hitler") appears 1 time.
Word 999 ("livesey") appears 2 time.
Word 1000 ("motto") appears 2 time.
Word 1001 ("order") appear

### **Model Building**

LDA (Latent Dirichlet Allocation) is an unsupervised method that groups documents by topic. The number of topics (here 8) is chosen in advance.

In [35]:
lda_model =  gensim.models.LdaMulticore(bow_corpus,
                                   num_topics = 8,
                                   id2word = dictionary,                                   
                                   passes = 10,
                                   workers = 2)

### **Displaying the top words of each topic.**

print_topics(-1) returns every topic. "idx" simply means "index": here, the number of the topic currently being printed.

**Training parameters**

**passes** - the total number of training passes over the corpus.

**chunksize** - the number of documents used in each training chunk.

**update_every** - determines how often the model parameters are updated.

In [None]:
for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic ))
    print("\n")

Topic: 0 
Words: 0.012*"armenian" + 0.006*"turkish" + 0.005*"bike" + 0.005*"leav" + 0.004*"road" + 0.004*"home" + 0.004*"turk" + 0.003*"greek" + 0.003*"armenia" + 0.003*"game"


Topic: 1 
Words: 0.021*"drive" + 0.007*"control" + 0.007*"disk" + 0.006*"hard" + 0.004*"caus" + 0.004*"pitt" + 0.004*"effect" + 0.003*"gordon" + 0.003*"firearm" + 0.003*"price"


Topic: 2 
Words: 0.008*"wire" + 0.008*"scsi" + 0.008*"power" + 0.005*"data" + 0.004*"design" + 0.004*"grind" + 0.004*"circuit" + 0.004*"chip" + 0.004*"engin" + 0.004*"connect"


Topic: 3 
Words: 0.012*"space" + 0.010*"nasa" + 0.006*"presid" + 0.005*"nation" + 0.005*"research" + 0.005*"program" + 0.004*"center" + 0.004*"orbit" + 0.004*"health" + 0.004*"group"


Topic: 4 
Words: 0.010*"govern" + 0.009*"encrypt" + 0.007*"secur" + 0.007*"israel" + 0.006*"chip" + 0.006*"public" + 0.006*"clipper" + 0.006*"isra" + 0.004*"protect" + 0.004*"key"


Topic: 5 
Words: 0.019*"game" + 0.016*"team" + 0.012*"play" + 0.010*"player" + 0.009*"hockey" + 0.
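
Each printed topic is a string of weight*"word" terms joined by " + ". If the weights are needed as numbers, one way to read such a string is a small regular expression. The helper `parse_topic_string` is hypothetical, and the sample string is shortened from Topic 1 above.

```python
import re

def parse_topic_string(topic):
    """Turn '0.021*"drive" + ...' into a list of (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic)
    return [(word, float(weight)) for weight, word in pairs]

sample = '0.021*"drive" + 0.007*"control" + 0.007*"disk"'
print(parse_topic_string(sample))
# [('drive', 0.021), ('control', 0.007), ('disk', 0.007)]
```

A higher weight means the word is more probable under that topic, so the first few words are usually the best guide to what the topic is about.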

## **Conclusion**

Thus, topic modelling has clustered related words into groups, or topics.
