> ## KRITHIKA DEVI CHANDRAN (2211570)
> ## Domain: AI & ML

> # Natural Language Processing

# Topic Modeling

This project takes posts from a set of newsgroups and identifies the topics discussed in them.

Only threads from 20 newsgroups are used, to keep processing time low.

## Dataset Description

**Data: The 20 newsgroups text dataset**

The dataset comprises around **18000 newsgroups posts** on **20 topics**, split into two subsets: one for training (or development) and the other for testing (or performance evaluation). The split between the train and test sets is based on whether a message was posted before or after a specific date.

The `sklearn.datasets.fetch_20newsgroups` function is a data fetching / caching function that downloads the data archive from the original 20 newsgroups website, extracts the archive contents into the `~/scikit_learn_data/20news_home` folder, and calls `sklearn.datasets.load_files` on the training set folder, the testing set folder, or both:

In [1]:
# importing 20 newsgroups dataset using sklearn
from sklearn.datasets import fetch_20newsgroups

In [2]:
# ignoring warnings
import warnings
warnings.filterwarnings('ignore')

In [3]:
# getting train and test set data
newsgroups_train = fetch_20newsgroups(subset='train',shuffle = True)
newsgroups_test = fetch_20newsgroups(subset='test',shuffle = True)

In [4]:
# getting the category names (target_names)
list(newsgroups_train.target_names)

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

The real data lies in the `filenames` and `target` attributes. The target attribute is the integer index of the category:

In [14]:
print("Shape of filename: ", newsgroups_train.filenames.shape)
print("Shape of target: ", newsgroups_train.target.shape)
newsgroups_train.target[:10]

Shape of filename:  (11314,)
Shape of target:  (11314,)


array([ 7,  4,  4,  1, 14, 16, 13,  3,  2,  4])
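Since `target` stores integer indices, each value can be mapped back to its newsgroup name by indexing into `target_names`. The sketch below hard-codes the category list and the first ten targets from the outputs above so that it runs without downloading the dataset. With the real objects, `newsgroups_train.target_names[i]` should serve the same purpose.

```python
# Category names and first ten targets copied from the outputs above,
# hard-coded here so the example runs without downloading the dataset.
target_names = [
    'alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc',
    'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x',
    'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball',
    'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med',
    'sci.space', 'soc.religion.christian', 'talk.politics.guns',
    'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc',
]
first_targets = [7, 4, 4, 1, 14, 16, 13, 3, 2, 4]

# Look up the newsgroup name for each integer label.
first_labels = [target_names[i] for i in first_targets]

for index, label in zip(first_targets, first_labels):
    print(index, label)
```

The first label resolves to `rec.autos`, which agrees with the first document shown below (a question about a car).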

In [15]:
# fetching the first two documents in the data
newsgroups_train.data[:2]

["From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n",
 "From: guykuo@carson.u.washington.edu (Guy Kuo)\nSubject: SI Clock Poll - Final Call\nSummary: Final call for SI clock reports\nKeywords: SI,acceleration,clock,upgrade\nArticle-I.D.: shelley.1qvfo9INNc3s\nOrganization: University of Washington\nLines: 

## Importing Libraries


* **Gensim library** 

   Gensim is designed to handle large text collections using data streaming and incremental online algorithms, which differentiates it from most other machine learning software packages that target only in-memory processing.

* **Preprocessing library**
   
   `simple_preprocess` is imported from the `gensim.utils` package. It converts a document into a list of tokens by lowercasing, tokenizing and (optionally) de-accenting the text. The output is a list of final tokens (unicode strings) that are not processed any further.

* **Preprocessing - Parsing library**

  Parsing is the process of analyzing a string of words according to the rules of a grammar, breaking it into its component parts to determine the syntactic structure of an expression. The `gensim.parsing.preprocessing` module contains methods for parsing and preprocessing strings. The `STOPWORDS` set is imported from it.

In [16]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

The `nltk` (Natural Language Toolkit) library is imported and `wordnet` is downloaded. WordNet is a lexical database for the English language and is part of the NLTK corpus collection.

In [17]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

* **Lemmatization and Stemming**

    * **Lemmatization** is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. It maps each inflected form to its dictionary form (lemma).

    * **Stemming** is the process of reducing inflected or derived words to a common root/base form by stripping suffixes. The resulting stem is not always a valid word (for example, `flies` becomes `fli`).

    Two classes, `WordNetLemmatizer` and `SnowballStemmer`, are imported:

    1. `WordNetLemmatizer` uses WordNet's built-in morphy function and returns the input word unchanged if it cannot be found in WordNet.
    2. `SnowballStemmer` supports several languages besides English.


* **Porter Stemmer** is based on the idea that suffixes in the English language are made up of combinations of smaller and simpler suffixes.

    Every public name (`*`) is imported from `nltk.stem.porter`.


In [18]:
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np # importing numpy library
np.random.seed(400) # setting random seed

The `omw-1.4` package is also downloaded from `nltk`. It contains the **Open Multilingual Wordnet (OMW)**, which links WordNet to wordnets in other languages.

In [19]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In order to lemmatize, you need to create an instance of the `WordNetLemmatizer()` and call the `lemmatize()` function on a single word. 

Here the word `went` is lemmatized with `pos = 'v'`, which tells the lemmatizer to treat it as a verb.

In [20]:
print(WordNetLemmatizer().lemmatize('went', pos = 'v'))

go


In [21]:
import pandas as pd # pandas library
stemmer = SnowballStemmer("english") # creating a Snowball stemmer for English

In [22]:
# stemming a list of sample words
original_words = ['caresses', 'flies', 'dies', 'mules', 'denied','died', 'agreed', 'owned',
           'humbled', 'sized','meeting', 'stating', 'siezing', 'itemization','sensational',
           'traditional', 'reference', 'colonizer','plotted']
singles = [stemmer.stem(plural) for plural in original_words]

In [23]:
pd.DataFrame(data={'original word':original_words, 'stemmed':singles })

| | original word | stemmed |
|---|---|---|
| 0 | caresses | caress |
| 1 | flies | fli |
| 2 | dies | die |
| 3 | mules | mule |
| 4 | denied | deni |
| 5 | died | die |
| 6 | agreed | agre |
| 7 | owned | own |
| 8 | humbled | humbl |
| 9 | sized | size |
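To make the suffix-stripping idea concrete, here is a minimal pure-Python sketch of just one rule of the Snowball English algorithm (the one that handles `-sses`, `-ies`/`-ied` and a trailing `-s`). The function name `strip_plural_suffix` is made up for illustration, and the real stemmer applies many further steps, so this is not a replacement for `SnowballStemmer`. On its own it only accounts for the first few rows of the table.

```python
VOWELS = set("aeiouy")

def strip_plural_suffix(word):
    """Apply only the first suffix rule of the Snowball English stemmer."""
    if word.endswith("sses"):
        return word[:-2]                      # caresses -> caress
    if word.endswith(("ies", "ied")):
        stem = word[:-3]
        # "i" if more than one letter precedes the suffix, otherwise "ie"
        return stem + ("i" if len(stem) > 1 else "ie")
    if word.endswith(("ss", "us")):
        return word                           # left unchanged by this rule
    if word.endswith("s"):
        # drop the "s" only if a vowel occurs before the letter preceding it
        if any(ch in VOWELS for ch in word[:-2]):
            return word[:-1]
    return word

examples = ['caresses', 'flies', 'dies', 'mules', 'denied', 'died']
stems = [strip_plural_suffix(w) for w in examples]
print(dict(zip(examples, stems)))
```

Words such as `agreed` or `humbled` are untouched by this rule. Their stems (`agre`, `humbl`) come from later steps of the full algorithm.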


Helper function that lemmatizes a word as a verb with WordNet and then stems the result:

In [24]:
def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

Helper function that tokenizes a text, removes stopwords and short tokens, and lemmatizes and stems the remaining tokens:



In [25]:
# Tokenize and lemmatize
def preprocess(text):
    result=[]
    for token in gensim.utils.simple_preprocess(text) :
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))

    return result

In [26]:
# index of a document in the dataset
document_num = 50
# a sample sentence to demonstrate the preprocessing
doc_sample = 'This disk has failed many times. I would like to get it replaced.'

In [27]:
print("Original document: ")

Original document: 


Splitting the sample sentence into words on spaces:

In [28]:
words = []
for word in doc_sample.split(' '):
  words.append(word)

print(words)  

['This', 'disk', 'has', 'failed', 'many', 'times.', 'I', 'would', 'like', 'to', 'get', 'it', 'replaced.']


Printing the tokenized and lemmatized sample sentence:


In [29]:
print("\n\nTokenized and lemmatized document: ")
print(preprocess(doc_sample))



Tokenized and lemmatized document: 
['disk', 'fail', 'time', 'like', 'replac']
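To see why only five tokens survive, the filtering stage can be traced without any libraries. The sketch below is a rough stand-in, not gensim's implementation: `rough_tokenize` approximates `simple_preprocess` with an ASCII-only regular expression and its default length limits, and `TOY_STOPWORDS` is a tiny hypothetical subset standing in for gensim's much larger `STOPWORDS` set.

```python
import re

# Tiny stand-in for gensim's STOPWORDS (assumption: the real set is far larger).
TOY_STOPWORDS = {"this", "has", "many", "would", "to", "get", "it"}

def rough_tokenize(text, min_len=2, max_len=15):
    """Lowercase and keep alphabetic runs, roughly like simple_preprocess."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

doc_sample = 'This disk has failed many times. I would like to get it replaced.'

tokens = rough_tokenize(doc_sample)
# Same condition as in preprocess(): not a stopword and longer than 3 characters.
kept = [t for t in tokens if t not in TOY_STOPWORDS and len(t) > 3]

print(tokens)
print(kept)
```

The tokens that remain are `disk`, `failed`, `times`, `like` and `replaced`. Lemmatizing and stemming them then yields the list printed above.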


Preprocessing every document in the 20 newsgroups training set with the `preprocess` function defined above, which relies on gensim's built-in `simple_preprocess` for lowercasing and tokenization:

In [30]:
processed_docs = []
 
for doc in newsgroups_train.data:
    processed_docs.append(preprocess(doc))
 
print(processed_docs[:2])


[['lerxst', 'thing', 'subject', 'nntp', 'post', 'host', 'organ', 'univers', 'maryland', 'colleg', 'park', 'line', 'wonder', 'enlighten', 'door', 'sport', 'look', 'late', 'earli', 'call', 'bricklin', 'door', 'small', 'addit', 'bumper', 'separ', 'rest', 'bodi', 'know', 'tellm', 'model', 'engin', 'spec', 'year', 'product', 'histori', 'info', 'funki', 'look', 'mail', 'thank', 'bring', 'neighborhood', 'lerxst'], ['guykuo', 'carson', 'washington', 'subject', 'clock', 'poll', 'final', 'summari', 'final', 'clock', 'report', 'keyword', 'acceler', 'clock', 'upgrad', 'articl', 'shelley', 'qvfo', 'innc', 'organ', 'univers', 'washington', 'line', 'nntp', 'post', 'host', 'carson', 'washington', 'fair', 'number', 'brave', 'soul', 'upgrad', 'clock', 'oscil', 'share', 'experi', 'poll', 'send', 'brief', 'messag', 'detail', 'experi', 'procedur', 'speed', 'attain', 'rat', 'speed', 'card', 'adapt', 'heat', 'sink', 'hour', 'usag', 'floppi', 'disk', 'function', 'floppi', 'especi', 'request', 'summar', 'day',

`gensim.corpora.Dictionary` constructs word<->id mappings. This module implements the concept of a Dictionary: a mapping between words and their integer ids.

In [31]:
dictionary = gensim.corpora.Dictionary(processed_docs)

In [32]:
# printing the first 101 (id, word) pairs of the dictionary
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 100:
        break

0 addit
1 bodi
2 bricklin
3 bring
4 bumper
5 call
6 colleg
7 door
8 earli
9 engin
10 enlighten
11 funki
12 histori
13 host
14 info
15 know
16 late
17 lerxst
18 line
19 look
20 mail
21 maryland
22 model
23 neighborhood
24 nntp
25 organ
26 park
27 post
28 product
29 rest
30 separ
31 small
32 spec
33 sport
34 subject
35 tellm
36 thank
37 thing
38 univers
39 wonder
40 year
41 acceler
42 adapt
43 answer
44 articl
45 attain
46 base
47 brave
48 brief
49 card
50 carson
51 clock
52 day
53 detail
54 disk
55 especi
56 experi
57 fair
58 final
59 floppi
60 function
61 guykuo
62 haven
63 heat
64 hour
65 innc
66 keyword
67 knowledg
68 messag
69 network
70 number
71 oscil
72 poll
73 procedur
74 qvfo
75 rat
76 report
77 request
78 send
79 share
80 shelley
81 sink
82 soul
83 speed
84 summar
85 summari
86 upgrad
87 usag
88 washington
89 access
90 activ
91 actual
92 advanc
93 anybodi
94 anymor
95 appear
96 better
97 breifli
98 bunch
99 convict
100 corner
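The listing shows a pattern: ids are handed out document by document, and within each document the new words receive ids in alphabetical order. A minimal pure-Python sketch that mimics this observed behaviour follows. `build_token2id` is a hypothetical helper written for illustration and is not how `gensim.corpora.Dictionary` is implemented internally.

```python
def build_token2id(docs):
    """Assign consecutive integer ids to tokens, document by document."""
    token2id = {}
    for doc in docs:
        # New tokens of each document get ids in alphabetical order,
        # mirroring the pattern in the listing above.
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

toy_docs = [
    ['door', 'bumper', 'door', 'call'],
    ['clock', 'call', 'acceler'],
]
token2id = build_token2id(toy_docs)
print(token2id)
```

Here `call` keeps the id it received in the first document, and only `acceler` and `clock` get new ids from the second one.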


`dictionary.filter_extremes` filters out tokens that appear in
1. fewer than `no_below` documents (absolute number) or
2. more than `no_above` documents (fraction of total corpus size, not absolute number).
3. After (1) and (2), it keeps only the first `keep_n` most frequent tokens (or keeps all if `None`).

After the pruning, the resulting gaps in word ids are closed.



In [33]:
dictionary.filter_extremes(no_below = 15, no_above = 0.1, keep_n = 10000)
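The three pruning rules can be sketched in plain Python on a toy corpus. This is a simplified illustration under stated assumptions, not gensim's code: the helper name `filter_by_document_frequency` is made up, the fraction is compared without rounding, and ties are broken alphabetically, which gensim does not promise.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below, no_above, keep_n=None):
    """Return the tokens that survive the three pruning rules."""
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)           # fraction -> absolute count

    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Most frequent first; ties broken alphabetically for a stable result.
    kept.sort(key=lambda t: (-doc_freq[t], t))
    return kept if keep_n is None else kept[:keep_n]

toy_docs = [
    ['line', 'space', 'orbit'],
    ['line', 'space', 'nasa'],
    ['line', 'hockey', 'team'],
    ['line', 'hockey', 'bricklin'],
]
survivors = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.5)
print(survivors)
```

In this toy corpus `line` is dropped because it occurs in every document, while `orbit`, `nasa`, `team` and `bricklin` are dropped because they occur in only one. With the notebook's settings, the same logic removes words found in fewer than 15 documents or in more than 10% of them.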

`dictionary.doc2bow` converts a document (a list of words) into the bag-of-words format: a list of (token_id, token_count) 2-tuples. Each word is assumed to be a **tokenized and normalized** string (either unicode or utf8-encoded). No further preprocessing is done on the words in the document, so tokenization, stemming etc. must be applied before calling this method.

In Gensim, the corpus contains the word id and its frequency in every document. We can create a `BoW corpus` from a simple list of documents and from text files.

In [37]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
 
document_num = 20 # index of an arbitrary document to inspect
bow_doc_x = bow_corpus[document_num] # (word id, frequency) pairs for that document
 

In [38]:
bow_doc_x

[(18, 1),
 (166, 1),
 (336, 1),
 (350, 1),
 (391, 1),
 (437, 1),
 (453, 1),
 (476, 1),
 (480, 1),
 (482, 3),
 (520, 2),
 (721, 3),
 (732, 1),
 (803, 1),
 (859, 1),
 (917, 1),
 (990, 1),
 (991, 1),
 (992, 1),
 (993, 1),
 (994, 2),
 (995, 1),
 (996, 1),
 (997, 3),
 (998, 1),
 (999, 2),
 (1000, 2),
 (1001, 1),
 (1002, 1),
 (1003, 1),
 (1004, 1),
 (1005, 1),
 (1006, 1),
 (1007, 1),
 (1008, 1),
 (1009, 1)]
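The bag-of-words conversion itself is simple enough to sketch with the standard library. The helper `to_bag_of_words` and the three-word vocabulary are hypothetical and only illustrate the output format. Like `doc2bow` with its default settings, the sketch ignores tokens that are missing from the vocabulary.

```python
from collections import Counter

def to_bag_of_words(tokens, token2id):
    """Count known tokens and return sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Hypothetical toy vocabulary; tokens missing from it are silently ignored.
token2id = {'clock': 0, 'poll': 1, 'upgrad': 2}
tokens = ['clock', 'poll', 'final', 'clock', 'upgrad', 'clock']

bow = to_bag_of_words(tokens, token2id)
print(bow)  # [(0, 3), (1, 1), (2, 1)]
```

Word order is lost in this representation. Only the ids and their counts remain, which is all that LDA needs as input.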

In [39]:
# printing each word id in the document with its word and the number of times it appears
for i in range(len(bow_doc_x)):
    print("Word {} (\"{}\") appears {} time.".format(bow_doc_x[i][0], 
                                                     dictionary[bow_doc_x[i][0]], 
                                                     bow_doc_x[i][1]))

Word 18 ("rest") appears 1 time.
Word 166 ("clear") appears 1 time.
Word 336 ("refer") appears 1 time.
Word 350 ("true") appears 1 time.
Word 391 ("technolog") appears 1 time.
Word 437 ("christian") appears 1 time.
Word 453 ("exampl") appears 1 time.
Word 476 ("jew") appears 1 time.
Word 480 ("lead") appears 1 time.
Word 482 ("littl") appears 3 time.
Word 520 ("wors") appears 2 time.
Word 721 ("keith") appears 3 time.
Word 732 ("punish") appears 1 time.
Word 803 ("california") appears 1 time.
Word 859 ("institut") appears 1 time.
Word 917 ("similar") appears 1 time.
Word 990 ("allan") appears 1 time.
Word 991 ("anti") appears 1 time.
Word 992 ("arriv") appears 1 time.
Word 993 ("austria") appears 1 time.
Word 994 ("caltech") appears 2 time.
Word 995 ("distinguish") appears 1 time.
Word 996 ("german") appears 1 time.
Word 997 ("germani") appears 3 time.
Word 998 ("hitler") appears 1 time.
Word 999 ("livesey") appears 2 time.
Word 1000 ("motto") appears 2 time.
Word 1001 ("order") appear

`gensim.models.ldamulticore` is an online **Latent Dirichlet Allocation (LDA)** implementation in Python that uses all CPU cores to parallelize and speed up model training. The parallelization is done with multiprocessing.

In [35]:
lda_model =  gensim.models.LdaMulticore(bow_corpus, 
                                   num_topics = 8,
                                   id2word = dictionary,                                   
                                   passes = 10,
                                   workers = 2)

## Topic Modeling

Topic modeling is a technique for extracting the hidden topics from large volumes of text. Latent Dirichlet Allocation (LDA) is a popular algorithm for topic modeling with an excellent implementation in Python's Gensim package. The challenge, however, is **how to extract good-quality topics that are clear, segregated and meaningful**. This depends heavily on the quality of text preprocessing and the strategy for finding the optimal number of topics.

The keywords for each topic and the weight (importance) of each keyword can be seen using `lda_model.print_topics()`.
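For reference, the standard generative story behind LDA can be written compactly. The symbols are introduced here for illustration: $K$ is the number of topics (8 in the model above), $D$ the number of documents, and $\alpha$ and $\eta$ are Dirichlet priors named after the `alpha` and `eta` parameters of Gensim's LDA classes. This is the textbook formulation, not a description of how Gensim trains the model.

```latex
\begin{aligned}
\phi_k   &\sim \mathrm{Dirichlet}(\eta),   & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,n}  &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n}  &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{aligned}
```

Each topic $\phi_k$ is a distribution over the vocabulary and each document has its own topic mixture $\theta_d$. For every word position $n$ in document $d$, a topic $z_{d,n}$ is drawn first and the word $w_{d,n}$ is then drawn from that topic.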

In [36]:
for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic ))
    print("\n")

Topic: 0 
Words: 0.012*"armenian" + 0.006*"turkish" + 0.005*"bike" + 0.005*"leav" + 0.004*"road" + 0.004*"home" + 0.004*"turk" + 0.003*"greek" + 0.003*"game" + 0.003*"armenia"


Topic: 1 
Words: 0.021*"drive" + 0.007*"control" + 0.007*"disk" + 0.006*"hard" + 0.004*"caus" + 0.004*"pitt" + 0.004*"effect" + 0.003*"gordon" + 0.003*"firearm" + 0.003*"price"


Topic: 2 
Words: 0.008*"wire" + 0.008*"scsi" + 0.008*"power" + 0.005*"data" + 0.004*"design" + 0.004*"grind" + 0.004*"circuit" + 0.004*"chip" + 0.004*"engin" + 0.004*"connect"


Topic: 3 
Words: 0.012*"space" + 0.010*"nasa" + 0.006*"presid" + 0.005*"nation" + 0.005*"research" + 0.005*"program" + 0.004*"center" + 0.004*"orbit" + 0.004*"health" + 0.004*"group"


Topic: 4 
Words: 0.010*"govern" + 0.009*"encrypt" + 0.007*"secur" + 0.007*"israel" + 0.006*"chip" + 0.006*"public" + 0.006*"clipper" + 0.006*"isra" + 0.004*"protect" + 0.004*"key"


Topic: 5 
Words: 0.018*"game" + 0.016*"team" + 0.012*"play" + 0.010*"player" + 0.009*"hockey" + 0.

*Summary:*

**Topic: 0** is represented as,

Words: 0.012*"armenian" + 0.006*"turkish" + 0.005*"bike" + 0.005*"leav" + 0.004*"road" + 0.004*"home" + 0.004*"turk" + 0.003*"greek" + 0.003*"game" + 0.003*"armenia".

This means the top 10 keywords that contribute to this topic are 'armenian', 'turkish', 'bike' and so on, and the weight of 'armenian' in topic 0 is 0.012.

**The weights reflect how important a keyword is to that topic.**
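If the keywords and weights are needed as data rather than as a printed string, the topic representation can be parsed with a regular expression. The sketch below works on the Topic 0 string copied from the output above. `parse_topic` is a hypothetical helper, not part of Gensim.

```python
import re

# Topic 0 exactly as printed by print_topics() above.
topic_0 = ('0.012*"armenian" + 0.006*"turkish" + 0.005*"bike" + 0.005*"leav" + '
           '0.004*"road" + 0.004*"home" + 0.004*"turk" + 0.003*"greek" + '
           '0.003*"game" + 0.003*"armenia"')

def parse_topic(topic_string):
    """Turn 'weight*"word" + ...' into a list of (word, weight) pairs."""
    pairs = re.findall(r'([\d.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

keywords = parse_topic(topic_0)
print(keywords[:3])  # [('armenian', 0.012), ('turkish', 0.006), ('bike', 0.005)]
```

With a trained model at hand, `lda_model.show_topic(0)` should give similar (word, weight) pairs directly, so parsing is mainly useful when only the printed output is available.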
