## Topic Modeling using LDA

#### Using the 20 Newsgroups dataset, whose posts are already grouped into 20 predefined news categories

In [1]:
import warnings

warnings.filterwarnings('ignore')

In [2]:
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train', shuffle = True)
newsgroups_test = fetch_20newsgroups(subset='test', shuffle = True)

#### Check unique categories in the dataset

In [3]:
set(newsgroups_train.target_names)

{'alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc'}

#### Check the first 5 documents

In [4]:
newsgroups_train.data[:5]

["From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n",
 "From: guykuo@carson.u.washington.edu (Guy Kuo)\nSubject: SI Clock Poll - Final Call\nSummary: Final call for SI clock reports\nKeywords: SI,acceleration,clock,upgrade\nArticle-I.D.: shelley.1qvfo9INNc3s\nOrganization: University of Washington\nLines: 

### Data Pre-processing

In [5]:
#!pip install gensim

In [6]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
import numpy as np

# WordNetLemmatizer needs the WordNet corpus; download it once if it is missing
nltk.download('wordnet', quiet=True)
np.random.seed(400)

### Functions to perform pre-processing

In [7]:
# Build the stemmer and lemmatizer once instead of on every call
stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

def lemmatize_stemming(text):
    # Lemmatize the token as a verb, then reduce it to its stem
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))

# Tokenize, drop stopwords and tokens of 3 characters or fewer, then lemmatize and stem
def preprocess(text):
    result = []
    for token in simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

#### Pre-process data

In [8]:
processed_docs = []

for doc in newsgroups_train.data:
    processed_docs.append(preprocess(doc))

#### Creating Bag of Words from the processed data

In [9]:
dictionary = gensim.corpora.Dictionary(processed_docs)

#### Filter extreme cases: drop words that appear in fewer than 10 documents or in more than 20% of the documents, then keep at most the 100,000 most frequent words

In [10]:
dictionary.filter_extremes(no_below=10,no_above=0.2,keep_n= 100000)
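
To make the two thresholds concrete, here is a minimal pure-Python sketch of the filtering rule on a toy corpus. It is not gensim's implementation: `filter_tokens` and the toy documents are hypothetical, and the sketch ignores the `keep_n` cap.

```python
from collections import Counter

def filter_tokens(docs, no_below, no_above):
    """Return tokens whose document frequency passes both thresholds.

    Hypothetical helper: keep a token if it appears in at least `no_below`
    documents and in at most a fraction `no_above` of all documents.
    """
    # Count each token once per document (document frequency, not raw count).
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

toy_docs = [
    ["car", "engin", "door"],
    ["car", "game", "team"],
    ["car", "game", "player"],
    ["car", "drive", "disk"],
]

# "car" is in every document (too common); most other tokens appear only once.
kept = filter_tokens(toy_docs, no_below=2, no_above=0.5)
print(kept)
```

Note that both thresholds count documents, not total occurrences, so a word repeated many times in a single post is still treated as rare.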

In [11]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
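
As a rough illustration of what each entry of `bow_corpus` looks like, the sketch below builds `(token_id, count)` pairs by hand. The helpers `build_token_ids` and `to_bow` are hypothetical stand-ins, and the ids assigned here will not match the ones gensim assigns.

```python
from collections import Counter

def build_token_ids(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens outside the vocabulary."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_docs = [["game", "team", "game"], ["drive", "disk"]]
token2id = build_token_ids(toy_docs)

# "unseen" is not in the vocabulary, so it is dropped from the vector.
bow = to_bow(["game", "game", "disk", "unseen"], token2id)
print(bow)
```

Tokens removed by the filtering step behave like `"unseen"` here: they are absent from the vocabulary and contribute nothing to the vector.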

### Running LDA on Bag of Words

In [12]:
# Train an LDA model with 8 topics on the bag-of-words corpus
lda_model =  gensim.models.LdaMulticore(bow_corpus, 
                                   num_topics = 8, 
                                   id2word = dictionary,                                    
                                   passes = 10,
                                   workers = 2)

#### Check the words occurring in each topic

In [13]:
for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic ))
    print("\n")

Topic: 0 
Words: 0.015*"file" + 0.015*"window" + 0.010*"program" + 0.008*"imag" + 0.008*"mail" + 0.006*"avail" + 0.006*"version" + 0.006*"graphic" + 0.006*"server" + 0.005*"inform"


Topic: 1 
Words: 0.008*"say" + 0.008*"right" + 0.006*"armenian" + 0.006*"govern" + 0.005*"kill" + 0.005*"go" + 0.005*"israel" + 0.005*"state" + 0.005*"come" + 0.004*"isra"


Topic: 2 
Words: 0.005*"state" + 0.004*"wire" + 0.004*"year" + 0.004*"drug" + 0.003*"problem" + 0.003*"control" + 0.003*"effect" + 0.003*"firearm" + 0.003*"weapon" + 0.003*"caus"


Topic: 3 
Words: 0.012*"game" + 0.010*"team" + 0.010*"year" + 0.008*"play" + 0.007*"player" + 0.005*"nasa" + 0.005*"hockey" + 0.004*"season" + 0.004*"go" + 0.004*"point"


Topic: 4 
Words: 0.011*"christian" + 0.008*"believ" + 0.007*"jesus" + 0.007*"say" + 0.006*"exist" + 0.005*"mean" + 0.005*"thing" + 0.005*"come" + 0.005*"question" + 0.005*"bibl"


Topic: 5 
Words: 0.015*"drive" + 0.009*"card" + 0.007*"scsi" + 0.007*"disk" + 0.007*"chip" + 0.007*"control" +

### Classification of the topics
#### Using the words in each topic and their corresponding weights, what categories were you able to infer?

* 0: Graphics Cards
* 1: Politics
* 2: Gun Violence 
* 3: Sports
* 4: Religion 
* 5: Technology
* 6: Driving
* 7: Encryption

### Testing Model

In [14]:
test_doc = newsgroups_test.data[10]
test_doc

'From: Greg.Reinacker@FtCollins.NCR.COM\nSubject: Windows On-Line Review uploaded\nReply-To: Greg.Reinacker@FtCollinsCO.NCR.COM\nOrganization: NCR Microelectronics, Ft. Collins, CO\nLines: 12\n\nI have uploaded the Windows On-Line Review shareware edition to\nftp.cica.indiana.edu as /pub/pc/win3/uploads/wolrs7.zip.\n\nIt is an on-line magazine which contains reviews of some shareware\nproducts...I grabbed it from the Windows On-Line BBS.\n\n--\n--------------------------------------------------------------------------\nGreg Reinacker                          (303) 223-5100 x9289\nNCR Microelectronic Products Division   VoicePlus 464-9289\n2001 Danfield Court                     Greg.Reinacker@FtCollinsCO.NCR.COM\nFort Collins, CO  80525\n'

In [15]:
## Pre-processing test document
bow_vector = dictionary.doc2bow(preprocess(test_doc))

for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

Score: 0.8724687695503235	 Topic: 0.015*"file" + 0.015*"window" + 0.010*"program" + 0.008*"imag" + 0.008*"mail"
Score: 0.1032753586769104	 Topic: 0.015*"drive" + 0.009*"card" + 0.007*"scsi" + 0.007*"disk" + 0.007*"chip"


In [16]:
newsgroups_test.target[10]

2
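
To compare the prediction with the true label, the sketch below hard-codes values copied from the outputs above instead of calling the model. It assumes the target indices follow the alphabetical order of the category names listed earlier, and that the two scored topics are topics 0 and 5 (inferred by matching their words against the topic printout). The labels are the hand-assigned ones from the classification section.

```python
# Category names in the alphabetical order shown in the notebook output.
target_names = [
    "alt.atheism", "comp.graphics", "comp.os.ms-windows.misc",
    "comp.sys.ibm.pc.hardware", "comp.sys.mac.hardware", "comp.windows.x",
    "misc.forsale", "rec.autos", "rec.motorcycles", "rec.sport.baseball",
    "rec.sport.hockey", "sci.crypt", "sci.electronics", "sci.med",
    "sci.space", "soc.religion.christian", "talk.politics.guns",
    "talk.politics.mideast", "talk.politics.misc", "talk.religion.misc",
]

# (topic_id, score) pairs copied from the test-document output (ids are an assumption).
topic_scores = [(0, 0.8724687695503235), (5, 0.1032753586769104)]

# Hand-assigned topic labels from the classification section.
topic_labels = {
    0: "Graphics Cards", 1: "Politics", 2: "Gun Violence", 3: "Sports",
    4: "Religion", 5: "Technology", 6: "Driving", 7: "Encryption",
}

true_category = target_names[2]  # target value printed for test document 10
best_topic, best_score = max(topic_scores, key=lambda pair: pair[1])
print("True category:", true_category)
print("Dominant topic:", best_topic, topic_labels[best_topic])
```

Under these assumptions the dominant topic and the true newsgroup are both about software on Windows-era PCs, which suggests the model placed this document sensibly even though the hand-assigned label is only approximate.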

## Visualizing LDA model

In [17]:
#!pip install pyLDAvis

In [18]:
# Note: pyLDAvis 3.x renamed this module to pyLDAvis.gensim_models
import pyLDAvis.gensim

In [19]:
lda_display = pyLDAvis.gensim.prepare(lda_model, bow_corpus, dictionary, sort_topics=False)

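Saliency: measures how informative a term is about the topics overall. A term that is frequent and concentrated in a few topics has high saliency, while a term spread evenly across all topics has low saliency.

Relevance: a weighted average of the log probability of the word given the topic and the log of that same probability divided by the overall probability of the word in the corpus (its lift). The weight λ is set with the slider: λ = 1 ranks words by their probability within the topic, and λ = 0 ranks them by lift.

The area of each bubble is proportional to the topic's prevalence in the corpus.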
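
The relevance weighting can be sketched in a few lines of plain Python. The probabilities below are made up for illustration and do not come from the trained model; `relevance` and `rank_terms` are hypothetical helpers, not part of pyLDAvis.

```python
import math

def relevance(p_word_given_topic, p_word, lam):
    """Weighted average of log probability and log lift, with weight lam in [0, 1]."""
    lift = p_word_given_topic / p_word
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(lift)

# Hypothetical (p(word | topic), p(word)) pairs:
# "scsi" is rare overall but concentrated in this topic; "year" is common everywhere.
terms = {"scsi": (0.007, 0.001), "year": (0.010, 0.009)}

def rank_terms(lam):
    """Return the terms sorted from most to least relevant for a given lam."""
    return sorted(terms, key=lambda t: relevance(*terms[t], lam=lam), reverse=True)

by_probability = rank_terms(1.0)  # pure within-topic probability
by_lift = rank_terms(0.0)         # pure lift
print(by_probability, by_lift)
```

Lowering λ pushes topic-specific words such as "scsi" above generic ones such as "year", which is why moving the slider can make a topic easier to label.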

In [20]:
pyLDAvis.display(lda_display)