## **Importing datasets**

Load the filenames and data from the 20 newsgroups dataset (classification).

1) Classes - 20 <br>
2) Samples total - 18846 <br>
3) Dimensionality - 1 <br>
4) Features - text <br>

In [38]:
# importing libraries
import numpy as np
import pandas as pd

In [39]:
# importing dataset
from sklearn.datasets import fetch_20newsgroups
train_data = fetch_20newsgroups(subset="train", shuffle=True)
test_data = fetch_20newsgroups(subset="test", shuffle=True)

In [40]:
# look at some sample news
train_data.data[:5]

["From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n",
 "From: guykuo@carson.u.washington.edu (Guy Kuo)\nSubject: SI Clock Poll - Final Call\nSummary: Final call for SI clock reports\nKeywords: SI,acceleration,clock,upgrade\nArticle-I.D.: shelley.1qvfo9INNc3s\nOrganization: University of Washington\nLines: 

In [41]:
print(list(train_data))
print(list(train_data.target_names))

['data', 'filenames', 'target_names', 'target', 'DESCR']
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


## **Preprocessing the raw text**

In [42]:
# import libraries
import nltk
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import PorterStemmer  # explicit import instead of a wildcard
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [43]:
stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()  # create once and reuse for every token

def lemmatize_stemming(text):
  # lemmatize the token as a verb, then reduce it to its stem
  return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))

def preprocess(text):
  # tokenize and lowercase, then drop stopwords and tokens of 3 characters or fewer
  result = []
  for token in simple_preprocess(text):
    if token not in STOPWORDS and len(token) > 3:
      result.append(lemmatize_stemming(token))
  return result

In [44]:
doc_sample = "This disk has failed many times. I would like to get it replaced"
print("Original")
print(doc_sample.split(' '))
print("\n\nTokenized and lemmatized document: ")
print(preprocess(doc_sample))

Original
['This', 'disk', 'has', 'failed', 'many', 'times.', 'I', 'would', 'like', 'to', 'get', 'it', 'replaced']


Tokenized and lemmatized document: 
['disk', 'fail', 'time', 'like', 'replac']
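
The filtering half of `preprocess` can be illustrated without gensim or NLTK. This is only a rough sketch: `TOY_STOPWORDS` and `toy_filter` are hypothetical names, the stopword set is a tiny stand-in for gensim's much larger `STOPWORDS`, and the regex tokenizer only approximates `simple_preprocess`. No lemmatizing or stemming is done here, so the surviving tokens keep their full form.

```python
import re

# Hypothetical, tiny stopword list for illustration only;
# gensim's STOPWORDS set is much larger.
TOY_STOPWORDS = {"this", "has", "many", "would", "to", "it", "get", "i"}

def toy_filter(text, min_len=4):
    """Lowercase, split on non-letters, drop stopwords and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in TOY_STOPWORDS and len(t) >= min_len]

kept = toy_filter("This disk has failed many times. I would like to get it replaced")
print(kept)  # ['disk', 'failed', 'times', 'like', 'replaced']
```

Comparing this with the notebook output above shows what the lemmatize-and-stem step adds: `failed` becomes `fail`, `times` becomes `time`, and `replaced` becomes `replac`.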


In [45]:
processed_docs = [preprocess(doc) for doc in train_data.data]

# print processed_docs
print(processed_docs[:5])

[['lerxst', 'thing', 'subject', 'nntp', 'post', 'host', 'organ', 'univers', 'maryland', 'colleg', 'park', 'line', 'wonder', 'enlighten', 'door', 'sport', 'look', 'late', 'earli', 'call', 'bricklin', 'door', 'small', 'addit', 'bumper', 'separ', 'rest', 'bodi', 'know', 'tellm', 'model', 'engin', 'spec', 'year', 'product', 'histori', 'info', 'funki', 'look', 'mail', 'thank', 'bring', 'neighborhood', 'lerxst'], ['guykuo', 'carson', 'washington', 'subject', 'clock', 'poll', 'final', 'summari', 'final', 'clock', 'report', 'keyword', 'acceler', 'clock', 'upgrad', 'articl', 'shelley', 'qvfo', 'innc', 'organ', 'univers', 'washington', 'line', 'nntp', 'post', 'host', 'carson', 'washington', 'fair', 'number', 'brave', 'soul', 'upgrad', 'clock', 'oscil', 'share', 'experi', 'poll', 'send', 'brief', 'messag', 'detail', 'experi', 'procedur', 'speed', 'attain', 'rat', 'speed', 'card', 'adapt', 'heat', 'sink', 'hour', 'usag', 'floppi', 'disk', 'function', 'floppi', 'especi', 'request', 'summar', 'day',

## **Bag of Words on dataset**

In [46]:
dictionary = gensim.corpora.Dictionary(processed_docs)

count = 0
for _key, _value in dictionary.items():
  print(_key, _value)
  count += 1
  if count > 10:
    break

0 addit
1 bodi
2 bricklin
3 bring
4 bumper
5 call
6 colleg
7 door
8 earli
9 engin
10 enlighten


In [47]:
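# remove very rare and very common words:
# drop tokens that appear in fewer than 10 documents (no_below=10)
# or in more than 10% of documents (no_above=0.1), then keep at most 100000 tokens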

dictionary.filter_extremes(no_below=10, no_above=0.1, keep_n=100000)
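
The idea behind `filter_extremes` is a filter on document frequency, that is, the number of documents a token appears in rather than its total count. The sketch below is a simplified pure-Python illustration, not gensim's implementation: `toy_filter_extremes` and `toy_docs` are hypothetical, and the `keep_n` cap is left out.

```python
from collections import Counter

def toy_filter_extremes(docs, no_below, no_above):
    """Return the tokens whose document frequency lies within the bounds."""
    # Count each token once per document (document frequency, not term frequency).
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)  # no_above is a fraction of the corpus size
    return sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)

toy_docs = [
    ["disk", "fail", "disk"],
    ["disk", "driver"],
    ["disk", "fail", "window"],
    ["disk", "window"],
]
kept_tokens = toy_filter_extremes(toy_docs, no_below=2, no_above=0.75)
print(kept_tokens)  # ['fail', 'window']
```

Here `disk` is dropped because it appears in all 4 documents (more than 75%), and `driver` is dropped because it appears in only 1 document (fewer than 2).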

In [48]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
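
Each entry of `bow_corpus` is a list of `(token_id, count)` pairs. As a minimal sketch of that representation, assuming a hypothetical `toy_token2id` mapping in place of the real dictionary, the conversion can be written with a `Counter`. Tokens missing from the mapping are skipped, which is also why words removed by `filter_extremes` do not show up in the bag of words.

```python
from collections import Counter

def toy_doc2bow(tokens, token2id):
    """Count known tokens and return sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Hypothetical vocabulary for illustration only.
toy_token2id = {"disk": 0, "fail": 1, "window": 2}
toy_bow = toy_doc2bow(["disk", "fail", "disk", "mouse"], toy_token2id)
print(toy_bow)  # [(0, 2), (1, 1)]
```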

In [49]:
document_num = 20
bow_doc_x = bow_corpus[document_num]

for token_id, token_count in bow_doc_x:
  print("Word {} (\"{}\") appears {} time.".format(token_id,
                                                   dictionary[token_id],
                                                   token_count))

Word 18 ("rest") appears 1 time.
Word 169 ("clear") appears 1 time.
Word 346 ("refer") appears 1 time.
Word 360 ("true") appears 1 time.
Word 403 ("technolog") appears 1 time.
Word 452 ("christian") appears 1 time.
Word 468 ("exampl") appears 1 time.
Word 491 ("jew") appears 1 time.
Word 495 ("lead") appears 1 time.
Word 497 ("littl") appears 3 time.
Word 540 ("wors") appears 2 time.
Word 748 ("keith") appears 3 time.
Word 760 ("punish") appears 1 time.
Word 835 ("california") appears 1 time.
Word 894 ("institut") appears 1 time.
Word 954 ("similar") appears 1 time.
Word 1033 ("allan") appears 1 time.
Word 1034 ("anti") appears 1 time.
Word 1035 ("arriv") appears 1 time.
Word 1036 ("austria") appears 1 time.
Word 1037 ("caltech") appears 2 time.
Word 1038 ("distinguish") appears 1 time.
Word 1039 ("german") appears 1 time.
Word 1040 ("germani") appears 3 time.
Word 1041 ("hitler") appears 1 time.
Word 1042 ("livesey") appears 2 time.
Word 1043 ("motto") appears 2 time.
Word 1044 ("orde

## **Running LDA using Bag of Words**

In [35]:
lda_model = gensim.models.LdaMulticore(bow_corpus,
                                       num_topics=8,
                                       id2word=dictionary,
                                       passes=10,
                                       workers=2)

In [36]:
for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic ))
    print("\n")

Topic: 0 
Words: 0.016*"game" + 0.013*"team" + 0.010*"play" + 0.009*"player" + 0.006*"hockey" + 0.006*"season" + 0.005*"leagu" + 0.004*"score" + 0.004*"pittsburgh" + 0.004*"pitt"


Topic: 1 
Words: 0.013*"space" + 0.010*"nasa" + 0.009*"imag" + 0.005*"graphic" + 0.005*"orbit" + 0.005*"color" + 0.004*"access" + 0.004*"card" + 0.004*"launch" + 0.004*"research"


Topic: 2 
Words: 0.015*"window" + 0.014*"file" + 0.008*"program" + 0.006*"version" + 0.005*"softwar" + 0.004*"avail" + 0.004*"server" + 0.004*"card" + 0.004*"email" + 0.004*"machin"


Topic: 3 
Words: 0.008*"israel" + 0.006*"isra" + 0.004*"arab" + 0.004*"research" + 0.004*"jew" + 0.004*"kill" + 0.004*"human" + 0.003*"food" + 0.003*"islam" + 0.003*"caus"


Topic: 4 
Words: 0.009*"encrypt" + 0.007*"chip" + 0.006*"secur" + 0.006*"clipper" + 0.005*"govern" + 0.005*"public" + 0.005*"wire" + 0.004*"protect" + 0.004*"key" + 0.004*"phone"


Topic: 5 
Words: 0.011*"armenian" + 0.006*"turkish" + 0.006*"bike" + 0.005*"kill" + 0.004*"live" + 

## **Testing model on test data**

In [50]:
test_data.data[:5]

['From: v064mb9k@ubvmsd.cc.buffalo.edu (NEIL B. GANDLER)\nSubject: Need info on 88-89 Bonneville\nOrganization: University at Buffalo\nLines: 10\nNews-Software: VAX/VMS VNEWS 1.41\nNntp-Posting-Host: ubvmsd.cc.buffalo.edu\n\n\n I am a little confused on all of the models of the 88-89 bonnevilles.\nI have heard of the LE SE LSE SSE SSEI. Could someone tell me the\ndifferences are far as features or performance. I am also curious to\nknow what the book value is for prefereably the 89 model. And how much\nless than book value can you usually get them for. In other words how\nmuch are they in demand this time of year. I have heard that the mid-spring\nearly summer is the best time to buy.\n\n\t\t\tNeil Gandler\n',
 'From: Rick Miller <rick@ee.uwm.edu>\nSubject: X-Face?\nOrganization: Just me.\nLines: 17\nDistribution: world\nNNTP-Posting-Host: 129.89.2.33\nSummary: Go ahead... swamp me.  <EEP!>\n\nI\'m not familiar at all with the format of these "X-Face:" thingies, but\nafter seeing them 

In [51]:
print(test_data.data[3])

From: bakken@cs.arizona.edu (Dave Bakken)
Subject: Re: Saudi clergy condemns debut of human rights group!
Keywords: international, non-usa government, government, civil rights, 	social issues, politics
Organization: U of Arizona CS Dept, Tucson
Lines: 101

In article <benali.737307554@alcor> benali@alcor.concordia.ca ( ILYESS B. BDIRA ) writes:
>It looks like Ben Baz's mind and heart are also blind, not only his eyes.
>I used to respect him, today I lost the minimal amount of respect that
>I struggled to keep for him.
>To All Muslim netters: This is the same guy who gave a "Fatwah" that
>Saudi Arabia can be used by the United Ststes to attack Iraq . 

They were attacking the Iraqis to drive them out of Kuwait,
a country whose citizens have close blood and business ties
to Saudi citizens.  And me thinks if the US had not helped out
the Iraqis would have swallowed Saudi Arabia, too (or at 
least the eastern oilfields).  And no Muslim country was doing
much of anything to help liberate Ku

In [37]:
# Data preprocessing step for the unseen document
bow = dictionary.doc2bow(preprocess(test_data.data[3]))

for index, score in sorted(lda_model[bow], key=lambda tup: tup[1], reverse=True):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

# ground-truth label index (17 is 'talk.politics.mideast' in target_names)
print(test_data.target[3])

Score: 0.4684653580188751	 Topic: 0.008*"israel" + 0.006*"isra" + 0.004*"arab" + 0.004*"research" + 0.004*"jew"
Score: 0.2354603111743927	 Topic: 0.011*"armenian" + 0.006*"turkish" + 0.006*"bike" + 0.005*"kill" + 0.004*"live"
Score: 0.16906537115573883	 Topic: 0.020*"drive" + 0.010*"scsi" + 0.008*"presid" + 0.007*"govern" + 0.007*"control"
Score: 0.08499891310930252	 Topic: 0.014*"christian" + 0.009*"jesus" + 0.007*"exist" + 0.006*"bibl" + 0.005*"religion"
Score: 0.040687885135412216	 Topic: 0.009*"encrypt" + 0.007*"chip" + 0.006*"secur" + 0.006*"clipper" + 0.005*"govern"
17
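
To reduce the ranked list above to a single prediction, one option is to take the highest-scoring topic. The sketch below uses hypothetical `(topic_id, probability)` pairs shaped like the output of `lda_model[bow]`; the values are rounded for illustration and the ids of some topics are assumed. Because LDA is unsupervised, a topic id is not a class label, so it cannot be compared directly with `test_data.target`; the match has to be judged from the topic's top words.

```python
# Hypothetical (topic_id, probability) pairs, shaped like lda_model[bow] output.
toy_topic_scores = [(4, 0.04), (3, 0.47), (6, 0.17), (5, 0.24), (7, 0.08)]

# Rank topics from most to least probable and keep the first one.
ranked = sorted(toy_topic_scores, key=lambda pair: pair[1], reverse=True)
dominant_topic, dominant_score = ranked[0]
print(dominant_topic, dominant_score)  # 3 0.47
```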
