# Tutorial
I'm going to explore practical uses of topic models with the Gensim library, such as finding structure in unknown datasets, classifying unlabeled data, and improving the accuracy of supervised learning.

## Introduction
This tutorial introduces several practical ways of using topic models with the **Gensim library**. I mainly focus on two kinds of topic models: *Latent Semantic Indexing (LSI)* and *Latent Dirichlet Allocation (LDA)*. Topic models can serve different purposes in different settings. Throughout this tutorial, you will learn how to use them to find structure in unknown datasets, classify unlabeled data, and improve the accuracy of supervised learning. To make the results easier to inspect, the tutorial also introduces some data visualization methods for reviewing topic models.

We will cover the following topics in this tutorial:
- [Installing the libraries](#Installing-the-libraries)
- [Processing text](#Processing-text)
- [Training LSI model](#Training-LSI-model)
- [Training LDA model]
- [Finding structures of unknown text]
- [Classifying unlabeled data]

## Installing the libraries

Before getting started, you'll need to install the libraries that we will use. You can install Gensim and NLTK using `pip`:

    $ pip3 install --upgrade gensim

    $ pip3 install -U nltk

The examples also use scikit-learn:

    $ pip3 install -U scikit-learn

In [1]:
import gensim
import nltk
import sklearn
import time
import random

## Processing text
Here we use the [20 newsgroups text dataset](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html) from scikit-learn's datasets. We use the same processing method as in homework 3 and additionally remove all stopwords and rare words from the documents.

In [2]:
from sklearn.datasets import fetch_20newsgroups
from text_process import process, tokenize, remove_stopwords
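
The `text_process` module is a local helper and is not shown in this tutorial. As a rough idea of what such helpers might do, here is a minimal pure-Python sketch. The function names mirror the import above, but their bodies, the tiny stopword list, and the `min_doc_freq` threshold are assumptions for illustration, not the actual homework code (which, judging by the processed output further down, also lemmatizes words).

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list; a real pipeline would
# more likely use a full list such as NLTK's English stopwords.
STOPWORDS = {"a", "an", "and", "the", "is", "to", "of", "in", "my", "but", "not"}


def tokenize(text):
    """Lowercase the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(docs, min_doc_freq=2):
    """Drop stopwords and words that appear in fewer than min_doc_freq documents."""
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for doc in docs for word in set(doc))
    return [
        [w for w in doc if w not in STOPWORDS and doc_freq[w] >= min_doc_freq]
        for doc in docs
    ]


raw = [
    "Sugar is bad to consume.",
    "My sister likes sugar, but not my father.",
    "My father drives my sister to practice.",
]
tokens = [tokenize(doc) for doc in raw]
cleaned = remove_stopwords(tokens)
print(cleaned)
```

Filtering by document frequency rather than raw count keeps a word that is repeated many times in a single post from being treated as common.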

The 20 newsgroups dataset comprises around 18,000 newsgroup posts on 20 topics, split into two subsets: one for training and one for testing. For basic topic models, there is no better way to evaluate the topics than to examine the results manually and see whether they make sense, so here I use both the training and testing sets to build the topic model.

In [3]:
newsgroups = fetch_20newsgroups(subset='all',remove=('headers', 'footers', 'quotes'))

In [4]:
start = time.time()
docs_raw = [tokenize(doc) for doc in newsgroups.data]
docs = remove_stopwords(docs_raw)
k = 1000
sample_idxs = random.sample(range(len(docs)), k)  # 1000 samples from docs in order to do quick modeling
docs_sample = []
docs_raw_sample = []
for idx in sample_idxs:
    docs_sample.append(docs[idx]) 
    docs_raw_sample.append(docs_raw[idx])
end = time.time()

print("Processing time: " + str(end - start))

Processing time: 67.17970275878906


In [5]:
# Print out the num of docs and some of the processing results
print("The number of docs is: " + str(len(docs)))
print()
print("1st from samples of processed document: ")
print(' '.join(docs_sample[0]))
print()
print("1st from processed document:")
print(' '.join(docs[0]))

The number of docs is: 18846

1st from samples of processed document: 
nutrasweet synthetic sweetener couple thousand time sweeter sugar people concerned chemical body produce degrades nutrasweet thought form formaldehyde known methanol degredation pathway body us eliminate substance real issue whether level methanol formaldehyde produced high enough cause significant damage toxic living cell say consume phenylalanine nothing worry amino acid everyone us small quantity protein synthesis body people disease known missing enzyme necessary degrade compound eliminate body accumulate body high level toxic growing nerve cell therefore major problem young child around age woman pregnant disorder used leading cause brain damage infant easily detected birth one must simply avoid phenylalanine child pregnant

1st from processed document:
sure bashers pen fan pretty confused lack kind post recent pen massacre devil actually bit puzzled bit relieved however going put end non pittsburghers relief b

In [6]:
# Creating the term dictionary of our corpus, where every unique term is assigned an index.
dictionary_all = gensim.corpora.Dictionary(docs)
dictionary_sample = gensim.corpora.Dictionary(docs_sample)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
doc_term_matrix_all = [dictionary_all.doc2bow(doc) for doc in docs]
doc_term_matrix_sample = [dictionary_sample.doc2bow(doc) for doc in docs_sample]


## Training LSI model

In [7]:
Lsi = gensim.models.lsimodel.LsiModel
lsimodel = Lsi(doc_term_matrix_sample, num_topics=30, id2word = dictionary_sample)

In [8]:
for item in lsimodel.show_topics():
    print(item)

(0, '0.382*"planet" + 0.270*"earth" + 0.225*"spacecraft" + 0.223*"moon" + 0.206*"solar" + 0.184*"system" + 0.169*"surface" + 0.141*"sun" + 0.140*"venus" + 0.137*"atmosphere"')
(1, '-0.597*"adl" + -0.305*"bullock" + -0.196*"gerard" + -0.190*"group" + -0.143*"information" + -0.133*"fbi" + -0.131*"right" + -0.126*"say" + -0.123*"francisco" + -0.122*"san"')
(2, '0.152*"m5" + 0.152*"mv" + 0.110*"mf" + 0.110*"m8" + 0.102*"mp" + 0.102*"md" + 0.102*"mj" + 0.101*"mh" + 0.093*"m4" + 0.093*"mw"')
(3, '-0.501*"kinsey" + -0.265*"sex" + -0.208*"reisman" + 0.174*"adl" + -0.139*"child" + -0.137*"people" + -0.119*"one" + -0.117*"sexual" + -0.111*"would" + -0.096*"boy"')
(4, '0.525*"space" + -0.316*"kinsey" + 0.243*"list" + 0.190*"post" + -0.161*"sex" + 0.159*"nasa" + 0.150*"shuttle" + 0.147*"sci" + -0.131*"reisman" + 0.110*"one"')
(5, '-0.345*"space" + -0.318*"kinsey" + 0.198*"jesus" + 0.178*"one" + -0.157*"sex" + 0.155*"people" + 0.144*"would" + -0.135*"list" + -0.132*"reisman" + 0.126*"day"')
(6, '0.

Processing with the data from 20newsgroups

In [50]:
id2word = gensim.corpora.Dictionary.load_from_text('data/wiki_en_wordids.txt')



In [55]:
print(len(id2word))

100000


In [31]:
from pprint import pprint
pprint(list(newsgroups.target_names))

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']


In [32]:
print(type(newsgroups.data[0]))

<class 'str'>


In [46]:
print(len(newsgroups.data))
print(newsgroups.data[200])

18846

Jesus did and so do I.

Peace be with you,


In [47]:
# `process` comes from the local text_process helper module and must be imported before use.
from text_process import process

doc_clean = [process(doc).split() for doc in newsgroups.data]

In [5]:
# test for simple samples

doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc4 = "Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better."
doc5 = "Health experts say that Sugar is not good for your lifestyle."

doc_complete = [doc1, doc2, doc3, doc4, doc5]

In [6]:
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string

# Requires the NLTK 'stopwords' and 'wordnet' corpora, e.g. nltk.download('stopwords').
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()

def clean(doc):
    # Lowercase the text and drop English stopwords.
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    # Strip punctuation characters.
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    # Lemmatize each remaining word.
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

doc_clean = [clean(doc).split() for doc in doc_complete]

In [37]:
# Importing Gensim
import gensim
from gensim import corpora

# Creating the term dictionary of our corpus, where every unique term is assigned an index.
dictionary = corpora.Dictionary(doc_clean)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]

In [41]:
# Create aliases for the LDA and LSI model classes from the Gensim library

start = time.time()
Lda = gensim.models.ldamodel.LdaModel
Lsi = gensim.models.lsimodel.LsiModel

# Train the LDA and LSI models on the document-term matrix.
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word = dictionary, passes=50)
lsimodel = Lsi(doc_term_matrix, num_topics=3, id2word = dictionary)

end = time.time()

print(str(end - start))

704.1523542404175


In [42]:
print(ldamodel.print_topics(num_topics=3, num_words=10))

[(0, '0.005*"one" + 0.005*"would" + 0.004*"year" + 0.004*"get" + 0.003*"time" + 0.003*"like" + 0.003*"game" + 0.003*"well" + 0.003*"also" + 0.003*"know"'), (1, '0.009*"1" + 0.007*"x" + 0.006*"2" + 0.006*"0" + 0.006*"file" + 0.004*"window" + 0.004*"use" + 0.004*"system" + 0.004*"image" + 0.004*"3"'), (2, '0.007*"would" + 0.007*"one" + 0.007*"people" + 0.006*"god" + 0.004*"think" + 0.004*"say" + 0.004*"it" + 0.004*"know" + 0.004*"maxaxaxaxaxaxaxaxaxaxaxaxaxaxax" + 0.003*"like"')]


In [43]:
print(lsimodel.print_topics(num_topics=3, num_words=10))

[(0, '0.976*"x" + 0.097*"file" + 0.079*"entry" + 0.052*"program" + 0.051*"0" + 0.035*"oname" + 0.033*"output" + 0.030*"char" + 0.030*"line" + 0.028*"section"'), (1, '0.700*"0" + 0.470*"1" + 0.390*"2" + 0.205*"3" + 0.155*"4" + 0.080*"5" + 0.078*"6" + -0.069*"x" + 0.068*"7" + 0.053*"8"'), (2, '1.000*"maxaxaxaxaxaxaxaxaxaxaxaxaxaxax" + 0.008*"mg9vg9vg9vg9vg9vg9vg9vg9vg9vg9vg9vg9vg9vg9vg9v" + 0.004*"14" + 0.004*"part" + 0.003*"maxaxaxaxaxaxaxaxaxaxaxaxaxaxaxq" + 0.002*"end" + 0.002*"m8axaxaxaxaxaxaxaxaxaxaxaxaxax" + 0.001*"maxaxaxaxaxaxaxaxaxaxaxaxaxaxasq" + 0.001*"maxaxaxaxaxaxaxaxaxaxaxaxaxax1f" + -0.001*"0"')]


In [9]:
from gensim.test.utils import common_dictionary, common_corpus
from gensim.models import LsiModel


In [11]:
print(common_dictionary)

Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)
