This section of the tutorial will teach you how to extract topics from documents. We will use the "gensim" python library to do this.

First, we need to import a bunch of stuff.

In [17]:
import nltk
from nltk.tokenize import *
from gensim import corpora, models
import gensim

We're going to create a few simple test documents for this section of the tutorial.

In [161]:
doc_a = "Mount Everest attracts many highly experienced mountaineers, as well as capable climbers willing to hire guides."
doc_b = "Everest is not the furthest summit from the centre of the Earth."
doc_c = "The first recorded efforts to reach Everest's summit were made by British mountaineers."
doc_d = "Biology recognizes the cell as the basic unit of life."
doc_e = "Biology began to quickly develop and grow with the improvement of the microscope."

test_doc = "Climbing microscopes is hard to cells to do."

We need to transform our documents into bags of words. 
In order to get only the most relevant words, we are going to remove stop words from our documents.
We will use a stemmer from NLTK to reduce our words to their stems.

In [151]:
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
# create English stop words list
stop_words = get_stop_words('en')

# Create p_stemmer of class PorterStemmer
stemmer = PorterStemmer()

Let's put all our training documents in a list.

In [152]:
training_docs = [doc_a,doc_b,doc_c,doc_d,doc_e]

Now, let's clean up our documents. We're going to put all our bags of words into a list.

In [153]:
texts = []
for i in training_docs:
    tokens = nltk.word_tokenize(i)
    tokens = [w.lower() for w in tokens if w.isalpha()] #lowercase and remove punctuation
    stopped_tokens = [i for i in tokens if not i in en_stop] #remove stop words
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens] #stem words
    texts.append(stemmed_tokens)# add words to corpus

Now we are going to create a dictionary that has all the words that appear in our documents in it.

In [154]:
dictionary = corpora.Dictionary(texts)
print dictionary[1]
print dictionary[2]
print dictionary[3]

capabl
climber
everest


This next line takes our words and transforms them into integers which represent those words' index in our dictionary.
It also counts the number of occurrances of each word in each document. 

In [155]:
corpus = [dictionary.doc2bow(text) for text in texts]

Now we create and train the topic model we will use to extract our topics. We use Latent Dirichlet Allocation for our model.

num_topics will determine how many topics our model will generate. Because our documents fall into two categories, I chose 2 topics.  If your document set has a large variety of topics, you should pick a bigger number.

passes will determine how many times the model loops through our corpus. A higher number will take more time, and a lower number will be less accurate.

In [156]:
lda = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word = dictionary, passes=20)

Now we can print out the topics that our model has found.
num_topics in this case will print out a number of topics, up to the number that we generated when we trained our model. Note that these topics are in no particular order.
num_words will show a number of the most relevant words to each topic. 

In [158]:
print(lda.print_topics(num_topics=2, num_words=1)) #print topics with most relevant word for each topic
print(lda.print_topics(num_topics=2, num_words=4)) #print topics with 4 most relevant words for each topic

[u'0.079*everest', u'0.082*biolog']
[u'0.079*everest + 0.056*mountain + 0.056*summit + 0.034*attract', u'0.082*biolog + 0.049*quickli + 0.049*improv + 0.049*grow']


Note that these topics to not have any names associated with them.

Now, let's find the topics of a document outside of our training set. 
First, we have to process our document into a bag of words using our already defined dictionary.

In [164]:
test_tokens = nltk.word_tokenize(test_doc)
test_tokens = [w.lower() for w in test_tokens if w.isalpha()]
test_tokens = [i for i in tokens if not i in stop_words]    
test_tokens = [stemmer.stem(i) for i in test_tokens]
test_tokens = dictionary.doc2bow(test_tokens)

Now, let our model extract the topics from our document.

In [163]:
print(lda[test_tokens]) #feed our test document to our model

[(0, 0.064127751544000258), (1, 0.9358722484559997)]


These tuples represent (topic, relevence) pairs. The first number is the number of the topic that the document contains, and the second number is the probability that the document really does contain that topic.

Try messing around with the test document and seeing how the results change.