Topic Modeling with Gensim
==========================

Topic modeling is a relatively recent technique, used increasingly heavily in recommendation engines, for analyzing a text or a corpus of texts to find clusters of words that occur together more often than average. In the study of literature, topic modeling is used to gain some automated insight into what a text or a set of texts is about, without necessarily having to read it. In fact, the potential of topic modeling is still under active discussion within the literary branches of digital humanities.

There is [a useful introduction on the subject](http://journalofdigitalhumanities.org/2-1/topic-modeling-a-basic-introduction-by-megan-r-brett/) by Megan R. Brett in the Journal of Digital Humanities.

We'll see roughly how it works today, using a Python library called `gensim`. There is a reasonably good tutorial on using gensim [available here](http://radimrehurek.com/gensim/tutorial.html), although the tutorial makes no assumptions about your purpose in analyzing texts.

If you need to, be sure to run

    pip install gensim
    
either from the command line, or prefixed with a `!` character in IPython.

We'll grab the Gutenberg corpus texts from NLTK to try the tool on. As a reminder, here is what they contain:

In [1]:
from nltk.corpus import gutenberg
gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

Just to remind ourselves, each text is essentially represented as a sequence of words, or *tokens*, that look something like this. 

In [2]:
# What the NLTK corpus texts look like
w = gutenberg.words("milton-paradise.txt")
" ".join(w[:350])

"[ Paradise Lost by John Milton 1667 ] Book I Of Man ' s first disobedience , and the fruit Of that forbidden tree whose mortal taste Brought death into the World , and all our woe , With loss of Eden , till one greater Man Restore us , and regain the blissful seat , Sing , Heavenly Muse , that , on the secret top Of Oreb , or of Sinai , didst inspire That shepherd who first taught the chosen seed In the beginning how the heavens and earth Rose out of Chaos : or , if Sion hill Delight thee more , and Siloa ' s brook that flowed Fast by the oracle of God , I thence Invoke thy aid to my adventurous song , That with no middle flight intends to soar Above th ' Aonian mount , while it pursues Things unattempted yet in prose or rhyme . And chiefly thou , O Spirit , that dost prefer Before all temples th ' upright heart and pure , Instruct me , for thou know ' st ; thou from the first Wast present , and , with mighty wings outspread , Dove - like sat ' st brooding on the vast Abyss , And mad 

We need to turn the corpus texts into something that is known as a **feature vector** - this is a way of representing the information about the text that our computational method will care about. In this case each text will be represented by a count of the interesting words - not too dissimilar to what you get with the NLTK `Text.vocab()` method that we saw last week. 
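As a minimal sketch of what "a count of the words" means, here is a count-based representation built with nothing but the standard library. The token list is made up for illustration and is not taken from the corpus.

```python
from collections import Counter

# Toy token sequence (hypothetical data, not from the Gutenberg corpus)
tokens = ["whale", "sea", "ship", "whale", "captain", "sea", "whale"]

# Map each distinct word to the number of times it occurs
counts = Counter(tokens)

print(counts.most_common(2))  # [('whale', 3), ('sea', 2)]
```

Gensim will store the same kind of information, only with integer IDs in place of the words themselves.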

So how do we define interesting words...? Well, for starters, we don't want to include super-common function words because they will dominate the text. Let's get a list of these *stopwords* from NLTK so that we can remove them later.

In [3]:
from nltk.corpus import stopwords
print(" : ".join(stopwords.words("english")))
print(len(stopwords.words("english")))

# Make the stopword list into a Python set. That will make our work much faster below.
swset = set(stopwords.words("english"))

i : me : my : myself : we : our : ours : ourselves : you : your : yours : yourself : yourselves : he : him : his : himself : she : her : hers : herself : it : its : itself : they : them : their : theirs : themselves : what : which : who : whom : this : that : these : those : am : is : are : was : were : be : been : being : have : has : had : having : do : does : did : doing : a : an : the : and : but : if : or : because : as : until : while : of : at : by : for : with : about : against : between : into : through : during : before : after : above : below : to : from : up : down : in : out : on : off : over : under : again : further : then : once : here : there : when : where : why : how : all : any : both : each : few : more : most : other : some : such : no : nor : not : only : own : same : so : than : too : very : s : t : can : will : just : don : should : now : d : ll : m : o : re : ve : y : ain : aren : couldn : didn : doesn : hadn : hasn : haven : isn : ma : mightn : mustn : needn 

The stopwords, as we can see, are all lowercase. We need to account for this when we use them. We probably also don't care about punctuation.

With all this in mind, we will convert each text in the corpus into a big list of words, still in order, where we have removed the stopwords and any punctuation. We will make a big list of these word lists and put that in the variable `gutentexts`. As a quick sanity check, when we're done the `gutentexts` list should be the same size as the NLTK Gutenberg corpus.

In [4]:
# Remove stopwords and punctuation from the texts
gutentexts = []
for text in gutenberg.fileids():
    wordseq = [x.lower() for x in gutenberg.words(text) 
               if x.lower() not in swset and x.isalpha()]
    gutentexts.append(wordseq)

len(gutentexts)

18
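To see what that filter keeps and drops, here is the same list comprehension applied to a handful of tokens. The tiny stopword set is a hypothetical stand-in for the NLTK list, used only so the example runs on its own.

```python
# Hypothetical mini stopword set; the real one comes from NLTK
toy_stopwords = {"the", "of", "and", "s"}

# A few tokens in the style NLTK produces (punctuation is its own token)
toy_tokens = ["Of", "Man", "'", "s", "first", "disobedience", ",", "and", "the", "fruit"]

# Lowercase each token, then drop stopwords and anything non-alphabetic
filtered = [t.lower() for t in toy_tokens
            if t.lower() not in toy_stopwords and t.isalpha()]

print(filtered)  # ['man', 'first', 'disobedience', 'fruit']
```

Note that `isalpha()` also discards tokens containing digits, such as years or verse numbers.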

Now we have a list of lists (loosely, a two-dimensional array whose rows have different lengths). Each text is now a list, in sequence, of the words we have deemed interesting.

With this we turn to the Gensim module, which will help us turn these texts into the feature vectors we need. First we make a dictionary - this is simply an ID-to-word mapping of all the words that appear in our array.

In [5]:
from gensim.corpora import Dictionary

# Build a Gensim dictionary (a mapping between words and integer IDs) from the texts
gutendict = Dictionary(gutentexts)
print("Dictionary has %d words" % len(gutendict))

Dictionary has 41335 words


Our dictionary now has an entry for every unique word in our "words-we-care-about" corpus, and has assigned an ID number to each one. We can look them up both ways.

In [6]:
print("Word #74 is %s" % gutendict.get(74))
print("The word 'paradise' has ID %d" % gutendict.token2id['paradise'])

Word #74 is mysteriously
The word 'paradise' has ID 16649
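The two-way lookup can be pictured with a pair of plain Python dicts. This is only an illustrative sketch on made-up data; the order in which Gensim itself hands out IDs may differ.

```python
# Toy "texts": a list of token lists, like gutentexts but tiny (hypothetical data)
toy_texts = [["whale", "sea", "ship"], ["sea", "captain", "whale"]]

# Give every distinct word an integer ID, in order of first appearance.
# (An assumption for illustration; Gensim's own ordering may be different.)
token2id = {}
for text in toy_texts:
    for word in text:
        token2id.setdefault(word, len(token2id))

# Reverse mapping for ID -> word lookups
id2token = {word_id: word for word, word_id in token2id.items()}

print(token2id)     # {'whale': 0, 'sea': 1, 'ship': 2, 'captain': 3}
print(id2token[3])  # captain
```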


Now that we have our dictionary, we can make our feature vectors using the *bag of words* approach. The dictionary has a method that will read in a text sequence and convert it to a list of (wordID, wordcount) pairs, one for every distinct word. This is what it looks like when it is run on one of the texts - I have picked text #5 completely arbitrarily.

In [7]:
gutendict.doc2bow(gutentexts[5])

[(0, 1),
 (2, 3),
 (6, 38),
 (8, 1),
 (16, 8),
 (17, 26),
 (23, 1),
 (24, 2),
 (26, 10),
 (34, 2),
 (35, 2),
 (38, 3),
 (39, 2),
 (40, 4),
 (42, 9),
 (44, 3),
 (53, 1),
 (55, 2),
 (56, 1),
 (57, 8),
 (64, 14),
 (78, 5),
 (84, 22),
 (87, 20),
 (88, 1),
 (90, 1),
 (91, 24),
 (95, 7),
 (97, 3),
 (102, 3),
 (113, 1),
 (117, 2),
 (121, 1),
 (124, 19),
 (129, 2),
 (132, 7),
 (134, 2),
 (135, 3),
 (140, 1),
 (142, 1),
 (148, 1),
 (156, 2),
 (163, 2),
 (167, 4),
 (169, 9),
 (177, 11),
 (182, 1),
 (186, 2),
 (191, 1),
 (199, 21),
 (203, 3),
 (206, 14),
 (207, 5),
 (211, 1),
 (220, 1),
 (223, 3),
 (225, 1),
 (227, 1),
 (229, 8),
 (230, 3),
 (231, 5),
 (232, 1),
 (236, 2),
 (237, 12),
 (240, 2),
 (244, 31),
 (246, 23),
 (247, 2),
 (256, 4),
 (259, 8),
 (264, 3),
 (267, 35),
 (275, 1),
 (283, 6),
 (284, 9),
 (285, 1),
 (286, 1),
 (288, 3),
 (293, 2),
 (295, 33),
 (299, 1),
 (302, 8),
 (305, 3),
 (314, 1),
 (316, 25),
 (320, 14),
 (322, 3),
 (324, 18),
 (329, 3),
 (331, 1),
 (332, 19),
 (335, 3),
 

So we need one of these feature vectors for every "text" in our interesting-words array. We'll store them in yet another array, which as far as Gensim is concerned is the corpus. (Don't get confused - Gensim has a different idea of "corpus" than NLTK does!)

In [8]:
gutenbow_corpus = [gutendict.doc2bow(x) for x in gutentexts]
len(gutenbow_corpus)

18
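To make the (wordID, wordcount) format concrete, here is a rough plain-Python imitation of the conversion. The function name `toy_doc2bow` and the small mapping are hypothetical, and this is a sketch of the idea rather than Gensim's implementation.

```python
from collections import Counter

def toy_doc2bow(tokens, token2id):
    """Return sorted (word_id, count) pairs, skipping words that have no ID."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Hypothetical word -> ID mapping
token2id = {"whale": 0, "sea": 1, "ship": 2}

bow = toy_doc2bow(["sea", "whale", "sea", "harpoon"], token2id)
print(bow)  # [(0, 1), (1, 2)]
```

The word "harpoon" has no ID, so it is silently left out, and "ship" does not occur in the text, so it gets no entry at all rather than a zero count.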

Now that we have done that, we are ready to do some topic modeling! For this we have to supply the corpus of feature vectors, the dictionary, and the number of topic "buckets" we want to create. Let's see what happens when we don't specify anything else.

In [9]:
from gensim.models.ldamodel import LdaModel

result = LdaModel(gutenbow_corpus, id2word=gutendict, num_topics=30)
result.print_topics(30)



[(0,
  '0.008*shall + 0.008*said + 0.005*one + 0.005*could + 0.005*man + 0.005*lord + 0.004*would + 0.004*unto + 0.004*upon + 0.004*thy'),
 (1,
  '0.011*shall + 0.009*unto + 0.008*said + 0.006*one + 0.005*thou + 0.005*could + 0.005*lord + 0.005*ye + 0.004*would + 0.004*man'),
 (2,
  '0.016*shall + 0.010*unto + 0.010*lord + 0.009*said + 0.008*thou + 0.007*ye + 0.007*god + 0.007*thee + 0.006*thy + 0.005*upon'),
 (3,
  '0.011*shall + 0.010*said + 0.008*unto + 0.007*lord + 0.006*thou + 0.005*god + 0.005*upon + 0.005*thee + 0.005*one + 0.004*ye'),
 (4,
  '0.011*shall + 0.010*said + 0.009*lord + 0.007*thy + 0.007*unto + 0.006*man + 0.006*thou + 0.005*one + 0.005*ye + 0.005*upon'),
 (5,
  '0.016*shall + 0.013*unto + 0.012*lord + 0.009*said + 0.007*man + 0.007*one + 0.006*thee + 0.006*thou + 0.006*thy + 0.006*god'),
 (6,
  '0.008*said + 0.007*shall + 0.007*one + 0.006*thou + 0.005*lord + 0.005*like + 0.004*thee + 0.004*man + 0.004*would + 0.004*ye'),
 (7,
  '0.010*shall + 0.008*unto + 0.007*sa

Well, our topics don't look that interesting - in fact they all look mostly the same - and we got a big red warning saying that the "training might not converge". We're going to have to think more carefully about what this is doing.
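As an aside, Gensim reports messages like this one through Python's standard `logging` module. The following is a minimal sketch for turning on more detailed output, assuming that Gensim's loggers sit under the name `gensim` (its modules name their loggers after themselves). Which progress messages then appear depends on the Gensim version.

```python
import logging

# Show timestamped log messages at INFO level and above
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s",
                    level=logging.INFO)

# Assumption: Gensim's module loggers are children of the "gensim" logger
gensim_logger = logging.getLogger("gensim")
gensim_logger.setLevel(logging.INFO)
```

Run this before creating the model so that messages emitted during training are shown.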

The "magic" behind the LDA (Latent Dirichlet Allocation) method, [as explained in this blog post on the topic](http://www.matthewjockers.net/2011/09/29/the-lda-buffet-is-now-open-or-latent-dirichlet-allocation-for-english-majors/), is that it assigns words to topic buckets over and over, a great many times, and it can only be considered "done" when the words have by and large stopped moving between buckets. What that warning message was telling us is that we didn't give the algorithm enough time - that is, enough passes through the data - to let the topic buckets reach a steady state. So the first thing we can do is take its advice and increase the number of passes that the algorithm gets to take.

In [10]:
result = LdaModel(gutenbow_corpus, id2word=gutendict, num_topics=30, passes=30)
result.print_topics(30)

[(0,
  '0.000*shall + 0.000*said + 0.000*lord + 0.000*unto + 0.000*thou + 0.000*thy + 0.000*one + 0.000*man + 0.000*god + 0.000*upon'),
 (1,
  '0.011*macb + 0.011*haue + 0.008*thou + 0.007*enter + 0.006*shall + 0.005*vpon + 0.005*thee + 0.005*vs + 0.005*yet + 0.005*th'),
 (2,
  '0.012*whale + 0.008*one + 0.006*like + 0.005*upon + 0.005*man + 0.005*ship + 0.005*ahab + 0.005*ye + 0.004*sea + 0.004*old'),
 (3,
  '0.000*shall + 0.000*unto + 0.000*said + 0.000*lord + 0.000*thy + 0.000*one + 0.000*thou + 0.000*thee + 0.000*man + 0.000*god'),
 (4,
  '0.000*said + 0.000*lord + 0.000*unto + 0.000*shall + 0.000*thy + 0.000*one + 0.000*man + 0.000*thou + 0.000*god + 0.000*upon'),
 (5,
  '0.000*unto + 0.000*shall + 0.000*said + 0.000*lord + 0.000*thy + 0.000*would + 0.000*god + 0.000*thee + 0.000*upon + 0.000*man'),
 (6,
  '0.012*said + 0.008*little + 0.006*one + 0.006*see + 0.005*old + 0.004*upon + 0.004*day + 0.004*man + 0.004*good + 0.004*know'),
 (7,
  '0.000*unto + 0.000*shall + 0.000*lord + 

These are looking a little more promising - the topics are now more differentiated - but there are still a lot of very common words that appear in most of the topics.

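Another trick we can try is to reduce our scope of interesting words. Perhaps the words that appear in every text are not quite so topic-informative as the words that are somewhat rarer. We could try this approach, removing from the dictionary any word that appears in all the texts; the Gensim dictionary class gives us a way to do that with its `filter_extremes` method. Be aware that this method also has a `no_below` parameter, which defaults to 5, so unless we override it the call also drops every word that appears in fewer than five texts. That second filter accounts for most of the shrinkage in the dictionary.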

In [11]:
# Remove the words that appear in 100% of the texts (no_above=0.99).
# Note: the default no_below=5 also removes words found in fewer than 5 texts.
gutendict.filter_extremes(no_above=0.99)
print("Dictionary now has %d words" % len(gutendict))

Dictionary now has 7374 words
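The size of that drop is easier to understand with a small model of the filtering rule. The sketch below is a plain-Python illustration, not Gensim's code: `toy_filter_extremes` is a hypothetical helper that keeps a word only if the number of texts containing it is at least `no_below` and at most `no_above` times the number of texts. It ignores the real method's other options, such as `keep_n`.

```python
from collections import Counter

def toy_filter_extremes(texts, no_below=5, no_above=0.5):
    """Return the set of words whose document frequency lies between
    no_below and no_above * len(texts), inclusive."""
    n_docs = len(texts)
    # Document frequency: in how many texts does each word occur at least once?
    doc_freq = Counter(word for text in texts for word in set(text))
    return {w for w, df in doc_freq.items()
            if no_below <= df <= no_above * n_docs}

# Hypothetical four-text corpus
toy_texts = [
    ["said", "whale", "sea"],
    ["said", "lord", "sea"],
    ["said", "lord", "anne"],
    ["said", "anne", "sea"],
]

kept = sorted(toy_filter_extremes(toy_texts, no_below=2, no_above=0.99))
print(kept)  # ['anne', 'lord', 'sea']
```

Here "said" is dropped because it occurs in every text, and "whale" is dropped because it occurs in only one. In the real call above, `no_below` was left at its default of 5, so rare words were removed as well as the universal ones.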


Of course, now we have to regenerate the feature vectors, since the dictionary has changed. (We don't have to regenerate the original text array, since any word not in the dictionary will simply be left out.)

In [12]:
gutenbow_corpus = [gutendict.doc2bow(x) for x in gutentexts]

Now we can try the modeling again.

In [13]:
result = LdaModel(gutenbow_corpus, id2word=gutendict, num_topics=30, passes=30)
result.print_topics(10)

[(10,
  '0.010*upon + 0.009*mr + 0.007*sir + 0.007*mrs + 0.007*never + 0.006*father + 0.006*shall + 0.006*much + 0.005*sure + 0.005*oh'),
 (2,
  '0.040*shall + 0.036*unto + 0.032*lord + 0.022*thou + 0.019*thy + 0.018*god + 0.016*ye + 0.016*thee + 0.011*upon + 0.010*israel'),
 (6,
  '0.036*anne + 0.024*captain + 0.023*father + 0.021*brown + 0.010*priest + 0.010*mary + 0.009*rather + 0.008*sir + 0.007*admiral + 0.007*lady'),
 (25,
  '0.000*shall + 0.000*lord + 0.000*thou + 0.000*thy + 0.000*ye + 0.000*upon + 0.000*thee + 0.000*unto + 0.000*god + 0.000*us'),
 (7,
  '0.000*whale + 0.000*shall + 0.000*thou + 0.000*upon + 0.000*ye + 0.000*first + 0.000*lord + 0.000*even + 0.000*mr + 0.000*god'),
 (9,
  '0.000*lord + 0.000*shall + 0.000*thou + 0.000*unto + 0.000*god + 0.000*ye + 0.000*thy + 0.000*king + 0.000*thee + 0.000*upon'),
 (26,
  '0.005*quite + 0.005*even + 0.005*seemed + 0.005*looked + 0.005*us + 0.004*voice + 0.004*something + 0.004*really + 0.004*first + 0.004*cried'),
 (29,
  '0.0

Feel free to play around with the number of topics, the number of passes, the set of words in the dictionary, and any of the other options [that are documented here](http://radimrehurek.com/gensim/models/ldamodel.html). There is no clear boundary of "right answer" with topic modeling; rather, you play around with the data and seek to understand the model until it begins to tell you something interesting!

There is also a popular non-Python tool for topic modeling called MALLET, [which is available here](http://mallet.cs.umass.edu/download.php). The Programming Historian has provided [a good tutorial](http://programminghistorian.org/lessons/topic-modeling-and-mallet) on its installation and use; you will need the Java Development Kit installed, and it is run from the command line. If you want to really learn how topic modeling functions, it would be instructive to run Gensim and MALLET on the same data set, to see how the results differ!