
#topicmodeling

Library containing tools for topic modeling and related NLP tasks.

It brings together implementations from various authors, slightly modified by me, as well as new visualization tools to help inspect the results. Many of the algorithms here were derived from the published implementations of David Blei's group.

I have also added a fair number of tests, mainly to guide my refactoring of the code. Test coverage is still sparse, but it will grow as the rest of the codebase sees more usage and refactoring.

###Running the tests

After you clone this repository, you can run the tests by going into the tests directory and running nosetests (nose required).

Quick tutorial

###Online LDA

The sub-package onlineldavb is currently the most used and best tested. Here is a quick example of its usage. Assume you have a set of documents from which you want to extract the most representative topics.

The first thing you need is a vocabulary list for these documents, i.e., the valid, informative words you may want to use to describe topics. I generally use a spellchecker to find these, plus a list of stopwords. NLTK and PyEnchant can help us with that:

import nltk
import enchant
from string import punctuation
from enchant.checker import SpellChecker

sw = nltk.corpus.stopwords.words('english')
checker=SpellChecker('en_US')

docset = ['...','...',...] # your corpus

Now, for every document in your corpus (called text below), you can run the following code to define its vocabulary.

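checker.set_text(text)  # text is a single document from docset
errors = [err.word for err in checker]  # words the spellchecker rejects
excluded = set(sw + errors)  # stopwords plus misspellings
tokens = [word.strip(punctuation) for word in nltk.wordpunct_tokenize(text)]
# drop empty strings left behind by punctuation-only tokens
vocab = list(set(tok for tok in tokens if tok and tok not in excluded))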
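The per-document step has to be repeated over the whole corpus and the results merged. The following is a minimal sketch of that merge, not part of the library: `doc_vocab` and `corpus_vocab` are hypothetical helper names, and `doc_vocab` uses a simple regex tokenizer as a stand-in for the NLTK/PyEnchant step so the example runs with the standard library alone.

```python
import re
from string import punctuation


def doc_vocab(text, excluded):
    """Hypothetical stand-in for the per-document vocabulary step."""
    # Split into runs of word characters or runs of punctuation.
    raw_tokens = re.findall(r"\w+|[^\w\s]+", text)
    tokens = (tok.strip(punctuation) for tok in raw_tokens)
    # Keep non-empty tokens that are not excluded (stopwords, misspellings).
    return {tok for tok in tokens if tok and tok not in excluded}


def corpus_vocab(docset, excluded=()):
    """Union of the vocabularies of every document, as a sorted list."""
    excluded = set(excluded)
    vocab = set()
    for text in docset:
        vocab |= doc_vocab(text, excluded)
    return sorted(vocab)
```

In practice you would replace the body of `doc_vocab` with the spellchecker-based code and pass `sw` as `excluded`.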

Now that you have a vocabulary, which is the union of the vocabularies of all the documents, you can run the LDA analysis. You have to specify the number of topics you expect to find (K below):

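import numpy as np
from Topics import onlineldavb  # assumed import path for the sub-package

K = 10   # number of topics
D = 100  # number of documents in the docset
# the remaining arguments are assumed to follow Hoffman's original onlineldavb:
# alpha, eta, tau0, kappa
olda = onlineldavb.OnlineLDA(vocab, K, D, 1./K, 1./K, 1024, 0.7)
for doc in docset:
    gamma, bound = olda.update_lambda(doc)
    wordids, wordcts = onlineldavb.parse_doc_list(doc, olda._vocab)
    perwordbound = bound * len(docset) / (D * sum(map(sum, wordcts)))
np.savetxt('lambda.dat', olda._lambda)  # save the topic-word matrix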
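The loop above computes `perwordbound` but never reports it. A common way to track progress is to turn it into a perplexity estimate, `exp(-perwordbound)`. The sketch below assumes that `bound` is the corpus-scaled estimate returned by `update_lambda`, as in Hoffman's original onlineldavb, and that `batch_size` is the number of documents passed in that call. The function name is hypothetical and not part of this package.

```python
import math


def per_word_perplexity(bound, batch_size, D, wordcts):
    """Estimate perplexity from the variational bound of one update.

    bound      -- bound returned by the update (assumed scaled to D documents)
    batch_size -- number of documents used in that update
    D          -- total number of documents in the corpus
    wordcts    -- per-document lists of word counts for the same documents
    """
    n_words = sum(sum(cts) for cts in wordcts)  # tokens in the batch
    perwordbound = bound * batch_size / (D * n_words)
    return math.exp(-perwordbound)
```

Lower values indicate a better fit. Printing this number inside the loop gives a rough picture of whether the model is still improving.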

Finally you can visualize the resulting topics as a Word Cloud:

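lamb = np.loadtxt('lambda.dat')  # topic-word matrix saved in the previous step
cloud = GenCloud(vocab, lamb)
for i in range(K):
    cloud.gen_image(i)  # one word cloud per topic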

If you have done everything right you should see 10 figures just like this:

topic_cloud
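The topics can also be inspected as plain text by ranking the words of each row of the lambda matrix. This is a minimal sketch under the assumption that column i of lambda corresponds to `vocab[i]`; check this against how your vocabulary was indexed. `top_words` is a hypothetical helper, not part of this package. It accepts nested lists or a NumPy array such as the one loaded from lambda.dat.

```python
def top_words(lamb, vocab, n=10):
    """Return the n highest-weight (word, probability) pairs for each topic.

    lamb  -- K x W matrix of topic-word weights (rows are topics)
    vocab -- list of W words, assumed aligned with the columns of lamb
    """
    topics = []
    for row in lamb:
        weights = [float(w) for w in row]
        total = sum(weights)  # normalize the row into a distribution
        ranked = sorted(range(len(weights)),
                        key=lambda i: weights[i], reverse=True)
        topics.append([(vocab[i], weights[i] / total) for i in ranked[:n]])
    return topics
```

For example, `top_words(lamb, vocab, n=5)[0]` lists the five most probable words of the first topic.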

Turbotopics

Turbo topics, from Blei & Lafferty (2009), is also part of this package. As with the rest of the code, it has been refactored for better compliance with PEP 8 and for better integration with the Topics package.

Here is a simple usage example:

from Topics.visualization.ngrams import compute
from Topics.visualization import lda_topics

compute('mydoc_utf8.txt', 0.001,False,'unigrams.txt',stopw=sw)

After executing the code above, two files will be generated on disk: "unigrams.txt" and "ngram_counts.csv".

Now we can load them and create nice word clouds:

import codecs
from collections import OrderedDict

import numpy as np

with codecs.open('ngram_counts.csv', encoding='utf8') as f:
    ngrams = f.readlines()
ng = OrderedDict()
for l in ngrams:
    w, c = l.split('|')  # each line holds an n-gram and its count
    if float(c.strip()) > 100:  # skip entries whose count exceeds 100
        continue
    ng[w.strip()] = float(c.strip())

counts = np.array(list(ng.values()))  # list() keeps this working on Python 3
counts.shape = 1, len(counts)  # reshape to a single row
ngcloud = GenCloud(list(ng.keys()), counts)

ngcloud.gen_image(0, 'ngrams')

If we want to include only the n-grams with more than one word, we can remove the single-word entries from the dictionary ng above before building the cloud.
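One way to do that filtering is sketched below. It assumes that the words of an n-gram are separated by whitespace in the keys of `ng`; the helper name `multiword_only` is hypothetical.

```python
from collections import OrderedDict


def multiword_only(ng):
    """Keep only entries whose key has more than one whitespace-separated word."""
    return OrderedDict((w, c) for w, c in ng.items() if len(w.split()) > 1)
```

The result preserves the original order, so it can be passed to the same counts and GenCloud steps in place of `ng`.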
