Library containing tools for topic modeling and related NLP tasks.
It brings together implementations from various authors, slightly modified by me, as well as new visualization tools to help inspect the results. Many of the algorithms here were derived from the published implementations of David Blei's group.
I have also added a fair number of tests, mainly to guide my refactoring of the code. Tests are still sparse, but will grow as the rest of the codebase sees more usage and refactoring.
### Running the tests
After you clone this repository, you can run the tests by going into the `tests` directory and running `nosetests` (nose required).
The sub-package `onlineldavb` is currently the most used and best tested. Here is a quick example of its usage. Assume you have a set of documents from which you want to extract the most representative topics.
The first thing you need is a vocabulary list for these documents, i.e., valid, informative words you may want to use to describe topics. I generally use a spellchecker to keep only valid words, plus a list of stopwords to drop uninformative ones. NLTK and PyEnchant can help with that:
```python
import nltk
import enchant
from string import punctuation
from enchant.checker import SpellChecker

sw = nltk.corpus.stopwords.words('english')
checker = SpellChecker('en_US')
docset = ['...', '...', ...]  # your corpus
```
Now, for every document in your corpus you can run the following code to define its vocabulary.
```python
checker.set_text(text)
errors = [err.word for err in checker]
vocab = [word.strip(punctuation) for word in nltk.wordpunct_tokenize(text)
         if word.strip(punctuation) not in sw + errors]
vocab = list(set(vocab))
```
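To get a single list covering the whole corpus, one straightforward option (a sketch using the objects defined above; `full_vocab` is just a temporary name) is to repeat this step for every document and take the union of the per-document vocabularies:

```python
# Sketch: build the corpus-wide vocabulary as the union of the per-document vocabularies
full_vocab = set()
for text in docset:
    checker.set_text(text)
    errors = [err.word for err in checker]
    full_vocab |= set(word.strip(punctuation) for word in nltk.wordpunct_tokenize(text)
                      if word.strip(punctuation) not in sw + errors)
vocab = list(full_vocab)
```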
Now that you have a vocabulary, which is the union of the vocabularies of all documents, you can run the LDA analysis. You have to specify the number of topics you expect to find (`K` below):
```python
import numpy as np
from Topics import onlineldavb  # assuming the sub-package is importable as Topics.onlineldavb

K = 10
D = 100  # number of documents in the docset
olda = onlineldavb.OnlineLDA(vocab, K, D, 1./K, 1./K, 1024, 0.7)
for doc in docset:
    gamma, bound = olda.update_lambda(doc)
    wordids, wordcts = onlineldavb.parse_doc_list(doc, olda._vocab)
    perwordbound = bound * len(docset) / (D * sum(map(sum, wordcts)))
np.savetxt('lambda.dat', olda._lambda)
```
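The per-word bound computed in the loop can be used to monitor the fit. Following the reference scripts that ship with online LDA, it is commonly reported as a held-out perplexity estimate (this line is not part of the original example):

```python
# Lower perplexity means a better fit; print this inside the loop to track progress
print('held-out perplexity estimate: %f' % np.exp(-perwordbound))
```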
Finally, you can visualize the resulting topics as word clouds:
```python
lamb = olda._lambda  # the topic-word matrix estimated above (also saved to lambda.dat)
cloud = GenCloud(vocab, lamb)  # GenCloud is the word-cloud generator from this package's visualization tools
for i in range(K):
    cloud.gen_image(i)
```
If you have done everything right, you should see 10 figures (one per topic) just like this:
Turbo topics, from Blei & Lafferty (2009), is also part of this package. As with the rest of the code, it has been refactored for better compliance with PEP 8 and for better integration with the Topics package.
Here is a simple usage example:
```python
from Topics.visualization.ngrams import compute
from Topics.visualization import lda_topics

compute('mydoc_utf8.txt', 0.001, False, 'unigrams.txt', stopw=sw)
```
After executing the code above, two files will be generated on disk: "unigrams.txt" and "ngram_counts.csv".
Now we can load them and create nice word clouds:
```python
import codecs
from collections import OrderedDict

with codecs.open('ngram_counts.csv', encoding='utf8') as f:
    ngrams = f.readlines()

ng = OrderedDict()
for l in ngrams:
    w, c = l.split('|')
    if float(c.strip()) > 100:
        continue
    ng[w.strip()] = float(c.strip())

counts = np.array(list(ng.values()))
counts.shape = 1, len(counts)
ngcloud = GenCloud(list(ng.keys()), counts)
ngcloud.gen_image(0, 'ngrams')
```
If we want to include only the n-grams with more than one word, we can remove the single-word entries from the dictionary `ng` above, as shown in the sketch below.
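A minimal sketch of that filtering step, using only the `ng` dictionary built above:

```python
# Keep only the n-grams made of more than one word before generating the cloud
ng = OrderedDict((w, c) for w, c in ng.items() if len(w.split()) > 1)
```

After this, rebuild `counts` and the `GenCloud` instance as before.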