Topic Modeling
=======
Given a set of documents, assign them to a set of topics
* Topics are abstract concepts: we usually also want to somehow define them, e.g. with keywords
* Applications: TODO

Hard assignment with clustering
* E.g. represent the documents in a TFIDF bag-of-words (BOW) vector space and find cluster boundaries
* Each cluster defines a topic, so number of clusters = number of topics
* It is naive to assume that a document belongs to a single cluster

Modern topic models tend to use mixed membership modeling
* Each document covers all topics, i.e. we are not forcefully assigning a single category
* The goal is to figure out the relative proportion of each topic in each document


Latent Dirichlet Allocation (LDA)
----------------------------
* Most common topic modeling approach, original paper has over 18K citations

### 3 Main Components of LDA: ###
Let’s assume we are representing our document as a multiset of words and we have a fixed number of topics.
Main components:
1. Vocabulary distributions for topics
    * Each word has a (corpus-wide) probability of occurring in a document, given a topic

2. Document specific topic distributions

3. Topic assignment for each word in each document in our corpus

To be truthful, there are several other distributions in the model, but for simplicity we are going to leave them out.
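
To make the three components concrete, here is a minimal sketch of how LDA imagines a document being generated. The topic names, words and probabilities are made up for illustration, and the priors on the distributions (among the parts left out above) are ignored.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# 1. Vocabulary distributions for topics (toy values, made up for illustration)
topic_word = {
    "genetics": {"gene": 0.5, "protein": 0.3, "patient": 0.1, "treatment": 0.1},
    "clinical": {"gene": 0.05, "protein": 0.05, "patient": 0.5, "treatment": 0.4},
}

# 2. Document-specific topic distribution (a single document, also made up)
doc_topics = {"genetics": 0.2, "clinical": 0.8}


def generate_document(doc_topics, topic_word, n_words, rng):
    """Sample (word, topic) pairs: pick a topic per position, then a word from it."""
    topics = list(doc_topics)
    pairs = []
    for _ in range(n_words):
        # 3. Topic assignment for this word position
        topic = rng.choices(topics, weights=[doc_topics[t] for t in topics])[0]
        vocab = list(topic_word[topic])
        word = rng.choices(vocab, weights=[topic_word[topic][w] for w in vocab])[0]
        pairs.append((word, topic))
    return pairs


document = generate_document(doc_topics, topic_word, n_words=8, rng=rng)
for word, topic in document:
    print(word, "<-", topic)
```

Fitting LDA runs this process in reverse: only the words are observed, and the three components have to be inferred from them.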

### Interpreting and Using the Model: ###
To interpret a given topic, we can sort the words from the most probable to the least. If the top words are coherent, they can be used as the keywords defining the topic.
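
As a small illustration of this ranking step, assuming a topic is given as a plain word-to-probability mapping (the words and values below are hypothetical):

```python
# Hypothetical vocabulary distribution of a single topic (word -> probability)
topic = {"patients": 0.30, "study": 0.10, "disease": 0.25, "water": 0.02,
         "clinical": 0.18, "treatment": 0.15}


def top_words(word_probs, n):
    """Return the n most probable words, most probable first."""
    ranked = sorted(word_probs, key=word_probs.get, reverse=True)
    return ranked[:n]


keywords = top_words(topic, 3)
print(keywords)
```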

From the document-specific topic distributions we can see which topics are “important” for the given document.
* By setting a threshold we can “hard-assign” the document to specific categories.
* The distributions can also be used for retrieval: two documents with similar topic distributions should have similar content. Thus we can calculate similarities based on the topic distribution vectors just like we do with TFIDF vectors.
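
Both uses can be sketched in a few lines of plain Python. The three topic distributions and the threshold of 0.25 are made-up values; a sensible threshold depends on the number of topics and on the application.

```python
import math


def hard_assign(topic_dist, threshold):
    """Return the indices of topics whose proportion reaches the threshold."""
    return [k for k, p in enumerate(topic_dist) if p >= threshold]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Hypothetical topic distributions of three documents over three topics
doc_a = [0.05, 0.65, 0.30]
doc_b = [0.10, 0.60, 0.30]
doc_c = [0.90, 0.05, 0.05]

labels_a = hard_assign(doc_a, threshold=0.25)
sim_ab = cosine(doc_a, doc_b)
sim_ac = cosine(doc_a, doc_c)
print(labels_a, round(sim_ab, 3), round(sim_ac, 3))
```

Documents A and B share the same dominant topics and end up far more similar to each other than A and C.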

Word assignments are not usually used in applications; they are just a means to model the topics.

### Solving LDA: ###
We would like to find out:
1. Vocabulary distributions within topics
2. Topic assignments for each word
3. Topic distributions for each document

Finding the optimal solution is intractable, i.e. there is no computationally efficient way of solving the problem. Luckily there are various ways of approximating the solution.

One solution, based on Gibbs sampling:
1. Randomly initialize all our attributes (that is, the word distributions within topics and the topic distributions for documents)
2. For each document:
    * Reassign all words randomly to the topics based on the known distributions (keeping them fixed).
    * Recalculate document topic distributions based on the counts of topic assignments for the words in document.
3. Recalculate topic vocabulary distributions from global word-to-topic assignments
4. We iterate steps 2 and 3 for the whole corpus until we reach some stopping criteria.
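
The steps above can be sketched in plain Python. This is a simplified illustration and not a full Gibbs sampler: the distributions are re-estimated from smoothed counts instead of being drawn from their Dirichlet posteriors. The function name, the `smoothing` constant and the fixed iteration count used as the stopping criterion are all assumptions made for this sketch. Documents are lists of integer word ids.

```python
import random


def sample_topics(docs, n_topics, vocab_size, n_iter=50, smoothing=0.1, seed=0):
    """Alternate between sampling word-topic assignments and re-estimating distributions."""
    rng = random.Random(seed)

    def normalize(counts):
        # Smoothing keeps every probability strictly positive
        total = sum(counts) + smoothing * len(counts)
        return [(c + smoothing) / total for c in counts]

    # Step 1: random initialization of both sets of distributions
    phi = [normalize([rng.random() for _ in range(vocab_size)])
           for _ in range(n_topics)]    # word distribution per topic
    theta = [normalize([rng.random() for _ in range(n_topics)])
             for _ in docs]             # topic distribution per document

    for _ in range(n_iter):             # Step 4: fixed number of sweeps
        word_topic_counts = [[0] * vocab_size for _ in range(n_topics)]
        for d, doc in enumerate(docs):
            doc_counts = [0] * n_topics
            for w in doc:
                # Step 2a: draw a topic for this word, keeping theta and phi fixed
                weights = [theta[d][k] * phi[k][w] for k in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                doc_counts[k] += 1
                word_topic_counts[k][w] += 1
            # Step 2b: re-estimate this document's topic distribution
            theta[d] = normalize(doc_counts)
        # Step 3: re-estimate vocabulary distributions from all assignments
        phi = [normalize(row) for row in word_topic_counts]
    return theta, phi


# Toy corpus: two documents use only words 0 and 1, two use only words 2 and 3
docs = [[0, 1, 0, 1, 0], [0, 0, 1, 1], [2, 3, 2, 3, 3], [3, 2, 2, 3]]
theta, phi = sample_topics(docs, n_topics=2, vocab_size=4)
for row in theta:
    print([round(p, 2) for p in row])
```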

An extension called collapsed Gibbs sampling ignores the vocabulary distributions and document topic distributions while calculating the word-to-topic assignments. Instead, each word assignment is derived from the other word assignments in the document and the corpus.
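
For reference, in the commonly used formulation with symmetric Dirichlet priors (the hyperparameters $\alpha$ and $\beta$ belong to the parts of the model left out above), the probability of assigning the word at position $i$ of document $d$ to topic $k$ is:

$$
P(z_i = k \mid z_{-i}, w) \;\propto\; \left(n_{d,k}^{-i} + \alpha\right) \cdot \frac{n_{k,w_i}^{-i} + \beta}{n_{k}^{-i} + V\beta}
$$

Here $n_{d,k}^{-i}$ is the number of words in document $d$ assigned to topic $k$, $n_{k,w_i}^{-i}$ is the number of times the word $w_i$ is assigned to topic $k$ in the whole corpus, $n_{k}^{-i}$ is the total number of words assigned to topic $k$, and $V$ is the vocabulary size. All counts exclude the current position $i$. The first factor favors topics already common in the document, the second favors topics in which the word is already common.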

### For bioinformatics students: ###
Essentially the same algorithm was originally invented for modeling genetic differences between populations and individuals:
* Topic = Population
* Word = Allele (at some specific locus)
* Document = Individual (set of alleles)

Learn:
1. Which alleles are prevalent in a population
2. From which population (or a mix) an individual originates

### Practical issues ###
1. What is a good number of topics?
    * Unfortunately there is no clear answer; several papers have been published on this question.
    * Too few topics result in massive topics that cover anything and everything, whereas too many lead to an unusable model (with many tiny topics).
    * An extension of LDA called Hierarchical Dirichlet Process (HDP) tries to learn the optimal number of topics directly from data.
2. Unsupervised method -> no control over which topics are generated?
    * Some implementations of LDA (GuidedLDA) support adding seed words for topics to “anchor” them. These seed words act as a bias that influences the vocabulary distributions of topics and the document-topic distributions (towards topics whose seed words occur in the document).
3. Functional words tend to form their own topics
    * Preprocessing (stop word filtering etc.) is extremely critical for good results


### LDA with Python ###
Scikit-learn has an implementation of [LDA](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html) we can use.

Let's try it out with a small toy dataset:

In [9]:
from __future__ import print_function

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

import codecs

import numpy
numpy.set_printoptions(precision=4)

import logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')

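def pubmed_articles(path, max_articles=10000):
    """Yield at most max_articles PubMed articles, one per line of the file at path."""
    # Assumes the file is UTF-8 encoded
    with open(path, encoding='utf-8') as f:
        for line in f:
            # Newlines inside an article are stored as the two characters '\n'
            yield line.replace('\\n', '\n')
            max_articles -= 1
            if max_articles == 0:
                break

# Let's get our documents
art = list(pubmed_articles('./pubmed.txt', max_articles=10000))
print('Articles: %s' % len(art))
print(art[0])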

Articles: 10000
Molar incisor hypomineralization (MIH): clinical presentation, aetiology and management.
In this paper, the current knowledge about Molar Incisor Hypomineralization (MIH) is presented. MIH is defined as hypomineralization of systemic origin of one to four permanent first molars frequently associated with affected incisors and these molars are related to major clinical problems in severe cases. At the moment, only limited data are available to describe the magnitude of the phenomenon. The prevalence of MIH in the different studies ranges from 3.6-25% and seems to differ in certain regions and birth cohorts. Several aetiological factors (for example, frequent childhood diseases) are mentioned as the cause of the defect. Children at risk should be monitored very carefully during the period of eruption of their first permanent molars. Treatment planning should consider the long-term prognosis of these teeth.




Now we have our articles, next we have to convert them into BOW vectors. Here the preprocessing settings are critical!

In [7]:
vect = CountVectorizer(stop_words='english', max_df=0.95, min_df=2)
vect.fit(art)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.95, max_features=None, min_df=2,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
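
To see what the `stop_words`, `min_df` and `max_df` settings used above do to the vocabulary, here is a rough pure-Python sketch of the same kind of filtering. It illustrates the idea and is not scikit-learn's implementation; the function name, the tiny stop word set and the toy documents are made up.

```python
from collections import Counter


def prune_vocabulary(tokenized_docs, stop_words, min_df, max_df):
    """Keep words that are not stop words and whose document frequency is in range.

    min_df is an absolute document count and max_df a fraction of all documents,
    mirroring how the two values are given in the CountVectorizer call above.
    """
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each word once per document
    return sorted(
        word for word, df in doc_freq.items()
        if word not in stop_words and min_df <= df <= max_df * n_docs
    )


toy_docs = [
    ["the", "study", "gene", "is", "expressed"],
    ["the", "study", "gene", "encodes", "a", "protein"],
    ["the", "study", "patients", "received", "treatment"],
    ["the", "study", "patients", "improved"],
]
vocabulary = prune_vocabulary(toy_docs, stop_words={"the", "is", "a"},
                              min_df=2, max_df=0.95)
print(vocabulary)
```

Here "study" is dropped because it occurs in every document, and words occurring in only one document are dropped as too rare, which leaves "gene" and "patients".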

That was simple; now on to the actual LDA model.

In [8]:
# Note: older scikit-learn releases called the n_components parameter n_topics
lda = LatentDirichletAllocation(n_components=10, random_state=42, learning_method='online', evaluate_every=1, verbose=1)
lda.fit(vect.transform(art))

iteration: 1, perplexity: 4837.5141
iteration: 2, perplexity: 4464.4961
iteration: 3, perplexity: 4367.6493
iteration: 4, perplexity: 4324.3441
iteration: 5, perplexity: 4300.5675
iteration: 6, perplexity: 4285.8962
iteration: 7, perplexity: 4275.8165
iteration: 8, perplexity: 4268.5588
iteration: 9, perplexity: 4263.1175
iteration: 10, perplexity: 4259.0157


LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=1, learning_decay=0.7,
             learning_method='online', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,
             n_jobs=1, n_topics=10, perp_tol=0.1, random_state=42,
             topic_word_prior=None, total_samples=1000000.0, verbose=1)
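
The perplexity printed during training measures how well the model predicts the words it sees: it is the exponential of the negative average log-likelihood per word, so lower is better. scikit-learn reports an approximation of it. The sketch below only illustrates the definition, using made-up per-word probabilities.

```python
import math


def perplexity(word_probs):
    """Exponential of the negative mean log-probability of the observed words."""
    log_likelihood = sum(math.log(p) for p in word_probs)
    return math.exp(-log_likelihood / len(word_probs))


# A model that spreads its probability evenly over 4 words...
uniform = perplexity([0.25] * 8)
# ...versus a model that gives every observed word probability 0.5
confident = perplexity([0.5] * 8)
print(uniform, confident)
```

A perplexity of about 4 means the model is as uncertain as if it were choosing uniformly among 4 words. The values above 4000 in the log are large because the vocabulary has tens of thousands of words.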

After a long wait we will have the model...

Next we can start using it.

Getting topic distributions for a document:

In [48]:
print (art[0])
print (lda.transform(vect.transform(art[:1]))[0])

print (lda.components_.shape)
print (lda.components_[7])

Molar incisor hypomineralization (MIH): clinical presentation, aetiology and management.
In this paper, the current knowledge about Molar Incisor Hypomineralization (MIH) is presented. MIH is defined as hypomineralization of systemic origin of one to four permanent first molars frequently associated with affected incisors and these molars are related to major clinical problems in severe cases. At the moment, only limited data are available to describe the magnitude of the phenomenon. The prevalence of MIH in the different studies ranges from 3.6-25% and seems to differ in certain regions and birth cohorts. Several aetiological factors (for example, frequent childhood diseases) are mentioned as the cause of the defect. Children at risk should be monitored very carefully during the period of eruption of their first permanent molars. Treatment planning should consider the long-term prognosis of these teeth.


[ 0.0014  0.0014  0.0014  0.0014  0.0014  0.0014  0.0014  0.317   0.6714
  0.001

OK, seems like this document likes topics #7 and #8. But what does that even mean?

Let's create (actually steal from a scikit-learn example) a helper function to print the top words in our topics:

In [33]:
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i] for i in topic.argsort()[::-1][:n_top_words]]))
    print()

Now we can actually try to interpret the topics

In [34]:

# Note: older scikit-learn releases used vect.get_feature_names() instead
print_top_words(lda, vect.get_feature_names_out(), 5)

Topic #0:
using used species method based
Topic #1:
gene protein expression genes dna
Topic #2:
levels rats mg increased effects
Topic #3:
cells cell expression il induced
Topic #4:
isolates women lps depression strains
Topic #5:
neurons receptor receptors nerve ht
Topic #6:
activity protein kinase beta alpha
Topic #7:
patients disease clinical treatment study
Topic #8:
age results subjects associated study
Topic #9:
skin water surface structure phase



Seems like both topics #7 and #8 are related to clinical studies. I guess that fits our document?
It's good to note that e.g. the word "age", which is the most defining keyword for topic #8, doesn't actually appear in the document we tested.

In [46]:
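from sklearn.metrics.pairwise import cosine_similarity

# Topic distribution for every article: one row per article, one column per topic
art_topic_vectors = lda.transform(vect.transform(art))
print(art_topic_vectors.shape)
# Compare the first article against all articles (including itself)
similarities = cosine_similarity(art_topic_vectors[:1], art_topic_vectors)[0]
# Indices of the 5 most similar articles, most similar first
best_hits = similarities.argsort()[:-6:-1]
print(best_hits)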

(10000, 10)
[   0    1 3362  966 4408]


In [43]:
print (similarities[best_hits])

[ 1.      0.9999  0.9975  0.9971  0.9967]


In [45]:
for i in best_hits[1:]:
    print (art[i])

Root canal retreatment: I. Case assessment and treatment planning.
Root canal retreatment is often the preferred method of treating a tooth in which root canal treatment has failed. Part one of this two-part article discusses reasons for failure of root canal treatment, case assessment and treatment planning. Part two describes some of the practical techniques that are available to the practitioner and the rationale for root canal retreatment.


Use of a brief Smoking Consequences Questionnaire for Adults (SCQ-A) in African American smokers.
Purposes of the present study were to (a) examine psychometric properties of a brief Smoking Consequences Questionnaire-Adult (SCQ-A) among an African American sample and (b) explore differences in smoking expectancies across levels of smoking-nicotine dependence. Four hundred eighty-four smokers attending an urban health clinic completed the brief SCQ-A. Maximum likelihood factor extraction with a varimax rotation specifying 9 factors replicated 9

Seems pretty good! Remember that we have reduced each document into a dense 10-dimensional vector instead of using sparse TFIDF vectors with about 27K dimensions.