## Topic Modeling
https://radimrehurek.com/topic_modeling_tutorial/2%20-%20Topic%20Modeling.html

 - create an id => word mapping, aka dictionary (**gensim.corpora.Dictionary**)
 - transform a document into a bag-of-words vector, using a dictionary
 - transform a stream of documents into a stream of vectors (**doc2bow**)
 - transform between vector streams, using topic models
 - save and load trained models, for persistence
 - use manual and semi-automated methods to evaluate quality of a topic model
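
The first three steps can be illustrated without gensim. The following is a minimal pure-Python sketch of the idea, not gensim's implementation: `build_dictionary` and `doc2bow` are hypothetical helpers written for this note, and the IDs they assign need not match those produced by **gensim.corpora.Dictionary**.

```python
from collections import Counter

def build_dictionary(docs):
    """Assign an integer ID to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def doc2bow(doc, token2id):
    """Return sorted (token_id, count) pairs; tokens not in the mapping are skipped."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

docs = [["bank", "river", "water"], ["bank", "money", "bank"]]
token2id = build_dictionary(docs)
bows = [doc2bow(doc, token2id) for doc in docs]
print(token2id)
print(bows)
```

Each document becomes a sparse list of (id, count) pairs, which is the shape of input that the topic models further down consume.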

### Training LDA
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/lda_training_tips.ipynb

In [None]:
# Read data.

import os

# Folder containing all NIPS papers.
data_dir = 'nipstxt/'

# Folders containing individual NIPS papers.
yrs = ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
dirs = ['nips' + yr for yr in yrs]

# Read all texts into a list.
docs = []
for yr_dir in dirs:
    files = os.listdir(data_dir + yr_dir)
    for filen in files:
        # Note: ignoring characters that cause encoding errors.
        with open(data_dir + yr_dir + '/' + filen, encoding='utf-8', errors='ignore') as fid:
            txt = fid.read()
        docs.append(txt)

### Pre-process and vectorize the documents
Among other things, we will:

 - Split the documents into tokens.
 - Lemmatize the tokens.
 - Compute bigrams.
 - Compute a bag-of-words representation of the data.

First we tokenize the text using a regular expression tokenizer from NLTK. We remove numeric tokens and tokens that are only a single character, as they don't tend to be useful, and the dataset contains a lot of them.

In [None]:
# Tokenize the documents.

from nltk.tokenize import RegexpTokenizer

# Split the documents into tokens.
tokenizer = RegexpTokenizer(r'\w+')
for idx in range(len(docs)):
    # Tokenize the raw text directly. Joining its characters with spaces
    # first would split every word into single letters.
    docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.

# Remove numbers, but not words that contain numbers.
docs = [[token for token in doc if not token.isdigit()] for doc in docs]

# Remove words that are only one character, and convert the rest to lowercase.
docs = [[token.lower() for token in doc if len(token) > 1] for doc in docs]
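
To see the effect of these filters on a single string, here is a stdlib-only sketch. It uses `re.findall(r'\w+', ...)` as a stand-in for `RegexpTokenizer(r'\w+')`, on the assumption that the two behave the same for this pattern; `tokenize_and_filter` and the sample sentence are made up for illustration.

```python
import re

def tokenize_and_filter(text):
    """Tokenize on word characters, drop pure numbers and 1-char tokens, lowercase."""
    tokens = re.findall(r"\w+", text)
    tokens = [t for t in tokens if not t.isdigit()]   # "1995" goes, "2D" stays
    return [t.lower() for t in tokens if len(t) > 1]  # "a" and "x" go

sample = "In 1995, a 2D Gaussian model x was fit to 300 samples."
tokens = tokenize_and_filter(sample)
print(tokens)
# ['in', '2d', 'gaussian', 'model', 'was', 'fit', 'to', 'samples']
```

Note that `2D` survives the numeric filter because `isdigit()` is only true for tokens made up entirely of digits.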

We use the WordNet lemmatizer from NLTK. A lemmatizer is preferred over a stemmer in this case because it produces more readable words, and easy-to-read output is very desirable in topic modeling.

In [None]:
# Lemmatize the documents.

from nltk.stem.wordnet import WordNetLemmatizer

# Lemmatize all words in documents.
lemmatizer = WordNetLemmatizer()
docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]
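
The bigram step from the list above is not shown in these notes. In gensim it is usually handled by `gensim.models.Phrases`. As a rough illustration of the idea only, the sketch below appends a `w1_w2` token for every adjacent pair that occurs often enough. It uses a plain count threshold, which is a simplification and not the scoring that `Phrases` applies; `add_frequent_bigrams`, `min_count`, and the toy documents are hypothetical.

```python
from collections import Counter

def add_frequent_bigrams(docs, min_count=2):
    """Append 'w1_w2' tokens for adjacent pairs seen at least min_count times."""
    pair_counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    frequent = {pair for pair, n in pair_counts.items() if n >= min_count}
    return [
        doc + ["_".join(pair) for pair in zip(doc, doc[1:]) if pair in frequent]
        for doc in docs
    ]

toy_docs = [
    ["neural", "network", "training"],
    ["deep", "neural", "network"],
    ["network", "training", "data"],
]
with_bigrams = add_frequent_bigrams(toy_docs)
for doc in with_bigrams:
    print(doc)
```

The original unigrams are kept and the bigrams are added alongside them, so a later bag-of-words step can count both.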

### New Term Topics Methods and Document Coloring
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/topic_methods.ipynb

We now set up a corpus to show off the new get_term_topics and get_document_topics methods. A good way to do so is to play with words that have different meanings in different contexts.

The word bank is a good candidate: it can mean either a financial institution or a river bank. The toy corpus below has 11 documents, 5 river-related and 6 finance-related.

In [1]:
from gensim.corpora import Dictionary
from gensim.models import ldamodel
import numpy
%matplotlib inline



In [13]:
texts = [['bank','river','shore','water'],
        ['river','water','flow','fast','tree'],
        ['bank','water','fall','flow'],
        ['bank','bank','water','rain','river'],
        ['river','water','mud','tree'],
        ['money','transaction','bank','finance'],
        ['bank','borrow','money'], 
        ['bank','finance'],
        ['finance','money','sell','bank'],
        ['borrow','sell'],
        ['bank','loan','sell']]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
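
A quick count shows why bank is ambiguous in this corpus. The sketch below repeats the `texts` list so that it runs on its own, and assumes the first five documents are the river-related ones, as in the listing above.

```python
texts = [['bank', 'river', 'shore', 'water'],
         ['river', 'water', 'flow', 'fast', 'tree'],
         ['bank', 'water', 'fall', 'flow'],
         ['bank', 'bank', 'water', 'rain', 'river'],
         ['river', 'water', 'mud', 'tree'],
         ['money', 'transaction', 'bank', 'finance'],
         ['bank', 'borrow', 'money'],
         ['bank', 'finance'],
         ['finance', 'money', 'sell', 'bank'],
         ['borrow', 'sell'],
         ['bank', 'loan', 'sell']]

# Assumption: documents 0-4 are river-related, documents 5-10 finance-related.
river_docs, finance_docs = texts[:5], texts[5:]

bank_in_river = sum(doc.count('bank') for doc in river_docs)
bank_in_finance = sum(doc.count('bank') for doc in finance_docs)
print(bank_in_river, bank_in_finance)  # 4 5
```

The word appears almost equally often in both groups, so a two-topic model should give it noticeable weight in each topic.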

In [27]:

numpy.random.seed(27)  # Set the random seed to get the same results each time.
model = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=23,
                          iterations=400, minimum_phi_value=0.05)

In [28]:
model.show_topics()

[(0,
  u'0.184*"water" + 0.150*"river" + 0.147*"bank" + 0.083*"flow" + 0.083*"tree" + 0.050*"fast" + 0.050*"fall" + 0.050*"shore" + 0.050*"mud" + 0.050*"rain"'),
 (1,
  u'0.214*"bank" + 0.134*"money" + 0.134*"sell" + 0.134*"finance" + 0.095*"borrow" + 0.057*"transaction" + 0.057*"loan" + 0.019*"water" + 0.019*"river" + 0.019*"rain"')]
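
When working from printed output like the above, the topic strings can be parsed back into (word, weight) pairs. This is a small stdlib sketch; `parse_topic` is a hypothetical helper, and the two strings are shortened copies of the output shown. With a live model, gensim can return the pairs directly (for example via `show_topic`), so parsing is only needed for saved text.

```python
import re

def parse_topic(topic_str):
    """Turn '0.184*"water" + 0.150*"river"' into [('water', 0.184), ('river', 0.15)]."""
    return [(word, float(weight))
            for weight, word in re.findall(r'([0-9.]+)\*"([^"]+)"', topic_str)]

topic0 = '0.184*"water" + 0.150*"river" + 0.147*"bank"'
topic1 = '0.214*"bank" + 0.134*"money" + 0.134*"sell"'

words0 = {word for word, _ in parse_topic(topic0)}
words1 = {word for word, _ in parse_topic(topic1)}
shared = words0 & words1
print(shared)  # {'bank'}
```

Among the top three words of each topic, bank is the only one the two topics share.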

### get_term_topics

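The function get_term_topics returns, for a given word, the topics it most likely belongs to, together with the corresponding probabilities. Topics whose probability falls below the minimum_probability threshold are left out of the result. A few examples: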

In [29]:
model.get_term_topics('water')

[(0, 0.17019226)]

In [30]:
model.get_term_topics('finance')

[(1, 0.11727806)]

In [31]:
model.get_term_topics('bank')

[(0, 0.13316578), (1, 0.1992695)]
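
These results can be summarized programmatically. The sketch below hard-codes the three outputs shown above so it runs without a trained model; `dominant_topic` is a hypothetical helper, not part of gensim.

```python
# Outputs of model.get_term_topics(...) copied from the cells above.
term_topics = {
    'water': [(0, 0.17019226)],
    'finance': [(1, 0.11727806)],
    'bank': [(0, 0.13316578), (1, 0.1992695)],
}

def dominant_topic(pairs):
    """Return the topic ID with the highest probability."""
    return max(pairs, key=lambda pair: pair[1])[0]

# Words that pass the threshold in more than one topic are ambiguous.
ambiguous = sorted(word for word, pairs in term_topics.items() if len(pairs) > 1)
dominant = {word: dominant_topic(pairs) for word, pairs in term_topics.items()}

print(ambiguous)  # ['bank']
print(dominant)   # {'water': 0, 'finance': 1, 'bank': 1}
```

Only bank is assigned to both topics, with the finance topic (1) slightly ahead, which matches the near-even split of the word across the two groups of documents.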