# Latent Dirichlet Allocation (LDA) Pipeline Example in Python

This tutorial shows how to preprocess text data and train an LDA model on it.

LDA performs better when stop words are removed and words are lemmatized, but the topics it then yields are made of lemmatized words, which are often ugly to read. To fix that, it helps to apply an inverse lemmatization to the topic words (or topic n-grams) that the LDA yields once it is trained on the data.

So let's build a pipeline that looks like this:

1. Load a dataset of many comments (or documents)
2. Transform comments to remove stop words
3. Lemmatize the comments without stop words for a better LDA
4. Perform LDA topic modeling
5. Recover words from inverse (backwards) lemmatization on topic words 
6. Get clean topic words (or expressions)

Note: The classes imported are clean and have unit tests. Don't hesitate to dive in and to check what's under the hood!


In [1]:

from app.data.load_sample_data import load_sample_data
from app.logic.stop_words_remover import StopWordsRemover
from app.logic.stemmer import Stemmer
from app.logic.lda import LDA


## Load a dataset of many comments (or documents)

In [2]:
messages, comments = load_sample_data()

## Transform comments to remove stop words

In [3]:
fr_en_stopwords = StopWordsRemover()
comments_without_stopwords = fr_en_stopwords.remove_from_many_strings(comments)
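
To make the step above concrete, here is a minimal sketch of what stop-word removal does. It is not the implementation of `StopWordsRemover`: the function name `remove_stop_words` and the tiny `STOP_WORDS` set are assumptions made up for illustration.

```python
import re

# Tiny illustrative stop-word list (an assumption, not the list used by StopWordsRemover).
STOP_WORDS = {"le", "la", "les", "un", "une", "et", "est", "sur",
              "the", "a", "is", "on", "and"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return " ".join(token for token in tokens if token not in stop_words)


comments = ["Le chat est sur le tapis", "The cat is on the mat"]
print([remove_stop_words(comment) for comment in comments])
```

A real remover would typically rely on a much larger, language-specific list and handle elisions such as "s'est" or "n'est".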

## Lemmatize the comments without stop words for clean texts

In [4]:
french_stemmer = Stemmer(language='french')
stemmed_comments = [french_stemmer.lemmatize(thread) for thread in comments_without_stopwords]

## Perform LDA topic modeling

In [5]:
lda = LDA(n_topics=2, max_iter=5, learning_method='online', learning_offset=50.)
lda_sklearn, feature_names = lda.fit(stemmed_comments[-4])

## Recover words from inverse (backwards) lemmatization on topic words 

In [6]:
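# For reference, the constructor parameters of scikit-learn's LatentDirichletAllocation
# (an older release in which n_topics still existed alongside n_components):
# (n_components=10, doc_topic_prior=None, topic_word_prior=None, learning_method=None, learning_decay=0.7, 
# learning_offset=10.0, max_iter=10, batch_size=128, evaluate_every=-1, total_samples=1000000.0, perp_tol=0.1, 
# mean_change_tol=0.001, max_doc_update_iter=100, n_jobs=1, verbose=0, random_state=None, n_topics=None)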
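
One possible way to implement the recovery step is to remember, while stemming, which original words produced each stem, and then map every topic stem back to its most frequent original form. The sketch below illustrates that idea only. `toy_stem`, `stem_and_record` and `restore` are hypothetical names, and the suffix-stripping stemmer is a stand-in for a real one.

```python
from collections import Counter, defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def stem_and_record(documents):
    """Stem every word and count which surface forms produced each stem."""
    surface_forms = defaultdict(Counter)
    stemmed_documents = []
    for document in documents:
        stems = []
        for word in document.lower().split():
            stem = toy_stem(word)
            surface_forms[stem][word] += 1
            stems.append(stem)
        stemmed_documents.append(" ".join(stems))
    return stemmed_documents, surface_forms


def restore(stem, surface_forms):
    """Map a stem back to its most frequent original word (the stem itself if unseen)."""
    if stem not in surface_forms:
        return stem
    return surface_forms[stem].most_common(1)[0][0]


stemmed, forms = stem_and_record(
    ["connected servers", "connected connecting server server"]
)
print(stemmed)
print([restore(stem, forms) for stem in ("connect", "server")])
```

Choosing the most frequent surface form is a design choice: it favors the wording that readers of the corpus actually saw, at the cost of hiding rarer variants.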

## Some clean topic words (or expressions) are then available!

## TODO: clean below

In [None]:
from sklearn.pipeline import Pipeline

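# Draft: assumes StopWordsRemover, Stemmer and LDA expose scikit-learn's fit/transform interface.
text_clf = Pipeline([('stopwords', StopWordsRemover()),
                     ('stemmer', Stemmer()),
                     ('lda', LDA(n_topics=2, max_iter=5, learning_method='online', learning_offset=50.)),
])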
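
As a point of comparison for the draft above, here is a sketch of a working pipeline built only from stock scikit-learn components. It is an assumption about where this draft could go, not the project's code: the stop-word list and documents are invented, and no stemming step is included.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Illustrative stop-word list and documents (assumptions, not the notebook's data).
stop_words = ["le", "la", "est", "sur", "et"]
documents = [
    "le chat est sur le tapis",
    "le chat et la souris",
    "le serveur est en panne",
    "la panne et le serveur",
]

topic_pipeline = Pipeline([
    # Tokenize, drop stop words, and count the remaining words.
    ("counts", CountVectorizer(stop_words=stop_words)),
    # Fit the topic model on the count matrix.
    ("lda", LatentDirichletAllocation(n_components=2, random_state=0)),
])

# Each row is one document's distribution over the two topics.
doc_topics = topic_pipeline.fit_transform(documents)
print(doc_topics.shape)
```

A stemming step could be plugged in through the `tokenizer` or `analyzer` parameter of `CountVectorizer`.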

In [7]:
import sklearn

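# Refit with a growing iteration budget and compare the resulting score and perplexity.
for i in range(2, 10):
    # Simpler alternative configuration:
    # lda = LDA(n_topics=2, max_iter=i, learning_method='online', learning_offset=50.)
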
    lda = LDA(
        n_topics=2, max_iter=i, learning_decay=0.7, learning_method='online', 
        learning_offset=10.0, batch_size=128, total_samples=1000000.0, 
        mean_change_tol=0.001, max_doc_update_iter=100, 
        verbose=1,
        n_jobs=-1,  # Use all CPUs
        evaluate_every=1)
    
    lda_sklearn, feature_names = lda.fit(stemmed_comments[-2])
    print(i, lda.score(stemmed_comments[-2]), lda.perplexity(stemmed_comments[-2]))

iteration: 1 of max_iter: 2, perplexity: 49.1109
iteration: 2 of max_iter: 2, perplexity: 44.6179
2 -292.4564070040818 44.61790143210732
iteration: 1 of max_iter: 3, perplexity: 49.1109
iteration: 2 of max_iter: 3, perplexity: 44.6179
iteration: 3 of max_iter: 3, perplexity: 42.1824
3 -288.13421703247997 42.18238929273422
iteration: 1 of max_iter: 4, perplexity: 49.1109
iteration: 2 of max_iter: 4, perplexity: 44.6179
iteration: 3 of max_iter: 4, perplexity: 42.1824
iteration: 4 of max_iter: 4, perplexity: 40.6754
4 -285.33295079087077 40.6753695112215
iteration: 1 of max_iter: 5, perplexity: 49.1109
iteration: 2 of max_iter: 5, perplexity: 44.6179
iteration: 3 of max_iter: 5, perplexity: 42.1824
iteration: 4 of max_iter: 5, perplexity: 40.6754
iteration: 5 of max_iter: 5, perplexity: 39.6464
5 -283.3600458998749 39.64641770588519
iteration: 1 of max_iter: 6, perplexity: 49.1109
iteration: 2 of max_iter: 6, perplexity: 44.6179
iteration: 3 of max_iter: 6, perplexity: 42.1824
iteration:
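
The score and perplexity printed by the loop above appear to be tied together by perplexity = exp(-score / N), where N is the number of word tokens in the scored data. This is how scikit-learn derives perplexity from its approximate log-likelihood. The sketch below checks that against two of the printed pairs. The value N = 77 is an assumption: it is inferred from the printed numbers, not stated in the notebook.

```python
import math


def perplexity_from_score(score, word_count):
    """Perplexity implied by a total log-likelihood over word_count tokens."""
    return math.exp(-score / word_count)


# (score, perplexity) pairs printed by the loop above.
printed = [
    (-292.4564070040818, 44.61790143210732),
    (-288.13421703247997, 42.18238929273422),
]

# Assumption: 77 tokens, the value that makes the printed pairs consistent.
WORD_COUNT = 77

for score, perplexity in printed:
    print(round(perplexity_from_score(score, WORD_COUNT), 4), round(perplexity, 4))
```

On a fixed dataset this means the two metrics always move in opposite directions: a higher (less negative) score gives a lower perplexity.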

In [14]:

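# Import the function explicitly: a bare `import sklearn` is not guaranteed to load submodules.
from sklearn.model_selection import train_test_split

# Hold out 20% of the comments to evaluate the model on unseen data.
train, test = train_test_split(stemmed_comments[-2], test_size=0.2)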

print(len(train), len(test))


lda = LDA(
    n_topics=2, max_iter=100, learning_decay=0.5, learning_method='online', 
    learning_offset=10.0, batch_size=128,
    mean_change_tol=0.001, max_doc_update_iter=100, 
    verbose=1,
    n_jobs=-1,  # Use all CPUs
    evaluate_every=1)

lda_sklearn, feature_names = lda.fit(train)

print(lda.score(train), lda.perplexity(train))
print(lda.score(test), lda.perplexity(test))
# -182.51029019702543   27.615270885701467
# -52.05494670051718    41.19061662552155   
# -41.973876576664225  189.9450029464927    
# -32.432059761463556 3320.9791408284505    
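# On a fixed dataset, a better model has a lower perplexity and a higher (less negative) score.
# The score is a total log-likelihood, so it also shrinks in magnitude when there are fewer words:
# only compare scores computed on the same data.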

20 5
iteration: 1 of max_iter: 100, perplexity: 43.5576
iteration: 2 of max_iter: 100, perplexity: 40.0195
iteration: 3 of max_iter: 100, perplexity: 38.2824
iteration: 4 of max_iter: 100, perplexity: 37.3002
iteration: 5 of max_iter: 100, perplexity: 36.6836
iteration: 6 of max_iter: 100, perplexity: 36.2304
iteration: 7 of max_iter: 100, perplexity: 35.8177
iteration: 8 of max_iter: 100, perplexity: 35.3896
iteration: 9 of max_iter: 100, perplexity: 34.9447
iteration: 10 of max_iter: 100, perplexity: 34.5067
iteration: 11 of max_iter: 100, perplexity: 34.1005
iteration: 12 of max_iter: 100, perplexity: 33.7404
iteration: 13 of max_iter: 100, perplexity: 33.4277
iteration: 14 of max_iter: 100, perplexity: 33.1558
iteration: 15 of max_iter: 100, perplexity: 32.9200
iteration: 16 of max_iter: 100, perplexity: 32.7227
iteration: 17 of max_iter: 100, perplexity: 32.5636
iteration: 18 of max_iter: 100, perplexity: 32.4344
iteration: 19 of max_iter: 100, perplexity: 32.3269
iteration: 20 of

In [None]:
len(stemmed_comments[-2]), len(stemmed_comments[-3]), len(stemmed_comments[-4])

In [None]:
[(len(thread), index) for index, thread in enumerate(stemmed_comments)]

In [None]:
# fr_en_stopwords = StopWordsRemover()
fr_en_stopwords.remove_from_string("Le chat s'est assis sur le tapis aujour'hui!!! il est très comfortable et n'est pas déçu, tout ronron!")