# Multilingual Latent Dirichlet Allocation (LDA) Pipeline - the Tutorial


Below is a tutorial on how to process data and train an LDA model on it. First, we're going to use the [multilingual LDA library](https://github.com/ArtificiAI/Multilingual-Latent-Dirichlet-Allocation-LDA) as-is to get an overview of what it can do. Second, we'll redo the same thing while exposing the underlying pipeline. Third, we're going to dissect the pipeline and inspect the intermediate transformations of the data so you can learn precisely how it works. As an overview, the pipeline looks like this:

1. Try to train with words. For this to work, the comments need to share at least some words with each other once they are [stemmed](https://github.com/ArtificiAI/Multilingual-Latent-Dirichlet-Allocation-LDA/blob/master/Stemming-words-from-multiple-languages.ipynb) and stripped of stop words.
  1. Forward pass
    1. Remove stop words
    2. Stem words
    3. Vectorize the remaining stemmed words to use their counts as features for the LDA. Words will appear as 1-grams, and there will also be some 2-grams (pairs of words).
    4. Train an LDA on those features
  2. Backward pass
    1. Invert the LDA by returning the top features per topic
    2. Invert those top features with the count vectorizer, which gives back the words (1-grams) or the pairs of words (2-grams).
    3. Un-stem the words or the 2-grams with a custom inverse stemming algorithm
    4. Stop words that originally sat between the two words of a 2-gram are not reintroduced at this point (TO DO)
  3. Finally, split the 1-grams from the 2-grams. Also, extract the top comments for each category.
2. If the previous step failed, we retry with a modified pipeline that trains on n-grams of letters instead of words. To do that, we replace the stemmer with a letter splitter that splits the text into letters before the featurization. The output of the inverse pass will be hard to interpret, but clustering still works in that case, so we can at least put each comment in a category and find the top comments of those categories.
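
To make the fallback of step 2 more concrete, here is a minimal sketch of what features made of letter n-grams could look like. The helper name `letter_ngrams` and the choice of `n=3` are assumptions made for this illustration; the library's own `LetterSplitter` may well work differently.

```python
def letter_ngrams(text, n=3):
    """Return the overlapping n-grams of letters found in `text`."""
    # Lowercase and collapse runs of whitespace so spacing does not create noise.
    normalized = " ".join(text.lower().split())
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


print(letter_ngrams("Les chats", n=3))
# ['les', 'es ', 's c', ' ch', 'cha', 'hat', 'ats']
```

Two comments that share no whole word can still share such letter n-grams, which is why clustering can still work in that mode.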

Note: The classes imported are clean and have unit tests. Don't hesitate to dive in and to check what's under the hood after or while reading!

## Overview: the why

We want to gain insight into the data. We want to automatically sort comments into categories, find the top comments per category, and represent the categories by their top words or top n-grams.

Let's dive in. First, here is an overview of what the whole thing does. It is only a very simple example designed to be understood easily, so we will ask for two categories here: comments about cats ("chats" in French), and dogs ("chiens" in French). 

We have French text here, but the pipeline works for any language supported by the [Snowball stemmer](http://snowball.tartarus.org/texts/stemmersoverview.html), provided that you have a list of stop words for that language. In our tests, the stop words removal step turned out to be quite important.


In [12]:
from pprint import pprint
from artifici_lda.lda_service import train_lda_pipeline_default


FR_STOPWORDS = [
    "le", "les", "la", "un", "de", "en",  # stop words
    "a", "b", "c", "d",  # 1 char words are removed too
    "est", "sur", "tres", "donc", "sont",  # can even mix in some more common words / borderline stop words.
    # even having slang/texto stop words can be good:
    "ya", "pis", "yer"]
# Note: this list of stop words is poor and has been crafted for this example.

fr_comments = [
    "Un super-chat marche sur le trottoir",
    "Les super-chats aiment ronronner",
    "Les chats sont ronrons",
    "Un super-chien aboie",
    "Deux super-chiens",
    "Combien de chiens sont en train d'aboyer?"
]

transformed_comments, top_comments, _1_grams, _2_grams = train_lda_pipeline_default(
    fr_comments,
    n_topics=2,
    stopwords=FR_STOPWORDS,
    language='french')
# More languages: 
# ['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 
#  'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish', 'turkish']

pprint(transformed_comments)
pprint(top_comments)
pprint(_1_grams)
pprint(_2_grams)


array([[0.14218066, 0.85781934],
       [0.11032926, 0.88967074],
       [0.16960699, 0.83039301],
       [0.88966976, 0.11033024],
       [0.85781743, 0.14218257],
       [0.83039307, 0.16960693]])
['Un super-chien aboie', 'Les super-chats aiment ronronner']
[[('chiens', 3.4911389446318633), ('super', 2.4999405011943825)],
 [('chats', 3.491141575287711), ('super', 2.5000594988056135)]]
[[('super chiens', 2.4921013713235154)], [('super chats', 2.4921054657872785)]]


## What's in the pipeline

Let's dig into the function `train_lda_pipeline_default(...)` and see what it does. It creates an `lda_pipeline` using scikit-learn's `Pipeline` class. This class can chain many other classes, which we have adapted to the pipeline and improved for our usage.

It chains a `StopWordsRemover()`, a `Stemmer()`, a `CountVectorizer()`, and finally the `LDA()`, with their respective (hyper)parameters stored in a dict.

Each of those chained data-transforming classes needs to implement these methods:
- `fit`: to fit the data before transforming it.
- `transform`: to transform the data.
- `inverse_transform`: once we have transformed data, we can feed it back into the pipeline in reverse order to get from LDA's topics to a more natural description of those topics.

Note that `fit_transform` is already implemented for each of those classes: it simply calls `fit` and then `transform` right after, on the very same data. We'll use `fit_transform` everywhere below as a shortcut.

So the pipeline basically does this: 
1. Fit everything and then transform everything, class by class, moving forward in the pipeline. At the output of the LDA, we get the topic probabilities for each comment, from which we can read its top topic.
2. We want not only the top topics but also some description of them. So we need the `inverse_transform` method to get the words of each topic in a legible manner (e.g., undo the stemming and undo the featurization).
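
Before looking at the real classes, the contract described above can be sketched with toy steps in plain Python. Everything here (`Lowercaser`, `Tokenizer`, `ToyPipeline`) is a hypothetical stand-in written for this tutorial, not the library's code nor scikit-learn's `Pipeline`; it only shows the forward pass running step by step and the backward pass running through the same steps in reverse order.

```python
class Lowercaser:
    """Toy step: lowercases comments and remembers one original spelling per word."""

    def fit(self, X, y=None):
        self.original_ = {}
        for comment in X:
            for word in comment.split():
                self.original_.setdefault(word.lower(), word)
        return self

    def transform(self, X):
        return [comment.lower() for comment in X]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

    def inverse_transform(self, X):
        # X is a list of word lists; map each word back to its original spelling.
        return [[self.original_.get(word, word) for word in words] for words in X]


class Tokenizer:
    """Toy step: splits comments into words; its inverse pass returns its input."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [comment.split() for comment in X]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

    def inverse_transform(self, X):
        return X


class ToyPipeline:
    """Chains steps forward for fit_transform and in reverse for inverse_transform."""

    def __init__(self, steps):
        self.steps = steps

    def fit_transform(self, X):
        for step in self.steps:
            X = step.fit_transform(X)
        return X

    def inverse_transform(self, X):
        for step in reversed(self.steps):
            X = step.inverse_transform(X)
        return X


toy = ToyPipeline([Lowercaser(), Tokenizer()])
print(toy.fit_transform(["Deux Chiens", "Un Chat"]))  # [['deux', 'chiens'], ['un', 'chat']]
print(toy.inverse_transform([["chiens", "chat"]]))    # [['Chiens', 'Chat']]
```

Note that the backward pass is not a strict mathematical inverse: just like in the real pipeline, it takes a description of the topics and makes it more readable at each step.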

Let's see how all this can be put together:

In [36]:
# The code directly below is derived from the file `lda_service/lda_service.py` and is simplified

from artifici_lda.data_utils import link_topics_and_weightings, get_top_comments, split_1_grams_from_n_grams, \
    get_lda_params_with_specific_n_cluster_or_language, get_word_weightings
from artifici_lda.logic.letter_splitter import LetterSplitter
from artifici_lda.logic.stop_words_remover import StopWordsRemover
from artifici_lda.logic.stemmer import Stemmer, FRENCH
from artifici_lda.logic.lda import LDA
from artifici_lda.logic.count_vectorizer import CountVectorizer

from sklearn.pipeline import Pipeline

LDA_PIPELINE_PARAMS_WORDS = {
    'stopwords__stopwords': None,
    'stemmer__language': FRENCH,  # ENGLISH
    'count_vect__max_df': 0.98,
    'count_vect__min_df': 2,
    'count_vect__max_features': 10000,
    'count_vect__ngram_range': (1, 2),
    'count_vect__strip_accents': None,
    'lda__n_components': 2,
    'lda__max_iter': 750,
    'lda__learning_decay': 0.5,
    'lda__learning_method': 'online',
    'lda__learning_offset': 10,
    'lda__batch_size': 25,
    'lda__n_jobs': -1,  # Use all CPUs
}

lda_pipeline = Pipeline([
    ('stopwords', StopWordsRemover()),
    ('stemmer', Stemmer()),
    ('count_vect', CountVectorizer()),
    ('lda', LDA()),
]).set_params(**LDA_PIPELINE_PARAMS_WORDS)

# Fit the data
transformed_comments = lda_pipeline.fit_transform(fr_comments)
print("Probabilities of categories for comments:")
pprint(transformed_comments)

top_comments = get_top_comments(fr_comments, transformed_comments)
print("Top comments per categories:")
pprint(top_comments)

# Extract information about data
topic_words = lda_pipeline.inverse_transform(X=None)
topic_words_weighting = get_word_weightings(lda_pipeline)
topics_words_and_weightings = link_topics_and_weightings(topic_words, topic_words_weighting)
print("Top words that defines the categories, and their weighting:")
pprint(topics_words_and_weightings)

# Manipulations on the information for a clean return.
_1_grams, _2_grams = split_1_grams_from_n_grams(topics_words_and_weightings)
print("Same as the top that defines the categories and their weighting, but here the 1-grams are splitted from the 2-grams:")
pprint(_1_grams)
pprint(_2_grams)

Probabilities of categories for comments:
array([[0.14218136, 0.85781864],
       [0.11032962, 0.88967038],
       [0.16960697, 0.83039303],
       [0.88967009, 0.11032991],
       [0.85781808, 0.14218192],
       [0.83039304, 0.16960696]])
Top comments per categories:
['Un super-chien aboie', 'Les super-chats aiment ronronner']
Top words that defines the categories, and their weighting:
[[('chiens', 3.491138564268627),
  ('super', 2.4999824170081246),
  ('super chiens', 2.4921007611254002)],
 [('chats', 3.4911393619816162),
  ('super', 2.5000175829918696),
  ('super chats', 2.492101963052467)]]
Same as the top that defines the categories and their weighting, but here the 1-grams are splitted from the 2-grams:
[[('chiens', 3.491138564268627), ('super', 2.4999824170081246)],
 [('chats', 3.4911393619816162), ('super', 2.5000175829918696)]]
[[('super chiens', 2.4921007611254002)], [('super chats', 2.492101963052467)]]
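
The last step, splitting the 1-grams from the 2-grams, can be pictured with the sketch below. It assumes that a term containing a space is an n-gram of several words; the function name is hypothetical and the library's `split_1_grams_from_n_grams` may be implemented differently. The weights are rounded from the output above.

```python
def split_unigrams_from_ngrams(topics):
    """Split each topic's (term, weight) pairs on whether the term contains a space."""
    unigrams = [[(term, w) for term, w in topic if " " not in term] for topic in topics]
    ngrams = [[(term, w) for term, w in topic if " " in term] for topic in topics]
    return unigrams, ngrams


topics = [
    [("chiens", 3.49), ("super", 2.50), ("super chiens", 2.49)],
    [("chats", 3.49), ("super", 2.50), ("super chats", 2.49)],
]
one_grams, two_grams = split_unigrams_from_ngrams(topics)
print(one_grams)  # [[('chiens', 3.49), ('super', 2.5)], [('chats', 3.49), ('super', 2.5)]]
print(two_grams)  # [[('super chiens', 2.49)], [('super chats', 2.49)]]
```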


## How it works: inspecting each part of the pipeline (forward)

Now that we have a good overview, let's dig in and run each step by hand, without the `Pipeline` object, to see each intermediate result.

In [37]:
from artifici_lda.data_utils import get_params_from_prefix_dict
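
Judging from how it is used below, `get_params_from_prefix_dict` selects the parameters of a single step out of the pipeline-wide dict. Here is a minimal sketch of that idea, under the assumption that it keeps the keys starting with the prefix and strips the prefix off; the name `params_for_prefix` and the small dict are made up for the illustration.

```python
def params_for_prefix(prefix, params):
    """Keep the entries whose key starts with `prefix`, with the prefix removed."""
    return {
        key[len(prefix):]: value
        for key, value in params.items()
        if key.startswith(prefix)
    }


pipeline_params = {
    "stemmer__language": "french",
    "count_vect__min_df": 2,
    "count_vect__ngram_range": (1, 2),
    "lda__n_components": 2,
}
print(params_for_prefix("count_vect__", pipeline_params))
# {'min_df': 2, 'ngram_range': (1, 2)}
```

The resulting dict can then be unpacked directly into the constructor of the matching class, as in `CountVectorizer(**count_vect_params)`.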

### Removing stop words

In [46]:
stopwords_params = get_params_from_prefix_dict(
    param_prefix="stopwords__", 
    lda_pipeline_params=LDA_PIPELINE_PARAMS_WORDS)

swr = StopWordsRemover(**stopwords_params)

print("Original comments:")
pprint(fr_comments)
comments_without_stopwords = swr.fit_transform(fr_comments)
print("")
print("Comments without stopwords:")
pprint(comments_without_stopwords)

Original comments:
['Un super-chat marche sur le trottoir',
 'Les super-chats aiment ronronner',
 'Les chats sont ronrons',
 'Un super-chien aboie',
 'Deux super-chiens',
 "Combien de chiens sont en train d'aboyer?"]

Comments without stopwords:
['super-chat marche trottoir',
 'super-chats aiment ronronner',
 'chats ronrons',
 'super-chien aboie',
 'Deux super-chiens',
 'Combien chiens train aboyer?']
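
The output above suggests that words are compared to the stop words regardless of case ("Un" and "Les" are gone) and that the apostrophe acts as a separator ("d'aboyer?" became "aboyer?"). A rough sketch that reproduces this behavior on our example could look as follows. This is our guess for illustration purposes, not the library's actual `StopWordsRemover`.

```python
import re


def remove_stop_words(comment, stop_words):
    """Drop stop words, comparing case-insensitively; split on spaces and apostrophes."""
    stop_words = {word.lower() for word in stop_words}
    tokens = re.split(r"[\s']+", comment.strip())
    return " ".join(
        token for token in tokens if token and token.lower() not in stop_words
    )


stop_words = ["le", "les", "la", "un", "de", "en", "d", "sur", "sont"]
print(remove_stop_words("Combien de chiens sont en train d'aboyer?", stop_words))
# Combien chiens train aboyer?
```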


### Stemming words

In [49]:
stemmer_params = get_params_from_prefix_dict(
    param_prefix="stemmer__", 
    lda_pipeline_params=LDA_PIPELINE_PARAMS_WORDS)

st = Stemmer(**stemmer_params)

print("Original comments (already without stop words):")
pprint(comments_without_stopwords)
comments_without_stopwords_stemmed = st.fit_transform(comments_without_stopwords)
print("")
print("Stemmed comments:")
pprint(comments_without_stopwords_stemmed)
print("Custom stemmer's cache that was saved for the inverse pass later on which "
      "will need to choose the top corresponding words back from their counts:")
pprint(st.stemmed_word_to_equiv_word_count)

Original comments (already without stop words):
['super-chat marche trottoir',
 'super-chats aiment ronronner',
 'chats ronrons',
 'super-chien aboie',
 'Deux super-chiens',
 'Combien chiens train aboyer?']

Stemmed comments:
['sup chat march trottoir',
 'sup chat aiment ronron',
 'chat ronron',
 'sup chien aboi',
 'deux sup chien',
 'combien chien train aboi']
Custom stemmer's cache that was saved for the inverse pass later on which will need to choose the top corresponding words back from their counts:
{'aboi': {'aboie': 1, 'aboyer': 1},
 'aiment': {'aiment': 1},
 'chat': {'chat': 1, 'chats': 2},
 'chien': {'chien': 1, 'chiens': 2},
 'combien': {'Combien': 1},
 'deux': {'Deux': 1},
 'march': {'marche': 1},
 'ronron': {'ronronner': 1, 'ronrons': 1},
 'sup': {'super': 4},
 'train': {'train': 1},
 'trottoir': {'trottoir': 1}}


### Converting words to 1-gram and 2-gram features

In [52]:
count_vect_params = get_params_from_prefix_dict(
    param_prefix="count_vect__", 
    lda_pipeline_params=LDA_PIPELINE_PARAMS_WORDS)

cv = CountVectorizer(**count_vect_params)

print("Comments stemmed and without stopwords:")
pprint(comments_without_stopwords_stemmed)
comments_without_stopwords_stemmed_vectorized = cv.fit_transform(comments_without_stopwords_stemmed)
print("")
print("Vectorized comments:")
print(type(comments_without_stopwords_stemmed_vectorized))
pprint(comments_without_stopwords_stemmed_vectorized.toarray())
print("The features in the matrix contains 1-gram and then 2-grams, such as:")
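# Note: scikit-learn 1.0 deprecated `get_feature_names()` in favor of
# `get_feature_names_out()` (and removed it in 1.2), so recent versions may need the latter.
pprint(cv.get_feature_names())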

Comments stemmed and without stopwords:
['sup chat march trottoir',
 'sup chat aiment ronron',
 'chat ronron',
 'sup chien aboi',
 'deux sup chien',
 'combien chien train aboi']

Vectorized comments:
<class 'scipy.sparse.csr.csr_matrix'>
array([[0, 1, 0, 0, 1, 1, 0],
       [0, 1, 0, 1, 1, 1, 0],
       [0, 1, 0, 1, 0, 0, 0],
       [1, 0, 1, 0, 1, 0, 1],
       [0, 0, 1, 0, 1, 0, 1],
       [1, 0, 1, 0, 0, 0, 0]], dtype=int64)
The features in the matrix contains 1-gram and then 2-grams, such as:
['aboi', 'chat', 'chien', 'ronron', 'sup', 'sup chat', 'sup chien']
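
To see where those seven features come from, here is a plain-Python sketch of the counting. It assumes whitespace tokenization and only applies the `min_df=2` filter (an n-gram must appear in at least two comments); `max_df` and `max_features` are left out because they change nothing on this tiny example. It is a simplified stand-in, not scikit-learn's implementation.

```python
from collections import Counter


def ngrams(tokens, n_min=1, n_max=2):
    """Yield every n-gram of the token list, joined by spaces."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def count_vectorize(documents, min_df=2):
    """Return (features, rows), keeping n-grams seen in at least `min_df` documents."""
    per_document = [Counter(ngrams(doc.split())) for doc in documents]
    # Iterating over a Counter yields each distinct term once per document.
    document_frequency = Counter(term for counts in per_document for term in counts)
    features = sorted(term for term, df in document_frequency.items() if df >= min_df)
    rows = [[counts[term] for term in features] for counts in per_document]
    return features, rows


stemmed = [
    "sup chat march trottoir",
    "sup chat aiment ronron",
    "chat ronron",
    "sup chien aboi",
    "deux sup chien",
    "combien chien train aboi",
]
features, rows = count_vectorize(stemmed)
print(features)  # ['aboi', 'chat', 'chien', 'ronron', 'sup', 'sup chat', 'sup chien']
print(rows[0])   # [0, 1, 0, 0, 1, 1, 0]
```

Words such as "march" or "trottoir" appear in a single comment only, so they are dropped: they could not help to link comments together anyway.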


### LDA on the word features

In [65]:
lda_params = get_params_from_prefix_dict(
    param_prefix="lda__", 
    lda_pipeline_params=LDA_PIPELINE_PARAMS_WORDS)

lda = LDA(**lda_params)

print("Original comments:")
pprint(fr_comments)
print("Comments, featurized:")
pprint(comments_without_stopwords_stemmed_vectorized.toarray())
print("")
comments_lda = lda.fit_transform(comments_without_stopwords_stemmed_vectorized)
print("Clusterized comments:")
pprint(comments_lda)
print("Let's see their category (argmax on inner dimension):")
pprint(comments_lda.argmax(-1))

Original comments:
['Un super-chat marche sur le trottoir',
 'Les super-chats aiment ronronner',
 'Les chats sont ronrons',
 'Un super-chien aboie',
 'Deux super-chiens',
 "Combien de chiens sont en train d'aboyer?"]
Comments, featurized:
array([[0, 1, 0, 0, 1, 1, 0],
       [0, 1, 0, 1, 1, 1, 0],
       [0, 1, 0, 1, 0, 0, 0],
       [1, 0, 1, 0, 1, 0, 1],
       [0, 0, 1, 0, 1, 0, 1],
       [1, 0, 1, 0, 0, 0, 0]], dtype=int64)

Clusterized comments:
array([[0.14218173, 0.85781827],
       [0.11032981, 0.88967019],
       [0.16960697, 0.83039303],
       [0.88967027, 0.11032973],
       [0.85781842, 0.14218158],
       [0.83039303, 0.16960697]])
Let's see their category (argmax on inner dimension):
array([1, 1, 1, 0, 0, 0])
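
The same matrix also tells us which comment best represents each category: it is the comment with the highest probability in that category's column. The sketch below shows this in plain Python on the values printed above. The function names are ours, and the library's `get_top_comments` may proceed differently.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)


def top_comment_per_topic(comments, topic_probabilities):
    """For each topic (column), return the comment with the highest probability."""
    n_topics = len(topic_probabilities[0])
    top = []
    for topic in range(n_topics):
        column = [row[topic] for row in topic_probabilities]
        top.append(comments[argmax(column)])
    return top


comments = [
    "Un super-chat marche sur le trottoir",
    "Les super-chats aiment ronronner",
    "Les chats sont ronrons",
    "Un super-chien aboie",
    "Deux super-chiens",
    "Combien de chiens sont en train d'aboyer?",
]
probabilities = [
    [0.14218173, 0.85781827],
    [0.11032981, 0.88967019],
    [0.16960697, 0.83039303],
    [0.88967027, 0.11032973],
    [0.85781842, 0.14218158],
    [0.83039303, 0.16960697],
]
print([argmax(row) for row in probabilities])          # [1, 1, 1, 0, 0, 0]
print(top_comment_per_topic(comments, probabilities))
# ['Un super-chien aboie', 'Les super-chats aiment ronronner']
```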


## Inspecting each part of the pipeline (backward)

### Inverse of the LDA gives us topics' features

In [66]:
a = lda.inverse_transform(None)  # None here for getting the fitted categories. 
pprint(a)

[array([2, 4, 6]), array([1, 4, 5])]


### Inverse of the CountVectorizer gives us the words from features

In [67]:
b = cv.inverse_transform(a)
pprint(b)

[['chien', 'sup', 'sup chien'], ['chat', 'sup', 'sup chat']]
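
This step boils down to a lookup of each feature index in the vectorizer's list of feature names. As a minimal sketch, with the values copied from the outputs above and a helper name of our own:

```python
feature_names = ['aboi', 'chat', 'chien', 'ronron', 'sup', 'sup chat', 'sup chien']
top_feature_indices = [[2, 4, 6], [1, 4, 5]]  # one list of indices per topic


def indices_to_terms(indices_per_topic, feature_names):
    """Replace every feature index by the stemmed 1-gram or 2-gram it stands for."""
    return [[feature_names[i] for i in indices] for indices in indices_per_topic]


print(indices_to_terms(top_feature_indices, feature_names))
# [['chien', 'sup', 'sup chien'], ['chat', 'sup', 'sup chat']]
```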


### Inverse Stemming here yields the most common original word for its stemmed version

[More info on how the inverse stemming works here](https://github.com/ArtificiAI/Multilingual-Latent-Dirichlet-Allocation-LDA/blob/master/Stemming-words-from-multiple-languages.ipynb).

In [70]:
c = st.inverse_transform(b)
pprint(c)

[['chiens', 'super', 'super chiens'], ['chats', 'super', 'super chats']]
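
The idea can be sketched with the stemmer's cache printed earlier: for each stem, pick the original word that was seen most often. The snippet below is a simplified illustration with a function name of our own, not the library's code. In particular, how ties are broken (e.g. "aboie" versus "aboyer", both seen once) is arbitrary here.

```python
def unstem(term, stem_to_word_counts):
    """Replace each stem of a 1-gram or 2-gram by its most frequent original word."""
    words = []
    for stem in term.split(" "):
        # Unknown stems are kept as they are.
        candidates = stem_to_word_counts.get(stem, {stem: 1})
        words.append(max(candidates, key=candidates.get))
    return " ".join(words)


cache = {
    'chat': {'chat': 1, 'chats': 2},
    'chien': {'chien': 1, 'chiens': 2},
    'sup': {'super': 4},
}
print([unstem(term, cache) for term in ['chien', 'sup', 'sup chien']])
# ['chiens', 'super', 'super chiens']
```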


### Inverse stop words removal here does nothing

The function basically returns its argument. This could be improved with a custom algorithm, like the Stemmer's inverse pass, which is custom here. For example, it would be possible to scan back through each comment and find occurrences with a regex.

In [71]:
d = swr.inverse_transform(c)
pprint(d)

[['chiens', 'super', 'super chiens'], ['chats', 'super', 'super chats']]
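
Here is one way the regex idea above might be sketched. This is not part of the library: the function `reinsert_stop_words` is hypothetical, and it simply returns the first place in the comments where the two words of a 2-gram are separated only by stop words or punctuation.

```python
import re


def reinsert_stop_words(two_gram, comments, stop_words):
    """Find the 2-gram in the comments, allowing stop words between its two words."""
    first, second = two_gram.split(" ")
    # Longest stop words first, so that "les" is tried before "le".
    alternatives = "|".join(
        re.escape(word) for word in sorted(stop_words, key=len, reverse=True)
    )
    pattern = re.compile(
        rf"\b{re.escape(first)}\b(?:\W+(?:{alternatives})\b)*\W+{re.escape(second)}\b",
        re.IGNORECASE,
    )
    for comment in comments:
        match = pattern.search(comment)
        if match:
            return match.group(0)
    return two_gram  # nothing found: leave the 2-gram untouched


comments = ["Un super-chat marche sur le trottoir", "Deux super-chiens"]
stop_words = ["le", "les", "la", "un", "de", "en", "sur", "sont"]
print(reinsert_stop_words("marche trottoir", comments, stop_words))  # marche sur le trottoir
print(reinsert_stop_words("super chiens", comments, stop_words))     # super-chiens
```

A real implementation would also have to decide which occurrence to keep when several comments match, for example by counting them as the Stemmer's inverse pass does.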


## Conclusion

You now have a fairly precise overview of how this [multilingual LDA](https://github.com/ArtificiAI/Multilingual-Latent-Dirichlet-Allocation-LDA) works.

On one hand, it's easy to use and quite straightforward.

On the other hand, each class in the pipeline has its own behavior. Here we inherit from some scikit-learn classes and add a few extras to them (such as most of the backward passes), and we also add some classes of our own (such as the Stemmer class, where [Snowball](https://github.com/ArtificiAI/Multilingual-Latent-Dirichlet-Allocation-LDA/blob/master/Stemming-words-from-multiple-languages.ipynb) is used for the forward pass). It would be easy to change the implementation of the LDA by creating another class, or to use other algorithms.

For more information, don't hesitate to dive into the code. There are also unit tests.

### License

This [project](https://github.com/ArtificiAI/Multilingual-Latent-Dirichlet-Allocation-LDA) is published under the [MIT License (MIT)](https://github.com/ArtificiAI/Multilingual-Latent-Dirichlet-Allocation-LDA/blob/master/LICENSE).

Copyright (c) [2018 Artifici online services inc](https://github.com/ArtificiAI).

Coded by [Guillaume Chevalier](https://github.com/guillaume-chevalier) at [Neuraxio Inc.](https://github.com/Neuraxio)
