## LDA (Latent Dirichlet Allocation) Model

In simple terms, Latent Dirichlet Allocation (LDA) models are a way of automatically discovering topics in a given corpus. For example, imagine you have the following set of sentences (in the real world, we would work with documents that contain multiple sentences).

* I like to eat broccoli and bananas.
* I ate a banana and spinach smoothie for breakfast.
* Chinchillas and kittens are cute.
* My sister adopted a kitten yesterday.
* Look at this cute hamster munching on a piece of broccoli.

The LDA model returns, more or less, the following information about the probability that each sentence belongs to a topic.

* **Sentences 1 and 2**: 100% Topic A
* **Sentences 3 and 4**: 100% Topic B
* **Sentence 5**: 60% Topic A, 40% Topic B

It also returns the most representative words for each topic.

* **Topic A**: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, etc. (at which point, you could interpret topic A to be about food)
* **Topic B**: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, etc. (at which point, you could interpret topic B to be about cute animals)

### This is fun but how does it happen?

First, LDA makes assumptions about how documents are created. It sees each document as a mixture of topics, each of which emits words with certain probabilities. Therefore, it assumes that each document is created in the following fashion.

1. Decide on the number of words `N` the document will have.
2. Choose a topic mixture for the document. For example, assuming that we have the two food and cute animal topics above, you might choose the document to consist of $\frac{1}{3}$ food and $\frac{2}{3}$ cute animals.
3. Generate each word $w_i$ in the document by:
	- First picking a topic (according to the topic mixture chosen in step 2; for example, you might pick the food topic with 1/3 probability and the cute animals topic with 2/3 probability).
	- Using the topic to generate the word itself (according to the topic’s multinomial distribution). For example, if we selected the food topic, we might generate the word “broccoli” with 30% probability, “bananas” with 15% probability, and so on.
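
The three steps above can be sketched in a few lines of _Python_. This is only an illustration under simplifying assumptions: the two toy topics and their word probabilities are made up, the function name `generate_document` is hypothetical, and the topic mixture is fixed by hand (in the full LDA model it would itself be drawn from a Dirichlet distribution).

```python
import random

# Toy topics: each maps words to probabilities (values are illustrative only).
topics = {
    "food": {"broccoli": 0.4, "bananas": 0.3, "breakfast": 0.2, "munching": 0.1},
    "cute animals": {"chinchillas": 0.3, "kittens": 0.3, "cute": 0.2, "hamster": 0.2},
}

def generate_document(n_words, topic_mixture, topics, rng):
    """Generate a list of words following the three steps described above."""
    topic_names = list(topic_mixture)
    topic_weights = [topic_mixture[name] for name in topic_names]
    words = []
    for _ in range(n_words):
        # Step 3a: pick a topic according to the document's topic mixture.
        topic = rng.choices(topic_names, weights=topic_weights, k=1)[0]
        # Step 3b: pick a word according to that topic's word distribution.
        vocabulary = list(topics[topic])
        word_weights = [topics[topic][w] for w in vocabulary]
        words.append(rng.choices(vocabulary, weights=word_weights, k=1)[0])
    return words

rng = random.Random(0)  # fixed seed so that the run is repeatable
# Steps 1 and 2: fix the length and the mixture (1/3 food, 2/3 cute animals).
document = generate_document(5, {"food": 1 / 3, "cute animals": 2 / 3}, topics, rng)
print(" ".join(document))
```

Because every word is drawn independently, the order of the generated words carries no information.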

Assuming this generative model for a collection of documents, LDA then tries to backtrack from the documents to find a set of topics that are likely to have generated the collection.

### Example

According to the above process, when generating some particular document D, you might

1. Pick 5 to be the number of words in D.
2. Decide that D will be 1/2 about food and 1/2 about cute animals.
3. Pick the first word to come from the food topic, which then gives you the word `"broccoli"`.
4. Pick the second word to come from the cute animals topic, which gives you `"panda"`.
5. Pick the third word to come from the cute animals topic, giving you `"adorable"`.
6. Pick the fourth word to come from the food topic, giving you `"cherries"`.
7. Pick the fifth word to come from the food topic, giving you `"eating"`.

So the document generated under the LDA model will be `"broccoli panda adorable cherries eating"` (note that LDA is a bag-of-words model).

### Learning

So far so good. Now let's assume that we have a set of documents generated in this way. You've chosen some fixed number `K` of topics to discover, and want to use LDA to learn the topic representation of each document and the words associated with each topic. How do you do this? One way (known as collapsed [Gibbs sampling](https://en.wikipedia.org/wiki/Gibbs_sampling)) is the following:

Go through each document, and randomly assign each word in the document to one of the K topics. Notice that this random assignment already gives you both topic representations of all the documents and word distributions of all the topics (albeit not very good ones).

So how do we improve these topics?

For each document `d`, go through each word `w` in `d`, and for each topic `t` compute two things:

1) $p(topic_t | document_d)$ -- the proportion of words in document `d` that are currently assigned to topic `t`. 
2) $p(word_w | topic_t)$ -- the proportion of assignments to topic `t` over all documents that come from this word `w`. 

Reassign `w` a new topic, where we choose topic `t` with probability $p(topic_t | document_d) \times p(word_w | topic_t)$ (according to our generative model, this is essentially the probability that topic `t` generated word `w`, so it makes sense that we resample the current word’s topic with this probability). In other words, in this step, we’re assuming that all topic assignments except for the current word in question are correct, and then updating the assignment of the current word using our model of how documents are generated.
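
In practice, implementations of collapsed Gibbs sampling usually smooth these two proportions so that no topic ever gets a probability of exactly zero. A commonly used form of the update is shown below. The symbols are assumptions introduced here for illustration: $n_{d,t}^{-i}$ is the number of words in document `d` assigned to topic `t`, $n_{t,w}^{-i}$ is the number of times word `w` is assigned to topic `t`, $n_{t}^{-i}$ is the total number of words assigned to topic `t` (all three counted without the current word $i$), $V$ is the vocabulary size, and $\alpha$ and $\beta$ are small positive smoothing constants.

$$
p(z_i = t \mid \mathbf{z}_{-i}, \mathbf{w}) \propto \left(n_{d,t}^{-i} + \alpha\right) \times \frac{n_{t,w}^{-i} + \beta}{n_{t}^{-i} + V\beta}
$$

The first factor plays the role of $p(topic_t | document_d)$ (its denominator is the same for every topic, so it can be dropped), and the second factor plays the role of $p(word_w | topic_t)$. With $\alpha = \beta = 0$ the expression essentially reduces to the product of the two plain proportions.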

After repeating the previous step a large number of times, you’ll eventually reach a roughly steady state where your assignments are pretty good. So use these assignments to estimate the topic mixtures of each document (by counting the proportion of words assigned to each topic within that document) and the words associated to each topic (by counting the proportion of words assigned to each topic overall).
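
The whole procedure fits in a short script. The sketch below is a toy version and not a production implementation: the function name `gibbs_lda`, the smoothing constants `alpha` and `beta` (added so that no topic gets zero probability), and the five hand-lemmatized documents are assumptions made for illustration. Note also that gensim's `LdaModel`, used later in this text, estimates the model with a different method (online variational Bayes).

```python
import random

def gibbs_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler; docs is a list of lists of tokens."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # Count tables: topics per document, words per topic, totals per topic.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [{} for _ in range(num_topics)]
    topic_total = [0] * num_topics
    # Random initial assignment of every word to one of the topics.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assignments = []
        for w in doc:
            t = rng.randrange(num_topics)
            doc_assignments.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] = topic_word[t].get(w, 0) + 1
            topic_total[t] += 1
        assignments.append(doc_assignments)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current word's assignment from the counts.
                t = assignments[d][i]
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # Smoothed p(topic | document) * p(word | topic) for each topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k].get(w, 0) + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                # Resample the topic and put the word back into the counts.
                t = rng.choices(range(num_topics), weights=weights, k=1)[0]
                assignments[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] = topic_word[t].get(w, 0) + 1
                topic_total[t] += 1
    # Topic mixture of a document: share of its words assigned to each topic.
    mixtures = [[c / len(doc) for c in doc_topic[d]] for d, doc in enumerate(docs)]
    return mixtures, topic_word

docs = [
    ["broccoli", "banana", "eat"],
    ["banana", "spinach", "smoothie", "breakfast", "eat"],
    ["chinchilla", "kitten", "cute"],
    ["sister", "adopt", "kitten"],
    ["cute", "hamster", "munch", "broccoli"],
]
mixtures, topic_word = gibbs_lda(docs, num_topics=2)
for mixture in mixtures:
    print([round(p, 2) for p in mixture])
```

With a corpus this small the result depends heavily on the random seed, so the printed mixtures should be read as an illustration of the mechanics rather than as meaningful topics.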


### Real-world example

This is all very well, but how do we do it in practice? The good news is that it is quite easy in _Python_. What we need is a corpus of texts. Let's use the same corpus we used in the article on [The effect of the pandemic on European narratives on smart cities and surveillance](https://doi.org/10.1177/00420980221138317). You can download it from [Google Drive](https://classroom.google.com/c/NjI5NzI5ODQxNDIw/m/NjU3NTM3NTU4MDg0/details).

In [None]:
## Import module for tokenization and lemmatization
import spacy
nlp = spacy.load("en_core_web_sm")
## Import module for LDA
from gensim.corpora import Dictionary
from gensim.models import LdaModel
## Import the module for JSON handling
import json
## Import the module for path handling
import os
## Import module for creating a table
import pandas as pd
## Import module for plotting
import matplotlib.pyplot as plt
## Import module for handling standard output
import sys

In [None]:
class MyCorpus:
    """
    A class that represents a corpus and has useful methods defined.

    """
    
    def __init__(self, path, key='content'):
        """
        Reads from a JSON Lines file. Tokenizes and lemmatizes
        the text under key. It writes out a new JSON Lines
        file with a new field, tokens.
        Args:
            path (str): a path to a JSON Lines file.
            key (str): a key with the content to lemmatize.
        """
        self._path_original = path
        self._key = key
        self._dictionary = None
        ## Insert '_NLP' before the file extension only,
        ## e.g. 'LDA.jsonl' becomes 'LDA_NLP.jsonl'
        root, ext = os.path.splitext(path)
        self._path = root + '_NLP' + ext
        with open(self._path_original, 'r') as source, \
                open(self._path, 'w') as file:
            for n, line in enumerate(source, start=1):
                temp_dict = json.loads(line)
                text_nlp = nlp(temp_dict[self._key])
                temp_dict['tokens'] = []
                for token in text_nlp:
                    ## Skip stop words, punctuation, whitespace, brackets,
                    ## currency symbols, digits, quotes, and one-character tokens
                    is_stop = token.is_stop or token.is_punct or token.is_space \
                        or token.is_bracket or token.is_currency or token.is_digit \
                            or token.is_quote or len(token) < 2
                    if is_stop:
                        continue
                    temp_dict['tokens'].append(token.lemma_.lower())
                file.write(json.dumps(temp_dict) + '\n')
                sys.stdout.write(f'\rLine {n} processed')
                sys.stdout.flush()

        
    def set_dictionary(self, dictionary):
        """
        Assigns a gensim.corpora.dictionary.Dictionary object
        to self._dictionary.

        Args:
            dictionary (gensim.corpora.dictionary.Dictionary): a dictionary
            that stores the frequencies of unique tokens in the corpus.
        """
        self._dictionary = dictionary

    def get_tokens(self):
        """
        It reads the preprocessed JSON Lines file and returns a generator
        that yields the tokens of each document.

        Yields:
            list : list of tokens for a document.
        """
        with open(self._path, 'r') as file:
            for doc in file:
                temp = json.loads(doc)
                yield temp['tokens']
    
    def get_bow(self):
        """
        It takes a dictionary with frequencies of unique tokens in the corpus
        and for each list of tokens returns a list of tuples that denote the 
        id of a given token and its frequency in a given document.

        Raises:
            ValueError: if the dictionary was not assigned to self._dictionary.

        Yields:
            list : a list of tuples that denote the id of a given token and its
            frequency in a given document.
        """
        if self._dictionary:
            for doc in self.get_tokens():
                yield self._dictionary.doc2bow(doc)
        else:
            raise ValueError('Dictionary has the value of None')
    
    def __iter__(self):
        """
        Yields:
            list : a list of tuples that denote the id of a given token and
            its frequency in a given document.
        """
        for doc in self.get_bow():
            yield doc

    def get_topics(self, model):
        """
        It takes a model and returns a generator that yields a mapping for each
        document in the corpus. Among other keys it returns the most probable
        topic based on the LDA model provided and its probability.

        Args:
            model (gensim.models.ldamodel.LdaModel): Latent Dirichlet Allocation
            model.

        Yields:
            dict : a mapping for each document in the corpus. Among other keys
            it returns the most probable topic based on the LDA model provided
            and its probability.
        """
        with open(self._path, 'r') as file:
            for doc in file:
                temp = json.loads(doc)
                topics = model.get_document_topics(self._dictionary.doc2bow(temp['tokens']))
                ## Keep the topic with the highest probability
                topic, prob = max(topics, key=lambda x: x[1])
                ## Topics are numbered from 1 in the output
                temp['topic'] = topic + 1
                temp['topic_prob'] = prob
                yield temp

                
class MyModel(LdaModel):
    """
    Subclass of gensim.models.LdaModel.
    """
    def get_coherence(self, corpus):
        """
        Returns the average coherence measure for the given model.

        Args:
            corpus (MyCorpus): A corpus on which the model is computed. 

        Returns:
            float: the average coherence measure for the given model.
        """
        top_topics = self.top_topics(corpus)
        return sum([t[1] for t in top_topics]) / len(top_topics)
    
    def get_top_tokens(self, corpus):
        """
        Returns a list of dictionaries that depict the most probable
        tokens for each topic.

        Args:
            corpus (MyCorpus): A corpus on which the model was computed.

        Returns:
            list: list of dictionaries that depict the most probable
            tokens for each topic.
        """
        top_tokens = self.top_topics(corpus)
        return [ { key : value for value, key in t[0] } for t in top_tokens ]

    
    
        
def run_lda_models(corpus, dictionary, min_topics, max_topics, step = 1, **kwargs):
    """
    Computes a sequence of LDA models for a given corpus and dictionary. It prints
    the number of topics and the coherence measure to the screen and writes each
    model to disk.

    Args:
        corpus (MyCorpus): A stream of document vectors in the bag-of-words format.
        dictionary (gensim.corpora.dictionary.Dictionary): a mapping that assigns an id to each unique token in the corpus.
        min_topics (int): the smallest number of topics to compute.
        max_topics (int): the highest number of topics to compute.
        step (int, optional): the increment in the number of topics between consecutive models. Defaults to 1.
    """
    name = input("Please provide the name of the model\n")
    ## Accessing any item forces gensim to build the id2token mapping
    _ = dictionary[0]
    id2word = dictionary.id2token
    if not os.path.exists('models'):
        os.mkdir('models')
    if not os.path.exists('png'):
        os.mkdir('png')
    for num_topic in range(min_topics, max_topics+1, step):
        model = MyModel( corpus = corpus,
                         id2word=id2word,
                         alpha = 'asymmetric',
                         eta = 'auto',
                         iterations = 500,
                         passes = 20,
                         eval_every=None,
                         num_topics=num_topic,
                         random_state=1044,
                         per_word_topics=True)
        temp_dict = {}
        temp_dict['name'] = name
        temp_dict['num_topics'] =  num_topic
        temp_dict['coherence'] = model.get_coherence(corpus = corpus)
        path_name = os.path.join('models', name + '-' + str(num_topic))
        model.save(path_name) 
        print(temp_dict)

Now that we have the corpus of texts, we first need to preprocess it a bit. This involves 3 steps (this is a very naive way of preprocessing a corpus, but for our purposes it is more than enough).

1. Tokenize the text.
2. Lemmatize the tokens.
3. Compute the bag-of-words representation of the data.
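
To make step 3 concrete, here is a minimal sketch of what a bag-of-words representation looks like. It uses only the standard library; the names `token2id` and `to_bow` are hypothetical, and the way ids are assigned here is an assumption for illustration, not gensim's own numbering. The `doc2bow` method of gensim's `Dictionary`, used in the code above, produces the same kind of `(token id, count)` pairs.

```python
from collections import Counter

# Two toy documents that are already tokenized and lemmatized.
docs = [
    ["kitten", "cute", "kitten"],
    ["cute", "hamster", "broccoli"],
]

# Assign an integer id to every unique token (in order of first appearance).
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

bows = [to_bow(doc) for doc in docs]
print(token2id)  # {'kitten': 0, 'cute': 1, 'hamster': 2, 'broccoli': 3}
print(bows)      # [[(0, 2), (1, 1)], [(1, 1), (2, 1), (3, 1)]]
```

The word order is lost in this representation, which is exactly what a bag-of-words model such as LDA expects.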

Afterward, we can compute the LDA model, which reduces our data by grouping similar texts together. For example, instead of 184 separate articles we end up with 7 coherent groups. This makes it possible for our limited cognitive system to process and compare the texts, and even to draw conclusions about the main topics in the corpus.

4. Compute the LDA.
5. Interpret the results.

There is a very good tutorial on exactly this [here](https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html#sphx-glr-auto-examples-tutorials-run-lda-py). However, its main issue is that it is somewhat 'raw' and includes details that can be off-putting because they require additional _Python_ knowledge. Therefore, in the code below I kept the number of such details to a minimum and focused on the most important parts. The code below uses the classes and functions defined in the chunk above. Don't be overwhelmed by it. For most uses, you can treat it as a script that computes LDA for a corpus stored in a JSON Lines file.

**IMPORTANT**: By default, the `content` field stores the text you would like to first tokenize and later lemmatize. 
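
If you would like to try the code on your own texts, the input file needs one JSON object per line, each with a `content` field. The sketch below writes and reads back such a file. The file name `toy_corpus.jsonl`, the extra `title` field, and the two example sentences are assumptions made for illustration.

```python
import json
import os
import tempfile

# Two toy records; every line of the file is one JSON object.
records = [
    {"title": "First text", "content": "Chinchillas and kittens are cute."},
    {"title": "Second text", "content": "I like to eat broccoli and bananas."},
]

# Write to a temporary directory; the file name is hypothetical.
path = os.path.join(tempfile.mkdtemp(), "toy_corpus.jsonl")
with open(path, "w", encoding="utf-8") as file:
    for record in records:
        file.write(json.dumps(record) + "\n")

# Read it back line by line, the same way the corpus class does.
with open(path, "r", encoding="utf-8") as file:
    contents = [json.loads(line)["content"] for line in file]
print(contents)
```

If your texts are stored under a different field, pass its name through the `key` argument of `MyCorpus`.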

In [None]:
## Read the corpus from the file
corpus = MyCorpus(path = 'LDA.jsonl')

In [None]:
## Create the dictionary
dictionary = Dictionary( corpus.get_tokens() )

In [None]:
## Filter out words that occur in fewer than 5 documents or in more than 50% of the documents
dictionary.filter_extremes(no_below=5, no_above=0.5)
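
To see what this kind of filtering does, here is a rough sketch based on document frequencies. It is not gensim's implementation: the function name `filter_tokens` and the toy documents are assumptions, and gensim's handling of the upper bound and of its other options may differ in detail.

```python
from collections import Counter

def filter_tokens(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents."""
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return {t for t, df in doc_freq.items() if no_below <= df <= max_docs}

docs = [
    ["city", "camera", "data"],
    ["city", "camera", "privacy"],
    ["city", "data", "sensor"],
    ["city", "privacy", "law"],
]
kept = filter_tokens(docs, no_below=2, no_above=0.5)
print(sorted(kept))  # ['camera', 'data', 'privacy']
```

Here `city` is dropped because it appears in every document, while `sensor` and `law` are dropped because they appear in only one.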

In [None]:
## Add the dictionary to the corpus
corpus.set_dictionary(dictionary)

In [None]:
## Compute models and write them out to the files
run_lda_models(corpus = corpus, dictionary = dictionary, min_topics=3, max_topics=10)

In [None]:
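## Read in the model. It requires providing the file name of the
## model we want to load, i.e. the name given above followed by
## '-' and the number of topics.
model_name = input('Provide the name of the model you would like to load:\n')
model_path = os.path.join('models', model_name)
## Load through MyModel so that get_top_tokens() is available
model = MyModel.load(model_path)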

In [None]:
## Print out and write the figures with the most 
## probable tokens in each topic.
list_top_tokens = model.get_top_tokens(corpus)
for i in range(len(list_top_tokens)):
    plt.barh(list(list_top_tokens[i].keys()), list(list_top_tokens[i].values()), align = 'center')
    plt.xlim(0,.03)
    plt.gca().invert_yaxis()
    plt.title('Topic' + ' ' + str(i + 1))
    plt.xlabel('Probability')
    plt.savefig('png/' + 'topic' + str(i + 1))
    plt.show()

In [None]:
## Write out the results into an Excel file
pd.DataFrame.from_records(line for line in corpus.get_topics(model = model)).to_excel('corpus.xlsx')