# Topic Modeling

In this notebook we explore __topic modeling__, an unsupervised technique. Let's first load a toy dataset used in the paper:

> Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).

This is actually the test set (used for algorithm evaluation), but we will simply use it to illustrate text analytics procedures.

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('data/ag_news.csv')
df.head(2)

There are four columns: 
- class index and class names that annotate the content type
- title and description of each news piece

Let's combine the title and description into a column called `content` and drop the unnecessary columns.

In [None]:
df['content'] = df['title'] + '. ' + df['description']
df.drop(columns=['title', 'description'], inplace=True)
df.head(2)

#### Bag-of-words
We'll first create a bag of words: each document is reduced to a list of its normalized tokens, ignoring word order.

In [None]:
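from string import punctuation

import nltk
from nltk import PorterStemmer
from nltk.corpus import stopwords

# Requires the NLTK stopword list and tokenizer models to be downloaded
# beforehand ('stopwords' and 'punkt' or 'punkt_tab', depending on the NLTK version)
eng_stopwords = set(stopwords.words('english'))  # set for fast membership tests
stemmer = PorterStemmer()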

def tokenize(df, min_length=3):
    """
    Tokenize the text content of a DataFrame.

    Parameters
    ----------
    df : pandas.DataFrame
        A DataFrame with a 'content' column containing the text to be tokenized.
    min_length : int, optional
        The minimum length of a token to include in the output. Default is 3.

    Returns
    -------
    list
        A list of tokenized documents, where each document is represented as a list of tokens.

    Notes
    -----
    This function uses the NLTK library for tokenization, removes punctuation and stopwords, and applies stemming.

    Examples
    --------
    >>> import pandas as pd
    >>> df = pd.DataFrame({'content': ['This is a test.', 'Another test sentence.'], 'label': [0, 1]})
    >>> tokens = tokenize(df)
    >>> tokens
    [['test'], ['anoth', 'test', 'sentenc']]
    """
    bow = [nltk.word_tokenize(content.lower()) for content in df['content'].values]
    bow = [[w for w in d if w not in punctuation and w not in eng_stopwords and not w.isdigit()] for d in bow]
    trans = str.maketrans('', '', punctuation)
    bow = [[w.translate(trans).strip() for w in d] for d in bow]
    bow = [[w for w in d if len(w) >= min_length] for d in bow]
    bow = [[stemmer.stem(w) for w in d] for d in bow]
    
    return bow

In [None]:
bow = tokenize(df)

##### Vector Space Model (VSM)

When we have a bag of words, we can create vectors based on these items. Instead of using tokens/text, it is often easier to use integer indices. For example, if `race` is the first token in the vocabulary, it is assigned the first index (`0` in `gensim`, which counts from zero).
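
To make the idea concrete before reaching for a library, here is a minimal pure-Python sketch of the token-to-integer mapping and the resulting count vectors. The documents and the helper `to_vector` are hypothetical and only for illustration; IDs are assigned in order of first appearance, which is not necessarily how `gensim` assigns them.

```python
from collections import Counter

# Hypothetical, already-tokenized documents (not taken from the AG News data).
docs = [
    ["race", "car", "win", "race"],
    ["stock", "market", "win"],
]

# Assign each distinct token the next free integer ID, in order of first appearance.
token2id = {}
for doc in docs:
    for token in doc:
        if token not in token2id:
            token2id[token] = len(token2id)

def to_vector(doc):
    """Encode a document as sorted (token_id, count) pairs."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

vectors = [to_vector(doc) for doc in docs]
print(token2id)
print(vectors)
```

Each document becomes a sparse vector: only the tokens that actually occur are stored, together with their counts.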

For this task, I like to use [`gensim`](https://radimrehurek.com/gensim/index.html), which has a library of very well written and convenient APIs, especially for [topic modeling](https://en.wikipedia.org/wiki/Topic_model) and [word2vec](https://rare-technologies.com/word2vec-tutorial/) algorithms:

`pip install gensim`

In [None]:
import gensim
dictionary = gensim.corpora.Dictionary(bow)
print(dictionary)

In [None]:
dictionary.token2id

We can also look up the integer ID of individual tokens:

In [None]:
dictionary.token2id['disappoint'], dictionary.token2id['california']

Upon creation of a dictionary that maps words to integers (and vice versa), we can transform our bag of words. Each document will be a list of tuples that contain token indices and frequencies.

In [None]:
corpus = [dictionary.doc2bow(d) for d in bow]
corpus[0]

Let's investigate, for example, the token with the highest frequency (ID 11):

In [None]:
# id2token is filled lazily, so index the dictionary itself to map an ID back to its token
dictionary[11]

In [None]:
bow[0]

#### Topic Modeling

##### Latent Dirichlet Allocation (LDA)

___Topic modeling___ is a widely used family of dimensionality reduction techniques. It assumes that each document is a mixture of topics, where each topic is a distribution over terms. One of the most successful algorithms is [___latent Dirichlet allocation___](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) (LDA), whose corresponding paper is:
> Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. Journal of machine Learning research, 3(Jan), 993-1022.

LDA is a generative model: fitting it amounts to reverse engineering the process that generated the documents. It can be represented as a probabilistic graphical model:
![lda](https://upload.wikimedia.org/wikipedia/commons/4/4d/Smoothed_LDA.png)

The generative process can be described as follows:
- For each topic $k$, sample a multinomial distribution $\phi_k$ over words from the Dirichlet prior with parameter $\beta$
- For each document $m$, sample a multinomial distribution $\theta_m$ over topics from the Dirichlet prior with parameter $\alpha$
    - For each word $n$ in $m$:
        - Sample a topic $z_{m,n}$ from the corresponding topic distribution parameterized by $\theta_m$
        - Sample a word $w_{m,n}$ from the word distribution of the corresponding topic $z_{m,n}$, parameterized by $\phi_{z_{m,n}}$
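
The generative process above can be sketched in plain Python. This is only an illustration under stated assumptions: the vocabulary, the hyperparameter values, and the helper names `sample_dirichlet` and `generate_corpus` are hypothetical, and symmetric Dirichlet priors are drawn by normalizing Gamma variates. It runs the model forwards; it is not how `gensim` fits LDA.

```python
import random

def sample_dirichlet(concentration, size, rng):
    """Draw one sample from a symmetric Dirichlet via normalized Gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(vocab, num_topics, num_docs, doc_length, alpha, beta, seed=0):
    rng = random.Random(seed)
    # One word distribution phi_k per topic, drawn from Dirichlet(beta).
    phi = [sample_dirichlet(beta, len(vocab), rng) for _ in range(num_topics)]
    docs, thetas = [], []
    for _ in range(num_docs):
        # One topic distribution theta_m per document, drawn from Dirichlet(alpha).
        theta = sample_dirichlet(alpha, num_topics, rng)
        words = []
        for _ in range(doc_length):
            z = rng.choices(range(num_topics), weights=theta)[0]  # topic z_{m,n}
            w = rng.choices(vocab, weights=phi[z])[0]             # word w_{m,n}
            words.append(w)
        docs.append(words)
        thetas.append(theta)
    return docs, thetas, phi

vocab = ["goal", "match", "stock", "bank", "chip", "code"]  # hypothetical vocabulary
docs, thetas, phi = generate_corpus(
    vocab, num_topics=2, num_docs=3, doc_length=8, alpha=0.5, beta=0.5
)
for words in docs:
    print(" ".join(words))
```

Fitting LDA goes in the opposite direction: given only the observed words, it estimates plausible values for $\theta_m$ and $\phi_k$.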

##### Parameters in LDA

Generally, we need to control two hyperparameters of an LDA model:
- Topic-word Dirichlet prior $\beta$
- Document-topic Dirichlet prior $\alpha$

The selection of these parameters is application dependent. A common heuristic is $\alpha=\dfrac{50}{K}$ and $\beta=0.1$, as used in

> Griffiths, T. L., and Steyvers, M. 2004. “Finding Scientific Topics,” Proceedings of the National Academy of Sciences (101:Supplement 1), National Academy of Sciences, pp. 5228–5235.

It is also possible to infer these two hyperparameters given the data.
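
To build some intuition for what a Dirichlet prior controls, the sketch below draws one topic mixture for a few values of $\alpha$ with $K=4$. The helper `sample_dirichlet` and the seed are hypothetical choices for illustration. Small concentration values tend to put most of the mass on one or two topics, while large values (such as $50/K$) tend to give more even mixtures.

```python
import random

def sample_dirichlet(concentration, size, rng):
    """Draw one sample from a symmetric Dirichlet via normalized Gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(42)  # arbitrary seed, only for repeatability
K = 4

samples = {}
for alpha in (0.1, 1.0, 50 / K):
    theta = sample_dirichlet(alpha, K, rng)
    samples[alpha] = theta
    print(f"alpha={alpha:<5} theta={[round(p, 3) for p in theta]}")
```

A single draw is noisy, so rerunning with other seeds gives a better feel for the typical behavior.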

The choice of $K$ depends on the context. It is also possible to select the number of topics using quantitative measures of topic quality, but this is beyond the scope of this tutorial.

For our toy sample set, we will just select $K=4$ because there are 4 labels: 

In [None]:
df.class_name.unique()

##### Run LDA!

Thanks to the convenient APIs by `gensim`, we can easily run [LDA in Python](https://radimrehurek.com/gensim/models/ldamodel.html):

In [None]:
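from gensim.models import LdaModel

# passes=10 sweeps the corpus ten times during training; minimum_probability=0
# keeps low-probability topics from being filtered out of the per-document output
lda = LdaModel(corpus, num_topics=4, id2word=dictionary, passes=10,
               minimum_probability=0)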

##### Analysis on LDA results

Let's take a look at the output of LDA. First, we can check whether the topics make sense:

In [None]:
for _, topic_str in lda.show_topics():
    print(topic_str)
    print('------------'*10)

The topics are not perfect, but they are reasonable. We can interpret them as sci/tech, sports, world, and business (the order of the topics can change between runs, because LDA training is randomized).

For each document, we can check its topic distribution:

In [None]:
i = 1000
lda.get_document_topics(corpus[i])

We can see that topic 0, which we interpreted as "business", dominates this document. We can check whether this makes sense:

In [None]:
df.loc[i]
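
Rather than reading the dominant topic off by eye, we can pick it programmatically. The sketch below uses a made-up list in the same `(topic_id, probability)` format that `get_document_topics` returns. Both the numbers and the `topic_labels` mapping are hypothetical, since the topic order depends on the trained model.

```python
# Hypothetical output in the (topic_id, probability) format; the numbers
# are made up for illustration and do not come from the trained model.
doc_topics = [(0, 0.71), (1, 0.05), (2, 0.19), (3, 0.05)]

# Hypothetical labels; the ID-to-label mapping must be read off the actual topics.
topic_labels = {0: "business", 1: "sports", 2: "world", 3: "sci/tech"}

# The dominant topic is the pair with the largest probability.
top_id, top_prob = max(doc_topics, key=lambda pair: pair[1])
print(f"dominant topic: {top_id} ({topic_labels[top_id]}), p={top_prob:.2f}")
```

With the real model, `doc_topics` would be replaced by `lda.get_document_topics(corpus[i])`.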

In fact, LDA can be used in many situations, such as information retrieval, document clustering and labeling, and even for images! Here we have only covered the simplest use case.

Return to [overview](../00_overview.ipynb)