## Topic Extraction w/ Latent Dirichlet Allocation `LDA`

## Fetching dataset & unraring


In [1]:
!wget https://www.minapharm.com/gShare/Pubmed5k.rar 
!apt-get install unrar
!unrar e Pubmed5k.rar


---

## Upgrading modules & fetching required modules


In [2]:
!python3 -m pip install --upgrade pip nltk gensim
# openpyxl is required by pandas to read_excel
!python3 -m pip install openpyxl 


## Feature Engineering of corpus

The following resources were found helpful in devising a strategy for feature engineering:

- [Text Analysis & Feature Engineering with NLP](https://towardsdatascience.com/text-analysis-feature-engineering-with-nlp-502d6ea9225d)
- [Topic Modelling and Latent Dirichlet Allocation \(LDA\) in Python](https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24)

### Importing modules &amp; defining globals


In [3]:
import os
import numpy as np
import pandas as pd
import nltk

from gensim.parsing.preprocessing import STOPWORDS
from nltk.corpus import stopwords

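# corpora needed by the stopword list and the WordNet lemmatiser
for resource in ("stopwords", "wordnet", "omw-1.4"):
    nltk.download(resource)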
np.random.seed(42)

DATA_PATH = "./"
STOPWORDS = STOPWORDS | frozenset(stopwords.words("english"))


### Loading dataset


In [4]:
io = "Pubmed5k.xlsx"
sheet_name = "random 5k"
df = pd.read_excel(os.path.join(DATA_PATH, io), sheet_name=sheet_name, index_col=[0])
df.head()


### Exploring dataset


In [5]:
df_lower = df["Abstract"].str.lower()

# some records have no clear abstract
idx_no_abs = df_lower.str.find("no abstract") != -1

df_processed = df[~idx_no_abs]


In [6]:
df_len = df_processed.apply({"Title": len, "Abstract": len})
df_len.describe()


Some of the length statistics look implausible,
e.g. `min(Abstract) = 1`, so the shortest records are inspected next.


In [7]:
df_len["Abstract"].value_counts().sort_index().head(10)


In [8]:
df_processed[df_len["Abstract"] <= 104]["Abstract"]


In [9]:
df_processed[df_len["Abstract"] == 43]["Title"].values


In [10]:
df_len["Title"].value_counts().sort_index().head(10)


In [11]:
df_processed[df_len["Title"] <= 31]["Title"]


In [12]:
df_altered = df[idx_no_abs]
df_altered = pd.concat([df_altered, df_processed[df_len["Abstract"] <= 43]])
df_altered.loc[:, "Abstract"] = ""
df_altered.shape


In [13]:
df_altered.head()


In [14]:
df_processed = df.drop(index=df_altered.index)
df_processed = pd.concat([df_processed, df_altered])
df_processed.shape


### Preprocessing

> Note: the approaches here are for a unigram model.

After exploring the dataset and slightly tweaking some features, the two columns are joined into a single feature (one document per record) that the rest of the pipeline operates on.


In [15]:
corpus = df_processed["Title"] + r" " + df_processed["Abstract"]
corpus.name = "document"
corpus.head()


Defining a helper function that bundles the preprocessing subroutine.

For each document:

1. Tokenising the document.
2. Lowercasing the tokens.
3. Dropping stop words.
4. Lemmatising the tokens.


In [16]:
from typing import Union

from gensim.utils import simple_preprocess
from nltk.stem import WordNetLemmatizer


def preprocess(
    document: str,
    min_len: int = 2,
    max_len: int = 15,
    stopwords: frozenset = frozenset(),
    pos: Union[str, list] = "n",
) -> list:
    """Tokenise and lowercase the document, drop stopwords, then lemmatise
    each remaining token

    Parameters:
    -----------
    document: raw text of a single record
    min_len, max_len: bounds on the length of the tokens to keep
    stopwords: tokens to drop after tokenisation
    pos: a WordNet part-of-speech tag, or a list of tags applied in order

    Returns:
    --------
    the list of lemmatised tokens
    """
    # using gensim.utils.simple_preprocess to tokenise, lowercase a document
    pp_doc = simple_preprocess(document, min_len=min_len, max_len=max_len, deacc=True)
    # removing generic stopwords from the tokens
    doc_non_stop = [token for token in pp_doc if token not in stopwords]

    # applying text normalisation: lemmatisation
    lemmatise = WordNetLemmatizer().lemmatize
    if isinstance(pos, str):
        return [lemmatise(token, pos=pos) for token in doc_non_stop]

    # several tags: lemmatise repeatedly, once per tag
    result = doc_non_stop
    for p in pos:
        result = [lemmatise(token, pos=p) for token in result]
    return result


- Calling the preprocessing subroutine on the dataset


In [17]:
pos = ["a", "n", "r", "s", "v"]
corpus_processed = corpus.apply(preprocess, stopwords=STOPWORDS, pos=pos)
corpus_processed.head()


### Saving the final feature engineered dataset


In [18]:
corpus_processed.to_csv(os.path.join(DATA_PATH, "corpus_clean.csv"))


## Hyper-Parameter Tuning

> NOTE: only a unigram model is considered for now

When using Latent Dirichlet Allocation `LDA`, the most important hyperparameters are:

- **k**: The number of topics.
- **&alpha;**: the Dirichlet prior for document/topic distribution.
- **&eta;**: the Dirichlet prior for topic/word distribution.
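
To build intuition for what the two priors control, the sketch below draws one sample from a symmetric Dirichlet with a small and with a large concentration value, using only the standard library. The helper `sample_dirichlet`, the variable `n_topics` and the chosen values are illustrative assumptions and not part of this notebook's pipeline.

```python
import random


def sample_dirichlet(concentration: float, size: int, rng: random.Random) -> list:
    """Draw one sample from a symmetric Dirichlet via normalised gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(42)
n_topics = 7  # illustrative number of topics

# a small prior tends to put most of the mass on a few topics,
# a large prior tends to spread it more evenly
sparse = sample_dirichlet(0.1, n_topics, rng)
dense = sample_dirichlet(50.0, n_topics, rng)

print([round(p, 3) for p in sparse])
print([round(p, 3) for p in dense])
```

Read as a document/topic distribution, a small &alpha; favours documents dominated by a few topics; the same reasoning applies to &eta; and the words of a topic.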

[Gensim](https://radimrehurek.com/gensim/index.html)'s implementation of `LDA` will be used.
Specifically the multi-cored variant [LdaMulticore](https://radimrehurek.com/gensim/models/ldamulticore.html)

These resources were found helpful for approaching the tuning routine.

- [When Coherence Score is Good or Bad in Topic Modelling?](https://www.baeldung.com/cs/topic-modeling-coherence-score)
- [Evaluate Topic Models: Latent Dirichlet Allocation (LDA)](https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0)

---

### Importing required modules and setting up globals


In [19]:
import pickle
import matplotlib.pyplot as plt

from gensim.corpora import Dictionary
from gensim.models import LdaMulticore, CoherenceModel
from pylab import rcParams

rcParams["figure.figsize"] = (16, 9)
plt.style.use("ggplot")


In [20]:
corpus = corpus_processed


## Tuning subroutines

First, the elbow method is used to tune **k**.
**&alpha;** &amp; **&eta;** are held fixed (assumed to be reasonable) while models with different numbers of topics **k** are fit, and their coherence scores are then plotted.

---

`LdaMulticore` expects certain inputs, so we build them first:

- A dictionary of words. Basically a map assigning an id to each word in the available vocabulary.
- A Bag-of-Words `BoW`, which can be easily built from the dictionary.

> The dictionary also provides useful utilities, like the ability to drop tokens that appear too rarely or too often in our corpus, as that might indicate they're less significant than other tokens.
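
The idea behind the dictionary, the frequency filter and the bag-of-words can be sketched in plain Python on a toy corpus. The documents, the thresholds and the names `token2id`, `doc2bow` and `bow_toy` are made up for illustration; this is a simplified stand-in, not gensim's implementation.

```python
from collections import Counter

# toy tokenised corpus (illustrative, not taken from the dataset)
toy_docs = [
    ["patient", "treatment", "cancer"],
    ["patient", "protein", "cell"],
    ["patient", "cancer", "cell", "cell"],
    ["patient", "risk"],
]

# document frequency: in how many documents each token appears
doc_freq = Counter(token for doc in toy_docs for token in set(doc))

# keep tokens that appear in at least `no_below` documents
# and in at most `no_above` (a fraction) of all documents
no_below, no_above = 2, 0.75
kept = sorted(
    token
    for token, freq in doc_freq.items()
    if freq >= no_below and freq / len(toy_docs) <= no_above
)

# the dictionary: token -> integer id
token2id = {token: idx for idx, token in enumerate(kept)}


def doc2bow(doc: list) -> list:
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


bow_toy = [doc2bow(doc) for doc in toy_docs]
print(token2id)
print(bow_toy)
```

Here `patient` is dropped for appearing in every document, and the tokens that occur in a single document are dropped for being too rare.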


In [21]:
# build the mapping
id2word = Dictionary(corpus)

# removing tokens that appear in less than 1% (.01) or in more than 50% (.5) of the corpus
id2word.filter_extremes(no_below=0.01 * len(corpus), no_above=0.5, keep_n=None)

# build a BoW
bow = corpus.apply(id2word.doc2bow)


### Saving objects


In [22]:
pickle_path = "./"
objects = {"id2word": id2word, "bow": bow}

for key, val in objects.items():
    path = os.path.join(pickle_path, f"{key}.pkl")
    # the context manager closes the file even if pickling fails
    with open(path, "wb") as file_ref:
        pickle.dump(val, file_ref)


Now, an `LdaMulticore` model may be built.

The following resources were found useful for selecting &alpha; &amp; &eta;:

- [What are typical values to use for alpha and beta in Latent Dirichlet Allocation?](https://stats.stackexchange.com/questions/59684/what-are-typical-values-to-use-for-alpha-and-beta-in-latent-dirichlet-allocation)
- [Rules to set hyper-parameters alpha and theta in LDA model](https://stackoverflow.com/questions/39644667/rules-to-set-hyper-parameters-alpha-and-theta-in-lda-model)

> NOTE: the candidate values of **k** are capped at 20 for simplicity and ease of calculation

---

### Tuning **k**

Calculating coherence score on **k** in range \[3,20\]


In [23]:
n_co_scores = []
for n in range(3, 21):
    lda_model = LdaMulticore(
        bow,
        id2word=id2word,
        num_topics=n,
        random_state=42,
        alpha="asymmetric",
        eta="auto",
    )
    co_model = CoherenceModel(
        model=lda_model, texts=corpus, dictionary=id2word, coherence="u_mass"
    )
    n_co_scores.append((n, co_model.get_coherence()))
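
To give a feel for what a `u_mass`-style score rewards, here is a simplified sketch: for each pair of top words, it takes the log of the smoothed number of documents containing both words over the number of documents containing the higher-ranked word, then averages over pairs. The helper `umass_like`, the smoothing constant `eps` and the toy documents are assumptions for illustration; gensim's exact computation may differ in smoothing and aggregation.

```python
import math
from itertools import combinations


def umass_like(top_words: list, docs: list, eps: float = 1.0) -> float:
    """Average log((D(w_i, w_j) + eps) / D(w_j)) over pairs of top words,
    where w_j precedes w_i in the ranked list and D counts documents.
    Assumes every top word occurs in at least one document."""
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words) -> int:
        return sum(all(w in s for w in words) for s in doc_sets)

    scores = [
        math.log((doc_count(w_i, w_j) + eps) / doc_count(w_j))
        for w_j, w_i in combinations(top_words, 2)
    ]
    return sum(scores) / len(scores)


toy_docs = [
    ["cancer", "tumor", "cell"],
    ["cancer", "tumor"],
    ["protein", "cell"],
    ["protein", "gene"],
]

coherent = umass_like(["cancer", "tumor"], toy_docs)  # words that co-occur
mixed = umass_like(["cancer", "gene"], toy_docs)  # words that never co-occur
print(coherent, mixed)
```

Words that tend to share documents score higher than words that never do, which is why a higher coherence is preferred when comparing models.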


- Plotting the findings, to look for an _elbow_ in the graph


In [24]:
x, y = np.array(n_co_scores).T
plt.plot(x, y)
plt.xticks(x)
plt.grid(True)
plt.show()


From the graph, the best **k** is $7$.

<!-- , followed by $11$ -->

<!-- > NOTE: choosing $11$ to avoid future mishap, as selecting $3$ yields more documents w/ less than $3$ topic participation -->

---

Due to implementation specifics, `LdaMulticore` doesn't accept `auto` for &alpha; in its constructor, so &alpha; needs to be tuned in a similar fashion.

Setting &eta; to `auto` lets the model learn that hyperparameter as well.

### Tuning &alpha;


In [25]:
n = 7
a_co_scores = []

for a in np.linspace(0.01, 1, num=50):  # 50 evenly spaced candidates (the default)
    lda_model = LdaMulticore(
        bow,
        id2word=id2word,
        num_topics=n,
        random_state=42,
        alpha=a,
        eta="auto",  # lets the model learn that hyperparameter
    )
    co_model = CoherenceModel(
        model=lda_model, texts=corpus, dictionary=id2word, coherence="u_mass"
    )
    a_co_scores.append((a, co_model.get_coherence()))


Selecting the &alpha; w/ highest coherence score


In [26]:
a_co_scores = sorted(a_co_scores, key=(lambda tup: tup[1]))
a = a_co_scores[-1][0]


### Saving findings


In [27]:
hyperparameters = np.array([n, a], dtype=float)
np.save(os.path.join(DATA_PATH, "hyperparameters.npy"), hyperparameters)


## Training a Unigram `LdaMulticore` model


In [28]:
k = n
lda_model = LdaMulticore(
    bow,
    num_topics=k,
    id2word=id2word,
    eta="auto",
    alpha=a,
    random_state=42,
)


Defining a helper function to wrap the subroutine for selecting top _n_ topics


In [29]:
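def get_top_n_topics(
    corpus: pd.Series, model: LdaMulticore, n: int = 3
) -> pd.DataFrame:
    """calculates the top `n` topics for each document in `corpus` through the model

    Parameters:
    -----------
    corpus: bag-of-words documents, as a Series named "document"
    model: trained LDA model used to infer the topic distribution
    n: number of top topics to keep per document

    Returns:
    --------
    DataFrame with `topic_i` and `topic_i_prop` columns for i in 1..n
    """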
    df_lda = corpus.apply(
        lambda doc: sorted(model[doc], key=(lambda tup: tup[1]), reverse=True)[:n]
    )
    corpus = pd.DataFrame(corpus)

    for i in range(n):
        corpus[f"topic_{i+1}"] = df_lda.apply(
            lambda topics: topics[i][0] if len(topics) > i else None
        )
        corpus[f"topic_{i+1}_prop"] = df_lda.apply(
            lambda topics: topics[i][1] if len(topics) > i else None
        )

    # FIXME: some documents might have less than `n` possible topics
    # to avoid N/As, set them to the previous topic
    for i in range(2, n + 1):
        idx = corpus[f"topic_{i}"].isna()
        cols = [f"topic_{i}", f"topic_{i}_prop"]
        cols_prev = [f"topic_{i-1}", f"topic_{i-1}_prop"]
        corpus.loc[idx, cols] = corpus.loc[idx, cols_prev].values

    return corpus.drop(columns=["document"])
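
The padding step at the end of the helper can be seen on a single toy topic distribution. The function `top_n_padded` is a hypothetical, pure-Python stand-in for the per-document logic, and the (topic id, probability) pairs are made up for illustration.

```python
def top_n_padded(topic_probs: list, n: int = 3) -> list:
    """Return the `n` most probable (topic_id, prob) pairs, repeating the
    last available pair when fewer than `n` topics are present."""
    ranked = sorted(topic_probs, key=lambda tup: tup[1], reverse=True)[:n]
    if not ranked:
        return []
    # pad with the last available pair, mirroring the "previous topic" fill
    return ranked + [ranked[-1]] * (n - len(ranked))


example = [(4, 0.15), (1, 0.80)]  # a document with only two topics
padded = top_n_padded(example)
print(padded)
```

A document with only two topics therefore reports its second topic twice, which avoids N/As at the cost of duplicated entries.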


## Assigning topics

Now the model is used to assign topics to the records/docs/articles.


In [30]:
n = 3

df_topics = get_top_n_topics(bow, lda_model, n)
columns = ["topic_1", "topic_2", "topic_3"]
df_topics[columns] = df_topics[columns].astype(int)
df_topics.head()


### Saving findings


In [31]:
df_topics.to_csv(os.path.join(DATA_PATH, "topics.csv"))


## Evaluating model


In [32]:
co_model = CoherenceModel(
    lda_model, texts=corpus, dictionary=id2word, coherence="u_mass"
)
co_model.get_coherence()


## Saving the model


In [33]:
path = os.path.join(DATA_PATH, "lda_model.pkl")
lda_model.save(path)


## Naming the topics

According to [Luis Serrano](https://ca.linkedin.com/in/luisgserrano) in this [video](https://www.youtube.com/watch?v=T05t-SqKArY),
labeling/naming a topic is a task that is best done by humans.

A naive approach would be:

    the name of the topic would be included in the title of the articles.

<!-- here is a simple hack to try using ML for the topic name. -->

<!-- Exploiting TF-IDF, for each topic, the lower the IDF score of a token, the more it appears in documents; intuitively the higher the chance that this particular token is the name of the topic. -->

So we gather the top contributing words of every topic.


In [34]:
topics_word = lda_model.show_topics(formatted=False)
topics_word


Ignoring, for a moment, the participation ratio of the tokens:


In [35]:
k = len(topics_word)
topic_word_sets = [set(np.array(topics_word[i][1])[:, 0]) for i in range(k)]
topic_word_sets


In [36]:
intersection = topic_word_sets[0]
for tw_set in topic_word_sets[1:]:
    intersection &= tw_set

intersection


It seems that some words contribute highly to all $7$ topics:

\[high, patient\]

---

Let's try to obtain the unique terms per topic, i.e. the tokens that appear among the top words of exactly $1$ topic.

To do that, we need to filter the sets.

For each topic:

- make a union
  of each set of topic/words,
  not including the current topic


In [37]:
union = [
    set.union(
        *[
            topic_word_sets[i]
            for i in range(k)  # 3. of each set of topic/words
            if i != j  # 4. not including the current topic
        ]
    )  # 2. make a union
    for j in range(k)  # 1. for each topic
]
union


Now, for each topic/words set, get the difference between the set and its complementing union.


In [38]:
unique_topic_word = [topic_word_sets[i] - union[i] for i in range(k)]
unique_topic_word
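
The same set arithmetic can be checked on a toy example. The helper `unique_terms` and the word sets are hypothetical and only illustrate how a topic whose top words are all shared ends up with an empty set.

```python
def unique_terms(word_sets: list) -> list:
    """For each set, keep only the words that appear in no other set."""
    result = []
    for j, words in enumerate(word_sets):
        # union of every set except the current one
        others = set().union(*(s for i, s in enumerate(word_sets) if i != j))
        result.append(words - others)
    return result


toy_sets = [{"patient", "cancer"}, {"patient", "protein"}, {"patient"}]
toy_unique = unique_terms(toy_sets)
print(toy_unique)
```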


Notice that the results contain empty sets. One intuition would be:

    the vocab is built on a unigram model, and topic names could likely be n-grams

The same naive approach could be extended in many ways, e.g.:

1. Relying on more than the top $10$ words; that might yield non-empty sets, but could just as likely produce more empty sets.
1. Building n-gram \(bigram, trigram, ...\) models, and applying the same naive approach.
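
As a rough illustration of the second option, adjacent tokens can be joined into bigrams before the dictionary is built. This is a minimal pure-Python sketch; the helper `add_bigrams`, the `min_count` threshold and the toy documents are assumptions for illustration, and a library phrase detector would normally be preferable.

```python
from collections import Counter


def add_bigrams(docs: list, min_count: int = 2) -> list:
    """Append bigram tokens (joined with "_") that occur at least
    `min_count` times across the whole corpus to each document."""
    counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    frequent = {pair for pair, c in counts.items() if c >= min_count}
    return [
        doc + ["_".join(pair) for pair in zip(doc, doc[1:]) if pair in frequent]
        for doc in docs
    ]


toy_docs = [
    ["risk", "control", "study"],
    ["risk", "control", "effect"],
    ["control", "effect"],
]
docs_with_bigrams = add_bigrams(toy_docs)
print(docs_with_bigrams)
```

The extended documents could then go through the same dictionary, bag-of-words and naming steps, so that a topic's top terms may include pairs such as `risk_control`.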

---

Apart from that caveat, it seems the approach gives acceptable insights into what some of the topics are:

- topic \#$1$ could be about `treatment`s
- topic \#$4$ could be about `cancer`
- topic \#$6$ could be about `protein`s

Other results don't seem particularly informative:

- topic \#$2$ could be about `control`, `effect`, `risk`, or any combination of them.
  Perhaps their contribution rates could lead to another conclusion.
- topic \#$5$ has the values `low` &amp; `report`, neither of which is highly descriptive.


In [39]:
# looking at participation of words in topic 2
_, topic_2_words = topics_word[2]
unique_participants = [
    topic_2_words[i]
    for i in range(len(topic_2_words))
    if topic_2_words[i][0] in unique_topic_word[2]
]
unique_participants


The participation rates merely sort the findings, as the differences between them are in the $10^{-3}$ range.

Here a bigram model could be helpful: by showing how each pair of these words appears together, it might suggest a topic name like:

- `risk control`
- `risk effect`
- `control effect`, maybe?
- ...

Or a trigram model:

- `control risk effect`
- `effect` _of_ `risk control`
- ...


In [42]:
topic_dict = {
    1: "treatment",
    4: "cancer",
    6: "protein",
    2: "effect",  # selected by order of participation
    5: "report",  # selected intuitively as report could be more meaningful than low
    # randomly selecting remaining topic names from participants
    0: "covid",
    3: "specie",
}
columns = ["topic_1", "topic_2", "topic_3"]
# assign whole columns so the integer ids can be replaced by string names
df_topics[columns] = df_topics[columns].replace(topic_dict)
df_topics.head()


### Saving findings


In [43]:
df_topics.to_csv(os.path.join(DATA_PATH, "topics_named.csv"))
