# Topic Modelling

In this session we will discuss an approach to discover underlying "topics" in your text collection. But before we start, let's discuss the task more broadly.

There are two scenarios when identifying topics in a collection of texts, depending on what you know beforehand:

1. you know which topics you are looking for
2. you don't know which topics you are looking for

In this session we look at the second scenario. If you are in the first one instead, a later session will cover text classification.

## What do we mean by "topics"

1. Groups of tokens that are likely to appear in **the same context**
2. A **hidden structure** that determines how tokens appear in the corpus

The first is what you observe (tokens co-occurring), the second is what you assume when you use a topic modelling approach.
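
To make the first point concrete, here is a minimal sketch that counts how often pairs of tokens appear in the same context. The mini corpus and the stopword list are made up for illustration, and splitting on whitespace is a simplifying assumption rather than a proper tokenizer.

```python
from collections import Counter
from itertools import combinations

# Made-up mini corpus; each string plays the role of one "context" (document)
docs = [
    "the engine drives the machine",
    "the machine needs a new engine",
    "the poet wrote a new verse",
    "a verse by the poet",
]

stopwords = {"the", "a", "by"}  # tiny hand-picked list, for illustration only

pair_counts = Counter()
for doc in docs:
    # unique, sorted tokens so that each pair is counted once per document
    tokens = sorted(set(doc.split()) - stopwords)
    pair_counts.update(combinations(tokens, 2))

# show the pairs that co-occur most often
for pair, count in pair_counts.most_common(3):
    print(pair, count)
```

Pairs such as `engine`/`machine` and `poet`/`verse` share a document more often than pairs that mix the two themes. A topic model tries to explain this kind of pattern.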

![](images/lda.png)

## How do we get these topics?

There are several methods:

- Latent Semantic Analysis
- Probabilistic Latent Semantic Analysis
- Latent Dirichlet Allocation (LDA), the most widely adopted approach

In the last ten years, LDA has been a highly popular approach in digital humanities for corpus exploration, due to its flexibility (it can be applied to any language, given a tokenizer). You "just" need to select the number of topics you want to discover, in advance.
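
To see why the number of topics has to be fixed in advance, it may help to look at the generative story LDA assumes: each document is a mixture over a fixed set of topics, and each topic is a probability distribution over the vocabulary. The sketch below runs that story forwards with two toy topics. The topic names, words, probabilities and the `alpha` value are all made up for illustration. Fitting LDA amounts to running this process backwards, from the observed tokens to the hidden topics.

```python
import random

random.seed(0)

# Two made-up topics, each a probability distribution over a tiny vocabulary
topics = {
    "machines": {"engine": 0.5, "steam": 0.3, "wheel": 0.2},
    "poetry": {"verse": 0.5, "rhyme": 0.3, "poet": 0.2},
}


def sample_dirichlet(alpha, size):
    """Draw one sample from a symmetric Dirichlet using gamma draws."""
    draws = [random.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(n_tokens, alpha=0.5):
    """Generate one toy document following the LDA generative story."""
    names = list(topics)
    # 1. draw the topic mixture of the document
    mixture = sample_dirichlet(alpha, len(names))
    tokens = []
    for _ in range(n_tokens):
        # 2. draw a topic for this token according to the mixture
        topic = random.choices(names, weights=mixture)[0]
        # 3. draw a word from the chosen topic
        words = list(topics[topic])
        probs = list(topics[topic].values())
        tokens.append(random.choices(words, weights=probs)[0])
    return mixture, tokens


mixture, tokens = generate_document(n_tokens=8)
print([round(p, 2) for p in mixture], tokens)
```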

[Slides to go through together](https://docs.google.com/presentation/d/1u5Fs1C6vwdfsv93H-iX3c4jkwyjgijxShsUjoGZPwcc/edit?usp=sharing) (starting from slide 15)

## Topic Modelling the LwM Animacy Dataset

In [None]:
import pandas as pd

animacy_df = pd.read_csv('data/LwM-nlp-animacy-annotations-machines19thC.tsv',sep='\t')
animacy_snippets = animacy_df['SentenceCtxt'].to_list()

print (animacy_snippets[0])

In [None]:
def clean_animacy_text(snippet:str)-> str:
    """
    Remove specific tags from sentences in the animacy dataset

    Args:
        snippet (str): a snippet of text from the dataset

    Returns:
        str: the same snippet, without the tags
    """
    assert type(snippet) is str, 'The input is not a string'
    snippet = snippet.replace('[SEP]','')
    snippet = snippet.replace('***','')
    return snippet


In [None]:
animacy_snippets = [clean_animacy_text(snippet) for snippet in animacy_snippets]

print (animacy_snippets[0])

In [None]:
import warnings
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

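def train_lda(texts, n_topics):
    """
    Fit an LDA model on TF-IDF weighted texts

    Args:
        texts (list): the documents, one string each
        n_topics (int): the number of topics to discover

    Returns:
        tuple: the fitted LDA model, the document-term matrix and the fitted vectorizer
    """
    # the vectorizer object will be used to transform text to vector form;
    # max_df=0.5 drops tokens found in more than half of the documents,
    # min_df=10 drops tokens found in fewer than 10 documents
    tfidf_vectorizer = TfidfVectorizer(max_df=0.5, min_df=10)
    dtm_tfidf = tfidf_vectorizer.fit_transform(texts)

    # random_state is fixed so that repeated runs give the same topics
    lda_tfidf = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda_tfidf.fit(dtm_tfidf)
    return lda_tfidf, dtm_tfidf, tfidf_vectorizer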

animacy_lda_tfidf,animacy_dtm_tfidf, animacy_tfidf_vectorizer = train_lda(animacy_snippets,n_topics=20)
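
The `max_df` and `min_df` arguments used above prune the vocabulary by document frequency before the model is trained. The sketch below mimics that filtering in plain Python on a made-up corpus. `filter_vocabulary` is a hypothetical helper, not part of scikit-learn, and it splits on whitespace instead of using the vectorizer's own tokenizer.

```python
from collections import Counter


def filter_vocabulary(docs, min_df, max_df):
    """Keep tokens found in at least min_df documents and in at most
    a max_df share of all documents."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # count each token once per document
        doc_freq.update(set(doc.lower().split()))
    return sorted(
        token for token, df in doc_freq.items()
        if min_df <= df <= max_df * n_docs
    )


toy_docs = [
    "the engine is running",
    "the engine stopped",
    "the poet is writing",
    "the poet stopped writing",
]

kept = filter_vocabulary(toy_docs, min_df=2, max_df=0.5)
print(kept)
```

Here `the` is dropped because it appears in every document, and `running` is dropped because it appears in only one. Very frequent tokens tend to appear in every topic, and very rare ones give the model too little evidence to place them.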

In [None]:
%matplotlib inline
import pyLDAvis
# note: in pyLDAvis 3.4 and later this module is named pyLDAvis.lda_model
import pyLDAvis.sklearn
import warnings

pyLDAvis.enable_notebook()

warnings.filterwarnings("ignore", category=DeprecationWarning) 
warnings.filterwarnings("ignore", category=FutureWarning) 

pyLDAvis.sklearn.prepare(animacy_lda_tfidf,animacy_dtm_tfidf, animacy_tfidf_vectorizer)

## Topic Modelling the British Library Books

Here we use a small sample of a collection of digitised books created by the British Library in partnership with Microsoft, which is available on HuggingFace. To learn more, see [this blog post](https://blogs.bl.uk/digital-scholarship/2022/04/making-british-library-collections-even-more-accessible.html).

In [None]:
import pandas as pd

sample_blbooks_df = pd.read_pickle('data/bl_books_sample.pickle')
blbooks_content = sample_blbooks_df['text'].to_list()
print (len(blbooks_content))

In [None]:
warnings.filterwarnings("ignore", category=DeprecationWarning) 
warnings.filterwarnings("ignore", category=FutureWarning) 

blbooks_lda_tfidf,blbooks_dtm_tfidf, blbooks_tfidf_vectorizer = train_lda(blbooks_content,n_topics=50)
pyLDAvis.sklearn.prepare(blbooks_lda_tfidf,blbooks_dtm_tfidf, blbooks_tfidf_vectorizer)

✏️ **Exercise:** 

Process each of the "documents" inside the `animacy_snippets` dataset, keeping only the nouns. Re-run the topic modelling code and see if the topics look better.

### How do we properly assess the quality of our topics?

Topic models are useful for data exploration, but if we want to use them as evidence in our study, we need to be sure they are working well.

There are many approaches for evaluating topic models:

1.   ~~Looking at them~~
2.   ~~Cherry-pick only the good ones~~
3.   Measuring topic coherence
4.   Studying topic stability 
5.   Using topics as features in another machine learning system
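
As an illustration of the third option, here is a minimal sketch of one common coherence variant, a UMass-style score. It rewards topics whose top words tend to appear in the same documents. `umass_coherence` and the toy documents are assumptions made for this example. The sketch also assumes that every top word occurs in at least one document, which holds when the words come from a model trained on the same corpus.

```python
import math
from itertools import combinations


def umass_coherence(top_words, docs):
    """UMass-style coherence of a topic's top words, ordered from most
    to least relevant."""
    doc_sets = [set(doc.lower().split()) for doc in docs]

    def doc_count(*words):
        # number of documents that contain all the given words
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    # combinations keeps the input order, so `earlier` always ranks above `later`
    for earlier, later in combinations(top_words, 2):
        score += math.log((doc_count(earlier, later) + 1) / doc_count(earlier))
    return score


toy_docs = [
    "engine steam wheel",
    "engine steam",
    "verse rhyme poet",
    "verse rhyme",
]

coherent = umass_coherence(["engine", "steam", "wheel"], toy_docs)
mixed = umass_coherence(["engine", "verse", "wheel"], toy_docs)
print(coherent, mixed)
```

The topic that mixes words from both themes gets a lower score than the one whose words co-occur. On a real corpus you would compute the score for each topic and compare models trained with different numbers of topics.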

![](https://media3.giphy.com/media/KFiQXtO3rWxlzpjnrV/giphy.gif)

### The word intrusion task

An alternative is the so-called word intrusion task: a word from another topic is added to the list of a topic's most relevant words, and users have to spot it.

In [None]:
import numpy as np 

topic_words = {}

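# get the vocabulary once; get_feature_names_out() replaces get_feature_names(),
# which was removed in scikit-learn 1.2
feature_names = blbooks_tfidf_vectorizer.get_feature_names_out()

for topic, comp in enumerate(blbooks_lda_tfidf.components_):
    # indices of the 10 highest-weighted words, in descending order
    word_idx = np.argsort(comp)[::-1][:10]

    # store the words most relevant to the topic
    topic_words[topic] = [feature_names[i] for i in word_idx]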

In [None]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

from random import shuffle
from urllib.request import urlopen
import re
from IPython.display import display, Markdown


def good_job():
  html = urlopen("https://giphy.com/explore/good-job").read()
  # raw string, so that \s is read as a regex class and not as a string escape
  links = [x.split(".gif")[0]+".gif" for x in re.findall(r"(?P<url>https?://[^\s]+)", str(html)) if ".gif" in x]
  shuffle(links)
  gif = "<img src="+links[0]+'  >'
  display(Markdown(gif))


def word_intrusion(topics_words):
  # pick two distinct topics at random: one to show, one to draw the intruder from
  topic_ids = list(topics_words.keys())
  shuffle(topic_ids)
  topic, another_topic = topic_ids[:2]

  # the four most relevant words of the shown topic (the slice makes a copy)
  shown_words = topics_words[topic][:4]
  # the intruder is the most relevant word of the other topic not already shown
  intruder_word = [word for word in topics_words[another_topic] if word not in shown_words][0]
  shown_words.append(intruder_word)
  shuffle(shown_words)

  return shown_words, intruder_word

In [None]:
topic_list, intruder = word_intrusion(topic_words)
print (topic_list)

In [None]:
your_guess = "translation"
if your_guess == intruder:
  print ("Good one!")
  good_job()
else:
  print ("False! The correct one is:",intruder)
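
A single guess says little about a topic. If several people play the game on the same word list, the share of them who spot the intruder gives a simple score, sometimes called model precision. The sketch below computes it. `model_precision` is a hypothetical helper and the guesses are made up.

```python
def model_precision(guesses, intruder):
    """Share of guesses that match the true intruder word."""
    if not guesses:
        raise ValueError("need at least one guess")
    return sum(guess == intruder for guess in guesses) / len(guesses)


# Hypothetical answers from four participants for one topic
guesses = ["translation", "king", "translation", "translation"]
precision = model_precision(guesses, intruder="translation")
print(precision)
```

A value close to 1 suggests the topic's words hang together well enough for the intruder to stand out. A value close to chance suggests the topic is hard to interpret.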