<a href="https://colab.research.google.com/github/Andrian0s/ML4NLP1-2024-Tutorial-Notebooks/blob/main/tutorials_notebooks_in_class_2024/W12_Topic_Modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this tutorial, we look at two topic modeling approaches: `Latent Dirichlet Allocation (LDA)` and `Contextualized Topic Models (CTM)`, the latter through its `CombinedTM` model. We use the [A Million News Headlines](https://www.kaggle.com/datasets/therohk/million-headlines/data) dataset, which contains news headlines published over a period of nineteen years by the Australian Broadcasting Corporation.

  

### Import Necessary Libraries

In [1]:
import kagglehub
import pandas as pd
import os

### Download Dataset

In [2]:
# Download latest version
path = kagglehub.dataset_download("therohk/million-headlines")
df = pd.read_csv(os.path.join(path, 'abcnews-date-text.csv'))

In [3]:
# Inspect Dataset
print(df['headline_text'].iloc[0])

# Select only 100000 samples
df = df.sample(n=100000, random_state=42)
df.head()

aba decides against community broadcasting licence


| | publish_date | headline_text |
|---|---|---|
| 1144371 | 20181017 | virtual reality trial ahead of fire season in ... |
| 282871 | 20070131 | farmers prepare for ec funding |
| 895099 | 20140810 | the sunday inquisition august 10 |
| 764744 | 20130221 | news csg reax |
| 894276 | 20140806 | rosetta spacecraft on final approach to comet ... |


### Data Preprocessing

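For preprocessing, we lowercase the text, remove stopwords, and lemmatize the remaining words. Lemmatization reduces a word to its dictionary base form (lemma); for example, "walking", "walks" and "walked" all share the lemma "walk". Note that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a part-of-speech tag is passed, so the default call used below mostly reduces plural nouns ("farmers" becomes "farmer") and leaves verb forms such as "dominates" unchanged.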

In [4]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

In [5]:
# Downloading dependencies
nltk.download('punkt_tab')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

# Function to lemmatize and remove stopwords from the text data
def preprocess(text):
    text = text.lower()
    words = word_tokenize(text)
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    return " ".join(words)


# Applying the function to headlines
df['preprocessed_text'] = df['headline_text'].apply(preprocess)

# Convert to list
text = df['preprocessed_text'].tolist()
print(text[:10])


['virtual reality trial ahead fire season south australia', 'farmer prepare ec funding', 'sunday inquisition august 10', 'news csg reax', 'rosetta spacecraft final approach comet landing', "milne 's lawyer want access police note", 'needle found mandarin amid sa fruit contamination incident', 'nrn prawn plan', 'tiger wood dominates president cup day three', 'long take lose fitness']


### LDA

LDA is a generative probabilistic model that assumes each topic is a probability distribution over the words in the vocabulary, and each document is a mixture over a set of topics. See the [official scikit-learn documentation for `LatentDirichletAllocation`](https://scikit-learn.org/1.5/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html).
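
To make the generative story concrete, here is a minimal NumPy sketch of how LDA assumes a single document is produced. The toy vocabulary, the number of topics, and the Dirichlet parameter `0.5` are illustrative assumptions, not values used elsewhere in this tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["police", "court", "fire", "win", "cup", "council"]  # toy vocabulary (assumption)
n_topics, doc_len = 2, 8

# Each topic is a probability distribution over the vocabulary.
topic_word = rng.dirichlet(alpha=np.full(len(vocab), 0.5), size=n_topics)

# Each document has its own mixture over topics.
doc_topic = rng.dirichlet(alpha=np.full(n_topics, 0.5))

# Generate the document: draw a topic for each word slot, then a word from that topic.
words = []
for _ in range(doc_len):
    z = rng.choice(n_topics, p=doc_topic)
    w = rng.choice(len(vocab), p=topic_word[z])
    words.append(vocab[w])

print("topic mixture:", doc_topic)
print("document:", " ".join(words))
```

Fitting LDA goes in the opposite direction: given only the observed words, it estimates plausible values for `topic_word` and `doc_topic`.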

In [6]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer


NUM_LDA_TOPICS = 5 # Number of topics to use for LDA
NUM_FEATURES = 10000 # Maximum vocabulary size (max_features in CountVectorizer)
MAX_DF = 0.95 # Drop terms that are too frequent (CountVectorizer): here, terms that appear in more than 95% of documents
MIN_DF = 100 # Drop terms that are too rare (CountVectorizer): here, terms that appear in fewer than 100 documents

In [7]:
# Use CountVectorizer to generate Bag of Words

tf_vectorizer = CountVectorizer(max_df=MAX_DF, min_df=MIN_DF, max_features=NUM_FEATURES)
tf = tf_vectorizer.fit_transform(text)
tf_feature_names = tf_vectorizer.get_feature_names_out()
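
The effect of the `max_df` and `min_df` thresholds can be illustrated on a toy corpus. The sketch below is a simplified pure-Python stand-in, not scikit-learn's implementation; the helper `filter_vocab` and the example documents are hypothetical. As in `CountVectorizer`, an integer `min_df` is read as an absolute document count and a float `max_df` as a proportion of documents.

```python
from collections import Counter

docs = [
    "police investigate car crash",
    "police charge man over crash",
    "council plan new police station",
    "farmer prepare for funding",
]

def filter_vocab(docs, min_df, max_df):
    """Keep terms whose document frequency lies in [min_df, max_df * n_docs]."""
    n_docs = len(docs)
    # Count each term once per document (document frequency, not raw term count).
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    return sorted(t for t, c in doc_freq.items() if min_df <= c <= max_df * n_docs)

kept = filter_vocab(docs, min_df=2, max_df=0.7)
print(kept)  # ['crash']
```

Here "police" is dropped because it occurs in 3 of 4 documents (more than 70%), and every term that occurs in only one document is dropped by `min_df=2`.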

In [8]:
# LDA

lda = LatentDirichletAllocation(n_components=NUM_LDA_TOPICS,
                                max_iter=5,
                                learning_method='online',
                                random_state=42).fit(tf)

In [9]:
for topic_idx, topic in enumerate(lda.components_):
    print(f'Topic {topic_idx}:', end=' ')
    print(' '.join([tf_feature_names[i] for i in topic.argsort()[:-12 - 1:-1]]))

Topic 0: police win crash wa day car man australia world two set mp
Topic 1: say nsw qld attack change hit death government market coast take farmer
Topic 2: new plan call council australian get home charged case drug killed open
Topic 3: man court govt year water face sa election murder charge sydney first
Topic 4: fire interview back report hospital health claim school south cut national minister
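
The slice `[:-12 - 1:-1]` in the loop above can look cryptic: `argsort()` returns indices in ascending order of weight, and the slice walks backwards from the end to pick the 12 largest. A small sketch with made-up weights (the array values and feature names are assumptions for illustration) shows the same pattern with `n_top = 3`.

```python
import numpy as np

feature_names = np.array(["car", "court", "crash", "fire", "police"])
topic = np.array([3.0, 0.5, 4.0, 0.1, 9.0])  # hypothetical weights for one topic

n_top = 3
top_idx = topic.argsort()[:-n_top - 1:-1]   # indices of the n_top largest weights, descending
top_words = feature_names[top_idx].tolist()
print(top_words)  # ['police', 'crash', 'car']

# An equivalent, arguably more readable form: reverse first, then take the first n_top.
assert list(top_idx) == list(topic.argsort()[::-1][:n_top])

# Normalising a row of weights turns it into a distribution over words.
topic_probs = topic / topic.sum()
```

The rows of `lda.components_` are unnormalised pseudo-counts, so dividing each row by its sum, as in the last line, yields the topic's word distribution.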


It can be difficult to assign an exact label to each topic in this setting. However, the following are some plausible topic themes.

Topic 0: Road Accidents  - Car, Crash, Police (maybe)

Topic 1: Places (NSW - New South Wales ; QLD - Queensland)

Topic 2: Criminal Cases (Drug, Killed, Case, Charged)

Topic 3: Judiciary (Court, Govt)

Topic 4: Emergencies (Fire, Hospital, Health)

### CTM

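Now we look into Contextualized Topic Models (CTM), which use pre-trained contextualized document representations. The `CombinedTM` model used below combines these embeddings with the Bag of Words representation rather than relying on a plain Bag of Words alone.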

Method developed by [Bianchi et al. 2021](https://aclanthology.org/2021.acl-short.96/).

[A 6min presentation of the paper by one of the authors.](https://underline.io/lecture/25716-pre-training-is-a-hot-topic-contextualized-document-embeddings-improve-topic-coherence)

[Medium Blog](https://towardsdatascience.com/contextualized-topic-modeling-with-python-eacl2021-eacf6dfa576)

Code: [https://github.com/MilaNLProc/contextualized-topic-models](https://github.com/MilaNLProc/contextualized-topic-models)

Tutorial: [https://colab.research.google.com/drive/1fXJjr_rwqvpp1IdNQ4dxqN4Dp88cxO97?usp=sharing](https://colab.research.google.com/drive/1fXJjr_rwqvpp1IdNQ4dxqN4Dp88cxO97?usp=sharing)
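
As a rough intuition for what "combined" means, the sketch below builds the kind of input such a model could consume: a Bag of Words vector concatenated with a projected document embedding. This is an illustration under assumptions, not the library's code: the random arrays stand in for real counts and sentence embeddings, and the projection matrix `W` is hypothetical (in a trained model it would be learned).

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, contextual_size, n_docs = 6, 768, 4

# Toy Bag of Words counts and stand-in contextual embeddings (both random here).
bow = rng.integers(0, 3, size=(n_docs, vocab_size)).astype(float)
embeddings = rng.normal(size=(n_docs, contextual_size))

# Hypothetical projection of the embeddings down to the vocabulary size.
W = rng.normal(scale=0.01, size=(contextual_size, vocab_size))
projected = embeddings @ W

# Combined input: BoW features side by side with the projected contextual features.
combined = np.concatenate([bow, projected], axis=1)
print(combined.shape)  # (4, 12)
```

The value `768` matches the `contextual_size` passed to `CombinedTM` later in this notebook, which is the embedding dimension of the sentence encoder used there.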


In [10]:
!pip install -qU contextualized-topic-models

In [11]:
from contextualized_topic_models.models.ctm import CombinedTM

In [12]:
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation

***Important: the import below (`WhiteSpacePreprocessing`) raises an `ImportError` on the first run. Running it a second time succeeds. This is probably due to a caching issue in the `contextualized_topic_models` package.***

In [13]:
from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessing

ImportError: cannot import name 'triu' from 'scipy.linalg.special_matrices' (/usr/local/lib/python3.10/dist-packages/scipy/linalg/special_matrices.py)

In [14]:
from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessing
NUM_CTM_TOPICS = 5

In [15]:
# Preprocessing specific to CTM - Preprocess titles by removing stopwords and whitespace.
# We use only first 10k headlines for faster implementation
sp = WhiteSpacePreprocessing(text[:10000], stopwords_language='english')
preprocessed_documents, unpreprocessed_corpus, vocab, retained_indices = sp.preprocess()




In [16]:
# Load model and create training dataset by creating bow and contextualized embeddings representations.

tp = TopicModelDataPreparation("paraphrase-distilroberta-base-v1")
training_dataset = tp.fit(text_for_contextual=unpreprocessed_corpus, text_for_bow=preprocessed_documents)
tp.vocab[:10]


array(['10', '100', '11', '12', '13', '14', '15', '16', '17', '18'],
      dtype=object)

In [17]:
# Train the topic model in order to obtain 5 topics.

ctm = CombinedTM(bow_size=len(tp.vocab),
                 contextual_size=768,
                 n_components=NUM_CTM_TOPICS,
                 num_epochs=20)
ctm.fit(training_dataset) # run the model

# Look at the 20 most important words of those topics.
ctm.get_topic_lists(20)


Epoch: [20/20]	 Seen Samples: [197120/197580]	Train Loss: 29.979616908283976	Time: 0:00:04.006571: : 20it [01:00,  3.03s/it]


[['police',
  'car',
  'crash',
  'killed',
  'fire',
  'woman',
  'dead',
  'found',
  'west',
  'two',
  'missing',
  'report',
  'home',
  'dy',
  'soldier',
  'death',
  'south',
  'flood',
  'hit',
  'three'],
 ['court',
  'murder',
  'man',
  'charge',
  'charged',
  'accused',
  'face',
  'guilty',
  'child',
  'assault',
  'drug',
  'jailed',
  'sex',
  'back',
  'attack',
  'case',
  'woman',
  'law',
  'coronavirus',
  'trial'],
 ['interview',
  'david',
  'britain',
  'election',
  'visit',
  'china',
  'military',
  'rudd',
  'begin',
  'prime',
  'looking',
  'attempt',
  'iraq',
  'coaching',
  'criticises',
  'kevin',
  'pm',
  'gillard',
  'turkey',
  'hong'],
 ['win',
  'news',
  'cup',
  'day',
  'lead',
  'world',
  'first',
  'australia',
  'top',
  'end',
  'beat',
  'open',
  'market',
  'aussie',
  'league',
  'drum',
  'final',
  'second',
  'country',
  'grandstand'],
 ['plan',
  'govt',
  'council',
  'government',
  'new',
  'say',
  'call',
  'health',
  'fu

📝❓ Can you guess the topics?

*Note that the order of the topics, and of the words within each topic, may differ between runs, but the overall topic themes should remain similar.*
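
Because topic indices are not stable across runs, one simple way to compare two runs is to match topics by the overlap of their top words. The sketch below uses Jaccard similarity on made-up word lists; `run_a`, `run_b`, and the helper `jaccard` are hypothetical and only illustrate the idea.

```python
def jaccard(a, b):
    """Share of words two top-word lists have in common."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Hypothetical top words from two separate training runs.
run_a = [["police", "car", "crash"], ["court", "murder", "charge"]]
run_b = [["court", "charge", "guilty"], ["crash", "police", "killed"]]

# For each topic in run_a, find the index of the most similar topic in run_b.
matches = [
    max(range(len(run_b)), key=lambda j: jaccard(topic, run_b[j]))
    for topic in run_a
]
print(matches)  # [1, 0]
```

With real output, the lists returned by `ctm.get_topic_lists(20)` from two runs could be passed in place of `run_a` and `run_b`.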