<a href="https://colab.research.google.com/github/Andrian0s/ML4NLP1-2024-Tutorial-Notebooks/blob/main/exercises/ex6/ex06_topic_modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install -qU contextualized-topic-models


## General Instructions

1. Perform Topic Modeling using LDA and CTM on the three time frames: before 1990, 1990-2009 and 2010 onwards.
2. Experiment with a) different preprocessing functions and b) varying number of topics.
3. Annotate the topics.
4. Answer the questions marked with 📝❓ in your lab report at the end of this notebook.

## Import Libraries

In [2]:
import re
import urllib
import gzip
import io
import csv
import random
from collections import defaultdict
from tqdm import tqdm

## Download Dataset

In [3]:
url_before_1990 = 'https://drive.google.com/file/d/1o_IeJCqvDLH5xgjYYuEHoPuPjF7SYvwR/view?usp=drive_link'
url_from_1990_to_2009 = 'https://drive.google.com/file/d/1Q31iYPxlcsvB0nwGter3RDfbhVRtV2yI/view?usp=drive_link'
url_from_2010 = 'https://drive.google.com/file/d/1s7pLqaiMVxM0M4WBKgZpBxNDFKXeQ47x/view?usp=drive_link'

In [4]:
# Download a text file from a Google Drive share URL and return its titles as a list
import requests

def download_text_file_from_drive(drive_url):
    # Extract the file id from a URL of the form .../d/<file_id>/...
    try:
        file_id = drive_url.split('/d/')[1].split('/')[0]
    except IndexError:
        raise ValueError("Invalid Google Drive URL format. Ensure it includes '/d/<file_id>/'.")

    # Build the direct-download URL for that file id
    download_url = f"https://drive.google.com/uc?id={file_id}&export=download"

    response = requests.get(download_url)
    if response.status_code != 200:
        raise RuntimeError(f"Failed to download file. HTTP Status Code: {response.status_code}")

    content = response.text
    titles_year = content.splitlines()
    # Keep only the text before the first comma of each line.
    # Note: a title that itself contains a comma is cut at that comma.
    titles = [x.split(',')[0] for x in titles_year]
    return titles
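
Each line of the downloaded file has the form `paper_title, year`, so splitting on the first comma truncates any title that itself contains a comma. A minimal sketch of a more careful parse follows. It assumes the file follows CSV quoting conventions and that the year is always the last field (neither is documented here). The helper name `parse_titles` and the sample lines are made up for illustration.

```python
import csv
import io

def parse_titles(content):
    """Return the title part of each 'title, year' line (hypothetical helper)."""
    titles = []
    for row in csv.reader(io.StringIO(content)):
        if not row:  # skip blank lines
            continue
        # Assumption: the last field is the year, everything before it is the title.
        title_fields = row[:-1] if len(row) > 1 else row
        titles.append(','.join(title_fields).strip())
    return titles

# Invented sample lines, one quoted title and one unquoted title with a comma
sample = '"Graphs, Trees and Sets",1985\nLogic, Language and Meaning,1988\n'
print(parse_titles(sample))
```

Swapping this in for the `split(',')[0]` step would change the downloaded titles, and therefore the topics, so results would no longer match the outputs recorded in this notebook.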

In [5]:
titles_before_1990 = download_text_file_from_drive(url_before_1990)
titles_from_1990_to_2009 = download_text_file_from_drive(url_from_1990_to_2009)
titles_from_2010 = download_text_file_from_drive(url_from_2010)

# Check the length of downloaded data
print(len(titles_before_1990))
print(len(titles_from_1990_to_2009))
print(len(titles_from_2010))

# Check the first element of each list
# Each line of the raw file has the format "paper_title, year"; only the title is kept
print(titles_before_1990[0])
print(titles_from_1990_to_2009[0])
print(titles_from_2010[0])

40000
243581
582378
An Introduction to Mathematical Taxonomy
The Future of Classic Data Administration: Objects + Databases + CASE
E. W. Dijkstra Archive: The manuscripts of Edsger W. Dijkstra 1930-2002


## Preprocessing Functions

*Optionally, you can write the preprocessing functions for LDA here, or use scikit-learn's built-in preprocessing options while performing LDA.*

*For CTMs, it is recommended that you preprocess the dataset only for creating the bag of words, and generate the embeddings from the unpreprocessed text. The embeddings are then of better quality because more context is present, while the vocabulary size stays manageable. You can refer to the authors' proposed preprocessing implementation [here](https://github.com/MilaNLProc/contextualized-topic-models?tab=readme-ov-file#preprocessing).*
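
The recommendation above amounts to keeping two aligned versions of every document: a cleaned one for the bag of words and the raw one for the embeddings. The sketch below illustrates that idea with the standard library only. It is not the library's implementation; the helper name `prepare_ctm_inputs`, the tiny stop word list, and the default `vocabulary_size` are assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption; real lists are much longer)
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}

def prepare_ctm_inputs(documents, vocabulary_size=2000):
    """Return (bow_texts, raw_texts) as aligned lists (hypothetical helper)."""
    tokenized = [
        [w for w in re.sub(r"\W+", " ", doc.lower()).split() if w not in STOPWORDS]
        for doc in documents
    ]
    # Keep only the most frequent words so the bag-of-words vocabulary stays small
    counts = Counter(w for tokens in tokenized for w in tokens)
    vocab = {w for w, _ in counts.most_common(vocabulary_size)}

    bow_texts, raw_texts = [], []
    for doc, tokens in zip(documents, tokenized):
        kept = [w for w in tokens if w in vocab]
        if kept:  # drop documents that end up empty, so both lists stay aligned
            bow_texts.append(" ".join(kept))
            raw_texts.append(doc)
    return bow_texts, raw_texts

docs = ["An Introduction to Mathematical Taxonomy", "On the", "Parallel Processing in Ada."]
bow, raw = prepare_ctm_inputs(docs)
print(bow)  # cleaned text for the bag of words
print(raw)  # untouched text for the sentence embeddings
```

The important design point is that a document dropped from one list is dropped from the other as well, since the bag of words and the embedding of the same title must stay at the same index.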

In [6]:
# Preprocess 1: Basic Cleaning
def preprocess1(texts):
    return [re.sub(r'\W+', ' ', text.lower()) for text in texts]

In [7]:
# Preprocess 2: Advanced Cleaning (e.g., removing short words, stemming)
def preprocess2(texts):
    from nltk.stem import PorterStemmer
    stemmer = PorterStemmer()
    cleaned_texts = []
    for text in texts:
        cleaned = re.sub(r'\W+', ' ', text.lower())
        tokens = [stemmer.stem(word) for word in cleaned.split() if len(word) > 2]
        cleaned_texts.append(' '.join(tokens))
    return cleaned_texts
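
Some titles contain HTML character entities such as `&auml;`. After the `\W+` substitution these turn into standalone tokens like `auml`, which can then surface as topic words. One possible third preprocessing variant decodes the entities first with the standard library's `html.unescape`. The function name `preprocess_unescaped` is hypothetical, and this is only a sketch of the idea.

```python
import html
import re

def preprocess_unescaped(texts):
    """Decode HTML entities, then apply the same basic cleaning as preprocess1."""
    return [re.sub(r'\W+', ' ', html.unescape(text).lower()) for text in texts]

# '&auml;' is decoded to 'ä' before punctuation is stripped
print(preprocess_unescaped(['ausl&auml;nd. Studieng&auml;nge.']))
```

Since `\W` is Unicode-aware in Python 3, decoded letters such as `ä` are kept as part of the word rather than being split off.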

## LDA

In [8]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

num_lda_topics = 5 # min number of topics

### Before the 1990s:

In [9]:
# Read data
titles_before_1990 = download_text_file_from_drive(url_before_1990)

In [10]:
titles_before_1990

['An Introduction to Mathematical Taxonomy',
 'Speech Acts: An Essay in the Philosophy of Language.',
 'Evolution and the Theory of Games',
 'On the Construction of Programs',
 'Modal Logic - An Introduction',
 'Introduction to Combinators and Lambda-Calculus.',
 'The Cognitive Structure of Emotions.',
 'A Structured Operating System.',
 'Parallel Processing in Ada.',
 'A Model for Communicating Sequential Processes.',
 'An Introduction to Pascal-Plus.',
 'Languages for Parallel Computers.',
 'Modules and Visibility in the Ada Programming Language.',
 'Algorithms for Parallel Computers.',
 'Concurrent Pascal - An Appraisal.',
 'A Structured Compiler.',
 '"Information Systems: Modelling',
 'Computable set theory. Volume 1.',
 'Mentale Belastung und kognitive Prozesse bei komplexen Dialogstrukturen.',
 'Naturwissenschaftsdidaktik als Studienfach: kommentierte Dokumentation ausl&auml;nd. Studieng&auml;nge.',
 'Simulation and the Monte Carlo method.',
 'Software engineering in C.',
 'Searc

In [11]:
# Preprocess 1
titles_before_1990_1 = preprocess1(titles_before_1990)

In [12]:
titles_before_1990_1

['an introduction to mathematical taxonomy',
 'speech acts an essay in the philosophy of language ',
 'evolution and the theory of games',
 'on the construction of programs',
 'modal logic an introduction',
 'introduction to combinators and lambda calculus ',
 'the cognitive structure of emotions ',
 'a structured operating system ',
 'parallel processing in ada ',
 'a model for communicating sequential processes ',
 'an introduction to pascal plus ',
 'languages for parallel computers ',
 'modules and visibility in the ada programming language ',
 'algorithms for parallel computers ',
 'concurrent pascal an appraisal ',
 'a structured compiler ',
 ' information systems modelling',
 'computable set theory volume 1 ',
 'mentale belastung und kognitive prozesse bei komplexen dialogstrukturen ',
 'naturwissenschaftsdidaktik als studienfach kommentierte dokumentation ausl auml nd studieng auml nge ',
 'simulation and the monte carlo method ',
 'software engineering in c ',
 'search in arti

In [13]:
# Preprocess 2
titles_before_1990_2 = preprocess2(titles_before_1990)

In [14]:
titles_before_1990_2

['introduct mathemat taxonomi',
 'speech act essay the philosophi languag',
 'evolut and the theori game',
 'the construct program',
 'modal logic introduct',
 'introduct combin and lambda calculu',
 'the cognit structur emot',
 'structur oper system',
 'parallel process ada',
 'model for commun sequenti process',
 'introduct pascal plu',
 'languag for parallel comput',
 'modul and visibl the ada program languag',
 'algorithm for parallel comput',
 'concurr pascal apprais',
 'structur compil',
 'inform system model',
 'comput set theori volum',
 'mental belastung und kognit prozess bei komplexen dialogstrukturen',
 'naturwissenschaftsdidaktik al studienfach kommentiert dokument ausl auml studieng auml nge',
 'simul and the mont carlo method',
 'softwar engin',
 'search artifici intellig',
 'build expert system',
 'structur complex',
 'encyclopaedia linguist',
 'the comput and the mind introduct cognit scienc',
 'the theori pars',
 'catalogu artifici intellig tool',
 'artifici intellig 

In [15]:
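def perform_lda(texts, num_topics):
    # Bag-of-words counts: drop terms that occur in more than 95% of the documents
    # or in fewer than 2 documents, and remove English stop words
    vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
    X = vectorizer.fit_transform(texts)

    # A fixed random_state keeps the topics reproducible across runs
    lda = LatentDirichletAllocation(n_components=num_topics, random_state=42)
    lda.fit(X)

    feature_names = vectorizer.get_feature_names_out()
    topics = []
    for idx, topic in enumerate(lda.components_):
        # argsort is ascending, so [:-11:-1] selects the 10 highest-weighted words
        topics.append([feature_names[i] for i in topic.argsort()[:-11:-1]])
        print(f"Topic {idx}: {topics[-1]}")

    return topics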

In [16]:
# Perform LDA with num_lda_topics = 5 for Preprocess 1 - Annotate the topics
print("LDA with Preprocess 1 (num_topics = 5):")
topics_p1_5 = perform_lda(titles_before_1990_1, 5)

LDA with Preprocess 1 (num_topics = 5):
Topic 0: ['uuml', 'der', 'und', 'von', 'auml', 'systems', 'zur', 'die', 'distributed', 'retrieval']
Topic 1: ['based', 'analysis', 'using', 'recognition', 'logic', 'design', 'networks', 'graphs', 'image', 'pattern']
Topic 2: ['language', 'sub', 'eacute', 'sup', 'program', 'programming', 'sets', 'trees', 'research', 'graph']
Topic 3: ['computer', 'systems', 'data', 'software', 'design', 'information', 'review', 'analysis', 'programming', 'database']
Topic 4: ['linear', 'problem', 'algorithm', 'control', 'problems', 'method', 'optimal', 'time', 'algorithms', 'systems']


In [17]:
# Perform LDA with num_lda_topics = 5 for Preprocess 2 - Annotate the topics
print("LDA with Preprocess 2 (num_topics = 5):")
topics_p2_5 = perform_lda(titles_before_1990_2, 5)

LDA with Preprocess 2 (num_topics = 5):
Topic 0: ['graph', 'uuml', 'comput', 'der', 'und', 'von', 'ein', 'auml', 'zur', 'die']
Topic 1: ['comput', 'inform', 'data', 'network', 'problem', 'algorithm', 'model', 'design', 'base', 'perform']
Topic 2: ['control', 'program', 'theori', 'model', 'set', 'function', 'linear', 'problem', 'languag', 'applic']
Topic 3: ['time', 'review', 'analysi', 'approxim', 'algorithm', 'error', 'model', 'introduct', 'filter', 'languag']
Topic 4: ['method', 'logic', 'softwar', 'imag', 'use', 'process', 'pattern', 'equat', 'algorithm', 'comput']


In [18]:
# Perform LDA with num_lda_topics > 5 for Preprocess 1 - Annotate the topics
print("LDA with Preprocess 1 (num_topics = 10):")
topics_p1_10 = perform_lda(titles_before_1990_1, 10)

LDA with Preprocess 1 (num_topics = 10):
Topic 0: ['systems', 'control', 'time', 'distributed', 'model', 'discrete', 'linear', 'real', 'adaptive', 'optimal']
Topic 1: ['design', 'logic', 'systems', 'networks', 'based', 'microprocessor', 'analysis', 'new', 'memory', 'high']
Topic 2: ['programming', 'program', 'language', 'software', 'trees', 'development', 'functions', 'engineering', 'computer', 'computing']
Topic 3: ['sub', 'languages', 'structures', 'systems', 'computer', 'introduction', 'free', 'database', 'operating', 'context']
Topic 4: ['algorithm', 'algorithms', 'optimal', 'linear', 'estimation', 'using', 'method', 'random', 'efficient', 'function']
Topic 5: ['theory', 'review', 'graphs', 'theorem', 'note', 'graph', 'set', 'sets', 'book', 'number']
Topic 6: ['uuml', 'der', 'und', 'von', 'auml', 'zur', 'die', 'sup', 'ouml', 'ein']
Topic 7: ['problems', 'problem', 'method', 'equations', 'solution', 'methods', 'complexity', 'linear', 'solving', 'numerical']
Topic 8: ['data', 'compute

In [19]:
# Perform LDA with num_lda_topics > 5 for Preprocess 2 - Annotate the topics
print("LDA with Preprocess 2 (num_topics = 10):")
topics_p2_10 = perform_lda(titles_before_1990_2, 10)

LDA with Preprocess 2 (num_topics = 10):
Topic 0: ['comput', 'scienc', 'studi', 'technolog', 'new', 'program', 'univers', 'educ', 'chess', 'report']
Topic 1: ['network', 'problem', 'comput', 'algorithm', 'parallel', 'tree', 'complex', 'perform', 'model', 'flow']
Topic 2: ['control', 'theori', 'model', 'optim', 'sub', 'sup', 'stochast', 'applic', 'linear', 'adapt']
Topic 3: ['review', 'time', 'analysi', 'real', 'book', 'recognit', 'grammar', 'languag', 'correct', 'queue']
Topic 4: ['process', 'test', 'gener', 'evalu', 'distribut', 'pattern', 'recognit', 'research', 'method', 'model']
Topic 5: ['uuml', 'der', 'und', 'von', 'ein', 'auml', 'zur', 'die', 'ouml', 'mit']
Topic 6: ['algorithm', 'method', 'imag', 'use', 'linear', 'equat', 'solut', 'problem', 'transform', 'dimension']
Topic 7: ['inform', 'logic', 'estim', 'time', 'retriev', 'model', 'analysi', 'decis', 'associ', 'effect']
Topic 8: ['graph', 'set', 'theorem', 'relat', 'eacut', 'number', 'problem', 'group', 'order', 'finit']
Topic 

### From 1990 to 2009:

*Add your code for topic modelling the period from 1990 to 2009 here, similar to what you did for the period before 1990.*

In [20]:
titles_from_1990_to_2009 = download_text_file_from_drive(url_from_1990_to_2009)
titles_from_1990_to_2009_1 = preprocess1(titles_from_1990_to_2009)
titles_from_1990_to_2009_2 = preprocess2(titles_from_1990_to_2009)

In [21]:
titles_from_1990_to_2009_1_5 = perform_lda(titles_from_1990_to_2009_1,5)

Topic 0: ['using', 'sub', 'sup', 'analysis', 'complexity', 'model', 'high', 'low', 'language', 'study']
Topic 1: ['systems', 'software', 'information', 'computer', 'learning', 'based', 'process', 'study', 'model', 'case']
Topic 2: ['problem', 'problems', 'method', 'algorithm', 'linear', 'graphs', 'finite', 'order', 'methods', 'algorithms']
Topic 3: ['information', 'networks', 'data', 'design', 'management', 'web', 'based', 'systems', 'wireless', 'knowledge']
Topic 4: ['based', 'using', 'systems', 'control', 'time', 'networks', 'analysis', 'neural', 'data', 'adaptive']


In [22]:
titles_from_1990_to_2009_1_10 = perform_lda(titles_from_1990_to_2009_1,10)

Topic 0: ['sub', 'sup', 'using', 'high', 'low', 'video', 'power', 'codes', 'measurement', 'frequency']
Topic 1: ['process', 'systems', 'models', 'model', 'information', 'study', 'case', 'test', 'analysis', 'processes']
Topic 2: ['problems', 'problem', 'graphs', 'method', 'uuml', 'finite', 'equations', 'methods', 'und', 'algorithm']
Topic 3: ['management', 'networks', 'wireless', 'mobile', 'information', 'eacute', 'oriented', 'internet', 'object', 'knowledge']
Topic 4: ['systems', 'control', 'time', 'neural', 'adaptive', 'networks', 'based', 'fuzzy', 'using', 'network']
Topic 5: ['analysis', 'human', 'functional', 'study', 'brain', 'model', 'data', 'fmri', 'control', 'networks']
Topic 6: ['based', 'using', 'image', 'images', 'recognition', 'detection', 'data', 'analysis', 'classification', 'processing']
Topic 7: ['based', 'parallel', 'decision', 'scheduling', 'algorithms', 'service', 'simulation', 'multi', 'using', 'performance']
Topic 8: ['software', 'computer', 'systems', 'web', 'desi

In [None]:
titles_from_1990_to_2009_2_5 = perform_lda(titles_from_1990_to_2009_2,5)

In [None]:
titles_from_1990_to_2009_2_10 = perform_lda(titles_from_1990_to_2009_2,10)

### From 2010 onwards:

*Add your code for topic modelling the period from 2010 onwards here, similar to what you did for the period before 1990.*

In [None]:
titles_from_2010 = download_text_file_from_drive(url_from_2010)
titles_from_2010_1 = preprocess1(titles_from_2010)
titles_from_2010_2 = preprocess2(titles_from_2010)

In [None]:
titles_from_2010_1_5 = perform_lda(titles_from_2010_1,5)

In [None]:
titles_from_2010_1_10 = perform_lda(titles_from_2010_1, 10)

In [None]:
titles_from_2010_2_5 = perform_lda(titles_from_2010_2,5)

In [None]:
titles_from_2010_2_10 = perform_lda(titles_from_2010_2,10)

📝❓ For each period, assign a name to each generated topic based on the topic’s top words. List all topic names in your report. If a topic is incoherent to the degree that no common theme is detectable, you can just mark it as incoherent (i.e., no need to name a topic that does not exist).

📝❓ Do the topics make sense to you? Are they coherent? Do you observe trends across different time periods? Discuss in 4-6 sentences.


## Combined Topic Models

Method developed by [Bianchi et al. 2021](https://aclanthology.org/2021.acl-short.96/).

[A 6min presentation of the paper by one of the authors.](https://underline.io/lecture/25716-pre-training-is-a-hot-topic-contextualized-document-embeddings-improve-topic-coherence)

[Medium Blog](https://towardsdatascience.com/contextualized-topic-modeling-with-python-eacl2021-eacf6dfa576)

Code: [https://github.com/MilaNLProc/contextualized-topic-models](https://github.com/MilaNLProc/contextualized-topic-models)

Tutorial: [https://colab.research.google.com/drive/1fXJjr_rwqvpp1IdNQ4dxqN4Dp88cxO97?usp=sharing](https://colab.research.google.com/drive/1fXJjr_rwqvpp1IdNQ4dxqN4Dp88cxO97?usp=sharing)

Again, perform topic modelling for the three time periods - this time using the combined topic models (CTMs).

You can use and adapt the code from the tutorial linked above.

Use the available GPU for faster running times.

In [None]:
from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation

***Important: Executing the import below (`WhiteSpacePreprocessing`) produces an error on the first run. Executing it a second time resolves the error. This is probably due to a caching issue in the contextualized_topic_models package.***

In [None]:
from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessing

In [None]:
from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessing

In [None]:
num_ctm_topics = 5  # min number of topics
MODEL_NAME = "sentence-transformers/paraphrase-mpnet-base-v2" # Model to use for CTM
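
A possible starting point for the CTM cells below, adapted from the usage shown in the package README. Treat it as a sketch under assumptions rather than a reference solution: the helper name `train_ctm` is hypothetical, `contextual_size=768` assumes the embedding dimension of `paraphrase-mpnet-base-v2`, `num_epochs=10` is an arbitrary choice, and the number of values returned by `preprocess()` may differ between package versions.

```python
def train_ctm(raw_titles, num_topics=5,
              model_name="sentence-transformers/paraphrase-mpnet-base-v2",
              num_epochs=10):
    """Train a CombinedTM on raw titles and return the top words per topic."""
    # Imports are local so that defining this helper does not require the package
    import nltk
    from contextualized_topic_models.models.ctm import CombinedTM
    from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation
    from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessing

    # The preprocessing step relies on NLTK's stop word lists
    nltk.download("stopwords", quiet=True)

    sp = WhiteSpacePreprocessing(raw_titles, stopwords_language="english")
    # Assumption: the first two returned values are the preprocessed and the
    # unpreprocessed documents; later values (vocabulary, indices) are ignored here
    result = sp.preprocess()
    bow_texts, raw_texts = result[0], result[1]

    # Embeddings come from the raw text, the bag of words from the cleaned text
    tp = TopicModelDataPreparation(model_name)
    dataset = tp.fit(text_for_contextual=raw_texts, text_for_bow=bow_texts)

    ctm = CombinedTM(bow_size=len(tp.vocab), contextual_size=768,
                     n_components=num_topics, num_epochs=num_epochs)
    ctm.fit(dataset)
    return ctm.get_topic_lists(10)  # 10 top words per topic
```

A call such as `train_ctm(titles_before_1990, num_ctm_topics, MODEL_NAME)` would then return one list of top words per topic. To compare preprocessing variants, the cleaned text passed as `text_for_bow` can be replaced with your own preprocessing output, as long as it stays aligned with the raw text.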

### Before the 1990s:

In [None]:
# Preprocess 1

In [None]:
# Preprocess 2

In [None]:
# Perform CTM with num_ctm_topics = 5 for Preprocess 1 - Annotate the topics

In [None]:
# Perform CTM with num_ctm_topics = 5 for Preprocess 2 - Annotate the topics

In [None]:
# Perform CTM with num_ctm_topics > 5 for Preprocess 1 - Annotate the topics

In [None]:
# Perform CTM with num_ctm_topics > 5 for Preprocess 2 - Annotate the topics

### From 1990 to 2009

Add your code for topic modelling the period from 1990 to 2009 here - similar to what you did for before 1990s

### From 2010 onwards

Add your code for topic modelling the period from 2010 onwards - similar to what you did for before 1990s

📝❓ Again: Assign a name to each topic based on the topic’s top words (for each period). List all topic names in your report.

📝❓ Bianchi et al. 2021 claim that their approach produces more coherent topics than previous methods. Let’s test this claim by comparing the coherence of the topics produced by CTM with the topics produced by LDA. Describe your observations in 3-4 sentences.

📝❓ Do the two models generate similar topics? Can you discover the same temporal trends (if there are any)? Discuss in 5-6 sentences.

📝❓ Can you suggest an alternate model apart from paraphrase-mpnet-base-v2? What could be some of the possible advantages and disadvantages of using an alternate model? Hint: Look at some of the models [here](https://huggingface.co/spaces/mteb/leaderboard). Note: You do not need to execute the code for an alternate model.
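
For the coherence comparison asked about above, a quantitative score can complement manual inspection. The sketch below computes normalized pointwise mutual information (NPMI), one commonly used coherence measure, from document-level co-occurrence. It assumes the documents are already preprocessed and whitespace-tokenized; the function name `npmi_coherence` and the toy documents are made up for illustration, and this is not necessarily the exact evaluation setup of Bianchi et al.

```python
import math
from itertools import combinations

def npmi_coherence(topic_words, documents):
    """Average NPMI over all pairs of topic words (document-level co-occurrence)."""
    doc_sets = [set(doc.split()) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all the given words
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # the words never co-occur: minimum NPMI
        elif p12 == 1:
            scores.append(1.0)   # both words in every document: limiting value
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)

# Toy corpus: 'graph' and 'theory' always co-occur, 'graph' and 'neural' never do
toy_docs = ["graph theory", "graph theory", "neural network", "neural network"]
print(npmi_coherence(["graph", "theory"], toy_docs))
print(npmi_coherence(["graph", "neural"], toy_docs))
```

Scores range from -1 (words never appear together) to 1 (words always appear together), so averaging the score over all topics of a model gives one number per model and time period to compare.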

## Lab Report