<a href="https://colab.research.google.com/github/Andrian0s/ML4NLP1-2024-Tutorial-Notebooks/blob/main/exercises/ex6/ex06_topic_modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -qU contextualized-topic-models


## General Instructions

1. Perform Topic Modeling using LDA and CTM on the three time frames: before 1990, 1990-2009 and 2010 onwards.
2. Experiment with a) different preprocessing functions and b) varying numbers of topics.
3. Annotate the topics.
4. Answer the questions marked with 📝❓ in your lab report at the end of this notebook.

## Import Libraries

In [None]:
import re
import urllib
import gzip
import io
import csv
import random
from collections import defaultdict
from tqdm import tqdm

## Download Dataset

In [None]:
url_before_1990 = 'https://drive.google.com/file/d/1o_IeJCqvDLH5xgjYYuEHoPuPjF7SYvwR/view?usp=drive_link'
url_from_1990_to_2009 = 'https://drive.google.com/file/d/1Q31iYPxlcsvB0nwGter3RDfbhVRtV2yI/view?usp=drive_link'
url_from_2010 = 'https://drive.google.com/file/d/1s7pLqaiMVxM0M4WBKgZpBxNDFKXeQ47x/view?usp=drive_link'

In [None]:
# Function to download data given a google drive url - Returns a list
import requests

def download_text_file_from_drive(drive_url):
    try:
        file_id = drive_url.split('/d/')[1].split('/')[0]
    except IndexError:
        raise ValueError("Invalid Google Drive URL format. Ensure it includes '/d/<file_id>/'.")

    download_url = f"https://drive.google.com/uc?id={file_id}&export=download"

    response = requests.get(download_url)
    if response.status_code != 200:
        raise RuntimeError(f"Failed to download file. HTTP Status Code: {response.status_code}")

    content = response.text
    titles_year = content.splitlines()
    titles = [x.split(',')[0] for x in titles_year]
    return titles
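
Note that `x.split(',')[0]` in the function above keeps only the text before the *first* comma, so a title that itself contains a comma would be cut short. A minimal sketch of a comma-safe alternative follows, assuming every line ends with `, year` as its last field. The helper name `split_title_year` and the example title are made up for illustration; they are not part of the exercise code.

```python
def split_title_year(line):
    """Split a 'title, year' line on the LAST comma only (hypothetical helper)."""
    title, sep, year = line.rpartition(',')
    if not sep:
        # No comma at all: treat the whole line as the title.
        return line.strip(), None
    return title.strip(), year.strip()


# Made-up example line, used only to show the behaviour.
print(split_title_year("Graphs, Trees, and Proofs, 1985"))
```

If the source files turn out to use quoted CSV fields instead, the `csv` module would be the safer choice.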

In [None]:
titles_before_1990 = download_text_file_from_drive(url_before_1990)
titles_from_1990_to_2009 = download_text_file_from_drive(url_from_1990_to_2009)
titles_from_2010 = download_text_file_from_drive(url_from_2010)

# Check the length of downloaded data
print(len(titles_before_1990))
print(len(titles_from_1990_to_2009))
print(len(titles_from_2010))

# Check the first element of each list
# Each source line has the format `paper_title, year`; only the title is kept
print(titles_before_1990[0])
print(titles_from_1990_to_2009[0])
print(titles_from_2010[0])

40000
243581
582378
An Introduction to Mathematical Taxonomy
The Future of Classic Data Administration: Objects + Databases + CASE
E. W. Dijkstra Archive: The manuscripts of Edsger W. Dijkstra 1930-2002


## Preprocessing Functions

*Optionally, you can write the preprocessing functions for LDA here, or use sklearn's built-in functionality for preprocessing while performing LDA.*

*For CTMs, it is recommended to preprocess the dataset only for creating the bag of words, and to generate the embeddings from the unpreprocessed text. The embeddings are then of better quality because more context is present, while the vocabulary size stays manageable. You can refer to the authors' proposed preprocessing implementation [here](https://github.com/MilaNLProc/contextualized-topic-models?tab=readme-ov-file#preprocessing)*
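
The idea of keeping two aligned versions of every document can be sketched in plain Python. This is only an illustration under stated assumptions: `prepare_ctm_inputs` is a hypothetical helper, not the library's implementation, and the cleaning rule (lowercase, letters only, stopword removal) is one possible choice.

```python
import re


def prepare_ctm_inputs(documents, stop_words):
    """Return aligned (bow_docs, contextual_docs) lists (hypothetical helper).

    bow_docs:        cleaned text, intended for the bag-of-words matrix
    contextual_docs: the original text, intended for the sentence embeddings
    """
    bow_docs, contextual_docs = [], []
    for doc in documents:
        # Clean only the copy used for the bag of words.
        tokens = re.sub(r"[^a-z ]", " ", doc.lower()).split()
        tokens = [t for t in tokens if t not in stop_words]
        if tokens:
            # Drop documents that become empty, in BOTH lists,
            # so the two lists stay aligned index by index.
            bow_docs.append(" ".join(tokens))
            contextual_docs.append(doc)
    return bow_docs, contextual_docs


bow, ctx = prepare_ctm_inputs(
    ["The Design of Parallel Algorithms!", "Of the"], {"the", "of"}
)
print(bow)
print(ctx)
```

The key design point is the alignment: whenever a document is dropped from the bag-of-words side, it has to be dropped from the embedding side as well.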

In [None]:
def preprocess1():
    return

In [None]:
def preprocess2():
    return

## LDA

In [None]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

num_lda_topics = 5 # min number of topics

In [None]:
# Constants
NUM_OF_FEATURES = 10000
MAX_DF = 0.95
MIN_DF = 0.01
NUM_LDA_TOPICS_5 = 5
NUM_LDA_TOPICS_8 = 8

### Before the 1990s:

In [None]:
# Read data
def get_titles(data):
    """Extracts titles from the downloaded text file."""
    return [re.sub(r',.*', '', line) for line in data]

In [None]:
# Preprocess 1
def preprocess1(text):
    """Basic text cleaning: keep only ASCII letters and spaces (digits and punctuation are removed), then lowercase."""
    text = re.sub(r'[^a-zA-Z ]', '', text)
    text = text.lower()
    return text

In [None]:
# Preprocess 2
def preprocess2(text, stop_words=None):
    """Text cleaning with additional stopword removal."""
    if stop_words is None:
        stop_words = set(['the', 'of', 'and', 'to', 'a', 'in', 'for', 'on', 'with'])
    text = preprocess1(text)
    words = [word for word in text.split() if word not in stop_words]
    return " ".join(words)

In [None]:
# Vectorize titles
def vectorize_titles(titles, num_features, max_df=MAX_DF, min_df=MIN_DF):
    """Converts titles into a term-document matrix."""
    vectorizer = CountVectorizer(max_df=max_df, min_df=min_df, max_features=num_features, stop_words='english')
    tf = vectorizer.fit_transform(titles)
    tf_feature_names = vectorizer.get_feature_names_out()
    return tf, tf_feature_names

# Print topics
def print_topics(lda, tf_feature_names):
    """Displays the top 12 words for each topic."""
    for topic_idx, topic in enumerate(lda.components_):
        print(f"Topic {topic_idx + 1}: ", end="")
        print(" ".join([tf_feature_names[i] for i in topic.argsort()[:-12 - 1:-1]]))
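
The slice `topic.argsort()[:-12 - 1:-1]` in `print_topics` can look cryptic: it takes the indices sorted by ascending weight and walks backwards from the end, yielding the 12 highest-weighted indices in descending order. A plain-Python equivalent, shown here as a sketch with a hypothetical helper `top_n_indices` and made-up weights:

```python
def top_n_indices(weights, n):
    """Indices of the n largest weights, largest first."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return order[:n]


# Made-up topic-word weights and vocabulary for illustration.
weights = [0.1, 2.5, 0.7, 1.9]
vocab = ["graph", "system", "logic", "data"]
print([vocab[i] for i in top_n_indices(weights, 2)])  # ['system', 'data']
```

For tied weights the two approaches may order the tied entries differently.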

In [None]:
# titles_before_1990 = []

In [None]:
# Perform LDA with num_lda_topics = 5 for Preprocess 1 - Annotate the topics
titles_before_1990_preprocessed1 = [preprocess1(title) for title in titles_before_1990]

titles_before_1990_preprocessed1 = [title for title in titles_before_1990_preprocessed1 if title.strip()]

# Print only a small sample; the full list contains tens of thousands of titles
print("Preprocessed Titles (Method 1), first 5:", titles_before_1990_preprocessed1[:5])



In [None]:
tf1, feature_names1 = vectorize_titles(titles_before_1990_preprocessed1, NUM_OF_FEATURES)
# Perform LDA with num_lda_topics = 5
print("LDA Topics with num_lda_topics = 5 (Preprocess 1):")
lda1_5 = LatentDirichletAllocation(
    n_components=NUM_LDA_TOPICS_5,
    max_iter=5,
    learning_method="online",
    random_state=42
).fit(tf1)
print_topics(lda1_5, feature_names1)

LDA Topics with num_lda_topics = 5 (Preprocess 1):
Topic 1: model linear method theory new note adaptive digital program graphs methods applications
Topic 2: systems using information software approach performance processing application network models database distributed
Topic 3: control algorithm design problem problems distributed parallel und based programs optimal adaptive
Topic 4: analysis data algorithms review optimal der time estimation dynamic von zur functions
Topic 5: computer programming networks language logic image evaluation science finite processing dynamic performance


In [None]:
# Perform LDA with num_lda_topics = 5 for Preprocess 2 - Annotate the topics
titles_before_1990_preprocessed2 = [preprocess2(title) for title in titles_before_1990]

tf2, feature_names2 = vectorize_titles(titles_before_1990_preprocessed2, NUM_OF_FEATURES)

print("\nLDA Topics with num_lda_topics = 5 (Preprocess 2):")
lda2_5 = LatentDirichletAllocation(
    n_components=NUM_LDA_TOPICS_5,
    max_iter=5,
    learning_method="online",
    random_state=42
).fit(tf2)
print_topics(lda2_5, feature_names2)


LDA Topics with num_lda_topics = 5 (Preprocess 2):
Topic 1: model linear method theory new note adaptive digital program graphs methods applications
Topic 2: systems using information software approach performance processing application network models database distributed
Topic 3: control algorithm design problem problems distributed parallel und based programs optimal adaptive
Topic 4: analysis data algorithms review optimal der time estimation dynamic von zur functions
Topic 5: computer programming networks language logic image evaluation science finite processing dynamic performance


In [None]:
# Perform LDA with num_lda_topics > 5 for Preprocess 1 - Annotate the topics
print("\nLDA Topics with num_lda_topics = 8 (Preprocess 1):")
lda1_8 = LatentDirichletAllocation(
    n_components=NUM_LDA_TOPICS_8,
    max_iter=5,
    learning_method="online",
    random_state=42
).fit(tf1)
print_topics(lda1_8, feature_names1)


LDA Topics with num_lda_topics = 8 (Preprocess 1):
Topic 1: problems distributed processing parallel note based languages functions image algorithms algorithm method
Topic 2: problem software approach networks digital program recognition simulation new using application design
Topic 3: control algorithm design theory programs database adaptive application systems applications linear optimal
Topic 4: analysis data algorithms time adaptive estimation methods dynamic image performance using application
Topic 5: programming language logic finite dynamic linear languages programs sets applications application design
Topic 6: systems information linear method models graphs applications sets control time dynamic estimation
Topic 7: computer using model review new performance application evaluation science systems image information
Topic 8: optimal der network und von zur control simulation linear problem algorithm time


In [None]:
# Perform LDA with num_lda_topics > 5 for Preprocess 2 - Annotate the topics
print("\nLDA Topics with num_lda_topics = 8 (Preprocess 2):")
lda2_8 = LatentDirichletAllocation(
    n_components=NUM_LDA_TOPICS_8,
    max_iter=5,
    learning_method="online",
    random_state=42
).fit(tf2)
print_topics(lda2_8, feature_names2)


LDA Topics with num_lda_topics = 8 (Preprocess 2):
Topic 1: problems distributed processing parallel note based languages functions image algorithms algorithm method
Topic 2: problem software approach networks digital program recognition simulation new using application design
Topic 3: control algorithm design theory programs database adaptive application systems applications linear optimal
Topic 4: analysis data algorithms time adaptive estimation methods dynamic image performance using application
Topic 5: programming language logic finite dynamic linear languages programs sets applications application design
Topic 6: systems information linear method models graphs applications sets control time dynamic estimation
Topic 7: computer using model review new performance application evaluation science systems information analysis
Topic 8: optimal der network und von zur control simulation linear problem algorithm time


### From 1990 to 2009:

*Add your code for topic modelling the period from 1990 to 2009 here, similar to what you did for the period before 1990.*

In [None]:
# Preprocess titles
titles_preprocessed1 = [preprocess1(title) for title in titles_from_1990_to_2009]
titles_preprocessed2 = [preprocess2(title) for title in titles_from_1990_to_2009]

# Remove empty titles after preprocessing
titles_preprocessed1 = [title for title in titles_preprocessed1 if title.strip()]
titles_preprocessed2 = [title for title in titles_preprocessed2 if title.strip()]

In [None]:
# Vectorization (reuses vectorize_titles as defined in the "Before the 1990s" section)

# Vectorize titles for both preprocessing methods
tf1, feature_names1 = vectorize_titles(titles_preprocessed1, NUM_OF_FEATURES)
tf2, feature_names2 = vectorize_titles(titles_preprocessed2, NUM_OF_FEATURES)

# Perform LDA and print topics
def perform_lda_and_print(tf, tf_feature_names, num_topics):
    lda = LatentDirichletAllocation(
        n_components=num_topics,
        max_iter=5,
        learning_method="online",
        random_state=42
    ).fit(tf)

    for topic_idx, topic in enumerate(lda.components_):
        print(f"Topic {topic_idx + 1}: ", end="")
        print(" ".join([tf_feature_names[i] for i in topic.argsort()[:-12 - 1:-1]]))


In [None]:
# Perform LDA with num_lda_topics = 5 for Preprocess 1
print("LDA Topics with num_lda_topics = 5 (Preprocess 1):")
perform_lda_and_print(tf1, feature_names1, NUM_LDA_TOPICS_5)

LDA Topics with num_lda_topics = 5 (Preprocess 1):
Topic 1: design information method new adaptive models software problem applications digital mobile control
Topic 2: model data methods distributed image development knowledge recognition processing management based application
Topic 3: networks approach network neural performance study linear optimal nonlinear graphs time evaluation
Topic 4: systems control based algorithms parallel problems modeling dynamic management computing theory web
Topic 5: using analysis algorithm learning estimation application efficient optimization simulation processing images parallel


In [None]:
# Perform LDA with num_lda_topics = 5 for Preprocess 2
print("\nLDA Topics with num_lda_topics = 5 (Preprocess 2):")
perform_lda_and_print(tf2, feature_names2, NUM_LDA_TOPICS_5)


LDA Topics with num_lda_topics = 5 (Preprocess 2):
Topic 1: design information method new adaptive models software problem applications digital mobile control
Topic 2: model data methods distributed image development knowledge recognition processing management based application
Topic 3: networks approach network neural performance study linear optimal nonlinear graphs time evaluation
Topic 4: systems control based algorithms parallel problems modeling dynamic management computing theory web
Topic 5: using analysis algorithm learning estimation application efficient optimization simulation processing images parallel


In [None]:
# Perform LDA with num_lda_topics > 5 for Preprocess 1
print("\nLDA Topics with num_lda_topics = 8 (Preprocess 1):")
perform_lda_and_print(tf1, feature_names1, NUM_LDA_TOPICS_8)


LDA Topics with num_lda_topics = 8 (Preprocess 1):
Topic 1: design new learning models problem digital mobile algorithm approach application linear using
Topic 2: model data development recognition software application using based management information time approach
Topic 3: networks approach network neural performance study time evaluation detection using nonlinear mobile
Topic 4: systems control problems optimization web programming nonlinear linear optimal information approach application
Topic 5: analysis distributed dynamic computing theory processing images performance applications application parallel using
Topic 6: using algorithm information linear software application nonlinear efficient applications management time parallel
Topic 7: based method algorithms estimation parallel optimal methods modeling computer efficient linear nonlinear
Topic 8: adaptive graphs fuzzy image simulation knowledge nonlinear management using algorithm control application


In [None]:
# Perform LDA with num_lda_topics > 5 (e.g., 8) for Preprocess 2
print("\nLDA Topics with num_lda_topics = 8 (Preprocess 2):")
perform_lda_and_print(tf2, feature_names2, NUM_LDA_TOPICS_8)


LDA Topics with num_lda_topics = 8 (Preprocess 2):
Topic 1: design new learning models problem digital mobile algorithm approach application linear using
Topic 2: model data development recognition software application using based management information time approach
Topic 3: networks approach network neural performance study time evaluation detection using nonlinear mobile
Topic 4: systems control problems optimization web programming nonlinear linear optimal information approach application
Topic 5: analysis distributed dynamic computing theory processing images performance applications application parallel using
Topic 6: using algorithm information linear software application nonlinear efficient applications management time parallel
Topic 7: based method algorithms estimation parallel optimal methods modeling computer efficient linear nonlinear
Topic 8: adaptive graphs fuzzy image simulation knowledge nonlinear management using algorithm control application


### From 2010 onwards:

*Add your code for topic modelling the period from 2010 onwards here, similar to what you did for the period before 1990.*

In [None]:
# Preprocess titles
titles_preprocessed1 = [preprocess1(title) for title in titles_from_2010]
titles_preprocessed2 = [preprocess2(title) for title in titles_from_2010]

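# Remove empty titles after preprocessing
titles_preprocessed1 = [title for title in titles_preprocessed1 if title.strip()]
titles_preprocessed2 = [title for title in titles_preprocessed2 if title.strip()]

# Re-vectorize for this period; without this step tf1/tf2 still hold the 1990-2009 matrices
tf1, feature_names1 = vectorize_titles(titles_preprocessed1, NUM_OF_FEATURES)
tf2, feature_names2 = vectorize_titles(titles_preprocessed2, NUM_OF_FEATURES)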

In [None]:
# Perform LDA with num_lda_topics = 5 for Preprocess 1
print("LDA Topics with num_lda_topics = 5 (Preprocess 1):")
perform_lda_and_print(tf1, feature_names1, NUM_LDA_TOPICS_5)

LDA Topics with num_lda_topics = 5 (Preprocess 1):
Topic 1: design information method new adaptive models software problem applications digital mobile control
Topic 2: model data methods distributed image development knowledge recognition processing management based application
Topic 3: networks approach network neural performance study linear optimal nonlinear graphs time evaluation
Topic 4: systems control based algorithms parallel problems modeling dynamic management computing theory web
Topic 5: using analysis algorithm learning estimation application efficient optimization simulation processing images parallel


In [None]:
# Perform LDA with num_lda_topics = 5 for Preprocess 2
print("\nLDA Topics with num_lda_topics = 5 (Preprocess 2):")
perform_lda_and_print(tf2, feature_names2, NUM_LDA_TOPICS_5)


LDA Topics with num_lda_topics = 5 (Preprocess 2):
Topic 1: design information method new adaptive models software problem applications digital mobile control
Topic 2: model data methods distributed image development knowledge recognition processing management based application
Topic 3: networks approach network neural performance study linear optimal nonlinear graphs time evaluation
Topic 4: systems control based algorithms parallel problems modeling dynamic management computing theory web
Topic 5: using analysis algorithm learning estimation application efficient optimization simulation processing images parallel


In [None]:
# Perform LDA with num_lda_topics > 5 for Preprocess 1
print("\nLDA Topics with num_lda_topics = 8 (Preprocess 1):")
perform_lda_and_print(tf1, feature_names1, NUM_LDA_TOPICS_8)


LDA Topics with num_lda_topics = 8 (Preprocess 1):
Topic 1: design new learning models problem digital mobile algorithm approach application linear using
Topic 2: model data development recognition software application using based management information time approach
Topic 3: networks approach network neural performance study time evaluation detection using nonlinear mobile
Topic 4: systems control problems optimization web programming nonlinear linear optimal information approach application
Topic 5: analysis distributed dynamic computing theory processing images performance applications application parallel using
Topic 6: using algorithm information linear software application nonlinear efficient applications management time parallel
Topic 7: based method algorithms estimation parallel optimal methods modeling computer efficient linear nonlinear
Topic 8: adaptive graphs fuzzy image simulation knowledge nonlinear management using algorithm control application


In [None]:
# Perform LDA with num_lda_topics > 5 for Preprocess 2
print("\nLDA Topics with num_lda_topics = 8 (Preprocess 2):")
perform_lda_and_print(tf2, feature_names2, NUM_LDA_TOPICS_8)


LDA Topics with num_lda_topics = 8 (Preprocess 2):
Topic 1: design new learning models problem digital mobile algorithm approach application linear using
Topic 2: model data development recognition software application using based management information time approach
Topic 3: networks approach network neural performance study time evaluation detection using nonlinear mobile
Topic 4: systems control problems optimization web programming nonlinear linear optimal information approach application
Topic 5: analysis distributed dynamic computing theory processing images performance applications application parallel using
Topic 6: using algorithm information linear software application nonlinear efficient applications management time parallel
Topic 7: based method algorithms estimation parallel optimal methods modeling computer efficient linear nonlinear
Topic 8: adaptive graphs fuzzy image simulation knowledge nonlinear management using algorithm control application


📝❓ For each period, assign a name to each generated topic based on the topic’s top words. List all topic names in your report. If a topic is incoherent to the degree that no common theme is detectable, you can just mark it as incoherent (i.e., no need to name a topic that does not exist).

📝❓ Do the topics make sense to you? Are they coherent? Do you observe trends across different time periods? Discuss in 4-6 sentences.
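
As a rough, automatic complement to reading the topics by eye, one could compute how much the topics overlap. The sketch below computes a simple topic-diversity score: the share of unique words among all top words. It is not required by the exercise; `topic_diversity` is a hypothetical helper and the two toy topics are made up.

```python
def topic_diversity(topics, top_k=10):
    """Fraction of unique words among the top_k words of all topics (1.0 = no overlap)."""
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)


# Toy topics: "data" appears in both, so 5 of the 6 top words are unique.
toy_topics = [["data", "model", "system"], ["data", "network", "control"]]
print(round(topic_diversity(toy_topics), 3))
```

A low score suggests that several topics share the same generic words (for example `using` or `application`), which is one symptom of poorly separated topics. It says nothing about whether each individual topic is coherent.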


## Combined Topic Models

Method developed by [Bianchi et al. 2021](https://aclanthology.org/2021.acl-short.96/).

[A 6min presentation of the paper by one of the authors.](https://underline.io/lecture/25716-pre-training-is-a-hot-topic-contextualized-document-embeddings-improve-topic-coherence)

[Medium Blog](https://towardsdatascience.com/contextualized-topic-modeling-with-python-eacl2021-eacf6dfa576)

Code: [https://github.com/MilaNLProc/contextualized-topic-models](https://github.com/MilaNLProc/contextualized-topic-models)

Tutorial: [https://colab.research.google.com/drive/1fXJjr_rwqvpp1IdNQ4dxqN4Dp88cxO97?usp=sharing](https://colab.research.google.com/drive/1fXJjr_rwqvpp1IdNQ4dxqN4Dp88cxO97?usp=sharing)

Again, perform topic modelling for the three time periods - this time using the combined topic models (CTMs).

You can use and adapt the code from the tutorial linked above.

Use the available GPU for faster running times.
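
Before training, it can be worth checking that a GPU is actually visible to PyTorch. A minimal sketch, assuming PyTorch is installed in the runtime (it falls back to `"cpu"` otherwise); `pick_device` is a hypothetical helper name:

```python
def pick_device():
    """Return "cuda" if PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(pick_device())
```

If this prints `cpu` in Colab, the runtime type probably needs to be switched to a GPU runtime before rerunning the notebook.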

In [None]:
from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation
from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessingStopwords

***Important: executing the import below (WhiteSpacePreprocessing) produces an error on the first run; executing it again succeeds. This is probably due to a caching issue in the contextualized_topic_models package.***

In [None]:
from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessing

In [None]:
def print_ctm_topics(ctm):
  for topic_idx, topic in enumerate(ctm.get_topic_lists(10)):
    print(f"Topic {topic_idx + 1}: ", end="")
    print(" ".join(topic))

### Preprocess 1

In [None]:
# Preprocess 1
import nltk

from nltk.corpus import stopwords as stop_words

nltk.download('stopwords')

stopwords = list(stop_words.words("english"))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
tp = TopicModelDataPreparation("sentence-transformers/paraphrase-mpnet-base-v2")

In [None]:
# Before 1990

sp_before_1990_1 = WhiteSpacePreprocessingStopwords(titles_before_1990, stopwords_list=stopwords)
preprocessed_documents_before_1990_1, unpreprocessed_corpus_before_1990_1, vocab_before_1990_1, retained_indices_before_1990_1 = sp_before_1990_1.preprocess()
training_dataset_before_1990_1 = tp.fit(text_for_contextual=unpreprocessed_corpus_before_1990_1, text_for_bow=preprocessed_documents_before_1990_1)


In [None]:
# From 1990 to 2009

sp_from_1990_to_2009_1 = WhiteSpacePreprocessingStopwords(titles_from_1990_to_2009, stopwords_list=stopwords)
preprocessed_documents_from_1990_to_2009_1, unpreprocessed_corpus_from_1990_to_2009_1, vocab_from_1990_to_2009_1, retained_indices_from_1990_to_2009_1 = sp_from_1990_to_2009_1.preprocess()
training_dataset_from_1990_to_2009_1 = tp.fit(text_for_contextual=unpreprocessed_corpus_from_1990_to_2009_1, text_for_bow=preprocessed_documents_from_1990_to_2009_1)


In [None]:
# From 2010

sp_from_2010_1 = WhiteSpacePreprocessingStopwords(titles_from_2010, stopwords_list=stopwords)
preprocessed_documents_from_2010_1, unpreprocessed_corpus_from_2010_1, vocab_from_2010_1, retained_indices_from_2010_1 = sp_from_2010_1.preprocess()
training_dataset_from_2010_1 = tp.fit(text_for_contextual=unpreprocessed_corpus_from_2010_1, text_for_bow=preprocessed_documents_from_2010_1)


### Preprocess 2

In [None]:
# Preprocess 2: same whitespace/stopword preprocessing as Preprocess 1, but with a different embedding model
tp_2 = TopicModelDataPreparation("all-mpnet-base-v2")

In [None]:
# Before 1990

sp_before_1990_2 = WhiteSpacePreprocessingStopwords(titles_before_1990, stopwords_list=stopwords)
preprocessed_documents_before_1990_2, unpreprocessed_corpus_before_1990_2, vocab_before_1990_2, retained_indices_before_1990_2 = sp_before_1990_2.preprocess()
training_dataset_before_1990_2 = tp_2.fit(text_for_contextual=unpreprocessed_corpus_before_1990_2, text_for_bow=preprocessed_documents_before_1990_2)


In [None]:
# From 1990 to 2009

sp_from_1990_to_2009_2 = WhiteSpacePreprocessingStopwords(titles_from_1990_to_2009, stopwords_list=stopwords)
preprocessed_documents_from_1990_to_2009_2, unpreprocessed_corpus_from_1990_to_2009_2, vocab_from_1990_to_2009_2, retained_indices_from_1990_to_2009_2 = sp_from_1990_to_2009_2.preprocess()
training_dataset_from_1990_to_2009_2 = tp_2.fit(text_for_contextual=unpreprocessed_corpus_from_1990_to_2009_2, text_for_bow=preprocessed_documents_from_1990_to_2009_2)


In [None]:
# From 2010

sp_from_2010_2 = WhiteSpacePreprocessingStopwords(titles_from_2010, stopwords_list=stopwords)
preprocessed_documents_from_2010_2, unpreprocessed_corpus_from_2010_2, vocab_from_2010_2, retained_indices_from_2010_2 = sp_from_2010_2.preprocess()
training_dataset_from_2010_2 = tp_2.fit(text_for_contextual=unpreprocessed_corpus_from_2010_2, text_for_bow=preprocessed_documents_from_2010_2)


### Perform CTM

In [None]:
# Perform CTM with num_ctm_topics = 5 for Preprocess 1 - Annotate the topics
ctm_1_1 = CombinedTM(bow_size=len(tp.vocab), contextual_size=768, n_components=5, num_epochs=10)
ctm_1_1.fit(training_dataset_before_1990_1) # run the model


In [None]:
print("CTM Topics with num_ctm_topics = 5 (Preprocess 1):")
print_ctm_topics(ctm_1_1)

CTM Topics with num_ctm_topics = 5 (Preprocess 1):
Topic 1: uuml und zur von der auml des ber die ouml
Topic 2: book editor technology teaching survey intelligence report science program artificial
Topic 3: algorithm linear control method problems time problem equations adaptive algorithms
Topic 4: number regular designs sup combinatorial boolean groups plane graphs trees
Topic 5: system data design information systems based processing analysis management model


In [None]:
# Perform CTM with num_ctm_topics = 5 for Preprocess 2 - Annotate the topics
ctm_1_2 = CombinedTM(bow_size=len(tp_2.vocab), contextual_size=768, n_components=5, num_epochs=10)
ctm_1_2.fit(training_dataset_before_1990_2) # run the model

Epoch: [10/10]	 Seen Samples: [392960/393030]	Train Loss: 34.373047101769465	Time: 0:00:11.849386: : 10it [01:55, 11.53s/it]
100%|██████████| 615/615 [00:08<00:00, 71.76it/s]


In [None]:
print("CTM Topics with num_ctm_topics = 5 (Preprocess 2):")
print_ctm_topics(ctm_1_2)

CTM Topics with num_ctm_topics = 5 (Preprocess 2):
Topic 1: using image recognition dimensional algorithms algorithm adaptive pattern speech estimation
Topic 2: graphs problems finite problem sup equations solution sub linear number
Topic 3: systems computer design system control data information software database programming
Topic 4: experiment symbolic physical advances logical logic capability methodologies assembly heterogeneous
Topic 5: der uuml zur von und de ouml eacute die mit


In [None]:
# Perform CTM with num_ctm_topics = 8 for Preprocess 1 - Annotate the topics
ctm_1_3 = CombinedTM(bow_size=len(tp.vocab), contextual_size=768, n_components=8, num_epochs=10)
ctm_1_3.fit(training_dataset_before_1990_1) # run the model

In [None]:
print("CTM Topics with num_ctm_topics = 8 (Preprocess 1):")
print_ctm_topics(ctm_1_3)

CTM Topics with num_ctm_topics = 8 (Preprocess 1):
Topic 1: uuml von und der auml die zur mit ouml ein
Topic 2: graphs groups classes forms regular cycles hierarchy designs theorem proof
Topic 3: problems problem algorithm algorithms solution equations parallel complexity method linear
Topic 4: system information data database systems processing based management performance design
Topic 5: control systems time model optimal linear theory discrete dynamic adaptive
Topic 6: book computers scientific editor world research chess report current international
Topic 7: recognition image using pattern digital images dimensional processing speech detection
Topic 8: programming software language languages logic program design development programs engineering


In [None]:
# Perform CTM with num_ctm_topics = 8 for Preprocess 2 - Annotate the topics
ctm_1_4 = CombinedTM(bow_size=len(tp_2.vocab), contextual_size=768, n_components=8, num_epochs=10)
ctm_1_4.fit(training_dataset_before_1990_2) # run the model

Epoch: [10/10]	 Seen Samples: [392960/393030]	Train Loss: 34.709036457422116	Time: 0:00:11.341558: : 10it [01:54, 11.41s/it]
100%|██████████| 615/615 [00:08<00:00, 72.47it/s]


In [None]:
print("CTM Topics with num_ctm_topics = 8 (Preprocess 2):")
print_ctm_topics(ctm_1_4)

CTM Topics with num_ctm_topics = 8 (Preprocess 2):
Topic 1: problem algorithm graphs algorithms networks graph network parallel trees search
Topic 2: und von auml der uuml die ouml zur ein szlig
Topic 3: causal layer limited magnetic chip mass water arm optimizing gas
Topic 4: using recognition analysis dimensional image detection digital adaptive images pattern
Topic 5: computer science review software language programming engineering book program development
Topic 6: system data design systems performance distributed information database management based
Topic 7: linear control method systems problems methods discrete equations nonlinear stochastic
Topic 8: logic theorem languages arithmetic grammars sets theories calculus automata types


### From 1990 to 2009

Add your code for topic modelling the period from 1990 to 2009 here, similar to what you did for the period before 1990.

In [None]:
# Perform CTM with num_ctm_topics = 5 for Preprocess 1 - Annotate the topics

ctm_2_1 = CombinedTM(bow_size=len(tp.vocab), contextual_size=768, n_components=5, num_epochs=10)
ctm_2_1.fit(training_dataset_from_1990_to_2009_1) # run the model


In [None]:
print("CTM Topics with num_ctm_topics = 5 (Preprocess 1):")
print_ctm_topics(ctm_2_1)

CTM Topics with num_ctm_topics = 5 (Preprocess 1):
Topic 1: systems networks control time network neural design system performance adaptive
Topic 2: collision vibration antenna multilayer dense disk following fine reducing redundancy
Topic 3: computer information technology web software knowledge virtual development case research
Topic 4: image using data analysis images based model detection recognition estimation
Topic 5: problems sub problem sup linear methods finite order method algorithm


In [None]:
# Perform CTM with num_ctm_topics = 5 for Preprocess 2 - Annotate the topics
ctm_2_2 = CombinedTM(bow_size=len(tp_2.vocab), contextual_size=768, n_components=5, num_epochs=10)
ctm_2_2.fit(training_dataset_from_1990_to_2009_2) # run the model

Epoch: [10/10]	 Seen Samples: [2392320/2392490]	Train Loss: 39.25382107244321	Time: 0:01:08.441913: : 10it [11:11, 67.14s/it]
100%|██████████| 3739/3739 [00:58<00:00, 63.49it/s]


In [None]:
print("CTM Topics with num_ctm_topics = 5 (Preprocess 2):")
print_ctm_topics(ctm_2_2)

CTM Topics with num_ctm_topics = 5 (Preprocess 2):
Topic 1: polygonal log covariance nonstationary coefficient vibration nonuniform weights residual shift
Topic 2: linear sub sup time problems control nonlinear systems method equations
Topic 3: using image based data analysis model detection estimation images recognition
Topic 4: information web knowledge case study management research computer virtual eacute
Topic 5: design system performance uuml distributed und software der oriented high


In [None]:
# Perform CTM with num_ctm_topics = 8 for Preprocess 1 - Annotate the topics

ctm_2_3 = CombinedTM(bow_size=len(tp.vocab), contextual_size=768, n_components=8, num_epochs=10)
ctm_2_3.fit(training_dataset_from_1990_to_2009_1) # run the model


In [None]:
print("CTM Topics with num_ctm_topics = 8 (Preprocess 1):")
print_ctm_topics(ctm_2_3)

CTM Topics with num_ctm_topics = 8 (Preprocess 1):
Topic 1: analysis data functional brain study imaging activity models human fmri
Topic 2: capability forecasting utilizing optimisation neuro hopfield underwater guided window consensus
Topic 3: control time systems nonlinear sub linear adaptive neural discrete estimation
Topic 4: problems problem finite sup graphs equations methods method element order
Topic 5: image using based images recognition detection coding algorithm classification compression
Topic 6: computer introduction research technology review online internet science virtual electronic
Topic 7: system software design development object knowledge oriented engineering management based
Topic 8: networks wireless performance network parallel mobile distributed high sensor power


In [None]:
# Perform CTM with num_ctm_topics = 8 for Preprocess 2 - Annotate the topics
ctm_2_4 = CombinedTM(bow_size=len(tp_2.vocab), contextual_size=768, n_components=8, num_epochs=10)
ctm_2_4.fit(training_dataset_from_1990_to_2009_2) # run the model

Epoch: [10/10]	 Seen Samples: [2392320/2392490]	Train Loss: 39.6022332106254	Time: 0:01:05.475847: : 10it [11:11, 67.16s/it]
100%|██████████| 3739/3739 [00:58<00:00, 64.25it/s]


In [None]:
print("CTM Topics with num_ctm_topics = 8 (Preprocess 2):")
print_ctm_topics(ctm_2_4)

CTM Topics with num_ctm_topics = 8 (Preprocess 2):
Topic 1: special introduction issue auml von und guest editorial uuml der
Topic 2: problems problem method finite methods two order equations optimization algorithm
Topic 3: control systems time sub nonlinear discrete linear adaptive real robust
Topic 4: information software knowledge web system engineering learning development management design
Topic 5: using image based recognition images detection vector transform classification feature
Topic 6: complete designs connected proof de hopfield cyclic eacute perfect et
Topic 7: networks performance wireless mobile network routing distributed sensor scheduling high
Topic 8: analysis functional data study brain human effects imaging models fmri


### From 2010 onwards

In [None]:
# Perform CTM with num_ctm_topics = 5 for Preprocess 1 - Annotate the topics

ctm_3_1 = CombinedTM(bow_size=len(tp.vocab), contextual_size=768, n_components=5, num_epochs=10)
ctm_3_1.fit(training_dataset_from_2010_1) # run the model


Epoch: [10/10]	 Seen Samples: [5758080/5758260]	Train Loss: 49.045647748671755	Time: 0:02:45.458489: : 10it [27:46, 166.66s/it]
100%|██████████| 8998/8998 [02:22<00:00, 63.14it/s]


In [None]:
print("CTM Topics with num_ctm_topics = 5 (Preprocess 1):")
print_ctm_topics(ctm_3_1)

CTM Topics with num_ctm_topics = 5 (Preprocess 1):
Topic 1: time control nonlinear linear systems order method sub problems discrete
Topic 2: networks energy wireless sensor efficient system algorithm computing mobile cloud
Topic 3: using learning image detection deep network classification images based neural
Topic 4: study case review development research social digital knowledge technology software
Topic 5: directional bi frame neighborhood overlapping gray incremental neighbor adjustment hyper


In [None]:
# Perform CTM with num_ctm_topics = 5 for Preprocess 2 - Annotate the topics

ctm_3_2 = CombinedTM(bow_size=len(tp_2.vocab), contextual_size=768, n_components=5, num_epochs=10)
ctm_3_2.fit(training_dataset_from_2010_2) # run the model


Epoch: [10/10]	 Seen Samples: [5758080/5758260]	Train Loss: 49.03129169243209	Time: 0:02:47.322584: : 10it [30:06, 180.66s/it]
100%|██████████| 8998/8998 [02:19<00:00, 64.53it/s]


In [None]:
print("CTM Topics with num_ctm_topics = 5 (Preprocess 2):")
print_ctm_topics(ctm_3_2)

CTM Topics with num_ctm_topics = 5 (Preprocess 2):
Topic 1: time nonlinear sub systems control linear order method problems finite
Topic 2: bidirectional neighborhood auto pair targets deformable completion branch rotation multichannel
Topic 3: study research case social information development review knowledge technology software
Topic 4: using image learning detection deep network images classification neural based
Topic 5: networks wireless energy sensor efficient power iot computing cloud system


In [None]:
# Perform CTM with num_ctm_topics = 8 for Preprocess 1 - Annotate the topics

ctm_3_3 = CombinedTM(bow_size=len(tp.vocab), contextual_size=768, n_components=8, num_epochs=10)
ctm_3_3.fit(training_dataset_from_2010_1) # run the model


Epoch: [10/10]	 Seen Samples: [2392320/2392490]	Train Loss: 39.681725403534145	Time: 0:01:06.883526: : 10it [11:31, 69.10s/it]
100%|██████████| 3739/3739 [01:00<00:00, 61.93it/s]


In [None]:
print("CTM Topics with num_ctm_topics = 8 (Preprocess 1):")
print_ctm_topics(ctm_3_3)

CTM Topics with num_ctm_topics = 8 (Preprocess 1):
Topic 1: formation hopfield internal fine nuclear position multilayer guaranteed controlled similar
Topic 2: special introduction issue und uuml der eacute von de editorial
Topic 3: image using data based recognition classification images detection segmentation speech
Topic 4: sup graphs problem algorithms algorithm sets graph number trees complexity
Topic 5: information software knowledge web development study case oriented virtual engineering
Topic 6: networks performance wireless high mobile distributed network power sensor efficient
Topic 7: systems control time neural nonlinear fuzzy approach model adaptive robust
Topic 8: method finite methods order equations estimation two analysis problems differential


In [None]:
# Perform CTM with num_ctm_topics = 8 for Preprocess 2 - Annotate the topics

ctm_3_4 = CombinedTM(bow_size=len(tp_2.vocab), contextual_size=768, n_components=8, num_epochs=10)
ctm_3_4.fit(training_dataset_from_2010_2) # run the model


Epoch: [10/10]	 Seen Samples: [5758080/5758260]	Train Loss: 49.335801242761804	Time: 0:02:43.756885: : 10it [26:57, 161.76s/it]
100%|██████████| 8998/8998 [02:18<00:00, 64.75it/s]


In [None]:
print("CTM Topics with num_ctm_topics = 8 (Preprocess 2):")
print_ctm_topics(ctm_3_4)

CTM Topics with num_ctm_topics = 8 (Preprocess 2):
Topic 1: sup problems eacute equations boundary approximation problem two solving generalized
Topic 2: learning deep network neural machine classification recognition graph detection convolutional
Topic 3: image based algorithm feature images detection estimation color fusion improved
Topic 4: multichannel drone ecg hyper adjustment zone grey conversion marine multilayer
Topic 5: wireless networks energy sensor efficient computing cloud mobile iot power
Topic 6: research review technology virtual reality software digital perspective information special
Topic 7: systems control time nonlinear feedback adaptive state varying system design
Topic 8: data analysis mapping remote satellite study surface case using sensing


In [None]:
import numpy as np
from collections import defaultdict
from itertools import combinations

def compute_coherence_for_topics(topics_keywords, documents, beta=1):
    """Return one co-occurrence coherence score per topic (scores are <= 0; closer to 0 is better)."""
    # Build each document's word set once instead of once per topic.
    document_word_sets = [set(doc.split()) for doc in documents]

    coherences = []

    for topic_idx, top_words in enumerate(topics_keywords):
        word_doc_count = defaultdict(int)  # D(w_i): documents containing w_i
        word_pair_doc_count = defaultdict(int)  # D(w_i, w_j): documents containing both words
        for unique_words in document_word_sets:
            for word in top_words:
                if word in unique_words:
                    word_doc_count[word] += 1
            for word_i, word_j in combinations(top_words, 2):
                if word_i in unique_words and word_j in unique_words:
                    word_pair_doc_count[(word_i, word_j)] += 1
        coherence = 0.0
        for word_i, word_j in combinations(top_words, 2):
            # Use the same (word_i, word_j) key order as when counting;
            # a reversed key is never stored, so it would always yield 0.
            D_wi_wj = word_pair_doc_count[(word_i, word_j)]
            D_wi = word_doc_count[word_i]
            coherence += np.log((D_wi_wj + beta) / (D_wi + beta))

        coherences.append(coherence)
        print(f"Topic {topic_idx}: Coherence Score = {coherence}")

    return coherences
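
A toy example may help to see how the score behaves. The sketch below sums log((D(w_i, w_j) + beta) / (D(w_i) + beta)) over word pairs, as in the function above, on three made-up documents. The helper name `pair_coherence` and the toy corpus are assumptions for illustration only and are not part of the notebook.

```python
import math
from itertools import combinations

def pair_coherence(top_words, documents, beta=1):
    """Sum log((D(w_i, w_j) + beta) / (D(w_i) + beta)) over all word pairs."""
    doc_sets = [set(doc.split()) for doc in documents]
    score = 0.0
    for w_i, w_j in combinations(top_words, 2):
        d_i = sum(w_i in words for words in doc_sets)  # documents containing w_i
        d_ij = sum(w_i in words and w_j in words for words in doc_sets)  # containing both
        score += math.log((d_ij + beta) / (d_i + beta))
    return score

# Hypothetical mini-corpus, not taken from the dataset.
toy_docs = [
    "image recognition using neural networks",
    "image detection using deep learning",
    "wireless sensor networks",
]

co_occurring = pair_coherence(["image", "using"], toy_docs)       # always appear together
never_together = pair_coherence(["image", "wireless"], toy_docs)  # never share a document
print(co_occurring, never_together)
```

Words that always co-occur contribute nothing to the sum, while words that never share a document pull the score down, so values closer to zero indicate tighter topics.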


In [None]:
# Score the Preprocess 1 CTM models (5 and 8 topics) on the documents they were trained on.
for model_name, ctm_model, preprocessed_docs in [
    ("ctm_1_1", ctm_1_1, preprocessed_documents_before_1990_1),
    ("ctm_2_1", ctm_2_1, preprocessed_documents_from_1990_to_2009_1),
    ("ctm_3_1", ctm_3_1, preprocessed_documents_from_2010_1),
    ("ctm_1_3", ctm_1_3, preprocessed_documents_before_1990_1),
    ("ctm_2_3", ctm_2_3, preprocessed_documents_from_1990_to_2009_1),
    ("ctm_3_3", ctm_3_3, preprocessed_documents_from_2010_1),
]:
    topic_keywords = ctm_model.get_topic_lists(10)  # top-10 words per topic

    coherence_scores = compute_coherence_for_topics(topic_keywords, preprocessed_docs)

    # Print a readable label instead of the model object's repr.
    print(f"Coherence scores for {model_name}: {coherence_scores}")

📝❓ Again: Assign a name to each topic based on the topic’s top words (for each period). List all topic names in your report.

📝❓ Bianchi et al. 2021 claim that their approach produces more coherent topics than previous methods. Let’s test this claim by comparing the coherence of the topics produced by CTM with the topics produced by LDA. Describe your observations in 3-4 sentences.

📝❓ Do the two models generate similar topics? Can you discover the same temporal trends (if there are any)? Discuss in 5-6 sentences.

📝❓ Can you suggest an alternate model apart from paraphrase-mpnet-base-v2? What could be some of the possible advantages and disadvantages of using an alternate model? Hint: Look at some of the models [here](https://huggingface.co/spaces/mteb/leaderboard). Note: You do not need to execute the code for an alternate model.

## Lab Report

# LDA

📝❓ For each period, assign a name to each generated topic based on the topic’s top words. List all topic names in your report. If a topic is incoherent to the degree that no common theme is detectable, you can just mark it as incoherent (i.e., no need to name a topic that does not exist).

### Table 1: Topic Analysis before 1990
| Topic | Content | Name |
|------------------|-----------------|-----------------|
| Topic 1 | model linear method theory new note adaptive digital program graphs methods applications | model note|
| Topic 2 | systems using information software approach performance processing application network models database distributed | systems database |
| Topic 3 | control algorithm design problem problems distributed parallel und based programs optimal adaptive | control problems|
| Topic 4 | analysis data algorithms review optimal der time estimation dynamic von zur functions | analysis review |
| Topic 5 | computer programming networks language logic image evaluation science finite processing dynamic performance | computer processing|

### Table 2: Topic Analysis from 1990 to 2009
| Topic | Content | Name |
|------------------|-----------------|-----------------|
| Topic 1 | design, information, method, new, adaptive, models, software, problem, applications, digital, mobile, control | design |
| Topic 2 | model, data, methods, distributed, image, development, knowledge, recognition, processing, management, based, application | distributed model|
| Topic 3 | networks, approach, network, neural, performance, study, linear, optimal, nonlinear, graphs, time, evaluation |networks |
| Topic 4 | systems, control, based, algorithms, parallel, problems, modeling, dynamic, management, computing, theory, web |based systems|
| Topic 5 | using, analysis, algorithm, learning, estimation, application, efficient, optimization, simulation, processing, images, parallel |using analysis|

### Table 3: Topic Analysis from 2010
| Topic | Content | Name |
|------------------|-----------------|-----------------|
|Topic 1| time control nonlinear linear systems order method sub problems discrete | time discrete|
|Topic 2 | networks energy wireless sensor efficient system algorithm computing mobile cloud | networks sensor|
|Topic 3 | using learning image detection deep network classification images based neural | using detection|
|Topic 4 | study case review development research social digital knowledge technology software | study case|
|Topic 5 | directional bi frame neighborhood overlapping gray incremental neighbor adjustment hyper | directional adjustment|


### Table 4: Topic Analysis before 1990 with 8 topics
| Topic | Content | Name |
|------------------|-----------------|-----------------|
|Topic 1 | problems distributed processing parallel note based languages functions image algorithms algorithm method | problems image|
|Topic 2 | problem software approach networks digital program recognition simulation new using application design | problem design|
|Topic 3 | control algorithm design theory programs database adaptive application systems applications linear optimal | control database|
|Topic 4 | analysis data algorithms time adaptive estimation methods dynamic image performance using application | analysis methods|
|Topic 5 | programming language logic finite dynamic linear languages programs sets applications application design | programming languages|
|Topic 6 | systems information linear method models graphs applications sets control time dynamic estimation | systems control|
|Topic 7 | computer using model review new performance application evaluation science systems image information | computer science |
|Topic 8  |optimal der network und von zur control simulation linear problem algorithm time | optimal algorithm|

### Table 5: Topic Analysis from 1990 to 2009 with 8 topics
| Topic | Content | Name |
|------------------|-----------------|-----------------|
| Topic 1 | design, new, learning, models, problem, digital, mobile, algorithm, approach, application, linear, using | digital application|
| Topic 2 | model, data, development, recognition, software, application, using, based, management, information, time, approach | data recognition|
| Topic 3 | networks, approach, network, neural, performance, study, time, evaluation, detection, using, nonlinear, mobile |networks evaluation|
| Topic 4 | systems, control, problems, optimization, web, programming, nonlinear, linear, optimal, information, approach, application |systems approach|
| Topic 5 | analysis, distributed, dynamic, computing, theory, processing, images, performance, applications, application, parallel, using |dynamic processing|
| Topic 6 | using, algorithm, information, linear, software, application, nonlinear, efficient, applications, management, time, parallel |efficient management|
| Topic 7 | based, method, algorithms, estimation, parallel, optimal, methods, modeling, computer, efficient, linear, nonlinear |based parallel methods|
| Topic 8 | adaptive, graphs, fuzzy, image, simulation, knowledge, nonlinear, management, using, algorithm, control, application |adaptive knowledge|

### Table 6: Topic Analysis from 2010 with 8 topics
| Topic | Content | Name |
|------------------|-----------------|-----------------|
|Topic 1 |design new learning models problem digital mobile algorithm approach application linear using | design learning models |
|Topic 2 | model data development recognition software application using based management information time approach |model development|
|Topic 3 | networks approach network neural performance study time evaluation detection using nonlinear mobile |networks performance|
|Topic 4 | systems control problems optimization web programming nonlinear linear optimal information approach application |systems optimization|
|Topic 5 | analysis distributed dynamic computing theory processing images performance applications application parallel using |analysis computing theory|
|Topic 6 | using algorithm information linear software application nonlinear efficient applications management time parallel |using linear software|
|Topic 7 | based method algorithms estimation parallel optimal methods modeling computer efficient linear nonlinear |based algorithms|
|Topic 8 | adaptive graphs fuzzy image simulation knowledge nonlinear management using algorithm control application |adaptive simulation|


📝❓ Do the topics make sense to you? Are they coherent? Do you observe trends across different time periods? Discuss in 4-6 sentences.

The topics appear to be generally coherent, with most groupings centered on a recognizable domain such as design, algorithms, neural networks, systems, or optimization. For example, "design," "adaptive," and "control" in one topic (Table 2, Topic 1) indicate a focus on system modeling and software design, while "networks," "neural," and "graphs" in another (Table 2, Topic 3) point to machine learning and network analysis. Moving from 5 to 8 topics increases the granularity: broader themes split into more specific subtopics, for example separating neural networks from optimization and programming.

Regarding trends, the recurrence of "using," "application," and "management" across topics suggests a strong applied focus in the research. Similarly, "parallel," "distributed," and "dynamic" point to scalable computing and performance optimization. Specific trends over time are hard to infer from these LDA topics alone, but the distinct yet interconnected themes indicate that the topics are coherent and relevant to both applied and theoretical computing research.

# CTM

📝❓ Again: Assign a name to each topic based on the topic’s top words (for each period). List all topic names in your report.

### Table 1: Topic Analysis before 1990
| Topic | Content | Name |
|------------------|-----------------|-----------------|
| Topic 1 | uuml und zur von der auml des ber die ouml | Incoherent |
| Topic 2 | book editor technology teaching survey intelligence report science program artificial| Artificial Intelligence and Technology |
| Topic 3 | algorithm linear control method problems time problem equations adaptive algorithms | Algorithm and Problem Solving |
| Topic 4 | number regular designs sup combinatorial boolean groups plane graphs trees | Combinatorial Mathematics |
| Topic 5 | system data design information systems based processing analysis management model | System Design and Data Analysis |
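
The tokens in the incoherent topic (`uuml`, `auml`, `ouml`) look like leftovers of HTML entities such as `&uuml;` whose `&` and `;` were removed during preprocessing, and `sub`/`sup` may similarly come from HTML tags. This is an assumption, since the raw titles are not shown here. A minimal sketch of a decoding step that could run before tokenization follows; the helper name `decode_entities` and the sample title are hypothetical.

```python
import html
import re

def decode_entities(title):
    """Turn HTML entities into characters and replace simple tags such as <sub>/<sup> with spaces."""
    text = html.unescape(title)            # "&uuml;" -> "ü", "&eacute;" -> "é"
    return re.sub(r"</?\w+>", " ", text)   # drop attribute-free tags like <sup> and </sup>

# Hypothetical title, not taken from the dataset.
raw_title = "Einf&uuml;hrung in die L&ouml;sung von x<sup>2</sup>"
print(decode_entities(raw_title))
```

With such a step, German titles would still form their own topic, but its top words would be real words rather than entity fragments.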

### Table 2: Topic Analysis from 1990 to 2009
| Topic | Content | Name |
|------------------|-----------------|-----------------|
| Topic 1 | systems networks control time network neural design system performance adaptive | Neural Networks and Adaptive Systems |
| Topic 2 | collision vibration antenna multilayer dense disk following fine reducing redundancy | Vibration and Redundancy Reduction |
| Topic 3 | computer information technology web software knowledge virtual development case research | Information Technology and Virtual Systems |
| Topic 4 | image using data analysis images based model detection recognition estimation | Image Analysis and Recognition |
| Topic 5 | problems sub problem sup linear methods finite order method algorithm | Mathematical Algorithms |

### Table 3: Topic Analysis from 2010
| Topic | Content | Name |
|------------------|-----------------|-----------------|
| Topic 1 | time control nonlinear linear systems order method sub problems discrete | Nonlinear Systems and Control Theory |
| Topic 2 | networks energy wireless sensor efficient system algorithm computing mobile cloud | Energy-Efficient Wireless Networks |
| Topic 3 | using learning image detection deep network classification images based neural | Deep Learning in Image Classification |
| Topic 4 | study case review development research social digital knowledge technology software | Social and Digital Technology Research |
| Topic 5 | directional bi frame neighborhood overlapping gray incremental neighbor adjustment hyper | Image Segmentation and Analysis |

### Table 4: Topic Analysis before 1990 with num_ctm_topics = 8
| Topic | Content | Name |
|------------------|-----------------|-----------------|
| Topic 1 | uuml von und der auml die zur mit ouml ein | Incoherent |
| Topic 2 | graphs groups classes forms regular cycles hierarchy designs theorem proof | Graph Theory and Proofs |
| Topic 3 | problems problem algorithm algorithms solution equations parallel complexity method linear | Computational Complexity and Algorithms |
| Topic 4 | system information data database systems processing based management performance design | Information and Database Systems |
| Topic 5 | control systems time model optimal linear theory discrete dynamic adaptive | Control Theory and Dynamic Systems |
| Topic 6 | book computers scientific editor world research chess report current international | Scientific Research and International Publications |
| Topic 7 | recognition image using pattern digital images dimensional processing speech detection | Image Processing and Recognition |
| Topic 8 | programming software language languages logic program design development programs engineering | Programming and Software Development |

### Table 5: Topic Analysis from 1990 to 2009 with num_ctm_topics = 8
| Topic | Content | Name |
|------------------|-----------------|-----------------|
| Topic 1 | analysis data functional brain study imaging activity models human fmri | Brain Imaging and Human Activity |
| Topic 2 | capability forecasting utilizing optimisation neuro hopfield underwater guided window consensus | Optimization and Consensus Forecasting |
| Topic 3 | control time systems nonlinear sub linear adaptive neural discrete estimation | Nonlinear Control Systems |
| Topic 4 | problems problem finite sup graphs equations methods method element order | Mathematical Equations and Graph Problems |
| Topic 5 | image using based images recognition detection coding algorithm classification compression | Image Recognition and Classification |
| Topic 6 | computer introduction research technology review online internet science virtual electronic | Technology and Virtual Research |
| Topic 7 | system software design development object knowledge oriented engineering management based | Knowledge-Oriented System Design |
| Topic 8 | networks wireless performance network parallel mobile distributed high sensor power | Wireless and Sensor Networks |

### Table 6: Topic Analysis from 2010 with num_ctm_topics = 8
| Topic | Content | Name |
|------------------|-----------------|-----------------|
| Topic 1 | formation hopfield internal fine nuclear position multilayer guaranteed controlled similar | Nuclear and Internal Positioning |
| Topic 2 | special introduction issue und uuml der eacute von de editorial | Incoherent |
| Topic 3 | image using data based recognition classification images detection segmentation speech | Speech and Image Recognition |
| Topic 4 | sup graphs problem algorithms algorithm sets graph number trees complexity | Graph Theory and Complexity |
| Topic 5 | information software knowledge web development study case oriented virtual engineering | Virtual Knowledge Engineering |
| Topic 6 | networks performance wireless high mobile distributed network power sensor efficient | Distributed Wireless Networks |
| Topic 7 | systems control time neural nonlinear fuzzy approach model adaptive robust | Adaptive and Robust Systems |
| Topic 8 | method finite methods order equations estimation two analysis problems differential | Differential Equations and Methods |


📝❓ Bianchi et al. 2021 claim that their approach produces more coherent topics than previous methods. Let’s test this claim by comparing the coherence of the topics produced by CTM with the topics produced by LDA. Describe your observations in 3-4 sentences.

### Table 7: Coherence of the Topics

| Model | Data                        | Coherence_1 | Coherence_2 | Coherence_3 | Coherence_4 | Coherence_5 | Coherence_6 | Coherence_7 | Coherence_8 |
|-------|-----------------------------|-------------|-------------|-------------|-------------|-------------|-------------|-------------|-------------|
| CTM   | before 1990, 5 topics       | -284.14     | -231.62     | -310.44     | -234.58     | -325.62     |             |             |             |
| LDA   | before 1990, 5 topics       | -435.20     | -454.14     | -448.12     | -438.18     | -430.77     |             |             |             |
| CTM   | before 1990, 8 topics       | -285.68     | -230.52     | -295.82     | -315.89     | -317.35     | -230.75     | -281.80     | -293.96     |
| LDA   | before 1990, 8 topics       | -418.61     | -429.88     | -447.11     | -433.05     | -419.90     | -450.13     | -445.46     | -426.54     |
| CTM   | from 1990 to 2009, 5 topics | -306.55     | -47.45      | -247.65     | -299.44     | -278.84     |             |             |             |
| LDA   | from 1990 to 2009, 5 topics | -436.83     | -421.59     | -403.57     | -440.45     | -426.00     |             |             |             |
| CTM   | from 1990 to 2009, 8 topics | -250.18     | -74.84      | -298.45     | -282.09     | -282.06     | -248.64     | -291.06     | -219.43     |
| LDA   | from 1990 to 2009, 8 topics | -412.84     | -434.49     | -399.23     | -416.25     | -422.54     | -438.95     | -419.81     | -397.37     |
| CTM   | from 2010, 5 topics         | -303.72     | -209.99     | -242.80     | -256.36     | -54.85      |             |             |             |
| LDA   | from 2010, 5 topics         | -436.83     | -421.59     | -403.57     | -440.45     | -426.00     |             |             |             |
| CTM   | from 2010, 8 topics         | -62.46      | -249.99     | -289.50     | -280.41     | -239.89     | -218.57     | -294.42     | -285.70     |
| LDA   | from 2010, 8 topics         | -412.84     | -434.49     | -399.23     | -416.25     | -422.54     | -438.95     | -419.81     | -397.37     |

Based on Table 7, CTM produces more coherent topics than LDA under this metric: every CTM score is closer to zero than every LDA score. CTM scores range from about -47 to -326, whereas LDA scores stay in a narrow band between roughly -397 and -454, so CTM shows the larger variability while LDA is uniformly poor. Note that the CTM scores closest to zero (-47.45, -54.85, -62.46) belong to topics made up of rare, loosely related words (e.g., "collision vibration antenna multilayer ..."), so these values should be read with caution.
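
One way to condense a pair of rows from Table 7 is to compare their mean and spread. The sketch below does this for the "before 1990, 5 topics" rows; the scores are copied from the table, while the helper name `summarize` is an assumption made for illustration.

```python
from statistics import mean

# Scores copied from Table 7 (before 1990, 5 topics).
ctm_scores = [-284.14, -231.62, -310.44, -234.58, -325.62]
lda_scores = [-435.20, -454.14, -448.12, -438.18, -430.77]

def summarize(scores):
    """Return the mean and the spread (max - min) of a list of coherence scores."""
    return mean(scores), max(scores) - min(scores)

ctm_mean, ctm_spread = summarize(ctm_scores)
lda_mean, lda_spread = summarize(lda_scores)
print(f"CTM: mean={ctm_mean:.2f}, spread={ctm_spread:.2f}")
print(f"LDA: mean={lda_mean:.2f}, spread={lda_spread:.2f}")
```

The same helper can be applied to any other row of the table to check whether the pattern holds across periods and topic counts.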

📝❓ Do the two models generate similar topics? Can you discover the same temporal trends (if there are any)? Discuss in 5-6 sentences.

Based on Table 7, CTM and LDA differ noticeably in coherence but show a similar temporal pattern. Both models tend to score lowest on the period before 1990, which suggests that topics in the older data are less clearly defined. For the period from 2010, CTM reaches scores much closer to zero for some topics (e.g., -62.46), suggesting clearer and more consistent topics in recent data. LDA scores lower than CTM in every period, reflecting weaker word co-occurrence within its topics. Despite these differences, the two models appear to detect similar broad trends over time, such as the increasing clarity of topics in more recent data. CTM thus improves topic quality while remaining consistent with the temporal patterns captured by LDA.
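
Topic similarity across periods can also be checked directly on the top words. The sketch below compares the CTM image topic from 1990 to 2009 (Topic 4, 5 topics, Preprocess 1) with the one from 2010 (Topic 3, 5 topics, Preprocess 1), using the word lists printed earlier. Jaccard overlap is used here as one possible measure; it is not part of the original notebook, and the helper name `jaccard` is an assumption.

```python
def jaccard(words_a, words_b):
    """Share of distinct words that two top-word lists have in common."""
    set_a, set_b = set(words_a), set(words_b)
    return len(set_a & set_b) / len(set_a | set_b)

# Top words copied from the CTM outputs above.
image_1990_2009 = "image using data analysis images based model detection recognition estimation".split()
image_from_2010 = "using learning image detection deep network classification images based neural".split()

overlap = jaccard(image_1990_2009, image_from_2010)
new_words = sorted(set(image_from_2010) - set(image_1990_2009))
print(f"Jaccard overlap: {overlap:.2f}")
print("Words that only appear from 2010:", new_words)
```

The shared core ("image", "images", "detection", "using", "based") shows that the theme persists, while the words unique to the later list ("deep", "learning", "neural", "network", "classification") make the shift towards deep learning visible.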

📝❓ Can you suggest an alternate model apart from paraphrase-mpnet-base-v2? What could be some of the possible advantages and disadvantages of using an alternate model? Hint: Look at some of the models [here](https://huggingface.co/spaces/mteb/leaderboard). Note: You do not need to execute the code for an alternate model.

sentence-transformers/all-MiniLM-L6-v2
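
Switching to this model mainly affects the embedding size: `all-MiniLM-L6-v2` produces 384-dimensional sentence embeddings, whereas `paraphrase-mpnet-base-v2` produces 768-dimensional ones, so `contextual_size` has to change accordingly. The smaller model should be faster to encode with, at the possible cost of less expressive embeddings. The sketch below only builds the keyword arguments; the helper `ctm_kwargs` and the lookup table are assumptions for illustration, and nothing here was run against the dataset.

```python
# Embedding sizes of the two Sentence-Transformers models discussed here.
EMBEDDING_DIMS = {
    "paraphrase-mpnet-base-v2": 768,
    "all-MiniLM-L6-v2": 384,
}

def ctm_kwargs(model_name, vocab_size, n_topics, num_epochs=10):
    """Build keyword arguments for CombinedTM that match the chosen embedding model."""
    return {
        "bow_size": vocab_size,
        "contextual_size": EMBEDDING_DIMS[model_name],  # must equal the embedding dimension
        "n_components": n_topics,
        "num_epochs": num_epochs,
    }

# Hypothetical vocabulary size; in the notebook it would be len(tp.vocab).
kwargs = ctm_kwargs("all-MiniLM-L6-v2", vocab_size=2000, n_topics=5)
print(kwargs)
# Assumed usage in the notebook: CombinedTM(**kwargs), after re-creating the
# training dataset with the new embedding model instead of paraphrase-mpnet-base-v2.
```

Because the contextual embeddings are computed during data preparation, the training datasets would have to be rebuilt with the new model before fitting.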