<a href="https://colab.research.google.com/github/Andrian0s/ML4NLP1-2024-Tutorial-Notebooks/blob/main/exercises/ex6/ex06_topic_modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -qU contextualized-topic-models

## General Instructions

1. Perform Topic Modeling using LDA and CTM on the three time frames: before 1990, 1990-2009 and 2010 onwards.
2. Experiment with a) different preprocessing functions and b) varying number of topics.
3. Annotate the topics.
4. Answer the questions marked with 📝❓ in your lab report at the end of this notebook.
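
The instructions above imply a small grid of runs per model. A minimal sketch of that grid, assuming two preprocessing variants and two topic counts (5 and 10); the string labels are hypothetical placeholders, not names used elsewhere in this notebook:

```python
from itertools import product

# Hypothetical labels for the three time frames, two preprocessing
# variants and two topic counts explored in this exercise.
periods = ["before_1990", "1990_to_2009", "from_2010"]
preprocessors = ["preprocess1", "preprocess2"]
topic_counts = [5, 10]

# Every combination of period, preprocessing variant and topic count.
experiment_grid = list(product(periods, preprocessors, topic_counts))

print(len(experiment_grid))
for period, preprocessor, n_topics in experiment_grid[:3]:
    print(period, preprocessor, n_topics)
```

Listing the combinations up front makes it easier to check that every run has been annotated in the report.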

In [None]:
!pip install pyLDAvis



In [None]:
import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter("ignore", DeprecationWarning)
warnings.simplefilter("ignore", UserWarning)

## Import Libraries

In [None]:
import re
import urllib
import gzip
import io
import csv
import random
from collections import defaultdict
from tqdm import tqdm
import pyLDAvis
import nltk



## Download Dataset

In [None]:
url_before_1990 = 'https://drive.google.com/file/d/1o_IeJCqvDLH5xgjYYuEHoPuPjF7SYvwR/view?usp=drive_link'
url_from_1990_to_2009 = 'https://drive.google.com/file/d/1Q31iYPxlcsvB0nwGter3RDfbhVRtV2yI/view?usp=drive_link'
url_from_2010 = 'https://drive.google.com/file/d/1s7pLqaiMVxM0M4WBKgZpBxNDFKXeQ47x/view?usp=drive_link'

In [None]:
# Download a text file from a Google Drive share URL and return a list of paper titles
import requests

def download_text_file_from_drive(drive_url):
    # Extract the file ID from a URL of the form .../d/<file_id>/...
    try:
        file_id = drive_url.split('/d/')[1].split('/')[0]
    except IndexError:
        raise ValueError("Invalid Google Drive URL format. Ensure it includes '/d/<file_id>/'.")

    download_url = f"https://drive.google.com/uc?id={file_id}&export=download"

    response = requests.get(download_url)
    if response.status_code != 200:
        raise RuntimeError(f"Failed to download file. HTTP Status Code: {response.status_code}")

    # Each line has the format "paper_title, year"; keep only the title.
    # Note: splitting on the first comma truncates titles that themselves contain a comma.
    content = response.text
    titles_year = content.splitlines()
    titles = [x.split(',')[0] for x in titles_year]
    return titles
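
Splitting each line on the first comma cuts off titles that contain a comma themselves. A minimal sketch of an alternative, assuming the year is always the last comma-separated field and that titles with commas may be CSV-quoted. The helper name `parse_title_year_lines` and the sample lines are hypothetical:

```python
import csv

def parse_title_year_lines(lines):
    """Return the title part of lines shaped like `title,year`, honouring CSV quoting."""
    titles = []
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        # Assumption: the last field is the year; everything before it is the title.
        title = ",".join(row[:-1]) if len(row) > 1 else row[0]
        titles.append(title.strip())
    return titles

# Hypothetical sample lines, not taken from the dataset
sample = ['"A Title, With a Comma",1987', 'Plain Title,1990']
print(parse_title_year_lines(sample))
```

Switching to this parser would change the downloaded titles, so the outputs shown in this notebook would no longer match exactly.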

In [None]:
titles_before_1990 = download_text_file_from_drive(url_before_1990)
titles_from_1990_to_2009 = download_text_file_from_drive(url_from_1990_to_2009)
titles_from_2010 = download_text_file_from_drive(url_from_2010)

# Check the length of downloaded data
print(len(titles_before_1990))
print(len(titles_from_1990_to_2009))
print(len(titles_from_2010))

# Check the first element of each list
# Each raw line has the format `paper_title, year`; the lists hold only the title part
print(titles_before_1990[0])
print(titles_from_1990_to_2009[0])
print(titles_from_2010[0])

40000
243581
582378
An Introduction to Mathematical Taxonomy
The Future of Classic Data Administration: Objects + Databases + CASE
E. W. Dijkstra Archive: The manuscripts of Edsger W. Dijkstra 1930-2002


## Preprocessing Functions

*Optionally, you can write the preprocessing functions for LDA here or use the built-in sklearn functionality for preprocessing while performing LDA.*

*For CTMs, it is recommended to apply preprocessing only to the text used for the bag of words and to generate the embeddings from the unpreprocessed text. This gives the embedding model more context, and therefore better-quality embeddings, while keeping the vocabulary size manageable. You can refer to the authors' proposed preprocessing implementation [here](https://github.com/MilaNLProc/contextualized-topic-models?tab=readme-ov-file#preprocessing).*

In [None]:
# Preprocess 1: Basic Cleaning
def preprocess1(texts):
    return [re.sub(r'\W+', ' ', text.lower()) for text in texts]

In [None]:
# Preprocess 2: Advanced Cleaning (e.g., removing short words, stemming)
from nltk.stem import PorterStemmer

def preprocess2(texts):
    stemmer = PorterStemmer()
    cleaned_texts = []
    for text in texts:
        cleaned = re.sub(r'\W+', ' ', text.lower())
        tokens = [stemmer.stem(word) for word in cleaned.split() if len(word) > 2]
        cleaned_texts.append(' '.join(tokens))
    return cleaned_texts
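
Some titles in the dataset contain HTML entities such as `&auml;`, which the cleaning functions above split into tokens like `auml`. A minimal sketch of a third variant that decodes entities before cleaning. The name `preprocess3` is hypothetical, and the tag-stripping step assumes that tokens such as `sub` and `sup` come from inline HTML tags, which has not been verified against the raw data:

```python
import html
import re

def preprocess3(texts):
    """Hypothetical variant of preprocess1: decode HTML entities and drop tags first."""
    cleaned_texts = []
    for text in texts:
        text = html.unescape(text)            # e.g. '&auml;' becomes 'ä'
        text = re.sub(r'<[^>]+>', ' ', text)  # assumption: markup such as <sub>...</sub> may occur
        text = re.sub(r'\W+', ' ', text.lower()).strip()
        cleaned_texts.append(text)
    return cleaned_texts

print(preprocess3(['ausl&auml;nd. Studieng&auml;nge.']))
```

Because `\W` is Unicode-aware in Python 3, decoded characters such as `ä` stay inside their words instead of splitting them.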

## LDA

In [None]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

num_lda_topics = 5 # min number of topics

### Before the 1990s:

In [None]:
# Read data
titles_before_1990 = download_text_file_from_drive(url_before_1990)

In [None]:
titles_before_1990

['An Introduction to Mathematical Taxonomy',
 'Speech Acts: An Essay in the Philosophy of Language.',
 'Evolution and the Theory of Games',
 'On the Construction of Programs',
 'Modal Logic - An Introduction',
 'Introduction to Combinators and Lambda-Calculus.',
 'The Cognitive Structure of Emotions.',
 'A Structured Operating System.',
 'Parallel Processing in Ada.',
 'A Model for Communicating Sequential Processes.',
 'An Introduction to Pascal-Plus.',
 'Languages for Parallel Computers.',
 'Modules and Visibility in the Ada Programming Language.',
 'Algorithms for Parallel Computers.',
 'Concurrent Pascal - An Appraisal.',
 'A Structured Compiler.',
 '"Information Systems: Modelling',
 'Computable set theory. Volume 1.',
 'Mentale Belastung und kognitive Prozesse bei komplexen Dialogstrukturen.',
 'Naturwissenschaftsdidaktik als Studienfach: kommentierte Dokumentation ausl&auml;nd. Studieng&auml;nge.',
 'Simulation and the Monte Carlo method.',
 'Software engineering in C.',
 'Searc

In [None]:
# Preprocess 1
titles_before_1990_1 = preprocess1(titles_before_1990)

In [None]:
titles_before_1990_1

['an introduction to mathematical taxonomy',
 'speech acts an essay in the philosophy of language ',
 'evolution and the theory of games',
 'on the construction of programs',
 'modal logic an introduction',
 'introduction to combinators and lambda calculus ',
 'the cognitive structure of emotions ',
 'a structured operating system ',
 'parallel processing in ada ',
 'a model for communicating sequential processes ',
 'an introduction to pascal plus ',
 'languages for parallel computers ',
 'modules and visibility in the ada programming language ',
 'algorithms for parallel computers ',
 'concurrent pascal an appraisal ',
 'a structured compiler ',
 ' information systems modelling',
 'computable set theory volume 1 ',
 'mentale belastung und kognitive prozesse bei komplexen dialogstrukturen ',
 'naturwissenschaftsdidaktik als studienfach kommentierte dokumentation ausl auml nd studieng auml nge ',
 'simulation and the monte carlo method ',
 'software engineering in c ',
 'search in arti

In [None]:
# Preprocess 2
titles_before_1990_2 = preprocess2(titles_before_1990)

In [None]:
titles_before_1990_2

['introduct mathemat taxonomi',
 'speech act essay the philosophi languag',
 'evolut and the theori game',
 'the construct program',
 'modal logic introduct',
 'introduct combin and lambda calculu',
 'the cognit structur emot',
 'structur oper system',
 'parallel process ada',
 'model for commun sequenti process',
 'introduct pascal plu',
 'languag for parallel comput',
 'modul and visibl the ada program languag',
 'algorithm for parallel comput',
 'concurr pascal apprais',
 'structur compil',
 'inform system model',
 'comput set theori volum',
 'mental belastung und kognit prozess bei komplexen dialogstrukturen',
 'naturwissenschaftsdidaktik al studienfach kommentiert dokument ausl auml studieng auml nge',
 'simul and the mont carlo method',
 'softwar engin',
 'search artifici intellig',
 'build expert system',
 'structur complex',
 'encyclopaedia linguist',
 'the comput and the mind introduct cognit scienc',
 'the theori pars',
 'catalogu artifici intellig tool',
 'artifici intellig 

In [None]:
def perform_lda(texts, num_topics):
    vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
    X = vectorizer.fit_transform(texts)

    lda = LatentDirichletAllocation(n_components=num_topics, random_state=42)
    lda.fit(X)

    feature_names = vectorizer.get_feature_names_out()
    topics = []
    for idx, topic in enumerate(lda.components_):
        topics.append([feature_names[i] for i in topic.argsort()[:-11:-1]])
        print(f"Topic {idx}: {topics[-1]}")

    return topics

In [None]:
# Perform LDA with num_lda_topics = 5 for Preprocess 1 - Annotate the topics
print("LDA with Preprocess 1 (num_topics = 5):")
topics_p1_5 = perform_lda(titles_before_1990_1, 5)

LDA with Preprocess 1 (num_topics = 5):
Topic 0: ['uuml', 'der', 'und', 'von', 'auml', 'systems', 'zur', 'die', 'distributed', 'retrieval']
Topic 1: ['based', 'analysis', 'using', 'recognition', 'logic', 'design', 'networks', 'graphs', 'image', 'pattern']
Topic 2: ['language', 'sub', 'eacute', 'sup', 'program', 'programming', 'sets', 'trees', 'research', 'graph']
Topic 3: ['computer', 'systems', 'data', 'software', 'design', 'information', 'review', 'analysis', 'programming', 'database']
Topic 4: ['linear', 'problem', 'algorithm', 'control', 'problems', 'method', 'optimal', 'time', 'algorithms', 'systems']


In [None]:
# Perform LDA with num_lda_topics = 5 for Preprocess 2 - Annotate the topics
print("LDA with Preprocess 2 (num_topics = 5):")
topics_p2_5 = perform_lda(titles_before_1990_2, 5)

LDA with Preprocess 2 (num_topics = 5):
Topic 0: ['graph', 'uuml', 'comput', 'der', 'und', 'von', 'ein', 'auml', 'zur', 'die']
Topic 1: ['comput', 'inform', 'data', 'network', 'problem', 'algorithm', 'model', 'design', 'base', 'perform']
Topic 2: ['control', 'program', 'theori', 'model', 'set', 'function', 'linear', 'problem', 'languag', 'applic']
Topic 3: ['time', 'review', 'analysi', 'approxim', 'algorithm', 'error', 'model', 'introduct', 'filter', 'languag']
Topic 4: ['method', 'logic', 'softwar', 'imag', 'use', 'process', 'pattern', 'equat', 'algorithm', 'comput']


In [None]:
# Perform LDA with num_lda_topics > 5 for Preprocess 1 - Annotate the topics
print("LDA with Preprocess 1 (num_topics = 10):")
topics_p1_10 = perform_lda(titles_before_1990_1, 10)

LDA with Preprocess 1 (num_topics = 10):
Topic 0: ['systems', 'control', 'time', 'distributed', 'model', 'discrete', 'linear', 'real', 'adaptive', 'optimal']
Topic 1: ['design', 'logic', 'systems', 'networks', 'based', 'microprocessor', 'analysis', 'new', 'memory', 'high']
Topic 2: ['programming', 'program', 'language', 'software', 'trees', 'development', 'functions', 'engineering', 'computer', 'computing']
Topic 3: ['sub', 'languages', 'structures', 'systems', 'computer', 'introduction', 'free', 'database', 'operating', 'context']
Topic 4: ['algorithm', 'algorithms', 'optimal', 'linear', 'estimation', 'using', 'method', 'random', 'efficient', 'function']
Topic 5: ['theory', 'review', 'graphs', 'theorem', 'note', 'graph', 'set', 'sets', 'book', 'number']
Topic 6: ['uuml', 'der', 'und', 'von', 'auml', 'zur', 'die', 'sup', 'ouml', 'ein']
Topic 7: ['problems', 'problem', 'method', 'equations', 'solution', 'methods', 'complexity', 'linear', 'solving', 'numerical']
Topic 8: ['data', 'compute

In [None]:
# Perform LDA with num_lda_topics > 5 for Preprocess 2 - Annotate the topics
print("LDA with Preprocess 2 (num_topics = 10):")
topics_p2_10 = perform_lda(titles_before_1990_2, 10)

LDA with Preprocess 2 (num_topics = 10):
Topic 0: ['comput', 'scienc', 'studi', 'technolog', 'new', 'program', 'univers', 'educ', 'chess', 'report']
Topic 1: ['network', 'problem', 'comput', 'algorithm', 'parallel', 'tree', 'complex', 'perform', 'model', 'flow']
Topic 2: ['control', 'theori', 'model', 'optim', 'sub', 'sup', 'stochast', 'applic', 'linear', 'adapt']
Topic 3: ['review', 'time', 'analysi', 'real', 'book', 'recognit', 'grammar', 'languag', 'correct', 'queue']
Topic 4: ['process', 'test', 'gener', 'evalu', 'distribut', 'pattern', 'recognit', 'research', 'method', 'model']
Topic 5: ['uuml', 'der', 'und', 'von', 'ein', 'auml', 'zur', 'die', 'ouml', 'mit']
Topic 6: ['algorithm', 'method', 'imag', 'use', 'linear', 'equat', 'solut', 'problem', 'transform', 'dimension']
Topic 7: ['inform', 'logic', 'estim', 'time', 'retriev', 'model', 'analysi', 'decis', 'associ', 'effect']
Topic 8: ['graph', 'set', 'theorem', 'relat', 'eacut', 'number', 'problem', 'group', 'order', 'finit']
Topic 

### From 1990 to 2009:

*Add your code for topic modelling the period from 1990 to 2009 here, similar to what you did for the period before 1990.*

In [None]:
titles_from_1990_to_2009 = download_text_file_from_drive(url_from_1990_to_2009)
titles_from_1990_to_2009_1 = preprocess1(titles_from_1990_to_2009)
titles_from_1990_to_2009_2 = preprocess2(titles_from_1990_to_2009)

In [None]:
titles_from_1990_to_2009_1_5 = perform_lda(titles_from_1990_to_2009_1,5)

Topic 0: ['using', 'sub', 'sup', 'analysis', 'complexity', 'model', 'high', 'low', 'language', 'study']
Topic 1: ['systems', 'software', 'information', 'computer', 'learning', 'based', 'process', 'study', 'model', 'case']
Topic 2: ['problem', 'problems', 'method', 'algorithm', 'linear', 'graphs', 'finite', 'order', 'methods', 'algorithms']
Topic 3: ['information', 'networks', 'data', 'design', 'management', 'web', 'based', 'systems', 'wireless', 'knowledge']
Topic 4: ['based', 'using', 'systems', 'control', 'time', 'networks', 'analysis', 'neural', 'data', 'adaptive']


In [None]:
titles_from_1990_to_2009_1_10 = perform_lda(titles_from_1990_to_2009_1,10)

Topic 0: ['sub', 'sup', 'using', 'high', 'low', 'video', 'power', 'codes', 'measurement', 'frequency']
Topic 1: ['process', 'systems', 'models', 'model', 'information', 'study', 'case', 'test', 'analysis', 'processes']
Topic 2: ['problems', 'problem', 'graphs', 'method', 'uuml', 'finite', 'equations', 'methods', 'und', 'algorithm']
Topic 3: ['management', 'networks', 'wireless', 'mobile', 'information', 'eacute', 'oriented', 'internet', 'object', 'knowledge']
Topic 4: ['systems', 'control', 'time', 'neural', 'adaptive', 'networks', 'based', 'fuzzy', 'using', 'network']
Topic 5: ['analysis', 'human', 'functional', 'study', 'brain', 'model', 'data', 'fmri', 'control', 'networks']
Topic 6: ['based', 'using', 'image', 'images', 'recognition', 'detection', 'data', 'analysis', 'classification', 'processing']
Topic 7: ['based', 'parallel', 'decision', 'scheduling', 'algorithms', 'service', 'simulation', 'multi', 'using', 'performance']
Topic 8: ['software', 'computer', 'systems', 'web', 'desi

In [None]:
titles_from_1990_to_2009_2_5 = perform_lda(titles_from_1990_to_2009_2,5)

Topic 0: ['inform', 'comput', 'equat', 'knowledg', 'theori', 'logic', 'manag', 'databas', 'issu', 'eacut']
Topic 1: ['use', 'model', 'imag', 'base', 'estim', 'analysi', 'method', 'detect', 'measur', 'code']
Topic 2: ['network', 'control', 'design', 'base', 'time', 'model', 'use', 'neural', 'perform', 'learn']
Topic 3: ['algorithm', 'problem', 'method', 'optim', 'program', 'graph', 'parallel', 'linear', 'gener', 'comput']
Topic 4: ['data', 'sub', 'robot', 'softwar', 'test', 'model', 'uuml', 'base', 'und', 'develop']


In [None]:
titles_from_1990_to_2009_2_10 = perform_lda(titles_from_1990_to_2009_2,10)

Topic 0: ['equat', 'logic', 'eacut', 'order', 'semant', 'differenti', 'introduct', 'issu', 'game', 'comput']
Topic 1: ['imag', 'use', 'base', 'model', 'analysi', 'recognit', 'detect', 'visual', 'data', 'featur']
Topic 2: ['control', 'network', 'design', 'base', 'adapt', 'wireless', 'sensor', 'mobil', 'use', 'model']
Topic 3: ['problem', 'program', 'algorithm', 'optim', 'sup', 'method', 'parallel', 'comput', 'object', 'solut']
Topic 4: ['robot', 'uuml', 'und', 'der', 'sub', 'manipul', 'auml', 'von', 'ein', 'die']
Topic 5: ['high', 'sub', 'model', 'method', 'test', 'finit', 'flow', 'power', 'perform', 'element']
Topic 6: ['code', 'transform', 'estim', 'signal', 'filter', 'time', 'use', 'base', 'algorithm', 'channel']
Topic 7: ['time', 'set', 'real', 'structur', 'function', 'complex', 'model', 'data', 'protein', 'larg']
Topic 8: ['inform', 'softwar', 'comput', 'model', 'manag', 'develop', 'base', 'technolog', 'web', 'learn']
Topic 9: ['network', 'neural', 'algorithm', 'graph', 'model', 'u

### From 2010 onwards:

*Add your code for topic modelling the period from 2010 onwards here, similar to what you did for the period before 1990.*

In [None]:
titles_from_2010 = download_text_file_from_drive(url_from_2010)
titles_from_2010_1 = preprocess1(titles_from_2010)
titles_from_2010_2 = preprocess2(titles_from_2010)

In [None]:
titles_from_2010_1_5 = perform_lda(titles_from_2010_1,5)

Topic 0: ['based', 'using', 'detection', 'network', 'image', 'multi', 'classification', 'recognition', 'images', 'data']
Topic 1: ['based', 'data', 'algorithm', 'image', 'large', 'analysis', 'clustering', 'processing', 'efficient', 'search']
Topic 2: ['based', 'networks', 'systems', 'sensor', 'wireless', 'energy', 'design', 'network', 'using', 'control']
Topic 3: ['learning', 'based', 'data', 'study', 'using', 'model', 'machine', 'analysis', 'social', 'information']
Topic 4: ['systems', 'control', 'time', 'method', 'model', 'based', 'linear', 'nonlinear', 'order', 'analysis']


In [None]:
titles_from_2010_1_10 = perform_lda(titles_from_2010_1,10)

Topic 0: ['social', 'analysis', 'research', 'information', 'brain', 'sup', 'study', 'knowledge', 'functional', 'human']
Topic 1: ['algorithm', 'large', 'based', 'data', 'search', 'scale', 'eacute', 'algorithms', 'problem', 'optimization']
Topic 2: ['networks', 'based', 'systems', 'wireless', 'energy', 'computing', 'sub', 'multi', 'sensor', 'network']
Topic 3: ['learning', 'based', 'machine', 'model', 'virtual', 'decision', 'approach', 'using', 'support', 'fuzzy']
Topic 4: ['systems', 'control', 'time', 'method', 'linear', 'nonlinear', 'order', 'model', 'finite', 'problems']
Topic 5: ['data', 'study', 'analysis', 'case', 'review', 'digital', 'systems', 'privacy', 'security', 'based']
Topic 6: ['design', 'based', 'robot', 'using', 'control', 'high', 'performance', 'analysis', 'optical', 'human']
Topic 7: ['based', 'learning', 'network', 'detection', 'neural', 'using', 'deep', 'time', 'recognition', 'classification']
Topic 8: ['using', 'based', 'sensing', 'data', 'remote', 'internet', 'me

In [None]:
titles_from_2010_2_5 = perform_lda(titles_from_2010_2,5)

Topic 0: ['base', 'imag', 'learn', 'use', 'network', 'detect', 'deep', 'model', 'neural', 'featur']
Topic 1: ['control', 'model', 'time', 'base', 'function', 'dynam', 'graph', 'fuzzi', 'sub', 'robot']
Topic 2: ['optim', 'algorithm', 'method', 'base', 'network', 'problem', 'effici', 'comput', 'energi', 'model']
Topic 3: ['data', 'use', 'base', 'sens', 'model', 'map', 'studi', 'analysi', 'remot', 'cloud']
Topic 4: ['network', 'base', 'learn', 'social', 'use', 'commun', 'inform', 'analysi', 'studi', 'mobil']


In [None]:
titles_from_2010_2_10 = perform_lda(titles_from_2010_2,10)

Topic 0: ['base', 'imag', 'learn', 'network', 'use', 'detect', 'deep', 'neural', 'featur', 'classif']
Topic 1: ['control', 'model', 'time', 'base', 'sub', 'fuzzi', 'nonlinear', 'decis', 'dynam', 'delay']
Topic 2: ['optim', 'algorithm', 'base', 'problem', 'comput', 'effici', 'multi', 'energi', 'schedul', 'data']
Topic 3: ['data', 'use', 'base', 'sens', 'remot', 'model', 'map', 'spatial', 'monitor', 'analysi']
Topic 4: ['network', 'sensor', 'base', 'wireless', 'commun', 'mobil', 'energi', 'power', 'channel', 'internet']
Topic 5: ['learn', 'research', 'review', 'social', 'studi', 'inform', 'data', 'use', 'technolog', 'analysi']
Topic 6: ['model', 'brain', 'interact', 'use', 'human', 'robot', 'function', 'base', 'analysi', 'predict']
Topic 7: ['base', 'design', 'secur', 'architectur', 'high', 'code', 'applic', 'use', 'cloud', 'attack']
Topic 8: ['method', 'equat', 'flow', 'order', 'point', 'problem', 'model', 'finit', 'solut', 'special']
Topic 9: ['graph', 'model', 'product', 'game', 'know

📝❓ For each period, assign a name to each generated topic based on the topic’s top words. List all topic names in your report. If a topic is incoherent to the degree that no common theme is detectable, you can just mark it as incoherent (i.e., no need to name a topic that does not exist).

📝❓ Do the topics make sense to you? Are they coherent? Do you observe trends across different time periods? Discuss in 4-6 sentences.
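
As a rough complement to manual inspection, one can count how much the top words of different topics overlap. A minimal sketch, assuming topics are given as lists of top words such as those returned by `perform_lda`. The function name `topic_diversity` and the example topics are hypothetical, and the measure says nothing about whether a single topic is coherent:

```python
def topic_diversity(topics, top_k=10):
    """Fraction of unique words among the top_k words of all topics (1.0 means no overlap)."""
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)

# Two hypothetical topics that share one of their three top words
example_topics = [["graph", "theorem", "set"], ["control", "linear", "set"]]
print(topic_diversity(example_topics, top_k=3))
```

A low value suggests that several topics are dominated by the same frequent words, which may point to missing stop words or too many topics.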


## Combined Topic Models

Method developed by [Bianchi et al. 2021](https://aclanthology.org/2021.acl-short.96/).

[A 6min presentation of the paper by one of the authors.](https://underline.io/lecture/25716-pre-training-is-a-hot-topic-contextualized-document-embeddings-improve-topic-coherence)

[Medium Blog](https://towardsdatascience.com/contextualized-topic-modeling-with-python-eacl2021-eacf6dfa576)

Code: [https://github.com/MilaNLProc/contextualized-topic-models](https://github.com/MilaNLProc/contextualized-topic-models)

Tutorial: [https://colab.research.google.com/drive/1fXJjr_rwqvpp1IdNQ4dxqN4Dp88cxO97?usp=sharing](https://colab.research.google.com/drive/1fXJjr_rwqvpp1IdNQ4dxqN4Dp88cxO97?usp=sharing)

Again, perform topic modelling for the three time periods - this time using the combined topic models (CTMs).

You can use and adapt the code from the tutorial linked above.

Use the available GPU for faster running times.
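
Before training, it can help to check whether a GPU is actually visible to PyTorch. A minimal sketch, assuming PyTorch is installed (it is normally pulled in with `contextualized-topic-models`). The helper name `pick_device` is hypothetical and falls back to the CPU when PyTorch or a GPU is missing:

```python
def pick_device():
    """Return "cuda" if PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # assumption: available in the notebook environment
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

print(pick_device())
```

If this prints `cpu` in Colab, switching the runtime type to a GPU runtime and re-running the notebook should change the result.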

In [None]:
from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation

***Important: executing the import below (`WhiteSpacePreprocessing`) will produce an error on the first run. Executing it again resolves the error. This is probably due to a caching issue in the `contextualized_topic_models` package.***

In [None]:
try:  # the first import attempt is known to fail
    from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessing
except Exception:  # retry once; the second attempt succeeds
    from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessing


In [None]:
from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessing

In [None]:
num_ctm_topics = 5  # min number of topics
MODEL_NAME = "sentence-transformers/paraphrase-mpnet-base-v2" # Model to use for CTM

### Before the 1990s:

In [None]:
# Read data
titles_before_1990 = download_text_file_from_drive(url_before_1990)
titles_before_1990[:10]

['An Introduction to Mathematical Taxonomy',
 'Speech Acts: An Essay in the Philosophy of Language.',
 'Evolution and the Theory of Games',
 'On the Construction of Programs',
 'Modal Logic - An Introduction',
 'Introduction to Combinators and Lambda-Calculus.',
 'The Cognitive Structure of Emotions.',
 'A Structured Operating System.',
 'Parallel Processing in Ada.',
 'A Model for Communicating Sequential Processes.']

In [None]:
# Download the NLTK stopword list used by the recommended WhiteSpacePreprocessing step
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
# WhiteSpacePreprocessing returns both the cleaned documents (for the bag of words)
# and an untouched copy of the retained documents (for the contextual embeddings).
# preprocess1 and preprocess2 are then applied to the cleaned documents only.

sp_1 = WhiteSpacePreprocessing(titles_before_1990, "english")
preprocessed_documents_1, unpreprocessed_corpus_1, vocab_1, retained_indices_1 = sp_1.preprocess()

sp_2 = WhiteSpacePreprocessing(titles_before_1990, "english")
preprocessed_documents_2, unpreprocessed_corpus_2, vocab_2, retained_indices_2 = sp_2.preprocess()

In [None]:
# Preprocess 1
preprocessed_documents_1 = preprocess1(preprocessed_documents_1)

# Preprocess 2
preprocessed_documents_2 = preprocess2(preprocessed_documents_2)

In [None]:
preprocessed_documents_1[:2]

['introduction mathematical', 'speech language']

In [None]:
preprocessed_documents_2[:2]

['introduct mathemat', 'speech languag']

In [None]:
unpreprocessed_corpus_2[:2]

['An Introduction to Mathematical Taxonomy',
 'Speech Acts: An Essay in the Philosophy of Language.']

In [None]:
# Perform CTM with num_ctm_topics = 5 for Preprocess 1 - Annotate the topics

tp_5_1 = TopicModelDataPreparation(MODEL_NAME)
training_dataset_5_1 = tp_5_1.fit(text_for_contextual=unpreprocessed_corpus_1, text_for_bow=preprocessed_documents_1)

ctm_5_1 = CombinedTM(bow_size=len(tp_5_1.vocab), contextual_size=768, n_components=num_ctm_topics, num_epochs=10)
ctm_5_1.fit(training_dataset_5_1) # run the model

for i in range(num_ctm_topics):
    print(f"topic {i}: ", ', '.join(ctm_5_1.get_topic_lists(10)[i]))

topic 0:  system, systems, design, information, software, computer, data, development, distributed, model
topic 1:  uuml, der, und, zur, des, von, mit, eacute, ein, die
topic 2:  linear, control, using, adaptive, recognition, algorithm, algorithms, time, optimal, estimation
topic 3:  graphs, finite, set, functions, note, theorem, sup, trees, logic, graph
topic 4:  report, controlled, international, automated, scientific, book, citation, impact, assembly, education





In [None]:
import pyLDAvis as vis

lda_vis_data_5_1 = ctm_5_1.get_ldavis_data_format(tp_5_1.vocab, training_dataset_5_1, n_samples=10)

ctm_pd_5_1 = vis.prepare(**lda_vis_data_5_1)
vis.display(ctm_pd_5_1)


In [None]:
# Perform CTM with num_ctm_topics = 5 for Preprocess 2 - Annotate the topics

tp_5_2 = TopicModelDataPreparation(MODEL_NAME)
training_dataset_5_2 = tp_5_2.fit(text_for_contextual=unpreprocessed_corpus_2, text_for_bow=preprocessed_documents_2)

ctm_5_2 = CombinedTM(bow_size=len(tp_5_2.vocab), contextual_size=768, n_components=num_ctm_topics, num_epochs=10)
ctm_5_2.fit(training_dataset_5_2) # run the model

for i in range(num_ctm_topics):
    print(f"topic {i}: ", ', '.join(ctm_5_2.get_topic_lists(10)[i]))


topic 0:  algorithm, method, use, time, linear, optim, imag, estim, control, equat
topic 1:  set, graph, sub, bound, minim, tree, number, note, theori, properti
topic 2:  uuml, ein, von, auml, da, der, und, die, ber, ouml
topic 3:  3b20d, futur, qualiti, air, cam, world, white, vision, renew, mv
topic 4:  system, comput, design, model, inform, data, program, base, softwar, structur
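
Tokens such as `uuml`, `auml` and `ouml` in the topics above look like remnants of HTML character references (`&uuml;`, `&auml;`, `&ouml;`) whose `&` and `;` were stripped during preprocessing; `ber` may likewise be what is left of `&uuml;ber`. A minimal sketch, assuming the raw titles really contain such references. `decode_entities` is a hypothetical helper that is not part of the notebook, and the two sample titles are invented for illustration.

```python
import html

def decode_entities(titles):
    """Return the titles with HTML character references decoded."""
    return [html.unescape(title) for title in titles]

# Invented sample titles, only to show the effect.
sample = ["Einf&uuml;hrung in die Informatik", "Th&eacute;orie des graphes"]
print(decode_entities(sample))
```

If the assumption holds, applying such a step to the downloaded titles before `WhiteSpacePreprocessing` would keep words like "Einführung" intact instead of splitting them around the entity name.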





In [None]:
lda_vis_data_5_2 = ctm_5_2.get_ldavis_data_format(tp_5_2.vocab, training_dataset_5_2, n_samples=10)

ctm_pd_5_2 = vis.prepare(**lda_vis_data_5_2)
vis.display(ctm_pd_5_2)


In [None]:
# Perform CTM with num_ctm_topics > 5 for Preprocess 1 - Annotate the topics

num_ctm_topics = 10

tp_10_1 = TopicModelDataPreparation(MODEL_NAME)
training_dataset_10_1 = tp_10_1.fit(text_for_contextual=unpreprocessed_corpus_1, text_for_bow=preprocessed_documents_1)

ctm_10_1 = CombinedTM(bow_size=len(tp_10_1.vocab), contextual_size=768, n_components=num_ctm_topics, num_epochs=10)
ctm_10_1.fit(training_dataset_10_1) # run the model

for i in range(num_ctm_topics):
    print(f"topic {i}: ", ', '.join(ctm_10_1.get_topic_lists(10)[i]))


topic 0:  uuml, und, von, auml, der, zur, die, eacute, de, des
topic 1:  science, computer, software, graphics, computers, technology, engineering, research, issues, security
topic 2:  graphs, algorithm, algorithms, networks, graph, parallel, problem, tree, trees, complexity
topic 3:  sets, designs, types, theories, groups, degrees, theorem, sub, theorems, forms
topic 4:  data, information, system, design, systems, based, management, retrieval, database, processing
topic 5:  image, recognition, pattern, digital, using, speech, images, shape, dimensional, detection
topic 6:  languages, language, logic, programming, book, review, program, programs, context, natural
topic 7:  das, und, uuml, von, der, ein, auml, mit, zur, ouml
topic 8:  control, systems, time, adaptive, model, optimal, linear, estimation, dynamic, system
topic 9:  problems, method, linear, equations, solution, problem, nonlinear, functions, two, finite
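
With 10 topics, topic 0 and topic 7 above both consist of German function words, which suggests the model spent two topics on the same theme. One way to put a number on that impression is the Jaccard similarity of the top-word sets. This is a sketch for annotation support only; `jaccard` is a hypothetical helper, and the word lists are copied from the output above.

```python
def jaccard(words_a, words_b):
    """Jaccard similarity between two collections of words."""
    a, b = set(words_a), set(words_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Top words of topic 0 and topic 7, copied from the output above.
topic_0 = "uuml, und, von, auml, der, zur, die, eacute, de, des".split(", ")
topic_7 = "das, und, uuml, von, der, ein, auml, mit, zur, ouml".split(", ")

print(round(jaccard(topic_0, topic_7), 3))
```

The two lists share six of their ten words, so the score is 6/14. Comparing every pair of topics this way can flag candidates for merging when annotating.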





In [None]:
lda_vis_data_10_1 = ctm_10_1.get_ldavis_data_format(tp_10_1.vocab, training_dataset_10_1, n_samples=10)

ctm_pd_10_1 = vis.prepare(**lda_vis_data_10_1)
vis.display(ctm_pd_10_1)


In [None]:
# Perform CTM with num_ctm_topics > 5 for Preprocess 2 - Annotate the topics

tp_10_2 = TopicModelDataPreparation(MODEL_NAME)
training_dataset_10_2 = tp_10_2.fit(text_for_contextual=unpreprocessed_corpus_2, text_for_bow=preprocessed_documents_2)

ctm_10_2 = CombinedTM(bow_size=len(tp_10_2.vocab), contextual_size=768, n_components=num_ctm_topics, num_epochs=10)
ctm_10_2.fit(training_dataset_10_2) # run the model

for i in range(num_ctm_topics):
    print(f"topic {i}: ", ', '.join(ctm_10_2.get_topic_lists(10)[i]))


topic 0:  set, graph, sub, theorem, theori, order, number, group, finit, sup
topic 1:  control, estim, time, optim, linear, process, filter, analysi, adapt, system
topic 2:  languag, recognit, use, program, imag, pattern, dimension, techniqu, code, automat
topic 3:  system, data, inform, base, design, structur, retriev, databas, knowledg, manag
topic 4:  problem, method, equat, function, solut, valu, linear, numer, solv, approxim
topic 5:  uncertainti, market, cancel, regist, markovian, uncertain, treatment, highli, associ, unifi
topic 6:  comput, scienc, review, graphic, softwar, research, engin, book, educ, secur
topic 7:  algorithm, network, tree, parallel, search, bound, distribut, optim, two, binari
topic 8:  model, simul, perform, control, dynam, design, robot, oper, memori, processor
topic 9:  uuml, von, und, der, ein, mit, die, auml, eacut, ouml
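
When comparing 5 against 10 topics, one simple number to set beside the manual annotations is the share of distinct words among all top words: a low value means the topics repeat each other. A minimal sketch under that assumption; `topic_diversity` is a hypothetical helper, and the toy input below is made up. In the notebook, the input would be the list returned by `ctm_10_2.get_topic_lists(10)`.

```python
def topic_diversity(topic_lists):
    """Fraction of distinct words among the top words of all topics."""
    all_words = [word for topic in topic_lists for word in topic]
    if not all_words:
        return 0.0
    return len(set(all_words)) / len(all_words)

# Toy input: two topics with three top words each, one word shared.
example = [
    ["system", "data", "inform"],
    ["system", "control", "time"],
]
print(topic_diversity(example))
```

The measure ignores whether the words are meaningful, so it complements the annotation rather than replacing it.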





In [None]:
lda_vis_data_10_2 = ctm_10_2.get_ldavis_data_format(tp_10_2.vocab, training_dataset_10_2, n_samples=10)

ctm_pd_10_2 = vis.prepare(**lda_vis_data_10_2)
vis.display(ctm_pd_10_2)


### From 1990 to 2009

Add your code for topic modelling the period from 1990 to 2009 here, following the same steps you used for the period before 1990.
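
The cells for each period reuse names such as `tp_5_1` and `ctm_5_1`, so the models of an earlier period are overwritten once the next period is run. One possible way to keep all runs apart is to key them by period, preprocessing function, and number of topics. This is only a sketch: the period labels and the `results` dictionary are assumptions made here, and `None` stands in for whatever is stored per run (for example the topic lists).

```python
from itertools import product

periods = ["before_1990", "1990_to_2009", "from_2010"]
preprocessors = ["preprocess1", "preprocess2"]
topic_counts = [5, 10]

# One entry per (period, preprocessing, number of topics) combination.
results = {key: None for key in product(periods, preprocessors, topic_counts)}

print(len(results))
for key in list(results)[:3]:
    print(key)
```

Storing, for instance, the topic lists under `results[("1990_to_2009", "preprocess1", 5)]` makes the later comparison across periods possible without re-running earlier cells.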

In [None]:
titles_from_1990_to_2009 = download_text_file_from_drive(url_from_1990_to_2009)

In [None]:
sp_1 = WhiteSpacePreprocessing(titles_from_1990_to_2009, "english")
preprocessed_documents_1, unpreprocessed_corpus_1, vocab_1, retained_indices_1 = sp_1.preprocess()

sp_2 = WhiteSpacePreprocessing(titles_from_1990_to_2009, "english")
preprocessed_documents_2, unpreprocessed_corpus_2, vocab_2, retained_indices_2 = sp_2.preprocess()

In [None]:
# Preprocess 1
preprocessed_documents_1 = preprocess1(preprocessed_documents_1)

# Preprocess 2
preprocessed_documents_2 = preprocess2(preprocessed_documents_2)

In [None]:
# Perform CTM with num_ctm_topics = 5 for Preprocess 1 - Annotate the topics

num_ctm_topics = 5

tp_5_1 = TopicModelDataPreparation(MODEL_NAME)
training_dataset_5_1 = tp_5_1.fit(text_for_contextual=unpreprocessed_corpus_1, text_for_bow=preprocessed_documents_1)

ctm_5_1 = CombinedTM(bow_size=len(tp_5_1.vocab), contextual_size=768, n_components=num_ctm_topics, num_epochs=10)
ctm_5_1.fit(training_dataset_5_1) # run the model

for i in range(num_ctm_topics):
    print(f"topic {i}: ", ', '.join(ctm_5_1.get_topic_lists(10)[i]))


topic 0:  systems, control, networks, time, neural, network, adaptive, nonlinear, dynamic, fuzzy
topic 1:  arm, regulation, sensory, forecasting, hopfield, vibration, actuator, egrave, regulatory, plant
topic 2:  information, web, software, management, knowledge, computer, technology, framework, development, security
topic 3:  problems, problem, equations, sup, finite, order, methods, graphs, two, algorithms
topic 4:  using, image, data, based, detection, estimation, recognition, images, analysis, model





In [None]:
lda_vis_data_5_1 = ctm_5_1.get_ldavis_data_format(tp_5_1.vocab, training_dataset_5_1, n_samples=10)

ctm_pd_5_1 = vis.prepare(**lda_vis_data_5_1)
vis.display(ctm_pd_5_1)


In [None]:
# Perform CTM with num_ctm_topics = 5 for Preprocess 2 - Annotate the topics

tp_5_2 = TopicModelDataPreparation(MODEL_NAME)
training_dataset_5_2 = tp_5_2.fit(text_for_contextual=unpreprocessed_corpus_2, text_for_bow=preprocessed_documents_2)

ctm_5_2 = CombinedTM(bow_size=len(tp_5_2.vocab), contextual_size=768, n_components=num_ctm_topics, num_epochs=10)
ctm_5_2.fit(training_dataset_5_2) # run the model

for i in range(num_ctm_topics):
    print(f"topic {i}: ", ', '.join(ctm_5_2.get_topic_lists(10)[i]))


topic 0:  imag, use, data, analysi, base, estim, detect, model, recognit, featur
topic 1:  network, system, control, time, neural, dynam, design, wireless, robot, distribut
topic 2:  redund, gain, static, feedforward, neuro, vibrat, motor, tune, hopfield, trajectori
topic 3:  technolog, web, inform, softwar, uuml, develop, knowledg, comput, virtual, und
topic 4:  problem, sup, method, algorithm, equat, sub, approxim, linear, optim, solut





In [None]:
lda_vis_data_5_2 = ctm_5_2.get_ldavis_data_format(tp_5_2.vocab, training_dataset_5_2, n_samples=10)

ctm_pd_5_2 = vis.prepare(**lda_vis_data_5_2)
vis.display(ctm_pd_5_2)


In [None]:
# Perform CTM with num_ctm_topics > 5 for Preprocess 1 - Annotate the topics

num_ctm_topics = 10

tp_10_1 = TopicModelDataPreparation(MODEL_NAME)
training_dataset_10_1 = tp_10_1.fit(text_for_contextual=unpreprocessed_corpus_1, text_for_bow=preprocessed_documents_1)

ctm_10_1 = CombinedTM(bow_size=len(tp_10_1.vocab), contextual_size=768, n_components=num_ctm_topics, num_epochs=10)
ctm_10_1.fit(training_dataset_10_1) # run the model

for i in range(num_ctm_topics):
    print(f"topic {i}: ", ' '.join(ctm_10_1.get_topic_lists(10)[i]))


topic 0:  systems time sub control linear nonlinear discrete estimation real adaptive
topic 1:  computer uuml und eacute der virtual internet de science auml
topic 2:  analysis study functional brain effects human molecular model imaging fmri
topic 3:  graphs sets number trees graph note complexity complete degree matrices
topic 4:  information software management knowledge engineering oriented object web development database
topic 5:  problems problem method finite methods algorithms order equations algorithm optimization
topic 6:  networks performance wireless distributed scheduling power sensor parallel mobile high
topic 7:  neural learning system network based fuzzy model approach robot control
topic 8:  considering utilizing reference reactive capability angle gate vibration symbolic positioning
topic 9:  image using images video detection based motion fast recognition estimation





In [None]:
lda_vis_data_10_1 = ctm_10_1.get_ldavis_data_format(tp_10_1.vocab, training_dataset_10_1, n_samples=10)

ctm_pd_10_1 = vis.prepare(**lda_vis_data_10_1)
vis.display(ctm_pd_10_1)


In [None]:
# Perform CTM with num_ctm_topics > 5 for Preprocess 2 - Annotate the topics

tp_10_2 = TopicModelDataPreparation(MODEL_NAME)
training_dataset_10_2 = tp_10_2.fit(text_for_contextual=unpreprocessed_corpus_2, text_for_bow=preprocessed_documents_2)

ctm_10_2 = CombinedTM(bow_size=len(tp_10_2.vocab), contextual_size=768, n_components=num_ctm_topics, num_epochs=10)
ctm_10_2.fit(training_dataset_10_2) # run the model

for i in range(num_ctm_topics):
    print(f"topic {i}: ", ' '.join(ctm_10_2.get_topic_lists(10)[i]))


topic 0:  network perform wireless effici neural rout protocol traffic mobil sensor
topic 1:  comput softwar design orient object engin program applic architectur special
topic 2:  number cycl graph theorem note group partit algebra complet proof
topic 3:  effect measur human activ respons visual brain field function fmri
topic 4:  multir consid von anneal incorpor underwat forecast neuro overlap guarante
topic 5:  inform web manag technolog knowledg librari research case learn busi
topic 6:  problem method equat sup finit algorithm approxim solut optim order
topic 7:  imag estim code filter signal transform use adapt compress channel
topic 8:  model base data approach analysi fuzzi use mine recognit rule
topic 9:  system control time sub nonlinear linear dynam robot stabil discret





In [None]:
lda_vis_data_10_2 = ctm_10_2.get_ldavis_data_format(tp_10_2.vocab, training_dataset_10_2, n_samples=10)

ctm_pd_10_2 = vis.prepare(**lda_vis_data_10_2)
vis.display(ctm_pd_10_2)


### From 2010 onwards

Add your code for topic modelling the period from 2010 onwards here, following the same steps you used for the period before 1990.

In [None]:
titles_from_2010 = download_text_file_from_drive(url_from_2010)

In [None]:
sp_1 = WhiteSpacePreprocessing(titles_from_2010, "english")
preprocessed_documents_1, unpreprocessed_corpus_1, vocab_1, retained_indices_1 = sp_1.preprocess()

sp_2 = WhiteSpacePreprocessing(titles_from_2010, "english")
preprocessed_documents_2, unpreprocessed_corpus_2, vocab_2, retained_indices_2 = sp_2.preprocess()

In [None]:
# Preprocess 1
preprocessed_documents_1 = preprocess1(preprocessed_documents_1)

# Preprocess 2
preprocessed_documents_2 = preprocess2(preprocessed_documents_2)

In [None]:
# Perform CTM with num_ctm_topics = 5 for Preprocess 1 - Annotate the topics

num_ctm_topics = 5

tp_5_1 = TopicModelDataPreparation(MODEL_NAME)
training_dataset_5_1 = tp_5_1.fit(text_for_contextual=unpreprocessed_corpus_1, text_for_bow=preprocessed_documents_1)

ctm_5_1 = CombinedTM(bow_size=len(tp_5_1.vocab), contextual_size=768, n_components=num_ctm_topics, num_epochs=10)
ctm_5_1.fit(training_dataset_5_1) # run the model

for i in range(num_ctm_topics):
    print(f"topic {i}: ", ', '.join(ctm_5_1.get_topic_lists(10)[i]))


topic 0:  using, image, learning, detection, deep, classification, network, images, recognition, based
topic 1:  time, nonlinear, control, order, systems, linear, sub, method, problems, finite
topic 2:  branch, decoder, window, directional, neighborhood, multilayer, multichannel, frame, coarse, angle
topic 3:  study, research, case, review, development, social, information, software, knowledge, digital
topic 4:  networks, wireless, sensor, efficient, energy, system, cloud, algorithm, allocation, computing
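
To see how a theme shifts between periods, the top words of matching topics can be compared directly. The sketch below contrasts the image-related topic of the 1990 to 2009 run (topic 4, Preprocess 1, 5 topics) with topic 0 above. `new_terms` is a hypothetical helper, and pairing these two topics is a manual judgment, not something the model provides.

```python
def new_terms(earlier_words, later_words):
    """Words of the later topic that do not occur in the earlier one."""
    earlier = set(earlier_words)
    return sorted(word for word in set(later_words) if word not in earlier)

# Image-related topic for 1990 to 2009 (topic 4, Preprocess 1, 5 topics).
earlier_topic = (
    "using, image, data, based, detection, estimation, "
    "recognition, images, analysis, model"
).split(", ")
# Image-related topic for 2010 onwards (topic 0 in the output above).
later_topic = (
    "using, image, learning, detection, deep, classification, "
    "network, images, recognition, based"
).split(", ")

print(new_terms(earlier_topic, later_topic))
```

For these two lists the new words are `classification`, `deep`, `learning` and `network`, which can serve as a starting point for describing the change in the lab report.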





In [None]:
lda_vis_data_5_1 = ctm_5_1.get_ldavis_data_format(tp_5_1.vocab, training_dataset_5_1, n_samples=10)

ctm_pd_5_1 = vis.prepare(**lda_vis_data_5_1)
vis.display(ctm_pd_5_1)


In [None]:
# Perform CTM with num_ctm_topics = 5 for Preprocess 2 - Annotate the topics

tp_5_2 = TopicModelDataPreparation(MODEL_NAME)
training_dataset_5_2 = tp_5_2.fit(text_for_contextual=unpreprocessed_corpus_2, text_for_bow=preprocessed_documents_2)

ctm_5_2 = CombinedTM(bow_size=len(tp_5_2.vocab), contextual_size=768, n_components=num_ctm_topics, num_epochs=10)
ctm_5_2.fit(training_dataset_5_2) # run the model

for i in range(num_ctm_topics):
    print(f"topic {i}: ", ', '.join(ctm_5_2.get_topic_lists(10)[i]))


topic 0:  detector, branch, bidirect, descriptor, sparsiti, frame, neighborhood, mutual, acoust, multichannel
topic 1:  case, studi, inform, research, develop, technolog, review, social, evalu, softwar
topic 2:  network, sensor, wireless, effici, energi, commun, algorithm, mobil, optim, distribut
topic 3:  imag, use, detect, learn, deep, featur, base, classif, recognit, segment
topic 4:  time, method, control, problem, nonlinear, equat, linear, order, system, stabil


In [None]:
lda_vis_data_5_2 = ctm_5_2.get_ldavis_data_format(tp_5_2.vocab, training_dataset_5_2, n_samples=10)

ctm_pd_5_2 = vis.prepare(**lda_vis_data_5_2)
vis.display(ctm_pd_5_2)


In [None]:
# Perform CTM with num_ctm_topics > 5 for Preprocess 1 - Annotate the topics

num_ctm_topics = 10

tp_10_1 = TopicModelDataPreparation(MODEL_NAME)
training_dataset_10_1 = tp_10_1.fit(text_for_contextual=unpreprocessed_corpus_1, text_for_bow=preprocessed_documents_1)

ctm_10_1 = CombinedTM(bow_size=len(tp_10_1.vocab), contextual_size=768, n_components=num_ctm_topics, num_epochs=10)
ctm_10_1.fit(training_dataset_10_1) # run the model

for i in range(num_ctm_topics):
    print(f"topic {i}: ", ', '.join(ctm_10_1.get_topic_lists(10)[i]))


topic 0:  data, analysis, remote, using, sensing, mapping, surface, land, spatial, satellite
topic 1:  systems, control, time, nonlinear, sub, adaptive, state, feedback, discrete, varying
topic 2:  problems, method, order, dimensional, methods, solving, equations, differential, two, equation
topic 3:  distance, sets, trees, graphs, buildings, number, organizing, multilayer, codes, drone
topic 4:  image, segmentation, feature, images, recognition, attention, video, detection, object, super
topic 5:  wireless, networks, energy, sensor, efficient, computing, edge, power, communications, iot
topic 6:  research, special, review, social, digital, information, issue, online, media, open
topic 7:  functional, connectivity, magnetic, brain, fmri, response, memory, activity, structural, resonance
topic 8:  system, virtual, design, decision, development, making, simulation, eacute, reality, robot
topic 9:  learning, deep, neural, machine, network, based, classification, using, prediction, detecti




In [None]:
lda_vis_data_10_1 = ctm_10_1.get_ldavis_data_format(tp_10_1.vocab, training_dataset_10_1, n_samples=10)

ctm_pd_10_1 = vis.prepare(**lda_vis_data_10_1)
vis.display(ctm_pd_10_1)


In [None]:
# Perform CTM with num_ctm_topics > 5 for Preprocess 2 - Annotate the topics

tp_10_2 = TopicModelDataPreparation(MODEL_NAME)
training_dataset_10_2 = tp_10_2.fit(text_for_contextual=unpreprocessed_corpus_2, text_for_bow=preprocessed_documents_2)

ctm_10_2 = CombinedTM(bow_size=len(tp_10_2.vocab), contextual_size=768, n_components=num_ctm_topics, num_epochs=10)
ctm_10_2.fit(training_dataset_10_2) # run the model

for i in range(num_ctm_topics):
    print(f"topic {i}: ", ', '.join(ctm_10_2.get_topic_lists(10)[i]))


topic 0:  wireless, energi, sensor, effici, network, power, commun, alloc, rout, radio
topic 1:  equat, sup, problem, element, approxim, differenti, finit, solut, converg, boundari
topic 2:  learn, neural, deep, machin, network, convolut, classif, detect, predict, recognit
topic 3:  research, social, role, effect, empir, impact, media, inform, onlin, influenc
topic 4:  model, base, algorithm, approach, optim, decis, fuzzi, process, make, multi
topic 5:  comput, secur, issu, special, internet, privaci, smart, intellig, cloud, thing
topic 6:  data, use, water, satellit, land, forest, chang, urban, map, analysi
topic 7:  imag, featur, object, segment, detect, local, super, transform, color, fusion
topic 8:  control, system, time, nonlinear, feedback, output, delay, adapt, vari, input
topic 9:  multichannel, adjust, bidirect, railway, aircraft, multilay, marin, batch, window, station





In [None]:
lda_vis_data_10_2 = ctm_10_2.get_ldavis_data_format(tp_10_2.vocab, training_dataset_10_2, n_samples=10)

ctm_pd_10_2 = vis.prepare(**lda_vis_data_10_2)
vis.display(ctm_pd_10_2)


📝❓ Again: Assign a name to each topic based on the topic’s top words (for each period). List all topic names in your report.

📝❓ Bianchi et al. (2021) claim that their approach produces more coherent topics than previous methods. Let’s test this claim by comparing the coherence of the topics produced by CTM with that of the topics produced by LDA. Describe your observations in 3-4 sentences.
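
One way to back up the coherence comparison with a number is a co-occurrence score such as NPMI. The sketch below is a simplified, document-level NPMI written for illustration; it is not the metric implementation used by Bianchi et al., and the function name `npmi_coherence` and the toy corpus are assumptions introduced here.

```python
import math
from itertools import combinations


def npmi_coherence(topic_words, tokenized_docs):
    """Mean NPMI over all word pairs of one topic, from document co-occurrence."""
    if len(topic_words) < 2:
        raise ValueError("need at least two topic words")
    doc_sets = [set(doc) for doc in tokenized_docs]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain every word in `words`.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p12 = prob(w1, w2)
        if p12 == 0.0:
            scores.append(-1.0)  # convention: the pair never co-occurs
        elif p12 == 1.0:
            scores.append(1.0)   # convention: the pair is in every document
        else:
            pmi = math.log(p12 / (prob(w1) * prob(w2)))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


# Toy corpus (made up for illustration, not taken from the dataset).
toy_docs = [
    ["image", "deep", "learning"],
    ["image", "deep"],
    ["control", "nonlinear"],
    ["control", "nonlinear", "system"],
]
print(npmi_coherence(["image", "deep"], toy_docs))     # words that always co-occur
print(npmi_coherence(["image", "control"], toy_docs))  # words that never co-occur
```

On the notebook's data, one would pass each topic's top-word list together with something like `[doc.split() for doc in preprocessed_documents_1]` (assuming the preprocessed documents are whitespace-joined strings) and compare the average score per model. gensim's `CoherenceModel` offers related, more standard coherence measures.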

📝❓ Do the two models generate similar topics? Can you discover the same temporal trends (if there are any)? Discuss in 5-6 sentences.
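
To make "similar topics" concrete, the top-word lists of two runs can be matched by set overlap. The following is a minimal sketch using Jaccard similarity; the helper names `jaccard` and `best_matches` are introduced here for illustration, and the example lists are copied from the CTM outputs printed above (Preprocess 1, 5 and 10 topics). The same functions would apply to LDA top-word lists.

```python
def jaccard(words_a, words_b):
    """Jaccard overlap between two top-word lists (1.0 = identical sets)."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b) if (a or b) else 0.0


def best_matches(topics_a, topics_b):
    """For each topic in topics_a, find the most similar topic in topics_b."""
    matches = []
    for i, words_a in enumerate(topics_a):
        scores = [jaccard(words_a, words_b) for words_b in topics_b]
        j = max(range(len(scores)), key=scores.__getitem__)
        matches.append((i, j, scores[j]))
    return matches


# Top words copied from the CTM outputs above (Preprocess 1).
ctm_5_topic_4 = ["networks", "wireless", "sensor", "efficient", "energy",
                 "system", "cloud", "algorithm", "allocation", "computing"]
ctm_10_topic_4 = ["image", "segmentation", "feature", "images", "recognition",
                  "attention", "video", "detection", "object", "super"]
ctm_10_topic_5 = ["wireless", "networks", "energy", "sensor", "efficient",
                  "computing", "edge", "power", "communications", "iot"]

# Each tuple is (index in first list, index of best match, Jaccard score).
print(best_matches([ctm_5_topic_4], [ctm_10_topic_4, ctm_10_topic_5]))
```

Word overlap is only meaningful between runs that share a vocabulary, so stemmed (Preprocess 2) and unstemmed (Preprocess 1) topics should be compared separately.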

📝❓ Can you suggest an alternative model apart from paraphrase-mpnet-base-v2? What could be some of the possible advantages and disadvantages of using an alternative model? Hint: Look at some of the models [here](https://huggingface.co/spaces/mteb/leaderboard). Note: You do not need to execute the code for an alternative model.
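
One practical detail when swapping the embedding model: the `contextual_size` passed to `CombinedTM` has to match the embedding dimension of the chosen model. The sketch below illustrates this with a small lookup table; `combined_tm_kwargs` is a hypothetical helper introduced here, and the listed checkpoints are examples rather than recommendations.

```python
# Output dimensions of a few sentence-transformers checkpoints.
# For other models, the dimension can be read from the loaded model, e.g. via
# SentenceTransformer(name).get_sentence_embedding_dimension().
EMBEDDING_DIMS = {
    "paraphrase-mpnet-base-v2": 768,  # the model named in the question
    "all-mpnet-base-v2": 768,
    "all-MiniLM-L6-v2": 384,          # smaller model, lower-dimensional output
}


def combined_tm_kwargs(model_name, vocab_size, n_topics, num_epochs=10):
    """Hypothetical helper: keyword arguments for CombinedTM for a given model."""
    return {
        "bow_size": vocab_size,
        "contextual_size": EMBEDDING_DIMS[model_name],
        "n_components": n_topics,
        "num_epochs": num_epochs,
    }


# vocab_size=2000 is a placeholder; in the notebook it would be len(tp.vocab).
print(combined_tm_kwargs("all-MiniLM-L6-v2", vocab_size=2000, n_topics=5))
```

The resulting dictionary could then be unpacked as `CombinedTM(**kwargs)`, with the same model name passed to `TopicModelDataPreparation`.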

## Lab Report

# **LDA Analysis**

---

## **Before 1990**

| Preprocessing   | Number of Topics | Topic Index | Topic Name                           |
|------------------|------------------|-------------|---------------------------------------|
| Preprocessing 1 | 5                | 0           | Distributed Systems and Retrieval    |
| Preprocessing 1 | 5                | 1           | Pattern Recognition and Analysis     |
| Preprocessing 1 | 5                | 2           | Programming Languages and Structures |
| Preprocessing 1 | 5                | 3           | Software Development and Data Systems |
| Preprocessing 1 | 5                | 4           | Algorithms and Optimization          |
| Preprocessing 1 | 10               | 0           | Control Systems                      |
| Preprocessing 1 | 10               | 1           | Distributed Networks                 |
| Preprocessing 1 | 10               | 2           | Programming Frameworks               |
| Preprocessing 1 | 10               | 3           | Data Structures and Algorithms       |
| Preprocessing 1 | 10               | 4           | Optimization Methods                 |
| Preprocessing 1 | 10               | 5           | Graph Theory                         |
| Preprocessing 1 | 10               | 6           | Programming Languages                |
| Preprocessing 1 | 10               | 7           | Software Systems                     |
| Preprocessing 1 | 10               | 8           | Computational Models                 |
| Preprocessing 1 | 10               | 9           | Pattern Analysis                     |
| Preprocessing 2 | 5                | 0           | Graph Theory                         |
| Preprocessing 2 | 5                | 1           | Computing Performance                |
| Preprocessing 2 | 5                | 2           | System Optimization                  |
| Preprocessing 2 | 5                | 3           | Software Engineering                 |
| Preprocessing 2 | 5                | 4           | Computational Methods                |
| Preprocessing 2 | 10               | 0           | Advanced Graphs                      |
| Preprocessing 2 | 10               | 1           | Parallel Systems                     |
| Preprocessing 2 | 10               | 2           | Algorithm Optimization               |
| Preprocessing 2 | 10               | 3           | Pattern Recognition                  |
| Preprocessing 2 | 10               | 4           | Language Models                      |
| Preprocessing 2 | 10               | 5           | Software Techniques                  |
| Preprocessing 2 | 10               | 6           | Distributed Systems                  |
| Preprocessing 2 | 10               | 7           | Programming Paradigms                |
| Preprocessing 2 | 10               | 8           | Information Retrieval                |
| Preprocessing 2 | 10               | 9           | Mathematical Approaches              |

---

## **1990-2009**

| Preprocessing   | Number of Topics | Topic Index | Topic Name                           |
|------------------|------------------|-------------|---------------------------------------|
| Preprocessing 1 | 5                | 0           | Computational Models                 |
| Preprocessing 1 | 5                | 1           | Software Systems                     |
| Preprocessing 1 | 5                | 2           | Optimization Problems                |
| Preprocessing 1 | 5                | 3           | Networks and Web Technologies        |
| Preprocessing 1 | 5                | 4           | Neural Control Systems               |
| Preprocessing 1 | 10               | 0           | High-Performance Computing           |
| Preprocessing 1 | 10               | 1           | Data Processing Techniques           |
| Preprocessing 1 | 10               | 2           | Robotics and Automation              |
| Preprocessing 1 | 10               | 3           | Software Design                      |
| Preprocessing 1 | 10               | 4           | Control Theory                       |
| Preprocessing 1 | 10               | 5           | Computational Mathematics            |
| Preprocessing 1 | 10               | 6           | Image Processing                     |
| Preprocessing 1 | 10               | 7           | Database Systems                     |
| Preprocessing 1 | 10               | 8           | Wireless Technologies                |
| Preprocessing 1 | 10               | 9           | Parallel Computing                   |
| Preprocessing 2 | 5                | 0           | Knowledge Systems                    |
| Preprocessing 2 | 5                | 1           | Image Recognition                    |
| Preprocessing 2 | 5                | 2           | Control and Design                   |
| Preprocessing 2 | 5                | 3           | Graph Theory and Algorithms          |
| Preprocessing 2 | 5                | 4           | Software Tools                       |
| Preprocessing 2 | 10               | 0           | Computational Equations              |
| Preprocessing 2 | 10               | 1           | Visual Recognition                   |
| Preprocessing 2 | 10               | 2           | Adaptive Networks                    |
| Preprocessing 2 | 10               | 3           | Optimization Problems                |
| Preprocessing 2 | 10               | 4           | Robotic Design                       |
| Preprocessing 2 | 10               | 5           | High-Performance Models              |
| Preprocessing 2 | 10               | 6           | Signal Processing                    |
| Preprocessing 2 | 10               | 7           | Real-Time Systems                    |
| Preprocessing 2 | 10               | 8           | Database Development                 |
| Preprocessing 2 | 10               | 9           | Advanced Graph Analysis              |

---

## **2010 Onwards**

| Preprocessing   | Number of Topics | Topic Index | Topic Name                           |
|------------------|------------------|-------------|---------------------------------------|
| Preprocessing 1 | 5                | 0           | Deep Learning and Image Processing   |
| Preprocessing 1 | 5                | 1           | Large-Scale Data Analysis            |
| Preprocessing 1 | 5                | 2           | Wireless Energy Systems              |
| Preprocessing 1 | 5                | 3           | Machine Learning Applications        |
| Preprocessing 1 | 5                | 4           | Control and Nonlinear Systems        |
| Preprocessing 1 | 10               | 0           | Social Networks                      |
| Preprocessing 1 | 10               | 1           | Efficient Algorithms                 |
| Preprocessing 1 | 10               | 2           | Neural Architectures                 |
| Preprocessing 1 | 10               | 3           | Machine Learning Models              |
| Preprocessing 1 | 10               | 4           | System Control                       |
| Preprocessing 1 | 10               | 5           | Privacy and Security                 |
| Preprocessing 1 | 10               | 6           | Human-Robot Interaction              |
| Preprocessing 1 | 10               | 7           | Cloud Technologies                   |
| Preprocessing 1 | 10               | 8           | Mathematical Solutions               |
| Preprocessing 1 | 10               | 9           | Risk Management                      |
| Preprocessing 2 | 5                | 0           | Image and Feature Learning           |
| Preprocessing 2 | 5                | 1           | Advanced Robotics                    |
| Preprocessing 2 | 5                | 2           | Computational Efficiency             |
| Preprocessing 2 | 5                | 3           | IoT and Sensors                      |
| Preprocessing 2 | 5                | 4           | Communication Networks               |
| Preprocessing 2 | 10               | 0           | Advanced Visual Models               |
| Preprocessing 2 | 10               | 1           | Dynamic Systems                      |
| Preprocessing 2 | 10               | 2           | Energy Optimization                  |
| Preprocessing 2 | 10               | 3           | Data Monitoring                      |
| Preprocessing 2 | 10               | 4           | Wireless Communication               |
| Preprocessing 2 | 10               | 5           | Social Research                      |
| Preprocessing 2 | 10               | 6           | Predictive Analytics                 |
| Preprocessing 2 | 10               | 7           | Security Frameworks                  |
| Preprocessing 2 | 10               | 8           | Numerical Methods                    |
| Preprocessing 2 | 10               | 9           | Advanced Computational Graphs        |

---


### **Discussion**
Do the topics make sense? Are they coherent?

Yes, the topics largely make sense and appear coherent. For instance, foundational topics such as distributed systems and programming languages are prevalent before 1990. The period from 1990 to 2009 emphasizes applied systems, including networks and optimization problems. Post-2010, modern technologies like deep learning, IoT, and social network analysis dominate. These trends align well with the historical evolution of research in computing and technology.
