# Topic Modeling of Machine Learning Research Papers with BERTopic

This notebook demonstrates how to perform topic modeling on a dataset of Machine Learning research papers from arXiv using the BERTopic library. It covers data loading, model training (or loading a pre-trained model), topic visualization, and analysis.

## 1. Installation of Libraries


Overview of installed libraries:
*   **`bertopic`**: The core library for BERTopic-based topic modeling.
*   **`sentence-transformers`**: Used for generating sentence embeddings, which are crucial for BERTopic's understanding of semantic meaning.
*   **`scikit-learn`**: Provides machine learning tools, including the `CountVectorizer` used here for text preprocessing.
*   **`umap-learn`**: Implements UMAP (Uniform Manifold Approximation and Projection), the dimensionality reduction technique BERTopic applies to the embeddings before clustering and for its 2D visualizations.
*   **`pandas`**: Used for data manipulation and working with DataFrames.
*   **`torch`**: PyTorch, a deep learning framework; used here to check for available hardware acceleration.
*   **`kagglehub`**: Used to fetch data from Kaggle.

In [None]:
%pip install bertopic sentence-transformers pandas torch scikit-learn kagglehub octis

## 2. Importing Libraries and Reading the Dataset

After importing the required libraries, we combine each paper's title and abstract into a single text column for preprocessing. This combined text serves as input for our embedding model. 

We preserve the complete text, including stop words, since transformer-based embedding models require full contextual information to generate accurate embeddings (Grootendorst, 2024). [As recommended by BERTopic's developers](https://maartengr.github.io/BERTopic/faq.html#how-do-i-remove-stop-words), any additional preprocessing steps are performed after generating the embeddings.



In [107]:
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP
import pandas as pd
import torch
import kagglehub
import numpy as np

dataset = "/kaggle/input/machine-learning-arxiv-papers-122022-122024/arxiv_papers.csv"
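try:
    df = pd.read_csv(dataset)
except FileNotFoundError:
    # Not running on Kaggle: download the CSV via kagglehub, which returns the local file path
    dataset = kagglehub.dataset_download('student344/machine-learning-arxiv-papers-122022-122024', path="arxiv_papers.csv")
    df = pd.read_csv(dataset)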
    
df["text"] = df["title"] + " " + df["summary"]
docs = df["text"].tolist()
print("The dataset has been processed successfully.")

The dataset has been processed successfully.


## 3. Loading the Embeddings

Sentence embeddings are numerical representations of text that capture semantic meaning. This section loads precomputed embeddings from the current environment or, if they are not available locally, downloads them from Kaggle. To recreate the embeddings from scratch with the embedding model instead, set `load_embeddings_from_storage` to `False`.

In [106]:
embeddings = None
found_model = False  # becomes True once precomputed embeddings have been loaded
load_embeddings_from_storage = True  # set to False to recreate the embeddings

if load_embeddings_from_storage:
    print("Attempting to load embeddings...")
    try:
        embeddings = np.load("/kaggle/input/machine-learning-arxiv-papers-122022-122024/embeddings.npy")
    except FileNotFoundError:
        # Not running on Kaggle: download the file, then load the array from the returned path
        embeddings_path = kagglehub.dataset_download('student344/machine-learning-arxiv-papers-122022-122024', path="embeddings.npy")
        embeddings = np.load(embeddings_path)
    found_model = True
    print("Embeddings loaded successfully.")

### 3.1 Setting the Embedding Model

*   **Embedding Model:** We use the "nomic-ai/nomic-embed-text-v1.5" model from Sentence Transformers. 
*   **Encoding:** `embedding_model.encode()` generates embeddings for `docs` (the list of paper texts). The `prompt` argument is set to "clustering", which corresponds to one of the embedding tasks supported by the nomic model. Embeddings are only computed here if they were not already loaded in the previous step.
*   **Device Usage:** `device=device` runs the model on the best available hardware: a CUDA GPU, Apple's MPS backend, or the CPU as a fallback.

In [230]:
device = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'
embedding_model_name = "nomic-ai/nomic-embed-text-v1.5"
embedding_model = SentenceTransformer(embedding_model_name, trust_remote_code=True, device=device)

if not found_model:
    print("Creating embeddings...")
    embeddings = embedding_model.encode(docs, prompt="clustering", show_progress_bar=True)
    np.save("embeddings.npy", embeddings)


### 3.2 Creating a Topic List for Zero-Shot Topic Modeling

We define a `topic_list`, which is a Python list containing strings that represent potential topics within the machine learning domain. These topics are used for zero-shot topic classification, where the model tries to assign documents to these pre-defined topics. Topics that do not fit into the topic list are automatically clustered as new topics by the model.

The code initializes a BERTopic model with the following parameters:

- `CountVectorizer` removes English stop words from the topic representations. This happens after the embeddings are generated, so the embedding model still sees the full text.
- The previously defined embedding model is used.
- Zero-shot topic modeling with a minimum similarity threshold of 0.56.
- No upper limit on the number of topics is set (`nr_topics` is left at its default).
- N-gram range of (1, 2) to capture single words and bigrams (pairs of words such as "Computer Vision" or "Reinforcement Learning").
- Probability calculation is left disabled (the default) for efficiency.
- Verbose mode is enabled for training progress updates.
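
To make the similarity threshold more concrete, here is a minimal NumPy sketch of the general idea: a document is assigned to its best-matching predefined topic when the cosine similarity reaches the threshold, and is otherwise left for regular clustering (marked `-1` here). The vectors are made-up toy values and `assign_zeroshot` is a hypothetical helper for illustration, not BERTopic's internal implementation.

```python
import numpy as np

def assign_zeroshot(doc_embeddings, topic_embeddings, min_similarity):
    """Return the best-matching topic index per document, or -1 if below the threshold."""
    # Normalize rows so that the dot product equals the cosine similarity
    docs_n = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    topics_n = topic_embeddings / np.linalg.norm(topic_embeddings, axis=1, keepdims=True)
    sims = docs_n @ topics_n.T              # shape: (n_docs, n_topics)
    best = sims.argmax(axis=1)              # index of the most similar topic
    best_sim = sims.max(axis=1)             # similarity to that topic
    return np.where(best_sim >= min_similarity, best, -1)

# Toy 2D "embeddings" (assumed values, for illustration only)
toy_topics = np.array([[1.0, 0.0], [0.0, 1.0]])
toy_docs = np.array([[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]])

assignments = assign_zeroshot(toy_docs, toy_topics, min_similarity=0.8)
print(assignments)  # the third document matches neither topic closely enough
```

Raising the threshold sends more documents to the clustering step, while lowering it forces more documents into the predefined topics.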



In [245]:
topic_list = [
    # Core ML Paradigms
    "Supervised Learning",
    "Unsupervised Learning", 
    "Reinforcement Learning",
    "Transfer Learning",
    
    # Key Application Areas
    "Text Classification & Clustering",
    "Computer Vision",
    "Translation & Transcription",
    "Image Segmentation",
    "Weather & Climate Prediction",
    "Clinical/Medical Data Analysis",
    "Robotics",
    "Recommendation Systems",
    "Graph Neural Networks (GNNs)",
    "Time Series Analysis & Forecasting",
    
    # Model Development
    "Meta Learning",
    "Optimization Methods",
    "Model Compression & Quantization",
    "Distributed Learning",
    "Transformer Architecture",
    "Tokenization",
    "Retrieval-Augmented Generation (RAG)",
    "Synthetic Data",
    "In-Context Learning (ICL)",
    "Fine-tuning",

    # Generation and Synthesis
    # "Generative AI",
    "Large Language Models (LLMs)",
    "Multimodality",
    "Diffusion",
    "Image & Video Generation",
    "Audio, Speech & Music Synthesis",
    "Program Synthesis & Code Generation",
    "3D Mesh Generation & Gaussian Splatting",
    
    # Responsible AI
    "AI Safety & Alignment",
    "Explainability",
    "Interpretability",
    "Privacy & Security",
    "Fairness in AI",
    
    # Emerging Technologies
    "Reasoning",
    "Autonomous Systems",
    "Agents",
    "Quantum Computing",
    "Adversarial Learning"
]

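# English stop words are removed only from the topic representations, not from the embedded text.
# The n-gram range is set on the vectorizer itself, because BERTopic ignores its own
# n_gram_range argument when a custom vectorizer_model is passed.
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 2))
print("Creating new model...")
topic_model = BERTopic(
    verbose=True,
    embedding_model=embedding_model,
    zeroshot_topic_list=topic_list,
    vectorizer_model=vectorizer_model,
    n_gram_range=(1, 2),
    zeroshot_min_similarity=.56,
)

# fit() returns the fitted model itself, so `topics` is an alias of `topic_model`
topics = topic_model.fit(docs, embeddings)
print("The topic model has been created.")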


2024-12-30 15:14:03,457 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm


Creating new model...


2024-12-30 15:14:08,769 - BERTopic - Dimensionality - Completed ✓
2024-12-30 15:14:08,770 - BERTopic - Zeroshot Step 1 - Finding documents that could be assigned to either one of the zero-shot topics


Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2024-12-30 15:14:09,031 - BERTopic - Zeroshot Step 1 - Completed ✓
2024-12-30 15:14:11,607 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-12-30 15:14:11,661 - BERTopic - Cluster - Completed ✓
2024-12-30 15:14:11,662 - BERTopic - Zeroshot Step 2 - Combining topics from zero-shot topic modeling with topics from clustering...
2024-12-30 15:14:11,700 - BERTopic - Zeroshot Step 2 - Completed ✓
2024-12-30 15:14:11,701 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-12-30 15:14:13,395 - BERTopic - Representation - Completed ✓


The topic model has been created.


## 4. Exploring and Visualizing the Results

### 4.1 Top Topics Table

The table shows the top 30 most frequent topics, with the outlier topic (Topic -1) filtered out.

In [199]:
topic_info = topics.get_topic_info()
# topic_info.set_index("Topic", inplace=True)
topic_info[topic_info["Topic"] != -1].sort_values("Count", ascending=False).head(30)



| | Topic | Count | Name | CustomName | Representation | Representative_Docs |
|---|---|---|---|---|---|---|
| 13 | 12 | 2317 | Graph Neural Networks (GNNs) | Graph Neural Networks (GNNs) | [neural, graph, networks, network, deep, graph... | [A Multi-Fidelity Graph U-Net Model for Accele... |
| 3 | 2 | 1076 | Reinforcement Learning | Reinforcement Learning | [reinforcement, rl, policy, learning, reward, ... | [RL$^3$: Boosting Meta Reinforcement Learning ... |
| 25 | 24 | 873 | Large Language Models (LLMs) | Large Language Models (LLMs) | [llms, language, llm, large, models, model, fi... | [Evaluating Large Language Models for Health-R... |
| 1 | 0 | 725 | Supervised Learning | Supervised Learning | [learning, supervised, data, training, ssl, la... | [Multi-Label Contrastive Learning : A Comprehe... |
| 5 | 4 | 543 | Text Classification & Clustering | Text Classification & Clustering | [clustering, data, cluster, classification, te... | [Effectiveness of Deep Image Embedding Cluster... |
| 14 | 13 | 507 | Time Series Analysis & Forecasting | Time Series Analysis & Forecasting | [series, time, forecasting, temporal, data, pr... | [MSGNet: Learning Multi-Scale Inter-Series Cor... |
| 16 | 15 | 439 | Optimization Methods | Optimization Methods | [optimization, gradient, convex, algorithm, pr... | [Smooth Tchebycheff Scalarization for Multi-Ob... |
| 21 | 20 | 356 | Retrieval-Augmented Generation (RAG) | Retrieval-Augmented Generation (RAG) | [retrieval, rag, generation, data, image, mode... | [Know Your RAG: Dataset Taxonomy and Generatio... |
| 27 | 26 | 353 | Diffusion | Diffusion | [diffusion, models, image, generative, denoisi... | [Image Embedding for Denoising Generative Mode... |
| 17 | 16 | 303 | Model Compression & Quantization | Model Compression & Quantization | [quantization, compression, model, models, qua... | [MixQuant: Mixed Precision Quantization with a... |


### 4.2 Intertopic Distance Map

Before visualizing the topics, we change the displayed topic labels to show their actual topic names.

Our first visualization shows the relationships between topics in a 2D space. Topics that are closer together are semantically more similar. 

The map allows for interactive exploration. Upon hovering over the circles, the topic names and sizes are shown. Any area can be selected to zoom in for closer inspection. The size of each circle corresponds to the topic's prevalence in the dataset, making it easy to identify dominant themes.

This visualization provides a clear and intuitive clustering of topics. For instance, Topic 24, labeled "Large Language Models (LLMs)," is positioned near Topic 20, labeled "Retrieval-Augmented Generation (RAG)." This proximity aligns with expectations, as RAG is a commonly used technique for enhancing LLMs with external knowledge from documents.



In [200]:
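# Use the topic names as display labels; get_topic_info() is ordered by topic ID (starting at -1),
# which is the order set_topic_labels() expects for a list
topics.set_topic_labels(topic_info["Name"].tolist())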
topics.visualize_topics(custom_labels=True)

### 4.3 Topic Word Scores Bar Chart

This bar chart highlights the top words associated with each topic identified by the BERTopic model. Each topic is represented by its most representative terms, ranked by their c-TF-IDF scores, and the length of each bar corresponds to the importance of that word in defining the topic. Note how the Reinforcement Learning topic is not just represented by 'reinforcement' and 'learning,' but also by related concepts such as 'reward' (the feedback signal that the algorithm seeks to maximize over time) and 'policy' (the strategy or mapping from states to actions that the algorithm learns).
This illustrates how clustering on the underlying embeddings groups semantically related terms and concepts into the same topic.
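
BERTopic's default word scores come from class-based TF-IDF (c-TF-IDF), which weights a term's frequency within a topic by how rare the term is across all topics. The sketch below reproduces the general formula, `tf * log(1 + A / f)`, on made-up counts; `c_tf_idf`, the vocabulary, and the numbers are illustrative assumptions rather than values from this notebook.

```python
import numpy as np

def c_tf_idf(counts):
    """counts: array of shape (n_topics, n_terms) with raw term counts per topic."""
    counts = np.asarray(counts, dtype=float)
    tf = counts / counts.sum(axis=1, keepdims=True)   # term frequency within each topic
    avg_words = counts.sum() / counts.shape[0]        # A: average number of words per topic
    term_freq = counts.sum(axis=0)                    # f: frequency of each term across all topics
    idf = np.log(1 + avg_words / term_freq)
    return tf * idf

vocab = ["learning", "reward", "policy", "graph"]
toy_counts = [
    [10, 8, 6, 0],   # toy "Reinforcement Learning" topic
    [10, 0, 0, 9],   # toy "Graph Neural Networks" topic
]

scores = c_tf_idf(toy_counts)
# Rank the terms of the first topic from highest to lowest score
top_terms = [vocab[i] for i in np.argsort(scores[0])[::-1][:2]]
print(top_terms)
```

In this toy example, "learning" is the most frequent word in the first topic, yet it is outranked by "reward" and "policy" because it is equally common in the other topic.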

In [201]:
topics.visualize_barchart(custom_labels=True, width=320)

### 4.4 Topic Similarity Heatmap

The similarity matrix heatmap offers a good way to inspect the relationships between pairs of topics. Each row and column corresponds to a particular topic, and the color of each cell reflects the degree of semantic similarity between those two topics. Darker cells along the diagonal indicate higher self-similarity (a topic compared to itself), while off-diagonal cells reveal how related (or unrelated) different topics are.

From the heatmap, you can see which topics tend to cluster together. Topics that share conceptual ground, such as “Supervised Learning” and “Unsupervised Learning”, appear in regions of higher similarity, suggesting that the language used to describe them overlaps significantly. Conversely, less closely related topics (e.g., “Quantum Computing” vs. “Audio, Speech & Music Synthesis”) have lower similarity scores, appearing in lighter-colored cells.
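
As a rough illustration of what such a heatmap encodes, the following sketch computes a pairwise cosine similarity matrix for three made-up topic vectors. It assumes that topic similarity is measured as cosine similarity between topic embeddings; the vectors and the helper name `cosine_similarity_matrix` are hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarities between the rows of `vectors`."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T

toy_topic_vectors = np.array([
    [1.0, 0.2, 0.0],   # hypothetical topic A
    [0.9, 0.3, 0.1],   # hypothetical topic B, close to A
    [0.0, 0.1, 1.0],   # hypothetical topic C, far from both
])

sim = cosine_similarity_matrix(toy_topic_vectors)
print(np.round(sim, 2))
```

The matrix is symmetric with ones on the diagonal, and the entry for topics A and B is much larger than the entries involving topic C.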

In [225]:
topics.visualize_heatmap(custom_labels=True, top_n_topics=25)

## 5. Document-Level Visualizations

### 5.1 Visualize Documents with Hoverable Titles

The following visualization is a scatter plot where each point represents a document. 
The documents are colored by their assigned topic. Hovering over a point shows the document's title.

In [192]:
top_topics = topic_info[topic_info["Topic"] != -1].sort_values("Count", ascending=False).head(15)["Topic"].to_list()

topics.visualize_documents(df["title"], 
                           title="Documents and Topics",
                           embeddings=embeddings,
                           custom_labels=True, 
                           hide_annotations=True, 
                           topics=top_topics)

### 5.2 Documents with Labeled Topics

This scatter plot visualizes the distribution of documents and their assigned topics in a two-dimensional space. Each point represents a document, and points are colored according to their topic. Labels indicate the general area of the plot where a particular topic is most prominent, showing how the BERTopic model clusters semantically similar documents together.

For example, the "Quantum Computing" topic is well-separated from other topics, suggesting its distinct semantic space, while "Reinforcement Learning" forms its own cohesive cluster. More general topics, such as "Supervised Learning" and "Optimization Methods," are positioned closer to other topics, reflecting overlapping concepts or shared terminology.

The clustering and separation of points indicate the effectiveness of the topic modeling process, with clear groupings suggesting coherent topic definitions.



In [194]:
topics.visualize_documents(docs, 
                           title="Documents and Topics",
                           embeddings=embeddings, 
                           hide_document_hover=True, 
                           custom_labels=True, 
                           topics=top_topics)

## 6. Topic Search

Below, we put our topic modeling to use with a search engine for our data, allowing for filtering by topic.

In [228]:
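def search_papers(query, topic_model, df, embeddings, embedding_model, top_k=5, topic_filter=None):
    # Embed the query with the same prompt that was used for the documents
    query_embedding = embedding_model.encode([query], prompt="clustering", show_progress_bar=False)[0]
    # Cosine similarity between the query and every document embedding
    similarities = np.dot(embeddings, query_embedding) / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_embedding)
    )
    
    doc_topics = topic_model.topics_
    
    if topic_filter is not None:
        # Exclude documents outside the selected topics; -inf keeps them at the bottom
        # of the ranking even when the remaining similarities are negative
        mask = np.isin(doc_topics, list(topic_filter))
        similarities = np.where(mask, similarities, -np.inf)
    
    top_indices = np.argsort(similarities)[-top_k:][::-1]
    # Drop excluded documents if fewer than top_k documents match the filter
    top_indices = [idx for idx in top_indices if np.isfinite(similarities[idx])]
    
    # Look up topic names from the model passed in, not from a global variable
    topic_info = topic_model.get_topic_info()
    
    results = []
    for idx in top_indices:
        topic_id = doc_topics[idx]
        topic_name = topic_info.loc[topic_info["Topic"] == topic_id, "Name"].iloc[0]
        
        results.append({
            'title': df['title'].iloc[idx],
            'summary': df['summary'].iloc[idx],
            'topic': topic_name,
            'topic_id': topic_id,
            'similarity': similarities[idx]
        })
    
    return results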
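
One subtlety when restricting ranked results to certain topics is how excluded documents are scored. The toy example below, with made-up similarities and topic assignments, compares multiplying the scores by a boolean mask with replacing excluded scores by `-inf`.

```python
import numpy as np

similarities = np.array([-0.20, -0.05, 0.30, -0.10])   # toy cosine similarities
doc_topics = [3, 7, 3, 3]                                # toy topic assignments
topic_filter = [7]                                       # keep only topic 7

mask = np.isin(doc_topics, topic_filter)

# Multiplying by the mask sets excluded documents to 0, which still outranks
# an included document whose similarity is negative.
multiplied = similarities * mask
top_by_multiplication = int(np.argsort(multiplied)[::-1][0])

# Replacing excluded scores with -inf keeps them at the bottom of the ranking.
masked = np.where(mask, similarities, -np.inf)
best = int(np.argmax(masked))

print(top_by_multiplication, best)
```

With `-inf` masking, the only document in topic 7 (index 1) is ranked first, whereas multiplication ranks an excluded document above it.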

In [229]:
import ipywidgets as widgets
from IPython.display import display, HTML, clear_output

def create_search_interface(topic_model, df, embeddings, embedding_model):
    # Create widgets
    search_box = widgets.Text(
        placeholder='Enter search query...',
        description='Search:',
        layout=widgets.Layout(width='50%')
    )
    
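    # Get available topics for the dropdown (excluding the outlier topic -1),
    # reading them from the model passed in rather than from a global variable
    topic_info = topic_model.get_topic_info()
    named_topics = topic_info[topic_info['Topic'] != -1]
    topic_options = list(zip(named_topics['Name'], named_topics['Topic']))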
    
    topic_dropdown = widgets.SelectMultiple(
        options=topic_options,
        description='Filter by topics:',
        layout=widgets.Layout(width='50%', height='200px')
    )
    
    results_output = widgets.Output()
    
    def on_search_clicked(b):
        with results_output:
            clear_output()
            query = search_box.value
            topic_filter = list(topic_dropdown.value) if topic_dropdown.value else None
            
            results = search_papers(query, topic_model, df, embeddings, 
                                  embedding_model, top_k=5, topic_filter=topic_filter)
            
            for i, result in enumerate(results, 1):
                html = f"""
                <div style="margin-bottom: 20px; padding: 10px; border: 1px solid #ddd;">
                    <h3>{i}. {result['title']}</h3>
                    <p><b>Topic:</b> {result['topic']}</p>
                    <p><b>Similarity:</b> {result['similarity']:.3f}</p>
                    <p><b>Summary:</b> {result['summary'][:200]}...</p>
                </div>
                """
                display(HTML(html))
    
    search_button = widgets.Button(description="Search")
    search_button.on_click(on_search_clicked)
    
    clear_button = widgets.Button(description="Clear")
    def on_clear_clicked(b):
        with results_output:
            clear_output()
            search_box.value = ''
            topic_dropdown.value = ()
    clear_button.on_click(on_clear_clicked)
    
    # Layout
    buttons = widgets.HBox([search_button, clear_button])
    interface = widgets.VBox([
        search_box,
        topic_dropdown,
        buttons,
        results_output
    ])
    
    display(interface)

# Use the interface
create_search_interface(topic_model, df, embeddings, embedding_model)

VBox(children=(Text(value='', description='Search:', layout=Layout(width='50%'), placeholder='Enter search que…

## 7. Evaluating the Topic Model

Below, we evaluate the model with standard topic modeling metrics: a diversity score and a coherence score. The C_V coherence score of **0.543681** suggests that our topics are reasonably interpretable, meaning the words within each topic tend to be semantically related. The ideal diversity score is close to 1, indicating that topics share few of their top words. Our score of **0.991419** suggests that the model captures distinct themes within our data.



In [261]:
from gensim.models.coherencemodel import CoherenceModel
import gensim.corpora as corpora
tokenized_docs = [doc.split() for doc in docs] 

# Create a dictionary representation of the documents
dictionary = corpora.Dictionary(tokenized_docs)

# Convert document into the bag-of-words (BoW) format
corpus = [dictionary.doc2bow(text) for text in tokenized_docs]

# Get the top words for each topic from BERTopic
topics_words = [
  [word for word, _ in topic_model.get_topic(topic_id)[:10]]
  for topic_id in range(len(set(topics.topics_)) - 1) # Exclude outlier topic -1
]

def calculate_topic_diversity(topic_words):
     """
     Calculates the topic diversity.

     Args:
         topic_words (list of list of str): A list of lists, where each inner list contains the top words for a topic.

     Returns:
         float: The average pairwise Jaccard distance between all topics.
     """
     num_topics = len(topic_words)
     total_distance = 0
     count = 0

     for i in range(num_topics):
         for j in range(i + 1, num_topics):
             set1 = set(topic_words[i])
             set2 = set(topic_words[j])
             intersection = len(set1.intersection(set2))
             union = len(set1.union(set2))
             jaccard_similarity = intersection / union if union > 0 else 0
             jaccard_distance = 1 - jaccard_similarity  # We want distance (opposite of similarity)
             total_distance += jaccard_distance
             count += 1

     return total_distance / count if count > 0 else 0

# Get the topic words (same as for coherence calculation)
topic_diversity_score = calculate_topic_diversity(topics_words)
print(f"Topic Diversity: {topic_diversity_score}")


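# Calculate C_V coherence
cm_cv = CoherenceModel(topics=topics_words, texts=tokenized_docs, dictionary=dictionary, coherence='c_v')
coherence_cv = cm_cv.get_coherence()

# Calculate NPMI coherence on the same topics and texts
cm_npmi = CoherenceModel(topics=topics_words, texts=tokenized_docs, dictionary=dictionary, coherence='c_npmi')
coherence_npmi = cm_npmi.get_coherence()

print(f"C_V Coherence: {coherence_cv}")
print(f"NPMI Coherence: {coherence_npmi}")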


Topic Diversity: 0.9914190771627477






C_V Coherence: 0.5436805076971135
NPMI Coherence: -0.002126678890616694


In [246]:
from sklearn.metrics import silhouette_score

# Get the topic labels for each document
doc_info = topics.get_document_info(docs)
labels = doc_info['Topic'].values  # The assigned topic for each document

# Calculate the Silhouette Score on your embeddings
sil_score = silhouette_score(embeddings, labels)
print("Silhouette Score:", sil_score)


Silhouette Score: -0.012392449
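
The silhouette coefficient compares, for each point, the mean distance to its own cluster (`a`) with the mean distance to the nearest other cluster (`b`), as `(b - a) / max(a, b)`. Values near 1 indicate well-separated clusters, values near 0 indicate overlapping clusters, and negative values indicate points that sit closer to another cluster than to their own. The following NumPy sketch uses made-up 2D points and a hypothetical `silhouette` helper with Euclidean distance; it is meant to illustrate the definition, not to replace `sklearn.metrics.silhouette_score`.

```python
import numpy as np

def silhouette(points, labels):
    """Mean silhouette coefficient with Euclidean distance (each cluster needs >= 2 points)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    scores = []
    for i in range(len(points)):
        same = labels == labels[i]
        same[i] = False                                  # exclude the point itself
        a = dists[i, same].mean()                        # mean distance within own cluster
        b = min(dists[i, labels == other].mean()         # mean distance to nearest other cluster
                for other in set(labels.tolist()) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

toy_points = [[0, 0], [0, 1], [5, 5], [5, 6]]
well_separated = silhouette(toy_points, [0, 0, 1, 1])   # labels follow the two groups
mixed = silhouette(toy_points, [0, 1, 0, 1])            # labels cut across the groups
print(well_separated, mixed)
```

A score close to zero, like the one above, is what this definition yields when topic assignments overlap substantially in the original embedding space.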




In [258]:
evaluation_results = pd.DataFrame({
    'Metric': ['C_V Coherence', 'NPMI Coherence', 'Topic Diversity'],
    'Score': [coherence_cv, coherence_npmi, topic_diversity_score]
})
evaluation_results

| | Metric | Score |
|---|---|---|
| 0 | C_V Coherence | 0.543681 |
| 1 | NPMI Coherence | -0.002127 |
| 2 | Topic Diversity | 0.991419 |
