<a href="https://www.kaggle.com/code/student344/arxiv-topic-modeling?scriptVersionId=247937525" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Machine Learning Research Topic Modeling with BERTopic

This notebook demonstrates how to perform topic modeling on a dataset of Machine Learning research papers from arXiv using the BERTopic library. It covers data loading, model training (or loading a pre-trained model), topic visualization, and analysis.

### NOTE
Sometimes the visualization outputs might be blank. This is likely an issue with rendering in the Kaggle environment, and is solved by simply running the code again. Also, to avoid rendering issues, make sure that your ad-blocking extension is disabled.  

## **1. Installation of Libraries**


Some of the dependencies we use don't need to be installed when the notebook is run on Kaggle, because they are included in every Kaggle environment.
Overview of dependencies:
*   **`bertopic`**: The core package for BERTopic-based topic modeling.
*   **`litellm`**: A package that simplifies LLM API calls by providing a single API client for any LLM provider (e.g. Gemini, Anthropic, OpenAI, AWS Bedrock, etc.)
*   **`octis`**: Used for topic coherence and topic diversity evaluation metrics.
*   **`sentence-transformers`**: Used for generating sentence embeddings, which are crucial for BERTopic's understanding of semantic meaning.
*   **`scikit-learn`**: Provides machine learning tools, including CountVectorizer used here for text preprocessing.
*   **`pandas`**: Used for data manipulation, data analysis, and working with DataFrames.
*   **`torch`**: PyTorch is a deep learning framework, and we are using it to check for GPU hardware acceleration availability.
*   **`kagglehub`**: Used to fetch data from Kaggle.

In [21]:
%pip install litellm bertopic scikit-learn kagglehub octis --quiet 


os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.



Note: you may need to restart the kernel to use updated packages.


## 2. Importing Libraries and Preparing the Dataset

In [22]:
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from kaggle_secrets import UserSecretsClient
import pandas as pd
import torch
import kagglehub
import numpy as np

dataset = "/kaggle/input/arxiv-ml-ai-052023-052025/arxiv_ml_ai_papers_last_2_years.csv"

try:
    df = pd.read_csv(dataset)
except:
    dataset = kagglehub.dataset_download('student344/arxiv-ml-ai-052023-052025', path="/kaggle/input/arxiv-ml-ai-052023-052025/arxiv_ml_ai_papers_last_2_years.csv")
    df = pd.read_csv(dataset)
    
df["text"] = df["title"] + " " + df["summary"]
print("The dataset has been parsed successfully.")

The dataset has been parsed successfully.


After importing the required libraries, we combine each paper's title and abstract into a single text column for preprocessing. This combined text serves as input for our embedding model. 

We preserve most of the text, including stop words, since transformer-based embedding models require full contextual information to generate accurate embeddings. Light preprocessing is performed to remove escape characters, LaTeX code, URLs, and other noise. [As recommended by BERTopic's developers](https://maartengr.github.io/BERTopic/faq.html#how-do-i-remove-stop-words), any additional preprocessing steps are performed *after* generating the embeddings.



In [23]:
import re

def preprocess(text: str) -> str:
    # Remove inline LaTeX math expressions: $...$
    text = re.sub(r'\$(.*?)\$', '', text)
    
    # Remove display math
    text = re.sub(r'\$\$(.*?)\$\$', '', text, flags=re.DOTALL)
    text = re.sub(r'\\(.*?)(.*?)\\', '', text, flags=re.DOTALL)

    # Remove common LaTeX commands (e.g., "\cite{}", "?????????\ref{}", etc.)
    text = re.sub(r'\\[a-zA-Z]+\{.*?\}', '', text)
    # Remove LaTeX escape sequences such as \\% or \\_
    text = re.sub(r'\\([%_&#$])', r'\1', text)

    # Remove multiple spaces and newlines
    text = re.sub(r'\s+', ' ', text)
    
    # Remove URLs (http/https)
    text = re.sub(r'http\S+|www\.\S+', '', text)

    # Strip leading/trailing whitespace
    text = text.strip()
    
    return text

docs = df['text'].apply(preprocess).to_list()
print('Data preprocessing complete.')

Data preprocessing complete.


## 3. Loading the Embeddings

Sentence embeddings are numerical representations of text that capture semantic meaning. This section handles loading embeddings from the current environment or downloading them from Kaggle. To replicate all steps to create the embeddings from the embedding model, we can set `load_embeddings_from_storage` to `False`. 

In [24]:
embeddings = None
load_embeddings_from_storage = False  # set to False to recreate the embeddings

def load_embeddings():
    try:
        print("Attempting to load embeddings from local storage...")
        return np.load("/kaggle/working/arxiv_gist_embeddings_small.npy")
    except Exception as e:
        print(f"Local load failed: {e}. Trying kagglehub download...")
        return kagglehub.dataset_download(
            'student344/machine-learning-arxiv-papers-122022-122024',
            path="arxiv_gist_embeddings.npy"
        )

if load_embeddings_from_storage:
    embeddings = load_embeddings()
    print("Embeddings loaded successfully.")
else:
    print("Skipping embedding load: will recreate embeddings.")


Skipping embedding load: will recreate embeddings.


### 3.1 Creating the Text Embeddings

*   **Embedding Model:** We use the "avsolatorio/GIST-Embedding-v0" model (via SentenceTransformers), one of the highest scoring semantic text embedding models of the <100m parameter range on the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard), particularly on clustering-related benchmarks. This model has a 33.4 million parameter size, which is lightweight enough to load quickly in a Kaggle or Colab runtime (with the GPU runtime enabled).
*   **Encoding:** The `embedding_model.encode()` function generates embeddings for the `docs` (the list of paper texts).
*   **Device Usage:** `device=device` ensures that the embedding generation uses the available hardware acceleration (GPU or CPU).

In [25]:
device = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'
embedding_model_name = "avsolatorio/GIST-small-Embedding-v0"

embedding_model = SentenceTransformer(embedding_model_name,
                                      trust_remote_code=True, 
                                      device=device,
                                     )
if not load_embeddings_from_storage:
    print("Creating embeddings...")
    embeddings = embedding_model.encode(docs, show_progress_bar=True)


Creating embeddings...


Batches:   0%|          | 0/379 [00:00<?, ?it/s]

## 4. Creating the Topic Model

We are now ready to initialize a BERTopic model. 

#### Parameters
- Now that the embeddings are generated, CountVectorizer is used to remove English stop words (words like "for", "and", " "to", etc.).
- The previously defined embedding model is used.
- N-gram range of (1,3) is used to capture single words, bi-grams (pairs of words such as "Computer Vision" and "Reinforcement Learning"), and tri-grams (terms like "Large Language Models" "Time Series Forecasting", etc.)
- Verbose mode enabled for training progress updates.
- We set a minimum topic size of 35, so that smaller topics with less than 35 document examples do not get clustered. This prevents noise in the results at the expense of not capturing the entire breadth of topics in the dataset. 


In [26]:
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 2))
print("Creating new model...")

topic_model = BERTopic(
    verbose=True,
    embedding_model=embedding_model,
    vectorizer_model=vectorizer_model,
    n_gram_range=(1, 3),
    min_topic_size=30,
)

topics, probs = topic_model.fit_transform(docs, embeddings)
print("The topic model has been created.")


2025-06-29 01:25:38,813 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm


Creating new model...


2025-06-29 01:25:43,584 - BERTopic - Dimensionality - Completed ✓
2025-06-29 01:25:43,585 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-06-29 01:25:44,088 - BERTopic - Cluster - Completed ✓
2025-06-29 01:25:44,095 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-06-29 01:25:50,782 - BERTopic - Representation - Completed ✓


The topic model has been created.


The following table shows the model output. We will explore this further in the sections below. 
- **Topic**: The topic ID. Note that the Topic ID "-1" represents the outliers (documents that were not clustered into any specific topic). 
- **Count**: The number of documents that were clustered into the topic.
- **Representation**: The list of the top words that represent the topic.
- **Representative_Docs**: A sample of representative documents for the topic.

In [27]:
topic_info = topic_model.get_topic_info()
topic_info.head(30)

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,4135,-1_models_data_model_learning,"[models, data, model, learning, based, perform...",[Task-Equivariant Graph Few-shot Learning Alth...
1,0,413,0_language_llms_models_language models,"[language, llms, models, language models, larg...",[LayerNorm: A key component in parameter-effic...
2,1,322,1_federated_fl_federated learning_clients,"[federated, fl, federated learning, clients, p...",[A Survey on Blockchain-Based Federated Learni...
3,2,311,2_driving_traffic_autonomous_trajectory,"[driving, traffic, autonomous, trajectory, veh...",[Action and Trajectory Planning for Urban Auto...
4,3,301,3_neural_networks_gradient_optimization,"[neural, networks, gradient, optimization, neu...",[Occam Gradient Descent Deep learning neural n...
5,4,289,4_rl_policy_reinforcement_reinforcement learning,"[rl, policy, reinforcement, reinforcement lear...",[Sharper Model-free Reinforcement Learning for...
6,5,258,5_visual_multimodal_image_vision,"[visual, multimodal, image, vision, language, ...",[CLIP meets DINO for Tuning Zero-Shot Classifi...
7,6,238,6_matrix_regression_learning_data,"[matrix, regression, learning, data, label, di...",[An Unbiased Risk Estimator for Partial Label ...
8,7,219,7_ai_human_ai systems_systems,"[ai, human, ai systems, systems, intelligence,...",[Human-AI collaboration is not very collaborat...
9,8,217,8_code_software_llms_code generation,"[code, software, llms, code generation, genera...",[SEED: Customize Large Language Models with Sa...


## 5. Evaluation Metric Scores 

Below, we use standard evaluation metrics for topic modeling. We use the OCTIS (Optimizing and Comparing Topic Models is Simple) library to calculate the NPMI topic coherence score and the topic diversity score. 
The coherence score calculation can take a few minutes, even when using multithreading.

The NPMI coherence score of **~0.22** suggests that our topics are reasonably interpretable, meaning the words within each topic tend to be semantically related. Our diversity score of **~0.73** suggests that our model is decently capturing different themes within our data, although an ideal score would be above 80. This is expected, however, since we have limited our topic size to a minimum of 35 representative documents (skipping many smaller topics), and our dataset is supposed to have common themes (topics related to machine learning).  

While it is good to have these evaluation metrics, determining the relevance of clustered topics is an inherently subjective process, so it is difficult and impractical to obtain a clear picture of the topic model's quality while only relying on purely quantitative metrics. 

In [28]:
topics_dict  = topic_model.get_topics()                 
topic_words  = [
    [w for w, _ in words[:10]]                 # top-10 words
    for tid, words in topics_dict.items()
    if tid != -1                               # skip outliers
]


In [29]:
vectorizer = topic_model.vectorizer_model # the CountVectorizer that c-TF-IDF used
analyzer   = vectorizer.build_analyzer()  # includes lowercase, n-grams, stop-words …

tokenized_docs = [analyzer(d) for d in docs]


In [30]:
from octis.evaluation_metrics.coherence_metrics import Coherence
from octis.evaluation_metrics.diversity_metrics import TopicDiversity

octis_format = {"topics": topic_words}

coh = Coherence(texts=tokenized_docs, topk=5, processes=4).score(octis_format)
print(f"c_npmi={coh:.3f}")


os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.



c_npmi=0.215


In [31]:
div = TopicDiversity(topk=10).score(octis_format)
print(f"diversity={div:.3f}")


diversity=0.738


## 6. Exploring and Visualizing the Results

### 6.1 Creating New Topic Labels and Summaries using Large Language Models

As seen in the table above, the names of our topics use a simple topic representation model: the topic ID number, followed by the top representative words, separated with underscores.
We can use a Large Language Model (LLM) to create a more descriptive label. Additionally, we can use it to generate a short summary for each topic. 

At the moment, Google offers an experimental version of Gemini Flash 2.0-Lite at only 0.0075 USD per million input tokens and 30 cents per million output tokens. Additionally, it offers 15 requests per minute (RPM) and 1500 free requests per day. It is distinguished for being small, fast, and capable of high quality outputs for non-reasoning tasks. Since the model has a large context window, we do not really need to batch our requests, but our code below processes our topics list in two halves, just in case (some models may have degraded performance when the included context is too large).

We use LiteLLM, a library that supports API calls to many LLM providers using a common interface. With this library, we can swap out our chosen model for a new one in the future without having to rewrite most of the code.

Note: You must set up your own Gemini API key. After creating your key through Google AI Studio, go to the "Add-ons" > "Secrets" > "Add Secret", then create a secret with the name "GEMINI_API_KEY" and paste your unique API token. 

In [32]:
import os, math, json, time
from litellm import completion
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
api_key = user_secrets.get_secret("GEMINI_API_KEY")

os.environ["GEMINI_API_KEY"] = api_key

def prompt_builder(df_chunk):
    # concise system instruction
    system_msg = """You are an ML researcher who names Machine Learning research topic clusters 
    generated by BERTopic. 
    For each topic you receive, return a JSON object with:
          id      : integer   (the topic ID)
          Label   : ≤ 6 words (concise title)
          Summary : ≤ 25 words (short description of the topic)
        Each topic cluster you receive includes its name and representation (top words and phrases found in the cluster).
        Respond with a JSON list only—no extra text.
    """.strip()

    # build a small TSV block the model can read
    rows = [
        f"\t{r.Name}\t{r.Representation}"
        for _, r in df_chunk.iterrows()
    ]
    user_msg = (
        "Columns: id, size, placeholder_label, top_keywords\n"
        "```text\n" + "\n".join(rows) + "\n```"
    )

    return [
        {"role": "system", "content": system_msg},
        {"role": "user",   "content": user_msg}
    ]

# split the DataFrame
midpoint = math.ceil(len(topic_info) / 2)

first_half  = topic_info.iloc[:midpoint]
second_half = topic_info.iloc[midpoint:]

# build prompts and call LLM
def call_gemini(df_part):
    messages = prompt_builder(df_part)      
    resp = completion(
        model="gemini/gemini-2.0-flash",
        messages=messages,
        response_format={"type": "json_object"},
        temperature=0.3,                         
        max_tokens=100000,                    
    )
    raw_json = resp.choices[0].message.content   
    return json.loads(raw_json)                  

results_1 = call_gemini(first_half)
results_2 = call_gemini(second_half)

# --- 3. stitch back together --------------------------------------------------
all_results = results_1 + results_2
all_results.sort(key=lambda d: d["id"])

[92m01:26:15 - LiteLLM:INFO[0m: utils.py:3173 - 
LiteLLM completion() model= gemini-2.0-flash; provider = gemini
[92m01:26:23 - LiteLLM:INFO[0m: utils.py:1234 - Wrapper: Completed Call, calling success_handler
[92m01:26:23 - LiteLLM:INFO[0m: utils.py:3173 - 
LiteLLM completion() model= gemini-2.0-flash; provider = gemini
[92m01:26:31 - LiteLLM:INFO[0m: utils.py:1234 - Wrapper: Completed Call, calling success_handler


The model was instructed to generate its output in structured JSON, which can be easily parsed into a dataframe and merged with the dataframe generated by `get_topic_info()`.

In [33]:
import json

label_df = pd.DataFrame(all_results)                
label_df.rename(columns={"id": "Topic"}, inplace=True)

# apply to BERTopic 
# merge so we keep original ordering and any extra columns
topic_info = topic_model.get_topic_info()
topic_info = topic_info.merge(label_df, on="Topic", how="left")

# use the “label” column as custom topic names
custom_labels = topic_info.set_index("Topic")["Label"].to_dict()
topic_model.set_topic_labels(custom_labels)

# store the summaries for later display
topic_summaries = topic_info.set_index("Topic")["Summary"].to_dict()
print('Topic labels and summaries have been added to the topic model.')

Topic labels and summaries have been added to the topic model.


### 6.2 Top Topics Table

The table below shows the top 30 most frequent topics, including the new summaries and labels, with the outliers topic (Topic -1) at the top. You can change the value of `head(30)` to see more or less of the top topics. You can also change `head(30)` to `tail(30)` to see the bottom 30 topics as well. 

In [34]:
topic_info[["Topic", "Count", "Name", "Label", "Summary"]].head(30)

Unnamed: 0,Topic,Count,Name,Label,Summary
0,-1,4135,-1_models_data_model_learning,General Machine Learning,"General machine learning methods, models, data..."
1,0,413,0_language_llms_models_language models,Large Language Models (LLMs),"Training, fine-tuning, and applications of lar..."
2,1,322,1_federated_fl_federated learning_clients,Federated Learning for Privacy,Federated learning (FL) for privacy-preserving...
3,2,311,2_driving_traffic_autonomous_trajectory,Autonomous Driving and Trajectory Prediction,"Autonomous driving, traffic management, and tr..."
4,3,301,3_neural_networks_gradient_optimization,Neural Network Optimization,"Optimization methods, including gradient desce..."
5,4,289,4_rl_policy_reinforcement_reinforcement learning,Reinforcement Learning and Policy Optimization,"Reinforcement learning (RL) algorithms, policy..."
6,5,258,5_visual_multimodal_image_vision,Visual-Language Models,Multimodal models combining visual and languag...
7,6,238,6_matrix_regression_learning_data,Matrix Regression and Generalization,"Matrix regression, learning bounds, and genera..."
8,7,219,7_ai_human_ai systems_systems,Human-AI Interaction,"Research on AI systems, human-AI interaction, ..."
9,8,217,8_code_software_llms_code generation,Code Generation with LLMs,Using large language models (LLMs) for code ge...


### 6.3 Intertopic Distance Map

Our first visualization shows the relationships between topics in a 2D space. Topics that are closer together are semantically more similar. Since we set the labels generated by the LLM as our custom topic labels, they can now be used in the visualizations.


The map allows for interactive exploration. Upon hovering over the circles, the topic names and sizes are shown. Any area can be selected to zoom in for closer inspection. The size of each circle corresponds to the topic's prevalence in the dataset, making it easy to identify dominant themes.

This visualization provides a clear and intuitive overview of topic relationships, and allows us to see which topics have enough overlap to be merged if we want to trim down our number of topics even further. To see the topic labels, sizes and IDs, simply hover over each circle with the mouse. The slider can be used to highlight a specific topic.



In [35]:
fig = topic_model.visualize_topics(custom_labels=True)
fig.show()

### 6.4 Topic Word Scores Bar Chart

This bar chart visualization highlights the top words associated with each topic identified by the BERTopic model. The topics are represented by their most representative terms, ranked by relevance scores. The length of the bars corresponds to the importance of each word in defining the topic. Note how the Reinforcement Learning topic is not just represented by 'reinforcement' and 'learning,' but also by related concepts such as 'reward' (the feedback signal that the algorithm seeks to maximize over time) and 'policy' (the strategy or mapping from states to actions that the algorithm learns).
This demonstrates how the BERTopic model’s underlying embeddings effectively capture the semantic relationships between terms and concepts

In [36]:
fig = topic_model.visualize_barchart(custom_labels=True, height=300, width=385)
fig.show()

### 6.5 Topic Similarity Heatmap

The similarity matrix heatmap offers a good way to inspect the relationships between pairs of topics. Each row and column corresponds to a particular topic, and the color of each cell reflects the degree of semantic similarity between those two topics. Darker cells along the diagonal indicate higher self-similarity (a topic compared to itself), while off-diagonal cells reveal how related (or unrelated) different topics are.

From the heatmap, you can see which topics tend to cluster together. Topics that share conceptual ground, such as “Bandit Algorithms” and “Reinforcement Learning Policies”, appear in regions of higher similarity, suggesting that the language used to describe them overlaps significantly. Conversely, less closely related topics have lower similarity scores, appearing in lighter-colored cells. Note that the results below might differ from the examples described, due to the stochastic nature of the topic model and the LLM outputs.

In [37]:
fig = topic_model.visualize_heatmap(custom_labels=True, top_n_topics=25)
fig.show()

## 7. Document-Level Visualizations

### 7.1 Visualize Documents with Hoverable Titles

The following visualization is a scatter plot where each point represents a document. This scatter plot visualizes the distribution of documents and their assigned topics in a two-dimensional space. Each point represents a document, and points are colored according to their topic. Labels indicate the general area of the plot where a particular topic is most prominent, showing how the BERTopic model clusters semantically similar documents together.

The documents are colored by their assigned topic. Hovering over a point shows the document's title. 

The clustering and separation of points indicate the effectiveness of the topic modeling process, with clear groupings suggesting coherent topic definitions.


In [38]:
fig = topic_model.visualize_documents(df["title"], 
                           title="Documents and Topics",
                           embeddings=embeddings,
                           custom_labels=True, 
                           hide_annotations=True, 
                           topics=topics)
fig.show()

### 7.2 Documents with Labeled Topics

This version of the visualization shows the document clusters with their respective topic labels. In order to make space for the labels, only the top 55 document clusters are used for the `topics` parameter.

In [39]:
top_topics = topics[:55]
fig = topic_model.visualize_documents(docs, 
                           title="Documents and Topics",
                           embeddings=embeddings, 
                           hide_document_hover=True, 
                           custom_labels=True, 
                           topics=top_topics)
fig.show()

## 8. Topic Search

Below, we put our topic modeling to use with a search engine for our data, allowing for filtering by topic. The engine uses a simple cosine similarity algorithm and leverages the same embedding model that was used for the topic model.

* Enter you search query and click the search button to search accross all topics.
* Click on a topic from the list to choose a filter.
* Click the clear button to remove your input text and your filtered topic.

Try the following search queries with no filter selected: "rag", "image segment", "cluster", and "TTS".

In [40]:
import numpy as np
import ipywidgets as widgets
from IPython.display import display, HTML, clear_output
from sklearn.metrics.pairwise import cosine_similarity

def search_papers(
    query,
    topic_model,
    df,
    embeddings,
    embedder,
    top_k=15,
    topic_filter=None):
    # 1. Encode the query. Reshape to (1, n_features) for cosine_similarity.
    query_emb = embedder.encode([query], show_progress_bar=False)

    # 2. Get all document topics from the model.
    doc_topics = np.array(topic_model.topics_)
    
    # 3. Apply the topic filter before calculating similarity.
    search_indices = np.arange(len(embeddings))
    search_embeddings = embeddings

    if topic_filter:
        mask = np.isin(doc_topics, topic_filter)
        search_indices = np.where(mask)[0]
        
        if len(search_indices) == 0:
            return []
            
        search_embeddings = embeddings[search_indices]

    # 4. Calculate cosine similarity against the (potentially filtered) embeddings.
    similarities = cosine_similarity(query_emb, search_embeddings)[0]

    # 5. Get the top-k indices from the filtered results.
    num_results = min(top_k, len(similarities))
    if num_results == 0:
        return []

    top_filtered_indices = np.argpartition(similarities, -num_results)[-num_results:]
    top_filtered_indices = top_filtered_indices[np.argsort(similarities[top_filtered_indices])[::-1]]

    # 6. Build the results list.
    results = []
    for idx in top_filtered_indices:
        original_idx = search_indices[idx]
        topic_id = doc_topics[original_idx]
        try:
            if topic_id == "-1":
                topic_label = "Outlier Topics"
            else:
                topic_label = topic_info.loc[topic_info["Topic"] == topic_id, "Label"].iloc[0]
                topic_summary = topic_info.loc[topic_info["Topic"] == topic_id, "Summary"].iloc[0]
        except IndexError:
            topic_label = f"Topic {topic_id}"
            topic_summary = ""

        results.append(
            {
                "title": df.at[original_idx, "title"],
                "summary": df.at[original_idx, "summary"],
                "topic": topic_label,
                "topic_summary": topic_summary,
                "topic_id": int(topic_id),
                "similarity": float(similarities[idx]),
            }
        )
    return results


def create_search_interface(
    topic_model,
    df,
    embeddings,
    embedder,
    topic_info):
    """
    Creates and displays a search interface in a Jupyter environment.
    """

    # Widgets
    search_box = widgets.Text(
        placeholder="Enter search query…",
        description="Search:",
        layout=widgets.Layout(width="50%"),
    )

    topic_options = [
        (row["Name"], int(row["Topic"]))
        for _, row in topic_info[topic_info["Topic"] != -1].iterrows()
    ]

    topic_dropdown = widgets.SelectMultiple(
        options=topic_options,
        description="Filter topics:",
        layout=widgets.Layout(width="50%", height="200px"),
    )

    results_out = widgets.Output()
    
    # Event Handlers
    def run_search(_):
        with results_out:
            clear_output()
            if not search_box.value.strip():
                display(HTML("<em>Please enter a search query.</em>"))
                return

            topic_filter = list(topic_dropdown.value) or None
            
            # Call the refined search function
            hits = search_papers(
                search_box.value,
                topic_model,
                df,
                embeddings,
                embedder,
                top_k=15,
                topic_filter=topic_filter,
            )
            
            if not hits:
                display(HTML("<em>No results found.</em>"))
                return

            for i, hit in enumerate(hits, 1):
                html = f"""
                <div style="margin:12px 0; padding:12px; border:1px solid #e0e0e0; border-radius: 8px; background-color: #f9f9f9;">
                    <h3 style="margin-top:0;">{i}. {hit['title']}</h3>
                    <p><b>Topic:</b> {hit['topic']} (ID: {hit['topic_id']})</p>
                    <p><b>Similarity:</b> {hit['similarity']:.3f}</p>
                    <p><b>Summary:</b> {hit['summary'][:500]}…</p>
                    <p><b>Topic Summary:</b> {hit['topic_summary'][:500]}.</p>

                </div>"""
                display(HTML(html))

    def clear_form(_):
        search_box.value = ""
        topic_dropdown.value = ()
        with results_out:
            clear_output()

    # Assemble UI
    search_btn = widgets.Button(description="Search", button_style='primary')
    clear_btn = widgets.Button(description="Clear")
    search_btn.on_click(run_search)
    clear_btn.on_click(clear_form)

    ui = widgets.VBox(
        [
            search_box,
            topic_dropdown,
            widgets.HBox([search_btn, clear_btn]),
            results_out,
        ]
    )
    display(ui)

create_search_interface(topic_model, df, embeddings, embedding_model, topic_info)


VBox(children=(Text(value='', description='Search:', layout=Layout(width='50%'), placeholder='Enter search que…