# Enhancing Topic Modeling with Open Source Large Language Models (LLMs) 🦙
*Integrate BERTopic and LLMs for Richer Topic Insights*

[Inspired by MAARTEN GROOTENDORST](https://maartengrootendorst.substack.com/p/topic-modeling-with-llama-2)
<br>

In this guide, we use open-source Large Language Models (LLMs) such as Mistral, Llama, and Gemma for efficient topic modeling (hosted models like Claude or GPT work too). The goal is to avoid the exhaustive approach of passing every document through an LLM. Instead, we use BERTopic, a modular topic modeling framework that can use any LLM to refine its topic representations.

BERTopic simplifies the process into five clear steps: embedding documents, dimensionality reduction of embeddings, clustering of embeddings, document tokenization by cluster, and extraction of the most representative words for each topic.
<br>
<div>
<img src="https://github.com/MaartenGr/BERTopic/assets/25746895/e9b0d8cf-2e19-4bf1-beb4-4ff2d9fa5e2d" width="500"/>
</div>
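
To make the last two steps more concrete, here is a minimal, simplified sketch of class-based TF-IDF (c-TF-IDF) in plain Python. It assumes whitespace tokenization and two made-up toy clusters. `c_tf_idf` and `top_words` are illustrative helper names, not BERTopic functions; BERTopic's own implementation works on a `CountVectorizer` output and sparse matrices.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_topic):
    """Score words per topic with a simplified class-based TF-IDF.

    docs_per_topic maps a topic id to the list of documents in that cluster.
    """
    # Step 4: join each cluster into one "class document" and tokenize it
    counts = {
        topic: Counter(" ".join(docs).lower().split())
        for topic, docs in docs_per_topic.items()
    }
    # Frequency of every term across all classes
    total = Counter()
    for counter in counts.values():
        total.update(counter)
    # Average number of words per class
    avg_words = sum(total.values()) / len(counts)

    # Step 5: term frequency within the class, weighted by log(1 + A / f_t)
    scores = {}
    for topic, counter in counts.items():
        n_words = sum(counter.values())
        scores[topic] = {
            word: (freq / n_words) * math.log(1 + avg_words / total[word])
            for word, freq in counter.items()
        }
    return scores


def top_words(scores, topic, n=3):
    """Return the n highest-scoring words of a topic (ties broken alphabetically)."""
    ranked = sorted(scores[topic].items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


# Toy clusters, invented for illustration only
toy_clusters = {
    0: ["electrode implant stimulates the brain", "implant electrode records brain signals"],
    1: ["sleep stage monitor tracks the sleep cycle", "monitor detects sleep apnea"],
}
scores = c_tf_idf(toy_clusters)
print(top_words(scores, 0))
print(top_words(scores, 1))
```

Words that appear in every cluster (here "the") get a lower weight than words concentrated in a single cluster, which is why the top words tend to characterize the topic.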

With the advent of capable LLMs like **Llama, Mistral or Gemma**, topic representations can go well beyond simple word lists. However, passing every document through an LLM is computationally impractical for large collections. Vector databases help with search, but they still require you to know which topics you are looking for.

We therefore take a two-stage approach: BERTopic generates the clusters and topics, and an LLM (Llama 3.1 in the code cells of this notebook) then turns each topic into a short, readable label.

This combines the strengths of both: BERTopic's efficient topic generation and the LLM's ability to write a concise topic representation.
<br>
<div>
<img src="https://github.com/MaartenGr/BERTopic/assets/25746895/7c7374a1-5b41-4e93-aafd-a1587367767b" width="500"/>
</div>

With our introduction complete, let's dive into the practical tutorial!

---
        
💡 **NOTE**: We will call a remotely hosted LLM through the together.ai API rather than running a model locally.

---

We will start by installing a number of packages that we are going to use throughout this example:

In [None]:
!pip install bertopic datasets -qqq

# DataMapPlot
!pip install datamapplot -q

# GPU-accelerated HDBSCAN + UMAP
!pip install cudf-cu12 dask-cudf-cu12 --extra-index-url=https://pypi.nvidia.com -q
!pip install cuml-cu12 --extra-index-url=https://pypi.nvidia.com -q
!pip install cugraph-cu12 --extra-index-url=https://pypi.nvidia.com -q
!pip install cupy-cuda12x -f https://pip.cupy.dev/aarch64 -q

In [None]:
!pip install openai -q

In [None]:
from bertopic.representation import OpenAI

# 📄 **Data**

We are going to apply topic modeling to a collection of patent abstracts. They are a great source for topic modeling since they are generally well-written and cover a wide variety of technologies, and therefore topics.

In [None]:
import textwrap

In [None]:
from datasets import load_dataset

dataset = load_dataset("RJuro/neuro_patents")['train']

# Extract abstracts to train on and corresponding titles
abstracts = dataset["appln_abstract"]
titles = dataset["appln_title"]

To give you an idea, an abstract looks like the following:

In [None]:
# a sleeping stage monitor
print(textwrap.fill(abstracts[8765], width=80))

In [None]:
len(abstracts)

# 💬 **Utilizing the OpenAI Package with together.ai API**

In this section, we use the OpenAI Python package, which is compatible with the together.ai API, to call hosted LLMs. Mixtral from Mistral is one good option: it was reported to match or outperform GPT-3.5 on common benchmarks while remaining computationally efficient and cheap. The code cells below use `meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo`; any chat model available on together.ai can be substituted.

### Integration Steps:

1. **OpenAI Client Setup**: We start by configuring the OpenAI client with the together.ai API as the base URL and our API key. Because together.ai accepts the OpenAI request format, the same code can later be pointed at ChatGPT or a local LLM by changing the base URL and model name.

2. **Model Selection and Request**: We choose a model by its together.ai identifier and build a chat request with the OpenAI package. The request contains the messages that define the task (here, generating a topic label) and parameters such as `temperature`, which controls how random the output is.

3. **Execution and Output Retrieval**: We send the request through the OpenAI client to the together.ai backend, which understands the OpenAI format, runs the specified model, and returns the completion.
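
As a rough sketch of step 2, the request can be assembled as a plain dictionary before it is sent. `build_chat_request` is a hypothetical helper written for this illustration, not part of the OpenAI package; the keys mirror the keyword arguments of `client.chat.completions.create`.

```python
def build_chat_request(model, system, user, temperature=0.2):
    """Return keyword arguments for an OpenAI-style chat completion call."""
    return {
        "model": model,  # model identifier as listed by the provider
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "temperature": temperature,  # lower values give more deterministic output
    }


request = build_chat_request(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
    system="You are a helpful assistant",
    user="Label this topic: optogenetics, neurons, light",
)

# With a configured client, step 3 would be:
#   completion = client.chat.completions.create(**request)
#   text = completion.choices[0].message.content
print(request["messages"][1]["content"])
```

Keeping the request separate from the call makes it easy to inspect or log the exact prompt before spending API credits.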


In [None]:
from google.colab import userdata

In [None]:
import openai

In [None]:
TOGETHER_API_KEY = userdata.get('TOGETHER_API_KEY')

In [None]:
# Point the OpenAI client at the together.ai API
client = openai.OpenAI(base_url="https://api.together.xyz/v1", api_key=TOGETHER_API_KEY)

## **Prompt Engineering**

To check that the API connection and the model work, let's try a simple prompt.

In [None]:
system = "You are a helpful assistant"
user = "Could you explain to me how training dogs works as if I am 5?"

In [None]:
completion = client.chat.completions.create(
  model="meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo", # model identifier on together.ai
  messages=[
    {"role": "system", "content": system},
    {"role": "user", "content": user}
  ],
  temperature=0.2,
)

In [None]:
print(textwrap.fill(completion.choices[0].message.content, width=100))

In [None]:
# let's print the markdown produced, too

from IPython.display import Markdown, display

display(Markdown(completion.choices[0].message.content))

### **Prompt Template**

We are going to keep our `system prompt` simple and to the point:

In [None]:
# System prompt describes information given to all conversations
system_prompt = """
You are a helpful, respectful and honest assistant for labeling scientific and technical topics - particularly within neuroscience and neurotech.
"""

We will tell the model that it is simply a helpful assistant for labeling topics since that is our main goal.

In contrast, our `user prompt` is going to be a bit more involved. It will consist of two components, an **example** and the **main prompt**.

Let's start with the **example**. Most LLMs do a much better job of generating accurate responses if you give them an example to work with. We will show it an accurate example of the kind of output we are expecting.

In [None]:
# Example prompt demonstrating the output we are looking for
example_prompt = """
I have a topic that contains the following documents:
- Optogenetics allows for the control of specific neurons with high temporal precision using light, making it a powerful tool for studying neural circuits.
- Recent developments in optogenetic tools have improved the targeting of specific cell types, enhancing the ability to manipulate neural pathways with minimal invasiveness.
- The combination of optogenetics with other techniques, such as electrophysiology, provides deeper insights into the functional dynamics of neural networks.

The topic is described by the following keywords: 'optogenetics, neurons, neural circuits, light, cell types, neural pathways, electrophysiology'.

Based on the information about the topic above, please create a short label of this topic. Make sure to only return the label and nothing more.
"""

example_output = """Advancements in optogenetics for precise neural circuit manipulation"""

This example, based on a few documents and keywords about optogenetics, helps the model understand the kind of output it should give. We show the model that we expect only the label, which makes the output easier for us to extract.

Next, we will create a template that we can use within BERTopic:

In [None]:
# Our main prompt with documents ([DOCUMENTS]) and keywords ([KEYWORDS]) tags
main_prompt = """
I have a topic that contains the following documents:
[DOCUMENTS]

The topic is described by the following keywords: '[KEYWORDS]'.

Based on the information about the topic above, please create a short label of this topic. Make sure to only return the label and nothing more.
"""

There are two BERTopic-specific tags that are of interest, namely `[DOCUMENTS]` and `[KEYWORDS]`:

* `[DOCUMENTS]` contains the top 5 most relevant documents of the topic
* `[KEYWORDS]` contains the top 10 most relevant keywords of the topic, as generated through c-TF-IDF
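
Conceptually, filling the two tags is plain string substitution. The sketch below imitates it with a hypothetical `fill_prompt` helper and made-up inputs; it is not BERTopic's internal code, and the bullet formatting of the documents is an assumption.

```python
def fill_prompt(template, documents, keywords):
    """Replace the [DOCUMENTS] and [KEYWORDS] tags in a prompt template."""
    doc_block = "\n".join(f"- {doc}" for doc in documents)
    return (
        template
        .replace("[DOCUMENTS]", doc_block)
        .replace("[KEYWORDS]", ", ".join(keywords))
    )


# Toy template and inputs, for illustration only
template = "Documents:\n[DOCUMENTS]\n\nKeywords: '[KEYWORDS]'."
filled = fill_prompt(
    template,
    documents=["A wearable sleep stage monitor.", "Detecting apnea from EEG."],
    keywords=["sleep", "monitor", "eeg"],
)
print(filled)
```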

This template is filled in separately for each topic. Finally, we combine the pieces into our final prompt:

In [None]:
prompt = system_prompt + example_prompt + main_prompt

In [None]:
print(prompt)

# 🗨️ **BERTopic**

Before we can start with topic modeling, we will first need to perform two steps:
* Pre-calculating Embeddings
* Defining Sub-models

## **Preparing Embeddings**

By pre-calculating the embeddings for each document, we can speed up later exploration steps and reuse the embeddings to quickly iterate over BERTopic's hyperparameters if needed.

For Chinese text, use `BAAI/bge-small-zh-v1.5` instead.


🔥 **TIP**: You can find a great overview of good embeddings for clustering on the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard).
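
Since the embeddings are reused across runs, it can be worth caching them on disk. The following is a minimal sketch assuming NumPy's `.npy` format; `load_or_compute` and `fake_encode` are hypothetical names, and the random vectors only stand in for the output of `embedding_model.encode(abstracts)`.

```python
import os
import tempfile

import numpy as np


def load_or_compute(path, compute_fn):
    """Load embeddings from `path` if it exists, otherwise compute and save them."""
    if os.path.exists(path):
        return np.load(path)
    embeddings = np.asarray(compute_fn())
    np.save(path, embeddings)
    return embeddings


def fake_encode():
    """Stand-in for the real encoder: random vectors, for illustration only."""
    rng = np.random.default_rng(42)
    return rng.normal(size=(4, 8)).astype("float32")


cache_path = os.path.join(tempfile.mkdtemp(), "embeddings.npy")
first = load_or_compute(cache_path, fake_encode)   # computed and written to disk
second = load_or_compute(cache_path, fake_encode)  # read back from disk
print(first.shape, np.array_equal(first, second))
```

In the notebook, `compute_fn` could be something like `lambda: embedding_model.encode(abstracts, show_progress_bar=True)`. Remember to delete the cache file when the documents or the embedding model change.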

In [None]:
from sentence_transformers import SentenceTransformer

# Pre-calculate embeddings
embedding_model = SentenceTransformer("BAAI/bge-small-en")
embeddings = embedding_model.encode(abstracts, show_progress_bar=True)

## **Sub-models**

Next, we will define all sub-models in BERTopic and do some small tweaks to the number of clusters to be created, setting random states, etc.

In [None]:
from cuml.manifold import UMAP
from cuml.cluster import HDBSCAN
# from umap import UMAP
# from hdbscan import HDBSCAN

umap_model = UMAP(n_neighbors=3, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=25, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
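
If no GPU is available, the commented-out imports above are the CPU alternatives. One way to pick whichever is installed is sketched below; `first_available` is a hypothetical helper, and the demo uses a standard-library module so that it runs anywhere.

```python
import importlib


def first_available(candidates):
    """Return (module_name, attribute) for the first importable candidate."""
    for module_name, attr_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # not installed, try the next candidate
        return module_name, getattr(module, attr_name)
    raise ImportError(f"None of {[m for m, _ in candidates]} could be imported")


# Demo with a module that does not exist and a standard-library fallback
source, cls = first_available([
    ("not_a_real_module_xyz", "Missing"),
    ("collections", "OrderedDict"),
])
print(source, cls.__name__)

# In this notebook the same idea would read (assumption: one of the two is installed):
#   _, UMAP = first_available([("cuml.manifold", "UMAP"), ("umap", "UMAP")])
#   _, HDBSCAN = first_available([("cuml.cluster", "HDBSCAN"), ("hdbscan", "HDBSCAN")])
```

Note that this only catches missing packages. A GPU package that is installed but fails at import time for another reason (for example a driver problem) would still raise.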

As a small bonus, we are going to reduce the embeddings we created before to two dimensions so that we can use them for visualization once we have created our topics.

In [None]:
# Pre-reduce embeddings for visualization purposes
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine', random_state=42).fit_transform(embeddings)

### **Representation Models**

One of the ways we are going to represent the topics is with an LLM, which should give us a nice label. However, we might want to have additional representations to view a topic from multiple angles.

Here, we will be using c-TF-IDF as our main representation and [KeyBERT](https://maartengr.github.io/BERTopic/getting_started/representation/representation.html#keybertinspired), [MMR](https://maartengr.github.io/BERTopic/getting_started/representation/representation.html#maximalmarginalrelevance), and an [LLM](https://maartengr.github.io/BERTopic/getting_started/representation/llm.html) as our additional representations.
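
To give an intuition for what the `diversity` parameter of MMR does, here is a small pure-Python sketch of greedy Maximal Marginal Relevance. The 2-D vectors are invented toy embeddings, and `cosine` and `mmr` are illustrative functions, not BERTopic's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(topic_vec, word_vecs, words, top_n=2, diversity=0.3):
    """Greedy MMR: trade relevance to the topic against redundancy among picks."""
    relevance = [cosine(vec, topic_vec) for vec in word_vecs]
    # Start with the single most relevant word
    selected = [max(range(len(words)), key=lambda i: relevance[i])]
    while len(selected) < min(top_n, len(words)):
        remaining = [i for i in range(len(words)) if i not in selected]

        def score(i):
            # Penalize similarity to the closest already-selected word
            redundancy = max(cosine(word_vecs[i], word_vecs[j]) for j in selected)
            return (1 - diversity) * relevance[i] - diversity * redundancy

        selected.append(max(remaining, key=score))
    return [words[i] for i in selected]


# Toy embeddings: "neuron" and "neurons" are near-duplicates, "light" is different
words = ["neuron", "neurons", "light"]
word_vecs = [[1.0, 0.10], [1.0, 0.12], [0.6, 0.8]]
topic_vec = [1.0, 0.0]

print(mmr(topic_vec, word_vecs, words, diversity=0.3))
print(mmr(topic_vec, word_vecs, words, diversity=0.8))
```

With a low `diversity`, the near-duplicate keyword is still picked because it is highly relevant; with a high value, the less similar keyword wins.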

In [None]:
prompt = """
I have a topic that is described by the following keywords: [KEYWORDS]
In this topic, the following documents are a small but representative subset of all documents in the topic:
[DOCUMENTS]

Based on the information above, please give a topic label of maximum 6 words:
topic: <label>
"""

In [None]:
#from bertopic.representation import OpenAI

from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance

# KeyBERT
keybert = KeyBERTInspired()

# MMR
mmr = MaximalMarginalRelevance(diversity=0.3)

#openai_rep = OpenAI(client, model="meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
#                    chat=True,
#                    prompt=prompt,
#                    nr_docs=5,
#                    delay_in_seconds=3)


# All representation models
representation_model = {
    "KeyBERT": keybert,
#    "Llama": openai_rep,
    "MMR": mmr,
}

# 🔥 **Training**

Now that we have our models prepared, we can start training our topic model! We supply BERTopic with the sub-models of interest, run `.fit_transform`, and see what kind of topics we get.

In [None]:
from bertopic import BERTopic

topic_model = BERTopic(

  # Sub-models
  embedding_model=embedding_model,
  umap_model=umap_model,
  hdbscan_model=hdbscan_model,
  representation_model=representation_model,

  # Hyperparameters
  top_n_words=15,
  verbose=True
)

# Train model
topics, probs = topic_model.fit_transform(abstracts, embeddings)


Now that we are done training our model, let's see what topics were generated:

In [None]:
# Show topics
topic_model.get_topic_info()

In [None]:
topic_model.get_topic(1, full=True)["KeyBERT"]

### **Workaround**

Unfortunately, there are currently some incompatibilities when using the LLM representation inside BERTopic with this setup, so the labeling is implemented by hand below.

In [None]:
from tqdm import tqdm

# System prompt describing the context
system_prompt = """
You are a helpful, respectful and honest assistant for labeling scientific and technical topics - particularly within neuroscience and neurotech.
"""

# Example prompt demonstrating the desired output
example_prompt = """
I have a topic that contains the following documents:
- Optogenetics allows for the control of specific neurons with high temporal precision using light, making it a powerful tool for studying neural circuits.
- Recent developments in optogenetic tools have improved the targeting of specific cell types, enhancing the ability to manipulate neural pathways with minimal invasiveness.
- The combination of optogenetics with other techniques, such as electrophysiology, provides deeper insights into the functional dynamics of neural networks.

The topic is described by the following keywords: 'optogenetics, neurons, neural circuits, light, cell types, neural pathways, electrophysiology'.

Based on the information about the topic above, please create a short label of this topic. Make sure to only return the label and nothing more.
"""

example_output = "Advancements in optogenetics for precise neural circuit manipulation"

# Our main prompt template
main_prompt_template = """
I have a topic that contains the following documents:
{documents}

The topic is described by the following keywords: '{keywords}'.

Based on the information about the topic above, please create a short label of this topic. Make sure to only return the label and nothing more.
"""


# List to store the results
labels = []

# Iterate over each row in the dataframe
for index, row in tqdm(topic_model.get_topic_info().iterrows(), total=topic_model.get_topic_info().shape[0], desc="Labeling Topics"):
    documents = row['Representative_Docs']
    keywords = row['KeyBERT']

    # Create the main prompt for the current topic
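    main_prompt = main_prompt_template.format(
        documents="\n".join(f"- {doc}" for doc in documents),  # bullet list, as in the example prompt
        keywords=", ".join(keywords),  # plain comma-separated string instead of a Python list repr
    )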

    # Combine system, example, and main prompts
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": example_prompt},
        {"role": "assistant", "content": example_output},
        {"role": "user", "content": main_prompt},
    ]

    # Call the LLM model to generate the label
    completion = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",  # model identifier on together.ai
        messages=messages,
        temperature=0.2,
    )

    # Extract the label from the response and add it to the results list
    label = completion.choices[0].message.content.strip()
    labels.append(label)

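# 'labels' now holds one label per row of get_topic_info();
# register them as custom labels so that custom_labels=True works in the plots
topic_model.set_topic_labels(dict(zip(topic_model.get_topic_info()["Topic"], labels)))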
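
LLM replies do not always follow the "label only" instruction: they may add a `topic:` prefix, quotes, or an explanation on a second line. A small post-processing step can help. `clean_label` below is a hypothetical helper, and the patterns it strips are assumptions about typical replies rather than guaranteed model behavior.

```python
import re


def clean_label(raw, fallback="Unlabelled"):
    """Normalize a raw LLM reply into a single-line topic label."""
    # Keep only non-empty lines; the label is assumed to be on the first one
    lines = [line.strip() for line in raw.strip().splitlines() if line.strip()]
    if not lines:
        return fallback
    # Drop a leading "topic:" or "label:" prefix, then surrounding quotes or markdown
    label = re.sub(r"^(topic|label)\s*:\s*", "", lines[0], flags=re.IGNORECASE)
    label = label.strip(" \"'`*")
    return label or fallback


print(clean_label('topic: "Sleep stage monitoring"\n'))
print(clean_label("**Neural implants**\nThis label summarizes the documents."))
print(clean_label("   "))
```

It could be applied as `labels = [clean_label(label) for label in labels]` before the labels are used further.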

We got over 100 topics, and they all seem quite diverse. We can now assign the Llama 3.1 labels to the topics we created. Normally, the default topic representation would be c-TF-IDF, but we will focus on the Llama 3.1 labels instead.


In [None]:
topic_model.visualize_documents(titles, reduced_embeddings=reduced_embeddings, hide_annotations=True, hide_document_hover=False, custom_labels=True)

In [None]:
import PIL.Image  # importing the submodule explicitly makes PIL.Image available
import numpy as np
import requests

# Prepare logo
bertopic_logo_response = requests.get(
    "https://raw.githubusercontent.com/MaartenGr/BERTopic/master/images/logo.png",
    stream=True,
    headers={'User-Agent': 'My User Agent 1.0'}
)
bertopic_logo = np.asarray(PIL.Image.open(bertopic_logo_response.raw))

In [None]:
import datamapplot
import re

# Create a label for each document | notice that we are passing labels from manual labeling here
llm_labels = [label if label else "Unlabelled" for label in labels]
all_labels = [llm_labels[topic+topic_model._outliers] if topic != -1 else "Unlabelled" for topic in topics]

# Run the visualization
datamapplot.create_plot(
    reduced_embeddings,
    all_labels,
    label_font_size=10,
    title="Neurotech - Topics",
    sub_title="Topics labeled with `Llama 3.1`",
    label_wrap_width=20,
    use_medoids=True,
    #logo=bertopic_logo,
    #logo_width=0.16
)
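
The index arithmetic used for `all_labels` above relies on the label list being ordered like `get_topic_info()`, where the outlier topic `-1` (if present) comes first. The toy sketch below spells this out; `labels_per_document` is a hypothetical helper and the inputs are made up.

```python
def labels_per_document(topics, labels, has_outliers):
    """Look up each document's label when `labels` is ordered like get_topic_info()."""
    offset = 1 if has_outliers else 0  # row 0 belongs to topic -1 when outliers exist
    return ["Unlabelled" if topic == -1 else labels[topic + offset] for topic in topics]


# Toy example: labels for topics -1, 0, 1 and the topic assignment of five documents
toy_labels = ["(outliers)", "Sleep monitoring", "Neural implants"]
toy_topics = [0, 1, -1, 1, 0]
print(labels_per_document(toy_topics, toy_labels, has_outliers=True))
```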

In [None]:
dataset.to_parquet('patents_with_tm.pq')