Add `.merge_models` for merging several fitted BERTopic models #1516

MaartenGr · 2023-09-09T06:45:17Z

Merge Multiple Fitted Models

After you have trained a BERTopic model on your data, new data might still be coming in. Although you can use online BERTopic, you might prefer the default HDBSCAN and UMAP models, which do not support incremental learning out of the box.

Instead, you can train a new BERTopic model on the incoming data and merge it with your base model to detect whether new topics have appeared in the unseen documents. Any topic that the merge adds is information that was not previously found in your base topic model.

Similarly, you might want to train multiple BERTopic models using different sets of settings, even though they might all be using the same underlying embedding model. Merging these models would also allow for a single model that you can use throughout your use cases.

Lastly, this method also allows for a degree of federated learning, where each node trains its own topic model and the models are aggregated on a central server.
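The idea behind detecting new topics through merging can be illustrated with a small, self-contained sketch. It is not the library's implementation: the helper names and the toy 3-dimensional "topic embeddings" are hypothetical, and the 0.7 threshold is an assumption that mirrors what I understand to be the default of the `min_similarity` parameter of `.merge_models`. A topic from the new model is kept only if it is not similar enough to any topic in the base model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def merge_topics(base_topics, new_topics, min_similarity=0.7):
    """Keep all base topics and add new topics that match no base topic.

    Both arguments map a topic name to its embedding (hypothetical toy format).
    Returns the merged mapping and the list of topic names that were added.
    """
    merged = dict(base_topics)
    added = []
    for name, embedding in new_topics.items():
        # Similarity to the closest topic in the base model
        best = max(cosine_similarity(embedding, base) for base in base_topics.values())
        if best < min_similarity:
            merged[name] = embedding
            added.append(name)
    return merged, added


# Toy embeddings, invented for illustration only
base_topics = {"music": [1.0, 0.0, 0.0], "traffic": [0.0, 1.0, 0.0]}
new_topics = {"audio": [0.9, 0.1, 0.0], "fairness": [0.0, 0.0, 1.0]}

merged, added = merge_topics(base_topics, new_topics)
print(added)  # ['fairness']
```

Here "audio" is nearly identical to the existing "music" topic and is dropped, while "fairness" has no close match and is added to the merged set.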

Example

To demonstrate merging different topic models with BERTopic, we use the ArXiv paper abstracts to see which topics they generally contain.

First, we train three separate models on different parts of the data:

from umap import UMAP
from bertopic import BERTopic
from datasets import load_dataset

dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]

# Split the abstracts into three separate training subsets
abstracts_1 = dataset["abstract"][:5_000]
abstracts_2 = dataset["abstract"][5_000:10_000]
abstracts_3 = dataset["abstract"][10_000:15_000]

# Create topic models
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
topic_model_1 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_1)
topic_model_2 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_2)
topic_model_3 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_3)

Then, we can combine all three models into one with .merge_models:

# Combine all models into one
merged_model = BERTopic.merge_models([topic_model_1, topic_model_2, topic_model_3])

When we inspect the first model, we can see it has 52 topics:

>>> len(topic_model_1.get_topic_info())
52

When we inspect the merged model, we can see it has 57 topics:

>>> len(merged_model.get_topic_info())
57

It seems that by merging these three models, there were 5 undiscovered topics (57 minus the 52 in the first model) that we could add to the very first model.

Let's inspect them:

>>> merged_model.get_topic_info().tail(5)

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 52 | 51 | 47 | 50_activity_mobile_wearable_sensors | ['activity', 'mobile', 'wearable', 'sensors', 'falls', 'human', 'phone', 'recognition', 'activities', 'accelerometer'] | nan |
| 53 | 52 | 48 | 25_music_musical_audio_chord | ['music', 'musical', 'audio', 'chord', 'and', 'we', 'to', 'that', 'of', 'for'] | nan |
| 54 | 53 | 32 | 36_fairness_discrimination_fair_groups | ['fairness', 'discrimination', 'fair', 'groups', 'protected', 'decision', 'we', 'of', 'classifier', 'to'] | nan |
| 55 | 54 | 30 | 38_traffic_driver_prediction_flow | ['traffic', 'driver', 'prediction', 'flow', 'trajectory', 'the', 'and', 'congestion', 'of', 'transportation'] | nan |
| 56 | 55 | 22 | 50_spiking_neurons_networks_learning | ['spiking', 'neurons', 'networks', 'learning', 'neural', 'snn', 'dynamics', 'plasticity', 'snns', 'of'] | nan |
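The number of added topics can be cross-checked from the topic IDs alone. The sketch below uses plain lists as stand-ins for the `Topic` columns of the two `get_topic_info()` results, under the assumption that both include the outlier topic -1. The helper name `newly_added_topics` is hypothetical and not part of BERTopic.

```python
def newly_added_topics(base_topic_ids, merged_topic_ids):
    """Return topic IDs present in the merged model but not in the base model."""
    return sorted(set(merged_topic_ids) - set(base_topic_ids))


# Stand-ins for the "Topic" columns, assuming each includes the outlier topic -1
base_topic_ids = list(range(-1, 51))    # 52 rows, as reported for topic_model_1
merged_topic_ids = list(range(-1, 56))  # 57 rows, as reported for merged_model

new_ids = newly_added_topics(base_topic_ids, merged_topic_ids)
print(len(new_ids), new_ids)  # 5 [51, 52, 53, 54, 55]
```

With fitted models, the same helper could presumably be fed `topic_model_1.get_topic_info()["Topic"]` and `merged_model.get_topic_info()["Topic"]`. The resulting IDs 51 to 55 match the `Topic` column of the five rows shown above.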

MaartenGr added 2 commits September 9, 2023 08:44

Add method for merging several fitted BERTopic models

9d5fc17

Update test

e9c2cba

MaartenGr mentioned this pull request Sep 14, 2023

Topic merging in different models #1531

Closed

MaartenGr merged commit 5b663e9 into master Sep 21, 2023
2 checks passed

This was referenced Oct 23, 2023

Loaded Model wont .fit or fit and transform #1584

Open

Is there a way to find emerging topics on regular basis? #1514

Open

MaartenGr mentioned this pull request Nov 1, 2023

Merging topic models #1471

Open

This was referenced Nov 15, 2023

Using represenation_model in online learning #1628

Closed

Constant labels #1632

Closed

Documents and Topics are different lengths and cannot merge the topics #1626

Open

MaartenGr deleted the merge_models branch May 12, 2024 09:03
