Add .merge_models for merging several fitted BERTopic models #1516

Merged
merged 2 commits into from
Sep 21, 2023

MaartenGr
Owner

Merge Multiple Fitted Models

After you have trained a BERTopic model on your data, new data might still be coming in. Although you could use online BERTopic, you might prefer to keep the default HDBSCAN and UMAP models, which do not support incremental learning out of the box.

Instead, you can train a new BERTopic model on the incoming data and merge it with your base model to detect whether new topics have appeared in the unseen documents. This shows whether the new model contains information that was not previously found in your base topic model.

Similarly, you might want to train multiple BERTopic models with different settings, even though they all use the same underlying embedding model. Merging these models gives you a single model that you can use throughout your use cases.

Lastly, this method also allows for a degree of federated learning, where each node trains its own topic model and the models are then aggregated on a central server.

Example

To demonstrate merging different topic models with BERTopic, we use the ArXiv paper abstracts to see which topics they generally contain.

First, we train three separate models on different parts of the data:

from umap import UMAP
from bertopic import BERTopic
from datasets import load_dataset

dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]

# Extract three non-overlapping slices of abstracts, one per model
abstracts_1 = dataset["abstract"][:5_000]
abstracts_2 = dataset["abstract"][5_000:10_000]
abstracts_3 = dataset["abstract"][10_000:15_000]

# Create topic models
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
topic_model_1 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_1)
topic_model_2 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_2)
topic_model_3 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_3)

Then, we can combine all three models into one with .merge_models:

# Combine all models into one
merged_model = BERTopic.merge_models([topic_model_1, topic_model_2, topic_model_3])
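To give an intuition for what "detecting new topics" can mean during a merge, here is a minimal sketch that compares topic embeddings by cosine similarity and flags a candidate topic as new when no base topic is similar enough. It is an illustration under stated assumptions, not a description of BERTopic's internals: `find_new_topics`, the toy 2-D vectors, and the 0.7 threshold are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_new_topics(base_embeddings, candidate_embeddings, min_similarity=0.7):
    """Return indices of candidate topics with no sufficiently similar base topic.

    A candidate counts as "new" when its best cosine similarity to any
    base topic embedding is below min_similarity (an assumed threshold).
    """
    new_topics = []
    for idx, candidate in enumerate(candidate_embeddings):
        best = max(cosine_similarity(candidate, base) for base in base_embeddings)
        if best < min_similarity:
            new_topics.append(idx)
    return new_topics


# Toy topic embeddings, made up for illustration only
base = [[1.0, 0.0], [0.0, 1.0]]
candidates = [[0.9, 0.1], [-1.0, 0.0]]

new_topic_indices = find_new_topics(base, candidates)
print(new_topic_indices)  # -> [1]
```

In this toy case the first candidate nearly coincides with an existing base topic and is treated as already known, while the second has no close match and would be added as a new topic.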

When we inspect the first model, we can see it has 52 topics:

>>> len(topic_model_1.get_topic_info())
52

When we inspect the merged model, we can see it has 57 topics:

>>> len(merged_model.get_topic_info())
57

It seems that by merging these three models, there were 5 undiscovered topics (57 - 52) that we could add to the very first model.
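The number of added topics is simply the difference between the two lengths above. A small helper makes that explicit; `count_added_topics` is a hypothetical name, and it only assumes that both models expose `get_topic_info()` with a result that supports `len()`, as in the calls above.

```python
def count_added_topics(base_model, merged_model):
    """Number of topic rows the merged model has beyond the base model.

    Any row that appears in both tables (for example an outlier row,
    if both models have one) cancels out in the subtraction.
    """
    return len(merged_model.get_topic_info()) - len(base_model.get_topic_info())
```

With the models from this example, `count_added_topics(topic_model_1, merged_model)` gives the number of rows to pass to `.tail(...)` when looking at the newly added topics.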

Let's inspect them:

>>> merged_model.get_topic_info().tail(5)
    Topic  Count  Name                                    Representation                                                                                                          Representative_Docs
52  51     47     50_activity_mobile_wearable_sensors     ['activity', 'mobile', 'wearable', 'sensors', 'falls', 'human', 'phone', 'recognition', 'activities', 'accelerometer']  nan
53  52     48     25_music_musical_audio_chord            ['music', 'musical', 'audio', 'chord', 'and', 'we', 'to', 'that', 'of', 'for']                                          nan
54  53     32     36_fairness_discrimination_fair_groups  ['fairness', 'discrimination', 'fair', 'groups', 'protected', 'decision', 'we', 'of', 'classifier', 'to']               nan
55  54     30     38_traffic_driver_prediction_flow       ['traffic', 'driver', 'prediction', 'flow', 'trajectory', 'the', 'and', 'congestion', 'of', 'transportation']           nan
56  55     22     50_spiking_neurons_networks_learning    ['spiking', 'neurons', 'networks', 'learning', 'neural', 'snn', 'dynamics', 'plasticity', 'snns', 'of']                 nan
