# Sharing BERTopic models on the Hugging Face Hub

This notebook shows the steps involved in sharing a BERTopic model on the Hugging Face Hub. As an example, we'll train a topic model on GitHub issue titles for the Transformers library.

First we need to install `BERTopic` along with the `huggingface_hub` library. We can optionally also install [`safetensors`](https://huggingface.co/docs/safetensors/index), a simple format for storing tensors safely (as opposed to pickle) that is still fast (zero-copy). If this library is installed, BERTopic can use the `safetensors` format for model serialization.

In [None]:
%pip install git+https://github.com/MaartenGr/BERTopic huggingface_hub safetensors -qqq

We can use a [dataset](https://github.com/nlp-with-transformers/notebooks) that has been created for the [Natural Language Processing with Transformers](https://github.com/nlp-with-transformers/notebooks) book. This dataset contains issue titles, along with some metadata for the Transformers library GitHub repository.

GitHub issues are an example of a domain where we might assume some sort of topics exist in the corpus, but we probably don't have an exact sense of what all of these topics would be. This is the type of problem where topic modelling can give us a better sense of the corpus and potentially be useful for classifying new issues into topics.

We'll start by loading the data into a pandas DataFrame.

In [None]:
import pandas as pd

dataset_url = "https://raw.githubusercontent.com/nlp-with-transformers/notebooks/main/data/github-issues-transformers.jsonl"
df_issues = pd.read_json(dataset_url, lines=True)


In [None]:
df_issues.head(4)

We can train our topic model on a subset of the data and hold back some examples which we can treat as new data. This mirrors a situation where we might use a BERTopic model in a production setting.

In [None]:
len(df_issues)

In [None]:
df_issues_train = df_issues[:9000]

In [None]:
df_issues_test = df_issues[9000:]

BERTopic expects a list of strings as input, so let's grab the title column and turn it into a list.

In [None]:
issue_titles = df_issues_train['title'].to_list()

In [None]:
issue_titles[:3]
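
Since BERTopic expects a list of strings, it can help to check the titles for missing or empty values before training. The following is a minimal sketch, not part of the original notebook: `clean_titles` is a hypothetical helper, and the toy list stands in for the real titles.

```python
def clean_titles(titles):
    """Return (kept_positions, cleaned_titles) for non-empty string titles."""
    kept_positions = []
    cleaned = []
    for position, title in enumerate(titles):
        # Skip missing values and anything that is not a string.
        if not isinstance(title, str):
            continue
        text = title.strip()
        # Skip titles that are empty after trimming whitespace.
        if not text:
            continue
        kept_positions.append(position)
        cleaned.append(text)
    return kept_positions, cleaned


# Toy input standing in for df_issues_train['title'].to_list()
raw_titles = ["  Fix tokenizer crash ", None, "", "Add Flax BERT example"]
kept_positions, cleaned_titles = clean_titles(raw_titles)
print(kept_positions)
print(cleaned_titles)
```

The kept positions are returned so that any per-document metadata, such as the `created_at` timestamps, can be filtered the same way and stay aligned with the titles.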

## Training our model

We'll train a BERTopic model using fairly standard settings.

In [None]:
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired

In [None]:
representation_model = KeyBERTInspired()

In [None]:
topic_model = BERTopic("english", verbose=True, nr_topics=30, representation_model=representation_model)

In [None]:
topics, probs = topic_model.fit_transform(issue_titles)

We can quickly explore the topics from our model.

In [None]:
freq = topic_model.get_topic_info()

In [None]:
freq.head(10)

In [None]:
topic_model.visualize_topics()
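
When reading the topic info table, note that BERTopic uses topic `-1` for documents it treats as outliers. A quick way to see how much of the corpus ended up there is to compute the share of documents per topic from the `topics` list returned by `fit_transform`. The sketch below uses a toy list of topic ids; `topic_shares` is a hypothetical helper, not a BERTopic function.

```python
from collections import Counter


def topic_shares(topics):
    """Map each topic id to the fraction of documents assigned to it."""
    total = len(topics)
    if total == 0:
        return {}
    counts = Counter(topics)
    return {topic: count / total for topic, count in counts.items()}


# Toy topic assignments standing in for the `topics` list from fit_transform
toy_topics = [-1, 0, 0, 1, -1, 2, 0, -1]
shares = topic_shares(toy_topics)
outlier_share = shares.get(-1, 0.0)
print(f"Share of documents in the outlier topic: {outlier_share:.1%}")
```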

We can also view topics over time.

In [None]:
timestamps = df_issues_train['created_at']

In [None]:
topics_over_time = topic_model.topics_over_time(issue_titles, timestamps, nr_bins=20)

In [None]:
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=10)
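
The `nr_bins=20` argument groups the timestamps into a fixed number of time bins before topic frequencies are computed. To make the idea concrete, here is a simplified sketch of equal-width binning over a date range. It is an illustration only, not BERTopic's implementation, and `assign_time_bins` is a hypothetical helper.

```python
from datetime import datetime


def assign_time_bins(timestamps, nr_bins):
    """Assign each timestamp to one of nr_bins equal-width bins (0-based)."""
    start, end = min(timestamps), max(timestamps)
    span = (end - start).total_seconds()
    if span == 0:
        # All timestamps are identical, so everything falls into one bin.
        return [0 for _ in timestamps]
    bins = []
    for ts in timestamps:
        fraction = (ts - start).total_seconds() / span
        # The latest timestamp has fraction 1.0; clamp it into the last bin.
        bins.append(min(int(fraction * nr_bins), nr_bins - 1))
    return bins


# Toy timestamps standing in for df_issues_train['created_at']
toy_timestamps = [
    datetime(2020, 1, 1),
    datetime(2020, 4, 1),
    datetime(2020, 7, 1),
    datetime(2020, 12, 31),
]
bins = assign_time_bins(toy_timestamps, nr_bins=4)
print(bins)
```

Fewer bins give smoother but coarser trends, while more bins show finer detail with fewer documents in each bin.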

## Pushing our BERTopic model to the Hugging Face Hub 🤗

We can use the new BERTopic Hub integration to push our models to the Hugging Face Hub. Sharing models on the Hub makes it easier for others (or our future selves) to use or adapt our topic models.

In [None]:
from huggingface_hub import notebook_login

In [None]:
notebook_login()

In [None]:
HF_USER_NAME = "" # add your hub username here

In [None]:
topic_model.push_to_hf_hub(f'{HF_USER_NAME}/transformers_issues_topics')
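
The repository ID passed to `push_to_hf_hub` has the form `username/model_name`. If `HF_USER_NAME` is left as an empty string, the f-string above produces an ID that starts with `/`. A small check before pushing can catch this early. This is a sketch only, and `build_repo_id` is a hypothetical helper rather than part of BERTopic or `huggingface_hub`.

```python
def build_repo_id(user_name, model_name):
    """Join a Hub username and model name into 'username/model_name'."""
    user_name = user_name.strip()
    model_name = model_name.strip()
    if not user_name:
        raise ValueError("Set HF_USER_NAME to your Hub username before pushing.")
    if not model_name:
        raise ValueError("The model name must not be empty.")
    if "/" in user_name or "/" in model_name:
        raise ValueError("Username and model name must not contain '/'.")
    return f"{user_name}/{model_name}"


# "my-user" is a placeholder username used only for illustration
repo_id = build_repo_id("my-user", "transformers_issues_topics")
print(repo_id)
```

The result can then be passed to `topic_model.push_to_hf_hub(repo_id)` in place of the f-string.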

## Loading models from the Hugging Face Hub 🤗

We can similarly load models from the Hub.

In [None]:
from bertopic import BERTopic
topic_model = BERTopic.load("davanstrien/transformers_issues_topics")

We can then use this model to predict the topics of new unseen documents.

In [None]:
new_issue_titles = df_issues_test['title'].to_list()

In [None]:
examples = new_issue_titles[5:15]

In [None]:
examples

In [None]:
topics, prob = topic_model.transform(examples)

In [None]:
for example, topic in zip(examples,topics):
    print(f"TEXT: {example}")
    print(f"TOPIC: {topic_model.get_topic_info(int(topic)).loc[0,'Representation']}")
    print('--*--'*9)
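
The loop above looks up the topic info once per example. For larger batches, one alternative is to build a mapping from topic id to representation once and reuse it. The sketch below assumes such a mapping already exists as a plain dictionary; `label_predictions` and the toy labels are hypothetical and only illustrate the lookup.

```python
def label_predictions(examples, topics, topic_labels, unknown="<no label>"):
    """Pair each example with the label of its predicted topic."""
    return [
        (text, topic_labels.get(int(topic), unknown))
        for text, topic in zip(examples, topics)
    ]


# Toy mapping from topic id to keywords, standing in for the
# 'Topic' and 'Representation' columns of get_topic_info()
toy_labels = {
    -1: ["issue", "error"],
    0: ["tokenizer", "tokenizers"],
    1: ["trainer", "training"],
}
toy_examples = ["Tokenizer fails on empty string", "Trainer ignores eval_steps"]
toy_predictions = [0, 1]

labelled = label_predictions(toy_examples, toy_predictions, toy_labels)
for text, label in labelled:
    print(f"TEXT: {text}")
    print(f"TOPIC: {label}")
```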


## Next steps

You can try training your own topic model and pushing it to the Hub. BERTopic is a very flexible library so you can swap out many of the components.

You can easily grab a dataset from the Hugging Face Hub and extract the text you want to use for training a topic model. For example, we can train a topic model on the German subset of the [amazon_reviews_multi](https://huggingface.co/datasets/amazon_reviews_multi) dataset.

In [None]:
%pip install datasets

In [None]:
from datasets import load_dataset

dataset = load_dataset("amazon_reviews_multi", "de")

In [None]:
docs = dataset['train']['review_body']

In [None]:
docs[0:5]

In [None]:
topic_model = BERTopic("german")

In [None]:
topics, probs = topic_model.fit_transform(docs)
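
If fitting on every review in the training split takes too long, one option is to fit on a random sample of the documents instead. The following is a minimal sketch of reproducible sampling; `sample_docs`, the seed, and the sample size are assumptions for illustration, and the toy list stands in for `docs`.

```python
import random


def sample_docs(docs, max_docs, seed=42):
    """Return at most max_docs documents, sampled without replacement."""
    docs = list(docs)
    if max_docs >= len(docs):
        return docs
    # A seeded generator makes the sample reproducible across runs.
    rng = random.Random(seed)
    return rng.sample(docs, max_docs)


# Toy documents standing in for dataset['train']['review_body']
toy_docs = [f"review {i}" for i in range(100)]
docs_sample = sample_docs(toy_docs, max_docs=10)
print(len(docs_sample))
```

The sampled list can be passed to `fit_transform` in the same way as the full list of documents.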