# BERTopic - Tutorial
We start by installing BERTopic from PyPI before preparing the data.

**NOTE**: Make sure to select a GPU runtime. Otherwise, the model can take quite some time to create the document embeddings!
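
A quick way to confirm that a GPU is visible before creating embeddings is sketched below, using only the standard library. It assumes an NVIDIA GPU and that the `nvidia-smi` tool is on the PATH whenever such a GPU is present; `gpu_visible` is a hypothetical helper name, not part of BERTopic.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Heuristic check: True if `nvidia-smi -L` runs and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # Tool not installed, so assume no NVIDIA GPU runtime is attached
        return False
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.SubprocessError):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU visible:", gpu_visible())
```

If this prints `False`, the embedding step will still run, only more slowly on the CPU.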

In [1]:
!pip install bertopic



# Prepare data
For this example, we use the famous 20 Newsgroups dataset, which contains roughly 18,000 newsgroup posts on 20 topics.

In [2]:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
 
docs = fetch_20newsgroups(subset='train')['data']

In [10]:
len(docs)

2209

In [8]:
from pathlib import Path

# Replace the 20 Newsgroups data with a custom corpus: one document per file
corpus_dir = Path('/home/ajanco/projects/slavic_review/slavic_review_data/SlavicStudiesCluster1991to2020/pseudo')
docs = [text.read_text() for text in corpus_dir.iterdir()]
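
Loading a folder of text files can be made a little more defensive. The sketch below is one possible approach, not the notebook's original code: `load_texts` is a hypothetical helper, and the `*.txt` pattern and UTF-8 encoding are assumptions about the corpus. It skips subdirectories and reads files in sorted order so that document indices are reproducible.

```python
import tempfile
from pathlib import Path


def load_texts(folder, pattern="*.txt", encoding="utf-8"):
    """Read every matching file in `folder` into a list of strings, in sorted order."""
    paths = sorted(p for p in Path(folder).glob(pattern) if p.is_file())
    return [p.read_text(encoding=encoding) for p in paths]


# Demonstration on a throwaway directory with made-up files
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "b.txt").write_text("second document", encoding="utf-8")
    (Path(tmp) / "a.txt").write_text("first document", encoding="utf-8")
    (Path(tmp) / "notes.md").write_text("ignored", encoding="utf-8")
    sample_docs = load_texts(tmp)

print(sample_docs)
```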


# Create Topics
We use the **distilbert-base-nli-mean-tokens** model, which the authors of the [sentence-transformers](https://www.sbert.net/docs/pretrained_models.html) package recommend for creating sentence embeddings. However, you can use any of the pre-trained embedding models available in the package.

In [11]:
model = BERTopic("distilbert-base-nli-mean-tokens", verbose=True)
topics = model.fit_transform(docs)

2020-10-06 10:16:16,473 - BERTopic - Loaded BERT model
INFO:BERTopic:Loaded BERT model
2020-10-06 10:21:28,970 - BERTopic - Transformed documents to Embeddings
INFO:BERTopic:Transformed documents to Embeddings
2020-10-06 10:21:38,207 - BERTopic - Reduced dimensionality with UMAP
INFO:BERTopic:Reduced dimensionality with UMAP
2020-10-06 10:21:38,285 - BERTopic - Clustered UMAP embeddings with HDBSCAN
INFO:BERTopic:Clustered UMAP embeddings with HDBSCAN


In [12]:
# Get most frequent topics
model.get_topics_freq()[:5]

|   | Topic | Count |
|---|-------|-------|
| 0 | 5     | 1779  |
| 1 | -1    | 152   |
| 2 | 0     | 64    |
| 3 | 3     | 64    |
| 4 | 4     | 59    |
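
The frequency table counts how many documents received each topic label, with `-1` marking documents left unassigned as outliers. A minimal sketch of the same tally, assuming a list with one label per document; the labels below are made up and `topic_frequencies` is a hypothetical helper name:

```python
from collections import Counter


def topic_frequencies(labels):
    """Return (topic, count) pairs sorted by count, largest first."""
    return Counter(labels).most_common()


# Hypothetical per-document topic labels; -1 marks outliers
labels = [5, 5, 5, -1, 0, 5, 3, -1]

freq = topic_frequencies(labels)
outlier_share = dict(freq).get(-1, 0) / len(labels)

print(freq)
print(f"outliers: {outlier_share:.1%}")
```

Applied to the counts above, the same arithmetic puts 152 of the 2209 documents in the outlier group.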


In [13]:
# Get a topic 
model.get_topic(5)[:10]

[('ipfv', 0.00029459507352350904),
 ('inf', 0.00029184607606157323),
 ('1sg', 0.00029084422394701524),
 ('dat', 0.00028680449951270067),
 ('3pl', 0.00028208019895627765),
 ('že', 0.00028170985984073957),
 ('ako', 0.00028097106167717615),
 ('iprf', 0.0002796271642093219),
 ('china', 0.000278205824134825),
 ('putin', 0.0002768406925860054)]
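
Each pair returned by `get_topic` is a word and its weight for that topic. BERTopic's documentation describes these weights as class-based TF-IDF (c-TF-IDF) scores. The sketch below is a simplified illustration of that idea on toy data, not the library's implementation: `class_tfidf`, the whitespace tokenizer, and the example sentences are all assumptions.

```python
import math
from collections import Counter


def class_tfidf(docs_per_topic, n_docs):
    """Score each word per topic as
    (word count in topic / total words in topic) * log(n_docs / word count overall)."""
    counts = {
        topic: Counter(" ".join(texts).lower().split())
        for topic, texts in docs_per_topic.items()
    }
    overall = Counter()
    for topic_counts in counts.values():
        overall.update(topic_counts)

    scores = {}
    for topic, topic_counts in counts.items():
        n_words = sum(topic_counts.values())
        weighted = [
            (word, (freq / n_words) * math.log(n_docs / overall[word]))
            for word, freq in topic_counts.items()
        ]
        # Highest-scoring words first, like the output of get_topic
        scores[topic] = sorted(weighted, key=lambda pair: pair[1], reverse=True)
    return scores


# Toy corpus: documents already grouped by (made-up) topic id
toy = {
    0: ["the cubs won the game", "the braves lost the game", "cubs pitching won the game"],
    1: ["putin visited china", "the china trade talks", "china and russia"],
}
scores = class_tfidf(toy, n_docs=6)
print(scores[1][:3])
```

Words that are frequent inside one topic but rare elsewhere rise to the top, while words spread across all topics, such as "the" here, sink to the bottom.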

In [16]:
import matplotlib.pyplot as plt
import pandas as pd
import umap
from sentence_transformers import SentenceTransformer

# BERTopic has no encode() method, so embed the documents with the sentence-transformers model directly
embedder = SentenceTransformer("distilbert-base-nli-mean-tokens")
embeddings = embedder.encode(docs, show_progress_bar=True)

# Prepare data
umap_data = umap.UMAP(n_neighbors=15, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
result = pd.DataFrame(umap_data, columns=['x', 'y'])
# Assumes `topics` from fit_transform above is a list with one topic label per document
result['labels'] = topics

# Visualize clusters
fig, ax = plt.subplots(figsize=(20, 10))
outliers = result.loc[result.labels == -1, :]
clustered = result.loc[result.labels != -1, :]
plt.scatter(outliers.x, outliers.y, color='#BDBDBD', s=0.05)
plt.scatter(clustered.x, clustered.y, c=clustered.labels, s=0.05, cmap='hsv_r')
plt.colorbar()

In [15]:
!pip install matplotlib

Collecting matplotlib
  Using cached matplotlib-3.3.2-cp38-cp38-manylinux1_x86_64.whl (11.6 MB)
Collecting kiwisolver>=1.0.1
  Using cached kiwisolver-1.2.0-cp38-cp38-manylinux1_x86_64.whl (92 kB)
Collecting cycler>=0.10
  Using cached cycler-0.10.0-py2.py3-none-any.whl (6.5 kB)
Installing collected packages: kiwisolver, cycler, matplotlib
Successfully installed cycler-0.10.0 kiwisolver-1.2.0 matplotlib-3.3.2


# Model serialization
The model and its internal settings can easily be saved. Note that the documents and embeddings will not be saved. However, UMAP and HDBSCAN will be saved. 

In [None]:
# Save model
model.save("my_model")	

In [None]:
# Load model
my_model = BERTopic.load("my_model")	

In [None]:
my_model.get_topic(4)[:10]

[('baseball', 0.01534818753609341),
 ('players', 0.01113384693242755),
 ('cubs', 0.010651317673247482),
 ('game', 0.01064425481072388),
 ('braves', 0.010439585241772109),
 ('pitching', 0.009477156669897367),
 ('games', 0.009166144809830891),
 ('runs', 0.009154570979537589),
 ('year', 0.008982491530594413),
 ('team', 0.00894693731063402)]