18 changes: 9 additions & 9 deletions README.md
@@ -1,4 +1,4 @@
[![PyPI - Python](https://img.shields.io/badge/python-v3.6+-blue.svg)](https://pypi.org/project/bertopic/)
[![PyPI - Python](https://img.shields.io/badge/python-v3.7+-blue.svg)](https://pypi.org/project/bertopic/)
[![Build](https://img.shields.io/github/workflow/status/MaartenGr/BERTopic/Code%20Checks/master)](https://pypi.org/project/bertopic/)
[![docs](https://img.shields.io/badge/docs-Passing-green.svg)](https://maartengr.github.io/BERTopic/)
[![PyPI - PyPi](https://img.shields.io/pypi/v/BERTopic)](https://pypi.org/project/bertopic/)
@@ -13,9 +13,9 @@ BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF
allowing for easily interpretable topics whilst keeping important words in the topic descriptions.

BERTopic supports
[**guided**](https://maartengr.github.io/BERTopic/tutorial/guided/guided.html),
(semi-) [**supervised**](https://maartengr.github.io/BERTopic/tutorial/supervised/supervised.html),
and [**dynamic**](https://maartengr.github.io/BERTopic/tutorial/topicsovertime/topicsovertime.html) topic modeling. It even supports visualizations similar to LDAvis!
[**guided**](https://maartengr.github.io/BERTopic/getting_started/guided/guided.html),
(semi-) [**supervised**](https://maartengr.github.io/BERTopic/getting_started/supervised/supervised.html),
and [**dynamic**](https://maartengr.github.io/BERTopic/getting_started/topicsovertime/topicsovertime.html) topic modeling. It even supports visualizations similar to LDAvis!

Corresponding medium posts can be found [here](https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6?source=friends_link&sk=0b5a470c006d1842ad4c8a3057063a99)
and [here](https://towardsdatascience.com/interactive-topic-modeling-with-bertopic-1ea55e7d73d8?sk=03c2168e9e74b6bda2a1f3ed953427e4).
@@ -54,7 +54,7 @@ with one of the examples below:


## Quick Start
We start by extracting topics from the well-known 20 newsgroups dataset which is comprised of english documents:
We start by extracting topics from the well-known 20 newsgroups dataset containing English documents:

```python
from bertopic import BERTopic
@@ -66,7 +66,7 @@ topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
```

After generating topics, we can access the frequent topics that were generated:
After generating topics and their probabilities, we can access the frequent topics that were generated:

```python
>>> topic_model.get_topic_info()
@@ -123,7 +123,7 @@ topic_model.visualize_barchart()


Find all possible visualizations with interactive examples in the documentation
[here](https://maartengr.github.io/BERTopic/tutorial/visualization/visualization.html).
[here](https://maartengr.github.io/BERTopic/getting_started/visualization/visualization.html).

## Embedding Models
BERTopic supports many embedding models that can be used to embed the documents and words:
@@ -151,7 +151,7 @@ roberta = TransformerDocumentEmbeddings('roberta-base')
topic_model = BERTopic(embedding_model=roberta)
```

Click [here](https://maartengr.github.io/BERTopic/tutorial/embeddings/embeddings.html)
Click [here](https://maartengr.github.io/BERTopic/getting_started/embeddings/embeddings.html)
for a full overview of all supported embedding models.

## Dynamic Topic Modeling
@@ -238,7 +238,7 @@ To cite BERTopic in your work, please use the following bibtex reference:
title = {BERTopic: Leveraging BERT and c-TF-IDF to create easily interpretable topics.},
year = 2020,
publisher = {Zenodo},
version = {v0.9.2},
version = {v0.9.4},
doi = {10.5281/zenodo.4381785},
url = {https://doi.org/10.5281/zenodo.4381785}
}
2 changes: 1 addition & 1 deletion bertopic/__init__.py
@@ -1,6 +1,6 @@
from bertopic._bertopic import BERTopic

__version__ = "0.9.3"
__version__ = "0.9.4"

__all__ = [
"BERTopic",
66 changes: 36 additions & 30 deletions bertopic/_bertopic.py
@@ -78,6 +78,7 @@ def __init__(self,
nr_topics: Union[int, str] = None,
low_memory: bool = False,
calculate_probabilities: bool = False,
diversity: float = None,
seed_topic_list: List[List[str]] = None,
embedding_model=None,
umap_model: UMAP = None,
@@ -105,8 +106,7 @@ def __init__(self,
number of topics to the value specified. This reduction can take
a while as each reduction in topics (-1) activates a c-TF-IDF
calculation. If this is set to None, no reduction is applied. Use
"auto" to automatically reduce topics that have a similarity of at
least 0.9, do not maps all others.
"auto" to automatically reduce topics using HDBSCAN.
low_memory: Sets UMAP low memory to True to make sure less memory is used.
calculate_probabilities: Whether to calculate the probabilities of all topics
per document instead of the probability of the assigned
@@ -116,6 +116,9 @@
you do not mind more computation time.
NOTE: If false you cannot use the corresponding
visualization method `visualize_probabilities`.
diversity: Whether to use MMR to diversify the resulting topic representations.
If set to None, MMR will not be used. Accepted values lie between
0 and 1 with 0 being not at all diverse and 1 being very diverse.
seed_topic_list: A list of seed words per topic to converge around
verbose: Changes the verbosity of the model. Set to True if you want
to track the stages of the model.
@@ -141,6 +144,7 @@ def __init__(self,
self.nr_topics = nr_topics
self.low_memory = low_memory
self.calculate_probabilities = calculate_probabilities
self.diversity = diversity
self.verbose = verbose
self.seed_topic_list = seed_topic_list

@@ -370,10 +374,14 @@ def transform(self,
verbose=self.verbose)

umap_embeddings = self.umap_model.transform(embeddings)
logger.info("Reduced dimensionality with UMAP")

predictions, probabilities = hdbscan.approximate_predict(self.hdbscan_model, umap_embeddings)
logger.info("Predicted clusters with HDBSCAN")

if self.calculate_probabilities:
probabilities = hdbscan.membership_vector(self.hdbscan_model, umap_embeddings)
logger.info("Calculated probabilities with HDBSCAN")
else:
probabilities = None

@@ -476,7 +484,7 @@ def topics_over_time(self,
selection = documents.loc[documents.Timestamps == timestamp, :]
documents_per_topic = selection.groupby(['Topic'], as_index=False).agg({'Document': ' '.join,
"Timestamps": "count"})
c_tf_idf, words = self._c_tf_idf(documents_per_topic, m=len(selection), fit=False)
c_tf_idf, words = self._c_tf_idf(documents_per_topic, fit=False)

if global_tuning or evolution_tuning:
c_tf_idf = normalize(c_tf_idf, axis=1, norm='l1', copy=False)
@@ -569,7 +577,7 @@ def topics_per_class(self,
selection = documents.loc[documents.Class == class_, :]
documents_per_topic = selection.groupby(['Topic'], as_index=False).agg({'Document': ' '.join,
"Class": "count"})
c_tf_idf, words = self._c_tf_idf(documents_per_topic, m=len(selection), fit=False)
c_tf_idf, words = self._c_tf_idf(documents_per_topic, fit=False)

# Fine-tune the timestamp c-TF-IDF representation based on the global c-TF-IDF representation
# by simply taking the average of the two
@@ -1107,8 +1115,8 @@ def visualize_hierarchy(self,
Either 'left' or 'bottom'
topics: A selection of topics to visualize
top_n_topics: Only select the top n most frequent topics
width: The width of the figure.
height: The height of the figure.
width: The width of the figure. Only works if orientation is set to 'left'
height: The height of the figure. Only works if orientation is set to 'bottom'

Returns:
fig: A plotly figure
@@ -1185,18 +1193,18 @@ def visualize_heatmap(self,

def visualize_barchart(self,
topics: List[int] = None,
top_n_topics: int = 6,
top_n_topics: int = 8,
n_words: int = 5,
width: int = 800,
height: int = 600) -> go.Figure:
width: int = 250,
height: int = 250) -> go.Figure:
""" Visualize a barchart of selected topics

Arguments:
topics: A selection of topics to visualize.
top_n_topics: Only select the top n most frequent topics.
n_words: Number of words to show in a topic
width: The width of the figure.
height: The height of the figure.
width: The width of each figure.
height: The height of each figure.

Returns:
fig: A plotly figure
@@ -1447,7 +1455,7 @@ def _extract_topics(self, documents: pd.DataFrame):
c_tf_idf: The resulting matrix giving a value (importance score) for each word per topic
"""
documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
self.c_tf_idf, words = self._c_tf_idf(documents_per_topic, m=len(documents))
self.c_tf_idf, words = self._c_tf_idf(documents_per_topic)
self.topics = self._extract_words_per_topic(words)
self._create_topic_vectors()
self.topic_names = {key: f"{key}_" + "_".join([word[0] for word in values[:4]])
@@ -1553,7 +1561,7 @@ def _create_topic_vectors(self):

self.topic_embeddings = topic_embeddings

def _c_tf_idf(self, documents_per_topic: pd.DataFrame, m: int, fit: bool = True) -> Tuple[csr_matrix, List[str]]:
def _c_tf_idf(self, documents_per_topic: pd.DataFrame, fit: bool = True) -> Tuple[csr_matrix, List[str]]:
""" Calculate a class-based TF-IDF where m is the number of total documents.

Arguments:
@@ -1581,7 +1589,7 @@ def _c_tf_idf(self, documents_per_topic: pd.DataFrame, fit: bool = True)
multiplier = None

if fit:
self.transformer = ClassTFIDF().fit(X, n_samples=m, multiplier=multiplier)
self.transformer = ClassTFIDF().fit(X, multiplier=multiplier)

c_tf_idf = self.transformer.transform(X)

@@ -1641,19 +1649,20 @@ def _extract_words_per_topic(self,

# Extract word embeddings for the top 30 words per topic and compare it
# with the topic embedding to keep only the words most similar to the topic embedding
if self.embedding_model is not None:
if self.diversity is not None:
if self.embedding_model is not None:

for topic, topic_words in topics.items():
words = [word[0] for word in topic_words]
word_embeddings = self._extract_embeddings(words,
method="word",
verbose=False)
topic_embedding = self._extract_embeddings(" ".join(words),
method="word",
verbose=False).reshape(1, -1)
topic_words = mmr(topic_embedding, word_embeddings, words,
top_n=self.top_n_words, diversity=0)
topics[topic] = [(word, value) for word, value in topics[topic] if word in topic_words]
for topic, topic_words in topics.items():
words = [word[0] for word in topic_words]
word_embeddings = self._extract_embeddings(words,
method="word",
verbose=False)
topic_embedding = self._extract_embeddings(" ".join(words),
method="word",
verbose=False).reshape(1, -1)
topic_words = mmr(topic_embedding, word_embeddings, words,
top_n=self.top_n_words, diversity=self.diversity)
topics[topic] = [(word, value) for word, value in topics[topic] if word in topic_words]
topics = {label: values[:self.top_n_words] for label, values in topics.items()}

return topics
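The MMR re-ranking that the new `diversity` parameter now controls can be sketched in isolation. This is a minimal stand-alone version operating on precomputed similarity scores, not BERTopic's exact `mmr` implementation; the similarity inputs are assumptions for illustration:

```python
def mmr(doc_sim, word_sims, words, top_n=5, diversity=0.3):
    """Maximal Marginal Relevance over precomputed similarities.

    doc_sim[i]     : similarity of word i to the topic embedding
    word_sims[i][j]: similarity between words i and j
    """
    # Start from the word most similar to the topic itself
    selected = [max(range(len(words)), key=lambda i: doc_sim[i])]
    candidates = [i for i in range(len(words)) if i != selected[0]]
    while candidates and len(selected) < top_n:
        def score(c):
            # Trade off relevance against redundancy with already-picked words
            redundancy = max(word_sims[c][s] for s in selected)
            return (1 - diversity) * doc_sim[c] - diversity * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return [words[i] for i in selected]
```

With `diversity=0` this reduces to plain similarity ranking, which matches the previous hard-coded behavior; higher values increasingly penalize near-duplicate topic words.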
@@ -1694,10 +1703,7 @@ def _reduce_to_n_topics(self, documents: pd.DataFrame) -> pd.DataFrame:
self.merged_topics = []

# Create topic similarity matrix
if self.topic_embeddings is not None:
similarities = cosine_similarity(np.array(self.topic_embeddings))
else:
similarities = cosine_similarity(self.c_tf_idf)
similarities = cosine_similarity(self.c_tf_idf)
np.fill_diagonal(similarities, 0)

# Find most similar topic to least common topic
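With the topic-embedding branch removed, reduction now always compares topics through their c-TF-IDF rows. A stdlib sketch of that comparison, using toy values rather than the scikit-learn `cosine_similarity` call:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy c-TF-IDF rows for three topics (illustrative values only)
c_tf_idf = [[1.0, 0.0, 0.5],
            [0.9, 0.1, 0.4],
            [0.0, 1.0, 0.0]]

# Pairwise similarity matrix with the diagonal zeroed out,
# mirroring cosine_similarity + np.fill_diagonal above
sims = [[0.0 if i == j else cosine(c_tf_idf[i], c_tf_idf[j])
         for j in range(3)] for i in range(3)]

# Topic 0's most similar topic, the candidate it would merge with
most_similar = max(range(3), key=lambda j: sims[0][j])
```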
18 changes: 14 additions & 4 deletions bertopic/_ctfidf.py
@@ -21,12 +21,12 @@ class ClassTFIDF(TfidfTransformer):
def __init__(self, *args, **kwargs):
super(ClassTFIDF, self).__init__(*args, **kwargs)

def fit(self, X: sp.csr_matrix, n_samples: int, multiplier: np.ndarray = None):
def fit(self, X: sp.csr_matrix, multiplier: np.ndarray = None):
"""Learn the idf vector (global term weights).

Arguments:
X: A matrix of term/token counts.
n_samples: Number of total documents
multiplier: A multiplier for increasing/decreasing certain IDF scores
"""
X = check_array(X, accept_sparse=('csr', 'csc'))
if not sp.issparse(X):
@@ -35,19 +35,29 @@ def fit(self, X: sp.csr_matrix, n_samples: int, multiplier: np.ndarray = None):

if self.use_idf:
_, n_features = X.shape

# Calculate the frequency of words across all classes
df = np.squeeze(np.asarray(X.sum(axis=0)))

# Calculate the average number of samples as regularization
avg_nr_samples = int(X.sum(axis=1).mean())
idf = np.log(avg_nr_samples / df)

# Divide the average number of samples by the word frequency
# +1 is added to force values to be positive
idf = np.log((avg_nr_samples / df)+1)

# Multiplier to increase/decrease certain idf scores
if multiplier is not None:
idf = idf * multiplier

self._idf_diag = sp.diags(idf, offsets=0,
shape=(n_features, n_features),
format='csr',
dtype=dtype)

return self
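The regularized IDF introduced above can be checked by hand on a toy count matrix. This is an illustrative pure-Python sketch of the same arithmetic, not the sparse-matrix implementation:

```python
import math

# Toy term counts, rows = classes (topics), columns = words
X = [[2, 0, 1],
     [0, 3, 1]]

# Frequency of each word summed across all classes
df = [sum(row[j] for row in X) for j in range(3)]

# Average number of tokens per class, used as regularization
avg_nr_samples = int(sum(sum(row) for row in X) / len(X))

# +1 inside the log keeps every score positive, even when a word's
# frequency exceeds the average class size
idf = [math.log(avg_nr_samples / d + 1) for d in df]
```

Without the `+1`, a word with `df > avg_nr_samples` would get a negative weight, which is exactly what the change to `np.log((avg_nr_samples / df)+1)` prevents.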

def transform(self, X: sp.csr_matrix, copy=True):
def transform(self, X: sp.csr_matrix):
"""Transform a count-based matrix to c-TF-IDF

Arguments:
33 changes: 18 additions & 15 deletions bertopic/plotting/_barchart.py
@@ -1,3 +1,4 @@
import itertools
import numpy as np
from typing import List

@@ -7,19 +8,19 @@

def visualize_barchart(topic_model,
topics: List[int] = None,
top_n_topics: int = 6,
top_n_topics: int = 8,
n_words: int = 5,
width: int = 800,
height: int = 600) -> go.Figure:
width: int = 250,
height: int = 250) -> go.Figure:
""" Visualize a barchart of selected topics

Arguments:
topic_model: A fitted BERTopic instance.
topics: A selection of topics to visualize.
top_n_topics: Only select the top n most frequent topics.
n_words: Number of words to show in a topic
width: The width of the figure.
height: The height of the figure.
width: The width of each figure.
height: The height of each figure.

Returns:
fig: A plotly figure
@@ -39,9 +40,11 @@ def visualize_barchart(topic_model,
fig = topic_model.visualize_barchart()
fig.write_html("path/to/file.html")
```
<iframe src="../../tutorial/visualization/bar_chart.html"
<iframe src="../../getting_started/visualization/bar_chart.html"
style="width:1100px; height: 660px; border: 0px;""></iframe>
"""
colors = itertools.cycle(["#D55E00", "#0072B2", "#CC79A7", "#E69F00", "#56B4E9", "#009E73", "#F0E442"])

# Select topics based on top_n and topics args
if topics is not None:
topics = list(topics)
@@ -52,13 +55,13 @@

# Initialize figure
subplot_titles = [f"Topic {topic}" for topic in topics]
columns = 3
columns = 4
rows = int(np.ceil(len(topics) / columns))
fig = make_subplots(rows=rows,
cols=columns,
shared_xaxes=True,
horizontal_spacing=.15,
vertical_spacing=.15,
shared_xaxes=False,
horizontal_spacing=.1,
vertical_spacing=.4 / rows if rows > 1 else 0,
subplot_titles=subplot_titles)

# Add barchart for each topic
@@ -71,7 +74,8 @@ def visualize_barchart(topic_model,
fig.add_trace(
go.Bar(x=scores,
y=words,
orientation='h'),
orientation='h',
marker_color=next(colors)),
row=row, col=column)

if column == columns:
@@ -86,16 +90,15 @@ def visualize_barchart(topic_model,
showlegend=False,
title={
'text': "<b>Topic Word Scores",
'y': .95,
'x': .15,
'x': .5,
'xanchor': 'center',
'yanchor': 'top',
'font': dict(
size=22,
color="Black")
},
width=width,
height=height,
width=width*4,
height=height*rows if rows > 1 else height * 1.3,
hoverlabel=dict(
bgcolor="white",
font_size=16,
Expand Down
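The barchart's new grid and sizing logic follows directly from the per-subplot defaults. A small sketch of the arithmetic, assuming the new defaults of `top_n_topics=8`, `width=250`, `height=250`, and 4 columns:

```python
import math

width, height, columns = 250, 250, 4  # new per-subplot defaults in this PR
n_topics = 8                          # new top_n_topics default

# One row per group of `columns` topics
rows = math.ceil(n_topics / columns)

# Spacing shrinks as rows are added, matching the make_subplots call
vertical_spacing = .4 / rows if rows > 1 else 0

# Overall figure dimensions scale with the grid, not fixed 800x600
fig_width = width * 4
fig_height = height * rows if rows > 1 else height * 1.3
```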
2 changes: 1 addition & 1 deletion bertopic/plotting/_distribution.py
@@ -32,7 +32,7 @@ def visualize_distribution(topic_model,
fig = topic_model.visualize_distribution(probabilities[0])
fig.write_html("path/to/file.html")
```
<iframe src="../../tutorial/visualization/probabilities.html"
<iframe src="../../getting_started/visualization/probabilities.html"
style="width:1000px; height: 500px; border: 0px;""></iframe>
"""
if len(probabilities.shape) != 1: