18 changes: 9 additions & 9 deletions README.md
@@ -1,4 +1,4 @@
[![PyPI - Python](https://img.shields.io/badge/python-v3.6+-blue.svg)](https://pypi.org/project/bertopic/)
[![PyPI - Python](https://img.shields.io/badge/python-v3.7+-blue.svg)](https://pypi.org/project/bertopic/)
[![Build](https://img.shields.io/github/workflow/status/MaartenGr/BERTopic/Code%20Checks/master)](https://pypi.org/project/bertopic/)
[![docs](https://img.shields.io/badge/docs-Passing-green.svg)](https://maartengr.github.io/BERTopic/)
[![PyPI - PyPi](https://img.shields.io/pypi/v/BERTopic)](https://pypi.org/project/bertopic/)
@@ -13,9 +13,9 @@ BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF
allowing for easily interpretable topics whilst keeping important words in the topic descriptions.

BERTopic supports
[**guided**](https://maartengr.github.io/BERTopic/tutorial/guided/guided.html),
(semi-) [**supervised**](https://maartengr.github.io/BERTopic/tutorial/supervised/supervised.html),
and [**dynamic**](https://maartengr.github.io/BERTopic/tutorial/topicsovertime/topicsovertime.html) topic modeling. It even supports visualizations similar to LDAvis!
[**guided**](https://maartengr.github.io/BERTopic/getting_started/guided/guided.html),
(semi-) [**supervised**](https://maartengr.github.io/BERTopic/getting_started/supervised/supervised.html),
and [**dynamic**](https://maartengr.github.io/BERTopic/getting_started/topicsovertime/topicsovertime.html) topic modeling. It even supports visualizations similar to LDAvis!

Corresponding medium posts can be found [here](https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6?source=friends_link&sk=0b5a470c006d1842ad4c8a3057063a99)
and [here](https://towardsdatascience.com/interactive-topic-modeling-with-bertopic-1ea55e7d73d8?sk=03c2168e9e74b6bda2a1f3ed953427e4).
@@ -54,7 +54,7 @@ with one of the examples below:


## Quick Start
We start by extracting topics from the well-known 20 newsgroups dataset which is comprised of english documents:
We start by extracting topics from the well-known 20 newsgroups dataset containing English documents:

```python
from bertopic import BERTopic
@@ -66,7 +66,7 @@ topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
```

After generating topics, we can access the frequent topics that were generated:
After generating topics and their probabilities, we can access the frequent topics that were generated:

```python
>>> topic_model.get_topic_info()
@@ -123,7 +123,7 @@ topic_model.visualize_barchart()


Find all possible visualizations with interactive examples in the documentation
[here](https://maartengr.github.io/BERTopic/tutorial/visualization/visualization.html).
[here](https://maartengr.github.io/BERTopic/getting_started/visualization/visualization.html).

## Embedding Models
BERTopic supports many embedding models that can be used to embed the documents and words:
@@ -151,7 +151,7 @@ roberta = TransformerDocumentEmbeddings('roberta-base')
topic_model = BERTopic(embedding_model=roberta)
```

Click [here](https://maartengr.github.io/BERTopic/tutorial/embeddings/embeddings.html)
Click [here](https://maartengr.github.io/BERTopic/getting_started/embeddings/embeddings.html)
for a full overview of all supported embedding models.

## Dynamic Topic Modeling
@@ -238,7 +238,7 @@ To cite BERTopic in your work, please use the following bibtex reference:
title = {BERTopic: Leveraging BERT and c-TF-IDF to create easily interpretable topics.},
year = 2020,
publisher = {Zenodo},
version = {v0.9.2},
version = {v0.9.4},
doi = {10.5281/zenodo.4381785},
url = {https://doi.org/10.5281/zenodo.4381785}
}
2 changes: 1 addition & 1 deletion bertopic/__init__.py
@@ -1,6 +1,6 @@
from bertopic._bertopic import BERTopic

__version__ = "0.9.3"
__version__ = "0.9.4"

__all__ = [
"BERTopic",
66 changes: 36 additions & 30 deletions bertopic/_bertopic.py
@@ -78,6 +78,7 @@ def __init__(self,
nr_topics: Union[int, str] = None,
low_memory: bool = False,
calculate_probabilities: bool = False,
diversity: float = None,
seed_topic_list: List[List[str]] = None,
embedding_model=None,
umap_model: UMAP = None,
@@ -105,8 +106,7 @@ def __init__(self,
number of topics to the value specified. This reduction can take
a while as each reduction in topics (-1) activates a c-TF-IDF
calculation. If this is set to None, no reduction is applied. Use
"auto" to automatically reduce topics that have a similarity of at
least 0.9, do not maps all others.
"auto" to automatically reduce topics using HDBSCAN.
low_memory: Sets UMAP low memory to True to make sure less memory is used.
calculate_probabilities: Whether to calculate the probabilities of all topics
per document instead of the probability of the assigned
@@ -116,6 +116,9 @@
you do not mind more computation time.
NOTE: If false you cannot use the corresponding
visualization method `visualize_probabilities`.
diversity: Whether to use MMR to diversify the resulting topic representations.
If set to None, MMR will not be used. Accepted values lie between
0 and 1 with 0 being not at all diverse and 1 being very diverse.
seed_topic_list: A list of seed words per topic to converge around
verbose: Changes the verbosity of the model. Set to True if you want
to track the stages of the model.
@@ -141,6 +144,7 @@ def __init__(self,
self.nr_topics = nr_topics
self.low_memory = low_memory
self.calculate_probabilities = calculate_probabilities
self.diversity = diversity
self.verbose = verbose
self.seed_topic_list = seed_topic_list

@@ -370,10 +374,14 @@ def transform(self,
verbose=self.verbose)

umap_embeddings = self.umap_model.transform(embeddings)
logger.info("Reduced dimensionality with UMAP")

predictions, probabilities = hdbscan.approximate_predict(self.hdbscan_model, umap_embeddings)
logger.info("Predicted clusters with HDBSCAN")

if self.calculate_probabilities:
probabilities = hdbscan.membership_vector(self.hdbscan_model, umap_embeddings)
logger.info("Calculated probabilities with HDBSCAN")
else:
probabilities = None

@@ -476,7 +484,7 @@ def topics_over_time(self,
selection = documents.loc[documents.Timestamps == timestamp, :]
documents_per_topic = selection.groupby(['Topic'], as_index=False).agg({'Document': ' '.join,
"Timestamps": "count"})
c_tf_idf, words = self._c_tf_idf(documents_per_topic, m=len(selection), fit=False)
c_tf_idf, words = self._c_tf_idf(documents_per_topic, fit=False)

if global_tuning or evolution_tuning:
c_tf_idf = normalize(c_tf_idf, axis=1, norm='l1', copy=False)
@@ -569,7 +577,7 @@ def topics_per_class(self,
selection = documents.loc[documents.Class == class_, :]
documents_per_topic = selection.groupby(['Topic'], as_index=False).agg({'Document': ' '.join,
"Class": "count"})
c_tf_idf, words = self._c_tf_idf(documents_per_topic, m=len(selection), fit=False)
c_tf_idf, words = self._c_tf_idf(documents_per_topic, fit=False)

# Fine-tune the timestamp c-TF-IDF representation based on the global c-TF-IDF representation
# by simply taking the average of the two
@@ -1107,8 +1115,8 @@ def visualize_hierarchy(self,
Either 'left' or 'bottom'
topics: A selection of topics to visualize
top_n_topics: Only select the top n most frequent topics
width: The width of the figure.
height: The height of the figure.
width: The width of the figure. Only works if orientation is set to 'left'
height: The height of the figure. Only works if orientation is set to 'bottom'

Returns:
fig: A plotly figure
@@ -1185,18 +1193,18 @@ def visualize_heatmap(self,

def visualize_barchart(self,
topics: List[int] = None,
top_n_topics: int = 6,
top_n_topics: int = 8,
n_words: int = 5,
width: int = 800,
height: int = 600) -> go.Figure:
width: int = 250,
height: int = 250) -> go.Figure:
""" Visualize a barchart of selected topics

Arguments:
topics: A selection of topics to visualize.
top_n_topics: Only select the top n most frequent topics.
n_words: Number of words to show in a topic
width: The width of the figure.
height: The height of the figure.
width: The width of each figure.
height: The height of each figure.

Returns:
fig: A plotly figure
@@ -1447,7 +1455,7 @@ def _extract_topics(self, documents: pd.DataFrame):
c_tf_idf: The resulting matrix giving a value (importance score) for each word per topic
"""
documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
self.c_tf_idf, words = self._c_tf_idf(documents_per_topic, m=len(documents))
self.c_tf_idf, words = self._c_tf_idf(documents_per_topic)
self.topics = self._extract_words_per_topic(words)
self._create_topic_vectors()
self.topic_names = {key: f"{key}_" + "_".join([word[0] for word in values[:4]])
@@ -1553,7 +1561,7 @@ def _create_topic_vectors(self):

self.topic_embeddings = topic_embeddings

def _c_tf_idf(self, documents_per_topic: pd.DataFrame, m: int, fit: bool = True) -> Tuple[csr_matrix, List[str]]:
def _c_tf_idf(self, documents_per_topic: pd.DataFrame, fit: bool = True) -> Tuple[csr_matrix, List[str]]:
""" Calculate a class-based TF-IDF where m is the number of total documents.

Arguments:
@@ -1581,7 +1589,7 @@ def _c_tf_idf(self, documents_per_topic: pd.DataFrame, fit: bool = True)
multiplier = None

if fit:
self.transformer = ClassTFIDF().fit(X, n_samples=m, multiplier=multiplier)
self.transformer = ClassTFIDF().fit(X, multiplier=multiplier)

c_tf_idf = self.transformer.transform(X)

@@ -1641,19 +1649,20 @@ def _extract_words_per_topic(self,

# Extract word embeddings for the top 30 words per topic and compare it
# with the topic embedding to keep only the words most similar to the topic embedding
if self.embedding_model is not None:
if self.diversity is not None:
if self.embedding_model is not None:

for topic, topic_words in topics.items():
words = [word[0] for word in topic_words]
word_embeddings = self._extract_embeddings(words,
method="word",
verbose=False)
topic_embedding = self._extract_embeddings(" ".join(words),
method="word",
verbose=False).reshape(1, -1)
topic_words = mmr(topic_embedding, word_embeddings, words,
top_n=self.top_n_words, diversity=0)
topics[topic] = [(word, value) for word, value in topics[topic] if word in topic_words]
for topic, topic_words in topics.items():
words = [word[0] for word in topic_words]
word_embeddings = self._extract_embeddings(words,
method="word",
verbose=False)
topic_embedding = self._extract_embeddings(" ".join(words),
method="word",
verbose=False).reshape(1, -1)
topic_words = mmr(topic_embedding, word_embeddings, words,
top_n=self.top_n_words, diversity=self.diversity)
topics[topic] = [(word, value) for word, value in topics[topic] if word in topic_words]
topics = {label: values[:self.top_n_words] for label, values in topics.items()}

return topics
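The MMR re-ranking that the new `diversity` parameter now controls can be sketched in isolation. This is a minimal stand-alone version operating on precomputed similarity scores, not BERTopic's exact `mmr` implementation; the similarity inputs are assumptions for illustration:

```python
def mmr(doc_sim, word_sims, words, top_n=5, diversity=0.3):
    """Maximal Marginal Relevance over precomputed similarities.

    doc_sim[i]     : similarity of word i to the topic embedding
    word_sims[i][j]: similarity between words i and j
    """
    # Start from the word most similar to the topic itself
    selected = [max(range(len(words)), key=lambda i: doc_sim[i])]
    candidates = [i for i in range(len(words)) if i != selected[0]]
    while candidates and len(selected) < top_n:
        def score(c):
            # Trade off relevance against redundancy with already-picked words
            redundancy = max(word_sims[c][s] for s in selected)
            return (1 - diversity) * doc_sim[c] - diversity * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return [words[i] for i in selected]
```

With `diversity=0` this reduces to plain similarity ranking, which matches the previous hard-coded behavior; higher values increasingly penalize near-duplicate topic words.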
@@ -1694,10 +1703,7 @@ def _reduce_to_n_topics(self, documents: pd.DataFrame) -> pd.DataFrame:
self.merged_topics = []

# Create topic similarity matrix
if self.topic_embeddings is not None:
similarities = cosine_similarity(np.array(self.topic_embeddings))
else:
similarities = cosine_similarity(self.c_tf_idf)
similarities = cosine_similarity(self.c_tf_idf)
np.fill_diagonal(similarities, 0)

# Find most similar topic to least common topic
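With the topic-embedding branch removed, reduction now always compares topics through their c-TF-IDF rows. A stdlib sketch of that comparison, using toy values rather than the scikit-learn `cosine_similarity` call:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy c-TF-IDF rows for three topics (illustrative values only)
c_tf_idf = [[1.0, 0.0, 0.5],
            [0.9, 0.1, 0.4],
            [0.0, 1.0, 0.0]]

# Pairwise similarity matrix with the diagonal zeroed out,
# mirroring cosine_similarity + np.fill_diagonal above
sims = [[0.0 if i == j else cosine(c_tf_idf[i], c_tf_idf[j])
         for j in range(3)] for i in range(3)]

# Topic 0's most similar topic, the candidate it would merge with
most_similar = max(range(3), key=lambda j: sims[0][j])
```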
18 changes: 14 additions & 4 deletions bertopic/_ctfidf.py
@@ -21,12 +21,12 @@ class ClassTFIDF(TfidfTransformer):
def __init__(self, *args, **kwargs):
super(ClassTFIDF, self).__init__(*args, **kwargs)

def fit(self, X: sp.csr_matrix, n_samples: int, multiplier: np.ndarray = None):
def fit(self, X: sp.csr_matrix, multiplier: np.ndarray = None):
"""Learn the idf vector (global term weights).

Arguments:
X: A matrix of term/token counts.
n_samples: Number of total documents
multiplier: A multiplier for increasing/decreasing certain IDF scores
"""
X = check_array(X, accept_sparse=('csr', 'csc'))
if not sp.issparse(X):
@@ -35,19 +35,29 @@ def fit(self, X: sp.csr_matrix, n_samples: int, multiplier: np.ndarray = None):

if self.use_idf:
_, n_features = X.shape

# Calculate the frequency of words across all classes
df = np.squeeze(np.asarray(X.sum(axis=0)))

# Calculate the average number of samples as regularization
avg_nr_samples = int(X.sum(axis=1).mean())
idf = np.log(avg_nr_samples / df)

# Divide the average number of samples by the word frequency
# +1 is added to force values to be positive
idf = np.log((avg_nr_samples / df)+1)

# Multiplier to increase/decrease certain idf scores
if multiplier is not None:
idf = idf * multiplier

self._idf_diag = sp.diags(idf, offsets=0,
shape=(n_features, n_features),
format='csr',
dtype=dtype)

return self
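The regularized IDF introduced above can be checked by hand on a toy count matrix. This is an illustrative pure-Python sketch of the same arithmetic, not the sparse-matrix implementation:

```python
import math

# Toy term counts, rows = classes (topics), columns = words
X = [[2, 0, 1],
     [0, 3, 1]]

# Frequency of each word summed across all classes
df = [sum(row[j] for row in X) for j in range(3)]

# Average number of tokens per class, used as regularization
avg_nr_samples = int(sum(sum(row) for row in X) / len(X))

# +1 inside the log keeps every score positive, even when a word's
# frequency exceeds the average class size
idf = [math.log(avg_nr_samples / d + 1) for d in df]
```

Without the `+1`, a word with `df > avg_nr_samples` would get a negative weight, which is exactly what the change to `np.log((avg_nr_samples / df)+1)` prevents.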

def transform(self, X: sp.csr_matrix, copy=True):
def transform(self, X: sp.csr_matrix):
"""Transform a count-based matrix to c-TF-IDF

Arguments:
33 changes: 18 additions & 15 deletions bertopic/plotting/_barchart.py
@@ -1,3 +1,4 @@
import itertools
import numpy as np
from typing import List

@@ -7,19 +8,19 @@

def visualize_barchart(topic_model,
topics: List[int] = None,
top_n_topics: int = 6,
top_n_topics: int = 8,
n_words: int = 5,
width: int = 800,
height: int = 600) -> go.Figure:
width: int = 250,
height: int = 250) -> go.Figure:
""" Visualize a barchart of selected topics

Arguments:
topic_model: A fitted BERTopic instance.
topics: A selection of topics to visualize.
top_n_topics: Only select the top n most frequent topics.
n_words: Number of words to show in a topic
width: The width of the figure.
height: The height of the figure.
width: The width of each figure.
height: The height of each figure.

Returns:
fig: A plotly figure
@@ -39,9 +40,11 @@ def visualize_barchart(topic_model,
fig = topic_model.visualize_barchart()
fig.write_html("path/to/file.html")
```
<iframe src="../../tutorial/visualization/bar_chart.html"
<iframe src="../../getting_started/visualization/bar_chart.html"
style="width:1100px; height: 660px; border: 0px;""></iframe>
"""
colors = itertools.cycle(["#D55E00", "#0072B2", "#CC79A7", "#E69F00", "#56B4E9", "#009E73", "#F0E442"])

# Select topics based on top_n and topics args
if topics is not None:
topics = list(topics)
@@ -52,13 +55,13 @@

# Initialize figure
subplot_titles = [f"Topic {topic}" for topic in topics]
columns = 3
columns = 4
rows = int(np.ceil(len(topics) / columns))
fig = make_subplots(rows=rows,
cols=columns,
shared_xaxes=True,
horizontal_spacing=.15,
vertical_spacing=.15,
shared_xaxes=False,
horizontal_spacing=.1,
vertical_spacing=.4 / rows if rows > 1 else 0,
subplot_titles=subplot_titles)

# Add barchart for each topic
@@ -71,7 +74,8 @@ def visualize_barchart(topic_model,
fig.add_trace(
go.Bar(x=scores,
y=words,
orientation='h'),
orientation='h',
marker_color=next(colors)),
row=row, col=column)

if column == columns:
@@ -86,16 +90,15 @@ def visualize_barchart(topic_model,
showlegend=False,
title={
'text': "<b>Topic Word Scores",
'y': .95,
'x': .15,
'x': .5,
'xanchor': 'center',
'yanchor': 'top',
'font': dict(
size=22,
color="Black")
},
width=width,
height=height,
width=width*4,
height=height*rows if rows > 1 else height * 1.3,
hoverlabel=dict(
bgcolor="white",
font_size=16,
Expand Down
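The barchart's new grid and sizing logic follows directly from the per-subplot defaults. A small sketch of the arithmetic, assuming the new defaults of `top_n_topics=8`, `width=250`, `height=250`, and 4 columns:

```python
import math

width, height, columns = 250, 250, 4  # new per-subplot defaults in this PR
n_topics = 8                          # new top_n_topics default

# One row per group of `columns` topics
rows = math.ceil(n_topics / columns)

# Spacing shrinks as rows are added, matching the make_subplots call
vertical_spacing = .4 / rows if rows > 1 else 0

# Overall figure dimensions scale with the grid, not fixed 800x600
fig_width = width * 4
fig_height = height * rows if rows > 1 else height * 1.3
```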
2 changes: 1 addition & 1 deletion bertopic/plotting/_distribution.py
@@ -32,7 +32,7 @@ def visualize_distribution(topic_model,
fig = topic_model.visualize_distribution(probabilities[0])
fig.write_html("path/to/file.html")
```
<iframe src="../../tutorial/visualization/probabilities.html"
<iframe src="../../getting_started/visualization/probabilities.html"
style="width:1000px; height: 500px; border: 0px;""></iframe>
"""
if len(probabilities.shape) != 1: