Welcome to this tutorial! 

To begin, install the requirements, fetch the data, and train the model.
After that, the notebook lists questions that you can answer with the trained BERTopic model. Explore BERTopic's functionality to answer them.

Note: The BERTopic visualizations do not work in Jupyter Lab. Please use Jupyter Notebook instead.

### Install requirements

In [None]:
!pip install -r requirements-core.txt --user

### Imports etc.

In [None]:
# NOTE: After properly installing the requirements you might have
# to restart your kernel to be able to make the imports.
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic
import plotly.offline as pyo
import plotly.graph_objs as go
pyo.init_notebook_mode()

### Fetch and check out the data

In [None]:
newsgroups_train = fetch_20newsgroups(subset='train')

In [None]:
data = newsgroups_train.data

In [None]:
type(data)

In [None]:
len(data)

In [None]:
# We only take the first 1000 documents to speed up the computation.
# Be aware that this will produce worse topics;
# normally you would want to use as much data as possible.
data = data[:1000]
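
Taking the first 1000 documents is the simplest way to shrink the corpus. As an alternative, the sketch below draws a reproducible random subset instead. It is only an illustration: `sample_documents` is a hypothetical helper, not part of BERTopic or scikit-learn, and the stand-in `corpus` list replaces the real `data` so that the example runs on its own.

```python
import random

def sample_documents(docs, n, seed=42):
    """Return n documents drawn at random without replacement (hypothetical helper)."""
    rng = random.Random(seed)   # fixed seed keeps the subset reproducible
    n = min(n, len(docs))       # never ask for more documents than exist
    return rng.sample(docs, n)

# Stand-in corpus; in the notebook you would pass the full `data` list instead.
corpus = [f"document {i}" for i in range(5000)]
subset = sample_documents(corpus, 1000)
print(len(subset))
```

A random subset is less likely than a plain slice to reflect only the ordering of the source, although either choice still gives the model fewer documents to learn from.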

In [None]:
# example
print(data[0])

### BERT time!

Link to the documentation of the BERTopic API: 

https://maartengr.github.io/BERTopic/api/bertopic.html

In [None]:
# initialize the model
model = BERTopic(language='english')

In [None]:
# fit the model on the documents
# on 1000 entries this should only take 1-2 minutes
topics, probs = model.fit_transform(data)

In [None]:
# topics: a list of integers showing which topic each document is assigned to
# (-1 marks outlier documents that belong to no topic).
topics

In [None]:
len(topics), len(data)

In [None]:
# TODO: Check out the results to be able to answer the following questions:

#### Q1: How many topics has the model identified? 

In [None]:
# for Q1, 2, and 3
model.get_topic_info()

#### Q2: How many documents are assigned to each topic? 

#### Q3: What are the names of the topics? 
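
The numbers behind Q1 and Q2 can also be derived directly from the list of topic assignments. The sketch below assumes a small made-up list, `example_topics`, shaped like the `topics` list returned by `fit_transform`; in the notebook you would pass `topics` itself. It relies on BERTopic's convention that `-1` marks outliers.

```python
from collections import Counter

# Hypothetical topic assignments, shaped like the `topics` list from fit_transform.
example_topics = [0, 1, -1, 0, 2, 1, 0, -1, 2, 0]

counts = Counter(example_topics)

# -1 is the outlier label, so exclude it when counting real topics.
n_topics = len([topic_id for topic_id in counts if topic_id != -1])
print("number of topics:", n_topics)

# Documents per topic, largest first.
for topic_id, size in counts.most_common():
    print(topic_id, size)
```

The result should agree with the counts reported by `model.get_topic_info()`, which additionally lists a name for each topic.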

#### Q4: Can you get the most representative documents for a topic of your choice? 

In [None]:
# plug any topic id into the function
for doc in model.get_representative_docs(0):
    print('#################')
    print(doc)

#### Q5: Which visualizations can you make to give more detailed insights into the topics? 

In [None]:
# any function from the 'BERTopic' object with 'visualize' in it

In [None]:
# only works on larger datasets (a few thousand documents)
model.visualize_topics()

In [None]:
model.visualize_heatmap()

In [None]:
model.visualize_hierarchy()

#### Q6: Can you find topics that match certain keywords? 

In [None]:
keyword = "article"
similar_topics, similarity = model.find_topics(keyword, top_n=5)
model.get_topic(similar_topics[0])
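
Conceptually, a keyword search like this embeds the search term and ranks topics by how similar their embeddings are to it. The sketch below illustrates that idea with cosine similarity on tiny hand-written vectors. It is an assumption-laden simplification, not BERTopic's actual implementation: `cosine`, `rank_topics`, and all vectors are made up for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_topics(query_vec, topic_vecs, top_n=5):
    """Return (topic ids, similarities) for the top_n most similar topics."""
    scores = {topic_id: cosine(query_vec, vec) for topic_id, vec in topic_vecs.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]
    return [topic_id for topic_id, _ in ranked], [score for _, score in ranked]

# Made-up 3-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions.
topic_vecs = {0: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0], 2: [0.7, 0.7, 0.0]}
query_vec = [0.9, 0.1, 0.0]

ids, sims = rank_topics(query_vec, topic_vecs, top_n=2)
print(ids, sims)
```

Like `find_topics`, the helper returns two parallel lists, so the best match is at index 0 of both.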

#### Q7: Can you reduce the number of topics to create more generalized representations? 

In [None]:
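# Create a new list of topic assignments and update the 'model' object.
# NOTE: this call signature matches older BERTopic releases; newer releases
# changed it, so check the API documentation for the version you installed.
new_topics, new_probs = model.reduce_topics(data, topics, probabilities=probs, nr_topics='auto')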
new_topics

#### Q8: Can you update the topic representations? 

In [None]:
# add n-grams (1 to 3 words) to the topic representations
model.update_topics(data, topics, n_gram_range=(1, 3))
model.get_topic(0)[:10]

In [None]:
# remove English stop words while still using 1- to 3-grams
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 3))
model.update_topics(data, topics, vectorizer_model=vectorizer_model)
model.get_topic(0)[:10]
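
To make the two settings above more concrete, the sketch below shows in plain Python what stop-word removal and an n-gram range of (1, 3) do to a single sentence. It is a simplified illustration, not `CountVectorizer` itself: `ngrams` is a hypothetical helper, the stop-word set is a tiny made-up list, and tokenization is a plain whitespace split.

```python
def ngrams(tokens, n_min=1, n_max=3):
    """Return all word n-grams with n_min <= n <= n_max, shortest first."""
    result = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[start:start + n]))
    return result

# Tiny illustrative stop-word list; scikit-learn's "english" list is much longer.
STOP_WORDS = {"the", "is", "a", "of"}

text = "the engine of the car is loud"

# Stop words are dropped first, then n-grams are built from the remaining tokens.
tokens = [token for token in text.lower().split() if token not in STOP_WORDS]
print(tokens)
print(ngrams(tokens))
```

Because stop words are removed before the n-grams are formed, phrases such as "engine car" can appear even though the two words were not adjacent in the original sentence.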

In [None]:
model.get_topic_info()