# Session 9b - Dynamic topic modelling

# Data
This time, we're going to load and work with an English-language novel. The goal of *dynamic topic modelling* is to study how topics evolve and interact over time.

In this case, we want to know how different topics evolve and change over the course of a single novel.

In [None]:
import os
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

In [None]:
with open("path/to/novel", "r") as f:
    text = f.read()

We're first going to do some *very* crude pre-processing by splitting the text into sentences wherever a period (full stop) occurs.

**Question:** How might we improve this step?

In [None]:
docs = text.split(".")
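One possible answer to the question above, as a minimal sketch rather than a definitive tokenizer: split on sentence-ending punctuation followed by whitespace, then drop empty fragments. The helper name `split_sentences` and the `min_chars` threshold are illustrative assumptions; a dedicated sentence tokenizer (for example from NLTK or spaCy) would handle abbreviations such as "Mr." much better.

```python
import re

def split_sentences(text, min_chars=2):
    """Split text on ., ! or ? followed by whitespace (hypothetical helper)."""
    # Collapse line breaks and repeated spaces so sentences are not cut mid-line
    normalized = re.sub(r"\s+", " ", text)
    # Split after sentence-ending punctuation, keeping the punctuation itself
    parts = re.split(r"(?<=[.!?])\s+", normalized)
    # Drop empty or very short fragments left over from the split
    return [p.strip() for p in parts if len(p.strip()) >= min_chars]

sample = "It was late.  She waited!\nWould he come? Nobody knew."
sentences = split_sentences(sample)
print(sentences)
```

Unlike `text.split(".")`, this keeps question and exclamation marks as sentence boundaries and avoids empty strings in `docs`.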

# **Topic Modeling**


## Training

We start by instantiating BERTopic. We set language to `english` since our documents are in the English language.

We will also calculate the topic probabilities which will be useful later.

**NB:** On a 4 CPU UCloud instance, this takes around 13 minutes to run!


In [None]:
# initialize the model
topic_model = BERTopic(language="english", 
                       calculate_probabilities=True, 
                       verbose=True)

# notice how we use the same fit_transform logic as we've seen before
topics, probs = topic_model.fit_transform(docs)

**NOTE**: Use `language="multilingual"` to select a model that supports 50+ languages.

## Extracting Topics
After fitting our model, we can start by looking at the results. Typically, we look at the most frequent topics first as they best represent the collection of documents. 

In [None]:
freq = topic_model.get_topic_info()
freq.head(5)

Topic `-1` collects all outliers and should typically be ignored. Next, let's take a look at one of the frequent topics that were generated:

In [None]:
topic_model.get_topic(2)  # Inspect one of the frequent topics (topic 0 is the largest)

__Get the topics assigned to the first 10 documents__

In [None]:
topic_model.topics_[:10]
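Since `topics_` holds one topic label per document, counting the labels gives the size of each topic. The sketch below does this on a made-up list (`toy_topics` is invented for illustration, not model output) and sets the outlier label `-1` aside.

```python
from collections import Counter

# Invented topic assignments, one per document; -1 marks an outlier
toy_topics = [0, -1, 2, 0, 1, -1, 0, 2, -1, 0]

counts = Counter(toy_topics)
n_outliers = counts.pop(-1, 0)   # set the outlier label aside
ranked = counts.most_common()    # (topic, count) pairs, largest first

print("outliers:", n_outliers)
print("topics by size:", ranked)
```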

# **Visualization**

## Visualize Topics

In [None]:
topic_model.visualize_topics()

## Visualize Topic Probabilities

In [None]:
topic_model.visualize_distribution(probs[10])
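With `calculate_probabilities=True`, `probs` holds one row per document and one column per topic, so `probs[10]` is the distribution for the eleventh document. The sketch below uses an invented row (`toy_row`) to show how the most likely topics can be read off such a distribution.

```python
# Invented topic distribution for one document; list index = topic id
toy_row = [0.05, 0.40, 0.10, 0.30, 0.15]

# Pair each topic id with its probability and sort, most likely first
ranked = sorted(enumerate(toy_row), key=lambda pair: pair[1], reverse=True)
top_three = ranked[:3]

for topic_id, prob in top_three:
    print(f"topic {topic_id}: {prob:.2f}")
```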

## Visualize Topic Hierarchy

In [None]:
topic_model.visualize_hierarchy(top_n_topics=50)

## Visualize Terms

In [None]:
topic_model.visualize_barchart(top_n_topics=20)

## Visualize Topic Similarity

In [None]:
topic_model.visualize_heatmap(n_clusters=20, 
                              width=1000, 
                              height=1000)

# **Search Topics**

In [None]:
similar_topics, similarity = topic_model.find_topics("love", top_n=5)
print(similar_topics)
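Conceptually, a search like this compares an embedding of the query term with an embedding of each topic and ranks topics by similarity. The sketch below illustrates the idea with cosine similarity over invented three-dimensional vectors; the vectors, the `cosine` helper, and the topic ids are assumptions for illustration and not BERTopic's internals.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented embeddings: topic id -> vector
toy_topic_vectors = {
    0: [1.0, 0.0, 0.0],
    1: [0.0, 1.0, 0.0],
    2: [0.7, 0.7, 0.0],
}
toy_query = [1.0, 0.1, 0.0]  # stands in for the embedded search term

# Score every topic against the query and rank, most similar first
scores = {t: cosine(toy_query, v) for t, v in toy_topic_vectors.items()}
best_first = sorted(scores, key=scores.get, reverse=True)
print(best_first)
```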

In [None]:
topic_model.get_topic(5)

In [None]:
topic_model.get_representative_docs(5)

# **Topic Representation**
After having created the topic model, you might not be satisfied with some of the parameters you have chosen. Fortunately, BERTopic allows you to update the topics after they have been created. 

This allows for fine-tuning the model to your specifications and wishes. 

## Update Topics

One of the easiest ways to make our topics more coherent is to revise the *topic descriptions* rather than training a new model. 

This means that we try to foreground more potentially significant and meaningful words for individual topics.

We can do this by creating a simple bag-of-words model and filtering based on some conditions - just like we do when working with scikit-learn. 

**NB:** This will also take around 25 mins to run on a 4 CPU machine.

In [None]:
vectorizer_model = CountVectorizer(stop_words="english", 
                                   ngram_range=(1, 2),
                                   min_df=0.05,
                                   max_df=0.9)
topic_model.update_topics(docs, vectorizer_model=vectorizer_model)
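The `min_df` and `max_df` arguments filter terms by document frequency; given as floats, they are read as proportions of documents. Below is a rough pure-Python sketch of that filter on invented toy documents. It uses plain whitespace tokenization, so unlike `CountVectorizer` it ignores stop words and n-grams, and `filter_by_doc_freq` is a hypothetical helper.

```python
from collections import Counter

def filter_by_doc_freq(docs, min_df=0.05, max_df=0.9):
    """Keep terms whose share of documents lies between min_df and max_df."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each term at most once per document
        doc_freq.update(set(doc.lower().split()))
    return sorted(
        term for term, df in doc_freq.items()
        if min_df * n_docs <= df <= max_df * n_docs
    )

toy_docs = [
    "the sea was calm",
    "the ship left the harbour",
    "the captain watched the sea",
    "the storm broke",
]
vocab = filter_by_doc_freq(toy_docs, min_df=0.5, max_df=0.9)
print(vocab)
```

Here "the" is dropped because it appears in every document, and words that occur in only one document are dropped as too rare.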

In [None]:
topic_model.get_topic(2)   # Select the same topic that we viewed before

## Dynamic topic models

We first need to create some kind of *timestamps* for our data.

**Question:** In this context, what would be a sensible way to create timestamps? What is the cell below doing?

In [None]:
timestamps = list(range(len(docs)))

In [None]:
topics_over_time = topic_model.topics_over_time(docs, 
                                                timestamps, 
                                                nr_bins=10)
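Because the timestamps here are just sentence positions, binning them divides the novel into consecutive segments. The sketch below shows equal-width binning on a toy example; `to_bin` is a hypothetical helper that only approximates the idea and is not BERTopic's exact binning.

```python
def to_bin(position, n_positions, nr_bins):
    """Map a 0-based sentence position to one of nr_bins equal-width segments."""
    # Integer arithmetic: positions 0..n_positions-1 land in bins 0..nr_bins-1
    return position * nr_bins // n_positions

n_sentences = 25  # invented novel length, in sentences
toy_timestamps = list(range(n_sentences))
bins = [to_bin(t, n_sentences, nr_bins=5) for t in toy_timestamps]

print(bins)
```

With more meaningful units, such as chapter numbers, each bin would correspond to a stretch of the narrative rather than an arbitrary slice of sentences.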

__Plotting topics over time__

In [None]:
topic_model.visualize_topics_over_time(topics_over_time, 
                                       normalize_frequency=True, 
                                       top_n_topics=20)

__Plot specific topics__

In [None]:
topic_model.visualize_topics_over_time(topics_over_time, 
                                       normalize_frequency=True, 
                                       topics=[5]) # list of chosen topics