# Transformer Topic Modelling
<img src="https://github.com/Minyall/sc207_materials/blob/master/images/funko_prime.jpg?raw=true" align="right" width="200">

Whilst topic modelling has been a foundational technique for many years, recent strides in neural networks, alongside the building of huge archives of textual material, mean that rather than building our own textual models as before, we can use models pre-trained on millions of examples. These models are far better at capturing the semantic meaning of words, have a better sense of which words deserve the most attention, and account for word context.

For example, in the sentence "One dog greeted the other dog", the values assigned to the first 'dog' will differ from the values assigned to the second 'dog'. BERT refers to...
- **B**idirectional - Considers each word and looks both at what precedes it and what follows it.
- **E**ncoder - Encodes the textual material into numerical values...
- **R**epresentations from - that accurately represent the original textual meaning
- **T**ransformers - a type of machine learning model that is able to adjust what parts of the data it pays most attention to.

Transformers have been used for many applications, including language translation and [text generation](https://huggingface.co/gpt2).

If you'd like more technical details on the BERT model you can see this [well illustrated guide](https://www.exxactcorp.com/blog/Deep-Learning/how-do-bert-transformers-work), but it is not necessary to fully understand the details, thanks to...

# BERTopic
<img src="https://maartengr.github.io/BERTopic/logo.png?raw=true" align="right" width="200">

- [BERTopic Website](https://maartengr.github.io/BERTopic/index.html)

- Grootendorst, M. (2022) ‘BERTopic: Neural topic modeling with a class-based TF-IDF procedure’. arXiv. Available at: [https://doi.org/10.48550/ARXIV.2203.05794](https://doi.org/10.48550/ARXIV.2203.05794)

BERTopic provides us with a Python library that leverages BERT transformers whilst offering an accessible set of methods for helpful visualisations, summaries and tweaking of the model.




In [None]:
! conda install -c conda-forge hdbscan --yes

In [None]:
! pip install bertopic seaborn

In [None]:
!pip install "jupyterlab>=3" "ipywidgets>=7.6"

In [None]:
# May be required if you see an "IProgress not found" widget error
# ! pip install ipywidgets widgetsnbextension
# ! jupyter nbextension enable --py widgetsnbextension

In [None]:
from bertopic import BERTopic
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import plotly.io as pio
pio.renderers.default = "colab"

### Our Dataset

In [None]:
news_df = pd.read_csv('sample_news_large_with_tokens.csv')

In [None]:
news_df.head()

In [None]:
raw_corpus = news_df['text'].tolist()
raw_corpus[0:2]

In [None]:
tokenized_corpus = news_df['tokens'].tolist()
tokenized_corpus[0:2]

### Basic BERTopic

BERTopic analysis can be broken down into two parts.

1. Embeddings
2. Topic Representation

#### 1. Embeddings
BERTopic uses a pre-trained BERT model to generate embeddings for each document. We created a simple kind of embedding ourselves when we performed vectorisation on text, generating a matrix of document-to-word-count values.


> 1. I am a pet dog.
> 2. I am a pet cat
> 3. Architecture is a serious, serious discipline


| **Document** | I | am | a | pet | dog | cat | architecture | is | serious | discipline |
|--------------|---|----|---|-----|-----|-----|--------------|----|---------|------------|
| 1            | 1 | 1  | 1 | 1   | 1   | 0   | 0            | 0  | 0       | 0          |
| 2            | 1 | 1  | 1 | 1   | 0   | 1   | 0            | 0  | 0       | 0          |
| 3            | 0 | 0  | 1 | 0   | 0   | 0   | 1            | 1  | 2       | 1          |
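
The count matrix above can be reproduced in a few lines. This is a minimal sketch using only the standard library rather than a full vectoriser; the simple lowercase and regex tokenisation, and the names `example_docs`, `vocab` and `matrix`, are assumptions made for illustration.

```python
import re
from collections import Counter

example_docs = [
    "I am a pet dog.",
    "I am a pet cat",
    "Architecture is a serious, serious discipline",
]

# Lowercase each document and pull out runs of letters as tokens
tokenised = [re.findall(r"[a-z]+", doc.lower()) for doc in example_docs]

# Build the vocabulary in order of first appearance
vocab = []
for tokens in tokenised:
    for token in tokens:
        if token not in vocab:
            vocab.append(token)

# One row per document, one column per vocabulary word
matrix = [[Counter(tokens)[word] for word in vocab] for tokens in tokenised]

print(vocab)
for row in matrix:
    print(row)
```

Each row only records which words appear and how often, so "dog" and "cat" are simply two unrelated columns. That is the limitation the pre-trained embeddings are meant to address.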

BERT embeddings represent each document as a row of 384 values, generated by the pre-trained model. This data can be used by our computational tools to find clusters of documents based on their semantic similarity, identify which documents are representative of a particular cluster etc. We generate these embeddings by passing in raw text because the BERT model needs context, including all the little words and grammar we might strip out in pre-processing.
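
To get a feel for what "semantic similarity" between rows of numbers can mean, here is a small sketch of cosine similarity, one common way of comparing embedding vectors. The four-value vectors are invented toy values standing in for real 384-value embeddings, and this is not necessarily the exact measure used at every step inside BERTopic.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented toy "embeddings" (real ones would have 384 values each)
dog_doc = [0.9, 0.8, 0.1, 0.0]
cat_doc = [0.8, 0.9, 0.2, 0.1]
architecture_doc = [0.1, 0.0, 0.9, 0.8]

pets = cosine_similarity(dog_doc, cat_doc)
mixed = cosine_similarity(dog_doc, architecture_doc)
print(round(pets, 3), round(mixed, 3))
```

With these toy values the two pet documents score close to 1, while the pet and architecture documents score much lower. That is the kind of signal a clustering step can use to group documents.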

#### 2. Topic Representation
Separately, BERTopic uses a variation of TF-IDF to generate keywords that represent the topics it finds using the embeddings. In this case TF-IDF works best when we **DO** strip out the noise and grammatical features.

By default BERTopic uses basic tokenisation with little filtering of words, though this can be adjusted, as we'll see later.

However, it is also possible to pass it pre-prepared tokens which it will happily use instead. Given we spent all that time learning pre-processing let's use our custom tokens for this notebook, though we'll look at how to tweak BERTopic's built in pre-processing for better results later.
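
The "variation of TF-IDF" is the class-based TF-IDF described in the Grootendorst (2022) paper: all documents in a topic are treated as one big document, and a term's score is its frequency within the topic multiplied by `log(1 + A / f)`, where `A` is the average number of words per topic and `f` is the term's frequency across all topics. The sketch below follows that formula on invented toy tokens using raw counts; it is a simplified illustration, not BERTopic's actual implementation.

```python
import math
from collections import Counter

# Hypothetical pre-processed tokens, already grouped by topic (cluster)
topic_tokens = {
    0: ["pet", "dog", "pet", "cat", "walk"],
    1: ["architecture", "serious", "discipline", "serious", "walk"],
}

# Term frequency within each topic, treating a topic as one big document
tf_per_topic = {topic: Counter(tokens) for topic, tokens in topic_tokens.items()}

# Frequency of each term across all topics
tf_overall = Counter()
for counts in tf_per_topic.values():
    tf_overall.update(counts)

# Average number of words per topic
avg_words = sum(len(tokens) for tokens in topic_tokens.values()) / len(topic_tokens)

def c_tf_idf(term, topic):
    """Frequency in the topic * log(1 + average words per topic / overall frequency)."""
    return tf_per_topic[topic][term] * math.log(1 + avg_words / tf_overall[term])

for topic in topic_tokens:
    ranked_terms = sorted(
        tf_per_topic[topic], key=lambda term: c_tf_idf(term, topic), reverse=True
    )
    print(topic, ranked_terms[:3])
```

Note how "walk", which appears in both toy topics, scores lower than "dog", even though each appears once in topic 0. Terms shared across topics are pushed down the ranking, which is why stripping out common noise words helps the keywords.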





### Basic BERTopic - default settings, no custom pre-processing

In [None]:
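# Create a model with default settings and fit it to the raw text
topic_model = BERTopic()
topics, probabilities = topic_model.fit_transform(raw_corpus)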

In [None]:
# Summary table of the topics found
topic_model.get_topic_info()

Here are our topics. It has discovered 7 topics plus a noise topic labelled -1. There is almost always a noise topic as not all documents perfectly fit into a cluster. There may be outliers or topics that are difficult to classify as one topic or another.
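
It can be useful to check how much of the corpus ended up in the noise topic. The sketch below counts assignments in a hypothetical list of topic numbers (one per document, in the same form as the topic list returned by the model); the values are invented for illustration.

```python
from collections import Counter

# Hypothetical topic assignments, one per document; -1 marks the noise topic
example_topics = [0, 1, -1, 0, 2, -1, 1, 0, -1, 3]

topic_counts = Counter(example_topics)
noise_share = topic_counts[-1] / len(example_topics)

print(f"Documents per topic: {dict(sorted(topic_counts.items()))}")
print(f"Share of documents treated as noise: {noise_share:.0%}")
```

A large noise share is not necessarily a problem, but it is worth knowing before drawing conclusions from the remaining topics.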

Topic representation, which we can see in the labels it has generated for each topic, could be improved.

In [None]:
# Our first built in visualisation helps us quickly see the topics and their associated words. Hover over the bars to see words and scores.
topic_model.visualize_barchart()


Remember the separation in the model: the embeddings, which determine the topics, and then the topic representation. We can update the topic representation side of our model without impacting the embedding side.

Looking better. The number and distribution of the topics is still broadly the same, but now the topic descriptions are improved. Let's work with this.

#### Assessing our Topic Model

To some extent we already know generally what topics should be in our data as we know the queries that generated the documents.

In [None]:
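# Assumes the search query for each article is stored in the `query` column
query_topic_crosstab = pd.crosstab(news_df['query'], pd.Series(topics, index=news_df.index, name='topic'))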
query_topic_crosstab

In [None]:
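# Heatmap of how many documents from each query fall into each topic
sns.heatmap(query_topic_crosstab, cmap='YlGnBu')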

We can see that generally the topics conform to our queries. This is a good sign, indicating that our embeddings were able to accurately determine similarity. We can even see some crossover on particular queries. Remember that when the modelling started, BERTopic had no idea which query produced each news document; it determined this separation based on the content of the articles.

We won't always have existing classifications like this, but this helps give us confidence that if we did the same procedure on a set of documents in which we had no sense of the topics, it would be able to surface them for us.

In [None]:
topic_model.generate_topic_labels()

### Topic and Document Distribution

We can see the similarity of topics using the built in visualiser. Whether they are or are not similar to the extent that they could be merged as a single topic is down to qualitative assessment. Normally they will overlap if they are all part of a larger overarching topic.

The plot above shows us the distance between topics, with the size of the circle indicating the relative size of the topic in the corpus. Topics that are closer together are considered similar. We can see a more detailed version by visualizing the document embeddings in two dimensions.
If we provide the embeddings, the first argument is used only to label the points, rather than as the text from which their positions are calculated.

In [None]:
embeddings =

If we examine the scatter plot above more closely, and consider the article titles we can see why some articles might be closer together even within a cluster.

### Hierarchical Clustering
This visual shows us how the topics were determined, indicating where large clusters of documents were split into multiple groups and at what point.

We can see above that topic 4 (Hong Kong/protest) was considered distinct enough to be separated from the remaining documents early on, followed by topic 3 (Tesla), then topic 5 (online scams and online abuse), then topic 2 (Brexit), and finally topic 0 (Facebook/Libra) and topic 1 (Trump/alt-right). The colouring indicates that these last three topics are more similar to one another than to the others, having been split off from a larger cluster.

### Term scoring
When looking at a topic's keywords, how far down the list do you go before you stop looking? Top 10, top 20? Term rank allows us to see the point at which adding more terms no longer helps to differentiate topics.

The guidance is to look for the 'knee' or 'elbow' where the line flattens out. Beyond that point, additional terms do little to improve the differentiation. In our plot we can see that differentiation declines dramatically for most topics after only 3 keywords.
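
Spotting the elbow is usually done by eye, but the idea can be sketched numerically: look at how much the score drops from one term to the next, and find the last drop that is still "large". The scores and the threshold below are invented for illustration, and the threshold in particular is an arbitrary assumption rather than a recommended value.

```python
# Hypothetical term scores for one topic, ranked highest first
term_scores = [0.120, 0.085, 0.060, 0.031, 0.029, 0.028, 0.027]

# Drop in score between each term and the next one down the ranking
drops = [term_scores[i] - term_scores[i + 1] for i in range(len(term_scores) - 1)]

# Treat the last drop above an (arbitrary) threshold as the elbow
threshold = 0.01
elbow_rank = max(i + 1 for i, drop in enumerate(drops) if drop > threshold)

print(f"Scores flatten out after term rank {elbow_rank}")
```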

# Topics over time
If you have datestamps for your individual data points, you can get BERTopic to show you topic trends over time.

In [None]:
topics_over_time =

Note that as you hover over each point, the keywords for the topic change. This helps us see how the topic discourse may have altered over time.
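
The frequency side of a topics-over-time table boils down to counting documents per time bin and topic. The sketch below shows that counting step on invented (date, topic) pairs using monthly bins; it does not reproduce the per-bin keyword recalculation that BERTopic also performs.

```python
from collections import Counter
from datetime import date

# Hypothetical (publication date, topic) pairs, one per document
dated_topics = [
    (date(2019, 6, 3), 0),
    (date(2019, 6, 17), 0),
    (date(2019, 6, 20), 1),
    (date(2019, 7, 2), 1),
    (date(2019, 7, 9), 1),
    (date(2019, 7, 30), 0),
]

# Count documents per (month, topic) bin
trend = Counter((day.strftime("%Y-%m"), topic) for day, topic in dated_topics)

for (month, topic), n_docs in sorted(trend.items()):
    print(month, f"topic {topic}", n_docs)
```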

In [None]:
# The raw data used to generate the visual is in our topics_over_time dataframe
topics_over_time

# Topics per class
Allows us to 'split' up the model to see how different topics might differ depending on some sort of classification. So for example in our data, if we took the time to label each document with the type of publication (Broadsheet newspaper, tabloid, left wing, right wing, etc.) we could see how the topics found across all the documents, differed depending on the type of publisher.

We don't have that information(!) but we can demonstrate using our `query` classification at least.

In [None]:
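# Assumes the search query for each article is stored in the `query` column
topics_per_class = topic_model.topics_per_class(raw_corpus, topics, classes=news_df['query'].tolist())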


This is still informative in that it shows us which topics are most important for each query group, but also that some topics might actually overlap a little. Again note that the words for each topic differ depending on the classification.

In [None]:
# ...and again the raw data is available to us in the variable we created...
topics_per_class

### Topic Similarity
A different way of examining similar phenomena: where do topics overlap, and how similar or different are they? Ideally you don't want all your topics to be highly similar, because then you haven't been able to distinguish different topics. However, if some overlap in some way, that might tell you something interesting about how different discourses/issues/cultures might overlap or intersect.

# Topic Distribution
If you recall in LDA topic modelling every document has a score for each topic. Whilst most documents might align strongly with only one topic, this approach recognised that topics existed across documents, and one document could contain multiple topics.

BERTopic does not work like LDA, but it does provide us with a table of probabilities. This shows us how probable it is that a document could be classified as topic x.
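
Reading one row of such a table means ranking the topics by probability for a single document. The sketch below does this for an invented row of four probabilities. As far as I am aware, BERTopic only returns a full document-by-topic table when the model is created with `calculate_probabilities=True`, so treat the shape of your own `probabilities` object as something to check rather than assume.

```python
# Hypothetical probabilities for one document across four topics (topics 0 to 3)
doc_probabilities = [0.05, 0.62, 0.08, 0.21]

# Pair each probability with its topic number and rank highest first
ranked = sorted(enumerate(doc_probabilities), key=lambda pair: pair[1], reverse=True)

best_topic, best_prob = ranked[0]
runner_up, runner_up_prob = ranked[1]

print(f"Most likely topic: {best_topic} ({best_prob:.2f})")
print(f"Next most likely topic: {runner_up} ({runner_up_prob:.2f})")
```

A strong runner-up suggests a document that sits between two topics, which is often worth reading qualitatively.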

In [None]:
news_df.loc[3,'title']

In [None]:
print(news_df.loc[3,'text'])

## Representative Documents
The model stores a few documents that it considers representative of the topic as a whole. We can use the command below, passing in the topic number, to get the corresponding representative documents.

## Exporting Analysis for use in your report
### Tables

Any of the tables that are produced by BERTopic can be exported as they are Pandas Dataframes...

In [None]:
# In the rare case that the table may instead be a numpy array...
type(probabilities)

In [None]:
# simply wrap in a dataframe and then export (the filename here is just an example)
pd.DataFrame(probabilities).to_csv('probabilities.csv')


### Figures
All figures produced by BERTopic are actually [Plotly](https://plotly.com/python/) figures. They can be exported too...

In [None]:


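topic_bar_chart = topic_model.visualize_barchart(topics=[1,2,3,4],n_words=20, height=400)
# Save an interactive HTML copy of the figure (the filename here is just an example)
topic_bar_chart.write_html('topic_bar_chart.html')
topic_bar_chart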

Whilst you can write directly to an image file in Plotly, it requires additional packages. It is simpler to generate the HTML file and then click the camera icon in the top right of the tool bar that appears when you hover over the figure. This will download an image of the plot for you.

On the rare occasion we use a Seaborn chart instead...

In [None]:
heatmap = sns.heatmap(query_topic_crosstab, cmap='YlGnBu')

# dpi 300 or 400 produces a good sized image (the filename here is just an example)
heatmap.get_figure().savefig('heatmap.png', dpi=300)

# Topic Modelling with Twitter Data - Example

Here we show an example of how you can apply BERTopic to Twitter data. The small text sizes, noise and variability of Twitter data can make it difficult to get a handle on whether there are any latent topics of discussion within your collected data. In this example we'll show how you can use the community detection analysis from your network work to help guide the topic modelling process, as well as some of the more advanced tweaks that can help improve your models.

Here is our original Twitter dataset and the communities we detected using NetworkX.

In [None]:
tweets = pd.read_json('trhr.json')
communities = pd.read_csv('communities.csv', index_col=0)

In [None]:
# We may choose to only examine original tweets, excluding retweets. Whether or not this is advisable varies.
tweets =
tweets

Here we combine the original tweet text with the community classifications we generated. We drop any tweets where the user had no community affiliation as that means they weren't a part of our primary network.

In [None]:
tweets = tweets.merge(communities,how='left', left_on='user_id', right_index=True).dropna(subset=['community'])
tweets = tweets[['user_username','text','community']]
tweets

We don't have any pre-existing tokens this time, so we're going to tweak the settings of our BERTopic model to improve the topic representation.
BERTopic is built on a range of well-established components, including scikit-learn's `CountVectorizer`. We can create and configure our own vectoriser, and then pass it to our model to be used instead of the default one.
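
Two of the most useful vectoriser settings are the document-frequency limits, `min_df` and `max_df`. In scikit-learn an integer `min_df` is a minimum number of documents, while a float `max_df` is a maximum proportion of documents. The sketch below imitates that filtering in plain Python on invented toy tweets, so the effect is visible without fitting anything; it is an illustration of the idea, not scikit-learn's code.

```python
from collections import Counter

# Hypothetical tokenised tweets
toy_tweets = [
    ["climate", "march", "today"],
    ["climate", "policy", "vote"],
    ["climate", "march", "photos"],
    ["climate", "vote", "today"],
]

# Count in how many documents each word appears (not total occurrences)
doc_freq = Counter(word for tweet in toy_tweets for word in set(tweet))

min_df = 2     # keep words found in at least 2 documents
max_df = 0.95  # drop words found in more than 95% of documents

kept = sorted(
    word
    for word, df in doc_freq.items()
    if df >= min_df and df / len(toy_tweets) <= max_df
)
print(kept)
```

Here "climate" is dropped for appearing in every tweet, and "policy" and "photos" are dropped for appearing only once, leaving the mid-frequency words that tend to distinguish topics.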

In [None]:
# Here we will just keep tweets from users in the top 5 communities
n_communities = 5
top_communities = tweets['community'].value_counts().head(n_communities).index
tweets = tweets[tweets['community'].isin(top_communities)]

In [None]:

from sklearn.feature_extraction.text import CountVectorizer

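# We set stop words, tell it to look for phrases between 1 and 2 words long, that a term
# must appear in at least 5 documents to be included, and in no more than 95% of the documents.
# stop_words='english' is an assumption; swap in a custom list if you have one.
cv = CountVectorizer(stop_words='english', ngram_range=(1, 2), min_df=5, max_df=0.95)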

# We're also going to specify the diversity of the representative words this time.
# This pushes the model to try and avoid repeating words across topic representations. It is similar to tweaking the lambda value in LDAvis
# nr_topics='auto' tells the model that after the clusters are found, to automatically try and combine highly similar topics.
# This can be useful when you get hundreds of topics, or they are highly similar. Conversely, if you want fine-grained detail of how
# a particular topic may be being discussed in different ways, you may want to avoid this argument.

tweet_model =

# Here we tell the model that we want it to consider the community assignments when modelling. This is providing guidance to the model that there should be semantic similarity within communities.

topics, probabilities =



In [None]:
tweet_model.get_topic_info()

In [None]:
tweet_model.visualize_barchart(n_words=20,height=400)

In [None]:
tweet_model.visualize_topics()

In [None]:
tweet_model.visualize_documents(tweets['text'].tolist(), sample=0.1)

In [None]:
tweet_model.visualize_hierarchy()

In [None]:
tweets['community'].value_counts()

In [None]:
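# Rows are topics, columns are communities; index=tweets.index keeps the topics aligned with the filtered rows
topic_community_crosstab = pd.crosstab(pd.Series(topics, index=tweets.index, name='topic'), tweets['community'])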
topic_community_crosstab.loc[0:]

In [None]:
sns.set(rc={"figure.figsize":(8,6)})
sns.clustermap(topic_community_crosstab.loc[0:], cmap='coolwarm', linewidths=0.1)


In [None]:
tweet_model.get_representative_docs(1)

In [None]:
topics_per_class = tweet_model.topics_per_class(tweets['text'].tolist(),topics, classes=tweets['community'].tolist())
tweet_model.visualize_topics_per_class(topics_per_class)