> # Pre-Lab Instructions
> <img src="https://github.com/Minyall/sc207_290_public/blob/main/images/attention.webp?raw=true" align="right" height=150>
> This poor man will be at the start of every notebook, letting you know what you will need for the lab.
> 
> For this lab you will need:
> - DATA: `farright_dataset_cleaned.parquet` - Download from Moodle and upload to this Colab session.
> - INSTALL: You will need to install `bertopic` and `embedding-atlas`. Use the cell below.

In [None]:
# Uncomment the line below and run 
# ! pip install bertopic embedding-atlas

# Let it completely finish before moving on

# SC290: Finding topics and themes when you have too much text 
<img src="https://github.com/Minyall/sc207_290_public/blob/main/images/sc290_headers/4.png?raw=true" height=150 align="right">

## Last week
- How to represent texts as 'vectors'.
- Counts, TFIDF and BERT embeddings.
- How vectors can help us summarise or represent groups of text.
- How vectors can help us determine document similarity and difference.

## This week

Having learned about BERT embeddings and TFIDF, we're going to apply these two techniques in practice for *Topic Modelling*. Topic modelling is a well-established technique for finding the key themes or topics that exist across a large number of documents.

Traditionally, topic modelling uses word frequencies, often through the technique known as *Latent Dirichlet Allocation (LDA)*. However, with the development of BERT embedding models, the cutting edge of topic modelling uses embeddings to determine these topics. Because embedding models better account for the semantic similarity of words, and adjust their attention to take word context into account when examining documents, embeddings are far superior to simple word counts when identifying topics.

Whilst it is a complex process, it has been made incredibly simple and accessible via this week's library...


# BERTopic
<img src="https://maartengr.github.io/BERTopic/logo.png?raw=true" align="right" width="200">

- [BERTopic Website](https://maartengr.github.io/BERTopic/index.html)

- Grootendorst, M. (2022) ‘BERTopic: Neural topic modeling with a class-based TF-IDF procedure’. arXiv. Available at: [https://doi.org/10.48550/ARXIV.2203.05794](https://doi.org/10.48550/ARXIV.2203.05794)

BERTopic provides us a Python library that leverages BERT transformers whilst providing an accessible set of methods for helpful visualisations, summaries and tweaking of the model.




In [None]:
# We'll need a few different libraries this week but we'll introduce them as we use them.
## To begin we just need...


In [None]:
# Convenience variables for accessing our documents


## 1.1 Introducing BERTopic

BERTopic analysis can be broken down into two parts.

**1. Embeddings**

Embeddings rely on the pre-trained BERT model that we used in the previous session to determine the similarity/difference of our documents. Remember that embeddings work best if we provide the whole text, with all its variation in words, punctuation etc. We'll use the data in our `cleaned_text` column.

**2. Topic Representation**

Separately, BERTopic uses a variation of TFIDF to generate keywords that represent the topics it finds using the embeddings. This step works best when we **DO** strip out the noise and grammatical features because, like standard TFIDF, it is based on the frequency of words. For this we'll use the tokens we prepared in our session on text preparation, the `tokens` column.
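
To make the "variation of TFIDF" a little more concrete, here is a minimal sketch of the class-based weighting described in the Grootendorst (2022) paper cited above: a word's count within a topic is multiplied by `log(1 + A / f)`, where `A` is the average number of words per topic and `f` is the word's count across all topics. The toy tokens and the function name `class_tfidf` are made up for illustration, and the library's own implementation differs in its details (for example in how it normalises counts).

```python
import math
from collections import Counter

def class_tfidf(topic_tokens):
    """Score words per topic. topic_tokens maps a topic label to ALL tokens in that topic."""
    counts = {topic: Counter(tokens) for topic, tokens in topic_tokens.items()}

    # How often each word appears across every topic combined
    overall = Counter()
    for topic_counts in counts.values():
        overall.update(topic_counts)

    # Average number of words per topic (the "A" in the formula)
    average_words = sum(overall.values()) / len(counts)

    return {
        topic: {
            word: frequency * math.log(1 + average_words / overall[word])
            for word, frequency in topic_counts.items()
        }
        for topic, topic_counts in counts.items()
    }

# Hypothetical tokens, as if every document in a topic had been joined together
toy_topics = {
    0: ["immigration", "border", "policy", "border", "party"],
    1: ["election", "vote", "party", "vote", "leader"],
}

scores = class_tfidf(toy_topics)
top_words = {topic: max(words, key=words.get) for topic, words in scores.items()}
print(top_words)
```

Notice that `party` appears in both topics, so it scores lower than a word that appears equally often but in only one topic. This is why cleaned tokens help: words that appear everywhere are pushed down the list.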





In [None]:
# Whilst BERTopic can generate embeddings for us, 
# it is more efficient to do it ourselves first as we can then use them 
# when we need rather than wait for them to be generated


In [None]:
# We create a blank topic model object

# We then train the model on our documents and embeddings

# Finally we update the topic representations to use our tokens.


In [None]:
# The best overview method is .get_topic_info()


This is our starting point for understanding the results of our topic modelling.
- Each row of the dataframe represents a topic found in the documents. The number of rows == number of topics.
- Each document is assigned a topic label, and the topics are listed in size order: the topic with the largest number of documents comes first.
- Each topic is given a number label. The `-1` topic represents documents not given a topic; we'll explore why later.
- Each topic is given a name, which tends to be the words most representative of the topic according to TFIDF.
- Representation is the full list of words used to represent the topic.
- Representative Docs is a list of documents that are most emblematic of that topic.
- All tables produced by BERTopic are Pandas Dataframes, meaning the skills you learned in working with Pandas apply to any table generated by BERTopic.
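
As a rough recap of the steps that lead to this table, the sketch below wraps them in one function. The function name `fit_and_refine` and the variable names `docs`, `token_docs` and `embeddings` are assumptions for illustration, not names fixed by the lab. It also assumes `token_docs` holds each document's tokens joined into a single string, in the same order as `docs`.

```python
def fit_and_refine(topic_model, docs, token_docs, embeddings):
    """Train a topic model, then rebuild its keywords from cleaned tokens.

    topic_model: an unfitted model, e.g. BERTopic(calculate_probabilities=True)
    docs:        the full texts (for us, the `cleaned_text` column as a list)
    token_docs:  the same documents as space-joined tokens, in the same order
    embeddings:  one pre-computed embedding row per document
    """
    # Cluster the documents using the embeddings we generated ourselves
    topics, probabilities = topic_model.fit_transform(docs, embeddings=embeddings)

    # Recalculate the topic keywords from the cleaned tokens;
    # the topic assignments themselves stay the same
    topic_model.update_topics(token_docs)

    return topic_model.get_topic_info(), topics, probabilities

# In the lab this would be called roughly as follows (commented out because it
# needs the installed library and our dataset):
# from bertopic import BERTopic
# topic_info, topics, probabilities = fit_and_refine(
#     BERTopic(calculate_probabilities=True), docs, token_docs, embeddings
# )
```

Keeping the embeddings as a separate input means they only need to be generated once, however many times we rebuild the model.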

In [None]:
#*
# When the model ran we got a list of topic assignments, one per document


In [9]:
#*
# and a table of probabilities that says how likely it is that each document might be assigned to each topic.
# More on this later
pd.DataFrame(probabilities) 

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
0,2.267476e-308,4.392524e-308,1.730944e-308,1.112468e-308,6.062572e-308,3.731248e-308,5.119500e-308,1.646307e-308,3.584772e-308,2.017166e-308,1.451399e-308,1.642077e-308,2.993094e-308,1.913429e-308,1.000000e+00,7.901506e-308,8.160806e-308
1,1.161820e-02,1.446296e-02,4.386093e-02,4.438669e-01,1.564423e-02,1.284043e-02,1.151416e-02,1.390409e-02,1.837927e-02,1.585187e-02,2.462544e-02,2.208949e-02,1.616646e-02,1.281437e-02,1.312723e-02,1.214967e-02,1.396249e-02
2,9.740014e-309,1.207674e-308,3.367549e-308,1.000000e+00,1.298676e-308,1.072327e-308,9.662869e-309,1.161196e-308,1.509446e-308,1.318903e-308,2.015496e-308,1.827057e-308,1.340081e-308,1.077033e-308,1.095078e-308,1.016497e-308,1.160705e-308
3,9.718246e-02,1.545379e-02,1.682372e-02,8.709029e-03,1.991956e-02,2.820512e-02,1.456056e-02,4.224967e-02,1.826659e-02,2.340316e-02,1.624356e-02,1.417418e-02,2.339893e-02,1.490387e-02,1.826271e-02,1.852660e-02,1.932790e-02
4,1.043662e-02,1.015999e-02,6.280796e-02,1.349813e-02,1.158738e-02,1.022276e-02,7.668474e-03,1.716752e-02,1.566911e-02,1.418452e-02,6.527917e-01,1.224866e-02,1.252941e-02,1.055487e-02,9.591454e-03,8.674620e-03,1.035162e-02
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
739,1.424486e-02,1.613341e-02,9.635406e-03,7.122975e-03,2.240795e-02,2.647293e-02,2.494602e-01,9.790967e-03,1.547705e-02,1.189195e-02,8.358316e-03,1.093881e-02,1.683345e-02,9.763309e-03,4.439482e-02,1.103540e-01,4.059209e-02
740,2.361316e-02,3.378041e-01,2.432694e-02,1.646613e-02,5.841501e-02,3.041226e-02,3.358133e-02,2.123325e-02,5.228362e-02,2.543257e-02,2.016987e-02,2.060035e-02,3.330612e-02,3.636687e-02,6.611624e-02,3.891839e-02,4.491127e-02
741,1.528457e-02,3.253559e-02,1.455243e-02,8.245157e-03,3.959936e-02,2.133982e-02,2.167301e-02,1.224818e-02,3.928125e-02,1.473549e-02,1.185448e-02,1.217723e-02,2.084735e-02,1.450569e-02,5.530695e-02,2.901776e-02,5.353181e-02
742,1.150975e-02,1.464244e-02,1.440339e-02,6.131844e-03,1.547092e-02,1.211191e-02,8.073927e-03,1.510960e-02,2.342544e-02,1.580747e-02,1.175176e-02,8.856387e-03,1.618073e-02,1.558465e-02,1.233139e-02,9.740175e-03,1.198708e-02


BERTopic uses a number of well-established libraries to do its various tasks. Think of a BERTopic model as a little factory made up of different machines. These machines are pre-built by other libraries and, if we want to, we can easily swap out one of these machines for our own.

For example, one of the 'machines' is the scikit-learn `CountVectorizer` that we used last week. This controls which words are kept in the vocabulary when the model works out which words best represent each topic. We can update our model to use our own `CountVectorizer`. It will then use that to update the topic representations, whilst keeping the underlying topic assignments the same.

In [None]:
# Our first built in visualisation helps us quickly see the topics and their associated words. Hover over the bars to see words and scores.


If we want to get a sense of what documents are exemplary of these topics we can ask for the representative documents.

## 1.2 Understanding Topic Identification

Now we have established our topics, let's explain how topics are determined in the first place. Understanding the underlying process helps with interpretation of the many different analysis options that come built in to BERTopic.

Remember `embedding_atlas` and how it represented the relative similarity and difference between documents? We'll do the same using a built-in BERTopic function.


**What is a topic?**

If we remember when we examined embeddings using `embedding_atlas`, there were areas of denser concentration of articles. Here BERTopic has identified those dense areas and labelled them as a topic.

Documents that are alone, too far from any topic, or hovering between multiple topics are not given a topic and are coloured light grey. These are the documents assigned `-1`, also known as 'noise'. This doesn't mean they're meaningless, just that they do not easily fit into a cluster and so BERTopic is cautious about assigning one.
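
The idea of "dense areas" versus "noise" can be illustrated with a deliberately simplified sketch. This is not the algorithm BERTopic uses; it is a plain neighbour count in which a point is treated as sitting in a dense area only if enough other points lie within a fixed radius. The points, the radius and the function name `in_dense_area` are all invented for the example.

```python
from math import dist

def in_dense_area(points, radius=1.0, min_neighbours=2):
    """Return True/False per point: does it have enough close neighbours?"""
    flags = []
    for i, point in enumerate(points):
        neighbours = sum(
            1
            for j, other in enumerate(points)
            if i != j and dist(point, other) <= radius
        )
        flags.append(neighbours >= min_neighbours)
    return flags

# Two tight groups of three documents, plus one document sitting between them
points = [
    (0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
    (5.0, 5.0), (5.1, 5.2), (4.9, 5.1),
    (2.5, 2.5),
]

dense = in_dense_area(points)
# Documents outside any dense area would be the ones labelled -1
labels = ["clustered" if flag else "noise (-1)" for flag in dense]
print(labels)
```

The lone document in the middle is not wrong or meaningless; it simply has no close neighbours, so a cautious approach leaves it unassigned.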

**What makes a cluster?**

<img src="https://github.com/Minyall/sc207_290_public/blob/main/images/clusters.png?raw=true" align="right" height=200>
A cluster is formed when a set of documents are significantly closer to one another than they are to any other documents, and there are a decent number of them. BERTopic uses an algorithm called HDBSCAN to find clusters of documents. Clusters are not necessarily circular, but can stretch across the 'page' or form irregular shapes. There may also occasionally be a document out of place, such as a document from a different topic in the middle of a cluster. This is firstly because HDBSCAN is designed to find oddly shaped clusters, because texts tend to cluster in odd ways.

More importantly it is because when BERTopic determines these clusters, it's actually using more than two dimensions to determine document similarity (5 by default). Here we are forcing it to show document similarity in only two dimensions, meaning there may be the occasional document that seems to be misplaced. If we were able to see all five dimensions those misplaced documents would actually be a part of their cluster.
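
A small sketch shows why flattening dimensions can make a document look misplaced. Here we simply drop the third coordinate of some made-up 3-D points, which is far cruder than what the plot actually does, but the effect is the same in spirit: two documents that are far apart in the full space can land almost on top of each other on the page.

```python
from math import dist

# Hypothetical positions for three documents in three dimensions
cluster_doc_a = (1.0, 1.0, 0.0)
cluster_doc_b = (1.2, 0.9, 0.1)
other_doc = (1.1, 1.0, 4.0)  # differs mainly on the third dimension

def flatten(point):
    """Keep only the first two coordinates, as if drawing on a flat page."""
    return point[:2]

# Distances using all three dimensions
full_gap_to_other = dist(cluster_doc_a, other_doc)

# Distances once everything is squashed onto two dimensions
flat_gap_to_other = dist(flatten(cluster_doc_a), flatten(other_doc))
flat_gap_to_neighbour = dist(flatten(cluster_doc_a), flatten(cluster_doc_b))

print("full space, a to other:", round(full_gap_to_other, 2))
print("flat page, a to other:", round(flat_gap_to_other, 2))
print("flat page, a to b:", round(flat_gap_to_neighbour, 2))
```

On the flat page `other_doc` appears closer to `cluster_doc_a` than its genuine neighbour does, even though in the full space it is a long way off.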

> ### Wait, how many dimensions?
> Remember in our vectors session we showed that BERT embeddings represent documents in 384 dimensions that are meant to fully capture the nuanced differences between them.
> BERTopic takes those 384 dimensions and uses a clever process called UMAP to reduce the number of dimensions down to 5 whilst still maintaining enough information to express those document differences. It is those 5 dimensions that BERTopic then examines to find the 'clusters' in the data where documents are densely packed together to then identify topics.

### Topic and Document Distribution

We can see the similarity of topics using the built in visualiser. Whether they are or are not similar to the extent that they could be merged as a single topic is down to qualitative assessment. Normally they will overlap if they are all part of a larger overarching topic.

The plot above shows us the distance between topics, with the size of the circle indicating the relative size of the topic in the corpus. Topics that are closer together are considered similar. We can see a more detailed version by visualising the document embeddings in two dimensions.
The first argument specifies how to label the points. Because we provide the embeddings ourselves, the text is only used for labelling and does not need to be embedded again.

### Hierarchical Clustering
This visual shows us how the topics were determined, indicating where large clusters of documents were split into multiple groups and at what point.

### Term scoring
When looking at a topic's keywords, how far down the list do you go before you stop looking? Top 10, top 20? Term rank allows us to see where adding more terms stops adding value, i.e. the point at which further terms no longer help to differentiate topics.

The guidance is to look for the 'knee' or 'elbow' where the line flattens out. Beyond that point, extra terms do little to improve the differentiation. In our case we can see that differentiation dramatically declines for most topics after only 3 keywords.
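
Reading the elbow by eye is usually enough, but the logic can be sketched in a few lines. The scores below are invented, and the rule used here (stop when the drop to the next term falls below a chosen threshold) is just one simple way to define an elbow, not the method BERTopic uses to draw its chart.

```python
def terms_before_elbow(scores, min_drop=0.01):
    """Count how many top terms to keep before the scores flatten out.

    scores:   keyword scores for one topic, sorted from highest to lowest
    min_drop: a drop smaller than this counts as 'flat' (a hypothetical threshold)
    """
    for rank in range(1, len(scores)):
        drop = scores[rank - 1] - scores[rank]
        if drop < min_drop:
            return rank
    return len(scores)

# Hypothetical keyword scores for a single topic
topic_scores = [0.090, 0.061, 0.040, 0.037, 0.035, 0.034]

keep = terms_before_elbow(topic_scores)
print("Keywords worth reading closely:", keep)
```

With these toy scores the line flattens after the third keyword, mirroring the pattern described above.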

# Topics over time
If you have datestamps for your individual data points, you can get BERTopic to show you topic trends over time.

Note that as you hover over each point, the keywords for the topic change. This helps us see how the topic discourse may have altered over time.
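
Underneath the visual, part of what is being tracked is simply how many documents fall into each topic in each time period. The sketch below reproduces only that counting step on invented data, using monthly bins. It does not recalculate the keywords for each period, which BERTopic also does. The function name and the example dates are assumptions.

```python
from collections import Counter
from datetime import date

def monthly_topic_counts(timestamps, topics):
    """Count documents per (month, topic) pair.

    timestamps: ISO date strings such as "2021-02-15", one per document
    topics:     the topic label for each document, in the same order
    """
    counts = Counter()
    for stamp, topic in zip(timestamps, topics):
        month = date.fromisoformat(stamp).strftime("%Y-%m")
        counts[(month, topic)] += 1
    return counts

# Hypothetical dates and topic assignments for five documents
timestamps = ["2021-01-04", "2021-01-19", "2021-02-02", "2021-02-15", "2021-02-27"]
topics = [0, 1, 0, 0, -1]

counts = monthly_topic_counts(timestamps, topics)
for (month, topic), n in sorted(counts.items()):
    print(month, "topic", topic, "documents:", n)
```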

In [None]:
# The raw data used to generate the visual is in our topics_over_time dataframe


# Topics per class
This allows us to 'split' up the model to see how topics might differ depending on some sort of classification. So for example in our data, if we took the time to label each document with the type of publication (broadsheet newspaper, tabloid, left wing, right wing, etc.) we could see how the topics found across all the documents differed depending on the type of publisher.

We don't have that information(!) but we can demonstrate using our `query` classification at least.

This is still informative in that it shows us which topics are most important for each query group, but also that some topics might actually overlap a little. Again note that the words for each topic differ depending on the classification.

In [None]:
# ...and again the raw data is available to us in the variable we created...


### Topic Similarity
A different way of examining similar phenomena: where do topics overlap, and how similar or different are they? Ideally you don't want all your topics to be highly similar, because then you haven't been able to distinguish different topics. However, if some overlap in some way, that might tell you something interesting about how different discourses/issues/cultures might overlap or intersect.
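
One common way to put a number on "how similar are two topics" is cosine similarity between the vectors that represent them, which is the same measure we met when comparing documents. The sketch below uses three tiny made-up topic vectors and hypothetical topic names; real topic vectors would have many more dimensions.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Similarity of two vectors: close to 1 means same direction, close to 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    length_a = sqrt(sum(x * x for x in a))
    length_b = sqrt(sum(y * y for y in b))
    return dot / (length_a * length_b)

# Hypothetical 3-dimensional vectors standing in for three topics
topic_vectors = {
    "immigration": [0.9, 0.1, 0.2],
    "borders": [0.8, 0.2, 0.3],
    "football": [0.1, 0.9, 0.1],
}

related = cosine_similarity(topic_vectors["immigration"], topic_vectors["borders"])
unrelated = cosine_similarity(topic_vectors["immigration"], topic_vectors["football"])

print("immigration vs borders:", round(related, 2))
print("immigration vs football:", round(unrelated, 2))
```

A pair scoring close to 1 might be a candidate for merging, or a sign of overlapping discourses; that judgement remains a qualitative one.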

# Topic Distribution
If you recall, in LDA topic modelling every document has a score for each topic. Whilst most documents might align strongly with only one topic, this approach recognised that topics existed across documents, and one document could contain multiple topics.

BERTopic does not work like LDA but it does provide us a table of probabilities. This shows us how probable it is that a document could be classified as topic x.
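
To read a table like this, it helps to rank the topics for each document rather than scan every column. The sketch below does that for a tiny invented table with three topics. The function name `rank_topics` and the numbers are made up; each row stands for one document and each column for one topic.

```python
def rank_topics(row, top_n=2):
    """Return the top_n (topic, probability) pairs for one document, highest first."""
    ranked = sorted(enumerate(row), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]

# Hypothetical probabilities: 2 documents, 3 topics
toy_probabilities = [
    [0.01, 0.02, 0.97],  # clearly belongs to topic 2
    [0.44, 0.40, 0.16],  # sits between topics 0 and 1
]

for doc_number, row in enumerate(toy_probabilities):
    print("document", doc_number, rank_topics(row))
```

The second document is the interesting one: its two strongest topics are nearly tied, which is exactly the sort of case where a single topic label hides some nuance.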

## Visualising topic content

In [None]:
# saving the model, then advanced tweaking of it

# Show wordcloud


Topics are essentially just another categorical classification for your documents. We can add them to the articles dataframe and then explore them using other techniques we learned in SC207.

## Exporting Analysis for use in your report
### Tables

Any of the tables that are produced by BERTopic can be exported as they are Pandas Dataframes...

### Figures
All figures produced by BERTopic are actually [Plotly](https://plotly.com/python/) figures. They can be exported too...

Whilst you can write directly to image file in plotly, it requires additional packages. It is simpler to generate the html file and then click the camera icon in the top right of the tool bar that appears when you hover over the figure. This will download an image of the plot for you.
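
As a minimal sketch of both kinds of export, the helper below saves a table and a figure side by side. It assumes `table` is a Pandas Dataframe (so it has `.to_csv()`) and `figure` is a Plotly figure (so it has `.write_html()`). The function name `export_for_report` and the file naming scheme are assumptions for illustration.

```python
def export_for_report(table, figure, stem):
    """Save a table as CSV and a figure as HTML, sharing one file name stem.

    table:  a Pandas Dataframe, e.g. the output of .get_topic_info()
    figure: a Plotly figure, e.g. the output of a BERTopic visualise method
    stem:   the file name without extension, e.g. "topic_overview"
    """
    csv_path = f"{stem}.csv"
    html_path = f"{stem}.html"

    # index=False leaves out the row numbers, which are rarely needed in a report
    table.to_csv(csv_path, index=False)

    # The HTML file keeps the figure interactive when opened in a browser
    figure.write_html(html_path)

    return csv_path, html_path
```

Opening the saved HTML file in a browser gives you the same hover toolbar, including the camera icon for downloading an image.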

On the rare occasion we use a Seaborn chart instead...

In [None]:
# fig = heatmap.get_figure()
