> # Pre-Lab Instructions
> <img src="https://github.com/Minyall/sc207_290_public/blob/main/images/attention.webp?raw=true" height=200>

> For this lab you will need:
> - DATA: `farright_dataset_cleaned.parquet` - Download from Moodle and upload to this Colab session.
> - INSTALL: You will need to install `bertopic` and `embedding-atlas`. Use the cell below.

In [None]:
# Uncomment the line below and run 
# ! pip install bertopic embedding-atlas

# Let it completely finish before moving on

# Transformer Topic Modelling
<img src="https://github.com/Minyall/sc207_materials/blob/master/images/funko_prime.jpg?raw=true" align="right" width="200">

Whilst topic modelling has been a foundational technique for many years, recent strides in neural networks, alongside the building of huge archives of textual material, mean that rather than building our own textual models as before, we can use models pre-trained on millions of examples. These models are far better at accounting for the semantic meaning of words, have a better sense of which words should be given the most attention, and account for word context.

For example, in the sentence "One dog greeted the other dog" the values assigned to the first 'dog' will differ from the values assigned to the second 'dog'. BERT refers to...
- **B**idirectional - Considers each word and looks both at what precedes it and what follows it.
- **E**ncoder - Encodes the textual material into numerical values...
- **R**epresentations from - that accurately represent the original textual meaning
- **T**ransformers - a type of machine learning model that is able to adjust what parts of the data it pays most attention to.
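
The idea that the same word can receive different values depending on its neighbours can be illustrated with a toy sketch. This is not how BERT actually computes anything: the two-number word vectors and the `contextual_vectors` helper are invented for illustration, and "context" is approximated by simply averaging each word with the words immediately before and after it.

```python
# Toy illustration only: hypothetical 2-number "static" vectors for each word.
static_vectors = {
    "one": [1.0, 0.0],
    "dog": [0.0, 1.0],
    "greeted": [1.0, 1.0],
    "the": [0.5, 0.0],
    "other": [0.0, 0.5],
}


def contextual_vectors(words, vectors):
    """Average each word's vector with its left and right neighbours."""
    result = []
    for i in range(len(words)):
        # Look both ways (bidirectional): previous word, the word itself, next word.
        window = words[max(0, i - 1): i + 2]
        dims = len(vectors[words[i]])
        averaged = [
            sum(vectors[w][d] for w in window) / len(window) for d in range(dims)
        ]
        result.append(averaged)
    return result


sentence = "one dog greeted the other dog".split()
context = contextual_vectors(sentence, static_vectors)

# Both words start from the same static vector...
first_dog, second_dog = context[1], context[5]
# ...but end up with different values because their neighbours differ.
print("first 'dog': ", first_dog)
print("second 'dog':", second_dog)
```

A real transformer learns how much attention each neighbour deserves rather than weighting them equally, but the principle is the same: the surrounding words change the numbers.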

Transformers have been used for many applications, including language translation and [text generation](https://huggingface.co/gpt2).

If you'd like more technical details on the BERT model you can see this [well illustrated guide](https://www.exxactcorp.com/blog/Deep-Learning/how-do-bert-transformers-work), but it is not necessary to fully understand the details, thanks to...

# BERTopic
<img src="https://maartengr.github.io/BERTopic/logo.png?raw=true" align="right" width="200">

- [BERTopic Website](https://maartengr.github.io/BERTopic/index.html)

- Grootendorst, M. (2022) ‘BERTopic: Neural topic modeling with a class-based TF-IDF procedure’. arXiv. Available at: [https://doi.org/10.48550/ARXIV.2203.05794](https://doi.org/10.48550/ARXIV.2203.05794)

BERTopic is a Python library that leverages BERT transformers whilst providing an accessible set of methods for helpful visualisations, summaries and tweaking of the model.




In [None]:
from bertopic import BERTopic
import pandas as pd

### Our Dataset

In [None]:
articles = pd.read_parquet('farright_dataset_cleaned.parquet')
articles.head()

### Basic BERTopic

BERTopic analysis can be broken down into two parts.

1. Embeddings
2. Topic Representation

#### 1. Embeddings
Embeddings rely on a pre-trained BERT model, like the one we used in the previous session to determine the similarity/difference of our documents. Remember that embeddings work best if we provide the whole text, with all its variation in words, punctuation etc. We'll use the data in our `cleaned_text` column.

#### 2. Topic Representation
Separately, BERTopic uses a variation of TFIDF to generate keywords that represent the topics it finds using the embeddings. In this case the process works best when we **DO** strip out the noise and grammatical features because, like standard TFIDF, it is based on the frequency of words. For this we'll use the pre-prepared tokens we created in our session on text preparation, the `tokens` column.
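
To make the "variation of TFIDF" a little more concrete, here is a minimal sketch of the class-based TF-IDF idea described in Grootendorst (2022): all documents in a topic are treated as one long document, and each term's frequency in that topic is weighted by `log(1 + A / f)`, where `A` is the average number of words per topic and `f` is the term's frequency across all topics. The `class_tfidf` function and the toy tokens are assumptions for illustration; the library's own implementation may differ in details such as normalisation.

```python
import math
from collections import Counter


def class_tfidf(docs_per_topic):
    """docs_per_topic maps a topic id to a list of token lists."""
    # Treat all documents in a topic as one long document and count its terms.
    term_counts = {
        topic: Counter(token for doc in docs for token in doc)
        for topic, docs in docs_per_topic.items()
    }
    # Frequency of each term across all topics combined.
    total_counts = Counter()
    for counts in term_counts.values():
        total_counts.update(counts)
    # Average number of words per topic.
    avg_words = sum(total_counts.values()) / len(term_counts)

    scores = {}
    for topic, counts in term_counts.items():
        scores[topic] = {
            term: tf * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
    return scores


# Hypothetical, already-tokenised documents grouped by topic.
toy_docs = {
    0: [["immigration", "border", "policy"], ["border", "election"]],
    1: [["election", "vote", "party"], ["party", "election"]],
}
scores = class_tfidf(toy_docs)
for topic, term_scores in scores.items():
    best = max(term_scores, key=term_scores.get)
    print(topic, best, round(term_scores[best], 3))
```

Notice that "election" appears in both toy topics, so it is weighted down, whereas terms concentrated in one topic rise to the top.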





### Basic BERTopic - default settings, no custom pre-processing

In [None]:
from sentence_transformers import SentenceTransformer

transformer = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = transformer.encode(articles['cleaned_text'])

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(stop_words='english', min_df=5, max_df=0.95)

topic_model = BERTopic(calculate_probabilities=True, vectorizer_model=cv) # one extra argument so we can demonstrate a feature later
topics, probabilities = topic_model.fit_transform(articles['cleaned_text'], embeddings=embeddings)

In [None]:
probabilities

In [None]:
topic_model.get_topic_info()

In [None]:
topic_model.update_topics(docs=articles['tokens'], topics=topics, vectorizer_model=cv)
topic_model.get_topic_info()

Here are our topics, plus a noise topic labelled -1. There is almost always a noise topic as not all documents perfectly fit into a cluster. There may be outliers, or documents that are difficult to classify as one topic or another.

To understand why, we can visualise the documents as a scatterplot, similarly to how we did last week.

In [None]:
topic_model.visualize_documents(docs=articles['webTitle'], topics=topics, embeddings=embeddings)

If we remember when we examined embeddings using `embedding_atlas`, there were areas of denser concentration of articles. Here BERTopic has identified those dense areas and labelled them. Areas where there aren't so many articles concentrated have been labelled as noise.
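
It can be useful to put a number on how much of the corpus ended up as noise. The sketch below is a small helper that summarises a list of topic assignments; `summarise_assignments` and the example list are hypothetical, and in the lab you could pass in the `topics` list returned by `fit_transform` instead.

```python
from collections import Counter


def summarise_assignments(topic_assignments):
    """Count documents, real topics and noise (-1) in a list of topic ids."""
    counts = Counter(topic_assignments)
    n_documents = len(topic_assignments)
    n_noise = counts.get(-1, 0)
    return {
        "n_documents": n_documents,
        # Every distinct label except -1 counts as a real topic.
        "n_topics": len([t for t in counts if t != -1]),
        "n_noise": n_noise,
        "noise_share": n_noise / n_documents if n_documents else 0.0,
    }


# Made-up assignments for ten documents.
example_assignments = [0, 0, 1, -1, 2, 1, -1, 0, 0, 2]
summary = summarise_assignments(example_assignments)
print(summary)
```

A large noise share is not necessarily a problem, but it is worth reporting alongside your topics.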

In [None]:
# Our first built in visualisation helps us quickly see the topics and their associated words. Hover over the bars to see words and scores.

topic_model.visualize_barchart(n_words=10,height=400)

If we want to get a sense of what documents are exemplary of these topics we can ask for the representative documents.

In [None]:
topic_model.get_representative_docs(6)

### Topic and Document Distribution

We can see the similarity of topics using the built in visualiser. Whether they are or are not similar to the extent that they could be merged as a single topic is down to qualitative assessment. Normally they will overlap if they are all part of a larger overarching topic.

In [None]:
topic_model.visualize_topics()

The plot above shows us the distance between topics, with the size of the circle indicating the relative size of the topic in the corpus. Topics that are closer together are considered similar. For a more detailed view, look back at the document scatterplot we produced earlier with `visualize_documents`, which plots the document embeddings in two dimensions.
In that call the first argument (`docs`) specifies how to label the points; because we provide the embeddings, the text itself isn't needed to position them.

### Hierarchical Clustering
This visual shows us how the topics were determined, indicating where large clusters of documents were split into multiple groups and at what point.
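
A hierarchy like this is typically built bottom-up, by repeatedly merging the two closest groups; read from the top down, the same tree shows where large groups split. The sketch below illustrates that merging process on invented data. The `merge_order` function and the single-number topic "positions" are assumptions for illustration, and BERTopic's own linkage method and distance measure may differ.

```python
def merge_order(positions):
    """positions maps a topic name to a single number (a toy 1-D 'location')."""
    # Start with every topic in its own cluster: (member names, centre).
    clusters = [([name], value) for name, value in positions.items()]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters whose centres are closest together.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                distance = abs(clusters[i][1] - clusters[j][1])
                if best is None or distance < best[0]:
                    best = (distance, i, j)
        distance, i, j = best
        names = clusters[i][0] + clusters[j][0]
        # New centre: mean position of every topic in the merged cluster.
        centre = sum(positions[n] for n in names) / len(names)
        merges.append((clusters[i][0], clusters[j][0], distance))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((names, centre))
    return merges


# Made-up topic positions: two tight pairs that sit far apart from each other.
toy_positions = {"immigration": 0.0, "borders": 1.0, "elections": 5.0, "parties": 6.5}
merges = merge_order(toy_positions)
for left, right, distance in merges:
    print(left, "+", right, "at distance", distance)
```

The final merge happens at a much larger distance than the first two, which is the kind of gap to look for when deciding whether topics belong together.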

In [None]:
hierarchical_topics = topic_model.hierarchical_topics(docs=articles['tokens'])
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)

In [None]:
topic_model.visualize_hierarchical_documents(docs=articles['tokens'],hierarchical_topics=hierarchical_topics, embeddings=embeddings)

### Term scoring
When looking at a topic's keywords, how far down the list do you go before you stop looking? Top 10, top 20? Term rank allows us to see the point at which adding more terms no longer helps to differentiate topics.

In [None]:
topic_model.visualize_term_rank()

The guidance is to look for the 'knee' or 'elbow' where the line flattens out. Beyond that point, additional terms do little to improve the differentiation. In this case we can see that differentiation dramatically declines for most topics after only 3 keywords.
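
If you would rather not judge the elbow purely by eye, one common heuristic is to find the point that sits furthest below a straight line drawn from the first score to the last. The sketch below applies that heuristic to invented scores; `find_elbow` is a hypothetical helper and is not part of BERTopic, which only draws the plot.

```python
def find_elbow(scores):
    """Return the 1-based rank where a descending score curve bends most sharply."""
    n = len(scores)
    if n < 3:
        return n
    top, bottom = scores[0], scores[-1]
    if top == bottom:
        return 1
    best_rank, best_gap = 1, 0.0
    for i, score in enumerate(scores):
        # Rescale both axes to 0-1 so rank and score are comparable.
        x = i / (n - 1)
        y = (score - bottom) / (top - bottom)
        # How far the point falls below the straight line from first to last.
        gap = 1 - x - y
        if gap > best_gap:
            best_rank, best_gap = i + 1, gap
    return best_rank


# Made-up keyword scores for one topic, ordered from rank 1 downwards.
toy_scores = [0.30, 0.12, 0.06, 0.05, 0.045, 0.04, 0.038, 0.035]
elbow_rank = find_elbow(toy_scores)
print("Elbow at rank", elbow_rank)
```

Treat the result as a prompt for your own judgement rather than a definitive cut-off.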

# Topics over time
If you have datestamps for your individual data points, you can get BERTopic to show you topic trends over time.
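
The core idea is to divide the date range into equal-width bins and count how often each topic appears in each bin. The sketch below shows only that counting step, on invented data; `topic_counts_over_time` is a hypothetical helper, and it does not recalculate keywords per time bin.

```python
from collections import Counter
from datetime import datetime


def topic_counts_over_time(topic_ids, timestamps, nr_bins):
    """Count documents per (time bin, topic) using equal-width time bins."""
    start, end = min(timestamps), max(timestamps)
    span = (end - start).total_seconds()
    counts = Counter()
    for topic, stamp in zip(topic_ids, timestamps):
        if span == 0:
            bin_index = 0
        else:
            fraction = (stamp - start).total_seconds() / span
            # The latest timestamp would land in bin nr_bins, so cap it.
            bin_index = min(int(fraction * nr_bins), nr_bins - 1)
        counts[(bin_index, topic)] += 1
    return counts


# Made-up publication dates and topic assignments for five articles.
toy_stamps = [
    datetime(2020, 1, 1),
    datetime(2020, 3, 1),
    datetime(2020, 6, 1),
    datetime(2020, 9, 1),
    datetime(2020, 12, 31),
]
toy_topic_ids = [0, 0, 1, 1, 0]
time_counts = topic_counts_over_time(toy_topic_ids, toy_stamps, nr_bins=2)
for (bin_index, topic), n in sorted(time_counts.items()):
    print(f"bin {bin_index}, topic {topic}: {n}")
```

More bins give a finer-grained trend but fewer documents per bin, so the choice of `nr_bins` is a trade-off.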

In [None]:
topics_over_time = topic_model.topics_over_time(docs=articles['tokens'],topics=topics,timestamps=articles['webPublicationDate'], nr_bins=50)

In [None]:
topic_model.visualize_topics_over_time(topics_over_time)

Note that as you hover over each point, the keywords for the topic change. This helps us see how the topic discourse may have altered over time.

In [None]:
# The raw data used to generate the visual is in our topics_over_time dataframe
topics_over_time

# Topics per class
Topics per class allows us to 'split' up the model to see how topics might differ depending on some sort of classification. So for example in our data, if we took the time to label each document with the type of publication (broadsheet newspaper, tabloid, left wing, right wing, etc.) we could see how the topics found across all the documents differ depending on the type of publisher.

We don't have that information(!) but we can demonstrate using our `sectionName` classification at least.
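
At its simplest, splitting by class means asking what share of each class's documents falls into each topic. The sketch below computes those shares for invented data; `topic_share_per_class` and the toy section labels are assumptions for illustration, and this only counts documents rather than recalculating keywords for each class.

```python
from collections import Counter, defaultdict


def topic_share_per_class(topic_ids, classes):
    """For each class, return the share of its documents assigned to each topic."""
    counts = defaultdict(Counter)
    for topic, label in zip(topic_ids, classes):
        counts[label][topic] += 1
    shares = {}
    for label, topic_counts in counts.items():
        total = sum(topic_counts.values())
        shares[label] = {t: c / total for t, c in topic_counts.items()}
    return shares


# Made-up topic assignments and section labels for six articles.
toy_topics = [0, 0, 1, 1, 1, 2]
toy_sections = [
    "Politics", "Politics", "Politics",
    "World news", "World news", "World news",
]
shares = topic_share_per_class(toy_topics, toy_sections)
for section, topic_shares in shares.items():
    print(section, {t: round(s, 2) for t, s in topic_shares.items()})
```

Using shares rather than raw counts makes classes of different sizes easier to compare.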

In [None]:
topics_per_class = topic_model.topics_per_class(docs=articles['tokens'].tolist(), classes=articles['sectionName'].tolist())
topic_model.visualize_topics_per_class(topics_per_class)

This is still informative in that it shows us which topics are most important for each class, but also that some topics might actually overlap a little. Again, note that the words for each topic differ depending on the classification.

In [None]:
# ...and again the raw data is available to us in the variable we created...
topics_per_class

### Topic Similarity
A different way of examining similar phenomena - where do topics overlap, how similar or different are they. Ideally you don't want all your topics to be highly similar, because then you haven't been able to distinguish different topics. However if some overlap in some way, that might tell you something interesting about how different discourses/issues/cultures might overlap or intersect.
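
A heatmap of this kind is built from pairwise similarity scores between the vectors that represent each topic. The sketch below computes a cosine similarity matrix for three invented topic vectors; the helper names and the numbers are assumptions for illustration, not values from our model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors):
    """Compare every vector with every other vector."""
    return [[cosine_similarity(a, b) for b in vectors] for a in vectors]


# Made-up vectors standing in for three topics.
toy_topic_vectors = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
]
matrix = similarity_matrix(toy_topic_vectors)
for row in matrix:
    print([round(value, 2) for value in row])
```

Here the first two toy topics overlap substantially while the third is unrelated to both, which is the pattern of light and dark cells to look for in the real heatmap.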

In [None]:
topic_model.visualize_heatmap()

# Topic Distribution
If you recall in LDA topic modelling every document has a score for each topic. Whilst most documents might align strongly with only one topic, this approach recognised that topics existed across documents, and one document could contain multiple topics.

BERTopic does not work like LDA but it does provide us a table of probabilities. This shows us how probable it is that a document could be classified as topic x.
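
Each row of that table is just a list of numbers, one per topic, so ranking them tells us a document's most likely topics. The sketch below does this for an invented row; `top_topics` is a hypothetical helper, and in the lab you could pass it `probabilities[story_index]` instead.

```python
def top_topics(probability_row, n=3):
    """Return the n most probable (topic number, probability) pairs."""
    ranked = sorted(
        enumerate(probability_row), key=lambda pair: pair[1], reverse=True
    )
    return ranked[:n]


# Made-up probabilities for one document across four topics.
toy_row = [0.05, 0.60, 0.10, 0.25]
best_two = top_topics(toy_row, n=2)
print(best_two)
```

A document whose top two probabilities are close together may be worth reading closely, as it could sit between topics.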

In [None]:
story_index = 1
print(articles.loc[story_index,'webTitle'])
print(topics[story_index])

topic_model.visualize_distribution(probabilities[story_index])

In [None]:
datamap = topic_model.visualize_document_datamap(docs=articles['cleaned_text'], embeddings=embeddings, title='"Far Right" Stories')

In [None]:
from embedding_atlas.widget import EmbeddingAtlasWidget

umap_model = BERTopic().umap_model

umap_model.n_components = 2

two_dim = umap_model.fit_transform(embeddings)

articles[['x','y']] = two_dim

articles['topic_num'] = topics
articles['topic_label'] = articles['topic_num'].map(topic_model.topic_labels_)

widget = EmbeddingAtlasWidget(articles, text='tokens', x='x', y='y', show_charts=False)
widget

## Visualising topic content

In [None]:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

fig = topic_model.visualize_document_datamap(docs=articles['tokens'], embeddings=embeddings, title='"Far Right" Stories')
plt.tight_layout()
fig.savefig('datamap.png', dpi=400)

In [None]:
# Recalculate the topic keywords, keeping the top 50 per topic so the word cloud has plenty of words to draw
topic_model.update_topics(docs=articles['tokens'], vectorizer_model=cv, top_n_words=50)




def create_wordcloud(model, topic, save_to=None):
    """Draw a word cloud of a topic's keywords, sized by their scores."""
    # get_topic returns (word, score) pairs; WordCloud expects a {word: score} dict
    text = {word: value for word, value in model.get_topic(topic)}
    wc = WordCloud(background_color="white", width=600, height=400)
    wc.generate_from_frequencies(text)
    plt.figure(figsize=(20,10))

    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    if save_to is not None:
        plt.savefig(save_to, dpi=400)
    plt.show()

# Show the word cloud for topic 1 and save a copy
create_wordcloud(topic_model, topic=1, save_to='topic_1.png')

Topics are essentially just another categorical classification for your documents. We can add them to the articles dataframe and then explore them using other techniques we learned in SC207.

In [None]:
articles['topic'] = topics

import seaborn as sns

sns.catplot(data=articles[articles['topic'] >=0], x='topic', y='wordcount', kind='box')

In [None]:
def view_stories(data, n_stories=10):
    if n_stories is not None:
        data = data.head(n_stories)
    for index, row in data.iterrows():
        print(index, row['webTitle'])
        print(row['webUrl'])
        print('****')


to_view = articles[articles['topic'] == 0]
view_stories(to_view)

In [None]:
tag_per_line = articles.explode('tags')

tag_data = pd.json_normalize(tag_per_line['tags'])
tag_data = tag_data.set_index(tag_per_line.index)

tag_data['article_title'] = tag_per_line['webTitle']
tag_data['article_url'] = tag_per_line['webUrl']
tag_data['topic'] = tag_per_line['topic']
tag_data = tag_data[['webTitle','article_title','article_url','topic']].rename(columns={'webTitle':'tag'})
tag_data.head()



In [None]:
tag_data_per_topic = tag_data.groupby('topic')

TOPIC = 0

topic_tags = tag_data_per_topic.get_group(TOPIC).reset_index()
order = topic_tags['tag'].value_counts().head(10).index
sns.catplot(data=topic_tags, y='tag',
            kind='count',
            aspect=1.5,
            order=order).set(
            title=f'Top tags for Topic {TOPIC}')


## Exporting Analysis for use in your report
### Tables

Any of the tables that are produced by BERTopic can be exported as they are Pandas Dataframes...

In [None]:
from csv import QUOTE_ALL
topic_model.get_topic_freq().to_csv('topic_frequency.csv')
topic_model.get_topic_info().to_csv('topic_info.csv', quoting=QUOTE_ALL)

### Figures
All figures produced by BERTopic are actually [Plotly](https://plotly.com/python/) figures. They can be exported too...

In [None]:


topic_bar_chart = topic_model.visualize_barchart(topics=[1,2,3,4],n_words=20, height=400)
topic_bar_chart.write_html('topic_bar_chart.html')

Whilst you can write directly to an image file in Plotly, it requires additional packages. It is simpler to generate the html file, open it, and then click the camera icon in the top right of the toolbar that appears when you hover over the figure. This will download an image of the plot for you.

On the rare occasion we use a Seaborn chart instead...

In [None]:
sns.catplot(data=articles[articles['topic'] >=0], x='topic', y='wordcount', kind='box')
plt.savefig('topic_word_boxplot.png', dpi=300) # dpi 300 or 400 produces a good sized image