# SC207 - Text Mining
## Topic Modelling with BERT

[BERTopic Website](https://maartengr.github.io/BERTopic/index.html)

In [None]:
! pip install bertopic

In [None]:
from bertopic import BERTopic
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# May be required if you see an IProgress widget error
# ! pip install ipywidgets widgetsnbextension
# !jupyter nbextension enable --py widgetsnbextension

### Our Dataset

In [None]:
news_df = pd.read_csv('sample_news_large_with_tokens.csv')

In [None]:
news_df.head()

In [None]:
raw_corpus = news_df['text'].tolist()
raw_corpus[0:2]

In [None]:
tokenized_corpus = news_df['tokens'].tolist()
tokenized_corpus[0:2]

### Basic BERTopic

BERTopic analysis can be broken down into two parts.

1. Embeddings
2. Topic Representation

#### 1. Embeddings
BERTopic uses a pre-trained language model (similar to spaCy's language models) to determine the similarity of different documents within a corpus. The BERT model is able to factor word order and the semantic similarity of words into its predictions, and to better anticipate which words within a document matter most for conveying its meaning. In essence, it looks at words within their context and makes a prediction about which documents are similar and which are dissimilar. It expresses this using a large array of numbers called 'embeddings'. In many ways embeddings are similar to the vectors we produced when we counted the frequency of words.


> 1. I am a pet dog
> 2. I am a pet cat
> 3. Architecture is a serious discipline


| **Document** | I | am | a | pet | dog | cat | architecture | is | serious | discipline |
|--------------|---|----|---|-----|-----|-----|--------------|----|---------|------------|
| 1            | 1 | 1  | 1 | 1   | 1   | 0   | 0            | 0  | 0       | 0          |
| 2            | 1 | 1  | 1 | 1   | 0   | 1   | 0            | 0  | 0       | 0          |
| 3            | 0 | 0  | 1 | 0   | 0   | 0   | 1            | 1  | 1       | 1          |
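
To make the table concrete, here is a minimal sketch that rebuilds the same count vectors in plain Python and compares them with cosine similarity. The naive whitespace tokeniser and the helper names (`count_vector`, `cosine_similarity`) are illustrative assumptions and not part of BERTopic.

```python
import math
from collections import Counter

docs = [
    "I am a pet dog",
    "I am a pet cat",
    "Architecture is a serious discipline",
]

# Deliberately naive tokeniser: lower-case and split on whitespace.
tokenized = [doc.lower().split() for doc in docs]

# Vocabulary in order of first appearance, which matches the table columns.
vocab = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocab:
            vocab.append(token)


def count_vector(tokens):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


vectors = [count_vector(tokens) for tokens in tokenized]
for row in vectors:
    print(row)

# The two pet sentences share four of their five words; the third shares only "a".
print(round(cosine_similarity(vectors[0], vectors[1]), 2))  # 0.8
print(round(cosine_similarity(vectors[0], vectors[2]), 2))  # 0.2
```

Count vectors only register shared words. Embeddings are compared in the same way, but are intended to also place documents with related meanings close together even when they share few words.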

BERT embeddings represent each document as a row of 384 columns. This data can be used by our computational tools to find clusters of documents based on their semantic similarity, to identify which documents are representative of a particular cluster, and so on. It is a powerful approach to topic modelling because it is aware not just of the words used, but of how they are used and the context of their usage. As such, BERT embeddings are best produced using completely raw text, without any pre-processing.

#### 2. Topic Representation
Separately, BERTopic uses a variation of TF-IDF to generate keywords that represent the clusters of documents found via the embeddings. By default it uses a fairly basic vectorisation and pre-processing routine. However, it is also possible to pass it pre-prepared tokens (like those we produced in earlier sessions), which it will happily use instead. Because the embeddings are generated from the raw text, whilst the keywords are produced from pre-processed text, our pre-processing decisions won't affect the embeddings or the clustering itself, only the quality of the words used to describe the clusters.
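
The idea behind this class-based TF-IDF can be sketched in a few lines. This is a simplified illustration under stated assumptions: the two toy clusters are hypothetical, and the weighting (a term's share of its cluster multiplied by `log(1 + average words per cluster / term frequency across all clusters)`) is a plain-Python approximation of the scheme, not BERTopic's own implementation.

```python
import math
from collections import Counter

# Hypothetical clusters of pre-tokenised documents (toy data, not the course dataset).
toy_clusters = {
    0: [["pet", "dog", "walk"], ["pet", "cat", "sleep"]],
    1: [["building", "design", "architecture"], ["architecture", "discipline", "design"]],
}

# Treat every cluster as one long "class document" and count its terms.
class_counts = {
    topic: Counter(token for doc in docs for token in doc)
    for topic, docs in toy_clusters.items()
}

# Frequency of each term across all clusters, and the average cluster length.
total_counts = Counter()
for counts in class_counts.values():
    total_counts.update(counts)
avg_words_per_class = sum(total_counts.values()) / len(class_counts)


def c_tf_idf(topic):
    """Score each term of a cluster: frequent inside it, rare elsewhere scores high."""
    counts = class_counts[topic]
    n_words = sum(counts.values())
    return {
        term: (freq / n_words) * math.log(1 + avg_words_per_class / total_counts[term])
        for term, freq in counts.items()
    }


def top_terms(topic, n=2):
    """Return the n highest-scoring terms, breaking ties alphabetically."""
    scores = c_tf_idf(topic)
    return sorted(scores, key=lambda term: (-scores[term], term))[:n]


print(top_terms(0))
print(top_terms(1))
```

Because the scores are computed per cluster rather than per document, changing the tokens only changes the keywords. The cluster membership stays exactly as the embeddings determined it.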



### Basic BERTopic - default settings, no custom pre-processing

In [None]:
topic_model = BERTopic(calculate_probabilities=True)
topics, probabilities = topic_model.fit_transform(raw_corpus)

In [None]:
topic_model.get_topic_info()

Here are our topics. It has discovered 7 topics (plus a noise topic labelled -1). The topic representations, i.e. the keywords used to describe each topic, are not very informative.

In [None]:
topic_model.visualize_barchart(n_words=10)

Remember the separation in the model: the embeddings, which determine the topics, and the topic representation. We can update the representation side of our topic model without affecting the embedding side.

In [None]:
topic_model.update_topics(docs=tokenized_corpus, topics=topics)

In [None]:
topic_model.get_topic_info()

In [None]:
topic_model.visualize_barchart()

Looking better. The number and distribution of the topics is still broadly the same, but now the topic descriptions are improved. Let's work with this.

#### Assessing our Topic Model

To some extent we already know generally what topics should be in our data as we know the queries that generated the documents.

In [None]:
news_df['query'].unique()

In [None]:
news_df['topic'] = topics

In [None]:
query_topic_crosstab = pd.crosstab(index=news_df['query'], columns=news_df['topic'])
query_topic_crosstab

In [None]:
sns.heatmap(query_topic_crosstab, cmap='YlGnBu')
plt.show()

We can see that the topics generally conform to our queries. This is a good sign, indicating that our embeddings were able to accurately determine similarity. We can even see some crossover on particular queries.

We won't always have existing classifications like this, but this helps give us confidence that if we did the same procedure on a set of documents in which we had no sense of the topics, it would be able to surface them for us.
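
One way to put a number on this agreement is to ask, for each topic, which query dominates it and what share of the topic's documents that query accounts for. The sketch below uses hypothetical `(query, topic)` pairs in place of `news_df['query']` and `news_df['topic']`; the function name `topic_purity` is an assumption made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical stand-ins for news_df['query'] and news_df['topic'] (toy data).
toy_queries = ["brexit", "brexit", "trump", "tesla", "tesla", "hong kong", "hong kong", "tesla"]
toy_topics = [2, 2, 2, 3, 3, 4, 4, -1]


def topic_purity(queries, topics):
    """Map each topic to (dominant query, share of the topic's documents).

    Documents in the noise topic (-1) are ignored.
    """
    per_topic = defaultdict(Counter)
    for query, topic in zip(queries, topics):
        if topic != -1:
            per_topic[topic][query] += 1

    purity = {}
    for topic, counts in per_topic.items():
        query, n_docs = counts.most_common(1)[0]
        purity[topic] = (query, n_docs / sum(counts.values()))
    return purity


purity = topic_purity(toy_queries, toy_topics)
for topic, (query, share) in sorted(purity.items()):
    print(f"topic {topic}: mostly '{query}' ({share:.0%})")
```

A share close to 100% means a topic maps cleanly onto one query, while lower shares point to the crossover visible in the heatmap.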

In [None]:
topic_model.generate_topic_labels()

We can see the similarity of topics using the built-in visualiser. Whether topics are similar enough to be merged into a single topic is a matter of qualitative assessment. Normally they will overlap if they are all part of a larger overarching topic.

### Topic and Document Distribution

In [None]:
topic_model.visualize_topics()

The plot above shows us the distance between topics, with the size of each circle indicating the relative size of the topic in the corpus. Topics that are closer together are considered similar. We can see a more detailed version by visualizing the document embeddings in two dimensions.
The first argument specifies how to label the points. Because we provide the embeddings ourselves, we can pass the article titles as labels rather than relying on the full text.

In [None]:
embeddings = topic_model._extract_embeddings(raw_corpus)

In [None]:
topic_model.visualize_documents(news_df['title'].tolist(), embeddings=embeddings)

If we examine the scatter plot above more closely, and consider the article titles we can see why some articles might be closer together even within a cluster.

### Hierarchical Clustering
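This visual shows us how the topics relate to one another hierarchically. Read from the root down, it indicates where large clusters of documents split into multiple groups and at what point; read the other way, it shows which topics would be merged first if we wanted fewer topics.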

In [None]:
topic_model.visualize_hierarchy()

We can see above that topic 4 (Hong Kong/protest) was considered distinct enough to be separated from the remaining documents early on. It is followed by topic 3 (Tesla), then topic 5 (online scams and online abuse), then topic 2 (Brexit), and finally topic 0 (Facebook/Libra) and topic 1 (Trump/alt-right). The colouring indicates that these last three topics are more similar to one another than to the others, having been split off from a larger cluster.
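
The logic of reading such a tree can be sketched with a small agglomerative clustering in plain Python. Everything here is an assumption made for illustration: the topic vectors are invented toy numbers (only the labels echo the topics above), and average linkage on cosine distance is a simple stand-in rather than the method BERTopic uses internally.

```python
import math
from itertools import combinations

# Invented toy vectors; only the labels are borrowed from the topics discussed above.
toy_topic_vectors = {
    "facebook/libra": [0.9, 0.8, 0.1, 0.0],
    "trump/alt-right": [0.8, 0.9, 0.2, 0.0],
    "brexit": [0.6, 0.5, 0.6, 0.1],
    "hong kong/protest": [0.0, 0.1, 0.1, 0.9],
}


def cosine_distance(a, b):
    """1 minus cosine similarity: 0 for identical directions, 1 for unrelated ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)


def average_linkage(cluster_a, cluster_b):
    """Mean distance over all pairs of topics drawn from the two clusters."""
    distances = [
        cosine_distance(toy_topic_vectors[a], toy_topic_vectors[b])
        for a in cluster_a
        for b in cluster_b
    ]
    return sum(distances) / len(distances)


def merge_order():
    """Repeatedly merge the two closest clusters and record each merge."""
    clusters = [(name,) for name in toy_topic_vectors]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: average_linkage(*pair))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b))
    return merges


merges = merge_order()
for step, (a, b) in enumerate(merges, start=1):
    print(step, a, "+", b)
```

With these toy vectors the two most similar topics merge first and the most distinct topic joins last, which corresponds to it being the first to split off when the tree is read from the root down.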

### Topic Similarity
A different way of examining a similar question: where do topics overlap, and how similar or different are they? Ideally you don't want all your topics to be highly similar, because that means the model has not managed to distinguish different topics. However, if some topics overlap, that might tell you something interesting about how different discourses, issues or cultures overlap or intersect.

In [None]:
topic_model.visualize_heatmap()

### Term scoring
When looking at a topic's keywords, how far down the list do you go before you stop looking? The top 10? The top 20? Term rank lets us see the point at which adding more terms stops helping to differentiate topics.

In [None]:
topic_model.visualize_term_rank()

The guidance is to look for the 'knee' or 'elbow' where the line flattens out. Beyond that point, additional terms add little to the differentiation. Here we can see that differentiation declines dramatically for most topics after only 3 keywords.
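
If you want a rough, automatic version of this eyeballing, one crude heuristic is to find the rank after which the score drops the most. The scores below are hypothetical, and `elbow_rank` is an illustrative helper rather than anything BERTopic provides; it is no substitute for looking at the plot.

```python
def elbow_rank(scores):
    """Return the 1-based rank after which the largest single drop in score occurs.

    Expects scores sorted from the highest-ranked term downwards.
    """
    drops = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]
    return drops.index(max(drops)) + 1


# Hypothetical keyword scores for one topic, from rank 1 downwards.
toy_scores = [0.21, 0.19, 0.17, 0.06, 0.05, 0.045, 0.04]

print(elbow_rank(toy_scores))  # 3: the scores fall sharply after the third keyword
```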

In [None]:
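# Keyword arguments avoid relying on the positional order of topics and timestamps
topics_over_time = topic_model.topics_over_time(docs=tokenized_corpus, topics=topics, timestamps=news_df['published'].tolist())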

In [None]:
topic_model.visualize_topics_over_time(topics_over_time)

In [None]:
topics_per_class = topic_model.topics_per_class(tokenized_corpus,topics, classes=news_df['query'].tolist())
topic_model.visualize_topics_per_class(topics_per_class)

In [None]:
topic_model.visualize_distribution(probabilities[2])

In [None]:
news_df.loc[2,'title']

In [None]:
topic_model.get_representative_docs()[5][2]

In [None]:
topic_model.get_topic_info()

In [None]:
cats, labels = pd.factorize(news_df['query'])
cats

class_topic_model = BERTopic()
class_topics, class_probabilities = class_topic_model.fit_transform(documents=raw_corpus, y=cats)

In [None]:
class_topic_model.get_topic_info()

In [None]:
class_topic_model.update_topics(docs=tokenized_corpus, topics=class_topics)

In [None]:
class_topic_model.get_topic_info()

In [None]:
news_df['class_topic'] = class_topics

query_topic_crosstab = pd.crosstab(index=news_df['query'], columns=news_df['class_topic'])
query_topic_crosstab

In [None]:
sns.heatmap(query_topic_crosstab)

# Topic Modelling with Twitter Data - Example

Here we show an example of how you can apply BERTopic to Twitter data. The short texts, noise and variability of Twitter data can make it difficult to tell whether there are any latent topics of discussion within your collected data. In this example we'll show how you can use the community detection analysis from your network analysis work to help guide the topic modelling process, as well as some of the more advanced tweaks that can help improve your models.

In [None]:
edges = pd.read_csv('retweet_edge_list.csv')
edges

In [None]:
import networkx as nx
rt_network = nx.from_pandas_edgelist(edges, edge_attr='weight', create_using=nx.Graph)
print(rt_network.number_of_nodes())

In [None]:
def filter_by_degree(G, minimum_degree):
    """Remove nodes whose degree is below minimum_degree. Modifies G in place."""
    scores = G.degree()
    remove_nodes = [node for node, degree in scores if degree < minimum_degree]
    G.remove_nodes_from(remove_nodes)
    return G

def filter_by_giant_component(G):
    """Return the largest connected component of G as a subgraph view."""
    components = sorted(nx.connected_components(G), key=len, reverse=True)
    return G.subgraph(components[0])

def louvain_modularity(G, weight='weight', resolution=1):
    """Detect Louvain communities; return (node/community DataFrame, modularity score)."""
    communities = nx.algorithms.community.louvain_communities(G, weight=weight, resolution=resolution)
    modularity_score = nx.algorithms.community.modularity(G, communities, weight=weight)
    com_node_assignments = []
    # Flatten the list of node sets into one row per node
    for community, nodes in enumerate(communities):
        for node in nodes:
            com_node_assignments.append({'community': community, 'node': node})
    return pd.DataFrame(com_node_assignments), modularity_score
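
To see what the giant component filter is doing under the hood, here is a minimal plain-Python sketch that finds connected components with a breadth-first search and keeps the largest. The edge list is hypothetical toy data and the helper names are assumptions; in the notebook itself networkx does this work for us.

```python
from collections import deque

# Hypothetical undirected retweet edges (toy data).
toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("x", "y")]


def build_adjacency(edges):
    """Map every node to the set of nodes it is directly connected to."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def find_components(adjacency):
    """Group nodes into connected components using breadth-first search."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node] - component:
                component.add(neighbour)
                queue.append(neighbour)
        seen |= component
        components.append(component)
    return components


toy_adjacency = build_adjacency(toy_edges)
giant = max(find_components(toy_adjacency), key=len)
print(sorted(giant))  # ['a', 'b', 'c', 'd']
```

The small `x`/`y` pair is disconnected from the main group, so it is dropped. This is the kind of isolated fragment the filter removes before community detection.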


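Below we keep only nodes with a degree of at least 2, reduce the retweet network to its giant component, and then detect communities with the Louvain method. Community assignments like these can also be produced in Gephi and imported from a file such as `node_communities.csv`; here we compute them directly with networkx.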

In [None]:
rt_network = filter_by_giant_component(filter_by_degree(rt_network,2))
rt_network.number_of_nodes()

In [None]:
communities, modularity = louvain_modularity(rt_network)
communities

In [None]:
tweets = pd.read_pickle('example_twitter_data.pkl')

In [None]:
tweets = tweets[tweets['retweeted_status'].isna()] # remove retweets

In [None]:
def flatten_nested_dicts(df):
    dicts = df.to_dict(orient='records')
    flattened = pd.json_normalize(dicts)
    return flattened

tweets = flatten_nested_dicts(tweets)

In [None]:
# Attach each tweet author's community; drop tweets whose author is not in the filtered network
tweets = tweets.merge(communities, how='left', left_on='user.screen_name', right_on='node').dropna(subset=['community'])
tweets[['full_text','user.screen_name', 'community']]

In [None]:

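from sklearn.feature_extraction.text import CountVectorizer

# Custom vectoriser for the topic representation: drop English stop words,
# allow unigrams and bigrams, and ignore very rare or near-universal terms
cv = CountVectorizer(stop_words='english', ngram_range=(1,2), min_df=5, max_df=0.95)
tweet_model = BERTopic(vectorizer_model=cv, diversity=0.8)
# Passing the community labels as y lets them guide the topic modelling
topics, probabilities = tweet_model.fit_transform(
    tweets['full_text'].tolist(),
    y=tweets['community'].tolist(),
)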

In [None]:
tweet_model.get_topic_info()

In [None]:
tweet_model.visualize_topics()

In [None]:
tweet_model.visualize_documents(tweets['full_text'].tolist())

In [None]:
tweet_model.visualize_hierarchy()

In [None]:
tweet_model.visualize_heatmap(n_clusters=4)

In [None]:
# The merge above stores each tweet's community in the 'community' column
topic_community_crosstab = pd.crosstab(index=topics, columns=tweets['community'])
topic_community_crosstab.loc[0:]  # skip the noise topic (-1)

In [None]:
sns.heatmap(topic_community_crosstab.loc[0:], cmap='coolwarm')

In [None]:
tweet_model.get_representative_docs(4)

In [None]:
topics_per_class = tweet_model.topics_per_class(tweets['full_text'].tolist(),topics, classes=tweets['community'].tolist())
tweet_model.visualize_topics_per_class(topics_per_class)