# BERTopic Modelling


## Prerequisites:
- Data file (`data.csv`) stored in the current directory
- Python 3.6 or above

### BERTopic Installation

In [None]:
# Install BERTopic
# Use '!pip install bertopic' if you are running this notebook in Google Colab
%pip install bertopic

### Data Import

- Topic modelling is run on the abstracts in the data corpus.
- To model a different column, change the column name on the line that assigns `abstract_pd` in the next code block.


In [None]:
import pandas as pd  # Data loading and manipulation
from bertopic import BERTopic

# Load the CSV file into a DataFrame (a separate name avoids shadowing the pandas module)
df = pd.read_csv('data.csv')

# Extract the abstracts ('AB') and the time information ('PY') as pandas Series
abstract_pd = df['AB']
time = df['PY']
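
Abstract columns often contain empty cells, and a missing value is not a valid document for the model. The following is a minimal sketch of one way to drop such rows while keeping abstracts and time values aligned. The in-memory CSV and the names `raw`, `clean`, `docs` and `years` are made up for illustration and are not part of the original notebook.

```python
import io
import pandas as pd

# Hypothetical stand-in for data.csv: the second row has no abstract
raw = io.StringIO("AB,PY\nFirst abstract,2019\n,2020\nThird abstract,2021\n")
df = pd.read_csv(raw)

# Drop rows without an abstract so every document is a real string
clean = df.dropna(subset=["AB"])

# Convert to plain Python lists; both lists stay aligned row by row
docs = clean["AB"].astype(str).tolist()
years = clean["PY"].tolist()

print(len(docs), len(years))
```

The same filtering could be applied to the real DataFrame before the abstract column is extracted.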

### UMAP usage:
- UMAP is a dimensionality reduction technique that can be used for visualisation similarly to t-SNE, but also for general non-linear dimension reduction.

### HDBSCAN usage:
- HDBSCAN is a clustering algorithm developed by Campello, Moulavi, and Sander. It extends DBSCAN by converting it into a hierarchical clustering algorithm, and then extracts a flat clustering based on the stability of the clusters.

### Pipeline:
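- The pipeline chains the steps needed to create a topic model: the documents are embedded, UMAP reduces the dimensionality of the embeddings, and HDBSCAN clusters the reduced embeddings. The UMAP and HDBSCAN models are configured here and later passed to the BERTopic class.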

In [None]:
from umap import UMAP
from hdbscan import HDBSCAN

# Fix the random state (here 993) so that the model is reproducible
# Use the UMAP algorithm to reduce the dimensionality of the document embeddings
umap_model = UMAP(n_neighbors=20, n_components=10, metric='cosine', low_memory=False, random_state=993)

# Use the HDBSCAN algorithm to cluster the reduced embeddings
hdbscan_model = HDBSCAN(min_cluster_size=10, metric='euclidean', prediction_data=True)

### Remove English Stopwords from Topic Representations

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_model = CountVectorizer(stop_words="english")
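
To see what the `stop_words="english"` setting does, the vectorizer can be fitted on a couple of toy sentences and its vocabulary inspected. This is only an illustrative sketch: the sentences and the names `toy_docs`, `toy_vectorizer` and `vocab` are invented here.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences containing common English stop words
toy_docs = [
    "the image is hidden in the video",
    "a watermark in the image",
]

# Same setting as the notebook's vectorizer_model
toy_vectorizer = CountVectorizer(stop_words="english")
toy_vectorizer.fit(toy_docs)

# vocabulary_ maps each retained term to its column index
vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)
```

Words such as "the", "is" and "in" do not appear in the vocabulary, so they cannot show up among the words that describe a topic.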

### Define Topic Model with embedding model, umap model and hdbscan

- Set `calculate_probabilities=True` to keep the topic probabilities, so that the certainty of each topic assignment can be checked later.
- The embedding model used is all-MiniLM-L6-v2. Other models are listed in the BERTopic documentation: https://maartengr.github.io/BERTopic/getting_started/embeddings/embeddings.html

In [None]:
topic_model = BERTopic(verbose=True, embedding_model="all-MiniLM-L6-v2",
                       umap_model=umap_model, hdbscan_model=hdbscan_model,
                       n_gram_range=(1, 3), calculate_probabilities=True)

In [None]:
# Fit model to abstract data
topics,prob = topic_model.fit_transform(abstract_pd)
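
BERTopic assigns the label -1 to documents that HDBSCAN treats as outliers. Counting the labels is a quick way to gauge how much of the corpus was left unassigned. The sketch below uses a made-up list, `topics_example`, in place of the `topics` list returned by `fit_transform`.

```python
from collections import Counter

# Hypothetical topic assignments standing in for the `topics` list
topics_example = [0, 1, -1, 0, 2, -1, 1, 0]

# Count how many documents fall into each topic
counts = Counter(topics_example)

# Share of documents labelled -1 (outliers)
outlier_share = counts[-1] / len(topics_example)

print(counts.most_common())
print(f"Outlier share: {outlier_share:.2%}")
```

A large outlier share may suggest trying a smaller `min_cluster_size`, although the right value depends on the data.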

## Update model with stop words removal

In [None]:
topic_model.update_topics(abstract_pd, vectorizer_model=vectorizer_model)

In [None]:
count_df = topic_model.get_topic_info()

## Visualisation

### Barchart

In [None]:
topic_model.visualize_barchart()

### Similarity Matrix (Heatmap)

In [None]:
topic_model.visualize_heatmap()

### Visualise Documents and Topics Closeness

In [None]:
topic_model.visualize_documents(abstract_pd)

### Hierarchical Clustering

In [None]:
topic_model.visualize_hierarchy()

### Intertopic Distance Map

In [None]:
topic_model.visualize_topics()

### Topic Probability Distribution

- Visualise the topic probability distribution of a single document, for example the second document (index 1 with 0-based indexing)

In [None]:
topic_model.visualize_distribution(prob[1])
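
The probabilities can also be inspected numerically to find documents whose assignment is uncertain. The following sketch assumes a document-by-topic probability matrix like the one returned when `calculate_probabilities=True`. The matrix `prob_example` and the `threshold` of 0.5 are arbitrary values chosen for illustration.

```python
import numpy as np

# Hypothetical probability matrix: 3 documents x 3 topics
prob_example = np.array([
    [0.80, 0.10, 0.05],
    [0.30, 0.35, 0.20],
    [0.02, 0.03, 0.01],
])

# Most probable topic and its probability for each document
best_topic = prob_example.argmax(axis=1)
confidence = prob_example.max(axis=1)

# Indices of documents whose best probability is below the chosen threshold
threshold = 0.5  # arbitrary cut-off for this example
uncertain = np.where(confidence < threshold)[0]

print(best_topic, confidence, uncertain)
```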

# Extras

## Set Custom Topic Names
- Custom topic names can be given to the topics by using the set_topic_labels() function

- Pass a dictionary with the topic number as key and the topic name as value

In [None]:
topic_model.set_topic_labels({
    0: "Steganographic Image Data Hiding",
    1: "Deep Image Steganalysis",
    2: "Neural Watermark Robustness",
    3: "Linguistic Steganography Models",
    4: "Speech Steganalysis Algorithms",
    5: "Cognitive DNS Communication",
    6: "Video Steganography Techniques",
})
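
When there are many topics, a first draft of the label dictionary can be generated from the top words of each topic and then edited by hand. This is a rough sketch only: the `top_words` data is invented, shaped like the (word, score) pairs that `get_topic()` returns, and the two-word labelling rule is an arbitrary choice.

```python
# Hypothetical (word, score) pairs per topic, highest score first
top_words = {
    0: [("image", 0.09), ("hiding", 0.07), ("embedding", 0.05)],
    1: [("steganalysis", 0.11), ("detection", 0.06), ("cnn", 0.04)],
}

# Join the two highest-scoring words of each topic into a draft label
custom_labels = {
    topic_id: " ".join(word.title() for word, _ in words[:2])
    for topic_id, words in top_words.items()
}

print(custom_labels)
```

A dictionary of this shape, mapping topic numbers to names, is what `set_topic_labels()` expects.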

### Visualisation with Custom Topic Names

- Add a parameter 'custom_labels=True' to the visualisation functions

In [None]:
topic_model.visualize_barchart(custom_labels=True)

## Saving and Loading Model

- Topic models can be saved with `save()` and loaded back with `BERTopic.load()`, which use pickle

In [None]:
# Saving
topic_model.save("my_model")

# Loading
topic_model = BERTopic.load("my_model")

## Saving probabilities

- The document probabilities are not saved with the model by default, but they can be saved separately as a NumPy array

In [None]:
import numpy as np
np.savetxt('prob.csv', prob, delimiter=',')  # written to the current directory
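
A saved probability file can be read back with `np.loadtxt`. The sketch below shows a round trip on a small made-up matrix, `prob_example`, written to a temporary directory so that no real file is overwritten.

```python
import os
import tempfile

import numpy as np

# Hypothetical probability matrix: 2 documents x 3 topics
prob_example = np.array([[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "prob.csv")

    # Write the matrix as comma-separated text, then read it back
    np.savetxt(path, prob_example, delimiter=",")
    restored = np.loadtxt(path, delimiter=",")

# The restored array has the same shape and values as the original
print(restored.shape, np.allclose(prob_example, restored))
```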