<a href="https://colab.research.google.com/github/SEEsuite/colab_scripts/blob/main/english_bertopic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


BERTopic is a topic model designed by Maarten Grootendorst that comes with a pretty great Python library. For example, there are a good number of plots that you can generate immediately. This script generates 3 files: the fitted model, the tweets with topic labels one-hot encoded, and the topics by their top keywords. The script is not meant to be comprehensive; Grootendorst provides a bunch of other example Colab notebooks.


BERTopic will contextually embed tweets, then cluster them into topic groups. The number of topics is not necessarily fixed in advance, but you can manually merge topics together or adjust parts of the model. This script should work as is, but if the results are not great for your data, look into adjusting subcomponents of the model, like the Hugging Face embedding model, UMAP, or the clustering tool used. There is plenty of documentation [here](https://maartengr.github.io/BERTopic/index.html#:~:text=BERTopic%20is%20a%20topic%20modeling,words%20in%20the%20topic%20descriptions.). There are a million changes you could make to this model, so don't get too caught up in it: if it's not working, cut your losses.


The text may benefit from cleaning so that it better matches the training data of the embedding model. Definitely consider removing punctuation, and consider removing digits as well. That said, the model will be pretty robust to noisy text, more so than traditional topic models like LDA.
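As a minimal sketch of the cleaning suggested above, the helper below lowercases text and strips ASCII punctuation and digits using only the standard library. The name `strip_digits_and_punct` is hypothetical and not part of this script (the script's own `clean` function is defined further down). Whether this kind of cleaning helps depends on your data and embedding model.

```python
import string

# Hypothetical helper, for illustration only.
# Translation table that deletes every ASCII punctuation character and digit.
_DROP = str.maketrans("", "", string.punctuation + string.digits)

def strip_digits_and_punct(text):
    """Lowercase text, drop ASCII punctuation and digits, collapse whitespace."""
    return " ".join(text.lower().translate(_DROP).split())

print(strip_digits_and_punct("Top 10 reasons, ranked!!"))
```

Non-ASCII punctuation such as curly quotes is not covered by `string.punctuation` and would need to be handled separately.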


[paper](https://arxiv.org/abs/2203.05794)

In [None]:
### HERE IS THE CELL YOU NEED TO CHANGE
link = "https://docs.google.com/spreadsheets/d/1m1-qV00Qkm2m9Znypj_ORBZgAQ9yQ9eO/edit?usp=sharing&ouid=101042095541764641159&rtpof=true&sd=true"
model_name = "basic_english_model"
### Expects the text to be in a dataframe column named 'Full Text'

In [None]:
!pip install transformers
!pip install bertopic

In [None]:
# huggingface's tools for pretrained language models
from transformers import AutoModelForSequenceClassification
from transformers import TFAutoModelForSequenceClassification
from transformers import AutoTokenizer

In [None]:
import re  # search through and clean text
import string

import numpy as np
import pandas as pd
import tqdm
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer
nltk.download('stopwords')
import hdbscan
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# importing miscellaneous packages
import numpy as np # fast manipulation of multidimensional arrays

from tqdm.notebook import tqdm as progress_bar # a little visualization of how fast a loop is running
from scipy.special import softmax
import pandas as pd # basically the Excel of Python

In [None]:
import urllib.request
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

def import_data_from_drive(share_link, your_name_for_file="my_data"):
  """Brings a data file from a shared Google Drive link into your Colab workspace.
     It does not require you to host the dataset on your own account.

     Parameters:
     share_link: the link to view a file in Google Drive
     your_name_for_file: a string naming the file, preferably ending in a file type, e.g. 'data.csv'
     """
  id = share_link.split("/")[5] # separate the id from the link
  print("Using id", id, "to find file on drive")

  # use pydrive and colab modules to authenticate you
  auth.authenticate_user()
  gauth = GoogleAuth()
  gauth.credentials = GoogleCredentials.get_application_default()
  drive = GoogleDrive(gauth)
  print("Authenticated colab user")

  # This step will move the file from Drive to the workspace
  download_object = drive.CreateFile({'id':id}) 
  download_object.GetContentFile(your_name_for_file)
  print("Added file to workspace with name", your_name_for_file)

  return
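The function above finds the file id with `share_link.split("/")[5]`, which relies on the id being the sixth slash-separated piece of the link. A slightly more defensive sketch is shown below; `extract_drive_id` and `example_link` are hypothetical names, and the pattern assumes that ids contain only letters, digits, hyphens, and underscores and that the link has a `/d/<id>` segment.

```python
import re

def extract_drive_id(share_link):
    """Return the file id that follows '/d/' in a Google Drive or Docs share link."""
    # Assumption: ids are made of letters, digits, '-' and '_'
    match = re.search(r"/d/([A-Za-z0-9_-]+)", share_link)
    if match is None:
        raise ValueError("No '/d/<id>' segment found in link: " + share_link)
    return match.group(1)

example_link = "https://docs.google.com/spreadsheets/d/1m1-qV00Qkm2m9Znypj_ORBZgAQ9yQ9eO/edit?usp=sharing"
print(extract_drive_id(example_link))
```

For links of this shape the result matches `split("/")[5]`, but a link in an unexpected format raises a clear error instead of silently returning the wrong piece.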

In [None]:
import_data_from_drive(link, your_name_for_file="tweets.xlsx")
df = pd.read_excel('tweets.xlsx')
# df = pd.read_csv('tweets.xlsx')

In [None]:
# df is now an object, with associated methods we can use
df.head(5) # let's look at the first five data samples
# you can even access the spreadsheet in colab... 

In [None]:
stop = stopwords.words('english')
stop.extend([' ', 'ok', 'okay', 'via', 'this', 'that', 'it', 'lol', 'hah', 'haha', 'ha', 'like']) # you can add anything to this ban list

In [None]:
# there is a problem with the data oh no!
print(df['Full Text'][0])
# Most language models probably don't know what the hell 'https://t.co/F5ak34HrCE' is

In [None]:
# running this cell defines the function, does not run the function

def clean(tweet):

  # convert to lowercase
  tweet = tweet.lower()

  # remove links
  tweet = re.sub(r"http\S+", "", tweet)

  # remove everything but ASCII letters, digits, underscores, and whitespace
  tweet = re.sub(r"[^a-z0-9_\s]+", "", tweet)


  # some irregular punctuation marks need to be removed manually
  tweet = re.sub("['\"’…”“]", "", tweet)

  #remove punctuations
  temp = tweet.translate(str.maketrans('', '', string.punctuation))
  tweet = " ".join(temp.split())



  return tweet

In [None]:
df['Clean Text'] = df['Full Text'].apply(clean)

In [None]:
print(df['Clean Text'][0])

In [None]:
model_path = "bert-base-uncased" #one language
sentence_model = SentenceTransformer(model_path, device="cuda")
clusterer = hdbscan.HDBSCAN(min_cluster_size=30, min_samples=5, cluster_selection_method='leaf', prediction_data=True)
# clusterer = KMeans(n_clusters=150)
tokenizer = TweetTokenizer().tokenize
vectorizer_model = CountVectorizer(ngram_range=(1, 1), stop_words=stop, tokenizer=tokenizer) # you could change "ngrams" to consider common word pairs or triplets.
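`CountVectorizer` takes `ngram_range=(min_n, max_n)`, so `(1, 1)` keeps single words only and `(1, 2)` adds word pairs. The standard-library sketch below illustrates what that means for one tokenized tweet. The `ngrams` helper and `example_tokens` are hypothetical, and the sketch ignores stop-word filtering and counting, which the real vectorizer also handles.

```python
def ngrams(tokens, n_min=1, n_max=1):
    """Return all contiguous n-grams of tokens for n_min <= n <= n_max, joined by spaces."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

example_tokens = ["climate", "change", "policy"]
print(ngrams(example_tokens, 1, 1))  # single words only, like ngram_range=(1, 1)
print(ngrams(example_tokens, 1, 2))  # single words plus adjacent pairs, like ngram_range=(1, 2)
```

Wider ranges give more descriptive topic keywords but also a much larger vocabulary, so it is usually worth trying `(1, 2)` before going further.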


In [None]:
topic_model = BERTopic(embedding_model=sentence_model, top_n_words=10, calculate_probabilities=True, verbose=True, hdbscan_model = clusterer, vectorizer_model=vectorizer_model)
topics, probs = topic_model.fit_transform(df['Clean Text'])

# Save to Excel

In [None]:
# TODO: the cleaned tweets need to be handled differently.

tweet_by_topic = pd.DataFrame()
tweet_df = df
tweet_by_topic['topic'] = topic_model.topics_
tweet_by_topic['probability'] = np.max(topic_model.probabilities_, axis=1)

In [None]:
tweet_by_topic

In [None]:
topics = tweet_by_topic['topic'].copy()
topics = pd.get_dummies(topics)  # one column per topic id, including -1 for outliers
# convert() below skips the first column (the outlier topic, -1)
topics['probability'] =  tweet_by_topic['probability'].copy()

a = topics.head(1)

In [None]:
a

In [None]:

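def convert(row):
  # skip the first column (outliers, -1) and the last column (probability)
  topics = row.iloc[1:-1].astype(float)
  prob = row.iloc[-1]

  # replace the 1 marking the assigned topic with that document's probability;
  # outlier rows have no topic flagged and stay all zeros
  if topics.max() > 0:
    topics.iloc[int(np.argmax(topics.to_numpy()))] = prob

  return topics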

topics = topics.apply(convert, axis=1)
topics
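The intended result of the conversion above is one row per tweet, with the tweet's probability in the column of its assigned topic and zeros everywhere else. The standard-library sketch below shows that layout on a toy example. The helper name and the `example_*` values are made up for illustration, and leaving outliers (topic `-1`) as all-zero rows is an assumption about the desired behavior.

```python
def one_hot_with_probability(topic, probability, n_topics):
    """Return a list of length n_topics with `probability` at the assigned topic.

    Assumption: outliers (topic == -1) get an all-zero row.
    """
    row = [0.0] * n_topics
    if topic >= 0:
        row[topic] = probability
    return row

# Made-up topic assignments and probabilities for three tweets
example_topics = [2, 0, -1]
example_probs = [0.91, 0.55, 0.30]
example_rows = [
    one_hot_with_probability(t, p, n_topics=3)
    for t, p in zip(example_topics, example_probs)
]
for r in example_rows:
    print(r)
```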

In [None]:
df.columns

In [None]:
tweet_by_topic = topics
tweet_by_topic['text'] = tweet_df['Full Text'] 
tweet_by_topic['cleaned_text'] = tweet_df['Clean Text'] 
tweet_by_topic["Twitter Followers"] =  tweet_df["Twitter Followers"]
tweet_by_topic["Twitter Reply Count"] =  tweet_df["Twitter Reply Count"]
tweet_by_topic["Twitter Retweets"] =  tweet_df["Twitter Retweets"]


In [None]:
tweet_by_topic

In [None]:
tweet_by_topic.to_excel('tweet_by_topic-' + model_name + '_with_sent.xlsx')

In [None]:
tweet_by_topic

In [None]:
num_terms = 10
# one row per topic: topic id, number of documents, then the top terms by c-TF-IDF
rows = []
for topic, terms in topic_model.topic_representations_.items():
  row = {"topic": topic, "count": topic_model.topic_sizes_.get(topic, 0)}
  for i, (term, _score) in enumerate(terms[:num_terms]):
    row['term ' + str(i + 1)] = term
  rows.append(row)
topic_by_words = pd.DataFrame(rows)
topic_by_words


In [None]:
# sort by topic id and renumber the rows without keeping the old index as a column
topic_by_words = topic_by_words.sort_values(by='topic').reset_index(drop=True)
topic_by_words

In [None]:
len(topic_model.get_representative_docs())

In [None]:
s_docs = pd.DataFrame(topic_model.representative_docs_.items())
# s_docs = s_docs.sort_index()
# print(docs[1][:][0])
docs = pd.DataFrame()
# b = pd.DataFrame(topic_by_words[1].to_list())
docs['topic_#'] = s_docs[0]
docs['sample 1'] = s_docs[1].apply(lambda x: x[0])
docs['sample 2'] = s_docs[1].apply(lambda x: x[1])
docs['sample 3'] = s_docs[1].apply(lambda x: x[2])
ro = pd.DataFrame()
ro['sample 1'] = [0]
ro['sample 2'] = [0]
ro['sample 3'] = [0]
ro['topic_#'] = [-1]
print(ro)
docs = pd.concat((docs, ro), axis=0)

docs = docs.sort_values(by=['topic_#'])
docs = docs.reset_index()

docs

In [None]:

save_topics = pd.concat((topic_by_words, docs[['sample 1', 'sample 2', 'sample 3']]), axis=1)
# save_topics = topic_by_words


In [None]:
save_topics = save_topics[1:]

In [None]:
save_topics

In [None]:
save_topics.to_excel("topics_info_" + model_name +".xlsx")


# Pretty Plots

### 2.1 Attributes

There are a number of attributes that you can access after having trained your BERTopic model:


| Attribute | Description |
|------------------------|---------------------------------------------------------------------------------------------|
| topics_               | The topics that are generated for each document after training or updating the topic model. |
| probabilities_ | The probabilities that are generated for each document if HDBSCAN is used. |
| topic_sizes_           | The size of each topic                                                                      |
| topic_mapper_          | A class for tracking topics and their mappings anytime they are merged/reduced.             |
| topic_representations_ | The top *n* terms per topic and their respective c-TF-IDF values.                             |
| c_tf_idf_              | The topic-term matrix as calculated through c-TF-IDF.                                       |
| topic_labels_          | The default labels for each topic.                                                          |
| custom_labels_         | Custom labels for each topic as generated through `.set_topic_labels`.                                                               |
| topic_embeddings_      | The embeddings for each topic if `embedding_model` was used.                                                              |
| representative_docs_   | The representative documents for each topic if HDBSCAN is used.                                                |
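To make the shape of `topic_representations_` concrete: as described in the table, it maps each topic id to a list of `(term, c-TF-IDF score)` pairs. The sketch below uses invented terms and scores (not output from this model) to show how the top terms can be pulled out. `example_representations` and `top_terms` are hypothetical names.

```python
# Illustrative data only: invented terms and scores, shaped like topic_representations_
example_representations = {
    -1: [("the", 0.02), ("news", 0.01)],
    0: [("vaccine", 0.12), ("dose", 0.09)],
    1: [("election", 0.15), ("vote", 0.11)],
}

def top_terms(representations, n=2):
    """Map each topic id to its first n terms, dropping the scores."""
    return {
        topic: [term for term, _score in pairs[:n]]
        for topic, pairs in representations.items()
    }

print(top_terms(example_representations, n=1))
```

On a fitted model, the same function could be called as `top_terms(topic_model.topic_representations_, n=10)`.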

In [None]:
topics = topic_model.topics_
probs = topic_model.probabilities_

In [None]:
tweet_df.info()


### 2.2 Visualizations

More can be found [here](https://maartengr.github.io/BERTopic/getting_started/visualization/visualization.html#visualize-hierarchical-documents)

| Method | Description |
|------------------------|---------------------------------------------------------------------------------------------|
| visualize_hierarchy | To understand the potential hierarchical structure of the topics, we can use `scipy.cluster.hierarchy` to create clusters and visualize how they relate to one another. |
| visualize_topics | We embed our representation of the topics in 2D using UMAP and then create an interactive view. |
| visualize_barchart | We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. |

In [None]:
topic_model.visualize_hierarchy(top_n_topics = 70, custom_labels=True)

In [None]:
topic_model.visualize_topics(top_n_topics=60)

In [None]:
topic_model.visualize_topics()

In [None]:
from sentence_transformers import SentenceTransformer

model_path = "cardiffnlp/twitter-roberta-base" #one language
sentence_model = SentenceTransformer(model_path, device="cuda")
# sentence_model = SentenceTransformer("all-MiniLM-L6-v2")


In [None]:
from umap import UMAP
embeddings = sentence_model.encode(df['Clean Text'].tolist(), show_progress_bar=True)
# umap_model = UMAP(n_neighbors=10, n_components=5, min_dist=0.1, metric='cosine', random_state=42)
# visualize_documents plots two dimensions, so reduce the embeddings to 2D
reduced_embeddings = UMAP(n_neighbors=15, n_components=2,
                          min_dist=0.0, metric='cosine').fit_transform(embeddings)

In [None]:
topic_model.visualize_documents(df['Clean Text'], reduced_embeddings=reduced_embeddings,
                                hide_document_hover=False, hide_annotations=True)

In [None]:
topic_model.visualize_barchart(top_n_topics=20, n_words=5) # not the best tool in my opinion