<a href="https://colab.research.google.com/github/SEEsuite/colab_scripts/blob/main/twitter_bertopic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


BERTopic is a topic model designed by Maarten Grootendorst that comes with a well-developed Python library. For example, it can generate a good number of plots right after fitting. This script generates three files: the fitted model, the tweets with topic labels one-hot encoded, and the topics with their top keywords. The script is not meant to be comprehensive; Grootendorst provides a number of other example Colab notebooks.


BERTopic contextually embeds tweets, then clusters them into topic groups. The number of topics is not necessarily fixed in advance, but you can manually fold topics together or adjust parts of the model. This script should work as is, but if the results are not great for your data, look into adjusting subcomponents of the model, such as the Hugging Face embedding model, UMAP, or the clustering tool. There is plenty of documentation [here](https://maartengr.github.io/BERTopic/index.html#:~:text=BERTopic%20is%20a%20topic%20modeling,words%20in%20the%20topic%20descriptions.). There are a million changes you could make to this model, so don't get too caught up in it: if it's not working, cut your losses.
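As a rough illustration of what adjusting subcomponents can look like, the sketch below wires an embedding model, UMAP, and HDBSCAN into BERTopic explicitly. The function name `build_topic_model` and all default values are assumptions chosen for illustration, not tuned settings. The imports live inside the function so the cell can be defined before the packages are installed.

```python
def build_topic_model(embedding_name="all-MiniLM-L6-v2",
                      n_neighbors=15, n_components=5,
                      min_cluster_size=30, seed=42):
    """Assemble a BERTopic model from explicitly configured sub-models.

    Hypothetical helper: the defaults are illustrative starting points only.
    """
    # Imported lazily so defining the function does not require the packages.
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from sentence_transformers import SentenceTransformer
    from umap import UMAP

    # Step 1: sentence embeddings
    embedding_model = SentenceTransformer(embedding_name)
    # Step 2: dimensionality reduction before clustering
    umap_model = UMAP(n_neighbors=n_neighbors, n_components=n_components,
                      min_dist=0.0, metric="cosine", random_state=seed)
    # Step 3: density-based clustering; larger min_cluster_size gives fewer topics
    hdbscan_model = HDBSCAN(min_cluster_size=min_cluster_size,
                            metric="euclidean", prediction_data=True)
    return BERTopic(embedding_model=embedding_model,
                    umap_model=umap_model,
                    hdbscan_model=hdbscan_model)
```

Changing one argument at a time (for example `min_cluster_size`) and refitting makes it easier to see which component is responsible for a change in the topics.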


The text may benefit from cleaning to match the training set of the embedding model. Definitely consider removing punctuation, and consider removing digits as well. That said, the model will be fairly robust to noisy text, more so than traditional topic models such as LDA.
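One way to read "match the training set of the embedding model": the model cards for the cardiffnlp Twitter RoBERTa models suggest replacing user mentions and links with placeholder tokens rather than deleting them. The sketch below follows that idea. The function name `normalize_for_twitter_roberta` is made up here, and whether placeholders or outright removal gives better topics on your data is something to test, not a given.

```python
def normalize_for_twitter_roberta(text):
    """Replace mentions with '@user' and links with 'http', token by token."""
    tokens = []
    for token in text.split(" "):
        if token.startswith("@") and len(token) > 1:
            token = "@user"
        elif token.startswith("http"):
            token = "http"
        tokens.append(token)
    return " ".join(tokens)


example = "We sit down with @WRAL tonight https://t.co/F5ak34HrCE"  # made-up tweet
print(normalize_for_twitter_roberta(example))
# We sit down with @user tonight http
```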




[paper](https://arxiv.org/abs/2203.05794)

In [None]:
### HERE IS THE CELL YOU NEED TO CHANGE
link = "https://docs.google.com/spreadsheets/d/1m1-qV00Qkm2m9Znypj_ORBZgAQ9yQ9eO/edit?usp=sharing&ouid=101042095541764641159&rtpof=true&sd=true"
model_name = "basic_twitter_model"
### IF YOUR DATASET DOES NOT USE STANDARD BRANDWATCH COLUMN NAMES YOU WILL NEED TO CHANGE THE EXCEL NAMES OR THE DF NAMES BELOW

In [None]:
!pip install transformers
!pip install bertopic

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
Collecting huggingface-hub<1.0,>=0.11.0 (from transformers)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transformers-4.28.1
Looking in in

In [None]:
# huggingface's tools for pretrained language models
from transformers import AutoModelForSequenceClassification
from transformers import TFAutoModelForSequenceClassification
from transformers import AutoTokenizer

In [None]:
import re  # search through and clean text
import string

import numpy as np
import pandas as pd
import tqdm
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer
import hdbscan
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
# importing miscellaneous packages
import numpy as np # fast manipulation of multidimensional arrays

from tqdm.notebook import tqdm as progress_bar # a little visualization of how fast a loop is running
from scipy.special import softmax
import pandas as pd # basically the Excel of Python

In [None]:
import urllib.request
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

def import_data_from_drive(share_link, your_name_for_file="my_data"):
  """Brings a data file from a Google Drive share link into your Colab workspace.
     It does not require you to host the dataset on your own account.

     Parameters:
     share_link: the link to view a file in Google Drive
     your_name_for_file: a string naming the file, preferably ending in a file type, e.g. 'data.csv'
     """
  file_id = share_link.split("/")[5] # separate the id from the link
  print("Using id", file_id, "to find file on drive")

  # use pydrive and colab modules to authenticate you
  auth.authenticate_user()
  gauth = GoogleAuth()
  gauth.credentials = GoogleCredentials.get_application_default()
  drive = GoogleDrive(gauth)
  print("Authenticated colab user")

  # This step will move the file from Drive to the workspace
  download_object = drive.CreateFile({'id': file_id})
  download_object.GetContentFile(your_name_for_file)
  print("Added file to workspace with name", your_name_for_file)
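The `split("/")[5]` step above assumes the link has the exact shape `https://docs.google.com/<kind>/d/<id>/...`. A slightly more defensive sketch is to search for the `/d/<id>` segment with a regular expression and fail loudly when it is missing. The helper name `extract_drive_id` is an assumption for illustration, and the pattern covers only the `/d/<id>` link style used in this notebook.

```python
import re


def extract_drive_id(share_link):
    """Return the file id from a Google Drive/Docs link of the form .../d/<id>/..."""
    match = re.search(r"/d/([A-Za-z0-9_-]+)", share_link)
    if match is None:
        raise ValueError(f"No '/d/<id>' segment found in link: {share_link}")
    return match.group(1)


example_link = "https://docs.google.com/spreadsheets/d/1m1-qV00Qkm2m9Znypj_ORBZgAQ9yQ9eO/edit?usp=sharing"
print(extract_drive_id(example_link))
# 1m1-qV00Qkm2m9Znypj_ORBZgAQ9yQ9eO
```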

In [None]:
import_data_from_drive(link, your_name_for_file="tweets.xlsx")
df = pd.read_excel('tweets.xlsx')
# df = pd.read_csv('tweets.xlsx')

Using id 1m1-qV00Qkm2m9Znypj_ORBZgAQ9yQ9eO to find file on drive
Authenticated colab user
Added file to workspace with name tweets.xlsx


In [None]:
# df is now a DataFrame object, with associated methods we can use
df.head(5) # let's look at the first five data samples
# you can even open the spreadsheet in Colab...

Unnamed: 0,Date,Full Text,Clean Text,Author,Url,Continent,Country,Region,Country Code,Continent Code,Region Code,City Code,Twitter Followers,Twitter Following,Twitter Reply Count,Twitter Retweets,Twitter Verified,Reach (new)
0,2022-10-01 23:40:00.000,"In Colorado Senate race, Michael Bennet still ...",in colorado senate race michael bennet still f...,Prison_Health,http://twitter.com/Prison_Health/statuses/1576...,North America,United States of America,Hawaii,USA,NORTH AMERICA,USA.HI,USA.HI.Honolulu,19711,2715,0,0,False,7325
1,2022-10-01 23:27:28.000,COMING UP on @WRAL at 7:30pm: We sit down with...,coming up on at 730pm we sit down with and abo...,BryanRAnderson,http://twitter.com/BryanRAnderson/statuses/157...,North America,United States of America,North Carolina,USA,NORTH AMERICA,USA.NC,USA.NC.Raleigh,3832,1103,2,4,True,13263
2,2022-10-01 23:16:38.000,Summaries of high-profile Supreme Court cases:...,summaries of highprofile supreme court cases t...,January20th49,http://twitter.com/January20th49/statuses/1576...,North America,United States of America,Ohio,USA,NORTH AMERICA,USA.OH,,39,300,0,1,False,0
3,2022-10-01 23:05:12.000,Abortion Icon Emma Bonino Trounced in Italian ...,abortion icon emma bonino trounced in italian ...,UsBurning,http://twitter.com/UsBurning/statuses/15763475...,North America,United States of America,Georgia,USA,NORTH AMERICA,USA.GA,USA.GA.Atlanta,360,34,0,0,False,0
4,2022-10-01 22:02:12.000,💥38 DAYS UNTIL #ELECTIONDAY MIDTERMS💥 WHAT R U...,💥38 days until midterms💥 what r u doing for de...,LeviFetterman,http://twitter.com/LeviFetterman/statuses/1576...,North America,United States of America,Pennsylvania,USA,NORTH AMERICA,USA.PA,,33774,1702,1,10,False,16039


In [None]:
stop = stopwords.words('english')
stop.extend([' ', 'ok', 'okay', 'via', 'this', 'that', 'it', 'lol', 'hah', 'haha', 'ha', 'like']) # you can add anything to this ban list

In [None]:
# there is a problem with the data: the tweets contain raw links
print(df['Full Text'][0])
# most language models will not get anything useful out of 'https://t.co/F5ak34HrCE'

In Colorado Senate race, Michael Bennet still fights for child tax credit and immigration reform https://t.co/F5ak34HrCE


In [None]:
# running this cell defines the function; it does not run the function

def clean(tweet):

  # convert to lowercase
  tweet = tweet.lower()

  # remove digits
  tweet = re.sub(r"[0-9]", "", tweet)
  # remove mentions
  tweet = re.sub(r"@[A-Za-z0-9_]+", "", tweet)
  # remove hashtags
  tweet = re.sub(r"#[A-Za-z0-9_]+", "", tweet)
  # remove links
  tweet = re.sub(r"http\S+", "", tweet)

  # some irregular punctuation marks need to be removed manually
  tweet = re.sub(r"['\"’…”“]", "", tweet)

  # remove the remaining ASCII punctuation and collapse repeated whitespace
  temp = tweet.translate(str.maketrans('', '', string.punctuation))
  tweet = " ".join(temp.split())

  return tweet

In [None]:
df['Clean Text'] = df['Full Text'].apply(clean)

In [None]:
print(df['Clean Text'][1])

coming up on at pm we sit down with and about their bid for ncs th congressional district and approach to abortion the economy immigration and more what you need to know


In [None]:
model_path = "cardiffnlp/twitter-roberta-base" # RoBERTa pretrained on tweets, one language
sentence_model = SentenceTransformer(model_path, device="cuda") # needs a GPU runtime
clusterer = hdbscan.HDBSCAN(min_cluster_size=30, min_samples=5, cluster_selection_method='leaf', prediction_data=True)
# clusterer = KMeans(n_clusters=150) # alternative: a fixed number of clusters
tokenizer = TweetTokenizer().tokenize
vectorizer_model = CountVectorizer(ngram_range=(1, 1), stop_words=stop, tokenizer=tokenizer) # widen ngram_range, e.g. (1, 2), to also consider common word pairs


Some weights of the model checkpoint at /root/.cache/torch/sentence_transformers/cardiffnlp_twitter-roberta-base were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.decoder.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.dense.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
# sentence_model, clusterer and vectorizer_model were created in the previous cell
topic_model = BERTopic(embedding_model=sentence_model, top_n_words=10, calculate_probabilities=True, verbose=True, hdbscan_model=clusterer, vectorizer_model=vectorizer_model)
topics, probs = topic_model.fit_transform(df['Clean Text'])


Batches:   0%|          | 0/592 [00:00<?, ?it/s]

2023-05-10 12:50:51,769 - BERTopic - Transformed documents to Embeddings
2023-05-10 12:51:11,030 - BERTopic - Reduced dimensionality
2023-05-10 12:51:21,164 - BERTopic - Clustered reduced embeddings


# Save to Excel

In [None]:
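# TODO: the cleaned tweets need to be handled differently.

# collect the assigned topic and its probability for each tweet
tweet_by_topic = pd.DataFrame()
tweet_df = df
tweet_by_topic['topic'] = topic_model.topics_
# probabilities_ has one row per tweet and one column per topic; keep the highest value
tweet_by_topic['probability'] = np.max(topic_model.probabilities_, axis=1)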

In [None]:
tweet_by_topic

Unnamed: 0,topic,probability
0,-1,0.072786
1,12,1.000000
2,-1,0.047750
3,-1,0.057274
4,-1,0.130690
...,...,...
18912,-1,0.060293
18913,-1,0.040365
18914,-1,0.006739
18915,-1,0.027059
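In BERTopic, topic `-1` collects outliers, that is, tweets that HDBSCAN did not assign to any cluster. A quick way to see how much of the data ends up there is to count the labels. The helper name `outlier_share` and the toy label list are assumptions for illustration.

```python
from collections import Counter


def outlier_share(topic_labels):
    """Return the fraction of documents assigned to the outlier topic (-1)."""
    counts = Counter(topic_labels)
    total = sum(counts.values())
    return counts[-1] / total if total else 0.0


toy_labels = [-1, 12, -1, -1, -1, 3]  # made-up labels, not the real output
print(round(outlier_share(toy_labels), 2))  # 4 of the 6 labels are -1
# 0.67
```

With the fitted model, the same call would be `outlier_share(topic_model.topics_)`. In the run recorded in this notebook, the topic table further down lists 12979 tweets under topic -1 out of 18917, roughly 69%, which is a sign that the clustering settings may deserve a second look.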


In [None]:
# one-hot encode the topic labels: one column per topic, including the outlier topic -1
topics = tweet_by_topic['topic'].copy()
topics = pd.get_dummies(topics)
topics['probability'] = tweet_by_topic['probability'].copy()

a = topics.head(1)

In [None]:
a

Unnamed: 0,-1,0,1,2,3,4,5,6,7,8,...,73,74,75,76,77,78,79,80,81,probability
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0.072786


In [None]:

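def convert(row):
  """Replace the 1 in a one-hot topic row with the tweet's topic probability."""
  topics = row[1:-1] # drop the outlier column (-1) and the probability column
  prob = row[-1:]
  # print(prob)

  # caution: outlier tweets have no 1 left in the row, so argmax returns 0 and
  # their probability lands in the column for topic 0 (see row 0 in the output)
  index = np.argmax(topics)
  topics[index] = prob

  return topics

topics = topics.apply(convert, axis=1)
topics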

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,72,73,74,75,76,77,78,79,80,81
0,0.072786,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.047750,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.057274,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.130690,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18912,0.060293,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18913,0.040365,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18914,0.006739,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18915,0.027059,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
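The `convert` step above can also be written without `argmax`, which sidesteps the case where an outlier row has no 1 to find. In the output above, row 0 is an outlier tweet yet carries its probability in the column for topic 0. The sketch below is a dependency-free version of the same idea: one row per tweet, one slot per topic, and the probability stored only when the tweet has a real topic. The function name and the toy inputs are assumptions for illustration.

```python
def topic_probability_rows(topic_labels, probabilities, n_topics):
    """Build one row per document with the probability in its topic's slot.

    Outlier documents (topic -1) get a row of zeros.
    """
    rows = []
    for topic, prob in zip(topic_labels, probabilities):
        row = [0.0] * n_topics
        if topic >= 0:
            row[topic] = float(prob)
        rows.append(row)
    return rows


rows = topic_probability_rows([-1, 2, 0], [0.07, 1.0, 0.4], n_topics=3)
print(rows)
# [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.4, 0.0, 0.0]]
```

`pd.DataFrame(rows)` turns the result back into a table with one column per topic.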


In [None]:
df.columns

Index(['Date', 'Full Text', 'Clean Text', 'Author', 'Url', 'Continent',
       'Country', 'Region', 'Country Code', 'Continent Code', 'Region Code',
       'City Code', 'Twitter Followers', 'Twitter Following',
       'Twitter Reply Count', 'Twitter Retweets', 'Twitter Verified',
       'Reach (new)'],
      dtype='object')

In [None]:
tweet_by_topic = topics
tweet_by_topic['text'] = tweet_df['Full Text'] 
tweet_by_topic['cleaned_text'] = tweet_df['Clean Text'] 
tweet_by_topic["Twitter Followers"] =  tweet_df["Twitter Followers"]
tweet_by_topic["Twitter Reply Count"] =  tweet_df["Twitter Reply Count"]
tweet_by_topic["Twitter Retweets"] =  tweet_df["Twitter Retweets"]


In [None]:
tweet_by_topic

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,77,78,79,80,81,text,cleaned_text,Twitter Followers,Twitter Reply Count,Twitter Retweets
0,0.072786,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,"In Colorado Senate race, Michael Bennet still ...",in colorado senate race michael bennet still f...,19711,0,0
1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,COMING UP on @WRAL at 7:30pm: We sit down with...,coming up on at pm we sit down with and about ...,3832,2,4
2,0.047750,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,Summaries of high-profile Supreme Court cases:...,summaries of highprofile supreme court cases t...,39,0,1
3,0.057274,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,Abortion Icon Emma Bonino Trounced in Italian ...,abortion icon emma bonino trounced in italian ...,360,0,0
4,0.130690,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,💥38 DAYS UNTIL #ELECTIONDAY MIDTERMS💥 WHAT R U...,💥 days until midterms💥 what r u doing for demo...,33774,1,10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18912,0.060293,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,"Americans realize, this isn't a game where we ...",americans realize this isnt a game where we vo...,217,0,0
18913,0.040365,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,Senator has been blocking immigration reforms ...,senator has been blocking immigration reforms ...,2,0,0
18914,0.006739,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,Weekend Update: Tammy the Trucker on Gas Price...,weekend update tammy the trucker on gas prices...,3011,0,0
18915,0.027059,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,I think Trump is the only Republican that coul...,i think trump is the only republican that coul...,1121,9,2


In [None]:
tweet_by_topic.to_excel('tweet_by_topic-' + model_name + '_with_sent.xlsx')

In [None]:
num_terms = 10
# topic_representations_ maps each topic id to a list of (term, c-TF-IDF score) pairs
terms = pd.DataFrame(list(topic_model.topic_representations_.values()))
# topic_sizes_ maps each topic id to its number of tweets
sizes = pd.Series(list(topic_model.topic_sizes_.items()))
topic_by_words = pd.DataFrame()
topic_by_words["topic"] = sizes.apply(lambda x: x[0])
topic_by_words["count"] = sizes.apply(lambda x: x[1])
# note: the two attributes are paired by position, so this relies on both listing topics in the same order
for i in range(num_terms):
  col = str(i + 1)
  topic_by_words['term ' + col] = terms[i].apply(lambda x: x[0])
topic_by_words


Unnamed: 0,topic,count,term 1,term 2,term 3,term 4,term 5,term 6,term 7,term 8,term 9,term 10
0,-1,12979,immigration,vote,illegal,voters,voting,election,border,democrats,rights,biden
1,0,253,hands,elite,👇,amlo,good,please,think,time,vote,pass
2,1,218,rights,reform,womens,gun,lgbtq,climate,voting,healthcare,social,medicare
3,2,203,passed,house,bill,senate,reform,refused,bring,comprehensive,pass,congress
4,3,181,debate,night,challenger,republican,rep,candidates,district,attorney,texas,cuellar
...,...,...,...,...,...,...,...,...,...,...,...,...
78,77,31,🤣,🤡,🤮,️,😡,🤔,🤦‍♀,🖕,‼,🤷‍♂
79,78,30,tax,pro,gun,yes,molinaros,billionaires,choice,marc,unions,social
80,79,30,cbs,americans,poll,favor,migrants,democrats,say,politico,migration,prefer
81,80,30,fetterman,oz,trust,dr,john,poll,economy,crime,warren,pennsylvania


In [None]:
topic_by_words.sort_values(by='topic', inplace=True)
topic_by_words = topic_by_words.reset_index() # keeps the old row order in a new 'index' column
topic_by_words

Unnamed: 0,index,topic,count,term 1,term 2,term 3,term 4,term 5,term 6,term 7,term 8,term 9,term 10
0,0,-1,12979,immigration,vote,illegal,voters,voting,election,border,democrats,rights,biden
1,1,0,253,hands,elite,👇,amlo,good,please,think,time,vote,pass
2,2,1,218,rights,reform,womens,gun,lgbtq,climate,voting,healthcare,social,medicare
3,3,2,203,passed,house,bill,senate,reform,refused,bring,comprehensive,pass,congress
4,4,3,181,debate,night,challenger,republican,rep,candidates,district,attorney,texas,cuellar
...,...,...,...,...,...,...,...,...,...,...,...,...,...
78,78,77,31,🤣,🤡,🤮,️,😡,🤔,🤦‍♀,🖕,‼,🤷‍♂
79,79,78,30,tax,pro,gun,yes,molinaros,billionaires,choice,marc,unions,social
80,80,79,30,cbs,americans,poll,favor,migrants,democrats,say,politico,migration,prefer
81,81,80,30,fetterman,oz,trust,dr,john,poll,economy,crime,warren,pennsylvania


In [None]:
len(topic_model.get_representative_docs())

83

In [None]:
# representative_docs_ maps each topic id to a list of representative tweets
s_docs = pd.DataFrame(topic_model.representative_docs_.items())
docs = pd.DataFrame()
docs['topic_#'] = s_docs[0]
docs['sample 1'] = s_docs[1].apply(lambda x: x[0])
docs['sample 2'] = s_docs[1].apply(lambda x: x[1])
docs['sample 3'] = s_docs[1].apply(lambda x: x[2])

# placeholder row for the outlier topic (-1)
# caution: representative_docs_ can already contain -1 (as in the output below),
# in which case this adds a second -1 row and shifts any later positional alignment
ro = pd.DataFrame()
ro['sample 1'] = [0]
ro['sample 2'] = [0]
ro['sample 3'] = [0]
ro['topic_#'] = [-1]
print(ro)
docs = pd.concat((docs, ro), axis=0)

docs = docs.sort_values(by=['topic_#'])
docs = docs.reset_index()

docs

   sample 1  sample 2  sample 3  topic_#
0         0         0         0       -1


Unnamed: 0,index,topic_#,sample 1,sample 2,sample 3
0,0,-1,should president biden end all of the trump ad...,i trust president biden to pass immigration le...,our gop is responsible for this if you want ch...
1,0,-1,0,0,0
2,1,0,elite or its in your hands so,elite or its in your hands,time out elite or its in your hands so
3,2,1,republicans are wrong on every single issue vo...,why i am voting blue dems in democracy womens ...,climate change lgbtq rights reproductive right...
4,3,2,gop controlled house senate and white house an...,folks this lady say if the house was republica...,yep the senate passed comprehensive immigratio...
...,...,...,...,...,...
79,78,77,a fake birth certificate a fake election a con...,🤣🤣🤣🤣 words with no meanings kind of like a fak...,legal immigration are people who follow the ru...
80,79,78,dem mainstream policies pro choice gun safety ...,dem mainstream policies gun safety pro union p...,dem centrist policies gun safety pro union pro...
81,80,79,cbs poll majority of voters say democrats favo...,cbs poll voters say democrats favor migrants o...,cbs poll democrats favor migrants over americans
82,81,80,come on you trust fetterman over oz unbelievab...,voters trust john fetterman over dr oz on the ...,voters trust john fetterman over dr oz on the ...


In [None]:

# note: concat with axis=1 pairs rows by position, not by topic id
save_topics = pd.concat((topic_by_words, docs[['sample 1', 'sample 2', 'sample 3']]), axis=1)
# save_topics = topic_by_words


In [None]:
save_topics = save_topics[1:]

In [None]:
save_topics

Unnamed: 0,index,topic,count,term 1,term 2,term 3,term 4,term 5,term 6,term 7,term 8,term 9,term 10,sample 1,sample 2,sample 3
1,1.0,0.0,253.0,hands,elite,👇,amlo,good,please,think,time,vote,pass,0,0,0
2,2.0,1.0,218.0,rights,reform,womens,gun,lgbtq,climate,voting,healthcare,social,medicare,elite or its in your hands so,elite or its in your hands,time out elite or its in your hands so
3,3.0,2.0,203.0,passed,house,bill,senate,reform,refused,bring,comprehensive,pass,congress,republicans are wrong on every single issue vo...,why i am voting blue dems in democracy womens ...,climate change lgbtq rights reproductive right...
4,4.0,3.0,181.0,debate,night,challenger,republican,rep,candidates,district,attorney,texas,cuellar,gop controlled house senate and white house an...,folks this lady say if the house was republica...,yep the senate passed comprehensive immigratio...
5,5.0,4.0,173.0,high,prices,gas,crime,inflation,illegal,food,record,fentanyl,higher,texas gov greg abbott and democratic challenge...,gov greg abbott and democratic challenger beto...,tx gov greg abbott and democratic challenger b...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
79,79.0,78.0,30.0,tax,pro,gun,yes,molinaros,billionaires,choice,marc,unions,social,a fake birth certificate a fake election a con...,🤣🤣🤣🤣 words with no meanings kind of like a fak...,legal immigration are people who follow the ru...
80,80.0,79.0,30.0,cbs,americans,poll,favor,migrants,democrats,say,politico,migration,prefer,dem mainstream policies pro choice gun safety ...,dem mainstream policies gun safety pro union p...,dem centrist policies gun safety pro union pro...
81,81.0,80.0,30.0,fetterman,oz,trust,dr,john,poll,economy,crime,warren,pennsylvania,cbs poll majority of voters say democrats favo...,cbs poll voters say democrats favor migrants o...,cbs poll democrats favor migrants over americans
82,82.0,81.0,30.0,effect,lot,doesnt,yeah,gop,median,econ,sometimes,people,true,come on you trust fetterman over oz unbelievab...,voters trust john fetterman over dr oz on the ...,voters trust john fetterman over dr oz on the ...
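`pd.concat(..., axis=1)` pairs rows by position, so the keyword table and the sample table must contain the same topics in the same order. In the output above they do not: the samples sit one row below the topic they belong to (the "hands, elite" keywords of topic 0 are next to zeros, while the "elite or its in your hands" samples appear beside topic 1). Joining on the topic id avoids this. The sketch below shows the idea on plain dictionaries shaped like `topic_representations_` and `representative_docs_`; the function name and the toy inputs are assumptions for illustration.

```python
def join_topics_with_samples(topic_terms, representative_docs, n_samples=3):
    """Join top terms and sample documents by topic id, skipping outliers (-1).

    topic_terms: dict of topic id -> list of (term, score) pairs
    representative_docs: dict of topic id -> list of documents
    """
    rows = []
    for topic_id in sorted(topic_terms):
        if topic_id == -1:
            continue
        samples = list(representative_docs.get(topic_id, []))[:n_samples]
        samples += [""] * (n_samples - len(samples))  # pad missing samples
        row = {"topic": topic_id,
               "terms": ", ".join(term for term, _score in topic_terms[topic_id])}
        for i, sample in enumerate(samples, start=1):
            row[f"sample {i}"] = sample
        rows.append(row)
    return rows


# Toy inputs, not the real model output.
toy_terms = {-1: [("vote", 0.1)], 0: [("hands", 0.3), ("elite", 0.2)], 1: [("rights", 0.4)]}
toy_docs = {-1: ["a", "b", "c"], 0: ["elite or its in your hands"], 1: ["d", "e", "f"]}
for row in join_topics_with_samples(toy_terms, toy_docs):
    print(row)
```

Each row keeps its keywords and its samples together regardless of how the two inputs are ordered, and `pd.DataFrame(rows)` turns the result into a table.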


In [None]:
save_topics.to_excel("topics_info_" + model_name +".xlsx")
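The introduction lists the fitted model among the output files, but none of the cells shown here writes it to disk. A minimal sketch using BERTopic's `save` method and `BERTopic.load` follows. The helper names and the file naming scheme are assumptions, and a saved model is safest to reload with the same package versions it was fitted with.

```python
def save_topic_model(topic_model, model_name):
    """Write the fitted model to disk and return the path used."""
    path = model_name + "_bertopic_model"  # hypothetical naming scheme
    topic_model.save(path)
    return path


def load_topic_model(path):
    """Reload a model previously written by save_topic_model."""
    from bertopic import BERTopic  # imported lazily; requires bertopic
    return BERTopic.load(path)
```

In this notebook the call would be `save_topic_model(topic_model, model_name)`.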


# Pretty Plots

### 2.1 Attributes

There are a number of attributes that you can access after having trained your BERTopic model:


| Attribute | Description |
|------------------------|---------------------------------------------------------------------------------------------|
| topics_               | The topics that are generated for each document after training or updating the topic model. |
| probabilities_ | The probabilities that are generated for each document if HDBSCAN is used. |
| topic_sizes_           | The size of each topic                                                                      |
| topic_mapper_          | A class for tracking topics and their mappings anytime they are merged/reduced.             |
| topic_representations_ | The top *n* terms per topic and their respective c-TF-IDF values.                             |
| c_tf_idf_              | The topic-term matrix as calculated through c-TF-IDF.                                       |
| topic_labels_          | The default labels for each topic.                                                          |
| custom_labels_         | Custom labels for each topic as generated through `.set_topic_labels`.                                                               |
| topic_embeddings_      | The embeddings for each topic if `embedding_model` was used.                                                              |
| representative_docs_   | The representative documents for each topic if HDBSCAN is used.                                                |

In [None]:
topics = topic_model.topics_
probs = topic_model.probabilities_

In [None]:
tweet_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18917 entries, 0 to 18916
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Date                 18917 non-null  object
 1   Full Text            18917 non-null  object
 2   Clean Text           18917 non-null  object
 3   Author               18917 non-null  object
 4   Url                  18917 non-null  object
 5   Continent            18900 non-null  object
 6   Country              18900 non-null  object
 7   Region               14571 non-null  object
 8   Country Code         18900 non-null  object
 9   Continent Code       18900 non-null  object
 10  Region Code          14571 non-null  object
 11  City Code            9821 non-null   object
 12  Twitter Followers    18917 non-null  int64 
 13  Twitter Following    18917 non-null  int64 
 14  Twitter Reply Count  18917 non-null  int64 
 15  Twitter Retweets     18917 non-null  int64 
 16  Twit

### 2.2 Visualizations

More can be found [here](https://maartengr.github.io/BERTopic/getting_started/visualization/visualization.html#visualize-hierarchical-documents)

| Method | Description |
|------------------------|---------------------------------------------------------------------------------------------|
| visualize_hierarchy | To understand the potential hierarchical structure of the topics, we can use scipy.cluster.hierarchy to create clusters and visualize how they relate to one another. |
| visualize_topics | We embed our representation of the topics in 2D using UMAP and then create an interactive view. |
| visualize_barchart | We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. |

In [None]:
topic_model.visualize_hierarchy(top_n_topics = 70, custom_labels=True)

In [None]:
topic_model.visualize_topics(top_n_topics=60)

In [None]:
topic_model.visualize_topics()

In [None]:
from sentence_transformers import SentenceTransformer

model_path = "cardiffnlp/twitter-roberta-base" #one language
sentence_model = SentenceTransformer(model_path, device="cuda")
# sentence_model = SentenceTransformer("all-MiniLM-L6-v2")



In [None]:
from umap import UMAP
embeddings = sentence_model.encode(df['Clean Text'].tolist(), show_progress_bar=True)
# umap_model = UMAP(n_neighbors=10, n_components=5, min_dist=0.1, metric='cosine', random_state=42)
# reduce to 2 dimensions: visualize_documents plots the reduced embeddings as x/y coordinates
reduced_embeddings = UMAP(n_neighbors=15, n_components=2,
                          min_dist=0.0, metric='cosine').fit_transform(embeddings)

Batches:   0%|          | 0/592 [00:00<?, ?it/s]

In [None]:
topic_model.visualize_documents(df['Clean Text'], reduced_embeddings=reduced_embeddings,
                                hide_document_hover=False, hide_annotations=True)

In [None]:
topic_model.visualize_barchart(top_n_topics=20, n_words=5) # not the best tool in my opinion
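The `visualize_*` methods return Plotly figures, so the interactive plots can also be written to standalone HTML files for sharing outside Colab. The sketch below does that for three of the plots used above. The function name `save_plots` and the file naming scheme are assumptions for illustration.

```python
def save_plots(topic_model, prefix):
    """Write a few BERTopic plots to HTML files and return the file names."""
    figures = {
        "hierarchy": topic_model.visualize_hierarchy(),
        "topics": topic_model.visualize_topics(),
        "barchart": topic_model.visualize_barchart(),
    }
    paths = []
    for name, fig in figures.items():
        path = f"{prefix}_{name}.html"
        fig.write_html(path)  # Plotly figure method
        paths.append(path)
    return paths
```

In this notebook the call would be `save_plots(topic_model, model_name)`.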