# BERTopic
- BERTopic is a topic modeling technique that uses transformer embeddings (such as BERT) and class-based TF-IDF to create dense clusters of documents. It also makes the generated topics easy to interpret and visualize.


- The BERTopic algorithm has 3 stages:
1. Embed the textual data (documents)
   - In this step, the algorithm extracts document embeddings with BERT or any other embedding technique.
   - By default, depending on the `language` setting, it uses one of the following sentence transformers:
   - "paraphrase-MiniLM-L6-v2": an English BERT-based model trained specifically for semantic similarity tasks.
   - "paraphrase-multilingual-MiniLM-L12-v2": similar to the first, with one major difference: the multilingual model works for 50+ languages.
2. Cluster documents
   - It uses UMAP to reduce the dimensionality of the embeddings, then HDBSCAN to cluster the reduced embeddings into groups of semantically similar documents.
3. Create a topic representation
   - The last step extracts and reduces topics with class-based TF-IDF, then improves the coherence of the topic words with Maximal Marginal Relevance.
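
To make the class-based TF-IDF step more concrete, here is a minimal pure-Python sketch. It assumes the weighting W(t, c) = tf(t, c) * log(1 + A / f(t)), where tf(t, c) is the relative frequency of term t inside cluster c, f(t) is the count of t across all clusters, and A is the average number of words per cluster. The function name `c_tf_idf` and the toy clusters are hypothetical, and this is not BERTopic's own implementation.

```python
import math
from collections import Counter

def c_tf_idf(clusters):
    """Score each term per cluster as tf(t, c) * log(1 + A / f(t))."""
    # Treat every cluster as one long document by pooling its tokens.
    counts = {cid: Counter(tok for doc in docs for tok in doc)
              for cid, docs in clusters.items()}
    # f(t): how often each term occurs across all clusters.
    total = Counter()
    for c in counts.values():
        total.update(c)
    # A: average number of tokens per cluster.
    avg_words = sum(total.values()) / len(counts)
    scores = {}
    for cid, c in counts.items():
        n = sum(c.values())  # cluster length, used to normalize tf
        scores[cid] = {t: (k / n) * math.log(1 + avg_words / total[t])
                       for t, k in c.items()}
    return scores

# Hypothetical toy clusters of already tokenized tweets.
clusters = {
    0: [["table", "tennis", "olympics"], ["tennis", "singles", "olympics"]],
    1: [["zambia", "goals", "olympics"], ["zambia", "hattrick", "olympics"]],
}
scores = c_tf_idf(clusters)
top_terms = {cid: max(s, key=s.get) for cid, s in scores.items()}
print(top_terms)
```

In this toy example "olympics" is frequent in both clusters, so it is down-weighted relative to terms that are specific to one cluster.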

In [4]:
!pip install bertopic

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [11]:
!pip install bertopic[visualization]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [5]:
cd drive/My Drive/Colab Notebooks

/content/drive/My Drive


In [12]:
#import packages

import pandas as pd 
import numpy as np
from bertopic import BERTopic

In [14]:
df = pd.read_csv('tokyo_2020_tweets.csv', engine='python')
print(df.shape)

(160548, 16)


In [15]:
# select only 10000 tweets 
df = df[0:10000]

In [16]:
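# create the model
model = BERTopic(verbose=True)

# convert the tweet text column to a list of documents
docs = df.text.to_list()

# fit the model; returns one topic id per document and the assignment probabilities
topics, probabilities = model.fit_transform(docs)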

2022-06-14 14:53:44,299 - BERTopic - Transformed documents to Embeddings
2022-06-14 14:54:30,196 - BERTopic - Reduced dimensionality
2022-06-14 14:54:30,602 - BERTopic - Clustered reduced embeddings


In [17]:
model.get_topic_freq().head(11)

| | Topic | Count |
|---|---|---|
| 0 | -1 | 2724 |
| 1 | 0 | 463 |
| 2 | 1 | 451 |
| 3 | 2 | 418 |
| 4 | 3 | 399 |
| 5 | 4 | 367 |
| 6 | 5 | 286 |
| 7 | 6 | 177 |
| 8 | 7 | 103 |
| 9 | 8 | 84 |


In [19]:
model.get_topic(0)

[('sutirtha', 0.03724728509296531),
 ('mukherjee', 0.03565451893226214),
 ('tennis', 0.034955830598846147),
 ('tabletennis', 0.031123120037262454),
 ('round', 0.026996566459222),
 ('singles', 0.024804593859245813),
 ('comeback', 0.020502421350979478),
 ('table', 0.019461791876490463),
 ('manika', 0.01908779762719676),
 ('linda', 0.018264385244567725)]

In [20]:
model.get_topic(7)

[('banda', 0.060470811564520104),
 ('zambia', 0.05349652460336988),
 ('barbra', 0.04057790792462584),
 ('china', 0.03548480485248873),
 ('barbara', 0.027129618553350842),
 ('hattricks', 0.020823531853025084),
 ('hattrick', 0.019074623737739223),
 ('goals', 0.018736163870133025),
 ('44', 0.018564543001354353),
 ('two', 0.016511885626669102)]

In [21]:
model.visualize_topics()

In [22]:
model.visualize_barchart()

## LDA


In [62]:
df = pd.read_csv('tokyo_2020_tweets.csv', engine='python')
df.reset_index(inplace=True,drop=True)

In [63]:
data_text = df[['text']].copy()  # copy so that adding a column does not write to a view of df
data_text['index'] = df.index
documents = data_text

In [64]:
data_text.head()

| | text | index |
|---|---|---|
| 0 | Let the party begin\n#Tokyo2020 | 0 |
| 1 | Congratulations #Tokyo2020 https://t.co/8OFKMs... | 1 |
| 2 | Big Breaking Now \n\nTokyo Olympic Update \n\n... | 2 |
| 3 | Q4: 🇬🇧3-1🇿🇦\n\nGreat Britain finally find a wa... | 3 |
| 4 | All I can think of every time I watch the ring... | 4 |


In [65]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
import nltk

np.random.seed(2020)  # seed NumPy's global random number generator
nltk.download('wordnet')  # WordNet data required by WordNetLemmatizer

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [66]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [67]:
print(WordNetLemmatizer().lemmatize('went', pos='v'))

go


In [71]:
stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

def lemmatize_stemming(text):
    # Lemmatize the token as a verb, then reduce it to its stem
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))

def preprocess(text):
    # Tokenize and lowercase, then drop stopwords and tokens of 3 characters or fewer
    text = str(text)
    result = []
    for token in simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

In [72]:
doc_sample = documents[documents['index'] == 4310].values[0][0]
print('original document: ')
words = doc_sample.split(' ')
print(words)
print('\n\n tokenized and lemmatized document: ')
print(preprocess(doc_sample))

original document: 
['@PuneethRajkumar', '@mirabai_chanu', 'Hearty', 'congratulations', '@mirabai_chanu', 'for', 'lifting', '🥈', 'in', "women's", '49Kg', 'Weightlifting.…', 'https://t.co/lrNsnMv8kR']


 tokenized and lemmatized document: 
['puneethrajkumar', 'mirabai_chanu', 'hearti', 'congratul', 'mirabai_chanu', 'lift', 'women', 'weightlift', 'https', 'lrnsnmv']


In [73]:
processed_docs = documents['text'].map(preprocess)

In [74]:
processed_docs[:10]

0                                [parti, begin, tokyo]
1                      [congratul, tokyo, https, ofkm]
2    [break, tokyo, olymp, updat, japan, gold, taka...
3    [great, britain, final, pieters, jack, waller,...
4    [think, time, watch, ring, event, tokyo, olymp...
5    [tokyo, olymp, mirabaichanu, weightlift, women...
6    [help, cheer, banda, goal, game, zambia, goal,...
7    [inquirerdotnet, ftjochoainq, caloy, yulo, goo...
8    [green, card, canada, captain, scott, tupper, ...
9    [hearti, congratul, indian, railway, player, s...
Name: text, dtype: object

In [75]:
dictionary = gensim.corpora.Dictionary(processed_docs)


In [76]:
from itertools import islice

# Show the first 11 (id, token) pairs in the dictionary
for k, v in islice(dictionary.items(), 11):
    print(k, v)

0 begin
1 parti
2 tokyo
3 congratul
4 https
5 ofkm
6 break
7 gold
8 japan
9 judo
10 naohisa


## Gensim filter_extremes

Filter out tokens that appear in:

- fewer than 15 documents (absolute number), or
- more than 50% of the documents (`no_above=0.5` is a fraction of the total corpus size, not an absolute number).

After these two steps, keep only the 100000 most frequent tokens.

In [77]:
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
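
The three rules above can be sketched in plain Python. This is an illustrative approximation of the filtering logic, not gensim's implementation; the helper name `filter_tokens` and the toy corpus are hypothetical, and the sketch ranks tokens by document frequency when applying `keep_n`.

```python
from collections import Counter

def filter_tokens(docs, no_below, no_above, keep_n):
    """Return the tokens that survive the three filtering rules."""
    # Document frequency: number of documents that contain each token.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    max_docs = no_above * len(docs)  # fraction converted to a document count
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Keep the keep_n tokens found in the most documents (ties broken alphabetically).
    kept.sort(key=lambda t: (-doc_freq[t], t))
    return kept[:keep_n]

# Hypothetical toy corpus of tokenized documents.
docs = [
    ["tokyo", "gold", "medal"],
    ["tokyo", "gold", "judo"],
    ["tokyo", "hockey", "medal"],
    ["tokyo", "hockey", "gold"],
]
surviving = filter_tokens(docs, no_below=2, no_above=0.75, keep_n=2)
print(surviving)
```

Here "tokyo" is dropped because it appears in every document and "judo" because it appears in only one, which mirrors why very common and very rare tokens are removed before training.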


## Gensim doc2bow

For each document, we build a bag-of-words representation: a list of (token id, count) pairs recording which dictionary words appear in the document and how many times each one appears. Save this to `bow_corpus`, then check the document we selected earlier.

In [78]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
bow_corpus[4310]

[(2, 1), (26, 1), (27, 1), (52, 1), (87, 2), (670, 1), (2466, 1)]
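
As a rough illustration of what a bag-of-words vector is, the sketch below builds (token id, count) pairs with only the standard library. The helpers `build_vocab` and `to_bow` are hypothetical, and the id assignment (order of first appearance) is an assumption made for the example rather than gensim's scheme.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs, skipping tokens not in the vocabulary."""
    counts = Counter(vocab[tok] for tok in doc if tok in vocab)
    return sorted(counts.items())

# Hypothetical tokenized documents.
docs = [["hearti", "congratul", "mirabai_chanu"], ["lift", "women", "weightlift"]]
vocab = build_vocab(docs)
bow = to_bow(["mirabai_chanu", "lift", "mirabai_chanu", "unknown"], vocab)
print(bow)
```

Tokens missing from the vocabulary are silently ignored, which is also why tokens removed by the earlier filtering step no longer appear in the bag-of-words output for document 4310.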

In [79]:
bow_doc_4310 = bow_corpus[4310]

for token_id, count in bow_doc_4310:
    print("Word {} (\"{}\") appears {} time.".format(token_id, dictionary[token_id], count))

Word 2 ("congratul") appears 1 time.
Word 26 ("weightlift") appears 1 time.
Word 27 ("women") appears 1 time.
Word 52 ("hearti") appears 1 time.
Word 87 ("mirabai_chanu") appears 2 time.
Word 670 ("lift") appears 1 time.
Word 2466 ("puneethrajkumar") appears 1 time.


In [80]:
from gensim import corpora, models

tfidf = models.TfidfModel(bow_corpus)

In [81]:
corpus_tfidf = tfidf[bow_corpus]


In [82]:
from pprint import pprint

for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.6056441383142471), (1, 0.7957356204956474)]
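
The two weights printed above have a squared sum of about 1, which is consistent with unit-length (L2) normalization. The sketch below shows one way such weights can arise, assuming each term is weighted by count * log2(N / df) and the vector is then L2-normalized. The helper `tfidf_vector` and the document frequencies are hypothetical, and this is not gensim's code.

```python
import math

def tfidf_vector(bow, doc_freq, num_docs):
    """Weight a bag-of-words vector by tf * log2(N / df), then scale it to unit length."""
    weighted = [(tid, cnt * math.log2(num_docs / doc_freq[tid])) for tid, cnt in bow]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    if norm == 0:
        return []
    # Drop zero-weight terms (tokens that occur in every document).
    return [(tid, w / norm) for tid, w in weighted if w > 0]

# Hypothetical document frequencies for a corpus of 8 documents.
doc_freq = {0: 4, 1: 2, 2: 8}
vec = tfidf_vector([(0, 1), (1, 1), (2, 3)], doc_freq, num_docs=8)
print(vec)
```

Rarer tokens receive larger weights, and a token present in every document contributes nothing regardless of how often it occurs in the document.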


In [83]:
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary, passes=2, workers=2)


In [84]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.101*"india" + 0.063*"tokyoolymp" + 0.046*"teamindia" + 0.044*"olymp" + 0.044*"cheer" + 0.032*"box" + 0.027*"hockey" + 0.026*"round" + 0.024*"indian" + 0.020*"tabletenni"
Topic: 1 
Words: 0.041*"olymp" + 0.034*"uswnt" + 0.024*"usavaus" + 0.019*"live" + 0.013*"follow" + 0.012*"game" + 0.010*"like" + 0.010*"taekwondo" + 0.010*"watch" + 0.009*"mckeown"
Topic: 2 
Words: 0.029*"team" + 0.028*"olymp" + 0.023*"game" + 0.019*"lead" + 0.018*"play" + 0.018*"great" + 0.014*"hockey" + 0.014*"australia" + 0.014*"women" + 0.013*"half"
Topic: 3 
Words: 0.029*"teamusa" + 0.023*"olymp" + 0.019*"race" + 0.018*"amaz" + 0.013*"bike" + 0.013*"good" + 0.012*"absolut" + 0.011*"hard" + 0.011*"luck" + 0.011*"sail"
Topic: 4 
Words: 0.091*"olymp" + 0.027*"watch" + 0.024*"olympicgam" + 0.022*"dive" + 0.020*"love" + 0.020*"sport" + 0.018*"skateboard" + 0.014*"gymnast" + 0.014*"teamgb" + 0.014*"athlet"
Topic: 5 
Words: 0.060*"final" + 0.041*"diaz" + 0.030*"olymp" + 0.028*"hidilyn" + 0.026*"shoot" 

In [85]:
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=10, id2word=dictionary, passes=2, workers=4)


In [86]:
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic))

Topic: 0 Word: 0.014*"kayle" + 0.013*"mckeown" + 0.013*"medal" + 0.013*"gold" + 0.012*"olymp" + 0.011*"pidcock" + 0.010*"india" + 0.010*"win" + 0.009*"mirabai" + 0.008*"silver"
Topic: 1 Word: 0.040*"hidilyn" + 0.037*"gold" + 0.032*"diaz" + 0.025*"congratul" + 0.023*"olymp" + 0.023*"medal" + 0.022*"philippin" + 0.018*"proud" + 0.015*"weightlift" + 0.015*"india"
Topic: 2 Word: 0.017*"usavaus" + 0.011*"olymp" + 0.010*"duffi" + 0.009*"flora" + 0.008*"dean" + 0.008*"uswnt" + 0.006*"minion" + 0.006*"comeback" + 0.006*"game" + 0.006*"unbeliev"
Topic: 3 Word: 0.018*"rugbi" + 0.015*"teamgb" + 0.013*"olymp" + 0.013*"tomdaley" + 0.010*"dive" + 0.009*"box" + 0.008*"mountainbik" + 0.007*"morn" + 0.007*"watch" + 0.006*"gold"
Topic: 4 Word: 0.022*"olymp" + 0.013*"watch" + 0.013*"skateboard" + 0.009*"volleybal" + 0.009*"game" + 0.009*"basketbal" + 0.008*"good" + 0.007*"osaka" + 0.007*"sport" + 0.007*"team"
Topic: 5 Word: 0.014*"teamcanada" + 0.012*"diaz_hidilyn" + 0.012*"olymp" + 0.011*"jacobi" + 0.00

In [58]:
df.sample(5)

Unnamed: 0,id,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,retweets,favorites,is_retweet
121,1418888925456871426,#Tokyo2020,"Tokyo, Japan",The official account of The Tokyo Organising C...,2013-12-17 03:43:22,374086,436,2339,True,2021-07-24 11:00:56,Congratulations to @LindseyHoran who is set to...,['Tokyo2020'],Khoros Publishing,7.0,28.0,False
4259,1418874484036014080,International Hockey Federation,Lausanne,Official International Hockey Federation Twitt...,2010-10-20 10:45:59,103980,2724,36554,True,2021-07-24 10:03:33,Q2: 🇬🇧1-1🇿🇦\n\nGB's Chris Griffiths cracks a b...,,Twitter Web App,0.0,1.0,False
4889,1418872692174897153,💉 Tim Coghlan,,Waratahs | QPR | Sydney FC | Bulldogs | Wallab...,2013-10-06 15:23:47,124,387,12484,False,2021-07-24 09:56:25,Tillies need to get more physical. Ref is lett...,"['SWEvAUS', 'football', 'Tokyo2020', 'Olympics']",Twitter for Android,0.0,1.0,False
7888,1418864639568457730,Vicky Whelan,"Liverpool, England",Benjamin / Colorectal Registrar / Sauvignon Bl...,2011-02-24 19:35:27,225,701,2782,False,2021-07-24 09:24:25,Who needs a crowd when you’ve got your team ma...,"['TeamGB', 'mensgymnastics', 'ukgymnastics', '...",Twitter for iPhone,0.0,0.0,False
4391,1418874011669438468,Alistair Hogg,"Dubai, United Arab Emirates",Global Social Media Lead @FIBA 🏀📱 Passionate a...,2009-06-18 02:09:00,6072,3170,3530,False,2021-07-24 10:01:40,Bit of a mismatch in the dojo tonight.\n#Tokyo...,"['Tokyo2020', 'Judo']",TweetDeck,0.0,4.0,False


In [89]:
unseen_document = 'swimming news?'
bow_vector = dictionary.doc2bow(preprocess(unseen_document))
for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))
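
The scores printed for this query follow a simple pattern: one topic near 0.70 and the others near 0.0333. The sketch below reproduces that pattern under two assumptions that are not confirmed by the notebook: the model uses a symmetric topic prior of 1 / num_topics, and the two tokens left after preprocessing are both attributed to a single topic. The function `expected_topic_mix` is hypothetical.

```python
def expected_topic_mix(num_topics, token_count, dominant_topic=0):
    """Topic proportions when every token is attributed to a single topic."""
    alpha = 1.0 / num_topics        # assumed symmetric prior
    gamma = [alpha] * num_topics    # prior pseudo-counts for each topic
    gamma[dominant_topic] += token_count
    total = sum(gamma)
    return [g / total for g in gamma]

mix = expected_topic_mix(num_topics=10, token_count=2)
print([round(p, 4) for p in mix])
```

With so few tokens the prior dominates, so a short query cannot produce a score much above 0.70 here, and the near-identical small scores carry no real information about the remaining topics.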

Score: 0.6999684572219849	 Topic: 0.156*"gold" + 0.107*"medal" + 0.068*"olymp" + 0.030*"win" + 0.029*"bronz"
Score: 0.033346403390169144	 Topic: 0.053*"olymp" + 0.025*"surf" + 0.025*"triathlon" + 0.023*"watch" + 0.013*"basketbal"
Score: 0.03334055095911026	 Topic: 0.060*"final" + 0.041*"diaz" + 0.030*"olymp" + 0.028*"hidilyn" + 0.026*"shoot"
Score: 0.03333999216556549	 Topic: 0.101*"india" + 0.063*"tokyoolymp" + 0.046*"teamindia" + 0.044*"olymp" + 0.044*"cheer"
Score: 0.03333523869514465	 Topic: 0.057*"olymp" + 0.049*"rugbi" + 0.027*"tenni" + 0.024*"game" + 0.023*"match"
Score: 0.03333459421992302	 Topic: 0.029*"team" + 0.028*"olymp" + 0.023*"game" + 0.019*"lead" + 0.018*"play"
Score: 0.03333417698740959	 Topic: 0.091*"olymp" + 0.027*"watch" + 0.024*"olympicgam" + 0.022*"dive" + 0.020*"love"
Score: 0.033333610743284225	 Topic: 0.041*"olymp" + 0.034*"uswnt" + 0.024*"usavaus" + 0.019*"live" + 0.013*"follow"
Score: 0.03333359584212303	 Topic: 0.076*"congratul" + 0.057*"olymp" + 0.052*"pro