
# **Installing BERTopic**

We start by installing BERTopic from PyPI:

In [None]:
%%capture
!pip install bertopic

# Data

In [52]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [55]:
from google.colab import files
uploaded = files.upload()

Saving TweetBatch3b.csv to TweetBatch3b.csv


In [56]:
import pandas as pd
import io

tweets = pd.read_csv(io.BytesIO(uploaded['TweetBatch3b.csv']))

In [57]:
tweets.head()

Unnamed: 0,text,cleaned
0,@ReallyAmerican1 #Roevember and\n#ForThePeople...,roevember and forthepeople and votebluein2022...
1,RT @sandibachom: IS THIS THING ON???!!This is ...,rt is this thing on this is pathetic acting se...
2,RT @sandibachom: IS THIS THING ON???!!This is ...,rt is this thing on this is pathetic acting se...
3,RT @tleehumphrey: Today is the beginning of th...,rt today is the beginning of the inquiry into ...
4,RT @AdamKinzinger: Mitch McConnell.\nKevin McC...,rt mitch mcconnell kevin mccarthy they both kn...


In [58]:
tweets.drop(['cleaned'], axis=1, inplace=True)

In [59]:
tweets.head()

Unnamed: 0,text
0,@ReallyAmerican1 #Roevember and\n#ForThePeople...
1,RT @sandibachom: IS THIS THING ON???!!This is ...
2,RT @sandibachom: IS THIS THING ON???!!This is ...
3,RT @tleehumphrey: Today is the beginning of th...
4,RT @AdamKinzinger: Mitch McConnell.\nKevin McC...


In [60]:
tweets.shape

(34993, 1)

In [61]:
# Turn the tweet column into a list of strings
tweet_list = tweets["text"].tolist()

# **Topic Modeling**

In this example, we will go through the main components of BERTopic and the steps necessary to create a strong topic model.




## Training

We start by instantiating BERTopic. We set language to `english` since our documents are in the English language. If you would like to use a multi-lingual model, please use `language="multilingual"` instead.

We will also calculate the topic probabilities. However, this can slow BERTopic down significantly on large datasets (more than 100,000 documents), so consider turning it off if you want to speed up the model.

In this code, BERTopic also runs the following steps:
- Transforming documents to embeddings
  - The default embedding model in BERTopic is `all-MiniLM-L6-v2`.
  - For higher-quality embeddings at the cost of more computing time, the BERTopic FAQ advises `all-mpnet-base-v2`, which works well for English documents.
- Reducing dimensionality
- Clustering the reduced embeddings
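
The advice above can be folded into a small helper. This is a minimal sketch, not part of the original notebook: `choose_bertopic_settings` is a hypothetical name, the 100,000-document cutoff is taken from the note above, and the sketch relies on BERTopic accepting a sentence-transformers model name as a string for `embedding_model`.

```python
def choose_bertopic_settings(n_docs, prefer_quality=False):
    """Return keyword arguments for BERTopic based on corpus size (hypothetical helper)."""
    return {
        # Larger model: higher-quality embeddings, more computing time.
        "embedding_model": "all-mpnet-base-v2" if prefer_quality else "all-MiniLM-L6-v2",
        # Per-topic probabilities get slow beyond roughly 100,000 documents.
        "calculate_probabilities": n_docs <= 100_000,
    }


settings = choose_bertopic_settings(n_docs=34_993)
print(settings)

# With bertopic installed, the settings could then be passed on:
# topic_model = BERTopic(verbose=True, **settings)
```

For the 34,993 tweets used here, this keeps the default embedding model and leaves probability calculation on.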


In [62]:
from bertopic import BERTopic

topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)

# Fit the model to the data
# topics = the topic (cluster) number assigned to each document
# probs = the topic probabilities calculated for each document
topics, probs = topic_model.fit_transform(tweet_list)

Batches:   0%|          | 0/1094 [00:00<?, ?it/s]

2023-08-14 20:57:36,456 - BERTopic - Transformed documents to Embeddings
2023-08-14 20:59:34,558 - BERTopic - Reduced dimensionality
2023-08-14 21:11:55,589 - BERTopic - Clustered reduced embeddings


**NOTE**: Use `language="multilingual"` to select a model that supports 50+ languages.

In [63]:
#topics

In [64]:
#probs
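
The two commented-out cells above would simply display `topics` and `probs`. Below is a small sketch of how the two outputs relate, using made-up toy values rather than real model output. It assumes that, with `calculate_probabilities=True`, each row of `probs` holds one probability per non-outlier topic for a document; `most_likely_topic` is a hypothetical helper.

```python
# Toy stand-ins for the outputs of fit_transform (values are made up).
toy_topics = [0, 2, -1]
toy_probs = [
    [0.80, 0.05, 0.10],  # document 0: mostly topic 0
    [0.10, 0.15, 0.60],  # document 1: mostly topic 2
    [0.20, 0.25, 0.22],  # document 2: no clear winner, labelled an outlier
]


def most_likely_topic(row):
    """Return (topic index, probability) for the highest value in one row."""
    best = max(range(len(row)), key=lambda i: row[i])
    return best, row[best]


for doc_id, (assigned, row) in enumerate(zip(toy_topics, toy_probs)):
    best, p = most_likely_topic(row)
    print(f"doc {doc_id}: assigned={assigned}, most likely={best} (p={p:.2f})")
```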

## Extracting Topics
After fitting our model, we can start by looking at the results. Typically, we look at the most frequent topics first as they best represent the collection of documents.

In [65]:
freq = topic_model.get_topic_info()
freq.head(8)

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,2719,-1_plans_nationalist_violent_flood,"[plans, nationalist, violent, flood, warnings,...",[RT @chadloder: The reason law enforcement fai...
1,0,6017,0_both_they_backed_kevin,"[both, they, backed, kevin, mccarthy, mitch, m...",[RT @AdamKinzinger: Mitch McConnell.\nKevin Mc...
2,1,1613,1_creating_hamill_correct_overthrowing,"[creating, hamill, correct, overthrowing, resi...",[RT @Resist_MAGA_GOP: Mark Hamill is correct. ...
3,2,1176,2_demands_ja_deserves_unanimously,"[demands, ja, deserves, unanimously, history, ...",[RT @AdamKinzinger: We just voted unanimously ...
4,3,1111,3_onthis_deploy_defense_pathetic,"[onthis, deploy, defense, pathetic, sec, actin...",[RT @sandibachom: IS THIS THING ON???!!This is...
5,4,1059,4_goaded_author_summoned_excuse,"[goaded, author, summoned, excuse, rioters, at...",[RT @AdamKinzinger: Trump is the author of the...
6,5,950,5_rejection_loss_break_couldnt,"[rejection, loss, break, couldnt, accept, trie...",[RT @AdamKinzinger: When he couldn't accept hi...
7,6,882,6_rudy_hatch_giuliani_meadows,"[rudy, hatch, giuliani, meadows, yet, new, rog...",[RT @StandForBetter: 📺 NEW Video:\n\nTrump Los...


Topic -1 refers to all outliers and should typically be ignored. Next, let's take a look at the most frequent topics that were generated:

In [66]:
freq.shape

(371, 5)
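
Since topic -1 only collects outliers, it can help to set it aside before ranking topics. Below is a minimal sketch in plain Python with a made-up list of topic assignments; the real `topics` list from `fit_transform` could be passed in the same way. `rank_topics` and `outlier_share` are hypothetical helper names.

```python
from collections import Counter


def rank_topics(topics, outlier_label=-1):
    """Count documents per topic, dropping the outlier label, most frequent first."""
    counts = Counter(t for t in topics if t != outlier_label)
    return counts.most_common()


def outlier_share(topics, outlier_label=-1):
    """Fraction of documents assigned to the outlier topic."""
    return sum(1 for t in topics if t == outlier_label) / len(topics)


toy_topics = [0, 0, 0, 1, 1, -1, 2, -1, 0, 1]
print(rank_topics(toy_topics))
print(f"{outlier_share(toy_topics):.0%} of documents are outliers")
```

In the run shown above, 2,719 of the 34,993 tweets fell into topic -1, which is a little under 8%.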

Get the top terms for the three most frequent topics:

In [67]:
topic_model.get_topic(0)  # Select the most frequent topic

[('both', 0.0142775613901605),
 ('they', 0.013294008528485592),
 ('backed', 0.013060906657019094),
 ('kevin', 0.013043005451758695),
 ('mccarthy', 0.013033574214260652),
 ('mitch', 0.013032466294675712),
 ('mcconnell', 0.012979684605310215),
 ('responsible', 0.012934623424407213),
 ('called', 0.012906686949853129),
 ('down', 0.012596584306227447)]

In [68]:
topic_model.get_topic(1)

[('creating', 0.04315976383332436),
 ('hamill', 0.04315976383332436),
 ('correct', 0.0431411623869482),
 ('overthrowing', 0.043066924645806505),
 ('resistmagagop', 0.042590838307379315),
 ('january6thcomm', 0.041653718294944846),
 ('love', 0.04112532034081855),
 ('without', 0.04009348558395055),
 ('country', 0.03750021431237921),
 ('violence', 0.02714771495578993)]

In [69]:
topic_model.get_topic(2)

[('demands', 0.0457584310468279),
 ('ja', 0.045570163638357905),
 ('deserves', 0.04407765194484484),
 ('unanimously', 0.043621417551267),
 ('history', 0.04210097451584061),
 ('testify', 0.0404338756458835),
 ('subpoena', 0.03505636855704162),
 ('voted', 0.034525155198128966),
 ('oath', 0.031948989686344305),
 ('just', 0.02877037800330572)]
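
`get_topic` returns a list of `(term, score)` pairs, ordered by score. As a small illustration, the sketch below takes the first five pairs printed above for topic 0 (scores rounded) and rebuilds the label shown in the `Name` column of `get_topic_info()`. The joining rule is inferred from the table above rather than taken from the library's source, and `topic_label` is a hypothetical helper.

```python
# First five (term, score) pairs for topic 0, copied from the output above (rounded).
topic_0 = [
    ("both", 0.0143),
    ("they", 0.0133),
    ("backed", 0.0131),
    ("kevin", 0.0130),
    ("mccarthy", 0.0130),
]


def topic_label(topic_id, term_scores, n_terms=4):
    """Join the topic id and its top terms with underscores."""
    terms = [term for term, _ in term_scores[:n_terms]]
    return "_".join([str(topic_id)] + terms)


print(topic_label(0, topic_0))
```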

**NOTE**: BERTopic is stochastic, which means that the topics might differ across runs. This is mostly due to the stochastic nature of UMAP.
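
One common way to make runs repeatable is to pass BERTopic a UMAP model with a fixed `random_state`. Below is a minimal sketch, assuming the `umap-learn` package that BERTopic depends on. The seed value 42 is arbitrary, the other UMAP settings are meant to mirror BERTopic's defaults as I understand them, and `build_reproducible_model` is a hypothetical helper. Fixing the seed can make UMAP slower, since it limits parallelism.

```python
# UMAP settings with a fixed seed (42 is an arbitrary choice).
UMAP_SETTINGS = {
    "n_neighbors": 15,
    "n_components": 5,
    "min_dist": 0.0,
    "metric": "cosine",
    "random_state": 42,
}


def build_reproducible_model(umap_settings=UMAP_SETTINGS):
    """Create a BERTopic model whose UMAP step is seeded."""
    # Imported here so the function can be defined before the packages are loaded.
    from bertopic import BERTopic
    from umap import UMAP

    umap_model = UMAP(**umap_settings)
    return BERTopic(language="english", umap_model=umap_model, verbose=True)


# Usage (requires bertopic to be installed):
# topic_model = build_reproducible_model()
# topics, probs = topic_model.fit_transform(tweet_list)
```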

# **KeyBERT-Inspired Model**
- Reduces the appearance of stop words, which often also improves the topic representation:
- https://maartengr.github.io/BERTopic/api/representation/keybert.html
- https://maartengr.github.io/BERTopic/getting_started/representation/representation.html

In [75]:
from bertopic.representation import KeyBERTInspired
from bertopic import BERTopic

# Create your representation model
representation_model = KeyBERTInspired()

# Use the representation model in BERTopic on top of the default pipeline.
# Stored as topic_model2 so the fitted topic_model above is not overwritten.
topic_model2 = BERTopic(representation_model=representation_model)

In [76]:
topic_model2

<bertopic._bertopic.BERTopic at 0x7a6546fe3280>

#### Next Steps
- Fit this model, look over and generate visualizations, collect thoughts. Does this look better than the other?
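
One rough way to judge whether a representation "looks better" is to check how many of a topic's top terms are stop words. The sketch below applies this to the ten terms printed earlier for topic 0 of the default model. The stop-word set is a small hand-picked assumption for illustration, and `stop_word_share` is a hypothetical helper, not a BERTopic function.

```python
# Small illustrative stop-word set (an assumption; a fuller list would be used in practice).
STOP_WORDS = {"both", "they", "down", "just", "without", "the", "and", "is"}

# Top terms of topic 0 from the default representation shown earlier.
default_topic_0 = [
    "both", "they", "backed", "kevin", "mccarthy",
    "mitch", "mcconnell", "responsible", "called", "down",
]


def stop_word_share(terms, stop_words=STOP_WORDS):
    """Fraction of a topic's terms that appear in the stop-word set."""
    if not terms:
        return 0.0
    return sum(term in stop_words for term in terms) / len(terms)


print(f"{stop_word_share(default_topic_0):.0%} of topic 0's terms are stop words")
```

Once `topic_model2` has been fitted, the same function could be applied to `[term for term, _ in topic_model2.get_topic(0)]` and the two shares compared. Topic numbers are not guaranteed to line up between the two models.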

# **Visualization**
Generate these visualizations for both the KeyBERT and OpenAI models:
- Document visualization: https://maartengr.github.io/BERTopic/getting_started/visualization/visualize_documents.html
- Term visualization (not bad; do this one too): https://maartengr.github.io/BERTopic/getting_started/visualization/visualize_terms.html

## Visualize Terms

We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics. Moreover, you can easily compare topic representations to each other.

In [74]:
topic_model.visualize_barchart(top_n_topics=5)
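
The scores plotted in these bar charts are c-TF-IDF weights. As a rough illustration, the sketch below computes them for a toy corpus using the formula given in the BERTopic documentation, W(t, c) = tf(t, c) * log(1 + A / f(t)), where tf(t, c) is the count of term t in class c, f(t) is its count across all classes, and A is the average number of words per class. It is a simplified stand-in, not BERTopic's own implementation, and the toy documents are made up.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_class):
    """Compute c-TF-IDF weights for each term in each class.

    docs_per_class maps a class (topic) id to a list of documents.
    """
    # Treat all documents of a class as one long document and count its terms.
    tf = {c: Counter(" ".join(docs).lower().split()) for c, docs in docs_per_class.items()}
    # Frequency of each term across all classes.
    f = Counter()
    for counts in tf.values():
        f.update(counts)
    # Average number of words per class.
    avg_words = sum(f.values()) / len(tf)
    return {
        c: {t: n * math.log(1 + avg_words / f[t]) for t, n in counts.items()}
        for c, counts in tf.items()
    }


toy = {
    0: ["trump mccarthy mcconnell", "mccarthy backed trump"],
    1: ["trump subpoena testify", "subpoena voted trump"],
}
weights = c_tf_idf(toy)
for c, scores in weights.items():
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3]
    print(c, [(t, round(s, 2)) for t, s in top])
```

The term shared by both classes ("trump") is down-weighted relative to terms that are concentrated in a single class.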

In [73]:
topic_model.visualize_topics()