# **BERTopic Modeling**
- Starts the Intrusion Methodology
- BERTopic + KeyBERTInspired()

# Imports

In [27]:
import pandas as pd

In [28]:
from bertopic import BERTopic

# Data

In [29]:
tweets = pd.read_csv('/Users/jennihawk/Documents/Data Science Projects/Project_NLP/TweetBatch3.csv')

In [30]:
tweets.head()

Unnamed: 0,text,cleaned
0,@ReallyAmerican1 #Roevember and\n#ForThePeople...,roevember and forthepeople and votebluein2022...
1,RT @sandibachom: IS THIS THING ON???!!This is ...,rt is this thing on this is pathetic acting se...
2,RT @sandibachom: IS THIS THING ON???!!This is ...,rt is this thing on this is pathetic acting se...
3,RT @tleehumphrey: Today is the beginning of th...,rt today is the beginning of the inquiry into ...
4,RT @AdamKinzinger: Mitch McConnell.\nKevin McC...,rt mitch mcconnell kevin mccarthy they both kn...


In [31]:
tweets.drop(['cleaned'], axis=1, inplace=True)

In [32]:
tweets.head()

Unnamed: 0,text
0,@ReallyAmerican1 #Roevember and\n#ForThePeople...
1,RT @sandibachom: IS THIS THING ON???!!This is ...
2,RT @sandibachom: IS THIS THING ON???!!This is ...
3,RT @tleehumphrey: Today is the beginning of th...
4,RT @AdamKinzinger: Mitch McConnell.\nKevin McC...


In [33]:
tweets.shape

(34993, 1)

In [34]:
#turn tweet column into a list of strings
tweet_list = tweets["text"].tolist()

# **Topic Modeling**



## Training

In [35]:
from umap import UMAP
from sentence_transformers import SentenceTransformer

In [36]:
# UMAP is stochastic; fixing random_state makes the topic results reproducible across runs
umap_model = UMAP(n_neighbors=15, n_components=5,
                  min_dist=0.0, metric='cosine', random_state=42)

# Precompute the embeddings so they can be reused later for visualization
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(tweet_list, show_progress_bar=False)

# Initialize BERTopic
topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=False, umap_model=umap_model)

# Fit the model to the data, passing in the precomputed embeddings
topics, probs = topic_model.fit_transform(tweet_list, embeddings)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
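The warning above recommends setting `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch of one way to do that from inside the notebook, assuming the cell runs before `sentence_transformers` or `bertopic` is imported (the value `"false"` is a choice made here for illustration, not something the original notebook sets):

```python
import os

# Assumption: run this before importing sentence_transformers / bertopic,
# so the tokenizers library sees the value when it is first used.
# "false" disables tokenizer parallelism, which silences the fork warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```

Setting the variable to `"true"` instead keeps parallelism on, at the risk of the deadlock the warning describes.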

## Extracting Topics - Most Frequent Topics
After fitting our model, we can start by looking at the results. Typically, we look at the most frequent topics first as they best represent the collection of documents.
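As a rough illustration of what "most frequent topics" means, the sketch below counts topic assignments in a small, made-up list standing in for the `topics` returned by `fit_transform`. The toy values are hypothetical; the only BERTopic convention assumed is that topic `-1` collects outlier documents, so it is excluded from the ranking.

```python
from collections import Counter

# Hypothetical topic assignments, one per document (not the real results)
toy_topics = [0, 0, -1, 1, 0, 1, -1, 2, 0]

# Count how many documents fall into each topic
counts = Counter(toy_topics)

# Rank topics by size, skipping the outlier topic -1
ranked = [(topic, n) for topic, n in counts.most_common() if topic != -1]

for topic, n in ranked:
    print(f"topic {topic}: {n} documents")
```

The `Count` column in the table below plays the same role for the fitted model.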

In [52]:
bertopic_info = topic_model.get_topic_info()
bertopic_info.head(8)

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,2695,-1_january6thcommitteehearings_are_will_in,"[january6thcommitteehearings, are, will, in, n...",[RT @StandForBetter: The Secret Service clearl...
1,0,6006,0_both_they_backed_kevin,"[both, they, backed, kevin, mccarthy, mitch, m...",[RT @AdamKinzinger: Mitch McConnell.\nKevin Mc...
2,1,1368,1_rudy_hatch_giuliani_meadows,"[rudy, hatch, giuliani, meadows, yet, new, rog...",[RT @StandForBetter: 📺 NEW Video:\n\nTrump Los...
3,2,1176,2_demands_ja_deserves_unanimously,"[demands, ja, deserves, unanimously, history, ...",[RT @AdamKinzinger: We just voted unanimously ...
4,3,1117,3_creating_hamill_correct_overthrowing,"[creating, hamill, correct, overthrowing, resi...",[RT @Resist_MAGA_GOP: Mark Hamill is correct. ...
5,4,1106,4_deploy_onthis_defense_sec,"[deploy, onthis, defense, sec, pathetic, actin...",[RT @sandibachom: IS THIS THING ON???!!This is...
6,5,1060,5_goaded_author_summoned_excuse,"[goaded, author, summoned, excuse, rioters, at...",[RT @AdamKinzinger: Trump is the author of the...
7,6,949,6_rejection_loss_break_couldnt,"[rejection, loss, break, couldnt, accept, trie...",[RT @AdamKinzinger: When he couldn't accept hi...


In [53]:
bertopic_info.shape

(381, 5)

# **KeyBERT-Inspired Model**
- Reduces the appearance of stop words, which often improves the topic representation:
- https://maartengr.github.io/BERTopic/api/representation/keybert.html
- https://maartengr.github.io/BERTopic/getting_started/representation/representation.html
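To give a feel for why this pushes stop words down, here is a toy sketch of the general idea: rank candidate words by how close their embedding is to an embedding of the topic. This is a simplified illustration under stated assumptions, not the library's implementation; the three-dimensional vectors and the word list are made up, and real embeddings come from a sentence-transformer model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embedding summarizing a topic's representative documents
topic_embedding = [0.9, 0.1, 0.0]

# Hypothetical embeddings for candidate words of that topic
candidates = {
    "mcconnell": [0.8, 0.2, 0.1],
    "both": [0.1, 0.9, 0.3],
    "senate": [0.7, 0.0, 0.2],
    "they": [0.0, 0.8, 0.6],
}

# Rerank candidates by similarity to the topic embedding, most similar first
ranked = sorted(
    candidates,
    key=lambda word: cosine(candidates[word], topic_embedding),
    reverse=True,
)
print(ranked)
```

In this toy setup the generic words ("both", "they") sit far from the topic vector, so they fall to the bottom of the ranking.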

In [46]:
from bertopic.representation import KeyBERTInspired

# Conceptually: BERTopic is the architectural blueprint for a house, and KeyBERTInspired is a specific feature you can build into a new house
# Build the KeyBERT feature of the house:
representation_model = KeyBERTInspired()

keybert_model = BERTopic(language="english", calculate_probabilities=True, verbose=False, umap_model=umap_model, representation_model=representation_model).fit(tweet_list)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

In [None]:
#pd.set_option('display.max_colwidth', None)

#### KeyBERT Topic Representations

In [47]:
keybert_info = keybert_model.get_topic_info()

In [48]:
keybert_info.head(8)

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,2695,-1_jan_jan6_jan6thhearings_january6thcommittee...,"[jan, jan6, jan6thhearings, january6thcommitte...",[RT @StandForBetter: The Secret Service clearl...
1,0,6006,0_mcconnell_mccarthy_senate_impeached,"[mcconnell, mccarthy, senate, impeached, kenne...",[RT @AdamKinzinger: Mitch McConnell.\nKevin Mc...
2,1,1368,1_trumpscoupplaybook_trump_giuliani_coup,"[trumpscoupplaybook, trump, giuliani, coup, me...",[RT @StandForBetter: 📺 NEW Video:\n\nTrump Los...
3,2,1176,2_subpoena_testify_oath_unanimously,"[subpoena, testify, oath, unanimously, ja, tru...",[RT @AdamKinzinger: We just voted unanimously ...
4,3,1117,3_democracy_january6thcomm_overthrowing_hamill,"[democracy, january6thcomm, overthrowing, hami...",[RT @Resist_MAGA_GOP: Mark Hamill is correct. ...
5,4,1106,4_miller_defense_deploy_sec,"[miller, defense, deploy, sec, national, pathe...",[RT @sandibachom: IS THIS THING ON???!!This is...
6,5,1060,5_trumpisanationalsecurityrisk_rioters_riot_at...,"[trumpisanationalsecurityrisk, rioters, riot, ...",[RT @AdamKinzinger: Trump is the author of the...
7,6,949,6_trump_democracy_loss_rejection,"[trump, democracy, loss, rejection, american, ...",[RT @AdamKinzinger: When he couldn't accept hi...


In [49]:
keybert_info.shape

(381, 5)

## Visualize Terms

In [54]:
topic_model.visualize_barchart(top_n_topics=7)

In [55]:
keybert_model.visualize_barchart(top_n_topics=7)

Visualizing documents shows a two-dimensional projection of the embeddings.