In this notebook I'm going to test BERTopic for topic modeling. It seems to have significant advantages over LDA.

For now I'm just going to follow a tutorial and use its dataset.

In [1]:
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic
import pandas as pd
import pickle



In [2]:
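# BERTopic uses sentence embeddings in its process, so we want to keep the stop words in the documents
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
# Hugely cut down the data because it takes forever otherwise. This is slow but w/e.
# random_state (any fixed seed works) keeps the sample stable so the cached embeddings below stay aligned with these documents
data = pd.Series(docs['data']).sample(5000, random_state=42).to_list()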
data[:5]

['[">"= Mark, ">>"= mp]\n\n\n\n    I am sorry you find these charges amusing, Mark. I understand your\nfrustration though--it can be kind of scary to find your assumptions\nchallenged. Some of the specific cultural interference to which I refer\nincludes linguistic manipulation, for instance, their Tzotzil-Spanish\ndictionary removed both Spanish and Tzotzil words for concepts which are\nthreatening to the ruling ideology, e.g., class, conquer, exploitation,\nrepression, revolution, and described words which can express\nideological concepts in examples like "Boss--the boss is good. He treats\nus well and pays us a good wage." As some of my students would say,\n"NOT!"  \n     Your tone implies that you are unlikely to believe me--indeed, why\nshould you? If you are interested enough to do some further research\nthough, and you sound as if you are, here are some references for you.\n \nStoll, David. _Fishers of Men or Founders of Empire? The Wycliffe Bible\nTranslators in Latin America_

In [3]:
# Precompute the sentence embeddings once and save them so they don't have to be recomputed on every run
from sentence_transformers import SentenceTransformer
if 0:  # flip to 1 to recompute the embeddings and overwrite the cache file
    embedding_model = SentenceTransformer("all-MiniLM-L6-v2") # fast but pretty accurate embeddings
    embeddings = embedding_model.encode(data, show_progress_bar=True)

    with open("embeddings.pkl", "wb") as fOut:
        pickle.dump({'embeddings': embeddings}, fOut)



In [10]:
# Load embeddings from file
embeddings = None
if 1:
    with open("embeddings.pkl", "rb") as fIn:
        cache_data = pickle.load(fIn)
        embeddings = cache_data['embeddings']
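
The `if 0:` / `if 1:` toggles above work, but nothing checks that the cached embeddings still belong to the current `data`. If the sampled documents ever change, the rows in `embeddings.pkl` silently stop lining up with them. Here is a minimal sketch of one way to guard against that, storing a hash of the documents next to the embeddings. The helper names `docs_fingerprint` and `load_or_compute_embeddings` are my own, not part of BERTopic or sentence-transformers, and `encode` is assumed to be any callable that maps a list of strings to embeddings.

```python
import hashlib
import os
import pickle


def docs_fingerprint(docs):
    """Return a SHA-256 digest that changes when the documents or their order change."""
    h = hashlib.sha256()
    for doc in docs:
        h.update(doc.encode("utf-8"))
        h.update(b"\x00")  # separator so ["ab", "c"] and ["a", "bc"] hash differently
    return h.hexdigest()


def load_or_compute_embeddings(docs, encode, path="embeddings.pkl"):
    """Reuse cached embeddings only if they were computed for exactly these docs."""
    fingerprint = docs_fingerprint(docs)
    if os.path.exists(path):
        with open(path, "rb") as f_in:
            cache = pickle.load(f_in)
        # Old cache files without a fingerprint fall through and get recomputed
        if cache.get("fingerprint") == fingerprint:
            return cache["embeddings"]
    embeddings = encode(docs)
    with open(path, "wb") as f_out:
        pickle.dump({"fingerprint": fingerprint, "embeddings": embeddings}, f_out)
    return embeddings
```

In this notebook the call would look something like `embeddings = load_or_compute_embeddings(data, lambda d: embedding_model.encode(d, show_progress_bar=True))`, with `embedding_model` created as in the cell above.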

In [11]:
# Default representation. Next I want to try another representation that has fewer stop words
topicModel = BERTopic(verbose=True, embedding_model="all-MiniLM-L6-v2")
topics, probs = topicModel.fit_transform(data, embeddings)

2023-10-20 17:17:22,980 - BERTopic - Reduced dimensionality
2023-10-20 17:17:23,159 - BERTopic - Clustered reduced embeddings


In [19]:
# Possibly a better representation: KeyBERTInspired
from bertopic.representation import KeyBERTInspired

topicModel = BERTopic(verbose=True, embedding_model="all-MiniLM-L6-v2", representation_model=KeyBERTInspired())
topics, probs = topicModel.fit_transform(data, embeddings)

2023-10-20 17:28:48,053 - BERTopic - Reduced dimensionality
2023-10-20 17:28:48,219 - BERTopic - Clustered reduced embeddings


In [20]:
topicModel.save("keyBertModel", serialization='safetensors')

In [21]:
topicModel.get_topic_info()

,Topic,Count,Name,Representation,Representative_Docs
0,-1,1486,-1_new_other_get_your,"[new, other, get, your, about, all, on, use, m...",[Archive-name: x-faq/speedups\nLast-modified: ...
1,0,495,0_god_be_about_just,"[god, be, about, just, than, think, have, you,...","[Hi everyone,\n\nI'm trying to find my way to ..."
2,1,233,1_defamation_government_anti_information,"[defamation, government, anti, information, wh...","[\n\nI know it doesn't make sense, but since w..."
3,2,217,2_government_what_have_just,"[government, what, have, just, think, are, bei...",[\n There are a couple of ways to look a...
4,3,202,3_energy_said_have_information,"[energy, said, have, information, ax, what, 0q...",[Brian Kendig first states:\n\n\nI ask:\n\n\nB...
5,4,191,4_more_we_drive_get,"[more, we, drive, get, 16550, space, of, me, u...","[Folks,\n\nIt's time to start building some pr..."
6,5,169,5_arab_israel_be_bxn,"[arab, israel, be, bxn, what, they, pl, qq, th...",[\n\nIf you look at the bottom of this article...
7,6,156,6_c__chz_as_k8,"[c_, chz, as, k8, on, x7, qs, of, are, c8]",[\nBecause much of the public aren't even awar...
8,7,139,7_publication_display_x11r5_xterm,"[publication, display, x11r5, xterm, about, wh...",[\nAt issue was not a trial behind closed door...
9,8,115,8_azerbaijan_god_slave_are,"[azerbaijan, god, slave, are, be, about, to, t...",[The ONLY unity I've found which is true is wh...


In [23]:
topicModel.get_topic(1)

[('defamation', 0.29812446),
 ('government', 0.1615123),
 ('anti', 0.15016481),
 ('information', 0.12503073),
 ('what', 0.11455843),
 ('who', 0.10659747),
 ('san', 0.09380421),
 ('which', 0.09249413),
 ('just', 0.09124923),
 ('about', 0.08959746)]
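
`get_topic` returns a plain list of `(word, score)` tuples, so it is easy to check how much of a topic is filler. A rough sketch using the values printed above; the `FILLER` set is hand-picked by me for this one topic and is not a real stop-word list.

```python
# Top words for topic 1, copied from the get_topic(1) output above
topic_words = [
    ('defamation', 0.29812446),
    ('government', 0.1615123),
    ('anti', 0.15016481),
    ('information', 0.12503073),
    ('what', 0.11455843),
    ('who', 0.10659747),
    ('san', 0.09380421),
    ('which', 0.09249413),
    ('just', 0.09124923),
    ('about', 0.08959746),
]

# Illustrative, hand-picked filler words (an assumption, not a complete stop-word list)
FILLER = {"what", "who", "which", "just", "about"}

# Keep only the words that carry some meaning for the topic
content_words = [(word, score) for word, score in topic_words if word not in FILLER]
filler_share = 1 - len(content_words) / len(topic_words)

print([word for word, _ in content_words])
print(f"{filler_share:.0%} of the top words are filler")
```

So even with KeyBERTInspired, half of the top ten words for this topic are ones I would want filtered out.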

In [24]:
topicModel.visualize_topics()