<a id="top"></a>
# **Topic Modeling: BERTopic Models**

***

## Learning Goals
After reviewing this notebook, you will be familiar with multiple aspects of BERTopic, a commonly used topic modeling tool.

```
By the end of this tutorial, you will:

- Train a BERTopic model
- Generate topic model visualizations
- Generate custom embeddings and vectorizer models

```

## **Word Embeddings**
Word embeddings rely on the distributional hypothesis: the idea that the words surrounding a word in context are clues to its meaning. Using these neighboring words, the computer generates dense vectors and so 'learns' the meaning of the texts. Three properties of word embeddings are that they encode similarity, generalize automatically, and provide a measure of meaning. Related ideas have been applied across many text models, including latent Dirichlet allocation [LDA (Blei et al., 2001)](https://proceedings.neurips.cc/paper_files/paper/2001/file/296472c9542ad4d4788d543508116cbc-Paper.pdf), Bidirectional Encoder Representations from Transformers [BERT (Devlin et al.)](https://arxiv.org/abs/1810.04805), and Global Vectors for Word Representation [GloVe (Pennington et al.)](https://nlp.stanford.edu/pubs/glove.pdf).
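
As a rough illustration of the distributional hypothesis, the sketch below builds count vectors from a small, invented corpus and compares words with cosine similarity. The corpus, the window size, and the helper names `cooccurrence_vectors` and `cosine` are assumptions made for this example; real embedding models learn dense vectors rather than using raw counts.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus, invented for illustration only.
toy_corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the dog",
    "the senator passes the bill",
    "the senator signs the bill",
]

def cooccurrence_vectors(sentences, window=2):
    """For each word, count the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

vectors = cooccurrence_vectors(toy_corpus)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_bill = cosine(vectors["cat"], vectors["bill"])

# Words that share contexts ("cat", "dog") score higher than words that do not.
print(f"cat vs dog:  {sim_cat_dog:.3f}")
print(f"cat vs bill: {sim_cat_bill:.3f}")
```

In this toy setting, "cat" and "dog" share more neighboring words than "cat" and "bill", so their similarity is higher.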

As these models have matured in research, there have been more and more ways to integrate, analyze, and validate model outputs. A high priority is placed on how models should be interpreted and generalized.

## **Why BERTopic?**

BERTopic, developed by [Grootendorst](https://arxiv.org/abs/2203.05794), is a "topic modeling technique that leverages HuggingFace transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics" ([documentation](https://maartengr.github.io/BERTopic/index.html)). BERTopic is popular, with about 1,833 citing articles since the paper was posted to arXiv in 2022.

Additionally, BERTopic is a low-barrier topic modeling tool. It is entirely open source, can be customized to language needs, and can be run without (much) fine-tuning. BERTopic can provide a new way to read texts from a distance, as it provides the user with representative documents and a hierarchical organization of topics. In comparison to LDA, BERTopic automatically detects potential themes through clustering, instead of requiring the user to pass in a number of topics up front.
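
To make the c-TF-IDF step mentioned above more concrete, here is a minimal sketch of a class-based TF-IDF score. It assumes the weighting described in the BERTopic paper, where a term's frequency within a cluster is multiplied by log(1 + A / f_t), with A the average number of words per cluster and f_t the term's frequency across all clusters. The clusters and the helper `class_tf_idf` are invented for illustration, and this is a simplification rather than the library's implementation.

```python
import math
from collections import Counter

# Hypothetical clusters of short documents, invented for illustration.
clusters = {
    0: ["gun law police", "police gun crime"],
    1: ["tax income money", "income tax rate law"],
}

def class_tf_idf(clusters):
    """Score each term per cluster: tf_in_cluster * log(1 + avg_words / tf_overall)."""
    # Treat all documents in a cluster as one long document.
    class_counts = {c: Counter(" ".join(docs).split()) for c, docs in clusters.items()}

    # Term frequencies across every cluster.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)

    # Average number of words per cluster.
    avg_words = sum(total_counts.values()) / len(class_counts)

    scores = {}
    for c, counts in class_counts.items():
        scores[c] = {
            term: tf * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
    return scores

scores = class_tf_idf(clusters)
top_words = {c: sorted(s, key=s.get, reverse=True)[:3] for c, s in scores.items()}
print(top_words)
```

The word "law" appears in both clusters, so it is down-weighted and drops out of the top words for each one, while cluster-specific words such as "gun" or "tax" rise to the top.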

In this notebook, I will provide an overview of how to instantiate a BERTopic model and different ways to customize it.


## **1. Environment Set Up**

The libraries used in this tutorial are:

- *datasets* to import data from HuggingFace
- *BERTopic* to perform topic modeling on the data
- *pandas* to structure data
- *NLTK*, the natural language toolkit, to clean and preprocess the texts
- *sklearn.feature_extraction.text* to develop custom vectorizer models
- *sentence_transformers* to load embedding models
- *Plotly* to wrap around the existing graph objects generated by BERTopic and manipulate them

In [None]:
pip install datasets



In [None]:
pip install BERTopic



In [None]:
''' DATA QUERYING '''
from datasets import load_dataset
import pandas as pd

''' TOPIC MODELING '''
from bertopic import BERTopic
import nltk
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')
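## Use a separate name so the imported `stopwords` module is not overwritten when the cell is re-run
stop_words = set(stopwords.words('english'))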

''' CUSTOM MODELS '''
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sentence_transformers import SentenceTransformer

''' DATA VISUALIZATION '''

from plotly.subplots import make_subplots
import plotly.graph_objects as go

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### **1.2 Data Import**

Let's use a dataset posted by Google on HuggingFace. The dataset is named "Civil Comments" and was originally used to identify bias in machine learning outputs. Further documentation of the dataset can be found on [HuggingFace](https://huggingface.co/datasets/google/civil_comments) or in the paper [Nuanced Metrics for Measuring Unintended Bias with Real Data for Text Classification](https://arxiv.org/abs/1903.04561). The original dataset is quite large, so we will only be using a small section of it.

Running a topic model on a sample of this dataset can provide further insight into what kinds of text Google is using to train its AI models.

In [None]:
  ## First, let's use Hugging Face's load_dataset function. `split` is a parameter of load_dataset that tells it to load only that partition of the data.
ds = load_dataset('google/civil_comments', split='test')

In [None]:
  ## The dataset has its own type, a Hugging Face Dataset object, which behaves much like a table
type(ds)

In [None]:
  ## We can put the dataset into a dataframe. This allows for easy parsing and data construction.
df = pd.DataFrame(ds)

In [None]:
  ## To look at the top of the data we can use the 'head' method.
df.head()

  ## From this, we can see the kinds of metrics Google is using in its training data.

Unnamed: 0,text,toxicity,severe_toxicity,obscene,threat,insult,identity_attack,sexual_explicit
0,[ Integrity means that you pay your debts.]\n\...,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,This is malfeasance by the Administrator and t...,0.1,0.0,0.0,0.0,0.1,0.0,0.0
2,@Rmiller101 - Spoken like a true elitist. But ...,0.3,0.0,0.0,0.0,0.2,0.0,0.0
3,"Paul: Thank you for your kind words. I do, in...",0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Sorry you missed high school. Eisenhower sent ...,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
''' CREATING A CORPUS '''

  ## Let's take the full dataframe of the split and cut it down to a smaller section. Typically, the models will perform best and generate the most robust topics if there are more than 1000 documents.
docs_full = df['text'].to_list()
docs = docs_full[0:1500]
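
Slicing takes the first 1,500 comments in whatever order the dataset stores them. As an alternative, the sketch below draws a reproducible random sample and drops empty strings first. The helper `sample_corpus`, the seed value, and the stand-in list `fake_docs_full` are assumptions for illustration; in the notebook, `docs_full` would be passed in instead.

```python
import random

def sample_corpus(texts, n_docs=1500, seed=42):
    """Return up to `n_docs` non-empty texts, drawn reproducibly at random."""
    cleaned = [t.strip() for t in texts if isinstance(t, str) and t.strip()]
    if len(cleaned) <= n_docs:
        return cleaned
    # A dedicated Random instance keeps the sample stable across re-runs.
    return random.Random(seed).sample(cleaned, n_docs)

# Stand-in for df['text'].to_list(); invented so the example runs on its own.
fake_docs_full = [f"comment number {i}" for i in range(5000)] + ["", "   "]

docs_sample = sample_corpus(fake_docs_full, n_docs=1500)
print(len(docs_sample))
```

A fixed seed means the same documents are selected each time, which makes topic results easier to compare between runs.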

## **2. Training the basic model**

First, the model will be trained without specifying an embedding model; later, it will be trained on embeddings computed explicitly for the corpus.

To create a BERTopic model without any refinement, you simply call `BERTopic` to create a 'topic model'. This object will be called upon later to further examine and interact with the generated topics.

In [None]:
''' SETTING UP THE MODEL '''
  ## Instantiating a vectorizer model with an n-gram range of 1 to 2 and applying English stopwords
  ## Applying stopwords at this stage removes them from the topic representations
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english")

  ## Instantiate topic model by calling BERTopic
topic_model = BERTopic(vectorizer_model=vectorizer_model, verbose=True)
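
To show what an n-gram range of 1 to 2 with stopword removal produces, here is a hand-rolled approximation in plain Python. It is not scikit-learn's code: the helper `toy_ngrams` and the tiny stopword set are assumptions for illustration, and the built-in English list used by `CountVectorizer` is much longer.

```python
# A tiny, hypothetical stopword list used only for this example.
TOY_STOPWORDS = {"the", "is", "a", "of", "to", "and", "in"}

def toy_ngrams(text, ngram_range=(1, 2), stop_words=TOY_STOPWORDS):
    """Lowercase, drop stopwords, then build word n-grams from what remains."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

features = toy_ngrams("The Catholic Church is in the news")
print(features)
```

Because stopwords are dropped before the n-grams are formed, two-word terms such as "catholic church" can appear in a topic representation alongside single words.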

In [None]:
''' CALLING THE MODEL '''

  ## Training the model with the documents
topics, probs = topic_model.fit_transform(docs)

2024-10-22 22:45:04,370 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/47 [00:00<?, ?it/s]

2024-10-22 22:46:45,487 - BERTopic - Embedding - Completed ✓
2024-10-22 22:46:45,489 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-10-22 22:46:55,318 - BERTopic - Dimensionality - Completed ✓
2024-10-22 22:46:55,320 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-10-22 22:46:55,397 - BERTopic - Cluster - Completed ✓
2024-10-22 22:46:55,402 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-10-22 22:46:55,703 - BERTopic - Representation - Completed ✓


### **2.1 Basic Topic Analysis and Visualization**

`get_topic_info` shows information about each topic and can later provide information that feeds into the visualizations. It provides an overview of all of the topics generated by the model.

- **model.get_topic_info**: shows the topic number, the count of documents assigned to the topic, and the name of the topic. Topic -1 collects outliers and should "typically be ignored". The name of the topic is its number followed by its top four words.
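
The naming pattern and the outlier topic can be sketched with plain Python. The rows below copy a few values from the base-model table in this notebook (with representations truncated to five words); the helper `topic_name` is a hypothetical name used only here, not a BERTopic function.

```python
# Rows shaped like get_topic_info() output; values copied from the base-model table.
topic_rows = [
    {"Topic": -1, "Count": 700, "Representation": ["people", "like", "dont", "just", "trump"]},
    {"Topic": 0, "Count": 87, "Representation": ["gun", "guns", "police", "law", "crime"]},
    {"Topic": 1, "Count": 69, "Representation": ["trump", "president", "trumps", "obama", "hes"]},
]

def topic_name(row, n_words=4):
    """Join the topic number and its first `n_words` words with underscores."""
    return "_".join([str(row["Topic"])] + row["Representation"][:n_words])

# Skip the outlier topic (-1) when listing interpretable topics.
named_topics = [topic_name(r) for r in topic_rows if r["Topic"] != -1]

# Share of the 1,500 documents that were left unassigned as outliers.
n_docs = 1500
outlier_count = next(r["Count"] for r in topic_rows if r["Topic"] == -1)
outlier_share = outlier_count / n_docs

print(named_topics)
print(f"{outlier_share:.1%} of documents are outliers")
```

Checking the outlier share is a quick way to see how much of the corpus the topics actually describe.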

The two visualizations used in this tutorial are the intertopic distance map and the hierarchy map.

The intertopic distance map is a visualization of the topics in a two-dimensional space. It can be interpreted as something like a shadow of the topics, as if you were looking down at them from a multi-dimensional space. In a paper by [Tetteh and Mg](https://www.researchgate.net/figure/BERTopic-Intertopic-Distance-Map_fig5_376672724), the intertopic distance map is used to “show the relationship between the generated topic clusters.” Additionally, the intertopic distance map can provide some understanding for interpreting the hierarchical document tree.

A hierarchical document tree demonstrates the hierarchy within the topics that have been generated. The documentation can be found [here](https://maartengr.github.io/BERTopic/getting_started/visualization/visualize_hierarchy.html), and it provides more information about the creation of the visuals themselves. The document tree visually represents how clusters relate to one another, and it can inform our selection of topics if we would like to reduce them.
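
The idea behind a topic hierarchy can be sketched as repeated merging of the two most similar groups. The code below is a simplified stand-in, not BERTopic's implementation: it merges clusters by the cosine distance between their centroids, and the topic vectors and the helper `merge_order` are invented for illustration.

```python
import math

# Hypothetical topic vectors (for example, word weights per topic), invented here.
topic_vectors = {
    "gun_police": [0.9, 0.1, 0.0],
    "crime_law": [0.8, 0.2, 0.1],
    "tax_income": [0.0, 0.9, 0.3],
    "housing_prices": [0.1, 0.7, 0.6],
}

def cosine_distance(a, b):
    """1 minus the cosine similarity of two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def centroid(vectors):
    """Element-wise mean of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def merge_order(vectors):
    """Greedily merge the two closest clusters until only one remains."""
    clusters = {(name,): [vec] for name, vec in vectors.items()}
    merges = []
    while len(clusters) > 1:
        keys = list(clusters)
        best = None
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                d = cosine_distance(centroid(clusters[keys[i]]), centroid(clusters[keys[j]]))
                if best is None or d < best[0]:
                    best = (d, keys[i], keys[j])
        distance, left, right = best
        clusters[left + right] = clusters.pop(left) + clusters.pop(right)
        merges.append((left, right, round(distance, 3)))
    return merges

merges = merge_order(topic_vectors)
for left, right, distance in merges:
    print(left, "+", right, "at distance", distance)
```

Topics that merge early (at small distances) are the natural candidates to combine if the number of topics needs to be reduced.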


In [None]:
topic_info = topic_model.get_topic_info()
topic_info

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,700,-1_people_like_dont_just,"[people, like, dont, just, trump, think, does,...","[Nineteen Eighty-Four George Orwell ,read it\..."
1,0,87,0_gun_guns_police_law,"[gun, guns, police, law, crime, constitution, ...","[[Actually, homicide is defined as ""the delibe..."
2,1,69,1_trump_president_trumps_obama,"[trump, president, trumps, obama, hes, way, li...","[In 2009, Trump the private citizen irked FOX ..."
3,2,60,2_canada_trudeau_canadians_canadian,"[canada, trudeau, canadians, canadian, justin,...","[To suggest that ""Canada was punitive, mean an..."
4,3,51,3_catholic_church_jesus_think,"[catholic, church, jesus, think, catholic chur...","[Well, let's see: what party was the president..."
5,4,50,4_hillary_credibility_clinton_mayor,"[hillary, credibility, clinton, mayor, lost, p...",[All I am concerned about is her deceit. We ca...
6,5,42,5_tax_income_taxes_money,"[tax, income, taxes, money, cash, middle, rate...",[The fact of the matter is that you do not hav...
7,6,41,6_comments_comment_post_commenting,"[comments, comment, post, commenting, globe, g...","[Not sure if my feeling is correct about this,..."
8,7,37,7_war_syria_isis_world,"[war, syria, isis, world, vietnam, time, hitle...","[If Qatar is a puppet of Iran, why does it fun..."
9,8,33,8_team_lynch_players_ball,"[team, lynch, players, ball, coaching, siemian...",[I've seen enough of Paxton Lynch. I never wan...


In [None]:
''' TOPIC VISUALIZATION '''
topic_idm = topic_model.visualize_topics()
topic_idm

In [None]:
''' HIERARCHICAL TOPICS '''
hierarchical_topics = topic_model.hierarchical_topics(docs)
base_hierarchical = topic_model.visualize_hierarchical_documents(docs, hierarchical_topics)

100%|██████████| 25/25 [00:00<00:00, 191.66it/s]



In [None]:
base_hierarchical

## **3. Embedding Test: Sentence Transformers**

Next, let's build a model from embeddings that we compute ourselves for this corpus. We load the `all-MiniLM-L6-v2` sentence transformer, encode our documents with it, and pass the resulting embeddings to BERTopic. `all-MiniLM-L6-v2` is also the model BERTopic uses by default for English text, so the difference in this section is that the embeddings are computed explicitly and up front, and that no stopword-removing vectorizer is applied. The sentence transformer itself is pretrained and is not retrained here; only the downstream dimensionality reduction, clustering, and topic representation are fit to the documents we are currently working with.

Documentation for the model's creation can be found linked at [this HuggingFace Repository](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).

In [None]:
  ## Let's try out a different embedding. We can use any model provided in HuggingFace to do so.
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=True)

# Train our topic model using our pre-computed sentence-transformers embeddings
topic_model_cust = BERTopic(verbose=True, embedding_model=sentence_model)
topics, probs = topic_model_cust.fit_transform(docs, embeddings)


Batches:   0%|          | 0/47 [00:00<?, ?it/s]

2024-10-22 22:48:13,419 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-10-22 22:48:22,317 - BERTopic - Dimensionality - Completed ✓
2024-10-22 22:48:22,323 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-10-22 22:48:22,441 - BERTopic - Cluster - Completed ✓
2024-10-22 22:48:22,450 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-10-22 22:48:22,640 - BERTopic - Representation - Completed ✓


In [None]:
custom_topic_info = topic_model_cust.get_topic_info()
custom_topic_info

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,681,-1_the_to_of_and,"[the, to, of, and, is, in, that, it, you, are]","[Dang, some of you folks are so out of touch w..."
1,0,96,0_the_to_in_it,"[the, to, in, it, and, of, is, you, was, gun]",[The officer had called me earlier to tell me ...
2,1,89,1_trump_he_his_the,"[trump, he, his, the, and, president, to, of, ...",[Trump is comically thin-skinned and the press...
3,2,65,2_the_catholic_and_of,"[the, catholic, and, of, to, church, you, that...","[It's been there decades, Doug and no it's not..."
4,3,60,3_comments_you_my_to,"[comments, you, my, to, comment, your, post, i...","[Just reviewed the Globe's ""Community Guidelin..."
5,4,56,4_canada_trudeau_the_in,"[canada, trudeau, the, in, to, canadian, is, a...",[Larry enough! Why aren't you exposing the dis...
6,5,48,5_she_her_hillary_the,"[she, her, hillary, the, of, for, to, and, was...",[Trump's personal life is a matter of PUBLIC r...
7,6,35,6_the_of_to_and,"[the, of, to, and, in, that, war, it, is, syria]","[I agree with Gen. Dallaire, but there are som..."
8,7,33,7_the_he_lynch_and,"[the, he, lynch, and, to, team, was, last, pla...",[I've seen enough of Paxton Lynch. I never wan...
9,8,32,8_in_the_to_door,"[in, the, to, door, housing, buyers, is, for, ...","[Hi Shipboard,\n\nYou new to these boards? \n\..."


In [None]:
topic_model_cust.visualize_topics()

**These topics have a lot of stopwords in them. We can remove them after the model is trained.**

Stopwords should be left in during training because they semantically support the embedding model; they can then be removed afterwards to provide a clearer representation of each topic. This is part of the data cleaning process, and which stopwords are removed or left in should be considered and documented. Applying these methods without care can result in the erasure of the embodied experiences the data are rooted in.
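
The effect of removing stopwords only at the description stage can be sketched without any topic modeling library. In the example below the cluster membership never changes; only the words used to describe the cluster do. The documents, the small stopword set, and the helper `top_words` are assumptions for illustration, and simple counts stand in for c-TF-IDF scores.

```python
from collections import Counter

# A tiny, hypothetical stopword list used only for this example.
TOY_STOPWORDS = {"the", "and", "is", "in", "of"}

# Documents already assigned to one cluster (invented for illustration).
cluster_docs = [
    "the police and the gun laws",
    "the gun is in the car",
    "police and the law of the land",
]

def top_words(docs, n=4, stop_words=frozenset()):
    """Most frequent words in a cluster, optionally skipping stopwords."""
    counts = Counter(
        word
        for doc in docs
        for word in doc.lower().split()
        if word not in stop_words
    )
    return [word for word, _ in counts.most_common(n)]

before = top_words(cluster_docs)
after = top_words(cluster_docs, stop_words=TOY_STOPWORDS)

print("with stopwords:   ", before)
print("without stopwords:", after)
```

The same three documents are described first by function words and then by content words, which mirrors what happens when a vectorizer with stopwords is applied to an already trained model.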

In [None]:
  ## Let's update the topic representations by adding our CountVectorizer in
vectorizer = CountVectorizer(stop_words="english")
topic_model_cust.update_topics(docs, vectorizer_model=vectorizer)

In [None]:
  ## Are the new topics that were generated more clear?
new_custom_topic_info = topic_model_cust.get_topic_info()
new_custom_topic_info

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,681,-1_people_like_don_just,"[people, like, don, just, trump, time, governm...","[Dang, some of you folks are so out of touch w..."
1,0,96,0_gun_guns_police_law,"[gun, guns, police, law, constitution, crime, ...",[The officer had called me earlier to tell me ...
2,1,89,1_trump_president_obama_like,"[trump, president, obama, like, press, way, am...",[Trump is comically thin-skinned and the press...
3,2,65,2_catholic_church_jesus_think,"[catholic, church, jesus, think, catholics, kn...","[It's been there decades, Doug and no it's not..."
4,3,60,3_comments_comment_post_globe,"[comments, comment, post, globe, giving, comme...","[Just reviewed the Globe's ""Community Guidelin..."
5,4,56,4_canada_trudeau_canadian_canadians,"[canada, trudeau, canadian, canadians, justin,...",[Larry enough! Why aren't you exposing the dis...
6,5,48,5_hillary_mayor_clinton_credibility,"[hillary, mayor, clinton, credibility, lost, t...",[Trump's personal life is a matter of PUBLIC r...
7,6,35,6_war_syria_isis_vietnam,"[war, syria, isis, vietnam, world, hitler, cou...","[I agree with Gen. Dallaire, but there are som..."
8,7,33,7_lynch_team_players_siemian,"[lynch, team, players, siemian, ball, coach, c...",[I've seen enough of Paxton Lynch. I never wan...
9,8,32,8_door_buyers_housing_prices,"[door, buyers, housing, prices, income, lower,...","[Hi Shipboard,\n\nYou new to these boards? \n\..."


In [None]:
custom_idm = topic_model_cust.visualize_topics()
custom_idm

In [None]:
''' HIERARCHICAL TOPICS '''
custom_hts = topic_model_cust.hierarchical_topics(docs)
  ## Reuse the embeddings computed earlier so the documents are not embedded a second time
custom_hierarchical = topic_model_cust.visualize_hierarchical_documents(docs, custom_hts, embeddings=embeddings)

In [None]:
custom_hierarchical

## **4. Embedding Tests: BAAI/bge-base**

Now, let's test using an 'off the shelf' model. We can use models that were trained for particular tasks if they align with your data. For example, if your corpus contains both Spanish and English, a multilingual sentence transformer will likely generate better topics for your data. There are many examples of such 'specialized' models on HuggingFace.

Using an off-the-shelf model can both reduce computational load and serve as a low-barrier route to more customized and accurate results.

The off-the-shelf model we will test is ["Flag Embeddings"](https://huggingface.co/BAAI/bge-base-en-v1.5), published by [BAAI](https://www.baai.ac.cn/english.html), the Beijing Academy of Artificial Intelligence. This model is the base for their other multilingual models. It is a relatively lightweight model, so it is comparatively quick to run. In some cases, these models can have millions of parameters, and using them is computationally expensive.

In [None]:
## Let's use the sentence transformer to call the embedding model
embedding_model = SentenceTransformer("BAAI/bge-base-en-v1.5")

## And now instantiate a new model, with our stopwords vectorizer as well.
transformed_model = BERTopic(embedding_model=embedding_model, verbose=True,vectorizer_model=vectorizer)


In [None]:
topic, probs = transformed_model.fit_transform(docs)

2024-10-22 22:48:25,834 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/47 [00:00<?, ?it/s]

2024-10-22 22:56:20,296 - BERTopic - Embedding - Completed ✓
2024-10-22 22:56:20,302 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-10-22 22:56:28,275 - BERTopic - Dimensionality - Completed ✓
2024-10-22 22:56:28,277 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-10-22 22:56:28,338 - BERTopic - Cluster - Completed ✓
2024-10-22 22:56:28,344 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-10-22 22:56:28,469 - BERTopic - Representation - Completed ✓


In [None]:
''' This model generated a similar number of topics to the other models, but still has some noise '''
tm_topic_info = transformed_model.get_topic_info()
tm_topic_info

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,688,-1_people_don_just_like,"[people, don, just, like, good, time, state, n...",[Totally agree and go one better. I don't have...
1,0,144,0_trump_president_obama_news,"[trump, president, obama, news, said, press, l...",[Patricia is appealing to Trump stooges.\n\nTr...
2,1,87,1_police_gun_guns_shot,"[police, gun, guns, shot, crime, case, moose, ...","[[Actually, homicide is defined as ""the delibe..."
3,2,83,2_canada_trudeau_canadians_canadian,"[canada, trudeau, canadians, canadian, governm...","[To suggest that ""Canada was punitive, mean an..."
4,3,65,3_catholic_church_think_catholics,"[catholic, church, think, catholics, jesus, go...","[All the best to you. However, show me one rep..."
5,4,48,4_lynch_team_players_game,"[lynch, team, players, game, play, siemian, ba...",[I've seen enough of Paxton Lynch. I never wan...
6,5,41,5_yes_say_lol_humbly,"[yes, say, lol, humbly, great, nicely, played,...","[Bing! Nicely played, sir (or madam). Nicely..."
7,6,41,6_tax_income_taxes_higher,"[tax, income, taxes, higher, cost, dividend, m...",[The 50k isn't retirement income. You'll pay p...
8,7,29,7_syria_war_terrorist_world,"[syria, war, terrorist, world, isis, iran, cou...","[Eli, \nI think the lofty, aspirational concep..."
9,8,27,8_oil_pipeline_carbon_alberta,"[oil, pipeline, carbon, alberta, bc, demand, s...","["" Those opposing the pipeline need to take a ..."


In [None]:
transformed_idm = transformed_model.visualize_topics()
transformed_idm

In [None]:
''' HIERARCHICAL TOPICS '''
transformed_hts = transformed_model.hierarchical_topics(docs)
transformed_hierarchical = transformed_model.visualize_hierarchical_documents(docs, transformed_hts)

In [None]:
transformed_hierarchical

## **5. Interpreting the Output of Topic Models**
(Grimmer et al., 2022)

When interpreting a model, it should be understood that some tuning variables are arbitrary. Even so, a model can provide a strong new way of interpreting the documents that we may not have seen before. In the words of Grimmer et al., "When making [the] assessment, our goal is to assess their ability to credibly organize documents according to a particular organization." It is our role as researchers to decide which method is appropriate and most representative of the dataset.

Ways to support both validity and accurate interpretation include:
*   Reading random samples of documents that belong predominantly to a given cluster
*   Identifying the words that are most salient to the topic
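
The first check in the list above can be sketched as follows: group documents by their assigned topic and draw a small random sample from each for close reading. The helper `sample_docs_per_topic`, the seed, and the toy inputs are assumptions for illustration; in the notebook, the `docs` list and the `topics` list returned by `fit_transform` could be passed in, since they are aligned one to one.

```python
import random
from collections import defaultdict

def sample_docs_per_topic(docs, topics, k=3, seed=0, skip_outliers=True):
    """Return up to `k` randomly chosen documents for each topic id."""
    by_topic = defaultdict(list)
    for doc, topic in zip(docs, topics):
        if skip_outliers and topic == -1:
            continue
        by_topic[topic].append(doc)

    rng = random.Random(seed)  # fixed seed so the reading sample is reproducible
    return {
        topic: rng.sample(members, min(k, len(members)))
        for topic, members in sorted(by_topic.items())
    }

# Toy stand-ins for `docs` and `topics`, invented for illustration.
toy_docs = [
    "gun laws and policing",
    "police shooting case",
    "federal income tax rates",
    "lol nicely played",
    "taxes on dividends",
    "constitution and gun rights",
    "middle class tax cuts",
]
toy_topics = [0, 0, 1, -1, 1, 0, 1]

samples = sample_docs_per_topic(toy_docs, toy_topics, k=2)
for topic, members in samples.items():
    print(topic, members)
```

Reading a random sample, rather than only the representative documents, gives a less flattering but more honest picture of what each cluster contains.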

First, let's look at the topic maps. These can tell us which words are most salient to a topic, as well as how the model has situated the topics relative to one another.

Next, we will compare the hierarchy of the topics. This will further show how the salient words are situated together.

Last, we will extract a few representative documents from a topic that appears in all three models.

In [None]:
''' GENERATING IDM COMPARISON '''

## Adding the plots to the plotting space. This will create a 'canvas' that can place the plots next to each other
fig = make_subplots(
    rows=1, cols=3,
    shared_xaxes=True,
    vertical_spacing=0.02,subplot_titles=["Base Model","Custom Embeddings","'Off The Shelf' Embeddings"]
    )

## Generating the 'trace' data to plot the graphs in the canvas.
for i in topic_idm.data :
    fig.add_trace(i, row=1, col=1)

for i in custom_idm.data :
    fig.add_trace(i, row=1, col=2)

for i in transformed_idm.data :
    fig.add_trace(i, row=1, col=3)


## Plotting the maps side by side
fig.update_layout(height=716, width=2100, title_text="Word Embedding Comparison")
fig.show()

In [None]:
''' GENERATING HIERARCHICAL COMPARISON '''

## Adding the plots to the plotting space. This will create a 'canvas' that can place the plots next to each other
fig = make_subplots(
    rows=1, cols=3,
    shared_xaxes=False,
    vertical_spacing=0.02,subplot_titles=["Base Model","Custom Embeddings","'Off The Shelf' Embeddings"]
    )

## Generating the 'trace' data to plot the graphs in the canvas.
for i in base_hierarchical.data :
    fig.add_trace(i, row=1, col=1)

for i in custom_hierarchical.data :
    fig.add_trace(i, row=1, col=2)

for i in transformed_hierarchical.data :
    fig.add_trace(i, row=1, col=3)

## Plotting the maps side by side
fig.update_layout(height=716, width=2100, title_text="Word Embedding Comparison: Hierarchical Topics")
fig.show()

In [None]:
''' EXAMINING REPRESENTATIVE DOCUMENTS: trump, president '''

  ## Using the get_representative_docs method.
  ## The topic numbers are the ids BERTopic assigned to the 'trump, president' topic in each
  ## model's get_topic_info() table above. Ids change between runs, so check the tables first.
base_docs = topic_model.get_representative_docs(1)
custom_docs = topic_model_cust.get_representative_docs(1)
ots_docs = transformed_model.get_representative_docs(0)
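
Topic ids are assigned per run, so hard-coding them is fragile. One way to look up the matching topic in each model is to search the topic representations for shared keywords, as in the sketch below. The helper `find_topics_by_keywords` is a hypothetical name, and the dictionary copies a few truncated representations from the base-model table in this notebook; in practice it could be built from the `Topic` and `Representation` columns of `get_topic_info()`.

```python
def find_topics_by_keywords(topic_words, keywords, min_hits=1):
    """Return topic ids whose words overlap `keywords`, best match first."""
    wanted = {k.lower() for k in keywords}
    hits = []
    for topic_id, words in topic_words.items():
        if topic_id == -1:  # skip the outlier topic
            continue
        overlap = len(wanted & {w.lower() for w in words})
        if overlap >= min_hits:
            hits.append((overlap, topic_id))
    # Sort by number of matching keywords (descending), then by topic id.
    return [topic_id for _, topic_id in sorted(hits, key=lambda h: (-h[0], h[1]))]

# Truncated representations copied from the base-model table above.
base_topic_words = {
    -1: ["people", "like", "dont", "just", "trump"],
    0: ["gun", "guns", "police", "law", "crime"],
    1: ["trump", "president", "trumps", "obama", "hes"],
    2: ["canada", "trudeau", "canadians", "canadian", "justin"],
}

matches = find_topics_by_keywords(base_topic_words, ["trump", "president"])
print(matches)
```

The first id returned is the strongest candidate to pass to `get_representative_docs`, though it is still worth confirming against the topic table by eye.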

In [None]:
## Let's iterate through the representative documents:

print("BASE MODEL -----------------------------------------\n")
for representative in base_docs:
    print(representative)
    print("\n")

print("CUSTOM MODEL -----------------------------------------\n")
for representative in custom_docs:
    print(representative)
    print("\n")

print("OFF THE SHELF MODEL -----------------------------------------\n")
for representative in ots_docs:
    print(representative)
    print("\n")

BASE MODEL -----------------------------------------

To suggest that "Canada was punitive, mean and vindictive" is an extreme form of self-torture, unnecessary hard language on an issue that is hard to fathom by most Canadians. 

 Canada in no way caused Khadr's plight. It was caused by his own family. In fact, if he had been brought in as a minor he would have been back with his family and would have never been  liberated.  Even he himself in  his recent (7/7)interview with CBC, did  admit to his teenager's frog in the well worldview. He had paid his dues for his errors, induced or otherwise, and the court system addressed Canada's due to him.   He does appears to be a very matured in his thinking.

There is rule of law in Canada and he correctly argued it through the court and the system arrived at a solution. It is unnecessary to harp on Canada. The system worked. One can be certain that in a repeat of a similar situation, Canada will likely to react no differently. A good society 

In the representative documents, the differences in the authors' meaning are clear. While all of these documents are the most representative for the "Trump" topic, not all of them feel a similar way towards him. In the off-the-shelf model, we see him characterized as a fascist and criminal, whereas the documents from the first model speak more to his political stances. The range across the models is a result of the embedding models and of what language each model is told to prioritize.

In analyzing a quantitative topic model, a close qualitative read is needed to best understand the true meaning behind the topics. Topic models present a new way to read topics, but they may not always be the 'correct' way.

___

## About this Notebook

This notebook was prepared for an independent Text as Data study for the CU Boulder MS in Information Science.

**Author:** Natalie Castro  
**Updated On:** 2024-10-21
