# **Tutorial** - (semi)-supervised topic modeling
(last updated 11-09-2022)

In this tutorial, we will be looking at a feature of BERTopic, namely (semi)-supervised topic modeling! This allows us to steer the dimensionality reduction of the embeddings into a space that closely follows any labels we might already have.

## Semi-supervised modeling
(Semi)-supervised topic modeling is a class of methods that lets you perform topic modeling with previously defined labels. This can help nudge the model towards specific topics or classes for which you have labels.

<br>

<img src="https://raw.githubusercontent.com/MaartenGr/BERTopic/master/images/logo.png" width="40%">

## Working at TACC
### Removed Steps for working at TACC
* Enabling the GPU
* Installing BERTopic
* Restarting the Notebook

### Loading packages for use at TACC

In [1]:
import os
import logging
import sys

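# Disable parallelism in the Hugging Face tokenizers library (avoids fork-related warnings).
os.environ["TOKENIZERS_PARALLELISM"] = "false"

import plotly.io as pio
# Render Plotly figures in an iframe so they display inside the notebook.
pio.renderers.default = 'iframe'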

# **Data**
For this example, we use the popular 20 Newsgroups dataset, which contains roughly 18000 newsgroup posts, each assigned to one of 20 topics:

In [2]:
import pandas as pd
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

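# Strip headers, footers, and quoted replies so topics reflect the post body only.
data = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
docs = data["data"]
targets = data["target"]              # integer class id per document
target_names = data["target_names"]   # class id -> newsgroup name
classes = [data["target_names"][i] for i in data["target"]]  # newsgroup name per document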

Each document can be put into one of the following categories:

In [3]:
target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

# **(semi)-Supervised modeling**


## Basic Model
Before we start with semi-supervised modeling, let us first take a look at the output of the basic model.

In [4]:
topic_model = BERTopic(verbose=True)
topics, _ = topic_model.fit_transform(docs)

2024-03-04 11:38:33,912 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/589 [00:00<?, ?it/s]

2024-03-04 11:39:09,614 - BERTopic - Embedding - Completed ✓
2024-03-04 11:39:09,615 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-04 11:39:36,948 - BERTopic - Dimensionality - Completed ✓
2024-03-04 11:39:36,954 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-04 11:39:39,834 - BERTopic - Cluster - Completed ✓
2024-03-04 11:39:39,896 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-04 11:39:42,986 - BERTopic - Representation - Completed ✓


In [5]:
topic_model.get_topic_info().head(10)

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,6218,-1_to_the_of_and,"[to, the, of, and, is, you, in, that, it, for]",[----- Begin Included Message -----\n\nThe fol...
1,0,1835,0_game_team_games_he,"[game, team, games, he, players, season, hocke...",[#21\tPETER AHOLA\t\tSeason: 2nd\nAcquired:\t'...
2,1,566,1_key_clipper_chip_encryption,"[key, clipper, chip, encryption, keys, escrow,...",[The following document summarizes the Clipper...
3,2,526,2_ites_cheek_yep_huh,"[ites, cheek, yep, huh, ken, auto, why, each, ...","[\nHuh?, \nYep.\n, ites:]"
4,3,472,3_israel_israeli_jews_arab,"[israel, israeli, jews, arab, jewish, arabs, p...",[From: Center for Policy Research <cpr>\nSubje...
5,4,456,4_card_monitor_video_drivers,"[card, monitor, video, drivers, vga, screen, m...","[Has anyone connected a high-res, fixed freque..."
6,5,427,5_drive_scsi_drives_ide,"[drive, scsi, drives, ide, disk, controller, h...","[Wow, you guys are really going wild on this I..."
7,6,388,6_car_cars_engine_ford,"[car, cars, engine, ford, toyota, mustang, mil...",[\nWhile I don't read normally read this group...
8,7,363,7_jpeg_image_gif_images,"[jpeg, image, gif, images, format, file, files...",[I am happy to announce the first public relea...
9,8,249,8_koresh_fbi_fire_gas,"[koresh, fbi, fire, gas, compound, children, t...",[Ever since the siege at Waco started the FBI ...


The topics that were created mostly make sense. There are some clearly defined topics such as "nasa, orbit, spacecraft, moon", but also some topics that seem mostly derived from other topics. We can visualize this by extracting the topic representations per class and checking whether the unsupervised topics closely resemble the known categories.

**NOTE**: You can **hover** over the bars to see the representation per class!

In [6]:
topics_per_class = topic_model.topics_per_class(docs, classes=classes)
fig_unsupervised = topic_model.visualize_topics_per_class(topics_per_class, top_n_topics=10)
fig_unsupervised

20it [00:03,  5.58it/s]


The results do seem promising. Topics like "nasa, space, etc" seem to be clearly related to sci.space, but some topics were created that span many categories. For example, we would expect the topic "bike, bikes, etc" to appear only in rec.motorcycles.
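The idea of a topic "spanning many categories" can also be checked numerically. Below is a minimal sketch, not part of BERTopic, that computes the share of each topic's documents falling in each class. The helper name `class_share_per_topic` and the toy data are hypothetical; with a real run you would pass `topics` and `classes` instead.

```python
from collections import Counter, defaultdict


def class_share_per_topic(topics, classes):
    """For each topic, return the share of its documents found in each class."""
    counts = defaultdict(Counter)
    for topic, cls in zip(topics, classes):
        counts[topic][cls] += 1
    return {
        topic: {cls: n / sum(counter.values()) for cls, n in counter.items()}
        for topic, counter in counts.items()
    }


# Toy assignments (hypothetical, not taken from the run above).
toy_topics = [0, 0, 0, 0, 1, 1, 1, 1]
toy_classes = [
    "rec.motorcycles", "rec.motorcycles", "rec.motorcycles", "rec.autos",
    "sci.space", "sci.space", "sci.space", "sci.space",
]

shares = class_share_per_topic(toy_topics, toy_classes)
for topic, share in sorted(shares.items()):
    dominant = max(share, key=share.get)
    print(topic, dominant, round(share[dominant], 2))
```

A topic whose dominant share is well below 1.0 is spread over several classes, which is the pattern the bar chart shows visually.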

## Semi-supervised
In the example above you might notice that some topics were somewhat merged together. What we would like to see is a clear separation between those topics. Fortunately, we have the labels and can use them to improve the model.

Because we only supply labels for the classes we are interested in, this method is called semi-supervised topic modeling. In practice, this means that we have the labels of some documents but not all.

For this example let's say we only have the labels of all computer-related categories:

In [7]:
labels_to_add = ['comp.graphics', 'comp.os.ms-windows.misc',
                 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
                 'comp.windows.x']
# Integer ids of the classes whose labels we keep.
indices = [target_names.index(label) for label in labels_to_add]
# Keep the id for known classes; mark every other document as unlabeled (-1).
new_labels = [label if label in indices else -1 for label in targets]
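The masking step above can be tried on a small toy example to confirm how many documents keep a label. This is only a sketch: `mask_labels`, `toy_names`, and `toy_targets` are hypothetical stand-ins for the cell above and for `target_names` and `targets`.

```python
from collections import Counter


def mask_labels(targets, target_names, labels_to_keep):
    """Keep the integer label for the chosen classes; mark everything else as -1."""
    keep = {target_names.index(name) for name in labels_to_keep}
    return [t if t in keep else -1 for t in targets]


# Toy stand-ins (hypothetical) for target_names and targets.
toy_names = ["alt.atheism", "comp.graphics", "rec.autos", "comp.windows.x"]
toy_targets = [0, 1, 1, 2, 3, 0]

masked = mask_labels(toy_targets, toy_names, ["comp.graphics", "comp.windows.x"])
label_counts = Counter(masked)

print(masked)
print("unlabeled documents:", label_counts[-1])
```

Using a set for the kept ids makes the membership test cheap, which matters little here but helps on larger label lists.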

When generating our new labels it is important to mark unknown classes as **-1**. Next, we use those newly constructed labels to again run BERTopic:

In [8]:
topic_model = BERTopic(verbose=True)
topics, _ = topic_model.fit_transform(docs, y=new_labels)

2024-03-04 11:42:02,261 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/589 [00:00<?, ?it/s]

2024-03-04 11:42:23,599 - BERTopic - Embedding - Completed ✓
2024-03-04 11:42:23,600 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-04 11:42:36,139 - BERTopic - Dimensionality - Completed ✓
2024-03-04 11:42:36,142 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-04 11:42:36,613 - BERTopic - Cluster - Completed ✓
2024-03-04 11:42:36,618 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-04 11:42:39,594 - BERTopic - Representation - Completed ✓


In [9]:
topic_model.get_topic_info().head(10)

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,5829,-1_to_the_of_and,"[to, the, of, and, is, it, in, you, that, for]","[\nSince this is alt.atheism, I hope you don't..."
1,0,1821,0_game_team_games_he,"[game, team, games, he, players, season, hocke...","[The problem with your nihilistic approach, Ro..."
2,1,876,1_window_server_motif_widget,"[window, server, motif, widget, xterm, openwin...",[I am working on an X-Window based application...
3,2,843,2_image_jpeg_images_gif,"[image, jpeg, images, gif, format, graphics, f...",[I have posted DISP140.ZIP to alt.binaries.pic...
4,3,555,3_key_clipper_chip_encryption,"[key, clipper, chip, encryption, keys, escrow,...",[-----BEGIN PGP SIGNED MESSAGE-----\n\nPlease ...
5,4,532,4_ites_cheek_yep_huh,"[ites, cheek, yep, huh, ken, ignore, forget, a...","[\nYep.\n, \n \n ..."
6,5,459,5_israel_israeli_jews_arab,"[israel, israeli, jews, arab, jewish, arabs, p...",[From: Center for Policy Research <cpr>\nSubje...
7,6,398,6_gun_guns_firearms_amendment,"[gun, guns, firearms, amendment, militia, crim...","[\nRead it again yourself, then re-apply the a..."
8,7,301,7_bike_ride_riding_my,"[bike, ride, riding, my, lane, you, road, car,...",[Sixteen days I had put off test driving the H...
9,8,251,8_you_your_post_jim,"[you, your, post, jim, context, that, ted, to,...",[\n[ stuff deleted ]\n |> Are you calling na...


Finally, we can again extract the topics per class to see if our semi-supervised approach had some effect:

In [10]:
topics_per_class = topic_model.topics_per_class(docs, classes=classes)
fig_semi_supervised = topic_model.visualize_topics_per_class(topics_per_class, top_n_topics=10, width=900)
fig_semi_supervised

20it [00:03,  5.56it/s]


We can clearly see that many more topics about computers were created and that the separation between those topics is solid. This indicates that even if you do not have all the labels, you can still improve the model!

However, there are still some clusters that could be improved with the labels that we have.

## Supervised

Finally, we are going to be using all labels. These labels help BERTopic understand where most clusters can be found. However, this does not mean that it will only find the 20 clusters that we have defined. If there are sub-clusters to be found, then there is a good chance BERTopic will find them!

In [11]:
topic_model = BERTopic(verbose=True)
topics, _ = topic_model.fit_transform(docs, y=targets)

2024-03-04 11:42:44,648 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/589 [00:00<?, ?it/s]

2024-03-04 11:43:06,081 - BERTopic - Embedding - Completed ✓
2024-03-04 11:43:06,082 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-04 11:43:16,326 - BERTopic - Dimensionality - Completed ✓
2024-03-04 11:43:16,327 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-04 11:43:16,751 - BERTopic - Cluster - Completed ✓
2024-03-04 11:43:16,755 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-04 11:43:19,742 - BERTopic - Representation - Completed ✓


In [12]:
topic_model.get_topic_info().head(10)

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,3971,-1_the_to_of_and,"[the, to, of, and, that, in, you, is, it, for]",[I posted this several days ago for Dave Butle...
1,0,952,0_space_launch_nasa_orbit,"[space, launch, nasa, orbit, shuttle, spacecra...",[Archive-name: space/references\nLast-modified...
2,1,909,1_he_game_year_baseball,"[he, game, year, baseball, team, games, player...",[\nNot particularly *in* the World Series. Dur...
3,2,903,2_window_server_file_widget,"[window, server, file, widget, motif, entry, p...","[Enclosed are the rules, guidelines and relate..."
4,3,895,3_key_encryption_clipper_chip,"[key, encryption, clipper, chip, keys, privacy...",[-----BEGIN PGP SIGNED MESSAGE-----\n\nPlease ...
5,4,893,4_car_cars_engine_ford,"[car, cars, engine, ford, it, my, oil, the, de...",[\nI have had my Probe looked at twice by my l...
6,5,883,5_windows_db_dos_file,"[windows, db, dos, file, files, 31, driver, mo...","[\n: I have normal procomm plus for dos, but..."
7,6,860,6_gun_guns_firearms_fbi,"[gun, guns, firearms, fbi, they, weapons, the,...",[NOTE - local tx groups trimmed out of Newsgro...
8,7,848,7_image_jpeg_images_graphics,"[image, jpeg, images, graphics, format, gif, f...",[Archive-name: jpeg-faq\nLast-modified: 18 Apr...
9,8,527,8_whatta_ites_cheek_ass,"[whatta, ites, cheek, ass, ken, ignore, forget...","[Lets not forget , \n \n ..."


Not only do we see a nice separation of the topics, there are also significantly fewer outliers, which suggests that BERTopic has improved in connecting the documents to topics.
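To put a number on "fewer outliers", the sketch below compares the size of topic -1 reported in the three `get_topic_info()` outputs above. The counts come from this single run; the underlying algorithms are stochastic, so treat the exact figures as illustrative. The helper `relative_reduction` is a hypothetical name, not a BERTopic function.

```python
def relative_reduction(baseline, new):
    """Fraction by which `new` is smaller than `baseline`."""
    return (baseline - new) / baseline


# Size of topic -1 in each run, as shown in the tables above.
outliers = {"unsupervised": 6218, "semi-supervised": 5829, "supervised": 3971}

baseline = outliers["unsupervised"]
for name, count in outliers.items():
    change = relative_reduction(baseline, count)
    print(f"{name}: {count} outliers ({change:.1%} fewer than unsupervised)")
```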

Let's see the results by again visualizing the topic representation per class:

In [13]:
topics_per_class = topic_model.topics_per_class(docs, classes=classes)
fig_supervised = topic_model.visualize_topics_per_class(topics_per_class, top_n_topics=10, width=900)
fig_supervised

20it [00:03,  5.80it/s]


Now that we have used all labels, BERTopic seems to closely match our pre-defined labels. Moreover, it still allows you to discover topics that were not previously defined. Thus, you can use this method to find unknown topics within pre-defined topics!
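One way to look for such sub-topics is to list which topics occur inside each pre-defined label. The following is a minimal sketch on toy data. The helper `topics_within_labels` is hypothetical and not part of BERTopic; with a real run you would pass `topics` and `classes`.

```python
from collections import defaultdict


def topics_within_labels(topics, labels, outlier=-1):
    """Map each label to the sorted list of non-outlier topics found among its documents."""
    found = defaultdict(set)
    for topic, label in zip(topics, labels):
        if topic != outlier:
            found[label].add(topic)
    return {label: sorted(ts) for label, ts in found.items()}


# Toy assignments (hypothetical, not taken from the run above).
toy_topics = [0, 0, 5, 5, 1, -1, 1]
toy_labels = ["sci.space"] * 4 + ["rec.autos"] * 3

result = topics_within_labels(toy_topics, toy_labels)
for label, found_topics in result.items():
    note = "possible sub-topics" if len(found_topics) > 1 else "single topic"
    print(label, found_topics, note)
```

Labels that map to more than one topic are candidates for closer inspection, for example by reading the topic representations with `get_topic_info()`.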