[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/CRCTransformers/deepdive-book/blob/main/Chapter-3-TopicModeling.ipynb)

# Motivation

In this chapter, we looked at several applications of the Transformer architecture. In this case study, we see how to use pretrained (or finetuned) Transformer models to do topic modeling. If one is exploring a new dataset, this method could be used during exploratory data analysis.

We'll use pretrained Transformers to explore the [Yelp reviews dataset](https://huggingface.co/datasets/yelp_review_full) and see what kinds of things the reviewers have to say.

There are many ways one can generate sentence embeddings, but we are going to use sentence embeddings from the [sentence-transformers](https://github.com/UKPLab/sentence-transformers) library. Sentence-transformers provides models pretrained for specific tasks, such as semantic search. 

We're going to use [BERTopic](https://github.com/MaartenGr/BERTopic) for topic modeling and [Huggingface Datasets](https://huggingface.co/docs/datasets/) for loading the data. 

Note: Huggingface Datasets lets you work with large datasets without needing to hold the entire thing in memory (the data is memory-mapped from disk using Apache Arrow).



# Environment setup

In [1]:
!pip install -U datasets==2.2.1 bertopic==0.10.0


In [2]:
import matplotlib.pyplot as plt

%matplotlib notebook

# Data

In [3]:
from datasets import load_dataset
import numpy as np

There are 650,000 reviews in the training split of the dataset. To keep the runtime of this case study within reason, we'll only process the first 10,000 reviews.

To process more reviews, simply change `N`.
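Taking the first `N` rows is the simplest choice. If the rows happened to be ordered in some way, a random sample might be more representative. The sketch below draws `N` distinct row indices with the standard library; `sample_indices`, `TOTAL_REVIEWS`, and `SAMPLE_SIZE` are hypothetical names introduced only for illustration, and the commented-out usage assumes the indices are passed to `Dataset.select`.

```python
import random

TOTAL_REVIEWS = 650_000  # size of the training split
SAMPLE_SIZE = 10_000     # plays the role of N


def sample_indices(total, k, seed=0):
    """Return k distinct row indices in [0, total), sorted in ascending order."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return sorted(rng.sample(range(total), k))


indices = sample_indices(TOTAL_REVIEWS, SAMPLE_SIZE)

# Hypothetical usage with Huggingface Datasets (not run here):
# full_train = load_dataset("yelp_review_full", split="train")
# dataset = full_train.select(indices)
```

Whether this matters depends on how the dataset is ordered, which this case study does not examine.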

In [4]:
N = 10_000
dataset = load_dataset("yelp_review_full", split=f"train[:{N}]")


In [5]:
dataset

Dataset({
    features: ['label', 'text'],
    num_rows: 10000
})

# Sentence Embeddings

In this case study, we're interested in exploring the Yelp dataset, seeing what topics are being written about.

We'll use the [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) model from sentence-transformers. It's built to perform well on semantic search when embedding sentences and longer spans of text.

To use the GPU when computing the embeddings, we set the `device` parameter in `SentenceTransformer` to "cuda".
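If the notebook might run on a machine without a GPU, the device can be chosen at runtime instead of being hardcoded. This is a minimal sketch, not part of the original notebook: `pick_device` is a hypothetical helper that falls back to the CPU when PyTorch is missing or reports no CUDA device.

```python
def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no way to query CUDA, so assume CPU
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()

# Hypothetical usage (not run here):
# embeddings_model = SentenceTransformer("all-mpnet-base-v2", device=device)
```

Computing embeddings on the CPU works but is considerably slower, so a smaller `N` may be needed in that case.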

In [6]:
from sentence_transformers import SentenceTransformer

embeddings_model = SentenceTransformer("all-mpnet-base-v2", device="cuda")

In [7]:
# We embed the reviews in batches, to speed things up
batch_size = 256

In [8]:
def embed(batch):
    batch["embedding"] = embeddings_model.encode(batch["text"])
    return batch

In [9]:
dataset = dataset.map(embed, batch_size=batch_size, batched=True)
dataset.set_format(type='numpy', columns=['embedding'], output_all_columns=True)
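With `batched=True`, `map` passes the function slices of up to `batch_size` rows rather than one row at a time. The following sketch reproduces that slicing in plain Python to show how many batches the 10,000 reviews produce; `iter_batches` is a hypothetical helper, not a Datasets API.

```python
import math


def iter_batches(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


num_reviews = 10_000
batch_size = 256

# Number of batches is the row count divided by the batch size, rounded up
num_batches = math.ceil(num_reviews / batch_size)

batches = list(iter_batches(list(range(num_reviews)), batch_size))
last_batch_size = len(batches[-1])  # the final batch holds the remainder
```

Here the reviews split into 40 batches, with the last one holding the 16 leftover rows.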




# Topics

## Building Topics

In [10]:
from bertopic import BERTopic

In [11]:
topic_model = BERTopic(n_gram_range=(1, 2))
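The `n_gram_range=(1, 2)` setting means topic terms can be single words or two-word phrases, which is why topic names can contain entries such as "the pizza". By default BERTopic hands this range to a scikit-learn `CountVectorizer`, which does its own tokenization; the sketch below is a simplified illustration on pre-split tokens, and `ngrams` is a hypothetical helper.

```python
def ngrams(tokens, n_gram_range):
    """Return all n-grams of tokens for every n in the inclusive range."""
    low, high = n_gram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(low, high + 1)
        for i in range(len(tokens) - n + 1)
    ]


terms = ngrams(["the", "pizza", "crust"], (1, 2))
print(terms)
# ['the', 'pizza', 'crust', 'the pizza', 'pizza crust']
```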

In [12]:
topics, probs = topic_model.fit_transform(dataset["text"], 
                                          np.array(dataset["embedding"]))



In [13]:
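# A second model that also allows trigrams. With calculate_probabilities=True,
# probs1 holds, for each review, a probability for every topic.
# This model is not used in the rest of the case study.
topic_model1 = BERTopic(n_gram_range=(1, 3), calculate_probabilities=True)
topics1, probs1 = topic_model1.fit_transform(dataset["text"],
                                             np.array(dataset["embedding"]))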

In [14]:
print(f"Number of topics: {len(topic_model.get_topics())}")

Number of topics: 139


Now that we have computed a topic distribution, we need to see what kind of reviews are in each topic.

In [15]:
topic_model.get_topic_info()

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 3115 | -1_the_and_was_to |
| 1 | 0 | 368 | 0_we_our_to_was |
| 2 | 1 | 340 | 1_italian_pasta_was_the |
| 3 | 2 | 272 | 2_pizza_the pizza_crust_cheese |
| 4 | 3 | 244 | 3_chinese_chinese food_chicken_food |
| ... | ... | ... | ... |
| 134 | 133 | 12 | 133_comic_comics_comic book_comic books |
| 135 | 134 | 11 | 134_thai_pad_pad thai_thai house |
| 136 | 135 | 11 | 135_we_us_were_was |
| 137 | 136 | 11 | 136_yum_yum yum_salad_julianas |


## Topic size distribution

What is the distribution of topic sizes, where the size of a topic is the number of reviews assigned to it?
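Since each review receives a single topic id (the `topics` list returned by `fit_transform`), a topic's size is just the number of times its id appears. A minimal sketch of that count on made-up assignments, where `toy_topics` is illustrative data rather than output from the model:

```python
from collections import Counter

# Toy assignments: one topic id per document, with -1 meaning unassigned
toy_topics = [0, 0, 1, -1, 0, 2, 1, -1, -1, 0]

sizes = Counter(toy_topics)    # topic id -> number of documents
ranked = sizes.most_common()   # (topic id, size) pairs, largest first
print(ranked)
# [(0, 4), (-1, 3), (1, 2), (2, 1)]
```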

In [16]:
topic_sizes = topic_model.get_topic_freq()

In [17]:
topic_sizes

| | Topic | Count |
|---|---|---|
| 0 | -1 | 3115 |
| 1 | 0 | 368 |
| 2 | 1 | 340 |
| 3 | 2 | 272 |
| 4 | 3 | 244 |
| ... | ... | ... |
| 134 | 133 | 12 |
| 135 | 134 | 11 |
| 136 | 135 | 11 |
| 137 | 136 | 11 |


Note the topic with id -1. This corresponds to the unassigned cluster output by the HDBSCAN algorithm, which BERTopic uses for clustering. It collects all the reviews that could not be assigned to any of the other clusters. It can *generally* be ignored, but if it were too large, it would be a sign that our choice of parameters is probably not a good fit for our data.
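To judge how large the unassigned cluster is, it helps to express it as a share of all processed reviews. The arithmetic below uses the count of topic -1 from the table and `N = 10,000`; the variable names are introduced here for illustration only.

```python
unassigned = 3115        # Count for topic -1 in the table above
total_reviews = 10_000   # N, the number of reviews processed

share = unassigned / total_reviews
print(f"Unassigned share: {share:.2%}")
```

Roughly 31% of the reviews end up unassigned in this run. What counts as "too large" is a judgment call that depends on the data and the goals of the analysis.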

In [18]:
topic_sizes[topic_sizes["Topic"] != -1]["Count"].hist()

Most topics have fewer than 50 reviews.

Note that the unassigned cluster has been omitted from the histogram.

In [19]:
n = len(topic_sizes) - 1  # subtract 1 to ignore the unassigned cluster

# Visualization of topics

This section shows off some of the ways the topics can be visualized with the BERTopic library.

In [20]:
# Visualize the 10 topics that are most prevalent in the dataset
topic_model.visualize_barchart(top_n_topics=10, 
                               n_words=5, width=1000, height=800)

BERTopic can also show a heatmap of the cosine similarities of the topic embeddings.

In [21]:
topic_model.visualize_heatmap(top_n_topics=20, n_clusters=5)
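Each cell of such a heatmap is the cosine similarity between two topic vectors: the dot product divided by the product of their lengths. The sketch below computes a small similarity matrix in plain Python on made-up 2-dimensional vectors; it illustrates the measure and is not BERTopic's implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise cosine similarities, as a list of rows."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Toy "topic embeddings": two orthogonal vectors and one in between
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
matrix = similarity_matrix(toy_vectors)
```

Orthogonal vectors score 0, identical directions score 1, and the in-between vector scores about 0.71 against each of the other two.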

# Sampling the distribution of topics

Let's look at the unassigned cluster, the largest topic, the smallest topic, and a topic of median size.
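The topic ids double as size ranks here: in the tables above, topic 0 is the largest and the counts decrease as the id grows. Under that assumption, the largest, median, and smallest topics can be picked by id alone. `pick_topic_ids` is a hypothetical helper for illustration.

```python
def pick_topic_ids(num_topics_including_unassigned):
    """Ids of the largest, median and smallest topics.

    Assumes topics are numbered 0..n-1 in descending order of size,
    with the unassigned cluster (-1) counted separately.
    """
    n = num_topics_including_unassigned - 1  # drop the unassigned cluster
    return {"largest": 0, "median": n // 2, "smallest": n - 1}


picks = pick_topic_ids(139)  # 139 topics were reported earlier
print(picks)
# {'largest': 0, 'median': 69, 'smallest': 137}
```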

In [22]:
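def dump_topic_and_docs(text, topic_id):
    # Look the size up by topic id rather than by row position,
    # so the result does not depend on how topic_sizes is ordered
    size = topic_sizes.loc[topic_sizes["Topic"] == topic_id, "Count"].iloc[0]
    print(f"{text} size: {size}\n")

    # Representative reviews are only requested for real topics,
    # not for the unassigned cluster (-1)
    if topic_id != -1:
        reviews = topic_model.get_representative_docs(topic_id)
        print("**** Representative reviews ****")
        for review in reviews:
            print(review, "\n")

    # Top 10 (term, score) pairs for the topic
    return topic_model.get_topic(topic_id)[:10]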

## Unassigned cluster

In [23]:
dump_topic_and_docs("Unassigned cluster", -1)

Unassigned cluster size: 3115



[('the', 0.006485498038619296),
 ('and', 0.005975922658815651),
 ('was', 0.005882849346982429),
 ('to', 0.005471069304066611),
 ('it', 0.005163699384497507),
 ('of', 0.004971711034275305),
 ('for', 0.004804835025489624),
 ('is', 0.004672343048500447),
 ('but', 0.004566944565925044),
 ('in', 0.004490537388902892)]

As we can see, the top terms of the unassigned cluster are common function words ("the", "and", "was", and so on) that do not strongly belong to any topic.
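One rough way to quantify this is to measure what fraction of a topic's top terms are stop words. The sketch below uses a small hand-picked stop word set (an assumption, not a standard list) and compares the unassigned cluster's terms with the terms in the name of the pizza topic; `stop_word_share` is a hypothetical helper.

```python
# Small illustrative stop word set; a real analysis would use a fuller list
STOP_WORDS = {
    "a", "an", "and", "but", "for", "he", "in", "is", "it", "of",
    "our", "she", "the", "to", "us", "was", "we", "were",
}


def stop_word_share(terms, stop_words=STOP_WORDS):
    """Fraction of terms that are stop words (0.0 for an empty list)."""
    if not terms:
        return 0.0
    return sum(term in stop_words for term in terms) / len(terms)


# Top terms of the unassigned cluster, from the output above
unassigned_terms = ["the", "and", "was", "to", "it", "of", "for", "is", "but", "in"]
# Terms from the name of topic 2 in the topic table
pizza_terms = ["pizza", "the pizza", "crust", "cheese"]

unassigned_share = stop_word_share(unassigned_terms)
pizza_share = stop_word_share(pizza_terms)
```

Every top term of the unassigned cluster is a stop word, while none of the pizza topic's terms are, since a bigram such as "the pizza" counts as its own term.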

## Largest topic

In [24]:
dump_topic_and_docs("Largest topic", 0)

Largest topic size: 368

**** Representative reviews ****
The worst customer service by server and manager. We've been coming here for years. Tonight's experience is one of the worst restaurant experiences I've ever had. Supposedly there was a computer error with our food. Fine, I understand it happens sometimes. However while waiting two hours, we repeatedly asked our server about our dinner. We got the same response every time. \"It'll be right up.\" He did not\nconsider actually checking on the food. All the tables around us came, ate and left. Meanwhile, our server could not even remember to bring water to the table without being asked multiple times. \n\nAfter I stopped asking nicely about our food, the server finally went to inquire. The manager offered to comp two beers for waiting two hours if we wanted to just  leave. How generous. Otherwise she offered to discount our meals and offered us what we thought was another round of beers. She did not even follow through on that prom

[('we', 0.009598921317981065),
 ('our', 0.008503262197735627),
 ('to', 0.006763928602678122),
 ('was', 0.006666301867658669),
 ('the', 0.006375484151262111),
 ('she', 0.0063215730037490085),
 ('and', 0.006043346272495334),
 ('us', 0.005864020564012115),
 ('he', 0.005237563163380326),
 ('were', 0.005177237474880569)]

## Smallest topic

In [25]:
dump_topic_and_docs("Smallest topic", n-1)

Smallest topic size: 10

**** Representative reviews ****
Came to this location late at night and was greeted by the friendly staff. Very affordable prices and good food! They even have a dollar menu! I would come back! 

Not a bad store for a few tough-to-find Middle Eastern items, and the price is right.  Because Labad's doesn't appear to be as popular a destination as other stores in the area, some shelves are scantily stocked, and some items look a bit battered and dusty.  Still, this is the place to go for canned fava beans and Turkish delight.\n\nLabad's also offers tasty, filling falafel wraps or gyros for super cheap. 

This isn't my favorite Showmars, but it does the trick on a busy workday when I need to get in and out with a well priced, non-fast food lunch.  Located just off of South Blvd, parking can be a bit of a squeeze during peak lunch hours, but you'll be able to find a spot behind the restaurant if there isn't one available out front.  \n\nToday's lunch of a chicken 

[('showmars', 0.03753402813485906),
 ('pita', 0.029917184491924504),
 ('love showmars', 0.016653718317785962),
 ('salad pita', 0.016101654536505094),
 ('pepperoni rolls', 0.015323628753411958),
 ('chicken salad', 0.01511593918403704),
 ('lunch', 0.014480917079336103),
 ('gyros', 0.011734514740560904),
 ('599 on', 0.011621235452385106),
 ('location', 0.011363726138446406)]

## Median size topic

In [26]:
dump_topic_and_docs("Median topic", n//2)

Median topic size: 28

**** Representative reviews ****
They got some solid subs here, especially on Tuesday for the special.  $6.49 for sub, chips, and a drink.  Just an overall solid place for lunch.  Definitely couldn't handle going here like on a daily, but well worth it to mix things up from time to time.\n\ndmo out 

A neighborhood stalwart and a gem (21 years strong, I am told).  I love this place if no no other reason than meats and cheeses are not sliced until ordered and their staff are not faux \"sandwich artists\" programmed to put exactly 3 tomato slices and 4 pickles on my six-inch sub.  \n\nVeggie friends will love the options here (don't skip the hummus!), but carnivores have the appreciate the fact that there are something like 60 different subs to choose from.  I even love their outmoded ticket system where the cashier has to hit a button on their register to print out what extras you want on your sub.  It's like a flash back to the mid 80s.\n\nThough it's at the dead

[('sub', 0.038694178547913974),
 ('subs', 0.012885976977902834),
 ('subway', 0.011176094579458378),
 ('mikes', 0.00925778945143283),
 ('jersey', 0.008382070934593784),
 ('sub and', 0.008283828260537958),
 ('jersey mikes', 0.007783925171693379),
 ('the sub', 0.0076512278358050675),
 ('cheese', 0.007210145975429207),
 ('mayo', 0.006687675780528334)]