# Reimagining Reviews With Sentence Transformers
## Author: Christian Smith

## Introduction

A business’s image is central to its ability to operate. Few clients will accept a business with an inferior reputation, as it is in everyone’s best interest to seek out the best service possible for the task at hand. Although big billboards, word-of-mouth marketing, first impressions, and endorsements have all been critical to a company’s continued success, the digital age has brought about another avenue, one that arguably exceeds the others in importance: the online review. Before the era of the internet, it was often difficult to obtain information about a business before simply trying it out. Advertisements were biased, mostly represented the larger chains, and did not speak on the individual level. With the rise of the internet, companies such as Yelp began to offer a centralized platform where people could relay their experiences with any business, from franchised chains all the way down to mom-and-pop shops. Online reviews have become the new normal for three reasons: they are easy to access for readers looking for a specific service; they are simple, with each reviewer typically giving a rating of one to five stars that is averaged so the reader can quickly assess a business’s performance; and they are less biased than business advertisements, since each review is mostly tied to a unique, unfiltered, genuine experience of a business’s services.

As the popularity of online reviews has grown, their influence has become difficult for companies to ignore, particularly in big cities where the services a business offers are rarely unique, and thus the existence of the service itself is not a compelling reason for a customer to visit. The spread of GPS navigation has further compounded this issue by making it easier than ever for potential customers to choose the businesses with the best reviews, their decision often determined by a single number on a website like Yelp. Now that businesses are aware of the significance of reviews, how can they improve them?

## Motivation

Because reviews are so easy to post on websites like Yelp, they can quickly build up to an overwhelming volume, leaving most business owners puzzled as to what they should improve about their company first. However, through machine learning, it is possible to identify which topics are the predominant concerns of customers. In this article, I will explain how we can leverage machine learning to pinpoint topics of interest across thousands of reviews. A business can focus on these topics to earn better reviews and rely less on the luck of the individual customer review.

## Importing Data

Before we begin, there are some important Python libraries and two important datasets that will be used within the project. Additional Python libraries are needed later for the BERTopic language analysis models; those will be imported in this notebook at the point where they are required. Both datasets are modifications of the official Yelp review dataset provided by Yelp on their website: [Yelp Open Dataset](https://www.yelp.com/dataset). Details on how I modified the "business" dataset are available in the other file attached to this GitHub repository, named "Business Attributes Dataset Modification Details".

reduced_business Dataset:  
- <b>business_id</b>: A unique ID given to each business registered with Yelp  
- <b>name</b>: The official name of the business  
- <b>RestaurantsPriceRange2</b>: A discrete range of one to four where a four means the business is rated as one of the most expensive per person and a one means the business is rated as one of the cheapest per person. This metric is determined by a survey given to Yelp reviewers when checking in where 1 = under $\$$10, 2 = $\$$11-30, 3 = $\$$31-60, and 4 = $\$$61+ per person attending. While this column is called "RestaurantsPriceRange2", it also applies to some businesses that are not restaurants (e.g. Target)
- <b>GoodForKids</b>: Either a 0 or 1 where 1 means the business is a good place for kids as determined by a survey
- <b>NoiseLevel</b>: A categorical range from "quiet" to "average" to "loud" to "very_loud" in ascending levels of average noise at the business as voted on by a survey
- <b>GoodForDancing</b>: Either a 0 or 1 where 1 means the business is a good place to dance as determined by a survey
- <b>RestaurantsAttire</b>: A categorical range from "casual" to "semi-formal" to "formal" in ascending levels of average dress expectation at the business as voted on by a survey
- <b>BikeParking</b>: Either a 0 or 1 where 1 means the business has space for bike parking
- <b>WheelchairAccessible</b>: Either a 0 or 1 where 1 means the business has wheelchair accessibility features
- <b>categories</b>: A list that contains all of the categories the business is associated with  
  
reviews Dataset:  
- <b>review_id</b>: A unique ID given to each review created on Yelp
- <b>business_id</b>: The ID of the business the review is associated with
- <b>stars</b>: The number of stars the reviewer gave the business in the review
- <b>text</b>: A string that contains the entire review as written by the reviewer
- <b>date</b>: A datetime object that contains the year, month, day, and exact time at which the review was posted to Yelp's website

In [None]:
# Import required python libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Note: The OS library is required to remove some warnings with tensorflow that do not impact performance
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 
import tensorflow as tf
import json
import torch
import math
%matplotlib inline

# Import reduced_business dataframe
reduced_business = pd.read_csv('reduced_business.csv')

# Import reviews dataframe
# Note: Because the reviews dataframe is so large, it is best to import it in batches using the chunksize argument.
# Note: Language analysis models are very resource intensive when used on large datasets. Due to this, I have decided to only use the first
#       1,000,000 rows of reviews to maintain decent performance. The number of rows that can be used for model training largely depends on the
#       resources your device has available, and the sample size desired.

chunk_size = 1000  # Adjust the chunk size based on your file size
chunks = pd.read_json('yelp_academic_dataset_review.json', lines=True, chunksize=chunk_size, nrows = 1000000) # use nrows to adjust sample size

# Concatenate the chunks into a single DataFrame
reviews = pd.concat(chunks, ignore_index=True)

## What are Sentence Transformers?

Before training our own machine learning models, we must transform the reviews into data the models can use. Because Natural Language Processing (NLP) models take numeric vectors as input, we need a way to convert each string of text into such a vector. For this task, we have sentence transformers, a type of pre-trained natural language processing model. The goal of a sentence transformer is to convert sentences, or other short texts, into “embeddings”: fixed-dimensional vectors of real numbers that can then be used as input by other language processing models. The significance of sentence transformers is that they operate on the entire sentence rather than on the individual words that make it up. This allows the embeddings to capture the semantics of each sentence.
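
As a rough illustration of what "fixed-dimensional" means, the sketch below averages per-token vectors (mean pooling, one common pooling strategy) so that sentences of different lengths map to vectors of the same size. The 3-dimensional token vectors are hypothetical values invented for this example, not output from a real model; an actual sentence transformer uses hundreds of dimensions and token vectors that depend on the surrounding context.

```python
# Toy illustration of mean pooling; the token vectors are made-up values, not real model output.
TOY_TOKEN_VECTORS = {
    "the": [0.1, 0.0, 0.1],
    "food": [0.9, 0.2, 0.0],
    "was": [0.0, 0.1, 0.1],
    "great": [0.7, 0.9, 0.1],
    "service": [0.2, 0.3, 0.9],
    "slow": [0.1, -0.8, 0.6],
}

def mean_pool(sentence):
    """Average the token vectors of a sentence into one fixed-size vector."""
    vectors = [TOY_TOKEN_VECTORS[tok] for tok in sentence.lower().split()]
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]

short_embedding = mean_pool("food was great")
long_embedding = mean_pool("the food was great the service was slow")

# Both embeddings have the same length regardless of sentence length.
print(len(short_embedding), len(long_embedding))
```

Because every sentence ends up with the same number of dimensions, downstream models can compare or cluster sentences directly.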

## What are the Use Cases for Sentence Transformers?

To illustrate the capability of sentence transformers, consider a scenario in which the stars given by the users were controversially hidden from their respective company, but still visible to customers. In this scenario, it would be crucial for business owners to be able to predict the number of stars they were given from the text of the review in order to assess the reputation of their business from the perspective of new, potential customers. Although the goal is simple, to predict the number of stars a review received given its text, the process to complete this is rather complex. This is largely due to the fact that humans express language in inconsistent ways.

### Using VADER to Visualize the Complexity of Language Sentiment

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a rule-based, lexicon-based approach to scoring the sentiment of text. VADER works by looking up each word in a large lexicon of sentiment-rated words and then applying several rules that adjust those ratings. In that sense it resembles a more finely tuned count vectorizer: it is still driven by which words are present in each sentence, but it adjusts their weights based on other factors. For example, the sentence "the food was TERRIBLE!!" would be deemed by VADER to be more negative than "the food was terrible" due to the former's use of capital letters and punctuation. Each text is then assigned a compound sentiment score that ranges from -1 to 1, where 1 is extremely positive and -1 is extremely negative. When visualizing the sentiment score across all reviews, split by the number of stars each review was given, we begin to see why predicting sentiment from human language is so difficult.
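
To make the rule-based idea concrete, here is a minimal sketch of a lexicon scorer with emphasis rules. It is not VADER's implementation: the three-word lexicon, the boost values, and the function name `toy_compound` are assumptions chosen for illustration. Only the overall shape follows the description above: look up word valences, adjust them for capitals and exclamation marks, then squash the total into the range -1 to 1.

```python
import math

# Hypothetical mini-lexicon and boost values, chosen only for illustration.
TOY_LEXICON = {"terrible": -2.0, "good": 1.9, "delicious": 2.5}
CAPS_BOOST = 0.7         # extra intensity for an ALL-CAPS sentiment word
EXCLAMATION_BOOST = 0.3  # extra intensity per "!" (capped at 4 marks)

def toy_compound(text):
    """Score text with a tiny lexicon, then squash the total into (-1, 1)."""
    exclamations = min(text.count("!"), 4)
    total = 0.0
    for raw in text.split():
        word = raw.strip("!?.,")
        valence = TOY_LEXICON.get(word.lower(), 0.0)
        if valence == 0.0:
            continue  # word carries no sentiment in the toy lexicon
        sign = 1.0 if valence > 0 else -1.0
        if word.isupper():
            valence += sign * CAPS_BOOST
        total += valence
    if total != 0.0:
        # Exclamation marks push the score further in its existing direction.
        total += (1.0 if total > 0 else -1.0) * exclamations * EXCLAMATION_BOOST
    # Normalize the unbounded total to the open interval (-1, 1).
    return total / math.sqrt(total * total + 15)

plain = toy_compound("the food was terrible")
emphatic = toy_compound("the food was TERRIBLE!!")
print(plain, emphatic)
```

The emphatic sentence receives the more negative score, which mirrors the behavior described for VADER.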

In [None]:
# 1) start by initializing VADER
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

def vader_compound(text):
    return analyzer.polarity_scores(text)['compound']

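# 2) draw the sample of reviews to score (model_reviews is re-created later for the logistic regression model),
#    then create a new column that holds each review's compound sentiment score
model_reviews = reviews[['text', 'stars']].sample(100000).copy()
model_reviews['compound'] = model_reviews['text'].apply(vader_compound)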

# 3) visualize the differences in sentiment between each star level
fig, ((ax1, ax2, ax3, ax4, ax5)) = plt.subplots(1, 5, figsize = (15, 4))
ax1.set_title('1 star')
ax2.set_title('2 star')
ax3.set_title('3 star')
ax4.set_title('4 star')
ax5.set_title('5 star')
model_reviews[model_reviews['stars'] == 1]['compound'].plot(kind='box', ax=ax1)
model_reviews[model_reviews['stars'] == 2]['compound'].plot(kind='box', ax=ax2) 
model_reviews[model_reviews['stars'] == 3]['compound'].plot(kind='box', ax=ax3)
model_reviews[model_reviews['stars'] == 4]['compound'].plot(kind='box', ax=ax4)
model_reviews[model_reviews['stars'] == 5]['compound'].plot(kind='box', ax=ax5)
plt.savefig('vader_box.png')

When looking at the boxplots, VADER confirms that the median sentiment increases as the star rating of the review increases. This is expected behavior and how review stars are intended to function. What is not expected, however, is an outlier 5 star review with a compound score close to -1, as negative as possible, or a 1 star review with a compound score close to 1, as positive as possible. What is going on here?

In [None]:
# We can use this code block to find 1 star reviews that have an unusually high compound score
low_mask = (model_reviews['stars'] == 1) & (model_reviews['compound'] > 0.9)
low_positive = model_reviews.loc[low_mask].reset_index()
low_positive['text'][0]

When looking at 1 star reviews that have an unusually high compound score, we find reviews such as this one:

"As I ate at this restaurant I felt as if I was in a broadway production, a comedy, where everyone was doing their best to do the absolute opposite of good service. It was actually really entertaining, eventually. Toast was disgusting. Other food was good. Grits were really good."

Here, we can see that the reviewer did indeed have a negative experience but used a comedic analogy to relay it. Because of the words related to comedy and the reviewer’s compliments about some of the food, VADER assigned this review a high compound score.

In [None]:
# We can use this code block to find 5 star reviews that have an unusually low compound score
high_mask = (model_reviews['stars'] == 5) & (model_reviews['compound'] < -0.9)
high_negative = model_reviews.loc[high_mask].reset_index()
high_negative['text'][7]

When looking at 5 star reviews that have an unusually low compound score, we find reviews such as this one:

"My daughter hit a re-tread tire on the interstate going 60mph. The tire shrapnel attacked the undercarriage of her car and snapped the exhaust pipe from the engine. We towed her car to Quick Auto as they seemed to be the closest exhaust specialist near us. Uncertain and slightly fearful of the extent of the damage, we literally received a call right when we walked back in the door saying the car was fixed and ready for pick up! The cost was dreamily affordable. Seriously contemplating moving all of our service needs for all of our vehicles to Quick Auto."

Here, we can see that the reviewer spent most of the review describing the dreadful events that led to the visit, before the business provided any service. This recap of why the reviewer needed the business greatly overshadows the extremely positive sentiment at the end, hence the low compound score.

### What is the Result of this Sentiment Inconsistency?

What is the end result of this inconsistent application of sentiment in reviews? Machine learning models have a difficult time predicting the number of stars a review obtains just from its text. We can run a logistic regression model to see the results of inputting inconsistent data into a machine learning model.

#### Import Necessary Libraries/Dataset for the Logistic Regression Model

In [None]:
import string
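import nltk
from nltk.corpus import stopwords
# Download the stopword list once; it is required before stopwords.words('english') can be used
nltk.download('stopwords', quiet=True)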
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [None]:
model_reviews = pd.DataFrame(reviews[['text', 'stars']].sample(100000))

#### Remove All Punctuation and Stopwords From the Text

In [None]:
stop_words = set(stopwords.words('english'))
def text_process(mess):
    """
    Takes in a string of text, then performs the following:
    1. Remove all punctuation
    2. Remove all stopwords
    3. Returns a list of the cleaned text
    """
    # Drop every punctuation character and rejoin the remaining characters into a string
    nopunc = ''.join([char for char in mess if char not in string.punctuation])
    
    # Now just remove any stopwords
    return [word for word in nopunc.split() if word.lower() not in stop_words]

#### Train and Fit the Data onto the Logistic Regression Model

In [None]:
X_train, x_test, Y_train, y_test = train_test_split(model_reviews['text'], model_reviews['stars'], test_size=0.2)

In [None]:
pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=text_process)),  # strings to token integer counts
    ('tfidf', TfidfTransformer()),  # integer counts to weighted TF-IDF scores
    ('classifier', LogisticRegression(max_iter = 1000)),  # train on TF-IDF vectors w/ Logistic Regression classifier
])

In [None]:
pipeline.fit(X_train, Y_train)

#### Analyze Accuracy of Predictions

In [None]:
predictions = pipeline.predict(x_test)

In [None]:
# Note: scikit-learn expects the true labels first and the predictions second
print('Confusion Matrix:\n')
print(confusion_matrix(y_test, predictions))
print('\nClassification Report:\n')
print(classification_report(y_test, predictions))

Although the logistic regression model is able to predict the number of stars at each level, it is clear that it tends to favor placing each review at one of the two extreme values, one or five stars. The logistic regression model performs poorly overall when predicting the number of stars based solely on word frequency, due to the inconsistency of human language.
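
To show how a confusion matrix exposes this kind of bias toward the extremes, the sketch below builds one by hand, with rows as true labels and columns as predicted labels (the layout scikit-learn's `confusion_matrix` produces when called with the true labels first). The star ratings are hypothetical values invented for illustration, and `confusion_counts` and `recall_per_label` are helper names introduced here, not part of any library.

```python
from collections import Counter

def confusion_counts(y_true, y_pred, labels):
    """Build a matrix where rows are true labels and columns are predicted labels."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(t, p)] for p in labels] for t in labels]

def recall_per_label(matrix, labels):
    """Recall = correct predictions for a label / all true examples of that label."""
    return {
        label: (row[i] / sum(row) if sum(row) else 0.0)
        for i, (label, row) in enumerate(zip(labels, matrix))
    }

# Hypothetical star ratings, invented for illustration only.
true_stars = [1, 1, 2, 3, 3, 4, 5, 5, 5, 5]
pred_stars = [1, 1, 1, 5, 3, 5, 5, 5, 5, 1]
labels = [1, 2, 3, 4, 5]

matrix = confusion_counts(true_stars, pred_stars, labels)
recall = recall_per_label(matrix, labels)
for label, row in zip(labels, matrix):
    print(label, row, round(recall[label], 2))
```

In this toy data, most of the mass sits in the first and last columns, which is the pattern to look for when a model over-predicts one and five stars.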

Due to the variations in how humans express sentiment, context is crucial when analyzing language with models. As shown by VADER, even 5 star reviews can read as mostly negative, while 1 star reviews can sound like rave endorsements, when context is ignored. For this reason, sentence transformers are pivotal to processing language. Although imperfect, they continue to improve as they are trained on more data, becoming better at separating topics from each other and at taking more context into account. The better question now is: how can we leverage sentence transformers to drive better reviews?

## Leveraging Sentence Transformers to Drive Better Reviews for Businesses

When a business wants to increase its average star rating on websites like Yelp, a sensible place to start is to find out what the reviewers themselves consider to be “exceptional service”. This task is not a simple one, as each reviewer is not only unique but also inconsistent in the way they rate services. One of the best options for pinpointing what reviewers deem important, however, is to use machine learning models that build on the strengths of sentence transformers to cluster reviews into topics based on similar semantic meanings.

To uncover which topics are significant in determining the outcome of customer reviews, we will use BERTopic. BERTopic is a topic modeling technique that builds on the embeddings of a pre-trained sentence transformer and excels at grouping text into topics. The topics are then ranked by size, which is determined by the number of documents assigned to each topic. By narrowing the reviews down to specific attributes of businesses, we will investigate how BERTopic can be used in different ways by all kinds of companies to focus on the topics that matter most to reviewers. As a demonstration of its utility, we will see how the BERTopic model simplifies the overwhelming number of reviews from relatively expensive businesses, Chinese restaurants, medical businesses, and sports and fitness businesses, and condenses them into easily identifiable topics that can shed light on how to improve reviews overall.

### How Can We Use BERTopic to Analyze Reviews?

As an introduction to BERTopic, we will see how we can analyze reviews using a subset of reviews associated with relatively expensive businesses (with a price range rating of 3 or 4). In order to train the BERTopic model, we will first need to create the subset of reviews for expensive businesses that we will use as input.

In [None]:
# 1) create a subset of the reduced_business dataframe to only include businesses where their price range is a 3 or 4
expensive = reduced_business[(reduced_business['RestaurantsPriceRange2'] == 3) | (reduced_business['RestaurantsPriceRange2'] == 4)]

# 2) perform an inner merge on each business' unique ID to keep only the reviews of expensive businesses while also associating each review with its respective business
expensive_reviews = pd.merge(reviews, expensive, how = 'inner', on = 'business_id')

# 3) to support any categorical analysis in the future, drop any rows in which category information is missing
expensive_reviews = expensive_reviews.dropna(subset=['categories'])

# 4) reset the index of the new dataframe (will be necessary if a class-based analysis is performed)
expensive_reviews = expensive_reviews.reset_index(drop=True)
expensive_reviews.head()

Next, we will import all of the libraries required to run the BERTopic model. Additionally, we will set up the BERTopic model with its necessary parameters.

<b>Note:</b> In order to use GPU acceleration with BERTopic (within UMAP and HDBSCAN), BERTopic will need to be run within a WSL environment if on Windows. This is because the cuML library requires a Linux-based environment to run. To create an environment suitable for GPU hardware acceleration:
1. Ensure your device has an NVIDIA GPU and the latest video driver update
2. Install WSL with a Linux distribution (e.g. Ubuntu 22.04)
3. Install the RAPIDS 24.02 environment into WSL using this [Rapids install link](https://docs.rapids.ai/install).
4. Activate the environment with ```conda activate rapids-24.02``` and install Jupyter Notebook (if necessary)
5. Run ```jupyter-notebook``` on the command line and navigate back to this notebook

In [None]:
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP
from bertopic.representation import KeyBERTInspired
from plotly.offline import plot

umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=1, gen_min_span_tree=True)
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=10)
# KeyBERTInspired refines each topic's keywords so that they better match the documents in that topic,
# which can make it more obvious which topics are important
representation_model = KeyBERTInspired()

Now, we will train the model on the review text data derived from the subset of expensive businesses.

In [None]:
exp_reviews_to_analyze = expensive_reviews['text'].tolist()

expensive_topic_model = BERTopic(vectorizer_model = vectorizer_model, umap_model = umap_model,
                                 hdbscan_model = hdbscan_model, representation_model = representation_model)

expensive_topics, expensive_probs = expensive_topic_model.fit_transform(exp_reviews_to_analyze)

After training the model on the subset of Yelp reviews related to expensive businesses, we can review both how many and which documents were assigned to each topic. We can view these topics by calling the .get_topic_info() method on a topic model object. In the returned dataframe, topics are ranked by descending size, numbered from 0 up to one less than the total number of topics. Topic 0 is the most common topic with the greatest number of documents assigned to it, while Topic -1 is a collection of all the documents that do not fit into a topic (outliers) and should not be treated as a topic.
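
The ranking described above can be sketched with plain Python. The list of topic assignments below is hypothetical, standing in for the per-document topic IDs that a fitted topic model returns; the sketch only shows how outliers (-1) are set aside and how the remaining topics are ordered by size.

```python
from collections import Counter

# Hypothetical topic assignments for ten documents; -1 marks outliers.
assigned_topics = [0, 1, 0, -1, 2, 0, 1, -1, 0, -1]

counts = Counter(assigned_topics)
outlier_count = counts.pop(-1, 0)  # set outliers aside; they are not a real topic

# Rank the real topics by size, largest first.
ranked_topics = counts.most_common()

print("outliers:", outlier_count)
for topic_id, size in ranked_topics:
    print(f"topic {topic_id}: {size} documents")
```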

In [None]:
expensive_topic_model.get_topic_info()

When running .get_topic_info() on the expensive topic model, we see that BERTopic has returned around 3,500 topics alongside a group of outliers (labeled as -1). This many topics can be overwhelming to visualize, and the individual topics are too specific and complex to be useful. To combat this, we will force BERTopic to condense the number of topics it generates for the review dataset by using the .reduce_topics(docs, nr_topics = num_topics) method, where the nr_topics argument specifies the number of topics we wish the model to have. After running this method, the topic model is updated so that it only contains the specified number of topics. For this demonstration, I will reduce the number of topics to 20, although this can be fine-tuned based on the size of the data used.

<b>Note:</b> Reducing the number of topics to a value that is too low (e.g. 5 or 10 in my testing) can lead to topics that are too broad, and that therefore do not contain any useful information particular to a certain category of businesses.

In [None]:
expensive_topic_model.reduce_topics(exp_reviews_to_analyze, nr_topics=20)

After replacing the old topic model of expensive businesses with the reduced topic model, we can once again run the .get_topic_info() method to see which topics are deemed to be important by the model.

In [None]:
expensive_topic_model.get_topic_info()

In [None]:
# Note: You can save BERTopic models using .save("my_model", serialization="safetensors")
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
expensive_topic_model.save('expensive_Bert_model', serialization='safetensors', save_ctfidf = True, save_embedding_model = embedding_model)

After reducing the number of topics that the model can produce, the size of each of the significant topics has increased. By reducing the number of topics, we aim for topics that are neither too specific nor too broad to be useful, thus providing a better representation of which reviews fit in each topic.

At first glance, the model does not provide much insight as to why each document was assigned to a particular topic. To get a better understanding of why, we can first call the ```.approximate_distribution(docs, calculate_tokens = True)``` method to calculate the topic distributions on a token level, then pass the calculated distribution into the ```.visualize_approximate_distribution(doc[i], topic_token_distr[i])``` method to visualize the results.

In [None]:
# Calculate the topic distributions on a token-level
exp_topic_distr, exp_topic_token_distr = expensive_topic_model.approximate_distribution(exp_reviews_to_analyze, calculate_tokens=True)

In [None]:
# Visualize the token-level distributions
fig = expensive_topic_model.visualize_approximate_distribution(exp_reviews_to_analyze[0], exp_topic_token_distr[0])
fig.to_html('exp_app_dist.html')
fig

In [None]:
# Review that the above distribution relates to for reference:
exp_reviews_to_analyze[0]

There are two things we can gather from the results of this visualization. The first is that certain words have a greater influence than others on the placement of a review within a topic. For example, the combination "delicious food" is a significant indicator that the review is about food and restaurants, whereas the string "big shout out (to person)" more loosely implies that a service was performed. The second is the effect of the context on both sides of a word. The significant words (e.g. "delicious food") that point towards a topic have tails on both their left and right sides that decrease in importance the farther they are from the word(s). This is consistent with the way the approximate distribution is calculated, over overlapping windows of neighboring tokens, and it implies that the model is considering not just how a word fits a review into a certain topic, but also how the context around the word helps to decide which topic the review should be placed in.
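
One way to see why neighboring words receive a share of a key phrase's score is to score overlapping windows of tokens and spread each window's score over the tokens it contains. The sketch below is a simplified stand-in, not BERTopic's implementation: the keyword set, the window size, the scoring rule, and the example sentence are all assumptions made for illustration.

```python
# Simplified stand-in for token-level topic scoring; not BERTopic's actual implementation.
FOOD_KEYWORDS = {"delicious", "food"}  # hypothetical topic keywords
WINDOW = 4                              # assumed window size, moved one token at a time

def token_scores(tokens, keywords, window=WINDOW):
    """Score each overlapping window, then add that score to every token inside it."""
    scores = [0.0] * len(tokens)
    for start in range(max(len(tokens) - window + 1, 1)):
        chunk = tokens[start:start + window]
        # Window score: fraction of tokens in the window that are topic keywords.
        window_score = sum(tok in keywords for tok in chunk) / len(chunk)
        for offset in range(len(chunk)):
            scores[start + offset] += window_score
    return scores

tokens = "we stopped by last night and had delicious food with very friendly staff too".split()
for token, score in zip(tokens, token_scores(tokens, FOOD_KEYWORDS)):
    print(f"{token:>10} {score:.2f}")
```

The scores peak at "delicious food" and taper off symmetrically on both sides, which resembles the tails seen in the visualization.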

To get a better understanding of the relationships between reviews of expensive businesses, there are three main methods we can use to visualize their similarities. The first two methods visualize relationships on the topic level, once the reviews are already grouped, and the third method visualizes reviews on the individual document level, which provides insight as to why each document is grouped into each topic.

The first method is to create a visualization that relays how closely related each topic is to the others. We can model this relationship by considering the “distance” between them. The farther apart two topics are, the more unrelated they are; the shorter the distance, the more closely related they are. As an example, we would expect the statements "the doctor performs a surgery" and "the nurse helps a patient" to be relatively close together because they both convey information about performing tasks within the same field of work. These two statements would likely be placed within the same topic, called "medical", and would be located within the same circle of influence. However, "the dog barks" would be a relatively greater distance from the previous two statements, as it has little relation to either of them. This statement would be placed within its own topic, called "animals", some distance away from the "medical" topic. We can visualize this distance by calling the .visualize_topics() method on a topic model object.
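
The notion of distance on a two-dimensional map can be sketched as follows. The coordinates are hypothetical values invented for illustration (a real map would derive them from the fitted model), and the "pharmacy" topic is an extra made-up neighbor added so that there is a nearby topic to compare against.

```python
import math

# Hypothetical 2D map coordinates for three topics, invented for illustration.
topic_positions = {
    "medical": (1.0, 1.2),
    "pharmacy": (1.4, 0.9),
    "animals": (6.5, 5.8),
}

def map_distance(topic_a, topic_b):
    """Euclidean distance between two topics on the 2D map."""
    return math.dist(topic_positions[topic_a], topic_positions[topic_b])

print(round(map_distance("medical", "pharmacy"), 2))
print(round(map_distance("medical", "animals"), 2))
```

Related topics end up a short distance apart, while the unrelated "animals" topic sits far from both.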

In [None]:
fig = expensive_topic_model.visualize_topics()
plot(fig, filename = 'expensive_intertopic_dist_plot.html')
fig

When looking at the intertopic distance map, we can see that Topic 0, which is based around restaurants and food, is much closer to other topics centered around restaurants, such as Topic 13, which is about bars and dinner. On the other hand, the topics related to store service and appointments, such as Topic 1 (shops) and Topic 2 (salons), are a much greater distance away. These clusters within the visualization make it simple to pinpoint which topics are closely related, while the size of each topic’s sphere of influence gives insight into how significant each topic is, as reviewers tend to mention the topics that matter to them with greater frequency.

The second method, also used to visualize the relationship between topics, is to create a heatmap that relays the similarities between the topics. The heatmap is generated from the similarity score of each pair of topics. The closer this score is to 1, the more related the topics are, while the closer this score is to 0, the more unrelated the topics are.
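
A pairwise similarity matrix of this kind can be sketched with cosine similarity, a common choice for comparing embedding vectors. The topic vectors below are hypothetical three-dimensional values invented for illustration; real topic representations have many more dimensions.

```python
import math

# Hypothetical topic vectors, invented for illustration.
topic_vectors = {
    "restaurants": [0.9, 0.8, 0.1],
    "bars": [0.7, 0.9, 0.3],
    "salons": [0.1, 0.2, 0.9],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

names = list(topic_vectors)
similarity = {
    (a, b): cosine_similarity(topic_vectors[a], topic_vectors[b])
    for a in names for b in names
}

# Print one row of the matrix per topic.
for a in names:
    print(a, [round(similarity[(a, b)], 2) for b in names])
```

Each topic has a similarity of 1 with itself, and the matrix is symmetric, which is why a heatmap of it mirrors across the diagonal.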

In [None]:
fig = expensive_topic_model.visualize_heatmap(n_clusters = 5)
plot(fig, filename = 'expensive_similarity_mat_plot.html')
fig

From this similarity heatmap, we can see which aspects of a business depend on each other. For example, Topic 0 and Topic 13 have a relatively high similarity score of 0.648. Knowing that Topic 0 relates to restaurants and Topic 13 partially relates to music, it can be inferred that an expensive restaurant looking to improve its reviews should consider setting a fine atmosphere complete with appropriate and enjoyable music.

The third method analyzes the relation between reviews on the individual document level. Similar to the intertopic distance map created with the first method, we can use the concept of “distance” to model the relationships between individual reviews. Because each review is displayed on this graph, we can see which topic cluster it belongs to; each review is colored according to its topic.

In [None]:
fig = expensive_topic_model.visualize_documents(exp_reviews_to_analyze)
plot(fig, filename = 'expensive_document_dist_plot.html')
fig

With this visualization, we can toggle the exact topics we wish to see by clicking on the legend in the top right. When observing the top four topics on the graph, we can see some patterns emerge that coincide with the results from the other two methods. From the intertopic distance map, we expect the documents clustered around Topic 0 to be significantly farther away from the documents that make up the clusters of Topics 1, 2, 3, and 4, and this is the case. Furthermore, based on the similarity matrix, we see a lot of expected overlap between Topics 1 and 4, and Topics 2 and 3, which have high similarity scores between each respective pair of topics. By visualizing the documents on the individual level, we are able to combine the findings of the previous two methods while also being able to visualize the variance that is present between each document and its respective topic. As expected, the larger topics with larger spheres of influence tend to show greater variation in how far their associated documents are from the center.

With its extensive support for visualizations and its tools for assessing meaning using context, BERTopic has shown itself to be a powerful tool for language analysis. For businesses that want to improve client reviews, the most important step is the first step: identifying what needs to change. When we ran the BERTopic model on the subset of expensive businesses, we were able to observe not only which topics were frequently mentioned and critiqued, but also which topics depended on each other. The latter can arguably be more important for increasing average review ratings, as the largest topics are often just a collection of smaller topics. The largest topic may concern restaurant quality, but a restaurant is only as good as its food, atmosphere, and customer service, which are completely different topics. Now that we understand how to find these dependencies, we can further extend the usage of BERTopic to discover what makes a positive review versus a negative one using "classes".

### How Can We Use BERTopic to Distinguish Between a Positive and Negative Review?

Classes within BERTopic provide a way to split the data into categories based on a certain condition or attribute. In this example, I am analyzing medical businesses. Given the nature of reviews, it is intuitive to split them into classes based on how many stars each review received. For instance, "Class 1" would be the collection of all reviews that received a one star rating. If we then model these classes using a categorical bar chart, we can see which topics are mentioned more frequently at each star level.

In [None]:
# 1) create a subset of the reduced_business dataframe to only include businesses categorized as medical
#    (na=False treats businesses with missing category information as non-matches)
medical = reduced_business[(reduced_business['categories'].str.contains('Doctors', na=False)) | (reduced_business['categories'].str.contains('Medical', na=False))]

# 2) perform a left merge on each business' unique ID to associate each review with its business;
#    reviews of non-medical businesses find no match and receive NaN in the business columns
medical_reviews = pd.merge(reviews, medical, how = 'left', on = 'business_id')

# 3) drop the unmatched rows (those with no category information) so that only reviews of medical businesses remain
medical_reviews = medical_reviews.dropna(subset=['categories'])

# 4) reset the index of the new dataframe (will be necessary if a class-based analysis is performed)
medical_reviews = medical_reviews.reset_index()
medical_reviews.head()

In [None]:
medical_reviews_to_analyze = medical_reviews['text'].tolist()

# build the class labels: one star rating per review, in the same order as the documents
classes = medical_reviews['stars'].tolist()
medical_topic_model = BERTopic(vectorizer_model = vectorizer_model, umap_model = umap_model,
                               hdbscan_model = hdbscan_model, representation_model = representation_model)

medical_topics, medical_probs = medical_topic_model.fit_transform(medical_reviews_to_analyze)

In [None]:
# reduce the number of topics for topic clarity
medical_topic_model.reduce_topics(medical_reviews_to_analyze, nr_topics = 20)

Using the .topics_per_class(docs, classes) method, we can create a new object that can be passed into a categorical bar chart visualization.
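
To build some intuition for what a per-class topic table contains, here is a minimal sketch in plain Python. It only reproduces the counting step, under the assumption that each review already has a topic id and a star rating; the topic ids and ratings below are made up, and BERTopic itself does more than this (for example, it also derives representative words for each topic within each class).

```python
from collections import Counter

# Hypothetical topic id per review, as fit_transform might return (made-up values).
topics = [0, 0, 8, 5, 0, 8, 5, 0]
# Star rating of each review, aligned by position with `topics`.
classes = [1, 5, 5, 1, 5, 4, 2, 1]

# Count how many reviews fall into each (class, topic) pair.
pair_counts = Counter(zip(classes, topics))

# Arrange the counts as rows resembling a topics-per-class table.
rows = [
    {"Class": star, "Topic": topic, "Frequency": freq}
    for (star, topic), freq in sorted(pair_counts.items())
]

for row in rows:
    print(row)
```

The important detail is alignment: the i-th entry of `classes` must describe the i-th document, which is why the notebook resets the dataframe index before building both lists.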

In [None]:
# this object will be used to model the frequency of topics discussed split by class
medical_topics_per_class = medical_topic_model.topics_per_class(medical_reviews_to_analyze, classes=classes)

In [None]:
# Note: You can save BERTopic models using .save("my_model", serialization="safetensors")
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
medical_topic_model.save('Medical_Bert_model', serialization='safetensors', save_ctfidf = True, save_embedding_model = embedding_model)

In [None]:
medical_topic_model.get_topic_info()

In [None]:
fig = medical_topic_model.visualize_topics_per_class(medical_topics_per_class)
plot(fig, filename = 'medical_by_class_plot.html')
fig

When looking at the visualization, two types of topics begin to stand out: those that have a high frequency at the extremes (i.e. 1 and 5 star reviews) but a low frequency elsewhere, and those that only begin to appear as the reviews approach 5 stars. The first kind of topic is best shown by Topic 0, which is represented by words such as appointment and doctor. These could be considered the "make or break" topics for reviews. They are mentioned either because the experience was phenomenal or because it was particularly horrid. In the example of medical businesses, doctors are the central component. If a patient has a memorable doctor, whether for a positive or negative reason, it will affect their entire perception of the visit because seeing the doctor was, most of the time, the entire point of the visit. There is no "atmosphere" to make up for a poor doctor like there is in the restaurant industry. Thus, reviews that were elevated to 5 stars or tanked to 1 most likely involved a memorable doctor experience that either "made" or "broke" the trip for the customer.

The second kind of topic is the "nice to have" variety. These topics are not necessary for a positive experience, but they make a noticeable impact on top of the customer's main reason for attending. This is best shown by Topic 8, which is based around environment and comfort. Within one star reviews, the environment is hardly mentioned because other major factors, such as a poor doctor, likely overshadowed it. However, once customers are enjoying their time, they begin to relax and take in their surroundings, noticing things that might otherwise be missed, such as the relaxing environments in Topic 8 of medical businesses. This can also work in the other direction, notably in customer service. Good customer service can often go overlooked, but poor customer service is often remembered with far greater resentment, as demonstrated by Topic 5 of medical business reviews.

However, what if we instead wanted to view how these topics changed over time to see what is most relevant to the modern reviewer?

### How Can We Use BERTopic to Model Significant Review Topics Over Time?

Reviewers are humans, and humans frequently change their minds about what is important to them in a service. How can we model this change in topic significance over time? As long as the reviews are dated, BERTopic can be used to visualize how frequently a topic is mentioned over a span of time. To illustrate this concept, we will take a look at sports and fitness businesses and see how the topics important to their customers have evolved since reviews were first uploaded to the Yelp platform.

In [None]:
# 1) create a subset of the reduced_business dataframe to only include businesses categorized as sports and fitness
#    (na=False treats businesses with missing category information as non-matches)
sports = reduced_business[(reduced_business['categories'].str.contains('Sports', na=False)) | (reduced_business['categories'].str.contains('Fitness', na=False))]

# 2) perform a left merge on each business' unique ID to associate each review with its business;
#    reviews of businesses outside sports and fitness find no match and receive NaN in the business columns
sports_reviews = pd.merge(reviews, sports, how = 'left', on = 'business_id')

# 3) drop the unmatched rows (those with no category information) so that only reviews of sports and fitness businesses remain
sports_reviews = sports_reviews.dropna(subset=['categories'])

# 4) modify the "date" column to include just the year to model frequency of topics over time
sports_reviews['date'] = pd.to_datetime(sports_reviews['date']).dt.strftime('%Y')

# 5) reset the index of the new dataframe (will be necessary if a class-based analysis is performed)
sports_reviews = sports_reviews.reset_index()
sports_reviews.head()

In [None]:
sports_reviews_to_analyze = sports_reviews['text'].tolist()
sports_dates = sports_reviews['date'].tolist()

sports_topic_model = BERTopic(vectorizer_model = vectorizer_model, umap_model = umap_model,
                              hdbscan_model = hdbscan_model, representation_model = representation_model)

sports_topics, sports_probs = sports_topic_model.fit_transform(sports_reviews_to_analyze)

In [None]:
sports_topic_model.reduce_topics(sports_reviews_to_analyze, nr_topics = 20)

Using the .topics_over_time(docs, dates) method, we can create a new object that can be passed into a frequency over time visualization.
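
As a rough illustration of the bookkeeping behind a topics-over-time table, the sketch below reduces timestamps to years and counts topic mentions per year using only the standard library. The timestamps, their "YYYY-MM-DD HH:MM:SS" layout, and the topic ids are all assumptions made for this example; BERTopic performs additional work, such as recomputing topic words for each time slice.

```python
from collections import Counter
from datetime import datetime

# Made-up review timestamps; the "YYYY-MM-DD HH:MM:SS" layout is an assumption.
raw_dates = [
    "2019-03-02 10:15:00",
    "2020-06-11 18:40:12",
    "2020-09-30 08:05:44",
    "2021-01-17 12:00:00",
]
# Hypothetical topic id per review, aligned by position with raw_dates.
topics = [3, 1, 1, 7]

# Keep only the year, mirroring the strftime('%Y') step in the notebook.
years = [
    datetime.strptime(d, "%Y-%m-%d %H:%M:%S").strftime("%Y") for d in raw_dates
]

# Frequency of each topic within each year.
freq = Counter(zip(years, topics))

for (year, topic), n in sorted(freq.items()):
    print(year, topic, n)
```

Collapsing dates to the year keeps the number of time bins small, which tends to make the resulting plot easier to read.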

In [None]:
sports_topics_over_time = sports_topic_model.topics_over_time(sports_reviews_to_analyze, sports_dates)

In [None]:
sports_topic_model.get_topic_info()

In [None]:
# Note: You can save BERTopic models using .save("my_model", serialization="safetensors")
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
sports_topic_model.save('Sports_Bert_model', serialization='safetensors', save_ctfidf = True, save_embedding_model = embedding_model)

In [None]:
# Note: I removed review data from 2022 for this visualization as there were not enough data points to generate a good plot
fig = sports_topic_model.visualize_topics_over_time(sports_topics_over_time[sports_topics_over_time['Timestamp'] != '2022-01-01'],
                                                    normalize_frequency = True)
plot(fig, filename = 'sports_topics_over_time_plot.html')
fig

When looking at the normalized frequency distribution over time, we begin to understand how customers' priorities have changed. Starting with the elephant in the room, 2020 has a massive frequency spike for one of the topics. This topic, of course, revolves around the issue of masking in public areas, which undoubtedly applied to sports and fitness businesses. During this time, companies' decisions regarding how they would enforce masking policies were under intense scrutiny from reviewers. The chart suggests that having an effective and popular masking policy would most likely have led to better reviews more quickly. With this context from recent history, the same reading can be applied to other topics. For example, Topic 7 demonstrates how the demand for quality pool instructors has steadily increased over the years. In contrast, Topic 3 shows how interest in shoes tailored for sports use within certain department stores (i.e. a Big-5 type chain) has been on a decline since 2016.
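
To make the idea of a normalized frequency concrete, here is a small sketch with made-up yearly counts for two hypothetical topics. It assumes each topic is rescaled independently of the others and uses division by the topic's own peak purely for illustration; the scaling BERTopic applies internally may differ.

```python
# Made-up yearly mention counts for two hypothetical topics.
years = ["2018", "2019", "2020", "2021"]
counts = {
    "equipment": [400, 420, 410, 430],  # large, steady topic
    "masks": [2, 3, 60, 20],            # small topic with a 2020 spike
}


def scale_by_peak(values):
    """Divide each value by the series maximum so the peak becomes 1.0."""
    peak = max(values)
    return [v / peak for v in values]


# Rescale every topic on its own, independent of the other topics.
normalized = {name: scale_by_peak(vals) for name, vals in counts.items()}

for name, vals in normalized.items():
    print(name, [round(v, 2) for v in vals])
```

In raw counts the steady topic would dwarf the small one in every year, whereas after per-topic rescaling the small topic's spike becomes the most visible feature of the chart.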

## Conclusion

Businesses are constantly engaged in a mind game with their customers. These corporations have only a finite amount of capital to spend on generating a positive user experience to make up for the negative impact they impose on customers, mainly taking their money in order to turn a profit. With such limited resources to please the customer while still turning a profit, how can businesses decide which improvements to their service to focus on? This is why being able to model human language through machine learning is so revolutionary. Models that harness sentence transformers, which turn inconsistent language semantics into simple numerical vectors, are able to embrace the chaos of language and return simpler results that are easier for programmers to understand. Through the usage of BERTopic and similar models, we are able to pinpoint areas of interest and show how they change over time, or whether they are noticed when a user has a positive or negative experience. This allows businesses to learn how to better allocate their finite resources to achieve the desired result: higher average star ratings in their reviews.

## Credits

<b>Yelp dataset:</b>  
- Link: [Yelp Open Dataset](https://www.yelp.com/dataset)  
   
<b>BERTopic documentation:</b>  
- Title: BERTopic: Neural topic modeling with a class-based TF-IDF procedure  
- Author: Grootendorst, Maarten  
- Year: 2022  
- Link: [BERTopic Documentation](https://maartengr.github.io/BERTopic/index.html#citation)  
  
<b>VADER Documentation:</b>  
- Title: VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text  
- Authors: Hutto, C.J. and Gilbert, E.E.  
- Year: 2014  
- Link: [VADER Documentation](https://github.com/cjhutto/vaderSentiment)

<b>Sentence Transformer documentation:</b>  
- Title: Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks  
- Authors: Reimers, Nils and Gurevych, Iryna  
- Year: 2019  
- Link: [Sentence Transformer Documentation](https://github.com/UKPLab/sentence-transformers/blob/master/index.rst)