## 1. Problem Definition
Can reviews be faked? It is harder to fake a review on Airbnb than on many other booking sites: only guests who have verified their identity can book a place, and only guests who have checked out of their stay can leave a review for that listing or host.
The study, "Analyzing Airbnb Reviews with a Hierarchical Topic Model," takes an algorithmic approach to understanding travelers' experiences through their online reviews. Using topic modeling, a statistical method, the model should uncover latent topics across many user reviews on Airbnb's website. The aim is to identify distinct categories that give insight into travelers' overall experience with regard to comfort, spatial arrangements, and other typical amenities offered by most accommodation providers.


## 2. Installing and Importing the Necessary Python Libraries

In [1]:
# Install bertopic
!pip install bertopic

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting bertopic
  Downloading bertopic-0.13.0-py2.py3-none-any.whl (103 kB)
Collecting umap-learn>=0.5.0
  Downloading umap-learn-0.5.3.tar.gz (88 kB)
  Preparing metadata (setup.py) ... done
Collecting hdbscan>=0.8.29
  Downloading hdbscan-0.8.29.tar.gz (5.2 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting sentence-transformers>=0.4.1
  Downloading sentence-transform

In [2]:
# Data processing
import pandas as pd
import numpy as np
# Text preprocessing
import nltk
nltk.download('stopwords')
# Dimension reduction
from umap import UMAP
# Clustering
from hdbscan import HDBSCAN
# Count vectorization
from sklearn.feature_extraction.text import CountVectorizer
# Topic model
from bertopic import BERTopic

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


## 3. Getting the Data Ready

The Airbnb data is publicly available for research on a website called Inside Airbnb. We used the review data for Washington D.C. for this analysis, but the website also provides listing data for other locations around the world. Here is a link to the dataset: http://insideairbnb.com/get-the-data/

Now let’s read the data into a pandas dataframe and see what the dataset looks like.

The dataset has multiple columns, but we read only the comments column because the reviews are the only information needed for this project. The dataset has over three hundred thousand reviews; we read the first ten thousand to keep the computation manageable and save time on each iteration.

In [4]:
# Read in data
df = pd.read_csv('/content/drive/MyDrive/filename/reviews.csv.gz', nrows=10000, usecols=['comments'], compression='gzip')
# Take a look at the data
df.head()

Unnamed: 0,comments
0,"Don's apartment is comfortable, clean and well..."
1,Don was a great host. He went out of his way t...
2,This place was great! Don was extremely helpfu...
3,This condo was in a great location for touring...
4,"The apartment was very clean, comfortable and ..."


In [5]:
# Get the dataset information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 1 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   comments  10000 non-null  object
dtypes: object(1)
memory usage: 78.2+ KB


## 4. Removing Noise from the Topic Top Words
We will eliminate noise from the top words of the topic model. There are three main sources of noise that can affect the interpretation of the topics: stop words, personal names, and domain-specific words. Stop words are words that are commonly used in sentences but carry little meaning on their own, such as 'the' and 'for'. The Python package NLTK contains a list of 179 English stop words.

In [6]:
# NLTK English stopwords
stopwords = nltk.corpus.stopwords.words('english')
# Print out the NLTK default stopwords
print(f'There are {len(stopwords)} default stopwords. They are {stopwords}')

There are 179 default stopwords. They are ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'no

Personal names are high-frequency words in Airbnb reviews because reviewers like to mention hosts’ names. Names are therefore likely to become top words representing a topic, which makes the topics difficult to interpret.

To remove the hosts’ names from the top keywords representing the topics, I downloaded the list of frequently occurring surnames from https://www.census.gov/topics/population/genealogy/data/2000_surnames.html. It contains the surnames that occur 100 or more times.

1. The US Census Bureau surname frequency dataset has multiple columns. We read only the name column because only the names are needed for this tutorial.

2. After the names are read into a pandas dataframe, they are converted from uppercase to lowercase because the topic model uses lowercase.

3. Two name lists are created from the lowercase names: one holds the names, and the other holds each name plus the letter s. This is because many reviewers mention the host's apartment, as in "Don's apartment". The word "Don's" becomes "Dons" after punctuation is removed, so we need to remove the name plus the letter s from the top words as well.

4. Each name list has 151,671 names, and the three most frequent names are Smith, Johnson, and Williams.

In [7]:
# Read in names
names = pd.read_csv('/content/drive/MyDrive/filename/app_c.csv', usecols=['name'])
# Host name list
name_list = names['name'].str.lower().tolist()
# Host's name list
names_list = list(map(( lambda x: str(x)+'s'), name_list))
# Print out the number of names
print(f'There are {len(name_list)} names in the surname list, and the top three names are {name_list[:3]}.')

There are 151671 names in the surname list, and the top three names are ['smith', 'johnson', 'williams'].
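
As a rough illustration of why the second list is needed, the sketch below strips punctuation from a review snippet using only the standard library. The helper `strip_punctuation` and the three-name sample list are assumptions made for this example; the notebook itself works with the full Census list and leaves tokenization to the topic model.

```python
import string

def strip_punctuation(text):
    """Lowercase the text and delete ASCII punctuation characters."""
    return text.lower().translate(str.maketrans('', '', string.punctuation))

# Hypothetical three-name sample standing in for the full name list
sample_names = ['don', 'smith', 'johnson']
# Possessive variants: each name with a trailing "s"
sample_names_s = [f'{name}s' for name in sample_names]

tokens = strip_punctuation("Don's apartment is comfortable.").split()
# Keep only tokens that are neither a name nor a name plus "s"
blocked = set(sample_names + sample_names_s)
kept = [token for token in tokens if token not in blocked]
print(tokens)
print(kept)
```

Without the second list, the token `dons` would survive the filter and could still surface as a top word.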


Domain-specific words are high-frequency words related to the business. In Airbnb reviews, reviewers frequently mention the words airbnb, time, would, and stay. Because I am using the data for Washington D.C., the word dc is frequent too.

In [8]:
# Domain specific words to remove
airbnb_related_words = ['stay', 'airbnb', 'dc', 'would', 'time']

Removing noise is an iterative process: we can add new words to the list if they appear in the top topic words without adding useful meaning. For example, some less common names such as natasha can appear as top words. The word "also" appears in the top words as well but provides no useful information about the topic, so we remove such words.

In [9]:
# Other words to remove
other_words_to_remove = ['natasha', 'also', 'vladi']

To eliminate noise from the top words representing the topics, we extended the stop-word list with the two name lists, the Airbnb-specific words, and the other words to remove. After the extension, the list holds 303,529 entries that are excluded from the top words, which makes the topics easier to interpret.

In [10]:
# Expand stopwords
stopwords.extend(name_list + names_list + airbnb_related_words + other_words_to_remove)
print(f'There are {len(stopwords)} stopwords.')

There are 303529 stopwords.
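
The reported total can be checked by hand from the list sizes given earlier: 179 NLTK stop words, two name lists of 151,671 entries each, five domain words, and three extra words. The short sketch below reproduces that arithmetic; the variable names are illustrative only. Because `list.extend` does not deduplicate, the total counts list entries rather than unique words.

```python
# Sizes taken from the outputs shown above
n_nltk_stopwords = 179
n_names = 151_671          # entries in name_list
n_names_plus_s = 151_671   # entries in names_list
n_domain_words = 5         # airbnb_related_words
n_other_words = 3          # other_words_to_remove

total = (n_nltk_stopwords + n_names + n_names_plus_s
         + n_domain_words + n_other_words)
print(total)  # 303529
```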


## 5. Building a Basic BERTopic Model
The Hierarchical Topic Model (HTM) produces results that reveal the similarities among topics and help us understand the topic-subtopic structure. To begin, we will construct a basic BERTopic model.

The BERTopic model uses UMAP (Uniform Manifold Approximation and Projection) for dimensionality reduction. By default, BERTopic produces varying results from run to run because of the stochasticity inherited from UMAP. To get reproducible topics, we pass a value to the random_state parameter when creating the UMAP model.

*   n_neighbors=15 means that the local neighborhood size for UMAP is 15. This is the parameter that controls the balance between local and global structure in the data.
    1. A low value forces UMAP to focus more on local structure, and may lose insights into the big picture.
    2. A high value pushes UMAP to look at the broader neighborhood, and may lose details of the local structure.
    3. The default n_neighbors value for UMAP is 15.
*   n_components=5 indicates that the target dimension from UMAP is 5. This is the dimension of the data that will be passed into the clustering model.
*   min_dist controls how tightly UMAP is allowed to pack points together. It is the minimum distance between points in the low-dimensional space.
    1. Small values of min_dist result in clumpier embeddings, which is good for clustering. Since our goal of dimension reduction is to build clustering models, we set min_dist to 0.
    2. Large values of min_dist prevent UMAP from packing points together and preserve the broad structure of the data.
*   metric='cosine' indicates that we will use cosine distance to measure the distance between points.
*   random_state sets a random seed to make the UMAP results reproducible.
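
To make the cosine metric concrete, here is a minimal pure-Python sketch of cosine distance, taken as one minus the cosine similarity. The function name and the toy vectors are assumptions for illustration; UMAP computes the distance internally on the real review embeddings.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical 3-dimensional vectors standing in for review embeddings
same_direction = cosine_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_distance([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

Vectors that point the same way have a distance close to 0 regardless of their length, which is why cosine distance is a common choice for text embeddings.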

CountVectorizer counts word frequencies. Passing it the extended stop-word list removes noise from the top words representing each topic.

In [11]:
# Initiate UMAP
umap_model = UMAP(n_neighbors=15, 
                  n_components=5, 
                  min_dist=0.0, 
                  metric='cosine', 
                  random_state=100)
# Count vectorizer
vectorizer_model = CountVectorizer(stop_words=stopwords)



*   umap_model takes the model for dimensionality reduction. We are using the UMAP model for this project, but it can be another dimensionality reduction model such as PCA (Principal Component Analysis).
*   vectorizer_model takes the term vectorization model. The extended stop words list is passed into the BERTopic model through CountVectorizer.


*   diversity helps to remove the words with the same or similar meanings. It has a range of 0 to 1, where 0 means least diversity and 1 means most diversity.
*   min_topic_size is the minimum number of documents in a topic. min_topic_size=200 means that a topic needs to have at least 200 reviews.


*   top_n_words=4 indicates that we will use the top 4 words to represent the topic.
*   language has English as the default. We set it to multilingual because there are multiple languages in the Airbnb reviews.

*   calculate_probabilities=True means that the probabilities of each document belonging to each topic are calculated. The topic with the highest probability is the predicted topic for a new document. This probability represents how confident we are about finding the topic in the document.
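
The selection rule described in the last bullet can be sketched in a few lines of plain Python. This is only an illustration of "pick the most probable topic", not BERTopic's internal code; `predict_topic` and the probability values are hypothetical.

```python
def predict_topic(probabilities):
    """Return (topic_index, probability) for the most probable topic."""
    best_topic = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return best_topic, probabilities[best_topic]

# Hypothetical probabilities of one review over three topics
review_probabilities = [0.10, 0.65, 0.25]
topic, confidence = predict_topic(review_probabilities)
print(topic, confidence)
```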









In [12]:
# Initiate BERTopic
topic_model = BERTopic(umap_model=umap_model, 
                       vectorizer_model=vectorizer_model, 
                       diversity=0.8, 
                       min_topic_size=200,
                       top_n_words=4,
                       language="multilingual",
                       calculate_probabilities=True)
# Run BERTopic model; fit_transform returns the topic assignments and the probabilities
topics, probabilities = topic_model.fit_transform(df['comments'])
# Get the list of topics
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name
0,-1,2442,-1_recommend_restaurants_stayed_bed
1,0,4564,0_clean_neighborhood_restaurants_bed
2,1,515,1_breakfast_neighborhood_recommend_feel
3,2,513,2_bike_bed_neighbourhood_efficiency
4,3,487,3_comfortable_meet_neighborhood_subway
5,4,447,4_cathedral_recommend_basement_feel
6,5,282,5_comfortable_breakfast_tv_organic
7,6,280,6_apartment_recommend_bathroom_feel
8,7,242,7_great_neighborhood_restaurants_comfortable
9,8,228,8_stadium_parking_staying_bathroom


Calling the get_topic_info() method on the topic model gives us a list of topics. The output has 10 rows in total: one for the outlier topic -1 and nine for topics 0 to 8.

*   Topic -1 should be ignored. It indicates that the reviews are not assigned to any specific topic. The count for topic -1 is 2442, meaning that there are 2442 outlier reviews that do not belong to any topic.
*   Topics 0 to 8 are the 9 topics created for the reviews. They are ordered by the number of reviews in each topic, so topic 0 has the highest number of reviews.
*   The Name column lists the top terms for each topic. For example, the top 4 terms for Topic 0 are clean, neighborhood, restaurants, and bed, indicating that it is a topic related to a convenient neighborhood.
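
As a quick sanity check on the table, the counts should add up to the 10,000 reviews that were read. The sketch below copies the counts from the output above into a plain dictionary (the variable names are illustrative) and derives the share of outlier reviews.

```python
# Review counts per topic, copied from the get_topic_info() output above
topic_counts = {-1: 2442, 0: 4564, 1: 515, 2: 513, 3: 487,
                4: 447, 5: 282, 6: 280, 7: 242, 8: 228}

total_reviews = sum(topic_counts.values())
outlier_share = topic_counts[-1] / total_reviews
# Largest topic, ignoring the outlier topic -1
largest_topic = max((t for t in topic_counts if t != -1), key=topic_counts.get)
print(total_reviews, f'{outlier_share:.1%}', largest_topic)
```

Roughly a quarter of the reviews end up as outliers with these settings.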



## 6. Merging Topics Manually
If we would like to manually pick which topics to merge together based on domain knowledge, we can list the topic numbers and pass them into the merge_topics function.

In this example, we merged topic 1 and topic 5 together because they both talk about breakfast, and merged topic 0 and topic 7 together because they both talk about neighborhood restaurants.

In [13]:
# Topic to merge
topics_to_merge = [[1, 5],
                   [0, 7]]
# Merge topics                   
topic_model.merge_topics(df['comments'], topics_to_merge)
# Get the list of topics
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name
0,-1,2442,-1_recommend_restaurants_stayed_bed
1,0,4806,0_neighborhood_recommend_restaurants_bed
2,1,797,1_breakfast_neighborhood_recommend_bathroom
3,2,513,2_clean_bed_bike_neighbourhood
4,3,487,3_comfortable_neighborhood_meet_subway
5,4,447,4_cathedral_recommend_basement_feel
6,5,280,5_apartment_recommend_bathroom_feel
7,6,228,6_stadium_parking_staying_bathroom


We can see that the number of topics is reduced by two, and we now have 7 topics.
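
The new counts follow directly from the old ones. The sketch below is a plain-Python imitation of the bookkeeping, assuming (as the output suggests) that merged topics pool their reviews and that the remaining topics are renumbered by size. `merge_counts` is a hypothetical helper, not part of BERTopic.

```python
def merge_counts(counts, groups):
    """Merge topic counts by group, then renumber topics by descending size."""
    merged = dict(counts)
    for group in groups:
        keep, *others = group
        for topic in others:
            merged[keep] += merged.pop(topic)
    # The outlier topic -1 keeps its label; the rest are re-sorted by size
    outliers = merged.pop(-1)
    ranked = sorted(merged.values(), reverse=True)
    return {-1: outliers, **{new_id: n for new_id, n in enumerate(ranked)}}

# Counts before merging, copied from the first get_topic_info() output
counts_before = {-1: 2442, 0: 4564, 1: 515, 2: 513, 3: 487,
                 4: 447, 5: 282, 6: 280, 7: 242, 8: 228}
counts_after = merge_counts(counts_before, [[1, 5], [0, 7]])
print(counts_after)
```

Topic 0 grows to 4564 + 242 = 4806 reviews and topic 1 to 515 + 282 = 797, which matches the table above.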

## 7. Extracting Topic Hierarchy
After finishing building the basic BERTopic model, we can extract the hierarchical structure using hierarchical_topics. The output is a dataframe with 8 columns.


*   Parent_ID is a new topic ID created for the parent topics.
*   Parent_Name is a list of top words describing the parent topic.
*   Topics is a list of child topic numbers included in the parent topic. All the child topic numbers in this column are from the basic BERTopic model.

*   Child_Left_ID is the left child topic number. This child topic number can be from the basic BERTopic model or an existing parent topic number.
*   Child_Left_Name has the top words describing the left child topic.
*   Child_Right_ID is the right child topic number. Similar to Child_Left_ID, this child topic number can be from the basic BERTopic model or an existing parent topic number.
*   Child_Right_Name has the top words describing the right child topic.
*   Distance shows the distance between the left and right child topics.






In [14]:
# Hierarchical topics
hierarchical_topics = topic_model.hierarchical_topics(df['comments'])
# Take a look at the data
hierarchical_topics

100%|██████████| 6/6 [00:01<00:00,  3.48it/s]


Unnamed: 0,Parent_ID,Parent_Name,Topics,Child_Left_ID,Child_Left_Name,Child_Right_ID,Child_Right_Name,Distance
5,12,apartment_recommend_parking_distance,"[0, 1, 2, 3, 4, 5, 6]",6,stadium_parking_staying_bathroom,11,neighborhood_recommend_restaurants_bed,0.37835
4,11,neighborhood_recommend_restaurants_bed,"[0, 1, 2, 3, 4, 5]",10,neighborhood_recommend_restaurants_bed,5,apartment_recommend_bathroom_feel,0.239829
3,10,neighborhood_recommend_restaurants_bed,"[0, 1, 2, 3, 4]",8,breakfast_recommend_feel_apartment,9,neighborhood_recommend_restaurants_bed,0.222739
2,9,neighborhood_recommend_restaurants_bed,"[0, 2, 3]",7,clean_neighborhood_restaurants_bed,3,comfortable_neighborhood_meet_subway,0.191799
1,8,breakfast_recommend_feel_apartment,"[1, 4]",4,cathedral_recommend_basement_feel,1,breakfast_neighborhood_recommend_bathroom,0.160202
0,7,clean_neighborhood_restaurants_bed,"[0, 2]",2,clean_bed_bike_neighbourhood,0,neighborhood_recommend_restaurants_bed,0.117556
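
One way to read this table is to walk from a parent down to the base topics beneath it. The sketch below does that with the ID columns copied from the output above; `leaf_topics` is a hypothetical helper written for this illustration, and its results should match the Topics column.

```python
# (Parent_ID, Child_Left_ID, Child_Right_ID) copied from the table above
hierarchy_rows = [
    (12, 6, 11),
    (11, 10, 5),
    (10, 8, 9),
    (9, 7, 3),
    (8, 4, 1),
    (7, 2, 0),
]
children = {parent: (left, right) for parent, left, right in hierarchy_rows}

def leaf_topics(topic_id):
    """Return the sorted base topics found under a node of the hierarchy."""
    if topic_id not in children:   # not a parent, so it is a base topic
        return [topic_id]
    left, right = children[topic_id]
    return sorted(leaf_topics(left) + leaf_topics(right))

for parent_id in sorted(children):
    print(parent_id, leaf_topics(parent_id))
```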


## 8. Creating a Mapping Between Parent and Child Topics
We will take a close look at the topic hierarchy and create a mutually exclusive mapping between the parent topics and the child topics.

Using topic_model.visualize_hierarchy, we can visualize the hierarchical structure of the topics. Hovering the mouse over the black dots shows the top words for the level of the hierarchy.

In [15]:
# Visualize hierarchical topics
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)

Another way to visualize the topic hierarchy is to create a topic tree using topic_model.get_topic_tree. This view lists the topic representations for all hierarchy levels.

In [16]:
# Topic tree
tree = topic_model.get_topic_tree(hierarchical_topics)
# Print out the tree
print(tree)

.
├─■──stadium_parking_staying_bathroom ── Topic: 6
└─neighborhood_recommend_restaurants_bed
     ├─neighborhood_recommend_restaurants_bed
     │    ├─breakfast_recommend_feel_apartment
     │    │    ├─■──cathedral_recommend_basement_feel ── Topic: 4
     │    │    └─■──breakfast_neighborhood_recommend_bathroom ── Topic: 1
     │    └─neighborhood_recommend_restaurants_bed
     │         ├─clean_neighborhood_restaurants_bed
     │         │    ├─■──clean_bed_bike_neighbourhood ── Topic: 2
     │         │    └─■──neighborhood_recommend_restaurants_bed ── Topic: 0
     │         └─■──comfortable_neighborhood_meet_subway ── Topic: 3
     └─■──apartment_recommend_bathroom_feel ── Topic: 5



After examining the topic hierarchy, we decide to create four parent topics.

*   Parent topic 1 is about breakfast. It includes child topics 1 and 4.
*   Parent topic 2 is about the neighborhood with a focus on restaurants. It includes child topics 0, 2, and 3.
*   Parent topic 3 is about the apartment itself with comments on the bathroom. It includes child topic 5.
*   Parent topic 4 is about the stadium. It includes child topic 6.

A function called topic_mapping is created for the mapping between the parent topics and the child topics. It returns None for the outlier topic -1, which has no parent topic.



In [17]:
# A function to map parent and child topics
def topic_mapping(child_topic_number):
  if child_topic_number in [1, 4]:
    return 'P1_breakfast_recommend_feel_apartment'
  elif child_topic_number in [0, 2, 3]:
    return 'P2_neighborhood_recommend_restaurants_bed'
  elif child_topic_number in [5]:
    return 'P3_apartment_recommend_bathroom_feelm'
  elif child_topic_number in [6]:
    return 'P4_stadium_parking_staying_bathroom'

## 9. Topic Model In-sample Hierarchical Predictions
First, we get the predicted child topic numbers from the BERTopic model. Then, a column child_topic is created in the dataframe. After that, the topic_mapping function is applied to each child topic number and the parent_topic column is created.

In [18]:
# Get the child topic predictions from the basic BERTopic model
child_topic_prediction = topic_model.topics_[:]
# Save the child predictions in the dataframe
df['child_topic'] = child_topic_prediction
# Create the parent topics
df['parent_topic'] = df['child_topic'].apply(topic_mapping)
# Take a look at the data
df.head()

Unnamed: 0,comments,child_topic,parent_topic
0,"Don's apartment is comfortable, clean and well...",0,P2_neighborhood_recommend_restaurants_bed
1,Don was a great host. He went out of his way t...,-1,
2,This place was great! Don was extremely helpfu...,0,P2_neighborhood_recommend_restaurants_bed
3,This condo was in a great location for touring...,-1,
4,"The apartment was very clean, comfortable and ...",0,P2_neighborhood_recommend_restaurants_bed


The first review in the dataset says “Don’s apartment is comfortable, clean and well equipped. Three minutes walk from the metro, wonderful terrace view of the river and Washington Monument. All in all a great experience. I would certainly recommend Don’s apartment to friends and would be very happy to stay there again.” It is a good match for parent topic 2, which is about the neighborhood: the review mentions the short walk to the metro and recommends the apartment.

In [19]:
# Take a look at the first review
df['comments'][0]

"Don's apartment is comfortable, clean and well equipped. Three minutes walk from the metro, wonderful terrace view of the river and Washington Monument. All in all a great experience. I would certainly recommend Don's apartment to friends and would be very happy to stay there again."

Using value_counts, we get the frequencies of the parent topics. We can see that most reviews are about the neighborhood and that a lot of people care about breakfast. The 2,442 outlier reviews have no parent topic, so they are not counted.

In [20]:
# Review frequency by parent topic
df['parent_topic'].value_counts()

P2_neighborhood_recommend_restaurants_bed    5806
P1_breakfast_recommend_feel_apartment        1244
P3_apartment_recommend_bathroom_feelm         280
P4_stadium_parking_staying_bathroom           228
Name: parent_topic, dtype: int64
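
These frequencies can be cross-checked against the child topic counts from the merged model. The sketch below aggregates them with a dictionary lookup instead of the if/elif chain; the short labels `P1` to `P4` are illustrative stand-ins for the full parent topic names, and the outlier topic -1 is skipped because it has no parent.

```python
from collections import Counter

# Child-to-parent assignment described in the text (short labels are illustrative)
child_to_parent = {1: 'P1', 4: 'P1', 0: 'P2', 2: 'P2', 3: 'P2', 5: 'P3', 6: 'P4'}
# Review counts per child topic after merging, from get_topic_info()
child_counts = {-1: 2442, 0: 4806, 1: 797, 2: 513, 3: 487, 4: 447, 5: 280, 6: 228}

parent_counts = Counter()
for child, count in child_counts.items():
    # dict.get returns None for the outlier topic -1, which has no parent
    parent = child_to_parent.get(child)
    if parent is not None:
        parent_counts[parent] += count
print(parent_counts.most_common())
```

The four totals add up to 7,558, which is the 10,000 reviews minus the 2,442 outliers.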

## 10. New Document Topic Predictions
Suppose there is a new review: “I like the apartment. The bathroom is spacious and clean.” We can find the topic that is most similar to the review using find_topics.

In [21]:
# New data for the review
new_review = "I like the apartment. The bathroom is spacious and clean."
# Find topics
similar_topics, similarity = topic_model.find_topics(new_review, top_n=1)
# Print results
print(f'The most similar child topic is {similar_topics[0]}, and the similarity is {np.round(similarity,2)[0]}')

The most similar child topic is 5, and the similarity is 0.56


The result shows that child topic 5 is the topic that is most similar to the review. The corresponding parent topic is 3, which is about the apartment itself.
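
Conceptually, a search like this compares an embedding of the new text with an embedding of each topic and keeps the closest matches. The sketch below illustrates that idea with cosine similarity on toy two-dimensional vectors. All names and numbers in it are assumptions for illustration; the real model uses high-dimensional embeddings, and its internals may differ in detail.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def most_similar_topics(query_embedding, topic_embeddings, top_n=1):
    """Rank topics by cosine similarity to the query, highest first."""
    scored = [(topic, cosine_similarity(query_embedding, embedding))
              for topic, embedding in topic_embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Hypothetical 2-dimensional embeddings for three topics and one review
toy_topic_embeddings = {0: [1.0, 0.0], 5: [0.6, 0.8], 6: [0.0, 1.0]}
toy_review_embedding = [0.5, 0.9]
best = most_similar_topics(toy_review_embedding, toy_topic_embeddings, top_n=1)
print(best)
```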

## 11. Conclusion
Applying a hierarchical topic model to Airbnb reviews proved to be an effective way of understanding what guests write about. Working from 10,000 Washington D.C. reviews, the model identified seven child topics, which we grouped into four parent topics covering breakfast, the neighborhood, the apartment itself, and the stadium. This hierarchy lets us analyze reviews at more than one level of detail, and insights of this kind into guest opinions, preferences, and experiences can inform decisions about design, growth strategy, and market localization.
