Introduction:
I used Latent Dirichlet Allocation (LDA) to break the hotel reviews down into hidden themes without defining those themes in advance. The model read through thousands of reviews and grouped words that often appear together into topics, such as cleanliness, staff behavior, food, or location. For each review, it then estimated how much of the text belongs to each topic, so I could see both the main topic of a review and the smaller topics mixed into it. Finally, I split the reviews into good and bad using a rating threshold of 7 to see which topics show up more often in bad reviews and which are more common in good ones.
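To make the idea of a per-review topic mixture concrete, here is a minimal sketch on a handful of made-up reviews. The review texts and the `toy_` names are assumptions for illustration only and do not come from the Booking.com data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy reviews, not taken from the Booking.com data
toy_reviews = [
    "clean room clean bathroom",
    "friendly staff helpful staff",
    "clean room friendly staff",
    "great breakfast tasty breakfast",
]

toy_cv = CountVectorizer()
toy_dtm = toy_cv.fit_transform(toy_reviews)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_doc_topics = toy_lda.fit_transform(toy_dtm)  # shape: (n_reviews, n_topics)

# Each row is one review's topic mixture and sums to 1
print(toy_doc_topics.round(2))
print(toy_doc_topics.sum(axis=1))
```

The largest value in a row marks that review's main topic; the remaining values are the smaller topics mixed in.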

In [1]:
# Import the required packages


import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [6]:
from google.colab import files
uploaded = files.upload()



Saving booking_reviews copy.csv to booking_reviews copy.csv


In [8]:
booking = pd.read_csv("booking_reviews copy.csv") # Booking.com df


In [9]:
booking.head()

Unnamed: 0,index,review_title,reviewed_at,reviewed_by,images,crawled_at,url,hotel_name,hotel_url,avg_rating,nationality,rating,review_text,raw_review_text,tags,meta
0,0,Exceptional,11 July 2021,Kyrylo,,"07/20/2021, 01:09:27",https://www.booking.com/reviews/be/hotel/villa...,Villa Pura Vida,https://www.booking.com/hotel/be/villa-pura-vi...,9.7,Poland,10.0,"Everything was perfect! Quite, cozy place to r...","<p class=""review_pos ""><svg aria-label=""Positi...",Business trip~Solo traveller~Junior Suite~Stay...,"{'language': 'en-gb', 'source': 'https://www.b..."
1,1,I highly recommend this b&b! We enjoyed it a lot!,24 November 2019,Dimitri,,"07/20/2021, 01:09:27",https://www.booking.com/reviews/be/hotel/villa...,Villa Pura Vida,https://www.booking.com/hotel/be/villa-pura-vi...,9.7,Belgium,9.0,Very friendly host and perfect breakfast!,"<p class=""review_pos ""><svg aria-label=""Positi...",Leisure trip~Couple~Deluxe Suite~Stayed 1 nigh...,"{'language': 'en-gb', 'source': 'https://www.b..."
2,2,Exceptional,3 January 2020,Virginia,,"07/20/2021, 01:09:27",https://www.booking.com/reviews/be/hotel/hydro...,Hydro Palace Apartment,https://www.booking.com/hotel/be/hydro-palace....,9.2,United Kingdom,10.0,It was just what we wanted for a week by the b...,"<p class=""review_neg ""><svg aria-label=""Negati...",Leisure trip~Couple~Apartment with Sea View~St...,"{'language': 'en-gb', 'source': 'https://www.b..."
3,3,My stay in the house was a experiencing bliss ...,8 September 2019,Kannan,,"07/20/2021, 01:09:28",https://www.booking.com/reviews/be/hotel/villa...,Villa Pura Vida,https://www.booking.com/hotel/be/villa-pura-vi...,9.7,Netherlands,10.0,My stay in the house was a experiencing bliss ...,"<p class=""review_pos ""><svg aria-label=""Positi...",Business trip~Solo traveller~Junior Suite~Stay...,"{'language': 'en-gb', 'source': 'https://www.b..."
4,4,One bedroom apartment with wonderful view and ...,23 June 2019,Sue,https://cf.bstatic.com/xdata/images/xphoto/squ...,"07/20/2021, 01:09:28",https://www.booking.com/reviews/be/hotel/hydro...,Hydro Palace Apartment,https://www.booking.com/hotel/be/hydro-palace....,9.2,South Africa,9.2,The building itself has a very musty smell in ...,"<p class=""review_neg ""><svg aria-label=""Negati...",Leisure trip~People with friends~Apartment wit...,"{'language': 'en-gb', 'source': 'https://www.b..."


In [10]:
booking.shape

(26675, 16)

In [12]:
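# Keep only the review text and the hotel's average rating
# (avg_rating is the hotel-level average; the per-review score is in the separate 'rating' column)
booking = booking[['review_text', 'avg_rating']].copy()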


In [13]:
booking.isna().sum()

Unnamed: 0,0
review_text,289
avg_rating,289


In [None]:
booking = booking.dropna()  # drop the rows with a missing review text or rating

In [21]:
booking.head()

Unnamed: 0,review_text,avg_rating
0,"Everything was perfect! Quite, cozy place to r...",9.7
1,Very friendly host and perfect breakfast!,9.7
2,It was just what we wanted for a week by the b...,9.2
3,My stay in the house was a experiencing bliss ...,9.7
4,The building itself has a very musty smell in ...,9.2


In [24]:
booking2 = booking.rename(columns={'review_text': 'Review', 'avg_rating': 'rating'})  # rename the columns for consistency

In [25]:
booking2.head()

Unnamed: 0,Review,rating
0,"Everything was perfect! Quite, cozy place to r...",9.7
1,Very friendly host and perfect breakfast!,9.7
2,It was just what we wanted for a week by the b...,9.2
3,My stay in the house was a experiencing bliss ...,9.7
4,The building itself has a very musty smell in ...,9.2


In [32]:
def classify_rating(r):  # ratings below 7.0 are labelled "bad"; 7.0 and above are labelled "good"
    if r < 7.0:
        return "bad"
    else:
        return "good"
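The same threshold rule can also be applied to a whole column at once. This is a minimal sketch using `np.where` on a small hypothetical `demo` frame; the ratings below are made up to show the boundary case at exactly 7.0.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings used only to illustrate the threshold
demo = pd.DataFrame({"rating": [6.9, 7.0, 9.7, 3.2]})

# Same rule as classify_rating: below 7.0 is "bad", otherwise "good"
demo["Sentiment"] = np.where(demo["rating"] < 7.0, "bad", "good")
print(demo)
```

Both versions label a rating of exactly 7.0 as good, because the comparison is strictly less than.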


In [28]:

booking2['Sentiment'] = booking2['rating'].apply(classify_rating)
booking2.head()

Unnamed: 0,Review,rating,Sentiment
0,"Everything was perfect! Quite, cozy place to r...",9.7,good
1,Very friendly host and perfect breakfast!,9.7,good
2,It was just what we wanted for a week by the b...,9.2,good
3,My stay in the house was a experiencing bliss ...,9.7,good
4,The building itself has a very musty smell in ...,9.2,good


In [31]:
# Vectorize the reviews as a bag of words with CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(
    max_df=0.95,  # ignore words that appear in more than 95% of reviews
    min_df=5,     # ignore words that appear in fewer than 5 reviews
    stop_words='english'  # drop common English words such as "is" and "the"
)

dtm = cv.fit_transform(booking2['Review'])  # document-term matrix of word counts per review
print("Document-Term Matrix Shape:", dtm.shape)

Document-Term Matrix Shape: (26386, 4237)
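To show how `min_df` and `max_df` shrink the vocabulary, here is a small sketch on four made-up sentences. The sentences and thresholds are assumptions chosen so the effect is visible on a tiny corpus; they are not the settings used above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, not taken from the Booking.com data
toy_docs = [
    "the room was clean",
    "the room was noisy",
    "the staff was friendly",
    "the breakfast was cold",
]

# No filtering: every token of two or more characters is kept
cv_all = CountVectorizer()
cv_all.fit(toy_docs)

# min_df=2 keeps words found in at least 2 documents;
# max_df=0.9 drops words found in more than 90% of documents
cv_filtered = CountVectorizer(min_df=2, max_df=0.9)
cv_filtered.fit(toy_docs)

print(sorted(cv_all.vocabulary_))       # 9 words
print(sorted(cv_filtered.vocabulary_))  # ['room']
```

Here "the" and "was" are removed for being in every document, and the words that occur only once are removed by `min_df`, which leaves only "room".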


In [None]:
# Apply LDA to the document-term matrix

from sklearn.decomposition import LatentDirichletAllocation

LDA = LatentDirichletAllocation(
    n_components=7,          # number of topics to extract (7 as a starting point)
    doc_topic_prior=0.1,     # alpha: prior on each review's topic distribution
    topic_word_prior=0.01,   # beta (called eta in scikit-learn): prior on each topic's word distribution
    random_state=42
)

LDA.fit(dtm)
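Since 7 topics is only a starting point, one possible way to compare alternatives is the model's perplexity, where lower is better. The sketch below is an assumption about how that comparison could look, run on made-up reviews; in practice the score would be computed on held-out reviews rather than on the training data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy reviews, not taken from the Booking.com data
toy_reviews = [
    "clean room comfortable bed",
    "dirty room broken shower",
    "friendly staff helpful reception",
    "rude staff slow reception",
    "great breakfast fresh coffee",
    "central location close station",
]
toy_dtm = CountVectorizer().fit_transform(toy_reviews)

perplexities = {}
for k in (2, 3, 4):
    model = LatentDirichletAllocation(
        n_components=k,
        doc_topic_prior=0.1,
        topic_word_prior=0.01,
        random_state=42,
    )
    model.fit(toy_dtm)
    perplexities[k] = model.perplexity(toy_dtm)  # lower is better

for k, value in perplexities.items():
    print(f"{k} topics: perplexity {value:.1f}")
```

Perplexity is only a rough guide, so the top words of each candidate model are still worth reading before settling on a topic count.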

In [35]:
vocab = cv.get_feature_names_out()


for index, topic in enumerate(LDA.components_):  # loop over the topics and print each one's top words
    print(f"THE TOP 15 WORDS FOR TOPIC #{index}")
    # argsort sorts in ascending order, so the strongest word is printed last
    print([vocab[i] for i in topic.argsort()[-15:]])
    print("\n")

THE TOP 15 WORDS FOR TOPIC #0
['like', 'owner', 'floor', 'didn', 'clean', 'easy', 'time', 'kitchen', 'stay', 'location', 'great', 'booking', 'stairs', 'check', 'apartment']


THE TOP 15 WORDS FOR TOPIC #1
['floor', 'open', 'didn', 'hot', 'reception', 'did', 'good', 'bathroom', 'staff', 'water', 'breakfast', 'hotel', 'shower', 'night', 'room']


THE TOP 15 WORDS FOR TOPIC #2
['train', 'rooms', 'nice', 'centre', 'clean', 'center', 'walk', 'parking', 'good', 'city', 'room', 'station', 'close', 'location', 'hotel']


THE TOP 15 WORDS FOR TOPIC #3
['really', 'clean', 'comfortable', 'room', 'friendly', 'beautiful', 'perfect', 'host', 'nice', 'location', 'lovely', 'place', 'great', 'stay', 'breakfast']


THE TOP 15 WORDS FOR TOPIC #4
['service', 'comfortable', 'rooms', 'nice', 'clean', 'excellent', 'helpful', 'hotel', 'room', 'great', 'good', 'friendly', 'breakfast', 'location', 'staff']


THE TOP 15 WORDS FOR TOPIC #5
['expo', 'exploring', 'decorating', 'centrum', 'kamer', 'stad', 'locatie',
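Because the lists above run from weakest to strongest word, they can be easier to read in the opposite order. This is a minimal sketch of a hypothetical helper, `top_words`, shown on a made-up weight matrix that has the same layout as `LDA.components_` (one row per topic, one column per word).

```python
import numpy as np

def top_words(components, vocab, n=3):
    """Return the n highest-weight words per topic, strongest first."""
    vocab = np.asarray(vocab)
    return [vocab[np.argsort(row)[::-1][:n]].tolist() for row in components]

# Hypothetical vocabulary and topic-word weights for illustration
toy_vocab = ["room", "staff", "breakfast", "shower"]
toy_components = np.array([
    [5.0, 0.1, 0.2, 3.0],
    [0.3, 4.0, 2.5, 0.1],
])

print(top_words(toy_components, toy_vocab, n=2))
# [['room', 'shower'], ['staff', 'breakfast']]
```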

In [37]:
topic_results = LDA.transform(dtm)
booking2['Dominant_Topic'] = topic_results.argmax(axis=1)  # assign each review the topic with the highest weight

booking2[['Review', 'rating', 'Sentiment', 'Dominant_Topic']].head()

Unnamed: 0,Review,rating,Sentiment,Dominant_Topic
0,"Everything was perfect! Quite, cozy place to r...",9.7,good,3
1,Very friendly host and perfect breakfast!,9.7,good,3
2,It was just what we wanted for a week by the b...,9.2,good,3
3,My stay in the house was a experiencing bliss ...,9.7,good,3
4,The building itself has a very musty smell in ...,9.2,good,0
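The dominant topic keeps only the largest weight per review. To also see the smaller topics mixed in, the topic weights can be ranked per row. The sketch below uses a made-up matrix with the same layout as `topic_results` (one row per review, one column per topic); the `toy_` names and values are assumptions.

```python
import numpy as np

# Hypothetical topic weights for two reviews and three topics
toy_topic_results = np.array([
    [0.70, 0.25, 0.05],
    [0.10, 0.30, 0.60],
])

# Topic indices per review, sorted from highest to lowest weight
ranked = np.argsort(toy_topic_results, axis=1)[:, ::-1]
dominant = ranked[:, 0]
secondary = ranked[:, 1]
dominant_weight = toy_topic_results.max(axis=1)

print(dominant)         # [0 2]
print(secondary)        # [1 1]
print(dominant_weight)  # [0.7 0.6]
```

A low dominant weight signals a review that is spread across several topics, so its single label should be read with more caution.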


In [38]:
# Count the reviews per dominant topic and sentiment
# groupby(...).size() counts the rows in each (topic, sentiment) pair;
# unstack() turns the Sentiment labels into columns instead of rows
topic_sentiment = booking2.groupby(['Dominant_Topic', 'Sentiment']).size().unstack(fill_value=0)



In [44]:
# Which topics dominate the good and the bad reviews


topic_sentiment.head(7)

Dominant_Topic,bad,good
0,43,1304
1,168,2463
2,148,2884
3,22,3794
4,101,4430
5,224,7522
6,163,3120
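The topics differ a lot in size, so raw counts are hard to compare. One way to read the table is the share of bad reviews within each topic. The sketch below rebuilds the table from the counts printed above; the `bad_share_pct` column name is my own choice.

```python
import pandas as pd

# Counts copied from the topic_sentiment output above
counts = pd.DataFrame(
    {
        "bad": [43, 168, 148, 22, 101, 224, 163],
        "good": [1304, 2463, 2884, 3794, 4430, 7522, 3120],
    },
    index=pd.Index(range(7), name="Dominant_Topic"),
)

# Percentage of reviews in each topic that are labelled bad
total = counts["bad"] + counts["good"]
counts["bad_share_pct"] = (100 * counts["bad"] / total).round(1)

print(counts.sort_values("bad_share_pct", ascending=False))
```

On these counts, topic 1 has the highest share of bad reviews (about 6.4%) and topic 3 the lowest (about 0.6%).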


In [40]:
# Top words for good vs. bad reviews, keeping the 20 most frequent

import numpy as np

def top_words_by_sentiment(df, sentiment_label, top_n=20):
    subset = df[df['Sentiment'] == sentiment_label]
    subset_dtm = cv.transform(subset['Review'])

    word_counts = np.asarray(subset_dtm.sum(axis=0)).ravel()  # total count of each word across the subset
    word_freq = list(zip(vocab, word_counts))   # pair each word with its count as (word, count) tuples
    sorted_words = sorted(word_freq, key=lambda x: x[1], reverse=True)[:top_n]

    return pd.DataFrame(sorted_words, columns=['Word', 'Frequency'])

bad_words_df = top_words_by_sentiment(booking2, 'bad')
good_words_df = top_words_by_sentiment(booking2, 'good')


In [41]:
bad_words_df.head()   # most frequent words in the bad reviews

Unnamed: 0,Word,Frequency
0,room,377
1,available,252
2,comments,226
3,review,223
4,location,220


In [42]:
good_words_df.head()  # most frequent words in the good reviews

Unnamed: 0,Word,Frequency
0,room,9078
1,available,8006
2,comments,7488
3,review,7481
4,location,6685


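Conclusion:
This analysis gives me two views of the reviews. At the topic level, the table of dominant topics by sentiment shows where the bad reviews concentrate: topic 1 (room, night, shower, water, bathroom) has the highest share of bad reviews, 168 of 2,631 or about 6.4%, while topic 3 (breakfast, stay, great, place, lovely, host) has the lowest, 22 of 3,816 or about 0.6%. Room and bathroom issues are therefore more closely tied to complaints, and hosts and breakfast to praise. Because Sentiment is derived from avg_rating, a bad review here means a review of a hotel whose average rating is below 7. At the word level, raw frequency does not separate the two groups: the five most frequent words (room, available, comments, review, location) are the same in good and bad reviews, and the words "available", "comments" and "review" may come from boilerplate text in the review field rather than from guests. Telling apart the words guests use for praise from the ones they use for complaints would need a comparison of relative frequencies rather than raw counts. Together, the topics give me the big-picture view and the words a more detailed one of what guests love and what frustrates them.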