### Introduction

To gain deeper insights into customer feedback, it is essential to understand not only the sentiments expressed in hotel reviews but also the specific aspects of the hotel that these sentiments target. This task addresses this by employing topic modelling to cluster reviews into meaningful groups representing distinct hotel aspects such as cleanliness, service, location, and amenities.

Since the dataset lacks predefined aspect labels, I approximate ground truth by randomly selecting 50 reviews from the test set, manually labelling their aspects, and comparing them against the clusters produced by the topic model. This validation step checks whether the clusters align with human judgment and capture real-world aspects accurately.

Finally, the derived clusters are used as aspect labels to implement an aspect-based sentiment classifier, enabling sentiment analysis at the aspect level rather than the entire review level. This approach provides more granular and actionable insights for hotel management and research purposes.

---

### Importing Required Libraries

In [1]:
from glob import glob
import os
import re
import numpy as np
from numpy import mean, zeros
import pandas as pd
from collections import Counter
import scipy.stats as stats

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import adjusted_rand_score, jaccard_score
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix


from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from tqdm import tqdm
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report



### 09- Exploring Hotel Service Aspects

#### Part A- Topic Modeling and Manual Validation Preparation Using BERTopic

BERTopic was selected as the topic modelling approach to identify distinct aspects in hotel reviews because it combines transformer-based sentence embeddings with density-based clustering (HDBSCAN), enabling the extraction of semantically coherent topics from unstructured text. Specifically, the all-MiniLM-L6-v2 sentence transformer was used to generate dense embeddings that capture contextual meaning beyond simple word overlap, allowing reviews with different wording but similar meaning to be grouped together.

The following code performs topic modeling on hotel review texts using the BERTopic framework with pre-trained sentence embeddings from the "all-MiniLM-L6-v2" model. First, it encodes the cleaned review texts into dense vector embeddings. Then, it applies dimensionality reduction via UMAP to capture the underlying structure of the data, followed by clustering with HDBSCAN to group similar reviews into topics. BERTopic combines these components to assign each review a topic label and calculates the probability of each assignment. It also reduces the impact of outlier topics by reassigning them.
 
**reduce_outliers()**: BERTopic assigns -1 to points that don't clearly belong to any cluster. reduce_outliers() takes those -1 reviews and reassigns each one to its best-matching existing topic. Called with default arguments, as it is here, it matches on approximate topic distributions; other strategies (for example, embedding similarity) can be selected through its strategy parameter. It doesn’t invent new topics; it cleans up the noise and improves cluster coverage.
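
To make the idea of outlier reassignment concrete, here is a minimal sketch that moves each -1 document to the topic whose centroid is most similar in embedding space. It is a simplified illustration of the general idea only, not BERTopic's implementation; the function name `reassign_outliers` and the toy vectors are assumptions made for this example.

```python
import numpy as np

def reassign_outliers(embeddings, topics):
    """Assign each -1 document to the topic with the most similar centroid (illustrative helper)."""
    embeddings = np.asarray(embeddings, dtype=float)
    topics = np.asarray(topics)
    topic_ids = sorted(t for t in set(topics.tolist()) if t != -1)

    # Unit-normalise so that dot products equal cosine similarities
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    centroids = np.stack([unit[topics == t].mean(axis=0) for t in topic_ids])
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

    new_topics = topics.copy()
    outliers = np.where(topics == -1)[0]
    if len(outliers):
        # Pick the existing topic with the highest cosine similarity
        nearest = (unit[outliers] @ centroids.T).argmax(axis=1)
        new_topics[outliers] = np.array(topic_ids)[nearest]
    return new_topics.tolist()

# Toy 2-D "embeddings": the last document starts out as an outlier
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.8, 0.3]]
toy_topics = [0, 0, 1, 1, -1]
print(reassign_outliers(toy_embeddings, toy_topics))  # [0, 0, 1, 1, 0]
```

Note that no new topic IDs appear in the output: every outlier is absorbed by an existing cluster, which is why coverage improves while the number of topics stays the same.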


In [2]:
folder_path = '/kaggle/input/sentiment-data'
processed_df = pd.read_csv(f"{folder_path}/processed_sentiment_data.csv")

# Basic cleaning using regex
def basic_text_cleaning(text):
    text = text.lower()
    text = re.sub(r'http\S+|www\S+|https\S+', '', text)   # Remove URLs
    text = re.sub(r'<.*?>', '', text)                     # Remove HTML
    text = re.sub(r'[^a-z\s]', '', text)                  # Remove non-alphabetic
    return text

# Apply the function
processed_df['basic_text_cleaning'] = processed_df['review_text'].apply(basic_text_cleaning)
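
A quick check of what the cleaning function does to a single review can be useful before embedding. The sketch below repeats the function so that it runs on its own; the sample review is made up for illustration.

```python
import re

def basic_text_cleaning(text):
    text = text.lower()
    text = re.sub(r'http\S+|www\S+|https\S+', '', text)   # Remove URLs
    text = re.sub(r'<.*?>', '', text)                     # Remove HTML
    text = re.sub(r'[^a-z\s]', '', text)                  # Remove non-alphabetic
    return text

# Made-up review containing an HTML tag, digits, an apostrophe, and a URL
sample = "Great stay!<br>Room 204 wasn't clean. See https://example.com"
cleaned = basic_text_cleaning(sample)
print(cleaned.split())  # ['great', 'stayroom', 'wasnt', 'clean', 'see']
```

Two side effects are visible here: removing a tag without inserting a space can fuse neighbouring words ("stayroom"), and apostrophes are dropped ("wasnt"). Whether this matters for sentence embeddings depends on the data, but it is worth keeping in mind when reading topic keywords.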

In [3]:
# Load pre-trained sentence transformer model for generating dense vector embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
# Convert cleaned review texts to a list
docs = processed_df['basic_text_cleaning'].tolist()
# Generate sentence embeddings for each document
embeddings = embedding_model.encode(docs, show_progress_bar=True)

# Configure UMAP for dimensionality reduction of embeddings before clustering
umap_model = UMAP(n_neighbors=30, min_dist=0.1, metric='cosine', random_state=42)
# Set up HDBSCAN clustering algorithm to group similar reviews into topics
hdbscan_model = HDBSCAN(min_cluster_size=5, min_samples=1, prediction_data=True)

# Initialize BERTopic model with custom embedding, UMAP, and HDBSCAN configurations
topic_model = BERTopic(embedding_model=embedding_model,
                       umap_model=umap_model,
                       hdbscan_model=hdbscan_model,
                       calculate_probabilities=True)
# Fit the model and assign a topic to each review, reusing the precomputed embeddings
topics, probs = topic_model.fit_transform(docs, embeddings)
# Refine topic assignments by reducing outliers
topics = topic_model.reduce_outliers(docs, topics)
processed_df['topic'] = topics


After topic assignment, the code samples 50 reviews and their corresponding topics for manual validation. It creates a DataFrame with these samples and saves it to a CSV file, facilitating human review to verify and refine the topic labels. Additionally, it reports how many reviews remain unassigned (labeled -1), indicating potential outliers or ambiguous texts. This step helps ensure the quality and interpretability of the topic modeling results in the analysis.

In [4]:
# Randomly sample 50 rows from the processed dataframe (fixed seed for reproducibility)
sampled = processed_df.sample(50, random_state=42)
sample_texts = sampled['basic_text_cleaning'].tolist()
# Extract the assigned topic labels from the sample as a list
sample_topics = sampled['topic'].tolist()


# Save for manual labeling
manual_check_df = pd.DataFrame({
    'Review': sample_texts,
    'Assigned_Topic': sample_topics
})

# Print how many were -1 (unassigned topics)
print("Number of unassigned topics (-1):", (manual_check_df['Assigned_Topic'] == -1).sum())

manual_check_df.to_csv('manual_label_for_aspect_validation.csv', index=False)
print("Saved manual_label_for_aspect_validation.csv with 50 reviews.")

Number of unassigned topics (-1): 0
Saved manual_label_for_aspect_validation.csv with 50 reviews.


#### Part B- Evaluating Topic Alignment Between Manual and BERTopic Labels

The following code evaluates how well BERTopic's automatically assigned topics align with human-annotated aspect labels using clustering evaluation metrics. It starts by loading a CSV file containing 50 manually labeled hotel reviews (e.g., Food, Service, Room, Location) along with their predicted BERTopic topic IDs. The manual and predicted labels are extracted, with manual labels encoded into integers using LabelEncoder for compatibility.

In [5]:
# Load the CSV file with the manual labels and assigned BERTopic topics
manual_file_path = '/kaggle/input/manual-label-for-aspect-labeled/manual_label_for_aspect_labeled.csv'
manual_check = pd.read_csv(manual_file_path)

# Clean the manual labels column to avoid duplicates due to casing or whitespace
manual_check['Manual_Topic'] = manual_check['Manual_Topic'].str.strip().str.title()

# Quick check
print(manual_check.head(5))
print(f"Total reviews loaded: {len(manual_check)}")
print(f"Columns: {manual_check.columns.tolist()}")

                                              Review  Assigned_Topic  \
0  we had a fantastic stay the room was very plea...              24   
1  hotel stay was very comfortable food was excel...             117   
2  saman villas is an excellent hotel with a uniq...             148   
3  what a great stay we booked two rooms foods a ...             303   
4  we had chosen the riu at ahungalla for our ann...              93   

  Manual_Topic  
0      Service  
1         Food  
2      Service  
3         Food  
4     Location  
Total reviews loaded: 50
Columns: ['Review', 'Assigned_Topic', 'Manual_Topic']


Since the topic numbers assigned by BERTopic don’t necessarily match the human-assigned ones, the code applies the Hungarian algorithm (via linear_sum_assignment) to find the one-to-one matching between predicted topics and manual labels that maximizes the total overlap in the confusion matrix. It then computes two clustering evaluation metrics:

- **Adjusted Rand Index (ARI):** Measures similarity between two clustering assignments while correcting for chance. A score closer to 1 indicates strong agreement.
- **Jaccard Coefficient (Macro-average):** Measures the intersection-over-union of predicted and true labels across all classes, averaged across classes.
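
As a worked illustration of the two metrics, the sketch below computes both from first principles on toy labels, using only the standard library. The function names and toy data are assumptions for this example; the notebook itself relies on scikit-learn's `adjusted_rand_score` and `jaccard_score`.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Pair-counting ARI: agreement between two partitions, corrected for chance."""
    n = len(labels_a)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)   # agreement expected by chance
    max_index = (sum_a + sum_b) / 2         # agreement of a perfect match
    if max_index == expected:
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

def macro_jaccard(y_true, y_pred):
    """Per-class intersection-over-union, averaged over every class that occurs."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        true_idx = {i for i, y in enumerate(y_true) if y == label}
        pred_idx = {i for i, y in enumerate(y_pred) if y == label}
        scores.append(len(true_idx & pred_idx) / len(true_idx | pred_idx))
    return sum(scores) / len(scores)

# ARI ignores label names: the same grouping under different IDs scores 1.0
print(adjusted_rand_index(["Food", "Food", "Room", "Room"], [7, 7, 3, 3]))  # 1.0
# Groupings that cut across each other score below zero
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))                      # -0.5
# Jaccard compares label values directly, so both sides must share one label space
print(round(macro_jaccard([0, 0, 1, 1], [0, 1, 1, 1]), 4))                  # 0.5833
```

The contrast matters for interpretation: ARI can be computed on raw topic IDs, whereas the Jaccard coefficient is only meaningful after predicted topics have been mapped onto the manual label space.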

In [6]:
manual_labels_raw = manual_check['Manual_Topic'].values
predicted_labels_raw = manual_check['Assigned_Topic'].values

# Encode labels for metrics
le = LabelEncoder()
manual_labels = le.fit_transform(manual_labels_raw)
predicted_labels = predicted_labels_raw.astype(int)

# Hungarian alignment
conf_mat = confusion_matrix(manual_labels, predicted_labels)
row_ind, col_ind = linear_sum_assignment(-conf_mat)
mapping = dict(zip(col_ind, row_ind))
mapped_preds = [mapping.get(p, -1) for p in predicted_labels]

ari = adjusted_rand_score(manual_labels, mapped_preds)
jac = jaccard_score(manual_labels, mapped_preds, average='macro')

print(f"Adjusted Rand Index (aligned): {ari:.4f}")
print(f"Jaccard Coefficient (macro-average, aligned): {jac:.4f}")

Adjusted Rand Index (aligned): -0.0386
Jaccard Coefficient (macro-average, aligned): 0.0101
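
One caveat when aligning through `confusion_matrix`: scikit-learn builds that matrix over the sorted union of both label sets, so the row and column indices returned by `linear_sum_assignment` are positions in that union rather than topic IDs. A minimal sketch of an alignment keyed on the actual label values is shown here; the helper name `align_clusters` and the toy labels are assumptions for illustration, and the scores printed above were not produced with it.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_clusters(true_labels, cluster_ids):
    """Map each cluster ID to one true label by maximising total overlap (illustrative helper)."""
    true_vals = sorted(set(true_labels))
    cluster_vals = sorted(set(cluster_ids))
    t_pos = {v: i for i, v in enumerate(true_vals)}
    c_pos = {v: j for j, v in enumerate(cluster_vals)}

    # Contingency table: rows are true labels, columns are cluster IDs
    counts = np.zeros((len(true_vals), len(cluster_vals)), dtype=int)
    for t, c in zip(true_labels, cluster_ids):
        counts[t_pos[t], c_pos[c]] += 1

    # Hungarian algorithm on the (possibly rectangular) table
    rows, cols = linear_sum_assignment(counts, maximize=True)
    # Translate matrix positions back into the original label values
    return {cluster_vals[c]: true_vals[r] for r, c in zip(rows, cols)}

true_labels = [0, 0, 1, 1, 2, 2]
cluster_ids = [117, 117, 24, 24, 303, 148]
mapping = align_clusters(true_labels, cluster_ids)
# Clusters left without a partner fall back to -1
mapped = [mapping.get(c, -1) for c in cluster_ids]
print(mapping[117], mapping[24], mapped[:4])  # 0 1 [0, 0, 1, 1]
```

Because the matching is one-to-one, at most as many clusters as there are manual labels can be mapped; with hundreds of topics and four aspects, most clusters stay unmatched, which by itself keeps the aligned Jaccard score low.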


#### Interpretation:

**The Adjusted Rand Index (-0.0386) and Jaccard Coefficient (0.0101) being low means the BERTopic-generated clusters don't align well with the manual aspect labels ("Room", "Service", "Food", "Location").**

That's expected because:
- BERTopic clusters based on semantic similarity, not manually defined aspects.
- Aspects like "Room", "Food", and "Service" are human-defined categories, while topic models group linguistic patterns.
- Reviews often contain multiple aspects (e.g., "The room was great but the service was terrible"), but BERTopic assigns only one topic per document.
- The manual sample is small (50 reviews), which amplifies noise and reduces alignment with unsupervised clusters.

#### Tuning BERTopic with Fixed Manual Labels

In [7]:
texts = manual_check['Review'].tolist()
true_labels = manual_check['Manual_Topic'].tolist()

# Encode Text Using Better Embeddings
embedding_model = SentenceTransformer("all-mpnet-base-v2")
embeddings = embedding_model.encode(texts, show_progress_bar=True)


In [8]:
# Try Different BERTopic Settings

# Encode labels for metrics
le = LabelEncoder()
encoded_labels = le.fit_transform(true_labels)

# Try a few configurations
configs = [
    {"min_dist": 0.5, "min_cluster_size": 10},
    {"min_dist": 0.3, "min_cluster_size": 15},
    {"min_dist": 0.4, "min_cluster_size": 12},
    {"min_dist": 0.2, "min_cluster_size": 8},
]

for cfg in configs:
    print(f"\nTrying: UMAP min_dist={cfg['min_dist']}, HDBSCAN min_cluster_size={cfg['min_cluster_size']}")
    
    umap_model = UMAP(n_neighbors=30, min_dist=cfg['min_dist'], metric='cosine', random_state=42)
    hdbscan_model = HDBSCAN(min_cluster_size=cfg['min_cluster_size'], min_samples=1, prediction_data=True)
    
    topic_model = BERTopic(embedding_model=embedding_model,
                           umap_model=umap_model,
                           hdbscan_model=hdbscan_model,
                           calculate_probabilities=True,
                           verbose=False)
    
    topics, probs = topic_model.fit_transform(texts, embeddings)

    # Map -1 (unassigned) to a new label to avoid metric issues
    topics_fixed = [t if t != -1 else max(topics) + 1 for t in topics]

    # Evaluate
    try:
        ari = adjusted_rand_score(encoded_labels, topics_fixed)
        jac = jaccard_score(encoded_labels, topics_fixed, average='macro')
        print(f"ARI: {ari:.4f}, Jaccard: {jac:.4f}")
    except ValueError:
        print("Jaccard score could not be computed (likely due to label mismatch)")


Trying: UMAP min_dist=0.5, HDBSCAN min_cluster_size=10
ARI: -0.0290, Jaccard: 0.0783

Trying: UMAP min_dist=0.3, HDBSCAN min_cluster_size=15
ARI: 0.0000, Jaccard: 0.0500

Trying: UMAP min_dist=0.4, HDBSCAN min_cluster_size=12
ARI: 0.0565, Jaccard: 0.1062

Trying: UMAP min_dist=0.2, HDBSCAN min_cluster_size=8
ARI: 0.0480, Jaccard: 0.0537


#### Interpretation:

**Even after hyperparameter tuning, the performance of BERTopic for aligning with manually defined aspect labels remains weak.**

- The highest Adjusted Rand Index (ARI) achieved was just 0.0565, and the Jaccard Coefficient peaked at 0.1062 (both with min_dist=0.4 and min_cluster_size=12). Both scores are very low, indicating poor overlap and agreement between the predicted clusters and the true human-labeled aspects. For comparison, the baseline run without tuning scored even lower (ARI = -0.0386, Jaccard = 0.0101), confirming that the default clustering didn't represent meaningful service categories either.

- This outcome highlights the inherent limitations of unsupervised topic modeling for fine-grained aspect detection. Many clusters were found to be noisy or mixed, grouping unrelated topics (e.g., comments about both food and staff in the same cluster).

- These results suggest that while BERTopic can surface general themes, it lacks the precision required for aspect-level sentiment analysis compared to supervised models, rule-based lexicons, or targeted aspect extraction techniques.

##### Better Alternatives for Aspect Detection:

- **Aspect Term Extraction Models:** Use pretrained transformers or fine-tuned sequence labeling models (like BERT+CRF or spaCy pipelines) to extract specific nouns or noun phrases tied to sentiment.
- **Multi-label Classification:** Train classifiers to assign multiple aspects per review using supervised labels.
- **Manual Rules or Lexicons:** Define keyword-based heuristics for mapping sentences to known hotel service aspects (e.g., match “breakfast”, “buffet” → food).
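
The keyword-based option can be sketched in a few lines. The lexicon below is a hypothetical starting point invented for this example, not a validated resource, and the function name `detect_aspects` is likewise an assumption. Unlike a single topic assignment, it returns every aspect a review mentions.

```python
import re

# Hypothetical keyword lexicon; a real one would be built and checked against the data
ASPECT_KEYWORDS = {
    "Food": {"breakfast", "buffet", "dinner", "restaurant", "food"},
    "Room": {"room", "bed", "bathroom", "shower"},
    "Service": {"staff", "service", "reception"},
    "Location": {"location", "beach", "view", "walk"},
}

def detect_aspects(review):
    """Return every aspect whose keywords occur in the review (multi-label)."""
    tokens = set(re.findall(r"[a-z]+", review.lower()))
    return sorted(aspect for aspect, keywords in ASPECT_KEYWORDS.items()
                  if tokens & keywords)

print(detect_aspects("The room was great but the service was terrible"))
# ['Room', 'Service']
print(detect_aspects("Lovely breakfast buffet a short walk from the beach"))
# ['Food', 'Location']
```

Matching is on exact tokens, so inflected forms such as "rooms" would be missed unless they are added to the lexicon or the text is lemmatized first.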

#### Part C- Aspect-wise Sentiment Classification

After grouping reviews into distinct topics (aspects) using BERTopic, separate sentiment classifiers were trained for each aspect. The idea is to evaluate sentiment polarity (positive/negative) specifically within the context of each identified aspect, rather than across the whole dataset.

For each aspect:
- Data Filtering – isolate all reviews assigned to that aspect and extract their cleaned text along with sentiment labels.
- Data Validation – skip aspects with fewer than 20 samples or with only one sentiment class, since these would not allow for meaningful training and evaluation.
- Cross-Validation Split – the remaining data is split into 5 stratified folds (StratifiedKFold) to preserve class balance; aspects where any validation fold contains only one class are also skipped.
- Model Pipeline – each review is converted to a fixed-length vector by averaging the GloVe vectors of its words, and a linear Support Vector Machine (LinearSVC) predicts sentiment.
- Evaluation – the out-of-fold predictions from cross_val_predict are evaluated using precision, recall, F1-score, and accuracy for each sentiment class.

This process tells how well sentiment can be classified for each topic, revealing strengths and weaknesses in predicting sentiment at the aspect level.

The aspect-based sentiment classifiers were trained using clusters generated by BERTopic as pseudo ground truth aspect labels. Each cluster was treated as an aspect, and sentiment classification was performed within each cluster.

In [9]:
# Load GloVe embeddings 
def load_glove(path):
    glove = {}
    with open(path, 'r', encoding='utf8') as f:
        for line in f:
            parts = line.split()
            word = parts[0]
            vec = np.array(parts[1:], dtype='float32')
            glove[word] = vec
    return glove

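# Define Sentence Embedding Function
def sentence_to_glove(sentence, glove_embeddings, dim=300):
    words = sentence.split()
    # Keep only the words that have a GloVe vector
    valid_vectors = [glove_embeddings[word] for word in words if word in glove_embeddings]
    if not valid_vectors:
        # No known words: fall back to a zero vector of the same dimensionality
        return np.zeros(dim)
    return np.mean(valid_vectors, axis=0)

# Load GloVe 
glove_path = '/kaggle/input/glove-300/glove.6B.300d.txt'
glove = load_glove(glove_path)
# Must match the dimensionality of the loaded vectors (glove.6B.300d has 300 dimensions)
embedding_dim = 300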

# Loop through each aspect
for aspect in processed_df['topic'].unique():
    aspect_data = processed_df[processed_df['topic'] == aspect]
    
    if len(aspect_data) < 20:
        continue

    y = aspect_data['label'].values
    if len(set(y)) < 2:
        continue

    # Convert text to GloVe embeddings
    X = np.array([sentence_to_glove(text, glove, embedding_dim) for text in aspect_data['basic_text_cleaning']])

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    if not all(len(set(y[val_idx])) > 1 for _, val_idx in skf.split(X, y)):
        continue

    model = LinearSVC(max_iter=1000)

    try:
        preds = cross_val_predict(model, X, y, cv=skf)
    except ValueError:
        continue

    print(f"Aspect {aspect} Sentiment Classification Report:")
    print(classification_report(y, preds))

Aspect 20 Sentiment Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.25      0.40         8
           1       0.80      1.00      0.89        24

    accuracy                           0.81        32
   macro avg       0.90      0.62      0.64        32
weighted avg       0.85      0.81      0.77        32

Aspect 14 Sentiment Classification Report:
              precision    recall  f1-score   support

           0       0.55      0.55      0.55        11
           1       0.62      0.62      0.62        13

    accuracy                           0.58        24
   macro avg       0.58      0.58      0.58        24
weighted avg       0.58      0.58      0.58        24

Aspect 6 Sentiment Classification Report:
              precision    recall  f1-score   support

           0       0.87      0.50      0.63        26
           1       0.81      0.97      0.88        59

    accuracy                           0.82        85


#### General Insights:

- Imbalanced Classes: Most aspects show strong performance on positive sentiment and weak on negative due to data imbalance (for Aspects 20 and 6, recall is 1.00 and 0.97 on class 1 versus 0.25 and 0.50 on class 0).
- Small Sample Sizes: Aspect 14 (24 reviews) shows the limitations of training on very small subsets: performance is low (0.58 accuracy) and likely unstable.
- Classifier Bias: LinearSVC tends to favor the majority class unless balancing techniques (like class weighting or resampling) are applied.
- Overall Viability: For aspects with more data (like Aspect 6, with 85 reviews), sentiment classification can be moderately effective (0.82 accuracy), although recall on the minority class remains low.

##### Recommendations:

- Oversample negative examples or undersample positives to balance the classes.
- Adjust class weights so that errors on the minority class cost more.
- Collect or add more negative reviews for those aspects.
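
The first two recommendations can be sketched with the standard library alone. The helper names below are assumptions for illustration, and the class counts are taken from the Aspect 20 report (8 reviews in class 0, 24 in class 1). The weight formula mirrors the "balanced" heuristic, n_samples / (n_classes * class_count).

```python
from collections import Counter
import random

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

def random_oversample(texts, labels, seed=42):
    """Duplicate minority-class examples at random until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    target = max(len(items) for items in by_class.values())
    out_texts, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_texts.extend(items + extra)
        out_labels.extend([label] * target)
    return out_texts, out_labels

# Class counts from the Aspect 20 report
labels = [0] * 8 + [1] * 24
texts = [f"review {i}" for i in range(len(labels))]  # placeholder texts

weights = balanced_class_weights(labels)
print(weights[0], round(weights[1], 3))  # 2.0 0.667

resampled_texts, resampled_labels = random_oversample(texts, labels)
print(Counter(resampled_labels))         # both classes now have 24 examples
```

A weight dictionary like this can be passed to `LinearSVC(class_weight=...)`, or `class_weight='balanced'` can be used to let scikit-learn compute it. If oversampling is used instead, it should be applied only to the training portion of each fold, otherwise duplicated reviews leak into the validation data.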

##### Summary:

- Despite these limitations, the approach demonstrates a valid pipeline of using topic modeling for aspect detection and training aspect-specific sentiment classifiers.

**The results illustrate typical real-world challenges where ground truth aspect labels are not available, and unsupervised methods must suffice.**

**Future improvements could include integrating semi-supervised methods, refining cluster quality with domain knowledge, or leveraging multi-label classification frameworks to better handle overlapping aspects.**

---