In [1]:
# pip install vaderSentiment

In [2]:
import pandas as pd, numpy as np
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

notes = pd.read_csv("notes.csv")
listings = pd.read_csv("listings.csv")

In [3]:
analyzer = SentimentIntensityAnalyzer()
notes['vader_compound'] = notes['comments'].dropna().apply(lambda t: analyzer.polarity_scores(str(t))['compound'])
notes['vader_compound']

0         0.9827
1         0.8122
2         0.9722
3         0.9781
4         0.8479
           ...  
126420    0.4927
126421    0.5709
126422    0.9778
126423    0.8516
126424    0.9037
Name: vader_compound, Length: 126425, dtype: float64

In [4]:
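# VADER's compound score runs from -1 to 1; anything below -0.05 counts as negative here.
# Rows with a missing comment have a NaN score, and NaN < -0.05 evaluates to False.
notes['is_negative_review'] = notes['vader_compound'] < -0.05
int(notes['is_negative_review'].sum())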


4943

In [5]:
neg = notes[notes['is_negative_review']].copy()
# `pos` is everything not flagged negative, so it also holds neutral and unscored reviews.
pos = notes[~notes['is_negative_review']].copy()

print(f"Negative reviews : {len(neg):,}  ({100*len(neg)/len(notes):.1f}%)")
print(f"Positive reviews : {len(pos):,}  ({100*len(pos)/len(notes):.1f}%)")

Negative reviews : 4,943  (3.9%)
Positive reviews : 121,482  (96.1%)


In [6]:
ASPECT_LEXICONS = {
    'cleanliness': ['clean','dirty','dust','smell','odor','hygiene','stain','mold','filth'],
    'location':    ['location','far','remote','transport','tube','metro','walk','noise','area','neighbourhood'],
    'value':       ['price','expensive','overpriced','cheap','worth','value','money','cost'],
    'communication': ['host','respond','reply','communication','message','contact','rude','helpful','friendly'],
    'checkin':     ['checkin','check-in','key','access','lock','door','arrival','late'],
    'accuracy':    ['accurate','description','mislead','misleading','photo','picture','expect','disappoint','different'],
}
# Keyword lists drafted with an LLM; to be reviewed and edited later.

In [7]:
def tag_aspects(text):
    text = str(text).lower()
    return {asp: int(any(kw in text for kw in kws))
            for asp, kws in ASPECT_LEXICONS.items()}

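aspect_tags = neg['comments'].apply(tag_aspects).apply(pd.Series)
# Reset both indexes: concat aligns on labels, and mismatched labels add NaN-padded rows.
neg = pd.concat([neg.reset_index(drop=True), aspect_tags.reset_index(drop=True)], axis=1)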

print(aspect_tags.sum().sort_values(ascending=False))

location         1094
communication     650
checkin           504
accuracy          397
cleanliness       306
value             154
dtype: int64


## 📊 Phase 2: What Are People Actually Complaining About?

After scanning negative reviews for aspect keywords, here's the breakdown:

| Aspect | Mentions |
|---|---|
| location | 1,094 |
| communication | 650 |
| checkin | 504 |
| accuracy | 397 |
| cleanliness | 306 |
| value | 154 |

**Location is #1**, which fits what we're trying to prove. You can't move a flat. If someone complains about the location of a listing that everyone else loves, that points to the guest's expectations rather than the host.

**Accuracy at #4** is also interesting: "it looked different in the photos" is a mismatch complaint almost by definition.

**Cleanliness near the bottom** makes sense too, since it is harder to read as a mismatch. Dirty is dirty.
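
The counts above come from substring matching (`kw in text`), so short keywords such as `far`, `key`, or `late` also fire inside longer words. A minimal sketch of a whole-word variant follows; `LEXICON`, `compile_lexicon`, and `tag_aspects_strict` are hypothetical names used only for illustration, and the two-aspect lexicon is a shortened stand-in for `ASPECT_LEXICONS`.

```python
import re

# Hypothetical two-aspect excerpt of ASPECT_LEXICONS, kept short for the demo.
LEXICON = {
    "location": ["far", "walk", "noise"],
    "checkin": ["key", "late", "check-in"],
}

def compile_lexicon(lexicon):
    """Build one case-insensitive, word-bounded regex per aspect."""
    return {
        aspect: re.compile(
            r"\b(?:" + "|".join(re.escape(kw) for kw in kws) + r")\b",
            re.IGNORECASE,
        )
        for aspect, kws in lexicon.items()
    }

PATTERNS = compile_lexicon(LEXICON)

def tag_aspects_strict(text):
    """Return 1 per aspect only when a keyword appears as a whole word."""
    text = str(text)
    return {aspect: int(bool(p.search(text))) for aspect, p in PATTERNS.items()}

def tag_aspects_substring(text):
    """The substring rule from In [7], for comparison."""
    text = str(text).lower()
    return {a: int(any(kw in text for kw in kws)) for a, kws in LEXICON.items()}

review = "Lovely farmhouse, we ate chocolate and turkey."
print(tag_aspects_substring(review))  # both aspects fire on substrings
print(tag_aspects_strict(review))     # neither aspect fires
```

The trade-off is recall: whole-word matching no longer catches inflected forms such as "stains" or "walked", so a stricter lexicon would probably need stems or explicit variants added.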




In [8]:
# pip install bertopic 


In [9]:
# pip install sentence_transformers


In [10]:
# pip install hdbscan


In [11]:

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
import re

neg_clean = neg.dropna(subset=['comments'])

sample = neg_clean.sample(
    n=min(50_000, len(neg_clean)), random_state=42
)
docs = sample['comments'].str.strip().tolist()

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                  metric='cosine', random_state=42)

hdbscan_model = HDBSCAN(min_cluster_size=50, metric='euclidean',
                         cluster_selection_method='eom', prediction_data=True)

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    nr_topics="auto",          
    calculate_probabilities=False,
    verbose=True
)

topics, probs = topic_model.fit_transform(docs)
sample['topic'] = topics

print(topic_model.get_topic_info().head(20))


2026-02-20 12:19:00,657 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|██████████| 155/155 [01:29<00:00,  1.74it/s]
2026-02-20 12:20:29,917 - BERTopic - Embedding - Completed ✓
2026-02-20 12:20:29,918 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2026-02-20 12:20:43,054 - BERTopic - Dimensionality - Completed ✓
2026-02-20 12:20:43,055 - BERTopic - Cluster - Start clustering the reduced embeddings
2026-02-20 12:20:43,187 - BERTopic - Cluster - Completed ✓
2026-02-20 12:20:43,188 - BERTopic - Representation - Extracting topics using c-TF-IDF for topic reduction.
2026-02-20 12:20:43,324 - BERTopic - Representation - Completed ✓
2026-02-20 12:20:43,325 - BERTopic - Topic reduction - Reducing number of topics
2026-02-20 12:20:43,329 - BERTopic - Representation - Fine-tuning topics using representation models.
2026-02-20 12:20:43,459 - BERTopic - Represen

   Topic  Count                Name  \
0     -1      9     -1_br_너무_정말_그리고   
1      0   1935  0_und_die_sehr_ist   
2      1   1200    1_et_de_très_est   
3      2    717    2_the_and_to_was   
4      3    698      3_el_la_muy_de   
5      4    171     4_een_het_en_de   
6      5    128  5_muito_com_de_foi   
7      6     85  6_di_per_molto_che   

                                      Representation  \
0    [br, 너무, 정말, 그리고, 런던, 있습니다, 다시, 때문에, 아침, 좋았습니다]   
1  [und, die, sehr, ist, war, wir, der, in, das, zu]   
2  [et, de, très, est, nous, le, pour, la, tout, br]   
3      [the, and, to, was, is, we, of, it, not, for]   
4      [el, la, muy, de, en, que, con, es, no, para]   
5       [een, het, en, de, van, je, is, og, met, op]   
6  [muito, com, de, foi, para, em, uma, do, um, que]   
7  [di, per, molto, che, la, il, sono, casa, con,...   

                                 Representative_Docs  
0  [런던 여행을 ...

## 🌍 Phase 3: BERTopic Results (A Multilingual Surprise)

BERTopic ran successfully and found **8 topics** (including the noise bucket `-1`). But here's the thing: it didn't find *thematic* topics at all. It found **language clusters**:

| Topic | Language | Count |
|---|---|---|
| 0 | German (`und, die, sehr`) | 1,935 |
| 1 | French (`et, de, très`) | 1,200 |
| 2 | English (`the, and, to`) | 717 |
| 3 | Spanish (`el, la, muy`) | 698 |
| 4 | Dutch (`een, het, en`) | 171 |
| 5 | Portuguese (`muito, com, de`) | 128 |
| 6 | Italian (`di, per, molto`) | 85 |
| -1 | Noise / Korean | 9 |

This makes sense. The London dataset is full of international tourists reviewing in their native language, and `all-MiniLM-L6-v2` is trained mainly on English text, so the embeddings separated reviews by language before they separated them by topic.

### What this means for us

This isn't a failure; it's a data quality signal. We have two options going forward:

**Option A (quick):** filter to English-only reviews before re-running BERTopic:
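```python
from langdetect import DetectorFactory, detect
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # langdetect is non-deterministic unless the seed is fixed

def safe_detect(text):
    try:
        return detect(str(text))
    except LangDetectException:  # raised on empty or symbol-only text
        return 'unknown'

neg_clean = neg_clean.assign(lang=neg_clean['comments'].apply(safe_detect))
neg_en = neg_clean[neg_clean['lang'] == 'en']
print(f"English reviews: {len(neg_en):,} out of {len(neg_clean):,}")
```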

**Option B (better for the paper):** use a multilingual embedding model like `paraphrase-multilingual-MiniLM-L12-v2` so we keep all reviews and get genuinely thematic topics across languages.
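
Option B could be wired up roughly as follows. This is a sketch that reuses the UMAP and HDBSCAN settings from In [11] and swaps only the embedding model; `build_multilingual_topic_model` and `MULTILINGUAL_MODEL` are hypothetical names, and whether the resulting clusters turn out thematic is untested here.

```python
MULTILINGUAL_MODEL = "paraphrase-multilingual-MiniLM-L12-v2"

def build_multilingual_topic_model(min_cluster_size=50, seed=42):
    """Same pipeline as In [11], with only the embedding model swapped."""
    # Imported lazily so defining the function does not load the heavy libraries.
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from sentence_transformers import SentenceTransformer
    from umap import UMAP

    return BERTopic(
        embedding_model=SentenceTransformer(MULTILINGUAL_MODEL),
        umap_model=UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                        metric="cosine", random_state=seed),
        hdbscan_model=HDBSCAN(min_cluster_size=min_cluster_size, metric="euclidean",
                              cluster_selection_method="eom", prediction_data=True),
        nr_topics="auto",
        verbose=True,
    )

# Usage (downloads the model weights on first call):
# topic_model = build_multilingual_topic_model()
# topics, _ = topic_model.fit_transform(docs)
```

One thing to expect: the topic keywords are still computed from the raw words of each cluster, so a cluster that mixes languages will likely show a mixed-language keyword list even if the grouping itself is thematic.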

> 💡 Also worth noting: only **4,943 negative reviews (3.9%)** out of 126K is quite low. VADER tends to score polite hospitality text as positive, since guests rarely write aggressively even when unhappy. It is also an English-only lexicon, so compound scores for the non-English reviews above are unreliable. Worth considering raising the cutoff from `-0.05` to `0.0`, or experimenting with a stricter rating-based definition of "negative".
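
Before changing the cutoff, it may help to see how sensitive the negative share is to it. The sketch below uses a hypothetical list of compound scores (`toy_scores`) in place of `notes['vader_compound']`, and `negative_share` is an illustrative helper, not part of the notebook.

```python
import math

def negative_share(scores, cutoff):
    """Share of scored reviews whose compound score is strictly below `cutoff`."""
    scored = [s for s in scores if not math.isnan(s)]
    if not scored:
        return 0.0
    return sum(s < cutoff for s in scored) / len(scored)

# Hypothetical compound scores; in the notebook this would be
# notes['vader_compound'].tolist().
toy_scores = [0.98, 0.81, -0.30, 0.0, -0.02, 0.49, float("nan"), -0.60, 0.03, 0.90]

for cutoff in (-0.05, 0.0, 0.05):
    print(f"cutoff {cutoff:+.2f}: {negative_share(toy_scores, cutoff):.0%} negative")
```

Unscored reviews (NaN) are left out of the denominator here, which differs from In [5], where they are counted on the non-negative side.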


In [12]:
import numpy as np
import pandas as pd

# 0. Build neg_with_scores if it doesn't exist yet
# (safe to re-run; it just overwrites with the correct merge)
score_cols = [
    'review_scores_cleanliness',
    'review_scores_location',
    'review_scores_value',
    'review_scores_communication',
    'review_scores_checkin',
    'review_scores_accuracy',
]

aspect_cols = ['cleanliness', 'location', 'value',
               'communication', 'checkin', 'accuracy']

missing_aspects = [c for c in aspect_cols if c not in neg.columns]
if missing_aspects:
    raise ValueError(f"Aspect columns missing on `neg`: {missing_aspects}")

listing_aspect_complaints = (
    neg.groupby('listing_id')[aspect_cols]
       .mean()
       .rename(columns={k: f'pct_neg_{k}' for k in aspect_cols})
)

listing_scores = (
    notes.groupby('listing_id')[score_cols]
         .first()
)

listing_profile = listing_aspect_complaints.join(listing_scores, how='left')

neg_with_scores = neg.merge(
    listing_profile[score_cols],
    on='listing_id',
    how='left',
    suffixes=('', '_listing')
)

# 1. Define the score map
SCORE_MAP = {
    'cleanliness':   'review_scores_cleanliness',
    'location':      'review_scores_location',
    'value':         'review_scores_value',
    'communication': 'review_scores_communication',
    'checkin':       'review_scores_checkin',
    'accuracy':      'review_scores_accuracy',
}

# 2. Keep only negative reviews with all six scores present
scored_cols = list(SCORE_MAP.values())
neg_scored = neg_with_scores.dropna(subset=scored_cols).copy()

print(f"Total negative reviews: {len(neg_with_scores):,}")
print(f"With full score info : {len(neg_scored):,} "
      f"({len(neg_scored)/len(neg_with_scores)*100:.1f}%)")

# 3. Soft classification
def classify_review_soft(row):
    mismatch_signals, deficiency_signals = 0, 0
    for asp, score_col in SCORE_MAP.items():
        if row.get(asp, 0) == 1:  # aspect mentioned in this review
            listing_score = row.get(score_col, np.nan)
            if pd.isna(listing_score):
                continue
            if listing_score >= 4.6:
                mismatch_signals += 1
            elif listing_score <= 4.2:
                deficiency_signals += 1
    if mismatch_signals == 0 and deficiency_signals == 0:
        return 'ambiguous'
    return 'mismatch' if mismatch_signals >= deficiency_signals else 'deficiency'

neg_scored['review_label_soft'] = neg_scored.apply(classify_review_soft, axis=1)

# 4. Summary stats
print("\nSoft label distribution within scored negatives (%):")
print(
    neg_scored['review_label_soft']
        .value_counts(normalize=True)
        .mul(100)
        .round(2)
)

print("\nMismatch share by room_type (soft):")
print(
    neg_scored.groupby('room_type')['review_label_soft']
        .value_counts(normalize=True)
        .unstack()
        .fillna(0)
        .mul(100)
        .round(1)
)

# 5. Attach soft labels back to the full neg_with_scores
# If your review id column has a different name, adjust 'id_x' accordingly
id_col = 'id_x' if 'id_x' in neg_scored.columns else 'id'
neg_with_scores = neg_with_scores.merge(
    neg_scored[[id_col, 'review_label_soft']],
    on=id_col,
    how='left',
    suffixes=('', '_soft')
)


Total negative reviews: 9,686
With full score info : 4,943 (51.0%)

Soft label distribution within scored negatives (%):
review_label_soft
ambiguous     98.81
mismatch       1.17
deficiency     0.02
Name: proportion, dtype: float64

Mismatch share by room_type (soft):
review_label_soft  ambiguous  deficiency  mismatch
room_type                                         
Entire home/apt         98.8         0.0       1.2
Hotel room             100.0         0.0       0.0
Private room            98.8         0.0       1.2
Shared room            100.0         0.0       0.0


### 🔍 Phase 4: So… why is almost everything “ambiguous”?

After wiring up the score-based classifier, here’s what we see on the **4,943** negative reviews that have full listing scores:

- **~98.8%** → `ambiguous`
- **~1.2%** → `mismatch`
- **~0.0%** → `deficiency`

One caveat first: the cell reports **9,686** negative reviews in total, almost double the 4,943 flagged in In [4]. A row count that doubles between cells usually means the aspect flags and the review rows sit on different index labels. If so, most scored rows carry no aspect flags and can only land in `ambiguous`, so these shares need re-checking against a frame where every review row has its own flags.

With that caveat, the result also reflects the *data*:

- Airbnb ratings are **heavily compressed at the top end**. Almost every listing sits somewhere between 4.2 and 4.9 on all subscores.
- With that kind of compression, even clearly negative textual experiences do **not** translate into obviously “low” scores.
- Under a strict rule like “mismatch = high score + complaint, deficiency = low score + complaint”, any review whose flagged aspects score between 4.2 and 4.6, or that mentions no tracked aspect, falls into “🤷 not clearly one or the other”.
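
The rule can be traced on a few hand-made rows. This is a plain-Python restatement of `classify_review_soft` with the same 4.6 and 4.2 thresholds, limited to two aspects; the rows are hypothetical and exist only to show each outcome, including what happens when an aspect flag is missing.

```python
import math

# Thresholds and aspect-to-score mapping as in In [12] (two aspects shown).
SCORE_MAP = {
    "location": "review_scores_location",
    "cleanliness": "review_scores_cleanliness",
}
HIGH, LOW = 4.6, 4.2

def classify(row):
    """Label one review dict as 'mismatch', 'deficiency', or 'ambiguous'."""
    mismatch = deficiency = 0
    for aspect, score_col in SCORE_MAP.items():
        if row.get(aspect, 0) != 1:      # aspect not mentioned, or flag missing
            continue
        score = row.get(score_col, math.nan)
        if math.isnan(score):
            continue
        if score >= HIGH:
            mismatch += 1
        elif score <= LOW:
            deficiency += 1
    if mismatch == 0 and deficiency == 0:
        return "ambiguous"
    return "mismatch" if mismatch >= deficiency else "deficiency"

# Hypothetical rows, one per outcome.
rows = {
    "complaint, high score": {"location": 1, "review_scores_location": 4.9},
    "complaint, low score": {"cleanliness": 1, "review_scores_cleanliness": 4.0},
    "complaint, mid score": {"location": 1, "review_scores_location": 4.4},
    "no aspect flagged": {"review_scores_location": 4.9},
    "aspect flag is NaN": {"location": math.nan, "review_scores_location": 4.9},
}
for name, row in rows.items():
    print(f"{name:22s} -> {classify(row)}")
```

Two details of the rule are easy to miss: ties between mismatch and deficiency signals resolve to `mismatch`, and a NaN aspect flag is treated the same as "not mentioned", so it silently ends up `ambiguous`.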

So what do we take from this?

- **Structured scores are too blunt an instrument** to cleanly separate “bad fit” from “bad quality”.
- The ~1% of reviews that *do* pass our harsh mismatch filter are best seen as a **hard lower bound**: *at least* that many negative reviews are clear mismatches, even in a world of inflated ratings.
- For the rest, the interesting signal is not in tiny variations between 4.6 and 4.8, but in **what people actually say**.

Conclusion for the project:  
We’ll keep these labels around for sanity checks and robustness, but the real action for the research question now moves to **topics and language** (BERTopic + qualitative reading), not to slicing already-inflated numeric scores.
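
The `Total negative reviews: 9,686` line printed by In [12] is close to double the 4,943 rows flagged in In [4]. One mechanism that produces this kind of growth is `pd.concat(..., axis=1)` on two frames whose index labels differ: pandas aligns on labels and returns their union, padding the gaps with NaN. The sketch below reproduces the effect on hypothetical three-row frames (`neg_toy` and `tags_toy` are illustrative names, not objects from the notebook). Whether this is what happened in the recorded run is an assumption, and it can be checked by printing `len(neg)` right after In [7].

```python
import pandas as pd

# Hypothetical stand-in for `neg`: three rows that kept their labels from `notes`.
neg_toy = pd.DataFrame(
    {"listing_id": [11, 12, 13], "comments": ["dirty", "too far", "rude host"]},
    index=[0, 7, 42],
)
# Tags built with .apply() inherit the same labels (0, 7, 42).
tags_toy = pd.DataFrame(
    {"cleanliness": [1, 0, 0], "location": [0, 1, 0]},
    index=neg_toy.index,
)

# Resetting only one side: concat aligns on labels and returns their union.
misaligned = pd.concat([neg_toy.reset_index(drop=True), tags_toy], axis=1)

# Resetting both sides keeps one row per review, paired by position.
aligned = pd.concat(
    [neg_toy.reset_index(drop=True), tags_toy.reset_index(drop=True)], axis=1
)

print(len(misaligned), len(aligned))           # 5 3
print(misaligned["cleanliness"].isna().sum())  # 2 review rows lost their flags
```

In the misaligned frame, rows that have a `listing_id` mostly have NaN aspect flags, and rows that have flags have no `listing_id`, which matches a pattern where only about half the rows survive a `dropna` on listing scores.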


In [13]:
import pandas as pd

if 'orig_index' not in neg.columns:
    neg = neg.reset_index().rename(columns={'index': 'orig_index'})

neg_with_scores = neg.merge(
    listing_profile[score_cols],
    on='listing_id',
    how='left',
    suffixes=('', '_listing')
)

# Reuse SCORE_MAP and classify_review_soft as defined in In [12].
scored_cols = list(SCORE_MAP.values())
neg_scored = neg_with_scores.dropna(subset=scored_cols).copy()
neg_scored['review_label_soft'] = neg_scored.apply(classify_review_soft, axis=1)


neg_scored_clean = neg_scored.dropna(subset=['comments'])
sample = neg_scored_clean.sample(
    n=min(10_000, len(neg_scored_clean)), random_state=42
).copy()

docs = sample['comments'].str.strip().tolist()

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

umap_model = UMAP(
    n_neighbors=15, n_components=5, min_dist=0.0,
    metric='cosine', random_state=42
)

hdbscan_model = HDBSCAN(
    min_cluster_size=50, metric='euclidean',
    cluster_selection_method='eom', prediction_data=True
)

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    nr_topics="auto",
    calculate_probabilities=False,
    verbose=True
)

topics, probs = topic_model.fit_transform(docs)
sample['topic'] = topics

assert 'review_label_soft' in sample.columns

topic_cross = (
    sample.groupby(['topic', 'review_label_soft'])
          .size()
          .unstack(fill_value=0)
)

topic_cross['mismatch_share'] = (
    topic_cross.get('mismatch', 0) /
    topic_cross.sum(axis=1) * 100
).round(1)

print("\nTopic-level mismatch shares (soft labels, scored-only sample):")
print(topic_cross.sort_values('mismatch_share', ascending=False).head(15))


2026-02-20 12:20:48,635 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|██████████| 155/155 [00:34<00:00,  4.52it/s]
2026-02-20 12:21:22,974 - BERTopic - Embedding - Completed ✓
2026-02-20 12:21:22,975 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2026-02-20 12:21:28,724 - BERTopic - Dimensionality - Completed ✓
2026-02-20 12:21:28,725 - BERTopic - Cluster - Start clustering the reduced embeddings
2026-02-20 12:21:28,852 - BERTopic - Cluster - Completed ✓
2026-02-20 12:21:28,853 - BERTopic - Representation - Extracting topics using c-TF-IDF for topic reduction.
2026-02-20 12:21:28,989 - BERTopic - Representation - Completed ✓
2026-02-20 12:21:28,989 - BERTopic - Topic reduction - Reducing number of topics
2026-02-20 12:21:28,993 - BERTopic - Representation - Fine-tuning topics using representation models.
2026-02-20 12:21:29,122 - BERTopic - Representation - Completed ✓
2026-02-20 12:21:29,123 - B


Topic-level mismatch shares (soft labels, scored-only sample):
review_label_soft  ambiguous  deficiency  mismatch  mismatch_share
topic                                                             
 4                       166           0         5             2.9
 0                      1905           0        30             1.6
 2                       708           1         8             1.1
 3                       692           0         6             0.9
 1                      1191           0         9             0.8
-1                         9           0         0             0.0
 5                       128           0         0             0.0
 6                        85           0         0             0.0


### 🧩 Phase 5: Topics vs. Mismatches (a tiny but real signal)

On our scored negative sample, BERTopic found **8 topics** (including the noise bucket `-1`). The cluster sizes are identical to Phase 3, so these are still the language clusters rather than themes; topic 4, for example, is the Dutch cluster.  
When we overlay the soft mismatch labels, the picture is:

| Topic | ambiguous | deficiency | mismatch | mismatch share |
|-------|-----------|------------|----------|----------------|
| 4     | 166       | 0          | 5        | 2.9%           |
| 0     | 1,905     | 0          | 30       | 1.6%           |
| 2     | 708       | 1          | 8        | 1.1%           |
| 3     | 692       | 0          | 6        | 0.9%           |
| 1     | 1,191     | 0          | 9        | 0.8%           |
| 5     | 128       | 0          | 0        | 0.0%           |
| 6     | 85        | 0          | 0        | 0.0%           |
| -1    | 9         | 0          | 0        | 0.0%           |

The absolute numbers are small (58 mismatches across 4,943 reviews), so the ranking is a pointer for close reading rather than a result in itself. A few things stand out:

- Topic **4** has the **highest mismatch share (~3%)**, based on 5 of 171 reviews. It is our first candidate to read for a “classic mismatch” pattern: guests complaining even though the listing’s scores look great.
- Topics **0–3** also contain some mismatches, at a lower rate (~0.8–1.6%). They’re more mixed bags: mostly “generic” negatives with a small mismatch tail.
- Topics **5, 6, and the noise bucket (-1)** are entirely ambiguous under our rules, so they’re not helpful for score-based separation.

The key takeaway is not the exact percentages (they’re tiny by construction), but **which topics over-index on mismatches**. Those are the clusters we want to read closely and describe qualitatively as “latent interest vs. listing affordance” failures in the write-up.
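
"Over-indexing" can be made concrete as a lift: each topic's mismatch share divided by the overall mismatch share. The sketch below uses the counts from the printed cross-tab; `TOPIC_COUNTS` and `mismatch_lift` are illustrative names rather than notebook objects.

```python
# Counts copied from the topic cross-tab: topic -> (total reviews, mismatches).
TOPIC_COUNTS = {
    4: (171, 5),
    0: (1935, 30),
    2: (717, 8),
    3: (698, 6),
    1: (1200, 9),
    5: (128, 0),
    6: (85, 0),
    -1: (9, 0),
}

def mismatch_lift(counts):
    """Ratio of each topic's mismatch share to the overall mismatch share."""
    total = sum(n for n, _ in counts.values())
    mismatches = sum(m for _, m in counts.values())
    overall = mismatches / total
    return {topic: (m / n) / overall for topic, (n, m) in counts.items()}

lift = mismatch_lift(TOPIC_COUNTS)
for topic, value in sorted(lift.items(), key=lambda kv: kv[1], reverse=True):
    n, m = TOPIC_COUNTS[topic]
    print(f"topic {topic:>2}: {m:>2}/{n:<4} mismatches, lift {value:.2f}")
```

A lift above 1 means the topic holds more mismatches than its size would predict. With only 5 mismatches behind topic 4, relabelling one or two reviews would move its lift noticeably, so the ordering is probably best treated as a reading list, not a ranking.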
