---
# Review Based User Recommendations

---
## Problem Statement

Context:

As a Data Scientist, you’ve been asked to build a recommendation service for users on a vacation rental platform based on their previous experience.

Task:

Your task would be to develop a recommendation model that could recommend returning users new properties based on their old reviews. Let’s just assume that our platform has only vacation houses in London and we would like to recommend new properties only to our returning users.

Data:

As an input you get London Airbnb Dataset where you can find user reviews and general information about listings.

recommendations.zip

Deliverables / outcome:

Upon completion of your work, your presentation should encompass the following:
* insights and challenges that you’ve faced during the discovery process;  
* results of the sentiment analysis, how you extracted signals for the recommendation model;  
* recommendation model itself: what approach and algorithm was selected, why and how it can be evaluated.  

---

## Assumptions

* New properties: 
    * using reviews as a proxy for bookings, a new property is one the user has not yet reviewed
    * in reality a user may have booked a property without reviewing it, in which case it can still appear among the "new" recommendations
* Returning users: 
    * taken to mean anyone who has left at least one review, since any of them could return to make another booking

---

## Key Insights & Challenges


### User-Level Sparsity

Only **~12% of users have reviewed more than one property**, meaning most users have only one recorded interaction.

* Personalised modelling is therefore limited to a small subset of users
* User preference must often be inferred from minimal history
* Next-item prediction is inherently challenging for this user set

This places greater importance on strong listing representations
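
The sparsity figure above comes down to a count over (reviewer, listing) pairs. A minimal sketch on toy data follows; `multi_review_share` and the sample pairs are illustrative names and values, not part of the notebook:

```python
from collections import defaultdict

def multi_review_share(pairs):
    """Fraction of users who reviewed more than one distinct listing.

    `pairs` is an iterable of (reviewer_id, listing_id) tuples.
    """
    listings_by_user = defaultdict(set)
    for reviewer_id, listing_id in pairs:
        listings_by_user[reviewer_id].add(listing_id)
    if not listings_by_user:
        return 0.0
    multi = sum(1 for seen in listings_by_user.values() if len(seen) > 1)
    return multi / len(listings_by_user)

# Toy data (hypothetical): only user "a" has reviewed two different listings.
# User "c" reviewed the same listing twice, which counts as one listing.
toy_pairs = [("a", 1), ("a", 2), ("b", 1), ("c", 3), ("c", 3), ("d", 4)]
share = multi_review_share(toy_pairs)
print(f"{share:.0%} of users have more than one reviewed listing")
```

Counting distinct listings rather than raw reviews is a choice made in this sketch; repeat reviews of the same property add no new preference signal for next-item prediction.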

---

### Long-Tail Listings-Review interactions

The listing distribution is strongly long-tailed:

* Most listings have few reviews
* Only a small number accumulate high interaction counts

As a result:

* Popularity performs poorly at small K
* The held-out listing is rarely globally popular
* Absolute Recall@10 values are naturally low and must be interpreted in context

---

### Implicit Feedback Setting

Reviews act as a proxy for bookings:

* A review indicates interest
* Absence of a review does not imply dislike

This required:

* Pairwise ranking (BPR) instead of regression
* Careful negative sampling

---

### Impact of Feature Engineering

Experiments showed that:

* Structured features alone were insufficient
* Hard negative sampling improved ranking sharpness
* Adding summary text embeddings produced the most meaningful gains

---

# Reviews

Only 12% of users have reviewed more than 1 property. This highlights significant user-level sparsity limiting how strongly personalised the model can be.

## Sentiment

A review indicates that a user selected and experienced a property. Regardless of whether the review is positive or negative, it reflects an initial preference for that type of listing.

Given time constraints, I used overall review sentiment as a soft preference signal rather than performing deeper aspect-level extraction:  
* I convert the sentiment score into a smooth weight between 0 and 1  
* A user profile becomes a weighted average of the listings they’ve reviewed  
* Positive reviews contribute more strongly to the profile  
* Negative reviews reduce influence but are not treated as strict dislikes  
* This avoids overreacting to one-off bad experiences like a noisy weekend or a rude host that don’t necessarily reflect the user’s overall property preferences  
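
As a worked example of this weighting: the notebook's scalar score is `p_positive - p_negative`, which lies in [-1, 1], and it is rescaled linearly to [0, 1]. The sketch below mirrors that mapping in plain Python; the function name `sentiment_weight` and the probabilities are illustrative, not dataset results.

```python
def sentiment_weight(p_positive, p_negative):
    """Map class probabilities to an interaction weight in [0, 1]."""
    score = p_positive - p_negative          # scalar sentiment in [-1, 1]
    score = max(-1.0, min(1.0, score))       # guard against rounding drift
    return (score + 1.0) / 2.0               # linear rescale to [0, 1]

# A clearly positive, a mixed, and a clearly negative review (made-up numbers).
for p_pos, p_neg in [(0.90, 0.02), (0.40, 0.35), (0.05, 0.85)]:
    print(p_pos, p_neg, round(sentiment_weight(p_pos, p_neg), 3))
```

A strongly negative review still receives a small positive weight (about 0.1 in the last row), so it shrinks that listing's contribution to the profile rather than reversing it.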

### Sentiment distribution

Review sentiment is highly skewed:  
* ~91% positive  
* ~6% negative  
* ~2% neutral  

Implications:  
* Sentiment primarily acts as a strength of preference signal  
* The weighting refines user profiles but does not drastically change ranking behaviour  

### Further sentiment considerations for extended work

Negative reviews don't necessarily indicate that users were unhappy with their choice of property type; they are often more situational, e.g.:  
* property specific - unclean, faulty appliances, noise, unresponsive host, small size  
* location - hard to find, unsafe  

A more advanced extension would involve:  
* Extracting aspect level sentiment (e.g. location, host, cleanliness, noise)  
* Aligning those aspects directly with listing features, e.g. if they mentioned location negatively this can be used alongside the location features  
* Comparing positive/negative topics against other reviews to recommend properties where the review consensus on these aspects aligns  

---
# Feature Selection

To select the features, I reviewed the available data with consideration to:
* what might be most informative for a recommendation based on reviews 
* practicality within time and computational constraints  

## Structured Features

Features that likely reflect what users actively consider when selecting vacation rentals

* **Location based features**
    * Neighbourhood
        * strong indicator of location preference 

    * Latitude & Longitude
        * capture nearby neighbourhoods

* **Property based features**
    * Room type
        * entire home or a private room is likely to be one of the strongest drivers of booking decisions
    * Property type
        * apartment, house, loft etc. provide additional differentiation  
    * Accommodates
        * could be indicator of user type (solo, couple, group)
    * Amenities
        * property features such as wifi, kitchen, heating, washer, etc.  
        * limited to the most frequent amenities to reduce sparsity and noise  

* **Price**
  * Although this is the price on the scraping day rather than the actual booking price, it can be treated as a proxy for the property's quality tier

## Text Features

Capture the semantic representation not available within the structured features

* **Summary**
  * pre-trained sentence embeddings were used to encode listing summaries  
  * capture qualitative differences such as style, ambience, and unique characteristics  

## Sentiment

Without time to break sentiment down into specific features and preferences, I used overall sentiment as a preference signal to avoid overfitting to one-off negative reviews:

* review sentiment was converted into a smooth weight between 0 and 1  
* used to weight historical interactions when constructing user profiles  
* negative reviews reduced influence rather than acting as strict negative labels  

## Features Explicitly Excluded

The following features were intentionally not used:

* review score aggregates (rating, cleanliness score, etc.)  
* number of reviews  
* availability fields  

Reasons:

* they bias the model toward already popular listings rather than helping it learn individual user preferences
* they are better suited for post-ranking adjustments or filtering (e.g. ensuring quality thresholds or availability)

---

# Model Architecture & Evaluation

## Recommendation Model Approach

Given the objective is to recommend new listings to returning users based on their past reviews, the model must infer user preferences from historical interactions.

The model I chose takes the following approach:
* Learning a dense representation of each listing
* Constructing a user preference vector from previously reviewed listings
* Ranking unseen listings by similarity to the user profile


**Listing encoder**

Each listing is encoded into a dense vector using:
* Structured features (neighbourhood, room type, property type, amenities, price, location)
* Summary text embeddings to capture qualitative differences between properties
* These are combined in a neural network to produce a compact listing embedding

**User representation**  
* A user is represented as a sentiment-weighted average of the embeddings of previously reviewed listings.
* Positive reviews influence the profile more
* Negative reviews reduce influence rather than acting as strict dislikes
* This avoids overreacting to one-off negative experiences while still capturing overall preferences.
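
The user representation above can be sketched as a weighted mean followed by L2 normalisation. This is a minimal illustration in plain Python standing in for the notebook's tensor operations; `user_profile` and the two-dimensional vectors are hypothetical.

```python
import math

def user_profile(listing_vectors, weights):
    """Sentiment-weighted mean of listing embeddings, L2-normalised."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one review must carry positive weight")
    dim = len(listing_vectors[0])
    mean = [
        sum(w * vec[d] for vec, w in zip(listing_vectors, weights)) / total
        for d in range(dim)
    ]
    # Normalise so that a dot product with a unit listing vector is a cosine.
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0
    return [x / norm for x in mean]

# Hypothetical 2-d embeddings: a well-liked listing and a poorly rated one.
liked, disliked = [1.0, 0.0], [0.0, 1.0]
profile = user_profile([liked, disliked], weights=[0.9, 0.1])
print(profile)
```

The resulting profile points mostly towards the well-liked listing while keeping a small component from the poorly rated one, which is the "reduced influence" behaviour described above.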

**Training objective - Bayesian Personalized Ranking (BPR)**  
* The model is trained using Bayesian Personalized Ranking (BPR)
* The objective pushes each reviewed listing to rank higher than unobserved listings (those the user has not reviewed)
* This is appropriate for implicit feedback and directly aligns with the top-K recommendation objective
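
For reference, the standard pairwise BPR loss for one positive and a set of sampled negatives is the mean of `-log(sigmoid(s_pos - s_neg))`. The sketch below shows that textbook form in plain Python; it is an assumption that the notebook's loss matches it exactly, and the scores are illustrative.

```python
import math

def bpr_loss(pos_score, neg_scores):
    """Mean of -log(sigmoid(pos_score - neg_score)) over sampled negatives."""
    losses = []
    for neg_score in neg_scores:
        diff = pos_score - neg_score
        # softplus(-diff) equals -log(sigmoid(diff)); this form avoids
        # overflow when |diff| is large.
        losses.append(max(-diff, 0.0) + math.log1p(math.exp(-abs(diff))))
    return sum(losses) / len(losses)

# Scores stand for user-listing similarities (illustrative values).
print(bpr_loss(0.8, [0.1, 0.2]))   # positive ranked above the negatives
print(bpr_loss(0.1, [0.8, 0.9]))   # positive ranked below the negatives
```

The loss depends only on score differences, so it rewards correct ordering rather than calibrated scores, which is why it suits implicit feedback.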

**Hard negative sampling**  
* First runs used uniform random sampling for negatives which made the ranking task too easy and didn't provide meaningful separation  
* To improve discrimination between similar properties, negatives were sampled from listings sharing the same room type (this could further be extended to neighbourhood with more time)
* This encouraged the model to distinguish between similar listings, resulting in improved ranking performance (MRR)
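
A minimal sketch of same-room-type negative sampling, using only the standard library. The helper name, the toy catalogue, and the choice to exclude the positive itself from the candidates are assumptions of this sketch rather than a description of the training loop.

```python
import random

def sample_hard_negatives(pos_idx, room_type_of, buckets, k, rng=random):
    """Sample k negatives sharing the positive's room type, excluding the positive."""
    candidates = [i for i in buckets[room_type_of[pos_idx]] if i != pos_idx]
    if not candidates:
        # Fall back to uniform sampling over all other listings.
        candidates = [i for i in range(len(room_type_of)) if i != pos_idx]
    return [rng.choice(candidates) for _ in range(k)]  # with replacement

# Hypothetical catalogue: listing index -> room type.
room_type_of = ["entire", "private", "entire", "entire", "private"]
buckets = {}
for idx, room_type in enumerate(room_type_of):
    buckets.setdefault(room_type, []).append(idx)

negatives = sample_hard_negatives(0, room_type_of, buckets, k=5, rng=random.Random(0))
print(negatives)
```

Extending the bucket key from room type to, say, `(room_type, neighbourhood)` would give harder negatives at the cost of smaller candidate pools.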


**Evaluation**
* Performed using temporal hold-out
* For each user, the last reviewed listing was held out
* The model was trained on earlier reviews
* Metrics reported: 
    * Recall@K: how often does the model retrieve the correct listing within a shortlist of size K
    * MRR@K: how highly the correct listing is ranked when it appears in the top K.
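
Both metrics can be computed from each user's ranked list and held-out listing. The sketch below assumes one held-out item per user and counts a miss as a reciprocal rank of 0; the function names and toy rankings are illustrative.

```python
def recall_at_k(ranked_lists, targets, k):
    """Share of users whose held-out listing appears in their top-k."""
    hits = sum(1 for ranked, t in zip(ranked_lists, targets) if t in ranked[:k])
    return hits / len(targets)

def mrr_at_k(ranked_lists, targets, k):
    """Mean reciprocal rank of the held-out listing, 0 when outside the top-k."""
    total = 0.0
    for ranked, t in zip(ranked_lists, targets):
        top_k = ranked[:k]
        if t in top_k:
            total += 1.0 / (top_k.index(t) + 1)
    return total / len(targets)

# Three hypothetical users: hit at rank 1, hit at rank 3, miss.
ranked_lists = [[10, 11, 12], [20, 21, 22], [30, 31, 32]]
targets = [10, 22, 99]
print(recall_at_k(ranked_lists, targets, k=3))
print(mrr_at_k(ranked_lists, targets, k=3))
```

With a single held-out item per user, Recall@K is the same as hit rate at K, while MRR@K additionally rewards placing the item nearer the top.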

---

## Baseline Comparison

### Popularity Baseline

* As a benchmark, a non-personalised popularity baseline was calculated
* Listings were ranked by total number of reviews
* For each user, previously reviewed listings were excluded from recommendations
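
A minimal sketch of such a baseline follows. It assumes popularity is counted on training interactions only, so the held-out review does not leak into the ranking; `popularity_recommend` and the toy pairs are hypothetical.

```python
from collections import Counter

def popularity_recommend(train_pairs, user_id, k):
    """Top-k listings by training review count, skipping the user's own history."""
    counts = Counter(listing_id for _, listing_id in train_pairs)
    seen = {listing_id for uid, listing_id in train_pairs if uid == user_id}
    ranked = [lid for lid, _ in counts.most_common() if lid not in seen]
    return ranked[:k]

# Hypothetical (reviewer_id, listing_id) training interactions.
train_pairs = [("a", 1), ("b", 1), ("c", 1), ("a", 2), ("b", 2), ("c", 3)]
print(popularity_recommend(train_pairs, "a", k=2))
print(popularity_recommend(train_pairs, "new_user", k=2))
```

Every user receives the same ranking apart from the exclusion of their own history, which is what makes this a non-personalised reference point.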

Results showed:

* **Recall@10 ≈ 0.0%**
* **Recall@20 ≈ 0.0%**
* **Recall@1000 ≈ 16%**

This indicates:

* The catalogue is strongly long-tailed.
* The held-out listing is rarely among the globally most popular items.
* The task is non-trivial at small K.

Popularity is then used as a lower bound for performance.

---

## Experiment Results / Model Development

Only around 12% of users have reviewed more than one property, so most users provide very limited behavioural history. This makes exact next-item prediction inherently difficult, particularly in a large and long-tailed catalogue.

I therefore started with a simple structured-feature baseline and incrementally added complexity to evaluate which modelling changes produced meaningful improvements.

### Structured Features Only

Using only structured listing features:

* **Recall@10 ≈ 1.1%**
* **Recall@20 ≈ 1.4%**

Observations:

* The model slightly outperformed popularity at small K
* However, it struggled to distinguish between structurally similar listings
* This suggested that structured metadata alone was insufficient to fully capture user preference nuances

---

### Hard Negative Sampling

Switching from uniform random negatives to same-room-type negatives:

* Improved MRR
* Slightly improved Recall@K stability
* Increased training difficulty (loss decreased more slowly)

Observations:

* Uniform negatives made the ranking task too easy.  
* Hard negatives forced the model to distinguish between similar listings, improving ranking performance even if recall improvements were modest

---

### Adding Summary Text Embeddings

Incorporating semantic embeddings of the listing summary text:

* Recall@10 improved from ~1.1% → ~1.6%
* Recall@20 improved from ~1.4% → ~2.4%
* MRR increased further
* This represents a meaningful relative improvement

Observations:

* Structured features alone lack sufficient granularity
* Summary text captures qualitative differences (style, ambiance, unique attributes)
* Semantic representation significantly improves differentiation between similar properties

This confirms that representation quality is a key driver of model performance.

---

## Model Insights & Conclusions

### Deeper Representation of Features is Required

* Optimisation changes (more epochs or adjusting learning rate) yielded marginal improvements
* Adding semantic text embeddings yielded significant improvements
* Given how few experiments have been run so far, richer representations of the text-based features could improve results further

---

### Personalisation is Limited by Data Sparsity

* Only a small proportion of users have multiple reviews.
* Most user profiles are built from very limited historical interactions
* I evaluated the model using next item prediction, but with more time I would also look at general preference alignment, such as similarity to the held-out listing or neighbourhood level accuracy

---

### Sentiment Weighting as a Stabiliser

* Sentiment was used as a soft weighting mechanism
* Negative reviews reduced influence rather than acting as strict negative signals
* This prevents overfitting to situational dissatisfaction
* With more time I would look at modelling the key factors mentioned in the reviews and the positive/negative sentiment aligned to those

---

## Extensions & Future Improvements

With additional time, the following extensions would likely improve performance:

* **Aspect-level sentiment extraction**
  * Extract sentiment on specific aspects (location, cleanliness, host, noise)
  * Align aspect level sentiment directly with listing features
  * This would provide more targeted preference signals rather than relying on overall review sentiment

* **More advanced hard negative sampling**
  * Sample within same neighbourhood and price band
  * This could further improve the model's ability to distinguish similar listings

* **Cold-Start Strategy for Single-Review Users**
    * Given that only ~12% of users have multiple reviews, most users provide very limited history
    * Nearest-neighbour retrieval
        * Recommend listings most similar to the reviewed property using structured and semantic embeddings
    * Adjusting recommendations when confidence is low
        * Estimate confidence based on similarity strength and adjust ranking conservatively when signal is weak

* **Online evaluation**  
If deployed, evaluation could focus on:  
  * CTR (click-through rate)
  * Conversion rate / booking rate: do recommendations lead to bookings?
  * Diversity of recommendations: varied recommendations help ensure a wide range of inventory is booked
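
The nearest-neighbour cold-start idea for single-review users could look roughly like the sketch below. It is a plain-Python illustration under stated assumptions: `similar_listings`, the `min_similarity` cut-off (a crude stand-in for the confidence adjustment), and the toy embeddings are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similar_listings(seed_id, embeddings, k, min_similarity=0.0):
    """Rank other listings by cosine similarity to the single reviewed listing.

    Candidates scoring below `min_similarity` are dropped, so fewer than k
    results may be returned when the signal is weak.
    """
    seed = embeddings[seed_id]
    scored = [
        (cosine(seed, vec), lid)
        for lid, vec in embeddings.items()
        if lid != seed_id
    ]
    scored = [(s, lid) for s, lid in scored if s >= min_similarity]
    scored.sort(reverse=True)
    return [lid for _, lid in scored[:k]]

# Hypothetical listing embeddings keyed by listing id.
embeddings = {1: [1.0, 0.0], 2: [0.9, 0.1], 3: [0.0, 1.0], 4: [0.7, 0.7]}
print(similar_listings(1, embeddings, k=2))
print(similar_listings(1, embeddings, k=2, min_similarity=0.8))
```

When the threshold leaves fewer than k candidates, the remaining slots could be filled from a non-personalised fallback such as popularity.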




# Recommender Model Code



## Calculate sentiment of reviews

In [1]:
import pandas as pd
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from tqdm.auto import tqdm

MODEL_NAME = "cardiffnlp/twitter-xlm-roberta-base-sentiment"

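def clean_text(s) -> str:
    # Treat None and NaN (pandas missing values) as empty text;
    # str(NaN) would otherwise be scored as the literal string "nan"
    if s is None or (isinstance(s, float) and np.isnan(s)):
        return ""
    # Collapse all runs of whitespace (including newlines) into single spaces
    return " ".join(str(s).split())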

@torch.inference_mode()
def sentiment_xlmr(
    texts,
    batch_size: int = 64,
    max_length: int = 128,
    device: str | None = None,
):
    """
    Returns:
      - label: {negative, neutral, positive}
      - p_negative, p_neutral, p_positive
      - sentiment_score: scalar in [-1, 1] computed as p_pos - p_neg
    """
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).to(device)
    model.eval()

    # Model label order is typically: negative, neutral, positive
    id2label = model.config.id2label
    # Ensure consistent column ordering
    # We'll map probabilities by label name
    out_labels = []
    p_neg, p_neu, p_pos = [], [], []
    score = []

    # Clean texts
    texts = [clean_text(t) for t in texts]

    for i in tqdm(range(0, len(texts), batch_size), desc="Sentiment (XLM-R)"):
        batch = texts[i:i + batch_size]
        enc = tokenizer(
            batch,
            padding=True,
            truncation=True,
            max_length=max_length,
            return_tensors="pt",
        ).to(device)

        logits = model(**enc).logits
        probs = torch.softmax(logits, dim=-1).detach().cpu().numpy()  # shape (B, C)

        for row in probs:
            # Build dict {label_name: prob}
            prob_by_label = {id2label[j].lower(): float(row[j]) for j in range(len(row))}
            neg = prob_by_label.get("negative", 0.0)
            neu = prob_by_label.get("neutral", 0.0)
            pos = prob_by_label.get("positive", 0.0)

            # Predicted label
            pred = max(prob_by_label, key=prob_by_label.get)

            out_labels.append(pred)
            p_neg.append(neg)
            p_neu.append(neu)
            p_pos.append(pos)

            # Scalar sentiment in [-1, 1]
            # (pos - neg) is a simple, interpretable signal for weighting interactions
            score.append(pos - neg)

    return pd.DataFrame({
        "sentiment_label": out_labels,
        "p_negative": p_neg,
        "p_neutral": p_neu,
        "p_positive": p_pos,
        "sentiment_score": score,   # [-1, 1]
    })

In [2]:
# I would normally batch the processing of this for production but for this task have run all together for simplicity

run_sentiment_analysis = False

if run_sentiment_analysis:
    reviews = pd.read_csv('data/reviews.csv')
    sent_df = sentiment_xlmr(reviews["comments"].tolist(), batch_size=64, max_length=128)
    reviews = pd.concat([reviews.reset_index(drop=True), sent_df], axis=1)
    reviews.to_csv('data/reviews_sentiment.csv', index=False)
else:
    reviews = pd.read_csv('data/reviews_sentiment.csv')

In [3]:
reviews

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,word_count,lang,lang_prob,sentiment_label,p_negative,p_neutral,p_positive,sentiment_score
0,11551,30672,2010-03-21,93896,Shar-Lyn,"The flat was bright, comfortable and clean and...",49,en,0.989361,positive,0.041174,0.138245,0.820581,0.779408
1,11551,32236,2010-03-29,97890,Zane,We stayed with Adriano and Valerio for a week ...,46,en,0.965011,positive,0.024325,0.100110,0.875564,0.851239
2,90700,337227,2011-06-27,311071,Miqua,it was all in all the perfect week!\r\nchilton...,84,en,0.986724,positive,0.074755,0.112785,0.812459,0.737704
3,90700,378738,2011-07-17,224367,Prateek,"I'll start with the host, and then move on to ...",189,en,0.993054,positive,0.202945,0.295640,0.501414,0.298469
4,90700,543840,2011-09-18,1115024,Jennifer,Great location. Plenty to do just steps outsid...,92,en,0.977307,positive,0.046721,0.104764,0.848515,0.801794
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1485592,39740287,559509688,2019-11-04,182032644,Isabel,"A very good stay, I would repeat for sure.",9,en,0.996551,positive,0.020210,0.054513,0.925276,0.905066
1485593,22701498,558667202,2019-11-03,65955902,Shereen,"Set in a lovely development with onsite bar, c...",24,en,0.967389,positive,0.010547,0.090022,0.899432,0.888885
1485594,38398365,552239161,2019-10-21,60436496,Chee Ling,(Website hidden by Airbnb) a.best owner and ge...,31,en,0.891100,positive,0.034510,0.113235,0.852255,0.817745
1485595,38398365,559541617,2019-11-04,97684167,Carolyn,This flat is perfection! Everything you need i...,54,en,0.975492,positive,0.035403,0.072402,0.892195,0.856791


# Recommender

## Setup

In [4]:
import re
import math
import numpy as np
import pandas as pd
from dataclasses import dataclass
from typing import Dict, List, Tuple
import time
import logging

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from sentence_transformers import SentenceTransformer

# -------------------------
# Logging setup
# -------------------------
def setup_logger(name="recommender", level=logging.INFO):
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:
        h = logging.StreamHandler()
        fmt = logging.Formatter(
            fmt="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
            datefmt="%Y-%m-%d %H:%M:%S",
        )
        h.setFormatter(fmt)
        logger.addHandler(h)
    logger.propagate = False
    return logger

LOGGER = setup_logger()

## Summary Embeddings

In [5]:
def compute_summary_embeddings(
    listings: pd.DataFrame,
    text_col: str = "summary",
    model_name: str = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    batch_size: int = 64,
    logger: logging.Logger = LOGGER,
) -> np.ndarray:
    """
    Returns:
      emb: (n_listings, text_dim) float32 numpy array
    """
    texts = listings[text_col].fillna("").astype(str).tolist()
    # light cleanup: strip surrounding whitespace (blank summaries become "")
    texts = [t.strip() for t in texts]

    logger.info(f"Computing summary embeddings: n={len(texts)} model={model_name} batch_size={batch_size}")
    st = SentenceTransformer(model_name, device="cpu")

    emb = st.encode(
        texts,
        batch_size=batch_size,
        show_progress_bar=True,
        convert_to_numpy=True,
        normalize_embeddings=True,
    ).astype("float32")

    logger.info(f"Summary embeddings ready: shape={emb.shape}")
    return emb

## Preprocessing

In [6]:
# -------------------------
# Preprocessing helpers
# -------------------------
def parse_money_to_float(x):
    if pd.isna(x): return np.nan
    if isinstance(x, (int, float, np.number)): return float(x)
    s = str(x).strip().replace(",", "")
    s = re.sub(r"[^0-9\.\-]", "", s)
    try:
        return float(s) if s else np.nan
    except ValueError:
        return np.nan

def normalize_amenities(amenities_str):
    # clean text and return as list
    if pd.isna(amenities_str): return []
    s = str(amenities_str).strip().strip("{}")
    parts = [p.strip().strip('"').strip("'") for p in s.split(",")]
    parts = [p.lower() for p in parts if p]
    # light normalization
    out = []
    for p in parts:
        p = re.sub(r"\s+", "_", p)
        p = re.sub(r"[^a-z0-9_\-]+", "", p)
        if p:
            out.append(p)
    return out

def sentiment_to_weight(sentiment_score):
    # convert sentiment_score from [-1,1] to [0,1] to use as interaction weight
    s = np.clip(sentiment_score, -1, 1)
    return (s + 1.0) / 2.0  # [0,1]

# -------------------------
# Build feature index / encodings
# -------------------------
@dataclass
class FeatureIndex:
    neigh2i: Dict[str, int]
    ptype2i: Dict[str, int]
    rtype2i: Dict[str, int]
    amen2i: Dict[str, int]
    lid2i: Dict[int, int] 
    i2lid: List[int]

def build_feature_index(listings: pd.DataFrame) -> FeatureIndex:
    def make_map(vals):
        vals = [v for v in vals if isinstance(v, str) and v.strip()]
        uniq = sorted(set(vals))
        return {v: i+1 for i, v in enumerate(uniq)}  # 0 UNK/none

    neigh2i = make_map(listings["neighbourhood_cleansed"].fillna("").astype(str).tolist())
    ptype2i = make_map(listings["property_type"].fillna("").astype(str).tolist())
    rtype2i = make_map(listings["room_type"].fillna("").astype(str).tolist())

    # amenities index
    amen_counts = {}
    for a_list in listings["amenities_list"]:
        for a in a_list:
            amen_counts[a] = amen_counts.get(a, 0) + 1
    # keep top amenities (selected 50 for now but would ideally tune this)
    TOP_N_AMENITIES = 50
    top_amen = sorted(amen_counts.items(), key=lambda x: x[1], reverse=True)[:TOP_N_AMENITIES]
    amen2i = {a: i for i, (a, _) in enumerate(top_amen)}

    # listing id map
    i2lid = listings["id"].astype(int).tolist()
    lid2i = {lid: i for i, lid in enumerate(i2lid)}

    return FeatureIndex(neigh2i, ptype2i, rtype2i, amen2i, lid2i, i2lid)

def encode_listings(listings: pd.DataFrame, feature_index: FeatureIndex) -> Dict[str, torch.Tensor]:
    # categorical features
    neighbourhood_idx = listings["neighbourhood_cleansed"].fillna("").astype(str).map(lambda x: feature_index.neigh2i.get(x, 0)).to_numpy()
    property_type_idx = listings["property_type"].fillna("").astype(str).map(lambda x: feature_index.ptype2i.get(x, 0)).to_numpy()
    room_type_idx = listings["room_type"].fillna("").astype(str).map(lambda x: feature_index.rtype2i.get(x, 0)).to_numpy()

    # numeric features
    latitude = listings["latitude"].astype(float).to_numpy()
    longitude = listings["longitude"].astype(float).to_numpy()
    accommodates = listings["accommodates"].astype(float).fillna(0).to_numpy()

    total_price = listings["total_price"].astype(float).to_numpy()
    valid_mask = np.isfinite(total_price) & (total_price > 0)
    total_price = np.where(valid_mask, np.log1p(total_price), 0.0)

    numeric_features = np.stack([latitude, longitude, accommodates, total_price], axis=1).astype("float32")

    # normalize numeric
    num_mean = np.nanmean(numeric_features, axis=0)
    num_std  = np.nanstd(numeric_features, axis=0) + 1e-6
    numeric_features = np.nan_to_num((numeric_features - num_mean) / num_std, nan=0.0)

    # amenities multi-hot
    n_listings = len(listings)
    n_amen = len(feature_index.amen2i)

    amenity_matrix = np.zeros((n_listings, n_amen), dtype=np.float32)

    for i, amenities in enumerate(listings["amenities_list"]):
        for a in amenities:
            idx = feature_index.amen2i.get(a)
            if idx is not None:
                amenity_matrix[i, idx] = 1.0

    return {
        "neighbourhood_idx": torch.tensor(neighbourhood_idx, dtype=torch.long),
        "property_type_idx": torch.tensor(property_type_idx, dtype=torch.long),
        "room_type_idx": torch.tensor(room_type_idx, dtype=torch.long),
        "numeric_features": torch.tensor(numeric_features, dtype=torch.float32),
        "amenity_features": torch.tensor(amenity_matrix, dtype=torch.float32),
    }

## Model

In [7]:
# -------------------------
# Listing encoder model
# -------------------------
class ListingEncoder(nn.Module):
    def __init__(self, n_neigh, n_ptype, n_rtype, n_amen, d_numeric, text_dim, d=64, out_dim=64):
        super().__init__()
        # categorical embeddings
        self.neighbourhood_embedding = nn.Embedding(n_neigh, d)
        self.property_type_embedding = nn.Embedding(n_ptype, d)
        self.room_type_embedding = nn.Embedding(n_rtype, d)

        # numeric projection
        self.numeric_projection = nn.Sequential(
            nn.Linear(d_numeric, 64),
            nn.ReLU(),
            nn.Linear(64, d),
        )

        # amenities projection
        self.amenity_linear = nn.Linear(n_amen, d)

        # summary text projection
        self.text_proj = nn.Sequential(
            nn.Linear(text_dim, d),
            nn.ReLU(),
        )

        # final mlp
        self.final_mlp = nn.Sequential(
            nn.Linear(d*6, 256),
            nn.ReLU(),
            nn.Linear(256, out_dim),
        )

    def forward(self, neighbourhood_idx, property_type_idx, room_type_idx, numeric_features, amenity_features, summary_embedding):
        e_neigh = self.neighbourhood_embedding(neighbourhood_idx)
        e_ptype = self.property_type_embedding(property_type_idx)
        e_rtype = self.room_type_embedding(room_type_idx)

        e_num = self.numeric_projection(numeric_features)
        e_amen = self.amenity_linear(amenity_features)
        e_txt = self.text_proj(summary_embedding)

        combined_features = torch.cat([e_neigh, e_ptype, e_rtype, e_amen, e_num, e_txt], dim=1)

        listing_embedding = self.final_mlp(combined_features)
        listing_embedding = F.normalize(listing_embedding, dim=1)  
        return listing_embedding


# -------------------------
# Training dataset: next-item pairs
# -------------------------
class NextItemDataset(Dataset):
    """
    Builds training samples for next-item prediction from time-ordered reviews per user.
    Each sample is (history_listing_indices, history_weights, pos_listing_index)
    where weights come from sentiment_to_weight(sentiment_score).
    Note:
      - This dataset only yields samples for users with >= 2 reviews.
      - Single-review users are still recommendable at inference time, but they do not
        contribute to next-item training/evaluation under this setup.
    """
    def __init__(self, reviews: pd.DataFrame, feature_index: FeatureIndex, min_hist=1):
        df = reviews.loc[:, ["reviewer_id", "listing_id", "date", "sentiment_score"]].copy()
        df["date"] = pd.to_datetime(df["date"], errors="coerce")
        df = df.dropna(subset=["reviewer_id", "listing_id", "date"])

        df["listing_id"] = df["listing_id"].astype(int)
        df["reviewer_id"] = df["reviewer_id"].astype(str)

        df = df[df["listing_id"].isin(feature_index.lid2i)]
        df["listing_idx"] = df["listing_id"].map(feature_index.lid2i).astype(int)
        df["weight"] = df["sentiment_score"].astype(float).map(sentiment_to_weight)

        df = df.sort_values(["reviewer_id", "date"])

        samples: list[tuple[list[int], list[float], int]] = []

        for _, g in df.groupby("reviewer_id", sort=False):
            seq = g["listing_idx"].to_list()
            wts = g["weight"].to_list()

            # Need at least 2 interactions to form (history -> next item)
            if len(seq) < 2:
                continue

            # Create next-item samples
            for t in range(1, len(seq)):
                hist = seq[:t]
                hist_w = wts[:t]
                pos = seq[t]
                if len(hist) >= min_hist:
                    samples.append((hist, hist_w, pos))

        self.samples = samples

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, idx: int):
        return self.samples[idx]
    

def collate_batch(batch, max_hist=50):
    # pad histories to a fixed length
    hists, weights, pos = zip(*batch)

    B = len(hists)
    L = min(max_hist, max(len(h) for h in hists))

    hist_pad = torch.zeros((B, L), dtype=torch.long)
    w_pad = torch.zeros((B, L), dtype=torch.float32)
    mask = torch.zeros((B, L), dtype=torch.float32)

    for i, (h, w) in enumerate(zip(hists, weights)):
        h = h[-L:]  # keep most recent L items
        w = w[-L:]

        n = len(h)
        hist_pad[i, :n] = torch.tensor(h, dtype=torch.long)
        w_pad[i, :n] = torch.tensor(w, dtype=torch.float32)
        mask[i, :n] = 1.0

    pos = torch.tensor(pos, dtype=torch.long)
    return hist_pad, w_pad, mask, pos

## Training loop

In [8]:
def build_room_type_buckets(listing_feats):
    """
    Returns: dict {room_type_id: tensor_of_listing_indices}
    """
    room_type_idx = listing_feats["room_type_idx"]
    buckets = {}

    for idx, rt in enumerate(room_type_idx.tolist()):
        buckets.setdefault(rt, []).append(idx)

    # convert lists to tensors for fast sampling
    for rt in buckets:
        buckets[rt] = torch.tensor(buckets[rt], dtype=torch.long)

    return buckets


def train(
    model: nn.Module,
    listing_feats: dict,
    dataset,
    n_listings: int,
    collate_batch,
    device: str | None = None,
    epochs: int = 2,
    batch_size: int = 256,
    lr: float = 2e-3,
    neg_k: int = 10,
    logger: logging.Logger = LOGGER,
):

    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"

    logger.info("==== Training start ====")
    logger.info(f"device={device} epochs={epochs} batch_size={batch_size} lr={lr} neg_k={neg_k}")

    model = model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    loader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        collate_fn=collate_batch,
        drop_last=False,
    )

    # move listing features to device
    feats = {k: v.to(device) for k, v in listing_feats.items()}
    n_batches = len(loader)
    logger.info(f"dataset_samples={len(dataset)} batches_per_epoch={n_batches}")

    # room type buckets for negative sampling
    room_type_buckets = build_room_type_buckets(listing_feats)


    for epoch in range(1, epochs + 1):
        model.train()
        epoch_loss = 0.0
        n_seen = 0

        for hist, w, mask, pos in loader:
            hist = hist.to(device)
            w = w.to(device)
            mask = mask.to(device)
            pos = pos.to(device)

            B, L = hist.shape

            # Encode history listings
            hist_neigh = feats["neighbourhood_idx"][hist]
            hist_ptype = feats["property_type_idx"][hist]
            hist_rtype = feats["room_type_idx"][hist]
            hist_num   = feats["numeric_features"][hist]
            hist_amen  = feats["amenity_features"][hist]
            hist_txt   = feats["summary_embedding"][hist]

            v_hist = model(
                hist_neigh.reshape(-1),
                hist_ptype.reshape(-1),
                hist_rtype.reshape(-1),
                hist_num.reshape(-1, hist_num.shape[-1]),
                hist_amen.reshape(-1, hist_amen.shape[-1]),
                hist_txt.reshape(-1, hist_txt.shape[-1]),
            ).reshape(B, L, -1)

            # User embedding (weighted average)
            w_eff = w * mask
            denom = w_eff.sum(dim=1, keepdim=True).clamp_min(1e-6)
            u = (v_hist * w_eff.unsqueeze(-1)).sum(dim=1) / denom
            u = F.normalize(u, dim=1)

            # Positive item
            v_pos = model(
                feats["neighbourhood_idx"][pos],
                feats["property_type_idx"][pos],
                feats["room_type_idx"][pos],
                feats["numeric_features"][pos],
                feats["amenity_features"][pos],
                feats["summary_embedding"][pos],
            )

            # Negative sampling
            #neg = torch.randint(0, n_listings, (B, neg_k), device=device)
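            # Hard negatives: same room_type as positive.
            # Index the on-device copy (`feats`): `pos` is already on `device`,
            # and indexing the CPU-resident `listing_feats` with a CUDA tensor fails.
            pos_room_types = feats["room_type_idx"][pos]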

            neg_list = []

            for i in range(B):
                rt = int(pos_room_types[i].item())
                candidates = room_type_buckets.get(rt)

                if candidates is None or len(candidates) == 0:
                    # fallback to random if no candidates
                    sampled = torch.randint(0, n_listings, (neg_k,), device=device)
                else:
                    # sample with replacement
                    rand_idx = torch.randint(0, len(candidates), (neg_k,))
                    sampled = candidates[rand_idx].to(device)

                neg_list.append(sampled)

            neg = torch.stack(neg_list, dim=0)  # shape (B, neg_k)

            v_neg = model(
                feats["neighbourhood_idx"][neg].reshape(-1),
                feats["property_type_idx"][neg].reshape(-1),
                feats["room_type_idx"][neg].reshape(-1),
                feats["numeric_features"][neg].reshape(-1, feats["numeric_features"].shape[-1]),
                feats["amenity_features"][neg].reshape(-1, feats["amenity_features"].shape[-1]),
                feats["summary_embedding"][neg].reshape(-1, feats["summary_embedding"].shape[-1]),
            ).reshape(B, neg_k, -1)

            # BPR (Bayesian Personalized Ranking) loss
            s_pos = (u * v_pos).sum(dim=1, keepdim=True)
            s_neg = (u.unsqueeze(1) * v_neg).sum(dim=2)

            loss = -F.logsigmoid(s_pos - s_neg).mean()

            # Backprop
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            epoch_loss += loss.item() * B
            n_seen += B

        avg_loss = epoch_loss / max(n_seen, 1)
        logger.info(f"Epoch {epoch}/{epochs} - loss: {avg_loss:.4f}")

    logger.info("=== Training finished ===")
    return model
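
To make the BPR objective used in the loop above concrete, here is a minimal standalone sketch in plain Python (no PyTorch). The function names `log_sigmoid` and `bpr_loss` and the toy scores are illustrative assumptions, not part of the notebook; the computation mirrors `-F.logsigmoid(s_pos - s_neg).mean()` for a single batch.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bpr_loss(pos_scores: list[float], neg_scores: list[list[float]]) -> float:
    """Mean of -log(sigmoid(s_pos - s_neg)) over every (positive, negative) pair.

    pos_scores[i] is the score of user i's positive item;
    neg_scores[i] holds the scores of that user's sampled negatives.
    """
    terms = [
        -log_sigmoid(s_pos - s_neg)
        for s_pos, negs in zip(pos_scores, neg_scores)
        for s_neg in negs
    ]
    return sum(terms) / len(terms)


# Toy batch: 2 users, 2 negatives each (hypothetical scores).
well_ranked = bpr_loss([0.9, 0.8], [[0.1, 0.0], [0.2, -0.1]])
badly_ranked = bpr_loss([0.1, 0.0], [[0.9, 0.8], [0.7, 0.6]])
print(f"well ranked:  {well_ranked:.4f}")
print(f"badly ranked: {badly_ranked:.4f}")
```

When every positive and negative score is equal, each term is log 2 (about 0.693), which is a useful reference point for the logged losses. If the encoder outputs are unit-normalised, as the comment in `recommend_for_user` suggests, score differences are bounded by 2 in absolute value, so the loss cannot fall below log(1 + e^-2), roughly 0.127.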

## Run Training

In [9]:
# -------------------------
# Precompute listing embeddings
# -------------------------
@torch.inference_mode()
def compute_all_listing_embeddings(
    model: nn.Module,
    listing_feats: dict,
    batch_size: int = 4096,
    device: str | None = None,
    logger: logging.Logger = LOGGER,
):
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"

    model = model.to(device)
    model.eval()

    feats = {k: v.to(device) for k, v in listing_feats.items()}

    n_listings = feats["neighbourhood_idx"].shape[0]
    logger.info(f"Precomputing listing embeddings (n={n_listings}, batch_size={batch_size}, device={device})")

    embeddings = []
    start_time = time.time()

    for start in range(0, n_listings, batch_size):
        end = min(start + batch_size, n_listings)

        batch_embeddings = model(
            feats["neighbourhood_idx"][start:end],
            feats["property_type_idx"][start:end],
            feats["room_type_idx"][start:end],
            feats["numeric_features"][start:end],
            feats["amenity_features"][start:end],
            feats["summary_embedding"][start:end],
        )

        embeddings.append(batch_embeddings.cpu())

        logger.info(f"  processed {end}/{n_listings}")

    listing_embeddings = torch.cat(embeddings, dim=0)

    elapsed = time.time() - start_time
    logger.info(f"Finished embedding precompute. Shape={tuple(listing_embeddings.shape)} time={elapsed:.1f}s")

    return listing_embeddings


# listings preprocessing
def prepare_listings(listings: pd.DataFrame) -> pd.DataFrame:
    listings = listings.copy()
    listings["id"] = listings["id"].astype(int)

    listings["total_price"] = listings["price"].map(parse_money_to_float)
    listings["amenities_list"] = listings["amenities"].map(normalize_amenities)

    listings = listings.dropna(subset=["latitude", "longitude", "accommodates", "total_price"])
    listings["accommodates"] = listings["accommodates"].astype(float)
    listings["total_price"] = listings["total_price"].astype(float)
    return listings

# -------------------------
# Run training
# -------------------------

def run_training(listings: pd.DataFrame, reviews: pd.DataFrame):
    listings = prepare_listings(listings)

    summary_emb = compute_summary_embeddings(listings, text_col="summary", batch_size=64)    
    
    feature_index = build_feature_index(listings)
    listing_feats = encode_listings(listings, feature_index)
    listing_feats["summary_embedding"] = torch.tensor(summary_emb, dtype=torch.float32)
    text_dim = listing_feats["summary_embedding"].shape[1]

    dataset = NextItemDataset(reviews, feature_index, min_hist=1)

    model = ListingEncoder(
        n_neigh=max(feature_index.neigh2i.values(), default=0) + 1,
        n_ptype=max(feature_index.ptype2i.values(), default=0) + 1,
        n_rtype=max(feature_index.rtype2i.values(), default=0) + 1,
        n_amen=max(feature_index.amen2i.values(), default=0) + 1,
        d_numeric=listing_feats["numeric_features"].shape[1],
        text_dim=text_dim,
        d=64,
        out_dim=64,
    )

    model = train(
        model=model,
        listing_feats=listing_feats,
        dataset=dataset,
        n_listings=len(feature_index.i2lid),
        collate_batch=collate_batch,
        epochs=20,
        batch_size=256,
        lr=0.002,
        neg_k=20,
    )

    # Precompute listing embeddings
    listing_embeddings = compute_all_listing_embeddings(model, listing_feats, batch_size=4096)

    return model, feature_index, listing_embeddings

In [10]:
# load data

LISTINGS_PATH = "data/listings.csv"   
REVIEWS_PATH  = "data/reviews_sentiment.csv"

listings = pd.read_csv(LISTINGS_PATH)
reviews  = pd.read_csv(REVIEWS_PATH)

reviews["date"] = pd.to_datetime(reviews["date"], errors="coerce")
reviews = reviews.dropna(subset=["date", "reviewer_id", "listing_id"])
reviews["reviewer_id"] = reviews["reviewer_id"].astype(str)
reviews["listing_id"] = reviews["listing_id"].astype(int)

listings["id"] = listings["id"].astype(int)

# run model

model, feature_index, listing_embeddings = run_training(listings, reviews)

print("Trained model. Listing embedding matrix shape:", tuple(listing_embeddings.shape))

2026-02-26 19:17:35 | INFO | recommender | Computing summary embeddings: n=85068 model=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 batch_size=64


BertModel LOAD REPORT from: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
Key                     | Status     |
------------------------+------------+
embeddings.position_ids | UNEXPECTED |

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.

2026-02-26 19:22:42 | INFO | recommender | Summary embeddings ready: shape=(85068, 384)
2026-02-26 19:23:21 | INFO | recommender | ==== Training start ====
2026-02-26 19:23:21 | INFO | recommender | device=cpu epochs=20 batch_size=256 lr=0.002 neg_k=20
2026-02-26 19:23:21 | INFO | recommender | dataset_samples=253115 batches_per_epoch=989
2026-02-26 19:24:00 | INFO | recommender | Epoch 1/20 - loss: 0.4805
2026-02-26 19:24:39 | INFO | recommender | Epoch 2/20 - loss: 0.4655
2026-02-26 19:25:18 | INFO | recommender | Epoch 3/20 - loss: 0.4584
2026-02-26 19:25:57 | INFO | recommender | Epoch 4/20 - loss: 0.4530
2026-02-26 19:26:35 | INFO | recommender | Epoch 5/20 - loss: 0.4486
2026-02-26 19:27:15 | INFO | recommender | Epoch 6/20 - loss: 0.4452
2026-02-26 19:27:52 | INFO | recommender | Epoch 7/20 - loss: 0.4418
2026-02-26 19:28:30 | INFO | recommender | Epoch 8/20 - loss: 0.4390
2026-02-26 19:29:08 | INFO | recommender | Epoch 9/20 - loss: 0.4365
2026-02-26 19:29:46 | INFO | recommend

Trained model. Listing embedding matrix shape: (85068, 64)


## Get user recommendations

In [11]:
def recommend_for_user(user_reviews: pd.DataFrame, feature_index: FeatureIndex, listing_embeddings: torch.Tensor, k: int = 20) -> list[int]:

    if user_reviews.empty: return []

    user_reviews = user_reviews.sort_values("date")
    # history listing indices and weights
    listing_indices = [feature_index.lid2i.get(int(lid)) for lid in user_reviews["listing_id"].astype(int).tolist()]
    weights = [sentiment_to_weight(float(s)) for s in user_reviews["sentiment_score"].astype(float).tolist()]

    # keep only interactions that are in the index
    history = [(idx, w) for idx, w in zip(listing_indices, weights) if idx is not None]
    if not history: return []

    seen_indices = {idx for idx, _ in history}
    hist_idx = torch.tensor([idx for idx, _ in history], dtype=torch.long)
    hist_wts = torch.tensor([w for _, w in history], dtype=torch.float32)

    # user embedding: weighted average of listing embeddings
    user_vec = (listing_embeddings[hist_idx] * hist_wts.unsqueeze(1)).sum(dim=0)
    user_vec = user_vec / (hist_wts.sum() + 1e-9)
    user_vec = F.normalize(user_vec, dim=0)

    # Compute scores via dot product (cosine similarity because embeddings are normalized)
    scores = torch.mv(listing_embeddings, user_vec).cpu().numpy()
    # exclude seen
    scores[list(seen_indices)] = -1e9

    # Sort all listings by score descending
    sorted_idx = np.argsort(-scores)

    # Take top-k
    top_k_idx = sorted_idx[:k]

    return [feature_index.i2lid[i] for i in top_k_idx]

def pick_returning_users(reviews_df, n=5, min_reviews=3, random_state=42):
    counts = reviews_df.groupby("reviewer_id")["listing_id"].nunique()
    eligible = counts[counts >= min_reviews].index.tolist()
    rng = np.random.default_rng(random_state)
    picks = rng.choice(eligible, size=min(n, len(eligible)), replace=False)
    return [str(x) for x in picks]

def show_user_recs(user_id, k=10):
    user_hist = reviews[reviews["reviewer_id"] == str(user_id)].copy().sort_values("date")
    recs = recommend_for_user(user_hist, feature_index, listing_embeddings, k=k)

    print("\n" + "="*80)
    print(f"USER {user_id} | reviews={len(user_hist)} | unique_listings={user_hist['listing_id'].nunique()}")
    print("Last 3 reviewed listings:", user_hist["listing_id"].tail(3).tolist())
    print("Top recommendations:", recs[:k])

    # show listing summaries
    cols_to_show = [c for c in ["name","neighbourhood_cleansed","room_type","property_type","accommodates","price"] if c in listings.columns]
    if cols_to_show:
        listings_indexed = listings.set_index("id")
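        # skip reviewed listings missing from the listings table to avoid a KeyError
        reviewed_ids = [lid for lid in user_hist["listing_id"] if lid in listings_indexed.index]
        rev_df = listings_indexed.loc[reviewed_ids, cols_to_show]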
        print("\nReviewed listing details:")
        print(rev_df.to_string())

        ordered_ids = [lid for lid in recs[:k] if lid in listings_indexed.index]
        rec_df = listings_indexed.loc[ordered_ids, cols_to_show]
        print("\nRecommended listing details:")
        print(rec_df.to_string())

# review samples
sample_users = pick_returning_users(reviews, n=3, min_reviews=2)
for uid in sample_users:
    show_user_recs(uid, k=10)


USER 42827397 | reviews=2 | unique_listings=2
Last 3 reviewed listings: [14518189, 17658798]
Top recommendations: [8203315, 13026922, 7495122, 22480412, 7313478, 7164478, 12899998, 3957718, 3994973, 6506872]

Reviewed listing details:
                                               name neighbourhood_cleansed     room_type property_type  accommodates   price
id                                                                                                                          
14518189  Cellar room with Shared Bathroom & Toilet            Westminster  Private room     Apartment             1  $19.00
17658798              Double room in Balham, London             Wandsworth  Private room     Apartment             1  $50.00

Recommended listing details:
                                                       name neighbourhood_cleansed        room_type      property_type  accommodates   price
id                                                                                           
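
The scoring step in `recommend_for_user` can be traced by hand on a tiny catalogue. The sketch below is plain Python with made-up 2-D embeddings and weights (all values and the helper names `normalize` and `recommend` are illustrative assumptions); it builds the sentiment-weighted user vector, scores every listing by dot product, and drops already-seen listings.

```python
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit L2 norm (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def recommend(embeddings, history, weights, k):
    """Rank unseen listings by cosine similarity to a weighted user vector.

    embeddings: list of unit-norm listing vectors, indexed by listing index
    history:    listing indices the user has reviewed
    weights:    one non-negative weight per history entry
    """
    dim = len(embeddings[0])
    total = sum(weights) or 1.0  # guard against all-zero weights
    # Weighted average of the history embeddings, then re-normalised.
    user = [
        sum(w * embeddings[i][d] for i, w in zip(history, weights)) / total
        for d in range(dim)
    ]
    user = normalize(user)

    seen = set(history)
    scored = [
        (sum(u * e for u, e in zip(user, emb)), idx)
        for idx, emb in enumerate(embeddings)
        if idx not in seen
    ]
    # Highest score first; ties broken by lower index for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


# Hypothetical catalogue of four listings in a 2-D embedding space.
catalogue = [normalize(v) for v in ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9])]

# The user liked listing 0 (weight 1.0) and was lukewarm about listing 2 (weight 0.2).
print(recommend(catalogue, history=[0, 2], weights=[1.0, 0.2], k=2))
```

Swapping the two weights pulls the user vector towards listing 2 and reverses the ranking of the two unseen listings, which is the effect the sentiment weights are meant to have.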

## Baseline popularity

In [12]:
def popularity_baseline_eval(
    reviews: pd.DataFrame,
    feature_index: FeatureIndex,
    n_users: int = 200,
    min_unique: int = 2,
    K_list=(5, 10, 20),
    seed: int = 42,
):
    """
    Popularity baseline for next-item (last interaction) prediction.

    For each user:
      - sort by date
      - hold out the last listing as target
      - recommend most popular listings globally (from all reviews),
        excluding the user's seen listings
      - compute Recall@K and MRR@K
    """

    df = reviews.copy()
    df["reviewer_id"] = df["reviewer_id"].astype(str)
    df["listing_id"] = df["listing_id"].astype(int)
    df["date"] = pd.to_datetime(df["date"], errors="coerce")
    df = df.dropna(subset=["reviewer_id", "listing_id", "date"])

    # keep only listings that exist in our listing table / index
    df = df[df["listing_id"].isin(feature_index.lid2i)]

    # global popularity ranking (most reviewed listings first);
    # counts come from all reviews, so they include each user's held-out review
    pop_rank = df["listing_id"].value_counts().index.to_numpy()

    # pick eligible users
    uniq_counts = df.groupby("reviewer_id")["listing_id"].nunique()
    eligible = uniq_counts[uniq_counts >= min_unique].index.to_numpy()
    if len(eligible) == 0:
        raise ValueError("No eligible users found (min_unique too high).")

    rng = np.random.default_rng(seed)
    chosen = rng.choice(eligible, size=min(n_users, len(eligible)), replace=False)

    # metrics accumulators
    metrics = {f"recall@{k}": 0.0 for k in K_list}
    metrics.update({f"mrr@{k}": 0.0 for k in K_list})
    n_eval = 0

    for uid in chosen:
        g = df[df["reviewer_id"] == uid].sort_values("date")
        if g["listing_id"].nunique() < min_unique:
            continue

        target = int(g["listing_id"].iloc[-1])
        hist = g.iloc[:-1]
        if hist.empty:
            continue

        seen = set(hist["listing_id"].tolist())

        # recommend by popularity, excluding seen
        recs = [lid for lid in pop_rank if lid not in seen]
        if not recs:
            continue

        n_eval += 1
        for k in K_list:
            topk = recs[:k]
            hit = 1.0 if target in topk else 0.0
            metrics[f"recall@{k}"] += hit
            if hit:
                rank = topk.index(target) + 1
                metrics[f"mrr@{k}"] += 1.0 / rank

    # finalize
    for k in K_list:
        metrics[f"recall@{k}"] /= max(n_eval, 1)
        metrics[f"mrr@{k}"] /= max(n_eval, 1)

    return metrics, n_eval

metrics, n_eval = popularity_baseline_eval(
    reviews=reviews,
    feature_index=feature_index,
    n_users=200,
    min_unique=2,
    K_list=(20, 100, 500, 1000),
    seed=42,
)

print(f"Evaluated users: {n_eval}")
for k, v in metrics.items():
    print(f"{k}: {v:.4f}")

Evaluated users: 200
recall@20: 0.0000
recall@100: 0.0400
recall@500: 0.0950
recall@1000: 0.1600
mrr@20: 0.0000
mrr@100: 0.0008
mrr@500: 0.0011
mrr@1000: 0.0011
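
For reference, the two metrics reported above reduce to a few lines when each user has exactly one held-out target. The following is a small self-contained sketch; `recall_and_mrr_at_k` and the toy rankings are assumptions for illustration, not part of the notebook.

```python
def recall_and_mrr_at_k(ranked_lists, targets, k):
    """Average Recall@K and MRR@K with a single held-out target per user.

    ranked_lists[i] is user i's recommendation list, best first;
    targets[i] is that user's held-out item.
    """
    hits = 0.0
    reciprocal_ranks = 0.0
    for recs, target in zip(ranked_lists, targets):
        topk = recs[:k]
        if target in topk:
            hits += 1.0
            reciprocal_ranks += 1.0 / (topk.index(target) + 1)
    n = max(len(targets), 1)
    return hits / n, reciprocal_ranks / n


# Three hypothetical users: target ranked 2nd, ranked 4th, and missing.
ranked = [["a", "b", "c", "d"], ["e", "f", "g", "h"], ["i", "j", "k", "l"]]
held_out = ["b", "h", "z"]

for k in (1, 2, 4):
    recall, mrr = recall_and_mrr_at_k(ranked, held_out, k)
    print(f"K={k}: recall={recall:.4f} mrr={mrr:.4f}")
```

With one target per user, Recall@K is simply the hit rate. A hit deep in the list contributes very little to MRR, which is consistent with mrr@1000 staying near 0.001 above even though recall@1000 reaches 0.16.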


# Results

In [13]:
def evaluate_last_item_holdout(
    reviews: pd.DataFrame,
    feature_index: FeatureIndex,
    listing_embeddings: torch.Tensor,
    K_list=(5, 10, 20),
    n_users: int | None = None,
    logger: logging.Logger = LOGGER,
):
    """
    For each user:
      - Use all but last review as history
      - Check if last reviewed listing is in top-K recommendations
    """

    logger.info("=== Evaluation start ===")

    df = reviews.copy()
    df["date"] = pd.to_datetime(df["date"], errors="coerce")
    df = df.dropna(subset=["reviewer_id", "listing_id", "date"])
    df["reviewer_id"] = df["reviewer_id"].astype(str)
    df["listing_id"] = df["listing_id"].astype(int)

    # Only users with at least 2 distinct listings
    user_counts = df.groupby("reviewer_id")["listing_id"].nunique()
    eligible_users = user_counts[user_counts >= 2].index.tolist()

    if n_users is not None:
        # takes the first n_users in sorted (string) reviewer_id order, not a random sample
        eligible_users = eligible_users[:n_users]

    max_k = max(K_list)

    recall_scores = {k: 0.0 for k in K_list}
    mrr_scores = {k: 0.0 for k in K_list}

    n_evaluated = 0

    for user_id in eligible_users:
        user_df = df[df["reviewer_id"] == user_id].sort_values("date")

        if user_df["listing_id"].nunique() < 2:
            continue

        target_listing = int(user_df["listing_id"].iloc[-1])
        history = user_df.iloc[:-1]

        if target_listing not in feature_index.lid2i:
            continue

        recs = recommend_for_user(
            history,
            feature_index,
            listing_embeddings,
            k=max_k,
        )

        n_evaluated += 1

        for k in K_list:
            topk = recs[:k]
            is_correct = target_listing in topk

            recall_scores[k] += float(is_correct)

            if is_correct:
                rank_position = topk.index(target_listing) + 1
                mrr_scores[k] += 1.0 / rank_position

    if n_evaluated == 0:
        logger.warning("No users evaluated.")
        return {}, 0

    # Normalize
    metrics = {}
    for k in K_list:
        recall = recall_scores[k] / n_evaluated
        mrr = mrr_scores[k] / n_evaluated
        metrics[f"recall@{k}"] = recall
        metrics[f"mrr@{k}"] = mrr

    logger.info(f"Evaluated users: {n_evaluated}")
    for k in K_list:
        logger.info(
            f"K={k} | Recall={metrics[f'recall@{k}']:.4f} | "
            f"MRR={metrics[f'mrr@{k}']:.4f}"
        )

    logger.info("=== Evaluation finished ===")

    return metrics, n_evaluated


n_users = 1000
metrics, n_eval = evaluate_last_item_holdout(
    reviews=reviews,
    feature_index=feature_index,
    listing_embeddings=listing_embeddings,
    n_users=n_users,
    K_list=(5, 10, 20),
)


2026-02-26 19:36:21 | INFO | recommender | === Evaluation start ===
2026-02-26 19:37:04 | INFO | recommender | Evaluated users: 1000
2026-02-26 19:37:04 | INFO | recommender | K=5 | Recall=0.0110 | MRR=0.0062
2026-02-26 19:37:04 | INFO | recommender | K=10 | Recall=0.0160 | MRR=0.0069
2026-02-26 19:37:04 | INFO | recommender | K=20 | Recall=0.0200 | MRR=0.0071
2026-02-26 19:37:04 | INFO | recommender | === Evaluation finished ===
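
The holdout protocol in the docstring (history = all but the last review, target = the last reviewed listing) can be illustrated without pandas. This is a minimal sketch on hypothetical `(reviewer_id, listing_id, date)` tuples; `last_item_holdout` is an illustrative helper, not a function from the notebook. ISO date strings are used so that string order matches chronological order.

```python
from collections import defaultdict


def last_item_holdout(reviews, min_unique=2):
    """Split each user's reviews into (history, target) by date.

    reviews: iterable of (reviewer_id, listing_id, iso_date) tuples.
    Users with fewer than `min_unique` distinct listings are skipped.
    Returns {reviewer_id: (history_listing_ids, target_listing_id)}.
    """
    by_user = defaultdict(list)
    for reviewer_id, listing_id, date in reviews:
        by_user[reviewer_id].append((date, listing_id))

    splits = {}
    for reviewer_id, rows in by_user.items():
        if len({lid for _, lid in rows}) < min_unique:
            continue
        rows.sort()  # chronological, since dates are ISO strings
        listing_ids = [lid for _, lid in rows]
        splits[reviewer_id] = (listing_ids[:-1], listing_ids[-1])
    return splits


# Hypothetical reviews.
toy_reviews = [
    ("u1", 101, "2019-03-01"),
    ("u1", 102, "2019-01-15"),
    ("u1", 103, "2019-06-20"),
    ("u2", 201, "2019-02-02"),  # single review: not evaluable
    ("u3", 301, "2019-04-04"),
    ("u3", 302, "2019-04-01"),
]
print(last_item_holdout(toy_reviews))
```

If the split is meant to measure generalisation, the same split would typically be applied before building the training dataset, so that held-out targets are not also training positives.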


# End