# **Project NLP: TripAdvisor Recommendations**
#### By Tiberio Zolzettich & Ocean Spiess
---

### Import of the Dataset


In [1]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("joebeachcapital/hotel-reviews")

print("Path to dataset files:", path)



Path to dataset files: /Users/oceanspiess/.cache/kagglehub/datasets/joebeachcapital/hotel-reviews/versions/2


In [2]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Download required NLTK data
nltk.download('punkt')
nltk.download('stopwords') 
nltk.download('wordnet')

# Load the dataset
df = pd.read_csv(f"{path}/reviews.csv")

df.head()

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/oceanspiess/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/oceanspiess/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/oceanspiess/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,ratings,title,text,author,date_stayed,offering_id,num_helpful_votes,date,id,via_mobile
0,"{'service': 5.0, 'cleanliness': 5.0, 'overall'...","“Truly is ""Jewel of the Upper Wets Side""”",Stayed in a king suite for 11 nights and yes i...,"{'username': 'Papa_Panda', 'num_cities': 22, '...",December 2012,93338,0,2012-12-17,147643103,False
1,"{'service': 5.0, 'cleanliness': 5.0, 'overall'...",“My home away from home!”,"On every visit to NYC, the Hotel Beacon is the...","{'username': 'Maureen V', 'num_reviews': 2, 'n...",December 2012,93338,0,2012-12-17,147639004,False
2,"{'service': 4.0, 'cleanliness': 5.0, 'overall'...",“Great Stay”,This is a great property in Midtown. We two di...,"{'username': 'vuguru', 'num_cities': 12, 'num_...",December 2012,1762573,0,2012-12-18,147697954,False
3,"{'service': 5.0, 'cleanliness': 5.0, 'overall'...",“Modern Convenience”,The Andaz is a nice hotel in a central locatio...,"{'username': 'Hotel-Designer', 'num_cities': 5...",August 2012,1762573,0,2012-12-17,147625723,False
4,"{'service': 4.0, 'cleanliness': 5.0, 'overall'...",“Its the best of the Andaz Brand in the US....”,I have stayed at each of the US Andaz properti...,"{'username': 'JamesE339', 'num_cities': 34, 'n...",December 2012,1762573,0,2012-12-17,147612823,False


### Data Preparation
- In this data preparation phase, we:
    1. Extracted rating scores from the 'ratings' dictionary into separate columns
    2. Removed unnecessary columns like title, author, dates and metadata
    3. Reordered columns to put offering_id first for better organization
    4. Dropped reviews missing any rating category (436,391 of 878,561 reviews remain)
    5. Grouped reviews by offering_id, concatenating all review texts for each hotel
    6. Calculated per-hotel mean scores for all rating categories (service, cleanliness, etc.)
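
To make the dictionary extraction and the per-hotel aggregation concrete, here is a minimal sketch in plain Python. The three rows are hypothetical, not taken from the dataset, and `ast.literal_eval` is used for parsing because it only accepts Python literals.

```python
import ast
from collections import defaultdict

# Hypothetical rows: (offering_id, ratings stored as a string, review text)
rows = [
    (1, "{'service': 5.0, 'overall': 4.0}", "Great stay."),
    (1, "{'service': 3.0, 'overall': 4.0}", "Friendly staff."),
    (2, "{'service': 2.0, 'overall': 1.0}", "Noisy room."),
]

texts = defaultdict(list)
ratings = defaultdict(lambda: defaultdict(list))

for offering_id, ratings_str, text in rows:
    # literal_eval parses Python literals only, so it cannot run arbitrary code
    parsed = ast.literal_eval(ratings_str)
    texts[offering_id].append(text)
    for aspect, score in parsed.items():
        ratings[offering_id][aspect].append(score)

# One record per hotel: concatenated text plus the mean of each rating
hotels = {
    offering_id: {
        "text": " ".join(texts[offering_id]),
        **{aspect: sum(v) / len(v) for aspect, v in ratings[offering_id].items()},
    }
    for offering_id in texts
}

print(hotels[1])
```

This mirrors what a pandas `groupby(...).agg(...)` with `' '.join` and `'mean'` does, just without the DataFrame machinery.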


In [3]:
import ast

# Extract ratings dictionary into separate columns
df['ratings'] = df['ratings'].apply(ast.literal_eval)  # Safely parse the string into a dictionary
rating_cols = ['service', 'cleanliness', 'value', 'sleep_quality', 'rooms', 'location', 'overall']

for col in rating_cols:
    df[col] = df['ratings'].apply(lambda x: x.get(col))

# Drop unnecessary columns and ratings dictionary
columns_to_drop = ['title', 'author', 'date_stayed', 'num_helpful_votes', 
                  'date', 'id', 'via_mobile', 'ratings']
df = df.drop(columns=columns_to_drop)

# Reorder columns to put offering_id first
cols = ['offering_id'] + [col for col in df.columns if col != 'offering_id']
df = df[cols]



In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 878561 entries, 0 to 878560
Data columns (total 9 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   offering_id    878561 non-null  int64  
 1   text           878561 non-null  object 
 2   service        760918 non-null  float64
 3   cleanliness    759835 non-null  float64
 4   value          753695 non-null  float64
 5   sleep_quality  500903 non-null  float64
 6   rooms          705404 non-null  float64
 7   location       664904 non-null  float64
 8   overall        878561 non-null  float64
dtypes: float64(7), int64(1), object(1)
memory usage: 60.3+ MB


In [5]:
df = df.dropna(subset=['service', 'cleanliness', 'value', 'sleep_quality', 'rooms', 'location', 'overall'])
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 436391 entries, 0 to 878552
Data columns (total 9 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   offering_id    436391 non-null  int64  
 1   text           436391 non-null  object 
 2   service        436391 non-null  float64
 3   cleanliness    436391 non-null  float64
 4   value          436391 non-null  float64
 5   sleep_quality  436391 non-null  float64
 6   rooms          436391 non-null  float64
 7   location       436391 non-null  float64
 8   overall        436391 non-null  float64
dtypes: float64(7), int64(1), object(1)
memory usage: 33.3+ MB


In [6]:
# Group texts by offering_id and concatenate them
df = df.groupby('offering_id').agg({
    'text': ' '.join,
    'service': 'mean',
    'cleanliness': 'mean', 
    'value': 'mean',
    'sleep_quality': 'mean',
    'rooms': 'mean',
    'location': 'mean',
    'overall': 'mean'
}).reset_index()

df.head()

Unnamed: 0,offering_id,text,service,cleanliness,value,sleep_quality,rooms,location,overall
0,72572,I had to make fast visit to seattle and I foun...,4.60101,4.636364,4.323232,4.333333,4.282828,4.570707,4.388889
1,72579,"Great service, rooms were clean, could use som...",4.232,4.24,4.152,3.768,3.856,4.192,3.888
2,72586,Beautiful views of the space needle - especial...,4.25,4.287879,4.05303,4.113636,3.992424,4.537879,4.045455
3,72598,This hotel is in need of some serious updates....,3.243243,3.243243,3.054054,3.27027,3.189189,3.027027,2.918919
4,73236,My experience at this days inn was perfect. th...,4.277778,3.111111,3.777778,3.722222,3.222222,4.111111,3.388889


### Creating the Base BM25 Model Without Pre-processing

#### Overview
- In this phase, we will:
    1. Create a BM25 model using the raw review texts without any pre-processing
    2. This will serve as our baseline model to compare against later versions
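
For readers unfamiliar with BM25, the sketch below scores documents with the textbook Okapi BM25 formula in plain Python. It is an illustration only: the function name `bm25_scores`, the toy corpus, and the `+ 1` smoothing inside the IDF logarithm are assumptions of this sketch, and a library implementation may handle IDF differently.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with textbook BM25."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # The "+ 1" inside the log keeps IDF positive (an assumption of this sketch)
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Longer-than-average documents get a larger normalization factor
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Hypothetical toy corpus, tokenized on whitespace like the baseline model
corpus = [
    "clean room friendly staff".split(),
    "dirty room rude staff".split(),
    "great pool and clean beach".split(),
]
scores = bm25_scores("clean room".split(), corpus)
print(scores)
```

The first document matches both query terms and ranks highest; the other two match one term each, and the shorter of the two scores slightly higher because of length normalization.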



In [7]:
from rank_bm25 import BM25Okapi

def build_bm25_model(df):
    """
    Build BM25 model from the text data
    
    Args:
        df: DataFrame containing 'text' column with reviews
        
    Returns:
        BM25Okapi model trained on the corpus
    """
    # Tokenize each document (hotel reviews)
    tokenized_corpus = [doc.split() for doc in df['text']]
    
    # Create and return the BM25 model
    return BM25Okapi(tokenized_corpus)

### Creating Evaluation Functions

#### Overview
- In this phase, we will create functions:
    1. To evaluate our BM25 models
    2. To get the ranking of recommended hotels 
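
As a worked example of the evaluation metric, the sketch below computes the mean squared error between the rating vectors of two hotels. The values are the mean ratings of hotels 72572 and 72579 from the grouped table shown earlier; the helper name `rating_mse` is an assumption of this sketch.

```python
aspects = ['service', 'cleanliness', 'value', 'sleep_quality', 'rooms', 'location', 'overall']

# Mean ratings of hotels 72572 and 72579, copied from the grouped DataFrame output
target = [4.60101, 4.636364, 4.323232, 4.333333, 4.282828, 4.570707, 4.388889]
candidate = [4.232, 4.24, 4.152, 3.768, 3.856, 4.192, 3.888]

def rating_mse(a, b):
    """Mean of the squared per-aspect rating differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

mse = rating_mse(target, candidate)
print(f"MSE over {len(aspects)} aspects: {mse:.4f}")
```

A low MSE means the recommended hotel is rated similarly to the target across all aspects, which is the notion of a "good" recommendation used in this notebook.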


In [9]:
def rank_hotels(query, df, bm25_model, top_n=10):
    """
    Rank hotels based on relevance to query using BM25
    
    Args:
        query: Search query string
        df: DataFrame containing hotel data
        bm25_model: Trained BM25 model
        top_n: Number of top results to return
        
    Returns:
        List of tuples containing (offering_id, bm25_score)
    """
    # Tokenize the query on whitespace, matching how the corpus was tokenized
    tokenized_query = query.split()
    
    # Get BM25 scores for all documents
    scores = bm25_model.get_scores(tokenized_query)
    
    # Create list of (offering_id, score) tuples
    hotel_scores = list(zip(df['offering_id'], scores))
    
    # Sort by score in descending order and get top_n results
    ranked_hotels = sorted(hotel_scores, key=lambda x: x[1], reverse=True)[:top_n]
    
    return ranked_hotels

In [26]:
def evaluate_recommendations(target_hotel_id, comparison_df, aspects=None):
    """
    Evaluates BM25 recommendations for a target hotel by using its review text as the query
    and comparing its ratings with those of the top-ranked hotels.
    
    Args:
        target_hotel_id (int): ID of the target hotel.
        comparison_df (pd.DataFrame): Subsample of hotels to rank, including the target hotel.
        aspects (list, optional): List of aspects to use for calculating MSE.
                                Default: ['service', 'cleanliness', 'value', 'sleep_quality', 'rooms', 'location', 'overall'].
    
    Returns:
        tuple:
            - float: Average MSE of the top 10 most relevant recommendations.
            - list: List of top 10 recommendations with their MSE and BM25 scores.
    """
    if aspects is None:
        aspects = ['service', 'cleanliness', 'value', 'sleep_quality', 'rooms', 'location', 'overall']
    
    # Extract target hotel information
    target_hotel = comparison_df[comparison_df['offering_id'] == target_hotel_id]
    target_text = target_hotel['text'].iloc[0]
    target_ratings = target_hotel[aspects].iloc[0]
    
    # Build BM25 model for the sample
    tokenized_texts = [text.split() for text in comparison_df['text']]
    bm25_model = BM25Okapi(tokenized_texts)
    
    # Prepare texts and calculate BM25 scores
    tokenized_target_text = target_text.split()
    scores = bm25_model.get_scores(tokenized_target_text)
    
    # Add BM25 scores to DataFrame
    comparison_df = comparison_df.reset_index(drop=True)  # Reset index to avoid errors
    comparison_df['bm25_score'] = scores
    
    # Exclude target hotel from recommendations
    recommendations_df = comparison_df[comparison_df['offering_id'] != target_hotel_id]
    
    # Sort by descending BM25 score and take top 10
    top_recommendations = recommendations_df.nlargest(10, 'bm25_score')
    
    # Calculate MSE for each recommended hotel
    recommendations = []
    for _, row in top_recommendations.iterrows():
        mse = ((target_ratings - row[aspects]) ** 2).mean()
        recommendations.append({
            'offering_id': row['offering_id'],
            'bm25_score': row['bm25_score'],
            'mse': mse,
            'ratings': row[aspects].to_dict()
        })
    
    # Calculate average MSE of recommendations
    avg_mse = sum([rec['mse'] for rec in recommendations]) / len(recommendations)
    
    return avg_mse, recommendations


### Testing the Base BM25 Model

#### Overview
- In this phase, we will:
    1. Test our baseline BM25 model using sample queries
    2. Evaluate the model's performance with raw, unprocessed text
    3. Document the results to establish a baseline for comparison

#### Expected Outcomes
- Understanding of base model performance
- Identification of potential areas for improvement
- Baseline metrics for comparing with future enhanced models




In [27]:
NUM_QUERIES = 50
NUM_COMPARISON_HOTELS = 2000

all_mse_scores = []

print("Base BM25 Test Results:")
print("-" * 80)

for i in range(NUM_QUERIES):
    # 1. Select a target hotel
    target_hotel = df.sample(n=1).iloc[0]
    target_id = target_hotel['offering_id']
    
    # 2. Create a subsample of hotels
    comparison_df = df[df['offering_id'] != target_id].sample(n=NUM_COMPARISON_HOTELS)
    comparison_df = pd.concat([comparison_df, target_hotel.to_frame().T])  # Include target hotel
    
    # 3. Evaluate recommendations
    avg_mse, recommendations = evaluate_recommendations(target_id, comparison_df)
    all_mse_scores.append(avg_mse)
    
    # 4. Display results for each query
    print(f"Query {i+1}/{NUM_QUERIES}")
    print(f"Target Hotel ID: {target_id}")
    print(f"Average MSE for top 10 recommendations: {avg_mse:.4f}")
    print("-" * 80)

# Calculate overall performance
if all_mse_scores:
    overall_avg_mse = sum(all_mse_scores) / len(all_mse_scores)
    print("\nOverall Performance:")
    print(f"Average MSE over {len(all_mse_scores)} queries: {overall_avg_mse:.4f}")


Base BM25 Test Results:
--------------------------------------------------------------------------------
Query 1/50
Target Hotel ID: 98856
Average MSE for top 10 recommendations: 0.1231
--------------------------------------------------------------------------------
Query 2/50
Target Hotel ID: 74587
Average MSE for top 10 recommendations: 0.2766
--------------------------------------------------------------------------------
Query 3/50
Target Hotel ID: 112039
Average MSE for top 10 recommendations: 0.1603
--------------------------------------------------------------------------------
Query 4/50
Target Hotel ID: 109468
Average MSE for top 10 recommendations: 0.2449
--------------------------------------------------------------------------------
Query 5/50
Target Hotel ID: 217550
Average MSE for top 10 recommendations: 0.2648
--------------------------------------------------------------------------------
Query 6/50
Target Hotel ID: 81315
Average MSE for top 10 recommendations: 0.0887
-

In [32]:
def get_hotel_details(df, hotel_id):
    """
    Get and print details for a specific hotel from the dataframe.
    
    Args:
        df: Pandas DataFrame containing hotel data
        hotel_id: ID of the hotel to look up
    """
    hotel_details = df[df['offering_id'] == hotel_id][['text', 'service', 'cleanliness', 
                                                      'value', 'sleep_quality', 'rooms', 
                                                      'location', 'overall']]

    if not hotel_details.empty:
        print("Hotel Reviews and Ratings:")
        print("\nText:")
        print(hotel_details['text'].values[0])
        print("\nRatings:")
        print(f"Service: {hotel_details['service'].values[0]:.2f}")
        print(f"Cleanliness: {hotel_details['cleanliness'].values[0]:.2f}")
        print(f"Value: {hotel_details['value'].values[0]:.2f}")
        print(f"Sleep Quality: {hotel_details['sleep_quality'].values[0]:.2f}")
        print(f"Rooms: {hotel_details['rooms'].values[0]:.2f}")
        print(f"Location: {hotel_details['location'].values[0]:.2f}")
        print(f"Overall: {hotel_details['overall'].values[0]:.2f}")
    else:
        print(f"No data found for hotel ID {hotel_id}")


### Analyzing Hotels with Poor MSE Scores

In this section, we'll examine specific hotel IDs that showed high Mean Squared Error (MSE) scores in our recommendations. By analyzing these cases where the model performed poorly, we can better understand:

1. The characteristics of these hotels
2. Why the recommendations may have been inaccurate
3. Potential improvements to our recommendation system


In [None]:
# Get details for specific hotels
hotel_ids = [99514, 98820, 217240, 88419]

print("Getting details for selected hotels...")
for hotel_id in hotel_ids:
    print("\n" + "="*80)
    print(f"\nHotel ID: {hotel_id}")
    get_hotel_details(df, hotel_id)


### Analysis of Poor MSE Scores

The queries with high MSE scores are predominantly those whose target hotel has poor ratings. The model appears to have more difficulty finding similar hotels when the target hotel has negative reviews, suggesting that matching negative experiences may be inherently more challenging than matching positive ones.



# Improving on the Base BM25 Model

### Part 1 : Preprocessed Data
We preprocess the review text to improve BM25 search quality, cleaning and standardizing it through:
- Lowercase conversion
- Special character removal
- Stopword removal
- Word lemmatization
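
The effect of these steps on a single sentence can be seen in the simplified sketch below. It covers lowercasing, punctuation removal, and stopword removal only; lemmatization is left out, and the tiny `STOP_WORDS` set and the function name `simple_clean` are assumptions made for illustration.

```python
import re

# Tiny illustrative stopword list (an assumption; NLTK's English list is far longer)
STOP_WORDS = {"the", "was", "and", "a", "in", "were"}

def simple_clean(text):
    """Lowercase, strip punctuation, drop stopwords and purely numeric tokens."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(
        word for word in text.split()
        if word not in STOP_WORDS and not word.isnumeric()
    )

cleaned = simple_clean("The ROOM was clean, and the staff were friendly in 2012!")
print(cleaned)
```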



In [30]:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download required NLTK data
nltk.download('punkt_tab')
nltk.download('stopwords') 
nltk.download('wordnet')

# Initialize lemmatizer and get stopwords
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def clean_text(text):
    """
    Clean text by:
    1. Converting to lowercase
    2. Replacing special characters and punctuation with spaces
    3. Removing stopwords and purely numeric tokens
    4. Lemmatizing words
    """
    # Convert to lowercase
    text = text.lower()
    
    # Replace anything that is not a word character or whitespace with a space
    text = re.sub(r'[^\w\s]', ' ', text)
    
    # Tokenize
    words = word_tokenize(text)
    
    # Remove stopwords and numeric tokens, then lemmatize
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words and not word.isnumeric()]
    
    # Join tokens back into a single space-separated string (newlines are not preserved)
    cleaned_text = ' '.join(words)
    
    return cleaned_text

# Create a copy of the dataframe for cleaned version
df_cleaned = df.copy()

# Apply text cleaning to the text column of the cleaned dataframe
df_cleaned['text'] = df_cleaned['text'].apply(clean_text)


[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/oceanspiess/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/oceanspiess/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/oceanspiess/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [35]:
# Preprocessed BM25 Test
NUM_QUERIES = 50
NUM_COMPARISON_HOTELS = 2000

all_mse_scores = []

print("Preprocessed BM25 Test Results:")
print("-" * 80)

for i in range(NUM_QUERIES):
    # 1. Select a target hotel from the cleaned dataset
    target_hotel = df_cleaned.sample(n=1).iloc[0]
    target_id = target_hotel['offering_id']
    
    # 2. Create a subsample of hotels for comparison
    comparison_df = df_cleaned[df_cleaned['offering_id'] != target_id].sample(n=NUM_COMPARISON_HOTELS)
    comparison_df = pd.concat([comparison_df, target_hotel.to_frame().T])
    
    # 3. Evaluate recommendations
    avg_mse, recommendations = evaluate_recommendations(target_id, comparison_df)
    all_mse_scores.append(avg_mse)
    
    # 4. Display results for each query
    print(f"Query {i+1}/{NUM_QUERIES}")
    print(f"Target Hotel ID: {target_id}")
    print(f"Average MSE for top 10 recommendations: {avg_mse:.4f}")
    print("-" * 80)

# Calculate overall performance for the preprocessed BM25 model
if all_mse_scores:
    overall_avg_mse = sum(all_mse_scores) / len(all_mse_scores)
    print("\nOverall Performance with Preprocessed BM25:")
    print(f"Average MSE over {len(all_mse_scores)} queries: {overall_avg_mse:.4f}")


Preprocessed BM25 Test Results:
--------------------------------------------------------------------------------
Query 1/50
Target Hotel ID: 1641016
Average MSE for top 10 recommendations: 0.1464
--------------------------------------------------------------------------------
Query 2/50
Target Hotel ID: 656235
Average MSE for top 10 recommendations: 0.0689
--------------------------------------------------------------------------------
Query 3/50
Target Hotel ID: 1140049
Average MSE for top 10 recommendations: 0.1226
--------------------------------------------------------------------------------
Query 4/50
Target Hotel ID: 88429
Average MSE for top 10 recommendations: 2.3986
--------------------------------------------------------------------------------
Query 5/50
Target Hotel ID: 94186
Average MSE for top 10 recommendations: 2.4701
--------------------------------------------------------------------------------
Query 6/50
Target Hotel ID: 1152288
Average MSE for top 10 recommendatio

### Part 2 : Fine-tune the k1 and b parameters

- We vary the two main BM25 parameters, k1 (term-frequency saturation) and b (document-length normalization), and observe how they affect performance
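
To build intuition for what the two parameters do, the sketch below isolates the term-frequency part of the textbook BM25 formula. The helper `tf_weight` and the example document lengths are assumptions of this sketch, not part of the notebook.

```python
def tf_weight(tf, doc_len, avg_len, k1, b):
    """Term-frequency part of the BM25 score for one term in one document."""
    length_norm = 1 - b + b * doc_len / avg_len
    return tf * (k1 + 1) / (tf + k1 * length_norm)

# k1 controls saturation: with a small k1, repeating a term adds little
low_k1_gain = tf_weight(10, 100, 100, k1=0.5, b=0.75) / tf_weight(1, 100, 100, k1=0.5, b=0.75)
high_k1_gain = tf_weight(10, 100, 100, k1=2.0, b=0.75) / tf_weight(1, 100, 100, k1=2.0, b=0.75)

# b controls length normalization: b=1.0 penalizes a long document the most
long_doc_b0 = tf_weight(3, 400, 100, k1=1.5, b=0.0)
long_doc_b1 = tf_weight(3, 400, 100, k1=1.5, b=1.0)

print(f"Gain from tf=1 to tf=10: k1=0.5 -> {low_k1_gain:.2f}, k1=2.0 -> {high_k1_gain:.2f}")
print(f"Weight in a 4x-average-length doc: b=0.0 -> {long_doc_b0:.2f}, b=1.0 -> {long_doc_b1:.2f}")
```

Since each hotel document here is a concatenation of all its reviews, document lengths vary widely, so b is likely to matter for this dataset.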

In [40]:
k1_values = np.linspace(0.5, 2.0, 4)
b_values = np.linspace(0.5, 1.0, 6)

best_k1, best_b, best_score = None, None, float('inf')

print("Fine-tuning BM25 Parameters (k1, b):")
print("-" * 80)

for k1 in k1_values:
    for b in b_values:
        print(f"Testing BM25 with k1={k1}, b={b}")
        all_mse_scores = []

        # Iterate through each query
        for i in range(30):
            # Select a target hotel
            target_hotel = df.sample(n=1).iloc[0]
            target_id = target_hotel['offering_id']

            # Create a subsample of hotels for this query
            comparison_df = df[df['offering_id'] != target_id].sample(n=100)
            comparison_df = pd.concat([comparison_df, target_hotel.to_frame().T])  # Include target hotel
            
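            # Tokenize the corpus dynamically
            tokenized_corpus = [doc.split() for doc in comparison_df['text']]
            
            # Create BM25 model with current k1 and b
            bm25 = BM25Okapi(tokenized_corpus, k1=k1, b=b)
            
            # Score every hotel against the target's reviews with this model.
            # (evaluate_recommendations builds its own default-parameter model,
            # so calling it here would ignore k1 and b.)
            aspects = ['service', 'cleanliness', 'value', 'sleep_quality', 'rooms', 'location', 'overall']
            comparison_df = comparison_df.reset_index(drop=True)
            comparison_df['bm25_score'] = bm25.get_scores(target_hotel['text'].split())
            
            # Average MSE between the target's ratings and those of its top 10 neighbours
            top10 = comparison_df[comparison_df['offering_id'] != target_id].nlargest(10, 'bm25_score')
            squared_errors = (top10[aspects].astype(float) - target_hotel[aspects].astype(float)) ** 2
            avg_mse = squared_errors.mean(axis=1).mean()
            all_mse_scores.append(avg_mse)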

        # Calculate overall performance for this parameter combination
        overall_avg_mse = sum(all_mse_scores) / len(all_mse_scores)
        print(f"Average MSE across all queries: {overall_avg_mse:.4f}")
        print("-" * 70)

        # Update best parameters if performance improves
        if overall_avg_mse < best_score:
            best_k1, best_b, best_score = k1, b, overall_avg_mse

print(f"\nBest combination: k1={best_k1}, b={best_b} with Average MSE={best_score:.4f}")

Fine-tuning BM25 Parameters (k1, b):
--------------------------------------------------------------------------------
Testing BM25 with k1=0.5, b=0.5
Average MSE across all queries: 0.7942
----------------------------------------------------------------------
Testing BM25 with k1=0.5, b=0.6
Average MSE across all queries: 1.0681
----------------------------------------------------------------------
Testing BM25 with k1=0.5, b=0.7
Average MSE across all queries: 0.8300
----------------------------------------------------------------------
Testing BM25 with k1=0.5, b=0.8
Average MSE across all queries: 0.8956
----------------------------------------------------------------------
Testing BM25 with k1=0.5, b=0.9
Average MSE across all queries: 0.9230
----------------------------------------------------------------------
Testing BM25 with k1=0.5, b=1.0
Average MSE across all queries: 1.1525
----------------------------------------------------------------------
Testing BM25 with k1=1.0, b=0.

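#### In this run, the best-performing configuration was k1=2.0 and b=1.0, with a better MSE score than the base model.

Each parameter combination is scored on a different random draw of 30 target hotels, so small differences between combinations may reflect sampling noise rather than a real effect.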

### Part 3 : Further Exploration

- Performance could be improved further with embeddings, which link sentences that have the same meaning but different wording
- Three steps:
    - Convert every text into a vector, ideally with a model trained on comments and reviews
    - Store all embeddings
    - Convert the query into a vector as well and compute its cosine similarity with every stored embedding to find the top k vectors
    (if the DataFrame is too large, a vector search index can be used for better performance)
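
The retrieval step can be sketched as follows, assuming the embeddings already exist. The 3-dimensional vectors and hotel keys below are hypothetical placeholders; real embeddings would come from a sentence-embedding model and have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query_vec, stored_embeddings, k=2):
    """Return the k (hotel_id, similarity) pairs most similar to the query."""
    scored = [
        (hotel_id, cosine_similarity(query_vec, vec))
        for hotel_id, vec in stored_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical stored embeddings, one vector per hotel
stored_embeddings = {
    "hotel_a": [0.9, 0.1, 0.0],
    "hotel_b": [0.1, 0.9, 0.2],
    "hotel_c": [0.8, 0.2, 0.1],
}

results = top_k([1.0, 0.0, 0.0], stored_embeddings, k=2)
print(results)
```

This brute-force comparison is linear in the number of hotels, which is fine for a few thousand documents; an approximate nearest-neighbour index only becomes worthwhile at a much larger scale.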