# **Project NLP: TripAdvisor Recommendations**
#### By Tiberio Zolzettich & Ocean Spiess
---

### Import of the Dataset


In [47]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("joebeachcapital/hotel-reviews")

print("Path to dataset files:", path)

Path to dataset files: C:\Users\tzolz\.cache\kagglehub\datasets\joebeachcapital\hotel-reviews\versions\2


In [48]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Download required NLTK data
nltk.download('punkt')
nltk.download('stopwords') 
nltk.download('wordnet')

# Load the dataset
df = pd.read_csv(f"{path}/reviews.csv")

df.head()

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\tzolz\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\tzolz\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\tzolz\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,ratings,title,text,author,date_stayed,offering_id,num_helpful_votes,date,id,via_mobile
0,"{'service': 5.0, 'cleanliness': 5.0, 'overall'...","“Truly is ""Jewel of the Upper Wets Side""”",Stayed in a king suite for 11 nights and yes i...,"{'username': 'Papa_Panda', 'num_cities': 22, '...",December 2012,93338,0,2012-12-17,147643103,False
1,"{'service': 5.0, 'cleanliness': 5.0, 'overall'...",“My home away from home!”,"On every visit to NYC, the Hotel Beacon is the...","{'username': 'Maureen V', 'num_reviews': 2, 'n...",December 2012,93338,0,2012-12-17,147639004,False
2,"{'service': 4.0, 'cleanliness': 5.0, 'overall'...",“Great Stay”,This is a great property in Midtown. We two di...,"{'username': 'vuguru', 'num_cities': 12, 'num_...",December 2012,1762573,0,2012-12-18,147697954,False
3,"{'service': 5.0, 'cleanliness': 5.0, 'overall'...",“Modern Convenience”,The Andaz is a nice hotel in a central locatio...,"{'username': 'Hotel-Designer', 'num_cities': 5...",August 2012,1762573,0,2012-12-17,147625723,False
4,"{'service': 4.0, 'cleanliness': 5.0, 'overall'...",“Its the best of the Andaz Brand in the US....”,I have stayed at each of the US Andaz properti...,"{'username': 'JamesE339', 'num_cities': 34, 'n...",December 2012,1762573,0,2012-12-17,147612823,False


### Data Preparation
- In this data preparation phase, we:
    1. Extracted the rating scores from the 'ratings' dictionary into separate columns
    2. Removed columns we do not need (title, author, dates and other metadata)
    3. Reordered the columns so that offering_id comes first
    4. Dropped reviews missing any rating category (436,391 of 878,561 reviews remain)
    5. Grouped reviews by offering_id, concatenating all review texts for each hotel
    6. Calculated the mean score of each rating category (service, cleanliness, etc.) per hotel
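
As a minimal sketch of the parsing, grouping and averaging steps, here is the same idea on made-up data using only the standard library. The `rows` list, its values and the `hotels` structure are hypothetical and not taken from the dataset; the notebook itself does this with pandas.

```python
import ast
from collections import defaultdict
from statistics import mean

# Hypothetical rows mimicking the raw CSV: (offering_id, ratings string, review text)
rows = [
    (1, "{'service': 5.0, 'overall': 4.0}", "Great stay."),
    (1, "{'service': 3.0, 'overall': 4.0}", "Friendly staff."),
    (2, "{'service': 2.0, 'overall': 1.0}", "Needs updates."),
]

texts = defaultdict(list)
scores = defaultdict(lambda: defaultdict(list))

for offering_id, ratings_str, text in rows:
    ratings = ast.literal_eval(ratings_str)  # safely parse the dict literal
    texts[offering_id].append(text)
    for aspect, value in ratings.items():
        scores[offering_id][aspect].append(value)

# One record per hotel: concatenated text plus the mean score per aspect
hotels = {
    oid: {
        "text": " ".join(texts[oid]),
        **{aspect: mean(vals) for aspect, vals in scores[oid].items()},
    }
    for oid in texts
}
```

Each hotel ends up as a single "document" for BM25, with its average ratings kept alongside for evaluation.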


In [49]:
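# Parse the ratings string into a dictionary, then extract it into separate columns
# (ast.literal_eval only accepts Python literals, unlike eval, which runs arbitrary code)
import ast
df['ratings'] = df['ratings'].apply(ast.literal_eval)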
rating_cols = ['service', 'cleanliness', 'value', 'sleep_quality', 'rooms', 'location', 'overall']

for col in rating_cols:
    df[col] = df['ratings'].apply(lambda x: x.get(col))

# Drop unnecessary columns and ratings dictionary
columns_to_drop = ['title', 'author', 'date_stayed', 'num_helpful_votes', 
                  'date', 'id', 'via_mobile', 'ratings']
df = df.drop(columns=columns_to_drop)

# Reorder columns to put offering_id first
cols = ['offering_id'] + [col for col in df.columns if col != 'offering_id']
df = df[cols]



In [50]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 878561 entries, 0 to 878560
Data columns (total 9 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   offering_id    878561 non-null  int64  
 1   text           878561 non-null  object 
 2   service        760918 non-null  float64
 3   cleanliness    759835 non-null  float64
 4   value          753695 non-null  float64
 5   sleep_quality  500903 non-null  float64
 6   rooms          705404 non-null  float64
 7   location       664904 non-null  float64
 8   overall        878561 non-null  float64
dtypes: float64(7), int64(1), object(1)
memory usage: 60.3+ MB


In [51]:
df = df.dropna(subset=['service', 'cleanliness', 'value', 'sleep_quality', 'rooms', 'location', 'overall'])
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 436391 entries, 0 to 878552
Data columns (total 9 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   offering_id    436391 non-null  int64  
 1   text           436391 non-null  object 
 2   service        436391 non-null  float64
 3   cleanliness    436391 non-null  float64
 4   value          436391 non-null  float64
 5   sleep_quality  436391 non-null  float64
 6   rooms          436391 non-null  float64
 7   location       436391 non-null  float64
 8   overall        436391 non-null  float64
dtypes: float64(7), int64(1), object(1)
memory usage: 33.3+ MB


In [52]:
# Group texts by offering_id and concatenate them
df = df.groupby('offering_id').agg({
    'text': ' '.join,
    'service': 'mean',
    'cleanliness': 'mean', 
    'value': 'mean',
    'sleep_quality': 'mean',
    'rooms': 'mean',
    'location': 'mean',
    'overall': 'mean'
}).reset_index()

df.head()

Unnamed: 0,offering_id,text,service,cleanliness,value,sleep_quality,rooms,location,overall
0,72572,I had to make fast visit to seattle and I foun...,4.60101,4.636364,4.323232,4.333333,4.282828,4.570707,4.388889
1,72579,"Great service, rooms were clean, could use som...",4.232,4.24,4.152,3.768,3.856,4.192,3.888
2,72586,Beautiful views of the space needle - especial...,4.25,4.287879,4.05303,4.113636,3.992424,4.537879,4.045455
3,72598,This hotel is in need of some serious updates....,3.243243,3.243243,3.054054,3.27027,3.189189,3.027027,2.918919
4,73236,My experience at this days inn was perfect. th...,4.277778,3.111111,3.777778,3.722222,3.222222,4.111111,3.388889


In [53]:
df.to_excel("ratings.xlsx", index=False)
print('done')

done


### Creating the Base BM25 Model Without Preprocessing

#### Overview
- In this phase, we will:
    1. Create a BM25 model using the raw review texts without any pre-processing
    2. Use it as the baseline against which later versions are compared



In [54]:
from rank_bm25 import BM25Okapi

def build_bm25_model(df):
    """
    Build BM25 model from the text data
    
    Args:
        df: DataFrame containing 'text' column with reviews
        
    Returns:
        BM25Okapi model trained on the corpus
    """
    # Tokenize each document (hotel reviews)
    tokenized_corpus = [doc.split() for doc in df['text']]
    
    # Create and return the BM25 model
    return BM25Okapi(tokenized_corpus)

In [55]:

# Create base BM25 model without any preprocessing
base_bm25 = build_bm25_model(df)

print("Base BM25 model created successfully")

Base BM25 model created successfully


### Creating Evaluation Functions

#### Overview
- In this phase, we will create functions:
    1. To evaluate our BM25 models
    2. To get the ranking of recommended hotels 
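
The evaluation metric compares the target hotel's aspect ratings with those of each recommended hotel. A small sketch with made-up ratings shows the calculation; the `aspect_mse` helper and all numbers are illustrative assumptions, not part of the notebook.

```python
def aspect_mse(target, candidate):
    """Mean squared difference over the aspects present in `target`."""
    return sum((target[a] - candidate[a]) ** 2 for a in target) / len(target)

# Hypothetical rating profiles for two hotels
target = {"service": 4.0, "cleanliness": 5.0}
candidate = {"service": 3.0, "cleanliness": 5.0}

mse = aspect_mse(target, candidate)  # ((4 - 3)^2 + (5 - 5)^2) / 2
```

A lower value means the recommended hotel's rating profile is closer to the target's, and a hotel compared with itself scores exactly 0.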


In [56]:
def rank_hotels(query, df, bm25_model, top_n=10):
    """
    Rank hotels based on relevance to query using BM25
    
    Args:
        query: Search query string
        df: DataFrame containing hotel data
        bm25_model: Trained BM25 model
        top_n: Number of top results to return
        
    Returns:
        List of tuples containing (offering_id, bm25_score)
    """
    # Tokenize the query on whitespace; no further cleaning is applied here, so the query should already match the corpus preprocessing
    tokenized_query = query.split()
    
    # Get BM25 scores for all documents
    scores = bm25_model.get_scores(tokenized_query)
    
    # Create list of (offering_id, score) tuples
    hotel_scores = list(zip(df['offering_id'], scores))
    
    # Sort by score in descending order and get top_n results
    ranked_hotels = sorted(hotel_scores, key=lambda x: x[1], reverse=True)[:top_n]
    
    return ranked_hotels

In [57]:
def evaluate_recommendations(query, query_offering_id, df, bm25_model, aspects=['service', 'cleanliness', 'value', 'sleep_quality', 'rooms', 'location', 'overall']):
    """
    Evaluate BM25 recommendations by comparing aspect ratings between target and recommended hotels
    
    Args:
        query (str): Search query
        query_offering_id (int): Offering ID of the target hotel
        df (pd.DataFrame): DataFrame containing hotel data
        bm25_model: Trained BM25 model
        aspects (list): List of aspects to compare ratings for
        
    Returns:
        tuple: Average MSE across recommendations, and detailed list of recommendations with MSE scores
    """
    # Get target hotel ratings
    target_ratings = df[df['offering_id'] == query_offering_id][aspects].iloc[0]
    
    # Get recommended hotels
    recommended_hotels = rank_hotels(query, df, bm25_model)
    recommended_ids = [hotel_id for hotel_id, _ in recommended_hotels]
    
    # Get ratings for recommended hotels (isin() returns rows in DataFrame order, not BM25 rank order)
    recommended_ratings = df[df['offering_id'].isin(recommended_ids)][['offering_id'] + aspects]
    
    # Calculate MSE between target and each recommended hotel
    mse_scores = []
    results = []
    for _, rec_row in recommended_ratings.iterrows():
        rec_ratings = rec_row[aspects]
        mse = ((target_ratings - rec_ratings) ** 2).mean()
        mse_scores.append(mse)
        results.append({
            'offering_id': rec_row['offering_id'],
            'mse': mse,
            'ratings': rec_ratings.to_dict()
        })
    
    # Return average MSE and detailed results
    avg_mse = sum(mse_scores) / len(mse_scores)
    return avg_mse, results


### Testing the Base BM25 Model

#### Overview
- In this phase, we will:
    1. Test our baseline BM25 model using sample queries
    2. Evaluate the model's performance with raw, unprocessed text
    3. Document the results to establish a baseline for comparison

#### Expected Outcomes
- Understanding of base model performance
- Identification of potential areas for improvement
- Baseline metrics for comparing with future enhanced models




In [58]:
# Example for 1 query 
query = "great service clean room excellent breakfast"
target_hotel_id = 75662  

avg_mse, recommendations = evaluate_recommendations(query, target_hotel_id, df, base_bm25)

# Display results
print(f"\nMean Squared Error for recommendations: {avg_mse:.4f}")
print("\nRecommended Hotels and their MSE:")
print("-" * 50)
for rec in recommendations:
    print(f"Hotel ID: {rec['offering_id']}")
    print(f"MSE: {rec['mse']:.4f}")
    print(f"Ratings: {rec['ratings']}")
    print("-" * 50)



Mean Squared Error for recommendations: 0.0528

Recommended Hotels and their MSE:
--------------------------------------------------
Hotel ID: 79868.0
MSE: 0.0094
Ratings: {'service': 4.598870056497175, 'cleanliness': 4.5423728813559325, 'value': 4.423728813559322, 'sleep_quality': 4.423728813559322, 'rooms': 4.3107344632768365, 'location': 4.669491525423729, 'overall': 4.454802259887006}
--------------------------------------------------
Hotel ID: 93340.0
MSE: 0.0731
Ratings: {'service': 4.819108280254777, 'cleanliness': 4.807643312101911, 'value': 4.467515923566879, 'sleep_quality': 4.4636942675159235, 'rooms': 4.543949044585987, 'location': 4.84968152866242, 'overall': 4.70828025477707}
--------------------------------------------------
Hotel ID: 99762.0
MSE: 0.0783
Ratings: {'service': 4.790502793296089, 'cleanliness': 4.864525139664805, 'value': 4.379888268156424, 'sleep_quality': 4.582402234636872, 'rooms': 4.670391061452514, 'location': 4.656424581005586, 'overall': 4.754189944

In [59]:
# Multiple queries for accurate evaluation

# Define test queries and their corresponding target hotel IDs
test_queries = {
    "great service clean room excellent breakfast": 75662,
    "perfect location friendly staff comfortable bed": 72572,
    "luxury hotel amazing view great amenities": 72586,
    "budget friendly good value central location": 72579,
    "quiet room helpful staff near attractions": 73236,
    "modern design spa services business center": 72586,
    "family friendly spacious rooms pool": 73236,
    "historic building charming atmosphere": 72572,
    "close to shopping restaurants nightlife": 72579,
    "ocean view beachfront access": 72598,
    "romantic getaway luxury amenities": 72586,
    "airport shuttle convenient location": 75662,
    "fitness center wellness facilities": 72579,
    "rooftop bar city views": 72586,
    "business conference facilities": 73236,
    "budget-friendly dirty and ugly motel":83044,
    "A historic and charming hotel with a grand atmosphere, excellent location near the zoo and metro, friendly staff, and spacious rooms, though it could use updates in some areas like bathrooms and elevators; overall a solid choice for business or leisure stays in DC.":84087,
    "A clean, comfortable, and well-maintained Courtyard Marriott with friendly staff, modern amenities, and convenient location in Charlotte's Ballantyne area; minor issues like small bathrooms and occasional WiFi problems, but overall a great value for both business and leisure travelers.":100616
}

# Store MSE scores for all queries
all_mse_scores = []

# Evaluate each query
print("Individual Query Results:")
print("-" * 70)

for query, hotel_id in test_queries.items():
    avg_mse, recommendations = evaluate_recommendations(query, hotel_id, df, base_bm25)
    all_mse_scores.append(avg_mse)
    
    print(f"Query: {query}")
    print(f"Target Hotel ID: {hotel_id}")
    print(f"Average MSE: {avg_mse:.4f}")
    print("-" * 70)

# Calculate and display overall model performance
overall_avg_mse = sum(all_mse_scores) / len(all_mse_scores)
print(f"\nOverall Model Performance:")
print(f"Average MSE across all queries: {overall_avg_mse:.4f}")


Individual Query Results:
----------------------------------------------------------------------
Query: great service clean room excellent breakfast
Target Hotel ID: 75662
Average MSE: 0.0528
----------------------------------------------------------------------
Query: perfect location friendly staff comfortable bed
Target Hotel ID: 72572
Average MSE: 0.0649
----------------------------------------------------------------------
Query: luxury hotel amazing view great amenities
Target Hotel ID: 72586
Average MSE: 0.2945
----------------------------------------------------------------------
Query: budget friendly good value central location
Target Hotel ID: 72579
Average MSE: 0.1445
----------------------------------------------------------------------
Query: quiet room helpful staff near attractions
Target Hotel ID: 73236
Average MSE: 0.7175
----------------------------------------------------------------------
Query: modern design spa services business center
Target Hotel ID: 72586
Aver

# Improving on the Base BM25 Model

### Part 1: Preprocessed Data
We preprocess the review text to improve BM25 search quality, cleaning and standardizing it through:
- Lowercase conversion
- Special character removal
- Stopword removal
- Word lemmatization
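
BM25 matches tokens exactly, so case and attached punctuation can hide otherwise obvious matches. The sketch below illustrates this with only the first two steps (lowercasing and punctuation removal); stopword removal and lemmatization are left out to keep the example dependency-free, and the `review` and `query` strings are made up.

```python
import re

review = "Great service, rooms were CLEAN!"
query = "great service clean"

# Raw whitespace tokenization keeps case and attached punctuation
raw_tokens = review.split()

# Lowercase and replace punctuation with spaces before splitting
norm_tokens = re.sub(r"[^\w\s]", " ", review.lower()).split()

# Query terms that find an exact match in each token list
raw_matches = [t for t in query.split() if t in raw_tokens]
norm_matches = [t for t in query.split() if t in norm_tokens]
```

With raw tokens none of the three query terms match ("Great", "service," and "CLEAN!" all differ from the query tokens), whereas all three match after normalization.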



In [62]:
df.head()

Unnamed: 0,offering_id,text,service,cleanliness,value,sleep_quality,rooms,location,overall
0,72572,I had to make fast visit to seattle and I foun...,4.60101,4.636364,4.323232,4.333333,4.282828,4.570707,4.388889
1,72579,"Great service, rooms were clean, could use som...",4.232,4.24,4.152,3.768,3.856,4.192,3.888
2,72586,Beautiful views of the space needle - especial...,4.25,4.287879,4.05303,4.113636,3.992424,4.537879,4.045455
3,72598,This hotel is in need of some serious updates....,3.243243,3.243243,3.054054,3.27027,3.189189,3.027027,2.918919
4,73236,My experience at this days inn was perfect. th...,4.277778,3.111111,3.777778,3.722222,3.222222,4.111111,3.388889


In [63]:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download required NLTK data
nltk.download('punkt_tab')
nltk.download('stopwords') 
nltk.download('wordnet')

# Initialize lemmatizer and get stopwords
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def clean_text(text):
    """
    Clean text by:
    1. Converting to lowercase
    2. Replacing special characters and punctuation with spaces
    3. Removing stopwords and purely numeric tokens
    4. Lemmatizing words
    """
    # Convert to lowercase
    text = text.lower()
    
    # Replace special characters and punctuation with spaces
    # (\s already covers newlines, so no separate \n is needed)
    text = re.sub(r'[^\w\s]', ' ', text)
    
    # Tokenize
    words = word_tokenize(text)
    
    # Remove stopwords and numeric tokens, then lemmatize
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words and not word.isnumeric()]
    
    # Join tokens with single spaces (newlines are not preserved)
    cleaned_text = ' '.join(words)
    
    return cleaned_text

# Create a copy of the dataframe for cleaned version
df_cleaned = df.copy()

# Apply text cleaning to the text column of the cleaned dataframe
df_cleaned['text'] = df_cleaned['text'].apply(clean_text)


[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\tzolz\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\tzolz\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\tzolz\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [64]:
# Tokenize each document (all cleaned reviews of one hotel) for BM25
tokenized_corpus = [doc.split() for doc in df_cleaned['text']]

# Create the BM25 model
bm25 = BM25Okapi(tokenized_corpus)

In [65]:
# Multiple queries for accurate evaluation

# Define test queries and their corresponding target hotel IDs
test_queries = {
    "great service clean room excellent breakfast": 75662,
    "perfect location friendly staff comfortable bed": 72572,
    "luxury hotel amazing view great amenities": 72586,
    "budget friendly good value central location": 72579,
    "quiet room helpful staff near attractions": 73236,
    "modern design spa services business center": 72586,
    "family friendly spacious rooms pool": 73236,
    "historic building charming atmosphere": 72572,
    "close to shopping restaurants nightlife": 72579,
    "ocean view beachfront access": 72598,
    "romantic getaway luxury amenities": 72586,
    "airport shuttle convenient location": 75662,
    "fitness center wellness facilities": 72579,
    "rooftop bar city views": 72586,
    "business conference facilities": 73236
}

# Store MSE scores for all queries
all_mse_scores = []

# Evaluate each query
print("Individual Query Results:")
print("-" * 70)

for query, hotel_id in test_queries.items():
    avg_mse, recommendations = evaluate_recommendations(query, hotel_id, df_cleaned, bm25)
    all_mse_scores.append(avg_mse)
    
    print(f"Query: {query}")
    print(f"Target Hotel ID: {hotel_id}")
    print(f"Average MSE: {avg_mse:.4f}")
    print("-" * 70)

# Calculate and display overall model performance
overall_avg_mse = sum(all_mse_scores) / len(all_mse_scores)
print(f"\nOverall Model Performance:")
print(f"Average MSE across all queries: {overall_avg_mse:.4f}")


Individual Query Results:
----------------------------------------------------------------------
Query: great service clean room excellent breakfast
Target Hotel ID: 75662
Average MSE: 0.0486
----------------------------------------------------------------------
Query: perfect location friendly staff comfortable bed
Target Hotel ID: 72572
Average MSE: 0.0601
----------------------------------------------------------------------
Query: luxury hotel amazing view great amenities
Target Hotel ID: 72586
Average MSE: 0.2820
----------------------------------------------------------------------
Query: budget friendly good value central location
Target Hotel ID: 72579
Average MSE: 0.1653
----------------------------------------------------------------------
Query: quiet room helpful staff near attractions
Target Hotel ID: 73236
Average MSE: 0.8990
----------------------------------------------------------------------
Query: modern design spa services business center
Target Hotel ID: 72586
Aver

In [66]:
# Get the text for offering_id 75662
hotel_text = df[df['offering_id'] == 75662]['text'].iloc[0]
print("\nText for hotel 75662:")
print("-" * 50)
print(hotel_text)



Text for hotel 75662:
--------------------------------------------------
We stayed one night prior to our early flight at the end of our vacation in Sedona. We did not want to make the two hour drive from Sedona so early in the morning so we booked one night here. I have stayed in airport hotels before so frankly did not expect much. I was pleasantly surprised. The rooms were spacious. Front desk was very helpful. We returned our car when we arrived and took the car rental shuttle to the airport and was picked up (in less than 10 minutes). Then in the morning took the shuttle back to the airport. Very seamless and quick service. The breakfast was wonderful. Again we were not expecting much but was very surprised. My husband even personally thanked the woman who was putting out the breakfast since everything was very good and much more than you usually get at an airport hotel. My husand ate so much breakfast he skipped lunch that day. So if you are looking for a great airport hotel thi

### Part 2: Fine-tune the k1 and b parameters

- We vary the two main parameters of the BM25 model, k1 and b, and observe how each combination affects performance
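
For reference, the standard Okapi BM25 score of a document $D$ for a query $Q = (q_1, \dots, q_n)$ is usually written as below. The IDF term is left abstract because its exact form differs between implementations, and the notation ($f$, $|D|$, $\text{avgdl}$) is introduced here rather than taken from the notebook.

$$
\text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}
$$

Here $f(q_i, D)$ is the number of times term $q_i$ occurs in $D$, $|D|$ is the document length in tokens, and $\text{avgdl}$ is the average document length in the corpus. $k_1$ controls how quickly repeated occurrences of a term saturate, and $b$ (between 0 and 1) controls how strongly long documents are penalized. Because each document here concatenates all reviews of one hotel, document lengths likely vary a lot, so $b$ can be expected to have a visible effect.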

In [74]:
k1_values = np.linspace(0.5, 2.0, 4)
b_values = np.linspace(0.5, 1.0, 6)

# Track the best (lowest-MSE) parameter combination seen so far
best_k1, best_b, best_score = None, None, None

for k1 in k1_values:
    for b in b_values:
        bm25 = BM25Okapi(tokenized_corpus, k1=k1, b=b)
        print(f"Testing BM25 with k1={k1}, b={b}")
        all_mse_scores = []

        for query, hotel_id in test_queries.items():
            avg_mse, recommendations = evaluate_recommendations(query, hotel_id, df_cleaned, bm25)
            all_mse_scores.append(avg_mse)

        # Calculate and display overall model performance
        overall_avg_mse = sum(all_mse_scores) / len(all_mse_scores)
        print(f"\nOverall Model Performance:")
        print(f"Average MSE across all queries: {overall_avg_mse:.4f}")
        print("-" * 70)

        # Check for None first: comparing a float with None raises TypeError
        if best_score is None or overall_avg_mse < best_score:
            best_k1, best_b, best_score = k1, b, overall_avg_mse


print(f"Best combination: k1={best_k1}, b={best_b} with score={best_score}")

Testing BM25 with k1=0.5, b=0.5

Overall Model Performance:
Average MSE across all queries: 0.3314
----------------------------------------------------------------------
Testing BM25 with k1=0.5, b=0.6

Overall Model Performance:
Average MSE across all queries: 0.3220
----------------------------------------------------------------------
Testing BM25 with k1=0.5, b=0.7

Overall Model Performance:
Average MSE across all queries: 0.3143
----------------------------------------------------------------------
Testing BM25 with k1=0.5, b=0.8

Overall Model Performance:
Average MSE across all queries: 0.3119
----------------------------------------------------------------------
Testing BM25 with k1=0.5, b=0.9

Overall Model Performance:
Average MSE across all queries: 0.3159
----------------------------------------------------------------------
Testing BM25 with k1=0.5, b=1.0

Overall Model Performance:
Average MSE across all queries: 0.3723
---------------------------------------------------

In [72]:
# The best-performing model in our use case uses k1=0.5 and b=0.8, which lowers the MSE by about 0.02 compared to the base model.

### Part 3: Directions for Future Development

- Performance could be further improved with embeddings, which can link sentences that share the same meaning even when they use different words
- 3 steps:
    - Convert every text into a vector, ideally with a model trained on comments and reviews
    - Store all embeddings
    - Convert the query into a vector as well and compute its cosine similarity with every stored embedding to find the top-k vectors
    (a vector search index can be used to speed this up if the DataFrame is too large)
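
The retrieval step can be sketched with toy vectors. This is only an illustration under stated assumptions: the 3-dimensional vectors below are made up and stand in for real embeddings, which would come from a sentence-embedding model that is not chosen here, and the `cosine_similarity` and `top_k` helpers are hypothetical names.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query_vec, hotel_vecs, k=2):
    """Return the k (hotel_id, similarity) pairs most similar to the query."""
    scored = [(hid, cosine_similarity(query_vec, vec)) for hid, vec in hotel_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy stand-ins for stored hotel embeddings (hypothetical values)
hotel_vecs = {
    "hotel_a": [1.0, 0.0, 0.0],
    "hotel_b": [0.9, 0.1, 0.0],
    "hotel_c": [0.0, 1.0, 0.0],
}
query_vec = [1.0, 0.0, 0.0]  # stand-in for the embedded query

results = top_k(query_vec, hotel_vecs, k=2)
```

This brute-force comparison is linear in the number of hotels, which is usually fine at this scale; an approximate nearest-neighbour index only becomes worthwhile for much larger collections.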