## **Project One: TripAdvisor Recommendation Challenge - Beating BM25**
_**Authors:** Alberto MARTINELLI, Alessia SARRITZU_

### **Introduction**
The goal of this project is to build an unsupervised recommendation system that suggests similar locations from user reviews and outperforms a BM25 baseline. Recommendations are generated from review text alone; ratings are used only for evaluation, via the **Mean Squared Error (MSE)** between the average aspect ratings of the query location and those of the recommended location.
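
As a minimal sketch of the metric, assuming each location is represented by its average rating on the seven aspects used in this project, the score for one query is the mean of the squared differences between the two rating vectors. The helper name `rating_mse` and all rating values below are invented for illustration.

```python
# Illustrative only: the ratings below are made up, not taken from the dataset.
ASPECTS = ["service", "cleanliness", "overall", "value",
           "location", "sleep_quality", "rooms"]

def rating_mse(query_ratings, recommended_ratings, aspects=ASPECTS):
    """Mean squared error between two locations' per-aspect average ratings."""
    squared_errors = [
        (query_ratings[a] - recommended_ratings[a]) ** 2 for a in aspects
    ]
    return sum(squared_errors) / len(aspects)

query = {"service": 4.0, "cleanliness": 4.5, "overall": 4.0, "value": 3.5,
         "location": 5.0, "sleep_quality": 4.0, "rooms": 4.0}
recommended = {"service": 4.5, "cleanliness": 4.5, "overall": 4.0, "value": 3.0,
               "location": 4.0, "sleep_quality": 4.0, "rooms": 3.5}

print(rating_mse(query, recommended))
```

A lower value means the recommended location's ratings are closer to those of the query location.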

### **Development Phases**

1. **Data Preparation:**
   - Keep only reviews whose ratings cover exactly these seven aspects: **service**, **cleanliness**, **overall**, **value**, **location**, **sleep quality**, and **rooms**.
   - Concatenate reviews by `offering_id` and compute average ratings for each aspect to represent each location.
   - Draw a random sample of 100 reviews with a fixed seed, so that every model is evaluated on the same queries.

2. **Data Pre-Processing:**
   - Apply text preprocessing to standardize review text:
     - Tokenization: Split text into words.
     - Stop word removal: Exclude common, irrelevant words.
     - Lemmatization: Reduce words to their base forms.

3. **BM25 Implementation:**
   - Use the **Rank-BM25** library to recommend locations based on textual similarity.
   - Evaluate performance by calculating MSE between query and recommended location ratings.

4. **Enhanced Model Implementation:**
   - Create a more advanced unsupervised model to outperform BM25.
   - Use **TF-IDF vectorization** and **cosine similarity** to compare reviews by their weighted term overlap.

5. **Evaluation and Comparison:**
   - Compute the MSE of both BM25 and the enhanced model on the sampled queries.
   - Compare the average MSE of the two models to determine whether the enhanced model improves on BM25.
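
The TF-IDF and cosine similarity idea from phase 4 can be sketched on a toy corpus as follows. This is only an illustration under assumptions: it uses scikit-learn's default `TfidfVectorizer` settings and invented texts, and it is not necessarily the configuration of the enhanced model in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus (invented): one concatenated, preprocessed review text per location.
location_texts = [
    "clean room friendly staff quiet night",
    "noisy street dirty bathroom rude staff",
    "friendly staff clean quiet room great breakfast",
]
query_text = "quiet clean room helpful friendly staff"

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(location_texts)  # learn vocabulary and IDF on locations
query_vector = vectorizer.transform([query_text])      # project the query into the same space

# One similarity per location; the highest one is the recommendation.
similarities = cosine_similarity(query_vector, doc_matrix).ravel()
best_index = int(similarities.argmax())
print(best_index, similarities.round(3))
```

Query words that never occur in the fitted vocabulary (here `helpful`) are simply ignored by `transform`.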

--- 

0. **Data and Libraries Import**

In [None]:
import os
import pandas as pd
from datetime import datetime
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import nltk
import swifter
from rank_bm25 import BM25Okapi
from sklearn.metrics import mean_squared_error
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('reviews.csv')

In [None]:
aspects = ["service", "cleanliness", "overall", "value", "location", "sleep_quality", "rooms"]

# Uncomment on the first run to download the NLTK resources used below
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

---
1. **Data Preparation**

In [None]:
import ast

def dataframe_preparation(df):
    # --------------------------- Filter data --------------------------------------------
    required_aspects = {"service", "cleanliness", "overall", "value", "location", "sleep_quality", "rooms"}
    # Parse each ratings string once; ast.literal_eval only accepts Python literals, unlike eval
    parsed_ratings = df['ratings'].apply(ast.literal_eval)
    has_all_aspects = parsed_ratings.apply(lambda r: set(r.keys()) == required_aspects)
    filtered_df = df[has_all_aspects].reset_index(drop=True)
    parsed_ratings = parsed_ratings[has_all_aspects].reset_index(drop=True)
    print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}]    2.1 Data filtered")

    # -------------------------- Take a sample for model testing --------------------------
    # Fixed seed so every model is evaluated on the same 100 queries
    sample_df = filtered_df.sample(n=100, random_state=42)
    print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}]    2.2 Sample of 100 queries retrieved for model testing")

    # ------------------ Concatenate reviews for the same place ---------------------------
    # One column per aspect, aligned row by row with the filtered reviews
    expanded_ratings_df = pd.json_normalize(parsed_ratings.tolist()).join(filtered_df[['offering_id', 'title', 'text']])

    # Calculate the mean of each rating aspect and concatenate reviews
    final_df = expanded_ratings_df.groupby('offering_id').agg(
        service=('service', 'mean'),
        cleanliness=('cleanliness', 'mean'),
        overall=('overall', 'mean'),
        value=('value', 'mean'),
        location=('location', 'mean'),
        sleep_quality=('sleep_quality', 'mean'),
        rooms=('rooms', 'mean'),
        text=('text', lambda x: ' '.join(x)),
    ).reset_index()

    print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}]    2.3 Reviews concatenated")
    return sample_df, final_df

sample_df, final_df = dataframe_preparation(df)
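
To see the filter and the aggregation on a small scale, the following sketch runs the same steps on three invented rows. It parses the `ratings` strings with `ast.literal_eval`, which only accepts Python literals; the frame `toy_df` and all its values are made up for illustration.

```python
import ast
import pandas as pd

REQUIRED_ASPECTS = {"service", "cleanliness", "overall", "value",
                    "location", "sleep_quality", "rooms"}

# Toy rows (invented) mimicking a `ratings` column stored as dict-like strings.
toy_df = pd.DataFrame({
    "offering_id": [1, 1, 2],
    "ratings": [
        "{'service': 5.0, 'cleanliness': 4.0, 'overall': 4.0, 'value': 4.0, "
        "'location': 5.0, 'sleep_quality': 4.0, 'rooms': 3.0}",
        "{'service': 3.0, 'cleanliness': 4.0, 'overall': 4.0, 'value': 2.0, "
        "'location': 5.0, 'sleep_quality': 4.0, 'rooms': 5.0}",
        "{'overall': 2.0, 'value': 1.0}",  # incomplete: dropped by the filter
    ],
})

parsed = toy_df["ratings"].apply(ast.literal_eval)            # string -> dict
complete = parsed.apply(lambda r: set(r) == REQUIRED_ASPECTS)  # exactly the seven aspects

# Expand the dicts into one column per aspect, then average per location.
ratings_df = pd.json_normalize(parsed[complete].tolist())
ratings_df["offering_id"] = toy_df.loc[complete, "offering_id"].to_numpy()
per_location = ratings_df.groupby("offering_id").mean()
print(per_location)
```

Location 2 disappears from the result because its only review lacks five of the seven aspects.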

---
2. **Data Pre-Processing**

In [None]:
def preprocess_text(text):
    tokens = word_tokenize(text.lower())
    tokens = [lemmatizer.lemmatize(token) for token in tokens if token.isalpha() and token not in stop_words]
    return ' '.join(tokens)

preprocessed_file = "final_df_preprocessed.csv"
if os.path.exists(preprocessed_file):
    final_df = pd.read_csv(preprocessed_file)
else:
    final_df['text'] = final_df['text'].swifter.apply(preprocess_text)
    final_df.to_csv(preprocessed_file, index=False)
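
The three preprocessing steps can also be illustrated without any NLTK downloads. The sketch below is a dependency-free stand-in, not the notebook's pipeline: the stop word set, the lemma dictionary, and the regex tokenizer are tiny hypothetical substitutes for `stopwords`, `WordNetLemmatizer`, and `word_tokenize`.

```python
import re

# Tiny stand-ins for the NLTK resources (assumptions made for this sketch only).
TOY_STOP_WORDS = {"the", "was", "and", "were", "a", "very"}
TOY_LEMMAS = {"rooms": "room", "beds": "bed"}

def toy_preprocess(text):
    # Crude tokenization: lowercase, keep alphabetic runs only
    tokens = re.findall(r"[a-z]+", text.lower())
    # Stop word removal
    tokens = [t for t in tokens if t not in TOY_STOP_WORDS]
    # Lemmatization by dictionary lookup; unknown words pass through unchanged
    return " ".join(TOY_LEMMAS.get(t, t) for t in tokens)

print(toy_preprocess("The rooms were clean and the beds very comfy!"))
```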

---
3. **BM25 Implementation**

In [1]:
# Query details extraction 
def extract_query(query_row, aspects, df):
    query_id = query_row['offering_id']
    query_text = query_row['text']
    place_ratings = df[df['offering_id'] == query_id][aspects].iloc[0]
    return query_id, query_text, place_ratings

# BM25 implementation
def apply_bm25(query_id, query_text, place_ratings, df, aspects):
    # Exclude the query place from the documents to avoid recommending it
    documents_df = df[df['offering_id'] != query_id].reset_index(drop=True)

    # Tokenize each document for BM25
    documents_df['text'] = documents_df['text'].astype(str)
    documents = documents_df['text'].apply(lambda x: x.split())
    bm25 = BM25Okapi(documents)
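    # Preprocess the raw query review like the documents so that tokens are comparable
    scores = bm25.get_scores(preprocess_text(str(query_text)).split())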
    top_match_index = int(np.argmax(scores))  # position of the highest-scoring document
    top_match = documents_df.iloc[top_match_index]

    # Calculate the MSE between the ratings of the query place and the recommended place
    recommended_ratings = top_match[aspects]
    mse = mean_squared_error(place_ratings, recommended_ratings)
    return mse
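
For intuition about what the BM25 scores measure, here is a small pure-Python sketch of an Okapi BM25 variant. It is an illustration under assumptions, not the Rank-BM25 implementation: the function name `bm25_scores` is hypothetical, `k1=1.5` and `b=0.75` are commonly used values chosen here for the example, and the IDF uses the non-negative `log(... + 1)` form, which may differ from the library's formula.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, documents, k1=1.5, b=0.75):
    """Score each tokenized document against the query with an Okapi BM25 variant."""
    n_docs = len(documents)
    avg_len = sum(len(d) for d in documents) / n_docs
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for d in documents for term in set(d))
    scores = []
    for doc in documents:
        term_freq = Counter(doc)
        # Longer-than-average documents are penalized through this factor
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue  # terms absent from the document contribute nothing
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores

docs = [
    "clean room friendly staff".split(),
    "noisy street dirty bathroom".split(),
    "friendly staff great breakfast".split(),
]
scores = bm25_scores("clean quiet room".split(), docs)
best = max(range(len(scores)), key=scores.__getitem__)
print(best, scores)
```

Only exact token matches count, which is why the query and the documents need to go through the same preprocessing.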

In [None]:
print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] 4. Processing data (BM25)")
output_file = "results.csv"
if not os.path.exists(output_file):
    pd.DataFrame(columns=["row_id", "offering_id", "bm25_mse"]).to_csv(output_file, index=False)

existing_results = pd.read_csv(output_file)
# Rows that already have a BM25 score are skipped on re-runs
processed_ids = set(existing_results.loc[existing_results["bm25_mse"].notna(), "row_id"])

for index, row in sample_df.iterrows():
    # Process rows that are new or whose bm25_mse is still missing
    if index not in processed_ids:
        query_id, query_text, place_ratings = extract_query(row, aspects, final_df)
        mse = apply_bm25(query_id, query_text, place_ratings, final_df, aspects)
        
        if index in existing_results["row_id"].values:
            existing_results.loc[existing_results["row_id"] == index, "bm25_mse"] = mse
        else:
            new_row = pd.DataFrame([{"row_id": index, "offering_id": query_id, "bm25_mse": mse}])
            existing_results = pd.concat([existing_results, new_row], ignore_index=True)
        
        print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}]    -> Row {index}: processed with BM25 MSE={mse}")

existing_results.to_csv(output_file, index=False)
if not existing_results.empty:
    overall_average_mse = existing_results["bm25_mse"].mean()
    res = overall_average_mse
else:
    res = "No data available in the file"
print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}]    4.1 Calculating average MSE: {res}")
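
Once a second model's per-query MSE is stored next to `bm25_mse`, the comparison from phase 5 reduces to a few aggregate statistics. The sketch below assumes a hypothetical column name `enhanced_mse` and uses invented numbers; it is not output from this notebook.

```python
import pandas as pd

# Invented per-query results; `enhanced_mse` is a hypothetical column name.
toy_results = pd.DataFrame({
    "row_id": [10, 11, 12, 13],
    "bm25_mse": [0.50, 0.20, 0.80, 0.10],
    "enhanced_mse": [0.30, 0.25, 0.40, 0.10],
})

# Average MSE per model over all queries
mean_mse = toy_results[["bm25_mse", "enhanced_mse"]].mean()
# Share of queries where the second model is strictly better than BM25
win_rate = (toy_results["enhanced_mse"] < toy_results["bm25_mse"]).mean()

print(mean_mse)
print(f"Queries improved: {win_rate:.0%}")
```

Reporting the share of improved queries alongside the mean helps show whether a lower average comes from a few large gains or from a consistent improvement.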


---
4. **Enhanced Model Implementation**