## **Project One: TripAdvisor Recommendation Challenge - Beating BM25**
_**Authors:** Alberto MARTINELLI, Alessia SARRITZU_

The goal of the project is to develop a recommendation system that relies solely on user reviews to suggest similar places. Given a query place, the system proposes the most relevant place based only on the text of the reviews.

1. **Data pre-processing**:
  - Use only the reviews whose ratings cover exactly these aspects: **service**, **cleanliness**, **overall**, **value**, **location**, **sleep quality**, and **rooms**.
  - Concatenate the review texts by `offering_id` and average the ratings per aspect, so that each place has one document and one set of ratings for evaluation.

2. **BM25 Implementation**:
    - Implement a BM25 baseline using the **Rank-BM25** library.
    - Measure the performance of BM25 through the Mean Squared Error (MSE) between the ratings of the query place and those of the recommended place.

3. **Enhanced Unsupervised Model**:
    - Create a new unsupervised model to outperform BM25, potentially integrating it with other methods, while ensuring the model does not directly utilize ratings in its learning process.
    - Measure the performance of the **Enhanced Model** through the Mean Squared Error (MSE) between the ratings of the query place and those of the recommended place, with the aim of achieving a lower MSE than the BM25 baseline.
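
The MSE used in steps 2 and 3 is the mean, over the seven aspects, of the squared difference between the query place's rating and the recommended place's rating. A minimal sketch in plain Python follows; the helper name `rating_mse` and the rating values are illustrative assumptions, not taken from the dataset.

```python
def rating_mse(query, recommended):
    """Mean squared error over the rating aspects of two places."""
    if set(query) != set(recommended):
        raise ValueError("both places must be rated on the same aspects")
    return sum((query[a] - recommended[a]) ** 2 for a in query) / len(query)


# Hypothetical ratings for a query place and a recommended place
query = {"service": 5.0, "cleanliness": 5.0, "overall": 5.0, "value": 4.0,
         "location": 5.0, "sleep_quality": 5.0, "rooms": 5.0}
recommended = {"service": 4.0, "cleanliness": 5.0, "overall": 4.0, "value": 4.0,
               "location": 5.0, "sleep_quality": 5.0, "rooms": 4.0}

# Three aspects differ by 1.0, so the result is 3/7
print(rating_mse(query, recommended))
```

A lower value means the recommended place is rated more like the query place, which is the sense in which one model "beats" another here.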

### 1)

In [23]:
import pandas as pd

pd.set_option('display.max_colwidth', 200)
df = pd.read_csv('reviews.csv')
df.sample(1)

Unnamed: 0,ratings,title,text,author,date_stayed,offering_id,num_helpful_votes,date,id,via_mobile
260586,"{'service': 5.0, 'cleanliness': 5.0, 'overall': 5.0, 'value': 5.0, 'location': 5.0, 'sleep_quality': 5.0, 'rooms': 4.0}","“Great service, modern amenities, great views, small rooms”","If I had to go to NYC again, I would definitely check the rates for this Hilton Garden Inn. My husband and I stayed here just one night, but it was exactly what we needed. We checked in very early...","{'username': 'Sueswim03', 'num_cities': 2, 'num_helpful_votes': 1, 'num_reviews': 2, 'location': 'Irvine, California', 'id': 'A8171ADE45270015FAFD48DDFBE45B28'}",October 2010,1218792,0,2010-10-05,82237329,False


From `reviews.csv`, keep only the reviews whose ratings cover exactly these aspects:  
“service”, “cleanliness”, “overall”, “value”, “location”, “sleep quality”, “rooms”  
(no more and no fewer, so that places can be compared accurately).
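
The "exactly these aspects" rule is a set-equality test on the keys of each rating dictionary. The sketch below shows it on three toy rating strings; the helper name `has_exact_aspects` and the extra key `check_in_front_desk` are hypothetical and only serve the illustration. It parses the strings with `ast.literal_eval`, which only accepts Python literals.

```python
import ast

REQUIRED_ASPECTS = {"service", "cleanliness", "overall", "value",
                    "location", "sleep_quality", "rooms"}


def has_exact_aspects(raw_ratings):
    """Return True if the rating string has exactly the required aspect keys."""
    ratings = ast.literal_eval(raw_ratings)  # parse the dict literal safely
    return set(ratings) == REQUIRED_ASPECTS


complete = ("{'service': 5.0, 'cleanliness': 5.0, 'overall': 5.0, 'value': 5.0, "
            "'location': 5.0, 'sleep_quality': 5.0, 'rooms': 4.0}")
too_few = "{'service': 4.0, 'overall': 4.0}"
# Same as `complete` plus one hypothetical extra aspect
too_many = complete[:-1] + ", 'check_in_front_desk': 3.0}"

kept = [has_exact_aspects(s) for s in (complete, too_few, too_many)]
print(kept)
```

Only the first string passes: a review with missing aspects and a review with additional aspects are both dropped.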

In [24]:
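import ast

required_aspects = {"service", "cleanliness", "overall", "value", "location", "sleep_quality", "rooms"}

# Parse each ratings string with ast.literal_eval (safer than eval, which would run arbitrary code)
# and keep only the rows whose aspect keys match required_aspects exactly
filtered_df = df[df['ratings'].apply(lambda x: set(ast.literal_eval(x).keys()) == required_aspects)]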

filtered_df.sample(1)

Unnamed: 0,ratings,title,text,author,date_stayed,offering_id,num_helpful_votes,date,id,via_mobile
390574,"{'service': 5.0, 'cleanliness': 5.0, 'overall': 5.0, 'value': 5.0, 'location': 4.0, 'sleep_quality': 5.0, 'rooms': 5.0}",“Renommée internationale”,"Hôtel très chic à côté de l'aéroport, des locations de voitures, bien situé. Check-in rapide et service ok. Chambres spacieuses avec des lits queen bed plus que confortables. Petit-déjeuner excell...","{'username': 'Laetitiakse', 'num_cities': 33, 'num_helpful_votes': 15, 'num_reviews': 55, 'num_type_reviews': 24, 'id': 'F1CA1D56DE716553305607BDE3712DB2', 'location': 'Charleroi'}",May 2012,78046,0,2012-06-21,132416614,False


Concatenate the reviews of the same place, identified by the `offering_id` attribute. The rating of a place is the average, per aspect, of the ratings of all its reviews.

In [25]:
import ast

# Work on an explicit copy with a fresh index: the copy avoids pandas' SettingWithCopyWarning
# on the assignment below, and the reset index avoids misalignment in the concat
filtered_df = filtered_df.reset_index(drop=True).copy()
# Convert the ratings field from string to dictionary
filtered_df['ratings'] = filtered_df['ratings'].apply(ast.literal_eval)
# Expand each dictionary into one column per aspect
ratings_expanded_df = pd.json_normalize(filtered_df['ratings'].tolist())
selected_columns_df = filtered_df[['text', 'offering_id']]
# Join the rating columns with text and offering_id
expanded_df = pd.concat([ratings_expanded_df, selected_columns_df], axis=1)

expanded_df.sample(1)


Unnamed: 0,service,cleanliness,overall,value,location,sleep_quality,rooms,text,offering_id
207661,5.0,5.0,5.0,5.0,5.0,5.0,5.0,"I really enjoyed the The Peal Hotel. \n It was fun, funky, comfortable and ultra groovy. Brianna, Thomas, Devon and the rest of the staff were awesome. My roommate Greta the fish was quiet when I ...",658421


In [26]:
# Average each rating aspect per place and concatenate the texts of its reviews
combined_df = expanded_df.groupby('offering_id').agg(
    service=('service', 'mean'),  # Average rating per aspect
    cleanliness=('cleanliness', 'mean'),
    overall=('overall', 'mean'),
    value=('value', 'mean'),
    location=('location', 'mean'),
    sleep_quality=('sleep_quality', 'mean'),
    rooms=('rooms', 'mean'),
    text=('text', lambda x: ' '.join(x)), # Concatenate all text entries
).reset_index()

combined_df.sample(1)

Unnamed: 0,offering_id,service,cleanliness,overall,value,location,sleep_quality,rooms,text
2483,240079,2.0,3.2,1.8,2.8,3.4,2.6,2.8,"We just wanted a place to sleep and shower on our way to New York, mainly to kill some time while the roads were cleared from the snowstorm. It was late. We noticed right away some food smeared on..."


### 2)

In [27]:
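# Select a random query from the dataset.
# Note: expanded_df holds one row per review, so the ratings and text below
# belong to a single review of the place, not to the place-level averages in combined_df
query_place = expanded_df.sample(1).iloc[0]
# Print its details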
query_id = query_place['offering_id']
query_ratings = query_place[['service', 'cleanliness', 'overall', 'value', 'location', 'sleep_quality', 'rooms']]
query_text = query_place['text']
print("Query Place Details:")
print(f"Offering ID: {query_id}")
print("Ratings:")
for aspect, rating in query_ratings.items():
    print(f"  {aspect}: {rating}")
print("\nReviews texts concatenated:")
print(query_text)

# Exclude from combined_df the row (place) with the same offering_id as the query place
# to avoid recommending it
documents_df = combined_df[combined_df['offering_id'] != query_id].reset_index(drop=True)


Query Place Details:
Offering ID: 502408
Ratings:
  service: 5.0
  cleanliness: 5.0
  overall: 5.0
  value: 4.0
  location: 5.0
  sleep_quality: 5.0
  rooms: 5.0

Reviews texts concatenated:
I visited NYC for the first time the last week of March with my son, daughter-in-law, 14 year old grandson & 12 year old granddaughter. We had a junior suite with 2 queen beds & a full size hidebed. I was worried the 5 of us would be stumbling all over each other in one room, but it was perfect, much larger than I expected. Anything we asked for, more towels, more coffee, a bellman when we left , etc. was supplied within 5 minutes. The location was great. We walked to St. Patrick's, TIme Square, Grand Central Terminal, & more. Also it was very convenient to the subway. I would highly recommend this hotel to families and if I ever visit NYC again, I'll stay at Affinia 50.
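
Before running the baseline, it may help to see what a BM25 score measures. The sketch below is a from-scratch implementation on a toy corpus, not the code of the Rank-BM25 library: it assumes the common defaults `k1=1.5` and `b=0.75` and uses the "+1" variant of the IDF term, which stays positive, so its numbers will not match the library's exactly. The function name `bm25_scores` and the toy documents are illustrative.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document in corpus_tokens against query_tokens."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs

    # Document frequency: number of documents containing each term
    doc_freq = Counter()
    for doc in corpus_tokens:
        doc_freq.update(set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        # Length normalisation: long documents are penalised through b
        norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            n_t = doc_freq[term]
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1)
            tf = term_freq[term]
            score += idf * tf * (k1 + 1) / (tf + norm)
        scores.append(score)
    return scores


corpus = [
    "clean room friendly staff great location".split(),
    "noisy street small room rude staff".split(),
    "great breakfast friendly staff quiet room".split(),
]
query = "friendly staff great location".split()

scores = bm25_scores(query, corpus)
best = max(range(len(scores)), key=scores.__getitem__)
print(best, scores)
```

The first document contains all four query terms and therefore ranks first. Rare terms such as "location" contribute more than terms found in every document, such as "staff".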


In [None]:
from datetime import datetime

from rank_bm25 import BM25Okapi
from sklearn.metrics import mean_squared_error

def log(message):
    # Print a timestamped progress message (these are the log lines shown in the cell output)
    print(f"[{datetime.now():%Y-%m-%d %H:%M:%S}] {message}")

rating_columns = ['service', 'cleanliness', 'overall', 'value', 'location', 'sleep_quality', 'rooms']

# Step 1: Tokenize each document by splitting on whitespace
# (documents_df holds the concatenated reviews of every place except the query place)
log("Started BM25 ranking...")
documents = documents_df['text'].apply(lambda x: x.split()).tolist()

# Step 2: Build the BM25 index over the tokenized documents
bm25 = BM25Okapi(documents)
log("BM25 ranking completed.")

# Step 3: Tokenize the query (the review text of the query place)
tokenized_query = query_text.split()

# Step 4: Get BM25 scores for the query across all documents
log("Started BM25 scoring...")
scores = bm25.get_scores(tokenized_query)
log("BM25 scoring completed.")

# Step 5: Retrieve the top matching place based on BM25 score
top_match_index = int(scores.argmax())  # Index of the highest-scoring document
top_match = documents_df.iloc[top_match_index]
log("Top matching place retrieved.")

# Step 6: Calculate the MSE between the ratings of the query place and the recommended place
recommended_ratings = top_match[rating_columns]
mse = mean_squared_error(query_ratings.astype(float), recommended_ratings.astype(float))
log("MSE calculated.")

# Print the results
print(f"Query Offering ID: {query_id}")
print(f"Recommended Offering ID: {top_match['offering_id']}")
print(f"BM25 Score: {scores[top_match_index]}")
print(f"MSE: {mse}")

[2024-11-22 14:47:45] Started BM25 ranking...
[2024-11-22 14:49:37] BM25 ranking completed.
[2024-11-22 14:49:37] Started BM25 scoring...
[2024-11-22 14:49:38] BM25 scoring completed.
[2024-11-22 14:49:38] Top matching place retrieved.
[2024-11-22 14:49:38] MSE calculated.
Query Offering ID: 502408
Recommended Offering ID: 292142
BM25 Score: 438.7279450206638
MSE: 0.21538939690925638
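
The MSE above comes from a single randomly chosen query, so it can vary considerably from run to run. One possible way to get a steadier figure for comparing models is to average the MSE over several queries. The sketch below assumes a hypothetical data layout (a dict from `offering_id` to ratings and text) and a hypothetical `recommend` callback. The word-overlap recommender is only a stand-in so the example runs without external libraries, and the two-aspect ratings are toy values.

```python
import random
from statistics import mean


def average_mse(places, recommend, n_queries=100, seed=0):
    """Average the rating MSE between each sampled query place and its recommendation."""
    rng = random.Random(seed)  # fixed seed so the sampled queries are reproducible
    query_ids = rng.sample(sorted(places), min(n_queries, len(places)))
    errors = []
    for query_id in query_ids:
        rec_id = recommend(query_id, places)
        q = places[query_id]["ratings"]
        r = places[rec_id]["ratings"]
        errors.append(mean((q[a] - r[a]) ** 2 for a in q))
    return mean(errors)


def overlap_recommender(query_id, places):
    """Stand-in recommender: pick the other place sharing the most distinct words."""
    query_words = set(places[query_id]["text"].split())
    candidates = [pid for pid in sorted(places) if pid != query_id]
    return max(candidates, key=lambda pid: len(query_words & set(places[pid]["text"].split())))


# Toy places with two aspects each (hypothetical values)
places = {
    1: {"ratings": {"overall": 5.0, "rooms": 4.0}, "text": "clean quiet room"},
    2: {"ratings": {"overall": 4.0, "rooms": 4.0}, "text": "clean quiet room friendly"},
    3: {"ratings": {"overall": 2.0, "rooms": 1.0}, "text": "dirty noisy room"},
}

result = average_mse(places, overlap_recommender, n_queries=3)
print(result)
```

Passing a BM25-based function and an enhanced-model function as `recommend`, with the same seed, would compare both on the same set of queries.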
