## **Project One: TripAdvisor Recommendation Challenge - Beating BM25**
_**Authors:** Alberto MARTINELLI, Alessia SARRITZU_

The goal of the project is to develop a recommendation system that relies solely on user reviews to suggest similar places based on given queries. The system will propose the most relevant location based on the text of the reviews.

1. **Data pre-processing**:
  - Use only the reviews whose ratings contain exactly these aspects: **service**, **cleanliness**, **overall**, **value**, **location**, **sleep quality**, and **rooms**.
  - Group reviews by `offering_id`: concatenate their texts and average their ratings per aspect for evaluation.

2. **BM25 Implementation**:
    - Implement a BM25 baseline using the **Rank-BM25** library.
    - Measure the performance of BM25 with the Mean Squared Error (MSE) between the ratings of the query place and those of the recommended place.

3. **Enhanced Unsupervised Model**:
    - Create a new unsupervised model to outperform BM25, potentially integrating it with other methods, while ensuring the model does not directly utilize ratings in its learning process.
    - Measure the performance of the **Enhanced Model** with the Mean Squared Error (MSE) between the ratings of the query place and those of the recommended place, with the aim of achieving a lower MSE than the BM25 baseline.
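
The evaluation metric used in steps 2 and 3 can be sketched in plain Python as follows. This is a minimal illustration, assuming each place is represented by a dictionary of average ratings per aspect; the helper name `ratings_mse` and the toy values are hypothetical and are not part of the notebook.

```python
# Aspects compared between the query place and the recommended place
ASPECTS = ["service", "cleanliness", "overall", "value",
           "location", "sleep_quality", "rooms"]


def ratings_mse(query, recommended, aspects=ASPECTS):
    """Mean of the squared per-aspect rating differences (hypothetical helper)."""
    squared_errors = [(query[a] - recommended[a]) ** 2 for a in aspects]
    return sum(squared_errors) / len(aspects)


# Toy places: they differ by one point on two of the seven aspects
query = {a: 4.0 for a in ASPECTS}
recommended = dict(query, service=3.0, location=5.0)

mse = ratings_mse(query, recommended)
print(mse)  # 2/7, roughly 0.286
```

A lower value means the recommended place has ratings closer to those of the query place; identical rating profiles give an MSE of 0.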

### 1)

In [42]:
import pandas as pd

pd.set_option('display.max_colwidth', 200)
df = pd.read_csv('reviews.csv')
df.sample(1)

Unnamed: 0,ratings,title,text,author,date_stayed,offering_id,num_helpful_votes,date,id,via_mobile
752601,"{'service': 2.0, 'cleanliness': 4.0, 'overall': 2.0, 'value': 1.0, 'rooms': 4.0, 'sleep_quality': 4.0}",“A night in Boston”,I was returning from a trip to Iceland and had to stay overnight since there are no late night flights to Kansas City. The shuttle driver could not find me at the airport. He had my cell phone num...,"{'username': 'Stephen Y', 'num_reviews': 1, 'id': 'DAA0DE180891BCFA2A691B996F00585A', 'location': 'BLUE SPRINGS', 'num_helpful_votes': 1}",September 2012,217148,1,2012-09-29,141587560,False


In `reviews.csv`, keep only the reviews whose ratings contain exactly these aspects:
“service”, “cleanliness”, “overall”, “value”, “location”, “sleep quality”, “rooms”
(no more and no fewer, so that places can be compared accurately).
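
To make the "no more and no fewer" rule concrete, here is a small sketch in plain Python, assuming the ratings are stored as dictionary-like strings as in the sample row above. The helper name `has_exact_aspects` and the three example strings are made up for illustration.

```python
import ast

REQUIRED = {"service", "cleanliness", "overall", "value",
            "location", "sleep_quality", "rooms"}


def has_exact_aspects(raw_ratings):
    """True only if the parsed ratings have exactly the required keys (hypothetical helper)."""
    return set(ast.literal_eval(raw_ratings)) == REQUIRED


complete = ("{'service': 4.0, 'cleanliness': 3.0, 'overall': 4.0, 'value': 4.0, "
            "'location': 4.0, 'sleep_quality': 4.0, 'rooms': 4.0}")
# Missing 'location', like the first sampled review
missing = ("{'service': 2.0, 'cleanliness': 4.0, 'overall': 2.0, 'value': 1.0, "
           "'rooms': 4.0, 'sleep_quality': 4.0}")
# One aspect too many ('extra_aspect' is an invented key)
extra = complete[:-1] + ", 'extra_aspect': 5.0}"

print(has_exact_aspects(complete), has_exact_aspects(missing), has_exact_aspects(extra))
```

Comparing key sets for equality rejects both the review with a missing aspect and the one with an additional aspect, whereas a subset test would let one of them through.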

In [43]:
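import ast

# Exact set of rating aspects a review must have to be kept
required_aspects = {"service", "cleanliness", "overall", "value", "location", "sleep_quality", "rooms"}

# Parse each ratings string into a dict and keep only rows whose keys match exactly
filtered_df = df[df['ratings'].apply(lambda x: set(ast.literal_eval(x).keys()) == required_aspects)]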

filtered_df.sample(1)

Unnamed: 0,ratings,title,text,author,date_stayed,offering_id,num_helpful_votes,date,id,via_mobile
638844,"{'service': 4.0, 'cleanliness': 3.0, 'overall': 4.0, 'value': 4.0, 'location': 4.0, 'sleep_quality': 4.0, 'rooms': 4.0}",“nice stay for business”,I upgraded to the concierge floor and that was a great thing. It was like staying in a small hotel. The room was spacious but the stain on the comforter was gross. The staff were excellent. My onl...,"{'username': 'LItravelers_8', 'num_reviews': 1, 'id': 'D4B6D083267A5C7258FC066AFF6190C2', 'location': 'Great Neck, New York'}",March 2011,676408,0,2011-03-26,101620938,False


Concatenate the reviews of each place, grouping on the `offering_id` attribute. The rating of a
place is the average, per aspect, of the ratings across all of its reviews.
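
As a quick illustration of this aggregation, the sketch below applies it to a toy table with a single rating aspect. The frame `toy` and its values are invented for the example; only the grouping pattern mirrors what the notebook does on the real data.

```python
import pandas as pd

# Toy data: place 1 has two reviews, place 2 has one
toy = pd.DataFrame({
    "offering_id": [1, 1, 2],
    "service": [4.0, 2.0, 5.0],
    "text": ["Great staff.", "Slow check-in.", "Spotless room."],
})

# One row per place: mean rating and all review texts joined by a space
per_place = toy.groupby("offering_id").agg(
    service=("service", "mean"),
    text=("text", lambda x: " ".join(x)),
).reset_index()

print(per_place)
```

Place 1 ends up with the mean of its two service ratings and a single text containing both reviews, which is the form a text-retrieval model needs as a document.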

In [44]:
import ast

# Work on an explicit copy so the assignment below does not trigger SettingWithCopyWarning
filtered_df = filtered_df.copy()
# Convert the ratings field from a string to a dictionary
filtered_df['ratings'] = filtered_df['ratings'].apply(ast.literal_eval)
filtered_df = filtered_df.reset_index(drop=True)  # Reset the index to avoid misalignment in the concat below
ratings_expanded_df = pd.json_normalize(filtered_df['ratings'])  # Expand each dictionary into one column per aspect
selected_columns_df = filtered_df[['text', 'offering_id']]
combined_df = pd.concat([ratings_expanded_df, selected_columns_df], axis=1)  # Join the ratings with text and offering_id

combined_df.sample(1)


Unnamed: 0,service,cleanliness,overall,value,location,sleep_quality,rooms,text,offering_id
288173,5.0,5.0,4.0,5.0,5.0,5.0,4.0,"In my opinion, many of the reviews for this hotel came from spoiled brats! If you want 5 star, stay elsewhere. If you want clean rooms, comfy beds, free parking, EXCELLENT location where you can w...",81466


In [45]:
df = combined_df
# Average each rating aspect per place and concatenate the texts of its reviews
df = df.groupby('offering_id').agg(
    service=('service', 'mean'),
    cleanliness=('cleanliness', 'mean'),
    overall=('overall', 'mean'),
    value=('value', 'mean'),
    location=('location', 'mean'),
    sleep_quality=('sleep_quality', 'mean'),
    rooms=('rooms', 'mean'),
    text=('text', lambda x: ' '.join(x)), # Concatenate all text entries
).reset_index()

df.sample(1)

Unnamed: 0,offering_id,service,cleanliness,overall,value,location,sleep_quality,rooms,text
2299,223917,4.0,4.12,3.76,4.16,4.24,3.92,3.76,"I stayed here during the week of the Democratic Convention, when even the sleaziest motels were charging an arm and a leg. I stumbled upon a website deal for this hotel that was way too good to be..."


### 2)
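
For reference, the kind of score a BM25 ranker assigns can be sketched in plain Python. This is a simplified illustration, not the code of the Rank-BM25 library: it assumes whitespace tokenization, the parameters `k1=1.5` and `b=0.75`, and a non-negative IDF variant, so its values will not match the library's output exactly. The function name `bm25_scores` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query (simplified sketch)."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        # Longer-than-average documents are penalized through this term
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            n_t = doc_freq[term]
            # Assumed IDF variant: the "+ 1" keeps it non-negative
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


corpus = [
    "clean room friendly staff".split(),
    "noisy room near the airport".split(),
    "great breakfast and friendly staff".split(),
]
scores = bm25_scores("friendly staff breakfast".split(), corpus)
best = max(range(len(scores)), key=lambda i: scores[i])
print(best, scores)
```

Rare query terms contribute more than common ones, and a document sharing no term with the query scores zero. In the cells below the query is the full concatenated review text of one place, so the score reflects vocabulary overlap between two places' reviews.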

In [46]:
# Select a random query place from the dataset
query_place = df.sample(1).iloc[0]
# And print its details
query_id = query_place['offering_id']
query_ratings = query_place[['service', 'cleanliness', 'overall', 'value', 'location', 'sleep_quality', 'rooms']]
query_text = query_place['text']
print("Query Place Details:")
print(f"Offering ID: {query_id}")
print("Ratings:")
for aspect, rating in query_ratings.items():
    print(f"  {aspect}: {rating}")
print("\nReviews texts concatenated:")
print(query_text)

# Exclude the query place from the documents to avoid recommending it
documents_df = df[df['offering_id'] != query_id].reset_index(drop=True)


Query Place Details:
Offering ID: 108979
Ratings:
  service: 4.25
  cleanliness: 4.55
  overall: 4.15
  value: 4.15
  location: 3.65
  sleep_quality: 4.4
  rooms: 4.1

Reviews texts concatenated:
We had a large room, reliable WiFi, good shower, and great breakfasts. In-room HVAC was noisy. Free shuttle service to/from airport provided by a third party. Must have a vehicle to get to anything. I don't have anything else to say but must keep writing to get to the 200 character minimum. We stayed here twice during a month's touring of America. Our room was large, and the bed was by far the most comfortable out of all the hotels we stayed in. It was wonderful having a fridge in the room so we could occasionally do our waistlines a favour and have something light from Wholefoods. Housekeeping did a great job keeping everything clean.
The breakfasts were really good with lots of choice, the pastries being my particular favourite.We were even provided with a breakfast bag when we had to leave 

In [None]:
from rank_bm25 import BM25Okapi
from sklearn.metrics import mean_squared_error

# Tokenize each concatenated review text (one document per place) by splitting on whitespace
documents = documents_df['text'].apply(lambda x: x.split()).tolist()
bm25 = BM25Okapi(documents)

# Use the concatenated reviews of the query place as the query
tokenized_query = query_text.split()

# Get BM25 scores for the query across all documents
scores = bm25.get_scores(tokenized_query)

# Retrieve the top matching place based on BM25 score
top_n = 1  # Number of top matches to retrieve
top_match_index = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_n][0]  # Index of the highest-scoring document
top_match = documents_df.iloc[top_match_index]

# Calculate the MSE between the ratings of the query place and the recommended place
recommended_ratings = top_match[['service', 'cleanliness', 'overall', 'value', 'location', 'sleep_quality', 'rooms']]
# Cast to float: rows taken from a mixed-type DataFrame have object dtype
mse = mean_squared_error(query_ratings.astype(float), recommended_ratings.astype(float))

# Print the results
print(f"Query Offering ID: {query_id}")
print(f"Recommended Offering ID: {top_match['offering_id']}")
print(f"BM25 Score: {scores[top_match_index]}")
print(f"MSE: {mse}")

Query Offering ID: 108979
Recommended Offering ID: 93339
BM25 Score: 48.350943036568594
MSE: 0.26942810333148126
The best match for the given query is:
offering_id                                                                                                                                                                                                        93339
service                                                                                                                                                                                                         3.748175
cleanliness                                                                                                                                                                                                     4.109489
overall                                                                                                                                                                                              