## **Project One: TripAdvisor Recommendation Challenge - Beating BM25**
_**Authors:** Alberto MARTINELLI, Alessia SARRITZU_

The goal of the project is to build a recommendation system that relies solely on the text of user reviews. Given a query place, the system proposes the place whose reviews are most relevant to the query's reviews.

1. **Data pre-processing**:
  - Use only the reviews whose ratings cover exactly these aspects: **service**, **cleanliness**, **overall**, **value**, **location**, **sleep quality**, and **rooms**.
  - Concatenate the reviews of each place by `offering_id` and average its ratings per aspect for evaluation.

2. **BM25 Implementation**:
    - Implement a BM25 baseline using the **Rank-BM25** library.
    - Measure the performance of BM25 with the Mean Squared Error (MSE) between the ratings of the query place and those of the recommended place.

3. **Enhanced Unsupervised Model**:
    - Create a new unsupervised model to outperform BM25, potentially integrating it with other methods, while ensuring the model does not directly utilize ratings in its learning process.
    - Measure the performance of the **Enhanced Model** with the same Mean Squared Error (MSE) between the ratings of the query place and those of the recommended place, aiming for a lower MSE than the BM25 baseline.
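
The evaluation metric used in steps 2 and 3 can be sketched in plain Python. This is a minimal illustration, assuming each place's ratings are held in a dict keyed by aspect; the helper name `ratings_mse` and the toy values are hypothetical and do not come from the dataset.

```python
ASPECTS = ["service", "cleanliness", "overall", "value",
           "location", "sleep_quality", "rooms"]

def ratings_mse(query, recommended, aspects=ASPECTS):
    """Mean of the squared per-aspect differences between two rating dicts."""
    return sum((query[a] - recommended[a]) ** 2 for a in aspects) / len(aspects)

# Toy ratings (illustrative only): the two places differ on two aspects by 1.0 each.
query = dict.fromkeys(ASPECTS, 4.0)
recommended = dict(query, value=3.0, location=5.0)

print(ratings_mse(query, recommended))  # two squared differences of 1.0, averaged over 7 aspects
```

A lower value means the recommended place is rated more like the query place across all seven aspects; 0 would mean identical average ratings.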

### 1)

In [52]:
import pandas as pd

pd.set_option('display.max_colwidth', 200)
df = pd.read_csv('reviews.csv')
df.sample(1)

Unnamed: 0,ratings,title,text,author,date_stayed,offering_id,num_helpful_votes,date,id,via_mobile
314499,{'overall': 5.0},“Wonderful stay.”,Stayed at the hotel over the June 15-16 evening. Lovely room. The only issue was nothing the hotel had control over... hearing emergency vehicles screaming down the streets. Fortunately it was pri...,"{'username': 'chipp610', 'num_cities': 22, 'num_helpful_votes': 18, 'num_reviews': 33, 'num_type_reviews': 17, 'id': '3F708723623C71AB3984409DC614D7D2', 'location': 'Warner Robins, Georgia'}",June 2012,1742256,1,2012-06-24,132691496,False


In `reviews.csv`, keep only the reviews whose ratings cover exactly these aspects:
"service", "cleanliness", "overall", "value", "location", "sleep quality" (stored under the key `sleep_quality`), "rooms"
(no more and no fewer, so that places can be compared on the same aspects).
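
As a small illustration of the "no more and no fewer" rule, the sketch below applies an exact set comparison to toy rating strings. The helper `has_exact_aspects` and the key `extra_aspect` are hypothetical names introduced only for this example; `ast.literal_eval` is used here as a safer way to parse the stored dict strings.

```python
import ast

REQUIRED = {"service", "cleanliness", "overall", "value",
            "location", "sleep_quality", "rooms"}

def has_exact_aspects(raw_ratings, required=REQUIRED):
    """True only when the parsed rating keys equal the required set."""
    return set(ast.literal_eval(raw_ratings)) == required

# Toy rating strings in the same dict-as-text form as the `ratings` column.
complete = str(dict.fromkeys(sorted(REQUIRED), 4.0))
too_few = "{'overall': 5.0}"
too_many = str(dict.fromkeys(sorted(REQUIRED | {"extra_aspect"}), 4.0))

print([has_exact_aspects(r) for r in (complete, too_few, too_many)])
```

Set equality rejects both a review that rates only some aspects and one that rates additional aspects, which a subset test alone would not do.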

In [53]:
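import ast

required_aspects = {"service", "cleanliness", "overall", "value", "location", "sleep_quality", "rooms"}

# Parse each ratings string with ast.literal_eval (safer than eval) and keep
# only the rows whose keys match the required set exactly
filtered_df = df[df['ratings'].apply(lambda x: set(ast.literal_eval(x).keys()) == required_aspects)]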

filtered_df.sample(1)

Unnamed: 0,ratings,title,text,author,date_stayed,offering_id,num_helpful_votes,date,id,via_mobile
681644,"{'service': 3.0, 'cleanliness': 3.0, 'overall': 3.0, 'value': 3.0, 'location': 3.0, 'sleep_quality': 3.0, 'rooms': 3.0}",“Decent and dog friendly”,"Room was clean. Breakfast was good... fruit, yogurt, hard boiled eggs, waffles, cereal, bread products, juices and coffee. I brought my very large lab/pitt with no problems. No pool... sigh. Parki...","{'username': 'rebols', 'num_helpful_votes': 4, 'num_reviews': 3, 'num_type_reviews': 3, 'id': '0AFBE4E6EAEAB3F0E6E3604B65EC04BA', 'location': 'Durham, North Carolina'}",August 2012,94133,2,2012-08-28,138646879,False


Concatenate the reviews of the same place, identified by the attribute "offering_id". The rating of a
place is the average, per aspect, of the ratings of all its reviews.

In [54]:
import ast

# Parse each ratings string into a dict. assign() returns a new DataFrame,
# which avoids pandas' SettingWithCopyWarning on the filtered slice.
filtered_df = filtered_df.assign(ratings=filtered_df['ratings'].apply(ast.literal_eval))
filtered_df = filtered_df.reset_index(drop=True)  # Reset the index so the concat below aligns row by row
ratings_expanded_df = pd.json_normalize(filtered_df['ratings'].tolist())  # One column per rating aspect
selected_columns_df = filtered_df[['text', 'offering_id']]
combined_df = pd.concat([ratings_expanded_df, selected_columns_df], axis=1)  # Attach the review text and offering_id

combined_df.sample(1)


Unnamed: 0,service,cleanliness,overall,value,location,sleep_quality,rooms,text,offering_id
137750,5.0,5.0,5.0,5.0,5.0,5.0,5.0,"My colleague and I had the pleasure of staying at The Public a few weeks ago while on a business trip and I have to say, it blew us away! The staff was professional, kind, courteous and...beautifu...",87629


In [55]:
df = combined_df
# For each place: average every rating aspect and concatenate the review texts
df = df.groupby('offering_id').agg(
    service=('service', 'mean'),
    cleanliness=('cleanliness', 'mean'),
    overall=('overall', 'mean'),
    value=('value', 'mean'),
    location=('location', 'mean'),
    sleep_quality=('sleep_quality', 'mean'),
    rooms=('rooms', 'mean'),
    text=('text', lambda x: ' '.join(x)), # Concatenate all text entries
).reset_index()

df.sample(1)

Unnamed: 0,offering_id,service,cleanliness,overall,value,location,sleep_quality,rooms,text
698,89600,4.414239,4.491909,4.265372,4.0,4.68932,4.278317,4.423948,"My friend and I arrived around 1pm and there was no problem with an early check-in. Even though I had booked through Hotwire, the clerk was very welcoming. They even gave us a late checkout of 3pm..."


### 2)
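
As background for this section, the sketch below shows a textbook form of Okapi BM25 scoring in plain Python. It is an illustration under stated assumptions, not the Rank-BM25 implementation: the function name `bm25_scores` is hypothetical, the defaults `k1=1.5` and `b=0.75` are common choices, and the non-negative IDF variant used here may differ in detail from how the library handles IDF.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        term_freq = Counter(doc)
        # Longer-than-average documents are penalised through this term.
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            df_t = doc_freq[term]
            idf = math.log((n_docs - df_t + 0.5) / (df_t + 0.5) + 1)
            tf = term_freq[term]
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores

# Toy corpus (illustrative only).
docs = [
    "clean room friendly staff".split(),
    "noisy street small room".split(),
    "great location friendly staff clean room".split(),
]
scores = bm25_scores("friendly staff".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

In this toy corpus the first and third documents both contain the query terms once, and the shorter one scores higher because of the length normalisation.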

In [56]:
# Select a random query place from the dataset
query_place = df.sample(1).iloc[0]
# And print its details
query_id = query_place['offering_id']
query_ratings = query_place[['service', 'cleanliness', 'overall', 'value', 'location', 'sleep_quality', 'rooms']]
query_text = query_place['text']
print("Query Place Details:")
print(f"Offering ID: {query_id}")
print("Ratings:")
for aspect, rating in query_ratings.items():
    print(f"  {aspect}: {rating}")
print("\nReviews texts concatenated:")
print(query_text)

# Exclude the query place from the documents to avoid recommending it
documents_df = df[df['offering_id'] != query_id].reset_index(drop=True)


Query Place Details:
Offering ID: 604455
Ratings:
  service: 4.3
  cleanliness: 4.383333333333334
  overall: 4.183333333333334
  value: 3.816666666666667
  location: 4.7
  sleep_quality: 4.133333333333334
  rooms: 4.183333333333334

Reviews texts concatenated:
The hotel is on 4th Street, two blocks from the fame 6th Street. Our room faced 7th street and we heard the festivites until 2am. The room is nice and clean. But that noise was a big issue. Also the parking garage is tight and not truck/SUV friendly. If you are coming to Austin to party then this is a good hotel to stay in, but if you are coming to relax, you will not be able to till after 2am. The room was well-appointed and looks new- the carpet, the sheets, curtains and the rest of the furniture and fixture. I was actually surprised as the pictures on their site feature old style sheets, carpet and furniture pieces. I guess they updated their rooms recently to make it a little hip. Also the room was clean and the bathroom toil

In [None]:
import numpy as np
from rank_bm25 import BM25Okapi
from sklearn.metrics import mean_squared_error

rating_columns = ['service', 'cleanliness', 'overall', 'value', 'location', 'sleep_quality', 'rooms']

# Tokenize each candidate place's concatenated reviews by whitespace and build the BM25 index
documents = documents_df['text'].apply(lambda x: x.split()).tolist()
bm25 = BM25Okapi(documents)

# The query is the concatenated review text of the query place, tokenized the same way
tokenized_query = query_text.split()

# Score every candidate place against the query
scores = bm25.get_scores(tokenized_query)

# Retrieve the top-scoring place (the first one in case of ties)
top_match_index = int(np.argmax(scores))
top_match = documents_df.iloc[top_match_index]

# Compute the MSE between the ratings of the query place and the recommended place
# (cast to float because a row taken from a mixed-type DataFrame has object dtype)
recommended_ratings = top_match[rating_columns].astype(float)
mse = mean_squared_error(query_ratings.astype(float), recommended_ratings)

# Print the results
print(f"Query Offering ID: {query_id}")
print(f"Recommended Offering ID: {top_match['offering_id']}")
print(f"BM25 Score: {scores[top_match_index]}")
print(f"MSE: {mse}")
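
The cell above evaluates a single random query, so its MSE depends on which place was drawn. A possible next step, sketched below under stated assumptions, is to average the MSE over every place used as a query in turn. The function `mean_recommendation_mse`, the toy scorer `token_overlap` (a stand-in, not BM25) and the toy data are all hypothetical; in the notebook the scorer would be replaced by BM25 scores.

```python
def token_overlap(query_tokens, doc_tokens):
    """Toy stand-in scorer: number of distinct shared tokens (not BM25)."""
    return len(set(query_tokens) & set(doc_tokens))

def mean_recommendation_mse(places, score_fn=token_overlap):
    """Average MSE between each place and its top-scoring other place.

    `places` is a list of dicts with keys 'text' (str) and 'ratings'
    (list of floats in the same aspect order for every place).
    """
    tokenized = [p["text"].lower().split() for p in places]
    errors = []
    for q, query in enumerate(places):
        # Score every other place; the query itself is excluded.
        candidates = [i for i in range(len(places)) if i != q]
        best = max(candidates, key=lambda i: score_fn(tokenized[q], tokenized[i]))
        diffs = [(a - b) ** 2
                 for a, b in zip(query["ratings"], places[best]["ratings"])]
        errors.append(sum(diffs) / len(diffs))
    return sum(errors) / len(errors)

# Toy data (illustrative only): two aspects per place.
places = [
    {"text": "clean quiet room", "ratings": [4.0, 5.0]},
    {"text": "quiet clean room great staff", "ratings": [4.0, 4.0]},
    {"text": "noisy street dirty room", "ratings": [2.0, 1.0]},
]
print(mean_recommendation_mse(places))
```

Averaging over all queries gives one number per model, which makes the comparison between the BM25 baseline and an enhanced model less sensitive to the choice of query place.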