## **Project One: TripAdvisor Recommendation Challenge - Beating BM25**
_**Authors:** Alberto MARTINELLI, Alessia SARRITZU_

The goal of the project is to develop a recommendation system that relies solely on user reviews to suggest similar places based on given queries. The system will propose the most relevant location based on the text of the reviews.

1. **Data pre-processing**:
  - Use only the reviews whose ratings cover exactly the following aspects: **service**, **cleanliness**, **overall**, **value**, **location**, **sleep quality**, and **rooms**.
  - Concatenate the reviews of each place by `offering_id` and average their ratings per aspect for evaluation.

2. **BM25 Implementation**:
    - Implement a BM25 baseline using the **Rank-BM25** library.
    - Measure the performance of BM25 using the Mean Squared Error (MSE) between the ratings of the query place and those of the recommended place.

3. **Enhanced Unsupervised Model**:
    - Create a new unsupervised model to outperform BM25, potentially integrating it with other methods, while ensuring the model does not directly utilize ratings in its learning process.
    - Measure the performance of the **Enhanced Model** using the Mean Squared Error (MSE) between the ratings of the query place and those of the recommended place, with the aim of achieving a lower MSE than the BM25 baseline.
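
The MSE criterion used in steps 2 and 3 can be illustrated with a minimal sketch. It assumes each place is represented by a dictionary of average ratings per aspect, and that the error is averaged first over the seven aspects and then over all query/recommendation pairs; the averaging order and the helper names `place_mse` and `mean_mse` are assumptions for illustration, not taken from the project statement.

```python
# Aspect keys as they appear in the ratings dictionaries of reviews.csv
ASPECTS = ["service", "cleanliness", "overall", "value",
           "location", "sleep_quality", "rooms"]


def place_mse(query_ratings, recommended_ratings, aspects=ASPECTS):
    """Mean squared difference between two places' ratings, over all aspects."""
    squared_errors = [
        (query_ratings[a] - recommended_ratings[a]) ** 2 for a in aspects
    ]
    return sum(squared_errors) / len(aspects)


def mean_mse(pairs, aspects=ASPECTS):
    """Average place_mse over a list of (query, recommended) rating pairs."""
    errors = [place_mse(q, r, aspects) for q, r in pairs]
    return sum(errors) / len(errors)


# Hypothetical example: two places whose ratings differ by 1.0 on every aspect
query_place = {a: 4.0 for a in ASPECTS}
recommended_place = {a: 5.0 for a in ASPECTS}
example_error = place_mse(query_place, recommended_place)
```

Under this reading, a model beats the baseline when its `mean_mse` over the same set of queries is lower than the value obtained with BM25.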

# 1)

In [28]:
import pandas as pd

pd.set_option('display.max_colwidth', None)
df = pd.read_csv('reviews.csv')
df.sample(1)

Unnamed: 0,ratings,title,text,author,date_stayed,offering_id,num_helpful_votes,date,id,via_mobile
626316,"{'service': 4.0, 'cleanliness': 5.0, 'overall': 5.0, 'value': 5.0, 'location': 5.0, 'sleep_quality': 5.0, 'rooms': 5.0}",“Cant beat the location”,I was debating if I should stay at Hyatt at Fishermans Wharf or the Grand Hyatt. I wont regret this place is completely refurbished and rooms was real nice and jazzy. \nThe club is really great. The best is the coffee machine and could get my Cappuccion any time. Right next to the shopping district and it is not more than $15 cab ride to the Fishermans wharf. You could also take $6 ride on the cable car right from the next block.\nI would certainly come back again to this property.,"{'username': 'Trvlr_freq', 'num_cities': 12, 'num_helpful_votes': 6, 'num_reviews': 22, 'num_type_reviews': 14, 'id': '8FCA7E4FB78D333CA290754FC9AB88B6', 'location': 'Bloomington, Illinois'}",November 2012,80999,0,2012-12-16,147578986,False


In the table `reviews.csv`, keep only the reviews whose ratings cover exactly these aspects:
"service", "cleanliness", "overall", "value", "location", "sleep quality" (stored as `sleep_quality`), "rooms"
(no more and no fewer, so that places can be compared accurately).

In [None]:
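import ast

# The seven aspects every retained review must rate (no more, no fewer)
required_aspects = {"service", "cleanliness", "overall", "value", "location", "sleep_quality", "rooms"}

# Parse each ratings string with ast.literal_eval (safer than eval) and keep exact matches only
has_required_aspects = df['ratings'].apply(lambda x: set(ast.literal_eval(x).keys()) == required_aspects)
df = df[has_required_aspects]

df.sample(1)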

Unnamed: 0,ratings,title,text,author,date_stayed,offering_id,num_helpful_votes,date,id,via_mobile
866770,"{'service': 5.0, 'cleanliness': 5.0, 'overall': 5.0, 'value': 5.0, 'location': 5.0, 'sleep_quality': 5.0, 'rooms': 5.0}",“Terrific in DC”,"My son and I stayed for 6 nights, just to do tourist stuff. The staff was excellent. Helpful, efficient, and always welcoming you back with a cheerful ""Welcome Home!"" 3 blocks from the DuPont Circle Metro and 2 blocks from some good restaurants (reasonably priced) .. and the sharing bike station at the Safeway which is a great way to see DC. The hotel is ArtDeco, lovely, clean, comfortable, with nice bathrooms and a kitchenette. We were delighted. The breakfast buffet (about $12) was disappointing and bland, so we cooked breakfast in our room.","{'username': 'jlmcadams', 'num_cities': 2, 'num_helpful_votes': 2, 'num_reviews': 2, 'location': 'Saint Louis, Missouri', 'id': 'ADFCFE6EEAF852BA1FB18CD6E6F9A6A1'}",June 2012,84109,2,2012-06-11,131763197,False


Concatenate the reviews of the same place using the attribute `offering_id`. The rating of a place is simply the average, for each aspect, of the ratings across all of its reviews.

Index(['ratings', 'title', 'text', 'author', 'date_stayed', 'offering_id',
       'num_helpful_votes', 'date', 'id', 'via_mobile'],
      dtype='object')
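
The aggregation step described above could be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper name `aggregate_by_offering` is hypothetical, it relies only on the `ratings`, `text` and `offering_id` columns listed above, and it assumes the `ratings` column still holds dictionary strings as in the raw CSV. The toy data uses two aspects instead of seven to keep the example short.

```python
import ast

import pandas as pd


def aggregate_by_offering(reviews: pd.DataFrame) -> pd.DataFrame:
    """Return one row per offering_id: concatenated text plus mean rating per aspect."""
    # Parse the ratings strings into dicts (values that are already dicts are kept as-is)
    parsed = reviews["ratings"].apply(
        lambda r: ast.literal_eval(r) if isinstance(r, str) else r
    )
    # Expand the dicts into one numeric column per aspect
    aspects = pd.DataFrame(parsed.tolist(), index=reviews.index)
    combined = pd.concat(
        [reviews[["offering_id"]], reviews["text"].astype(str), aspects], axis=1
    )
    grouped = combined.groupby("offering_id")
    texts = grouped["text"].agg(" ".join)                 # concatenate reviews per place
    mean_ratings = grouped[list(aspects.columns)].mean()  # average each aspect per place
    return pd.concat([texts, mean_ratings], axis=1).reset_index()


# Hypothetical toy data with a reduced set of aspects
demo = pd.DataFrame({
    "offering_id": [1, 1, 2],
    "text": ["Great staff.", "Clean rooms.", "Noisy street."],
    "ratings": [
        "{'service': 5.0, 'overall': 4.0}",
        "{'service': 3.0, 'overall': 4.0}",
        "{'service': 2.0, 'overall': 1.0}",
    ],
})
places = aggregate_by_offering(demo)
```

The resulting frame keeps the concatenated text for retrieval and the averaged ratings for evaluation only, which is consistent with the requirement that the model must not use ratings during learning.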