### Load Data

In [24]:
import pandas as pd
df = pd.read_csv('Hotel_Review.csv')
df.head()

Unnamed: 0,Hotel_Address,Additional_Number_of_Scoring,Review_Date,Average_Score,Hotel_Name,Reviewer_Nationality,Negative_Review,Review_Total_Negative_Word_Counts,Total_Number_of_Reviews,Positive_Review,Review_Total_Positive_Word_Counts,Total_Number_of_Reviews_Reviewer_Has_Given,Reviewer_Score,Tags,days_since_review,lat,lng
0,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,8/3/2017,7.7,Hotel Arena,Russia,I am so angry that i made this post available...,397,1403,Only the park outside of the hotel was beauti...,11,7,2.9,"[' Leisure trip ', ' Couple ', ' Duplex Double...",0 days,52.360576,4.915968
1,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,8/3/2017,7.7,Hotel Arena,Ireland,No Negative,0,1403,No real complaints the hotel was great great ...,105,7,7.5,"[' Leisure trip ', ' Couple ', ' Duplex Double...",0 days,52.360576,4.915968
2,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,7/31/2017,7.7,Hotel Arena,Australia,Rooms are nice but for elderly a bit difficul...,42,1403,Location was good and staff were ok It is cut...,21,9,7.1,"[' Leisure trip ', ' Family with young childre...",3 days,52.360576,4.915968
3,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,7/31/2017,7.7,Hotel Arena,United Kingdom,My room was dirty and I was afraid to walk ba...,210,1403,Great location in nice surroundings the bar a...,26,1,3.8,"[' Leisure trip ', ' Solo traveler ', ' Duplex...",3 days,52.360576,4.915968
4,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,7/24/2017,7.7,Hotel Arena,New Zealand,You When I booked with your company on line y...,140,1403,Amazing location and building Romantic setting,8,3,6.7,"[' Leisure trip ', ' Couple ', ' Suite ', ' St...",10 days,52.360576,4.915968


### Drop redundant columns

In [25]:
df = df.drop(['Hotel_Address', 'Additional_Number_of_Scoring', 'Review_Date', 'Average_Score',
              'Reviewer_Nationality', 'Review_Total_Negative_Word_Counts', 'Total_Number_of_Reviews',
              'Review_Total_Positive_Word_Counts', 'Total_Number_of_Reviews_Reviewer_Has_Given',
              'Tags', 'days_since_review', 'lat', 'lng'], axis=1)

In [26]:
# Let's add a new column, review, which combines both reviews
# (joined with a space so the last and first words do not fuse together)
df["review"] = df["Negative_Review"] + " " + df["Positive_Review"]

# Remove the 'No Negative' and 'No Positive' placeholders from the text
df["review"] = df["review"].apply(lambda x: x.replace("No Negative", "").replace("No Positive", ""))

### Remove Duplicates

In [27]:
# Count fully duplicated rows, then drop them
print(df.duplicated().sum())
df = df.drop_duplicates()
print(f'After removing Duplicates: {df.shape}')

1334
After removing Duplicates: (514404, 5)


### Removing Stopwords

To clean the review text, we call our custom 'clean_text' function, which performs several transformations:

* lowercase the text
* tokenize the text (split it into words) and strip punctuation from each word
* remove words that contain digits
* remove stop words such as 'the', 'a', 'this', etc.
* remove any empty tokens left over from the previous steps

In [28]:
import string
from nltk.corpus import stopwords

# Build the stop-word set once; set lookups are much faster than list lookups
STOP_WORDS = set(stopwords.words('english'))

def clean_text(text):
    """
    Perform several cleaning operations on a piece of text
    and return the cleaned text as a single string.
    """
    # lowercase the text
    text = text.lower()
    # tokenize the text and strip punctuation from each token
    text = [word.strip(string.punctuation) for word in text.split(" ")]
    # remove words that contain digits
    text = [word for word in text if not any(c.isdigit() for c in word)]
    # remove stop words
    text = [x for x in text if x not in STOP_WORDS]
    # remove empty tokens
    text = [t for t in text if len(t) > 0]
    text = " ".join(text)
    return text

In [29]:
# Remove stop words and punctuation
df['review'] = df["review"].apply(clean_text)
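To see what these transformations do to a single review, here is a minimal, self-contained sketch. It assumes a tiny hand-picked stop-word set in place of NLTK's English list, and the function name `clean_text_sketch` and the sample sentence are made up for illustration.

```python
import string

# Assumption: a tiny hand-picked stop-word set standing in for NLTK's English list
STOP_WORDS = {"the", "a", "was", "and", "in", "this"}

def clean_text_sketch(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, drop words with digits, stop words and empty tokens."""
    # lowercase, split on spaces, and strip punctuation from both ends of each token
    tokens = [word.strip(string.punctuation) for word in text.lower().split(" ")]
    # drop tokens that contain a digit
    tokens = [t for t in tokens if not any(c.isdigit() for c in t)]
    # drop stop words and empty tokens
    tokens = [t for t in tokens if t and t not in stop_words]
    return " ".join(tokens)

sample = "The room was clean, and the breakfast in Room 12B was great!"
print(clean_text_sketch(sample))  # room clean breakfast room great
```

Note that `strip` only removes punctuation at the start and end of a token, so an apostrophe inside a word such as "didn't" is kept.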

# Recommender System based on Hotel Reviews

In this section, let's build a system that recommends hotels that are similar to a particular hotel. To achieve this, we will compute pairwise cosine similarity scores for all hotels based on their reviews and recommend the hotels with the highest similarity scores.

**Code Referenced:** The code implemented in this section is adapted from DataCamp's tutorial on recommender systems. The link to the tutorial is https://www.datacamp.com/community/tutorials/recommender-systems-python

In [30]:
# Group the rows by hotel name and join each hotel's cleaned reviews into a single document.
cleaned_review = df.groupby('Hotel_Name').agg({'review': ', '.join}).reset_index()

In [31]:
cleaned_review[['review']].head()

Unnamed: 0,review
0,thought prise drinks bar little excessive part...
1,air conditioning room work despite complaining...
2,breakfast included buffet really expensive coo...
3,thing like central proximity close services re...
4,kinds fruit juice make mini bar better everyth...


We will compute Term Frequency-Inverse Document Frequency (TF-IDF) vectors for each document, where a document is the combined review text of one hotel. This gives a matrix in which each row represents a hotel and each column represents a word in the review vocabulary (all the words that appear in at least one document).

In essence, the TF-IDF score is the frequency of a word in a document, down-weighted by the number of documents in which that word occurs. This reduces the importance of words that occur in the reviews of many hotels and, therefore, their influence on the final similarity score.
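The weighting can be sketched by hand on a toy corpus. The snippet below is only an illustration, not the notebook's method: it assumes whitespace tokenization, the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, and L2 normalization of each row. The function name `tfidf_sketch` and the three toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_sketch(docs):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # document frequency: in how many documents does each word occur?
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # smoothed inverse document frequency (an assumption of this sketch)
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[tok] * idf[tok] for tok in vocab]
        # scale the row to unit length; guard against an all-zero row
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows

docs = [
    "staff friendly room clean",
    "staff rude room noisy",
    "staff friendly pool great",
]
vocab, rows = tfidf_sketch(docs)
for word, weight in zip(vocab, rows[0]):
    print(f"{word:10s} {weight:.3f}")
```

In the first document, "staff" occurs in all three documents and so receives the lowest non-zero weight, while "clean" occurs only there and receives the highest.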

Fortunately, scikit-learn provides a built-in TfidfVectorizer class that produces the TF-IDF matrix in a couple of lines.

* Import TfidfVectorizer from scikit-learn;
* Remove stop words like 'the', 'an', etc. since they do not give any useful information about the topic;
* Replace not-a-number values with a blank string;
* Finally, construct the TF-IDF matrix on the data.

In [32]:
# Find the similarity between the hotels' reviews using TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

# Define a TF-IDF vectorizer object. Remove all English stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

# Replace NaN with an empty string
cleaned_review['review'] = cleaned_review['review'].fillna('')

# Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(cleaned_review['review'])

# Output the shape of tfidf_matrix (number of hotels, vocabulary size)
tfidf_matrix.shape

(1492, 76344)

We will use cosine similarity to calculate a numeric score that denotes how similar two hotels are. Because TfidfVectorizer L2-normalizes each row by default, every TF-IDF vector has unit length, so the dot product of two vectors is already their cosine similarity. We can therefore use sklearn's linear_kernel() instead of cosine_similarity(), since it skips the redundant normalization and is faster.

In [33]:
# Use linear_kernel to compute the pairwise similarity between all hotels
from sklearn.metrics.pairwise import linear_kernel
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
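The claim that a plain dot product equals the cosine similarity once vectors have unit length can be checked with a short, library-free sketch. The helper names and the two example vectors below are made up for illustration.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def l2_normalize(u):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(u, u))
    return [a / norm for a in u]

u, v = [3.0, 0.0, 4.0], [1.0, 2.0, 2.0]
u_hat, v_hat = l2_normalize(u), l2_normalize(v)

# Both expressions evaluate to 11/15 (up to floating-point rounding)
print(cosine(u, v), dot(u_hat, v_hat))
```

Since the normalization has already been paid for when the TF-IDF matrix was built, the pairwise step only needs the dot products.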

Let's build a reverse mapping from hotel names to DataFrame indices. In other words, we need a way to look up the index of a hotel in the cleaned_review DataFrame, given its name.

In [34]:
#Construct a reverse map of indices and hotel names
indices = pd.Series(cleaned_review.index, index=cleaned_review['Hotel_Name'])
indices.head()

Hotel_Name
11 Cadogan Gardens                    0
1K Hotel                              1
25hours Hotel beim MuseumsQuartier    2
41                                    3
45 Park Lane Dorchester Collection    4
dtype: int64

Let's build a recommendation function that performs the following steps:

* Get the index of the hotel given its name.
* Get the list of cosine similarity scores for that particular hotel with all hotels. Convert it into a list of tuples where the first element is the hotel's position and the second is the similarity score.
* Sort this list of tuples by similarity score, that is, by the second element, in descending order.
* Take the top 10 elements of the sorted list, skipping the first one because it refers to the hotel itself.
* Return the hotel names corresponding to the indices of the top elements.

In [71]:
# Function that prompts for a hotel name and outputs the most similar hotels
def get_recommendations(cosine_sim=cosine_sim):
    """
    Prompt the user for a hotel name and return the 10 hotels
    whose reviews are most similar to it.
    """
    # Prompt the user for a hotel name
    title = input('Please enter a hotel name: ')

    # Get the index of the hotel that matches the name;
    # an unknown name raises KeyError at the lookup, not at input()
    try:
        idx = indices[title]
    except KeyError:
        print(f'Please enter a valid hotel name: {title!r} was not found')
        return None

    # Get the pairwise similarity scores of all hotels with that hotel
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the hotels by similarity score, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the 10 most similar hotels, skipping the hotel itself
    sim_scores = sim_scores[1:11]

    # Get the hotel indices
    hotel_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar hotels
    return cleaned_review[['Hotel_Name']].iloc[hotel_indices]
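For experimenting without the interactive prompt, the same ranking logic can be sketched on a small hand-written similarity matrix. Everything in this snippet is hypothetical: the function name `top_similar`, the four hotel names, and the scores are illustrative and are not taken from the dataset.

```python
def top_similar(name, names, sim_matrix, k=10):
    """Return the k names most similar to `name`, excluding `name` itself."""
    if name not in names:
        raise KeyError(f"Unknown hotel name: {name!r}")
    idx = names.index(name)
    # Pair every other hotel's position with its similarity to the query hotel
    scored = [(i, score) for i, score in enumerate(sim_matrix[idx]) if i != idx]
    # Highest similarity first
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [names[i] for i, _ in scored[:k]]

# Toy data: a symmetric similarity matrix with 1.0 on the diagonal
names = ["Hotel A", "Hotel B", "Hotel C", "Hotel D"]
sim_matrix = [
    [1.0, 0.2, 0.7, 0.5],
    [0.2, 1.0, 0.1, 0.4],
    [0.7, 0.1, 1.0, 0.3],
    [0.5, 0.4, 0.3, 1.0],
]
print(top_similar("Hotel A", names, sim_matrix, k=2))  # ['Hotel C', 'Hotel D']
```

This variant excludes the query hotel by its index instead of dropping the first sorted element, so it still behaves correctly if another hotel ties with a similarity of 1.0.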

### Executing the Recommender System 

In [72]:
get_recommendations()

Please enter a hotel name: Hotel Arena


Unnamed: 0,Hotel_Name
1145,Park Plaza Vondelpark Amsterdam
182,Boutique Hotel Notting Hill
126,Best Western Blue Tower Hotel
173,Bilderberg Garden Hotel
789,Hotel Vondel Amsterdam
495,Hampshire Hotel Amsterdam American
362,Grand Hotel Amr th Amsterdam
39,Amadi Park Hotel
819,INK Hotel Amsterdam MGallery by Sofitel
1140,Park Plaza London Riverbank


# Conclusion

In this notebook, we walked through the steps to build a **content-based recommender system** that recommends the top 10 hotels most similar to a given hotel, based on the similarity of their reviews. The system relies on item content, here the combined text of each hotel's reviews, to make these recommendations.
