### About Dataset
 - Acknowledgements <br>
The data was scraped from Booking.com. All data in the file is already publicly available. Please note that the data is originally owned by Booking.com.

 - Data Context<br>
This dataset contains 515,000 customer reviews and scores for 1493 luxury hotels across Europe. The geographical location of each hotel is also provided for further analysis.

 - Data Content<br>
The csv file contains 17 fields. The description of each field is as below:

- Hotel_Address: Address of the hotel.
- Review_Date: Date when the reviewer posted the corresponding review.
- Average_Score: Average score of the hotel, calculated based on the latest comment in the last year.
- Hotel_Name: Name of the hotel.
- Reviewer_Nationality: Nationality of the reviewer.
- Negative_Review: Negative review the reviewer gave to the hotel. If the reviewer did not give a negative review, the value is 'No Negative'.
- Review_Total_Negative_Word_Counts: Total number of words in the negative review.
- Positive_Review: Positive review the reviewer gave to the hotel. If the reviewer did not give a positive review, the value is 'No Positive'.
- Review_Total_Positive_Word_Counts: Total number of words in the positive review.
- Reviewer_Score: Score the reviewer has given to the hotel, based on his/her experience.
- Total_Number_of_Reviews_Reviewer_Has_Given: Number of reviews the reviewer has given in the past.
- Total_Number_of_Reviews: Total number of valid reviews the hotel has.
- Tags: Tags the reviewer gave the hotel.
- days_since_review: Duration between the review date and the scrape date.
- Additional_Number_of_Scoring: Some guests only submitted a score without writing a review. This number indicates how many valid scores without a review the hotel has.
- lat: Latitude of the hotel.
- lng: Longitude of the hotel.

To keep the text data clean, I removed Unicode characters and punctuation from the text and transformed it to lower case. No other preprocessing was performed.
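
The field descriptions above mention two conventions that usually need handling before analysis: the `'No Negative'` / `'No Positive'` placeholders and the textual `days_since_review` values. The following is a minimal sketch on made-up rows, assuming the duration strings look like `"10 days"`; the `sample` frame and the `days_since_review_int` column are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Toy rows that only mimic the documented conventions; the values are made up.
sample = pd.DataFrame({
    "Negative_Review": ["No Negative", "room was small"],
    "Positive_Review": ["great location", "No Positive"],
    "days_since_review": ["0 days", "10 days"],
})

# Treat the placeholder strings as "no text was written".
sample["Negative_Review"] = sample["Negative_Review"].replace("No Negative", "")
sample["Positive_Review"] = sample["Positive_Review"].replace("No Positive", "")

# Pull the leading integer out of strings such as "10 days"
# (assumed format; hypothetical output column name).
sample["days_since_review_int"] = (
    sample["days_since_review"].str.extract(r"(\d+)", expand=False).astype(int)
)

print(sample)
```

Whether an empty string or a missing value is the better replacement depends on the downstream model; the empty string is used here only to keep the sketch short.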

 - Inspiration<br>
The dataset is large and informative, and I believe you can have a lot of fun with it! Here are some ideas to further inspire Kagglers:

- Fit a regression model on reviews and scores to see which words are more indicative of a higher/lower score
- Perform a sentiment analysis on the reviews
- Find correlation between reviewer's nationality and scores.
- Beautiful and informative visualization on the dataset.
- Clustering hotels based on reviews
- Simple recommendation engine to the guest who is fond of a special characteristic of hotel.
- The ideas are unlimited! Please have a look at the data, generate some ideas and leave a master kernel here! I am ready to upvote your ideas and kernels! Cheers!
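
As a starting point for the nationality idea in the list above, one possible first step is simply to compare the mean `Reviewer_Score` per `Reviewer_Nationality`. This is a minimal sketch on invented rows, not a result from the real dataset; the `reviews` and `by_nationality` names are hypothetical.

```python
import pandas as pd

# Made-up rows for illustration; only the column names follow the field list.
reviews = pd.DataFrame({
    "Reviewer_Nationality": ["United Kingdom", "United Kingdom", "Ireland", "Ireland", "Russia"],
    "Reviewer_Score": [8.0, 6.0, 7.5, 9.5, 2.9],
})

# Mean score and number of reviews per nationality, best-scoring first.
by_nationality = (
    reviews.groupby("Reviewer_Nationality")["Reviewer_Score"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)

print(by_nationality)
```

Keeping the `count` column next to the mean helps to avoid over-reading averages that rest on only a handful of reviews.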

In [61]:
import pandas as pd
import numpy as np
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('stopwords')  # required by stopwords.words('english') in the recommender

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from ast import literal_eval

import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Onkar\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Onkar\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Onkar\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [62]:
df = pd.read_csv(r"D:\Python_Workspace\Datasets\515_Hotel_Reviews_Europe\Hotel_Reviews.csv")

In [63]:
df.head()

Unnamed: 0,Id_Hotel_Rating,Hotel_Address,Province_Name,Country_Name,Additional_Number_of_Scoring,Review_Date,Average_Score,Hotel_Name,Reviewer_Nationality,Negative_Review,Review_Total_Negative_Word_Counts,Total_Number_of_Reviews,Positive_Review,Review_Total_Positive_Word_Counts,Total_Number_of_Reviews_Reviewer_Has_Given,Reviewer_Score,Tags,days_since_review,lat,lng
0,1,s Gravesandestraat 55 Oost 1092 AA,Amsterdam,Netherlands,194,8/3/2017,7.7,Hotel Arena,Russia,I am so angry that i made this post available...,397,1403,Only the park outside of the hotel was beauti...,11,7,2.9,"[' Leisure trip ', ' Couple ', ' Duplex Double...",0 days,52.360576,4.915968
1,2,s Gravesandestraat 55 Oost 1092 AA,Amsterdam,Netherlands,194,8/3/2017,7.7,Hotel Arena,Ireland,No Negative,0,1403,No real complaints the hotel was great great ...,105,7,7.5,"[' Leisure trip ', ' Couple ', ' Duplex Double...",0 days,52.360576,4.915968
2,3,s Gravesandestraat 55 Oost 1092 AA,Amsterdam,Netherlands,194,7/31/2017,7.7,Hotel Arena,Australia,Rooms are nice but for elderly a bit difficul...,42,1403,Location was good and staff were ok It is cut...,21,9,7.1,"[' Leisure trip ', ' Family with young childre...",3 days,52.360576,4.915968
3,4,s Gravesandestraat 55 Oost 1092 AA,Amsterdam,Netherlands,194,7/31/2017,7.7,Hotel Arena,United Kingdom,My room was dirty and I was afraid to walk ba...,210,1403,Great location in nice surroundings the bar a...,26,1,3.8,"[' Leisure trip ', ' Solo traveler ', ' Duplex...",3 days,52.360576,4.915968
4,5,s Gravesandestraat 55 Oost 1092 AA,Amsterdam,Netherlands,194,7/24/2017,7.7,Hotel Arena,New Zealand,You When I booked with your company on line y...,140,1403,Amazing location and building Romantic setting,8,3,6.7,"[' Leisure trip ', ' Couple ', ' Suite ', ' St...",10 days,52.360576,4.915968


In [64]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 515738 entries, 0 to 515737
Data columns (total 20 columns):
 #   Column                                      Non-Null Count   Dtype  
---  ------                                      --------------   -----  
 0   Id_Hotel_Rating                             515738 non-null  int64  
 1   Hotel_Address                               515737 non-null  object 
 2   Province_Name                               515738 non-null  object 
 3   Country_Name                                515738 non-null  object 
 4   Additional_Number_of_Scoring                515738 non-null  int64  
 5   Review_Date                                 515738 non-null  object 
 6   Average_Score                               515738 non-null  float64
 7   Hotel_Name                                  515738 non-null  object 
 8   Reviewer_Nationality                        515738 non-null  object 
 9   Negative_Review                             515738 non-null  object 
 

In [65]:
df['Province_Name'].unique()

array(['Amsterdam ', 'London', 'Paris ', 'Barcelona ', 'Milan ',
       'Vienna '], dtype=object)

In [66]:
df['Country_Name'].unique()

array(['Netherlands', 'United Kingdom', ' France', 'Spain', 'Italy',
       ' Austria'], dtype=object)

In [67]:
# Assign the result back instead of calling replace(..., inplace=True) on a column selection
df['Country_Name'] = df['Country_Name'].replace({' France': 'France', ' Austria': 'Austria'})
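
The `unique()` outputs above show stray whitespace in both columns (`'Paris '`, `' France'`). A more general cleanup, sketched here under the assumption that surrounding whitespace is never meaningful in these fields, is to strip every value instead of replacing individual strings. The small `places` frame is a hypothetical stand-in built from the values printed above.

```python
import pandas as pd

# Values copied from the unique() outputs shown above.
places = pd.DataFrame({
    "Province_Name": ["Amsterdam ", "London", "Paris ", "Barcelona ", "Milan ", "Vienna "],
    "Country_Name": ["Netherlands", "United Kingdom", " France", "Spain", "Italy", " Austria"],
})

# Remove leading and trailing whitespace from both text columns.
for col in ["Province_Name", "Country_Name"]:
    places[col] = places[col].str.strip()

print(places)
```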

In [68]:
df['Country_Name'].unique()

array(['Netherlands', 'United Kingdom', 'France', 'Spain', 'Italy',
       'Austria'], dtype=object)

In [69]:
df['Country_Name'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 515738 entries, 0 to 515737
Series name: Country_Name
Non-Null Count   Dtype 
--------------   ----- 
515738 non-null  object
dtypes: object(1)
memory usage: 3.9+ MB


In [70]:
df['Tags'][0]

"[' Leisure trip ', ' Couple ', ' Duplex Double Room ', ' Stayed 6 nights ']"

In [71]:
df.columns

Index(['Id_Hotel_Rating', 'Hotel_Address', 'Province_Name', 'Country_Name',
       'Additional_Number_of_Scoring', 'Review_Date', 'Average_Score',
       'Hotel_Name', 'Reviewer_Nationality', 'Negative_Review',
       'Review_Total_Negative_Word_Counts', 'Total_Number_of_Reviews',
       'Positive_Review', 'Review_Total_Positive_Word_Counts',
       'Total_Number_of_Reviews_Reviewer_Has_Given', 'Reviewer_Score', 'Tags',
       'days_since_review', 'lat', 'lng'],
      dtype='object')

In [72]:
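# .copy() makes new_df independent of df, so later column assignments do not raise SettingWithCopyWarning
new_df = df[['Hotel_Address','Average_Score','Hotel_Name','Tags','Country_Name']].copy()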

In [73]:
new_df.head()

Unnamed: 0,Hotel_Address,Average_Score,Hotel_Name,Tags,Country_Name
0,s Gravesandestraat 55 Oost 1092 AA,7.7,Hotel Arena,"[' Leisure trip ', ' Couple ', ' Duplex Double...",Netherlands
1,s Gravesandestraat 55 Oost 1092 AA,7.7,Hotel Arena,"[' Leisure trip ', ' Couple ', ' Duplex Double...",Netherlands
2,s Gravesandestraat 55 Oost 1092 AA,7.7,Hotel Arena,"[' Leisure trip ', ' Family with young childre...",Netherlands
3,s Gravesandestraat 55 Oost 1092 AA,7.7,Hotel Arena,"[' Leisure trip ', ' Solo traveler ', ' Duplex...",Netherlands
4,s Gravesandestraat 55 Oost 1092 AA,7.7,Hotel Arena,"[' Leisure trip ', ' Couple ', ' Suite ', ' St...",Netherlands


In [74]:
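#Converting Tags from a stringified list to a normal string format

def convert(column):
    # Tags arrive as strings such as "[' Leisure trip ', ' Couple ']";
    # literal_eval parses the string into a list, which is then joined.
    if not isinstance(column, list):
        return "".join(literal_eval(column))
    else:
        return column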
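
To make the effect of `convert` concrete, the sketch below applies the same `literal_eval` and `"".join` steps to the `Tags` string printed earlier. The `normalized` variant is not part of the notebook; it is a hypothetical alternative that strips each tag first so the result has single spaces.

```python
from ast import literal_eval

# The raw Tags value shown for row 0 earlier in the notebook.
raw = "[' Leisure trip ', ' Couple ', ' Duplex Double Room ', ' Stayed 6 nights ']"

tags = literal_eval(raw)   # parse the stringified list into a real list
joined = "".join(tags)     # what convert() returns: the padding spaces are kept

# Hypothetical alternative: strip each tag, then join with a single space.
normalized = " ".join(tag.strip() for tag in tags)

print(repr(joined))
print(repr(normalized))
```

Because each tag is padded with a space on both sides, the plain join leaves double spaces between tags; tokenizing the text later removes them again, so both forms yield the same tokens.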

In [76]:
#Converting the tags
new_df['Tags'] = new_df['Tags'].apply(convert)

In [77]:
new_df.head()

Unnamed: 0,Hotel_Address,Average_Score,Hotel_Name,Tags,Country_Name
0,s Gravesandestraat 55 Oost 1092 AA,7.7,Hotel Arena,Leisure trip Couple Duplex Double Room Sta...,Netherlands
1,s Gravesandestraat 55 Oost 1092 AA,7.7,Hotel Arena,Leisure trip Couple Duplex Double Room Sta...,Netherlands
2,s Gravesandestraat 55 Oost 1092 AA,7.7,Hotel Arena,Leisure trip Family with young children Dup...,Netherlands
3,s Gravesandestraat 55 Oost 1092 AA,7.7,Hotel Arena,Leisure trip Solo traveler Duplex Double Ro...,Netherlands
4,s Gravesandestraat 55 Oost 1092 AA,7.7,Hotel Arena,Leisure trip Couple Suite Stayed 2 nights ...,Netherlands


In [78]:
#Converting Tags and Country Name to lower case for simplicity
new_df['Tags'] = new_df['Tags'].str.lower()
new_df['Country_Name'] = new_df['Country_Name'].str.lower()
new_df.head()

Unnamed: 0,Hotel_Address,Average_Score,Hotel_Name,Tags,Country_Name
0,s Gravesandestraat 55 Oost 1092 AA,7.7,Hotel Arena,leisure trip couple duplex double room sta...,netherlands
1,s Gravesandestraat 55 Oost 1092 AA,7.7,Hotel Arena,leisure trip couple duplex double room sta...,netherlands
2,s Gravesandestraat 55 Oost 1092 AA,7.7,Hotel Arena,leisure trip family with young children dup...,netherlands
3,s Gravesandestraat 55 Oost 1092 AA,7.7,Hotel Arena,leisure trip solo traveler duplex double ro...,netherlands
4,s Gravesandestraat 55 Oost 1092 AA,7.7,Hotel Arena,leisure trip couple suite stayed 2 nights ...,netherlands


In [79]:
#Defining the Recommender Function
def recommender(location, description):
    #Dividing the given description into small tokens
    description = description.lower()
    description = word_tokenize(description)
    #Stop words (a set makes the membership tests below fast)
    stop_words = set(stopwords.words('english'))
    #Lemmatizer
    lemm = WordNetLemmatizer()
    #Filtering out stop words and lemmatizing the remaining words
    filtered_set = {lemm.lemmatize(word) for word in description if word not in stop_words}

    #Keeping only the rows for the requested country, with a fresh 0..n-1 index
    country = new_df[new_df['Country_Name'] == location.lower()].reset_index(drop=True)
    overlap = []

    for i in range(country.shape[0]):
        temp_token = word_tokenize(country['Tags'][i])
        temp_set = {lemm.lemmatize(word) for word in temp_token if word not in stop_words}
        #The score is the number of words shared with the description
        #(a set-overlap count, not a true cosine similarity)
        overlap.append(len(temp_set.intersection(filtered_set)))

    country['similarity'] = overlap
    #For each hotel, keeping the review whose tags overlap most with the description
    country = country.sort_values(by='similarity', ascending=False)
    country.drop_duplicates(subset='Hotel_Name', keep='first', inplace=True)
    #The final ranking is by Average_Score only
    country.sort_values('Average_Score', ascending=False, inplace=True)
    country.reset_index(inplace=True)

    return country[['Hotel_Name','Average_Score','Hotel_Address','Tags']].head(10)
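
The score computed inside `recommender` is the size of the intersection between two word sets. For comparison, here is a minimal sketch of a cosine similarity over the same kind of sets, which divides that overlap by the geometric mean of the set sizes so that long tag strings are not favoured. The helper names `overlap_count` and `set_cosine` and the example word sets are hypothetical and not part of the notebook.

```python
import math

def overlap_count(a, b):
    """Number of shared words, as used by the recommender above."""
    return len(a & b)

def set_cosine(a, b):
    """Cosine similarity of two word sets viewed as binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

# Invented example word sets for illustration.
query = {"business", "trip", "twin", "room"}
short_tags = {"business", "trip", "twin", "room"}
long_tags = short_tags | {"couple", "stayed", "3", "night",
                          "submitted", "mobile", "device", "standard"}

print(overlap_count(query, short_tags), overlap_count(query, long_tags))
print(set_cosine(query, short_tags), set_cosine(query, long_tags))
```

Both tag sets share the same four words with the query, so the overlap count cannot tell them apart, while the cosine variant scores the exact match higher than the longer tag set.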
        
    

In [80]:
# abc = 'what are you doing I want a luxorious room which should also be spacious'
# print(word_tokenize(abc))
# print(abc)

In [81]:
#Recommendations
recommender('Spain','Business Trip couple pool twin room')

Unnamed: 0,Hotel_Name,Average_Score,Hotel_Address,Tags
0,H10 Casa Mimosa 4 Sup,9.6,Pau Claris 179 Eixample 08037,business trip couple deluxe double room wit...
1,Hotel The Serras,9.6,Passeig de Colom 9 Ciutat Vella 08002,business trip couple superior double or twi...
2,Hotel Casa Camper,9.6,Elisabets 11 Ciutat Vella 08001,business trip couple camper room stayed 5 ...
3,Mercer Hotel Barcelona,9.5,Dels Lledo 7 Ciutat Vella 08003,business trip couple superior double room ...
4,The One Barcelona GL,9.4,277 Carrer de Proven a Eixample 08037,business trip couple double or twin room wi...
5,Catalonia Square 4 Sup,9.4,Ronda Sant Pere 9 Eixample 08010,business trip couple superior double or twi...
6,Catalonia Magdalenes,9.4,Magdalenes 13 15 Ciutat Vella 08002,business trip couple double or twin room s...
7,Hotel Palace GL,9.4,Gran Via de les Corts Catalanes 668 Eixample 0...,business trip couple deluxe double room 1 2...
8,Hotel Margot House,9.4,Paseo de Gracia 46 Eixample 08007,business trip couple superior double or twi...
9,The Wittmore Adults Only,9.4,Riudarenes 7 Ciutat Vella 08002,business trip couple superior king room st...


In [82]:
recommender('Spain','twin room')

Unnamed: 0,Hotel_Name,Average_Score,Hotel_Address,Tags
0,H10 Casa Mimosa 4 Sup,9.6,Pau Claris 179 Eixample 08037,leisure trip couple deluxe double room sta...
1,Hotel The Serras,9.6,Passeig de Colom 9 Ciutat Vella 08002,leisure trip couple superior double or twin...
2,Hotel Casa Camper,9.6,Elisabets 11 Ciutat Vella 08001,leisure trip couple camper room stayed 5 n...
3,Mercer Hotel Barcelona,9.5,Dels Lledo 7 Ciutat Vella 08003,leisure trip couple superior double room s...
4,Catalonia Magdalenes,9.4,Magdalenes 13 15 Ciutat Vella 08002,leisure trip couple double or twin room st...
5,Hotel Margot House,9.4,Paseo de Gracia 46 Eixample 08007,business trip solo traveler double or twin ...
6,Catalonia Square 4 Sup,9.4,Ronda Sant Pere 9 Eixample 08010,leisure trip group double or twin room sta...
7,The One Barcelona GL,9.4,277 Carrer de Proven a Eixample 08037,leisure trip couple double or twin room st...
8,The Wittmore Adults Only,9.4,Riudarenes 7 Ciutat Vella 08002,leisure trip couple deluxe king room staye...
9,Hotel Palace GL,9.4,Gran Via de les Corts Catalanes 668 Eixample 0...,leisure trip couple deluxe double room 1 2 ...
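
Both Spain queries above return the same ten hotels, because the final sort in `recommender` uses `Average_Score` alone and `similarity` only decides which review row represents each hotel. One possible adjustment, sketched here on invented data, is to sort on both columns so that better-matching hotels come first and the score breaks ties. The `candidates` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical hotels and overlap counts, invented for illustration.
candidates = pd.DataFrame({
    "Hotel_Name": ["Hotel A", "Hotel B", "Hotel C", "Hotel D"],
    "Average_Score": [9.6, 9.4, 9.1, 8.8],
    "similarity": [0, 2, 2, 1],
})

# Current behaviour: rank by score only, so Hotel A wins despite matching nothing.
by_score_only = candidates.sort_values("Average_Score", ascending=False)

# Alternative: rank by match quality first, then by score within equal matches.
by_match_then_score = candidates.sort_values(
    ["similarity", "Average_Score"], ascending=[False, False]
)

print(by_score_only["Hotel_Name"].tolist())
print(by_match_then_score["Hotel_Name"].tolist())
```

Whether the description or the rating should dominate is a design choice; dropping rows with `similarity == 0` before ranking is another option that keeps the score-based order.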


In [83]:
recommender('France','A spacious room with double bed travelling for a business trip')

Unnamed: 0,Hotel_Name,Average_Score,Hotel_Address,Tags
0,Ritz Paris,9.8,15 Place Vend me 1st arr 75001,business trip solo traveler executive doubl...
1,H tel de La Tamise Esprit de France,9.6,4 rue d Alger 1st arr 75001,business trip solo traveler standard double...
2,Le Narcisse Blanc Spa,9.5,19 Boulevard De La Tour Maubourg 7th arr 75007,business trip solo traveler superior double...
3,Hotel The Peninsula Paris,9.5,19 avenue Kleber 16th arr 75116,business trip solo traveler deluxe double r...
4,Hotel Monge,9.4,55 rue Monge 5th arr 75005,business trip solo traveler classic double ...
5,Nolinski Paris,9.4,16 Avenue de l Opera 1st arr 75001,business trip couple deluxe double or twin ...
6,La Chambre du Marais,9.4,85 87 RUE DES ARCHIVES 3rd arr 75003,business trip solo traveler double room st...
7,H tel D Aubusson,9.4,33 Rue Dauphine 6th arr 75006,business trip solo traveler superior double...
8,Hotel Eiffel Blomet,9.4,78 Rue Blomet 15th arr 75015,business trip solo traveler deluxe double r...
9,Goralska R sidences H tel Paris Bastille,9.4,7 Boulevard Bourdon 4th arr 75004,business trip couple exclusive suite staye...
