# Simple Content-based Hotel Recommendation System


This notebook analyzes a dataset containing the name, address, and description of hotels located in Seattle. From these data, we build a content-based recommendation system using TF-IDF over word n-grams (up to trigrams) and cosine similarity.

Import required packages

In [44]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import re
import random
import plotly.graph_objs as go
import plotly.express as px

Download stopword corpus

In [2]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/andres/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Read and load data associated with hotels located in Seattle. The original data can be downloaded from [here](https://github.com/susanli2016/NLP-with-Python/blob/master/data/Seattle_Hotels_dirty.csv). This dataset contains the name, address and description of 153 hotels located in Seattle.

In [3]:
df = pd.read_csv('data/Seattle_Hotels.csv', encoding="latin-1")
df.head()

Unnamed: 0,name,address,desc
0,Hilton Garden Seattle Downtown,"1821 Boren Avenue, Seattle Washington 98101 USA","Located on the southern tip of Lake Union, the..."
1,Sheraton Grand Seattle,"1400 6th Avenue, Seattle, Washington 98101 USA","Located in the city's vibrant core, the Sherat..."
2,Crowne Plaza Seattle Downtown,"1113 6th Ave, Seattle, WA 98101","Located in the heart of downtown Seattle, the ..."
3,Kimpton Hotel Monaco Seattle,"1101 4th Ave, Seattle, WA98101",What?s near our hotel downtown Seattle locatio...
4,The Westin Seattle,"1900 5th Avenue, Seattle, Washington 98101 USA",Situated amid incredible shopping and iconic a...


Function to print a hotel's description and name, given its index value.

In [4]:
def print_description(index):
    example = df[df.index == index][['desc', 'name']].values[0]
    if len(example) > 0:
        print(example[0])
        print('Name:', example[1])

In [5]:
print_description(10)

Soak up the vibrant scene in the Living Room Bar and get in the mix with our live music and DJ series before heading to a memorable dinner at TRACE. Offering inspired seasonal fare in an award-winning atmosphere, it's a not-to-be-missed culinary experience in downtown Seattle. Work it all off the next morning at FIT®, our state-of-the-art fitness center before wandering out to explore many of the area's nearby attractions, including Pike Place Market, Pioneer Square and the Seattle Art Museum. As always, we've got you covered during your time at W Seattle with our signature Whatever/Whenever® service - your wish is truly our command.
Name: W Seattle


### Visualize Token (vocabulary) Frequency Distribution Before Removing Stop Words

In [6]:
def get_top_n_words(corpus, n=None):
    vec = CountVectorizer().fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_words(df['desc'], 20)
df1 = pd.DataFrame(common_words, columns = ['desc' , 'count'])
top_words = df1.groupby('desc').sum()['count'].sort_values()
fig = px.bar(top_words, x="count", y=top_words.index, title="Top 20 words in hotel description before removing stop words", orientation='h')
fig.show()

### Visualize Token (vocabulary) Frequency Distribution After Removing Stop Words

In [7]:
def get_top_n_words(corpus, n=None):
    vec = CountVectorizer(stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_words(df['desc'], 20)
df1 = pd.DataFrame(common_words, columns = ['desc' , 'count'])
top_words = df1.groupby('desc').sum()['count'].sort_values()
fig = px.bar(top_words, x="count", y=top_words.index, title="Top 20 words in hotel description after removing stop words", orientation='h')
fig.show()

### Bigrams Frequency Distribution Before Removing Stop Words

In [8]:
def get_top_n_bigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 2)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]
common_words = get_top_n_bigram(df['desc'], 20)
df3 = pd.DataFrame(common_words, columns = ['desc' , 'count'])
top_n_bigram_no_clean = df3.groupby('desc').sum()['count'].sort_values(ascending=False)
fig = px.bar(top_n_bigram_no_clean, x="count", y=top_n_bigram_no_clean.index, title="Top 20 bigrams in hotel description before removing stop words", orientation='h')
fig.show()

### Bigrams Frequency Distribution After Removing Stop Words

In [9]:
def get_top_n_bigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 2), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]
common_words = get_top_n_bigram(df['desc'], 20)
df3 = pd.DataFrame(common_words, columns = ['desc' , 'count'])
top_n_bigram_clean = df3.groupby('desc').sum()['count'].sort_values(ascending=False)
fig = px.bar(top_n_bigram_clean, x="count", y=top_n_bigram_clean.index, title="Top 20 bigrams in hotel description after removing stop words", orientation='h')
fig.show()

### Trigrams Frequency Distribution Before Removing Stop Words

In [55]:
def get_top_n_trigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(3, 3)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]
common_words = get_top_n_trigram(df['desc'], 20)
df5 = pd.DataFrame(common_words, columns = ['desc' , 'count'])
top_trigram_with_stop_words = df5.groupby('desc').sum()['count'].sort_values(ascending=False)
fig = px.bar(top_trigram_with_stop_words, x="count", y=top_trigram_with_stop_words.index, title="Top 20 trigrams in hotel description before removing stop words", orientation='h')
fig.show()

### Trigrams Frequency Distribution After Removing Stop Words

In [11]:
def get_top_n_trigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(3, 3), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]
common_words = get_top_n_trigram(df['desc'], 20)
df5 = pd.DataFrame(common_words, columns = ['desc' , 'count'])
top_trigram_without_stop_words = df5.groupby('desc').sum()['count'].sort_values(ascending=False)
fig = px.bar(top_trigram_without_stop_words, x="count", y=top_trigram_without_stop_words.index, title="Top 20 trigrams in hotel description after removing stop words", orientation='h')
fig.show()
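
The counting cells above differ only in the `ngram_range` and `stop_words` arguments passed to `CountVectorizer`, so they could be folded into a single helper. The following is a minimal sketch of that idea, not the notebook's own code: the function name `get_top_ngrams` and the two-sentence `toy_corpus` are hypothetical stand-ins for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer


def get_top_ngrams(corpus, n=None, ngram_range=(1, 1), stop_words=None):
    """Return the n most frequent n-grams in corpus as (term, count) pairs."""
    vec = CountVectorizer(ngram_range=ngram_range, stop_words=stop_words)
    bag_of_words = vec.fit_transform(corpus)
    # Sum each column to get the total count of every term across documents
    totals = bag_of_words.sum(axis=0)
    freqs = [(term, int(totals[0, idx])) for term, idx in vec.vocabulary_.items()]
    return sorted(freqs, key=lambda pair: pair[1], reverse=True)[:n]


# Hypothetical toy corpus (not taken from the hotel dataset)
toy_corpus = [
    "the hotel is near pike place market",
    "pike place market is a short walk from the hotel",
]

top_bigrams = get_top_ngrams(toy_corpus, n=3, ngram_range=(2, 2), stop_words="english")
print(top_bigrams)
```

With a helper like this, each chart would need only a different `ngram_range` and `stop_words` value, for example `get_top_ngrams(df['desc'], 20, ngram_range=(3, 3))`.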

### Hotel Description Length Distribution

In [12]:
df['word_count'] = df['desc'].apply(lambda x: len(str(x).split()))

desc_lengths = list(df['word_count'])

print(f"Number of descriptions: {len(desc_lengths)}\nAverage word count: {np.average(desc_lengths)} \nMinimum word count: {min(desc_lengths)} \nMaximum word count: {max(desc_lengths)}")

Number of descriptions: 152
Average word count: 156.94736842105263 
Minimum word count: 16 
Maximum word count: 494


In [13]:
fig = px.histogram(df, x="word_count", title='Word Count Distribution in Hotel Description', nbins=50)
fig.show()

### Preprocessing hotel description text 

The text is pretty clean, so there is not much to do, but we normalize it just in case.

In [14]:
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))


def clean_text(text):
    """
        text: a string
        
        return: modified initial string
    """
    text = text.lower()  # lowercase the text
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # replace separator symbols with a space
    text = BAD_SYMBOLS_RE.sub('', text)  # delete every character other than 0-9, a-z, space, #, + and _
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)  # remove stop words
    return text
    
df['desc_clean'] = df['desc'].apply(clean_text)
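
To see what the two regular expressions do, it can help to run the same steps on a single sentence. The sketch below repeats the cleaning logic under two assumptions made so it runs without the NLTK download: the function is renamed `clean_text_demo`, and `TOY_STOPWORDS` is a tiny hand-written stand-in for the NLTK English list.

```python
import re

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
# Assumption: tiny stand-in stop-word set, not the NLTK list
TOY_STOPWORDS = {"the", "in", "an", "our", "at", "a", "it"}


def clean_text_demo(text, stopwords=TOY_STOPWORDS):
    text = text.lower()
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # ',' and ';' become spaces
    text = BAD_SYMBOLS_RE.sub('', text)        # '-' and '!' are deleted outright
    return ' '.join(w for w in text.split() if w not in stopwords)


sample = "Dinner at TRACE, in an award-winning atmosphere; state-of-the-art fitness center!"
cleaned = clean_text_demo(sample)
print(cleaned)
# dinner trace awardwinning atmosphere stateoftheart fitness center
```

Because hyphens are deleted rather than replaced by a space, hyphenated phrases collapse into single tokens such as `awardwinning` and `stateoftheart`. If separate tokens were preferred, one option would be to add `-` to `REPLACE_BY_SPACE_RE`.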

In [15]:
def print_description(index):
    example = df[df.index == index][['desc_clean', 'name']].values[0]
    if len(example) > 0:
        print(example[0])
        print('Name:', example[1])
print_description(10)

soak vibrant scene living room bar get mix live music dj series heading memorable dinner trace offering inspired seasonal fare awardwinning atmosphere nottobemissed culinary experience downtown seattle work next morning fit stateoftheart fitness center wandering explore many areas nearby attractions including pike place market pioneer square seattle art museum always weve got covered time w seattle signature whatever whenever service wish truly command
Name: W Seattle


In [16]:
df.set_index('name', inplace = True)

In [17]:
df.head()

name,address,desc,word_count,desc_clean
Hilton Garden Seattle Downtown,"1821 Boren Avenue, Seattle Washington 98101 USA","Located on the southern tip of Lake Union, the...",184,located southern tip lake union hilton garden ...
Sheraton Grand Seattle,"1400 6th Avenue, Seattle, Washington 98101 USA","Located in the city's vibrant core, the Sherat...",152,located citys vibrant core sheraton grand seat...
Crowne Plaza Seattle Downtown,"1113 6th Ave, Seattle, WA 98101","Located in the heart of downtown Seattle, the ...",147,located heart downtown seattle awardwinning cr...
Kimpton Hotel Monaco Seattle,"1101 4th Ave, Seattle, WA98101",What?s near our hotel downtown Seattle locatio...,150,whats near hotel downtown seattle location bet...
The Westin Seattle,"1900 5th Avenue, Seattle, Washington 98101 USA",Situated amid incredible shopping and iconic a...,151,situated amid incredible shopping iconic attra...


Compute TF-IDF (Term Frequency - Inverse Document Frequency) features from the cleaned descriptions, using unigrams, bigrams, and trigrams (`ngram_range=(1, 3)`). Then compute the cosine similarity between every pair of hotel descriptions from the TF-IDF matrix.

In [18]:
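# min_df=1 (the default) keeps every term; recent scikit-learn versions reject an integer min_df of 0
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')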
tfidf_matrix = tf.fit_transform(df['desc_clean'])
cosine_similarities = cosine_similarity(tfidf_matrix, tfidf_matrix)
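
The imports also bring in `linear_kernel`, which the notebook does not use. `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), so the plain dot product computed by `linear_kernel` should give the same matrix as `cosine_similarity` here. A small sketch to check this, using three made-up descriptions (`toy_docs` is hypothetical, not taken from the dataset):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical toy descriptions for illustration only
toy_docs = [
    "downtown hotel near pike place market",
    "airport hotel with free shuttle",
    "boutique hotel near pike place market",
]

toy_tf = TfidfVectorizer(analyzer="word", ngram_range=(1, 3))
toy_matrix = toy_tf.fit_transform(toy_docs)

cos = cosine_similarity(toy_matrix)  # normalizes rows, then takes dot products
lin = linear_kernel(toy_matrix)      # dot products only

print(np.allclose(cos, lin))  # rows are already unit length, so both agree
print(cos.round(2))
```

In the resulting matrix, entry `[i, j]` is the similarity between descriptions `i` and `j`, and the diagonal is 1 because each description is identical to itself. The first and third descriptions share most of their n-grams, so they score higher with each other than with the second.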

In [34]:
indices = pd.Series(df.index)

In [35]:
indices

0               Hilton Garden Seattle Downtown
1                       Sheraton Grand Seattle
2                Crowne Plaza Seattle Downtown
3                Kimpton Hotel Monaco Seattle 
4                           The Westin Seattle
                        ...                   
147                  The Halcyon Suite Du Jour
148                                Vermont Inn
149                 Stay Alfred on Wall Street
150         Pike's Place Lux Suites by Barsala
151    citizenM Seattle South Lake Union hotel
Name: name, Length: 152, dtype: object

Function to get the 10 most similar hotels, ranked by cosine similarity, given a hotel name.

In [48]:
def recommendations(name, cosine_similarities = cosine_similarities):
    
    recommended_hotels = []
    
    # getting the index of the hotel that matches the name
    idx = indices[indices == name].index[0]
        
    # creating a Series with the similarity scores in descending order
    score_series = pd.Series(cosine_similarities[idx]).sort_values(ascending = False)
    
    # getting the indexes of the 10 most similar hotels except itself
    top_10_indexes = list(score_series.iloc[1:11].index)
    
    # populating the list with the names of the top 10 matching hotels
    for i in top_10_indexes:
        recommended_hotels.append(df.index[i])
        
    return recommended_hotels
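
The function above skips the first entry of the sorted scores on the assumption that the hotel itself always ranks first. A variant that removes the query hotel by name and also returns the scores can be sketched as follows. The name `recommend_with_scores`, the four-row `toy` frame, and its hotel names are all hypothetical, and the sketch assumes hotel names are unique.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data standing in for df['desc_clean']
toy = pd.DataFrame({
    "name": ["Airport Inn", "Market Hotel", "Runway Lodge", "Harbor Suites"],
    "desc_clean": [
        "airport shuttle free parking near seatac airport",
        "downtown hotel near pike place market",
        "free airport shuttle parking seatac",
        "waterfront suites near pike place market downtown",
    ],
}).set_index("name")

toy_matrix = TfidfVectorizer(ngram_range=(1, 3)).fit_transform(toy["desc_clean"])
toy_sims = cosine_similarity(toy_matrix)


def recommend_with_scores(name, names, sims, top_n=10):
    """Return the top_n most similar hotels as a Series of name -> score."""
    pos = names.get_loc(name)  # integer position; assumes unique names
    scores = pd.Series(sims[pos], index=names).drop(name)  # exclude the query itself
    return scores.sort_values(ascending=False).head(top_n)


recs = recommend_with_scores("Airport Inn", toy.index, toy_sims, top_n=2)
print(recs)
```

Returning the scores alongside the names makes it easier to judge how close each recommendation is, and dropping the query by label avoids depending on sort order when two descriptions are identical.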

Top 10 recommendations for the hotel "Hilton Seattle Airport & Conference Center":

In [52]:
recommendations('Hilton Seattle Airport & Conference Center')

['Embassy Suites by Hilton Seattle Tacoma International Airport',
 'DoubleTree by Hilton Hotel Seattle Airport',
 'Seattle Airport Marriott',
 'Motel 6 Seattle Sea-Tac Airport South',
 'Econo Lodge SeaTac Airport North',
 'Four Points by Sheraton Downtown Seattle Center',
 'Knights Inn Tukwila',
 'Econo Lodge Renton-Bellevue',
 'Hampton Inn Seattle/Southcenter',
 'Radisson Hotel Seattle Airport']

For comparison, these are the hotels recommended by the Booking platform for "Hilton Seattle Airport & Conference Center":

![Search of Hilton Seattle Airport & Conference Center in Booking P1](imgs/search1)
![Search of Hilton Seattle Airport & Conference Center in Booking P2](imgs/search2)