# Hotel Review Sentiment Analysis


### Importing the relevant libraries

In [1]:
import numpy as np
import re
import pickle 
import nltk
import pandas as pd
from nltk.corpus import stopwords
from sklearn.datasets import load_files
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/JeremyBook/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
raw_data = pd.read_csv('Hotel_Reviews.csv')
raw_data.head()

Unnamed: 0,Hotel_Address,Additional_Number_of_Scoring,Review_Date,Average_Score,Hotel_Name,Reviewer_Nationality,Negative_Review,Review_Total_Negative_Word_Counts,Total_Number_of_Reviews,Positive_Review,Review_Total_Positive_Word_Counts,Total_Number_of_Reviews_Reviewer_Has_Given,Reviewer_Score,Tags,days_since_review,lat,lng
0,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,8/3/2017,7.7,Hotel Arena,Russia,I am so angry that i made this post available...,397,1403,Only the park outside of the hotel was beauti...,11,7,2.9,"[' Leisure trip ', ' Couple ', ' Duplex Double...",0 days,52.360576,4.915968
1,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,8/3/2017,7.7,Hotel Arena,Ireland,No Negative,0,1403,No real complaints the hotel was great great ...,105,7,7.5,"[' Leisure trip ', ' Couple ', ' Duplex Double...",0 days,52.360576,4.915968
2,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,7/31/2017,7.7,Hotel Arena,Australia,Rooms are nice but for elderly a bit difficul...,42,1403,Location was good and staff were ok It is cut...,21,9,7.1,"[' Leisure trip ', ' Family with young childre...",3 days,52.360576,4.915968
3,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,7/31/2017,7.7,Hotel Arena,United Kingdom,My room was dirty and I was afraid to walk ba...,210,1403,Great location in nice surroundings the bar a...,26,1,3.8,"[' Leisure trip ', ' Solo traveler ', ' Duplex...",3 days,52.360576,4.915968
4,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,7/24/2017,7.7,Hotel Arena,New Zealand,You When I booked with your company on line y...,140,1403,Amazing location and building Romantic setting,8,3,6.7,"[' Leisure trip ', ' Couple ', ' Suite ', ' St...",10 days,52.360576,4.915968


### Copy data to ensure original data is not modified

In [3]:
df = raw_data.copy()

In [4]:
df['Review'] = df.Negative_Review + df.Positive_Review
df['Review'][2]

' Rooms are nice but for elderly a bit difficult as most rooms are two story with narrow steps So ask for single level Inside the rooms are very very basic just tea coffee and boiler and no bar empty fridge  Location was good and staff were ok It is cute hotel the breakfast range is nice Will go back '

### Create the target and drop all columns that are not relevant

* Reviewer_score lower than 5 will be considered negative
* Reviewer_score equal or higher than 5 will be consider positive

In [5]:
# create the target
#lambda x: x*10 if x<2 else (x**2 if x<4 else x+10)

#df['Target'] = df["Reviewer_Score"].apply(lambda x: 0 if x < 5 else 1) #[0-4]:0, [5-10]:1

df['Target'] = df["Reviewer_Score"].apply(lambda x: 0 if x < 4 else (1 if x < 7 else 2))
df = df[['Review', 'Target']]

df.head()

Unnamed: 0,Review,Target
0,I am so angry that i made this post available...,0
1,No Negative No real complaints the hotel was g...,2
2,Rooms are nice but for elderly a bit difficul...,2
3,My room was dirty and I was afraid to walk ba...,0
4,You When I booked with your company on line y...,1


In [6]:
X, y = df.Review, df.Target

In [7]:
print('Total number of rows: ', df.shape[0])
print('Total number of positive reviews: ', y.sum())
print('Percentage of positive reivews:', y.sum()/df.shape[0])

Total number of rows:  515738
Total number of positive reviews:  933897
Percentage of positive reivews: 1.8107973428368669


In [8]:
df["Target"].value_counts()

2    428887
1     76123
0     10728
Name: Target, dtype: int64

### From the above output, the dataset is highly imbalance. We will have to balance in such a way it is approximately 50% positive and negative

In [9]:
# #remove excess 0s
# one_counter = 0
# counter = 0
# indices_to_remove =[]


# for index, row in df.iterrows():
#     if row['Target'] == 1:
#         one_counter+=1
#         if one_counter >= (df.shape[0] - df.Target.sum()):
#             indices_to_remove.append(index)
    
# df_balanced = df.drop(indices_to_remove)
# df_balanced.reset_index(inplace=True, drop=True)

# #check if targets are balance (approx. 50%)

# print(df_balanced['Target'].sum())
# print(df_balanced['Target'].shape[0])
# print(df_balanced['Target'].sum()/df_balanced['Target'].shape[0])

## Down-sample Majority Class
Down-sampling involves randomly removing observations from the majority class to prevent its signal from dominating the learning algorithm.

The most common heuristic for doing so is resampling without replacement.

steps:

1. First, we'll separate observations from each class into different DataFrames.
2. Next, we'll resample the majority class without replacement, setting the number of samples to match that of the minority class.
3. Finally, we'll combine the down-sampled majority class DataFrame with the original minority class DataFrame.

In [10]:
from sklearn.utils import resample

# Separate majority and minority classes
df_majority = df[df.Target==1]
df_minority = df[df.Target==0]
 
# Downsample majority class
df_majority_downsampled = resample(df_majority, 
                                 replace=False,    # sample without replacement
                                 n_samples=22281,     # to match minority class
                                 random_state=123) # reproducible results
 
# Combine minority class with downsampled majority class
df_downsampled = pd.concat([df_majority_downsampled, df_minority])
 
# Display new class counts
df_downsampled.Target.value_counts()

1    22281
0    10728
Name: Target, dtype: int64

In [11]:
# Separate majority and minority classes
df_majority = df[df.Target==2]
df_minority = df[df.Target==0]
 
# Downsample majority class
df_majority_downsampled = resample(df_majority, 
                                 replace=False,    # sample without replacement
                                 n_samples=22281,     # to match minority class
                                 random_state=123) # reproducible results
 
# Combine minority class with downsampled majority class
df_balanced = pd.concat([df_majority_downsampled, df_downsampled])
 
# Display new class counts
df_balanced.Target.value_counts()

2    22281
1    22281
0    10728
Name: Target, dtype: int64

In [12]:
X = df_balanced['Review']
y = df_balanced['Target']

### Text cleaning
* Removing white spaces, punctuations, single letter words

In [13]:
# Creating the corpus

def text_cleaner(X):
    corpus = []
    for i in X:
        review = re.sub(r'\W', ' ', i)
        review = review.lower()
        review = re.sub(r'^br$', ' ', review)
        review = re.sub(r'\s+br\s+',' ',review)
        review = re.sub(r'\s+[a-z]\s+', ' ',review)
        review = re.sub(r'^b\s+', '', review)
        review = re.sub(r'\s+', ' ', review)
        corpus.append(review)
    return X

corpus = text_cleaner(X)

### Prior to lemmatizing, we have to take a part of speech parameter, “pos”.

### If not supplied, the default is “noun”. We will make a function to solve it

In [14]:
# return the wordnet object value corresponding to the POS tag
from nltk.corpus import wordnet

def get_wordnet_pos(pos_tag):
    if pos_tag.startswith('J'):
        return wordnet.ADJ
    elif pos_tag.startswith('V'):
        return wordnet.VERB
    elif pos_tag.startswith('N'):
        return wordnet.NOUN
    elif pos_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

### In lemmatisation, the part of speech of a word should be first determined and will return the dictionary form of a word, which must be a valid word.

In [15]:
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
lemmatizer = WordNetLemmatizer()

# Lemmatization
pos_tags = pos_tag(corpus)
corpus = [WordNetLemmatizer().lemmatize(text[0], get_wordnet_pos(text[1])) for text in pos_tags]

### Creating the Tf-Idf model

In [16]:
# Creating the Tf-Idf model directly
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features = 6000, min_df = 3, max_df = 0.6, stop_words = stopwords.words('english'))
X = vectorizer.fit_transform(corpus).toarray()

### Splitting the dataset into training and test set

In [17]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
text_train, text_test, sent_train, sent_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

### Training the model

In [18]:
# Training the classifier
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(text_train,sent_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

### Testing model performance

In [19]:
sent_pred = model.predict(text_test)

In [20]:
model.score(text_test, sent_test)

0.6973232049195153

In [21]:
sample = [text_cleaner("""The room was simple, quite small and had basic equipments. The room was also clean. 
The main advantage of the room was the balcony which offers a really nice view on the street. 
The breakfast was also simple. We preferred to take it outside as the continental breakfast was not really suitable for us. 
At last, the location is really great, in the Chinatown Food Street and close to the Chinatown train station. 
The only problem is the street which is quite noisy until late night.""")]

sample = vectorizer.transform(sample).toarray()
sentiment = model.predict(sample)
sentiment[0]

1

In [38]:
sample = [text_cleaner("""The receptionist has a serious attitude problem. 
the room is not what you expected from the picture. room was small, aircon was not cold. 
the aircon and bed was at two different location. This hotel is definitely not worth the money will never come back"""
)]
pos_tags = pos_tag(sample)
sample = [WordNetLemmatizer().lemmatize(text[0], get_wordnet_pos(text[1])) for text in pos_tags]
sample = vectorizer.transform(sample).toarray()
sentiment = model.predict(sample)
sentiment[0]

1

In [41]:
sample = [text_cleaner("""Talk about an icon, skyline defining, world renowned, you would imagine the customer service is decent, but the front desk actually insulted me and made a mockery out of me, I have never met such a rude bunch of hotel staff in my life, in particular a staff named YURI, working the night shift."""
)]

pos_tags = pos_tag(sample)
sample = [WordNetLemmatizer().lemmatize(text[0], get_wordnet_pos(text[1])) for text in pos_tags]
sample = vectorizer.transform(sample).toarray()
sentiment = model.predict(sample)
sentiment[0]

0

In [39]:
sample = [text_cleaner("""Very Bad and worst service. and sorry to say , SLOW and UNPROFESSIONAL receptionist . (since she admitted that she cannot multi task).

Check in at 3 pm , was in queue at 2.50 pm but was served 30 min after.

AND we spent another 30 minutes with pretty, prettier and prettiest receptionist Devi and her intelligent manager Kenny .

We requested early in email for connecting room . And Alice Chia replied to me that she can provide connecting room without any TERM AND CONDITION (like please bring all your household NRIC and birth cert)

yes we spend time at the front desk just to clarify if we all in the same household.

AND. when im in queue , one of your staff asked me to give way to one of your guest because they got baby with them .

but NOW , i got two kids and father with disabilities . No place to be seated for my dad until he claimed that his leg and waist cramp .

Somemore you can asked we to give way to others? Is this the way 5 -star hotel serve?

Yes maybe it is 5 star hotel but not the staff .

Why employ staff who are not efficient and give SG 5 * hotel a bad name?

In addition , I cant queue for swimming pool . ITS FULL. Im really dissappointed and regret to check in here. I should went back to Fairmont Hotel instead since its cheaper and better than MARINA BAY SANDS .

SERIOUSLY . HAIZZZ. We are here to celebrate my birthday today , but thankyou to all of the staff that involved to ruin our mood.""")]


pos_tags = pos_tag(sample)
sample = [WordNetLemmatizer().lemmatize(text[0], get_wordnet_pos(text[1])) for text in pos_tags]
sample = vectorizer.transform(sample).toarray()
sentiment = model.predict(sample)
sentiment[0]




0

In [27]:
sample = [text_cleaner("""MBS, supposingly one of the premium hotel and I was looking forward to it.
I have had high expectations on it since it prices are usually within >$600 but I got it at a discounted rate so it’s a good time to see what’s the hype all about.

It’s covid I know the hassle of getting the people safe distancing from each other.

I was with child, pram and luggage, sadly I found myself having difficulties with the door, it wasn’t an automated door, it was a pull/push heavy glass door, so erm I had to figure my way in. I Guess the hotel staffs are more busy being the safety ambassador, the valets parking attendants, the screening of temperature, so nobody to attend to the door.

Check in was smooth, request granted and everywhere we go we were greeted with so much love and warmth. So yay!!

You have to book your pool time, so it wasn’t that bad.

Prices at spago was super expensive for local standard. But ok la expected it’s meant for tourist.

Sadly I know MBS is someway trying to cater to muslim by having the no pork no lard/ halal guide around MBS and sg. Just wondering is it so difficult to have one, just one halal cafe/restaurants like some hotels have. I walked around MBS shopping centre, barely anything for me to eat there. So hmm.. I wish someday somehow MBS management might consider but if not it’s ok also. Just a thought.

Tv wise it got stucked the night we stay, so we thought okla just wait, on and off but nth changed. So we were tvless for like 2 hours until we decided to call and it took only a simple step as to just remove your key card, let the room off for awhile and restart. Aiya so simple if only we had known.

Minibar is complimentary! Yay to that.

Bed wise, aiya, I had high expectations☹️
Batam hotels had better bed experience. Idk it just don’t feel rich hotel feel to it, like I’m not sure if it’s not enough duvets underneath? The bedsheet also not smooth kind of feeling and the pillow has got the egg white smell? So bed was hmm not so good.

But the space, omgosh so huge.
The toilet is bigger than my hdb common room. But the downside is no bidet. Maybe catering to angmoh, but we asians mostly need water to wash our butts for business number 2 so it was hmm.. pls do consider bidet. Even Japanese hotels have.

The interior, hmm design abit oldies la. Like it’s a new hotel but how come the design, the furniture choice, the fittings, the carpet, are like old school hotel design when it’s actually not an old hotel.

View is awesome! Pool is awesome but no baby pool or I can’t see any. Pool meant for adult cuz of the depth.

I like how they always end with my pleasure. It’s my pleasure too. This is just my honest review.
Would I return again? 30% maybe, but I’ve got nth to look forward to.. maybe if they change the bed.. I would probably go for a good bed and good food experience.""")]

sample = vectorizer.transform(sample).toarray()
sentiment = model.predict(sample)
sentiment[0]

1

In [40]:
sample = [text_cleaner("""We enjoyed our stay at 35th floor garden view!
Staff at the concierge was very accommodating to my request to be at the high floor despite surge of guests at that time.We had a great time having breakfast at Spago at 57th floor! It was am Amazing experience!""")]

pos_tags = pos_tag(sample)
sample = [WordNetLemmatizer().lemmatize(text[0], get_wordnet_pos(text[1])) for text in pos_tags]
sample = vectorizer.transform(sample).toarray()
sentiment = model.predict(sample)
sentiment[0]

2

## Saving the model and vectorizer

In [30]:
with open('model', 'wb') as file:
    pickle.dump(model, file)
    
with open('vectorizer', 'wb') as file:
    pickle.dump(vectorizer, file)