<b>Inspiration</b>

The dataset is large and informative, and I believe you can have a lot of fun with it! Here are some ideas to further inspire Kagglers:

- Fit a regression model on reviews and scores to see which words are more indicative of a higher/lower score
- Perform a sentiment analysis on the reviews
- Find correlations between reviewers' nationalities and scores
- Build beautiful and informative visualizations of the dataset
- Cluster hotels based on reviews
- Build a simple recommendation engine for guests who are fond of a particular hotel characteristic

In [1]:
# Importing Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import os

# NLP Packages
import nltk 
from textblob import TextBlob 

In [2]:
# Import csv file
df = pd.read_csv('csv/Hotel_Reviews.csv')

# Data Cleaning and EDA

## Understand Dataset

In [3]:
# Taking a look at the dataset
df.head(2)

Unnamed: 0,Hotel_Address,Additional_Number_of_Scoring,Review_Date,Average_Score,Hotel_Name,Reviewer_Nationality,Negative_Review,Review_Total_Negative_Word_Counts,Total_Number_of_Reviews,Positive_Review,Review_Total_Positive_Word_Counts,Total_Number_of_Reviews_Reviewer_Has_Given,Reviewer_Score,Tags,days_since_review,lat,lng
0,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,8/3/2017,7.7,Hotel Arena,Russia,I am so angry that i made this post available...,397,1403,Only the park outside of the hotel was beauti...,11,7,2.9,"[' Leisure trip ', ' Couple ', ' Duplex Double...",0 days,52.360576,4.915968
1,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,8/3/2017,7.7,Hotel Arena,Ireland,No Negative,0,1403,No real complaints the hotel was great great ...,105,7,7.5,"[' Leisure trip ', ' Couple ', ' Duplex Double...",0 days,52.360576,4.915968


In [4]:
# Checking the shape of the dataframe
df.shape

(515738, 17)

In [5]:
# Checking null values
df.isna().sum()

Hotel_Address                                    0
Additional_Number_of_Scoring                     0
Review_Date                                      0
Average_Score                                    0
Hotel_Name                                       0
Reviewer_Nationality                             0
Negative_Review                                  0
Review_Total_Negative_Word_Counts                0
Total_Number_of_Reviews                          0
Positive_Review                                  0
Review_Total_Positive_Word_Counts                0
Total_Number_of_Reviews_Reviewer_Has_Given       0
Reviewer_Score                                   0
Tags                                             0
days_since_review                                0
lat                                           3268
lng                                           3268
dtype: int64

In [6]:
# Checking how many hotels are in this dataset
len(df.Hotel_Name.unique())

1492

In [7]:
# Checking the hotel with the highest number of reviews
df.pivot_table(index=['Hotel_Name'], aggfunc='size').nlargest()

Hotel_Name
Britannia International Hotel Canary Wharf           4789
Strand Palace Hotel                                  4256
Park Plaza Westminster Bridge London                 4169
Copthorne Tara Hotel London Kensington               3578
DoubleTree by Hilton Hotel London Tower of London    3212
dtype: int64

In [8]:
# Double-checking whether the number of rows matches the column Total_Number_of_Reviews
df[df['Hotel_Name'] == 'Britannia International Hotel Canary Wharf'].head(2)

Unnamed: 0,Hotel_Address,Additional_Number_of_Scoring,Review_Date,Average_Score,Hotel_Name,Reviewer_Nationality,Negative_Review,Review_Total_Negative_Word_Counts,Total_Number_of_Reviews,Positive_Review,Review_Total_Positive_Word_Counts,Total_Number_of_Reviews_Reviewer_Has_Given,Reviewer_Score,Tags,days_since_review,lat,lng
63942,163 Marsh Wall Docklands Tower Hamlets London ...,2682,8/3/2017,7.1,Britannia International Hotel Canary Wharf,United Kingdom,The car park was small and unpleasant People ...,31,9086,The location was excellent for getting to the O2,10,3,7.9,"[' Leisure trip ', ' Group ', ' Standard Doubl...",0 days,51.50191,-0.023221
63943,163 Marsh Wall Docklands Tower Hamlets London ...,2682,8/3/2017,7.1,Britannia International Hotel Canary Wharf,United Kingdom,We weren t told that the only spa facility op...,34,9086,The house keeping lady made my boyfriends day...,14,3,8.3,"[' Leisure trip ', ' Couple ', ' Standard Doub...",0 days,51.50191,-0.023221


### Findings:

- There are reviews from 1,492 hotels
- The data is fairly clean. It doesn't have many null values
- It's missing the cities where the hotels are located.
- There are reviews without the latitude and longitude.
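- The number of reviews per hotel in the dataset does not match the Total_Number_of_Reviews column (for example, 4,789 rows versus 9,086 reported for Britannia International Hotel Canary Wharf)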
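
The last finding can be checked for every hotel at once. The following is a minimal sketch, assuming that `Hotel_Name` identifies a hotel and that `Total_Number_of_Reviews` is constant within a hotel; the function name `compare_review_counts` and the toy data are hypothetical and only serve as illustration.

```python
import pandas as pd


def compare_review_counts(df: pd.DataFrame) -> pd.DataFrame:
    """Compare the rows present per hotel with the total reported for that hotel."""
    summary = df.groupby("Hotel_Name").agg(
        rows_in_dataset=("Total_Number_of_Reviews", "size"),
        reported_total=("Total_Number_of_Reviews", "first"),
    )
    summary["difference"] = summary["reported_total"] - summary["rows_in_dataset"]
    return summary.sort_values("difference", ascending=False)


# Tiny made-up frame, for illustration only (not real data)
toy = pd.DataFrame({
    "Hotel_Name": ["A", "A", "B"],
    "Total_Number_of_Reviews": [5, 5, 1],
})
result = compare_review_counts(toy)
print(result)
```

On the real data, `compare_review_counts(df).head()` would list the hotels with the largest gap between the reported total and the rows actually present.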

## Data Cleaning

In [9]:
# Checking rows where the values are null
len(df[['Hotel_Address']][df.isnull().any(axis=1)])

3268

### Findings and Takeaways:

- There are 17 hotels without latitude and longitude. I'll work on it as a stretch goal
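
The cell above counts rows, not hotels. A hedged sketch of how the number of distinct hotels without coordinates could be obtained is shown below; the helper name `hotels_missing_coordinates` and the toy data are assumptions made for illustration.

```python
import pandas as pd


def hotels_missing_coordinates(df: pd.DataFrame) -> pd.Series:
    """Return the number of reviews lacking lat/lng, grouped by hotel."""
    missing = df[df["lat"].isna() | df["lng"].isna()]
    return missing.groupby("Hotel_Name").size().sort_values(ascending=False)


# Tiny made-up frame, for illustration only (not real data)
nan = float("nan")
toy = pd.DataFrame({
    "Hotel_Name": ["A", "A", "B", "C"],
    "lat": [nan, nan, 1.0, nan],
    "lng": [nan, nan, 2.0, nan],
})
missing_counts = hotels_missing_coordinates(toy)
print(len(missing_counts), "hotels without coordinates")
```

The length of the returned Series is the number of affected hotels, and its sum is the number of affected rows.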

## Fix Spelling

# Data Engineering

## Create a function for Sentiment Analysis

In this step, I will generate a sentiment analysis. Normally, this would be a step that I'd run after data cleaning for NLP. However, previous tests showed me that data cleaning does not affect the sentiment analysis using TextBlob.

Running sentiment analysis takes a lot of time because I have more than 515K observations. For this reason, once the sentiment analysis is created, I will save the DataFrame and load it again, so the analysis doesn't have to run again.

In [10]:
# Create a function to get subjectivity
def getSubjectivity(text):
    return TextBlob(text).sentiment.subjectivity

# Create a function to get polarity of a review
def getPolarity(text):
    return TextBlob(text).sentiment.polarity

# Create a function to correct spelling (returns the corrected text as a string)
def spellcheck(text):
    return str(TextBlob(text).correct())

In [63]:
df.Negative_Review[3]

' My room was dirty and I was afraid to walk barefoot on the floor which looked as if it was not cleaned in weeks White furniture which looked nice in pictures was dirty too and the door looked like it was attacked by an angry dog My shower drain was clogged and the staff did not respond to my request to clean it On a day with heavy rainfall a pretty common occurrence in Amsterdam the roof in my room was leaking luckily not on the bed you could also see signs of earlier water damage I also saw insects running on the floor Overall the second floor of the property looked dirty and badly kept On top of all of this a repairman who came to fix something in a room next door at midnight was very noisy as were many of the guests I understand the challenges of running a hotel in an old building but this negligence is inconsistent with prices demanded by the hotel On the last night after I complained about water damage the night shift manager offered to move me to a different room but that offer

<b>NOTE:</b>

Each of the two following cells takes around 10 minutes to run. For this reason, I will save the DataFrame into a csv file and load it again.

In [11]:
# # Create new columns with the polarity of the Negative and Positive Reviews
# df['Polarity_Net'] = df['Negative_Review'].apply(getPolarity)
# df['Polarity_Pos'] = df['Positive_Review'].apply(getPolarity)

In [12]:
# # Saving csv with sentiment analysis
# df.to_csv("csv/df_sentiment_analysis.csv")
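
The compute-once-then-reload approach used in these cells could also be wrapped in a small helper, so that nothing needs to be commented out by hand. This is only a sketch under stated assumptions: `load_or_compute` is a hypothetical name, and the scoring function is passed in as an argument (in the notebook it would be `getPolarity`), replaced here by a fake scorer so the example runs quickly.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_or_compute(cache_path, source_df, score_fn):
    """Load the cached csv if it exists; otherwise compute polarity and cache it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return pd.read_csv(cache_path, index_col=0)
    result = source_df.copy()
    result["Polarity_Net"] = result["Negative_Review"].apply(score_fn)
    result["Polarity_Pos"] = result["Positive_Review"].apply(score_fn)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    result.to_csv(cache_path)
    return result


# Illustration with a fake scorer that records how often it is called
calls = []

def fake_score(text):
    calls.append(text)
    return 0.5

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "df_sentiment_analysis.csv"
    toy = pd.DataFrame({"Negative_Review": ["bad"], "Positive_Review": ["good"]})
    first = load_or_compute(cache, toy, fake_score)   # computes and writes the csv
    second = load_or_compute(cache, toy, fake_score)  # reads the csv, no scoring

print(len(calls))
```

The second call does not invoke the scorer at all, which is the behaviour wanted for the slow TextBlob step.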

### Importing the DataFrame

Now let's import the DataFrame again with the sentiment analysis and check if the results make sense

In [13]:
# Importing DataFrame with new Polarity column
df = pd.read_csv('csv/df_sentiment_analysis.csv', index_col=0)

In [48]:
# Checking columns
df.columns

Index(['Hotel_Address', 'Additional_Number_of_Scoring', 'Review_Date',
       'Average_Score', 'Hotel_Name', 'Reviewer_Nationality',
       'Negative_Review', 'Review_Total_Negative_Word_Counts',
       'Total_Number_of_Reviews', 'Positive_Review',
       'Review_Total_Positive_Word_Counts',
       'Total_Number_of_Reviews_Reviewer_Has_Given', 'Reviewer_Score', 'Tags',
       'days_since_review', 'lat', 'lng', 'Polarity_Net', 'Polarity_Pos',
       'Sent_Analysis_Neg', 'Sent_Analysis_Pos', 'Score'],
      dtype='object')

In [27]:
# Classifying the polarity: 0 = negative (below 0), 1 = neutral (0 up to 0.1), 2 = positive (0.1 and above)
df['Sent_Analysis_Neg'] = df['Polarity_Net'].apply(lambda x: 0 if x < 0 else 1 if x < 0.1 else 2)
df['Sent_Analysis_Pos'] = df['Polarity_Pos'].apply(lambda x: 0 if x < 0 else 1 if x < 0.1 else 2)

In [51]:
# Creating a csv file with the sentiment analysis
sentiment_analysis = df[['Hotel_Name','Negative_Review','Positive_Review','Reviewer_Score','Sent_Analysis_Neg','Sent_Analysis_Pos']]

# Uncomment the line below to export the file
# sentiment_analysis.to_csv('sentiment_analysis.csv')

In [55]:
sentiment_analysis.head(10)

Unnamed: 0,Hotel_Name,Negative_Review,Positive_Review,Reviewer_Score,Sent_Analysis_Neg,Sent_Analysis_Pos
0,Hotel Arena,I am so angry that i made this post available...,Only the park outside of the hotel was beauti...,2.9,1,2
1,Hotel Arena,No Negative,No real complaints the hotel was great great ...,7.5,2,2
2,Hotel Arena,Rooms are nice but for elderly a bit difficul...,Location was good and staff were ok It is cut...,7.1,1,2
3,Hotel Arena,My room was dirty and I was afraid to walk ba...,Great location in nice surroundings the bar a...,3.8,0,2
4,Hotel Arena,You When I booked with your company on line y...,Amazing location and building Romantic setting,6.7,0,2
5,Hotel Arena,Backyard of the hotel is total mess shouldn t...,Good restaurant with modern design great chil...,6.7,0,2
6,Hotel Arena,Cleaner did not change our sheet and duvet ev...,The room is spacious and bright The hotel is ...,4.6,1,2
7,Hotel Arena,Apart from the price for the brekfast Everyth...,Good location Set in a lovely park friendly s...,10.0,2,2
8,Hotel Arena,Even though the pictures show very clean room...,No Positive,6.5,0,0
9,Hotel Arena,The aircondition makes so much noise and its ...,The room was big enough and the bed is good T...,7.9,0,2


### Findings and Takeaways:

- Polarity features were created for the negative and positive reviews using sentiment analysis; a subjectivity function is defined, but only the polarity columns appear in the DataFrame.
- Polarity ranges between -1 and 1, where -1 means that the review was very negative and 1 means that the review was very positive.
- The sentiment analysis seems to do a good job identifying positive reviews, but the results for negative reviews could be improved.
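
The mapping from polarity to the three sentiment classes can be written as a named function, which makes the thresholds easier to read and adjust. This is a minimal sketch that mirrors the thresholds used in the classification cell; the name `classify_polarity` and the `neutral_upper` parameter are hypothetical.

```python
def classify_polarity(polarity, neutral_upper=0.1):
    """Map a polarity in [-1, 1] to 0 (negative), 1 (neutral) or 2 (positive)."""
    if polarity < 0:
        return 0
    if polarity < neutral_upper:
        return 1
    return 2


# Polarity values taken from the first row shown in the notebook
print(classify_polarity(0.028671))  # Polarity_Net of row 0
print(classify_polarity(0.283333))  # Polarity_Pos of row 0
```

Note that with these thresholds any polarity below 0 is treated as negative, while the neutral band only covers values from 0 up to 0.1.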

## Target Variable

In this section, I will create a target variable and use it to train my models. I will bin the Reviewer_Score feature into three classes:

- <b>Bad:</b> Scores below 5
- <b>Regular:</b> Scores from 5 up to, but not including, 7
- <b>Good:</b> Scores of 7 and above
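
As an alternative to a row-wise lambda, the same binning could be done in a vectorized way with `pd.cut`. The sketch below follows the boundaries used by the notebook's lambda (a score of exactly 5 is Regular and a score of exactly 7 is Good); the function name `score_to_class` is hypothetical.

```python
import numpy as np
import pandas as pd


def score_to_class(scores: pd.Series) -> pd.Series:
    """Bin scores into 0 (Bad), 1 (Regular) and 2 (Good) using left-closed bins."""
    bins = [-np.inf, 5, 7, np.inf]  # [-inf, 5), [5, 7), [7, inf)
    return pd.cut(scores, bins=bins, labels=[0, 1, 2], right=False).astype(int)


example = score_to_class(pd.Series([2.9, 5.0, 6.7, 7.0, 10.0]))
print(example.tolist())
```

Setting `right=False` is what makes each bin include its lower edge, so the boundary scores land in the higher class.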

In [57]:
# Checking dataframe
df.head(1)

Unnamed: 0,Hotel_Address,Additional_Number_of_Scoring,Review_Date,Average_Score,Hotel_Name,Reviewer_Nationality,Negative_Review,Review_Total_Negative_Word_Counts,Total_Number_of_Reviews,Positive_Review,...,Reviewer_Score,Tags,days_since_review,lat,lng,Polarity_Net,Polarity_Pos,Sent_Analysis_Neg,Sent_Analysis_Pos,Score
0,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,8/3/2017,7.7,Hotel Arena,Russia,I am so angry that i made this post available...,397,1403,Only the park outside of the hotel was beauti...,...,2.9,"[' Leisure trip ', ' Couple ', ' Duplex Double...",0 days,52.360576,4.915968,0.028671,0.283333,1,2,0


In [45]:
# Create function that turns the Reviewer Score into a classification target with 3 values
df['Score'] = df['Reviewer_Score'].apply(lambda x: 0 if x < 5 else 1 if x >= 5 and x < 7 else 2)

In [56]:
# Checking if function worked
df[['Reviewer_Score', 'Score']].head(20)

Unnamed: 0,Reviewer_Score,Score
0,2.9,0
1,7.5,2
2,7.1,2
3,3.8,0
4,6.7,1
5,6.7,1
6,4.6,0
7,10.0,2
8,6.5,1
9,7.9,2


In [47]:
# Checking if there will be class imbalance
df.Score.value_counts()

2    428887
1     64570
0     22281
Name: Score, dtype: int64

### Findings and Takeaways:

- There is class imbalance in the target variable. Since the dataset is very large, it should not be a problem to use downsampling or upsampling.
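
One possible way to downsample is to draw the same number of rows from every class, using the size of the smallest class. This is a sketch and not the method used later in the project; `downsample_to_minority` is a hypothetical helper, and `GroupBy.sample` requires pandas 1.1 or newer.

```python
import pandas as pd


def downsample_to_minority(df: pd.DataFrame, target: str = "Score", random_state: int = 42) -> pd.DataFrame:
    """Randomly keep as many rows per class as the smallest class has."""
    n_min = df[target].value_counts().min()
    return df.groupby(target).sample(n=n_min, random_state=random_state)


# Tiny made-up frame, for illustration only (not real data)
toy = pd.DataFrame({"Score": [2] * 6 + [1] * 3 + [0] * 2, "x": range(11)})
balanced = downsample_to_minority(toy)
print(balanced["Score"].value_counts())
```

If this is applied, it is usually done on the training split only, so that the test set keeps the original class distribution.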

# Pickle DataFrame

In [64]:
# Pickle DataFrame
pd.to_pickle(df, "./dummy.pkl")

# Ideas

- Check if reviews are worse when they take longer to be written
- Check the country and nationalities
- Time of the year with more complaints

# Stretch Goals

- Get latitude and longitude for hotels that are missing this information
- People might base their review on an isolated bad experience
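
For the first stretch goal, only the distinct addresses that lack coordinates need to be geocoded, not every review row. The sketch below assumes a hypothetical helper `fill_missing_coordinates` that receives any callable mapping an address to a `(lat, lng)` tuple or `None`; a fake lookup stands in for a real geocoding service so that the example runs offline.

```python
import pandas as pd


def fill_missing_coordinates(df: pd.DataFrame, geocode) -> pd.DataFrame:
    """Fill lat/lng for rows that lack them, geocoding each address only once."""
    filled = df.copy()
    missing = filled["lat"].isna() | filled["lng"].isna()
    for address in filled.loc[missing, "Hotel_Address"].unique():
        coords = geocode(address)  # expected: (lat, lng) or None
        if coords is None:
            continue
        rows = missing & (filled["Hotel_Address"] == address)
        filled.loc[rows, "lat"] = coords[0]
        filled.loc[rows, "lng"] = coords[1]
    return filled


# Fake geocoder with made-up coordinates, for illustration only
looked_up = []

def fake_geocode(address):
    looked_up.append(address)
    return {"X": (10.0, 20.0)}.get(address)

nan = float("nan")
toy = pd.DataFrame({
    "Hotel_Address": ["X", "X", "Y", "Z"],
    "lat": [nan, nan, 1.0, nan],
    "lng": [nan, nan, 2.0, nan],
})
filled = fill_missing_coordinates(toy, fake_geocode)
print(filled)
```

With a real service, the callable would wrap the geocoder and return the latitude and longitude of the location it finds, or `None` when the address cannot be resolved.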

In [6]:
# Getting hotels' latitude and longitude

from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

# 0 - create the geocoder (the user_agent string is a placeholder chosen for this example)
locator = Nominatim(user_agent="hotel_reviews_geocoder")
# 1 - convenient function to delay between geocoding calls (at least 1 second per call)
geocode = RateLimiter(locator.geocode, min_delay_seconds=1)
# 2 - create location column from the address column of this dataset
df['location'] = df['Hotel_Address'].apply(geocode)
# 3 - create latitude, longitude and altitude from location column (returns tuple)
df['point'] = df['location'].apply(lambda loc: tuple(loc.point) if loc else (None, None, None))
# 4 - split point column into latitude, longitude and altitude columns
df[['latitude', 'longitude', 'altitude']] = pd.DataFrame(df['point'].tolist(), index=df.index)