<b>Inspiration</b>

The dataset is large and informative, and I believe you can have a lot of fun with it! Here are some ideas to further inspire Kagglers:

- Fit a regression model on reviews and scores to see which words are more indicative of a higher or lower score
- Perform a sentiment analysis on the reviews
- Find correlations between the reviewer's nationality and scores
- Build beautiful and informative visualizations of the dataset
- Cluster hotels based on reviews
- Build a simple recommendation engine for guests who are fond of a particular hotel characteristic

In [2]:
# Importing Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import os

# NLP Packages
import nltk 
from nltk.corpus import stopwords
from textblob import TextBlob 
from textblob import Word
import re
import string

# WordCloud
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# Sklearn Packages
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import text 
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_squared_error, precision_score, f1_score, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from sklearn.linear_model import LogisticRegression

# ImbLearn Packages
from imblearn.over_sampling import SMOTE

# Pandas Settings
pd.set_option('display.max_columns', 10000)
pd.set_option('display.max_rows', 100)

# Solve warnings
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

In [3]:
# Import csv file
df = pd.read_csv('csv/Hotel_Reviews.csv')

In [4]:
df

Unnamed: 0,Hotel_Address,Additional_Number_of_Scoring,Review_Date,Average_Score,Hotel_Name,Reviewer_Nationality,Negative_Review,Review_Total_Negative_Word_Counts,Total_Number_of_Reviews,Positive_Review,Review_Total_Positive_Word_Counts,Total_Number_of_Reviews_Reviewer_Has_Given,Reviewer_Score,Tags,days_since_review,lat,lng
0,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,8/3/2017,7.7,Hotel Arena,Russia,I am so angry that i made this post available...,397,1403,Only the park outside of the hotel was beauti...,11,7,2.9,"[' Leisure trip ', ' Couple ', ' Duplex Double...",0 days,52.360576,4.915968
1,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,8/3/2017,7.7,Hotel Arena,Ireland,No Negative,0,1403,No real complaints the hotel was great great ...,105,7,7.5,"[' Leisure trip ', ' Couple ', ' Duplex Double...",0 days,52.360576,4.915968
2,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,7/31/2017,7.7,Hotel Arena,Australia,Rooms are nice but for elderly a bit difficul...,42,1403,Location was good and staff were ok It is cut...,21,9,7.1,"[' Leisure trip ', ' Family with young childre...",3 days,52.360576,4.915968
3,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,7/31/2017,7.7,Hotel Arena,United Kingdom,My room was dirty and I was afraid to walk ba...,210,1403,Great location in nice surroundings the bar a...,26,1,3.8,"[' Leisure trip ', ' Solo traveler ', ' Duplex...",3 days,52.360576,4.915968
4,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,7/24/2017,7.7,Hotel Arena,New Zealand,You When I booked with your company on line y...,140,1403,Amazing location and building Romantic setting,8,3,6.7,"[' Leisure trip ', ' Couple ', ' Suite ', ' St...",10 days,52.360576,4.915968
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
515733,Wurzbachgasse 21 15 Rudolfsheim F nfhaus 1150 ...,168,8/30/2015,8.1,Atlantis Hotel Vienna,Kuwait,no trolly or staff to help you take the lugga...,14,2823,location,2,8,7.0,"[' Leisure trip ', ' Family with older childre...",704 day,48.203745,16.335677
515734,Wurzbachgasse 21 15 Rudolfsheim F nfhaus 1150 ...,168,8/22/2015,8.1,Atlantis Hotel Vienna,Estonia,The hotel looks like 3 but surely not 4,11,2823,Breakfast was ok and we got earlier check in,11,12,5.8,"[' Leisure trip ', ' Family with young childre...",712 day,48.203745,16.335677
515735,Wurzbachgasse 21 15 Rudolfsheim F nfhaus 1150 ...,168,8/19/2015,8.1,Atlantis Hotel Vienna,Egypt,The ac was useless It was a hot week in vienn...,19,2823,No Positive,0,3,2.5,"[' Leisure trip ', ' Family with older childre...",715 day,48.203745,16.335677
515736,Wurzbachgasse 21 15 Rudolfsheim F nfhaus 1150 ...,168,8/17/2015,8.1,Atlantis Hotel Vienna,Mexico,No Negative,0,2823,The rooms are enormous and really comfortable...,25,3,8.8,"[' Leisure trip ', ' Group ', ' Standard Tripl...",717 day,48.203745,16.335677


# Data Cleaning and EDA

## Understand Dataset

The idea in this step is to take a look at the data. The key points that I want to investigate are:
- Shape of the dataset
- Whether there are null values
- How many unique hotels there are, and whether that matches the number stated in the dataset description
- The types of data

In [3]:
# # Taking a look at the dataset
# df.head(5)

In [4]:
# Checking the shape of the dataframe
df.shape

(515738, 17)

In [5]:
df.columns

Index(['Hotel_Address', 'Additional_Number_of_Scoring', 'Review_Date',
       'Average_Score', 'Hotel_Name', 'Reviewer_Nationality',
       'Negative_Review', 'Review_Total_Negative_Word_Counts',
       'Total_Number_of_Reviews', 'Positive_Review',
       'Review_Total_Positive_Word_Counts',
       'Total_Number_of_Reviews_Reviewer_Has_Given', 'Reviewer_Score', 'Tags',
       'days_since_review', 'lat', 'lng'],
      dtype='object')

In [6]:
# Selecting only the columns that I will use
# (.copy() avoids SettingWithCopyWarning when new columns are added later)
features = ['Hotel_Name', 'Negative_Review','Positive_Review', 'Reviewer_Score']
df = df[features].copy()

In [7]:
df['Reviews'] = df['Negative_Review'] + df['Positive_Review']

In [8]:
# Reducing the size of the dataframe to 20%
df = df.sample(frac=0.2, random_state=1)

In [9]:
# Checking if it worked
df.shape

(103148, 5)

In [10]:
# Checking null values
df.isna().sum()

Hotel_Name         0
Negative_Review    0
Positive_Review    0
Reviewer_Score     0
Reviews            0
dtype: int64

In [11]:
# Checking how many hotels in this dataset
len(df.Hotel_Name.unique())

1488

In [12]:
# Checking the hotel with the highest number of reviews
df.pivot_table(index=['Hotel_Name'], aggfunc='size').nlargest()

Hotel_Name
Britannia International Hotel Canary Wharf           965
Strand Palace Hotel                                  900
Park Plaza Westminster Bridge London                 846
Copthorne Tara Hotel London Kensington               748
DoubleTree by Hilton Hotel London Tower of London    641
dtype: int64

### Findings:

- The 20% sample contains reviews from 1,488 hotels
- The data is fairly clean. The selected columns have no null values
- It's missing the cities where the hotels are located.
- There are reviews without the latitude and longitude.

## Data Cleaning

In this section, I will start the data cleaning. I expect to do it in several steps, depending on the complexity of the data. Since the main focus at this stage is to prepare data for the vanilla model, I won't overwork the cleaning.

### Remove Punctuation and Numbers

In [13]:
df.columns

Index(['Hotel_Name', 'Negative_Review', 'Positive_Review', 'Reviewer_Score',
       'Reviews'],
      dtype='object')

In [14]:
# This function lowercases all the review words and removes punctuation and words containing numbers
def clean_text_round1(text):
    '''Make text lowercase, remove punctuation and remove words containing numbers.'''
    text = text.lower()
    # Raw strings keep the regex escapes from being read as string escapes
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)

    return text

round1 = lambda x: clean_text_round1(x)
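
As a quick check of what this cleaning step does to a review, here is a minimal standalone sketch. `clean_review` mirrors the function above; the sample sentence and the `collapse_spaces` helper are my own illustrative additions and are not part of the notebook.

```python
import re
import string


def clean_review(text):
    """Lowercase, strip punctuation, and drop tokens that contain digits."""
    text = text.lower()
    text = re.sub(r"[%s]" % re.escape(string.punctuation), "", text)
    text = re.sub(r"\w*\d\w*", "", text)
    return text


def collapse_spaces(text):
    """Optional extra step (hypothetical, not in the notebook): squeeze repeated whitespace."""
    return re.sub(r"\s+", " ", text).strip()


sample = "The Room 204 was GREAT, but Wi-Fi didn't work!!"
cleaned = clean_review(sample)

# Removing "204" leaves a double space behind, which a tokenizer will ignore
print(repr(cleaned))
print(repr(collapse_spaces(cleaned)))
```

Note that hyphenated words and contractions are merged (`wi-fi` becomes `wifi`, `didn't` becomes `didnt`), because punctuation is deleted rather than replaced with a space.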

In [15]:
# Applying clean_text_round1 function
df['Reviews_Clean'] = df['Reviews'].apply(round1)

In [16]:
df.head()

Unnamed: 0,Hotel_Name,Negative_Review,Positive_Review,Reviewer_Score,Reviews,Reviews_Clean
356054,Canal House,No Negative,Nothing was too much trouble The staff were a...,10.0,No Negative Nothing was too much trouble The s...,no negative nothing was too much trouble the s...
395957,Grosvenor House A JW Marriott Hotel,I had a Junior suite The bed was only a queen...,I loved there shower It felt like you were un...,10.0,I had a Junior suite The bed was only a queen...,i had a junior suite the bed was only a queen...
468352,Imperial Riding School Renaissance Vienna Hotel,staff could be less rude the pool area is hor...,beds really comfy and the location is great a...,6.7,staff could be less rude the pool area is hor...,staff could be less rude the pool area is hor...
281462,Hotel SB Icaria Barcelona,No Negative,Really nice hotel good facilities great staff...,9.6,No Negative Really nice hotel good facilities ...,no negative really nice hotel good facilities ...
498978,Hotel Vilamar,No Negative,Everything is super And room and design Very ...,10.0,No Negative Everything is super And room and d...,no negative everything is super and room and d...


In [17]:
stop_words = stopwords.words('english')
# df['Reviews_Clean_2'] = df['Reviews_Clean'].apply(lambda x: [item for item in x if item not in stop_words])
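
One caveat about the commented-out line above: iterating directly over a string yields single characters, not words, so the review has to be split into tokens first. A minimal sketch of the idea, assuming whitespace tokenization and using a tiny hypothetical stop-word set in place of `stopwords.words('english')`:

```python
# Hypothetical mini stop-word list; the notebook uses nltk's stopwords.words('english')
STOP_WORDS = {"the", "was", "and", "a", "to", "of", "is", "in"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Split on whitespace and drop stop words, returning a list of tokens."""
    return [token for token in text.split() if token not in stop_words]


tokens = remove_stop_words("the staff was friendly and the room was clean")
print(tokens)
```

Converting the stop-word list to a `set` keeps each membership test cheap, which matters when the function is applied to thousands of reviews.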

In [18]:
# # Create a new column only with English words
# words = set(nltk.corpus.words.words())
# df['Negative_Review_Clean2'] = ''

# for index in df.index:
#     sent = df.Negative_Review_Clean[index]
#     df['Negative_Review_Clean2'][index] = " ".join(w for w in nltk.wordpunct_tokenize(sent) if w.lower() in words or not w.isalpha())
    
#     # 32 minutes

#### Findings and Takeaways:

- There are 17 hotels without latitude and longitude. I'll work on it as a stretch goal

## NEED TO WORK ON THAT - Fix Spelling

- To do a spell check, I will choose a random review and check whether it contains any misspellings
- I will create a function that will use TextBlob to fix misspellings
- Check the result

It seems that there are a few misspellings, such as the words `theough` and `extreamly`. I'll use TextBlob to fix them.

In [19]:
# # Create function to fix misspellings
# # Word.spellcheck() only handles a single word, so use TextBlob.correct() for a whole review
# def spellcheck(text):
#     return str(TextBlob(text).correct())

In [20]:
# df['Reviews_Clean'][356054]

In [21]:
# # Checking if function works
# spellcheck(df['Reviews_Clean'][356054])

In [22]:
# w = Word(df['Negative_Review'][4])
# w.spellcheck()

In [23]:
# df['Negative_Review_SC'] = df['Negative_Review'].apply(spellcheck)

In [24]:
# df.Negative_Review[4]

In [25]:
# blob = TextBlob(df.Negative_Review[4])
# blob.correct()

### Findings and Takeaways:
- While checking a random 

# Data Engineering

## Create a function for Sentiment Analysis

In this step, I will generate a sentiment analysis. Normally, this would be a step that I'd run after data cleaning for NLP. However, previous tests showed me that data cleaning does not affect the sentiment analysis using TextBlob.

Running sentiment analysis takes a lot of time because the full dataset has more than 515K observations, and even the 20% sample has about 103K. For this reason, once the sentiment analysis is created, I will save the DataFrame to a csv file and load it again, so the analysis won't have to run again.
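
The "compute once, then reload" idea can be sketched with the standard library alone. This is only an illustration under my own assumptions: `load_or_compute`, `slow_scores`, and the temporary file path are hypothetical names, and the notebook itself stores the result with pandas rather than with `pickle`.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Return the cached result at `path`, or call `compute()` and cache it."""
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)
    result = compute()
    with open(path, "wb") as fh:
        pickle.dump(result, fh)
    return result


calls = []  # records how many times the expensive step actually ran


def slow_scores():
    """Stand-in for the expensive sentiment step (made-up values)."""
    calls.append(1)
    return {"review_1": 0.8, "review_2": -0.4}


cache_path = os.path.join(tempfile.mkdtemp(), "scores.pkl")
first = load_or_compute(cache_path, slow_scores)   # computes and writes the cache
second = load_or_compute(cache_path, slow_scores)  # reads the cache instead
print(first == second, len(calls))
```

The second call never touches `slow_scores`, which is the behaviour wanted for a step that takes several minutes.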

In [26]:
# Create a function to get subjectivity
def getSubjectivity(text):
    return TextBlob(text).sentiment.subjectivity

# Create a function to get the polarity of a review
def getPolarity(text):
    return TextBlob(text).sentiment.polarity

<b>NOTE:</b>

Each of the two following cells takes around 10 minutes to run. For this reason, I will save the DataFrame into a csv file and load it again.

In [27]:
# Create new columns with the polarity of the Negative and Positive Reviews
df['Polarity_Net'] = df['Negative_Review'].apply(getPolarity)
df['Polarity_Pos'] = df['Positive_Review'].apply(getPolarity)

# 8 minutes

In [1]:
# Saving csv with sentiment analysis
df.to_csv("csv/df_sentiment_analysis.csv")


## Importing the Updated DataFrame

Now let's import the DataFrame again with the sentiment analysis and check if the results make sense

In [29]:
# Importing DataFrame with new Polarity column
df = pd.read_csv('csv/df_sentiment_analysis.csv', index_col=0)

In [30]:
# Checking columns
df.columns

Index(['Hotel_Name', 'Negative_Review', 'Positive_Review', 'Reviewer_Score',
       'Reviews', 'Reviews_Clean', 'Polarity_Net', 'Polarity_Pos'],
      dtype='object')

In [31]:
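# Classifying the polarity: 0 = negative (polarity < 0), 1 = neutral (0 <= polarity < 0.1), 2 = positive (polarity >= 0.1)
# The original `x > -0.1` check was redundant because negative values are already caught by `x < 0`
df['Sent_Analysis_Neg'] = df['Polarity_Net'].apply(lambda x: 0 if x < 0 else 1 if x < 0.1 else 2)
df['Sent_Analysis_Pos'] = df['Polarity_Pos'].apply(lambda x: 0 if x < 0 else 1 if x < 0.1 else 2)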

In [32]:
# Creating a csv file with the sentiment analysis
sentiment_analysis = df[['Hotel_Name','Negative_Review','Positive_Review','Reviewer_Score','Sent_Analysis_Neg','Sent_Analysis_Pos']]

# Uncomment the line below to export the file
# sentiment_analysis.to_csv('sentiment_analysis.csv')

### Findings and Takeaways:

- I created Polarity features for the Negative and Positive Reviews using sentiment analysis. The subjectivity function is defined but not applied yet.
- Polarity ranges between -1 and 1, where -1 means that the review was very negative and 1 means that the review was very positive.
- Sentiment analysis seems to do a good job identifying positive reviews, but its handling of negative reviews could be improved.

## Target Variable

In this section, I will create a target variable and use it to train my models. I will turn the Reviewer Score feature into three classes:

- <b>0 - Bad:</b> Scores below 5
- <b>1 - Regular:</b> Scores from 5 up to, but not including, 7
- <b>2 - Good:</b> Scores of 7 and above
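
To make the class boundaries explicit, the mapping can be written as a small named function. This is a sketch only; `score_to_class` is a hypothetical name, and it assumes that a score of exactly 5 counts as Regular and a score of exactly 7 counts as Good, which is how the lambda in the next cell behaves.

```python
def score_to_class(score):
    """Map a reviewer score (0-10) to 0 = Bad, 1 = Regular, 2 = Good."""
    if score < 5:
        return 0
    if score < 7:
        return 1
    return 2


# Boundary values show which side of each threshold belongs to which class
examples = [4.9, 5.0, 6.9, 7.0, 10.0]
print([score_to_class(s) for s in examples])
```

A named function like this could be passed to `df['Reviewer_Score'].apply(...)` in place of the nested lambda.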

In [33]:
# Create function that turns the Reviewer Score into a classification target with 3 values
df['Score'] = df['Reviewer_Score'].apply(lambda x: 0 if x < 5 else 1 if x >= 5 and x < 7 else 2)

In [34]:
# Checking if function worked
df[['Reviewer_Score', 'Score']].head(5)

Unnamed: 0,Reviewer_Score,Score
356054,10.0,2
395957,10.0,2
468352,6.7,1
281462,9.6,2
498978,10.0,2


In [35]:
# Checking if there will be class imbalance
df.Score.value_counts()

2    85623
1    13027
0     4498
Name: Score, dtype: int64

### Findings and Takeaways:
- I created the target variable and turned it into numbers: 0 means a bad review, 1 means a neutral review, and 2 means a positive review.
- There is a big class imbalance in this data set. It is taking a few seconds to process the data. I'll work on it in the next step.

### Solving Class Imbalance Manually

I decided to start by reducing the size of the set and fixing the class imbalance manually. It will help to process the data faster. There are around 19 times as many positive scores as negative scores, and around 3 times as many neutral scores as negative scores. To do so, I will:

- Create a separate dataframe for each class
- Take a sample of each dataframe according to its size
    - 5% of the positive scores
    - 30% of the neutral scores
- Then I will concatenate the results
- Export the new dataset, so I can use it later
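
The sampling fractions can be sanity-checked with a little arithmetic before running anything. The class counts below come from the `value_counts()` output above; the sketch assumes that sampling by fraction returns roughly `round(frac * n)` rows, and the variable names are illustrative.

```python
# Class counts taken from the value_counts() output above
class_counts = {0: 4498, 1: 13027, 2: 85623}

# Keep every bad review, 30% of the neutral ones, 5% of the good ones
sample_fracs = {0: 1.0, 1: 0.30, 2: 0.05}

expected = {
    label: round(class_counts[label] * frac)
    for label, frac in sample_fracs.items()
}

# How much larger the neutral and good classes are than the bad class
ratio_good_to_bad = class_counts[2] / class_counts[0]
ratio_neutral_to_bad = class_counts[1] / class_counts[0]

print(expected, sum(expected.values()))
print(round(ratio_good_to_bad, 1), round(ratio_neutral_to_bad, 1))
```

With these fractions the three classes end up at about 4,500, 3,900, and 4,300 rows, so no class dominates the downsampled set.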

In [36]:
# Creating separate dataframes depending on the classification
# (random_state makes the downsampling reproducible)
df_Score_0 = df[df.Score == 0]
df_Score_1 = df[df.Score == 1].sample(frac=0.3, random_state=1)
df_Score_2 = df[df.Score == 2].sample(frac=0.05, random_state=1)

In [37]:
# Concatenating the sampled dataframes
df = pd.concat([df_Score_2, df_Score_1, df_Score_0])
df.shape

(12687, 11)

In [38]:
df.Score.value_counts()

0    4498
2    4281
1    3908
Name: Score, dtype: int64

In [39]:
# Saving csv with the class imbalance fixed
features = ['Hotel_Name', 'Negative_Review', 'Positive_Review', 'Reviewer_Score', 'Reviews_Clean', 'Score']
df = df[features]
df.to_csv("csv/df_no_class_imbalance.csv")

### Findings and Takeaways:

- There is class imbalance in the target variable. Since the dataset is very large, I manually downsampled it to solve the class imbalance

## Lemmatization and Stemming

In [40]:
# # Importing data set with the class imbalance fixed
# df = pd.read_csv('csv/df_no_class_imbalance.csv', index_col=0)
# features = ['Hotel_Name', 'Negative_Review', 'Positive_Review', 'Reviewer_Score', 'Reviews_Clean', 'Score']
# df = df[features]
# df.shape

In [41]:
df.columns

Index(['Hotel_Name', 'Negative_Review', 'Positive_Review', 'Reviewer_Score',
       'Reviews_Clean', 'Score'],
      dtype='object')

In [42]:
# df.head()

In [43]:
# # Use English stemmer.
# from nltk.stem.snowball import SnowballStemmer
# stemmer = SnowballStemmer("english")
# df['stemmed'] = df['unstemmed'].apply(lambda x: [stemmer.stem(y) for y in x])
# df.drop(columns=(['unstemmed']), inplace=True)

In [44]:
# Use English lemmatizing

In [45]:
# df.head()

# Vanilla Model

Before preparing the data for modeling, I have two problems to solve:

- <b>Class imbalance:</b> There is a big class imbalance in my data set, where there are more positive reviews than negative reviews.
- <b>Curse of Dimensionality:</b> Since my dataset is so large, I'd end up with more than 50,000 features on top of 515K rows. That's too much for my computer to handle.

Luckily, I might be able to solve both problems at the same time. First, I will balance my target manually, which will help me partially solve the Curse of Dimensionality. Then, I will convert the tokenized dataset to a sparse format.
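
To see why a sparse representation helps, here is a toy bag-of-words built with the standard library only. The three example reviews are made up, and the dict-per-document layout is just one way to picture sparsity; it is not how pandas or scikit-learn store sparse data internally.

```python
from collections import Counter

# Made-up mini corpus standing in for the cleaned reviews
docs = [
    "great location friendly staff",
    "dirty room rude staff",
    "great breakfast",
]

# Vocabulary: every distinct word becomes one feature (column)
vocab = sorted({word for doc in docs for word in doc.split()})

# Dense layout: one count per (document, word) pair, mostly zeros
dense = [[Counter(doc.split())[word] for word in vocab] for doc in docs]

# Sparse layout: store only the words that actually occur in each document
sparse = [dict(Counter(doc.split())) for doc in docs]

dense_cells = sum(len(row) for row in dense)
sparse_cells = sum(len(row) for row in sparse)
print(len(vocab), dense_cells, sparse_cells)
```

Even in this tiny example fewer than half of the dense cells are needed, and the gap widens as the vocabulary grows, because each review uses only a handful of the available words.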

In [46]:
# Evaluation function

def evaluation(y_true, y_pred):
    '''Print Accuracy and weighted F1 Score for the given predictions.'''
    print('Evaluation Metrics:')
    print('Accuracy: ' + str(metrics.accuracy_score(y_true, y_pred)))
    print('F1 Score: ' + str(metrics.f1_score(y_true, y_pred, average="weighted")))
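
For readers who want to see what these two metrics compute, here is a plain-Python sketch. It is my own illustration of accuracy and of a support-weighted F1 score, which is my understanding of what `average="weighted"` does; the toy labels are made up.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)


# Toy example with the three classes used in this notebook
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0]

acc = accuracy(y_true_toy, y_pred_toy)
f1w = weighted_f1(y_true_toy, y_pred_toy)
print(round(acc, 3), round(f1w, 3))
```

Weighting by support means that a class with more true examples contributes more to the final score.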

## Vectorizing Dataset

In [47]:
df.columns

Index(['Hotel_Name', 'Negative_Review', 'Positive_Review', 'Reviewer_Score',
       'Reviews_Clean', 'Score'],
      dtype='object')

In [48]:
df.head()

Unnamed: 0,Hotel_Name,Negative_Review,Positive_Review,Reviewer_Score,Reviews_Clean,Score
279305,Catalonia Atenas,Nothing,Staff nice location excellent breakfast amazi...,8.8,nothing staff nice location excellent breakfa...,2
330486,Avenida Palace,No bar or restaurant,No Positive,9.2,no bar or restaurant no positive,2
223191,DoubleTree by Hilton London Islington,No Negative,The very friendly and helpful staff Made us f...,10.0,no negative the very friendly and helpful staf...,2
282475,Hotel Ronda Lesseps,Breakfast could have been better,Nice hotel super nice rooms and great locatio...,8.3,breakfast could have been better nice hotel ...,2
436370,Crowne Plaza London Docklands,Breakfast Staff not all really helpful a litt...,Location was great beautiful view close to Co...,7.9,breakfast staff not all really helpful a litt...,2


In [49]:
stop_words = stopwords.words('english')
# df['Reviews_Clean_2'] = df['Reviews_Clean'].apply(lambda x: [item for item in x if item not in stop_words])

In [50]:
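# Instantiate CountVectorizer
cv = CountVectorizer(stop_words=stop_words)

# CountVectorizer rejects NaN documents, which can appear in Reviews_Clean
# (for example after reloading the data from csv), so replace them with ''
reviews = df['Reviews_Clean'].fillna('')

# Fit and transform the cleaned reviews
df_cv = cv.fit_transform(reviews)
# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
df_tk = pd.DataFrame(df_cv.toarray(), columns=cv.get_feature_names_out())
df_tk.index = df.index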

In [None]:
df_tk.head()

In [None]:
df_tk.shape

In [None]:
# Converting the DataFrame to a sparse representation (zero counts are not stored)
df_sparse = df_tk.astype(pd.SparseDtype('int64', 0))

In [None]:
df_sparse.head()

In [None]:
df_sparse.to_csv('sparse_data.csv')
# df.to_csv("csv/df_sentiment_analysis.csv")

In [None]:
y = df.Score
X = df_tk

## Train Test Split

In [None]:
# Running Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.25)

# This cell takes around 2 minutes to run

In [None]:
X_train.shape

## Vanilla Models

### First Model

In [None]:
# Baseline Logistic Regression model
logreg_base = LogisticRegression()
logreg_base.fit(X_train, y_train) 
y_logreg_base = logreg_base.predict(X_test)

# 8 minutes

In [None]:
# Logistic Regression baseline evaluation
evaluation(y_test, y_logreg_base)

# Pickle DataFrame

# Ideas

- Check if the review is worse if it takes time to be made
- Check the country and nationalities
- Time of the year with more complaints

# Stretch Goals

- Get latitude and longitude for hotels that are missing this information
- People might base their review on an isolated bad experience

In [None]:
# Getting the hotels' latitude and longitude

from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

# Nominatim requires a user_agent; this string is a placeholder
locator = Nominatim(user_agent='hotel_reviews_notebook')
# 1 - convenient function to delay between geocoding calls
geocode = RateLimiter(locator.geocode, min_delay_seconds=1)
# 2 - create location column ('ADDRESS' stands for the address column, e.g. Hotel_Address in the original csv)
df['location'] = df['ADDRESS'].apply(geocode)
# 3 - create longitude, latitude and altitude from location column (returns tuple)
df['point'] = df['location'].apply(lambda loc: tuple(loc.point) if loc else None)
# 4 - split point column into latitude, longitude and altitude columns
df[['latitude', 'longitude', 'altitude']] = pd.DataFrame(df['point'].tolist(), index=df.index)