# What is NLP?
Working in Data Science and having a background in Technical Writing, I was drawn to the field of Natural Language Processing (NLP). Machines understanding language fascinates me, and I often ponder which algorithms Aristotle would have used to build a rhetorical analysis machine if he had the chance. If you’re new to Data Science, getting into NLP can seem complicated, especially since there have been so many recent advancements in the field. It is hard to know where to start.

# The Projects and the Data
These projects, built on reviews of London-based hotels, will give you an introduction to concepts and techniques used in Natural Language Processing.

## Overall problem
In a competitive market, every business wants to stand out from the others. The hotel industry in London is no exception, and it is trying to improve customer satisfaction and experience using machine learning algorithms. London is one of the world's best-known tourist destinations, so its economy depends heavily on tourism. Hotels play a major role in that tourism because most tourists stay in them. Many tourists leave negative reviews that deserve attention, but these are hard to address properly without analysis. As a result, many tourists avoid specific hotels, and some do not want to visit London again after a bad hotel experience, which affects the economy directly or indirectly.

## Project aim and objective
To our knowledge, hotel managers and owners cannot pinpoint specific areas for improvement from the comments left by customers without analyzing the data. The aim is to improve customer satisfaction and experience at London hotels using machine learning algorithms, and to understand what customers require:

• What are the major comment subjects?

• Which facilities need to be focused on?

• What are the target regions?

## Natural Language Processing

### Importing the libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import re
from nltk.corpus import stopwords

### Importing the dataset

In [2]:
df = pd.read_csv('London_hotel_reviews.csv', encoding = "ISO-8859-1")
print(df.shape)
df.head()

(27330, 6)


Unnamed: 0,Property_Name,Review_Rating,Review_Title,Review_Text,Location_Of_The_Reviewer,Date_Of_Review
0,Apex London Wall Hotel,5,Ottima qualità prezzo,Siamo stati a Londra per un week end ed abbiam...,"Casale Monferrato, Italy",10/20/2012
1,Corinthia Hotel London,5,"By far, my best hotel in the world",I had a pleasure of staying in this hotel for ...,"Savannah, Georgia",3/23/2016
2,The Savoy,5,First visit to the American Bar at the Savoy,A very lovely first visit to this iconic hotel...,London,7/30/2013
3,Rhodes Hotel,4,Nice stay,3 of us stayed at the Rhodes Hotel for 4 night...,"Maui, Hawaii",06-02-2012
4,The Savoy,5,Perfection,Form the moment we arrived until we left we ex...,"London, United Kingdom",11/24/2017


In [3]:
print(df.isna().sum(), end = '\n\n')
df[df.isnull().any(axis=1)].head()

Property_Name                  0
Review_Rating                  0
Review_Title                   0
Review_Text                    0
Location_Of_The_Reviewer    3953
Date_Of_Review                 1
dtype: int64



Unnamed: 0,Property_Name,Review_Rating,Review_Title,Review_Text,Location_Of_The_Reviewer,Date_Of_Review
5,Corinthia Hotel London,1,Staff stole from me!!,Well I am no strange to London's 5star hotels ...,,03-01-2013
24,Mondrian London at Sea Containers,5,"Fantastic nights stay, one of the best hotels ...",My partner and I found this hotel by chance an...,,6/20/2015
34,Mondrian London at Sea Containers,5,Just as good second time around!,After an amazing experience the first time we ...,,10/15/2015
37,The Rembrandt,5,Good Hotel - Great Area,Very good hotel in a lovely area. Handy to the...,,11-04-2016
45,Apex London Wall Hotel,4,Eccellente,"In tutto, dalla struttura nuova in stile moder...",,4/30/2013


In [4]:
df[df["Date_Of_Review"].isnull()]

Unnamed: 0,Property_Name,Review_Rating,Review_Title,Review_Text,Location_Of_The_Reviewer,Date_Of_Review
6556,The Lanesborough,4,<U+0412> <U+043F><U+043E><U+0434><U+0440><U+04...,<U+041E><U+0442><U+0435><U+043B><U+044C> : | <...,,


In [5]:
print(len(df[df['Review_Title'].str.contains("<U")]), 'reviews that are probably gibberish.')
df[df['Review_Title'].str.contains("<U")].head()

431 reviews that are probably gibberish.


Unnamed: 0,Property_Name,Review_Rating,Review_Title,Review_Text,Location_Of_The_Reviewer,Date_Of_Review
90,The Dorchester,5,<U+0391><U+03B3><U+03B3><U+03BB><U+03B9><U+03B...,e<U+03B9><U+03BD>a<U+03B9> e<U+03BD>a<U+03C2> ...,"Athens, Greece",7/18/2017
174,Corinthia Hotel London,5,<U+041B><U+0443><U+0447><U+0448><U+0438><U+043...,<U+0423><U+0440><U+043E><U+0432><U+0435><U+043...,Zurich,2/17/2016
178,"Mandarin Oriental Hyde Park, London",5,<U+512A><U+96C5><U+306A><U+6642><U+9593><U+304...,<U+5148><U+6708><U+3001><U+304A><U+98DF><U+4E8...,"Aichi Prefecture, Japan",11/20/2015
328,The Savoy,5,<U+C544><U+B984><U+B2F5><U+ACE0> <U+C6B0><U+C5...,<U+CE5C><U+C808><U+D558><U+ACE0> <U+C720><U+CF...,,5/15/2017
364,45 Park Lane - Dorchester Collection,5,<U+041C><U+043E><U+0434><U+043D><U+044B><U+043...,<U+041E><U+0442><U+0435><U+043B><U+044C> <U+04...,"Moscow, Russia",09-01-2015


In [6]:
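# Keep only the rows whose title has no "<U" marker; copy to avoid chained-assignment warnings later
df = df[~df['Review_Title'].str.contains("<U")].copy()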
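
The `<U+XXXX>` markers in these titles look like escaped Unicode code points rather than random noise, which suggests the reviews were written in non-Latin scripts and escaped somewhere during export. The notebook simply drops those rows. As a minimal sketch under that assumption, the escapes could be decoded first to inspect the text. The helper name `decode_unicode_escapes` is hypothetical and not part of the original notebook.

```python
import re

# Matches markers such as <U+041E>: 4 to 6 hexadecimal digits after "U+"
ESCAPE_PATTERN = re.compile(r'<U\+([0-9A-Fa-f]{4,6})>')

def decode_unicode_escapes(text):
    """Replace each <U+XXXX> marker with the character it encodes."""
    return ESCAPE_PATTERN.sub(lambda m: chr(int(m.group(1), 16)), text)

# Sample taken from the start of a review text shown in the tables above
sample = '<U+041E><U+0442><U+0435><U+043B><U+044C>'
decoded = decode_unicode_escapes(sample)
print(decoded)
```

Text without markers passes through unchanged, so the function could be applied to a whole column with `df['Review_Title'].map(decode_unicode_escapes)` before deciding which rows to keep.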

In [7]:
print(df.shape)
df.head()

(26899, 6)


Unnamed: 0,Property_Name,Review_Rating,Review_Title,Review_Text,Location_Of_The_Reviewer,Date_Of_Review
0,Apex London Wall Hotel,5,Ottima qualità prezzo,Siamo stati a Londra per un week end ed abbiam...,"Casale Monferrato, Italy",10/20/2012
1,Corinthia Hotel London,5,"By far, my best hotel in the world",I had a pleasure of staying in this hotel for ...,"Savannah, Georgia",3/23/2016
2,The Savoy,5,First visit to the American Bar at the Savoy,A very lovely first visit to this iconic hotel...,London,7/30/2013
3,Rhodes Hotel,4,Nice stay,3 of us stayed at the Rhodes Hotel for 4 night...,"Maui, Hawaii",06-02-2012
4,The Savoy,5,Perfection,Form the moment we arrived until we left we ex...,"London, United Kingdom",11/24/2017


In [8]:
df['Complete_Review'] = df['Review_Title'] + ' ' + df['Review_Text']
df.loc[df['Review_Rating'] > 4, 'Good_Review'] = 1
df.loc[df['Review_Rating'] <= 4, 'Good_Review'] = 0
print(sum(df['Good_Review'] == 1) / len(df['Good_Review']) * 100, 'percent of reviews are good (5 star).')

67.1251719394773 percent of reviews are good (5 star).


In [9]:
print(df.isna().sum(), end = '\n\n')
df[df.isnull().any(axis=1)].head()

Property_Name                  0
Review_Rating                  0
Review_Title                   0
Review_Text                    0
Location_Of_The_Reviewer    3839
Date_Of_Review                 0
Complete_Review                0
Good_Review                    0
dtype: int64



Unnamed: 0,Property_Name,Review_Rating,Review_Title,Review_Text,Location_Of_The_Reviewer,Date_Of_Review,Complete_Review,Good_Review
5,Corinthia Hotel London,1,Staff stole from me!!,Well I am no strange to London's 5star hotels ...,,03-01-2013,Staff stole from me!! Well I am no strange to ...,0.0
24,Mondrian London at Sea Containers,5,"Fantastic nights stay, one of the best hotels ...",My partner and I found this hotel by chance an...,,6/20/2015,"Fantastic nights stay, one of the best hotels ...",1.0
34,Mondrian London at Sea Containers,5,Just as good second time around!,After an amazing experience the first time we ...,,10/15/2015,Just as good second time around! After an amaz...,1.0
37,The Rembrandt,5,Good Hotel - Great Area,Very good hotel in a lovely area. Handy to the...,,11-04-2016,Good Hotel - Great Area Very good hotel in a l...,1.0
45,Apex London Wall Hotel,4,Eccellente,"In tutto, dalla struttura nuova in stile moder...",,4/30/2013,"Eccellente In tutto, dalla struttura nuova in ...",0.0


In [10]:
df.shape

(26899, 8)

### Cleaning the texts

In [11]:
features = df['Complete_Review'].values  # column 6: review title plus review text

In [12]:
Processed_features = []
for text in features:
    # Replace every non-word character (punctuation, symbols) with a space
    processed_feature = re.sub(r'\W', ' ', str(text))

    # Remove single characters surrounded by whitespace
    processed_feature = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature)

    # Remove a single character at the start of the text
    processed_feature = re.sub(r'^[a-zA-Z]\s+', ' ', processed_feature)

    # Collapse runs of whitespace into a single space
    processed_feature = re.sub(r'\s+', ' ', processed_feature)

    # Remove a leading 'b' left over from byte-string representations
    processed_feature = re.sub(r'^b\s+', '', processed_feature)

    # Convert to lowercase
    processed_feature = processed_feature.lower()

    Processed_features.append(processed_feature)
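
The steps described in the comments above can be sketched as a single function and tried on one string to see their effect. The function name `clean_text` is an assumption made for illustration; the sample is the title and opening words of one of the reviews shown earlier.

```python
import re

def clean_text(text):
    """Apply the cleaning steps in order and return the cleaned string."""
    text = re.sub(r'\W', ' ', str(text))            # non-word characters become spaces
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)     # drop isolated single letters
    text = re.sub(r'^[a-zA-Z]\s+', ' ', text)       # drop a single letter at the start
    text = re.sub(r'\s+', ' ', text)                # collapse repeated whitespace
    text = re.sub(r'^b\s+', '', text)               # drop a leading byte-string 'b'
    return text.lower()

sample = "Staff stole from me!! Well I am no strange to London's 5star hotels"
cleaned = clean_text(sample)
print(cleaned)
```

The punctuation disappears, and so do the isolated letters: the standalone "I" and the "s" left behind by "London's".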

### Creating the Bag of Words model

In [13]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 3000)
X = cv.fit_transform(Processed_features).toarray()
y = df.iloc[:, 7].values

In [14]:
len(X)

26899

In [15]:
len(y)

26899
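
To make the Bag of Words idea concrete, here is a simplified pure-Python sketch of what a count-based vectorizer does: tokenize, keep the most frequent tokens, and count them per document. It is an illustration only, not scikit-learn's implementation; the function name `bag_of_words` and the toy documents are assumptions.

```python
from collections import Counter
import re

def bag_of_words(docs, max_features):
    """Build a count matrix over the most frequent tokens in the corpus."""
    # Tokens of two or more word characters, lowercased
    tokenized = [re.findall(r'\b\w\w+\b', doc.lower()) for doc in docs]
    # Total frequency of every token across the whole corpus
    totals = Counter(token for tokens in tokenized for token in tokens)
    # Keep the max_features most frequent tokens, stored alphabetically
    vocabulary = sorted(token for token, _ in totals.most_common(max_features))
    # One row per document, one column per vocabulary token
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[token] for token in vocabulary])
    return vocabulary, matrix

docs = ['great hotel great staff', 'bad hotel staff', 'great location hotel']
vocabulary, matrix = bag_of_words(docs, max_features=3)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row plays the role of one row of `X`: word order is discarded and only the counts of the retained words remain. With `max_features = 3000`, each review becomes a vector of 3000 counts.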

### Splitting the dataset into the Training set and Test set

Next, split the data into a training set and a test set, with 80% of the data used for training and 20% for testing. The model will be trained on the training data and evaluated on the test data.

In [16]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
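
As a quick sanity check on the split, the expected set sizes can be worked out by hand. This sketch assumes the test share is rounded up to a whole number of rows, which is my understanding of how `train_test_split` treats a fractional `test_size`; the variable names are illustrative.

```python
import math

n_samples = 26899   # rows left after filtering
test_size = 0.2     # fraction held out for testing

n_test = math.ceil(test_size * n_samples)   # test share, rounded up
n_train = n_samples - n_test                # everything else is used for training
print(n_train, n_test)
```

Under that assumption the test set holds 5380 reviews, so the confusion matrix computed later should sum to the same number.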

### Training the Naive Bayes model on the Training set

In [17]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

### Predicting the Test set results

In [18]:
y_pred = classifier.predict(X_test)

### Making the Confusion Matrix

In [19]:
from sklearn.metrics import confusion_matrix, accuracy_score

print(confusion_matrix(y_test,y_pred))
print('accuracy score', accuracy_score(y_test, y_pred))

[[1029  703]
 [ 584 3064]]
accuracy score 0.7607806691449814
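
The accuracy figure can be reproduced from the confusion matrix, along with a few other numbers that help put it in context. This is a minimal sketch assuming scikit-learn's usual layout (rows are true labels, columns are predicted labels, with label 0 first); the variable names are illustrative.

```python
# Confusion matrix reported above: rows are true labels, columns are predictions
tn, fp = 1029, 703   # true label 0 (rating of 4 or less)
fn, tp = 584, 3064   # true label 1 (5-star review)

total = tn + fp + fn + tp
accuracy = (tp + tn) / total       # share of all predictions that are correct
precision = tp / (tp + fp)         # predicted 5-star reviews that really are 5-star
recall = tp / (tp + fn)            # actual 5-star reviews that the model found
baseline = (tp + fn) / total       # accuracy of always predicting the majority class

print(f'accuracy  {accuracy:.4f}')
print(f'precision {precision:.4f}')
print(f'recall    {recall:.4f}')
print(f'baseline  {baseline:.4f}')
```

Comparing accuracy against the majority-class baseline is useful here because roughly two thirds of the reviews are 5-star, so a model that always predicted "good" would already be right about 68% of the time on this test set.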
