### Sentiment Analysis on Hotel Reviews

#### Objective:
The goal of this project is to classify the sentiment of hotel reviews using natural language processing (NLP) techniques. The dataset comprises user reviews labeled as 'happy' or 'not happy', and the model is trained to predict that label from the review text.

##### Dataset Description:
Each entry in the dataset has five attributes: User_ID, Description (the review text), Browser_Used, Device_Used, and Is_Response (the sentiment label, 'happy' or 'not happy').

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Let's import the dataset
df = pd.read_csv('HotelReview.csv', encoding='latin-1')

In [3]:
df.head()

Unnamed: 0,User_ID,Description,Browser_Used,Device_Used,Is_Response
0,id10326,The room was kind of clean but had a VERY stro...,Edge,Mobile,not happy
1,id10327,I stayed at the Crown Plaza April -- - April -...,Internet Explorer,Mobile,not happy
2,id10328,I booked this hotel through Hotwire at the low...,Mozilla,Tablet,not happy
3,id10329,Stayed here with husband and sons on the way t...,InternetExplorer,Desktop,happy
4,id10330,My girlfriends and I stayed here to celebrate ...,Edge,Tablet,not happy


In [4]:
# Let's keep only the relevant columns (the review text and the sentiment label)
df = df.drop(columns=['User_ID', 'Browser_Used', 'Device_Used'])

In [5]:
df

Unnamed: 0,Description,Is_Response
0,The room was kind of clean but had a VERY stro...,not happy
1,I stayed at the Crown Plaza April -- - April -...,not happy
2,I booked this hotel through Hotwire at the low...,not happy
3,Stayed here with husband and sons on the way t...,happy
4,My girlfriends and I stayed here to celebrate ...,not happy
...,...,...
38927,We arrived late at night and walked in to a ch...,happy
38928,The only positive impression is location and p...,not happy
38929,Traveling with friends for shopping and a show...,not happy
38930,The experience was just ok. We paid extra for ...,not happy


In [6]:
# Checking for any missing data

In [7]:
df.isna().sum()

Description    0
Is_Response    0
dtype: int64

In [8]:
duplicates=df.duplicated().sum()
duplicates

0

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38932 entries, 0 to 38931
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Description  38932 non-null  object
 1   Is_Response  38932 non-null  object
dtypes: object(2)
memory usage: 608.4+ KB


In [10]:
# Now let's perform all the preprocessing steps

In [11]:
# As a first step in preprocessing, let's convert the reviews to lowercase

In [12]:
df['Description']=df['Description'].str.lower()
df

Unnamed: 0,Description,Is_Response
0,the room was kind of clean but had a very stro...,not happy
1,i stayed at the crown plaza april -- - april -...,not happy
2,i booked this hotel through hotwire at the low...,not happy
3,stayed here with husband and sons on the way t...,happy
4,my girlfriends and i stayed here to celebrate ...,not happy
...,...,...
38927,we arrived late at night and walked in to a ch...,happy
38928,the only positive impression is location and p...,not happy
38929,traveling with friends for shopping and a show...,not happy
38930,the experience was just ok. we paid extra for ...,not happy


In [13]:
# Let's remove HTML tags, if any

In [14]:
import re

In [15]:
def remove_html_tags(text):
    # Non-greedy match of everything between '<' and the next '>'
    pattern = re.compile(r'<.*?>')
    return pattern.sub('', text)

In [16]:
df['Description']=df['Description'].apply(remove_html_tags)
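As a quick sanity check on the pattern, here is a minimal sketch that applies the same regular expression to two made-up strings (the sample texts are assumptions, not taken from the dataset). It also shows a side effect worth knowing about: the non-greedy `<.*?>` removes any ordinary text that happens to sit between a `<` and a later `>` on the same line.

```python
import re

# Same pattern as in the notebook, compiled once
TAG_PATTERN = re.compile(r'<.*?>')

def remove_html_tags(text):
    return TAG_PATTERN.sub('', text)

# Made-up review containing HTML tags
sample = "great stay<br/>would <b>definitely</b> return"
cleaned = remove_html_tags(sample)
print(cleaned)

# Made-up text where '<' and '>' are comparison signs, not tags
comparison = remove_html_tags("price < 100 and rating > 4")
print(comparison)
```

The tags are removed without inserting a space, so words on either side of a tag such as `<br/>` end up joined.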

In [17]:
# Let's remove all punctuation

In [18]:
import string

In [19]:
punctuations=string.punctuation
punctuations

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [20]:
def remove_punctuation(text):
    return text.translate(str.maketrans('','',punctuations))

In [21]:
df['Description']=df['Description'].apply(remove_punctuation)

In [22]:
df['Description'][5]

'we had  rooms one was very nice and clearly had been updated more recently than the other the other was clean and the bed was comfy but it needed some updating carpet was old and wrinkled for example great location for visiting inner harbor getting to fells point orioles games etc supershuttle from bwi worked great both ways tv remotes in both rooms were terrible but we didnt watch much tv so not a big deal wireless was sketchy on th and th floors but again didnt need it much  we were on vacation so it didnt really matter breakfast was good each morning would stay again if in town'

In [23]:
# The output above shows that all punctuation has been removed.
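Deleting punctuation outright also glues together the parts of contractions and hyphenated words, which is why the sample above contains "didnt". The sketch below contrasts the notebook's approach with a hypothetical alternative, `punctuation_to_space`, that replaces punctuation with spaces instead. The sample sentence and the helper name are assumptions made for illustration.

```python
import string

def remove_punctuation(text):
    # Same approach as in the notebook: delete every punctuation character
    return text.translate(str.maketrans('', '', string.punctuation))

def punctuation_to_space(text):
    # Hypothetical alternative: map each punctuation character to a space,
    # then collapse repeated whitespace
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    return ' '.join(text.translate(table).split())

sample = "check-in was slow, but we didn't mind"
deleted = remove_punctuation(sample)
spaced = punctuation_to_space(sample)
print(deleted)
print(spaced)
```

Neither variant is strictly better: deletion keeps "didnt" as one token, while the space variant keeps "check" and "in" apart but leaves a stray "t".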

In [24]:
# Now let's remove the stop words, since they contribute little to predicting the target response.

In [25]:
import nltk
nltk.download('stopwords')  # one-time download of the stop word corpus

In [26]:
from nltk.corpus import stopwords

In [27]:
stop_words=set(stopwords.words('english'))

In [28]:
def remove_stop_words(text):
    new_text = []

    for word in text.split():
        if word not in stop_words:
            new_text.append(word)

    return " ".join(new_text)

In [29]:
df['Description']=df['Description'].apply(remove_stop_words)
df

Unnamed: 0,Description,Is_Response
0,room kind clean strong smell dogs generally av...,not happy
1,stayed crown plaza april april staff friendly ...,not happy
2,booked hotel hotwire lowest price could find g...,not happy
3,stayed husband sons way alaska cruise loved ho...,happy
4,girlfriends stayed celebrate th birthdays plan...,not happy
...,...,...
38927,arrived late night walked checkin area complet...,happy
38928,positive impression location public parking op...,not happy
38929,traveling friends shopping show location great...,not happy
38930,experience ok paid extra view pool got view pa...,not happy


In [30]:
# All the stop words have been removed
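One caveat for sentiment tasks: NLTK's English stop word list includes negation words such as "not" and "no", so removing every stop word can erase the difference between "clean" and "not clean". The sketch below uses a small hand-written stop word set as a stand-in for the NLTK list, and a hypothetical `negations` set that is kept in the text. Both sets are assumptions for illustration.

```python
# Small hand-written stand-in for the NLTK English list (assumption)
stop_words = {"the", "was", "were", "and", "not", "no", "is"}
# Hypothetical set of negation words to keep in the text
negations = {"not", "no"}

def remove_stop_words(text, stop_set):
    # Keep only the words that are not in the given stop word set
    return " ".join(word for word in text.split() if word not in stop_set)

review = "the room was not clean"
all_removed = remove_stop_words(review, stop_words)
negation_kept = remove_stop_words(review, stop_words - negations)
print(all_removed)
print(negation_kept)
```

Whether keeping negations helps this particular model would need to be checked by retraining and comparing the test metrics.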

In [31]:
# Let's separate X and y before training our model

X=df['Description']
y=df['Is_Response']

In [32]:
X

0        room kind clean strong smell dogs generally av...
1        stayed crown plaza april april staff friendly ...
2        booked hotel hotwire lowest price could find g...
3        stayed husband sons way alaska cruise loved ho...
4        girlfriends stayed celebrate th birthdays plan...
                               ...                        
38927    arrived late night walked checkin area complet...
38928    positive impression location public parking op...
38929    traveling friends shopping show location great...
38930    experience ok paid extra view pool got view pa...
38931    westin wonderfully restored grande dame hotel ...
Name: Description, Length: 38932, dtype: object

In [33]:
from sklearn.model_selection import train_test_split
# stratify=y keeps the class proportions of y the same in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=2)
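To make the effect of stratification concrete, here is a hand-rolled sketch of a stratified split on toy labels. It is not scikit-learn's implementation; the function name `stratified_split` and the 70/30 toy class balance are assumptions chosen for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=2):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    # Split each class separately so both sets keep the same class ratio
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy labels: 70 'happy' and 30 'not happy'
labels = ["happy"] * 70 + ["not happy"] * 30
train_idx, test_idx = stratified_split(labels)
train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts, test_counts)
```

Both sets keep the 70/30 ratio of the full label list, which is what `stratify=y` asks `train_test_split` to do.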

In [34]:
# Let's perform feature extraction, i.e. convert the review text to numerical vectors

In [35]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

tvec = TfidfVectorizer()
clf = LogisticRegression(solver="lbfgs")

In [36]:
model = Pipeline([('vectorizer',tvec),('classifier',clf)])

model.fit(X_train, y_train)


from sklearn.metrics import confusion_matrix

predictions = model.predict(X_test)

# scikit-learn expects the true labels first, then the predictions
confusion_matrix(y_test, predictions)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


array([[4995,  310],
       [ 593, 1889]], dtype=int64)
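The convergence warning in the output above means the lbfgs solver stopped at its iteration limit (the default `max_iter` is 100) before converging. A minimal sketch of one way to address it is to raise `max_iter`. The toy corpus and the value 1000 are assumptions for illustration, and the effect on the hotel review model would need to be checked by refitting on the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy corpus (made up for illustration)
texts = ["room clean staff friendly", "dirty room rude staff",
         "lovely breakfast great view", "noisy room broken shower"]
labels = ["happy", "not happy", "happy", "not happy"]

model = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    # max_iter=1000 is an assumed value; the default is 100
    ("classifier", LogisticRegression(solver="lbfgs", max_iter=1000)),
])
model.fit(texts, labels)

# n_iter_ reports how many iterations the solver actually used
iterations = model.named_steps["classifier"].n_iter_
print(iterations)
```

If `n_iter_` stays below `max_iter`, the solver converged within the limit and the warning is not raised.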

In [37]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

# scikit-learn metrics take the true labels first, then the predictions
print(f"Accuracy : {accuracy_score(y_test, predictions):.4f}")
print(f"Precision : {precision_score(y_test, predictions, average='weighted'):.4f}")
print(f"Recall : {recall_score(y_test, predictions, average='weighted'):.4f}")

Accuracy : 0.8840
Precision : 0.8828
Recall : 0.8840
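The weighted averages hide how the two classes differ. As a rough sketch, the per-class figures can be recomputed by hand from the four counts in the confusion matrix above: 4995 'happy' reviews predicted 'happy', 310 'happy' reviews predicted 'not happy', 593 'not happy' reviews predicted 'happy', and 1889 'not happy' reviews predicted 'not happy'. The variable names are illustrative.

```python
# Counts read from the confusion matrix above
tp_happy = 4995      # true 'happy', predicted 'happy'
fn_happy = 310       # true 'happy', predicted 'not happy'
fp_happy = 593       # true 'not happy', predicted 'happy'
tp_not_happy = 1889  # true 'not happy', predicted 'not happy'

total = tp_happy + fn_happy + fp_happy + tp_not_happy
accuracy = (tp_happy + tp_not_happy) / total

# Recall: share of each true class that the model recovers
recall_happy = tp_happy / (tp_happy + fn_happy)
recall_not_happy = tp_not_happy / (tp_not_happy + fp_happy)

# Precision: share of each predicted class that is correct
precision_happy = tp_happy / (tp_happy + fp_happy)
precision_not_happy = tp_not_happy / (tp_not_happy + fn_happy)

# Accuracy of always predicting the majority class 'happy'
majority_baseline = (tp_happy + fn_happy) / total

print(f"accuracy            {accuracy:.3f}")
print(f"recall happy        {recall_happy:.3f}")
print(f"recall not happy    {recall_not_happy:.3f}")
print(f"precision happy     {precision_happy:.3f}")
print(f"precision not happy {precision_not_happy:.3f}")
print(f"majority baseline   {majority_baseline:.3f}")
```

On these counts the model recovers about 94% of the 'happy' reviews but only about 76% of the 'not happy' ones, so the minority class is where most of the errors fall.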


In [38]:
# The model reaches about 88% accuracy on the test set, compared with about 68% for always predicting 'happy'

#### Model Evaluation

In [39]:
# Let's check the predicted sentiment for new, unseen reviews

In [40]:
Review1 = ["Rooms were not cleaned properly"]
result = model.predict(Review1)

print(result)

['not happy']


In [41]:
Review2 = ["location of the hotel is too far from airport"]
result = model.predict(Review2)

print(result)

['not happy']


In [42]:
Review3 = ["breakfast offered was delicious"]
result = model.predict(Review3)

print(result)

['happy']
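The three reviews above are passed to the model as raw text, while the training reviews were lowercased and stripped of HTML tags, punctuation and stop words. `TfidfVectorizer` lowercases and tokenizes on its own by default, but it does not remove stop words unless told to. The sketch below chains the notebook's cleaning steps into one hypothetical `preprocess` helper so that new reviews get the same treatment as the training data. The small stop word set is a stand-in for the NLTK list and is an assumption.

```python
import re
import string

# Small stand-in for the NLTK English stop word list (assumption)
stop_words = {"of", "the", "is", "too", "from", "was", "were"}

def preprocess(text):
    """Apply the same cleaning steps used on the training reviews."""
    text = text.lower()
    text = re.sub(r'<.*?>', '', text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    return " ".join(w for w in text.split() if w not in stop_words)

new_review = "Location of the hotel is too far from airport!"
cleaned = preprocess(new_review)
print(cleaned)

# With the fitted pipeline from the notebook, the call would then be:
# model.predict([cleaned])
```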


#### Saving the file

In [45]:
import pickle

In [47]:
with open('classifier.pkl','wb') as file:
    pickle.dump(model,file)
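Loading the saved model mirrors the saving step. The sketch below shows the round trip with a stand-in object and a temporary directory so that it runs on its own. In the notebook, the object would be the fitted pipeline `model` and the path would be `classifier.pkl`.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the fitted pipeline `model`
stand_in_model = {"name": "classifier", "labels": ["happy", "not happy"]}

# Temporary location so the sketch does not overwrite an existing file
path = os.path.join(tempfile.mkdtemp(), "classifier.pkl")
with open(path, "wb") as file:
    pickle.dump(stand_in_model, file)

# Loading uses mode 'rb' and pickle.load
with open(path, "rb") as file:
    loaded = pickle.load(file)

print(loaded == stand_in_model)
```

Only unpickle files from a trusted source, and load a pickled scikit-learn model with the same scikit-learn version that was used to save it.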