## Trip Advisor Hotel Review Prediction

### References:

1. https://www.kaggle.com/ruchi798/how-do-you-recognize-fake-news

2. https://www.kaggle.com/anokas/data-analysis-xgboost-starter-0-35460-lb

In this kernel, I analyze the hotel reviews from the Trip Advisor dataset. Ratings range from 1 to 5, and the dataset contains 20,491 reviews.

#### Approach:

I plan to run an n-gram and word cloud analysis to find out whether the reviews can be easily differentiated by rating. I then plan to build basic models: TF-IDF or count-vectorizer features with either logistic regression or a random forest. This is a multi-class classification problem.

### Loading the data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
from collections import Counter
import string
from nltk.corpus import stopwords
import spacy
from wordcloud import WordCloud

In [2]:
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer
from sklearn.linear_model import LogisticRegression
# plot_confusion_matrix is not used below and was removed in scikit-learn 1.2
from sklearn.metrics import log_loss,f1_score,confusion_matrix
from scipy.sparse import hstack,csr_matrix
from tqdm import tqdm
import operator

In [3]:
data=pd.read_csv('../data/tripadvisor_hotel_reviews.csv')
#data=pd.read_csv('../input/trip-advisor-hotel-reviews/tripadvisor_hotel_reviews.csv')

In [None]:
data.head()

In [None]:
data.shape

### Data Cleaning and handling missing values (if any)

In [None]:
data.isna().sum()

There are no null values in either of the columns.

In [None]:
plt.figure(figsize=(8,8))
sns.countplot(x='Rating',data=data)  # keyword arguments; recent seaborn versions no longer accept the column positionally
plt.title('Ratings Count in the dataset',fontsize=15)
plt.xlabel('Rating',fontsize=8)
plt.ylabel('Count',fontsize=8)

In [None]:
(data['Rating'].value_counts()/data.shape[0])*100

In [None]:
list(data['Review'])[:3]

About 44% of the reviews have a rating of 5 and about 29% have a rating of 4, so the classes are imbalanced.
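With classes this imbalanced, accuracy alone can be misleading, which is one reason to score models with a macro-averaged F1. The following is a minimal sketch on toy labels: only the 44% and 29% shares come from the counts above, the split of the remaining 27% is an assumption made purely for illustration, and `macro_f1` is a hypothetical helper written in plain Python.

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels: 44 and 29 mirror the shares above, the rest is assumed
y_true = [5] * 44 + [4] * 29 + [3] * 12 + [2] * 9 + [1] * 6

# Baseline that always predicts the most frequent rating
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy={accuracy:.2f}, macro F1={score:.3f}")
```

The always-predict-5 baseline reaches 44% accuracy but a much lower macro F1, because four of the five classes score zero.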

## n-gram Analysis

In [None]:
stopwrds=set(stopwords.words("english"))

In [None]:
#https://www.kaggle.com/ruchi798/how-do-you-recognize-fake-news
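def get_bigram(df,n):
    """Return the n most frequent bigrams in an iterable of texts as (bigram, count) pairs."""
    # stop_words is passed as a list; recent scikit-learn versions reject a set
    vec=CountVectorizer(ngram_range=(2,2),stop_words=list(stopwrds)).fit(df)
    bag_of_words=vec.transform(df)
    sum_words=bag_of_words.sum(axis=0)  # total count of each bigram over all texts
    word_freq=[(word,sum_words[0,idx]) for word,idx in vec.vocabulary_.items()]
    word_freq=sorted(word_freq,key=lambda x:x[1],reverse=True)
    return word_freq[:n]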

In [None]:
bigram_rat1=get_bigram(data.loc[data['Rating']==1,'Review'].apply(lambda x:" ".join(sent for sent in x.split() if sent not in ['did','not','hotel','room','does'])),10)
bigram_rat1

In [None]:
bigram_rat2=get_bigram(data.loc[data['Rating']==2,'Review'].apply(lambda x:" ".join(sent for sent in x.split() if sent not in ['did','not','hotel','room','does'])),10)
bigram_rat2

In [None]:
bigram_rat3=get_bigram(data.loc[data['Rating']==3,'Review'].apply(lambda x:" ".join(sent for sent in x.split() if sent not in ['did','not','hotel','room','does'])),10)
bigram_rat3

In [None]:
bigram_rat4=get_bigram(data.loc[data['Rating']==4,'Review'].apply(lambda x:" ".join(sent for sent in x.split() if sent not in ['did','not','hotel','room','does'])),10)
bigram_rat4

In [None]:
bigram_rat5=get_bigram(data.loc[data['Rating']==5,'Review'].apply(lambda x:" ".join(sent for sent in x.split() if sent not in ['did','not','hotel','room','does'])),10)
bigram_rat5

In [None]:
## Check for bigrams shared by the top-10 lists of the higher ratings (3-5)
set([x[0] for x in bigram_rat3])\
&set([x[0] for x in bigram_rat4])\
&set([x[0] for x in bigram_rat5])

In [None]:
## Check for bigrams shared by the top-10 lists of the lower ratings (1-3)
set([x[0] for x in bigram_rat1])&set([x[0] for x in bigram_rat2])&set([x[0] for x in bigram_rat3])


Analysis of the bigrams for each rating gives us the following insights:

1. Some bigrams are common to reviews of all ratings.

2. The bigrams great location, staff friendly, staff helpful, stayed nights and walking nights become more pronounced as the rating moves from 3 to 5, whereas stay away, customer service and make sure are among the few bigrams seen in reviews rated 1-3.

3. Punta Cana and San Juan are place names that show up in the reviews irrespective of the rating.
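The bigram counting used above can also be sketched without scikit-learn. The block below is a simplified illustration, not the notebook's method: `top_ngrams` is a hypothetical helper that splits on whitespace only, so its tokens differ from those of `CountVectorizer`, and the three reviews are made-up toy strings.

```python
from collections import Counter

def top_ngrams(texts, n, k, drop=frozenset()):
    """Return the k most frequent n-grams as (ngram, count) pairs."""
    counts = Counter()
    for text in texts:
        # Lowercase, split on whitespace, and remove unwanted tokens first
        tokens = [tok for tok in text.lower().split() if tok not in drop]
        # Slide a window of size n over the remaining tokens
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts.most_common(k)

# Toy reviews, not taken from the dataset
toy_reviews = [
    "great location friendly staff",
    "great location small room",
    "stay away terrible room",
]

result = top_ngrams(toy_reviews, n=2, k=1, drop={"room"})
print(result)
```

Passing `n=3` gives trigrams with the same function, and extending `drop` is how location names or other uninformative tokens can be filtered out before counting.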

For the trigram analysis, let's remove these common location names and check the corpus again.

In [None]:
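def get_trigram(df,n):
    """Return the n most frequent trigrams in an iterable of texts as (trigram, count) pairs."""
    # stop_words is passed as a list; recent scikit-learn versions reject a set
    vec=CountVectorizer(ngram_range=(3,3),stop_words=list(stopwrds)).fit(df)
    bag_of_words=vec.transform(df)
    sum_words=bag_of_words.sum(axis=0)  # total count of each trigram over all texts
    word_freq=[(word,sum_words[0,idx]) for word,idx in vec.vocabulary_.items()]
    word_freq=sorted(word_freq,key=lambda x:x[1],reverse=True)
    return word_freq[:n]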

In [None]:
trigram_rat1=get_trigram(data.loc[data['Rating']==1,'Review'].apply(lambda x:" ".join(sent for sent in x.split() if sent not in ['did','not','hotel','room','does','san','juan','punta','cana'])),10)
trigram_rat1

In [None]:
trigram_rat2=get_trigram(data.loc[data['Rating']==2,'Review'].apply(lambda x:" ".join(sent for sent in x.split() if sent not in ['did','not','hotel','room','does','san','juan','punta','cana'])),10)
trigram_rat2

In [None]:
trigram_rat3=get_trigram(data.loc[data['Rating']==3,'Review'].apply(lambda x:" ".join(sent for sent in x.split() if sent not in ['did','not','hotel','room','does','san','juan','punta','cana'])),10)
trigram_rat3

In [None]:
trigram_rat4=get_trigram(data.loc[data['Rating']==4,'Review'].apply(lambda x:" ".join(sent for sent in x.split() if sent not in ['did','not','hotel','room','does','san','juan','punta','cana'])),10)
trigram_rat4

In [None]:
trigram_rat5=get_trigram(data.loc[data['Rating']==5,'Review'].apply(lambda x:" ".join(sent for sent in x.split() if sent not in ['did','not','hotel','room','does','san','juan','punta','cana'])),10)
trigram_rat5

In [None]:
## Check for trigrams shared by the top-10 lists of the higher ratings (3-5)
set([x[0] for x in trigram_rat3])\
&set([x[0] for x in trigram_rat4])\
&set([x[0] for x in trigram_rat5])

In [None]:
## Check for trigrams shared by the top-10 lists of the lower ratings (1-3)
set([x[0] for x in trigram_rat1])\
&set([x[0] for x in trigram_rat2])\
&set([x[0] for x in trigram_rat3])

Similar to the bigram analysis, the trigram analysis gives us important insights:

1. The top-10 trigram lists of the highest-rated reviews share several entries; 10 minute walk, flat screen tv and staff friendly helpful are a few.

2. King size bed is a trigram that appears in reviews across all ratings.

The n-gram analysis shows that many n-grams appear in reviews of every rating. This is a challenge because it may make it difficult for our model to distinguish between the ratings.
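One way to put a number on this overlap is the Jaccard similarity between the top n-gram sets of two ratings. This is a rough sketch: the sets below are toy examples assembled from n-grams mentioned in the text, not the actual top-10 lists, and `jaccard` is a hypothetical helper.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 for two empty sets)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Toy top n-gram sets per rating (illustrative only)
top_by_rating = {
    1: {"stay away", "customer service", "punta cana"},
    5: {"great location", "staff friendly", "punta cana"},
}

overlap = jaccard(top_by_rating[1], top_by_rating[5])
print(f"Jaccard overlap between rating 1 and rating 5: {overlap:.2f}")
```

A value near 1 would mean two ratings share almost all of their most frequent n-grams, which is the situation that makes them hard to separate with bag-of-words features.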

Let's see whether the length, number of words, etc. differ across the ratings.

### Feature Engineering

In [None]:
data['length']=data['Review'].apply(lambda x:len(x.split()))
data['num_chars']=data['Review'].apply(lambda x:len(str(x)))
data['num_punctuations']=data['Review'].apply(lambda x:len([c for c in x if c in string.punctuation]))  # test each character c, not the whole review x
data['num_stopwords']=data['Review'].apply(lambda x:len([c for c in str(x).lower().split() if c in stopwrds]))

In [None]:
plt.figure(figsize=(15,10))
plt.subplot(2,2,1)
sns.boxplot(x='Rating',y='length',data=data,palette=sns.color_palette('colorblind'))
plt.title('Distribution of Length by Rating',fontsize=15)
plt.xlabel('Rating',fontsize=8)
plt.ylabel('Length',fontsize=8)
plt.subplot(2,2,2)
sns.boxplot(x='Rating',y='num_chars',data=data,palette=sns.color_palette('colorblind'))
plt.title('Distribution of Number of Characters by Rating',fontsize=15)
plt.xlabel('Rating',fontsize=8)
plt.ylabel('Num Chars',fontsize=8)
plt.subplot(2,2,3)
sns.boxplot(x='Rating',y='num_punctuations',data=data,palette=sns.color_palette('colorblind'))
plt.title('Distribution of Num Punctuations by Rating',fontsize=15)
plt.xlabel('Rating',fontsize=8)
plt.ylabel('Num Punctuations',fontsize=8)
plt.subplot(2,2,4)
sns.boxplot(x='Rating',y='num_stopwords',data=data,palette=sns.color_palette('colorblind'))
plt.title('Distribution of Stopwords by Rating',fontsize=15)
plt.xlabel('Rating',fontsize=8)
plt.ylabel('Num Stopwords',fontsize=8)

* The median review length differs slightly across ratings, but the difference is not pronounced. Rating 5 has a lower median than the other ratings.

* Similarly, the median number of characters is lower for rating 5 than for the other ratings, but the difference is hard to distinguish.

* An empty punctuation plot means that every count is zero, i.e. no review appears to contain punctuation. That is suspicious and may point to the counting step rather than the data: the count is only meaningful if each character `c`, rather than the whole review `x`, is tested against `string.punctuation`.

* The stopword counts contain many outliers for every rating. Reviewers who give a rating of 5 use more stopwords than those who give other ratings.
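The hand-crafted features above can be checked on a single string before applying them to the whole column. This is a minimal sketch in plain Python; `text_features` is a hypothetical helper and the sample review is made up.

```python
import string

PUNCTUATION = set(string.punctuation)

def text_features(review):
    """Compute simple length-based features for one review string."""
    return {
        "length": len(review.split()),              # number of whitespace-separated words
        "num_chars": len(review),                   # number of characters, spaces included
        # Count characters that are punctuation marks, one character at a time
        "num_punctuations": sum(ch in PUNCTUATION for ch in review),
    }

features = text_features("nice hotel, great staff!")
print(features)
```

If the punctuation count is still zero for every real review after a per-character test like this, the dataset itself was probably stripped of punctuation.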

## Basic Modelling

In [None]:
kf=StratifiedKFold(n_splits=5,random_state=42,shuffle=True)

In [None]:
feat=['length', 'num_chars', 'num_punctuations', 'num_stopwords']

In [None]:
encoding_dict={1:0,
              2:1,
              3:2,
              4:3,
              5:4}
data['Rating']=data['Rating'].map(encoding_dict)

In [None]:
data['Rating'].value_counts()

In [None]:
nlp=spacy.load('en_core_web_sm',disable=['ner','parser','tagger'])

def spacy_tokenizer(text):
    tokens=[x.text for x in nlp(text)]
    tokens=[tok.strip() for tok in tokens]
    ## remove most common terms identified from n-gram analysis,
    tokens=[tok for tok in tokens if tok!='' and tok not in ['did','not','hotel','room','does','san','juan','punta','cana']]
    return tokens

### Using TF-IDF and Random Forest

In [None]:
oof_preds_tfidf=np.zeros((len(data),1))
for i,(trn_idx,val_idx) in enumerate(kf.split(data['Review'],data['Rating'])):
    print(f'Fold {i+1} Training ...')
    train_x=data.iloc[trn_idx,].reset_index(drop=True)
    valid_x=data.iloc[val_idx,].reset_index(drop=True)
    train_y=data.iloc[trn_idx,1].values
    valid_y=data.iloc[val_idx,1].values
    
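    # token_pattern is ignored when a custom tokenizer is supplied, so it is set to None;
    # stop_words is passed as a list because recent scikit-learn versions reject a set
    word_vectorizer=TfidfVectorizer(analyzer='word',tokenizer=spacy_tokenizer,
                       token_pattern=None,
                       stop_words=list(stopwrds),
                      ngram_range=(1,3),max_features=8000)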
    
    word_vectorizer.fit(list(train_x['Review'].values))
    train_word_vec=word_vectorizer.transform(list(train_x['Review']))
    valid_word_vec=word_vectorizer.transform(list(valid_x['Review']))
    train_x_sparse=hstack((csr_matrix(train_x[feat]),train_word_vec))
    valid_x_sparse=hstack((csr_matrix(valid_x[feat]),valid_word_vec))
    rf=RandomForestClassifier(n_estimators=500,
                             max_depth=20,
                             max_features='sqrt',  # 'auto' meant 'sqrt' for classifiers and was removed in scikit-learn 1.3
                             min_samples_split=5,
                             bootstrap=True,
                             n_jobs=-1,
                             random_state=42,
                             verbose=False)
    rf.fit(train_x_sparse,train_y)
    preds=rf.predict(valid_x_sparse)
    score=f1_score(valid_y,preds,average='macro')
    print(f'Fold {i+1} f1 score {score}')
    oof_preds_tfidf[val_idx]=preds.reshape(-1,1)
oof_score_tfidf=f1_score(data['Rating'],oof_preds_tfidf.astype('int'),average='macro')
print(f'Overall OOF f1 score {oof_score_tfidf}')
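A single macro F1 number hides which ratings get confused with each other. `confusion_matrix` is already imported, and a sketch of how it could be applied to out-of-fold predictions follows. The arrays are toy values with three encoded classes, not results from this notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy encoded labels and predictions (illustrative only)
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

# Recall per class: correct predictions divided by the number of true samples
per_class_recall = cm.diagonal() / cm.sum(axis=1)

print(cm)
print(per_class_recall)
```

In this notebook the equivalent call would presumably take `data['Rating']` and `oof_preds_tfidf.astype('int').ravel()` as its two arguments.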

### Using Count Vectorizer and Random Forest

In [None]:
oof_preds_cv=np.zeros((len(data),1))
for i,(trn_idx,val_idx) in enumerate(kf.split(data['Review'],data['Rating'])):
    print(f'Fold {i+1} Training ...')
    train_x=data.iloc[trn_idx,].reset_index(drop=True)
    valid_x=data.iloc[val_idx,].reset_index(drop=True)
    train_y=data.iloc[trn_idx,1].values
    valid_y=data.iloc[val_idx,1].values
    
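    # token_pattern is ignored when a custom tokenizer is supplied, so it is set to None;
    # stop_words is passed as a list because recent scikit-learn versions reject a set
    count_vectorizer=CountVectorizer(analyzer='word',tokenizer=spacy_tokenizer,
                       token_pattern=None,
                       stop_words=list(stopwrds),
                      ngram_range=(1,3),max_features=8000)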
    
    count_vectorizer.fit(list(train_x['Review'].values))
    train_word_vec=count_vectorizer.transform(list(train_x['Review']))
    valid_word_vec=count_vectorizer.transform(list(valid_x['Review']))
    train_x_sparse=hstack((csr_matrix(train_x[feat]),train_word_vec))
    valid_x_sparse=hstack((csr_matrix(valid_x[feat]),valid_word_vec))
    rf=RandomForestClassifier(n_estimators=500,
                             max_depth=20,
                             max_features='sqrt',  # 'auto' meant 'sqrt' for classifiers and was removed in scikit-learn 1.3
                             min_samples_split=5,
                             bootstrap=True,
                             n_jobs=-1,
                             random_state=42,
                             verbose=False)
    rf.fit(train_x_sparse,train_y)
    preds=rf.predict(valid_x_sparse)
    score=f1_score(valid_y,preds,average='macro')
    print(f'Fold {i+1} f1 score {score}')
    oof_preds_cv[val_idx]=preds.reshape(-1,1)
oof_score_cv=f1_score(data['Rating'],oof_preds_cv.astype('int'),average='macro')
print(f'Overall OOF f1 score {oof_score_cv}')
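The approach section also mentions logistic regression, which is not fitted above. The following is a hedged sketch of how such a baseline might look, not a result from this notebook: it uses a scikit-learn pipeline with default word tokenization instead of `spacy_tokenizer`, only two classes, and eight made-up reviews, all of which are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline

# Toy reviews and binary labels (1 = positive, 0 = negative), illustrative only
texts = [
    "great location friendly staff", "lovely stay helpful staff",
    "great breakfast clean rooms", "friendly staff great view",
    "stay away dirty rooms", "terrible service rude staff",
    "dirty bathroom terrible smell", "rude reception stay away",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The vectorizer sits inside the pipeline, so it is refit on each training fold only
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)

cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
oof = cross_val_predict(model, texts, labels, cv=cv)  # out-of-fold predictions
score = f1_score(labels, oof, average="macro")
print(f"Toy OOF macro F1: {score:.3f}")
```

Keeping the vectorizer inside the pipeline mirrors what the manual loops above do by fitting on `train_x` only, which avoids leaking validation vocabulary into training.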