![tripadvisor](https://static.tacdn.com/img2/brand_refresh/Tripadvisor_lockup_horizontal_secondary_registered.svg)

# Tripadvisor Hotel Reviews

`This kernel is under construction 🔨`

Traveling is exciting! 

For the whole experience to be good, the hotel where we stay must be good enough to receive us.

The main idea behind this kernel is to explore Tripadvisor hotel reviews and build a model capable of predicting a hotel's rating from a new review :)

If you like it, please upvote!


# What's in this kernel?

* Exploratory Data Analysis (with [Plotly](https://plotly.com/))
* Data Cleaning 
* A simple LSTM model for sentiment analysis using PyTorch and GloVe pre-trained embeddings

In [None]:
import plotly.express as px
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt

import random as rnd
import re
import string
import operator
import numpy as np
import pandas as pd

from wordcloud import WordCloud

from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer, word_tokenize

# Exploratory Data Analysis

## Loading data

In [None]:
df = pd.read_csv('../input/trip-advisor-hotel-reviews/tripadvisor_hotel_reviews.csv')

In [None]:
print(f"There are {df.shape[0]} reviews in this dataset. Let's explore it!")

Most of the reviews have a 5-star rating, followed by 4, 3, and so on.

To make the visualizations easier to read, I will keep the color for each rating fixed throughout this notebook:

1. red
2. orange
3. yellow
4. light green
5. green

Please note: the graphs made with Plotly are interactive. You can zoom in and out and filter by rating.

In [None]:
color_map = {
                "1": "#EF553B",
                "2": "#FFA15A",
                "3": "#FECB52",
                "4": "#B6E880",
                "5": "#00CC96"
            }

In [None]:
# Name the columns explicitly: the defaults produced by value_counts().reset_index() differ between pandas versions
x = df['Rating'].value_counts().rename_axis('Rating').reset_index(name='Count')
fig = px.bar(x, 
             x="Rating", 
             y="Count", 
             color="Rating", 
             color_continuous_scale=["#EF553B", "#FFA15A", "#FECB52", "#B6E880", "#00CC96"],
             title="Reviews by rating"
            )


fig.update_layout(
    autosize=False,
    width=800,
    height=500,
    margin=dict(l=0,r=0,b=0)
)

fig.update_layout(coloraxis_showscale=False)

fig.show()

In [None]:
len(df.Review.unique()), df.isna().sum()

There are no null values in this data. This is very good!

# Review length by sentiment

The average review is ~700 characters long.

There are some really long reviews in this dataset, mostly with ratings of 3 stars or more.

In [None]:
df['text_length'] = df.Review.str.len()

fig = px.histogram(df, 
                   x='text_length', 
                   color='Rating', 
                   color_discrete_sequence=["#B6E880", "#FFA15A", "#FECB52", "#00CC96", "#EF553B"],
                   title="Review length distributions"
                  )

fig.update_layout(
    autosize=False,
    width=800,
    height=500,
    margin=dict(l=0,r=0,b=0)
)

fig.show()

# Review number of words by sentiment

The number of words follows the same pattern as the length.

High-rating reviews have more words than low-rating reviews.

In [None]:
df['number_of_words'] = df.Review.str.split().str.len()

fig = px.histogram(df, 
                   x='number_of_words', 
                   color='Rating', 
                   color_discrete_sequence=["#B6E880", "#FFA15A", "#FECB52", "#00CC96", "#EF553B"],
                   title="Number of words distribution"
                  )

fig.update_layout(
    autosize=False,
    width=800,
    height=500,
    margin=dict(l=0,r=0,b=0)
)

fig.show()
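
To put numbers on the two histograms, a per-rating summary can help. This is a minimal sketch that assumes a small, invented DataFrame `toy` standing in for `df` (the reviews and ratings below are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for df; the notebook itself would use the full dataset
toy = pd.DataFrame({
    "Review": [
        "great location friendly staff",
        "dirty room",
        "nice pool good breakfast would stay again",
        "not good",
    ],
    "Rating": [5, 1, 5, 1],
})

# Same two features as in the cells above
toy["text_length"] = toy["Review"].str.len()
toy["number_of_words"] = toy["Review"].str.split().str.len()

# Mean length and mean word count for each rating
summary = toy.groupby("Rating")[["text_length", "number_of_words"]].mean()
print(summary)
```

Running the same `groupby` on `df` would show whether the difference between ratings seen in the plots also holds on average.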

# Common stopwords in reviews

In [None]:
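def create_corpus_df(df):
    """Return an array of [token, rating] pairs, one row per token."""
    corpus = []

    for review, rating in zip(df['Review'], df['Rating']):
        for w in word_tokenize(review):
            corpus.append([w, rating])

    # Mixing str and int in one array casts everything to str,
    # so ratings become "1".."5" (the same keys used in color_map)
    return np.asarray(corpus)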

In [None]:
corpus = create_corpus_df(df)

In [None]:
corpus_df = pd.DataFrame(corpus, columns=['token', 'rating'])

In [None]:
def get_stopwords_from_corpus(corpus_df, n=10):
    stop = set(stopwords.words('english'))
    
    corpus_stop_df = corpus_df[corpus_df['token'].isin(stop)].copy()
    
    corpus_stop_df['count'] = 1
    
    corpus_stop_df = corpus_stop_df.groupby(by=['token', 'rating']).count().reset_index().sort_values('count', ascending=False)

    tokens = corpus_stop_df['token'].unique().tolist()[:n]
        
    return corpus_stop_df[corpus_stop_df['token'].isin(tokens)]

The most used stopword in the reviews is `not`, even in 5-star reviews. This is curious: the word `not` is generally associated with a bad opinion.

Here I show only the top five stopwords. The counts of the other stopwords are too small compared with those five.

In [None]:
x = get_stopwords_from_corpus(corpus_df, n=5)

fig = px.bar(x, 
             x="token", 
             y="count", 
             color="rating", 
             color_discrete_map=color_map,
             title="Top common stopwords"
            )


fig.update_layout(
    autosize=False,
    width=800,
    height=500,
    margin=dict(l=0,r=0,b=0),
    barmode='group'
)

fig.show()
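
One way to probe why `not` shows up even in 5-star reviews is to look at the token that follows it. The sketch below uses invented token lists, and the helper name `words_after` is hypothetical, so it only illustrates the idea and says nothing about the real data:

```python
from collections import Counter

def words_after(target, tokenized_reviews):
    """Count the tokens that immediately follow `target`."""
    counts = Counter()
    for tokens in tokenized_reviews:
        # Pair every token with its successor
        for current, following in zip(tokens, tokens[1:]):
            if current == target:
                counts[following] += 1
    return counts

# Invented examples: "not" can negate a complaint as well as a compliment
five_star = [["did", "not", "disappoint"], ["not", "far", "beach"], ["could", "not", "fault", "staff"]]
one_star = [["room", "not", "clean"], ["would", "not", "recommend"], ["not", "clean", "bathroom"]]

after_not_5 = words_after("not", five_star)
after_not_1 = words_after("not", one_star)
print(after_not_5.most_common(3))
print(after_not_1.most_common(3))
```

Applied to the tokenized reviews of each rating, this would show whether `not` mostly negates positive or negative words.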

# Common words in review

The most used word is `hotel` followed by `room`. That's expected in hotel reviews.

Words that describe characteristics, like `location`, `breakfast`, `service`, `clean`..., are mostly used in good reviews.

In [None]:
def get_common_words_from_corpus(corpus_df, n_top=20):
    stop = set(stopwords.words('english'))
    puncts = word_tokenize(string.punctuation)
    
    corpus_stop_df = corpus_df[(~corpus_df['token'].isin(stop)) & (~corpus_df['token'].isin(puncts))].copy()
    
    corpus_stop_df['count'] = 1
    
    corpus_stop_df = corpus_stop_df.groupby(by=['token', 'rating']).count().reset_index().sort_values('count', ascending=False)

    tokens = corpus_stop_df['token'].unique().tolist()[:n_top]
    
    return corpus_stop_df[corpus_stop_df['token'].isin(tokens)]

In [None]:
x = common_words = get_common_words_from_corpus(corpus_df, n_top=10)

fig = px.bar(x, x="token", y="count", color="rating", color_discrete_map=color_map, title="Top common words")


fig.update_layout(
    autosize=False,
    width=800,
    height=500,
    margin=dict(l=0,r=0,b=0),
    barmode='group'
)

fig.show()

# Common punctuation

This dataset has almost no punctuation, only commas that were used to replace tabs `\t` and line breaks `\n`.

There is also one stray `:` in a 3-star review.

In [None]:
def get_common_punctuations_from_corpus(corpus_df):
    puncts = word_tokenize(string.punctuation)
    
    corpus_stop_df = corpus_df[corpus_df['token'].isin(puncts)].copy()
    
    corpus_stop_df['count'] = 1
    
    corpus_stop_df = corpus_stop_df.groupby(by=['token', 'rating']).count().reset_index().sort_values('count', ascending=False)
    
    return corpus_stop_df

In [None]:
x = get_common_punctuations_from_corpus(corpus_df)

x.groupby('token')['count'].sum()
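
The same check can be done at character level, without tokenizing first. A minimal sketch, assuming an invented sample string in place of the real reviews:

```python
import string
from collections import Counter

# Invented sample; in the notebook this could be the concatenated reviews
sample = "nice hotel, great location, staff friendly: would return"

# Count only the characters listed in string.punctuation
punct_counts = Counter(ch for ch in sample if ch in string.punctuation)
print(punct_counts)
```

Counting characters rather than tokens also catches punctuation that the tokenizer leaves attached to a word.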

# Clean Data

For now we only remove all punctuation (mostly commas) and extra spaces from the reviews, to create N-grams and WordClouds.

In [None]:
def remove_punctuation(df):
    
    # remove punctuations
    table = str.maketrans('', '', string.punctuation)
    df['Review'] = df['Review'].str.translate(table)
    
    # remove extra spaces '  '
    df['Review'] = df['Review'].replace(r'\s\s+', ' ', regex=True)
    
    return df

In [None]:
df = remove_punctuation(df)
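
To see what these two steps do to a single review, here is a minimal sketch on an invented string. The helper name `clean_text` is hypothetical and simply mirrors `remove_punctuation` for one value:

```python
import re
import string

def clean_text(text):
    """Drop punctuation, then collapse runs of whitespace into one space."""
    table = str.maketrans("", "", string.punctuation)
    text = text.translate(table)
    return re.sub(r"\s\s+", " ", text)

# Invented example review
cleaned = clean_text("nice hotel,  great location,   did n't like breakfast")
print(cleaned)
```

Note that removing the apostrophe turns `n't` into `nt`, which is presumably where an `nt` token in the n-grams would come from.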

# Creating corpus

In [None]:
%%time
corpus = df['Review'].apply(word_tokenize).tolist()
len(corpus), df.shape[0]

In [None]:
# Reviews have different lengths, so store the token lists in an object array
corpus = np.asarray(corpus, dtype=object)
targets = df['Rating'].to_numpy()

In [None]:
%%time
corpus_1 = df[df['Rating'] == 1]['Review'].apply(word_tokenize).tolist()
corpus_2 = df[df['Rating'] == 2]['Review'].apply(word_tokenize).tolist()
corpus_3 = df[df['Rating'] == 3]['Review'].apply(word_tokenize).tolist()
corpus_4 = df[df['Rating'] == 4]['Review'].apply(word_tokenize).tolist()
corpus_5 = df[df['Rating'] == 5]['Review'].apply(word_tokenize).tolist()

In [None]:
def count_n_grams(corpus, n=2, n_top=None):
    """Count n-grams over tokenized reviews and return them sorted by count."""
    n_grams = {}

    for review in corpus:
        review = tuple(review)

        for i in range(len(review) - n + 1):
            # Get the n-gram from i to i+n
            n_gram = review[i:i+n]

            # Increment the count, starting from 0 for an unseen n-gram
            n_grams[n_gram] = n_grams.get(n_gram, 0) + 1

    n_grams = pd.DataFrame.from_dict(n_grams, orient='index', columns=['count']).sort_values('count', ascending=False)

    n_grams.reset_index(inplace=True)

    # Turn each tuple of tokens into a readable "w1, w2" label
    n_grams['index'] = n_grams['index'].str.join(', ')

    if n_top:
        n_grams = n_grams[:n_top]

    return n_grams
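
The sliding-window counting in `count_n_grams` can also be sketched with `collections.Counter`. This is an alternative illustration on invented token lists, not the function used in this notebook, and the name `count_n_grams_counter` is hypothetical:

```python
from collections import Counter

def count_n_grams_counter(corpus, n=2):
    """Count n-grams across tokenized reviews using a sliding window."""
    counts = Counter()
    for tokens in corpus:
        # Zipping n shifted copies of the token list yields every window of size n
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts

# Invented tokenized reviews
toy_corpus = [
    ["great", "location", "friendly", "staff"],
    ["great", "location", "clean", "room"],
]

bi = count_n_grams_counter(toy_corpus, n=2)
tri = count_n_grams_counter(toy_corpus, n=3)
print(bi.most_common(1))
```

`Counter.most_common(k)` plays the role of the `n_top` argument, and a review shorter than `n` tokens contributes no n-grams.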

# Bi-grams

In every rating, the (did, nt) and (did, not) pairs are among the most used.

Lower-rating reviews (1 and 2) do not have the (great, location) pair. Instead, they have (room, not) and (hotel, not) pairs, which may be describing bad characteristics of the hotel.

Higher ratings (3+) have positive pairs of words about hotel characteristics, like (staff, friendly), (great, location) and (punta, cana) 😅

In [None]:
%%time
bi_grams_1 = count_n_grams(corpus_1, n_top=10)
bi_grams_2 = count_n_grams(corpus_2, n_top=10)
bi_grams_3 = count_n_grams(corpus_3, n_top=10)
bi_grams_4 = count_n_grams(corpus_4, n_top=10)
bi_grams_5 = count_n_grams(corpus_5, n_top=10)

bi_grams_1['Rating'] = 1
bi_grams_2['Rating'] = 2
bi_grams_3['Rating'] = 3
bi_grams_4['Rating'] = 4
bi_grams_5['Rating'] = 5

bi_grams = pd.concat([bi_grams_5, bi_grams_4, bi_grams_3, bi_grams_2, bi_grams_1])

In [None]:
fig = px.bar(bi_grams, x="index", y="count", color="Rating", facet_col="Rating",  color_continuous_scale=["#EF553B", "#FFA15A", "#FECB52", "#B6E880", "#00CC96"])
fig.update_xaxes(matches=None)
fig.update_layout(coloraxis_showscale=False)
fig.show()

# Tri-grams

Analyzing tri-grams, we can collect more detail about reviews with different ratings.

In lower ratings we find negative sequences of words about the hotel, like (worst, hotel, stayed) and (not, recommend, hotel).

In high ratings we find compliments about hotel services, rooms and locations.

In [None]:
%%time
tri_grams_1 = count_n_grams(corpus_1, n=3, n_top=10)
tri_grams_2 = count_n_grams(corpus_2, n=3, n_top=10)
tri_grams_3 = count_n_grams(corpus_3, n=3, n_top=10)
tri_grams_4 = count_n_grams(corpus_4, n=3, n_top=10)
tri_grams_5 = count_n_grams(corpus_5, n=3, n_top=10)

tri_grams_1['Rating'] = 1
tri_grams_2['Rating'] = 2
tri_grams_3['Rating'] = 3
tri_grams_4['Rating'] = 4
tri_grams_5['Rating'] = 5

tri_grams = pd.concat([tri_grams_5, tri_grams_4, tri_grams_3, tri_grams_2, tri_grams_1])

In [None]:
fig = px.bar(tri_grams, x="index", y="count", color="Rating", facet_col="Rating", color_continuous_scale=["#EF553B", "#FFA15A", "#FECB52", "#B6E880", "#00CC96"])
fig.update_xaxes(matches=None)
fig.update_layout(coloraxis_showscale=False)
fig.show()

# WordCloud

To finish the exploratory data analysis, I will plot some word clouds to observe the words used in different ratings.

I exclude common English stopwords from the word clouds, along with the three words most used in reviews: `hotel`, `room`, `not`.

As expected, the sentiment expressed by the words declines along with the rating. We can also point out that basic hotel aspects like breakfast, bathroom, pool and employees are important to guests' ratings.

In [None]:
stops = stopwords.words('english') + ['hotel', 'room', 'not']

w1 = WordCloud(color_func=lambda *args, **kwargs: "#EF553B", 
               background_color='white', 
               max_words=50, 
               stopwords=stops,
               random_state=42,
               height=1000, 
               width=1000) \
            .generate(" ".join(df[df['Rating'] == 1]['Review']))

w2 = WordCloud(color_func=lambda *args, **kwargs: "#FFA15A", 
               background_color='white', 
               max_words=50, 
               stopwords=stops,
               random_state=42,
               height=1000, 
               width=1000) \
            .generate(" ".join(df[df['Rating'] == 2]['Review']))

w3 = WordCloud(color_func=lambda *args, **kwargs: "#FECB52", 
               background_color='white', 
               max_words=50, 
               stopwords=stops,
               random_state=42,
               height=1000, 
               width=1000) \
            .generate(" ".join(df[df['Rating'] == 3]['Review']))

w4 = WordCloud(color_func=lambda *args, **kwargs: "#B6E880", 
               background_color='white', 
               max_words=50, 
               stopwords=stops,
               random_state=42,
               height=1000, 
               width=1000) \
            .generate(" ".join(df[df['Rating'] == 4]['Review']))

w5 = WordCloud(color_func=lambda *args, **kwargs: "#00CC96", 
               background_color='white', 
               max_words=50,
               stopwords=stops,
               random_state=42,
               height=1000, 
               width=1000) \
            .generate(" ".join(df[df['Rating'] == 5]['Review']))



In [None]:
fig = plt.figure(figsize=(26, 12))

fig.add_subplot(1, 3, 1)
plt.axis('off')
plt.title('Rating=5')
plt.imshow(w5)

fig.add_subplot(1, 3, 2)
plt.title('Rating=4')
plt.axis('off')
plt.imshow(w4)

fig.add_subplot(1, 3, 3)
plt.axis('off')
plt.title('Rating=3')
plt.imshow(w3)

fig = plt.figure(figsize=(26, 12))


fig.add_subplot(1, 2, 1)
plt.title('Rating=2')
plt.axis('off')
plt.imshow(w2)

fig.add_subplot(1, 2, 2)
plt.title('Rating=1')
plt.axis('off')
plt.imshow(w1)

# Data Cleaning + Modelling

![](https://media.giphy.com/media/fVeAI9dyD5ssIFyOyM/giphy.gif)

Coming soon!

:) 