# Table of Contents

* [Importing Libraries](#1)
* [Load Dataset](#2)
* [Data Visualization](#3)
* [Text Preprocessing](#4)
* [Building Model with Tensorflow](#5)
* [Prediction](#6)

In [1]:
# nltk is one of the most useful libraries when it comes to nlp
!pip install nltk

<a id="1"></a>
# Importing Libraries

In [2]:
import string
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import tensorflow as tf
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

<a id="2"></a>
# Load Dataset

In [84]:
df = pd.read_csv('../input/trip-advisor-hotel-reviews/tripadvisor_hotel_reviews.csv')
df.head()

In [85]:
df.info()

As the output above shows, there are **no null values** in this dataset.
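The same check can be done explicitly instead of reading it off `df.info()`. A minimal sketch, assuming a small stand-in DataFrame (the two rows below are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# Hypothetical stand-in for the real dataset, using the same column names.
toy = pd.DataFrame({
    "Review": ["nice hotel, friendly staff", "room was dirty"],
    "Rating": [5, 1],
})

# Count missing values per column; all zeros means nothing is missing.
null_counts = toy.isnull().sum()
print(null_counts)

has_nulls = bool(null_counts.any())
print("Any nulls:", has_nulls)
```

On the real data, the same two lines would be applied to `df` instead of `toy`.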

<a id="3"></a>
# Data Visualization

In [92]:
sns.countplot(data=df, x='Rating', palette='mako').set_title('Rating Distribution Across Dataset')

In [86]:
# Length of each review in characters (len() on a string counts characters, not words)
df['Length'] = df['Review'].apply(len)
df.head()

In [87]:
sns.displot(data=df, x='Length', hue='Rating', palette='mako', kind='kde', fill=True, aspect=4)

In [90]:
g = sns.FacetGrid(data=df, col='Rating')
g.map(plt.hist, 'Length', color='#1D3557')

In [91]:
sns.stripplot(data=df, x='Rating', y='Length', palette='mako', alpha=0.3)

From the plot above, it looks like **the higher the hotel's rating, the more likely visitors were to write a long review**.
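A strip plot can exaggerate this kind of pattern when some ratings have many more reviews than others, because a bigger group simply shows more extreme points. A summary table per rating is a useful cross-check. Here is a rough sketch on made-up data (the `toy` frame and its reviews are hypothetical; only the column names follow the notebook):

```python
import pandas as pd

# Hypothetical toy data with the notebook's column names.
toy = pd.DataFrame({
    "Rating": [1, 1, 3, 5, 5, 5],
    "Review": ["bad", "very bad stay", "ok room",
               "great", "lovely hotel", "superb view and staff"],
})
toy["Length"] = toy["Review"].apply(len)

# Number of reviews, median length and maximum length for each rating.
summary = toy.groupby("Rating")["Length"].agg(["count", "median", "max"])
print(summary)
```

If the median grows with the rating as well as the maximum, the claim is on firmer ground.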

<a id="4"></a>
# Text Preprocessing

In [93]:
# Let's change the rating to be more general and easier to understand
def rating(score):
    if score > 3:
        return 'Good'
    elif score == 3:
        return 'Neutral'
    else:
        return 'Bad'

In [94]:
df['Rating'] = df['Rating'].apply(rating)

In [95]:
df.head()

In [96]:
# Total number of characters in the dataset before cleaning ('Length' counts characters)
length = df['Length'].sum()

### Stemming vs Lemmatization
I think this picture gives you a sense of the difference between stemming and lemmatization

![image](https://lh3.googleusercontent.com/BP5TVAfMRDWXufbDRorQs0s84WXcrmYEuru1tLrSBOd_xTtv06f2qld5VMXIvA_Y0iqeG__w0iXsTeZj9fSpocIx7eEZSqbY_gDihdIAHwuqlPSK244_IfK9tXaow3-Y3ftpW5WxEJ_58Meukw)

In [98]:
print('Original:')
print(df['Review'][0])
print()

# Create the stemmer once and reuse it for every word
stemmer = SnowballStemmer('english')
sentence = []
for word in df['Review'][0].split():
    sentence.append(stemmer.stem(word))
print('Stemming:')
print(' '.join(sentence))
print()

# Same idea for the lemmatizer; 'v' tells it to treat each word as a verb
lemmatizer = WordNetLemmatizer()
sentence = []
for word in df['Review'][0].split():
    sentence.append(lemmatizer.lemmatize(word, 'v'))
print('Lemmatization:')
print(' '.join(sentence))

There are some differences among these three versions, for instance:
* Original -> got, arrived
* Stemming -> got, arriv
* Lemmatization -> get, arrive

This time, we will use lemmatization to get the base form of each word. Note that passing `'v'` makes the lemmatizer treat every word as a verb.
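To make the difference concrete, here is a toy sketch in plain Python. Neither function is NLTK's actual algorithm: the suffix list and the lookup table are hypothetical and just big enough to reproduce the two words above. The point is only that stemming chops suffixes by rule, while lemmatization looks words up in a dictionary.

```python
# Toy illustration only: neither function is NLTK's actual algorithm.
SUFFIXES = ("ing", "ed", "s")                      # hypothetical rule list
LEMMA_TABLE = {"got": "get", "arrived": "arrive"}  # hypothetical dictionary


def toy_stem(word):
    """Strip the first matching suffix, whether or not a real word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in a dictionary; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)


for w in ["got", "arrived"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The rule-based version leaves `got` alone and cuts `arrived` to `arriv`, while the lookup version returns real words in both cases.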

In [99]:
# Build these once: calling stopwords.words() for every single word is very slow
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def cleaning(text):
    # remove punctuation and convert to lowercase
    clean_text = text.translate(str.maketrans('', '', string.punctuation)).lower()

    # remove stopwords
    words = [word for word in clean_text.split() if word not in stop_words]

    # lemmatize each word, treating it as a verb ('v')
    return ' '.join(lemmatizer.lemmatize(word, 'v') for word in words)
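To see what the punctuation and stopword steps do on their own, here is a small sketch that runs without any NLTK data. The stopword set is a hypothetical, tiny subset (NLTK's English list is much longer), and the lemmatization step is left out on purpose:

```python
import string

# Hypothetical, tiny stopword set; NLTK's English list is much longer.
TOY_STOPWORDS = {"the", "was", "and", "a", "to"}


def strip_and_filter(text):
    # Delete every punctuation character, then lowercase the text.
    no_punct = text.translate(str.maketrans("", "", string.punctuation)).lower()
    # Keep only the tokens that are not stopwords.
    return [tok for tok in no_punct.split() if tok not in TOY_STOPWORDS]


tokens = strip_and_filter("The room was CLEAN, and the staff: friendly!")
print(tokens)
```

One side effect worth knowing: the apostrophe is part of `string.punctuation`, so a contraction like `don't` turns into `dont` before the stopword check happens.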

In [100]:
df['Review'] = df['Review'].apply(cleaning)

In [101]:
df['Length'] = df['Review'].apply(len)
new_length = df['Length'].sum()

# 'Length' holds character counts, so these totals are characters, not words
print('Total characters before cleaning: {}'.format(length))
print('Total characters after cleaning: {}'.format(new_length))
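Note that `apply(len)` on a string column counts characters, so the totals above are character counts. A word-level total needs a split first. A quick sketch on two made-up reviews (the `toy` series is hypothetical):

```python
import pandas as pd

toy = pd.Series(["nice hotel", "great location near the beach"])  # hypothetical reviews

char_total = int(toy.str.len().sum())              # what apply(len) adds up
word_total = int(toy.str.split().str.len().sum())  # whitespace-separated tokens

print("Total characters:", char_total)
print("Total words:", word_total)
```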

<a id="5"></a>
# Building Model with Tensorflow

In [102]:
X_train, X_test, y_train, y_test = train_test_split(df['Review'], df['Rating'], test_size=0.2)

In [104]:
tokenizer = Tokenizer(oov_token='<OOV>')

tokenizer.fit_on_texts(X_train)
# print(tokenizer.word_index)
total_word = len(tokenizer.word_index)
print('Total distinct words: {}'.format(total_word))

train_seq = tokenizer.texts_to_sequences(X_train)
train_padded = pad_sequences(train_seq)

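test_seq = tokenizer.texts_to_sequences(X_test)
# Pad (or truncate) the test sequences to the training length so both sets share one shape
test_padded = pad_sequences(test_seq, maxlen=train_padded.shape[1])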

# One hot encoding the label
lb = LabelBinarizer()
train_labels = lb.fit_transform(y_train)
test_labels = lb.transform(y_test)
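If the tokenizing and padding steps feel like a black box, the idea can be sketched in plain Python. This is a simplified imitation, not the Keras implementation: the helper names (`fit_word_index`, `to_sequence`, `pad_pre`) are made up here, and the sketch assumes words are ranked by frequency, id 1 is reserved for the out-of-vocabulary token, and id 0 is used for padding on the left.

```python
from collections import Counter


def fit_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; id 1 is the OOV token, id 0 is left for padding."""
    counts = Counter(word for text in texts for word in text.split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index


def to_sequence(text, word_index, oov_token="<OOV>"):
    """Replace each word by its id, using the OOV id for unseen words."""
    return [word_index.get(word, word_index[oov_token]) for word in text.split()]


def pad_pre(seq, maxlen):
    """Left-pad with zeros, or keep the last `maxlen` ids if the sequence is too long."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq


word_index = fit_word_index(["great hotel great staff", "bad hotel"])
seq = to_sequence("great pool", word_index)
print(word_index)
print(pad_pre(seq, 4))
```

Here `pool` was never seen during fitting, so it gets the OOV id. The word index is built from the training texts only, which is why the notebook calls `fit_on_texts` on `X_train` and not on the test set.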

In [105]:
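# word_index ids start at 1 and 0 is used for padding, so the embedding needs total_word + 1 rows
model = tf.keras.models.Sequential([tf.keras.layers.Embedding(total_word + 1, 8),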
                                    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(16)),
                                    tf.keras.layers.Dropout(0.5),
                                    tf.keras.layers.Dense(16),
                                    tf.keras.layers.Dropout(0.5),
                                    tf.keras.layers.Dense(3, activation='softmax')])

model.summary()
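An embedding layer is essentially a lookup table indexed by token id, so its first dimension has to be larger than the largest id the tokenizer can produce. Since ids start at 1 and 0 is used for padding, that means one more row than the number of entries in the word index. The sketch below uses a NumPy array as a stand-in for the layer's weights (the word index and the random values are hypothetical):

```python
import numpy as np

word_index = {"<OOV>": 1, "great": 2, "hotel": 3}  # hypothetical tokenizer output
vocab_size = len(word_index)                        # 3, but the largest id is also 3

rng = np.random.default_rng(0)
too_small = rng.normal(size=(vocab_size, 8))        # rows 0..2 only
big_enough = rng.normal(size=(vocab_size + 1, 8))   # rows 0..3, row 0 for padding

padded = np.array([0, 0, 2, 3])                     # a pre-padded sequence of ids

vectors = big_enough[padded]                        # one 8-dim vector per token
print(vectors.shape)

try:
    too_small[padded]
except IndexError as err:
    print("lookup failed:", err)
```

With the smaller table, the lookup for id 3 falls outside the array, which is the situation the least frequent word in the vocabulary would hit.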

In [106]:
model.compile(optimizer=tf.optimizers.Adam(learning_rate=0.0001), loss='categorical_crossentropy', metrics=['accuracy'])

model.fit(train_padded, train_labels, epochs=25, validation_data=(test_padded, test_labels))

In [107]:
metrics = pd.DataFrame(model.history.history)
metrics[['accuracy', 'val_accuracy']].plot()
metrics[['loss', 'val_loss']].plot()

<a id="6"></a>
# Prediction

Let's make some predictions with the model we have trained. You can write your own reviews and let the model predict their sentiment.

In [108]:
def predict(text):
    clean_text = cleaning(text)
    seq = tokenizer.texts_to_sequences([clean_text])
    # Pad to the same length the model saw during training
    padded = pad_sequences(seq, maxlen=train_padded.shape[1])

    pred = model.predict(padded)
    # Get the label name back
    result = lb.inverse_transform(pred)[0]
    
    return result
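The last step of `predict` turns a row of softmax probabilities back into a label name. Roughly speaking, this amounts to picking the class with the highest probability. A minimal NumPy sketch, assuming the classes are stored in sorted order and using a made-up probability row (both the label names and the numbers are hypothetical):

```python
import numpy as np

# Hypothetical labels, stored in sorted order: ['Bad', 'Good', 'Neutral']
classes = np.array(sorted({"Good", "Neutral", "Bad"}))

# Made-up softmax output for one review: one probability per class.
probs = np.array([[0.10, 0.75, 0.15]])

# Pick the class with the highest probability in each row.
predicted = classes[probs.argmax(axis=1)]
print(predicted[0])
```

Looking at the raw probabilities as well as the final label can be helpful, since a 0.40 vs 0.35 split is a much weaker prediction than 0.95 vs 0.03.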

In [109]:
text = 'Such a comfy place to stay with the loved one'
predict(text)

In [110]:
text2 = 'Awful room services and slow wifi connection'
predict(text2)

In [111]:
text3 = 'Hard to get here but the scenery is wonderful'
predict(text3)
