<a href="https://colab.research.google.com/github/HassanJoumaa/Trip_Advisor_Hotel_Reviews/blob/main/Trip_Advisor_Hotel_Reviews.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Trip Advisor Hotel Reviews
Hotels play a crucial role in travel, and with increased access to information, new ways of selecting the best ones have emerged.

With this dataset, consisting of 20k reviews crawled from Tripadvisor, you can explore what makes a great hotel and maybe even use this model in your travels!

## 1. Problem

We are provided with a dataset of **20k** reviews crawled from Tripadvisor. Each review has a corresponding rating ***(1-5)***. The goal is to develop a sequence model that classifies the reviews into **3** classes instead of **5**.

## 2. Data

The data we're using is from Kaggle's Trip Advisor Hotel Reviews dataset.

https://www.kaggle.com/andrewmvd/trip-advisor-hotel-reviews

## 3. Evaluation

We will evaluate the model on accuracy, comparing training and validation accuracy to make sure it suffers from neither **High Bias** (underfitting) nor **High Variance** (overfitting).
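
One rough way to check for high bias or high variance is to compare the final training and validation accuracies. The sketch below is only an illustration: the helper name `diagnose_fit` and both thresholds are assumptions, not something defined in this notebook, and the accuracy values are made up.

```python
def diagnose_fit(train_acc, val_acc, min_acc=0.80, max_gap=0.05):
    """Heuristic reading of final accuracies (thresholds are assumed, not tuned)."""
    if train_acc < min_acc:
        return "high bias"       # underfits even the training data
    if train_acc - val_acc > max_gap:
        return "high variance"   # fits the training data but generalizes poorly
    return "ok"

# Made-up accuracy values, purely for illustration
print(diagnose_fit(0.70, 0.69))
print(diagnose_fit(0.98, 0.85))
print(diagnose_fit(0.90, 0.88))
```

In practice the same comparison could be made on the `accuracy` and `val_accuracy` curves that Keras records during training.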


## 4. Features
* There are 5 ratings, ***1*** to ***5***, which we will map to 3 classes: ***1 (Bad), 2 (Neutral), and 3 (Good)***.
* There are around ***20k*** reviews, which we will split into training and testing sets.

### Getting the Data and Importing the Libraries
We will start off by getting the data from Kaggle using the Kaggle API, but first we force-reinstall the `kaggle` package with pip in order to prevent any problems.

In [None]:
!pip install --upgrade --force-reinstall --no-deps kaggle

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime 

from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
print(tf.__version__) #Make sure Tensorflow 2 is imported

In [None]:
# Add the username and key from your Kaggle API token (kaggle.json).
# The values below are placeholders; keep real credentials out of shared notebooks.
os.environ['KAGGLE_USERNAME'] = "<your-kaggle-username>"
os.environ['KAGGLE_KEY'] = "<your-kaggle-api-key>"

In [None]:
!kaggle datasets download -d andrewmvd/trip-advisor-hotel-reviews

In [None]:
!unzip /content/trip-advisor-hotel-reviews.zip

### Preprocessing the Data

In [None]:
df = pd.read_csv("/content/tripadvisor_hotel_reviews.csv")
print("Number of records is:",len(df))
df.head(10)

In [None]:
df.info()

#### Removing the Stopwords

We will remove the stopwords from the reviews in order to reduce the data the model has to process.

This should not hurt the model's decisions much, since we are building a model for sentiment classification and not for machine translation or another use case that depends on these words. Note that the list below does not contain negations such as "not" or "no", so they stay in the reviews.
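
To see why stopword removal is mostly harmless for sentiment, here is a minimal sketch on a single sentence. `DEMO_STOPWORDS` is a small hypothetical subset used only for this demo, and `drop_stopwords` is an illustrative helper, not the notebook's own function.

```python
# Small illustrative subset of a stopword list (hypothetical, for this demo only)
DEMO_STOPWORDS = {"the", "was", "and", "a", "we", "it", "to", "at"}

def drop_stopwords(text, stopwords=DEMO_STOPWORDS):
    """Return text with whitespace-separated stopwords removed."""
    return " ".join(w for w in text.split() if w not in stopwords)

sample = "the room was not clean and we would not stay again"
print(drop_stopwords(sample))  # room not clean would not stay again
```

The sentiment-bearing words ("not", "clean", "stay") survive, while the sentence gets shorter.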

In [None]:
STOPWORDS = [ "a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", "did", "do", "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "it", "it's", "its", "itself", "let's", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "she", "she'd", "she'll", "she's", "should", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "we", "we'd", "we'll", "we're", "we've", "were", "what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom", "why", "why's", "with", "would", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves" ]

STOPWORDS_SET = set(STOPWORDS)  # set membership checks are faster than list lookups

def remove_stopwords(x, stopwords=STOPWORDS_SET):
  """Return the review with all stopwords removed."""
  return " ".join(word for word in x.split() if word not in stopwords)

In [None]:
df["Review"] = df["Review"].apply(remove_stopwords)

lengths = df["Review"].str.split().apply(lambda x: len(x))
maxlen_index = lengths.argmax()
maxlen_rating = df["Rating"][maxlen_index]
max_len = len(df["Review"][maxlen_index].split())

mean = lengths.mean()
median = lengths.median()

print("Index of longest review:", maxlen_index)
print("Rating of longest review:", maxlen_rating)
print("Length of longest review:", max_len)
print("The mean length is:", mean)
print("The median length is:", median)

#### Modify the Ratings: 1 is Bad, 2 is Neutral, 3 is Good

In [None]:
def modify_ratings(x):
  """Map a 1-5 rating to 3 classes: 1 (Bad), 2 (Neutral), 3 (Good)."""
  if x in (4, 5):
    return 3
  elif x in (1, 2):
    return 1
  else:
    return 2

In [None]:
df["Rating"] = df["Rating"].apply(modify_ratings)
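
Since accuracy is the evaluation metric, it can be worth checking how balanced the three classes are after the mapping. The sketch below uses made-up ratings rather than the real dataset, and `to_three_classes` is an illustrative stand-in for the mapping described above.

```python
from collections import Counter

def to_three_classes(rating):
    """Map a 1-5 star rating to 1 (Bad), 2 (Neutral), or 3 (Good)."""
    if rating >= 4:
        return 3
    if rating <= 2:
        return 1
    return 2

toy_ratings = [5, 4, 4, 3, 1, 2, 5, 5]  # made-up values, not from the dataset
counts = Counter(to_three_classes(r) for r in toy_ratings)
for cls in sorted(counts):
    share = counts[cls] / len(toy_ratings)
    print(f"class {cls}: {counts[cls]} reviews ({share:.0%})")
```

On the real data, `df["Rating"].value_counts()` would give the same kind of summary.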

In [None]:
labels = list(df["Rating"])
unique_ratings = np.unique(labels)
lbl=LabelEncoder()
labels=lbl.fit_transform(labels)
labels = to_categorical(labels)
print("Number of unique labels:",len(unique_ratings))
labels[0:2]
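
The effect of the encoding cell above can be sketched in plain Python: classes are sorted and numbered from 0, and each label becomes a row with a single 1. The helper `one_hot_encode` is hypothetical and only mimics the result; it is not how `LabelEncoder` or `to_categorical` are implemented.

```python
def one_hot_encode(labels):
    """Return (sorted classes, one-hot rows) for a list of labels."""
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}   # e.g. 1 -> 0, 2 -> 1, 3 -> 2
    rows = [[1.0 if index[label] == j else 0.0 for j in range(len(classes))]
            for label in labels]
    return classes, rows

classes, rows = one_hot_encode([3, 1, 2, 3])  # made-up 3-class ratings
print(classes)
for row in rows:
    print(row)
```

With this layout, column 0 corresponds to Bad, column 1 to Neutral, and column 2 to Good.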

### Preparing the Data

In [None]:
NUM_SAMPLES=20490 #@param{type:"slider", min:10000, max:20490, step:10} 
MAXLEN = 350
EMBEDDINGS_DIM= 128
vocab_size=35000
trunc_type='post'
padding_type='post'
test_portion = 0.1
BATCHE_SIZE=64

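# Shuffle reviews and labels with the same permutation so they stay aligned.
# (df.sample(frac=1) on its own returns a shuffled copy and leaves df unchanged.)
shuffle_idx = np.random.permutation(len(df))
X = np.array(df["Review"])[shuffle_idx]
y = labels[shuffle_idx]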

In [None]:
sentences = X[:NUM_SAMPLES]
labels = y[:NUM_SAMPLES]
tokenizer = Tokenizer(num_words=vocab_size, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index


sequences = tokenizer.texts_to_sequences(sentences)

padded = pad_sequences(sequences, maxlen= MAXLEN, truncating=trunc_type, padding=padding_type)

split = int(test_portion * NUM_SAMPLES)

test_padded = padded[0:split]
train_padded = padded[split:NUM_SAMPLES]
test_labels = labels[0:split]
train_labels = labels[split:NUM_SAMPLES]


print(test_padded.shape)
print(train_padded.shape)
print(test_labels.shape)
print(train_labels.shape)

print(train_padded[0:2],"\n")
print(train_labels[0:2])
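
To make the `'post'` settings used above concrete, here is a pure-Python sketch of what post-truncation and post-padding do to a single sequence. `pad_post` is a hypothetical helper written for illustration; it is not the Keras implementation.

```python
def pad_post(seq, maxlen, pad_value=0):
    """Cut extra items off the end, then pad the end, to exactly maxlen items."""
    trimmed = seq[:maxlen]
    return trimmed + [pad_value] * (maxlen - len(trimmed))

print(pad_post([7, 3, 9], 5))           # short sequence: zeros appended
print(pad_post([7, 3, 9, 4, 2, 8], 5))  # long sequence: tail dropped
```

With `MAXLEN = 350`, reviews longer than 350 tokens lose their tail and shorter ones are filled with zeros at the end.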


In [None]:
train_data = tf.data.Dataset.from_tensor_slices((tf.constant(train_padded),tf.constant(train_labels)))
test_data = tf.data.Dataset.from_tensor_slices((tf.constant(test_padded),tf.constant(test_labels)))

In [None]:
train_dataset = train_data.batch(BATCHE_SIZE)
test_dataset = test_data.batch(BATCHE_SIZE)

train_dataset.element_spec, test_dataset.element_spec
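
As a quick sanity check on the batching, the number of batches per epoch follows from the split sizes. This is a small arithmetic sketch using the notebook's parameter values; `num_batches` is a hypothetical helper, and it assumes the last partial batch is kept, which is the default for `Dataset.batch`.

```python
import math

def num_batches(num_items, batch_size):
    """Number of batches when the final partial batch is kept."""
    return math.ceil(num_items / batch_size)

num_samples, test_portion, batch_size = 20490, 0.1, 64
test_size = int(test_portion * num_samples)
train_size = num_samples - test_size

print(train_size, num_batches(train_size, batch_size))
print(test_size, num_batches(test_size, batch_size))
```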


In [None]:
# Uncomment to see a sample of the Data
# train_sentence, train_labels = next(train_data.as_numpy_iterator())

# train_sentence, train_labels

### Building the Model

In [None]:
model = tf.keras.Sequential([
          tf.keras.layers.Embedding(vocab_size, EMBEDDINGS_DIM, input_length=MAXLEN),
          tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
          tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
          tf.keras.layers.Dense(64, activation='relu'),
          tf.keras.layers.Dense(len(unique_ratings), activation='softmax')
])
        

In [None]:
model.summary()
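
The parameter counts reported by the summary can be cross-checked by hand. The sketch below assumes the standard LSTM layout (four gates, each with input weights, recurrent weights, and a bias) and 3 output classes; the helper names are illustrative.

```python
def lstm_params(input_dim, units):
    """Parameters of one LSTM: 4 gates x (input + recurrent weights + bias)."""
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    """Parameters of a dense layer: weights plus biases."""
    return input_dim * units + units

vocab_size, emb_dim, num_classes = 35000, 128, 3

embedding = vocab_size * emb_dim
bilstm_1 = 2 * lstm_params(emb_dim, 64)   # two directions, 64 units each
bilstm_2 = 2 * lstm_params(2 * 64, 32)    # input is the 128-dim output of bilstm_1
dense_1 = dense_params(2 * 32, 64)
dense_out = dense_params(64, num_classes)

total = embedding + bilstm_1 + bilstm_2 + dense_1 + dense_out
print(embedding, bilstm_1, bilstm_2, dense_1, dense_out, total)
```

Almost all of the parameters sit in the embedding layer, so `vocab_size` and `EMBEDDINGS_DIM` dominate the model size.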

In [None]:
model.compile(loss='categorical_crossentropy',optimizer='rmsprop',metrics=['accuracy'])
%load_ext tensorboard
!mkdir -p ./logs
logdir = os.path.join("./logs",
                        # Make it so the logs get tracked whenever we run an experiment
                        datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
tensorboard = tf.keras.callbacks.TensorBoard(logdir)


In [None]:
NUM_EPOCHS=5
history = model.fit(x=train_dataset,
                    epochs=NUM_EPOCHS,
                    validation_data=test_dataset,
                    callbacks=[tensorboard])

In [None]:
%tensorboard --logdir /content/logs

### Creating Data to Use in TensorFlow's Embedding Projector:
https://projector.tensorflow.org/
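
The projector is loaded with two tab-separated files: one with a vector per line and one with the matching word per line. Here is a minimal sketch of that layout, using in-memory buffers and made-up two-dimensional vectors; `write_projector_files` is a hypothetical helper.

```python
import io

def write_projector_files(words, vectors, vec_file, meta_file):
    """Write one tab-separated vector and one word per line, in the same order."""
    for word, vec in zip(words, vectors):
        meta_file.write(word + "\n")
        vec_file.write("\t".join(str(x) for x in vec) + "\n")

words = ["clean", "dirty"]            # made-up vocabulary
vectors = [[0.1, -0.2], [0.3, 0.4]]   # made-up embeddings

vecs, meta = io.StringIO(), io.StringIO()
write_projector_files(words, vectors, vecs, meta)
print(vecs.getvalue())
print(meta.getvalue())
```

Line *i* of the metadata file labels line *i* of the vectors file, so both must be written in the same order.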

In [None]:
import io

reverse_word_index = {index: word for word, index in word_index.items()}
e = model.layers[0]
weights = e.get_weights()[0]
print(weights.shape)

# Only indices below vocab_size have an embedding row, and the tokenizer
# may have seen fewer than vocab_size distinct words.
num_words = min(vocab_size, len(word_index) + 1)

with io.open('vecs.tsv', 'w', encoding='utf-8') as out_v, \
     io.open('meta.tsv', 'w', encoding='utf-8') as out_m:
  for word_num in range(1, num_words):
    word = reverse_word_index[word_num]
    embeddings = weights[word_num]
    out_m.write(word + "\n")
    out_v.write('\t'.join([str(x) for x in embeddings]) + "\n")

In [None]:
from google.colab import files
files.download('vecs.tsv')
files.download('meta.tsv')