# Trip Advisor model

## Steps

1. Import the Trip Advisor review data
2. Tokenize the reviews (build a word index that maps each word to an integer)
3. Use an out-of-vocabulary (OOV) token to stand in for words not seen during fitting
4. Pad or truncate the sequences so they all have the same length


In [1]:
import pandas as pd

In [2]:
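# Important variables
vocab_size = 10000      # only the most frequent words are kept when encoding
trunc_type = "post"     # cut over-long sequences at the end
padding_type = "post"   # add zeros at the end of short sequences
oov_tok = "<OOV>"       # placeholder for out-of-vocabulary words
embedding_dim = 16      # size of each word vector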

In [3]:
df = pd.read_csv('to_model.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Review,Rating
0,0,fabulous hotel mum just returned 4 night stay ...,4
1,1,romantic international ambience spent honeymoo...,4
2,2,great hotel location union square stay great 6...,3
3,3,pretty outside smelly inside beach beautiful p...,1
4,4,kids loved spent nights hotel wife kids christ...,4


In [4]:
X_train = list(df['Review'])
y_train = list(df['Rating'])

print(X_train[:5])
print(y_train[-5:])

['fabulous hotel mum just returned 4 night stay hotel 1898 fabulous, recently decorated rooms immaculate decor fantastic really beautiful hotel, location perfect nicer end la ramblas need close treat return tranquility hotel busy day sight seeing.the staff attentive polite spoke perfect english, price drinks bar expect little hard swallow having gorgeous course meal local resturant price gin tonics.although roof pool undergoing refurbishment stay basement pool surrounding facilities adequate,  ', "romantic international ambience spent honeymoon melia caribe 23-30. plane landed torrential downpour soaked skin steps plane, rained 7 days just hot gorgeous, truly loved resort food people, management helpful needed courteous friendly, nightly shows fun casino, pools incredible beach beautiful, just short stroll resort swim deserted stretches beach wanted, took outback tour must-do tourists, islanders live visit mountains macou beach enjoy lunch siesta hammocks, buy souviniers tour rum 2/bot

In [5]:
df = pd.read_csv('test.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Review,Rating
0,0,ehhh better punta cana twice compared hotel st...,1
1,1,"4 n't think, decided book atenea night stay de...",1
2,2,awesome time just returned vacation fantastic ...,3
3,3,"grand oasis wonderful second time, group 20 fr...",4
4,4,not bad stay stayed hotel family attending jav...,2


In [6]:
X_test = list(df['Review'])
y_test = list(df['Rating'])

print(X_test[:5])
print(y_test[-5:])

["ehhh better punta cana twice compared hotel stayed hotel major need help, let start good rooms little houses 4 rooms bungalow meant privacy, minibar minifrige pretty nice included array different liquors, staff really nice helpful exception, room did not smell like people expierience, pools nice, little shops really cool club plays mixture music, bad, desk staff exception nice staff, completely unorganized rude, rooms n't use key cards just keys, charge room prepared pay cash, n't creditcards, minibar frige stocked day, worst hike pool desk beach, closer not like, area like maze lost houses look map given, overall rate hotel 2-3 5 stars, like title states better,  ", "4 n't think, decided book atenea night stay december received fairly good reviews convenient metro provided breakfast, unfortunately disappointed yes hotel convenient metro positive feature, hotel dated curtains carpets rooms bit scrub, rooms kitchens unless bring kitchen utensils not major advantage, plus sheets clean 

In [7]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [8]:
# Build a word index from the training reviews; words outside it map to the OOV token
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(X_train)
word_index = tokenizer.word_index
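
To make the roles of `num_words` and `oov_token` concrete, here is a minimal pure-Python sketch of the indexing scheme. It is an illustration, not the Keras implementation: `build_word_index` and `encode` are hypothetical helpers, and the sketch assumes plain whitespace splitting, whereas the real `Tokenizer` also strips punctuation by default. It mirrors the behaviour that the OOV token takes index 1, words are ranked by frequency from index 2, and words whose index is at or above `num_words` are replaced by the OOV index when encoding.

```python
from collections import Counter


def build_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; OOV gets index 1 and 0 is left free for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _count) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index


def encode(text, word_index, num_words, oov_token="<OOV>"):
    """Map words to indices; unseen or too-rare words become the OOV index."""
    oov_index = word_index[oov_token]
    encoded = []
    for word in text.lower().split():
        index = word_index.get(word, oov_index)
        encoded.append(index if index < num_words else oov_index)
    return encoded


# Toy data, made up for illustration only.
toy_reviews = ["great hotel great location", "hotel staff rude", "great staff"]
toy_index = build_word_index(toy_reviews)
print(toy_index)
print(encode("great hotel terrible pool", toy_index, num_words=4))
```

Note that the full word index keeps every word it has seen; the `num_words` cap only takes effect at encoding time.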

In [9]:
import numpy as np

In [10]:
sequences_training = tokenizer.texts_to_sequences(X_train)
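# NOTE: len(x) on the raw strings counts characters, not tokens, so this is the
# median review length in characters (531, per the model summary below).
# Iterate over sequences_training instead to get the median token count.
max_length = int(np.median([len(x) for x in X_train]))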
padded_training = pad_sequences(
    sequences_training, 
    padding=padding_type, 
    truncating=trunc_type,
    maxlen=max_length
)

print(padded_training[0])
print(padded_training.shape)

[ 324    2 3605   11  192   67   18    9    2 5158  324  562  438   13
 1125  426   93   27   54    2   15   97  769  198  261  636  108   90
  920  155 5557    2  336   20 1303  787   29    8  660  593  466   97
  152   72  123   43  219   35  203    1  186  492  383  385  312 1989
   72 5054    1 2993  924   30 4218 2867    9 1973   30 1584  366  564
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0 
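
The padded row above is mostly zeros because `max_length` was derived from the raw strings, whose length is measured in characters. A small sketch of the difference, using a made-up toy list rather than the review data, and whitespace-split word counts as a stand-in for the tokenizer's sequences:

```python
from statistics import median

# Toy reviews, made up for illustration only.
toy_reviews = [
    "great hotel great location",
    "hotel staff rude",
    "pretty outside smelly inside beach beautiful",
]

# len() of a raw string counts characters.
char_lengths = [len(review) for review in toy_reviews]

# len() of a tokenized review counts words.
token_lengths = [len(review.split()) for review in toy_reviews]

char_median = int(median(char_lengths))
token_median = int(median(token_lengths))
print(char_median, token_median)
```

In the notebook, the token-based equivalent would take the median of `len(s)` over `sequences_training`, which would give a much shorter `max_length` and far less padding.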

In [11]:
sequences_test = tokenizer.texts_to_sequences(X_test)
padded_test = pad_sequences(
    sequences_test, 
    padding=padding_type, 
    truncating=trunc_type,
    maxlen=max_length
)

print(padded_test[0])
print(padded_test.shape)

[   1   62  216  213  436  763    2   17    2  425  108  239  287  384
    7   13   35 2745   67   13 4684 1165 1438 1131    1  128   12  245
 4055  175 5001    8   27   12   40 1456    3   10    4  532   25   33
    1  357   12   35  319   27  381  172 5094 5008  435  102   58    8
 1456   12    8  712 6350  522   13    6  110  489 1187   11 1611  322
    3  645  191 1273    6    1 1131    1  792   20  544 3144   30   58
   21 1157    4   25   39   25 5286  929 2745  246 1212  259  112  227
    2   31   44   42  707   25 5815 1572   62    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0 
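
Both `padding="post"` and `truncating="post"` act on the end of each sequence, and the test set reuses the `max_length` computed from the training set so that both arrays have the same width. The following pure-Python sketch shows that behaviour; `pad_post` is a hypothetical helper written for illustration, not part of Keras.

```python
def pad_post(sequences, maxlen, value=0):
    """Cut each sequence to maxlen from the end, then fill the end with `value`."""
    padded = []
    for seq in sequences:
        kept = list(seq[:maxlen])                # truncating="post" drops the tail
        kept += [value] * (maxlen - len(kept))   # padding="post" appends the fill value
        padded.append(kept)
    return padded


# One short and one long toy sequence.
toy_sequences = [[324, 2, 3605], [1, 62, 216, 213, 436, 763]]
toy_padded = pad_post(toy_sequences, maxlen=4)
print(toy_padded)
```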

In [12]:
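# Convert inputs and labels to NumPy arrays before calling model.fit in TensorFlow 2.x.
# pad_sequences already returns arrays, so only the label lists actually change type.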
import numpy as np
padded_training = np.array(padded_training)
y_train = np.array(y_train)
padded_test = np.array(padded_test)
y_test = np.array(y_test)

In [13]:

model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(5, activation='softmax')
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
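
`sparse_categorical_crossentropy` takes integer class labels and scores each example as the negative log of the probability that the softmax layer assigns to the true class. A minimal sketch of that arithmetic in plain Python follows; `sparse_cce_single` is a hypothetical helper and the probabilities are made up for illustration.

```python
import math


def sparse_cce_single(label, probabilities):
    """Loss for one example: -log of the probability given to the true class."""
    return -math.log(probabilities[label])


num_classes = 5

# A model that spreads probability evenly over the five classes.
uniform = [1.0 / num_classes] * num_classes
baseline_loss = sparse_cce_single(3, uniform)

# A model that puts most of its probability on the true class (index 3).
confident = [0.05, 0.05, 0.10, 0.70, 0.10]
confident_loss = sparse_cce_single(3, confident)

print(round(baseline_loss, 4), round(confident_loss, 4))
```

The uniform case gives ln 5, roughly 1.609, which is a useful yardstick when reading the training loss. With a 5-unit output layer, the integer labels need to lie in the range 0 to 4.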

In [14]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 531, 16)           160000    
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dense (Dense)                (None, 32)                544       
_________________________________________________________________
dropout (Dropout)            (None, 32)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 165       
Total params: 160,709
Trainable params: 160,709
Non-trainable params: 0
_________________________________________________________________
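
The parameter counts in the summary can be reproduced by hand from `vocab_size`, `embedding_dim`, and the layer widths. The variable names `hidden_units` and `num_classes` below are introduced only for this sketch:

```python
vocab_size = 10000
embedding_dim = 16
hidden_units = 32   # width of the first Dense layer
num_classes = 5     # width of the softmax output layer

# Embedding: one vector of size embedding_dim per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# Dense layers: one weight per input-output pair, plus one bias per output unit.
dense_params = embedding_dim * hidden_units + hidden_units
output_params = hidden_units * num_classes + num_classes

total_params = embedding_params + dense_params + output_params
print(embedding_params, dense_params, output_params, total_params)
```

Pooling and dropout layers have no weights, which is why they show 0 parameters. Almost all of the model's capacity sits in the embedding table.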


In [15]:
num_epochs = 50
history = model.fit(
    padded_training, y_train,
    epochs=num_epochs,
    validation_data=(padded_test, y_test),
    verbose=2
)

Epoch 1/50
481/481 - 1s - loss: 1.3808 - accuracy: 0.4392 - val_loss: 1.3301 - val_accuracy: 0.4353
Epoch 2/50
481/481 - 1s - loss: 1.2520 - accuracy: 0.4580 - val_loss: 1.1252 - val_accuracy: 0.4909
Epoch 3/50
481/481 - 1s - loss: 1.0256 - accuracy: 0.5262 - val_loss: 0.9958 - val_accuracy: 0.5382
Epoch 4/50
481/481 - 1s - loss: 0.9281 - accuracy: 0.5744 - val_loss: 0.9399 - val_accuracy: 0.5862
Epoch 5/50
481/481 - 1s - loss: 0.8730 - accuracy: 0.6026 - val_loss: 0.9225 - val_accuracy: 0.5885
Epoch 6/50
481/481 - 1s - loss: 0.8343 - accuracy: 0.6249 - val_loss: 0.8962 - val_accuracy: 0.6069
Epoch 7/50
481/481 - 1s - loss: 0.8010 - accuracy: 0.6421 - val_loss: 0.8819 - val_accuracy: 0.6131
Epoch 8/50
481/481 - 1s - loss: 0.7771 - accuracy: 0.6575 - val_loss: 0.8717 - val_accuracy: 0.6223
Epoch 9/50
481/481 - 1s - loss: 0.7484 - accuracy: 0.6723 - val_loss: 0.8778 - val_accuracy: 0.6207
Epoch 10/50
481/481 - 1s - loss: 0.7247 - accuracy: 0.6847 - val_loss: 0.8662 - val_accuracy: 0.6262
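
Validation loss improves more slowly than training loss and even rises briefly at epoch 9, so the gap between the two is worth tracking. `model.fit` returns a `History` object whose `history` attribute is a dict of per-epoch lists. The sketch below finds the best epoch so far, using a hand-copied dict of the ten epochs logged above in place of `history.history`:

```python
# Values copied from the ten epochs logged above; in the notebook this
# dict would come from history.history instead.
logged = {
    "loss": [1.3808, 1.2520, 1.0256, 0.9281, 0.8730,
             0.8343, 0.8010, 0.7771, 0.7484, 0.7247],
    "val_loss": [1.3301, 1.1252, 0.9958, 0.9399, 0.9225,
                 0.8962, 0.8819, 0.8717, 0.8778, 0.8662],
}

val_loss = logged["val_loss"]

# Epoch (1-based) with the lowest validation loss.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1

# Validation loss minus training loss; a widening gap suggests overfitting.
gaps = [round(v - t, 4) for t, v in zip(logged["loss"], val_loss)]
print(best_epoch, gaps[0], gaps[-1])
```

If the gap keeps widening in later epochs while validation loss stalls, stopping early or adding regularization may be worth trying.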