# Trip Advisor model

## Steps

1. Import Trip Advisor data
2. Tokenize the data (build a word index that maps each word to an integer)
3. Use an out-of-vocabulary (OOV) token to represent words not seen during training
4. Pad the sentences so they all have the same length
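
The tokenize, OOV, and pad steps can be sketched in plain Python. This is only an illustration of the idea, not the Keras implementation: the helper names `build_word_index`, `texts_to_sequences`, and `pad_post` are made up for this example, words are split on whitespace only, and ties in word frequency are broken alphabetically (an assumption made here to keep the output deterministic).

```python
from collections import Counter

OOV = "<OOV>"

def build_word_index(texts, oov_token=OOV):
    """Map each word to an integer id; more frequent words get smaller ids."""
    counts = Counter(word for text in texts for word in text.split())
    # Id 0 is reserved for padding and id 1 for the OOV token.
    word_index = {oov_token: 1}
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for word_id, (word, _) in enumerate(ordered, start=2):
        word_index[word] = word_id
    return word_index

def texts_to_sequences(texts, word_index, oov_token=OOV):
    """Replace each word by its id, or by the OOV id if it was never seen."""
    oov_id = word_index[oov_token]
    return [[word_index.get(w, oov_id) for w in t.split()] for t in texts]

def pad_post(sequences, maxlen):
    """Truncate at the end and pad with zeros at the end ("post" / "post")."""
    return [seq[:maxlen] + [0] * (maxlen - len(seq)) for seq in sequences]

train = ["great hotel great locat", "hotel pool beauti"]
word_index = build_word_index(train)

# "smelli" does not occur in the training texts, so it becomes id 1 (OOV).
new_texts = ["great pool smelli", "hotel"]
padded = pad_post(texts_to_sequences(new_texts, word_index), maxlen=4)
print(word_index)
print(padded)
```

The word index is built from the training texts only, so any word that first appears in new data falls back to the OOV id instead of being dropped.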


In [1]:
import pandas as pd

In [2]:
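# Important variables
vocab_size = 10000      # cap on the vocabulary used by the tokenizer and the embedding
trunc_type = "post"     # truncate long reviews at the end
padding_type = "post"   # pad short reviews at the end
oov_tok = "<OOV>"       # placeholder for out-of-vocabulary words
embedding_dim = 16      # size of each word embedding vector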

In [3]:
df = pd.read_csv('to_model.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Review,Rating
0,0,fabul hotel mum return 4 night stay hotel 1898...,4
1,1,romant intern ambienc spent honeymoon melia ca...,4
2,2,great hotel locat union squar stay great 6 peo...,3
3,3,pretti outsid smelli insid beach beauti pool p...,1
4,4,kid love spent night hotel wife kid christma h...,4


In [4]:
X_train = list(df['Review'])
y_train = list(df['Rating'])

print(X_train[:5])
print(y_train[-5:])

['fabul hotel mum return 4 night stay hotel 1898 fabulous, recent decor room immacul decor fantast realli beauti hotel, locat perfect nicer end la rambla need close treat return tranquil hotel busi day sight seeing.th staff attent polit spoke perfect english, price drink bar expect littl hard swallow gorgeou cours meal local restur price gin tonics.although roof pool undergo refurbish stay basement pool surround facil adequate,', "romant intern ambienc spent honeymoon melia carib 23-30. plane land torrenti downpour soak skin step plane, rain 7 day hot gorgeous, truli love resort food people, manag help need courteou friendly, nightli show fun casino, pool incred beach beautiful, short stroll resort swim desert stretch beach wanted, took outback tour must-do tourists, island live visit mountain macou beach enjoy lunch siesta hammocks, buy souvini tour rum 2/bottl jewelri 5-10, shop beach bargain big time, n't pay 1/3 ask price, phone room dinner reserv need places, avg, salari hotel wor

In [5]:
df = pd.read_csv('test.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Review,Rating
0,0,ehhh better punta cana twice compar hotel stay...,1
1,1,"4 n't think, decid book atenea night stay dece...",1
2,2,"awesom time return vacat fantast time, weather...",3
3,3,"grand oasi wonder second time, group 20 friend...",4
4,4,bad stay stay hotel famili attend javaon 2008....,2


In [6]:
X_test = list(df['Review'])
y_test = list(df['Rating'])

print(X_test[:5])
print(y_test[-5:])

["ehhh better punta cana twice compar hotel stay hotel major need help, let start good room littl hous 4 room bungalow meant privacy, minibar minifrig pretti nice includ array differ liquors, staff realli nice help exception, room smell like peopl expierience, pool nice, littl shop realli cool club play mixtur music, bad, desk staff except nice staff, complet unorgan rude, room n't use key card keys, charg room prepar pay cash, n't creditcards, minibar frige stock day, worst hike pool desk beach, closer like, area like maze lost hous look map given, overal rate hotel 2-3 5 stars, like titl state better,", "4 n't think, decid book atenea night stay decemb receiv fairli good review conveni metro provid breakfast, unfortun disappoint ye hotel conveni metro posit feature, hotel date curtain carpet room bit scrub, room kitchen unless bring kitchen utensil major advantage, plu sheet clean provid clean towel day, wall hotel extrem familiar neighbour nocturn habit smell smoke rooms.breakfast b

In [7]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [8]:
# Build a word index from the training reviews; unseen words map to the OOV token
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(X_train)
word_index = tokenizer.word_index

In [9]:
import numpy as np

In [10]:
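sequences_training = tokenizer.texts_to_sequences(X_train)
# Note: len(x) counts the characters in each raw review, not its tokens,
# so max_length is the median character count (470 in this run).
max_length = int(np.median([len(x) for x in X_train]))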
padded_training = pad_sequences(
    sequences_training, 
    padding=padding_type, 
    truncating=trunc_type,
    maxlen=max_length
)

print(padded_training[0])
print(padded_training.shape)

[ 413    2 3188   60   75    9    4    2 4613 1177  420  219    3 1059
  219  147   25   64    2   13  105  751  133  189  537   58   82  421
   60 2811    2  134   11  620 3097   27    8  471  626  475  105  158
   51   69   30   93   36  204 8673  660  400  263  253 1154   51 4396
    1 2699  858   24 3575 1452    4 1802   24  753  356 1453    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0 
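
The padded review above has fewer than 70 non-zero ids, so most of each row is padding. One likely reason is that `max_length` is derived from `len(x)` on the raw strings, which counts characters rather than tokens. The sketch below shows the difference on three shortened reviews taken from the data preview; splitting on whitespace is an assumption that only approximates what the Keras tokenizer does.

```python
from statistics import median

reviews = [
    "great hotel locat union squar",
    "pretti outsid smelli insid beach beauti pool",
    "kid love spent night hotel",
]

# Median length measured in characters (what len() on a string returns).
char_median = int(median(len(r) for r in reviews))

# Median length measured in whitespace-separated tokens.
token_median = int(median(len(r.split()) for r in reviews))

print(char_median, token_median)  # 29 5
```

In the notebook, the token-based equivalent would take the lengths of the entries in `sequences_training` instead of `X_train`. That changes `max_length`, and with it the input shape of the model, so the outputs shown here would no longer match.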

In [11]:
sequences_test = tokenizer.texts_to_sequences(X_test)
padded_test = pad_sequences(
    sequences_test, 
    padding=padding_type, 
    truncating=trunc_type,
    maxlen=max_length
)

print(padded_test[0])
print(padded_test.shape)

[   1   72  211  209  439  474    2    4    2  371   58   35  265  214
    7    3   36  422   75    3 3413 1084 2869 1032    1  145   10  114
 3527  154    1    8   25   10   35 3929    3  337   21   40    1   24
   10   36   87   25  364  160  408 4686  433  113   62    8  585   10
    8  398 6720  508    3    6   53  401  323 4650  237    3  588  139
 1157    6    1 1032    1  730   11  543 2256   24   62   15 1078   21
   29   21 4605  884  422   45  876  264  127  130    2   28   50   46
 1669   21 4696  581   72    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0 
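
In the padded test review above, every `1` is the OOV id: a word that was not in the index built from the training reviews. A rough way to check how much of a review is unknown to the model is to measure the share of OOV ids among the non-padding positions. The helper `oov_rate` is a hypothetical name for this sketch, and the example row is the first 14 ids of the output above followed by two padding zeros.

```python
def oov_rate(padded_row, oov_id=1, pad_id=0):
    """Fraction of non-padding ids in a row that are the OOV id."""
    tokens = [t for t in padded_row if t != pad_id]
    if not tokens:
        return 0.0
    return sum(t == oov_id for t in tokens) / len(tokens)

# First 14 ids of the padded test review shown above, plus two pad zeros.
row = [1, 72, 211, 209, 439, 474, 2, 4, 2, 371, 58, 35, 265, 214, 0, 0]
rate = oov_rate(row)
print(f"{rate:.3f}")
```

A high rate across the test set would suggest that `vocab_size` is too small or that the training and test vocabularies differ a lot.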

In [12]:
# model.fit in TensorFlow 2.x expects NumPy arrays, not Python lists.
# pad_sequences already returns arrays; the label lists are what need converting.
padded_training = np.array(padded_training)
y_train = np.array(y_train)
padded_test = np.array(padded_test)
y_test = np.array(y_test)

In [13]:

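# Embedding -> average over the sequence -> small dense classifier over 5 rating classes
model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),  # average the word vectors of each review
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(5, activation='softmax')  # one probability per rating class
])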

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
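
`sparse_categorical_crossentropy` takes integer class labels and the softmax probabilities, and returns the negative log of the probability assigned to the true class. With a 5-unit output layer the labels must be integers from 0 to 4. The sample rows are consistent with that encoding but do not prove it, so treat it as an assumption. A minimal hand-computed version for a single example:

```python
import math

def sparse_categorical_crossentropy(probs, label):
    """Loss for one example: -log of the probability of the true class."""
    return -math.log(probs[label])

num_classes = 5

# A model that guesses uniformly scores ln(5) regardless of the true label.
uniform = [1 / num_classes] * num_classes
baseline = sparse_categorical_crossentropy(uniform, label=3)

# A confident, correct prediction gives a much lower loss.
confident = sparse_categorical_crossentropy([0.05, 0.05, 0.1, 0.1, 0.7], label=4)

print(f"uniform baseline: {baseline:.4f}")
print(f"confident and correct: {confident:.4f}")
```

The uniform baseline, ln(5) ≈ 1.609, is a useful reference point when reading the training log: a loss below it means the model is doing better than uniform guessing.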

In [14]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 470, 16)           160000    
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dense (Dense)                (None, 32)                544       
_________________________________________________________________
dropout (Dropout)            (None, 32)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 165       
Total params: 160,709
Trainable params: 160,709
Non-trainable params: 0
_________________________________________________________________
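
The parameter counts in the summary can be checked by hand. The embedding stores one `embedding_dim`-sized vector per vocabulary entry, and each dense layer has one weight per input-output pair plus one bias per output. The helper `dense_params` is a name introduced only for this check.

```python
vocab_size = 10000
embedding_dim = 16

def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus biases (n_out) of a fully connected layer."""
    return n_in * n_out + n_out

embedding = vocab_size * embedding_dim    # 160000
dense = dense_params(embedding_dim, 32)   # 544
dense_1 = dense_params(32, 5)             # 165

# Pooling and dropout layers have no trainable parameters.
total = embedding + dense + dense_1
print(total)  # 160709
```

Almost all of the parameters sit in the embedding table, so `vocab_size` and `embedding_dim` are the settings that dominate the model size.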


In [15]:
num_epochs = 30
history = model.fit(
    padded_training, y_train,
    epochs=num_epochs,
    validation_data=(padded_test, y_test),
    verbose=2
)

Epoch 1/30
481/481 - 1s - loss: 1.3827 - accuracy: 0.4323 - val_loss: 1.3355 - val_accuracy: 0.4353
Epoch 2/30
481/481 - 1s - loss: 1.2622 - accuracy: 0.4541 - val_loss: 1.1440 - val_accuracy: 0.4650
Epoch 3/30
481/481 - 1s - loss: 1.0537 - accuracy: 0.5134 - val_loss: 1.0415 - val_accuracy: 0.5366
Epoch 4/30
481/481 - 1s - loss: 0.9703 - accuracy: 0.5499 - val_loss: 0.9915 - val_accuracy: 0.5505
Epoch 5/30
481/481 - 1s - loss: 0.9280 - accuracy: 0.5696 - val_loss: 0.9667 - val_accuracy: 0.5698
Epoch 6/30
481/481 - 1s - loss: 0.8931 - accuracy: 0.5932 - val_loss: 0.9636 - val_accuracy: 0.5659
Epoch 7/30
481/481 - 1s - loss: 0.8637 - accuracy: 0.6022 - val_loss: 0.9452 - val_accuracy: 0.5846
Epoch 8/30
481/481 - 1s - loss: 0.8372 - accuracy: 0.6208 - val_loss: 0.9294 - val_accuracy: 0.5944
Epoch 9/30
481/481 - 1s - loss: 0.8069 - accuracy: 0.6378 - val_loss: 0.9252 - val_accuracy: 0.6006
Epoch 10/30
481/481 - 1s - loss: 0.7855 - accuracy: 0.6498 - val_loss: 0.9242 - val_accuracy: 0.6032
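
To read the log more easily, the loss values of the ten epochs shown above can be copied into lists and summarised. This sketch only restates numbers from the log; in the notebook the same values would come from `history.history`.

```python
# Values transcribed from the first ten epochs of the log above.
train_loss = [1.3827, 1.2622, 1.0537, 0.9703, 0.9280,
              0.8931, 0.8637, 0.8372, 0.8069, 0.7855]
val_loss = [1.3355, 1.1440, 1.0415, 0.9915, 0.9667,
            0.9636, 0.9452, 0.9294, 0.9252, 0.9242]

# Epoch (1-based) with the lowest validation loss so far.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1

# How much validation loss improved in the most recent epoch.
last_gain = val_loss[-2] - val_loss[-1]

# Gap between validation and training loss at the latest epoch.
gap = val_loss[-1] - train_loss[-1]

print(best_epoch)
print(f"{last_gain:.4f}")
print(f"{gap:.4f}")
```

Over these epochs the validation loss is still falling, but by very little per epoch, while the training loss keeps dropping faster. That widening gap is the usual early sign of overfitting.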