# Trip Advisor model

## Steps

1. Import Trip Advisor data
2. Tokenize the data (create a word index that represents words as numbers)
3. Use an out-of-vocabulary (OOV) token to represent words not seen while fitting the tokenizer
4. Pad the sequences so they all have the same length
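
Steps 2 to 4 can be illustrated without any library. The following is a minimal pure-Python sketch, not the Keras implementation: the helper names `build_word_index`, `to_sequence`, and `pad_post` are hypothetical, and the toy sentences are invented. It reserves index 1 for the OOV token and pads with 0 at the end, which is consistent with the padded arrays printed later in this notebook.

```python
from collections import Counter

OOV = "<OOV>"

def build_word_index(texts):
    """Map each word to an integer, most frequent first; 1 is reserved for OOV."""
    counts = Counter(word for text in texts for word in text.lower().split())
    index = {OOV: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        index[word] = rank
    return index

def to_sequence(text, index):
    """Replace each word by its index, or by the OOV index if it was never seen."""
    return [index.get(word, index[OOV]) for word in text.lower().split()]

def pad_post(seq, maxlen):
    """Truncate at the end, then pad with zeros at the end ("post" / "post")."""
    return seq[:maxlen] + [0] * max(0, maxlen - len(seq))

# Toy training data (assumption, not taken from to_model.csv)
train = ["nice hotel nice staff", "great hotel location"]
word_index = build_word_index(train)

# "pool" was not in the training texts, so it maps to the OOV index 1
seq = to_sequence("nice pool great hotel", word_index)
padded = pad_post(seq, 6)
print(seq)
print(padded)
```

The word index is built from the training texts only, so any word that appears only in the test set ends up as index 1.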


In [1]:
import pandas as pd

In [2]:
# Important Variables
vocab_size = 10000
trunc_type = "post"
padding_type = "post"
oov_tok = "<OOV>"
embedding_dim = 16

In [3]:
df = pd.read_csv('to_model.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Review,Rating,weight
0,6344,nice corporate westin stayed westin market str...,4,14
1,11013,nice hotel chose hotel based tripadvisor revie...,4,14
2,2,great hotel location union square stay great 6...,4,14
3,4624,ok great potential stayed nights caribe hilton...,3,21
4,2236,just fine make minute trip seattle conventions...,3,21


In [4]:
X_train = list(df['Review'])
y_train = list(df['Rating'])

print(X_train[:5])
print(y_train[-5:])

["nice corporate westin stayed westin market street november 23rd pretty good non refundable internet rate 129 hotel website, n't fact just stays making platinum starwood preferred paid 100 using priceline equivalent 4* hotel.this perfectly acceptable nice clean comfortable corporate hotel no wow factor, decor bit grey gloomy lacks charm westins no near dismal recently renovated westin galleria dallas reviewed room bathroom sized starwood preferred guest floors good views city, conveniently located block away bart moma moscone centre blocks union square.if attending convention moscone centre ideal not recommend romance fun suggest historic westin st francis union square instead chic city centre boutique hotels hotel monaco palomar say staff extremely nice friendly helpful,  ", 'nice hotel chose hotel based tripadvisor reviews loved, actually better expected, room bathroom really big clean, location great short walk main attractions, 5 minutes walk subway, cafe inside building breakfast

In [5]:
df = pd.read_csv('test.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Review,Rating
0,0,ehhh better punta cana twice compared hotel st...,2
1,1,"4 n't think, decided book atenea night stay de...",2
2,2,awesome time just returned vacation fantastic ...,4
3,3,"grand oasis wonderful second time, group 20 fr...",5
4,4,not bad stay stayed hotel family attending jav...,3


In [6]:
X_test = list(df['Review'])
y_test = list(df['Rating'])

print(X_test[:5])
print(y_test[-5:])

["ehhh better punta cana twice compared hotel stayed hotel major need help, let start good rooms little houses 4 rooms bungalow meant privacy, minibar minifrige pretty nice included array different liquors, staff really nice helpful exception, room did not smell like people expierience, pools nice, little shops really cool club plays mixture music, bad, desk staff exception nice staff, completely unorganized rude, rooms n't use key cards just keys, charge room prepared pay cash, n't creditcards, minibar frige stocked day, worst hike pool desk beach, closer not like, area like maze lost houses look map given, overall rate hotel 2-3 5 stars, like title states better,  ", "4 n't think, decided book atenea night stay december received fairly good reviews convenient metro provided breakfast, unfortunately disappointed yes hotel convenient metro positive feature, hotel dated curtains carpets rooms bit scrub, rooms kitchens unless bring kitchen utensils not major advantage, plus sheets clean 

In [7]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [8]:
# Build a word index from the training reviews; words outside it map to the OOV token
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(X_train)
word_index = tokenizer.word_index

In [9]:
import numpy as np

In [10]:
sequences_training = tokenizer.texts_to_sequences(X_train)
# Note: len(x) is the character count of each raw review, not its token count
max_length = int(np.median([len(x) for x in X_train]))
padded_training = pad_sequences(
    sequences_training, 
    padding=padding_type, 
    truncating=trunc_type,
    maxlen=max_length
)

print(padded_training[0])
print(padded_training.shape)

[  12 2112  780   21  780  446   79 1505 3319  105    7  366 4640  160
  208 4964    2  401    5  301   13 1678  439 1736 2696 1279  157  553
  666 1400 3764   50    2  714 1017 1109   12   23   75 2112    2   11
 1255 2631  443   73 3320 4965 2968 1587    1   11  185 4391  614  760
  780 5333 2490 4966    3   42  672 2696 1279  374  682    7  381   85
 1711  161  367   68 3605 8189 8190  415  463 1168  302  634 3094 1315
 8190  415  705    4   57 8191  179  774 1960  780  619 9883 1168  302
  421 2303   85  415  914   64    2 4967    1   80    9  204   12   39
   51    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0 
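
`max_length` above is computed with `len(x)` over the raw strings in `X_train`, so it measures characters rather than tokens. As a minimal sketch of the difference, assuming whitespace-separated toy reviews (the list `reviews` is invented for illustration), the median can be taken over token counts instead:

```python
from statistics import median

# Toy reviews (assumption, not taken from the dataset)
reviews = [
    "nice hotel great location",
    "ok stay",
    "room clean staff friendly breakfast good",
]

char_lengths = [len(r) for r in reviews]           # characters per review
token_lengths = [len(r.split()) for r in reviews]  # whitespace tokens per review

median_chars = int(median(char_lengths))
median_tokens = int(median(token_lengths))
print(median_chars, median_tokens)
```

In the notebook, the token-based version would take lengths over `sequences_training` rather than `X_train`. That would change `max_length` and with it every shape reported afterwards, so the original computation is left in place here.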

In [11]:
sequences_test = tokenizer.texts_to_sequences(X_test)
padded_test = pad_sequences(
    sequences_test, 
    padding=padding_type, 
    truncating=trunc_type,
    maxlen=max_length
)

print(padded_test[0])
print(padded_test.shape)

[   1   55  203  199  424  773    2   21    2  453  103  234  267  422
    7   14   37 2627   50   14 3586  947 1294 1060    1  105   12  262
 4318  184 4654    9   30   12   51 1439    3    8    4  402   25   29
    1  382   12   37  418   30  392  198 4559 4243  410   82   47    9
 1439   12    9  625 3853  384   14    5  104  465 1151   13 1620  279
    3  585  141 1171    5    1 1060    1 1093   19  348 3600   31   47
   18 1061    4   25   36   25 4648  794 2627  214 1366  212  110  208
    2   32   38   35  581   25 6835 2055   55    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0 

In [12]:
# TensorFlow 2.x needs NumPy arrays for model.fit: y_train and y_test are plain lists
# (pad_sequences already returns NumPy arrays)
import numpy as np
padded_training = np.array(padded_training)
y_train = np.array(y_train)
padded_test = np.array(padded_test)
y_test = np.array(y_test)

In [13]:

model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(6, activation='softmax')
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
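
The output layer has 6 units although the ratings appear to run from 1 to 5. A likely reason is that `sparse_categorical_crossentropy` treats each label as a class index starting at 0, so a label of 5 needs index 5 to exist. As a rough pure-Python sketch of what that loss computes for a single sample (the function name `sparse_cce` and the probability vector are made up for illustration):

```python
import math

def sparse_cce(probs, label):
    """Negative log of the probability assigned to the true class index."""
    return -math.log(probs[label])

# Hypothetical softmax output over 6 classes (indices 0..5).
# Index 0 never corresponds to a real rating if ratings are 1..5.
probs = [0.01, 0.04, 0.10, 0.25, 0.40, 0.20]
rating = 4  # used directly as the class index

loss = sparse_cce(probs, rating)
print(round(loss, 4))
```

An alternative would be to subtract 1 from every rating and use 5 output units. The notebook does not do this.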

In [14]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 554, 16)           160000    
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dense (Dense)                (None, 32)                544       
_________________________________________________________________
dropout (Dropout)            (None, 32)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 6)                 198       
Total params: 160,742
Trainable params: 160,742
Non-trainable params: 0
_________________________________________________________________
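
The parameter counts in the summary can be checked by hand. This short sketch reproduces them from the layer sizes used above. The pooling and dropout layers have no weights, so they contribute nothing.

```python
vocab_size = 10000
embedding_dim = 16

embedding_params = vocab_size * embedding_dim  # one 16-dim vector per vocabulary entry
dense_params = embedding_dim * 32 + 32         # weights + biases of Dense(32)
output_params = 32 * 6 + 6                     # weights + biases of Dense(6)

total = embedding_params + dense_params + output_params
print(embedding_params, dense_params, output_params, total)
```

Nearly all of the parameters sit in the embedding table, so `vocab_size` and `embedding_dim` dominate the model size.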


In [15]:
num_epochs = 50
history = model.fit(
    padded_training, y_train,
    epochs=num_epochs,
    validation_data=(padded_test, y_test),
    verbose=2
)

Epoch 1/50
145/145 - 0s - loss: 1.6944 - accuracy: 0.3213 - val_loss: 1.5393 - val_accuracy: 0.2979
Epoch 2/50
145/145 - 0s - loss: 1.5788 - accuracy: 0.3289 - val_loss: 1.4888 - val_accuracy: 0.2979
Epoch 3/50
145/145 - 0s - loss: 1.5625 - accuracy: 0.3343 - val_loss: 1.4956 - val_accuracy: 0.2979
Epoch 4/50
145/145 - 0s - loss: 1.5505 - accuracy: 0.3258 - val_loss: 1.4871 - val_accuracy: 0.2979
Epoch 5/50
145/145 - 0s - loss: 1.5384 - accuracy: 0.3341 - val_loss: 1.4638 - val_accuracy: 0.2985
Epoch 6/50
145/145 - 0s - loss: 1.5052 - accuracy: 0.3395 - val_loss: 1.4345 - val_accuracy: 0.3072
Epoch 7/50
145/145 - 0s - loss: 1.4121 - accuracy: 0.3831 - val_loss: 1.3481 - val_accuracy: 0.3260
Epoch 8/50
145/145 - 0s - loss: 1.3005 - accuracy: 0.4328 - val_loss: 1.2417 - val_accuracy: 0.3447
Epoch 9/50
145/145 - 0s - loss: 1.2083 - accuracy: 0.4631 - val_loss: 1.1931 - val_accuracy: 0.3574
Epoch 10/50
145/145 - 0s - loss: 1.1403 - accuracy: 0.4829 - val_loss: 1.1355 - val_accuracy: 0.3613