# Trip Advisor model

## Steps

1. Import the Trip Advisor data
2. Tokenize the data (build a word index that maps each word to an integer)
3. Use an out-of-vocabulary (OOV) token to represent words not seen when the index was built
4. Pad the sequences so they all have the same length
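
Steps 2 to 4 can be illustrated without any library. The sketch below is a simplified stand-in for what the Keras `Tokenizer` and `pad_sequences` do later in this notebook, not their actual implementation. The helper names `build_word_index`, `to_sequence`, and `pad_post` are hypothetical, and so are the two toy training sentences.

```python
from collections import Counter

OOV = "<OOV>"

def build_word_index(texts, max_words):
    """Map the most frequent words to integers; 0 is kept for padding, 1 for OOV."""
    counts = Counter(word for text in texts for word in text.lower().split())
    index = {OOV: 1}
    for word, _ in counts.most_common(max_words):
        index[word] = len(index) + 1
    return index

def to_sequence(text, index):
    """Replace each word by its integer, falling back to the OOV index."""
    return [index.get(word, index[OOV]) for word in text.lower().split()]

def pad_post(seq, maxlen):
    """Truncate at the end, then pad with zeros at the end ("post" for both)."""
    truncated = seq[:maxlen]
    return truncated + [0] * (maxlen - len(truncated))

train = ["great hotel great staff", "nice hotel"]
index = build_word_index(train, max_words=10)

unseen = pad_post(to_sequence("great pool", index), maxlen=4)
too_long = pad_post(to_sequence("great hotel great staff nice", index), maxlen=4)
print(unseen)    # "pool" was never seen, so it becomes the OOV index 1
print(too_long)  # the fifth word is cut off
```

Reserving 0 for padding keeps padded positions distinguishable from real words, which is why the word indices start at 1.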


In [1]:
import pandas as pd

In [2]:
# Important Variables
vocab_size = 10000
trunc_type = "post"
padding_type = "post"
oov_tok = "<OOV>"
embedding_dim = 16

In [3]:
df = pd.read_csv('to_model.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Review,Rating,weight
0,6375,"great place stay, great place stay looking hot...",3,21
1,11040,nice hotel overal overall inn opera boutique h...,1,29
2,2,great hotel location union square stay great 6...,3,21
3,4636,nice hotel great location nice hotel great loc...,4,14
4,2243,dissapointing 5 star hotel just got 6 night st...,1,29


In [4]:
X_train = list(df['Review'])
y_train = list(df['Rating'])

print(X_train[:5])
print(y_train[-5:])

['great place stay, great place stay looking hotel close airport, room clean staff helpful breakfast good shuttle great, booked reservation upcoming trip, a+,  ', 'nice hotel overal overall inn opera boutique hotel nice antique furnishings small, beware noise issues, heard plumbing surrounding rooms toilet flushes water pressure pipes running shower/bath etc., noise got quite audible odd hours, got home night room true oasis, hope room works, nice quiet non-union square locale relaxing,  ', 'great hotel location union square stay great 6 people total 3 rooms, room little different, enjoyed location close union square shopping just cable car away wharf, staff friendly helpful, definately stay returning sf,  ', "nice hotel great location nice hotel great location arno river, rooms large bathroom large compared places stayed italy a/c really worked, staff helpful accomodating maps directions restaurant recommendations taxis laundry service, booth hotel lobby internet access, n't miss gela

In [5]:
df = pd.read_csv('test.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Review,Rating
0,0,ehhh better punta cana twice compared hotel st...,1
1,1,"4 n't think, decided book atenea night stay de...",1
2,2,awesome time just returned vacation fantastic ...,3
3,3,"grand oasis wonderful second time, group 20 fr...",4
4,4,not bad stay stayed hotel family attending jav...,2


In [6]:
X_test = list(df['Review'])
y_test = list(df['Rating'])

print(X_test[:5])
print(y_test[-5:])

["ehhh better punta cana twice compared hotel stayed hotel major need help, let start good rooms little houses 4 rooms bungalow meant privacy, minibar minifrige pretty nice included array different liquors, staff really nice helpful exception, room did not smell like people expierience, pools nice, little shops really cool club plays mixture music, bad, desk staff exception nice staff, completely unorganized rude, rooms n't use key cards just keys, charge room prepared pay cash, n't creditcards, minibar frige stocked day, worst hike pool desk beach, closer not like, area like maze lost houses look map given, overall rate hotel 2-3 5 stars, like title states better,  ", "4 n't think, decided book atenea night stay december received fairly good reviews convenient metro provided breakfast, unfortunately disappointed yes hotel convenient metro positive feature, hotel dated curtains carpets rooms bit scrub, rooms kitchens unless bring kitchen utensils not major advantage, plus sheets clean 

In [7]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [8]:
# Build a word index from the training reviews, with an OOV token for unseen words
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(X_train)
word_index = tokenizer.word_index

In [9]:
import numpy as np

In [10]:
sequences_training = tokenizer.texts_to_sequences(X_train)
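# Note: len(x) is the character count of each raw review string, not its token count.
# Iterating over sequences_training instead would give the median number of tokens.
max_length = int(np.median([len(x) for x in X_train]))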
padded_training = pad_sequences(
    sequences_training, 
    padding=padding_type, 
    truncating=trunc_type,
    maxlen=max_length
)

print(padded_training[0])
print(padded_training.shape)

[   5   28    9    5   28    9  151    2  100  108    3   24    8   44
   23    7  368    5   73  305 7068   49  289    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0 
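
Only the first 23 entries of the row above are nonzero, because `max_length` is computed from character counts of the raw strings rather than from token counts. A rough sketch of the difference, using the first training review printed earlier. The `simple_tokens` helper is a hypothetical approximation of the Keras `Tokenizer` defaults (lowercasing, punctuation filtering, splitting on spaces), not its actual code.

```python
# Punctuation treated as separators (assumed to mirror the Tokenizer's default filters)
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_tokens(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()

review = ('great place stay, great place stay looking hotel close airport, '
          'room clean staff helpful breakfast good shuttle great, '
          'booked reservation upcoming trip, a+,  ')

n_chars = len(review)                  # what len(x) measures on a raw string
n_tokens = len(simple_tokens(review))  # what the padded row actually contains
print(n_chars, n_tokens)
```

A `max_length` based on token counts would produce much shorter padded rows for the same reviews.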

In [11]:
sequences_test = tokenizer.texts_to_sequences(X_test)
padded_test = pad_sequences(
    sequences_test, 
    padding=padding_type, 
    truncating=trunc_type,
    maxlen=max_length
)

print(padded_test[0])
print(padded_test.shape)

[   1   62  221  210  426  689    2   19    2  399  117  248  303  413
    7   13   36 3467   60   13 3840 1121 1353 1081    1   98   12  253
 3838  187 3832    8   27   12   44 1393    3   10    4  442   26   34
    1  364   12   36  378   27  420  186 3893 3909  454   90   53    8
 1393   12    8  684 4467  453   13    6   97  418 1242   14 1515  308
    3  637  168 1403    6    1 1081    1  740   18  482 3270   30   53
   17 1027    4   26   37   26 4844  910 3467  251 1305  247  115  240
    2   31   40   39  645   26 6682 1582   62    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0 

In [12]:
# Convert inputs and labels to NumPy arrays so model.fit accepts them under TensorFlow 2.x
# (pad_sequences already returns arrays; the label lists are what need converting)
padded_training = np.array(padded_training)
y_train = np.array(y_train)
padded_test = np.array(padded_test)
y_test = np.array(y_test)

In [13]:

model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(5, activation='softmax')
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
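
`sparse_categorical_crossentropy` paired with a 5-unit softmax expects integer labels in the range 0 to 4. The notebook does not show the full set of `Rating` values, so the check below is only a sketch of how one might verify this before training. `check_sparse_labels` is a hypothetical helper, and the sample labels are the five ratings visible in the `to_model.csv` preview.

```python
def check_sparse_labels(labels, num_classes):
    """Raise if any label falls outside the range [0, num_classes)."""
    bad = sorted({int(y) for y in labels if not 0 <= int(y) < num_classes})
    if bad:
        raise ValueError(f"labels outside [0, {num_classes}): {bad}")
    return True

# Ratings from the first five rows of the training preview (assumed representative)
preview_labels = [3, 1, 3, 4, 1]
ok = check_sparse_labels(preview_labels, num_classes=5)
print(ok)
```

If the ratings turned out to run from 1 to 5 instead, subtracting 1 from every label would bring them into range.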

In [14]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 542, 16)           160000    
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dense (Dense)                (None, 32)                544       
_________________________________________________________________
dropout (Dropout)            (None, 32)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 165       
Total params: 160,709
Trainable params: 160,709
Non-trainable params: 0
_________________________________________________________________
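
The parameter counts in the summary can be reproduced by hand from the layer sizes. A short worked check, using only values that appear in this notebook:

```python
vocab_size, embedding_dim = 10000, 16
hidden_units, num_classes = 32, 5

# Embedding: one embedding_dim-sized vector per word index
embedding_params = vocab_size * embedding_dim
# Dense layers: a weight matrix plus one bias per output unit
dense_params = embedding_dim * hidden_units + hidden_units
output_params = hidden_units * num_classes + num_classes
# Pooling and dropout layers have no trainable parameters
total_params = embedding_params + dense_params + output_params

print(embedding_params, dense_params, output_params, total_params)
```

Almost all of the parameters sit in the embedding layer, so `vocab_size` and `embedding_dim` dominate the model size.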


In [15]:
num_epochs = 50
history = model.fit(
    padded_training, y_train,
    epochs=num_epochs,
    validation_data=(padded_test, y_test),
    verbose=2
)

Epoch 1/50
145/145 - 1s - loss: 1.5152 - accuracy: 0.3384 - val_loss: 1.3758 - val_accuracy: 0.4353
Epoch 2/50
145/145 - 1s - loss: 1.4653 - accuracy: 0.3375 - val_loss: 1.3694 - val_accuracy: 0.4392
Epoch 3/50
145/145 - 1s - loss: 1.4552 - accuracy: 0.3382 - val_loss: 1.3541 - val_accuracy: 0.4408
Epoch 4/50
145/145 - 0s - loss: 1.4187 - accuracy: 0.3531 - val_loss: 1.3135 - val_accuracy: 0.4533
Epoch 5/50
145/145 - 0s - loss: 1.3287 - accuracy: 0.3950 - val_loss: 1.2172 - val_accuracy: 0.5017
Epoch 6/50
145/145 - 0s - loss: 1.1991 - accuracy: 0.4490 - val_loss: 1.1125 - val_accuracy: 0.5120
Epoch 7/50
145/145 - 0s - loss: 1.1054 - accuracy: 0.4777 - val_loss: 1.0589 - val_accuracy: 0.5245
Epoch 8/50
145/145 - 0s - loss: 1.0435 - accuracy: 0.5082 - val_loss: 1.0595 - val_accuracy: 0.5161
Epoch 9/50
145/145 - 0s - loss: 0.9958 - accuracy: 0.5343 - val_loss: 1.0057 - val_accuracy: 0.5432
Epoch 10/50
145/145 - 0s - loss: 0.9652 - accuracy: 0.5460 - val_loss: 0.9916 - val_accuracy: 0.5530
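
To read the log at a glance, the validation metrics of the ten epochs shown above can be collected and scanned. The values are copied from the output. The remaining 40 epochs are not shown here, so this says nothing about the final model.

```python
# Validation metrics for epochs 1 to 10, copied from the training log
val_accuracy = [0.4353, 0.4392, 0.4408, 0.4533, 0.5017,
                0.5120, 0.5245, 0.5161, 0.5432, 0.5530]
val_loss = [1.3758, 1.3694, 1.3541, 1.3135, 1.2172,
            1.1125, 1.0589, 1.0595, 1.0057, 0.9916]

# Epoch (1-based) with the highest validation accuracy so far
best_epoch = max(range(len(val_accuracy)), key=val_accuracy.__getitem__) + 1

# Epochs where validation loss was worse than in the previous epoch
regressions = [i + 1 for i in range(1, len(val_loss)) if val_loss[i] > val_loss[i - 1]]

print(best_epoch, regressions)
```

With the real `history` object, the same lists are available as `history.history['val_accuracy']` and `history.history['val_loss']`.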