NLP PREDICTIONS FOR YELP AND AMAZON REVIEWS

In [12]:
# The objective here is to prepare the data for an NLP prediction task, using a dataset of Yelp and Amazon reviews.
# Each row of the dataset contains a review text and a sentiment label: 0 (negative) or 1 (positive).
import numpy as np
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

path = 'https://raw.githubusercontent.com/AleGL92/TensorFlow/main/NLP/combined_data.csv'
ds = pd.read_csv(path)
print(ds.head())
reviews = list(ds['text'])
# print(reviews)

   Unnamed: 0                                               text  sentiment
0           0  So there is no way for me to plug it in here i...          0
1           1                         Good case Excellent value.          1
2           2                             Great for the jawbone.          1
3           3  Tied to charger for conversations lasting more...          0
4           4                                  The mic is great.          1


1. Tokenize the text

In [13]:
tokenizer = Tokenizer(oov_token = '<OOV>')
tokenizer.fit_on_texts(reviews)
word_index = tokenizer.word_index
print(f'There are {len(word_index)} different words')
# print(word_index)

sequences = tokenizer.texts_to_sequences(reviews)
padded_sequences = pad_sequences(sequences, padding='post')     # 'post' appends the padding zeros after the sequence

print(padded_sequences.shape)       # The shape shows the number of sequences and the length of each one.
print(reviews[0])               # Printing the first review in words
print(padded_sequences[0])      # Printing the first review in sequence

There are 3261 different words
(1992, 139)
So there is no way for me to plug it in here in the US unless I go by a converter.
[  28   59    8   56  142   13   61    7  269    6   15   46   15    2
  149  449    4   60  113    5 1429    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0]
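
To make the tokenization step more concrete, here is a minimal pure-Python sketch of the two ideas used above: a frequency-ranked word index with a reserved out-of-vocabulary slot, and post-padding with zeros. It is only a rough approximation of what the Keras Tokenizer and pad_sequences do, not their actual implementation. The helper names (build_word_index, to_sequences, pad_post) and the toy reviews are hypothetical and exist only for this illustration.

```python
from collections import Counter
import re

WORD_PATTERN = re.compile(r"[a-z0-9']+")  # simplified word splitter (assumption)

def build_word_index(texts, oov_token="<OOV>"):
    """Map each word to an integer; more frequent words get smaller indices."""
    counts = Counter()
    for text in texts:
        counts.update(WORD_PATTERN.findall(text.lower()))
    # Index 0 is left free for padding; index 1 is reserved for unknown words.
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_sequences(texts, word_index, oov_token="<OOV>"):
    """Replace every word by its index, or by the OOV index if it is unknown."""
    oov_id = word_index[oov_token]
    return [[word_index.get(w, oov_id) for w in WORD_PATTERN.findall(t.lower())]
            for t in texts]

def pad_post(sequences, value=0):
    """Append zeros so that every sequence is as long as the longest one."""
    longest = max(len(s) for s in sequences)
    return [s + [value] * (longest - len(s)) for s in sequences]

toy_reviews = ["The mic is great", "Great case great value", "The battery died"]
toy_index = build_word_index(toy_reviews)

# "screen" was never seen while building the index, so it becomes the OOV id (1).
toy_padded = pad_post(to_sequences(toy_reviews + ["The screen is great"], toy_index))
for row in toy_padded:
    print(row)
```

The shorter third review ends with a 0, and the unseen word in the last review is encoded as 1, mirroring the roles of padding and '<OOV>' in the cell above.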


TRAINING A BASIC SENTIMENT MODEL WITH EMBEDDINGS

1. Preparing the dataset

In [14]:
# We'll use the same dataset as before, this time together with the labels.

path = 'https://raw.githubusercontent.com/AleGL92/TensorFlow/main/NLP/combined_data.csv'
ds = pd.read_csv(path)
print(ds.head())
reviews = list(ds['text'])
labels = list(ds['sentiment'])

# We are not using a helper such as train_test_split here, so we split the training and evaluation data manually (80% / 20%)
training_size = int(len(reviews) * 0.8)
training_sentences = reviews[0:training_size]
testing_sentences = reviews[training_size:]
training_labels = labels[0:training_size]
testing_labels = labels[training_size:]
# Make labels into numpy arrays for use with the network later
training_labels_final = np.array(training_labels)
testing_labels_final = np.array(testing_labels)

   Unnamed: 0                                               text  sentiment
0           0  So there is no way for me to plug it in here i...          0
1           1                         Good case Excellent value.          1
2           2                             Great for the jawbone.          1
3           3  Tied to charger for conversations lasting more...          0
4           4                                  The mic is great.          1
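
The manual split above keeps the rows in file order, so the last 20% of the file becomes the test set. If the CSV happened to be grouped by source or by label (an assumption, not something checked here), a shuffled split would give a more representative test set. The sketch below shows one way to do that with the standard library only; shuffled_split, its seed and the toy data are hypothetical and not part of the original notebook.

```python
import random

def shuffled_split(texts, labels, train_fraction=0.8, seed=42):
    """Shuffle (text, label) pairs together, then cut at train_fraction."""
    pairs = list(zip(texts, labels))
    random.Random(seed).shuffle(pairs)      # a fixed seed keeps the split reproducible
    cut = int(len(pairs) * train_fraction)
    train, test = pairs[:cut], pairs[cut:]
    train_texts, train_labels = map(list, zip(*train))
    test_texts, test_labels = map(list, zip(*test))
    return train_texts, test_texts, train_labels, test_labels

# Toy data: the label is simply the parity of the review number.
toy_texts = [f"review {i}" for i in range(10)]
toy_labels = [i % 2 for i in range(10)]

tr_x, te_x, tr_y, te_y = shuffled_split(toy_texts, toy_labels)
print(len(tr_x), len(te_x))
```

Shuffling the pairs rather than the two lists separately keeps every review attached to its own label.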


2. Tokenize the data

In [15]:
vocab_size = 1000           # Maximum number of words kept by the tokenizer
embedding_dim = 16          # Dimension of the embedding vector learned for each word
max_length = 100            # Maximum length of the sequences
trunc_type='post'           # Truncate the end of the sequences
padding_type='post'         # Pad the end of the sequences
oov_tok = "<OOV>"           # Out Of Vocabulary token

tokenizer = Tokenizer(num_words = vocab_size, oov_token = oov_tok)
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(training_sentences)
padded = pad_sequences(sequences, maxlen = max_length, padding = padding_type, truncating = trunc_type)

testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen = max_length, padding = padding_type, truncating = trunc_type)
# print(word_index)         # {'<OOV>': 1, 'the': 2, 'and': 3, 'i': 4, 'it': 5, 'a': 6, etc

In [16]:
# Checking the preparation results
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])    # Swap keys and values: index -> word

def decode_review(text):
    # dict.get returns the word stored for an index, or '?' when the index is missing.
    # Index 0 (padding) is never in word_index, so every padded position decodes to '?'.
    # ' '.join puts a single space between the decoded words.
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

print(padded[1])
print(decode_review(padded[1]))
print(training_sentences[1])

[ 20  90  76 364   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0]
good case excellent value ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
Good case Excellent value.
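
The decoded output above is dominated by '?' symbols because every padding zero is decoded too. A small variation, sketched below, skips the padding index so that only the real words remain. The function name decode_without_padding is hypothetical, and toy_reverse_index is a hand-written excerpt based on the sequence printed above, not the full index.

```python
def decode_without_padding(sequence, reverse_index, pad_id=0):
    """Turn a padded sequence back into text, ignoring the padding positions."""
    return ' '.join(reverse_index.get(i, '?') for i in sequence if i != pad_id)

# Hand-written excerpt of the index, taken from the printed sequence [20 90 76 364 0 ...].
toy_reverse_index = {1: '<OOV>', 20: 'good', 90: 'case', 76: 'excellent', 364: 'value'}
toy_sequence = [20, 90, 76, 364, 0, 0, 0]

decoded = decode_without_padding(toy_sequence, toy_reverse_index)
print(decoded)
```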


3. Training the model

In [17]:
# Now we train the sentiment model with embeddings. The embedding layer comes first, and the output layer has a single node,
# since the label is either 0 or 1 (negative or positive)
import tensorflow as tf
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length = max_length),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(6, activation = 'relu'),
    tf.keras.layers.Dense(1, activation = 'sigmoid')
])
model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])     
# These are common settings for binary text classification: binary cross-entropy loss for the sigmoid output, with the Adam optimizer
model.summary()

num_epochs = 10
model.fit(padded, training_labels_final, epochs=num_epochs, validation_data=(testing_padded, testing_labels_final))

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 100, 16)           16000     
                                                                 
 flatten_1 (Flatten)         (None, 1600)              0         
                                                                 
 dense_4 (Dense)             (None, 6)                 9606      
                                                                 
 dense_5 (Dense)             (None, 1)                 7         
                                                                 
Total params: 25,613
Trainable params: 25,613
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1e85dd0c9a0>
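
The parameter counts in the summary above can be reproduced by hand, which helps to see where the weights of this model live. The sketch below is plain arithmetic under the usual conventions (an embedding table of vocab_size x embedding_dim, and a dense layer with one weight per input plus one bias per unit); the helper names are hypothetical.

```python
def embedding_params(vocab_size, embedding_dim):
    """One embedding vector of size embedding_dim per word index."""
    return vocab_size * embedding_dim

def dense_params(inputs, units):
    """One weight per (input, unit) pair, plus one bias per unit."""
    return inputs * units + units

vocab_size, embedding_dim, max_length = 1000, 16, 100

flattened_size = max_length * embedding_dim          # Flatten: 100 * 16 = 1600 values
param_counts = {
    'embedding': embedding_params(vocab_size, embedding_dim),
    'flatten': 0,                                     # Flatten has no weights
    'dense_hidden': dense_params(flattened_size, 6),
    'dense_output': dense_params(6, 1),
}
total_params = sum(param_counts.values())
print(param_counts, total_params)
```

Most of the weights outside the embedding table sit in the first Dense layer, because it is connected to all 1600 flattened values.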

4. Making predictions

In [18]:
# Use the model to predict a review   
user_reviews = ['I love this phone', 'I hate spaghetti', 
                'Everything was cold',
                'Everything was hot exactly as I wanted', 
                'Everything was green', 
                'the host seated us immediately',
                'they gave us free chocolate cake', 
                'not sure about the wilted flowers on the table',
                'only works when I stand on tippy toes', 
                'does not work when I stand on my head']
print(user_reviews) 

# Create the sequences, padded and truncated the same way as the training data
padding_type = 'post'
sample_sequences = tokenizer.texts_to_sequences(user_reviews)
user_padded = pad_sequences(sample_sequences, padding = padding_type, truncating = trunc_type, maxlen = max_length)

print('\nPredictions for the user reviews:')              
classes = model.predict(user_padded)

# The closer the class is to 1, the more positive the review is deemed to be
for x in range(len(user_reviews)):
    print(user_reviews[x])
    print(classes[x], '\n')

# Some predictions are close to 0.5, meaning the model is unsure about them. Keep in mind that some of the reviews were
# deliberately confusing, to check how the model copes.

['I love this phone', 'I hate spaghetti', 'Everything was cold', 'Everything was hot exactly as I wanted', 'Everything was green', 'the host seated us immediately', 'they gave us free chocolate cake', 'not sure about the wilted flowers on the table', 'only works when I stand on tippy toes', 'does not work when I stand on my head']

Predictions for the user reviews:
I love this phone
[0.9959698] 

I hate spaghetti
[0.04349113] 

Everything was cold
[0.4546066] 

Everything was hot exactly as I wanted
[0.877783] 

Everything was green
[0.5545907] 

the host seated us immediately
[0.75958633] 

they gave us free chocolate cake
[0.76843816] 

not sure about the wilted flowers on the table
[0.02256727] 

only works when I stand on tippy toes
[0.9578408] 

does not work when I stand on my head
[0.00701591] 
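
The scores printed above are probabilities from the sigmoid output, so turning them into labels needs a threshold. The sketch below applies the conventional 0.5 cut-off to the ten scores copied from the output above, and also flags scores that sit close to the threshold. The function name label_scores and the 0.1 margin are arbitrary choices made for this illustration.

```python
def label_scores(scores, threshold=0.5, margin=0.1):
    """Return (labels, unsure): 0/1 labels, and indices of scores near the threshold."""
    labels = [1 if s >= threshold else 0 for s in scores]
    unsure = [i for i, s in enumerate(scores) if abs(s - threshold) < margin]
    return labels, unsure

# Scores copied from the output printed above, in the same order as user_reviews.
printed_scores = [0.9959698, 0.04349113, 0.4546066, 0.877783, 0.5545907,
                  0.75958633, 0.76843816, 0.02256727, 0.9578408, 0.00701591]

labels, unsure = label_scores(printed_scores)
print(labels)
print(unsure)
```

With this margin, only 'Everything was cold' and 'Everything was green' are flagged as unsure in this run.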



5. Tweaking the model

In [19]:
# We define another model, but with different values for the parameters
vocab_size = 500           # Maximum number of words kept by the tokenizer. Before it was 1000
embedding_dim = 16          # Dimension of the embedding vector learned for each word
max_length = 50            # Maximum length of the sequences. Before it was 100
trunc_type='post'           # Truncate the end of the sequences
padding_type='post'         # Pad the end of the sequences
oov_tok = "<OOV>"           # Out Of Vocabulary token

tokenizer = Tokenizer(num_words = vocab_size, oov_token = oov_tok)
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(training_sentences)
padded = pad_sequences(sequences, maxlen = max_length, padding = padding_type, truncating = trunc_type)

testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen = max_length, padding = padding_type, truncating = trunc_type)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length = max_length),
    tf.keras.layers.GlobalAveragePooling1D(),      # GlobalAveragePooling1D() instead of Flatten(): it averages the embeddings over the sequence, which is supposed to give better results.
    tf.keras.layers.Dense(6, activation = 'relu'),
    tf.keras.layers.Dense(1, activation = 'sigmoid')
])
model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])     
# These are common settings for binary text classification: binary cross-entropy loss for the sigmoid output, with the Adam optimizer
model.summary()

num_epochs = 30         # 30 epochs instead of 10
model.fit(padded, training_labels_final, epochs=num_epochs, validation_data=(testing_padded, testing_labels_final))


Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, 50, 16)            8000      
                                                                 
 global_average_pooling1d_1   (None, 16)               0         
 (GlobalAveragePooling1D)                                        
                                                                 
 dense_6 (Dense)             (None, 6)                 102       
                                                                 
 dense_7 (Dense)             (None, 1)                 7         
                                                                 
Total params: 8,109
Trainable params: 8,109
Non-trainable params: 0
_________________________________________________________________
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 1

<keras.callbacks.History at 0x1e85db0b790>
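
The summary above shows the effect of replacing Flatten with GlobalAveragePooling1D: the pooled output has only 16 values, so the next Dense layer needs 102 parameters. The pure-Python sketch below illustrates the difference between the two operations on a tiny hand-made "embedded sequence" and redoes the parameter arithmetic. The function names and the toy values are hypothetical, and the code mimics the idea rather than the Keras implementation.

```python
def flatten(embedded):
    """Concatenate all timestep vectors into one long vector."""
    return [x for vector in embedded for x in vector]

def global_average_pool(embedded):
    """Average each embedding dimension over all timesteps."""
    steps = len(embedded)
    return [sum(column) / steps for column in zip(*embedded)]

# Toy sequence: 3 timesteps, each with a 2-dimensional embedding.
toy_embedded = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
flat = flatten(toy_embedded)                # length = timesteps * dimensions
pooled = global_average_pool(toy_embedded)  # length = dimensions only
print(flat, pooled)

def dense_params(inputs, units):
    """One weight per (input, unit) pair, plus one bias per unit."""
    return inputs * units + units

max_length, embedding_dim, vocab_size = 50, 16, 500
dense_after_flatten = dense_params(max_length * embedding_dim, 6)
dense_after_pooling = dense_params(embedding_dim, 6)
total_with_pooling = vocab_size * embedding_dim + dense_after_pooling + dense_params(6, 1)
print(dense_after_flatten, dense_after_pooling, total_with_pooling)
```

Because pooling discards the position of each word, the model sees a review roughly as an average of its word vectors, in exchange for far fewer weights.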

In [20]:
# Use the model to predict a review. This repeats the earlier prediction cell, to check the results after tweaking.
user_reviews = ['I love this phone', 'I hate spaghetti', 
                'Everything was cold',
                'Everything was hot exactly as I wanted', 
                'Everything was green', 
                'the host seated us immediately',
                'they gave us free chocolate cake', 
                'not sure about the wilted flowers on the table',
                'only works when I stand on tippy toes', 
                'does not work when I stand on my head']
print(user_reviews) 

# Create the sequences, padded and truncated the same way as the training data
padding_type = 'post'
sample_sequences = tokenizer.texts_to_sequences(user_reviews)
user_padded = pad_sequences(sample_sequences, padding = padding_type, truncating = trunc_type, maxlen = max_length)

print('\nPredictions for the user reviews:')              
classes = model.predict(user_padded)

# The closer the class is to 1, the more positive the review is deemed to be
for x in range(len(user_reviews)):
    print(user_reviews[x])
    print(classes[x], '\n')

# In the previous run, we were not satisfied with the predictions that landed near 0.5.
# After the tweaks, the model picks up on individual words: it treats words like 'love' or 'free' as positive, and words like
# 'don't', 'not' or 'hate' as negative.
# Some ambiguous reviews still score around 0.5, because the model cannot pin down their sentiment.

['I love this phone', 'I hate spaghetti', 'Everything was cold', 'Everything was hot exactly as I wanted', 'Everything was green', 'the host seated us immediately', 'they gave us free chocolate cake', 'not sure about the wilted flowers on the table', 'only works when I stand on tippy toes', 'does not work when I stand on my head']

Predictions for the user reviews:
I love this phone
[0.91190165] 

I hate spaghetti
[0.162869] 

Everything was cold
[0.6069859] 

Everything was hot exactly as I wanted
[0.4335965] 

Everything was green
[0.6069859] 

the host seated us immediately
[0.5828072] 

they gave us free chocolate cake
[0.8504108] 

not sure about the wilted flowers on the table
[0.09598145] 

only works when I stand on tippy toes
[0.85347605] 

does not work when I stand on my head
[0.02362505] 
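
In the output above, 'Everything was cold' and 'Everything was green' receive exactly the same score (0.6069859). One plausible explanation, not verified against the actual word index, is that both 'cold' and 'green' rank outside the 500-word vocabulary, so both reviews are encoded as the same sequence. The sketch below shows that effect with a hand-made toy index. The encode helper and the index values are hypothetical, and the rule "keep only indices below num_words" follows how the Keras Tokenizer applies its num_words limit.

```python
def encode(text, word_index, num_words, oov_id=1):
    """Encode a text, sending unknown or too-rare words to the OOV index."""
    ids = []
    for word in text.lower().split():
        idx = word_index.get(word)
        ids.append(idx if idx is not None and idx < num_words else oov_id)
    return ids

# Toy index: smaller numbers stand for more frequent words (values are made up).
toy_word_index = {'<OOV>': 1, 'was': 2, 'everything': 3, 'cold': 4, 'green': 5}

small_vocab_cold = encode('Everything was cold', toy_word_index, num_words=4)
small_vocab_green = encode('Everything was green', toy_word_index, num_words=4)
large_vocab_cold = encode('Everything was cold', toy_word_index, num_words=6)
large_vocab_green = encode('Everything was green', toy_word_index, num_words=6)

print(small_vocab_cold, small_vocab_green)   # identical: both rare words become OOV
print(large_vocab_cold, large_vocab_green)   # different: both words are kept
```

Identical input sequences always produce identical predictions, so a smaller vocabulary trades some ability to tell reviews apart for a smaller embedding table.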

