# Instructor Do: RNNs for NLP - Sentiment Analysis

In this activity, students will learn how to define an LSTM RNN model for sentiment analysis using Keras. The activity also introduces the data preparation that LSTM models require for natural language processing.

In [3]:
# Initial imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from pathlib import Path

%matplotlib inline

## The Dataset

The provided data file contains `6878` customer reviews of coffee shops in Austin, Texas. The reviews were taken from Yelp; however, the names of the coffee shops were anonymized for privacy reasons.

The dataset has the following columns:

* `coffee_shop_name`: The anonymized name of the coffee shop.

* `full_review_text`: The customer reviews.

* `sentiment`: The sentiment of each customer's review. `0` - Negative, `1` - Positive.

In [4]:
# Import the dataset
# 1 = positive sentiment, 0 = negative sentiment
file_path = Path("../Resources/austin_coffee_shops_reviews.csv")
reviews_df = pd.read_csv(file_path)
reviews_df.head(20)

|   | coffee_shop_name | full_review_text | sentiment |
|---|---|---|---|
| 0 | Coffee Shop 66 | Love love loved the atmosphere! Every corner o... | 1 |
| 1 | Coffee Shop 66 | Listed in Date Night: Austin, Ambiance in Aust... | 1 |
| 2 | Coffee Shop 66 | Listed in Brunch Spots I loved the eclectic an... | 1 |
| 3 | Coffee Shop 66 | Very cool decor! Good drinks Nice seating How... | 0 |
| 4 | Coffee Shop 66 | They are located within the Northcross mall sh... | 1 |
| 5 | Coffee Shop 66 | Very cute cafe! I think from the moment I step... | 1 |
| 6 | Coffee Shop 66 | 2 check-ins Listed in "Nuptial Coffee Bliss!",... | 1 |
| 7 | Coffee Shop 66 | 2 check-ins Love this place! 5 stars for clea... | 1 |
| 8 | Coffee Shop 66 | 3 check-ins This place has been shown on my so... | 1 |
| 9 | Coffee Shop 66 | Listed in Americano This is not your average c... | 1 |


## Data Preprocessing

RNN input requires an array data type. The `full_review_text` column will be transformed into the `X` array and the `sentiment` column into the `y` array.

In [5]:
# Creating the X and y vectors
X = reviews_df["full_review_text"].values
y = reviews_df["sentiment"].values

To train the RNN model, we need to encode the text data as sequences of integers. This transformation can be done using the following tools from Keras.

In [6]:
# Import Keras modules for data encoding
# To train the RNN model, the text data should be encoded as sequences of integers.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [7]:
# Create an instance of the Tokenizer and fit it with the X text data
tokenizer = Tokenizer(lower=True)  # ensure data consistency by converting each word to lowercase
tokenizer.fit_on_texts(X)

In [8]:
# Print the first five elements of the encoded vocabulary
for token in list(tokenizer.word_index)[:5]:
    print(f"word: '{token}', token: {tokenizer.word_index[token]}")

word: 'the', token: 1
word: 'and', token: 2
word: 'a', token: 3
word: 'i', token: 4
word: 'to', token: 5
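
The ranking shown above (the most frequent word gets token `1`) can be illustrated with a minimal pure-Python sketch. This is a simplified stand-in, not the Keras implementation: the helper `build_word_index`, the regular expression used to split words, and the toy reviews are all assumptions made for illustration.

```python
from collections import Counter
import re


def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1 (hypothetical helper)."""
    counts = Counter()
    for text in texts:
        # Lowercase, then keep runs of letters, digits, and apostrophes (a simplification)
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    # most_common() sorts by descending count; index 0 is left unused so it can mark padding
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


# Toy reviews, invented for this example
toy_reviews = ["The coffee was great", "The latte was cold", "Great coffee, great staff"]
word_index = build_word_index(toy_reviews)
print(word_index)
```

Because "great" appears three times in the toy reviews, it receives index `1`, just as "the" does in the real vocabulary.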


In [10]:
# Transform the text data to numerical sequences
X_seq = tokenizer.texts_to_sequences(X)

# Contrast a sample numerical sequence with its text version
print("**Text comment**")
print(X[0])
print("**Numerical sequence representation**")
print(X_seq[0])

**Text comment**
Love love loved the atmosphere! Every corner of the coffee shop had its own style, and there were swings!!! I ordered the matcha latte, and it was muy fantastico! Ordering and getting my drink were pretty streamlined. I ordered on an iPad, which included all beverage selections that ranged from coffee to wine, desired level of sweetness, and a checkout system. I got my latte within minutes!  I was hoping for a typical heart or feather on my latte, but found myself listing out all the possibilities of what the art may be. Any ideas?
**Numerical sequence representation**
[53, 53, 301, 1, 114, 188, 589, 6, 1, 8, 65, 29, 255, 351, 810, 2, 36, 50, 1138, 4, 125, 1, 511, 69, 2, 11, 10, 5621, 5019, 506, 2, 319, 16, 106, 50, 89, 4562, 4, 125, 21, 58, 1112, 68, 1909, 40, 967, 998, 18, 5020, 43, 8, 5, 416, 3656, 1018, 6, 732, 2, 3, 4563, 1289, 4, 90, 16, 69, 999, 312, 4, 10, 1364, 12, 3, 811, 652, 39, 5622, 21, 16, 69, 17, 302, 474, 4202, 38, 40, 1, 4203, 6, 71, 1, 368, 439, 
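
The encoding step can be sketched in plain Python as a lookup of each word in the vocabulary. The helper `encode_texts` and the toy vocabulary below are hypothetical; the sketch assumes the Keras default behaviour when no `oov_token` is set, where words missing from the vocabulary are skipped.

```python
import re


def encode_texts(texts, word_index):
    """Replace each known word with its integer index (hypothetical helper)."""
    sequences = []
    for text in texts:
        words = re.findall(r"[a-z0-9']+", text.lower())
        # Words that are not in the vocabulary are dropped rather than encoded
        sequences.append([word_index[w] for w in words if w in word_index])
    return sequences


# Toy vocabulary, invented for this example
toy_index = {"great": 1, "the": 2, "coffee": 3, "was": 4}
encoded = encode_texts(["The coffee was great!", "The espresso was great"], toy_index)
print(encoded)  # [[2, 3, 4, 1], [2, 4, 1]]
```

The second toy review is one integer shorter because "espresso" is not in the toy vocabulary.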

The RNN model requires that all entries of the `X` vector have the same length; the `pad_sequences` function ensures that all integer-encoded reviews have the same size. Each entry in `X` will be truncated to `140` integers, or padded with trailing `0`s if it is shorter.

In [11]:
# Padding sequences
X_pad = pad_sequences(X_seq, maxlen=140, padding="post")
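
To make the effect of padding concrete, here is a small pure-Python sketch of post-padding. The helper `pad_post` is hypothetical; it assumes the `pad_sequences` defaults apart from `padding="post"`, so longer sequences lose values from the beginning (`truncating="pre"`).

```python
def pad_post(sequences, maxlen, value=0):
    """Pad at the end and truncate at the beginning (hypothetical helper)."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last `maxlen` values
        padded.append(seq + [value] * (maxlen - len(seq)))  # fill the tail with zeros
    return padded


toy_padded = pad_post([[5, 8], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(toy_padded)  # [[5, 8, 0, 0], [3, 4, 5, 6]]
```

If the start of a review matters more than its end, passing `truncating="post"` to `pad_sequences` keeps the first `maxlen` values instead.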

Now that the data is encoded, the training and testing sets will be created.

In [13]:
# Creating training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_pad, y, random_state=78)
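
Because the split shuffles the rows, the order of `X_test` no longer matches the order of the original reviews. The sketch below shows one way to keep several arrays aligned through a shuffled split. It uses only the standard library, so it does not reproduce the exact rows that `train_test_split` selects; the helper `split_aligned` and the toy data are assumptions for illustration.

```python
import math
import random


def split_aligned(*arrays, test_fraction=0.25, seed=78):
    """Shuffle one set of row indices and apply it to every array (hypothetical helper)."""
    n_rows = len(arrays[0])
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    result = []
    for array in arrays:
        result.append([array[i] for i in train_idx])
        result.append([array[i] for i in test_idx])
    return result


# Toy data, invented for this example
toy_texts = ["good", "bad", "great", "awful", "fine", "poor", "nice", "meh"]
toy_labels = [1, 0, 1, 0, 1, 0, 1, 0]
texts_train, texts_test, labels_train, labels_test = split_aligned(toy_texts, toy_labels)
print(list(zip(texts_test, labels_test)))
```

`train_test_split` itself accepts any number of equal-length arrays, so the raw review text could be passed alongside `X_pad` and `y` to obtain matching text for each test row.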

## Build and Train the LSTM RNN Model

In this section, a custom LSTM RNN model will be defined in Keras and fitted (trained) using the training data defined above.

These are the steps that will be followed:

* Define the model architecture in Keras.

* Compile the model.

* Fit the model to the training data.

### Importing the Keras Modules

To build an LSTM RNN model in Keras, the `Sequential` model is used; however, there are two new types of layers that are needed:

* `Embedding`: It's a layer that maps each integer-encoded word to a dense vector, which lets the neural network process encoded text data.

* `LSTM`: It's used to add an LSTM layer to the model.

In [14]:
# Import Keras modules for model creation
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

### Setting Up the Model

The `Embedding` layer takes the size of the vocabulary as its first parameter. The `vocabulary_size` is set to the total number of words in the `tokenizer` dictionary plus `1`, because index `0` is reserved for padding. The layer also needs the `input_length` parameter, which is set to `140` (the `max_words` variable), the same length used to pad the reviews.

The `embedding_size` parameter specifies how many dimensions are used to represent each word. As a rule of thumb, a multiple of eight can be used; for this demo, an `embedding_size` of `64` delivered the best result during tuning.

In [18]:
# Model set-up
vocabulary_size = len(tokenizer.word_counts.keys()) + 1
max_words = 140
embedding_size = 64
# A useful article on word embeddings:
# http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/#Word%20Embeddings

### Defining the Model's Structure

In [19]:
# Define the LSTM RNN model
model = Sequential()

# Layer 1
model.add(Embedding(vocabulary_size, embedding_size, input_length=max_words))

# Layer 2
# This layer contains multiple LSTM units that are structurally identical,
# but each one eventually "learns to remember" something different.
model.add(LSTM(units=280))

# Output layer
model.add(Dense(1, activation="sigmoid"))

### Compiling the Model

In [20]:
# Compile the model
model.compile(
    loss="binary_crossentropy",
    optimizer="adam"
)
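
For intuition about the `binary_crossentropy` loss, the following pure-Python sketch computes the mean binary cross-entropy between true labels and predicted probabilities. It is a simplified illustration rather than the Keras implementation; the function name, the clipping constant `eps`, and the toy values are assumptions.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over a batch (illustrative sketch)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Toy values: confident correct predictions cost little, confident wrong ones cost a lot
loss_good = binary_crossentropy([1, 0], [0.9, 0.1])
loss_bad = binary_crossentropy([1, 0], [0.1, 0.9])
print(loss_good, loss_bad)
```

The sigmoid output layer pairs naturally with this loss because it produces a single probability between `0` and `1` for the positive class.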

In [21]:
# Summarize the model
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 140, 64)           1091264   
_________________________________________________________________
lstm (LSTM)                  (None, 280)               386400    
_________________________________________________________________
dense (Dense)                (None, 1)                 281       
Total params: 1,477,945
Trainable params: 1,477,945
Non-trainable params: 0
_________________________________________________________________
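
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard Keras layouts for `Embedding`, `LSTM`, and `Dense` layers; the vocabulary size of `17051` is inferred from the summary itself (1,091,264 / 64), not read from the tokenizer.

```python
embedding_dim = 64
lstm_units = 280
vocab_size = 1091264 // embedding_dim  # inferred from the summary: 17051

# Embedding: one vector of `embedding_dim` weights per vocabulary entry
embedding_params = vocab_size * embedding_dim

# LSTM: four gates, each with input weights, recurrent weights, and a bias per unit
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)

# Dense: one weight per LSTM unit plus a single bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 1091264 386400 281 1477945
```

Most of the parameters sit in the embedding layer, so the vocabulary size has the largest effect on the size of this model.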


### Training the Model

In [22]:
# Training the model
# This will take ~ 5 minutes to run
batch_size = 1000
model.fit(
    X_train,
    y_train,
    epochs=10,
    batch_size=batch_size,
    verbose=0,
)

<tensorflow.python.keras.callbacks.History at 0x1b767283790>
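
As a rough guide to what the training call does, the arithmetic below estimates how many weight updates take place. It assumes that all `6878` reviews were loaded and that `train_test_split` used its default test share of 25%, rounding the test set up.

```python
import math

n_reviews = 6878      # number of reviews stated for the dataset
test_fraction = 0.25  # assumed default of train_test_split
batch_size = 1000
epochs = 10

n_test = math.ceil(n_reviews * test_fraction)
n_train = n_reviews - n_test
steps_per_epoch = math.ceil(n_train / batch_size)  # the last batch is smaller
total_updates = steps_per_epoch * epochs

print(n_train, n_test, steps_per_epoch, total_updates)  # 5158 1720 6 60
```

With such a large batch size, the model receives only a few updates per epoch; a smaller `batch_size` would give more updates per epoch at the cost of a longer training time.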

### Making Predictions

In [23]:
# Make sentiment predictions
# `predict_classes` is deprecated; for a sigmoid output layer, threshold the predicted probabilities at 0.5 instead
predicted = (model.predict(X_test[:10]) > 0.5).astype("int32")


In [24]:
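# Create a DataFrame of actual and predicted values
# Caution: X[:10] holds the first ten reviews of the full dataset, not the shuffled test rows,
# so the "Text" column is not aligned with the "Actual" and "Predicted" columns.
sentiments = pd.DataFrame({"Text": X[:10], "Actual": y_test[:10], "Predicted": predicted.ravel()})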
sentiments

|   | Text | Actual | Predicted |
|---|---|---|---|
| 0 | Love love loved the atmosphere! Every corner o... | 1 | 1 |
| 1 | Listed in Date Night: Austin, Ambiance in Aust... | 1 | 1 |
| 2 | Listed in Brunch Spots I loved the eclectic an... | 1 | 1 |
| 3 | Very cool decor! Good drinks Nice seating How... | 1 | 1 |
| 4 | They are located within the Northcross mall sh... | 1 | 1 |
| 5 | Very cute cafe! I think from the moment I step... | 1 | 1 |
| 6 | 2 check-ins Listed in "Nuptial Coffee Bliss!",... | 1 | 1 |
| 7 | 2 check-ins Love this place! 5 stars for clea... | 1 | 1 |
| 8 | 3 check-ins This place has been shown on my so... | 1 | 1 |
| 9 | Listed in Americano This is not your average c... | 1 | 1 |
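
Ten rows are too few to judge the model. As a sketch of how predictions could be summarized over a whole test set, the pure-Python example below thresholds probabilities at `0.5` and counts the four possible outcomes. The helper names and the toy probabilities are invented for illustration and are not results from this model.

```python
def to_labels(probabilities, threshold=0.5):
    """Convert sigmoid probabilities to 0/1 labels (hypothetical helper)."""
    return [int(p > threshold) for p in probabilities]


def confusion_counts(actual, predicted):
    """Count true/false positives and negatives (hypothetical helper)."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1
        elif a == 0 and p == 0:
            counts["tn"] += 1
        elif a == 0 and p == 1:
            counts["fp"] += 1
        else:
            counts["fn"] += 1
    return counts


# Toy values, invented for this example
toy_probabilities = [0.91, 0.12, 0.67, 0.50, 0.38]
toy_actual = [1, 0, 0, 1, 0]

toy_predicted = to_labels(toy_probabilities)
counts = confusion_counts(toy_actual, toy_predicted)
accuracy = (counts["tp"] + counts["tn"]) / len(toy_actual)
print(toy_predicted, counts, accuracy)
```

Looking at false positives and false negatives separately is useful when one class dominates, since a model that always predicts the majority class can still score a high accuracy.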
