In [1]:
%reload_ext nb_black


## Recurrent Neural Networks

In this assignment, we will learn about recurrent neural networks (RNNs). We will build a simple RNN and train it to classify the sentiment of text reviews.

In [2]:
import warnings

warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd

import tensorflow as tf

tf.debugging.set_log_device_placement(True)


In [3]:
import os

tf.compat.v1.disable_eager_execution()

hello = tf.constant("Hello, TensorFlow!")

# sess = tf.compat.v1.Session()

os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # Tell CUDA which GPU to use;
# with a single GPU, its ID is most likely "0".
# with tf.device('/gpu:0'):
a = tf.constant([1.0, 2.0, 3.0, 4.0], shape=[2, 2], name="a")
b = tf.constant([4.0, 3.0, 2.0, 1.0], shape=[2, 2], name="b")
c = tf.matmul(a, b)
# with tf.compat.v1.Session() as sess:
#     print(sess.run(c))

sess = tf.compat.v1.Session(config=tf.compat.v1.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(c))

Device mapping:
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device

[[ 8.  5.]
 [20. 13.]]



In [4]:
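# Skip malformed rows. pandas 1.3 deprecated `error_bad_lines` in favor of
# `on_bad_lines="skip"`, and pandas 2.0 removed the old argument.
yelp = pd.read_csv("data/yelp_labeled.csv", error_bad_lines=False)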

b'Skipping line 281: expected 2 fields, saw 3\nSkipping line 290: expected 2 fields, saw 3\nSkipping line 296: expected 2 fields, saw 3\nSkipping line 322: expected 2 fields, saw 3\nSkipping line 373: expected 2 fields, saw 3\nSkipping line 417: expected 2 fields, saw 3\nSkipping line 427: expected 2 fields, saw 3\nSkipping line 429: expected 2 fields, saw 3\nSkipping line 577: expected 2 fields, saw 3\nSkipping line 578: expected 2 fields, saw 3\nSkipping line 611: expected 2 fields, saw 3\nSkipping line 677: expected 2 fields, saw 3\nSkipping line 771: expected 2 fields, saw 3\nSkipping line 930: expected 2 fields, saw 3\nSkipping line 979: expected 2 fields, saw 4\nSkipping line 980: expected 2 fields, saw 3\n'



In [5]:
yelp.head()

   text                                               sentiment
0  Wow... Loved this place.                           1
1  Crust is not good.                                 0
2  Not tasty and the texture was just nasty.          0
3  Stopped by during the late May bank holiday of...  1
4  The selection on the menu was great and so wer...  1



In [6]:
yelp.sentiment.value_counts()

1    494
0    482
Name: sentiment, dtype: int64


The cells above load a Yelp review dataset. Positive reviews are labeled 1 and negative reviews are labeled 0; the two classes are roughly balanced (494 positive, 482 negative).

In [7]:
from nltk.corpus import stopwords
import re
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()


def remove_stopwords(input_text):
    stopwords_list = stopwords.words("english")
    # Some words which might indicate a certain sentiment are kept via a whitelist
    whitelist = ["n't", "not", "no"]
    words = input_text.split()
    clean_words = [
        word
        for word in words
        if (word not in stopwords_list or word in whitelist) and len(word) > 1
    ]
    return " ".join(clean_words)


def stem_list(word_list):
    stemmed = []
    for word in word_list:
        stemmedword = stemmer.stem(word)
        stemmed.append(stemmedword)
    return stemmed


def normalize(terms):
    terms = terms.lower()
    terms = remove_stopwords(terms)
    word_delimiters = u"[\\[\\]\n.!?,;:\t\\-\\\"\\(\\)\\'\u2019\u2013 ]"
    term_list = re.split(word_delimiters, terms)
    trimmed = [x.rstrip() for x in term_list]
    stemmed = stem_list(trimmed)
    space = " "
    normed = space.join(stemmed)
    normed = normed.replace("  ", " ")
    return normed


The code block above defines functions to remove stopwords, stem words, and normalize the text (lowercase it, split on special characters, and trim white space). Apply the `normalize` function to every Yelp review and assign the normalized text to a new column.
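To make the whitelist logic concrete, here is a minimal sketch of the stopword filter. It assumes a tiny hand-written stopword set (`STOPWORDS` is hypothetical and far shorter than NLTK's English list) so that it runs without downloading any corpus; the filtering rule itself mirrors `remove_stopwords` above.

```python
# Hypothetical, abbreviated stopword set; the notebook uses NLTK's full list.
STOPWORDS = {"the", "is", "was", "and", "this", "not", "no"}
# Negations carry sentiment, so they are kept even though they are stopwords.
WHITELIST = {"n't", "not", "no"}


def remove_stopwords(input_text):
    """Drop stopwords (except whitelisted ones) and single-character tokens."""
    words = input_text.split()
    kept = [
        word
        for word in words
        if (word not in STOPWORDS or word in WHITELIST) and len(word) > 1
    ]
    return " ".join(kept)


cleaned = remove_stopwords("crust is not good.")
print(cleaned)  # crust not good.
```

Without the whitelist, "not" would be dropped and the review would read as positive. Note that punctuation is still attached at this stage; `normalize` splits it off afterwards.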

In [8]:
# Answer below:
yelp["normed"] = yelp["text"].apply(normalize)



Next, encode the normalized text with the Keras `one_hot` function. The function needs a vocabulary size, so first count the unique words in the normalized reviews.
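Despite its name, Keras's `one_hot` does not build one-hot vectors: it hashes each word to an integer between 1 and `n - 1`, so different words can collide on the same index. The sketch below imitates that behavior in plain Python, assuming an MD5-based hash; `hashed_index` and `encode` are hypothetical helper names, not Keras functions.

```python
import hashlib


def hashed_index(word, n):
    """Map a word to an integer in [1, n - 1] with a stable hash."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % (n - 1) + 1


def encode(doc, n):
    """Encode a whitespace-separated document as a list of word indices."""
    return [hashed_index(word, n) for word in doc.split()]


vocab_size = 1629  # number of unique words counted in the notebook
encoded = encode("crust not good", vocab_size)

# Index 0 is never produced, which leaves it free to act as padding.
print(len(encoded), min(encoded) >= 1, max(encoded) <= vocab_size - 1)
```

With `n = 1629` the largest index this scheme can emit is 1628, which matters later when sizing the embedding layer.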

In [9]:
full = " ".join(yelp["normed"])
words = full.split(" ")
len(set(words))

1629


In [10]:
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences


In [11]:
docs = list(yelp["normed"].values)


In [12]:
# Answer below:
vocab_size = len(set(words))

encoded_docs = [one_hot(d, vocab_size) for d in docs]



Convert the encoded sequences into a NumPy array and make sure all reviews are the same length using the `pad_sequences` function in Keras.
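By default, `pad_sequences` pads every sequence with zeros on the left up to the length of the longest one. A minimal NumPy sketch of that default behavior, for illustration only (`pad_pre` is a hypothetical helper, and truncation is not handled):

```python
import numpy as np


def pad_pre(sequences, value=0):
    """Left-pad variable-length integer sequences to a common length."""
    maxlen = max(len(seq) for seq in sequences)
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if seq:
            # Right-align each sequence so the padding sits in front.
            out[row, maxlen - len(seq):] = seq
    return out


padded = pad_pre([[5, 2], [7], [3, 9, 4]])
print(padded)
# [[0 5 2]
#  [0 0 7]
#  [3 9 4]]
```

Left padding keeps the real tokens at the end of the sequence, closest to the final RNN state that the classifier reads.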

In [13]:
# Answer below:

ind_vars = pad_sequences(encoded_docs)


Split the data into training and test sets, holding out 20% for testing. Use the sentiment column as the target variable.

In [14]:
# Answer below:
from sklearn.model_selection import train_test_split

X = ind_vars
y = yelp["sentiment"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=34
)


Create a sequential model. The model should contain an embedding layer whose input dim is one more than the largest encoding in the vocabulary, because the layer only accepts indices from 0 up to, but not including, its input dim. The output dim should be 100, and the input length is the number of columns in the training data.
After the embedding layer, add a SimpleRNN layer with 32 units, a dense layer of size 8, and a dense output layer.

In [15]:
# Answer below:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense, Embedding




In [17]:
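# NOTE: Embedding needs input_dim = largest index + 1. Using the largest index
# itself is what triggers the InvalidArgumentError in the training cell below.
max_words = np.max(ind_vars)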

model = Sequential()
model.add(Embedding(max_words, 100, input_length=ind_vars.shape[1]))
model.add(SimpleRNN(32))
model.add(Dense(8, activation="relu"))
model.add(Dense(1, activation="sigmoid"))
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 83, 100)           162800    
_________________________________________________________________
simple_rnn (SimpleRNN)       (None, 32)                4256      
_________________________________________________________________
dense (Dense)                (None, 8)                 264       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 9         
Total params: 167,329
Trainable params: 167,329
Non-trainable params: 0
_________________________________________________________________
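The parameter counts in the summary can be reproduced by hand. The arithmetic below is a quick check using the layer sizes shown above: an embedding table of 1628 rows by 100 columns, a SimpleRNN with input weights, recurrent weights, and one bias per unit, and two dense layers with weights plus biases. The variable names are illustrative.

```python
vocab_rows, embed_dim = 1628, 100
rnn_units, dense_units = 32, 8

# Embedding: one 100-dimensional vector per index.
embedding = vocab_rows * embed_dim
# SimpleRNN: each unit sees the input, the previous state, and a bias.
simple_rnn = rnn_units * (embed_dim + rnn_units + 1)
# Dense layers: weights plus one bias per output unit.
dense = rnn_units * dense_units + dense_units
output = dense_units * 1 + 1

total = embedding + simple_rnn + dense + output
print(embedding, simple_rnn, dense, output, total)
# 162800 4256 264 9 167329
```

Almost all of the parameters sit in the embedding table, which is typical for small text models.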



Compile the model with the optimizer of your choice and use cross-entropy as the loss function. Fit the model using a batch size of 128 for 50 epochs.
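For a two-class problem with a sigmoid output, the relevant cross-entropy is binary cross-entropy. The NumPy sketch below shows what that loss computes and how many batches one epoch takes; it is an illustration rather than the Keras implementation, and the 780 training samples are taken from the training log in this notebook.

```python
import math

import numpy as np


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip to avoid log(0) for fully confident predictions.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))


confident = binary_crossentropy([1, 0], [0.9, 0.1])  # about 0.105
unsure = binary_crossentropy([1, 0], [0.5, 0.5])  # about 0.693, i.e. ln(2)

# 780 training samples in batches of 128 gives 7 weight updates per epoch.
steps_per_epoch = math.ceil(780 / 128)
print(round(confident, 3), round(unsure, 3), steps_per_epoch)
```

A loss near ln(2) means the model is doing no better than guessing 0.5 for every review, which is a useful baseline when reading the training log.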


In [18]:
# Answer below:
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

model.fit(X_train, y_train, validation_data=(X_test, y_test), batch_size=128, epochs=50)

Train on 780 samples, validate on 196 samples
Device mapping:
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device

Epoch 1/50
Instructions for updating:
This property should not be used in TensorFlow 2.0, as updates are applied automatically.


InvalidArgumentError: indices[94,79] = 1628 is not in [0, 1628)
	 [[{{node embedding/embedding_lookup}}]]
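The `InvalidArgumentError` above says that index 1628 falls outside `[0, 1628)`. An embedding layer with `input_dim=N` holds N rows and only accepts indices 0 through N - 1, while the encoded reviews contain the index 1628. The NumPy sketch below reproduces the same lookup with a plain array standing in for the embedding table; the `lookup` helper and the 4-column table are illustrative assumptions, not Keras code.

```python
import numpy as np

largest_index = 1628  # largest encoding reported in the error message
encoded = np.array([[0, 12, largest_index]])


def lookup(table, indices):
    """Fetch one row per index, rejecting indices outside the table."""
    if indices.max() >= table.shape[0]:
        raise IndexError(
            f"index {indices.max()} is not in [0, {table.shape[0]})"
        )
    return table[indices]


# A table with `largest_index` rows is one row too short.
too_small = np.zeros((largest_index, 4))
try:
    lookup(too_small, encoded)
    failed = False
except IndexError as err:
    failed = True
    print(err)  # index 1628 is not in [0, 1628)

# One extra row makes room for the largest index.
big_enough = np.zeros((largest_index + 1, 4))
vectors = lookup(big_enough, encoded)
print(vectors.shape)  # (1, 3, 4)
```

In the notebook, the likely fix is to set `max_words = np.max(ind_vars) + 1` (which equals `vocab_size` here) before building the model and then rerun the training cell.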
