# <img style="float: left; padding-right: 10px; width: 45px" src="https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/iacs.png"> CS109B Introduction to Data Science

## Lab 8: Recurrent Neural Networks and Introduction to Natural Language Processing

**Harvard University**<br/>
**Spring 2022**<br/>
**Instructors**: Mark Glickman & Pavlos Protopapas<br/>
**Lab Leaders**: Marios Mattheakis & Chris Gumb
<br/>

## Learning Objectives

By the end of this Lab, you should understand how to:
- use `keras` for constructing a simple RNN for time-series prediction
- perform basic preprocessing on text data (stemming, tokenization, padding, one-hot encoding)
- use feed-forward NNs for NLP tasks
- add embedding layers to improve performance
- use `keras` simple RNNs for NLP 
- inspect the embedding space

<a id="contents"></a>

## Notebook Contents

- [**Simple RNNs**](#rnn_intro)
    - [Time-series prediction](#timeSeries)
    - [Activity 1: Forecasting timeseries](#act1)
- [**Introduction to NLP**](#NLP_intro)
    - [Case Study: IMDB Review Dataset](#imdb)
- [**Preprocessing Text Data**](#prep)
    - [Tokenization](#token)
    - [Stemming](#stem)
    - [Padding](#pad)
    - [Numerical Encoding](#encode)    
- [**Neural Networks for NLP**](#NN)
    - [Feed Forward Neural Networks](#FFNN)
    - [Embedding layer](#embedding)    
    - [Activity 2: Recurrent Neural Networks with embeddings](#act2)
- [**Extra Material: Inspecting the embedding space**](#SM)


In [None]:
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Embedding, SimpleRNN, Flatten #GRU, LSTM
from sklearn.model_selection import train_test_split
import tensorflow_datasets
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd

# fix random seed for reproducibility
np.random.seed(109)

import warnings
warnings.filterwarnings('ignore')

# Simple Recurrent Neural Networks (RNNs) <div id='rnn_intro'></div>



An RNN is similar to a FFNN in that there is an input layer, a hidden layer, and an output layer. The input layer is fully connected to the hidden layer, and the hidden layer is fully connected to the output layer. However, the crux of what makes it a **recurrent** neural network is that the hidden layer for a given time _t_ is not only based on the input layer at time _t_ but also the hidden layer from time _t-1_.
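
To make the recurrence concrete, here is a minimal NumPy sketch of the hidden-state update $h_t = \tanh(x_t W_x + h_{t-1} W_h + b)$. The sizes, the random weights, and the names `W_x`, `W_h`, and `b` are illustrative assumptions, not the internals of any Keras layer.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_units, n_steps = 3, 4, 5   # hypothetical sizes

# Parameters are shared across all time steps
W_x = rng.normal(scale=0.1, size=(n_features, n_units))  # input -> hidden
W_h = rng.normal(scale=0.1, size=(n_units, n_units))     # hidden -> hidden
b = np.zeros(n_units)

x_seq = rng.normal(size=(n_steps, n_features))  # one toy sequence of 5 time steps
h = np.zeros(n_units)                           # initial hidden state

for x_t in x_seq:
    # the new hidden state depends on the current input AND the previous state
    h = np.tanh(x_t @ W_x + h @ W_h + b)

print(h.shape)  # (4,)
```

The final `h` summarizes the whole sequence; a layer placed on top of it would produce the prediction.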

Here's a popular blog post on [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/).


In Keras, the vanilla RNN unit is implemented as the `SimpleRNN` layer:
```
tf.keras.layers.SimpleRNN(
    units, activation='tanh', use_bias=True,
    kernel_initializer='glorot_uniform',
    recurrent_initializer='orthogonal',
    bias_initializer='zeros', kernel_regularizer=None,
    recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None,
    kernel_constraint=None, recurrent_constraint=None, bias_constraint=None,
    dropout=0.0, recurrent_dropout=0.0, return_sequences=False, return_state=False,
    go_backwards=False, stateful=False, unroll=False, **kwargs
)
```
For more details, check the Keras documentation: https://www.tensorflow.org/api_docs/python/tf/keras/layers/SimpleRNN

As you can see, recurrent layers in Keras take many arguments. We only need to be concerned with `units`, which specifies the size of the hidden state. 

The `return_sequences` argument can be left at its default of `False` for the moment.
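
As a rough guide to what `units` controls, the sketch below counts the trainable weights of a vanilla RNN layer: an input-to-hidden matrix, a hidden-to-hidden matrix, and a bias vector. The helper name `simple_rnn_param_count` is hypothetical and exists only for this illustration.

```python
def simple_rnn_param_count(units, input_dim, use_bias=True):
    """Number of trainable weights in a vanilla RNN layer."""
    kernel = input_dim * units          # input -> hidden weights
    recurrent_kernel = units * units    # hidden -> hidden weights
    bias = units if use_bias else 0
    return kernel + recurrent_kernel + bias

print(simple_rnn_param_count(units=32, input_dim=4))  # 1184
```

The count grows quadratically with `units` because of the hidden-to-hidden matrix.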

As you will see next week, simple RNNs have some serious problems and limitations, like vanishing and exploding gradients. Due to these limitations, the simple RNN unit tends not to be used much in practice. Perhaps for this reason, it seems that the Keras developers did not implement GPU acceleration for this layer! Later in the Lab, you will notice that training an RNN is slower than training an FFNN, even when the RNN has fewer parameters. 
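
The vanishing/exploding behaviour can be previewed with a deliberately simplified scalar caricature: if backpropagation through each time step multiplies the gradient by roughly the same factor, the result shrinks or grows exponentially with sequence length. In a real RNN the factor is a matrix (a Jacobian), so this is only a sketch of the intuition.

```python
def gradient_scale(factor, n_steps):
    """Size of a gradient after being multiplied by `factor` at each of n_steps."""
    return factor ** n_steps

for factor in (0.9, 1.0, 1.1):
    print(factor, gradient_scale(factor, 100))
```

A factor slightly below 1 drives the gradient toward zero over 100 steps, while a factor slightly above 1 makes it blow up.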

## Time-series prediction <div id = 'timeSeries'></div>
    
RNNs are effective at learning from sequential data like time series and text. Let's start our journey into RNNs by predicting a noisy time series. 

Generate some synthetic, noisy sequential data:

In [None]:
N =  1000    
Tp = 800    

t=np.arange(0,N)

x = np.sin(0.02*t) * np.sin(0.05*t) + 2*np.exp(-(t-500)**2/1000)
# Add uniform noise in [0, 1); use np.random.randn(N) instead for Gaussian (white) noise
x += np.random.rand(N)


df = pd.DataFrame(x)


plt.plot(t, x, 'k')
plt.xlabel('Time'); plt.ylabel('Series')



#### Split data into training and testing sets
Note that this is a forecasting task: the model cannot see the future, so the test set is the final stretch of the series rather than a random sample.

In [None]:
values=df.values
train,test = values[0:Tp,:], values[Tp:N,:]

plt.plot(df[0:Tp], 'b', label='training')
plt.plot(df[Tp:N], 'g', label='testing')

plt.axvline(df.index[Tp], c="r")
plt.xlabel('Time'); plt.ylabel('Series')
plt.legend()

#### Prepare the data

The RNN is trained on fixed-length input windows: each input is a sequence of `step` consecutive values, and the target is the value that immediately follows the window. 
Let's understand this concept through two simple cases. Consider the series 1, 2, ..., 6 with the input `x` and the output `y`:
- For step=1: 
   - x=[1,2,3,4,5]
   - y=[2,3,4,5,6]
- For step=2: 
   - x=[ (1,2), (2,3), (3,4), (4,5) ]
   - y=[3,4,5,6]

   

A series of length `n` therefore yields only `n - step` (input, target) pairs. To keep one pair per original time point, we append `step` copies of the last value to the end of the training and test data.
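
The two cases above can be reproduced with a small pure-Python helper. The name `make_windows` is hypothetical and used only for this sketch.

```python
def make_windows(series, step):
    """Split a sequence into (window, next value) pairs."""
    inputs, targets = [], []
    for i in range(len(series) - step):
        inputs.append(tuple(series[i:i + step]))  # `step` consecutive values
        targets.append(series[i + step])          # the value right after the window
    return inputs, targets

toy_series = [1, 2, 3, 4, 5, 6]
print(make_windows(toy_series, step=1))
print(make_windows(toy_series, step=2))
```

With `step=2` the six-point series produces four pairs, which is the `n - step` count that the padding is meant to compensate for.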


In [None]:
print(train.shape, test.shape )

In [None]:
step = 4
# append `step` copies of the last value to train and test (np.append also flattens them to 1-D)
test = np.append(test,np.repeat(test[-1,],step))
train = np.append(train,np.repeat(train[-1,],step))

In [None]:
print(train.shape, test.shape )

Convert the datasets into matrices of windows of length `step`, as explained above.


In [None]:
def convertToMatrix(data, step):
    X, Y =[], []
    for i in range(len(data)-step):
        d=i+step  
        X.append(data[i:d,])
        Y.append(data[d,])
    return np.array(X), np.array(Y)

trainX, trainY =convertToMatrix(train,step)
testX,  testY =convertToMatrix(test,step)

print('Shapes of the training dataset for (x,y): ', trainX.shape, trainY.shape)
print('Shapes of the testing dataset for (x,y) : ', testX.shape, testY.shape)

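Finally, we reshape `trainX` and `testX` to fit the Keras RNN model, which requires three-dimensional input data of shape `(samples, timesteps, features)`. Here each sample has a single time step with `step` features.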
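
Keras recurrent layers read their input as `(samples, timesteps, features)`. The reshape used in this lab places all lagged values in the feature axis of a single time step. A possible alternative, shown here only as a sketch on toy data, places each lagged value in its own time step so that the recurrence runs across the window; the model's `input_shape` would then have to be `(step, 1)`. The names `n_lags` and `toy_windows` are assumptions for this example.

```python
import numpy as np

n_lags = 4
toy_windows = np.arange(12, dtype=float).reshape(3, n_lags)  # 3 toy samples of 4 lagged values

# Layout used in this lab: one time step carrying `n_lags` features
as_features = toy_windows.reshape(toy_windows.shape[0], 1, n_lags)

# Alternative layout: `n_lags` time steps carrying one feature each
as_timesteps = toy_windows.reshape(toy_windows.shape[0], n_lags, 1)

print(as_features.shape)   # (3, 1, 4)
print(as_timesteps.shape)  # (3, 4, 1)
```

Both layouts hold the same numbers; they differ only in whether the layer sees the window as one wide input or as a short sequence.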



In [None]:
trainX = np.reshape(trainX, (trainX.shape[0], 1, trainX.shape[1]))
testX = np.reshape(testX, (testX.shape[0], 1, testX.shape[1]))
print(trainX.shape, testX.shape)

In [None]:
model = Sequential()
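# Recurrent layer with 32 hidden units; each sample has 1 time step with `step` features
model.add(SimpleRNN(units=32, input_shape=(1,step), activation="relu"))
# Fully connected layers that map the hidden state to a single predicted value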
model.add(Dense(8, activation="relu")) 
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer= 'adam' )
model.summary()

In [None]:
model.fit(trainX,trainY, epochs=100, batch_size=32, verbose=0)

trainPredict = model.predict(trainX)
testPredict  = model.predict(testX)

# concatenate train and test predictions for plotting purposes
predicted = np.concatenate((trainPredict,testPredict),axis=0)


In [None]:
trainScore = model.evaluate(trainX, trainY, verbose=0)
testScore = model.evaluate(testX, testY, verbose=0)
print('Train score: ', trainScore)
print('Test score: ', testScore)

In [None]:
index = df.index.values
plt.plot(df[0:Tp], 'b', label='training')
plt.plot(df[Tp:N], 'g', label='testing')
# plt.plot(index,predicted)
plt.plot(predicted, 'm', label='network')
plt.axvline(df.index[Tp], c="r")
plt.xlabel('Time'); plt.ylabel('Series')
plt.legend()


# Activity 1 <div id='act1'></div>
- Repeat the above experiment for step values of 1, 10, and 100.
- Does the step affect the performance? Make some comments.


In [None]:
# your code here


# Introduction to NLP <div id = 'NLP_intro'></div>
    
## Case Study: IMDB Review Classifier <div id='imdb'></div>
<!-- <img src='fig/manyto1.png' width='300px'> -->

Let's frame our introduction to NLP around the example of a text classifier. Specifically, we'll build and evaluate various models that all attempt to discriminate between positive and negative reviews from the Internet Movie Database (IMDB). The dataset is made available to us through the TensorFlow Datasets API.

In [None]:
(train, test), info = tensorflow_datasets.load('imdb_reviews', split=['train', 'test'], with_info=True)

The helpful `info` object provides details about the dataset.

In [None]:
info

We see that the dataset consists of text reviews and binary good/bad labels. Here are two examples:

In [None]:
labels = {0: 'bad', 1: 'good'}
seen = {'bad': False, 'good': False}
for review in train:
    label = review['label'].numpy()
    if not seen[labels[label]]:
        print(f"text:\n{review['text'].numpy().decode()}\n")
        print(f"label: {labels[label]}\n")
        seen[labels[label]] = True
    if all(seen.values()):
        break

# Preprocessing Text Data <div id='prep'></div>

Computers have no built-in knowledge of language and cannot understand text data in the rich way that humans do, at least not without some help! The first crucial step in natural language processing is to clean and preprocess your data so that your algorithms and models can make use of it.
    
We'll look at a few preprocessing steps:
- Tokenization
- Stemming 
- Padding
- Numerical encoding
        
Depending on your NLP task, you may (or may not) want to take additional preprocessing steps which we will not cover here. These can include:
- converting all characters to lowercase
- treating each punctuation mark as a token (e.g., , . ! ? are each separate tokens)
- removing punctuation altogether
- separating each sentence with a unique symbol (e.g., `<S>` and `</S>`)
- removing words that are incredibly common (e.g., function words, (in)definite articles); these are referred to as 'stopwords'
- lemmatizing (replacing words with their 'dictionary entry form')

Useful NLP Python libraries such as [NLTK](https://www.nltk.org/) and [spaCy](https://spacy.io/) provide built-in methods for many of these preprocessing steps.
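
A few of these optional steps can be sketched with the standard library alone. The stopword list below is a tiny hand-picked assumption, and the helper name `basic_clean` is hypothetical; the libraries mentioned above ship much fuller resources.

```python
import string

# A tiny, hand-picked stopword list (an assumption for this example)
STOPWORDS = {"a", "an", "the", "is", "it", "of", "and"}

def basic_clean(text):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOPWORDS]

print(basic_clean("The plot is thin, and the acting is WORSE!"))
# ['plot', 'thin', 'acting', 'worse']
```

Whether each step helps depends on the task; removing punctuation, for example, discards the exclamation mark that might itself carry sentiment.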

<!-- <div class='exercise' id='token'><b>Tokenization</b></div></br> -->
## Tokenization  <div id='token'></div>

**Tokenization**   is the process of breaking a document down into words, punctuation marks, numeric digits, etc.

**Tokens** are the atomic units of meaning which our model will be working with. What should these units be? These could be characters, words, or even sentences. For our movie review classifier we will be working at the word level.

For this example we will process just a subset of the original dataset.

In [None]:
SAMPLE_SIZE = 10 # number of reviews to consider
subset = list(train.take(SAMPLE_SIZE))
subset[5]

TFDS processes datasets into a standard format, which allows for the construction of efficient preprocessing pipelines. But for our own preprocessing example we will primarily be working with Python `list` objects. This gives us a chance to practice Python **list comprehensions**, a powerful tool to have at your disposal. They will serve you well when processing arbitrary text which may not already be in a nice TFDS format (such as in the HW 😉).

We'll convert our data subset into X and y lists.

In [None]:
X = [x['text'].numpy().decode() for x in subset]
y = [x['label'].numpy() for x in subset]

In [None]:
print(f'X has {len(X)} reviews')
print(f'y has {len(y)} labels')

In [None]:
N_CHARS = 20
print(f'First {N_CHARS} characters of all reviews:\n{[x[:N_CHARS]+"..." for x in X]}\n')
print(f'All labels:\n{y}')

Each observation in `X` is a review. A review is a `str` object, which we can think of as a sequence of characters. This is indeed how Python treats strings, as made clear by how we print 'slices' of each review in the code cell above.<br>

In this example, we will work at the word level. This means that our observations should be organized as **sequences of words** rather than sequences of characters. Other choices are possible; for instance, the data could be prepared at the character level instead.
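
The difference between the two levels is easy to see on a short string. This is only an illustration on a made-up example, not part of the lab's pipeline.

```python
text = "It stinks"

word_tokens = text.split()   # word level: split on whitespace
char_tokens = list(text)     # character level: every character is a token

print(word_tokens)  # ['It', 'stinks']
print(char_tokens)  # ['I', 't', ' ', 's', 't', 'i', 'n', 'k', 's']
```

Character-level sequences are much longer but draw on a far smaller vocabulary.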



In [None]:
# list comprehensions again to the rescue!
X_ = [x.split() for x in X]   # keep this temporary copy for comparison purposes, as we will see shortly
X = [x.split() for x in X]


Now let's look at the first 10 **tokens** in the first 2 reviews.

In [None]:
print('Review 1: ', X[0][:10])
print('Review 2: ', X[1][:10])

## Stemming <div id='stem'></div>
**Stemming** is the process of reducing morphological variants of a word to a common root/base form. For example, a stemming algorithm aims to reduce the words "chocolates", "chocolatey", "choco" to the root word "chocolate", or the words "likes", "liked", "likely", "liking" to "like". In practice, the stem produced by an algorithm is not always a dictionary word.

Stemming is desirable because it may reduce redundancy: most of the time, a word stem and its inflected/derived forms mean the same thing.
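
To see how stemming shrinks a vocabulary, here is a deliberately crude suffix-stripping sketch. It is not the Porter algorithm, and the suffix list and the name `toy_stem` are assumptions made up for this example.

```python
SUFFIXES = ("ing", "ed", "ly", "s")   # checked in this order

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

toy_words = ["likes", "liked", "liking", "like", "walks", "walked", "walking"]
toy_stems = [toy_stem(w) for w in toy_words]
print(toy_stems)
print(len(set(toy_words)), "->", len(set(toy_stems)))  # 7 -> 3
```

Seven distinct words collapse to three stems. The crude rules also split one family into "lik" and "like", which shows why real stemmers need more careful rules and why stems need not be real words.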

Here, we use the **Natural Language Toolkit (NLTK)** package. For more information, check [here](https://www.nltk.org/api/nltk.stem.html).

In [None]:
from nltk.stem import PorterStemmer

In [None]:
# object for stemming 
ps = PorterStemmer()
# perform stemming in all the sentences
X = [[ps.stem(w) for w in x] for x in X]

The nested list comprehension above is equivalent to the following loop: 

```
i=0
for words in X:
    j=0
    for w in words:
        X[i][j]=ps.stem(w)
        j+=1
    i+=1
```

Let's compare the words before and after stemming

In [None]:
for i in range(20):
    print(X_[0][i], ' --> ', X[0][i])

<div  style="background-color:#b3e6ff">
<b>Q</b>: Should we always use stemming?
</div>

In classification tasks (like sentiment analysis) stemming is fine. But what about in a text generation task?

## Padding <div id='pad'></div>

Let's take a look at the lengths of the reviews in our subset.

In [None]:
[len(x) for x in X]

If we were training our RNN one sentence at a time, it would be okay to have sentences of varying lengths. However, as with any neural network, it can sometimes be advantageous to train on inputs in batches. When doing so with RNNs, our input tensors need to be of the same length/dimensions.

Here are two examples of tokenized reviews padded to have a length of 5.
```
['I', 'loved', 'it', '<PAD>', '<PAD>']
['It', 'stinks', '<PAD>', '<PAD>', '<PAD>']
```
Now let's pad our own examples. Note that 'padding' in this context also means truncating sequences that are longer than our specified max length.

In [None]:
MAX_LEN = 500
PAD = '<PAD>'
# truncate
X = [x[:MAX_LEN] for x in X]
# pad
for x in X:
    while len(x) < MAX_LEN:
        x.append(PAD)

In [None]:
[len(x) for x in X]

Now all reviews have the same length!

## Numerical Encoding <div id='encode'></div>

If each review in our dataset is an observation, then the features of each observation are the tokens, in this case, words. But these words are still **strings**. Our machine learning methods require us to be able to multiply our features by weights. If we want to use these words as inputs for a neural network we'll have to convert them into some **numerical representation**.

One solution is to create a **one-to-one mapping** between unique words and integers.

If the six sentences below were our entire corpus, our conversion would look like this:

1. i have books [1,4,2]
2. interesting books are useful [9,2,5,7]
3. i have computers [1,4,3]
4. computers are interesting and useful [3,5,9,8,7]
5. books and computers are both valuable [2,8,3,5,11,10]
6. bye bye [6,6]

i-1, books-2, computers-3, have-4, are-5, bye-6, useful-7, and-8, interesting-9, valuable-10, both-11
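
A mapping of this kind can be built programmatically. The sketch below assigns integers in order of first appearance, so the integers differ from the hand-assigned ones in the list above; any consistent one-to-one assignment works. The names `toy_corpus` and `toy_word2idx` are used to avoid clashing with the lab's own variables.

```python
toy_corpus = [
    "i have books",
    "interesting books are useful",
    "i have computers",
    "computers are interesting and useful",
    "books and computers are both valuable",
    "bye bye",
]

toy_word2idx = {}
for sentence in toy_corpus:
    for word in sentence.split():
        # assign the next free integer the first time a word is seen (0 is left free)
        toy_word2idx.setdefault(word, len(toy_word2idx) + 1)

toy_encoded = [[toy_word2idx[w] for w in sentence.split()] for sentence in toy_corpus]
print(toy_word2idx)
print(toy_encoded)
```

Every occurrence of a word receives the same integer, which is the property the encoding needs.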

To accomplish this we'll first need to know what all the unique words are in our dataset.

In [None]:
all_tokens = [word for review in X for word in review]

In [None]:
# sanity check
len(all_tokens), sum([len(x) for x in X])

Casting our `list` of words into a `set` is a great way to get all the *unique* words in the data. Hence, we build our **vocabulary**. 

In [None]:
vocab = sorted(set(all_tokens))
print('Unique Words in our vocabulary:', len(vocab))

You can easily check that the vocabulary is larger if stemming is not applied. Try it yourself.

Now we need to create a mapping from words to integers. For this we will perform a **dictionary comprehension**.

In [None]:
word2idx = {word: idx for idx, word in enumerate(vocab)}

In [None]:
word2idx

We repeat the process, this time mapping integers to words.

In [None]:
idx2word = {idx: word for idx, word in enumerate(vocab)}

In [None]:
idx2word

Now, perform the mapping to encode the observations in our subset. One more ***nested list comprehension***!

In [None]:
X_proc = [[word2idx[word] for word in review] for review in X]
X_proc[0][:10], X_proc[1][:10]


# Neural Networks for NLP <div id='NN'></div>

`X_proc` is a list of lists but if we are going to feed it into a `keras` model we should convert both it and `y` into `numpy` arrays.

Just a reminder that `y` is the response variable: 
```
X = [x['text'].numpy().decode() for x in subset]
y = [x['label'].numpy() for x in subset]
```

In [None]:
X_proc = np.hstack(X_proc).reshape(-1, MAX_LEN)
y = np.array(y)
print(X_proc.shape, y.shape)
X_proc, y

## Feed Forward Neural Network  <div id='NN'></div>

Now, just to show that we've successfully processed the data, we perform a train/test split and feed it into an FFNN.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_proc, y, test_size=0.2, stratify=y)

In [None]:
model = Sequential()

model.add(Dense(250, activation='relu',input_dim=MAX_LEN))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=5, batch_size=2, verbose=2)

scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

It worked! 
Is this a good performance? 

Well, our subset is balanced and very small, so we shouldn't get excited about these results. 
Note that adding more layers or neurons does not improve the performance. Check it yourself! <br> 

### Load more clean data
The IMDB dataset is very popular so `keras` also includes an alternative method for loading the data. This method can save us a lot of time for many reasons:
- Cleaned text with less meaningless punctuation
- Pre-tokenized and numerically encoded
- Allows us to specify maximum vocabulary size
- more ...

In [None]:
from tensorflow.keras.datasets import imdb

# We want to have a finite vocabulary to make sure that our word matrices are not arbitrarily large
MAX_VOCAB = 10000
INDEX_FROM = 3   # word index offset 
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=MAX_VOCAB, index_from=INDEX_FROM)

`get_word_index` will load a JSON object that we can store in a dictionary. This gives us the word-to-integer mapping.

In [None]:
word2idx = imdb.get_word_index(path='imdb_word_index.json')
word2idx = {k:(v + INDEX_FROM) for k,v in word2idx.items()}
word2idx["<PAD>"] = 0
word2idx["<START>"] = 1
word2idx["<UNK>"] = 2
word2idx["<UNUSED>"] = 3
word2idx

In [None]:
idx2word = {v: k for k,v in word2idx.items()}
idx2word

We can see that the text data is already preprocessed for us.

In [None]:
print('Number of reviews', len(X_train))
print('Length of first and fifth review before padding', len(X_train[0]) ,len(X_train[4]),'\n')
print('First review: ', X_train[0],'\n')
print('First label: ', y_train[0],'\n')

Here is an example review, using the index-to-word mapping we created from the loaded JSON file to view a review in its original form.

In [None]:
def show_review(x):
    review = ' '.join([idx2word[idx] for idx in x])
    print(review)

show_review(X_train[0])

NOTE: This text has been neither **padded** nor **stemmed**.

Looking at the distribution of lengths will help us determine what a reasonable length to pad to will be.

In [None]:
plt.hist([len(x) for x in X_train])
plt.title('review lengths');

We saw one way of doing this earlier, but Keras actually has a built-in `pad_sequences` helper function. This handles both padding and truncating. By default, padding is added to the *beginning* of a sequence.

<div class="exercise"  style="background-color:#b3e6ff">
<b>Q</b>: Why might we want to truncate? Why might we want to pad from the beginning of a sequence (sentence in this case)?
</div>

- Unless we truncate, we need to pad every sentence to the length of the longest one. That requires a lot of padding, which adds uninformative input and produces long vectors that can be computationally costly. 
- Padding at the beginning of a sentence keeps the actual words at the end of the sequence. This sometimes enhances performance, since the most recent inputs, which the 'short' memory retains best, are then the informative ones. 
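
The effect of padding and truncating at the start versus the end can be sketched in pure Python. The helper `pad_or_truncate` is hypothetical, and its single `side` argument is a simplification: the Keras `pad_sequences` helper has separate `padding` and `truncating` arguments.

```python
def pad_or_truncate(seq, maxlen, pad_value=0, side="pre"):
    """Return a copy of `seq` with exactly `maxlen` items.

    side="pre"  : drop items from, or add padding to, the start
    side="post" : drop items from, or add padding to, the end
    """
    if side not in ("pre", "post"):
        raise ValueError("side must be 'pre' or 'post'")
    seq = list(seq)
    if len(seq) >= maxlen:
        return seq[len(seq) - maxlen:] if side == "pre" else seq[:maxlen]
    padding = [pad_value] * (maxlen - len(seq))
    return padding + seq if side == "pre" else seq + padding

print(pad_or_truncate([5, 6, 7], 5))               # [0, 0, 5, 6, 7]
print(pad_or_truncate([5, 6, 7], 5, side="post"))  # [5, 6, 7, 0, 0]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 4))      # [3, 4, 5, 6]
```

With `side="pre"` the real tokens always sit at the end of the sequence, which is where a recurrent model reads them last.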

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [None]:
MAX_LEN = 500
X_train = pad_sequences(X_train, maxlen=MAX_LEN, padding='pre') #padding='post' will pad in the end of a sequence
X_test = pad_sequences(X_test, maxlen=MAX_LEN, padding='pre')
print('Length of first and fifth review after padding', len(X_train[0]) ,len(X_train[4]))
print("Note that earlier the lengths were 218 and 147.")

In [None]:
print((X_train.shape))
X_train[0]

## Model 1: Naive Feed-Forward Network <div id='FFNN'></div>

In [None]:
model = Sequential(name='Naive_FFNN')
model.add(Dense(250, activation='relu',input_dim=MAX_LEN))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=128, verbose=2)

scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

<div class="exercise"  style="background-color:#b3e6ff">
<b>Q</b>: Why was the performance so poor? How could we improve our encoding?
    
<b>A</b>: The 'magic' Embedding Layer
</div>


## Model 2: Feed-Forward Network with Embeddings <div id='embedding'></div>
<img src='wordembedding2.png' width=450px>

    
    
An embedding is a linear projection from one vector space to another. For NLP, we usually use embeddings to project the **sparse one-hot encodings** of words onto **a more compact, lower-dimensional** continuous space.
We can view the embedding layer as a transformation $\mathbb{R}^{\text{inp}} \rightarrow \mathbb{R}^{\text{emb}}$.

This **not only reduces dimensionality** but also **allows semantic similarities** between tokens to be captured by similarities between the embedding vectors. This was not possible with one-hot encoding, as all vectors there were orthogonal to one another. 
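
Both points can be checked numerically. In the sketch below, `E` is a random stand-in for an embedding matrix (an assumption; a trained layer would hold learned values), with one row per vocabulary entry.

```python
import numpy as np

vocab_size, embed_dim = 5, 2
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, embed_dim))   # stand-in embedding matrix

token_id = 3
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0

# Projecting the one-hot vector is the same as selecting a row of E
assert np.allclose(one_hot @ E, E[token_id])

# Distinct one-hot vectors are orthogonal: their dot product is always 0
other = np.zeros(vocab_size)
other[1] = 1.0
print(one_hot @ other)   # 0.0
```

This is why an embedding layer can be implemented as a table lookup on integer indices instead of a matrix product with one-hot vectors.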

<img src='wordembedding.png' width=450px>

It is also possible to load pretrained embeddings that were learned from giant corpora. This would be an instance of transfer learning.

If you are interested in learning more, start with the astronomically impactful papers on [word2vec](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) and [GloVe](https://www.aclweb.org/anthology/D14-1162.pdf).

The next **Advanced Section** will focus on *word2vec*. 

In Keras we use the [`Embedding`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) layer:
```
tf.keras.layers.Embedding(
    input_dim, output_dim, embeddings_initializer='uniform',
    embeddings_regularizer=None, activity_regularizer=None,
    embeddings_constraint=None, mask_zero=False, input_length=None, **kwargs
)
```
We'll need to specify the `input_dim` and `output_dim`. Since we are working with sequences we  also need to set the `input_length`.

Let's implement this

In [None]:
MAX_LEN

In [None]:
EMBED_DIM = 2

model = Sequential(name='embedding_FFNN')
# Embedding layer followed by Flatten (see the note below on why Flatten is needed)
model.add(Embedding(MAX_VOCAB, EMBED_DIM, input_length=MAX_LEN))
model.add(Flatten())
# Fully connected classifier
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=128, verbose=2)

scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Wow! Notice the huge improvement in performance. The embedding layer really helps! 

NOTE: We need a flatten layer to correct the dimensions. For each review, the embedding layer returns a matrix where each row corresponds to a word encoding. However, the next `Dense` layer expects a vector instead of a matrix.
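
The shape bookkeeping behind this model can be worked out by hand. The sketch below uses the sizes from this lab and counts parameters under the usual weights-plus-bias convention for dense layers; the lowercase variable names are chosen only for this example.

```python
vocab_size, embed_dim, max_len, hidden = 10000, 2, 500, 250  # sizes used in this lab

embedding_params = vocab_size * embed_dim          # one embed_dim vector per vocabulary entry
flattened_size = max_len * embed_dim               # (500, 2) matrix unrolled into a vector
dense_params = flattened_size * hidden + hidden    # weights + biases
output_params = hidden * 1 + 1                     # single sigmoid unit

print(flattened_size)                                   # 1000
print(embedding_params + dense_params + output_params)  # 270501
```

These numbers can be compared against the output of `model.summary()`.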

# Activity 2: RNN with embedding for NLP <div id='act2'></div>

<img src='simplernn.png' width=300px>


- Construct a network architecture with: 
    - an embedding layer
    - a `SimpleRNN` layer with 250 units
    - a Dense layer 
- Train this network on the data used in the previous example, namely `X_train`, `y_train`
    - Train for 3 epochs with `batch_size=128`. It is slow because it is not run on GPUs.
- Evaluate it on the `X_test`, `y_test` datasets
- Report the accuracy score on the testing set
- Can you see any improvement compared to the FFNN model? Make some comments


In [None]:
# Your code here



Notice that we do not get any improvement compared to the FFNN. What is going on here? Why does the FFNN perform better than the RNN? 

It is because this task is very easy, and the network does not really need memory to make a good prediction. Just a few key words appearing in the text, like "terrible" or "amazing", can determine the prediction. 

In more challenging tasks, like multi-class classification and text generation, memory is crucial, and recurrence is one way to provide it. 


Next week you will see some more effective RNN architectures like **LSTM** and **GRU**. These can also be run on GPUs. 

# Extra Material <div id='SM'></div>


## Inspecting the embedding space

Let's train the FFNN with the embedding layer again.

In [None]:
EMBED_DIM = 2

model = Sequential(name='embedding_FFNN')
model.add(Embedding(MAX_VOCAB, EMBED_DIM, input_length=MAX_LEN))
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=128, verbose=0)

scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

#### Get access to the embeddings (the embedding space, or latent space)

In [None]:
from tensorflow.keras import backend

In [None]:
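# With a Sequential model: map token ids to the output of the layer after the Embedding layer.
# Each token id in a 1-D input is treated as a separate sample, so the output has one row per token.
get_embed_out = backend.function(
    [model.layers[0].input],
    [model.layers[1].output])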

In [None]:
layer_output = get_embed_out([X_test[0]])
# layer_output = get_embed_out([X_train[0]])

print(type(layer_output), len(layer_output), layer_output[0].shape)

#### Create a list of some representative words and check where they live in the embedding space. 
Can you see any meaningful pattern? 

In [None]:
words = layer_output[0]
plt.plot(words[:,0], words[:,1],'bo')

In [None]:
review = ['great',   'pleasure', 'good', 'awesome',
          'movie', 'and', 'was' ,
          'bad', 'boring' , 'crap']

enc_review = tf.constant([word2idx[word] for word in review])
enc_review


In [None]:

words = get_embed_out([enc_review])[0]

plt.figure(figsize=[10,10])
plt.plot(words[:,0], words[:,1], 'ob')
for i, txt in enumerate(review):
    plt.annotate(txt, (words[i,0], words[i,1]),  size=18)
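
Another way to inspect the space, sketched here under stated assumptions, is to read word vectors straight from the embedding matrix and compare them with cosine similarity. The matrix below is a random stand-in, so the similarity values carry no meaning; with the trained model above, the real matrix could be read from the weights of the embedding layer, as noted in the comment. The names `toy_vocab`, `toy_word2idx`, `embed`, and `cosine_similarity` are hypothetical.

```python
import numpy as np

# Stand-in for a trained embedding matrix; with the model above the real one
# could be read with: embedding_matrix = model.layers[0].get_weights()[0]
rng = np.random.default_rng(109)
toy_vocab = ["great", "good", "bad", "boring", "movie"]
embedding_matrix = rng.normal(size=(len(toy_vocab), 2))
toy_word2idx = {w: i for i, w in enumerate(toy_vocab)}

def embed(word):
    """Look up the embedding vector of a word by row index."""
    return embedding_matrix[toy_word2idx[word]]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors, in [-1, 1]."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(embed("great").shape)                              # (2,)
print(cosine_similarity(embed("great"), embed("good")))  # depends on the random matrix
```

In a trained embedding, one would look for words of similar sentiment having higher similarity to each other than to words of opposite sentiment.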
