# **Lesson 10: NLP Recurrent Neural Networks**

This lesson is split into two parts:
- Introducing recurrent neural networks and other methods and layers which consider sequences of data.
- Generative text models.

## **Lesson Setup**

Import the lesson dependencies

In [4]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import tensorflow_datasets as tfds

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

print(tf.__version__)
print(np.__version__)

2.8.0
1.21.5


download the text data needed for the review

In [8]:
# download the csv file
!wget --no-check-certificate \
    -O /tmp/sentiment.csv https://drive.google.com/uc?id=13ySLC_ue6Umt9RJYSeM2t-V0kCv-4C-P

--2022-03-22 21:23:16--  https://drive.google.com/uc?id=13ySLC_ue6Umt9RJYSeM2t-V0kCv-4C-P
Resolving drive.google.com (drive.google.com)... 74.125.137.100, 74.125.137.101, 74.125.137.139, ...
Connecting to drive.google.com (drive.google.com)|74.125.137.100|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-08-ak-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/2iuk9bqircsjtr7337e72sgcocl9m03b/1647984150000/11118900490791463723/*/13ySLC_ue6Umt9RJYSeM2t-V0kCv-4C-P [following]
--2022-03-22 21:23:17--  https://doc-08-ak-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/2iuk9bqircsjtr7337e72sgcocl9m03b/1647984150000/11118900490791463723/*/13ySLC_ue6Umt9RJYSeM2t-V0kCv-4C-P
Resolving doc-08-ak-docs.googleusercontent.com (doc-08-ak-docs.googleusercontent.com)... 142.250.141.132, 2607:f8b0:4023:c0b::84
Connecting to doc-08-ak-docs.googleusercontent.com (doc-08-ak-docs.googleusercontent.com)|142.250.141.1

In [9]:
# read the csv file as a pandas dataset
dataset = pd.read_csv('/tmp/sentiment.csv')

# get the sentences and labels
sentences = dataset['text'].tolist()
labels = dataset['sentiment'].tolist()

# split the sentences and labels into training and test sets (80% / 20%)

training_size = int(len(sentences)*0.8)

# define training data
training_sentences = sentences[:training_size]
training_label = labels[:training_size]

# define test data
testing_sentences = sentences[training_size:]
testing_labels = labels[training_size:]


define some parameters and helper functions

In [14]:
VOCAB_SIZE = 1000
OOV = "<OOV>"
MAX_SEQUENCE_LENGTH = 100
MAX_SUBWORD_LENGTH = 5


In [15]:
def apply_padding(sequences):
  return pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH,
                       padding='post', truncating='post')

# Before the test data is passed to the model it has to be tokenized and
# padded in the same way as the training data.
def prepare_test_data(test_sentences, tokenizer):
    # convert the sentences passed in into sequences of tokens
    sequences = tokenizer.texts_to_sequences(test_sentences)
    # pad_sequences returns an array of shape (num_sentences, MAX_SEQUENCE_LENGTH),
    # which is the shape the Embedding layer expects
    padded_sequences = apply_padding(sequences)
    return padded_sequences
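
To make the effect of `padding='post'` and `truncating='post'` concrete, here is a minimal pure-Python sketch. `pad_post` is a hypothetical helper written only for illustration; it is not the Keras implementation, and it only covers the `'post'` options used above.

```python
def pad_post(sequences, maxlen, pad_value=0):
    """Hypothetical helper mimicking padding='post', truncating='post'."""
    padded = []
    for seq in sequences:
        # truncating='post': tokens beyond maxlen are dropped from the end
        trimmed = list(seq[:maxlen])
        # padding='post': zeros are appended after the real tokens
        trimmed += [pad_value] * (maxlen - len(trimmed))
        padded.append(trimmed)
    return padded

example = pad_post([[5, 8, 2, 9], [7]], maxlen=3)
print(example)  # [[5, 8, 2], [7, 0, 0]]
```

Every row ends up with the same length, which is what lets the sequences be stacked into a single array for the model.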


Create a padded tokenized sequence

In [12]:
# create a tokenizer 
Full_word_tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=VOCAB_SIZE, oov_token=OOV)

# fit the tokenizer onto the training sentence
Full_word_tokenizer.fit_on_texts(training_sentences)

# with the tokenizer convert the training sentences into sequences of tokens
Training_sentences_sequences = Full_word_tokenizer.texts_to_sequences(training_sentences)

# apply padding to the training sentence sequences
Training_sentences_padded_sequences = apply_padding(Training_sentences_sequences)


In [13]:
# Apply the whole process to the testing sentences 
Testing_sentences_padded_sequences = prepare_test_data(testing_sentences,
                                                       Full_word_tokenizer)

Create the subword corpus and tokenizer

I'm a bit wary of using this as it has been deprecated, but for the sake of following the lesson I will keep using subwords.

In [None]:
# create the subword tokenizer and fit it to the training sentences
subword_tokenizer = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(training_sentences,
                                                                              target_vocab_size = VOCAB_SIZE,
                                                                              max_subword_length=MAX_SUBWORD_LENGTH)

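def create_subword_padded_sequence(sentences):
  # only a list of sentences is supported
  if not isinstance(sentences, list):
    raise TypeError("sentences must be a list of strings")

  # encode each sentence into a sequence of subword tokens
  tokenized_subword_sequences = []
  for sentence in sentences:
    tokenized_subword_sequences.append(subword_tokenizer.encode(sentence))

  # pad/truncate every sequence to MAX_SEQUENCE_LENGTH
  tokenized_subword_padded_sequences = apply_padding(tokenized_subword_sequences)

  return tokenized_subword_padded_sequences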


## **Recurrent Neural Networks**

So far, in the previous lesson, we have:
- Applied tokenization to text to convert individual words into unique tokens.
- Generated tokenized sequences from sentences.
- Used embeddings to represent these tokens as multi-dimensional vectors, which can then be passed to dense layers.
- Created a subword corpus from text datasets.

We have been able to get reasonably good performance with the approach we have taken so far, simply by using embedded tokenized words and subwords. But we have not really considered the order of the sequence.
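
As a small illustration of what ignoring order costs us, the sketch below compares two sentences by word counts only. The sentences and the `bag_of_words` helper are made up for this example; the point is simply that a representation which only records which words occur cannot tell the two apart.

```python
from collections import Counter

def bag_of_words(sentence):
    # count how often each word occurs, discarding word order
    return Counter(sentence.lower().split())

a = "the dog bit the man"
b = "the man bit the dog"

# different sentences, identical word counts
same_bag = bag_of_words(a) == bag_of_words(b)
print(same_bag)  # True
```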

<br/>

To address this, we will look at **Recurrent neural networks**
- These are neural networks that are able to take in sequences and create an output from the sequence.


### **Significance of considering the order of sequences**

Why do we need to consider the order of sequences in our tokenized sequence?
- ***Context matters: there are temporal dependencies in our sentences***. The context, and quite often the sentiment, of a sentence can be determined by certain words. These words can be close together or far apart in the text.

<br/>

For example:
- I saw a massive _ in the _
- I am _ and want to _

The two sentences above are really ambiguous. But when given more context, we can better predict the missing words and the overall sentiment of the text.
- I saw a massive _ in the garden.
- I am _ and want to sleep.


### **Basics of RNN**
- These are neural networks that take in input sequences and output either sequences or vectors (sequence to sequence, or sequence to vector).

<br/>

Workflow of an RNN
- For each value in a sequence, the RNN uses the value as input at that time step to produce an output and a state vector.
- The state vector from the previous time step is then passed on as an additional input to the next time step.
- For the very first time step there is no previous state, so a default initial state is used (in Keras this is a vector of zeros unless an `initial_state` is supplied).

<br/>

**key point here**   
In this way, some element of the previous input influences the calculation of the next output and state vector, since the previous state vector is used as an added input.
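
The recurrence described above can be sketched in a few lines of plain Python. This is a deliberately tiny version and rests on several assumptions: the state is a single number rather than a vector, the weights `w_x`, `w_h` and `bias` are arbitrary fixed values rather than learned ones, and the output at each step is taken to be the state itself.

```python
import math

def simple_rnn(sequence, w_x=0.5, w_h=0.8, bias=0.0, initial_state=0.0):
    """Scalar RNN sketch: returns the state after every time step."""
    state = initial_state  # default state used at the very first time step
    states = []
    for x in sequence:
        # the new state depends on the current input AND the previous state
        state = math.tanh(w_x * x + w_h * state + bias)
        states.append(state)
    return states

forward = simple_rnn([1.0, 2.0, 3.0])
reverse = simple_rnn([3.0, 2.0, 1.0])

# same values, different order, different final state
print(forward[-1], reverse[-1])
```

Because each state is fed back in at the next step, feeding the same values in a different order ends in a different final state, which is exactly the order sensitivity the earlier approach lacked.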

<br/>

***That's a very simplified, layman's description of an RNN, without going too deep into it.***


This lesson focuses on one variation of the RNN:
- Long Short Term Memory (LSTM)

### **Long Short Term Memory**

We now have a high-level understanding of how a basic RNN works. While it's useful that we can now take previous inputs into account, basic RNNs are not very effective on larger bodies of text, in which the significant words that influence the overall context and sentiment are farther apart.

<br/>

So how can we handle temporal dependencies that are really far apart? LSTMs are our saving grace in this case. Put naively, these neural networks have a longer short-term memory, so they are able to retain information for much longer than a basic RNN.

<br/>

Again, not much detail has been documented here on these neural networks; I've just tried to keep to the very basics.

The Udacity lesson focused on bidirectional LSTMs, in which the sequence is processed both forwards and backwards through the network.
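
A rough sketch of the bidirectional idea follows. To keep it short it uses a plain scalar tanh recurrence in place of a real LSTM cell, and the weights are arbitrary made-up values; only the "read forwards, read backwards, join the results" structure is the point.

```python
import math

def run_rnn(sequence, w_x, w_h):
    """Scalar recurrence standing in for an LSTM; returns the final state."""
    state = 0.0
    for x in sequence:
        state = math.tanh(w_x * x + w_h * state)
    return state

def bidirectional(sequence):
    # one recurrence reads the sequence left to right...
    forward_state = run_rnn(sequence, w_x=0.5, w_h=0.8)
    # ...a second one, with its own weights, reads it right to left
    backward_state = run_rnn(sequence[::-1], w_x=-0.3, w_h=0.6)
    # the two summaries are concatenated
    return [forward_state, backward_state]

summary = bidirectional([1.0, 2.0, 3.0])
print(len(summary))  # 2
```

The Keras `Bidirectional` wrapper concatenates the two directions by default, which is why wrapping `LSTM(64)` produces an output of size 128 in the model summary further down.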

In [16]:
#define a model with a bi-directional LSTM layer

simple_lstm_model = tf.keras.Sequential([
                             tf.keras.layers.Embedding(VOCAB_SIZE, 64),
                             tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
                             tf.keras.layers.Dense(6, activation='relu'),
                             tf.keras.layers.Dense(1, activation='sigmoid')
])

simple_lstm_model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 64)          64000     
                                                                 
 bidirectional (Bidirectiona  (None, 128)              66048     
 l)                                                              
                                                                 
 dense (Dense)               (None, 6)                 774       
                                                                 
 dense_1 (Dense)             (None, 1)                 7         
                                                                 
Total params: 130,829
Trainable params: 130,829
Non-trainable params: 0
_________________________________________________________________
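
The parameter counts in the summary can be reproduced by hand. The arithmetic below uses the standard counts for each layer type: an LSTM has four weight sets (three gates plus the candidate values), each with input weights, recurrent weights and a bias. The constant names are chosen just for this check.

```python
VOCAB_SIZE = 1000
EMBEDDING_DIM = 64
LSTM_UNITS = 64
DENSE_UNITS = 6

# one EMBEDDING_DIM-sized vector per token in the vocabulary
embedding_params = VOCAB_SIZE * EMBEDDING_DIM

# 4 weight sets, each: (inputs + recurrent units) * units weights, plus units biases
lstm_params = 4 * ((EMBEDDING_DIM + LSTM_UNITS) * LSTM_UNITS + LSTM_UNITS)

# the bidirectional wrapper holds a forward and a backward LSTM
bidirectional_params = 2 * lstm_params

# dense layers: inputs * units weights, plus units biases
dense_params = (2 * LSTM_UNITS) * DENSE_UNITS + DENSE_UNITS
output_params = DENSE_UNITS * 1 + 1

total = embedding_params + bidirectional_params + dense_params + output_params
print(embedding_params, bidirectional_params, dense_params, output_params, total)
# 64000 66048 774 7 130829
```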


## **Text Generation**

