<a href="https://colab.research.google.com/github/Angel-Castro-RC/Final_NLP/blob/main/F6_4_LongTermRecurrence.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS 195: Natural Language Processing
## Handling Long-Term Information in Recurrent Neural Networks

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ericmanley/f23-CS195NLP/blob/main/F6_4_LongTermRecurrence.ipynb)

## Reference

SLP: RNNs and LSTMs, Chapter 9 of Speech and Language Processing by Daniel Jurafsky & James H. Martin https://web.stanford.edu/~jurafsky/slp3/9.pdf

Wikipedia article on Gated Recurrent Unit: https://en.wikipedia.org/wiki/Gated_recurrent_unit

Sequence Classification with LSTM Recurrent Neural Networks in Python with Keras by Jason Brownlee: https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/

In [None]:
import sys
!{sys.executable} -m pip install datasets keras tensorflow transformers

Collecting datasets
  Downloading datasets-2.15.0-py3-none-any.whl (521 kB)
Collecting pyarrow-hotfix (from datasets)
  Downloading pyarrow_hotfix-0.5-py3-none-any.whl (7.8 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Installing collected packages: pyarrow-hotfix, dill, multiprocess, datasets
Successfully installed datasets-2.15.0 dill-0.3.7 multiprocess-0.70.15 pyarrow-hotfix-0.5


## Example Sentence

*The flights the airline was cancelling were full*

Suppose we have generated `The flights the airline`
* `was` is a good next choice
   - `airline` has context for `was` vs. `were`

Suppose we have generated `The flights the airline was cancelling`
* `was`/`were` depends on `flights`
* much more distant information

## The Vanishing Gradient

The *vanishing gradient* is a common problem in deep neural networks.

If weights are less than 1 in magnitude, a signal gets smaller and smaller at each node it has to pass through, until it has little or no effect.

This happens during training too: the error/loss is propagated backwards through the network in proportion to the weights on each edge
* earlier edges in the network are left with little error to use in adjusting weights
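
To make the shrinking concrete, here is a minimal sketch. It assumes a single scalar weight shared by every step and ignores activation derivatives, so it only illustrates the repeated multiplication and is not a real backpropagation implementation. The function name `backpropagated_signal` is made up for this example.

```python
# Hypothetical illustration: one scalar weight applied at every step.
def backpropagated_signal(weight, num_steps, initial_error=1.0):
    """Scale an error signal by `weight` once per step it passes through."""
    signal = initial_error
    for _ in range(num_steps):
        signal *= weight
    return signal

# With a weight of 0.5, the signal halves at every step.
for steps in (1, 5, 10, 50):
    print(steps, backpropagated_signal(0.5, steps))
```

After 50 steps the signal is below `1e-15`, which is why words far back in a sequence contribute almost nothing to the weight updates.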

<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/vanishing_gradient.png?raw=1">
</div>

image source: https://www.researchgate.net/figure/A-visualization-of-the-vanishing-gradient-problem-using-the-architecture-depicted-in_fig8_277603865

## LSTM

Long Short-Term Memory (LSTM) networks try to address the vanishing gradient through
* removing unneeded information from the context
* adding information likely needed later

<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/LSTM_node.png?raw=1" width=700>
</div>

image source: SLP Fig. 19.3, https://web.stanford.edu/~jurafsky/slp3/9.pdf

### How does it do this?

* Explicit **context layer** $c_t$
* neural **gates** that control the flow of information through the layer
    - $f$ - the **forget gate** - delete info from context that is no longer needed
    - $g$ - basic extraction of info from previous hidden state
    - $i$ - the **add gate** - select info to add to current context
    - $o$ - the **output gate** - decide what info is needed for current hidden state
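
In equation form, one common way to write these gates is sketched below, in the style of the SLP chapter's notation. The weight matrices $U$ and $W$ and the intermediate names $k_t$ and $j_t$ are assumptions that do not appear in the notes above, and bias terms are omitted. Here $\sigma$ is the sigmoid, $\odot$ is the element-wise product, $x_t$ is the current input, and $h_{t-1}$ is the previous hidden state.

$$
\begin{aligned}
f_t &= \sigma(U_f h_{t-1} + W_f x_t) \\
k_t &= c_{t-1} \odot f_t \\
g_t &= \tanh(U_g h_{t-1} + W_g x_t) \\
i_t &= \sigma(U_i h_{t-1} + W_i x_t) \\
j_t &= g_t \odot i_t \\
c_t &= j_t + k_t \\
o_t &= \sigma(U_o h_{t-1} + W_o x_t) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

The forget gate $f_t$ prunes the old context, the add gate $i_t$ selects which parts of $g_t$ to add to it, and the output gate $o_t$ decides how much of the new context $c_t$ shows up in the hidden state $h_t$.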
    
<div>
    <table>
    <tr>
        <td><img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/hadamard_product.png?raw=1"></td><td style="text-align: left;"><b>Hadamard product:</b> element-wise multiplication</td>
    </tr>
    <tr>
        <td><img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/sigmoid.png?raw=1"></td><td style="text-align: left;"><b>Sigmoid activation:</b> squashes everything into the range 0 to 1, pushing large inputs toward 0 or 1</td>
    </tr>
    <tr>
        <td><img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/tanh.png?raw=1"></td><td style="text-align: left;"><b>Hyperbolic tangent activation:</b> pushes to -1 or 1, more like identity at the origin</td>
    </tr>
    </table>
</div>

Combining sigmoid with ⊙ has the effect of *masking* out information: removing some values and leaving others.
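
A minimal sketch of that masking effect, in plain Python. The context values and gate scores are made up for illustration, and `sigmoid` and `hadamard` are small helper functions written for this example rather than library calls.

```python
import math

def sigmoid(x):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def hadamard(a, b):
    """Element-wise product of two equal-length lists."""
    return [x * y for x, y in zip(a, b)]

context = [0.9, -0.4, 0.7, 0.2]            # made-up context values
gate_scores = [10.0, -10.0, 10.0, -10.0]   # made-up pre-activation gate scores

gate = [sigmoid(s) for s in gate_scores]   # close to 1 = keep, close to 0 = drop
masked = hadamard(gate, context)

print([round(g, 3) for g in gate])
print([round(m, 3) for m in masked])
```

Positions where the gate is near 1 pass through almost unchanged, and positions where it is near 0 are wiped out.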

## Gated Recurrent Unit

A **Gated Recurrent Unit** is a popular unit similar to LSTM, except more lightweight
* no output gate
* no context vector

Performance is often similar, but fewer parameters
* faster
* less memory
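
To see where the parameter savings come from, here is a rough count, assuming each gate-like block has one weight matrix over the concatenated input and hidden state plus one bias vector. The helper `recurrent_params` is hypothetical. The sizes 50 and 100 match the `embedding_size` and `hidden_layer_size` used later in this notebook.

```python
# Hypothetical helper: textbook-style parameter count for a recurrent layer.
def recurrent_params(num_blocks, input_size, hidden_size):
    """Each block has hidden_size * (input_size + hidden_size) weights plus hidden_size biases."""
    per_block = hidden_size * (input_size + hidden_size) + hidden_size
    return num_blocks * per_block

input_size, hidden_size = 50, 100

simple_rnn = recurrent_params(1, input_size, hidden_size)  # one block
lstm = recurrent_params(4, input_size, hidden_size)        # f, g, i, o
gru = recurrent_params(3, input_size, hidden_size)         # two gates + candidate state

print("SimpleRNN:", simple_rnn)  # 15100
print("LSTM:", lstm)             # 60400
print("GRU:", gru)               # 45300
```

The `SimpleRNN` figure agrees with the 15100 reported by `model.summary()` further down. As far as I know, the default Keras `GRU` keeps an extra bias vector per block, so its summary may report slightly more than this textbook count.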


<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/RNN-vs-LSTM-vs-GRU.png?raw=1" width=700>
</div>

image source: http://dprogrammer.org/rnn-lstm-gru

## Let's work with some data

We'll do something that should be an easier learning problem: text classification with a recurrent network

<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/RNN_classification.png?raw=1" width=700>
</div>

image source: SLP Fig. 9.8, https://web.stanford.edu/~jurafsky/slp3/9.pdf

### IMDB Reviews Dataset

In [None]:
from datasets import load_dataset
from sklearn.model_selection import train_test_split

dataset = load_dataset("imdb")
dataset


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [None]:
# work with a 5,000-review subset of the data (comment these out to use the full data instead)
data_subset_text, _, data_subset_label, _ = train_test_split(dataset["train"]["text"], dataset["train"]["label"], train_size=5000)
train_data_text, test_data_text, train_data_label, test_data_label = train_test_split(data_subset_text, data_subset_label, test_size=0.2)

# uncomment these to use the full original data
# train_data_text = dataset["train"]["text"]
# train_data_label = dataset["train"]["label"]
# test_data_text = dataset["test"]["text"]
# test_data_label = dataset["test"]["label"]

In [None]:
#printing out a sample review
print( train_data_text[125] )
print( train_data_label[125] )

Don't bother. A little prosciutto could go a long way, but all we get is pure ham, particularly from Dunaway. The plot is one of those bumper car episodes... the vehicle bounces into another and everything changes direction again, until we are merely scratching our heads wondering if there were ever a plot. Gina Phillips is actually good, but it's hard playing across from a mystified Dunaway playing Lady Macbeth lost in the Marx's Brother's Duck Soup. Ah, the Raven...now there's an actor. And there is the relative who just lies and bed and looks ghostly. Or Dr. Dread who's filled with lots of gloom and no working remedies. I'm one of those suckers who just has to see a movie to the end. Quoth the Raven, "Nevermore."
0


### Importing libraries

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Embedding, SimpleRNN, Dropout, GRU
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

### Preparing the data

In [None]:
vocab_size = 10000
pad_length = 500

tokenizer = Tokenizer(num_words=vocab_size) #only keep the 10000 most common words
tokenizer.fit_on_texts(train_data_text)
tokenized_train_data = tokenizer.texts_to_sequences(train_data_text)
processed_train_data = pad_sequences(tokenized_train_data,maxlen=pad_length, padding='pre')
tokenized_test_data = tokenizer.texts_to_sequences(test_data_text)
processed_test_data = pad_sequences(tokenized_test_data,maxlen=pad_length, padding='pre')

train_target = np.array(train_data_label)
test_target = np.array(test_data_label)
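
For intuition about what the tokenizer step produces, here is a simplified sketch in plain Python. It is not the Keras implementation: the function names `fit_word_index` and `texts_to_ids` are made up, it splits on whitespace only, and it ignores punctuation handling. It does follow the Keras convention, as I understand it, that the most frequent word gets index 1, index 0 is left for padding, and only words with an index below `num_words` are kept.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_ids(texts, word_index, num_words):
    """Replace each word with its index, dropping unknown and too-rare words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

reviews = ["the movie was good", "the movie was bad", "the plot"]  # toy data
word_index = fit_word_index(reviews)

print(word_index)
print(texts_to_ids(reviews, word_index, num_words=4))
```

With `num_words=4` only `the`, `movie`, and `was` survive, so rare words such as `good` and `bad` vanish from the sequences. That is the trade-off behind choosing `vocab_size`.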

**Important Note:** I originally tried `padding='post'`, which led to bad results
* having a bunch of 0s at the end of a sequence is really bad when you are only sending the last output to the next layer
* in general, we shouldn't be using post-padding with recurrent networks
    - unfortunately, this doesn't seem to be the problem with our encoder-decoder example, but it is still worth going back and fixing if you want to keep working with it
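
The difference between the two padding modes can be sketched in plain Python. The `pad` function below is a hypothetical stand-in for `pad_sequences` that handles one sequence at a time and assumes `maxlen` is positive. Like the Keras default, it keeps the last `maxlen` tokens when a sequence is too long.

```python
# Hypothetical stand-in for pad_sequences, one sequence at a time.
def pad(sequence, maxlen, padding="pre", value=0):
    """Truncate to the last `maxlen` tokens, then fill with `value` before or after."""
    sequence = sequence[-maxlen:]
    filler = [value] * (maxlen - len(sequence))
    return filler + sequence if padding == "pre" else sequence + filler

review = [12, 7, 256]  # made-up token ids

print(pad(review, 6, padding="pre"))   # [0, 0, 0, 12, 7, 256]
print(pad(review, 6, padding="post"))  # [12, 7, 256, 0, 0, 0]
```

With `'pre'`, the real words are the last thing the recurrent layer sees before its final output is handed to the next layer. With `'post'`, the final steps are all padding.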

### Defining a simple LSTM-based architecture

Since this is a binary classification problem, we can use a sigmoid activation and `binary_crossentropy` loss.
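
As a quick worked example of that loss, here is binary cross-entropy for a single prediction, written out in plain Python. This is a sketch of the standard formula, not the Keras implementation, which also averages over a batch and clips probabilities.

```python
import math

def binary_crossentropy(label, predicted_prob):
    """Loss for one example: label is 0 or 1, predicted_prob is the sigmoid output."""
    return -(label * math.log(predicted_prob)
             + (1 - label) * math.log(1 - predicted_prob))

print(round(binary_crossentropy(1, 0.5), 4))  # 0.6931: a coin-flip prediction
print(round(binary_crossentropy(1, 0.9), 4))  # 0.1054: confident and right
print(round(binary_crossentropy(0, 0.9), 4))  # 2.3026: confident and wrong
```

A training loss stuck near 0.69 therefore suggests the model is doing no better than guessing.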

In [None]:
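embedding_size = 50
hidden_layer_size = 100

model = Sequential()
model.add(Embedding(vocab_size, embedding_size, input_length=pad_length))
model.add(Dropout(0.2))
# recurrent layer: this run uses SimpleRNN; swap in LSTM(hidden_layer_size) or GRU(hidden_layer_size) to compare
model.add(SimpleRNN(hidden_layer_size))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))  # one sigmoid output for binary classification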

In [None]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 500, 50)           500000    
                                                                 
 dropout_2 (Dropout)         (None, 500, 50)           0         
                                                                 
 simple_rnn (SimpleRNN)      (None, 100)               15100     
                                                                 
 dropout_3 (Dropout)         (None, 100)               0         
                                                                 
 dense_2 (Dense)             (None, 1)                 101       
                                                                 
Total params: 515201 (1.97 MB)
Trainable params: 515201 (1.97 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
None


In [None]:
model.fit(processed_train_data,
          train_target,
          epochs = 3,
          batch_size = 64,
          validation_data=(processed_test_data,test_target) )

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.src.callbacks.History at 0x7ac18b149b70>

Epoch 1/3
63/63 [==============================] - 61s 916ms/step - loss: 0.6764 - accuracy: 0.5878 - val_loss: 0.6448 - val_accuracy: 0.6420


Epoch 2/3
63/63 [==============================] - 62s 992ms/step - loss: 0.4863 - accuracy: 0.7955 - val_loss: 0.4803 - val_accuracy: 0.7700


Epoch 3/3
63/63 [==============================] - 60s 964ms/step - loss: 0.2533 - accuracy: 0.9043 - val_loss: 0.3874 - val_accuracy: 0.8280


with dropout

Epoch 1/3
63/63 [==============================] - 62s 946ms/step - loss: 0.7236 - accuracy: 0.5385 - val_loss: 0.6816 - val_accuracy: 0.6550

Epoch 2/3
63/63 [==============================] - 59s 925ms/step - loss: 0.6667 - accuracy: 0.7347 - val_loss: 0.6523 - val_accuracy: 0.6970

Epoch 3/3
63/63 [==============================] - 61s 980ms/step - loss: 0.5258 - accuracy: 0.7820 - val_loss: 0.4135 - val_accuracy: 0.8090


`SimpleRNN`

Epoch 1/3
63/63 [==============================] - 19s 283ms/step - loss: 0.6964 - accuracy: 0.5092 - val_loss: 0.6820 - val_accuracy: 0.5680

Epoch 2/3
63/63 [==============================] - 18s 284ms/step - loss: 0.6449 - accuracy: 0.6363 - val_loss: 0.6850 - val_accuracy: 0.5350

Epoch 3/3
63/63 [==============================] - 18s 290ms/step - loss: 0.6691 - accuracy: 0.5838 - val_loss: 0.6796 - val_accuracy: 0.5290


## Exercise

The source I got this code from (https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/) included Dropout layers - you can try uncommenting them and see what they do.

A closely related alternative is to pass the dropout arguments directly to the recurrent layer:

`model.add(LSTM(hidden_layer_size, dropout=0.2, recurrent_dropout=0.2))`

Do some searching and see what you can find out about what dropout layers do and why people use them. Discuss your findings with your group.

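**Did the dropout work?**

It made it worse, but with more epochs it did better.

**What did it do?**

It randomly removes (zeros out) some of the values passed between layers during training.

**Why use dropout?**

It randomly removes information so the network can't rely on any single value, which helps it not overfit.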
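
Dropout is usually described as zeroing a random fraction of a layer's activations during training, rather than removing training examples. Here is a minimal sketch of that idea in plain Python. The `dropout` function is hypothetical, and it mimics the common "inverted dropout" convention in which the surviving values are scaled up by `1 / (1 - rate)` so the expected total stays the same. At prediction time, dropout layers pass values through unchanged.

```python
import random

# Hypothetical training-time dropout on a list of activations.
def dropout(values, rate, rng):
    """Zero each value with probability `rate`; scale the survivors by 1 / (1 - rate)."""
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(0)    # seeded so the run is repeatable
activations = [1.0] * 10  # made-up activations

print(dropout(activations, 0.2, rng))
```

Because a different random subset is dropped on every batch, the network cannot lean on any one unit, which is the intuition behind using it against overfitting.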

## Exercise

Run an experiment: What is the difference between using `SimpleRNN` and `LSTM` with this data?

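* `SimpleRNN` did worse than the LSTM, with or without dropout: in the runs above its validation accuracy was about 0.53 after 3 epochs, versus about 0.81 to 0.83 for the LSTM runs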

## Applied Exploration

Do one of the following:

1. Redo your experiment with another classification dataset. Choose something with more than 2 classes - this will be good practice in understanding the changes you need to make to the model and data prep. Describe your data and results as usual.
    * I also suggest including a GRU layer in your experiment as well: https://keras.io/api/layers/recurrent_layers/gru/

2. Edit the Encoder-Decoder code from last time to use LSTM or GRU.
    * Note that since LSTM returns both a context and hidden state, you will get an output, a hidden state, and context returned from the LSTM layer (instead of just the output and state). It will look something like the `encoder_rnn` cell at the end of this notebook.
    

The GRU ran, but it didn't learn much and it didn't do well.
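
For a dataset with more than 2 classes, the usual change is a softmax output layer with one unit per class and a categorical cross-entropy loss. The sketch below writes both out in plain Python for a single example. The function names are chosen to echo the Keras loss name, but this is not the Keras implementation, and the scores are made up.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(label, probs):
    """Loss for one example when the label is an integer class index."""
    return -math.log(probs[label])

scores = [2.0, 1.0, 0.1]  # made-up outputs for 3 classes
probs = softmax(scores)

print([round(p, 3) for p in probs])
print(round(sparse_categorical_crossentropy(0, probs), 4))

# Chance-level loss for 20 equally likely classes
uniform = [1 / 20] * 20
print(round(sparse_categorical_crossentropy(3, uniform), 4))
```

"Sparse" here means the labels stay as integer class indices, so no one-hot encoding is needed in the data prep. With 20 classes, a loss near `ln(20)` (about 3.0) means the model is still guessing.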

In [None]:
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

print("Categories:", newsgroups.target_names)

Categories: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [None]:
data_train, data_test, target_train, target_test = train_test_split(newsgroups.data, newsgroups.target, test_size=0.2, random_state=42)

# Tokenize and pad the sequences
vocab_size = 10000
pad_length = 500

tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(data_train)
tokenized_train_data = tokenizer.texts_to_sequences(data_train)
processed_train_data = pad_sequences(tokenized_train_data, maxlen=pad_length, padding='pre')
tokenized_test_data = tokenizer.texts_to_sequences(data_test)
processed_test_data = pad_sequences(tokenized_test_data, maxlen=pad_length, padding='pre')

# Convert target labels to numpy array
train_target = np.array(target_train)
test_target = np.array(target_test)

embedding_size = 50
hidden_layer_size = 100

model = Sequential()
model.add(Embedding(vocab_size, embedding_size, input_length=pad_length))
model.add(GRU(hidden_layer_size))
model.add(Dropout(0.2))
model.add(Dense(len(newsgroups.target_names), activation='softmax'))  # Softmax for multi-class classification
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# print(model.summary())

# Train the model
model.fit(
    processed_train_data,
    train_target,
    epochs=5,
    batch_size=64,
    validation_data=(processed_test_data, test_target))

NameError: ignored

In [None]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, GRU, Dropout, Dense

# Load 20 newsgroups dataset
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

# Print target categories
print("Categories:", newsgroups.target_names)

# Split the data into training and testing sets
data_train, data_test, target_train, target_test = train_test_split(newsgroups.data, newsgroups.target, test_size=0.2, random_state=42)

# Preprocessing
data_train = [text.lower() for text in data_train]
data_test = [text.lower() for text in data_test]

# Tokenize and pad the sequences
vocab_size = 10000
pad_length = 500

tokenizer = Tokenizer(num_words=vocab_size, oov_token='<OOV>')
tokenizer.fit_on_texts(data_train)
tokenized_train_data = tokenizer.texts_to_sequences(data_train)
processed_train_data = pad_sequences(tokenized_train_data, maxlen=pad_length, padding='pre')
tokenized_test_data = tokenizer.texts_to_sequences(data_test)
processed_test_data = pad_sequences(tokenized_test_data, maxlen=pad_length, padding='pre')

# Convert target labels to numpy array
train_target = np.array(target_train)
test_target = np.array(target_test)

# Model Architecture
embedding_size = 50
hidden_layer_size = 100

model = Sequential()
model.add(Embedding(vocab_size, embedding_size, input_length=pad_length))
model.add(GRU(hidden_layer_size, dropout=0.2, recurrent_dropout=0.2, return_sequences=True))
model.add(GRU(hidden_layer_size, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(len(newsgroups.target_names), activation='softmax'))

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model
model.fit(
    processed_train_data,
    train_target,
    epochs=5,  # increase the number of epochs to train longer
    batch_size=64,
    validation_data=(processed_test_data, test_target))

Categories: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x78f35a41b520>

In [None]:
# enc_emb is the embedded encoder input from last time's encoder-decoder code
encoder_rnn = LSTM(100, return_state=True)
encoder_outputs, state_h, state_c = encoder_rnn(enc_emb)

You will then pass both `state_h` and `state_c` as the *context vector*, which is the initial state for the decoder. See the source from last time to flesh out the example: https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html