[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Humboldt-WI/adams/blob/master/exercises/tut8_RNN_NLP2_student.ipynb)

# Tutorial 8: Text classification considering words as a sequence
In [tutorial 5](https://colab.research.google.com/github/Humboldt-WI/adams/blob/master/exercises/tut5_embeddings_teacher.ipynb), we saw how to classify the reviews in the `IMDB` dataset as positive or negative. However, that approach ignored the order of the words in a review. In this tutorial, we therefore treat each review as a sequence of words instead of following the 'bag-of-words' approach.

This tutorial covers:
1. Loading and preparing the well-known IMDB dataset for the sequence model approach
2. Masking, a way to tell RNNs to skip meaningless inputs
3. Bidirectional RNNs
4. GRU RNNs

For further examples, please visit [demos/nlp/sentiment_analysis.ipynb](https://github.com/Humboldt-WI/adams/blob/master/demos/nlp/sentiment_analysis.ipynb).

## Preprocess IMDB data for the sequence model approach

In [49]:
# Import the required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf
import string
import re
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization

# load the data (be sure to provide the correct file path)
total_imbd = pd.read_csv("../../../demos/nlp/IMDB-50K-Movie-Review.zip", sep=",", encoding="ISO-8859-1")
total_imbd['sentiment'] = total_imbd['sentiment'].map({'positive' : 1, 'negative': 0})
# Split the data
X_train, X_val, y_train, y_val = train_test_split(total_imbd['review'], total_imbd['sentiment'], test_size = 0.2, random_state = 5)
# transform them to numpy 
X_train = X_train.to_numpy()
X_val = X_val.to_numpy()
y_train = y_train.to_numpy()
y_val = y_val.to_numpy()

# define standardization function
def our_standardization(text_data):
  lowercase = tf.strings.lower(text_data) # convert to lowercase
  remove_html = tf.strings.regex_replace(lowercase, '<br />', ' ') # replace HTML line breaks with spaces
  pattern_remove_punctuation = '[%s]' % re.escape(string.punctuation) # pattern to remove punctuation
  remove_punct = tf.strings.regex_replace(remove_html, pattern_remove_punctuation, '') # apply pattern
  remove_double_spaces = tf.strings.regex_replace(remove_punct, r'\s+', ' ') # collapse repeated whitespace (raw string avoids an invalid-escape warning)
  return remove_double_spaces

# Define the size of the vocabulary and the max number of words in a sequence
vocab_size = 10000
seq_length = 500

# Create a vectorization layer
vectorize_layer = TextVectorization(
    standardize = our_standardization,
    max_tokens = vocab_size,
    output_sequence_length = seq_length
    )
vectorize_layer.adapt(X_train)
# Transform sequences of words to sequences of integers and labels to tensors
X_train = vectorize_layer(X_train)
X_val = vectorize_layer(X_val)
y_train = tf.convert_to_tensor(y_train)
y_val = tf.convert_to_tensor(y_val)
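
To see what the standardization step does to a single review, here is a minimal sketch. The function `toy_standardization` mirrors the steps of `our_standardization` above, and the sample string is made up for illustration; neither is part of the tutorial's pipeline.

```python
import re
import string

import tensorflow as tf


def toy_standardization(text_data):
    """Lowercase, drop '<br />' breaks and punctuation, collapse whitespace."""
    lowercase = tf.strings.lower(text_data)
    no_breaks = tf.strings.regex_replace(lowercase, "<br />", " ")
    pattern = "[%s]" % re.escape(string.punctuation)
    no_punct = tf.strings.regex_replace(no_breaks, pattern, "")
    return tf.strings.regex_replace(no_punct, r"\s+", " ")


# Made-up sample review (assumption, not taken from the dataset)
sample = tf.constant(["Great movie!<br /><br />Loved   it."])
cleaned = toy_standardization(sample).numpy()[0].decode("utf-8")
print(cleaned)  # great movie loved it
```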


## Masking, a way to tell RNNs to skip meaningless inputs (padding)
Input sequences that consist mostly of padding zeros can hurt the model's performance. In our case, there are many zeros because we use the `output_sequence_length=seq_length` option in `TextVectorization`. This option truncates reviews longer than `seq_length` tokens to `seq_length` tokens and pads shorter reviews with zeros.
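
The padding and truncation behavior can be sketched on a toy corpus. The names `toy_reviews` and `toy_vectorizer`, the vocabulary size of 50, and the sequence length of 4 are assumptions chosen only for illustration.

```python
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

# Made-up toy corpus: one short review (2 words) and one long review (6 words)
toy_reviews = tf.constant(["great movie", "a truly terrible and boring movie"])

# Hypothetical small settings, not the tutorial's vocab_size / seq_length
toy_vectorizer = TextVectorization(max_tokens=50, output_sequence_length=4)
toy_vectorizer.adapt(toy_reviews)

toy_ids = toy_vectorizer(toy_reviews).numpy()
print(toy_ids.shape)  # every review now has exactly 4 token ids
print(toy_ids[0])     # short review: two word ids followed by two padding zeros
print(toy_ids[1])     # long review: truncated to its first four word ids
```

Index 0 is reserved for padding, which is why it can later serve as the masked value.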

For short reviews, the RNN spends its last iterations seeing only vectors that encode padding. The information stored in the internal state of the RNN gradually fades out as it is exposed to these empty inputs. To avoid this, we use masking. The `Embedding` layer can generate a mask (`mask_zero=True`) corresponding to its input data. This mask, attached to the data as metadata, tells the RNN to skip the iterations whose inputs only encode padding.
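
The effect of masking can be checked on a toy example. The sketch below assumes small, arbitrary layer sizes and passes the mask from `compute_mask` to the `LSTM` explicitly to make the mechanism visible (inside a Keras model the mask is normally propagated for you). With the mask, the padded and unpadded versions of the same sequence should give the same final output.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical toy layers; the sizes are arbitrary
toy_emb = layers.Embedding(input_dim=100, output_dim=8, mask_zero=True)
toy_lstm = layers.LSTM(4)

short = tf.constant([[5, 4, 3]])            # no padding
padded = tf.constant([[5, 4, 3, 0, 0, 0]])  # same tokens plus three padding zeros

# Pass the mask explicitly so the LSTM skips the padded time steps
out_short = toy_lstm(toy_emb(short), mask=toy_emb.compute_mask(short))
out_padded = toy_lstm(toy_emb(padded), mask=toy_emb.compute_mask(padded))

same_output = np.allclose(out_short.numpy(), out_padded.numpy(), atol=1e-5)
print(same_output)
```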


In [None]:
# Example: the mask is True for real tokens and False for padding (index 0)
ex_emb = layers.Embedding(input_dim = 100, output_dim=16, mask_zero=True)
ex_input = [[5,4,3,2,1,0,0],
            [1,2,3,0,0,0,0]]
ex_emb.compute_mask(ex_input)

### Exercise 1
Create a text classification model with `Embedding`+`LSTM`. Use `emb_size = 32` and `16` units for the RNN.

### Exercise 2
Fit the model for only 2 epochs with `batch_size = 128`.

### Exercise 3
Predict the sentiment of the following phrases:

`"This movie never stops surprising me. The actors are good."`

`"This movie never stops surprising me. The actors are good. However, the story is terrible."`

## Bidirectional LSTM
You have seen that RNNs are sensitive to the order of their inputs (that's why they do well when the sequence order is essential). A bidirectional RNN consists of two separate RNNs that process the same input sequence, one in chronological order and the other in reverse order, and combines their outputs. This allows it to learn patterns in both directions.

If the sequence consists of words, extracting patterns in both directions makes sense because, a priori, the relevance of a word for understanding a phrase does not depend entirely on its position: the order is determined by the grammar rather than by the sequential occurrence of the words.

| ![](https://www.gabormelli.com/RKB/images/4/4f/BRNN_Mike_Paliwal_1997_Fig3.png) | 
|:--:| 
| (Schuster & Paliwal, 1997) |

You can use the `Bidirectional` layer in Keras to create a bidirectional RNN.

```python
inputs = tf.keras.Input(shape=(sequence_length, ))
x = layers.Bidirectional(layers.LSTM(n_units))(inputs)
 ...
 ...
 ...
```
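
To get a feel for what the wrapper changes, the following sketch compares a plain `LSTM` with a `Bidirectional` one on random input. The toy sizes and the names `uni` and `bi` are assumptions for illustration only. With the default `merge_mode="concat"`, the forward and backward outputs are concatenated, so the output width and the number of parameters should both double.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical toy sizes, chosen only for illustration
batch, timesteps, features, n_units = 2, 5, 8, 4
x = tf.random.normal((batch, timesteps, features))

uni = layers.LSTM(n_units)
bi = layers.Bidirectional(layers.LSTM(n_units))  # merge_mode="concat" by default

uni_out = uni(x)
bi_out = bi(x)

print(uni_out.shape)  # last dimension equals n_units
print(bi_out.shape)   # last dimension equals 2 * n_units
print(uni.count_params(), bi.count_params())
```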


### Exercise 4
Create a new model similar to the previous one but using a `Bidirectional` layer instead of one `LSTM`. Check the number of parameters in the new layer.

### Exercise 5
Fit the model with the same arguments as before.

### Exercise 6
Predict the sentiment of the previous sentences.


In [1]:
# "This movie never stops surprising me. The actors are good."

In [2]:
# "This movie never stops surprising me. The actors are good. However, the story is terrible."

## Try with another RNN
GRU stands for "Gated Recurrent Unit". GRUs are similar to LSTMs but simpler: they use fewer gates and therefore have fewer parameters. They were introduced more recently than LSTMs, by [Cho et al. (2014)](https://arxiv.org/abs/1409.1259).

![](http://dprogrammer.org/wp-content/uploads/2019/04/RNN-vs-LSTM-vs-GRU-1200x361.png)
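
The difference in size can be checked directly. The sketch below builds one `LSTM` and one `GRU` layer with the same (arbitrary, assumed) toy dimensions and compares their parameter counts. It says nothing about predictive performance, which you need to evaluate on the data.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical toy sizes, chosen only for illustration
batch, timesteps, features, n_units = 2, 5, 8, 4
x = tf.random.normal((batch, timesteps, features))

toy_lstm = layers.LSTM(n_units)
toy_gru = layers.GRU(n_units)

# Call each layer once so that its weights are created
lstm_out = toy_lstm(x)
gru_out = toy_gru(x)

print(lstm_out.shape, gru_out.shape)  # same output shape for both layers
print("LSTM parameters:", toy_lstm.count_params())
print("GRU parameters: ", toy_gru.count_params())
```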

### Exercise 7
Using the `GRU` layer instead of the `LSTM`, create the text classification model and check whether there is a significant gain in performance compared with the LSTM model.