[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Humboldt-WI/adams/blob/master/exercises/tut7_RNN_NLP1_student.ipynb)

# Tutorial 7: Processing words as sequences
In this tutorial, we try to predict the next word in a sentence. This is challenging because the model must choose one word from a vocabulary that is typically large. The purpose of this tutorial is therefore not to build an accurate model but to show how the task can be set up. More accurate models require larger samples and more computational resources.

We cover the following:
1. Prepare the text data so that a sequence $[w_1,w_2,w_3,w_4,w_5,w_6]$ becomes a target $y=w_6$ and an input $x=[w_5,w_4,w_3,w_2,w_1]$. Because you are already familiar with the IMDB dataset, we use it to create our sequence data.
2. Train a feedforward network.
3. Train a NN with a `SimpleRNN` layer.
4. Train a NN with an `LSTM` layer.
5. Train a NN with `Embedding` and `LSTM` layers.

For further examples, please visit the demos in [demos/rnn](https://github.com/Humboldt-WI/adams/tree/master/demos/rnn).

## 1. Preprocess IMDB data 

In [2]:
# Import the required libraries
import pandas as pd
import numpy as np
import tensorflow as tf
import string
import re
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization

### Exercise 1
Load the IMDB data and use the first 100 reviews for training and the next 20 for validation. We use only the review text, not the sentiment labels.

In [6]:
# load the data (be sure to provide the correct file path)
data = pd.read_csv("IMDB-50K-Movie-Review.zip", sep=",", encoding="ISO-8859-1")
begin_training_at = 0
train_size = 100
train = data['review'][begin_training_at:train_size + begin_training_at]
val_size = 20
val = data['review'][begin_training_at + train_size:begin_training_at + train_size + val_size]

### Exercise 2
Create a function `our_standardization` that converts text to lowercase and removes HTML tags, punctuation, and repeated whitespace (see [tut5_embeddings](https://github.com/Humboldt-WI/adams/blob/master/exercises/tut5_embeddings_teacher.ipynb)).

In [7]:
def our_standardization(text_data):
    # Convert to lowercase
    lowercase = tf.strings.lower(text_data)

    # Replace HTML tags with a space (alternatively, strip the HTML beforehand with BeautifulSoup)
    remove_html = tf.strings.regex_replace(lowercase, r'<.*?>', ' ')

    # Build a character class that matches any punctuation character, then remove all matches
    pattern_remove_punctuation = '[%s]' % re.escape(string.punctuation)
    remove_punct = tf.strings.regex_replace(remove_html, pattern_remove_punctuation, '')

    # Collapse runs of whitespace into a single space
    remove_double_spaces = tf.strings.regex_replace(remove_punct, r'\s+', ' ')
    return remove_double_spaces
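
To see what the four steps do to a single string without running TensorFlow, they can be mirrored with Python's `re` module. This is only a sketch: the helper `standardize_py` and the sample review are made up for illustration, and Python's `re` engine and the RE2 engine behind `tf.strings.regex_replace` can differ in edge cases.

```python
import re
import string

def standardize_py(text):
    """Plain-Python mirror of our_standardization (illustrative helper only)."""
    text = text.lower()                                              # convert to lowercase
    text = re.sub(r"<.*?>", " ", text)                               # replace HTML tags with a space
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)  # drop punctuation
    text = re.sub(r"\s+", " ", text)                                 # collapse whitespace runs
    return text

sample = "This movie was <br />GREAT!!  Loved it."
print(standardize_py(sample))  # this movie was great loved it
```

Note that the tag is replaced by a space rather than removed, so that words on either side of a `<br />` do not get glued together.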

### Exercise 3
Create a `TextVectorization` layer with `output_mode='int'` and without setting `output_sequence_length`. Use a vocabulary of only 100 words (far too few for a useful model, but enough to illustrate the approach).

In [9]:
# Define the size of the vocabulary
vocab_size = 100

# Create a vectorization layer; output_sequence_length is left unset on purpose
vectorize_layer = TextVectorization(
    standardize=our_standardization,
    max_tokens=vocab_size,
    output_mode='int'
)


### Exercise 4
Adapt the vectorization layer to the training text. Recall that this builds your vocabulary from the provided text data. With `max_tokens = vocab_size`, the vocabulary holds the most frequent tokens plus two reserved entries: the padding token `''` and the out-of-vocabulary token `[UNK]`. For illustration, print the first ten elements of your vocabulary.

In [10]:
# To create the vocabulary, we need to call adapt. The input is only the text
vectorize_layer.adapt(train)
vocab = vectorize_layer.get_vocabulary()
vocab[:10]

['', '[UNK]', 'the', 'a', 'and', 'of', 'to', 'is', 'in', 'it']
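
As a rough mental model of what `adapt` and the layer call do, the following plain-Python sketch builds a frequency-ranked vocabulary with the two reserved entries and then encodes a sentence as integers. The helpers `build_vocab` and `encode` and the toy corpus are assumptions made for illustration; the real layer also applies the standardization function and may break frequency ties differently.

```python
from collections import Counter

def build_vocab(texts, max_tokens):
    """Reserve '' (padding) and '[UNK]', then add the most frequent tokens."""
    counts = Counter(token for text in texts for token in text.split())
    most_frequent = [token for token, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_frequent

def encode(text, vocab):
    """Map each token to its vocabulary index; unknown tokens map to 1."""
    index = {token: i for i, token in enumerate(vocab)}
    return [index.get(token, 1) for token in text.split()]

corpus = ["the movie was good", "the movie was bad",
          "the plot was thin", "the end"]
toy_vocab = build_vocab(corpus, max_tokens=5)
print(toy_vocab)                                # ['', '[UNK]', 'the', 'was', 'movie']
print(encode("the movie was thin", toy_vocab))  # [2, 4, 3, 1]
```

With a small `max_tokens`, many words fall outside the vocabulary and are all encoded as index 1.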

### Exercise 5
Next, we create *time series* from our text. Recall that, for language modelling, we need training data of the form $y=w_6$ and $x=[w_5,w_4,w_3,w_2,w_1]$, where each $w_i$ denotes one word. To achieve this, we supply the custom function `transform_text`. Its core is the built-in Keras function `timeseries_dataset_from_array`.

Make sure to examine the function to understand its operations. We use it in the next exercise.

In [20]:
def transform_text(data, vectorize_layer, sequence_length):
    # Each row of X holds sequence_length consecutive token ids from one review;
    # the matching entry of y is the id of the token that follows them.
    delay = sequence_length  # the target word is the word right after the sequence
    batch_size = 1
    inputs, targets = [], []
    for rev in data:
        vec_rev = vectorize_layer(rev)
        # Create a time series dataset for each review
        aux_dataset = tf.keras.preprocessing.timeseries_dataset_from_array(
            data=vec_rev,
            targets=vec_rev[delay:],
            sequence_length=sequence_length,
            shuffle=False,
            batch_size=batch_size)
        # Collect the windows; avoid the name `input`, which shadows a Python built-in
        for window, target in aux_dataset:
            inputs.append(window)
            targets.append(target)
    # Concatenate all windows once at the end
    X = tf.concat(inputs, 0)
    y = tf.concat(targets, 0)
    return X, y
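
The windowing that `transform_text` applies to each review can be sketched in plain Python. The helper `sliding_windows` and the toy token ids below are made up for illustration; the sketch mirrors the input/target pairing, not the Keras implementation itself.

```python
def sliding_windows(tokens, sequence_length):
    """Return (inputs, targets): each input holds sequence_length tokens,
    each target is the token that follows them."""
    inputs, targets = [], []
    for start in range(len(tokens) - sequence_length):
        inputs.append(tokens[start:start + sequence_length])
        targets.append(tokens[start + sequence_length])
    return inputs, targets

toy_review = [11, 12, 13, 14, 15, 16, 17]  # made-up token ids for w_1 ... w_7
X_toy, y_toy = sliding_windows(toy_review, sequence_length=5)
print(X_toy)  # [[11, 12, 13, 14, 15], [12, 13, 14, 15, 16]]
print(y_toy)  # [16, 17]
```

In this sketch, a review with $n$ tokens contributes $n - 5$ input/target pairs when the sequence length is 5.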

### Exercise 6
Create the training and validation datasets using our custom function `transform_text()`.

In [21]:
sequence_length = 5  # five input words per sample, as in x = [w_5, ..., w_1]
X_train, y_train = transform_text(train, vectorize_layer, sequence_length)
X_val, y_val = transform_text(val, vectorize_layer, sequence_length)

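In [41]:
# Vectorize the validation reviews (shorter reviews are padded with zeros)
vec_val = vectorize_layer(val)

In [42]:
# Count the non-padding tokens across all validation reviews
sum([np.count_nonzero(vec_val[idx]) for idx in range(vec_val.shape[0])])

4077

In [45]:
# Each review with n tokens yields n - sequence_length samples, so the token count
# exceeds the number of samples by exactly sequence_length * (number of reviews)
sum([np.count_nonzero(vec_val[idx]) for idx in range(vec_val.shape[0])]) - X_val.shape[0] == sequence_length * val.shape[0]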

True

In [26]:
(100*691) / 5

13820.0

In [23]:
X_train.shape[0]

22273

### Exercise 7
Check the frequency of each token (you can use `tf.unique_with_counts`). What's the problem?

In [7]:
# tf.unique_with_counts(y_train)
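
A minimal sketch of such a frequency check, written with `collections.Counter` on made-up target ids rather than the real `y_train` (the list `toy_targets` is hypothetical). On the real data, `tf.unique_with_counts(y_train)` gives the unique ids and their counts as tensors.

```python
from collections import Counter

toy_targets = [1, 2, 1, 1, 4, 2, 1, 3, 1, 1]  # made-up token ids, not real data
counts = Counter(toy_targets)

# Print each token id with its absolute and relative frequency
for token_id, count in counts.most_common():
    print(token_id, count, count / len(toy_targets))

# Share of the most frequent id: the accuracy of always predicting that id
most_common_id, most_common_count = counts.most_common(1)[0]
majority_share = most_common_count / len(toy_targets)
print(most_common_id, majority_share)  # 1 0.6
```

Comparing a model's accuracy against this share helps to judge whether the model has learned more than predicting the most frequent token.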

## 2. Feedforward NN
### Exercise 8
Fit a feedforward network.

## 3. SimpleRNN
### Exercise 9 
Fit a NN with a `SimpleRNN` layer.

## 4. LSTM
### Exercise 10
Fit a NN with an `LSTM` layer.

## 5. Embedding + LSTM
### Exercise 11
Fit a NN with `Embedding` and `LSTM` layers.
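
One possible starting point is sketched below. It is not the reference solution: the embedding dimension, the number of LSTM units, the optimizer, and the number of epochs are assumptions chosen only for illustration, and the sketch assumes inputs of five integer token ids drawn from a vocabulary of 100 entries, as prepared above.

```python
import tensorflow as tf
from tensorflow.keras import layers

vocab_size = 100       # matches the vectorization layer above
sequence_length = 5    # number of input words per sample
embedding_dim = 16     # assumed value, for illustration only
lstm_units = 32        # assumed value, for illustration only

model = tf.keras.Sequential([
    tf.keras.Input(shape=(sequence_length,), dtype="int32"),
    layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim),  # token id -> dense vector
    layers.LSTM(lstm_units),                                           # summarize the sequence
    layers.Dense(vocab_size, activation="softmax"),                    # one probability per vocabulary entry
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",  # targets are integer token ids
              metrics=["accuracy"])
model.summary()

# Training would then look like this (not run here):
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=5)
```

The sparse categorical cross-entropy loss is used because the targets are integer token ids rather than one-hot vectors.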