[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Humboldt-WI/adams/blob/master/exercises/tut7_RNN_NLP1_student.ipynb)

# Tutorial 7: Processing words as sequences
In this tutorial, we will try to predict the next word in a sentence. As we will see, this is challenging because the model has to pick one word out of a vocabulary that is usually large. The purpose of this tutorial is therefore not to obtain an accurate model, but to show you how the task can be performed. More accurate models require larger samples and more computational resources.

We cover the following:
1. Prepare the text data: turn a word sequence $[w_1,w_2,w_3,w_4,w_5,w_6]$ into an input/target pair such as $y=w_6$ and $x=[w_5,w_4,w_3,w_2,w_1]$. Because you are already familiar with the IMDB dataset, we will use it to create our sequence data.
2. Train a feedforward network.
3. Train a NN with a `SimpleRNN` layer.
4. Train a NN with an `LSTM` layer.
5. Train a NN with `Embedding` and `LSTM` layers.
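
As a minimal sketch of the data preparation in step 1, the snippet below slides a window over a made-up sentence and pairs each run of five words with the word that follows. The sentence and the variable names are illustrative assumptions, and the inputs are kept in reading order; reverse each list if you prefer the orientation $x=[w_5,\dots,w_1]$.

```python
# Toy sentence standing in for one tokenized review (illustrative only)
words = ["this", "movie", "was", "not", "what", "i", "expected"]
sequence_length = 5  # number of preceding words used as input

pairs = []
for end in range(sequence_length, len(words)):
    x = words[end - sequence_length:end]  # the 5 words before position `end`
    y = words[end]                        # the word to predict
    pairs.append((x, y))

for x, y in pairs:
    print(x, "->", y)
```

A sentence with $n$ words yields $n - 5$ such pairs, so short reviews contribute few samples.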

For further examples, please visit the demos in [demos/rnn](https://github.com/Humboldt-WI/adams/tree/master/demos/rnn).

## 1. Preprocess IMDB data 

In [2]:
# Import the required libraries
import pandas as pd
import numpy as np
import tensorflow as tf
import string
import re
from tensorflow.keras import layers
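# TextVectorization is part of tensorflow.keras.layers since TF 2.6;
# older versions expose it under tensorflow.keras.layers.experimental.preprocessing
from tensorflow.keras.layers import TextVectorization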

### Exercise 1
Load the IMDB dataset and use the first 100 reviews for training and the next 20 for validation. We will not use the sentiment labels, only the review text.

In [1]:
# load the data (be sure to provide the correct file path)


### Exercise 2
Create a function `our_standardization` that converts the text to lowercase and removes HTML tags, punctuation, and double spaces (check [tut5_embeddings](https://github.com/Humboldt-WI/adams/blob/master/exercises/tut5_embeddings_teacher.ipynb)).
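
To make the cleaning steps concrete, here is a plain-Python illustration on a single string. The helper name `standardize_plain` and the sample sentence are assumptions for this sketch. A function passed to `TextVectorization` receives tensors rather than Python strings, so your `our_standardization` would express the same steps with `tf.strings.lower` and `tf.strings.regex_replace`.

```python
import re
import string

def standardize_plain(text):
    """Plain-Python illustration of the cleaning steps (hypothetical helper)."""
    text = text.lower()
    # Replace HTML tags such as <br /> with a space so words do not get glued together
    text = re.sub(r"<[^>]+>", " ", text)
    # Drop every punctuation character
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r"\s+", " ", text)
    return text.strip()

cleaned = standardize_plain("Great movie!<br /><br />Really,  REALLY great.")
print(cleaned)
```

The order matters: tags are removed before punctuation, because stripping `<`, `/`, and `>` first would leave the tag name `br` behind as a fake word.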

In [2]:
# def our_standardization(text_data):


### Exercise 3
Create a `TextVectorization` layer with `output_mode='int'` and without defining `output_sequence_length`. Use a vocabulary of only 100 words (nothing good can be done with 100 words, but the purpose is to illustrate).
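
The effect of a small vocabulary can be sketched in plain Python. In integer mode, `TextVectorization` reserves index 0 for padding and index 1 for out-of-vocabulary words, and both count towards the vocabulary size. The helpers `build_vocab` and `encode` and the three toy sentences are assumptions for illustration, not part of Keras.

```python
from collections import Counter

def build_vocab(texts, max_tokens):
    """Keep the most frequent words; ids 0 and 1 are reserved (hypothetical helper)."""
    counts = Counter(word for text in texts for word in text.split())
    kept = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + kept  # 0 = padding, 1 = out-of-vocabulary

def encode(text, vocab):
    """Map each word to its vocabulary index, or to 1 if it is not in the vocabulary."""
    index = {word: i for i, word in enumerate(vocab)}
    return [index.get(word, 1) for word in text.split()]

texts = ["the film was good", "the film was bad", "the plot was thin"]
vocab = build_vocab(texts, max_tokens=5)
encoded = encode("the film was surprisingly good", vocab)
print(vocab)
print(encoded)
```

With `max_tokens=5` only three real words survive, so every other word collapses onto index 1. The same happens in the exercise with 98 real words.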

In [3]:
# Define the size of the vocabulary (the sequence length is left undefined here)
vocab_size = 100
# Create a vectorization layer

### Exercise 4
Adapt the vectorization layer to the text_data.

In [4]:
# To create the vocabulary, we need to call adapt. The input is only the text

### Exercise 5
Create a function `transform_text` that transforms the text data into a time series. Each target is paired with the 5 words that precede it (similar to what we saw in [tut6_LSTM](https://github.com/Humboldt-WI/adams/blob/master/exercises/tut6_LSTM_teacher.ipynb)). You can use the built-in `timeseries_dataset_from_array` utility from Keras.
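
The following NumPy sketch shows one possible shape of the result. It is not the Keras utility. The helper `windows_per_review` is hypothetical, and treating the cap as "at most this many samples from each review" is an assumption about what `num_samp_per_revs` is meant to do. Windows are built review by review so that no input mixes words from two different reviews.

```python
import numpy as np

def windows_per_review(reviews, sequence_length, max_per_review):
    """Build (x, y) pairs review by review so no window spans two reviews."""
    xs, ys = [], []
    for tokens in reviews:
        tokens = np.asarray(tokens)
        if len(tokens) <= sequence_length:
            continue  # too short to give even one input/target pair
        # Each row holds sequence_length inputs followed by one target
        win = np.lib.stride_tricks.sliding_window_view(tokens, sequence_length + 1)
        win = win[:max_per_review]  # assumed meaning of the per-review cap
        xs.append(win[:, :-1])
        ys.append(win[:, -1])
    return np.concatenate(xs), np.concatenate(ys)

# Three toy "reviews" given as token ids (made-up values)
reviews = [[2, 5, 7, 1, 3, 9, 4], [8, 6], [3, 3, 1, 2, 5, 7]]
x, y = windows_per_review(reviews, sequence_length=5, max_per_review=1000)
print(x.shape, y)
```

If the vectorized reviews are padded with zeros to a common length, you may want to drop the padding first, since otherwise many targets will be the padding index.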

In [5]:
# def transform_text(data, sequence_length, num_samp_per_revs):


### Exercise 6
Create the training and validation datasets.

In [6]:
sequence_length = 5 # we use the last 5 words
num_samp_per_revs = 1000

### Exercise 7
Check the frequency of each token (you can use `tf.unique_with_counts`). What's the problem?
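
As a hedged illustration of how to read such counts, the sketch below uses `np.unique` on a made-up target vector. Note the difference in output: `tf.unique_with_counts` returns the unique values in order of first appearance together with an index tensor and the counts, while `np.unique` returns sorted values. The array `y_toy` is an assumption, not real data from the exercise.

```python
import numpy as np

# Made-up targets; in the exercise these would be the token ids in y_train
y_toy = np.array([1, 1, 2, 1, 5, 2, 1, 1, 7, 1])

values, counts = np.unique(y_toy, return_counts=True)
shares = counts / counts.sum()  # relative frequency of each token

for v, c, s in zip(values, counts, shares):
    print(f"token {v}: {c} times ({s:.0%})")
```

Comparing the shares tells you how evenly the targets are spread, and what accuracy a model would reach by always predicting the most frequent token.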

In [7]:
# tf.unique_with_counts(y_train)

## 2. Feedforward NN
### Exercise 8
Fit a feedforward network.

## 3. SimpleRNN
### Exercise 9 
Fit a NN with a `SimpleRNN` layer.

## 4. LSTM
### Exercise 10
Fit a NN with an `LSTM` layer.

## 5. Embedding + LSTM
### Exercise 11
Fit a NN with an `Embedding` layer followed by an `LSTM` layer.