In this notebook, we wrap up time series prediction with RNNs before moving to a (brief) introduction to NLP. We will see how we can apply different types of models to NLP tasks, from simple bag-of-words models through RNN-driven sequence models to transformers. We will also learn how to use pre-trained transformer models with the 🤗 Hugging Face library.

The notebook builds on a combination of the GitHub repositories by [Aurélien Géron](https://github.com/ageron/handson-ml2) and [Francois Chollet](https://github.com/fchollet/deep-learning-with-python-notebooks), and official TensorFlow and 🤗 Hugging Face tutorials.

In [None]:
import numpy as np
import os
import shutil
import random as rnd
from pathlib import Path
import tensorflow as tf
import pandas as pd
import matplotlib.pyplot as plt
from tensorflow.keras.layers import Embedding, Dense, Dropout, GRU, Bidirectional, Flatten, TextVectorization, GlobalMaxPooling1D, BatchNormalization, Conv1D, LSTM
from tensorflow.keras.metrics import mean_squared_error

# 1. Back to our timeseries prediction example

## 1.1 Preparing the data

We start by fetching the data. This is the same code as last time, just put into one cell. If you have the "climate_data.csv" file in the same folder as the notebook and are running it locally, set `google = False`. If you are running on Colab, make sure to leave `google = True`.

In [None]:
# Load data
google = True

if google:
    from google.colab import drive 
    drive.mount('/content/gdrive')
    path = "gdrive/MyDrive/"
else:
    path = ""
    
fname = os.path.join(path, "climate_data.csv")
with open(fname) as f:
    data = f.read()

lines = data.split("\n")
header = lines[0].split(",")
lines = lines[1:]

# Create temperature data and predictors
temperature = np.zeros((len(lines),))
raw_data = np.zeros((len(lines), len(header) - 1))

# Convert data into arrays
for i, line in enumerate(lines):
    values = [float(x) for x in line.split(",")[1:]] # Date Time is irrelevant, as data is equally spaced, so we drop it
    temperature[i] = values[1] # We store the temperature (this is what we will predict)
    raw_data[i, :] = values[:] # We store all data points (including the temperature), which will be our predictors
    
# Define the parts used for training and validation (rest is for testing)
train_size = int(0.5 * len(raw_data))
valid_size = int(0.25 * len(raw_data))

# Standardize the data (using the training mean and standard deviation)
mean = raw_data[:train_size].mean(axis=0)
std = raw_data[:train_size].std(axis=0)
raw_data = (raw_data - mean) / std
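
The standardization step can be illustrated on a toy series. The following is a minimal sketch in plain Python (the values and the variable names `series`, `train_mean`, and `train_std` are made up for illustration). It shows that the mean and standard deviation are computed on the training slice only and then applied to all data, so that no information from the validation or test period enters the scaling.

```python
import statistics

# Toy one-feature series (hypothetical values); the first half plays the role of training data.
series = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0]
train_size = int(0.5 * len(series))

# The statistics come from the training slice only.
train_mean = statistics.fmean(series[:train_size])
train_std = statistics.pstdev(series[:train_size])  # population std, like NumPy's default

# The same two constants are applied to the whole series.
standardized = [(x - train_mean) / train_std for x in series]

print(standardized[:train_size])  # training part: mean close to 0, std close to 1
print(standardized[train_size:])  # later values are scaled with the training constants
```

Note that the later values are not centered around zero, which is expected: they were never used to compute the constants.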

## 1.2 Forecasting multiple timesteps: a sequence-to-vector approach

So far, we have only forecast one timestep. We will now forecast the temperature for the entire next day instead. For this, we have to recreate our datasets accordingly. To see how to create the correct dataset, let's start with a simple dummy example. The idea is that we create two datasets, one for input sequences and one for output sequences, then zip them together so that we have a single dataset to pass to the model when fitting.

**Discuss**: Play around with the input parameters to understand the data generation process:
- Why do we need different sequence lengths?
- How do you control the offset between inputs and targets?
- Why do we use a `batch_size` of 1 when generating the individual datasets?

In [None]:
int_sequence = np.arange(18)
sequence_length_input = 3
sequence_length_output = 2
delay = sequence_length_input + 2
batch_size = 4

def create_multiple_dummy_datasets(start_index, end_index):
    dummy_input_data = tf.keras.preprocessing.timeseries_dataset_from_array(
        data=int_sequence[:-delay],
        targets=None,
        sequence_length=sequence_length_input,
        start_index = start_index,
        end_index = end_index,
        batch_size=1).unbatch() # We create "unbatched" datasets, since we want to merge the data one by one
    dummy_target_data = tf.keras.preprocessing.timeseries_dataset_from_array(
        data=int_sequence[delay:],
        targets=None,
        sequence_length=sequence_length_output, # The length of the output sequences may be different from the length of the input sequences
        start_index = start_index,
        end_index = end_index,
        batch_size=1).unbatch() # We create "unbatched" datasets, since we want to merge the data one by one
    return tf.data.Dataset.zip((dummy_input_data, dummy_target_data)).batch(batch_size) # We merge the two datasets and then batch them

train_dummies = create_multiple_dummy_datasets(0, 8)
val_dummies = create_multiple_dummy_datasets(8, None)

for data in [train_dummies, val_dummies]:
    print("Next data set:")
    for inputs, targets in data:
        print("    Next batch:")
        for i in range(targets.shape[0]):
            print("       input:",[int(x) for x in inputs[i]], "| output:",[int(x) for x in targets[i]])
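
To make the pairing logic explicit, here is a minimal sketch in plain Python that mimics what the two datasets and the zip do. The function `paired_windows` is a hypothetical helper, not part of TensorFlow, and it leaves out `start_index`, `end_index`, and batching.

```python
def paired_windows(series, input_len, output_len, delay):
    """Pair each input window with the target window that starts `delay` steps later."""
    inputs_source = series[:-delay]   # same slicing as for the input dataset
    targets_source = series[delay:]   # same slicing as for the target dataset
    n_inputs = len(inputs_source) - input_len + 1
    n_targets = len(targets_source) - output_len + 1
    pairs = []
    # Like zip, we stop as soon as the shorter of the two window lists is exhausted.
    for i in range(min(n_inputs, n_targets)):
        pairs.append((inputs_source[i:i + input_len],
                      targets_source[i:i + output_len]))
    return pairs

pairs = paired_windows(list(range(18)), input_len=3, output_len=2, delay=5)
print(pairs[0])    # ([0, 1, 2], [5, 6])
print(pairs[-1])   # ([10, 11, 12], [15, 16])
```

The first target always starts `delay` positions after the first input, which is how the offset between inputs and targets is controlled.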

Once we have understood how to combine two datasets in order to generate input and output sequences, we can scale this up to create our actual datasets.

In [None]:
sampling_rate = 6
sequence_length_input = 120
sequence_length_output = 24
delay = sampling_rate * (sequence_length_input + sequence_length_output - 1)
batch_size = 256

def create_multiple_datasets(start_index, end_index):
    input_data = tf.keras.utils.timeseries_dataset_from_array(
        data = raw_data[:-delay],
        targets= None,
        sampling_rate=sampling_rate,
        sequence_length=sequence_length_input,
        batch_size=1,
        start_index=start_index,
        end_index=end_index).unbatch()
    target_data = tf.keras.utils.timeseries_dataset_from_array(
        data = temperature[delay:],
        targets = None,
        sampling_rate=sampling_rate,
        sequence_length=sequence_length_output,
        batch_size=1,
        start_index=start_index,
        end_index=end_index).unbatch()
    return tf.data.Dataset.zip((input_data, target_data)).shuffle(100).batch(batch_size) # Compared to the dummy data, we also shuffle the dataset

In [None]:
train_data = create_multiple_datasets(0, train_size)
val_data = create_multiple_datasets(train_size, train_size + valid_size)
test_data = create_multiple_datasets(train_size + valid_size, None)

Let's take a look at the dimensions. Note that our inputs are 120-entry sequences of vectors with dimension 14 each. Our outputs, on the other hand, are 24-dimensional vectors, but not actual sequences. We could create target sequences of vectors (for example, for a sequence-to-sequence approach), but we won't worry about this for now, as it's quite a bit more complex.

In [None]:
for inputs, targets in train_data:
    print("Input:",inputs.shape)
    print("Output:",targets.shape)
    break

Let's create a model. We are going to predict 24 timesteps ahead with a sequence-to-vector model (i.e., we take the sampled sequence `0,...,119` and predict `143,...,166` as a single output, or vector). The key here is that we need to have 24 outputs in the dense layer, corresponding to the length of our target vectors.
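
The target indices can be checked with a little arithmetic. This short sketch reuses the parameter values from the dataset cell above; the names `first_target_step` and `last_target_step` are only introduced here for illustration.

```python
sampling_rate = 6
sequence_length_input = 120
sequence_length_output = 24

# Offset (in raw 10-minute steps) between the start of an input window and its target window.
delay = sampling_rate * (sequence_length_input + sequence_length_output - 1)

# Expressed in sampled (hourly) steps, counted from the first input step.
first_target_step = delay // sampling_rate
last_target_step = first_target_step + sequence_length_output - 1

print(delay, first_target_step, last_target_step)  # 858 143 166
```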

In [None]:
model = tf.keras.Sequential([
    tf.keras.Input(shape=(sequence_length_input, raw_data.shape[-1])),
    GRU(20, return_sequences=True),
    GRU(20),
    Dense(24)
])
model.summary()

Note that it will take a bit to train the model.

In [None]:
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint("temp_stv",save_best_only=True)
model.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])
history = model.fit(train_data,
                    epochs=8,
                    validation_data=val_data,
                    callbacks=[checkpoint_cb])

model = tf.keras.models.load_model("temp_stv")
print(f"Test MAE: {model.evaluate(test_data)[1]:.2f}")

Is the model doing well? We can't compare with the previous one, because we are now estimating temperatures for the whole day rather than just one time point. But you could build very similar baselines as before.
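
One possible baseline, sketched here under the assumption that a "persistence" forecast is a reasonable reference, simply repeats the last observed temperature for every step of the target vector. The helper `persistence_mae` and the toy numbers are hypothetical.

```python
def persistence_mae(last_inputs, targets):
    """MAE of a baseline that repeats the last observed value for every target step."""
    total_error = 0.0
    count = 0
    for last_value, target_vector in zip(last_inputs, targets):
        for true_value in target_vector:
            total_error += abs(true_value - last_value)
            count += 1
    return total_error / count

# Toy data (hypothetical): one last observed temperature per sample and a
# 3-step target vector instead of the 24 steps used in the notebook.
last_observed = [0.0, 1.0]
target_vectors = [[0.0, 0.5, 1.0], [1.0, 1.0, 2.0]]
print(persistence_mae(last_observed, target_vectors))
```

To apply this idea to the notebook's data, keep in mind that the inputs are standardized while `temperature` is not, so the last input temperature would first have to be converted back to degrees.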

# 2. A simple text-classification example

We will next go through a relatively simple NLP task: sentiment analysis. In particular, we take a text, and we predict whether the sentiment of the writer is positive or negative. To train (and test) our model, we will load the IMDB review dataset, which has been pre-labeled.

In [None]:
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz
!rm -r aclImdb/train/unsup

The data comes with a training set and a test set. We will also split off a validation set, and keep the correct folder structure ("neg" and "pos" subfolders within each of the folders "train", "test", "val"):

In [None]:
base_dir = Path("aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"
for category in ("neg", "pos"):
    os.makedirs(val_dir / category)
    files = os.listdir(train_dir / category)
    rnd.Random(1337).shuffle(files)
    num_val_samples = int(0.2 * len(files))
    val_files = files[-num_val_samples:]
    for fname in val_files:
        shutil.move(train_dir / category / fname,
                    val_dir / category / fname)

We will use the `tf.data` API that we learned about last time in order to load the reviews from our directory when training.

In [None]:
batch_size = 32

train_ds = tf.keras.utils.text_dataset_from_directory("aclImdb/train", batch_size=batch_size)
val_ds = tf.keras.utils.text_dataset_from_directory("aclImdb/val", batch_size=batch_size)
test_ds = tf.keras.utils.text_dataset_from_directory("aclImdb/test", batch_size=batch_size)

Let's take a look at the first batch of data (and the first text + label) within that batch:

**Discussion**: What do you notice? What are the data types and do they make sense to you?

In [None]:
for inputs, targets in train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

## 2.1 A bag-of-words approach

We start with a simple "bag-of-words" approach, that is, a model that doesn't consider the sequence of the words in the reviews, just their existence. However, to get at least a little bit of contextual information, we will look for "bigrams" rather than only individual words. That is, we consider all sequences of two words ("That is", "is we", "we consider", "consider all", "all sequences", "sequences of", "of two", "two words").
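
The extraction of n-grams can be sketched in a few lines of plain Python. The helper `word_ngrams` is hypothetical and only lowercases and splits on whitespace, so it is a simplification of what `TextVectorization` does. As far as we know, passing an integer such as `ngrams=2` keeps the single words in addition to the bigrams, which is what the sketch imitates.

```python
def word_ngrams(text, max_n=2):
    """Return all n-grams of length 1..max_n from a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    ngrams = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

print(word_ngrams("we consider all sequences"))
# ['we', 'consider', 'all', 'sequences', 'we consider', 'consider all', 'all sequences']
```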

**Discussion**: Can you come up with a simple example where a model will likely misclassify the sentiment when only looking at individual words, but correctly classify the sentiment when looking at 2-grams?

In [None]:
max_tokens = 20000

text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=max_tokens,
    output_mode="multi_hot"
)
text_vectorization.adapt(train_ds.map(lambda x, y: x))

From our previous datasets, we derive new datasets that "vectorize" the data appropriately, so that each review becomes a multi-hot vector marking which of the most frequent words and bigrams it contains. For this, we use the `.map` functionality together with a `lambda` function.
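
What a multi-hot encoding looks like can be shown with a tiny vocabulary. This is a minimal sketch in plain Python; the function `multi_hot` and the toy vocabulary are hypothetical, and we assume (as we understand the Keras convention for this output mode) that out-of-vocabulary tokens share an `"[UNK]"` slot at index 0.

```python
def multi_hot(tokens, vocabulary):
    """Return a 0/1 vector with one position per vocabulary entry."""
    index = {token: i for i, token in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for token in tokens:
        # Assumption: tokens outside the vocabulary map to the "[UNK]" slot at index 0.
        vector[index.get(token, 0)] = 1
    return vector

toy_vocabulary = ["[UNK]", "the", "movie", "was", "the movie", "movie was"]
tokens = ["the", "movie", "rocks", "the movie", "movie rocks"]
print(multi_hot(tokens, toy_vocabulary))  # [1, 1, 1, 0, 1, 0]
```

The vector always has the length of the vocabulary, no matter how long the text is, and it records presence only, not order or counts.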

In [None]:
binary_2gram_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=tf.data.AUTOTUNE)
binary_2gram_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=tf.data.AUTOTUNE)
binary_2gram_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y),num_parallel_calls=tf.data.AUTOTUNE)

Let's take another look at the transformed data:

**Discussion**: Why is the input shape (32,20000)?

In [None]:
for inputs, targets in binary_2gram_train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

Let's create a simple FFN with an intermediate `Dense` layer, a `Dropout` layer, and the final classification layer:

In [None]:
inputs = tf.keras.Input(shape=(max_tokens,))
x = Dense(16, activation="relu")(inputs)
x = Dropout(0.5)(x)
outputs = Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)
model.summary()

Let's now run this model on our data. This will be relatively fast, since we have only a small model:

In [None]:
model.compile(optimizer="rmsprop",loss="binary_crossentropy",metrics=["accuracy"])
callbacks = [tf.keras.callbacks.ModelCheckpoint("binary_2gram",save_best_only=True)]
model.fit(binary_2gram_train_ds.cache(), # .cache() saves the whole dataset in memory, which avoids redoing the pre-processing, but is only possible if the dataset isn't too large
          validation_data=binary_2gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)
model = tf.keras.models.load_model("binary_2gram")
print(f"Test acc: {model.evaluate(binary_2gram_test_ds)[1]:.3f}")

## 2.2 A first sequential model

Next, we want to try a sequential model, that is, a model that captures the (global) order information of the reviews. The most natural approach is an RNN, which is designed exactly for this purpose.

Before we build the model, we have to encode our strings as integers. This time, we also specify a `max_length` to ensure that our input sequences are not too long (otherwise, training will be very slow).

In [None]:
max_length = 500
max_tokens = 20000
text_vectorization = TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_length,
)
text_vectorization.adapt(train_ds.map(lambda x, y: x))
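
To make the integer encoding concrete, here is a minimal sketch in plain Python. The helper `encode` and the toy vocabulary are hypothetical; the sketch follows the Keras convention that index 0 is reserved for padding and index 1 for unknown words, and it ignores punctuation handling.

```python
def encode(text, vocabulary, max_length):
    """Map words to integer ids, then truncate or zero-pad to max_length."""
    index = {token: i for i, token in enumerate(vocabulary)}
    ids = [index.get(token, 1) for token in text.lower().split()]  # 1 = "[UNK]"
    ids = ids[:max_length]                                         # truncate long texts
    return ids + [0] * (max_length - len(ids))                     # 0 = padding

toy_vocabulary = ["", "[UNK]", "the", "movie", "was", "great"]
print(encode("The movie was great", toy_vocabulary, max_length=6))
# [2, 3, 4, 5, 0, 0]
print(encode("The movie was not great at all", toy_vocabulary, max_length=6))
# [2, 3, 4, 1, 5, 1]
```

Unlike the multi-hot vectors from before, this encoding keeps the order of the words.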

Again, we use the `.map` function of our dataset to go through the vectorization process. We also `.prefetch` here, so that the next batch can be pre-processed on the CPU while the GPU trains on the current one.

In [None]:
int_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=tf.data.AUTOTUNE).prefetch(1)
int_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=tf.data.AUTOTUNE).prefetch(1)
int_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y),  num_parallel_calls=tf.data.AUTOTUNE).prefetch(1)

Let's take a look at the pre-processed data.

**Discussion**: Do you notice a difference to the previous encoding we used?

In [None]:
for inputs, targets in int_train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

We next generate our model. We will use a `Bidirectional` `GRU` before our classification layer. We also use the `tf.one_hot` functionality to turn our input vectors into sequences of (one-hot-encoded) word-representations.

In [None]:
inputs = tf.keras.Input(shape=(None,), dtype="int64")
embedded = tf.one_hot(inputs, depth=max_tokens)
x = Bidirectional(GRU(32))(embedded)
x = Dropout(0.5)(x)
outputs = Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)
model.summary()

Next is the training step. Even on Colab, this will take a while, so don't worry too much about actually running this. In fact, we will only get to around 87% test accuracy, worse than the much faster bigram model. So, really, don't worry :)

In [None]:
model.compile(optimizer="rmsprop",loss="binary_crossentropy",metrics=["accuracy"])
callbacks = [tf.keras.callbacks.ModelCheckpoint("one_hot_bidir_gru",save_best_only=True)]
model.fit(int_train_ds,
          validation_data=int_val_ds,
          epochs=10,
          callbacks=callbacks)
model = tf.keras.models.load_model("one_hot_bidir_gru")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

## 2.3 A sequential model with a newly trained embedding

Especially when using sequential models, we generally prefer to feed the model an embedding of our data rather than the original (one-hot) data. The denser input matrix leads to faster training and, normally, also to much better performance. Let's try it out by adding an `Embedding` layer. Setting `mask_zero=True` ensures that our model skips zero-words (those that exist only because one sequence was shorter than another and we added "padding").
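
The relation between one-hot inputs and an embedding can be sketched in plain Python: looking up a row of the embedding matrix gives the same result as multiplying the one-hot vector with that matrix, only without the many multiplications by zero. The helpers and the toy matrix below are hypothetical.

```python
def one_hot(index, depth):
    """One-hot vector of length `depth` with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(depth)]

def vector_times_matrix(vector, matrix):
    """Multiply a row vector with a matrix given as a list of rows."""
    n_cols = len(matrix[0])
    return [sum(vector[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(n_cols)]

# Toy embedding matrix (hypothetical values): 4 tokens, 2 dimensions each.
embedding_matrix = [[0.0, 0.0], [0.1, -0.3], [0.7, 0.2], [-0.5, 0.9]]
token_id = 2

via_one_hot = vector_times_matrix(one_hot(token_id, depth=4), embedding_matrix)
via_lookup = embedding_matrix[token_id]
print(via_one_hot, via_lookup)
```

With 20,000 tokens, the one-hot route means 20,000 inputs per word, whereas the lookup directly returns a short dense vector.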

In [None]:
inputs = tf.keras.Input(shape=(None,), dtype="int64")
embedded = Embedding(input_dim=max_tokens, output_dim=256, mask_zero=True)(inputs)
x = Bidirectional(GRU(32))(embedded)
x = Dropout(0.5)(x)
outputs = Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)
model.summary()

While faster than the previous model (thanks, `Embedding`!), this still takes quite some time. Hence, here is a sneak preview: the test accuracy is around 0.877, so we are still not beating our bag-of-words model, and it's not really worth spending all this runtime.

In [None]:
model.compile(optimizer="rmsprop",loss="binary_crossentropy",metrics=["accuracy"])
callbacks = [tf.keras.callbacks.ModelCheckpoint("embeddings_bidir_gru",save_best_only=True)]
model.fit(int_train_ds,
          validation_data=int_val_ds,
          epochs=10,
          callbacks=callbacks)
model = tf.keras.models.load_model("embeddings_bidir_gru")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

## 2.4 A sequential model with a pre-trained embedding

We will train another model, but this time, we will use a pre-trained embedding, to make our training process faster. In particular, we will use GloVe, one of the two most commonly used pre-trained word embeddings (the other being Word2Vec).

Note 1: Unzipping the GloVe files is a bit time-consuming. Don't worry, that's normal.

Note 2: If you are running this locally, you may first have to install the `wget` command to make the download work. Alternatively, you can download the zip file directly and load it into your current folder.

In [None]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip

Now that we have downloaded and unzipped the GloVe files (there are several, one per embedding dimension; we will use the 100-dimensional one), we will create a mapping between words and embeddings.

In [None]:
path_to_glove_file = "glove.6B.100d.txt"

embeddings_index = {}
with open(path_to_glove_file, encoding="utf-8") as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print(f"Found {len(embeddings_index)} word vectors.")

Take a look, for example, at the word "the". The `embeddings_index` will return the value of "the" on each of the 100 dimensions:

In [None]:
embeddings_index['the']

Next, we create a matrix that maps each word into its embedding. We use this matrix to initialize a new `Embedding` layer, and we make sure that this layer has `trainable=False` (since we don't want to overwrite the embedding!) 

In [None]:
embedding_dim = 100

vocabulary = text_vectorization.get_vocabulary()
word_index = dict(zip(vocabulary, range(len(vocabulary))))

embedding_matrix = np.zeros((max_tokens, embedding_dim))
for word, i in word_index.items():
    if i < max_tokens:
        embedding_vector = embeddings_index.get(word)
        # Words without a GloVe vector keep an all-zero row
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

embedding_layer = tf.keras.layers.Embedding(
    max_tokens,
    embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False,
    mask_zero=True,
)
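
The construction of the embedding matrix can be followed on a toy example. This is a minimal sketch in plain Python with hypothetical stand-ins for the vocabulary and the GloVe index; it also counts how many words are found, which can be a useful sanity check.

```python
max_tokens = 4
embedding_dim = 2

# Hypothetical stand-ins for text_vectorization.get_vocabulary() and the GloVe index.
vocabulary = ["", "[UNK]", "the", "zzyzx"]
embeddings_index = {"the": [0.1, 0.2], "movie": [0.3, 0.4]}

embedding_matrix = [[0.0] * embedding_dim for _ in range(max_tokens)]
hits, misses = 0, 0
for i, word in enumerate(vocabulary):
    if i >= max_tokens:
        continue                      # ignore anything beyond the matrix size
    vector = embeddings_index.get(word)
    if vector is None:
        misses += 1                   # the row stays all zeros
    else:
        embedding_matrix[i] = list(vector)
        hits += 1

print(hits, misses)       # 1 3
print(embedding_matrix)   # [[0.0, 0.0], [0.0, 0.0], [0.1, 0.2], [0.0, 0.0]]
```

Row `i` of the matrix belongs to the word with integer id `i`, so the padding token, the unknown token, and any word missing from GloVe all end up with a zero vector.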

Next up, defining the model. There is no difference here, other than the dimensionality of the embedding.

**Discussion**: Why are there so many non-trainable parameters here?

In [None]:
inputs = tf.keras.Input(shape=(None,), dtype="int64")
embedded = embedding_layer(inputs)
x = Bidirectional(GRU(32))(embedded)
x = Dropout(0.5)(x)
outputs = Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)
model.summary()

Let's train this again. Things are a bit faster this time, because we don't have to train the `Embedding` layer from scratch. Still not fast, though, thanks to the recurrence. In case training takes too long, here is the test accuracy: 0.876

**Discussion**:
- How do you assess the performance of the model, and what do you think drives it?
- When are pre-trained embeddings most valuable?

In [None]:
model.compile(optimizer="rmsprop",loss="binary_crossentropy",metrics=["accuracy"])
callbacks = [tf.keras.callbacks.ModelCheckpoint("glove_embeddings_bidir_gru",save_best_only=True)]
model.fit(int_train_ds,
          validation_data=int_val_ds,
          epochs=10,
          callbacks=callbacks)
model = tf.keras.models.load_model("glove_embeddings_bidir_gru")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

## 2.5 A Transformer model for text-classification

We will now use a transformer model (or, at least, the encoder part of a transformer). Don't worry too much about the details here. The two things to really note are that
- We are subclassing the `tf.keras.layers.Layer` class, so we are creating a new type of layer
- While there is no transformer layer per default in TensorFlow, the `MultiHeadAttention` layer exists, so we can use it directly.

In [None]:
class TransformerEncoder(tf.keras.layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim)
        self.dense_proj = tf.keras.Sequential(
            [tf.keras.layers.Dense(dense_dim, activation="relu"),
             tf.keras.layers.Dense(embed_dim),]
        )
        self.layernorm_1 = tf.keras.layers.LayerNormalization()
        self.layernorm_2 = tf.keras.layers.LayerNormalization()

    def call(self, inputs, mask=None):
        if mask is not None:
            mask = mask[:, tf.newaxis, :]
        attention_output = self.attention(
            inputs, inputs, attention_mask=mask)
        proj_input = self.layernorm_1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)
        return self.layernorm_2(proj_input + proj_output)

    def get_config(self):
        config = super().get_config()
        config.update({
            "embed_dim": self.embed_dim,
            "num_heads": self.num_heads,
            "dense_dim": self.dense_dim,
        })
        return config
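
For intuition about what the attention layer computes, here is a strongly simplified sketch in plain Python: a single head without the learned query, key, and value projections that `MultiHeadAttention` uses. The function names and the toy vectors are hypothetical. Each output is a weighted average of all token vectors, with weights given by a softmax over scaled dot products, and padded positions are masked out.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, mask=None):
    """Single-head self-attention without learned projections.

    x: list of token vectors; mask: optional list of booleans (False = padding).
    """
    dim = len(x[0])
    outputs = []
    for query in x:
        scores = []
        for j, key in enumerate(x):
            score = sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            if mask is not None and not mask[j]:
                score = -1e9  # padded positions get (practically) zero weight
            scores.append(score)
        weights = softmax(scores)
        # Weighted average of all token vectors.
        outputs.append([sum(w * value[d] for w, value in zip(weights, x))
                        for d in range(dim)])
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
print(self_attention(tokens, mask=[True, True, False]))
```

Because the third position is masked, every output is a mixture of the first two token vectors only.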

Next, we create the `PositionalEmbedding` layer, which adds position information to the token embeddings so that the model knows the order of the words. Again, don't worry too much about the details here.

In [None]:
class PositionalEmbedding(tf.keras.layers.Layer):
    def __init__(self, sequence_length, input_dim, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_embeddings = tf.keras.layers.Embedding(
            input_dim=input_dim, output_dim=output_dim)
        self.position_embeddings = tf.keras.layers.Embedding(
            input_dim=sequence_length, output_dim=output_dim)
        self.sequence_length = sequence_length
        self.input_dim = input_dim
        self.output_dim = output_dim

    def call(self, inputs):
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=length, delta=1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        return embedded_tokens + embedded_positions

    def compute_mask(self, inputs, mask=None):
        return tf.math.not_equal(inputs, 0)

    def get_config(self):
        config = super().get_config()
        config.update({
            "output_dim": self.output_dim,
            "sequence_length": self.sequence_length,
            "input_dim": self.input_dim,
        })
        return config
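
The core idea of this layer fits into a few lines of plain Python. In this minimal sketch (the function and the toy tables are hypothetical, and the tables would be learned in the real layer), each token vector is added to a vector that depends only on its position, and the padding mask marks which ids are non-zero.

```python
def positional_embedding(token_ids, token_table, position_table):
    """Add a position vector to each token vector; also return the padding mask."""
    embedded = []
    for position, token_id in enumerate(token_ids):
        token_vector = token_table[token_id]
        position_vector = position_table[position]
        embedded.append([t + p for t, p in zip(token_vector, position_vector)])
    mask = [token_id != 0 for token_id in token_ids]  # False marks padding
    return embedded, mask

# Toy tables (hypothetical values): 3 tokens and 4 positions, 2 dimensions each.
token_table = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
position_table = [[0.0, 0.1], [0.0, 0.2], [0.0, 0.3], [0.0, 0.4]]

embedded, mask = positional_embedding([1, 2, 1, 0], token_table, position_table)
print(embedded)
print(mask)  # [True, True, True, False]
```

Token 1 appears at positions 0 and 2 and receives a different vector each time, which is how the order information enters the model.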

We now build our model. We have to specify the size of the vocabulary and the maximum sequence length, the dimension of our embedding, the number of heads in the multi-head attention, and the number of units in the inner dense layer of the transformer encoder.

In [None]:
vocab_size = 20000
sequence_length = 500
embed_dim = 256
num_heads = 2
dense_dim = 32

inputs = tf.keras.Input(shape=(None,), dtype="int64")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(inputs)
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)
x = GlobalMaxPooling1D()(x)
x = Dropout(0.5)(x)
outputs = Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)

model.summary()

Note that this network is much bigger than the previous one (5 million trainable parameters versus 20 thousand), and even bigger than the one where we trained our own `Embedding`. You will generally also need a few more epochs until convergence. If you don't have the time, here is the test accuracy: 0.884

**Discussion**: How does this compare to the previous models we have seen?

In [None]:
model.compile(optimizer="rmsprop",loss="binary_crossentropy",metrics=["accuracy"])
callbacks = [tf.keras.callbacks.ModelCheckpoint("transformer_encoder",save_best_only=True)]
model.fit(int_train_ds,
          validation_data=int_val_ds,
          epochs=20,
          callbacks=callbacks)
model = tf.keras.models.load_model("transformer_encoder",custom_objects={"TransformerEncoder": TransformerEncoder,
                                                                      "PositionalEmbedding": PositionalEmbedding})
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

In [None]:
model = tf.keras.models.load_model("transformer_encoder",custom_objects={"TransformerEncoder": TransformerEncoder,
                                                                      "PositionalEmbedding": PositionalEmbedding})
model.summary()

# 3. Pre-trained transformer models using 🤗 Hugging Face

We now explore repeating the same task with a pre-trained model from the 🤗 Hugging Face library. You will probably have to start with some installation:

In [None]:
!pip install transformers
!pip install datasets

## 3.1 A default model for sentiment classification

🤗 Hugging Face provides many transformer-based neural network architectures (such as BERT, GPT-2, RoBERTa, XLM, DistilBert, etc.) for NLP, including many pretrained models. It relies on either `TensorFlow` or `PyTorch` (even though the default is `PyTorch`, so if you want to go deeper into NLP it might be worth taking a look at that).

For simple applications, you can use the pre-defined `pipeline` function from the `transformers` package. In this case, there is no fine-tuning of the models, but you can use whatever is available off the shelf (and there are [a lot of models](https://huggingface.co/models)).

We simply have to load the "right" pipeline (in our case, `'sentiment-analysis'`) and choose the underlying transformer model. If the model doesn't work with the pipeline, you'll get a warning. If you don't specify a model, it will load a default model.

In [None]:
from transformers import pipeline

In [None]:
sentiment_pipeline = pipeline("sentiment-analysis")

Let's try it out with a simple case:

In [None]:
sentiment_pipeline("The movie was terrible")

Next, we try it out on a few data points from the validation set. Note that the model only accepts inputs of up to 512 tokens, so we have to cut off our text at some point. As a simple, crude safeguard, we keep only the first 512 bytes of each review.

In [None]:
for text, target  in val_ds:
    for i in range(min(3,len(text))):
        print("----- Example", i, "-----")
        print("Sentiment classification by DistilBert:", sentiment_pipeline(text[i].numpy()[:512].decode("utf-8", errors="ignore")))
        print("Actual label:", target[i])
    break
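
One detail about cutting text is worth a small illustration: slicing a `bytes` object counts bytes, not characters, so the cut can land in the middle of a multi-byte UTF-8 character. The following sketch uses a made-up string to show the effect and one way to decode the truncated bytes safely.

```python
raw = "café au lait".encode("utf-8")   # "é" takes two bytes in UTF-8

cut = raw[:4]                           # slices bytes and splits "é" in half
try:
    cut.decode("utf-8")
except UnicodeDecodeError as error:
    print("strict decoding fails:", error.reason)

print(cut.decode("utf-8", errors="ignore"))   # drops the incomplete character
print(raw.decode("utf-8")[:4])                # slicing after decoding counts characters
```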

Finally, let's do some systematic testing. Note that we will only look at one batch, since this would take too much time otherwise, but the accuracy is relatively representative.

**Discussion**: What do you observe about the performance of this pre-trained model? What do you think are the causes?

In [None]:
from sklearn.metrics import accuracy_score

def count_ys(indiv_input,indiv_target):
    pred = sentiment_pipeline(indiv_input.numpy()[:512].decode("utf-8", errors="ignore"))[0]['label']
    if pred == "POSITIVE":
        return 1, indiv_target
    else:
        return 0, indiv_target

ypred = []
ytrue = []
for inputs, targets  in test_ds:
    res = list(map(list, zip(*[count_ys(inputs[i],targets[i]) for i in range(len(inputs))])))
    ypred += res[0]
    ytrue += res[1]
    break
    
print(f"Test acc: {accuracy_score(ytrue,ypred):.3f}")


## 3.2 Finetuning the classification model

We can also fine-tune the models from 🤗 Hugging Face to make them more relevant for our use case. There is a 🤗 Hugging Face specific training routine that you can use, but we will try to keep things simple and instead train directly in TensorFlow, as we are used to.

In either case, however, we will have to recreate our dataset. This is because each 🤗 Hugging Face model expects a specific tokenization process, and it is much easier to go through this from the raw data than from data that has already been loaded into a `tf.data.Dataset`. So let's load our data from disk into memory:

In [None]:
def read_imdb_manual(split_dir):
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir/label_dir).iterdir():
            texts.append(text_file.read_text())
            labels.append(0 if label_dir == "neg" else 1)

    return texts, labels

train_texts, train_labels = read_imdb_manual('aclImdb/train')
val_texts, val_labels = read_imdb_manual('aclImdb/val')
test_texts, test_labels = read_imdb_manual('aclImdb/test')

Next, we load the `Tokenizer` that is specific to the model we are using (a fine-tuned version of `DistilBert`). It is very important that you check the 🤗 Hugging Face documentation before working with any model, so as to make sure you are using the right tokenization process.

In [None]:
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

Now that we have our tokenizer, we will create new datasets with the data appropriately tokenized (and truncated). Note that this might take some time.

In [None]:
batch_size = 32

train_ds_new = tf.data.Dataset.from_tensor_slices((dict(tokenizer(train_texts, truncation=True, padding=True)), train_labels)).batch(batch_size)
val_ds_new = tf.data.Dataset.from_tensor_slices((dict(tokenizer(val_texts, truncation=True, padding=True)), val_labels)).batch(batch_size)
test_ds_new = tf.data.Dataset.from_tensor_slices((dict(tokenizer(test_texts, truncation=True, padding=True)), test_labels)).batch(batch_size)

Next, we load the actual pre-trained model and summarize it.

**Discussion**: What possible issue do you notice with this? How can we fix the issue?

In [None]:
from transformers import TFDistilBertForSequenceClassification

model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
model.summary()

Let's now freeze all the layers except the final classifier layer:

In [None]:
for layer in model.layers[:-2]:
    layer.trainable=False
model.summary()

We proceed to fine-tune the model. Given the slow training, we will only do so for one epoch (which already leads to quite the improvement). After fine-tuning, the test accuracy is 0.897

**Discussion**: How does this compare to the previous models?

In [None]:
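# The model returns one raw score (logit) per class, so we use a sparse categorical loss on the logits
model.compile(optimizer="rmsprop",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])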
model.fit(train_ds_new,
          validation_data=val_ds_new,
          epochs=1)

print(f"Test acc: {model.evaluate(test_ds_new)[1]:.3f}")
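
As far as we know, sequence-classification models such as `TFDistilBertForSequenceClassification` return one raw score (a logit) per class rather than a single probability. The following minimal sketch in plain Python, with made-up logits and hypothetical helper functions, shows how such logits are turned into probabilities and a predicted class, and how a cross-entropy loss with integer labels is computed from them.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    highest = max(logits)
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(logits, label):
    """Cross-entropy between softmax(logits) and an integer class label."""
    return -math.log(softmax(logits)[label])

# Hypothetical logits for one review: [score for class 0 (negative), score for class 1 (positive)].
logits = [-1.0, 2.0]
probabilities = softmax(logits)
prediction = probabilities.index(max(probabilities))

print(probabilities, prediction)
print(sparse_categorical_crossentropy(logits, label=1))  # small loss: the label matches the prediction
print(sparse_categorical_crossentropy(logits, label=0))  # large loss: the label contradicts it
```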

## 3.3 Other default model examples

Let's take a look at two of the (many) other NLP algorithms that you can implement with the help of 🤗 Hugging Face. We will focus on the pipelines here and not do any fine-tuning. But the process is the same as before if you want to go further.

### Translation

We will first implement a Transformer-based translator (basically, our own Google Translate). All we need is the `'translation_XX_to_YY'` pipeline. For demonstration, we use `'en_to_de'`. As before, you can choose a (fitting) model:

In [None]:
en_de_translator = pipeline("translation_en_to_de",model='t5-base')

In [None]:
en_de_translator("What's your name?")

Let's make things more complicated and use the complex translation example from the lecture:

In [None]:
en_de_translator("Students start in September, have three terms with classes, finish a project, and successfully complete their degree the following summer.")

### Q&A

Another pipeline is the `'question-answering'` pipeline. It takes a question and a context and extracts the answer to the question from the context. The loading process should be familiar by now:

In [None]:
qa_model = pipeline("question-answering",model='distilbert-base-cased-distilled-squad')

Try it out below:

In [None]:
context = "My name is Philippe and I live in London."
question = "Where do I live?"
qa_model(question = question, context = context)

### Other pipelines

You can find a list of the most commonly used pipelines in the [official documentation](https://huggingface.co/docs/transformers/main_classes/pipelines).

**Coding**: Can you find a pipeline that allows you to implement a simple chatbot? What's the most fun conversation you can come up with?

# 4. Content generation with RNNs: Creating chorales like Johann Sebastian Bach (Exercise)

## 4.1 Loading the data

We will work with the Bach chorales dataset. It consists of 382 chorales composed by Johann Sebastian Bach. Each chorale is 100 to 640 time steps long, and each time step contains a chord, made up of 4 integers, where each integer corresponds to a note's index on a piano (except for the value 0, which means that no note is played).

Our objective is to train a model that can predict the next time step (consisting of four notes), given a sequence of time steps from a chorale. We will later take a look at generating Bach-like music, one note at a time.

First, load the training, validation, and test data from Dropbox:

In [None]:
path_to_data = tf.keras.utils.get_file("jsb_chorales.tgz",
                                       "https://www.dropbox.com/s/9pwoax3cuylmzht/jsb_chorales.tgz?dl=1",
                                      extract=True)
jsb_chorales_dir = os.path.join(os.path.abspath(os.path.join(path_to_data, os.pardir)),'jsb_chorales')

In [None]:
def load_chorales(path_add):
    filepath = os.path.join(jsb_chorales_dir, path_add)
    return_list = []
    for file in [f for f in os.listdir(filepath) if f.endswith('.csv')]:
        return_list.append(pd.read_csv(os.path.join(filepath, file)).values.tolist())
    return return_list

In [None]:
train_chorales = load_chorales('train')
valid_chorales = load_chorales('valid')
test_chorales = load_chorales('test')

Let's take a look at one data point:

In [None]:
train_chorales[0]

We now write a few functions to listen to these chorales (you don't need to understand the details here, this is only to have a bit of fun with the exercise).

In [None]:
from IPython.display import Audio

def notes_to_frequencies(notes):
    # Frequency doubles when you go up one octave; there are 12 semi-tones
    # per octave; Note A on octave 4 is 440 Hz, and it is note number 69.
    return 2 ** ((np.array(notes) - 69) / 12) * 440

def frequencies_to_samples(frequencies, tempo, sample_rate):
    note_duration = 60 / tempo # the tempo is measured in beats per minute
    # To reduce click sound at every beat, we round the frequencies to try to
    # get the samples close to zero at the end of each note.
    frequencies = np.round(note_duration * frequencies) / note_duration
    n_samples = int(note_duration * sample_rate)
    time = np.linspace(0, note_duration, n_samples)
    sine_waves = np.sin(2 * np.pi * frequencies.reshape(-1, 1) * time)
    # Removing all notes with frequencies ≤ 9 Hz (includes note 0 = silence)
    sine_waves *= (frequencies > 9.).reshape(-1, 1)
    return sine_waves.reshape(-1)

def chords_to_samples(chords, tempo, sample_rate):
    freqs = notes_to_frequencies(chords)
    freqs = np.r_[freqs, freqs[-1:]] # make last note a bit longer
    merged = np.mean([frequencies_to_samples(melody, tempo, sample_rate)
                     for melody in freqs.T], axis=0)
    n_fade_out_samples = sample_rate * 60 // tempo # fade out last note
    fade_out = np.linspace(1., 0., n_fade_out_samples)**2
    merged[-n_fade_out_samples:] *= fade_out
    return merged

def play_chords(chords, tempo=160, amplitude=0.1, sample_rate=44100):
    samples = amplitude * chords_to_samples(chords, tempo, sample_rate)
    return display(Audio(samples, rate=sample_rate))
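
As a quick sanity check of the note-to-frequency formula, the sketch below repeats `notes_to_frequencies` (so that it runs on its own) and applies it to a few hand-picked note numbers. The chosen notes are just examples.

```python
import numpy as np

def notes_to_frequencies(notes):
    # Same formula as above: note 69 is the A at 440 Hz, and there are 12 semitones per octave
    return 2 ** ((np.array(notes) - 69) / 12) * 440

# One octave below A 440, A 440 itself, one octave above, and the "silence" value 0
freqs = notes_to_frequencies([57, 69, 81, 0])
print(np.round(freqs, 2))
```

Going up or down by 12 note numbers doubles or halves the frequency. The value 0 maps to a frequency below 9 Hz, which is why the `frequencies > 9.` mask in `frequencies_to_samples` mutes it.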

We can use our functions to listen to the chorales. Try it out:

In [None]:
play_chords(train_chorales[0])

We need to define a few things relating to the notes in the chorales. This will make pre-processing a lot easier later. Notes range from 36 (C2 = C on octave 2, given that note 69 is the A on octave 4) to 81 (A5 = A on octave 5), plus 0 for silence:

In [None]:
notes = set()
for chorales in (train_chorales, valid_chorales, test_chorales):
    for chorale in chorales:
        for chord in chorale:
            notes |= set(chord)

n_notes = len(notes)
min_note = min(notes - {0})
max_note = max(notes)

## 4.2 Preprocessing the data

In order to be able to generate new chorales, we want to train a model that can predict the next chord given all the previous chords. If we naively try to predict the next chord in one shot, predicting all 4 notes at once, we run the risk of getting notes that don't go very well together. It's much better to predict one note at a time.

Hence, we will need to preprocess every chorale, turning each chord into an arpeggio (i.e., a sequence of notes rather than notes played simultaneously). So each chorale will be a long sequence of notes (rather than chords), and we can just train a model that can predict the next note given all the previous notes.

So what will our input sequences be for our RNN? We will create sequences made from 32 chords (128 notes). Each x will be a sequence of 127 notes (so the notes 1-127 of the 128 notes). The corresponding y will be the same sequence, but shifted by one (so the notes 2-128 of the 128 notes). This is a "sequence-to-sequence" approach. You could, of course, also try a "sequence-to-vector" approach and only predict the last note of each sequence. However, this won't pick up very well on the combinations of notes in chords, so I wouldn't recommend it.

In principle, we could create a "sliding window" of 32 chords and go chord-by-chord. For example, if we have 100 chords in a chorale, then we could turn that chorale into 100 - 32 + 1 = 69 sequences of 32 chords. However, there will be a lot of correlation between the sequences, so we will instead slide our window by 16 chords. Hence, our chorale of 100 chords would now be turned into 5 sequences (the last one being from chords 65 to 96).
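
The window arithmetic can be checked with a few lines of plain Python. The helper `window_ranges` is a hypothetical name used only for this illustration; it mimics windows with `drop_remainder=True` by keeping only windows that fit completely into the chorale.

```python
def window_ranges(n_chords, window_size=32, window_shift=16):
    """Return the 1-indexed (first, last) chord of every complete window."""
    starts = range(0, n_chords - window_size + 1, window_shift)
    return [(start + 1, start + window_size) for start in starts]

dense = window_ranges(100, window_shift=1)    # slide chord-by-chord
sparse = window_ranges(100, window_shift=16)  # slide by 16 chords

print(len(dense))
print(sparse)
```

With a shift of 16, consecutive windows still overlap by half, but far fewer near-duplicate sequences are produced than with a shift of 1.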

We will use a TensorFlow `Dataset` to manage our data pre-processing. It's not strictly necessary, as the amount of data isn't massive, but it will hopefully help introduce the concepts better.

We will start with a simple example before putting everything into a function. First, we convert one of our chorales datasets into a Tensor:

In [None]:
chorales = tf.ragged.constant(train_chorales, ragged_rank=1)

We will also shift the values of the notes so that they range from 0 to 46, where 0 represents silence, and values 1 to 46 represent notes 36 (C2) to 81 (A5). This is useful if you want to predict the notes through regression rather than classification (I don't recommend this, for reasons that will become clearer below). But it also helps to keep things more organized.
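
To see what this shift does, here is a small NumPy sketch on two made-up chords. It uses `np.where`, which takes its arguments in the same order as `tf.where` (condition, value if true, value if false), and assumes `min_note = 36` as stated above.

```python
import numpy as np

min_note = 36  # assumption: lowest non-silent note, as stated in the text

# Two made-up chords; 0 stands for silence
chords = np.array([[0, 36, 60, 81],
                   [74, 70, 65, 58]])

# Keep silence at 0, move every other note into the range 1..46
shifted = np.where(chords == 0, chords, chords - min_note + 1)
# The inverse mapping recovers the original note numbers
restored = np.where(shifted == 0, shifted, shifted + min_note - 1)

print(shifted)
```

The inverse mapping is what we will need later to turn generated notes back into playable chords.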

**Code**: In the code below, can you complete the second line so that a note remains unchanged if it is zero, and is otherwise replaced by `note - min_note + 1`? Use [`tf.where`](https://www.tensorflow.org/api_docs/python/tf/where)!

In [None]:
dataset_all = tf.data.Dataset.from_tensor_slices(chorales)
#

Let's take a look at what kind of data we have so far:

In [None]:
for item in dataset_all.take(3):
    print(item.shape)

Each datapoint of our `Dataset` is an individual chorale:

In [None]:
item

We will now turn to processing a single one of those chorales. Start by creating a new `Dataset` from the chorale. Each entry of that dataset is now a chord:

In [None]:
chorale = next(iter(dataset_all))
dataset_single = tf.data.Dataset.from_tensor_slices(chorale)
for item in dataset_single.take(5):
    print(item)

To demonstrate how things work, we will look at 3 chords at a time, and slide our window chord-by-chord.
Convert your chorale into multiple sequences of 3 chords, one apart, using the `window(3,1,drop_remainder=True)` operation on `dataset_single`. Run the complete code and compare the output of the previous step with the output below. Can you see what is happening?

In [None]:
dataset_single = dataset_single.window(3, 1, drop_remainder=True)
dataset_single = dataset_single.flat_map(lambda window: window.batch(3))
for item in dataset_single.take(3):
    print(item)

We turn what we just did into a function, so we can apply it to the dataset of all chorales.

In [None]:
def to_windows(chorale):
    dataset_single = tf.data.Dataset.from_tensor_slices(chorale)
    dataset_single = dataset_single.window(3, 1, drop_remainder=True)
    dataset_single = dataset_single.flat_map(lambda window: window.batch(3))
    return dataset_single

In [None]:
dataset_all = dataset_all.flat_map(to_windows)
for item in dataset_all.take(3):
    print(item)

Now that we have created sequences of chords to be used as inputs, we need to turn the chords into arpeggios (i.e., convert everything into long lists of notes). Here, the lists will be of length 12, because we took 3 chords at a time. However, we will later work with 32 chords, so lists of length 128.
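
The flattening step can be illustrated with NumPy on a made-up window of 3 chords. `reshape(-1)` is the NumPy counterpart of `tf.reshape(window, [-1])`; the note values are invented for the example.

```python
import numpy as np

# A made-up window of 3 chords with 4 notes each
window = np.array([[39, 35, 30, 23],
                   [39, 35, 30, 23],
                   [41, 37, 30, 25]])

# Flatten row by row: the notes of chord 1, then chord 2, then chord 3
arpeggio = window.reshape(-1)

print(arpeggio.shape)
print(arpeggio)
```

Because the flattening goes row by row, the four notes of each chord stay next to each other in the resulting sequence.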

**Code**: Complete the lambda function, by applying `tf.reshape(window, [-1])` to all `window` elements in the dataset.

In [None]:
dataset_all = dataset_all.map(lambda # 
for item in dataset_all.take(3):
    print(item)

We will also generate our `x` and our `y` from the lists: elements 1-11 go into `x`, elements 2-12 go into `y`. Keep in mind that at runtime, our algorithm only sees **past** notes whenever it is working on one specific instance.
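
The input/target split is plain slicing. A small NumPy sketch, using the numbers 1 to 12 as stand-ins for the 12 notes of one flattened window:

```python
import numpy as np

arpeggio = np.arange(1, 13)  # stand-in for the 12 notes of one window

x = arpeggio[:-1]  # elements 1-11: what the model sees
y = arpeggio[1:]   # elements 2-12: what the model should predict

for step in range(3):
    print(f"after seeing {x[:step + 1]} the target is {y[step]}")
```

At every position, the target is simply the note that follows the inputs seen so far.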

**Code**: Again, compare the output of the previous cell with the output of the next cell to verify what is happening:

In [None]:
dataset_all = dataset_all.map(lambda #
for item in dataset_all.take(3):
    print(item)

We can now put everything together into a `create_dataset` function, which we apply to all the chorales, whether training, validation, or testing. There are a few additional points that come up here that we didn't discuss before, but they are commented accordingly.

In [None]:
def create_dataset(chorales, batch_size=32, shuffle_buffer_size=None,
                    window_size=32, window_shift=16, cache=True):

    def to_windows(chorale):
        dataset_single = tf.data.Dataset.from_tensor_slices(chorale)
        dataset_single = dataset_single.window(window_size, window_shift, drop_remainder=True)
        dataset_single = dataset_single.flat_map(lambda window: window.batch(window_size))
        return dataset_single

    chorales = tf.ragged.constant(chorales, ragged_rank=1)
    dataset = tf.data.Dataset.from_tensor_slices(chorales)
    dataset = dataset.map(lambda note: tf.where(note == 0, note, note - min_note + 1))
    dataset = dataset.flat_map(to_windows)
    dataset = dataset.map(lambda window: tf.reshape(window, [-1]))
    if cache:
        dataset = dataset.cache() # If the memory is sufficient, we can cache the data, which makes it faster to call up
    if shuffle_buffer_size:
        dataset = dataset.shuffle(shuffle_buffer_size) # For training, we usually want to shuffle the data randomly
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(lambda batch: (batch[:,:-1], batch[:,1:])) # Note: we are reshaping entire batches, not just single observations
    return dataset.prefetch(1)

In [None]:
train_dataset = create_dataset(train_chorales, shuffle_buffer_size = 1000)
valid_dataset = create_dataset(valid_chorales)
test_dataset = create_dataset(test_chorales)

Let's take a look at one element from the training set to see if we have the expected shapes:

In [None]:
temp = next(iter(train_dataset))
print(temp[0].shape)
print(temp[1].shape)

## 4.3 Training a model

We are now ready to create a model.

- We could feed the note values directly to the model, as floats, but this would probably not give good results. Indeed, the relationships between notes are not that simple: for example, if you replace a C3 with a C4, the melody will still sound fine, even though these notes are 12 semi-tones apart (i.e., one octave). Conversely, if you replace a C3 with a C\#3, it's very likely that the chord will sound horrible, despite these notes being just next to each other. So we will use an `Embedding` layer to convert each note to a small vector representation. We will use 5-dimensional embeddings, so the output of this first layer will have a shape of `[batch_size, window_size, 5]`. Keep in mind that the number of dimensions is a hyperparameter that you may want to tune.
- We will then feed this data into a stack of 4 `Conv1D` layers with doubling dilation rates. The `dilation_rate` specifies how spread apart each neuron's inputs are. The doubling of rates at each layer means that we are building a hierarchy of sequences, where the first layer captures only short sequences, and the last one captures long ones that consist of combinations of those short ones. The `'causal'` padding ensures that convolutions are padded such that there is no "looking into the future".
- We intersperse the layers with `BatchNormalization` layers for faster convergence.
- At the end, we have one `LSTM` layer to try to capture long-term patterns.
- Finally, a `Dense` layer with `'softmax'` activation is used to produce the final note probabilities. It will predict one probability for each chorale in the batch, for each time step, and for each possible note (including silence). So the output shape will be `[batch_size, window_size, 47]`.
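
To get a feeling for how far back the convolutional stack can "see", we can compute its receptive field by hand. The helper `receptive_field` is a hypothetical name for this illustration; it assumes causal convolutions with stride 1, where each layer adds `(kernel_size - 1) * dilation_rate` past time steps. The first `Conv1D` layer uses the Keras default `dilation_rate=1`.

```python
def receptive_field(kernel_size, dilation_rates):
    """Time steps (including the current one) visible at the top of a causal conv stack."""
    field = 1
    for rate in dilation_rates:
        field += (kernel_size - 1) * rate
    return field

rates = [1, 2, 4, 8]  # dilation rates of the four Conv1D layers
rf = receptive_field(kernel_size=2, dilation_rates=rates)
print(rf)
```

Since each chord is spread over 4 notes, the convolutional stack on its own covers only a handful of chords, which is one motivation for adding the `LSTM` layer on top.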

In [None]:
n_embedding_dims = 5

model = tf.keras.Sequential([
    Embedding(input_dim=n_notes, output_dim=n_embedding_dims,input_shape=[None]),
    Conv1D(32, kernel_size=2, padding="causal", activation="relu"),
    BatchNormalization(),
    Conv1D(48, kernel_size=2, padding="causal", activation="relu", dilation_rate=2),
    BatchNormalization(),
    Conv1D(64, kernel_size=2, padding="causal", activation="relu", dilation_rate=4),
    BatchNormalization(),
    Conv1D(96, kernel_size=2, padding="causal", activation="relu", dilation_rate=8),
    BatchNormalization(),
    LSTM(256, return_sequences=True),
    Dense(n_notes, activation="softmax")
])
model.summary()

The function below trains the model and plots the progress. Be aware that training may take a little bit of time.

In [None]:
def train_and_plot(model, learning_rate = 0.001, epochs = 20):
    model.compile(loss="sparse_categorical_crossentropy",
                    optimizer=tf.keras.optimizers.Adam(learning_rate = learning_rate),
                    metrics = ['accuracy'])
    early_stopping_cb = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience = 10, restore_best_weights=True)
    log = model.fit(train_dataset, epochs=epochs,
                        validation_data=valid_dataset,
                    callbacks = [early_stopping_cb])
    
    plt.plot(log.history['accuracy'],label = "training accuracy",color='green')
    plt.plot(log.history['loss'],label = "training loss",color='darkgreen')
    plt.plot(log.history['val_accuracy'], label = "validation accuracy",color='grey')
    plt.plot(log.history['val_loss'], label = "validation loss",color='darkblue')
    plt.legend()
    ax = plt.gca()
    plt.show()
    
    return model

In [None]:
train_and_plot(model)

## 4.4 Creating our own chorale

Now that we can predict the next note of a chorale, let's create some music! We will take a few starting chords, predict the next note to play, then take all the notes so far (the starting ones and the predicted one), and predict the next note, and so on. To do this, we need to process our data a little bit.

**Code**: Let's start somewhere. We will use 8 initial chords to start our own chorale. For simplicity, we take the first 8 chords from a test set chorale.
You will need to convert the notes back (from 36-81 to 1-46), unless a note is already a zero:

In [None]:
seed_chords = test_chorales[0][:8]
arpeggio = #
arpeggio

Keep in mind that we made predictions on lists of notes (arpeggios) instead of chords. Hence, we convert our starting chords:

In [None]:
arpeggio = tf.reshape(arpeggio, [1,-1])
arpeggio

**Code**: Then, we predict the most likely note to come next. Use `model.predict` as well as `np.argmax` (with `axis=-1`):

In [None]:
next_note = #
next_note

The code above predicts as many notes as we gave as an input (because of the sequence-to-sequence model structure). We only want the last note:

In [None]:
next_note = next_note[:1,-1:]
next_note
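
The shapes involved in this step are easy to mix up, so here is a NumPy sketch with random stand-in probabilities instead of real model output. The sizes (6 time steps, 47 classes) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_steps, n_classes = 6, 47  # toy sizes: 47 = 46 notes + silence

# Stand-in for model.predict(...): one sequence, one distribution per time step
probas = rng.random((1, n_steps, n_classes))
probas /= probas.sum(axis=-1, keepdims=True)  # normalise like a softmax output

all_preds = np.argmax(probas, axis=-1)  # most likely note at every step, shape (1, n_steps)
last_pred = all_preds[:1, -1:]          # keep only the final step, shape (1, 1)

print(all_preds.shape, last_pred.shape)
```

Slicing with `[:1, -1:]` rather than `[0, -1]` keeps both dimensions, so the result can be concatenated directly to a `[1, n_notes]` arpeggio.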

We can now add this note to our arpeggio, so we can keep predicting further notes.

**Code**: Use `tf.concat` with `axis=1`. You can concatenate the `arpeggio` with the `next_note` by putting them inside a list which is the primary input to `tf.concat`.

In [None]:
arpeggio = #
arpeggio

Let's wrap all this into a function:

In [None]:
def generate_chorale(model, seed_chords, output_chords):
    arpeggio = tf.constant([[note if note == 0 else note - min_note + 1 for note in chord] for chord in seed_chords], dtype=tf.int64)
    arpeggio = tf.reshape(arpeggio, [1,-1])
    for chord in range(output_chords):
        for note in range(4):
            next_note = np.argmax(model.predict(arpeggio), axis=-1)[:1,-1:]
            arpeggio = tf.concat([arpeggio, next_note], axis=1)
    arpeggio = tf.where(arpeggio == 0, arpeggio, arpeggio + min_note - 1)
    new_chorale = tf.reshape(arpeggio, shape=[-1, 4])
    return new_chorale

Try it out by creating your very own chorale:

In [None]:
new_chorale = generate_chorale(model, seed_chords, 56)
play_chords(new_chorale)

If you listened to your chorale, you may have noticed a major flaw of the approach: it is often too conservative. The model never takes any risk: it always chooses the note with the highest probability. Since repeating the previous note generally sounds good enough, that is the least risky option, so the algorithm tends to make notes last longer and longer (and, if you are unlucky, simply repeats one chord after just a few time steps). Pretty boring. Plus, if you run the model multiple times, it will always generate the same melody.

So let's spice things up a bit! Instead of always picking the note with the highest score, we will pick the next note randomly, according to the predicted probabilities. For example, if the model predicts a C3 with 75% probability, and a G3 with a 25% probability, then we will pick one of these two notes randomly, with these probabilities. We will also add a temperature parameter that will control how "hot" (i.e., daring) we want the system to feel. A high temperature will bring the predicted probabilities closer together, reducing the probability of the likely notes and increasing the probability of the unlikely ones.
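
The effect of the temperature can be seen on the two-note example from above. This NumPy sketch divides the log-probabilities by the temperature and renormalises with a softmax, which is the distribution that `tf.random.categorical` samples from when given such rescaled logits. The helper `apply_temperature` is a hypothetical name for this illustration.

```python
import numpy as np

def apply_temperature(probas, temperature):
    """Rescale log-probabilities by 1/temperature and renormalise them."""
    logits = np.log(probas) / temperature
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return exp / exp.sum()

probas = np.array([0.75, 0.25])  # e.g. C3 with 75%, G3 with 25%
for temperature in (0.5, 1.0, 2.0):
    print(temperature, np.round(apply_temperature(probas, temperature), 3))
```

A temperature of 1 leaves the probabilities unchanged, values above 1 pull them closer together, and values below 1 make the most likely note even more dominant.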

In [None]:
def generate_chorale_v2(model, seed_chords, output_chords, temperature=1):
    arpeggio = tf.constant([[note if note == 0 else note - min_note + 1 for note in chord] for chord in seed_chords], dtype=tf.int64)
    arpeggio = tf.reshape(arpeggio, [1,-1])
    for chord in range(output_chords):
        for note in range(4):
            next_note_probas = model.predict(arpeggio)[0, -1:]
            rescaled_logits = tf.math.log(next_note_probas) / temperature
            next_note = tf.random.categorical(rescaled_logits, num_samples=1)
            arpeggio = tf.concat([arpeggio, next_note], axis=1)
    arpeggio = tf.where(arpeggio == 0, arpeggio, arpeggio + min_note - 1)
    new_chorale = tf.reshape(arpeggio, shape=[-1, 4])
    return new_chorale

Try it out again. This should sound much better.

In [None]:
new_chorale = generate_chorale_v2(model, seed_chords, 56)
play_chords(new_chorale)

# 5. Text generation with large language models and the risks of bias

We will try out the `'text-generation'` pipeline, using GPT-2 (the predecessor of GPT-3; unlike GPT-2, GPT-3 and the later models behind ChatGPT are not publicly available).

In [None]:
generator = pipeline('text-generation', model='gpt2')

With the text-generation functionality, you give a prompt, and the model completes the sentence. Here, we cap each returned sequence at 20 tokens (`max_length=20`, which counts the prompt as well).

**Discussion**: In the example below, what do you notice? What do you think the causes are?

In [None]:
from transformers import set_seed
set_seed(99)
generator("The man worked as a", max_length=20, num_return_sequences=3)

In [None]:
set_seed(99)
generator("The woman worked as a", max_length=20, num_return_sequences=3)