<a href="https://colab.research.google.com/github/KCL-Health-NLP/nlp_examples/blob/master/ann/fine_tuning_with_huggingface_and_keras.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning a Hugging Face model with Keras

**This is the "question" version of this notebook. It differs from the "answer" version in that it is not complete. You are asked to do the training and evaluate your model. This builds on basic Keras use, covered in previous notebooks**

Based on this [Hugging Face tutorial](https://huggingface.co/docs/transformers/training)



## Fine-tuning

In fine-tuning, we take an existing language model that has been pre-trained on some task, and train that model for some other, usually related, task. Pre-training followed by fine-tuning is a kind of transfer learning - learning knowledge from one task, and applying it to another.

Often, pre-training will:
* be unsupervised, e.g. predicting masked words in a corpus of sentences
* include large amounts of training data, to ensure generalisability
* learn a large number of parameters, to ensure a rich representation

We will fine-tune the popular [BERT Base](https://aclanthology.org/N19-1423.pdf) model. The original BERT Base model was trained on 11,000 books and the whole of English Wikipedia, and has 110 million parameters.

BERT Base was trained on two tasks:
* masked language modelling - given a sequence with masked words, predict the missing masked words
* next sentence prediction - given two sentences, predict whether the second followed the first in the original data
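
The masked language task can be illustrated with a few lines of plain Python. This is a minimal sketch, not BERT's actual pre-training code: `mask_tokens` and its parameters are hypothetical names chosen for illustration, it works on whole words rather than WordPiece tokens, and it always substitutes `[MASK]`, whereas BERT masks about 15% of tokens and sometimes uses a random or unchanged token instead.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK].

    Returns the masked sequence, and a dict mapping each masked
    position to the original token the model should predict.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets

tokens = "the quick brown fox jumped over the lazy dog".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)   # model input, with some words hidden
print(targets)  # what the model is asked to recover
```

No labels are needed to build these training pairs, which is why this kind of pre-training can use very large unlabelled corpora.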

It is especially useful in tasks that draw on the context of a whole sequence, e.g. text classification. The practical will fine-tune BERT Base to classify IMDb movie reviews, using the [Keras library](https://keras.io/) and the [Hugging Face transformers library](https://huggingface.co/docs/transformers/index).

The notebook steps through the basics of setting up fine-tuning, and then asks you to complete model training and evaluation, using the classes and methods introduced in previous notebooks on Keras.

## Hugging Face

[Hugging Face](https://huggingface.co/) is a company that provides:

* a very popular machine learning library
* a repository for sharing trained models and datasets

The Hugging Face library is especially noted for its transformer support. It has a very high-level API, with just a few simple lines of code needed to perform many deep learning tasks. By default, it uses the [PyTorch tensor library](https://pytorch.org/), but there is also support for Keras / TensorFlow and for other tensor libraries.

We will load the pre-trained BERT Base model from Hugging Face. We will use a version trained on case-sensitive text (i.e. it includes both capital and lower case characters). It is worth reading the full model description on the [BERT Base cased model card](https://huggingface.co/bert-base-cased) in the Hugging Face repository.

**Hugging Face and Keras:** We will use the Hugging Face transformers library to load the pre-trained model into a specific TensorFlow model class. This means that we can use the same Keras methods and classes as in previous notebooks (e.g. compiling, saving history, evaluating).


## Using with GPUs

This code will run much faster on a GPU. To select a GPU runtime in Colab:

* Select the *Runtime* menu
* Select the *Change runtime type* submenu
* In the dialog that appears, under *Hardware accelerator* select *GPU*
* Your existing runtime will disconnect, and you will be allocated and connected to a new GPU runtime.
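
Once the new runtime has connected, a quick check can confirm that TensorFlow sees the GPU. This is a minimal sketch, assuming TensorFlow 2.x (preinstalled in Colab); `available_gpus` is a hypothetical helper name, and it returns an empty list if TensorFlow is not installed.

```python
def available_gpus():
    """Return the names of the GPUs that TensorFlow can see."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is missing, so no GPU can be reported
        return []
    return [device.name for device in tf.config.list_physical_devices("GPU")]

gpus = available_gpus()
if gpus:
    print("GPU runtime detected:", gpus)
else:
    print("No GPU found - check the runtime type")
```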

## Install packages
We need to install several Hugging Face libraries, which are not provided by default in Colab.

In [None]:
# Install Hugging Face transformers and datasets libraries
!pip install transformers datasets

## ***Restart your runtime***
**You need to restart your runtime in order for the above packages to be made available for imports**

* Menus
  * Runtime
    * Restart runtime

## Imports

In [None]:
import numpy as np

# Hugging Face datasets library has facilities for
# loading datasets direct from the Hugging Face
# datasets repository.
from datasets import load_dataset

# Many pre-trained transformer models have their own
# specific tokenisation
from transformers import AutoTokenizer

# The TensorFlow / Keras version of a Hugging Face
# sequence classifier
from transformers import TFAutoModelForSequenceClassification

# For use in our final Keras model
from tensorflow.keras.optimizers import Adam

# For displaying models
from tensorflow.keras.utils import plot_model
from tensorflow.keras.utils import model_to_dot
import matplotlib.pyplot as plt
from IPython.display import SVG

## Load the data

As in the previous notebooks, we will use the IMDb review dataset. In this, movie reviews are labelled as either positive sentiment (label 1), or negative sentiment (label 0).

A Hugging Face dataset object can hold multiple subsets (e.g. training, testing), and has methods for accessing different portions, splitting, shuffling, etc. We will load the IMDb text dataset from the Hugging Face repository.

Take a look at the object created.

In [None]:
# Load the dataset
ds_imdb = load_dataset("imdb")

# Let's take a look at it
ds_imdb

## Reduce size of dataset to speed up

Transformers can be slow to train, even when fine-tuning for simple tasks. We will therefore fine-tune on a portion of IMDb. Run the code below and make sure you understand:

* the size of this new dataset and its subsets
* the difference between the test subset in this new dataset, and the test subset in the above dataset

In [None]:
# Make a small training dataset from the 'train' part of IMDb.
# Shuffle the dataset using a random seed, and then select the
# first 600 reviews.
ds_train_sm = ds_imdb['train'].shuffle(seed=42).select(range(600))

# Create a train / test split.
ds_train_sm = ds_train_sm.train_test_split(test_size=0.2)

# Let's take a look
ds_train_sm

## Tokenise

The BERT models have been built with their own specific method of tokenisation, WordPiece tokenisation. This means that we have to tokenise our text with exactly the same method.

A WordPiece tokeniser is trained by starting with a vocabulary of tokens consisting of every character in the dataset, and then iteratively merging frequent combinations of these character tokens. The end result is tokenisation that splits words into fragments, or pieces. It is especially useful when dealing with unseen words. Hugging Face has a good [explanation of WordPiece tokenisation](https://huggingface.co/learn/nlp-course/chapter6/6?fw=pt).
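
Once a vocabulary has been learned, splitting a word into pieces is usually described as a greedy longest-match-first search. The sketch below illustrates that idea with a tiny made-up vocabulary. `wordpiece_tokenize` and `toy_vocab` are hypothetical names for illustration, not part of the Hugging Face API. The `##` prefix follows BERT's convention for marking a piece that continues a word.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into pieces by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the ## prefix
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches, so the whole word is unknown
            return [unk]
        pieces.append(piece)
        start = end
    return pieces

# A tiny made-up vocabulary, for illustration only
toy_vocab = {"play", "##ing", "un", "##play", "##able"}

print(wordpiece_tokenize("playing", toy_vocab))     # ['play', '##ing']
print(wordpiece_tokenize("unplayable", toy_vocab))  # ['un', '##play', '##able']
```

A real vocabulary also contains every single character, so in practice most unseen words can be broken into known pieces rather than mapped to `[UNK]`.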

We start by loading the BERT Base cased pretrained tokenisation model from Hugging Face. Hugging Face has a tokenizer class that will detect the type of tokenizer to build from the model being loaded.

In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

Let's try it out on some text.

In [None]:
encoding = tokenizer.encode('the quick brown fox jumped over the lazy dog')
print(encoding)

We had 9 words, but we get 11 integers. Why? Let's convert them back to see what tokens they represent.

In [None]:
print(tokenizer.convert_ids_to_tokens([101, 1103, 3613, 3058, 17594, 4874, 1166, 1103, 16688, 3676, 102]))

Two extra tokens have been added: ```[CLS]``` at the start, whose representation is used to classify the whole sequence, and ```[SEP]``` at the end, which marks the end of this sequence and separates it from a second sequence, if there is one.

Let's look at some others:

In [None]:
print(tokenizer.convert_ids_to_tokens([101, 102, 103, 904, 905, 2006, 2007, 3008, 12807]))

We have a mix of special tokens, single characters that have never been merged, whole words and part words.

What happens if we try to tokenise some made up words?

In [None]:
encoding = tokenizer.encode('elephere')
print(encoding)
encoding = tokenizer.encode('protoshere')
print(encoding)

Can you make out the individual tokens in the integers? Try converting them back to tokens:

In [None]:
print(tokenizer.convert_ids_to_tokens([101, 8468, 8043, 12807, 102]))

Now we will define a simple tokenisation function that takes a dataset of text and labels as input, takes out the text part of it, and returns the tokenised text and the labels.

Remember that we want to pass this to a Keras / TensorFlow model. By default, Hugging Face tokenisers return plain Python lists. We will ask for numpy arrays instead, which Keras accepts as input. Additionally, Keras will expect the tokenised text to be provided as a dictionary of arrays, so we need to do that conversion too.

In [None]:
# Tokenise a dataset
def tokenize(batch):

    # Do the tokenisation.
    # --- Truncate to at most 128 tokens, and pad to the
    # --- longest sequence in the batch.
    # --- Return the tensors as numpy arrays.
    text = tokenizer(batch['text'], return_tensors='np', padding=True, truncation=True, max_length=128)

    # return a tuple of the text array cast to a dictionary
    # and the labels as a numpy array.
    return (dict(text), np.array(batch['label']))

Now we can tokenize our training and validation sets. We will name them as follows:

* ```train_``` - training data
* ```val_``` - validation data
* ```_x``` - input text features
* ```_y``` - labels

We will print out an example. Take a look - can you see what the tokenisation has done, in addition to creating token vectors?

* attention mask - used to mask / select which tokens to consider, e.g. if we have padded with zeros, we want to ignore the zero tokens. 1 marks a real token, 0 marks padding.
* token type ids - used to tell two sequences apart when the input is a pair. 0 marks the first sequence, 1 marks the second. Our reviews are single sequences, so these are all 0.

In [None]:
# Tokenise our datasets and print some out
train_x, train_y = tokenize(ds_train_sm['train'])
val_x, val_y = tokenize(ds_train_sm['test'])
print(train_y)
print('\n'*4)
print(train_x)
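
The padding and attention mask in the output above can be reproduced by hand on a toy batch. This is a minimal sketch using numpy only. `pad_and_mask` is a hypothetical helper, the token ids are made up, and the truncation is simplified: it just cuts each sequence, whereas the real tokenizer keeps the closing `[SEP]` token.

```python
import numpy as np

def pad_and_mask(sequences, max_length, pad_id=0):
    """Truncate to max_length, pad to the longest remaining sequence,
    and build the attention mask (1 = real token, 0 = padding)."""
    truncated = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids = np.full((len(truncated), width), pad_id, dtype=np.int64)
    attention_mask = np.zeros((len(truncated), width), dtype=np.int64)
    for row, seq in enumerate(truncated):
        input_ids[row, :len(seq)] = seq
        attention_mask[row, :len(seq)] = 1
    # A dictionary of arrays, the same shape of input we pass to Keras
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two made-up sequences of token ids, of different lengths
batch = pad_and_mask([[7, 8, 9, 10], [7, 8]], max_length=3)
print(batch["input_ids"])       # [[7 8 9] [7 8 0]]
print(batch["attention_mask"])  # [[1 1 1] [1 1 0]]
```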

## Create the model

We will now create our model. We will use a ```TFAutoModelForSequenceClassification``` which creates a TensorFlow sequence classification model.

Like the tokeniser, this is a [Hugging Face AutoModel](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html) which will load a named pre-trained model from the Hugging Face repository, and detect exactly what class to instantiate. As we are loading a BERT model, it will create a BERT transformer with a sequence classification layer on top.

In [None]:
# Load our model - this is a Keras model
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased")

# Exercise

The rest of the code - compiling and evaluating - is for you to complete, using the methods and classes from previous notebooks, and the Keras documentation. Links to documentation are provided.

## Compile the model

* Use the [Keras compile method](https://keras.io/api/models/model_training_apis/) with these arguments:
  * [Adam optimiser](https://keras.io/api/optimizers/adam/) with a learning rate of 3e-5 (lower learning rates are often better for fine-tuning transformers)
  * the accuracy metric, evaluated during training and validation

## Train the model

Fit the model to the training data, validating against the validation data, over 10 epochs.

In previous notebooks, we provided our data as tuples of text and labels. Here, we have it split up into separate objects. You will need to either combine them, or pass them to the ```fit``` method as separate arguments. Take a look at the [documentation](https://keras.io/api/models/model_training_apis/#fit-method) if you are not sure what to do.

## Display training summary

* Plot the training history
* Can you explain the plots?
* What will your next step be?

## Evaluate

* Evaluate against the IMDb test corpus (not the validation one you split out earlier, which was also referred to as ```test``` in some data objects)
* You might want to shuffle and select a portion of the data to evaluate against.