# Creating a sequence classifier with DistilBERT


As transfer learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), running these large models on the edge or under constrained training or inference budgets remains challenging. DistilBERT, a smaller variant of the trusty BERT model, addresses this. It reduces the size of BERT through a process called distillation: a technique in which a large model, called the teacher, is compressed into a smaller model, called the student.
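To make the teacher/student idea a little more concrete, here is a minimal sketch in plain Python of the soft-target loss that is commonly used in knowledge distillation. The logits and the temperature value are made up for illustration, and this is not DistilBERT's actual training code, which combines several loss terms.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between the softened teacher and student distributions."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

# Hypothetical logits over three classes.
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]    # disagrees with the teacher

print(distillation_loss(teacher, close_student))
print(distillation_loss(teacher, far_student))
```

The student that mimics the teacher's output distribution gets the lower loss, which is the signal that pushes the smaller model toward the larger one's behavior.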

Working on NLP tasks has become much simpler with the advent of Hugging Face's Transformers library. As a platform hosting 10+ Transformer architectures, 🤗 Transformers makes it very easy to use, fine-tune, and compare the models that have transformed deep learning for NLP. It serves as a backend for many downstream apps that leverage transformer models and is used in production by many different companies.

Let's take a look at how to create a simple sequence classifier using DistilBERT and the Hugging Face Transformers library.

**Step 1: Import Libraries**

In [0]:
!pip install transformers

In [0]:
%tensorflow_version 2.x

In [0]:
import tensorflow as tf
import tensorflow_datasets as tfds
import transformers
from transformers import TFDistilBertForSequenceClassification
from transformers import DistilBertTokenizer
from transformers import glue_convert_examples_to_features

**Step 2: Loading DistilBERT**

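Load the pretrained DistilBERT weights for the sequence classifier, along with the matching tokenizer. The classification head on top of DistilBERT is newly initialized, which is why the model needs fine-tuning before its predictions are useful.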

In [0]:
model = TFDistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

**Step 3: See Tokenization**


Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.

Here is an example of tokenization: 

***Lions, tigers, and bulldogs are very scary!***

The tokens for this sentence would be:
*   Lions
*   tigers
*   and
*   bulldogs
*   are
*   very
*   scary
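
The split above can be reproduced with a few lines of plain Python. This is only a sketch of the general idea, using a regular expression that keeps runs of word characters and drops punctuation; `simple_tokenize` is a name made up for this example.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens, discarding punctuation."""
    return re.findall(r"\w+", text)

sentence = "Lions, tigers, and bulldogs are very scary!"
tokens = simple_tokenize(sentence)
print(tokens)
# ['Lions', 'tigers', 'and', 'bulldogs', 'are', 'very', 'scary']
```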




Similarly, we will use the DistilBertTokenizer to tokenize our input sequence.

In [0]:
sequence = "Hello! Welcome to the wondrous world of Natural Language Processing."
tokenized_sequence = tokenizer.tokenize(sequence)
print(tokenized_sequence)
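
DistilBERT's tokenizer goes one step further than splitting on words: it uses WordPiece, which breaks a word that is missing from its vocabulary into smaller known pieces. Below is a rough sketch of that greedy longest-match-first idea in plain Python. The function name and the toy vocabulary are made up for illustration; the real tokenizer has a much larger vocabulary, so its actual splits will differ.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, far smaller than the real one.
toy_wordpiece_vocab = {"hello", "world", "won", "##dr", "##ous", "##s"}

print(wordpiece_tokenize("wondrous", toy_wordpiece_vocab))  # ['won', '##dr', '##ous']
print(wordpiece_tokenize("worlds", toy_wordpiece_vocab))    # ['world', '##s']
```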

**Step 4: Loading the dataset**

We will use the Microsoft Research Paraphrase Corpus (MRPC), a sequence classification dataset in which each example is a pair of sentences labeled by whether they are paraphrases of each other. We get the train and validation splits from the `tensorflow_datasets` package. Each split is a `tf.data.Dataset`, which plugs directly into the conversion and training steps that follow.


In [0]:
data = tfds.load("glue/mrpc")
train_dataset = data["train"]
validation_dataset = data["validation"]

Let's take a look at a sample from the dataset.

In [0]:
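# Skip the first 42 examples and print the next one,
# without loading the whole split into memory.
for example in train_dataset.skip(42).take(1):
    print(example)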

**Step 5: Preparing Dataset**

First, we have to convert the dataset examples into features that DistilBERT can consume. Then we shuffle, batch, and repeat the training split, and batch the validation split.

In [0]:
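# Use the DistilBERT tokenizer loaded in Step 2; pad/truncate each pair to 128 tokens.
train_dataset = glue_convert_examples_to_features(train_dataset, tokenizer, max_length=128, task='mrpc')
# Repeat indefinitely so fit() never runs out of batches;
# training length is controlled by epochs * steps_per_epoch.
train_dataset = train_dataset.shuffle(100).batch(32).repeat()

In [0]:
validation_dataset = glue_convert_examples_to_features(validation_dataset, tokenizer, max_length=128, task='mrpc')
validation_dataset = validation_dataset.batch(64)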
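
To get a feel for what "features" means here, the sketch below shows roughly how a sentence pair becomes fixed-length model input: the two token lists are joined with special tokens, mapped to integer IDs, and padded, with an attention mask marking the real tokens. The `encode_pair` helper and its toy vocabulary are hypothetical, and the real conversion function handles truncation and other details differently.

```python
def encode_pair(tokens_a, tokens_b, vocab, max_length=16):
    """Build padded input IDs and an attention mask for a sentence pair."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    tokens = tokens[:max_length]  # naive truncation, for illustration only
    input_ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    attention_mask = [1] * len(input_ids)  # 1 marks a real token
    padding = max_length - len(input_ids)
    input_ids += [vocab["[PAD]"]] * padding
    attention_mask += [0] * padding        # 0 marks padding
    return input_ids, attention_mask

# Hypothetical toy vocabulary; real IDs come from the pretrained tokenizer.
toy_id_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
                "the": 4, "cat": 5, "sat": 6, "a": 7, "rested": 8}

ids, mask = encode_pair(["the", "cat", "sat"], ["a", "cat", "rested"],
                        toy_id_vocab, max_length=12)
print(ids)   # [2, 4, 5, 6, 3, 7, 5, 8, 3, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```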

**Step 6: Preparing the model for fine-tuning**

Before fine-tuning the model, we must choose a few training settings: the optimizer, the loss, and the evaluation metric.
We'll use the Adam optimizer, which was also used to pre-train these models. For the loss we'll use sparse categorical cross-entropy, with `from_logits=True` because the model outputs raw logits rather than probabilities, and for the evaluation metric we'll use sparse categorical accuracy.

Note that the learning rate is deliberately small (1e-5) so that fine-tuning does not overwrite what the model has already learned during pre-training.

In [0]:
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5, epsilon=1e-08)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
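
If the loss and metric names feel opaque, the plain-Python sketch below shows what they compute for integer labels and raw logits. The function names mirror the Keras ones only for readability, and the labels and logits are made up; this illustrates the arithmetic and is not the Keras implementation.

```python
import math

def sparse_categorical_crossentropy(label, logits):
    """Loss for one example: negative log-softmax of the true class."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]

def sparse_categorical_accuracy(labels, batch_logits):
    """Fraction of examples whose highest logit matches the integer label."""
    hits = sum(1 for y, logits in zip(labels, batch_logits)
               if logits.index(max(logits)) == y)
    return hits / len(labels)

# Hypothetical labels and logits for a batch of three examples.
example_labels = [1, 0, 1]
example_logits = [[-1.0, 2.0], [0.5, 0.3], [1.5, -0.5]]

example_losses = [sparse_categorical_crossentropy(y, z)
                  for y, z in zip(example_labels, example_logits)]
print(sum(example_losses) / len(example_losses))  # mean loss over the batch
print(sparse_categorical_accuracy(example_labels, example_logits))
```

"Sparse" here only means that the labels are plain integers such as `0` and `1` rather than one-hot vectors.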

**Step 7: Fine-tune away!**



Using the `fit()` method from `tf.keras`, we can fine-tune the model in a single call.

In [0]:
print("Fine-tuning DistilBERT on MRPC")
history = model.fit(train_dataset, epochs=3, steps_per_epoch=100, validation_data=validation_dataset)
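
When `steps_per_epoch` is set, an "epoch" in Keras means that many batches rather than one full pass over the data. The small sketch below works out how the two relate for this setup, assuming the MRPC training split has 3,668 examples; the helper names are made up for this example.

```python
import math

def batches_per_pass(num_examples, batch_size):
    """Number of batches in one full pass; the last batch may be partial."""
    return math.ceil(num_examples / batch_size)

def passes_needed(epochs, steps_per_epoch, num_examples, batch_size):
    """How many full passes over the data the requested steps consume."""
    total_steps = epochs * steps_per_epoch
    return total_steps / batches_per_pass(num_examples, batch_size)

# Assumption: 3,668 training examples in MRPC; batch size 32 as in Step 5.
print(batches_per_pass(3668, 32))  # 115
print(passes_needed(3, 100, 3668, 32))
```

So 3 epochs of 100 steps consume 300 batches, or about 2.6 passes over the training data, and the input pipeline has to be able to supply at least that many batches.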

In [0]:
# Report the loss and accuracy on the validation split.
model.evaluate(validation_dataset)