<img align="right" width="450px" src="https://github.com/digitalepidemiologylab/covid-twitter-bert/raw/master/images/COVID-Twitter-BERT-medium.png">

# Finetuning COVID-Twitter-BERT using Huggingface
In this notebook we will finetune CT-BERT for sentiment classification using the transformers library by Huggingface.

Learn more about this library [here](https://huggingface.co/transformers/).

## Before proceeding
Create a copy of this notebook by going to "File - Save a Copy in Drive".


# Install transformers and import libraries

In [14]:
!pip install transformers



In [15]:
from transformers import (
   AutoConfig,
   AutoTokenizer,
   TFAutoModelForSequenceClassification,
   glue_convert_examples_to_features
)
import tensorflow as tf
import tensorflow_datasets as tfds
import json

# Choose a Model from the Huggingface Library

In [16]:
# Choose model
# @markdown >The default model is <i><b>COVID-Twitter-BERT</b></i>. You can, however, choose <i><b>BERT Base</b></i> or <i><b>BERT Large</b></i> to compare these models with <i><b>COVID-Twitter-BERT</b></i>. All three models are initialised with a random classification layer. If you go directly to the Predict cell after compiling the model, the prediction still runs, but the output is random. The training steps below finetune the model for the specific task. <br /><br />
model_name = 'digitalepidemiologylab/covid-twitter-bert' #@param ["digitalepidemiologylab/covid-twitter-bert", "bert-large-uncased", "bert-base-uncased"]

# Initialise tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
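
The data preparation below calls the tokenizer with `padding='max_length'` and `truncation=True`, so every sentence becomes an `input_ids` list of the same length plus an `attention_mask`. The following is a toy sketch of that behaviour in plain Python, not the tokenizer's actual implementation. The helper `pad_or_truncate` and the word-piece IDs are hypothetical; the special-token IDs are those of the standard BERT uncased vocabulary, which is assumed to apply to CT-BERT as well.

```python
# Toy illustration only: the word-piece IDs passed in below are made up.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # [CLS], [SEP], [PAD] in the BERT uncased vocabulary


def pad_or_truncate(token_ids, max_length):
    """Mimic padding='max_length' with truncation=True for a single sentence."""
    # Reserve two positions for [CLS] and [SEP], then cut the rest if needed.
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    # Fill the remaining positions with [PAD]; the mask marks them as ignorable.
    n_pad = max_length - len(input_ids)
    return input_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad


short_ids, short_mask = pad_or_truncate([3407, 2210, 8044], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(1000, 1020)), max_length=8)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Short sentences are padded up to `max_length` and long ones are cut, which is why a smaller `max_seq_length` saves memory on a dataset of mostly short sentences.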

# Download the SST-2 Dataset and Prepare for Finetuning
You can skip this step if you are using an already finetuned model.

In [17]:
# Parameters
#@markdown >Batch size and sequence length need to be set to prepare the data. The batch size depends on available memory. For a Colab GPU, limit the batch size to 8 and the sequence length to 96. Reducing the input length (max_seq_length) also lets you increase the batch size. For a dataset like SST-2 with lots of short sentences, this will likely benefit training.
max_seq_length = 96 #@param {type: "integer"}
train_batch_size = 8 #@param {type: "integer"}
eval_batch_size = 8 #@param {type: "integer"}


#@markdown >The GLUE SST-2 dataset has around 68000 examples, and we do not need them all to train a decent model. To cut down training time, reduce this to a percentage of the entire set.
use_percentage_of_data = 5 #@param {type: "slider", min: 1, max: 100}

# get dataset sizes
glue_builder = tfds.builder('glue/sst2')
num_train_examples = glue_builder.info.splits['train'].num_examples
num_dev_examples = glue_builder.info.splits['validation'].num_examples
num_labels = glue_builder.info.features['label'].num_classes

# download datasets and convert to training features
glue_builder.download_and_prepare()
train_data = glue_builder.as_dataset(split='train')
train_dataset = glue_convert_examples_to_features(train_data, tokenizer, max_length=max_seq_length, task='sst-2')
train_dataset = train_dataset.shuffle(100).batch(train_batch_size)

dev_data = glue_builder.as_dataset(split='validation')
dev_dataset = glue_convert_examples_to_features(dev_data, tokenizer, max_length=max_seq_length, task='sst-2')
dev_dataset = dev_dataset.shuffle(100).batch(eval_batch_size)

# Map the labels for printing
label_mapping = {i: glue_builder.info.features['label'].int2str(i) for i in range(num_labels)}

print(f'\n\nThe dataset is downloaded. The entire dataset has {num_train_examples + num_dev_examples} examples of which you are using {use_percentage_of_data}%. This will result in a train dataset with {int(num_train_examples * (use_percentage_of_data/100))} examples and a validation dataset with {int(num_dev_examples * (use_percentage_of_data/100))} examples.')

TypeError: TextInputSequence must be str

In [18]:
# Parameters
#@markdown >Batch size and sequence length need to be set to prepare the data. The batch size depends on available memory. For a Colab GPU, limit the batch size to 8 and the sequence length to 96. Reducing the input length (max_seq_length) also lets you increase the batch size. For a dataset like SST-2 with lots of short sentences, this will likely benefit training.
max_seq_length = 96 #@param {type: "integer"}
train_batch_size = 8 #@param {type: "integer"}
eval_batch_size = 8 #@param {type: "integer"}

#@markdown >The GLUE SST-2 dataset has around 68000 examples, and we do not need them all to train a decent model. To cut down training time, reduce this to a percentage of the entire set.
use_percentage_of_data = 5 #@param {type: "slider", min: 1, max: 100}

# get dataset sizes
glue_builder = tfds.builder('glue/sst2')  # Note: changed to sst2 here
num_train_examples = glue_builder.info.splits['train'].num_examples
num_dev_examples = glue_builder.info.splits['validation'].num_examples
num_labels = glue_builder.info.features['label'].num_classes

# download datasets
glue_builder.download_and_prepare()
train_data = glue_builder.as_dataset(split='train')
dev_data = glue_builder.as_dataset(split='validation')

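# Convert a tfds dataset into tokenized features
def convert_dataset_to_features(dataset, tokenizer, max_length, task, num_examples=None):
    texts = []
    labels = []

    # Extract the texts and labels from the dataset (take(-1) keeps every example)
    for example in dataset.take(num_examples if num_examples is not None else -1):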
        text = example['sentence'].numpy().decode('utf-8')
        label = example['label'].numpy()
        texts.append(text)
        labels.append(label)

    # Tokenize the texts
    encoded = tokenizer(
        texts,
        padding='max_length',
        truncation=True,
        max_length=max_length,
        return_tensors='tf'
    )

    # Build the feature dataset
    dataset = tf.data.Dataset.from_tensor_slices({
        'input_ids': encoded['input_ids'],
        'attention_mask': encoded['attention_mask'],
        'labels': tf.convert_to_tensor(labels, dtype=tf.int32)
    })

    return dataset

# Compute how many examples to use
num_train_to_use = int(num_train_examples * (use_percentage_of_data/100))
num_dev_to_use = int(num_dev_examples * (use_percentage_of_data/100))

# Convert the datasets to features
train_dataset = convert_dataset_to_features(
    train_data, tokenizer, max_length=max_seq_length, task='sst-2', num_examples=num_train_to_use
)
train_dataset = train_dataset.shuffle(100).batch(train_batch_size)

dev_dataset = convert_dataset_to_features(
    dev_data, tokenizer, max_length=max_seq_length, task='sst-2', num_examples=num_dev_to_use
)
dev_dataset = dev_dataset.batch(eval_batch_size)

# Map the labels for printing
label_mapping = {i: glue_builder.info.features['label'].int2str(i) for i in range(num_labels)}

print(f'\n\nThe dataset is downloaded. The entire dataset has {num_train_examples + num_dev_examples} examples of which you are using {use_percentage_of_data}%. This will result in a train dataset with {num_train_to_use} examples and a validation dataset with {num_dev_to_use} examples.')



The dataset is downloaded. The entire dataset has 68221 examples of which you are using 5%. This will result in a train dataset with 3367 examples and a validation dataset with 43 examples.
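
The counts printed above follow from simple integer arithmetic on the split sizes. The sketch below reproduces them without downloading anything. It assumes the standard GLUE SST-2 split sizes of 67349 training and 872 validation examples, whose sum matches the 68221 total reported above; the helper `subset_sizes` is a hypothetical name used only for illustration.

```python
# Split sizes of GLUE SST-2 (assumed here; their sum matches the printed total of 68221).
NUM_TRAIN_EXAMPLES = 67349
NUM_DEV_EXAMPLES = 872


def subset_sizes(use_percentage_of_data, train_batch_size=8, eval_batch_size=8):
    """Return example counts and full batches per epoch for a given percentage."""
    fraction = use_percentage_of_data / 100
    num_train = int(NUM_TRAIN_EXAMPLES * fraction)
    num_dev = int(NUM_DEV_EXAMPLES * fraction)
    return {
        "train_examples": num_train,
        "dev_examples": num_dev,
        # Integer division drops the last, incomplete batch.
        "train_steps": num_train // train_batch_size,
        "dev_steps": num_dev // eval_batch_size,
    }


sizes = subset_sizes(5)
print(sizes)
```

With 5% of the data and a batch size of 8, this gives 420 training steps and 5 validation steps per epoch, so 3360 training and 40 validation examples are actually seen.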


# Compile the Model, Train it on the SST-2 Task and Save the Result
You can skip this step if you are using an already finetuned model.

In [19]:
#@markdown >The default learning rate of 2e-5 will be fine in most cases
learning_rate = 2e-5 #@param {type: "number"}

#@markdown > Typically this type of model is finetuned for 3 epochs. This can be increased for small datasets and decreased for large datasets.
num_epochs = 1  #@param {type: "integer"}

# Initialise a Model for Sequence Classification with 2 labels
config = AutoConfig.from_pretrained(model_name, num_labels=num_labels)
model = TFAutoModelForSequenceClassification.from_pretrained(model_name, config=config)

# Optimizer and loss
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Metrics and callbacks
metrics = [tf.keras.metrics.SparseCategoricalAccuracy('accuracy', dtype=tf.float32)]
checkpoint_path = './checkpoints/checkpoint.{epoch:02d}'
callbacks = [tf.keras.callbacks.ModelCheckpoint(checkpoint_path, save_weights_only=True)]

# Compute some variables
train_steps_per_epoch = int(num_train_examples * (use_percentage_of_data/100) / train_batch_size)
dev_steps_per_epoch = int(num_dev_examples * (use_percentage_of_data/100) / eval_batch_size)


# Compile model
model.compile(optimizer=optimizer, loss=loss, metrics=metrics)

# Train the model
history = model.fit(train_dataset,
  epochs=num_epochs,
  steps_per_epoch=train_steps_per_epoch,
  validation_data=dev_dataset,
  validation_steps=dev_steps_per_epoch,
  callbacks=callbacks)

# Print some information about the training
print(f'\nThe training has finished training after {num_epochs} epochs.')
print('\nThe history contains the accuracy and loss at every epoch:')
print(json.dumps(history.history, indent=4))

print('\nThe checkpoint callback has generated a checkpoint after every epoch (loss being the training loss, val_loss is the validation loss):')
!ls -lha ./checkpoints/

print('\nWe will now save the finetuned model and the corresponding config file on your Colab disk.')
model.save_pretrained('./huggingface_model/')

print('\nTensorflow model and config-file is saved in ./huggingface_model/')
!ls -lha ./huggingface_model/


All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at digitalepidemiologylab/covid-twitter-bert and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



The training has finished training after 1 epochs.

The history contains the accuracy and loss at every epoch:
{
    "loss": [
        0.3424808979034424
    ],
    "accuracy": [
        0.8494047522544861
    ],
    "val_loss": [
        0.24701440334320068
    ],
    "val_accuracy": [
        0.824999988079071
    ]
}

The checkpoint callback has generated a checkpoint after every epoch (loss being the training loss, val_loss is the validation loss):
total 3.8G
drwxr-xr-x 2 root root 4.0K May 12 14:37 .
drwxr-xr-x 1 root root 4.0K May 12 14:36 ..
-rw-r--r-- 1 root root   83 May 12 14:37 checkpoint
-rw-r--r-- 1 root root 3.8G May 12 14:37 checkpoint.01.data-00000-of-00001
-rw-r--r-- 1 root root  73K May 12 14:37 checkpoint.01.index

We will now save the finetuned model and the corresponding config file on your Colab disk.

Tensorflow model and config-file is saved in ./huggingface_model/
total 1.3G
drwxr-xr-x 2 root root 4.0K May 12 14:37 .
drwxr-xr-x 1 root root 4.0K May 12 14:37 ..
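
The file names in the checkpoint listing come from the `{epoch:02d}` placeholder in `checkpoint_path`. The sketch below expands the same template with plain `str.format`, which is assumed to mirror how the Keras `ModelCheckpoint` callback fills in a 1-based epoch number; the `names` list is only there for illustration.

```python
checkpoint_path = './checkpoints/checkpoint.{epoch:02d}'

# Expand the template for a few epoch numbers; 02d zero-pads to two digits.
names = [checkpoint_path.format(epoch=epoch) for epoch in (1, 2, 10)]
for name in names:
    print(name)
```

After one epoch only `checkpoint.01` exists, which matches the `checkpoint.01.index` and `checkpoint.01.data-00000-of-00001` files listed above.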

# Predict
Let's run some inference with the trained model.

In [20]:
# Small function only used for formatting the output
def format_prediction(preds, label_mapping, label_name):
    preds = tf.nn.softmax(preds, axis=1)
    formatted_preds = []
    for pred in preds.numpy():
        # convert to Python types and sort
        pred = {label: float(probability) for label, probability in zip(label_mapping.values(), pred)}
        pred = {k: v for k, v in sorted(pred.items(), key=lambda item: item[1], reverse=True)}
        formatted_preds.append({label_name: list(pred.keys())[0], f'{label_name}_probabilities': pred})
    return formatted_preds

In [21]:
#@markdown >Please input text that the model can try to classify
input_text = 'Happy little clouds'  #@param {type: "string"}

# Tokenize the input
input_ids = tf.constant(tokenizer.encode(input_text, add_special_tokens=True))[None, :]

# Run predictions
preds = model(input_ids)

# format logits
formatted_preds = format_prediction(preds[0], label_mapping, 'sentiment')

print(f'\nLabel Mapping:{json.dumps(label_mapping, indent=4)}')
print(f'\nLogits: {preds}')
print(f'\nProbabilities:{json.dumps(formatted_preds, indent=4)}')


Label Mapping:{
    "0": "negative",
    "1": "positive"
}

Logits: TFSequenceClassifierOutput(loss=None, logits=<tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[-1.4579324,  0.8760008]], dtype=float32)>, hidden_states=None, attentions=None)

Probabilities:[
    {
        "sentiment": "positive",
        "sentiment_probabilities": {
            "positive": 0.9116486310958862,
            "negative": 0.08835135400295258
        }
    }
]
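
The probabilities above are the softmax of the two logits. As a quick check, the sketch below recomputes them in plain Python using the logit values from the output; the `softmax` helper is a hypothetical stand-in for `tf.nn.softmax`, and the label order is taken from the label mapping.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-1.4579324, 0.8760008]   # values copied from the output above
labels = ["negative", "positive"]  # index 0 and 1 in the label mapping
probabilities = dict(zip(labels, softmax(logits)))
print(probabilities)
```

The result agrees with the reported values of roughly 0.912 for positive and 0.088 for negative.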


##### Copyright 2020 Per Egil Kummervold and Martin Müller