# Disaster tweets with RoBERTa

In this notebook we tackle the [disaster tweets](https://www.kaggle.com/c/nlp-getting-started) Kaggle competition using 🤗 Huggingface's transformers. Given a tweet, the task is to predict whether it is about a disaster or not. We will use an implementation of RoBERTa, a language model based on the transformer architecture, to solve this task.

The challenge is to predict whether a tweet refers to an occurring disaster or to something else. Language is full of figurative expressions, so it is not straightforward to come up with a rule that classifies text into one category or the other and works every time. Even for a human, a message can be difficult to interpret without the appropriate context, which can lead to misunderstandings that are sometimes funny.

As this is a binary classification task, in principle, any type of classifier can be used, such as logistic regression, SVM, random forest and feed-forward neural networks. These methods typically make use of the bag of words (BOW) approach to create numerical features, where the order of the words in the text and their relations are ignored. However, language is a sequential phenomenon and words in a sentence have complex relations between them. More sophisticated language models must be used to capture these relations and extract meaningful information from textual data. The present analysis makes use of RoBERTa, a type of transformer language model put forward by [Facebook AI](https://ai.facebook.com/blog/roberta-an-optimized-method-for-pretraining-self-supervised-nlp-systems/). RoBERTa is a version of BERT which has been trained on a larger corpus for a longer time to achieve better performance in NLU (natural language understanding) tasks. BERT, in turn, is a transformer model originally proposed by [Google Research](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html). A short technical introduction is in order now before we start the analysis.
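
To make the bag-of-words limitation mentioned above concrete, here is a minimal sketch using only the standard library. The helper `bag_of_words` and the two example sentences are made up for illustration. Both sentences produce exactly the same features, because only the word counts are kept.

```python
from collections import Counter

def bag_of_words(text):
    """Count how often each lowercase, whitespace-separated token occurs."""
    return Counter(text.lower().split())

# Two sentences with the same words in a different order
a = bag_of_words("the fire destroyed the forest")
b = bag_of_words("the forest destroyed the fire")

print(a)
print(a == b)  # True: word order is lost in this representation
```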

A [transformer](https://arxiv.org/abs/1706.03762) is a type of artificial neural network that consists of an encoder and a decoder, which are processing blocks composed of 'attention' layers. In this context, 'attention' can be thought of as a mechanism to relate inputs and outputs through time. The encoder constructs a high-dimensional numerical representation of textual data. In this form, documents are converted into numeric tensors. The decoder produces an output which depends on both the information from the encoder and on all the previous outputs of the decoder. A well-known model built on this architecture is [BERT](https://arxiv.org/abs/1810.04805v2) (Bidirectional Encoder Representations from Transformers), which uses only the encoder part. It is a highly complex model composed of a stack of bidirectional transformer encoder layers and trained on the BooksCorpus (800M words) and English Wikipedia (2,500M words). BERT can be used for many NLU (natural language understanding) tasks, including document classification. RoBERTa is a version of BERT that has been trained on a slightly modified task and with a larger corpus, including news articles, outperforming BERT on all GLUE tasks.
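
The attention mechanism described above can be sketched in a few lines of NumPy. This is a simplified illustration of scaled dot-product attention, softmax(QKᵀ/√d)V, and not the code used inside RoBERTa. As a simplifying assumption, it uses the token vectors directly as queries, keys and values, whereas a real transformer layer first applies learned projections and uses several attention heads.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """Return the attention output and the attention weights."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)      # similarity of every query with every key
    weights = softmax(scores, axis=-1)   # each row is a distribution over tokens
    return weights @ v, weights

# Toy input: 4 tokens represented by random 8-dimensional vectors
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
output, weights = scaled_dot_product_attention(x, x, x)

print(output.shape)          # (4, 8): one new vector per token
print(weights.sum(axis=1))   # every row of weights sums to 1
```

Each output vector is a weighted average of all token vectors, which is how every token can take the rest of the sequence into account.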

A publicly available implementation of this model (and many more) is offered by a popular Python library called [transformers](https://huggingface.co/transformers/) created by 🤗 [Huggingface](https://huggingface.co/). The models are available both as PyTorch and Tensorflow models, and checkpoints give easy access to trained models that are ready for use. This notebook will show step by step how to use this library to solve a classification task, with relevant explanations along the way.

## Contents

1. Loading the data
2. Loading the model and tokenizer
3. Processing the data
4. Fine-tuning RoBERTa
5. Inference
6. Summary

## 1. Loading the data

This Kaggle kernel includes the training and testing data in the folder '../input/nlp-getting-started/'. The files are given in CSV format so they can easily be loaded using pandas. The files contain several fields but we will only be interested in 'text' and 'target', containing the tweets and annotated classification, respectively. We place the labels in a numpy array and cast them as floats, as this is the data type we will later use when we place these labels into a tensor.

In [None]:
import random
import pandas as pd
import numpy as np

# Load training and testing data
df = pd.read_csv('../input/nlp-getting-started/train.csv',index_col=0)
df_test = pd.read_csv('../input/nlp-getting-started/test.csv',index_col=0)
# Extract 'text' and 'target' information from dataframe and shuffle the data
temp = [(x,y) for x,y in zip(list(df['text']),list(df['target']))]
random.shuffle(temp)
tweets = [t[0] for t in temp]
y = [t[1] for t in temp]
# Cast the target labels as a float in preparation for passing it to Tensorflow
y = np.array(y).astype('float32')

We can investigate the distribution of labels. There are slightly fewer positive than negative examples; however, this does not represent a significant imbalance, so no further action is required.

In [None]:
print('Observations in training set')
print(df['target'].count())
print()
print('Label proportion in training set')
print(df['target'].value_counts(normalize=True))
print()
print('Observations in test set')
print(df_test['text'].count())

## 2. Loading the model and tokenizer

The 'transformers' library contains many architectures useful for NLU. What makes this library particularly useful is that model checkpoints are available for a wide variety of models. This means that we don't have to train a new model from scratch but can instead load a pre-trained model and fine-tune it for whatever task we want. This is known as transfer learning and allows users to reuse previous knowledge, which is a more efficient way of advancing research. Within RoBERTa, different implementations are available, including a model returning the last hidden states of RoBERTa as-is, and one with a classification head stacked on top which is useful for sentiment analysis and document classification. This notebook will make use of the version that is already prepared for classification as a Tensorflow model. For an analysis that makes use of a transformer model as-is and stacks a custom-made classification head on top, see [this approach (BERT)](https://www.kaggle.com/dhruv1234/huggingface-tfbertmodel).

We also need to load the tokenizer that was used to originally train the model. The tokenizer includes the rules employed to tokenise text, the vocabulary and the dictionary mapping tokens to numerical indices. It is important we use exactly this tokenizer, as the model contains token representations that are identified by the token indices given by this tokenizer.

In [None]:
import tensorflow as tf
from transformers import RobertaTokenizerFast, TFRobertaForSequenceClassification

BERT is available in two versions, BERT-base and BERT-large. The former has ~110M parameters and is a reduced version of the latter, which has ~340M parameters. Similarly, RoBERTa is also available in both versions. For the purpose of this competition, we use the large version. However, the difference in performance is relatively small, and the base version can still yield very good results. Moreover, one could also consider DistilBERT, [a compact version of BERT](https://arxiv.org/abs/1910.01108) with 40% fewer parameters than BERT-base. The difference in score in this competition when using DistilBERT-base compared to BERT-large was about 0.02 points; however, training and inference were much faster. In real-life scenarios, it is important to consider the trade-off between performance and speed, and choose the appropriate model according to the requirements of the task.

In the following cell, we instantiate the model and use 'from_pretrained' to specify that we want to load weights from an existing checkpoint.

In [None]:
model_name = 'roberta-large'
roberta_tokenizer = RobertaTokenizerFast.from_pretrained(model_name)
roberta_seq = TFRobertaForSequenceClassification.from_pretrained(model_name)

This instance contains the RoBERTa-large model with a classifier on top.

In [None]:
roberta_seq.summary()

Once we have prepared the data sets in the next section, we will verify that the classifier consists of two units with no activation function, giving as output their bare activation values, also known as logits.

## 3. Processing the data

When dealing with text data, some pre-processing may be necessary as data may not be of the same type that the model was trained on. Here we briefly consider only a few pre-processing steps.

First, we know that tweets may contain tags (#tag) and mentions (@name). We will shortly see that the tokenizer can separate the special characters and read the tags and names as words. It's not clear whether tags and names consisting of multiple words and special characters without spaces would represent additional difficulties for the model. An option could be to remove these altogether. However, in this exercise, we keep tags and mentions as they appear.

Next, note that the datasets do not contain unicode codes, which could stand for emojis, for example. This means this data already underwent some type of pre-processing before being published, as tweets often feature emojis in some form. In any case, we won't have to worry about them in this example.

In [None]:
# Contains unicode codes (e.g. emojis)?
for t in tweets:
    if 'U+' in t:
        print(t)

However, the tweets contain some HTML character entities, which are text representations of special characters for HTML. For example, "&gt;" is to be interpreted as ">" (greater than). We can easily verify that this only occurs for 3 types of characters, "&amp;", "&gt;" and "&lt;", corresponding to "&", ">" and "<", respectively.

In [None]:
import re

# Contains HTML character entities?
for t in tweets:
    if '&' in re.sub(r'(&amp|&gt|&lt)','',t):
        print(t)

Additionally, tweets may contain links, which are rendered in a standard format as "http(s)://t.co/xxx" where the (s) is optional and "xxx" stands for an alphanumeric string. Since we know that RoBERTa was trained on books, Wikipedia articles and news articles, which do not feature links of this form, we could opt for removing these URLs.

Other processing measures can also be considered. For instance, people often use slang, abbreviations and alternative spellings in their tweets, which are unlikely to appear in the data set that RoBERTa was trained on. Consider the common abbreviations in the following cell:

In [None]:
for t in tweets:
    if any([x in t for x in [' btw ',' omg ',' lol ',' thx ']]):
        print(t)

Although we could replace these expressions with their word equivalents (e.g. "tbh": "to be honest"), we can immediately see that there are only a handful of examples that contain these irregularities. Given that our training and testing files consist of thousands of examples, replacing these expressions will not have a large impact. In fact, these may be considered as adding some noise, which may help to prevent overfitting.

Note that it is not immediately clear that we would benefit from more intrusive transformations, such as removing punctuation, numbers, undoing contractions or adding special tokens, because RoBERTa has been trained on text that contains all of these elements. Thus, the model should already be able to capture these basic elements of language, as it has seen them before. Misspellings are different, of course; however, we make no effort to fix them in this approach. You can consider running the tweets through a spelling checker and comparing the results.

In summary, we only perform two transformations on the data set: removing URLs and converting HTML character entities to their intended representation. If you prefer not to apply these transformations, simply comment out the following cell.

In [None]:
import re

def process_tweets(tweets):
    r = tweets
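    # Remove t.co links (the dot is escaped so that it only matches a literal '.')
    r = [re.sub(r'https?://t\.co/\w+','',t) for t in r]
    # Convert HTML character entities to the characters they stand for
    r = [re.sub('&amp;','&',t) for t in r]
    r = [re.sub('&gt;','>',t) for t in r]
    r = [re.sub('&lt;','<',t) for t in r]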
    return r

tweets = process_tweets(tweets)
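
As an alternative to listing each entity by hand, the standard library's `html.unescape` decodes HTML character entities in one call. The sketch below is only an illustration of that option: the helper `clean_tweet` and the sample text are made up and are not part of the original notebook.

```python
import html
import re

def clean_tweet(text):
    """Hypothetical helper: strip t.co links, then decode HTML entities."""
    text = re.sub(r'https?://t\.co/\w+', '', text)
    return html.unescape(text)

# Made-up example containing two entities and a shortened link
sample = 'Rock &amp; roll &gt; disco http://t.co/abc123'
print(repr(clean_tweet(sample)))  # 'Rock & roll > disco '
```

Note that removing a link leaves the surrounding whitespace in place, which is harmless for the tokenizer.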

Before we tokenise the text, we should try to understand what the tokenizer does. As an experiment, we can call the tokenizer on the first 5 tweets in the data set. For illustrative purposes, we arbitrarily add padding to obtain sequences of 50 tokens. The tokenizer returns an object which contains a dictionary with two elements: 'input_ids' and 'attention_mask'.

In [None]:
temp = roberta_tokenizer(tweets[:5],padding='max_length',max_length=50)
temp.keys()

'input_ids' are the indices assigned to each token. Decoding the 'input_ids' recovers the original tweet plus some special tokens that the tokenizer has introduced for the model. These special tokens include a start of sequence token "< s >" at the start of the document, an end of sequence token "< /s >" at the end of a sequence and a padding token "< pad >" to fill a sequence to the maximum specified length.

In [None]:
print('Original tweet:')
print(tweets[0])
print('Encoded tweet:')
print(temp['input_ids'][0])
print('Decoded tweet:')
print(roberta_tokenizer.decode(temp['input_ids'][0]))

The 'attention_mask' indicates whether a token in the encoded sequence corresponds to the "< pad >" token or not. Its function is to let the model know that these padding tokens are effectively blank spaces, so there is no need to pay attention to them.

In [None]:
print(temp['attention_mask'][0])
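
The relation between padding and the attention mask can be reproduced with a small pure-Python sketch. The function `pad_with_mask` and the token ids below are made up for illustration and do not use the real tokenizer. The idea is simply that real tokens get a 1 in the mask and padding positions get a 0.

```python
def pad_with_mask(token_ids, max_length, pad_id):
    """Right-pad a list of token ids and build the matching attention mask."""
    n_pad = max_length - len(token_ids)
    if n_pad < 0:
        raise ValueError('sequence is longer than max_length')
    input_ids = token_ids + [pad_id] * n_pad
    attention_mask = [1] * len(token_ids) + [0] * n_pad
    return input_ids, attention_mask

# Illustrative ids: 0 = start token, 2 = end token, 1 = padding token
ids, mask = pad_with_mask([0, 713, 16, 2], max_length=8, pad_id=1)
print(ids)   # [0, 713, 16, 2, 1, 1, 1, 1]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```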

You can experiment by calling the tokenizer on more tweets to see how it treats numbers, tags (#), mentions (@), links and other elements present in the tweets.

Since the encoded tweets will be placed in a tensor for training and inference, they should all be of the same length, so we have to find the length of the longest encoded sequence and pad all tweets to that value. Note that the testing set could contain a longer sequence than the training set. Thus, to make sure that we pad to the longest sequence we will find in this exercise, we combine training and testing sets, tokenise them together, and extract the maximum sequence length. In real-life applications where we don't know beforehand what the longest sequence will be, we can add some arbitrary extra padding just to be safe.

**Important note**: this is the only time we make use of this combined set, as training must be carried out only over the training set to prevent data leakage.

In [None]:
# Combine all data in a separate list used only for determining the length of sequences
all_tweets = list(pd.concat([df,df_test],axis=0)['text'])
# Comment out this line if you don't apply any pre-processing
all_tweets = process_tweets(all_tweets)
max_len = max([len(t) for t in roberta_tokenizer(all_tweets)['input_ids']])
print(max_len)

The available data is split into a training and a testing (or validation) set. Note that we ask the tokenizer to return Tensorflow tensors, as that's the library we will be using here; however, one could also use PyTorch.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(tweets,y,test_size=0.30)
X_train = roberta_tokenizer(X_train,padding='max_length',max_length=max_len,return_tensors='tf')
X_test = roberta_tokenizer(X_test,padding='max_length',max_length=max_len,return_tensors='tf')

We can verify that both sets contain a similar proportion of positive labels as the original set, to make sure that the random splitting has not unintentionally introduced a class imbalance.

In [None]:
print(y.sum()/len(y))
print(y_train.sum()/len(y_train))
print(y_test.sum()/len(y_test))

Next, the data is loaded into Tensorflow Datasets. These objects have built-in methods for shuffling and batching the data, and are more efficient for training and inference when dealing with large volumes of data. As RoBERTa-large is a rather heavy model, we have to choose a small batch size, otherwise the examples won't fit in memory.

In [None]:
batch_size = 8
train_dataset = tf.data.Dataset.from_tensor_slices((dict(X_train),y_train))
train_dataset = train_dataset.batch(batch_size)
test_dataset = tf.data.Dataset.from_tensor_slices((dict(X_test),y_test))
test_dataset = test_dataset.batch(batch_size)

Let's verify the structure of the data sets. Each one contains a tuple, where the first element is a dictionary of the encoded tweets and the second is an array of labels.

In [None]:
print(train_dataset)
print(test_dataset)

As a test to make sure we have the data in the right format, we can evaluate the model on the first batch of the training set and see what comes out. In the following cell, "iter" is used to cast the Dataset object as an iterator and "next" to take the first element, i.e. the first batch of 8 examples, which is separated into inputs (temp_x) and labels (temp_y). We can verify that the model returns a pair of logits per example, as mentioned earlier.

In [None]:
# Take the first batch of the training set
temp_x, temp_y = next(iter(train_dataset))
# Only the inputs are needed to obtain the logits
temp = roberta_seq(temp_x)
temp

## 4. Fine-tuning RoBERTa

Before we can fine-tune the model, we must add an optimiser and a loss function for training. We choose the 'adam' optimiser and set a small learning rate, as we are only fine-tuning the weights.

As for the loss function, we choose cross entropy as this is a classification task. Since the last layer of this model contains 2 units, while our training targets (y) are given as a single value per example (i.e. indices, [0] or [1]), the function we must call from Tensorflow is SparseCategoricalCrossentropy. If our targets were given as two values per example (i.e. one-hot encoded, [0,1] or [1,0]), we would use CategoricalCrossentropy; and if the model had only one output, same as our targets, then we would use BinaryCrossentropy. Note that in this example (binary classification) these three functions are all equivalent. Which one we choose depends only on the format of the data.

In summary:
* Model output: n elements. Target: 1 element.  Use: SparseCategoricalCrossentropy
* Model output: n elements. Target: n elements. Use: CategoricalCrossentropy
* Model output: 1 element.  Target: 1 element.  Use: BinaryCrossentropy
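
The equivalence of the three losses for binary classification can be checked numerically. The following is a minimal NumPy sketch for a single example with made-up logits. The three functions are simplified re-implementations written for this illustration and are not the Tensorflow classes themselves. The key fact is that a softmax over two logits (z0, z1) assigns class 1 the same probability as a sigmoid applied to the single logit z1 - z0.

```python
import numpy as np

def sparse_categorical_crossentropy(logits, label):
    """Cross entropy from a vector of logits and an integer label."""
    log_probs = logits - np.logaddexp.reduce(logits)   # log-softmax
    return -log_probs[label]

def categorical_crossentropy(logits, one_hot):
    """Cross entropy from a vector of logits and a one-hot target."""
    log_probs = logits - np.logaddexp.reduce(logits)
    return -np.sum(one_hot * log_probs)

def binary_crossentropy(logit, label):
    """Cross entropy from a single logit (sigmoid output) and a 0/1 label."""
    log_p1 = -np.logaddexp(0.0, -logit)   # log(sigmoid(logit))
    log_p0 = -np.logaddexp(0.0, logit)    # log(1 - sigmoid(logit))
    return -(label * log_p1 + (1 - label) * log_p0)

logits = np.array([0.3, 1.7])   # made-up pair of logits for one example
label = 1

loss_sparse = sparse_categorical_crossentropy(logits, label)
loss_onehot = categorical_crossentropy(logits, np.array([0.0, 1.0]))
loss_binary = binary_crossentropy(logits[1] - logits[0], label)

print(loss_sparse, loss_onehot, loss_binary)   # three identical values
```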

Furthermore, recall that the output of the last layer has no activation function, so the model is returning logits. Therefore, when we call the loss function from Tensorflow, we must pass the argument 'from_logits=True' to indicate that the outputs of the model should be passed through an activation function first when computing the loss. According to Tensorflow's [documentation](https://www.tensorflow.org/api_docs/python/tf/keras/losses/SparseCategoricalCrossentropy), this is more numerically stable than adding an activation function explicitly to the last layer of the model.
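
The numerical stability argument can be illustrated with a small NumPy sketch. This is not Tensorflow's implementation, only a demonstration of the general idea with made-up, deliberately extreme logits. Applying the softmax first and the logarithm afterwards overflows, while computing the log-softmax directly from the logits does not.

```python
import numpy as np

logits = np.array([1000.0, 0.0])   # the model is very confident in class 0
label = 0                          # and class 0 is the correct answer

# Naive route: explicit softmax activation, then the log inside the loss
with np.errstate(over='ignore', invalid='ignore', divide='ignore'):
    probs = np.exp(logits) / np.exp(logits).sum()   # exp(1000) overflows to inf
    naive_loss = -np.log(probs[label])

# Stable route: work with the logits directly (log-sum-exp trick)
stable_loss = np.logaddexp.reduce(logits) - logits[label]

print(naive_loss)    # nan
print(stable_loss)   # 0.0
```

The correct loss for a confident and correct prediction is essentially zero, which only the second route recovers.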

In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-6)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
roberta_seq.compile(optimizer=optimizer,loss=loss,metrics=['accuracy'])

Finally, we add a callback to save the checkpoints of the model and keep the one with the best performance only, measured by accuracy on the validation set.

In [None]:
chkpt = './checkpoint'
callback_chkpt = tf.keras.callbacks.ModelCheckpoint(chkpt,
                                              monitor='val_accuracy',
                                              save_weights_only=True,
                                              save_best_only=True,
                                              mode='max')

We can now train the model on our tweets dataset. The dataset objects are already batched, so there is no need to specify the batch size here. It does not take too long to obtain a good validation score, so 2-3 epochs of training should be enough.

In [None]:
history = roberta_seq.fit(train_dataset,epochs=3,
                          validation_data=test_dataset,
                          callbacks=[callback_chkpt])

Once training finishes, we restore the checkpoint with the best accuracy on the validation set.

In [None]:
roberta_seq.load_weights(chkpt)

Next, we can produce a classification report to compare the different metrics of the model. To do this, we compute and save the predictions on the labeled testing set. Calling 'predict' on the data outputs a tuple with a single element, an array with two logits per example. We assign a label based on the index of the largest value, which can be found using 'argmax' (we use 'axis=1' because the model returns a pair of logits for each example). Note that for inference, it is not necessary to apply an activation function to the logits, as this function does not change the result of taking 'argmax'. You can verify and convince yourself this is true.

In [None]:
outputs = roberta_seq.predict(test_dataset)
y_pred = outputs[0].argmax(axis=1)
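
The claim that the activation function does not change the result of 'argmax' can be checked with a short NumPy sketch. The logits below are made up for illustration. Softmax is a monotonically increasing transformation within each row, so the position of the largest value is preserved.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Made-up logits for three examples, one pair per row
logits = np.array([[ 2.1, -0.3],
                   [-1.2,  0.4],
                   [ 0.0,  3.5]])
probs = softmax(logits, axis=1)

print(logits.argmax(axis=1))   # [0 1 1]
print(probs.argmax(axis=1))    # [0 1 1]
```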

The confusion matrix and classification report are printed using sklearn's functions.

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

print('Confusion matrix:')
print(confusion_matrix(y_test,y_pred,labels=[0,1]))
print()
print('Classification report:')
print(classification_report(y_test,y_pred,labels=[0,1],target_names=['not a disaster','disaster']))
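
To see how the numbers in the classification report relate to the confusion matrix, the sketch below computes precision, recall and F1-score for the positive class by hand. The matrix is made up for illustration and does not come from this model. It follows sklearn's layout, where rows are true labels and columns are predicted labels.

```python
import numpy as np

# Made-up confusion matrix: rows = true labels, columns = predicted labels
cm = np.array([[50, 10],
               [ 5, 35]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of the tweets predicted as disasters, how many were
recall = tp / (tp + fn)      # of the actual disasters, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 4), round(recall, 4), round(f1, 4))
```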

There is an element of randomness arising from how the data is split into training and evaluation sets, and in the training process. This can lead to small variations in the performance of the model. Over several iterations of loading and fine-tuning the model, I have obtained an F1-score of 0.83-0.85 on the evaluation set, and 0.8274-0.8424 on the leaderboard of the competition (the real test set). Thus, the impact of these random factors on the score is rather small.
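
If reproducible runs are desired, the usual remedy is to fix the seeds of the random number generators involved. The sketch below shows the principle with the standard library and NumPy only, using a made-up helper `shuffle_with_seed`. In this notebook one could, as an assumption about how to apply it, call `random.seed` before shuffling, pass `random_state` to `train_test_split` and call `tf.random.set_seed` before training, although some GPU operations may still not be fully deterministic.

```python
import random
import numpy as np

def shuffle_with_seed(items, seed):
    """Return a shuffled copy of items using a dedicated, seeded generator."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled

examples = list(range(10))
first = shuffle_with_seed(examples, seed=42)
second = shuffle_with_seed(examples, seed=42)
print(first == second)   # True: same seed, same order

# The same idea with NumPy's generator
perm_a = np.random.default_rng(0).permutation(10)
perm_b = np.random.default_rng(0).permutation(10)
print(np.array_equal(perm_a, perm_b))   # True
```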

## 5. Inference

We are now ready to make predictions on the real test set. We extract the tweets from the test file, pass them through the tokeniser and place them into a batched dataset.

In [None]:
tweets_test = list(df_test['text'])
# Comment out this line if you are not doing pre-processing
tweets_test = process_tweets(tweets_test)
X_real_test = roberta_tokenizer(tweets_test,padding='max_length',max_length=max_len,return_tensors='tf')
real_test_dataset = tf.data.Dataset.from_tensor_slices(dict(X_real_test))
real_test_dataset = real_test_dataset.batch(batch_size)
real_test_dataset

The labels are assigned the same way as before, taking the 'argmax' from the outputs of the model for each observation.

In [None]:
outputs_test = roberta_seq.predict(real_test_dataset)
y_pred_test = outputs_test[0].argmax(axis=1)

Finally, we can check the proportion of the predicted positive cases in the test set. Assuming that the observations in the test and training sets come from the same distribution and are randomly sampled, the proportion of positive cases should be similar to what we saw before, ~43%. This does not say anything about how good the model is, but if there is a large difference, it can indicate that something is going wrong, either with the model, or with the way the data is distributed.

In [None]:
y_pred_test.sum()/len(y_pred_test)

Assign the id to each prediction and export the data as a CSV file ready for submission to the Kaggle competition.

In [None]:
results = pd.Series(y_pred_test,index=df_test.index,name='target')
results.to_csv('./submission.csv')

## 6. Summary

This notebook has illustrated how to use 🤗 Huggingface's transformers library to solve a classification task applied to tweets. The RoBERTa-large model with a classifier layer on top is easy to use and can be fine-tuned in just a few epochs. Even with very limited pre-processing of the text, the model achieves a good F1-score, showing that it can classify tweets correctly most of the time. Smaller models are also available which sacrifice only a little performance for a great boost in training and inference speed.

I hope this short tutorial has been useful and please feel free to share your comments, questions and any feedback. Thanks!