Edit:
TPU acceleration is finally enabled in Kaggle Notebooks! Training time drops by as much as 20x.

**Background on the tech used:**

**DNN architecture:**

> RoBERTa iterates on BERT's pretraining procedure, including training the model longer, with bigger batches over more data; removing the next sentence prediction objective; training on longer sequences; and dynamically changing the masking pattern applied to the training data. See the associated paper for more details.

**Implementation:**

TensorFlow implementation of pretrained RoBERTa for classification, provided by:
https://github.com/huggingface/transformers

We import the necessary libraries. 

In [None]:
import gc
import numpy as np
import pandas as pd

from sklearn.model_selection import StratifiedKFold
import tensorflow as tf

from transformers import RobertaTokenizer, RobertaConfig, TFRobertaPreTrainedModel
from transformers.modeling_tf_roberta import TFRobertaMainLayer
from transformers.modeling_tf_utils import get_initializer

These lines connect to the TPU cluster and initialize the TPU system.

In [None]:
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

This loads the pretrained tokenizer from the Transformers library and reads the competition CSV files.

In [None]:
MODEL_NAME = 'roberta-base'
MAX_LEN = 128
tokenizer = RobertaTokenizer.from_pretrained(MODEL_NAME)
df_train = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv', dtype={'id': np.int16, 'target': np.int8})
df_test = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv', dtype={'id': np.int16})  # the test set has no 'target' column
df_sample_submission = pd.read_csv('/kaggle/input/nlp-getting-started/sample_submission.csv')

We fix erroneous labels. See the kernel linked at the end to find out why they are erroneous.

In [None]:
ids_with_target_error = [328,443,513,2619,3640,3900,4342,5781,6552,6554,6570,6701,6702,6729,6861,7226]
# .loc accepts a boolean mask; .at is meant for a single row/column label
df_train.loc[df_train['id'].isin(ids_with_target_error), 'target'] = 0

The function below does all the tokenizing and encoding.
The tokenizer splits the text into subword tokens from the RoBERTa vocabulary and maps each token to an integer id. These ids are called input_ids, and they are later converted to a NumPy array for training. We also pad the input ids to a fixed length (MAX_LEN) so that there is no variation in length between individual rows.

In [None]:
def to_tokens(input_text, tokenizer):
    output = tokenizer.encode_plus(input_text, max_length=MAX_LEN, pad_to_max_length=True)
    return output

def select_field(features, field):
    return [feature[field] for feature in features]

In [None]:
tokenizer_output_train = df_train['text'].apply(lambda x: to_tokens(x, tokenizer))
tokenizer_output_test = df_test['text'].apply(lambda x: to_tokens(x, tokenizer))

We store the input ids and the attention masks as NumPy arrays. The input ids are the numbers the neural network understands; the attention masks mark which positions hold real tokens and which hold padding.

In [None]:
input_ids_train = np.array(select_field(tokenizer_output_train, 'input_ids'))
attention_masks_train = np.array(select_field(tokenizer_output_train, 'attention_mask'))

input_ids_test = np.array(select_field(tokenizer_output_test, 'input_ids'))
attention_masks_test = np.array(select_field(tokenizer_output_test, 'attention_mask'))
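
To make the two arrays concrete, here is a minimal sketch of what padding to a fixed length and the matching attention mask look like. It uses plain Python lists and made-up token ids; the helper `pad_and_mask`, the example ids, and the default `pad_id=1` are assumptions for illustration, not the tokenizer's actual implementation.

```python
def pad_and_mask(token_ids, max_len, pad_id=1):
    """Truncate or right-pad token_ids to max_len and build the attention mask.

    Hypothetical helper: a real tokenizer also takes care of special tokens
    when truncating, which this sketch does not attempt.
    """
    ids = list(token_ids[:max_len])   # cut sequences that are too long
    mask = [1] * len(ids)             # 1 marks a real token
    n_pad = max_len - len(ids)
    ids += [pad_id] * n_pad           # fill the rest with the padding id
    mask += [0] * n_pad               # 0 marks padding to be ignored
    return ids, mask


# Made-up ids for a short sentence, padded to length 8.
ids, mask = pad_and_mask([0, 713, 16, 10, 1296, 2], max_len=8)
print(ids)   # [0, 713, 16, 10, 1296, 2, 1, 1]
print(mask)  # [1, 1, 1, 1, 1, 1, 0, 0]
```

Every row ends up with the same length, so the rows can be stacked into one rectangular array, and the mask tells the model which positions to ignore.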

Here we define the model on top of the pretrained RoBERTa layer and create the optimizer, loss, and metric. To use the TPU support that was added to Kaggle, the model is built and model.compile() is called inside the scope of the distribution strategy.

In [None]:
class CustomModel(TFRobertaPreTrainedModel):
    def __init__(self, config, *inputs, **kwargs):
        super(CustomModel, self).__init__(config, *inputs, **kwargs)
        self.num_labels = config.num_labels
        self.roberta = TFRobertaMainLayer(config, name="roberta")
        self.dropout_1 = tf.keras.layers.Dropout(0.3)
        self.classifier = tf.keras.layers.Dense(units=config.num_labels,
                                                name='classifier', 
                                                kernel_initializer=get_initializer(
                                                    config.initializer_range))

    def call(self, inputs, **kwargs):
        outputs = self.roberta(inputs, **kwargs)
        pooled_output = outputs[1]
        pooled_output = self.dropout_1(pooled_output, training=kwargs.get('training', False))
        logits = self.classifier(pooled_output)
        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here

        return outputs

In [None]:
def init_model(model_name):
    strategy = tf.distribute.experimental.TPUStrategy(resolver)
    with strategy.scope():
        config = RobertaConfig.from_pretrained(model_name, num_labels=2)
        model = CustomModel.from_pretrained(model_name, config=config)  # pass the config so num_labels is actually used
        optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
        loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
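        # CategoricalAccuracy compares the argmax of the logits with the one-hot label;
        # BinaryAccuracy would threshold the raw logits at 0.5 instead
        metric = tf.keras.metrics.CategoricalAccuracy('accuracy')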
        model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
    return model

We convert the target labels to one-hot (categorical) format because the RoBERTa model is by default configured to return a tensor with 2 labels (the minimum for classification). For training on TPUs, a batch size of 128 is a good choice. I had previously used a batch size of 64 because it is divisible by 8, which the GCP page on TPUs recommends; the code below uses 128. GPUs are usually severely constrained by video RAM, which is quite low (~15.9 GB) on the P100 GPUs used by Kaggle, so memory becomes a bottleneck. We don't face such problems with TPUs.
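
As a rough illustration of the one-hot format and of a loss computed from raw logits, here is a dependency-free sketch. `one_hot` and `cross_entropy_from_logits` are hypothetical helpers written for this example; they follow the idea behind `to_categorical` and `CategoricalCrossentropy(from_logits=True)` but are not their implementations.

```python
import math


def one_hot(label, num_classes=2):
    """Turn an integer class label into a one-hot vector."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec


def cross_entropy_from_logits(target, logits):
    """Cross-entropy between a one-hot target and unnormalized logits."""
    # Numerically stable log-softmax: subtract the max before exponentiating.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    log_probs = [z - log_sum for z in logits]
    return -sum(t * lp for t, lp in zip(target, log_probs))


print(one_hot(0))  # [1.0, 0.0]
print(one_hot(1))  # [0.0, 1.0]

# Equal logits mean probability 0.5 per class, so the loss is ln(2).
loss_uncertain = cross_entropy_from_logits(one_hot(1), [0.0, 0.0])
# A confident, correct prediction gives a much smaller loss.
loss_confident = cross_entropy_from_logits(one_hot(1), [-3.0, 3.0])
print(loss_uncertain > loss_confident)  # True
```

With two output units the target needs two entries per row, which is why the integer labels are expanded before training.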

TPUs can only work with the data if its length is a multiple of the batch size. With 7613 rows of text in total, we would have to remove 61 rows, because 7613 - 61 = 7552 = 128 * 59, which is exactly divisible by the batch size of 128. In the code below, this trimming is applied to the training part of each fold, so the number of dropped rows is computed per fold.
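
The arithmetic above can be checked with a short sketch. `trim_to_batch_multiple` is a hypothetical helper name, and the list of integers only stands in for the real rows.

```python
def trim_to_batch_multiple(n_rows, batch_size):
    """Return how many rows to keep so the count is a multiple of batch_size."""
    return (n_rows // batch_size) * batch_size


n_rows, batch_size = 7613, 128
n_keep = trim_to_batch_multiple(n_rows, batch_size)
print(n_keep, n_rows - n_keep, n_keep // batch_size)  # 7552 61 59

rows = list(range(n_rows))   # stand-in for the real data
trimmed = rows[:n_keep]      # slicing by count keeps exactly n_keep rows
print(len(trimmed))          # 7552
```

Slicing by the number of rows to keep is a little safer than slicing with `[:-remainder]`: when the remainder happens to be 0, `[:-0]` is the same as `[:0]` and returns an empty sequence.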

For training with GPUs this is not required. Training on the TPU takes ~20 s per epoch (it takes longer initially because loading is a bottleneck), in contrast to about 10 min per epoch on an NVIDIA P100 GPU.

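I also use stratified k-fold cross-validation below: a fresh model is trained on each of the 5 folds and the test predictions are averaged.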

In [None]:
BATCH_SIZE = 128
EPOCHS = 10
SPLITS = 5
callbacks = [tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', 
                                              patience=3, verbose=0, 
                                              restore_best_weights=True)]
model_output = np.zeros((df_sample_submission['target'].shape[0], 2))
skf = StratifiedKFold(n_splits=SPLITS, shuffle=False)
X, y = input_ids_train, df_train['target'].values.reshape(-1, 1)
skf.get_n_splits(X, y)
for i, (train_index, test_index) in enumerate(skf.split(X, y)):
    X_train, attention_masks_train_stratified = X[train_index], attention_masks_train[train_index]
    X_test, attention_masks_test_stratified =  X[test_index], attention_masks_train[test_index]
    y_train, y_test = tf.keras.utils.to_categorical(y[train_index]), tf.keras.utils.to_categorical(y[test_index])
    # Keep only as many training rows as fill whole batches. Slicing by count
    # also works when the remainder is 0, unlike [:-remainder].
    n_keep = (X_train.shape[0] // BATCH_SIZE) * BATCH_SIZE
    X_train = X_train[:n_keep]
    attention_masks_train_stratified = attention_masks_train_stratified[:n_keep]
    y_train = y_train[:n_keep]
    model = init_model(MODEL_NAME)
    if i == 0:
        model.summary()  # prints the summary itself and returns None
    model.fit([X_train, attention_masks_train_stratified], y_train, 
              validation_data=([X_test, attention_masks_test_stratified], y_test), 
              batch_size=BATCH_SIZE, epochs=EPOCHS, callbacks=callbacks)
    model_output += model.predict([input_ids_test, attention_masks_test])
    del model
    gc.collect()
    print('='*22 + ' Split ' + str(i+1) + ' finished ' + '='*22)
model_output /= SPLITS
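
The idea behind the stratified split used in the loop above can be shown on toy labels. This is a hand-rolled illustration, not scikit-learn's algorithm: `stratified_folds` is a hypothetical helper that deals the rows of each class round-robin into the folds so that every fold keeps the overall class ratio.

```python
from collections import defaultdict


def stratified_folds(labels, n_splits):
    """Assign row indices to folds so each class is spread evenly across folds."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        # Deal the rows of this class round-robin, one fold after another.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return [sorted(fold) for fold in folds]


# Toy data: 12 rows of class 0 and 8 rows of class 1 (a 60/40 split).
labels = [0] * 12 + [1] * 8
folds = stratified_folds(labels, n_splits=4)
for fold in folds:
    positives = sum(labels[i] for i in fold)
    print(len(fold), positives)  # 5 2 for every fold
```

Each fold holds 3 rows of class 0 and 2 rows of class 1, so the validation accuracy of every split is measured on the same class balance as the full training set.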

The averaged predictions are converted to class labels and saved to the submission CSV file.

In [None]:
df_sample_submission['target'] = np.argmax(model_output, axis=1).flatten()
df_sample_submission['target'].value_counts()
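
The averaging over folds followed by argmax can be sketched with plain lists. The logits below are made up, and `average_folds` and `argmax` are hypothetical helpers that stand in for the NumPy operations used in the notebook.

```python
def average_folds(fold_logits):
    """Element-wise mean over folds; indexed as fold_logits[fold][row][class]."""
    n_folds = len(fold_logits)
    n_rows = len(fold_logits[0])
    n_classes = len(fold_logits[0][0])
    return [
        [sum(fold[r][c] for fold in fold_logits) / n_folds for c in range(n_classes)]
        for r in range(n_rows)
    ]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


fold_logits = [
    [[2.0, -1.0], [0.2, 0.4]],  # fold 1: on its own it would pick class 1 for row 2
    [[1.0, 0.0], [0.9, 0.1]],   # fold 2: picks class 0 for both rows
]
mean_logits = average_folds(fold_logits)
labels = [argmax(row) for row in mean_logits]
print(labels)  # [0, 0]
```

Averaging before taking the argmax lets a confident fold outweigh a hesitant one, which is what happens for the second row here.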

In [None]:
df_sample_submission.to_csv('submission.csv', index=False)

Thanks to [this](https://www.kaggle.com/wrrosa/keras-bert-using-tfhub-modified-train-data) notebook for finding flaws in some of the training data.
Special thanks to Huggingface 🤗 for providing such a wonderful library and to the Kaggle team for bringing TPUs to notebooks.