# TensorFlow roBERTa Starter - LB 0.705
This notebook is a TensorFlow template for Kaggle's Tweet Sentiment Extraction competition, formulated as a question-answering problem for roBERTa. We show how to tokenize the data, create question-answer targets, and build a custom question-answer head for roBERTa in TensorFlow. At the time of writing, HuggingFace transformers did not have a `TFRobertaForQuestionAnswering` class, so we build our own from `TFRobertaModel`. With some modifications, this notebook can achieve LB 0.715. Have fun experimenting!

You can also run this code offline; it saves the best model weights for each of the 5 training folds. Upload those weights to a private Kaggle dataset and attach it to this notebook. Then run this notebook with the `model.fit()` line commented out, and it will load your offline models and use them to predict OOF (out-of-fold) and test. This makes it easy to convert the notebook into an inference notebook, which takes only about 10 minutes to commit and submit instead of 2 hours. It is better to spend the 2 hours training offline, separately.

# Load Libraries, Data, Tokenizer
We will use HuggingFace transformers, documented [here][1].

[1]: https://huggingface.co/transformers/

In [None]:
pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/9c/35/1c3f6e62d81f5f0daff1384e6d5e6c5758682a8357ebc765ece2b9def62b/transformers-3.0.0-py3-none-any.whl (754kB)
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting tokenizers==0.8.0-rc4
  Downloading https://files.pythonhosted.org/packages/e8/bd/e5abec46af977c8a1375c1dca7cb1e5b3ec392ef279067af7f6bc50491a0/tokenizers-0.8.0rc4-cp36-cp36m-manylinux1_x86_64.whl (3.0MB)

In [None]:
!python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('I hate you'))"

2020-07-02 12:52:17.380981: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
Downloading: 100% 629/629 [00:00<00:00, 504kB/s]
Downloading: 100% 232k/232k [00:00<00:00, 420kB/s]
Downloading: 100% 230/230 [00:00<00:00, 163kB/s]
Downloading: 100% 268M/268M [00:04<00:00, 54.9MB/s]
[{'label': 'NEGATIVE', 'score': 0.9991129040718079}]


In [None]:
import pandas as pd, numpy as np
import tensorflow as tf
import tensorflow.keras.backend as K
from sklearn.model_selection import StratifiedKFold
from transformers import RobertaConfig, RobertaTokenizer, TFRobertaModel
import tokenizers
print('TF version',tf.__version__)

TF version 2.2.0


In [None]:
from google.colab import files
files.upload()

Saving tweet-sentiment-extraction.zip to tweet-sentiment-extraction.zip



In [None]:
!unzip -q "../content/tweet-sentiment-extraction.zip"

In [None]:
MAX_LEN = 96
#PATH = '../input/tf-roberta/'
#tokenizer = tokenizers.ByteLevelBPETokenizer(
#    vocab_file=PATH+'vocab-roberta-base.json', 
#    merges_file=PATH+'merges-roberta-base.txt', 
#    lowercase=True,
#    add_prefix_space=True
#)
tokenizer = RobertaTokenizer.from_pretrained('roberta-base') # https://huggingface.co/roberta-base

sentiment_id = {'positive': 1313, 'negative': 2430, 'neutral': 7974}
train = pd.read_csv('../content/train.csv').fillna('')
train.head()

Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative


# Training Data
We will now convert the training data into arrays that roBERTa understands. Here are example inputs and targets: 
![ids.jpg](attachment:ids.jpg)
The tokenization logic below is inspired by Abhishek's PyTorch notebook [here][1].

[1]: https://www.kaggle.com/abhishek/roberta-inference-5-folds
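
As a rough illustration of the input layout and targets, here is a minimal sketch that does not call the tokenizer. The tweet token ids (100 to 104) and the helper name `build_example` are made up for the example; the special ids 0, 1, 2 (`<s>`, `<pad>`, `</s>`) and the sentiment id 2430 (`negative`) match the values used in this notebook.

```python
import numpy as np

MAX_LEN = 12                 # shortened for display; the notebook uses 96
BOS, PAD, EOS = 0, 1, 2      # roBERTa ids for <s>, <pad>, </s>

def build_example(tweet_ids, selected, sentiment_token):
    """Hypothetical helper.
    tweet_ids: token ids of the tweet, without special tokens.
    selected:  indices into tweet_ids that overlap selected_text."""
    n = len(tweet_ids)
    input_ids = np.full(MAX_LEN, PAD, dtype='int32')
    attention_mask = np.zeros(MAX_LEN, dtype='int32')
    start = np.zeros(MAX_LEN, dtype='int32')
    end = np.zeros(MAX_LEN, dtype='int32')

    # Layout: <s> tweet </s></s> sentiment </s>, i.e. n + 5 positions
    input_ids[:n + 5] = [BOS] + tweet_ids + [EOS, EOS, sentiment_token, EOS]
    attention_mask[:n + 5] = 1

    # +1 because position 0 of the input is <s>
    start[selected[0] + 1] = 1
    end[selected[-1] + 1] = 1
    return input_ids, attention_mask, start, end

tweet_ids = [100, 101, 102, 103, 104]   # placeholder ids, not real vocabulary entries
ids, mask, start, end = build_example(tweet_ids, selected=[1, 2], sentiment_token=2430)
print(ids)
print(mask)
print(start)
print(end)
```

The one-hot `start` and `end` vectors are the targets the model is trained to reproduce.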

In [None]:
ct = train.shape[0]
input_ids = np.ones((ct,MAX_LEN),dtype='int32')
attention_mask = np.zeros((ct,MAX_LEN),dtype='int32')
token_type_ids = np.zeros((ct,MAX_LEN),dtype='int32')
start_tokens = np.zeros((ct,MAX_LEN),dtype='int32')
end_tokens = np.zeros((ct,MAX_LEN),dtype='int32')

for k in range(train.shape[0]):
    
    # FIND OVERLAP
    text1 = " "+" ".join(train.loc[k,'text'].split())
    text2 = " ".join(train.loc[k,'selected_text'].split())
    idx = text1.find(text2)
    chars = np.zeros((len(text1)))
    chars[idx:idx+len(text2)]=1
    if text1[idx-1]==' ': chars[idx-1] = 1 
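    enc = tokenizer.encode(text1, add_special_tokens=False)  # <s> and </s> are added manually below
        
    # ID_OFFSETS: character span covered by each token in text1
    offsets = []; idx=0
    for t in enc:
        w = tokenizer.decode([t])
        offsets.append((idx,idx+len(w)))
        idx += len(w)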
    
    # START END TOKENS
    toks = []
    for i,(a,b) in enumerate(offsets):
        sm = np.sum(chars[a:b])
        if sm>0: toks.append(i) 
        
    s_tok = sentiment_id[train.loc[k,'sentiment']]
    input_ids[k,:len(enc)+5] = [0] + enc + [2,2] + [s_tok] + [2]
    attention_mask[k,:len(enc)+5] = 1
    if len(toks)>0:
        start_tokens[k,toks[0]+1] = 1
        end_tokens[k,toks[-1]+1] = 1

# Test Data
We must tokenize the test data exactly as we tokenize the training data.

In [None]:
test = pd.read_csv('../content/test.csv').fillna('')

ct = test.shape[0]
input_ids_t = np.ones((ct,MAX_LEN),dtype='int32')
attention_mask_t = np.zeros((ct,MAX_LEN),dtype='int32')
token_type_ids_t = np.zeros((ct,MAX_LEN),dtype='int32')

for k in range(test.shape[0]):
        
    # INPUT_IDS
    text1 = " "+" ".join(test.loc[k,'text'].split())
    enc = tokenizer.encode(text1, add_special_tokens=False)  # <s> and </s> are added manually below
    s_tok = sentiment_id[test.loc[k,'sentiment']]
    input_ids_t[k,:len(enc)+5] = [0] + enc + [2,2] + [s_tok] + [2]
    attention_mask_t[k,:len(enc)+5] = 1

# Build roBERTa Model
We use a pretrained roBERTa base model and add a custom question-answer head. First, the tokens are fed into `bert_model` and we take its first output, i.e. `x[0]` below. These are the embeddings of all input tokens and have shape `(batch_size, MAX_LEN, 768)`. Next we apply `tf.keras.layers.Conv1D(filters=1, kernel_size=1)`, which transforms the embeddings into shape `(batch_size, MAX_LEN, 1)`. We then flatten this and apply `softmax`, so the final output `x1` has shape `(batch_size, MAX_LEN)`. It is a probability distribution over token positions, trained against the one-hot encoding of the start token index (for `selected_text`). Likewise, `x2` predicts the end token index.

![bert.jpg](attachment:bert.jpg)
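
To make the shapes concrete, the following NumPy sketch mimics one head without loading roBERTa. It assumes random numbers as a stand-in for `x[0]` and a randomly initialized weight vector; a `Conv1D` with one filter and kernel size 1 amounts to the same dot product applied at every token position.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, MAX_LEN, hidden = 2, 96, 768

# Stand-in for x[0], the per-token embeddings from the backbone
embeddings = rng.normal(size=(batch_size, MAX_LEN, hidden))

# Conv1D(filters=1, kernel_size=1) holds one weight per hidden unit plus a bias
w = rng.normal(size=(hidden,)) * 0.02
b = 0.0

# One logit per token position; Conv1D would give (batch, MAX_LEN, 1)
# and Flatten would drop the trailing axis.
logits = embeddings @ w + b

# Softmax over the token positions (shifted for numerical stability)
logits = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

print(probs.shape)        # (2, 96)
print(probs.sum(axis=1))  # each row sums to 1 up to rounding
```

Because the same weights are applied at every position, the head adds only 769 parameters per output on top of the backbone.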

In [None]:
def build_model():
    ids = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32)
    att = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32)
    tok = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32)

    #config = RobertaConfig.from_pretrained(PATH+'config-roberta-base.json')
    #bert_model = TFRobertaModel.from_pretrained(PATH+'pretrained-roberta-base.h5',config=config)
    bert_model = TFRobertaModel.from_pretrained('roberta-base')
    x = bert_model(ids,attention_mask=att,token_type_ids=tok)
    
    x1 = tf.keras.layers.Dropout(0.1)(x[0]) 
    x1 = tf.keras.layers.Conv1D(1,1)(x1)
    x1 = tf.keras.layers.Flatten()(x1)
    x1 = tf.keras.layers.Activation('softmax')(x1)
    
    x2 = tf.keras.layers.Dropout(0.1)(x[0]) 
    x2 = tf.keras.layers.Conv1D(1,1)(x2)
    x2 = tf.keras.layers.Flatten()(x2)
    x2 = tf.keras.layers.Activation('softmax')(x2)

    model = tf.keras.models.Model(inputs=[ids, att, tok], outputs=[x1,x2])
    optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
    model.compile(loss='categorical_crossentropy', optimizer=optimizer)

    return model

# Metric

In [None]:
def jaccard(str1, str2): 
    a = set(str1.lower().split()) 
    b = set(str2.lower().split())
    if len(a)==0 and len(b)==0: return 0.5
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))
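
As a small worked example of this word-level Jaccard score, the sketch below restates the function so it runs on its own and scores one tweet from the training sample shown earlier. Predicting the whole tweet shares 2 words with the 2-word target out of 10 distinct words in total, so it scores 2/10.

```python
def jaccard(str1, str2):
    # Word-level Jaccard: |intersection| / |union| of lowercased word sets
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    if len(a) == 0 and len(b) == 0:
        return 0.5
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

full = "Sooo SAD I will miss you here in San Diego!!!"
truth = "Sooo SAD"

score_full = jaccard(full, truth)    # predict the whole tweet
score_exact = jaccard(truth, truth)  # predict exactly the target
print(score_full, score_exact)       # 0.2 1.0
```

Note that punctuation stays attached to words (`diego!!!` is one word), so small boundary errors can cost a full word.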

# Train roBERTa Model
We train with 5 stratified K-folds (stratified by sentiment). In each fold, the best model weights are saved and then reloaded before OOF (out-of-fold) prediction and test prediction. Therefore you can run this code offline and upload your 5 fold models to a private Kaggle dataset. Then run this notebook with the `model.fit()` line commented out; the notebook will instead load the weights from your offline training in the line `model.load_weights()`, so update that line to point to the correct path. Also make sure the KFold seed below matches the one used in your offline training. The notebook will then use your offline models to predict OOF and test.
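
The prediction step that follows each fold can be sketched without a model. The example below is an illustration under assumptions: `decode_span` is a hypothetical helper, the token pieces are hand-written strings rather than real tokenizer output, and the two "fold" distributions are invented numbers. It shows fold averaging, the argmax decoding, and the fallback to the full text when the predicted start comes after the predicted end.

```python
import numpy as np

def decode_span(start_probs, end_probs, tokens, full_text):
    """Hypothetical helper: pick the most likely start and end positions."""
    a = int(np.argmax(start_probs))
    b = int(np.argmax(end_probs))
    if a > b:
        return full_text                 # inconsistent prediction: fall back to the whole tweet
    # Input position p corresponds to tokens[p-1] because position 0 is <s>
    return "".join(tokens[a - 1:b])

MAX_LEN = 8
tokens = [" my", " boss", " is", " bullying", " me", "..."]   # hand-written pieces
full_text = "my boss is bullying me..."

# Two invented folds' start/end distributions, averaged as in the training loop
fold_start = np.zeros((2, MAX_LEN))
fold_end = np.zeros((2, MAX_LEN))
fold_start[0, 4], fold_start[0, 1] = 0.9, 0.1
fold_start[1, 4], fold_start[1, 3] = 0.6, 0.4
fold_end[0, 5], fold_end[0, 6] = 0.8, 0.2
fold_end[1, 5], fold_end[1, 6] = 0.7, 0.3
start = fold_start.mean(axis=0)
end = fold_end.mean(axis=0)

span = decode_span(start, end, tokens, full_text)
fallback = decode_span(end, start, tokens, full_text)  # swapped on purpose, so a > b
print(repr(span))
print(repr(fallback))
```

Averaging probabilities before taking the argmax lets folds that disagree on the exact boundary still vote for a single span.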

In [None]:
jac = []; VER='v0'; DISPLAY=1 # USE display=1 FOR INTERACTIVE
oof_start = np.zeros((input_ids.shape[0],MAX_LEN))
oof_end = np.zeros((input_ids.shape[0],MAX_LEN))
preds_start = np.zeros((input_ids_t.shape[0],MAX_LEN))
preds_end = np.zeros((input_ids_t.shape[0],MAX_LEN))

skf = StratifiedKFold(n_splits=5,shuffle=True,random_state=777)
for fold,(idxT,idxV) in enumerate(skf.split(input_ids,train.sentiment.values)):

    print('#'*25)
    print('### FOLD %i'%(fold+1))
    print('#'*25)
    
    K.clear_session()
    model = build_model()
        
    sv = tf.keras.callbacks.ModelCheckpoint(
        '%s-roberta-%i.h5'%(VER,fold), monitor='val_loss', verbose=1, save_best_only=True,
        save_weights_only=True, mode='auto', save_freq='epoch')
        
    model.fit([input_ids[idxT,], attention_mask[idxT,], token_type_ids[idxT,]], [start_tokens[idxT,], end_tokens[idxT,]], 
        epochs=3, batch_size=32, verbose=DISPLAY, callbacks=[sv],
        validation_data=([input_ids[idxV,],attention_mask[idxV,],token_type_ids[idxV,]], 
        [start_tokens[idxV,], end_tokens[idxV,]]))
    
    print('Loading model...')
    model.load_weights('%s-roberta-%i.h5'%(VER,fold))
    
    print('Predicting OOF...')
    oof_start[idxV,],oof_end[idxV,] = model.predict([input_ids[idxV,],attention_mask[idxV,],token_type_ids[idxV,]],verbose=DISPLAY)
    
    print('Predicting Test...')
    preds = model.predict([input_ids_t,attention_mask_t,token_type_ids_t],verbose=DISPLAY)
    preds_start += preds[0]/skf.n_splits
    preds_end += preds[1]/skf.n_splits
    
    # DISPLAY FOLD JACCARD
    all = []
    for k in idxV:
        a = np.argmax(oof_start[k,])
        b = np.argmax(oof_end[k,])
        if a>b: 
            st = train.loc[k,'text'] # IMPROVE CV/LB with better choice here
        else:
            text1 = " "+" ".join(train.loc[k,'text'].split())
            enc = tokenizer.encode(text1, add_special_tokens=False)  # no automatic <s>, so input position a maps to enc[a-1]
            st = tokenizer.decode(enc[a-1:b])
        all.append(jaccard(st,train.loc[k,'selected_text']))
    jac.append(np.mean(all))
    print('>>>> FOLD %i Jaccard ='%(fold+1),np.mean(all))
    print()

#########################
### FOLD 1
#########################


Some weights of the model checkpoint at roberta-base were not used when initializing TFRobertaModel: ['lm_head']
- This IS expected if you are initializing TFRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFRobertaModel were initialized from the model checkpoint at roberta-base.
If your task is similar to the task the model of the ckeckpoint was trained on, you can already use TFRobertaModel for predictions without further training.


Epoch 1/3
Epoch 00001: val_loss improved from inf to 1.61203, saving model to v0-roberta-0.h5
Epoch 2/3
Epoch 00002: val_loss improved from 1.61203 to 1.54551, saving model to v0-roberta-0.h5
Epoch 3/3
Epoch 00003: val_loss improved from 1.54551 to 1.52248, saving model to v0-roberta-0.h5
Loading model...
Predicting OOF...
Predicting Test...
>>>> FOLD 1 Jaccard = 0.504246066053762

#########################
### FOLD 2
#########################




Epoch 1/3
Epoch 00001: val_loss improved from inf to 1.56996, saving model to v0-roberta-1.h5
Epoch 2/3
Epoch 00002: val_loss improved from 1.56996 to 1.51633, saving model to v0-roberta-1.h5
Epoch 3/3
Epoch 00003: val_loss did not improve from 1.51633
Loading model...
Predicting OOF...
Predicting Test...
>>>> FOLD 2 Jaccard = 0.5021087108070871

#########################
### FOLD 3
#########################




Epoch 1/3
Epoch 00001: val_loss improved from inf to 1.62723, saving model to v0-roberta-2.h5
Epoch 2/3
Epoch 00002: val_loss improved from 1.62723 to 1.53409, saving model to v0-roberta-2.h5
Epoch 3/3
Epoch 00003: val_loss improved from 1.53409 to 1.49771, saving model to v0-roberta-2.h5
Loading model...
Predicting OOF...
Predicting Test...
>>>> FOLD 3 Jaccard = 0.5050925391280524

#########################
### FOLD 4
#########################




Epoch 1/3
Epoch 00001: val_loss improved from inf to 1.62474, saving model to v0-roberta-3.h5
Epoch 2/3
Epoch 00002: val_loss improved from 1.62474 to 1.48386, saving model to v0-roberta-3.h5
Epoch 3/3
Epoch 00003: val_loss did not improve from 1.48386
Loading model...
Predicting OOF...
Predicting Test...
>>>> FOLD 4 Jaccard = 0.5031727473548202

#########################
### FOLD 5
#########################




Epoch 1/3
Epoch 00001: val_loss improved from inf to 1.55465, saving model to v0-roberta-4.h5
Epoch 2/3
Epoch 00002: val_loss improved from 1.55465 to 1.52094, saving model to v0-roberta-4.h5
Epoch 3/3
Epoch 00003: val_loss did not improve from 1.52094
Loading model...
Predicting OOF...
Predicting Test...
>>>> FOLD 5 Jaccard = 0.5068040332393245



In [None]:
print('>>>> OVERALL 5Fold CV Jaccard =',np.mean(jac))

>>>> OVERALL 5Fold CV Jaccard = 0.5042848193166092


# Kaggle Submission

In [None]:
all = []
for k in range(input_ids_t.shape[0]):
    a = np.argmax(preds_start[k,])
    b = np.argmax(preds_end[k,])
    if a>b: 
        st = test.loc[k,'text']
    else:
        text1 = " "+" ".join(test.loc[k,'text'].split())
        enc = tokenizer.encode(text1, add_special_tokens=False)  # no automatic <s>, so input position a maps to enc[a-1]
        st = tokenizer.decode(enc[a-1:b])
    all.append(st)

In [None]:
test['selected_text'] = all
test[['textID','selected_text']].to_csv('submission.csv',index=False)
pd.set_option('max_colwidth', 60)
test.sample(25)

Unnamed: 0,textID,text,sentiment,selected_text
1211,1137d77ecc,....got your message!!! You are such a twitter freak!,neutral,<s>....got your message!!! You are such a twitter freak
3004,12d70a758e,why is it so **** cold ??!,negative,<s> why is it so **** cold
375,0437867c21,So tired.,negative,<s> So tired
182,4a13c1093d,"between the Garlic Pills, the Spider bite between your ...",neutral,"<s> between the Garlic Pills, the Spider bite between yo..."
2158,4cc3d3824c,Good morning from RSA Twitterverse!! Please send me some...,neutral,<s> Good morning from RSA Twitterverse!! Please send me ...
1362,31d03bff54,"I would lime them. Or lemon? And deal, sounds fabulous...",positive,"deal, sounds fabulous"
450,797f9cddc0,why aren`t you showing up as a #spymaster in my screen?,neutral,<s> why aren`t you showing up as a #spymaster in my screen
2443,1c5ca67cf9,"Good Grief! I can`t say much, I was driving home from M...",neutral,"<s> Good Grief! I can`t say much, I was driving home fro..."
2349,7a7f7181c7,"I just want to say: Both you,Taylor Swift, and Hayley W...",positive,have a great and lovely voice
2914,0457eb75c2,"maaaaan! I spent an hour on a project for work, only to ...",neutral,"<s> maaaaan! I spent an hour on a project for work, only..."
