#### This is mostly a copy of the notebook submitted to Kaggle here: https://www.kaggle.com/competitions/tweet-sentiment-extraction/discussion/159505

Since the training data was small, for each idea we ran the local CV 10 times with 10 different K Fold random seeds and averaged the scores (5 folds times 10 seeds equals 50 fold scores). Each change below increased the CV average by at least 0.001.
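
The averaging scheme above can be sketched as follows. This is a minimal sketch, not the team's code: `k_fold_indices`, `repeated_cv` and `dummy_score` are hypothetical names, the split is a plain shuffled K Fold rather than the stratified split used later in this notebook, and `dummy_score` stands in for training a model and returning its validation Jaccard.

```python
import random
from statistics import mean

def k_fold_indices(n_samples, n_splits, seed):
    # Shuffle indices with a seed-specific RNG, then deal them into n_splits folds
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    return [order[i::n_splits] for i in range(n_splits)]

def repeated_cv(score_fold, n_samples, n_splits=5, seeds=range(10)):
    # Collect one score per (seed, fold) pair and average them all
    scores = []
    for seed in seeds:
        for valid_idx in k_fold_indices(n_samples, n_splits, seed):
            held_out = set(valid_idx)
            train_idx = [i for i in range(n_samples) if i not in held_out]
            scores.append(score_fold(train_idx, valid_idx))
    return mean(scores), len(scores)

def dummy_score(train_idx, valid_idx):
    # Placeholder (assumption): returns the validation fraction instead of a real Jaccard score
    return len(valid_idx) / (len(train_idx) + len(valid_idx))

cv_average, n_scores = repeated_cv(dummy_score, n_samples=100)
print(n_scores)  # 5 folds x 10 seeds = 50
```

Comparing two ideas then means comparing their `cv_average` values computed with the same seeds, which reduces the noise that a single 5 fold split would carry on a small dataset.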

1. Do not remove extra white space.

The extra white space contains signal. For example, if text is "that's awesome!" then selected text is awesome. However, if text is " that's awesome!" then selected text is s awesome. The second example has extra white space at the beginning of the text, and as a result the selected text has an extra preceding letter.
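
To see why this matters, note that the tokenization cells in this notebook normalize text with `" ".join(text.split())`. The small sketch below (helper names `collapse_whitespace` and `leading_spaces` are made up for illustration) shows that this normalization maps both example texts to the same string, so a model fed the normalized text cannot tell them apart.

```python
def collapse_whitespace(text):
    # The normalization used in this notebook's tokenization cells
    return " ".join(text.split())

def leading_spaces(text):
    # Number of space characters before the first non-space character
    return len(text) - len(text.lstrip(" "))

clean = "that's awesome!"
noisy = " that's awesome!"  # extra white space at the beginning

print(leading_spaces(clean), leading_spaces(noisy))  # 0 1
print(collapse_whitespace(noisy) == clean)           # True
```

Keeping the raw text, or passing the white space pattern to the model in some other form, preserves the signal described above.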

2. Break apart common single tokens

(https://www.kaggle.com/code/cdeotte/tensorflow-roberta-0-705/notebook)
RoBERTa makes a single token for "...", so your model cannot choose "fun." if the text is "This is fun...". So during preprocessing, convert all single [...] tokens into three [.][.][.] tokens. Similarly, split "..", "!!", "!!!".
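
A minimal sketch of the splitting step, working on token strings rather than token ids. The names `SPLIT_MAP` and `split_punctuation_tokens` are hypothetical, and the token list is illustrative rather than real tokenizer output; an actual implementation would map token ids and keep the character offsets in sync.

```python
# Multi-character punctuation tokens named in the text, mapped to single-character pieces
SPLIT_MAP = {
    "...": [".", ".", "."],
    "..": [".", "."],
    "!!!": ["!", "!", "!"],
    "!!": ["!", "!"],
}

def split_punctuation_tokens(tokens):
    # Replace each listed token with its pieces; leave every other token untouched
    out = []
    for tok in tokens:
        out.extend(SPLIT_MAP.get(tok, [tok]))
    return out

tokens = ["this", " is", " fun", "..."]  # illustrative token strings (assumption)
print(split_punctuation_tokens(tokens))
# ['this', ' is', ' fun', '.', '.', '.']
```

After the split, a span ending at the first "." corresponds to "fun.", which was not selectable while "..." was a single token.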

In [1]:
import sys
print(sys.version)

3.10.4 (v3.10.4:9d38120e33, Mar 23 2022, 17:29:05) [Clang 13.0.0 (clang-1300.0.29.30)]


In [2]:
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow.keras.backend as K
from sklearn.model_selection import StratifiedKFold
from transformers import RobertaConfig, TFRobertaModel
from tensorflow import keras
import tokenizers
print('TF version',tf.__version__)

2024-01-23 14:07:00.501918: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


TF version 2.11.0


In [3]:
MAX_LEN = 96
PATH = 'roBERTaFiles'
tokenizer = tokenizers.ByteLevelBPETokenizer(
    vocab=PATH+'/vocab-roberta-base.json', 
    merges=PATH+'/merges-roberta-base.txt', 
    lowercase=True,
    add_prefix_space=True
)
sentiment_id = {'positive': 1313, 'negative': 2430, 'neutral': 7974}
train = pd.read_csv('tweet-sentiment-extraction/train.csv').fillna('')
train.head()

Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative


In [4]:
ct = train.shape[0]
input_ids = np.ones((ct,MAX_LEN),dtype='int32')
attention_mask = np.zeros((ct,MAX_LEN),dtype='int32')
token_type_ids = np.zeros((ct,MAX_LEN),dtype='int32')
start_tokens = np.zeros((ct,MAX_LEN),dtype='int32')
end_tokens = np.zeros((ct,MAX_LEN),dtype='int32')

for k in range(train.shape[0]):
    
    # FIND OVERLAP
    text1 = " "+" ".join(train.loc[k,'text'].split())
    text2 = " ".join(train.loc[k,'selected_text'].split())
    idx = text1.find(text2)
    chars = np.zeros((len(text1)))
    chars[idx:idx+len(text2)]=1
    if text1[idx-1]==' ': chars[idx-1] = 1 
    enc = tokenizer.encode(text1) 
        
    # ID_OFFSETS
    offsets = []; idx=0
    for t in enc.ids:
        w = tokenizer.decode([t])
        offsets.append((idx,idx+len(w)))
        idx += len(w)
    
    # START END TOKENS
    toks = []
    for i,(a,b) in enumerate(offsets):
        sm = np.sum(chars[a:b])
        if sm>0: toks.append(i) 
        
    s_tok = sentiment_id[train.loc[k,'sentiment']]
    input_ids[k,:len(enc.ids)+5] = [0] + enc.ids + [2,2] + [s_tok] + [2]
    attention_mask[k,:len(enc.ids)+5] = 1
    if len(toks)>0:
        start_tokens[k,toks[0]+1] = 1
        end_tokens[k,toks[-1]+1] = 1

In [5]:
#We must tokenize the test data exactly the same as we tokenize the training data

test = pd.read_csv('tweet-sentiment-extraction/test.csv').fillna('')

ct = test.shape[0]
input_ids_t = np.ones((ct,MAX_LEN),dtype='int32')
attention_mask_t = np.zeros((ct,MAX_LEN),dtype='int32')
token_type_ids_t = np.zeros((ct,MAX_LEN),dtype='int32')

for k in range(test.shape[0]):
        
    # INPUT_IDS
    text1 = " "+" ".join(test.loc[k,'text'].split())
    enc = tokenizer.encode(text1)                
    s_tok = sentiment_id[test.loc[k,'sentiment']]
    input_ids_t[k,:len(enc.ids)+5] = [0] + enc.ids + [2,2] + [s_tok] + [2]
    attention_mask_t[k,:len(enc.ids)+5] = 1

2.1. Build roBERTa Model

We use a pretrained roBERTa base model and add a custom question answer head. First, tokens are input into bert_model and we use BERT's first output, i.e. x[0] below. These are embeddings of all input tokens and have shape (batch_size, MAX_LEN, 768). Next we apply tf.keras.layers.Conv1D(filters=1, kernel_size=1), which transforms the embeddings into shape (batch_size, MAX_LEN, 1). We then flatten this and apply softmax, so the final output x1 has shape (batch_size, MAX_LEN). Each row of x1 is a probability distribution over token positions for the start of selected_text, trained against the one hot encoded start token index. Likewise, x2 predicts the end token index.
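
The shape bookkeeping of this head can be illustrated without TensorFlow. The sketch below is plain Python for a single example (no batch dimension), with made-up embeddings and weights and hypothetical helper names `pointwise_conv` and `softmax`; it only shows that a convolution with one filter and kernel size 1 is one dot product per token, followed by a softmax across token positions.

```python
import math

def pointwise_conv(embeddings, kernel, bias=0.0):
    # Conv1D(filters=1, kernel_size=1): one dot product per token,
    # so (seq_len, hidden) becomes (seq_len,) after flattening
    return [sum(e * w for e, w in zip(token, kernel)) + bias for token in embeddings]

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

embeddings = [[0.1, 0.2], [2.0, 1.0], [0.3, -0.4]]  # 3 tokens, hidden size 2 (768 in the model)
kernel = [1.0, 0.5]                                  # made-up weights (assumption)

logits = pointwise_conv(embeddings, kernel)  # one logit per token
probs = softmax(logits)                      # distribution over token positions
start_index = probs.index(max(probs))        # same role as np.argmax on x1
```
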

In [6]:
def build_model():
    ids = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32)
    att = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32)
    tok = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32)

    config = RobertaConfig.from_pretrained(PATH+'/input/config-roberta-base.json')
    bert_model = TFRobertaModel.from_pretrained(PATH+"/input/pretrained-roberta-base.h5",config=config)
    x = bert_model(ids,attention_mask=att,token_type_ids=tok)
    
    x1 = tf.keras.layers.Dropout(0.1)(x[0]) 
    x1 = tf.keras.layers.Conv1D(1,1)(x1)
    x1 = tf.keras.layers.Flatten()(x1)
    x1 = tf.keras.layers.Activation('softmax')(x1)
    
    x2 = tf.keras.layers.Dropout(0.1)(x[0]) 
    x2 = tf.keras.layers.Conv1D(1,1)(x2)
    x2 = tf.keras.layers.Flatten()(x2)
    x2 = tf.keras.layers.Activation('softmax')(x2)

    model = tf.keras.models.Model(inputs=[ids, att, tok], outputs=[x1,x2])
    optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
    model.compile(loss=keras.losses.CategoricalCrossentropy(), optimizer=optimizer)

    return model

3. Underestimate train targets

The Jaccard score is higher if you underestimate than if you overestimate. Therefore, if text is " Matt loves ice cream" and the selected text is "t love", train your model with selected text "love", not selected text "Matt love". All public notebooks do the latter; we suggest the former.
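
The claim can be checked directly on the example above with the competition's word-level Jaccard metric. The function below restates the `jaccard` helper used elsewhere in this notebook so the snippet runs on its own.

```python
def jaccard(str1, str2):
    # Word-level Jaccard similarity on lowercased, whitespace-split words
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    if not a and not b:
        return 0.5
    c = a & b
    return len(c) / (len(a) + len(b) - len(c))

truth = "t love"                           # noisy ground-truth selected text
under = jaccard("love", truth)             # 1 shared word, 2 words in the union
over = jaccard("Matt love", truth)         # 1 shared word, 3 words in the union

print(under, round(over, 3))  # 0.5 0.333
```

The underestimate drops a word that could never match ("t"), while the overestimate adds a word ("Matt") that enlarges the union without enlarging the intersection.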

4. Modified Question Answer head

First predict the end index. Then concatenate the end index logits with RoBERTa last hidden layer to predict the start index.

In [7]:
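def jaccard(str1, str2): 
    # Word-level Jaccard similarity: |A ∩ B| / |A ∪ B| on lowercased, whitespace-split words
    a = set(str1.lower().split()) 
    b = set(str2.lower().split())
    if len(a)==0 and len(b)==0: return 0.5  # both strings are empty
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))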

2.2. Train roBERTa Model

We train with 5 Stratified KFolds (based on sentiment stratification). Each fold, the best model weights are saved and then reloaded before oof prediction and test prediction. Therefore you can run this code offline and upload your 5 fold models to a private Kaggle dataset. Then run this notebook and comment out the line model.fit(). Instead your notebook will load your model weights from offline training in the line model.load_weights(). Update this to have the correct path. Also make sure you change the KFold seed below to match your offline training. Then this notebook will proceed to use your offline models to predict oof and predict test.

In [None]:
jac = []
VER='v0'
DISPLAY=1 # USE display=1 FOR INTERACTIVE

oof_start = np.zeros((input_ids.shape[0],MAX_LEN))
oof_end = np.zeros((input_ids.shape[0],MAX_LEN))

preds_start = np.zeros((input_ids_t.shape[0],MAX_LEN))
preds_end = np.zeros((input_ids_t.shape[0],MAX_LEN))

skf = StratifiedKFold(n_splits=5,shuffle=True,random_state=777)
for fold,(idxT,idxV) in enumerate(skf.split(input_ids,train.sentiment.values)):

    print('#'*25)
    print('### FOLD %i'%(fold+1))
    print('#'*25)
    
    K.clear_session()
    model = build_model()
        
    sv = tf.keras.callbacks.ModelCheckpoint(
        '%s-roberta-%i.h5'%(VER,fold), monitor='val_loss', verbose=1, save_best_only=True,
        save_weights_only=True, mode='auto', save_freq='epoch')
        
    model.fit([input_ids[idxT,], attention_mask[idxT,], token_type_ids[idxT,]], [start_tokens[idxT,], end_tokens[idxT,]], 
        epochs=3, batch_size=32, verbose=DISPLAY, callbacks=[sv],
        validation_data=([input_ids[idxV,],attention_mask[idxV,],token_type_ids[idxV,]], 
        [start_tokens[idxV,], end_tokens[idxV,]]))
    
    print('Loading model...')
    model.load_weights('%s-roberta-%i.h5'%(VER,fold))
    
    print('Predicting OOF...')
    oof_start[idxV,],oof_end[idxV,] = model.predict([input_ids[idxV,],attention_mask[idxV,],token_type_ids[idxV,]],verbose=DISPLAY)
    
    print('Predicting Test...')
    preds = model.predict([input_ids_t,attention_mask_t,token_type_ids_t],verbose=DISPLAY)
    preds_start += preds[0]/skf.n_splits
    preds_end += preds[1]/skf.n_splits
    
    # DISPLAY FOLD JACCARD
    fold_scores = []  # Jaccard score of each validation sample in this fold
    for k in idxV:
        a = np.argmax(oof_start[k,])
        b = np.argmax(oof_end[k,])
        if a>b: 
            st = train.loc[k,'text'] # IMPROVE CV/LB with better choice here
        else:
            text1 = " "+" ".join(train.loc[k,'text'].split())
            enc = tokenizer.encode(text1)
            st = tokenizer.decode(enc.ids[a-1:b])
        fold_scores.append(jaccard(st,train.loc[k,'selected_text']))
    jac.append(np.mean(fold_scores))
    print('>>>> FOLD %i Jaccard ='%(fold+1),np.mean(fold_scores))
    print()

loading configuration file roBERTaFiles/input/config-roberta-base.json
Model config RobertaConfig {
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": null,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "eos_token_ids": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": null,
  "position_embedding_type": "absolute",
  "transformers_version": "4.19.4",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}

loading weights file roBERTaFiles/input/pretrained-roberta-base.h5


#########################
### FOLD 1
#########################


All model checkpoint layers were used when initializing TFRobertaModel.

All the layers of TFRobertaModel were initialized from the model checkpoint at roBERTaFiles/input/pretrained-roberta-base.h5.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaModel for predictions without further training.


Epoch 1/3
  7/687 [..............................] - ETA: 5:14:44 - loss: nan - activation_loss: nan - activation_1_loss: nan

In [None]:
print('>>>> OVERALL 5Fold CV Jaccard =',np.mean(jac))

5. Use label smoothing

In [None]:
loss = tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.2)
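
As a worked example of what `label_smoothing=0.2` does to a one hot target: Keras' categorical cross-entropy replaces each target value y with y * (1 - label_smoothing) + label_smoothing / num_classes. The sketch below applies that formula in plain Python; `smooth_labels` is a hypothetical helper, and 5 positions are used for readability.

```python
def smooth_labels(one_hot, label_smoothing):
    # y * (1 - label_smoothing) + label_smoothing / num_classes, applied elementwise
    num_classes = len(one_hot)
    return [y * (1.0 - label_smoothing) + label_smoothing / num_classes for y in one_hot]

target = [0.0, 0.0, 1.0, 0.0, 0.0]  # 5 positions for readability; the model uses MAX_LEN = 96
smoothed = smooth_labels(target, 0.2)
print([round(v, 2) for v in smoothed])  # [0.04, 0.04, 0.84, 0.04, 0.04]
```

The smoothed target still sums to 1, but the model is no longer pushed toward assigning probability exactly 1 to a single, possibly noisy, token position.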

6. Mask words

Use the data loader to randomly replace 5% of words with the [mask] token (id 50264). Within your data loader, use the following code. We also keep track of where the special tokens are so that they don't get replaced.

In [None]:
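# ids: batch of token ids; tru: boolean mask of special-token positions (defined elsewhere in the data loader)
r = np.random.uniform(0,1,ids.shape)
ids[r<0.05] = 50264  # replace about 5% of positions with the mask token
ids[tru] = self.ids[indexes][tru]  # restore the original ids at special-token positions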
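
Since the fragment above depends on data loader state that is not shown, here is a self-contained sketch of the same idea in plain Python. The function `mask_tokens`, its parameters, and the example ids are assumptions for illustration; only the mask id 50264, the 5% rate, and the input layout `[0] + ids + [2,2] + [sentiment] + [2]` come from this notebook.

```python
import random

MASK_ID = 50264  # mask token id quoted in the text

def mask_tokens(ids, special_positions, mask_prob=0.05, rng=random):
    # Replace each non-special token with MASK_ID with probability mask_prob
    masked = []
    for pos, tok in enumerate(ids):
        if pos not in special_positions and rng.random() < mask_prob:
            masked.append(MASK_ID)
        else:
            masked.append(tok)
    return masked

ids = [0, 100, 200, 300, 2, 2, 1313, 2]  # <s>, three made-up word ids, separators, sentiment id
special = {0, 4, 5, 6, 7}                # positions that must never be masked

augmented = mask_tokens(ids, special, rng=random.Random(0))
```

Masking is applied fresh each time a batch is drawn, so the model sees a different corrupted copy of the same tweet in every epoch.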

7. Decay learning rate

In [None]:
def lrfn(epoch):
    dd = {0:4e-5,1:2e-5,2:1e-5,3:5e-6,4:2.5e-6}
    return dd[epoch]
lr = tf.keras.callbacks.LearningRateScheduler(lrfn, verbose=True)
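
The table in `lrfn` starts at 4e-5 and halves the learning rate every epoch. A closed-form equivalent is sketched below; `lrfn_halving` is a hypothetical name. Unlike the dictionary lookup, which raises a `KeyError` from epoch 5 onward, the closed form is defined for any epoch count.

```python
import math

def lrfn_halving(epoch, base_lr=4e-5):
    # Start at base_lr and halve once per epoch
    return base_lr * 0.5 ** epoch

table = {0: 4e-5, 1: 2e-5, 2: 1e-5, 3: 5e-6, 4: 2.5e-6}
matches = all(math.isclose(lrfn_halving(e), lr) for e, lr in table.items())
print(matches)  # True
```
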

8. Train each fold 100% data for submit

After using normal 5 fold CV with early stopping, note how many epochs are optimal. Then, for your LB submission, train all 5 models for that fixed number of epochs, each on 100% of the training data.

9. Sample weight positive and negative

In TensorFlow Keras it is easy to make certain training samples more important. The normal output from class DataGenerator(tf.keras.utils.Sequence) is (X,y). Instead, output (X,y,w), where w has the same shape as y. Then make w=2 for all the positive and negative targets and w=1 for all the neutral targets. Then train with the usual TensorFlow Keras calls.

In [None]:
t_gen = DataGenerator()
model.fit(t_gen)
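
The `DataGenerator` class itself is not shown in this notebook. As a partial sketch under that caveat, the per-sample weights it would emit can be computed as below; `SENTIMENT_WEIGHT` and `sample_weights` are hypothetical names, and expanding the weights to the shape of y, as the text describes, is left to the generator.

```python
# Weights from the text: 2 for positive and negative samples, 1 for neutral samples
SENTIMENT_WEIGHT = {"positive": 2.0, "negative": 2.0, "neutral": 1.0}

def sample_weights(sentiments):
    # One weight per training sample, looked up from its sentiment label
    return [SENTIMENT_WEIGHT[s] for s in sentiments]

batch = ["neutral", "negative", "negative", "positive"]
print(sample_weights(batch))  # [1.0, 2.0, 2.0, 2.0]
```
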

10. Post process

The above 9 changes already predict much of the noise. For example, the above has no problem with the following 2 examples: text " that's awesome!!!" with selected text "s awesome!", and text " I'm thinking... wonderful." with selected text ". wonderful". In each case, the model sees the leading double white space and extracts the single preceding character.

However, the model cannot break a single letter off a word, as in text "went fishing and loved it" with selected text "d loved". This would require breaking a "d" off of the word "and". For these difficult cases, we use post-processing, which increases CV by 0.0025 and LB by 0.0025.

In [None]:
# INPUT s=predicted, t=text, ex=sentiment
# OUTPUT predicted with PP

def applyPP(s,t,ex):

    t1 = t.lower()
    t2 = s.lower()

    # CLEAN PREDICTED
    b = 0
    if len(t2)>=1:
        if t2[0]==' ': 
            b = 1
            t2 = t2[1:]
    x = t1.find(t2)

    # PREDICTED MUST BE SUBSET OF TEXT
    if x==-1:
        print('CANT FIND',x)
        print(t1)
        print(t2)
        return s

    # ADJUST FOR EXTRA WHITE SPACE
    p = np.sum( np.array(t1[:x].split(' '))=='' )
    if (p>2): 
        d = 0
        if p>3: 
            d=p-3
        return t1[x-1-b-d:x+len(t2)]

    # CLEAN BAD PREDICTIONS
    if (len(t2)<=2)|(ex=='neutral'):
        return t1

    return s

Other ideas

Our team tried many more ideas which may have worked if we had spent more time refining them. Below are some interesting things we tried:

- Replacing **** with the original curse word.
- Using part of speech information as an additional feature.
- Using NER model predictions as additional features.
- Comparing test text with train text using Jaccard and using the train selected text when jac >= 0.85 and text length >= 4. (This gained 0.001 on public LB but didn't change private LB.)
- Pretraining with the Sentiment140 dataset as MLM (masked language model).
- Pseudo labeling the Sentiment140 dataset and pretraining as QA (question answer model).
- Training a BERT to choose the best prediction from multiple BERT predictions.
- Stacking BERTs: appending output from one BERT to the QA training data of another BERT.
- Tons of ensembling ideas like Jaccard expectation, softmax manipulations, voting ensembles, etc.
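
The train-text lookup idea from the list above could look roughly like the sketch below. This is a guess at the mechanics, not the team's code: `lookup_selected_text` is a hypothetical name, "text length >= 4" is interpreted as at least 4 words (an assumption), and the best-scoring match above the threshold is used.

```python
def jaccard(str1, str2):
    # Word-level Jaccard similarity on lowercased, whitespace-split words
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    if not a and not b:
        return 0.5
    c = a & b
    return len(c) / (len(a) + len(b) - len(c))

def lookup_selected_text(test_text, train_rows, threshold=0.85, min_words=4):
    # Return the selected_text of the most similar train text, or None if nothing qualifies
    if len(test_text.split()) < min_words:
        return None
    best_score, best_selected = 0.0, None
    for train_text, selected in train_rows:
        score = jaccard(test_text, train_text)
        if score >= threshold and score > best_score:
            best_score, best_selected = score, selected
    return best_selected

train_rows = [("my boss is bullying me...", "bullying me")]
print(lookup_selected_text("my boss is bullying me...", train_rows))  # bullying me
```

When the lookup returns `None`, the model prediction would be used as usual.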