# 📖 PyTorch "ShortFormer" - RoBERTa w/Chunks - Infer [0.607]

![](https://storage.googleapis.com/kaggle-competitions/kaggle/31779/logos/header.png)

### A NER "ShortFormer" with chunks, strides, and all the clumsy stuff

**This notebook is a baseline model for the competition [Feedback Prize - Evaluating Student Writing](https://www.kaggle.com/c/feedback-prize-2021). It approaches the problem as a token classification problem ("NER"-like) and builds a RoBERTa model (`roberta-large` in the configuration below) with `max_length=512`. To do so, it splits texts longer than 512 tokens into chunks with a stride, and merges the chunk predictions afterwards.**

It is a kind of follow-up of the public work, and relies heavily on the awesome public BigBird baseline by [Chris Deotte](https://www.kaggle.com/cdeotte): [PyTorch - BigBird - NER - [CV 0.615]](https://www.kaggle.com/cdeotte/pytorch-bigbird-ner-cv-0-615). That notebook, in turn, uses code from the following ones:
* [Fine-Tunned on Roberta-base as NER problem [0.533]](https://www.kaggle.com/raghavendrakotala/fine-tunned-on-roberta-base-as-ner-problem-0-533) by [RAGHAVENDRAKUTTALA](https://www.kaggle.com/raghavendrakotala)
* [🎓 Student Writing Competition [Twitch Stream]](https://www.kaggle.com/robikscube/student-writing-competition-twitch) by [Rob Mulla](https://www.kaggle.com/robikscube/)
* [Pytorch NER infer](https://www.kaggle.com/zzy990106/pytorch-ner-infer) by [zzy](https://www.kaggle.com/zzy990106)

Don't forget to upvote all these excellent kernels.



### This is the inference notebook.
### The training notebook is here: [📖 PyTorch- "ShortFormer" w/Chunks - Train [0.604]](https://www.kaggle.com/julian3833/pytorch-roberta-w-chunks-train-0-604)


&nbsp;
I loved the dual training/inference nature of Chris' notebook, but it was too much for me right now (I'm still learning PyTorch), so I unrolled it into the old separate Training/Inference notebooks that we are used to.


Both mostly follow Chris'. The main differences are:
1. The tokenizing step, where I used the Hugging Face tokenizer functionality to handle the chunking. See that step for details about the implementation.
2. Validation is now performed once per epoch.
3. The `inference` and `get_predictions` functions had to be adapted to the chunking as well.



# Please _DO_ upvote if you found this kernel useful or interesting! 🤗

&nbsp;
&nbsp;

&nbsp;
&nbsp;

---

# Oh, ($n^2$)oo!: Some context


Transformer models are great. We all love them. _But_ the self-attention mechanism, the core of the Transformer architecture, has a matrix multiplication that scales (at least) quadratically with the input sequence length in terms of memory. The $QK^T$ product costs a lot, and it makes the vanilla Transformer prohibitive for long sequences. This led to the `512`-token max length in the BERT-like models we see and use constantly.

There is research in the direction of reducing the cost of the attention operation so it scales in a slower fashion with the input length. Two recent models from this research are [LongFormer](https://arxiv.org/abs/2004.05150) and [BigBird](https://arxiv.org/abs/2007.14062), both put on the table by Chris Deotte in this competition (at least for me). 
Both models propose variations of the self-attention mechanism that reduce the memory dependency to $O(n)$, that is, memory scales linearly with the length of the input sequences. Both are "sparse attention" methods, meaning that, instead of each token attending to (and receiving attention from) all of the others, attention is restricted to a small number of tokens. In Longformer, there are two flavors of local windows (normal and dilated) plus a global, task-specific attention, while in BigBird there is a window, a random, and a global attention.


   <center><img src="https://i.imgur.com/t4MYmbj.png" width="50%"></center>
      <center><i>From the LongFormer <a href="https://arxiv.org/abs/2004.05150">paper</a></i></center>


   <center><img src="https://i.imgur.com/4bkL2JA.png" width="50%"></center>
      <center><i>From the BigBird <a href="https://arxiv.org/abs/2007.14062">paper</a></i></center>

&nbsp;
&nbsp;

The lower cost of these sparse self-attention mechanisms allows these models to handle up to `4096` tokens on a normal GPU, that is, `8x` what a normal Transformer can.
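To make the scaling concrete, here is a rough back-of-the-envelope sketch that counts attention score entries (per head, per layer) for full attention versus a purely local window. The window size of `512` and the function names are illustrative assumptions, not the configuration of Longformer or BigBird.

```python
def full_attention_entries(n):
    """Entries in the n x n score matrix QK^T."""
    return n * n


def windowed_attention_entries(n, window):
    """Upper bound on entries when each token attends to at most `window` tokens."""
    return n * min(window, n)


for n in (512, 4096):
    full = full_attention_entries(n)
    local = windowed_attention_entries(n, window=512)  # assumed window size
    print(f"n={n}: full={full:,} windowed={local:,}")
```

Under these assumptions, going from `512` to `4096` tokens multiplies the full score matrix by 64, but the windowed one only by 8.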

Given the lengths of the texts in this competition, it is no surprise that the current public work is focused on those so-called "Longformer" models, and it is probable that they will be an important part of the final ensemble solutions.

But... but..., on the other hand, the old-fashioned 512-token models _do have_ a mechanism to cope with their sequence length limitations. For a given sequence of length greater than `512`, before longformer-like models, the NLP community would:

1. Split it into chunks of `512` tokens (possibly with some overlap)
2. Use the model to process those chunks
3. Merge back the predictions over the chunks to obtain predictions over the full text
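
The three steps above can be sketched in plain Python, independently of any tokenizer. This is only a toy illustration: the helper names are hypothetical, the "model" is a stand-in that labels integers by parity, and the merge rule (keep the first prediction seen for each position) is just one possible choice.

```python
def split_into_chunks(tokens, max_length, stride):
    """Split `tokens` into windows of at most `max_length` items.

    Consecutive windows overlap by `stride` items.
    Returns a list of (start_offset, chunk) pairs.
    """
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    step = max_length - stride
    chunks, start = [], 0
    while True:
        chunks.append((start, tokens[start:start + max_length]))
        if start + max_length >= len(tokens):
            break
        start += step
    return chunks


def merge_chunk_predictions(chunk_preds, total_length):
    """Merge per-chunk predictions, keeping the first one seen per position."""
    merged = [None] * total_length
    for start, preds in chunk_preds:
        for offset, pred in enumerate(preds):
            if merged[start + offset] is None:
                merged[start + offset] = pred
    return merged


tokens = list(range(10))
# 1. Split (tiny sizes so the overlap is easy to see)
chunks = split_into_chunks(tokens, max_length=4, stride=1)
# 2. "Process" each chunk with a stand-in model
chunk_preds = [(start, ["even" if t % 2 == 0 else "odd" for t in chunk])
               for start, chunk in chunks]
# 3. Merge back to one prediction per original position
merged = merge_chunk_predictions(chunk_preds, len(tokens))
print([start for start, _ in chunks], merged)
```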

This mechanism was used, for example, during the recently finished [ChaII competition](https://www.kaggle.com/c/chaii-hindi-and-tamil-question-answering), starting from [Darek Kłeczek](https://www.kaggle.com/thedrcat)'s [baseline](https://www.kaggle.com/thedrcat/chaii-eda-baseline).


It is possible that this mechanism is still relevant despite the sparse-attention models.

Is it possible that ["ShortFormers"](https://www.kaggle.com/c/feedback-prize-2021/discussion/297461) have something to say in this competition? Maybe add some variance to an ensemble? ... Or even more?


---


&nbsp;
&nbsp;

&nbsp;
&nbsp;


Ok, let's go!

# Imports

In [None]:
import os
import pandas as pd
from collections import defaultdict

import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelForTokenClassification

os.environ['TOKENIZERS_PARALLELISM'] = 'false'

# Configuration

In [None]:
MODEL_NAME = "roberta-large"

# The model is a roberta large from the HF model hub with a modified token classification head
# The modification is done in the training notebook
MODEL_PATH = '../input/feedback-prize-roberta-weights/model'

# Weights from the training notebook
MODEL_CHECKPOINT = '../input/feedback-prize-roberta-weights/pytorch_model_e3.bin'

# Test batch size
BATCH_SIZE = 4

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

# The stride (overlap) between chunks of texts during the split
DOC_STRIDE = 128

# Max model length, 512 for roberta
MAX_LENGTH = 512

# Load test data

Code from [Fine-Tunned on Roberta-base as NER problem [0.533]](https://www.kaggle.com/raghavendrakotala/fine-tunned-on-roberta-base-as-ner-problem-0-533)

In [None]:
def load_df_test():
    test_dir = '../input/feedback-prize-2021/test'
    test_names, test_texts = [], []
    for f in os.listdir(test_dir):
        test_names.append(f.replace('.txt', ''))
        # Use a context manager so each file handle is closed after reading
        with open(os.path.join(test_dir, f), 'r') as fh:
            test_texts.append(fh.read())
    df_test = pd.DataFrame({'id': test_names, 'text': test_texts})
    # Whitespace-split words, which are the units fed to the tokenizer
    df_test['text_split'] = df_test.text.str.split()
    return df_test

df_test = load_df_test()
df_test.head()

## Create mapping for output labels

In [None]:
# CREATE DICTIONARIES THAT WE CAN USE DURING TRAIN AND INFER
output_labels = ['O', 'B-Lead', 'I-Lead', 'B-Position', 'I-Position', 'B-Claim', 'I-Claim', 'B-Counterclaim', 'I-Counterclaim', 
          'B-Rebuttal', 'I-Rebuttal', 'B-Evidence', 'I-Evidence', 'B-Concluding Statement', 'I-Concluding Statement']

LABELS_TO_IDS = {v:k for k,v in enumerate(output_labels)}
IDS_TO_LABELS = {k:v for k,v in enumerate(output_labels)}

LABELS_TO_IDS

# Tokenization and chunking

This is the main added value of these notebooks.

In particular, the call to the `tokenizer` with the following parameters:

* The text split already into words, in combination with `is_split_into_words`, as used by Chris Deotte and explained [here](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.prepare_for_tokenization.is_split_into_words).
* `return_overflowing_tokens=True`, which activates the "chunking" mechanism (that is, it generates more than one tokenized sample for texts with more than 512 tokens).
* `stride`: the size of the overlap between chunked parts of a text

`return_overflowing_tokens=True` adds the key `overflow_to_sample_mapping` to the output, which holds, for each generated sample, the index of the original text it came from.

Moreover, the `word_ids(idx)` method returns a back-reference to the word index in the original text, indexed correctly no matter the chunk, doing a lot of the heavy lifting. That is, for each token in the tokenized output, it says which word of the original text generated that token.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

In [None]:
def get_labels(word_ids, word_labels):
    label_ids = []
    for word_idx in word_ids:                            
        if word_idx is None:
            label_ids.append(-100)
        else:
            label_ids.append(LABELS_TO_IDS[word_labels[word_idx]])
    return label_ids


def tokenize(df, to_tensor=True, with_labels=True):
    
    encoded = tokenizer(df['text_split'].tolist(),
                        is_split_into_words=True,
                        return_overflowing_tokens=True,
                        stride=DOC_STRIDE,
                        max_length=MAX_LENGTH,
                        padding="max_length",
                        truncation=True)

    if with_labels:
        encoded['labels'] = []

    encoded['wids'] = []
    
    n = len(encoded['overflow_to_sample_mapping'])
    
    for i in range(n):

        # Map back to original row
        text_idx = encoded['overflow_to_sample_mapping'][i]
        
        # Get word indexes (this is a global index that takes into consideration the chunking :D )
        word_ids = encoded.word_ids(i)
        
        if with_labels:
            # Get word labels of the full un-chunked text
            word_labels = df['entities'].iloc[text_idx]
        
            # Get the labels associated with the word indexes
            label_ids = get_labels(word_ids, word_labels)
            encoded['labels'].append(label_ids)
        encoded['wids'].append([w if w is not None else -1 for w in word_ids])
    
    if to_tensor:
        encoded = {key: torch.as_tensor(val) for key, val in encoded.items()}
    return encoded

In [None]:
# Tokenize the df_test
tokenized_test = tokenize(df_test, with_labels=False)

# A short exploration of the tokenization procedure

In [None]:
# Original number of rows
len(df_test)

In [None]:
# Number of samples to feed the model
len(tokenized_test['input_ids'])

In [None]:
# Back-reference. 
# The first 2 zeroes mean that the first row was split into 2 samples
# And the 4 twos mean that one very long text was split into 4 :P
tokenized_test['overflow_to_sample_mapping']

In [None]:
# The one in position "2" has 1056 words and has to fit into chunks of 512 tokens, with overlaps of DOC_STRIDE (128).
df_test['text_split'].str.len()

In [None]:
# Actually, it has 1304 tokens (which are subwords)
n_tokens = len(tokenizer(df_test.iloc[2]['text'])['input_ids'])
n_tokens

In [None]:
# Further exploration of the case for those who are interested:

## Verification that 4 chunks of 512 with a stride of 128 (DOC_STRIDE) is the correct number of chunks to fit 1304 tokens in
# 512 + 2*(512-128) < n_tokens < 512 + 3*(512-128)

## Original text:
# df_test.iloc[2]['text']

## The four 512-token chunks generated by the tokenization procedure:
# tokenizer.decode(tokenized_test['input_ids'][3])
# tokenizer.decode(tokenized_test['input_ids'][4])
# tokenizer.decode(tokenized_test['input_ids'][5])
# tokenizer.decode(tokenized_test['input_ids'][6])
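
The chunk-count check in the cell above can also be written as a small helper. This is a sketch under two assumptions that are not stated in the notebook: every chunk spends 2 positions on special tokens (`<s>` and `</s>`), and consecutive chunks share exactly `stride` content tokens. The function name is hypothetical, and `1304` is simply the token count quoted above.

```python
import math


def expected_num_chunks(n_tokens, max_length=512, stride=128, n_special=2):
    """Estimate how many chunks a text produces.

    n_tokens: length of the un-chunked tokenization, special tokens included.
    """
    content = n_tokens - n_special    # tokens that come from the text itself
    capacity = max_length - n_special  # content tokens that fit in one chunk
    if content <= capacity:
        return 1
    step = capacity - stride           # new content tokens per extra chunk
    return 1 + math.ceil((content - capacity) / step)


print(expected_num_chunks(1304))
```

Under those assumptions the helper returns `4` for the long text, in line with the four entries for row `2` in `overflow_to_sample_mapping`.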

## Dataset class

With the functional tokenization we performed above, the dataset class is trivial.

In [None]:
class FeedbackPrizeDataset(Dataset):
    def __init__(self, tokenized_ds):
        self.data = tokenized_ds

    def __getitem__(self, index):
        item = {}
        for k in self.data.keys():
            item[k] = self.data[k][index]
        return item

    def __len__(self):
        return len(self.data['input_ids'])

    
# Create Dataset and DataLoader
ds_test = FeedbackPrizeDataset(tokenized_test)
dl_test = DataLoader(ds_test, batch_size=BATCH_SIZE, shuffle=False, 
                              num_workers=2, pin_memory=True)

# Load model with fine-tuned weights

Training notebook: https://www.kaggle.com/julian3833/pytorch-roberta-w-chunks-train-0-604

In [None]:
def load_model():
    model = AutoModelForTokenClassification.from_pretrained(MODEL_PATH)
    model.to(DEVICE)
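    # map_location lets a checkpoint saved on GPU load on a CPU-only machine as well
    model.load_state_dict(torch.load(MODEL_CHECKPOINT, map_location=DEVICE))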
    model.eval()
    print('Model loaded.')
    return model

model = load_model()

# Inference code

We will infer in batches using our data loader, which is faster than inferring one text at a time with a for-loop.

Code taken and adapted from Chris Deotte. In turn his work is based on [this][1] and [this][2].


The adaptations are the minimum required to handle the fact that one text might have generated more than one model sample.

The key `overflow_to_sample_mapping` is a mapping from the sample back to the original text.


During inference our model will make predictions for each subword token. Some single words consist of multiple subword tokens. In the code below, we use a word's first subword token prediction as the label for the entire word. We can try other approaches, like averaging all subword predictions or taking `B` labels before `I` labels etc.
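
As an illustration of one of those alternatives, here is a minimal sketch of averaging the subword scores of each word before taking the argmax. It is not what the code below does. The function name and the toy scores are hypothetical, and it assumes the same convention as `wids`: `-1` marks special and padding tokens.

```python
from collections import defaultdict


def average_subword_scores(word_ids, token_scores):
    """Average per-token class scores over the subwords of each word.

    word_ids: one entry per token (word index, or -1 for special/padding tokens).
    token_scores: one list of class scores per token.
    Returns {word_index: predicted_class_index}.
    """
    sums = {}
    counts = defaultdict(int)
    for wid, scores in zip(word_ids, token_scores):
        if wid == -1:
            continue
        if wid not in sums:
            sums[wid] = [0.0] * len(scores)
        sums[wid] = [a + b for a, b in zip(sums[wid], scores)]
        counts[wid] += 1
    result = {}
    for wid, total in sums.items():
        avg = [s / counts[wid] for s in total]
        result[wid] = max(range(len(avg)), key=avg.__getitem__)
    return result


# Word 1 has two subwords that disagree: the first favors class 0, the second class 1
word_ids = [-1, 0, 1, 1, -1]
token_scores = [[0.0, 0.0], [0.9, 0.1], [0.6, 0.4], [0.1, 0.9], [0.0, 0.0]]
word_preds = average_subword_scores(word_ids, token_scores)
print(word_preds)
```

In this toy case the first-subword rule would assign class `0` to word `1`, while averaging assigns class `1`.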

Moreover, since there are large overlaps, for long texts there will be more than one prediction for many tokens. In this version, we are using the first prediction found and dropping all the rest. A voting mechanism could be implemented.
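
A voting mechanism along those lines could look like the sketch below. It is not used in this notebook. The function name is hypothetical, ties go to the label seen first, and only the first subword of each word casts a vote in each chunk (mirroring the first-subword rule above).

```python
from collections import Counter, defaultdict


def vote_word_labels(chunk_word_ids, chunk_labels):
    """Majority-vote one label per word across the overlapping chunks of one text.

    chunk_word_ids: list of chunks, each a list of word indexes (-1 = special/padding).
    chunk_labels: list of chunks, each a list of predicted labels, one per token.
    Returns the winning labels ordered by word index.
    """
    votes = defaultdict(Counter)
    for word_ids, labels in zip(chunk_word_ids, chunk_labels):
        seen_in_chunk = set()
        for wid, label in zip(word_ids, labels):
            if wid == -1 or wid in seen_in_chunk:
                continue
            seen_in_chunk.add(wid)
            votes[wid][label] += 1
    return [votes[wid].most_common(1)[0][0] for wid in sorted(votes)]


# Three overlapping chunks over a 5-word text; word 2 appears in all of them
chunk_word_ids = [[-1, 0, 1, 2, -1], [-1, 1, 2, 3, -1], [-1, 2, 3, 4, -1]]
chunk_labels = [["O", "B-Claim", "I-Claim", "O", "O"],
                ["O", "I-Claim", "I-Claim", "O", "O"],
                ["O", "I-Claim", "O", "O", "O"]]
voted = vote_word_labels(chunk_word_ids, chunk_labels)
print(voted)
```

Here word `2` gets `O` from the first chunk but `I-Claim` from the other two, so voting keeps `I-Claim` where the first-prediction rule would keep `O`.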



[1]: https://www.kaggle.com/raghavendrakotala/fine-tunned-on-roberta-base-as-ner-problem-0-533
[2]: https://www.kaggle.com/zzy990106/pytorch-ner-infer

In [None]:

def inference(dl):
    
    # These 2 dictionaries will hold text-level data
    # Helping in the merging process by accumulating data
    # Through all the chunks
    predictions = defaultdict(list)
    seen_words_idx = defaultdict(list)
    
    for batch in dl:
        ids = batch["input_ids"].to(DEVICE)
        mask = batch["attention_mask"].to(DEVICE)
        # No gradients are needed at inference time; this saves GPU memory
        with torch.no_grad():
            outputs = model(ids, attention_mask=mask, return_dict=False)
        
        del ids, mask
        
        batch_preds = torch.argmax(outputs[0], axis=-1).cpu().numpy() 
    
        # Go over each prediction, getting the text_id reference
        for k, (chunk_preds, text_id) in enumerate(zip(batch_preds, batch['overflow_to_sample_mapping'].tolist())):
            
            # The word_ids are absolute references in the original text
            word_ids = batch['wids'][k].numpy()
            
            # Map from ids to labels
            chunk_preds = [IDS_TO_LABELS[i] for i in chunk_preds]        
            
            for idx, word_idx in enumerate(word_ids):                            
                if word_idx == -1:
                    pass
                elif word_idx not in seen_words_idx[text_id]:
                    # Add predictions if the word doesn't have a prediction from a previous chunk
                    predictions[text_id].append(chunk_preds[idx])
                    seen_words_idx[text_id].append(word_idx)
    
    final_predictions = [predictions[k] for k in sorted(predictions.keys())]
    return final_predictions


# https://www.kaggle.com/zzy990106/pytorch-ner-infer
# code has been modified from original
# I moved the iteration over the batches to inference because
# samples from the same text might have been split into different batches
def get_predictions(df, dl):
    
    all_labels = inference(dl)
    final_preds = []
    
    for i in range(len(df)):
        idx = df.id.values[i]
        pred = all_labels[i]
        preds = []
        j = 0
        
        while j < len(pred):
            cls = pred[j]
            if cls == 'O': pass
            else: cls = cls.replace('B','I')
            end = j + 1
            while end < len(pred) and pred[end] == cls:
                end += 1
            if cls != 'O' and cls != '' and end - j > 7:
                final_preds.append((idx, cls.replace('I-',''), 
                                    ' '.join(map(str, list(range(j, end))))))
            j = end
        
    # Passing the columns here also works when no span was predicted at all
    df_pred = pd.DataFrame(final_preds, columns=['id', 'class', 'predictionstring'])
    return df_pred

# Inference
We will now infer the test data and write the submission CSV.

In [None]:
df_sub = get_predictions(df_test, dl_test)
display(df_sub.head())

# Post processing

The model currently breaks some discourse blocks at a dot. In `df_sub` above, for example, row 1 ends at word `64` and row 2 starts at word `66`. An analysis of the cases showed that the word in between (`65`) is a dot.

The following code takes care of those cases (and only those), leading to a +`0.02` LB score.



In [None]:
# Add: https://www.kaggle.com/vuxxxx/tensorflow-longformer-ner-postprocessing#Postprocessing
map_clip = {'Lead':9, 
            'Position':5, 
            'Evidence':14, 
            'Claim':3, 
            'Concluding Statement':11,
            'Counterclaim':6, 
            'Rebuttal':4}

def threshold(df):
    df = df.copy()
    df['len'] = df['predictionstring'].apply(lambda x:len(x.split()))
    for key, value in map_clip.items():
    # if df.loc[df['class']==key,'len'] < value 
        index = df.loc[df['class']==key].query(f'len<{value}').index
        df.drop(index, inplace = True)
    df = df.drop('len', axis=1)
    return df

In [None]:
def post_process(df_sub):
    df_post = [df_sub.iloc[0].copy()]
    for i in range(1, len(df_sub)):
        prev_row = df_post[-1]
        row = df_sub.iloc[i].copy()
        
        # Does the row belong to the same text as the previous and to the same discourse class?
        if row['id'] == prev_row['id'] and row['class'] == prev_row['class']:
            try:
                first_pos_row = int(row['predictionstring'].split(" ")[0])
                last_pos_prev_row = int(prev_row['predictionstring'].split(" ")[-1])
                
                # Is the row starting 2 words after the previous?
                if last_pos_prev_row + 2 == first_pos_row:
                    # In that case, merge with the previous one
                    new_id = last_pos_prev_row + 1
                    row['predictionstring'] = prev_row['predictionstring'] + f' {new_id} ' + row['predictionstring']
                    df_post = df_post[:-1]
                    
                df_post.append(row)
            except (ValueError, AttributeError):
                # predictionstring could not be parsed; keep the row unchanged
                df_post.append(row)
        else:
            df_post.append(row)
    df_post = pd.DataFrame(df_post).reset_index(drop=True)
    df_post = threshold(df_post)
    return df_post




# Link Evidence

In [None]:
def jn(pst, start, end):
    return " ".join([str(x) for x in pst[start:end]])


def link_evidence(oof):
    thresh = 1
    idu = oof['id'].unique()
    eoof = oof[oof['class'] == "Evidence"]
    neoof = oof[oof['class'] != "Evidence"]
    for thresh2 in range(26,27, 1):
        retval = []
        for idv in idu:
            for c in  ['Lead', 'Position', 'Evidence', 'Claim', 'Concluding Statement',
                   'Counterclaim', 'Rebuttal']:
                q = eoof[(eoof['id'] == idv) & (eoof['class'] == c)]
                if len(q) == 0:
                    continue
                pst = []
                for i,r in q.iterrows():
                    pst = pst +[-1] + [int(x) for x in r['predictionstring'].split()]
                start = 1
                end = 1
                for i in range(2,len(pst)):
                    cur = pst[i]
                    end = i
                    #if pst[start] == 205:
                    #   print(cur, pst[start], cur - pst[start])
                    if (cur == -1 and c != 'Evidence') or ((cur == -1) and ((pst[i+1] > pst[end-1] + thresh) or (pst[i+1] - pst[start] > thresh2))):
                        retval.append((idv, c, jn(pst, start, end)))
                        start = i + 1
                v = (idv, c, jn(pst, start, end+1))
                #print(v)
                retval.append(v)
        roof = pd.DataFrame(retval, columns = ['id', 'class', 'predictionstring']) 
        roof = roof.merge(neoof, how='outer')
        return roof

# Submit!

In [None]:
#df_post = post_process(df_sub)
df_post = threshold(df_sub)
df_post = link_evidence(df_post)
df_post.to_csv("submission.csv", index=False)
df_post.head()

# Please _DO_ upvote if you found this kernel useful or interesting! 🤗