## **BigBird NER Baseline in PyTorch with  CV Score 0.740**

https://www.kaggle.com/competitions/feedback-prize-2021/overview

- This notebook serves as an introductory guide for Kaggle's "Feedback Prize - Evaluating Student Writing" Competition using PyTorch. It outlines the process of training, inferring, and submitting a model to Kaggle with no internet connection. The current implementation features:

* BigBird as the backbone (utilizing HuggingFace's TokenClassification head)
* NER (Named Entity Recognition) approach (tokenizing with is_split_into_words=True)
* Single fold
* BigBird serves as the backbone for the Named Entity Recognition (NER) model. BigBird is a state-of-the-art transformer architecture designed to handle long sequences of input tokens effectively. It can manage token inputs as wide as 4096, making it particularly suitable for tasks involving large-scale text processing.

- Modifying just a few lines of code allows for the evaluation of different PyTorch backbones and other experiments. When using a backbone that cannot handle 1024-wide tokens (unlike BigBird or LongFormer), a sliding window can be implemented during training and inference. BigBird, a state-of-the-art transformer, is capable of accepting large token inputs as wide as 4096. The arXiv paper can be found here.

- This model employs HuggingFace's AutoModelForTokenClassification. To use a custom head, AutoModel can be used to build one separately. An example can be found in this TensorFlow notebook.

- The tokenization process in this notebook uses tokenizer(txt.split(), is_split_into_words=True), which omits characters like \n. To enable the model to recognize new paragraphs, the code needs to be rewritten without using is_split_into_words=True. An example is available in this TensorFlow notebook.

- Many code snippets in this notebook come from Raghavendrakotala's excellent notebook, which can be found here. Don't forget to upvote Raghavendrakotala's notebook!

* Configuration Options
- This notebook can train a new model or load a pre-trained one (from a previous notebook version). Additionally, it can create new NER labels or load existing ones (from a previous notebook version). In this version, we will load both the model and NER labels.

- This notebook can also load HuggingFace components (like tokenizers) from a Kaggle dataset or download them from the internet. After downloading from the internet, the components can be placed in a Kaggle dataset, allowing for offline access in the future.

In [1]:
import os
# DECLARE HOW MANY GPUS YOU WISH TO USE. 
# KAGGLE ONLY HAS 1, BUT OFFLINE, YOU CAN USE MORE
os.environ["CUDA_VISIBLE_DEVICES"]="0" #0,1,2,3 for four gpu

# VERSION FOR SAVING MODEL WEIGHTS
VER=26

# IF VARIABLE IS NONE, THEN NOTEBOOK COMPUTES TOKENS
# OTHERWISE NOTEBOOK LOADS TOKENS FROM PATH
LOAD_TOKENS_FROM = './input/py-bigbird-v26'

# IF VARIABLE IS NONE, THEN NOTEBOOK TRAINS A NEW MODEL
# OTHERWISE IT LOADS YOUR PREVIOUSLY TRAINED MODEL
LOAD_MODEL_FROM = None #'./input/py-bigbird-v26'

# IF FOLLOWING IS NONE, THEN NOTEBOOK 
# USES INTERNET AND DOWNLOADS HUGGINGFACE 
# CONFIG, TOKENIZER, AND MODEL
DOWNLOADED_MODEL_PATH = None #'./input/py-bigbird-v26' 


In [2]:
if DOWNLOADED_MODEL_PATH is None:
    DOWNLOADED_MODEL_PATH = 'model'    
MODEL_NAME = 'google/bigbird-roberta-base'

In [3]:
from torch import cuda
config = {'model_name': MODEL_NAME,   
         'max_length': 1024,
         'train_batch_size':16,
         'valid_batch_size':16,
         'epochs':5,
         'learning_rates': [2.5e-5, 2.5e-5, 2.5e-6, 2.5e-6, 2.5e-7],
         'max_grad_norm':10,
         'device': 'cuda' if cuda.is_available() else 'cpu'}

In [4]:
# THIS WILL COMPUTE VAL SCORE DURING COMMIT BUT NOT DURING SUBMIT
COMPUTE_VAL_SCORE = True
if len( os.listdir('./input/feedback-prize-2021/test') )>5:
      COMPUTE_VAL_SCORE = False

# How To Submit PyTorch Without Internet
Many people ask me, how do I submit PyTorch models without internet? With HuggingFace Transformer, it's easy. Just download the following 3 things (1) model weights, (2) tokenizer files, (3) config file, and upload them to a Kaggle dataset. Below shows code how to get the files from HuggingFace for Google's BigBird-base. But this same code can download any transformer, like for example roberta-base.

In [5]:
import warnings
warnings.filterwarnings('ignore')

In [6]:
from transformers import *

In [7]:

if DOWNLOADED_MODEL_PATH == 'model':
#     os.mkdir('model')
    
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, add_prefix_space=True)
    tokenizer.save_pretrained('model')

    config_model = AutoConfig.from_pretrained(MODEL_NAME) 
    config_model.num_labels = 15
    config_model.save_pretrained('model')

    backbone = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, config=config_model)
    backbone.save_pretrained('model')

loading configuration file config.json from cache at /home/orangel/.cache/huggingface/hub/models--google--bigbird-roberta-base/snapshots/5a145f7852cba9bd431386a58137bf8a29903b90/config.json
Model config BigBirdConfig {
  "_name_or_path": "google/bigbird-roberta-base",
  "architectures": [
    "BigBirdForPreTraining"
  ],
  "attention_probs_dropout_prob": 0.1,
  "attention_type": "block_sparse",
  "block_size": 64,
  "bos_token_id": 1,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu_new",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 4096,
  "model_type": "big_bird",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "num_random_blocks": 3,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "rescale_embeddings": false,
  "sep_token_id": 66,
  "transformers_version": "4.26.1",
  "type_vocab_

# Load Data and Libraries
In addition to loading the train dataframe, we will load all the train and text files and save them in a dataframe.

In [8]:
import numpy as np, os 
import pandas as pd, gc 
from tqdm import tqdm

from transformers import AutoTokenizer, AutoModelForTokenClassification
from torch.utils.data import Dataset, DataLoader
import torch
from sklearn.metrics import accuracy_score

In [9]:
train_df = pd.read_csv('./input/feedback-prize-2021/train.csv')
print( train_df.shape )
train_df.head()

(144293, 8)


Unnamed: 0,id,discourse_id,discourse_start,discourse_end,discourse_text,discourse_type,discourse_type_num,predictionstring
0,423A1CA112E2,1622628000000.0,8.0,229.0,Modern humans today are always on their phone....,Lead,Lead 1,1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 1...
1,423A1CA112E2,1622628000000.0,230.0,312.0,They are some really bad consequences when stu...,Position,Position 1,45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
2,423A1CA112E2,1622628000000.0,313.0,401.0,Some certain areas in the United States ban ph...,Evidence,Evidence 1,60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
3,423A1CA112E2,1622628000000.0,402.0,758.0,"When people have phones, they know about certa...",Evidence,Evidence 2,76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 9...
4,423A1CA112E2,1622628000000.0,759.0,886.0,Driving is one of the way how to get around. P...,Claim,Claim 1,139 140 141 142 143 144 145 146 147 148 149 15...


In [10]:
# https://www.kaggle.com/raghavendrakotala/fine-tunned-on-roberta-base-as-ner-problem-0-533
test_names, test_texts = [], []

for f in list(os.listdir('./input/feedback-prize-2021/test')):
    test_names.append(f.replace('.txt', ''))
    test_texts.append(open('./input/feedback-prize-2021/test/' + f, 'r').read())
    
test_texts = pd.DataFrame({'id': test_names, 'text': test_texts})
test_texts.head()

Unnamed: 0,id,text
0,0FB0700DAF44,"During a group project, have you ever asked a ..."
1,D72CB1C11673,Making choices in life can be very difficult. ...
2,DF920E0A7337,Have you ever asked more than one person for h...
3,18409261F5C2,80% of Americans believe seeking multiple opin...
4,D46BCB48440A,"When people ask for advice,they sometimes talk..."


############# -------------------- ###################

#### Using string.punctuation and string.digits constants: 
- You can use the string.punctuation constant to get all punctuation characters, 
- and the string.digits constant to get all digits. 
- Then, you can remove these characters from the string using the str.translate() method.

In [11]:
import string
translator = str.maketrans('', '', string.punctuation + string.digits)

In [12]:
# apply the translation table to the "text" column to remove punctuations and digits
test_texts["text"] = test_texts["text"].apply(lambda x: x.translate(translator))
test_texts.head()

Unnamed: 0,id,text
0,0FB0700DAF44,During a group project have you ever asked a g...
1,D72CB1C11673,Making choices in life can be very difficult P...
2,DF920E0A7337,Have you ever asked more than one person for h...
3,18409261F5C2,of Americans believe seeking multiple opinion...
4,D46BCB48440A,When people ask for advicethey sometimes talk ...


In [13]:
# https://www.kaggle.com/raghavendrakotala/fine-tunned-on-roberta-base-as-ner-problem-0-533
test_names, train_texts = [], [] # Initialize two empty lists, test_names and train_texts.

# Iterate through the list of files in the ./input/feedback-prize-2021/train directory using a for loop 
# with a progress bar from tqdm.

# For each file, append the file name without the '.txt' extension to the test_names list
for f in tqdm(list(os.listdir('./input/feedback-prize-2021/train'))):
    test_names.append(f.replace('.txt', ''))
    # Read the content of each file and append it to the train_texts list.
    train_texts.append(open('./input/feedback-prize-2021/train/' + f, 'r').read())

# Create a pandas DataFrame called train_text_df with two columns: 
# 'id' (file names) and 'text' (file contents).    
train_text_df = pd.DataFrame({'id': test_names, 'text': train_texts})
train_text_df.head()

100%|█████████████████████████████████████████████████████████████████████████| 15594/15594 [00:00<00:00, 140490.94it/s]


Unnamed: 0,id,text
0,90F900708083,"Dear Principal,\n\nI heard about the two new c..."
1,29BE597866EE,"Dear Principal of SCHOOL_NAME,\n\nCommunity se..."
2,3B783778AA40,Venus is a planet that for a fact humans could...
3,B5330C56B5B8,I think that Lukes point of viewIf you were go...
4,B0C7779B7276,"Dear Ms. Principal,\n\nCommunity Service is ex..."


In [14]:
###############################################################################

In [15]:
train_text_df["text"] = train_text_df["text"].str.lower()

# apply the translation table to the "text" column to remove punctuations and digits
train_text_df["text"] = train_text_df["text"].apply(lambda x: x.translate(translator))

train_text_df.head()

Unnamed: 0,id,text
0,90F900708083,dear principal\n\ni heard about the two new ce...
1,29BE597866EE,dear principal of schoolname\n\ncommunity serv...
2,3B783778AA40,venus is a planet that for a fact humans could...
3,B5330C56B5B8,i think that lukes point of viewif you were go...
4,B0C7779B7276,dear ms principal\n\ncommunity service is exac...


In [16]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/orangel/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/orangel/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [17]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Load English stopwords
stop_words = set(stopwords.words('english'))

In [18]:
def remove_stopwords(text):
    # Tokenize the text
    words = word_tokenize(text)
    
    # Filter out stopwords
    filtered_words = [word for word in words if word.lower() not in stop_words]
    
    # Reconstruct the text
    filtered_text = ' '.join(filtered_words)
    return filtered_text

In [19]:
test_texts
# Apply the remove_stopwords function to the 'text' column from train_text_df
train_text_df['text'] = train_text_df['text'].apply(remove_stopwords)

# Display the updated DataFrame
train_text_df.head()

Unnamed: 0,id,text
0,90F900708083,dear principal heard two new cell phone polici...
1,29BE597866EE,dear principal schoolname community service re...
2,3B783778AA40,venus planet fact humans could live gases atmo...
3,B5330C56B5B8,think lukes point viewif going ask someone wou...
4,B0C7779B7276,dear ms principal community service exactly so...


In [20]:
# Apply the remove_stopwords function to the 'text' column
train_text_df['text'] = train_text_df['text'].apply(remove_stopwords)

# Display the updated DataFrame
train_text_df.head()

Unnamed: 0,id,text
0,90F900708083,dear principal heard two new cell phone polici...
1,29BE597866EE,dear principal schoolname community service re...
2,3B783778AA40,venus planet fact humans could live gases atmo...
3,B5330C56B5B8,think lukes point viewif going ask someone wou...
4,B0C7779B7276,dear ms principal community service exactly so...


In [21]:
train_text_df['text'] = train_text_df['text'].str.replace('\n\n', ' ')

# Display the updated DataFrame
train_text_df.head()

Unnamed: 0,id,text
0,90F900708083,dear principal heard two new cell phone polici...
1,29BE597866EE,dear principal schoolname community service re...
2,3B783778AA40,venus planet fact humans could live gases atmo...
3,B5330C56B5B8,think lukes point viewif going ask someone wou...
4,B0C7779B7276,dear ms principal community service exactly so...


In [22]:
###############################################################################

# Convert Train Text to NER Labels
We will now convert all text words into NER labels and save in a dataframe.

In [23]:
if not LOAD_TOKENS_FROM:
    all_entities = []
    for ii,i in enumerate(train_text_df.iterrows()):
        if ii%100==0: print(ii,', ',end='')
        total = i[1]['text'].split().__len__()
        entities = ["O"]*total
        
        for j in train_df[train_df['id'] == i[1]['id']].iterrows():
            discourse = j[1]['discourse_type']
            list_ix = [int(x) for x in j[1]['predictionstring'].split(' ')]
            entities[list_ix[0]] = f"B-{discourse}"
            
            for k in list_ix[1:]: entities[k] = f"I-{discourse}"
                
        all_entities.append(entities)
        
    train_text_df['entities'] = all_entities
    train_text_df.to_csv('train_NER.csv',index=False)
    
else:
    from ast import literal_eval
    train_text_df = pd.read_csv(f'{LOAD_TOKENS_FROM}/train_NER.csv')
    # pandas saves lists as string, we must convert back
    train_text_df.entities = train_text_df.entities.apply(lambda x: literal_eval(x) )
    
print( train_text_df.shape )
train_text_df.head()

(15594, 3)


Unnamed: 0,id,text,entities
0,E1FA876D6E6C,"Dear Senator,\n\nI am writting this letter to ...","[O, O, B-Lead, I-Lead, I-Lead, I-Lead, I-Lead,..."
1,8AC1D6E165CD,"Dear Principal, I believe in policy 2. Kids ar...","[O, O, B-Position, I-Position, I-Position, I-P..."
2,45EF6A4EDB1A,"Summer projects are no fun, but they are a gre...","[B-Lead, I-Lead, I-Lead, I-Lead, I-Lead, I-Lea..."
3,B0070361406D,"The author who wrote ""The challenge of Explori...","[B-Lead, I-Lead, I-Lead, I-Lead, I-Lead, I-Lea..."
4,839F4F7F7DD7,Our school systems have seen many changes as t...,"[B-Lead, I-Lead, I-Lead, I-Lead, I-Lead, I-Lea..."


In [24]:
# CREATE DICTIONARIES THAT WE CAN USE DURING TRAIN AND INFER
output_labels = ['O', 'B-Lead', 'I-Lead', 'B-Position', 'I-Position', 'B-Claim', 'I-Claim', 'B-Counterclaim', 'I-Counterclaim', 
          'B-Rebuttal', 'I-Rebuttal', 'B-Evidence', 'I-Evidence', 'B-Concluding Statement', 'I-Concluding Statement']

labels_to_ids = {v:k for k,v in enumerate(output_labels)}
ids_to_labels = {k:v for k,v in enumerate(output_labels)}

In [25]:
labels_to_ids

{'O': 0,
 'B-Lead': 1,
 'I-Lead': 2,
 'B-Position': 3,
 'I-Position': 4,
 'B-Claim': 5,
 'I-Claim': 6,
 'B-Counterclaim': 7,
 'I-Counterclaim': 8,
 'B-Rebuttal': 9,
 'I-Rebuttal': 10,
 'B-Evidence': 11,
 'I-Evidence': 12,
 'B-Concluding Statement': 13,
 'I-Concluding Statement': 14}

# Define the dataset function
Below is our PyTorch dataset function. It always outputs tokens and attention. During training it also provides labels. And during inference it also provides word ids to help convert token predictions into word predictions.

Note that we use `text.split()` and `is_split_into_words=True` when we convert train text to labeled train tokens. This is how the HugglingFace tutorial does it. However, this removes characters like `\n` new paragraph. If you want your model to see new paragraphs, then we need to map words to tokens ourselves using `return_offsets_mapping=True`. See my TensorFlow notebook [here][1] for an example.

Some of the following code comes from the example at HuggingFace [here][2]. However I think the code at that link is wrong. The HuggingFace original code is [here][3]. With the flag `LABEL_ALL` we can either label just the first subword token (when one word has more than one subword token). Or we can label all the subword tokens (with the word's label). In this notebook version, we label all the tokens. There is a Kaggle discussion [here][4]

[1]: https://www.kaggle.com/cdeotte/tensorflow-longformer-ner-cv-0-617
[2]: https://huggingface.co/docs/transformers/custom_datasets#tok_ner
[3]: https://github.com/huggingface/transformers/blob/86b40073e9aee6959c8c85fcba89e47b432c4f4d/examples/pytorch/token-classification/run_ner.py#L371
[4]: https://www.kaggle.com/c/feedback-prize-2021/discussion/296713

In [26]:
LABEL_ALL_SUBTOKENS = True

class dataset(Dataset):
  def __init__(self, dataframe, tokenizer, max_len, get_wids):
        self.len = len(dataframe)
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_len = max_len
        self.get_wids = get_wids # for validation

  def __getitem__(self, index):
        # GET TEXT AND WORD LABELS 
        text = self.data.text[index]        
        word_labels = self.data.entities[index] if not self.get_wids else None

        # TOKENIZE TEXT
        encoding = self.tokenizer(text.split(),
                             is_split_into_words=True,
                             #return_offsets_mapping=True, 
                             padding='max_length', 
                             truncation=True, 
                             max_length=self.max_len)
        word_ids = encoding.word_ids()  
        
        # CREATE TARGETS
        if not self.get_wids:
            previous_word_idx = None
            label_ids = []
            for word_idx in word_ids:                            
                if word_idx is None:
                    label_ids.append(-100)
                elif word_idx != previous_word_idx:              
                    label_ids.append( labels_to_ids[word_labels[word_idx]] )
                else:
                    if LABEL_ALL_SUBTOKENS:
                        label_ids.append( labels_to_ids[word_labels[word_idx]] )
                    else:
                        label_ids.append(-100)
                previous_word_idx = word_idx
            encoding['labels'] = label_ids

        # CONVERT TO TORCH TENSORS
        item = {key: torch.as_tensor(val) for key, val in encoding.items()}
        if self.get_wids: 
            word_ids2 = [w if w is not None else -1 for w in word_ids]
            item['wids'] = torch.as_tensor(word_ids2)
        
        return item

  def __len__(self):
        return self.len

# Create Train and Validation Dataloaders
We will use the same train and validation subsets as my TensorFlow notebook [here][1]. Then we can compare results. And/or experiment with ensembling the validation fold predictions.

[1]: https://www.kaggle.com/cdeotte/tensorflow-longformer-ner-cv-0-617

In [27]:
# CHOOSE VALIDATION INDEXES (that match my TF notebook)
IDS = train_df.id.unique()
print('There are',len(IDS),'train texts. We will split 90% 10% for validation.')

There are 15594 train texts. We will split 90% 10% for validation.


In [28]:
# TRAIN VALID SPLIT 90% 10%
np.random.seed(42)
# Sample 30% of the unique IDs
# sampled_IDS = np.random.choice(IDS, int(0.3 * len(IDS)), replace=False)

# Sample 100% of the unique IDs
train_idx = np.random.choice(np.arange(len(IDS)),int(0.9*len(IDS)),replace=False)
valid_idx = np.setdiff1d(np.arange(len(IDS)),train_idx)
np.random.seed(None)

In [29]:
# CREATE TRAIN SUBSET AND VALID SUBSET
data = train_text_df[['id','text', 'entities']]
train_dataset = data.loc[data['id'].isin(IDS[train_idx]),['text', 'entities']].reset_index(drop=True)
test_dataset = data.loc[data['id'].isin(IDS[valid_idx])].reset_index(drop=True)

print("FULL Dataset: {}".format(data.shape))
print("TRAIN Dataset: {}".format(train_dataset.shape))
print("TEST Dataset: {}".format(test_dataset.shape))

FULL Dataset: (15594, 3)
TRAIN Dataset: (14034, 2)
TEST Dataset: (1560, 3)


In [30]:
tokenizer = AutoTokenizer.from_pretrained(DOWNLOADED_MODEL_PATH) 
training_set = dataset(train_dataset, tokenizer, config['max_length'], False)
testing_set = dataset(test_dataset, tokenizer, config['max_length'], True)

loading file spiece.model
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json


In [31]:
# TRAIN DATASET AND VALID DATASET
train_params = {'batch_size': config['train_batch_size'],
                'shuffle': True,
                'num_workers': 8,
                'pin_memory':True
                }

test_params = {'batch_size': config['valid_batch_size'],
                'shuffle': False,
                'num_workers':8,
                'pin_memory':True
                }

training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)

# TEST DATASET
test_texts_set = dataset(test_texts, tokenizer, config['max_length'], True)
test_texts_loader = DataLoader(test_texts_set, **test_params)

# Train Model
The PyTorch train function is taken from Raghavendrakotala's great notebook [here][1]. I assume it uses a masked loss which avoids computing loss when target is `-100`. If not, we need to update this.

In Kaggle notebooks, we will train our model for 5 epochs `batch_size=4` with Adam optimizer and learning rates `LR = [2.5e-5, 2.5e-5, 2.5e-6, 2.5e-6, 2.5e-7]`. The loaded model was trained offline with `batch_size=8` and `LR = [5e-5, 5e-5, 5e-6, 5e-6, 5e-7]`. (Note the learning rate changes `e-5`, `e-6`, and `e-7`). Using `batch_size=4` will probably achieve a better validation score than `batch_size=8`, but I haven't tried yet.

[1]: https://www.kaggle.com/raghavendrakotala/fine-tunned-on-roberta-base-as-ner-problem-0-533

In [32]:
num_epochs = config['epochs']
unfreeze_epochs = [2, 3] # Epochs when we want to unfreeze additional layers

In [33]:
# CREATE MODEL
config_model = AutoConfig.from_pretrained(DOWNLOADED_MODEL_PATH+'/config.json') 
model = AutoModelForTokenClassification.from_pretrained(DOWNLOADED_MODEL_PATH+'/pytorch_model.bin', config=config_model)

model.to(config['device'])
optimizer = torch.optim.Adam(params=model.parameters(), lr=config['learning_rates'][0])
total_steps = len(training_loader) * num_epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)

loading configuration file model/config.json
Model config BigBirdConfig {
  "_name_or_path": "model/config.json",
  "architectures": [
    "BigBirdForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "attention_type": "block_sparse",
  "block_size": 64,
  "bos_token_id": 1,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu_new",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6",
    "7": "LABEL_7",
    "8": "LABEL_8",
    "9": "LABEL_9",
    "10": "LABEL_10",
    "11": "LABEL_11",
    "12": "LABEL_12",
    "13": "LABEL_13",
    "14": "LABEL_14"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_10": 10,
    "LABEL_11": 11,
    "LABEL_12": 12,
    "LABEL_13": 13,
    "LABEL_14": 14,
  

In [34]:
# https://www.kaggle.com/raghavendrakotala/fine-tunned-on-roberta-base-as-ner-problem-0-533
def train(epoch):
    tr_loss, tr_accuracy = 0, 0
    nb_tr_examples, nb_tr_steps = 0, 0
    #tr_preds, tr_labels = [], []
    
    # put model in training mode
    model.train()
    
    for idx, batch in enumerate(training_loader):
        
        ids = batch['input_ids'].to(config['device'], dtype = torch.long)
        mask = batch['attention_mask'].to(config['device'], dtype = torch.long)
        labels = batch['labels'].to(config['device'], dtype = torch.long)

        loss, tr_logits = model(input_ids=ids, attention_mask=mask, labels=labels, return_dict=False)
        tr_loss += loss.item()

        nb_tr_steps += 1
        nb_tr_examples += labels.size(0)
        
        if idx % 200==0:
            loss_step = tr_loss/nb_tr_steps
            print(f"Training loss after {idx:04d} training steps: {loss_step}")
           
        # compute training accuracy
        flattened_targets = labels.view(-1) # shape (batch_size * seq_len,)
        active_logits = tr_logits.view(-1, model.num_labels) # shape (batch_size * seq_len, num_labels)
        flattened_predictions = torch.argmax(active_logits, axis=1) # shape (batch_size * seq_len,)
        
        # only compute accuracy at active labels
        active_accuracy = labels.view(-1) != -100 # shape (batch_size, seq_len)
        #active_labels = torch.where(active_accuracy, labels.view(-1), torch.tensor(-100).type_as(labels))
        
        labels = torch.masked_select(flattened_targets, active_accuracy)
        predictions = torch.masked_select(flattened_predictions, active_accuracy)
        
        #tr_labels.extend(labels)
        #tr_preds.extend(predictions)

        tmp_tr_accuracy = accuracy_score(labels.cpu().numpy(), predictions.cpu().numpy())
        tr_accuracy += tmp_tr_accuracy
    
        # gradient clipping
        torch.nn.utils.clip_grad_norm_(
            parameters=model.parameters(), max_norm=config['max_grad_norm']
        )
        
        # backward pass
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) # Prevent exploding gradients
        optimizer.step()
        scheduler.step()

    epoch_loss = tr_loss / nb_tr_steps
    tr_accuracy = tr_accuracy / nb_tr_steps
    print(f"Training loss epoch: {epoch_loss}")
    print(f"Training accuracy epoch: {tr_accuracy}")

In [35]:
def unfreeze_bert_layers(model, num_layers_to_unfreeze):
    for layer in model.bert.encoder.layer[-num_layers_to_unfreeze:]:
        for param in layer.parameters():
            param.requires_grad = True

In [36]:
# epoch 5
# Training loss after 0400 training steps: 0.5081020672422395
# LOOP TO TRAIN MODEL (or load model)
if not LOAD_MODEL_FROM:
    for epoch in range(config['epochs']):    
        print(f"### Training epoch: {epoch + 1}")
        
        if epoch in unfreeze_epochs:
            num_layers_to_unfreeze = 3 # Number of layers to unfreeze at each epoch
            unfreeze_bert_layers(model, num_layers_to_unfreeze)
        
        for g in optimizer.param_groups: 
            g['lr'] = config['learning_rates'][epoch]
        lr = optimizer.param_groups[0]['lr']
        print(f'### LR = {lr}\n')
        
        train(epoch)
        torch.cuda.empty_cache()
        gc.collect()
        
    torch.save(model.state_dict(), f'bigbird_v{VER}.pt')
else:
    model.load_state_dict(torch.load(f'{LOAD_MODEL_FROM}/bigbird_v{VER}.pt'))
    print('Model loaded.')

### Training epoch: 1
### LR = 2.5e-05

Training loss after 0000 training steps: 2.6822750568389893
Training loss after 0200 training steps: 1.1064895729520428
Training loss after 0400 training steps: 0.9459145409507942
Training loss after 0600 training steps: 0.8736576732502206
Training loss after 0800 training steps: 0.8251651139295056
Training loss epoch: 0.8100140699968251
Training accuracy epoch: 0.7423063334921682
### Training epoch: 2
### LR = 2.5e-05

Training loss after 0000 training steps: 0.4712039828300476
Training loss after 0200 training steps: 0.6118609144616483
Training loss after 0400 training steps: 0.5998665367130033
Training loss after 0600 training steps: 0.6038803077378805
Training loss after 0800 training steps: 0.6003204164433569
Training loss epoch: 0.5983312263711439
Training accuracy epoch: 0.7973084662773199
### Training epoch: 3
### LR = 2.5e-06

Training loss after 0000 training steps: 0.6126350164413452
Training loss after 0200 training steps: 0.516176141

In [36]:
# Specify the file path where you want to save the model
save_path = "model_02_weights.pth"

In [42]:
# Save the model's state dictionary
torch.save(model.state_dict(), save_path)

In [37]:
# Load the saved state dictionary
model.load_state_dict(torch.load(save_path))

# Set the model to evaluation mode (important for models with batch normalization, dropout, etc.)
model.eval()

BigBirdForTokenClassification(
  (bert): BigBirdModel(
    (embeddings): BigBirdEmbeddings(
      (word_embeddings): Embedding(50358, 768, padding_idx=0)
      (position_embeddings): Embedding(4096, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BigBirdEncoder(
      (layer): ModuleList(
        (0): BigBirdLayer(
          (attention): BigBirdAttention(
            (self): BigBirdBlockSparseAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
            )
            (output): BigBirdSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        

In [38]:
# Clear GPU cache
torch.cuda.empty_cache()

In [39]:
import gpustat

# Get GPU stats
gpu_stats = gpustat.GPUStatCollection.new_query()

# Print formatted GPU stats
print(gpu_stats)

GPUStatCollection(host=Anvil, [
  [36m[0][m [34mNVIDIA RTX A6000[m |[31m 44°C[m, [32m  0 %[m | [36m[1m[33m 1804[m / [33m49140[m MB |
])


# Inference and Validation Code
We will infer in batches using our data loader which is faster than inferring one text at a time with a for-loop. The metric code is taken from Rob Mulla's great notebook [here][2]. Our model achieves validation F1 score 0.615! 

During inference our model will make predictions for each subword token. Some single words consist of multiple subword tokens. In the code below, we use a word's first subword token prediction as the label for the entire word. We can try other approaches, like averaging all subword predictions or taking `B` labels before `I` labels etc.

[1]: https://www.kaggle.com/raghavendrakotala/fine-tunned-on-roberta-base-as-ner-problem-0-533
[2]: https://www.kaggle.com/robikscube/student-writing-competition-twitch

In [40]:
def inference(batch):
                
    # MOVE BATCH TO GPU AND INFER
    ids = batch["input_ids"].to(config['device'])
    mask = batch["attention_mask"].to(config['device'])
    outputs = model(ids, attention_mask=mask, return_dict=False)
    all_preds = torch.argmax(outputs[0], axis=-1).cpu().numpy() 

    # INTERATE THROUGH EACH TEXT AND GET PRED
    predictions = []
    for k,text_preds in enumerate(all_preds):
        token_preds = [ids_to_labels[i] for i in text_preds]

        prediction = []
        word_ids = batch['wids'][k].numpy()  
        previous_word_idx = -1
        for idx,word_idx in enumerate(word_ids):                            
            if word_idx == -1:
                pass
            elif word_idx != previous_word_idx:              
                prediction.append(token_preds[idx])
                previous_word_idx = word_idx
        predictions.append(prediction)
    
    return predictions

In [41]:
# https://www.kaggle.com/zzy990106/pytorch-ner-infer
# code has been modified from original
def get_predictions(df=test_dataset, loader=testing_loader):
    
    # put model in training mode
    model.eval()
    
    # GET WORD LABEL PREDICTIONS
    y_pred2 = []
    for batch in loader:
        labels = inference(batch)
        y_pred2.extend(labels)

    final_preds2 = []
    for i in range(len(df)):

        idx = df.id.values[i]
        #pred = [x.replace('B-','').replace('I-','') for x in y_pred2[i]]
        pred = y_pred2[i] # Leave "B" and "I"
        preds = []
        j = 0
        while j < len(pred):
            cls = pred[j]
            if cls == 'O': j += 1
            else: cls = cls.replace('B','I') # spans start with B
            end = j + 1
            while end < len(pred) and pred[end] == cls:
                end += 1
            
            if cls != 'O' and cls != '' and end - j > 7:
                final_preds2.append((idx, cls.replace('I-',''),
                                     ' '.join(map(str, list(range(j, end))))))
        
            j = end
        
    oof = pd.DataFrame(final_preds2)
    oof.columns = ['id','class','predictionstring']

    return oof

In [42]:
# from Rob Mulla @robikscube
# https://www.kaggle.com/robikscube/student-writing-competition-twitch
def calc_overlap(row):
    """
    Calculates the overlap between prediction and
    ground truth and overlap percentages used for determining
    true positives.
    """
    set_pred = set(row.predictionstring_pred.split(' '))
    set_gt = set(row.predictionstring_gt.split(' '))
    # Length of each and intersection
    len_gt = len(set_gt)
    len_pred = len(set_pred)
    inter = len(set_gt.intersection(set_pred))
    overlap_1 = inter / len_gt
    overlap_2 = inter/ len_pred
    return [overlap_1, overlap_2]


def score_feedback_comp(pred_df, gt_df):
    """
    A function that scores for the kaggle
        Student Writing Competition
        
    Uses the steps in the evaluation page here:
        https://www.kaggle.com/c/feedback-prize-2021/overview/evaluation
    """
    gt_df = gt_df[['id','discourse_type','predictionstring']] \
        .reset_index(drop=True).copy()
    pred_df = pred_df[['id','class','predictionstring']] \
        .reset_index(drop=True).copy()
    pred_df['pred_id'] = pred_df.index
    gt_df['gt_id'] = gt_df.index
    # Step 1. all ground truths and predictions for a given class are compared.
    joined = pred_df.merge(gt_df,
                           left_on=['id','class'],
                           right_on=['id','discourse_type'],
                           how='outer',
                           suffixes=('_pred','_gt')
                          )
    joined['predictionstring_gt'] = joined['predictionstring_gt'].fillna(' ')
    joined['predictionstring_pred'] = joined['predictionstring_pred'].fillna(' ')

    joined['overlaps'] = joined.apply(calc_overlap, axis=1)

    # 2. If the overlap between the ground truth and prediction is >= 0.5, 
    # and the overlap between the prediction and the ground truth >= 0.5,
    # the prediction is a match and considered a true positive.
    # If multiple matches exist, the match with the highest pair of overlaps is taken.
    joined['overlap1'] = joined['overlaps'].apply(lambda x: eval(str(x))[0])
    joined['overlap2'] = joined['overlaps'].apply(lambda x: eval(str(x))[1])


    joined['potential_TP'] = (joined['overlap1'] >= 0.5) & (joined['overlap2'] >= 0.5)
    joined['max_overlap'] = joined[['overlap1','overlap2']].max(axis=1)
    tp_pred_ids = joined.query('potential_TP') \
        .sort_values('max_overlap', ascending=False) \
        .groupby(['id','predictionstring_gt']).first()['pred_id'].values

    # 3. Any unmatched ground truths are false negatives
    # and any unmatched predictions are false positives.
    fp_pred_ids = [p for p in joined['pred_id'].unique() if p not in tp_pred_ids]

    matched_gt_ids = joined.query('potential_TP')['gt_id'].unique()
    unmatched_gt_ids = [c for c in joined['gt_id'].unique() if c not in matched_gt_ids]

    # Get numbers of each type
    TP = len(tp_pred_ids)
    FP = len(fp_pred_ids)
    FN = len(unmatched_gt_ids)
    #calc microf1
    my_f1_score = TP / (TP + 0.5*(FP+FN))
    return my_f1_score

In [43]:
if COMPUTE_VAL_SCORE: # note this doesn't run during submit
    # VALID TARGETS
    valid = train_df.loc[train_df['id'].isin(IDS[valid_idx])]

    # OOF PREDICTIONS
    oof = get_predictions(test_dataset, testing_loader)

    # COMPUTE F1 SCORE
    f1s = []
    CLASSES = oof['class'].unique()
    print()
    for c in CLASSES:
        pred_df = oof.loc[oof['class']==c].copy()
        gt_df = valid.loc[valid['discourse_type']==c].copy()
        f1 = score_feedback_comp(pred_df, gt_df)
        print(c,f1)
        f1s.append(f1)
    print()
    print('Overall',np.mean(f1s))
    print()


Evidence 0.6434053003172004
Concluding Statement 0.7792562265438417
Lead 0.7491995731056563
Position 0.6193724420190996
Claim 0.4979180363795748
Rebuttal 0.37546933667083854
Counterclaim 0.48250460405156537

Overall 0.5924465027268252



# Infer Test Data and Write Submission CSV
We will now infer the test data and write submission CSV

In [44]:
sub = get_predictions(test_texts, test_texts_loader)
sub.head()

Unnamed: 0,id,class,predictionstring
0,0FB0700DAF44,Lead,4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 2...
1,0FB0700DAF44,Position,41 42 43 44 45 46 47 48 49
2,0FB0700DAF44,Claim,57 58 59 60 61 62 63 64
3,0FB0700DAF44,Claim,66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 8...
4,0FB0700DAF44,Evidence,120 121 122 123 124 125 126 127 128 129 130 13...


In [45]:
sub.to_csv("submission.csv", index=False)