<div style="padding: 5px; min-height:60px; margin: 10px; border-radius: 5px; background: #b3ffb3; display: flex;
  justify-content: center;
  align-content: center;
  flex-direction: column;">
    <h1 style="margin:auto; text-align:center;color: black;"> PyTorch QA/NER hybrid approach to identify important spans </h1>
</div>

## ****** Update (ver. 22) ******

I compared mine to ynakama's and I realized I was ignoring the examples with empty annotations. This version includes those samples and also has some improvements for the model and metric computation.

Now there is better CV/LB: 5 fold average CV = 0.86, 5 fold average LB = 0.86

---
## Description

This approach is part QA and part NER, so I'm calling it a QA/NER hybrid. It is like NER because each token is classified, and it is like QA because I put the feature text at the beginning of the input (similar to how the question is put in front of the context). The classifier outputs 1 or 0 for each token, indicating whether that token is important for the feature placed in the input. The feature text helps guide the model to the right spans. To the best of my knowledge, this is how multi-span QA is usually done, because typical extractive QA is mainly suited to single-span prediction.
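
The input layout described above can be sketched without any model. This is only a toy illustration: a whitespace split stands in for a real subword tokenizer, the special-token names are placeholders, and `build_pair` and `make_labels` are hypothetical helpers that do not appear elsewhere in this notebook.

```python
# Toy illustration only: whitespace splitting stands in for a real tokenizer,
# and "[CLS]" / "[SEP]" are placeholder special tokens.
def build_pair(feature_text, note_text):
    feature_tokens = feature_text.split()
    note_tokens = note_text.split()
    tokens = ["[CLS]"] + feature_tokens + ["[SEP]"] + note_tokens + ["[SEP]"]
    # None = special token, 0 = feature ("question"), 1 = note ("context")
    sequence_ids = (
        [None] + [0] * len(feature_tokens) + [None] + [1] * len(note_tokens) + [None]
    )
    return tokens, sequence_ids


def make_labels(sequence_ids, important_note_positions):
    """One float label per token; -100.0 marks tokens the loss should ignore."""
    labels = []
    note_pos = 0
    for seq_id in sequence_ids:
        if seq_id != 1:
            labels.append(-100.0)  # feature text and special tokens
            continue
        labels.append(1.0 if note_pos in important_note_positions else 0.0)
        note_pos += 1
    return labels


tokens, sequence_ids = build_pair("chest pain", "pt reports chest pain today")
labels = make_labels(sequence_ids, important_note_positions={2, 3})
for tok, lab in zip(tokens, labels):
    print(f"{tok:>8} {lab}")
```

Only the note tokens receive 0/1 labels; everything belonging to the feature text or the special tokens is masked out.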


This uses the Hugging Face Trainer to do 5-fold cross-validation of various base models. This is just a starting point, so please comment if you notice anything that could be improved! There are lots of possibilities for boosting the score so this should be an interesting competition! Happy Kaggling 🦢 (it's supposed to be a Goose). To see the actual training results, check out the versions mentioned in the updates below or check out the [Weights and Biases Report at the end of this notebook.](#Weights-and-Biases-Report)

Note: I'm using base models but performance will likely improve with model size!

This notebook is heavily based off of [my notebook](https://www.kaggle.com/nbroad/bigbird-ner-training-pt-gpu-feedback-prize), the work of others in the Feedback Prize competition, and some of the [Hugging Face example scripts.](https://github.com/huggingface/transformers/tree/master/examples/pytorch) 

The inference notebook is [here](https://www.kaggle.com/nbroad/qa-ner-hybrid-infer-nbme)   
There is a bit of a discrepancy between CV (0.830) and LB (0.809).





## ****** Update (ver. 12) ******

I tried 5 different models including `xlnet` and `mpnet` but they failed miserably. Version 14 swaps those two out for `bert` and `roberta`. Results in [Weights and Biases Report at the end of this notebook.](#Weights-and-Biases-Report)

## ****** Update (ver. 14) ******
I thought it would be interesting to show how flexible this notebook is by training 5 different models on 5 different folds. The model definition doesn't change and it still plugs in nicely to the trainer. The models used are: [`bert-base-cased`, `albert-base-v2`, `google/electra-base-discriminator`, `microsoft/deberta-v3-base`, `roberta-base`]. The fast tokenizer for DeBERTa v2/3 is not in the current version of `transformers` but I attach a dataset and add a cell to allow it to be used here. I increase the max length to 512, but I'm not padding every sample to max length, just to the longest sequence in a batch. I use the `group_by_length` flag in `TrainingArguments` which speeds up training by reducing unnecessary padding. Results in [Weights and Biases Report at the end of this notebook.](#Weights-and-Biases-Report)

## ****** Update (ver. 15/16) ******
I realized that there are some subtle differences in tokenizers that can make the labeling step break. `roberta` will tokenize `This is a sentence` as `['This', 'Ġis', 'Ġa', 'Ġsentence']` and the offset mappings will start the last three tokens at the letter and not the whitespace. Specifically, the offset mapping for the second token is (5,7) and not (4,7) where 5 is the index of `i`.  `deberta-v3` on the other hand will start the offset on the whitespace. The same sentence is tokenized as `['▁This', '▁is', '▁a', '▁sentence']` and the offset mapping for the second token is (4,7). This makes it slightly more annoying to label and decode. This version fixes that labeling and decoding process. So far I have only noticed this in DeBERTa, but I've changed the tokenize function and the decoding function.
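
One way to make the two offset conventions agree is to push a span's start past any leading whitespace before using it. The following is a minimal sketch under that assumption; `strip_leading_whitespace` is a hypothetical helper, not the exact code used in this notebook, and the offsets are the ones quoted above.

```python
def strip_leading_whitespace(text, start, end):
    """Move `start` right past whitespace so the span begins on a visible character."""
    while start < end and text[start].isspace():
        start += 1
    return start, end


text = "This is a sentence"

roberta_offset = (5, 7)  # starts on the letter "i"
deberta_offset = (4, 7)  # starts on the space before "is"

normalized_roberta = strip_leading_whitespace(text, *roberta_offset)
normalized_deberta = strip_leading_whitespace(text, *deberta_offset)

# After normalization both conventions point at the same characters.
print(normalized_roberta, normalized_deberta, repr(text[5:7]))
```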


## ****** Update (ver. 17) ******

Trying a BioBert/PubmedBert model for 5 folds.

## ****** Update (ver. 18) ******

Thanks to @ryotak12's [discussion](https://www.kaggle.com/c/nbme-score-clinical-patient-notes/discussion/305599), I realized I made a mistake when making folds. I previously split across `case_num` and `feature_num`, but this leaks notes between folds. It has now been changed to `StratifiedGroupKFold`. These leaks are likely the reason for the large discrepancy between train CV and LB.  
Furthermore, I now use the annotation corrections by @yasufuminakama ([link here](https://www.kaggle.com/yasufuminakama/nbme-deberta-base-baseline-train?scriptVersionId=87264998&cellId=17)) and I also include some pseudolabeling that I did using my 5 folds of deberta-v3-base (from Version 16). 
This run uses 5 different models, same as ver. 14, but I swap bert-base for biobert/pubmedbert.
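
A quick way to detect this kind of leak is to check whether any patient note ends up in more than one fold. The following is a minimal sketch in plain Python; `leaked_groups` is a hypothetical helper and the rows are made-up examples, not real data from the competition.

```python
from collections import defaultdict


def leaked_groups(rows, group_key="pn_num", fold_key="fold"):
    """Return the group ids that appear in more than one fold."""
    folds_per_group = defaultdict(set)
    for row in rows:
        folds_per_group[row[group_key]].add(row[fold_key])
    return sorted(g for g, folds in folds_per_group.items() if len(folds) > 1)


# Made-up rows: splitting per (note, feature) row can put one note in two folds.
row_level_split = [
    {"pn_num": 16, "feature_num": 0, "fold": 0},
    {"pn_num": 16, "feature_num": 1, "fold": 1},
    {"pn_num": 41, "feature_num": 0, "fold": 1},
]

# Splitting per note keeps all rows of a note in the same fold.
note_level_split = [
    {"pn_num": 16, "feature_num": 0, "fold": 0},
    {"pn_num": 16, "feature_num": 1, "fold": 0},
    {"pn_num": 41, "feature_num": 0, "fold": 1},
]

print(leaked_groups(row_level_split))
print(leaked_groups(note_level_split))
```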


## ****** Update (ver. 20) ******

Same as 18 but with 5 folds of deberta-v3-base

## ****** Update (ver. 21) ******

Same as 20 but no pseudolabels and this time, it actually uses the label corrections

## Steps to include fast tokenizer for deberta v2 or v3

This must be done before importing transformers

In [None]:
# The following is necessary if you want to use the fast tokenizer for deberta v2 or v3
import shutil
from pathlib import Path

transformers_path = Path("/opt/conda/lib/python3.7/site-packages/transformers")

input_dir = Path("../input/deberta-v2-3-fast-tokenizer")

convert_file = input_dir / "convert_slow_tokenizer.py"
conversion_path = transformers_path / convert_file.name

if conversion_path.exists():
    conversion_path.unlink()

shutil.copy(convert_file, transformers_path)
deberta_v2_path = transformers_path / "models" / "deberta_v2"

for filename in [
    "tokenization_deberta_v2.py",
    "tokenization_deberta_v2_fast.py",
    "deberta__init__.py",
]:
    if str(filename).startswith("deberta"):
        filepath = deberta_v2_path / str(filename).replace("deberta", "")
    else:
        filepath = deberta_v2_path / filename
    if filepath.exists():
        filepath.unlink()

    shutil.copy(input_dir / filename, filepath)

# Imports, setup, and arguments hidden in next cell

In [None]:
import os
from typing import Any, Optional
from datetime import datetime
from collections import Counter
from pathlib import Path
from dataclasses import dataclass, field
from itertools import chain
from functools import partial
from ast import literal_eval

import torch
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import precision_recall_fscore_support
import plotly.express as px
import plotly.offline as pyo

pyo.init_notebook_mode()
import pandas as pd
import numpy as np
from datasets import load_dataset, Dataset

from transformers import (
    AutoConfig,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    HfArgumentParser,
    Trainer,
    TrainingArguments,
    set_seed,
    logging,
)
from transformers.file_utils import ModelOutput

logging.set_verbosity(logging.WARNING)
%env TOKENIZERS_PARALLELISM=false


@dataclass
class ModelArguments:
    """
    Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
    """

    model_name_or_path: str = field(
        metadata={
            "help": "Path to pretrained model or model identifier from huggingface.co/models"
        }
    )


@dataclass
class DataTrainingArguments:
    """
    Arguments pertaining to what data we are going to input our model for training and eval.
    """

    k_folds: int = field(
        default=5, metadata={"help": "How many folds for kfold validation"}
    )
    num_proc: Optional[int] = field(
        default=None,
        metadata={"help": "The number of processes to use for the preprocessing."},
    )
    max_seq_length: Optional[int] = field(
        default=None,
        metadata={
            "help": "The maximum total input sequence length after tokenization. If set, sequences longer "
            "than this will be truncated, sequences shorter will be padded."
        },
    )

# Configuration

In [None]:
DEBUG = False

# all_models = [
#     'microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext',
#     'albert-base-v2',
#     'google/electra-base-discriminator',
#     'microsoft/deberta-v3-base',
#     'roberta-base'
# ]
all_models = ["../input/deberta-v3-base/deberta-v3-base"] * 5

model_args = ModelArguments(
    model_name_or_path=all_models[0],
)
data_args = DataTrainingArguments(
    k_folds=5,
    max_seq_length=512,
    num_proc=2,
)
training_args = TrainingArguments(
    output_dir="model",
    do_train=True,
    do_eval=True,
    do_predict=True,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    gradient_accumulation_steps=2,
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=5,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    logging_steps=75,
    evaluation_strategy="epoch",
    save_strategy="no",
    seed=18,
    fp16=False,
    report_to="wandb",
    group_by_length=True,
)
set_seed(training_args.seed)

# Quick EDA

In [None]:
feats_df = pd.read_csv("../input/nbme-score-clinical-patient-notes/features.csv")
notes_df = pd.read_csv("../input/nbme-score-clinical-patient-notes/patient_notes.csv")
train_df = pd.read_csv("../input/nbme-score-clinical-patient-notes/train.csv")

### features.csv 

Description from hosts
> `features.csv` - The rubric of features (or key concepts) for each clinical case.  
`feature_num` - A unique identifier for each feature.  
`case_num` - A unique identifier for each case.  
`feature_text` - A description of the feature.  

The `feature_text` values will be prepended onto the texts before being tokenized, similar to QA.

In [None]:
display(feats_df.head())

# 131/143 of the features are unique and it looks like some have OR delimiting multiple names
len(feats_df), feats_df.feature_text.nunique()

### patient_notes.csv

Description from hosts  
> `patient_notes.csv` - A collection of about 40,000 Patient Note history portions. Only a subset of these have features annotated. You may wish to apply unsupervised learning techniques on the notes without annotations. The patient notes in the test set are not included in the public version of this file.  
`pn_num` - A unique identifier for each patient note.  
`case_num` - A unique identifier for the clinical case a patient note represents.  
`pn_history` - The text of the encounter as recorded by the test taker.  

This is the file that we will use as the text to be tokenized. The `num` columns will be used to link each text to `train.csv`

It looks like there are only 10 different cases and about 42k different notes.

In [None]:
display(notes_df.head())
print("DataFrame shape", notes_df.shape)
print("Unique case_num values", notes_df.case_num.unique())
print("Number of unique pn_num", notes_df.pn_num.nunique())
print("Number of unique note texts", notes_df.pn_history.nunique())

#### Somewhat unequal distribution of notes for each case

In [None]:
px.histogram(notes_df, x="case_num", color="case_num")

## `train.csv`

Description from hosts
> `train.csv` - Feature annotations for 1000 of the patient notes, 100 for each of ten cases.  
`id` - Unique identifier for each patient note / feature pair.  
`pn_num` - The patient note annotated in this row.  
`feature_num` - The feature annotated in this row.  
`case_num` - The case to which this patient note belongs.  
`annotation` - The text(s) within a patient note indicating a feature. A feature may be indicated multiple times within a single note.  
`location` - Character spans indicating the location of each annotation within the note. Multiple spans may be needed to represent an annotation, in which case the spans are delimited by a semicolon ;.
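
To make the `location` format concrete, here is a small sketch that turns one location string into the text it points at. The note text and the span values are made up for illustration, `extract_annotation` is a hypothetical helper, and it assumes the end index is exclusive, like Python slicing.

```python
def extract_annotation(note, location):
    """Join the pieces of `note` selected by a location string such as '0 7;16 27'."""
    pieces = []
    for span in location.split(";"):
        start, end = map(int, span.split())
        pieces.append(note[start:end])  # end index treated as exclusive
    return " ".join(pieces)


# Made-up note text, not taken from the dataset.
note = "no heat or cold intolerance"

single_span = extract_annotation(note, "11 27")
multi_span = extract_annotation(note, "0 7;16 27")
print(single_span)
print(multi_span)
```

A single annotation that skips over words in the note, as in the second call, is why one location entry can hold several spans separated by semicolons.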

In [None]:
print(train_df.shape)
train_df.head()

## Checking for blank annotations

About 4.4k rows have blank annotations.

In [None]:
blank_annotations = train_df["annotation"] == "[]"
blank_locations = train_df["location"] == "[]"
both_blank = (train_df["annotation"] == train_df["location"]) & blank_annotations

sum(blank_annotations), sum(blank_locations), sum(both_blank)

## Looking at distribution of case numbers in train.csv

In [None]:
px.histogram(train_df, x="case_num", color="case_num")

## Looking at distribution of patient note numbers in train.csv

In [None]:
px.histogram(train_df, x="pn_num", color="case_num")

## Equal numbers of features for each case_num

In [None]:
px.histogram(train_df, x="feature_num", color="case_num", nbins=1000)

## Data cleanup

The annotation and location columns are loaded as strings. This turns them back into lists.

In [None]:
train_df["anno_list"] = [literal_eval(x) for x in train_df.annotation]
train_df["loc_list"]  = [literal_eval(x) for x in train_df.location]
train_df.head()

# Stratified KFold

Without any leaks this time ;)  
Thanks @theoviel https://www.kaggle.com/c/nbme-score-clinical-patient-notes/discussion/305599#1678215

In [None]:
skf = StratifiedKFold(n_splits=data_args.k_folds, random_state=training_args.seed, shuffle=True)

notes_df["fold"] = -1

for fold, (_, val_idx) in enumerate(skf.split(notes_df, y=notes_df["case_num"])):
    notes_df.loc[val_idx, "fold"] = fold
    
counts = notes_df.groupby(["fold", "pn_num"], as_index=False).count()

# If the number of rows equals the number of unique pn_num values,
# then each pn_num appears in exactly one fold.
# Equivalently, every count should be 1.
print(counts.shape, counts.pn_num.nunique(), counts.case_num.unique())
counts

In [None]:
merged = train_df.merge(notes_df, how="left")
merged = merged.merge(feats_df, how="left")

merged.head(10)

## Correcting some annotations

Huge shoutout to @yasufuminakama for this work: https://www.kaggle.com/yasufuminakama/nbme-deberta-base-baseline-train?scriptVersionId=87264998&cellId=17

In [None]:
# incorrect annotations
merged.loc[338, "anno_list"] =  '["father heart attack"]'
merged.loc[338, "loc_list"] =  '["764 783"]'

merged.loc[621, "anno_list"] =  '["for the last 2-3 months", "over the last 2 months"]'
merged.loc[621, "loc_list"] =  '["77 100", "398 420"]'

merged.loc[655, "anno_list"] =  '["no heat intolerance", "no cold intolerance"]'
merged.loc[655, "loc_list"] =  '["285 292;301 312", "285 287;296 312"]'

merged.loc[1262, "anno_list"] =  '["mother thyroid problem"]'
merged.loc[1262, "loc_list"] =  '["551 557;565 580"]'

merged.loc[1265, "anno_list"] =  '[\'felt like he was going to "pass out"\']'
merged.loc[1265, "loc_list"] =  '["131 135;181 212"]'

merged.loc[1396, "anno_list"] =  '["stool , with no blood"]'
merged.loc[1396, "loc_list"] =  '["259 280"]'

merged.loc[1591, "anno_list"] =  '["diarrhoe non blooody"]'
merged.loc[1591, "loc_list"] =  '["176 184;201 212"]'

merged.loc[1615, "anno_list"] =  '["diarrhea for last 2-3 days"]'
merged.loc[1615, "loc_list"] =  '["249 257;271 288"]'

merged.loc[1664, "anno_list"] =  '["no vaginal discharge"]'
merged.loc[1664, "loc_list"] =  '["822 824;907 924"]'

merged.loc[1714, "anno_list"] =  '["started about 8-10 hours ago"]'
merged.loc[1714, "loc_list"] =  '["101 129"]'

merged.loc[1929, "anno_list"] =  '["no blood in the stool"]'
merged.loc[1929, "loc_list"] =  '["531 539;549 561"]'

merged.loc[2134, "anno_list"] =  '["last sexually active 9 months ago"]'
merged.loc[2134, "loc_list"] =  '["540 560;581 593"]'

merged.loc[2191, "anno_list"] =  '["right lower quadrant pain"]'
merged.loc[2191, "loc_list"] =  '["32 57"]'

merged.loc[2553, "anno_list"] =  '["diarrhoea no blood"]'
merged.loc[2553, "loc_list"] =  '["308 317;376 384"]'

merged.loc[3124, "anno_list"] =  '["sweating"]'
merged.loc[3124, "loc_list"] =  '["549 557"]'

merged.loc[3858, "anno_list"] =  '["previously as regular", "previously eveyr 28-29 days", "previously lasting 5 days", "previously regular flow"]'
merged.loc[3858, "loc_list"] =  '["102 123", "102 112;125 141", "102 112;143 157", "102 112;159 171"]'

merged.loc[4373, "anno_list"] =  '["for 2 months"]'
merged.loc[4373, "loc_list"] =  '["33 45"]'

merged.loc[4763, "anno_list"] =  '["35 year old"]'
merged.loc[4763, "loc_list"] =  '["5 16"]'

merged.loc[4782, "anno_list"] =  '["darker brown stools"]'
merged.loc[4782, "loc_list"] =  '["175 194"]'

merged.loc[4908, "anno_list"] =  '["uncle with peptic ulcer"]'
merged.loc[4908, "loc_list"] =  '["700 723"]'

merged.loc[6016, "anno_list"] =  '["difficulty falling asleep"]'
merged.loc[6016, "loc_list"] =  '["225 250"]'

merged.loc[6192, "anno_list"] =  '["helps to take care of aging mother and in-laws"]'
merged.loc[6192, "loc_list"] =  '["197 218;236 260"]'

merged.loc[6380, "anno_list"] =  '["No hair changes", "No skin changes", "No GI changes", "No palpitations", "No excessive sweating"]'
merged.loc[6380, "loc_list"] =  '["480 482;507 519", "480 482;499 503;512 519", "480 482;521 531", "480 482;533 545", "480 482;564 582"]'

merged.loc[6562, "anno_list"] =  '["stressed due to taking care of her mother", "stressed due to taking care of husbands parents"]'
merged.loc[6562, "loc_list"] =  '["290 320;327 337", "290 320;342 358"]'

merged.loc[6862, "anno_list"] =  '["stressor taking care of many sick family members"]'
merged.loc[6862, "loc_list"] =  '["288 296;324 363"]'

merged.loc[7022, "anno_list"] =  '["heart started racing and felt numbness for the 1st time in her finger tips"]'
merged.loc[7022, "loc_list"] =  '["108 182"]'

merged.loc[7422, "anno_list"] =  '["first started 5 yrs"]'
merged.loc[7422, "loc_list"] =  '["102 121"]'

merged.loc[8876, "anno_list"] =  '["No shortness of breath"]'
merged.loc[8876, "loc_list"] =  '["481 483;533 552"]'

merged.loc[9027, "anno_list"] =  '["recent URI", "nasal stuffines, rhinorrhea, for 3-4 days"]'
merged.loc[9027, "loc_list"] =  '["92 102", "123 164"]'

merged.loc[9938, "anno_list"] =  '["irregularity with her cycles", "heavier bleeding", "changes her pad every couple hours"]'
merged.loc[9938, "loc_list"] =  '["89 117", "122 138", "368 402"]'

merged.loc[9973, "anno_list"] =  '["gaining 10-15 lbs"]'
merged.loc[9973, "loc_list"] =  '["344 361"]'

merged.loc[10513, "anno_list"] =  '["weight gain", "gain of 10-16lbs"]'
merged.loc[10513, "loc_list"] =  '["600 611", "607 623"]'

merged.loc[11551, "anno_list"] =  '["seeing her son knows are not real"]'
merged.loc[11551, "loc_list"] =  '["386 400;443 461"]'

merged.loc[11677, "anno_list"] =  '["saw him once in the kitchen after he died"]'
merged.loc[11677, "loc_list"] =  '["160 201"]'

merged.loc[12124, "anno_list"] =  '["tried Ambien but it didnt work"]'
merged.loc[12124, "loc_list"] =  '["325 337;349 366"]'

merged.loc[12279, "anno_list"] =  '["heard what she described as a party later than evening these things did not actually happen"]'
merged.loc[12279, "loc_list"] =  '["405 459;488 524"]'

merged.loc[12289, "anno_list"] =  '["experienced seeing her son at the kitchen table these things did not actually happen"]'
merged.loc[12289, "loc_list"] =  '["353 400;488 524"]'

merged.loc[13238, "anno_list"] =  '["SCRACHY THROAT", "RUNNY NOSE"]'
merged.loc[13238, "loc_list"] =  '["293 307", "321 331"]'

merged.loc[13297, "anno_list"] =  '["without improvement when taking tylenol", "without improvement when taking ibuprofen"]'
merged.loc[13297, "loc_list"] =  '["182 221", "182 213;225 234"]'

merged.loc[13299, "anno_list"] =  '["yesterday", "yesterday"]'
merged.loc[13299, "loc_list"] =  '["79 88", "409 418"]'

merged.loc[13845, "anno_list"] =  '["headache global", "headache throughout her head"]'
merged.loc[13845, "loc_list"] =  '["86 94;230 236", "86 94;237 256"]'

merged.loc[14083, "anno_list"] =  '["headache generalized in her head"]'
merged.loc[14083, "loc_list"] =  '["56 64;156 179"]'

In [None]:
merged["anno_list"] = [
    literal_eval(x) if isinstance(x, str) else x for x in merged["anno_list"]
]
merged["loc_list"] = [
    literal_eval(x) if isinstance(x, str) else x for x in merged["loc_list"]
]

# Before version 22, I mistakenly removed these
# merged = merged[merged["anno_list"].map(len)!=0].copy().reset_index(drop=True)

merged.head()

## Tokenizing and Adding Labels

Since the labeling is given to us at the character level, the tokenizer needs to have `return_offsets_mapping=True` which returns the start and end indexes for each token. These indexes can then map the char-level labels to tokens. The loss for the model must be calculated at the token level.

Here are the 3 scenarios where I mark a token as a label.



`token_start, token_end` are the start and end indexes of the token. start is inclusive, end is exclusive, just like indexing a string.  
`label_start, label_end` are the start and end indexes of the label. start is inclusive, end is exclusive, just like indexing a string.

1. `token_start <= label_start < token_end`  
The token span overlaps with the start of the label span.
2. `token_start < label_end <= token_end`  
The token span overlaps with the end of the label span.
3. `label_start <= token_start < label_end`  
If it doesn't fall into (1) or (2), then the token span is entirely in the label span.
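
The three conditions can be checked in isolation with a few hand-picked spans. This is a small sketch; `token_is_labeled` is a hypothetical helper that only restates the conditions listed above, and the spans are made-up numbers.

```python
def token_is_labeled(token_start, token_end, label_start, label_end):
    """True if the token span overlaps the label span (ends are exclusive)."""
    return (
        token_start <= label_start < token_end  # (1) token covers the label start
        or token_start < label_end <= token_end  # (2) token covers the label end
        or label_start <= token_start < label_end  # (3) token starts inside the label
    )


label = (5, 12)  # made-up label span

overlap_results = {
    "covers label start": token_is_labeled(3, 7, *label),
    "covers label end": token_is_labeled(10, 14, *label),
    "inside label": token_is_labeled(6, 9, *label),
    "ends where label starts": token_is_labeled(0, 5, *label),
    "starts where label ends": token_is_labeled(12, 15, *label),
}

for name, result in overlap_results.items():
    print(f"{name:>24}: {result}")
```

Because the end indexes are exclusive, a token that merely touches the label boundary is not marked.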

In [None]:
def location_to_ints(loc_list):
    to_return = []

    for loc_str in loc_list:
        loc_strs = loc_str.split(";")

        for loc in loc_strs:
            start, end = loc.split()
            to_return.append((int(start), int(end)))

    return to_return


def process_feature_text(text):
    text = text.replace("-OR-", " or ")
    return text.replace("-", " ")


def tokenize(example, tokenizer):

    tokenized_inputs = tokenizer(
        example["feature_text"],
        example["pn_history"],
        truncation="only_second",
        max_length=data_args.max_seq_length,
        padding=False,
        return_offsets_mapping=True,
    )

    # labels should be float
    labels = [0.0] * len(tokenized_inputs["input_ids"])
    tokenized_inputs["locations"] = location_to_ints(example["loc_list"])
    tokenized_inputs["sequence_ids"] = tokenized_inputs.sequence_ids()

    # Always mask the question part and special tokens, even when the example
    # has no annotations; otherwise they would be trained on as negatives.
    for idx, (seq_id, offsets) in enumerate(
        zip(tokenized_inputs["sequence_ids"], tokenized_inputs["offset_mapping"])
    ):
        if seq_id is None or seq_id == 0:
            # don't calculate loss on question part or special tokens
            labels[idx] = -100.0
            continue

        token_start, token_end = offsets
        for label_start, label_end in tokenized_inputs["locations"]:
            if (
                token_start <= label_start < token_end
                or token_start < label_end <= token_end
                or label_start <= token_start < label_end
            ):
                labels[idx] = 1.0  # labels should be float

    tokenized_inputs["labels"] = labels

    return tokenized_inputs

In [None]:
merged["feature_text"] = [process_feature_text(x) for x in merged["feature_text"]]

In [None]:
if (
    "deberta-v2" in model_args.model_name_or_path
    or "deberta-v3" in model_args.model_name_or_path
):
    from transformers.models.deberta_v2 import DebertaV2TokenizerFast

    tokenizer = DebertaV2TokenizerFast.from_pretrained(model_args.model_name_or_path)
else:
    tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path)

## Double-checking alignment is good

It will print out a random one each time, so you can keep running it to check as many as you want.

In [None]:
random_sample = merged.sample(n=1).iloc[0]
example = {
    "feature_text": random_sample.feature_text,
    "pn_history": random_sample.pn_history,
    "loc_list": random_sample.loc_list,
    "annotations": random_sample.anno_list,
}
print(example, "\n\n")
tokenized = partial(tokenize, tokenizer=tokenizer)(example)


tokens = tokenizer.tokenize(
    example["feature_text"], example["pn_history"], add_special_tokens=True
)

print("Locations")
print(example["loc_list"], "\n")

print("Annotations")
print(example["annotations"], "\n")

print("Token | Label | Token Offsets")
zipped = list(zip(tokens, tokenized["labels"], tokenized["offset_mapping"]))
[x for x in zipped if x[1] > 0]

In [None]:
%%time

dataset = Dataset.from_pandas(
    merged[
        [
            "id",
            "case_num",
            "pn_num",
            "feature_num",
            "loc_list",
            "pn_history",
            "feature_text",
            "fold",
        ]
    ]
)

if DEBUG:
    dataset = dataset.shuffle().select(range(1000))
# This can take up to a minute
tokenized_dataset = dataset.map(
    partial(tokenize, tokenizer=tokenizer),
    desc="Tokenizing and adding labels",
    num_proc=data_args.num_proc,
    batched=False,
)
tokenized_dataset

In [None]:
print(tokenized_dataset[0])

## How long are the texts?

(Before ver. 12) Using roberta, a maximum sequence length of 384 would probably be fine. I chose 416 just to make sure nothing gets truncated.   
(After ver. 12) I'm testing multiple different tokenizers, so now I'm dynamically padding to the longest.

I'm not padding to max length across all samples, just within a batch. This will save compute time. Moreover, I've used the `group_by_length` flag in the `TrainingArguments` which means the excess padding will be limited, speeding up training even more.
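
The saving from dynamic padding can be estimated by counting how many token slots each strategy processes. This is a rough sketch with made-up sequence lengths; `padded_tokens` is a hypothetical helper, and sorting by length is only a simplification of what `group_by_length` does, since the real sampler keeps some randomness.

```python
def padded_tokens(lengths, batch_size, pad_to=None):
    """Total token slots processed when each batch is padded to a common width."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i : i + batch_size]
        width = pad_to if pad_to is not None else max(batch)
        total += width * len(batch)
    return total


# Made-up tokenized lengths for eight samples.
lengths = [120, 480, 95, 300, 130, 460, 110, 310]

fixed = padded_tokens(lengths, batch_size=4, pad_to=512)  # pad everything to max length
dynamic = padded_tokens(lengths, batch_size=4)  # pad to the longest in each batch
grouped = padded_tokens(sorted(lengths), batch_size=4)  # similar lengths batched together

print("real tokens:", sum(lengths))
print("fixed:", fixed, "dynamic:", dynamic, "grouped:", grouped)
```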

In [None]:
tokenized_lengths = [len(x) for x in tokenized_dataset["input_ids"]]

print("The longest is", max(tokenized_lengths))

px.histogram(x=tokenized_lengths, labels={"x":"tokenized_length"})

## Setup Weights and Biases for tracking experiments

If you put `report_to="none"` in the `TrainingArguments` then it won't use Weights and Biases. I like using it because it helps keep track of experiments.

In [None]:
if "wandb" in training_args.report_to:
    !pip install -U wandb -qq
    import wandb

    from kaggle_secrets import UserSecretsClient

    user_secrets = UserSecretsClient()
    wandb_key = user_secrets.get_secret("wandb")

    os.environ["WANDB_PROJECT"] = "NBME"
    os.environ["WANDB_RUN_GROUP"] = "hybrid_" + datetime.now().strftime(
        "%Y-%m-%d %H:%M"
    )
    wandb.login(key=wandb_key)

## Model backbone flexibility

This custom model is a bit funky because I tried to make it versatile to whichever model you would like to use (bert, roberta, electra, etc.). It works by pulling the proper classes based on the model_type specified in the config object. If you know of a better way, by all means please share! Unfortunately there are minor differences in how each model is set up, so there are exceptions here and there for individual models.

In [None]:
@dataclass
class TokenClassifierOutput(ModelOutput):
    """
    Output of the custom token classification model.
    Args:
        loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided) :
            Binary cross-entropy loss, averaged over tokens whose label is not -100.
        proba (`torch.FloatTensor` of shape `(batch_size, sequence_length, 1)`):
            Per-token probabilities (sigmoid of the classifier logits).
    """

    loss: Any = None
    proba: Any = None


# Functions that are similar across all models
def __init__(self, config):
    super(self.PreTrainedModel, self).__init__(config)

    kwargs = {"add_pooling_layer": False}
    if config.model_type not in {"bert", "roberta"}:
        kwargs = {}
    setattr(self, self.backbone_name, self.ModelClass(config, **kwargs))

    classifier_dropout_name = None
    for key in dir(config):
        if ("classifier" in key or "hidden" in key) and "dropout" in key:
            if getattr(config, key) is not None:
                classifier_dropout_name = key
                break

    if classifier_dropout_name is None:
        raise ValueError("Cannot infer dropout name in config")
    classifier_dropout = getattr(config, classifier_dropout_name)
    self.dropout = torch.nn.Dropout(classifier_dropout)
    self.classifier = torch.nn.Linear(config.hidden_size, 1)


def forward(
    self,
    input_ids=None,
    attention_mask=None,
    token_type_ids=None,
    position_ids=None,
    labels=None,
):

    outputs = getattr(self, self.backbone_name)(
        input_ids,
        attention_mask=attention_mask,
        token_type_ids=token_type_ids,
        position_ids=position_ids,
    )

    sequence_output = outputs[0]

    sequence_output = self.dropout(sequence_output)
    logits = self.classifier(sequence_output)

    loss = None
    if labels is not None:
        loss_fct = torch.nn.BCEWithLogitsLoss(reduction="none")
        loss = loss_fct(logits.view(-1, 1), labels.view(-1, 1))

        # this ignores the part of the sequence that got -100 as labels
        loss = torch.masked_select(loss, labels.view(-1, 1) > -1).mean()

    return TokenClassifierOutput(
        loss=loss,
        proba=logits.sigmoid(),
    )
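
The masking used for the loss in `forward` can be reproduced by hand without PyTorch. This is a minimal sketch in plain Python; `masked_bce_with_logits` is a hypothetical helper meant only to show the arithmetic, the logits and labels are made-up numbers, and it assumes at least one token is left after masking.

```python
import math


def masked_bce_with_logits(logits, labels):
    """Mean binary cross-entropy over positions whose label is greater than -1."""
    losses = []
    for z, y in zip(logits, labels):
        if not y > -1:
            continue  # label is -100: question part, special token, or padding
        # Numerically stable form of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))]
        losses.append(max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z))))
    return sum(losses) / len(losses)


# Made-up values: the first position is masked, so its logit never matters.
labels = [-100.0, 1.0, 0.0]
loss_a = masked_bce_with_logits([0.0, 0.0, 5.0], labels)
loss_b = masked_bce_with_logits([9.0, 0.0, 5.0], labels)

print(loss_a, loss_b)
```

Both calls give the same value because the masked position is dropped before averaging.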

In [None]:
def get_model(config, init=False):
    model_type = type(config).__name__[: -len("config")]
    if model_type == "Bart":
        name = f"{model_type}PretrainedModel"
    else:
        name = f"{model_type}PreTrainedModel"
    PreTrainedModel = getattr(__import__("transformers", fromlist=[name]), name)
    name = f"{model_type}Model"
    ModelClass = getattr(__import__("transformers", fromlist=[name]), name)

    model = type(
        "CustomModel",
        (PreTrainedModel,),
        {"__init__": __init__, "forward": forward},
    )

    model._keys_to_ignore_on_load_unexpected = [r"pooler"]
    model._keys_to_ignore_on_load_missing = [r"position_ids"]

    model.PreTrainedModel = PreTrainedModel
    model.ModelClass = ModelClass
    model.backbone_name = config.model_type

    # changes deberta-v2 --> deberta
    if "deberta" in model.backbone_name:
        model.backbone_name = "deberta"

    if init:
        return model(config)
    return model


def get_pretrained(model_name_or_path, config, **kwargs):

    model = get_model(config, init=False)

    return model.from_pretrained(
        pretrained_model_name_or_path=model_name_or_path,
        config=config,
        **kwargs,
    )

In [None]:
@dataclass
class CustomDataCollator(DataCollatorForTokenClassification):
    """
    Data collator that will dynamically pad the inputs received, as well as the labels.
    Have to modify to make label tensors float and not int.
    """

    tokenizer: Any
    padding: Any = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    label_pad_token_id: int = -100
    return_tensors: str = "pt"

    def torch_call(self, features):
        batch = super().torch_call(features)
        label_name = "label" if "label" in features[0].keys() else "labels"

        batch[label_name] = torch.as_tensor(batch[label_name], dtype=torch.float32)

        return batch

## Getting locations based on predictions

A token is marked as important if its logit, after passing through a sigmoid, is greater than 0.5. This is a simple approach, and it would be good to test other thresholds in CV.
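
Trying other thresholds could look roughly like the sketch below. It uses plain Python and made-up probabilities for a single toy note; `spans_from_probs` and `char_f1` are hypothetical helpers that expose the threshold as a parameter and are not the functions used for the actual metric in this notebook.

```python
def spans_from_probs(probs, offsets, threshold):
    """Merge consecutive tokens with probability above `threshold` into character spans."""
    spans, start, end = [], None, None
    for p, (tok_start, tok_end) in zip(probs, offsets):
        if p > threshold:
            if start is None:
                start = tok_start
            end = tok_end
        elif start is not None:
            spans.append((start, end))
            start = None
    if start is not None:
        spans.append((start, end))
    return spans


def char_f1(pred_spans, true_spans):
    """Character-level F1 between two lists of (start, end) spans."""
    pred = {i for s, e in pred_spans for i in range(s, e)}
    true = {i for s, e in true_spans for i in range(s, e)}
    tp = len(pred & true)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(true)
    return 2 * precision * recall / (precision + recall)


# Made-up example: token offsets for "no heat or cold intolerance".
offsets = [(0, 2), (3, 7), (8, 10), (11, 15), (16, 27)]
probs = [0.30, 0.45, 0.05, 0.10, 0.90]
true_spans = [(0, 7), (16, 27)]

scores = {
    t: char_f1(spans_from_probs(probs, offsets, t), true_spans)
    for t in (0.25, 0.4, 0.5)
}
best_threshold = max(scores, key=scores.get)
print(scores, best_threshold)
```

In practice the sweep would run over out-of-fold predictions for all validation notes rather than a single example.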

In [None]:
def kaggle_metrics(eval_prediction, dataset):
    """
    For `compute_metrics`

    Use partial for the args and kwargs to pass other data
    into the `compute_metrics` function.
    """

    pred_idxs = get_location_predictions(eval_prediction.predictions, dataset)

    all_labels = []
    all_preds = []
    for preds, locations, text in zip(
        pred_idxs,
        dataset["locations"],
        dataset["pn_history"],
    ):

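        num_chars = len(text)

        # Gold labels: mark every character inside an annotated span
        char_labels = np.zeros(num_chars, dtype=bool)
        for start, end in locations:
            char_labels[start:end] = True

        # Predictions: mark every character inside a predicted span
        char_preds = np.zeros(num_chars, dtype=bool)
        for start_idx, end_idx in preds:
            char_preds[start_idx:end_idx] = True
            # Drop a leading whitespace character from the span unless
            # the previous character is also predicted
            if (
                text[start_idx].isspace()
                and start_idx > 0
                and not char_preds[start_idx - 1]
            ):
                char_preds[start_idx] = False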

        all_labels.extend(char_labels)
        all_preds.extend(char_preds)

    results = precision_recall_fscore_support(all_labels, all_preds, average="binary")

    return {"precision": results[0], "recall": results[1], "f1": results[2]}


def get_location_predictions(preds, dataset):
    """
    Finds the prediction indexes at the character level.
    """
    all_predictions = []
    for pred, offsets, seq_ids in zip(
        preds, dataset["offset_mapping"], dataset["sequence_ids"]
    ):
        start_idx = None
        current_preds = []
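        for p, o, s_id in zip(pred, offsets, seq_ids):
            # Skip special tokens (None) and the feature text (sequence 0)
            if s_id is None or s_id == 0:
                continue

            # NOTE: assumes `p` is already a probability; if the model returns
            # raw logits, apply a sigmoid first (equivalently, threshold at 0)
            if p > 0.5: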
                if start_idx is None:
                    start_idx = o[0]
                end_idx = o[1]
            elif start_idx is not None:
                current_preds.append((start_idx, end_idx))
                start_idx = None

        if start_idx is not None:
            current_preds.append((start_idx, end_idx))

        all_predictions.append(current_preds)

    return all_predictions
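
To make the token-to-character mapping concrete, here is a minimal, self-contained sketch of the same merging logic on a toy example. `tokens_to_spans`, the probabilities, and the offsets are made up for illustration, and the offsets assume a tokenizer that includes the preceding space in each token's span. The trimming step at the end is a simplified version of the whitespace handling in `kaggle_metrics`.

```python
def tokens_to_spans(probs, offsets, seq_ids, threshold=0.5):
    """Merge consecutive above-threshold context tokens into (start, end) character spans."""
    spans = []
    start = end = None
    for p, (tok_start, tok_end), s_id in zip(probs, offsets, seq_ids):
        # Skip special tokens (None) and the feature text (sequence 0)
        if s_id is None or s_id == 0:
            continue
        if p > threshold:
            if start is None:
                start = tok_start
            end = tok_end
        elif start is not None:
            spans.append((start, end))
            start = None
    # Close a span that runs to the last context token
    if start is not None:
        spans.append((start, end))
    return spans


# Toy input: [CLS] feature [SEP] pt has chest pain today [SEP]
text = "pt has chest pain today"
probs = [0.1, 0.9, 0.1, 0.2, 0.1, 0.8, 0.9, 0.3, 0.1]
offsets = [(0, 0), (0, 10), (0, 0), (0, 2), (2, 6), (6, 12), (12, 17), (17, 23), (0, 0)]
seq_ids = [None, 0, None, 1, 1, 1, 1, 1, None]

spans = tokens_to_spans(probs, offsets, seq_ids)
print(spans, repr(text[spans[0][0]:spans[0][1]]))

# Simplified trimming: drop a leading whitespace character from each span
trimmed = [(s + 1, e) if text[s].isspace() else (s, e) for s, e in spans]
print(trimmed, repr(text[trimmed[0][0]:trimmed[0][1]]))
```

The high score on the feature token is ignored because it belongs to sequence 0, and the two adjacent context tokens are merged into one span.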

## Train!

This trains all k folds and saves each model.
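
The per-fold split can be sketched with plain Python lists standing in for the `datasets` objects. The rows and the way folds are assigned here are hypothetical; the only thing taken from the notebook is the idea of filtering on a precomputed `fold` column.

```python
# Toy rows with a precomputed `fold` column (hypothetical data)
rows = [{"id": i, "fold": i % 5} for i in range(10)]
k_folds = 5

splits = {}
for fold in range(k_folds):
    # Everything outside the current fold is used for training...
    train_rows = [r for r in rows if r["fold"] != fold]
    # ...and the held-out fold is used for evaluation
    eval_rows = [r for r in rows if r["fold"] == fold]
    splits[fold] = (train_rows, eval_rows)
    print(f"fold {fold}: {len(train_rows)} train / {len(eval_rows)} eval")
```

Each row ends up in exactly one evaluation split, so the per-fold scores together cover the whole training set.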

In [None]:
if DEBUG:
    training_args.num_train_epochs = 1

previous_config = None
for fold, model_name in zip(range(data_args.k_folds), all_models):

    # The logging verbosity seems to get reset after each fold, and the default
    # can print out a lot of information that I don't really care about. When
    # debugging, you should definitely not hide these messages though 😉
    if not DEBUG:
        logging.set_verbosity(logging.CRITICAL)

    print(f"Starting training for fold {fold} using {model_name}")

    config = AutoConfig.from_pretrained(
        model_name,
    )
    using_deberta_v2_3 = "deberta-v2" in model_name or "deberta-v3" in model_name

    # Only re-run when the config changes
    if previous_config is None or previous_config != config.__dict__:

        if using_deberta_v2_3:
            from transformers.models.deberta_v2 import DebertaV2TokenizerFast

            tokenizer = DebertaV2TokenizerFast.from_pretrained(model_name)
        else:
            tokenizer = AutoTokenizer.from_pretrained(model_name)

        data_collator = CustomDataCollator(
            tokenizer,
            pad_to_multiple_of=8 if training_args.fp16 else None,
            padding="longest",
        )

        print("Tokenizing dataset")
        tokenized_dataset = dataset.map(
            partial(
                tokenize,
                tokenizer=tokenizer,
            ),
            desc="Tokenizing and adding labels",
            num_proc=4,
        )

    model_args.model_name_or_path = model_name  # So wandb will track it

    if "wandb" in training_args.report_to:
        wandb_config = {
            **model_args.__dict__,
            **data_args.__dict__,
            **training_args.__dict__,
            **config.__dict__,
        }
        wandb_config["fold"] = fold
        wandb.init(config=wandb_config, group=os.environ["WANDB_RUN_GROUP"])

    model = get_pretrained(model_name, config)

    train_dataset = tokenized_dataset.filter(lambda x: x["fold"] != fold, num_proc=data_args.num_proc)
    eval_dataset = tokenized_dataset.filter(lambda x: x["fold"] == fold, num_proc=data_args.num_proc)

    compute_metrics = partial(kaggle_metrics, dataset=eval_dataset)

    # Initialize our Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
    )

    trainer.train()
    trainer.save_model(f"fold{fold}")

    if "wandb" in training_args.report_to:
        wandb.finish()

    previous_config = config.__dict__

# Weights and Biases Report

<iframe src="https://wandb.ai/nbroad/NBME/reports/Hybrid-QA-NER-Train-Results--VmlldzoxNTE4Mjk1" style="border:none;height:1024px;width:100%"></iframe>