# NBME - Score Clinical Patient Notes
This notebook explains:
1. How to generate tokens
2. How to generate labels (and why some of them are -1)
3. How to get predictions and convert them back to character spans

In [None]:
from ast import literal_eval
from itertools import chain

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from torch import optim
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
from tqdm.notebook import tqdm
from transformers import AutoModel, AutoTokenizer

### 1. Datasets Helper Function
We need to merge `features.csv` and `patient_notes.csv` into `train.csv`.

The `annotation` and `location` columns are stored as stringified lists, so we need to convert them back to Python lists. Let's use `literal_eval` from the `ast` library for this conversion.
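
As a minimal sketch of that conversion, the snippet below parses a `location` value into integer `(start, end)` pairs. The sample strings are made up, and the format (space-separated start and end, with `;` joining several spans inside one entry) is an assumption about the CSV, so check it against your own rows. `parse_locations` is a hypothetical helper, not part of the original notebook.

```python
from ast import literal_eval


def parse_locations(raw):
    """Turn a stringified list such as "['696 724']" into [(696, 724)]."""
    spans = []
    for entry in literal_eval(raw):        # string -> real Python list of strings
        for piece in entry.split(";"):     # assumed: ";" joins several spans
            start, end = piece.split()
            spans.append((int(start), int(end)))
    return spans


# Made-up values in the assumed format
print(parse_locations("['696 724']"))
print(parse_locations("['85 99;126 131', '200 210']"))
print(parse_locations("[]"))
```

Working with integer pairs instead of raw strings makes the later comparison against token offsets a plain numeric check.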



In [None]:
BASE_URL = "../input/nbme-score-clinical-patient-notes"


def process_feature_text(text):
    return text.replace("-OR-", ";-").replace("-", " ")


def prepare_datasets():
    features = pd.read_csv(f"{BASE_URL}/features.csv")
    notes = pd.read_csv(f"{BASE_URL}/patient_notes.csv")
    df = pd.read_csv(f"{BASE_URL}/train.csv")
    df["annotation_list"] = [literal_eval(x) for x in df["annotation"]]
    df["location_list"] = [literal_eval(x) for x in df["location"]]

    merged = df.merge(notes, how="left")
    merged = merged.merge(features, how="left")

    merged["feature_text"] = [process_feature_text(x) for x in merged["feature_text"]]
    merged["feature_text"] = merged["feature_text"].apply(lambda x: x.lower())
    merged["pn_history"] = merged["pn_history"].apply(lambda x: x.lower())

    return merged

In [None]:
df = prepare_datasets()

### Let's Tokenize
Here we use `BertTokenizerFast` instead of `BertTokenizer`. For the difference, see the documentation for [PreTrainedTokenizerFast](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizerFast) vs. [PreTrainedTokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer).

> A tokenizer is in charge of preparing the inputs for a model. The library contains tokenizers for all the models. Most of the tokenizers are available in two flavors: a full python implementation and a “Fast” implementation based on the Rust library 🤗 Tokenizers. The “Fast” implementations allows: a significant speed-up in particular when doing batched tokenization and additional methods to map between the original string (character and words) and the token space (e.g. getting the index of the token comprising a given character or the span of characters corresponding to a given token).

In [None]:
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

In [None]:
auto_tokenizer = AutoTokenizer.from_pretrained("../input/huggingface-bert/bert-base-uncased")

print("Bert Tokenizer Type", type(tokenizer))
print("Auto Tokenizer Type", type(auto_tokenizer))

type(auto_tokenizer) == type(tokenizer)

So both resolve to the same tokenizer class, and `AutoTokenizer` gives us the fast BERT tokenizer as well.

Now let's take the first row of the dataset and prepare it for model input step by step.

In [None]:
first = df.iloc[0]
first

In [None]:
feature_text, pn_history = first.feature_text, first.pn_history
print("feature_text", feature_text)
print("\n")
print("pn_history", pn_history)

In [None]:
# Let's see what the text looks like after it is converted into BERT tokens
tokens = tokenizer.tokenize(feature_text, pn_history)
print("Total Tokens", len(tokens))
print(tokens)

In [None]:
# Character span 696-724 of the note (the same span is used as the annotated location below)
pn_history[696:724]

Let's ask the tokenizer to return the offset mapping by passing `return_offsets_mapping=True`.

- `return_offsets_mapping`:
> Whether or not to return (char_start, char_end) for each token. This is only available on fast tokenizers inheriting from PreTrainedTokenizerFast, if using Python’s tokenizer, this method will raise NotImplementedError.
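
To make the idea of an offset mapping concrete, here is a toy sketch that does not use the BERT tokenizer at all: it splits a string on words and punctuation with a regular expression and records each token's `(char_start, char_end)`. `toy_offsets` and the sample text are hypothetical. A real WordPiece tokenizer splits differently, but the offsets keep the same meaning: slicing the original text with them gives back the token's characters.

```python
import re


def toy_offsets(text):
    """Return (token, (char_start, char_end)) pairs for a word/punctuation split."""
    return [(m.group(), (m.start(), m.end()))
            for m in re.finditer(r"\w+|[^\w\s]", text)]


text = "chest pain, no fever"
pairs = toy_offsets(text)
for token, (start, end) in pairs:
    # The end offset is exclusive, exactly like a Python slice
    print(f"{token!r:10} -> ({start}, {end}) -> {text[start:end]!r}")
```

Because the end offset is exclusive, `text[start:end]` can be used directly, which is what the later cells do with `pn_history`.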

In [None]:
out = tokenizer(
    feature_text,
    pn_history,
    truncation=True,
    max_length=1000,
    padding="max_length",
    return_offsets_mapping=True,
)

We can now generate tokens for training, but we also need labels.

Steps to generate labels:
 1. Zip `sequence_ids` with the offset mapping.
 2. If the sequence id is `None` or `0`, the label is -1 (or any other sentinel value you want).
 3. Otherwise, if the token's offsets fall inside a location from the dataset, it is a true label, which is 1 (the remaining note tokens get 0).


#### What are sequence ids?
They map each token to the id of the sequence it came from:
- `None` for special tokens added around or between sequences,
- `0` for tokens corresponding to words in the first sequence,
- `1` for tokens corresponding to words in the second sequence when a pair of sequences was jointly encoded.
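
The layout described above can be mimicked with a few lines of plain Python. This is only an illustration of the shape of the list, not the tokenizer's implementation; `toy_sequence_ids` is a hypothetical helper, and it assumes that padding positions are reported as `None` just like the other special tokens.

```python
def toy_sequence_ids(n_first, n_second, max_length):
    """Mimic the layout [CLS] first [SEP] second [SEP] [PAD]... as sequence ids."""
    ids = [None] + [0] * n_first + [None] + [1] * n_second + [None]
    ids += [None] * (max_length - len(ids))  # assumed: padding is also None
    return ids


# 2 feature_text tokens, 3 note tokens, padded to 10 positions
print(toy_sequence_ids(n_first=2, n_second=3, max_length=10))
# [None, 0, 0, None, 1, 1, 1, None, None, None]
```

Only the positions marked `1` belong to the patient note, so only those can ever carry a real label.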

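#### Why -1 as the label?
If the sequence id is `None`, the token is a special token such as `[SEP]` or `[CLS]`; if it is `0`, the token comes from the first sequence, which in our case is `feature_text`. Neither kind of token belongs to the patient note, so it can never be part of an annotation. The value -1 marks these positions so that they can be told apart from real 0/1 labels, for example to leave them out of the loss.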
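
A minimal sketch of the label logic, under stated assumptions: `build_labels` and `masked_bce` are hypothetical helpers written in plain Python, the toy sequence ids and offsets are made up, and skipping -1 positions in the loss is one common use of such a sentinel rather than something this notebook shows.

```python
import math


def build_labels(sequence_ids, offsets, locations):
    """-1 for non-note tokens, 1 for note tokens inside a location, else 0."""
    labels = []
    for seq_id, (tok_start, tok_end) in zip(sequence_ids, offsets):
        if seq_id != 1:  # None (special/padding) or 0 (feature_text)
            labels.append(-1)
            continue
        inside = any(tok_start >= s and tok_end <= e for s, e in locations)
        labels.append(1 if inside else 0)
    return labels


def masked_bce(logits, labels):
    """Binary cross-entropy averaged only over positions whose label is not -1."""
    losses = []
    for z, y in zip(logits, labels):
        if y == -1:
            continue  # sentinel positions contribute nothing
        p = 1 / (1 + math.exp(-z))
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return sum(losses) / len(losses)


# Toy layout: [CLS] feat [SEP] "chest" "pain" "today" [SEP]
sequence_ids = [None, 0, None, 1, 1, 1, None]
offsets = [(0, 0), (0, 4), (0, 0), (0, 5), (6, 10), (11, 16), (0, 0)]
labels = build_labels(sequence_ids, offsets, locations=[(0, 10)])
print(labels)  # [-1, -1, -1, 1, 1, 0, -1]
```

Changing the logits at the -1 positions leaves `masked_bce` unchanged, which is the point of the sentinel.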

In [None]:
zipped = zip(out.sequence_ids(), out["offset_mapping"])

idx, (seq_id, offsets) = next(enumerate(zipped))
# The first token is [CLS], so its sequence id is None
if seq_id is None or seq_id == 0:
    print("Sequence id is None or 0, so the label is -1")

In [None]:
loc_list = [696, 724]  # annotated (start, end) character span

for idx, (seq_id, offsets) in enumerate(zip(out.sequence_ids(), out["offset_mapping"])):
    if seq_id != 1:
        continue  # skip special tokens, padding, and feature_text tokens
    token_start, token_end = offsets
    for feature_start, feature_end in [loc_list]:
        # The token lies completely inside the annotated span
        if token_start >= feature_start and token_end <= feature_end:
            print(f"Word {pn_history[token_start:token_end]}, label: 1")

# Prediction
Now let's run a pre-trained model and convert its output back to character spans. Here I am going to use the model from the [Pytorch Bert baseline NBME](https://www.kaggle.com/code/iamsdt/pytorch-bert-baseline-nbme) notebook.

In [None]:
class CustomModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.bert = AutoModel.from_pretrained(config['model_name'])  # BERT model
        self.dropout = nn.Dropout(p=config['dropout'])
        self.config = config
        self.fc1 = nn.Linear(768, 512)
        self.fc2 = nn.Linear(512, 512)
        self.fc3 = nn.Linear(512, 1)

    def forward(self, input_ids, attention_mask, token_type_ids):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        logits = self.fc1(outputs[0])
        logits = self.fc2(self.dropout(logits))
        logits = self.fc3(self.dropout(logits)).squeeze(-1)
        return logits
    

hyperparameters = {
    "max_length": 416,
    "padding": "max_length",
    "return_offsets_mapping": True,
    "truncation": "only_second",
    "model_name": "../input/huggingface-bert/bert-base-uncased",
    "dropout": 0.2,
    "lr": 1e-5,
    "test_size": 0.2,
    "seed": 1268,
    "batch_size": 8
}

model = CustomModel(hyperparameters)
model.load_state_dict(torch.load("../input/pytorch-bert-baseline-nbme/nbme_bert_v2.pth", map_location="cpu"))
model.eval()  # switch off dropout for inference

In [None]:
def create_test_df():
    feats = pd.read_csv(f"{BASE_URL}/features.csv")
    notes = pd.read_csv(f"{BASE_URL}/patient_notes.csv")
    test = pd.read_csv(f"{BASE_URL}/test.csv")

    merged = test.merge(notes, how = "left")
    merged = merged.merge(feats, how = "left")

    def process_feature_text(text):
        return text.replace("-OR-", ";-").replace("-", " ")
    
    merged["feature_text"] = [process_feature_text(x) for x in merged["feature_text"]]
    
    return merged


class SubmissionDataset(Dataset):
    def __init__(self, data, tokenizer, config):
        self.data = data
        self.tokenizer = tokenizer
        self.config = config
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        example = self.data.iloc[idx]  # positional lookup, independent of the index labels
        tokenized = self.tokenizer(
            example["feature_text"],
            example["pn_history"],
            truncation = self.config['truncation'],
            max_length = self.config['max_length'],
            padding = self.config['padding'],
            return_offsets_mapping = self.config['return_offsets_mapping']
        )
        tokenized["sequence_ids"] = tokenized.sequence_ids()

        input_ids = np.array(tokenized["input_ids"])
        attention_mask = np.array(tokenized["attention_mask"])
        token_type_ids = np.array(tokenized["token_type_ids"])
        offset_mapping = np.array(tokenized["offset_mapping"])
        sequence_ids = np.array(tokenized["sequence_ids"]).astype("float16")

        return input_ids, attention_mask, token_type_ids, offset_mapping, sequence_ids


test_df = create_test_df()

submission_data = SubmissionDataset(test_df, tokenizer, hyperparameters)
submission_dataloader = DataLoader(submission_data, batch_size=1, shuffle=False)

In [None]:
test_pn_history = test_df.iloc[0]['pn_history']
print(test_pn_history)

In [None]:
# Take a single batch; the batch size is 1, so it holds only one patient note
batch = next(iter(submission_dataloader))

In [None]:
input_ids = batch[0]
attention_mask = batch[1]
token_type_ids = batch[2]
offset_mapping = batch[3]
sequence_ids = batch[4]

with torch.no_grad():  # gradients are not needed for inference
    logits = model(input_ids, attention_mask, token_type_ids)
predicted = logits.detach().cpu().numpy()
offset_mapping = offset_mapping.numpy()
sequence_ids = sequence_ids.numpy()
print(predicted.size == hyperparameters['max_length'])

#### Steps
1. Pass the model output through a sigmoid function.
2. Skip every token whose sequence id is `None` or `0`, using the same logic as for the labels.
3. If the prediction is greater than 0.5 (the threshold), use the token's offsets to start or extend the current span.
In [None]:
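for logit_row, offsets, seq_ids in zip(predicted, offset_mapping, sequence_ids):
    probs = 1 / (1 + np.exp(-logit_row))  # sigmoid function
    start_idx = None
    end_idx = None

    for prob, offset, seq_id in zip(probs, offsets, seq_ids):
        # Only note tokens (sequence id 1) count. None became NaN in the
        # float16 cast, and NaN == 1 is False, so special and padding
        # tokens are excluded as well.
        is_hit = seq_id == 1 and prob > 0.5

        if is_hit:
            if start_idx is None:  # 0 is a valid offset, so compare with None
                start_idx = offset[0]
            end_idx = offset[1]
        elif start_idx is not None:
            # The span ended (also triggered by the [SEP] after the note)
            print("Current index", f"{start_idx} {end_idx}")
            print("Word: ", test_pn_history[start_idx:end_idx])
            start_idx = None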

That's it, we are done. If you have any thoughts or want to add something, please use the comment section.
> Thank you very much