# Introduction

In this tutorial, we explore the medical natural language inference (NLI) task, in which we classify a pair of medical sentences as entailment, contradiction, or neutral. Adapted from the BERT for Text Classification on GLUE [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/text_classification.ipynb).

This notebook uses Python 3 and transformers 4.12.3.

# Dataset



The dataset used in this tutorial is created from MIMIC-III. Thus, make sure you have the corresponding credentials before proceeding. <br>

Download the original NLI data [here](https://physionet.org/content/mednli/1.0.0/). <br>

Download MedNLI test set [here](https://physionet.org/content/mednli-bionlp19/1.0.1/). <br>

Upload to Google Drive.



# Preparation

Mount Google Drive to access files stored in Drive.

In [None]:
### Google Colab Mount Drive ###

# Load the Drive helper and mount
from google.colab import drive

# This will prompt for authorization.
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%cd drive/MyDrive

/content/drive/MyDrive


Go to the original dataset directory.


In [None]:
# cd into original data directory
%cd nli-original/

/content/drive/MyDrive/nli-original


The directory should contain the train/dev/test splits, each in its own JSON Lines (jsonl) file: mli_train_v1.jsonl, mli_dev_v1.jsonl, and mli_test_v1.jsonl. More on JSON Lines [here](https://jsonlines.org).
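As a quick illustration of the JSON Lines format, the following minimal sketch parses two made-up records from an in-memory string using only the standard library. The sample sentences and the helper name `read_jsonl` are hypothetical; only the field names `sentence1`, `sentence2`, and `gold_label` come from the dataset.

```python
import io
import json

def read_jsonl(handle):
    """Parse one JSON object per non-empty line (hypothetical helper)."""
    return [json.loads(line) for line in handle if line.strip()]

# Made-up records for illustration; not taken from the dataset.
sample = io.StringIO(
    '{"sentence1": "Patient has a fever.", "sentence2": "Patient is febrile.", "gold_label": "entailment"}\n'
    '{"sentence1": "Patient has a fever.", "sentence2": "Patient is afebrile.", "gold_label": "contradiction"}\n'
)
records = read_jsonl(sample)
print(len(records), records[0]["gold_label"])
```

Each line is an independent JSON object, so a file can be read one record at a time without loading it all into memory.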

In [None]:
!ls

LICENSE.txt	  mli_test_v1.jsonl   README.txt      test-1
mli_dev_v1.jsonl  mli_train_v1.jsonl  SHA256SUMS.txt


Install relevant packages.

In [None]:
! pip install datasets transformers

In [None]:
from huggingface_hub import notebook_login

notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token
Authenticated through git-crendential store but this isn't the helper defined on your machine.
You will have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal to set it as the default

git config --global credential.helper store


In [None]:
!apt install git-lfs

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  git-lfs
0 upgraded, 1 newly installed, 0 to remove and 37 not upgraded.
Need to get 2,129 kB of archives.
After this operation, 7,662 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 git-lfs amd64 2.3.4-1 [2,129 kB]
Fetched 2,129 kB in 2s (913 kB/s)
Selecting previously unselected package git-lfs.
(Reading database ... 155219 files and directories currently installed.)
Preparing to unpack .../git-lfs_2.3.4-1_amd64.deb ...
Unpacking git-lfs (2.3.4-1) ...
Setting up git-lfs (2.3.4-1) ...
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...


In [None]:
import transformers

print(transformers.__version__)

4.12.3


Check availability of GPU.

In [None]:
import torch
# If there's a GPU available...
if torch.cuda.is_available():    
    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla K80


# Loading Dataset

Select a pretrained model from the Hugging Face model hub [here](https://huggingface.co/models).  <br>
Make sure the model's config file lists "BertForMaskedLM" under "architectures". <br>
Some model choices: <br>
[bert-base-uncased](https://huggingface.co/bert-base-uncased) <br>
PubmedBERT ([abstract](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract) or [fulltext](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext)) <br>

In [None]:
# model_checkpoint = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"
# model_checkpoint = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
# model_checkpoint = "bert-base-uncased"

# in this example, we use PubMedBert fulltext
model_checkpoint = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"

batch_size = 8

In [None]:
from datasets import load_dataset, load_metric, DatasetDict, Metric

Inspect available metrics.

In [None]:
from datasets import list_metrics
metrics_list = list_metrics()
metrics_list

['accuracy',
 'bertscore',
 'bleu',
 'bleurt',
 'cer',
 'chrf',
 'comet',
 'competition_math',
 'coval',
 'cuad',
 'f1',
 'gleu',
 'glue',
 'google_bleu',
 'indic_glue',
 'matthews_correlation',
 'meteor',
 'pearsonr',
 'precision',
 'recall',
 'rouge',
 'sacrebleu',
 'sari',
 'seqeval',
 'spearmanr',
 'squad',
 'squad_v2',
 'super_glue',
 'ter',
 'wer',
 'wiki_split',
 'xnli']


Since the evaluation metric for the original task is accuracy, we will use the same metric.

In [None]:
metric = load_metric('accuracy')

Downloading:   0%|          | 0.00/1.42k [00:00<?, ?B/s]

In [None]:
metric

Metric(name: "accuracy", features: {'predictions': Value(dtype='int32', id=None), 'references': Value(dtype='int32', id=None)}, usage: """
Args:
    predictions: Predicted labels, as returned by a model.
    references: Ground truth labels.
    normalize: If False, return the number of correctly classified samples.
        Otherwise, return the fraction of correctly classified samples.
    sample_weight: Sample weights.
Returns:
    accuracy: Accuracy score.
Examples:

    >>> accuracy_metric = datasets.load_metric("accuracy")
    >>> results = accuracy_metric.compute(references=[0, 1], predictions=[0, 1])
    >>> print(results)
    {'accuracy': 1.0}
""", stored examples: 0)

An example use of the metric.

In [None]:
import numpy as np
fake_preds = np.random.randint(0, 3, size=(64,))
fake_labels = np.random.randint(0, 3, size=(64,))
print(metric.compute(predictions=fake_preds, references=fake_labels))

{'accuracy': 0.375}
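For intuition, accuracy is simply the fraction of predictions that match the references. The sketch below is a plain-Python illustration with a hypothetical helper name `accuracy`; it is not the implementation used by `load_metric`. With uniformly random predictions and labels over three classes, a score near 1/3 (such as the 0.375 above) is expected.

```python
def accuracy(predictions, references):
    """Fraction of positions where prediction equals reference (illustrative helper)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Toy label ids, not model output.
print(accuracy([0, 1, 2, 2], [0, 1, 1, 2]))  # 3 of 4 match
```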


Load data into a DatasetDict directly from jsonl files.

In [None]:
train_file = 'mli_train_v1.jsonl'
dev_file = 'mli_dev_v1.jsonl'
test_file = 'mli_test_v1.jsonl'

In [None]:
dataset = DatasetDict.from_json({'train': train_file, 'dev': dev_file, 'test': test_file})

Using custom data configuration default-a0d3fe7e9550c67c


Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-a0d3fe7e9550c67c/0.0.0...


  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/3 [00:00<?, ?it/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-a0d3fe7e9550c67c/0.0.0. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'pairID', 'sentence1_parse', 'sentence1_binary_parse', 'sentence2', 'sentence2_parse', 'sentence2_binary_parse', 'gold_label'],
        num_rows: 11232
    })
    dev: Dataset({
        features: ['sentence1', 'pairID', 'sentence1_parse', 'sentence1_binary_parse', 'sentence2', 'sentence2_parse', 'sentence2_binary_parse', 'gold_label'],
        num_rows: 1395
    })
    test: Dataset({
        features: ['sentence1', 'pairID', 'sentence1_parse', 'sentence1_binary_parse', 'sentence2', 'sentence2_parse', 'sentence2_binary_parse', 'gold_label'],
        num_rows: 1422
    })
})

In [None]:
dataset['train'][0]

{'gold_label': 'entailment',
 'pairID': '23eb94b8-66c7-11e7-a8dc-f45c89b91419',
 'sentence1': 'Labs were notable for Cr 1.7 (baseline 0.5 per old records) and lactate 2.4.',
 'sentence1_binary_parse': '( Labs ( ( were ( notable ( for ( ( ( ( Cr 1.7 ) ( -LRB- ( ( ( baseline 0.5 ) ( per ( old records ) ) ) -RRB- ) ) ) and ) ( lactate 2.4 ) ) ) ) ) . ) )',
 'sentence1_parse': '(ROOT (S (NP (NNPS Labs)) (VP (VBD were) (ADJP (JJ notable) (PP (IN for) (NP (NP (NP (NN Cr) (CD 1.7)) (PRN (-LRB- -LRB-) (NP (NP (NN baseline) (CD 0.5)) (PP (IN per) (NP (JJ old) (NNS records)))) (-RRB- -RRB-))) (CC and) (NP (NN lactate) (CD 2.4)))))) (. .)))',
 'sentence2': ' Patient has elevated Cr',
 'sentence2_binary_parse': '( Patient ( has ( elevated Cr ) ) )',
 'sentence2_parse': '(ROOT (S (NP (NN Patient)) (VP (VBZ has) (NP (JJ elevated) (NN Cr)))))'}

The following function helps to better visualize each data instance.

In [None]:
def show_one(example):
  print(f"Sentence 1: {example['sentence1']}")
  print(f"Sentence 2: {example['sentence2']}")
  print(f"Ground truth: {example['gold_label']}")

In [None]:
show_one(dataset['train'][3])

Sentence 1: Nystagmus and twiching of R arm was noted.
Sentence 2:  The patient had abnormal neuro exam.
Ground truth: entailment


# Preprocessing

In [None]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/385 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/221k [00:00<?, ?B/s]

In the preprocess function, in addition to tokenizing the sentences, we also convert the labels to ids. <br>

Important inputs for the BERT model: <br>
input_ids: indices of the input sequence tokens in the vocabulary <br>
attention_mask: mask to avoid performing attention on padding token indices <br>
labels: labels for computing the loss <br>

The [default data collator](https://huggingface.co/transformers/main_classes/data_collator.html#default-data-collator) does no additional preprocessing, so the property names of each input object are used as the corresponding inputs to the model. Make sure the label column is named 'label'. <br>
Alternatively, you can also write your own DataCollator. 
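To make the idea of a custom collator concrete, here is a minimal sketch in plain Python, assuming each feature is a dict with `input_ids`, `attention_mask`, and `label`. The function name `pad_collate` and the default `pad_id=0` are assumptions for illustration; a real collator would also handle `token_type_ids` and return tensors, both of which are omitted here.

```python
def pad_collate(features, pad_id=0):
    """Pad every example in a batch to the length of the longest one (sketch)."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * n_pad)
        # Padding positions get mask 0 so attention ignores them.
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
        # The model's forward pass takes 'labels', so 'label' is renamed here.
        batch["labels"].append(f["label"])
    return batch

# Toy token ids for illustration only.
toy = [
    {"input_ids": [2, 11, 12, 3], "attention_mask": [1, 1, 1, 1], "label": 0},
    {"input_ids": [2, 11, 3], "attention_mask": [1, 1, 1], "label": 2},
]
print(pad_collate(toy))
```

Padding per batch rather than to a fixed maximum length keeps short batches small, which is the main reason to pad in the collator instead of in the preprocess function.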

In [None]:
def preprocess_function(examples):
    output = tokenizer(examples['sentence1'], examples['sentence2'], truncation=True, max_length=128)
    label_to_id = {'entailment': 0, 'contradiction': 1, 'neutral': 2}
    output['label'] = list(map(lambda x : label_to_id[x], examples['gold_label']))
    return output

Inspect the keys to make sure input_ids, attention_mask, and label are included.

In [None]:
preprocess_function(dataset['train'][:5]).keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'label'])
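The `token_type_ids` key appears because the tokenizer encodes the two sentences as a single sequence. The sketch below mimics the BERT-style layout `[CLS] sentence1 [SEP] sentence2 [SEP]` with a toy whitespace tokenizer; the function `encode_pair` is a hypothetical stand-in and not the real tokenizer, which works on subword ids.

```python
def encode_pair(sentence1, sentence2):
    """Toy BERT-style pair encoding using whitespace tokens (illustration only)."""
    first = ["[CLS]"] + sentence1.split() + ["[SEP]"]
    second = sentence2.split() + ["[SEP]"]
    tokens = first + second
    # Segment 0 covers [CLS], sentence1 and its [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * len(first) + [1] * len(second)
    attention_mask = [1] * len(tokens)
    return tokens, token_type_ids, attention_mask

tokens, type_ids, mask = encode_pair("Cr is 1.7", "Cr is elevated")
print(tokens)
print(type_ids)
```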

Preprocess all splits of the dataset.

In [None]:
encoded_dataset = dataset.map(preprocess_function, batched=True)

  0%|          | 0/12 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

In [None]:
encoded_dataset['train'][0]

{'attention_mask': [1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1],
 'gold_label': 'entailment',
 'input_ids': [2,
  17919,
  1985,
  11892,
  1958,
  2345,
  21,
  18,
  27,
  12,
  3703,
  20,
  18,
  25,
  2079,
  4156,
  7123,
  13,
  1930,
  9456,
  22,
  18,
  24,
  18,
  3,
  2774,
  2258,
  4664,
  2345,
  3],
 'label': 0,
 'pairID': '23eb94b8-66c7-11e7-a8dc-f45c89b91419',
 'sentence1': 'Labs were notable for Cr 1.7 (baseline 0.5 per old records) and lactate 2.4.',
 'sentence1_binary_parse': '( Labs ( ( were ( notable ( for ( ( ( ( Cr 1.7 ) ( -LRB- ( ( ( baseline 0.5 ) ( per ( old records ) ) ) -RRB- ) ) ) and ) ( lactate 2.4 ) ) ) ) ) . ) )',
 'sentence1_parse': '(ROOT (S (NP (NNPS Labs)) (VP (VBD were) (ADJP (JJ notable) (PP (IN for) (NP (NP (NP (NN Cr) (CD 1.7)) (PRN (-LRB- -LRB-) (NP (NP (NN baseline) (CD 0.5)) (PP (IN per) (NP (JJ old) (NNS records)))) (-RRB- -RRB-))) (CC a

# Finetuning

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

In [None]:
num_labels = 3
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

Downloading:   0%|          | 0.00/420M [00:00<?, ?B/s]

Some weights of the model checkpoint at microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Ber

Inspect model architecture.

In [None]:
model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

Training arguments.

In [None]:
args = TrainingArguments(
    "../finetune-medical-nli",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    logging_strategy = "epoch", # log training loss
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=10,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model='accuracy',
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


Function to compute accuracy of predictions.

In [None]:
import numpy as np
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)
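In `compute_metrics`, `predictions` holds one row of logits per example, and `np.argmax(..., axis=1)` picks the index of the largest logit in each row. A plain-Python sketch of the same step, with made-up logits and a hypothetical helper `argmax_rows`:

```python
def argmax_rows(logits):
    """Return the index of the largest value in each row (ties go to the first)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

# Made-up logits for three examples over the three NLI classes.
logits = [[2.1, -0.3, 0.4], [-1.0, 0.2, 1.5], [0.0, 3.2, -0.7]]
print(argmax_rows(logits))  # [0, 2, 1]
```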

Initialize the Trainer.

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["dev"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

Train the model.

In [None]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: gold_label, sentence2_binary_parse, sentence1, sentence1_parse, pairID, sentence1_binary_parse, sentence2_parse, sentence2.
***** Running training *****
  Num examples = 11232
  Num Epochs = 10
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 14040


Epoch,Training Loss,Validation Loss,Accuracy
1,0.5825,0.432031,0.842294
2,0.3828,0.509577,0.837276
3,0.297,0.648013,0.849462
4,0.2229,0.777256,0.855914
5,0.1623,0.84179,0.860215
6,0.104,0.938593,0.860215
7,0.0683,1.013631,0.860215
8,0.0444,1.069693,0.864516
9,0.0264,1.118391,0.871685
10,0.0229,1.134652,0.870251


The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: gold_label, sentence2_binary_parse, sentence1, sentence1_parse, pairID, sentence1_binary_parse, sentence2_parse, sentence2.
***** Running Evaluation *****
  Num examples = 1395
  Batch size = 8
Saving model checkpoint to ../finetune-medical-nli/checkpoint-1404
Configuration saved in ../finetune-medical-nli/checkpoint-1404/config.json
Model weights saved in ../finetune-medical-nli/checkpoint-1404/pytorch_model.bin
tokenizer config file saved in ../finetune-medical-nli/checkpoint-1404/tokenizer_config.json
Special tokens file saved in ../finetune-medical-nli/checkpoint-1404/special_tokens_map.json
The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: gold_label, sentence2_binary_parse, sentence1, sentence1_parse, pairID, sentence1_binary_parse, sentenc

TrainOutput(global_step=14040, training_loss=0.19136185306429523, metrics={'train_runtime': 3983.2319, 'train_samples_per_second': 28.198, 'train_steps_per_second': 3.525, 'total_flos': 3679690480691904.0, 'train_loss': 0.19136185306429523, 'epoch': 10.0})

Check model performance on the dev set.

In [None]:
trainer.evaluate()

Since a model checkpoint is saved at the end of each epoch, we can load any of the fine-tuned checkpoints for evaluation and prediction. Note the checkpoint of the model you choose.
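Checkpoint folders are named after the global optimization step at which they were saved, so the folder for a given epoch can be worked out from the training log above (11232 training examples, batch size 8, one device, no gradient accumulation). A small sketch of that arithmetic, with a hypothetical helper `checkpoint_step`:

```python
import math

def checkpoint_step(epoch, num_examples=11232, batch_size=8):
    """Global step at the end of `epoch`, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epoch

print(checkpoint_step(1))   # 1404
print(checkpoint_step(8))   # 11232
print(checkpoint_step(9))   # 12636
print(checkpoint_step(10))  # 14040
```

These values agree with the `checkpoint-1404` folder written after the first epoch and the 14040 total optimization steps reported in the log.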

Performance on test set: <br>
Epoch 8  model: 0.8411 <br>
Epoch 9  model: 0.8404 <br>

In [None]:
model_checkpoint = '../finetune-medical-nli/checkpoint-11232' # epoch 8
#model_checkpoint = '../finetune-medical-nli/checkpoint-12636' # epoch 9

In [None]:
num_labels = 3
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

loading configuration file ../finetune-medical-nli/checkpoint-11232/config.json
Model config BertConfig {
  "_name_or_path": "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.12.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loadi

In [None]:
batch_size = 8
args = TrainingArguments(
    "../finetune-medical-nli",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    logging_strategy = "epoch", # log training loss
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=10,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model='accuracy',
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["dev"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
test_predictions = trainer.evaluate(encoded_dataset['test'])

The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence2_parse, gold_label, sentence2, sentence2_binary_parse, sentence1_binary_parse, pairID, sentence1, sentence1_parse.
***** Running Evaluation *****
  Num examples = 1422
  Batch size = 8


In [None]:
test_predictions

{'eval_accuracy': 0.8410689170182841,
 'eval_loss': 1.3067668676376343,
 'eval_runtime': 11.3992,
 'eval_samples_per_second': 124.745,
 'eval_steps_per_second': 15.615}

Let's inspect the predictions of some test samples.

In [None]:
predictions = trainer.predict(encoded_dataset['test'])

The following columns in the test set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence2_parse, gold_label, sentence2, sentence2_binary_parse, sentence1_binary_parse, pairID, sentence1, sentence1_parse.
***** Running Prediction *****
  Num examples = 1422
  Batch size = 8


In [None]:
import numpy as np
id_to_label = {0: 'entailment', 1: 'contradiction', 2: 'neutral'}
def show_prediction(dataset, predictions, index):
  show_one(dataset['test'][index])
  prediction = np.argmax(predictions[0][index]) 
  print(f"Predicted: {id_to_label[prediction]}")
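The `id_to_label` mapping has to mirror the `label_to_id` mapping used during preprocessing. One way to keep the two in sync is to derive one from the other, as in this small sketch. Defining `label_to_id` at module level is an assumption here, since the notebook defines it inside `preprocess_function`.

```python
# Same mapping as in preprocess_function, defined once at module level (assumption).
label_to_id = {"entailment": 0, "contradiction": 1, "neutral": 2}

# Invert the mapping so ids and names cannot drift apart.
id_to_label = {idx: name for name, idx in label_to_id.items()}

print(id_to_label)
```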

In [None]:
index = 0
show_prediction(dataset, predictions, index)

Sentence 1: In the ED, initial VS revealed T 98.9, HR 73, BP 121/90, RR 15, O2 sat 98% on RA.
Sentence 2:  The patient is hemodynamically stable 
Ground truth: entailment
Predicted: entailment


# Evaluating on MedNLI

Go into the MedNLI directory.

In [None]:
%cd ..

/content/drive/My Drive


In [None]:
%cd mednli-for-shared-task-at-acl-bionlp-2019/

/content/drive/My Drive/mednli-for-shared-task-at-acl-bionlp-2019


The directory should contain test set sentences in mednli_bionlp19_shared_task.jsonl and the correct labels in mednli_bionlp19_shared_task_ground_truth.csv 




In [None]:
!ls

finetune-medical-nli			      mednli_bionlp19_shared_task.jsonl
LICENSE.txt				      README.txt
mednli_bionlp19_shared_task_ground_truth.csv  SHA256SUMS.txt


Load MedNLI data.

In [None]:
mediqa_file = 'mednli_bionlp19_shared_task.jsonl'
mediqa_label_file = 'mednli_bionlp19_shared_task_ground_truth.csv'

Load the MedNLI sentences and labels manually, since the file cannot be loaded with load_dataset('json', ...).

Load the data first.

In [None]:
import json
data = []
with open(mediqa_file) as f:
  for line in f:
    data.append(json.loads(line))

In [None]:
data[0]

{'gold_label': '',
 'pairID': 'mediqa-6784509e-5ca1-11e9-8844-f45c89b91419',
 'sentence1': 'She arrived with her friend, very lethargic.',
 'sentence1_binary_parse': '( She ( ( ( ( arrived ( with ( her friend ) ) ) , ) ( very lethargic ) ) . ) )',
 'sentence1_parse': '(ROOT (S (NP (PRP She)) (VP (VBD arrived) (PP (IN with) (NP (PRP$ her) (NN friend))) (, ,) (ADJP (RB very) (JJ lethargic))) (. .)))',
 'sentence2': ' she  appeared unenergetic',
 'sentence2_binary_parse': '( she ( appeared unenergetic ) )',
 'sentence2_parse': '(ROOT (S (NP (PRP she)) (VP (VBD appeared) (ADJP (JJ unenergetic)))))'}

In [None]:
import pandas as pd
label_df = pd.read_csv(mediqa_label_file)
# set pair_id as index
label_df = label_df.set_index('pair_id')
# create a dictionary of (id, label) pairs
label_dic = label_df.to_dict()

In [None]:
# inspect dictionary
label_dic['label']

{'mediqa-6784509e-5ca1-11e9-8844-f45c89b91419': 'entailment',
 'mediqa-678582f4-5ca1-11e9-af66-f45c89b91419': 'contradiction',
 'mediqa-67858838-5ca1-11e9-ad79-f45c89b91419': 'neutral',
 'mediqa-67858ba8-5ca1-11e9-8a1f-f45c89b91419': 'entailment',
 'mediqa-67858f54-5ca1-11e9-baa4-f45c89b91419': 'contradiction',
 'mediqa-678594ae-5ca1-11e9-a95d-f45c89b91419': 'neutral',
 'mediqa-67859800-5ca1-11e9-a86a-f45c89b91419': 'entailment',
 'mediqa-67859a76-5ca1-11e9-8140-f45c89b91419': 'contradiction',
 'mediqa-67859d08-5ca1-11e9-b1d5-f45c89b91419': 'neutral',
 'mediqa-67859f94-5ca1-11e9-9963-f45c89b91419': 'entailment',
 'mediqa-6785a1e4-5ca1-11e9-adbf-f45c89b91419': 'contradiction',
 'mediqa-6785a82c-5ca1-11e9-bc81-f45c89b91419': 'neutral',
 'mediqa-6785ab58-5ca1-11e9-9698-f45c89b91419': 'entailment',
 'mediqa-6785e9a6-5ca1-11e9-9365-f45c89b91419': 'contradiction',
 'mediqa-6785f298-5ca1-11e9-a800-f45c89b91419': 'neutral',
 'mediqa-6785f642-5ca1-11e9-9dd3-f45c89b91419': 'entailment',
 'mediqa

Add the labels to the test set data.

In [None]:
for d in data:
  d['gold_label'] = label_dic['label'][d['pairID']]
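If any `pairID` is missing from the label file, the loop above raises a `KeyError`. A slightly more defensive sketch, using only the standard library and made-up rows, is shown below. The helper name `attach_labels` is hypothetical; the column names `pair_id` and `label` follow the CSV used above.

```python
import csv
import io

def attach_labels(records, label_csv):
    """Fill gold_label from a CSV with pair_id/label columns; return ids with no label."""
    labels = {row["pair_id"]: row["label"] for row in csv.DictReader(label_csv)}
    missing = []
    for rec in records:
        if rec["pairID"] in labels:
            rec["gold_label"] = labels[rec["pairID"]]
        else:
            missing.append(rec["pairID"])
    return missing

# Made-up ids and labels for illustration.
toy_csv = io.StringIO("pair_id,label\na1,entailment\na2,neutral\n")
toy_records = [{"pairID": "a1", "gold_label": ""}, {"pairID": "zz", "gold_label": ""}]
missing_ids = attach_labels(toy_records, toy_csv)
print(missing_ids)
```

Returning the unmatched ids instead of failing makes it easy to check that the list is empty before moving on.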

Load the complete test set into a dataset.

In [None]:
data[:2]

[{'gold_label': 'entailment',
  'pairID': 'mediqa-6784509e-5ca1-11e9-8844-f45c89b91419',
  'sentence1': 'She arrived with her friend, very lethargic.',
  'sentence1_binary_parse': '( She ( ( ( ( arrived ( with ( her friend ) ) ) , ) ( very lethargic ) ) . ) )',
  'sentence1_parse': '(ROOT (S (NP (PRP She)) (VP (VBD arrived) (PP (IN with) (NP (PRP$ her) (NN friend))) (, ,) (ADJP (RB very) (JJ lethargic))) (. .)))',
  'sentence2': ' she  appeared unenergetic',
  'sentence2_binary_parse': '( she ( appeared unenergetic ) )',
  'sentence2_parse': '(ROOT (S (NP (PRP she)) (VP (VBD appeared) (ADJP (JJ unenergetic)))))'},
 {'gold_label': 'contradiction',
  'pairID': 'mediqa-678582f4-5ca1-11e9-af66-f45c89b91419',
  'sentence1': 'She arrived with her friend, very lethargic.',
  'sentence1_binary_parse': '( She ( ( ( ( arrived ( with ( her friend ) ) ) , ) ( very lethargic ) ) . ) )',
  'sentence1_parse': '(ROOT (S (NP (PRP She)) (VP (VBD arrived) (PP (IN with) (NP (PRP$ her) (NN friend))) (, ,) 

In [None]:
# convert data to a dictionary
data = {'gold_label': [d['gold_label'] for d in data],
        'pairID': [d['pairID'] for d in data],
        'sentence1': [d['sentence1'] for d in data],
        'sentence2': [d['sentence2'] for d in data]}
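`Dataset.from_dict` expects a column-oriented dictionary, with one list per field. The same conversion can be written generically; the helper `to_columns` below is a hypothetical sketch using made-up records.

```python
def to_columns(records, keys):
    """Turn a list of row dicts into a dict of column lists, keeping only `keys`."""
    return {key: [rec[key] for rec in records] for key in keys}

# Made-up rows for illustration.
rows = [
    {"pairID": "a1", "gold_label": "entailment", "extra": 1},
    {"pairID": "a2", "gold_label": "neutral", "extra": 2},
]
columns = to_columns(rows, ["gold_label", "pairID"])
print(columns)
```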

In [None]:
from datasets import Dataset
mediqa = Dataset.from_dict(data)

In [None]:
mediqa

Dataset({
    features: ['gold_label', 'pairID', 'sentence1', 'sentence2'],
    num_rows: 405
})

Preprocess (tokenize and encode) the MedNLI test set.

In [None]:
encoded_mediqa = mediqa.map(preprocess_function, batched=True)

  0%|          | 0/1 [00:00<?, ?ba/s]

In [None]:
encoded_mediqa

Dataset({
    features: ['attention_mask', 'gold_label', 'input_ids', 'label', 'pairID', 'sentence1', 'sentence2', 'token_type_ids'],
    num_rows: 405
})

In [None]:
predictions = trainer.predict(encoded_mediqa)

The following columns in the test set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: pairID, sentence1, sentence2, gold_label.
***** Running Prediction *****
  Num examples = 405
  Batch size = 8


Inspect performance.

In [None]:
predictions[2]

{'test_accuracy': 0.8469135802469135,
 'test_loss': 1.2830283641815186,
 'test_runtime': 3.1996,
 'test_samples_per_second': 126.579,
 'test_steps_per_second': 15.94}

Inspect some predictions.

In [None]:
import numpy as np
id_to_label = {0: 'entailment', 1: 'contradiction', 2: 'neutral'}
def show_mednli_prediction(dataset, predictions, index):
  show_one(dataset[index])
  prediction = np.argmax(predictions[0][index]) 
  print(f"Predicted: {id_to_label[prediction]}")

In [None]:
show_mednli_prediction(mediqa, predictions, 100)

Sentence 1: Post-ERCP, she was admitted to the ICU with a diagnosis of cholangitis.
Sentence 2:  She tolerated the ERCP well and felt better post procedure
Ground truth: contradiction
Predicted: contradiction
