# Summary

This notebook outlines an error detection pipeline that can help find and sort valuable text samples for manual review.


## Problem description

The need arises in the context of an **automatic speech recognition*** and transcription system (ASR for short) used in the medical domain. The transcriptions currently contain a number of errors big enough to be problematic, most likely due to the significant disparity between the **target domain (healthcare)** and the domain composition of the datasets used for training the base ASR system (most likely not focused exclusively on healthcare).

In order to improve the ASR's performance, these errors should be detected, so that they can be characterized, the effort of fixing them can then be adequately quantified, the impact and trade-off of doing so can thus be estimated (so that each error type can be prioritized effectively) and, finally, so that corrected versions of the ideal transcription can be produced (and added as regression tests and/or further training data to mitigate the model's underfitting).

The core task of transcription error detection can be framed, to at least some extent, as the more general **spellcehcking/error detection** task. Our proposed solution in this spirit is outlined throughout the rest of this notebook.

However, it's worth also noting that a purely binary objective function (_error_ or _no error_) would not be optimal in our case, though, as not all errors are equally damaging for the proper understanding of a transcription (more damaging errors should be assigning higher scores, and prioritized accordingly) and, similarly, not all transcriptions contain the same amount of errors (we'd expect transcriptions containing many errors to be ranked higher than those containing fewer errors). Ideally, this objective calls for a probabilistic score that can reflect these quantitative nuances.

This becomes particularly more important in the context of an annotation task where a limited-size team of annotators is required to annotate as effectively as possible within a constrained annotation budget in terms of person hours. In order to scale the annotation process and ensure that the annotators, whichever their number, can be as productive as possible, we should have them look at samples with a higher error density first, in order to highlight recurrent patterns and/or particularly serious ASR errors.

## Proposed solution

To ensure the domain relevance of the prototype presented here, we leverage the English translation of the [CodiEsp corpus](https://zenodo.org/records/3693570#.X3rm8C8RrOR) of clinical cases (general medicine) as our **dataset**. File `src/codiesp.py` contains the Python data loaders for the dataset and exposes as constants two pandas DataFrames (`CODIESP`, `TEST`) with the expected data schema. Please note that this module presupposes the existence of the sibling folder `data`, under which the CodiEsp dataset raw data files must have been placed:

```
.
├── data                                   # generic data folder
│   └── final_dataset_v2_to_publish             # RAW DATASET FILES
├── notebooks
│   ├── DocPlanner candidate ranking.ipynb              # (this notebook)
├── requirements.txt
└── src
    ├── __init__.py
    ├── __pycache__
    ├── codiesp.py                            # dataset loaders
    ├── error_injector.py
    ├── language_model.py
    └── utils.py
```

> **CodiEsp corpus: Spanish clinical cases coded in ICD10 (CIE10) - eHealth CLEF2020**
> The CodiEsp corpus contains manually coded clinical cases. All documents are in Spanish language and CIE10 is the coding terminology (it is the Spanish version of ICD10-CM and ICD10-PCS). The CodiEsp corpus has been randomly sampled into three subsets: the train, the development, and the test set. The train set contains 500 clinical cases, and the development and test set 250 clinical cases each.

In the same vein, as the **base model** for detection will be using [Bio-ClinicalBERT](https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT), a transformer fine-tuned for the healthcare domain:

> This model card describes the **Bio+Clinical BERT** model, which was initialized from **BioBERT** & trained on all **MIMIC** notes.
> 
> 1. **BioBERT** (BioBERT-Base v1.0 + PubMed 200K + PMC 270K)
> 2. The Bio_ClinicalBERT model was trained on all notes from **MIMIC** III, a database containing electronic health records from ICU patients at the Beth Israel Hospital in Boston, MA. For more details on MIMIC, see here. All notes from the NOTEEVENTS table were included (~880M words).

For our task, we will
1. fine-tune Bio-ClinicalBERT on
   1. the training portion of CodiEsp as the negative examples (_no error_)
   2. and a corrupted version of it as the positive cases (_error_) that we will **synthesize programmatically**;
2. evaluate the fine-tuned model on
   1. a random subset of 1000 sentences from CodiEsp's test set again as the negative class
   2. their corrupted counterparts as the positive class.

Note that CodiEsp contains full clinical cases but that our samples have been segmented at the **sentence level**.

The code used to **synthesize errors** from real samples is implemented in module `src/error_injector`. In addition, module `src/language_model` implements a wrapper for easier handling of the **model, its tokenizer and its parameters**, for convenience. Lastly, module `src/utils` contains **miscellaneous functions** to avoid overloading this notebook with code.

# References

1. [Pre-trained model: Bio-ClinicalBERT (link to HuggingFace)](https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT)
2. [Clinical cases dataset](https://zenodo.org/records/3693570#.X3rm8C8RrOR)

# About this notebook

The rest of this notebook details the prototype pipeline implementing the 1) dataset synthesis, 2) model fine-tuning, 3) model evaluation workflow introduced in the previous section. 

1. It first installs the dependencies from the associated `requirements.txt` file.
2. It then initializes the dependencies and related artifacts.
3. Next, synthetic errors are added to both the training and test sets.
4. The training set is then used to fine-tune Bio-ClinicalBERT.
5. Once the model has been fine-tuned, it is concept-proven on a single sample.
6. After this initial smoke test, the fine-tune model is ran on the full test set and the final metric is computed.
7. The result is a 92% accuracy

# Prototype pipeline

## Install project dependencies

In [98]:
!pip install -r ../requirements.txt



## Initialize dependencies

In [99]:
%load_ext autoreload

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [100]:
%autoreload
import os
import sys

import pandas as pd
from tqdm.notebook import tqdm
from transformers import (
    BertTokenizer,
    BertForSequenceClassification,
    pipeline,
    Trainer,
    TrainingArguments
)
from datasets import load_dataset


curr_dir = os.getcwd()
while os.path.basename(curr_dir) != "candidate_ranking":
    curr_dir = os.path.dirname(curr_dir)
sys.path.append(curr_dir)

In [101]:
from src.codiesp import (
    CODIESP,
    TEST
)
from src.error_injector import corrupt_text
from src.language_model import BioClinicalBert
from src.utils import (
    negative_case,
    positive_case,
    zero
)

## Set parameters

In [102]:
BIOBERT = BioClinicalBert()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at emilyalsentzer/Bio_ClinicalBERT and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [103]:
PATH_DATASET_TRAIN_SYNTHETIC = os.path.join(curr_dir, "data", "codiesp.synthetic.train.csv")
PATH_DATASET_TEST_SYNTHETIC = os.path.join(curr_dir, "data", "codiesp.synthetic.test.csv")

In [104]:
CODIESP.head()

Unnamed: 0,text_path,text_id,sentence
0,S1130-05582007000500007-1.txt,0,The patient is a 38-year-old male who attended...
1,S1130-05582007000500007-1.txt,0,1.
2,S1130-05582007000500007-1.txt,0,The pathological antecedents include being exa...
3,S1130-05582007000500007-1.txt,0,The examination revealed a right mandibular tu...
4,S1130-05582007000500007-1.txt,0,The orthopantomography showed a mixed lesion w...


## Synthesize train and test datasets by injecting simulated transcription errors

We’ll write a function to:
Randomly introduce errors (typos, swaps, deletions, insertions).
Prioritize errors in medical entities if specified.
Ensure a user-defined ratio of altered words.

### Training set

In [106]:
dataset = CODIESP.copy()
corrupted_dataset = dataset.copy()

In [107]:
dataset["label"] = dataset["sentence"].apply(negative_case)
dataset["num_character_errors"] = dataset["sentence"].apply(zero)
dataset["num_word_errors"] = dataset["sentence"].apply(zero)

In [108]:
error_rows = list(corrupted_dataset["sentence"].apply(
    lambda x: corrupt_text(x, word_error_probability=0.2, character_error_probability=0.2)
))

In [109]:
error_columns_df = pd.DataFrame(error_rows, columns=["sentence", "num_character_errors", "num_word_errors"])

corrupted_dataset = pd.concat(
    [corrupted_dataset.drop("sentence", axis=1, inplace=True), error_columns_df],
    axis=1
)

error_columns_df

Unnamed: 0,sentence,num_character_errors,num_word_errors
0,The pathient si a 38-eyar-old male who attende...,8,0
1,1u,1,0
2,The pathological antecedents inlcude being exa...,5,0
3,The examination revealeq a right mandibular qt...,7,1
4,The orthopantomography showed a mixed lesion w...,4,0
...,...,...,...
8027,"In th denudedraea, an amniotic membrane graft ...",9,1
8028,During follow-up there wsa a progressive reepi...,7,1
8029,Three weeks after surgery there was a regulpr ...,3,1
8030,mVA of the left sye improved to 4/10.,2,1


In [110]:
corrupted_dataset["label"] = corrupted_dataset.sentence.apply(positive_case)

# Merge and shuffle dataset
final_dataset = pd.concat([dataset, corrupted_dataset])
final_dataset.to_csv(PATH_DATASET_TRAIN_SYNTHETIC, index=False)

In [111]:
corrupted_dataset

Unnamed: 0,sentence,num_character_errors,num_word_errors,label
0,The pathient si a 38-eyar-old male who attende...,8,0,1
1,1u,1,0,1
2,The pathological antecedents inlcude being exa...,5,0,1
3,The examination revealeq a right mandibular qt...,7,1,1
4,The orthopantomography showed a mixed lesion w...,4,0,1
...,...,...,...,...
8027,"In th denudedraea, an amniotic membrane graft ...",9,1,1
8028,During follow-up there wsa a progressive reepi...,7,1,1
8029,Three weeks after surgery there was a regulpr ...,3,1,1
8030,mVA of the left sye improved to 4/10.,2,1,1


Explanation:
1. Medical terms from CANTEMIST are identified and corrupted preferentially based on entity_error_prob.
2. Non-medical words are corrupted with a lower probability other_word_prob.
3. The output dataset (corrupted_CANTEMIST.csv) contains:
   1. Original sentences (label = 0)
   2. Corrupted sentences (label = 1)

### Test set

We apply the same sequence of steps to add positive cases (errors) to the label set:

In [112]:
test = TEST.copy()
corrupted_test = test.copy()

test["label"] = test["sentence"].apply(negative_case)

error_rows_test = list(corrupted_test["sentence"].apply(
    lambda x: corrupt_text(x, word_error_probability=0.2, character_error_probability=0.2)
))

error_columns_test_df = pd.DataFrame(
    error_rows_test,
    columns=["sentence", "num_character_errors", "num_word_errors"]
)

corrupted_test = pd.concat(
    [corrupted_test.drop("sentence", axis=1, inplace=True), error_columns_test_df],
    axis=1
)

error_columns_test_df

corrupted_test["label"] = corrupted_test.sentence.apply(positive_case)

# Merge and shuffle dataset
final_test = pd.concat([test, corrupted_test])
final_test.to_csv(PATH_DATASET_TEST_SYNTHETIC, index=False)

## Training ClinicalBERT to detect errors

Now, we’ll train ClinicalBERT on the synthetic dataset to detect transcription errors.

### Load dataset

In [113]:
dataset = load_dataset("csv", data_files={"train": PATH_DATASET_SYNTHETIC})
dataset = dataset["train"].train_test_split(test_size=0.2)

print(dataset)

Generating train split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text_path', 'text_id', 'sentence', 'label', 'num_character_errors', 'num_word_errors'],
        num_rows: 12851
    })
    test: Dataset({
        features: ['text_path', 'text_id', 'sentence', 'label', 'num_character_errors', 'num_word_errors'],
        num_rows: 3213
    })
})


### Fine-tune base-model

In [114]:
tokenized_datasets = dataset.map(BIOBERT.tokenize, batched=True)

# Training settings
training_args = TrainingArguments(
    output_dir="./error_detection",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
)

trainer = Trainer(
    model=BIOBERT.model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
)

# Train model
trainer.train()

Map:   0%|          | 0/12851 [00:00<?, ? examples/s]

Map:   0%|          | 0/3213 [00:00<?, ? examples/s]



Epoch,Training Loss,Validation Loss
1,0.2044,0.174313
2,0.0871,0.21822
3,0.0526,0.233648


TrainOutput(global_step=4821, training_loss=0.12451475029094564, metrics={'train_runtime': 1456.361, 'train_samples_per_second': 26.472, 'train_steps_per_second': 3.31, 'total_flos': 2535930129323520.0, 'train_loss': 0.12451475029094564, 'epoch': 3.0})

## Proof of concept: error detection on new transcriptions

Once trained, ClinicalBERT can now classify new transcriptions for errors.

In [115]:
error_detector = pipeline("text-classification", model=BIOBERT.model, tokenizer=BIOBERT.tokenizer)

test_text = "The patient has dyspnaea and needs immdeiate attention."
score = error_detector(test_text)
print(score)  # [{'label': '1' (error), 'score': 0.95}]


Device set to use mps:0


[{'label': 'LABEL_1', 'score': 0.9999202489852905}]


In [116]:
test_text = "The patient has dyspnea and needs immediate attention."
score = error_detector(test_text)
score

[{'label': 'LABEL_0', 'score': 0.9979265928268433}]

- A high probability means the model detects errors.
- A low probability means the transcription is likely correct.

## Quantitative evaluation

In [119]:
final_test["prediction"] = [
    1 if '1' in detection['label'] else 0
    for detection in tqdm(error_detector(final_test["sentence"].values.tolist()))
]

In [127]:
round(float(final_test[["label", "prediction"]].corr().loc['label']['prediction']), 4)

0.9235

In [128]:
from sklearn.metrics import accuracy_score

The history saving thread hit an unexpected error (OperationalError('attempt to write a readonly database')).History will not be written to the database.


In [133]:
final_test[final_test['label'].isna()]

Unnamed: 0,text_path,text_id,sentence,label,num_character_errors,num_word_errors,prediction
8032,,,"Three minutes were recorded wih eyes closed, t...",,1.0,0.0,1
8033,,,A sample of ascites was taken for ceyl block a...,,2.0,0.0,0
8034,,,"Th patientl progressed win good condition, asy...",,6.0,0.0,1
8035,,,EKG: atrial fibillation wuth a mean frequency ...,,2.0,1.0,1
8036,,,Computed toomgraphy showed a pancreatic head t...,,4.0,0.0,1
...,...,...,...,...,...,...,...
9995,,,vzmkwptge radiological significance.,,0.0,1.0,1
9996,,,Urgent abdominal coemputed tomography: difufse...,,7.0,0.0,1
9997,,,"In the initial complementary tests, hypokalemi...",,2.0,1.0,1
9998,,,A g59-year-old male witm a history of smoking ...,,5.0,0.0,1


In [129]:
accuracy_score(final_test['label'], final_test['prediction'])

  return x.astype(dtype, copy=copy, casting=casting)


ValueError: Input y_true contains NaN.