# Explanation of what I want to achieve

The goal of this work is to classify texts from medical notes into diagnosis codes. The dataset contains short texts, apparently written by clinicians in medical summary documents (**sentence** column), and the related diagnosis codes (**code** column).

## Installing libraries for model loading, data pre-processing and metrics evaluation.

As the suggested model is a *huggingface*-style model, I'm gonna load the **transformers** library, which is developed to work with this kind of model, and the **datasets** library for the respective data pre-processing.

In [None]:
! pip install transformers

In [None]:
! pip install datasets

## Loading the suggested model 
And the relevant tokenizer.

In [None]:
from transformers import AutoTokenizer, AutoModel
# model_max_length is the tokenizer argument that sets the maximum sequence length
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT", do_lower_case=False, model_max_length=512)
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

Some weights of the model checkpoint at emilyalsentzer/Bio_ClinicalBERT were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Looking at the number of parameters just to get an idea of the model size: about 108M, i.e. this is a base-sized (not large) BERT model.

In [None]:
print(sum(p.numel() for p in model.parameters()))

108310272


Importing pandas to load the dataset

In [None]:
import pandas as pd

# Explanation of my assumptions

I'm working with a simple sequence classification task, multi-class, but single-label. 

So, I'm going to use a workflow to build a sequence classification model based on a huggingface BERT model from these resources:

https://huggingface.co/docs/transformers/tasks/sequence_classification

The model is prepared in pytorch (see paper by Alsentzer et al. 2019, https://arxiv.org/abs/1904.03323), so I'm gonna use the pytorch-based sequence classification tutorial: 
https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/text_classification.ipynb.

The main assumption behind using the suggested model in this task is that the codes are actual clinical codes, i.e. they are assigned based on the sentence contents, and that BERT-like models can effectively capture the semantics of the sentences; moreover, the suggested BERT trained on clinical text should capture more of the domain-specific semantics.

Importing the required libs for the classification experiment

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

# Consuming the provided dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
foo = './drive/MyDrive/Maverick/'
fn = foo + 'MIMIC III per sentence annotated dataset.csv'
df = pd.read_csv(fn)
df

Unnamed: 0.1,Unnamed: 0,sentence,code
0,0,pt transferred to [**hospital unit name 4**] c...,J80
1,1,chb d/t hypothyroidism--pt with recent hx of n...,E039
2,2,the patient is a 67-year-old female with a his...,I4891
3,3,"rca, htn, gerd, left knee replacement, bipolar...",F319
4,4,chronic obstructive pulmonary disease diabete...,E119
...,...,...,...
81022,64472,# encephalopathy following improvement of pati...,F329
81023,64473,"secondary progressive ms, sx onset [**8-/2167...",F329
81024,64474,"major depressive disorder, recurrent, without ...",F329
81025,64475,h/o major depression 3,F329


Checking the number of duplicates for 2 reasons:
1. if many text duplicates are present, this might actually look more like a multi-label task (i.e. a text can normally have more than 1 class);
2. if duplicates are a big part of the data, it might make sense to de-duplicate. However, whether de-duplication is appropriate here is an open question.

Less than 5% of the sentences are duplicates (77971 unique sentences out of 81027 rows, see the cell below), so it seems we don't have to worry about the duplicate-related issues mentioned above.

In [None]:
df.sentence.drop_duplicates().shape

(77971,)
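
To make the duplicate check a bit more explicit, here is a minimal sketch on a toy frame (the `toy` data is made up; only the column names `sentence` and `code` come from the dataset). It computes the share of repeated texts and lists the texts that appear with more than one code, which is the case that would hint at a multi-label task. With the counts reported in this notebook (81027 rows, 77971 unique sentences), the repeated rows amount to 3056, about 3.8%.

```python
import pandas as pd

# Toy stand-in for df; the real frame has the same two columns.
toy = pd.DataFrame({
    "sentence": ["htn, gerd", "htn, gerd", "h/o major depression", "htn, gerd"],
    "code": ["I10", "K219", "F329", "I10"],
})

# Share of rows whose text already appeared earlier in the frame.
dup_share = toy["sentence"].duplicated().mean()

# Texts that occur with more than one distinct code (multi-label candidates).
codes_per_text = toy.groupby("sentence")["code"].nunique()
conflicting = codes_per_text[codes_per_text > 1]

print(f"duplicate share: {dup_share:.2%}")
print(conflicting)
```

On the real data the same two expressions could be run on `df` directly; if `conflicting` turned out to be large, the single-label assumption would deserve a second look.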

Assigning integer ids to the string class labels, because the classification model expects integer class ids as labels rather than strings.

In [None]:
code2id, id2code = {}, {}
res = []
for c in df.code:
  if c not in code2id:
    # assign the next free integer id, in order of first appearance
    code2id[c] = len(code2id)
    id2code[code2id[c]] = c
  res.append(code2id[c])
# print the first 50 (id, code) pairs as a sanity check
for i in range(50):
  print(res[i], df.code.iloc[i])

0 J80
1 E039
2 I4891
3 F319
4 E119
5 F0280
6 I609
7 R65.21
4 E119
4 E119
5 F0280
5 F0280
8 K7030
9 K766
2 I4891
10 M810
11 B182
12 D696
2 I4891
10 M810
13 I469
14 N186
7 R65.21
15 J449
16 I2699
12 D696
17 Z79.4
2 I4891
1 E039
6 I609
9 K766
11 B182
2 I4891
14 N186
18 F10239
19 I25.2
14 N186
20 E6601
18 F10239
4 E119
21 A419
2 I4891
17 Z79.4
4 E119
5 F0280
11 B182
11 B182
20 E6601
17 Z79.4
2 I4891


In [None]:
df['label'] = res
df

Unnamed: 0.1,Unnamed: 0,sentence,code,label
0,0,pt transferred to [**hospital unit name 4**] c...,J80,0
1,1,chb d/t hypothyroidism--pt with recent hx of n...,E039,1
2,2,the patient is a 67-year-old female with a his...,I4891,2
3,3,"rca, htn, gerd, left knee replacement, bipolar...",F319,3
4,4,chronic obstructive pulmonary disease diabete...,E119,4
...,...,...,...,...
81022,64472,# encephalopathy following improvement of pati...,F329,28
81023,64473,"secondary progressive ms, sx onset [**8-/2167...",F329,28
81024,64474,"major depressive disorder, recurrent, without ...",F329,28
81025,64475,h/o major depression 3,F329,28


Tokenizing the texts to get an idea of the sequence length that I need to cover most of the texts.

In [None]:
lens = []
for i,x in enumerate(df.sentence.to_list()):
  lens.append(len(tokenizer.tokenize(x)))
  if i%1000 == 0:
    print(i)

0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
11000
12000
13000
14000
15000
16000
17000
18000
19000
20000
21000
22000
23000
24000
25000
26000
27000
28000
29000
30000
31000
32000
33000
34000
35000
36000
37000
38000
39000
40000
41000
42000
43000
44000
45000
46000
47000
48000
49000
50000
51000
52000
53000
54000
55000
56000
57000
58000
59000
60000
61000
62000
63000
64000
65000
66000
67000
68000
69000
70000
71000
72000
73000
74000
75000
76000
77000
78000
79000
80000
81000


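The statistics of the tokenized sequence lengths show that a sequence length of 32 covers between 50% and 75% of the texts without truncation (the median is 23 tokens, the 75th percentile is 42). Note that `tokenizer.tokenize` does not count the [CLS] and [SEP] tokens, which take up 2 of the 32 positions. A length of 64 would cover even more, but for the sake of speed let's use 32 below.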

In [None]:
pd.Series(lens).describe()

count    81027.000000
mean        34.440014
std         33.721708
min          1.000000
25%         15.000000
50%         23.000000
75%         42.000000
max       2149.000000
dtype: float64
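
Instead of reading the coverage off the quartiles, it can be computed directly for a few candidate lengths. This is a small sketch: `coverage` and `toy_lens` are hypothetical names and values, standing in for the `lens` list built above, and `n_special=2` assumes one [CLS] and one [SEP] token per text.

```python
def coverage(lengths, max_length, n_special=2):
    """Fraction of texts that fit into max_length without truncation.

    n_special accounts for [CLS] and [SEP], which tokenizer.tokenize() does
    not count but which are part of the encoded input.
    """
    budget = max_length - n_special
    return sum(1 for n in lengths if n <= budget) / len(lengths)

# Hypothetical token counts standing in for the `lens` list.
toy_lens = [1, 15, 15, 23, 23, 30, 42, 42, 64, 2149]

for max_length in (32, 64, 128):
    print(max_length, coverage(toy_lens, max_length))
```

Calling `coverage(lens, 32)` on the real list would give the exact share of untruncated texts for the setting used in this notebook.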

# Data preparation and dataset preprocessing where required
I'm using the steps from the tutorial linked above to create a dataset that can be processed by the `AutoModelForSequenceClassification` model.

In [None]:
from datasets import Dataset

In [None]:
def preprocess_function(examples):
    return tokenizer(examples['sentence'], truncation=True, max_length=32, padding = True)

Getting an idea of class sparsity. Not too sparse overall; let's get back to removing the smallest 1 or 3 classes later, *if needed*.

In [None]:
df.code.value_counts()

I469      4326
E780      4048
I609      3295
Z79.4     3284
E119      3084
I4891     2976
E039      2811
F0280     2698
R570      2695
R65.21    2575
J9620     2548
A419      2522
F10239    2512
N186      2417
I10       2358
K7030     2283
D696      2266
B182      2057
I509      2013
M810      1991
E6601     1975
I472      1741
I25.2     1705
K766      1670
I6529     1624
J80       1382
C7931     1369
G936      1321
J15211    1258
K219      1250
J449      1231
I714      1016
M069      1014
F319       968
N189       955
I214       758
I2510      746
E46        725
I619       626
I739       620
I200       545
I6350      503
I129       395
I2699      223
J690       186
N179       178
R569       153
C787        81
F329        44
F341         6
Name: code, dtype: int64
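
An alternative to dropping the smallest classes would be to weight them more heavily. The sketch below computes inverse-frequency weights with the N / (K * n_c) heuristic (the one scikit-learn calls "balanced"). The function name and the toy counts are made up for illustration, and nothing like this is used in the training below; plugging such weights into the loss would require customizing the Trainer.

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight_c = N / (K * n_c): rare classes get proportionally larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

# Toy label column: one frequent, one medium and one very rare code.
toy_codes = ["I469"] * 6 + ["F329"] * 3 + ["F341"] * 1
weights = balanced_class_weights(toy_codes)

# Print from the smallest to the largest weight.
for code, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print(code, round(w, 3))
```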

In [None]:
df1 = df
#df1 = df[:1000][(df.code != 'F329') & (df.code != 'J690')]

## Making a train-dev-test split to perform the training and dev testing, and then the final sanity/quality check on held-out test data.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
idx = range(df1.shape[0])
X_train, X_test1 = train_test_split(idx, stratify = list(df1['label']),
                                   test_size=0.5)
len(X_train), len(X_test1)
X_test, X_val = train_test_split(X_test1, stratify = list(df1['label'].iloc[X_test1]),
                                   test_size=0.5)
print(len(X_train), len(X_val), len(X_test))
print(X_train[:10], X_val[:10], X_test[:10])


40513 20257 20257
[54583, 49559, 8136, 27682, 37968, 40531, 77030, 59713, 76908, 67796] [2841, 39748, 80228, 13722, 13043, 40747, 31895, 9093, 69090, 6828] [43010, 44322, 72306, 56091, 29770, 33357, 54032, 37793, 42909, 39420]
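
The split above is not seeded, so every run produces different subsets. A reproducible variant could look like the sketch below; `seeded_split` is a hypothetical helper, `seed=42` is an arbitrary choice, and the toy labels only illustrate the call. `random_state` is the standard `train_test_split` argument for fixing the shuffle.

```python
from sklearn.model_selection import train_test_split

def seeded_split(labels, seed=42):
    """Stratified 50/25/25 split of positional indices with a fixed seed."""
    idx = list(range(len(labels)))
    train, rest = train_test_split(
        idx, stratify=labels, test_size=0.5, random_state=seed)
    test, val = train_test_split(
        rest, stratify=[labels[i] for i in rest], test_size=0.5, random_state=seed)
    return train, val, test

# Toy labels: 3 classes with 8 examples each.
toy_labels = [0, 1, 2] * 8
train, val, test = seeded_split(toy_labels)
print(len(train), len(val), len(test))
```

Because the helper splits positional indices (not the label column itself), the returned lists can be passed straight to `.iloc`.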


In [None]:
set(X_test).intersection(X_val), set(X_test).intersection(X_train), set(X_train).intersection(X_val)

(set(), set(), set())

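The index-based split above should be used instead of the one in the cell below.
The cell below was the source of the error that led to perfect results later: it splits the label Series itself, so the resulting label values (0 to 49) are then used as row positions in `.iloc`, and all three subsets end up being drawn from the first 50 rows of the dataset: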

In [None]:
#Found an error here:
X_train, X_test1 = train_test_split(df1['label'], stratify = df1['label'],
                                   test_size=0.5)
len(X_train), len(X_test1)
X_test, X_val = train_test_split(df1['label'].iloc[X_test1], stratify = df1['label'].iloc[X_test1],
                                   test_size=0.5)
print(len(X_train), len(X_val), len(X_test))
print(X_train[:10], X_val[:10], X_test[:10])

40513 20257 20257
73363    13
10650    43
51701    35
45365    10
25335    29
45094    10
73362    13
11630     1
61483    25
7535     43
Name: label, dtype: int64 23    15
26    17
11     5
6      6
43     4
35    19
46    11
6      6
31    11
10     5
Name: label, dtype: int64 10     5
6      6
19    10
37    20
13     9
10     5
29     6
8      4
27     2
30     9
Name: label, dtype: int64
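
The pitfall can be reproduced on a toy frame. In this sketch (all names and data are made up), label values are passed to `.iloc` as if they were row positions, so only the first rows of the frame are ever selected; the intended pattern splits positional indices instead.

```python
import pandas as pd

toy = pd.DataFrame({
    "sentence": [f"text {i}" for i in range(8)],
    "label": [0, 1, 0, 1, 0, 1, 0, 1],
})

# Buggy pattern: the "indices" are really label values taken from a Series.
label_values = toy["label"].iloc[4:]            # values 0/1, original index 4..7
buggy_rows = toy.iloc[label_values.tolist()]    # selects positions 0 and 1 only
print(sorted(set(buggy_rows.index.tolist())))

# Intended pattern: take positional indices and select rows with them.
positions = list(range(len(toy)))[4:]
correct_rows = toy.iloc[positions]
print(sorted(set(correct_rows.index.tolist())))
```

With 50 classes the same mistake restricts every subset to rows 0 to 49, which is why train, validation and test overlapped almost completely.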


In [None]:
data_train = df1.iloc[X_train]
data_val = df1.iloc[X_val]
data_test = df1.iloc[X_test]
labels_test = data_test.label

I only include the input features (no labels) in the test dataset, to make 100% sure there's no label leakage.

The test labels are stored separately for later evaluation.

In [None]:
dataset_train = Dataset.from_pandas(data_train[['sentence', 'label']])
dataset_val = Dataset.from_pandas(data_val[['sentence', 'label']])
dataset_test = Dataset.from_pandas(data_test[['sentence']])

In [None]:
encoded_dataset_train = dataset_train.map(preprocess_function, batched = True)
encoded_dataset_val = dataset_val.map(preprocess_function, batched = True)
encoded_dataset_test = dataset_test.map(preprocess_function, batched = True)

In [None]:
columns_to_return = ['input_ids', 'label', 'attention_mask']
encoded_dataset_train.set_format(type='torch', columns=columns_to_return)
encoded_dataset_val.set_format(type='torch', columns=columns_to_return)
encoded_dataset_test.set_format(type='torch', columns=['input_ids', 'attention_mask'])

In [None]:
encoded_dataset_val, encoded_dataset_test

(Dataset({
     features: ['sentence', 'label', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
     num_rows: 20257
 }), Dataset({
     features: ['sentence', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
     num_rows: 20257
 }))

In [None]:
encoded_dataset_test[0], encoded_dataset_test[1]

({'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1]),
  'input_ids': tensor([  101,  1884, 15789,  1616, 18593,  3653,   117, 14255,  7562,  3946,
           1762,  4290,   117, 17972,  1143,  6473,  4814,   117, 17963, 26557,
           3653,   117, 24438, 16071,  4043,  3457,  9870, 23179,   117,   185,
          15384,   102])},
 {'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0]),
  'input_ids': tensor([  101,  6613, 11153,  6620,   117,   184, 13894,  4184, 14824,  4863,
            117,  2012,  7777,  7874,  1279,   102,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0])})

# Model training
Most steps are taken from the tutorial https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/text_classification.ipynb, except for the metrics evaluation part.

The parameters are taken as-is, except for the batch size (preliminary experiments have shown that batch size = 64 with sequence length = 32 works fine with the current GPU resources), the number of epochs = 5 (chosen arbitrarily) and `push_to_hub=False` (we don't need anything pushed to the Hugging Face Hub currently).

In [None]:
num_labels = len(set(df1.label))

In [None]:
model = AutoModelForSequenceClassification.from_pretrained("emilyalsentzer/Bio_ClinicalBERT", 
                                                           num_labels=num_labels)

Some weights of the model checkpoint at emilyalsentzer/Bio_ClinicalBERT were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model

In [None]:
metric_name = "accuracy"

In [None]:
args = TrainingArguments(
    "medical-finetuned",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    push_to_hub=False,
)

In [None]:
import numpy as np
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

## Setting up evaluation metrics
as described here https://huggingface.co/transformers/v3.0.2/training.html#trainer

Using macro-averaged metrics (i.e. weighting all classes equally) alongside accuracy. In single-label classification, accuracy is equal to the micro-averaged F1, so it already gives an idea of the micro-averaged performance. Micro-averaging alone would give an overestimation: it weights large classes as more important, and it is the large classes which will likely have better results.

In [None]:
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='macro')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
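
To see why accuracy alone can be misleading here, consider a toy case where a large class is always right and a rare class is always missed. The sketch below spells out the macro F1 arithmetic in plain Python (the helper and the toy labels are made up; the notebook itself relies on scikit-learn for the real numbers).

```python
def per_class_f1(y_true, y_pred):
    """F1 for each class that occurs in the true or the predicted labels."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores

# Toy labels: class 0 is large and always right, class 1 is rare and always missed.
y_true = [0] * 9 + [1]
y_pred = [0] * 10

f1 = per_class_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1 = sum(f1.values()) / len(f1)
print(accuracy, macro_f1)
```

Accuracy stays high while macro F1 drops to roughly half of it, because the missed rare class counts as much as the large one. The `_warn_prf` lines in the logs are most likely scikit-learn's warning about classes that received no predictions; `precision_recall_fscore_support` accepts a `zero_division` argument to make the handling of that case explicit.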

# Training

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset_train,
    eval_dataset=encoded_dataset_val,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence, __index_level_0__. If sentence, __index_level_0__ are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 40513
  Num Epochs = 5
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 3170


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,1.7358,0.803036,0.819766,0.710466,0.75494,0.702706
2,0.8031,0.710095,0.830626,0.755808,0.783392,0.75061
3,0.6906,0.680115,0.832947,0.771905,0.827557,0.760758
4,0.5712,0.671982,0.833736,0.784317,0.830808,0.77
5,0.5352,0.670377,0.833934,0.790986,0.822818,0.776877


The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence, __index_level_0__. If sentence, __index_level_0__ are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 20257
  Batch size = 64
  _warn_prf(average, modifier, msg_start, len(result))
Saving model checkpoint to medical-finetuned/checkpoint-634
Configuration saved in medical-finetuned/checkpoint-634/config.json
Model weights saved in medical-finetuned/checkpoint-634/pytorch_model.bin
tokenizer config file saved in medical-finetuned/checkpoint-634/tokenizer_config.json
Special tokens file saved in medical-finetuned/checkpoint-634/special_tokens_map.json
The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence, __index_level_0__. If sentence, __ind

TrainOutput(global_step=3170, training_loss=0.8101381151458066, metrics={'train_runtime': 2899.4362, 'train_samples_per_second': 69.864, 'train_steps_per_second': 1.093, 'total_flos': 3332503782284160.0, 'train_loss': 0.8101381151458066, 'epoch': 5.0})

The evaluation on the val dataset shows reasonably high results.

Let's use the held-out test dataset for the actual testing, because the val results could be overestimated: the best model is selected based on the val metrics, so it is to some extent fitted to the val sample.

# Testing/Validation: evaluation of trained model

In [None]:
pred = trainer.predict(encoded_dataset_test)

The following columns in the test set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence, __index_level_0__. If sentence, __index_level_0__ are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 20257
  Batch size = 64


Outputting different metrics for testing: number of correct predictions + the traditional metrics.

In [None]:
predicted = pred.predictions.argmax(axis=1)
sum(labels_test == predicted) ,'/', len(predicted)

(16831, '/', 20257)

In [None]:
precision, recall, f1, _ = precision_recall_fscore_support(labels_test, predicted, average='macro')
acc = accuracy_score(labels_test, predicted)
precision, recall, f1, acc

  _warn_prf(average, modifier, msg_start, len(result))


(0.8130028958806585,
 0.7625909720994425,
 0.7763661629663445,
 0.8308732783729081)

In [None]:
from sklearn.metrics import classification_report

# Presentation of results

Overall metrics:

In [None]:
# use num_labels instead of a hard-coded class count
print(classification_report(labels_test, predicted,
      target_names = [id2code[x] for x in range(num_labels)]))

              precision    recall  f1-score   support

         J80       0.94      0.86      0.90       346
        E039       0.66      0.73      0.69       703
       I4891       0.84      0.81      0.82       744
        F319       0.80      0.77      0.78       242
        E119       0.60      0.71      0.65       771
       F0280       0.87      0.87      0.87       674
        I609       0.97      0.99      0.98       824
      R65.21       0.91      0.88      0.90       644
       K7030       0.90      0.92      0.91       571
        K766       0.86      0.88      0.87       417
        M810       0.73      0.69      0.71       498
        B182       0.81      0.80      0.81       515
        D696       0.88      0.84      0.86       567
        I469       0.90      0.92      0.91      1081
        N186       0.92      0.94      0.93       605
        J449       0.65      0.66      0.66       307
       I2699       0.85      0.82      0.83        55
       Z79.4       0.97    

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Errors output:

In [None]:
size = 120
labelsl = list(labels_test)
print('Sentence\tTrue label\tPredicted label')
for i, s in enumerate(dataset_test['sentence'][:size]):
  if labelsl[i] != predicted[i]:
    print(s, '\t', id2code[labelsl[i]], '\t', id2code[predicted[i]], '\n')

Sentence	True label	Predicted label
 this is a 78 year-old male with a history of afib, pes while on coumadin, pulmonary hypertension and hypothyroidism who is admitted with shock on pressors 	 E039 	 J449 

 [**known lastname 106013**] is a 60-year-old female with htn, ckd, untreated chronic hepatitis c and active depression with related acute on chronic renal failure (baseline cr 1.1), and mechanical avr/mvr on coumadin who is presenting with hyperkalemia and hypertensive urgency 	 B182 	 F0280 

 global loss of [**doctor last name 352**]-white differentiation in bilateral cerebral hemispheres with diffuse hypodense appearance, most likely representing cerebral edema in this patient with cardiac arrest 	 I469 	 G936 

2) htn 3) pvd, s/p l fem-[**doctor last name **] [**2103**] 4) tcc of bladder - s/p turbt and local bcg treatments, no evidence of recurrence at last urology f/u 6 months ago 5) osteoporosis 6) hyperlipidemia 7) cataract surgery [**9-10**] 	 M810 	 E780 

he was called 

# Explanation of results
The classifier works pretty well for classes with more than 100 examples: F1 >= 0.6, mostly >= 0.8. It is mostly the sparse classes that degrade the overall macro-averaged metrics.

With more time, I'd suggest outputting the confusion matrix and doing a more thorough analysis of which classes were confused and why, in linguistic/semantic terms.
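
A lightweight first step towards that analysis is to count the most frequent (true, predicted) pairs among the errors. The sketch below is only an illustration: `most_confused` is a hypothetical helper and the toy ids and codes are made up; on the real data it would be called with `labels_test`, `predicted` and `id2code`.

```python
from collections import Counter

def most_confused(y_true, y_pred, id2name, top=3):
    """Return the most frequent (true, predicted) pairs among the errors."""
    errors = Counter(
        (id2name[t], id2name[p]) for t, p in zip(y_true, y_pred) if t != p
    )
    return errors.most_common(top)

# Toy example with three codes.
toy_id2code = {0: "E039", 1: "J449", 2: "B182"}
toy_true = [0, 0, 0, 1, 2, 2]
toy_pred = [1, 1, 0, 1, 0, 2]
print(most_confused(toy_true, toy_pred, toy_id2code))
```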

# Conclusions
The suggested simple first-shot model works well on classifying the suggested dataset. The reasons for that:
1. The suggested model was pre-trained on clinical notes from the MIMIC-III dataset, which the current dataset is a part of. Although it was pre-trained with language-modelling objectives rather than for classification (the paper evaluates it on NLI and NER tasks), the model appears to have a good grasp of the domain.

2. The suggested dataset contains very specific and well-structured short texts: probably, short summaries with diagnoses.

However, the results are expectedly poor for extremely low-populated classes.

Also, there seems to be no over-fitting to the validation set in the current experiment, because the test results are very similar to the validation ones.

# How to further continue given more time and resources
The current dataset is classified reasonably well.

However, to increase the performance and/or in further experiments, the following steps should/could be taken:

1. Technical checks:

- initialize random seeds wherever needed (both in the model and in the dataset split) for full reproducibility;

- check whether the best val model is actually used by the Trainer, or load the best saved checkpoint explicitly.

2. Technical steps:

- tweak parameters: add more epochs (F1, and recall in particular, is still increasing after 5 epochs); change the learning rate and weight decay; tune dropout and/or regularization (the checkpoint config already sets dropout to 0.1); perform early stopping based on the evaluation loss (when it stops decreasing);

- balance data: duplicate or assign higher weights to examples of low-populated classes; perform some synthetic data augmentation? The simplest example would be combining several diagnoses the way it's done in some sentences.

3. Analyze the reasons for errors:

- are longer sentences more error-prone? If yes, increasing the max sequence length could help;

- linguistic/semantic analysis: with the confusion matrix, analyze the errors made on specific classes and try to come up with generalized reasons for them:

  - Are sentences containing many diagnoses at once more prone to errors? (For example, *significant for hepatitis c virus and alcoholic cirrhosis diagnosed in* is labeled as *hepatitis* and classified as *cirrhosis*, but both seem correct.) If yes, we should probably turn the task into a multi-label one and re-annotate the data accordingly.

  - Does the model seem to ignore some specific information, or ignore everything apart from a few overly significant words? If yes, the training set could be balanced in these terms (see above);

- output attention heatmaps for the sentences where the model makes mistakes: are the tokens with the most attention actually the relevant ones? If not, try to add more training data containing them.
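
The question about longer sentences can be checked with very little code. Below is a sketch that groups test examples into token-length buckets and reports the error rate per bucket; the helper name, the bucket edges and the toy data are all assumptions made for illustration. On the real data the inputs would be the token counts of the test sentences, `labels_test` and `predicted`.

```python
def error_rate_by_length(lengths, y_true, y_pred, edges=(16, 32, 64)):
    """Error rate per token-length bucket; edges are inclusive upper bounds."""
    buckets = {}
    for n, t, p in zip(lengths, y_true, y_pred):
        # None marks texts longer than every edge.
        upper = next((e for e in edges if n <= e), None)
        total, wrong = buckets.get(upper, (0, 0))
        buckets[upper] = (total + 1, wrong + (t != p))
    return {k: wrong / total for k, (total, wrong) in buckets.items()}

# Toy token counts, true labels and predictions.
toy_lengths = [5, 12, 20, 30, 40, 70, 90]
toy_true = [0, 1, 2, 0, 1, 2, 0]
toy_pred = [0, 1, 2, 1, 1, 0, 1]
print(error_rate_by_length(toy_lengths, toy_true, toy_pred))
```

If the error rate clearly rises for the buckets above the truncation length of 32, that would support trying a larger `max_length`.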

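# Some older runs

Kept for reference only. The perfect scores in the first run below are presumably a result of the split error described above.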

In [None]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence, __index_level_0__. If sentence, __index_level_0__ are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 54288
  Num Epochs = 5
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 4245


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.4948,0.004355,1.0,1.0,1.0,1.0
2,0.0036,0.001552,1.0,1.0,1.0,1.0
3,0.0015,0.000884,1.0,1.0,1.0,1.0
4,0.0011,0.000628,1.0,1.0,1.0,1.0
5,0.0008,0.000551,1.0,1.0,1.0,1.0


The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence, __index_level_0__. If sentence, __index_level_0__ are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 26739
  Batch size = 64
Saving model checkpoint to medical-finetuned/checkpoint-849
Configuration saved in medical-finetuned/checkpoint-849/config.json
Model weights saved in medical-finetuned/checkpoint-849/pytorch_model.bin
tokenizer config file saved in medical-finetuned/checkpoint-849/tokenizer_config.json
Special tokens file saved in medical-finetuned/checkpoint-849/special_tokens_map.json
The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence, __index_level_0__. If sentence, __index_level_0__ are not expected by `BertForSequenceClassi

TrainOutput(global_step=4245, training_loss=0.0604822888145458, metrics={'train_runtime': 3935.9711, 'train_samples_per_second': 68.964, 'train_steps_per_second': 1.079, 'total_flos': 4465602777692160.0, 'train_loss': 0.0604822888145458, 'epoch': 5.0})

In [None]:
trainer.evaluate()

The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence, __index_level_0__. If sentence, __index_level_0__ are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 26739
  Batch size = 64


{'epoch': 5.0,
 'eval_accuracy': 1.0,
 'eval_f1': 1.0,
 'eval_loss': 0.004354993347078562,
 'eval_precision': 1.0,
 'eval_recall': 1.0,
 'eval_runtime': 115.5524,
 'eval_samples_per_second': 231.402,
 'eval_steps_per_second': 3.617}

In [None]:
pred = trainer.predict(encoded_dataset_test)

The following columns in the test set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence, __index_level_0__. If sentence, __index_level_0__ are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 500
  Batch size = 64


In [None]:
pred.predictions.shape

(500, 50)

In [None]:
predicted = pred.predictions.argmax(axis=1)
predicted

array([ 5, 15,  7, 16,  2,  6,  5,  9,  2, 13,  5,  9,  4, 13, 10, 12,  1,
        2,  2,  5, 17,  7,  4, 15, 18,  2, 13,  2,  4, 20, 12,  9, 10,  2,
       15,  9, 18, 17,  2,  4,  5, 13, 14,  7,  4,  7, 10,  8,  5,  2,  8,
        9,  9,  7, 20,  5, 14,  2,  4,  5, 11, 18,  2,  5,  5,  2,  8,  1,
        9, 12,  4,  9, 11,  7,  9, 10, 13, 12,  7,  9, 20,  4, 18, 14,  9,
       15,  4,  2, 21,  9, 20,  2, 15,  2,  9,  9,  1,  2,  6,  5,  4, 15,
       13, 13, 20,  2, 17, 18,  2,  2,  9, 13,  2,  2, 12,  5,  6, 17,  4,
       18, 12, 18, 11,  0,  0, 14, 10,  2,  9,  5,  2,  0,  9,  2,  1,  1,
       12,  6,  2,  2, 15, 17, 10, 13, 11,  5, 12, 14,  6, 12,  5, 18, 14,
        4, 20, 16,  4,  5, 10,  6, 14, 15,  2,  9, 18,  9, 14, 20, 12,  7,
       17, 20,  6,  4,  6,  5,  9, 19,  2,  8, 20, 14, 10, 15,  2,  1,  2,
       18, 12,  3,  0, 12,  2, 15, 11,  5, 13, 18, 15,  7,  9, 16,  2,  5,
        7,  6,  3, 17,  5,  6, 10,  0,  6, 11,  2,  4,  4,  5, 17,  0,  5,
       16, 17,  2,  6,  4

In [None]:
sum(labels_test == predicted)

500

In [None]:
import os

foo = './medical-finetuned/checkpoint-4245/'
os.listdir(foo)

['trainer_state.json',
 'special_tokens_map.json',
 'scheduler.pt',
 'training_args.bin',
 'rng_state.pth',
 'optimizer.pt',
 'vocab.txt',
 'config.json',
 'tokenizer_config.json',
 'pytorch_model.bin',
 'tokenizer.json']

In [None]:
m1 = AutoModelForSequenceClassification.from_pretrained(foo)

loading configuration file ./medical-finetuned/checkpoint-4245/config.json
Model config BertConfig {
  "_name_or_path": "./medical-finetuned/checkpoint-4245/",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6",
    "7": "LABEL_7",
    "8": "LABEL_8",
    "9": "LABEL_9",
    "10": "LABEL_10",
    "11": "LABEL_11",
    "12": "LABEL_12",
    "13": "LABEL_13",
    "14": "LABEL_14",
    "15": "LABEL_15",
    "16": "LABEL_16",
    "17": "LABEL_17",
    "18": "LABEL_18",
    "19": "LABEL_19",
    "20": "LABEL_20",
    "21": "LABEL_21",
    "22": "LABEL_22",
    "23": "LABEL_23",
    "24": "LABEL_24",
    "25": "LABEL_25",
    "26": "LABEL_26",
    "27": "LABEL_27",
    "28": "LABEL_28",

In [None]:
m1(encoded_dataset_val)

AttributeError: ignored

In [None]:
columns_to_return = ['input_ids', 'label', 'attention_mask']
encoded_dataset_train.set_format(type='torch', columns=columns_to_return)
encoded_dataset_val.set_format(type='torch', columns=columns_to_return)

In [None]:
m1.classifier

Linear(in_features=768, out_features=50, bias=True)

In [None]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence, __index_level_0__. If sentence, __index_level_0__ are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 668
  Num Epochs = 5
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 105


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,No log,3.095351,0.506061,0.262821,0.248094,0.328736
2,No log,2.491139,0.654545,0.387783,0.377002,0.424839
3,No log,2.037671,0.8,0.582829,0.588836,0.595455
4,No log,1.798174,0.824242,0.596699,0.591393,0.613636
5,No log,1.712147,0.866667,0.686627,0.687894,0.695455


The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence, __index_level_0__. If sentence, __index_level_0__ are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 330
  Batch size = 32
  _warn_prf(average, modifier, msg_start, len(result))
Saving model checkpoint to medical-finetuned/checkpoint-21
Configuration saved in medical-finetuned/checkpoint-21/config.json
Model weights saved in medical-finetuned/checkpoint-21/pytorch_model.bin
tokenizer config file saved in medical-finetuned/checkpoint-21/tokenizer_config.json
Special tokens file saved in medical-finetuned/checkpoint-21/special_tokens_map.json
The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence, __index_level_0__. If sentence, __index_leve

TrainOutput(global_step=105, training_loss=2.523679896763393, metrics={'train_runtime': 117.4686, 'train_samples_per_second': 28.433, 'train_steps_per_second': 0.894, 'total_flos': 109891276024320.0, 'train_loss': 2.523679896763393, 'epoch': 5.0})