<a href="https://colab.research.google.com/github/KiyoshiMu/tagC/blob/master/demo_BERT_active_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip -q install tagc --upgrade

Upload data.zip to /content

In [None]:
!unzip data.zip

In [None]:
import gc
import random
import shutil

import torch
from sklearn.preprocessing import MultiLabelBinarizer

from tagc.domain import Params, RawData
from tagc.io_utils import load_datazip
from tagc.make_figs import make_figures
from tagc.model import StandaloneModel
from tagc.train import Pipeline
from tagc.validation import eval_model

random.seed(42)


def train_main_model(
    dataset: RawData, model_p="model", outdir="out", epoch=10, upsample=200
):
    """Train the tagger on `dataset`, evaluate it, and save it to `model_p`."""
    keep_key = True
    max_len = 150
    dropout = 0.5
    mlb = MultiLabelBinarizer().fit(dataset.y_tags)
    params = Params(
        dataset, max_len, upsample, dropout, "bert-base-uncased", keep_key, epoch
    )
    pipeline = Pipeline(params)
    model_tmp = pipeline.train(output_dir=outdir)
    model = StandaloneModel(
        pipeline.model, pipeline.tokenizer, keep_key=keep_key, max_len=max_len
    )
    _, judges_count, _, _ = eval_model(model, dataset, 5, mlb, outdir)
    print(judges_count)
    pipeline.trainer.save_model(model_p)
    # Release GPU memory and delete the directory returned by pipeline.train.
    del pipeline
    gc.collect()
    torch.cuda.empty_cache()
    shutil.rmtree(model_tmp)

The data that support the findings of this study are available on reasonable request from the corresponding author [CJVC], pending local REB and privacy office approval. The data are not publicly available because they contain information that could compromise research participant privacy/consent.

## Train model

In [None]:
dataset_p = "/content/standardDs.zip"
ds = load_datazip(dataset_p)
train_main_model(ds)

## Analyze results

In [None]:
# Make sure unlabelled.json is available at the path below before running.
from tagc.make_figs import make_figures

model_p = "model"
outdir = "figs"
unlabelled_p = "/content/unlabelled.json"
make_figures(model_p, dataset_p, unlabelled_p, dst=outdir)

/content/figs/label_tsne.pdf should show clusters similar to those at https://storage.googleapis.com/pathopatho/label_tsne.html

The cluster positions will differ from those in the paper because t-SNE is randomized.

You can also check the other figures and compare them with the results in the paper.

## Make embedding

In [None]:
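import pickle

from tagc.io_utils import load_json


def embed(model_p, case_p):
    """Embed the cases in `case_p` with the saved model and pickle the result."""
    cases = load_json(case_p)
    model = StandaloneModel.from_path(model_p)
    # Use a distinct name so the local variable does not shadow this function.
    embeddings = model.predict(cases, pooled_output=True)
    with open("embed.pkl", "wb") as target:
        pickle.dump(embeddings, target)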

In [None]:
embed(model_p, unlabelled_p)

/content/embed.pkl holds the semantic embeddings of the cases in /content/unlabelled.json.

You can load the embeddings as follows. Using them as an index, we built a simple semantic search application (https://kkkfff.web.app/#/).

In [None]:
with open("embed.pkl", "rb") as target:   
    embedding = pickle.load(target)

In [None]:
embedding.shape  # 768-dim vectors for 1000 cases

(1000, 768)
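
To illustrate how such embeddings can serve as a search index, here is a minimal sketch of nearest-neighbour lookup by cosine similarity. It assumes `embedding` is, or can be converted to, a 2-D NumPy array with one row per case. The function name `top_k_similar` and the toy data are hypothetical, and this is not necessarily how the linked application is implemented.

```python
import numpy as np


def top_k_similar(embedding, query_idx: int, k: int = 5) -> list[int]:
    """Return the indices of the k cases most similar to the query case."""
    vectors = np.asarray(embedding, dtype=float)
    # L2-normalize each row so that a dot product equals cosine similarity.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    scores = unit @ unit[query_idx]
    scores[query_idx] = -np.inf  # never return the query case itself
    return np.argsort(-scores)[:k].tolist()


# Toy stand-in for the (1000, 768) array loaded from embed.pkl.
rng = np.random.default_rng(0)
toy = rng.normal(size=(20, 8))
toy[7] = toy[3] * 2.0  # same direction as case 3, so it should rank first
print(top_k_similar(toy, query_idx=3, k=3))
```

With the real data, `top_k_similar(embedding, query_idx=0)` would list the five cases whose embeddings point in the most similar direction to case 0.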

## Scripts

In [None]:
!pip -q install tagc --upgrade

In [None]:
!git clone https://github.com/KiyoshiMu/tagC.git

In [None]:
%cd tagC/

/content/tagC


Upload data.zip to /content/tagC

In [None]:
!unzip data.zip

### Dataset Creation by MCCV

In [None]:
!python3 make_dataset.py report.xlsx standardDs.zip standardDsTmp.zip mona_j.csv

The outputs are in the **data** folder.
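
For readers unfamiliar with the term, and assuming MCCV here stands for Monte Carlo cross-validation, the idea is to draw several independent random train/test splits instead of fixed folds. The sketch below illustrates only that idea on case indices. `mccv_splits` and its parameters are hypothetical names, and it does not reproduce the logic of make_dataset.py.

```python
import random


def mccv_splits(n_cases: int, n_rounds: int = 5, test_frac: float = 0.2, seed: int = 42):
    """Yield (train_idx, test_idx) pairs from repeated random splits."""
    rng = random.Random(seed)
    n_test = max(1, int(n_cases * test_frac))
    for _ in range(n_rounds):
        idx = list(range(n_cases))
        rng.shuffle(idx)
        # The first n_test shuffled indices form the test set; the rest train.
        yield sorted(idx[n_test:]), sorted(idx[:n_test])


splits = list(mccv_splits(n_cases=10, n_rounds=3))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Each round reshuffles all cases, so a case can land in the test set in more than one round, which is what distinguishes this scheme from k-fold cross-validation.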

### Train

In [None]:
!python3 make_train.py standardDs.zip unlabelled.json out/model out --plot True --train True

The model is in the **out/model** folder and its figures are in the **out** folder.

### Active learning comparison


#### Models trained on data sampled by active learning

In [None]:
!python3 make_exp.py lab0 --dataset_path standardDs.zip

The final model path is lab0/keepKey_200/model/

#### Models trained on data sampled by random selection

In [None]:
!python3 make_exp.py lab0R --dataset_path randomDs.zip

The final model path is lab0R/keepKey_200/model/
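
The two experiments above compare models trained on cases picked by active learning with models trained on randomly picked cases. The notebook does not state which selection strategy make_exp.py uses, so the sketch below is only an assumption for illustration: it ranks unlabelled cases by how close their predicted per-label probabilities are to 0.5 and contrasts that with a random draw. All names (`uncertainty`, `select_active`, `select_random`) and the toy probabilities are hypothetical.

```python
import random


def uncertainty(probs: list[float]) -> float:
    """Score in [0, 1]; higher when per-label probabilities are nearer 0.5."""
    return sum(1.0 - 2.0 * abs(p - 0.5) for p in probs) / len(probs)


def select_active(pool: dict[str, list[float]], k: int) -> list[str]:
    """Pick the k cases the model is least certain about."""
    return sorted(pool, key=lambda cid: uncertainty(pool[cid]), reverse=True)[:k]


def select_random(pool: dict[str, list[float]], k: int, seed: int = 42) -> list[str]:
    """Pick k cases uniformly at random, as a baseline."""
    return random.Random(seed).sample(sorted(pool), k)


# Toy predicted probabilities for three labels per unlabelled case.
pool = {
    "case_a": [0.98, 0.02, 0.01],
    "case_b": [0.55, 0.48, 0.60],
    "case_c": [0.90, 0.45, 0.05],
    "case_d": [0.51, 0.50, 0.49],
}
print("active:", select_active(pool, 2))
print("random:", select_random(pool, 2))
```

In this toy pool the active selection favours `case_d` and `case_b`, whose probabilities are all close to 0.5, while the random baseline ignores the model's predictions entirely.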

### Improvement from feedback

In [None]:
!python3 feedback.py [model_path] --eval_ret mona_j.csv \
--dataset_p standardDs.zip \
--ori_eval_p eval.json \
--outdir lab0/feedbackM \
--unlabelled_p unlabelled.json