### This notebook is an advanced tutorial detailing the config changes for optimising the BERT and LSTM models for the Experiencer classification task on a custom dataset

In [1]:
import json
import os
from datetime import date
from medcat.cat import CAT
from medcat.meta_cat import MetaCAT
from medcat.config_meta_cat import ConfigMetaCAT
from medcat.tokenizers.meta_cat_tokenizers import TokenizerWrapperBPE, TokenizerWrapperBERT
from tokenizers import ByteLevelBPETokenizer

In [2]:
# if you want to enable info level logging
import logging
logging.basicConfig(level=logging.INFO,force=True)

# Set parameters

In [3]:
# relative path to working_with_cogstack folder
_rel_path = os.path.join("..", "..", "..")
# absolute path to working_with_cogstack folder
base_path = os.path.abspath(_rel_path)
# Load mct export
ann_dir = os.path.join(base_path, "data", "medcattrainer_export")

mctrainer_export_path = os.path.join(ann_dir, "")  # name of your mct export

# Load model
model_dir = os.path.join(base_path, "models", "modelpack")
modelpack = '' # name of modelpack
model_pack_path = os.path.join(model_dir, modelpack)

# will be used to date the trained model
today = str(date.today())
today = today.replace("-","")
# output_modelpack = os.path.join(model_dir, f"{today}_trained_model")

# Initialise meta_ann models
if model_pack_path[-4:] == '.zip':
    base_dir_meta_models = model_pack_path[:-4]
else:
    base_dir_meta_models = model_pack_path

# Iterate through the meta_models contained in the model
meta_model_names = [] # These meta-annotation tasks should correspond to the ones labelled in the MedCATtrainer export
for dirpath, dirnames, filenames in os.walk(base_dir_meta_models):
    for dirname in dirnames:
        if dirname.startswith('meta_'):
            meta_model_names.append(dirname[5:])

Before you run the next section, please double-check that the model's meta_annotation names match those specified in the mct export.
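One way to run this check is to collect the meta-annotation task names from the export and compare them with `meta_model_names`. The sketch below assumes the export is nested as `projects`, then `documents`, then `annotations`, each annotation carrying a `meta_anns` entry that is either a dict keyed by task name or a list of dicts with a `name` field. The helper name `collect_meta_task_names` and the toy export are illustrative only; adjust the traversal if your export differs.

```python
def collect_meta_task_names(export):
    """Return the set of meta-annotation task names found in an export dict."""
    names = set()
    for project in export.get("projects", []):
        for doc in project.get("documents", []):
            for ann in doc.get("annotations", []):
                meta_anns = ann.get("meta_anns", {})
                if isinstance(meta_anns, dict):
                    # assumed layout: {"Experiencer": {"value": "Patient", ...}}
                    names.update(meta_anns.keys())
                else:
                    # assumed layout: [{"name": "Experiencer", "value": ...}]
                    names.update(m["name"] for m in meta_anns if "name" in m)
    return names


# Toy export, for illustration only
toy_export = {
    "projects": [
        {
            "documents": [
                {
                    "annotations": [
                        {"meta_anns": {"Experiencer": {"value": "Patient"}}},
                        {"meta_anns": [{"name": "Experiencer", "value": "Family"}]},
                    ]
                }
            ]
        }
    ]
}

model_tasks = ["Experiencer", "Presence"]  # stand-in for meta_model_names
export_tasks = collect_meta_task_names(toy_export)
missing = sorted(set(model_tasks) - export_tasks)
print("Tasks in the model but not in the export:", missing)
```

With a real export, load the JSON file at `mctrainer_export_path` with `json.load` and pass the resulting dict to the helper.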



# For LSTM model

In [4]:
for meta_model in meta_model_names:
    vocab_file = os.path.join(base_dir_meta_models,"meta_"+meta_model,'bbpe-vocab.json')
    merges_file = os.path.join(base_dir_meta_models,"meta_"+meta_model,'bbpe-merges.txt')
    tokenizer = TokenizerWrapperBPE(ByteLevelBPETokenizer(vocab=vocab_file,
                                    merges=merges_file,
                                    lowercase=True))
    # load and sort out the config
    config_file = os.path.join(base_dir_meta_models,"meta_"+meta_model,"config.json")
    with open(config_file, 'r') as jfile:
        config_dict = json.load(jfile)
    config = ConfigMetaCAT()
    for key, value in config_dict.items():
        setattr(config, key, value['py/state']['__dict__'])
        
    save_dir_path= "test_meta_"+meta_model # Where to save the meta_model and results. 
    #Ideally this should replace the meta_models inside the modelpack
    
    #Below are the config values used for Experiencer classification task
    
    #Class weights--------------------------------------------------------------
    #adjusting class weights to give more importance to minority classes
    # To use class weights, we have 2 options:

    #1st option:
    #to calculate class weights based on class distribution
    #NOTE: this will only be applicable if config.train.class_weights is empty
    config.train['class_weights'] = []
    config.train['compute_class_weights'] = True

    #2nd option
    #using specified class weights
    config.train['class_weights'] = [0.4,1.5,0.1]

    #we'll use the 2nd option in this example
    #----------------------------------------------------------------------------
    
    #NOTE: when using class weights, it is recommended to define the category to index mapping to ensure the weights are assigned to the right class
    config.general['category_value2id'] = {'Family':1, 'Other':0, 'Patient':2}

    config.train['test_size'] = 0.2
    config.train['nepochs'] = 15

    #since we have class imbalance, macro avg is better suited than weighted avg
    config.train.metric['base'] = 'macro avg'

    # Initialise and train meta_model
    mc = MetaCAT(tokenizer=tokenizer, embeddings=None, config=config)
    results = mc.train_from_json(mctrainer_export_path, save_dir_path=save_dir_path)

INFO:medcat.meta_cat:LSTM model used for classification
INFO:medcat.utils.meta_cat.data_utils:Updated label_data: {1: 75, 0: 75, 2: 75}
INFO:medcat.utils.meta_cat.ml_utils:Total steps for optimizer: 1684
INFO:medcat.utils.meta_cat.ml_utils:Epoch: 0 ************************************************** Train
INFO:medcat.utils.meta_cat.ml_utils:              precision    recall  f1-score   support

           0       0.46      0.40      0.43       793
           1       0.00      0.00      0.00        64
           2       0.92      0.94      0.93      6331

    accuracy                           0.87      7188
   macro avg       0.46      0.45      0.45      7188
weighted avg       0.86      0.87      0.87      7188

INFO:medcat.utils.meta_cat.ml_utils:Epoch: 0 ************************************************** Test
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
INFO:medcat

The LSTM achieves a high weighted F1-score (0.94), but with this degree of class imbalance that figure is misleading. <br>Recall for the minority classes (Other and Family) is low, particularly for Family at 0.36.
<br> There is room for improvement, especially on the minority classes.
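To see why the weighted average can hide poor minority-class performance, the short sketch below recomputes both averages from the per-class F1 scores and supports printed in the epoch 0 training report above. It reproduces the 0.45 (macro avg) and 0.87 (weighted avg) rows of that report; the variable names are illustrative.

```python
# Per-class F1 and support, copied from the epoch 0 training report above
f1_scores = {0: 0.43, 1: 0.00, 2: 0.93}
support = {0: 793, 1: 64, 2: 6331}

# Macro average: every class counts equally, regardless of its size
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

# Weighted average: each class contributes in proportion to its support,
# so the majority class (2) dominates the result
total = sum(support.values())
weighted_f1 = sum(f1_scores[c] * support[c] for c in f1_scores) / total

print(f"macro avg F1:    {macro_f1:.2f}")
print(f"weighted avg F1: {weighted_f1:.2f}")
```

Class 1 has an F1 of 0.00 here, yet the weighted average barely moves because that class accounts for only 64 of 7188 samples. The macro average drops sharply instead, which is why it is used as the base metric in this notebook.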

# For BERT model

In [5]:
for meta_model in meta_model_names:
    # load and sort out the config
    config_file = os.path.join(base_dir_meta_models,"meta_"+meta_model,"config.json")
    with open(config_file, 'r') as jfile:
        config_dict = json.load(jfile)
    config = ConfigMetaCAT()
    for key, value in config_dict.items():
        setattr(config, key, value['py/state']['__dict__'])

    # change model name if training BERT for the first time
    config.model['model_name'] = 'bert'

    tokenizer = TokenizerWrapperBERT.load(os.path.join(base_dir_meta_models,"meta_"+meta_model), config.model['model_variant'])
    

    #Below are the config values used for Experiencer classification task

    config.model['nclasses'] = 3
    config.general['category_name'] = 'Experiencer'

    config.train.lr = 5e-4
    config.train['test_size'] = 0.2
    config.train['nepochs'] = 15

    # you can switch between freezing the BERT layers and using LoRA during training
    # to use LoRA:
    config.model['model_freeze_layers'] = False

    config.train.metric['base'] = 'macro avg'

    config.train['class_weights'] = [0.4,1.5,0.1]
    config.general['category_value2id'] = {'Family':1, 'Other':0, 'Patient':2}

    save_dir_path= "test_meta" # Where to save the meta_model and results.
    #Ideally this should replace the meta_models inside the modelpack

    # Initialise and train meta_model
    mc = MetaCAT(tokenizer=tokenizer, embeddings=None, config=config)
    results = mc.train_from_json(mctrainer_export_path, save_dir_path=save_dir_path)

    # Save results (the with-block makes sure the file is closed)
    with open(os.path.join(save_dir_path, 'meta_results.json'), 'w') as f:
        json.dump(results, f)

If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`. Loading from library for model variant: bert-base-uncased
Input size for bert-base-uncased model should be 768, provided input size is 300 Input size changed to 768
INFO:medcat.meta_cat:BERT model used for classification
Token indices sequence length is longer than the specified maximum sequence length for this model (554 > 512). Running this sequence through the model will result in indexing errors
INFO:medcat.utils.meta_cat.data_utils:Updated label_data: {1: 75, 0: 75, 2: 75}
INFO:medcat.utils.meta_cat.ml_utils:Total steps for optimizer: 1078
INFO:medcat.utils.meta_cat.ml_utils:Epoch: 0 ************************************************** Train
INFO:medcat.utils.meta_cat.ml_utils:              precision    recall  f1-score   support

           0       0.12      0.25      0.16       793
           1       0.00   

The BERT model improves on the LSTM for the Family class, but the recall values can still be improved.<br> To help tackle this, we'll use 2 phase learning for training.

## If you don't have the model packs and are training from scratch

In [None]:
config = ConfigMetaCAT()
# make sure to change the following parameters:
# config.model['nclasses']
# config.general['category_name']

# change model name if training BERT for the first time
config.model['model_name'] = 'bert'

tokenizer = TokenizerWrapperBERT.load("", config.model['model_variant'])

save_dir_path= "test_meta" # Where to save the meta_model and results. 
#Ideally this should replace the meta_models inside the modelpack

# Initialise and train meta_model
mc = MetaCAT(tokenizer=tokenizer, embeddings=None, config=config)
results = mc.train_from_json(mctrainer_export_path, save_dir_path=save_dir_path)

# Save results (the with-block makes sure the file is closed)
with open(os.path.join(save_dir_path, 'meta_results.json'), 'w') as f:
    json.dump(results, f)

## If using 2 phase learning for training
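As a rough illustration of what phase 1 does, the sketch below undersamples a toy dataset so that no class has more samples than a chosen reference class. It is a simplified stand-in, not MedCAT's own implementation; the function name `undersample_to_class` and the toy data are hypothetical. Its behaviour follows the `Updated label_data` lines in this notebook's logs, where larger classes are capped at the size of the reference class and smaller classes are left unchanged.

```python
import random
from collections import Counter, defaultdict


def undersample_to_class(samples, reference_label=None, seed=0):
    """Cap every class at the size of `reference_label`.

    `samples` is a list of tuples whose last element is the label.
    If `reference_label` is None, the smallest class sets the cap.
    """
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample[-1]].append(sample)

    if reference_label is None:
        cap = min(len(items) for items in by_label.values())
    else:
        cap = len(by_label[reference_label])

    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    result = []
    for items in by_label.values():
        # only classes larger than the cap are reduced
        result.extend(rng.sample(items, cap) if len(items) > cap else items)
    return result


# Toy dataset: 50 Patient, 10 Other, 3 Family
toy = [("doc", "Patient")] * 50 + [("doc", "Other")] * 10 + [("doc", "Family")] * 3

phase1 = undersample_to_class(toy, reference_label="Other")
print(Counter(label for _, label in phase1))
```

Phase 2 then continues training on the full, unsampled data, so the model sees the true class distribution after it has first learned the minority classes.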

### Phase 1

In [6]:
######################################################################################################
# 2 phase learning (used for imbalanced datasets) - trains the models twice: 
#                    phase 1: trains for minority class(es) by undersampling data
#                    phase 2: trains for all classes
# parameter values: 
# 1: Phase 1 - Train model on undersampled data
# 2: Phase 2 - Continue training on full data
# 0: None
#
# Paper reference - https://ieeexplore.ieee.org/document/7533053
# NOTE: Make sure to use class weights in favour of minority classes with 2 phase learning
#####################################################################################################

# Follow same steps till defining save_dir_path

# change phase number to 1
config.model['phase_number'] = 1

# specify the class that will define the desired sample size for the undersampling process
# if this is left empty, the class with the lowest samples will be chosen
# example
config.model['category_undersample'] = 'Other'

#Below are the config values used for Experiencer classification task

config.model['nclasses'] = 3
config.general['category_name'] = 'Experiencer'


config.train.lr = 5e-4
config.train['test_size'] = 0.2
config.train['nepochs'] = 20

config.train.metric['base'] = 'macro avg'

config.train['class_weights'] = [0.4,1.5,0.05]
config.general['category_value2id'] = {'Other':0, 'Family':1, 'Patient':2}

config.model['model_freeze_layers'] = False

# Initialise and train meta_model 
mc = MetaCAT(tokenizer=tokenizer, embeddings=None, config=config)
results = mc.train_from_json(mctrainer_export_path, save_dir_path=save_dir_path)

If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`. Loading from library for model variant: bert-base-uncased
Input size for bert-base-uncased model should be 768, provided input size is 300 Input size changed to 768
INFO:medcat.meta_cat:BERT model used for classification
Token indices sequence length is longer than the specified maximum sequence length for this model (554 > 512). Running this sequence through the model will result in indexing errors
INFO:medcat.utils.meta_cat.data_utils:Updated label_data: {0: 1002, 1: 75, 2: 1002}
INFO:medcat.utils.meta_cat.ml_utils:Total steps for optimizer: 665
INFO:medcat.utils.meta_cat.ml_utils:Epoch: 0 ************************************************** Train
INFO:medcat.utils.meta_cat.ml_utils:              precision    recall  f1-score   support

           0       0.48      0.80      0.60       809
           1       0.03

### Phase 2

In [7]:
# Perform 2nd round of training

config.model['phase_number'] = 2
config.train['class_weights'] = [0.3,1,0.05]
config.train['nepochs'] = 10

results = mc.train_from_json(mctrainer_export_path, save_dir_path=save_dir_path)

If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`. Loading from library for model variant: bert-base-uncased
Input size for bert-base-uncased model should be 768, provided input size is 300 Input size changed to 768
INFO:medcat.meta_cat:BERT model used for classification
Token indices sequence length is longer than the specified maximum sequence length for this model (554 > 512). Running this sequence through the model will result in indexing errors
INFO:medcat.utils.meta_cat.data_utils:Updated label_data: {0: 1002, 1: 75, 2: 1002}
INFO:medcat.meta_cat:Model state loaded from dict for 2 phase learning
INFO:medcat.utils.meta_cat.ml_utils:Total steps for optimizer: 718
INFO:medcat.utils.meta_cat.ml_utils:Epoch: 0 ************************************************** Train
INFO:medcat.utils.meta_cat.ml_utils:              precision    recall  f1-score   support

      

Using 2 phase learning boosts the model's performance, especially for the minority classes, with recall values of 0.88 and 1. <br> This highlights the impact of 2 phase learning when training on imbalanced datasets.
<br><br><b>NOTE:</b> The observed performance improvements are dataset-dependent, and you may not experience such substantial gains. Additionally, class weights and other hyperparameters will need to be fine-tuned for your specific dataset.
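One possible starting point for tuning the class weights is the inverse-frequency ("balanced") heuristic, `weight = n_samples / (n_classes * class_count)`. This is a general heuristic and not necessarily what `compute_class_weights` does inside MedCAT; the function name `balanced_class_weights` is hypothetical. The counts below are the training supports from the LSTM log earlier in this notebook.

```python
def balanced_class_weights(class_counts):
    """Inverse-frequency weights, returned as a list ordered by class id.

    `class_counts` maps class id to the number of training samples.
    """
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return [
        n_samples / (n_classes * class_counts[class_id])
        for class_id in sorted(class_counts)
    ]


# Class ids follow category_value2id: {'Other': 0, 'Family': 1, 'Patient': 2}
train_counts = {0: 793, 1: 64, 2: 6331}
weights = balanced_class_weights(train_counts)
for class_id, weight in enumerate(weights):
    print(f"class {class_id}: weight {weight:.3f}")
```

The resulting list is ordered by class id, which is the ordering assumed here for `config.train['class_weights']` given the `category_value2id` mapping. Treat the values as a first guess to adjust, not as final settings.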

# Oversampling data

You can generate synthetic data to help mitigate class imbalance. <br> Use this code to generate synthetic data using an LLM - [link](https://gist.github.com/shubham-s-agarwal/401ef8bf6cbbd66fa0c76a8fbfc1f6c4) <br> <b>NOTE</b>: the generated data will require a manual quality check to ensure that only high-quality, relevant data is used for training.

The data generated by the gist code is in a different format from the one MedCAT requires, so it currently needs manual reformatting. We will update this module to include code that handles the conversion.

In [None]:
# To run the training with original + synthetic data
# Follow all the same steps till initializing the metacat model

# Initialise and train meta_model
mc = MetaCAT(tokenizer=tokenizer, embeddings=None, config=config)

# the format expected is [[['text','of','the','document'], [index of medical entity], "label"],
#                         [['text','of','the','document'], [index of medical entity], "label"]]

synthetic_data_export = [[],[],[]]

results = mc.train_from_json(mctrainer_export_path, save_dir_path=save_dir_path,data_oversampled=synthetic_data_export)
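Because the synthetic data is formatted by hand, a quick structural check before training can catch malformed records. The sketch below validates records against the layout described in the comment of the cell above. It assumes the entity indices point into the token list and that labels must be keys of `category_value2id`; the helper name `validate_oversampled` and the sample records are hypothetical.

```python
def validate_oversampled(records, category_value2id):
    """Return a list of human-readable problems; empty means no issue found."""
    problems = []
    for i, record in enumerate(records):
        if len(record) != 3:
            problems.append(f"record {i}: expected 3 parts, got {len(record)}")
            continue
        tokens, entity_idx, label = record
        if not isinstance(tokens, list) or not tokens:
            problems.append(f"record {i}: tokens must be a non-empty list")
        elif not isinstance(entity_idx, list) or any(
            not isinstance(j, int) or not 0 <= j < len(tokens) for j in entity_idx
        ):
            # assumption: indices refer to positions in the token list
            problems.append(f"record {i}: entity index out of range")
        if label not in category_value2id:
            problems.append(f"record {i}: unknown label {label!r}")
    return problems


category_value2id = {'Other': 0, 'Family': 1, 'Patient': 2}
sample_records = [
    [['the', 'patient', 'has', 'asthma'], [3], 'Patient'],   # valid
    [['mother', 'had', 'diabetes'], [5], 'Family'],          # index out of range
    [['no', 'relevant', 'history'], [2], 'Unknown'],         # unknown label
]

problems = validate_oversampled(sample_records, category_value2id)
for problem in problems:
    print(problem)
```

A check like this only covers structure; the manual quality review of the generated text is still needed.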