Self Explainable Transformers for Flow Cytometry Cell Classification

Official implementation of our work: Towards Self-Explainable Transformers for Cell Classification in Flow Cytometry Data by Florian Kowarsch, Lisa Weijler, Matthias Wödlinger, Michael Reiter, Margarita Maurer-Granofszky, Angela Schumich, Elisa O. Sajaroff, Stefanie Groeneveld-Krentz, Jorge G. Rossi, Leonid Karawajew, Richard Ratei, Michael N. Dworzak

Abstract

Decisions of automated systems in healthcare can have far-reaching consequences such as delayed or incorrect treatment and thus must be explainable and comprehensible for medical experts. This also applies to the field of automated Flow Cytometry (FCM) data analysis. In leukemic cancer therapy, FCM samples are obtained from the patient’s bone marrow to determine the number of remaining leukemic cells. In a manual process, called gating, medical experts draw several polygons among different cell populations on 2D plots in order to hierarchically sub-select and track down cancer cell populations in an FCM sample. Several approaches exist that aim at automating this task. However, predictions of state-of-the-art models for automatic cell-wise classification act as black-boxes and lack the explainability of human-created gating hierarchies. We propose a novel transformer-based approach that classifies cells in FCM data by mimicking the decision process of medical experts. Our network considers all events of a sample at once and predicts the corresponding polygons of the gating hierarchy, thus, producing a verifiable visualization in the same way a human operator does. The proposed model has been evaluated on three publicly available datasets for acute lymphoblastic leukemia (ALL). In experimental comparison, it reaches state-of-the-art performance for automated blast cell identification while providing transparent results and explainable visualizations for human experts.

Installation

All dependencies are listed in requirements.txt. Install them with:

pip install -r requirements.txt

This repo requires flowmepy, a Python package for FCM data loading. For more information see: https://pypi.org/project/flowmepy/ Install with:

pip install flowmepy

(If you run into issues with newer versions of dependencies, check the requirements.txt file; it pins the package versions used at the time of testing.)

IMPORTANT: As of now, the flowmepy package is only supported on Windows. If you are running a Unix-based system and want to try out our method, you will need to preload the data (for example, into a pandas dataframe) on a Windows machine and then adapt the lines in the code where the flowme Python package is called. Simply load your preloaded event matrices (dataframes or CSV) instead of the events = sample.events() lines, and load your gate label matrices (dataframes or CSV) instead of the lines where labels = sample.gate_labels(). Sorry for the inconvenience; we are working on a solution.
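The workaround above can be sketched as follows. This is an illustrative stdlib-only loader, not code from this repository; the file names, the presence of a header row, and the all-numeric layout are assumptions about how you exported the data.

```python
# Hedged sketch of the Unix workaround: instead of calling
#   events = sample.events()
#   labels = sample.gate_labels()
# load matrices that were exported to CSV on a Windows machine beforehand.
import csv

def load_matrix(path):
    """Read a CSV with a header row into (column_names, list of float rows)."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)          # e.g. marker or gate names
        rows = [[float(v) for v in row] for row in reader]
    return header, rows

# Hypothetical usage, replacing the two flowmepy calls:
# markers, events = load_matrix("sample_events.csv")
# gates, labels = load_matrix("sample_gate_labels.csv")
```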

Usage

To reproduce the experiments from the paper, follow these steps:

  1. Create preprocessed cache files from the FCM files (these include the event data as well as the polygons).
  2. Train a model with the created cache files.
  3. Test the trained model.

Creating Cache

createcache.py generates one cache file per FCM sample containing the data needed to train the model. Preprocessing steps are applied as well, such as computing the convex hull for each specified gate definition and determining the ground-truth class for every cell.
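The convex-hull step can be illustrated as below. This is a scipy-based sketch of the idea, not the repository's actual preprocessing code; the function name is hypothetical.

```python
# Illustrative sketch: reduce a human-drawn gate polygon to its convex hull,
# as done during cache creation (assumption: 2D gate polygons).
import numpy as np
from scipy.spatial import ConvexHull

def convex_gate(polygon_points):
    """Return only the polygon vertices that lie on the convex hull."""
    pts = np.asarray(polygon_points, dtype=float)
    hull = ConvexHull(pts)
    return pts[hull.vertices]  # hull vertices, in counter-clockwise order
```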

{
   "type_name": "src.datastructures.configs.cachedatacreationconfig.CacheDataCreationConfig",
   "output_location": "path to folder where cached files should be stored",
   "blacklist_path": "",   //optional text file that specfies FCM-files that should be skipped
   "ignore_blacklist": true,
   "outlier_handler_config": {
       "n_events_threshold": 300, // min number of events needed bevore outlier removal is executed
       "alpha": 0.00001 // alpha value for Mahalanobis outlier removal
   },
   "source_datasets": [] // list of datasets that should be used
   "gate_defintions": [] // definition of gates from which the convex hull should be created
}
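The outlier_handler_config above can be read as follows: events are dropped when their squared Mahalanobis distance to the sample mean exceeds the chi-squared quantile implied by alpha, and the step is skipped entirely for very small samples. The sketch below is our own illustration of that behavior, not the repository's implementation.

```python
# Illustrative Mahalanobis outlier removal matching the config semantics:
# alpha is a tail probability; events beyond the corresponding chi-squared
# quantile (df = number of markers) are discarded.
import numpy as np
from scipy.stats import chi2

def remove_outliers(events, alpha=0.00001, n_events_threshold=300):
    if len(events) < n_events_threshold:
        return events  # too few events: skip outlier removal
    mean = events.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(events, rowvar=False))
    diff = events - mean
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)  # squared distances
    cutoff = chi2.ppf(1.0 - alpha, df=events.shape[1])
    return events[d2 <= cutoff]
```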

Train Model

train.py serves as entrypoint for model training.

{
    "type_name": "src.datastructures.configs.trainconfig.TrainConfig",
    "name": "train_vie14_val_bln",
    "default_retrieve_options": {
        "shuffle": true,
        "use_convex_gates": true, //wheter actual human gt polygons or generated convex gates are used
        "filter_gate" : "Intact", //Gate after which events are considered in training
        "polygon_min": -0.1,
        "polygon_max": 1.7,
        "always_keep_blasts": true, //wheter blast should be favored when sampling events
        "gate_polygon_interpolation_length" : 120, // number of points per polygon that are interpolated
        "gate_polygon_seq_length": 20, //number of points per polygon
        "events_seq_length": 50000, //number of events per sample used for training
        "used_markers": [],      // names of used markers
        "used_gates": [],        // names of used gates
        "gate_definitions" : [], //used Gate Definitions
        "events_mean": [
            1.2207406759262085,
            1.245536208152771,
            1.413953185081482,
            0.9958911538124084,
            2.1471059322357178,
            2.0955066680908203,
            1.4734785556793213,
            0.42288827896118164,
            0.8758889436721802,
            2.472586154937744
        ],
        "events_sd": [
            0.3791336715221405,
            0.3249945342540741,
            0.3320983946323395,
            0.18722516298294067,
            0.35046276450157166,
            0.25932908058166504,
            0.4003131091594696,
            0.014336027204990387,
            0.5279378294944763,
            0.34515222907066345
        ],
        "augmentation_config": {
            "shift_propability": 0.7,
            "shift_percent": 0.25,
            "polygon_scale_range" : {
                "Syto" : 0.01,
                "Singlets" : 0.01,
                "Intact" : 0.05,
                "CD19" : 0.15,
                "Blasts_CD45CD10" : 0.3,
                "Blasts_CD20CD10" : 0.3,
                "Blasts_CD38CD10" : 0.3
            },
            "scale_propability" : 0.7,
            "scale_propability_2nd_marker" : 0.3
        }
    },
    "train_data":  {}, //dataset for training
    "validation_data": {}, //dataset for validation
    "model_storage": {
        "file_path": "./data/saved_models/train_vie14_val_bln",
        "load_stats_from_file": false,
        "gpu_name" :"cuda"
    },
    "train_params": {
        "learning_rate": 0.001,
        "weight_decay": 0.00000000000001,
        "validation_interval": 50,
        "n_training_epochs": 1500,
        "training_batchsize": 2,
        "clip_norm": 4.0,
        "random_seed": 42,
        "polygon_loss_weight": 1.0,
        "saving_interval": 50,
        "use_auxiliary_loss" : true
    },
    "model_factory": {
        "model_type": "src.model.FlowGATR.FlowGATR",
        "params_type": "src.datastructures.configs.modelparams.ModelParams",
        "params": {
            "dim_input": 10,
            "n_hidden_layers_ISAB": 2,
            "n_hidden_layers_decoder": 2,
            "n_obj_queries": 7,
            "points_per_query" : 5,
            "dim_latent": 36,
            "n_polygon_out": 20,
            "n_decoder_cross_att_heads": 6,
            "n_hidden_layers_polygon_out": 2,
            "n_perciever_blocks_decoder" : 4
        }
    },
    "wandb_config": {
        "entity": "your_wandb_username",
        "prj_name": "fcm-polygon-pred",
        "notes": "",
        "tags": [],
        "enabled": true //wheter training is logged to wandb or not
    },
    "gpu_name": "cuda",
    "n_workers": 0
}
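Note that the config examples above contain //-style comments, which strict JSON parsers reject. If you want to experiment with such annotated configs, a minimal comment-stripping loader could look like this (our own sketch, not the repository's config loader; it naively assumes no "//" appears inside string values, which holds for the configs shown here):

```python
# Minimal sketch: parse a JSON config that contains //-style line comments.
import json
import re

def load_jsonc(text):
    # strip everything from "//" to end of line, then parse as plain JSON
    return json.loads(re.sub(r"//[^\n]*", "", text))
```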

Test Model

test.py evaluates the performance of a trained model on a given test set.

Data

The vie14, bln, and bue datasets from our work can be downloaded here: https://flowrepository.org/id/FR-FCM-ZYVT

Cite

If you use this project, please consider citing our work:

@inproceedings{kowarsch2022towards,
  title={Towards Self-explainable Transformers for Cell Classification in Flow Cytometry Data},
  author={Kowarsch, Florian and Weijler, Lisa and W{\"o}dlinger, Matthias and Reiter, Michael and Maurer-Granofszky, Margarita and Schumich, Angela and Sajaroff, Elisa O and Groeneveld-Krentz, Stefanie and Rossi, Jorge G and Karawajew, Leonid and others},
  booktitle={Interpretability of Machine Intelligence in Medical Image Computing: 5th International Workshop, iMIMIC 2022, Held in Conjunction with MICCAI 2022, Singapore, Singapore, September 22, 2022, Proceedings},
  pages={22--32},
  year={2022},
  organization={Springer}
}
