# Bias Amplification by Quantitative Misrepresentation

This notebook walks you through bias amplification by quantitative misrepresentation. The demonstration uses MIDRC open A1 chest X-ray image data as the example and shows how to process the data, train and deploy the models, and visualize the resulting bias.
Quantitative misrepresentation (i.e., data set skew) is systematically applied to the training set to simulate different levels of selection bias. Specifically, the data used during development are selected such that the disease prevalence varies between patient subgroups. The degree to which bias is promoted can be controlled by changing the degree to which the prevalence varies between subgroups.

## Data Preprocessing

To amplify the model bias by quantitative misrepresentation, the first step is to vary the disease prevalence within each subgroup in the **training set** while maintaining constant overall disease prevalence and subgroup distribution. 
This demonstration uses patient sex (male and female) as the example subgroup attribute. Running the following code samples the data so that the disease prevalence in the "F" (female) subgroup is 10%, 25%, 50%, 75%, and 90%, while the disease prevalence in the "M" (male) subgroup is 90%, 75%, 50%, 25%, and 10%, respectively. Note that the training set with 50% disease prevalence in each subgroup serves as the baseline.
The code saves the resulting training sets as *.csv* files named by subgroup disease prevalence (e.g., the training set with 10% disease prevalence in the female subgroup is saved as *train_10FP.csv*).

In [1]:
# manipulate subgroup disease prevalence in training/validation set
main_dir = "/gpfs_projects/yuhang.zhang/OUT/2022_CXR/test/RAND_0"
%run ../src/utils/quantitative_misrepresentation_data_process.py --input_file "train.csv" \
                                                                 --prevalences 0.1 0.25 0.5 0.75 0.9 \
                                                                 --test_subgroup "F" \
                                                                 --in_dir "{main_dir}" \
                                                                 --save_dir "{main_dir}"
%run ../src/utils/quantitative_misrepresentation_data_process.py --input_file "validation.csv" \
                                                                 --prevalences 0.1 0.25 0.5 0.75 0.9 \
                                                                 --test_subgroup "F" \
                                                                 --in_dir "{main_dir}" \
                                                                 --save_dir "{main_dir}"

Start data split of 0.1 for F

Start data split of 0.25 for F

Start data split of 0.5 for F

Start data split of 0.75 for F

Start data split of 0.9 for F
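
The internals of `quantitative_misrepresentation_data_process.py` are not shown in this notebook. As a rough illustration of the idea only, the sketch below subsamples a table so that one subgroup reaches a target disease prevalence and the other reaches the complement. It assumes the one-hot columns `M`, `F`, and `Yes` hold 0/1 values and that both subgroups are drawn at equal size, which keeps the overall prevalence at 50% and the subgroup distribution fixed. The function name `resample_prevalence` and the toy table `demo` are hypothetical.

```python
import pandas as pd


def resample_prevalence(df, target_prev, test_subgroup="F", other_subgroup="M",
                        pos_col="Yes", seed=0):
    """Subsample df so test_subgroup has prevalence target_prev and
    other_subgroup has prevalence 1 - target_prev (illustrative sketch)."""
    # Split the table into the four (subgroup, label) cells.
    cells = {
        (g, y): df[(df[g] == 1) & (df[pos_col] == y)]
        for g in (test_subgroup, other_subgroup)
        for y in (0, 1)
    }
    # Size each subgroup by the smallest cell so every draw is feasible
    # without replacement at any prevalence level.
    n_per_subgroup = min(len(c) for c in cells.values())
    parts = []
    for g, prev in ((test_subgroup, target_prev),
                    (other_subgroup, 1 - target_prev)):
        n_pos = round(n_per_subgroup * prev)
        n_neg = n_per_subgroup - n_pos
        parts.append(cells[(g, 1)].sample(n=n_pos, random_state=seed))
        parts.append(cells[(g, 0)].sample(n=n_neg, random_state=seed))
    return pd.concat(parts).reset_index(drop=True)


# Toy table (made-up data): 100 rows in each (sex, label) cell.
rows = []
for sex in ("M", "F"):
    for label in (0, 1):
        for _ in range(100):
            rows.append({"M": int(sex == "M"), "F": int(sex == "F"),
                         "Yes": label, "No": 1 - label})
demo = pd.DataFrame(rows)

for p in (0.1, 0.25, 0.5, 0.75, 0.9):
    sub = resample_prevalence(demo, p)
    prev_f = sub.loc[sub["F"] == 1, "Yes"].mean()
    prev_m = sub.loc[sub["M"] == 1, "Yes"].mean()
    print(f"{round(p * 100)}FP: female={prev_f:.2f} male={prev_m:.2f} "
          f"overall={sub['Yes'].mean():.2f}")
```

Sizing both subgroups by the smallest cell is a simplification; the actual script may choose sample sizes differently.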



## Model Training

After data preprocessing is done, you can run the following cell to train models on these different training sets. This demonstration uses *ResNet-18* as the example network architecture, initialized with weights pre-trained using a contrastive self-supervised learning (CSL) approach on CheXpert data. The weight file can be found under the */example/* directory.

In [None]:
# model training
exp_list = ["10FP", "25FP", "50FP", "75FP", "90FP"]
for EXP in exp_list:
    %run ../src/utils/model_train.py -i "{main_dir}/{EXP}_train.csv" \
                                     -v "{main_dir}/validation.csv" \
                                     -o "{main_dir}/{EXP}" \
                                     -l "{main_dir}/{EXP}/run_log.log" \
                                     -c "checkpoint_csl.pth.tar" \
                                     -p "adam" \
                                     -g 0 \
                                     --pretrained_weights True

Start experiment...


2024-02-29 15:14:40.637353: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-02-29 15:14:40.702802: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


Full fine tuning selected
['patient_id', 'Path', 'M', 'F', 'White', 'Black', 'Yes', 'No']
['patient_id', 'Path', 'M', 'F', 'White', 'Black', 'Yes', 'No']
Training...
EPOCH	TR-AVG-LOSS	VD-AUC
> 0	0.64736		0.60740
> 1	0.47892		0.58973
> 2	0.37559		0.59448
> 3	0.29787		0.58872
> 4	0.27418		0.58733
> 5	0.25196		0.58974
> 6	0.23189		0.58815
> 7	0.22453		0.58927
> 8	0.22361		0.58897
> 9	0.21520		0.58852
> 10	0.21679		0.58867
> 11	0.21191		0.58887
> 12	0.21539		0.58925
> 13	0.21745		0.58923
> 14	0.21056		0.58958
Final epoch model saved to: /gpfs_projects/yuhang.zhang/OUT/2022_CXR/test/RAND_0//10FP/pytorch_last_epoch_model.onnx
Final epoch model saved to: /gpfs_projects/yuhang.zhang/OUT/2022_CXR/test/RAND_0//10FP/best_auc_model.onnx
END.
Start experiment...
Full fine tuning selected
['patient_id', 'Path', 'M', 'F', 'White', 'Black', 'Yes', 'No']
['patient_id', 'Path', 'M', 'F', 'White', 'Black', 'Yes', 'No']
Training...
EPOCH	TR-AVG-LOSS	VD-AUC
> 0	0.68779		0.61231
> 1	0.59941		0.64265
> 2	0.5

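The training log prints one row per epoch with the average training loss and the validation AUC. To check which epoch scored best on validation, a small parser along these lines could be used. This is a sketch that assumes each epoch row has the form `> epoch loss auc` separated by whitespace, as in the output above; the helper name `best_epoch` is hypothetical, and incomplete rows are skipped.

```python
def best_epoch(log_text):
    """Return (epoch, validation AUC) for the row with the highest VD-AUC."""
    rows = []
    for line in log_text.splitlines():
        parts = line.split()
        # Complete epoch rows have four tokens: ">", epoch, loss, AUC.
        if len(parts) == 4 and parts[0] == ">":
            rows.append((int(parts[1]), float(parts[3])))
    if not rows:
        raise ValueError("no complete epoch rows found")
    return max(rows, key=lambda r: r[1])


# First rows of the 10FP run, copied from the log above.
sample_log = """EPOCH\tTR-AVG-LOSS\tVD-AUC
> 0\t0.64736\t\t0.60740
> 1\t0.47892\t\t0.58973
> 2\t0.37559\t\t0.59448
> 3\t0.29787\t\t0.58872
"""
print(best_epoch(sample_log))
```

For the 10FP rows shown, the highest validation AUC occurs at epoch 0, while the training loss keeps falling in later epochs.
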
## Model Inference

After model training is done, you can deploy the models on the independent test set by running the inference code below. The inference code saves the prediction scores as a *results__.tsv* file in the same directory as each model.

In [3]:
# model inference
for exp in exp_list:
    %run ../src/utils/model_inference.py -i "{main_dir}/independent_test.csv" \
                                         -w "{main_dir}/{exp}/pytorch_last_epoch_model.onnx" \
                                         -g 0 \
                                         -l "{main_dir}/{exp}/inference_log.log"

['patient_id', 'Path', 'M', 'F', 'White', 'Black', 'Yes', 'No']
Inferencing now ...
 onnxruntime available providers: ['CUDAExecutionProvider', 'CPUExecutionProvider']
 onnxruntime running on: GPU
 There were 1048 ROIs in the lists
 AUROC = 0.584017
 Time taken:  175.85512603400275
END.
['patient_id', 'Path', 'M', 'F', 'White', 'Black', 'Yes', 'No']
Inferencing now ...
 onnxruntime available providers: ['CUDAExecutionProvider', 'CPUExecutionProvider']
 onnxruntime running on: GPU
 There were 1048 ROIs in the lists
 AUROC = 0.639466
 Time taken:  74.80611474101897
END.
['patient_id', 'Path', 'M', 'F', 'White', 'Black', 'Yes', 'No']
Inferencing now ...
 onnxruntime available providers: ['CUDAExecutionProvider', 'CPUExecutionProvider']
 onnxruntime running on: GPU
 There were 1048 ROIs in the lists
 AUROC = 0.682226
 Time taken:  76.51662474789191
END.
['patient_id', 'Path', 'M', 'F', 'White', 'Black', 'Yes', 'No']
Inferencing now ...
 onnxruntime available providers: ['CUDAExecutionProvi

## Bias Visualization

After inference, you can analyze the model bias by running the following code. The analysis code calculates the subgroup **predicted prevalence** and **AUROC**, and plots these measurements against the difference in training disease prevalence between the two subgroups.

In [None]:
# metric calculation
for exp in exp_list:
    %run ../src/utils/bias_analysis.py -d "{main_dir}/{exp}" \
                                       -r "results__.tsv" \
                                       -i "{main_dir}/independent_test.csv" \
                                       -s "sex"
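
The contents of `bias_analysis.py` are not shown here. As a minimal sketch of the two measurements, the code below computes a per-subgroup predicted prevalence and AUROC from a table of scores. Several things are assumptions: the column names `score` and `label` are hypothetical (the layout of *results__.tsv* is not documented in this notebook), predicted prevalence is taken to be the fraction of cases scored at or above a 0.5 threshold, and the male training prevalence is taken to be one minus the female prevalence when converting an experiment name such as `10FP` into a prevalence difference.

```python
import pandas as pd


def auroc(labels, scores):
    """AUROC via the rank-sum (Mann-Whitney U) form; ties get average ranks."""
    ranks = scores.rank(method="average")
    n_pos = int((labels == 1).sum())
    n_neg = int((labels == 0).sum())
    u = ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


def subgroup_metrics(df, subgroups=("F", "M"), threshold=0.5):
    """Predicted prevalence and AUROC for each one-hot subgroup column."""
    out = {}
    for g in subgroups:
        sub = df[df[g] == 1]
        out[g] = {
            "predicted_prevalence": float((sub["score"] >= threshold).mean()),
            "auroc": float(auroc(sub["label"], sub["score"])),
        }
    return out


def prevalence_difference(exp_name):
    """'10FP' -> female minus male training prevalence (male = 1 - female)."""
    p_female = int(exp_name.replace("FP", "")) / 100
    return p_female - (1 - p_female)


# Toy scores (made-up data), four cases per subgroup.
toy = pd.DataFrame({
    "F":     [1, 1, 1, 1, 0, 0, 0, 0],
    "M":     [0, 0, 0, 0, 1, 1, 1, 1],
    "label": [1, 1, 0, 0, 1, 1, 0, 0],
    "score": [0.9, 0.8, 0.7, 0.2, 0.6, 0.3, 0.4, 0.1],
})
print(subgroup_metrics(toy))
print([round(prevalence_difference(e), 2)
       for e in ("10FP", "25FP", "50FP", "75FP", "90FP")])
```

Plotting each metric against the prevalence difference, one point per experiment and one line per subgroup, would give the kind of view described above.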