# CADe/CADx Performance Evaluation Tutorial

This tutorial demonstrates how to use the repository to evaluate the performance of CADe (Computer-Aided Detection) and CADx (Computer-Aided Diagnosis) systems.
It only addresses single-"target" systems, namely binary detection and classification systems (e.g. malignant nodule). It computes ROC curves, Precision-Recall curves, FROC curves (with several operating point specifications), bootstraps, confidence intervals, and statistical comparison tests across sets of detection predictions. In short, it provides a standard and generic CADe/CADx evaluation process that is easy to use for any standard case and easy to adapt to more specific requirements.
It serves two purposes that drive two independent usages: first, to reproduce all the results of the publication (give arXiv link); second, to provide an evaluation process for other CADe/CADx systems.

We will walk through the following steps:
1. Installing the package and setting up the environment.
2. Structure of the repository.
3. Reproducing all the paper's figures, performances and tests in a single command from data inputs.
4. Preparing input data.
5. Running the evaluation modules (isolated functions):
   - Series-level evaluation.
   - Lesion-level evaluation.
   - Statistical tests.
6. Analyzing the results.


Let's get started!

## 1. INSTALL

### 1.1 Install UV - Ultra-fast Python package manager

This repository uses UV to manage its dependencies, virtual environment and Python version, so you need to install it first.
UV is an ultra-fast Python package manager (see the complete instructions on the [official UV installation page](https://docs.astral.sh/uv/getting-started/installation/)).
To install UV, run the following command in your terminal, depending on whether your system is Linux/macOS or Windows:

Linux / macOS:
```bash
curl -LsSf https://astral.sh/uv/install.sh | env UV_INSTALL_DIR="/usr/local/bin" sudo -E sh
```

Windows (PowerShell):
```powershell
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
```

### 1.2 Install dependencies

To install the required dependencies, run the following command in your terminal:

```bash
uv sync
```

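Activate the virtual environment that `uv sync` created in `.venv` (alternatively, prefix your commands with `uv run`):

```bash
source .venv/bin/activate   # on Windows (PowerShell): .venv\Scripts\activate
```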

## 2. Structure of the repository

The structure of the repository is the following (including the output directories, which are not committed):

```
CADe_CADx_evaluation
│   README.md
│   config_paper.py  
│   run_paper_evaluation.py   
│
└───Data
│   │CADe_CADx_evaluate.log  (OUTPUT)
│   │
│   └───data_series          (INPUT)
│   │   │series.csv          
│   │  
│   └───data_lesions         (INPUT)
│   │   │lesions.csv         
│   │   │... 
│   │     
│   └───evaluate_series      (OUTPUT)
│   │   └─── figure_1  
│   │   │     └───4_radiologist_prediction_test3     
│   │   │         │roc_curve_with_op_test3.png  
│   │   │         │... 
│   │   │    
│   │   └─── ...        
│   │     
│   └───evaluate_lesions     (OUTPUT)
│   │   └─── figure_2  
│   │   │     └───4_radiologist_prediction_test3     
│   │   │         │froc_curve_with_2_op_test3.png  
│   │   │         │... 
│   │   │    
│   │   └─── ...        
│   │     
│   └───statistical_tests     (OUTPUT)
│       └─── figure_1  
│       │    │test_results_AUC_model_vs_4radiolog_test3.csv  
│       │    │... 
│       │    
│       └─── ...                
│   
└───evaluate_common            (CODE)
│   │roc.py   
│   │precision_recall.py
│   │sample_size_analysis.py
│   │roc_confidence_interval.py
│   │plot_score_distribution_benign_cancer.py
│   │logger.py
│   │sens_spec.py
│  
└───evaluate_lesions           (CODE)
│   │evaluate_lesions.py   
│   │froc.py
│   │plot_diameter_prediction_distributionss.py
│  
└───evaluate_series            (CODE)
│   │evaluate_series.py   
│  
└───statistical_tests          (CODE)
│   │statistical_tests.py   

```

### 2.1 Configuration
All the parameters of the package are defined in `config_paper.py`. They include the input and output data paths, the names of the predictions, the names of the sample subsets of an evaluation, the names of the ground-truth (GT) labels, the confidence level of the confidence intervals, the number of bootstrap samples, a fast computation option, and the operating point thresholds and labels. The file also defines, grouped by figure, the lists of evaluations to run, each specifying predictions, labels and subsets.

To apply the package to new models and datasets, you may either rewrite this configuration and define your own lists of evaluations (the complex case of multiple evaluations), or directly run the submodule functions (detailed below). By default, this configuration script reproduces all the figures and results of the paper arXiv (to complete).
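
As a rough illustration of what an adapted configuration might look like, the sketch below reuses attribute names that appear in the calls later in this tutorial (`path_data_series`, `path_model_evaluate_series`, `nb_bootstrap_samples`, ...). Every value, folder name and unit below is a placeholder assumption, not the content of the actual `config_paper.py`.

```python
# Hypothetical configuration sketch. Attribute names come from the calls shown
# in this tutorial; all values are placeholder assumptions.
from pathlib import Path

path_data = Path("Data")  # assumed root of inputs and outputs

# Inputs
path_data_series = path_data / "data_series" / "series.csv"
path_data_lesions = path_data / "data_lesions" / "lesions.csv"

# Outputs
path_model_evaluate_series = path_data / "evaluate_series"
path_model_evaluate_lesions = path_data / "evaluate_lesions"
path_model_statistical_tests = path_data / "statistical_tests"

# Bootstrap settings (placeholders; check the unit expected by the package)
nb_bootstrap_samples = 1000
confidence_threshold = 0.95
```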

### 2.2 Input
The inputs are stored in the `.\data\data_series` and `.\data\data_lesions` directories, for series/patient-level and nodule/lesion-level inputs respectively.
They are CSV files (e.g. `series.csv` and `lesions.csv`). Each row is one sample and each column is a feature of the sample. The evaluation uses 5 kinds of features:
* predictions: numeric values (float or int), commonly the probability predicted by a model for the sample, but possibly also a psychophysical assessment by a human (e.g. an expert radiologist) or a numerical variable associated with the sample (e.g. size) that may have predictive value for the binary detection-classification task.
* labels: the ground truth associated with the sample. They are binary: either (0, 1) or Boolean (True or 1 indicates the positive class, False or 0 indicates the negative class).
* identifiers: commonly, in medical imaging, patient_id, series_uid, Time_point.
* features-variables: variables associated with the sample, on which you may stratify your evaluation (e.g. age, gender, slice thickness, manufacturer, spiculation...).
* test set name: Boolean variables (indicator functions) that indicate whether or not the sample belongs to a subset (this plays the same role as stratification).

In addition, for the lesion/nodule level only, there is a required column "detection_status" that takes the value "TP", "FP" or "FN" (for "True Positive", "False Positive" and "False Negative") as a result of the pairing of the detections with the GT (see below). Note that "False Negative" detections are assigned a prediction of 0 (or the lowest score) by the classification evaluation algorithm.
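
To make the expected layout concrete, here is a minimal sketch of a lesion-level table and a sanity check on it. Only the `detection_status` column name and its three allowed values come from the description above. The other column names (`model_prediction`, `label`, `test1`, ...) and the helper `validate_lesion_rows` are illustrative assumptions, not part of the package.

```python
import csv
import io

# Hypothetical lesion-level table: column names other than "detection_status"
# are illustrative assumptions.
LESIONS_CSV = """patient_id,series_uid,model_prediction,label,detection_status,test1
p1,s1,0.91,1,TP,True
p1,s1,0.40,0,FP,True
p2,s2,0.00,1,FN,True
p3,s3,0.15,0,FP,False
"""

ALLOWED_STATUS = {"TP", "FP", "FN"}
ALLOWED_LABELS = {"0", "1", "True", "False"}


def validate_lesion_rows(rows, prediction="model_prediction", label="label"):
    """Return a list of human-readable problems found in the rows."""
    problems = []
    for i, row in enumerate(rows):
        try:
            float(row[prediction])
        except ValueError:
            problems.append(f"row {i}: prediction is not numeric")
        if row[label] not in ALLOWED_LABELS:
            problems.append(f"row {i}: label is not binary")
        if row["detection_status"] not in ALLOWED_STATUS:
            problems.append(f"row {i}: unknown detection_status")
    return problems


rows = list(csv.DictReader(io.StringIO(LESIONS_CSV)))
problems = validate_lesion_rows(rows)
print(problems)
```

Running a check of this kind before the evaluation can catch malformed rows early, for instance a status other than "TP", "FP" or "FN".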
 
### 2.3 Modules "evaluate_lesions" and "evaluate_series"
The submodules "evaluate_lesions" and "evaluate_series" perform essentially the same tasks and computations, at nodule/lesion level and at series/patient level respectively.
These two levels are separated because patient/series-level performance evaluation is straightforward from the GT input (e.g. cancer or not), whereas nodule/lesion-level performance requires a prior pairing of the detections with the lesions in the GT.
This pairing is not provided (yet) with the repository; we used the standard IoU-based pairing (for 3D pairing) with a 0.1 threshold, as provided and recommended by [Jaeger et al.](https://arxiv.org/abs/1811.08661).
Moreover, lesion/nodule-level performance is also commonly evaluated using FROC curves, which do not make sense at scan/patient level. Both modules call functions from the "evaluate_common" library (`roc.py`, etc.).
They compute ROC curves, FROC curves (at lesion level), etc. (see the output descriptions below).
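
Since the pairing step is not shipped with the repository, the following is a minimal sketch of one way to produce the `detection_status` values with a 3D IoU threshold of 0.1. It assumes axis-aligned boxes and a greedy, score-ordered matching. The function names and the box format are hypothetical, and this is not the implementation of Jaeger et al.

```python
def volume(box):
    """Volume of an axis-aligned box (z1, y1, x1, z2, y2, x2)."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def iou_3d(a, b):
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for lo_a, lo_b, hi_a, hi_b in zip(a[:3], b[:3], a[3:], b[3:]):
        overlap = min(hi_a, hi_b) - max(lo_a, lo_b)
        if overlap <= 0:
            return 0.0
        inter *= overlap
    return inter / (volume(a) + volume(b) - inter)


def pair_detections(detections, gt_boxes, iou_threshold=0.1):
    """Greedy pairing: highest-scoring detections claim ground-truth boxes first.

    detections: list of (score, box); gt_boxes: list of boxes.
    Returns a list of (status, score) with status in {"TP", "FP", "FN"}.
    """
    results, matched = [], set()
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_iou, best_gt = 0.0, None
        for j, gt in enumerate(gt_boxes):
            if j in matched:
                continue
            iou = iou_3d(box, gt)
            if iou > best_iou:
                best_iou, best_gt = iou, j
        if best_gt is not None and best_iou >= iou_threshold:
            matched.add(best_gt)
            results.append(("TP", score))
        else:
            results.append(("FP", score))
    # Unmatched ground-truth lesions become false negatives with score 0.
    results += [("FN", 0.0) for j in range(len(gt_boxes)) if j not in matched]
    return results


gt = [(0, 0, 0, 10, 10, 10), (50, 50, 50, 60, 60, 60)]
dets = [(0.9, (2, 2, 2, 12, 12, 12)), (0.3, (100, 100, 100, 110, 110, 110))]
print(pair_detections(dets, gt))
```

In this toy case the first detection overlaps the first lesion and becomes a "TP", the second detection overlaps nothing and becomes an "FP", and the second lesion is left unmatched and becomes an "FN".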

### 2.4 Module "statistical_tests"
The "statistical_tests" module is based on the bootstrap methods conceived by [Efron and Tibshirani](https://www.hms.harvard.edu/bss/neuro/bornlab/nb204/statistics/bootstrap.pdf). The bootstrap samples are drawn with replacement. Bootstrap methods are recommended because they are non-parametric (they make no assumption on the distribution that would need to be verified, and use no parameter that would need to be fitted) and can be applied generically, within a common framework, to a wide range of observables (for instance to the ROC AUC and to the sensitivity at a given operating point of the ROC alike). The module takes a list of pairs of paths to saved .npy vectors (produced by "evaluate_lesions" or "evaluate_series") containing the metric values (AUC, accuracy, sensitivity, specificity...) of the n bootstrap samples, and runs the whole list of tests. For each pair, it implements 2 statistical tests using [scipy.stats](https://docs.scipy.org/doc/scipy/reference/stats.html):
* a superiority t-test with unequal variance (one-sided Welch t-test, 1st prediction vs. 2nd)
* a superiority test assuming equal variance (t-test, 1st prediction vs. 2nd)

It returns a CSV with the results of all pairs of tests (p-value, acceptance (strong, very strong, moderate, rejected)), as well as a figure of the 2 bootstrap distributions.
This module can only be run after "evaluate_lesions" and/or "evaluate_series", as it depends on their bootstrap vector .npy outputs (they are its only inputs).
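
The one-sided Welch comparison of two bootstrap vectors can be sketched as follows, using only the standard library. The two vectors are synthetic stand-ins for the saved .npy files, and `welch_one_sided` is a hypothetical helper, not the module's code. The p-value uses a normal approximation of the Student-t tail, which is reasonable only for the large degrees of freedom typical of bootstrap vectors. In SciPy, `scipy.stats.ttest_ind(a, b, equal_var=False, alternative="greater")` gives the exact Student-t value.

```python
import math
import random
from statistics import NormalDist, mean, variance


def welch_one_sided(a, b):
    """One-sided Welch test of H1: mean(a) > mean(b).

    Returns (t statistic, Welch-Satterthwaite degrees of freedom,
    approximate p-value).
    """
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t_stat = (mean(a) - mean(b)) / math.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    # Normal approximation of the Student-t tail (assumption: dof is large).
    p_value = 1.0 - NormalDist().cdf(t_stat)
    return t_stat, dof, p_value


# Synthetic stand-ins for two saved bootstrap vectors of AUC values.
rng = random.Random(0)
auc_model = [rng.gauss(0.90, 0.02) for _ in range(1000)]
auc_reader = [rng.gauss(0.85, 0.02) for _ in range(1000)]

t_stat, dof, p_value = welch_one_sided(auc_model, auc_reader)
print(f"t = {t_stat:.1f}, dof = {dof:.0f}, p = {p_value:.3g}")
```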

### 2.5 Outputs
The outputs are stored in the `.\data\evaluate_series` and `.\data\evaluate_lesions` directories, for series/patient-level and nodule/lesion-level outputs respectively.
They are CSV files, figures saved both in .png and .svg (the latter allowing further editing in vector format), and NumPy arrays of bootstrap samples (for the statistical tests).
The saved figures are:
* ROC curves, with operating points (by default the Youden Index Maximum OP is given, but a list of OP thresholds and labels can be given)
* Precision-Recall curve
* the distribution of the predictions for each label
* FROC curves (only at lesion level), with operating points at the points closest to 0.5 FP/scan and 1 FP/scan

The saved CSV files are:
* sample sizes: with total sample size, imbalance, and sample size in each class.
* "roc_CI_bootstrap_hanley" csv: with ROC AUC confidence intervals (lower and upper) obtained using bootstrap methods and using the [Hanley & McNeil method](https://pubs.rsna.org/doi/10.1148/radiology.143.1.7063747?url_ver=Z39.88-2003&rfr_id=ori:rid:crossref.org&rfr_dat=cr_pub%20%200pubmed).
* "operating_point_performances" csv: with, for all operating points, the sensitivity, specificity and accuracy, with their mean and lower and upper CI over the n bootstrap samples.
* "operating_point_FROC_scores_at_0.5_and_1_FP_per_scan" (only at lesion level): with, for the operating points closest to 0.5 FP/scan and 1 FP/scan, the sensitivity and exact FP/scan, with their mean and lower and upper CI over the n bootstrap samples.
* "test_results" csv (in the statistical test folders): with the results of all pairs of tests (p-value, acceptance (strong, very strong, moderate, rejected)), along with a figure of the 2 bootstrap distributions.

The saved NumPy (.npy) files are the vectors of the n bootstrapped values of AUC, sensitivity, specificity and accuracy.
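
To illustrate the two kinds of AUC confidence intervals mentioned above, here is a self-contained sketch on synthetic scores. It computes a percentile interval from bootstrap resamples and a normal-approximation interval from the Hanley & McNeil (1982) standard error. The function names, the synthetic data and the choice of a percentile interval are assumptions made for illustration; the package may compute its intervals differently.

```python
import math
import random


def auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc(scores, labels, n_bootstrap=200, seed=0):
    """Resample cases with replacement and return the bootstrap AUC values."""
    rng = random.Random(seed)
    n, values = len(scores), []
    while len(values) < n_bootstrap:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that miss one class
            values.append(auc([scores[i] for i in idx], ys))
    return values


def percentile_ci(values, confidence=0.95):
    """Percentile interval read directly from the sorted bootstrap values."""
    ordered = sorted(values)
    alpha = (1.0 - confidence) / 2.0
    lo = ordered[math.floor(alpha * (len(ordered) - 1))]
    hi = ordered[math.ceil((1.0 - alpha) * (len(ordered) - 1))]
    return lo, hi


def hanley_mcneil_ci(a, n_pos, n_neg, z=1.96):
    """Normal-approximation CI from the Hanley & McNeil standard error."""
    q1, q2 = a / (2.0 - a), 2.0 * a * a / (1.0 + a)
    var = (a * (1 - a) + (n_pos - 1) * (q1 - a * a)
           + (n_neg - 1) * (q2 - a * a)) / (n_pos * n_neg)
    half = z * math.sqrt(var)
    return max(0.0, a - half), min(1.0, a + half)


# Synthetic data: 40 positives and 60 negatives with overlapping scores.
rng = random.Random(1)
labels = [1] * 40 + [0] * 60
scores = [rng.gauss(1.0, 1.0) if y else rng.gauss(0.0, 1.0) for y in labels]

auc_value = auc(scores, labels)
boot_ci = percentile_ci(bootstrap_auc(scores, labels))
hm_ci = hanley_mcneil_ci(auc_value, n_pos=40, n_neg=60)
print(auc_value, boot_ci, hm_ci)
```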


## 3. Reproduce all the paper's figures, performances and tests in a single command from data inputs

The `run_paper_evaluation.py` script is designed to produce all the performance evaluation analyses presented in the paper (give arXiv link). This script integrates all the evaluation modules of the package and generates the outputs required for each figure of the paper in the associated directories.
At series/patient level and at nodule level, it generates all the outputs listed above for all the figures of the paper, as well as all the statistical tests of the paper.
All its parameters are stored in `config_paper.py`, notably the path to the input CSV data for patient and nodule predictions and the path where the outputs are stored, along with the list of figures, statistics and tests to produce. This config is specific to the paper; for another set of evaluations, write another config that keeps the same structure.

To run the whole paper evaluation, just run the following lines:

```python
from run_paper_evaluation import paper_evaluation

# Set to True for faster computation during testing/debugging
# (number of bootstrap samples reduced to 50 and fast FROC computation)
fast_computation = True
paper_evaluation(fast_computation=fast_computation)
```


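To run the evaluation modules individually, first import their entry points and the configuration:

```python
import config_paper as config  # assumption: `config` below refers to config_paper.py
from evaluate_series.evaluate_series import evaluate_serie_main
from evaluate_lesions.evaluate_lesions import evaluate_lesions_main
from statistical_tests.statistical_tests import statistical_test_main
```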


## 4. Preparing input data

```python
import pandas as pd

# Load input data
# Path to the series data (update this path if necessary)
path_to_series_csv = config.path_data_series

# Load the series data
series_data = pd.read_csv(path_to_series_csv)
print("Series Data:")
print(series_data.head())
```

```python
# Inspect lesion-level data
# Path to the lesion data (update this path if necessary)
path_to_lesions_csv = config.path_data_lesions

# Load the lesion data
lesion_data = pd.read_csv(path_to_lesions_csv)
print("Lesion Data:")
print(lesion_data.head())
```

## 5. Running the evaluation modules

### 5.1 Series-level evaluation

```python
# Run series-level evaluation
# Define parameters for series evaluation
path_to_load_csv_serie = config.path_data_series
expdir = config.path_model_evaluate_series
set_name = "test1"  # Update this to the desired test set
prediction = "model_prediction"
label_name = "label"
operating_point_thresholds = [0]
operating_point_labels = ["Youden Index Max"]
nb_bootstrap_samples = config.nb_bootstrap_samples
confidence_threshold = config.confidence_threshold

# Run the series evaluation
evaluate_serie_main(
    path_to_load_csv_serie,
    expdir,
    set_name,
    prediction,
    label_name,
    operating_point_thresholds,
    operating_point_labels,
    nb_bootstrap_samples,
    confidence_threshold,
)
```
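
For readers unfamiliar with the "Youden Index Max" operating point used as the default label above, the sketch below shows how such a point can be found on toy data: it is the threshold that maximises J = sensitivity + specificity - 1. The helper `youden_operating_point` and the toy scores are assumptions for illustration, not the package's implementation.

```python
def youden_operating_point(scores, labels):
    """Return (threshold, sensitivity, specificity) maximising Youden's J.

    A case is called positive when its score is >= threshold. On ties in J,
    the lowest threshold is kept.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = (None, 0.0, 0.0, -1.0)  # threshold, sens, spec, J
    for thr in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= thr)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < thr)
        sens, spec = tp / n_pos, tn / n_neg
        j = sens + spec - 1.0
        if j > best[3]:
            best = (thr, sens, spec, j)
    return best[:3]


# Toy series-level data: one prediction and one binary label per series.
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
thr, sens, spec = youden_operating_point(scores, labels)
print(thr, sens, spec)  # prints: 0.4 1.0 0.75
```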

### 5.2 Lesion-level evaluation

```python
# Run lesion-level evaluation
# Define parameters for lesion evaluation
path_to_load_csv_lesion = config.path_data_lesions
expdir = config.path_model_evaluate_lesions
set_name = "test1"  # Update this to the desired test set
prediction = "model_prediction"
label_name = "label"
operating_point_thresholds = [0]
operating_point_labels = ["Youden Index Max"]
nb_bootstrap_samples = config.nb_bootstrap_samples
confidence_threshold = config.confidence_threshold

# Run the lesion evaluation
evaluate_lesions_main(
    path_to_load_csv_lesion,
    expdir,
    set_name,
    prediction,
    label_name,
    operating_point_thresholds,
    operating_point_labels,
    nb_bootstrap_samples,
    confidence_threshold,
)
```
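
The FROC operating points reported at lesion level (closest to 0.5 FP/scan and 1 FP/scan) can be pictured with the following sketch, which builds a FROC curve from `(score, detection_status)` rows. The helpers `froc_points` and `closest_operating_point` and the toy rows are hypothetical; the package's `froc.py` may differ in its details.

```python
def froc_points(lesions, n_scans):
    """FROC curve from lesion rows given as (score, detection_status).

    At each candidate threshold a detection is kept when score >= threshold.
    Returns a list of (threshold, fp_per_scan, sensitivity).
    """
    n_lesions = sum(1 for _, status in lesions if status in ("TP", "FN"))
    thresholds = sorted({s for s, status in lesions if status != "FN"},
                        reverse=True)
    points = []
    for thr in thresholds:
        tp = sum(1 for s, status in lesions if status == "TP" and s >= thr)
        fp = sum(1 for s, status in lesions if status == "FP" and s >= thr)
        points.append((thr, fp / n_scans, tp / n_lesions))
    return points


def closest_operating_point(points, target_fp_per_scan):
    """FROC point closest to the target FP/scan (ties: higher sensitivity)."""
    return min(points, key=lambda p: (abs(p[1] - target_fp_per_scan), -p[2]))


# Toy lesion table: 3 detected lesions, 4 false positives, 1 missed lesion.
lesions = [(0.95, "TP"), (0.90, "FP"), (0.80, "TP"), (0.60, "FP"),
           (0.55, "TP"), (0.30, "FP"), (0.20, "FP"), (0.0, "FN")]
points = froc_points(lesions, n_scans=4)
op_half = closest_operating_point(points, 0.5)
op_one = closest_operating_point(points, 1.0)
print(op_half, op_one)  # prints: (0.55, 0.5, 0.75) (0.2, 1.0, 0.75)
```

The number of scans is passed separately because scans with neither a lesion nor a detection do not appear in a lesion-level table, yet they still count in the FP/scan denominator.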

### 5.3 Statistical tests

```python
# Run statistical tests
# Define parameters for statistical tests
list_of_tuples_of_pairs_of_bootstrap_paths = config.list_statistical_tests_figure[0][0]
expdir_analysis = config.path_model_statistical_tests

# Run the statistical tests
statistical_test_main(list_of_tuples_of_pairs_of_bootstrap_paths, expdir_analysis)
```

## 6. Analyzing the results

```python
# Analyze results
# Path to the output directory
output_dir = config.path_model_eval
print(f"Results saved in: {output_dir}")
```

```python
# Visualize example results
from pathlib import Path

import matplotlib.pyplot as plt

# Example: load and display a Precision-Recall curve
example_pr_curve_path = Path(expdir) / "Precision_Recall_curve_model_test1.png"
if example_pr_curve_path.exists():
    img = plt.imread(example_pr_curve_path)
    plt.imshow(img)
    plt.axis("off")
    plt.title("Precision-Recall Curve")
    plt.show()
else:
    print(f"Precision-Recall curve not found at {example_pr_curve_path}")
```

## Conclusion

In this tutorial, we demonstrated how to:
1. Set up the environment.
2. Prepare input data.
3. Run series-level and lesion-level evaluations.
4. Perform statistical tests.
5. Analyze the results.

You can customize the parameters in the `config_paper.py` file to evaluate different datasets and models.

Happy evaluating!