# Testing the consistency of performance scores reported for the DRIVE retinal vessel segmentation dataset

In [1]:
from mlscorecheck.bundles.retina import (check_drive_vessel_image,
                                        check_drive_vessel_aggregated)

## Aggregated scores

First we test the aggregated scores published in
G. Kovacs et al, 2016: "A self-calibrating technique for the segmentation of retinal vessels by
template matching and contour reconstruction", Medical Image Analysis 29(4), 24-46
doi:10.1016/j.media.2015.12.003.


In [2]:
scores = {'acc': 0.9494, 'sens': 0.7450, 'spec': 0.9793}

# scores are reported to k decimal places; eps = 10**(-k) is the numerical
# uncertainty assuming the scores were truncated (ceiling/flooring)
k = 4

In [3]:
results = check_drive_vessel_aggregated(scores=scores,
                                        eps=10**(-k),
                                        imageset='test',
                                        annotator=1,
                                        verbosity=0)

2023-10-26 23:42:05,250:INFO:checking the scores {'acc': 0.9494, 'sens': 0.745, 'spec': 0.9793}
2023-10-26 23:42:05,251:INFO:evaluating the tp and tn solution for acc and sens
2023-10-26 23:42:05,253:INFO:intervals before: (0, 577649), (0, 3960494)
2023-10-26 23:42:05,254:INFO:the tp solutions: (430289.58480199997, 430407.42519800004)
2023-10-26 23:42:05,255:INFO:the tn solutions: (3877642.6484160004, 3878686.2699840004)
2023-10-26 23:42:05,256:INFO:intervals after: [(430290, 430407)], [(3877643, 3878686)]
2023-10-26 23:42:05,258:INFO:evaluating the tp and tn solution for acc and spec
2023-10-26 23:42:05,259:INFO:intervals before: [(430290, 430407)], [(3877643, 3878686)]
2023-10-26 23:42:05,260:INFO:the tp solutions: (429134.3290260006, 430868.0509740007)
2023-10-26 23:42:05,261:INFO:the tn solutions: (3878107.8038119995, 3878915.744588)
2023-10-26 23:42:05,262:INFO:intervals after: [(430290, 430407)], [(3878108, 3878686)]
2023-10-26 23:42:05,263:INFO:evaluating the tp and tn solution 

In [4]:
results['inconsistency']

{'inconsistency_fov_mos': False,
 'inconsistency_fov_som': False,
 'inconsistency_all_mos': True,
 'inconsistency_all_som': True}

As the results show, the scores are inconsistent with the assumption that all pixels were used, under either aggregation (Mean-of-Scores or Score-of-Means). They are not inconsistent with the assumption that only the FoV pixels were used.
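The two aggregations generally give different numbers for the same segmentations, which is why they are tested separately. The following is a minimal sketch of the distinction for sensitivity, assuming the usual reading of the terms (Mean-of-Scores averages per-image scores, Score-of-Means computes one score from pooled counts). The per-image counts are hypothetical and are not DRIVE data, and the helper names are illustrative only.

```python
# Hypothetical per-image counts: tp = true positives, p = positive (vessel) pixels.
images = [
    {'tp': 80, 'p': 100},
    {'tp': 30, 'p': 50},
]

def mean_of_scores_sens(images):
    """Average the per-image sensitivities (Mean-of-Scores)."""
    return sum(img['tp'] / img['p'] for img in images) / len(images)

def score_of_means_sens(images):
    """Compute one sensitivity from the pooled counts (Score-of-Means)."""
    return sum(img['tp'] for img in images) / sum(img['p'] for img in images)

mos = mean_of_scores_sens(images)   # (0.8 + 0.6) / 2
som = score_of_means_sens(images)   # 110 / 150
print(round(mos, 4), round(som, 4))
```

Because the images contribute unequal numbers of vessel pixels, the two values differ, so a reported score can be feasible under one aggregation and infeasible under the other.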

Next, we test the aggregated scores published in
Jebaseeli et al., 2019, "Retinal Blood Vessel segmentation from diabetic retinopathy images using
tandem PCNN model and deep learning based SVM", Optik 199, 163328, doi:10.1016/j.ijleo.2019.163328.

In [5]:
scores = {'acc': 0.9898, 'sens': 0.8027, 'spec': 0.9980}

# scores are reported to k decimal places; eps = 10**(-k) is the numerical
# uncertainty assuming the scores were truncated (ceiling/flooring)
k = 4

In [6]:
results = check_drive_vessel_aggregated(scores=scores,
                                        eps=10**(-k),
                                        imageset='test',
                                        annotator=1,
                                        verbosity=0)

2023-10-26 23:42:05,390:INFO:checking the scores {'acc': 0.9898, 'sens': 0.8027, 'spec': 0.998}
2023-10-26 23:42:05,392:INFO:evaluating the tp and tn solution for acc and sens
2023-10-26 23:42:05,393:INFO:intervals before: (0, 577649), (0, 3960494)
2023-10-26 23:42:05,394:INFO:the tp solutions: (463619.93210199993, 463737.772498)
2023-10-26 23:42:05,395:INFO:the tn solutions: (4027653.278316, 4028696.899884)
2023-10-26 23:42:05,397:INFO:intervals after: [(463620, 463737)], []
2023-10-26 23:42:05,398:INFO:evaluating the tp and tn solution for acc and spec
2023-10-26 23:42:05,399:INFO:intervals before: [(463620, 463737)], []
2023-10-26 23:42:05,401:INFO:the tp solutions: (538414.068426, 540147.7903740001)
2023-10-26 23:42:05,402:INFO:the tn solutions: (3952169.0416119997, 3952976.982388)
2023-10-26 23:42:05,403:INFO:intervals after: [], []
2023-10-26 23:42:05,404:INFO:evaluating the tp and tn solution for sens and spec
2023-10-26 23:42:05,405:INFO:intervals before: [], []
2023-10-26 23:4

In [7]:
results['inconsistency']

{'inconsistency_fov_mos': True,
 'inconsistency_fov_som': True,
 'inconsistency_all_mos': True,
 'inconsistency_all_som': True}

As the results show, the reported scores are inconsistent with all four reasonable assumptions in the field (using only the FoV pixels or using all pixels, combined with either Mean-of-Scores or Score-of-Means aggregation).
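When several papers are checked, it can help to reduce each flag dictionary to the list of assumptions that were not contradicted. A small sketch, using the two dictionaries printed above; `surviving_assumptions` is a hypothetical helper written for this notebook, not part of `mlscorecheck`.

```python
def surviving_assumptions(flags):
    """Return the assumption names whose inconsistency flag is False."""
    prefix = 'inconsistency_'
    return sorted(key[len(prefix):]
                  for key, inconsistent in flags.items()
                  if not inconsistent)

# Flag dictionaries copied from the outputs of cells [4] and [7].
kovacs_flags = {'inconsistency_fov_mos': False,
                'inconsistency_fov_som': False,
                'inconsistency_all_mos': True,
                'inconsistency_all_som': True}
jebaseeli_flags = {'inconsistency_fov_mos': True,
                   'inconsistency_fov_som': True,
                   'inconsistency_all_mos': True,
                   'inconsistency_all_som': True}

print(surviving_assumptions(kovacs_flags))     # ['fov_mos', 'fov_som']
print(surviving_assumptions(jebaseeli_flags))  # []
```

An empty list means that none of the tested evaluation protocols can reproduce the reported scores at the given numerical uncertainty.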

## Image level

Image level figures from Mo et al., "Multi-level deep supervised networks for retinal vessel
segmentation", Int J Comput Assist Radiol Surg, 12. doi:10.1007/s11548-017-1619-0

In [8]:
scores = {'acc': 0.9323, 'sens': 0.5677, 'spec': 0.9944}
identifier = '03'
k = 4

In [9]:
results = check_drive_vessel_image(scores=scores,
                                   eps=10**(-k),
                                   image_identifier=identifier,
                                   annotator=1)

2023-10-26 23:42:05,527:INFO:Use this function if the scores originate from the tp and tn statistics calculated on one test set with no aggregation of any kind.
2023-10-26 23:42:05,529:INFO:calling the score check with scores {'acc': 0.9323, 'sens': 0.5677, 'spec': 0.9944}, uncertainty 0.0001, p 32886 and n 192841
2023-10-26 23:42:05,531:INFO:checking the scores {'acc': 0.9323, 'sens': 0.5677, 'spec': 0.9944}
2023-10-26 23:42:05,533:INFO:evaluating the tp and tn solution for acc and sens
2023-10-26 23:42:05,534:INFO:intervals before: (0, 32886), (0, 192841)
2023-10-26 23:42:05,536:INFO:the tp solutions: (18666.027828, 18672.736572)
2023-10-26 23:42:05,537:INFO:the tn solutions: (191749.521374, 191802.27842600003)
2023-10-26 23:42:05,539:INFO:intervals after: [(18667, 18672)], [(191750, 191802)]
2023-10-26 23:42:05,541:INFO:evaluating the tp and tn solution for acc and spec
2023-10-26 23:42:05,543:INFO:intervals before: [(18667, 18672)], [(191750, 191802)]
2023-10-26 23:42:05,544:INFO:t

In [10]:
results['inconsistency']

{'inconsistency_fov': False, 'inconsistency_all': True}

As the results show, the scores are inconsistent with the assumption that all pixels were used. However, they are not inconsistent with the assumption that only the FoV pixels were used for evaluation.
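The idea behind the image-level check can be sketched in a few lines: each reported score, together with its uncertainty, confines an integer count (tp, tn, or tp + tn) to an interval, and the scores are feasible only if the intervals are compatible. The sketch below is a simplified illustration, not the library's implementation. The function names are hypothetical, and it omits the small extra numerical tolerance that the logged intervals suggest the library adds. The `p` and `n` values are those printed in the log above.

```python
import math

def integer_interval(score, eps, count):
    """Integers m satisfying score - eps <= m / count <= score + eps."""
    return math.ceil((score - eps) * count), math.floor((score + eps) * count)

def is_consistent(scores, p, n, eps):
    """Check whether integer tp, tn exist that reproduce acc, sens and spec."""
    tp_lo, tp_hi = integer_interval(scores['sens'], eps, p)         # sens = tp / p
    tn_lo, tn_hi = integer_interval(scores['spec'], eps, n)         # spec = tn / n
    sum_lo, sum_hi = integer_interval(scores['acc'], eps, p + n)    # acc = (tp + tn) / (p + n)
    if tp_lo > tp_hi or tn_lo > tn_hi or sum_lo > sum_hi:
        return False
    # tp + tn can take every integer between tp_lo + tn_lo and tp_hi + tn_hi,
    # so feasibility reduces to an overlap test between two integer intervals.
    return max(tp_lo + tn_lo, sum_lo) <= min(tp_hi + tn_hi, sum_hi)

scores = {'acc': 0.9323, 'sens': 0.5677, 'spec': 0.9944}
print(is_consistent(scores, p=32886, n=192841, eps=1e-4))
```

With the logged counts the sketch finds the three scores jointly feasible. Running the same test with the pixel counts of another evaluation protocol (for example, all pixels instead of the FoV) changes `n` and can flip the outcome.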

Testing image level results from Lupascu et al., 2010, "FABC: retinal vessel segmentation using
AdaBoost", Trans. Info. Tech. Biomed 14 (5), 1267-1274. doi:10.1109/TITB.2010.2052282

In [11]:
scores = {'acc': 0.9633, 'sens': 0.7406, 'spec': 0.9849}
identifier = '01'
k = 4

In [12]:
results = check_drive_vessel_image(scores=scores,
                                   eps=10**(-k),
                                   image_identifier=identifier,
                                   annotator=1)

2023-10-26 23:42:05,688:INFO:Use this function if the scores originate from the tp and tn statistics calculated on one test set with no aggregation of any kind.
2023-10-26 23:42:05,691:INFO:calling the score check with scores {'acc': 0.9633, 'sens': 0.7406, 'spec': 0.9849}, uncertainty 0.0001, p 29412 and n 194965
2023-10-26 23:42:05,694:INFO:checking the scores {'acc': 0.9633, 'sens': 0.7406, 'spec': 0.9849}
2023-10-26 23:42:05,696:INFO:evaluating the tp and tn solution for acc and sens
2023-10-26 23:42:05,697:INFO:intervals before: (0, 29412), (0, 194965)
2023-10-26 23:42:05,700:INFO:the tp solutions: (21779.527176, 21785.527224)
2023-10-26 23:42:05,701:INFO:the tn solutions: (194333.950422, 194385.72337800002)
2023-10-26 23:42:05,704:INFO:intervals after: [(21780, 21785)], [(194334, 194385)]
2023-10-26 23:42:05,706:INFO:evaluating the tp and tn solution for acc and spec
2023-10-26 23:42:05,711:INFO:intervals before: [(21780, 21785)], [(194334, 194385)]
2023-10-26 23:42:05,713:INFO:t

In [13]:
results['inconsistency']

{'inconsistency_fov': True, 'inconsistency_all': False}

As the results show, the scores are inconsistent with the assumption that the segmentation was evaluated only in the FoV region. However, they are not inconsistent with the assumption that the method was evaluated on all pixels.