# Testing the consistency of performance scores reported for the DRIVE retinal vessel segmentation dataset

In [1]:
from mlscorecheck.bundles import (drive_image, drive_aggregated)

## Aggregated scores

First, we test the aggregated scores published in
G. Kovacs et al., 2016: "A self-calibrating technique for the segmentation of retinal vessels by
template matching and contour reconstruction", Medical Image Analysis 29(4), 24-46
doi:10.1016/j.media.2015.12.003.


In [2]:
scores = {'acc': 0.9494, 'sens': 0.7450, 'spec': 0.9793}

# the numerical uncertainty (assuming ceiling/flooring)
k = 4
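The choice `eps = 10**(-k)` used in the calls can be motivated as follows: if a score is reported to `k` decimals and it is unknown whether it was rounded, floored, or ceiled, the reported value still lies within `10**(-k)` of the true one. A minimal sketch, assuming a hypothetical true score and a helper name of our own (not part of mlscorecheck):

```python
import math

def reported_variants(value, k):
    """Return the value as it would be reported to k decimals
    under rounding, flooring, and ceiling."""
    scale = 10 ** k
    return {
        'round': round(value * scale) / scale,
        'floor': math.floor(value * scale) / scale,
        'ceil': math.ceil(value * scale) / scale,
    }

true_value = 0.94936  # hypothetical true score, not taken from any paper
k = 4
variants = reported_variants(true_value, k)

# every reporting convention stays within 10**(-k) of the true value
within_eps = all(abs(v - true_value) < 10 ** (-k) for v in variants.values())
print(variants, within_eps)
```

If the scores are known to be rounded to the nearest value, half of this uncertainty would suffice.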

In [3]:
results = drive_aggregated(scores=scores,
                            eps=10**(-k),
                            image_set='test',
                            verbosity=0)

2023-10-17 04:02:39,338:INFO:testing MoR FoV pixels
2023-10-17 04:02:39,362:INFO:testing MoR all pixels
2023-10-17 04:02:39,378:INFO:testing RoM FoV pixels
2023-10-17 04:02:39,444:INFO:testing acc, feasible tptn pairs: 577650
2023-10-17 04:02:51,705:INFO:testing sens, feasible tptn pairs: 230090
2023-10-17 04:02:51,758:INFO:testing spec, feasible tptn pairs: 116
2023-10-17 04:02:51,760:INFO:constructing final tp, tn pair set
2023-10-17 04:02:51,760:INFO:final number of intervals: 116
2023-10-17 04:02:51,761:INFO:final number of pairs: 59334
2023-10-17 04:02:51,762:INFO:testing RoM all pixels
2023-10-17 04:02:51,799:INFO:testing acc, feasible tptn pairs: 577946
2023-10-17 04:03:04,546:INFO:testing sens, feasible tptn pairs: 334588
2023-10-17 04:03:04,647:INFO:testing spec, feasible tptn pairs: 117
2023-10-17 04:03:04,649:INFO:no more feasible tp,tn pairs left
2023-10-17 04:03:04,650:INFO:constructing final tp, tn pair set
2023-10-17 04:03:04,651:INFO:final number of intervals: 0
2023-10

In [4]:
results

{'mor_fov_pixels_inconsistency': False,
 'mor_all_pixels_inconsistency': True,
 'rom_fov_pixels_inconsistency': False,
 'rom_all_pixels_inconsistency': True}

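As the results show, the scores are inconsistent with the assumption that all pixels were used, under both aggregations (Mean-of-Ratios and Ratio-of-Means). They are not inconsistent with the assumption that only the field-of-view (FoV) pixels were used.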
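The two aggregation modes differ in where the averaging happens: Mean-of-Ratios scores each image and averages the scores, while Ratio-of-Means pools the counts and computes the score once. A minimal sketch with hypothetical per-image counts (not DRIVE data) and function names of our own:

```python
# Hypothetical per-image confusion counts; these are NOT DRIVE figures.
images = [
    {'tp': 80, 'fn': 20, 'tn': 880, 'fp': 20},
    {'tp': 30, 'fn': 30, 'tn': 930, 'fp': 10},
]

def sensitivity(tp, fn):
    """Sensitivity from true positive and false negative counts."""
    return tp / (tp + fn)

def mean_of_ratios(images):
    """Score every image separately, then average the scores."""
    per_image = [sensitivity(img['tp'], img['fn']) for img in images]
    return sum(per_image) / len(per_image)

def ratio_of_means(images):
    """Pool the counts over all images, then compute the score once."""
    tp = sum(img['tp'] for img in images)
    fn = sum(img['fn'] for img in images)
    return sensitivity(tp, fn)

mor = mean_of_ratios(images)
rom = ratio_of_means(images)
print(mor, rom)
```

The two values generally differ when the images contain different numbers of positive pixels, which is why each aggregation is tested as a separate assumption.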

Next, we test the aggregated scores published in
Jebaseeli et al., 2019, "Retinal Blood Vessel segmentation from diabetic retinopathy images using
tandem PCNN model and deep learning based SVM", Optik 199, 163328, doi:10.1016/j.ijleo.2019.163328.

In [5]:
scores = {'acc': 0.9898, 'sens': 0.8027, 'spec': 0.9980}

# the numerical uncertainty (assuming ceiling/flooring)
k = 4

In [6]:
results = drive_aggregated(scores=scores,
                            eps=10**(-k),
                            image_set='test',
                            verbosity=0)

2023-10-17 04:03:04,693:INFO:testing MoR FoV pixels
2023-10-17 04:03:04,709:INFO:testing MoR all pixels
2023-10-17 04:03:04,725:INFO:testing RoM FoV pixels
2023-10-17 04:03:04,767:INFO:testing acc, feasible tptn pairs: 577650
2023-10-17 04:03:17,417:INFO:testing sens, feasible tptn pairs: 46749
2023-10-17 04:03:17,426:INFO:no more feasible tp,tn pairs left
2023-10-17 04:03:17,428:INFO:constructing final tp, tn pair set
2023-10-17 04:03:17,429:INFO:final number of intervals: 0
2023-10-17 04:03:17,429:INFO:final number of pairs: 0
2023-10-17 04:03:17,430:INFO:testing RoM all pixels
2023-10-17 04:03:17,458:INFO:testing acc, feasible tptn pairs: 577946
2023-10-17 04:03:28,282:INFO:testing sens, feasible tptn pairs: 67980
2023-10-17 04:03:28,295:INFO:no more feasible tp,tn pairs left
2023-10-17 04:03:28,295:INFO:constructing final tp, tn pair set
2023-10-17 04:03:28,296:INFO:final number of intervals: 0
2023-10-17 04:03:28,297:INFO:final number of pairs: 0


In [7]:
results

{'mor_fov_pixels_inconsistency': True,
 'mor_all_pixels_inconsistency': True,
 'rom_fov_pixels_inconsistency': True,
 'rom_all_pixels_inconsistency': True}

As the results show, the reported scores are inconsistent with all four reasonable assumptions in the field (using only the FoV pixels or all pixels; using Mean-of-Ratios or Ratio-of-Means aggregation).
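When many papers are checked, it can help to reduce each result dictionary to the assumptions that were not ruled out. A small sketch, assuming the key layout shown in the outputs above; the helper `feasible_assumptions` is our own and not part of mlscorecheck:

```python
def feasible_assumptions(results):
    """Return the names of the assumptions the test did not rule out."""
    suffix = '_inconsistency'
    return [key[:-len(suffix)]
            for key, inconsistent in results.items()
            if key.endswith(suffix) and not inconsistent]

# the two result dictionaries displayed above
first = {'mor_fov_pixels_inconsistency': False,
         'mor_all_pixels_inconsistency': True,
         'rom_fov_pixels_inconsistency': False,
         'rom_all_pixels_inconsistency': True}
second = {'mor_fov_pixels_inconsistency': True,
          'mor_all_pixels_inconsistency': True,
          'rom_fov_pixels_inconsistency': True,
          'rom_all_pixels_inconsistency': True}

print(feasible_assumptions(first))
print(feasible_assumptions(second))
```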

## Image level

Image-level figures from Mo et al., "Multi-level deep supervised networks for retinal vessel
segmentation", Int J Comput Assist Radiol Surg, 12. doi:10.1007/s11548-017-1619-0

In [8]:
scores = {'acc': 0.9323, 'sens': 0.5677, 'spec': 0.9944}
identifier = '03'
k = 4

In [9]:
results = drive_image(scores=scores,
                        eps=10**(-k),
                        image_set='test',
                        identifier=identifier)

2023-10-17 04:03:28,331:INFO:Use this function if the scores originate from the tp and tn statistics calculated on one test set with no aggregation of any kind.
2023-10-17 04:03:28,333:INFO:calling the score check with scores {'acc': 0.9323, 'sens': 0.5677, 'spec': 0.9944}, uncertainty 0.0001, p 32886 and n 192841
2023-10-17 04:03:28,334:INFO:checking the scores {'acc': 0.9323, 'sens': 0.5677, 'spec': 0.9944}
2023-10-17 04:03:28,335:INFO:evaluating the tp and tn solution for acc and sens
2023-10-17 04:03:28,336:INFO:intervals before: (0, 32886), (0, 192841)
2023-10-17 04:03:28,338:INFO:the tp solutions: (18666.027828, 18672.736572)
2023-10-17 04:03:28,338:INFO:the tn solutions: (191749.521374, 191802.27842600003)
2023-10-17 04:03:28,339:INFO:intervals after: [(18667, 18672)], [(191750, 191802)]
2023-10-17 04:03:28,341:INFO:evaluating the tp and tn solution for acc and spec
2023-10-17 04:03:28,341:INFO:intervals before: [(18667, 18672)], [(191750, 191802)]
2023-10-17 04:03:28,343:INFO:t

In [10]:
results

{'fov_pixels_inconsistency': False, 'all_pixels_inconsistency': True}

As the results show, the scores are inconsistent with the assumption that all pixels were used. However, they are not inconsistent with the assumption that only the FoV pixels were used for evaluation.

Testing image-level results from Lupascu et al., 2010, "Fabc: retinal vessel segmentation using
adaboost", Trans. Info. Tech. Biomed 14 (5), 1267-1274, doi:10.1109/TITB.2010.2052282.

In [11]:
scores = {'acc': 0.9633, 'sens': 0.7406, 'spec': 0.9849}
identifier = '01'
k = 4

In [12]:
results = drive_image(scores=scores,
                        eps=10**(-k),
                        image_set='test',
                        identifier=identifier)

2023-10-17 04:03:28,413:INFO:Use this function if the scores originate from the tp and tn statistics calculated on one test set with no aggregation of any kind.
2023-10-17 04:03:28,414:INFO:calling the score check with scores {'acc': 0.9633, 'sens': 0.7406, 'spec': 0.9849}, uncertainty 0.0001, p 29412 and n 194965
2023-10-17 04:03:28,415:INFO:checking the scores {'acc': 0.9633, 'sens': 0.7406, 'spec': 0.9849}
2023-10-17 04:03:28,416:INFO:evaluating the tp and tn solution for acc and sens
2023-10-17 04:03:28,417:INFO:intervals before: (0, 29412), (0, 194965)
2023-10-17 04:03:28,418:INFO:the tp solutions: (21779.527176, 21785.527224)
2023-10-17 04:03:28,419:INFO:the tn solutions: (194333.950422, 194385.72337800002)
2023-10-17 04:03:28,420:INFO:intervals after: [(21780, 21785)], [(194334, 194385)]
2023-10-17 04:03:28,423:INFO:evaluating the tp and tn solution for acc and spec
2023-10-17 04:03:28,424:INFO:intervals before: [(21780, 21785)], [(194334, 194385)]
2023-10-17 04:03:28,425:INFO:t

In [13]:
results

{'fov_pixels_inconsistency': True, 'all_pixels_inconsistency': False}

As the results show, the scores are inconsistent with the assumption that the segmentation was evaluated only in the FoV region. However, they are not inconsistent with the assumption that the method was evaluated on all pixels.
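The idea behind the image-level test can be illustrated with plain interval arithmetic: each score restricts `tp`, `tn`, or `tp + tn` to an integer interval, and the scores are consistent only if these intervals are compatible. The sketch below is a simplified stand-in, not the mlscorecheck implementation (whose logged intervals are slightly wider). It takes the `p` and `n` values from the two log outputs above, which we assume to be the FoV pixel counts; the function names are our own.

```python
import math

def integer_interval(score, eps, total):
    """Integer counts c in [0, total] with c / total within eps of score."""
    low = math.ceil((score - eps) * total)
    high = math.floor((score + eps) * total)
    return max(low, 0), min(high, total)

def is_consistent(scores, eps, p, n):
    """Check whether integer tp and tn exist that yield all three scores."""
    tp_low, tp_high = integer_interval(scores['sens'], eps, p)
    tn_low, tn_high = integer_interval(scores['spec'], eps, n)
    sum_low, sum_high = integer_interval(scores['acc'], eps, p + n)
    if tp_low > tp_high or tn_low > tn_high or sum_low > sum_high:
        return False
    # tp and tn vary independently, so tp + tn takes every integer value
    # between tp_low + tn_low and tp_high + tn_high
    return max(sum_low, tp_low + tn_low) <= min(sum_high, tp_high + tn_high)

eps = 10 ** (-4)

# Mo et al., image '03', p and n as printed in the log (assumed FoV counts)
mo = is_consistent({'acc': 0.9323, 'sens': 0.5677, 'spec': 0.9944},
                   eps, p=32886, n=192841)

# Lupascu et al., image '01', p and n as printed in the log (assumed FoV counts)
lupascu = is_consistent({'acc': 0.9633, 'sens': 0.7406, 'spec': 0.9849},
                        eps, p=29412, n=194965)

print(mo, lupascu)
```

Under these assumptions the sketch agrees with the `fov_pixels_inconsistency` flags reported above: a feasible `tp`, `tn` pair exists for the first set of scores and none exists for the second.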