<h2>Model Evaluation</h2>

<b>Ground Truth Path</b>:<br>
Path to the annotators' segmentations or ground truth.<br>

<b>Predictions Path</b>:<br>
Path to the Model Predictions.<br>

<b>Save generated mask</b>:<br>
Option to save the union, intersection, and overlap masks used in the evaluation.<br>

<b>Evaluation Class</b>:<br>
Class to use in the evaluation process. Defaults to 2 (stringvessel).<br>


<b>Use Dilation</b>:<br>
Use dilation to reduce mismatch among annotators.<br>

<b>Radius</b>:<br>
Dilation radius in pixels.<br>

<b>Mode</b>:<br>
Kernel shape used for dilation.<br>
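
As a rough illustration of how the dilation options interact, here is a minimal sketch. It assumes a mask is stored as a set of `(row, col)` foreground coordinates and that the mode selects between a disk and a square kernel. The function names and the mode strings `"disk"` and `"square"` are hypothetical and are not taken from the tool itself.

```python
def kernel_offsets(radius, mode="disk"):
    """Return the (d_row, d_col) offsets covered by a kernel of the given radius.

    The mode names are illustrative assumptions, not the tool's actual options.
    """
    offsets = []
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            # A square kernel keeps every offset; a disk keeps those within the radius.
            if mode == "square" or dr * dr + dc * dc <= radius * radius:
                offsets.append((dr, dc))
    return offsets


def dilate(mask, radius, mode="disk"):
    """Dilate a mask given as a set of (row, col) pixels (no image-border clipping)."""
    offsets = kernel_offsets(radius, mode)
    return {(r + dr, c + dc) for (r, c) in mask for (dr, dc) in offsets}


# Two annotators mark the same thin structure one pixel apart.
annotator_1 = {(5, 5)}
annotator_2 = {(5, 6)}

print(annotator_1 & annotator_2)                        # no common pixel before dilation
print(dilate(annotator_1, 1) & dilate(annotator_2, 1))  # common pixels after dilation
```

In this toy case the two annotations share no pixel until both are dilated, which is the kind of mismatch the option is meant to absorb.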

<span style="color:red;"><b>Note:</b></span><br>
If only one annotator folder is given, Jaccard, Dice, Hausdorff, and Mean Surface Distance are computed between the given annotations and the model prediction.

In [None]:
from core.evaluation import create_evaluation_menu
create_evaluation_menu()

<h2>Summary Generation</h2>

This step takes the folder of CSV files generated in the previous step and produces new CSV files with summary information, together with summary plots. The goal is to give some insight into the evaluation of multiple models and into the generated metrics.

<b>CSV files folder</b>:<br>
Path to the CSV files generated in the previous step.<br>

In [None]:
from core.evaluation import create_sumary_menu
create_sumary_menu()

<h2><b>Model Evaluation Overview</b></h2>
<img src="docs/examples/inp.PNG" width="600"/>

<h4><b>STAPLE (Simultaneous Truth and Performance Level Estimation) Mask:</b><br><br>
STAPLE is an algorithm used to fuse multiple expert segmentations (annotations) into a single, highly reliable consensus segmentation ($A_{\text{STAPLE}}$).<br><br>
It simultaneously estimates the "true" underlying segmentation and the performance level (sensitivity and specificity) of each individual annotator relative to this estimated truth using an Expectation-Maximization (EM) algorithm.<br><br>
The output is a probabilistic map ($P_{\text{STAPLE}}$), where each pixel value represents the estimated probability that the pixel belongs to the object of interest.<br><br>
The final binary STAPLE mask, $A_{\text{STAPLE}}$, is derived by applying a threshold (typically 0.5) to this probability map:<br><br>
$\displaystyle A_{\text{STAPLE}} = \begin{cases} 1 & \text{if } P_{\text{STAPLE}} \ge 0.5 \\ 0 & \text{if } P_{\text{STAPLE}} < 0.5 \end{cases}$<br><br>
The STAPLE mask is a robust fusion of the annotations, providing a more reliable estimate of the ground truth than any single annotator.
</h4>
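
The EM procedure described above can be sketched for the binary case as follows. This is a simplified illustration and not the implementation used by the evaluation menu: it assumes each annotation is flattened to a list of 0/1 labels, uses a single global foreground prior, and runs a fixed number of iterations. The function name `staple` and its parameters are chosen here for illustration only.

```python
def staple(decisions, iterations=20, init=0.9):
    """Simplified binary STAPLE.

    decisions[j][i] is annotator j's label (0 or 1) for pixel i.
    Returns (probability map, sensitivities, specificities).
    """
    n_annot, n_pix = len(decisions), len(decisions[0])
    # Prior foreground probability: overall fraction of foreground labels.
    prior = sum(map(sum, decisions)) / (n_annot * n_pix)
    sens = [init] * n_annot
    spec = [init] * n_annot
    prob = [0.0] * n_pix
    for _ in range(iterations):
        # E-step: probability that each pixel is foreground, given the current
        # estimate of every annotator's sensitivity and specificity.
        for i in range(n_pix):
            fg, bg = prior, 1.0 - prior
            for j in range(n_annot):
                if decisions[j][i]:
                    fg *= sens[j]
                    bg *= 1.0 - spec[j]
                else:
                    fg *= 1.0 - sens[j]
                    bg *= spec[j]
            prob[i] = fg / (fg + bg) if fg + bg > 0 else 0.0
        # M-step: re-estimate each annotator's performance against the soft truth.
        total_fg = max(sum(prob), 1e-12)
        total_bg = max(n_pix - sum(prob), 1e-12)
        for j in range(n_annot):
            sens[j] = sum(w for w, d in zip(prob, decisions[j]) if d) / total_fg
            spec[j] = sum(1.0 - w for w, d in zip(prob, decisions[j]) if not d) / total_bg
    return prob, sens, spec


# Three annotators over ten pixels; the third disagrees on pixels 3 and 4.
annotations = [
    [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 0, 0, 0, 0, 0],
]
p_staple, sensitivities, specificities = staple(annotations)
a_staple = [1 if p >= 0.5 else 0 for p in p_staple]  # threshold at 0.5
print(a_staple)
print(sensitivities, specificities)
```

In this toy example the two agreeing annotators end up with higher estimated sensitivity and specificity than the third, so the thresholded mask follows their labels on the disputed pixels.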

<h4><b>Sensitivity and Specificity (STAPLE Performance Metrics):</b><br><br>
In the STAPLE context, Sensitivity and Specificity are calculated for each individual annotator ($A_i$) relative to the estimated true segmentation ($A_{\text{STAPLE}}$). These metrics quantify the estimated performance of each annotator.<br><br>
Let TP, TN, FP, and FN be the True Positives, True Negatives, False Positives, and False Negatives of annotator $A_i$ compared against $A_{\text{STAPLE}}$:
<br><br>
<b>Sensitivity</b> (True Positive Rate, $S_e$):<br><br>
Measures the proportion of actual object pixels (estimated by $A_{\text{STAPLE}}$) that the annotator $A_i$ correctly identified. It reflects the annotator's ability to avoid False Negatives (missing the object).<br><br>
$\displaystyle \text{Sensitivity}(A_i) = \frac{\text{TP}}{\text{TP} + \text{FN}} \Rightarrow \left\{{\begin{array}{rcl}S_e\to 1&{\mbox{High ability to detect object}}\\S_e\to 0&{\mbox{Low ability to detect object}}\end{array}}\right.$
<br><br>
<b>Specificity</b> (True Negative Rate, $S_p$):<br><br>
Measures the proportion of the actual background pixels (estimated by $A_{\text{STAPLE}}$) that the annotator $A_i$ correctly identified. It reflects the annotator's ability to avoid False Positives (over-segmenting the object or including background).<br><br>
$\displaystyle \text{Specificity}(A_i) = \frac{\text{TN}}{\text{TN} + \text{FP}} \Rightarrow \left\{{\begin{array}{rcl}S_p\to 1&{\mbox{High ability to detect background}}\\S_p\to 0&{\mbox{Low ability to detect background}}\end{array}}\right.$
<br><br>
The STAPLE algorithm uses these estimated individual performance metrics ($S_e$ and $S_p$ for each annotator) to weight their contributions in the iterative process of generating the $A_{\text{STAPLE}}$ mask.
</h4><br>
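
The two formulas above can be checked on a small example. The sketch below assumes masks are sets of `(row, col)` foreground pixels and that the total pixel count of the image is known, since True Negatives depend on the background size. The function name is hypothetical.

```python
def sensitivity_specificity(annotator, reference, n_pixels):
    """Sensitivity and specificity of an annotator mask against a reference mask.

    Both masks are sets of (row, col) foreground pixels; n_pixels is the image size.
    """
    tp = len(annotator & reference)   # object pixels the annotator found
    fp = len(annotator - reference)   # background pixels marked as object
    fn = len(reference - annotator)   # object pixels the annotator missed
    tn = n_pixels - tp - fp - fn      # background pixels left as background
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity


# A 2 x 5 image (10 pixels): the annotator misses one object pixel and adds one extra.
reference = {(0, 0), (0, 1), (0, 2), (0, 3)}
annotator = {(0, 0), (0, 1), (0, 2), (0, 4)}
se, sp = sensitivity_specificity(annotator, reference, n_pixels=10)
print(se, sp)  # TP = 3, FN = 1, FP = 1, TN = 5
```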


<h2><b>Metrics on multiple annotations:</b></h2>
<h4>For every available metric $\mu$, if there are $n$ annotations $\{A_1,...,A_n\}$ for an image and the model prediction $Mp$, then we compute:<br><br>
$\mu (A_i,Mp)$ for $1\leqslant i \leqslant n \hspace{3mm}$ Model prediction vs each single annotation.<br><br>
$\displaystyle \mu\left(\bigcup_{i=1}^n A_i,Mp\right) \hspace{3mm}$ Model prediction vs Union of annotations.<br><br>
$\displaystyle \mu\left(\bigcap_{i=1}^n A_i,Mp\right) \hspace{3mm}$ Model prediction vs Intersection of annotations.<br><br>
$\displaystyle \mu\left(A_{\text{STAPLE}},Mp\right) \hspace{3mm}$ Model prediction vs STAPLE annotation.<br><br>
${\displaystyle \mu (A_1,...,A_n)} \hspace{3mm}$ Generalized agreement of annotations.<br><br>
${\displaystyle \mu (A_1,...,A_n,Mp)} \hspace{3mm}$ Generalized agreement of annotations and model.<br><br>
<b>Note:</b> When there is a single annotation (a regular GT), this process is skipped and the metrics are computed in the usual model prediction vs GT way, $\mu(GT,Mp)$.
</h4>
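
The per-annotator, union, and intersection comparisons listed above can be sketched as follows, assuming masks are sets of `(row, col)` pixels and taking the Jaccard index as the example metric $\mu$. The STAPLE and generalized-agreement comparisons are left out of this sketch, and the function names and result keys are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two pixel sets (two empty sets are treated as a perfect match)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def multi_annotation_scores(metric, annotations, prediction):
    """Apply a two-mask metric to each annotation, their union, and their intersection."""
    scores = {
        f"annotator_{i}": metric(a, prediction)
        for i, a in enumerate(annotations, start=1)
    }
    scores["union"] = metric(set().union(*annotations), prediction)
    scores["intersection"] = metric(set.intersection(*annotations), prediction)
    return scores


a1 = {(0, 1), (0, 2), (0, 3)}
a2 = {(0, 2), (0, 3), (0, 4)}
mp = {(0, 2), (0, 3)}  # model prediction
scores = multi_annotation_scores(jaccard, [a1, a2], mp)
print(scores)
```

Here the prediction matches the intersection exactly but covers only half of the union, which shows why the different reference masks can give different scores for the same prediction.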

<h2><b>Computed metrics:</b></h2>
<h4><b>Jaccard index or Intersection over union (IoU)</b><br><br>
Measures the degree of geometric overlap between two sets: the size of their intersection divided by the size of their union.<br><br>
${\displaystyle J(A,B) =\frac{|A\cap B|}{|A\cup B|}\Rightarrow\left\{{\begin{array}{rcl}J\to 1&{\mbox{Perfect overlap}}\\J\to 0&{\mbox{No overlap}}\end{array}}\right.}$<br><br>
</h4>

<h4><b>Dice or F1-score:</b><br><br>
Measures the harmonic mean of <b>Precision</b> and <b>Recall</b>.<br><br>
<b>Precision:</b> Measures what proportion of the predicted segmentation is correct.<br><br>
<b>Recall:</b> Measures what proportion of the reference segmentation was correctly identified.<br><br>
${\displaystyle D(A,B) =\frac{2|A\cap B|}{|A| + |B|}= \frac{2J(A,B)}{1+J(A,B)} = \frac{2\text{TP}}{2\text{TP}+\text{FP}+\text{FN}} \Rightarrow \left\{{\begin{array}{rcl}D\to 1&{\mbox{High precision and high recall}}\\D\to 0&{\mbox{Low precision or low recall}}\end{array}}\right.}$<br><br>
</h4>
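
The Jaccard and Dice formulas, and the identity that links them, can be verified numerically. This sketch assumes masks are sets of `(row, col)` pixels and treats two empty masks as a perfect match, which is a convention chosen here rather than something stated by the tool.

```python
def jaccard(a, b):
    """Intersection over union of two pixel sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dice(a, b):
    """Dice coefficient of two pixel sets."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0


gt = {(0, 0), (0, 1), (0, 2), (1, 0)}            # reference mask A
mp = {(0, 1), (0, 2), (1, 0), (1, 1), (1, 2)}    # predicted mask B

j = jaccard(gt, mp)
d = dice(gt, mp)

# The same Dice value obtained from the confusion counts.
tp = len(gt & mp)
fp = len(mp - gt)
fn = len(gt - mp)
d_from_counts = 2 * tp / (2 * tp + fp + fn)

print(j, d, 2 * j / (1 + j), d_from_counts)
```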

<h4><b>Hausdorff Distance:</b><br><br>
Measures the maximum distance from a point in one segmentation boundary to the closest point in the other segmentation boundary. It acts as a worst-case metric that is highly sensitive to outliers.<br><br>
$\displaystyle d_H(A,B) = \max \left( \sup_{a \in A} \inf_{b \in B} d(a,b), \sup_{b \in B} \inf_{a \in A} d(a,b) \right)$<br><br>
Where $A$ and $B$ represent the points on the boundaries of the two segmentations, and $d(a,b)$ is the Euclidean distance.<br><br>
$\displaystyle d_H(A,B)\Rightarrow \left\{{\begin{array}{rcl}d_H\to 0&{\mbox{Perfect boundary alignment}}\\d_H\to \infty&{\mbox{Large maximum mismatch (outliers)}}\end{array}}\right.$<br><br>
</h4>

<h4><b>Mean Surface Distance:</b><br><br>
Calculates the average distance between the boundaries (surfaces) of two segmentations. Unlike the Hausdorff distance, it evaluates the overall average boundary error rather than focusing on the single worst-case outlier.<br><br>
$\displaystyle MSD(A,B) = \frac{1}{|S_A| + |S_B|} \left( \sum_{a \in S_A} \min_{b \in S_B} d(a,b) + \sum_{b \in S_B} \min_{a \in S_A} d(a,b) \right)$<br><br>
Where $S_A$ and $S_B$ are the sets of boundary pixels for segmentations $A$ and $B$.<br><br>
$\displaystyle MSD(A,B) \Rightarrow \left\{{\begin{array}{rcl}MSD\to 0&{\mbox{Surfaces perfectly overlap}}\\MSD\to \infty&{\mbox{High average boundary mismatch}}\end{array}}\right.$<br><br>
</h4>
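
To make the contrast between the Hausdorff distance and the Mean Surface Distance concrete, the sketch below evaluates both on the same pair of boundaries. It assumes the boundary pixels have already been extracted and are given as lists of `(row, col)` points; the brute-force nearest-point search is for illustration and would be slow on large contours.

```python
from math import dist


def nearest_distances(src, dst):
    """Euclidean distance from every point in src to its closest point in dst."""
    return [min(dist(p, q) for q in dst) for p in src]


def hausdorff(a, b):
    """Symmetric Hausdorff distance: the largest nearest-point distance in either direction."""
    return max(max(nearest_distances(a, b)), max(nearest_distances(b, a)))


def mean_surface_distance(a, b):
    """Average nearest-point distance, pooled over both boundaries."""
    total = sum(nearest_distances(a, b)) + sum(nearest_distances(b, a))
    return total / (len(a) + len(b))


# Two boundaries that agree on two points; boundary_b contains one outlier.
boundary_a = [(0, 0), (0, 1), (0, 2)]
boundary_b = [(0, 0), (0, 1), (3, 2)]

print(hausdorff(boundary_a, boundary_b))              # driven by the single outlier
print(mean_surface_distance(boundary_a, boundary_b))  # averaged over all six points
```

The outlier at `(3, 2)` alone sets the Hausdorff distance, while the Mean Surface Distance stays well below it because the other points match closely.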

<h4><b>Boundary IoU:</b><br><br>
Measures the Intersection over Union strictly focused on the boundary regions of the segmentations rather than the entire interior volume. This metric is especially useful for evaluating contour quality in larger objects, where standard IoU is dominated by internal pixels and becomes insensitive to boundary errors.<br><br>
Let $A_d$ and $B_d$ represent the boundaries of the segmentations dilated by a specific pixel distance threshold $d$.<br><br>
$\displaystyle Boundary\_IoU(A,B) = \frac{|(A_d \cap A) \cap (B_d \cap B)|}{|(A_d \cap A) \cup (B_d \cap B)|}$<br><br>
$\displaystyle Boundary\_IoU(A,B) \Rightarrow \left\{{\begin{array}{rcl}Boundary\_IoU\to 1&{\mbox{Perfect contour overlap}}\\Boundary\_IoU\to 0&{\mbox{No contour overlap}}\end{array}}\right.$<br><br>
</h4>
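
A minimal sketch of the Boundary IoU formula, assuming masks are sets of `(row, col)` pixels, a contour pixel is a mask pixel with at least one 4-connected neighbour outside the mask, and the contour is dilated with a disk of radius $d$. These choices and the function names are illustrative assumptions and may differ from the tool's implementation.

```python
NEIGHBOURS = [(-1, 0), (1, 0), (0, -1), (0, 1)]


def contour(mask):
    """Mask pixels with at least one 4-connected neighbour outside the mask."""
    return {
        (r, c) for (r, c) in mask
        if any((r + dr, c + dc) not in mask for dr, dc in NEIGHBOURS)
    }


def boundary_band(mask, d):
    """Pixels of the mask lying within distance d of its contour (A_d intersected with A)."""
    offsets = [
        (dr, dc)
        for dr in range(-d, d + 1)
        for dc in range(-d, d + 1)
        if dr * dr + dc * dc <= d * d
    ]
    dilated = {(r + dr, c + dc) for (r, c) in contour(mask) for dr, dc in offsets}
    return dilated & mask


def boundary_iou(a, b, d=1):
    """IoU restricted to the boundary bands of both masks."""
    band_a, band_b = boundary_band(a, d), boundary_band(b, d)
    union = band_a | band_b
    return len(band_a & band_b) / len(union) if union else 1.0


# A 10 x 10 square and the same square shifted one pixel to the right.
square = {(r, c) for r in range(10) for c in range(10)}
shifted = {(r, c + 1) for (r, c) in square}

iou = len(square & shifted) / len(square | shifted)
biou = boundary_iou(square, shifted, d=1)
print(iou, biou)
```

For this one-pixel shift the standard IoU stays above 0.8 while the Boundary IoU drops to 0.6, which illustrates the sensitivity to contour errors described above.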

<h4><b>Overall Contour Agreement:</b><br><br>
Evaluates the degree of consensus among multiple contours, often measuring the proportion of the predicted contour that falls within a defined tolerance distance of the ground truth boundaries, and vice versa. It functions similarly to a boundary-specific F1-score or Surface Dice.<br><br>
$\displaystyle OCA(A,B) = \frac{2 |C_A \cap_d C_B|}{|C_A| + |C_B|}$<br><br>
Where $C_A$ and $C_B$ are the contour sets, and $\cap_d$ denotes the intersection of points that lie within a distance threshold $d$ of each other.<br><br>
$\displaystyle OCA(A,B) \Rightarrow \left\{{\begin{array}{rcl}OCA\to 1&{\mbox{High contour agreement within tolerance}}\\OCA\to 0&{\mbox{No contour agreement}}\end{array}}\right.$<br><br>
<br><br>
</h4>
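
The formula above leaves some room in how $\cap_d$ is counted. The sketch below takes the Surface Dice reading as an assumption: the numerator counts the points of each contour that lie within distance $d$ of the other contour, so that $2\,|C_A \cap_d C_B|$ becomes the sum of the two counts. Contours are assumed to be lists of `(row, col)` points, and the function names are hypothetical.

```python
from math import dist


def points_within(src, dst, d):
    """Number of points in src that have some point of dst within distance d."""
    return sum(1 for p in src if any(dist(p, q) <= d for q in dst))


def contour_agreement(contour_a, contour_b, d):
    """Tolerance-based contour agreement under the Surface Dice reading (an assumption)."""
    total = len(contour_a) + len(contour_b)
    if total == 0:
        return 1.0  # convention chosen here: two empty contours agree
    matched = points_within(contour_a, contour_b, d) + points_within(contour_b, contour_a, d)
    return matched / total


contour_a = [(0, 0), (0, 1), (0, 2)]
contour_b = [(0, 0), (0, 1), (3, 2)]

print(contour_agreement(contour_a, contour_b, d=0))  # exact matches only
print(contour_agreement(contour_a, contour_b, d=1))  # one pixel of tolerance
```

Raising the tolerance from 0 to 1 lets `(0, 2)` match its neighbour `(0, 1)`, while the outlier `(3, 2)` still counts as a disagreement.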

<img src="docs/examples/out.PNG" width="600"/>