<h2>Model Evaluation</h2>

<b>Ground truth Path</b>:<br>
Path to the annotators' segmentations (ground truth).<br>

<b>Predictions Path</b>:<br>
Path to the Model Predictions.<br>

<b>Save generated mask</b>:<br>
Option to save the union, intersection and overlap masks used during the evaluation.<br>

<b>Use Dilation</b>:<br>
Use dilation to reduce mismatch among annotators.<br>

<b>Radius</b>:<br>
Radius of the dilation kernel, in pixels.<br>

<b>Mode</b>:<br>
Kernel shape used for dilation.<br>
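
To illustrate how the radius and the kernel shape interact, here is a minimal NumPy-only sketch of binary dilation. The function name <code>dilate</code> and the mode names <code>"square"</code> and <code>"disk"</code> are assumptions made for this example; the modes actually offered by the evaluation menu may differ.

```python
import numpy as np

def dilate(mask, radius=1, mode="square"):
    """Binary dilation of a 2D mask (hypothetical helper, not the tool's code)."""
    mask = np.asarray(mask, dtype=bool)
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    h, w = mask.shape
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # A disk kernel drops the corners of the square window.
            if mode == "disk" and dy * dy + dx * dx > radius * radius:
                continue
            # OR the mask shifted by (dy, dx) into the result.
            out |= padded[radius + dy:radius + dy + h, radius + dx:radius + dx + w]
    return out

mask = np.zeros((5, 5), dtype=bool)
mask[2, 2] = True                                # a single foreground pixel
square = dilate(mask, radius=1, mode="square")   # grows to a 3x3 block
disk = dilate(mask, radius=1, mode="disk")       # grows to a plus shape
```

A larger radius makes thin disagreements along the object border disappear, at the cost of inflating every annotation by the same margin.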

<span style="color:red;"><b>Note:</b></span><br>
If only one annotator folder is given, Jaccard, Dice, Hausdorff and Mean Surface Distance are computed with the given annotations and the model prediction.

In [None]:
from core.evaluation import create_evaluation_menu
create_evaluation_menu()

<h2>Summary Generation</h2>

This part takes the folder of CSV files generated in the previous step and generates a new CSV file with summary information, which gives insight into the evaluation of multiple models and into the generated metrics.

<b>CSV files folder</b>:<br>
Path to the csv files generated in the previous step.<br>

In [None]:
from core.evaluation import create_sumary_menu
create_sumary_menu()

<h2><b>Model Evaluation Overview</b></h2>
<img src="docs/examples/inp.PNG" width="600"/>

<h4><b>STAPLE (Simultaneous Truth and Performance Level Estimation) Mask:</b><br><br>
STAPLE is an algorithm used to fuse multiple expert segmentations (annotations) into a single, highly reliable consensus segmentation ($A_{\text{STAPLE}}$).<br><br>
It simultaneously estimates the "true" underlying segmentation and the performance level (sensitivity and specificity) of each individual annotator relative to this estimated truth using an Expectation-Maximization (EM) algorithm.<br><br>
The output is a probabilistic map ($P_{\text{STAPLE}}$), where each pixel value represents the estimated probability that the pixel belongs to the object of interest.<br><br>
The final binary STAPLE mask, $A_{\text{STAPLE}}$, is derived by applying a threshold (typically 0.5) to this probability map:<br><br>
$\displaystyle A_{\text{STAPLE}} = \begin{cases} 1 & \text{if } P_{\text{STAPLE}} \ge 0.5 \\ 0 & \text{if } P_{\text{STAPLE}} < 0.5 \end{cases}$<br><br>
The STAPLE mask is a robust fusion of the annotations, providing a more reliable estimate of the ground truth than any single annotator.
</h4>
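
<p>The EM idea can be sketched in a few lines of NumPy. This is a simplified binary version without spatial priors, written only for illustration; it is not the implementation used by the evaluation code, and the function name <code>staple_binary</code>, the initial value 0.99 and the global prior are assumptions.</p>

```python
import numpy as np

def staple_binary(decisions, n_iter=50, init=0.99, eps=1e-6):
    """Simplified binary STAPLE (EM) sketch, hypothetical helper.

    decisions: (n_annotators, n_pixels) array of 0/1 labels.
    Returns the per-pixel foreground probability and the estimated
    sensitivity and specificity of each annotator.
    """
    d = np.asarray(decisions, dtype=float)
    prior = d.mean()                  # assumed global foreground prior
    se = np.full(d.shape[0], init)    # initial sensitivity per annotator
    sp = np.full(d.shape[0], init)    # initial specificity per annotator
    for _ in range(n_iter):
        # E-step: posterior probability that each pixel is foreground.
        fg = prior * np.prod(np.where(d == 1, se[:, None], 1 - se[:, None]), axis=0)
        bg = (1 - prior) * np.prod(np.where(d == 1, 1 - sp[:, None], sp[:, None]), axis=0)
        w = fg / (fg + bg)
        # M-step: re-estimate each annotator's performance (clipped for stability).
        se = np.clip((d @ w) / w.sum(), eps, 1 - eps)
        sp = np.clip(((1 - d) @ (1 - w)) / (1 - w).sum(), eps, 1 - eps)
    return w, se, sp

# Three annotators, six pixels; the third annotator disagrees on two pixels.
annotations = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
])
p_staple, se, sp = staple_binary(annotations)
a_staple = (p_staple >= 0.5).astype(np.uint8)   # threshold at 0.5
```

<p>In this toy case the two agreeing annotators end up with sensitivity and specificity close to 1, while the third is estimated at roughly 2/3 for both, so its votes carry less weight in the consensus.</p>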

<h4><b>Sensitivity and Specificity (STAPLE Performance Metrics):</b><br><br>
In the STAPLE context, Sensitivity and Specificity are calculated for each individual annotator ($A_i$) relative to the estimated true segmentation ($A_{\text{STAPLE}}$). These metrics quantify the estimated performance of each annotator.<br><br>
Let TP, TN, FP, and FN be the True Positives, True Negatives, False Positives, and False Negatives of annotator $A_i$ compared against $A_{\text{STAPLE}}$:
<br><br>
<b>Sensitivity</b> (True Positive Rate, $S_e$):<br><br>
Measures the proportion of actual object pixels (estimated by $A_{\text{STAPLE}}$) that the annotator $A_i$ correctly identified. It reflects the annotator's ability to avoid False Negatives (missing the object).<br><br>
$\displaystyle \text{Sensitivity}(A_i) = \frac{\text{TP}}{\text{TP} + \text{FN}} \Rightarrow \left\{{\begin{array}{rcl}S_e\to 1&{\mbox{High ability to detect object}}\\S_e\to 0&{\mbox{Low ability to detect object}}\end{array}}\right.$
<br><br>
<b>Specificity</b> (True Negative Rate, $S_p$):<br><br>
Measures the proportion of the actual background pixels (estimated by $A_{\text{STAPLE}}$) that the annotator $A_i$ correctly identified. It reflects the annotator's ability to avoid False Positives (over-segmenting the object or including background).<br><br>
$\displaystyle \text{Specificity}(A_i) = \frac{\text{TN}}{\text{TN} + \text{FP}} \Rightarrow \left\{{\begin{array}{rcl}S_p\to 1&{\mbox{High ability to detect background}}\\S_p\to 0&{\mbox{Low ability to detect background}}\end{array}}\right.$
<br><br>
The STAPLE algorithm uses these estimated individual performance metrics ($S_e$ and $S_p$ for each annotator) to weigh their contribution in the iterative process of generating the $A_{\text{STAPLE}}$ mask.
</h4><br><br><br>
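
<p>As a small worked example of the two formulas, the sketch below counts TP, TN, FP and FN for one annotation against a reference mask. The function name and the toy masks are assumptions made for illustration.</p>

```python
import numpy as np

def sensitivity_specificity(annotation, reference):
    """Sensitivity and specificity of one annotation against a reference mask."""
    a = np.asarray(annotation, dtype=bool)
    r = np.asarray(reference, dtype=bool)
    tp = int(np.sum(a & r))      # object pixels the annotator found
    tn = int(np.sum(~a & ~r))    # background pixels the annotator left out
    fp = int(np.sum(a & ~r))     # background pixels marked as object
    fn = int(np.sum(~a & r))     # object pixels the annotator missed
    se = tp / (tp + fn) if tp + fn else float("nan")
    sp = tn / (tn + fp) if tn + fp else float("nan")
    return se, sp

reference = np.array([[1, 1, 0, 0],
                      [1, 1, 0, 0]])
annotation = np.array([[1, 1, 1, 0],
                       [1, 0, 0, 0]])
se, sp = sensitivity_specificity(annotation, reference)
# Here TP = 3, FN = 1, FP = 1, TN = 3, so both values are 0.75.
```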


<h2><b>Computed metrics:</b></h2>
<h4><b>Jaccard index or Intersection over union (IoU)</b><br><br>
Measures the degree of geometric overlap between two sets.<br><br>
${\displaystyle J(A,B) =\frac{|A\cap B|}{|A\cup B|}\Rightarrow\left\{{\begin{array}{rcl}J\to 1&{\mbox{Perfect overlap}}\\J\to 0&{\mbox{No overlap}}\end{array}}\right.}$<br><br>
Assuming that there are $n$ annotations $\{A_1,...,A_n\}$ for an image and the model prediction $Mp$, the following computations are included:<br><br>
$J(A_i,Mp)$ for $1\leqslant i \leqslant n \hspace{3mm}$  The overlap between every single annotation and the model prediction.<br><br>
$\displaystyle J\left(\bigcup_{i=1}^n A_i,Mp\right) \hspace{3mm}$ The overlap between the union of the annotations and the model prediction.<br><br>
$\displaystyle J\left(\bigcap_{i=1}^n A_i,Mp\right) \hspace{3mm}$ The overlap between the intersection of the annotations and the model prediction.<br><br>
$\displaystyle J\left(A_{\text{STAPLE}},Mp\right) \hspace{3mm}$ The overlap between the STAPLE annotation and the model prediction.<br><br>
${\displaystyle J(A_1,...,A_n) =\frac{\left|\bigcap_{i=1}^n A_i\right|}{\left|\bigcup_{i=1}^n A_i\right|}} \hspace{3mm}$ The generalized Jaccard index measures the level of agreement among the annotators.<br><br>
${\displaystyle J(A_1,...,A_n,Mp) =\frac{\left|\bigcap_{i=1}^n A_i \cap Mp\right|}{\left|\bigcup_{i=1}^n A_i \cup Mp\right|}} \hspace{3mm}$ The generalized Jaccard index measures the level of agreement among the annotators taking into account the model prediction.<br><br>
</h4>
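
<p>The Jaccard variants can be sketched on flattened toy masks as follows. The generalized index is taken here as the size of the common intersection divided by the size of the common union, so that it stays in $[0, 1]$. The function names, the toy data and the convention of returning 1 for two empty masks are assumptions of this example.</p>

```python
import numpy as np

def jaccard(a, b):
    """Intersection over union of two binary masks."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    # Assumed convention: two empty masks overlap perfectly.
    return float(np.logical_and(a, b).sum() / union) if union else 1.0

def generalized_jaccard(masks):
    """Common intersection over common union of several binary masks."""
    stack = np.asarray(masks, dtype=bool)
    union = stack.any(axis=0).sum()
    return float(stack.all(axis=0).sum() / union) if union else 1.0

a1 = np.array([1, 1, 1, 0, 0, 0])   # annotation 1
a2 = np.array([0, 1, 1, 1, 0, 0])   # annotation 2
mp = np.array([0, 1, 1, 1, 1, 0])   # model prediction

per_annotator = [jaccard(a, mp) for a in (a1, a2)]   # J(A_i, Mp)
j_union = jaccard(a1 | a2, mp)                       # J(union, Mp)
j_inter = jaccard(a1 & a2, mp)                       # J(intersection, Mp)
j_annotators = generalized_jaccard([a1, a2])         # J(A_1, ..., A_n)
j_with_model = generalized_jaccard([a1, a2, mp])     # J(A_1, ..., A_n, Mp)
```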

<h4><b>Dice or F1-score:</b><br><br>
Measures the harmonic mean of <b>Precision</b> and <b>Recall</b>.<br><br>
<b>Precision:</b> Measures the proportion of the predicted segmentation that is correct, $\text{TP}/(\text{TP}+\text{FP})$.<br><br>
<b>Recall:</b> Measures the proportion of the true segmentation that was correctly identified, $\text{TP}/(\text{TP}+\text{FN})$.<br><br>
${\displaystyle D(A,B) =\frac{2|A\cap B|}{|A| + |B|}= \frac{2J(A,B)}{1+J(A,B)} = \frac{2\,\text{TP}}{2\,\text{TP}+\text{FP}+\text{FN}} \Rightarrow \left\{{\begin{array}{rcl}D\to 1&{\mbox{High precision and high recall}}\\D\to 0&{\mbox{Low precision or low recall}}\end{array}}\right.}$<br><br>
Assuming that there are $n$ annotations $\{A_1,...,A_n\}$ for an image and the model prediction $Mp$, the following computations are included:<br><br>
$D(A_i,Mp)$ for $1\leqslant i \leqslant n \hspace{3mm}$  The score between every single annotation and the model prediction.<br><br>
$\displaystyle D\left(\bigcup_{i=1}^n A_i,Mp\right) \hspace{3mm}$ The score between the union of the annotations and the model prediction.<br><br>
$\displaystyle D\left(\bigcap_{i=1}^n A_i,Mp\right) \hspace{3mm}$ The score between the intersection of the annotations and the model prediction.<br><br>
$\displaystyle D\left(A_{\text{STAPLE}},Mp\right) \hspace{3mm}$ The score between the STAPLE annotation and the model prediction.<br><br>
${\displaystyle D(A_1,...,A_n) } \hspace{3mm}$ The generalized Dice score measures the agreement among the annotators.<br><br>
${\displaystyle D(A_1,...,A_n,Mp) } \hspace{3mm}$ The generalized Dice score measures the agreement among the annotators taking into account the model prediction.<br><br>
</h4>
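
<p>The sketch below checks numerically that the pairwise Dice score equals both the harmonic mean of precision and recall and $2J/(1+J)$. Function names and toy masks are assumptions made for illustration, and the example assumes non-empty masks when computing precision and recall.</p>

```python
import numpy as np

def dice(a, b):
    """Dice score of two binary masks."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    total = a.sum() + b.sum()
    # Assumed convention: two empty masks agree perfectly.
    return float(2 * np.logical_and(a, b).sum() / total) if total else 1.0

def precision_recall(truth, pred):
    """Precision and recall of a prediction (both masks assumed non-empty)."""
    t, p = np.asarray(truth, dtype=bool), np.asarray(pred, dtype=bool)
    tp = np.sum(t & p)
    return float(tp / p.sum()), float(tp / t.sum())

truth = np.array([1, 1, 1, 0, 0, 0])
pred = np.array([0, 1, 1, 1, 1, 0])

d = dice(truth, pred)                                  # 2*2 / (3 + 4) = 4/7
precision, recall = precision_recall(truth, pred)      # 2/4 and 2/3
harmonic = 2 * precision * recall / (precision + recall)
j = float(np.sum(truth & pred) / np.sum(truth | pred)) # Jaccard = 2/5
from_jaccard = 2 * j / (1 + j)
```

<p>All three quantities, <code>d</code>, <code>harmonic</code> and <code>from_jaccard</code>, evaluate to 4/7 for this pair of masks.</p>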

<h4><b>Hausdorff Distance:</b><br><br>
Measures the worst-case (maximum) mismatch between the boundaries of two segmentations $A$ and $B$. Because it is a maximum, it is sensitive to outliers; a small value guarantees that every boundary point of one segmentation lies close to some boundary point of the other.<br><br>
The directed Hausdorff distance from $A$ to $B$ is:<br><br>
$\displaystyle h(A, B) = \sup_{a \in A} \inf_{b \in B} \|a - b\|$<br><br>
The symmetrical Hausdorff Distance $D_{\text{HD}}$ is:<br><br>
$\displaystyle D_{\text{HD}}(A, B) = \max \left( h(A, B), h(B, A) \right) \Rightarrow \left\{{\begin{array}{rcl}D_{\text{HD}}\to 0&{\mbox{Perfect boundary alignment}}\\D_{\text{HD}}\to \infty&{\mbox{Large boundary mismatch}}\end{array}}\right.$<br><br>
Assuming that there are $n$ annotations $\{A_1,...,A_n\}$ for an image and the model prediction $Mp$, the following computations are included:<br><br>
$D_{\text{HD}}(A_i,Mp)$ for $1\leqslant i \leqslant n \hspace{3mm}$  The symmetrical distance between every single annotation boundary and the model prediction boundary.<br><br>
$\displaystyle D_{\text{HD}}\left(\bigcup_{i=1}^n A_i,Mp\right) \hspace{3mm}$ The symmetrical distance between the boundary of the union of the annotations and the model prediction boundary.<br><br>
$\displaystyle D_{\text{HD}}\left(\bigcap_{i=1}^n A_i,Mp\right) \hspace{3mm}$ The symmetrical distance between the boundary of the intersection of the annotations and the model prediction boundary.<br><br>
$\displaystyle D_{\text{HD}}\left(A_{\text{STAPLE}},Mp\right) \hspace{3mm}$ The symmetrical distance between the boundary of the STAPLE annotation and the model prediction boundary.<br><br>
$\displaystyle \text{Avg}_{i<j} \{D_{\text{HD}}(A_i, A_j)\} \hspace{3mm}$ Generalized $\text{HD}$ for Annotator Agreement, calculated as the average of all pairwise symmetrical distances among annotator boundaries.<br><br>
$\displaystyle \text{Avg}_{i} \{D_{\text{HD}}(A_i, Mp)\} \hspace{3mm}$ Generalized $\text{HD}$ for Model Agreement, calculated as the average of the pairwise symmetrical distances between each annotator boundary and the model prediction boundary.<br><br>
</h4>
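
<p>A brute-force sketch of the symmetrical Hausdorff distance between two mask boundaries is given below. It uses NumPy only and assumes non-empty masks and a 4-neighbour definition of the boundary; the helper names are hypothetical and the evaluation code may extract boundaries differently.</p>

```python
import numpy as np

def boundary_points(mask):
    """(row, col) coordinates of foreground pixels with a background 4-neighbour."""
    m = np.asarray(mask, dtype=bool)
    p = np.pad(m, 1, mode="constant", constant_values=False)
    # A pixel is interior when its four neighbours are all foreground.
    interior = p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:]
    return np.argwhere(m & ~interior)

def directed_hausdorff(pts_a, pts_b):
    """h(A, B): largest distance from a point of A to its nearest point of B."""
    d = np.linalg.norm(pts_a[:, None, :] - pts_b[None, :, :], axis=-1)
    return float(d.min(axis=1).max())

def hausdorff(mask_a, mask_b):
    a, b = boundary_points(mask_a), boundary_points(mask_b)
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

annotation = np.zeros((8, 8), dtype=bool)
annotation[2:5, 2:5] = True          # 3x3 square
prediction = np.zeros((8, 8), dtype=bool)
prediction[2:5, 2:7] = True          # same square extended two columns right

hd = hausdorff(annotation, prediction)   # the two extra columns give 2.0 pixels
```

<p>The pairwise distance matrix grows with the product of the two boundary sizes, so this form is only practical for small masks.</p>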

<h4><b>Mean Surface Distance:</b><br><br>
Measures the average mismatch between the boundaries of two segmentations $A$ and $B$. It is less sensitive to extreme outliers than Hausdorff Distance.<br><br>
The directed Mean Surface Distance from $A$ to $B$ is:<br><br>
$\displaystyle D_{A \to B} = \frac{1}{|A|} \sum_{a \in A} \inf_{b \in B} \|a - b\|$<br><br>
The Symmetrical Mean Surface Distance is:<br><br>
$\displaystyle D_{\text{MSD}}(A, B) = \frac{D_{A \to B} + D_{B \to A}}{2} \Rightarrow \left\{{\begin{array}{rcl}D_{\text{MSD}}\to 0&{\mbox{Perfect boundary alignment}}\\D_{\text{MSD}}\to \infty&{\mbox{Large average boundary mismatch}}\end{array}}\right.$<br><br>
Assuming that there are $n$ annotations $\{A_1,...,A_n\}$ for an image and the model prediction $Mp$ (where $A_i$ and $Mp$ represent boundary points), the following computations are included:<br><br>
$D_{\text{MSD}}(A_i,Mp)$ for $1\leqslant i \leqslant n \hspace{3mm}$  The symmetrical average distance between every single annotation boundary and the model prediction boundary.<br><br>
$\displaystyle D_{\text{MSD}}\left(\bigcup_{i=1}^n A_i,Mp\right) \hspace{3mm}$ The symmetrical average distance between the boundary of the union of the annotations and the model prediction boundary.<br><br>
$\displaystyle D_{\text{MSD}}\left(\bigcap_{i=1}^n A_i,Mp\right) \hspace{3mm}$ The symmetrical average distance between the boundary of the intersection of the annotations and the model prediction boundary.<br><br>
$\displaystyle D_{\text{MSD}}\left(A_{\text{STAPLE}},Mp\right) \hspace{3mm}$ The symmetrical average distance between the boundary of the STAPLE annotation and the model prediction boundary.<br><br>
$\displaystyle \text{Avg}_{i<j} \{D_{\text{MSD}}(A_i, A_j)\} \hspace{3mm}$ Generalized $\text{MSD}$ for Annotator Agreement, calculated as the average of all pairwise symmetrical average distances among annotator boundaries.<br><br>
$\displaystyle \text{Avg}_{i} \{D_{\text{MSD}}(A_i, Mp)\} \hspace{3mm}$ Generalized $\text{MSD}$ for Model Agreement, calculated as the average of the pairwise symmetrical average distances between each annotator boundary and the model prediction boundary.<br><br>
</h4>
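
<p>To show why the mean surface distance reacts less strongly to a single stray point than the Hausdorff distance, the sketch below computes both on two small sets of boundary points. The point coordinates and function names are assumptions made for illustration.</p>

```python
import numpy as np

def nearest_distances(pts_a, pts_b):
    """Distance from every point of A to its nearest point of B."""
    d = np.linalg.norm(pts_a[:, None, :] - pts_b[None, :, :], axis=-1)
    return d.min(axis=1)

def mean_surface_distance(pts_a, pts_b):
    """Symmetrical MSD: average of the two directed mean distances."""
    d_ab = nearest_distances(pts_a, pts_b).mean()
    d_ba = nearest_distances(pts_b, pts_a).mean()
    return float((d_ab + d_ba) / 2)

def hausdorff(pts_a, pts_b):
    return float(max(nearest_distances(pts_a, pts_b).max(),
                     nearest_distances(pts_b, pts_a).max()))

# Boundary points as (row, col); the last point of b is a stray outlier.
a = np.array([[0, 0], [0, 1], [0, 2]], dtype=float)
b = np.array([[1, 0], [1, 1], [1, 2], [1, 4]], dtype=float)

msd = mean_surface_distance(a, b)   # (1 + (3 + sqrt(5)) / 4) / 2
hd = hausdorff(a, b)                # sqrt(5), set by the outlier alone
```

<p>The outlier alone determines the Hausdorff distance, while in the mean surface distance it is averaged with the other three points of <code>b</code>.</p>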

<img src="docs/examples/out.PNG" width="600"/>

<h2><b>Output Interpretation Guide</b></h2>
<p>The evaluation process generates a <b>CSV file</b> with all computed metrics and a directory of <b>plots</b> for visual analysis of segmentation performance and annotator agreement.</p>

<h3><b>Interpretation of Plots</b></h3>

<p>The plots provide a visual summary of the metrics across all images, helping to quickly identify trends, outliers, and overall performance.</p>

<h4><b>Line Plots (Metric vs. Image ID)</b></h4>
<ul>
    <li><b>Jaccard/Dice (Overlap Metrics):</b> Values range from 0 (no overlap) to 1 (perfect overlap). Higher is better. A stable, high line indicates consistent and accurate segmentation.</li>
    <li><b>Hausdorff Distance (HD) and Mean Surface Distance (MSD):</b> Values are in pixels. Lower is better. HD is sensitive to outliers (worst mismatch), while MSD reflects the average mismatch.</li>
    <li><b>Agreement Plots:</b> These compare the generalized annotator agreement metric ($M(\text{Annotator Agreement})$) with the generalized annotator/model agreement ($M(\text{Annotator/Model Agreement})$).
        <ul>
            <li>If the model is highly accurate, the $M(\text{Annotator/Model Agreement})$ line should be close to the $M(\text{Annotator Agreement})$ line for overlap metrics, and the difference should be small for distance metrics.</li>
        </ul>
    </li>
    <li><b>STAPLE Sensitivity/Specificity:</b> Closer to 1 is better. Sensitivity indicates the annotator's ability to find the object (avoid false negatives), and Specificity indicates their ability to correctly identify the background (avoid false positives).</li>
</ul>

<h4><b>Distribution Plots (Boxplot and Violin Plot)</b></h4>
<p>These plots summarize the statistical distribution of the metrics over all images. They show the central tendency (mean/median) and variability (spread/range).</p>
<ul>
    <li><b>Boxplots:</b> Show the median (line), interquartile range (box), and potential outliers (points).</li>
    <li><b>Violin Plots:</b> Show the probability density of the data at different values, giving a clearer picture of the distribution shape.</li>
    <li><b>Interpretation:</b> For overlap metrics (Jaccard, Dice), look for distributions clustered near 1. For distance metrics (HD, MSD), look for distributions clustered near 0. A tight, narrow distribution indicates high consistency across all images.</li>
</ul>
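
<p>The quantities a boxplot displays can also be computed directly, which helps when reading the plots. The sketch below uses the common 1.5 × IQR rule for outliers; the plots generated by the evaluation may use a different rule, and the scores are made-up illustrative values, not real results.</p>

```python
import numpy as np

def box_stats(values):
    """Median, quartiles and 1.5*IQR outliers, as drawn by a typical boxplot."""
    v = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(v, [25, 50, 75])
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = v[(v < low) | (v > high)]
    return {"median": float(median), "q1": float(q1), "q3": float(q3),
            "outliers": outliers.tolist()}

# Hypothetical per-image Dice scores: five consistent images and one failure.
dice_scores = [0.91, 0.93, 0.90, 0.92, 0.94, 0.45]
stats = box_stats(dice_scores)
```

<p>Here the failed image is reported as an outlier, while the median stays close to the scores of the other five images.</p>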