# Diptera wings classification using Topological Data Analysis

Guilherme Vituri F. Pinto ([ORCID](https://orcid.org/0000-0002-7813-8777)) (Universidade Estadual Paulista)  
Sergio Ura  
Northon  
February 10, 2026

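We studied the classification of Diptera wings into four families (Asilidae, Bibionidae, Ceratopogonidae, and Sciaridae) using topological data analysis. From 28 wing images we computed 1-dimensional Vietoris-Rips persistence diagrams and compared several distances and vectorizations with k-nearest-neighbor classifiers under leave-one-out cross-validation. The best single distance (Euclidean distance on persistence images with 3-NN) classified 23/28 wings correctly (82.1%), and a nested cross-validation of a combined Wasserstein-1 and statistics distance gave 21/28 (75.0%).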

In [2]:
using TDAfly, TDAfly.Preprocessing, TDAfly.TDA, TDAfly.Analysis
using Images: mosaicview
using Plots: plot, display, heatmap, scatter
using StatsPlots: boxplot
using PersistenceDiagrams

## 1 Introduction

To do: describe the dataset, TDA, etc.

## 2 Methods

All images are in the `images/processed` directory. We load each image, apply a Gaussian blur, crop it, and resize it to a height of 150 pixels. The blurring step is necessary to “glue” small holes in the figure and keep it connected.

In [3]:
paths = readdir("images/processed", join = true)
species = basename.(paths) .|> (x -> replace(x, ".png" => ""))

# Extract family name from filename, normalizing typos and separators
function extract_family(name)
    family_raw = lowercase(split(name, r"[\s\-]")[1])
    if family_raw in ("bibionidae", "biobionidae")
        return "Bibionidae"
    elseif family_raw in ("sciaridae", "scaridae")
        return "Sciaridae"
    else
        return titlecase(family_raw)
    end
end

families = extract_family.(species)

individuals = map(species) do specie
    parts = split(specie, r"[\s\-]")
    string(extract_family(specie)[1]) * "-" * parts[end]
end

wings = load_wing.(paths, blur = 1.3)
Xs = map(image_to_r2, wings);

In [4]:
mosaicview(wings, ncol = 4, fillvalue=1)

### 2.1 Vietoris-Rips filtration

We select 1000 points from each image using farthest-point sampling

In [5]:
samples = map(Xs) do X
  farthest_points_sample(X, 1000)
end;

and create an empty dictionary to store all computations

In [6]:
simple_rips_dc = Dict();
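For readers unfamiliar with farthest-point sampling, the idea can be sketched as a greedy loop: start from one point and repeatedly add the point that is farthest from everything chosen so far. The function name `fps_sketch`, the tuple-based point format, and the choice of the first point as the seed are assumptions for illustration; TDAfly's `farthest_points_sample` may be implemented differently.

```julia
# Greedy farthest-point sampling on a point cloud given as a vector of (x, y) tuples.
# `fps_sketch` is a hypothetical helper, not the TDAfly implementation.
function fps_sketch(points::AbstractVector, n::Integer; start::Integer = 1)
    n = min(n, length(points))
    dist2(p, q) = (p[1] - q[1])^2 + (p[2] - q[2])^2
    chosen = [start]
    # Squared distance from every point to its nearest chosen point so far
    mind = [dist2(p, points[start]) for p in points]
    while length(chosen) < n
        next = argmax(mind)              # point farthest from the current sample
        push!(chosen, next)
        for (i, p) in enumerate(points)
            mind[i] = min(mind[i], dist2(p, points[next]))
        end
    end
    return points[chosen]
end

cloud = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
sample = fps_sketch(cloud, 3)
```

Because each new point maximizes its distance to the current sample, the selected points tend to cover the shape evenly, which is why this kind of subsampling is popular before computing Vietoris-Rips persistence.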

We then compute the persistence diagram of each sample using the Vietoris-Rips filtration.

In [7]:
# get only the 1-dimensional PD
simple_rips_dc["PD"] = rips_pd.(samples, cutoff = 5, threshold = 200) .|> last;

We compute a 10×10 persistence image from each 1-dimensional persistence diagram

In [8]:
PI = PersistenceImage(simple_rips_dc["PD"], size = (10, 10))

simple_rips_dc["PI"] = PI.(simple_rips_dc["PD"]);

#### 2.1.1 Examples

Below are some examples of 1-dimensional barcodes, together with their persistence images and the original wings that generated them. Note: we plot each barcode using birth and persistence coordinates.

In [9]:
# plot one example per family
example_indices = [findfirst(==(f), families) for f in unique(families)]
map(example_indices) do i
  p = plot_wing_with_pd(simple_rips_dc["PD"][i], simple_rips_dc["PI"][i], samples[i], species[i])
  display(p)
end;

We now compute the Euclidean distance between every pair of persistence images (each viewed as a vector in $\mathbb{R}^{10 \times 10}$) and plot the resulting distance matrix as a heatmap.

In [10]:
simple_rips_dc["Distance_matrix"] = pairwise_distance(simple_rips_dc["PI"]);
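As a rough illustration of what a pairwise Euclidean distance between persistence images amounts to, the sketch below flattens each image and fills a symmetric matrix. The name `pairwise_euclidean_sketch` is hypothetical, and it assumes each persistence image is an ordinary numeric matrix; TDAfly's `pairwise_distance` may differ in its details.

```julia
using LinearAlgebra: norm

# Flatten each image to a vector and compute all pairwise Euclidean distances.
# `pairwise_euclidean_sketch` is a hypothetical helper for illustration only.
function pairwise_euclidean_sketch(images)
    vecs = [vec(img) for img in images]
    n = length(vecs)
    D = zeros(n, n)
    for i in 1:n, j in (i + 1):n
        D[i, j] = D[j, i] = norm(vecs[i] - vecs[j])   # symmetric by construction
    end
    return D
end

imgs = [zeros(10, 10), ones(10, 10), fill(2.0, 10, 10)]
D = pairwise_euclidean_sketch(imgs)
```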

In [11]:
plot_heatmap(
  simple_rips_dc["Distance_matrix"], 
  individuals, 
  "Distance matrix for Vietoris-Rips barcodes"
)

## 3 TDA Statistics Analysis

We extract summary statistics from each 1-dimensional persistence diagram. These statistics capture different aspects of the topological structure:

-   **Number of intervals**: Total count of 1-dimensional features (loops)
-   **Maximum persistence**: The longest-lived feature
-   **Total persistence**: Sum of all persistence values
-   **Median persistence**: The median of the persistence values
-   **Persistence entropy**: Normalized Shannon entropy of persistence values
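The five statistics listed above can be sketched for a diagram given as a list of (birth, death) pairs. The helper name `pd_statistics_sketch` is hypothetical, and normalizing the entropy by $\log n$ (the maximum entropy for $n$ intervals) is an assumption about what "normalized" means here; the actual `pd_statistics` function may use another convention.

```julia
using Statistics: median

# Summary statistics of a non-empty list of (birth, death) intervals.
# Persistence of an interval is death - birth.
function pd_statistics_sketch(intervals)
    pers = [d - b for (b, d) in intervals]
    total = sum(pers)
    p = pers ./ total                            # persistence values as a probability vector
    H = -sum(x * log(x) for x in p if x > 0)     # Shannon entropy
    n = length(pers)
    H_norm = n > 1 ? H / log(n) : 0.0            # assumed normalization by log(n)
    return [n, maximum(pers), total, median(pers), H_norm]
end

stats = pd_statistics_sketch([(0.0, 1.0), (0.5, 1.5), (1.0, 3.0)])
```

The returned order (count, maximum, total, median, entropy) mirrors the `stat_names` used in the analysis.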

In [12]:
using DataFrames

# Extract statistics from persistence diagrams
stats_matrix = collect(hcat([pd_statistics(pd) for pd in simple_rips_dc["PD"]]...)')
stat_names = ["count", "max_pers", "total_pers", "median_pers", "entropy"]

# Create DataFrame for analysis
stats_df = DataFrame(
    sample = individuals,
    family = families,
    n_intervals = stats_matrix[:, 1],
    max_pers = stats_matrix[:, 2],
    total_pers = stats_matrix[:, 3],
    median_pers = stats_matrix[:, 4],
    entropy = stats_matrix[:, 5]
)

stats_df

### 3.1 Statistics comparison by family

In [13]:
using StatsPlots: boxplot  # boxplot is provided by StatsPlots, not Plots

p1 = boxplot(stats_df.family, stats_df.n_intervals,
             title="Number of 1D Intervals", legend=false, ylabel="count")
p2 = boxplot(stats_df.family, stats_df.max_pers,
             title="Maximum Persistence", legend=false, ylabel="persistence")
p3 = boxplot(stats_df.family, stats_df.entropy,
             title="Persistence Entropy", legend=false, ylabel="entropy")
plot(p1, p2, p3, layout=(1, 3), size=(900, 300))

## 4 Distance Matrix Comparison

We compute multiple distance metrics between persistence diagrams to compare their effectiveness for classification:

1.  **Euclidean distance** on persistence images (already computed)
2.  **Bottleneck distance** - the smallest achievable value, over all matchings between two diagrams, of the largest distance between matched points
3.  **Wasserstein-1 distance** - optimal transport with q=1
4.  **Wasserstein-2 distance** - optimal transport with q=2
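For reference, the standard definitions of these diagram distances can be written as follows, where $\gamma$ ranges over bijections between the diagrams $X$ and $Y$ after each is augmented with points on the diagonal. The $\ell^\infty$ ground norm is the usual convention and is an assumption here; the implementations behind `bottleneck_distance_matrix` and `wasserstein_distance_matrix` may use a different one.

$$
d_B(X, Y) = \inf_{\gamma} \sup_{x \in X} \lVert x - \gamma(x) \rVert_\infty,
\qquad
W_q(X, Y) = \Bigl( \inf_{\gamma} \sum_{x \in X} \lVert x - \gamma(x) \rVert_\infty^{q} \Bigr)^{1/q}
$$

The bottleneck distance is the limiting case $q \to \infty$: it is driven by the single worst-matched pair, whereas $W_1$ and $W_2$ accumulate the cost of all matched pairs.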

In [14]:
# Compute all distance matrices
D_euclidean = simple_rips_dc["Distance_matrix"]
D_bottleneck = bottleneck_distance_matrix(simple_rips_dc["PD"])
D_wasserstein1 = wasserstein_distance_matrix(simple_rips_dc["PD"], q=1)
D_wasserstein2 = wasserstein_distance_matrix(simple_rips_dc["PD"], q=2)

distances = Dict(
    "Euclidean (PI)" => D_euclidean,
    "Bottleneck" => D_bottleneck,
    "Wasserstein-1" => D_wasserstein1,
    "Wasserstein-2" => D_wasserstein2
);

In [15]:
p1 = plot_heatmap(D_euclidean, individuals, "Euclidean (PI)")
p2 = plot_heatmap(D_bottleneck, individuals, "Bottleneck")
p3 = plot_heatmap(D_wasserstein1, individuals, "Wasserstein-1")
p4 = plot_heatmap(D_wasserstein2, individuals, "Wasserstein-2")
plot(p1, p2, p3, p4, layout=(2, 2), size=(900, 800))

## 5 Classification Results

We use k-nearest neighbors (k-NN) classification with leave-one-out cross-validation (LOOCV). With a small sample size, LOOCV is a natural choice because it:

-   Uses all but one sample for training in each fold
-   Tests every sample exactly once
-   Provides nearly unbiased error estimates
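A minimal sketch of LOOCV with k-NN on a precomputed distance matrix is shown below. The function names are hypothetical, and the tie-breaking rule (prefer the class of the closest tied neighbor) is an assumption; the `loocv_knn` function used in this analysis may break ties differently.

```julia
# Majority vote among the k nearest neighbors of sample i, excluding i itself.
function knn_predict_sketch(D, labels, i; k = 3)
    others = [j for j in axes(D, 1) if j != i]
    nearest = others[sortperm(D[i, others])[1:k]]
    votes = Dict{eltype(labels),Int}()
    for j in nearest
        votes[labels[j]] = get(votes, labels[j], 0) + 1
    end
    # Assumed tie-break: the closest neighbor among the top-voted classes wins
    top = maximum(values(votes))
    for j in nearest
        votes[labels[j]] == top && return labels[j]
    end
end

# Each sample is predicted from all the others, so every sample is tested once.
function loocv_knn_sketch(D, labels; k = 3)
    predictions = [knn_predict_sketch(D, labels, i; k = k) for i in eachindex(labels)]
    accuracy = sum(predictions .== labels) / length(labels)
    return (predictions = predictions, accuracy = accuracy)
end

# Toy example: two well-separated groups on a line
xs = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
D = [abs(a - b) for a in xs, b in xs]
toy_labels = ["A", "A", "A", "B", "B", "B"]
result = loocv_knn_sketch(D, toy_labels; k = 3)
```

With a precomputed distance matrix there is no model to refit, so leaving a sample out only means excluding its own row and column from the neighbor search.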

### 5.1 Labels

In [16]:
labels = families

28-element Vector{String}:
 "Asilidae"
 "Asilidae"
 "Asilidae"
 "Asilidae"
 "Asilidae"
 "Asilidae"
 "Asilidae"
 "Asilidae"
 "Bibionidae"
 "Bibionidae"
 "Bibionidae"
 "Bibionidae"
 "Bibionidae"
 ⋮
 "Ceratopogonidae"
 "Ceratopogonidae"
 "Ceratopogonidae"
 "Ceratopogonidae"
 "Ceratopogonidae"
 "Ceratopogonidae"
 "Sciaridae"
 "Sciaridae"
 "Sciaridae"
 "Sciaridae"
 "Sciaridae"
 "Sciaridae"

### 5.2 Systematic comparison of all methods

In [17]:
# Test all combinations of distance metrics and k values
results = []

for (dist_name, D) in distances
    for k in [1, 3]
        result = loocv_knn(D, labels; k=k)
        n_correct = sum(result.predictions .== labels)
        push!(results, (
            distance = dist_name,
            k = k,
            n_correct = n_correct,
            n_total = length(labels),
            accuracy = result.accuracy
        ))
    end
end

results_df = DataFrame(results)
sort!(results_df, :accuracy, rev=true)
results_df

### 5.3 Best classifier evaluation

In [18]:
# Find best configuration
best_row = results_df[1, :]
best_dist = best_row.distance
best_k = best_row.k

println("Best method: $(best_dist) with k=$(best_k)")
println("Accuracy: $(best_row.n_correct)/$(best_row.n_total) ($(round(best_row.accuracy * 100, digits=1))%)")

# Get the corresponding distance matrix
D_best = distances[best_dist]

Best method: Euclidean (PI) with k=3
Accuracy: 23/28 (82.1%)

28×28 Matrix{Float64}:
 0.0         0.00110974  0.00445699  0.00310895  …  0.00770125   0.00785984
 0.00110974  0.0         0.00370692  0.00287656     0.00809153   0.00830718
 0.00445699  0.00370692  0.0         0.00283964     0.0089151    0.00921733
 0.00310895  0.00287656  0.00283964  0.0            0.00789216   0.00798768
 0.00183329  0.0024426   0.00588601  0.00477462     0.00797611   0.00811209
 0.00199248  0.00122984  0.0025716   0.0024043   …  0.00816044   0.00843425
 0.00257675  0.00310172  0.00650266  0.00541992     0.00799876   0.00811174
 0.00294059  0.00357795  0.00643965  0.00516952     0.00567919   0.00585013
 0.00481194  0.00556319  0.00774269  0.00604244     0.00392491   0.00383383
 0.00625684  0.00696381  0.00907759  0.0075677      0.00453116   0.00418445
 0.00652954  0.00697028  0.00749469  0.00599016  …  0.00420844   0.00372585
 0.00606755  0.00683354  0.00893434  0.00713092     0.00420579   0.00380993
 0.00548931  0.0058231   0.00668556  0.00563405     0.00238601   

### 5.4 Detailed classification report

In [19]:
# Full classification report with statistical validation
report = classification_report(D_best, labels; k=best_k, n_permutations=10000)

println("=== Classification Report ===")
println("Method: $(best_dist) + $(best_k)-NN")
println("Accuracy: $(report.n_correct)/$(report.n_total) ($(round(report.accuracy * 100, digits=1))%)")
println("95% CI: [$(round(report.ci_95.lower * 100, digits=1))%, $(round(report.ci_95.upper * 100, digits=1))%]")
println("p-value: $(round(report.p_value, digits=4))")
println("Chance level: $(round(report.chance_level * 100, digits=1))%")
println("")
println("Per-class accuracy:")
cm = report.confusion_matrix
for (i, cls) in enumerate(report.classes)
    correct = cm[i, i]
    total = sum(cm[i, :])
    println("  $(cls): $(correct)/$(total) ($(round(correct/total * 100, digits=1))%)")
end

=== Classification Report ===
Method: Euclidean (PI) + 3-NN
Accuracy: 23/28 (82.1%)
95% CI: [64.4%, 92.1%]
p-value: 0.0
Chance level: 22.1%

Per-class accuracy:
  Asilidae: 8/8 (100.0%)
  Bibionidae: 4/6 (66.7%)
  Ceratopogonidae: 6/8 (75.0%)
  Sciaridae: 5/6 (83.3%)
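The reported confidence interval is consistent with the Wilson score formula for a binomial proportion. The sketch below is an independent reimplementation under that assumption (the name `wilson_ci_sketch` is hypothetical and $z = 1.96$ is the usual 95% quantile); it is not the code used by `classification_report`.

```julia
# Wilson score interval for a binomial proportion (z = 1.96 for a 95% level).
function wilson_ci_sketch(n_correct, n_total; z = 1.96)
    p = n_correct / n_total
    denom = 1 + z^2 / n_total
    center = (p + z^2 / (2 * n_total)) / denom
    half = z * sqrt(p * (1 - p) / n_total + z^2 / (4 * n_total^2)) / denom
    return (lower = center - half, upper = center + half)
end

ci = wilson_ci_sketch(23, 28)
println(round(ci.lower * 100, digits = 1), "% to ", round(ci.upper * 100, digits = 1), "%")
```

For 23 correct out of 28 this gives approximately 64.4% to 92.1%, matching the interval in the report.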

### 5.5 Confusion matrix

In [20]:
classes = report.classes
n_classes = length(classes)

println("Confusion Matrix (rows = true, columns = predicted):")
# Header
print(lpad("", 20))
for cls in classes
    print(lpad(cls[1:min(4,end)], 6))
end
println()
# Rows
for i in 1:n_classes
    print(rpad(classes[i], 20))
    for j in 1:n_classes
        print(lpad(string(cm[i,j]), 6))
    end
    println()
end

Confusion Matrix (rows = true, columns = predicted):
                      Asil  Bibi  Cera  Scia
Asilidae                 8     0     0     0
Bibionidae               0     4     0     2
Ceratopogonidae          0     0     6     2
Sciaridae                0     0     1     5

In [21]:
# Visualize confusion matrix
heatmap(cm,
        xticks=(1:n_classes, classes),
        yticks=(1:n_classes, classes),
        xlabel="Predicted", ylabel="True",
        title="Confusion Matrix",
        color=:Blues,
        clims=(0, maximum(cm)))

## 6 Alternative Metrics on PD Statistics

The five PD statistics have very different scales (e.g. `total_pers` can be orders of magnitude larger than `entropy`), so on raw values the largest-scale statistic dominates any distance. We test alternative distance metrics and z-score normalization to address this.
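Z-score normalization can be sketched as a column-wise standardization of the statistics matrix. The name `zscore_sketch` and the guard for constant columns are assumptions; `zscore_normalize` may handle edge cases differently.

```julia
using Statistics: mean, std

# Standardize each column (one statistic per column) to zero mean and unit standard deviation.
function zscore_sketch(X::AbstractMatrix)
    mu = mean(X, dims = 1)
    sigma = std(X, dims = 1)
    sigma = map(s -> s == 0 ? one(s) : s, sigma)   # guard against constant columns
    return (X .- mu) ./ sigma
end

X = [1.0 100.0; 2.0 200.0; 3.0 300.0]
Z = zscore_sketch(X)
```

In this toy matrix the second column is 100 times larger than the first, yet both columns become identical after standardization, so neither dominates a Euclidean distance.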

In [22]:
using Distances: euclidean, cityblock, chebyshev, cosine_dist

# Raw statistics vectors
stats_vectors = [stats_matrix[i, :] for i in 1:size(stats_matrix, 1)]

# Z-score normalized statistics vectors
stats_norm = zscore_normalize(stats_matrix)
stats_norm_vectors = [stats_norm[i, :] for i in 1:size(stats_norm, 1)]

# Distance matrices with different metrics on raw and normalized statistics
stats_metrics = Dict(
    "L2" => euclidean,
    "L1" => cityblock,
    "L∞" => chebyshev,
    "Cosine" => cosine_dist,
)

stats_results = []

for (metric_name, metric_fn) in stats_metrics
    for (norm_name, vecs) in [("raw", stats_vectors), ("z-score", stats_norm_vectors)]
        D_stats = pairwise_distance(vecs, metric_fn)
        for k in [1, 3, 5]
            result = loocv_knn(D_stats, labels; k=k)
            n_correct = sum(result.predictions .== labels)
            push!(stats_results, (
                metric = metric_name,
                normalization = norm_name,
                k = k,
                n_correct = n_correct,
                n_total = length(labels),
                accuracy = result.accuracy
            ))
        end
    end
end

stats_results_df = DataFrame(stats_results)
sort!(stats_results_df, :accuracy, rev=true)
stats_results_df

## 7 Combined Distance Analysis

We combine the Wasserstein-1 distance (best topology-aware metric) with a statistics-based distance using convex combinations: $D_\text{combined}(\alpha) = \alpha \cdot D_1^* + (1 - \alpha) \cdot D_2^*$, where $D_1^*$ and $D_2^*$ denote the Wasserstein-1 and statistics-based distance matrices, each normalized to $[0, 1]$, and $\alpha \in [0, 1]$ is the weight given to Wasserstein-1.
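The combination can be sketched in a few lines. Dividing each matrix by its largest entry is one simple way to map distances to $[0, 1]$ and is an assumption here; `combined_distance_grid_search` may normalize differently. The helper names are hypothetical.

```julia
# Rescale a distance matrix to [0, 1] by dividing by its largest entry (assumed normalization).
normalize01(D) = maximum(D) > 0 ? D ./ maximum(D) : D

# Convex combination of two normalized distance matrices; alpha is the weight of D1.
combine_sketch(D1, D2, alpha) = alpha .* normalize01(D1) .+ (1 - alpha) .* normalize01(D2)

D1 = [0.0 1.0 2.0; 1.0 0.0 1.0; 2.0 1.0 0.0]
D2 = [0.0 10.0 5.0; 10.0 0.0 5.0; 5.0 5.0 0.0]
Dc = combine_sketch(D1, D2, 0.5)
```

Normalizing first keeps $\alpha$ interpretable: without it, the matrix with the larger numerical range would dominate regardless of the weight.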

### 7.1 Wasserstein-1 + Statistics (best metric)

In [23]:
# Use best stats metric from previous section
best_stats_row = stats_results_df[1, :]
best_stats_metric = stats_metrics[best_stats_row.metric]
best_stats_norm = best_stats_row.normalization == "z-score" ? stats_norm_vectors : stats_vectors
D_stats_best = pairwise_distance(best_stats_norm, best_stats_metric)

println("Best statistics metric: $(best_stats_row.metric) ($(best_stats_row.normalization))")

grid_wass_stats = combined_distance_grid_search(D_wasserstein1, D_stats_best, labels)

println("\nTop 5 combinations (Wasserstein-1 + Stats):")
for r in grid_wass_stats[1:min(5, end)]
    println("  α=$(round(r.alpha, digits=1)), k=$(r.k): $(r.n_correct)/$(length(labels)) ($(round(r.accuracy * 100, digits=1))%)")
end

Best statistics metric: L2 (z-score)

Top 5 combinations (Wasserstein-1 + Stats):
  α=0.4, k=3: 24/28 (85.7%)
  α=0.5, k=3: 24/28 (85.7%)
  α=0.6, k=3: 24/28 (85.7%)
  α=0.6, k=5: 24/28 (85.7%)
  α=0.8, k=3: 24/28 (85.7%)

In [24]:
# Heatmap of accuracy over (alpha, k)
alphas = 0.0:0.1:1.0
ks = [1, 3, 5]

acc_grid = zeros(length(alphas), length(ks))
for r in grid_wass_stats
    i = findfirst(a -> isapprox(a, r.alpha; atol = 1e-8), alphas)  # tolerate floating-point rounding in α
    j = findfirst(==(r.k), ks)
    if !isnothing(i) && !isnothing(j)
        acc_grid[i, j] = r.accuracy
    end
end

heatmap(string.(ks), string.(collect(alphas)),
        acc_grid,
        xlabel="k", ylabel="α (Wasserstein-1 weight)",
        title="Wasserstein-1 + Stats ($(best_stats_row.metric), $(best_stats_row.normalization))",
        color=:Blues, clims=(0.5, 1.0))

### 7.2 Wasserstein-1 + Bottleneck

In [25]:
grid_wass_bn = combined_distance_grid_search(D_wasserstein1, D_bottleneck, labels)

println("Top 5 combinations (Wasserstein-1 + Bottleneck):")
for r in grid_wass_bn[1:min(5, end)]
    println("  α=$(round(r.alpha, digits=1)), k=$(r.k): $(r.n_correct)/$(length(labels)) ($(round(r.accuracy * 100, digits=1))%)")
end

Top 5 combinations (Wasserstein-1 + Bottleneck):
  α=0.7, k=3: 22/28 (78.6%)
  α=0.8, k=3: 22/28 (78.6%)
  α=0.9, k=3: 22/28 (78.6%)
  α=1.0, k=1: 22/28 (78.6%)
  α=1.0, k=3: 22/28 (78.6%)

In [26]:
acc_grid_bn = zeros(length(alphas), length(ks))
for r in grid_wass_bn
    i = findfirst(a -> isapprox(a, r.alpha; atol = 1e-8), alphas)  # tolerate floating-point rounding in α
    j = findfirst(==(r.k), ks)
    if !isnothing(i) && !isnothing(j)
        acc_grid_bn[i, j] = r.accuracy
    end
end

heatmap(string.(ks), string.(collect(alphas)),
        acc_grid_bn,
        xlabel="k", ylabel="α (Wasserstein-1 weight)",
        title="Wasserstein-1 + Bottleneck",
        color=:Blues, clims=(0.5, 1.0))

## 8 Alternative Representations

### 8.1 Persistence Landscapes and Betti Curves

Persistence landscapes and Betti curves provide richer vectorized representations of persistence diagrams than the 5 summary statistics.
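To make the Betti curve concrete, the sketch below counts, at each scale on a grid, how many intervals are alive. This is the textbook definition under the half-open convention birth ≤ t < death; the `BettiCurve` type in PersistenceDiagrams.jl works on binned ranges and may give slightly different values. The helper name is hypothetical.

```julia
# Betti curve: for each scale t in `grid`, count intervals with birth <= t < death.
function betti_curve_sketch(intervals, grid)
    return [count(((b, d),) -> b <= t < d, intervals) for t in grid]
end

intervals = [(0.0, 2.0), (1.0, 3.0), (2.5, 4.0)]
grid = range(0.0, 4.0, length = 5)   # scales 0, 1, 2, 3, 4
curve = betti_curve_sketch(intervals, grid)
```

Sampling the curve on a fixed grid turns every diagram into a vector of the same length, which is what allows ordinary Euclidean distances between diagrams.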

In [27]:
using PersistenceDiagrams: BettiCurve, Landscape

pds = simple_rips_dc["PD"]

# Betti curve: counts alive features at each filtration scale
bc = BettiCurve(pds; length=50)
betti_vectors = bc.(pds)
D_betti = pairwise_distance(betti_vectors, euclidean)

# Persistence landscape (1st landscape)
land = Landscape(1, pds; length=50)
land_vectors = land.(pds)
D_landscape = pairwise_distance(land_vectors, euclidean)

repr_results = []
for (name, D) in [("Betti curve", D_betti), ("Landscape-1", D_landscape)]
    for k in [1, 3, 5]
        result = loocv_knn(D, labels; k=k)
        n_correct = sum(result.predictions .== labels)
        push!(repr_results, (
            representation = name,
            k = k,
            n_correct = n_correct,
            n_total = length(labels),
            accuracy = result.accuracy
        ))
    end
end

repr_results_df = DataFrame(repr_results)
sort!(repr_results_df, :accuracy, rev=true)
repr_results_df

### 8.2 Combining Betti curves / Landscapes with Wasserstein-1

In [28]:
grid_wass_betti = combined_distance_grid_search(D_wasserstein1, D_betti, labels)
grid_wass_land = combined_distance_grid_search(D_wasserstein1, D_landscape, labels)

println("Top 3 (Wasserstein-1 + Betti curve):")
for r in grid_wass_betti[1:min(3, end)]
    println("  α=$(round(r.alpha, digits=1)), k=$(r.k): $(r.n_correct)/$(length(labels)) ($(round(r.accuracy * 100, digits=1))%)")
end

println("\nTop 3 (Wasserstein-1 + Landscape-1):")
for r in grid_wass_land[1:min(3, end)]
    println("  α=$(round(r.alpha, digits=1)), k=$(r.k): $(r.n_correct)/$(length(labels)) ($(round(r.accuracy * 100, digits=1))%)")
end

Top 3 (Wasserstein-1 + Betti curve):
  α=0.7, k=1: 22/28 (78.6%)
  α=0.7, k=3: 22/28 (78.6%)
  α=0.8, k=1: 22/28 (78.6%)

Top 3 (Wasserstein-1 + Landscape-1):
  α=0.9, k=3: 23/28 (82.1%)
  α=0.8, k=3: 22/28 (78.6%)
  α=0.9, k=5: 22/28 (78.6%)

## 9 Alternative Classifiers

We compare the standard k-NN with weighted k-NN (inverse distance weighting) and nearest centroid classification.
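The difference between plain and weighted k-NN can be illustrated with a small sketch in which each neighbor votes with weight $1/d$. The function name, the small `eps` added to avoid division by zero, and the exact weighting are assumptions; `loocv_knn_weighted` may use a different scheme.

```julia
# Inverse-distance weighted vote among the k nearest neighbors of sample i.
function weighted_knn_predict_sketch(D, labels, i; k = 3, eps = 1e-12)
    others = [j for j in axes(D, 1) if j != i]
    nearest = others[sortperm(D[i, others])[1:k]]
    scores = Dict{eltype(labels),Float64}()
    for j in nearest
        # eps avoids division by zero when two samples coincide
        scores[labels[j]] = get(scores, labels[j], 0.0) + 1 / (D[i, j] + eps)
    end
    # Return the label with the largest accumulated weight
    best_label, best_score = first(scores)
    for (label, score) in scores
        if score > best_score
            best_label, best_score = label, score
        end
    end
    return best_label
end

xs = [0.0, 0.1, 3.0, 3.2]
D = [abs(a - b) for a in xs, b in xs]
toy_labels = ["A", "A", "B", "B"]
pred = weighted_knn_predict_sketch(D, toy_labels, 1; k = 3)
```

In this toy example the three neighbors of the first sample are one nearby "A" and two distant "B" points: an unweighted majority vote would choose "B", while the weighted vote chooses "A".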

In [29]:
# Collect all distance matrices to test
all_distances = merge(distances, Dict(
    "Betti curve" => D_betti,
    "Landscape-1" => D_landscape,
))

classifier_results = []

for (dist_name, D) in all_distances
    for k in [1, 3, 5]
        # Standard k-NN
        r1 = loocv_knn(D, labels; k=k)
        push!(classifier_results, (
            classifier = "k-NN",
            distance = dist_name,
            k = k,
            n_correct = sum(r1.predictions .== labels),
            accuracy = r1.accuracy
        ))

        # Weighted k-NN
        r2 = loocv_knn_weighted(D, labels; k=k)
        push!(classifier_results, (
            classifier = "Weighted k-NN",
            distance = dist_name,
            k = k,
            n_correct = sum(r2.predictions .== labels),
            accuracy = r2.accuracy
        ))
    end

    # Nearest centroid (no k parameter)
    r3 = loocv_nearest_centroid(D, labels)
    push!(classifier_results, (
        classifier = "Nearest centroid",
        distance = dist_name,
        k = 0,
        n_correct = sum(r3.predictions .== labels),
        accuracy = r3.accuracy
    ))
end

classifier_df = DataFrame(classifier_results)
sort!(classifier_df, :accuracy, rev=true)
first(classifier_df, 15)

## 10 Honest Evaluation (Nested LOOCV)

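Since we tested many (distance, α, k) configurations, the best LOOCV accuracy is optimistically biased. We therefore use **nested leave-one-out cross-validation** to estimate accuracy: the outer loop leaves one sample out, and the inner loop selects the best (α, k) using only the remaining samples. Note that the distance pair itself (Wasserstein-1 plus the best statistics metric) was chosen on the full dataset, so some optimism may remain.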
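The two loops can be sketched as follows. All names are hypothetical, the max-based normalization and the tie-breaking rules are assumptions, and ties between configurations are resolved by keeping the first one found; the `nested_loocv` function used below may differ on each of these points.

```julia
# k-NN vote for sample i using only the samples in `pool`.
function knn_vote(D, labels, i, pool, k)
    nearest = pool[sortperm(D[i, pool])[1:min(k, length(pool))]]
    counts = Dict{eltype(labels),Int}()
    for j in nearest
        counts[labels[j]] = get(counts, labels[j], 0) + 1
    end
    top = maximum(values(counts))
    # Assumed tie-break: closest neighbor whose class reaches the top count
    return labels[first(j for j in nearest if counts[labels[j]] == top)]
end

# Inner LOOCV accuracy restricted to the training indices `idx`.
function inner_accuracy(D, labels, idx, k)
    hits = count(i -> knn_vote(D, labels, i, filter(!=(i), idx), k) == labels[i], idx)
    return hits / length(idx)
end

normalize01(D) = D ./ maximum(D)   # assumed normalization to [0, 1]

function nested_loocv_sketch(D1, D2, labels; alphas = 0.0:0.1:1.0, ks = (1, 3, 5))
    n = length(labels)
    N1, N2 = normalize01(D1), normalize01(D2)
    predictions = similar(labels)
    for i in 1:n
        train = filter(!=(i), 1:n)        # outer loop: hold out sample i
        best = (acc = -1.0, alpha = first(alphas), k = first(ks))
        for alpha in alphas, k in ks      # inner loop: tune on training samples only
            D = alpha .* N1 .+ (1 - alpha) .* N2
            acc = inner_accuracy(D, labels, train, k)
            if acc > best.acc
                best = (acc = acc, alpha = alpha, k = k)
            end
        end
        D = best.alpha .* N1 .+ (1 - best.alpha) .* N2
        predictions[i] = knn_vote(D, labels, i, train, best.k)
    end
    return (predictions = predictions, accuracy = sum(predictions .== labels) / n)
end

xs = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3]
D1 = [abs(a - b) for a in xs, b in xs]
D2 = D1 .^ 2
toy_labels = ["A", "A", "A", "A", "B", "B", "B", "B"]
res = nested_loocv_sketch(D1, D2, toy_labels)
```

The key property is that the held-out sample never influences the choice of (α, k) used to classify it.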

In [30]:
# Nested LOOCV: Wasserstein-1 + best stats distance
nested_result = nested_loocv(D_wasserstein1, D_stats_best, labels)
n_correct_nested = sum(nested_result.predictions .== labels)

println("=== Nested LOOCV Result ===")
println("Distance pair: Wasserstein-1 + Stats ($(best_stats_row.metric), $(best_stats_row.normalization))")
println("Accuracy: $(n_correct_nested)/$(length(labels)) ($(round(nested_result.accuracy * 100, digits=1))%)")

ci = wilson_ci(n_correct_nested, length(labels))
println("95% CI: [$(round(ci.lower * 100, digits=1))%, $(round(ci.upper * 100, digits=1))%]")

println("\nSelected parameters per fold:")
for (i, p) in enumerate(nested_result.params)
    println("  Sample $i: α=$(round(p.alpha, digits=1)), k=$(p.k) (inner acc=$(round(p.inner_acc * 100, digits=1))%)")
end

=== Nested LOOCV Result ===
Distance pair: Wasserstein-1 + Stats (L2, z-score)
Accuracy: 21/28 (75.0%)
95% CI: [56.6%, 87.3%]

Selected parameters per fold:
  Sample 1: α=0.4, k=3 (inner acc=85.2%)
  Sample 2: α=0.4, k=3 (inner acc=85.2%)
  Sample 3: α=0.4, k=3 (inner acc=85.2%)
  Sample 4: α=0.4, k=3 (inner acc=85.2%)
  Sample 5: α=0.6, k=3 (inner acc=85.2%)
  Sample 6: α=0.4, k=3 (inner acc=85.2%)
  Sample 7: α=0.6, k=3 (inner acc=85.2%)
  Sample 8: α=0.4, k=3 (inner acc=85.2%)
  Sample 9: α=0.4, k=3 (inner acc=81.5%)
  Sample 10: α=0.3, k=3 (inner acc=77.8%)
  Sample 11: α=0.4, k=3 (inner acc=88.9%)
  Sample 12: α=0.4, k=3 (inner acc=81.5%)
  Sample 13: α=0.3, k=3 (inner acc=88.9%)
  Sample 14: α=0.6, k=5 (inner acc=88.9%)
  Sample 15: α=0.4, k=3 (inner acc=88.9%)
  Sample 16: α=0.4, k=3 (inner acc=85.2%)
  Sample 17: α=0.4, k=3 (inner acc=85.2%)
  Sample 18: α=0.4, k=3 (inner acc=85.2%)
  Sample 19: α=0.5, k=3 (inner acc=88.9%)
  Sample 20: α=0.4, k=3 (inner acc=88.9%)
  Sample 21:

In [31]:
# Confusion matrix for nested LOOCV
cm_nested = confusion_matrix(labels, nested_result.predictions)
classes_nested = cm_nested.classes
n_classes_nested = length(classes_nested)

println("Per-class accuracy (Nested LOOCV):")
for (i, cls) in enumerate(classes_nested)
    correct = cm_nested.matrix[i, i]
    total = sum(cm_nested.matrix[i, :])
    println("  $(cls): $(correct)/$(total) ($(round(correct/total * 100, digits=1))%)")
end

Per-class accuracy (Nested LOOCV):
  Asilidae: 8/8 (100.0%)
  Bibionidae: 3/6 (50.0%)
  Ceratopogonidae: 6/8 (75.0%)
  Sciaridae: 4/6 (66.7%)

In [32]:
heatmap(cm_nested.matrix,
        xticks=(1:n_classes_nested, classes_nested),
        yticks=(1:n_classes_nested, classes_nested),
        xlabel="Predicted", ylabel="True",
        title="Confusion Matrix (Nested LOOCV)",
        color=:Blues,
        clims=(0, maximum(cm_nested.matrix)))

## 11 Discussion

The topological approach using persistence diagrams is applied to distinguish between four Diptera families (Asilidae, Bibionidae, Ceratopogonidae, and Sciaridae). Key findings:

1.  **Feature extraction**: The 1-dimensional persistence captures meaningful topological differences in wing venation patterns (loops and holes). Summary statistics (count, max/total/median persistence, entropy) provide a complementary low-dimensional representation.

2.  **Distance metrics**: We compared Euclidean distance on persistence images with topological distances (Bottleneck, Wasserstein) and with alternative metrics (L1, L2, L∞, cosine) on PD statistics. Among the four diagram-level distances, Euclidean distance on persistence images with 3-NN performed best (23/28, 82.1%). For the statistics, z-score normalization matters because it prevents large-scale features from dominating the distance; the best-performing statistics metric was L2 on z-scored values.

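3.  **Combined distances**: Convex combinations of the Wasserstein-1 distance with statistics-based distances reached 24/28 (85.7%) under plain LOOCV, compared with 22/28 (78.6%) for Wasserstein-1 alone (α = 1.0, k = 3). Nested LOOCV lowers the estimate to 21/28 (75.0%), which suggests that part of this gain reflects hyperparameter selection rather than complementary information.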

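4.  **Alternative representations**: Persistence landscapes and Betti curves provide richer vectorized summaries than the 5 summary statistics. Combined with Wasserstein-1, the first landscape reached 23/28 (82.1%) and the Betti curve 22/28 (78.6%), so neither exceeded the 24/28 (85.7%) obtained with Wasserstein-1 plus statistics.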

5.  **Statistical validation**: With 28 samples across 4 families, we report:

    -   Leave-one-out cross-validation accuracy
    -   95% Wilson confidence intervals
    -   Permutation test p-value to confirm statistical significance
    -   Nested LOOCV for honest evaluation when hyperparameters are tuned
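The permutation test mentioned in the list above can be sketched as follows: shuffle the labels many times, recompute the LOOCV accuracy each time, and report how often a shuffled labelling does at least as well as the true one. The function names, the use of 1-NN, and the `+1` correction (a common convention that keeps the p-value strictly positive) are assumptions; the test inside `classification_report` may be computed differently.

```julia
using Random

# LOOCV accuracy of a 1-nearest-neighbor classifier on a distance matrix.
function loocv_1nn_accuracy(D, labels)
    n = length(labels)
    hits = 0
    for i in 1:n
        best_j, best_d = 0, Inf
        for j in 1:n
            if j != i && D[i, j] < best_d
                best_j, best_d = j, D[i, j]
            end
        end
        hits += labels[best_j] == labels[i]
    end
    return hits / n
end

# accuracy_fn(labels) must return the accuracy obtained with a given labelling.
function permutation_pvalue_sketch(accuracy_fn, labels;
                                   n_permutations = 1000, rng = Random.default_rng())
    observed = accuracy_fn(labels)
    hits = count(_ -> accuracy_fn(shuffle(rng, labels)) >= observed, 1:n_permutations)
    # The +1 counts the observed labelling itself, so p is never exactly 0
    return (hits + 1) / (n_permutations + 1)
end

xs = [0.0, 0.11, 0.25, 0.42, 0.62, 5.0, 5.11, 5.25, 5.42, 5.62]
D = [abs(a - b) for a in xs, b in xs]
toy_labels = repeat(["A", "B"], inner = 5)
p = permutation_pvalue_sketch(l -> loocv_1nn_accuracy(D, l), toy_labels;
                              n_permutations = 999, rng = MersenneTwister(1))
```

With this convention the smallest attainable p-value for $N$ permutations is $1/(N + 1)$, which makes explicit that the resolution of the test is limited by the number of shuffles.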

### 11.1 Limitations

-   Small sample size (n=28) leads to wide confidence intervals; results should be confirmed with larger datasets
-   Mild class imbalance (8 Asilidae, 6 Bibionidae, 8 Ceratopogonidae, 6 Sciaridae) may affect classification performance
-   Image quality and preprocessing parameters may influence topological features
-   Grid search over many configurations makes the best plain-LOOCV accuracy optimistically biased; nested cross-validation is required for a fair estimate

### 11.2 Future work

-   Extend dataset with more species and specimens per family
-   Explore additional filtration methods (line-based, directional)
-   Combine TDA features with traditional morphometric measurements
-   Investigate kernel methods for persistence diagrams (e.g., persistence scale-space kernel)