## Topological Data Quality notebook

Here we review some examples from the article "A Topological Approach to Measuring Training Data Quality" (arXiv:2306.02411).


In [1]:
import os
import numpy as np
import scipy.spatial.distance as dist
import matplotlib as mpl

_tol = 1e-10

In [2]:
# IBloFunMatch.cc must be compiled first; EXECUTABLE_PATH must point to the resulting executable
EXECUTABLE_PATH = "..\\x64\\Debug\\IBloFunMatchCPP.exe" # this is my particular path

In [3]:
! {EXECUTABLE_PATH + " -h"}


Given metric spaces S and X together with an inclusion f
from S to X, indicated by a set of indices, which is 
such that, for all a,b from X,  

         d(a,b) >= d_X(f(a), f(b)).

This induces a morphism of persistence modules

         PH_k(VR(S))-->PH_k(VR(X)), 

for all k>=0, and with fixed field Z/2Z.
Usage: ..\x64\Debug\IBloFunMatchCPP.exe [options] file_dist_S file_dist_X sample-indices, where:
 [file_dist_S] is the file storing the distance matrix from S
 [file_dist_X] is the file storing the distance matrix from X
 [sample-indices] is a file with the indices of elements from S in X.

Allowed options:
  -h [ --help ]                              produce help message
  -r [ --max-edge-length ] arg (=inf)        Maximal length of an edge for the Rips complex 
                                             construction.
  -d [ --cpx-dimension ] arg (=1)            Maximal dimension of the Rips complex we want to 
                                             compute.
  -i [ --edge-
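
The hypothesis in the help message, d_S(a, b) >= d_X(f(a), f(b)) for all a, b in S, can be checked on the distance matrices before calling the executable. Below is a minimal sketch, assuming the sample indices give the position in X of each point of S; the function name `check_inclusion_inequality` and the toy data are illustrative and are not part of IBloFunMatch.

```python
import numpy as np

def check_inclusion_inequality(dist_S, dist_X, idS, tol=1e-10):
    """Return True if dist_S[a, b] >= dist_X[idS[a], idS[b]] - tol for all a, b."""
    dist_S = np.asarray(dist_S, dtype=float)
    dist_X = np.asarray(dist_X, dtype=float)
    idS = np.asarray(idS, dtype=int)
    # Restrict the distance matrix of X to the rows and columns of the sample
    dist_X_on_S = dist_X[np.ix_(idS, idS)]
    return bool(np.all(dist_S >= dist_X_on_S - tol))

# Toy data (assumption): four points on a line, S consists of points 0 and 2 of X
X_line = np.array([[0.0], [1.0], [2.0], [3.0]])
dist_X = np.abs(X_line - X_line.T)
idS = [0, 2]
dist_S = dist_X[np.ix_(idS, idS)]  # S inherits its metric from X

ok_same = check_inclusion_inequality(dist_S, dist_X, idS)
# Shrinking the distances in S violates the inequality
ok_shrunk = check_inclusion_inequality(0.5 * dist_S, dist_X, idS)
print(ok_same, ok_shrunk)
```

When S carries the metric inherited from X, as in the examples of this notebook, the inequality holds with equality.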

### Example 1

In [4]:
# Load data and labels
data = np.genfromtxt("data_first/dataset.txt")
y = np.genfromtxt("data_first/labels.txt")
# Load subsets S1, S2, S3 of the dataset (their indices w.r.t. X are computed later)
S1 = np.genfromtxt("data_first/subset1.txt")
S2 = np.genfromtxt("data_first/subset2.txt")
S3 = np.genfromtxt("data_first/subset3.txt")
# Load their labels 
yS1 = np.genfromtxt("data_first/labels_subset1.txt")
yS2 = np.genfromtxt("data_first/labels_subset2.txt")
yS3 = np.genfromtxt("data_first/labels_subset3.txt")

In [18]:
# Compute the persistent homology and associated matrices for each class and subset
NUM_class = 2
NUM_subset = 3
# Organize Subset, Data and labels into lists
S_list = [S1, S2, S3]
yS_list = [yS1, yS2, yS3]
# Output of each run is stored here as a dictionary
IBloFunMatch_output = []
# Buffer files to write subsets and classes for communicating with C++ program 
f_ind_sampl = "output\\indices_sample.out"
f_dist_X = "output\\dist_X.out"
f_dist_S = "output\\dist_S.out"
# Directory where output of C++ program will be read from
output_dir = "output"
# Initialize variables to range over 
attributes = ["X_barcode", "S_barcode", "X_reps", "S_reps", "S_reps_im", "pm_matrix", "induced_matching"]
# One type per attribute, in the same order
types_list = ["float", "float", "int", "int", "int", "int", "int"]
# class_sub_idx[c, s] is the position in IBloFunMatch_output of the run for class c and subset s (-1 if not run)
class_sub_idx = np.ones((NUM_class, NUM_subset)).astype("int")*(-1)
counter=0
# Range over Subsets and Classes
for idx_class in range(NUM_class):
    for idx_subset in range(NUM_subset):
        output_data = {}
        # Subset and dataset points pertaining to class 
        subset = S_list[idx_subset]
        y_subset = yS_list[idx_subset]
        S = subset[y_subset==idx_class]
        X = data[y==idx_class]
        output_data["S"]=S
        output_data["X"]=X
        # Find the index in X of each point of S (L1 distance below _tol) and save these indices
        idS = [np.argmax(np.sum(abs(X - pt), axis=1) < _tol) for pt in S]
        output_data["idS"]=idS
        np.savetxt(f_ind_sampl, idS, fmt="%d", newline="\n")
        # Compute distance matrices and save
        Dist_X = dist.squareform(dist.pdist(X))
        Dist_S = dist.squareform(dist.pdist(S))
        np.savetxt(f_dist_X, Dist_X, fmt="%.14e", delimiter=" ", newline="\n")
        np.savetxt(f_dist_S, Dist_S, fmt="%.14e", delimiter=" ", newline="\n")
        # Call the IBloFunMatch C++ program with a Rips complex of dimension 2 (used here for dimension 1 PH)
        ! {EXECUTABLE_PATH + " " + f_dist_S + " " + f_dist_X + " " + f_ind_sampl + " -d 2"}
        # Read barcodes, representatives and matrices from the output files
        for attribute_name, typename in zip(attributes, types_list):
            print(f"attribute:{attribute_name}, type:{typename}")
            print(output_dir + "\\" + attribute_name + ".out")
            data_read = []
            with open(output_dir + "\\" + attribute_name + ".out") as file:
                for line in file:
                    data_line = line.split(" ")
                    if typename=="int":
                        # Drop the last token (the line ending after the trailing separator)
                        data_line=data_line[:-1]
                    data_read.append(list(np.array(data_line).astype(typename)))
            # Barcodes are stored as float arrays; the other attributes stay as lists of rows
            if typename=="float":
                output_data[attribute_name] = np.array(data_read)
            else:
                output_data[attribute_name] = data_read
        IBloFunMatch_output.append(output_data)
        class_sub_idx[idx_class, idx_subset]=counter
        # Stop after the third run (counter takes values 0, 1, 2)
        if counter==2:
            break
        counter+=1
    # Only class 0 is processed: leave the class loop after its subsets
    break

sample_indices (5): 23, 49, 0, 28, 11, attribute:X_barcode, type:float
output\X_barcode.out

Correctly checked inequality on dist_S and dist_X
Sample indices (sorted)
0, 11, 23, 28, 49, 
Welcome to PerMoVEC!
The subcomplex contains 13 simplices  after collapse. 
   and has dimension 2 
The subcomplex contains 468 simplices  after collapse. 
   and has dimension 2 
Cycle columns image 1: 
Checking zero columns:

1 PM_matrix:

PM_matrix end
Ready to compute block functions
Filling red_pm_matrix
Filled, printing: 
3 : 
Now going to reduce
Reduced
attribute:S_barcode, type:float
output\S_barcode.out
attribute:X_reps, type:int
output\X_reps.out
attribute:S_reps, type:int
output\S_reps.out
attribute:S_reps_im, type:int
output\S_reps_im.out
attribute:pm_matrix, type:int
output\pm_matrix.out
attribute:induced_matching, type:int
output\induced_matching.out
sample_indices (5): 33, 25, 29, 43, 10, attribute:X_barcode, type:float
output\X_barcode.out

Correctly checked inequality on dist_S and dis
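
The index lookup in the cell above relies on `np.argmax` over a boolean mask, which returns 0 when no row of X matches, so a point of S that is missing from X would be mapped silently to index 0. A minimal sketch of a stricter lookup follows; `find_indices` is a hypothetical helper, not part of the notebook or of IBloFunMatch, and the toy arrays are made up for illustration.

```python
import numpy as np

def find_indices(S, X, tol=1e-10):
    """Hypothetical helper: index in X of each row of S; raises if a row is missing."""
    S = np.atleast_2d(np.asarray(S, dtype=float))
    X = np.atleast_2d(np.asarray(X, dtype=float))
    indices = []
    for pt in S:
        # Same L1 criterion as in the cell above
        matches = np.flatnonzero(np.sum(np.abs(X - pt), axis=1) < tol)
        if matches.size == 0:
            raise ValueError(f"point {pt} not found in X")
        indices.append(int(matches[0]))
    return indices

# Toy data (assumption): three points in the plane and a subset of two of them
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
S_toy = np.array([[0.0, 1.0], [1.0, 0.0]])
ids_toy = find_indices(S_toy, X_toy)
print(ids_toy)

# np.argmax on an all-False mask returns 0, so a missing point goes unnoticed
missing = np.array([5.0, 5.0])
silent_idx = int(np.argmax(np.sum(np.abs(X_toy - missing), axis=1) < 1e-10))
print(silent_idx)
```

Raising an error costs nothing here and guards against subsets that were saved with a different precision than the dataset.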

In [19]:
# Entries of class_sub_idx left at -1 (class 1 was not run) index the last stored output
for idx_class in range(NUM_class):
    for idx_subset in range(NUM_subset):
        idx_run = class_sub_idx[idx_class][idx_subset]
        print(idx_run)
        print(IBloFunMatch_output[idx_run]["induced_matching"])
        print(IBloFunMatch_output[idx_run]["pm_matrix"])

0
[[]]
[[]]
1
[]
[]
2
[[]]
[[2]]
-1
[[]]
[[2]]
-1
[[]]
[[2]]
-1
[[]]
[[2]]
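
In the output above, the three blocks labelled -1 repeat the result of run 2, because a Python list indexed with -1 returns its last element. A minimal sketch of how the unprocessed (class, subset) pairs could be skipped instead; the names `class_sub_idx_toy` and `outputs_toy` are stand-ins with made-up contents, not the real results.

```python
import numpy as np

# Toy stand-ins for class_sub_idx and IBloFunMatch_output (assumed contents)
class_sub_idx_toy = np.array([[0, 1, 2], [-1, -1, -1]])
outputs_toy = [{"pm_matrix": "a"}, {"pm_matrix": "b"}, {"pm_matrix": "c"}]

visited = []
for idx_class, idx_subset in np.ndindex(class_sub_idx_toy.shape):
    k = int(class_sub_idx_toy[idx_class, idx_subset])
    if k < 0:
        # -1 marks a pair that was never processed; outputs_toy[-1] would be the last entry
        continue
    visited.append((idx_class, idx_subset, outputs_toy[k]["pm_matrix"]))

print(visited)
```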
