# Comparing a dataset to its UMAP reduction #

This code was used for an example application of the normalized bottleneck distance in the paper below.

The paper can be found here: https://link.springer.com/article/10.1007/s44007-024-00130-0

The arXiv link: https://arxiv.org/abs/2306.06727

We use UMAP to reduce the dimensionality of the scikit-learn digits dataset. The normalized bottleneck distance, computed after rescaling each dataset by its diameter, gives a smaller distance between the original and reduced datasets than the ordinary bottleneck distance.
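
The comparison carried out in the final cell can be written as a formula. This is a sketch of what the code computes, not a quotation from the paper; the symbols $d_N$, $d_B$, $\mathrm{Dgm}_1$ and $\operatorname{diam}$ are notation chosen here for illustration.

$$
d_N(X, Y) = d_B\left(\mathrm{Dgm}_1\left(\frac{X}{\operatorname{diam}(X)}\right),\ \mathrm{Dgm}_1\left(\frac{Y}{\operatorname{diam}(Y)}\right)\right),
\qquad
\operatorname{diam}(X) = \max_{x, x' \in X} \lVert x - x' \rVert_2
$$

Here $d_B$ is the ordinary bottleneck distance and $\mathrm{Dgm}_1$ is the degree-1 Vietoris-Rips persistence diagram. After rescaling, each dataset has diameter $1$, so the two diagrams are compared on a common scale.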

## Import Libraries ##
Run this cell first.

In [3]:
import numpy as np
import persim
import tadasets
import ripser
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import seaborn as sns
import pandas as pd
%matplotlib inline
import umap

def diameter(A):
    '''Find the diameter of a dataset A.

    inputs:
        A: A dataset stored as a NumPy array, one point per row
    outputs:
        d: The diameter of the dataset (largest pairwise Euclidean distance)'''
    
    D = pdist(A)
    D = squareform(D)
    d = np.nanmax(D)
    
    return d
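
As a quick sanity check of what `diameter` measures, the sketch below recomputes the largest pairwise Euclidean distance on a small point set with a known answer. The helper `diameter_np` and the `triangle` array are hypothetical names introduced here; the helper is a NumPy-only stand-in for the SciPy-based function above, not the notebook's own code.

```python
import numpy as np

def diameter_np(A):
    """Largest pairwise Euclidean distance (hypothetical NumPy-only stand-in)."""
    # Broadcast to all pairwise coordinate differences, shape (n, n, dim)
    diffs = A[:, None, :] - A[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1)).max()

# A 3-4-5 right triangle: the longest side is the hypotenuse
triangle = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
d = diameter_np(triangle)
print(d)  # 5.0

# Dividing the points by the diameter rescales the set to diameter 1
rescaled = triangle / d
print(np.isclose(diameter_np(rescaled), 1.0))
```

Dividing by the diameter is the same rescaling that the normalized bottleneck computation applies to each dataset later in the notebook.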


## Digits Data ##
Load the dataset and reduce it to $2$ dimensions using UMAP.

In [4]:
#Load the digits
digits = load_digits()
dig_data = digits.data

print(digits)

# Reduce the data to 2D
reducer2 = umap.UMAP(random_state=42)
reducer2.fit(digits.data)

embedding2 = reducer2.transform(digits.data)
# Verify that the result of calling transform is
# identical to accessing the embedding_ attribute
assert np.all(embedding2 == reducer2.embedding_)

{'data': array([[ 0.,  0.,  5., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ..., 10.,  0.,  0.],
       [ 0.,  0.,  0., ..., 16.,  9.,  0.],
       ...,
       [ 0.,  0.,  1., ...,  6.,  0.,  0.],
       [ 0.,  0.,  2., ..., 12.,  0.,  0.],
       [ 0.,  0., 10., ..., 12.,  1.,  0.]]), 'target': array([0, 1, 2, ..., 8, 9, 8]), 'frame': None, 'feature_names': ['pixel_0_0', 'pixel_0_1', 'pixel_0_2', 'pixel_0_3', 'pixel_0_4', 'pixel_0_5', 'pixel_0_6', 'pixel_0_7', 'pixel_1_0', 'pixel_1_1', 'pixel_1_2', 'pixel_1_3', 'pixel_1_4', 'pixel_1_5', 'pixel_1_6', 'pixel_1_7', 'pixel_2_0', 'pixel_2_1', 'pixel_2_2', 'pixel_2_3', 'pixel_2_4', 'pixel_2_5', 'pixel_2_6', 'pixel_2_7', 'pixel_3_0', 'pixel_3_1', 'pixel_3_2', 'pixel_3_3', 'pixel_3_4', 'pixel_3_5', 'pixel_3_6', 'pixel_3_7', 'pixel_4_0', 'pixel_4_1', 'pixel_4_2', 'pixel_4_3', 'pixel_4_4', 'pixel_4_5', 'pixel_4_6', 'pixel_4_7', 'pixel_5_0', 'pixel_5_1', 'pixel_5_2', 'pixel_5_3', 'pixel_5_4', 'pixel_5_5', 'pixel_5_6', 'pixel_5_7', 'pixel_6_0', '

  warn(f"n_jobs value {self.n_jobs} overridden to 1 by setting random_state. Use no seed for parallelism.")


## Compare the Distances ##

Notice that the normalized bottleneck distance (about 0.056) is much smaller than the ordinary bottleneck distance (about 4.34).

In [5]:
print("2 dimensional reduction")
# Find the bottleneck distance between the reduced dataset and the original dataset.
# ['dgms'][1] selects the degree-1 persistence diagram returned by ripser.
dgm_dig = ripser.ripser(dig_data)['dgms'][1]
dgm_reduced2 = ripser.ripser(embedding2)['dgms'][1]
distance_bottleneck2 = persim.bottleneck(dgm_dig, dgm_reduced2)
print(f"The ordinary bottleneck distance between the sets is: {distance_bottleneck2}")

# Compute the normalized bottleneck distance:
# rescale each dataset by its diameter, then compare the diagrams as before
data_dig_n = dig_data / diameter(dig_data)
data_embedding2_n = embedding2 / diameter(embedding2)
dgm_dig_n = ripser.ripser(data_dig_n)['dgms'][1]
dgm_embedding2_n = ripser.ripser(data_embedding2_n)['dgms'][1]
distance_bottleneck_n2 = persim.bottleneck(dgm_dig_n, dgm_embedding2_n)
print(f"The normalized bottleneck distance between the sets is: {distance_bottleneck_n2}")

2 dimensional reduction
The ordinary bottleneck distance between the sets is: 4.340822219848633
The normalized bottleneck distance between the sets is: 0.056345805525779724
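
One way to see why normalization matters is to isolate the effect of scale. A Vietoris-Rips diagram depends only on pairwise distances, so two point clouds that differ by a uniform rescaling have different raw distance matrices but identical ones after dividing by the diameter. The sketch below illustrates this with synthetic data and NumPy only. The names `pairwise_distances`, `X`, and `Y` and the scale factor 5 are assumptions made for illustration, and UMAP itself is not a pure rescaling, which is why the normalized distance reported above is small but not zero.

```python
import numpy as np

def pairwise_distances(A):
    """Full Euclidean distance matrix (hypothetical NumPy-only helper)."""
    diffs = A[:, None, :] - A[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))   # synthetic point cloud
Y = 5.0 * X                    # same shape, five times the scale

DX, DY = pairwise_distances(X), pairwise_distances(Y)

# Without normalization, the distance matrices differ by the scale factor
raw_gap = np.abs(DX - DY).max()

# Dividing each matrix by its diameter (its largest entry) removes the scale
norm_gap = np.abs(DX / DX.max() - DY / DY.max()).max()

print(f"largest raw gap:        {raw_gap:.3f}")
print(f"largest normalized gap: {norm_gap:.2e}")
```

The raw gap is on the order of the data's diameter, while the normalized gap is at the level of floating-point round-off, so any distance between the two Rips diagrams comes from scale alone.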
