# Topological Exploratory Data Analysis

Mathieu Carrière, https://mathieucarriere.github.io/website/

In [None]:
from gudhi import CoverComplex
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
%matplotlib notebook

In this notebook, we will see how to use Gudhi to perform topological dimension reduction: we will compute simplicial complex approximations of point clouds and distance matrices. These complexes will be either [Mapper complexes](https://diglib.eg.org/handle/10.2312/SPBG.SPBG07.091-100) or [Graph Induced complexes](http://web.cse.ohio-state.edu/~dey.8/GIC/gic.html). Both constructions start from a cover of the initial space (such as a Voronoi partition or the preimages of a filter function) and turn it into a simplicial complex, either by taking the nerve of the cover (Mapper) or by checking for the presence of colored cliques (Graph Induced).
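
To make the nerve construction concrete, here is a minimal sketch in plain Python that is independent of Gudhi. It builds only the 1-skeleton of the nerve: one node per cover element and one edge per pair of elements that share a point. The helper `nerve_graph` and the toy cover are hypothetical names and data introduced for illustration.

```python
from itertools import combinations


def nerve_graph(cover):
    """Return the 1-skeleton of the nerve of a cover.

    `cover` is a list of sets of point indices. There is one node per
    cover element and one edge per pair of elements that intersect.
    """
    nodes = list(range(len(cover)))
    edges = [(i, j) for i, j in combinations(nodes, 2)
             if set(cover[i]) & set(cover[j])]
    return nodes, edges


# Toy cover of the points 0..5 by three overlapping sets (illustrative data).
toy_cover = [{0, 1, 2}, {2, 3}, {3, 4, 5}]
nodes, edges = nerve_graph(toy_cover)
print(edges)  # [(0, 1), (1, 2)]
```

The first and last sets do not intersect, so the nerve is a path on three nodes.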

Gudhi can handle both point clouds and distance matrices. Let's start with a point cloud.

# Point cloud

Load an example point cloud.

In [None]:
X = np.loadtxt('/home/mcarrier/Github/gudhi/data/points/human.txt')

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[::2,1], X[::2,0], X[::2,2], s=1)
limits = np.array([ax.get_xlim3d(), ax.get_ylim3d(), ax.get_zlim3d()])
origin = np.mean(limits, axis=1)
radius = 0.5 * np.max(np.abs(limits[:, 1] - limits[:, 0]))
x, y, z = origin
ax.set_xlim3d([x - radius, x + radius])
ax.set_ylim3d([y - radius, y + radius])
ax.set_zlim3d([z - radius, z + radius])
plt.show()

We will use the height function to color the complex nodes.

In [None]:
height = X[:,2]

We next provide different configurations for computing cover complexes:

Graph Induced complex built from a Voronoi partition with 100 randomly sampled germs and a Rips graph whose threshold is chosen automatically.

In [None]:
cover_complex = CoverComplex(
    complex_type='gic', input_type='point cloud', cover='voronoi', colors=height, mask=0,
    graph="rips", rips_threshold=None, N=100, beta=0., C=10,
    voronoi_samples=100, 
    input_name="human", cover_name="coord2", color_name="coord2", verbose=True)
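
As a rough illustration of what a Voronoi partition with random germs means, the sketch below samples a few germs among the points and assigns every point to its closest germ. It is not Gudhi's implementation: the function `voronoi_partition`, its parameters, and the use of plain Euclidean distance are assumptions made for this example, and Gudhi may measure distances differently.

```python
import math
import random


def voronoi_partition(points, n_germs, seed=0):
    """Assign each point to the closest of `n_germs` randomly sampled germs.

    Returns the indices of the germs and, for each point, the position
    (in `germs`) of its closest germ. Euclidean distance is an assumption.
    """
    rng = random.Random(seed)
    germs = rng.sample(range(len(points)), n_germs)
    labels = []
    for p in points:
        dists = [math.dist(p, points[g]) for g in germs]
        labels.append(dists.index(min(dists)))
    return germs, labels


# Toy 2D point cloud (illustrative data).
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
germs, labels = voronoi_partition(points, n_germs=2)
print(germs, labels)
```

Each point receives exactly one label, so the cells form a partition of the point cloud.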

Graph Induced complex built from a preimage partition with automatic resolution and a Rips graph whose threshold is chosen automatically.

In [None]:
cover_complex = CoverComplex(
    complex_type='gic', input_type='point cloud', cover='functional', colors=height, mask=0,
    graph="rips", rips_threshold=None, N=100, beta=0., C=10,
    filters=height, resolutions=None, gains=0.,
    input_name="human", cover_name="coord2", color_name="coord2", verbose=True)
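
When every point belongs to exactly one cover element, the edges of a Graph Induced complex can be sketched as follows: two cover elements are connected whenever some edge of the underlying graph joins a point of one to a point of the other. The helper `gic_edges` and the toy graph are hypothetical and only cover the 1-skeleton; higher-dimensional simplices would require looking at larger cliques.

```python
def gic_edges(graph_edges, labels):
    """Edges of the Graph Induced complex for a partition.

    `graph_edges` lists the edges (u, v) of the neighborhood graph and
    `labels[u]` is the cover element (color) of point u. Two colors are
    joined when a graph edge has one endpoint of each color.
    """
    edges = set()
    for u, v in graph_edges:
        cu, cv = labels[u], labels[v]
        if cu != cv:
            edges.add((min(cu, cv), max(cu, cv)))
    return sorted(edges)


# Toy path graph on five points, split into three cells (illustrative data).
graph_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
labels = [0, 0, 1, 1, 2]
print(gic_edges(graph_edges, labels))  # [(0, 1), (1, 2)]
```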

Mapper complex built from a preimage cover with automatic resolution and a hierarchical clustering whose threshold is chosen automatically.

In [None]:
cover_complex = CoverComplex(
    complex_type='mapper', input_type='point cloud', cover='functional', colors=height[:,np.newaxis], mask=0,
    clustering=None, N=100, beta=0., C=10,
    filters=height[:,np.newaxis], filter_bnds=None, resolutions=None, gains=None,
    input_name="human", cover_name="coord2", color_name="coord2", verbose=True)
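
The preimage (functional) cover can be pictured as follows: the range of the filter is covered by overlapping intervals, and each cover element gathers the points whose filter value falls in one interval. The sketch below assumes that the resolution is the number of intervals and the gain is the fraction of overlap between consecutive intervals. The function `interval_cover` is hypothetical, and the exact way Gudhi combines these two parameters may differ.

```python
def interval_cover(values, resolution, gain):
    """Cover point indices by preimages of overlapping intervals.

    The range of `values` is covered by `resolution` intervals of equal
    length, consecutive intervals overlapping by a fraction `gain` of
    their length (an assumed convention).
    """
    lo, hi = min(values), max(values)
    length = (hi - lo) / (resolution - (resolution - 1) * gain)
    step = length * (1.0 - gain)
    cover = []
    for k in range(resolution):
        start = lo + k * step
        # Clamp the last interval to the maximum to avoid rounding issues.
        end = hi if k == resolution - 1 else start + length
        cover.append([i for i, v in enumerate(values) if start <= v <= end])
    return cover


# Toy filter values 0, 1, ..., 10 (illustrative data).
values = [float(v) for v in range(11)]
cover = interval_cover(values, resolution=2, gain=0.2)
print(cover)  # [[0, 1, 2, 3, 4, 5], [5, 6, 7, 8, 9, 10]]
```

The point with filter value 5 lies in both preimages, which is what creates an edge once the nerve is taken.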

Mapper complex built from the preimage cover of a 2D function, with a resolution of 20 and 2 intervals along the two filter coordinates, and a hierarchical clustering whose threshold is chosen automatically.

In [None]:
filt2d = np.hstack([height[:,np.newaxis],X[:,0:1]])

In [None]:
cover_complex = CoverComplex(
    complex_type='mapper', input_type='point cloud', cover='functional', colors=filt2d, mask=0,
    clustering=None, N=100, beta=0., C=10,
    filters=filt2d, filter_bnds=None, resolutions=[20,2], gains=None,
    input_name="human", cover_name="coord2", color_name="coord2", verbose=True)

# Distance matrix

We can actually process the dataset using only the pairwise distances between points.

In [None]:
from sklearn.metrics import pairwise_distances

In [None]:
X = pairwise_distances(X)

In [None]:
plt.figure()
plt.imshow(X)
plt.show()

This time, the color is given by the eccentricity, i.e. the largest distance from each point to any other point.

In [None]:
ecc = X.max(axis=0)
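
For intuition, the same computation on a tiny distance matrix in plain Python looks like this. The helper `eccentricity` and the toy matrix are illustrative only. Since a distance matrix is symmetric, taking the maximum over rows or over columns gives the same values.

```python
def eccentricity(dist_matrix):
    """Largest distance from each point to any other point."""
    return [max(row) for row in dist_matrix]


# Distances between three points on a line at positions 0, 1 and 2.
D = [[0.0, 1.0, 2.0],
     [1.0, 0.0, 1.0],
     [2.0, 1.0, 0.0]]
ecc_toy = eccentricity(D)
print(ecc_toy)  # [2.0, 1.0, 2.0]
```

The middle point has the smallest eccentricity, while the two endpoints have the largest.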

In [None]:
cover_complex = CoverComplex(
    complex_type='gic', input_type='distance matrix', cover='functional', colors=ecc, mask=0,
    graph="rips", rips_threshold=None, N=100, beta=0., C=10,
    filters=ecc, resolutions=None, gains=0.,
    input_name="human", cover_name="coord2", color_name="coord2", verbose=True)

In [None]:
cover_complex = CoverComplex(
    complex_type='gic', input_type='distance matrix', cover='voronoi', colors=ecc, mask=0,
    graph="rips", rips_threshold=None, N=100, beta=0., C=10,
    voronoi_samples=100, 
    input_name="human", cover_name="coord2", color_name="coord2", verbose=True)

In [None]:
cover_complex = CoverComplex(
    complex_type='mapper', input_type='distance matrix', cover='functional', colors=ecc[:,np.newaxis], mask=0,
    clustering=None, N=100, beta=0., C=10,
    filters=ecc[:,np.newaxis], filter_bnds=None, resolutions=None, gains=None,
    input_name="human", cover_name="coord2", color_name="coord2", verbose=True)

# Complex computation

The cover complex can now be computed in a single line of code! Note that `cover_complex` refers to whichever configuration was defined last.

In [None]:
_ = cover_complex.fit(X)

# Visualization

You can visualize the complex in three different ways with Gudhi.

1. You can use the Python package `networkx`.

In [None]:
import networkx as nx

In [None]:
G = cover_complex.get_networkx()

In [None]:
plt.figure()
nx.draw(G, pos=nx.kamada_kawai_layout(G), node_color=[cover_complex.node_info[v]["colors"][0] for v in G.nodes()])
plt.show()

2. You can create a DOT file that can be processed later with `neato` to produce a PDF.

In [None]:
cover_complex.print_to_dot()

In [None]:
!neato -Tpdf human.dot -o human.pdf 

3. You can create a TXT file that can be processed later with our KeplerMapper wrapper to produce an HTML file that you can view in a browser.

In [None]:
cover_complex.print_to_txt()

In [None]:
!python /home/mcarrier/Github/gudhi/src/Nerve_GIC/utilities/KeplerMapperVisuFromTxtFile.py -f human.txt

# Topological features

Various kinds of postprocessing can be applied to a cover complex. For instance, one can compute its topological features. For our human shape, the topological features (identified by computing the persistence of the color function on the complex) are the three branches corresponding to the arms and legs (the lower leg corresponds to the whole connected component).

In [None]:
dgm, bnd = cover_complex.compute_topological_features(threshold=0.)

In [None]:
G = cover_complex.get_networkx()
plt.figure(figsize=(8,2))
for idx, bd in enumerate(bnd):
    plt.subplot(1,len(bnd),idx+1)
    nx.draw(G, pos=nx.kamada_kawai_layout(G), 
            node_color=[1 if node in bd else 0 for node in G.nodes()], node_size=5)
plt.show()

You can also identify the robust topological features by bootstrapping, and select those associated with a 95% confidence level.

In [None]:
cover_complex.bootstrap_topological_features(100)

In [None]:
dist = cover_complex.get_distance_from_confidence_level(.95)

In [None]:
bnd_boot = [b for idx, b in enumerate(bnd) if np.abs(.5 * (dgm[idx][1][1]-dgm[idx][1][0])) >= dist]

In [None]:
G = cover_complex.get_networkx()
plt.figure(figsize=(8,2))
for idx, bd in enumerate(bnd_boot):
    plt.subplot(1,len(bnd_boot),idx+1)
    nx.draw(G, pos=nx.kamada_kawai_layout(G), 
            node_color=[1 if node in bd else 0 for node in G.nodes()], node_size=5)
plt.show()

Finally, one can identify the coordinates that best explain a topological feature versus the rest of the complex with a Kolmogorov-Smirnov test. In particular, for each topological feature, we can rank the coordinates by their p-values.

For instance, coordinate 2 (height) is the one that best distinguishes the leg from the rest.

In [None]:
cover_complex.compute_differential_coordinates(nodes=bnd[1])

On the other hand, coordinate 0 best explains both arms.

In [None]:
cover_complex.compute_differential_coordinates(nodes=bnd[2])

In [None]:
cover_complex.compute_differential_coordinates(nodes=bnd[3])
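
The idea behind this ranking can be sketched in plain Python, independently of Gudhi. For each coordinate, the two-sample Kolmogorov-Smirnov statistic compares the values taken on the feature with the values taken on the rest of the data. The functions `ks_statistic` and `rank_coordinates` are hypothetical helpers, and the sketch ranks by the statistic itself rather than by p-value; for fixed sample sizes a larger statistic corresponds to a smaller p-value, so the order is the same.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    This is the largest gap between the two empirical cumulative
    distribution functions, evaluated at every observed value.
    """
    gap = 0.0
    for x in list(sample_a) + list(sample_b):
        cdf_a = sum(v <= x for v in sample_a) / len(sample_a)
        cdf_b = sum(v <= x for v in sample_b) / len(sample_b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def rank_coordinates(feature_points, other_points):
    """Rank coordinates from most to least discriminating."""
    n_coords = len(feature_points[0])
    stats = []
    for c in range(n_coords):
        stat = ks_statistic([p[c] for p in feature_points],
                            [p[c] for p in other_points])
        stats.append((stat, c))
    return [c for stat, c in sorted(stats, reverse=True)]


# Toy data: the two groups differ mostly along coordinate 1.
feature = [(0.1, 5.0), (0.4, 5.5), (0.7, 6.0)]
others = [(0.2, 1.0), (0.5, 1.5), (0.8, 2.0)]
ranking = rank_coordinates(feature, others)
print(ranking)  # [1, 0]
```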