# 3D shape segmentation using TDA

Our goal in this notebook is to classify the points of 3D shapes, a task often referred to as 3D shape segmentation. Intuitively, we want to assign a label to each point of a 3D shape. For instance, if the shapes represent airplanes, the labels would be "left wing", "right wing", "tail" and so on.

The idea is to use TDA to represent each point of each shape with a persistence diagram, obtained from the sublevel sets of the geodesic distance to that point. More formally, if $x$ is a point on the 3D shape $S$, let $D(x) = \text{Dgm}(f_x)$, where $f_x:y\in S\mapsto d_S(x,y)$ and $d_S(\cdot,\cdot)$ is the geodesic distance on $S$. Then $D(x)$ can be used as a powerful descriptor for $x$ which enjoys many desirable properties, such as stability and invariance to rigid transformations of the shape. See [this article](https://diglib.eg.org/handle/10.1111/cgf12692) for more details.
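To make the function $f_x$ concrete, here is a minimal sketch that approximates geodesic distances by shortest-path lengths on a small graph. The toy graph and the helper name `geodesic_function` are assumptions made for illustration; the diagrams used in this notebook were precomputed with separate code, and this sketch does not compute persistence diagrams.

```python
import heapq

def geodesic_function(adjacency, source):
    """Shortest-path distances from `source` to every vertex (Dijkstra)."""
    dist = {v: float("inf") for v in adjacency}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry, a shorter path was already found
        for v, w in adjacency[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist

# Hypothetical toy "shape": a square 0-1-2-3 with unit-length edges.
adjacency = {
    0: [(1, 1.0), (3, 1.0)],
    1: [(0, 1.0), (2, 1.0)],
    2: [(1, 1.0), (3, 1.0)],
    3: [(2, 1.0), (0, 1.0)],
}

# f_x for x = vertex 0: one value per vertex of the shape.
f_x = geodesic_function(adjacency, source=0)
print(f_x)
```

The values of `f_x` are what the sublevel-set filtration is built on: vertices enter the filtration in order of increasing distance to $x$.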

Let's go!

Note that this notebook requires [h5py](https://www.h5py.org/) (for reading the dataset of persistence diagrams), [pandas](https://pandas.pydata.org/) (for reading the labels), [sklearn_tda](https://github.com/MathieuCarriere/sklearn_tda) (for handling the persistence diagrams) and [scikit-learn](http://scikit-learn.org/stable/index.html) (for performing the final classification). It also makes use of numpy.

I/O functions.

In [None]:
import numpy as np

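def diag_to_array(data):
    """Turn a group laid out as data[dim][index] into one list of diagram arrays per dimension."""
    # The number of diagrams is read from homological dimension "0"
    dataset, num_diag = [], len(data["0"].keys())
    for dim in data.keys():
        X = []
        for diag in range(num_diag):
            # Each diagram is stored under its index, used as a string key
            pers_diag = np.array(data[dim][str(diag)])
            X.append(pers_diag)
        dataset.append(X)
    return dataset

def diag_to_dict(D):
    """Flatten D[f][dim][index] into a dict with keys of the form "<dim>_<f>" (e.g. "1_geodesic")."""
    X = dict()
    for f in D.keys():
        df = diag_to_array(D[f])
        for dim in range(len(df)):
            X[str(dim) + "_" + f] = df[dim]
    return X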

Read the dataset. "train_diag.hdf5" is a file containing the persistence diagrams. They were computed using [this code](https://github.com/MathieuCarriere/local-persistence-with-UF) on 3D shapes representing airplanes. The 3D shapes and their corresponding point labels ("train.csv") were retrieved from [this dataset](http://segeval.cs.princeton.edu/).

In [None]:
import pandas as pd
import h5py

train_lab  = pd.read_csv("train.csv")
train_diag = diag_to_dict(h5py.File("train_diag.hdf5", "r"))

We now split the dataset into train and test sets. The size of the test set is a fraction of the dataset size, which you can set by changing the `test_size` variable. The test set is then obtained by randomly picking points in the dataset.

In [None]:
from sklearn.preprocessing import LabelEncoder

# Size of test set
test_size = 0.4

# Shuffle dataset and pick points for test set
train_num_pts        = train_lab.shape[0]    
perm                 = np.random.permutation(train_num_pts)
limit                = int(test_size * train_num_pts)
test_sub, train_sub  = perm[:limit], perm[limit:]

# Create train and test labels with LabelEncoder from scikit-learn
train_full_labels  = train_lab["part"]
le                 = LabelEncoder()
train_labels       = np.array(le.fit_transform(train_full_labels[train_sub]))
test_labels        = np.array(le.transform(train_full_labels[test_sub]))

# Create train and test sets of persistence diagrams
train_full_diag    = train_diag["1_geodesic"]
train_diag         = [train_full_diag[i] for i in train_sub]
test_diag          = [train_full_diag[i] for i in test_sub]

# Print sizes
train_num_pts, test_num_pts = len(train_sub), len(test_sub)
print("Number of train points = " + str(train_num_pts))
print("Number of test  points = " + str(test_num_pts))
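Since the split above is random, the reported accuracies change from run to run. If you want a reproducible split, one option is to draw the permutation from a seeded generator. The sketch below uses a hypothetical dataset of 10 points and an arbitrary seed of 0; neither comes from the original experiment.

```python
import numpy as np

# Seeded generator: the same seed always gives the same permutation.
rng = np.random.default_rng(0)

num_pts, test_size = 10, 0.4          # hypothetical toy values
perm = rng.permutation(num_pts)
limit = int(test_size * num_pts)      # number of test points
test_sub, train_sub = perm[:limit], perm[limit:]

print(len(train_sub), len(test_sub))
```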

Here we create a scikit-learn pipeline for processing the diagrams. The pipeline will:
1. extract the points of the persistence diagrams with finite coordinates (i.e. the non-essential points)
2. optionally rotate the diagrams (rotation is useful for persistence images)
3. handle diagrams with vectorization or kernel methods using the sklearn_tda package
4. train a classifier from the scikit-learn package
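To give an idea of what the first two steps do to a single diagram, here is a minimal NumPy sketch. It assumes that the rotation maps each point (birth, death) to (birth, death - birth), which is what the name `BirthPersistenceTransform` suggests. It is an illustration on a toy diagram, not the sklearn_tda implementation.

```python
import numpy as np

# Toy diagram: rows are (birth, death); the last point is essential (death = inf).
diagram = np.array([[0.0, 1.0],
                    [0.5, 2.0],
                    [0.2, np.inf]])

# Step 1: keep only the points whose coordinates are all finite.
finite = diagram[np.isfinite(diagram).all(axis=1)]

# Step 2 (assumed form of the rotation): (birth, death) -> (birth, persistence).
rotated = np.column_stack([finite[:, 0], finite[:, 1] - finite[:, 0]])

print(rotated)
```

After the rotation, the second coordinate is the persistence of each point, so all points lie in the upper half-plane instead of above the diagonal.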

In [None]:
import sklearn_tda as tda
from sklearn.pipeline        import Pipeline
from sklearn.svm             import SVC
from sklearn.ensemble        import RandomForestClassifier
from sklearn.neighbors       import KNeighborsClassifier

# Definition of pipeline
pipe = Pipeline([("Separator", tda.DiagramSelector(limit=np.inf, point_type="finite")),
                 ("Rotator",   tda.DiagramPreprocessor(scalers=[([0,1], tda.BirthPersistenceTransform())])),
                 ("TDA",       tda.PersistenceImage()),
                 ("Estimator", SVC())])

# Parameters of pipeline. This is the place where you specify the methods you want to use to handle diagrams
param =    [{"Rotator__use":        [False],
             "TDA":                 [tda.SlicedWassersteinKernel()], 
             "TDA__bandwidth":      [0.1, 1.0],
             "TDA__num_directions": [20],
             "Estimator":           [SVC(kernel="precomputed")]},
            
            {"Rotator__use":        [False],
             "TDA":                 [tda.PersistenceWeightedGaussianKernel()], 
             "TDA__bandwidth":      [0.1, 1.0],
             "TDA__weight":         [lambda x: np.arctan(x[1]-x[0])], 
             "Estimator":           [SVC(kernel="precomputed")]},
            
            {"Rotator__use":        [True],
             "TDA":                 [tda.PersistenceImage()], 
             "TDA__resolution":     [ [5,5], [6,6] ],
             "TDA__bandwidth":      [0.01, 0.1, 1.0, 10.0],
             "Estimator":           [SVC()]},
            
            {"Rotator__use":        [False],
             "TDA":                 [tda.Landscape()], 
             "TDA__resolution":     [100],
             "Estimator":           [RandomForestClassifier()]},
           
            {"Rotator__use":        [False],
             "TDA":                 [tda.WassersteinDistance()],
             "TDA__wasserstein":    [1],
             "TDA__delta":          [0.1], 
             "Estimator":           [KNeighborsClassifier(metric="precomputed")]}
           ]
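Several entries of the grid pair a kernel on diagrams with `SVC(kernel="precomputed")`. With this option, the classifier is fitted on a Gram matrix between train samples and predicts from a matrix of kernel values between test and train samples. The sketch below illustrates this mechanism with scikit-learn only: an RBF kernel on toy vectors stands in for a kernel on persistence diagrams, and the data and `gamma` value are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

# Toy vector data standing in for persistence diagrams.
X, y = make_classification(n_samples=60, n_features=5, random_state=0)
X_train, y_train, X_test = X[:40], y[:40], X[40:]

# Gram matrix between train samples, and kernel values between test and train samples.
K_train = rbf_kernel(X_train, X_train, gamma=0.1)   # shape (40, 40)
K_test  = rbf_kernel(X_test,  X_train, gamma=0.1)   # shape (20, 40)

clf  = SVC(kernel="precomputed").fit(K_train, y_train)
pred = clf.predict(K_test)
print(pred.shape)
```

The same convention applies to `KNeighborsClassifier(metric="precomputed")`, except that the matrices contain distances instead of kernel values.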

Our final model is the best estimator found by a grid search over our pipeline, using 2-fold cross-validation.

In [None]:
from sklearn.model_selection import GridSearchCV

model = GridSearchCV(pipe, param, cv=2)
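If the way the parameter list replaces whole steps of the pipeline is unfamiliar, the following toy sketch shows the same pattern with scikit-learn objects only. The step names, data and parameter values are made up for illustration. Each dictionary in the list is its own grid, and a key equal to a step name swaps the step itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=80, n_features=6, random_state=0)

# Default steps; each grid below may replace them.
toy_pipe = Pipeline([("Scaler", StandardScaler()), ("Estimator", SVC())])

toy_param = [
    # Grid 1: standard scaling + SVM, two values of C
    {"Scaler":       [StandardScaler()],
     "Estimator":    [SVC()],
     "Estimator__C": [0.1, 1.0]},
    # Grid 2: min-max scaling + random forest, two forest sizes
    {"Scaler":                  [MinMaxScaler()],
     "Estimator":               [RandomForestClassifier(random_state=0)],
     "Estimator__n_estimators": [10, 50]},
]

toy_model = GridSearchCV(toy_pipe, toy_param, cv=2).fit(X, y)
num_candidates = len(toy_model.cv_results_["params"])
print(num_candidates)
```

The grids are searched independently, so the number of candidates is the sum of the sizes of the grids, not their product.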

Now it is time to train the model. Since we perform cross-validation, the computation can take quite long, especially when using k-NN with Wasserstein distances, which are expensive to compute. You may consider grabbing a cup of coffee at this point.

In [None]:
model = model.fit(train_diag, train_labels)

Training is finally over! Let us check which method for handling persistence diagrams works best for this classification problem.

In [None]:
print(model.best_params_)
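Beyond the single best setting, it can be useful to compare all the candidates. A fitted `GridSearchCV` stores the scores in its `cv_results_` attribute, which loads directly into a pandas table. The sketch below does this on a toy search, with data and parameter values chosen arbitrarily; the same lines should apply to the `model` fitted above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=80, n_features=6, random_state=0)
toy_search = GridSearchCV(SVC(), {"C": [0.1, 1.0, 10.0]}, cv=2).fit(X, y)

# One row per candidate, sorted from best to worst mean validation score.
table = pd.DataFrame(toy_search.cv_results_)
table = table[["params", "mean_test_score", "rank_test_score"]]
table = table.sort_values("rank_test_score")
print(table)
```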

Finally, we evaluate the accuracy of our model on the test set.

In [None]:
print("Train accuracy = " + str(model.score(train_diag, train_labels)))
print("Test accuracy  = " + str(model.score(test_diag,  test_labels)))

Depending on the method you used, the accuracy can go up to ~90%, not bad! This score can be improved further by adding other descriptors to the persistence diagrams (using for instance a FeatureUnion in the pipeline), but using TDA alone already gives competitive accuracies!
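As a rough idea of how a FeatureUnion combines descriptors, here is a toy sketch with scikit-learn only. Two generic feature extractors stand in for a diagram vectorization and an additional descriptor, and the data is synthetic. In a real setting, each branch would first have to select its own part of the input, which is left out here.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=80, n_features=6, random_state=0)

# Each branch receives the same input; the outputs are concatenated column-wise.
union = FeatureUnion([("scaled", StandardScaler()),
                      ("pca",    PCA(n_components=2))])
features = union.fit_transform(X)          # 6 scaled columns + 2 PCA columns

# The union can then be used as a single step of a pipeline.
union_pipe = Pipeline([("Features", union), ("Estimator", SVC())]).fit(X, y)
print(features.shape)
```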