<a href="https://colab.research.google.com/github/AnkanKar-Zargon/Topological-ML-Reports-and-Presentations/blob/main/Topological_machine_learning_using_Random_Forest.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Importing Libraries

In [None]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [None]:
!pip install --upgrade hepml
!pip install persim
!pip install ripser



### All libraries and tools are imported here

In [None]:
from itertools import product
import time
import numpy as np
import persim

from scipy.stats import multivariate_normal as mvn
import matplotlib.pyplot as plt
import pandas as pd
from pathlib import Path
import pickle
from typing import List
from PIL import Image
from hepml.core import download_dataset
from scipy import ndimage


from ripser import Rips
from persim import PersistenceImager

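import collections
import collections.abc

# Compatibility shim: Python 3.10 removed the collections.Iterable alias that some older packages still reference
collections.Iterable = collections.abc.Iterable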


from gtda.homology import VietorisRipsPersistence, CubicalPersistence
from gtda.diagrams import PersistenceEntropy
from gtda.plotting import plot_heatmap, plot_point_cloud, plot_diagram
from gtda.pipeline import Pipeline
from hepml.core import make_point_clouds, load_shapes


from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from persim.images_weights import linear_ramp
from sklearn.decomposition import PCA
from sklearn import datasets
from sklearn import svm
from sklearn import metrics


# Downloading Data

### The dataset "shapes.zip" is downloaded here

In [None]:
download_dataset("shapes.zip")

Dataset already exists at '../data/shapes.zip' and is not downloaded again.


### Path for the data is stored here

In [None]:
DATA = Path('../data/')

### The archive "shapes.zip" is unzipped here (if it was already extracted, type A and press Enter to overwrite all files)

In [None]:
!unzip {DATA}/'shapes.zip' -d {DATA}

Archive:  ../data/shapes.zip
replace ../data/shapes/fighter_jet6.pts? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: ../data/shapes/fighter_jet6.pts  
  inflating: ../data/shapes/guitar9.pts  
  inflating: ../data/shapes/vase5.pts  
  inflating: ../data/shapes/vase4.pts  
  inflating: ../data/shapes/guitar8.pts  
  inflating: ../data/shapes/fighter_jet7.pts  
  inflating: ../data/shapes/helicopter9.pts  
  inflating: ../data/shapes/fighter_jet5.pts  
  inflating: ../data/shapes/vase6.pts  
  inflating: ../data/shapes/vase7.pts  
  inflating: ../data/shapes/fighter_jet4.pts  
  inflating: ../data/shapes/helicopter8.pts  
  inflating: ../data/shapes/fighter_jet0.pts  
  inflating: ../data/shapes/vase3.pts  
  inflating: ../data/shapes/vase2.pts  
  inflating: ../data/shapes/fighter_jet1.pts  
  inflating: ../data/shapes/human_arms_out9.pts  
  inflating: ../data/shapes/fighter_jet3.pts  
  inflating: ../data/shapes/handgun9.pts  
  inflating: ../data/shapes/potted_plant8.pts  
  in
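The `unzip` command above stops to ask before overwriting existing files. A minimal sketch of a non-interactive alternative, assuming Python's standard-library `zipfile` module is acceptable here; the helper name `extract_archive` is hypothetical and not part of `hepml`:

```python
import zipfile
from pathlib import Path


def extract_archive(archive: Path, target_dir: Path) -> list:
    """Extract every member of `archive` into `target_dir`.

    Hypothetical helper for illustration. Existing files are overwritten
    without a prompt, so re-running the cell is safe.
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target_dir)
        # Return the member names so the caller can inspect what was written
        return zf.namelist()
```

In this notebook the call would look like `extract_archive(DATA / "shapes.zip", DATA)`.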

# We will create a dataset from a subset of shape classes

In [None]:
SHAPES = Path("../data/shapes")
df = load_shapes(SHAPES, ["human_arms_out", "vase", "dining_chair", "biplane"], 400)
df.head()


The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.



| | x | y | z | label |
|---|---|---|---|---|
| 0 | 0.425342 | 0.569695 | 0.109767 | human_arms_out0 |
| 1 | 0.54106 | 0.426657 | 0.114984 | human_arms_out0 |
| 2 | 0.516622 | 0.54628 | 0.097866 | human_arms_out0 |
| 3 | 0.54106 | 0.440214 | 0.065773 | human_arms_out0 |
| 4 | 0.390249 | 0.793699 | 0.082838 | human_arms_out0 |


#### To check the data quality and visualize the point distribution, we plot a single shape (dining_chair0) here

In [None]:
plot_point_cloud(df.query('label == "dining_chair0"')[["x", "y", "z"]].values)

#### We collect the point clouds of all shapes in a single array of shape (n_shapes, n_points, 3)

In [None]:
point_clouds = np.asarray([df.query("label == @shape")[["x", "y", "z"]].values for shape in df["label"].unique()])
point_clouds.shape


(40, 400, 3)

#### We plot the points of all shapes together here

In [None]:
plot_point_cloud(df[["x","y","z"]].values)

### We define the homology dimensions: 0 for "connected components", 1 for "holes" (loops), 2 for "voids"

In [None]:
homology_dimensions = [0, 1, 2]  # H0: connected components, H1: holes, H2: voids
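For intuition about dimension 0: in a Vietoris-Rips filtration every point starts as its own connected component at scale 0, and a component dies when it merges with another one. Those merge scales are the edge lengths of a minimum spanning tree. The sketch below computes them with Kruskal's algorithm in plain NumPy. It is an illustration only, not how `VietorisRipsPersistence` is implemented; the helper name `h0_death_times` is hypothetical, and it assumes an edge enters the filtration at a scale equal to its length.

```python
import numpy as np


def h0_death_times(points: np.ndarray) -> np.ndarray:
    """Return the finite H0 death times of a Vietoris-Rips filtration.

    Hypothetical helper for illustration: Kruskal's algorithm with a
    union-find structure, recording one death per merge of two components.
    """
    n = len(points)
    # Pairwise Euclidean distances
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))

    # All edges (i < j), visited from shortest to longest
    i_idx, j_idx = np.triu_indices(n, k=1)
    order = np.argsort(dists[i_idx, j_idx], kind="stable")

    parent = list(range(n))

    def find(a: int) -> int:
        # Follow parent links to the root, shortening the path on the way
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    deaths = []
    for e in order:
        i, j = int(i_idx[e]), int(j_idx[e])
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # Two components merge at this scale, so one of them dies here
            parent[root_i] = root_j
            deaths.append(dists[i, j])
    return np.asarray(deaths)
```

A cloud of n points yields n - 1 finite death times; the last surviving component never dies.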

## We calculate persistence up to dimension 2 (H0, H1 and H2); the H2 computation is memory intensive

In [None]:
persistence = VietorisRipsPersistence(metric="euclidean", homology_dimensions=homology_dimensions, n_jobs=6)
# %time must share a line with the statement it measures; on a line of its own it times nothing
%time persistence_diagrams = persistence.fit_transform(point_clouds)


### Since calculating persistence diagrams is often one of the most time-consuming parts of the pipeline, it is a good idea to save our intermediate results to disk:

In [None]:
with open(DATA / "diagrams.pkl", "wb") as f:
    pickle.dump(persistence_diagrams, f)

with open(DATA / "diagrams.pkl", "rb") as f:
    diagrams = pickle.load(f)
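The save-then-load pattern above can be wrapped so that the expensive computation is skipped whenever a cached file already exists. This is a minimal sketch under that assumption; the helper name `load_or_compute` is hypothetical, and the cache is never invalidated, so the file must be deleted by hand if the inputs change.

```python
import pickle
from pathlib import Path
from typing import Any, Callable


def load_or_compute(cache_path: Path, compute: Callable[[], Any]) -> Any:
    """Return the pickled object at `cache_path`, computing and saving it first if missing.

    Hypothetical helper for illustration. Only unpickle files you created yourself.
    """
    if cache_path.exists():
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    # Cache miss: run the expensive step once and store the result
    result = compute()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result
```

In this notebook it could be used as `diagrams = load_or_compute(DATA / "diagrams.pkl", lambda: persistence.fit_transform(point_clouds))`.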

# Starting index of each class in the diagrams array: human_arms_out = 0, vase = 10, dining_chair = 20, biplane = 30

In [None]:
index = 20  # first dining_chair sample
plot_diagram(diagrams[index])

### We calculate the topological feature matrix

In [None]:
persistent_entropy = PersistenceEntropy()
X = persistent_entropy.fit_transform(diagrams)
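Persistence entropy condenses each diagram into one number per homology dimension: the Shannon entropy of the bar lifetimes (death minus birth) after normalizing them to sum to 1. The NumPy sketch below shows that calculation for a single diagram and a single dimension. The helper name `persistence_entropy` is hypothetical, and the base-2 logarithm and the NaN returned for an empty diagram are assumptions that may differ from the conventions `PersistenceEntropy` uses internally.

```python
import numpy as np


def persistence_entropy(births, deaths, base=2.0):
    """Shannon entropy of the normalized bar lifetimes of one diagram.

    Hypothetical helper for illustration; `base=2.0` is an assumed convention.
    """
    lifetimes = np.asarray(deaths, dtype=float) - np.asarray(births, dtype=float)
    lifetimes = lifetimes[lifetimes > 0]  # zero-length bars carry no weight
    if lifetimes.size == 0:
        return float("nan")  # entropy is undefined without any bars
    p = lifetimes / lifetimes.sum()  # lifetimes as a probability distribution
    return float(-(p * np.log(p)).sum() / np.log(base))
```

Under these assumptions, a diagram whose bars all have the same length has the highest entropy for its number of bars, while a diagram dominated by one long bar has entropy close to 0.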




#### The feature matrix has shape (n_point_clouds, n_homology_dimensions)

In [None]:
X.shape

(40, 3)

In [None]:
plot_point_cloud(X)

#### In this particular case we do not observe distinct clusters in the data, so we anticipate that the classifier's performance might not be optimal. To train a model, we must first define a target vector with one label per point cloud. A simple approach is to label each class with an integer from 0 to n_classes - 1:

In [None]:
labels = np.zeros(40)
labels[10:20] = 1
labels[20:30] = 2
labels[30:] = 3
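The same target vector can be built without manual slicing. This sketch assumes, as in this notebook, that the point clouds are ordered class by class with the same number of samples in each class; the variable names are illustrative.

```python
import numpy as np

n_classes = 4            # human_arms_out, vase, dining_chair, biplane
samples_per_class = 10   # point clouds per class in this notebook

# Ten 0s, then ten 1s, ten 2s and ten 3s
labels_repeat = np.repeat(np.arange(n_classes), samples_per_class)
```

Unlike `np.zeros(40)`, this produces an integer array, which is a natural dtype for class labels.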

## Given the small sample size (40 point clouds), we use a Random Forest classifier: averaging over many trees reduces overfitting relative to a single decision tree, and the out-of-bag (OOB) score estimates performance without setting aside a separate test set.

In [None]:
rf = RandomForestClassifier(oob_score=True, random_state=42)
rf.fit(X, labels)

#### Getting the out-of-bag (OOB) accuracy score

In [None]:
rf.oob_score_

0.6
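The OOB score is possible because each tree is trained on a bootstrap sample (n draws with replacement), which leaves some samples unseen by that tree; those samples can then serve as a small validation set for it. The simulation below is a sketch of how large that unseen fraction is for n = 40. It only illustrates the sampling step and does not train a forest; the tree count and seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 40   # number of point clouds in this notebook
n_trees = 1000   # illustrative value, not the forest size used above

oob_fractions = []
for _ in range(n_trees):
    # Bootstrap sample for one tree: n draws with replacement
    in_bag = rng.integers(0, n_samples, size=n_samples)
    # Samples that were never drawn are out-of-bag for this tree
    n_oob = n_samples - np.unique(in_bag).size
    oob_fractions.append(n_oob / n_samples)

mean_oob_fraction = float(np.mean(oob_fractions))
# Probability that a given sample is missed by all n draws; tends to 1/e for large n
expected_fraction = (1 - 1 / n_samples) ** n_samples
```

Roughly a third of the 40 point clouds are out-of-bag for any given tree. For context, with four balanced classes a classifier guessing uniformly at random would score about 0.25, so an OOB score of 0.6 is well above chance but leaves considerable room for improvement.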