# Tutorial 5: Unbalanced Gromov-Wasserstein Distances and Fused Gromov-Wasserstein Distances
This notebook demonstrates the practical usage of unbalanced GW and fused GW. The theory behind these concepts was introduced in the "Variants of Gromov-Wasserstein" page.

## Using Unbalanced Gromov-Wasserstein
We have implemented a Python module that computes the unbalanced Gromov-Wasserstein (UGW) distance between cells. By default, CAJAL ships with a single-core version and a multicore version of the algorithm. To get a GPU version using either CUDA or OpenCL, the user can uncomment the appropriate line in the package's `setup.py` build script. The GPU backends are disabled by default because the end user must first configure their machine so that the CUDA (respectively, OpenCL) header files can be found and all necessary libraries are available. A few other backends can be made available upon request. In our experience, the GPU backends are only likely to be useful when the individual UGW problems are very large (i.e., the metric spaces are large).

The user should only import one of the backend modules at a time due to technical limitations of C (C has no namespacing, so there will be symbol conflicts from identically named functions in the two backend modules). Restart the Python interpreter if you want to load a different backend module.

Let us demonstrate how to use the implementation.

We import the module we want to use, in this case the multicore implementation, together with the `UGW` class. The constructor for the `UGW` class takes the backend module as its argument, establishes a connection with the library, and returns an object that maintains the internal state of the computation. The wrapper functions for the C backend are then accessible as *methods* of this object. If the user intends to parallelize at the level of Python processes, each process should instantiate the class itself. As usual, one can call `help(UGW_multicore)`, `help(UGW_multicore.ugw_armijo)`, and so on for documentation of the functions.

In [25]:
from cajal.ugw import _multicore, UGW # _single_core for single-threaded usage, useful if you want to parallelize at the level of Python processes

UGW_multicore = UGW(_multicore) # For GPU backends, the constructor has to negotiate a connection to the GPU, so it may take a long time to initialize.

In [27]:
from cajal.run_gw import cell_iterator_csv
import numpy as np


cells, icdms = zip(*cell_iterator_csv("/home/patn/dropbox/Data/AllenInstitute/swc_bdad_100pts_euclidean_icdm.csv"))
icdm_block = np.stack(icdms,axis=0) # For efficient memory usage and effective parallelization, the parallel function requires an array of cells of uniform length.

# The appropriate parameters are sensitive to the absolute scale of your data.
# To choose coefficients, you can run the ordinary GW computation first and use its output to estimate the relevant scales.
rho1 = 4000.
rho2 = 4000.
eps = 100.

UGW_results = UGW_multicore.ugw_armijo_pairwise_unif(
    rho1 = rho1,
    rho2 = rho2,
    eps = eps,
    dmats = icdm_block
)
UGW_array = UGW_multicore.from_futhark(UGW_results)
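
The two practical points in the cell above (the stacked array must have a uniform shape, and the parameters depend on the absolute scale of the data) can be checked before launching a long computation. The following is only a minimal sketch on synthetic data: `stack_uniform` and `distance_scale_summary` are hypothetical helper names, not part of CAJAL, and the choice of median and maximum as summaries is an assumption. How to turn such summaries into values for `rho1`, `rho2`, and `eps` remains a judgment call that this sketch does not settle.

```python
import numpy as np


def stack_uniform(dmats):
    """Stack square distance matrices into one (n_cells, n_pts, n_pts) array.

    Hypothetical helper (not part of CAJAL): raises a readable error when
    the cells were sampled with different numbers of points.
    """
    shapes = {np.shape(d) for d in dmats}
    if len(shapes) != 1:
        raise ValueError(f"distance matrices have differing shapes: {sorted(shapes)}")
    (shape,) = shapes
    if len(shape) != 2 or shape[0] != shape[1]:
        raise ValueError(f"expected square matrices, got shape {shape}")
    return np.stack(dmats, axis=0)


def distance_scale_summary(block):
    """Return (median, maximum) of the off-diagonal distances over all cells."""
    n_pts = block.shape[1]
    off_diag = ~np.eye(n_pts, dtype=bool)
    values = block[:, off_diag]  # shape (n_cells, n_pts * (n_pts - 1))
    return float(np.median(values)), float(values.max())


# Synthetic stand-in for the intracell distance matrices: 4 "cells", 10 points each.
rng = np.random.default_rng(0)
toy_cells = []
for _ in range(4):
    pts = rng.normal(scale=50.0, size=(10, 3))
    toy_cells.append(np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1))

toy_block = stack_uniform(toy_cells)
median_dist, max_dist = distance_scale_summary(toy_block)
print(toy_block.shape, median_dist, max_dist)
```

The same kind of summary can be applied to an ordinary GW distance matrix, which is the quantity the comment in the cell suggests looking at.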

In [19]:
from os.path import join
bd = "/home/patn/dropbox/Data/AllenInstitute/"

The `from_futhark()` method converts the library's internal representation of the output to a NumPy array.

The returned array has five columns: $\mathcal{G}(T)$; the first and second marginal costs $KL(\pi_X(T)\otimes\pi_X(T)\mid \mu\otimes\mu)$ and $KL(\pi_Y(T)\otimes\pi_Y(T)\mid \nu\otimes\nu)$; the entropy regularization term $KL(T\otimes T\mid (\mu\otimes\nu)\otimes(\mu\otimes \nu))$; and the weighted linear combination $\mathcal{L}_\varepsilon(T)=UGW_\varepsilon$, where $T$ is the optimal coupling found by the search. In our analysis, we use $\mathcal{L}$ (the first three columns combined with weights 1, `rho1`, and `rho2`) rather than $\mathcal{L}_\varepsilon$ as the measure of "distance", because the regularization term is present only for computational reasons and tells us nothing about morphological distinctions.

In [11]:
from scipy.spatial.distance import squareform
UGW_dmat = squareform(UGW_array[:,0] + rho1 * UGW_array[:,1] + rho2 * UGW_array[:,2])
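
To make the column layout concrete, here is a small sketch with made-up numbers for three cells. It rests on two assumptions: that the fifth column equals the first four combined with weights 1, `rho1`, `rho2`, and `eps`, and that the rows are ordered as the condensed pairs (0,1), (0,2), (1,2), which is what the use of `squareform` in the cell above implies. The NumPy indexing at the end builds the same symmetric matrix that `squareform` would produce from a condensed vector.

```python
import numpy as np

# Made-up weights and values, for illustration only.
rho1, rho2, eps = 10.0, 10.0, 2.0
n_cells = 3

# Columns: G(T), first marginal cost, second marginal cost, regularization term, L_eps.
toy = np.array([
    [1.0, 0.10, 0.05, 0.5, 0.0],
    [2.0, 0.20, 0.10, 0.4, 0.0],
    [3.0, 0.30, 0.15, 0.3, 0.0],
])
# Fill the fifth column under the assumed weighting.
toy[:, 4] = toy[:, 0] + rho1 * toy[:, 1] + rho2 * toy[:, 2] + eps * toy[:, 3]

# The unregularized cost drops the entropy term.
unregularized = toy[:, 0] + rho1 * toy[:, 1] + rho2 * toy[:, 2]

# Expand the condensed vector of pairwise values into a symmetric matrix.
dmat = np.zeros((n_cells, n_cells))
rows, cols = np.triu_indices(n_cells, k=1)
dmat[rows, cols] = unregularized
dmat[cols, rows] = unregularized
print(dmat)
```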

In [4]:
import numpy as np
from os.path import join
bd = "/home/patn/dropbox/Data/AllenInstitute"
UGW_array = np.load(join(bd,"unbalanced_gw_bdad_100pts_euclidean_rho1_4000_rho2_4000_eps_100.csv.npy"))

In [7]:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

metadata = pd.read_csv(join(bd,"cell_types_specimen_details.csv"),index_col='specimen__id').loc[pd.Series(cells).map(int)]

In [12]:
cre_lines = np.array(metadata["line_name"])

clf = KNeighborsClassifier(metric="precomputed", n_neighbors=10, weights="distance")
cv = StratifiedKFold(n_splits=7, shuffle=True, random_state=0)
accuracy = cross_val_score(clf, X=UGW_dmat, y=cre_lines,cv=cv)

np.mean(accuracy)



0.31047510328332245

In [13]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import matthews_corrcoef

cvp = cross_val_predict(clf, X=UGW_dmat, y=cre_lines, cv=cv)

print(matthews_corrcoef(cre_lines, cvp))  # arguments are (y_true, y_pred); the MCC is symmetric in them

0.2570621070384286




The accuracy of 0.310 and the MCC of 0.257 are slightly better than the accuracy of 0.296 and MCC of 0.242 that we observed in Tutorial 1 with ordinary GW, a modest relative increase of about 5% in accuracy and 6% in MCC. (Of course, these statistics say nothing by themselves about the sampling distribution they are drawn from. A different set of neurons or a different cross-validation seed would give different numbers, so there are limits to their interpretability.)
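
One way to get a rough sense of how much the cross-validation seed alone moves these numbers is to repeat the evaluation over several seeds and look at the spread. The sketch below does this on a synthetic two-class distance matrix, reusing the classifier settings from the cells above; `accuracy_over_seeds` is a hypothetical helper name, the toy data are made up, and the spread over seeds does not capture the variability due to the choice of neurons.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score


def accuracy_over_seeds(dmat, labels, seeds, n_neighbors=10, n_splits=7):
    """Mean cross-validated accuracy for each shuffling seed (hypothetical helper)."""
    clf = KNeighborsClassifier(
        metric="precomputed", n_neighbors=n_neighbors, weights="distance"
    )
    means = []
    for seed in seeds:
        cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
        means.append(cross_val_score(clf, X=dmat, y=labels, cv=cv).mean())
    return np.array(means)


# Synthetic stand-in: two overlapping point clouds and their Euclidean distance matrix.
rng = np.random.default_rng(0)
points = np.concatenate([
    rng.normal(0.0, 1.0, size=(35, 2)),
    rng.normal(1.5, 1.0, size=(35, 2)),
])
toy_labels = np.repeat(["a", "b"], 35)
toy_dmat = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

seed_means = accuracy_over_seeds(toy_dmat, toy_labels, seeds=range(5))
print(seed_means.mean(), seed_means.std())
```

With the real data, one would pass `UGW_dmat` and `cre_lines` in place of the toy inputs.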

In [20]:
import cajal.sample_swc, cajal.swc

cajal.sample_swc.compute_icdm_all_geodesic(
    infolder=join(bd, 'swc'),
    out_csv=join(bd, 'swc_bdad_100pts_geodesic_icdm.csv'),
    preprocess=cajal.swc.preprocessor_geo(
        structure_ids=[1,3,4]),
    n_sample=100)

100%|█████████▉| 508/509 [00:06<00:00, 79.50it/s] 


[]

In [21]:
import cajal.run_gw
cajal.run_gw.compute_gw_distance_matrix(
    join(bd, 'swc_bdad_100pts_geodesic_icdm.csv'),
    join(bd, 'swc_bdad_100pts_geodesic_gw.csv'),
    16
)

  0%|          | 0/129286 [00:00<?, ?it/s]

(array([[ 0.        , 55.22678091, 29.96775976, ..., 33.70196277,
         28.89944618, 43.30072298],
        [55.22678091,  0.        , 64.18048016, ..., 67.20359832,
         63.98534729, 63.61532619],
        [29.96775976, 64.18048016,  0.        , ..., 22.15445984,
         25.95531057, 51.88901184],
        ...,
        [33.70196277, 67.20359832, 22.15445984, ...,  0.        ,
         26.78978415, 60.35736956],
        [28.89944618, 63.98534729, 25.95531057, ..., 26.78978415,
          0.        , 48.84018813],
        [43.30072298, 63.61532619, 51.88901184, ..., 60.35736956,
         48.84018813,  0.        ]]),
 None)

In [22]:
import cajal.utilities


cell_names, gw_geo = cajal.utilities.read_gw_dists(join(bd, 'swc_bdad_100pts_geodesic_gw.csv'), header=True)
gw_geo_dmat = cajal.utilities.dist_mat_of_dict(gw_geo,cell_names)

In [23]:
clf = KNeighborsClassifier(metric="precomputed", n_neighbors=10, weights="distance")
cv = StratifiedKFold(n_splits=7, shuffle=True, random_state=0)
accuracy = cross_val_score(clf, X=gw_geo_dmat, y=cre_lines,cv=cv)

np.mean(accuracy)



0.2770982822352685

In [19]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import matthews_corrcoef

cvp = cross_val_predict(clf, X=gw_geo_dmat, y=cre_lines, cv=cv)

print(matthews_corrcoef(cre_lines, cvp))  # arguments are (y_true, y_pred); the MCC is symmetric in them

0.24253698545564495




In [24]:
rho1 = 4000.
rho2 = 4000.
eps = 100.

cells, geo_icdms = zip(*cell_iterator_csv("/home/patn/dropbox/Data/AllenInstitute/swc_bdad_100pts_geodesic_icdm.csv"))
geo_icdm_block = np.stack(geo_icdms, axis=0)



In [None]:

UGW_results_geo = UGW_multicore.ugw_armijo_pairwise_unif(
    rho1 = rho1,
    rho2 = rho2,
    eps = eps,
    dmats = geo_icdm_block
)
UGW_array_geo = UGW_multicore.from_futhark(UGW_results_geo)
np.save(join(bd,"swc_bdad_100pts_geodesic_ugw_array_geo.npy"), UGW_array_geo)