Tutorial: Genetic Determinants of Neuronal Morphology
=====================================================
We will illustrate the utility of the Laplacian score in identifying genes that
contribute to neuronal plasticity in C. elegans. This example uses a dataset of 799 3D neuronal reconstructions of the C. elegans DVB neuron across various mutant and control strains during days 1 to 5 of adulthood. The dataset can be downloaded from the following
[folder](https://www.dropbox.com/scl/fo/y7axeardkwyn1d6j97dqr/h?dl=0&rlkey=4b65w4tc93e778rql72d275ym). The DVB neuron is an excitatory GABAergic motor
interneuron located in the dorso-rectal ganglion of the worm, and is known to undergo post-developmental neurite outgrowth in males. This outgrowth
alters the neuron's morphology and synaptic connectivity, contributing to
changes in the spicule protraction step of male mating behavior. More information about this
dataset can be found at:

\- Hart, M. P. & Hobert, O. [Neurexin controls plasticity of a mature, sexually dimorphic neuron.](https://www.nature.com/articles/nature25192) Nature 553, 165-170, (2018).

\- Govek, K. W. et al. [Analysis and integration of single-cell morphological data using metric geometry.](https://www.biorxiv.org/content/10.1101/2022.05.19.492525v3) (2022). DOI: 10.1101/2022.05.19.492525 (bioRxiv).

To begin our analysis, we calculate the Gromov-Wasserstein distance between each pair of cells. To save time, we sample only 50 points per cell. This computation typically takes 20-30 minutes on a standard desktop computer. Sampling more points would give better results at the cost of longer computing time.

In [1]:
import cajal.sample_swc
import cajal.swc
import cajal.run_gw

cajal.sample_swc.compute_icdm_all_geodesic(
    infolder="CAJAL/data_worm/swc/",
    out_csv="CAJAL/data_worm/c_elegans_icdm.csv",
    preprocess=cajal.swc.preprocessor_geo(
        structure_ids="keep_all_types"),
    n_sample=50,
    num_cores=8)

cajal.run_gw.compute_gw_distance_matrix(
    "CAJAL/data_worm/c_elegans_icdm.csv",
    "CAJAL/data_worm/c_elegans_gw_dist.csv")

1
101
201
301
401
501
601
701
Computation finished. Computed 318801 cell pairs. Time elapsed: 1591.9364891052246
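
As a quick sanity check on the log above, the number of cell pairs follows directly from the number of cells: with 799 neurons there are 799 × 798 / 2 unordered pairs. The short sketch below reproduces that count and converts the reported elapsed time (about 26.5 minutes, or roughly 5 ms per pair) into more readable units. The variable names are illustrative only and are not part of CAJAL.

```python
# Number of neurons in the dataset and elapsed time, both copied from the text/log above.
n_cells = 799
elapsed_seconds = 1591.9364891052246

# Each unordered pair of distinct cells is compared once.
n_pairs = n_cells * (n_cells - 1) // 2
print(n_pairs)  # 318801

# Convert the elapsed time into minutes and into an average cost per pair.
print(f"{elapsed_seconds / 60:.1f} minutes in total")
print(f"{1000 * elapsed_seconds / n_pairs:.2f} ms per pair on average")
```

Because the number of pairs grows quadratically with the number of cells, doubling the dataset size roughly quadruples the running time at a fixed number of sampled points.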


We can generate a UMAP plot that visualizes the cell morphology space, with each point colored according to the age of the worm in days. The metadata for each neuron in this example is provided in the file `CAJAL/data/c_elegans_features.csv`, which can be found in the GitHub repository of CAJAL. Note that the code below reads the file from `CAJAL/data_worm/`, so adjust the path to match where you saved it. The metadata includes the age of the worm in days and the genotype at each gene (0: wild-type; 1: mutant).

In [9]:
import cajal.utilities
import umap
import pandas
import plotly.express

# Read GW distance matrix
cells, gw_dist_dict = cajal.utilities.read_gw_dists("CAJAL/data_worm/c_elegans_gw_dist.csv", header=True)
gw_dist = cajal.utilities.dist_mat_of_dict(gw_dist_dict)

# Compute UMAP representation
reducer = umap.UMAP(metric="precomputed", random_state=1)
embedding = reducer.fit_transform(gw_dist)

# Load metadata (one row per cell, indexed by cell name)
metadata = pandas.read_csv("CAJAL/data_worm/c_elegans_features.csv", index_col = "cell_name")

# Visualize UMAP
plotly.express.scatter(x=embedding[:,0], 
                       y=embedding[:,1], 
                       template="simple_white", 
                       hover_name=[m + ".swc" for m in cells],
                       color = [str(m) for m in metadata["day"]])


using precomputed metric; inverse_transform will be unavailable



Unsurprisingly, the age of the worm plays a significant role in shaping the morphology of its neurons. This is evident in the UMAP representation above, which reveals that neurons of different ages cluster in distinct regions of the UMAP. To quantify this association, we can use the Laplacian score:

In [4]:
import cajal.laplacian_score
import numpy
from scipy.spatial.distance import squareform

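# Laplacian score of worm age on the cell morphology space.
# The third argument is set to the median pairwise GW distance; squareform
# converts the square matrix into the vector of its off-diagonal entries.
laplacian = pandas.DataFrame(cajal.laplacian_score.laplacian_scores(
    numpy.array(metadata["day"]).reshape(-1, 1),
    gw_dist,
    numpy.median(squareform(gw_dist)),
    permutations=5000,
    covariates=None,
    return_random_laplacians=False)[0])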

print(laplacian)

   feature_laplacians  laplacian_p_values  laplacian_q_values
0            0.951447              0.0002              0.0002


The very small p-value (0.0002) indicates a strong association between the age of the worm and the morphology of the DVB neuron.
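
For intuition about what the score measures, here is a minimal sketch of a Laplacian score with a permutation test. It assumes the classical definition of He, Cai and Niyogi on a graph that connects cells closer than a threshold `epsilon`; the function names are hypothetical and CAJAL's own implementation may differ in its details. A score near 0 means the feature varies little between neighboring cells, while a feature unrelated to morphology scores close to 1.

```python
import numpy as np

def laplacian_score_sketch(feature, dist, epsilon):
    """Laplacian score of one feature on an epsilon-neighborhood graph (illustrative)."""
    # Connect distinct cells whose distance is below epsilon.
    adj = (dist < epsilon).astype(float)
    np.fill_diagonal(adj, 0.0)
    degree = adj.sum(axis=1)
    # Center the feature using the degree-weighted mean.
    centered = feature - (feature @ degree) / degree.sum()
    laplacian = np.diag(degree) - adj
    return (centered @ laplacian @ centered) / (centered @ (degree * centered))

def permutation_p_value(feature, dist, epsilon, permutations=1000, seed=0):
    """Compare the observed score with scores of randomly shuffled features."""
    rng = np.random.default_rng(seed)
    observed = laplacian_score_sketch(feature, dist, epsilon)
    null = np.array([
        laplacian_score_sketch(rng.permutation(feature), dist, epsilon)
        for _ in range(permutations)
    ])
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (1 + np.sum(null <= observed)) / (1 + permutations)
    return observed, p_value

# Toy data (not from the worm dataset): two well-separated groups on a line.
positions = np.concatenate([np.arange(10) * 0.1, 10 + np.arange(10) * 0.1])
toy_dist = np.abs(positions[:, None] - positions[None, :])
toy_label = np.repeat([0.0, 1.0], 10)

score, p_value = permutation_p_value(toy_label, toy_dist, epsilon=1.0, permutations=200)
print(score, p_value)
```

In the toy example the label is constant within each connected group, so the score is essentially zero and the permutation p-value is small.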

Next, we want to identify mutations that affect the morphology of the DVB neuron, again using the Laplacian score. However, worms of a given genotype are not equally represented across ages in the dataset, so the analysis must account for the uneven age distribution of each genotype. As an example, we will investigate the impact of deleterious mutations in the unc-25 gene. Let us first look at their distribution in the cell morphology space:

In [10]:
plotly.express.scatter(x=embedding[:,0], 
                       y=embedding[:,1], 
                       template="simple_white", 
                       hover_name=[m + ".swc" for m in cells],
                       color = [str(m) for m in metadata["unc-25"]])

The UMAP representation reveals that cells with a deleterious mutation in unc-25 exhibit similar morphology, a finding supported by the small p-value of the Laplacian score of unc-25 in the cell morphology space:

In [6]:
laplacian = pandas.DataFrame(cajal.laplacian_score.laplacian_scores(
    numpy.array(metadata["unc-25"]).reshape(-1, 1),
    gw_dist,
    numpy.median(squareform(gw_dist)),
    permutations=5000,
    covariates=None,
    return_random_laplacians=False)[0])

print(laplacian)

   feature_laplacians  laplacian_p_values  laplacian_q_values
0            0.995089              0.0012              0.0012


However, all of the samples with a mutation in unc-25 were obtained from worms aged 1 or 3 days, most of them from day 1:

In [7]:
metadata.loc[metadata["unc-25"]==1,"day"].value_counts()

1    18
3     6
Name: day, dtype: int64
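
A contingency table of genotype against age shows this kind of imbalance for both genotypes at once. The sketch below uses a small toy table: the mutant counts (18 and 6) are taken from the output above, while the wild-type counts are invented purely for illustration. On the real data, the same call could be applied to the `metadata` columns.

```python
import pandas as pd

# Toy metadata. Mutant counts match the output above; wild-type counts are hypothetical.
toy = pd.DataFrame({
    "unc-25": [1] * 18 + [1] * 6 + [0] * 10 + [0] * 10,
    "day":    [1] * 18 + [3] * 6 + [1] * 10 + [5] * 10,
})

# Rows: genotype (0 wild-type, 1 mutant); columns: age in days.
table = pd.crosstab(toy["unc-25"], toy["day"])
print(table)
```

Empty or sparse cells in such a table indicate ages at which a genotype cannot be compared directly with the rest of the dataset.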

This raises the question: is the similar morphology of neurons with a deleterious mutation in unc-25 due to the mutation itself or to the similar age of the worms? To address this, we can compute the Laplacian score while treating the age of the worm as a covariate:

In [8]:
laplacian = pandas.DataFrame(cajal.laplacian_score.laplacian_scores(numpy.array(metadata.iloc[:,0:11]), 
                                       gw_dist, 
                                       numpy.median(squareform(gw_dist)), 
                                       permutations = 5000, 
                                       covariates = numpy.array(metadata["day"]), 
                                       return_random_laplacians = False)[0])
laplacian.index = metadata.columns.values.tolist()[0:11]

print(laplacian)

        feature_laplacians  laplacian_p_values  laplacian_q_values    beta_0  \
nrx-1             0.996573            0.002799            0.006159  0.978138   
mir-1             1.000141            0.123775            0.151281  0.983826   
unc-49            0.996214            0.003199            0.005865  0.988101   
nlg-1             0.994699            0.001000            0.002749  0.971700   
unc-25            0.995089            0.000800            0.003519  0.921371   
unc-97            0.961518            0.000200            0.002200  1.023469   
lim-6             1.000951            0.302140            0.332354  1.028147   
lat-2             0.994027            0.000800            0.003519  1.018798   
ptp-3             0.999623            0.082783            0.113827  1.006270   
sup-17            0.997835            0.014197            0.022310  0.969905   
pkd-2             1.001720            0.536293            0.536293  0.972349   

          beta_1  beta_1_p_value  regre

Upon examining the table, we note that the significance of unc-25 weakens from 0.0008 (its unadjusted p-value) to 0.05 after adjusting for the covariate effect. Consistent with this, the low p-value of the F-statistic indicates that the covariate has a considerable impact on the Laplacian score of unc-25.
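
As a side check on the `laplacian_q_values` column, the printed q-values can be reproduced from the printed p-values with a simple rank-based adjustment, q = p · n / rank, where n is the number of genes tested and tied p-values share their average rank. This is an observation about the numbers shown in the table, not a statement about how CAJAL computes them internally; the helper below is a hypothetical illustration.

```python
# p-values copied from the laplacian_p_values column of the table above.
p_values = {
    "nrx-1": 0.002799, "mir-1": 0.123775, "unc-49": 0.003199, "nlg-1": 0.001000,
    "unc-25": 0.000800, "unc-97": 0.000200, "lim-6": 0.302140, "lat-2": 0.000800,
    "ptp-3": 0.082783, "sup-17": 0.014197, "pkd-2": 0.536293,
}

def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their average rank."""
    ordered = sorted(values)
    rank_of = {}
    for v in set(values):
        first = ordered.index(v) + 1      # rank of the first occurrence
        count = ordered.count(v)          # number of tied entries
        rank_of[v] = first + (count - 1) / 2
    return [rank_of[v] for v in values]

genes = list(p_values)
n = len(genes)
ranks = average_ranks([p_values[g] for g in genes])
q_values = {g: p_values[g] * n / r for g, r in zip(genes, ranks)}

for g in genes:
    print(f"{g:8s} p={p_values[g]:.6f} q={q_values[g]:.6f}")
```

Up to rounding of the printed p-values, the results agree with the table, for example about 0.0035 for unc-25 and lat-2, which are tied at rank 2.5.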