# Which one's the real data?

In the previous notebook, we found that it is difficult to determine a good algorithmic model for biological connectomes.
We also saw how easy it is to be misled: to take a result found in the data as evidence of an underlying algorithm when in reality it is only a meaningless side effect.

Here, we use this insight for a little game.

I prepared data related to a number of connectomes. One of them is the actual data from human brain tissue, published as:
Peng et al., 2024. Science. Directed and acyclic synaptic connectivity in the human layer 2-3 cortical microcircuit.

The others are stochastically generated "fakes", built by a variety of algorithms. Can you tell which one's the real one?
To do this, we analyze the connectomes and see if they match a number of findings that have been reported in the paper.

## Your input required
Note that the code cells below are not complete. Your input and a little bit of coding are required at times. Read the instructions carefully!

For reference, you can read the paper under the following url:
https://doi.org/10.1126/science.adg8828



# Loading the data
We begin by loading in the data.

Obtain the data from [this link](https://openbraininstitute-my.sharepoint.com/:u:/g/personal/michael_reimann_openbraininstitute_org/EcphXv5auVlLilTB2YzYEQsB6EJtcwLakIzc45PDYaDAfA?e=svddAO)

In [None]:
!pip install numpy
!pip install pandas
!pip install Connectome-Utilities
!pip install connectome-analysis
!pip install scipy

In [None]:
import numpy
import pandas
import conntility
import connalysis
import scipy.stats  # needed for the t-tests used further below
from scipy import sparse

use_gdrive = True

if use_gdrive:
    from google.colab import drive
    drive.mount('/content/drive')

    # Assumes a shortcut to the shared drive has been placed in your Drive.
    fn_data = "/content/drive/MyDrive/NSC6085_Student_Share/April08/data/mystery_con_mats.h5"
else:
    # Alternatively, if the GDrive method does not work, you can download the file separately and place it into the local file system.
    # Obtain from the link listed above.
    fn_data = "./mystery_con_mats.h5"

data = conntility.ConnectivityGroup.from_h5(fn_data)

print("Loaded {0} sample connectivity matrices.".format(len(data.index)))

The result is a ConnectivityGroup object, that is, a bundle of connectivity matrices that can be analyzed together.
The matrices fall into seven groups, "mystery_1" to "mystery_7". Each group has been generated by a different algorithm, except for one, which is the original data.

The first level of the .index tells you which group each matrix belongs to.

The second level of the index numbers the instances within a group. We do not have a full regional connectome available in this dataset. Instead, the authors of the paper probed many different brain slices. This resulted in several hundred sampled connectomes, each with only 8-12 neurons.

In [None]:
display(data.index)

Use the regular indexing operation to access individual matrices.

The .vertices are the properties associated with each vertex, i.e., neuron, such as its location and distance from the pia.

The .array holds the adjacency matrix: zero values indicate the absence of a connection, nonzero values indicate the strength of a connection.

In [None]:
mat = data["mystery_1", 0]
display(mat.array, mat.vertices)

You can convert a matrix to a *networkx* graph and use that, for example, for plotting.

In [None]:
import networkx

def draw_graph(graph, coords=("coordinate_1", "coordinate_2"), **kwargs):
    # Place each node at the position given by the selected vertex properties
    pos = {k: [v[_c] for _c in coords] for k, v in graph.nodes.items()}
    networkx.draw(graph, pos=pos, **kwargs)

graph = data["mystery_1", 0].to_networkx()
draw_graph(graph)


## Which one's the original?

### Part 1: The mean amplitude of connections is 0.64 mV
We know that in the original data the mean amplitude is 0.64 mV. We can test which of the mystery connectomes match this.

First, we write a function that takes as inputs:
 - The adjacency matrix of a connectome as a scipy.sparse matrix
 - A pandas.DataFrame of the node properties

and returns a Series of connection amplitude values. We reference this function in a basic analysis recipe so that it can be conveniently applied to all matrices.

Then we test the extracted distributions against the expected mean, separately for each type of matrix (.groupby).
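To see what the per-group test does in isolation, here is a minimal sketch on made-up numbers. The group names `toy_a` and `toy_b` and all values are invented for illustration. Only the index level name `matrix_type` is taken from this notebook.

```python
import numpy
import pandas
from scipy import stats

# Two made-up groups: one centered on 0.64, one centered on 1.5
idx = pandas.MultiIndex.from_product(
    [["toy_a", "toy_b"], range(201)], names=["matrix_type", "sample"]
)
vals = numpy.concatenate([
    numpy.linspace(0.14, 1.14, 201),  # mean 0.64
    numpy.linspace(1.0, 2.0, 201),    # mean 1.5
])
samples = pandas.Series(vals, index=idx)

# True where the one-sample t-test cannot reject "mean == 0.64" at the 5% level
matches = samples.groupby("matrix_type").apply(
    lambda s: stats.ttest_1samp(s, 0.64).pvalue >= 0.05
)
print(matches)
```

Note that a result of True only means the data are compatible with the expected mean. It does not prove that the group is the original.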

In [None]:
def edge_values(mat, nodes):
    vals = pandas.Series(mat.data)
    return vals[~numpy.isnan(vals)]

analysis = {
    "analyses": {
        "connection_strengths": {
            "source": edge_values,
            "output": "Series"
        }
    }
}
# Apply the analysis function to all matrices in the group.
result = data.analyze(analysis)

# Wrapper for a statistical test of amplitudes against the expected value
mean_strength_matches = lambda samples: scipy.stats.ttest_1samp(samples, 0.64).pvalue >= 0.05
# Compare distributions separately for all matrix_types.
display(result["connection_strengths"].groupby("matrix_type").apply(mean_strength_matches))

### Part 2: The connection probability is 0.158

We know the mean connection probability of the data (0.158). Similar to the above, we extract connection probabilities and compare the distribution of results to this value.
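As a starting point, one possible way to compute a connection probability is sketched below on a toy matrix. It rests on several assumptions that you should check against the real data: every nonzero stored entry counts as a connection, self-connections are excluded from the pair count, and matrices with fewer than two neurons yield NaN. The function name `toy_connection_probability` is made up for this sketch.

```python
import numpy
from scipy import sparse

def toy_connection_probability(mat):
    n = mat.shape[0]
    n_pairs = n * (n - 1)  # ordered pairs, self-connections excluded (assumption)
    if n_pairs == 0:
        return numpy.nan  # fewer than two neurons: probability is undefined
    m = sparse.coo_matrix(mat)
    off_diag = m.row != m.col
    n_con = numpy.count_nonzero(m.data[off_diag])
    return n_con / n_pairs

# Toy example: 3 neurons, connections 0 -> 1 and 2 -> 0
toy = sparse.csr_matrix(numpy.array([
    [0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0],
    [1.2, 0.0, 0.0],
]))
p = toy_connection_probability(toy)  # 2 connections among 6 ordered pairs
p_single = toy_connection_probability(sparse.csr_matrix((1, 1)))
```

This sketch takes only the matrix as input. The analysis recipe used in this notebook calls its functions with both `mat` and `nodes`.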

In [None]:
def connection_probability(mat, nodes):
    # Fill in a function that returns the connection probability as a single, float value between 0 and 1.
    # Warning: There are empty (0 nodes) matrices in the data.
    return numpy.nan


analysis["analyses"]["connection_probability"] = {
    "source": connection_probability,
    "output": "Value"
}
# Apply the analysis functions to all matrices in the group.
result = data.analyze(analysis)

# Wrapper for a statistical test of connection probabilities against the expected value
mean_prob_matches = lambda samples: scipy.stats.ttest_1samp(samples, 0.158).pvalue >= 0.05
# Compare distributions separately for all matrix_types.
display(result["connection_probability"].dropna().groupby("matrix_type").apply(mean_prob_matches))

### Part 3: The connection probability does not have a bias for reciprocal connections

Many connectomes, e.g. rat and mouse local cortical circuitry, have been reported to have an overexpression of reciprocal connections. That is, if a connection exists in one direction between a pair of neurons, then the probability that a connection also exists in the other direction is higher than expected by chance.

This is, however, reportedly not the case for the original human connectivity samples.

We test this here in the following way:
  1. For each matrix type, count the number of neuron pairs, connections and reciprocal connections
  2. Based on this, for each matrix type calculate its individual connection probability
  3. Based on this, calculate the expected distribution of the number of reciprocal connections under the null hypothesis (binomial)
  4. Based on this, evaluate the null hypothesis
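The steps above can be sketched on made-up counts as follows. This is one possible reading of the null model, not necessarily the intended solution. It assumes that `n_reciprocal_pairs` counts each reciprocally connected pair once. If you count entries of a matrix of bidirectional connections instead, each such pair may be counted twice, so check which convention applies. The function name and the toy numbers are invented for illustration.

```python
from scipy.stats import binom

def reciprocity_consistent_with_independence(n_ordered_pairs, n_connections,
                                             n_reciprocal_pairs, thresh=0.05):
    # Step 2: connection probability from the pooled counts
    p = n_connections / n_ordered_pairs
    # Step 3: each unordered pair is reciprocally connected with probability p ** 2
    n_unordered = n_ordered_pairs // 2
    distr = binom(n_unordered, p ** 2)
    # Step 4: two-sided acceptance region at the given significance level
    lo = distr.isf(1.0 - thresh / 2.0)
    hi = distr.isf(thresh / 2.0)
    return lo <= n_reciprocal_pairs <= hi, (lo, hi)

# Made-up counts: 7200 ordered pairs, 1140 connections
ok_90, bounds = reciprocity_consistent_with_independence(7200, 1140, 90)
ok_200, _ = reciprocity_consistent_with_independence(7200, 1140, 200)
print(ok_90, ok_200, bounds)
```

With these toy numbers, about 90 reciprocal pairs are expected under independence, so a count of 90 is compatible with the null hypothesis while a count of 200 is not.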

In [None]:
def count_connections(mat, nodes):
    # Return the number of connections in the connection matrix
    return numpy.nan

def count_pairs(mat, nodes):
    # Return the number of ordered neuron pairs.
    # Note: This means both the pair (i, j) and (j, i) are counted!
    return numpy.nan

# Add to the list of analyses
analysis["analyses"]["pair_count"] = {
    "source": count_pairs,
    "output": "Value"
}
analysis["analyses"]["connection_count"] = {
    "source": count_connections,
    "output": "Value"
}
# To count the number of reciprocal connections we do not need a new function.
# We simply apply the connection counting function to the reciprocal connectivity
analysis["analyses"]["reciprocal_count"] = {
    "source": count_connections,
    "decorators": [  # This "decorator" means that the analysis function will be applied to the matrix containing only reciprocal connections.
        {"name": "for_bidirectional_connectivity"}
    ],
    "output": "Value"
}
# Apply the analysis functions to all matrices in the group.
result = data.analyze(analysis)
# Concatenate the three counts
result = pandas.concat([result["pair_count"],
                        result["connection_count"],
                        result["reciprocal_count"]], axis=1, 
                       keys=["pairs", "total", "reciprocal"])
# Summing per matrix type results in a DataFrame with matrix types as rows, the three counts as columns:
#              pairs  total  reciprocal
# matrix_type
# mystery_1     7200  [...]
# [...]
# mystery_7     7200  [...]
sum_results = result.groupby(["matrix_type"]).sum()

# We define a function that returns True if the null hypothesis that connections in either direction are formed
# statistically independently cannot be rejected at the specified significance level.
# The test statistic we use is the total count of reciprocal connections.
# The null hypothesis is that this number is binomially distributed with a probability equal to the square of the basic connection probability
def test_rec_count(row, thresh=0.05):
    from scipy.stats import binom
    num_pairs = row["pairs"]
    num_connections = row["total"]
    num_reciprocal = row["reciprocal"]

    # DEFINE THE APPROPRIATE NULL MODEL HERE!
    #distr = binom([...])  # Reciprocal connections are statistically independent. Using binomial distribution
    co_left = distr.isf(1.0 - thresh / 2.0)  # Cutoff for the left tail of the distribution. Values beyond this occur with p=thresh/2
    co_right = distr.isf(thresh / 2.0)  # Cutoff for the right tail of the distribution.
    # If the actual value lies between the two cutoffs, the hypothesis cannot be rejected
    return co_left <= num_reciprocal <= co_right


# Apply
display(sum_results.apply(test_rec_count, axis=1))


### Part 4: The connection probability is distance dependent
Many connectomes, e.g. rat and mouse local cortical circuitry, have been reported to have a distance-dependent connection probability.
This is also the case for the original human data.

In this exercise, we shall evaluate this in a very simple way: We test whether the connection probability for pairs with a soma distance between 0 and 100 µm is significantly higher than the connection probability for pairs more than 100 µm apart.

The distance can be calculated by using the columns "coordinate_1", "coordinate_2", "coordinate_3" of the `nodes` input.
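To illustrate the idea on toy data, the sketch below masks neuron pairs by their pairwise distance and computes the connection probability among the selected pairs only. It is a hedged example rather than the reference solution: the function name is invented, nonzero entries are assumed to be connections, and the matrix is converted to a dense array, which is only reasonable for small samples like the ones in this dataset.

```python
import numpy
import pandas
from scipy import sparse
from scipy.spatial.distance import pdist, squareform

def toy_probability_in_distance_bin(mat, nodes, min_dist=0, max_dist=None,
                                    props=("coordinate_1", "coordinate_2", "coordinate_3")):
    if len(nodes) < 2:
        return numpy.nan  # no pairs to evaluate
    D = squareform(pdist(nodes[list(props)].to_numpy()))  # N x N pairwise distances
    mask = D > min_dist  # the diagonal (distance 0) is excluded for min_dist >= 0
    if max_dist is not None:
        mask &= (D <= max_dist)
    n_pairs = mask.sum()
    if n_pairs == 0:
        return numpy.nan
    connected = mat.toarray() != 0
    return (connected & mask).sum() / n_pairs

# Toy example: three neurons on a line at 0, 50 and 300 um
nodes = pandas.DataFrame({
    "coordinate_1": [0.0, 50.0, 300.0],
    "coordinate_2": [0.0, 0.0, 0.0],
    "coordinate_3": [0.0, 0.0, 0.0],
})
adj = numpy.zeros((3, 3))
adj[0, 1] = 1.0  # proximal connection (50 um apart)
adj[2, 0] = 1.0  # distal connection (300 um apart)
mat = sparse.csr_matrix(adj)

p_prox = toy_probability_in_distance_bin(mat, nodes, 0, 100)   # 1 of 2 ordered pairs
p_dist = toy_probability_in_distance_bin(mat, nodes, 100)      # 1 of 4 ordered pairs
```

Samples that contain no pairs in a given distance bin return NaN here, so they need to be dropped before any statistical comparison.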

In [None]:
# Define: A function that returns the connection probability, but only
# for pairs with a soma distance > min_dist and <= max_dist. The second condition is
# ignored if max_dist is None
def connection_probability_within(mat, nodes, min_dist=0, max_dist=None,
                                  props=["coordinate_1", "coordinate_2", "coordinate_3"]):
    assert min_dist >= 0
    assert (max_dist is None) or (max_dist > min_dist)
    from scipy.spatial.distance import (pdist, squareform)
    D = squareform(pdist(nodes[props]))  # N x N numpy.array of pairwise distances
    #[ FILL IN THE REST]


# Add to the list of analyses
analysis["analyses"]["proximal_probability"] = {
    "source": connection_probability_within,
    "output": "Value",
    "kwargs": {
        "min_dist": 0,
        "max_dist": 100
    }
}
analysis["analyses"]["distal_probability"] = {
    "source": connection_probability_within,
    "output": "Value",
    "kwargs": {
        "min_dist": 100
    }
}

# Apply the analysis functions to all matrices in the group.
result = data.analyze(analysis)
# Concatenate the two probabilities
result = pandas.concat([result["proximal_probability"],
                        result["distal_probability"]],
                       axis=1,
                       keys=["proximal", "distal"])

# Here, define the appropriate statistical test
def test_con_prob_is_different(dframe, thresh=0.05):
    p1 = dframe["proximal"]
    p2 = dframe["distal"]
    #[...]


# Apply with the appropriate pandas statement
#final_result = result....
display(final_result)

### Part 5: The connection probability is directional
For this connectome, a certain directionality has been reported.
That is, connections are more likely from more superficial neurons to deeper neurons than the other way around.

The node property "piadistance" (the distance of a neuron from the pia) can be used to test this.
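One way to get started is sketched below on a toy example: count connections that run from a more superficial to a deeper neuron, and those that run the other way. The sketch assumes that rows of the matrix are presynaptic and columns postsynaptic, and that a larger "piadistance" means a deeper neuron. Both assumptions should be verified for the real data. The function name is invented for illustration.

```python
import numpy
import pandas
from scipy import sparse

def toy_directional_counts(mat, nodes, prop="piadistance"):
    depth = nodes[prop].to_numpy()
    connected = mat.toarray() != 0  # connected[i, j]: connection from i to j (assumption)
    # downward[i, j] is True if neuron i is more superficial than neuron j
    downward = depth[:, None] < depth[None, :]
    n_down = int((connected & downward).sum())
    n_up = int((connected & downward.T).sum())
    return n_down, n_up

# Toy example: three neurons at increasing depth
nodes = pandas.DataFrame({"piadistance": [100.0, 200.0, 300.0]})
adj = numpy.zeros((3, 3))
adj[0, 1] = 1.0  # superficial -> deep
adj[0, 2] = 1.0  # superficial -> deep
adj[2, 1] = 1.0  # deep -> superficial
n_down, n_up = toy_directional_counts(sparse.csr_matrix(adj), nodes)
print(n_down, n_up)
```

Every pair of neurons at different depths contributes exactly one candidate connection in each direction, so comparing the two counts is equivalent to comparing the two connection probabilities. Whether a difference is significant still requires a statistical test of your choice.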

In [None]:
# Implement....