# Exercise 2: Sub-cellular targeting by cell type in the mouse visual cortex

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF;">
At the end of Workshop 1, we saw how to label whether each synapse targets a spine,
dendritic shaft, or soma. We also plotted this data as adjacency matrices, and made some
qualitative observations about how connectivity between cell types differs across
those target structures.

Here, we'll go through a few steps to turn those observations into numbers: specifically, what
fraction of the synapses between each pair of cell types targets spines, dendritic shafts, or somas.

To accomplish this, we'll walk through some common Pandas operations (mapping, grouping/aggregating), and see how to apply them to this problem. 

As a bonus, we'll also get a glimpse of how this kind of data could be used to classify cell types. 

</div>

![](../../figures/sss-diagram.png)


<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF;">
We'll start by loading up much of the same data from Workshop 1.
</div>

In [None]:
# Import packages
import sys
from os.path import join as pjoin
import platform

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Dataset version number, used below in the data directory and file names
mat_version = 1196

platstring = platform.platform()
system = platform.system()
if system == "Darwin":
    # macOS
    data_root = "/Volumes/Brain2025/"
elif system == "Windows":
    # Windows (replace with the drive letter of your USB drive)
    data_root = "E:/"
elif "amzn" in platstring:
    # then on CodeOcean
    data_root = "/data/"
else:
    # otherwise, assume your own Linux machine
    # EDIT this path to wherever you mounted the hard drive; $USERNAME is a
    # placeholder that Python will not expand for you
    data_root = "/media/$USERNAME/Brain2025/"

# Set the directory to load prepared data and utility code
data_dir = pjoin(data_root, f"v1dd_{mat_version}")
utils_dir = pjoin("..", "utils")

# Add utilities to path
sys.path.append(utils_dir)


<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF;">
Load proofreading information about cells:
</div>

In [None]:
# Load the root IDs of cells with proofread dendrites and of cells with proofread axons
dendrite_proof_root_ids = np.load(
    pjoin(data_dir, f"proofread_dendrite_list_{mat_version}.npy")
)
axon_proof_root_ids = np.load(pjoin(data_dir, f"proofread_axon_list_{mat_version}.npy"))

# Keep only the cells that appear in both lists
proof_root_ids = np.intersect1d(dendrite_proof_root_ids, axon_proof_root_ids)

proof_root_ids

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF;">
Load synapses and target structure info:
</div>

In [None]:
syn_df = pd.read_feather(
    f"{data_dir}/syn_df_all_to_proofread_to_all_{mat_version}.feather"
).set_index("id")
target_structure = pd.read_feather(
    pjoin(data_dir, f"syn_label_df_all_to_proofread_to_all_{mat_version}.feather")
)["tag"]

# Add the target structure labels to the synapse table; unlabeled synapses become "unknown"
syn_df["target_structure"] = target_structure
syn_df["target_structure"] = syn_df["target_structure"].fillna("unknown")
syn_df.head()

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF;">
Filter to synapses between proofread cells:
</div>

In [None]:
from utils import filter_synapse_table

proof_syn_df = filter_synapse_table(syn_df, proof_root_ids, proof_root_ids)

# we're going to copy proof_syn_df to avoid modifying the original DataFrame -
# pandas will often yell at you if you try to modify a DataFrame that is a view of
# another DataFrame
proof_syn_df = proof_syn_df.copy()

proof_syn_df.head()

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF;">
Load cell information:
</div>

In [None]:
cell_df = pd.read_feather(
    pjoin(data_dir, f"soma_and_cell_type_{mat_version}.feather")
).set_index("pt_root_id")
cell_df = cell_df.loc[proof_root_ids]
cell_df.head()

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF;">
Now that we have all of that data loaded up, we'll start by focusing on what information we need to select and combine. These are the columns in the synapse table we'll focus on:
</div>

In [None]:
proof_syn_df[["pre_pt_root_id", "post_pt_root_id", "target_structure"]].head()

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF;">

We have a table of synapses, but we first want to know which cell types are involved in each one.
Although it is a little redundant, it is often useful to simply include the pre- and post-synaptic
cell types in the synapse table. There are many ways to accomplish this in Pandas. A
hint for one option is to use a column's [.map()](https://pandas.pydata.org/docs/reference/api/pandas.Series.map.html#pandas.Series.map) method.

</div>
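
<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF;">

As a minimal sketch of how `.map()` behaves, here is a made-up example. The tables `toy_cells` and `toy_synapses` and their values are hypothetical and not taken from the dataset. When `.map()` is given a Series, each value is looked up in that Series' index, and IDs with no match come back as `NaN`.

</div>

```python
import pandas as pd

# Hypothetical lookup table: one row per cell, indexed by cell ID
toy_cells = pd.DataFrame(
    {"cell_type": ["L2-IT", "PTC", "L5-ET"]},
    index=pd.Index([101, 102, 103], name="pt_root_id"),
)

# Hypothetical synapse table: each row refers to its two cells by ID
toy_synapses = pd.DataFrame(
    {
        "pre_pt_root_id": [101, 101, 102, 103],
        "post_pt_root_id": [102, 103, 101, 999],  # 999 is not in toy_cells
    }
)

# Look up each ID in the index of toy_cells["cell_type"]
toy_synapses["pre_type"] = toy_synapses["pre_pt_root_id"].map(toy_cells["cell_type"])
toy_synapses["post_type"] = toy_synapses["post_pt_root_id"].map(toy_cells["cell_type"])

print(toy_synapses)  # the last row has NaN for post_type
```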

<div style="background: #DFF0D8; border-radius: 3px; padding: 10px;">

<b> Task: </b> Use `.map()` or another method of your choosing to make two new columns in the synapse table:
`proof_syn_df['pre_cell_type']` and `proof_syn_df['post_cell_type']`. The former should
contain the cell type of the pre-synaptic cell for that row's synapse, and the latter
should contain the cell type of the corresponding post-synaptic cell.

</div>


In [None]:
proof_syn_df["pre_cell_type"] = # YOUR CODE HERE
proof_syn_df["post_cell_type"] = # YOUR CODE HERE

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">

With this information in hand, let's start with just counting the number of synapses
between each pair of cell types and in each target structure category.

In Pandas, [.groupby()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html)
is often key to aggregating data. Groupby takes a column name or list of column names, and
then allows you to apply an aggregation function to subsets of the data that share the same values
of those columns. For example, another way to count the total number of synapses in our
table in each target structure category is to do:

</div>


In [None]:
proof_syn_df.groupby(
    "target_structure",
    as_index=False,  # this makes the grouping variable(s) into columns instead of the index
).size()  # .size() is a pandas method which counts the number of rows in each group

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF;">

Aside: just to show the flexibility of `groupby`, the code below groups by `pre_pt_root_id`, which identifies the pre-synaptic cell,
selects the `ctr_pt_position_y` column, which represents the depth of each synapse in nanometers,
and then computes the mean depth of the synapses made by each cell.
</div>

In [None]:
(
    proof_syn_df.groupby("pre_pt_root_id", as_index=False)[
        "ctr_pt_position_y"  # this is selecting a column "ctr_pt_position_y" by name
    ].mean()  # this is a pandas method which computes the mean of the selected column per group
)

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF;">

More extensive tutorials on `groupby` can be found in the [Pandas User Guide](https://pandas.pydata.org/docs/user_guide/groupby.html).

</div>

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF;">

Your turn! Returning to our problem of synapse categories and cell types, let's make a new DataFrame that counts the number of synapses between each pair of
cell types in each target structure category. 

For instance, one row might represent the information "there are 38 synapses from L2-IT cells to L3-IT cell shafts". 

</div>

<div style="background: #DFF0D8; border-radius: 3px; padding: 10px;">

<b> Task: </b>  Create a DataFrame that has four columns:
- `pre_cell_type`
- `post_cell_type`
- `target_structure` (spine, shaft, soma, or unknown)
- `size` (the number of synapses between the pre- and post-synaptic cell types in that target structure)

</div>

<div style="background: #f6d5f2ff; border-radius: 3px; padding: 10px;">

<b> Hint: </b> `.groupby()` can take a list of column names to group on!

</div>
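
<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF;">

To illustrate the hint, here is a small sketch of grouping on two columns at once. The table `toy` and its columns are made up for illustration and have nothing to do with the synapse data. Each distinct combination of values in the grouping columns becomes one row of the result, and `.size()` with `as_index=False` puts the row count in a column named `size`.

</div>

```python
import pandas as pd

# Hypothetical table with two descriptive columns
toy = pd.DataFrame(
    {
        "animal": ["cat", "cat", "cat", "dog", "dog"],
        "color": ["black", "black", "white", "black", "black"],
    }
)

# Group on both columns: one output row per (animal, color) combination
toy_counts = toy.groupby(["animal", "color"], as_index=False).size()

print(toy_counts)
```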

In [None]:
group_structure_counts = # YOUR CODE HERE
group_structure_counts

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF;">

Now, because we are interested in the _proportion_ of synapses that each cell type pair
makes onto each target structure, we also need the denominator: the total number of
synapses between each pair of cell types, regardless of target structure.


</div>

<div style="background: #DFF0D8; border-radius: 3px; padding: 10px;">

<b> Task: </b>  Create another DataFrame called `group_total_counts`. It should have
columns for `pre_cell_type`, `post_cell_type`, and `total_size` (the total number of synapses
between the pre- and post-synaptic cell types, regardless of target structure). 

</div>


<div style="background: #f6d5f2ff; border-radius: 3px; padding: 10px;">

<b> Hint: </b> you might find it useful to use `.rename(columns={"old_name": "new_name"})` to 
rename the `size` column to `total_size`.

</div>


In [None]:
group_total_counts = # YOUR CODE HERE
group_total_counts

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF;">

Now, let's think about how to join these tables together. As usual, there are many ways
to do this in Pandas, but one option is to use the [pd.merge()](https://pandas.pydata.org/docs/reference/api/pandas.merge.html) function. This function takes
two DataFrames and joins them together based on one or more columns that they share.
Think about which columns we want to join on! 
</div>
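
<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF;">

As a rough sketch of what `pd.merge()` does, here is an example on made-up data. The tables `per_item` and `per_store` and their columns are hypothetical. Rows are matched on the shared column named in `on`, so a value that appears once in the smaller table is repeated for every matching row of the larger one.

</div>

```python
import pandas as pd

# Hypothetical detailed table: one row per (store, item)
per_item = pd.DataFrame(
    {
        "store": ["A", "A", "B"],
        "item": ["apple", "pear", "apple"],
        "sold": [3, 1, 4],
    }
)

# Hypothetical summary table: one row per store
per_store = pd.DataFrame({"store": ["A", "B"], "total_sold": [4, 4]})

# Match rows on the shared "store" column; how="left" keeps every row of per_item
merged = pd.merge(per_item, per_store, on="store", how="left")

print(merged)
```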


<div style="background: #DFF0D8; border-radius: 3px; padding: 10px;">

<b> Task: </b> Use `pd.merge()` to create a new
DataFrame called `group_counts` that contains the following columns:
- `pre_cell_type`
- `post_cell_type`
- `target_structure`
- `size` (the number of synapses between the pre- and post-synaptic cell types in that target structure)
- `total_size` (the total number of synapses between the pre- and post-synaptic cell types, regardless of target structure)

</div>



In [None]:
group_counts = # YOUR CODE HERE
group_counts

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF;">

Now that we've gone to all that trouble to align our data in this single table, it should
be easy to calculate the _proportion_ of synapses in each target structure category for each
pair of cell types.

</div>

<div style="background: #DFF0D8; border-radius: 3px; padding: 10px;">

<b> Task: </b> Add a new column to `group_counts` called `proportion`, which is the
`size` divided by `total_size`.

</div>

In [None]:
group_counts["proportion"] = # YOUR CODE HERE
group_counts

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF;">

We have what we said we wanted to compute, but let's transform it a bit to make it 
easier to plot. The code below first uses a query to select the proportions of synapses onto spines, 
and then [pivots](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html) the DataFrame into a square matrix (pre-synaptic cell types as rows, post-synaptic cell types as columns) that we can plot.

</div>

In [None]:
square_spine_counts = group_counts.query("target_structure == 'spine'").pivot(
    index="pre_cell_type", columns="post_cell_type", values="proportion"
)

# to make our plot look nicer, this code just reorders the categories to be first by
# excitatory/inhibitory, then by layer, and finally by cell type
categories = [
    "L2-IT",  # excitatory cell types
    "L3-IT",
    "L4-IT",
    "L5-IT",
    "L5-ET",
    "L5-NP",
    "L6-IT",
    "L6-CT",
    "DTC",  # inhibitory cell types
    "ITC",
    "PTC",
    "STC",
]
square_spine_counts = square_spine_counts.reindex(index=categories, columns=categories)
square_spine_counts

In [None]:
sns.heatmap(
    square_spine_counts,
    annot=True,
    cmap="Reds",
    cbar_kws={"label": "Proportion of synapses onto spines"},
    square=True,
    fmt=".2f",  # format the annotations to 2 decimal places
)

<div style="background: #DFF0D8; border-radius: 3px; padding: 10px;">

<b> Task: </b> Now, create similar plots for the other target structures. 

</div>

In [None]:
# YOUR CODE HERE

<div style="background: #DFF0D8; border-radius: 3px; padding: 10px;">

<b> Question: </b> What general trends do you notice about which connection types are using which "channels" of synaptic targeting? What sources of bias might affect the reliability of these proportions?

</div>

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF;">

Excitatory neurons are often called "spiny" cells because they receive much of their total input onto spines. We just looked at this in terms of cell type $\rightarrow$ cell type targeting, but what about at the single-cell level?

</div>

<div style="background: #DFF0D8; border-radius: 3px; padding: 10px;">

<b> Bonus task: </b> Compute the proportion of synapses onto spines __for each individual cell__. You should be able to do this using very similar tools to what we developed above. 

</div>

<div style="background: #f6d5f2ff; border-radius: 3px; padding: 10px;">

<b> Hint: </b> you will want a different filter on the synapse table than we used before (why?). 

</div>

In [None]:
# YOUR CODE HERE

<div style="background: #DFF0D8; border-radius: 3px; padding: 10px;">

<b>Question:</b> How does this one feature perform as a classifier for excitatory vs. inhibitory cells? What would be the drawbacks of using this feature as a classifier in practice? 

</div>