Jaccard similarity between different configurations of oviIN modularity
* full oviINr connectome
* oviINr inputs
* oviINr outputs
* combined oviIN connectome
* oviINr connectome without oviINr
* combined oviIN connectome with oviINs

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

# Testing with dummy data
I'm going to do this with dummy data first. I will create a dataframe with 2 columns that each have 2 clusters. There is 50% overlap between the 2 columns. 

In [3]:
# Create the dummy data for the columns
column_0 = np.concatenate([np.full(10, 1), np.full(10, 2)])
column_05 = np.tile([1, 2], 10)

# Create the DataFrame
df = pd.DataFrame({'0.0': column_0, '0.05': column_05})
df

Unnamed: 0,0.0,0.05
0,1,1
1,1,2
2,1,1
3,1,2
4,1,1
5,1,2
6,1,1
7,1,2
8,1,1
9,1,2


In [4]:
# get the coarse modules from the data
coarse_modules = df['0.0'].unique().tolist()
coarse_modules

[1, 2]

dict_zero gets the indices (which might be the body IDs depending on the df) of the rows that correspond to the 2 coarse clusters.

In [5]:
# Get bodyIds at zero resolution for each coarse cluster
dict_zero = {module: [] for module in coarse_modules}
for key, value in dict_zero.items():
    dict_zero[key] = df[df['0.0']==key].index.tolist()
dict_zero

{1: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 2: [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]}

The keys in dict are the coarse cluster numbers and the items are the chi cluster numbers that contain the members from the coarse cluster. 

In [6]:
# Gets cluster numbers for chi resolution for each coarse cluster
chi = '0.05'
chi_values = [chi]
dict = {}

for i, x in enumerate(coarse_modules):
    # grab a coarse cluster
    df_test = df[df['0.0']==x]
    for f, y in enumerate(chi_values):
        # take new cluster numbers overlapping with coarse cluster
        cluster = df_test[y].unique()
        #cluster_all[i,f, :len(cluster)] = cluster
    dict[coarse_modules[i]] = cluster
dict

{1: array([1, 2]), 2: array([1, 2])}

The keys in dict_new are also the coarse cluster numbers but the items are the indices that correspond to all the chi clusters for that coarse cluster.

In [7]:
# Get bodyIDs for each cluster number in the resolution
dict_new = {module: [] for module in coarse_modules}
for key, value in dict.items():
    body_ids = []
    for i, x in enumerate(value):
        # appends all the body_ids from the chi clusters
        body_ids.extend(df[df[chi]==x].index.tolist())
    dict_new[key] = body_ids
dict_new

{1: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19],
 2: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19]}

These dictionaries can then be used to compute the Jaccard similarity between a coarse cluster and the chi clusters that contain its members. Rhessa did something like this.

In [8]:
set1 = set(dict_new[key])
set2 = set(dict_zero[key])

In [9]:
set2

{10, 11, 12, 13, 14, 15, 16, 17, 18, 19}

In [10]:
set1-set2

{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}

In [11]:
set2-set1

set()

In [12]:
unique_1 = set1-set2
unique_2 = set2-set1

In [13]:
common = set1.intersection(set2)
# total_unique is not actually the total unique count
total_unique = len(unique_1) + len(unique_2)
jaccard_sim = len(common) / (total_unique + len(common))

In [14]:
common

{10, 11, 12, 13, 14, 15, 16, 17, 18, 19}

In [15]:
total_unique

10

I will actually do the union between sets instead.

In [16]:
# the union of set1 and set2
union_1_2 = set1 | set2
union_1_2

{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19}

In [17]:
# jaccard similarity is intersection divided by union
jaccard_sim = len(common) / len(union_1_2)
jaccard_sim

0.5

We can continue and make a dictionary to store jaccard values for the different coarse clusters.

In [20]:
# Create a dictionary to store the jaccard similarities
jaccard_dict = {}
for key, value in dict_new.items():
    jaccard_dict[key] = []
    unique_1, unique_2, common, jaccard_sim = calculate_difference(dict_new[key], dict_zero[key])
    jaccard_dict[key].append(jaccard_sim)
    
jaccard_dict

{1: [0.5], 2: [0.5]}

In [21]:
# Create a dataframe to store the jaccard similarities
df_jaccard = pd.DataFrame(jaccard_dict)
df_jaccard

Unnamed: 0,1,2
0,0.5,0.5


I'm not sure why Rhessa wanted to shift the indexing but I will leave that alone in case there was a reason. Doesn't seem to hurt anything.

In [22]:
# starts indexing at 1
df_jaccard.index = np.arange(1, len(df_jaccard)+1)
df_jaccard

Unnamed: 0,1,2
1,0.5,0.5


# Updates to Rhessa's Jaccard functions
The modified functions are below. This is first tested on the dummy data before overwriting df and testing on the oviIN connectome data.

In [25]:
# create a function that takes in two lists of id numbers and returns the jaccard sim of the two lists
def calculate_difference(list1, list2):
    set1 = set(list1)
    set2 = set(list2)

    # the intersection of set1 and set2
    common = set1.intersection(set2)

    # the union of set1 and set2
    union_1_2 = set1 | set2

    # Jaccard similarity index for set1 and set2
    jaccard_sim = len(common) / len(union_1_2)

    return unique_1, unique_2, common, jaccard_sim

# function that takes in a partition dataframe, the modules at the base resolution, and the chi value at which to compare
def body_ids_by_cluster(df, ref_col, chi):
    """ This function takes in a partition dataframe, the column name for modules at the base resolution (ref_col), and the chi value at which to compare. 
    It returns a dictionary of body IDs for each cluster number at the resolution and a dictionary of body IDs for each cluster 
    number at the base resolution."""
    # get the coarse modules from the data
    coarse_modules = df[ref_col].unique().tolist()

    # Get bodyIds at zero resolution for each coarse cluster
    dict_zero = {module: [] for module in coarse_modules}
    for key, value in dict_zero.items():
        dict_zero[key] = df[df[ref_col]==key].index.tolist()

    # Gets cluster numbers for chi resolution for each coarse cluster
    chi_values = [chi]
    dict = {}
    for i, x in enumerate(coarse_modules):
        # grab a coarse cluster
        df_test = df[df[ref_col]==x]
        for f, y in enumerate(chi_values):
            # take new cluster numbers overlapping with coarse cluster
            cluster = df_test[y].unique()
        dict[coarse_modules[i]] = cluster
    
    # Get bodyIDs for each cluster number in the chi resolution clusters
    dict_new = {module: [] for module in coarse_modules}
    for key, value in dict.items():
        body_ids = []
        for i, x in enumerate(value):
            # appends all the body_ids from the chi clusters
            body_ids.extend(df[df[chi]==x].index.tolist())
        dict_new[key] = body_ids

    return dict_new, dict_zero
    
# Function that takes in partition dataframe, the modules at the base resolution, and the chi value at which to compare
def main_jaccard(df, ref_col, chi):
    """ This function takes in a partition dataframe, the modules at the base resolution, and the chi 
    value at which to compare. It returns a dataframe of the jaccard similarities between the resolutions at each cluster"""

    # Get the body IDs for each cluster number at the resolution and the base resolution
    dict_new, dict_zero = body_ids_by_cluster(df, ref_col, chi)

    # Create a dictionary to store the jaccard similarities
    jaccard_dict = {}
    for key, value in dict_new.items():
        jaccard_dict[key] = []
        unique_1, unique_2, common, jaccard_sim = calculate_difference(dict_new[key], dict_zero[key])
        jaccard_dict[key].append(jaccard_sim)
    
    # Create a dataframe to store the jaccard similarities
    df_jaccard = pd.DataFrame(jaccard_dict)
    
    return df_jaccard

In [23]:
coarse_col = '0.0'
chi = '0.05'
test = main_jaccard(df, coarse_col, chi)
test

Unnamed: 0,1,2
0,0.5,0.5


In [3]:
import os

# file path for oviIN modularity data for full ovi connectome
os.chdir('/Users/ggutierr/My Drive (ggutierr@barnard.edu)/GitHub/oviIN-analyses-gabrielle/ovi_preprocessed/preprocessed-v1.2.1')

# load in the data
ovi_HB_node_df = pd.read_csv('preprocessed_nodes.csv', index_col=0)

In [25]:
df = ovi_HB_node_df

In [26]:
# get the coarse modules from the data
coarse_modules = df['0.0'].unique().tolist()

In [31]:
real_test = main_jaccard(df, coarse_col, chi)
real_test

Unnamed: 0,1,2,3,4,5,6
0,0.287668,0.221907,0.130645,0.259494,0.166893,0.179653


# Co-clustering matrices
Based on a suggestion from Alex. 
"... just re-run the clustering like 100 times and compute the co-clustering matrix (C_ij = 1 if nodes i and j belong to the same cluster, 0 otherwise) each time, then take the average. If the distribution of values in the average is really strongly bimodal (a giant peak at 0 and at 1, with very little for values strictly in between), that means you can trust the clusters"

## Co-clustering of full oviINr runs
Testing this out on the only run that I have for now.

In [4]:
# file path for oviIN modularity data for full ovi connectome
os.chdir('/Users/ggutierr/My Drive (ggutierr@barnard.edu)/GitHub/oviIN-analyses-gabrielle/ovi_preprocessed/preprocessed-v1.2.1')

ovi_HB_node_df = pd.read_csv('preprocessed_nodes.csv')

In [4]:
ovi_HB_node_df

Unnamed: 0_level_0,key,0.0,0.05,0.1,0.5,0.75,1.0,instance,celltype,pre,...,status,cropped,statusLabel,cellBodyFiber,somaRadius,somaLocation,roiInfo,notes,inputRois,outputRois
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1003215282,1,1,1,1,1,1,1,CL229_R,CL229,100,...,Traced,False,Roughly traced,PDM19,301.0,"[23044, 14981, 11600]","{'INP': {'pre': 87, 'post': 351, 'downstream':...",,"['EPA(R)', 'GOR(R)', 'IB', 'ICL(R)', 'INP', 'S...","['GOR(R)', 'IB', 'ICL(R)', 'INP', 'SCL(R)', 'S..."
1005952640,2,1,1,2,2,2,2,IB058_R,IB058,664,...,Traced,False,Roughly traced,PVL20,,,"{'INP': {'pre': 464, 'post': 1327, 'downstream...",,"['ATL(R)', 'IB', 'ICL(R)', 'INP', 'PLP(R)', 'S...","['ATL(R)', 'IB', 'ICL(R)', 'INP', 'PLP(R)', 'S..."
1006928515,3,1,1,1,3,3,3,CL300_R,CL300,86,...,Traced,False,Roughly traced,PVL13,236.0,"[12083, 10523, 16816]","{'INP': {'pre': 79, 'post': 126, 'downstream':...",,"['ATL(R)', 'IB', 'ICL(R)', 'INP', 'SCL(R)', 'S...","['ATL(R)', 'IB', 'ICL(R)', 'INP', 'SCL(R)', 'S..."
1007260806,4,1,2,1,4,4,4,CL301_R,CL301,119,...,Traced,False,Roughly traced,PVL13,236.0,"[13524, 10108, 16480]","{'INP': {'pre': 40, 'post': 128, 'downstream':...",,"['GOR(R)', 'IB', 'ICL(R)', 'INP', 'PLP(R)', 'S...","['IB', 'ICL(R)', 'INP', 'PLP(R)', 'SCL(R)', 'S..."
1007402796,5,1,1,2,5,5,5,PS119_R,PS119,245,...,Traced,False,Roughly traced,PDM16,301.0,"[25364, 12010, 12544]","{'SNP(R)': {'pre': 100, 'post': 50, 'downstrea...",,"['CAN(R)', 'GOR(R)', 'IB', 'ICL(L)', 'ICL(R)',...","['AVLP(R)', 'CAN(R)', 'IB', 'ICL(L)', 'INP', '..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
988269593,4545,3,5,5,95,13,1785,FB4E_L,FB4E,168,...,Traced,False,Roughly traced,AVM08,,,"{'SNP(L)': {'post': 25, 'upstream': 25, 'mito'...",CRELALFB4_1,"['CRE(-RUB)(L)', 'CRE(L)', 'CX', 'FB', 'FB-col...","['CRE(-RUB)(L)', 'CRE(L)', 'CX', 'FB', 'FB-col..."
988291460,4546,4,4,263,1059,1501,1786,,,2,...,Assign,,0.5assign,,,,"{'SNP(L)': {'pre': 2, 'post': 1, 'downstream':...",,"['SMP(L)', 'SNP(L)']","['SMP(L)', 'SNP(L)']"
988567837,4547,5,8,7,13,13,13,FB4G_R,FB4G,785,...,Traced,False,Roughly traced,AVM08,,,"{'SNP(R)': {'pre': 6, 'post': 73, 'downstream'...",CRELALFB4_3,"['CRE(-ROB,-RUB)(R)', 'CRE(R)', 'CX', 'FB', 'F...","['CRE(-ROB,-RUB)(R)', 'CRE(R)', 'CX', 'FB', 'F..."
988909130,4548,5,8,7,27,56,436,FB5V_R,FB5V,269,...,Traced,False,Roughly traced,AVM10,296.5,"[13226, 32024, 18600]","{'SNP(R)': {'pre': 1, 'post': 28, 'downstream'...",CRELALFB5,"['AB(R)', 'CRE(-ROB,-RUB)(R)', 'CRE(R)', 'CX',...","['CRE(-ROB,-RUB)(R)', 'CRE(R)', 'CX', 'FB', 'F..."


In [6]:
test = ovi_HB_node_df[['id','0.0']].copy()

In [7]:
test = test[4544:]
test

Unnamed: 0,id,0.0
4544,988269593,3
4545,988291460,4
4546,988567837,5
4547,988909130,5
4548,989228019,5


Co-pilot helped me out with a nested list comprehension for creating the co-clustering matrix. All I need to do is then create a co-clustering matrix for each run and average them. I plan to put this into a for loop where I iterate over each run, create its co-clustering matrix and add it to an accumulating matrix. After the loop, I will divide all elements in the matrix by the number of runs and thus obtain the average. I need to ensure that bodyIDs are always in the same order. They seem to be sorted, but make sure.

There is one potentially big problem with this approach. The clusters might be numbered differently in different runs even if the same nodes stick together. For this reason, I think Jaccard similarity will be the way to go.

In [8]:
# Create the co-clustering matrix
coclust_matrix = pd.DataFrame([[1 if test['0.0'].iloc[i] == test['0.0'].iloc[j] else 0 for j in range(len(test))] for i in range(len(test))], 
                      index=test['id'], columns=test['id'])

coclust_matrix

id,988269593,988291460,988567837,988909130,989228019
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
988269593,1,0,0,0,0
988291460,0,1,0,0,0
988567837,0,0,1,1,1
988909130,0,0,1,1,1
989228019,0,0,1,1,1


# Jaccard similarity of full oviINr runs

First, make a for loop and load each df. Attach the coarse column to a df that will collect the coarse columns of each run. Exit loop. Then nested (triangular) loop to compute similarity between pairs of columns.
The end result can be a matrix where runs are rows and columns and entries are similarity.

# Jaccard similarities between connectome configurations
Now I will compute similarities between columns from different dataframes. Because there might be different sets of body IDs for the different configurations, we may need a new function that compares 2 dataframes. 

Ultimately, when comparing different configurations, perhaps the most fair thing to do is to only do so with intersecting groups of neurons. For example, if I want to compare the full oviINr clustering to the oviINr inputs, I really only want to work with nodes that are inputs to oviINr. This would effectively be a quantification of the Sankeys that I did in sankey_modularity.ipynb. In which case, I may not need to create new functions, I just have to do some extra pre-processing steps.

What I'll do here is to create a new dataframe that combines neurons in common between 2 partition datasets. Then we can simply use the same functions for computing Jaccard similarity that we already have.

In [10]:
import os

# file path for oviIN modularity data for full ovi connectome
os.chdir('/Users/ggutierr/My Drive (ggutierr@barnard.edu)/GitHub/oviIN-analyses-gabrielle/hemibrain_preprocessed/preprocessed-v1.2')

# load in the data
HB_node_df = pd.read_csv('preprocessed_nodes.csv')

In [5]:
import os

# file path for oviIN modularity data for ovi input connectome
os.chdir('/Users/ggutierr/My Drive (ggutierr@barnard.edu)/GitHub/oviIN-analyses-gabrielle/ovi_preprocessed/preprocessed_inputs-v1.2.1')

ovi_in_node_df = pd.read_csv('preprocessed_nodes.csv')
ovi_in_node_df

Unnamed: 0,id,key,0.0,0.05,0.1,0.5,0.75,1.0,instance,celltype,...,status,cropped,statusLabel,cellBodyFiber,somaRadius,somaLocation,roiInfo,notes,inputRois,outputRois
0,1003215282,1,1,1,1,1,1,1,CL229_R,CL229,...,Traced,False,Roughly traced,PDM19,301.0,"[23044, 14981, 11600]","{'INP': {'pre': 87, 'post': 351, 'downstream':...",,"['EPA(R)', 'GOR(R)', 'IB', 'ICL(R)', 'INP', 'S...","['GOR(R)', 'IB', 'ICL(R)', 'INP', 'SCL(R)', 'S..."
1,1005952640,2,2,1,1,2,2,2,IB058_R,IB058,...,Traced,False,Roughly traced,PVL20,,,"{'INP': {'pre': 464, 'post': 1327, 'downstream...",,"['ATL(R)', 'IB', 'ICL(R)', 'INP', 'PLP(R)', 'S...","['ATL(R)', 'IB', 'ICL(R)', 'INP', 'PLP(R)', 'S..."
2,1006928515,3,1,1,1,3,3,3,CL300_R,CL300,...,Traced,False,Roughly traced,PVL13,236.0,"[12083, 10523, 16816]","{'INP': {'pre': 79, 'post': 126, 'downstream':...",,"['ATL(R)', 'IB', 'ICL(R)', 'INP', 'SCL(R)', 'S...","['ATL(R)', 'IB', 'ICL(R)', 'INP', 'SCL(R)', 'S..."
3,1007260806,4,2,1,1,4,4,4,CL301_R,CL301,...,Traced,False,Roughly traced,PVL13,236.0,"[13524, 10108, 16480]","{'INP': {'pre': 40, 'post': 128, 'downstream':...",,"['GOR(R)', 'IB', 'ICL(R)', 'INP', 'PLP(R)', 'S...","['IB', 'ICL(R)', 'INP', 'PLP(R)', 'SCL(R)', 'S..."
4,1008024276,5,3,2,2,5,5,5,FB5N_R,FB5N,...,Traced,False,Roughly traced,AVM08,472.5,"[19178, 29711, 37312]","{'SNP(L)': {'post': 5, 'upstream': 5, 'mito': ...",SMPCREFB5_4,"['CRE(-ROB,-RUB)(R)', 'CRE(R)', 'CX', 'FB', 'F...","['CRE(-ROB,-RUB)(R)', 'CRE(R)', 'CX', 'FB', 'F..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2506,987273073,2507,3,8,8,409,604,629,(PVL05)_L,,...,Traced,False,Roughly traced,,,,"{'SNP(R)': {'pre': 65, 'post': 52, 'downstream...",,"['CRE(-ROB,-RUB)(R)', 'CRE(-RUB)(L)', 'CRE(L)'...","['CRE(-ROB,-RUB)(R)', 'CRE(-RUB)(L)', 'CRE(L)'..."
2507,987842109,2508,3,9,23,533,780,815,,,...,Orphan,,Orphan hotknife,,,,"{'SNP(R)': {'pre': 2, 'post': 13, 'downstream'...",,"['SMP(R)', 'SNP(R)']","['SMP(R)', 'SNP(R)']"
2508,988567837,2509,2,3,4,16,58,63,FB4G_R,FB4G,...,Traced,False,Roughly traced,AVM08,,,"{'SNP(R)': {'pre': 6, 'post': 73, 'downstream'...",CRELALFB4_3,"['CRE(-ROB,-RUB)(R)', 'CRE(R)', 'CX', 'FB', 'F...","['CRE(-ROB,-RUB)(R)', 'CRE(R)', 'CX', 'FB', 'F..."
2509,988909130,2510,2,3,4,389,559,572,FB5V_R,FB5V,...,Traced,False,Roughly traced,AVM10,296.5,"[13226, 32024, 18600]","{'SNP(R)': {'pre': 1, 'post': 28, 'downstream'...",CRELALFB5,"['AB(R)', 'CRE(-ROB,-RUB)(R)', 'CRE(R)', 'CX',...","['CRE(-ROB,-RUB)(R)', 'CRE(R)', 'CX', 'FB', 'F..."


In [11]:
# return to cd
os.chdir('/Users/ggutierr/My Drive (ggutierr@barnard.edu)/GitHub/oviIN-analyses-gabrielle')

In [22]:
test1 = HB_node_df[['id','0.0','celltype','instance']].copy()
test2 = ovi_in_node_df[['id','0.0']].copy()

In [23]:
test3 = pd.merge(test1, test2, on='id', how='inner')
test3

Unnamed: 0,id,0.0_x,celltype,instance,0.0_y
0,263674097,3,LHPD2a5_a,LHPD2a5_a_R,5
1,266187480,3,SMP349,SMP349_R,5
2,266187559,3,SLP399,SLP399_R,5
3,267214250,3,pC1b,pC1b_R,5
4,267223104,4,SMP025,SMP025_R,5
...,...,...,...,...,...
1827,5901225755,1,,,2
1828,5901227238,1,,,2
1829,5901232053,3,SMP272,SMP272(PDL21)_L,1
1830,6400000773,3,SMP411,SMP411_R,5


In [26]:
coarse_col = '0.0_x'
chi = '0.0_y'
test4 = main_jaccard(test3, coarse_col, chi)
test4

NameError: name 'unique_1' is not defined

In [133]:
# create a new df from the reference column
configs_df = df['0.0'].copy()

# append column
configs_df['0.05'] = df['0.05']


id
1003215282    1
1005952640    1
1006928515    1
1007260806    1
1007402796    1
             ..
988269593     3
988291460     4
988567837     5
988909130     5
989228019     5
Name: 0.0, Length: 4549, dtype: int64