Jaccard similarity between different configurations of oviIN modularity
* full oviINr connectome
* oviINr inputs
* oviINr outputs
* combined oviIN connectome
* oviINr connectome without oviINr
* combined oviIN connectome with oviINs

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Testing with dummy data
I'll do this with dummy data first: a dataframe with 2 columns, each holding 2 clusters. Each cluster in one column shares 50% of its members with each cluster in the other column.

In [82]:
# Create the dummy data for the columns
column_0 = np.concatenate([np.full(10, 1), np.full(10, 2)])
column_05 = np.tile([1, 2], 10)

# Create the DataFrame
df = pd.DataFrame({'0.0': column_0, '0.05': column_05})
df

Unnamed: 0,0.0,0.05
0,1,1
1,1,2
2,1,1
3,1,2
4,1,1
5,1,2
6,1,1
7,1,2
8,1,1
9,1,2
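
To make the 50% overlap concrete, a cross-tabulation counts how many rows fall into each pair of clusters. This is a minimal sketch that rebuilds the dummy dataframe; the name `overlap` is introduced here for illustration and is not used elsewhere in the notebook.

```python
import numpy as np
import pandas as pd

# Rebuild the dummy partition: 20 rows, two resolution columns.
df = pd.DataFrame({
    '0.0': np.concatenate([np.full(10, 1), np.full(10, 2)]),
    '0.05': np.tile([1, 2], 10),
})

# Rows are the coarse clusters ('0.0'), columns are the chi clusters ('0.05').
# Each cell counts the rows that belong to that pair of clusters.
overlap = pd.crosstab(df['0.0'], df['0.05'])
print(overlap)
```

Every cell holds 5 rows, so each coarse cluster of 10 members is split evenly across the two chi clusters.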


In [83]:
# get the coarse modules from the data
coarse_modules = df['0.0'].unique().tolist()
coarse_modules

[1, 2]

dict_zero maps each of the 2 coarse clusters to the index labels of its rows (these are the body IDs if the dataframe is indexed by body ID).

In [84]:
# Get bodyIds at zero resolution for each coarse cluster
dict_zero = {module: [] for module in coarse_modules}
for key, value in dict_zero.items():
    dict_zero[key] = df[df['0.0']==key].index.tolist()
dict_zero

{1: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 2: [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]}
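
The same mapping can be built without an explicit loop by grouping on the coarse column. This is a minimal sketch, assuming the dummy dataframe from above and integer cluster labels; `dict_zero_alt` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the dummy partition used in this section.
df = pd.DataFrame({
    '0.0': np.concatenate([np.full(10, 1), np.full(10, 2)]),
    '0.05': np.tile([1, 2], 10),
})

# groupby(...).groups maps each coarse cluster label to the index labels of its rows.
# int(label) assumes the cluster labels are integers, as they are in the dummy data.
dict_zero_alt = {
    int(label): idx.tolist()
    for label, idx in df.groupby('0.0').groups.items()
}
print(dict_zero_alt)
```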

The keys in dict are the coarse cluster numbers and the values are the numbers of the chi clusters that contain members of that coarse cluster.

In [85]:
# Gets cluster numbers for chi resolution for each coarse cluster
chi = '0.05'  # resolution column to compare against the base resolution '0.0'
chi_values = [chi]
dict = {}

for i, x in enumerate(coarse_modules):
    # grab a coarse cluster
    df_test = df[df['0.0']==x]
    for f, y in enumerate(chi_values):
        # take new cluster numbers overlapping with coarse cluster
        cluster = df_test[y].unique()
        #cluster_all[i,f, :len(cluster)] = cluster
    dict[coarse_modules[i]] = cluster
dict

{1: array([1, 2]), 2: array([1, 2])}

The keys in dict_new are also the coarse cluster numbers, but the values are the row indices of every member of all the chi clusters listed for that coarse cluster.

In [86]:
# Get bodyIDs for each cluster number in the resolution
dict_new = {module: [] for module in coarse_modules}
for key, value in dict.items():
    body_ids = []
    for i, x in enumerate(value):
        # appends all the body_ids from the chi clusters
        body_ids.extend(df[df[chi]==x].index.tolist())
    dict_new[key] = body_ids
dict_new

{1: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19],
 2: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19]}

These dictionaries can then be used to compute the Jaccard similarity between a coarse cluster and the chi clusters that contain its members. Rhessa did something like this.

In [87]:
# `key` still holds the last key from the loop above (coarse cluster 2)
set1 = set(dict_new[key])
set2 = set(dict_zero[key])
set1

{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19}

In [95]:
set1-set2

{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}

In [93]:
set2

{10, 11, 12, 13, 14, 15, 16, 17, 18, 19}

In [96]:
set2-set1

set()

In [90]:
unique_1 = set1-set2
unique_2 = set2-set1
unique_1

{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}

In [97]:
common = set1.intersection(set2)
# despite its name, total_unique counts only the elements found in exactly one of the two sets
total_unique = len(unique_1) + len(unique_2)
jaccard_sim = len(common) / (total_unique + len(common))

In [99]:
common

{10, 11, 12, 13, 14, 15, 16, 17, 18, 19}

In [98]:
total_unique

10

Adding len(common) to total_unique gives the size of the union, so I will take the union of the sets directly instead.

In [105]:
# the union of set1 and set2
union_1_2 = set1 | set2
union_1_2

{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19}

In [106]:
# jaccard similarity is intersection divided by union
jaccard_sim = len(common) / len(union_1_2)
jaccard_sim

0.5
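
The two denominators used above agree because the elements unique to each set and the common elements are disjoint pieces of the union. The following sketch checks this on the dummy sets for coarse cluster 2; the names `only_1`, `only_2`, `denominator_from_parts`, and `denominator_from_union` are introduced here for illustration.

```python
# Sets from the dummy example for coarse cluster 2:
# set1 holds all 20 row indices, set2 holds the members of coarse cluster 2.
set1 = set(range(20))
set2 = set(range(10, 20))

common = set1 & set2   # elements in both sets
only_1 = set1 - set2   # elements only in set1
only_2 = set2 - set1   # elements only in set2

# The three pieces are disjoint, so their sizes add up to the size of the union.
denominator_from_parts = len(only_1) + len(only_2) + len(common)
denominator_from_union = len(set1 | set2)

jaccard_sim = len(common) / denominator_from_union
print(denominator_from_parts, denominator_from_union, jaccard_sim)
```

Both denominators come out to 20, and the similarity is 10 / 20 = 0.5 either way.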

We can continue and make a dictionary to store jaccard values for the different coarse clusters.

In [107]:
# Create a dictionary to store the jaccard similarities
# (calculate_difference is defined in the functions cell further down; run that cell first)
jaccard_dict = {}
for key, value in dict_new.items():
    jaccard_dict[key] = []
    unique_1, unique_2, common, jaccard_sim = calculate_difference(dict_new[key], dict_zero[key])
    jaccard_dict[key].append(jaccard_sim)
    
jaccard_dict

{1: [0.5], 2: [0.5]}

In [108]:
# Create a dataframe to store the jaccard similarities
df_jaccard = pd.DataFrame(jaccard_dict)
df_jaccard

Unnamed: 0,1,2
0,0.5,0.5


I'm not sure why Rhessa wanted to shift the indexing.

In [109]:
# starts indexing at 1
df_jaccard.index = np.arange(1, len(df_jaccard)+1)
df_jaccard

Unnamed: 0,1,2
1,0.5,0.5


# Updates to Rhessa's Jaccard functions
The modified functions are below. They are tested on the dummy data first; df is then overwritten with the oviIN connectome data and the test is repeated.

In [110]:
# create a function that takes in two lists of id numbers and returns the elements unique to each list,
# the elements common to both, and the jaccard sim of the two lists
def calculate_difference(list1, list2):
    set1 = set(list1)
    set2 = set(list2)

    # elements found in only one of the two sets
    unique_1 = set1 - set2
    unique_2 = set2 - set1

    # the intersection of set1 and set2
    common = set1.intersection(set2)

    # the union of set1 and set2
    union_1_2 = set1 | set2

    # Jaccard similarity index for set1 and set2
    jaccard_sim = len(common) / len(union_1_2)

    return unique_1, unique_2, common, jaccard_sim

# function that takes in a partition dataframe, the modules at the base resolution, and the chi value at which to compare
def get_body_ids(df, coarse_modules, chi):
    """ This function takes in a partition dataframe, the modules at the base resolution ('0.0'), and the chi
    column at which to compare. It returns two dictionaries keyed by coarse module: the body IDs in every chi
    cluster that contains at least one member of the module, and the body IDs in the module itself."""
    # Get bodyIds at zero resolution for each coarse cluster
    dict_zero = {module: df[df['0.0'] == module].index.tolist() for module in coarse_modules}

    # Get the chi cluster numbers that overlap each coarse cluster
    chi_clusters = {module: df.loc[df['0.0'] == module, chi].unique() for module in coarse_modules}

    # Get bodyIDs for each cluster number in the chi resolution clusters
    dict_new = {}
    for module, clusters in chi_clusters.items():
        body_ids = []
        for cluster in clusters:
            # appends all the body_ids from the chi clusters
            body_ids.extend(df[df[chi] == cluster].index.tolist())
        dict_new[module] = body_ids

    return dict_new, dict_zero
    
# Function that takes in partition dataframe, the modules at the base resolution, and the chi value at which to compare
def main_jaccard(df, coarse_modules, chi):
    """ This function takes in a partition dataframe, the modules at the base resolution, and the chi
    value at which to compare. It returns a one-row dataframe with one column per coarse module, holding the
    jaccard similarity between that module and the chi clusters that contain its members."""

    # Get the body IDs for each cluster number at the resolution and the base resolution
    dict_new, dict_zero = get_body_ids(df, coarse_modules, chi)

    # Create a dictionary to store the jaccard similarities
    jaccard_dict = {}
    for key, value in dict_new.items():
        jaccard_dict[key] = []
        unique_1, unique_2, common, jaccard_sim = calculate_difference(dict_new[key], dict_zero[key])
        jaccard_dict[key].append(jaccard_sim)
    
    # Create a dataframe to store the jaccard similarities
    df_jaccard = pd.DataFrame(jaccard_dict)
    
    return df_jaccard

In [111]:
test = main_jaccard(df, coarse_modules, '0.05')
test

Unnamed: 0,1,2
0,0.5,0.5
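
The same comparison can be repeated over several resolution columns to get one row of similarities per chi value. This is a self-contained sketch rather than the notebook's own functions: `jaccard_per_module` is a hypothetical helper that follows the same logic as main_jaccard, and the '0.1' column is invented here purely to give the sweep a second resolution to compare.

```python
import numpy as np
import pandas as pd

def jaccard_per_module(df, base_col, chi_col):
    """Hypothetical helper: Jaccard similarity between each base-resolution cluster
    and the union of the chi-resolution clusters that contain its members."""
    scores = {}
    for module, rows in df.groupby(base_col):
        base_ids = set(rows.index)
        # chi clusters that contain at least one member of this base cluster
        chi_clusters = rows[chi_col].unique()
        # every row that belongs to any of those chi clusters
        chi_ids = set(df.index[df[chi_col].isin(chi_clusters)])
        scores[module] = len(base_ids & chi_ids) / len(base_ids | chi_ids)
    return scores

df = pd.DataFrame({
    '0.0': np.concatenate([np.full(10, 1), np.full(10, 2)]),
    '0.05': np.tile([1, 2], 10),
    # invented column: same grouping as '0.0' but with different labels
    '0.1': np.concatenate([np.full(10, 3), np.full(10, 4)]),
})

# One row per chi value, one column per coarse module.
sweep = pd.DataFrame(
    {chi: jaccard_per_module(df, '0.0', chi) for chi in ['0.05', '0.1']}
).T
print(sweep)
```

On this data the '0.05' row is 0.5 for both modules, matching the result above, and the invented '0.1' row is 1.0 because its clusters coincide with the coarse ones.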


In [112]:
import os

# file path for oviIN modularity data for full ovi connectome
os.chdir('/Users/ggutierr/My Drive (ggutierr@barnard.edu)/GitHub/oviIN-analyses-gabrielle/ovi_preprocessed/preprocessed-v1.2.1')

path = os.getcwd()

# load in the data
# index_col is left unset, so the index holds row positions rather than body IDs
ovi_HB_node_df = pd.read_csv('preprocessed_nodes.csv')  # , index_col=0)

In [113]:
df = ovi_HB_node_df

In [115]:
# get the coarse modules from the data
coarse_modules = df['0.0'].unique().tolist()

In [116]:
real_test = main_jaccard(df, coarse_modules, chi)
real_test

Unnamed: 0,1,2,3,4,5,6
0,0.287668,0.221907,0.130645,0.259494,0.166893,0.179653
