# Data Conditioning when using data from Neuprint and FlyWire
When we ask questions about neuron connectivity, such as how many synapses a neuron receives and where, we have to make sure that the data we pull from the databases answers that question accurately. Following the same steps every time ensures that the data is always treated the same way. This notebook covers two sources:
1. Neuprint
2. FlyWire

## Neuprint

In [1]:
# Connecting to Neuprint
# Import packages from neuprint (Setting up access is shown in the tutorial file on basecamp)
from neuprint import Client, fetch_neurons, NeuronCriteria as NC, fetch_adjacencies

# Load the authentication token from a file.
# I store my authentication token in a file called "flybrain.auth.txt", which makes it easier to access and keeps it out of the code.
# The with-block closes the file once the token has been read.
with open("flybrain.auth.txt", 'r') as auth_token_file:
    auth_token = auth_token_file.readline().strip()

try:
    np_client = Client('neuprint.janelia.org', dataset='hemibrain:v1.2.1', token=auth_token)
except Exception as err:
    # Report why the connection failed instead of hiding the error
    print(f"Failed to connect to Neuprint: {err}")
    np_client = None

In [2]:
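# Pulling data from Neuprint using fetch_adjacencies
# fetch_adjacencies returns the connections between neurons that match the criteria you set.
# Here the source is None and the target is our neuron of interest, so we get the connections onto that neuron.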
neuron_data, conn_data = fetch_adjacencies(None, NC(bodyId=423101189))

# Here we can see the expected number of rows (one for each neuron)
neuron_data

Unnamed: 0,bodyId,type,instance
0,423101189,oviIN,oviIN_R
1,234630133,SMP184,SMP184(PDL05)_L
2,263674097,LHPD2a5_a,LHPD2a5_a_R
3,266187480,SMP349,SMP349_R
4,266187559,SLP399,SLP399_R
...,...,...,...
2520,5901231318,,
2521,5901232053,SMP272,SMP272(PDL21)_L
2522,6400000773,SMP411,SMP411_R
2523,7112622044,LAL137,LAL137(PVL05)_L


**In the connection dataframe, we can see that fetch_adjacencies returns a separate row for each ROI in which a pair of neurons connects, rather than one row per pair.**
For this specific query, that adds about 1,000 extra rows: conn_data has 3,530 rows, while neuron_data lists only 2,524 neurons.

In [3]:
# Print the connection data; the same pre/post pair is repeated once for each ROI in which it connects
conn_data

Unnamed: 0,bodyId_pre,bodyId_post,roi,weight
0,234630133,423101189,CRE(R),2
1,263674097,423101189,SMP(R),2
2,266187480,423101189,SMP(R),1
3,266187559,423101189,SMP(R),3
4,267214250,423101189,SMP(R),9
...,...,...,...,...
3526,6400000773,423101189,SMP(R),2
3527,7112622044,423101189,SIP(R),1
3528,7112622044,423101189,SMP(R),1
3529,7112622044,423101189,SMP(L),1
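
To quantify how many rows are redundant, one option is to compare the row count with the number of unique pre/post pairs. The sketch below uses a small made-up dataframe with the same column names as conn_data; the name toy_conn and all body IDs and weights in it are illustrative assumptions, not hemibrain values.

```python
import pandas as pd

# Toy stand-in for conn_data (values are made up for illustration)
toy_conn = pd.DataFrame({
    "bodyId_pre":  [101, 102, 102, 103, 103, 103],
    "bodyId_post": [999, 999, 999, 999, 999, 999],
    "roi":         ["CRE(R)", "SMP(R)", "SIP(R)", "SIP(R)", "SMP(R)", "SMP(L)"],
    "weight":      [2, 1, 4, 1, 1, 1],
})

# Number of distinct pre/post pairs, regardless of ROI
n_pairs = len(toy_conn.drop_duplicates(subset=["bodyId_pre", "bodyId_post"]))

# Every row beyond the first for a given pair is a per-ROI repeat
n_extra_rows = len(toy_conn) - n_pairs

print(f"{len(toy_conn)} rows, {n_pairs} unique pairs, {n_extra_rows} extra rows")
```

Running the same two lines on the real conn_data would give the number of rows that exist only because a connection spans more than one ROI.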


In [4]:
# When working with synapses, we need to make sure a pair of neurons is accurately represented by a single number: the total number of synapses between them.
# To do this, we group the data by the pre- and post-synaptic neurons and sum the weight column.
# Selecting [['weight']] explicitly keeps the non-numeric roi column out of the sum.
conn_grouped = conn_data.groupby(['bodyId_pre','bodyId_post'])[['weight']].sum()

# After doing that we can see that we have the correct number of rows. Each row represents a connection between a pre-synaptic neuron and our neuron of interest.
# There is one fewer row than in neuron_data because our neuron of interest is itself included in that dataframe.
conn_grouped


bodyId_pre,bodyId_post,weight
234630133,423101189,2
263674097,423101189,2
266187480,423101189,1
266187559,423101189,3
267214250,423101189,9
...,...,...
5901231318,423101189,1
5901232053,423101189,3
6400000773,423101189,2
7112622044,423101189,3
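
A quick sanity check after grouping is to confirm that no synapses were lost and that each pair now appears exactly once. The following is a minimal sketch on made-up data with the same columns as conn_data; toy_conn and toy_grouped are placeholder names, and the values are not from the hemibrain.

```python
import pandas as pd

# Toy stand-in for conn_data (values are made up for illustration)
toy_conn = pd.DataFrame({
    "bodyId_pre":  [101, 102, 102, 103, 103, 103],
    "bodyId_post": [999, 999, 999, 999, 999, 999],
    "roi":         ["CRE(R)", "SMP(R)", "SIP(R)", "SIP(R)", "SMP(R)", "SMP(L)"],
    "weight":      [2, 1, 4, 1, 1, 1],
})

# Collapse the per-ROI rows into one row per pre/post pair
toy_grouped = toy_conn.groupby(
    ["bodyId_pre", "bodyId_post"], as_index=False
)["weight"].sum()

# Check 1: the total synapse count is unchanged by the grouping
assert toy_grouped["weight"].sum() == toy_conn["weight"].sum()

# Check 2: no pre/post pair appears more than once
assert not toy_grouped.duplicated(subset=["bodyId_pre", "bodyId_post"]).any()

print(toy_grouped)
```

If either assertion fails on real data, the grouping step dropped or duplicated connections and should be inspected before any downstream analysis.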


## FlyWire

The steps can differ a bit depending on which dataset you download from the Codex. In general, it is important to check whether the connections are already collapsed across neuropils or are still listed separately for each neuropil.

In [13]:
# Let's import the dataframe I have downloaded from the Codex
import pandas as pd
connections = pd.read_csv('data/connections_example.csv', header=0, sep="\t")

# Note that in this dataframe each connection is listed separately for every neuropil in which it occurs
# This usually means that we want to group the connections by neuron pair and sum the weights, ignoring the neuropil column
connections

Unnamed: 0,pre_root_id,post_root_id,neuropil,syn_count,nt_type
0,720575940629970489,720575940631267655,AVLP_R,7,GABA
1,720575940605876866,720575940606514878,LAL_R,15,GABA
2,720575940627737365,720575940628914436,AL_L,32,ACH
3,720575940633587552,720575940626452879,SMP_R,15,ACH
4,720575940616871878,720575940621203973,AVLP_L,13,GABA
...,...,...,...,...,...
994,720575940627181498,720575940632221209,LAL_R,22,ACH
995,720575940629370967,720575940609268316,PRW,3,GLUT
996,720575940621298523,720575940619076918,AVLP_R,6,ACH
997,720575940645056791,720575940654057889,PRW,40,ACH


In [14]:
# Print out repeated pairings in pre_root_id and post_root_id
connections[connections.duplicated(subset=['pre_root_id','post_root_id'], keep=False)]

Unnamed: 0,pre_root_id,post_root_id,neuropil,syn_count,nt_type
429,720575940624942855,720575940619134789,SIP_R,5,GLUT
704,720575940624942855,720575940619134789,CRE_R,6,GLUT


We can see that there are repeated pairings, separated only because the synapses are made in different neuropils. Most analyses based on connection weight don't require separation by neuropil.
Therefore we need to combine those rows while adding their syn_count values together, so that we still represent the data accurately!

In [30]:
# This is how we can do that:
# first we drop the nt_type and neuropil columns
connections_grouping = connections[['pre_root_id','post_root_id','syn_count']] # the double brackets select only the listed columns and return them as a new dataframe

# Then we group by the pre and post root id and sum the syn_count, we also reset the index for readability
connections_grouped = connections_grouping.groupby(['pre_root_id','post_root_id']).sum().reset_index()
connections_grouped

Unnamed: 0,pre_root_id,post_root_id,syn_count
0,720575940596125868,720575940605825666,2
1,720575940596125868,720575940606217138,1
2,720575940596125868,720575940608552405,5
3,720575940596125868,720575940609975854,4
4,720575940596125868,720575940613059993,5
...,...,...,...
15091978,720575940661338497,720575940643867296,1
15091979,720575940661338497,720575940645527918,1
15091980,720575940661338497,720575940647030580,1
15091981,720575940661339777,720575940616982614,4


From this grouping we can see that the dataframe collapsed by almost 1.8 million rows! This matters because we want each pairing to be represented exactly once, with all of the synapses between the two neurons counted in a single row.
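
Because the Neuprint and FlyWire steps are the same operation with different column names, they could be wrapped in one small helper. The sketch below is one possible way to do that, not part of either library: collapse_connections and toy_flywire are hypothetical names, and the rows are made up in the FlyWire column layout.

```python
import pandas as pd

def collapse_connections(df, pre_col, post_col, weight_col):
    """Return one row per (pre, post) pair with the weights summed."""
    collapsed = df.groupby([pre_col, post_col], as_index=False)[weight_col].sum()
    # Guard against silently losing synapses during the collapse
    if collapsed[weight_col].sum() != df[weight_col].sum():
        raise ValueError("Total synapse count changed during grouping")
    return collapsed

# Made-up rows in the FlyWire column layout (not real root IDs)
toy_flywire = pd.DataFrame({
    "pre_root_id":  [1, 1, 2],
    "post_root_id": [10, 10, 10],
    "neuropil":     ["SIP_R", "CRE_R", "SMP_R"],
    "syn_count":    [5, 6, 3],
    "nt_type":      ["GLUT", "GLUT", "ACH"],
})

flywire_collapsed = collapse_connections(
    toy_flywire, "pre_root_id", "post_root_id", "syn_count"
)
print(flywire_collapsed)
```

For the Neuprint dataframe, the same helper would be called with 'bodyId_pre', 'bodyId_post', and 'weight'. Keeping the total-count check inside the function means every dataset is conditioned and verified the same way.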