# The CAVEclient

The CAVEclient is a client-side library that allows easy interaction with the services within CAVE (Connectome Annotation Versioning Engine, also known as the Dynamic Annotation Framework), e.g. the annotation service and the state server. The GitHub repository is public:
https://github.com/seung-lab/CAVEclient

The library can be installed directly from the GitHub repository or from the prebuilt versions using pip:
```
pip install caveclient
```


## Tutorials

This tutorial mainly covers interactions with the materialized annotation tables. More information on, and fuller explanations of, the client's other functionality can be found in the tutorial linked below. Please note that, depending on your permission level, you may not be able to execute all of its queries with the preset parameters, as it was written with defaults for IARPA's MICrONS project:
https://github.com/seung-lab/CAVEclient/blob/master/CAVEclientExamples.ipynb


## Authentication & Authorization

If this is your first time interacting with any part of CAVE, chances are you need to set up your local credentials for your FlyWire account first. Please follow the section "Setting up your credentials" at the beginning of the tutorial above to do so.

You will need access to FlyWire's production dataset to retrieve annotations. Otherwise you will see

```HTTPError: 403 Client Error: FORBIDDEN for url```

errors upon querying the materialization server.

## Initialize CAVEclient

The CAVEclient is instantiated with a datastack name. A datastack is a set of segmentation and annotation tables and lives within an aligned volume (the coordinate space). FlyWire's main datastack is `flywire_fafb_production`; its aligned volume is `fafb_seung_alignment_v0` (v14.1). For convenience, other defaults are also set at the datastack level.

In [1]:
import numpy as np
import datetime
from caveclient import CAVEclient

In [2]:
datastack_name = "flywire_fafb_production"
client = CAVEclient(datastack_name)

## Annotation tables

Annotations are represented by points in space and parameters (such as size, type). At specific timepoints, annotations are combined with the (proofread) segmentation to create a materialized version of the annotation table. The AnnotationEngine (`client.annotation`) owns the raw annotations and the Materialization Service (`client.materialize`) owns the materialized versions of these tables. 

To check what annotation tables are visible to you run

In [3]:
client.annotation.get_tables()

https://prod.flywire-daf.com/annotation/api/v2/aligned_volume/fafb_seung_alignment_v0/table


['synapses_nt_v1']

Every table has metadata associated with it, which includes information about the owner/creator, a description, and the schema that annotations in this table follow. Please review the metadata of any table before using it, as it might contain instructions and restrictions for its usage and explain how to credit its creators. For instance, the metadata of the (v1) synapse table (`synapses_nt_v1`) includes an extensive description of all its columns, credits the people who created it, and contains instructions for citing this resource, among other things:

In [4]:
meta_data = client.annotation.get_table_metadata("synapses_nt_v1")
print(meta_data["description"])

FlyWire synapse description
Synapse version: 20191211
NT version: 20201223

Synapses in this table consist of a pre- and a postsynaptic point (in nm), confidence scores, and neurotransmitter information. 
The synapses were predicted by Buhmann et al [1] for the v14 alignment of the FAFB dataset. The FlyWire team remapped these synapses into the v14.1 space used by FlyWire with an accuracy of <64nm (therefore, this is a potential source of error). This version of the Buhmann et al. synapses was trained on the initial training set from the calyx and performance varies across brain areas accordingly.
Buhmann et al. assigned two scores to their synapses representing different measurements of confidence. The “connection_score” column contains the scores assigned by them during prediction (higher is more confident) and “cleft_score” contains the scores acquired by Buhmann et al. by using the synapse segmentation from Heinrich et al. [2] (higher is more confident). For the latter, Buhma

The metadata contains information about the schema, which ultimately determines how annotations in a table are structured. All annotations in a table follow the same schema. The synapse table follows the `fly_nt_synapse` schema:

In [5]:
meta_data["schema_type"]

'fly_nt_synapse'

## Materialized annotation tables & Queries

```
materialization = annotation + segmentation snapshot
```

As the segmentation and annotations change over time, we need to create snapshots of a combined view of them (materialized versions). Materialized versions of the annotation tables are generated automatically at a certain frequency. In addition, queries for any timestamp since the latest materialization ("live" queries) are supported by newer versions of the client; see the section on "Live" Materialization Queries below.
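As a conceptual illustration only (this is not how the Materialization Service is implemented), the idea can be sketched on toy data: annotations store the supervoxel each point falls into, and a segmentation snapshot maps every supervoxel to the root id it belonged to at that time. All ids and the `materialize` helper below are made up for this sketch.

```python
# Toy annotations: each bound point stores the supervoxel it falls into.
annotations = [
    {"id": 1, "pre_pt_supervoxel_id": 11, "post_pt_supervoxel_id": 21},
    {"id": 2, "pre_pt_supervoxel_id": 12, "post_pt_supervoxel_id": 22},
]

# Toy segmentation snapshots: supervoxel id -> root id at two points in time.
# Between the snapshots, an edit merged supervoxels 11 and 12 under a new root.
snapshot_t0 = {11: 100, 12: 101, 21: 200, 22: 201}
snapshot_t1 = {11: 102, 12: 102, 21: 200, 22: 201}


def materialize(annotations, supervoxel_to_root):
    """Attach a *_root_id column for every *_supervoxel_id column."""
    rows = []
    for ann in annotations:
        row = dict(ann)
        for column, supervoxel_id in ann.items():
            if column.endswith("_supervoxel_id"):
                root_column = column.replace("_supervoxel_id", "_root_id")
                row[root_column] = supervoxel_to_root[supervoxel_id]
        rows.append(row)
    return rows


table_t0 = materialize(annotations, snapshot_t0)
table_t1 = materialize(annotations, snapshot_t1)

# Same annotation, same supervoxel, different root id after the edit.
print(table_t0[1]["pre_pt_root_id"], table_t1[1]["pre_pt_root_id"])
```

In this sketch only the root id columns differ between the two materialized tables; the annotation ids and supervoxel ids stay the same.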

There are usually a number of materialized versions available at the same time:

In [6]:
client.materialize.get_versions()

[56, 64, 66, 67, 68, 70]

Each version comes with metadata about when it was created and when it will be deleted (expired). Different versions have different lifetimes, and some may be long-term support (LTS) versions. The exact frequency and lifetime of versions will depend on how the community is using these tables.

In [7]:
latest_version = max(client.materialize.get_versions())
client.materialize.get_version_metadata(latest_version)

{'valid': True,
 'time_stamp': datetime.datetime(2021, 6, 10, 8, 10, 0, 286750, tzinfo=datetime.timezone.utc),
 'id': 59,
 'version': 70,
 'datastack': 'flywire_fafb_production',
 'expires_on': datetime.datetime(2021, 6, 12, 8, 10, 0, 286750, tzinfo=datetime.timezone.utc)}

Generally, specifying versions for the materialize service is optional. The latest version is used if no version is defined. 
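For reproducible analyses it can still be useful to pin a version and to check that it has not expired. The following is a minimal sketch on toy metadata shaped like the output above; the `usable_versions` helper and the dates for version 68 are made up for illustration.

```python
import datetime

UTC = datetime.timezone.utc

# Toy metadata entries shaped like the output of get_version_metadata above.
version_metadata = [
    {"version": 68, "valid": True,
     "time_stamp": datetime.datetime(2021, 6, 8, 8, 10, tzinfo=UTC),
     "expires_on": datetime.datetime(2021, 6, 10, 8, 10, tzinfo=UTC)},
    {"version": 70, "valid": True,
     "time_stamp": datetime.datetime(2021, 6, 10, 8, 10, tzinfo=UTC),
     "expires_on": datetime.datetime(2021, 6, 12, 8, 10, tzinfo=UTC)},
]


def usable_versions(metadata, now):
    """Return version numbers that are valid and not expired at `now`, ascending."""
    return sorted(
        entry["version"] for entry in metadata
        if entry["valid"] and entry["expires_on"] > now
    )


now = datetime.datetime(2021, 6, 11, 0, 0, tzinfo=UTC)
print(usable_versions(version_metadata, now))
```

Recording the version number alongside any results makes it possible to tell later which snapshot an analysis was based on.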

Each materialization version contains a set of annotation tables. At the moment all tables are included in a materialization but in the future we might not include all tables in every materialization:

In [8]:
client.materialize.get_tables()

['synapses_nt_v1']

### Queries

Here, we demonstrate some queries with the synapses from Buhmann et al. For some essential annotation types, default tables are defined in the centralized info service. This way, one automatically uses the latest synapse table after an update.

In [9]:
synapse_table = client.info.get_datastack_info()['synapse_table']
print(synapse_table)

synapses_nt_v1


Each table in this list is stored as a SQL table on the backend. The client lets users query these tables conveniently through the frontend of the Materialization Service without writing SQL, and it formats the results as pandas dataframes. Queries are limited to 200k rows so as not to overwhelm the server; if a query matches more rows, only the first 200k are returned. For bulk downloads (e.g. for data preservation before a publication), please contact us.

To demonstrate this, the following query requests the entire table but only returns 200k rows (this should take less than 2 minutes). A warning is raised if the query is cut short.

In [10]:
%%time

syn_df = client.materialize.query_table(synapse_table)



CPU times: user 764 ms, sys: 187 ms, total: 951 ms
Wall time: 6.76 s


In [11]:
print(len(syn_df))

200000
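Because results are truncated at 200k rows, one way to stay below the limit is to split a long list of root ids into batches and to treat any batch that reaches the limit as possibly incomplete. The sketch below assumes a hypothetical callable `query_fn` that wraps a filtered `query_table` call and returns a list of rows; `fake_query` is a stand-in so the example runs without server access.

```python
ROW_LIMIT = 200_000  # row limit stated in the text above


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def query_in_batches(query_fn, root_ids, batch_size, row_limit=ROW_LIMIT):
    """Run `query_fn` once per batch of root ids and collect all rows.

    `query_fn` is a hypothetical callable that takes a list of root ids and
    returns a list of rows. Raises if a batch may have been truncated.
    """
    rows = []
    for batch in chunked(root_ids, batch_size):
        result = query_fn(batch)
        if len(result) >= row_limit:
            raise RuntimeError(
                f"Batch of {len(batch)} ids reached the row limit; "
                "use a smaller batch_size."
            )
        rows.extend(result)
    return rows


# Stand-in for a real query: pretend every root id has exactly three synapses.
def fake_query(root_ids):
    return [(root_id, i) for root_id in root_ids for i in range(3)]


rows = query_in_batches(fake_query, list(range(10)), batch_size=4)
print(len(rows))
```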


The `query_table` call above does not set a materialization version, so it defaults to the most recent version. To pin a specific version, pass it via the `materialization_version` parameter of `query_table`.

Let's take a brief look at the columns to illustrate how the materialization extends an annotation table:

In [12]:
syn_df.head()

Unnamed: 0,id,valid,pre_pt_supervoxel_id,pre_pt_root_id,post_pt_supervoxel_id,post_pt_root_id,connection_score,cleft_score,gaba,ach,glut,oct,ser,da,valid_nt,pre_pt_position,post_pt_position
0,102406485,t,81559230397890019,720575940633908343,81559230397890088,720575940572044374,217.02092,122,0.001483,0.988252,1.837718e-05,6.6e-05,0.000337,0.009844,t,"[636900, 135040, 152840]","[636832, 135124, 152880]"
1,101781363,t,81139629272907063,720575940630969051,81139629272948248,720575940599614314,22.191496,0,0.105925,0.352661,0.3122027,0.000107,0.176082,0.053023,t,"[613864, 292888, 152600]","[613868, 292984, 152640]"
2,102336123,t,83251997268285653,720575940627576594,83181628524108478,720575940590270455,115.574806,57,0.005971,0.927559,0.0007128069,0.063133,2e-05,0.002605,t,"[732880, 370040, 151320]","[732752, 370080, 151320]"
3,101781421,t,73469504877006214,720575940589372239,73469504876999566,720575940587527270,141.460388,144,0.000468,0.553068,1.522532e-08,1.3e-05,3e-06,0.446448,t,"[166996, 294144, 151280]","[166920, 294024, 151280]"
4,101781599,t,80998960504056992,720575940613732364,80998960504043951,720575940613732364,38.557209,37,0.674273,0.01204,0.2504642,0.00107,0.012926,0.049227,t,"[604412, 294744, 151520]","[604364, 294860, 151520]"


Annotations consist of parameters and spatial points. Some or all of these spatial points are what we call "BoundSpatialPoints". These are linked to the segmentation during materialization. The synapse tables have two such points (`pre_pt`, `post_pt`). Per point there are three columns: `*_position`, `*_supervoxel_id`, `*_root_id`. Supervoxels are the small atomic segments, and root ids describe large components (neurons) consisting of many supervoxels. A root id always refers to the same version of a neuron and represents a snapshot in time in its own right. For a given annotation id (`id`), all but the `*_root_id` columns stay constant between materializations. 

`query_table` has three parameters to define filters: `filter_in_dict`, `filter_out_dict`, and `filter_equal_dict`. More options will be added. These can be used to query synapses between arbitrary lists of neurons. For instance, to query the outgoing synapses of an AMMC-B1 neuron we included in the FlyWire paper:
(see the next section for how to come up with a specific root id)

In [13]:
%%time

syn_df = client.materialize.query_table(synapse_table, 
                                        filter_in_dict={"pre_pt_root_id": [720575940627197566]})

CPU times: user 34 ms, sys: 209 µs, total: 34.2 ms
Wall time: 413 ms
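To make the three filter types concrete, here is a toy re-implementation on a list of dicts. It assumes the semantics suggested by the parameter names (value in a list, value not in a list, value equal to a single value) and that filters on different columns are combined with AND. `apply_filters` is a hypothetical helper and not part of the client.

```python
def apply_filters(rows, filter_in_dict=None, filter_out_dict=None,
                  filter_equal_dict=None):
    """Toy version of the three filter types, applied to a list of dicts."""
    filter_in_dict = filter_in_dict or {}
    filter_out_dict = filter_out_dict or {}
    filter_equal_dict = filter_equal_dict or {}
    kept = []
    for row in rows:
        if any(row[col] not in allowed for col, allowed in filter_in_dict.items()):
            continue  # value must be one of the listed values
        if any(row[col] in excluded for col, excluded in filter_out_dict.items()):
            continue  # value must not be one of the listed values
        if any(row[col] != value for col, value in filter_equal_dict.items()):
            continue  # value must match exactly
        kept.append(row)
    return kept


# Made-up rows with the column names used by the synapse table.
rows = [
    {"id": 1, "pre_pt_root_id": 10, "post_pt_root_id": 20},
    {"id": 2, "pre_pt_root_id": 10, "post_pt_root_id": 0},
    {"id": 3, "pre_pt_root_id": 11, "post_pt_root_id": 20},
]

kept = apply_filters(
    rows,
    filter_in_dict={"pre_pt_root_id": [10]},
    filter_out_dict={"post_pt_root_id": [0]},
)
print([row["id"] for row in kept])
```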


As described in the metadata above, we suggest filtering the synapse table using the `cleft_score` and `connection_score`. Tuning these will help to reduce the number of false positive synapses in the list. The best threshold(s) will depend on the specific neurons included in the analysis. Here we will just remove all synapses with a `cleft_score < 50`.

In [14]:
syn_df = syn_df[syn_df["cleft_score"] >= 50]
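Since the best threshold depends on the neurons being analysed, it can help to see how many synapses survive at a few candidate thresholds before settling on one. A minimal sketch with made-up scores (not real data); `count_retained` is a hypothetical helper:

```python
# Made-up cleft scores for a handful of synapses (not real data).
cleft_scores = [0, 12, 37, 49, 50, 57, 100, 122, 144, 158]


def count_retained(scores, threshold):
    """Number of synapses kept when requiring score >= threshold."""
    return sum(score >= threshold for score in scores)


for threshold in (0, 50, 100, 150):
    print(threshold, count_retained(cleft_scores, threshold))
```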

Some postsynaptic partners have a 0 id. Many of these are due to the synapse prediction covering a bigger space than the segmentation. Here, we remove these along with synapses onto itself as we are confident that this cell does not make autapses.

In [15]:
syn_df = syn_df[syn_df["pre_pt_root_id"] != syn_df["post_pt_root_id"]]
syn_df = syn_df[syn_df["post_pt_root_id"] != 0]

This synapse table comes with neurotransmitter predictions from the work of Eckstein et al. Please review the description in the metadata to understand the caveats of this data with regard to your analysis. Here, we just look at the mean of the probabilities over all outgoing synapses, which shows that this neuron's neurotransmitter is very likely acetylcholine.

In [16]:
# Column-wise mean; calling .mean() on the dataframe gives one value per column
syn_df[["gaba", "ach", "glut", "oct", "ser", "da"]].mean()

gaba    0.032069
ach     0.835793
glut    0.041500
oct     0.005178
ser     0.021450
da      0.064010
dtype: float64

Here we take a brief look at the postsynaptic partners, sort them by number of synapses, and display the top 10:

In [17]:
u_post_root_ids, c_post_root_ids = np.unique(syn_df["post_pt_root_id"], return_counts=True)

sorting = np.argsort(c_post_root_ids)[::-1][:10]
list(zip(u_post_root_ids[sorting], c_post_root_ids[sorting]))

[(720575940612001489, 96),
 (720575940639811469, 92),
 (720575940606297353, 88),
 (720575940615361748, 86),
 (720575940621893127, 84),
 (720575940621301738, 80),
 (720575940626312778, 73),
 (720575940625492753, 71),
 (720575940623903434, 65),
 (720575940618489734, 64)]

The main target is an AMMC-A1 (720575940613535430) which is a connection we described in Figure 6 in the FlyWire paper.

We can further restrict the query by filtering the postsynaptic targets. For instance, this query will only return the synapses between these two root ids.

In [18]:
syn_df = client.materialize.query_table(synapse_table, 
                                        filter_in_dict={"pre_pt_root_id": [720575940627197566],
                                                        "post_pt_root_id": [720575940612001489]})
syn_df = syn_df[syn_df["cleft_score"] >= 50]

syn_df

Unnamed: 0,id,valid,pre_pt_supervoxel_id,pre_pt_root_id,post_pt_supervoxel_id,post_pt_root_id,connection_score,cleft_score,gaba,ach,glut,oct,ser,da,valid_nt,pre_pt_position,post_pt_position
0,211365569,t,77832229442872849,720575940627197566,77832229442883229,720575940612001489,52.161354,100,0.050798,0.706145,0.103963,0.020856,0.007026,0.111212,t,"[418808, 286736, 110840]","[418760, 286616, 110880]"
1,217890780,t,77691423235424395,720575940627197566,77691423235425756,720575940612001489,102.692398,145,0.004396,0.970069,0.000883,0.000840,0.000148,0.023665,t,"[412236, 283056, 121640]","[412208, 283148, 121680]"
2,6180240,t,77761173637894262,720575940627197566,77761173637888123,720575940612001489,227.389908,158,0.011852,0.966436,0.005491,0.002103,0.000580,0.013538,t,"[414292, 245008, 144520]","[414160, 245016, 144520]"
3,51374154,t,77691285729116258,720575940627197566,77691285729100925,720575940612001489,111.818047,132,0.008370,0.705035,0.009202,0.000005,0.272142,0.005246,t,"[409856, 274840, 93040]","[409984, 274868, 93040]"
4,237434942,t,77832023284765928,720575940627197566,77832023284771485,720575940612001489,161.775467,141,0.023719,0.890757,0.013778,0.002228,0.019519,0.049999,t,"[417900, 272952, 119880]","[417844, 273044, 119880]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
121,236131087,t,77761585821121562,720575940627197566,77761585821120349,720575940612001489,88.254326,116,0.007150,0.958711,0.001355,0.016179,0.000408,0.016197,t,"[414620, 270192, 119680]","[414520, 270216, 119680]"
122,217890770,t,77691423235405859,720575940627197566,77691423235425756,720575940612001489,65.530502,145,0.003493,0.976093,0.000120,0.003406,0.000033,0.016854,t,"[412216, 283052, 121600]","[412128, 283092, 121640]"
123,147115465,t,77832023284730175,720575940627197566,77832023284716150,720575940612001489,13.454404,142,0.014081,0.748565,0.001552,0.000199,0.003485,0.232117,t,"[419512, 274800, 118480]","[419480, 274656, 118480]"
124,221946020,t,77832160723343355,720575940627197566,77832160723346346,720575940612001489,1038.665283,143,0.048305,0.895462,0.023994,0.006970,0.001213,0.024056,t,"[417700, 282552, 108800]","[417580, 282568, 108800]"


## "Live" Materialization Queries

Before using live materializations, please make sure that your installation of the caveclient is `>= 3.1.0`. You can upgrade your installed version with 


```
pip install caveclient --upgrade
```

To make sure the latest version of the library is used in this notebook after an upgrade it is best to reload the notebook kernel. Your current version is:

In [19]:
import caveclient
caveclient.__version__

'3.1.0'

"Live" materializations allow one to run queries without adhering to versions. This is useful when recent proofreading edits should be reflected in the analysis. Live materializations require a timestamp for which the query should be executed. 

In [20]:
timestamp_now = datetime.datetime.utcnow()

In [21]:
%%time 

# Code to retrieve a root id that will work with this query. See the next section for more details
latest_roots = client.chunkedgraph.get_latest_roots(720575940627185911, timestamp_future=timestamp_now)
latest_roots

syn_df = client.materialize.live_query(synapse_table, 
                                       filter_in_dict={"pre_pt_root_id": [latest_roots[0]]},
                                       timestamp=timestamp_now)

syn_df = syn_df[syn_df["cleft_score"] >= 50]

syn_df

CPU times: user 70.7 ms, sys: 1.2 ms, total: 71.9 ms
Wall time: 1.97 s


Unnamed: 0,id,valid,pre_pt_supervoxel_id,pre_pt_root_id,post_pt_supervoxel_id,post_pt_root_id,connection_score,cleft_score,gaba,ach,glut,oct,ser,da,valid_nt,pre_pt_position,post_pt_position
0,43917406,t,84727748031271775,720575940637186126,84727748031264584,720575940626445100,63.022396,124,0.010390,0.950272,0.002132,0.008374,3.866737e-04,0.028445,t,"[820188, 250388, 153800]","[820212, 250440, 153840]"
1,61506994,t,83320304562319572,720575940637186126,83320304562328649,720575940619082317,54.087387,139,0.000389,0.997138,0.000002,0.001707,9.331072e-07,0.000762,t,"[740408, 246192, 190520]","[740380, 246032, 190560]"
2,232489725,t,84094429400951571,720575940637186126,84094429400956510,720575940590374582,270.478394,137,0.064027,0.356340,0.515614,0.004106,6.949618e-03,0.052963,t,"[782552, 248476, 180040]","[782508, 248392, 180080]"
3,232945196,t,84094429400971745,720575940637186126,84024060656788915,720575940583849998,100.080338,144,0.021885,0.813248,0.028637,0.006434,7.459946e-03,0.122336,t,"[782036, 250656, 180240]","[781968, 250780, 180200]"
4,239559768,t,84094429400828504,720575940637186126,84094429400827280,720575940626431532,31.991150,146,0.087519,0.123452,0.744683,0.002957,2.260582e-02,0.018783,t,"[783352, 250516, 176320]","[783288, 250484, 176280]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1807,232945305,t,84094429400962617,720575940637186126,84094429400990219,720575940633432206,51.244499,111,0.006596,0.946592,0.003757,0.034190,8.990807e-05,0.008774,t,"[782376, 251864, 180800]","[782468, 251780, 180840]"
1808,232948742,t,84024060656803275,720575940637186126,84024060656799478,720575940627765099,11.450621,67,0.032924,0.898681,0.032187,0.001579,1.981975e-03,0.032648,t,"[780720, 250520, 180640]","[780840, 250488, 180640]"
1810,239559737,t,84094429400832117,720575940637186126,84094429400820474,720575940620162332,275.813080,142,0.010846,0.939684,0.026259,0.020127,7.136991e-05,0.003012,t,"[784380, 249888, 176080]","[784280, 249884, 176040]"
1811,239559595,t,84094429400800897,720575940637186126,84094429400805747,720575940627076396,241.319305,142,0.002015,0.988075,0.000435,0.007664,2.856799e-06,0.001808,t,"[784652, 251304, 175600]","[784708, 251216, 175600]"


If the root id is incompatible with the timestamp, an error is raised:

In [22]:
%%time 

syn_df = client.materialize.live_query(synapse_table, 
                                       filter_in_dict={"pre_pt_root_id": [720575940627185911]},
                                       timestamp=timestamp_now)

ValueError: Timestamp incompatible with IDs: [720575940627185911] are expired, use chunkedgraph client to find valid ID(s)
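One way to handle this case is to catch the error, look up the successor(s) of the expired root id, and retry. The sketch below assumes the error is a `ValueError`, as in the output above. `query_fn` and `latest_roots_fn` are hypothetical callables wrapping `client.materialize.live_query` and `client.chunkedgraph.get_latest_roots`; the fakes let the example run without server access.

```python
def live_query_with_fallback(query_fn, latest_roots_fn, root_id, timestamp):
    """Try a live query; on an expired id, retry with its latest successor(s).

    `query_fn(root_ids, timestamp)` and `latest_roots_fn(root_id, timestamp)`
    are hypothetical callables wrapping the client calls used in this tutorial.
    Returns the root ids that were actually queried and the resulting rows.
    """
    try:
        return [root_id], query_fn([root_id], timestamp)
    except ValueError:
        successors = list(latest_roots_fn(root_id, timestamp))
        return successors, query_fn(successors, timestamp)


# Fakes with made-up ids: root 101 was edited and replaced by root 102.
CURRENT_IDS = {102}
SUCCESSORS = {101: [102]}


def fake_query(root_ids, timestamp):
    expired = [r for r in root_ids if r not in CURRENT_IDS]
    if expired:
        raise ValueError(f"Timestamp incompatible with IDs: {expired} are expired")
    return [{"pre_pt_root_id": r} for r in root_ids]


def fake_latest_roots(root_id, timestamp):
    return SUCCESSORS.get(root_id, [root_id])


used_ids, rows = live_query_with_fallback(
    fake_query, fake_latest_roots, 101, timestamp=None
)
print(used_ids, rows)
```

If the expired id was split, several successors are returned and the retry queries all of them, so the result still needs to be narrowed down to the neuron of interest.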

## Retrieving matching root ids

Neuroglancer shows the most recent version of the segmentation by default. Neurons that have been updated since a materialized version are not included in a table of that version. To reconcile this, users need to look up root ids for their data with a timestamp. 

We generally recommend storing annotations as points in space, as these can be mapped to root ids easily (that is essentially what materialization does). Soon, users will be able to create their own annotation tables, and CAVE will provide matching root ids automatically. Still, use cases will arise that require a manual materialization by the user.

### Programmatically/Manually - Root id history

The client interface can be used to query the "lineage" of a root id. This contains all ancestors and successors in time and can be restricted with timestamps in the past and future. The lineage can be retrieved as a networkx graph:

In [23]:
client.chunkedgraph.get_lineage_graph(720575940627185911, as_nx_graph=True)

<networkx.classes.digraph.DiGraph at 0x7f6d02ad5d00>

Based on the lineage graph the latest root ids can be retrieved:

In [24]:
latest_roots = client.chunkedgraph.get_latest_roots(720575940627197566, timestamp_future=timestamp_now)
latest_roots

array([720575940627197566])

As there can be multiple successors for a given ID (because of splits) the user will have to determine which of these matches the neuron of interest.

The client also enables the retrieval of the original root ids that contributed to a given neuron:

In [25]:
original_roots = client.chunkedgraph.get_original_roots(720575940627197566)
original_roots

array([720575940611736976, 720575940619683443, 720575940502382601,
       720575940502373129, 720575940502373897, 720575940502377225,
       720575940519142108, 720575940519144924, 720575940519141596,
       720575940519144156, 720575940502370057, 720575940519126492,
       720575940519126748, 720575940519130332, 720575940519131356,
       720575940519134684, 720575940618400560, 720575940615477396,
       720575940618047811, 720575940601614877, 720575940630717493])
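The two lookups above can be pictured on a toy lineage graph. The sketch below uses a plain dict of made-up ids rather than the networkx graph returned by the client. `latest_roots` and `original_roots` are illustrative helpers, not the client's implementation.

```python
# Toy lineage: each edge points from an older root id to the root id that
# replaced it after an edit. Ids are made up.
#   1 and 2 were merged into 3; 3 was later split into 4 and 5.
successors = {1: [3], 2: [3], 3: [4, 5], 4: [], 5: []}


def latest_roots(graph, root_id):
    """Follow edges away from `root_id` and return all ids with no outgoing edge."""
    stack, seen, leaves = [root_id], set(), set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        children = graph[node]
        if not children:
            leaves.add(node)
        stack.extend(children)
    return sorted(leaves)


def original_roots(graph, root_id):
    """Reverse all edges, then return the ids without predecessors."""
    predecessors = {node: [] for node in graph}
    for parent, children in graph.items():
        for child in children:
            predecessors[child].append(parent)
    return latest_roots(predecessors, root_id)


print(latest_roots(successors, 1))    # ids that root 1 ended up in
print(original_roots(successors, 4))  # ids that contributed to root 4
```

Root 1 has two latest successors here because of the split, which is the situation in which the user has to decide which one matches the neuron of interest.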

### Programmatically - Spatial lookup

The client interface allows users to query a root id for a given supervoxel id (see Section 5 in [the related tutorial](https://github.com/seung-lab/CAVEclient/blob/master/CAVEclientExamples.ipynb)). Supervoxel ids can be retrieved from the segmentation using [cloudvolume](https://github.com/seung-lab/cloud-volume/).

### Neuroglancer

The segmentation layer has an option under the "graph" tab to lock a layer to a specific timestamp. Root ids are then looked up with this specific timestamp (proofreading is not possible in this mode). Be aware that this mode does not prevent the pasting of root ids from different timestamps into the layer, as that circumvents the lookup on the server.

## Timestamps

Timestamps are _always_ UTC. 

Please be aware that the package or browser you are using might format timestamps in your local timezone. The timestamp is the same for all annotation tables within a materialization:

In [26]:
client.materialize.get_version_metadata(15)

{'valid': None,
 'time_stamp': datetime.datetime(2021, 3, 10, 19, 7, 56, 375440, tzinfo=datetime.timezone.utc),
 'id': 3,
 'version': 15,
 'datastack': 'flywire_fafb_production',
 'expires_on': datetime.datetime(2021, 4, 9, 19, 7, 56, 375440, tzinfo=datetime.timezone.utc)}
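If a timestamp is built from local wall-clock time, converting it explicitly to UTC avoids ambiguity. A minimal sketch using only the standard library; the UTC-4 offset and the example times are arbitrary choices for illustration:

```python
import datetime

# A timezone-aware local time (UTC-4 chosen arbitrarily for illustration).
local_zone = datetime.timezone(datetime.timedelta(hours=-4))
local_time = datetime.datetime(2021, 3, 10, 15, 7, 56, tzinfo=local_zone)

# Convert to UTC before using the timestamp in a query.
utc_time = local_time.astimezone(datetime.timezone.utc)
print(utc_time.isoformat())

# A naive datetime carries no timezone information at all, so nothing in the
# object itself says whether it was meant as local time or as UTC.
naive_time = datetime.datetime(2021, 3, 10, 19, 7, 56)
print(naive_time.tzinfo is None)
```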

## Creating neuroglancer links programmatically

We are building infrastructure into neuroglancer to display this information there while browsing neurons. Until this is ready, the most convenient way to visualize this information in neuroglancer is to programmatically create neuroglancer states and upload them to the state server. The links can then be distributed.

[NeuroglancerAnnotationUI (nglui)](https://github.com/seung-lab/NeuroglancerAnnotationUI) makes programmatic creation of neuroglancer states convenient. The [statebuilder examples](https://github.com/seung-lab/NeuroglancerAnnotationUI/blob/master/examples/statebuilder_examples.ipynb) show how one can go directly from dataframes such as the ones above to neuroglancer states. The [related tutorial on this client](https://github.com/seung-lab/CAVEclient/blob/master/CAVEclientExamples.ipynb) shows under "4. JSON Service" how this client can be used to upload states to the server and to create neuroglancer links.


## Further references


More examples for the usage of CAVE can be found in a related project:

https://github.com/AllenInstitute/MicronsBinder

A rough overview of the structure of our backend services can be found here:

https://github.com/seung-lab/AnnotationPipelineOverview

## Credit

CAVE is developed at Princeton University and the Allen Institute for Brain Science within the IARPA MICrONS project and the FlyWire project. The main contributors to the design and backend development are Derrick Brittain, Forrest Collman, Sven Dorkenwald, Chris Jordan, and Casey Schneider-Mizell.

A citable publication is in the works. Please contact us if you are interested in using CAVE on another dataset. 