---
title: "Materialization and Versioning"
draft: false
format: 
    html:
        toc: true
bibliography: references.bib
---

<h3> The MICrONS dataset is public, open access, and ready for analysis. However, manual edits to the segmentation continue to improve data quality.</h3>

When beginning your analysis in the MICrONS dataset, it is important to understand:

 1. Why the data changes
 2. What types of data change with time
 3. How to set the version for your analysis
 4. How to cross-reference data across time

The data is regularly **versioned**; that is, a long-term copy of the dataset is made available for users. We highly recommend setting the version or **timestamp** in your analysis for future consistency. 

However, even if you do not set the version, a **lineage graph** records the changes to the dataset. This means you can find the past version of your cell, annotation table, mesh, skeleton, etc., as long as you know the **root id** of the object you are interested in or the date on which you performed the analysis.

## Why the data changes

The automatic segmentation from EM imagery to 3D reconstruction was largely effective, and automation is the only practical way to process data at this scale [@the_microns_consortium_functional_2025]. However, due to imaging defects and the nature of thin, branching axons, the automated methods make mistakes that can substantially affect the biological accuracy of the reconstructions.

[Manual Proofreading](proofreading.html), that is, the correction of the segmentation and the addition of annotations, is an [ongoing effort](vortex-overview.html).

Different aspects of the data require different levels of manual intervention. For example, the segmentation methods produced highly accurate dendritic arbors before proofreading, enabling morphological identification of broad cell types. Most dendritic spines are properly associated with their dendritic trunk. Recovery of larger-caliber axons, those of inhibitory neurons, and the initial portions of excitatory neurons was also typically successful. Owing to the high frequency of imaging defects in the shallower and deeper portions of the dataset, processes near the pia and white matter often contain errors. Many non-neuronal objects are also well-segmented, including astrocytes, microglia, and blood vessels. The two subvolumes of the dataset were segmented separately, but the alignment between the two is sufficient for manually tracing between them.

**Changes to the dataset represent an improvement in accuracy, and reflect an investment in the long-term usefulness of this open-access resource**. 

## What types of data change with time

**Proofreading** edits to the segmentation change which **supervoxels** (groups of locally aggregated voxels) are associated with which segmented object. Any time a supervoxel is associated with a different segmented object, all of the `id`s upstream of that supervoxel will update. In practice, this means the 18-digit segmentation id or `pt_root_id` of your neuron, microglia, axon, etc. will change every time it is proofread.

![a, Automated segmentation overlaid on EM data. Each color represents an individual putative cell. b, Different colors represent supervoxels that make up putative cells. c, Supervoxels belonging to a particular neuron, with an overlaid cartoon of its supervoxel graph. These data correspond to the framed square in a and the full panel in b. d, One-dimensional representation of the supervoxel graph. The ChunkedGraph data structure adds an octree structure to the graph to store the connected component information. Each abstract node (black nodes in levels >1) represents the connected component in the spatially underlying graph. ](img/nature/nature_cave_fig_2 a-d.png)

<p><center> Figure from [@dorkenwald_cave_2025] </center></p>


The `pt_root_id` is always associated with the same collection of supervoxels, and therefore the same mesh and same skeleton. But if that `pt_root_id` is *expired*, you may not find that object in the current **Annotation Tables**, **Synapse Connectivity Tables**, or Neuroglancer views, which show the *current* version of the dataset by default.

Creating a new `pt_root_id` for an edited object is the only way to have the flexibility of both **merging** two or more segments that should be connected (for example: extending an axon) and **splitting** an object into two, as in the following example:
 
![f, To submit a split operation, users place labels for each side of the split (top right). The backend system first connects each set of labels on each side by identifying supervoxels between them in the graph (left). The extended sets are used to identify the edges needed to be cut with a maximum-flow minimum-cut algorithm.](img/nature/nature_cave_fig_2 f.png){width=400}

<p><center> Figure from [@dorkenwald_cave_2025] </center></p>

But this also means we can track the histories of what `id`s used to be part of which segmented objects, which helps for finding the same cell, axon, or arbitrary segment across time. See **Lineage Graphs** below for details.
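To make this bookkeeping concrete, here is a toy sketch in plain Python. It is not the ChunkedGraph implementation: the class `ToySegmentation`, its methods, and the small integer ids are all hypothetical and exist only to illustrate the idea that every edit retires the affected root ids and creates new ones, while supervoxel ids never change.

```python
from itertools import count


class ToySegmentation:
    """Toy model of proofreading: a root id names one fixed set of supervoxels."""

    def __init__(self, objects):
        # objects: {root_id: iterable of supervoxel ids} (hypothetical small ints)
        self.current = {root: frozenset(svs) for root, svs in objects.items()}
        self.expired = {}    # retired root_id -> its unchanged supervoxel set
        self.lineage = []    # (older_root_id, newer_root_id) pairs
        self._ids = count(max(self.current) + 1)

    def _retire(self, root_id):
        self.expired[root_id] = self.current.pop(root_id)
        return self.expired[root_id]

    def _create(self, supervoxels, parents):
        new_root = next(self._ids)
        self.current[new_root] = frozenset(supervoxels)
        self.lineage.extend((parent, new_root) for parent in parents)
        return new_root

    def merge(self, root_a, root_b):
        """Join two objects; both old root ids expire."""
        supervoxels = self._retire(root_a) | self._retire(root_b)
        return self._create(supervoxels, parents=(root_a, root_b))

    def split(self, root_id, side_a):
        """Cut one object in two; the old root id expires."""
        supervoxels = self.current[root_id]
        side_a = frozenset(side_a)
        if not side_a or not side_a < supervoxels:
            raise ValueError("side_a must be a non-empty proper subset")
        self._retire(root_id)
        return (
            self._create(side_a, parents=(root_id,)),
            self._create(supervoxels - side_a, parents=(root_id,)),
        )


seg = ToySegmentation({1: {10, 11}, 2: {12, 13}})
merged = seg.merge(1, 2)                 # roots 1 and 2 expire
left, right = seg.split(merged, {10})    # the merged root expires in turn

print("current roots:", sorted(seg.current))
print("expired roots:", sorted(seg.expired))
print("lineage:", seg.lineage)
```

Note that an expired root id still maps to exactly the supervoxels it always had, which is why its mesh and skeleton remain valid even though it no longer appears among the current objects.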

## How to set the version of your analysis

Most programmatic access to the CAVE services occurs through CAVEclient, a Python client to access various types of data from the online services.

Full documentation for CAVEclient [is available here](http://caveclient.readthedocs.io).

::: {.callout-important}
#### Initial Setup
Before using any programmatic access to the data, [you first need to set up your CAVEclient token](em_py_01_caveclient_setup.html).
:::

To initialize a CAVEclient, we give it a **datastack**, which is a name that defines a particular combination of imagery, segmentation, and annotation database.
For the MICrONS public data, we use the datastack name `minnie65_public`.

```python
from caveclient import CAVEclient
from datetime import datetime, timezone

# initialize cave client
client = CAVEclient('minnie65_public')

# see the available materialization versions
client.materialize.get_versions()
```

```
[1300, 1078, 117, 661, 343, 1181, 795, 943]
```

And these are their associated timestamps (all timestamps are in UTC):



```python
for version in client.materialize.get_versions():
    print(f"Version {version}: {client.materialize.get_timestamp(version)}")
```

```
Version 1300: 2025-01-13 10:10:01.286229+00:00
Version 1078: 2024-06-05 10:10:01.203215+00:00
Version 117: 2021-06-11 08:10:00.215114+00:00
Version 661: 2023-04-06 20:17:09.199182+00:00
Version 343: 2022-02-24 08:10:00.184668+00:00
Version 1181: 2024-09-16 10:10:01.121167+00:00
Version 795: 2023-08-23 08:10:01.404268+00:00
Version 943: 2024-01-22 08:10:01.497934+00:00
```


You can set the overall materialization version for the dataset using `client.version`. This will ensure all of the subsequent CAVE queries are performed at the same materialization, so you will get consistency between, for example, a cell type query and a synapse query.

```python
# set materialization version, for consistency
client.version = 1300 # current public as of 1/13/2025
```

However, you can also run an individual query at a different version with the optional argument `materialization_version`. For more about table queries, see [CAVE Query Cell Types](quickstart_notebooks/02-cave-query-cell-types.html).

```python
nuc_v661 = client.materialize.tables.nucleus_detection_v0().query(materialization_version=661)

nuc_v661.sample(3)
```

|  | id | created | superceded_id | valid | volume | pt_supervoxel_id | pt_root_id | pt_position | bb_start_position | bb_end_position |
| :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| 92393 | 391995 | 2020-09-28 22:43:43.204393+00:00 |  | t | 143.598387 | 96695656953309967 | 864691135736903556 | [232304, 127728, 16737] | [nan, nan, nan] | [nan, nan, nan] |
| 77241 | 437922 | 2020-09-28 22:43:14.122531+00:00 |  | t | 94.680515 | 100231274366863259 | 864691135090300591 | [258080, 255920, 19317] | [nan, nan, nan] | [nan, nan, nan] |
| 23906 | 308570 | 2020-09-28 22:41:28.612416+00:00 |  | t | 77.572833 | 91223731514852491 | 864691135373378760 | [192272, 253488, 19393] | [nan, nan, nan] | [nan, nan, nan] |


You can do the same thing with an arbitrary timestamp, using the optional argument `timestamp`. However, due to how the ChunkedGraph operates, this will be more time-intensive than looking up a specific materialized version.

```python
%%time
# query at an arbitrary time; the timestamp is made explicitly UTC
nuc_ts = client.materialize.tables.nucleus_detection_v0().query(timestamp=datetime(2023, 4, 6, 20, tzinfo=timezone.utc))

nuc_ts.sample(3)
```

```
CPU times: total: 688 ms
Wall time: 59.2 s
```

|  | id | valid | volume | pt_supervoxel_id | pt_root_id | pt_position | bb_start_position | bb_end_position |
| :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| 37886 | 253051 | t | 91.884913 | 0 | 0 | [177456, 79216, 24517] | [nan, nan, nan] | [nan, nan, nan] |
| 47218 | 291916 | t | 237.901906 | 91205108066569853 | 864691136174955526 | [192128, 114608, 15436] | [nan, nan, nan] | [nan, nan, nan] |
| 44601 | 268785 | t | 128.630948 | 87488484222419503 | 864691135293082636 | [165216, 210992, 18093] | [nan, nan, nan] | [nan, nan, nan] |
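Because all materialization timestamps are in UTC, it can help to make every query timestamp explicitly timezone-aware before passing it to `timestamp`. The helper below is a minimal sketch using only the standard library; `as_utc` is a hypothetical name, and it assumes that a naive datetime is already expressed in UTC.

```python
from datetime import datetime, timezone


def as_utc(dt):
    """Return a timezone-aware UTC datetime.

    Assumption: a naive datetime (no tzinfo) is already expressed in UTC.
    An aware datetime is converted to UTC.
    """
    if dt.tzinfo is None:
        return dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


# A naive datetime becomes an explicit UTC timestamp
query_time = as_utc(datetime(2023, 4, 6, 20))
print(query_time.isoformat())
```

The resulting `query_time` can then be used wherever a query accepts a `timestamp`.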


### How to set the timestamp to an expired version
Materialization versions expire at regular intervals. Indeed, every version between our major long-term public releases existed at some point, but has since expired.

This does not mean the data from those versions is gone. 

It does mean that it takes longer to materialize data from that date, because the ChunkedGraph has to calculate the differences between an extant materialized version and the requested time. To materialize data from an expired version, you must set the optional `timestamp` argument in every query:

```python
# set the timestamp of a version that may or may not exist
example_timestamp = datetime(2022, 2, 24, 8, 10, 0, 184668, tzinfo=timezone.utc)

# example table query
nuc_timestamp = client.materialize.tables.nucleus_detection_v0().query(timestamp=example_timestamp, limit=100)
nuc_timestamp.head(3)
```

```
201 - "Limited query to 100 rows
```

|  | id | valid | volume | pt_supervoxel_id | pt_root_id | pt_position | bb_start_position | bb_end_position |
| :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| 0 | 11294 | t | 49.112842 | 73556435283294116 | 864691135269406572 | [63968, 218032, 20683] | [nan, nan, nan] | [nan, nan, nan] |
| 1 | 11300 | t | 323.577446 | 73626116832723681 | 864691135151717168 | [64400, 213072, 20630] | [nan, nan, nan] | [nan, nan, nan] |
| 2 | 11301 | t | 277.511864 | 0 | 0 | [64496, 219104, 20408] | [nan, nan, nan] | [nan, nan, nan] |


```python
# example synapse query
syn_timestamp = client.materialize.synapse_query(post_ids=nuc_timestamp.pt_root_id, timestamp=example_timestamp)
syn_timestamp.head(3)
```

|  | id | valid | pre_pt_supervoxel_id | pre_pt_root_id | post_pt_supervoxel_id | post_pt_root_id | size | pre_pt_position | post_pt_position | ctr_pt_position |
| :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| 0 | 507533913 | t | 120060554860852929 | 864691132931221450 | 120060554860853553 | 0 | 420 | [402320, 146520, 23899] | [402354, 146594, 23900] | [402350, 146558, 23899] |
| 1 | 512239447 | t | 107692559937453733 | 864691134764563389 | 107692560004163297 | 0 | 1300 | [312342, 272446, 16895] | [312272, 272510, 16899] | [312298, 272460, 16898] |
| 4 | 416272231 | t | 110295859874434808 | 864691135385181117 | 110295859874440074 | 0 | 1884 | [330842, 269600, 16654] | [330878, 269532, 16662] | [330854, 269538, 16663] |


### Timestamps for all public release versions
Long-term releases are made available for analysis, but are not permanent. You can look up the timestamp associated with any public release version in the table below.

| Version | Timestamp |
| :-- | :------ |
| 117 | datetime(2021, 6, 11, 8, 10, 0, 215114, tzinfo=timezone.utc) |
| 343 | datetime(2022, 2, 24, 8, 10, 0, 184668, tzinfo=timezone.utc) |
| 661 | datetime(2023, 4, 6, 20, 17, 9, 199182, tzinfo=timezone.utc) |
| 795 | datetime(2023, 8, 23, 8, 10, 1, 404268, tzinfo=timezone.utc) |
| 943 | datetime(2024, 1, 22, 8, 10, 1, 497934, tzinfo=timezone.utc) |
| 1078 | datetime(2024, 6, 5, 10, 10, 1, 203215, tzinfo=timezone.utc) |
| 1181 | datetime(2024, 9, 16, 10, 10, 1, 121167, tzinfo=timezone.utc) |
| 1300 | datetime(2025, 1, 13, 10, 10, 1, 286229, tzinfo=timezone.utc) |

: CAVEclient materialization version timestamps (as `datetime.datetime` objects, assuming `from datetime import datetime, timezone`)
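If you only know the date of a past analysis, a small lookup over these timestamps can suggest which public release was current at that time. This is a minimal sketch in plain Python: `PUBLIC_VERSIONS` and `latest_version_at` are hypothetical names, the values are copied from the table, and the result does not tell you whether that version is still available on the server.

```python
from datetime import datetime, timezone

# Version timestamps copied from the table of public releases (all UTC)
PUBLIC_VERSIONS = {
    117: datetime(2021, 6, 11, 8, 10, 0, 215114, tzinfo=timezone.utc),
    343: datetime(2022, 2, 24, 8, 10, 0, 184668, tzinfo=timezone.utc),
    661: datetime(2023, 4, 6, 20, 17, 9, 199182, tzinfo=timezone.utc),
    795: datetime(2023, 8, 23, 8, 10, 1, 404268, tzinfo=timezone.utc),
    943: datetime(2024, 1, 22, 8, 10, 1, 497934, tzinfo=timezone.utc),
    1078: datetime(2024, 6, 5, 10, 10, 1, 203215, tzinfo=timezone.utc),
    1181: datetime(2024, 9, 16, 10, 10, 1, 121167, tzinfo=timezone.utc),
    1300: datetime(2025, 1, 13, 10, 10, 1, 286229, tzinfo=timezone.utc),
}


def latest_version_at(when, versions=PUBLIC_VERSIONS):
    """Return the newest version materialized at or before `when`, or None.

    `when` must be a timezone-aware datetime.
    """
    candidates = [v for v, ts in versions.items() if ts <= when]
    return max(candidates, key=versions.get) if candidates else None


analysis_date = datetime(2023, 5, 1, tzinfo=timezone.utc)
print(latest_version_at(analysis_date))
```

For dates that fall between releases, passing the exact date through the `timestamp` argument remains the more precise option, at the cost of a slower query.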


## How to cross-reference across time

### Lineage Graphs

[CAVEclient](python-tools.html) combines materialized snapshots with ChunkedGraph-based tracking of neuron edit histories to facilitate analysis queries for arbitrary time points. The ChunkedGraph tracks the edit lineage of neurons as they are being proofread, allowing us to map any segment used in a query to the closest available snapshot time point. This produces an overinclusive set of segments with which we query the snapshot database. 


![a, Edits change the assignment of synapses to segment IDs. Each of the four synapses is assigned to the segment IDs (colors) according to the presynaptic and postsynaptic points (point, bar). The identity of the segments changes through proofreading (time passed: ΔT) indicated by different colors. The lineage graph shows the current segment ID (color) for each point in time.](img/nature/nature_cave_fig_5 a.png){width=400}

<p><center> Figure from [@dorkenwald_cave_2025] </center></p>

We then query the ‘live’ database for all changes to annotations since the materialization snapshot that was used and add them to the set of annotations. The resulting set of annotations is then mapped back to the query timestamp using the lineage graph and supervoxel-to-root lookups, and finally reduced to include only the queried set of root IDs.

![b, Analysis queries are not necessarily aligned to exported snapshots. Queries for other time points are supported by on-the-fly delta updates from both the annotations and segmentation through the use of the lineage graph.](img/nature/nature_cave_fig_5 b.png)

<p><center> Figure from [@dorkenwald_cave_2025] </center></p>

<h4>Example querying the ChunkedGraph history for a root id</h4>

::: {.callout-important}
#### Initial Setup
Before using any programmatic access to the data, [you first need to set up your CAVEclient token](em_py_01_caveclient_setup.html).
:::

```python
from caveclient import CAVEclient
from datetime import datetime

# initialize cave client
client = CAVEclient('minnie65_public')

# return the current timestamp
client.materialize.get_timestamp()
```

```
datetime.datetime(2025, 1, 13, 10, 10, 1, 286229, tzinfo=datetime.timezone.utc)
```

Most commonly, you will want to look up the current root id for a `pt_root_id` from a previous analysis. This is not always trivial, for example in the case of a multi-soma object that has been manually split: which of the two new cells was your original cell of interest?

The ChunkedGraph will make its best guess, given supervoxel overlap, with the function `suggest_latest_roots()`:

```python
example_id = 864691135919440816

# Access the ChunkedGraph service of caveclient
client.chunkedgraph.suggest_latest_roots(example_id, timestamp=client.materialize.get_timestamp())
```

```
np.int64(864691135970572133)
```

Now we have the updated `pt_root_id` for our cell at the current materialized version.

If you want to run this for a large number of root ids, you can first check if the `pt_root_id`s are current to your CAVEclient materialization version using `is_latest_roots()`, and then only update the ids that have expired:

```python
# Check if roots are current
print(client.chunkedgraph.is_latest_roots(example_id))

# See when the id was generated (when the segment was last edited)
client.chunkedgraph.get_root_timestamps(example_id)
```

```
[False]

array([datetime.datetime(2023, 2, 1, 8, 47, 29, 891000, tzinfo=<UTC>)],
      dtype=object)
```
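The check-then-update pattern described above can be wrapped in a small helper. This is a sketch rather than an official recipe: `update_expired_roots` is a hypothetical name, and it assumes that `is_latest_roots()` accepts a list of ids plus an optional `timestamp` and returns one boolean per id, and that `suggest_latest_roots()` is called once per expired id, as in the examples here.

```python
def update_expired_roots(client, root_ids, timestamp=None):
    """Map each root id to a current root id, querying only the expired ones.

    client: an initialized CAVEclient (or any object exposing the same
        `chunkedgraph.is_latest_roots` and `chunkedgraph.suggest_latest_roots`).
    Returns a dict {original_root_id: suggested_current_root_id}.
    """
    root_ids = list(root_ids)
    is_current = client.chunkedgraph.is_latest_roots(root_ids, timestamp=timestamp)

    updated = {}
    for root_id, current in zip(root_ids, is_current):
        if current:
            # Still valid at the requested time: keep as-is
            updated[root_id] = root_id
        else:
            # Expired: ask the ChunkedGraph for its best guess
            suggestion = client.chunkedgraph.suggest_latest_roots(
                root_id, timestamp=timestamp
            )
            updated[root_id] = int(suggestion)
    return updated
```

A call such as `update_expired_roots(client, my_root_ids)` would then return a dictionary you can use to relabel an older analysis. Keep in mind that each suggestion is a best guess based on supervoxel overlap, so split objects deserve a manual check.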

Using the `timestamp` argument, you can also look up the suggested root at any arbitrary time. Here we use the timestamp for a different materialization, version 943:

```python
client.chunkedgraph.suggest_latest_roots(
    example_id,
    timestamp=client.materialize.get_version_metadata(943)['time_stamp'],
)
```

```
np.int64(864691135808631069)
```

Sometimes you may want to check the lineage graph for a cell of interest, to better understand what was edited and why. You can access this and more advanced features through `get_lineage_graph()`. See the [ChunkedGraph documentation](https://caveconnectome.github.io/CAVEclient/tutorials/chunkedgraph/) for more use cases.

```python
client.chunkedgraph.get_lineage_graph(example_id)
```
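Conceptually, a lineage graph is a directed graph whose edges point from an older root id to the newer root id that replaced it. The sketch below walks such a graph in plain Python to find every id with no successor, which shows why an id that was later split has more than one candidate descendant. The edge list and the name `latest_descendants` are hypothetical; the exact structure returned by `get_lineage_graph()` is not shown here, so converting its output to `(older, newer)` pairs is left as an assumption.

```python
from collections import defaultdict, deque


def latest_descendants(edges, root_id):
    """Return all ids reachable from `root_id` that have no newer successor.

    edges: iterable of (older_id, newer_id) pairs.
    """
    children = defaultdict(set)
    for older, newer in edges:
        children[older].add(newer)

    leaves, seen, queue = set(), {root_id}, deque([root_id])
    while queue:
        node = queue.popleft()
        successors = children.get(node, ())
        if not successors:
            leaves.add(node)  # nothing replaced this id, so it is current
        for nxt in successors:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return leaves


# Toy history: 1 and 2 were merged into 3, 3 was split into 4 and 5,
# and 5 was later edited into 6.
toy_edges = [(1, 3), (2, 3), (3, 4), (3, 5), (5, 6)]
print(sorted(latest_descendants(toy_edges, 1)))
```

When more than one current id is returned, `suggest_latest_roots()` is what picks among them based on supervoxel overlap.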

### Static annotations

The [CAVEclient](python-tools.html) Materialization Engine updates segmentation data and creates databases that combine spatial annotation points and segmentation information. 

The live database is written to by the [Annotation service](https://caveconnectome.github.io/CAVEclient/tutorials/annotation/) and is actively managed by the [Materialization service](https://caveconnectome.github.io/CAVEclient/tutorials/materialization/) to keep root IDs up to date for all BoundSpatialPoints in all tables. Snapshotted databases are copies of a time-locked state of the ‘live’ database’s segmentation and annotation information used to facilitate consistent querying.

This means if you use a static **annotation** label to index your analysis, for example a `nucleus_id` or a `synapse_id` which do not undergo proofreading, you can look up the current `pt_root_id` at any time by asking CAVEclient to materialize the new segmentation under the static point.

Using our example cell from above, let's find its nucleus id. If we query the nucleus table with the expired id, no rows are returned:

```python
example_id = 864691135919440816

client.materialize.tables.nucleus_detection_v0(pt_root_id=example_id).query()
```

|  | id | created | superceded_id | valid | volume | pt_supervoxel_id | pt_root_id | pt_position | bb_start_position | bb_end_position |
| :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- |

This is expected, since the example id is *expired* at the time of this materialization. Instead, let's query a version in which we know this id existed: version 661.

```python
client.materialize.tables.nucleus_detection_v0(pt_root_id=example_id).query(materialization_version=661)
```

|  | id | created | superceded_id | valid | volume | pt_supervoxel_id | pt_root_id | pt_position | bb_start_position | bb_end_position |
| :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| 0 | 260620 | 2020-09-28 22:44:52.109195+00:00 |  | t | 292.154409 | 88605588438852384 | 864691135919440816 | [173584, 145120, 21127] | [nan, nan, nan] | [nan, nan, nan] |


This returns both the expected `pt_root_id` 864691135919440816 and the `id` from the `nucleus_detection_v0` table, better known as the `nucleus_id`.

Given the `nucleus_id`, we can now query the current materialized version for the current root id.

```python
nuc_id = client.materialize.tables.nucleus_detection_v0(pt_root_id=example_id).query(materialization_version=661)['id']

client.materialize.tables.nucleus_detection_v0(id=nuc_id).query()
```

|  | id | created | superceded_id | valid | volume | pt_supervoxel_id | pt_root_id | pt_position | bb_start_position | bb_end_position |
| :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| 0 | 260620 | 2020-09-28 22:44:52.109195+00:00 |  | t | 292.154409 | 88605588438852384 | 864691135970572133 | [173584, 145120, 21127] | [nan, nan, nan] | [nan, nan, nan] |


This returns the same `pt_root_id` as the lineage graph example above, with the added benefit of guaranteeing that the id belongs to your cell of interest and not to an arbitrary piece of the previously segmented object.
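The same idea extends to many cells at once: because the nucleus `id` does not change, it can serve as the join key between two materializations. Below is a minimal sketch in plain Python, with `roots_by_nucleus` as a hypothetical helper name; the example values are the ones returned by the two nucleus queries shown in this section.

```python
def roots_by_nucleus(old_rows, new_rows):
    """Pair old and current root ids using the nucleus `id` as a stable key.

    Both arguments are iterables of (nucleus_id, pt_root_id) pairs taken from
    two different materializations. Returns {nucleus_id: (old_root, new_root)};
    new_root is None when the nucleus is missing from `new_rows`.
    """
    new_lookup = dict(new_rows)
    return {
        nucleus_id: (old_root, new_lookup.get(nucleus_id))
        for nucleus_id, old_root in old_rows
    }


# (nucleus id, pt_root_id) at version 661 and at the current version
v661_rows = [(260620, 864691135919440816)]
current_rows = [(260620, 864691135970572133)]

paired = roots_by_nucleus(v661_rows, current_rows)
print(paired)
```

With query results held as DataFrames, the equivalent operation is a merge of the two tables on the `id` column.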

<h4>If you are working with a neuron, glial cell, or any cell that has a nucleus detection, we recommend using the `nucleus_id` as your identifier rather than the `pt_root_id`.</h4>

<h4>If you do use `pt_root_id`, be sure to note the dataset materialization version in your analysis.</h4>
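One way to follow this recommendation is to store a small provenance record next to your results. The sketch below is only a suggestion: `version_record` and `save_version_record` are hypothetical helper names, the output file name is arbitrary, and the code assumes that the client exposes `datastack_name` in addition to the `version` attribute and `materialize.get_timestamp()` used earlier.

```python
import json


def version_record(client):
    """Return a JSON-serializable record of the materialization in use."""
    version = client.version
    timestamp = client.materialize.get_timestamp(version)
    return {
        "datastack": client.datastack_name,  # assumed attribute name
        "materialization_version": version,
        "timestamp_utc": timestamp.isoformat(),
    }


def save_version_record(client, path="materialization_version.json"):
    """Write the record to disk alongside the analysis outputs."""
    with open(path, "w") as f:
        json.dump(version_record(client), f, indent=2)
```

Calling `save_version_record(client)` at the start of an analysis gives you both the version number and its timestamp, so the analysis can still be reproduced with the `timestamp` argument after the version itself has expired.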