# CryoET Data Portal with Zarr Benchmarks

This notebook demonstrates how to access data from the CryoET Data Portal and benchmark it using the zarr-benchmarks tools.

## About the Data

We'll be working with dataset 10445 from the CryoET Data Portal - the "CZII CryoET Object Identification Challenge" dataset.
This dataset contains:
- 121 runs with 6,981 annotations across 484 tomograms
- Annotated objects: Apo-ferritin, Beta-amylase, Beta-galactosidase, cytosolic ribosomes, thyroglobulin, and VLP
- Tomograms stored in OME-Zarr format, which makes the dataset a convenient benchmarking target

**Dataset Link**: https://cryoetdataportal.czscience.com/datasets/10445

## 1. Installation

First, let's install the required packages.

In [4]:
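# Install required packages (skip this cell if they are already installed)
!pip install cryoet-data-portal
# Editable install of zarr-benchmarks; run this from the root of a zarr-benchmarks checkout
!pip install -e ".[plots,zarr-python-v3]"
!pip install s3fs  # For accessing S3 data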

Collecting botocore<1.41.0,>=1.40.71 (from boto3<2.0,>=1.0.0->cryoet-data-portal)
  Using cached botocore-1.40.71-py3-none-any.whl.metadata (5.9 kB)
Using cached botocore-1.40.71-py3-none-any.whl (14.1 MB)
Installing collected packages: botocore
  Attempting uninstall: botocore
    Found existing installation: botocore 1.40.70
    Uninstalling botocore-1.40.70:
      Successfully uninstalled botocore-1.40.70
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
aiobotocore 2.25.2 requires botocore<1.40.71,>=1.40.46, but you have botocore 1.40.71 which is incompatible.
Successfully installed botocore-1.40.71

[notice] A new release of pip is available: 25.2 -> 25.3
[notice] To update, run: pip install --upgrade pip

In [3]:
import numpy as np
import pathlib
import time
import zarr
import s3fs
from cryoet_data_portal import Client, Dataset, Run, Tomogram
from zarr_benchmarks.read_write_zarr import read_write_zarr
from zarr_benchmarks import utils
import matplotlib.pyplot as plt

ModuleNotFoundError: No module named 'zarr'

## 2. Connect to CryoET Data Portal

Let's connect to the portal and explore dataset 10445.

In [None]:
# Initialize the client
client = Client()

# Get the dataset
dataset = Dataset.get_by_id(client, 10445)

print(f"Dataset: {dataset.title}")
print(f"Dataset ID: {dataset.id}")
# Truncate long descriptions to 200 characters for display
description = dataset.description
if len(description) > 200:
    description = description[:200] + "..."
print(f"Description: {description}")

## 3. Explore Available Runs and Tomograms

In [None]:
# Get all runs for this dataset
runs = list(dataset.runs)
print(f"Total number of runs: {len(runs)}")

# Get information about the first run
first_run = runs[0]
print(f"\nFirst run name: {first_run.name}")
print(f"Run ID: {first_run.id}")

# Get tomograms from this run
tomograms = list(first_run.tomograms)
print(f"\nNumber of tomograms in first run: {len(tomograms)}")

In [None]:
# Get details about the first tomogram
if tomograms:
    first_tomogram = tomograms[0]
    print(f"Tomogram name: {first_tomogram.name}")
    print(f"Tomogram size (x,y,z): {first_tomogram.size_x} x {first_tomogram.size_y} x {first_tomogram.size_z}")
    print(f"Voxel spacing: {first_tomogram.voxel_spacing} Å")
    print(f"S3 path: {first_tomogram.s3_omezarr_dir}")

## 4. Access Zarr Data from S3

Now let's access the actual Zarr data from the S3 bucket.

In [None]:
# Setup S3 filesystem (anonymous access)
s3 = s3fs.S3FileSystem(anon=True)

# Get the zarr path (removing the 's3://' prefix)
if tomograms:
    zarr_path = first_tomogram.s3_omezarr_dir.replace('s3://', '')
    print(f"Accessing zarr data at: {zarr_path}")
    
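    # Open the OME-Zarr store. The tomogram directory is a multiscale group,
    # so select the array at path "0" (assumed here to be the full-resolution level)
    store = s3fs.S3Map(root=zarr_path, s3=s3, check=False)
    zarr_group = zarr.open_group(store, mode='r')
    zarr_array = zarr_group["0"]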
    
    print(f"\nZarr array shape: {zarr_array.shape}")
    print(f"Zarr array dtype: {zarr_array.dtype}")
    print(f"Zarr array chunks: {zarr_array.chunks}")
    print(f"Zarr array compressor: {zarr_array.compressor}")
    print(f"Zarr array size: {zarr_array.nbytes / (1024**3):.2f} GB (uncompressed)")
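
The cell above strips the `s3://` prefix with `str.replace`, which would also alter any later occurrence of that substring. As a minimal sketch of a stricter alternative, the hypothetical helper `split_s3_url` below uses the standard library's `urlparse` to separate bucket and key and to reject non-S3 URLs. The example URL is made up for illustration and is not a real portal path.

```python
from urllib.parse import urlparse


def split_s3_url(url):
    """Split 's3://bucket/key' into (bucket, key); raise on other schemes."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URL: {url!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical URL, for illustration only
bucket, key = split_s3_url("s3://example-bucket/some/run/Tomogram.zarr")
print(bucket)  # example-bucket
print(key)     # some/run/Tomogram.zarr
```

Joining the two parts as `f"{bucket}/{key}"` gives the same `bucket/key` root string that `s3fs.S3Map` receives in the cell above.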

## 5. Benchmark Reading from S3

Let's benchmark reading different portions of the data.

In [None]:
# Benchmark 1: Read a single slice
print("Benchmark 1: Reading a single XY slice")
start_time = time.time()
single_slice = zarr_array[zarr_array.shape[0]//2, :, :]
slice_time = time.time() - start_time
print(f"  Time: {slice_time:.3f}s")
print(f"  Shape: {single_slice.shape}")
print(f"  Size: {single_slice.nbytes / (1024**2):.2f} MB")

# Benchmark 2: Read a small 3D chunk
print("\nBenchmark 2: Reading a 128x128x128 chunk")
start_time = time.time()
chunk_size = 128
small_chunk = zarr_array[:chunk_size, :chunk_size, :chunk_size]
chunk_time = time.time() - start_time
print(f"  Time: {chunk_time:.3f}s")
print(f"  Shape: {small_chunk.shape}")
print(f"  Size: {small_chunk.nbytes / (1024**2):.2f} MB")

# Benchmark 3: Read a larger chunk
print("\nBenchmark 3: Reading a 256x256x256 chunk")
start_time = time.time()
chunk_size = 256
if all(dim >= chunk_size for dim in zarr_array.shape):
    medium_chunk = zarr_array[:chunk_size, :chunk_size, :chunk_size]
    medium_chunk_time = time.time() - start_time
    print(f"  Time: {medium_chunk_time:.3f}s")
    print(f"  Shape: {medium_chunk.shape}")
    print(f"  Size: {medium_chunk.nbytes / (1024**2):.2f} MB")
else:
    print("  Skipped: tomogram is smaller than 256x256x256")
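
Each benchmark above is timed once, so a single slow network request can dominate the result. A minimal sketch of a repeat-timing helper follows; `time_repeated` is a hypothetical name, and the workload here is a stand-in rather than an S3 read. It uses `time.perf_counter`, which is intended for measuring short intervals.

```python
import statistics
import time


def time_repeated(fn, repeats=3):
    """Call fn() `repeats` times and return (best, median) wall-clock seconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return min(durations), statistics.median(durations)


# Stand-in workload; in this notebook it could be
# lambda: zarr_array[:128, :128, :128]
best, median = time_repeated(lambda: sum(range(100_000)), repeats=5)
print(f"best: {best:.4f}s, median: {median:.4f}s")
```

When the workload is a remote read, later repeats may benefit from caching along the way, so it is worth looking at the best and median values separately rather than averaging them.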

## 6. Visualize a Slice

Let's visualize a slice from the tomogram.

In [None]:
# Visualize the middle slice
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# XY slice (middle Z)
mid_z = zarr_array.shape[0] // 2
xy_slice = zarr_array[mid_z, :, :]
axes[0].imshow(xy_slice, cmap='gray')
axes[0].set_title(f'XY Slice (Z={mid_z})')
axes[0].axis('off')

# XZ slice (middle Y)
mid_y = zarr_array.shape[1] // 2
xz_slice = zarr_array[:, mid_y, :]
axes[1].imshow(xz_slice, cmap='gray')
axes[1].set_title(f'XZ Slice (Y={mid_y})')
axes[1].axis('off')

# YZ slice (middle X)
mid_x = zarr_array.shape[2] // 2
yz_slice = zarr_array[:, :, mid_x]
axes[2].imshow(yz_slice, cmap='gray')
axes[2].set_title(f'YZ Slice (X={mid_x})')
axes[2].axis('off')

plt.tight_layout()
plt.show()

## 7. Download and Re-compress with Different Codecs

Now let's download a chunk of data and test different compression methods using zarr-benchmarks.

In [None]:
# Extract a manageable chunk for benchmarking
chunk_size = 256
if all(dim >= chunk_size for dim in zarr_array.shape):
    test_data = zarr_array[:chunk_size, :chunk_size, :chunk_size]
else:
    # Use the full array if it's smaller
    test_data = zarr_array[:]

print(f"Test data shape: {test_data.shape}")
print(f"Test data size: {test_data.nbytes / (1024**2):.2f} MB")
print(f"Test data dtype: {test_data.dtype}")

In [None]:
# Setup for re-compression benchmarks
output_dir = pathlib.Path("data/output/cryoet_benchmarks")
output_dir.mkdir(parents=True, exist_ok=True)

chunk_size_recompress = 64
chunks = (chunk_size_recompress, chunk_size_recompress, chunk_size_recompress)
zarr_spec = 3

results = {}

print(f"Original compression: {zarr_array.compressor}")
print(f"Testing re-compression with chunk size: {chunk_size_recompress}")
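
Before writing, it can help to know how many chunks a given chunk size produces, since each chunk is stored and compressed separately. The sketch below works this out for the 256x256x256 test volume and 64x64x64 chunks used here. `chunk_grid` is a hypothetical helper, and `itemsize=4` assumes a 32-bit dtype; check `test_data.dtype.itemsize` for the real value.

```python
import math


def chunk_grid(shape, chunks, itemsize):
    """Return (chunks per axis, total chunk count, uncompressed bytes per full chunk)."""
    per_axis = tuple(math.ceil(s / c) for s, c in zip(shape, chunks))
    total = math.prod(per_axis)
    chunk_bytes = math.prod(chunks) * itemsize
    return per_axis, total, chunk_bytes


# 256^3 test volume with 64^3 chunks; itemsize=4 assumes a 32-bit dtype
per_axis, total, chunk_bytes = chunk_grid((256, 256, 256), (64, 64, 64), itemsize=4)
print(per_axis)               # (4, 4, 4)
print(total)                  # 64
print(chunk_bytes / 1024**2)  # 1.0
```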

### 7.1 Test Blosc Compression

In [None]:
# Blosc compression
store_path = output_dir / "cryoet_blosc.zarr"
blosc_compressor = read_write_zarr.get_blosc_compressor(
    cname="zstd",
    clevel=5,
    shuffle="shuffle",
    zarr_spec=zarr_spec
)

utils.remove_output_dir(store_path)
start_time = time.time()
read_write_zarr.write_zarr_array(
    test_data,
    store_path,
    overwrite=False,
    chunks=chunks,
    compressor=blosc_compressor,
    zarr_spec=zarr_spec
)
write_time = time.time() - start_time

start_time = time.time()
read_back = read_write_zarr.read_zarr_array(store_path)
read_time = time.time() - start_time

compression_ratio = read_write_zarr.get_compression_ratio(store_path)
storage_size = utils.get_directory_size(store_path) / (1024**2)

results['blosc_zstd'] = {
    'write_time': write_time,
    'read_time': read_time,
    'compression_ratio': compression_ratio,
    'storage_size_mb': storage_size
}

print("Blosc-Zstd Results:")
print(f"  Write time: {write_time:.3f}s")
print(f"  Read time: {read_time:.3f}s")
print(f"  Compression ratio: {compression_ratio:.2f}x")
print(f"  Storage size: {storage_size:.2f} MB")

### 7.2 Test GZip Compression

In [None]:
# GZip compression
store_path = output_dir / "cryoet_gzip.zarr"
gzip_compressor = read_write_zarr.get_gzip_compressor(
    level=6,
    zarr_spec=zarr_spec
)

utils.remove_output_dir(store_path)
start_time = time.time()
read_write_zarr.write_zarr_array(
    test_data,
    store_path,
    overwrite=False,
    chunks=chunks,
    compressor=gzip_compressor,
    zarr_spec=zarr_spec
)
write_time = time.time() - start_time

start_time = time.time()
read_back = read_write_zarr.read_zarr_array(store_path)
read_time = time.time() - start_time

compression_ratio = read_write_zarr.get_compression_ratio(store_path)
storage_size = utils.get_directory_size(store_path) / (1024**2)

results['gzip'] = {
    'write_time': write_time,
    'read_time': read_time,
    'compression_ratio': compression_ratio,
    'storage_size_mb': storage_size
}

print("GZip Results:")
print(f"  Write time: {write_time:.3f}s")
print(f"  Read time: {read_time:.3f}s")
print(f"  Compression ratio: {compression_ratio:.2f}x")
print(f"  Storage size: {storage_size:.2f} MB")

### 7.3 Test Different Blosc Algorithms

In [None]:
# Test other Blosc algorithms: lz4 (fast) and zlib; zstd was already benchmarked in 7.1
for cname in ['lz4', 'zlib']:
    store_path = output_dir / f"cryoet_blosc_{cname}.zarr"
    blosc_compressor = read_write_zarr.get_blosc_compressor(
        cname=cname,
        clevel=5,
        shuffle="shuffle",
        zarr_spec=zarr_spec
    )
    
    utils.remove_output_dir(store_path)
    start_time = time.time()
    read_write_zarr.write_zarr_array(
        test_data,
        store_path,
        overwrite=False,
        chunks=chunks,
        compressor=blosc_compressor,
        zarr_spec=zarr_spec
    )
    write_time = time.time() - start_time
    
    start_time = time.time()
    read_back = read_write_zarr.read_zarr_array(store_path)
    read_time = time.time() - start_time
    
    compression_ratio = read_write_zarr.get_compression_ratio(store_path)
    storage_size = utils.get_directory_size(store_path) / (1024**2)
    
    results[f'blosc_{cname}'] = {
        'write_time': write_time,
        'read_time': read_time,
        'compression_ratio': compression_ratio,
        'storage_size_mb': storage_size
    }
    
    print(f"\nBlosc-{cname} Results:")
    print(f"  Write time: {write_time:.3f}s")
    print(f"  Read time: {read_time:.3f}s")
    print(f"  Compression ratio: {compression_ratio:.2f}x")
    print(f"  Storage size: {storage_size:.2f} MB")
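
The three cells in this section repeat the same measure-and-record pattern. As a rough sketch of how that pattern folds into one function, the example below times a compress/decompress round trip and returns the same four metrics stored in `results`. It uses the standard library's `zlib` and `lzma` on synthetic bytes as stand-ins for the Zarr codecs, and it assumes the compression ratio means uncompressed size divided by stored size, which may differ from how `get_compression_ratio` computes it. `benchmark_codec` is a hypothetical name.

```python
import lzma
import time
import zlib


def benchmark_codec(compress, decompress, payload):
    """Time one compress/decompress round trip and report the same metrics as `results`."""
    start = time.perf_counter()
    packed = compress(payload)
    write_time = time.perf_counter() - start

    start = time.perf_counter()
    restored = decompress(packed)
    read_time = time.perf_counter() - start

    if restored != payload:
        raise ValueError("round trip changed the data")
    return {
        "write_time": write_time,
        "read_time": read_time,
        "compression_ratio": len(payload) / len(packed),
        "storage_size_mb": len(packed) / 1024**2,
    }


# Synthetic, highly repetitive payload (not tomogram data)
payload = bytes(range(256)) * 4096  # 1 MiB

demo_results = {
    "zlib": benchmark_codec(lambda b: zlib.compress(b, 6), zlib.decompress, payload),
    "lzma": benchmark_codec(lzma.compress, lzma.decompress, payload),
}
for name, row in demo_results.items():
    print(f"{name}: ratio {row['compression_ratio']:.1f}x, "
          f"write {row['write_time']:.4f}s, read {row['read_time']:.4f}s")
```

The round-trip check is worth keeping in a real benchmark too: comparing `read_back` against `test_data` confirms that the codec configuration is lossless.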

## 8. Compare Results

In [None]:
import pandas as pd

# Create comparison plots
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

methods = list(results.keys())
write_times = [results[m]['write_time'] for m in methods]
read_times = [results[m]['read_time'] for m in methods]
compression_ratios = [results[m]['compression_ratio'] for m in methods]
storage_sizes = [results[m]['storage_size_mb'] for m in methods]

# Plot 1: Write times
axes[0, 0].bar(methods, write_times, color='steelblue')
axes[0, 0].set_ylabel('Time (seconds)')
axes[0, 0].set_title('Write Performance for CryoET Data')
axes[0, 0].tick_params(axis='x', rotation=45)

# Plot 2: Read times
axes[0, 1].bar(methods, read_times, color='coral')
axes[0, 1].set_ylabel('Time (seconds)')
axes[0, 1].set_title('Read Performance for CryoET Data')
axes[0, 1].tick_params(axis='x', rotation=45)

# Plot 3: Compression ratios
axes[1, 0].bar(methods, compression_ratios, color='green')
axes[1, 0].set_ylabel('Compression Ratio')
axes[1, 0].set_title('Compression Ratio (Higher is Better)')
axes[1, 0].tick_params(axis='x', rotation=45)

# Plot 4: Storage sizes
axes[1, 1].bar(methods, storage_sizes, color='purple')
axes[1, 1].set_ylabel('Size (MB)')
axes[1, 1].set_title('Storage Size (Lower is Better)')
axes[1, 1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

# Summary table
summary_df = pd.DataFrame(results).T
summary_df = summary_df.round(3)
summary_df.columns = ['Write Time (s)', 'Read Time (s)', 'Compression Ratio', 'Storage Size (MB)']

print("\n=== CryoET Data Compression Benchmark Summary ===")
print(summary_df)

print("\nBest Methods:")
print(f"  Fastest write: {summary_df['Write Time (s)'].idxmin()}")
print(f"  Fastest read: {summary_df['Read Time (s)'].idxmin()}")
print(f"  Best compression: {summary_df['Compression Ratio'].idxmax()}")
print(f"  Smallest storage: {summary_df['Storage Size (MB)'].idxmin()}")

## 9. Explore More Tomograms

You can iterate through other tomograms in the dataset to test different data characteristics.

In [None]:
# List all tomogram details in the first run
print("Available tomograms in the first run:")
print("="*80)
for i, tomo in enumerate(tomograms[:5]):  # Show first 5
    print(f"\n{i+1}. {tomo.name}")
    print(f"   Size: {tomo.size_x} x {tomo.size_y} x {tomo.size_z}")
    print(f"   Voxel spacing: {tomo.voxel_spacing} Å")
    print(f"   S3 path: {tomo.s3_omezarr_dir}")
    
if len(tomograms) > 5:
    print(f"\n... and {len(tomograms) - 5} more tomograms")

## 10. Next Steps and Recommendations

### Key Findings for CryoET Data

Based on the benchmarks above, you can now make informed decisions about:

1. **Compression Method**: Which codec provides the best balance of speed and compression for your use case
2. **Chunk Size**: How chunk size affects access patterns (test with different sizes)
3. **Storage vs Speed**: Trade-offs between storage space and read/write performance
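
One way to reason about the storage versus speed trade-off is to keep only the codecs that no other codec beats on both metrics at once (a Pareto front). The sketch below does this over a dictionary shaped like `results`. `pareto_front` is a hypothetical helper, and the numbers are invented for illustration; they are not measurements from this notebook.

```python
def pareto_front(rows, minimize=("read_time", "storage_size_mb")):
    """Return names of entries that no other entry beats on every metric in `minimize`."""
    front = []
    for name, row in rows.items():
        dominated = any(
            other_name != name
            and all(other[m] <= row[m] for m in minimize)
            and any(other[m] < row[m] for m in minimize)
            for other_name, other in rows.items()
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical numbers for illustration only, not measured results
example = {
    "codec_a": {"read_time": 0.10, "storage_size_mb": 40.0},
    "codec_b": {"read_time": 0.30, "storage_size_mb": 25.0},
    "codec_c": {"read_time": 0.35, "storage_size_mb": 42.0},
}
print(pareto_front(example))  # ['codec_a', 'codec_b']
```

Here `codec_c` drops out because `codec_a` is both faster and smaller, while `codec_a` and `codec_b` remain as a genuine choice between speed and size. Calling `pareto_front(results)` would apply the same filter to the measured values.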

### Recommendations for CryoET Data

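- **For interactive analysis**: Blosc with LZ4 or Zstd is a common starting point, since both typically decompress quickly; confirm against the read times measured above
- **For archival storage**: Favor whichever codec gave the highest compression ratio above; GZip, Blosc with Zlib, and Blosc with Zstd at higher levels are the usual candidates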
- **For streaming access**: Consider chunk sizes that match your typical access patterns
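
To make the access-pattern point concrete, the sketch below counts how many chunks a read touches. `chunks_touched` is a hypothetical helper that takes a `(start, stop)` range per axis; the shapes match the 256x256x256 test volume and 64x64x64 chunks used in this notebook.

```python
import math


def chunks_touched(selection, chunks):
    """Count chunks intersected by a read; `selection` is (start, stop) per axis."""
    counts = []
    for (start, stop), size in zip(selection, chunks):
        if stop <= start:
            return 0  # empty selection touches nothing
        first = start // size
        last = (stop - 1) // size
        counts.append(last - first + 1)
    return math.prod(counts)


chunks = (64, 64, 64)
# One XY slice (single Z index) of a 256^3 volume
xy_slice = [(128, 129), (0, 256), (0, 256)]
# One chunk-aligned 64^3 sub-volume
sub_volume = [(0, 64), (0, 64), (0, 64)]

print(chunks_touched(xy_slice, chunks))    # 16
print(chunks_touched(sub_volume, chunks))  # 1
```

Each touched chunk has to be fetched and decompressed in full, so the XY slice pulls 16 chunks (about 4.2 million voxels) to return 65,536 voxels. If slice viewing is the main use case, chunks that are thinner along Z would reduce that overhead.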

### Further Exploration

1. Test with different tomograms (sparse vs dense data)
2. Benchmark different chunk sizes (32, 64, 128, 256)
3. Try different compression levels
4. Test partial reads vs full reads
5. Compare with other runs and datasets
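
Items 2 and 3 amount to a parameter sweep. A minimal sketch of building that sweep with `itertools.product` is shown below. The chunk sizes come from the list above; the codec names and levels are illustrative choices, not recommendations. Each resulting dictionary could then feed the `get_blosc_compressor` and `write_zarr_array` calls from section 7.

```python
from itertools import product

chunk_sizes = [32, 64, 128, 256]
# Codec names and levels here are illustrative choices, not a recommendation
codecs = ["lz4", "zstd", "zlib"]
levels = [1, 5, 9]

# One configuration per (chunk size, codec, level) combination
configs = [
    {"chunks": (size,) * 3, "cname": cname, "clevel": level}
    for size, cname, level in product(chunk_sizes, codecs, levels)
]

print(len(configs))  # 36
print(configs[0])    # {'chunks': (32, 32, 32), 'cname': 'lz4', 'clevel': 1}
```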

### Resources

- CryoET Portal: https://cryoetdataportal.czscience.com/
- CryoET API Docs: https://chanzuckerberg.github.io/cryoet-data-portal/
- Zarr Benchmarks Docs: https://heftieproject.github.io/zarr-benchmarks/
- Dataset 10445: https://cryoetdataportal.czscience.com/datasets/10445