<!-- DATA PROVIDER INSTRUCTIONS

1. Provide the name of your dataset, replacing the bracketed placeholder text.
2. Update the Registry of Open Data landing page URL, by replacing the bracketed placeholder text. The [REGISTRY_YAML_NAME] will correspond to the name of the YAML document in your pull request to the Registry of Open Data on Github, minus the .yaml file extension.
3. Remove these comment blocks when you have completed each section.

DATA PROVIDER INSTRUCTIONS -->

# Get to Know a Dataset: Vesuvius Challenge - CT Scans of Herculaneum Papyri

This notebook serves as a guided tour of the [Vesuvius Challenge - CT Scans of Herculaneum Papyri](https://registry.opendata.aws/vesuvius-challenge-herculaneum-scrolls) dataset, provided as OME-Zarr volumes. More usage examples, tutorials, and documentation for this dataset and others can be found at the [Registry of Open Data on AWS](https://registry.opendata.aws/).

<!-- DATA PROVIDER INSTRUCTIONS

The goal of this section is to orient users to the structure of your dataset. 

1. How are key prefixes and objects organized in your S3 bucket?
2. What kinds of filetypes are represented in your dataset?
3. Explain with text what users are expected to encounter, and then demonstrate with code the organizational framework you applied when creating your dataset.
4. The responses to each question section are meant to be expanded or replaced as dictated by your dataset

DATA PROVIDER INSTRUCTIONS -->

### Q: How have you organized your dataset? Help us understand the key prefix structure of your S3 bucket.

Each scan is stored as a single OME-Zarr root (a directory ending in .zarr/) that contains OME-NGFF metadata and multiscale arrays. Level 0 is the native resolution; higher levels are downsampled for fast preview and chunked access.

We have scanned more than 30 scrolls, but most are not yet released. The URLs for the CT scans of the released scrolls are listed at https://scrollprize.org/data_scrolls
The URLs for fragments (pieces mechanically detached from scrolls) are listed at https://scrollprize.org/data_fragments

#### TODO for Johannes, brief recap of data organization here

In this tutorial we focus on the data hosted on AWS thanks to the AWS Open Data Sponsorship Program. The S3 bucket name and final prefix layout will be added here once finalized.

We recommend starting with these two volumes, acquired using the same protocol as most unreleased scans:
- PHerc. 0139 (scroll)
- PHerc. 0009B (fragment)

In [None]:
# CODING GUIDELINES FOR DATA PROVIDER
#
# General notebook coding guidelines:
# 1. Assume that your reader understands the basics of Jupyter Notebooks, Python, and their Python environment.
#    The focus of this tutorial is on your dataset.
# 2. For library requirements, list the required libraries in a comment block in "requirements.txt" format
#    (https://pip.pypa.io/en/stable/reference/requirements-file-format/)
# 3. Demonstrate importing libraries with the assumption that the user has correctly installed the required
#    libraries.
# 4. List and load all library dependencies once, at this point of the notebook, unless a complicated dependency
#    set makes it unwieldy.
# 5. Remember, the goal of this tutorial is a 101-level introduction to your dataset using common tools and libraries.
#    Examples using specialized environments and deep-diving methods are better suited to follow-up tutorials.
#
# CODING GUIDELINES FOR DATA PROVIDER

We now import the basic Python libraries (NumPy, Matplotlib, boto3) and the `vesuvius` library. We also set some default plot properties and define the edge length, in voxels, of the cubic chunk we will extract from the CT scan of the scroll.

In [None]:
# This notebook requires the following additional libraries:
#
# vesuvius
# numpy
# matplotlib
# boto3

# Accept the Vesuvius Challenge data license (required before first use).
# This is a non-interactive acceptance; see https://scrollprize.org/data for details.
#!vesuvius.accept_terms --yes

# Import the libraries required for this notebook
import numpy as np
import matplotlib.pyplot as plt

import boto3
from botocore import UNSIGNED
from botocore.config import Config

import vesuvius
from vesuvius import Volume

plt.rcParams["figure.figsize"] = (6, 6)
plt.rcParams["image.cmap"] = "gray"  # grayscale is easier for CT volumes

CHUNK_SIZE = 256

Next, we define the S3 bucket and list the top-level prefixes.

In [None]:
bucket = "<AWS_OPEN_DATA_BUCKET>"  # TODO: set when finalized
dataset_prefix = ""  # optional prefix, e.g. "vesuvius/"

if bucket.startswith("<"): # TODO: this prints a reminder if bucket is not finalized yet, to delete in final version
    print("Set bucket and dataset_prefix to list S3 contents.")
else:
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=dataset_prefix, Delimiter="/")
    for item in resp.get("CommonPrefixes", []):
        print(item["Prefix"])

Below are two example OME-Zarr roots that we will use throughout the notebook. These are public HTTP endpoints that point to the same data hosted in S3.

In [None]:
#TODO: How to handle end points with open data program?

PHERC_0139_URL_9um = "https://data.aws.ash2txt.org/samples/PHerc0139/volumes/20250728140407-9.362um-1.2m-113keV-masked.zarr/"
PHERC_0009B_URL_9um = "https://data.aws.ash2txt.org/samples/PHerc0009B/volumes/20250521125136-8.640um-1.2m-116keV-masked.zarr/"

SAMPLE_VOLUMES = {
    "PHerc. 0139 (scroll)": PHERC_0139_URL_9um,
    "PHerc. 0009B (fragment)": PHERC_0009B_URL_9um,
}

for name, url in SAMPLE_VOLUMES.items():
    print(f"{name}: {url}")
    print(f"  native scale: {url}0")
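
The folder names in these URLs appear to pack acquisition parameters into a fixed pattern. The sketch below assumes the fields are, in order, an acquisition timestamp, the voxel size in micrometres, a distance in metres, and the beam energy in keV, followed by optional tags such as `masked`. This reading is inferred from the two example URLs, not from an official specification, and `parse_volume_name` is a hypothetical helper that is not part of the `vesuvius` library.

```python
import re
from datetime import datetime

# Hypothetical helper: the field meanings are assumptions inferred from the example URLs.
VOLUME_NAME_RE = re.compile(
    r"(?P<timestamp>\d{14})-"
    r"(?P<voxel_um>\d+(?:\.\d+)?)um-"
    r"(?P<distance_m>\d+(?:\.\d+)?)m-"
    r"(?P<energy_kev>\d+(?:\.\d+)?)keV"
    r"(?P<suffix>(?:-[A-Za-z0-9]+)*)\.zarr/?$"
)


def parse_volume_name(url):
    """Split an OME-Zarr root URL into the fields encoded in its folder name."""
    match = VOLUME_NAME_RE.search(url)
    if match is None:
        raise ValueError(f"Unrecognized volume name: {url}")
    return {
        "acquired": datetime.strptime(match["timestamp"], "%Y%m%d%H%M%S"),
        "voxel_um": float(match["voxel_um"]),
        "distance_m": float(match["distance_m"]),
        "energy_kev": float(match["energy_kev"]),
        "tags": [tag for tag in match["suffix"].split("-") if tag],
    }


example_url = (
    "https://data.aws.ash2txt.org/samples/PHerc0139/volumes/"
    "20250728140407-9.362um-1.2m-113keV-masked.zarr/"
)
info = parse_volume_name(example_url)
print(info)
```

A helper like this makes it easy to filter a list of volumes by voxel size or energy before opening any of them.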

These Zarr stores follow the OME-Zarr layout: each root folder contains numbered subfolders, one per resolution level of the same volume.

We will access the native scale, which is in subfolder "0".

The axis order in the volumes is Z, Y, X.

In [None]:
SCROLL_URL = PHERC_0139_URL_9um
FRAGMENT_URL = PHERC_0009B_URL_9um
NATIVE_SCALE = "0"
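
When previewing a region at a coarser level, it helps to translate native-scale coordinates into that level's voxel grid. The sketch below assumes that each pyramid level halves the resolution of the previous one along every axis. That factor is an assumption; the actual values are recorded in the OME-NGFF metadata of each store. `level_url` and `box_at_level` are hypothetical helper names.

```python
# Sketch under an assumption: each pyramid level halves the resolution of the previous one.
def level_url(root_url, level):
    """Return the URL of one resolution level inside an OME-Zarr root."""
    return root_url.rstrip("/") + f"/{level}"


def box_at_level(start_zyx, size_zyx, level, factor=2):
    """Map a native-scale box (Z, Y, X order) to voxel indices at a coarser level."""
    scale = factor ** level
    start = tuple(s // scale for s in start_zyx)
    # Round the far corner up so the coarse box still covers the native box.
    stop = tuple(-(-(s + n) // scale) for s, n in zip(start_zyx, size_zyx))
    return start, stop


root = (
    "https://data.aws.ash2txt.org/samples/PHerc0139/volumes/"
    "20250728140407-9.362um-1.2m-113keV-masked.zarr/"
)
print(level_url(root, 2))

start, stop = box_at_level((10500, 3000, 3000), (256, 256, 256), level=2)
print(f"level 2 box: start={start}, stop={stop}")
```

Reading the coarse box first is a cheap way to check that a region contains papyrus before fetching it at native resolution.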

<!-- DATA PROVIDER INSTRUCTIONS
This section is meant to orient users of your dataset to the formats present in your dataset, particularly if your dataset includes formats that may be unfamiliar to a general data scientist audience. This section should include:

1. Explanation of data format(s) (very common formats can be very briefly described, while less common
   or domain specific formats should include more explanation as well as links to official documentation)
2. Explanation of why the data format was chosen for your dataset
3. Recommendations around software and tooling to work with this data format
4. Explanation of any dataset-specific aspects to your usage of the format
5. Description of AWS services that may be useful to users working with your data
DATA PROVIDER INSTRUCTIONS -->

### Q: What data formats are present in your dataset? What kinds of data are stored using these formats? Can you give any advice for how you work with these data formats?

Our dataset is provided as [OME-Zarr](https://pmc.ncbi.nlm.nih.gov/articles/PMC9980008/) archives. A Zarr is a chunked N-dimensional array; in our case it is 3D with Z, Y, and X axes. OME-Zarr adds a multiscale pyramid: level 0 is native resolution, and higher levels are downsampled for fast preview and interactive access. This structure enables cloud-native, chunked reads without downloading entire volumes.
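
The multiscale pyramid is described by a small JSON document in the store's attributes (in Zarr v2 stores, typically the `.zattrs` file at the root). The sketch below parses such a document in the OME-NGFF 0.4 layout and lists the path, axis order, and voxel size of each level. The metadata values are invented for illustration and are not copied from the dataset; `list_levels` is a hypothetical helper.

```python
import json

# Illustrative metadata only: the values are made up to show the OME-NGFF 0.4 layout,
# not copied from the actual dataset.
EXAMPLE_ZATTRS = json.loads("""
{
  "multiscales": [{
    "version": "0.4",
    "axes": [
      {"name": "z", "type": "space", "unit": "micrometer"},
      {"name": "y", "type": "space", "unit": "micrometer"},
      {"name": "x", "type": "space", "unit": "micrometer"}
    ],
    "datasets": [
      {"path": "0", "coordinateTransformations": [{"type": "scale", "scale": [9.362, 9.362, 9.362]}]},
      {"path": "1", "coordinateTransformations": [{"type": "scale", "scale": [18.724, 18.724, 18.724]}]},
      {"path": "2", "coordinateTransformations": [{"type": "scale", "scale": [37.448, 37.448, 37.448]}]}
    ]
  }]
}
""")


def list_levels(zattrs):
    """Return (path, axis names, voxel size per axis) for every pyramid level."""
    multiscale = zattrs["multiscales"][0]
    axes = [axis["name"] for axis in multiscale["axes"]]
    levels = []
    for dataset in multiscale["datasets"]:
        # Each dataset carries a list of transformations; we only need the scale entry.
        scale = next(
            t["scale"] for t in dataset["coordinateTransformations"] if t["type"] == "scale"
        )
        levels.append((dataset["path"], axes, scale))
    return levels


for path, axes, scale in list_levels(EXAMPLE_ZATTRS):
    print(f"level {path}: axes={axes}, voxel size={scale}")
```

Reading the scale factors from the metadata, instead of hard-coding them, keeps downstream code correct even if a store uses a different downsampling scheme.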

Recommended tooling for a first look: the `vesuvius` library for convenience, `zarr` plus `fsspec/s3fs` for direct access, `numpy` for analysis, and `napari` for interactive viewing.

However, to work on our virtual unwrapping pipeline, we recommend using, and building on top of, the virtual unwrapping software that we are developing, called [VC3D](https://github.com/ScrollPrize/villa/tree/main/volume-cartographer). A more advanced tutorial can be found [here](https://scrollprize.org/segmentation).

AWS services that may be useful: S3 for storage, EC2 or AWS Batch for distributed chunk processing, and SageMaker for ML training and inference on derived datasets.

<!-- DATA PROVIDER INSTRUCTIONS
The goal of this section is to demonstrate loading a portion of data from your dataset, and reveal something about its structure.
1. Load an object from S3
2. Show the structure of data in the object
DATA PROVIDER INSTRUCTIONS -->

### Q: Can you show us an example of downloading and loading data from your dataset?

Below we open an OME-Zarr volume using the `vesuvius` library. The `Volume` object is lazy; data are fetched only when indexed. We start from the native scale (level 0).

In [None]:
scroll = Volume(type="zarr", path=SCROLL_URL + NATIVE_SCALE)
print(f"Shape: {scroll.shape()}")
print(f"dtype: {scroll.dtype}")

Great! The scan is now opened lazily, so voxel data will only be fetched when we index into it. Next, we download a chunk from the full volume by specifying its bounding box in Z, Y, X order.

In [None]:
chunk = scroll[
    10500:10500 + CHUNK_SIZE,
    3000:3000 + CHUNK_SIZE,
    3000:3000 + CHUNK_SIZE,
]
print(f"Chunk shape: {chunk.shape}")
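
Because Zarr stores data in fixed-size chunks, the cost of a read depends on how many storage chunks the bounding box intersects, not only on its volume. The sketch below counts those chunks for a given box. The 128-voxel storage chunk shape is an assumption for illustration (the real value is in the array metadata), and `chunks_touched` is a hypothetical helper.

```python
import math


def chunks_touched(start_zyx, size_zyx, chunk_zyx):
    """Count the storage chunks intersected by a box, per axis and in total."""
    per_axis = []
    for start, size, chunk in zip(start_zyx, size_zyx, chunk_zyx):
        first = start // chunk               # index of the first chunk touched
        last = (start + size - 1) // chunk   # index of the last chunk touched
        per_axis.append(last - first + 1)
    return per_axis, math.prod(per_axis)


# Assumed storage chunk shape; read the real value from the array metadata.
ASSUMED_CHUNK = (128, 128, 128)

per_axis, total = chunks_touched((10500, 3000, 3000), (256, 256, 256), ASSUMED_CHUNK)
print(f"unaligned box: {per_axis} chunks per axis, {total} in total")

aligned_axis, aligned_total = chunks_touched((10496, 2944, 2944), (256, 256, 256), ASSUMED_CHUNK)
print(f"aligned box:   {aligned_axis} chunks per axis, {aligned_total} in total")
```

Under the assumed chunk shape, the unaligned box touches 27 chunks while a box of the same size snapped to chunk boundaries touches 8, so aligning reads to the chunk grid can noticeably reduce the amount of data transferred.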

Let us inspect basic intensity statistics for this chunk to guide visualization and normalization.

In [None]:
chunk_min = float(chunk.min())
chunk_max = float(chunk.max())
p01, p99 = np.percentile(chunk, [1, 99])

print(f"Chunk min: {chunk_min}")
print(f"Chunk max: {chunk_max}")
print(f"1st percentile: {p01}")
print(f"99th percentile: {p99}")
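
One common way to use these percentiles is to clip the intensities to the 1st to 99th percentile range and rescale them to [0, 1] before display or model input. The sketch below shows this on a synthetic array so that it runs without network access. `normalize_percentile` is a hypothetical helper, and the percentile window is a choice for illustration, not a recommendation specific to this dataset.

```python
import numpy as np


def normalize_percentile(volume, low=1.0, high=99.0):
    """Clip to the [low, high] percentile range and rescale to floats in [0, 1]."""
    lo, hi = np.percentile(volume, [low, high])
    if hi <= lo:  # constant data: avoid dividing by zero
        return np.zeros(volume.shape, dtype=np.float32)
    scaled = (volume.astype(np.float32) - lo) / (hi - lo)
    return np.clip(scaled, 0.0, 1.0).astype(np.float32)


# Synthetic stand-in for a CT chunk so the sketch runs without network access.
rng = np.random.default_rng(0)
fake_chunk = rng.integers(0, 60000, size=(32, 32, 32), dtype=np.uint16)

normalized = normalize_percentile(fake_chunk)
print(normalized.dtype, float(normalized.min()), float(normalized.max()))
```

To apply it to the real data, pass the extracted `chunk` array in place of `fake_chunk`.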

<!-- DATA PROVIDER INSTRUCTIONS
The goal here is to visualize some aspect of your dataset in order to help users understand it. In addition to helping users of your dataset understand the dataset, an additional goal is to impress!

Please demonstrate any data preprocessing or reshaping required for your visualization(s).

https://www.reddit.com/r/dataisbeautiful/ for inspiration.
DATA PROVIDER INSTRUCTIONS -->

### Q: A picture is worth a thousand words. Show us a visual (or several!) from your dataset that either illustrates something informative about your dataset, or that you think might excite someone to dig in further.

We start by visualizing three orthogonal slices of the extracted chunk.

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Z=100, XY
axes[0].imshow(chunk[100, :, :])
axes[0].set_title("XY plane")
axes[0].axis("off")

# Y=100, XZ
axes[1].imshow(chunk[:, 100, :])
axes[1].set_title("XZ plane")
axes[1].axis("off")

# X=100, YZ
axes[2].imshow(chunk[:, :, 100])
axes[2].set_title("YZ plane")
axes[2].axis("off")

plt.tight_layout()
plt.show()

Next, we plot a histogram of the intensities in the extracted chunk.

In [None]:
plt.figure(figsize=(8, 5))
plt.hist(chunk.ravel(), bins=80, color="#4c78a8", alpha=0.8)
plt.title("Intensity histogram (chunk)")
plt.xlabel("Intensity")
plt.ylabel("Count")
plt.tight_layout()
plt.show()

<!-- DATA PROVIDER INSTRUCTIONS
This section is less prescriptive / freeform than previous sections. The goal here is to show an opinionated example of answering a question using your data. The scale of your dataset may preclude a full example, and so feel free to limit the scope of this example (e.g. work on a subset of data). Users should be able to replicate your example in this notebook, and get a sense of how they would scale up.

A "toy" example is better than no example.

Ideally, your example would:
1. Transmit some of your domain & dataset experience to the reader, drawing on your own work as much as possible
2. Provide a jumping off point for users to extend your work, and do novel work of their own.

DATA PROVIDER INSTRUCTIONS -->

### Q: What is one question that you have answered using these data? Can you show us how you came to that answer?

We are on a quest to digitally unwrap these CT scans of scrolls and read them.
Reading these scrolls could change our understanding of ancient Roman history.

To do so, we need to:
1. Find a mesh (a representation of a surface in 3D) that sits on the inner surface of the papyrus layers. These layers are still rolled up in the CT scan; they also survived a volcanic eruption and were buried under ash for almost two millennia.
2. Find a 2D parametrization of that mesh which preserves 3D distances and angles as closely as possible. This amounts to "flattening" the surface.
3. (Un)warp an extruded region of interest around the mesh, using the flat parametrization as a reference, and render it so that it looks flat.
4. Look for ink, or enhance any visible trace of ink, using machine learning.
5. Read the text.

More details about points 1-3 are found [here](https://scrollprize.org/unwrapping).
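
To make the idea of a distance-preserving flattening concrete, here is a toy sketch that is not the VC3D algorithm. It builds a quarter cylinder as a stand-in for one curled papyrus layer, flattens it by mapping arc length to `u` and height to `v`, and compares edge lengths before and after. A cylinder can be unrolled without stretch; real papyrus surfaces are crumpled and torn, so in practice distortion can only be minimized.

```python
import numpy as np

# Toy surface: a quarter cylinder, standing in for one curled papyrus layer.
radius = 50.0
theta = np.linspace(0.0, np.pi / 2, 200)
height = np.linspace(0.0, 20.0, 50)
T, H = np.meshgrid(theta, height, indexing="ij")

points_3d = np.stack([radius * np.cos(T), radius * np.sin(T), H], axis=-1)
# Flat parametrization: arc length along the curl becomes u, height becomes v.
points_2d = np.stack([radius * T, H], axis=-1)


def edge_lengths(points):
    """Lengths of grid edges along both grid directions, concatenated."""
    along_rows = np.linalg.norm(np.diff(points, axis=0), axis=-1).ravel()
    along_cols = np.linalg.norm(np.diff(points, axis=1), axis=-1).ravel()
    return np.concatenate([along_rows, along_cols])


# Ratio of flattened to 3D edge length; 1.0 means no stretch.
stretch = edge_lengths(points_2d) / edge_lengths(points_3d)
print(f"stretch range: {stretch.min():.6f} to {stretch.max():.6f}")
```

The same ratio, computed per edge or per triangle on a real mesh, is a simple diagnostic for where a flattening distorts the surface most.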

Some preliminary results obtained on other scrolls:

- [Vesuvius Challenge Grand Prize Banner of 2023, PHerc. Paris 4](https://scrollprize.org/grandprize)
  ![Vesuvius Challenge Grand Prize Banner text](https://scrollprize.org/img/grandprize/text_bcb-smaller.webp)
- [Digital unwrapping of 70% of PHerc. 172](https://scrollprize.substack.com/p/70-of-pherc-172-is-now-digitally)
  [![Digital unwrapping of PHerc. 172](https://substackcdn.com/image/fetch/$s_!k3z_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0de8af62-d0e5-48cb-838c-00a5c51ba048_30001x547.png)](https://substackcdn.com/image/fetch/$s_!k3z_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0de8af62-d0e5-48cb-838c-00a5c51ba048_30001x547.png)
  *Click the image to open the full-resolution version in your browser.*

<!-- DATA PROVIDER INSTRUCTIONS
This section is, like the previous one, intended to be freeform / non-prescriptive. The goal here is to provide a challenge to the community to do something novel with your dataset. That can either be novel in terms of the task, or novel in terms of methodological or computational approach.

Another way to consider this section, is as a wishlist. If you were less constrained by time, cost, skill, etc., what would you like to see achieved using these data? 

The challenge should, however, be somewhat realistic. A challenge that assumes e.g. original data collection, is likely to go unanswered.
DATA PROVIDER INSTRUCTIONS -->

### Q: What is one unanswered question that you think could be answered using these data? Do you have any recommendations or advice for someone wanting to answer this question?

The results described in the previous section are promising, but the virtual unwrapping pipeline required substantial manual effort from expert human annotators.

A compelling open question is whether we can reliably reconstruct continuous text from full scroll volumes with minimal manual surface selection.

To reach this goal, we are developing an increasingly automated pipeline, called [VC3D](https://github.com/ScrollPrize/villa/tree/main/volume-cartographer).

We are also always looking for contributions that improve the quality of ink detection, or that help us see the text in scrolls where the ink is still elusive.

We have an active online challenge with a rich line-up of prizes; visit [this page](https://scrollprize.org/prizes) to read about the currently open prizes.

So far, the project has distributed more than $1.5M in prizes! For the latest information, join our [official Discord channel](https://discord.com/invite/V4fJhvtaQn).

# DATA PROVIDER: PLEASE REMEMBER TO CLEAR ALL OUTPUTS BEFORE COMMITTING TO YOUR GITHUB REPOSITORY