# 01 - SpatialData Exploration

## **Objective**
This notebook explores and analyzes the structure of the provided SpatialData object for **Crunch 1** of the Autoimmune Disease Machine Learning Challenge. The following tasks are performed to prepare the dataset for downstream modeling:

1. Load the `.zarr` dataset into a `SpatialData` object.
2. Validate the dataset to ensure it contains required components.
3. Visualize key elements, including H&E images and nucleus segmentation masks.
4. Preprocess the dataset, optionally subsampling large data for efficiency.
5. Extract nuclei coordinates and link them to gene expression profiles.
6. Create interactive visualizations to analyze tissue and gene expression.

---

## **Dataset Overview**
The dataset includes:
- **H&E Pathology Images**:
  - `HE_original`: The original H&E image in its native pixel coordinates.
  - `HE_nuc_original`: The nucleus segmentation mask associated with the H&E image.

- **Gene Expression Data**:
  - `anucleus`: Aggregated gene expression profiles for each nucleus, with log1p-normalized values for 460 genes.
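
As a reminder of what "log1p-normalized" means, the transform maps a non-negative value x to log(1 + x), and `expm1` inverts it. The sketch below uses only the standard library. The sample values are made up for illustration and are not taken from `anucleus`, and the function names are hypothetical.

```python
import math

def log1p_normalize(values):
    """Apply log(1 + x) to each non-negative value."""
    return [math.log1p(v) for v in values]

def log1p_invert(values):
    """Undo log1p with expm1, recovering the original scale."""
    return [math.expm1(v) for v in values]

# Made-up values, for illustration only
raw = [0.0, 1.0, 9.0]
normalized = log1p_normalize(raw)
restored = log1p_invert(normalized)
```

Zero stays at zero under this transform, which is convenient for sparse expression data.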

---

## **Steps in This Notebook**
1. **Environment Setup**: Modules are imported, and the `SpatialDataHandler` is initialized.
2. **Data Loading and Preprocessing**: The dataset is loaded, validated, and optionally subsampled.
3. **Visualization**: Key dataset components are visualized for initial exploration.
4. **Data Extraction**: Nuclei coordinates are linked to gene expression profiles.
5. **Advanced Visualization**: Interactive visualizations are created to explore tissue and gene expression relationships.

---

## **Expected Outputs**
- A summary of the dataset structure and its components.
- Visualizations of H&E images and nucleus segmentation masks.
- A DataFrame linking nuclei coordinates to gene expression data.
- Interactive visualizations for deeper analysis of tissue structure and gene expression.

---


# **Step 1: Environment Setup**

This step prepares the environment for data exploration. Tasks include:
1. Importing required modules and libraries.
2. Parsing the `config.yaml` file to retrieve project paths.
3. Identifying `.zarr` files in the dataset directory (`raw_dir`).
4. Validating the environment to ensure necessary files and dependencies are present.
5. Initializing a `SpatialDataHandler` for a selected `.zarr` dataset.
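
One possible way to validate dependencies (task 4) before running the rest of the notebook is to ask Python whether each module can be found. This is a minimal sketch, not the notebook's own check. `find_missing_modules` is a hypothetical helper, and the module list mirrors the `%pip install` line below, using import names rather than package names.

```python
import importlib.util

def find_missing_modules(module_names):
    """Return the names from module_names that cannot be located for import."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]

# Import names can differ from pip package names (pyyaml is imported as yaml)
required = ["spatialdata", "matplotlib", "plotly", "pandas", "numpy", "yaml", "tqdm"]
missing = find_missing_modules(required)
if missing:
    print(f"Missing modules: {', '.join(missing)}")
```

`find_spec` only locates a module without importing it, so the check stays fast even for heavy packages.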


In [1]:
# Install missing dependencies (if necessary)
%pip install spatialdata matplotlib plotly pandas numpy pyyaml tqdm

Note: you may need to restart the kernel to use updated packages.


## **Step 1.1: Imports**

Import the modules needed by the setup cells: `os` for path handling and `yaml` for parsing the configuration file. The `SpatialDataHandler` class is imported later, in Step 1.3.


In [2]:

import os
import yaml



## **Step 1.2: Load `config.yaml` and Set Paths**

Parse the `config.yaml` file to retrieve the `raw_dir` path where `.zarr` datasets are stored. The presence of `.zarr` files in the directory is then verified.


In [4]:
# Define the relative path to config.yaml
config_path = "../config.yaml"  # Adjust as needed for your directory structure

# Ensure the config.yaml file exists
if not os.path.exists(config_path):
    raise FileNotFoundError(f"Configuration file not found at: {config_path}")

# Load the configuration file
with open(config_path, "r") as file:
    config = yaml.safe_load(file)

# Extract the raw data directory from the configuration
raw_dir = config["paths"]["raw_dir"]
print(f"Dataset directory (raw_dir): {raw_dir}")


Dataset directory (raw_dir): /mnt/d/AutoImmuneML/broad-1-autoimmune-crunch1/data
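
The lookup `config["paths"]["raw_dir"]` assumes the configuration file contains a `paths` mapping with a `raw_dir` key. If the file is empty, `yaml.safe_load` returns `None` and the lookup fails with a less helpful error. The sketch below shows one way to make that assumption explicit. `get_raw_dir` is a hypothetical helper, and the example path is a placeholder.

```python
def get_raw_dir(config):
    """Return config['paths']['raw_dir'], with a clear error if it is absent."""
    if not isinstance(config, dict):
        raise ValueError("config.yaml is empty or not a mapping")
    paths = config.get("paths")
    if not isinstance(paths, dict) or "raw_dir" not in paths:
        raise KeyError("config.yaml must define paths.raw_dir")
    return paths["raw_dir"]

# Hypothetical parsed config, mirroring what yaml.safe_load would return
example_config = {"paths": {"raw_dir": "/path/to/data"}}
raw_dir_example = get_raw_dir(example_config)
```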


In [5]:
# Validate the raw directory and list all `.zarr` files
if not os.path.exists(raw_dir):
    raise FileNotFoundError(f"Raw data directory not found: {raw_dir}")

# Sort so that the menu numbering in Step 1.3 is stable across runs
zarr_files = sorted(os.path.join(raw_dir, f) for f in os.listdir(raw_dir) if f.endswith(".zarr"))

if not zarr_files:
    raise FileNotFoundError(f"No `.zarr` files found in directory: {raw_dir}")

print(f"Found {len(zarr_files)} `.zarr` files:")
for zarr_file in zarr_files:
    print(f" - {os.path.basename(zarr_file)}")


Found 8 `.zarr` files:
 - DC1.zarr
 - DC5.zarr
 - UC1_I.zarr
 - UC1_NI.zarr
 - UC6_I.zarr
 - UC6_NI.zarr
 - UC7_I.zarr
 - UC9_I.zarr
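
`os.listdir` returns entries in arbitrary order, and a name filter alone would also match a regular file that happens to end in `.zarr`. A `pathlib` variant can handle both points. This sketch assumes each store is a directory, which is the usual on-disk layout for Zarr. `list_zarr_stores` is a hypothetical helper, and the demo runs against a temporary directory rather than the real `raw_dir`.

```python
import tempfile
from pathlib import Path

def list_zarr_stores(directory):
    """Return sorted paths of sub-directories whose names end in .zarr."""
    root = Path(directory)
    return sorted(p for p in root.glob("*.zarr") if p.is_dir())

# Demo on a throwaway directory with two fake stores and one unrelated file
with tempfile.TemporaryDirectory() as tmp:
    for name in ("UC1_I.zarr", "DC1.zarr"):
        (Path(tmp) / name).mkdir()
    (Path(tmp) / "notes.txt").write_text("not a dataset")
    found = [p.name for p in list_zarr_stores(tmp)]
```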


## **Step 1.3: Initialize `SpatialDataHandler` for Single and All Datasets**

This step initializes two `SpatialDataHandler` objects:
1. **Single Dataset Handler**: Initializes the handler for a selected dataset. The dataset is chosen interactively or defaults to the first available dataset.
2. **All Datasets Handler**: Initializes the handler for all available datasets in the directory. This allows for collective exploration and processing.

#### **Key Objectives**
- Enable structured exploration and validation of a single dataset interactively.
- Prepare for batch processing across all datasets.

#### **Expected Outputs**
- A `SpatialDataHandler` for the selected dataset.
- A `SpatialDataHandler` for all datasets collectively.


In [6]:
# Display available datasets for selection
print("Available datasets:")
for i, zarr_file in enumerate(zarr_files):
    print(f"{i + 1}: {os.path.basename(zarr_file)}")

# Select a dataset interactively or default to the first one
dataset_index = input("Select a dataset by number (default: 1): ")
dataset_index = int(dataset_index) - 1 if dataset_index else 0

# Validate the selected index
if dataset_index < 0 or dataset_index >= len(zarr_files):
    raise ValueError(f"Invalid dataset selection. Please choose a number between 1 and {len(zarr_files)}.")

selected_dataset = os.path.basename(zarr_files[dataset_index])
selected_path = zarr_files[dataset_index]
print(f"Selected dataset: {selected_dataset}")

# Import the SpatialDataHandler class
from src.spatialdata_handler import SpatialDataHandler

# Initialize handler for the selected dataset (single dataset handler)
single_handler = SpatialDataHandler(zarr_paths=[selected_path])

# Initialize handler for all datasets (all datasets handler)
all_handler = SpatialDataHandler(zarr_paths=zarr_files)


Available datasets:
1: DC1.zarr
2: DC5.zarr
3: UC1_I.zarr
4: UC1_NI.zarr
5: UC6_I.zarr
6: UC6_NI.zarr
7: UC7_I.zarr
8: UC9_I.zarr
Selected dataset: UC7_I.zarr
Initialized handler for datasets: ['/mnt/d/AutoImmuneML/broad-1-autoimmune-crunch1/data/UC7_I.zarr']
Initialized handler for datasets: ['/mnt/d/AutoImmuneML/broad-1-autoimmune-crunch1/data/DC1.zarr', '/mnt/d/AutoImmuneML/broad-1-autoimmune-crunch1/data/DC5.zarr', '/mnt/d/AutoImmuneML/broad-1-autoimmune-crunch1/data/UC1_I.zarr', '/mnt/d/AutoImmuneML/broad-1-autoimmune-crunch1/data/UC1_NI.zarr', '/mnt/d/AutoImmuneML/broad-1-autoimmune-crunch1/data/UC6_I.zarr', '/mnt/d/AutoImmuneML/broad-1-autoimmune-crunch1/data/UC6_NI.zarr', '/mnt/d/AutoImmuneML/broad-1-autoimmune-crunch1/data/UC7_I.zarr', '/mnt/d/AutoImmuneML/broad-1-autoimmune-crunch1/data/UC9_I.zarr']
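
The selection logic in the cell above converts a 1-based menu answer into a 0-based list index. The same logic can be isolated in a small function, which also makes it easy to handle whitespace and non-numeric input. This is a sketch, and `parse_selection` is a hypothetical helper that is not part of the project code.

```python
def parse_selection(raw, n_items, default=1):
    """Convert a 1-based menu answer into a 0-based index."""
    text = raw.strip()
    if not text:
        choice = default  # empty answer falls back to the default entry
    else:
        try:
            choice = int(text)
        except ValueError:
            raise ValueError(f"Expected a whole number, got {raw!r}") from None
    if not 1 <= choice <= n_items:
        raise ValueError(f"Choose a number between 1 and {n_items}, got {choice}")
    return choice - 1
```

With eight datasets, `parse_selection("7", 8)` yields index 6, which corresponds to `UC7_I.zarr` in the listing above.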


In [7]:
# For single dataset
print(f"Loading and validating single dataset: {selected_dataset}")
single_handler.load_data(max_retries=3)
try:
    single_handler.validate_data(
        required_images=["HE_original", "HE_nuc_original"],
        required_tables=["anucleus"]
    )
    print(f"Single dataset '{selected_dataset}' validated successfully.")
except ValueError as e:
    print(f"Validation error in single dataset: {e}")

# For all datasets
print("Loading and validating all datasets.")
all_handler.load_data(max_retries=3)
try:
    all_handler.validate_data(
        required_images=["HE_original", "HE_nuc_original"],
        required_tables=["anucleus"]
    )
    print("All datasets validated successfully.")
except ValueError as e:
    print(f"Validation error in all datasets: {e}")


Loading and validating single dataset: UC7_I.zarr


Loading Datasets:   0%|          | 0/1 [00:00<?, ?it/s]

Loading dataset: UC7_I.zarr (Attempt 1/3)...


Loading Datasets: 100%|██████████| 1/1 [00:14<00:00, 14.10s/it]


Dataset 'UC7_I.zarr' loaded successfully.


Validating Datasets: 100%|██████████| 1/1 [00:00<00:00, 4899.89it/s]


Validating dataset: UC7_I.zarr...
Dataset 'UC7_I.zarr' is fully validated.
Single dataset 'UC7_I.zarr' validated successfully.
Loading and validating all datasets.


Loading Datasets:   0%|          | 0/8 [00:00<?, ?it/s]

Loading dataset: UC6_I.zarr (Attempt 1/3)...
Loading dataset: DC5.zarr (Attempt 1/3)...
Loading dataset: UC1_NI.zarr (Attempt 1/3)...
Loading dataset: DC1.zarr (Attempt 1/3)...
Loading dataset: UC1_I.zarr (Attempt 1/3)...
Loading dataset: UC9_I.zarr (Attempt 1/3)...
Loading dataset: UC6_NI.zarr (Attempt 1/3)...
Loading dataset: UC7_I.zarr (Attempt 1/3)...


Loading Datasets:  12%|█▎        | 1/8 [00:05<00:35,  5.11s/it]

Dataset 'DC1.zarr' loaded successfully.


Loading Datasets:  25%|██▌       | 2/8 [00:24<01:20, 13.39s/it]

Dataset 'UC1_NI.zarr' loaded successfully.


Loading Datasets:  38%|███▊      | 3/8 [00:24<00:37,  7.50s/it]

Dataset 'UC6_NI.zarr' loaded successfully.


Loading Datasets:  50%|█████     | 4/8 [00:25<00:19,  4.97s/it]

Dataset 'DC5.zarr' loaded successfully.


Loading Datasets:  62%|██████▎   | 5/8 [00:28<00:12,  4.24s/it]

Dataset 'UC7_I.zarr' loaded successfully.


Loading Datasets:  75%|███████▌  | 6/8 [00:37<00:11,  5.75s/it]

Dataset 'UC9_I.zarr' loaded successfully.
Dataset 'UC6_I.zarr' loaded successfully.


Loading Datasets: 100%|██████████| 8/8 [00:38<00:00,  4.77s/it]


Dataset 'UC1_I.zarr' loaded successfully.


Validating Datasets: 100%|██████████| 8/8 [00:00<00:00, 9236.01it/s]

Validating dataset: DC1.zarr...
Dataset 'DC1.zarr' is fully validated.
Validating dataset: UC1_NI.zarr...
Dataset 'UC1_NI.zarr' is fully validated.
Validating dataset: UC6_NI.zarr...
Dataset 'UC6_NI.zarr' is fully validated.
Validating dataset: DC5.zarr...
Dataset 'DC5.zarr' is fully validated.
Validating dataset: UC7_I.zarr...
Dataset 'UC7_I.zarr' is fully validated.
Validating dataset: UC9_I.zarr...
Dataset 'UC9_I.zarr' is fully validated.
Validating dataset: UC6_I.zarr...
Dataset 'UC6_I.zarr' is fully validated.
Validating dataset: UC1_I.zarr...
Dataset 'UC1_I.zarr' is fully validated.
All datasets validated successfully.
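
The `(Attempt 1/3)` messages in the log suggest that `load_data(max_retries=3)` retries a failed load. The implementation of `SpatialDataHandler` is not shown in this notebook, so the following is only a sketch of how such a retry loop could look. `load_with_retries`, `flaky_loader`, and `delay_seconds` are hypothetical names.

```python
import time

def load_with_retries(loader, path, max_retries=3, delay_seconds=0.0):
    """Call loader(path), retrying up to max_retries times on failure."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        print(f"Loading dataset: {path} (Attempt {attempt}/{max_retries})...")
        try:
            return loader(path)
        except Exception as error:  # broad on purpose: this is only a sketch
            last_error = error
            if attempt < max_retries:
                time.sleep(delay_seconds)
    raise RuntimeError(f"Failed to load {path} after {max_retries} attempts") from last_error

# Stand-in loader that fails twice before succeeding
calls = {"count": 0}

def flaky_loader(path):
    calls["count"] += 1
    if calls["count"] < 3:
        raise OSError("temporary read failure")
    return f"loaded:{path}"

result = load_with_retries(flaky_loader, "UC7_I.zarr")
```

Chaining the final error with `from last_error` keeps the original cause visible in the traceback.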





# **Step 2: Data Loading and Preprocessing**

This step loads the datasets and prepares them for further analysis. It is divided into three substeps:

1. **Dataset Loading and Validation**: Load the `.zarr` files into memory and validate them to ensure the presence of required components.
2. **Generate Summaries**: Provide a detailed summary of the single selected dataset and an overview of all datasets.
3. **Preprocessing**: Extract gene lists, normalize data, select training cells, and crop images to prepare for downstream analysis.

The expected outcomes include:
- Validated datasets with the required components (`HE_original`, `HE_nuc_original`, and `anucleus`).
- Summaries that detail the structure and attributes of the datasets.
- Preprocessed data ready for exploration and modeling.

---

### **Step 2.1: Dataset Loading and Validation**

This substep involves:
1. Loading the single selected dataset and all datasets using the `SpatialDataHandler`.
2. Validating the datasets to ensure the presence of:
   - **Images**: `HE_original`, `HE_nuc_original`
   - **Tables**: `anucleus`

The validation process confirms that all datasets are complete and ready for further analysis.
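
A presence check of this kind can be expressed as a comparison between the element names a dataset exposes and the names that are required. The sketch below is an assumption about what such a check might do, not the code behind `validate_data`. `find_missing` and `validate_elements` are hypothetical helpers.

```python
def find_missing(available, required):
    """Return required names that are absent from available, in order."""
    present = set(available)
    return [name for name in required if name not in present]

def validate_elements(images, tables, required_images, required_tables):
    """Raise ValueError listing any missing images or tables."""
    missing_images = find_missing(images, required_images)
    missing_tables = find_missing(tables, required_tables)
    if missing_images or missing_tables:
        raise ValueError(
            f"Missing images: {missing_images}; missing tables: {missing_tables}"
        )

# Element names as they appear in the dataset summaries of this notebook
validate_elements(
    images=["HE_original", "HE_nuc_original"],
    tables=["anucleus", "cell_id-group"],
    required_images=["HE_original", "HE_nuc_original"],
    required_tables=["anucleus"],
)
```

A check like this confirms only that the named elements exist. It says nothing about whether a table contains any rows.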


In [8]:
# Load and validate the selected single dataset
print(f"Loading dataset: {selected_dataset}")
single_handler.load_data(max_retries=3)
single_handler.validate_data(
    required_images=["HE_original", "HE_nuc_original"],
    required_tables=["anucleus"]
)
print(f"Single dataset '{selected_dataset}' validated successfully.")


Loading dataset: UC7_I.zarr


Loading Datasets:   0%|          | 0/1 [00:00<?, ?it/s]

Loading dataset: UC7_I.zarr (Attempt 1/3)...


Loading Datasets: 100%|██████████| 1/1 [00:12<00:00, 12.26s/it]


Dataset 'UC7_I.zarr' loaded successfully.


Validating Datasets: 100%|██████████| 1/1 [00:00<00:00, 10082.46it/s]

Validating dataset: UC7_I.zarr...
Dataset 'UC7_I.zarr' is fully validated.
Single dataset 'UC7_I.zarr' validated successfully.





In [9]:
# Load and validate all datasets
print("Loading and validating all datasets...")
all_handler.load_data(max_retries=3)
all_handler.validate_data(
    required_images=["HE_original", "HE_nuc_original"],
    required_tables=["anucleus"]
)
print("All datasets validated successfully.")


Loading and validating all datasets...
Loading dataset: DC1.zarr (Attempt 1/3)...
Loading dataset: DC5.zarr (Attempt 1/3)...
Loading dataset: UC1_I.zarr (Attempt 1/3)...
Loading dataset: UC1_NI.zarr (Attempt 1/3)...
Loading dataset: UC6_I.zarr (Attempt 1/3)...
Loading dataset: UC6_NI.zarr (Attempt 1/3)...


Loading Datasets:   0%|          | 0/8 [00:00<?, ?it/s]

Loading dataset: UC7_I.zarr (Attempt 1/3)...
Loading dataset: UC9_I.zarr (Attempt 1/3)...


Loading Datasets:  12%|█▎        | 1/8 [00:07<00:49,  7.10s/it]

Dataset 'DC1.zarr' loaded successfully.


Loading Datasets:  38%|███▊      | 3/8 [00:27<00:40,  8.04s/it]

Dataset 'UC1_NI.zarr' loaded successfully.
Dataset 'UC6_NI.zarr' loaded successfully.


Loading Datasets:  50%|█████     | 4/8 [00:27<00:20,  5.10s/it]

Dataset 'DC5.zarr' loaded successfully.


Loading Datasets:  62%|██████▎   | 5/8 [00:31<00:13,  4.56s/it]

Dataset 'UC7_I.zarr' loaded successfully.


Loading Datasets: 100%|██████████| 8/8 [00:38<00:00,  4.79s/it]


Dataset 'UC9_I.zarr' loaded successfully.
Dataset 'UC1_I.zarr' loaded successfully.
Dataset 'UC6_I.zarr' loaded successfully.


Validating Datasets: 100%|██████████| 8/8 [00:00<00:00, 39568.91it/s]

Validating dataset: DC1.zarr...
Dataset 'DC1.zarr' is fully validated.
Validating dataset: UC1_NI.zarr...
Dataset 'UC1_NI.zarr' is fully validated.
Validating dataset: UC6_NI.zarr...
Dataset 'UC6_NI.zarr' is fully validated.
Validating dataset: DC5.zarr...
Dataset 'DC5.zarr' is fully validated.
Validating dataset: UC7_I.zarr...
Dataset 'UC7_I.zarr' is fully validated.
Validating dataset: UC9_I.zarr...
Dataset 'UC9_I.zarr' is fully validated.
Validating dataset: UC6_I.zarr...
Dataset 'UC6_I.zarr' is fully validated.
Validating dataset: UC1_I.zarr...
Dataset 'UC1_I.zarr' is fully validated.
All datasets validated successfully.





### **Step 2.2: Generate Dataset Summaries**

Generate detailed summaries of the datasets to provide insights into their structure and components. The summaries are organized as follows:

1. **Single Dataset Summary**:
   - Displays the structure of the selected dataset, including the available images, tables, and their attributes.
   - Key details include image shapes and data types, as well as table dimensions (rows and columns).

2. **All Datasets Summary**:
   - Provides an overview of all available datasets in the directory.
   - Highlights the structure of each dataset, with details about images, tables, and their attributes.

In [10]:
# Print summary of the single dataset
print("\nSummary of the single dataset:")
single_handler.print_summary()



Summary of the single dataset:
Dataset: UC7_I.zarr
- Images:
  HE_nuc_original: shape (1, 17000, 20992), dtype uint32
  HE_original: shape (3, 17000, 20992), dtype uint8
- Tables:
  anucleus: 144704 rows, 460 columns
  cell_id-group: 277046 rows, 0 columns
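
The shapes and dtypes in this summary give a rough idea of why subsampling or cropping may be needed. The sketch below estimates the uncompressed size of each image if it were fully materialized in memory. It is a back-of-the-envelope calculation, and `array_nbytes` is a hypothetical helper.

```python
import math

# Bytes per element for the two dtypes reported in the summary
DTYPE_BYTES = {"uint8": 1, "uint32": 4}

def array_nbytes(shape, dtype):
    """Uncompressed size in bytes of an array with the given shape and dtype."""
    return math.prod(shape) * DTYPE_BYTES[dtype]

he_bytes = array_nbytes((3, 17000, 20992), "uint8")
mask_bytes = array_nbytes((1, 17000, 20992), "uint32")
print(f"HE_original: {he_bytes / 1e9:.2f} GB")        # prints 1.07 GB
print(f"HE_nuc_original: {mask_bytes / 1e9:.2f} GB")  # prints 1.43 GB
```

The single-channel mask is larger than the three-channel image because each `uint32` label takes four bytes.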


In [11]:
# Print summary of all datasets
print("\nSummary of all datasets:")
all_handler.print_summary()



Summary of all datasets:
Dataset: DC1.zarr
- Images:
  HE_nuc_original: shape (1, 17000, 22000), dtype uint32
  HE_original: shape (3, 17000, 22000), dtype uint8
- Tables:
  anucleus: 0 rows, 460 columns
  cell_id-group: 215465 rows, 0 columns
Dataset: UC1_NI.zarr
- Images:
  HE_nuc_original: shape (1, 21000, 22000), dtype uint32
  HE_original: shape (3, 21000, 22000), dtype uint8
- Tables:
  anucleus: 80037 rows, 460 columns
  cell_id-group: 93686 rows, 0 columns
Dataset: UC6_NI.zarr
- Images:
  HE_nuc_original: shape (1, 17000, 21000), dtype uint32
  HE_original: shape (3, 17000, 21000), dtype uint8
- Tables:
  anucleus: 101485 rows, 460 columns
  cell_id-group: 123517 rows, 0 columns
Dataset: DC5.zarr
- Images:
  HE_nuc_original: shape (1, 18000, 22000), dtype uint32
  HE_original: shape (3, 18000, 22000), dtype uint8
- Tables:
  anucleus: 140368 rows, 460 columns
  cell_id-group: 171019 rows, 0 columns
Dataset: UC7_I.zarr
- Images:
  HE_nuc_original: shape (1, 17000, 20992), dtype

### **Step 2.3: Preprocess Dataset(s)**

This substep preprocesses the dataset(s) for downstream analysis and machine learning tasks. The key preprocessing steps are:

1. **Extracting Gene Expression Profiles**:
   - Links nuclei coordinates to their respective gene expression data from the `anucleus` table.
   - Optionally applies filters for subsets of genes or nuclei.

2. **Log1p Normalization**:
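   - Applies the log1p transformation, log(1 + x), to gene expression data to stabilize variance and improve downstream modeling performance. The Dataset Overview describes the `anucleus` values as already log1p-normalized, so the transform should only be applied to data that has not been normalized yet.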

3. **Visualization Setup**:
   - Prepares the processed data for visualization, including cropping of H&E images or nucleus segmentation masks around each nucleus.
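
Cropping around a nucleus comes down to computing a bounding box centred on its coordinates and clamping that box to the image borders. The sketch below shows the arithmetic. `crop_bounds` and the patch size of 64 are hypothetical choices, and the image dimensions assume the `(channel, y, x)` axis order suggested by the shape `(3, 17000, 20992)` reported for `UC7_I.zarr`.

```python
def crop_bounds(center_x, center_y, patch_size, image_width, image_height):
    """Return (x0, y0, x1, y1) for a patch centred on a point, clamped to the image."""
    half = patch_size // 2
    x0 = max(0, center_x - half)
    y0 = max(0, center_y - half)
    x1 = min(image_width, center_x + half)
    y1 = min(image_height, center_y + half)
    return x0, y0, x1, y1

# Interior nucleus: full 64 x 64 patch
interior = crop_bounds(100, 50, 64, image_width=20992, image_height=17000)
# Nucleus near the top-left corner: the patch is truncated at the border
corner = crop_bounds(10, 10, 64, image_width=20992, image_height=17000)
```

For an array in `(channel, y, x)` order, the patch would then be taken as `image[:, y0:y1, x0:x1]`. Patches near the border come out smaller than `patch_size`, so a model that needs fixed-size inputs would have to pad them.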


#### **Step 2.3.1: Extract Gene Expression Profiles**

In [12]:
# Substep 2.3.1: Extract Gene Expression

# Define the genes and nuclei subsets (if applicable)
gene_subset = None  # Replace with specific genes if filtering is required
nuclei_subset = None  # Replace with specific nuclei if filtering is required

# Extract data for the single dataset
print("\nExtracting nuclei and gene expression for the single dataset...")
single_dataset_results = single_handler.extract_nuclei_and_gene_expression(
    gene_subset=gene_subset,
    nuclei_subset=nuclei_subset,
    batch_size=1000
)
print(f"Extraction completed for single dataset '{selected_dataset}'.")

# Extract data for all datasets
print("\nExtracting nuclei and gene expression for all datasets...")
all_datasets_results = all_handler.extract_nuclei_and_gene_expression(
    gene_subset=gene_subset,
    nuclei_subset=nuclei_subset,
    batch_size=1000
)
print("Extraction completed for all datasets.")



Extracting nuclei and gene expression for the single dataset...


Extracting Data:   0%|          | 0/1 [00:00<?, ?it/s]
Processing Batch 1/1 in UC7_I.zarr: 100%|██████████| 460/460 [00:00<00:00, 493952.85it/s]
Extracting Data: 100%|██████████| 1/1 [00:00<00:00,  1.35it/s]


Extraction complete for dataset 'UC7_I.zarr'.
Extraction completed for single dataset 'UC7_I.zarr'.

Extracting nuclei and gene expression for all datasets...


Extracting Data:   0%|          | 0/8 [00:00<?, ?it/s]


ValueError: Dataset 'DC1.zarr' is missing required spatial or expression data.
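
This error is consistent with the summary in Step 2.2, which reports `anucleus: 0 rows` for `DC1.zarr` even though the dataset passed validation. One way to avoid the failure is to separate datasets with an empty table before extraction. The sketch below uses the row counts printed in the summary. `split_by_row_count` is a hypothetical helper, and only the datasets whose counts are visible in the output above are included.

```python
def split_by_row_count(row_counts):
    """Split dataset names into (usable, empty) based on table row counts."""
    usable = [name for name, rows in row_counts.items() if rows > 0]
    empty = [name for name, rows in row_counts.items() if rows == 0]
    return usable, empty

# anucleus row counts as printed by the summary in Step 2.2
anucleus_rows = {
    "DC1.zarr": 0,
    "UC1_NI.zarr": 80037,
    "UC6_NI.zarr": 101485,
    "DC5.zarr": 140368,
    "UC7_I.zarr": 144704,
}
usable, empty = split_by_row_count(anucleus_rows)
print(f"Skipping datasets with no expression rows: {empty}")
```

Whether `DC1.zarr` should be skipped or handled separately is a judgment call that the notebook does not settle. The sketch only shows how to detect the condition up front.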