# 01 - SpatialData Exploration

## **Objective**
This notebook explores and analyzes the structure of the provided SpatialData object for **Crunch 1** of the Autoimmune Disease Machine Learning Challenge. The following tasks are performed to prepare the dataset for downstream modeling:

1. Load the `.zarr` dataset into a `SpatialData` object.
2. Validate the dataset to ensure it contains required components.
3. Visualize key elements, including H&E images and nucleus segmentation masks.
4. Preprocess the dataset, optionally subsampling large data for efficiency.
5. Extract nuclei coordinates and link them to gene expression profiles.
6. Create interactive visualizations to analyze tissue and gene expression.

---

## **Dataset Overview**
The dataset includes:
- **H&E Pathology Images**:
  - `HE_original`: The original H&E image in its native pixel coordinates.
  - `HE_nuc_original`: The nucleus segmentation mask associated with the H&E image.

- **Gene Expression Data**:
  - `anucleus`: Aggregated gene expression profiles for each nucleus, with log1p-normalized values for 460 genes.
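
For orientation, `log1p` maps a value x to ln(1 + x), which keeps zeros at zero and compresses large values, and `expm1` inverts it. The sketch below uses only the standard library and made-up values. Whether any scaling was applied before the log in `anucleus` is not stated here, so the inverse is illustrative and should not be read as recovered raw counts.

```python
import math

# Hypothetical expression values for one nucleus (not taken from the dataset).
raw_values = [0.0, 1.0, 9.0, 99.0]

# log1p(x) = ln(1 + x): zero stays zero, large values are compressed.
log_values = [math.log1p(x) for x in raw_values]

# expm1(y) = exp(y) - 1 inverts the transform (up to floating-point error).
recovered = [math.expm1(y) for y in log_values]

for x, y in zip(raw_values, log_values):
    print(f"{x:6.1f} -> {y:.4f}")
```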

---

## **Steps in This Notebook**
1. **Environment Setup**: Modules are imported, and the `SpatialDataHandler` is initialized.
2. **Data Loading and Preprocessing**: The dataset is loaded, validated, and optionally subsampled.
3. **Visualization**: Key dataset components are visualized for initial exploration.
4. **Data Extraction**: Nuclei coordinates are linked to gene expression profiles.
5. **Advanced Visualization**: Interactive visualizations are created to explore tissue and gene expression relationships.

---

## **Expected Outputs**
- A summary of the dataset structure and its components.
- Visualizations of H&E images and nucleus segmentation masks.
- A DataFrame linking nuclei coordinates to gene expression data.
- Interactive visualizations for deeper analysis of tissue structure and gene expression.

---


# **Step 1: Environment Setup**

This step prepares the environment for data exploration. Tasks include:
1. Importing required modules and libraries.
2. Parsing the `config.yaml` file to retrieve project paths.
3. Identifying `.zarr` files in the dataset directory (`raw_dir`).
4. Validating the environment to ensure necessary files and dependencies are present.
5. Initializing a `SpatialDataHandler` for a selected `.zarr` dataset.
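
The dependency part of the environment check in item 4 could be sketched as follows. The helper `find_missing_modules` and the `REQUIRED_MODULES` list are hypothetical names introduced for illustration. The list mirrors the imports used in this notebook, and `find_spec` only locates top-level modules without importing them.

```python
from importlib.util import find_spec

# Import names used in this notebook. The import name can differ from the
# pip package name (e.g. `yaml` is provided by `pyyaml`).
REQUIRED_MODULES = ["spatialdata", "spatialdata_plot", "scanpy", "matplotlib",
                    "plotly", "pandas", "numpy", "yaml", "tqdm"]


def find_missing_modules(module_names):
    """Return the module names that cannot be located in this environment."""
    return [name for name in module_names if find_spec(name) is None]


missing = find_missing_modules(REQUIRED_MODULES)
if missing:
    print(f"Missing modules: {', '.join(missing)}")
else:
    print("All required modules are importable.")
```

Running a check like this before the imports turns a mid-notebook `ModuleNotFoundError` into a single up-front list of what to install.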


In [1]:
# Install missing dependencies (if necessary)
%pip install spatialdata spatialdata-plot scanpy matplotlib plotly pandas numpy pyyaml tqdm

Note: you may need to restart the kernel to use updated packages.


## **Step 1.1: Imports**

Import the libraries used for data loading, manipulation, visualization, and validation.


In [2]:
from tqdm import tqdm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import os
import yaml

In [3]:
import spatialdata as sd
import spatialdata_plot  # imported for its side effect: registers the `.pl` plotting accessor
import scanpy as sc



## **Step 1.2: Load `config.yaml` and Set Paths**

Parse the `config.yaml` file to retrieve the `raw_dir` path where the `.zarr` datasets are stored, then verify that the directory contains `.zarr` files.


In [4]:
# Define the relative path to config.yaml
config_path = "../config.yaml"  # Adjust as needed for your directory structure

# Ensure the config.yaml file exists
if not os.path.exists(config_path):
    raise FileNotFoundError(f"Configuration file not found at: {config_path}")

# Load the configuration file
with open(config_path, "r") as file:
    config = yaml.safe_load(file)

# Extract the raw data directory from the configuration
raw_dir = config["paths"]["raw_dir"]
print(f"Dataset directory (raw_dir): {raw_dir}")


Dataset directory (raw_dir): /mnt/d/AutoImmuneML/broad-1-autoimmune-crunch1/data


In [5]:
# Validate the raw directory and list all `.zarr` files
if not os.path.exists(raw_dir):
    raise FileNotFoundError(f"Raw data directory not found: {raw_dir}")

# os.listdir returns entries in arbitrary order; sort so the listing is stable across runs
zarr_files = sorted(os.path.join(raw_dir, f) for f in os.listdir(raw_dir) if f.endswith(".zarr"))

if not zarr_files:
    raise FileNotFoundError(f"No `.zarr` files found in directory: {raw_dir}")

print(f"Found {len(zarr_files)} `.zarr` files:")
for zarr_file in zarr_files:
    print(f" - {os.path.basename(zarr_file)}")


Found 8 `.zarr` files:
 - DC1.zarr
 - DC5.zarr
 - UC1_I.zarr
 - UC1_NI.zarr
 - UC6_I.zarr
 - UC6_NI.zarr
 - UC7_I.zarr
 - UC9_I.zarr
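
`os.listdir` returns entries in arbitrary order and does not distinguish files from directories. A `pathlib`-based variant that sorts the result and keeps only directories might look like the sketch below. The function name `find_zarr_stores` is hypothetical, and the directory-only filter assumes the stores were written as directory trees, as the listing above suggests.

```python
from pathlib import Path


def find_zarr_stores(raw_dir):
    """Return `.zarr` directories under raw_dir, sorted by name."""
    root = Path(raw_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"Raw data directory not found: {root}")
    # A Zarr store saved to disk is normally a directory, so plain files that
    # merely end in ".zarr" are skipped here (an assumption of this sketch).
    return sorted(p for p in root.glob("*.zarr") if p.is_dir())
```

Sorting keeps the numbering of any selection menu built from this list stable from run to run.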


## **Step 1.3: Initialize `SpatialDataHandler` for One Dataset**

Initialize a `SpatialDataHandler` for a single dataset, selected either interactively or by default (the first dataset listed). The handler is then used for structured exploration and validation.



In [6]:
# Display available datasets for selection
print("Available datasets:")
for i, zarr_file in enumerate(zarr_files):
    print(f"{i + 1}: {os.path.basename(zarr_file)}")

# Select a dataset interactively; an empty answer defaults to the first one
choice = input("Select a dataset by number (default: 1): ").strip()
dataset_index = int(choice) - 1 if choice else 0

# Validate the selected index
if dataset_index < 0 or dataset_index >= len(zarr_files):
    raise ValueError(f"Invalid dataset selection. Please choose a number between 1 and {len(zarr_files)}.")

selected_dataset = os.path.basename(zarr_files[dataset_index])
selected_path = zarr_files[dataset_index]
print(f"Selected dataset: {selected_dataset}")


Available datasets:
1: DC1.zarr
2: DC5.zarr
3: UC1_I.zarr
4: UC1_NI.zarr
5: UC6_I.zarr
6: UC6_NI.zarr
7: UC7_I.zarr
8: UC9_I.zarr
Selected dataset: UC7_I.zarr
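
The selection logic above can be factored into a small helper that is testable without `input()`. This is a sketch only: `parse_selection` is a hypothetical name, and it additionally reports non-numeric answers with a readable message instead of the bare `int()` error.

```python
def parse_selection(raw, n_options, default=1):
    """Convert a 1-based menu answer into a 0-based index.

    An empty answer selects `default`; anything that is not an integer in
    the range 1..n_options raises ValueError with a readable message.
    """
    text = raw.strip()
    if not text:
        return default - 1
    try:
        number = int(text)
    except ValueError:
        raise ValueError(f"Not a number: {text!r}") from None
    if not 1 <= number <= n_options:
        raise ValueError(f"Choose a number between 1 and {n_options}.")
    return number - 1


print(parse_selection("7", 8))  # 6, i.e. the seventh entry
print(parse_selection("", 8))   # 0, the default first entry
```

In the cell above, this would be called as `parse_selection(input(...), len(zarr_files))`.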


In [7]:
# Import the SpatialDataHandler class
from crunch1_project.src.spatialdata_handler import SpatialDataHandler

# Initialize the handler for the selected dataset
handler = SpatialDataHandler(selected_path)

# Confirm initialization
print(f"Initialized handler for dataset: {selected_dataset}")


INFO:crunch1_project.src.spatialdata_handler:SpatialDataHandler initialized for file: /mnt/d/AutoImmuneML/broad-1-autoimmune-crunch1/data/UC7_I.zarr


Initialized handler for dataset: UC7_I.zarr


In [8]:
# Load and validate the dataset
print(f"Loading dataset: {selected_dataset}")
handler.load_data()

print(f"Validating dataset: {selected_dataset}")
handler.validate_data()


Loading dataset: UC7_I.zarr


INFO:ome_zarr.reader:root_attr: multiscales
INFO:ome_zarr.reader:root_attr: omero
INFO:ome_zarr.reader:root_attr: spatialdata_attrs
INFO:ome_zarr.reader:datasets [{'coordinateTransformations': [{'scale': [1.0, 1.0, 1.0], 'type': 'scale'}], 'path': '0'}]
INFO:ome_zarr.reader:resolution: 0
INFO:ome_zarr.reader: - shape ('c', 'y', 'x') = (1, 17000, 20992)
INFO:ome_zarr.reader: - chunks =  ['1', '5792 (+ 5416)', '5792 (+ 3616)']
INFO:ome_zarr.reader: - dtype = uint32
INFO:ome_zarr.reader:root_attr: multiscales
INFO:ome_zarr.reader:root_attr: omero
INFO:ome_zarr.reader:root_attr: spatialdata_attrs
INFO:ome_zarr.reader:root_attr: multiscales
INFO:ome_zarr.reader:root_attr: omero
INFO:ome_zarr.reader:root_attr: spatialdata_attrs
INFO:ome_zarr.reader:datasets [{'coordinateTransformations': [{'scale': [1.0, 1.0, 1.0], 'type': 'scale'}], 'path': '0'}]
INFO:ome_zarr.reader:resolution: 0
INFO:ome_zarr.reader: - shape ('c', 'y', 'x') = (3, 17000, 20992)
INFO:ome_zarr.reader: - chunks =  ['3', '6688

Validating dataset: UC7_I.zarr
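
What `validate_data()` checks internally is not shown in this notebook. A minimal sketch of one plausible check, assuming validation means confirming that the three elements named in the Dataset Overview are present, is given below. `find_missing_elements` is a hypothetical helper. It relies only on the dict-like element groups (`images`, `labels`, `points`, `shapes`, `tables`) that a `SpatialData` object exposes, so it can be demonstrated with a stand-in object.

```python
from types import SimpleNamespace

# Element names listed in the Dataset Overview above.
REQUIRED_ELEMENTS = ("HE_original", "HE_nuc_original", "anucleus")
# Element groups exposed by a SpatialData object as dict-like attributes.
ELEMENT_GROUPS = ("images", "labels", "points", "shapes", "tables")


def find_missing_elements(sdata, required=REQUIRED_ELEMENTS):
    """Return required element names absent from every element group."""
    present = set()
    for group in ELEMENT_GROUPS:
        # getattr with a default keeps the check usable on stand-in objects.
        present.update(getattr(sdata, group, {}).keys())
    return [name for name in required if name not in present]


# Stand-in object for illustration; in the notebook this would be the
# SpatialData object loaded by the handler.
demo = SimpleNamespace(images={"HE_original": None, "HE_nuc_original": None},
                       tables={})
print(find_missing_elements(demo))  # ['anucleus']
```

Searching all element groups avoids assuming whether the nucleus mask is stored as an image or as a labels element.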
