# 01_Preprocessing

### Overview
This notebook preprocesses the data for Crunch 1 of the Autoimmune Disease Machine Learning Challenge. It prepares the raw spatial transcriptomics data and H&E images for downstream modeling by checking data quality and converting everything into consistent formats that machine learning code can consume.

---

### Objectives
1. **Data Loading**: Load and validate the raw Zarr stores, which contain the H&E images and gene expression tables.
2. **Image Preprocessing**:
   - Normalize H&E images.
   - Extract nucleus-centered patches for modeling.
3. **Gene Expression Preprocessing**:
   - Normalize and filter gene expression data.
   - Perform dimensionality reduction and clustering.
4. **Spatial Feature Engineering**:
   - Compute spatial features such as centroids, pairwise distances, and adjacency matrices.
5. **Save Outputs**: Store preprocessed data in the `interim` directory for use in downstream tasks.

---

### Expected Outputs
- **Processed H&E Images**:
  - Nucleus-centered patches with normalized intensities.
- **Processed Gene Expression**:
  - Normalized, filtered, and clustered gene expression data.
- **Spatial Features**:
  - Centroids, distances, and adjacency matrices for nuclei.

---

### Steps
1. **Imports and Configuration**: Load necessary libraries and initialize configurations.
2. **Data Loading**: Load raw data using the `DataLoader`.
3. **Preprocessing Tasks**:
   - Image Preprocessing (`ImagePreprocessor`).
   - Gene Expression Preprocessing (`GenePreprocessor`).
   - Spatial Preprocessing (`SpatialPreprocessor`).
4. **Validation**: Validate preprocessed data visually or through summary statistics.
5. **Save Outputs**: Ensure all preprocessed data is saved to the `interim` directory.


## 1. Imports and Configuration
This section initializes the preprocessing environment by:
- Importing core libraries and project-specific modules.
- Loading the configuration file (`config.yaml`) for path management.
- Setting up paths for raw and interim data.

In [2]:
# Import libraries
import os
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from tqdm import tqdm
import spatialdata as sd
import scanpy as sc

# Import project-specific modules
from src.config.config_loader import ConfigLoader
from src.data_loader import DataLoader

In [3]:
# Load configuration and DataLoader for Crunch 1
config_path = "/home/secondbook5/projects/AutoImmuneML/config.yaml"
config = ConfigLoader(config_path=config_path)
crunch_name = "crunch1"

# Initialize the data loader, reusing crunch_name so it is defined in one place
data_loader = DataLoader(config=config, crunch_name=crunch_name)


# Set paths
raw_dir = config.get_crunch_path(crunch_name, "raw_dir")
interim_dir = config.get_crunch_path(crunch_name, "interim_dir")

# Display paths
print(f"Raw directory: {raw_dir}")
print(f"Interim directory: {interim_dir}")

Raw directory: /mnt/d/AutoImmuneML/broad-1-autoimmune-crunch1/data
Interim directory: /mnt/d/AutoImmuneML/broad-1-autoimmune-crunch1/interim


In [4]:
# Project-Specific Modules
from src.loaders.zarr_loader import ZARRLoader
from src.preprocessors.image_preprocessor import ImagePreprocessor
from src.preprocessors.gene_preprocessor import GenePreprocessor
from src.preprocessors.spatial_preprocessor import SpatialPreprocessor


## 2. Data Loading
This section loads the raw data from Zarr files using the `DataLoader`. The data includes:
1. **Images**: H&E registered images.
2. **Tables**: Gene expression data.

We also display a sample image and inspect the structure of the gene expression table to ensure correctness.
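
Before calling the loader, it can help to confirm which Zarr stores are actually present on disk. The following is a minimal sketch using only the standard library. The helper name `find_zarr_stores` is hypothetical and not part of the project's `DataLoader`, and it assumes each dataset is a directory whose name ends in `.zarr`.

```python
from pathlib import Path


def find_zarr_stores(root):
    """Return {dataset_name: path} for every '*.zarr' directory directly under root."""
    root = Path(root)
    if not root.is_dir():
        raise FileNotFoundError(f"Raw directory does not exist: {root}")
    # Zarr stores are directories on disk, so plain files are skipped.
    return {p.stem: p for p in sorted(root.glob("*.zarr")) if p.is_dir()}
```

Calling `find_zarr_stores(raw_dir)` would then show whether stores such as `UC1_NI.zarr` and `UC1_I.zarr` exist before any loading is attempted.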


In [5]:
# Define specific Zarr files to load
zarr_keys = ["UC1_NI.zarr", "UC1_I.zarr"]
zarr_paths = [os.path.join(raw_dir, key) for key in zarr_keys]

# Load specific Zarr files using the DataLoader
print(f"[INFO] Loading specific Zarr files: {zarr_paths}")
zarr_data = data_loader.load_zarr(zarr_paths)

# Display loaded datasets
print(f"[INFO] Loaded datasets: {list(zarr_data.keys())}")

[ERROR] Path Key '/mnt/d/AutoImmuneML/broad-1-autoimmune-crunch1/data/UC1_NI.zarr' not found in configuration for 'crunch1'
[ERROR] Error processing Zarr key '/mnt/d/AutoImmuneML/broad-1-autoimmune-crunch1/data/UC1_NI.zarr': Path Key '/mnt/d/AutoImmuneML/broad-1-autoimmune-crunch1/data/UC1_NI.zarr' not found in configuration for 'crunch1'
[ERROR] Path Key '/mnt/d/AutoImmuneML/broad-1-autoimmune-crunch1/data/UC1_I.zarr' not found in configuration for 'crunch1'
[ERROR] Error processing Zarr key '/mnt/d/AutoImmuneML/broad-1-autoimmune-crunch1/data/UC1_I.zarr': Path Key '/mnt/d/AutoImmuneML/broad-1-autoimmune-crunch1/data/UC1_I.zarr' not found in configuration for 'crunch1'


[INFO] Loading specific Zarr files: ['/mnt/d/AutoImmuneML/broad-1-autoimmune-crunch1/data/UC1_NI.zarr', '/mnt/d/AutoImmuneML/broad-1-autoimmune-crunch1/data/UC1_I.zarr']
[DEBUG] Retrieved path for key '/mnt/d/AutoImmuneML/broad-1-autoimmune-crunch1/data/UC1_NI.zarr': None
[DEBUG] Retrieved path for key '/mnt/d/AutoImmuneML/broad-1-autoimmune-crunch1/data/UC1_I.zarr': None
[INFO] Loaded datasets: []


In [6]:
# Load all Zarr files in the raw directory
print(f"[INFO] Loading all Zarr files in directory: {raw_dir}")
all_zarr_data = data_loader.load_zarr([raw_dir])

# Display loaded datasets
print(f"[INFO] All loaded datasets: {list(all_zarr_data.keys())}")


[ERROR] Path Key '/mnt/d/AutoImmuneML/broad-1-autoimmune-crunch1/data' not found in configuration for 'crunch1'
[ERROR] Error processing Zarr key '/mnt/d/AutoImmuneML/broad-1-autoimmune-crunch1/data': Path Key '/mnt/d/AutoImmuneML/broad-1-autoimmune-crunch1/data' not found in configuration for 'crunch1'


[INFO] Loading all Zarr files in directory: /mnt/d/AutoImmuneML/broad-1-autoimmune-crunch1/data
[DEBUG] Retrieved path for key '/mnt/d/AutoImmuneML/broad-1-autoimmune-crunch1/data': None
[INFO] All loaded datasets: []
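
The `[ERROR]` and `[DEBUG]` lines above suggest that `load_zarr` looks up each entry as a configuration key, so a full filesystem path resolves to `None` and nothing is loaded. The sketch below illustrates that behavior with a plain dictionary. `PATHS` and `resolve_key` are hypothetical stand-ins rather than the project's `ConfigLoader`, the paths are made up for illustration, and the assumption that keys are bare file names such as `UC1_NI.zarr` is not confirmed by the source.

```python
import os

# Hypothetical key-to-path table standing in for the project's configuration.
PATHS = {
    "UC1_NI.zarr": "/data/raw/UC1_NI.zarr",
    "UC1_I.zarr": "/data/raw/UC1_I.zarr",
}


def resolve_key(key):
    """Look up a configured path by key; unknown keys resolve to None."""
    return PATHS.get(key)


full_path = "/data/raw/UC1_NI.zarr"
print(resolve_key(full_path))                    # the full path is not a key
print(resolve_key(os.path.basename(full_path)))  # the bare file name is a key
```

If the real configuration behaves this way, passing the configured key instead of a joined path would be the first thing to try.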


In [None]:
for key, data in zarr_data.items():
    print(f"[DEBUG] Dataset {key} contains: {list(data.images.keys())} and {list(data.tables.keys())}")


In [None]:
# Step 1: Get the raw_dir path from the configuration
raw_dir = config.get_crunch_path("crunch1", "raw_dir")

# Step 2: Ensure raw_dir exists and is valid
assert raw_dir, "raw_dir is not defined in the configuration or is invalid."
assert os.path.exists(raw_dir), f"raw_dir does not exist: {raw_dir}"

# Step 3: Specify which datasets to load (e.g., UC1_NI.zarr, UC1_I.zarr)
zarr_filenames = ["UC1_NI.zarr", "UC1_I.zarr"]
zarr_paths = [os.path.join(raw_dir, fname) for fname in zarr_filenames]

# Step 4: Load Zarr datasets using the DataLoader
# (in the run logged above, full paths were rejected as unknown configuration keys)
zarr_data = data_loader.load_zarr(zarr_paths)

# Step 5: Print loaded dataset keys
print(f"Loaded datasets: {list(zarr_data.keys())}")


In [None]:
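# Load Zarr Data
zarr_data = data_loader.load_zarr(["UC1_NI.zarr", "UC1_I.zarr"])
print(f"Loaded datasets: {list(zarr_data.keys())}")

# Display Sample Image and Metadata
# NOTE: dictionary-style access is used here, whereas the debug cell above uses
# attribute access (data.images); adjust to match what load_zarr actually returns.
sample_image = zarr_data["UC1_NI"]["images"]["HE_registered"]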
plt.imshow(sample_image)
plt.title("Sample H&E Registered Image")
plt.show()

# Display Sample Gene Expression Data
gene_expression = pd.DataFrame(zarr_data["UC1_NI"]["tables"]["anucleus"].to_numpy())
print(gene_expression.head())

## 3. Preprocessing Tasks

### 3.1 H&E Image Preprocessing
- Normalize stains using Reinhard normalization.
- Extract nucleus-centered patches.
- Apply augmentations.
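
Reinhard normalization transfers color statistics from a reference image to a source image by matching the per-channel mean and standard deviation, usually in a perceptual color space such as LAB. The sketch below shows only that statistic-matching step and assumes both images have already been converted to the same color space (for example with scikit-image's `rgb2lab`). The function name `match_channel_stats` is hypothetical, and this is not necessarily how `ImagePreprocessor.normalize_stains` is implemented.

```python
import numpy as np


def match_channel_stats(source, target, eps=1e-8):
    """Shift and scale each channel of `source` to the mean/std of `target`.

    Both inputs are float arrays of shape (H, W, C), assumed to be in the
    same color space.
    """
    source = np.asarray(source, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    src_mean = source.mean(axis=(0, 1))
    src_std = source.std(axis=(0, 1))
    tgt_mean = target.mean(axis=(0, 1))
    tgt_std = target.std(axis=(0, 1))
    # Standardize the source per channel, then rescale to the target statistics.
    return (source - src_mean) / (src_std + eps) * tgt_std + tgt_mean
```

After matching, the result would be converted back to RGB and clipped to the valid intensity range before patches are extracted.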


In [None]:
# Initialize Image Preprocessor
image_preprocessor = ImagePreprocessor()

# Normalize and Extract Patches
for zarr_key, dataset in zarr_data.items():
    he_images = dataset["images"]["HE_registered"]
    he_nuc_masks = dataset["images"]["HE_nuc_registered"]

    # Normalize stains
    normalized_images = image_preprocessor.normalize_stains(he_images)

    # Extract patches from the normalized image so the normalization is actually applied
    patches = image_preprocessor.extract_patches(normalized_images, he_nuc_masks, patch_size=32)

    # Save to Interim Directory
    save_path = os.path.join(interim_dir, f"{zarr_key}_processed_images.npy")
    np.save(save_path, patches)
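
To make the idea of nucleus-centered patches concrete, here is a minimal NumPy sketch. It assumes the nucleus mask is an integer label image in which 0 is background and each nucleus has its own positive ID; that assumption, and the helper names `nucleus_centroids` and `extract_patch`, are illustrative and not taken from `ImagePreprocessor`.

```python
import numpy as np


def nucleus_centroids(label_mask):
    """Return {nucleus_id: (row, col)} centroids from an integer label mask (0 = background)."""
    centroids = {}
    for nucleus_id in np.unique(label_mask):
        if nucleus_id == 0:
            continue
        rows, cols = np.nonzero(label_mask == nucleus_id)
        centroids[int(nucleus_id)] = (rows.mean(), cols.mean())
    return centroids


def extract_patch(image, center, patch_size=32):
    """Cut a patch_size x patch_size patch around `center` from an (H, W, C) image.

    The image is reflect-padded so nuclei near the border still yield full patches.
    """
    half = patch_size // 2
    padded = np.pad(image, ((half, half), (half, half), (0, 0)), mode="reflect")
    row, col = (int(round(c)) for c in center)
    # After padding by `half`, original pixel (row, col) sits at (row + half, col + half),
    # so the window starting at (row, col) places it at offset `half` inside the patch.
    return padded[row:row + patch_size, col:col + patch_size]
```

The per-ID loop is fine for small crops but slow on whole-slide masks; a vectorized routine such as `scipy.ndimage.center_of_mass` would likely be a better fit there.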


### 3.2 Gene Expression Preprocessing
- Normalize and filter gene expression data.
- Log-transform if needed.
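
One common way to carry out these steps is to drop rarely detected genes, scale each cell to a fixed total count, and apply `log1p`. The sketch below does this with NumPy on a cells-by-genes count matrix. The function name `normalize_counts` and the defaults `target_sum=1e4` and `min_cells=1` are assumptions for illustration, and `GenePreprocessor.normalize` may work differently. Scanpy offers comparable steps through `sc.pp.filter_genes`, `sc.pp.normalize_total`, and `sc.pp.log1p`.

```python
import numpy as np


def normalize_counts(counts, target_sum=1e4, min_cells=1):
    """Filter rarely detected genes, scale each cell to `target_sum`, then apply log1p.

    `counts` is a (cells x genes) array of raw counts. Returns the transformed
    matrix and the boolean mask of genes that were kept.
    """
    counts = np.asarray(counts, dtype=np.float64)
    # Keep genes detected (count > 0) in at least `min_cells` cells.
    keep = (counts > 0).sum(axis=0) >= min_cells
    filtered = counts[:, keep]
    totals = filtered.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid division by zero for empty cells
    scaled = filtered / totals * target_sum
    return np.log1p(scaled), keep
```

Returning the gene mask alongside the matrix keeps it possible to map columns back to gene names after filtering.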


In [None]:
# Initialize Gene Preprocessor
gene_preprocessor = GenePreprocessor()

# Normalize Gene Expression
for zarr_key, dataset in zarr_data.items():
    anucleus_table = dataset["tables"]["anucleus"]
    normalized_genes = gene_preprocessor.normalize(anucleus_table)

    # Save to Interim Directory
    save_path = os.path.join(interim_dir, f"{zarr_key}_processed_genes.npy")
    np.save(save_path, normalized_genes)


### 3.3 Spatial Feature Engineering
- Compute distances between nuclei.
- Create adjacency matrices for spatial modeling.
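
As a rough illustration of these two features, the sketch below computes a dense Euclidean distance matrix from nucleus centroids and thresholds it into a binary adjacency matrix. The function names and the radius-based definition of adjacency are assumptions; `SpatialPreprocessor.generate_features` may define neighbors differently (for example by k nearest neighbors).

```python
import numpy as np


def pairwise_distances(centroids):
    """Euclidean distance matrix for an (N, 2) array of nucleus centroids."""
    centroids = np.asarray(centroids, dtype=np.float64)
    diff = centroids[:, None, :] - centroids[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def radius_adjacency(distances, radius):
    """Binary adjacency: 1 where two distinct nuclei lie within `radius` of each other."""
    adjacency = (distances <= radius).astype(np.uint8)
    np.fill_diagonal(adjacency, 0)  # no self-loops
    return adjacency
```

A dense N x N matrix grows quadratically with the number of nuclei, so for full slides a spatial index such as `scipy.spatial.cKDTree` with a sparse adjacency would probably be more practical.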


In [None]:
# Initialize Spatial Preprocessor
spatial_preprocessor = SpatialPreprocessor()

# Compute Features
for zarr_key, dataset in zarr_data.items():
    spatial_features = spatial_preprocessor.generate_features(dataset)

    # Save to Interim Directory
    save_path = os.path.join(interim_dir, f"{zarr_key}_spatial_features.npy")
    np.save(save_path, spatial_features)


## 4. Intermediate Validation
- Visualize preprocessed images, gene distributions, and spatial features.
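
Alongside the plots, a few summary statistics can catch problems such as NaNs or unexpected value ranges. The helper below is a minimal sketch; the name `summarize_array` and the choice of statistics are illustrative rather than part of the project code.

```python
import numpy as np


def summarize_array(name, array):
    """Return basic sanity statistics for a preprocessed array."""
    array = np.asarray(array, dtype=np.float64)
    return {
        "name": name,
        "shape": array.shape,
        "n_nan": int(np.isnan(array).sum()),
        "min": float(np.nanmin(array)),
        "max": float(np.nanmax(array)),
        "mean": float(np.nanmean(array)),
    }
```

For instance, `summarize_array("patches", patches)` would report the patch array's shape and intensity range in one dictionary.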


In [None]:
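# Visualize Preprocessed Images
# `patches` and `normalized_genes` hold the results for the last dataset processed above.
plt.imshow(patches[0])
plt.title("Example Preprocessed Patch")
plt.show()

# Plot Gene Expression Distribution
plt.hist(np.asarray(normalized_genes).ravel(), bins=50)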
plt.title("Normalized Gene Expression Distribution")
plt.show()


## 5. Save Preprocessed Data
- Save all outputs to the `interim` directory.

In [None]:
# Ensure the interim directory that holds the preprocessed data exists
assert os.path.isdir(interim_dir), f"Interim directory does not exist: {interim_dir}"
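
Checking that the directory exists does not confirm that every output was written. A slightly stronger check, sketched below, lists the expected `.npy` files that are missing. The helper name `missing_outputs` is hypothetical; the file-name pattern `<key>_<suffix>.npy` follows the `np.save` calls in the preprocessing cells above.

```python
from pathlib import Path


def missing_outputs(interim_dir, dataset_keys, suffixes):
    """List expected '<key>_<suffix>.npy' files that are absent from interim_dir."""
    interim_dir = Path(interim_dir)
    expected = [interim_dir / f"{key}_{suffix}.npy"
                for key in dataset_keys for suffix in suffixes]
    return [str(p) for p in expected if not p.is_file()]
```

In this notebook it could be called as `missing_outputs(interim_dir, zarr_data.keys(), ["processed_images", "processed_genes", "spatial_features"])`, with an empty list indicating that all files are in place.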

## 6. Notes and Next Steps
- Preprocessing is complete. The next step is Enhanced EDA.
- Key Observations:
  - ...
  - ...
