# 01_Preprocessing

## Notebook Overview
This notebook handles data preprocessing for Crunch 1. It focuses on:
1. Preprocessing H&E images (e.g., stain normalization, patch extraction).
2. Normalizing and filtering gene expression data.
3. Engineering spatial features (e.g., distances between nuclei).
4. Saving preprocessed data to appropriate directories for modeling.

---

## 1. Imports and Configuration


In [None]:
# Core Libraries
import os
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# Data Libraries
from skimage import io, color, exposure
from tqdm import tqdm

In [None]:
import squidpy as sq
import scanpy as sc
import spatialdata as sd

In [None]:
# Project-Specific Modules
from src.config.config_loader import ConfigLoader
from src.loaders.data_loader import DataLoader
from src.preprocessors.image_preprocessor import ImagePreprocessor
from src.preprocessors.gene_preprocessor import GenePreprocessor
from src.preprocessors.spatial_preprocessor import SpatialPreprocessor


In [None]:
# Load Config
config = ConfigLoader("config.yaml")
crunch_name = "crunch1"
data_loader = DataLoader(config=config, crunch_name=crunch_name)

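# Set Paths
raw_dir = config.get_crunch_path(crunch_name, "raw_dir")
interim_dir = config.get_crunch_path(crunch_name, "interim_dir")

# Create the interim directory up front so later np.save calls cannot fail on a missing path
os.makedirs(interim_dir, exist_ok=True)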

## 2. Data Loading

In [None]:
# Load Zarr Data
zarr_data = data_loader.load_zarr(["UC1_NI.zarr", "UC1_I.zarr"])
print(f"Loaded datasets: {list(zarr_data.keys())}")

# Display Sample Image and Metadata
sample_image = zarr_data["UC1_NI"]["images"]["HE_registered"]
plt.imshow(sample_image)
plt.title("Sample H&E Registered Image")
plt.show()

# Display Sample Gene Expression Data
gene_expression = pd.DataFrame(zarr_data["UC1_NI"]["tables"]["anucleus"].to_numpy())
print(gene_expression.head())

## 3. Preprocessing Tasks

### 3.1 H&E Image Preprocessing
- Normalize stains using Reinhard normalization.
- Extract nucleus-centered patches.
- Apply augmentations.
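
The Reinhard step listed above can be sketched as follows. This is a minimal NumPy-only illustration, not the code behind `ImagePreprocessor.normalize_stains`, whose internals are not shown in this notebook. It assumes `uint8` RGB images and a reference image chosen by the user; the function names `rgb_to_lab`, `lab_to_rgb`, and `reinhard_normalize` are hypothetical. The idea is to convert both images to the decorrelated lαβ colour space used by Reinhard et al., match the per-channel mean and standard deviation of the source to those of the reference, and convert back.

```python
import numpy as np

# RGB -> LMS cone response matrix from Reinhard et al. (2001).
_RGB2LMS = np.array([[0.3811, 0.5783, 0.0402],
                     [0.1967, 0.7244, 0.0782],
                     [0.0241, 0.1288, 0.8444]])
# log-LMS -> l-alpha-beta decorrelating transform.
_LMS2LAB = np.diag([1 / np.sqrt(3), 1 / np.sqrt(6), 1 / np.sqrt(2)]) @ np.array(
    [[1.0, 1.0, 1.0], [1.0, 1.0, -2.0], [1.0, -1.0, 0.0]]
)


def rgb_to_lab(rgb):
    """Convert a uint8 RGB image of shape (H, W, 3) to l-alpha-beta."""
    scaled = np.clip(rgb.astype(np.float64) / 255.0, 1e-6, 1.0)  # avoid log(0)
    lms = scaled @ _RGB2LMS.T
    return np.log10(lms) @ _LMS2LAB.T


def lab_to_rgb(lab):
    """Invert rgb_to_lab and return a uint8 RGB image."""
    lms = 10.0 ** (lab @ np.linalg.inv(_LMS2LAB).T)
    rgb = lms @ np.linalg.inv(_RGB2LMS).T
    return (np.clip(rgb, 0.0, 1.0) * 255.0).round().astype(np.uint8)


def reinhard_normalize(source, target):
    """Match the per-channel mean/std of `source` to `target` in l-alpha-beta."""
    src, tgt = rgb_to_lab(source), rgb_to_lab(target)
    src_mean, src_std = src.mean(axis=(0, 1)), src.std(axis=(0, 1))
    tgt_mean, tgt_std = tgt.mean(axis=(0, 1)), tgt.std(axis=(0, 1))
    matched = (src - src_mean) / (src_std + 1e-8) * tgt_std + tgt_mean
    return lab_to_rgb(matched)
```

In practice the statistics are often computed over tissue pixels only, so that white background does not dominate the mean; that masking step is omitted here.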


In [None]:
# Initialize Image Preprocessor
image_preprocessor = ImagePreprocessor()

# Normalize and Extract Patches
for zarr_key, dataset in zarr_data.items():
    he_images = dataset["images"]["HE_registered"]
    he_nuc_masks = dataset["images"]["HE_nuc_registered"]

    # Normalize stains first so that patches are cut from the normalized image
    normalized_images = image_preprocessor.normalize_stains(he_images)

    # Extract nucleus-centered patches from the normalized image
    patches = image_preprocessor.extract_patches(normalized_images, he_nuc_masks, patch_size=32)

    # Save to Interim Directory
    save_path = os.path.join(interim_dir, f"{zarr_key}_processed_images.npy")
    np.save(save_path, patches)
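
For readers without access to `ImagePreprocessor`, nucleus-centered patch extraction can be sketched roughly as below. This is an assumption about what `extract_patches` does, not its actual implementation: it treats the nucleus mask as an integer label image (0 for background, one id per nucleus), takes each nucleus centroid, and crops a `patch_size` window around it after reflect-padding the image so that border nuclei still yield full patches. The helper names are hypothetical.

```python
import numpy as np


def nucleus_centroids(label_mask):
    """Return {nucleus_id: (row, col)} centroids from an integer label mask."""
    ids = np.unique(label_mask)
    ids = ids[ids != 0]  # 0 is assumed to be background
    centroids = {}
    for nid in ids:
        # Simple per-label scan; fine for a sketch, slow for very large masks.
        rows, cols = np.nonzero(label_mask == nid)
        centroids[int(nid)] = (int(round(rows.mean())), int(round(cols.mean())))
    return centroids


def extract_patches(image, label_mask, patch_size=32):
    """Crop one (patch_size, patch_size, C) patch per nucleus, in id order."""
    half = patch_size // 2
    # Reflect-pad so patches near the border keep the full size.
    padded = np.pad(image, ((half, half), (half, half), (0, 0)), mode="reflect")
    patches = [
        # Original pixel (r, c) sits at (r + half, c + half) in the padded image.
        padded[r:r + patch_size, c:c + patch_size]
        for r, c in nucleus_centroids(label_mask).values()
    ]
    if not patches:
        return np.empty((0, patch_size, patch_size, image.shape[2]), dtype=image.dtype)
    return np.stack(patches)
```

With an even `patch_size`, the centroid pixel lands at index `[patch_size // 2, patch_size // 2]` of each patch.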


### 3.2 Gene Expression Preprocessing
- Normalize and filter gene expression data.
- Log-transform if needed.
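
As an illustration of the two bullets above, here is a minimal sketch of a common normalization recipe: drop genes detected in too few cells, scale each cell to a fixed total, then apply `log1p`. It assumes a dense cells-by-genes matrix of raw counts; if the table is already log-normalized, the transform should be skipped. The function `normalize_counts` and its parameters are hypothetical and do not describe `GenePreprocessor.normalize`.

```python
import numpy as np


def normalize_counts(counts, target_sum=1e4, min_cells=1):
    """Filter genes, scale each cell to `target_sum`, and log1p-transform.

    Returns the normalized matrix and the boolean mask of genes that were kept.
    """
    counts = np.asarray(counts, dtype=np.float64)
    # Keep genes detected (count > 0) in at least `min_cells` cells.
    keep = (counts > 0).sum(axis=0) >= min_cells
    filtered = counts[:, keep]
    # Per-cell library size; empty cells are left as all zeros.
    totals = filtered.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0
    scaled = filtered / totals * target_sum
    return np.log1p(scaled), keep
```

Returning the gene mask alongside the matrix makes it possible to apply the same gene selection to other datasets later.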


In [None]:
# Initialize Gene Preprocessor
gene_preprocessor = GenePreprocessor()

# Normalize Gene Expression
for zarr_key, dataset in zarr_data.items():
    anucleus_table = dataset["tables"]["anucleus"]
    normalized_genes = gene_preprocessor.normalize(anucleus_table)

    # Save to Interim Directory
    save_path = os.path.join(interim_dir, f"{zarr_key}_processed_genes.npy")
    np.save(save_path, normalized_genes)


### 3.3 Spatial Feature Engineering
- Compute distances between nuclei.
- Create adjacency matrices for spatial modeling.
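
The two features above can be sketched with plain NumPy. This is an assumption-laden illustration rather than the logic of `SpatialPreprocessor.generate_features`: it assumes nucleus centroids are available as an `(n, 2)` coordinate array and builds a symmetric k-nearest-neighbour adjacency matrix from the pairwise Euclidean distances. Both function names are hypothetical.

```python
import numpy as np


def pairwise_distances(coords):
    """Euclidean distance matrix of shape (n, n) for an (n, d) coordinate array."""
    coords = np.asarray(coords, dtype=np.float64)
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def knn_adjacency(dist, k):
    """Symmetric boolean adjacency linking each point to its k nearest neighbours."""
    n = dist.shape[0]
    d = dist.copy()
    np.fill_diagonal(d, np.inf)  # a point is never its own neighbour
    nearest = np.argsort(d, axis=1)[:, :k]
    adj = np.zeros((n, n), dtype=bool)
    adj[np.repeat(np.arange(n), nearest.shape[1]), nearest.ravel()] = True
    return adj | adj.T  # symmetrize: keep an edge if either end selected it
```

The dense distance matrix needs memory quadratic in the number of nuclei, so for whole-slide data a spatial index (for example a KD-tree) with a sparse adjacency matrix is usually the more practical choice.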


In [None]:
# Initialize Spatial Preprocessor
spatial_preprocessor = SpatialPreprocessor()

# Compute Features
for zarr_key, dataset in zarr_data.items():
    spatial_features = spatial_preprocessor.generate_features(dataset)

    # Save to Interim Directory
    save_path = os.path.join(interim_dir, f"{zarr_key}_spatial_features.npy")
    np.save(save_path, spatial_features)


## 4. Intermediate Validation
- Visualize preprocessed images, gene distributions, and spatial features.


In [None]:
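# Visualize a Preprocessed Patch
# Note: `patches` holds only the last dataset processed in Section 3.1
plt.imshow(patches[0])
plt.title("Example Preprocessed Patch")
plt.show()

# Plot Gene Expression Distribution
# Note: `normalized_genes` holds only the last dataset processed in Section 3.2
plt.hist(normalized_genes.flatten(), bins=50)
plt.title("Normalized Gene Expression Distribution")
plt.show()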


## 5. Save Preprocessed Data
- All outputs are written to the `interim` directory in Section 3; the cell below checks that the directory exists.

In [None]:
# Ensure Preprocessed Data is Stored Correctly
assert os.path.exists(interim_dir), "Interim directory does not exist!"
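
A stricter check would confirm that every expected file was written, not only that the directory exists. The sketch below assumes the file-name pattern used in the save steps of this notebook (`{key}_processed_images.npy`, `{key}_processed_genes.npy`, `{key}_spatial_features.npy`); the helper `missing_outputs` is hypothetical, and the demo runs on a temporary directory rather than on real outputs.

```python
import os
import tempfile

# Suffixes taken from the np.save calls in Section 3.
OUTPUT_SUFFIXES = ("processed_images", "processed_genes", "spatial_features")


def missing_outputs(interim_dir, dataset_keys, suffixes=OUTPUT_SUFFIXES):
    """Return the expected .npy paths that are not present on disk."""
    expected = [
        os.path.join(interim_dir, f"{key}_{suffix}.npy")
        for key in dataset_keys
        for suffix in suffixes
    ]
    return [path for path in expected if not os.path.isfile(path)]


# Demo on a throwaway directory that holds only one of the three outputs.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "demo_processed_images.npy"), "wb").close()
    demo_missing = [os.path.basename(p) for p in missing_outputs(demo_dir, ["demo"])]
```

In this notebook the call would look like `missing_outputs(interim_dir, zarr_data.keys())`, with an empty list meaning that all outputs are in place.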

## 6. Notes and Next Steps
- Preprocessing is complete. The next step is Enhanced EDA.
- Key Observations:
  - ...
  - ...
