# 🧬 Mechano-Velocity: Notebook 01 - Preprocessing

**Physics-Informed Cell Migration Prediction for Spatial Transcriptomics**

This notebook covers:
1. Environment setup and dependency installation
2. Dataset download and loading
3. Quality control and filtering
4. Normalization and transformation
5. Clustering and visualization
6. Checking key mechanotyping genes
7. Saving processed data

---

## 1. Environment Setup

In [None]:
# Check if running in Colab
import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print("Running in Google Colab")
    # Clone repository
    !git clone https://github.com/your-username/mechano-velocity.git 2>/dev/null || echo "Repo already cloned"
    %cd mechano-velocity
    
    # Install dependencies
    !pip install -q scanpy scvelo squidpy anndata leidenalg
else:
    print("Running locally")

In [None]:
# Core imports
import scanpy as sc
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Set scanpy settings
sc.settings.verbosity = 3
sc.settings.set_figure_params(dpi=100, frameon=False, figsize=(8, 8))

print(f"Scanpy version: {sc.__version__}")

In [None]:
# Add project to path
import sys
from pathlib import Path

# Use the current directory if we are already inside the repository;
# otherwise assume it was cloned into ./mechano-velocity
PROJECT_ROOT = Path('.').resolve()
if 'mechano-velocity' not in str(PROJECT_ROOT):
    PROJECT_ROOT = (PROJECT_ROOT / 'mechano-velocity').resolve()

sys.path.insert(0, str(PROJECT_ROOT))

# Import our modules
from mechano_velocity import Config, DataLoader, Preprocessor
print(f"Project root: {PROJECT_ROOT}")

## 2. Dataset Download

We use the **10x Genomics Human Breast Cancer** dataset:
- Block A Section 1
- Invasive Ductal Carcinoma
- Contains visible fibrotic regions, which makes it a useful test case for our model

In [None]:
# Create data directory
DATA_DIR = PROJECT_ROOT / "data" / "V1_Breast_Cancer_Block_A"
DATA_DIR.mkdir(parents=True, exist_ok=True)

SPATIAL_DIR = DATA_DIR / "spatial"
SPATIAL_DIR.mkdir(exist_ok=True)

print(f"Data directory: {DATA_DIR}")

In [None]:
# Option 1: Download from 10x Genomics (Colab)
# Note: You may need to manually download if links change

if IN_COLAB:
    # Check if already downloaded
    h5_file = DATA_DIR / "filtered_feature_bc_matrix.h5"
    
    if not h5_file.exists():
        print("Downloading dataset from 10x Genomics...")
        
        # Download filtered matrix
        !wget -q -O {DATA_DIR}/filtered_feature_bc_matrix.h5 \
            "https://cf.10xgenomics.com/samples/spatial-exp/1.1.0/V1_Breast_Cancer_Block_A_Section_1/V1_Breast_Cancer_Block_A_Section_1_filtered_feature_bc_matrix.h5"
        
        # Download spatial files
        !wget -q -O {SPATIAL_DIR}/tissue_hires_image.png \
            "https://cf.10xgenomics.com/samples/spatial-exp/1.1.0/V1_Breast_Cancer_Block_A_Section_1/V1_Breast_Cancer_Block_A_Section_1_tissue_hires_image.png"
        
        !wget -q -O {SPATIAL_DIR}/tissue_lowres_image.png \
            "https://cf.10xgenomics.com/samples/spatial-exp/1.1.0/V1_Breast_Cancer_Block_A_Section_1/V1_Breast_Cancer_Block_A_Section_1_tissue_lowres_image.png"
        
        !wget -q -O {SPATIAL_DIR}/scalefactors_json.json \
            "https://cf.10xgenomics.com/samples/spatial-exp/1.1.0/V1_Breast_Cancer_Block_A_Section_1/V1_Breast_Cancer_Block_A_Section_1_scalefactors_json.json"
        
        !wget -q -O {SPATIAL_DIR}/tissue_positions_list.csv \
            "https://cf.10xgenomics.com/samples/spatial-exp/1.1.0/V1_Breast_Cancer_Block_A_Section_1/V1_Breast_Cancer_Block_A_Section_1_tissue_positions_list.csv"
        
        print("Download complete!")
    else:
        print("Dataset already downloaded.")
else:
    print("Running locally - please ensure dataset is in data/V1_Breast_Cancer_Block_A/")
    print("Download from: https://www.10xgenomics.com/resources/datasets/human-breast-cancer-block-a-section-1-1-standard-1-1-0")

In [None]:
# Verify files
required_files = [
    DATA_DIR / "filtered_feature_bc_matrix.h5",
    SPATIAL_DIR / "scalefactors_json.json",
]

for f in required_files:
    status = "✅" if f.exists() else "❌"
    print(f"{status} {f.name}")

# Check tissue positions (can have different names)
pos_file = SPATIAL_DIR / "tissue_positions_list.csv"
pos_file_alt = SPATIAL_DIR / "tissue_positions.csv"
if pos_file.exists() or pos_file_alt.exists():
    print("✅ tissue_positions file")
else:
    print("❌ tissue_positions file missing!")
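
The same check can be wrapped in a reusable helper that returns the missing files instead of printing them. This is a minimal sketch using only the standard library; `find_missing` and its arguments are hypothetical names, not part of `mechano_velocity`, and the demo runs on a throwaway directory rather than the real dataset.

```python
from pathlib import Path
import tempfile


def find_missing(data_dir, required, alternatives):
    """Return the required files that are absent from data_dir.

    `required` lists relative paths that must all exist; `alternatives`
    lists groups of relative paths where any one member is enough.
    (Hypothetical helper, for illustration only.)
    """
    data_dir = Path(data_dir)
    missing = [rel for rel in required if not (data_dir / rel).exists()]
    for group in alternatives:
        if not any((data_dir / rel).exists() for rel in group):
            missing.append(" or ".join(group))
    return missing


# Demo: a temporary directory holding the matrix and a positions file,
# but no scale factors file
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "spatial").mkdir()
    (root / "filtered_feature_bc_matrix.h5").touch()
    (root / "spatial" / "tissue_positions.csv").touch()
    demo_missing = find_missing(
        root,
        required=["filtered_feature_bc_matrix.h5", "spatial/scalefactors_json.json"],
        alternatives=[["spatial/tissue_positions_list.csv", "spatial/tissue_positions.csv"]],
    )

print(f"Missing: {demo_missing}")
```

Returning a list makes it easy to stop the notebook early, for example by raising `FileNotFoundError` when the list is not empty.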

## 3. Load Data

In [None]:
# Initialize configuration
config = Config()
config.data_dir = PROJECT_ROOT / "data"
config.output_dir = PROJECT_ROOT / "output"
config.output_dir.mkdir(exist_ok=True)

print("Config loaded:")
print(f"  Dataset: {config.dataset_name}")
print(f"  Min counts: {config.preprocessing.min_counts}")
print(f"  Target sum: {config.preprocessing.target_sum}")

In [None]:
# Load data using our DataLoader
loader = DataLoader(config)
adata = loader.load_visium(DATA_DIR)

In [None]:
# Print summary
print(loader.summary())

In [None]:
# View the tissue image
fig, ax = plt.subplots(figsize=(10, 10))
sc.pl.spatial(adata, img_key='hires', ax=ax, show=False)
ax.set_title("Breast Cancer Tissue (H&E Staining)", fontsize=14)
plt.tight_layout()
plt.show()

## 4. Quality Control

In [None]:
# Calculate QC metrics
adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
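
To make the three QC columns used below concrete, here is a small NumPy sketch of how `total_counts`, `n_genes_by_counts`, and `pct_counts_mt` are defined. The toy matrix and gene list are made up for illustration; on real data the values come from `sc.pp.calculate_qc_metrics` as in the cell above.

```python
import numpy as np

# Toy count matrix: 3 spots (rows) x 4 genes (columns); values are invented
gene_names = np.array(["MT-CO1", "COL1A1", "MT-ND1", "LOX"])
counts = np.array([
    [10, 30, 10, 50],
    [0, 5, 0, 0],
    [2, 0, 0, 8],
])

# Flag mitochondrial genes by their "MT-" prefix, as in the cell above
is_mt = np.char.startswith(gene_names, "MT-")

# Total UMI counts per spot
total_counts = counts.sum(axis=1)

# Number of genes with at least one count per spot
n_genes_by_counts = (counts > 0).sum(axis=1)

# Percentage of each spot's counts that come from mitochondrial genes
pct_counts_mt = 100.0 * counts[:, is_mt].sum(axis=1) / total_counts

print(total_counts, n_genes_by_counts, pct_counts_mt)
```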

In [None]:
# Visualize QC metrics
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Total counts distribution
axes[0].hist(adata.obs['total_counts'], bins=50, color='steelblue', edgecolor='white')
axes[0].axvline(x=config.preprocessing.min_counts, color='red', linestyle='--', label=f'Threshold ({config.preprocessing.min_counts})')
axes[0].set_xlabel('Total Counts')
axes[0].set_ylabel('Frequency')
axes[0].set_title('UMI Counts per Spot')
axes[0].legend()

# Genes detected
axes[1].hist(adata.obs['n_genes_by_counts'], bins=50, color='forestgreen', edgecolor='white')
axes[1].set_xlabel('Number of Genes')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Genes per Spot')

# Mitochondrial percentage
axes[2].hist(adata.obs['pct_counts_mt'], bins=50, color='coral', edgecolor='white')
axes[2].axvline(x=20, color='red', linestyle='--', label='Threshold (20%)')
axes[2].set_xlabel('Mitochondrial %')
axes[2].set_ylabel('Frequency')
axes[2].set_title('Mitochondrial Content')
axes[2].legend()

plt.tight_layout()
plt.savefig(config.output_dir / 'qc_distributions.png', dpi=150)
plt.show()
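
The red threshold lines in the histograms can be turned into a quick estimate of how many spots would survive filtering. This sketch assumes a spot is kept when its total counts are at least `min_counts` and its mitochondrial percentage is at most 20%; the actual rule applied by `Preprocessor` may differ, and `keep_mask` is a hypothetical helper.

```python
import numpy as np


def keep_mask(total_counts, pct_counts_mt, min_counts, max_pct_mt=20.0):
    """Boolean mask of spots passing both thresholds (assumed filtering rule)."""
    total_counts = np.asarray(total_counts)
    pct_counts_mt = np.asarray(pct_counts_mt)
    return (total_counts >= min_counts) & (pct_counts_mt <= max_pct_mt)


# Toy values for four spots; with real data, pass adata.obs['total_counts']
# and adata.obs['pct_counts_mt'] instead
mask = keep_mask([100, 5, 10, 800], [20.0, 0.0, 35.0, 4.0], min_counts=10)
print(f"Keeping {mask.sum()} of {mask.size} spots")
```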

In [None]:
# Spatial QC visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

sc.pl.spatial(adata, color='total_counts', ax=axes[0], show=False, title='Total Counts')
sc.pl.spatial(adata, color='n_genes_by_counts', ax=axes[1], show=False, title='Genes Detected')

plt.tight_layout()
plt.savefig(config.output_dir / 'qc_spatial.png', dpi=150)
plt.show()

## 5. Preprocessing Pipeline

In [None]:
# Store raw counts for later
adata.layers['counts'] = adata.X.copy()

print(f"Original shape: {adata.shape}")

In [None]:
# Run full preprocessing pipeline
preprocessor = Preprocessor(config)
adata = preprocessor.run(
    adata,
    filter_spots=True,
    normalize=True,
    log_transform=True,
    compute_hvg=True,
    compute_pca=True,
    compute_neighbors=True,
    copy=False  # Modify in place
)
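
For reference, the `normalize` and `log_transform` steps can be sketched in plain NumPy. This assumes they follow the usual library-size scaling followed by `log1p` (the approach of `sc.pp.normalize_total` and `sc.pp.log1p`); the internals of `Preprocessor.run` are not shown in this notebook, so treat `normalize_and_log` as a hypothetical illustration rather than the project's implementation.

```python
import numpy as np


def normalize_and_log(counts, target_sum):
    """Scale each spot (row) to sum to target_sum, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # leave all-zero spots at zero instead of dividing by zero
    scaled = counts / totals * target_sum
    return np.log1p(scaled)


# Two spots with the same composition but different sequencing depth,
# plus one empty spot
toy = np.array([[1, 3], [10, 30], [0, 0]])
out = normalize_and_log(toy, target_sum=1e6)
print(out)
```

After normalization the first two spots have identical values, which is the point of the step: differences in sequencing depth no longer dominate the comparison. In the notebook the scaling constant comes from `config.preprocessing.target_sum`.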

In [None]:
# Verify preprocessing results
print(f"\nFinal shape: {adata.shape}")
print(f"\nLayers: {list(adata.layers.keys())}")
print(f"Obsm: {list(adata.obsm.keys())}")

## 6. Clustering and Visualization

In [None]:
# Compute UMAP
sc.tl.umap(adata)

# Cluster with Leiden algorithm
sc.tl.leiden(adata, resolution=0.5)

print(f"Found {adata.obs['leiden'].nunique()} clusters")

In [None]:
# Visualize clusters
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# UMAP
sc.pl.umap(adata, color='leiden', ax=axes[0], show=False, title='UMAP - Clusters')

# Spatial
sc.pl.spatial(adata, color='leiden', ax=axes[1], show=False, title='Spatial - Clusters')

plt.tight_layout()
plt.savefig(config.output_dir / 'clusters.png', dpi=150)
plt.show()

## 7. Check Key Genes

Let's verify our mechanotyping genes are present.

In [None]:
# Extract gene panel
gene_panel, missing = loader.extract_gene_panel()

print(f"\nAvailable mechanotyping genes: {gene_panel.n_vars}")
if missing:
    print(f"Missing genes: {missing}")

In [None]:
# Visualize key genes spatially
key_genes = ['COL1A1', 'LOX', 'MMP9', 'CD8A']
available_genes = [g for g in key_genes if g in adata.var_names]

if available_genes:
    fig, axes = plt.subplots(1, len(available_genes), figsize=(5*len(available_genes), 5))
    if len(available_genes) == 1:
        axes = [axes]
    
    for ax, gene in zip(axes, available_genes):
        sc.pl.spatial(adata, color=gene, ax=ax, show=False, title=gene)
    
    plt.tight_layout()
    plt.savefig(config.output_dir / 'key_genes_spatial.png', dpi=150)
    plt.show()
else:
    print("Key genes not found in dataset")

## 8. Save Processed Data

In [None]:
# Save processed AnnData
output_path = config.output_dir / 'preprocessed_adata.h5ad'
adata.write_h5ad(output_path)
print(f"Saved preprocessed data to: {output_path}")

In [None]:
# Save configuration
config.save(config.output_dir / 'config.json')
print("Configuration saved.")

## Summary

✅ Loaded 10x Visium breast cancer dataset  
✅ Performed quality control and filtering  
✅ Applied CPM normalization and log transformation  
✅ Computed PCA and neighborhood graph  
✅ Identified clusters  
✅ Verified mechanotyping genes are present  
✅ Saved processed data for next notebook  

**Next: Run `02_Mechanotyping.ipynb` to calculate the resistance field.**