# Single-nucleus ATAC-seq Preprocessing Pipeline

## Overview

This pipeline preprocesses single-nucleus ATAC-seq (snATAC-seq) pseudobulk peak count data
for downstream chromatin accessibility QTL (caQTL) analysis and region-specific studies.

**Goals:**
- Transform raw pseudobulk peak counts into analysis-ready formats
- Remove technical confounders while optionally preserving biological covariates
- Generate QTL-ready phenotype files or region-specific datasets

## Pipeline Structure
```
Step 0: Sample ID Mapping
↓
Step 1: Pseudobulk QC
├── Option A: BIOvar (regress out technical + biological covariates)
└── Option B: noBIOvar (regress out technical covariates only)
↓ (optional)
Batch Correction (ComBat-seq or limma::removeBatchEffect)
↓
Step 2: Format Output
├── Format A: Phenotype Reformatting → BED (genome-wide caQTL mapping)
└── Format B: Region Peak Filtering  → TSV (locus-specific analysis)

```

## Input Files

All input files required to run this pipeline can be downloaded
[here](https://drive.google.com/drive/folders/1UzJuHN8SotMn-PJTBp9uGShD25YxapKr?usp=drive_link).

| File | Used in |
|------|---------|
| `pseudobulk_peaks_counts_{celltype}.csv.gz` | Step 0, Step 1 |
| `metadata_{celltype}.csv` | Step 0, Step 1 |
| `rosmap_sample_mapping_data.csv` | Step 0 |
| `rosmap_cov.txt` | Step 1 |
| `hg38-blacklist.v2.bed.gz` | Step 1 |
| `SampleSheet.csv` | Step 1 (batch correction only) |
| `sampleSheetAfterQc.csv` | Step 1 (batch correction only) |


## Minimal Working Example

## Step 0: Sample ID Mapping

Maps original sample identifiers (`individualID`) to standardized sample IDs (`sampleid`)
across metadata and count matrix files.

### Input

| File | Description |
|------|-------------|
| `rosmap_sample_mapping_data.csv` | Mapping reference: `individualID → sampleid` |
| `metadata_{celltype}.csv` × 6 | Per-cell-type sample metadata |
| `pseudobulk_peaks_counts_{celltype}.csv.gz` × 6 | Per-cell-type peak count matrices |

Cell types: `Ast`, `Ex`, `In`, `Microglia`, `Oligo`, `OPC`

### Process

**Part 1 — Metadata files**

For each `metadata_{celltype}.csv`:
1. Look up each `individualID` in the mapping reference
2. Assign `sampleid` — falls back to `individualID` if no mapping found
3. Insert `sampleid` as the first column
4. Save updated file

**Part 2 — Count matrix files**

For each `pseudobulk_peaks_counts_{celltype}.csv.gz`:
1. Extract the header row (column names only)
2. Keep `peak_id` (first column) unchanged
3. Map remaining column names (`individualID` → `sampleid`) where mapping exists,
   otherwise keep original
4. Write new header and stream data rows unchanged
5. Recompress with gzip
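
The header remapping in Part 2 can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's own code: the function names `remap_header` and `remap_counts_gz` and the sample IDs are hypothetical, and it assumes a plain comma-separated header with no quoted fields.

```python
import gzip

def remap_header(header_line, id_map):
    """Map sample columns through id_map; keep peak_id and unmapped names as-is."""
    cols = header_line.rstrip("\n").split(",")
    return ",".join([cols[0]] + [id_map.get(c, c) for c in cols[1:]])

def remap_counts_gz(in_path, out_path, id_map):
    """Rewrite only the header of a gzipped CSV; data rows are streamed unchanged."""
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        dst.write(remap_header(src.readline(), id_map) + "\n")
        for line in src:
            dst.write(line)

# Hypothetical IDs, for illustration only
id_map = {"R1234567": "SM-AAAA", "R7654321": "SM-BBBB"}
new_header = remap_header("peak_id,R1234567,R9999999,R7654321\n", id_map)
print(new_header)  # the unmapped column R9999999 keeps its original name
```

Because only the first line is rewritten, memory use stays flat regardless of matrix size.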

### Output

Output directory: `output/1_files_with_sampleid/`

| File | Description |
|------|-------------|
| `metadata_{celltype}.csv` × 6 | Metadata with `sampleid` column prepended |
| `pseudobulk_peaks_counts_{celltype}.csv.gz` × 6 | Count matrices with mapped column headers |

**Timing:** < 1 min


In [None]:
sos run pipeline/snatacseq_preprocessing.ipynb sampleid_mapping \
    --cwd output/atac_seq/1_files_with_sampleid \
    --map_file data/atac_seq/rosmap_sample_mapping_data.csv \
    --input_dir data/atac_seq/1_files_with_sampleid_xiong \
    --output_dir output/atac_seq/1_files_with_sampleid \
    --celltype Ast Ex In Microglia Oligo OPC


# For MIT input data
sos run pipeline/snatacseq_preprocessing.ipynb sampleid_mapping \
    --cwd output/atac_seq/1_files_with_sampleid  \
    --map_file data/atac_seq/rosmap_sample_mapping_data.csv \
    --input_dir data/atac_seq/1_files_with_sampleid_MIT \
    --output_dir output/atac_seq/1_files_with_sampleid \
    --celltype Astro Exc Inh Mic Oligo OPC \
    --suffix _50nuc

## Step 1: Pseudobulk QC

Two approaches are available depending on whether biological covariates should be regressed out.
Both options support an **optional batch correction** step after filtering and normalization.


### Option A: With Biological Covariates (BIOvar)

Use when residuals should be adjusted for all technical **and** biological covariates (sex, age, PMI).

**Input:**

| File | Location |
|------|----------|
| `pseudobulk_peaks_counts_{celltype}.csv.gz` | `1_files_with_sampleid/` |
| `metadata_{celltype}.csv` | `1_files_with_sampleid/` |
| `rosmap_cov.txt` | `data/` |
| `hg38-blacklist.v2.bed.gz` | `data/` |
| `SampleSheet.csv` *(batch correction only)* | `data/` |
| `sampleSheetAfterQc.csv` *(batch correction only)* | `data/` |

**Process:**

1. Load pseudobulk peak count matrix and metadata per cell type
2. Remove samples with 20 or fewer nuclei (only samples with `n_nuclei > min_nuclei`, default 20, are kept)
3. Calculate technical QC metrics per sample:
   - `log_n_nuclei`: log-transformed nuclei count
   - `med_nucleosome_signal`: median nucleosome signal
   - `med_tss_enrich`: median TSS enrichment score
   - `log_med_n_tot_fragment`: log-transformed median total fragments
   - `log_total_unique_peaks`: log-transformed unique peak count
4. Filter blacklisted genomic regions
5. Merge with demographic covariates (`msex`, `age_death`, `pmi`, `study`)
6. Apply expression filtering (`filterByExpr`):
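   - `min_count = 5`: count threshold, converted to a CPM cutoff at the median library size, that a peak must reach in enough samples
   - `min_total_count = 15`: minimum total reads across all samples
   - `min_prop = 0.1`: for cohorts larger than 10 samples, the fraction of the additional samples that must also reach the cutoff (roughly "expressed in ≥10% of samples" for large cohorts)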
7. TMM normalization
8. *(Optional)* Batch correction — see [Batch Correction](#batch-correction-optional) below
9. Fit linear model (`voom` + `lmFit`):
   `~ log_n_nuclei + med_nucleosome_signal + med_tss_enrich + log_med_n_tot_fragment + log_total_unique_peaks + sequencingBatch + msex + age_death + pmi + study`

   > If batch correction was applied, `sequencingBatch` is removed from the model.
10. Compute residuals adjusted for all covariates
11. Compute final adjusted values: `offset + residuals`
    - `offset`: predicted expression at median/reference covariate values
    - `residuals`: unexplained variation after removing all covariate effects
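
The `offset + residuals` idea in steps 10 and 11 can be illustrated with a single covariate and ordinary least squares. This is a simplified sketch, not the pipeline's `voom`/`lmFit` fit: the function name `adjust_for_covariate` and the toy data are hypothetical, and the real model has many covariates and observation weights.

```python
from statistics import mean, median

def adjust_for_covariate(y, x):
    """One-covariate OLS: return offset + residual for every sample."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    # Residuals: what the covariate does not explain
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Offset: predicted value at the median covariate value
    offset = intercept + slope * median(x)
    return [offset + r for r in residuals]

# Toy peak whose accessibility is driven entirely by the covariate (y = 10 + 2x)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [12.0, 14.0, 16.0, 18.0, 20.0]
adjusted = adjust_for_covariate(y, x)
```

Adding the offset back keeps the adjusted values on the original log2-CPM scale instead of centring them on zero. In this toy case every sample ends up at the value predicted for the median covariate.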

**Output:** `output/2_residuals/{celltype}/`

| File | Description |
|------|-------------|
| `{celltype}_residuals.txt` | Covariate-adjusted peak accessibility (log2-CPM) |
| `{celltype}_results.rds` | Full results: DGEList, fit, offset, residuals, design |
| `{celltype}_filtered_raw_counts.txt` | Filtered raw counts before normalization |
| `{celltype}_summary.txt` | Filtering statistics and QC summary |

**Covariates regressed out:**
- Technical: sequencing depth, nuclei count, nucleosome signal, TSS enrichment, batch
- Biological: sex (`msex`), age at death (`age_death`), post-mortem interval (`pmi`), study cohort

### Option B: Without Biological Covariates (noBIOvar)

Use when biological variation should be preserved (e.g., age/sex comparisons, region-specific analyses).

**Input:** Same as Option A.

**Process:**

Steps 1–8 are identical to Option A. Key differences at the modelling stage:
- `msex` and `age_death` are **excluded** from the model
- `med_peakwidth` (weighted median peak width per sample) is added as a technical covariate
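
How `med_peakwidth` is weighted is not spelled out here. The sketch below assumes each peak's width is weighted by that sample's read count in the peak, which is one plausible reading and only an assumption; `weighted_median` is a hypothetical helper, not a pipeline function.

```python
def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half of the total."""
    pairs = sorted((v, w) for v, w in zip(values, weights) if w > 0)
    total = sum(w for _, w in pairs)
    if total == 0:
        raise ValueError("all weights are zero")
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= total / 2:
            return value

peak_widths = [200, 500, 1000]   # end - start for three hypothetical peaks
sample_counts = [1, 1, 8]        # one sample's counts in those peaks (assumed weights)
med_peakwidth = weighted_median(peak_widths, sample_counts)
```

With these weights the wide, heavily covered peak dominates, so the weighted median differs from the unweighted one.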

**Model formula:**
```
Model: ~ log_n_nuclei + med_nucleosome_signal + med_tss_enrich + log_med_n_tot_fragment + log_total_unique_peaks + med_peakwidth + sequencingBatch + pmi + study
```

**Output:** `output/2_residuals/{celltype}/`

| File | Description |
|------|-------------|
| `{celltype}_residuals.txt` | Covariate-adjusted peak accessibility (log2-CPM) |
| `{celltype}_results.rds` | Full results: DGEList, fit, offset, residuals, design |
| `{celltype}_filtered_raw_counts.txt` | Filtered raw counts before normalization |
| `{celltype}_summary.txt` | Filtering statistics and QC summary |

**Variables deliberately NOT regressed out:**
- Sex (`msex`)
- Age at death (`age_death`)

**Timing:** <5 min per celltype

### Pseudobulk QC with BIOvar

In [None]:
sos run pipeline/snatacseq_preprocessing.ipynb pseudobulk_qc \
    --cwd output/atac_seq \
    --input_dir output/atac_seq/1_files_with_sampleid_xiong \
    --output_dir output/atac_seq/2_residuals \
    --blacklist_file data/atac_seq/hg38-blacklist.v2.bed.gz \
    --covariates_file data/atac_seq/rosmap_cov.txt \
    --include_bio TRUE \
    --batch_correction FALSE \
    --min_count 5 \
    --celltype Ast Ex In Microglia Oligo OPC

### Pseudobulk QC noBIOvar 

In [None]:
sos run pipeline/snatacseq_preprocessing.ipynb pseudobulk_qc \
    --cwd output/atac_seq \
    --input_dir output/atac_seq/1_files_with_sampleid_xiong \
    --output_dir output/atac_seq/2_residuals \
    --blacklist_file data/atac_seq/hg38-blacklist.v2.bed.gz \
    --covariates_file data/atac_seq/rosmap_cov.txt \
    --include_bio FALSE \
    --batch_correction FALSE \
    --min_count 5 \
    --celltype Ast Ex In Microglia Oligo OPC

### Batch Correction (Optional)

Applies to both Option A and Option B. Runs between TMM normalization and model fitting.
Use when batch effects are severe (e.g., visible batch clusters in PCA, multiple sequencing runs).

> When batch correction is applied, `sequencingBatch` is **removed** from the model formula
> since batch variance has already been removed from the counts.

**Method comparison:**

| | ComBat-seq | limma `removeBatchEffect` |
|---|---|---|
| **Operates on** | Raw integer counts | log-CPM values |
| **Mean-variance modelling** | Yes | No |
| **Best for** | Large, balanced batches | Small or fragmented batches |
| **Robustness** | May fail with many small batches | More robust to unbalanced designs |

**ComBat-seq:**
```r
adjusted_counts <- ComBat_seq(counts = dge$counts, batch = batches)
```

**limma `removeBatchEffect`:**
```r
logCPM <- cpm(dge, log = TRUE, prior.count = 1)
adj_logCPM  <- removeBatchEffect(logCPM, batch = batches, design = model.matrix(~1, data = dge$samples))
adjusted_counts <- round(pmax(2^adj_logCPM * mean(dge$samples$lib.size) / 1e6, 0))
```

**Additional filtering applied before correction:**
- Singleton batches (only 1 sample) are removed
- Samples absent from the batch sheet are dropped
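
The two ideas above (dropping singleton batches, then removing per-batch shifts on the log-CPM scale) can be sketched for a single peak. This is a conceptual illustration under an intercept-only design, not limma's implementation; `drop_singleton_batches`, `center_batches` and the toy values are hypothetical.

```python
from collections import Counter

def drop_singleton_batches(batches):
    """Return indices of samples whose batch contains more than one sample."""
    sizes = Counter(batches)
    return [i for i, b in enumerate(batches) if sizes[b] > 1]

def center_batches(values, batches):
    """Remove per-batch mean shifts from one peak's log-CPM values.

    Each sample is shifted by (its batch mean - the mean of all batch means).
    """
    by_batch = {}
    for v, b in zip(values, batches):
        by_batch.setdefault(b, []).append(v)
    batch_means = {b: sum(vs) / len(vs) for b, vs in by_batch.items()}
    reference = sum(batch_means.values()) / len(batch_means)
    return [v - (batch_means[b] - reference) for v, b in zip(values, batches)]

batches = ["B1", "B1", "B2", "B2", "B3"]   # B3 is a singleton batch
log_cpm = [4.0, 6.0, 8.0, 10.0, 7.0]       # one peak, five samples
keep = drop_singleton_batches(batches)
kept_batches = [batches[i] for i in keep]
adjusted = center_batches([log_cpm[i] for i in keep], kept_batches)
```

After centring, both remaining batches share the same mean while the within-batch differences are untouched.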

**Additional output when batch correction is enabled:**

| File | Description |
|------|-------------|
| `{celltype}_results.rds` | Includes `batch_adjusted_counts` and `batch_method` fields |


### Pseudobulk QC with BIOvar & with batch correction

In [1]:
sos run pipeline/snatacseq_preprocessing.ipynb pseudobulk_qc \
    --cwd output/atac_seq \
    --input_dir output/atac_seq/1_files_with_sampleid_xiong \
    --output_dir output/atac_seq/2_residuals \
    --blacklist_file data/atac_seq/hg38-blacklist.v2.bed.gz \
    --covariates_file data/atac_seq/rosmap_cov.txt \
    --include_bio TRUE \
    --batch_correction TRUE \
    --batch_method limma \
    --min_count 2 \
    --celltype Ast Ex In Microglia Oligo OPC


### Pseudobulk QC noBIOvar & with batch correction

In [None]:
sos run pipeline/snatacseq_preprocessing.ipynb pseudobulk_qc \
    --cwd output/atac_seq \
    --input_dir output/atac_seq/1_files_with_sampleid_xiong \
    --output_dir output/atac_seq/2_residuals \
    --blacklist_file data/atac_seq/hg38-blacklist.v2.bed.gz \
    --covariates_file data/atac_seq/rosmap_cov.txt \
    --include_bio FALSE \
    --batch_correction TRUE \
    --batch_method limma \
    --min_count 5 \
    --celltype Ast Ex In Microglia Oligo OPC

**Note**
For MIT data, add these parameters:

In [None]:
--celltype Astro Exc Inh Mic Oligo OPC \
--suffix _50nuc \
--input_dir output/1_files_with_sampleid_MIT

For additional parameters:

In [None]:
--min_count 5
--min_total_count 15
--min_prop 0.1
--min_nuclei 20
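
To see how `--min_count`, `--min_total_count` and `--min_prop` interact, the following sketch re-implements the single-group filtering rule in plain Python. It is a simplified approximation of edgeR's `filterByExpr` written for illustration: the function `filter_by_expr`, the `large_n = 10` constant and the toy counts are assumptions here, and the pipeline itself calls the R function.

```python
from statistics import median

def filter_by_expr(counts, min_count=5, min_total_count=15, min_prop=0.1, large_n=10):
    """Return one keep/drop flag per peak; counts is a list of rows (peaks x samples)."""
    n_samples = len(counts[0])
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]
    # min_count is turned into a CPM cutoff at the median library size
    cpm_cutoff = min_count / median(lib_sizes) * 1e6
    # Single group: all samples are required up to large_n,
    # plus a min_prop fraction of any samples beyond large_n
    min_samples = n_samples
    if min_samples > large_n:
        min_samples = large_n + (min_samples - large_n) * min_prop
    keep = []
    for row in counts:
        n_pass = sum(1 for c, lib in zip(row, lib_sizes) if c / lib * 1e6 >= cpm_cutoff)
        keep.append(n_pass >= min_samples - 1e-14 and sum(row) >= min_total_count - 1e-14)
    return keep

toy_counts = [
    [10, 10, 10, 10],   # consistently accessible peak
    [1, 0, 2, 0],       # sparse peak
    [9, 10, 8, 10],     # consistently accessible peak
]
keep_flags = filter_by_expr(toy_counts)
```

Lowering `--min_count` (as in the batch-corrected BIOvar example, which uses 2) lowers the CPM cutoff and therefore retains more low-coverage peaks.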

## Step 2: Format Output
### Phenotype Reformatting

Converts residuals into a QTL-ready BED format for genome-wide caQTL mapping.

**Input:**

| File | Location |
|------|----------|
| `{celltype}_residuals.txt` | `output/2_residuals/{celltype}/` |

**Process:**

1. Read residuals file with proper handling of peak IDs and sample columns
2. Parse peak coordinates from peak IDs (`chr-start-end` format)
3. Convert to midpoint coordinates (standard for QTLtools):
```
   start = floor((peak_start + peak_end) / 2)
   end = start + 1
```
4. Build BED format: `#chr`, `start`, `end`, `ID` followed by per-sample expression values
5. Sort by chromosome and position
6. Compress with `bgzip` and index with `tabix`
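
Steps 2 to 5 can be sketched as follows. This is a minimal illustration assuming peak IDs of the form `chr-start-end`; `peak_to_bed_fields` and the example IDs are hypothetical, and the sort shown is a plain lexicographic sort on chromosome name followed by position.

```python
def peak_to_bed_fields(peak_id):
    """Turn 'chr-start-end' into (#chr, start, end, ID) using the peak midpoint."""
    chrom, start, end = peak_id.rsplit("-", 2)
    mid = (int(start) + int(end)) // 2    # floor of the midpoint
    return chrom, mid, mid + 1, peak_id

# Hypothetical peak IDs, for illustration only
peak_ids = ["chr2-500-900", "chr1-1000-1501", "chr1-200-400"]
bed_rows = sorted(peak_to_bed_fields(p) for p in peak_ids)
for row in bed_rows:
    print("\t".join(str(field) for field in row))
```

Each peak is reduced to a 1 bp interval at its midpoint, so the cis-window used in QTL mapping is measured from the centre of the peak rather than from either edge.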

**Output:** `output/3_phenotype_processing/phenotype/{celltype}_snatac_phenotype.bed.gz`

| File | Description |
|------|-------------|
| `{celltype}_snatac_phenotype.bed.gz` | bgzip-compressed BED with peak midpoint coordinates |
| `{celltype}_snatac_phenotype.bed.gz.tbi` | tabix index for random-access queries |

**Use case:** Standard caQTL mapping to identify genetic variants affecting chromatin
accessibility independent of demographic factors. Compatible with FastQTL, TensorQTL, and QTLtools.

**Timing:** <1 min

In [None]:
sos run pipeline/snatacseq_preprocessing.ipynb phenotype_formatting \
    --cwd output/atac_seq \
    --input_dir output/atac_seq/2_residuals \
    --output_dir output/atac_seq/3_pheno_reformat \
    --celltype Ast Ex In Microglia Oligo OPC

### Region Peak Filtering

Filters peak counts to specific genomic regions of interest for locus-specific analysis.

**Input:**

| File | Location |
|------|----------|
| `{celltype}_filtered_raw_counts.txt` | `output/2_residuals/{celltype}/` |

**Process:**

1. Read filtered raw counts per cell type
2. Parse peak coordinates from peak IDs (`chr-start-end` format)
3. Calculate per-peak metrics:
   - `peakwidth`: `end - start`
   - `midpoint`: `(start + end) / 2`
4. Filter peaks overlapping target regions (includes peaks that start, end, or span boundaries):

   | Region | Coordinates | Size |
   |--------|-------------|------|
   | Chr7   | 28,000,000 – 28,300,000 bp | 300 kb |
   | Chr11  | 85,050,000 – 86,200,000 bp | 1.15 Mb |

5. Calculate summary statistics per peak:
   - `total_count`: sum of counts across all samples
   - `weighted_count`: `total_count / peakwidth` (normalizes for peak size)
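
The overlap test and per-peak summary above can be sketched as below. This is an illustrative example, not the pipeline's code: `overlaps`, `summarize_peak` and the peak IDs are hypothetical, lowercase `chr7`/`chr11` names are assumed in the peak IDs, and region boundaries are treated as inclusive.

```python
# Default target regions from the table above (chromosome naming is an assumption)
REGIONS = {"chr7": (28_000_000, 28_300_000), "chr11": (85_050_000, 86_200_000)}

def overlaps(start, end, region_start, region_end):
    """True if the peak starts in, ends in, or spans the region."""
    return start <= region_end and end >= region_start

def summarize_peak(peak_id, counts):
    """Return summary metrics for a peak in a target region, else None."""
    chrom, start, end = peak_id.rsplit("-", 2)
    start, end = int(start), int(end)
    region = REGIONS.get(chrom)
    if region is None or not overlaps(start, end, *region):
        return None
    width = end - start
    total = sum(counts)
    return {
        "peak_id": peak_id,
        "peakwidth": width,
        "midpoint": (start + end) / 2,
        "total_count": total,
        "weighted_count": total / width,   # normalizes for peak size
    }

inside = summarize_peak("chr7-27999800-28000200", [3, 5, 2])    # spans the left boundary
outside = summarize_peak("chr7-28300001-28300500", [4, 4, 4])   # starts past the region
```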

**Output:** `output/3_format_output/regions/{celltype}/`

| File | Description |
|------|-------------|
| `{celltype}_filtered_regions.txt` | Full count matrix for peaks in target regions |
| `{celltype}_filtered_regions_summary.txt` | Peak metadata with coordinates and count statistics |

**Use case:** Hypothesis-driven analysis of specific genomic loci (e.g., AD risk loci such as
the APOE or TREM2 regions) where biological variation is preserved for downstream interpretation.

**Timing:** <1 min

In [None]:
sos run pipeline/snatacseq_preprocessing.ipynb region_filtering \
    --cwd output/atac_seq \
    --input_dir output/atac_seq/2_residuals \
    --output_dir output/atac_seq/3_region_filter \
    --celltype Ast Ex In Microglia Oligo OPC

In [None]:
# Custom regions
sos run pipeline/snatacseq_preprocessing.ipynb region_filtering \
    --cwd output/atac_seq \
    --input_dir output/atac_seq/2_residuals \
    --output_dir output/atac_seq \
    --celltype Ast Ex In Microglia Oligo OPC \
    --regions "chr1:1000000-2000000,chr5:50000000-51000000"

## Command interface

In [None]:
sos run pipeline/snatacseq_preprocessing.ipynb -h

## Setup and global parameters

In [None]:
[global]
# Output directory
parameter: cwd = path("output")
# For cluster jobs, number of commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 8
# Software container
parameter: container = ""

import re
parameter: entrypoint = (
    'micromamba run -a "" -n' + ' ' +
    re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])
) if container else ""

from sos.utils import expand_size
cwd = path(f'{cwd:a}')

## `sampleid_mapping`

In [None]:
[sampleid_mapping]
parameter: map_file   = str
parameter: input_dir  = str
parameter: output_dir = str
parameter: celltype   = ['Ast', 'Ex', 'In', 'Microglia', 'Oligo', 'OPC']
parameter: suffix     = ''   # e.g. '' for Xiong, '_50nuc' for Kellis

input:  [f'{input_dir}/metadata_{ct}{suffix}.csv' for ct in celltype]
output: [f'{output_dir}/metadata_{ct}{suffix}.csv' for ct in celltype]

python: expand = "${ }"

    import pandas as pd
    import gzip
    import os
    import subprocess
    import csv
    import numpy as np

    map_df = pd.read_csv("${map_file}")
    id_map = dict(zip(map_df["individualID"], map_df["sampleid"]))

    celltype   = ${celltype}
    input_dir  = "${input_dir}"
    # Write directly into --output_dir so files match the declared `output:` paths
    output_dir = "${output_dir}"
    suffix     = "${suffix}"

    os.makedirs(output_dir, exist_ok=True)

    def map_id(ind_id):
        return id_map.get(ind_id, ind_id)
    
    def format_value(val):
        """Format numeric values: remove .0 from integers, keep decimals"""
        if pd.isna(val):
            return ''
        if isinstance(val, (int, np.integer)):
            return str(val)
        if isinstance(val, (float, np.floating)):
            if float(val).is_integer():  # whole number; also safe for inf
                return str(int(val))
            else:
                return str(val)
        return str(val)

    # ── Process metadata CSV files ────────────────────────────────────────────
    for ct in celltype:
        fname    = f"metadata_{ct}{suffix}.csv"
        in_path  = os.path.join(input_dir,  fname)
        out_path = os.path.join(output_dir, fname)

        if not os.path.exists(in_path):
            print(f"Warning: Metadata file not found: {in_path}")
            continue

        meta = pd.read_csv(in_path)

        if "individualID" not in meta.columns:
            print(f"Warning: individualID column not found in {fname}")
            continue

        # Create or update sampleid column
        meta["sampleid"] = meta["individualID"].map(map_id)
        
        # Always reorder: sampleid FIRST, then individualID, then rest
        cols = meta.columns.tolist()
        cols.remove("sampleid")
        cols.remove("individualID")
        new_cols = ["sampleid", "individualID"] + cols
        meta = meta[new_cols]

        # Write CSV with custom formatting
        with open(out_path, 'w', newline='') as f:
            writer = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
            # Write header
            writer.writerow(meta.columns)
            # Write data rows with custom formatting
            for _, row in meta.iterrows():
                writer.writerow([format_value(val) for val in row])
        
        print(f"Processed metadata: {fname}")

    # ── Process count matrix .csv.gz files ───────────────────────────────────
    for ct in celltype:
        # Try both naming patterns: with and without underscore
        patterns = [
            f"pseudobulk_peaks_counts_{ct}{suffix}.csv.gz",  # Xiong pattern
            f"pseudobulk_peaks_counts{ct}{suffix}.csv.gz"    # Kellis pattern
        ]
        
        in_path = None
        for pattern in patterns:
            test_path = os.path.join(input_dir, pattern)
            if os.path.exists(test_path):
                in_path = test_path
                fname = pattern
                break
        
        if in_path is None:
            print(f"Warning: Count file not found for celltype {ct}")
            continue
        
        out_path = os.path.join(output_dir, fname)

        with gzip.open(in_path, "rt") as fh:
            header_line = fh.readline().rstrip("\n")

        col_names       = header_line.split(",")
        peak_id_col     = col_names[0]
        sample_cols     = col_names[1:]
        new_sample_cols = [map_id(s) for s in sample_cols]
        new_header      = ",".join([peak_id_col] + new_sample_cols)

        import tempfile
        temp_header = tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.txt')
        temp_header.write(new_header + "\n")
        temp_header.close()
        
        cmd = f"zcat {in_path} | tail -n +2 | cat {temp_header.name} - | gzip -6 > {out_path}"
        subprocess.run(cmd, shell=True, check=True)
        
        os.unlink(temp_header.name)
        print(f"Processed counts: {fname}")

    print("\nSample ID mapping completed!")

## `pseudobulk_qc`

In [None]:
[pseudobulk_qc]
parameter: celltype         = ['Ast','Ex','In','Microglia','Oligo','OPC']
parameter: input_dir        = str
parameter: output_dir       = str
parameter: covariates_file  = str
parameter: blacklist_file   = ''
parameter: include_bio      = "FALSE"   # "TRUE" or "FALSE"
parameter: batch_correction = "FALSE"   # "TRUE" or "FALSE"
parameter: batch_method     = "limma"   # "limma" or "combat"
parameter: min_count        = 5
parameter: min_total_count  = 15
parameter: min_prop         = 0.1
parameter: min_nuclei       = 20
parameter: suffix           = ''

input:  [f'{input_dir}/metadata_{ct}{suffix}.csv'                   for ct in celltype], \
        [f'{input_dir}/pseudobulk_peaks_counts_{ct}{suffix}.csv.gz' for ct in celltype]
output: [f'{output_dir}/{ct}/{ct}_residuals.txt'        for ct in celltype]

task: trunk_workers = 1, trunk_size = 1, walltime = '6:00:00', mem = '64G', cores = 4

R: expand = "${ }", stdout = f'{_output[0]:n}.stdout', stderr = f'{_output[0]:n}.stderr'

    library(edgeR)
    library(limma)
    library(data.table)
    library(GenomicRanges)
    if (as.logical("${batch_correction}") && "${batch_method}" == "combat") library(sva)

    # ── Helper: standardize metadata column names ─────────────────────────────
    rename_if_found <- function(dt, target, candidates) {
        found <- intersect(candidates, colnames(dt))[1]
        if (!is.na(found) && found != target) setnames(dt, found, target)
    }

    standardize_meta <- function(meta) {
        rename_if_found(meta, "n_nuclei",              c("n.nuclei","nNuclei","nuclei_count"))
        rename_if_found(meta, "med_nucleosome_signal", c("med.nucleosome_signal.ct","NucleosomeRatio","med_nucleosome_signal.ct"))
        rename_if_found(meta, "med_tss_enrich",        c("med.tss.enrich.ct","TSSEnrichment","med_tss_enrich.ct"))
        rename_if_found(meta, "med_n_tot_fragment",    c("med.n_tot_fragment.ct","med_n_tot_fragment.ct"))
        return(meta)
    }

    # ── Helper: blacklist filtering ───────────────────────────────────────────
    filter_blacklist <- function(mat, bed) {
        peaks <- data.table(id = rownames(mat))
        peaks[, c("chr","start","end") := tstrsplit(gsub("_","-",id), "-")]
        peaks[, `:=`(start = as.numeric(start), end = as.numeric(end))]
        bl <- fread(bed)[, 1:3]
        setnames(bl, c("chr","start","end"))
        bl[, `:=`(start = as.numeric(start), end = as.numeric(end))]
        gr1 <- GRanges(peaks$chr, IRanges(peaks$start, peaks$end))
        gr2 <- GRanges(bl$chr,    IRanges(bl$start,    bl$end))
        blacklisted <- unique(queryHits(findOverlaps(gr1, gr2)))
        if (length(blacklisted) > 0) {
            message("Blacklisted peaks removed: ", length(blacklisted))
            return(mat[-blacklisted, , drop=FALSE])
        }
        return(mat)
    }

    # ── Helper: predictOffset ─────────────────────────────────────────────────
    predictOffset <- function(fit) {
        D  <- fit$design
        Dm <- D
        for (col in colnames(D)) {
            if (col == "(Intercept)") next
            if (is.numeric(D[, col]) && !all(D[, col] %in% c(0, 1)))
                Dm[, col] <- median(D[, col], na.rm=TRUE)
            else
                Dm[, col] <- 0
        }
        B <- fit$coefficients
        B[is.na(B)] <- 0
        B %*% t(Dm)
    }

    # ── Main loop ─────────────────────────────────────────────────────────────
    cts <- c(${', '.join([f"'{x}'" for x in celltype])})

    for (ct in cts) {
        message("\n", paste(rep("=", 40), collapse=""))
        message("Processing: ", ct)
        message("Mode: ", ifelse(as.logical("${include_bio}"), "BIOvar", "noBIOvar"))
        message("Batch correction: ", ifelse(as.logical("${batch_correction}"), "${batch_method}", "none"))
        message(paste(rep("=", 40), collapse=""))

        # Write into --output_dir/<celltype> so files match the declared `output:` paths
        outdir <- file.path("${output_dir}", ct)
        dir.create(outdir, recursive=TRUE, showWarnings=FALSE)

        # ── 1. Load data ───────────────────────────────────────────────────
        meta       <- fread(sprintf("${input_dir}/metadata_%s${suffix}.csv", ct))
        counts_raw <- fread(sprintf("${input_dir}/pseudobulk_peaks_counts_%s${suffix}.csv.gz", ct))

        counts <- as.matrix(counts_raw[, -1, with=FALSE])
        rownames(counts) <- counts_raw[[1]]
        rm(counts_raw)
        n_original <- nrow(counts)
        message("Loaded: ", n_original, " peaks x ", ncol(counts), " samples")

        # ── 2. Standardize metadata columns ───────────────────────────────
        meta <- standardize_meta(meta)

        # ── 3. Identify sample ID column ──────────────────────────────────
        idcol <- intersect(c("sampleid","sampleID","individualID","projid"), colnames(meta))[1]
        if (is.na(idcol)) stop("Cannot find sample ID column in metadata.")

        # ── 4. Nuclei filter ──────────────────────────────────────────────
        if ("n_nuclei" %in% colnames(meta)) {
            meta <- meta[meta$n_nuclei > ${min_nuclei}]
            message("Samples after nuclei (>${min_nuclei}) filter: ", nrow(meta))
        }
        n_after_nuclei <- nrow(meta)

        # ── 5. Align samples ───────────────────────────────────────────────
        common <- intersect(meta[[idcol]], colnames(counts))
        if (length(common) == 0) stop("Zero sample overlap between metadata and count matrix.")
        meta   <- meta[match(common, meta[[idcol]])]
        counts <- counts[, common, drop=FALSE]
        message("Samples after alignment: ", length(common))

        # ── 6. Blacklist filtering ─────────────────────────────────────────
        if ("${blacklist_file}" != "" && file.exists("${blacklist_file}")) {
            counts <- filter_blacklist(counts, "${blacklist_file}")
            message("Peaks after blacklist filter: ", nrow(counts))
        } else {
            message("No blacklist file provided - skipping blacklist filtering.")
        }
        n_after_blacklist <- nrow(counts)

        # ── 7. Load and merge covariates ───────────────────────────────────
        covs      <- fread("${covariates_file}")
        id2       <- intersect(c("#id","id","projid","individualID"), colnames(covs))[1]
        bio_cols  <- if (as.logical("${include_bio}")) c("msex","age_death","pmi","study") else c("pmi","study")
        keep_cols <- c(id2, intersect(bio_cols, colnames(covs)))
        covs      <- covs[, ..keep_cols]
        meta      <- merge(meta, covs, by.x=idcol, by.y=id2, all.x=TRUE)

        # ── CRITICAL: re-order meta back to common sample order ────────────
        meta <- meta[match(common, meta[[idcol]])]

        # ── 8. Impute missing covariate values ─────────────────────────────
        for (col in intersect(c("pmi","age_death"), colnames(meta))) {
            if (any(is.na(meta[[col]]))) {
                message("Imputing missing values for: ", col)
                meta[[col]][is.na(meta[[col]])] <- median(meta[[col]], na.rm=TRUE)
            }
        }

        # ── 9. Compute technical metrics ──────────────────────────────────
        meta$log_n_nuclei           <- log1p(meta$n_nuclei)
        meta$log_med_n_tot_fragment <- log1p(meta$med_n_tot_fragment)
        meta$log_total_unique_peaks <- log1p(colSums(counts > 0))

        # ── 10. Select model variables ────────────────────────────────────
        tech_vars <- c("log_n_nuclei","med_nucleosome_signal","med_tss_enrich",
                       "log_med_n_tot_fragment","log_total_unique_peaks","pmi","study")
        bio_vars  <- c("msex","age_death")
        all_vars  <- if (as.logical("${include_bio}")) c(tech_vars, bio_vars) else tech_vars
        all_vars  <- intersect(all_vars, colnames(meta))
        message("Model terms: ", paste(all_vars, collapse=", "))

        # ── 11. Drop samples with NA in model variables ────────────────────
        keep_rows <- complete.cases(meta[, ..all_vars])
        meta      <- meta[keep_rows]
        counts    <- counts[, meta[[idcol]], drop=FALSE]
        message("Valid samples for modelling: ", nrow(meta))

        # ── 12. Expression filtering ───────────────────────────────────────
        dge <- DGEList(counts=counts, samples=meta)
        dge$samples$group <- factor(rep("all", ncol(dge)))
        message("Peaks before expression filter: ", nrow(dge))

        keep <- filterByExpr(dge, group=dge$samples$group,
                             min.count=${min_count},
                             min.total.count=${min_total_count},
                             min.prop=${min_prop})
        dge <- dge[keep,, keep.lib.sizes=FALSE]
        n_after_expr <- nrow(dge)
        message("Peaks after expression filter: ", n_after_expr)

        # Save filtered raw counts
        write.table(dge$counts,
                    file.path(outdir, paste0(ct, "_filtered_raw_counts.txt")),
                    sep="\t", quote=FALSE, col.names=NA)

        # ── 13. TMM normalization ──────────────────────────────────────────
        dge <- calcNormFactors(dge, method="TMM")

        # ── 14. Optional batch correction ─────────────────────────────────
        if (as.logical("${batch_correction}") && "sequencingBatch" %in% colnames(dge$samples)) {
            batches       <- dge$samples$sequencingBatch
            batch_counts  <- table(batches)
            valid_batches <- names(batch_counts[batch_counts > 1])
            keep_bc       <- batches %in% valid_batches
            dge           <- dge[, keep_bc, keep.lib.sizes=FALSE]
            batches       <- batches[keep_bc]
            message("Samples after singleton batch removal: ", ncol(dge))

            if ("${batch_method}" == "combat") {
                dge$counts <- ComBat_seq(as.matrix(dge$counts), batch=batches)
                message("ComBat-seq batch correction applied.")
            } else {
                logCPM     <- cpm(dge, log=TRUE, prior.count=1)
                logCPM     <- removeBatchEffect(logCPM, batch=factor(batches))
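                # Convert adjusted log-CPM back to the count scale using the mean library size
                dge$counts <- round(pmax(2^logCPM * mean(dge$samples$lib.size) / 1e6, 0))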
                message("limma removeBatchEffect applied.")
            }
        }

        # ── 15. Add sequencingBatch and Library to model if multi-level ───
        # Insert after technical vars but before pmi/study to match original order
        tech_only <- c("log_n_nuclei","med_nucleosome_signal","med_tss_enrich",
                       "log_med_n_tot_fragment","log_total_unique_peaks")
        other_vars <- setdiff(all_vars, tech_only)  # pmi, study, msex, age_death

        batch_vars <- c()
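        # sequencingBatch is left out of the model when batch correction was requested
        if (!as.logical("${batch_correction}") &&
            "sequencingBatch" %in% colnames(dge$samples) &&
            length(unique(dge$samples$sequencingBatch)) > 1) {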
            dge$samples$sequencingBatch_factor <- factor(dge$samples$sequencingBatch)
            batch_vars <- c(batch_vars, "sequencingBatch_factor")
        }

        if ("Library" %in% colnames(dge$samples) &&
            length(unique(dge$samples$Library)) > 1) {
            dge$samples$Library_factor <- factor(dge$samples$Library)
            batch_vars <- c(batch_vars, "Library_factor")
        }

        # Final order: technical + batch + other (pmi, study, bio)
        all_vars <- c(tech_only, batch_vars, other_vars)
        all_vars <- intersect(all_vars, c(colnames(dge$samples), colnames(meta)))

        # ── 16. Build design matrix ────────────────────────────────────────
        form   <- as.formula(paste("~", paste(all_vars, collapse=" + ")))
        design <- model.matrix(form, data=dge$samples)
        message("Formula: ", deparse(form))

        if (!is.fullrank(design)) {
            message("Design not full rank - trimming.")
            qr_d   <- qr(design)
            design <- design[, qr_d$pivot[seq_len(qr_d$rank)], drop=FALSE]
        }
        message("Design matrix: ", nrow(design), " x ", ncol(design))

        # ── 17. Voom + lmFit + eBayes ─────────────────────────────────────
        v   <- voom(dge, design, plot=FALSE)
        fit <- lmFit(v, design)
        fit <- eBayes(fit)

        # ── 18. Offset + residuals ─────────────────────────────────────────
        off   <- predictOffset(fit)
        res   <- residuals(fit, v)
        final <- off + res

        # ── 19. Save outputs ───────────────────────────────────────────────
        write.table(final,
                    file.path(outdir, paste0(ct, "_residuals.txt")),
                    sep="\t", quote=FALSE, col.names=NA)

        saveRDS(list(
            dge              = dge,
            offset           = off,
            residuals        = res,
            final_data       = final,
            valid_samples    = colnames(dge),
            design           = design,
            fit              = fit,
            model            = form,
            mode             = ifelse(as.logical("${include_bio}"), "BIOvar", "noBIOvar"),
            batch_correction = as.logical("${batch_correction}"),
            batch_method     = ifelse(as.logical("${batch_correction}"), "${batch_method}", "none")
        ), file.path(outdir, paste0(ct, "_results.rds")))

        # ── 20. Summary report ─────────────────────────────────────────────
        sink(file.path(outdir, paste0(ct, "_summary.txt")))
        cat("*** Processing Summary for", ct, "***\n\n")

        cat("=== Analysis Mode ===\n")
        cat("Mode:", ifelse(as.logical("${include_bio}"), "BIOvar", "noBIOvar"), "\n")
        cat("Batch correction:", ifelse(as.logical("${batch_correction}"), "${batch_method}", "none"), "\n")
        cat("Model formula:", deparse(form), "\n\n")

        cat("=== Filtering Parameters ===\n")
        cat("Nuclei cutoff: >", ${min_nuclei}, "\n")
        cat("Blacklist filtering:", ifelse("${blacklist_file}" != "", "TRUE", "FALSE"), "\n")
        if ("${blacklist_file}" != "") cat("Blacklist file:", "${blacklist_file}", "\n")
        cat("min_count:", ${min_count}, "\n")
        cat("min_total_count:", ${min_total_count}, "\n")
        cat("min_prop:", ${min_prop}, "\n\n")

        cat("=== Peak Counts ===\n")
        cat("Original peak count:", n_original, "\n")
        cat("Peaks after blacklist filtering:", n_after_blacklist, "\n")
        cat("Peaks after expression filtering:", n_after_expr, "\n\n")

        cat("=== Sample Counts ===\n")
        cat("Number of samples after nuclei (>", ${min_nuclei}, ") filtering:", n_after_nuclei, "\n")
        cat("Number of samples in final model:", ncol(final), "\n\n")

        cat("=== Technical Variables Used ===\n")
        # Loop variable is named tech_var so it does not overwrite the voom object `v`
        for (tech_var in intersect(tech_only, all_vars))
            cat("-", tech_var, "\n")
        if ("sequencingBatch_factor" %in% all_vars) cat("- sequencingBatch: Sequencing batch ID\n")
        if ("Library_factor"         %in% all_vars) cat("- Library: Library ID\n")

        cat("\n=== Biological Variables Used ===\n")
        if (as.logical("${include_bio}")) {
            for (bio_var in intersect(c("msex","age_death"), all_vars))
                cat("-", bio_var, "\n")
        } else {
            cat("None (noBIOvar mode - biological variation preserved)\n")
        }

        cat("\n=== Other Variables Used ===\n")
        if ("pmi"   %in% all_vars) cat("- pmi: Post-mortem interval\n")
        if ("study" %in% all_vars) cat("- study: Study cohort\n")
        sink()

        # ── 21. Variable explanation report ───────────────────────────────
        sink(file.path(outdir, paste0(ct, "_variable_explanation.txt")))
        cat("# ATAC-seq Technical Variables Explanation\n\n")
        cat("## Why Log Transformation?\n")
        cat("Log transformation is applied to certain variables for several reasons:\n")
        cat("1. To make the distribution more symmetric and closer to normal\n")
        cat("2. To stabilize variance across the range of values\n")
        cat("3. To match the scale of voom-transformed peak counts, which are on log2-CPM scale\n")
        cat("4. To be consistent with the approach used in related studies like haQTL\n\n")
        cat("## Variables and Their Meanings\n\n")
        cat("### Technical Variables\n")
        cat("- n_nuclei: Number of nuclei that contributed to this pseudobulk sample\n")
        cat("  * Filtered to include only samples with >", ${min_nuclei}, "nuclei\n")
        cat("  * Log-transformed because count data typically has a right-skewed distribution\n\n")
        cat("- med_n_tot_fragment: Median number of total fragments per cell\n")
        cat("  * Represents sequencing depth\n")
        cat("  * Log-transformed because sequencing depth typically has exponential effects\n\n")
        cat("- total_unique_peaks: Number of unique peaks detected in each sample\n")
        cat("  * Log-transformed similar to 'TotalNumPeaks' in haQTL pipeline\n\n")
        cat("- med_nucleosome_signal: Median nucleosome signal\n")
        cat("  * Measures the degree of nucleosome positioning\n")
        cat("  * Not log-transformed as it is already a ratio/normalized metric\n\n")
        cat("- med_tss_enrich: Median transcription start site enrichment score\n")
        cat("  * Indicates the quality of the ATAC-seq data\n")
        cat("  * Not log-transformed as it is already a ratio/normalized metric\n\n")
        if ("sequencingBatch_factor" %in% all_vars)
            cat("- sequencingBatch: Sequencing batch ID\n  * Treated as a factor to account for batch effects\n\n")
        if ("Library_factor" %in% all_vars)
            cat("- Library: Library preparation batch ID\n  * Treated as a factor to account for library preparation effects\n\n")
        if (as.logical("${include_bio}")) {
            cat("### Biological Variables\n")
            cat("- msex: Sex (male=1, female=0)\n")
            cat("- age_death: Age at death\n\n")
        }
        cat("### Other Variables\n")
        cat("- pmi: Post-mortem interval (time between death and tissue collection)\n")
        cat("- study: Study cohort (ROSMAP, MAP, ROS)\n\n")
        cat("## Relationship to voom Transformation\n")
        cat("The voom transformation converts count data to log2-CPM (counts per million) values ")
        cat("and estimates the mean-variance relationship. By log-transforming certain technical ")
        cat("covariates, we ensure they are on a similar scale to the transformed accessibility data, ")
        cat("which can improve the fit of the linear model used for removing unwanted variation.\n")
        sink()

        message("Completed: ", ct, " -> ", outdir)
        message("  Peaks: ", nrow(final), " | Samples: ", ncol(final))
    }
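
Step 18 adds an offset back to the residuals so that the adjusted values stay on the log2-CPM scale instead of being centered at zero. The sketch below illustrates the idea in plain Python for one peak and one covariate. It assumes the offset is the fitted value at the average covariate level; `predictOffset` is defined elsewhere in this pipeline and may differ in detail. The function `residualize` and the toy values are hypothetical.

```python
from statistics import mean

def residualize(y, x):
    """Regress y on one covariate x; return (offset, residuals, adjusted)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("covariate has no variance")
    # Ordinary least squares slope and intercept
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    # Assumed offset definition: fitted value at the average covariate level
    offset = intercept + slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    adjusted = [offset + r for r in residuals]
    return offset, residuals, adjusted

# Toy data: log2-CPM of one peak in five samples, plus one technical covariate
log_cpm = [5.0, 5.5, 6.1, 6.4, 7.0]
log_n_nuclei = [2.0, 2.5, 3.0, 3.5, 4.0]
offset, residuals, adjusted = residualize(log_cpm, log_n_nuclei)
print(round(offset, 6), [round(a, 3) for a in adjusted])
```

With a single covariate the offset equals the mean of the input values, so the adjusted values keep the peak's overall level while the covariate trend is removed.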

## `phenotype_formatting`

[phenotype_formatting]
parameter: celltype    = ['Ast','Ex','In','Mic','Oligo','OPC']
parameter: input_dir   = str
parameter: output_dir  = str

input:  [f'{input_dir}/{ct}/{ct}_residuals.txt' for ct in celltype]
output: [f'{output_dir}/3_pheno_reformat/{ct}_snatac_phenotype.bed.gz' for ct in celltype]

task: trunk_workers = 1, trunk_size = 1, walltime = '2:00:00', mem = '16G', cores = 2

python: expand = "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'

    import os
    import subprocess
    import pandas as pd

    celltypes  = ${celltype}
    input_dir  = "${input_dir}"
    output_dir = "${output_dir}"

    def read_residuals(path):
        # R's write.table(col.names=NA) writes a blank leading header field for the
        # row names; a header with one field fewer than the data rows is also handled.
        with open(path) as fh:
            col_names = fh.readline().rstrip("\n").split("\t")
        df = pd.read_csv(path, sep="\t", header=None, skiprows=1)
        peak_ids = df.iloc[:, 0].values
        df       = df.iloc[:, 1:]
        # Drop the row-name header field only when it is present
        df.columns = col_names if df.shape[1] == len(col_names) else col_names[1:]
        return peak_ids, df

    def to_midpoint_bed(peak_ids, residuals):
        # Split from the right so a contig name containing "-" still yields three fields
        parts  = pd.Series(peak_ids).str.rsplit("-", n=2, expand=True)
        chrs   = parts[0].values
        starts = parts[1].astype(int).values
        ends   = parts[2].astype(int).values
        mids   = ((starts + ends) // 2).astype(int)
        bed = pd.DataFrame({
            "#chr":  chrs,
            "start": mids,
            "end":   mids + 1,
            "ID":    peak_ids
        })
        bed = pd.concat([bed, residuals.reset_index(drop=True)], axis=1)
        return bed.sort_values(["#chr", "start"]).reset_index(drop=True)

    def run_cmd(cmd, label):
        r = subprocess.run(cmd, capture_output=True)
        if r.returncode != 0:
            print(f"WARNING: {label} failed: {r.stderr.decode()}")
        else:
            print(f"{label}: OK")

    for ct in celltypes:
        print(f"\n{'='*40}\nPhenotype Formatting: {ct}\n{'='*40}")

        out_dir = os.path.join(output_dir, "3_pheno_reformat")
        os.makedirs(out_dir, exist_ok=True)

        res_path = os.path.join(input_dir, ct, f"{ct}_residuals.txt")
        if not os.path.exists(res_path):
            print(f"WARNING: {res_path} not found, skipping.")
            continue

        peak_ids, residuals = read_residuals(res_path)
        print(f"Loaded {len(peak_ids)} peaks x {residuals.shape[1]} samples")

        bed     = to_midpoint_bed(peak_ids, residuals)
        out_bed = os.path.join(out_dir, f"{ct}_snatac_phenotype.bed")
        bed.to_csv(out_bed, sep="\t", index=False, float_format="%.15f")
        print(f"Written: {out_bed}")

        run_cmd(["bgzip", "-f", out_bed],                "bgzip")
        run_cmd(["tabix", "-p", "bed", f"{out_bed}.gz"], "tabix")

        print(f"Completed: {ct} -> {out_dir}")
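
As an illustration of the midpoint conversion, the sketch below applies the same arithmetic to three toy peak IDs in plain Python (no pandas). The helper `peak_to_midpoint_bed` and the peak IDs are hypothetical; the `chr-start-end` ID format is taken from the code above.

```python
def peak_to_midpoint_bed(peak_id):
    """Turn 'chr-start-end' into a 1-bp BED record centered on the peak."""
    # Split from the right so a contig name containing '-' would still parse
    chrom, start, end = peak_id.rsplit("-", 2)
    mid = (int(start) + int(end)) // 2
    return chrom, mid, mid + 1, peak_id

peaks = ["chr2-500-1001", "chr1-100-300", "chr1-10-21"]

# Sort by chromosome, then position, as required before bgzip + tabix indexing
rows = sorted((peak_to_midpoint_bed(p) for p in peaks), key=lambda r: (r[0], r[1]))

for row in rows:
    print("\t".join(map(str, row)))
```

Each peak collapses to a single-base interval `[mid, mid + 1)`, and the original peak ID is retained in the fourth column so the full coordinates remain recoverable.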

## `region_filtering`

[region_filtering]
parameter: celltype    = ['Ast','Ex','In','Mic','Oligo','OPC']
parameter: input_dir   = str
parameter: output_dir  = str
parameter: regions     = "chr7:28000000-28300000,chr11:85050000-86200000"

input:  [f'{input_dir}/{ct}/{ct}_filtered_raw_counts.txt' for ct in celltype]
output: [f'{output_dir}/3_region_filter/{ct}_filtered_regions_of_interest.txt' for ct in celltype]

task: trunk_workers = 1, trunk_size = 1, walltime = '1:00:00', mem = '16G', cores = 2

python: expand = "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'

    import os
    import pandas as pd

    celltypes  = ${celltype}
    input_dir  = "${input_dir}"
    output_dir = "${output_dir}"

    def parse_regions(region_str):
        result = []
        for r in region_str.split(","):
            chrom, coords = r.strip().split(":")
            start, end    = coords.split("-")
            result.append({"chr": chrom, "start": int(start), "end": int(end)})
        return result

    regions = parse_regions("${regions}")

    def parse_peak_ids(peak_ids):
        # Split from the right so a contig name containing "-" still yields three fields
        parts = pd.Series(peak_ids).str.rsplit("-", n=2, expand=True)
        return pd.DataFrame({
            "chr":   parts[0].values,
            "start": parts[1].astype(int).values,
            "end":   parts[2].astype(int).values
        })

    def overlaps_region(chr_col, start_col, end_col, reg):
        return (
            (chr_col   == reg["chr"]) &
            (start_col <  reg["end"]) &
            (end_col   >  reg["start"])
        )

    for ct in celltypes:
        print(f"\n{'='*40}\nRegion Filtering: {ct}\n{'='*40}")

        reg_dir = os.path.join(output_dir, "3_region_filter")
        os.makedirs(reg_dir, exist_ok=True)

        counts_path = os.path.join(input_dir, ct, f"{ct}_filtered_raw_counts.txt")
        if not os.path.exists(counts_path):
            print(f"WARNING: {counts_path} not found, skipping.")
            continue

        df = pd.read_csv(counts_path, sep="\t", index_col=0)
        df.index.name = "peak_id"
        df = df.reset_index()

        coords          = parse_peak_ids(df["peak_id"].values)
        df["chr"]       = coords["chr"].values
        df["start"]     = coords["start"].values
        df["end"]       = coords["end"].values
        df["peakwidth"] = df["end"] - df["start"]
        df["midpoint"]  = (df["start"] + df["end"]) // 2

        # Filter to regions of interest
        mask = pd.Series(False, index=df.index)
        for reg in regions:
            mask |= overlaps_region(df["chr"], df["start"], df["end"], reg)

        region_df = df[mask].copy()
        print(f"Peaks in regions of interest: {len(region_df)}")

        # Save full filtered data
        full_out = os.path.join(reg_dir, f"{ct}_filtered_regions_of_interest.txt")
        region_df.to_csv(full_out, sep="\t", index=False)
        print(f"Saved: {full_out}")

        # Save summary
        meta_cols  = ["peak_id","chr","start","end","peakwidth","midpoint"]
        count_cols = [c for c in region_df.columns if c not in meta_cols]
        count_mat  = region_df[count_cols].apply(pd.to_numeric, errors="coerce")

        summary = region_df[meta_cols].copy()
        summary["total_count"]    = count_mat.sum(axis=1).values
        summary["weighted_count"] = (summary["total_count"] / summary["peakwidth"]).values

        summary_out = os.path.join(reg_dir, f"{ct}_filtered_regions_of_interest_summary.txt")
        summary.to_csv(summary_out, sep="\t", index=False)
        print(f"Saved: {summary_out}")

        print(f"Completed: {ct} -> {reg_dir}")
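
The overlap test in `overlaps_region` uses strict inequalities, so a peak that only touches a region boundary is excluded. The sketch below checks this behavior on toy peaks in plain Python; the helpers `parse_region` and `overlaps` and the peak IDs are hypothetical, while the region string is one of the defaults above.

```python
def parse_region(region):
    """Parse 'chr:start-end' into (chrom, start, end)."""
    chrom, coords = region.split(":")
    start, end = coords.split("-")
    return chrom, int(start), int(end)

def overlaps(peak_id, region):
    """Same strict-inequality test as overlaps_region, for a single peak."""
    chrom, start, end = peak_id.rsplit("-", 2)
    r_chrom, r_start, r_end = region
    return chrom == r_chrom and int(start) < r_end and int(end) > r_start

region = parse_region("chr7:28000000-28300000")
peaks = [
    "chr7-27999500-28000001",   # extends one base past the region start: kept
    "chr7-27999500-28000000",   # ends exactly at the region start: dropped
    "chr7-28300000-28300500",   # starts exactly at the region end: dropped
    "chr11-28100000-28100500",  # matching coordinates, different chromosome
]
kept = [p for p in peaks if overlaps(p, region)]
print(kept)
```

Whether boundary-touching peaks should count depends on the coordinate convention of the peak IDs, so it is worth confirming this against the upstream peak caller before relying on edge cases.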