# Short-Read Amplicon Sequencing Pipeline (Dox-Enabled)

This notebook is a single entry point for running the analysis pipeline on paired-end short-read amplicon sequencing data. It focuses on SNP-based allelic quantification (B6 vs. Cast) and includes a built-in check for reference bias: reads are aligned to both the original and an N-masked reference.

### Required Environment
Run this notebook in the `bioinfo` conda environment. Activate it before launching Jupyter so that the kernel and the `!` shell commands both use it:
```bash
conda activate bioinfo
```
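
As a quick sanity check, the sketch below looks for the command-line tools this notebook relies on. The tool list is an assumption inferred from the steps that follow (fastp, bowtie2, samtools, MultiQC); adjust it to match what the `bioinfo` environment actually provides.

```python
import shutil

# Assumed tool list, inferred from the steps in this notebook.
REQUIRED_TOOLS = ["fastp", "bowtie2", "samtools", "multiqc"]


def find_missing_tools(tools):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools(REQUIRED_TOOLS)
if missing:
    print("Missing from PATH: " + ", ".join(missing))
else:
    print("All required tools found.")
```

If anything is reported missing, the kernel is probably not running inside the intended environment.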

## Configuration

Set the global paths and parameters below. These will be used throughout the pipeline.

In [None]:
import os

# Input Data
DATA_DIR = "/Volumes/guttman/users/gmgao/Data_seq/20260207-DoxSeqAllRepsMultiplexed/"
RESULTS_DIR = os.path.join(DATA_DIR, "results")

# Computation
THREADS = 8

# Create results directory if it doesn't exist
os.makedirs(RESULTS_DIR, exist_ok=True)
print(f"Results will be saved to: {RESULTS_DIR}")

---

## Step 0: Reference Preparation

**What it does:** Generates an N-masked reference sequence for the *Xist* amplicon and builds bowtie2 indices for both the original and masked sequences. N-masking replaces the base at each known SNP position with `N`, so reads carrying either the B6 or the Cast allele incur the same alignment penalty at that position. This reduces reference bias in alignment.

### Outputs:
- `results/references/`: FASTA files and bowtie2 indices.
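
To make the masking idea concrete, here is a minimal sketch of N-masking a sequence. It is not the code in `00_prepare_references.py`; the function name, the 0-based position convention, and the toy sequence are assumptions for illustration only.

```python
def mask_snps(sequence, snp_positions):
    """Return a copy of `sequence` with 'N' at each 0-based SNP position."""
    bases = list(sequence)
    for pos in snp_positions:
        if not 0 <= pos < len(bases):
            raise ValueError(f"SNP position {pos} is outside the sequence")
        bases[pos] = "N"
    return "".join(bases)


# Toy example: mask positions 2 and 7 of a 10-base sequence.
print(mask_snps("ACGTACGTAC", [2, 7]))  # prints ACNTACGNAC
```

Because the masked base matches neither allele, a B6 read and a Cast read are penalized identically at that site.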

In [None]:
!python 00_prepare_references.py

## Step 1: Quality Control (fastp)

**What it does:** Runs `fastp` on all paired-end FASTQ files to calculate quality metrics (Q30 rate, GC content, duplication rate). The script writes a detailed HTML report for each sample and a consolidated CSV summary.

### Outputs:
- `results/qc/fastp/`: HTML and JSON reports.
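
The per-sample JSON reports can also be summarized directly. The sketch below assumes the usual fastp JSON layout (a `summary` block with `before_filtering` and `after_filtering` entries, plus a top-level `duplication` block) and assumes one `*.json` file per sample; the function names are hypothetical and this is not necessarily how `01_fastq_qc_fastp.py` builds its CSV.

```python
import json
from pathlib import Path


def summarize_fastp_report(report):
    """Pull a few headline metrics out of one parsed fastp JSON report."""
    summary = report["summary"]
    return {
        "q30_rate_before": summary["before_filtering"]["q30_rate"],
        "q30_rate_after": summary["after_filtering"]["q30_rate"],
        "gc_content_after": summary["after_filtering"]["gc_content"],
        "duplication_rate": report["duplication"]["rate"],
    }


def summarize_fastp_dir(fastp_dir):
    """Summarize every *.json report in a directory (assumed one per sample)."""
    rows = []
    for path in sorted(Path(fastp_dir).glob("*.json")):
        with open(path) as handle:
            row = summarize_fastp_report(json.load(handle))
        row["sample"] = path.stem
        rows.append(row)
    return rows
```

Calling `summarize_fastp_dir(os.path.join(RESULTS_DIR, "qc", "fastp"))` would return a list of dictionaries that can be passed to `pandas.DataFrame` for inspection.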

In [None]:
!python 01_fastq_qc_fastp.py \
  --data_dir {DATA_DIR} \
  --results_dir {RESULTS_DIR}

## Step 2: Bowtie2 Alignment (Dual-Mode)

**What it does:** Aligns reads to the amplicon reference using `bowtie2`. The alignment is run twice: once against the original reference and once against the N-masked reference. Comparing the two runs shows whether aligning to a single-strain reference introduces technical bias into the quantification.

### Key Parameters:
- `--local`: Uses local alignment, which soft-clips read ends that do not match the reference, to tolerate residual adapter sequence or structural variation.
- `--reference_mode`: Selects the `original` or the `masked` index.
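
For orientation, the sketch below shows roughly what a paired-end local-mode `bowtie2` call looks like when assembled in Python. It uses only standard bowtie2 options (`--local`, `-p`, `-x`, `-1`, `-2`, `-S`); the function name and all file paths are hypothetical, and the actual options used inside `02_align_bowtie2.py` may differ.

```python
def build_bowtie2_command(index_prefix, fastq_r1, fastq_r2, sam_out, threads=8):
    """Assemble a paired-end, local-mode bowtie2 command as an argument list."""
    return [
        "bowtie2",
        "--local",            # local alignment: read ends may be soft-clipped
        "-p", str(threads),   # number of alignment threads
        "-x", index_prefix,   # index prefix (original or masked reference)
        "-1", fastq_r1,       # mate 1 FASTQ
        "-2", fastq_r2,       # mate 2 FASTQ
        "-S", sam_out,        # SAM output file
    ]


# Hypothetical paths, for illustration only.
cmd = build_bowtie2_command(
    "references/masked", "sample_R1.fastq.gz", "sample_R2.fastq.gz", "sample.sam"
)
print(" ".join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it; keeping the command as a list avoids shell-quoting problems with unusual file names.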

In [None]:
# Align to Original Reference
!python 02_align_bowtie2.py \
  --data_dir {DATA_DIR} \
  --results_dir {RESULTS_DIR} \
  --reference_mode original \
  --threads {THREADS}

# Align to Masked Reference
!python 02_align_bowtie2.py \
  --data_dir {DATA_DIR} \
  --results_dir {RESULTS_DIR} \
  --reference_mode masked \
  --threads {THREADS}

## Step 3: Alignment QC (samtools)

**What it does:** Calculates mapping rates, depth of coverage, and insert size distributions using `samtools flagstat` and `samtools stats`.
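
As an illustration of how a mapping rate is derived, the sketch below parses the plain-text output of `samtools flagstat`. It assumes the default text format, in which each line starts with `<passed> + <failed> <label>`, and counts QC-passed reads only; the function names are hypothetical and not taken from `03_alignment_qc.py`.

```python
import re

# Matches e.g. "950 + 0 mapped (95.00% : N/A)" but not "950 + 0 primary mapped".
_FLAGSTAT_LINE = re.compile(r"(\d+) \+ (\d+) (in total|mapped|properly paired)\b")


def parse_flagstat(text):
    """Return QC-passed counts for selected flagstat categories."""
    counts = {}
    for line in text.splitlines():
        match = _FLAGSTAT_LINE.match(line)
        if match:
            counts[match.group(3)] = int(match.group(1))
    return counts


def mapping_rate(counts):
    """Fraction of records that mapped; 0.0 when there are no records."""
    total = counts.get("in total", 0)
    return counts.get("mapped", 0) / total if total else 0.0


example = """1000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
950 + 0 mapped (95.00% : N/A)
1000 + 0 paired in sequencing
900 + 0 properly paired (90.00% : N/A)"""
print(mapping_rate(parse_flagstat(example)))  # prints 0.95
```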

In [None]:
!python 03_alignment_qc.py \
  --results_dir {RESULTS_DIR}

## Step 4: Allele Quantification

**What it does:** This is the core quantification step. For every read, it extracts the base at each of the four accessible SNP positions, then assigns the read to the B6 or Cast allele based on how many positions match each strain. The implementation uses `pysam` and is designed for high-depth amplicon data, so that every read is accounted for.

### Outputs:
- `results/quantification/`: Per-read CSVs and the `allele_quantification_summary.csv`.
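
The assignment logic can be sketched as a match count per strain. Everything specific below is an assumption for illustration: the SNP positions and bases are invented, and the rule for ties (labelled `ambiguous`) may not match what `04_snp_calling_quantification.py` does.

```python
# Hypothetical SNP table: 0-based reference position -> (B6 base, Cast base).
EXAMPLE_SNPS = {10: ("A", "G"), 25: ("C", "T"), 40: ("G", "A"), 55: ("T", "C")}


def assign_allele(observed_bases, snps=EXAMPLE_SNPS):
    """Assign a read to B6, Cast, or ambiguous from its bases at SNP positions.

    `observed_bases` maps reference position -> base seen in the read;
    positions the read does not cover are simply absent.
    """
    b6_matches = sum(
        1 for pos, base in observed_bases.items()
        if pos in snps and base == snps[pos][0]
    )
    cast_matches = sum(
        1 for pos, base in observed_bases.items()
        if pos in snps and base == snps[pos][1]
    )
    if b6_matches > cast_matches:
        return "B6"
    if cast_matches > b6_matches:
        return "Cast"
    return "ambiguous"  # assumed handling for ties and uninformative reads


print(assign_allele({10: "A", 25: "C", 40: "G", 55: "C"}))  # prints B6
```

A read with three B6 matches and one Cast match is called B6 here; a stricter pipeline might instead require all covered SNPs to agree.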

In [None]:
!python 04_snp_calling_quantification.py

## Step 6: MultiQC Consolidation

**What it does:** Aggregates all QC metrics from fastp and samtools into a single, interactive dashboard.

In [None]:
!python 06_consolidate_qc.py

---

## Final Report & Visualization

- **Final Report**: See `Final_Pipeline_Summary.md` for a technical overview of results.
- **Visualization**: Use **[visualization.ipynb](visualization.ipynb)** to generate stoichiometry heatmaps and biological aggregation plots (Nanopore Style).