# Nanopore Amplicon Sequencing Pipeline

This notebook provides a consolidated interface for running the six-step analysis pipeline for Nanopore amplicon sequencing. It is designed for SNP-based allelic quantification (e.g., B6 vs. Cast strains).

### Required Environment
Ensure you are running this in the `bioinfo` conda environment:
```bash
conda activate bioinfo
```

## Configuration

Set the global paths and parameters below. These will be used throughout the pipeline.

In [1]:
import os

# Input Data
DATA_DIR = (
    "/Volumes/guttman/users/gmgao/Data_seq/Consolidated-2026Feb-DoxSeqAmpliconNanopore"
)
RESULTS_DIR = "/Volumes/guttman/users/gmgao/Data_seq/Consolidated-2026Feb-DoxSeqAmpliconNanopore/results/"

# Reference Files
GENOME_FA = "/Volumes/guttman/genomes/mm10/fasta/mm10.fa"
VCF_FILE = (
    "/Volumes/guttman/genomes/mm10/variants/mgp.v5.merged.snps_all.dbSNP142.vcf.gz"
)

# SNP Extraction Parameters (B6/Cast example)
REGION = "chrX:103460373-103483233"
B6_NAME = "C57BL_6NJ"
CAST_NAME = "CAST_EiJ"
SNP_FILE = os.path.join(RESULTS_DIR, "xist_snps.txt")

# Computation
THREADS = 8

# Create results directory if it doesn't exist
os.makedirs(RESULTS_DIR, exist_ok=True)
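
Because every step depends on these paths, a quick pre-flight check can catch an unmounted volume or a mistyped path before a long run starts. The following is a minimal sketch; the helper `missing_inputs` is hypothetical and not part of the pipeline scripts, and the demonstration uses a temporary directory rather than the real inputs.

```python
import os
import tempfile


def missing_inputs(named_paths):
    """Return the names whose paths do not exist on disk."""
    return [name for name, path in named_paths.items() if not os.path.exists(path)]


# Demonstration: a temporary directory stands in for the real input files.
with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "genome.fa")
    open(present, "w").close()  # create an empty placeholder file
    checked = {
        "GENOME_FA": present,
        "VCF_FILE": os.path.join(tmp, "absent.vcf.gz"),  # deliberately missing
    }
    missing = missing_inputs(checked)
    print("Missing inputs:", missing)
```

In the notebook itself, the same helper could be called with `{"DATA_DIR": DATA_DIR, "GENOME_FA": GENOME_FA, "VCF_FILE": VCF_FILE}`.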

---

## Step 1: Quality Control

**What it does:** Calculates read length statistics (N50, mean, median) and quality scores for all FASTQ files in the data directory. It also generates length distribution histograms for each sample. Results are saved to a `qc/` subfolder inside your results directory.

### Arguments:
- `--data_dir`: Directory containing raw FASTQ files.
- `--results_dir`: Base results directory where the `qc/` folder will be created.
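
For reference, the length statistics named above can be computed from a plain list of read lengths. This is a minimal sketch of the standard N50 definition (the length of the shortest read in the smallest set of longest reads that covers at least half of all bases); the function `read_length_stats` and the example lengths are illustrative and are not taken from `01_fastq_quality_metrics.py`.

```python
from statistics import mean, median


def read_length_stats(lengths):
    """Return N50, mean, and median for a list of read lengths."""
    if not lengths:
        raise ValueError("at least one read length is required")
    ordered = sorted(lengths, reverse=True)  # longest reads first
    half_total = sum(ordered) / 2
    running = 0
    n50 = ordered[-1]
    for length in ordered:
        running += length
        if running >= half_total:
            n50 = length  # first read that brings the sum to half of all bases
            break
    return {"n50": n50, "mean": mean(ordered), "median": median(ordered)}


# Made-up lengths, for illustration only.
stats = read_length_stats([100, 200, 300, 400, 500])
print(stats)
```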

In [2]:
!python 01_fastq_quality_metrics.py \
  --data_dir {DATA_DIR} \
  --results_dir {RESULTS_DIR}

Saving QC results to: /Volumes/guttman/users/gmgao/Data_seq/Consolidated-2026Feb-DoxSeqAmpliconNanopore/results/qc


## Step 2: Genome Alignment

**What it does:** Aligns your reads to the **full reference genome** using `minimap2` with the `map-ont` preset. Aligning against the whole genome (rather than an amplicon-only reference) lets off-target reads map to their true locus instead of being forced onto the amplicon, which reduces bias in your on-target quantification.

### Arguments:
- `--data_dir`: Directory containing raw FASTQ files.
- `--results_dir`: Base results directory where the `aligned/` folder will be created.
- `--genome`: Path to the reference genome FASTA file.
- `--threads`: Number of CPU threads to use for alignment and sorting.
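
The internals of `02_align_to_genome.py` are not shown in this notebook, so the following is only a sketch of what a typical `minimap2` plus `samtools` invocation for this step looks like. The helper `build_alignment_commands` and the file names are hypothetical; the code only builds and prints the command lines and does not run them.

```python
import shlex


def build_alignment_commands(fastq, genome, out_bam, threads=8):
    """Build argument lists for align, sort, and index (assumed workflow)."""
    # -a emits SAM; -x map-ont selects the Nanopore preset; -t sets threads.
    align = ["minimap2", "-ax", "map-ont", "-t", str(threads), genome, fastq]
    # "-" makes samtools sort read the SAM stream from standard input.
    sort = ["samtools", "sort", "-@", str(threads), "-o", out_bam, "-"]
    index = ["samtools", "index", out_bam]
    return align, sort, index


# Hypothetical file names, for illustration only.
align, sort, index = build_alignment_commands(
    "sample.fastq.gz", "mm10.fa", "sample.sorted.bam"
)
print(shlex.join(align), "|", shlex.join(sort))
print(shlex.join(index))
```

Keeping the commands as argument lists makes them easy to pass to `subprocess` later, with the output of the alignment piped into the sort.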

In [3]:
!python 02_align_to_genome.py \
  --data_dir {DATA_DIR} \
  --results_dir {RESULTS_DIR} \
  --genome {GENOME_FA} \
  --threads {THREADS}

Aligning XistExMixAmp-A_WT_diff-Rep1...
...
Indexing XistExMixAmp-K_WT_diff72h-Dox24h-Rep1...


## Step 3: Extract Targeted SNPs

**What it does:** Extracts discriminating SNPs between your two strains (e.g., B6 and Cast) **specifically for the region of interest** (e.g., Xist). This results in a focused SNP reference table that will be used to assign alleles in the next step.

### Arguments:
- `--vcf`: Path to the compressed VCF file containing variant information.
- `--region`: Genomic coordinates for the target locus (Format: chrom:start-end).
- `--b6`: Sample name for the B6 strain in the VCF.
- `--cast`: Sample name for the Cast strain in the VCF.
- `--output`: Path to save the extracted SNP table.
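
To make "discriminating SNP" concrete: one reasonable definition is a site where both strains are homozygous for single-base alleles and those bases differ. The sketch below applies that definition to a single tab-separated VCF record. The function `discriminating_snp`, the column indices, and the example records are assumptions for illustration; the actual criteria used by `03_extract_snps_from_vcf.py` may differ.

```python
def discriminating_snp(vcf_line, b6_col, cast_col):
    """Return (chrom, pos, b6_base, cast_base) or None if not discriminating."""
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
    alleles = [ref] + alt.split(",")  # genotype index 0 is REF, 1+ are ALTs

    def homozygous_base(sample_field):
        # GT is the first colon-separated subfield; accept phased or unphased.
        genotype = sample_field.split(":")[0].replace("|", "/").split("/")
        if "." in genotype or len(set(genotype)) != 1:
            return None  # missing call or heterozygous
        base = alleles[int(genotype[0])]
        return base if base in ("A", "C", "G", "T") else None  # SNPs only

    b6_base = homozygous_base(fields[b6_col])
    cast_base = homozygous_base(fields[cast_col])
    if b6_base is None or cast_base is None or b6_base == cast_base:
        return None
    return chrom, pos, b6_base, cast_base


# Made-up records: sample columns 9 and 10 stand in for the B6 and Cast columns.
informative = "chrX\t103460400\t.\tA\tG\t.\tPASS\t.\tGT\t0/0\t1/1"
heterozygous = "chrX\t103460500\t.\tC\tT\t.\tPASS\t.\tGT\t0/0\t0/1"
print(discriminating_snp(informative, 9, 10))
print(discriminating_snp(heterozygous, 9, 10))
```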

In [4]:
!python 03_extract_snps_from_vcf.py \
  --vcf {VCF_FILE} \
  --region {REGION} \
  --b6 {B6_NAME} \
  --cast {CAST_NAME} \
  --output {SNP_FILE}

## Step 4: Region Filtering & Allele Quantification

**What it does:** This step combines your alignments (Step 2) with your reference SNPs (Step 3). It performs **on-the-fly filtering** of your whole-genome BAM files to keep only the reads within your region of interest. It then records the base each read carries at every SNP position and assigns the read to either the B6 or the Cast allele.

### Arguments:
- `--bam_dir`: Directory containing the whole-genome sorted BAM files.
- `--snp_file`: The region-specific SNP reference file (from Step 3).
- `--output_dir`: Directory to save the quantification CSVs.
- `--region`: The genomic region to filter for (ensures we only analyze on-target reads).
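
The two ideas in this step, parsing a `chrom:start-end` region and assigning a read to an allele from its bases at SNP positions, can be sketched without any BAM library. The helpers `parse_region` and `assign_read`, the majority-vote rule, the `min_margin` parameter, and the example positions are all assumptions for illustration; the rule implemented in `04_quantify_alleles.py` is not shown here and may differ.

```python
def parse_region(region):
    """Split 'chrom:start-end' into (chrom, start, end)."""
    chrom, span = region.rsplit(":", 1)
    start, end = span.replace(",", "").split("-")
    return chrom, int(start), int(end)


def assign_read(read_bases, snps, min_margin=1):
    """Assign a read by majority vote over the SNP positions it covers.

    read_bases: {position: base observed in the read}
    snps:       {position: (b6_base, cast_base)}
    """
    b6_votes = cast_votes = 0
    for pos, (b6_base, cast_base) in snps.items():
        base = read_bases.get(pos)  # None if the read does not cover this SNP
        if base == b6_base:
            b6_votes += 1
        elif base == cast_base:
            cast_votes += 1
    if b6_votes - cast_votes >= min_margin:
        return "B6"
    if cast_votes - b6_votes >= min_margin:
        return "Cast"
    return "ambiguous"


# Made-up SNP positions and bases, for illustration only.
snps = {100: ("A", "G"), 200: ("C", "T")}
print(parse_region("chrX:103460373-103483233"))
print(assign_read({100: "A", 200: "C"}, snps))
print(assign_read({100: "A", 200: "T"}, snps))
```

Reads with equal support for both alleles, or with no coverage of any SNP, are reported as `ambiguous` rather than forced into one class.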

In [5]:
BAM_DIR = os.path.join(RESULTS_DIR, "aligned")
QUANT_DIR = os.path.join(RESULTS_DIR, "quantification")

!python 04_quantify_alleles.py \
  --bam_dir {BAM_DIR} \
  --snp_file {SNP_FILE} \
  --output_dir {QUANT_DIR} \
  --region {REGION}

Quantifying alleles for XistExMixAmp-A_WT_diff-Rep1 using XistExMixAmp-A_WT_diff-Rep1.sorted.bam...
  Filtering for region: chrX:103460373-103483233
...


---

## Step 5: Compare Multiple Datasets (Optional)

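**What it does:** Compares allelic ratios across replicates or conditions, generating a comparative report. The paths in the cell below are examples; replace them with the results directories of the runs you want to compare.

### Arguments:
- `--results_dirs`: Comma-separated list of results directories to compare.
- `--labels`: Comma-separated display labels, one per results directory, in the same order.
- `--output_dir`: Directory to save the comparative analysis.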
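
As a rough illustration of what is being compared, an allelic ratio can be expressed as the B6 fraction of all allele-assigned reads. The function `allelic_ratio` and the read counts below are made up for illustration; they are not results from this pipeline, and the report produced by `05_compare_multiple_datasets.py` may use a different metric.

```python
import math


def allelic_ratio(b6_reads, cast_reads):
    """Fraction of allele-assigned reads that carry the B6 allele."""
    total = b6_reads + cast_reads
    return b6_reads / total if total else float("nan")


# Made-up (B6, Cast) read counts per dataset label, for illustration only.
counts = {"ExonRep1": (80, 20), "ExonRep2": (75, 25), "Intron": (0, 0)}
ratios = {label: allelic_ratio(b6, cast) for label, (b6, cast) in counts.items()}
for label, ratio in ratios.items():
    shown = "n/a" if math.isnan(ratio) else f"{ratio:.2f}"
    print(f"{label}: B6 fraction = {shown}")
```

Returning `nan` for a dataset with no assigned reads keeps an empty sample from being mistaken for a 0% B6 result.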


In [None]:
COMP_DIR = "results/comparative_analysis"
DIRS = "results/exon_rep1,results/exon_rep2,results/intron"
LABELS = "ExonRep1,ExonRep2,Intron"

!python 05_compare_multiple_datasets.py \
  --results_dirs {DIRS} \
  --labels {LABELS} \
  --output_dir {COMP_DIR}

## Step 6: Final Summary Report

**What it does:** Generates a consolidated Markdown summary report for the current run and saves it to a `reports/` subfolder inside your results directory.

In [7]:
!python 06_generate_summary_report.py \
  --results_dir {RESULTS_DIR}

Report generated: /Volumes/guttman/users/gmgao/Data_seq/Consolidated-2026Feb-DoxSeqAmpliconNanopore/results/reports/Automated_Summary_Report.md


---

## Visualization

Detailed visualizations, including SNP match heatmaps and stacked barplots of allele quantification, are available in the dedicated **[visualization.ipynb](visualization.ipynb)** notebook. Use it to fine-tune your plots for publication.