In [13]:
# Date of creation: 12-12-24
# Author: David Yang

### **Step 1: CpG Methylation Probability Pileup QC**

CpG methylation probability pileups generated by [pb-CpG-tools](https://github.com/PacificBiosciences/pb-CpG-tools) (v2.3.2) can produce incorrect results in two key scenarios:

1. **Variant Sites Appearing as Unmethylated CpGs**: When a genetic variant disrupts a CpG site, the position can be misinterpreted as a CpG with zero or low (<10%) methylation probability. This creates "phantom" unmethylated CpG calls at variant locations.
2. **De Novo CpG Creation**: A variant can create a new CpG site on one haplotype that does not exist on the other. This leads to incorrect methylation calls when the non-CpG haplotype is interpreted as carrying an unmethylated CpG.
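
As a rough illustration of both scenarios, the sketch below compares the CpG positions in a short reference context before and after a single-base substitution. The function names (`cpg_starts`, `classify_snv`) and the toy sequences are hypothetical and are not taken from the pipeline scripts; indels and multi-base variants are not handled.

```python
def cpg_starts(seq: str) -> set[int]:
    """Return the 0-based start offset of every 'CG' dinucleotide in seq."""
    seq = seq.upper()
    return {i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"}


def classify_snv(ref_context: str, offset: int, alt: str) -> dict[str, set[int]]:
    """Report which CpGs a single-base substitution destroys or creates.

    ref_context: reference bases around the variant (hypothetical input)
    offset:      0-based index of the substituted base within ref_context
    alt:         the alternate base
    """
    alt_context = ref_context[:offset] + alt + ref_context[offset + 1:]
    ref_cpgs = cpg_starts(ref_context)
    alt_cpgs = cpg_starts(alt_context)
    return {
        "destroyed": ref_cpgs - alt_cpgs,  # CpG in the reference, absent on the variant haplotype
        "created": alt_cpgs - ref_cpgs,    # de novo CpG present only on the variant haplotype
    }


# Scenario 1: a G>A change removes the reference CpG at offset 2
destroyed_example = classify_snv("AACGTT", 3, "A")
# Scenario 2: an A>G change creates a CpG that the reference lacks
created_example = classify_snv("AACATT", 3, "G")
```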

#### Workflow

**Input Files Required:**
- VCF file containing genetic variants
- Reference FASTA file
- CpG methylation BED files (combined, hap1, hap2)

**Analysis Steps:**
1. Identify variant positions affecting CpG sites
2. Detect destroyed reference CpGs and newly created de novo CpGs per haplotype
3. Filter methylation calls based on variant effects
4. Generate QC metrics and visualizations

**Output Files:**
- Filtered BED files with problematic sites removed
- Excluded sites BED files
- QC report with filtering statistics
- Distribution plots for passing and excluded sites
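
The filtering step can be pictured as splitting pileup rows into a passing set and an excluded set, keyed on variant-affected positions. This is a minimal sketch under assumptions: the row layout (`chrom`, `start`, `end`, methylation probability) and the function name `split_by_variant_overlap` are illustrative and do not reflect the actual script or the full pb-CpG-tools BED schema.

```python
# Hypothetical row layout: chrom, start, end, methylation probability (%)
BedRow = tuple[str, int, int, float]


def split_by_variant_overlap(
    rows: list[BedRow],
    flagged: set[tuple[str, int]],
) -> tuple[list[BedRow], list[BedRow]]:
    """Split rows into (kept, excluded) using a set of flagged (chrom, start) positions."""
    kept: list[BedRow] = []
    excluded: list[BedRow] = []
    for row in rows:
        chrom, start = row[0], row[1]
        if (chrom, start) in flagged:
            excluded.append(row)  # site overlaps a variant that destroys or creates a CpG
        else:
            kept.append(row)
    return kept, excluded


rows = [("chr1", 100, 101, 95.0), ("chr1", 250, 251, 2.0), ("chr2", 100, 101, 88.0)]
flagged = {("chr1", 250)}
kept, excluded = split_by_variant_overlap(rows, flagged)
```

Keeping the excluded rows alongside the passing rows mirrors the two kinds of BED output listed above.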


In [None]:
# Set up environment
import os
import sys

# Set up paths
bash_dir = "/gs/gsfs0/shared-lab/greally-lab/David/AlleleStacker_tests/AlleleStacker/bash"
bash_current = os.path.join(bash_dir, "pileup_QC")

print(bash_current)

In [None]:
# Run the filtering script
sample_list = "/gs/gsfs0/shared-lab/greally-lab/David/AlleleStacker_tests/AlleleStacker/bash/pileup_QC/sample_list.txt"
config_file = "/gs/gsfs0/shared-lab/greally-lab/David/AlleleStacker_tests/AlleleStacker/bash/pileup_QC/config.yaml"
output_dir = "/gs/gsfs0/shared-lab/greally-lab/David/AlleleStacker_tests/AlleleStacker/outputs/pileup_QC"

# adjust paths in config file as needed

!sbatch $bash_current/submit_pileup_qc.sh $sample_list $config_file $output_dir


In [None]:
# Generate summary statistics and plots of the filtering results

import subprocess

# Run the script
input_dir = "/gs/gsfs0/shared-lab/greally-lab/David/AlleleStacker_tests/AlleleStacker/outputs/pileup_QC"
output_dir = "/gs/gsfs0/shared-lab/greally-lab/David/AlleleStacker_tests/AlleleStacker/outputs/pileup_QC/summary"

python_script = "/gs/gsfs0/shared-lab/greally-lab/David/AlleleStacker_tests/AlleleStacker/python/pileup_QC/summarize_qc.py"

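# check=True raises CalledProcessError if the summary script exits with a non-zero status
subprocess.run(["python", python_script, "--input-dir", input_dir, "--output-dir", output_dir], check=True)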


#### Note
A site is considered "preserved" if it meets these criteria:
- Has high methylation probability (>90%)
- Has good coverage (≥10 reads)
- Was initially marked for exclusion but isn't at a variant position
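
The criteria above can be read as a single predicate. The sketch below assumes that all three conditions must hold together and that methylation probability is expressed in percent; the `Site` fields and the `is_preserved` name are hypothetical, not the pipeline's own data model.

```python
from dataclasses import dataclass


@dataclass
class Site:
    meth_prob: float            # methylation probability in percent (assumed 0-100 scale)
    coverage: int               # number of reads covering the site
    marked_for_exclusion: bool  # flagged by the initial filter
    at_variant: bool            # sits exactly at a variant position


def is_preserved(site: Site, min_prob: float = 90.0, min_cov: int = 10) -> bool:
    """Return True when a flagged site meets every 'preserved' criterion."""
    return (
        site.meth_prob > min_prob
        and site.coverage >= min_cov
        and site.marked_for_exclusion
        and not site.at_variant
    )


rescued = is_preserved(Site(meth_prob=96.5, coverage=14, marked_for_exclusion=True, at_variant=False))
dropped = is_preserved(Site(meth_prob=96.5, coverage=14, marked_for_exclusion=True, at_variant=True))
```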


### **Step 2: Generate segmentation data**


In [1]:
# Set up environment
import os
import sys

# Set up paths
bash_dir = "/gs/gsfs0/shared-lab/greally-lab/David/AlleleStacker_tests/AlleleStacker/bash"
bash_current = os.path.join(bash_dir, "segmentation")

In [2]:
# Run the segmentation script

qc_dir="/gs/gsfs0/shared-lab/greally-lab/David/AlleleStacker_tests/AlleleStacker/outputs/pileup_QC"
output_dir="/gs/gsfs0/shared-lab/greally-lab/David/AlleleStacker_tests/AlleleStacker/outputs/segmentation"

!sbatch $bash_current/a_segment_samples.sh $qc_dir $output_dir



Submitted batch job 8694899


In [None]:
# Parse the segmentation results 

input="/gs/gsfs0/shared-lab/greally-lab/David/AlleleStacker_tests/AlleleStacker/outputs/segmentation"
output="/gs/gsfs0/shared-lab/greally-lab/David/AlleleStacker_tests/AlleleStacker/outputs/segmentation/regions_by_label"

!sbatch $bash_current/b_extract-regions.sh $input $output




### **Step 3: Generate candidate regions**


In [1]:

# Set up environment
import os
import sys

# Set up paths
bash_dir = "/gs/gsfs0/shared-lab/greally-lab/David/AlleleStacker_tests/AlleleStacker/bash"
bash_current = os.path.join(bash_dir, "candidate_regions")

In [None]:
# Generate the consensus regions

# Data and summary graphs are generated in the candidate_regions directory
input_dir="/gs/gsfs0/shared-lab/greally-lab/David/AlleleStacker_tests/AlleleStacker/outputs/segmentation/regions_by_label/regions"
output_dir="/gs/gsfs0/shared-lab/greally-lab/David/AlleleStacker_tests/AlleleStacker/outputs/candidate_regions/pre-filter"


!sbatch $bash_current/a_consensus_regions.sh $input_dir $output_dir

In [None]:
# Filter consensus regions to generate candidate regions

# Data and summary graphs are generated in the candidate_regions directory

# Set script parameters
CONSENSUS_DIR="/gs/gsfs0/shared-lab/greally-lab/David/AlleleStacker_tests/AlleleStacker/outputs/candidate_regions/pre-filter"
REGIONS_DIR="/gs/gsfs0/shared-lab/greally-lab/David/AlleleStacker_tests/AlleleStacker/outputs/segmentation/regions_by_label/regions"
OUTPUT_DIR="/gs/gsfs0/shared-lab/greally-lab/David/AlleleStacker_tests/AlleleStacker/outputs/candidate_regions/filtered"

# Filter the consensus regions separately for each haplotype (H1, H2)
!sbatch $bash_current/b_filter-consensus.sh $CONSENSUS_DIR $REGIONS_DIR $OUTPUT_DIR H1
!sbatch $bash_current/b_filter-consensus.sh $CONSENSUS_DIR $REGIONS_DIR $OUTPUT_DIR H2


### **Step 4: Generate IGV viewable files**
To visualize the candidate regions and variants in IGV, we aggregate, format, and colorize the two datasets generated so far:
- Individual sample data
- Candidate regions

Files generated:

- allele_stacks: IGV-colorized BED files of each sample's segmentation regions, organized by haplotype.
- candidate_regions: IGV-colorized BED files of the candidate regions.

In [2]:
# Set up environment
import os
import sys

# Set up paths
bash_dir = "/gs/gsfs0/shared-lab/greally-lab/David/AlleleStacker_tests/AlleleStacker/bash"
bash_current = os.path.join(bash_dir, "igv_viewing")



In [None]:
# Generate allele stacks by aggregating the segmentation data for each sample
INPUT_DIR="/gs/gsfs0/shared-lab/greally-lab/David/AlleleStacker_tests/AlleleStacker/outputs/segmentation/regions_by_label/regions"
OUTPUT_DIR="/gs/gsfs0/shared-lab/greally-lab/David/AlleleStacker_tests/AlleleStacker/outputs/igv_beds/allele_stacks"


!sbatch $bash_current/a_IGV_all-samples.sh $INPUT_DIR $OUTPUT_DIR


In [None]:
# Generate IGV beds for the candidate regions

INPUT_DIR="/gs/gsfs0/shared-lab/greally-lab/David/AlleleStacker_tests/AlleleStacker/outputs/candidate_regions/filtered"
OUTPUT_DIR="/gs/gsfs0/shared-lab/greally-lab/David/AlleleStacker_tests/AlleleStacker/outputs/igv_beds/candidate_regions"


!sbatch $bash_current/b_consensus-IGV.sh $INPUT_DIR $OUTPUT_DIR


In [None]:
# Create BED files for IGV viewing each sample's H1 and H2 segments

# Define paths
input_dir="/gs/gsfs0/shared-lab/greally-lab/David/AlleleStacker_tests/AlleleStacker/outputs/segmentation/regions_by_label/regions"
output_dir="/gs/gsfs0/shared-lab/greally-lab/David/AlleleStacker_tests/AlleleStacker/old_QC_AS/outputs/allele_stacks/all_samples/sample_segmentation"

!sbatch $bash_current/c_IGV-segmentation.sh $input_dir $output_dir



### **Step 5: Prepare VCFs for variant mapping**
This step merges each sample's VCF, per variant type, into a cohort-wide VCF.
* Adjust file paths within each bash script

Each VCF was QC'ed with the following parameters for this run:


Merging:
bcftools merge \
    --force-samples \
    --merge both \
    --file-list "$FILEPATHS" \
    --output-type z \

Small variants:
bcftools filter -i 'QUAL>=20 && DP>=10' 
bcftools annotate --set-id '%CHROM:%POS:%REF:%ALT'

SVs:
bcftools filter -i 'FILTER="PASS" && DP>=10'
bcftools annotate --set-id '%CHROM:%POS:%REF:%ALT:%SVANN'

CNVs:
bcftools filter -i 'QUAL>=20 && FILTER="PASS"'
bcftools annotate --set-id '%CHROM:%POS:%REF:%ALT:%SVLEN' 

Tandem Repeats:
bcftools filter -i 'QUAL>=20 && FILTER="PASS"'
bcftools annotate --set-id '%CHROM:%POS:%REF:%ALT'
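
To make the small-variant settings above concrete, the sketch below restates the `QUAL>=20 && DP>=10` threshold and the `%CHROM:%POS:%REF:%ALT` ID template in plain Python. It only illustrates the intended logic on already-parsed values; it is not how bcftools evaluates expressions, and the function names are hypothetical.

```python
def passes_small_variant_qc(
    qual: float,
    depth: int,
    min_qual: float = 20.0,
    min_depth: int = 10,
) -> bool:
    """Mirror the small-variant inclusion rule: QUAL>=20 and DP>=10."""
    return qual >= min_qual and depth >= min_depth


def make_variant_id(chrom: str, pos: int, ref: str, alt: str) -> str:
    """Build an ID following the CHROM:POS:REF:ALT template used for small variants."""
    return f"{chrom}:{pos}:{ref}:{alt}"


keep = passes_small_variant_qc(qual=35.2, depth=18)
drop = passes_small_variant_qc(qual=19.9, depth=30)
example_id = make_variant_id("chr1", 12345, "A", "G")
```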

In [3]:

# Set up environment
import os
import sys

# Set up paths
bash_dir = "/gs/gsfs0/shared-lab/greally-lab/David/AlleleStacker_tests/AlleleStacker/bash"
bash_current = os.path.join(bash_dir, "variant_map")   

In [None]:
# merge each VCF type into a single cohort vcf using bcftools merge
# small variants
FILEPATHS="/gs/gsfs0/shared-lab/greally-lab/David/AlleleStacker_tests/AlleleStacker/bash/variant_map/file_paths/filepaths_small.txt"
OUT_DIR="/gs/gsfs0/shared-lab/greally-lab/David/AlleleStacker_tests/AlleleStacker/outputs/variant_mapping/merged_variants/small_variants"


!sbatch $bash_current/a_merge_small_vars.sh $FILEPATHS $OUT_DIR

In [None]:
# merge structural variants

FILEPATHS="/gs/gsfs0/shared-lab/greally-lab/David/AlleleStacker_tests/AlleleStacker/bash/variant_map/file_paths/filepaths_SVs.txt"
OUT_DIR="/gs/gsfs0/shared-lab/greally-lab/David/AlleleStacker_tests/AlleleStacker/outputs/variant_mapping/merged_variants/structural_variants"


!sbatch $bash_current/b_merge_sv_vars.sh $FILEPATHS $OUT_DIR

In [None]:
# merge copy number variants
FILEPATHS="/gs/gsfs0/shared-lab/greally-lab/David/AlleleStacker_tests/AlleleStacker/bash/variant_map/file_paths/filepaths_CNVs.txt"
OUT_DIR="/gs/gsfs0/shared-lab/greally-lab/David/AlleleStacker_tests/AlleleStacker/outputs/variant_mapping/merged_variants/copy_number_variants"

!sbatch $bash_current/c_merge_CNVs_vars.sh $FILEPATHS $OUT_DIR

In [None]:
# merge tandem repeats

FILEPATHS="/gs/gsfs0/shared-lab/greally-lab/David/AlleleStacker_tests/AlleleStacker/bash/variant_map/file_paths/filepaths_repeats.txt"

OUT_DIR="/gs/gsfs0/shared-lab/greally-lab/David/AlleleStacker_tests/AlleleStacker/outputs/variant_mapping/merged_variants/trgt_repeat_vcf"

!sbatch $bash_current/d_merge_repeats_vars.sh  $FILEPATHS $OUT_DIR

# none remained after QC, omitted for now  

### **Step 6: Map variants to regions**

#### Haplotype-Specific Variant Mapping Analysis

**Overview**

Maps genetic variants to haplotype-specific methylated and unmethylated regions. Phased variants (small variants, SVs) and unphased variants (CNVs, tandem repeats) from VCF files are compared against the methylation BED files.

**Input Requirements**
- BED file with methylation regions containing:
  - Chromosome, start, end positions
  - `methylated_samples` and `unmethylated_samples` columns
- Indexed VCF files (.vcf.gz + .tbi) for variants:
  - Small variants (SNPs/indels)
  - Copy number variants
  - Structural variants  
  - Tandem repeats

**Output Format**

Tab-separated file with columns:
```
chrom         Chromosome
start         Region start position
end           Region end position
variant_id    Unique variant identifier 
type          Variant type (small/cnv/sv/tr)
ref           Reference allele
alt           Alternative allele
num_meth      Number of samples with variant + methylated
num_unmeth    Number of samples with variant + unmethylated
meth_samples  Methylated samples with variant: sample:genotype
unmeth_samples Unmethylated samples with variant: sample:genotype
```
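
A row in this layout could be parsed as sketched below. Several details are assumptions rather than facts about the mapping script: that columns appear in the order listed, that there is no header row, and that multiple `sample:genotype` entries are comma-separated. `COLUMNS` and `parse_row` are hypothetical names.

```python
COLUMNS = [
    "chrom", "start", "end", "variant_id", "type", "ref", "alt",
    "num_meth", "num_unmeth", "meth_samples", "unmeth_samples",
]


def parse_row(line: str, sample_sep: str = ",") -> dict:
    """Parse one tab-separated output row into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    record: dict = dict(zip(COLUMNS, fields))
    for key in ("start", "end", "num_meth", "num_unmeth"):
        record[key] = int(record[key])
    for key in ("meth_samples", "unmeth_samples"):
        # Assumed entry format: sample:genotype, joined by sample_sep
        entries = [e for e in record[key].split(sample_sep) if ":" in e]
        record[key] = dict(e.rsplit(":", 1) for e in entries)
    return record


example = parse_row("chr1\t100\t200\tchr1:150:A:G\tsmall\tA\tG\t2\t0\tS1:0|1,S2:1|1\t\n")
```

Splitting each entry on its last colon keeps the parser working if a sample name itself contains a colon.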

**Haplotype Assignment**
- Phased variants (small, sv): Uses `|`-separated genotypes, checking H1 (idx 0) or H2 (idx 1)
- Unphased variants (cnv, tr): Reports any non-reference genotype
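
The assignment rule can be sketched as a small genotype check. This is an illustration only: the function name `carries_alt`, the treatment of `.` as missing, and the choice to skip unphased genotypes for phased variant types are assumptions, not confirmed behavior of the mapping script.

```python
PHASED_TYPES = {"small", "sv"}
NON_ALT = {"0", "."}  # reference allele or missing call


def carries_alt(genotype: str, variant_type: str, haplotype: str) -> bool:
    """Decide whether a sample's genotype counts as carrying the variant.

    haplotype is 'H1' (allele index 0) or 'H2' (allele index 1).
    """
    if variant_type in PHASED_TYPES:
        if "|" not in genotype:
            return False  # assumption: unphased calls are ignored for phased types
        alleles = genotype.split("|")
        idx = 0 if haplotype == "H1" else 1
        return idx < len(alleles) and alleles[idx] not in NON_ALT
    # Unphased types (cnv, tr): any non-reference allele counts
    alleles = genotype.replace("|", "/").split("/")
    return any(allele not in NON_ALT for allele in alleles)


h1_hit = carries_alt("0|1", "small", "H1")
h2_hit = carries_alt("0|1", "small", "H2")
cnv_hit = carries_alt("0/1", "cnv", "H1")
```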



In [4]:
# Set up environment
import os
import sys

# Set up paths
bash_dir = "/gs/gsfs0/shared-lab/greally-lab/David/AlleleStacker_tests/AlleleStacker/bash"
bash_current = os.path.join(bash_dir, "variant_map")   

BASE_DIR="/gs/gsfs0/shared-lab/greally-lab/David/AlleleStacker_tests/AlleleStacker"
OUTPUT_DIR="/gs/gsfs0/shared-lab/greally-lab/David/AlleleStacker_tests/AlleleStacker/outputs/variant_mapping"

# Make sure output directory exists
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Submit the job
!sbatch $bash_current/test_map.sh -b $BASE_DIR -o $OUTPUT_DIR

Submitted batch job 8694918
