# Week 5: Pharmacogenomics Analysis

## Analysis Pipeline
1. Download reference genome (chromosome 10)
2. Align Illumina and PacBio reads with minimap2
3. Call variants with bcftools
4. Phase variants with HapCUT2
5. Compare VCFs and identify discordant variants
6. Determine star-alleles using PharmVar database

## Step 0: Setup

### Import Dependencies

In [1]:
import os
import sys
import subprocess
import pandas as pd
from pathlib import Path
import gzip
from collections import defaultdict
import urllib.request
import shutil
import bz2
import platform
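
The later steps shell out to several command-line tools, so it may help to confirm they are on the PATH before running anything. A minimal sketch, assuming the executable names used further down in this notebook; `find_missing_tools` is a hypothetical helper, not part of any library:

```python
import shutil

# Executable names as they are invoked later in this notebook.
REQUIRED_TOOLS = ["minimap2", "samtools", "bcftools", "extractHAIRS", "HAPCUT2"]

def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

missing = find_missing_tools()
if missing:
    print(f"Missing tools: {', '.join(missing)}")
else:
    print("All required tools found")
```

Failing early here is cheaper than discovering a missing binary after a multi-minute alignment.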

### Set up working directories

In [2]:
notebook_dir = Path.cwd()
data_dir = notebook_dir / "data"
data_dir.mkdir(parents=True, exist_ok=True)
print(f"Working directory: {notebook_dir}")
print(f"Data directory: {data_dir}")

Working directory: /mnt/c/Users/Nick/Documents/GitHub/fall25-csc-bioinf/week5/code
Data directory: /mnt/c/Users/Nick/Documents/GitHub/fall25-csc-bioinf/week5/code/data


## Step 1: Download Reference Genome

All target genes are located on chromosome 10:
- CYP2C8
- CYP2C9
- CYP2C19

Locate these genes in the reference genome using the UCSC Genome Browser. We will use the hg38 (GRCh38) build of the human genome, so make sure to download the matching reference.

Note: you do not need the whole human genome for this step, only the chromosome that contains these genes. The reference should be a single FASTA file (extension: .fa or .fasta).

Expected output: a single FASTA file.

In [3]:
# Download chromosome 10 reference genome
chr10_url = "https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr10.fa.gz"
chr10_gz_path = data_dir / "chr10.fa.gz"
chr10_fa_path = data_dir / "chr10.fa"

print("DOWNLOADING REFERENCE GENOME")

if not chr10_fa_path.exists():
    print(f"Downloading chromosome 10 from UCSC Genome Browser")
    print(f"Source: {chr10_url}")
    print(f"Target: {chr10_gz_path}")
    
    try:
        urllib.request.urlretrieve(chr10_url, chr10_gz_path)
        print("Download complete!")
        
        print(f"Decompressing {chr10_gz_path.name}...")
        with gzip.open(chr10_gz_path, 'rb') as f_in:
            with open(chr10_fa_path, 'wb') as f_out:
                shutil.copyfileobj(f_in, f_out)
        
        print(f"Decompression complete!")
        print(f"Output file: {chr10_fa_path}")
        
        # Remove compressed file to save space
        chr10_gz_path.unlink()
        print(f"Removed compressed file: {chr10_gz_path}")
                
    except Exception as e:
        print(f"Error during download: {e}")
        raise
else:
    print(f"Reference genome already exists: {chr10_fa_path}")

print("")

DOWNLOADING REFERENCE GENOME
Downloading chromosome 10 from UCSC Genome Browser
Source: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr10.fa.gz
Target: /mnt/c/Users/Nick/Documents/GitHub/fall25-csc-bioinf/week5/code/data/chr10.fa.gz
Download complete!
Decompressing chr10.fa.gz...
Decompression complete!
Output file: /mnt/c/Users/Nick/Documents/GitHub/fall25-csc-bioinf/week5/code/data/chr10.fa
Removed compressed file: /mnt/c/Users/Nick/Documents/GitHub/fall25-csc-bioinf/week5/code/data/chr10.fa.gz
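
Since the expected output is a single-sequence FASTA, a quick sanity check on the decompressed file can catch a truncated download. A minimal sketch; `fasta_summary` is a hypothetical helper that reads the file line by line and reports each record's length:

```python
def fasta_summary(fasta_path):
    """Return {sequence_name: length} for every record in a FASTA file."""
    lengths = {}
    name = None
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                header = line[1:].split()
                name = header[0] if header else ""  # first word of the header
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths

# Hypothetical usage with the path defined in the cell above:
# summary = fasta_summary(chr10_fa_path)
# print(summary)
```

For this notebook the result should contain one record; the minimap2 indexing log in Step 2 reports one sequence with a total length of 133797422, which gives a value to compare against.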



### Download Sequencing Data

In [4]:
# Download Illumina data
illumina_url = "https://github.com/inumanag/fall25-csc-bioinf/raw/refs/heads/main/week4/data/illumina.fq.bz2"
illumina_bz2_path = data_dir / "illumina.fq.bz2"
illumina_fq_path = data_dir / "illumina.fq"

if not illumina_fq_path.exists():
    print(f"Downloading Illumina data from GitHub")
    print(f"Source: {illumina_url}")
    
    try:
        urllib.request.urlretrieve(illumina_url, illumina_bz2_path)
        print("Download complete!")
        
        print(f"Decompressing {illumina_bz2_path.name}...")
        with bz2.open(illumina_bz2_path, 'rb') as f_in:
            with open(illumina_fq_path, 'wb') as f_out:
                shutil.copyfileobj(f_in, f_out)
        
        print(f"Decompression complete!")
        print(f"Output file: {illumina_fq_path}")
        
        # Remove compressed file
        illumina_bz2_path.unlink()
        print(f"Removed compressed file: {illumina_bz2_path}")
                
    except Exception as e:
        print(f"Error during download: {e}")
        raise
else:
    print(f"Illumina data already exists: {illumina_fq_path}")

print("")

pacbio_url = "https://github.com/inumanag/fall25-csc-bioinf/raw/refs/heads/main/week4/data/pacbio.fq.bz2"
pacbio_bz2_path = data_dir / "pacbio.fq.bz2"
pacbio_fq_path = data_dir / "pacbio.fq"

if not pacbio_fq_path.exists():
    print(f"Downloading PacBio data from GitHub")
    print(f"Source: {pacbio_url}")
    
    try:
        urllib.request.urlretrieve(pacbio_url, pacbio_bz2_path)
        print("Download complete!")
        
        print(f"Decompressing {pacbio_bz2_path.name}...")
        with bz2.open(pacbio_bz2_path, 'rb') as f_in:
            with open(pacbio_fq_path, 'wb') as f_out:
                shutil.copyfileobj(f_in, f_out)
        
        print(f"Decompression complete!")
        print(f"Output file: {pacbio_fq_path}")
        
        pacbio_bz2_path.unlink()
        print(f"Removed compressed file: {pacbio_bz2_path}")
                
    except Exception as e:
        print(f"Error during download: {e}")
        raise
else:
    print(f"PacBio data already exists: {pacbio_fq_path}")

print("")

Downloading Illumina data from GitHub
Source: https://github.com/inumanag/fall25-csc-bioinf/raw/refs/heads/main/week4/data/illumina.fq.bz2
Download complete!
Decompressing illumina.fq.bz2...
Decompression complete!
Output file: /mnt/c/Users/Nick/Documents/GitHub/fall25-csc-bioinf/week5/code/data/illumina.fq
Removed compressed file: /mnt/c/Users/Nick/Documents/GitHub/fall25-csc-bioinf/week5/code/data/illumina.fq.bz2

Downloading PacBio data from GitHub
Source: https://github.com/inumanag/fall25-csc-bioinf/raw/refs/heads/main/week4/data/pacbio.fq.bz2
Download complete!
Decompressing pacbio.fq.bz2...
Decompression complete!
Output file: /mnt/c/Users/Nick/Documents/GitHub/fall25-csc-bioinf/week5/code/data/pacbio.fq
Removed compressed file: /mnt/c/Users/Nick/Documents/GitHub/fall25-csc-bioinf/week5/code/data/pacbio.fq.bz2
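
The three download blocks above repeat the same download, decompress, and clean-up steps. One possible refactoring is sketched below. `fetch_and_decompress` is a hypothetical helper; it writes to a `.part` file first so that an interrupted run does not leave a truncated file that later passes the `exists()` check.

```python
import bz2
import gzip
import shutil
import urllib.request
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse

# Map a file suffix to the matching standard-library opener.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open}

def fetch_and_decompress(url, out_path):
    """Download a .gz/.bz2 file and decompress it to out_path, skipping if present."""
    out_path = Path(out_path)
    if out_path.exists():
        return out_path
    suffix = PurePosixPath(urlparse(url).path).suffix
    opener = OPENERS[suffix]  # KeyError for an unsupported compression type
    compressed = out_path.with_name(out_path.name + suffix)
    partial = out_path.with_name(out_path.name + ".part")
    try:
        urllib.request.urlretrieve(url, compressed)
        with opener(compressed, "rb") as f_in, open(partial, "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)
        partial.replace(out_path)  # publish only a fully written file
    finally:
        # Remove leftovers whether or not the steps above succeeded
        compressed.unlink(missing_ok=True)
        partial.unlink(missing_ok=True)
    return out_path
```

With this helper, each download would reduce to a single call such as `fetch_and_decompress(illumina_url, illumina_fq_path)`.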



## Step 2: Align Reads with minimap2

Align all samples in FASTQ format to the human genome (version GRCh38). Use minimap2 for the alignment. Make sure to use appropriate parameters for each technology.

Note: use the chromosome 10 FASTA from Step 1 as the reference; the whole human genome is not needed.

Expected output: two BAM files and two BAI files (one of those for each sample).

### Index Reference Genome

In [5]:
print("INDEXING REFERENCE GENOME")

chr10_mmi_path = data_dir / "chr10.mmi"

if not chr10_mmi_path.exists():
    print(f"Creating minimap2 index for {chr10_fa_path}")
    cmd = [
        "minimap2",
        "-d", str(chr10_mmi_path),
        str(chr10_fa_path)
    ]
    
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    print(result.stderr)  # minimap2 outputs to stderr
    print(f"Index created: {chr10_mmi_path}")

else:
    print(f"Index already exists: {chr10_mmi_path}")

print("")

INDEXING REFERENCE GENOME
Creating minimap2 index for /mnt/c/Users/Nick/Documents/GitHub/fall25-csc-bioinf/week5/code/data/chr10.fa
[M::mm_idx_gen::5.555*0.71] collected minimizers
[M::mm_idx_gen::6.252*0.94] sorted minimizers
[M::main::83.910*0.16] loaded/built the index for 1 target sequence(s)
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::84.082*0.16] distinct minimizers: 16061920 (79.91% are singletons); average occurrences: 1.563; average spacing: 5.329; total length: 133797422
[M::main] Version: 2.26-r1175
[M::main] CMD: minimap2 -d /mnt/c/Users/Nick/Documents/GitHub/fall25-csc-bioinf/week5/code/data/chr10.mmi /mnt/c/Users/Nick/Documents/GitHub/fall25-csc-bioinf/week5/code/data/chr10.fa
[M::main] Real time: 84.148 sec; CPU: 13.487 sec; Peak RSS: 1.097 GB

Index created: /mnt/c/Users/Nick/Documents/GitHub/fall25-csc-bioinf/week5/code/data/chr10.mmi
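
The log above shows the index was built with minimap2's default settings (k-mer size 15, skip 10). minimap2 fixes the k-mer and window size at indexing time, so a preset chosen later with `-ax` cannot change them. One option is to build a separate index per preset. A minimal sketch, where `index_command` and the `chr10.<preset>.mmi` naming scheme are hypothetical:

```python
from pathlib import Path

def index_command(fasta_path, preset):
    """Build the minimap2 command that indexes fasta_path for one preset."""
    fasta_path = Path(fasta_path)
    # Hypothetical naming scheme: chr10.fa -> chr10.sr.mmi, chr10.map-pb.mmi
    index_path = fasta_path.with_name(f"{fasta_path.stem}.{preset}.mmi")
    cmd = ["minimap2", "-x", preset, "-d", str(index_path), str(fasta_path)]
    return cmd, index_path

for preset in ("sr", "map-pb"):
    cmd, index_path = index_command("data/chr10.fa", preset)
    print(" ".join(cmd))
    # To actually build the index: subprocess.run(cmd, check=True)
```

Alternatively, passing the FASTA itself to `minimap2 -ax <preset>` lets each run index on the fly with the preset's own settings, at the cost of rebuilding the index every time.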



### Align Illumina Reads

In [None]:
print("ALIGNING ILLUMINA READS")

illumina_sam = data_dir / "illumina.sam"
illumina_bam = data_dir / "illumina.bam"
illumina_sorted_bam = data_dir / "illumina_sorted.bam"
illumina_bai = data_dir / "illumina_sorted.bam.bai"

if not illumina_sorted_bam.exists():
    print("Aligning Illumina reads with minimap2 (sr preset for short reads)")
    
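    # Align with minimap2 using the short-read preset.
    # Caveat: chr10.mmi was built with default settings (k=15, w=10). A prebuilt
    # index fixes k and w, so the preset's own k-mer settings are not applied
    # (the log below reports "kmer size: 15; skip: 10"). Passing chr10_fa_path
    # instead of the .mmi lets the preset choose its indexing parameters.
    align_cmd = [
        "minimap2",
        "-ax", "sr",  # short-read preset
        "-t", "4",     # threads
        str(chr10_mmi_path),
        str(illumina_fq_path)
    ]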
    
    with open(illumina_sam, 'w') as sam_file:
        result = subprocess.run(align_cmd, stdout=sam_file, stderr=subprocess.PIPE, text=True, check=True)
        print(result.stderr)  # minimap2 outputs to stderr
    
    print(f"Alignment complete: {illumina_sam}")
    
    # Convert SAM to BAM
    print("Converting SAM to BAM...")
    subprocess.run([
        "samtools", "view",
        "-b", "-o", str(illumina_bam),
        str(illumina_sam)
    ], check=True)
    print(f"BAM created: {illumina_bam}")
    
    # Sort BAM
    print("Sorting BAM file...")
    subprocess.run([
        "samtools", "sort",
        "-o", str(illumina_sorted_bam),
        str(illumina_bam)
    ], check=True)
    print(f"Sorted BAM created: {illumina_sorted_bam}")
    
    # Index BAM
    print("Indexing BAM file...")
    subprocess.run([
        "samtools", "index",
        str(illumina_sorted_bam)
    ], check=True)
    print(f"Index created: {illumina_bai}")
    
    # Clean up intermediate files
    illumina_sam.unlink()
    illumina_bam.unlink()
    print("Cleaned up intermediate files")
    
    print(f"\nFinal output: {illumina_sorted_bam}")
    print(f"Final index: {illumina_bai}")
else:
    print(f"Illumina alignment already exists: {illumina_sorted_bam}")

print("")

ALIGNING ILLUMINA READS
Aligning Illumina reads with minimap2 (sr preset for short reads)
[M::main::73.124*0.12] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::73.124*0.12] mid_occ = 1000
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::73.289*0.12] distinct minimizers: 16061920 (79.91% are singletons); average occurrences: 1.563; average spacing: 5.329; total length: 133797422
[M::worker_pipeline::121.134*0.46] mapped 309505 sequences
[M::main] Version: 2.26-r1175
[M::main] CMD: minimap2 -ax sr -t 4 /mnt/c/Users/Nick/Documents/GitHub/fall25-csc-bioinf/week5/code/data/chr10.mmi /mnt/c/Users/Nick/Documents/GitHub/fall25-csc-bioinf/week5/code/data/illumina.fq
[M::main] Real time: 121.157 sec; CPU: 55.691 sec; Peak RSS: 0.925 GB

Alignment complete: /mnt/c/Users/Nick/Documents/GitHub/fall25-csc-bioinf/week5/code/data/illumina.sam
Converting SAM to BAM...


### Align PacBio Reads

In [None]:
print("ALIGNING PACBIO READS")

pacbio_sam = data_dir / "pacbio.sam"
pacbio_bam = data_dir / "pacbio.bam"
pacbio_sorted_bam = data_dir / "pacbio_sorted.bam"
pacbio_bai = data_dir / "pacbio_sorted.bam.bai"

if not pacbio_sorted_bam.exists():
    print("Aligning PacBio reads with minimap2 (map-pb preset for PacBio)")
    
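    # Align with minimap2 using the PacBio preset.
    # Caveat: chr10.mmi was built with default settings, and a prebuilt index
    # fixes the indexing parameters, so map-pb's own index settings
    # (homopolymer-compressed k-mers, k=19) are not applied here.
    align_cmd = [
        "minimap2",
        "-ax", "map-pb",  # PacBio CLR reads preset
        "-t", "4",         # threads
        str(chr10_mmi_path),
        str(pacbio_fq_path)
    ]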
    
    with open(pacbio_sam, 'w') as sam_file:
        result = subprocess.run(align_cmd, stdout=sam_file, stderr=subprocess.PIPE, text=True, check=True)
        print(result.stderr)  # minimap2 outputs to stderr
    
    print(f"Alignment complete: {pacbio_sam}")
    
    print("Converting SAM to BAM...")
    subprocess.run([
        "samtools", "view",
        "-b", "-o", str(pacbio_bam),
        str(pacbio_sam)
    ], check=True)
    print(f"BAM created: {pacbio_bam}")
    
    print("Sorting BAM file...")
    subprocess.run([
        "samtools", "sort",
        "-o", str(pacbio_sorted_bam),
        str(pacbio_bam)
    ], check=True)
    print(f"Sorted BAM created: {pacbio_sorted_bam}")
    
    print("Indexing BAM file...")
    subprocess.run([
        "samtools", "index",
        str(pacbio_sorted_bam)
    ], check=True)
    print(f"Index created: {pacbio_bai}")
    
    pacbio_sam.unlink()
    pacbio_bam.unlink()
    print("Cleaned up intermediate files")
    
    print(f"\nFinal output: {pacbio_sorted_bam}")
    print(f"Final index: {pacbio_bai}")
else:
    print(f"PacBio alignment already exists: {pacbio_sorted_bam}")

print("")
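
Both alignment cells write a full SAM file and an unsorted BAM before sorting. A possible alternative, sketched below, streams minimap2 output straight into `samtools sort` so no intermediate files are created. The function names are hypothetical and the sketch assumes `minimap2` and `samtools` are on the PATH.

```python
import subprocess

def pipeline_commands(reference, fastq, preset, sorted_bam, threads=4):
    """Return the minimap2 and samtools sort commands for one sample."""
    align_cmd = ["minimap2", "-ax", preset, "-t", str(threads), str(reference), str(fastq)]
    sort_cmd = ["samtools", "sort", "-o", str(sorted_bam), "-"]  # "-" = read stdin
    return align_cmd, sort_cmd

def align_and_sort(reference, fastq, preset, sorted_bam, threads=4):
    """Stream minimap2 output into samtools sort, then index the sorted BAM."""
    align_cmd, sort_cmd = pipeline_commands(reference, fastq, preset, sorted_bam, threads)
    aligner = subprocess.Popen(align_cmd, stdout=subprocess.PIPE)
    sorter = subprocess.Popen(sort_cmd, stdin=aligner.stdout)
    aligner.stdout.close()  # let minimap2 receive SIGPIPE if samtools exits early
    sort_rc, align_rc = sorter.wait(), aligner.wait()
    for returncode, cmd in ((align_rc, align_cmd), (sort_rc, sort_cmd)):
        if returncode != 0:
            raise subprocess.CalledProcessError(returncode, cmd)
    subprocess.run(["samtools", "index", str(sorted_bam)], check=True)
```

Usage would look like `align_and_sort(chr10_fa_path, pacbio_fq_path, "map-pb", pacbio_sorted_bam)`. The trade-off is that minimap2's log goes to the inherited stderr instead of being captured and printed in the notebook.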

## Step 3: Call Variants with bcftools

Find all variants in each sample for all genes of interest and obtain VCF files.

**Expected output:** Two VCF files (one for each sample)
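
To restrict calling to the genes of interest, `bcftools mpileup` accepts a comma-separated region list through `-r`. A minimal sketch of building that argument; `region_args` is a hypothetical helper, and the coordinates are placeholders for illustration only, not the real gene positions:

```python
def region_args(regions):
    """Turn {gene: (chrom, start, end)} into a bcftools '-r' argument pair."""
    spec = ",".join(f"{chrom}:{start}-{end}" for chrom, start, end in regions.values())
    return ["-r", spec]

# Placeholder coordinates: replace with the hg38 positions looked up
# in the Genome Browser for CYP2C8, CYP2C9 and CYP2C19.
example_regions = {
    "GENE_A": ("chr10", 1000, 2000),
    "GENE_B": ("chr10", 5000, 6000),
}
print(region_args(example_regions))
```

The returned pair could be spliced into the `mpileup` command list before the BAM path. Region queries rely on the BAM index created in Step 2.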

In [None]:
print("VARIANT CALLING WITH BCFTOOLS")

illumina_vcf = data_dir / "illumina_variants.vcf.gz"
pacbio_vcf = data_dir / "pacbio_variants.vcf.gz"

def call_variants(bam_path, vcf_path, sample_name):
    """Call variants with bcftools (mpileup, call, filter), then index the VCF and print stats."""
    
    if vcf_path.exists():
        print(f"\n{sample_name} VCF already exists: {vcf_path}")
        return
    
    print(f"Calling variants for {sample_name}")
    print(f"Running bcftools mpileup")
    mpileup_cmd = [
        "bcftools", "mpileup",
        "-f", str(chr10_fa_path),  # reference genome
        "-Ou",                      # uncompressed BCF output
        "-q", "20",                 # minimum mapping quality
        "-Q", "20",                 # minimum base quality
        str(bam_path)
    ]
    
    print(f"Running bcftools call")
    call_cmd = [
        "bcftools", "call",
        "-mv",                      # multiallelic and variants-only caller
        "-Ou",                      # uncompressed BCF output
        "--ploidy", "2"             # diploid organism
    ]
    
    print(f"Applying quality filters")
    filter_cmd = [
        "bcftools", "filter",
        "-s", "LowQual",            # mark low quality variants
        "-e", "QUAL<20",            # filter expression: quality < 20
        "-Oz",                      # compressed VCF output
        "-o", str(vcf_path)
    ]
    
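    # Run mpileup | call | filter as one pipeline. Only the last stage pipes
    # stderr; upstream stderr is inherited so an unread pipe cannot fill and stall.
    p1 = subprocess.Popen(mpileup_cmd, stdout=subprocess.PIPE)
    p2 = subprocess.Popen(call_cmd, stdin=p1.stdout, stdout=subprocess.PIPE)
    p1.stdout.close()  # Allow p1 to receive SIGPIPE if p2 exits
    p3 = subprocess.Popen(filter_cmd, stdin=p2.stdout, stderr=subprocess.PIPE)
    p2.stdout.close()
    _, filter_stderr = p3.communicate()
    
    # Check every stage, not only the final filter
    for proc, cmd in ((p1, mpileup_cmd), (p2, call_cmd), (p3, filter_cmd)):
        if proc.wait() != 0:
            stderr = filter_stderr.decode() if proc is p3 else None
            raise subprocess.CalledProcessError(proc.returncode, cmd, stderr=stderr)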
    
    print(f"Variants called successfully: {vcf_path}")
    
    print(f"Indexing VCF file")
    subprocess.run([
        "bcftools", "index",
        "-t",  # tabix index
        str(vcf_path)
    ], check=True)
    print(f"Index created: {vcf_path}.tbi")
    
    print(f"\nVariant Statistics:")
    stats_result = subprocess.run([
        "bcftools", "stats",
        str(vcf_path)
    ], capture_output=True, text=True, check=True)
    
    for line in stats_result.stdout.split('\n'):
        if line.startswith('SN'):
            parts = line.split('\t')
            if len(parts) >= 4 and 'number of' in parts[2]:
                print(f"  {parts[2]}: {parts[3]}")

call_variants(illumina_sorted_bam, illumina_vcf, "Illumina")
call_variants(pacbio_sorted_bam, pacbio_vcf, "PacBio")

print("VARIANT CALLING COMPLETE")
print(f"\nOutput files:")
print(f"  Illumina: {illumina_vcf}")
print(f"  PacBio: {pacbio_vcf}")
print("")
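
The `bcftools filter -s LowQual` step above is a soft filter: it labels records in the FILTER column and does not remove them. It can therefore be useful to see how many calls carry each label before phasing. A minimal sketch, where `count_filters` is a hypothetical helper that reads the (optionally compressed) VCF as text:

```python
import gzip
from collections import Counter

def count_filters(vcf_path):
    """Count VCF records by FILTER value (column 7)."""
    opener = gzip.open if str(vcf_path).endswith(".gz") else open
    counts = Counter()
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta and header lines
            fields = line.rstrip("\n").split("\t")
            counts[fields[6]] += 1
    return counts

# Hypothetical usage with the paths defined above:
# print(count_filters(illumina_vcf))
# print(count_filters(pacbio_vcf))
```

Python's `gzip` module can read the bgzip-compressed output of `-Oz` because bgzip files are valid multi-member gzip files.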

## Step 4: Phase Variants with HapCUT2

Now phase the variant VCFs with HapCUT2. The output of this tool may be in HapCUT block format; if that happens, convert the file to the phased VCF format.

Expected output: two VCF files (one for each sample).

In [None]:
print("PHASING VARIANTS WITH HAPCUT2")

# Define output paths for phasing
illumina_fragments = data_dir / "illumina_fragments.txt"
illumina_phased = data_dir / "illumina_phased.txt"
illumina_phased_vcf = data_dir / "illumina_phased.vcf"

pacbio_fragments = data_dir / "pacbio_fragments.txt"
pacbio_phased = data_dir / "pacbio_phased.txt"
pacbio_phased_vcf = data_dir / "pacbio_phased.vcf"

def phase_variants(bam_path, vcf_path, fragments_path, phased_path, phased_vcf_path, sample_name):
    """Phase variants using HapCUT2"""
    
    if phased_vcf_path.exists():
        print(f"{sample_name} phased VCF already exists: {phased_vcf_path}")
        return
    
    print(f"Phasing variants for {sample_name}")    
    vcf_for_phasing = vcf_path
    temp_vcf = None
    if str(vcf_path).endswith('.gz'):
        print(f"Decompressing VCF for extractHAIRS...")
        temp_vcf = data_dir / f"{sample_name}_temp.vcf"
        subprocess.run([
            "bcftools", "view",
            str(vcf_path),
            "-o", str(temp_vcf)
        ], check=True)
        vcf_for_phasing = temp_vcf
        print(f"Temporary VCF created: {temp_vcf}")
    
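    print(f"Extracting haplotype-informative reads...")
    extracthairs_cmd = [
        "extractHAIRS",
        "--bam", str(bam_path),
        "--VCF", str(vcf_for_phasing),
        "--out", str(fragments_path),
        "--ref", str(chr10_fa_path)
    ]
    # extractHAIRS expects --pacbio 1 for long, error-prone PacBio reads
    if sample_name == "PacBio":
        extracthairs_cmd += ["--pacbio", "1"]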
    
    result = subprocess.run(extracthairs_cmd, capture_output=True, text=True, check=True)
    print(f"Fragment file created: {fragments_path}")
    if result.stderr:
        print(f"  extractHAIRS info: {result.stderr[:200]}...")
    
    print(f"Running HapCUT2 phasing algorithm...")
    hapcut2_cmd = [
        "HAPCUT2",
        "--fragments", str(fragments_path),
        "--VCF", str(vcf_for_phasing),
        "--output", str(phased_path),
        "--outvcf", "1"  # Output VCF directly from HAPCUT2
    ]
    
    result = subprocess.run(hapcut2_cmd, capture_output=True, text=True, check=True)
    print(f"✓ Phased haplotype blocks created: {phased_path}")
    
    # HAPCUT2 with --outvcf 1 writes <output>.phased.VCF (capital VCF) next to the block file
    hapcut2_phased_vcf = phased_path.with_name(f"{phased_path.name}.phased.VCF")
    
    if hapcut2_phased_vcf.exists():
        shutil.move(str(hapcut2_phased_vcf), str(phased_vcf_path))
        print(f"Phased VCF created: {phased_vcf_path}")
    else:
        print(f"Warning: expected phased VCF not found: {hapcut2_phased_vcf}")
    
    if result.stdout:
        print(f"\nPhasing Statistics:")
        for line in result.stdout.split('\n')[:10]:  # Show first 10 lines
            if line.strip():
                print(f"  {line}")
    
    if temp_vcf and temp_vcf.exists():
        temp_vcf.unlink()
        print(f"Cleaned up temporary VCF: {temp_vcf}")
    
    print(f"\nPhasing Summary:")
    with open(phased_path, 'r') as f:
        block_count = 0
        variant_count = 0
        for line in f:
            if line.startswith('BLOCK'):
                block_count += 1
            elif not line.startswith('*') and line.strip():
                variant_count += 1
        print(f"  Phased blocks: {block_count}")
        print(f"  Phased variants: {variant_count}")
        
phase_variants(illumina_sorted_bam, illumina_vcf, illumina_fragments, illumina_phased, illumina_phased_vcf, "Illumina")

phase_variants(pacbio_sorted_bam, pacbio_vcf, pacbio_fragments, pacbio_phased, pacbio_phased_vcf, "PacBio")

print("PHASING COMPLETE")
print(f"\nOutput files:")
print(f"  Illumina phased VCF: {illumina_phased_vcf}")
print(f"  PacBio phased VCF: {pacbio_phased_vcf}")
print("")
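
If only the HapCUT block file is available, the phased genotypes can be recovered from it directly. The sketch below assumes the tab-separated block layout described in the HapCUT2 documentation: variant index, haplotype 1 allele, haplotype 2 allele, chromosome, position, and further columns, with `BLOCK` header lines and `********` separators. It also assumes `-` marks a variant left unphased. `parse_hapcut_blocks` is a hypothetical helper.

```python
def parse_hapcut_blocks(lines):
    """Return (chrom, pos, 'a|b' genotype, phase_set) tuples from HapCUT2 block text."""
    calls = []
    phase_set = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith("*"):
            continue  # blank line or block separator
        if line.startswith("BLOCK"):
            phase_set = None  # new block: phase set is set by its first phased variant
            continue
        fields = line.split("\t")
        hap1, hap2, chrom, pos = fields[1], fields[2], fields[3], int(fields[4])
        if "-" in (hap1, hap2):
            continue  # assumed marker for a variant left unphased
        if phase_set is None:
            phase_set = pos
        calls.append((chrom, pos, f"{hap1}|{hap2}", phase_set))
    return calls

# Hypothetical usage with the block file written above:
# with open(illumina_phased) as handle:
#     calls = parse_hapcut_blocks(handle)
```

Each tuple maps onto a phased VCF record: the `a|b` string becomes the GT field and the phase set becomes PS, so variants in different blocks are not treated as phased relative to each other.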

## Step 5: Compare VCFs and Identify Discordant Variants

Now you should have two phased VCF files (one for each sequencing technology). Compare these VCFs. How many variants are shared between the VCFs? How many are not?

Select 2-3 variants that are not common (if any) and check which technology supports this variant. Open both BAM files in IGV and take a screenshot of each problematic discordant location. What can you deduce from these screenshots—are these variants sequencing-related artifacts or are they indeed true variants?

Do this analysis for every gene.

IGV screenshots can also be automated (it is a bit tricky, though—ask your LLM for help). You can opt out of doing this, but you will lose half a point.

Expected output: Jupyter cell(s) with IGV screenshots and a discussion.
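
One simple way to count shared and technology-specific variants is to key each record by chromosome, position, REF and ALT and compare the resulting sets. A minimal sketch, assuming exact-match comparison is acceptable; `load_variants` and `compare_variant_sets` are hypothetical helpers:

```python
import gzip

def load_variants(vcf_path):
    """Return the set of (chrom, pos, ref, alt) keys in a VCF, one per ALT allele."""
    opener = gzip.open if str(vcf_path).endswith(".gz") else open
    variants = set()
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta and header lines
            chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
            for allele in alt.split(","):
                variants.add((chrom, int(pos), ref, allele))
    return variants

def compare_variant_sets(first, second):
    """Split two variant sets into shared and sample-specific calls."""
    return {
        "shared": first & second,
        "only_first": first - second,
        "only_second": second - first,
    }

# Hypothetical usage with the phased VCFs from Step 4:
# result = compare_variant_sets(load_variants(illumina_phased_vcf),
#                               load_variants(pacbio_phased_vcf))
# print({name: len(calls) for name, calls in result.items()})
```

Exact matching can report the same indel as discordant when the two callers represent it differently, so normalizing both files first (for example with `bcftools norm`) may be worthwhile.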

In [None]:
print("="*60)
print("COMPARING VCFS")
print("="*60)
print("\nTODO: Compare VCFs and identify discordant variants")
print("This step will analyze shared and unique variants between technologies")
print("")
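
IGV screenshots can be automated with an IGV batch script, a plain-text list of commands such as `load`, `goto` and `snapshot`. The sketch below only generates that script. `igv_batch_script` is a hypothetical helper, the 50 bp flank is an arbitrary choice, and the example site is a placeholder, not a real discordant position.

```python
def igv_batch_script(bam_paths, sites, snapshot_dir, genome="hg38", flank=50):
    """Build an IGV batch script that snapshots each (chrom, pos) site."""
    lines = ["new", f"genome {genome}"]
    lines += [f"load {bam}" for bam in bam_paths]
    lines.append(f"snapshotDirectory {snapshot_dir}")
    for chrom, pos in sites:
        start = max(1, pos - flank)
        lines.append(f"goto {chrom}:{start}-{pos + flank}")
        lines.append(f"snapshot {chrom}_{pos}.png")
    lines.append("exit")
    return "\n".join(lines) + "\n"

# Placeholder site for illustration; use the discordant positions found above.
print(igv_batch_script(["illumina_sorted.bam", "pacbio_sorted.bam"],
                       [("chr10", 1000)], "igv_snapshots"))
```

The generated text can be saved to a file and run from IGV's Tools > Run Batch Script menu. Running it headlessly from the command line depends on the local IGV installation.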

## Step 6: Determine Star-Alleles using PharmVar

Can you figure out the star-allele for each gene of interest? The star-allele definitions can be found in PharmVar (see, for example, its page for CYP2C19). Your answer should look something like "CYP2C19*12 because X, Y and Z". This step does not have to be automated, but it should at least be explained in the notebook.

Hint: use phased data!

Expected output: Jupyter cell(s) with discussion (and code, if you want to do it that way).

In [None]:
print("="*60)
print("STAR-ALLELE DETERMINATION")
print("="*60)
print("\nTODO: Determine star-alleles using PharmVar database")
print("This step will identify CYP2C8, CYP2C9, and CYP2C19 star-alleles")
print("")
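
If you choose to automate this step, one approach is to split the phased calls into two haplotypes and test each against a table of star-allele definitions. The sketch below assumes biallelic phased genotypes. The definition table is entirely hypothetical (`*A`, `*B` and their variants are placeholders); real definitions must be taken from PharmVar.

```python
def haplotype_sets(phased_calls):
    """Split (variant, 'a|b') calls into the ALT variants carried by each haplotype."""
    hap1, hap2 = set(), set()
    for variant, genotype in phased_calls:
        if "|" not in genotype:
            continue  # unphased call: cannot be assigned to a haplotype
        first, second = genotype.split("|")
        if first == "1":
            hap1.add(variant)
        if second == "1":
            hap2.add(variant)
    return hap1, hap2

def match_star_alleles(haplotype_variants, definitions):
    """Return star-alleles whose defining variants all occur on the haplotype."""
    matches = [
        name for name, required in definitions.items()
        if required and required <= haplotype_variants
    ]
    # Most specific definition (most defining variants) first
    matches.sort(key=lambda name: len(definitions[name]), reverse=True)
    return matches

# Placeholder definitions and calls, for illustration only.
definitions = {
    "*A": {("chr10", 100, "C", "T")},
    "*B": {("chr10", 100, "C", "T"), ("chr10", 250, "G", "A")},
}
calls = [
    (("chr10", 100, "C", "T"), "1|0"),
    (("chr10", 250, "G", "A"), "1|0"),
    (("chr10", 400, "A", "C"), "0|1"),
]
hap1, hap2 = haplotype_sets(calls)
print(match_star_alleles(hap1, definitions), match_star_alleles(hap2, definitions))
```

An empty result means no listed definition matched that haplotype. Which allele corresponds to the reference sequence differs by gene, so check it in PharmVar before making a call. Phasing matters because two variants define an allele together only if they sit on the same haplotype.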