# Week 5: Pharmacogenomics Analysis

## Analysis Pipeline
1. Download reference genome (chromosome 10)
2. Align Illumina and PacBio reads with minimap2
3. Call variants with bcftools
4. Phase variants with HapCUT2
5. Compare VCFs and identify discordant variants
6. Determine star-alleles using PharmVar database

### Import Dependencies

In [1]:
import os
import subprocess
import pandas as pd
from pathlib import Path
import gzip
from collections import defaultdict
import urllib.request
import shutil
import bz2
import platform

### Set up working directories

In [2]:
notebook_dir = Path.cwd()
data_dir = notebook_dir.parent / "data"
data_dir.mkdir(parents=True, exist_ok=True)

### Load Sequence Data

In [3]:
import bz2
import shutil
from pathlib import Path

illumina_path = Path('../data/illumina.fq.bz2')
pacbio_path = Path('../data/pacbio.fq.bz2')

illumina_out = illumina_path.with_suffix('')
pacbio_out = pacbio_path.with_suffix('')

print(f"Illumina data: {illumina_path.exists()}")
print(f"PacBio data: {pacbio_path.exists()}")

if not illumina_out.exists():
    print("Decompressing Illumina data")
    with bz2.open(illumina_path, 'rb') as f_in, open(illumina_out, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
    print("Illumina decompressed to", illumina_out)
else:
    print("Illumina already decompressed")

if not pacbio_out.exists():
    print("Decompressing PacBio data")
    with bz2.open(pacbio_path, 'rb') as f_in, open(pacbio_out, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
    print("PacBio decompressed to", pacbio_out)
else:
    print("PacBio already decompressed")


Illumina data: True
PacBio data: True
Illumina already decompressed
PacBio already decompressed


## Step 1: Download Reference Genome

The genes of interest are all located on chromosome 10:

- CYP2C8
- CYP2C9
- CYP2C19

Locate these genes in the reference genome using the Genome Browser. We will use the hg38 (GRCh38) version of the human genome, so make sure to download the matching reference.

Note: you do not need the whole human genome for this step, only the chromosome that contains these genes. The reference should be a single FASTA file (extension: .fa or .fasta).

Expected output: a single FASTA file.

In [4]:
# Download chromosome 10 reference genome
chr10_url = "https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr10.fa.gz"
chr10_gz_path = data_dir / "chr10.fa.gz"
chr10_fa_path = data_dir / "chr10.fa"

print("DOWNLOADING REFERENCE GENOME")
if not chr10_fa_path.exists():
    print(f"\nDownloading chromosome 10 from UCSC Genome Browser")
    print(f"Source: {chr10_url}")
    print(f"Target: {chr10_gz_path}")
    
    try:
        urllib.request.urlretrieve(chr10_url, chr10_gz_path)
        print("\nDownload complete")
        print(f"Decompressing {chr10_gz_path.name}")
        with gzip.open(chr10_gz_path, 'rb') as f_in:
            with open(chr10_fa_path, 'wb') as f_out:
                shutil.copyfileobj(f_in, f_out)
        
        print(f"Decompression complete")
        print(f"  Output file: {chr10_fa_path}")
                
    except Exception as e:
        print(f"\nError during download: {e}")
        raise
else:
    print("File already exists.")

DOWNLOADING REFERENCE GENOME
File already exists.
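
A quick sanity check on the downloaded reference can catch a truncated or mislabeled file before alignment. The sketch below is a minimal plain-Python reader; the function name `fasta_summary` is introduced here for illustration and is not part of the assignment.

```python
def fasta_summary(fasta_path):
    """Return {sequence_name: length} for every record in a FASTA file."""
    lengths = {}
    name = None
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Header line: keep only the first whitespace-delimited token
                fields = line[1:].split()
                name = fields[0] if fields else ""
                lengths[name] = 0
            elif name is not None:
                # Sequence line: add its length to the current record
                lengths[name] += len(line)
    return lengths
```

Calling `fasta_summary(chr10_fa_path)` should report exactly one record. Its length can be compared with the `total length: 133797422` that minimap2 prints in its log when it loads the index.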


## Step 2: Align Illumina and PacBio reads with minimap2

Align all samples in FASTQ format to the human genome (version GRCh38). Use minimap2 for the alignment. Make sure to use appropriate parameters for each technology.

Note: the reference is the single-chromosome FASTA file from Step 1 (extension: .fa); the whole human genome is not needed.

Expected output: two BAM files and two BAI files (one of each per sample).

### Install minimap2

In [5]:
%%bash
# Local install (no sudo) — builds minimap2 from source into ~/bin
mkdir -p ~/bin
cd /tmp
git clone https://github.com/lh3/minimap2.git
cd minimap2
make
cp minimap2 ~/bin/

# Add to PATH for this session
export PATH="$HOME/bin:$PATH"
echo "export PATH=\"$HOME/bin:\$PATH\"" >> ~/.bashrc

# Verify installation
~/bin/minimap2 --version


fatal: destination path 'minimap2' already exists and is not an empty directory.


make: Nothing to be done for 'all'.
2.30-r1287


### Index Reference Genome

In [6]:
# Index the reference genome for minimap2
print("INDEXING REFERENCE GENOME")
chr10_mmi_path = data_dir / "chr10.mmi"

if not chr10_mmi_path.exists():
    print(f"\nCreating minimap2 index for {chr10_fa_path}")
    cmd = [
        "minimap2",
        "-d", str(chr10_mmi_path),
        str(chr10_fa_path)
    ]
    
    try:
        subprocess.run(cmd, check=True)
        print(f"Index created: {chr10_mmi_path}")
    except subprocess.CalledProcessError as e:
        print(f"Error creating index: {e}")
        raise
else:
    print(f"Index already exists: {chr10_mmi_path}")

INDEXING REFERENCE GENOME
Index already exists: /mnt/c/Users/Nick/Documents/GitHub/fall25-csc-bioinf/week5/data/chr10.mmi


### Align Illumina Reads

In [7]:
# Align Illumina reads (short, accurate reads)
print("\nALIGNING ILLUMINA READS")

illumina_sam = data_dir / "illumina.sam"
illumina_bam = data_dir / "illumina.bam"
illumina_sorted_bam = data_dir / "illumina_sorted.bam"
illumina_bai = data_dir / "illumina_sorted.bam.bai"

if not illumina_sorted_bam.exists():
    print("Aligning Illumina reads with minimap2 (sr preset for short reads)")
    
    # Align with minimap2 using short-read preset
    align_cmd = [
        "minimap2",
        "-ax", "sr",  # short-read preset (suits Illumina reads)
        "-t", "4",     # threads
        str(chr10_mmi_path),
        str(illumina_out)
    ]
    
    with open(illumina_sam, 'w') as sam_file:
        subprocess.run(align_cmd, stdout=sam_file, check=True)
    
    print(f"Alignment complete: {illumina_sam}")
    
    # Convert SAM to BAM
    print("Converting SAM to BAM")
    subprocess.run([
        "samtools", "view",
        "-b", "-o", str(illumina_bam),
        str(illumina_sam)
    ], check=True)
    
    # Sort BAM
    print("Sorting BAM file")
    subprocess.run([
        "samtools", "sort",
        "-o", str(illumina_sorted_bam),
        str(illumina_bam)
    ], check=True)
    
    # Index BAM
    print("Indexing BAM file")
    subprocess.run([
        "samtools", "index",
        str(illumina_sorted_bam)
    ], check=True)
    
    # Clean up intermediate files
    illumina_sam.unlink()
    illumina_bam.unlink()
    
    print(f"Final output: {illumina_sorted_bam}")
    print(f"Index: {illumina_bai}")
else:
    print(f"Illumina alignment already exists: {illumina_sorted_bam}")


ALIGNING ILLUMINA READS
Aligning Illumina reads with minimap2 (sr preset for short reads)


[M::main::59.279*0.10] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::59.279*0.10] mid_occ = 1000
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::59.451*0.11] distinct minimizers: 16061920 (79.91% are singletons); average occurrences: 1.563; average spacing: 5.329; total length: 133797422
[M::worker_pipeline::100.235*0.50] mapped 309505 sequences
[M::main] Version: 2.26-r1175
[M::main] CMD: minimap2 -ax sr -t 4 /mnt/c/Users/Nick/Documents/GitHub/fall25-csc-bioinf/week5/data/chr10.mmi ../data/illumina.fq
[M::main] Real time: 100.258 sec; CPU: 50.300 sec; Peak RSS: 0.925 GB


Alignment complete: /mnt/c/Users/Nick/Documents/GitHub/fall25-csc-bioinf/week5/data/illumina.sam
Converting SAM to BAM
Sorting BAM file
Indexing BAM file
Final output: /mnt/c/Users/Nick/Documents/GitHub/fall25-csc-bioinf/week5/data/illumina_sorted.bam
Index: /mnt/c/Users/Nick/Documents/GitHub/fall25-csc-bioinf/week5/data/illumina_sorted.bam.bai


### Align PacBio Reads

In [8]:
# Align PacBio reads (long, less accurate reads)
print("\nALIGNING PACBIO READS")

pacbio_sam = data_dir / "pacbio.sam"
pacbio_bam = data_dir / "pacbio.bam"
pacbio_sorted_bam = data_dir / "pacbio_sorted.bam"
pacbio_bai = data_dir / "pacbio_sorted.bam.bai"

if not pacbio_sorted_bam.exists():
    print("Aligning PacBio reads with minimap2 (map-pb preset for PacBio)")
    
    # Align with minimap2 using PacBio preset
    align_cmd = [
        "minimap2",
        "-ax", "map-pb",  # PacBio CLR reads preset
        "-t", "4",         # threads
        str(chr10_mmi_path),
        str(pacbio_out)
    ]
    
    with open(pacbio_sam, 'w') as sam_file:
        subprocess.run(align_cmd, stdout=sam_file, check=True)
    
    print(f"Alignment complete: {pacbio_sam}")
    
    # Convert SAM to BAM
    print("Converting SAM to BAM")
    subprocess.run([
        "samtools", "view",
        "-b", "-o", str(pacbio_bam),
        str(pacbio_sam)
    ], check=True)
    
    # Sort BAM
    print("Sorting BAM file")
    subprocess.run([
        "samtools", "sort",
        "-o", str(pacbio_sorted_bam),
        str(pacbio_bam)
    ], check=True)
    
    # Index BAM
    print("Indexing BAM file")
    subprocess.run([
        "samtools", "index",
        str(pacbio_sorted_bam)
    ], check=True)
    
    # Clean up intermediate files
    pacbio_sam.unlink()
    pacbio_bam.unlink()
    
    print(f"Final output: {pacbio_sorted_bam}")
    print(f"Index: {pacbio_bai}")
else:
    print(f"PacBio alignment already exists: {pacbio_sorted_bam}")


ALIGNING PACBIO READS
Aligning PacBio reads with minimap2 (map-pb preset for PacBio)


[M::main::59.848*0.10] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::60.150*0.11] mid_occ = 178
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::60.313*0.11] distinct minimizers: 16061920 (79.91% are singletons); average occurrences: 1.563; average spacing: 5.329; total length: 133797422
[M::worker_pipeline::62.874*0.19] mapped 3063 sequences
[M::main] Version: 2.26-r1175
[M::main] CMD: minimap2 -ax map-pb -t 4 /mnt/c/Users/Nick/Documents/GitHub/fall25-csc-bioinf/week5/data/chr10.mmi ../data/pacbio.fq
[M::main] Real time: 62.896 sec; CPU: 12.253 sec; Peak RSS: 0.748 GB


Alignment complete: /mnt/c/Users/Nick/Documents/GitHub/fall25-csc-bioinf/week5/data/pacbio.sam
Converting SAM to BAM
Sorting BAM file
Indexing BAM file
Final output: /mnt/c/Users/Nick/Documents/GitHub/fall25-csc-bioinf/week5/data/pacbio_sorted.bam
Index: /mnt/c/Users/Nick/Documents/GitHub/fall25-csc-bioinf/week5/data/pacbio_sorted.bam.bai


### Verify output

In [10]:
# Verify all output files exist
print("\nVERIFYING OUTPUT FILES")
print(f"Illumina BAM: {illumina_sorted_bam.exists()}")
print(f"Illumina BAI: {illumina_bai.exists()}")
print(f"PacBio BAM: {pacbio_sorted_bam.exists()}")
print(f"PacBio BAI: {pacbio_bai.exists()}")

if all([illumina_sorted_bam.exists(), illumina_bai.exists(), 
        pacbio_sorted_bam.exists(), pacbio_bai.exists()]):
    print("\nAll alignment files successfully created!")


VERIFYING OUTPUT FILES
Illumina BAM: True
Illumina BAI: True
PacBio BAM: True
PacBio BAI: True

All alignment files successfully created!
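
The two alignment cells differ only in the preset and the file names, so they could share one helper. The sketch below is one possible refactoring, not the method the notebook actually ran: `MINIMAP2_PRESETS`, `alignment_commands` and `run_alignment` are names made up here, and it pipes minimap2 straight into `samtools sort` instead of writing intermediate SAM and BAM files. One thing to be aware of: the minimap2 logs show k-mer size 15 for both runs, which comes from the shared `chr10.mmi`. minimap2 takes its indexing parameters from a prebuilt index rather than from the preset, so passing the FASTA file as `reference` may be worth trying if preset-specific indexing matters.

```python
import subprocess

# Presets already used in this notebook; other technologies need other presets.
MINIMAP2_PRESETS = {"illumina": "sr", "pacbio": "map-pb"}


def alignment_commands(technology, reference, fastq, sorted_bam, threads=4):
    """Build the minimap2, samtools sort and samtools index commands for one sample."""
    preset = MINIMAP2_PRESETS[technology]
    align = ["minimap2", "-ax", preset, "-t", str(threads), str(reference), str(fastq)]
    # "-" makes samtools sort read the SAM stream from standard input
    sort = ["samtools", "sort", "-o", str(sorted_bam), "-"]
    index = ["samtools", "index", str(sorted_bam)]
    return align, sort, index


def run_alignment(technology, reference, fastq, sorted_bam, threads=4):
    """Run minimap2 piped into samtools sort, then index the sorted BAM."""
    align, sort, index = alignment_commands(technology, reference, fastq, sorted_bam, threads)
    aligner = subprocess.Popen(align, stdout=subprocess.PIPE)
    try:
        subprocess.run(sort, stdin=aligner.stdout, check=True)
    finally:
        aligner.stdout.close()
    if aligner.wait() != 0:
        raise subprocess.CalledProcessError(aligner.returncode, align)
    subprocess.run(index, check=True)
```

With this helper, `run_alignment("pacbio", chr10_fa_path, pacbio_out, pacbio_sorted_bam)` would replace the body of the PacBio cell, and the Illumina call would differ only in its arguments.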


## Step 3: Call variants with bcftools

Find all variants (i.e., call them) in each sample for all genes of interest, and obtain the resulting VCF file. You can use either bcftools or FreeBayes. Those with suppressed masochistic tendencies are welcome to use GATK, but be warned: that tool is the epitome of (needless) complexity.

Expected output: two VCF files (one for each sample).
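
One way to approach this step with bcftools is the usual `mpileup | call` pipe, restricted to the region that holds the genes. The sketch below only assembles and runs the commands; the function names are invented for illustration, the `region` string has to be filled in with coordinates looked up in the Genome Browser, and the default mpileup settings are an assumption that may need tuning for PacBio reads.

```python
import subprocess


def bcftools_commands(reference_fa, bam_path, region, out_vcf):
    """Build the bcftools mpileup and call commands for one sample."""
    mpileup = [
        "bcftools", "mpileup",
        "-Ou",                    # uncompressed BCF, suited to piping
        "-f", str(reference_fa),  # reference FASTA
        "-r", region,             # e.g. "chr10:START-END"
        str(bam_path),
    ]
    call = [
        "bcftools", "call",
        "-mv",                    # multiallelic caller, variant sites only
        "-Ov",                    # plain-text VCF output
        "-o", str(out_vcf),
    ]
    return mpileup, call


def call_variants(reference_fa, bam_path, region, out_vcf):
    """Run mpileup piped into call and write a VCF file."""
    mpileup, call = bcftools_commands(reference_fa, bam_path, region, out_vcf)
    pileup = subprocess.Popen(mpileup, stdout=subprocess.PIPE)
    try:
        subprocess.run(call, stdin=pileup.stdout, check=True)
    finally:
        pileup.stdout.close()
    if pileup.wait() != 0:
        raise subprocess.CalledProcessError(pileup.returncode, mpileup)
```

Running `call_variants` once per sorted BAM would give the two expected VCF files. Plain-text VCF is chosen here because it is convenient as input for the phasing step.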

## Step 4: Phase variants with HapCUT2

Now phase the variant VCFs with HapCUT2 or HapTree-X. The output of these tools may be in HapCUT block format; if that happens, convert this file to the phased VCF format.

Expected output: two VCF files (one for each sample).
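
If the phasing tool only produces a HapCUT block file, the conversion to a phased VCF can be sketched in plain Python. The code below assumes the block layout described in the HapCUT2 documentation: a `BLOCK:` header line, then one whitespace-separated line per variant with the haplotype-1 allele, haplotype-2 allele, chromosome and position in columns 2 to 5, and a line of asterisks closing each block. It also assumes a single-sample VCF. Both function names are illustrative. Some HapCUT2 releases can also write a phased VCF directly, which would make this conversion unnecessary.

```python
def parse_hapcut2_blocks(lines):
    """Map (chrom, pos) -> (hap1_allele, hap2_allele, phase_set)."""
    phased = {}
    phase_set = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("BLOCK") or line.startswith("*"):
            phase_set = None  # a new block starts or the current one ends
            continue
        fields = line.split()
        hap1, hap2, chrom, pos = fields[1], fields[2], fields[3], int(fields[4])
        if hap1 == "-" or hap2 == "-":
            continue  # allele pruned by the phaser: leave it unphased
        if phase_set is None:
            phase_set = pos  # name the phase set after its first phased variant
        phased[(chrom, pos)] = (hap1, hap2, phase_set)
    return phased


def phase_vcf_lines(vcf_lines, phased):
    """Yield VCF lines with GT rewritten as a|b and a PS tag for phased sites."""
    seen_ps_header = False
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            seen_ps_header = seen_ps_header or line.startswith("##FORMAT=<ID=PS,")
            yield line
            continue
        if line.startswith("#"):
            if not seen_ps_header:
                yield '##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase set">'
            yield line
            continue
        cols = line.split("\t")
        key = (cols[0], int(cols[1]))
        if key in phased and len(cols) >= 10:
            keys, values = cols[8].split(":"), cols[9].split(":")
            if "GT" in keys and len(keys) == len(values):
                hap1, hap2, phase_set = phased[key]
                values[keys.index("GT")] = f"{hap1}|{hap2}"
                if "PS" in keys:
                    values[keys.index("PS")] = str(phase_set)
                else:
                    keys.append("PS")
                    values.append(str(phase_set))
                cols[8], cols[9] = ":".join(keys), ":".join(values)
        yield "\t".join(cols)
```

Sites missing from the block file, such as homozygous calls, pass through unchanged, so the output keeps every record from the unphased VCF.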

## Step 5: Compare VCFs and identify discordant variants

Now you should have two phased VCF files (one for each sequencing technology). Compare these VCFs. How many variants are shared between the VCFs? How many are not?
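
The shared and technology-specific counts can be obtained by keying each variant on chromosome, position, reference allele and alternate allele and then using set operations. The sketch below is a minimal version that assumes single-sample VCFs with GT as the first FORMAT field; the function names are illustrative, and indels written differently by the two callers would need normalizing first (for example with `bcftools norm`).

```python
import gzip


def read_variants(vcf_path):
    """Return {(chrom, pos, ref, alt): genotype} for a single-sample VCF."""
    opener = gzip.open if str(vcf_path).endswith(".gz") else open
    variants = {}
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            cols = line.rstrip("\n").split("\t")
            chrom, pos, ref, alts = cols[0], int(cols[1]), cols[3], cols[4]
            genotype = cols[9].split(":")[0] if len(cols) > 9 else "."
            # Multi-allelic records are split into one key per ALT allele
            for alt in alts.split(","):
                variants[(chrom, pos, ref, alt)] = genotype
    return variants


def compare_variants(first, second):
    """Return sorted lists of keys: shared, only in first, only in second."""
    shared = sorted(set(first) & set(second))
    only_first = sorted(set(first) - set(second))
    only_second = sorted(set(second) - set(first))
    return shared, only_first, only_second
```

The lengths of the three returned lists answer the counting questions, and the two "only" lists are the candidates to inspect in IGV.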

Select 2-3 discordant variants (if any) and check which technology supports each one. Open both BAM files in IGV and take a screenshot of each discordant location. What can you deduce from these screenshots: are these variants sequencing-related artifacts, or are they true variants?

Do this analysis for every gene.
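
Repeating the comparison per gene amounts to assigning each variant position to the gene whose interval contains it. The sketch below groups variant keys by gene. `group_by_gene` is a name made up here, and the coordinates in the usage comment are toy values rather than real gene positions, which should come from the Genome Browser.

```python
from collections import defaultdict


def group_by_gene(variant_keys, gene_regions):
    """Group (chrom, pos, ...) tuples by the gene interval that contains them.

    gene_regions maps gene name -> (chrom, start, end), 1-based inclusive.
    Variants outside every interval are collected under the key None.
    """
    grouped = defaultdict(list)
    for key in variant_keys:
        chrom, pos = key[0], key[1]
        gene = None
        for name, (g_chrom, start, end) in gene_regions.items():
            if chrom == g_chrom and start <= pos <= end:
                gene = name
                break
        grouped[gene].append(key)
    return dict(grouped)


# Usage with toy coordinates (NOT the real gene positions):
# regions = {"GENE_A": ("chr10", 100, 200), "GENE_B": ("chr10", 300, 400)}
# group_by_gene([("chr10", 150, "A", "G")], regions)
```

Applying the same grouping to the shared and the discordant variant lists gives per-gene counts for each category.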

Expected output: Jupyter cell(s) with IGV screenshots and a discussion.

## Step 6: Determine star-alleles using PharmVar database

Can you figure out the star-allele for each gene of interest? The star-allele definitions can be found in the PharmVar database; see, for example, its page for CYP2C19. Your answer should be something like "CYP2C19*12 because X, Y and Z". This step does not have to be automated, but it should at least be explained in the notebook.

Hint: use phased data!

Expected output: Jupyter cell(s) with discussion (and code, if you want to do it that way).
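
For those who want to automate part of this step, one simple approach is to split the phased calls into two haplotypes and check which star-allele definitions are fully contained in each one. The sketch below assumes biallelic sites and treats a haplotype with no matching definition as the reference allele `*1`. The function names and the definitions in the usage comment are hypothetical placeholders; real definitions have to be taken from PharmVar.

```python
REFERENCE_ALLELE = "*1"  # assumption: no defining variants means the reference allele


def split_haplotypes(records):
    """Split (variant_id, genotype) pairs into two sets of ALT-carrying variants."""
    hap1, hap2 = set(), set()
    for variant, genotype in records:
        if genotype in ("1/1", "1|1"):
            # Homozygous ALT sits on both haplotypes, phased or not
            hap1.add(variant)
            hap2.add(variant)
        elif genotype == "1|0":
            hap1.add(variant)
        elif genotype == "0|1":
            hap2.add(variant)
        # Unphased heterozygous calls (0/1) cannot be assigned and are skipped
    return hap1, hap2


def call_star_allele(haplotype, definitions):
    """Pick the definition with the most defining variants present on the haplotype."""
    candidates = [
        name for name, required in definitions.items()
        if required and set(required) <= haplotype
    ]
    if not candidates:
        return REFERENCE_ALLELE
    # Prefer the most specific match; ties are broken by name
    return max(candidates, key=lambda name: (len(definitions[name]), name))


# Usage with made-up definitions (NOT PharmVar data):
# definitions = {"*X": {"v1"}, "*Y": {"v1", "v2"}}
# hap1, hap2 = split_haplotypes([("v1", "1|0"), ("v2", "1|0")])
# call_star_allele(hap1, definitions), call_star_allele(hap2, definitions)
```

The phase information matters here: two heterozygous variants on the same haplotype can define a different star-allele pair than the same two variants on opposite haplotypes.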