# Clinical Case Presentation – WGS Variant Analysis: Human
**By:** Bassa Joshua Samuel  
**HackBio Internship – Week 3**  
**Date:** 10/09/2025

---

## Patient Background
- **Patient X (25-year-old male):** Recurrent severe fatigue, jaundice, joint pain, and anemia since childhood.  
- **Laboratory findings:** Hemoglobin 6–8 g/dL, elevated reticulocytes, hemolysis.  
- **Family:** African descent; mother sequenced (Patient Y), father unavailable.  
- **Clinical suspicion:** Genetic cause (hemoglobinopathy).  

## Description
Identify potential causal variants in Patient X from WGS data, following the GATK Best Practices workflow.

---

## Objective
- Identify the causal mutation using WGS data.  
- Annotate and interpret the mutation.  
- Provide clinical recommendations for diagnosis and management.  

---

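## Tools
- **BWA-MEM** (read alignment) and **SAMtools** (BAM conversion, sorting, indexing)  
- **GATK** (Genome Analysis Toolkit): duplicate marking, variant calling, filtering, and Funcotator annotation  
- **Reference genome:** GRCh38 (hg38.fa, .fai, .dict)  
- **Dataset:** Single-end FASTQ reads from `/data/human_stage_1/`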
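
Before running the pipeline it can help to confirm that the tools listed above are installed. The sketch below is one minimal way to do that; `missing_tools` is a hypothetical helper name introduced here for illustration, not part of the original workflow.

```bash
#!/usr/bin/env bash
# Pre-flight check: print the name of every requested tool that is not on PATH.
# missing_tools is a hypothetical helper, not part of the original pipeline.
missing_tools() {
    local tool
    for tool in "$@"; do
        # command -v prints nothing and returns non-zero when a tool is absent
        command -v "${tool}" >/dev/null 2>&1 || printf '%s\n' "${tool}"
    done
}

# Example: list any pipeline tools that still need installing.
# missing_tools bwa samtools gatk
```

An empty result means every tool was found; any names printed need to be installed before the script is run.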

---

### Variant Annotation

Annotate the filtered cohort VCF (`cohort_filtered.vcf.gz`) to flag known pathogenic variants. The pipeline script uses GATK Funcotator; ANNOVAR or Ensembl VEP are alternatives.

### Expected Findings

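Likely mutation in the *HBB* gene (e.g., rs334; p.Glu6Val in legacy numbering, p.Glu7Val in HGVS numbering), consistent with sickle cell disease.

Compare the genotypes of Patient X and his mother (Patient Y) at this site to confirm the inheritance pattern: a homozygous proband with a heterozygous (carrier) mother fits autosomal recessive inheritance.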
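
The proband-versus-mother comparison can be sketched with a small helper that prints each sample's genotype at one site of a multi-sample VCF. `genotype_at` is a hypothetical name used only for this illustration. The coordinate in the usage comment (chr11:5227002) is the GRCh38 position commonly reported for rs334 and should be verified against dbSNP before use. The labels assume a biallelic site.

```bash
#!/usr/bin/env bash
# Print sample name, genotype, and a zygosity label for one VCF site.
# Usage: genotype_at CHROM POS < file.vcf   (uncompressed VCF on stdin)
# genotype_at is a hypothetical helper, not part of the original pipeline.
genotype_at() {
    awk -F'\t' -v chrom="$1" -v pos="$2" '
        /^#CHROM/ { for (i = 10; i <= NF; i++) name[i] = $i; next }
        /^#/      { next }
        $1 == chrom && $2 == pos {
            for (i = 10; i <= NF; i++) {
                split($i, field, ":")      # GT is the first FORMAT field
                gt = field[1]
                gsub(/[|]/, "/", gt)       # treat phased and unphased alike
                if (gt == "0/0")                      label = "homozygous reference"
                else if (gt == "0/1" || gt == "1/0")  label = "heterozygous (carrier)"
                else if (gt == "1/1")                 label = "homozygous alternate"
                else                                  label = "other/no call"
                printf "%s\t%s\t%s\n", name[i], gt, label
            }
        }'
}

# Example (position is an assumption; confirm it in dbSNP):
# zcat variants/cohort_filtered.vcf.gz | genotype_at chr11 5227002
```

The output has one line per sample, so Patient X and Patient Y can be read side by side.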

### Clinical Recommendations

**Confirmatory test:** Hemoglobin electrophoresis or a targeted PCR assay for the *HBB* mutation.

### Management Strategies:

- Hydroxyurea therapy (reduces sickle crises).

- Regular monitoring and transfusion support.

- Genetic counseling for patient and family.

- Consider curative approaches such as stem cell transplantation or emerging gene therapy.


In [None]:
#!/bin/bash
set -euo pipefail  # Safer execution: stop on error, undefined var, or failed pipe
#---------------------------
# Input & Reference Setup
#---------------------------
DATA_DIR="/data/human_stage_1"
REF_DIR="/data/ref"
REF="${REF_DIR}/hg38.fa"
SAMPLES=("PatientX" "PatientY")   # Patient X = proband, Patient Y = mother

#---------------------------
# Step 1: Index Reference Genome
#---------------------------
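echo "[INFO] Indexing reference genome..."
# Build each index only if it is missing; CreateSequenceDictionary
# exits with an error when the .dict file already exists.
[[ -f "${REF}.bwt" ]] || bwa index "${REF}"
[[ -f "${REF}.fai" ]] || samtools faidx "${REF}"
[[ -f "${REF%.fa}.dict" ]] || gatk CreateSequenceDictionary -R "${REF}"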

#---------------------------
# Step 2: Align Reads to Reference
#---------------------------
mkdir -p alignments
for SAMPLE in "${SAMPLES[@]}"; do
    echo "[INFO] Aligning ${SAMPLE}..."
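    # GATK tools require read groups. The ID, LB, and PL values below are
    # placeholders (Illumina assumed); replace them with the real run details.
    bwa mem -t 8 \
        -R "@RG\tID:${SAMPLE}\tSM:${SAMPLE}\tLB:${SAMPLE}_lib\tPL:ILLUMINA" \
        "${REF}" \
        "${DATA_DIR}/${SAMPLE}.fastq.gz" \
        | samtools view -b -o "alignments/${SAMPLE}.bam" -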
done

#---------------------------
# Step 3: Sort and Mark Duplicates
#---------------------------
mkdir -p processed
for SAMPLE in "${SAMPLES[@]}"; do
    echo "[INFO] Processing ${SAMPLE} BAM..."
    samtools sort -o "processed/${SAMPLE}_sorted.bam" "alignments/${SAMPLE}.bam"
    gatk MarkDuplicates \
        -I "processed/${SAMPLE}_sorted.bam" \
        -O "processed/${SAMPLE}_dedup.bam" \
        -M "processed/${SAMPLE}_dup_metrics.txt"
    samtools index "processed/${SAMPLE}_dedup.bam"
done

#---------------------------
# Step 4: Variant Calling
#---------------------------
mkdir -p variants
for SAMPLE in "${SAMPLES[@]}"; do
    echo "[INFO] Calling variants for ${SAMPLE}..."
    gatk HaplotypeCaller \
        -R "${REF}" \
        -I "processed/${SAMPLE}_dedup.bam" \
        -O "variants/${SAMPLE}.g.vcf.gz" \
        -ERC GVCF
done

# Joint Genotyping
gatk CombineGVCFs -R "${REF}" \
    --variant variants/PatientX.g.vcf.gz \
    --variant variants/PatientY.g.vcf.gz \
    -O variants/cohort.g.vcf.gz

gatk GenotypeGVCFs -R "${REF}" \
    -V variants/cohort.g.vcf.gz \
    -O variants/cohort.vcf.gz

#---------------------------
# Step 5: Variant Filtering
#---------------------------
echo "[INFO] Filtering variants..."
gatk VariantFiltration \
    -R "${REF}" \
    -V variants/cohort.vcf.gz \
    -O variants/cohort_filtered.vcf.gz \
    --filter-expression "QD < 2.0 || FS > 60.0 || MQ < 40.0" \
    --filter-name "LOW_QUALITY"

#---------------------------
# Step 6: Variant Annotation
#---------------------------
echo "[INFO] Annotating variants..."
mkdir -p annotation
# Funcotator needs a pre-downloaded data-sources bundle (which may include ClinVar).
# Replace the path below with the actual location of that bundle.
gatk Funcotator \
    -R "${REF}" \
    -V variants/cohort_filtered.vcf.gz \
    -O annotation/cohort_annotated.vcf.gz \
    --output-file-format VCF \
    --ref-version hg38 \
    --data-sources-path /data/ref/funcotator_dataSources

#---------------------------
# Step 7: Generate Clinical Report
#---------------------------
echo "[INFO] Generating clinical report..."
zgrep -v "^#" annotation/cohort_annotated.vcf.gz \
    | awk '$7 == "PASS"' \
    | cut -f1-8 \
    > annotation/clinical_report.txt

echo "[INFO] Pipeline completed. Results in annotation/clinical_report.txt"
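
Because the clinical suspicion is a hemoglobinopathy, the genome-wide report can be narrowed to the *HBB* locus. The sketch below assumes the tab-separated layout written to `annotation/clinical_report.txt` (chromosome in column 1, position in column 2). `records_in_region` is a hypothetical helper, and the *HBB* coordinates in the usage comment are approximate GRCh38 values that should be checked against a current gene annotation.

```bash
#!/usr/bin/env bash
# Keep only the report rows that fall inside a genomic interval.
# Usage: records_in_region CHROM START END < report.txt
# records_in_region is a hypothetical helper, not part of the original pipeline.
records_in_region() {
    awk -F'\t' -v chrom="$1" -v start="$2" -v end="$3" '
        # Column 1 is the chromosome, column 2 the 1-based position
        $1 == chrom && $2 + 0 >= start + 0 && $2 + 0 <= end + 0
    '
}

# Example (approximate HBB interval on GRCh38; verify before use):
# records_in_region chr11 5225464 5229395 < annotation/clinical_report.txt
```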
