# BAM to GedMatch Conversion (Google Colab)

Convert Bodzia cemetery BAM files to GedMatch-compatible 23andMe format.

**Requirements:** Run this notebook in Google Colab with Google Drive mounted.

**Samples available:**
- VK157 (Elite male - proposed Sviatopolk I)
- VK158, VK159, VK166
- VK50, VK53, VK110, VK146, VK147, VK201, VK202, VK234

In [None]:
# Step 1: Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Step 2: Install bioinformatics tools
!apt-get update -qq
!apt-get install -qq bcftools samtools tabix plink1.9
!ln -sf /usr/bin/plink1.9 /usr/bin/plink

# Verify installation
!bcftools --version | head -1
!samtools --version | head -1

In [None]:
# Step 3: Download reference genome (GRCh37) - ~3GB, takes ~10 min
!mkdir -p /content/reference
!wget -q -c ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz -O /content/reference/human_g1k_v37.fasta.gz
!gunzip -f /content/reference/human_g1k_v37.fasta.gz
!samtools faidx /content/reference/human_g1k_v37.fasta
print("Reference genome ready!")

In [None]:
# Step 4: Download dbSNP for rsID annotation - ~10GB, takes ~30 min
# (Needed for Steps 8-9: Step 9 keeps only SNPs that carry an rsID,
#  so skipping this download leaves the final file empty)
!mkdir -p /content/dbsnp
!wget -q -c ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/GATK/All_20180423.vcf.gz -O /content/dbsnp/All_20180423.vcf.gz
!wget -q -c ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/GATK/All_20180423.vcf.gz.tbi -O /content/dbsnp/All_20180423.vcf.gz.tbi
print("dbSNP ready!")
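
Step 8 matches dbSNP records to the called variants by chromosome name and position, so the dbSNP VCF and the reference FASTA need to name contigs the same way (`1` vs. `chr1`). The following is a minimal sketch of a sanity check, assuming the contig names have already been read into Python lists. The helper names `naming_style` and `check_contig_naming` are illustrative and not part of bcftools or any other tool.

```python
def naming_style(contigs):
    """Classify contig naming by looking for chromosome 1 under either convention."""
    names = set(contigs)
    has_plain = "1" in names
    has_chr = "chr1" in names
    if has_plain and has_chr:
        return "mixed"
    if has_plain:
        return "plain"
    if has_chr:
        return "chr"
    return "unknown"


def check_contig_naming(ref_contigs, dbsnp_contigs):
    """Compare naming styles and count the contig names both sources share."""
    ref_style = naming_style(ref_contigs)
    dbsnp_style = naming_style(dbsnp_contigs)
    shared = set(ref_contigs) & set(dbsnp_contigs)
    return {
        "ref_style": ref_style,
        "dbsnp_style": dbsnp_style,
        "shared": len(shared),
        # Compatible only if the styles agree and at least one name overlaps
        "compatible": ref_style == dbsnp_style and len(shared) > 0,
    }


# Illustrative inputs, not read from the real files
report = check_contig_naming(["1", "2", "MT"], ["chr1", "chr2"])
print(report)
```

In the notebook, the two lists could be collected with `ref_contigs = !cut -f1 "{REF_GENOME}.fai"` and `dbsnp_contigs = !tabix -l "{DBSNP}"`. If the styles turn out to differ, one option is to rename the contigs in one of the files with `bcftools annotate --rename-chrs` before running Step 8.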

In [None]:
# Configuration - UPDATE THESE PATHS
import os

# Path to your BAM files in Google Drive
# Adjust this to match your folder structure
DRIVE_BAM_FOLDER = "/content/drive/MyDrive/YOUR_FOLDER_WITH_BAM_FILES"

# Which sample to process (change as needed)
SAMPLE = "VK157"  # Options: VK157, VK158, VK159, VK166, etc.

# Derived paths
INPUT_BAM = f"{DRIVE_BAM_FOLDER}/{SAMPLE}.final.bam"
OUTPUT_DIR = "/content/output"
REF_GENOME = "/content/reference/human_g1k_v37.fasta"
DBSNP = "/content/dbsnp/All_20180423.vcf.gz"

os.makedirs(OUTPUT_DIR, exist_ok=True)

# Verify BAM exists
if os.path.exists(INPUT_BAM):
    print(f"✓ Found: {INPUT_BAM}")
    !ls -lh "{INPUT_BAM}"
else:
    print(f"✗ NOT FOUND: {INPUT_BAM}")
    print("\nSearching for .bam files in Drive...")
    !find /content/drive/MyDrive -name "*.final.bam" 2>/dev/null | head -20

In [None]:
# Step 5: Index BAM if needed
import os

bai_file = f"{INPUT_BAM}.bai"
if not os.path.exists(bai_file):
    print("Indexing BAM file...")
    !samtools index "{INPUT_BAM}"
else:
    print("BAM index already exists")

In [None]:
# Step 6: Variant calling (autosomal chromosomes 1-22)
# This is the longest step - may take 30-90 minutes for WGS
# Ancient DNA with low coverage will be faster

print(f"Starting variant calling for {SAMPLE}...")
print("This may take 30-90 minutes depending on file size.")

!bcftools mpileup -Ou -f "{REF_GENOME}" \
    -r 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22 \
    --max-depth 250 \
    --min-MQ 20 \
    "{INPUT_BAM}" 2>/dev/null | \
bcftools call -mv -Oz -o "{OUTPUT_DIR}/{SAMPLE}_raw.vcf.gz"

!bcftools index "{OUTPUT_DIR}/{SAMPLE}_raw.vcf.gz"
print("\n✓ Variant calling complete!")
!bcftools stats "{OUTPUT_DIR}/{SAMPLE}_raw.vcf.gz" | grep -E "^SN"
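
The `SN` lines printed by `bcftools stats` are tab-separated summary counts. The sketch below turns them into a dictionary so that later cells can check the numbers programmatically. It assumes the usual `SN<TAB>id<TAB>key:<TAB>value` layout. `parse_sn_lines` is a hypothetical helper name, and the example counts are made up.

```python
def parse_sn_lines(lines):
    """Parse 'SN' summary lines from `bcftools stats` output into {key: int}."""
    summary = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip comments and any section other than SN
        if len(fields) < 4 or fields[0] != "SN":
            continue
        key = fields[2].rstrip(":").strip()
        try:
            summary[key] = int(fields[3])
        except ValueError:
            continue  # ignore values that are not plain integers
    return summary


# Made-up example lines in the assumed layout
example = [
    "# SN\t[2]id\t[3]key\t[4]value",
    "SN\t0\tnumber of samples:\t1",
    "SN\t0\tnumber of records:\t1200",
    "SN\t0\tnumber of SNPs:\t1100",
    "SN\t0\tnumber of indels:\t100",
]
stats = parse_sn_lines(example)
print(stats)
```

In the notebook, the input could come from `sn_lines = !bcftools stats "{OUTPUT_DIR}/{SAMPLE}_raw.vcf.gz" | grep "^SN"`.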

In [None]:
# Step 7: Filter and normalize variants
print("Filtering variants (QUAL >= 20)...")
!bcftools filter -i 'QUAL>=20' "{OUTPUT_DIR}/{SAMPLE}_raw.vcf.gz" -Oz -o "{OUTPUT_DIR}/{SAMPLE}_filtered.vcf.gz"

print("Normalizing variants...")
!bcftools norm -f "{REF_GENOME}" "{OUTPUT_DIR}/{SAMPLE}_filtered.vcf.gz" -Oz -o "{OUTPUT_DIR}/{SAMPLE}_normalized.vcf.gz" 2>/dev/null
!bcftools index "{OUTPUT_DIR}/{SAMPLE}_normalized.vcf.gz"

print("✓ Filtering complete!")
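
VCF `QUAL` values are Phred-scaled, so the `QUAL>=20` threshold used in Step 7 corresponds to an estimated probability of at most 1% that the variant call is wrong. A short worked conversion, with `phred_to_error_prob` as an illustrative helper name:

```python
def phred_to_error_prob(qual):
    """Convert a Phred-scaled quality to the probability that the call is wrong."""
    return 10 ** (-qual / 10)


for q in (10, 20, 30):
    print(f"QUAL {q}: error probability about {phred_to_error_prob(q):.3g}")
```

Raising the threshold to 30 would tighten the bound to about 0.1%, at the cost of keeping fewer variants from a low-coverage sample.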

In [None]:
# Step 8: Annotate with rsIDs from dbSNP
print("Annotating with rsIDs from dbSNP...")
!bcftools annotate -a "{DBSNP}" -c ID "{OUTPUT_DIR}/{SAMPLE}_normalized.vcf.gz" -Oz -o "{OUTPUT_DIR}/{SAMPLE}_annotated.vcf.gz" 2>/dev/null
!bcftools index "{OUTPUT_DIR}/{SAMPLE}_annotated.vcf.gz"

# Count annotation rate
total = !bcftools view -H "{OUTPUT_DIR}/{SAMPLE}_annotated.vcf.gz" | wc -l
annotated = !bcftools view -H "{OUTPUT_DIR}/{SAMPLE}_annotated.vcf.gz" | awk '$3 != "."' | wc -l
total_count = int(total[0].strip())
annot_count = int(annotated[0].strip())
rate = (annot_count / total_count * 100) if total_count > 0 else 0

print(f"\n✓ Annotation rate: {rate:.1f}% ({annot_count:,} / {total_count:,} variants)")

In [None]:
# Step 9: Convert to 23andMe format for GedMatch
print("Converting to 23andMe format...")

# Filter to SNPs with rsIDs only
!bcftools view -i 'ID!="." && strlen(REF)==1 && strlen(ALT)==1' \
    "{OUTPUT_DIR}/{SAMPLE}_annotated.vcf.gz" -Oz -o "{OUTPUT_DIR}/{SAMPLE}_snps.vcf.gz"

# Convert with plink
!plink --vcf "{OUTPUT_DIR}/{SAMPLE}_snps.vcf.gz" \
    --recode 23 \
    --chr 1-22 \
    --out "{OUTPUT_DIR}/{SAMPLE}_plink" 2>/dev/null

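# Create final output with a single header line
# (printf expands \t in any shell, whereas echo only does so in some;
#  grep -v drops every '#' comment line PLINK may have written, not just the first)
output_file = f"{OUTPUT_DIR}/{SAMPLE}_23andme.txt"
!printf '# rsid\tchromosome\tposition\tgenotype\n' > "{output_file}"
!grep -v '^#' "{OUTPUT_DIR}/{SAMPLE}_plink.txt" >> "{output_file}"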

# Count SNPs
snp_count = !tail -n +2 "{output_file}" | wc -l
snp_count = int(snp_count[0].strip())

print("\n" + "=" * 50)
print("✓ Conversion complete!")
print("=" * 50)
print(f"Output: {output_file}")
print(f"SNP count: {snp_count:,}")

if snp_count < 200000:
    print(f"\n⚠️  WARNING: SNP count < 200,000")
    print("   This is expected for ancient DNA with low coverage.")
    print("   GedMatch may still accept it but matches will be limited.")
else:
    print("\n✓ SNP count looks good for GedMatch upload")
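
For reference, each data row in the 23andMe format holds `rsid`, `chromosome`, `position`, and a two-letter genotype built from the VCF `REF`/`ALT` alleles and the sample's `GT` field. The sketch below illustrates that mapping for biallelic SNPs in a single-sample VCF. It is a simplified illustration, not PLINK's implementation. `vcf_line_to_23andme` is a hypothetical name, and the example record is made up.

```python
def vcf_line_to_23andme(line):
    """Convert one single-sample VCF data line to a 23andMe row, or None if unusable."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return None
    chrom, pos, rsid, ref, alt = fields[:5]
    # Keep only biallelic SNPs that carry an ID
    if rsid == "." or len(ref) != 1 or len(alt) != 1:
        return None
    # Locate GT by its position in the FORMAT column
    fmt_keys = fields[8].split(":")
    if "GT" not in fmt_keys:
        return None
    gt = fields[9].split(":")[fmt_keys.index("GT")]
    alleles = [ref, alt]
    calls = []
    for idx in gt.replace("|", "/").split("/"):
        if idx == ".":
            calls.append("-")  # no-call allele
        elif idx.isdigit() and int(idx) < len(alleles):
            calls.append(alleles[int(idx)])
        else:
            return None  # allele index outside REF/ALT
    return f"{rsid}\t{chrom}\t{pos}\t{''.join(calls)}"


# Made-up heterozygous record: REF=G, ALT=A, GT=0/1
record = "1\t1000\trs123\tG\tA\t50\t.\tDP=5\tGT:PL\t0/1:60,0,60"
print(vcf_line_to_23andme(record))
```

Because Step 6 calls variants only (`bcftools call -mv`), positions where the sample matches the reference never reach this stage, which is one reason the SNP count can fall below the threshold checked above.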

In [None]:
# Step 10: Copy result back to Google Drive
DRIVE_OUTPUT = "/content/drive/MyDrive/GedMatch_Files"
!mkdir -p "{DRIVE_OUTPUT}"
!cp "{OUTPUT_DIR}/{SAMPLE}_23andme.txt" "{DRIVE_OUTPUT}/"

print(f"\n✓ File saved to Google Drive: {DRIVE_OUTPUT}/{SAMPLE}_23andme.txt")
print("\nTo upload to GedMatch:")
print("  1. Go to https://www.gedmatch.com")
print("  2. Log in → DNA Upload")
print("  3. Select '23andMe' format")
print(f"  4. Upload the file from your Drive: GedMatch_Files/{SAMPLE}_23andme.txt")
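
Before uploading, it can help to confirm that every data row has the four tab-separated fields named in the header. Below is a minimal sketch of such a check, assuming autosomal data only, as produced by the steps above. `validate_23andme_lines` is a hypothetical helper, and its rules are a reasonable guess at what a well-formed row looks like rather than GedMatch's documented acceptance criteria.

```python
import re

# rsid, chromosome, position, two-character genotype (A/C/G/T or '-' for no-call)
ROW_RE = re.compile(r"^(rs\d+)\t(\d+)\t(\d+)\t([ACGT-]{2})$")


def validate_23andme_lines(lines):
    """Return (n_valid, problems); problems lists (line_number, text) for bad rows."""
    n_valid = 0
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if not text or text.startswith("#"):
            continue  # blank lines and comment/header lines are not data
        match = ROW_RE.match(text)
        if match and 1 <= int(match.group(2)) <= 22:
            n_valid += 1
        else:
            problems.append((number, text))
    return n_valid, problems


# Made-up rows: one valid, three malformed
rows = [
    "# rsid\tchromosome\tposition\tgenotype\n",
    "rs1\t1\t100\tAG\n",
    "rs2\t23\t5\tAA\n",
    "rs3\t2\t7\tA\n",
    ".\t1\t5\tAA\n",
]
n_valid, problems = validate_23andme_lines(rows)
print(n_valid, [number for number, _ in problems])
```

To check the real output, the file can be passed in directly: `with open(output_file) as fh: n_valid, problems = validate_23andme_lines(fh)`.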

In [None]:
# Cleanup intermediate files to free space (optional)
# Uncomment to delete temporary files

# !rm -f "{OUTPUT_DIR}/{SAMPLE}_raw.vcf.gz"*
# !rm -f "{OUTPUT_DIR}/{SAMPLE}_filtered.vcf.gz"*
# !rm -f "{OUTPUT_DIR}/{SAMPLE}_normalized.vcf.gz"*
# !rm -f "{OUTPUT_DIR}/{SAMPLE}_annotated.vcf.gz"*
# !rm -f "{OUTPUT_DIR}/{SAMPLE}_snps.vcf.gz"*
# !rm -f "{OUTPUT_DIR}/{SAMPLE}_plink"*
# print("Intermediate files cleaned up")

## Process Additional Samples

To process another sample, go back to the Configuration cell, set `SAMPLE` to the new ID (for example `SAMPLE = "VK158"`), re-run that cell so the derived paths update, and then re-run Steps 5-10.

Available Bodzia samples:
- VK157 (Elite male)
- VK158, VK159, VK166
- VK50, VK53, VK110, VK146, VK147, VK201, VK202, VK234
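
Re-running for another sample amounts to recomputing the derived paths for each sample name. The sketch below shows that bookkeeping, mirroring the Configuration cell's `<SAMPLE>.final.bam` naming. `derive_paths` is a hypothetical helper, and the Drive folder is the same placeholder used in the Configuration cell.

```python
import os

SAMPLES = [
    "VK157", "VK158", "VK159", "VK166",
    "VK50", "VK53", "VK110", "VK146", "VK147", "VK201", "VK202", "VK234",
]


def derive_paths(sample, bam_folder, output_dir="/content/output"):
    """Build the input and final-output paths for one sample."""
    return {
        "input_bam": os.path.join(bam_folder, f"{sample}.final.bam"),
        "final_txt": os.path.join(output_dir, f"{sample}_23andme.txt"),
    }


# Placeholder folder; replace with the real location of the BAM files
BAM_FOLDER = "/content/drive/MyDrive/YOUR_FOLDER_WITH_BAM_FILES"

for sample in SAMPLES:
    paths = derive_paths(sample, BAM_FOLDER)
    status = "found" if os.path.exists(paths["input_bam"]) else "missing"
    print(f"{sample}: BAM {status} -> {paths['final_txt']}")
```

This only reports which BAM files are present. Running the full pipeline in a loop would also require wrapping Steps 5-10 in a function, which the notebook does not currently do.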