# VCF to MAF and VCF Conversion

This notebook demonstrates how to:
1. Read a VCF file using `read_vcf` with assembly=38
2. Export the PyMutation object to MAF format using `to_maf`
3. Export the PyMutation object to VCF format using `to_vcf`
4. Read the exported MAF file using `read_maf`


In [1]:
import sys
import os

# Add the project's src directory to sys.path so that pyMut can be imported
project_root = os.path.abspath(os.path.join(os.getcwd(), '..', '..', '..', 'src'))
if project_root not in sys.path:
    sys.path.append(project_root)

print('✅ PYTHONPATH configured to include:', project_root)


✅ PYTHONPATH configured to include: /Users/luis/Desktop/pyMut/src


## Import the necessary functions


In [2]:
from pyMut import read_vcf, read_maf

print("✅ Functions imported correctly")


✅ Functions imported correctly


## Define the path to the VCF file


In [3]:
# Path to the VCF file with VEP annotations
vcf_path = "../../../src/pyMut/data/examples/VCF/subset_1k_variants_ALL.chr10.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased_vep_protein_gene_variant_class.vcf"

print("📁 File to process:")
print(f"  - VCF file: {vcf_path}")

# Verify that the file exists
if os.path.exists(vcf_path):
    print("✅ File found")
else:
    print("❌ File not found")


📁 File to process:
  - VCF file: ../../../src/pyMut/data/examples/VCF/subset_1k_variants_ALL.chr10.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased_vep_protein_gene_variant_class.vcf
✅ File found
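
The check above only prints a message, so later cells would still run against a missing file. A minimal sketch of a fail-fast alternative follows; `require_file` is a hypothetical helper written for this notebook, not part of pyMut.

```python
from pathlib import Path


def require_file(path: str) -> Path:
    """Return the path as a Path object, or raise if it is not a regular file.

    Hypothetical helper for illustration; pyMut does not provide it.
    """
    resolved = Path(path).expanduser()
    if not resolved.is_file():
        # Stop here instead of letting a later cell fail with a less clear error.
        raise FileNotFoundError(f"Input file not found: {resolved}")
    return resolved
```

Calling `require_file(vcf_path)` before `read_vcf` would stop the notebook at the first cell that depends on the file.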


## Read the VCF file with assembly=38


In [4]:
print("📖 Reading VCF file...")

try:
    # Read the VCF file with assembly=38
    pymutation_obj = read_vcf(vcf_path, "38")
    
    print("✅ PyMutation object created successfully")
    print(f"   DataFrame shape: {pymutation_obj.data.shape}")
    print(f"   Number of variants: {len(pymutation_obj.data)}")
    print(f"   Number of columns: {len(pymutation_obj.data.columns)}")
    print(f"   Number of samples: {len(pymutation_obj.samples)}")
    
except Exception as e:
    print(f"❌ Error reading the file: {e}")
    import traceback
    traceback.print_exc()


2025-07-29 19:30:54,739 | INFO | pyMut.input | Starting optimized VCF reading: ../../../src/pyMut/data/examples/VCF/subset_1k_variants_ALL.chr10.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased_vep_protein_gene_variant_class.vcf
2025-07-29 19:30:54,740 | INFO | pyMut.input | Loading from cache: ../../../src/pyMut/data/examples/VCF/.pymut_cache/subset_1k_variants_ALL.chr10.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased_vep_protein_gene_variant_class_459f83c3aa1088de.parquet


📖 Reading VCF file...


2025-07-29 19:30:54,906 | INFO | pyMut.input | Cache loaded successfully in 0.17 seconds


✅ PyMutation object created successfully
   DataFrame shape: (1000, 2601)
   Number of variants: 1000
   Number of columns: 2601
   Number of samples: 2548


## Show the first rows of the DataFrame


In [5]:
print("🔍 First 3 rows of the DataFrame:")
pymutation_obj.head(3)


🔍 First 3 rows of the DataFrame:


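|  | CHROM | POS | ID | REF | ALT | QUAL | FILTER | HG00096 | HG00097 | HG00099 | ... | VEP_ENSP | VEP_SWISSPROT | VEP_TREMBL | VEP_UNIPARC | VEP_UNIPROT_ISOFORM | VEP_NEAREST | VEP_DOMAINS | Hugo_Symbol | Variant_Classification | Variant_Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | chr10 | 11501 | . | C | A | . | PASS | C\|A | C\|C | C\|C | ... |  |  |  |  |  | TUBB8 |  | TUBB8 | INTRON | SNP |
| 1 | chr10 | 36097 | . | G | A | . | PASS | G\|A | A\|G | G\|G | ... |  |  |  |  |  | TUBB8 |  | TUBB8 | INTRON | SNP |
| 2 | chr10 | 45900 | . | C | T | . | PASS | C\|C | C\|C | C\|C | ... | ENSP00000456206 | Q3ZCM7.157 |  | UPI000007238E |  | TUBB8 |  | TUBB8 | 3'FLANK | SNP |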


## Define output paths for MAF and VCF exports


In [6]:
# Create output directory if it doesn't exist
output_dir = "./output"
os.makedirs(output_dir, exist_ok=True)

# Define output paths
maf_output_path = os.path.join(output_dir, "vcf_to_maf_output.maf")
vcf_output_path = os.path.join(output_dir, "vcf_to_vcf_output.vcf")

print("📁 Output files will be saved to:")
print(f"  - MAF output: {maf_output_path}")
print(f"  - VCF output: {vcf_output_path}")


📁 Output files will be saved to:
  - MAF output: ./output/vcf_to_maf_output.maf
  - VCF output: ./output/vcf_to_vcf_output.vcf


## Export to MAF format


In [7]:
print("📝 Exporting to MAF format...")

try:
    # Export to MAF format
    pymutation_obj.to_maf(maf_output_path)
    
    # Check if the file was created
    if os.path.exists(maf_output_path):
        print(f"✅ MAF file created successfully: {maf_output_path}")
        print(f"   File size: {os.path.getsize(maf_output_path) / (1024 * 1024):.2f} MB")
    else:
        print("❌ MAF file was not created")
        
except Exception as e:
    print(f"❌ Error exporting to MAF: {e}")
    import traceback
    traceback.print_exc()


2025-07-29 19:30:55,090 | INFO | pyMut.output | Starting MAF export to: output/vcf_to_maf_output.maf
2025-07-29 19:30:55,106 | INFO | pyMut.output | Starting to process 1000 variants from 2548 samples


📝 Exporting to MAF format...


2025-07-29 19:30:55,126 | INFO | pyMut.output | Processing sample 1/2548: HG00096 (0.0%)
2025-07-29 19:30:55,175 | INFO | pyMut.output | Sample HG00096: 49 variants found
2025-07-29 19:30:58,973 | INFO | pyMut.output | Processing sample 50/2548: HG00149 (2.0%)
2025-07-29 19:30:59,015 | INFO | pyMut.output | Sample HG00149: 122 variants found
2025-07-29 19:31:02,854 | INFO | pyMut.output | Processing sample 100/2548: HG00256 (3.9%)
2025-07-29 19:31:02,893 | INFO | pyMut.output | Sample HG00256: 94 variants found
2025-07-29 19:31:06,728 | INFO | pyMut.output | Processing sample 150/2548: HG00328 (5.9%)
2025-07-29 19:31:06,768 | INFO | pyMut.output | Sample HG00328: 49 variants found
2025-07-29 19:31:10,627 | INFO | pyMut.output | Processing sample 200/2548: HG00406 (7.8%)
2025-07-29 19:31:10,666 | INFO | pyMut.output | Sample HG00406: 8 variants found
2025-07-29 19:31:14,477 | INFO | pyMut.output | Processing sample 250/2548: HG00581 (9.8%)
2025-07-29 19:31:14,516 | INFO | pyMut.output |

✅ MAF file created successfully: ./output/vcf_to_maf_output.maf
   File size: 53.29 MB


## Export to VCF format


In [8]:
print("📝 Exporting to VCF format...")

try:
    # Export to VCF format
    pymutation_obj.to_vcf(vcf_output_path)
    
    # Check if the file was created
    if os.path.exists(vcf_output_path):
        print(f"✅ VCF file created successfully: {vcf_output_path}")
        print(f"   File size: {os.path.getsize(vcf_output_path) / (1024 * 1024):.2f} MB")
    else:
        print("❌ VCF file was not created")
        
except Exception as e:
    print(f"❌ Error exporting to VCF: {e}")
    import traceback
    traceback.print_exc()


2025-07-29 19:34:14,933 | INFO | pyMut.output | Starting VCF export to: output/vcf_to_vcf_output.vcf


📝 Exporting to VCF format...


2025-07-29 19:34:14,948 | INFO | pyMut.output | Starting to process 1000 variants from 2548 samples
2025-07-29 19:34:15,121 | INFO | pyMut.output | Processing genotype data to replace bases with indices
2025-07-29 19:34:32,352 | INFO | pyMut.output | Writing 1000 variants to file
2025-07-29 19:34:32,649 | INFO | pyMut.output | Progress: 1000/1000 variants written (100.0%)
2025-07-29 19:34:32,651 | INFO | pyMut.output | VCF export completed successfully: 1000 variants processed and written to output/vcf_to_vcf_output.vcf
2025-07-29 19:34:32,651 | INFO | pyMut.output | Conversion summary: 2548 samples, 1000 input variants, 1000 output variants


✅ VCF file created successfully: ./output/vcf_to_vcf_output.vcf
   File size: 10.00 MB
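
The export log mentions "Processing genotype data to replace bases with indices". The DataFrame stores genotypes as bases (for example `C|A`), whereas the VCF `GT` field uses allele indices (for example `0|1`). The sketch below illustrates that conversion under simple assumptions; `bases_to_indices` is a hypothetical helper and not pyMut's actual implementation.

```python
import re


def bases_to_indices(genotype: str, ref: str, alt: str) -> str:
    """Convert a base-encoded genotype such as 'C|A' to allele indices such as '0|1'.

    Hypothetical helper for illustration; it is not part of pyMut.
    """
    # Index 0 is REF; ALT alleles (comma-separated in VCF) are numbered from 1.
    alleles = [ref] + alt.split(",")
    # The capturing group keeps the separators, so phasing ('|' vs '/') is preserved.
    parts = re.split(r"([|/])", genotype)
    converted = []
    for part in parts:
        if part in ("|", "/"):
            converted.append(part)
        elif part in alleles:
            converted.append(str(alleles.index(part)))
        else:
            # Alleles that match neither REF nor ALT are written as missing.
            converted.append(".")
    return "".join(converted)


print(bases_to_indices("C|A", ref="C", alt="A"))  # 0|1
print(bases_to_indices("A|G", ref="G", alt="A"))  # 1|0
```

Keeping the separator during the split means a phased genotype stays phased after conversion.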


## Examine the exported files


In [9]:
# Show the first few lines of the exported MAF file
print("🔍 First 10 lines of the exported MAF file:")
!head -10 {maf_output_path}


🔍 First 10 lines of the exported MAF file:
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1)">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Total number of alternate alleles in called genotypes">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=EAS_AF,Number=A,Type=Float,Description="Allele frequency in the EAS populations calculated from AC and AN, in the range (0,1)">
##INFO=<ID=EUR_AF,Number=A,Type=Float,Description="Allele frequency in the EUR populations calculated from AC and AN, in the range (0,1)">
##INFO=<ID=AFR_AF,Number=A,Type=Float,Description="Allele frequency in the AFR populations calculated from AC and AN, in the range (0,1)">
##INFO=<ID=AMR_AF,Number=A,Type=Float,Description="Allele frequency in the AMR populations calculated from AC and AN, in the range (0,1)">
##INFO=<ID=SAS_AF,N
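
The first ten lines of the exported MAF are all `##INFO` metadata lines, so `head` never reaches the column header. A minimal sketch for previewing the first non-comment lines instead is shown below; `first_data_lines` is a hypothetical helper, and the in-memory sample content is made up for illustration.

```python
import io


def first_data_lines(handle, n=3, comment_prefix="#"):
    """Return the first n lines that do not start with the comment prefix."""
    lines = []
    for line in handle:
        if line.startswith(comment_prefix):
            continue  # skip metadata lines such as '##INFO=...'
        lines.append(line.rstrip("\n"))
        if len(lines) == n:
            break
    return lines


# Made-up sample content; the real exported file has many more columns.
sample = io.StringIO(
    '##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">\n'
    "Hugo_Symbol\tChromosome\tStart_Position\n"
    "TUBB8\tchr10\t11501\n"
)
preview = first_data_lines(sample, n=2)
print(preview)
```

For the real file, the same function can be applied to `open(maf_output_path)` inside a `with` block.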

In [10]:
# Show the first few lines of the exported VCF file
print("🔍 First 10 lines of the exported VCF file:")
!head -10 {vcf_output_path}


🔍 First 10 lines of the exported VCF file:
##fileformat=VCFv4.3
##fileDate=20250729
##source=https://github.com/Luisruimor/pyMut
##reference=38
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=10>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Phased Genotype">
##INFO=<ID=PMUT,Number=.,Type=String,Description="Consequence annotations columns from PyMut. Format: AC|AN|DP|AF|EAS_AF|EUR_AF|AFR_AF|AMR_AF|SAS_AF|VT|NS|EX_TARGET|VEP_Allele|VEP_Consequence|VEP_IMPACT|VEP_SYMBOL|VEP_Gene|VEP_Feature_type|VEP_Feature|VEP_BIOTYPE|VEP_EXON|VEP_INTRON|VEP_HGVSc|VEP_HGVSp|VEP_cDNA_position|VEP_CDS_position|VEP_Protein_position|VEP_Amino_acids|VEP_Codons|VEP_Existing_variation|VEP_DISTANCE|VEP_STRAND|VEP_FLAGS|VEP_VARIANT_CLASS|VEP_SYMBOL_SOURCE|VEP_HGNC_ID|VEP_ENSP|VEP_SWISSPROT|VEP_TREMBL|VEP_UNIPARC|VEP_UNIPROT_ISOFORM|VEP_NEAREST|VEP_DOMAINS|Hugo_Symbol|Variant_Classification|Variant_Type">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	HG00096	HG00097	HG00099	HG00100	H
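
The `PMUT` INFO header lists its sub-fields after `Format:`, separated by `|`. The sketch below shows one way to recover those field names and pair them with a record's values. Both function names are hypothetical, the header is shortened to four fields, the example values are invented, and the code assumes that values never contain a literal `|`.

```python
import re


def pmut_field_names(header_line):
    """Extract the pipe-separated field names listed after 'Format: '."""
    match = re.search(r'Format: ([^"]+)"', header_line)
    return match.group(1).split("|") if match else []


def decode_pmut(value, field_names):
    """Pair each pipe-separated value with its field name (assumes no '|' inside values)."""
    return dict(zip(field_names, value.split("|")))


# Shortened version of the header shown above (only the first four fields).
header = (
    '##INFO=<ID=PMUT,Number=.,Type=String,Description="Consequence annotations '
    'columns from PyMut. Format: AC|AN|DP|AF">'
)
names = pmut_field_names(header)
# Invented values, used only to show the mapping.
record = decode_pmut("1|5096|20462|0.0002", names)
print(record)
```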

## Read the exported MAF file

Now we'll read the MAF file that was generated using `to_maf()` to demonstrate the full conversion cycle.


In [11]:
print("📖 Reading the exported MAF file...")

try:
    # Read the MAF file with assembly=38 (same as the original VCF)
    maf_pymutation_obj = read_maf(maf_output_path, "38")
    
    print("✅ PyMutation object created successfully from MAF")
    print(f"   DataFrame shape: {maf_pymutation_obj.data.shape}")
    print(f"   Number of variants: {len(maf_pymutation_obj.data)}")
    print(f"   Number of columns: {len(maf_pymutation_obj.data.columns)}")
    print(f"   Number of samples: {len(maf_pymutation_obj.samples)}")
    
except Exception as e:
    print(f"❌ Error reading the MAF file: {e}")
    import traceback
    traceback.print_exc()


2025-07-29 19:34:33,001 | INFO | pyMut.input | Starting MAF reading: output/vcf_to_maf_output.maf


📖 Reading the exported MAF file...


2025-07-29 19:34:33,065 | INFO | pyMut.input | Reading MAF with 'pyarrow' engine…
2025-07-29 19:34:34,467 | INFO | pyMut.input | Reading with 'pyarrow' completed.
2025-07-29 19:34:34,654 | INFO | pyMut.input | Detected 2548 unique samples.
2025-07-29 19:35:01,385 | INFO | pyMut.input | Saving to cache: output/.pymut_cache/vcf_to_maf_output_0bf45773a91c2ef0.parquet
2025-07-29 19:35:22,108 | INFO | pyMut.input | MAF processed successfully: 188125 rows, 2609 columns in 49.11 seconds


✅ PyMutation object created successfully from MAF
   DataFrame shape: (188125, 2609)
   Number of variants: 188125
   Number of columns: 2609
   Number of samples: 2548
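
The MAF-derived object has 188125 rows, compared with 1000 rows in the VCF-derived object. This is consistent with the MAF export log, which reports a variant count per sample: a MAF file holds one row per sample-variant pair rather than one row per variant. The sketch below illustrates that wide-to-long expansion using the first two variants and three samples from the preview above. `expand_to_sample_rows` is a hypothetical helper, and the rule "any allele differs from REF" is an assumption rather than pyMut's documented criterion.

```python
def expand_to_sample_rows(variants, samples):
    """Yield one (sample, chrom, pos, ref, alt) row per sample carrying a non-REF allele."""
    for variant in variants:
        ref = variant["REF"]
        for sample in samples:
            # Treat phased ('|') and unphased ('/') genotypes the same way.
            alleles = variant[sample].replace("/", "|").split("|")
            # Assumption: a sample carries the variant if any called allele differs from REF.
            if any(allele != ref and allele != "." for allele in alleles):
                yield (sample, variant["CHROM"], variant["POS"], ref, variant["ALT"])


variants = [
    {"CHROM": "chr10", "POS": 11501, "REF": "C", "ALT": "A",
     "HG00096": "C|A", "HG00097": "C|C", "HG00099": "C|C"},
    {"CHROM": "chr10", "POS": 36097, "REF": "G", "ALT": "A",
     "HG00096": "G|A", "HG00097": "A|G", "HG00099": "G|G"},
]
rows = list(expand_to_sample_rows(variants, ["HG00096", "HG00097", "HG00099"]))
print(len(rows))  # 3 rows from 2 variants
```

Two variants across three samples produce three rows here, which is the same effect that turns 1000 variants into many more MAF rows.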


## Show the first rows of the MAF-derived PyMutation object


In [12]:
print("🔍 First 3 rows of the MAF-derived PyMutation object:")
maf_pymutation_obj.head(3)


🔍 First 3 rows of the MAF-derived PyMutation object:


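|  | CHROM | POS | ID | REF | ALT | QUAL | FILTER | HG00096 | HG00097 | HG00099 | ... | VEP_ENSP | VEP_IMPACT | VEP_SWISSPROT | EAS_AF | DP | VEP_Allele | VEP_DISTANCE | NS | AMR_AF | VEP_UNIPROT_ISOFORM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | chr10 | 11501 | . | C | A | . | PASS | C\|A | C\|C | C\|C | ... |  | MODIFIER |  | 0.0 | 20462 | A |  | 2548 | 0.03 |  |
| 1 | chr10 | 36097 | . | G | A | . | PASS | G\|A | G\|G | G\|G | ... |  | MODIFIER |  | 0.26 | 18607 | A |  | 2548 | 0.18 |  |
| 2 | chr10 | 47876 | . | C | T | . | PASS | C\|T | C\|C | C\|C | ... | ENSP00000456206 | LOW | Q3ZCM7.157 | 0.22 | 513338 | T |  | 2548 | 0.24 |  |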


## Summary

In this notebook, we demonstrated how to:
1. Read a VCF file using `read_vcf` with assembly=38
2. Export the PyMutation object to MAF format using `to_maf`
3. Export the PyMutation object to VCF format using `to_vcf`
4. Read the exported MAF file using `read_maf`

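Together, these functions convert between VCF and MAF while keeping all 2548 samples. The two representations differ in row count: the VCF-derived object has 1000 rows (one per variant), whereas the object read back from MAF has 188125 rows, which is consistent with the MAF export writing one row per sample-variant pair. The "Number of variants" value printed for the MAF-derived object therefore counts rows, not distinct variants.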