# PyMutation Filtering Methods Example

This notebook demonstrates the various filtering methods available in PyMutation:
- `filter_by_chrom_sample`: Filter by chromosome and/or sample
- `region`: Filter by genomic coordinates
- `gen_region`: Filter by gene name
- `pass_filter`: Check if specific records have FILTER == "PASS"


In [1]:
import sys
import os
import pandas as pd
from IPython.display import display

# Add src to path
sys.path.insert(0, os.path.join('..', '..', '..', 'src'))

from pyMut.input import read_maf


## Load TCGA LAML Dataset


In [2]:
# Load real TCGA LAML data
maf_path = os.path.join('..', '..', '..', 'src', 'pyMut', 'data', 'examples', 'MAF', 'tcga_laml.maf.gz')
# TCGA data is typically based on the GRCh37 assembly
py_mut = read_maf(maf_path, assembly="37")

print(f"Loaded TCGA LAML data: {len(py_mut.data)} variants")
print(f"Unique genes: {py_mut.data['Hugo_Symbol'].nunique()}")
print(f"Unique samples: {py_mut.data['Tumor_Sample_Barcode'].nunique()}")
print(f"Chromosomes present: {sorted(py_mut.data['CHROM'].unique())}")

# Display first few rows
print("\nFirst 5 rows of the dataset:")
display(py_mut.data.head())


2025-07-30 23:17:37,176 | INFO | pyMut.input | Starting MAF reading: ../../../src/pyMut/data/examples/MAF/tcga_laml.maf.gz
2025-07-30 23:17:37,177 | INFO | pyMut.input | Loading from cache: ../../../src/pyMut/data/examples/MAF/.pymut_cache/tcga_laml.maf_3515f757055e6890.parquet
2025-07-30 23:17:37,202 | INFO | pyMut.input | Cache loaded successfully in 0.03 seconds


Loaded TCGA LAML data: 2207 variants
Unique genes: 1611
Unique samples: 193
Chromosomes present: ['X', 'chr1', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr2', 'chr20', 'chr21', 'chr22', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9']

First 5 rows of the dataset:


| | CHROM | POS | ID | REF | ALT | QUAL | FILTER | TCGA-AB-2988 | TCGA-AB-2869 | TCGA-AB-3009 | ... | Strand | Variant_Classification | Variant_Type | Reference_Allele | Tumor_Seq_Allele1 | Tumor_Seq_Allele2 | Tumor_Sample_Barcode | Protein_Change | i_TumorVAF_WU | i_transcript_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | chr17 | 67170917 | . | T | C | . | . | T\|C | T\|T | T\|T | ... | + | SPLICE_SITE | SNP | T | T | C | TCGA-AB-2988 | p.K960R | 45.66 | NM_080282.3 |
| 1 | chr1 | 94490594 | . | C | T | . | . | C\|C | C\|T | C\|C | ... | + | MISSENSE_MUTATION | SNP | C | C | T | TCGA-AB-2869 | p.R1517H | 38.12 | NM_000350.2 |
| 2 | chr2 | 169780250 | . | G | A | . | . | G\|G | G\|G | G\|A | ... | + | MISSENSE_MUTATION | SNP | G | G | A | TCGA-AB-3009 | p.A1283V | 46.972177 | NM_003742.2 |
| 3 | chr16 | 48244997 | . | G | A | . | . | G\|G | G\|G | G\|G | ... | + | SILENT | SNP | G | G | A | TCGA-AB-2830 | p.I490I | 34.27 | NM_032583.3 |
| 4 | chr17 | 48760974 | . | C | T | . | . | C\|C | C\|C | C\|C | ... | + | MISSENSE_MUTATION | SNP | C | C | T | TCGA-AB-2887 | p.P1271S | 56.41 | NM_003786.1 |


## 1. Chromosome and Sample Filtering (filter_by_chrom_sample)

This method allows filtering by chromosome and/or sample. It comes from `chrom_sample_filter.py`.


In [3]:
print("=== Chromosome and Sample Filtering Examples ===")

# Example 1: Filter by chromosome only
print("\n1. Filter by chromosome 17:")
filtered_chr17 = py_mut.filter_by_chrom_sample(chrom='17')
print(f"Original variants: {len(py_mut.data)}")
print(f"Chromosome 17 variants: {len(filtered_chr17.data)}")

# Example 2: Filter by multiple chromosomes
print("\n2. Filter by chromosomes 17 and X:")
filtered_multi_chr = py_mut.filter_by_chrom_sample(chrom=['17', 'X'])
print(f"Chromosomes 17 and X variants: {len(filtered_multi_chr.data)}")

# Example 3: Filter by sample (get first few samples)
sample_list = py_mut.data['Tumor_Sample_Barcode'].unique()[:3].tolist()
print(f"\n3. Filter by first 3 samples: {sample_list}")
filtered_samples = py_mut.filter_by_chrom_sample(sample=sample_list)
print(f"Filtered by samples variants: {len(filtered_samples.data)}")
print(f"Unique samples in filtered data: {filtered_samples.data['Tumor_Sample_Barcode'].nunique()}")

# Example 4: Combined filtering (chromosome + sample)
print("\n4. Combined filter - chromosome 17 + first sample:")
filtered_combined = py_mut.filter_by_chrom_sample(chrom='17', sample=sample_list[0])
print(f"Combined filter variants: {len(filtered_combined.data)}")


2025-07-30 23:17:37,256 | INFO | pyMut.filters.chrom_sample_filter | Chromosomes to filter: ['chr17']
2025-07-30 23:17:37,261 | INFO | pyMut.filters.chrom_sample_filter | Applied chromosome filter: chr17
2025-07-30 23:17:37,261 | INFO | pyMut.filters.chrom_sample_filter | Combined filter applied: chromosome:chr17
2025-07-30 23:17:37,261 | INFO | pyMut.filters.chrom_sample_filter | Variants before filter: 2207
2025-07-30 23:17:37,262 | INFO | pyMut.filters.chrom_sample_filter | Variants after filter: 99
2025-07-30 23:17:37,262 | INFO | pyMut.filters.chrom_sample_filter | Variants filtered out: 2108
2025-07-30 23:17:37,262 | INFO | pyMut.filters.chrom_sample_filter | Successfully applied filter: chromosome:chr17
2025-07-30 23:17:37,265 | INFO | pyMut.filters.chrom_sample_filter | Chromosomes to filter: ['chr17', 'chrX']
2025-07-30 23:17:37,269 | INFO | pyMut.filters.chrom_sample_filter | Applied chromosome filter: chr17,chrX
2025-07-30 23:17:37,270 | INFO | pyMut.filters.chrom_sample_fil

=== Chromosome and Sample Filtering Examples ===

1. Filter by chromosome 17:
Original variants: 2207
Chromosome 17 variants: 99

2. Filter by chromosomes 17 and X:
Chromosomes 17 and X variants: 207

3. Filter by first 3 samples: ['TCGA-AB-2988', 'TCGA-AB-2869', 'TCGA-AB-3009']
Filtered by samples variants: 69
Unique samples in filtered data: Tumor_Sample_Barcode    3
Tumor_Sample_Barcode    3
dtype: int64

4. Combined filter - chromosome 17 + first sample:
Combined filter variants: 1
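
The log lines above show the filter rewriting `'17'` to `'chr17'` before matching. The following is a minimal sketch of that kind of label normalization, assuming a simple "add the `chr` prefix if it is missing" rule. `normalize_chrom` is a hypothetical helper written for illustration, not pyMut's actual implementation.

```python
def normalize_chrom(label: str) -> str:
    """Return a chromosome label with a 'chr' prefix (hypothetical helper)."""
    name = str(label).strip()
    # Drop an existing prefix in any case, so 'CHR17' and 'chr17' both reduce to '17'.
    if name.lower().startswith("chr"):
        name = name[3:]
    # Sex and mitochondrial chromosomes are conventionally written in upper case.
    if name.lower() in {"x", "y", "m", "mt"}:
        name = name.upper()
    return "chr" + name


requested = ["17", "X", "chr1"]
normalized = [normalize_chrom(c) for c in requested]
print(normalized)
```

The loaded dataset lists `'X'` without a prefix among its chromosome labels, so it may be worth checking that the stored labels and the query go through the same normalization before relying on the counts.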


## 2. Genomic Range Filtering (region)

This method filters by genomic coordinates using chromosome, start, and end positions. It comes from `genomic_range.py`.


In [4]:
print("=== Genomic Range Filtering Examples ===")

# Example 1: Filter a specific region on chromosome 17
print("\n1. Filter chromosome 17, positions 7,500,000 to 8,000,000:")
filtered_region = py_mut.region(chrom='17', start=7500000, end=8000000)
print(f"Original variants: {len(py_mut.data)}")
print(f"Region variants: {len(filtered_region.data)}")

if len(filtered_region.data) > 0:
    print("Genes in this region:")
    genes_in_region = filtered_region.data['Hugo_Symbol'].value_counts().head(10)
    display(genes_in_region)

# Example 2: Filter a smaller region
print("\n2. Filter chromosome 17, positions 7,570,000 to 7,590,000 (TP53 region):")
filtered_tp53_region = py_mut.region(chrom='17', start=7570000, end=7590000)
print(f"TP53 region variants: {len(filtered_tp53_region.data)}")

if len(filtered_tp53_region.data) > 0:
    print("Variants in TP53 region:")
    display(filtered_tp53_region.data[['Hugo_Symbol', 'CHROM', 'POS', 'REF', 'ALT', 'Variant_Classification']].head())


2025-07-30 23:17:37,340 | INFO | pyMut.filters.genomic_range | Chromosome formatted: '17' -> 'chr17'
2025-07-30 23:17:37,341 | INFO | pyMut.filters.genomic_range | Attempting to use PyArrow optimization
2025-07-30 23:17:37,346 | INFO | pyMut.filters.genomic_range | PyArrow optimization successful
2025-07-30 23:17:37,347 | INFO | pyMut.filters.genomic_range | Genomic filter applied: chr17:7500000-8000000
2025-07-30 23:17:37,347 | INFO | pyMut.filters.genomic_range | Variants before filter: 2207
2025-07-30 23:17:37,347 | INFO | pyMut.filters.genomic_range | Variants after filter: 20
2025-07-30 23:17:37,347 | INFO | pyMut.filters.genomic_range | Variants filtered out: 2187
2025-07-30 23:17:37,348 | INFO | pyMut.filters.genomic_range | Successfully filtered genomic region: chr17:7500000-8000000


=== Genomic Range Filtering Examples ===

1. Filter chromosome 17, positions 7,500,000 to 8,000,000:
Original variants: 2207
Region variants: 20
Genes in this region:


Hugo_Symbol
TP53      19
GUCY2D     1
Name: count, dtype: int64[pyarrow]

2025-07-30 23:17:37,354 | INFO | pyMut.filters.genomic_range | Chromosome formatted: '17' -> 'chr17'
2025-07-30 23:17:37,354 | INFO | pyMut.filters.genomic_range | Attempting to use PyArrow optimization
2025-07-30 23:17:37,360 | INFO | pyMut.filters.genomic_range | PyArrow optimization successful
2025-07-30 23:17:37,360 | INFO | pyMut.filters.genomic_range | Genomic filter applied: chr17:7570000-7590000
2025-07-30 23:17:37,361 | INFO | pyMut.filters.genomic_range | Variants before filter: 2207
2025-07-30 23:17:37,361 | INFO | pyMut.filters.genomic_range | Variants after filter: 19
2025-07-30 23:17:37,361 | INFO | pyMut.filters.genomic_range | Variants filtered out: 2188
2025-07-30 23:17:37,362 | INFO | pyMut.filters.genomic_range | Successfully filtered genomic region: chr17:7570000-7590000



2. Filter chromosome 17, positions 7,570,000 to 7,590,000 (TP53 region):
TP53 region variants: 19
Variants in TP53 region:


| | Hugo_Symbol | CHROM | POS | REF | ALT | Variant_Classification |
|---|---|---|---|---|---|---|
| 1999 | TP53 | chr17 | 7578403 | C | T | MISSENSE_MUTATION |
| 2000 | TP53 | chr17 | 7578181 | - | GCGGCTC | FRAME_SHIFT_INS |
| 2001 | TP53 | chr17 | 7577100 | T | C | MISSENSE_MUTATION |
| 2002 | TP53 | chr17 | 7577609 | C | T | SPLICE_SITE |
| 2003 | TP53 | chr17 | 7579312 | C | T | SPLICE_SITE |
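
Conceptually, a region filter keeps a variant when its chromosome matches and its position falls inside the requested interval. The sketch below expresses that as a plain predicate over `(chrom, pos)` pairs copied from the outputs in this notebook. The helper name `in_region` is hypothetical, and treating both bounds as inclusive is an assumption; the notebook does not state how `region` handles the interval ends.

```python
# (chrom, pos) pairs copied from the outputs shown in this notebook.
records = [
    ("chr17", 67170917),
    ("chr1", 94490594),
    ("chr17", 7578403),
    ("chr17", 7578181),
    ("chr17", 7577100),
]


def in_region(chrom: str, pos: int, region_chrom: str, start: int, end: int) -> bool:
    """True if pos lies in [start, end] on region_chrom (inclusive bounds assumed)."""
    return chrom == region_chrom and start <= pos <= end


# Keep only the records inside chr17:7570000-7590000.
hits = [(c, p) for c, p in records if in_region(c, p, "chr17", 7_570_000, 7_590_000)]
print(hits)
```

A variant on the right chromosome but outside the interval, such as `chr17:67170917`, is excluded, which is what separates `region` from a chromosome-only filter.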


## 3. Gene-based Filtering (gen_region)

This method filters by gene name using the Hugo_Symbol column. It comes from `genomic_range.py`.


In [5]:
print("=== Gene-based Filtering Examples ===")

# Get the most common genes in the dataset
common_genes = py_mut.data['Hugo_Symbol'].value_counts().head(5)
print("Most common genes in the dataset:")
display(common_genes)

# Example 1: Filter by TP53 gene
print("\n1. Filter by TP53 gene:")
filtered_tp53 = py_mut.gen_region('TP53')
print(f"TP53 variants: {len(filtered_tp53.data)}")

if len(filtered_tp53.data) > 0:
    print("TP53 variant types:")
    tp53_variants = filtered_tp53.data['Variant_Classification'].value_counts()
    display(tp53_variants)

# Example 2: Filter by the most common gene
most_common_gene = common_genes.index[0]
print(f"\n2. Filter by most common gene ({most_common_gene}):")
filtered_common = py_mut.gen_region(most_common_gene)
print(f"{most_common_gene} variants: {len(filtered_common.data)}")

# Example 3: Filter by multiple genes (using multiple calls)
print("\n3. Filter by multiple genes (FLT3, NPM1, DNMT3A):")
genes_of_interest = ['FLT3', 'NPM1', 'DNMT3A']
for gene in genes_of_interest:
    filtered_gene = py_mut.gen_region(gene)
    print(f"{gene}: {len(filtered_gene.data)} variants")


=== Gene-based Filtering Examples ===
Most common genes in the dataset:


Hugo_Symbol
DNMT3A    54
FLT3      52
NPM1      34
TET2      27
IDH2      20
Name: count, dtype: int64[pyarrow]

2025-07-30 23:17:37,422 | INFO | pyMut.filters.genomic_range | Applying gene filter for: TP53
2025-07-30 23:17:37,423 | INFO | pyMut.filters.genomic_range | Source format detected: MAF
2025-07-30 23:17:37,423 | INFO | pyMut.filters.genomic_range | Processing MAF format - looking for Hugo_Symbol column
2025-07-30 23:17:37,423 | INFO | pyMut.filters.genomic_range | Found Hugo_Symbol column: Hugo_Symbol
2025-07-30 23:17:37,427 | INFO | pyMut.filters.genomic_range | Gene filter applied: TP53
2025-07-30 23:17:37,427 | INFO | pyMut.filters.genomic_range | Variants before filter: 2207
2025-07-30 23:17:37,427 | INFO | pyMut.filters.genomic_range | Variants after filter: 19
2025-07-30 23:17:37,428 | INFO | pyMut.filters.genomic_range | Variants filtered out: 2188
2025-07-30 23:17:37,428 | INFO | pyMut.filters.genomic_range | Successfully filtered data for gene: TP53



1. Filter by TP53 gene:
TP53 variants: 19
TP53 variant types:


Variant_Classification
MISSENSE_MUTATION    11
SPLICE_SITE           3
FRAME_SHIFT_INS       2
FRAME_SHIFT_DEL       2
NONSENSE_MUTATION     1
Name: count, dtype: int64[pyarrow]

2025-07-30 23:17:37,431 | INFO | pyMut.filters.genomic_range | Applying gene filter for: DNMT3A
2025-07-30 23:17:37,432 | INFO | pyMut.filters.genomic_range | Source format detected: MAF
2025-07-30 23:17:37,433 | INFO | pyMut.filters.genomic_range | Processing MAF format - looking for Hugo_Symbol column
2025-07-30 23:17:37,433 | INFO | pyMut.filters.genomic_range | Found Hugo_Symbol column: Hugo_Symbol



2. Filter by most common gene (DNMT3A):


2025-07-30 23:17:37,436 | INFO | pyMut.filters.genomic_range | Gene filter applied: DNMT3A
2025-07-30 23:17:37,436 | INFO | pyMut.filters.genomic_range | Variants before filter: 2207
2025-07-30 23:17:37,436 | INFO | pyMut.filters.genomic_range | Variants after filter: 54
2025-07-30 23:17:37,439 | INFO | pyMut.filters.genomic_range | Variants filtered out: 2153
2025-07-30 23:17:37,440 | INFO | pyMut.filters.genomic_range | Successfully filtered data for gene: DNMT3A
2025-07-30 23:17:37,441 | INFO | pyMut.filters.genomic_range | Applying gene filter for: FLT3
2025-07-30 23:17:37,441 | INFO | pyMut.filters.genomic_range | Source format detected: MAF
2025-07-30 23:17:37,441 | INFO | pyMut.filters.genomic_range | Processing MAF format - looking for Hugo_Symbol column
2025-07-30 23:17:37,442 | INFO | pyMut.filters.genomic_range | Found Hugo_Symbol column: Hugo_Symbol
2025-07-30 23:17:37,444 | INFO | pyMut.filters.genomic_range | Gene filter applied: FLT3
2025-07-30 23:17:37,444 | INFO | pyMu

DNMT3A variants: 54

3. Filter by multiple genes (FLT3, NPM1, DNMT3A):
FLT3: 52 variants
NPM1: 34 variants
DNMT3A: 54 variants
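
The loop above calls `gen_region` once per gene. When only per-gene counts are needed, a single pass over the gene column is an alternative. The sketch below shows the idea on a plain Python list standing in for the `Hugo_Symbol` column; the symbols are a toy sample for illustration, not the dataset.

```python
from collections import Counter

# Toy stand-in for the Hugo_Symbol column (illustrative values only).
hugo_symbols = ["FLT3", "TP53", "NPM1", "FLT3", "DNMT3A", "TET2", "NPM1", "FLT3"]
genes_of_interest = {"FLT3", "NPM1", "DNMT3A"}

# Count only the symbols that belong to the set of interest, in one pass.
counts = Counter(sym for sym in hugo_symbols if sym in genes_of_interest)

for gene in sorted(genes_of_interest):
    print(f"{gene}: {counts[gene]} variants")
```

On the real DataFrame the same counting could likely be done with pandas, for example `Series.isin` followed by `value_counts`.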


## 4. PASS Filter Check (pass_filter)

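This method checks whether a specific record has FILTER == "PASS". It comes from `pass_filter.py`.
Note: This method returns a boolean value, not a filtered dataset. In the examples that follow it returns `False` both for a record whose FILTER value is not "PASS" and for a record that is not found.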


In [6]:
print("=== PASS Filter Check Examples ===")

# First, let's see what FILTER values are present in our data
if 'FILTER' in py_mut.data.columns:
    print("FILTER column values:")
    filter_values = py_mut.data['FILTER'].value_counts()
    display(filter_values)
    
    # Example 1: Check specific records for PASS filter
    print("\n1. Checking specific records for PASS filter:")
    
    # Get a few sample records
    sample_records = py_mut.data.head(3)
    
    for idx, row in sample_records.iterrows():
        chrom = row['CHROM']
        pos = row['POS']
        ref = row['REF']
        alt = row['ALT']
        
        is_pass = py_mut.pass_filter(chrom=chrom, pos=pos, ref=ref, alt=alt)
        print(f"Record {chrom}:{pos} {ref}>{alt} - PASS: {is_pass}")
        
    # Example 2: Check a non-existent record
    print("\n2. Checking a non-existent record:")
    is_pass_fake = py_mut.pass_filter(chrom='1', pos=999999999, ref='A', alt='T')
    print(f"Non-existent record - PASS: {is_pass_fake}")
    
else:
    print("FILTER column not found in the dataset")
    print("Available columns:", list(py_mut.data.columns))


=== PASS Filter Check Examples ===
FILTER column values:


FILTER
.    2207
Name: count, dtype: int64

2025-07-30 23:17:37,523 | INFO | pyMut.filters.pass_filter | Checking PASS filter for: chr17:67170917 T>C
2025-07-30 23:17:37,524 | INFO | pyMut.filters.pass_filter | Attempting to use PyArrow optimization



1. Checking specific records for PASS filter:


2025-07-30 23:17:37,531 | INFO | pyMut.filters.pass_filter | PASS filter result: False
2025-07-30 23:17:37,531 | INFO | pyMut.filters.pass_filter | Checking PASS filter for: chr1:94490594 C>T
2025-07-30 23:17:37,532 | INFO | pyMut.filters.pass_filter | Attempting to use PyArrow optimization
2025-07-30 23:17:37,538 | INFO | pyMut.filters.pass_filter | PASS filter result: False
2025-07-30 23:17:37,539 | INFO | pyMut.filters.pass_filter | Checking PASS filter for: chr2:169780250 G>A
2025-07-30 23:17:37,539 | INFO | pyMut.filters.pass_filter | Attempting to use PyArrow optimization


Record chr17:67170917 T>C - PASS: False
Record chr1:94490594 C>T - PASS: False


2025-07-30 23:17:37,546 | INFO | pyMut.filters.pass_filter | PASS filter result: False
2025-07-30 23:17:37,547 | INFO | pyMut.filters.pass_filter | Checking PASS filter for: chr1:999999999 A>T
2025-07-30 23:17:37,547 | INFO | pyMut.filters.pass_filter | Attempting to use PyArrow optimization
2025-07-30 23:17:37,553 | INFO | pyMut.filters.pass_filter | Record not found: chr1:999999999 A>T


Record chr2:169780250 G>A - PASS: False

2. Checking a non-existent record:
Non-existent record - PASS: False
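
In the output above, `False` covers two different situations: the record exists but its FILTER value is `.`, or the record is not in the data at all. If that distinction matters, a three-state lookup is one option. The sketch below is a hypothetical helper built on a plain dictionary, not part of pyMut; the three keys are the records checked above.

```python
from typing import Dict, Tuple

VariantKey = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)

# FILTER values keyed by variant; every record in this dataset has FILTER == '.'.
filter_by_variant: Dict[VariantKey, str] = {
    ("chr17", 67170917, "T", "C"): ".",
    ("chr1", 94490594, "C", "T"): ".",
    ("chr2", 169780250, "G", "A"): ".",
}


def pass_status(key: VariantKey) -> str:
    """Return 'pass', 'not_pass', or 'missing' for a variant (hypothetical helper)."""
    value = filter_by_variant.get(key)
    if value is None:
        return "missing"
    return "pass" if value == "PASS" else "not_pass"


print(pass_status(("chr17", 67170917, "T", "C")))
print(pass_status(("chr1", 999999999, "A", "T")))
```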


## 5. Combining Multiple Filters

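Each dataset-returning filter produces a new PyMutation object, so filters can be applied one after another to build more specific selections. Each step is recorded in `metadata.filters`.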


In [7]:
print("=== Combining Multiple Filters ===")

# Example: Filter by chromosome 17, then by TP53 gene, then by genomic region
print("1. Multi-step filtering: Chromosome 17 → TP53 gene → specific region")

# Step 1: Filter by chromosome 17
step1 = py_mut.filter_by_chrom_sample(chrom='17')
print(f"Step 1 - Chromosome 17: {len(step1.data)} variants")

# Step 2: Filter by TP53 gene
step2 = step1.gen_region('TP53')
print(f"Step 2 - TP53 gene: {len(step2.data)} variants")

# Step 3: Filter by specific region (TP53 locus)
step3 = step2.region(chrom='17', start=7570000, end=7590000)
print(f"Step 3 - TP53 region: {len(step3.data)} variants")

if len(step3.data) > 0:
    print("\nFinal filtered results:")
    display(step3.data[['Hugo_Symbol', 'CHROM', 'POS', 'REF', 'ALT', 'Variant_Classification', 'Tumor_Sample_Barcode']])

# Show the filter history
print(f"\nFilter history: {step3.metadata.filters}")


2025-07-30 23:17:37,592 | INFO | pyMut.filters.chrom_sample_filter | Chromosomes to filter: ['chr17']


=== Combining Multiple Filters ===
1. Multi-step filtering: Chromosome 17 → TP53 gene → specific region


2025-07-30 23:17:37,597 | INFO | pyMut.filters.chrom_sample_filter | Applied chromosome filter: chr17
2025-07-30 23:17:37,598 | INFO | pyMut.filters.chrom_sample_filter | Combined filter applied: chromosome:chr17
2025-07-30 23:17:37,598 | INFO | pyMut.filters.chrom_sample_filter | Variants before filter: 2207
2025-07-30 23:17:37,598 | INFO | pyMut.filters.chrom_sample_filter | Variants after filter: 99
2025-07-30 23:17:37,600 | INFO | pyMut.filters.chrom_sample_filter | Variants filtered out: 2108
2025-07-30 23:17:37,600 | INFO | pyMut.filters.chrom_sample_filter | Successfully applied filter: chromosome:chr17
2025-07-30 23:17:37,600 | INFO | pyMut.filters.genomic_range | Applying gene filter for: TP53
2025-07-30 23:17:37,601 | INFO | pyMut.filters.genomic_range | Source format detected: MAF
2025-07-30 23:17:37,601 | INFO | pyMut.filters.genomic_range | Processing MAF format - looking for Hugo_Symbol column
2025-07-30 23:17:37,601 | INFO | pyMut.filters.genomic_range | Found Hugo_Symbo

Step 1 - Chromosome 17: 99 variants
Step 2 - TP53 gene: 19 variants
Step 3 - TP53 region: 19 variants

Final filtered results:


| | Hugo_Symbol | CHROM | POS | REF | ALT | Variant_Classification | Tumor_Sample_Barcode |
|---|---|---|---|---|---|---|---|
| 1999 | TP53 | chr17 | 7578403 | C | T | MISSENSE_MUTATION | TCGA-AB-2813 |
| 2000 | TP53 | chr17 | 7578181 | - | GCGGCTC | FRAME_SHIFT_INS | TCGA-AB-2820 |
| 2001 | TP53 | chr17 | 7577100 | T | C | MISSENSE_MUTATION | TCGA-AB-2829 |
| 2002 | TP53 | chr17 | 7577609 | C | T | SPLICE_SITE | TCGA-AB-2829 |
| 2003 | TP53 | chr17 | 7579312 | C | T | SPLICE_SITE | TCGA-AB-2838 |
| 2004 | TP53 | chr17 | 7579569 | - | CCATCCAG | FRAME_SHIFT_INS | TCGA-AB-2860 |
| 2005 | TP53 | chr17 | 7578555 | C | T | SPLICE_SITE | TCGA-AB-2868 |
| 2006 | TP53 | chr17 | 7578206 | T | C | MISSENSE_MUTATION | TCGA-AB-2878 |
| 2007 | TP53 | chr17 | 7578414 | A | - | FRAME_SHIFT_DEL | TCGA-AB-2878 |
| 2008 | TP53 | chr17 | 7578272 | G | A | MISSENSE_MUTATION | TCGA-AB-2885 |



Filter history: ['.', 'chromosome:chr17', 'gene_filter:Hugo_Symbol:TP53', 'genomic_region:chr17:7570000-7590000']
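
The chaining pattern used here (each call returns a new object and appends a label to a history) can be sketched independently of pyMut. `VariantSet` and `where` below are hypothetical names for illustration, and the labels only imitate the format seen in the filter history above. The records are `(chrom, pos)` values taken from this notebook's outputs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

Record = Dict[str, object]


@dataclass(frozen=True)
class VariantSet:
    """Hypothetical stand-in for an object holding records plus a filter history."""

    records: Tuple[Record, ...]
    filters: Tuple[str, ...] = ()

    def where(self, label: str, keep: Callable[[Record], bool]) -> "VariantSet":
        # Build a new object; the receiver is left untouched.
        kept = tuple(r for r in self.records if keep(r))
        return VariantSet(kept, self.filters + (label,))


base = VariantSet((
    {"chrom": "chr17", "pos": 7578403},
    {"chrom": "chr17", "pos": 7577100},
    {"chrom": "chr17", "pos": 67170917},
    {"chrom": "chr1", "pos": 94490594},
))

step1 = base.where("chromosome:chr17", lambda r: r["chrom"] == "chr17")
step2 = step1.where(
    "genomic_region:chr17:7570000-7590000",
    lambda r: 7_570_000 <= r["pos"] <= 7_590_000,
)

print(len(base.records), len(step1.records), len(step2.records))
print(step2.filters)
```

Because every step returns a fresh object, intermediate results such as `step1` stay available for inspection, which mirrors how `step1`, `step2`, and `step3` are used in the cell above.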


## Summary

This notebook demonstrated the four main filtering methods available in PyMutation:

1. **`filter_by_chrom_sample`**: Filters by chromosome and/or sample
   - Parameters: `chrom` (str or list), `sample` (str or list), `sample_column` (str)
   - Returns: New PyMutation object with filtered data

2. **`region`**: Filters by genomic coordinates
   - Parameters: `chrom` (str), `start` (int), `end` (int)
   - Returns: New PyMutation object with filtered data

3. **`gen_region`**: Filters by gene name
   - Parameters: `gen_name` (str)
   - Returns: New PyMutation object with filtered data

4. **`pass_filter`**: Checks if specific records have FILTER == "PASS"
   - Parameters: `chrom` (str), `pos` (int), `ref` (str), `alt` (str)
   - Returns: Boolean value

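The three methods that return a PyMutation object (`filter_by_chrom_sample`, `region`, and `gen_region`) preserve the original data structure and update the metadata to track applied filters. `pass_filter` only reports a boolean and does not produce a filtered dataset.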