# PyMutation Filtering Methods Example

This notebook demonstrates the various filtering methods available in PyMutation:
- `filter_by_chrom_sample`: Filter by chromosome and/or sample
- `region`: Filter by genomic coordinates
- `gen_region`: Filter by gene name
- `pass_filter`: Check if specific records have FILTER == "PASS"


In [1]:
import os
from IPython.display import display
from pyMut.input import read_maf

## Load TCGA LAML Dataset


In [2]:
# Load real TCGA LAML data
maf_path = os.path.join('..', '..', '..', 'src', 'pyMut', 'data', 'examples', 'MAF', 'tcga_laml.maf.gz')
# TCGA data is typically based on GRCh37 assembly
py_mut = read_maf(maf_path, assembly="37")

print(f"Loaded TCGA LAML data: {len(py_mut.data)} variants")
print(f"Unique genes: {py_mut.data['Hugo_Symbol'].nunique()}")
print(f"Unique samples: {py_mut.data['Tumor_Sample_Barcode'].nunique()}")
print(f"Chromosomes present: {sorted(py_mut.data['CHROM'].unique())}")

# Display first few rows
print("\nFirst 5 rows of the dataset:")
display(py_mut.data.head())


2025-08-01 02:02:53,078 | INFO | pyMut.input | Starting MAF reading: ../../../src/pyMut/data/examples/MAF/tcga_laml.maf.gz
2025-08-01 02:02:53,079 | INFO | pyMut.input | Loading from cache: ../../../src/pyMut/data/examples/MAF/.pymut_cache/tcga_laml.maf_8bfbda65c4b23428.parquet
2025-08-01 02:02:53,105 | INFO | pyMut.input | Cache loaded successfully in 0.03 seconds


Loaded TCGA LAML data: 2091 variants
Unique genes: 1611
Unique samples: 190
Chromosomes present: ['X', 'chr1', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr2', 'chr20', 'chr21', 'chr22', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9']

First 5 rows of the dataset:


Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,TCGA-AB-2988,TCGA-AB-2869,TCGA-AB-3009,...,Strand,Variant_Classification,Variant_Type,Reference_Allele,Tumor_Seq_Allele1,Tumor_Seq_Allele2,Tumor_Sample_Barcode,Protein_Change,i_TumorVAF_WU,i_transcript_name
0,chr9,100077177,.,T,C,.,.,T|T,T|T,T|T,...,+,SILENT,SNP,T,T,C,TCGA-AB-2886,p.T431T,9.76,NM_020893.1
1,chr9,100085148,.,G,A,.,.,G|G,G|G,G|G,...,+,MISSENSE_MUTATION,SNP,G,G,A,TCGA-AB-2917,p.R581H,18.4,NM_020893.1
2,chr9,100971322,.,A,C,.,.,A|A,A|A,A|A,...,+,MISSENSE_MUTATION,SNP,A,A,C,TCGA-AB-2841,p.L593R,45.83,NM_018421.3
3,chr9,104086335,.,C,T,.,.,C|C,C|C,C|C,...,+,MISSENSE_MUTATION,SNP,C,C,T,TCGA-AB-2877,p.T325I,37.12,NM_017753.2
4,chr9,104124840,.,G,A,.,.,G|A,G|G,G|G,...,+,MISSENSE_MUTATION,SNP,G,G,A,TCGA-AB-2988,p.T376M,48.35,NM_001701.1


## 1. Chromosome and Sample Filtering (filter_by_chrom_sample)

This method filters variants by chromosome, by sample, or by both. It is implemented in `chrom_sample_filter.py`.


In [3]:
print("=== Chromosome and Sample Filtering Examples ===")

# Example 1: Filter by chromosome only
print("\n1. Filter by chromosome 17:")
filtered_chr17 = py_mut.filter_by_chrom_sample(chrom='17')
print(f"Original variants: {len(py_mut.data)}")
print(f"Chromosome 17 variants: {len(filtered_chr17.data)}")

# Example 2: Filter by multiple chromosomes
print("\n2. Filter by chromosomes 17 and X:")
filtered_multi_chr = py_mut.filter_by_chrom_sample(chrom=['17', 'X'])
print(f"Chromosomes 17 and X variants: {len(filtered_multi_chr.data)}")

# Example 3: Filter by sample (get first few samples)
sample_list = py_mut.data['Tumor_Sample_Barcode'].unique()[:3].tolist()
print(f"\n3. Filter by first 3 samples: {sample_list}")
filtered_samples = py_mut.filter_by_chrom_sample(sample=sample_list)
print(f"Filtered by samples variants: {len(filtered_samples.data)}")
print(f"Unique samples in filtered data: {filtered_samples.data['Tumor_Sample_Barcode'].nunique()}")

# Example 4: Combined filtering (chromosome + sample)
print("\n4. Combined filter - chromosome 17 + first sample:")
filtered_combined = py_mut.filter_by_chrom_sample(chrom='17', sample=sample_list[0])
print(f"Combined filter variants: {len(filtered_combined.data)}")


2025-08-01 02:02:53,152 | INFO | pyMut.filters.chrom_sample_filter | Chromosomes to filter: ['chr17']
2025-08-01 02:02:53,156 | INFO | pyMut.filters.chrom_sample_filter | Applied chromosome filter: chr17
2025-08-01 02:02:53,156 | INFO | pyMut.filters.chrom_sample_filter | Combined filter applied: chromosome:chr17
2025-08-01 02:02:53,157 | INFO | pyMut.filters.chrom_sample_filter | Variants before filter: 2091
2025-08-01 02:02:53,157 | INFO | pyMut.filters.chrom_sample_filter | Variants after filter: 99
2025-08-01 02:02:53,157 | INFO | pyMut.filters.chrom_sample_filter | Variants filtered out: 1992
2025-08-01 02:02:53,157 | INFO | pyMut.filters.chrom_sample_filter | Successfully applied filter: chromosome:chr17
2025-08-01 02:02:53,160 | INFO | pyMut.filters.chrom_sample_filter | Chromosomes to filter: ['chr17', 'chrX']
2025-08-01 02:02:53,165 | INFO | pyMut.filters.chrom_sample_filter | Applied chromosome filter: chr17,chrX
2025-08-01 02:02:53,166 | INFO | pyMut.filters.chrom_sample_fil

=== Chromosome and Sample Filtering Examples ===

1. Filter by chromosome 17:
Original variants: 2091
Chromosome 17 variants: 99

2. Filter by chromosomes 17 and X:
Chromosomes 17 and X variants: 205

3. Filter by first 3 samples: ['TCGA-AB-2886', 'TCGA-AB-2917', 'TCGA-AB-2841']
Filtered by samples variants: 34
Unique samples in filtered data: Tumor_Sample_Barcode    3
Tumor_Sample_Barcode    3
dtype: int64

4. Combined filter - chromosome 17 + first sample:
Combined filter variants: 0
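
The log lines above show short labels such as `'17'` being expanded to `'chr17'` before matching. The following plain-Python sketch illustrates that kind of combined chromosome/sample selection. `normalize_chrom`, `_as_list`, and `filter_rows` are hypothetical helpers written for this illustration, not pyMut functions, and the library's actual implementation may differ.

```python
# Illustrative only: hypothetical helpers, not part of pyMut.

def normalize_chrom(label):
    """Return a 'chr'-prefixed label, e.g. '17' -> 'chr17', 'x' -> 'chrX'."""
    text = str(label).strip()
    if text.lower().startswith("chr"):
        text = text[3:]
    return "chr" + text.upper()


def _as_list(value):
    """Wrap a single value in a list; pass lists, tuples and sets through."""
    if isinstance(value, (list, tuple, set)):
        return list(value)
    return [value]


def filter_rows(rows, chrom=None, sample=None):
    """Keep rows that match every criterion that was supplied."""
    chroms = None if chrom is None else {normalize_chrom(c) for c in _as_list(chrom)}
    samples = None if sample is None else set(_as_list(sample))
    kept = []
    for row in rows:
        if chroms is not None and normalize_chrom(row["CHROM"]) not in chroms:
            continue
        if samples is not None and row["Tumor_Sample_Barcode"] not in samples:
            continue
        kept.append(row)
    return kept


# Toy rows; only the two columns used by the filter are included.
rows = [
    {"CHROM": "chr9", "Tumor_Sample_Barcode": "TCGA-AB-2886"},
    {"CHROM": "chr17", "Tumor_Sample_Barcode": "TCGA-AB-2938"},
    {"CHROM": "X", "Tumor_Sample_Barcode": "TCGA-AB-2886"},
]

chr17_rows = filter_rows(rows, chrom="17")
chr17_or_x = filter_rows(rows, chrom=["17", "X"])
combined = filter_rows(rows, chrom="17", sample="TCGA-AB-2886")
print(len(chr17_rows), len(chr17_or_x), len(combined))
```

In this sketch the two criteria are combined with a logical AND, so a chromosome/sample pair with no shared variants produces an empty result. That is consistent with example 4 above returning 0 variants, although the notebook does not confirm how pyMut combines the two criteria internally.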


## 2. Genomic Range Filtering (region)

This method filters by genomic coordinates, given a chromosome plus start and end positions. It is implemented in `genomic_range.py`.


In [4]:
print("=== Genomic Range Filtering Examples ===")

# Example 1: Filter a specific region on chromosome 17
print("\n1. Filter chromosome 17, positions 7,500,000 to 8,000,000:")
filtered_region = py_mut.region(chrom='17', start=7500000, end=8000000)
print(f"Original variants: {len(py_mut.data)}")
print(f"Region variants: {len(filtered_region.data)}")

if len(filtered_region.data) > 0:
    print("Genes in this region:")
    genes_in_region = filtered_region.data['Hugo_Symbol'].value_counts().head(10)
    display(genes_in_region)

# Example 2: Filter a smaller region
print("\n2. Filter chromosome 17, positions 7,570,000 to 7,590,000 (TP53 region):")
filtered_tp53_region = py_mut.region(chrom='17', start=7570000, end=7590000)
print(f"TP53 region variants: {len(filtered_tp53_region.data)}")

if len(filtered_tp53_region.data) > 0:
    print("Variants in TP53 region:")
    display(filtered_tp53_region.data[['Hugo_Symbol', 'CHROM', 'POS', 'REF', 'ALT', 'Variant_Classification']].head())


2025-08-01 02:02:53,222 | INFO | pyMut.filters.genomic_range | Chromosome formatted: '17' -> 'chr17'
2025-08-01 02:02:53,222 | INFO | pyMut.filters.genomic_range | Attempting to use PyArrow optimization
2025-08-01 02:02:53,228 | INFO | pyMut.filters.genomic_range | PyArrow optimization successful
2025-08-01 02:02:53,228 | INFO | pyMut.filters.genomic_range | Genomic filter applied: chr17:7500000-8000000
2025-08-01 02:02:53,229 | INFO | pyMut.filters.genomic_range | Variants before filter: 2091
2025-08-01 02:02:53,229 | INFO | pyMut.filters.genomic_range | Variants after filter: 20
2025-08-01 02:02:53,229 | INFO | pyMut.filters.genomic_range | Variants filtered out: 2071
2025-08-01 02:02:53,230 | INFO | pyMut.filters.genomic_range | Successfully filtered genomic region: chr17:7500000-8000000


=== Genomic Range Filtering Examples ===

1. Filter chromosome 17, positions 7,500,000 to 8,000,000:
Original variants: 2091
Region variants: 20
Genes in this region:


Hugo_Symbol
TP53      19
GUCY2D     1
Name: count, dtype: int64[pyarrow]

2025-08-01 02:02:53,234 | INFO | pyMut.filters.genomic_range | Chromosome formatted: '17' -> 'chr17'
2025-08-01 02:02:53,235 | INFO | pyMut.filters.genomic_range | Attempting to use PyArrow optimization
2025-08-01 02:02:53,240 | INFO | pyMut.filters.genomic_range | PyArrow optimization successful
2025-08-01 02:02:53,241 | INFO | pyMut.filters.genomic_range | Genomic filter applied: chr17:7570000-7590000
2025-08-01 02:02:53,241 | INFO | pyMut.filters.genomic_range | Variants before filter: 2091
2025-08-01 02:02:53,242 | INFO | pyMut.filters.genomic_range | Variants after filter: 19
2025-08-01 02:02:53,242 | INFO | pyMut.filters.genomic_range | Variants filtered out: 2072
2025-08-01 02:02:53,242 | INFO | pyMut.filters.genomic_range | Successfully filtered genomic region: chr17:7570000-7590000



2. Filter chromosome 17, positions 7,570,000 to 7,590,000 (TP53 region):
TP53 region variants: 19
Variants in TP53 region:


Unnamed: 0,Hugo_Symbol,CHROM,POS,REF,ALT,Variant_Classification
1932,TP53,chr17,7574003,G,-,FRAME_SHIFT_DEL
1933,TP53,chr17,7574018,G,A,MISSENSE_MUTATION
1934,TP53,chr17,7576897,G,A,NONSENSE_MUTATION
1935,TP53,chr17,7577081,T,C,MISSENSE_MUTATION
1936,TP53,chr17,7577100,T,C,MISSENSE_MUTATION
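
Conceptually, a coordinate filter like the one above keeps rows whose chromosome matches and whose position falls inside the interval. The sketch below shows that test in plain Python. `in_region` is a hypothetical helper, the out-of-range row is made up, and treating both endpoints as inclusive is an assumption, since the notebook does not state how `region` handles its boundaries.

```python
# Illustrative only: a hypothetical helper, not pyMut's implementation.

def in_region(row, chrom, start, end):
    """True if the row lies on `chrom` with start <= POS <= end (inclusive)."""
    return row["CHROM"] == chrom and start <= row["POS"] <= end


variants = [
    {"CHROM": "chr17", "POS": 7574003},    # position taken from the output above
    {"CHROM": "chr17", "POS": 7577100},    # position taken from the output above
    {"CHROM": "chr17", "POS": 8500000},    # made-up row outside the interval
    {"CHROM": "chr9", "POS": 100077177},   # different chromosome
]

selected = [v for v in variants if in_region(v, "chr17", 7_570_000, 7_590_000)]
print(len(selected))
```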


## 3. Gene-based Filtering (gen_region)

This method filters by gene name, matching against the `Hugo_Symbol` column. It is implemented in `genomic_range.py`.


In [5]:
print("=== Gene-based Filtering Examples ===")

# Get the most common genes in the dataset
common_genes = py_mut.data['Hugo_Symbol'].value_counts().head(5)
print("Most common genes in the dataset:")
display(common_genes)

# Example 1: Filter by TP53 gene
print("\n1. Filter by TP53 gene:")
filtered_tp53 = py_mut.gen_region('TP53')
print(f"TP53 variants: {len(filtered_tp53.data)}")

if len(filtered_tp53.data) > 0:
    print("TP53 variant types:")
    tp53_variants = filtered_tp53.data['Variant_Classification'].value_counts()
    display(tp53_variants)

# Example 2: Filter by the most common gene
most_common_gene = common_genes.index[0]
print(f"\n2. Filter by most common gene ({most_common_gene}):")
filtered_common = py_mut.gen_region(most_common_gene)
print(f"{most_common_gene} variants: {len(filtered_common.data)}")

# Example 3: Filter by multiple genes (using multiple calls)
print("\n3. Filter by multiple genes (FLT3, NPM1, DNMT3A):")
genes_of_interest = ['FLT3', 'NPM1', 'DNMT3A']
for gene in genes_of_interest:
    filtered_gene = py_mut.gen_region(gene)
    print(f"{gene}: {len(filtered_gene.data)} variants")


=== Gene-based Filtering Examples ===
Most common genes in the dataset:


Hugo_Symbol
FLT3      38
DNMT3A    29
TET2      26
CEBPA     19
TP53      19
Name: count, dtype: int64[pyarrow]

2025-08-01 02:02:53,307 | INFO | pyMut.filters.genomic_range | Applying gene filter for: TP53
2025-08-01 02:02:53,307 | INFO | pyMut.filters.genomic_range | Source format detected: MAF
2025-08-01 02:02:53,308 | INFO | pyMut.filters.genomic_range | Processing MAF format - looking for Hugo_Symbol column
2025-08-01 02:02:53,308 | INFO | pyMut.filters.genomic_range | Found Hugo_Symbol column: Hugo_Symbol
2025-08-01 02:02:53,312 | INFO | pyMut.filters.genomic_range | Gene filter applied: TP53
2025-08-01 02:02:53,312 | INFO | pyMut.filters.genomic_range | Variants before filter: 2091
2025-08-01 02:02:53,312 | INFO | pyMut.filters.genomic_range | Variants after filter: 19
2025-08-01 02:02:53,313 | INFO | pyMut.filters.genomic_range | Variants filtered out: 2072
2025-08-01 02:02:53,313 | INFO | pyMut.filters.genomic_range | Successfully filtered data for gene: TP53



1. Filter by TP53 gene:
TP53 variants: 19
TP53 variant types:


Variant_Classification
MISSENSE_MUTATION    11
SPLICE_SITE           3
FRAME_SHIFT_DEL       2
FRAME_SHIFT_INS       2
NONSENSE_MUTATION     1
Name: count, dtype: int64[pyarrow]

2025-08-01 02:02:53,316 | INFO | pyMut.filters.genomic_range | Applying gene filter for: FLT3
2025-08-01 02:02:53,317 | INFO | pyMut.filters.genomic_range | Source format detected: MAF
2025-08-01 02:02:53,317 | INFO | pyMut.filters.genomic_range | Processing MAF format - looking for Hugo_Symbol column
2025-08-01 02:02:53,317 | INFO | pyMut.filters.genomic_range | Found Hugo_Symbol column: Hugo_Symbol
2025-08-01 02:02:53,319 | INFO | pyMut.filters.genomic_range | Gene filter applied: FLT3
2025-08-01 02:02:53,320 | INFO | pyMut.filters.genomic_range | Variants before filter: 2091
2025-08-01 02:02:53,320 | INFO | pyMut.filters.genomic_range | Variants after filter: 38
2025-08-01 02:02:53,320 | INFO | pyMut.filters.genomic_range | Variants filtered out: 2053
2025-08-01 02:02:53,320 | INFO | pyMut.filters.genomic_range | Successfully filtered data for gene: FLT3
2025-08-01 02:02:53,321 | INFO | pyMut.filters.genomic_range | Applying gene filter for: FLT3
2025-08-01 02:02:53,322 | INFO | pyM


2. Filter by most common gene (FLT3):
FLT3 variants: 38

3. Filter by multiple genes (FLT3, NPM1, DNMT3A):
FLT3: 38 variants
NPM1: 14 variants
DNMT3A: 29 variants
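
The loop above calls `gen_region` once per gene and produces one result per call. When a single combined selection is wanted instead, a set-membership test does it in one pass. The sketch below shows the idea in plain Python on made-up toy records; it does not use pyMut.

```python
from collections import Counter

# Toy records: the gene symbols appear in the example above, the rows are made up.
records = [
    {"Hugo_Symbol": "FLT3"},
    {"Hugo_Symbol": "NPM1"},
    {"Hugo_Symbol": "TP53"},
    {"Hugo_Symbol": "FLT3"},
    {"Hugo_Symbol": "DNMT3A"},
]

genes_of_interest = {"FLT3", "NPM1", "DNMT3A"}

# One pass over the records, keeping any row whose gene is in the set.
selected = [r for r in records if r["Hugo_Symbol"] in genes_of_interest]

# Per-gene counts of the selected rows.
counts = Counter(r["Hugo_Symbol"] for r in selected)
print(len(selected), dict(counts))
```

On the pandas DataFrame held in `py_mut.data`, the same selection can be expressed with `Series.isin` on the `Hugo_Symbol` column. That yields a plain DataFrame rather than a PyMutation object, so no filter would be recorded in the metadata.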


## 4. PASS Filter Check (pass_filter)

This method checks whether a single record, identified by `chrom`, `pos`, `ref`, and `alt`, has FILTER == "PASS". It is implemented in `pass_filter.py`.
Note: This method returns a boolean value, not a filtered dataset.


In [6]:
print("=== PASS Filter Check Examples ===")

# First, let's see what FILTER values are present in our data
if 'FILTER' in py_mut.data.columns:
    print("FILTER column values:")
    filter_values = py_mut.data['FILTER'].value_counts()
    display(filter_values)
    
    # Example 1: Check specific records for PASS filter
    print("\n1. Checking specific records for PASS filter:")
    
    # Get a few sample records
    sample_records = py_mut.data.head(3)
    
    for idx, row in sample_records.iterrows():
        chrom = row['CHROM']
        pos = row['POS']
        ref = row['REF']
        alt = row['ALT']
        
        is_pass = py_mut.pass_filter(chrom=chrom, pos=pos, ref=ref, alt=alt)
        print(f"Record {chrom}:{pos} {ref}>{alt} - PASS: {is_pass}")
        
    # Example 2: Check a non-existent record
    print("\n2. Checking a non-existent record:")
    is_pass_fake = py_mut.pass_filter(chrom='1', pos=999999999, ref='A', alt='T')
    print(f"Non-existent record - PASS: {is_pass_fake}")
    
else:
    print("FILTER column not found in the dataset")
    print("Available columns:", list(py_mut.data.columns))


=== PASS Filter Check Examples ===
FILTER column values:


FILTER
.    2091
Name: count, dtype: int64

2025-08-01 02:02:53,391 | INFO | pyMut.filters.pass_filter | Checking PASS filter for: chr9:100077177 T>C
2025-08-01 02:02:53,391 | INFO | pyMut.filters.pass_filter | Attempting to use PyArrow optimization
2025-08-01 02:02:53,398 | INFO | pyMut.filters.pass_filter | PASS filter result: False
2025-08-01 02:02:53,399 | INFO | pyMut.filters.pass_filter | Checking PASS filter for: chr9:100085148 G>A
2025-08-01 02:02:53,399 | INFO | pyMut.filters.pass_filter | Attempting to use PyArrow optimization
2025-08-01 02:02:53,405 | INFO | pyMut.filters.pass_filter | PASS filter result: False
2025-08-01 02:02:53,406 | INFO | pyMut.filters.pass_filter | Checking PASS filter for: chr9:100971322 A>C
2025-08-01 02:02:53,406 | INFO | pyMut.filters.pass_filter | Attempting to use PyArrow optimization
2025-08-01 02:02:53,411 | INFO | pyMut.filters.pass_filter | PASS filter result: False
2025-08-01 02:02:53,412 | INFO | pyMut.filters.pass_filter | Checking PASS filter for: chr1:999999999 A>T
2025-08-01 02:0


1. Checking specific records for PASS filter:
Record chr9:100077177 T>C - PASS: False
Record chr9:100085148 G>A - PASS: False
Record chr9:100971322 A>C - PASS: False

2. Checking a non-existent record:
Non-existent record - PASS: False
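
Every record in this dataset has FILTER set to `.`, so all lookups above return `False`, and so does the lookup for a record that does not exist. A `False` result alone therefore does not say whether the record was missing or simply not marked PASS. The sketch below mirrors the behaviour suggested by that output; `is_pass` is a hypothetical helper, the PASS row is made up for contrast, and pyMut's own logic may differ.

```python
# Illustrative only: a hypothetical helper, not pyMut's implementation.

def is_pass(rows, chrom, pos, ref, alt):
    """True only if a matching record exists and its FILTER is exactly 'PASS'."""
    for row in rows:
        key = (row["CHROM"], row["POS"], row["REF"], row["ALT"])
        if key == (chrom, pos, ref, alt):
            return row["FILTER"] == "PASS"
    return False  # no matching record was found


rows = [
    # First record from the dataset preview above; its FILTER value is '.'.
    {"CHROM": "chr9", "POS": 100077177, "REF": "T", "ALT": "C", "FILTER": "."},
    # Made-up record with FILTER set to PASS, for contrast.
    {"CHROM": "chr1", "POS": 12345, "REF": "A", "ALT": "G", "FILTER": "PASS"},
]

unfiltered = is_pass(rows, "chr9", 100077177, "T", "C")
passing = is_pass(rows, "chr1", 12345, "A", "G")
missing = is_pass(rows, "chr1", 999999999, "A", "T")
print(unfiltered, passing, missing)
```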


## 5. Combining Multiple Filters

Because each filtering method returns a new PyMutation object, calls can be chained to build more specific filters.


In [7]:
print("=== Combining Multiple Filters ===")

# Example: Filter by chromosome 17, then by TP53 gene, then by genomic region
print("1. Multi-step filtering: Chromosome 17 → TP53 gene → specific region")

# Step 1: Filter by chromosome 17
step1 = py_mut.filter_by_chrom_sample(chrom='17')
print(f"Step 1 - Chromosome 17: {len(step1.data)} variants")

# Step 2: Filter by TP53 gene
step2 = step1.gen_region('TP53')
print(f"Step 2 - TP53 gene: {len(step2.data)} variants")

# Step 3: Filter by specific region (TP53 locus)
step3 = step2.region(chrom='17', start=7570000, end=7590000)
print(f"Step 3 - TP53 region: {len(step3.data)} variants")

if len(step3.data) > 0:
    print("\nFinal filtered results:")
    display(step3.data[['Hugo_Symbol', 'CHROM', 'POS', 'REF', 'ALT', 'Variant_Classification', 'Tumor_Sample_Barcode']])

# Show the filter history
print(f"\nFilter history: {step3.metadata.filters}")


2025-08-01 02:02:53,451 | INFO | pyMut.filters.chrom_sample_filter | Chromosomes to filter: ['chr17']
2025-08-01 02:02:53,454 | INFO | pyMut.filters.chrom_sample_filter | Applied chromosome filter: chr17
2025-08-01 02:02:53,455 | INFO | pyMut.filters.chrom_sample_filter | Combined filter applied: chromosome:chr17
2025-08-01 02:02:53,455 | INFO | pyMut.filters.chrom_sample_filter | Variants before filter: 2091
2025-08-01 02:02:53,456 | INFO | pyMut.filters.chrom_sample_filter | Variants after filter: 99
2025-08-01 02:02:53,456 | INFO | pyMut.filters.chrom_sample_filter | Variants filtered out: 1992
2025-08-01 02:02:53,457 | INFO | pyMut.filters.chrom_sample_filter | Successfully applied filter: chromosome:chr17
2025-08-01 02:02:53,457 | INFO | pyMut.filters.genomic_range | Applying gene filter for: TP53
2025-08-01 02:02:53,457 | INFO | pyMut.filters.genomic_range | Source format detected: MAF
2025-08-01 02:02:53,458 | INFO | pyMut.filters.genomic_range | Processing MAF format - looking 

=== Combining Multiple Filters ===
1. Multi-step filtering: Chromosome 17 → TP53 gene → specific region
Step 1 - Chromosome 17: 99 variants
Step 2 - TP53 gene: 19 variants
Step 3 - TP53 region: 19 variants

Final filtered results:


Unnamed: 0,Hugo_Symbol,CHROM,POS,REF,ALT,Variant_Classification,Tumor_Sample_Barcode
1932,TP53,chr17,7574003,G,-,FRAME_SHIFT_DEL,TCGA-AB-2938
1933,TP53,chr17,7574018,G,A,MISSENSE_MUTATION,TCGA-AB-2904
1934,TP53,chr17,7576897,G,A,NONSENSE_MUTATION,TCGA-AB-2908
1935,TP53,chr17,7577081,T,C,MISSENSE_MUTATION,TCGA-AB-2952
1936,TP53,chr17,7577100,T,C,MISSENSE_MUTATION,TCGA-AB-2829
1937,TP53,chr17,7577121,G,A,MISSENSE_MUTATION,TCGA-AB-2943
1938,TP53,chr17,7577538,C,T,MISSENSE_MUTATION,TCGA-AB-2935
1939,TP53,chr17,7577609,C,T,SPLICE_SITE,TCGA-AB-2829
1940,TP53,chr17,7578181,-,GCGGCTC,FRAME_SHIFT_INS,TCGA-AB-2820
1941,TP53,chr17,7578206,T,C,MISSENSE_MUTATION,TCGA-AB-2878



Filter history: ['.', 'chromosome:chr17', 'gene_filter:Hugo_Symbol:TP53', 'genomic_region:chr17:7570000-7590000']
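
The filter history above grows by one entry per step while the object being filtered is never modified. One way to picture that pattern is an immutable container whose filter method returns a new instance with the label appended. The sketch below is a minimal illustration under that assumption: `VariantSet` and `where` are hypothetical names, the `GENE_A` row is made up, and this is not how `PyMutation` is necessarily implemented.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class VariantSet:
    """Hypothetical stand-in for a filterable container; not pyMut's PyMutation."""
    rows: Tuple[Dict, ...]
    filters: Tuple[str, ...] = ()

    def where(self, label: str, keep: Callable[[Dict], bool]) -> "VariantSet":
        """Return a new VariantSet with the matching rows and `label` appended."""
        kept = tuple(r for r in self.rows if keep(r))
        return VariantSet(rows=kept, filters=self.filters + (label,))


original = VariantSet(rows=(
    {"Hugo_Symbol": "TP53", "CHROM": "chr17", "POS": 7574003},
    {"Hugo_Symbol": "TP53", "CHROM": "chr17", "POS": 7577100},
    {"Hugo_Symbol": "GENE_A", "CHROM": "chr9", "POS": 100077177},  # made-up gene label
))

# Labels reuse the strings shown in the filter history above.
result = (
    original
    .where("chromosome:chr17", lambda r: r["CHROM"] == "chr17")
    .where("gene_filter:Hugo_Symbol:TP53", lambda r: r["Hugo_Symbol"] == "TP53")
    .where("genomic_region:chr17:7570000-7590000",
           lambda r: 7_570_000 <= r["POS"] <= 7_590_000)
)

print(len(result.rows), result.filters)
print(len(original.rows), original.filters)
```

Returning a new object at each step keeps intermediate results such as `step1` and `step2` usable after later filters have been applied.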


## Summary

This notebook demonstrated the four main filtering methods available in PyMutation:

1. **`filter_by_chrom_sample`**: Filters by chromosome and/or sample
   - Parameters: `chrom` (str or list), `sample` (str or list), `sample_column` (str)
   - Returns: New PyMutation object with filtered data

2. **`region`**: Filters by genomic coordinates
   - Parameters: `chrom` (str), `start` (int), `end` (int)
   - Returns: New PyMutation object with filtered data

3. **`gen_region`**: Filters by gene name
   - Parameters: `gen_name` (str)
   - Returns: New PyMutation object with filtered data

4. **`pass_filter`**: Checks if specific records have FILTER == "PASS"
   - Parameters: `chrom` (str), `pos` (int), `ref` (str), `alt` (str)
   - Returns: Boolean value

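The three methods that return a PyMutation object leave the original object unchanged and record each applied filter in `metadata.filters`. `pass_filter` only returns a boolean and does not produce a filtered dataset.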