# **SnpEff SNP Mutation Matrix (MPAM) (Tier 2)**

This notebook walks through the analysis and preparation of the SnpEff SNP mutation dataset, generated from Snippy output files (VCF files), for use in machine learning models. The `.pkl` file contains the following data.

Total unique mutations detected: 529,854

**Mutation type distribution:**

  - frameshift: 3,125
  - inframe_indel: 73,303
  - intergenic: 73,890
  - nonsynonymous: 188,882
  - stop/unknown: 26,193
  - synonymous: 162,544

**Frequency distribution:**

  - Rare (1-2 samples): 167,191
  - Uncommon (3-10): 78,980
  - Common (11-50): 95,231
  - Very common (51-100): 41,787
  - Highly common (>100): 146,665
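
The frequency bins above can be reproduced from a per-mutation sample count. The following is a minimal sketch, assuming a hypothetical `bin_mutation_frequencies` helper and toy counts (neither is part of the original notebook); it assumes every mutation is seen in at least one sample.

```python
import numpy as np

def bin_mutation_frequencies(frequencies):
    """Count mutations per frequency bin (hypothetical helper, bins as listed above)."""
    labels = ['Rare (1-2)', 'Uncommon (3-10)', 'Common (11-50)',
              'Very common (51-100)', 'Highly common (>100)']
    # Inclusive upper bounds of the first four bins; anything larger is the last bin
    upper_bounds = np.array([2, 10, 50, 100])
    # side='left' maps a count equal to a bound into the bin that ends at that bound
    bin_index = np.searchsorted(upper_bounds, np.asarray(frequencies), side='left')
    counts = np.bincount(bin_index, minlength=len(labels))
    return {label: int(n) for label, n in zip(labels, counts)}

# Toy counts chosen to sit on every bin boundary (not the real data)
toy_counts = [1, 2, 3, 10, 11, 50, 51, 100, 101, 900]
print(bin_mutation_frequencies(toy_counts))
```

On the real data the same call would take `data['feature_frequencies']` as input.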

# **SNP Helper Functions**

In [None]:
import pickle
import pandas as pd
import numpy as np
import re
from scipy.sparse import load_npz, hstack, csr_matrix

## **LOADING FUNCTIONS**

In [None]:
def load_snp_features(pkl_file='snp_features_ALL_MUTATIONS.pkl'):
    """Load complete SNP feature package"""
    with open(pkl_file, 'rb') as f:
        data = pickle.load(f)
    return data

def load_snp_sparse(npz_file='snp_matrix_sparse.npz'):
    """Load just the sparse matrix"""
    return load_npz(npz_file)

## **File Format CONVERSION FUNCTIONS**

In [None]:
def sparse_to_pandas(data_dict):
    """Convert sparse matrix to pandas DataFrame (use for small subsets only)"""
    sparse_matrix = data_dict['matrix']
    sample_ids = data_dict['sample_ids']
    feature_names = data_dict['feature_names']

    df = pd.DataFrame.sparse.from_spmatrix(
        sparse_matrix,
        index=sample_ids,
        columns=feature_names
    )
    return df

## **FILTERING FUNCTIONS**

In [None]:
def filter_by_frequency(data_dict, min_freq=10, max_freq=None):
    """
    Filter mutations by sample frequency: removes rare mutations (present in very few samples) and fixed or near-fixed mutations (present in all or nearly all samples)

    Fixed mutations (frequency = 100%) have zero variance and provide no discriminative power for prediction.
    Example:
        hemD_S99R: present in 1089/1089 samples (100%)
        This mutation is USELESS for prediction because:
            - All resistant strains have it
            - All susceptible strains have it
            - It's just a fixed lineage marker

    Args:
        min_freq: Minimum samples (default: 10)
        max_freq: Maximum samples (default: n_samples - min_freq)

    Returns:
        Filtered data dictionary

    Reference:
        Brown et al. (2016). Removing invariant features improves
        bacterial GWAS power. PLoS Genet, 12(11), e1006413.
    """
    frequencies = np.array(data_dict['feature_frequencies'])
    n_samples = len(data_dict['sample_ids'])

    if max_freq is None:
        max_freq = n_samples - min_freq

    mask = (frequencies >= min_freq) & (frequencies <= max_freq)

    n_kept = np.sum(mask)
    n_total = len(mask)
    print(f"Frequency filter: kept {n_kept}/{n_total} ({n_kept/n_total*100:.1f}%)")
    print(f"  Removed {np.sum(frequencies < min_freq)} rare mutations (< {min_freq} samples)")
    print(f"  Removed {np.sum(frequencies > max_freq)} fixed mutations (> {max_freq} samples)")

    return {
        'matrix': data_dict['matrix'][:, mask],
        'sample_ids': data_dict['sample_ids'],
        'feature_names': [f for i, f in enumerate(data_dict['feature_names']) if mask[i]],
        'feature_frequencies': [f for i, f in enumerate(frequencies) if mask[i]]
    }
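
The zero-variance argument in the docstring can be checked on a toy matrix. This is a small sketch under stated assumptions: the 4 × 3 presence/absence matrix is made up, and the per-mutation counts are recomputed from the matrix instead of being read from `feature_frequencies`.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy presence/absence matrix: 4 samples x 3 mutations (hypothetical data)
toy = csr_matrix(np.array([
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
]))

# Number of samples carrying each mutation
counts = np.asarray(toy.sum(axis=0)).ravel()
n_samples = toy.shape[0]

# The variance of a binary column is p * (1 - p), which is zero when p is 0 or 1
p = counts / n_samples
variance = p * (1 - p)

# Same rule as the filter: keep min_freq <= count <= n_samples - min_freq
min_freq = 1
keep = (counts >= min_freq) & (counts <= n_samples - min_freq)
filtered = toy[:, keep]

print("counts:", counts)
print("variance:", variance)
print("shape after filtering:", filtered.shape)
```

The first column is present in every sample, so its variance is zero and the filter drops it.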

### **`Mutation Types`**
Of the options available in the `filter_by_type` function, we initially focus on the mutation types most strongly associated with functional change.

The types below are candidates for a comprehensive initial analysis; others can be added later:

| Mutation Type | Impact on Protein | Rationale for Keeping |
| :--- | :--- | :--- |
| **Nonsynonymous** | Changes the amino acid sequence. | These are the canonical **missense mutations** that drive protein functional changes, often leading to drug resistance. |
| **Frameshift** | Shifts the reading frame, leading to a completely different, usually truncated, protein. | Almost always results in a **loss-of-function** (LOF) and is a strong candidate for resistance or fitness cost. |
| **Stop Gain** | Converts an amino acid codon into a premature stop codon. | Causes premature termination and a **loss-of-function** (LOF). Highly relevant for resistance mechanisms that involve regulatory genes. |
| **Stop Loss** | Mutates a stop codon into an amino acid codon. | Results in an **extended, non-functional protein** or unstable mRNA. Also a strong LOF candidate. |
| **Inframe Indel** | Insertion or deletion of codons that keeps the reading frame intact. | Alters the protein structure by adding or removing amino acids. The functional impact can vary but should be retained as it constitutes a structural change. |

**The initial recommended set is:** `['nonsynonymous', 'frameshift', 'stop_gain', 'stop_loss', 'inframe_indel']`

**`Mutation Types to Consider Excluding`**

1. **Synonymous Mutations**

Synonymous mutations (silent mutations) change the DNA sequence but **do not alter the amino acid sequence**. They are generally excluded from preliminary GWAS as they are poor predictors of resistance. However, they can be kept in a secondary analysis if you suspect **codon usage bias** or effects on mRNA stability.

2. **Intergenic Mutations**

Intergenic mutations fall outside of protein-coding regions. While they can be crucial if they land in **regulatory regions** (like promoters or terminators), they are often difficult to interpret and introduce significant noise.

  * **Recommendation:** Keep them separate. If you want to include them, filter them heavily by frequency and only analyze them if your primary coding mutations don't explain the full resistance signal.
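
One way to keep intergenic features separate is to partition the feature names by their prefix. This is only a sketch: `split_intergenic` is a hypothetical helper, and it assumes intergenic features always start with `intergenic_`, as they do in the feature names printed later in this notebook.

```python
def split_intergenic(feature_names):
    """Partition feature names into (coding, intergenic) lists (hypothetical helper)."""
    coding, intergenic = [], []
    for name in feature_names:
        # Assumption: intergenic features carry the literal prefix 'intergenic_'
        if name.startswith('intergenic_'):
            intergenic.append(name)
        else:
            coding.append(name)
    return coding, intergenic

names = ['gyrA_S83L', 'intergenic_NZ_LR881938.1_3384850_A>G', 'malT_Q360Q']
coding, intergenic = split_intergenic(names)
print(coding)
print(intergenic)
```

The two lists can then be filtered with different frequency thresholds and analysed independently.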

In [None]:
def filter_by_type(data_dict, keep_types=('nonsynonymous', 'frameshift')):
    """
    Filter mutations by biological effect

    Args:
        keep_types: List of types to keep

    Returns:
        Filtered data dictionary

    Reference:
        Cingolani et al. (2012). SnpEff: SNP effect prediction.
        Fly, 6(2), 80-92.
    """
    feature_names = data_dict['feature_names']
    mask = []

    for mut in feature_names:
        mut_type = None

        if 'intergenic' in mut:
            mut_type = 'intergenic'
        elif 'frameshift' in mut.lower():
            mut_type = 'frameshift'
        elif 'inframe' in mut.lower() or 'insertion' in mut.lower() or 'deletion' in mut.lower():
            if '_' in mut and not mut.startswith('intergenic'):
                mut_type = 'inframe_indel'
            else:
                mut_type = 'intergenic'
        else:
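            #NOTE: parts[1] assumes the gene name contains no underscore;
            #locus tags such as IEU92_RS16020_D16G fail the regex below and end up as 'unknown'
            parts = mut.split('_')
            if len(parts) >= 2:
                aa_change = parts[1]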
                match = re.match(r'^([A-Z*]+)(\d+)([A-Z*]+|STOP)$', aa_change)

                if match:
                    ref_aa = match.group(1)
                    alt_aa = match.group(3)

                    if alt_aa == 'STOP' or alt_aa == '*':
                        mut_type = 'stop_gain'
                    elif ref_aa == '*':
                        mut_type = 'stop_loss'
                    elif ref_aa == alt_aa:
                        mut_type = 'synonymous'
                    else:
                        mut_type = 'nonsynonymous'
                else:
                    mut_type = 'unknown'
            else:
                mut_type = 'unknown'

        mask.append(mut_type in keep_types)

    mask = np.array(mask)
    n_kept = np.sum(mask)
    n_total = len(mask)
    print(f"Type filter: kept {n_kept}/{n_total} ({n_kept/n_total*100:.1f}%)")

    return {
        'matrix': data_dict['matrix'][:, mask],
        'sample_ids': data_dict['sample_ids'],
        'feature_names': [f for i, f in enumerate(feature_names) if mask[i]],
        'feature_frequencies': [f for i, f in enumerate(data_dict['feature_frequencies']) if mask[i]]
    }

In [None]:
def filter_known_amr_genes(data_dict, gene_list=None):
    """
    Filter to mutations in specific genes (case-insensitive).

    Matching ignores case because annotation tools use different naming
    conventions (Prokka uses lowercase, NCBI uses uppercase).

    Args:
        gene_list: List of gene names (if None, uses default AMR genes)

    Returns:
        Filtered data dictionary

    Reference:
        Seemann (2014). Prokka annotation. Bioinformatics, 30(14), 2068-2069.
    """
    if gene_list is None:
        gene_list = [
            'gyrA', 'gyrB', 'parC', 'parE',  #quinolones
            'ompF', 'ompC', 'ompK36',         #porins
            'marR', 'acrR', 'soxR', 'ramR',   #regulators
            'rpoB', 'rpsL', 'folP', 'folA',   #other targets
            'pmrA', 'pmrB', 'mgrB'            #colistin
        ]

    gene_list_lower = [g.lower() for g in gene_list]
    feature_names = data_dict['feature_names']
    mask = []

    for mut in feature_names:
        gene = mut.split('_')[0].lower()
        mask.append(gene in gene_list_lower)

    mask = np.array(mask)
    n_kept = np.sum(mask)
    n_total = len(mask)
    print(f"Gene filter: kept {n_kept}/{n_total} ({n_kept/n_total*100:.1f}%)")

    return {
        'matrix': data_dict['matrix'][:, mask],
        'sample_ids': data_dict['sample_ids'],
        'feature_names': [f for i, f in enumerate(feature_names) if mask[i]],
        'feature_frequencies': [f for i, f in enumerate(data_dict['feature_frequencies']) if mask[i]]
    }

## **MERGING FUNCTION**

In [None]:
def merge_with_amr_genes(snp_data, amr_df, standardize_ids=True):
    """
    Merge SNP + AMR features with SAFE ALIGNMENT

    Args:
        snp_data: Dict from load_snp_features()
        amr_df: DataFrame (samples × genes)
        standardize_ids: If True, replace the last '_' in every sample ID with '#' (on both inputs) before matching

    Returns:
        Dict with combined sparse matrix
        
    """
    if standardize_ids:
        def replace_last_underscore_with_hash(s):
            s_str = str(s)
            parts = s_str.rsplit('_', 1)
            return '#'.join(parts) if len(parts) > 1 else s_str

        snp_sample_ids = [replace_last_underscore_with_hash(s) for s in snp_data['sample_ids']]
        amr_df = amr_df.copy()
        amr_df.index = amr_df.index.map(replace_last_underscore_with_hash)
    else:
        snp_sample_ids = snp_data['sample_ids']

    #find common samples
    common_samples = set(snp_sample_ids).intersection(set(amr_df.index))
    print(f"Common samples: {len(common_samples)}")

    if len(common_samples) == 0:
        raise ValueError("No common samples! Check ID formatting.")

    #align both to the same (sorted) sample order
    common_list = sorted(common_samples)
    snp_row_of = {s: i for i, s in enumerate(snp_sample_ids)}
    snp_idx = [snp_row_of[s] for s in common_list]  #row k must match common_list[k]
    snp_matrix_aligned = snp_data['matrix'][snp_idx, :]
    amr_df_aligned = amr_df.loc[common_list]
    amr_sparse = csr_matrix(amr_df_aligned.values)

    #combine
    combined = hstack([snp_matrix_aligned, amr_sparse])
    combined_features = snp_data['feature_names'] + list(amr_df_aligned.columns)

    print(f"Combined: {combined.shape}")

    return {
        'matrix': combined,
        'sample_ids': common_list,
        'feature_names': combined_features
    }
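
When two feature tables are stacked side by side, the rows of both must follow the same sample order. The sketch below illustrates ID-based row alignment on toy data; `align_rows` and the sample IDs are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack

def align_rows(matrix, row_ids, target_ids):
    """Reorder sparse rows so that row k corresponds to target_ids[k] (hypothetical helper)."""
    position = {sample: i for i, sample in enumerate(row_ids)}
    return matrix[[position[s] for s in target_ids], :]

# SNP rows are stored in the order s3, s1, s2
snp_ids = ['s3', 's1', 's2']
snp = csr_matrix(np.array([[1, 0], [0, 1], [1, 1]]))

# The gene table covers only s1 and s3
genes = pd.DataFrame({'geneA': [1, 0]}, index=['s1', 's3'])

# Shared samples in one fixed order, used for both inputs
common = sorted(set(snp_ids) & set(genes.index))

combined = hstack(
    [align_rows(snp, snp_ids, common), csr_matrix(genes.loc[common].values)],
    format='csr',
)
print(common)
print(combined.toarray())
```

Selecting rows by a membership test alone would keep the SNP rows in their stored order (s3 before s1) while the gene rows follow the sorted order, so each combined row would mix two different samples.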

## **ANALYSIS HELPERS**

In [None]:
def get_feature_summary(data_dict):
    """Get summary statistics"""
    frequencies = np.array(data_dict['feature_frequencies'])
    return {
        'total_features': len(data_dict['feature_names']),
        'total_samples': len(data_dict['sample_ids']),
        'mean_frequency': np.mean(frequencies),
        'median_frequency': np.median(frequencies),
        'rare_mutations': np.sum(frequencies <= 2),
        'common_mutations': np.sum(frequencies > 50),
        'sparsity': data_dict.get('sparsity', 'N/A')
    }

# **`TIER 2: SNP Mutations + AMR Genes`**
- Filters 529K mutations ---> ~500-1000 high-impact mutations

### **Variant Annotation/SnpEff Workflow**

**Genome Database**: `SnpEff` requires a pre-built reference genome database containing gene and transcript coordinates, usually downloaded from Ensembl or NCBI. For this project, an *E. coli* database had already been downloaded.

**Input**: A VCF (Variant Call Format) file containing the detected variants and their genomic positions.

**Annotation**: For each variant in the VCF, `SnpEff` checks the reference database, determines the surrounding context, and appends a detailed annotation string (the `ANN`/annotation field in the VCF) describing the predicted effect.

The general structure in our data is often `GeneName_RefAAPositionAltAA` (e.g., `gyrA_S83L`), where AA = amino acid and Alt = alternative.
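
The naming scheme can be unpacked with a regular expression. This is a minimal sketch: `parse_feature_name` and the pattern are assumptions made for illustration, and the name is split on the last underscore so that locus tags containing underscores stay intact.

```python
import re

# Amino-acid change such as S83L, W45STOP or L180* (pattern is an assumption)
AA_CHANGE = re.compile(r'^([A-Z*])(\d+)([A-Z*]|STOP)$')

def parse_feature_name(name):
    """Split 'gene_RefPosAlt' into its parts; return None if the suffix is not an AA change."""
    # Split on the LAST underscore so a gene such as IEU92_RS16020 is kept whole
    gene, _, change = name.rpartition('_')
    match = AA_CHANGE.match(change)
    if not gene or match is None:
        return None
    ref, position, alt = match.groups()
    return {'gene': gene, 'ref': ref, 'position': int(position), 'alt': alt}

for feature in ['gyrA_S83L', 'IEU92_RS16020_D16G',
                'intergenic_NZ_LR881938.1_3384850_A>G']:
    print(feature, '->', parse_feature_name(feature))
```

Intergenic features do not end in an amino-acid change, so the helper returns `None` for them.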

**`Interpreting Common Mutation Types`**

The biological impact is categorized based on SnpEff's prediction, which is crucial for distinguishing features in our GWAS.

**`1. Synonymous (Silent) Mutations (Low Impact)`**

These mutations change the DNA sequence but **do not alter the amino acid sequence** because multiple codons can code for the same amino acid. They are generally filtered out in functional GWAS.

| Mutation Example | Breakdown | SnpEff Impact | Implication |
| :--- | :--- | :--- | :--- |
| **`ulaD_V115V`** | **V**aline → **V**aline at position 115 in the *ulaD* gene. | **Synonymous** | No direct change to the protein function. |
| **`malT_Q360Q`** | Glutamine (**Q**) → Glutamine (**Q**) at position 360 in the *malT* gene. | **Synonymous** | Likely neutral, although it could affect mRNA stability or translation speed. |


**`2. Nonsynonymous (Missense) Mutations (Moderate Impact)`**

These mutations change the DNA sequence, resulting in a **different amino acid** in the protein sequence. They are a primary focus in drug resistance studies.

| Mutation Example | Breakdown | SnpEff Impact | Implication |
| :--- | :--- | :--- | :--- |
| **`gyrA_S111P`** | **S**erine → **P**roline at position 111 in the *gyrA* gene. | **Nonsynonymous** | Can directly alter the enzyme's binding pocket (e.g., for quinolones), leading to resistance. |
| **`ampC_T86A`** | **T**hreonine → **A**lanine at position 86 in the *ampC* gene. | **Nonsynonymous** | Alters the active site or stability of the chromosomal β-lactamase, potentially increasing its activity. |
| **`hemD_S99R`** | **S**erine → Arginine (**R**) at position 99 in the *hemD* gene. | **Nonsynonymous** | Changes an amino acid to a chemically different one, potentially impacting protein folding or function. |

**`3. High Impact Mutations (Loss-of-Function)`**

These mutations are the most severe, often resulting in a truncated or completely non-functional protein.

| Mutation Example | Breakdown | SnpEff Impact | Implication |
| :--- | :--- | :--- | :--- |
| **`ldtC_L180P`** | **L**eucine → **P**roline at position 180 in the *ldtC* gene. | **Nonsynonymous** (shown for contrast) | Not a high-impact change as written. Had the change been `L180*`, it would be a **Stop Gain** that shortens the protein; a **Frameshift** at this position would destroy the downstream sequence. Both typically cause **loss of function (LOF)**. |
| **`GeneX_W45STOP`** (hypothetical) | Tryptophan (**W**) → stop codon at position 45 in a hypothetical gene. | **Stop Gain** | Complete LOF. Very strong signal for traits like auxotrophy or resistance via inactivation. |


**`4. Non-Coding and Structural Variations (Modifier Impact)`**

These do not directly alter the protein sequence but can affect gene expression or stability.

| Mutation Example | Breakdown | SnpEff Impact | Implication |
| :--- | :--- | :--- | :--- |
| **`intergenic_NZ...A>G`** | An **A** to **G** substitution occurring in an **intergenic region** (between two genes). | **Intergenic** | May affect the promoter or regulatory sequence of an adjacent gene (e.g., activating `ampC` expression). |
| **`intergenic_NZ...insertion_1bp`** | A 1 base pair **insertion** in an **intergenic region**. | **Intergenic** | A structural change that could disrupt a transcription factor binding site or promoter element. |


**Interpreting Filtered List**

The final `827` mutations are informative because they are the **Moderate and High Impact** mutations in known AMR genes:

| Our Data Example | Interpretation | Resistance Relevance |
| :--- | :--- | :--- |
| **`gyrA_R91C`** | Nonsynonymous in DNA Gyrase. | Substitutions in `gyrA` (classically `S83L/N` and `D87N`) are the primary mechanism of **quinolone resistance**; `R91C` is a candidate of the same kind. |
| **`robA_L279V`** | Nonsynonymous in a regulator (`robA`). | Can lead to **Efflux pump hyper-expression** and Multi-Drug Resistance (MDR). |
| **`ftsI_H425Q`** | Nonsynonymous in PBP3 (`ftsI`). | Mutation in a Penicillin-Binding Protein (PBP) target for β-lactams, potentially conferring **cephalosporin resistance**. |

In [None]:
import os
import pandas as pd
from pathlib import Path
import numpy as np
import zipfile
import pickle
from scipy.sparse import csr_matrix

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
zip_file_path = '/content/drive/MyDrive/data/snp_features.zip'
#extract_to_path = './snp_features'
extract_to_path = '/content/drive/MyDrive/snp_features'

#create the extraction directory if it doesn't exist
os.makedirs(extract_to_path, exist_ok=True)

# Extract the zip file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(extract_to_path)

print(f"Successfully extracted '{zip_file_path}' to '{extract_to_path}'")

Successfully extracted '/content/drive/MyDrive/data/snp_features.zip' to '/content/drive/MyDrive/snp_features'


## **Load sparse matrix**

In [None]:
with open('/content/drive/MyDrive/snp_features/snp_features_ALL_MUTATIONS.pkl', 'rb') as f:
    data = pickle.load(f)

#structure
print(data.keys())

dict_keys(['matrix', 'sample_ids', 'feature_names', 'feature_frequencies', 'mutation_types', 'shape', 'sparsity'])


In [None]:
#inspect
print(f"Shape: {data['matrix'].shape}")
print(f"Samples: {len(data['sample_ids'])}")
print(f"Features: {len(data['feature_names'])}")
print(f"Sparsity: {1 - data['matrix'].nnz / (data['matrix'].shape[0] * data['matrix'].shape[1]):.2%}")

#first few feature names
print(data['feature_names'][:20])

Shape: (1089, 529854)
Samples: 1089
Features: 529854
Sparsity: 86.69%
['ulaD_V115V', 'fumB_A353T', 'hemD_S99R', 'cysI_R566R', 'truA_T102T', 'IEU92_RS16020_D16G', 'malT_Q360Q', 'intergenic_NZ_LR881938.1_1080684_insertion_1bp', 'yciT_I244M', 'grpE_V44V', 'plsX_P42P', 'acnA_A522A', 'ldtC_L180P', 'tdh_N254H', 'intergenic_NZ_LR881938.1_3384850_A>G', 'yfeX_V194V', 'yebF_D37G', 'mepM_N298H', 'hflX_T338I', 'intergenic_NZ_LR881938.1_1271723_A>G']
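
Sparsity here is the fraction of matrix cells that are zero, and a mutation's frequency is the number of samples with a stored entry in its column. Both can be read directly off a SciPy sparse matrix. The sketch below uses a made-up 3 × 4 matrix to show the calculation.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy presence/absence matrix: 3 samples x 4 mutations (hypothetical data)
toy = csr_matrix(np.array([
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 1, 0, 1],
]))

n_rows, n_cols = toy.shape

# Fraction of cells that are zero: 1 - stored entries / total cells
sparsity = 1 - toy.nnz / (n_rows * n_cols)

# Stored entries per column = number of samples carrying each mutation
feature_frequencies = toy.getnnz(axis=0)

print(f"Sparsity: {sparsity:.2%}")
print("Frequencies:", feature_frequencies)
```

Note that `getnnz` counts stored entries, so it matches the sample count only when the matrix holds no explicitly stored zeros.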


## **Load SNP Data**

In [None]:
data = load_snp_features('/content/drive/MyDrive/snp_features/snp_features_ALL_MUTATIONS.pkl')
print(f"\nLoaded: {data['matrix'].shape}")
print(f"Sparsity: {data['sparsity']:.2f}%")


Loaded: (1089, 529854)
Sparsity: 86.69%


## **Frequency Filter (Remove Rare + Fixed)**
- **`Rare`** = present in only a few isolates (for example, fewer than 5 or 10).
- **`Fixed`** = present in all or almost all isolates.
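
The sample-count thresholds translate into population shares with simple arithmetic. The following sketch uses a hypothetical `frequency_window` helper together with the sample count and `min_freq` used in this notebook.

```python
def frequency_window(n_samples, min_freq):
    """Return the kept count range and its share of all samples (hypothetical helper)."""
    # Symmetric window: drop mutations in fewer than min_freq samples
    # and mutations missing from fewer than min_freq samples
    max_freq = n_samples - min_freq
    return min_freq, max_freq, min_freq / n_samples, max_freq / n_samples

lo, hi, lo_share, hi_share = frequency_window(n_samples=1089, min_freq=10)
print(f"keep mutations in {lo}-{hi} samples ({lo_share:.1%} - {hi_share:.1%})")
```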

In [None]:
#keep mutations in 10-1079 samples (0.9% - 99.1%)
data_freq_filtered = filter_by_frequency(data, min_freq=10, max_freq=1079)
print(f"\nAfter frequency filter: {data_freq_filtered['matrix'].shape}")

Frequency filter: kept 289475/529854 (54.6%)
  Removed 240168 rare mutations (< 10 samples)
  Removed 211 fixed mutations (> 1079 samples)

After frequency filter: (1089, 289475)


## **Type Filtering (Keep High-Impact Mutations)**

In [None]:
#keep nonsynonymous, frameshift, stop_gain, stop_loss
data_coding = filter_by_type(
    data_freq_filtered,
    keep_types=['nonsynonymous', 'frameshift', 'stop_gain', 'stop_loss']
)
print(f"\nAfter type filter: {data_coding['matrix'].shape}")

Type filter: kept 113779/289475 (39.3%)

After type filter: (1089, 113779)


## **Gene Filter (Focus on Known AMR Genes)**

**`Initial List`**

The initial gene filter was designed to capture primarily chromosomally acquired mutations associated with AMR.


| Gene(s) | Antimicrobial Class / Function |
|---------|-------------------------------|
| `gyrA`, `gyrB`, `parC`, `parE` | Quinolones (QRDR mutations) |
| `ompF`, `ompC`, `ompK36`, `ompK35` | Beta-lactams (porin mutations) |
| `marR`, `acrR`, `soxR`, `ramR`, `robA` | Efflux regulators |
| `rpoB` | Rifampin |
| `rpsL`, `rrs` | Aminoglycosides |
| `folP`, `folA`, `thyA` | Folate pathway (sulfonamides/trimethoprim) |
| `pmrA`, `pmrB`, `mgrB`, `phoP`, `phoQ` | Colistin |
| `ftsI`, `pbp1a`, `pbp2` | Beta-lactams (PBPs) |
| `erm`, `rplD`, `rplV` | Macrolides |

**`Expanded List`**

The list was subsequently expanded to include additional genes for more comprehensive coverage of known AMR roles.


| Gene(s) | Antimicrobial Class / Function | Notes |
|---------|-------------------------------|-------|
| `gyrA`, `gyrB`, `parC`, `parE`, `qnrA`, `qnrB`, `qnrS` | Quinolones | Added `qnr` for comprehensive coverage |
| `ompF`, `ompC`, `ompK36`, `ompK35` | Beta-lactams (porins) | |
| `marR`, `acrR`, `soxR`, `ramR`, `robA` | Efflux regulators | |
| `rpoB` | Rifampin | |
| `rpsL`, `rrs` | Aminoglycosides | |
| `folP`, `folA`, `thyA` | Folate pathway | |
| `pmrA`, `pmrB`, `mgrB`, `phoP`, `phoQ` | Colistin | |
| `ftsI`, `pbp1a`, `pbp2`, `pbp3`, `ampC`, `tolC` | Beta-lactams / Efflux | Added `pbp3`, `ampC`, and `tolC` (outer membrane channel of the AcrAB-TolC efflux pump) |
| `erm`, `rplD`, `rplV` | Macrolides | |


**`Rationale for Including qnr Genes`**

While `qnr` genes are typically plasmid-borne and captured in **Tier 1A**, they were included in the chromosomal SNP list as a safeguard. `qnrA`, `qnrB`, and `qnrS` represent a crucial mechanism for quinolone resistance by **protecting the DNA gyrase/topoisomerase targets** from the drug. Given the focus on quinolone resistance (with `gyrA` and `parC` as top hits), this inclusion ensures the capture of rare events where a `qnr` gene has chromosomally integrated and acquired a functional point mutation.

In [None]:
#focus on genes with known AMR roles
amr_genes = [
    #quinolones (QRDR mutations)
    'gyrA', 'gyrB', 'parC', 'parE',
    'qnrA', 'qnrB', 'qnrS', #added qnr for comprehensive quinolone coverage

    #beta-lactams (porin mutations)
    'ompF', 'ompC', 'ompK36', 'ompK35',
    #efflux regulators
    'marR', 'acrR', 'soxR', 'ramR', 'robA',
    #rifampin
    'rpoB',
    #aminoglycosides
    'rpsL', 'rrs',
    #folate pathway (sulfonamides/trimethoprim)
    'folP', 'folA', 'thyA',
    #colistin
    'pmrA', 'pmrB', 'mgrB', 'phoP', 'phoQ',
    #beta-lactams (PBPs)
    'ftsI', 'pbp1a', 'pbp2',

    'pbp3', 'ampC',

    'tolC', #outer membrane channel of the AcrAB-TolC efflux pump

    #macrolides
    'erm', 'rplD', 'rplV', #ribosomal proteins for Macrolides ('rplD', 'rplV')

    # 'nfsA', 'nfsB',      #nitrofurantoin resistance
    # 'blaTEMp',           #beta-lactamase promoter
    # 'fabI',              #triclosan/biocide resistance
    # 'uhpT'               #fosfomycin resistance
]

data_amr_muts = filter_known_amr_genes(data_coding, gene_list=amr_genes)
print(f"\nAfter gene filter: {data_amr_muts['matrix'].shape}")

Gene filter: kept 827/113779 (0.7%)

After gene filter: (1089, 827)


## **Convert to DataFrame (Now Manageable Size)**

In [None]:
tier2_snps = sparse_to_pandas(data_amr_muts)
print(f"\nTier 2 SNPs DataFrame: {tier2_snps.shape}")
print(f"\nFirst 10 mutations:")
print(tier2_snps.columns[:10].tolist())


Tier 2 SNPs DataFrame: (1089, 827)

First 10 mutations:
['ampC_T86A', 'ampC_A356S', 'gyrA_I198L', 'gyrA_S111P', 'gyrA_R91C', 'robA_L279V', 'ftsI_H425Q', 'thyA_Q39E', 'marR_I137L', 'thyA_L184F']


In [None]:
#save filtered SNPs
tier2_snps.to_csv('/content/drive/MyDrive/amr_features/tier2_snp_mutations_filtered.csv')
print("\nSaved: tier2_snp_mutations_filtered.csv")


Saved: tier2_snp_mutations_filtered.csv


## **Merge SNPs with Tier 1A (Acquired AMR Genes)**

In [None]:
tier1c = pd.read_csv('/content/drive/MyDrive/amr_features/population_structure_markers.csv', index_col=0)
print(f"\nTier 1C loaded: {tier1c.shape}")

common_cols = set(tier2_snps.columns.tolist()).intersection(set(tier1c.columns))

print(f"Number of common columns: {len(common_cols)}")
print("Common columns:")
for col in sorted(list(common_cols)):
    print(f"- {col}")


#check overlap between AMRFinder and SnpEff
amrfinder_muts = set(tier1c.columns)
snpeff_muts = set(tier2_snps.columns)

overlap = amrfinder_muts.intersection(snpeff_muts)
print(f"Overlap: {len(overlap)}/{len(amrfinder_muts)} AMRFinder mutations found in SnpEff")

only_in_amrfinder = amrfinder_muts - snpeff_muts
print(f"Only in AMRFinder: {len(only_in_amrfinder)}")


Tier 1C loaded: (1168, 69)
Number of common columns: 5
Common columns:
- gyrA_S83L
- parC_A108T
- parC_E84V
- parC_S80I
- parE_S458T
Overlap: 5/69 AMRFinder mutations found in SnpEff
Only in AMRFinder: 64


In [None]:
#load Tier 1A (without chromosomal mutations)
tier1a = pd.read_csv('/content/drive/MyDrive/amr_features/tier1a_acquired_amr_genes_CORRECTED.csv', index_col=0)
print(f"\nTier 1A loaded: {tier1a.shape}")


#merge using corrected function
tier2_combined = merge_with_amr_genes(data_amr_muts, tier1a, standardize_ids=True)

#convert to DataFrame
tier2_df = pd.DataFrame.sparse.from_spmatrix(
    tier2_combined['matrix'],
    index=tier2_combined['sample_ids'],
    columns=tier2_combined['feature_names']
)


Tier 1A loaded: (1651, 409)
Common samples: 1089
Combined: (1089, 1236)


In [None]:
print(f"Final shape: {tier2_df.shape}")
print(f"  - Tier 1A genes: {len(tier1a.columns)}")
print(f"  - SNP mutations: {len(data_amr_muts['feature_names'])}")
print(f"  - Total features: {len(tier2_combined['feature_names'])}")

Final shape: (1089, 1236)
  - Tier 1A genes: 409
  - SNP mutations: 827
  - Total features: 1236


In [None]:
#save Tier 2 matrix
tier2_df.to_csv('/content/drive/MyDrive/amr_features/tier2_amr_genes_plus_mutations.csv')
print("\nSaved: tier2_amr_genes_plus_mutations.csv")


Saved: tier2_amr_genes_plus_mutations.csv


## **Summary Statistics**

In [None]:
tier2_df.head()

Unnamed: 0,ampC_T86A,ampC_A356S,gyrA_I198L,gyrA_S111P,gyrA_R91C,robA_L279V,ftsI_H425Q,thyA_Q39E,marR_I137L,thyA_L184F,...,rmtB1,sat2,sul1.1,sul2.1,sul3.1,tet(A).1,tet(B),tet(D).1,tet(M),tet(X4)
11657_5#25,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0,0.0,0.0,0.0,0,0.0,0.0,0,0,0
11657_5#26,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0,0.0,0.0,0.0,0,0.0,0.0,0,0,0
11657_5#27,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0,1.0,0.0,0.0,0,0.0,0.0,0,0,0
11657_5#29,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0,0.0,1.0,1.0,0,1.0,0.0,0,0,0
11657_5#30,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0,0.0,1.0,0.0,0,0.0,1.0,0,0,0


In [None]:
combined_frequencies = np.array(tier2_combined['matrix'].astype(bool).sum(axis=0)).flatten()
tier2_combined['feature_frequencies'] = combined_frequencies

summary = get_feature_summary(tier2_combined)
print(f"\nTotal features: {summary['total_features']}")
print(f"Total samples: {summary['total_samples']}")
print(f"Mean frequency: {summary['mean_frequency']:.1f} samples")
print(f"Median frequency: {summary['median_frequency']:.1f} samples")


Total features: 1236
Total samples: 1089
Mean frequency: 247.0 samples
Median frequency: 59.5 samples


In [None]:
#count mutations by gene
from collections import Counter
mutation_counts = Counter([f.split('_')[0] for f in data_amr_muts['feature_names']])
print(f"\nTop 10 genes with most mutations:")
for gene, count in mutation_counts.most_common(10):
    print(f"  {gene}: {count} mutations")


Top 10 genes with most mutations:
  gyrA: 135 mutations
  gyrB: 106 mutations
  parC: 99 mutations
  parE: 81 mutations
  phoQ: 48 mutations
  ampC: 46 mutations
  robA: 39 mutations
  pmrB: 36 mutations
  folP: 35 mutations
  ompC: 34 mutations


In [None]:
#  tier2_df = pd.read_csv('tier2_amr_genes_plus_mutations.csv', index_col=0)

#verify expected mutations are present
quinolone_muts = [col for col in tier2_df.columns if 'gyr' in col.lower() or 'par' in col.lower()]
print(f"Quinolone mutations: {len(quinolone_muts)}")
print(quinolone_muts[:10])

Quinolone mutations: 421
['gyrA_I198L', 'gyrA_S111P', 'gyrA_R91C', 'gyrA_Y100H', 'parE_V44L', 'parE_V39F', 'gyrA_G468R', 'parE_S164R', 'parE_Q211E', 'parE_N299H']


In [None]:
if 'gyrA_S83L' in tier2_df.columns:
    print(f"gyrA_S83L present in {tier2_df['gyrA_S83L'].sum()} samples")
else:
    print("WARNING: gyrA_S83L missing!")

gyrA_S83L present in 239.0 samples


# **Feature Frequencies Separately for Genes and SNPs**
Our current output:
```
Mean frequency: 247.0 samples
Median frequency: 59.5 samples
```
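These pooled numbers are hard to interpret because the matrix mixes two feature classes with very different frequency profiles:
- AMR genes: presence/absence calls for acquired genes; a few are widespread (mobile, epidemic clones), while many are seen in only a handful of samples
- SNP mutations: chromosomal variants that were already filtered to at least 10 samples

Pooling them hides both distributions.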
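
The effect of pooling can be seen on a toy example. The counts below are invented for illustration and are not taken from the dataset; the sketch only shows that the pooled median describes neither group.

```python
import numpy as np

# Hypothetical per-feature sample counts for the two feature classes
gene_freq = np.array([0, 1, 2, 2, 900])       # mostly rare, one widespread gene
snp_freq = np.array([40, 90, 110, 150, 600])  # moderately frequent mutations
pooled = np.concatenate([gene_freq, snp_freq])

for label, values in [('genes', gene_freq), ('SNPs', snp_freq), ('pooled', pooled)]:
    print(f"{label:>6}: mean={values.mean():.1f}, median={np.median(values):.1f}")
```

Reporting the two groups separately keeps the skew of the gene counts from distorting the picture of the SNP counts, and vice versa.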

In [None]:
import pandas as pd
import numpy as np

## **Separate Statistics**

In [None]:
def get_feature_summary_separate(tier2_df, tier1a_columns):
    """
    Calculate statistics separately for genes vs mutations

    Args:
        tier2_df: Combined Tier 2 DataFrame (samples × features)
        tier1a_columns: List of column names from Tier 1A (genes)
    """

    #separate features
    gene_cols = [col for col in tier2_df.columns if col in tier1a_columns]
    snp_cols = [col for col in tier2_df.columns if col not in tier1a_columns]

    #calculate frequencies
    gene_freq = tier2_df[gene_cols].sum(axis=0)
    snp_freq = tier2_df[snp_cols].sum(axis=0)

    #summary statistics
    summary = {
        'total_features': len(tier2_df.columns),
        'total_samples': len(tier2_df),

        #gene statistics
        'n_genes': len(gene_cols),
        'gene_freq_mean': gene_freq.mean(),
        'gene_freq_median': gene_freq.median(),
        'gene_freq_min': gene_freq.min(),
        'gene_freq_max': gene_freq.max(),
        'gene_freq_std': gene_freq.std(),

        #SNP statistics
        'n_snps': len(snp_cols),
        'snp_freq_mean': snp_freq.mean(),
        'snp_freq_median': snp_freq.median(),
        'snp_freq_min': snp_freq.min(),
        'snp_freq_max': snp_freq.max(),
        'snp_freq_std': snp_freq.std(),

        #frequency distributions
        'gene_frequencies': gene_freq,
        'snp_frequencies': snp_freq
    }

    return summary

In [None]:
tier2_df = pd.read_csv('/content/drive/MyDrive/amr_features/tier2_amr_genes_plus_mutations.csv', index_col=0)
tier1a = pd.read_csv('/content/drive/MyDrive/amr_features/tier1a_acquired_amr_genes_CORRECTED.csv', index_col=0)

summary = get_feature_summary_separate(tier2_df, tier1a.columns.tolist())

In [None]:
print("TIER 2 FEATURE SUMMARY")
print(f"\nTotal Features: {summary['total_features']}")
print(f"Total Samples: {summary['total_samples']}")

TIER 2 FEATURE SUMMARY

Total Features: 1236
Total Samples: 1089


In [None]:
print("ACQUIRED AMR GENES (Tier 1A)")
print(f"Count: {summary['n_genes']}")
print(f"Frequency (samples):")
print(f"  Mean:   {summary['gene_freq_mean']:.1f}")
print(f"  Median: {summary['gene_freq_median']:.1f}")
print(f"  Min:    {summary['gene_freq_min']:.0f}")
print(f"  Max:    {summary['gene_freq_max']:.0f}")
print(f"  Std:    {summary['gene_freq_std']:.1f}")

ACQUIRED AMR GENES (Tier 1A)
Count: 409
Frequency (samples):
  Mean:   163.7
  Median: 2.0
  Min:    0
  Max:    1089
  Std:    353.5


In [None]:
print("CHROMOSOMAL MUTATIONS (Tier 2)")
print(f"Count: {summary['n_snps']}")
print(f"Frequency (samples):")
print(f"  Mean:   {summary['snp_freq_mean']:.1f}")
print(f"  Median: {summary['snp_freq_median']:.1f}")
print(f"  Min:    {summary['snp_freq_min']:.0f}")
print(f"  Max:    {summary['snp_freq_max']:.0f}")
print(f"  Std:    {summary['snp_freq_std']:.1f}")

CHROMOSOMAL MUTATIONS (Tier 2)
Count: 827
Frequency (samples):
  Mean:   288.2
  Median: 110.0
  Min:    10
  Max:    1055
  Std:    330.3


## **FREQUENCY DISTRIBUTIONS**

In [None]:
#SNP frequency bins
snp_freq_array = summary['snp_frequencies'].values
print("\nSNP Mutations by Frequency:")
print(f"  Rare (1-10 samples):      {np.sum((snp_freq_array >= 1) & (snp_freq_array <= 10))}")
print(f"  Uncommon (11-50):         {np.sum((snp_freq_array > 10) & (snp_freq_array <= 50))}")
print(f"  Common (51-200):          {np.sum((snp_freq_array > 50) & (snp_freq_array <= 200))}")
print(f"  Very common (201-500):    {np.sum((snp_freq_array > 200) & (snp_freq_array <= 500))}")
print(f"  Highly prevalent (>500):  {np.sum(snp_freq_array > 500)}")


SNP Mutations by Frequency:
  Rare (1-10 samples):      11
  Uncommon (11-50):         266
  Common (51-200):          228
  Very common (201-500):    121
  Highly prevalent (>500):  201


In [None]:
# Gene frequency bins (genes with frequency 0 fall outside every bin below)
gene_freq_array = summary['gene_frequencies'].values
print("\nAMR Genes by Frequency:")
print(f"  Rare (1-50 samples):      {np.sum((gene_freq_array >= 1) & (gene_freq_array <= 50))}")
print(f"  Uncommon (51-200):        {np.sum((gene_freq_array > 50) & (gene_freq_array <= 200))}")
print(f"  Common (201-500):         {np.sum((gene_freq_array > 200) & (gene_freq_array <= 500))}")
print(f"  Very common (>500):       {np.sum(gene_freq_array > 500)}")


AMR Genes by Frequency:
  Rare (1-50 samples):      183
  Uncommon (51-200):        40
  Common (201-500):         14
  Very common (>500):       54
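The four gene bins above sum to 183 + 40 + 14 + 54 = 291, fewer than the 409 gene columns, which would be consistent with the remaining 118 columns having a frequency of 0 (the reported minimum). A minimal sketch of a binning helper that makes the zero bin explicit is shown below. It uses `pd.cut` with right-closed intervals; the helper name `bin_frequencies`, the bin labels, and the toy frequencies are assumptions for illustration, not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd

def bin_frequencies(freqs, edges, labels):
    """Count features per frequency bin; intervals are right-closed, i.e. (lo, hi]."""
    binned = pd.cut(pd.Series(freqs), bins=edges, labels=labels, right=True)
    # sort=False keeps the bins in their defined order and retains empty bins
    return binned.value_counts(sort=False)

# Toy frequencies (hypothetical values, not the notebook's data)
toy_freqs = np.array([0, 0, 1, 2, 50, 51, 200, 201, 500, 501, 1089])

# The first edge is -1 so that a frequency of exactly 0 lands in its own bin
edges = [-1, 0, 50, 200, 500, np.inf]
labels = ["Absent (0)", "Rare (1-50)", "Uncommon (51-200)",
          "Common (201-500)", "Very common (>500)"]

counts = bin_frequencies(toy_freqs, edges, labels)
print(counts)
```

Because every value falls into exactly one interval, the bin counts always sum to the number of features, which gives a quick sanity check against `summary['n_genes']`.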


**Analysis of Separate Frequencies**

1. **ACQUIRED AMR GENES (Tier 1A)**

| Statistic | Value | Interpretation |
| :--- | :--- | :--- |
| **Count** | 409 | Total acquired genes in the combined matrix. |
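| **Median Freq.** | **2.0 samples** | At least half of the acquired genes occur in two or fewer samples, so most are **very rare** (e.g., specific serotype-linked genes, rare resistance types). The minimum of 0 shows that some gene columns are absent from every sample. |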
| **Mean Freq.** | **163.7 samples** | The mean is pulled up by a few **highly common epidemic genes**. |
| **Max Freq.** | **1089 samples** | At least one gene is present in **every single sample** (likely a universally conserved, non-variable gene that was included in the source file). |
| **Very Common (>500)** | **54 genes** | This represents the highly **prevalent, mobile genes** ($\text{bla}_{\text{TEM}}$, $\text{tetA}$, etc.) that are characteristic of epidemic clones. |
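
The table reports a minimum of 0 and a maximum of 1089 samples, so some gene columns appear to be constant across all samples (always absent or always present). Such columns have zero variance and cannot help a model discriminate between samples. A minimal sketch for locating them follows; the helper name `find_constant_features` and the toy matrix are hypothetical, and the sketch assumes a dense samples-by-features DataFrame of 0/1 values.

```python
import pandas as pd

def find_constant_features(df):
    """Return the columns that hold a single value in every sample (zero variance)."""
    n_unique = df.nunique(dropna=False)
    return n_unique[n_unique <= 1].index.tolist()

# Toy presence/absence matrix (hypothetical gene names and sample IDs)
toy = pd.DataFrame(
    {
        "gene_all": [1, 1, 1, 1],   # present in every sample
        "gene_none": [0, 0, 0, 0],  # absent from every sample
        "gene_some": [1, 0, 1, 0],  # variable, potentially informative
    },
    index=["s1", "s2", "s3", "s4"],
)

constant = find_constant_features(toy)
print(constant)

# Keep only the columns that vary across samples
informative = toy.drop(columns=constant)
print(informative.columns.tolist())
```

Whether to drop these columns before modeling is a design choice; keeping them is harmless for most tree-based models but adds noise-free dead weight to the feature matrix.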


2. **CHROMOSOMAL MUTATIONS (Tier 2)**


| Statistic | Value | Interpretation |
| :--- | :--- | :--- |
| **Count** | 827 | Total chromosomal SNP features. |
| **Median Freq.** | **110.0 samples** | The typical SNP is present in about one tenth of the 1089 samples (110/1089 ≈ 10%), frequent enough to give a model a usable signal. |
| **Min Freq.** | **10 samples** | Confirms that the initial frequency filter removed ultra-rare mutations (those found in fewer than 10 samples). |
| **Highly Prevalent (>500)** | **201 mutations** | **This is the key finding.** These are likely the **major clonal/core mutations** that define large, potentially multi-drug resistant lineages, and they may be highly predictive. Note that $gyrA\_S83L$, seen earlier at 239 samples, falls in the 201-500 bin rather than in this one. |


**Conclusion: Ready for Modeling**

Our separate frequency analysis provides the necessary justification for our feature selection:

1.  **Tier 1A (Acquired Genes):** Shows a huge diversity (median 2.0) but also contains powerful, high-frequency signals (54 genes $>500$ samples) crucial for explaining resistance.
2.  **Tier 2 (Chromosomal Mutations):** Shows a broader distribution (median 110 samples), with a large block of **201 highly prevalent mutations** that may serve as markers for successful evolutionary lineages.

The separation shows that the combined mean frequency of 247.0 was misleading because it hid the true complexity: the **Tier 1A median of 2.0** and the **Tier 2 median of 110.0**.
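
The combined mean of 247.0 can be recovered from the two group summaries as a count-weighted average, which shows why it says little about either group. A minimal sketch, using the counts and rounded means printed earlier in this notebook; the helper name `pooled_mean` is hypothetical.

```python
def pooled_mean(counts, means):
    """Count-weighted mean of several groups, given each group's size and mean."""
    total = sum(counts)
    return sum(n * m for n, m in zip(counts, means)) / total

# 409 acquired genes (mean 163.7) and 827 chromosomal mutations (mean 288.2)
combined = pooled_mean([409, 827], [163.7, 288.2])
print(f"Combined mean frequency: {combined:.1f}")
```

The result sits between the two group means and far above both medians (2.0 and 110.0), so a single pooled figure describes neither feature type well.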

In [None]:
summary = get_feature_summary_separate(tier2_df, tier1a.columns.tolist())

# Save the per-group statistics for the manuscript
stats_df = pd.DataFrame({
    'Feature_Type': ['AMR Genes', 'Chromosomal Mutations'],
    'Count': [summary['n_genes'], summary['n_snps']],
    'Mean_Frequency': [summary['gene_freq_mean'], summary['snp_freq_mean']],
    'Median_Frequency': [summary['gene_freq_median'], summary['snp_freq_median']],
    'Min_Frequency': [summary['gene_freq_min'], summary['snp_freq_min']],
    'Max_Frequency': [summary['gene_freq_max'], summary['snp_freq_max']],
    'Std_Frequency': [summary['gene_freq_std'], summary['snp_freq_std']]
})

stats_df.to_csv('/content/drive/MyDrive/amr_features/tier2_feature_statistics.csv', index=False)
print(stats_df)

            Feature_Type  Count  Mean_Frequency  Median_Frequency  \
0              AMR Genes    409      163.669927               2.0   
1  Chromosomal Mutations    827      288.211608             110.0   

   Min_Frequency  Max_Frequency  Std_Frequency  
0            0.0         1089.0     353.479415  
1           10.0         1055.0     330.266713  
