# **ROARY PANGENOME PROCESSING FOR ML - TIER 3 FEATURES**

**`Purpose:`** Process our 2018 Roary output for hierarchical ML analysis

**`Strategy:`** Aggressive filtering → Feature selection → Top genes for discovery

**`Input:`** 
```python
ROARY_FILE = Path('/content/drive/MyDrive/data/E.coli/gene_presence_absence.csv')
UNIFIED_AMR_FILE = Path('/content/drive/MyDrive/amr_features/gpam_card_gpam_resfinder_gpam_amrfinder_ppam_plasmid.csv')  #our Tier 1 matrix
PHENOTYPE_FILE = Path('/content/drive/MyDrive/data/E.coli/phenotypic.csv')
```
**`Output:`** Filtered accessory genome features for ML
```python
OUTPUT_DIR = Path('/content/drive/MyDrive/pangenome_features')
```
**`Steps:`**
1. Remove known AMR genes (that are already in Tier 1)
2. Variance filtering (remove rare/ubiquitous genes)
3. Feature selection (mutual information + RF importance)
4. Output top N genes for novel discovery

**`References:`**
- Page et al. (2015) - Roary pangenome analysis
- Breiman (2001) - Random Forest feature importance
- Cover & Thomas (1991) - Mutual information

**`Runtime:`** 5-10 minutes for 44,957 genes × 1,094 samples

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.feature_selection import VarianceThreshold, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
import warnings
warnings.filterwarnings('ignore')

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## **CONFIGURATION**

In [None]:
#paths
ROARY_FILE = Path('/content/drive/MyDrive/data/E.coli/gene_presence_absence.csv')
UNIFIED_AMR_FILE = Path('/content/drive/MyDrive/amr_features/gpam_card_gpam_resfinder_gpam_amrfinder_ppam_plasmid.csv')  #our Tier 1 matrix
PHENOTYPE_FILE = Path('/content/drive/MyDrive/data/E.coli/phenotypic.csv')
OUTPUT_DIR = Path('/content/drive/MyDrive/pangenome_features')
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

In [None]:
#filtering parameters
VARIANCE_THRESHOLD = 0.05   #frequency margin: keep genes present in 5-95% of isolates
TOP_N_FEATURES = 500        #final number of genes for Tier 3
RANDOM_STATE = 42

print(f"Input: {ROARY_FILE}")
print(f"Output: {OUTPUT_DIR}")
print(f"Variance threshold: {VARIANCE_THRESHOLD}")
print(f"Target features: {TOP_N_FEATURES}")

Input: /content/drive/MyDrive/data/E.coli/gene_presence_absence.csv
Output: /content/drive/MyDrive/pangenome_features
Variance threshold: 0.05
Target features: 500


## **LOAD ROARY DATA**

In [None]:
roary_df = pd.read_csv(ROARY_FILE, low_memory=False)

print(f"Loaded Roary output")
print(f"  Total genes: {len(roary_df):,}")

Loaded Roary output
  Total genes: 44,957


In [None]:
roary_df.head()

Unnamed: 0,Gene,Non-unique Gene name,Annotation,No. isolates,No. sequences,Avg sequences per isolate,Genome Fragment,Order within Fragment,Accessory Fragment,Accessory Order with Fragment,...,12045_3#58,12045_3#59,12045_3#60,12045_3#61,12045_3#62,12045_3#63,12045_3#64,12045_3#65,12045_3#66,12045_3#67
0,menH,,acyl-CoA thioester hydrolase,1094,1094,1.0,1,17765,,,...,12045_3#58_03627,12045_3#59_02017,12045_3#60_00251,12045_3#61_00635,12045_3#62_03718,12045_3#63_00290,12045_3#64_01814,12045_3#65_01647,12045_3#66_00916,12045_3#67_00914
1,yhbP,,"conserved protein, UPF0306 family",1094,1094,1.0,1,2449,,,...,12045_3#58_00874,12045_3#59_01852,12045_3#60_02321,12045_3#61_02767,12045_3#62_02513,12045_3#63_02181,12045_3#64_02429,12045_3#65_01476,12045_3#66_01618,12045_3#67_02278
2,group_10229,,putative hydrolase,1094,1094,1.0,1,26199,,,...,12045_3#58_01104,12045_3#59_04138,12045_3#60_01650,12045_3#61_02519,12045_3#62_01218,12045_3#63_01510,12045_3#64_02819,12045_3#65_02964,12045_3#66_02239,12045_3#67_03235
3,hofP,,protein involved in utilization of DNA as a ca...,1094,1094,1.0,1,26240,,,...,12045_3#58_01137,12045_3#59_04100,12045_3#60_01683,12045_3#61_02557,12045_3#62_01251,12045_3#63_01543,12045_3#64_02786,12045_3#65_02931,12045_3#66_02272,12045_3#67_03202
4,ugpQ,,glycerophosphoryl diester phosphodiesterase,1094,1094,1.0,1,26350,,,...,12045_3#58_01198,12045_3#59_04039,12045_3#60_01752,12045_3#61_02612,12045_3#62_01315,12045_3#63_01612,12045_3#64_02721,12045_3#65_02809,12045_3#66_02336,12045_3#67_03133


In [None]:
print("Null values:", roary_df['Non-unique Gene name'].isnull().sum())
print('='*50)
print("Unique values:", roary_df['Non-unique Gene name'].nunique())
print('='*50)
print(roary_df['Non-unique Gene name'].value_counts())

Null values: 34788
Unique values: 3367
Non-unique Gene name
intA_1    40
flu_2     35
traC_1    31
hsdS      30
umuC_2    29
          ..
ygjI_2     1
yeeO_2     1
fiu        1
iaaA_2     1
ymcB_2     1
Name: count, Length: 3367, dtype: int64


In [None]:
#metadata vs sample columns
metadata_cols = ['Gene', 'Non-unique Gene name', 'Annotation', 'No. isolates',
                'No. sequences', 'Avg sequences per isolate', 'Genome Fragment',
                'Order within Fragment', 'Accessory Fragment',
                'Accessory Order with Fragment', 'QC',
                'Min group size nuc', 'Max group size nuc', 'Avg group size nuc']

In [None]:
#sample columns are those not in metadata
sample_cols = [col for col in roary_df.columns if col not in metadata_cols]

print(f"  Sample columns: {len(sample_cols)}")
print(f"  First 5 samples: {sample_cols[:5]}")

  Sample columns: 1094
  First 5 samples: ['11657_5#25', '11657_5#26', '11657_5#27', '11657_5#29', '11657_5#30']


In [None]:
#extract presence/absence matrix - Roary stores the isolate's locus tag (e.g., "12045_3#58_03627") for presence, blank for absence
presence_absence = roary_df[sample_cols].copy()

#convert to binary matrix
for col in sample_cols:
    presence_absence[col] = (presence_absence[col].notna() & (presence_absence[col] != '')).astype(int)

#add gene names as index
presence_absence.index = roary_df['Gene'].values

#transpose (samples as rows, genes as columns)
gene_matrix = presence_absence.T
gene_matrix.index = sample_cols

print(f"\nGene matrix: {gene_matrix.shape}")
print(f"  Samples: {gene_matrix.shape[0]:,}")
print(f"  Genes: {gene_matrix.shape[1]:,}")


Gene matrix: (1094, 44957)
  Samples: 1,094
  Genes: 44,957
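
The per-column loop above can also be written as a single vectorized expression. The following is a minimal sketch on a toy table, assuming (as in the Roary output shown earlier) that a non-empty cell means presence. The `toy` frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for roary_df[sample_cols]; rows are genes, columns are isolates.
toy = pd.DataFrame(
    {
        "isolate_A": ["A_00001", np.nan, "A_00003"],
        "isolate_B": ["B_00010", "", np.nan],
    },
    index=["geneX", "geneY", "geneZ"],
)

# Presence = cell is neither missing nor an empty string.
toy_binary = (toy.notna() & toy.ne("")).astype(int)

# Transpose so that isolates are rows and genes are columns.
toy_matrix = toy_binary.T
print(toy_matrix)
```

Applied to the real table, the same expression avoids a Python-level loop over more than a thousand sample columns.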


## **REMOVE KNOWN AMR GENES (EXACT + KEYWORD MATCHING)**

In [None]:
# #load unified AMR matrix (Tier 1)
# try:
#     unified_amr = pd.read_csv(UNIFIED_AMR_FILE, index_col=0)


#     known_amr_genes = set(unified_amr.columns)

#     print(f"Loaded {len(known_amr_genes)} known AMR genes from Tier 1")
#     print(f"  Examples: {list(known_amr_genes)[:10]}")

#     #find overlap with Roary genes (exact matching)
#     roary_genes = set(gene_matrix.columns)
#     genes_to_remove = roary_genes.intersection(known_amr_genes)

#     print(f"\nFound {len(genes_to_remove)} genes that overlap with Tier 1")
#     if len(genes_to_remove) > 0:
#         print(f"  Examples: {list(genes_to_remove)[:10]}")

#     #keep only novel genes
#     novel_genes = [g for g in gene_matrix.columns if g not in genes_to_remove]
#     gene_matrix_novel = gene_matrix[novel_genes]

#     print(f"\nAfter removing Tier 1 genes:")
#     print(f"  Remaining genes: {gene_matrix_novel.shape[1]:,}")

# except FileNotFoundError:
#     print("Warning: Unified AMR file not found")
#     print("   Skipping AMR gene removal (will use all Roary genes)")
#     gene_matrix_novel = gene_matrix.copy()

Loaded 678 known AMR genes from Tier 1
  Examples: ['nfsA_Q113STOP', 'IncX3_1', 'IncHI1B(CIT)_1_pNDM-CIT', 'tet(D)', 'ErmB', 'Col(KPHS6)_1', 'blaCTX-M-1', 'terD', 'emrD', 'blaOXA-9_2']

Found 101 genes that overlap with Tier 1
  Examples: ['mchF', 'ireA', 'terZ', 'lpfA', 'emrE', 'mdtM', 'terD', 'emrD', 'emrK', 'pcoC']

After removing Tier 1 genes:
  Remaining genes: 44,856


In [None]:
[col for col in gene_matrix_novel.columns if "bla" in col.lower()]

['blaSE', 'bla', 'bla_1', 'bla_2', 'bla_3']

In [None]:
try:
    unified_amr = pd.read_csv(UNIFIED_AMR_FILE, index_col=0)

    #get exact AMR gene names from OUR CARD/ResFinder/AMRFinder data (Tier 1 data)
    known_amr_genes = set(unified_amr.columns)


    roary_genes = gene_matrix.columns.tolist()

    #BROAD KEYWORDS FOR GENERIC ROARY CLUSTERS
    # broad_amr_keywords = [
    #     # Common AMR families that Roary often groups generically
    #     'bla', 'tet', 'erm', 'aad', 'str', 'aph', 'sul', 'dfr', 'qnr',
    #     'cat', 'cml', 'mcr', 'flo', 'optr', 'van', 'emr', 'mdt', 'efflux'
    # ]
    broad_amr_keywords = [
    'bla',      # β-lactamases
    'tet',      # Tetracyclines
    'erm',      # Macrolides (Erythromycin)
    'aad',      # Aminoglycosides
    'str',      # Streptomycin
    'aph',      # Aminoglycosides (phosphotransferases)
    'sul',      # Sulfonamides
    'dfr',      # Trimethoprim
    'qnr',      # Quinolones
    'cat',      # Chloramphenicol
    'cml',      # Chloramphenicol
    'mcr',      # Colistin
    'flo',      # Florfenicol
    'optr',     # Oxazolidinones/phenicols (optrA)
    'van',      # Vancomycin
    'emr',      # Efflux (EmrAB)
    'mdt',      # Efflux (MdtABC)
    'efflux',    # Generic efflux
    #to be more comprehensive:
    'acr',      # AcrAB efflux (very common in E. coli)
    'tolc',     # Outer membrane efflux channel
    'omp',      # Porins (ompF, ompC - permeability)
    'mar',      # Multiple antibiotic resistance regulators
    'mex',      # Efflux (MexAB, though more common in P. aeruginosa)
    'fos',      # Fosfomycin resistance
    'ant',      # Aminoglycoside nucleotidyltransferases
    'aac',      # Aminoglycoside acetyltransferases
    ]


    genes_to_remove = set()

    #1.exact Matching (Captures specific variants and detailed names)
    exact_matches = set(roary_genes).intersection(known_amr_genes)
    genes_to_remove.update(exact_matches)

    #2.keyword Matching (Captures Roary's generic cluster names like 'bla' or 'tet')
    keyword_matches = set()
    for gene in roary_genes:
        #check if the gene name (case-insensitive) contains any of the broad keywords
        if any(keyword in gene.lower() for keyword in broad_amr_keywords):
            keyword_matches.add(gene)

    genes_to_remove.update(keyword_matches)


    print(f"Loaded {len(known_amr_genes)} known specific AMR genes (Tier 1)")
    print(f"  Examples: {list(known_amr_genes)[:10]}")
    print(f"Found {len(exact_matches)} genes that exact match Tier 1 names")
    if len(exact_matches) > 0:
        print(f"  Examples: {list(exact_matches)[:10]}")
    print(f"Found {len(keyword_matches)} genes matching broad AMR keywords (e.g., 'bla', 'tet')")
    print(f"Total unique genes marked for removal: {len(genes_to_remove):,}")

    #keep only novel genes
    novel_genes = [g for g in gene_matrix.columns if g not in genes_to_remove]
    gene_matrix_novel = gene_matrix[novel_genes]

    print(f"\nAfter removing Tier 1/AMR-associated genes:")
    print(f"  Remaining genes: {gene_matrix_novel.shape[1]:,}")

except FileNotFoundError:
    print("Warning: Unified AMR file not found")
    print("    Skipping AMR gene removal (will use all Roary genes)")
    gene_matrix_novel = gene_matrix.copy()

Loaded 678 known specific AMR genes (Tier 1)
  Examples: ['silB', 'ampC_C-42T', 'IncFIC(FII)_1', 'blaTEM-1B_1', "aac(6')-Ib3", 'pcoC', 'cmlA5', 'tetM', 'mdtE', 'qepA2_1']
Found 101 genes that exact match Tier 1 names
  Examples: ['tetD', 'ompN_2', 'bla_2', 'silB', 'sepA', 'arsA', 'emrA', 'mcrB', 'ompR_2', 'mdtK_2']
Found 152 genes matching broad AMR keywords (e.g., 'bla', 'tet')
Total unique genes marked for removal: 227

After removing Tier 1/AMR-associated genes:
  Remaining genes: 44,730
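
Substring matching also flags names that merely contain a keyword, such as `nanT_4` (`ant`) or `hcaT` (`cat`), both of which appear in the removal set printed further down. One possible refinement, sketched here as an assumption rather than as the method used in this notebook, is to match keywords only at the start of the gene name. The helper `is_amr_like` and the shortened keyword tuple are hypothetical.

```python
import re

# Shortened keyword list for illustration; the notebook's list is longer.
KEYWORDS = ("bla", "tet", "aph", "ant", "cat", "dfr", "acr", "omp")

# A keyword must start the name, e.g. "bla_2" or "tetD", but not "hcaT".
PREFIX_PATTERN = re.compile(r"^(?:" + "|".join(KEYWORDS) + r")", re.IGNORECASE)

def is_amr_like(gene_name: str) -> bool:
    """Return True if the gene name starts with one of the keywords."""
    return PREFIX_PATTERN.match(gene_name) is not None

for name in ["bla_2", "tetD", "ompC_3", "nanT_4", "hcaT", "ydfR", "napH"]:
    print(name, is_amr_like(name))
```

Prefix matching is stricter but not exact: a name that begins with a keyword for unrelated reasons would still be flagged, so a manual review of the removal set remains useful.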


In [None]:
print(len(genes_to_remove))
print(genes_to_remove)

227
{'tetD', 'ompN_2', 'bla_2', 'silB', 'sepA', 'arsA', 'emrA', 'mcrB', 'ompR_2', 'mdtK_2', 'lacR_1', 'pcoC', 'aadA_1', 'mexB', 'espB', 'ompC_3', 'nanT_4', 'ariR', 'ompX_2', 'mdtE', 'mdtD', 'emrB_1', 'marC_1', 'perM', 'marC_2', 'mdtP_3', 'iucA', 'acrE', 'ompF', 'cmlA', 'sulA', 'emrB_4', 'vanT', 'dfrA', 'acrR_1', 'marA_1', 'marR_4', 'fosA', 'aap', 'terC', 'merC', 'arsR', 'napH', 'marA', 'marA_2', 'ompN_1', 'astA', 'air', 'mdtO', 'tolC', 'ydfR', 'mchF', 'merD', 'baeS', 'marR_2', 'mdtN', 'espA', 'espF', 'mexA', 'ompX_5', 'iucB', 'floR', 'acrB_3', 'aadA1', 'nemR_2', 'bla_3', 'msbA', 'antB', 'ompT_1', 'hlyE', 'papE', 'cvaB', 'merP', 'ble', 'mdtM', 'emrE_2', 'sfaS', 'acrR', 'mdtM_2', 'terZ', 'ant', 'emrD', 'mexB_2', 'merE', 'eae', 'tir', 'dfrD', 'mdtB', 'kdpE', 'eatA', 'papA', 'tetR_2', 'hcaT', 'cpxA', 'tetC', 'silC', 'ireA', 'tsh', 'dapH', 'acrF', 'papH_2', 'mdtL', 'iucD', 'ompC', 'papH', 'evgS', 'aggR', 'iroC', 'mdtI', 'ompX_4', 'cdtB', 'iutA', 'ompN_4', 'mdtE_1', 'mphA', 'ompX_1', 'mfd', 

### **`Analysis of Removed Genes (Total: 227 Unique Genes)`**

#### 1. **Efflux Pumps & Outer Membrane Channels (High AMR Relevance)**

This is the largest category. Its removal shows that the expanded keyword list (`acr`, `tolc`, `emr`, `mdt`, `omp`, `mex`) captures the main efflux and permeability genes. These genes mediate resistance by actively pumping drugs out of the cell or by controlling cell permeability.

| Gene | Mechanism | AMR Relevance |
| :--- | :--- | :--- |
| **`acrA`, `acrB`, `acrB_2`, `acrB_3`** | AcrAB-TolC efflux system components. | Primary system conferring **multi-drug resistance (MDR)** to fluoroquinolones, $\beta$-lactams, etc. |
| **`tolC`** | Outer membrane channel. | Essential outer channel for AcrAB, EmrAB, and other efflux pumps. Critical for MDR. |
| **`emrA`, `emrB_1`, `emrB_2`, `emrB_3`, `emrB_4`, `emrE`, `emrE_1`, `emrE_2`, `emrK`** | EmrAB and EmrKY efflux systems. | Multidrug efflux pumps. |
| **`mexA`, `mexB`, `mexB_1`, `mexB_2`** | MexAB efflux components (MexAB-OprM in *P. aeruginosa*). | Another major family of efflux pumps, often associated with MDR. |
| **`mdtC`, `mdtD`, `mdtD_1`, `mdtD_2`, `mdtE`, `mdtE_1`, `mdtE_3`, `mdtF`, `mdtG`, `mdtH_2`, `mdtI`, `mdtJ`, `mdtK`, `mdtK_2`, `mdtL`, `mdtM`, `mdtM_2`, `mdtN`, `mdtO`, `mdtP_1`, `mdtP_2`, `mdtP_3`** | Multidrug Transporter (Mdt) systems. | Various MDR efflux transporters. |
| **`ompC`, `ompC_1`, `ompC_2`, `ompC_3`, `ompF`, `ompN`, `ompN_1`, `ompN_2`, `ompN_3`, `ompN_4`, `ompT`, `ompT_1`, `ompT_2`, `ompG`, `ompG_1`, `ompW`, `ompX_1`, `ompX_2`, `ompX_3`, `ompX_4`, `ompX_5`** | Outer membrane porins/proteins (OMPs). | Changes in porin expression or structure (like OmpF) are a major mechanism for **resistance to $\beta$-lactams and fluoroquinolones** by limiting drug uptake. |

#### 2. **Transcriptional Regulators (MDR Control)**

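These genes control the expression of efflux pumps and porins, which makes them relevant to the AMR phenotype. They were captured by the `mar` and `acr` keywords (`marA`, `marR`, `acrR`) or, as with `baeR`/`baeS` and `evgA`/`evgS`, presumably by exact matches against Tier 1 names.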

| Gene | Mechanism | AMR Relevance |
| :--- | :--- | :--- |
| **`marA`, `marA_1`, `marA_2`** | Multiple Antibiotic Resistance (Mar) system activator. | **Central regulator of MDR**; activates *acrAB* and represses *ompF*. |
| **`marR_1`, `marR_2`, `marR_3`, `marR_4`** | MarR system repressor. | Represses *marA*. Mutations lead to constitutive expression of MarA and MDR. |
| **`acrR`, `acrR_1`, `acrR_2`** | AcrAB efflux system regulator. | Controls the AcrAB efflux pump expression. |
| **`baeR`, `baeS`** | BaeSR Two-Component System. | Regulates the expression of efflux pumps like MdtABC and AcrD. |
| **`evgA`, `evgS`** | EvgAS Two-Component System. | Affects drug resistance by regulating efflux pumps and other general stress responses. |

#### 3. **Direct Resistance Genes (Tier 1/Keywords)**

These are classic resistance genes that were likely captured by our specific or broad keyword matches.

| Gene | Mechanism | AMR Relevance |
| :--- | :--- | :--- |
| **`bla`, `bla_1`, `bla_2`, `bla_3`, `blaSE`** | $\beta$-Lactamase genes. | Direct inactivation of $\beta$-lactam antibiotics (AMX, AMC). |
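| **`tetA`, `tetC`, `tetD`** | Tetracycline resistance genes. | Efflux pumps mediating resistance to tetracyclines. |
| **`mcrB`, `mcrB_1`, `mcrB_2`, `mcrC`** | Matched by the `mcr` keyword. | Likely keyword false positives: in *E. coli*, `mcrB`/`mcrC` usually denote the McrBC methylcytosine-specific restriction system, not the mobilized colistin resistance genes (`mcr-1` and relatives). |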
| **`aadA`, `aadA_1`, `aadA1`** | Aminoglycoside resistance gene. | Nucleotidyltransferase that inactivates streptomycin/spectinomycin. |
| **`aphA`, `aphA_2`** | Aminoglycoside resistance gene. | Phosphotransferase that inactivates kanamycin, etc. |
| **`aacA-aphD`, `aacA-aphD_1`** | Aminoglycoside resistance gene (fusion). | Inactivates multiple aminoglycosides. |
| **`cmlA`, `cmlA1`** | Chloramphenicol resistance gene. | Efflux pump mediating resistance. |
| **`dfrA`, `dfrD`** | Trimethoprim resistance genes. | Target modification (DHFR bypass). |
| **`fosA`** | Fosfomycin resistance gene. | Direct inactivation of fosfomycin. |
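| **`sulA`, `sulI`** | Matched by the `sul` keyword. | `sulI` (`sul1`) confers sulfonamide resistance by target bypass. `sulA` in *E. coli* usually denotes the SOS cell-division inhibitor, so it is likely a keyword false positive. |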

#### 4. **Heavy Metal/Biocide Resistance**

These are often co-located with AMR genes on plasmids or mobile elements, creating strong linkage disequilibrium (LD) with them.

| Gene | Mechanism | AMR Relevance |
| :--- | :--- | :--- |
| **`arsA`, `arsD`, `arsR`** | Arsenic resistance. | Frequently found on plasmids co-harboring AMR genes. |
| **`silB`, `silC`, `silE`, `silP`** | Silver resistance. | Often plasmid-borne with $\beta$-lactamases and other AMR genes. |
| **`terB`, `terC`, `terD`, `terE`, `terW`, `terZ`** | Tellurite resistance. | Frequently associated with AMR plasmids. |
| **`merA`, `merC`, `merD`, `merE`, `merP`, `merR`, `merT`** | Mercury resistance (Mer operon). | Often found on large resistance plasmids. |
| **`pcoC`, `pcoE`** | Copper resistance. | Plasmid-borne, often linked to AMR. |


**`Conclusion: Success`**

Removing these 227 genes excludes core efflux components (`acrB`, `tolC`), key regulators (`marA`, `acrR`), and direct resistance enzymes (`bla`, `tet`), so the Tier 3 feature set is enriched for candidate **novel genetic determinants** rather than genes already represented in Tier 1. Because keywords are matched as substrings, the removed set also appears to include a few genes unrelated to AMR (for example `nanT_4`, `napH`, `dapH`, `hcaT`, and `ydfR`); out of 44,957 genes this loss is small.

This achieves the primary objective of the initial filtering steps: isolating the **non-canonical** accessory genes whose function is currently unknown or not well-characterized in the context of AMR.


## **VARIANCE FILTERING**


| Step | Action | Explanation |
| :--- | :--- | :--- |
| **Step 1 & 2** | **Remove Known AMR Genes** | **Purpose:** Ensures the remaining genes are **novel candidates** for Tier 3 discovery, preventing feature redundancy and false novelty claims. **Basis:** *Comparative Genomics*. By excluding genes identified by CARD/ResFinder/AMRFinder patterns (Tier 1), we isolate the accessory genome's **non-canonical** components. |
| **Step 3** | **Variance Filtering (min/max frequency = 5%/95%)** | **Purpose:** Removes genes that are too rare to be statistically linked to resistance (e.g., in only 1-2 isolates) and genes that are effectively ubiquitous (core genes), which typically don't explain phenotype variance. **Basis:** *Feature Engineering*. Removing low-variance features improves the signal-to-noise ratio for subsequent ML models. |
| **Step 4 & 5** | **Feature Selection (MI + RF)** | **Purpose:** Uses two complementary techniques to rank the remaining **~6,500** genes. **Mutual Information (MI)** measures the dependency between a gene and the phenotype (R/S), acting as a univariate filter. **Random Forest (RF) Importance** measures how valuable a gene is when considered *in combination* with other genes (multivariate importance). **Basis:** *Feature Selection Theory*. Combining these approaches favors features that are both individually relevant *and* useful within a model ensemble. |

Making the final selection **per antibiotic** is critical, as different antibiotics rely on distinct resistance mechanisms (e.g., an efflux pump gene relevant for fluoroquinolones might be irrelevant for $\beta$-lactams).
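
For a binary presence/absence column with frequency $p$, the variance is $p(1-p)$, so the 5%/95% frequency bounds correspond to a minimum variance of $0.05 \times 0.95 = 0.0475$. A minimal sketch of the filter on made-up data follows. The `toy` matrix and the helper `frequency_filter` are illustrative names, not part of the pipeline.

```python
import pandas as pd

def frequency_filter(matrix: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Keep binary columns whose frequency lies in [threshold, 1 - threshold]."""
    freq = matrix.mean(axis=0)  # fraction of samples carrying each gene
    keep = (freq >= threshold) & (freq <= 1 - threshold)
    return matrix.loc[:, keep]

# 10 toy isolates: one absent gene, one rare, one balanced, one ubiquitous.
toy = pd.DataFrame(
    {
        "absent": [0] * 10,
        "rare": [1] + [0] * 9,
        "balanced": [1] * 5 + [0] * 5,
        "core": [1] * 10,
    }
)

kept = frequency_filter(toy, threshold=0.05)
freq = toy.mean(axis=0)
bernoulli_var = freq * (1 - freq)  # population variance of a 0/1 column
print(list(kept.columns))
print(bernoulli_var.round(4).to_dict())
```

Genes absent from every isolate or present in all of them have zero variance and are dropped, while the rare and balanced genes are kept.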


In [None]:
print(f"Removing genes with frequency <{VARIANCE_THRESHOLD*100:.0f}% or >{(1-VARIANCE_THRESHOLD)*100:.0f}%")

#calculate gene frequencies
gene_frequencies = gene_matrix_novel.sum(axis=0) / len(gene_matrix_novel)

#keep genes in 5-95% range - remove low-variance genes (present in <5% or >95% of samples)
#rationale- Rare/ubiquitous genes unlikely to be informative
min_freq = VARIANCE_THRESHOLD
max_freq = 1 - VARIANCE_THRESHOLD

keep_mask = (gene_frequencies >= min_freq) & (gene_frequencies <= max_freq)
gene_matrix_filtered = gene_matrix_novel.loc[:, keep_mask]

removed = gene_matrix_novel.shape[1] - gene_matrix_filtered.shape[1]
print(f"\nVariance filtering complete:")
print(f"  Removed: {removed:,} genes")
print(f"  Remaining: {gene_matrix_filtered.shape[1]:,} genes")

Removing genes with frequency <5% or >95%

Variance filtering complete:
  Removed: 38,239 genes
  Remaining: 6,491 genes


In [None]:
[col for col in gene_matrix_filtered.columns if "bla" in col.lower()]   #sanity check: no generic 'bla' clusters should remain after AMR removal and frequency filtering

[]

In [None]:
#show frequency distribution
freq_filtered = gene_frequencies[keep_mask]
print(f"\nFrequency distribution of kept genes:")
print(f"  5-10%: {np.sum((freq_filtered >= 0.05) & (freq_filtered < 0.10)):,}")
print(f"  10-25%: {np.sum((freq_filtered >= 0.10) & (freq_filtered < 0.25)):,}")
print(f"  25-50%: {np.sum((freq_filtered >= 0.25) & (freq_filtered < 0.50)):,}")
print(f"  50-75%: {np.sum((freq_filtered >= 0.50) & (freq_filtered < 0.75)):,}")
print(f"  75-95%: {np.sum((freq_filtered >= 0.75) & (freq_filtered <= 0.95)):,}")


Frequency distribution of kept genes:
  5-10%: 1,974
  10-25%: 2,464
  25-50%: 1,004
  50-75%: 525
  75-95%: 524


In [None]:
#save variance-filtered matrix
variance_filtered_file = OUTPUT_DIR / 'roary_variance_filtered.csv'
gene_matrix_filtered.to_csv(variance_filtered_file)
print(f"\nSaved variance-filtered matrix: {variance_filtered_file}")


Saved variance-filtered matrix: /content/drive/MyDrive/pangenome_features/roary_variance_filtered.csv


## **FEATURE SELECTION (PER ANTIBIOTIC)**

## **1. Mutual Information (MI)**

Mutual Information measures the **statistical dependence** between two variables. In this context, it quantifies how much knowing the state of a gene ($X$) reduces the uncertainty about the phenotype (Resistance/Susceptibility, $Y$).

### **Key Concept: Information Gain**

MI is calculated using Shannon's entropy ($H$), which measures the uncertainty (or impurity) of a variable.

$$MI(X; Y) = H(Y) - H(Y | X)$$

* $H(Y)$: The initial uncertainty about the phenotype (e.g., if $R$ and $S$ are equally common, uncertainty is high).
* $H(Y | X)$: The uncertainty about the phenotype *after* knowing the state of the gene (i.e., if the gene is present/absent).

### **Application to Genes**

* **High MI Score:** A high score means the gene's presence or absence is strongly predictive of the resistance phenotype. For example, a known $\beta$-lactamase gene that is **always** present in Resistant strains ($R$) and **always** absent in Susceptible strains ($S$) would have a very high MI score.
* **Strengths:** MI is excellent at detecting **non-linear relationships** and is model-agnostic (it doesn't rely on training a classifier).
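
The definition above can be checked numerically. This sketch computes $H(Y) - H(Y | X)$ by hand for a made-up gene/phenotype pair and compares it with scikit-learn's `mutual_info_score`, which reports MI in nats for discrete variables. The arrays and helper names are illustrative.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def entropy(labels) -> float:
    """Shannon entropy (in nats) of a discrete label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def mutual_information(x, y) -> float:
    """MI(X; Y) = H(Y) - H(Y | X) for discrete x and y."""
    x = np.asarray(x)
    y = np.asarray(y)
    h_y_given_x = 0.0
    for value in np.unique(x):
        mask = x == value
        # Weight the entropy of y within each gene state by P(X = value).
        h_y_given_x += mask.mean() * entropy(y[mask])
    return entropy(y) - h_y_given_x

gene = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])       # made-up presence/absence
phenotype = np.array([1, 1, 1, 0, 0, 0, 0, 1, 1, 0])  # made-up R (1) / S (0)

mi_manual = mutual_information(gene, phenotype)
mi_sklearn = mutual_info_score(gene, phenotype)
print(f"manual: {mi_manual:.6f}  sklearn: {mi_sklearn:.6f}")
```

A gene that perfectly tracks the phenotype reaches the upper bound $MI(X; Y) = H(Y)$, which is why the maximum attainable score depends on how balanced the R/S labels are.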

## **2. Random Forest (RF) Feature Importance**

Random Forest (RF) is an ensemble of decision trees. Feature importance, often called **Gini importance** or **Mean Decrease in Impurity (MDI)**, measures how much each feature contributes to reducing the impurity of the trees in the forest.

### **Key Concept: Impurity Reduction**

When a decision tree splits on a feature (gene) to separate $R$ and $S$ samples, it reduces the *impurity* (mix of $R$ and $S$) of the resulting subsets. The importance score is the total reduction in impurity across all splits using that feature, averaged over all trees in the forest.

### **Application to Genes**

* **High RF Importance:** A high score means the gene is frequently selected as an optimal split point in the decision trees to maximize the separation between $R$ and $S$ samples.
* **Strengths:** RF inherently captures the **interactions** between features. A gene might have a moderate MI score alone, but if it is crucial when combined with another gene (a synergistic effect), RF is often better at reflecting its practical importance in a predictive model. It directly measures a gene's utility for **classification**.



### **The Combined Approach**

By min-max normalizing each score to the range 0 to 1 and taking an equally weighted average ($0.5 \cdot MI_{norm} + 0.5 \cdot RF_{norm}$), the combined approach draws on both criteria:

1.  **Robustness:** It captures strong, independent associations (MI) and crucial predictive contributions/interactions (RF).
2.  **Stability:** Features with high scores in both metrics are prioritized, leading to a more stable and reliable set of candidate AMR genes compared to using either method alone.

This strategy favors genes that are both **statistically informative** and **useful to a predictive model**; it does not by itself establish a functional role in resistance.
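
As a worked example of the combination, the sketch below applies min-max scaling and the 50/50 average to made-up scores for three genes. The gene names and score values are purely illustrative.

```python
import numpy as np

def min_max(scores: np.ndarray, eps: float = 1e-10) -> np.ndarray:
    """Scale scores to roughly [0, 1]; eps guards against a zero range."""
    return (scores - scores.min()) / (scores.max() - scores.min() + eps)

genes = np.array(["geneA", "geneB", "geneC"])
mi = np.array([0.00, 0.10, 0.20])   # made-up MI scores
rf = np.array([0.02, 0.00, 0.01])   # made-up RF importances

combined = 0.5 * min_max(mi) + 0.5 * min_max(rf)

top_n = 2
top_idx = np.argsort(combined)[-top_n:][::-1]  # highest combined score first
print(genes[top_idx], combined[top_idx].round(3))
```

Here `geneA` has the best RF importance and `geneC` the best MI score; `geneC` ranks first overall because it is also second-best on RF, while `geneB` scores low on both and is dropped.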

In [None]:
gene_matrix_filtered.head()

Unnamed: 0,eutR,ybiN,yjdB,ydhY,frc,atoA,fadK,exoX,allS_5,group_7729,...,group_7354,group_7535,tar_2,mbeA,group_7801,group_8004,group_8174,group_8196,agaC_1,group_9813
11657_5#25,1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
11657_5#26,1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
11657_5#27,1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
11657_5#29,1,0,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,1,0
11657_5#30,1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,1,1,0


In [None]:
phenotypes = pd.read_csv(PHENOTYPE_FILE, index_col=0)
phenotypes.head()

Unnamed: 0_level_0,Isolate,Lane.accession,Sequecning Status,Year,CTZ,CTX,AMP,AMX,AMC,TZP,CXM,CET,GEN,TBM,TMP,CIP
ENA.Accession.Number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
ERS356929,11657_5#10,ERR434268,Previously sequenced,2010.0,S,S,S,,S,S,S,I,S,S,S,S
ERS356935,11657_5#11,ERR434269,Previously sequenced,2010.0,S,S,R,,R,S,S,I,S,S,R,R
ERS356938,11657_5#12,ERR434270,Previously sequenced,2010.0,S,S,S,,S,S,S,S,S,S,S,S
ERS356941,11657_5#13,ERR434271,Previously sequenced,2010.0,S,S,R,,R,S,S,I,S,S,S,R
ERS356967,11657_5#14,ERR434272,Previously sequenced,2010.0,S,S,R,,S,S,S,S,S,S,R,S
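
The next cell encodes each drug column as a binary label. A minimal sketch of that encoding on made-up calls, assuming the same convention (R = 1, S and I = 0, missing calls dropped). The isolate IDs and the `geneX` column are hypothetical.

```python
import numpy as np
import pandas as pd

calls = pd.Series(
    ["R", "S", "I", np.nan, "R"],
    index=["iso1", "iso2", "iso3", "iso4", "iso5"],
    name="AMX",
)

# Map to 1/0; unmapped or missing calls become NaN and are dropped.
y = calls.map({"R": 1, "S": 0, "I": 0}).dropna().astype(int)

# Align a feature matrix to the isolates that still have a label.
X = pd.DataFrame({"geneX": [1, 0, 0, 1, 1]}, index=calls.index)
X_drug = X.loc[y.index]
print(y.to_dict(), X_drug.shape)
```

Treating intermediate (I) calls as non-resistant is a modelling choice; grouping them with R instead would change the class balance for each drug.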


In [None]:
#load phenotypes
try:
    phenotypes = pd.read_csv(PHENOTYPE_FILE, index_col=0)


    # #standardize sample IDs   (No need as they both have same naming - was going to do it for other datasets but then left the idea to maintain the naming by ENA)
    # phenotypes.index = phenotypes.index.str.replace('#', '_')
    # gene_matrix_filtered.index = gene_matrix_filtered.index.str.replace('#', '_')

    #use the 'Isolate' column as the index for phenotypes to align with gene_matrix_filtered
    phenotypes.set_index('Isolate', inplace=True)

    #find common samples
    common_samples = gene_matrix_filtered.index.intersection(phenotypes.index)

    print(f"Loaded phenotypes: {phenotypes.shape}")
    print(f"Common samples: {len(common_samples)}")

    #align datasets
    X = gene_matrix_filtered.loc[common_samples]

    #process each antibiotic
    for drug in ['AMX', 'AMC', 'CIP']:
        print(f"\n--- Processing {drug} ---")

        #prepare labels (R = 1; S and I = 0, i.e. intermediate is treated as non-resistant)
        y = phenotypes.loc[common_samples, drug].map({'R': 1, 'S': 0, 'I': 0})
        y = y.dropna().astype(int)

        #align X with y
        X_drug = X.loc[y.index]

        print(f"  Samples: {len(y)}")
        print(f"  Resistant: {y.sum()} ({y.mean()*100:.1f}%)")

        #mutual Information (find genes correlated with resistance)
        # Reference: Cover & Thomas (1991) - Information Theory

        print("  Computing MI scores...")
        mi_scores = mutual_info_classif(
            X_drug,
            y,
            discrete_features=True,
            random_state=RANDOM_STATE
            )

        # Random Forest
        print("  Training RF...")
        rf = RandomForestClassifier(
            n_estimators=100,
            max_depth=10,
            random_state=RANDOM_STATE,
            n_jobs=-1
        )
        rf.fit(X_drug, y)
        rf_scores = rf.feature_importances_

        #normalize scores
        mi_scores_norm = (mi_scores - mi_scores.min()) / (mi_scores.max() - mi_scores.min() + 1e-10)
        rf_scores_norm = (rf_scores - rf_scores.min()) / (rf_scores.max() - rf_scores.min() + 1e-10)

        #combined score (50% MI + 50% RF)
        combined_scores = 0.5 * mi_scores_norm + 0.5 * rf_scores_norm

        #get top N genes
        top_indices = np.argsort(combined_scores)[-TOP_N_FEATURES:][::-1]
        selected_genes = X_drug.columns[top_indices].tolist()

        print(f"  Selected top {len(selected_genes)} genes")
        print(f"  MI score range: {mi_scores[top_indices].min():.4f} - {mi_scores[top_indices].max():.4f}")
        print(f"  RF importance range: {rf_scores[top_indices].min():.4f} - {rf_scores[top_indices].max():.4f}")

        #save selected genes for this drug
        X_selected = X_drug[selected_genes]
        output_file = OUTPUT_DIR / f'roary_filtered_{drug}_top{TOP_N_FEATURES}.csv'
        X_selected.to_csv(output_file)
        print(f"  Saved: {output_file}")

        #save feature importance scores
        scores_df = pd.DataFrame({
            'gene': X_drug.columns,
            'mi_score': mi_scores,
            'rf_importance': rf_scores,
            'combined_score': combined_scores
        }).sort_values('combined_score', ascending=False)

        scores_file = OUTPUT_DIR / f'roary_feature_scores_{drug}.csv'
        scores_df.to_csv(scores_file, index=False)
        print(f"  Saved scores: {scores_file}")

except FileNotFoundError:
    print("Warning: Phenotype file not found")
    print("Skipping feature selection")
    print("Use variance-filtered matrix for now")

Loaded phenotypes: (1936, 15)
Common samples: 1094

--- Processing AMX ---
  Samples: 1094
  Resistant: 662 (60.5%)
  Computing MI scores...
  Training RF...
  Selected top 500 genes
  MI score range: 0.0021 - 0.2299
  RF importance range: 0.0000 - 0.0233
  Saved: /content/drive/MyDrive/pangenome_features/roary_filtered_AMX_top500.csv
  Saved scores: /content/drive/MyDrive/pangenome_features/roary_feature_scores_AMX.csv

--- Processing AMC ---
  Samples: 1094
  Resistant: 326 (29.8%)
  Computing MI scores...
  Training RF...
  Selected top 500 genes
  MI score range: 0.0108 - 0.0599
  RF importance range: 0.0000 - 0.0101
  Saved: /content/drive/MyDrive/pangenome_features/roary_filtered_AMC_top500.csv
  Saved scores: /content/drive/MyDrive/pangenome_features/roary_feature_scores_AMC.csv

--- Processing CIP ---
  Samples: 1094
  Resistant: 181 (16.5%)
  Computing MI scores...
  Training RF...
  Selected top 500 genes
  MI score range: 0.0775 - 0.1827
  RF importance range: 0.0000 - 0.015

### **`Analysis of Feature Selection Quality`**

| Drug | Resistant Isolates | MI Score Range (Top 500) | RF Importance Range (Top 500) | Interpretation |
| :--- | :--- | :--- | :--- | :--- |
| **AMX** (Amoxicillin) | 60.5% | 0.0021 - **0.2299** | 0.0000 - **0.0233** | **Excellent/High Confidence.** MI spans a wide range and has the highest maximum of the three drugs, indicating that some genes are highly informative. RF importance also has the highest peak. |
| **AMC** (Amox./Clavulanate) | 29.8% | 0.0108 - **0.0599** | 0.0000 - **0.0101** | **Difficult/Low Confidence.** The maximum MI is roughly a quarter to a third of that for AMX or CIP, suggesting weaker individual gene associations. RF importance is also the lowest. |
| **CIP** (Ciprofloxacin) | 16.5% | 0.0775 - **0.1827** | 0.0000 - **0.0155** | **Good/Moderate Confidence.** The maximum MI is close to that of AMX, indicating strong statistical signals. The minimum MI is also the highest of the three, suggesting even the 500th gene is somewhat informative. |
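
MI ranges are not directly comparable across drugs, because MI cannot exceed the label entropy $H(Y)$, which depends on the resistant fraction. A rough sketch that expresses each drug's maximum MI as a fraction of that ceiling, using the resistant fractions and maximum MI values reported above and assuming natural-log units (nats), as scikit-learn uses for discrete mutual information:

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy in nats of a binary label with positive fraction p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

# (resistant fraction, maximum MI among the top 500) from the run above
runs = {"AMX": (0.605, 0.2299), "AMC": (0.298, 0.0599), "CIP": (0.165, 0.1827)}

for drug, (frac_resistant, max_mi) in runs.items():
    ceiling = binary_entropy(frac_resistant)
    print(f"{drug}: H(Y) = {ceiling:.3f} nats, max MI / H(Y) = {max_mi / ceiling:.2f}")
```

On this relative scale the strongest CIP gene accounts for a larger share of the label entropy than the strongest AMX gene, even though its raw MI is lower.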



## **CREATE GENERIC FILTERED SET (NO PHENOTYPE) (Redundant)**

- This section is redundant. I added it as a fallback while per-drug selection was failing; that issue has since been resolved, and the per-drug selection above supersedes it.

In [None]:
#if we couldn't do per-drug selection, we can just take top N by overall variance
if gene_matrix_filtered.shape[1] > TOP_N_FEATURES:
    #use variance (spread) as a simple importance measure
    gene_variances = gene_matrix_filtered.var(axis=0)
    top_var_indices = np.argsort(gene_variances)[-TOP_N_FEATURES:][::-1]

    gene_matrix_final = gene_matrix_filtered.iloc[:, top_var_indices]

    print(f"Selected top {TOP_N_FEATURES} genes by variance")
    print(f"  Variance range: {gene_variances.iloc[top_var_indices].min():.4f} - {gene_variances.iloc[top_var_indices].max():.4f}")
else:
    gene_matrix_final = gene_matrix_filtered

# Save generic filtered matrix
generic_file = OUTPUT_DIR / f'roary_filtered_generic_top{TOP_N_FEATURES}.csv'
gene_matrix_final.to_csv(generic_file)
print(f"Saved: {generic_file}")

Selected top 500 genes by variance
  Variance range: 0.2367 - 0.2502
Saved: /content/drive/MyDrive/pangenome_features/roary_filtered_generic_top500.csv
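
For a 0/1 presence column, the sample variance depends only on the gene's frequency $p$: it equals $p(1-p)\,\frac{n}{n-1}$ and peaks at $p = 0.5$. Ranking by variance therefore selects genes whose prevalence is closest to 50%, regardless of phenotype. The sketch below illustrates this with a hypothetical helper, `binary_variance`, and an assumed cohort size of 1,094 isolates (the sample count printed in the per-drug logs).

```python
import pandas as pd

def binary_variance(n_present, n_total):
    """Sample variance (ddof=1, the pandas default) of a 0/1 presence column."""
    column = pd.Series([1] * n_present + [0] * (n_total - n_present))
    return column.var()

n_isolates = 1094  # assumed cohort size for illustration

for frequency in (0.05, 0.25, 0.50, 0.95):
    n_present = round(frequency * n_isolates)
    variance = binary_variance(n_present, n_isolates)
    print(f"frequency {frequency:.0%}: variance {variance:.4f}")
```

Under that assumption, the upper end of the reported range (0.2502) is what a gene present in exactly half of the isolates would give, and the whole top-500 set sits fairly close to 50% prevalence.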


## **FINAL SUMMARY**

In [None]:
print(f"\nSummary:")
print(f"  Original Roary genes: {gene_matrix.shape[1]:,}")
print(f"  After removing Tier 1: {gene_matrix_novel.shape[1]:,}")
print(f"  After variance filter: {gene_matrix_filtered.shape[1]:,}")
print(f"  Final selected: {TOP_N_FEATURES}")


Summary:
  Original Roary genes: 44,957
  After removing Tier 1: 44,795
  After variance filter: 6,520
  Final selected: 500


In [None]:
print(f"\nOutput files:")
print(f"  1. roary_variance_filtered.csv - All variance-filtered genes")
print(f"  2. roary_filtered_{{DRUG}}_top{TOP_N_FEATURES}.csv - Per-drug selections")
print(f"  3. roary_feature_scores_{{DRUG}}.csv - Feature importance scores")
print(f"  4. roary_filtered_generic_top{TOP_N_FEATURES}.csv - Generic selection")
print(f"\nReady for Tier 3 model training!")        # actually, we are not ready for modelling yet; the issue is discussed below


Output files:
  1. roary_variance_filtered.csv - All variance-filtered genes
  2. roary_filtered_{DRUG}_top500.csv - Per-drug selections
  3. roary_feature_scores_{DRUG}.csv - Feature importance scores
  4. roary_filtered_generic_top500.csv - Generic selection

Ready for Tier 3 model training!


# **POST-HOC CORRELATION FILTER FOR TIER 3 NOVEL GENES**

**`Purpose:`** Remove Tier 3 genes highly correlated with Tier 1 AMR genes

**`Prevents:`** False "novel" discoveries that are just linked to known AMR genes

**`Basis:`**
- Linkage disequilibrium on plasmids (genes travel together)
- Correlation threshold: |ρ| ≥ 0.90 indicates strong linkage
- Ensures Tier 3 genes are truly independent mechanisms


**`The "Novelty Trap"`**
We are still vulnerable to the "novelty trap," where a high-ranking gene is merely an innocent neighbor of a known AMR gene on a plasmid (linkage disequilibrium).
- We applied a robust, multi-step filter (Variance $\rightarrow$ MI $\rightarrow$ RF) to the 44,958 Roary genes.
- However, we will select the top 200-500 genes based on Random Forest importance. Since the Roary matrix and the phenotypic data are aligned, how do we ensure that these top genes are truly novel mechanisms and `not simply accessory genes` that are highly correlated (in linkage disequilibrium) with a known Tier 1 or Tier 2 feature?

 **`Ensuring Novelty in Roary Filtering` (Correlation Challenge):**

 - There might be high correlation (**`linkage disequilibrium (LD)`**) between accessory genes and known resistance genes.
 - A gene highly correlated with $\text{bla}_{\text{TEM-1}}$ will rank highly in our Random Forest importance but won't be a novel resistance mechanism; it will just be a linked accessory gene on the same mobile element (plasmid).

 **`Strategy to Address Linkage:`**

 **`1. Direct Feature Exclusion:`** `Before training the Tier 3 model`, we perform a $\mathbf{post-hoc\ correlation\ analysis}$ on the final $\sim 500$ top Roary genes: calculate the Pearson correlation coefficient between each of the top $\mathbf{500}$ Roary genes and every known $\mathbf{Tier\ 1\ gene}$, then exclude any Roary gene with $|\rho| \ge 0.9$ against any Tier 1 feature. This helps decouple accessory genes from known mobile resistance elements.

 **`2. Biological Vetting:`** Once the final $\sim 500$ candidates are selected (post-correlation filter), our final task is a literature search. Check the candidates for annotations (e.g., hypothetical protein, membrane transporter, efflux pump component). Genes with functional annotations related to transport, regulation, or membrane structure are stronger novel candidates than those annotated simply as Hypothetical Protein or genes known to be related to housekeeping.

 **`Linkage Disequilibrium (LD)`** exists when certain combinations of genetic markers (alleles) occur more or less often in a population than would be expected by chance. This is usually because the genes are physically close to each other on a chromosome and are inherited together more often than they are separated by recombination.

 **`Justification of |ρ| ≥ 0.90 threshold:`**
 - A cutoff of |ρ| ≥ 0.90 is more conservative about what it removes than a lower cutoff. It removes only genes in very strong (near-complete) association with a known AMR gene and retains genes that are partially associated but may still be functionally important. A more relaxed cutoff (e.g., 0.80) would discard many of these candidates.

| Threshold | Use Case | Expected Removal |
|-----------|----------|------------------|
| ρ ≥ 0.95 | Remove only near-perfect LD (plasmid co-carriage) | 5-15% |
| ρ ≥ 0.90 | Remove strong LD (setting applied here) | 10-25% |
| ρ ≥ 0.80 | Our earlier setting (a bit too aggressive for the accessory genome) | 50-60% |

In short:
 *We applied a correlation threshold of |ρ| ≥ 0.90 to remove genes in near-perfect linkage disequilibrium with known AMR determinants while retaining genes that may represent functionally independent mechanisms. This threshold is consistent with bacterial GWAS studies (Lees et al., 2020) and accounts for the high recombination rates in E. coli accessory genomes.*
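
The effect of the cutoff can be seen on a small, made-up example. The sketch below builds one known AMR gene and four hypothetical candidate genes over 20 isolates, computes |ρ| for each candidate, and lists which candidates each threshold would remove. All gene names and values are invented for illustration.

```python
import pandas as pd

# toy data: one known AMR gene and four candidate accessory genes (all hypothetical)
known_amr = pd.Series([1] * 10 + [0] * 10, name="known_amr_gene")
candidates = pd.DataFrame({
    "same_plasmid_copy": [1] * 10 + [0] * 10,  # identical presence pattern
    "one_mismatch":      [1] * 9 + [0] * 11,   # differs in 1 isolate
    "two_mismatches":    [1] * 8 + [0] * 12,   # differs in 2 isolates
    "unlinked":          [1, 0] * 10,          # unrelated pattern
})

# absolute Pearson correlation of each candidate with the known gene
abs_rho = candidates.corrwith(known_amr).abs()
print(abs_rho.round(3))

for threshold in (0.95, 0.90, 0.80):
    removed = abs_rho[abs_rho >= threshold].index.tolist()
    print(f"|rho| >= {threshold:.2f}: remove {removed}")
```

In this toy cohort a single discordant isolate already lowers |ρ| to about 0.90, so the 0.95 cutoff removes only the identical copy, 0.90 also removes the one-mismatch gene, and 0.80 removes three of the four candidates.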

Our **unified matrix** (678 features) from `Master_file_creation.ipynb` is composed of:
```python
gpam_card          → 143 features (CARD AMR genes)
gpam_resfinder     → 121 features (ResFinder AMR genes)
gpam_amrfinder     → 354 features (AMRFinderPlus genes/mutations)
ppam_plasmid       →  60 features (PlasmidFinder replicons)
─────────────────────────────────────────────────
TOTAL              → 678 features
```
**`AMRFinderPlus`** doesn't just report AMR genes - it reports multiple element types:

| Element Type | What It Is | Should Be in Tier 1? |
|--------------|------------|----------------------|
| AMR | True resistance genes (e.g., `blaTEM-1B`, `aac(6')-Ib`) | YES |
| STRESS | Metal/biocide resistance (e.g., `arsA`, `merA`) | Debatable (creates LD) |
| VIRULENCE | Pathogenicity factors (e.g., `ibeA`, `eatA`) | NO (not AMR) |
| POINT (reported as a Subtype of AMR) | Chromosomal mutations (e.g., `gyrA_S83L`, `parC_E84V`, `ptsI_V25I`) | Some YES, some NO |



## **Check AMRFinderPlus Types**
- The AMRFinderPlus matrix most likely contains all element types (`AMR`, `STRESS`, `VIRULENCE`, and point mutations) because we `never filtered by the Type column` to keep only AMR elements. For the post-hoc test we only need to remove genes linked to acquired AMR genes from the `roary_pangenome`; the other element types can stay. We therefore need separate AMRFinderPlus files, one of which contains only AMR elements.
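
The intended split can be sketched on a toy table before touching the real file. The rows below are made up, but the column names (`Type`, `Subtype`, `Element symbol`) follow the AMRFinderPlus master table; `split_by_element_type` is a hypothetical helper, not part of the pipeline.

```python
import pandas as pd

# toy stand-in for the AMRFinderPlus master table (rows are invented)
toy = pd.DataFrame({
    "ISOLATE_ID":     ["iso1", "iso1", "iso2", "iso2", "iso3"],
    "Element symbol": ["blaTEM-1", "gyrA_S83L", "merA", "ibeA", "blaTEM-1"],
    "Type":           ["AMR", "AMR", "STRESS", "VIRULENCE", "AMR"],
    "Subtype":        ["AMR", "POINT", "METAL", "VIRULENCE", "AMR"],
})

def split_by_element_type(df):
    """Split rows into acquired AMR, point mutations, stress and virulence."""
    is_amr = df["Type"] == "AMR"
    is_point = df["Subtype"] == "POINT"
    return {
        "acquired_amr": df[is_amr & ~is_point],
        "point_mutations": df[is_amr & is_point],
        "stress": df[df["Type"] == "STRESS"],
        "virulence": df[df["Type"] == "VIRULENCE"],
    }

subsets = split_by_element_type(toy)
for name, subset in subsets.items():
    print(f"{name}: {sorted(subset['Element symbol'].unique())}")
```

Only the `acquired_amr` subset would feed the strict correlation filter; the other subsets are kept separately.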

In [None]:
# Load the AMRFinderPlus master file
df_amrfinder = pd.read_csv('/content/drive/MyDrive/amr_features/df_master_amrfinder.csv')

#check the distribution of element types
print("AMRFinderPlus Element Types:")
print(df_amrfinder['Type'].value_counts())
print("\n")

# Check specific problematic elements
problematic = df_amrfinder[df_amrfinder['Element symbol'].str.contains(
    'ptsI|parC|gyrA|gyrB', case=False, na=False)]
print(f"Chromosomal mutations in AMRFinderPlus: {len(problematic)}")
print(problematic[['Element symbol', 'Type', 'Subtype']].drop_duplicates())

AMRFinderPlus Element Types:
Type
VIRULENCE    29860
AMR          15739
STRESS        6484
Name: count, dtype: int64


Chromosomal mutations in AMRFinderPlus: 1654
      Element symbol Type Subtype
49         parC_S57T  AMR   POINT
50         parC_S80I  AMR   POINT
64         gyrA_D87N  AMR   POINT
65         gyrA_S83L  AMR   POINT
169        gyrA_S83A  AMR   POINT
195        ptsI_V25I  AMR   POINT
584        parC_E84V  AMR   POINT
849        parC_S80R  AMR   POINT
1518       parC_E84K  AMR   POINT
1564       gyrA_D87Y  AMR   POINT
2567       parC_E84G  AMR   POINT
3772       gyrA_D87G  AMR   POINT
8714       parC_A56T  AMR   POINT
11778     parC_A108V  AMR   POINT
28405      gyrA_G81D  AMR   POINT
35020     parC_A108T  AMR   POINT
38796      parC_G78D  AMR   POINT
51710      gyrA_S83V  AMR   POINT


In [None]:
#check the Type/Subtype combinations
print("Type and Subtype combinations:")
print(df_amrfinder.groupby(['Type', 'Subtype']).size())

#check specific problematic elements
print("\nElements with Type='AMR' and Subtype='POINT':")
point_mutations = df_amrfinder[
    (df_amrfinder['Type'] == 'AMR') &
    (df_amrfinder['Subtype'] == 'POINT')    # <--- Chromosomal mutations (PROBLEM!) as these are not acquired genes
]
print(point_mutations['Element symbol'].unique())

Type and Subtype combinations:
Type       Subtype  
AMR        AMR          12381
           POINT         3358
STRESS     ACID          1604
           BIOCIDE       2029
           HEAT           115
           METAL         2736
VIRULENCE  VIRULENCE    29860
dtype: int64

Elements with Type='AMR' and Subtype='POINT':
['cyaA_S352T' 'parE_L416F' 'parC_S57T' 'parC_S80I' 'gyrA_D87N' 'gyrA_S83L'
 'uhpT_E350Q' 'blaTEMp_C32T' 'gyrA_S83A' 'ptsI_V25I' 'parE_I529L'
 'marR_S3N' 'nfsA_E75STOP' 'soxR_R20H' 'parC_E84V' 'parC_S80R' 'acrR_R45C'
 'parE_I355T' 'nfsA_R203C' 'parE_D475E' 'parE_S458A' 'rpoB_Q148L'
 'parC_E84K' 'parE_E460D' 'gyrA_D87Y' 'ampC_C-42T' 'blaTEMp_G162T'
 'soxR_G121D' 'parE_S458T' 'parC_E84G' 'nfsA_H11Y' 'pmrB_V161G'
 'ampC_T-32A' 'gyrA_D87G' 'ompF_Q88STOP' 'ftsI_I336IKYRI' 'nfsA_W159STOP'
 'nfsA_R133S' 'ampC_C-11T' 'nfsA_E223STOP' 'parC_A56T' 'nfsA_R15C'
 'nfsA_G154E' 'ampC_T-14TGT' 'nfsA_Q67STOP' 'nfsA_G126R' 'parC_A108V'
 'parE_E460K' 'cirA_Q42STOP' 'pmrB_P94Q' 'ftsI_G363S' 

## **Separate AMRFinderPlus by Type**

### **TIER 1A: TRUE AMR GENES (`Acquired Genes ONLY`) (for correlation filtering)**

In [None]:
df_amrfinder = pd.read_csv('/content/drive/MyDrive/amr_features/df_master_amrfinder.csv')
df_master_card = pd.read_csv('/content/drive/MyDrive/amr_features/df_master_card.csv')
df_master_resfinder = pd.read_csv('/content/drive/MyDrive/amr_features/df_master_resfinder.csv')

In [None]:
df_amrfinder.head()

Unnamed: 0,Protein id,Contig id,Start,Stop,Strand,Element symbol,Element name,Scope,Type,Subtype,...,Target length,Reference sequence length,% Coverage of reference,% Identity to reference,Alignment length,Closest reference accession,Closest reference name,HMM accession,HMM description,ISOLATE_ID
0,,ERZ3165056.1,15297,24914,+,clbB,colibactin hybrid non-ribosomal peptide synthe...,plus,VIRULENCE,VIRULENCE,...,3206,3206,100.0,100.0,3206,AMQ58400.1,colibactin hybrid non-ribosomal peptide synthe...,,,11679_7_21
1,,ERZ3165056.1,55013,59377,+,clbN,colibactin non-ribosomal peptide synthetase ClbN,plus,VIRULENCE,VIRULENCE,...,1455,1455,100.0,100.0,1455,AMH08480.1,colibactin non-ribosomal peptide synthetase ClbN,,,11679_7_21
2,,ERZ3165056.1,102681,104480,+,ybtP,yersiniabactin ABC transporter ATP-binding/per...,plus,VIRULENCE,VIRULENCE,...,600,600,100.0,99.83,600,CAA21388.1,yersiniabactin ABC transporter ATP-binding/per...,,,11679_7_21
3,,ERZ3165056.1,104470,106269,+,ybtQ,yersiniabactin ABC transporter ATP-binding/per...,plus,VIRULENCE,VIRULENCE,...,600,600,100.0,99.5,600,AAC69584.1,yersiniabactin ABC transporter ATP-binding/per...,,,11679_7_21
4,,ERZ3165056.1,151758,152576,+,cdtB,cytolethal distending toxin type I nuclease su...,plus,VIRULENCE,VIRULENCE,...,273,273,100.0,100.0,273,BAF63361.1,cytolethal distending toxin type I nuclease su...,,,11679_7_21


In [None]:
df_amrfinder['Type'].value_counts()

Unnamed: 0_level_0,count
Type,Unnamed: 1_level_1
VIRULENCE,29860
AMR,15739
STRESS,6484


In [None]:
df_amrfinder['Subtype'].value_counts()

Unnamed: 0_level_0,count
Subtype,Unnamed: 1_level_1
VIRULENCE,29860
AMR,12381
POINT,3358
METAL,2736
BIOCIDE,2029
ACID,1604
HEAT,115


In [None]:
print("AMRFinderPlus raw data loaded")
print(f"Total rows: {len(df_amrfinder):,}")
print("\nType/Subtype breakdown:")
print(df_amrfinder.groupby(['Type', 'Subtype']).size())

AMRFinderPlus raw data loaded
Total rows: 52,083

Type/Subtype breakdown:
Type       Subtype  
AMR        AMR          12381
           POINT         3358
STRESS     ACID          1604
           BIOCIDE       2029
           HEAT           115
           METAL         2736
VIRULENCE  VIRULENCE    29860
dtype: int64


In [None]:
def create_gpam(df_master, gene_col, id_col='ISOLATE_ID', identity_thr=90, coverage_thr=90):
    """Creates a Gene Presence/Absence Matrix (GPAM)."""

    #1.standardize and filter for High Confidence Hits
    #CARD and ResFinder usually use %IDENTITY and %COVERAGE
    if '%IDENTITY' in df_master.columns and '%COVERAGE' in df_master.columns:
        df_filtered = df_master[
            (df_master['%IDENTITY'] >= identity_thr) &
            (df_master['%COVERAGE'] >= coverage_thr)
        ].copy()
    else:
        #AMRFinderPlus tables name these columns '% Identity to reference' and '% Coverage of reference',
        #so no identity/coverage threshold is applied here and all rows are kept
        df_filtered = df_master.copy()

    # 2.assign Presence (1)
    df_filtered['Presence'] = 1

    # 3.pivot the table (Wide Format Transformation) - using pivot_table to handle multiple hits for the same gene/isolate (max operation keeps the 1)
    gpam = df_filtered.pivot_table(
        index=id_col,
        columns=gene_col,
        values='Presence',
        fill_value=0, #if an isolate/gene combination is missing, the gene is absent (0)
        aggfunc='max'
    )

    #clean column names (important for AMRfinderPlus's 'Element symbol')
    gpam.columns.name = None
    gpam.index.name = 'ISOLATE_ID'

    return gpam

#lets try for GPAMs
gpam_card = create_gpam(df_master_card, gene_col='GENE')
gpam_resfinder = create_gpam(df_master_resfinder, gene_col='GENE')


print(f"\nCARD GPAM shape: {gpam_card.shape}")
print(f"ResFinder GPAM shape: {gpam_resfinder.shape}")


CARD GPAM shape: (1651, 143)
ResFinder GPAM shape: (1124, 121)


In [None]:
df_amrfinder_acquired = df_amrfinder[
    (df_amrfinder['Type'] == 'AMR') &     #only AMR genes
    (df_amrfinder['Subtype'] != 'POINT')  #exclude chromosomal mutations
].copy()

print(f"\nAcquired AMR genes (Type=AMR, Subtype≠POINT): {len(df_amrfinder_acquired):,} rows")


gpam_amrfinder_acquired = create_gpam(
    df_amrfinder_acquired,
    gene_col='Element symbol',
    identity_thr=90,
    coverage_thr=90
)

print(f"\nGPAM (Acquired AMR): {gpam_amrfinder_acquired.shape}")
print(f"Columns (first 20): {gpam_amrfinder_acquired.columns.tolist()[:20]}")


Acquired AMR genes (Type=AMR, Subtype≠POINT): 12,381 rows

GPAM (Acquired AMR): (1651, 145)
Columns (first 20): ['aac(3)-IId', 'aac(3)-IIe', 'aac(3)-IIg', 'aac(3)-IVa', 'aac(3)-VIa', "aac(6')-IIc", "aac(6')-Ib", "aac(6')-Ib'", "aac(6')-Ib-cr5", "aac(6')-Ib3", "aac(6')-Ib4", 'aacA16', 'aadA1', 'aadA11', 'aadA12', 'aadA13', 'aadA15', 'aadA2', 'aadA22', 'aadA25']
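
Note that `create_gpam` only applies `identity_thr` and `coverage_thr` when the table has `%IDENTITY` and `%COVERAGE` columns. The AMRFinderPlus table uses `% Identity to reference` and `% Coverage of reference` (see the `df_amrfinder.head()` preview), so the thresholds passed above have no effect. If thresholding were wanted, a minimal sketch could look like this; `filter_amrfinder_hits` is a hypothetical helper and the rows are invented.

```python
import pandas as pd

def filter_amrfinder_hits(df, identity_thr=90, coverage_thr=90):
    """Keep AMRFinderPlus rows that meet both thresholds (hypothetical helper).

    Rows with missing identity or coverage fail the comparison and are dropped.
    """
    identity = pd.to_numeric(df["% Identity to reference"], errors="coerce")
    coverage = pd.to_numeric(df["% Coverage of reference"], errors="coerce")
    return df[(identity >= identity_thr) & (coverage >= coverage_thr)].copy()

# toy rows (values are made up for illustration)
toy_hits = pd.DataFrame({
    "ISOLATE_ID": ["iso1", "iso1", "iso2"],
    "Element symbol": ["blaTEM-1", "sul2", "tet(A)"],
    "% Identity to reference": [100.0, 85.2, 99.5],
    "% Coverage of reference": [100.0, 100.0, 72.0],
})

kept = filter_amrfinder_hits(toy_hits)
print(kept[["ISOLATE_ID", "Element symbol"]])
```

Applying such a filter to the real table could change the number of columns in the resulting GPAM, so the shapes reported here would need to be rechecked.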


In [None]:
#verify ptsI_V25I is NOT in the columns
#ptsI_V25I is a chromosomal SNP in a metabolic gene (enzyme I of the phosphotransferase system, PTS), not an acquired AMR gene.
#It is associated with resistance when it occurs together with mutations in uhpT (e.g., uhpT_E350Q), which can disrupt fosfomycin uptake into the cell.
if 'ptsI_V25I' in gpam_amrfinder_acquired.columns:
    print("\nERROR: ptsI_V25I is STILL in acquired genes!")
else:
    print("\nSUCCESS: ptsI_V25I is NOT in acquired genes")


SUCCESS: ptsI_V25I is NOT in acquired genes


### **TIER 1B: STRESS RESPONSE (debatable inclusion)**

1. **Baker-Austin, C., Wright, M. S., Stepanauskas, R., & McArthur, J. V. (2006).** Co-selection of antibiotic and metal resistance. Trends in microbiology, 14(4), 176–182. https://doi.org/10.1016/j.tim.2006.02.006

    *Key point:* Metal resistance genes create LD but may have independent biological importance

In [None]:
df_amrfinder_stress = df_amrfinder[
    df_amrfinder['Type'].isin(['STRESS'])   #STRESS/METAL RESISTANCE
].copy()

gpam_amrfinder_stress = create_gpam(
    df_amrfinder_stress,
    gene_col='Element symbol'
)

print(f"AMRFinderPlus STRESS genes (Tier 1B): {gpam_amrfinder_stress.shape}")

AMRFinderPlus STRESS genes (Tier 1B): (1644, 50)


### **SEPARATE: CHROMOSOMAL MUTATIONS (Clonal markers/population structure markers) (`SEPARATE - NOT FOR FILTERING`)**

1. `Reference for Chromosomal Mutations as Population Markers:`

    **Stoesser, N., et al. (2016).** Evolutionary History of the Global Emergence of the Escherichia coli Epidemic Clone ST131. mBio, 7(2), e02162. https://doi.org/10.1128/mBio.02162-15

    *Key point:* Chromosomal SNPs like gyrA_S83L, parC_E84V define clonal complexes

In [None]:
df_amrfinder_chromosomal = df_amrfinder[
    (df_amrfinder['Type'] == 'AMR') &
    (df_amrfinder['Subtype'] == 'POINT')
].copy()

gpam_amrfinder_chromosomal = create_gpam(
    df_amrfinder_chromosomal,
    gene_col='Element symbol'
)

print(f"\nGPAM (Chromosomal mutations): {gpam_amrfinder_chromosomal.shape}")
print(f"Mutations (first 20): {gpam_amrfinder_chromosomal.columns.tolist()[:20]}")


GPAM (Chromosomal mutations): (1168, 69)
Mutations (first 20): ['acrR_R45C', 'ampC_C-11T', 'ampC_C-42T', 'ampC_T-14TGT', 'ampC_T-32A', 'blaTEMp_C141G', 'blaTEMp_C32T', 'blaTEMp_G162T', 'cirA_Q42STOP', 'cyaA_S352T', 'fabI_F203L', 'ftsI_G363S', 'ftsI_I336IKYRI', 'ftsI_N337NYRIN', 'gyrA_D87G', 'gyrA_D87N', 'gyrA_D87Y', 'gyrA_G81D', 'gyrA_S83A', 'gyrA_S83L']


In [None]:
#verify ptsI_V25I IS in chromosomal mutations
if 'ptsI_V25I' in gpam_amrfinder_chromosomal.columns:
    print("\nSUCCESS: ptsI_V25I is in chromosomal mutations (correct location)")
else:
    print("\nWARNING: ptsI_V25I not found in chromosomal mutations")


SUCCESS: ptsI_V25I is in chromosomal mutations (correct location)


In [None]:
#check what's in the mutations
print("\nChromosomal mutations found:")
print(gpam_amrfinder_chromosomal.columns.tolist())


Chromosomal mutations found:
['acrR_R45C', 'ampC_C-11T', 'ampC_C-42T', 'ampC_T-14TGT', 'ampC_T-32A', 'blaTEMp_C141G', 'blaTEMp_C32T', 'blaTEMp_G162T', 'cirA_Q42STOP', 'cyaA_S352T', 'fabI_F203L', 'ftsI_G363S', 'ftsI_I336IKYRI', 'ftsI_N337NYRIN', 'gyrA_D87G', 'gyrA_D87N', 'gyrA_D87Y', 'gyrA_G81D', 'gyrA_S83A', 'gyrA_S83L', 'gyrA_S83V', 'marR_S3N', 'nfsA_E223STOP', 'nfsA_E75STOP', 'nfsA_G126R', 'nfsA_G131D', 'nfsA_G154E', 'nfsA_H11Y', 'nfsA_Q113STOP', 'nfsA_Q44STOP', 'nfsA_Q67STOP', 'nfsA_R133S', 'nfsA_R15C', 'nfsA_R203C', 'nfsA_W159STOP', 'nfsB_W94STOP', 'ompC_Q171STOP', 'ompC_Q82STOP', 'ompC_R195L', 'ompF_Q88STOP', 'parC_A108T', 'parC_A108V', 'parC_A56T', 'parC_E84G', 'parC_E84K', 'parC_E84V', 'parC_G78D', 'parC_S57T', 'parC_S80I', 'parC_S80R', 'parE_D475E', 'parE_E460D', 'parE_E460K', 'parE_I355T', 'parE_I464F', 'parE_I529L', 'parE_L416F', 'parE_L445H', 'parE_S458A', 'parE_S458T', 'pmrB_L10P', 'pmrB_P94Q', 'pmrB_V161G', 'ptsI_V25I', 'rpoB_Q148L', 'rpoB_V146F', 'soxR_G121D', 'soxR_R20H

## **Revised Unified Matrix**

### **TIER 1A: ACQUIRED AMR GENES ONLY (`WITHOUT CHROMOSOMAL MUTATIONS`)**

In [None]:
#shapes of the CARD and ResFinder GPAMs created earlier
print(f"\nCARD: {gpam_card.shape}")
print(f"ResFinder: {gpam_resfinder.shape}")


CARD: (1651, 143)
ResFinder: (1124, 121)


In [None]:
tier1a_list = [
    gpam_card,                    #CARD genes
    gpam_resfinder,               #ResFinder genes
    gpam_amrfinder_acquired,        #AMRFinderPlus AMR genes ONLY - Without chromosomal mutations
]

unified_tier1a = pd.concat(tier1a_list, axis=1, join='outer').fillna(0)

print(f"\nTier 1A (Acquired AMR genes): {unified_tier1a.shape}")
print(f"Samples: {len(unified_tier1a)}")
print(f"Features: {len(unified_tier1a.columns)}")


Tier 1A (Acquired AMR genes): (1651, 409)
Samples: 1651
Features: 409
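
One thing to watch with a column-wise `pd.concat` is that a gene symbol reported by more than one database ends up as several columns with the same name, each counted as a separate feature. The sketch below, on invented matrices from two hypothetical sources, shows how such duplicates could be listed and collapsed (a gene is present if any source reports it).

```python
import pandas as pd

# toy matrices from two hypothetical sources that both report 'sul1'
source_a = pd.DataFrame({"sul1": [1, 0], "tet(B)": [0, 1]}, index=["iso1", "iso2"])
source_b = pd.DataFrame({"sul1": [1, 1], "aadA1": [0, 1]}, index=["iso1", "iso3"])

unified = pd.concat([source_a, source_b], axis=1, join="outer").fillna(0).astype(int)

# column names that occur more than once
duplicated_names = unified.columns[unified.columns.duplicated()].unique().tolist()
print(f"Duplicated gene names: {duplicated_names}")

# collapse duplicates by taking the maximum across same-named columns
collapsed = unified.T.groupby(level=0).max().T
print(collapsed)
```

Whether collapsing is appropriate depends on whether the same symbol means the same gene in each database, which is an assumption to verify before applying this to the real Tier 1A matrix.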


In [None]:
#verify no chromosomal mutations in Tier 1A
chromosomal_markers = ['ptsI_V25I', 'parC_E84V', 'gyrA_S83L', 'gyrA_D87N']
found_markers = [m for m in chromosomal_markers if m in unified_tier1a.columns]

if found_markers:
    print(f"\nERROR: Found {len(found_markers)} chromosomal mutations in Tier 1A!")
    print(f"  Found: {found_markers}")
    print("  These should NOT be in Tier 1A!")
else:
    print(f"\nSUCCESS: No chromosomal mutations in Tier 1A")


SUCCESS: No chromosomal mutations in Tier 1A
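
The check above tests only four hand-picked markers. A broader check could compare Tier 1A against every column of the chromosomal-mutation GPAM, which would also catch point mutations that arrive through another database. The sketch uses invented stand-ins for `unified_tier1a` and `gpam_amrfinder_chromosomal`; `find_point_mutation_overlap` is a hypothetical helper.

```python
import pandas as pd

# toy stand-ins: only the column names matter for this check
toy_tier1a = pd.DataFrame(columns=["blaTEM-1", "sul1", "uhpT_E350Q"])
toy_chromosomal = pd.DataFrame(columns=["gyrA_S83L", "parC_S80I", "uhpT_E350Q"])

def find_point_mutation_overlap(tier1a_df, chromosomal_df):
    """Return the sorted point-mutation column names that also appear in Tier 1A."""
    return sorted(set(tier1a_df.columns) & set(chromosomal_df.columns))

leaked = find_point_mutation_overlap(toy_tier1a, toy_chromosomal)
print(f"Point mutations present in Tier 1A: {leaked}")
```

The `tier1a.head()` preview further down lists columns such as `soxR_R20H` and `uhpT_E350Q`, so running this broader check on the real matrices may be worthwhile.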


### **SAVE CORRECTED TIER MATRICES**

In [None]:
output_dir = Path('/content/drive/MyDrive/amr_features')

unified_tier1a.to_csv(output_dir / 'tier1a_acquired_amr_genes_CORRECTED.csv')
print(f"\nSaved: tier1a_acquired_amr_genes_CORRECTED.csv")

gpam_amrfinder_stress.to_csv(output_dir / 'tier1b_stress_genes.csv')
print(f"Saved: tier1b_stress_genes.csv")

gpam_amrfinder_chromosomal.to_csv(output_dir / 'population_structure_markers.csv')
print(f"Saved: population_structure_markers.csv")


Saved: tier1a_acquired_amr_genes_CORRECTED.csv
Saved: tier1b_stress_genes.csv
Saved: population_structure_markers.csv


In [None]:
#summary
print(f"Tier 1A (Acquired AMR): {unified_tier1a.shape[1]} features")
print(f"  - CARD: {len(gpam_card.columns)}")
print(f"  - ResFinder: {len(gpam_resfinder.columns)}")
print(f"  - AMRFinderPlus (acquired only): {len(gpam_amrfinder_acquired.columns)}")
print(f"\nTier 1B (Stress/metal): {gpam_amrfinder_stress.shape[1]} features")
print(f"\nChromosomal mutations (NOT for filtering): {gpam_amrfinder_chromosomal.shape[1]} features")

Tier 1A (Acquired AMR): 409 features
  - CARD: 143
  - ResFinder: 121
  - AMRFinderPlus (acquired only): 145

Tier 1B (Stress/metal): 50 features

Chromosomal mutations (NOT for filtering): 69 features


### **TIER 1B: STRESS/METAL RESISTANCE (optional - creates LD)**

In [None]:
unified_tier1b = gpam_amrfinder_stress.fillna(0)
print(f"Tier 1B (Stress/metal resistance): {unified_tier1b.shape}")

Tier 1B (Stress/metal resistance): (1644, 50)


### **TIER 1C: PLASMID REPLICONS (mobilization markers)**



In [None]:
df_master_plasmidfinder = pd.read_csv('/content/drive/MyDrive/amr_features/df_master_plasmidfinder.csv')

def create_ppam(df_master, gene_col='GENE', id_col='ISOLATE_ID', identity_thr=90, coverage_thr=90):
    """Creates a Plasmid Presence/Absence Matrix (PPAM) focusing on replicon types."""

    #filter for high confidence hits (adjust thresholds as needed for PlasmidFinder)
    df_filtered = df_master[
        (df_master['%IDENTITY'] >= identity_thr) &
        (df_master['%COVERAGE'] >= coverage_thr)
    ].copy()

    df_filtered['Presence'] = 1

    #pivot to create the wide format matrix
    ppam = df_filtered.pivot_table(
        index=id_col,
        columns=gene_col,
        values='Presence',
        fill_value=0,
        aggfunc='max'
    )

    ppam.columns.name = None
    ppam.index.name = 'ISOLATE_ID'

    return ppam

ppam_plasmid = create_ppam(df_master_plasmidfinder, gene_col='GENE')

print(f"\nPlasmid Presence/Absence Matrix (PPAM) shape: {ppam_plasmid.shape}")

unified_tier1c = ppam_plasmid.fillna(0)
print(f"Tier 1C (Plasmid replicons): {unified_tier1c.shape}")


Plasmid Presence/Absence Matrix (PPAM) shape: (1451, 60)
Tier 1C (Plasmid replicons): (1451, 60)


### **POPULATION STRUCTURE MARKERS (DO NOT USE IN CORRELATION FILTER)**

In [None]:
population_structure_markers = gpam_amrfinder_chromosomal.fillna(0)
print(f"\nPopulation structure markers: {population_structure_markers.shape}")
print("These should NOT be used for correlation filtering!")


Population structure markers: (1168, 69)
These should NOT be used for correlation filtering!


## **SAVE TIER-SPECIFIC FILES**

In [None]:
unified_tier1c.to_csv(
    '/content/drive/MyDrive/amr_features/tier1c_plasmid_replicons.csv'
)

population_structure_markers.to_csv(
    '/content/drive/MyDrive/amr_features/population_structure_markers.csv'
)

## **Revised Correlation Filter**

In [None]:
#LOAD TIER-SPECIFIC MATRICES
tier1a = pd.read_csv(
    '/content/drive/MyDrive/amr_features/tier1a_acquired_amr_genes_CORRECTED.csv',
    index_col=0
)

tier1b = pd.read_csv(
    '/content/drive/MyDrive/amr_features/tier1b_stress_genes.csv',
    index_col=0
)

### **Standardize sample IDs**

In [None]:
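def replace_last_underscore_with_hash(s):
    """Replace the last underscore in a sample ID with '#' (e.g. '11679_7_21' -> '11679_7#21') to match the Roary IDs."""
    s_str = str(s)
    parts = s_str.rsplit('_', 1)   #split once, from the right
    return '#'.join(parts) if len(parts) > 1 else s_str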

tier1a.index = tier1a.index.map(replace_last_underscore_with_hash)
tier1b.index = tier1b.index.map(replace_last_underscore_with_hash)

print(f"Tier 1A (CORRECTED) samples: {len(tier1a)}")
print(f"Tier 1A (CORRECTED) features: {len(tier1a.columns)}")

#verify no chromosomal mutations
if 'ptsI_V25I' in tier1a.columns:
    print("\nERROR: ptsI_V25I STILL in Tier 1A!")
else:
    print("\nptsI_V25I successfully excluded from Tier 1A")

Tier 1A (CORRECTED) samples: 1651
Tier 1A (CORRECTED) features: 409

ptsI_V25I successfully excluded from Tier 1A
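
A quick illustration of the ID conversion on a few example strings (the function is repeated so the snippet runs on its own; `'no-underscore'` is an invented ID):

```python
def replace_last_underscore_with_hash(s):
    """Swap the final underscore in a sample ID for '#'."""
    s_str = str(s)
    parts = s_str.rsplit('_', 1)
    return '#'.join(parts) if len(parts) > 1 else s_str

examples = ['11679_7_21', '11657_5_1', 'no-underscore']
for raw_id in examples:
    print(f"{raw_id!r} -> {replace_last_underscore_with_hash(raw_id)!r}")

# applying the function to an ID that is already converted changes it again
already_converted = '11657_5#1'
print(f"{already_converted!r} -> {replace_last_underscore_with_hash(already_converted)!r}")
```

The conversion is not idempotent: an already converted ID still contains an underscore, which would be replaced on a second pass. It should therefore be applied once, to freshly loaded tables only.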


### **CORRELATION FILTER FUNCTION**

In [None]:
def filter_by_correlation_v3(X_novel, tier1a, tier1b=None, corr_threshold=0.90):
    """
    Enhanced correlation filter with tiered approach.

    Args:
        X_novel: Novel genes from Roary (Tier 3 candidates)
        tier1a: Acquired AMR genes (STRICT filtering)
        tier1b: Stress/metal genes (OPTIONAL - report but don't remove)
        corr_threshold: Correlation threshold (default 0.90)

    Returns:
        Filtered genes, removed genes, flagged genes
    """

    print(f"\n{'='*80}")
    print("TIERED CORRELATION FILTERING")
    print(f"{'='*80}")

    #find common samples FIRST (between novel genes and tier1a)
    common_samples = X_novel.index.intersection(tier1a.index)
    print(f"Common samples (Novel ∩ Tier1a): {len(common_samples)}")

    #align all matrices to common samples
    X_novel_aligned = X_novel.loc[common_samples]
    tier1a_aligned = tier1a.loc[common_samples]

    #Align tier1b to common_samples AND handle missing samples
    if tier1b is not None:
        #find which common samples exist in tier1b
        tier1b_available_samples = common_samples.intersection(tier1b.index)
        print(f"Common samples in Tier1b: {len(tier1b_available_samples)}")

        #only align to samples that exist in tier1b
        tier1b_aligned = tier1b.loc[tier1b_available_samples]

        # for samples NOT in tier1b, we'll skip the stress gene check
        missing_in_tier1b = set(common_samples) - set(tier1b_available_samples)
        if missing_in_tier1b:
            print(f"Warning: {len(missing_in_tier1b)} samples missing from Tier1b")
            print(f"  Examples: {list(missing_in_tier1b)[:5]}")
    else:
        tier1b_aligned = None
        tier1b_available_samples = []

    genes_to_keep = []
    genes_removed_tier1a = []
    genes_flagged_tier1b = []

    for i, novel_gene in enumerate(X_novel_aligned.columns):
        if (i + 1) % 100 == 0:
            print(f"  Progress: {i+1}/{X_novel_aligned.shape[1]} genes...")


        #CHECK 1: Correlation with ACQUIRED AMR (Tier 1A) - STRICT
        corr_tier1a = tier1a_aligned.corrwith(
            X_novel_aligned[novel_gene],
            method='pearson'
        ).abs()

        max_corr_tier1a = corr_tier1a.max()

        if pd.isna(max_corr_tier1a) or max_corr_tier1a < corr_threshold:
            #gene passes Tier 1A filter
            genes_to_keep.append(novel_gene)


            #CHECK 2: Correlation with STRESS genes (Tier 1B) - REPORT ONLY
            if tier1b_aligned is not None and len(tier1b_available_samples) > 0:
                #only correlate on samples where tier1b has data
                X_novel_tier1b_subset = X_novel_aligned.loc[tier1b_available_samples, novel_gene]

                corr_tier1b = tier1b_aligned.corrwith(
                    X_novel_tier1b_subset,
                    method='pearson'
                ).abs()

                max_corr_tier1b = corr_tier1b.max()

                if not pd.isna(max_corr_tier1b) and max_corr_tier1b >= corr_threshold:
                    max_gene_tier1b = corr_tier1b.idxmax()
                    genes_flagged_tier1b.append({
                        'novel_gene': novel_gene,
                        'corr_with': max_gene_tier1b,
                        'correlation': corr_tier1b[max_gene_tier1b],
                        'note': 'Correlated with stress/metal gene (LD expected)'
                    })
        else:
            #gene removed due to high correlation with Tier 1A
            max_gene_tier1a = corr_tier1a.idxmax()
            genes_removed_tier1a.append({
                'novel_gene': novel_gene,
                'corr_with': max_gene_tier1a,
                'correlation': corr_tier1a[max_gene_tier1a],
                'tier': 'Tier 1A (Acquired AMR)'
            })

            if len(genes_removed_tier1a) <= 10:
                print(f"    REMOVED: {novel_gene} | ρ={max_corr_tier1a:.3f} | {max_gene_tier1a}")

    print("FILTERING COMPLETE")
    print(f"Removed (Tier 1A - Acquired AMR): {len(genes_removed_tier1a)}")
    print(f"Flagged (Tier 1B - Stress/metal): {len(genes_flagged_tier1b)}")
    print(f"Retained: {len(genes_to_keep)}")

    return (X_novel_aligned[genes_to_keep],
            genes_removed_tier1a,
            genes_flagged_tier1b)
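
The function above calls `corrwith` once per candidate gene. If that becomes slow, the same maximum |ρ| per gene can be obtained from a single matrix product of z-scored columns. The sketch below is a hypothetical alternative, not the function used in this notebook; it assumes both matrices share the same row order and contain no missing values, and it does not handle the partially overlapping Tier 1B sample set.

```python
import numpy as np
import pandas as pd

def max_abs_correlation(novel, known):
    """For each column of `novel`, return the largest |Pearson rho| with any
    column of `known`, plus the name of that column (rows must be aligned)."""
    # drop constant columns: their correlation is undefined
    novel = novel.loc[:, novel.std(ddof=0) > 0]
    known = known.loc[:, known.std(ddof=0) > 0]

    # z-score each column; one matrix product then gives all pairwise rho
    z_novel = (novel - novel.mean()) / novel.std(ddof=0)
    z_known = (known - known.mean()) / known.std(ddof=0)
    rho = z_novel.to_numpy().T @ z_known.to_numpy() / len(novel)

    abs_rho = np.abs(rho)
    best = abs_rho.argmax(axis=1)
    return pd.DataFrame({
        "max_abs_rho": abs_rho.max(axis=1),
        "corr_with": known.columns.to_numpy()[best],
    }, index=novel.columns)

# toy data (gene names and values are invented)
known = pd.DataFrame({"amr_a": [1, 1, 1, 0, 0, 0, 0, 0],
                      "amr_b": [0, 1, 0, 1, 0, 1, 0, 1]})
novel = pd.DataFrame({"linked": [1, 1, 1, 0, 0, 0, 0, 0],
                      "other":  [1, 0, 0, 1, 1, 0, 0, 1]})

result = max_abs_correlation(novel, known)
print(result)

removed = result.index[result["max_abs_rho"] >= 0.90].tolist()
print(f"Removed at |rho| >= 0.90: {removed}")
```

Unlike the loop, this sketch drops constant columns up front instead of keeping genes whose correlation is undefined, so the retained set could differ in that edge case.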

### **APPLY TO EACH DRUG**

In [None]:
ROARY_FILTERED_DIR = Path('/content/drive/MyDrive/pangenome_features')

In [None]:
tier1a.head()

Unnamed: 0_level_0,AAC(3)-IIa,AAC(3)-IV,AAC(3)-VIa,AAC(6')-IIc,AAC(6')-Iaf,ANT(2'')-Ia,APH(3'')-Ib,APH(3')-IIa,APH(3')-Ia,APH(4)-Ia,...,soxR_R20H,sul1.1,sul2.1,sul3.1,tet(A).1,tet(B),tet(D).1,tet(M),tet(X4),uhpT_E350Q
ISOLATE_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
11657_5#1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
11657_5#10,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
11657_5#11,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,1
11657_5#12,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
11657_5#13,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,1,0,0,0,1


In [None]:
roary_df.head()

Unnamed: 0,tnpR,group_3326,folP2,group_8657,ebr,yhjK_2,neo,repA,pemI,intI,...,group_26329,group_26390,group_4134,group_26387,group_26391,group_26396,group_26394,group_26393,group_26392,group_26311
11657_5#25,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
11657_5#26,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
11657_5#27,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
11657_5#29,1,1,1,0,1,1,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
11657_5#30,0,0,0,0,1,1,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0


In [None]:
tier1b.head()

Unnamed: 0_level_0,ariR,arsA,arsD,arsR,clpK,emrE,hdeD-GI,hsp20,kefB-GI,merA,...,silS,terB,terC,terD,terE,terW,terZ,trxLHR,yfdX1,yfdX2
ISOLATE_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
11657_5#1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
11657_5#10,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
11657_5#11,1,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
11657_5#12,1,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
11657_5#13,1,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
ROARY_FILTERED_DIR = Path('/content/drive/MyDrive/pangenome_features')

In [None]:
#perfect correlation (ρ = 1.0) suggests these are the SAME feature

#APPLY TO EACH DRUG
for drug in ['AMX', 'AMC', 'CIP']:
    print(f"\n\n{'#'*80}")
    print(f"PROCESSING {drug}")
    print(f"{'#'*80}")

    roary_file = ROARY_FILTERED_DIR / f'roary_filtered_{drug}_top500.csv'
    roary_df = pd.read_csv(roary_file, index_col=0)

    print(f"Loaded Roary: {roary_df.shape}")

    #apply tiered correlation filter
    X_filtered, removed_tier1a, flagged_tier1b = filter_by_correlation_v3(
        roary_df,
        tier1a,
        tier1b,
        corr_threshold=0.90
    )

    #save results
    output_file = ROARY_FILTERED_DIR / f'roary_filtered_{drug}_top500_decorrelated_v2.csv'
    X_filtered.to_csv(output_file)
    print(f"\nSaved filtered genes: {output_file}")

    #save removed genes
    if removed_tier1a:
        removed_df = pd.DataFrame(removed_tier1a)
        removed_file = ROARY_FILTERED_DIR / f'roary_removed_tier1a_{drug}.csv'
        removed_df.to_csv(removed_file, index=False)
        print(f"Saved removed genes: {removed_file}")

        #show top removed genes
        print(f"\nTop 10 removed genes (Tier 1A):")
        removed_df_sorted = removed_df.reindex(
            removed_df['correlation'].abs().sort_values(ascending=False).index
        )
        print(removed_df_sorted.head(10)[['novel_gene', 'corr_with', 'correlation']].to_string(index=False))

    #save flagged genes
    if flagged_tier1b:
        flagged_df = pd.DataFrame(flagged_tier1b)
        flagged_file = ROARY_FILTERED_DIR / f'roary_flagged_tier1b_{drug}.csv'
        flagged_df.to_csv(flagged_file, index=False)
        print(f"\nSaved flagged genes: {flagged_file}")
        print(f"  (These genes correlated with stress/metal genes but were retained)")

    print(f"\n{'='*80}")
    print(f"FINAL RESULT FOR {drug}")
    print(f"{'='*80}")
    print(f"Original genes: {roary_df.shape[1]}")  #count columns instead of hard-coding 500
    print(f"Removed (Tier 1A): {len(removed_tier1a)}")
    print(f"Flagged (Tier 1B): {len(flagged_tier1b)}")
    print(f"Retained: {X_filtered.shape[1]}")
    print(f"Final shape: {X_filtered.shape}")



################################################################################
PROCESSING AMX
################################################################################
Loaded Roary: (1094, 500)

TIERED CORRELATION FILTERING
Common samples (Novel ∩ Tier1a): 1089
Common samples in Tier1b: 1082
  Examples: ['11658_8#70', '11791_3#13', '11657_7#35', '11657_7#46', '11657_5#93']
    REMOVED: folP2 | ρ=0.989 | sul2.1
    REMOVED: ebr | ρ=0.980 | sul1.1
    REMOVED: repA | ρ=0.914 | sul2_3
    REMOVED: group_14507 | ρ=0.904 | sul2_3
    REMOVED: dhfrI | ρ=0.993 | dfrA17
    REMOVED: mphR | ρ=1.000 | Mrx
    REMOVED: group_20781 | ρ=0.921 | aadA5
    REMOVED: group_14301 | ρ=0.965 | catA1
    REMOVED: group_15444 | ρ=0.975 | Mrx
  Progress: 100/500 genes...
    REMOVED: group_26088 | ρ=0.992 | CTX-M-15
  Progress: 200/500 genes...
  Progress: 300/500 genes...
  Progress: 400/500 genes...
  Progress: 500/500 genes...
FILTERING COMPLETE
Removed (Tier 1A - Acquired AMR): 11
Flagged (Tie

### **SUMMARY and Next Possible Steps (notes are kept in each notebook because of the size of the project and datasets)**

In [None]:
print("CORRELATION FILTERING COMPLETE")

print(f"\nSummary:")
print(f"  For each drug, we now have:")
print(f"  1. roary_filtered_{{DRUG}}_top500_decorrelated_v2.csv - Filtered genes")
print(f"  2. roary_removed_tier1a_{{DRUG}}.csv - Removed genes with reasons")

print(f"\nNext steps:")
print(f"  1. Use decorrelated files for Tier 3 training")
print(f"  2. Genes in these files are truly NOVEL candidates")
print(f"  3. Review removed genes to understand linkage patterns")

print(f"\nInterpretation:")
print(f"  - High correlation = gene travels with known AMR gene (plasmid linkage)")
print(f"  - Low correlation = independent mechanism (TRUE NOVELTY!)")

CORRELATION FILTERING COMPLETE

Summary:
  For each drug, we now have:
  1. roary_filtered_{DRUG}_top500_decorrelated_v2.csv - Filtered genes
  2. roary_removed_tier1a_{DRUG}.csv - Removed genes with reasons

Next steps:
  1. Use decorrelated files for Tier 3 training
  2. Genes in these files are truly NOVEL candidates
  3. Review removed genes to understand linkage patterns

Interpretation:
  - High correlation = gene travels with known AMR gene (plasmid linkage)
  - Low correlation = independent mechanism (TRUE NOVELTY!)


### **`Summary of What Has Been Achieved`**
Correlation filtering is now substantially improved. In the earlier notebook `1.0 Master_file_creation.ipynb` we mistakenly used the AMRFinderPlus data without filtering it. That data mixes several element types: acquired AMR genes, stress and virulence genes, and POINT mutations. For the novel gene discovery pipeline, only the true acquired AMR genes (those carrying the AMR tag) need to be removed from the Roary pangenome data, through correlation and other methods; the mutations and other elements can stay. Because `unified_dataset` contained everything from AMRFinderPlus, the filter removed a large share of the 500 most variable/informative Roary genes, so we had to revise the AMRFinderPlus handling from `1.0 Master_file_creation.ipynb`:

| Metric          | Before Fix               | After Fix                | Improvement          |
|-----------------|--------------------------|--------------------------|----------------------|
| AMX Removal     | 111/500 (22%)           | 11/500 (2.2%)           | 90% reduction        |
| AMC Removal     | 185/500 (37%)           | 11/500 (2.2%)           | 94% reduction        |
| CIP Removal     | 180/500 (36%)           | 5/500 (1.0%)            | 97% reduction        |
| ptsI_V25I correlations | Present               | Absent                  | Problem solved       |

**`Success Indicators:`**
- No ptsI_V25I (chromosomal mutation) in removed genes
- Only TRUE acquired AMR genes removed (e.g., sul2.1, dfrA17, CTX-M-15)
- Removal rates are now biologically realistic (1.0-2.2% vs. 22-37% before the fix)
- `Retained 489-495 genes per antibiotic` for novel discovery
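
The percentages in the table can be re-derived from the removal counts. The snippet below is a small arithmetic check using only the counts reported above; the variable names are illustrative and not part of the notebook.

```python
# Removal counts reported in the table above (out of 500 top genes per drug).
TOTAL_GENES = 500
removed = {
    "AMX": {"before": 111, "after": 11},
    "AMC": {"before": 185, "after": 11},
    "CIP": {"before": 180, "after": 5},
}

summary = {}
for drug, counts in removed.items():
    rate_before = 100 * counts["before"] / TOTAL_GENES
    rate_after = 100 * counts["after"] / TOTAL_GENES
    # Relative drop in the number of removed genes after the fix
    reduction = 100 * (counts["before"] - counts["after"]) / counts["before"]
    summary[drug] = {
        "rate_before_pct": rate_before,
        "rate_after_pct": rate_after,
        "reduction_pct": reduction,
        "retained": TOTAL_GENES - counts["after"],
    }
    print(f"{drug}: {rate_before:.1f}% -> {rate_after:.1f}% removed, "
          f"{reduction:.0f}% reduction, {summary[drug]['retained']} genes retained")
```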

#### **`Current Data Assets`**
There are now 4 distinct feature sets that should be used for different purposes:

1. `Tier 1A:` Acquired AMR Genes (409 features)
File: tier1a_acquired_amr_genes_CORRECTED.csv

Composition:
- CARD: 143 genes
- ResFinder: 121 genes
- AMRFinderPlus (acquired only): 145 genes

Use for:
- Tier 1 baseline models (known AMR mechanisms)
- Correlation filtering (DONE)
- Feature importance benchmarking
- Manuscript: "Known resistance determinants"

Do NOT use for:
- Novel gene discovery (these are known mechanisms)


2. `Tier 1B:` Stress/Metal Resistance (50 features)
File: tier1b_stress_genes.csv

Composition:
- Metal resistance: arsA, merA, silB, terC, etc.
- Biocide resistance: qacE, emrE, etc.

Use for:
- Flagging co-selection (report genes correlated with these)
- Supplementary analysis: "Genes linked to stress response"
- Debatable for Tier 2: Could include in multi-tier models if wanting to capture co-selection dynamics

Do NOT use for:
- Strict correlation filtering (these have already been flagged separately)

3. `Tier 1C:` Plasmid Replicons (60 features)
File: tier1c_plasmid_replicons.csv

Composition:
- Plasmid incompatibility groups: IncFII, IncFIA, IncQ1, etc.

Use for:
- Tier 2 models (plasmid-based transmission risk)
- Epidemiological analysis: Tracking horizontal gene transfer
- Feature engineering: Create "plasmid burden" composite features

Example:
- Create "high-risk plasmid" composite feature
```python
high_risk_plasmids = ['IncFII', 'IncFIA', 'IncN', 'IncHI2']
df['high_risk_plasmid_count'] = df[high_risk_plasmids].sum(axis=1)
```

4. `Population Structure Markers (69 features)`
File: population_structure_markers.csv

Composition:
- Chromosomal SNPs: gyrA_S83L, parC_E84V, ptsI_V25I, etc.
- Regulatory mutations: marR_S3N, acrR_R45C, etc.

Use for:
- Population structure analysis (phylogenetic context)
- Clonal complex identification (e.g., ST131 markers)
- Confounding variable control in statistical models
- NOT for correlation filtering (already separated)

Important Note:
Some of these ARE resistance mutations (e.g., `gyrA_S83L` for fluoroquinolones), but they are chromosomal and clonally inherited, not acquired via horizontal gene transfer (HGT) like Tier 1A genes.

Tier 3 Novel Genes (Final Decorrelated Sets)
Files:
```python
- roary_filtered_AMX_top500_decorrelated_v2.csv → 489 genes
- roary_filtered_AMC_top500_decorrelated_v2.csv → 489 genes
- roary_filtered_CIP_top500_decorrelated_v2.csv → 495 genes
```
These are the TRUE NOVEL CANDIDATES for Tier 3 discovery.

What was removed (and why it's correct):

| Removed Gene | Correlated With    | Reason                                             |
|--------------|--------------------|----------------------------------------------------|
| folP2        | sul2.1 (ρ=0.99)    | DHPS variant traveling with sulfonamide resistance |
| dhfrI        | dfrA17 (ρ=0.99)    | Trimethoprim resistance gene variant               |
| mphR         | Mrx (ρ=1.0)        | Macrolide resistance regulator                     |
| ebr          | sul1.1 (ρ=0.98)    | Gene on integrons with sulfonamide resistance      |
| group_26088  | CTX-M-15 (ρ=0.99)  | Gene co-located with ESBL plasmid                  |

All removals are biologically justified - these genes are in strong LD (Linkage Disequilibrium) with known AMR genes due to plasmid co-carriage.
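
To make the remove-versus-flag logic concrete, here is a minimal sketch of a tiered correlation filter on toy data. It is not the notebook's `filter_by_correlation_v3` (whose body is not shown in this section); the function names `max_abs_correlation` and `tiered_correlation_filter`, the toy gene names, and the use of Pearson correlation on 0/1 columns are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def max_abs_correlation(novel, known):
    """For each novel gene, find the known gene with the largest |Pearson r|."""
    common = novel.index.intersection(known.index)
    a = novel.loc[common].astype(float)
    b = known.loc[common].astype(float)
    # Constant columns have undefined correlation, so they are skipped
    a = a.loc[:, a.std(ddof=0) > 0]
    b = b.loc[:, b.std(ddof=0) > 0]
    az = (a - a.mean()) / a.std(ddof=0)
    bz = (b - b.mean()) / b.std(ddof=0)
    corr = az.T.dot(bz) / len(common)          # genes_novel x genes_known
    values = corr.to_numpy()
    best = np.abs(values).argmax(axis=1)
    return pd.DataFrame(
        {"corr_with": corr.columns.to_numpy()[best],
         "correlation": values[np.arange(len(best)), best]},
        index=corr.index,
    )

def tiered_correlation_filter(novel, tier1a, tier1b, corr_threshold=0.90):
    """Drop genes linked to Tier 1A; flag (but keep) genes linked to Tier 1B."""
    hits_a = max_abs_correlation(novel, tier1a)
    removed = hits_a[hits_a["correlation"].abs() >= corr_threshold]
    kept = novel.drop(columns=removed.index)
    hits_b = max_abs_correlation(kept, tier1b)
    flagged = hits_b[hits_b["correlation"].abs() >= corr_threshold]
    return kept, removed, flagged

# Toy data: one gene copies an acquired AMR gene, one copies a metal gene
rng = np.random.default_rng(0)
isolates = [f"iso{i}" for i in range(200)]
tier1a = pd.DataFrame({"sul2": rng.integers(0, 2, 200)}, index=isolates)
tier1b = pd.DataFrame({"merA": rng.integers(0, 2, 200)}, index=isolates)
novel = pd.DataFrame(
    {"linked_gene": tier1a["sul2"],       # should be removed
     "metal_linked": tier1b["merA"],      # should be flagged but retained
     "independent": rng.integers(0, 2, 200)},
    index=isolates,
)

kept, removed, flagged = tiered_correlation_filter(novel, tier1a, tier1b)
print("Retained:", list(kept.columns))
print("Removed (Tier 1A):", list(removed.index))
print("Flagged (Tier 1B):", list(flagged.index))
```

Computing the full correlation matrix in one matrix product avoids a Python loop over gene pairs, which matters when 500 candidate genes are compared against several hundred known features.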

### **Methodology in Brief**
**Feature Engineering and Tiered Architecture**

**Antimicrobial resistance feature extraction.** We employed a multi-database approach to identify resistance determinants: `CARD` (n=`143` genes), `ResFinder` (n=`121`), and `AMRFinderPlus` (n=`15,739` elements). From `AMRFinderPlus`, we separated acquired resistance genes (`Type='AMR'`, `Subtype≠'POINT'`, n=`145`) from chromosomal point mutations (`Subtype='POINT'`, n=`69`) to distinguish horizontally transferred determinants from clonally inherited markers. Stress response genes (`Type='STRESS'`, n=`50`) and plasmid replicon types (`PlasmidFinder`, n=`60`) were categorized separately. All gene presence/absence calls required `≥90%` sequence identity and `≥90%` coverage.
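
The split described above can be sketched with pandas as follows. This is an illustration on a toy table, not the code from `1.0 Master_file_creation.ipynb`: the column names `ISOLATE_ID`, `Element`, `Type`, and `Subtype` are placeholders that mirror the labels used in the text, and real AMRFinderPlus output may name these columns differently depending on the version.

```python
import pandas as pd

# Toy AMRFinderPlus-style hit table (one row per element found in an isolate).
# Column names are assumptions for this sketch.
hits = pd.DataFrame({
    "ISOLATE_ID": ["iso1", "iso1", "iso2", "iso2", "iso3"],
    "Element":    ["sul2", "gyrA_S83L", "tet(A)", "merA", "sul2"],
    "Type":       ["AMR", "AMR", "AMR", "STRESS", "AMR"],
    "Subtype":    ["AMR", "POINT", "AMR", "METAL", "AMR"],
})
all_isolates = sorted(hits["ISOLATE_ID"].unique())

is_amr = hits["Type"] == "AMR"
is_point = hits["Subtype"] == "POINT"
categories = {
    "tier1a_acquired": hits[is_amr & ~is_point],      # acquired AMR genes
    "point_mutations": hits[is_point],                # chromosomal mutations
    "tier1b_stress": hits[hits["Type"] == "STRESS"],  # stress/metal genes
}

def to_presence_absence(df):
    """Pivot a hit table into a 0/1 isolate x element matrix."""
    matrix = pd.crosstab(df["ISOLATE_ID"], df["Element"]).clip(upper=1)
    # Isolates without any hit in this category still need an all-zero row
    return matrix.reindex(all_isolates, fill_value=0)

matrices = {name: to_presence_absence(df) for name, df in categories.items()}
for name, matrix in matrices.items():
    print(name, matrix.shape, list(matrix.columns))
```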

**Tiered model architecture.** We constructed three hierarchical feature sets:
1. **Tier 1:** Known acquired AMR genes from `CARD`, `ResFinder`, and `AMRFinderPlus` (n=`409` features)
2. **Tier 2:** Tier 1 + plasmid replicon types (n=`469` features)
3. **Tier 3:** Tier 2 + novel accessory genome genes (n=`958-964` features, antibiotic-specific: `469` Tier 2 features plus `489-495` novel genes)
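
As a rough illustration of how the nested tiers could be assembled from separate presence/absence matrices, the sketch below concatenates the blocks column-wise on the isolates they share. The helper `build_tiers` and the toy matrices are hypothetical; the notebook's own assembly code is not shown in this section.

```python
import pandas as pd

def build_tiers(tier1_amr, plasmids, novel):
    """Return (tier1, tier2, tier3) matrices restricted to shared isolates."""
    common = tier1_amr.index.intersection(plasmids.index).intersection(novel.index)
    blocks = [tier1_amr.loc[common], plasmids.loc[common], novel.loc[common]]
    # Guard against a feature name appearing in more than one block
    all_columns = pd.Index([c for block in blocks for c in block.columns])
    if all_columns.has_duplicates:
        dupes = list(all_columns[all_columns.duplicated()])
        raise ValueError(f"Duplicate feature names across tiers: {dupes}")
    tier1 = blocks[0]
    tier2 = pd.concat(blocks[:2], axis=1)   # Tier 1 + plasmid replicons
    tier3 = pd.concat(blocks, axis=1)       # Tier 2 + novel accessory genes
    return tier1, tier2, tier3

# Toy matrices with different isolate sets and orderings
tier1_amr = pd.DataFrame({"sul2": [1, 0, 1], "tet(A)": [0, 0, 1]},
                         index=["iso1", "iso2", "iso3"])
plasmids = pd.DataFrame({"IncFII": [1, 1, 0, 0]},
                        index=["iso1", "iso2", "iso3", "iso4"])
novel = pd.DataFrame({"group_1": [0, 1, 1], "group_2": [1, 1, 0]},
                     index=["iso3", "iso1", "iso2"])

tier1, tier2, tier3 = build_tiers(tier1_amr, plasmids, novel)
print(tier1.shape, tier2.shape, tier3.shape)
```

Restricting every block to the shared isolates first keeps the three tiers row-aligned, so models trained on different tiers see exactly the same samples.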

**Novel gene discovery pipeline.** To identify putative novel resistance determinants, we processed pangenome data from `Roary` (Page et al., 2015). We removed genes matching known AMR families via exact matching (n=`101`) and broad keyword filtering (n=`65` additional genes), including efflux components (`acr`, `emr`, `mdt`), transcriptional regulators (`mar`, `acrR`), and metal resistance operons. Variance filtering (`5-95%` prevalence) reduced the candidate set from `44,957` to `6,520` genes. Feature selection combining mutual information (Cover & Thomas, 1991) and Random Forest importance (Breiman, 2001) identified the top `500` genes per antibiotic. Post-hoc correlation filtering (`|ρ| ≥ 0.90`) removed genes in linkage disequilibrium with Tier 1 genes, retaining `489-495` genes per antibiotic (`1.0-2.2%` removal rate) for Tier 3 analysis.
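
The prevalence filter and the combined ranking can be sketched as below. The text does not state how the mutual information and Random Forest scores were merged, so averaging the two ranks is an assumption of this sketch, as are the helper names `prevalence_filter` and `select_top_genes` and the synthetic data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif

def prevalence_filter(X, low=0.05, high=0.95):
    """Keep genes present in between `low` and `high` fraction of isolates."""
    prevalence = X.mean(axis=0)
    return X.loc[:, (prevalence >= low) & (prevalence <= high)]

def select_top_genes(X, y, top_n=500, seed=0):
    """Rank genes by the mean of their MI rank and RF-importance rank."""
    mi = mutual_info_classif(X, y, discrete_features=True, random_state=seed)
    rf = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X, y)
    scores = pd.DataFrame({"mi": mi, "rf": rf.feature_importances_},
                          index=X.columns)
    # Rank 1 = best on each criterion; average the two ranks (assumed scheme)
    scores["mean_rank"] = scores[["mi", "rf"]].rank(ascending=False).mean(axis=1)
    return scores.sort_values("mean_rank").head(top_n)

# Synthetic example: one informative gene, one rare, one near-core, five noise
rng = np.random.default_rng(42)
n = 300
y = rng.integers(0, 2, n)                       # resistant / susceptible labels
flip = rng.random(n) < 0.1                      # 10% label noise
X = pd.DataFrame({
    "causal": np.where(flip, 1 - y, y),
    "rare": (np.arange(n) < 3).astype(int),     # 1% prevalence
    "core": (np.arange(n) >= 3).astype(int),    # 99% prevalence
    **{f"noise_{i}": rng.integers(0, 2, n) for i in range(5)},
})

X_var = prevalence_filter(X)
ranked = select_top_genes(X_var, y, top_n=3)
print(f"Genes after prevalence filter: {X_var.shape[1]} of {X.shape[1]}")
print(ranked)
```

Rank averaging is used here because mutual information and Random Forest importances are on different scales; any other merging rule (for example, taking the union of two top lists) would fit the same skeleton.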

**Population structure markers.** Chromosomal point mutations (n=`69`), including quinolone resistance-determining region (`QRDR`) mutations (`gyrA`, `parC`) and metabolic gene variants (`ptsI_V25I`), were analyzed separately as clonal complex markers to control for population structure effects but were excluded from predictive models.