# **1. Tier 1A: Acquired AMR Genes (409 features)**
**`File:`** `tier1a_acquired_amr_genes_CORRECTED.csv`

**`Composition:`**
- **CARD**: 143 genes
- **ResFinder**: 121 genes
- **AMRFinderPlus** (acquired only): 145 genes

**`Use:`**
- Tier 1 baseline models (`known AMR mechanisms`)
- Correlation filtering against the Roary pangenome to build the novel gene set (done)
- Feature importance benchmarking

These genes are "`known resistance determinants`".

An `important note`, or rather `a mistake to avoid`: we can't and should never use these features for novel gene discovery, because they are `already known` mechanisms.

If you had such thoughts, `SUCH BLASPHEMOUS thoughts!!!`, quickly shoo them away and go take a shower with cold water...
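
The correlation filtering step listed under "Use" can be sketched as follows. This is only an illustration, not the filter that was actually run: the function name `drop_genes_correlated_with_known`, the 0.8 threshold, and the toy matrices are all assumptions. The idea is to drop pangenome genes whose presence/absence pattern tracks any known AMR gene, so that what remains can be treated as candidate novel genes.

```python
import pandas as pd

def drop_genes_correlated_with_known(pangenome, known, threshold=0.8):
    """Drop pangenome genes whose |Pearson r| with any known AMR gene is >= threshold.

    Both inputs are isolate x gene presence/absence (0/1) DataFrames.
    Hypothetical helper; the threshold is an assumption, not a value from this notebook.
    """
    shared = pangenome.index.intersection(known.index)
    pan = pangenome.loc[shared].astype(float)
    amr = known.loc[shared].astype(float)

    # z-score each column; constant columns give 0/0 = NaN and are set to 0
    pan_z = ((pan - pan.mean()) / pan.std(ddof=0)).fillna(0.0)
    amr_z = ((amr - amr.mean()) / amr.std(ddof=0)).fillna(0.0)

    # correlation matrix: rows are pangenome genes, columns are known AMR genes
    corr = pan_z.T.dot(amr_z) / len(shared)
    max_abs_corr = corr.abs().max(axis=1)

    keep = max_abs_corr[max_abs_corr < threshold].index
    return pangenome.loc[shared, keep], max_abs_corr

# toy data (hypothetical isolates and gene clusters)
samples = ['s1', 's2', 's3', 's4', 's5', 's6']
known = pd.DataFrame({'blaTEM-1': [1, 1, 1, 0, 0, 0]}, index=samples)
pangenome = pd.DataFrame({
    'group_001': [1, 1, 1, 0, 0, 0],  # identical to blaTEM-1, so it is dropped
    'group_002': [1, 0, 1, 0, 1, 0],  # weakly correlated, so it is kept
}, index=samples)

candidates, max_corr = drop_genes_correlated_with_known(pangenome, known)
print(list(candidates.columns))  # ['group_002']
```

For 0/1 data the Pearson correlation equals the phi coefficient, which is why a plain correlation is a reasonable first filter here.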

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, roc_auc_score, average_precision_score, precision_score, recall_score, f1_score
import matplotlib.pyplot as plt
import seaborn as sns
import shap

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## **Load Datasets**

In [None]:
#load Tier 1A
tier1a = pd.read_csv('/content/drive/MyDrive/amr_features/tier1a_acquired_amr_genes_CORRECTED.csv', index_col=0)

#load phenotypes
phenotypes = pd.read_csv('/content/drive/MyDrive/data/E.coli/phenotypic.csv', index_col=0)
phenotypes.set_index('Isolate', inplace=True)

## **Standardize sample IDs**

In [None]:
def replace_last_underscore_with_hash(s):
    s_str = str(s)
    parts = s_str.rsplit('_', 1)
    return '#'.join(parts) if len(parts) > 1 else s_str

tier1a.index = tier1a.index.map(replace_last_underscore_with_hash)
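
A quick check of what the ID conversion does, plus an overlap test against the phenotype index. The phenotype IDs below are hypothetical and only illustrate the `#` convention this step assumes.

```python
import pandas as pd

def replace_last_underscore_with_hash(s):
    s_str = str(s)
    parts = s_str.rsplit('_', 1)
    return '#'.join(parts) if len(parts) > 1 else s_str

# only the LAST underscore is replaced; IDs without an underscore pass through unchanged
examples = ['11657_5_10', '11657_5_1', 'noseparator']
converted = [replace_last_underscore_with_hash(x) for x in examples]
print(converted)  # ['11657_5#10', '11657_5#1', 'noseparator']

# sanity check: how many feature IDs find a partner in the phenotype table?
feature_ids = pd.Index(converted)
phenotype_ids = pd.Index(['11657_5#10', '11657_5#1', '11658_4#2'])  # hypothetical IDs
matched = feature_ids.intersection(phenotype_ids)
missing = feature_ids.difference(phenotype_ids)
print(f"{len(matched)} matched, unmatched: {list(missing)}")
```

Printing the unmatched IDs before modelling makes a silent loss of samples at the `intersection` step easier to spot.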

## **Unweighted Model**

In [None]:
#train Tier 1 models for each drug
for drug in ['AMX', 'AMC', 'CIP']:
    print("="*50)
    print(f"TIER 1 MODEL: {drug}")
    print("="*50)

    #find common samples
    common_samples = tier1a.index.intersection(phenotypes.index)

    X = tier1a.loc[common_samples]

    #map phenotypic data to binary (1=Resistant, 0=Susceptible/Intermediate)
    y = phenotypes.loc[common_samples, drug].map({'R': 1, 'S': 0, 'I': 0}).dropna()

    X = X.loc[y.index]

    print(f"Data prepared for {drug}. Total samples: {len(X)}")
    print(f"Resistance counts (R=1, S/I=0): {y.value_counts()}")

    #split data into train and test sets; the stratified split keeps the ratio of
    #Resistant (1) to Susceptible (0) the same in both the training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        test_size=0.2,    #use 20% of data for testing
        random_state=42,  #for reproducibility
        stratify=y        #essential for imbalanced data like AMR
    )

    #train XGBoost
    model = XGBClassifier(
        n_estimators=100,
        max_depth=6,
        learning_rate=0.1,
        random_state=42,
        eval_metric='logloss'
    )

    model.fit(X_train, y_train)

    #evaluate
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    auroc = roc_auc_score(y_test, y_pred_proba)
    auprc = average_precision_score(y_test, y_pred_proba)

    print(f"\nResults:")
    print(classification_report(y_test, y_pred))
    print(f"ROC-AUC: {roc_auc_score(y_test, y_pred_proba):.3f}")
    print(f" Model trained. Test AUROC: {auroc:.4f}")
    print(f"AUPRC: {auprc:.4f}")

    # SHAP feature importance
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_test)

    # Rank the known AMR genes (Tier 1A features) by mean |SHAP| and show the top 20
    feature_importance = pd.DataFrame({
        'gene': X_train.columns,
        'shap_importance': np.abs(shap_values).mean(axis=0)
    }).sort_values('shap_importance', ascending=False)

    top_20_genes = feature_importance.head(20)
    print(top_20_genes)

    #save model
    model.save_model(f'/content/drive/MyDrive/models/tier1_model_{drug}.json')

TIER 1 MODEL: AMX
Data prepared for AMX. Total samples: 1089
Resistance counts (R=1, S/I=0): AMX
1.0    659
0.0    430
Name: count, dtype: int64

Results:
              precision    recall  f1-score   support

         0.0       0.81      0.97      0.88        86
         1.0       0.97      0.85      0.91       132

    accuracy                           0.89       218
   macro avg       0.89      0.91      0.89       218
weighted avg       0.91      0.89      0.90       218

ROC-AUC: 0.941
 Model trained. Test AUROC: 0.9414
AUPRC: 0.9686
            gene  shap_importance
58         TEM-4         2.370581
327     blaTEM-1         0.345772
39         OXA-1         0.278450
134         sul1         0.267117
311        blaEC         0.225393
44          PmrE         0.193800
36           Mrx         0.151461
204  blaTEM-1B_1         0.141080
288  aph(3'')-Ib         0.132515
16      CTX-M-15         0.123643
402       sul2.1         0.123463
343      catB3.1         0.102834
55       SHV
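
The effect of `stratify=y` can be checked on labels with the same class counts as AMX (659 resistant, 430 susceptible). This is a small sketch with dummy features, not part of the original run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# labels with the AMX class counts reported above; X is a placeholder feature matrix
y = np.array([1] * 659 + [0] * 430)
X = np.zeros((len(y), 1))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"train: {len(y_train)}, test: {len(y_test)}")
print(f"resistant fraction  all: {y.mean():.3f}  "
      f"train: {y_train.mean():.3f}  test: {y_test.mean():.3f}")
```

The resistant fraction should be nearly identical in all three sets, which matches the 132/86 test support shown in the classification report.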

## **Weighted Model**

In [None]:
#train Tier 1 models for each drug
#initialize the results store once, outside the loop, so every drug's results are kept
data_dict = {}

for drug in ['AMX', 'AMC', 'CIP']:
    print("="*50)
    print(f"TIER 1 MODEL: {drug}")
    print("="*50)

    #find common samples
    common_samples = tier1a.index.intersection(phenotypes.index)

    X = tier1a.loc[common_samples]

    #map phenotypic data to binary (1=Resistant, 0=Susceptible/Intermediate)
    y = phenotypes.loc[common_samples, drug].map({'R': 1, 'S': 0, 'I': 0}).dropna()

    X = X.loc[y.index]

    print(f"Data prepared for {drug}. Total samples: {len(X)}")
    print(f"Resistance counts (R=1, S/I=0): {y.value_counts()}")

    #split data into train and test sets; the stratified split keeps the ratio of
    #Resistant (1) to Susceptible (0) the same in both the training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        test_size=0.2,    #use 20% of data for testing
        random_state=42,  #for reproducibility
        stratify=y        #essential for imbalanced data like AMR
    )

    #calculate Class Weights for Imbalance, Resistance (1) is the positive class.
    n_resistant = np.sum(y_train == 1)
    n_susceptible = np.sum(y_train == 0)

    if n_resistant > 0:
        scale_pos_weight = n_susceptible / n_resistant
    else:
        #fallback if no resistant samples are in the training set (rare but safe)
        scale_pos_weight = 1.0

    #train XGBoost Model
    print(f"\n{drug}: Training XGBoost model...")
    print(f" - Train Samples: {len(X_train)} (R={n_resistant}, S={n_susceptible})")
    print(f" - Test Samples: {len(X_test)}")
    print(f" - scale_pos_weight: {scale_pos_weight:.2f}")


    #train XGBoost
    model = XGBClassifier(
        scale_pos_weight=scale_pos_weight,
        max_depth=6,
        learning_rate=0.1,
        n_estimators=100,
        subsample=0.8,
        colsample_bytree=0.8,
        random_state=42,
        eval_metric='logloss',
        use_label_encoder=False  #ignored by recent XGBoost releases (hence the "not used" warning); safe to drop
    )

    model.fit(X_train, y_train)

    #evaluate Model
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    auroc = roc_auc_score(y_test, y_pred_proba)
    auprc = average_precision_score(y_test, y_pred_proba)
    #y_pred is already thresholded at 0.5, so reuse it instead of rounding the probabilities
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)


    print(f" Model trained. Test AUROC: {auroc:.4f}")
    print(f"ROC-AUC: {roc_auc_score(y_test, y_pred_proba):.3f}")
    print(f"AUPRC: {auprc:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1 Score: {f1:.4f}")

    #store results
    data_dict[drug] = {
        'model': model,
        'X_train': X_train,
        'X_test': X_test,
        'y_train': y_train,
        'y_test': y_test,
        'auroc': auroc
    }

    print(f"\nResults:")
    print(classification_report(y_test, y_pred))

    # SHAP feature importance
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_test)

    # Rank the known AMR genes (Tier 1A features) by mean |SHAP| and show the top 20
    feature_importance = pd.DataFrame({
        'gene': X_train.columns,
        'shap_importance': np.abs(shap_values).mean(axis=0)
    }).sort_values('shap_importance', ascending=False)

    top_20_genes = feature_importance.head(20)
    print(top_20_genes)

    print(f"\n--- Training complete for {drug} ---")

    #save model
    model.save_model(f'/content/drive/MyDrive/models/tier1_weighted_model_{drug}.json')

TIER 1 MODEL: AMX
Data prepared for AMX. Total samples: 1089
Resistance counts (R=1, S/I=0): AMX
1.0    659
0.0    430
Name: count, dtype: int64

AMX: Training XGBoost model...
 - Train Samples: 871 (R=527, S=344)
 - Test Samples: 218
 - scale_pos_weight: 0.65


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


 Model trained. Test AUROC: 0.9350
ROC-AUC: 0.935
AUPRC: 0.9660
Precision: 0.9825
Recall: 0.8485
F1 Score: 0.9106

Results:
              precision    recall  f1-score   support

         0.0       0.81      0.98      0.88        86
         1.0       0.98      0.85      0.91       132

    accuracy                           0.90       218
   macro avg       0.90      0.91      0.90       218
weighted avg       0.91      0.90      0.90       218

            gene  shap_importance
58         TEM-4         1.745257
327     blaTEM-1         0.642719
204  blaTEM-1B_1         0.267102
39         OXA-1         0.214595
311        blaEC         0.206418
134         sul1         0.205218
44          PmrE         0.201813
402       sul2.1         0.155927
288  aph(3'')-Ib         0.137484
36           Mrx         0.117108
120         mdtM         0.104284
363       emrD.1         0.083918
189   blaOXA-1_1         0.069570
55       SHV-102         0.061054
135         sul2         0.058966
343  

Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


 Model trained. Test AUROC: 0.8255
ROC-AUC: 0.826
AUPRC: 0.7148
Precision: 0.5882
Recall: 0.8000
F1 Score: 0.6780

Results:
              precision    recall  f1-score   support

         0.0       0.90      0.76      0.82       230
         1.0       0.59      0.80      0.68       100

    accuracy                           0.77       330
   macro avg       0.74      0.78      0.75       330
weighted avg       0.80      0.77      0.78       330

            gene  shap_importance
58         TEM-4         1.121231
318     blaOXA-1         0.389447
39         OXA-1         0.207375
311        blaEC         0.139966
258     tet(A)_4         0.116612
327     blaTEM-1         0.114794
36           Mrx         0.102914
135         sul2         0.083075
134         sul1         0.071199
401       sul1.1         0.069021
204  blaTEM-1B_1         0.067930
348        dfrA1         0.066062
382       mdtM.1         0.064300
126         mphA         0.064259
352     dfrA17.1         0.059725
363  

Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


 Model trained. Test AUROC: 0.9272
ROC-AUC: 0.927
AUPRC: 0.8613
Precision: 0.7160
Recall: 0.8056
F1 Score: 0.7582

Results:
              precision    recall  f1-score   support

         0.0       0.94      0.91      0.93       258
         1.0       0.72      0.81      0.76        72

    accuracy                           0.89       330
   macro avg       0.83      0.86      0.84       330
weighted avg       0.89      0.89      0.89       330

               gene  shap_importance
120            mdtM         0.756636
126            mphA         0.559397
205     blaTEM-1C_5         0.230480
311           blaEC         0.230048
204     blaTEM-1B_1         0.218444
258        tet(A)_4         0.208959
382          mdtM.1         0.204063
16         CTX-M-15         0.194781
0        AAC(3)-IIa         0.185926
343         catB3.1         0.169481
88           dfrA17         0.132326
44             PmrE         0.122105
159  aac(6')Ib-cr_1         0.114237
248          strA_4         0.1

## **Tier 1 Model Comparison**

The **Weighted Models** are preferable for **AMC and CIP** because they recover substantially more resistant isolates (higher **Recall** for the minority class) at the cost of some **Precision**. The unweighted model is marginally better on AMX by AUROC and AUPRC, likely because the imbalance is less severe.

| Drug | Model Version | AUROC (Area Under ROC) | AUPRC (Area Under PR Curve) | Precision (R=1) | Recall (R=1) | F1-Score (R=1) | Judgment |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **AMX** (Balanced, R:S = 1.5:1) | **Unweighted (Default)** | **0.9414** | **0.9686** | 0.97 | 0.85 | 0.91 | **Slightly Better** (Higher AUROC/AUPRC) |
| **AMX** | Weighted (0.65) | 0.9350 | 0.9660 | **0.98** | 0.85 | 0.91 | Very close, same F1. |
| **AMC** (Imbalanced, R:S = 1:2.3) | Unweighted (Default) | 0.8198 | **0.7167** | **0.67** | 0.53 | 0.59 | **Worse** (Low Recall for Resistance) |
| **AMC** | **Weighted (2.29)** | **0.8255** | 0.7148 | 0.59 | **0.80** | **0.68** | **Better** (Higher Recall/F1) |
| **CIP** (Highly Imbalanced, R:S = 1:3.6) | Unweighted (Default) | **0.9352** | **0.8681** | **0.88** | 0.68 | **0.77** | **Worse** (Lower Recall for Resistance) |
| **CIP** | **Weighted (3.62)** | 0.9272 | 0.8613 | 0.72 | **0.81** | 0.76 | **Better** (Higher Recall, similar F1) |

**`Detailed Analysis by Drug`**

**`1. Amoxicillin (AMX) - Relatively Balanced`**
* **Best Model:** **Unweighted Model.**
* **Reasoning:** With a class ratio close to 1:1, reweighting has little to gain. The unweighted model has a slightly higher **AUROC (0.9414 vs 0.9350)** and **AUPRC (0.9686 vs 0.9660)**, while the weighted model has marginally higher Precision for the resistant class (0.98 vs 0.97) at the same Recall (0.85) and F1 (0.91). Both models are excellent.

**`2. Amoxicillin/Clavulanate (AMC) - Imbalanced`**
* **Best Model:** **Weighted Model (`scale_pos_weight = 2.29`).**
* **Reasoning:** The unweighted model achieves high precision (0.81) for susceptible (0.0) but only **0.53 Recall** for the resistant (1.0) class. This means it misses almost half of the actual resistant isolates. The weighted model sacrifices a bit of precision (down to **0.59**) to drastically increase **Recall to 0.80**, resulting in a much higher **F1-Score (0.68 vs 0.59)**. This is a vital trade-off in AMR, as **missing resistance (low Recall)** is usually the most dangerous error.

**`3. Ciprofloxacin (CIP) - Highly Imbalanced`**
* **Best Model:** **Weighted Model (`scale_pos_weight = 3.62`).**
* **Reasoning:** The imbalance is most severe here. The unweighted model prioritizes accuracy on the large susceptible class, resulting in a **Recall of only 0.68** for the resistant class. The weighted model redistributes the focus, pushing **Recall up to 0.81**, a significant improvement in identifying resistance. This comes at the cost of Precision (0.72 vs 0.88), so the **F1-Score** for the resistant class is essentially unchanged (0.76 vs 0.77); the weighted model is preferred because missed resistance is the costlier error.
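
The F1 values in this comparison can be re-derived from precision and recall, since F1 is their harmonic mean. The check below uses the precision and recall figures reported in the table and outputs above (two-decimal values where only the table is available).

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) for the resistant class, taken from the results above
rows = {
    ('AMC', 'unweighted'): (0.67, 0.53),
    ('AMC', 'weighted'): (0.5882, 0.8000),
    ('CIP', 'unweighted'): (0.88, 0.68),
    ('CIP', 'weighted'): (0.7160, 0.8056),
}

f1_scores = {key: round(f1_from(p, r), 2) for key, (p, r) in rows.items()}
for (drug, version), f1 in f1_scores.items():
    print(f"{drug} {version}: F1 = {f1:.2f}")
# AMC unweighted: F1 = 0.59
# AMC weighted: F1 = 0.68
# CIP unweighted: F1 = 0.77
# CIP weighted: F1 = 0.76
```

For CIP the recall gain and the precision loss nearly cancel, so the case for the weighted model rests on recall rather than on F1.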

**`Conclusion`**

The **weighted models** provide a more clinically relevant performance profile for the **imbalanced** drugs (**AMC and CIP**). They use **`scale_pos_weight`** to shift the model toward the resistant class, as shown by the improved Recall for the Resistant (1.0) class on both drugs. F1 improves clearly for AMC (0.68 vs 0.59) and is essentially unchanged for CIP (0.76 vs 0.77), where the Recall gain is offset by lower Precision.
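
As a worked example of the weighting, `scale_pos_weight` is the ratio of negative (susceptible) to positive (resistant) training samples. The helper name below is hypothetical; the AMX counts are the ones printed in the training output above.

```python
import numpy as np

def compute_scale_pos_weight(y_train):
    """Return n_negative / n_positive, falling back to 1.0 if there are no positives."""
    y_train = np.asarray(y_train)
    n_pos = int((y_train == 1).sum())
    n_neg = int((y_train == 0).sum())
    return n_neg / n_pos if n_pos > 0 else 1.0

# AMX training split from the output above: R=527, S=344
y_amx_train = np.array([1] * 527 + [0] * 344)
amx_weight = compute_scale_pos_weight(y_amx_train)
print(f"{amx_weight:.2f}")  # 0.65
```

A value below 1 (AMX) slightly down-weights the resistant class because it is the majority there, while values above 1 (2.29 for AMC, 3.62 for CIP) up-weight it.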


# **2. Tier 1B: Stress/Metal Resistance (50 features)**
**`File:`** `tier1b_stress_genes.csv`

**`Composition:`**
- Metal resistance: `arsA`, `merA`, `silB`, `terC`, etc.
- Biocide resistance: `qacE`, `emrE`, etc.

**`Used for:`**
- Flagging co-selection (report genes correlated with these)
- Supplementary analysis: "Genes linked to stress response"
- **`Debatable for Tier 2:`** These could be included in multi-tier models if we want to capture co-selection dynamics

**`Do NOT use for:`**
- Strict correlation filtering (we already flagged these separately)

**`Examples of Co-Selection Genes:`**

- **Arsenic**: `arsA`, `arsD`, `arsR`
- **Silver**: `silB`, `silC`, `silE`, `silP`
- **Tellurite**: `terB`, `terC`, `terD`, `terE`, `terW`, `terZ`
- **Mercury**: `merA`, `merC`, `merD`, `merE`, `merP`, `merR`, `merT`
- **Copper**: `pcoC`, `pcoE`

| System      | Genes Removed                              | Co-Selection Link                                                                                     | References                                                                                                                                                   |
| :---------- | :----------------------------------------- | :---------------------------------------------------------------------------------------------------- | :----------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Silver (`sil`) | `silB`, `silC`, `silE`, `silP`             | Often on plasmids with β-lactamases                                                                   | Gupta et al. (1999) demonstrated that silver resistance genes on plasmids are frequently co-located with antibiotic resistance determinants                 |
| Mercury (`mer`) | `merA`, `merC`, `merD`, `merE`, `merP`, `merR`, `merT` | Class 1 integrons with AMR cassettes                                                               | Liebert et al. (1999) found mercury resistance genes on conjugative plasmids carrying multiple antibiotic resistance genes                                   |
| Tellurite (`ter`) | `terB`, `terC`, `terD`, `terE`, `terW`, `terZ` | Plasmid-borne with AMR genes                                                                      | Turner et al. (1999) reported tellurite resistance operons on mobile genetic elements associated with antibiotic resistance                                 |

**Validated**: These genes are in linkage disequilibrium (LD) with true AMR genes because they are co-located on the same plasmids. Removing them prevents spurious associations in ML models.
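
Flagging co-selection, as opposed to filtering, can be sketched as a report of candidate genes that co-occur with stress/metal genes. Everything below is an assumption for illustration: the helper name `flag_co_selection`, the 0.6 cutoff, and the toy data.

```python
import pandas as pd

def flag_co_selection(candidates, stress, threshold=0.6):
    """List (candidate gene, stress gene) pairs whose |phi| is >= threshold.

    Inputs are isolate x gene presence/absence (0/1) DataFrames.
    Hypothetical helper; the cutoff is an assumption.
    """
    shared = candidates.index.intersection(stress.index)
    records = []
    for stress_gene in stress.columns:
        # Pearson correlation on 0/1 data is the phi coefficient
        corr = candidates.loc[shared].corrwith(stress.loc[shared, stress_gene])
        for gene, value in corr.dropna().items():
            if abs(value) >= threshold:
                records.append({'gene': gene, 'stress_gene': stress_gene,
                                'phi': round(float(value), 3)})
    return pd.DataFrame(records, columns=['gene', 'stress_gene', 'phi'])

# toy data (hypothetical isolates and candidate gene clusters)
samples = ['s1', 's2', 's3', 's4', 's5', 's6']
stress_toy = pd.DataFrame({'merA': [1, 1, 0, 0, 0, 0],
                           'silB': [0, 0, 0, 0, 1, 1]}, index=samples)
candidates_toy = pd.DataFrame({'group_010': [1, 1, 0, 0, 0, 0],   # travels with merA
                               'group_011': [1, 0, 1, 0, 1, 0]},  # independent
                              index=samples)

flags = flag_co_selection(candidates_toy, stress_toy)
print(flags)
```

Flagged genes stay in the analysis; the report only marks them as possibly hitchhiking with a metal or biocide resistance system.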

In [4]:
stress = pd.read_csv("./tier1b_stress_genes.csv")

In [5]:
stress.head(20)

Unnamed: 0,ISOLATE_ID,ariR,arsA,arsD,arsR,clpK,emrE,hdeD-GI,hsp20,kefB-GI,...,silS,terB,terC,terD,terE,terW,terZ,trxLHR,yfdX1,yfdX2
0,11657_5_1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,11657_5_10,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,11657_5_11,1,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,11657_5_12,1,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,11657_5_13,1,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,11657_5_14,1,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,11657_5_15,1,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,11657_5_16,1,0,0,0,0,1,0,0,0,...,1,0,0,0,0,0,0,0,0,0
8,11657_5_17,1,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,11657_5_18,1,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [9]:
stress.columns

Index(['ISOLATE_ID', 'ariR', 'arsA', 'arsD', 'arsR', 'clpK', 'emrE', 'hdeD-GI',
       'hsp20', 'kefB-GI', 'merA', 'merB', 'merC', 'merD', 'merE', 'merF',
       'merP', 'merR', 'merT', 'ncrA', 'ncrB', 'ncrC', 'pcoA', 'pcoB', 'pcoC',
       'pcoD', 'pcoE', 'pcoR', 'pcoS', 'psi-GI', 'qacE', 'qacEdelta1', 'qacL',
       'shsP', 'silA', 'silB', 'silC', 'silE', 'silF', 'silP', 'silR', 'silS',
       'terB', 'terC', 'terD', 'terE', 'terW', 'terZ', 'trxLHR', 'yfdX1',
       'yfdX2'],
      dtype='object')
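
One way to summarize this table is to count how many isolates carry at least one gene from each metal system, grouping columns by the prefixes listed earlier (`ars`, `mer`, `sil`, `ter`, `pco`). The small frame below is a hypothetical stand-in for `stress`, with made-up values.

```python
import pandas as pd

# hypothetical stand-in for the stress table, indexed by isolate
stress_demo = pd.DataFrame({
    'ISOLATE_ID': ['11657_5_1', '11657_5_10', '11657_5_11'],
    'arsD': [0, 0, 1],
    'merA': [0, 1, 0],
    'merR': [0, 1, 0],
    'silS': [1, 0, 0],
    'terB': [0, 0, 0],
}).set_index('ISOLATE_ID')

# gene-name prefixes for each system, following the co-selection list above
systems = {'Arsenic': 'ars', 'Mercury': 'mer', 'Silver': 'sil',
           'Tellurite': 'ter', 'Copper': 'pco'}

summary = {}
for system, prefix in systems.items():
    cols = [c for c in stress_demo.columns if c.startswith(prefix)]
    # an isolate carries the system if at least one of its genes is present
    summary[system] = int(stress_demo[cols].any(axis=1).sum()) if cols else 0

print(summary)
# {'Arsenic': 1, 'Mercury': 1, 'Silver': 1, 'Tellurite': 0, 'Copper': 0}
```

Setting `ISOLATE_ID` as the index also makes it straightforward to align this table with the Tier 1A features after the same ID standardization.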