# **Preparing Datasets**

| Dataset              | Format                          | Feature Space                              | Status               |
|----------------------|----------------------------------|---------------------------------------------|-----------------------|
| **1. Phenotypes**    | Tabular (`phenotypic_df`)        | Target variable (R/S/I for each antibiotic) | Available             |
| **2. SNP Data**      | Matrix (`snp_encoded`)           | 10,000 top variable SNP positions           | Ready for mapping     |
| **3. Roary Core/Accessory** | Matrix (`roary_matrix`)   | ≈ 5,000+ generic gene names                 | Ready for filtering   |
| **4. CARD Genes**    | Master DF / Extracted lists (`card_features`) | Specific AMR/Efflux genes (e.g., *PmrF*) | Ready for GPAM        |
| **5. ResFinder Genes** | Master DF / Extracted lists (`resfinder_features`) | Specific transferable genes (e.g., `tet(B)_4`) | Ready for GPAM |
| **6. AMRFinderPlus** | Master DF / Extracted lists (`amrfinder_features`) | Specific genes and point mutations (e.g., `parC_S80I`) | Ready for GPAM |

If you have PlasmidFinder results (Abricate run with `--db plasmidfinder`), process them the same way, as presence/absence of Inc groups such as IncFII and IncI1, and include them in `unified_df`. Replicon content is useful context for predicting how resistance spreads. The PlasmidFinder results are generated and processed later in this notebook.

In [None]:
import os
import pandas as pd
from pathlib import Path
import numpy as np
import zipfile


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## **Extract `amr_features.zip`**

In [None]:
zip_file_path = '/content/drive/MyDrive/data/amr_features.zip'
extract_to_path = './contents'

#create the extraction directory if it doesn't exist
os.makedirs(extract_to_path, exist_ok=True)

#extract the zip file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(extract_to_path)

print(f"Successfully extracted '{zip_file_path}' to '{extract_to_path}'")

Successfully extracted '/content/drive/MyDrive/data/amr_features.zip' to './contents'


### **Inspecting File Contents in `amr_features`**

In [None]:
card_path = '/content/drive/MyDrive/amr_features/abricate/11657_5_10_card.tsv'
resfinder_path = '/content/drive/MyDrive/amr_features/abricate/11657_5_10_resfinder.tsv'
amrfinder_path = '/content/drive/MyDrive/amr_features/amrfinder/11657_5_1.tsv'

print(f"\nImporting: {card_path}")
df_card = pd.read_csv(card_path, sep='\t')
print("df_card head:")
print(df_card.head())

print(f"\nImporting: {resfinder_path}")
df_resfinder = pd.read_csv(resfinder_path, sep='\t')
print("df_resfinder head:")
print(df_resfinder.head())

print(f"\nImporting: {amrfinder_path}")
df_amrfinder = pd.read_csv(amrfinder_path, sep='\t')
print("df_amrfinder head:")
print(df_amrfinder.head())


Importing: /content/drive/MyDrive/amr_features/abricate/11657_5_10_card.tsv
df_card head:
                                              #FILE      SEQUENCE   START  \
0  /home/ryder/AMR_PROJECT/assemblies/11657_5_10.fa  ERZ3164562.1   76875   
1  /home/ryder/AMR_PROJECT/assemblies/11657_5_10.fa  ERZ3164562.1  266270   
2  /home/ryder/AMR_PROJECT/assemblies/11657_5_10.fa  ERZ3164562.1  266989   
3  /home/ryder/AMR_PROJECT/assemblies/11657_5_10.fa  ERZ3164562.1  269805   
4  /home/ryder/AMR_PROJECT/assemblies/11657_5_10.fa  ERZ3164562.1  272883   

      END  GENE     COVERAGE     COVERAGE_MAP GAPS  %COVERAGE  %IDENTITY  \

  DATABASE                  ACCESSION  
0     card     U00096:2367071-2368040  
1     card  NC_007779:2166413-2167136  
2     card   AP009048:2165013-2166417  
3     card     U00096:2158386-2161464  
4     card     U00096:2155263-2158386  

Importing: /content/drive/MyDrive/amr_features/abricate/11657_5_10_resfinder.tsv
df_resfinder head:
                            


#### **1. Abricate Outputs (CARD & ResFinder)**

Abricate uses a standardized format regardless of which database (CARD or ResFinder) is used.

* **#FILE**: The path to the input genomic assembly file (e.g., `11657_5_10.fa`).
* **SEQUENCE**: The specific contig or sequence ID within our assembly where the gene was found.
* **START / END**: The nucleotide coordinates on our sequence where the gene alignment begins and ends.
* **GENE**: The name of the antimicrobial resistance (AMR) gene identified (e.g., `PmrF`, `tet(34)`).
* **COVERAGE**: The specific range of the gene covered relative to its total length (e.g., `1-970/970` means the entire gene was found).
* **COVERAGE_MAP**: A visual "bar" representing the alignment; `=======` indicates full coverage.
* **GAPS**: Indicates if there are any insertions or deletions (gaps) in the alignment.
* **%COVERAGE**: The percentage of the reference gene length that matched our sequence.
* **%IDENTITY**: The percentage of exact nucleotide matches between our sequence and the database reference.
* **DATABASE**: The database used for the search (e.g., `card` or `resfinder`).
* **ACCESSION**: The unique identifier/accession number for that gene in the respective database.
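
The `COVERAGE` string can be split into its parts when the aligned span is needed as numbers. The sketch below is a minimal illustration; `parse_coverage` is a hypothetical helper, not part of Abricate. It computes the covered span only and ignores gaps, so it can differ slightly from the reported `%COVERAGE` on gapped hits (for `66-421/465` with gaps `2/2`, the span gives 76.56 while the file reports 76.34).

```python
import re


def parse_coverage(coverage: str) -> tuple[int, int, int, float]:
    """Split an Abricate COVERAGE string such as '1-970/970'.

    Returns (start, end, reference_length, percent_span). percent_span is the
    covered span over the reference length and does not account for gaps.
    """
    match = re.fullmatch(r"(\d+)-(\d+)/(\d+)", coverage.strip())
    if match is None:
        raise ValueError(f"Unexpected COVERAGE format: {coverage!r}")
    start, end, ref_len = (int(group) for group in match.groups())
    percent_span = 100.0 * (end - start + 1) / ref_len
    return start, end, ref_len, round(percent_span, 2)


print(parse_coverage("1-970/970"))   # full-length hit
print(parse_coverage("1-258/262"))   # slightly truncated hit
```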



#### **2. AMRFinderPlus Output**

AMRFinderPlus provides more detailed biological context, including functional descriptions and the "Method" used for detection.

* **Protein id / Contig id**: Identifiers for the protein (if provided) and the genomic contig.
* **Start / Stop / Strand**: The genomic coordinates and the orientation (`+` or `-`) of the gene.
* **Element symbol**: The short gene name or mutation symbol (e.g., `blaEC` or `parC_S80I` for a specific point mutation).
* **Element name**: The full descriptive name of the gene or protein (e.g., "BlaEC family class C beta-lactamase").
* **Scope**: Indicates whether the hit belongs to the "core" or the "plus" subset of the AMRFinderPlus database (this is unrelated to the core genome). The "core" subset is the highly curated set of AMR genes and point mutations; the "plus" subset adds genes related to biocide resistance, stress resistance, general efflux pumps, virulence factors, and antigenicity. "Plus" genes may not directly cause resistance to the antibiotics typically tested in clinical settings, but they provide context on how bacteria survive in hostile environments.
* **Type / Subtype**: The broad category of the hit (e.g., `AMR`, `VIRULENCE`) and its specific subtype (e.g., `POINT` for point mutations).
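* **Method**: The detection strategy. The trailing `X` marks hits found by translated nucleotide search. Values seen in this dataset:
  * **EXACTX**: 100% identical to a reference protein over its full length.
  * **PARTIALX**: Only part of the reference is covered (the gene is truncated or split).
  * **POINTX**: A known resistance point mutation was identified.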


* **Target length / Reference sequence length**: The length of the gene in our sample versus its length in the NCBI database.
* **% Coverage of reference**: How much of the NCBI reference gene is present in our assembly.
* **% Identity to reference**: The percentage of identical amino acids or nucleotides.
* **Alignment length**: The actual length of the sequence alignment.
* **Closest reference accession / name**: The ID and name of the top-matching sequence in the NCBI database.
* **HMM accession / description**: Details of the Hidden Markov Model (HMM) if the gene was identified through profile matching rather than a direct sequence search.

In [None]:
df_card.head()

Unnamed: 0,#FILE,SEQUENCE,START,END,GENE,COVERAGE,COVERAGE_MAP,GAPS,%COVERAGE,%IDENTITY,DATABASE,ACCESSION
0,/home/ryder/AMR_PROJECT/assemblies/11657_5_10.fa,ERZ3164562.1,76875,77844,PmrF,1-970/970,===============,0/0,100.0,99.38,card,U00096:2367071-2368040
1,/home/ryder/AMR_PROJECT/assemblies/11657_5_10.fa,ERZ3164562.1,266270,266993,baeR,1-724/724,===============,0/0,100.0,98.07,card,NC_007779:2166413-2167136
2,/home/ryder/AMR_PROJECT/assemblies/11657_5_10.fa,ERZ3164562.1,266989,268393,baeS,1-1405/1405,===============,0/0,100.0,100.0,card,AP009048:2165013-2166417
3,/home/ryder/AMR_PROJECT/assemblies/11657_5_10.fa,ERZ3164562.1,269805,272883,mdtC,1-3079/3079,===============,0/0,100.0,98.9,card,U00096:2158386-2161464
4,/home/ryder/AMR_PROJECT/assemblies/11657_5_10.fa,ERZ3164562.1,272883,276006,mdtB,1-3124/3124,===============,0/0,100.0,98.62,card,U00096:2155263-2158386


In [None]:
df_resfinder.head()

Unnamed: 0,#FILE,SEQUENCE,START,END,GENE,COVERAGE,COVERAGE_MAP,GAPS,%COVERAGE,%IDENTITY,DATABASE,ACCESSION
0,/home/ryder/AMR_PROJECT/assemblies/11657_5_10.fa,ERZ3164562.39,16974,17329,tet(34)_1,66-421/465,.=======/=====.,2/2,76.34,75.07,resfinder,AB061440


In [None]:
df_amrfinder.head()

Unnamed: 0,Protein id,Contig id,Start,Stop,Strand,Element symbol,Element name,Scope,Type,Subtype,...,Method,Target length,Reference sequence length,% Coverage of reference,% Identity to reference,Alignment length,Closest reference accession,Closest reference name,HMM accession,HMM description
0,,ERZ3164308.17,50583,51431,-,espX1,type III secretion system effector EspX1,plus,VIRULENCE,VIRULENCE,...,PARTIALX,283,473,59.83,88.69,283,ADD54679.1,type III secretion system effector EspX1,,
1,,ERZ3164308.23,51543,52673,-,blaEC,BlaEC family class C beta-lactamase,plus,AMR,AMR,...,EXACTX,377,377,100.0,100.0,377,AAC77110.1,BlaEC family class C beta-lactamase,,
2,,ERZ3164308.27,32433,34688,-,parC_S80I,Escherichia quinolone resistant ParC,core,AMR,POINT,...,POINTX,752,752,100.0,99.87,752,WP_001281881.1,DNA topoisomerase IV subunit A ParC,,
3,,ERZ3164308.27,42222,44111,-,parE_E460K,Escherichia quinolone resistant ParE,core,AMR,POINT,...,POINTX,630,630,100.0,99.68,630,WP_033554192.1,DNA topoisomerase IV subunit B ParE,,
4,,ERZ3164308.29,45497,46240,-,mdtM,multidrug efflux MFS transporter MdtM,plus,AMR,AMR,...,PARTIALX,248,410,60.49,95.97,248,AAC77293.1,multidrug efflux MFS transporter MdtM,,


### **Master File Creation**
For each file we strip the database tag (`_card` or `_resfinder`; the AMRFinderPlus files carry no tag) and the `.tsv` extension to recover the full isolate ID (e.g., `11657_5_10`), then store it in a new `ISOLATE_ID` column so that samples can be matched across all datasets.

In [None]:
from pathlib import Path  #needed here if the import cell was not re-run

#configuration
BASE_FEATURES_DIR = Path('./amr_features')

#processing Function

def create_master_df_revised(search_subpath: str, file_suffix: str) -> pd.DataFrame:
    """
    Finds files with a specific suffix, extracts the full ISOLATE_ID from the file name, and concatenates.
    """

    search_path = BASE_FEATURES_DIR / search_subpath
    print(f"\n--- Processing: {file_suffix} ---")

    #use rglob to find all files ending with the specific suffix
    all_files = list(search_path.rglob(f'*{file_suffix}.tsv'))

    if not all_files:
        print(f"No files found ending with {file_suffix}.tsv in {search_path}.")
        return pd.DataFrame()

    print(f"Found {len(all_files)} files to concatenate.")

    all_data = []

    for f in all_files:
        try:
            df = pd.read_csv(f, sep='\t', low_memory=False)

            #ISOLATE ID EXTRACTION
            #strip the tool tag from the file stem (e.g., '_card' or '_resfinder'); an empty suffix leaves the stem unchanged
            full_stem = f.stem
            isolate_id = full_stem.removesuffix(file_suffix)  #str.removesuffix needs Python 3.9+

            # Set the ISOLATE_ID
            df['ISOLATE_ID'] = isolate_id

            all_data.append(df)

        except Exception as e:
            print(f"Error reading {f}: {e}")
            continue

    if not all_data:
        return pd.DataFrame()

    master_df = pd.concat(all_data, ignore_index=True)

    print(f"Master {file_suffix} DF created. Total rows: {len(master_df):,}")
    return master_df

#lets try

#1.CARD features (uses _card.tsv suffix)
df_master_card = create_master_df_revised('abricate', '_card')

#2.ResFinder features (uses _resfinder.tsv suffix)
df_master_resfinder = create_master_df_revised('abricate', '_resfinder')

#3.AMRfinderPlus features (uses the full filename as it has no specific suffix in our example)
df_master_amrfinder = create_master_df_revised('amrfinder', '') #using empty suffix for AMRfinderPlus


In [None]:
if not df_master_card.empty:
    print(f"\nMaster CARD DF ({len(df_master_card):,} rows):")
    print(df_master_card[['GENE', 'DATABASE', 'ISOLATE_ID']].head())

if not df_master_resfinder.empty:
    print(f"\nMaster ResFinder DF ({len(df_master_resfinder):,} rows):")
    print(df_master_resfinder[['GENE', 'DATABASE', 'ISOLATE_ID']].head())

if not df_master_amrfinder.empty:
    print(f"\nMaster AMRfinderPlus DF ({len(df_master_amrfinder):,} rows):")
    print(df_master_amrfinder[['Element symbol', 'Type', 'ISOLATE_ID']].head())


Master CARD DF (98,440 rows):
   GENE DATABASE  ISOLATE_ID
0  PmrF     card  11679_6_95
1  emrY     card  11679_6_95
2  emrK     card  11679_6_95
3  evgA     card  11679_6_95
4  evgS     card  11679_6_95

Master ResFinder DF (12,486 rows):
       GENE   DATABASE  ISOLATE_ID
0   catB4_1  resfinder  11679_5_79
1  tet(B)_4  resfinder  11679_5_79
2   catA1_1  resfinder  11679_5_79
3   catB4_1  resfinder  11679_5_79
4    strB_1  resfinder  11679_5_79

Master AMRfinderPlus DF (52,083 rows):
  Element symbol       Type  ISOLATE_ID
0           clbB  VIRULENCE  11679_7_21
1           clbN  VIRULENCE  11679_7_21
2           ybtP  VIRULENCE  11679_7_21
3           ybtQ  VIRULENCE  11679_7_21
4           cdtB  VIRULENCE  11679_7_21
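
The isolate-ID extraction used in `create_master_df_revised` can be checked in isolation. This is a small sketch; `isolate_id_from_path` is a hypothetical helper that mirrors the two lines inside the loop, and the example file names follow the patterns seen above. `str.removesuffix` requires Python 3.9 or newer.

```python
from pathlib import Path


def isolate_id_from_path(path: Path, file_suffix: str) -> str:
    """Drop the extension and an optional tool tag from a result file name."""
    return path.stem.removesuffix(file_suffix)


examples = [
    (Path("abricate/11657_5_10_card.tsv"), "_card"),
    (Path("abricate/11657_5_10_resfinder.tsv"), "_resfinder"),
    (Path("amrfinder/11657_5_1.tsv"), ""),  # AMRFinderPlus files have no tag
]

for path, suffix in examples:
    print(f"{path.name!r:35} -> {isolate_id_from_path(path, suffix)!r}")
```

With an empty suffix, `removesuffix` returns the stem unchanged, which is why the AMRFinderPlus call works with `''`.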


#### **Save Master Files**

In [None]:
df_master_card.to_csv('df_master_card.csv', index=False)
print("df_master_card saved to 'df_master_card.csv'")

df_master_card saved to 'df_master_card.csv'


In [None]:
df_master_resfinder.to_csv('df_master_resfinder.csv', index=False)
print("df_master_resfinder saved to 'df_master_resfinder.csv'")

df_master_resfinder saved to 'df_master_resfinder.csv'


In [None]:
df_master_amrfinder.to_csv('df_master_amrfinder.csv', index=False)
print("df_master_amrfinder saved to 'df_master_amrfinder.csv'")

df_master_amrfinder saved to 'df_master_amrfinder.csv'


## **Plasmid Finder/Mobile Genetic Elements (MGEs)**

### **Extraction of `plasmidfinder.zip`**

In [None]:
zip_file_path = '/content/drive/MyDrive/data/plasmidfinder.zip'
extract_to_path = './contents'

#create the extraction directory if it doesn't exist
os.makedirs(extract_to_path, exist_ok=True)

#extract the zip file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(extract_to_path)

print(f"Successfully extracted '{zip_file_path}' to '{extract_to_path}'")

Successfully extracted '/content/drive/MyDrive/data/plasmidfinder.zip' to './contents'


### **Inspect the content of Files**

In [None]:
df_plasmidfinder = pd.read_csv('/content/drive/MyDrive/plasmidfinder/11657_5_11_plasmidfinder.tsv', sep='\t')
df_plasmidfinder.head()

Unnamed: 0,#FILE,SEQUENCE,START,END,GENE,COVERAGE,COVERAGE_MAP,GAPS,%COVERAGE,%IDENTITY,DATABASE,ACCESSION
0,/home/ryder/AMR_PROJECT/assemblies/11657_5_11.fa,ERZ3164564.100,124,385,Col(MG828)_1,1-262/262,===============,0/0,100.0,93.51,plasmidfinder,NC_008486
1,/home/ryder/AMR_PROJECT/assemblies/11657_5_11.fa,ERZ3164564.103,128,385,Col(MG828)_1,1-258/262,===============,0/0,98.47,94.19,plasmidfinder,NC_008486
2,/home/ryder/AMR_PROJECT/assemblies/11657_5_11.fa,ERZ3164564.30,51875,52190,IncI2_1_Delta,1-316/316,===============,0/0,100.0,98.42,plasmidfinder,AP002527
3,/home/ryder/AMR_PROJECT/assemblies/11657_5_11.fa,ERZ3164564.32,38099,38359,IncFII_1,1-261/261,===============,0/0,100.0,100.0,plasmidfinder,AY458016
4,/home/ryder/AMR_PROJECT/assemblies/11657_5_11.fa,ERZ3164564.62,4252,4639,IncFIA_1,1-388/388,===============,0/0,100.0,99.74,plasmidfinder,AP001918


### **PlasmidFinder Master File Creation**

In [None]:
#base directory for the PlasmidFinder files
BASE_FEATURES_DIR = Path('/content/drive/MyDrive/plasmidfinder')

#processing Function

def create_master_plasmid_df(search_subpath: str, file_suffix: str) -> pd.DataFrame:
    """
    Finds PlasmidFinder files, extracts the ISOLATE_ID, and concatenates.
    """

    #for the PlasmidFinder example, search_subpath might be an empty string
    search_path = BASE_FEATURES_DIR / search_subpath
    print(f"\n--- Processing: {file_suffix} ---")

    # use rglob to find all files ending with the specific suffix - the common PlasmidFinder file name is often '_plasmidfinder.tsv'
    all_files = list(search_path.rglob(f'*{file_suffix}.tsv'))

    if not all_files:
        print(f"No files found ending with {file_suffix}.tsv in {search_path}.")
        return pd.DataFrame()

    print(f"Found {len(all_files)} files to concatenate.")

    all_data = []

    for f in all_files:
        try:
            df = pd.read_csv(f, sep='\t', low_memory=False)

            #ISOLATE ID EXTRACTION (PlasmidFinder Specific)
            full_stem = f.stem
            #strip the PlasmidFinder suffix to leave the isolate ID
            isolate_id = full_stem.removesuffix(file_suffix)

            #set the ISOLATE_ID
            df['ISOLATE_ID'] = isolate_id

            all_data.append(df)

        except Exception as e:
            print(f"Error reading {f}: {e}")
            continue

    if not all_data:
        return pd.DataFrame()

    master_df = pd.concat(all_data, ignore_index=True)

    print(f"Master {file_suffix} DF created. Total rows: {len(master_df):,}")
    return master_df


# all files end with '_plasmidfinder.tsv'
df_master_plasmidfinder = create_master_plasmid_df(
    search_subpath='',           #search directly in BASE_FEATURES_DIR
    file_suffix='_plasmidfinder' #use the PlasmidFinder specific suffix
)


--- Processing: _plasmidfinder ---
Found 1651 files to concatenate.
Master _plasmidfinder DF created. Total rows: 7,269




In [None]:
print(df_master_plasmidfinder.head())

                                              #FILE        SEQUENCE START  \
0  /home/ryder/AMR_PROJECT/assemblies/11657_7_19.fa  ERZ3164689.102   324   
1  /home/ryder/AMR_PROJECT/assemblies/11657_7_19.fa  ERZ3164689.106  2634   
2  /home/ryder/AMR_PROJECT/assemblies/11657_7_19.fa  ERZ3164689.118  2433   
3  /home/ryder/AMR_PROJECT/assemblies/11657_7_19.fa  ERZ3164689.123  1101   
4  /home/ryder/AMR_PROJECT/assemblies/11657_7_19.fa  ERZ3164689.133   928   

    END                       GENE    COVERAGE     COVERAGE_MAP GAPS  \

   %COVERAGE  %IDENTITY       DATABASE  ACCESSION  ISOLATE_ID  
0      99.03      88.78  plasmidfinder   DQ995353  11657_7_19  
1     100.00      96.18  plasmidfinder   AJ851089  11657_7_19  
2      90.77      87.50  plasmidfinder   DQ298019  11657_7_19  
3      99.80      93.61  plasmidfinder   AP001918  11657_7_19  
4     100.00      92.75  plasmidfinder  NC_008486  11657_7_19  


| Aspect | Result |
|--------|--------|
| **Status** | `df_master_plasmidfinder` successfully processed |
| **Total Rows** | **7,269** total plasmid hits across all isolates |
| **Isolate ID Extraction** | Correct: IDs such as `11657_7_19` are parsed as expected |
| **Features Captured** | Captures recognizable replicon types, including:<br>• `IncFII(pRSB107)_1_pRSB107`<br>• `IncFIC(FII)_1`<br>Replicon types mark plasmid backbones, so they serve as markers of potential HGT (Horizontal Gene Transfer) rather than direct evidence of resistance |


### **Save PlasmidFinder Master File**

In [None]:
# save
df_master_plasmidfinder.to_csv('/content/drive/MyDrive/amr_features/df_master_plasmidfinder.csv', index=False)


### **Load PlasmidFinder Master File**

In [None]:
df_master_plasmidfinder = pd.read_csv('/content/drive/MyDrive/amr_features/df_master_plasmidfinder.csv')

In [None]:
df_master_plasmidfinder['GENE'].value_counts()

Unnamed: 0_level_0,count
GENE,Unnamed: 1_level_1
IncFIB(AP001918)_1,1043
ColRNAI_1,928
Col156_1,794
IncFIC(FII)_1,736
Col(MG828)_1,639
...,...
IncHI1A(CIT)_1_pNDM-CIT,1
IncFII_1_pKP91,1
Col(Ye4449)_1,1
IncN2_1,1


In [None]:
df_master_plasmidfinder.isnull().sum()

Unnamed: 0,0
#FILE,0
SEQUENCE,0
START,0
END,0
GENE,0
COVERAGE,0
COVERAGE_MAP,0
GAPS,0
%COVERAGE,0
%IDENTITY,0


In [None]:
df_master_plasmidfinder['ISOLATE_ID'].nunique()

1509

In [None]:
df_master_plasmidfinder['GENE'].nunique()

69
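
Note that 1,651 files were concatenated but only 1,509 isolates appear in the master table. A plausible explanation (an assumption, not verified here) is that the remaining files contain only a header row because no replicon was detected. The sketch below shows one way to list such isolates; `isolates_without_hits` and the toy data are hypothetical.

```python
import pandas as pd


def isolates_without_hits(all_isolate_ids, df_master, id_col="ISOLATE_ID"):
    """Return isolate IDs that were screened but have no row in df_master."""
    seen = set(df_master[id_col].astype(str))
    return sorted(set(map(str, all_isolate_ids)) - seen)


# Toy data: three isolates were screened, only two produced hits.
screened = ["11657_5_10", "11657_5_11", "11657_7_19"]
toy_master = pd.DataFrame(
    {
        "ISOLATE_ID": ["11657_5_11", "11657_5_11", "11657_7_19"],
        "GENE": ["IncFII_1", "IncFIA_1", "Col(MG828)_1"],
    }
)

print(isolates_without_hits(screened, toy_master))
```

In the real notebook, the list of screened isolates could be derived from the file names found by `rglob`.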

# **Presence/Absence Matrices (`PAMs`)**

## **Plasmid Presence/Absence Matrix (`PPAM`)**

In [None]:
def create_ppam(df_master, gene_col='GENE', id_col='ISOLATE_ID', identity_thr=90, coverage_thr=90):
    """Creates a Plasmid Presence/Absence Matrix (PPAM) focusing on replicon types."""

    #filter for high confidence hits
    df_filtered = df_master[
        (df_master['%IDENTITY'] >= identity_thr) &
        (df_master['%COVERAGE'] >= coverage_thr)
    ].copy()

    df_filtered['Presence'] = 1

    #pivot to create the wide format matrix
    ppam = df_filtered.pivot_table(
        index=id_col,
        columns=gene_col,
        values='Presence',
        fill_value=0,
        aggfunc='max'
    )

    ppam.columns.name = None
    ppam.index.name = 'ISOLATE_ID'

    return ppam

ppam_plasmid = create_ppam(df_master_plasmidfinder, gene_col='GENE')

print(f"\nPlasmid Presence/Absence Matrix (PPAM) shape: {ppam_plasmid.shape}")


Plasmid Presence/Absence Matrix (PPAM) shape: (1451, 60)
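
The PPAM has 1,451 rows because `pivot_table` only emits isolates with at least one hit that passed the filter. If an all-zero row is wanted for every screened isolate at this stage, the matrix can be reindexed. This is a sketch under that assumption; `to_presence_matrix` and the toy isolate names are hypothetical.

```python
import pandas as pd


def to_presence_matrix(hits, all_isolate_ids, gene_col="GENE", id_col="ISOLATE_ID"):
    """Pivot long-format hits to 0/1 and add all-zero rows for missing isolates."""
    pam = hits.assign(Presence=1).pivot_table(
        index=id_col,
        columns=gene_col,
        values="Presence",
        fill_value=0,
        aggfunc="max",  # repeated hits of the same gene collapse to a single 1
    )
    pam.columns.name = None
    pam = pam.reindex(all_isolate_ids, fill_value=0).rename_axis(id_col)
    return pam.astype("int8")


toy_hits = pd.DataFrame(
    {
        "ISOLATE_ID": ["iso_A", "iso_A", "iso_A", "iso_B"],
        "GENE": ["IncFII_1", "Col(MG828)_1", "Col(MG828)_1", "IncFII_1"],
    }
)

ppam_toy = to_presence_matrix(toy_hits, all_isolate_ids=["iso_A", "iso_B", "iso_C"])
print(ppam_toy)
```

Here `iso_C` has no hits and still receives a row of zeros.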


## **`GPAM` - Gene Presence Absence Matrix**

In [None]:
df_master_card = pd.read_csv('/content/drive/MyDrive/amr_features/df_master_card.csv')
df_master_resfinder = pd.read_csv('/content/drive/MyDrive/amr_features/df_master_resfinder.csv')
df_master_amrfinder = pd.read_csv('/content/drive/MyDrive/amr_features/df_master_amrfinder.csv')

In [None]:
print(df_master_card['ISOLATE_ID'].nunique())
print(df_master_amrfinder['ISOLATE_ID'].nunique())
print(df_master_resfinder['ISOLATE_ID'].nunique())

1651
1651
1394


In [None]:
df_master_card['GENE'].nunique()

165

In [None]:
df_master_resfinder['GENE'].nunique()

146

In [None]:
df_master_amrfinder['Element symbol'].nunique()

354

### **Justification for the 90% Threshold**

The choice of **90%** for both %IDENTITY and %COVERAGE in preliminary feature filtering is a common bioinformatics heuristic, often used to establish a baseline for high-confidence homologous hits.

### 1. The Intuition and Basis

The core scientific purpose of using these thresholds is to filter out noisy, ambiguous, or evolutionarily distant sequence matches, ensuring that the feature (the gene or replicon) we flag as "present" is highly similar to the known resistance mechanism in the database.

| Threshold | Meaning in Context | Intuition |
|----------|--------------------|-----------------------|
| **%IDENTITY (≥ 90%)** | The percentage of matching nucleotides (or amino acids) between the sequenced gene and the reference gene in the database. | Makes it likely that the gene we found is a close variant of the reference. Lower identity implies substantial mutations, which might change function or indicate an unrelated sequence. |
| **%COVERAGE (≥ 90%)** | The percentage of the reference database gene that is covered by the alignment to the assembled contig. | Makes it likely that the gene is complete. Low coverage (e.g., 50%) means only half the gene aligned, suggesting truncation, misassembly, or a gene split across contigs. |


### 2. References and Common Practice

- While a universal “best” cutoff does not exist, cutoffs in the range of **90–95%** are commonly used for initial filtering in AMR studies, particularly with BLAST-based tools such as Abricate.
- **Abricate / ResFinder / CARD:** Default thresholds differ between tools.  
  - *ResFinder* defaults to **90% identity** and **60% coverage**.  
  - Using **90% coverage** is more conservative and reduces calls from truncated or fragmented genes.
- **Bioinformatics Literature:**  
  Many comparative genomic studies use **≥ 90% identity** to define strong homology, distinguish paralogs, and identify conserved AMR variants.

```
"To ensure the features represented true antimicrobial resistance determinants and to mitigate noise from fragmented assemblies or weak sequence homology,
we employed a conservative filtering approach for all gene and plasmid hits derived from Abricate and PlasmidFinder.
Only hits demonstrating a minimum of 90% sequence identity to the reference gene and 90% coverage of the reference gene length were retained.
This approach is consistent with established practices in comparative genomics for defining high-confidence gene presence."
```
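
As a worked example of the rule described above, the sketch below applies the 90/90 filter to three hits whose values are taken from the outputs shown earlier in this notebook. `passes_thresholds` is a hypothetical helper that mirrors the boolean mask used in the matrix functions.

```python
import pandas as pd

hits = pd.DataFrame(
    {
        "GENE": ["tet(34)_1", "Col(MG828)_1", "IncFII_1"],
        "%COVERAGE": [76.34, 98.47, 100.0],
        "%IDENTITY": [75.07, 94.19, 100.0],
    }
)


def passes_thresholds(df, identity_thr=90.0, coverage_thr=90.0):
    """Boolean mask: True where a hit meets both cutoffs."""
    return (df["%IDENTITY"] >= identity_thr) & (df["%COVERAGE"] >= coverage_thr)


hits["kept"] = passes_thresholds(hits)
print(hits)
```

The `tet(34)_1` hit fails both cutoffs, while the two replicon hits are kept.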


In [None]:
def create_gpam(df_master, gene_col, id_col='ISOLATE_ID', identity_thr=90, coverage_thr=90):
    """Creates a Gene Presence/Absence Matrix (GPAM)."""

    #1.standardize and filter for High Confidence Hits
    #Abricate outputs (CARD, ResFinder) name these columns %IDENTITY and %COVERAGE; other tools or future versions may use different names
    if '%IDENTITY' in df_master.columns and '%COVERAGE' in df_master.columns:
        df_filtered = df_master[
            (df_master['%IDENTITY'] >= identity_thr) &
            (df_master['%COVERAGE'] >= coverage_thr)
        ].copy()
    else:
        #AMRFinderPlus names its columns '% Identity to reference' / '% Coverage of reference', so it lands here and is NOT threshold-filtered (partial hits such as PARTIALX are kept)
        df_filtered = df_master.copy()

    # 2.assign Presence (1)
    df_filtered['Presence'] = 1

    # 3.pivot the table (Wide Format Transformation) - using pivot_table to handle multiple hits for the same gene/isolate (max operation keeps the 1)
    gpam = df_filtered.pivot_table(
        index=id_col,
        columns=gene_col,
        values='Presence',
        fill_value=0, #if an isolate/gene combination is missing, the gene is absent (0)
        aggfunc='max'
    )

    #clean column names (important for AMRfinderPlus's 'Element symbol')
    gpam.columns.name = None
    gpam.index.name = 'ISOLATE_ID'

    return gpam

#lets try (hope it works)
gpam_card = create_gpam(df_master_card, gene_col='GENE')
gpam_resfinder = create_gpam(df_master_resfinder, gene_col='GENE')
#for AMRfinder, the gene column is 'Element symbol'
gpam_amrfinder = create_gpam(df_master_amrfinder, gene_col='Element symbol')

print(f"\nCARD GPAM shape: {gpam_card.shape}")
print(f"ResFinder GPAM shape: {gpam_resfinder.shape}")
print(f"AMRfinderPlus GPAM shape: {gpam_amrfinder.shape}")


CARD GPAM shape: (1651, 143)
ResFinder GPAM shape: (1124, 121)
AMRfinderPlus GPAM shape: (1651, 354)
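
The AMRFinderPlus GPAM keeps all 354 symbols because its identity and coverage columns have different names, so `create_gpam` takes the unfiltered branch. Whether to filter these hits is a design decision; if the same 90/90 rule were wanted, it could look like the sketch below. `filter_amrfinder` is a hypothetical helper, and the rows reuse the five hits shown in `df_amrfinder.head()`.

```python
import pandas as pd

amr = pd.DataFrame(
    {
        "Element symbol": ["espX1", "blaEC", "parC_S80I", "parE_E460K", "mdtM"],
        "Method": ["PARTIALX", "EXACTX", "POINTX", "POINTX", "PARTIALX"],
        "% Coverage of reference": [59.83, 100.0, 100.0, 100.0, 60.49],
        "% Identity to reference": [88.69, 100.0, 99.87, 99.68, 95.97],
    }
)


def filter_amrfinder(df, identity_thr=90.0, coverage_thr=90.0):
    """Keep AMRFinderPlus hits that meet both cutoffs (column names as in its TSV)."""
    keep = (df["% Identity to reference"] >= identity_thr) & (
        df["% Coverage of reference"] >= coverage_thr
    )
    return df.loc[keep].copy()


kept = filter_amrfinder(amr)
print(kept["Element symbol"].tolist())
```

Both `PARTIALX` hits drop out under this rule, which would shrink the feature space relative to the unfiltered matrix above.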


In [None]:
gpam_card.head()

Unnamed: 0_level_0,AAC(3)-IIa,AAC(3)-IV,AAC(3)-VIa,AAC(6')-IIc,AAC(6')-Iaf,ANT(2'')-Ia,APH(3'')-Ib,APH(3')-IIa,APH(3')-Ia,APH(4)-Ia,...,rmtB,sul1,sul2,sul3,tet(A),tet(D),tetM,tetX,tolC,vgaC
ISOLATE_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
11657_5_1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
11657_5_10,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
11657_5_11,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,1,0
11657_5_12,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
11657_5_13,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,1,1,0,0,1,0


## **Combine all DataSets - `unified_df`**

- Plasmid Presence/Absence Matrix (PPAM) shape: (1451, 60)
- CARD GPAM shape: (1651, 143)
- ResFinder GPAM shape: (1124, 121)
- AMRfinderPlus GPAM shape: (1651, 354)

60 + 143 + 121 + 354 = 678




In [None]:
#the feature matrices below (gpam_card, gpam_resfinder, gpam_amrfinder, ppam_plasmid) must already exist and be indexed by 'ISOLATE_ID'
#mpam_snp (the SNP matrix) is not included yet

#list of all feature matrices
# feature_list = [mpam_snp, gpam_card, gpam_resfinder, gpam_amrfinder, ppam_plasmid]
feature_list = [gpam_card, gpam_resfinder, gpam_amrfinder, ppam_plasmid]

#combine all features into one dataframe
unified_df = pd.concat(feature_list, axis=1, join='outer')

#fill NaN values (where an isolate was not found in a specific tool's output) with 0
unified_df = unified_df.fillna(0)

print(f"\nFinal Unified Feature Matrix shape: {unified_df.shape}")


Final Unified Feature Matrix shape: (1651, 678)
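
The 678 columns are the plain sum of the four matrices, because `pd.concat` along `axis=1` does not merge columns that share a name. If two tools report the same gene name, the unified frame ends up with duplicate column labels. Whether that happens in this dataset has not been checked here; the sketch below uses hypothetical toy matrices to show the effect and one way to keep names unique by prefixing each source. It also casts back to integers, since `fillna(0)` leaves floats.

```python
import pandas as pd

# Toy matrices: both tools report a column called 'sul1'.
card = pd.DataFrame({"sul1": [1, 0]}, index=["iso_A", "iso_B"])
amrfinder = pd.DataFrame({"sul1": [1, 1], "blaEC": [1, 1]}, index=["iso_A", "iso_C"])

# Plain concatenation keeps both 'sul1' columns under the same label.
plain = pd.concat([card, amrfinder], axis=1, join="outer")
print("unique column names (plain):", plain.columns.is_unique)

# Prefix each source so every feature name stays unique and traceable.
sources = {"card": card, "amrfinder": amrfinder}
prefixed = (
    pd.concat(
        [df.add_prefix(f"{name}__") for name, df in sources.items()],
        axis=1,
        join="outer",
    )
    .fillna(0)
    .astype("int8")
)
prefixed.index.name = "ISOLATE_ID"
print(prefixed)
```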


In [None]:
unified_df

Unnamed: 0_level_0,AAC(3)-IIa,AAC(3)-IV,AAC(3)-VIa,AAC(6')-IIc,AAC(6')-Iaf,ANT(2'')-Ia,APH(3'')-Ib,APH(3')-IIa,APH(3')-Ia,APH(4)-Ia,...,IncX1_1,IncX1_4,IncX3_1,IncX4_1,IncX4_2,IncY_1,TrfA_1,p0111_1,pENTAS02_1,pSL483_1
ISOLATE_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
11657_5_1,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11657_5_10,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11657_5_11,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11657_5_12,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11657_5_13,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18090_8_5,0,0,1,0,0,0,0,0,0,0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18090_8_6,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18090_8_7,1,0,0,0,0,0,1,0,0,0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18090_8_8,1,0,0,0,0,0,0,0,0,0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
unified_df['AAC(3)-IIa'].value_counts()

Unnamed: 0_level_0,count
AAC(3)-IIa,Unnamed: 1_level_1
0,1459
1,192


In [None]:
unified_df.to_csv("/content/drive/MyDrive/amr_features/gpam_card_gpam_resfinder_gpam_amrfinder_ppam_plasmid.csv")

Final Unified Feature Matrix Shape: **(1651 rows × 678 columns)**.  
- The matrix combines gene, plasmid-replicon, and point-mutation features for all 1651 isolates.

**Data Completeness:**  
- The final matrix has 1651 rows because `pd.concat` with `join='outer'` keeps the union of isolate IDs, and `fillna(0)` marks isolates that are missing from a given matrix as absent for its features. This covers the smaller matrices too, such as ResFinder (1394 isolates with any hit, 1124 after the 90% filter) and PlasmidFinder (1451 isolates after filtering).
