## STRATEGY for getting AMR gene lists

Even when the search is narrowed to ESKAPE pathogens only, the number of AMR genes is still very high: roughly 3.5 million entries across about 250,000 unique genomes. So here is my plan of action.

Extract the AMR genes for each pathogen entry by filtering on the following:
- subtype = 'AMR'
- scope = 'core'
- % identity to reference = 100
- select first 15 unique genomes
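
The notebook does not show where this filter was applied, so the following is only a sketch of how the four criteria could be expressed in pandas. The column names (`Subtype`, `Scope`, `% Identity to reference`, `Assembly`) are the ones that appear in the metadata table; treating one `Assembly` accession as one genome, the helper name `select_amr_rows`, and the toy rows are assumptions made for illustration.

```python
import pandas as pd

def select_amr_rows(df, n_genomes=15):
    """Keep core AMR hits at 100% identity from the first n_genomes assemblies."""
    mask = (
        (df["Subtype"] == "AMR")
        & (df["Scope"] == "core")
        & (df["% Identity to reference"] == 100)
    )
    filtered = df[mask]
    # unique() keeps first-appearance order, so this takes the first n assemblies.
    keep = filtered["Assembly"].unique()[:n_genomes]
    return filtered[filtered["Assembly"].isin(keep)]

# Made-up rows, only to show which ones survive the filter.
toy = pd.DataFrame({
    "Assembly": ["GCA_1", "GCA_1", "GCA_2", "GCA_3"],
    "Subtype": ["AMR", "POINT", "AMR", "AMR"],
    "Scope": ["core", "core", "plus", "core"],
    "% Identity to reference": [100.0, 100.0, 100.0, 99.5],
})
selected = select_amr_rows(toy, n_genomes=15)
print(selected)
```

Only the first toy row passes all three conditions; the others fail on subtype, scope, and identity respectively.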


The above steps were applied to 6 pathogen groups:
- Enterococcus faecium
- Staphylococcus aureus
- Klebsiella pneumoniae
- Acinetobacter baumannii
- Pseudomonas aeruginosa
- other Enterobacterales group

In [11]:
import pandas as pd

In [17]:
#read and combine data from tsv files 
df1 = pd.read_csv('AMR_metadata/E.tsv', sep='\t')
df2 = pd.read_csv('AMR_metadata/S.tsv', sep='\t')
df3 = pd.read_csv('AMR_metadata/K.tsv', sep='\t')
df4 = pd.read_csv('AMR_metadata/A.tsv', sep='\t')
df5 = pd.read_csv('AMR_metadata/P.tsv', sep='\t')
df6 = pd.read_csv('AMR_metadata/otherE.tsv', sep='\t')

# ignore_index expects a boolean; the string 'TRUE' only worked because it is truthy
combined_AMR_metadata = pd.concat([df1, df2, df3, df4, df5, df6], ignore_index=True)
combined_AMR_metadata

Unnamed: 0,#Scientific name,Protein,BioSample,Isolate,Contig,Start,Stop,Strand,Assembly,Element symbol,Element name,Type,Scope,Subtype,Class,Subclass,Method,% Coverage of reference,AMRFinderPlus analysis type,% Identity to reference
0,Enterococcus faecium,HBC2783091.1,SAMEA2156146,PDT000701543.1,DADXWL010000002.1,79551,80099,+,GCA_018086425.1,aac(6')-I,aminoglycoside 6'-N-acetyltransferase,AMR,core,AMR,AMINOGLYCOSIDE,AMINOGLYCOSIDE,EXACTP,100.00,COMBINED,100.0
1,Enterococcus faecium,HAQ4543115.1,SAMEA3304260,PDT000702375.1,DACGHB010000003.1,19821,20318,-,GCA_015422305.1,dfrG,trimethoprim-resistant dihydrofolate reductase...,AMR,core,AMR,TRIMETHOPRIM,TRIMETHOPRIM,EXACTP,100.00,COMBINED,100.0
2,Enterococcus faecium,WP_001574277.1,SAMN37287145,PDT002100136.1,NZ_CP144272.1,7309,8685,-,GCA_037019775.1,tet(L),tetracycline efflux MFS transporter Tet(L),AMR,core,AMR,TETRACYCLINE,TETRACYCLINE,EXACTP,100.00,COMBINED,100.0
3,Enterococcus faecium,HAQ9466659.1,SAMEA5368487,PDT000700007.1,DACIRG010000007.1,31550,32047,-,GCA_015460685.1,dfrG,trimethoprim-resistant dihydrofolate reductase...,AMR,core,AMR,TRIMETHOPRIM,TRIMETHOPRIM,EXACTP,100.00,COMBINED,100.0
4,Enterococcus faecium,HAQ8784092.1,SAMEA5368439,PDT000700258.1,DACIIG010000002.1,118680,119177,+,GCA_015456265.1,dfrG,trimethoprim-resistant dihydrofolate reductase...,AMR,core,AMR,TRIMETHOPRIM,TRIMETHOPRIM,EXACTP,100.00,COMBINED,100.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4776,Enterobacter hormaechei,HCR0194550.1,SAMN31246612,PDT001469959.1,DAJRKR010000159.1,389,810,+,GCA_025899025.1,sul1,sulfonamide-resistant dihydropteroate synthase...,AMR,core,AMR,SULFONAMIDE,SULFONAMIDE,PARTIAL_CONTIG_ENDP,51.27,COMBINED,100.0
4777,Enterobacter hormaechei,HCR0194551.1,SAMN31246612,PDT001469959.1,DAJRKR010000160.1,106,558,+,GCA_025899025.1,arr-3,NAD(+)--rifampin ADP-ribosyltransferase Arr-3,AMR,core,AMR,RIFAMYCIN,RIFAMYCIN,EXACTP,100.00,COMBINED,100.0
4778,Enterobacter hormaechei,HCR0194552.1,SAMN31246612,PDT001469959.1,DAJRKR010000161.1,42,674,+,GCA_025899025.1,catB3,type B-3 chloramphenicol O-acetyltransferase C...,AMR,core,AMR,PHENICOL,CHLORAMPHENICOL,EXACTP,100.00,COMBINED,100.0
4779,Enterobacter hormaechei,HCR0194554.1,SAMN31246612,PDT001469959.1,DAJRKR010000163.1,26,580,+,GCA_025899025.1,aac(6')-Ib-cr5,fluoroquinolone-acetylating aminoglycoside 6'-...,AMR,core,AMR,AMINOGLYCOSIDE/QUINOLONE,AMIKACIN/KANAMYCIN/QUINOLONE/TOBRAMYCIN,ALLELEP,100.00,COMBINED,100.0
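
A quick sanity check on the combined table is to count the unique assemblies per species and compare against the 15-genome limit. The sketch below uses a small made-up frame with the same two column names (`#Scientific name`, `Assembly`); on the real data it would be run on `combined_AMR_metadata`. Note that a file covering a whole group (such as the other Enterobacterales) may spread its genomes over several species names, so per-species counts there can be lower than 15.

```python
import pandas as pd

# Made-up stand-in for combined_AMR_metadata (accessions are placeholders).
toy = pd.DataFrame({
    "#Scientific name": ["Enterococcus faecium"] * 3 + ["Enterobacter hormaechei"] * 2,
    "Assembly": ["GCA_A", "GCA_A", "GCA_B", "GCA_C", "GCA_C"],
})

# Number of distinct assemblies (genomes) contributing AMR rows for each species.
genomes_per_species = toy.groupby("#Scientific name")["Assembly"].nunique()
print(genomes_per_species)
```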


### Now let's get the list of genome assemblies we have to download

In [28]:
assemblies = combined_AMR_metadata['Assembly'].unique().tolist()
print(assemblies)

# Save the list to a file
with open('AMR_metadata/unique_assemblies.txt', 'w') as f:
    for element in assemblies:
        f.write(f"{element}\n")

['GCA_018086425.1', 'GCA_015422305.1', 'GCA_037019775.1', 'GCA_015460685.1', 'GCA_015456265.1', 'GCA_015424425.1', 'GCA_018084525.1', 'GCA_015345495.1', 'GCA_017262775.1', 'GCA_015422045.1', 'GCA_030925715.1', 'GCA_019032785.1', 'GCA_015455185.1', 'GCA_015395665.1', 'GCA_017700385.1', 'GCA_021010185.1', 'GCA_028832425.1', 'GCA_028832465.1', 'GCA_027190905.1', 'GCA_028831535.1', 'GCA_028831345.1', 'GCA_028832365.1', 'GCA_027106235.1', 'GCA_028836565.1', 'GCA_028832385.1', 'GCA_028831305.1', 'GCA_021009465.1', 'GCA_027756525.1', 'GCA_031904085.1', 'GCA_029690545.1', 'GCA_022315655.1', 'GCA_022315595.1', 'GCA_022332625.1', 'GCA_023024485.1', 'GCA_022305095.1', 'GCA_022152465.1', 'GCA_022300495.1', 'GCA_022289835.1', 'GCA_022302065.1', 'GCA_022329305.1', 'GCA_021942045.1', 'GCA_022110415.1', 'GCA_022328325.1', 'GCA_031017585.1', 'GCA_022122845.1', 'GCA_016514445.1', 'GCA_021484905.1', 'GCA_022442445.3', 'GCA_025897775.1', 'GCA_031152015.2', 'GCA_016514295.1', 'GCA_016513995.1', 'GCA_016497

In [40]:
! head -10 AMR_metadata/unique_assemblies.txt

GCA_018086425.1
GCA_015422305.1
GCA_037019775.1
GCA_015460685.1
GCA_015456265.1
GCA_015424425.1
GCA_018084525.1
GCA_015345495.1
GCA_017262775.1
GCA_015422045.1


### Downloading the above list of genomes using ncbi-genome-download

In [25]:
conda install bioconda::ncbi-genome-download


Channels:
 - defaults
 - bioconda
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/maaly7/anaconda3/envs/general

  added / updated specs:
    - bioconda::ncbi-genome-download


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    brotli-python-1.0.9        |  py311h6a678d5_8         358 KB
    certifi-2024.8.30          |  py311h06a4308_0         163 KB
    charset-normalizer-3.3.2   |     pyhd3eb1b0_0          44 KB
    idna-3.7                   |  py311h06a4308_0         133 KB
    ncbi-genome-download-0.3.3 |     pyh7cba7a3_0          29 KB  bioconda
    pysocks-1.7.1              |  py311h06a4308_0          35 KB
    requests-2.32.3            |  py311h06a4308_0         126 KB
    tqdm-4.66.5                |  py311h92b7b1e_0         163 KB
    urllib3-2.2.2              |  py311h06a4308_0  

In [29]:
! ncbi-genome-download --help

usage: ncbi-genome-download [-h] [-s {refseq,genbank}] [-F FILE_FORMATS]
                            [-l ASSEMBLY_LEVELS] [-g GENERA] [--genus GENERA]
                            [--fuzzy-genus] [-S STRAINS] [-T SPECIES_TAXIDS]
                            [-t TAXIDS] [-A ASSEMBLY_ACCESSIONS]
                            [--fuzzy-accessions] [-R REFSEQ_CATEGORIES]
                            [--refseq-category REFSEQ_CATEGORIES] [-o OUTPUT]
                            [--flat-output] [-H] [-P] [-u URI] [-p N] [-r N]
                            [-m METADATA_TABLE] [-n] [-N] [-v] [-d] [-V]
                            [-M TYPE_MATERIALS]
                            groups

positional arguments:
  groups                The NCBI taxonomic groups to download (default: all).
                        A comma-separated list of taxonomic groups is also
                        possible. For example: "bacteria,viral"Choose from:
                        ['all', 'archaea', 'bacteria', 'fungi',
        

In [41]:
! ncbi-genome-download -s genbank -F 'cds-fasta' -A AMR_metadata/unique_assemblies.txt -o genome_assemblies/ --flat-output -v bacteria

INFO: Using cached summary.
INFO: Checking record 'GCA_021665835.3'
INFO: Checking record 'GCA_021666615.3'
INFO: Checking record 'GCA_021666635.3'
INFO: Checking record 'GCA_021666655.3'
INFO: Checking record 'GCA_021666675.3'
INFO: Checking record 'GCA_021674535.1'
INFO: Checking record 'GCA_021674595.1'
INFO: Checking record 'GCA_023951155.1'
INFO: Checking record 'GCA_023994755.1'
INFO: Checking record 'GCA_023995295.1'
INFO: Checking record 'GCA_024961505.3'
INFO: Checking record 'GCA_025921805.1'
INFO: Checking record 'GCA_028640045.3'
INFO: Checking record 'GCA_029622995.1'
INFO: Checking record 'GCA_039684275.1'
INFO: Checking record 'GCA_016497045.1'
INFO: Checking record 'GCA_016497185.1'
INFO: Checking record 'GCA_016513995.1'
INFO: Checking record 'GCA_016514295.1'
INFO: Checking record 'GCA_016514445.1'
INFO: Checking record 'GCA_021484905.1'
INFO: Checking record 'GCA_022442445.3'
INFO: Checking record 'GCA_023783775.1'
INFO: Checking record 'GCA_023825195.1'
INFO: Checki

In [34]:
! ls genome_assemblies/

GCA_018086425.1_PDT000701543.1_cds_from_genomic.fna.gz
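
Since downloads can fail silently for individual records, it may help to compare the output directory against the accession list. This is a minimal sketch, assuming every downloaded file name starts with the accession followed by an underscore, as in the file listed above. The helper name `missing_accessions` is hypothetical, and the demo runs in a temporary directory rather than on the real folders.

```python
import os
import tempfile

def missing_accessions(list_path, download_dir):
    """Return accessions from list_path that have no file in download_dir."""
    with open(list_path) as handle:
        wanted = [line.strip() for line in handle if line.strip()]
    downloaded = os.listdir(download_dir)
    # Assumption: each downloaded file name starts with "<accession>_".
    return [acc for acc in wanted
            if not any(name.startswith(acc + "_") for name in downloaded)]

# Self-contained demo: two requested accessions, one downloaded file.
with tempfile.TemporaryDirectory() as tmp:
    list_path = os.path.join(tmp, "unique_assemblies.txt")
    with open(list_path, "w") as handle:
        handle.write("GCA_018086425.1\nGCA_015422305.1\n")
    download_dir = os.path.join(tmp, "genome_assemblies")
    os.mkdir(download_dir)
    fname = "GCA_018086425.1_PDT000701543.1_cds_from_genomic.fna.gz"
    open(os.path.join(download_dir, fname), "w").close()
    missing = missing_accessions(list_path, download_dir)

print(missing)
```

On the real data the call would be `missing_accessions('AMR_metadata/unique_assemblies.txt', 'genome_assemblies/')`.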


In [45]:
! gzip -d genome_assemblies/*.gz


In [36]:
! head -20 genome_assemblies/GCA_018086425.1_PDT000701543.1_cds_from_genomic.fna

>lcl|DADXWL010000001.1_cds_HBC2782886.1_1 [locus_tag=IX163_000001] [protein=arsenate reductase] [protein_id=HBC2782886.1] [location=complement(606..962)] [gbkey=CDS]
GTGGAAGTGAAAGTCTATGGATCAGCGCACTGTGTAACAAGTCTGAAGGCGAAACAATGGTTAGAACACCATCAATTGCC
TTATCAATTTGTTGACTTAGAAGAACGCGATCTAAGTACAGAAGAAGCAGAAGAATTATGCAGTTTTGAAGAAGTGCAAG
CAGAACAGCTATTTGCGACTTGGTCTGAGGCTTTTCACAACATTGCGATTGACCTTTCCTGTGTAGGAAAAAAACAACTG
ATTTTTTTATGCACACAACATACCAATCTCCTCAGACGTCCTTTGATTATTATCGATGAAGTTCTATTTATTGGATACAA
TGAGCAGTTGATGGAACAGCTTATCTTAAATAAATAG
>lcl|DADXWL010000001.1_cds_HBC2782887.1_2 [gene=spx] [locus_tag=IX163_000002] [protein=transcriptional regulator Spx] [protein_id=HBC2782887.1] [location=complement(976..1386)] [gbkey=CDS]
ATGATAAAAGTATATACTTCGCCTAGCTGTACTTCCTGTAGAAAAGCAAGAGCATGGTTGCAAGCGAATCAACTTGAATT
TGAAGAGAAGAATATTTTTGCCGAACCGTTAACAGAAAGTGAGATCAAAAAGATCTTGGCAGCGACTGAAGGAGGAGTTC
AAGATATTATTTCCACTCGGTCGAAAATTTACGAAAAGCTGGGCATAGATTTTAACGAATTAACTTTAAAGCAAATGATC
GTCTTATTCAAAGAATATCCTTCATTATTGAGACGACCTA
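
Each header line carries bracketed `key=value` fields (`protein_id`, `locus_tag`, `location`, and so on). A small regular expression is enough to turn them into a dictionary. This is a sketch only: the helper name `parse_cds_description` is hypothetical, and the pattern assumes field values never contain a closing bracket, which may not hold for every protein name.

```python
import re

# One "[key=value]" field; the value runs up to the next closing bracket.
FIELD_RE = re.compile(r"\[([^=\]]+)=([^\]]*)\]")

def parse_cds_description(description):
    """Return the bracketed key=value fields of a CDS FASTA header as a dict."""
    return dict(FIELD_RE.findall(description))

# Header copied from the file preview above.
header = (
    "lcl|DADXWL010000001.1_cds_HBC2782887.1_2 [gene=spx] "
    "[locus_tag=IX163_000002] [protein=transcriptional regulator Spx] "
    "[protein_id=HBC2782887.1] [location=complement(976..1386)] [gbkey=CDS]"
)
fields = parse_cds_description(header)
print(fields["protein_id"], fields["location"])
```

With Biopython, the full header line is available as `record.description`, whereas `record.id` holds only the part before the first space.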

## Extracting AMR gene sequences and non-AMR gene sequences

In [46]:
from Bio import SeqIO
import os

In [47]:
def create_cds_dataframe(fasta_dir):
    """Creates a DataFrame from all CDS FASTA files in a directory."""
    
    data = []
    
    # Iterate over all files in the directory
    for filename in os.listdir(fasta_dir):
        # Check if the file ends with '.fna'
        if filename.endswith(".fna"):
            filepath = os.path.join(fasta_dir, filename)
            
            # Parse the FASTA file and extract headers and sequences
            for record in SeqIO.parse(filepath, "fasta"):
                data.append({"header": record.id, "sequence": str(record.seq)})
    
    return pd.DataFrame(data)


In [48]:
fasta_dir = "genome_assemblies/"  

# Create the DataFrame from the FASTA files
all_df = create_cds_dataframe(fasta_dir)

print(all_df)

                                             header  \
0          lcl|DADXSS010000001.1_cds_HBC2407390.1_1   
1          lcl|DADXSS010000001.1_cds_HBC2407391.1_2   
2          lcl|DADXSS010000001.1_cds_HBC2407392.1_3   
3          lcl|DADXSS010000001.1_cds_HBC2407393.1_4   
4          lcl|DADXSS010000001.1_cds_HBC2407394.1_5   
...                                             ...   
418304  lcl|DACIGJ010000248.1_cds_HAQ8610755.1_2883   
418305  lcl|DACIGJ010000249.1_cds_HAQ8610756.1_2884   
418306  lcl|DACIGJ010000250.1_cds_HAQ8610757.1_2885   
418307  lcl|DACIGJ010000251.1_cds_HAQ8610758.1_2886   
418308  lcl|DACIGJ010000252.1_cds_HAQ8610759.1_2887   

                                                 sequence  
0       ATGAAAGAAATCAGACGAAAACATTTCCGTTTTCCTGCGTACGATG...  
1       ATGCCCACTTTAGCAGATGTTGCAAAACGTGCCAATGTTTCAAAAA...  
2       TTGAATCCAATTATATTGAAACGTTCTGTCCATCATATTCCAGTGT...  
3       ATGAAAACGGTCAATCGTGTGTTTTGTGCCAGCCACGTGCTAAGTC...  
4       ATGATTGATCCAAAAATGTTGGGGATGATCT

In [58]:
all_df['header'][1]  # the header contains the protein ID

'lcl|DADXSS010000001.1_cds_HBC2407391.1_2'
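
Judging from the headers shown so far, the ID follows the pattern `lcl|<contig>_cds_<protein_id>_<counter>`. Under that assumption the protein ID can be pulled out explicitly instead of being searched for as a substring. The function name `protein_id_from_header` is hypothetical, and the second example header is made up to show that underscores inside the ID (as in `WP_` accessions) are handled.

```python
import re

# Greedy match after "_cds_", stopping before the trailing "_<counter>".
CDS_ID_RE = re.compile(r"_cds_(?P<protein_id>.+)_\d+$")

def protein_id_from_header(header):
    """Return the protein ID embedded in a CDS record ID, or None if absent."""
    match = CDS_ID_RE.search(header)
    return match.group("protein_id") if match else None

print(protein_id_from_header("lcl|DADXSS010000001.1_cds_HBC2407391.1_2"))
# Illustrative header with an underscore inside the protein ID.
print(protein_id_from_header("lcl|NZ_CP144272.1_cds_WP_001574277.1_12"))
```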

In [61]:
# Collect the AMR protein IDs, dropping rows that have no protein ID (NaN)
protein_ids = combined_AMR_metadata['Protein'].dropna().tolist()
len(protein_ids)

4639

### Get the AMR vs. non-AMR sequences by checking the Protein IDs in both

In [62]:
# For each AMR protein ID, collect the CDS rows whose header contains it
matches = []
for protein_id in protein_ids:
    # Plain substring match (regex=False) against the 'header' column
    matches.append(all_df[all_df['header'].str.contains(protein_id, regex=False)])

# Concatenate once at the end instead of growing amr_df inside the loop
amr_df = pd.concat(matches)

# Rows that matched no protein ID are the non-AMR sequences
non_amr_df = all_df[~all_df.index.isin(amr_df.index)]
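
The loop above scans every header once per protein ID, which gets slow with about 418,000 CDS rows and several thousand IDs. A possible alternative, sketched here under the assumption that headers follow `..._cds_<protein_id>_<counter>`, is to extract the ID once and test set membership. Unlike the loop, this matches IDs exactly rather than as substrings, keeps rows in `all_df` order, and never duplicates a row. The function name `split_amr` and the toy rows are illustrative only.

```python
import pandas as pd

def split_amr(all_df, protein_ids):
    """Split CDS rows into AMR and non-AMR by exact protein-ID membership."""
    # Headers without a protein ID yield NaN, which isin() treats as not found.
    extracted = all_df["header"].str.extract(r"_cds_(.+)_\d+$", expand=False)
    wanted = set(protein_ids)
    is_amr = extracted.isin(wanted)
    # IDs listed in the metadata that were not found in any downloaded CDS file.
    unmatched = sorted(wanted - set(extracted.dropna()))
    return all_df[is_amr], all_df[~is_amr], unmatched

# Toy data: the last header and the last ID are made up for the demo.
toy_all = pd.DataFrame({
    "header": [
        "lcl|DADXWL010000002.1_cds_HBC2783091.1_212",
        "lcl|DADXSS010000001.1_cds_HBC2407390.1_1",
        "lcl|NZ_CP144272.1_cds_WP_001574277.1_7",
    ],
    "sequence": ["ATGATAATC", "ATGAAAGAA", "ATGAATTCA"],
})
toy_ids = ["HBC2783091.1", "WP_001574277.1", "HAQ0000000.1"]
amr, non_amr, unmatched = split_amr(toy_all, toy_ids)
print(len(amr), len(non_amr), unmatched)
```

The `unmatched` list is a cheap way to spot AMR proteins whose assemblies failed to download.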


In [63]:
amr_df

Unnamed: 0,header,sequence
324991,lcl|DADXWL010000002.1_cds_HBC2783091.1_212,ATGATAATCAGTGAATTTGACCGTAATAATCCAGTATTGAAAGATC...
281181,lcl|DACGHB010000003.1_cds_HAQ4543115.1_254,ATGAAAGTTTCTTTGATTGCTGCGATGGATAAGAATAGAGTGATTG...
242199,lcl|DACIRG010000007.1_cds_HAQ9466659.1_528,ATGAAAGTTTCTTTGATTGCTGCGATGGATAAGAATAGAGTGATTG...
148208,lcl|DACIIG010000002.1_cds_HAQ8784092.1_240,ATGAAAGTTTCTTTGATTGCTGCGATGGATAAGAATAGAGTGATTG...
299547,lcl|DACGLS010000004.1_cds_HAQ4897849.1_242,ATGAAAGTTTCTTTGATTGCTGCGATGGATAAGAATAGAGTGATTG...
...,...,...
48581,lcl|DAJRKR010000159.1_cds_HCR0194550.1_4885,ATGGTGACGGTGTTCGGCATTCTGAATCTCACCGAGGACTCCTTCT...
48582,lcl|DAJRKR010000160.1_cds_HCR0194551.1_4886,ATGGTAAAAGATTGGATTCCCATCTCTCATGATAATTACAAGCAGG...
48583,lcl|DAJRKR010000161.1_cds_HCR0194552.1_4887,ATGACCAACTACTTTGATAGCCCCTTCAAAGGCAAGCTGCTTTCTG...
48585,lcl|DAJRKR010000163.1_cds_HCR0194554.1_4889,GTGACCAACAGCAACGATTCCGTCACACTGCGCCTCATGACTGAGC...


In [64]:
non_amr_df

Unnamed: 0,header,sequence
0,lcl|DADXSS010000001.1_cds_HBC2407390.1_1,ATGAAAGAAATCAGACGAAAACATTTCCGTTTTCCTGCGTACGATG...
1,lcl|DADXSS010000001.1_cds_HBC2407391.1_2,ATGCCCACTTTAGCAGATGTTGCAAAACGTGCCAATGTTTCAAAAA...
2,lcl|DADXSS010000001.1_cds_HBC2407392.1_3,TTGAATCCAATTATATTGAAACGTTCTGTCCATCATATTCCAGTGT...
3,lcl|DADXSS010000001.1_cds_HBC2407393.1_4,ATGAAAACGGTCAATCGTGTGTTTTGTGCCAGCCACGTGCTAAGTC...
4,lcl|DADXSS010000001.1_cds_HBC2407394.1_5,ATGATTGATCCAAAAATGTTGGGGATGATCTTCCTCATTAATTTTG...
...,...,...
418300,lcl|DACIGJ010000245.1_cds_HAQ8610751.1_2879,ATGGTTTTTGATATTGATAATTTAAAAGGATTTCTTAATGATACCA...
418301,lcl|DACIGJ010000245.1_cds_HAQ8610752.1_2880,ATGAATATAGTTGAAAATGAAATATGTATAAGAACTTTAATAGATG...
418302,lcl|DACIGJ010000246.1_cds_HAQ8610753.1_2881,ATGTCTAAAATCGAAATAAAAAATCTGACATTCGGCTATGACAGCC...
418303,lcl|DACIGJ010000247.1_cds_HAQ8610754.1_2882,ATGTCTAAAATCGAAATAAAAAATCTGACATTCGGCTACGACAGCC...


In [7]:
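from classes.genes import genes
# Use a separate name for the instance so the imported class is not shadowed
# (re-running `genes = genes()` would fail because the instance is not callable)
gene_loader = genes()

### GENERATE DATAFRAME FROM ASSEMBLY ACCESSION

In [5]:
wg_accs = ['KQ970957.1']

In [6]:
genes_df = gene_loader.genbank(wg_accs)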
genes_df

KQ970957.1


Unnamed: 0,accession,id,locus_tag,gene,product,protein_id,translation,text
0,KQ970957.1,0,HMPREF2530_00001,,TonB-dependent receptor plug domain protein,KXU51435.1,TYKIVDNQIVVSTAAVTTNNVQVTQQQKQRKVSGIIKDTTGEPVIG...,T Y K I V D N Q I V V S T A A V T T N N V Q V ...
1,KQ970957.1,1,HMPREF2530_00002,,SusD family protein,KXU51436.1,MKKNYLYIFATALVVGFSSCSDFLDRYPQEELSDGSFWKTPDDANK...,M K K N Y L Y I F A T A L V V G F S S C S D F ...
2,KQ970957.1,2,HMPREF2530_00003,,hypothetical protein,KXU51437.1,MGAFYLECPHFTYYLPPILMKRLPYLLFALLFIIYFLCYQGVLSHV...,M G A F Y L E C P H F T Y Y L P P I L M K R L ...
3,KQ970957.1,3,HMPREF2530_00004,,WD40-like protein,KXU51438.1,MKTIYILLITVLSWSLQACTAQCGKPDACTDSIPRIYPDYAGVTFP...,M K T I Y I L L I T V L S W S L Q A C T A Q C ...
4,KQ970957.1,4,HMPREF2530_00005,,glycosyl hydrolase family 3 protein,KXU51439.1,MKNIFLTMSLGIGLLFPCKLHAQSQYPFQNTTLSTEERVDDLIKRM...,M K N I F L T M S L G I G L L F P C K L H A Q ...
5,KQ970957.1,5,HMPREF2530_00006,,"transcriptional regulator, AraC family",KXU51440.1,MNLISMAKQKDGFLGEQALVLPPAIVQRMKTDPATSILYITDIGYY...,M N L I S M A K Q K D G F L G E Q A L V L P P ...
6,KQ970957.1,6,HMPREF2530_00007,,"MFS transporter, SP family",KXU51441.1,MKQTRGYLLLICIVSAMGGLLFGYDWVVIGGAKIFYEPYFGIENSA...,M K Q T R G Y L L L I C I V S A M G G L L F G ...
7,KQ970957.1,7,HMPREF2530_00008,,tetratricopeptide repeat protein,KXU51442.1,MKESVNVWEEDILLPTYGIGRPEKNPMFLEKRVYQGSSGVVYPYPV...,M K E S V N V W E E D I L L P T Y G I G R P E ...
8,KQ970957.1,8,HMPREF2530_00009,,"glycosyl hydrolase family 2, sugar binding dom...",KXU51443.1,MKQSFQNTRYRAILLLTGLMLAMTAMAQTLTNDALDLSGMWRFQLD...,M K Q S F Q N T R Y R A I L L L T G L M L A M ...
9,KQ970957.1,9,HMPREF2530_00010,,hypothetical protein,KXU51444.1,MPLGNATVGALVWQRDSTLRLSLDRTDLWDLRPVDSLSGDNFRFSW...,M P L G N A T V G A L V W Q R D S T L R L S L ...
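
In the output above, the `text` column appears to be the `translation` string with a space between residues. The `classes.genes` module is not shown here, so the following is only a guess at an equivalent transformation, with a hypothetical helper name.

```python
def space_separate(translation):
    """Insert a single space between residues: 'MKK' -> 'M K K'."""
    return " ".join(translation)

# First residues of the first translation shown in the table.
example = space_separate("TYKIVDNQ")
print(example)
```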
