# <span style="color: blue;">Intro</span>

The first part of the analysis runs ABRicate with both the Resfinder and CARD databases to measure how much the two differ in identifying "transmissible resistance genes".
Note that because we only look at transmissible resistance genes, the CARD and Resfinder databases should both contain all of the necessary genes.

**For two programs to identify the same thing they have to identify everything identically at the protein level**

### Setup 

**Dependencies**
* Python 3
* Biopython

**Inputs - Resistance Databases**
* resfinder_20191001 (Primary)
* card_20191023
* resfinder_20180122
* resfinder_20170126




In [1]:
# Importing modules and looking at local files

# Python inbuilt
import json
import csv
import subprocess
import os

# Extra downloaded
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Data handling
import numpy as np
import pandas as pd

# File Structure
%ls

[1m[36mcard_20191023[m[m/                      [1m[36mresfinder_20180122[m[m/
card_20191023_formatted.fasta       [1m[36mresfinder_20190122[m[m/
card_20191023_link.csv              [1m[36mresfinder_20191001[m[m/
db_comparison.csv                   [1m[36mresfinder_20191001_ami[m[m/
db_comparison_full.json             [1m[36mresfinder_20191001_blm[m[m/
[1m[36mdb_comparison_result_tarballs[m[m/      resfinder_20191001_formatted.fasta
db_related.ipynb                    resfinder_20191001_link.csv
formatting_old_resfinder_dbs.ipynb  [1m[36mresfinder_20191001_qui[m[m/
resfinder.fasta                     [1m[36mresfinder_20191001_sul[m[m/
[1m[36mresfinder_20170126[m[m/                 [1m[36mresfinder_20191001_tri[m[m/


In [2]:
# First we PARSE the databases 
# CARD
card_db = "card_20191023/card-data/card.json"
with open(card_db) as f:
    card_db = json.load(f)

    

## <span style="color: blue;">Formatting and comparing the two databases</span>


To compare CARD with Resfinder results, we aim to identify the transmissible proteins in each database.
In CARD this approximately equates to the **protein homolog model**.

This process has several steps:
1. For each database, create a CSV which contains an easily accessible set of information for comparison
- Note: to work out whether there is a translation link, we translate the DNA with translation table 11, then disregard the first residue and remove any trailing stop codon.
- If the remaining sequences match there is a link; if not, there is no link
- If there is no link the field is left blank, and each DNA sequence with no link is put next to its directly translated protein
2. As part of making a database, each element gets a new identifier so that all programs process it the same way
- e.g. cardnewid_x 
3. Match these CSVs according to protein sequence
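
The translation-link check described in step 1 can be sketched as follows. This is a minimal, standard-library-only illustration rather than the notebook's own code: `translate`, `core` and `has_translation_link` are hypothetical helper names, and the codon table is the standard genetic code, whose amino-acid assignments are shared by translation table 11 (the two tables differ only in which codons may act as start codons).

```python
from itertools import product

# Standard genetic code in NCBI order (bases T, C, A, G); "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA codon by codon; unknown codons become 'X'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore an incomplete trailing codon
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def core(protein):
    """Drop any trailing stop symbol, then the first residue."""
    return protein.rstrip("*")[1:]


def has_translation_link(dna, protein):
    """True if the DNA encodes the protein, ignoring the start residue and stop."""
    return core(translate(dna)) == core(protein)


# A GTG start is translated literally as V, while the curated protein starts with M.
print(translate("GTGAAATAA"))                   # VK*
print(has_translation_link("GTGAAATAA", "MK"))  # True
```

Ignoring the first residue is what allows an alternative start codon (translated literally as V or L) to match a curated protein that begins with M.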




#### The card CSV will have the following headers

1. prot_seq
2. dna_seq
3. card_name
4. card_newname
5. aro_id
6. translation_link

#### The resfinder CSV will have the following headers

1. prot_seq
2. dna_seq
3. resfinder_name
4. resfinder_newname


#### DATABASE QUALITY CONTROL

For each database, we need to create a unique identifier for each element.
This is based on its DNA sequence only.
In effect this removes any sequence which has the same DNA sequence as another entry but a different name in a given database.

e.g. suppose the sequence AACTTGCTA was called both gene1 and gene2; in the formatted database it will only appear once, as newgene1

Duplicate names for different sequences will be assigned new names. The aim of this approach is to keep the databases as close to their original format as possible, while making up-to-date databases readable for each of the 4 programs
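
As a rough sketch of this rule (not the notebook's own code), the function below gives each distinct DNA sequence a single new identifier. The name `assign_new_ids` and the `newgene` prefix are hypothetical, and unlike the cells further down, which keep only the first name seen, this version records every original name so the collapsed duplicates stay visible.

```python
def assign_new_ids(records, prefix="newgene"):
    """Give each distinct DNA sequence one new identifier.

    records: iterable of (original_name, dna_sequence) pairs.
    Returns two dictionaries: new id -> sequence, and new id -> original names.
    """
    seq_to_id = {}
    names_by_id = {}
    for name, seq in records:
        if seq not in seq_to_id:
            # First time this exact sequence is seen: mint the next identifier
            seq_to_id[seq] = "{0}{1}".format(prefix, len(seq_to_id) + 1)
        names_by_id.setdefault(seq_to_id[seq], []).append(name)
    id_to_seq = {new_id: seq for seq, new_id in seq_to_id.items()}
    return id_to_seq, names_by_id


records = [
    ("gene1", "AACTTGCTA"),
    ("gene2", "AACTTGCTA"),  # same sequence, different name
    ("gene3", "AACTTGCTT"),
]
id_to_seq, names_by_id = assign_new_ids(records)
print(id_to_seq)    # {'newgene1': 'AACTTGCTA', 'newgene2': 'AACTTGCTT'}
print(names_by_id)  # {'newgene1': ['gene1', 'gene2'], 'newgene2': ['gene3']}
```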

##### Resfinder 1st October 2019 release

**Note the Resfinder database is produced by concatenating each of the antibiotic-specific databases. This results in some duplicates, which we remove (i.e. genes which affect more than one type of antibiotic, for example the quinolone resistance-causing aac variant)**

###### Removed variants below

The duplicate names removed using this method are:
1. "blaOXA-347_1_JN086160", same sequence as "blaOXA-347_1_ACWG01000053" (2 refs)
2. "blaZ_129_CP003194", same sequence as "blaZ_125_CP003194" (curation)
3. "blaIMP-58_1_KU647281", same sequence as "blaIMP-58_1_KU647281" (duplicate)
4. "blaCTX-M-63_1_AB205197", same sequence as "blaCTX-M-63_1_EU660216" (2 refs)
5. "blaCMY-110_1_AB872957", same sequence as "blaCMY-110_1_AB872957" (duplicate)
6. "blaCMY-104_1_KF150216", same sequence as "blaCMY-104_1_KF150216" (duplicate)
7. "blaACC-4_2_EF504260", same sequence as "blaACC-4_1_GU256641" (2 refs)
8. "blaSHV-36_1_AF467947", same sequence as "blaSHV-36_1_AF467947" (duplicate)
9. "blaOXA-60_1_AF525303", same sequence as "blaOXA-60_1_AF525303" (duplicate)
10. "blaFRI-1_1_KT192551", same sequence as "blaFRI-1_1_KT192551" (duplicate)
11. "cfr_1_AM408573", same sequence as "cfr_1_AM408573" (duplicate)
12. "cfr_2_AJ879565", same sequence as "cfr_2_AJ879565" (duplicate)
13. "cfr(B)_3_KR610408", same sequence as "cfr(B)_3_KR610408" (duplicate)
14. "aac(6')-Ib-cr_1_DQ303918", same sequence as "aac(6')-Ib-cr_1_DQ303918" (duplicate)
15. "aac(6')-Ib-cr_2_EF636461", same sequence as "aac(6')-Ib-cr_2_EF636461" (duplicate)
16. "dfrA22_3_FM957884", same sequence as "dfrA33_1_FM957884" (curation)

##### CARD 23rd October 2019 release

This database is more complex in structure, but to produce a similar set of genes (i.e. just the transmissible genes) we only look at the genes which belong to the "protein homolog model".

In this database, there are several things to note which will be important as we compare the databases:
* Each element has at least one DNA and protein sequence
* All of the DNA elements translate to the related protein, but they differ in whether the sequence includes a stop codon and in how the start codon is translated. Some are also not in the correct frame
* One entry (Erm(44)v) has 2 sequences. This is not relevant to this study, which focusses on E. coli, whereas this gene is only seen in S. saprophyticus and encodes macrolide resistance (to which E. coli is innately resistant)
* For simplicity, we just take the DNA sequences from the CARD protein homolog models.
* Otherwise the sequences in this database are all unique (see the Resfinder duplicates above)


#### DATABASE PREPARATION CHOICES

These are the specific parameters we have chosen when preparing the databases

**ARIBA -> Non coding only**

In [38]:
# CARD

# This code makes the formatted card db ("card_20191023_formatted.fasta") 
# and produces a key for the identifiers in the fasta file "card_20191023_link.csv"
# And finally produces a dictionary which we use downstream (card_dict)

card_formatted_no = 1

card_dict = {}

card_recs = []

with open("card_20191023_link.csv", "w") as f:
    writer = csv.writer(f)
    for key in card_db.keys():
        try:
            # First we extract relevant data
            if card_db[key]['model_type'] == "protein homolog model":
                aro_id = card_db[key]['ARO_accession']
                name = card_db[key]["ARO_name"]
                for k in card_db[key]['model_sequences']['sequence'].values():
                    card_prot = k['protein_sequence']['sequence']
                    card_dna = k['dna_sequence']['sequence']
                # If an entry has more than one sequence, the loop above keeps the last one
                # Then we rename things so that each program formats them properly.
                new_id = "cardnewid_{0}" .format(card_formatted_no)
                card_formatted_no += 1
                card_seq = Seq(card_dna)
                card_rec = SeqRecord(card_seq, id=new_id, description="")
                card_recs.append(card_rec)
                # Now we want to create some stuff for a linking database to compare the two
                orig_id = name + "_ARO"+ aro_id
                card_dict[new_id] = {"orig_id":orig_id, 
                                    "dna_seq":card_dna,
                                    "prot_seq":card_prot}
                writer.writerow([orig_id, new_id, str(card_seq)])
        except (KeyError, TypeError):
            # Skips entries that are not model dictionaries (e.g. the 1 entry which just has a description of the database)
            pass

SeqIO.write(card_recs, "card_20191023_formatted.fasta", "fasta")

2617

In [39]:
# Resfinder

# Likewise for the resfinder database
# The formatted fasta "resfinder_20191001_formatted.fasta"
# link for ids "resfinder_20191001_link.csv"
# Dictionary for downstream analysis resfinder_db

# Firstly, you concatenate and read in the database

##############################################################################
# NOTE THIS NEEDS TO BE FIXED BEFORE WE PUT THIS UP FORMALLY ONLINE
# The concatenation only runs on POSIX systems; elsewhere resfinder.fasta must already exist
if os.name == "posix":
    subprocess.check_call("cat resfinder_20191001/*.fsa > resfinder.fasta", shell=True)
# Read the concatenated file from the working directory, which is where "cat" writes it
resfinder_initdb = SeqIO.parse("resfinder.fasta", "fasta")
##############################################################################

# Then you assign a new unique name to each element of the database
# Note some elements are removed (see above) to ensure each DNA sequence only has one name
# The way things are linked are put into a resfinder_20191001_link.csv file
# The final database is then written into a resfinder_20191001_formatted.fasta
resfinder_db = {}
identified_seqs = []
newid = 0
for k in resfinder_initdb:
    if str(k.seq) not in identified_seqs:
        newid += 1
        identified_seqs.append(str(k.seq))
        resfinder_db["resfindernewid_{0}" .format(newid)] = k
out_recs = []
with open("resfinder_20191001_link.csv", "w") as f:
    writer = csv.writer(f, delimiter = ",")
    for k in resfinder_db:
        writer.writerow([k, resfinder_db[k].id, str(resfinder_db[k].seq)])
        k_id = k
        k_desc = ""
        k_seq = resfinder_db[k].seq
        k_rec = SeqRecord(k_seq, id=k_id, description=k_desc)
        out_recs.append(k_rec)

SeqIO.write(out_recs, "resfinder_20191001_formatted.fasta", "fasta")

3079

## <span style="color: blue;">Next comes how to format the databases </span>

This section goes through the exact commands we use to format the databases for each program. Note that for each one we use Singularity images which contain the program. The actual locations of some scripts (e.g. the SRST2-specific ones) may vary on your system.

As an additional point, the database first needs to be formatted to meet ABRicate's structure.
The script below does this

~~~
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

base_db = SeqIO.parse("<db_file>","fasta")
abricate_recs = []

for k in base_db:
    k_seq = k.seq
    k_id = "resfinderformatted~~~{0}~~~00{1}".format(k.id, "{:04d}" .format(int(k.id.split("_")[-1])))
    k_rec = SeqRecord(k_seq, id=k_id, description="")
    abricate_recs.append(k_rec)
SeqIO.write(abricate_recs, "sequences", "fasta")
~~~
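
To make the identifier produced by the script above concrete, here is a small stand-alone sketch of the same string formatting without Biopython. The function name `abricate_header` and the `db_name` parameter are hypothetical; only the `~~~`-separated layout and the zero padding are taken from the script.

```python
def abricate_header(record_id, db_name="resfinderformatted"):
    """Build the sequence id written by the formatting script above."""
    # Numeric suffix of the formatted id, e.g. "resfindernewid_12" -> 12
    number = int(record_id.split("_")[-1])
    # "00" followed by the number zero-padded to four digits, as in the script
    return "{0}~~~{1}~~~00{2:04d}".format(db_name, record_id, number)


print(abricate_header("resfindernewid_12"))
# resfinderformatted~~~resfindernewid_12~~~000012
```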


##### Each of the following illustrates the steps required to prep the database

**preparing the ABRicate_db**

~~~
python abricate_format.py 
mkdir abricate_db
mv sequences abricate_db/
cd abricate_db
makeblastdb -in sequences -title <db_title> -dbtype nucl -hash_index
cd ..
~~~
**preparing the ARIBA db**
~~~
ariba prepareref --all_coding no -f <db_file> ariba_db
~~~

**Preparing the KmerResistance db**
**Note for KmerResistance, there is an additional file**

Note: the text below needs adding to, and should primarily be in the supplements.

The exception to this was that KmerResistance requires a “bacteria.fsa” FASTA file of all complete genomes in NCBI’s RefSeq database to filter low-coverage matches. At the time of writing, the version used by KmerResistance’s authors was not publicly available; however, we attempted to mitigate this using two alterations: replacing the file with a FASTA containing all complete genomes as identified by Centrifuge-download[14], and applying an average depth of coverage cut-off of >5x to KmerResistance results. For ABRicate (which uses BLASTn to search assemblies), assemblies were produced using SPAdes[8] run with default parameters.

~~~
mkdir kmerres_db
mv bacteria.fsa kmerres_db
cp <db_file> kmerres_db/
cd kmerres_db
kma index -i bacteria.fsa -o bacteria -Sparse ATG
mv <db_file> kmerres_fasta.fa
kma_index -i kmerres_fasta.fa -o kmerres_fasta
cd ..
~~~

**Preparing the SRST2 db**
~~~
mkdir srst2_db
cd srst2_db/
mv <db_file> rawseqs.fasta
cd-hit-est -i rawseqs.fasta -o rawseqs_cdhit90 -d 0 > rawseqs_cdhit90.stdout
python /srst2/database_clustering/cdhit_to_csv.py --cluster_file rawseqs_cdhit90.clstr --infasta rawseqs.fasta --outfile rawseqs_clustered.csv
python /srst2/database_clustering/csv_to_gene_db.py -t rawseqs_clustered.csv -o seqs_clustered.fasta -f rawseqs.fasta -c 4
cd ..
~~~

**A similar method is used for preparing the CARD and legacy Resfinder databases**





## <span style="color: blue;">Comparing the databases</span>

The next step is to compare the two databases.
The Resfinder database is now held in the resfinder_db dictionary.
The CARD database is now held in card_dict.

**To compare these two we do the following:**
1. Translate the resfinder_db sequences
2. Compare the protein sequences from the two databases [note this requires some wiggle room for the first and last amino acids, see above]
3. Give each unique protein sequence a unique id
4. Identify whether DNA variants are the same or different

**Note all elements of each database are distinct (we have removed duplicates as part of formatting)**
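
A toy walk-through of steps 2 and 3 may help before the full cell. This is a simplified sketch under stated assumptions, not the code used for the analysis: `end_variants`, `same_protein` and `group_proteins` are hypothetical names, the sequences are invented, and each protein is compared only against the first member of each group.

```python
def end_variants(protein):
    """The sequence as given, plus versions missing the first and/or last residue."""
    return {protein, protein[1:], protein[:-1], protein[1:-1]}


def same_protein(p1, p2):
    """True if the proteins match after trimming at most one residue from each end."""
    return not end_variants(p1).isdisjoint(end_variants(p2))


def group_proteins(proteins):
    """Map each entry to the id of the first earlier protein it matches."""
    representatives = {}
    membership = {}
    for name, seq in proteins.items():
        for rep_id, rep_seq in representatives.items():
            if same_protein(seq, rep_seq):
                membership[name] = rep_id
                break
        else:
            # No existing group matched, so this sequence starts a new one
            rep_id = "uprotein_{0}".format(len(representatives) + 1)
            representatives[rep_id] = seq
            membership[name] = rep_id
    return membership


toy = {
    "resfindernewid_1": "VKTAY*",  # translated DNA: literal start codon, trailing stop
    "cardnewid_1": "MKTAY",        # curated protein: M start, no stop symbol
    "cardnewid_2": "MKSAY",        # internal difference, so a different protein
}
membership = group_proteins(toy)
print(membership)
```

Because either end may be trimmed on both sequences, two proteins that differ only in their first or last residue are also treated as the same protein under this rule.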


In [80]:

# First we start with a function to compare two proteins with "wiggle room"
def compare_2_prots(s1, s2):
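    # There is some wiggle room around the first and last residues (start codon translation, trailing stop),
    # so we also compare the sequences with either or both ends trimmed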
    s1_poss = [s1,s1[1:],s1[:-1],s1[1:-1]]
    s2_poss = [s2,s2[1:],s2[:-1],s2[1:-1]]
    linked = False
    for k in s1_poss:
        if k in s2_poss:
            linked = True
    return linked

# Next lets get all of the proteins from both the CARD and Resfinder databases
resfinder_prots = {}
for k in resfinder_db:
    resfinder_prots[k] = str(resfinder_db[k].seq.translate(11))

card_prots = {}
for k in card_dict:
    card_prots[k] = card_dict[k]['prot_seq']


# Now we combine these into a list of individual proteins
prot_list = []
for k in resfinder_prots:
    seen_before  = False
    for j in prot_list:
        if compare_2_prots(resfinder_prots[k], j) == True:
            seen_before = True
    if seen_before == False:
        prot_list.append(resfinder_prots[k])
    
for k in card_prots:
    seen_before = False
    for j in prot_list:
        if compare_2_prots(card_prots[k], j) == True:
            seen_before = True
    if seen_before == False:
        prot_list.append(card_prots[k])

#Turning these into a dictionary
prots_full = {}
prot_no = 1
for k in prot_list:
    prot_id = "uprotein_{0}".format(prot_no)
    prots_full[prot_id] = k
    prot_no += 1

# We then want to make an output which combines all of these.
# We will produce two however, 
# 1 which only contains the simple names linking (and official names) (CSV) and 1 which contains all of the details about the sequence (JSON)

final_dict = {}
with open("db_comparison.csv", "w") as f:
    writer = csv.writer(f)
    writer.writerow(["prot_id", 'card_fids','card_oids', 'resfinder_fids', 'resfinder_oids'])
    for k in prots_full:
        final_dict[k] ={"card":{}, "resfinder":{}, "protein":prots_full[k]}
        k_card_ids = []
        k_resfinder_ids = []
        for j in card_prots:
            if compare_2_prots(prots_full[k], card_prots[j]) == True:
                k_card_ids.append(j)
        k_card_origid = [card_dict[j]['orig_id'] for j in k_card_ids]
        for j in k_card_ids:
            final_dict[k]['card'][j] = card_dict[j]
        for j in resfinder_prots:
            if compare_2_prots(prots_full[k], resfinder_prots[j]) == True:
                k_resfinder_ids.append(j)
        for j in k_resfinder_ids:
            final_dict[k]['resfinder'][j] = {resfinder_db[j].id:str(resfinder_db[j].seq)}
        k_resfinder_origid = [resfinder_db[j].id for j in k_resfinder_ids]
        csv_line = [k, ":".join(k_card_ids),":".join(k_card_origid), 
                   ":".join(k_resfinder_ids), ":".join(k_resfinder_origid)]
        writer.writerow(csv_line)

# NOTE FOR ME, see sublime_text on problems with 3 SHV sequences. I will currently plough on regardless 
        
with open("db_comparison_full.json", "w") as f:
    # Dump the dictionary directly; dumping the string returned by json.dumps would encode it twice
    json.dump(final_dict, f)

# Lastly given the different scope of the two databases, CARD will contain some sequences that are ubiquitous in E. coli
# We create a list of these so they can be filtered out at a later stage.

# We do this by examining the card_prevalence.txt from the card database. (note for this study we have used the 20191127 release)
# and I set a rather arbitrary threshold to determine near ubiquity
# NOTE: this was described as a 90% threshold, but the check below uses > 50; confirm which is intended

def near_ubiquitous(l):
    # True if any prevalence value in the list exceeds the cut-off
    return any(k > 50 for k in l)

# so we go through all the sequences in the card protein homolog database and then check whether they are nearly ubiquitously found in E. coli
# card exclusion list and then a human interpretable version
nu_ecoli_card = []
nu_ecoli_card_hi = []
# We will also check we won't remove any Resfinder ones as its our default method
nu_ecoli_rf = []

# Reloading the data
db_comparison = pd.read_csv("db_comparison.csv").fillna("")
card_prev = pd.read_csv("card_prevalence_20191127.txt", delimiter = "\t", index_col = 0)
card_prev = card_prev.loc[card_prev.Pathogen == "Escherichia coli"]
for i in range(len(db_comparison)):
    if db_comparison.iloc[i].card_fids != "":
        i_card_id = db_comparison.iloc[i].card_fids
        i_rf_id = db_comparison.iloc[i].resfinder_fids
        i_name = db_comparison.iloc[i].card_oids.split("_")[0]
        i_aro = "ARO:" + db_comparison.iloc[i].card_oids.split("_")[-1].lstrip("ARO")
        try:
            i_prev = card_prev.loc[[i_aro]]
            i_ncbi_wgs = list(i_prev['NCBI WGS'])
            if near_ubiquitous(i_ncbi_wgs) == True:
                nu_ecoli_card.append(i_card_id)
                nu_ecoli_card_hi.append(i_name)
                if i_rf_id != "":
                    for j in i_rf_id.split(":"):
                        nu_ecoli_rf.append(j)
        except KeyError:
            pass

# having generated these lists 
# We check the  resfinder list is empty (NOTE AS WE GO THROUGH THE DB COMPARISON DATABASE IT WILL REMOVE ANY RESFINDER SEQUENCES WITH THE SAME PROTEIN)
assert nu_ecoli_rf == []
# then print the card exclusions
print("***** EXCLUDED NEARLY UBIQUITOUS SEQUENCES ******")
print(nu_ecoli_card_hi)



***** EXCLUDED NEARLY UBIQUITOUS SEQUENCES ******
['mdtP', 'gadX', 'mdtF', 'cpxA', 'mdtH', 'mdtA', 'AcrS', 'emrY', 'mdtG', 'emrK', 'eptA', 'mdtB', 'TolC', 'CRP', 'ugd', 'baeS', 'evgA', 'acrB', 'mdtM', 'H-NS', 'mdtC', 'evgS', 'emrR', 'baeR', 'acrD', 'AcrF', 'mdtN', 'AcrE', 'mphB', 'emrA', 'bacA', 'emrB', 'mdtE', 'marA', 'mdtO', 'Escherichia coli ampC1 beta-lactamase', 'PmrF', 'Escherichia coli ampH beta-lactamase', 'kdpE', 'msbA', 'YojI', 'Escherichia coli emrE', 'Escherichia coli acrA', 'Escherichia coli mdfA', 'Escherichia coli ampC beta-lactamase', 'Klebsiella pneumoniae KpnE', 'Klebsiella pneumoniae KpnF']


## <span style="color: blue;">Comparing Resfinder and CARD output</span>

Note I will only investigate differences between the two that I encounter in my data,
so the next step is to compare the output between CARD and Resfinder.

This bit needs to be updated given the figure I intend to produce (possibly comparing the full databases).


In [75]:
#############################################################
#               LOADING THE DATA                            #
#############################################################

# Note during the covid outbreak I have created a temporary file structure to enable analysis of files
# This option (if true) enables using it
temp = False
# THIS CHOICE WILL BE DELETED PRIOR TO PRE_PRINT publication

result_dict = {}

linking_csv = pd.read_csv("db_comparison.csv", index_col=0).fillna("")
card_linkdict = {}
card_names = {}
for k in linking_csv.index:
    for j in range(len(linking_csv.loc[k].card_fids.split(":"))):
        j_fid = linking_csv.loc[k].card_fids.split(":")[j]
        j_oid = linking_csv.loc[k].card_oids.split(":")[j]        
        card_linkdict[j_fid] = k
        card_names[k] = j_oid
        
resfinder_linkdict = {}
resfinder_names = {}
for k in linking_csv.index:
    for j in range(len(linking_csv.loc[k].resfinder_fids.split(":"))):
        j_fid = linking_csv.loc[k].resfinder_fids.split(":")[j]
        j_oid = linking_csv.loc[k].resfinder_oids.split(":")[j]
        resfinder_linkdict[j_fid]=k
        resfinder_names[k]=j_oid

card_ftranslate = pd.read_csv("card_20191023/card_20191023_link.csv", header=None, index_col = 1)
card_ftranslate = {k: card_ftranslate.loc[k][0] for k in card_ftranslate.index}
card_btranslate = pd.read_csv("card_20191023/card_20191023_link.csv", header=None, index_col = 0)
card_btranslate = {k: card_btranslate.loc[k][1] for k in card_btranslate.index}
resfinder_ftranslate = pd.read_csv("resfinder_20191001/resfinder_20191001_link.csv", header=None, index_col = 1)
resfinder_ftranslate = {k: resfinder_ftranslate.loc[k][0] for k in resfinder_ftranslate.index}
resfinder_btranslate = pd.read_csv("resfinder_20191001/resfinder_20191001_link.csv", header=None, index_col = 0)
resfinder_btranslate = {k: resfinder_btranslate.loc[k][1] for k in resfinder_btranslate.index}

card_dir = "db_comparison_result_tarballs/card_results/"
card_reports = [os.path.join(root, f) for root, dirs, files in os.walk(card_dir) for f in files]
rf_dir = "db_comparison_result_tarballs/resfinder_results/"
rf_reports = [os.path.join(root, f) for root, dirs, files in os.walk(rf_dir) for f in files]
guuids = [k.split("/")[-1].split("_")[0] for k in rf_reports if "summary" not in k]
result_dict = {k:{"card":"", "resfinder":""} for k in guuids}
for k in result_dict:
    card_fl = [f for f in card_reports if k in f]
    assert len(card_fl) == 1
    card_fl = card_fl[0]
    resfinder_fl = [f for f in rf_reports if k in f]
    assert len(resfinder_fl) == 1 
    resfinder_fl = resfinder_fl[0]
    result_dict[k]["card"] = card_fl
    result_dict[k]["resfinder"] = resfinder_fl

### DELETE THIS SECTION FOR pre-print publication

# this code essentially produces a dictionary where keys are guuids and then within that 
# there is a card and a resfinder reports

if temp == True:
    card_dir = "../../../result_tarballs/pl_comp_data/db_comparison_result_tarballs/card_results"
    card_reports = [os.path.join(root, f) for root, dirs, files in os.walk(card_dir) for f in files]
    rf_dir = "../../../result_tarballs/pl_comp_data/db_comparison_result_tarballs/resfinder_results"
    rf_reports = [os.path.join(root, f) for root, dirs, files in os.walk(rf_dir) for f in files]
    guuids = [k.split("\\")[-1].split("_")[0] for k in rf_reports if "summary" not in k]
    result_dict = {k:{"card":"", "resfinder":""} for k in guuids}
    for k in result_dict:
        card_fl = [f for f in card_reports if k in f]
        assert len(card_fl) == 1
        card_fl = card_fl[0]
        resfinder_fl = [f for f in rf_reports if k in f]
        assert len(resfinder_fl) == 1 
        resfinder_fl = resfinder_fl[0]
        result_dict[k]["card"] = card_fl
        result_dict[k]["resfinder"] = resfinder_fl

    
### END OF DELETE


In [84]:
class isolate:
    
    def __init__(self, guuid):
        self.guuid = guuid
        self.resfinder_fl = pd.read_csv(result_dict[self.guuid]["resfinder"], sep="\t")
        self.resfinder_trgs = sorted(list(set([resfinder_linkdict[k] for k in list(self.resfinder_fl["GENE"])])))
        self.card_fl = pd.read_csv(result_dict[self.guuid]["card"], sep="\t")
        self.card_trgs = sorted(list(set([card_linkdict[k] for k in list(self.card_fl["GENE"]) 
                                          if k not in nu_ecoli_card])))
        self.resfinder_otrgs = [resfinder_names[k] for k in self.resfinder_trgs]
        self.card_otrgs = [card_names[k] for k in self.card_trgs]


out_genes = []
for g in guuids:
    x = isolate(g)
    for k in x.resfinder_otrgs:
        out_genes.append(k)
x = sorted(list(set(out_genes)))

In [85]:
### So there is a major problem with comparing the two databases 
# Difference of scope (e.g. CARD has only 1 variant per protein and includes chromosomal mechanisms of resistance)
# Variable inclusion of different genes/variants

# This makes the two very difficult to compare, and consequently I'm planning on taking a more manual approach.
# Essentially I will instead look at "of the Resfinder genes, which ones" are also found using CARD
# And matching by the NAMES alone!


for k in x:
    print(k)

ARR-3_4_FM207631
aac(3)-IIa_1_X51534
aac(3)-IId_1_EU022314
aac(3)-IVa_1_X01385
aac(3)-Ia_1_X15852
aac(3)-Ib_1_L06157
aac(3)-VIa_2_NC_009838
aac(6')-Ib-cr_1_DQ303918
aac(6')-Ib_1_M21682
aac(6')-Ib_2_M23634
aadA12_1_AY665771
aadA13_1_AY713504
aadA1_2_FJ591054
aadA1_3_JQ414041
aadA1_4_JQ480156
aadA22_1_AM261837
aadA2_1_NC_010870
aadA2_2_JQ364967
aadA3_1_AF047479
aadA4_1_Z50802
aadA5_1_AF137361
aadA8b_2_AM040708
ant(2'')-Ia_1_X04555
ant(2'')-Ia_6_AJ871915
ant(3'')-Ia_1_X02340
ant(3'')-Ii-aac(6')-IId_1_AF453998
aph(3'')-Ib_1_M28829
aph(3'')-Ib_3_AF321550
aph(3'')-Ib_4_AF313472
aph(3'')-Ib_5_AF321551
aph(3')-IIa_2_V00618
aph(3')-Ia_1_V00359
aph(3')-Ia_3_EF015636
aph(3')-Ia_4_AF498082
aph(3')-Ia_6_L05392
aph(3')-Ia_7_X62115
aph(3')-VI_1_KC170992
aph(4)-Ia_1_V01499
aph(6)-Ic_1_X01702
aph(6)-Id_1_M28829
aph(6)-Id_4_CP000971
armA_1_AY220558
blaACT-15_1_JX440356
blaCARB-2_1_M69058
blaCMY-2_1_X91840
blaCMY-37_1_AB280919
blaCMY-42_1_HM146927
blaCMY-4_1_LNHZ01000079
blaCMY-51_1_JQ733571
blaCMY-6_1_A