In [2]:
# This code decides which elements in each of the two databases are "the same"

# Required modules

# Python inbuilt
import json
import csv
import subprocess

# Extra downloaded
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord



In [3]:
# The first part of the analysis is to see what happens when ABRicate is run with both the Resfinder and CARD databases,
# to identify the extent of difference in how each database identifies "TRANSMISSIBLE RESISTANCE GENES"
# Note: because we are only looking at "TRANSMISSIBLE RESISTANCE GENES", the CARD and Resfinder dbs should both have all the necessary genes in them

# For the database comparison, we use the two following simple definitions

# 0 = Identical sequence
# 1 = Identical protein

# For two programs to identify the same thing they have to identify everything identically at the protein level
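
The two match levels defined above can be sketched as a small comparison function. This is a minimal illustration rather than the notebook's actual implementation: the function name `match_level` and the `(dna_seq, prot_seq)` tuple layout are assumptions made for the example, and the sequences are toy data.

```python
def match_level(entry_a, entry_b):
    """Return 0 for identical DNA, 1 for identical protein only, None otherwise.

    Each entry is a hypothetical (dna_seq, prot_seq) tuple of plain strings.
    """
    dna_a, prot_a = entry_a
    dna_b, prot_b = entry_b
    if dna_a == dna_b:
        return 0  # identical sequence
    if prot_a == prot_b:
        return 1  # identical protein, different DNA
    return None   # not "the same" under either definition


# Toy entries: CTT and CTC are synonymous codons (both encode leucine)
a = ("ATGCTTTAA", "ML")
b = ("ATGCTCTAA", "ML")
c = ("ATGAAATAA", "MK")
levels = [match_level(a, a), match_level(a, b), match_level(a, c)]
print(levels)
```

Checking DNA identity first means a pair is only reported at level 1 when the nucleotide sequences genuinely differ.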


In [4]:
# First we PARSE the databases
# CARD
card_db_path = "card_20191023/card-data/card.json"
with open(card_db_path) as f:
    card_db = json.load(f)

    

# Formatting and comparing the two databases


So for comparing CARD with Resfinder results, we are aiming to identify transmissible proteins from each.
In CARD this approximately equates to the **protein homolog model**.

This process has several steps
1. For each database, create a CSV which contains an easily accessible set of information for comparison
- Note: to work out whether there is a translation link, we translate the DNA with translation table 11, then disregard the first residue and remove any trailing stop codon.
- If the remaining sequence matches the database protein then there is a link; if not, there is no link
- If there is no link the field is left blank, and each DNA sequence with no link is put next to its directly translated protein
2. As part of making a database each element gets a new identifier, which is used to make sure all programs can process them the same way
- e.g. cardnewid_x
3. Match these CSVs according to protein sequence
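
The translation-link check described in step 1 can be sketched as follows. This is an illustration under stated assumptions, not the code used for the analysis: `translate` and `has_translation_link` are hypothetical helper names, and the codon table is written out by hand so the example runs without Biopython (in the notebook itself, `Seq.translate(table=11)` would be the natural choice). Translation table 11 assigns the same amino acids to codons as the standard table and differs only in which codons may act as starts, which is why the first residue is disregarded: an alternative start such as GTG translates as V even though the protein begins with M.

```python
from itertools import product

# Codon-to-amino-acid assignments shared by NCBI tables 1 and 11,
# listed in the conventional TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate in frame 1; unknown codons become X, trailing partial codons are ignored."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def has_translation_link(dna, protein):
    """Compare translation and protein, ignoring the first residue and trailing stops."""
    translated = translate(dna).rstrip("*")
    reference = protein.upper().rstrip("*")
    return translated[1:] == reference[1:]


# Toy examples: a GTG start still links to a protein annotated as starting with M
print(has_translation_link("GTGAAATAA", "MK"))
print(has_translation_link("ATGAAATAA", "ML"))
```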




#### The card CSV will have the following headers

1. prot_seq
2. dna_seq
3. card_name
4. card_newname
5. aro_id
6. translation_link

#### The resfinder CSV will have the following headers

1. prot_seq
2. dna_seq
3. resfinder_name
4. resfinder_newname
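
With both CSVs laid out as above, matching on `prot_seq` can be sketched like this. The function name `match_on_protein` and all row values are made up for illustration; only the column names come from the header lists above, and the sketch assumes each CSV starts with a header row.

```python
import csv
import io
from collections import defaultdict


def match_on_protein(card_csv, resfinder_csv):
    """Return (card_newname, resfinder_newname) pairs sharing an identical protein."""
    card_by_prot = defaultdict(list)
    for row in csv.DictReader(card_csv):
        card_by_prot[row["prot_seq"]].append(row["card_newname"])

    matches = []
    for row in csv.DictReader(resfinder_csv):
        for card_newname in card_by_prot.get(row["prot_seq"], []):
            matches.append((card_newname, row["resfinder_newname"]))
    return matches


# Toy CSV contents (not real database entries)
card_csv = io.StringIO(
    "prot_seq,dna_seq,card_name,card_newname,aro_id,translation_link\n"
    "ML,ATGCTTTAA,toyA,cardnewid_1,0000001,yes\n"
    "MK,ATGAAATAA,toyB,cardnewid_2,0000002,yes\n"
)
resfinder_csv = io.StringIO(
    "prot_seq,dna_seq,resfinder_name,resfinder_newname\n"
    "ML,ATGCTCTAA,toyA_1_X00000,resfindernewid_1\n"
    "MF,ATGTTTTAA,toyC_1_X00001,resfindernewid_2\n"
)
matches = match_on_protein(card_csv, resfinder_csv)
print(matches)
```

Indexing one database by protein first keeps the comparison linear in the number of entries rather than comparing every pair.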


#### DATABASE QUALITY CONTROL

For each database, we need to create a unique identifier for each element.
This is based on its DNA sequence only.
In effect this removes any sequence which has the same DNA sequence as another entry but a different name in a given database.

e.g. suppose the sequence AACTTGCTA was called both gene1 and gene2; in the formatted database it will only be called newgene1

Duplicate names for different sequences will be assigned new names. The aim of this approach is to keep the databases as close to their original format as possible, while making up-to-date databases readable by each of the 4 programs

**Note the Resfinder database is produced by concatenating each of the antibiotic-specific databases. This results in some duplicates, which we remove (i.e. the genes which affect more than one type of antibiotic, for example the quinolone-resistance-causing aac variant).**

###### Removed variants below

The duplicate names censored using this method are:
1. "blaOXA-347_1_JN086160", same sequence as "blaOXA-347_1_ACWG01000053" (2 refs)
2. "blaZ_129_CP003194", same sequence as "blaZ_125_CP003194" (curation)
3. "blaIMP-58_1_KU647281", same sequence as "blaIMP-58_1_KU647281" (duplicate)
4. "blaCTX-M-63_1_AB205197", same sequence as "blaCTX-M-63_1_EU660216" (2 refs)
5. "blaCMY-110_1_AB872957", same sequence as "blaCMY-110_1_AB872957" (duplicate)
6. "blaCMY-104_1_KF150216", same sequence as "blaCMY-104_1_KF150216" (duplicate)
7. "blaACC-4_2_EF504260", same sequence as "blaACC-4_1_GU256641" (2 refs)
8. "blaSHV-36_1_AF467947", same sequence as "blaSHV-36_1_AF467947" (duplicate)
9. "blaOXA-60_1_AF525303", same sequence as "blaOXA-60_1_AF525303" (duplicate)
10. "blaFRI-1_1_KT192551", same sequence as "blaFRI-1_1_KT192551" (duplicate)
11. "cfr_1_AM408573", same sequence as "cfr_1_AM408573" (duplicate)
12. "cfr_2_AJ879565", same sequence as "cfr_2_AJ879565" (duplicate)
13. "cfr(B)_3_KR610408", same sequence as "cfr(B)_3_KR610408" (duplicate)
14. "aac(6')-Ib-cr_1_DQ303918", same sequence as "aac(6')-Ib-cr_1_DQ303918" (duplicate)
15. "aac(6')-Ib-cr_2_EF636461", same sequence as "aac(6')-Ib-cr_2_EF636461" (duplicate)
16. "dfrA22_3_FM957884", same sequence as "dfrA33_1_FM957884" (curation)
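
A list like the one above can be produced by recording, for each DNA sequence, the first name seen and reporting every later name that shares it. The sketch below is a minimal illustration using the gene1/gene2 example from the quality-control section; `censor_duplicates` is a hypothetical name, and the records are plain `(name, sequence)` tuples rather than Biopython records.

```python
def censor_duplicates(records):
    """Keep the first name per DNA sequence and list the names that were dropped.

    records: iterable of (name, dna_sequence) tuples.
    Returns (kept, censored), where kept maps sequence -> first name seen and
    censored is a list of (dropped_name, kept_name) pairs.
    """
    kept = {}
    censored = []
    for name, seq in records:
        if seq in kept:
            censored.append((name, kept[seq]))
        else:
            kept[seq] = name
    return kept, censored


# Toy records: gene1 and gene2 share a sequence, gene3 is distinct
records = [
    ("gene1", "AACTTGCTA"),
    ("gene2", "AACTTGCTA"),
    ("gene3", "AACTTGCTT"),
]
kept, censored = censor_duplicates(records)
for dropped, retained in censored:
    print('"{0}", same sequence as "{1}"'.format(dropped, retained))
```

Entries that repeat both name and sequence (the "duplicate" cases above) are reported the same way, with the dropped and retained names being identical.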

#### DATABASE PREPARATION CHOICES


**ARIBA**

In [43]:
# CARD

# Note: the DNA sequences do appear always to translate to the protein, although the DNA sequence is not always in the correct frame
# To keep it simple, I will just use the DNA database
# and then compare with the Resfinder database using the protein

# Note: 7 entries have 2 DNA sequences. These are identical in all but Erm(44)v's case
# Given the ARO identifier is in effect our unique identifier, we keep only one sequence per entry: the second encountered
# This is unlikely to be relevant in Gram-negatives, as Erm(44)v has only been seen in S. saprophyticus and confers resistance to macrolides (to which GNRs are generally innately resistant)

for key in card_db.keys():
    try:
        if card_db[key]['model_type'] == "protein homolog model":
            aro_id = card_db[key]['ARO_id']
            name = card_db[key]["ARO_name"]
            for k in card_db[key]['model_sequences']['sequence'].values():
                card_prot = k['protein_sequence']['sequence']
                card_dna = Seq(k['dna_sequence']['sequence'])
            # Each pass overwrites card_prot/card_dna, so the second sequence is the one kept
    except (KeyError, TypeError):
        # This skips the 1 entry which just has a description of the database
        # (it is not a dict with a 'model_type' key)
        pass



CblA-1 ATGAAAGCATATTTCATCGCCATACTTACCTTATTCACTTGTATAGCTACCGTCGTCCGGGCGCAGCAAATGTCTGAACTTGAAAACCGGATTGACAGTCTGCTCAATGGCAAGAAAGCCACCGTTGGTATAGCCGTATGGACAGACAAAGGAGACATGCTCCGGTATAACGACCATGTACACTTCCCCTTGCTCAGTGTATTCAAATTCCATGTGGCACTGGCCGTACTGGACAAGATGGATAAGCAAAGCATCAGTCTGGACAGCATTGTTTCCATAAAGGCATCCCAAATGCCGCCCAATACCTACAGCCCCCTGCGGAAGAAGTTTCCCGACCAGGATTTCACGATTACGCTTAGGGAACTGATGCAATACAGCATTTCCCAAAGCGACAACAATGCCTGCGACATCTTGATAGAATATGCAGGAGGCATCAAACATATCAACGACTATATCCACCGGTTGAGTATCGACTCCTTCAACCTCTCGGAAACAGAAGACGGCATGCACTCCAGCTTCGAGGCTGTATACCGCAACTGGAGTACTCCTTCCGCTATGGTCCGACTACTGAGAACGGCTGATGAAAAAGAGTTGTTCTCCAACAAGGAGCTGAAAGACTTCTTGTGGCAGACCATGATAGATACTGAAACCGGTGCCAACAAACTGAAAGGTATGTTGCCAGCCAAAACCGTGGTAGGACACAAGACCGGCTCTTCCGACCGCAATGCCGACGGTATGAAAACTGCAGATAATGATGCCGGCCTCGTTATCCTTCCCGACGGCCGGAAATACTACATTGCCGCCTTCGTCATGGACTCATACGAGACGGATGAGGACAATGCGAACATCATCGCCCGCATATCACGCATGGTATATGATGCGATGAGATGA MKAYFIAILTLFTCIATVVRAQQMSELENRIDSLLNGKKATVGIAVWTDKGDMLRYNDHVHFPLLSVFKFHVALAVLDKMDKQSISLDSIVSIKASQMPPN

mecR1 GTGTTATCATCTTTTTTAATGTTAAGTATAATCAGTTCATTGCTCACGATATGTGTAATTTTTTTAGTGAGAATGCTCTATATAAAATATACTCAAAATATTATGTCACATAAGATTTGGTTATTAGTGCTCGTCTCCACGTTAATTCCATTAATACCATTTTACAAAATATCGAATTTTACATTTTCAAAAGATATGATGAATCGAAATGTATCTGACACGACTTCTTCGGTTAGTCATATGTTAGATGGTCAACAATCATCTGTTACGAAAGACTTAGCAATTAATGTTAATCAGTTTGAGACCTCAAATATAACGTATATGATTCTTTTGATATGGGTATTTGGTAGTTTGTTGTGCTTATTTTATATGATTAAGGCATTCCGACAAATTGATGTTATTAAAAGTTCGTCATTGGAATCGTCATATCTTAATGAACGACTTAAAGTATGTCAAAGTAAGATGCAGTTCTACAAAAAGCATATAACAATTAGTTATAGTTCAAACATTGATAATCCGATGGTATTTGGTTTAGTGAAATCCCAAATTGTACTACCAACTGTCGTAGTCGAAACCATGAATGACAAAGAAATTGAATATATTATTCTACATGAACTATCACATGTGAAAAGTCATGACTTAATATTCAACCAGCTTTATGTTGTTTTTAAAATGATATTCTGGTTTAATCCTGCACTATATATAAGTAAAACAATGATGGACAATGACTGTGAAAAAGTATGTGATAGAAACGTTTTAAAAATTTTGAATCGCCATGAACATATACGTTATGGTGAATCGATATTAAAATGCTCTATTTTAAAATCTCAGCACATAAATAATGTGGCAGCACAATATTTACTAGGTTTTAATTCAAATATTAAAGAACGTGTTAAGTATATTGCACTTTATGATTCAATGCCTAAACCTAATCGAAACAAGCGTATTGTTGCGTATATTGTATGTAGTATATCGCTTTTAATACAAGCACCGT

adeA ATGCAAAAGCATCTTTTACTTCCTTTATTTTTATCTATTGGGCTGATATTACAGGGGTGTGATTCAAAAGAAGTCGCTCAAGCTGAGCCACCACCGGCTAAAGTCAGTGTATTAAGCATTCAACCGCAATCGGTAAATTTTAGTGAAAATCTTCCTGCACGTGTACATGCATTCCGTACGGCGGAAATCCGTCCGCAAGTCGGAGGTATCATTGAAAAGGTTCTATTTAAACAAGGTAGTGAAGTTAGAGCAGGGCAAGCCTTATATAAAATTAATTCCGAGACTTTTGAGGCCGATGTAAATAGCAATAGAGCTTCTCTCAATAAAGCTGAAGCTGAGGTGGCAAGACTCAAAGTTCAGTTAGAACGTTATGAGCAGTTATTACCAAGTAATGCAATTAGTAAGCAAGAAGTAAGTAATGCTCAAGCTCAGTATCGTCAGGCTCTAGCCGATGTCGCTCAAATGAAAGCATTGCTGGCCAGACAAAACTTGAATCTGCAATATGCAACAGTTCGAGCGCCTATTTCTGGGCGTATTGGGCAATCTTTTGTCACTGAAGGTGCATTGGTCGGTCAGGGCGATACCAATACGATGGCAACCATTCAACAGATTGATAAAGTCTATGTTGATGTAAAGCAATCGGTTAGTGAGTATGAACGCCTACAGGCTGCGCTACAAAGCGGCGAATTATCAGCAAATAGTGACAAAACCGTTCGTATTACCAATAGCCACGGACAGCCCTATAACGTCACAGCAAAAATGTTGTTTGAAGATATTAATGTTGACCCGGAAACAGGCGATGTCACATTCCGTATTGAAGTTAATAACACTGAACGAAAATTACTTCCGGGCATGTATGTGCGTGTCAATATTGATCGTGCTTCTATTCCTCAAGCGCTATTGGTTCCGGCGCAAGCGATCCAACGTAATATCAGTGGCGAGCCTCAGGTATATGTCATTAACGCCCAAGGGACAGCGGAAATTCGTCCTATCGA




In [41]:
# Resfinder

# This code does the following
# First, concatenate the per-antibiotic FASTA files and read in the combined database
# (the output path of cat must match the path passed to SeqIO.parse)
subprocess.check_call("cat resfinder_20191001/*.fsa > resfinder_20191001/resfinder.fasta", shell=True)
resfinder_initdb = SeqIO.parse("resfinder_20191001/resfinder.fasta", "fasta")

# Then assign a new unique name to each element of the database
# Note some elements are removed (see above) to ensure each DNA sequence only has one name
# The links between new and original names are written to resfinder_20191001_link.csv
# The final database is then written to resfinder_20191001_formatted.fasta
resfinder_db = {}
identified_seqs = set()  # a set gives constant-time membership checks
newid = 0
for k in resfinder_initdb:
    if str(k.seq) not in identified_seqs:
        newid += 1
        identified_seqs.add(str(k.seq))
        resfinder_db["resfindernewid_{0}".format(newid)] = k
out_recs = []
with open("resfinder_20191001_link.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter=",")
    for k in resfinder_db:
        writer.writerow([k, resfinder_db[k].id, str(resfinder_db[k].seq)])
        k_id = k
        k_desc = ""
        k_seq = resfinder_db[k].seq
        k_rec = SeqRecord(k_seq, id=k_id, description=k_desc)
        out_recs.append(k_rec)

SeqIO.write(out_recs, "resfinder_20191001_formatted.fasta", "fasta")

3079

### Next comes how to format the databases 

This database then needs to be reformatted so that its FASTA headers match the structure ABRicate expects.
The script below (run later as `abricate_format.py`) does this

~~~
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

base_db = SeqIO.parse("resfinder_20191001_formatted.fasta","fasta")
abricate_recs = []

for k in base_db:
    k_seq = k.seq
    # Header layout: database name, new identifier, then the numeric part zero-padded as a stand-in accession
    k_id = "resfinderformatted~~~{0}~~~00{1:04d}".format(k.id, int(k.id.split("_")[-1]))
    k_rec = SeqRecord(k_seq, id=k_id, description="")
    abricate_recs.append(k_rec)
SeqIO.write(abricate_recs, "sequences", "fasta")
~~~
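
To make the header layout concrete, the sketch below builds the same identifier string without needing Biopython. As far as I understand, ABRicate splits headers on `~~~` into a database name, a gene identifier and an accession; here the zero-padded number simply stands in for the accession. The function name `abricate_header` is hypothetical.

```python
def abricate_header(new_id, db_name="resfinderformatted"):
    """Build a header of the form DB~~~ID~~~ACC from an identifier like 'resfindernewid_12'."""
    number = int(new_id.split("_")[-1])
    # '00' followed by the number padded to 4 digits, as in the script above
    return "{0}~~~{1}~~~00{2:04d}".format(db_name, new_id, number)


header = abricate_header("resfindernewid_12")
print(header)
# resfinderformatted~~~resfindernewid_12~~~000012
```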


##### Each of the following illustrates the steps required to prep the database

**Preparing the ABRicate db**

~~~
python abricate_format.py 
mkdir abricate_db
mv sequences abricate_db/
cd abricate_db
makeblastdb -in sequences -title rf102019 -dbtype nucl -hash_index
cd ..
~~~

**Preparing the ARIBA db**
~~~
mkdir ariba_db
cp  resfinder_20191001_formatted.fasta ariba_db
cd ariba_db
ariba prepareref --all_coding no -f resfinder_20191001_formatted.fasta formatted_dbs/ariba_db
cd ..
~~~

**Preparing the KmerResistance db**

**Note: for KmerResistance, there is an additional file.**

*To add to this text (this should primarily be in the supplements):*

The exception to this was that KmerResistance requires a “bacteria.fsa” FASTA file of all complete genomes in NCBI’s RefSeq database to filter low-coverage matches. As of writing, the version used by KmerResistance’s authors was not publicly available; however, we attempted to mitigate this using two alterations: replacing the file with a FASTA containing all complete genomes as identified by Centrifuge-download[14], and applying an average depth of coverage cut-off of >5x to KmerResistance results. For ABRicate (which uses BLASTn to search assemblies), assemblies were produced using SPAdes[8] run with default parameters.

~~~
mkdir kmerres_db
mv bacteria.fsa kmerres_db
cp resfinder_20191001_formatted.fasta kmerres_db/
cd kmerres_db
kma index -i bacteria.fsa -o bacteria -Sparse ATG
mv resfinder_20191001_formatted.fasta kmerres_fasta.fa
kma_index -i kmerres_fasta.fa -o kmerres_fasta
cd ..
~~~
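
The >5x average depth cut-off described above could be applied to a tab-separated results table along these lines. This is only a sketch: the column names `template` and `depth` are placeholders (the real KmerResistance output columns should be checked and substituted), `filter_by_depth` is a hypothetical helper, and the table contents are made up.

```python
import csv
import io


def filter_by_depth(tsv_text, depth_column="depth", min_depth=5.0):
    """Keep rows whose average depth is strictly greater than min_depth."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if float(row[depth_column]) > min_depth]


# Made-up results table, purely for illustration
example = (
    "template\tdepth\n"
    "resfindernewid_1\t12.4\n"
    "resfindernewid_2\t3.1\n"
    "resfindernewid_3\t5.0\n"
)
kept_rows = filter_by_depth(example)
print([row["template"] for row in kept_rows])
```

The comparison is strict, so a hit with a depth of exactly 5x is discarded, matching the ">5x" wording.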

**Preparing the SRST2 db**
~~~
mkdir srst2_db
cp resfinder_20191001_formatted.fasta srst2_db/
cd srst2_db/
mv resfinder_20191001_formatted.fasta rawseqs.fasta
cd-hit-est -i rawseqs.fasta -o rawseqs_cdhit90 -d 0 > rawseqs_cdhit90.stdout
python /srst2/database_clustering/cdhit_to_csv.py --cluster_file rawseqs_cdhit90.clstr --infasta rawseqs.fasta --outfile rawseqs_clustered.csv
python /srst2/database_clustering/csv_to_gene_db.py -t rawseqs_clustered.csv -o seqs_clustered.fasta -f rawseqs.fasta -c 4
cd ..
~~~







In [None]:
# Next step we want to 