# <span style="color: blue;">Analysing method related differences script - bioinformatic differences</span>

This script investigates discrepancies between the four methods:

1. ABRicate (our baseline method)
2. ARIBA
3. KmerResistance
4. SRST2

**NOTE: FOR EASE OF READING, SOME CODE IS REPEATED IN EACH SECTION SO THAT EACH SECTION IS A DISCRETE UNIT OF ANALYSIS. THIS LENGTHENS THE RUNTIME OF THE CODE BUT MAKES IT EASIER TO UNDERSTAND. THE RUNTIME IS STILL SHORT (< 2 hours, even when comparing all database similarities).**

### <span style="color: green;">Highlighting database dependency.</span>

For each of several databases, it does the following:
1. Reads in all the results we obtained
2. Uses these to produce **Figure X**


Databases used:

**Primary**
* Resfinder 1st October 2019 release

**Secondary**
* CARD 23rd October 2019 release
* beta-lactam.fsa of the Resfinder 1st October 2019 release
* aminoglycoside.fsa of the Resfinder 1st October 2019 release
* quinolone.fsa of the Resfinder 1st October 2019 release
* trimethoprim.fsa of the Resfinder 1st October 2019 release
* sulphonamide.fsa of the Resfinder 1st October 2019 release
* Resfinder 22nd of January 2019 release
* Resfinder 22nd of January 2018 release
* Resfinder 26th of January 2017 release



From here on, we are investigating causes of error and only use the Resfinder 1st October 2019 release.

### <span style="color: green;">Identifying which TRGs are most commonly discrepant</span>

1. For this we first identify which reported TRGs likely represent the same gene found differently by the different programs (see image below)

![image](method_1.png)

2. We then look at patterns of discrepancy and identify the 10 most common across the data
3. We then demonstrate what we believe the cause of each of these is (simulated example in each cell of the notebook)





### <span style="color: green;">Highlighting extent of annotation errors</span>

This section goes through how we determine whether an error is likely due to annotation.

1. For each discrepancy, we identify whether any of the identified TRGs are complete within the assembly.
2. If so, we take the containing contig and simulate **perfect** reads at 50x and 500x from it (10 times each).
3. We then observe whether the same misclassification pattern occurs again. If so, the error is deemed likely due to annotation.
4. This is then used to produce **Figure XX** in the paper.
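
The simulation in step 2 can be pictured with a minimal sketch. This is not the simulator used for the paper (the notebook does not say which tool was used); it assumes single-end, error-free reads of a hypothetical fixed length of 150 bp, sampled uniformly from the contig until the requested mean depth is reached. The function name `simulate_perfect_reads` and the toy contig are illustrative only.

```python
import random


def simulate_perfect_reads(contig, coverage, read_length=150, seed=0):
    """Sample error-free reads uniformly from `contig` up to a mean depth.

    Assumptions (not from the original analysis): single-end reads,
    fixed read length, uniform start positions.
    """
    if len(contig) < read_length:
        raise ValueError("contig is shorter than the read length")
    rng = random.Random(seed)
    # Number of reads needed so that reads * read_length ~= coverage * contig length
    n_reads = (coverage * len(contig)) // read_length
    last_start = len(contig) - read_length
    reads = []
    for _ in range(n_reads):
        start = rng.randint(0, last_start)  # inclusive on both ends
        reads.append(contig[start:start + read_length])
    return reads


# Toy contig; a real run would use the contig containing the complete TRG.
toy_contig = "".join(random.Random(1).choice("ACGT") for _ in range(3000))
reads_50x = simulate_perfect_reads(toy_contig, coverage=50)
reads_500x = simulate_perfect_reads(toy_contig, coverage=500)
print(len(reads_50x), len(reads_500x))
```

Repeating the call with ten different `seed` values would give the ten replicates per depth described above.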

Note that some of the code for this section is only presented here in markdown, as I can't provide full sequence files on GitHub. However, the sequences for this project are available on NCBI, so the analysis can be recreated using the scripts here.

### <span style="color: green;">Investigating non annotation errors.</span>

This section ends with an investigation of 6 samples where different beta-lactamase genes were reported but were not artefactual.





## <span style="color: blue;">Setup</span>

**The first steps are to load the required modules and then identify all the output reports.**


#### Dependencies

1. Python 3 
2. Biopython
3. Pandas
4. Numpy
5. tqdm
6. networkx

#### Inputs
Some notes on which files we take for this step:
1. For ABRicate we take each contigs.tab file
2. For ARIBA we use its summary file
3. For KmerResistance we use the .KmerRes files
4. For SRST2 we use the .out__fullgenes__seqs_clustered__results.txt files

Note these were chosen because they appear to follow the programs' guidelines and, where no guidelines are available, give us the most closely matching results between the four programs.

Note also SRST2 does not produce output if it finds no genes. The others all do
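
Because a missing SRST2 report means "no genes found" rather than a failed run, it helps to treat an absent file as an empty result. The sketch below shows one way to do this; the helper name `load_srst2_alleles` is hypothetical, and the only column assumed is `allele`, which is the column the notebook reads later on. The `Sample` column and the allele values in the toy file are made up for the demonstration.

```python
import os
import tempfile

import pandas as pd


def load_srst2_alleles(path):
    """Return the sorted unique alleles in an SRST2 fullgenes report.

    A missing report is treated as 'no genes found' and gives an empty list.
    """
    if path is None or not os.path.isfile(path):
        return []
    table = pd.read_csv(path, sep="\t").fillna("")
    return sorted(set(table["allele"]))


# Toy report written to a temporary file, purely to exercise the helper.
with tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False) as handle:
    handle.write("Sample\tallele\nx\tsul2_2\nx\tblaTEM-1B_1\n")
    toy_path = handle.name

found = load_srst2_alleles(toy_path)
missing = load_srst2_alleles("no_such_report.txt")
os.remove(toy_path)
print(found, missing)
```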

#### Resfinder database

For this we load
1. The naming link database
2. For each of the sub databases by antibiotic class 
  * beta-lactam
  * quinolone
  * aminoglycoside
  * sulfa antibiotics
  * trimethoprim
3. The whole database

We do 1 so the results are interpretable.
We do 2 so we can break down results by antibiotic class as well as by specific gene.
We do 3 to do an all-versus-all comparison of the resistance database.


In [3]:
# =============================================================================
# Code block 1 - Requirements
# =============================================================================


# We start by importing required modules
# File structure
import os
import csv


# Pandas
import pandas as pd
import numpy as np
import networkx as nx
# These are the fundamental modules used for analysing the data


# Pictures
import matplotlib.pyplot as plt
import seaborn as sns

# Sequence manipulation
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
# These modules are primarily used for loading the database, and then for identifying whether there are any
# perfect protein matches in the dataset


# Plotting


# Other general
from tqdm import tnrange, tqdm_notebook
# We just import this to check our code runs sensibly and to get timing estimates for stuff
from copy import deepcopy

#Looking at local files
%ls 



by_trg_pattern.csv
gene_naming.csv
gene_numbers.csv
interpreting_simulations.csv
isolate_patterns.csv
[1m[36mlegacy_code[m[m/
main_analysis.ipynb
method_1.png
method_1.svg
method_combinations.png
method_image.png
notes_random.txt
pattern_annotator.csv
readymade_card_20191023_full_sim_matrix.csv
readymade_resfinder_20170126_full_sim_matrix.csv
readymade_resfinder_20180122_full_sim_matrix.csv
readymade_resfinder_20190122_full_sim_matrix.csv
readymade_resfinder_20191001_ami_sim_matrix.csv
readymade_resfinder_20191001_blm_sim_matrix.csv
readymade_resfinder_20191001_full_sim_matrix.csv
readymade_resfinder_20191001_qui_sim_matrix.csv
readymade_resfinder_20191001_sul_sim_matrix.csv
readymade_resfinder_20191001_tri_sim_matrix.csv
[1m[36mresult_tarballs[m[m/
simulation_analysis.ipynb
simulation_contigs.csv


## <span style="color: blue;">Loading results</span>

### Loading the output files - Code block 2

The following block of code finds all the result files and puts them together into a single dictionary.
Note we re-ran each program several times using several different databases. Most of these results aren't in the main text (aside from those pertaining to the 20191001 database) but relate to things in the supplementary material.
Also note we re-ran each program on the sub-databases of interest to make sure different sub-databases didn't interfere with one another.

### Results all contained within the result_tarballs directory and dbs all contained within the ../db_preparation directory 

The following dictionary links tarballs (for results) and database files, so that all can be analysed together.
All tarballs are available in result_tarballs. Please run tar -xzvf on each tarball prior to running this code.
Note we don't provide all the results, as the files become large and difficult to work with in this small repo.
Instead we just keep the files we need. These can all be reproduced by accessing the sequence files directly and
then rerunning each piece of software using the databases provided.

### Loading the database - Code block 3


In [4]:

# =============================================================================
# Code block 2 - finding the output files
# =============================================================================


# Specify where the files are. Note that if you unpack the result tarballs somewhere else you will need to repoint the
# "results" entries

dbresult_link = {"resfinder_20191001_full":{"results":"result_tarballs/resfinderfull_20191001/",
                                          "db": "../db_preparation/resfinder_20191001/"},
                 "resfinder_20191001_blm":{"results":"result_tarballs/resfinderblm_20191001/",
                                          "db":"../db_preparation/resfinder_20191001_blm/"},
                 "resfinder_20191001_ami":{"results":"result_tarballs/resfinderami_20191001/",
                                          "db":"../db_preparation/resfinder_20191001_ami/"},
                 "resfinder_20191001_qui":{"results":"result_tarballs/resfinderqui_20191001/",
                                           "db":"../db_preparation/resfinder_20191001_qui/"},
                 "resfinder_20191001_tri":{"results":"result_tarballs/resfindertri_20191001/", 
                                          "db":"../db_preparation/resfinder_20191001_tri/"},
                 "resfinder_20191001_sul":{"results":"result_tarballs/resfindersul_20191001/", 
                                          "db":"../db_preparation/resfinder_20191001_sul/"},
                 "resfinder_20190122_full":{"results":"result_tarballs/resfinder_20190122/", 
                                          "db":"../db_preparation/resfinder_20190122/"},
                 "resfinder_20180122_full":{"results":"result_tarballs/resfinder_20180122/", 
                                          "db":"../db_preparation/resfinder_20180122/"},
                 "resfinder_20170126_full":{"results":"result_tarballs/resfinder_20170126/", 
                                          "db":"../db_preparation/resfinder_20170126/"},
                 "card_20191023_full":{"results":"result_tarballs/card_20191023/", 
                                       "db":"../db_preparation/card_20191023/"}
            }


# Loading in the output files

output_files = {}

for k in dbresult_link:
    output_files[k] = {}
    base_key = dbresult_link[k]['results'] + dbresult_link[k]['results'].split("/")[-2] + "_"
    # loading in ABRicate first
    output_files[k]["abricate"] = base_key + "abricate/"
    abricate_files = [os.path.join(root, f) for root, dirs, files 
                  in os.walk(output_files[k]["abricate"])
                 for f in files if f != "summary.tab"]
    abricate_files = {k.split("/")[-1].split("_")[0]:k for k in abricate_files}
    output_files[k]["abricate"] = abricate_files
    # Loading the KmerRes files
    output_files[k]["kmerres"] = base_key + "kmerres/"
    kmerres_files = [os.path.join(root, f) for root, dirs, files in os.walk(output_files[k]["kmerres"])
                     for f in files if ".KmerRes" in f]
    kmerres_files = {k.split("/")[-1].split("_")[0]:k for k in kmerres_files}    
    output_files[k]["kmerres"] = kmerres_files
    # The SRST2 files
    output_files[k]['srst2'] = base_key + "srst2/"
    srst2_files = [os.path.join(root, f) for root, dirs, files in os.walk(output_files[k]['srst2']) for f in files]
    srst2_files = {k.split("/")[-1].split("_")[0]:k for k in srst2_files}
    output_files[k]['srst2'] = srst2_files
    # Finally we'll put the whole ARIBA summary into a pandas DataFrame, indexed by sample ID
    ariba_path = base_key + "ariba.csv"
    ariba_summary = pd.read_csv(ariba_path).fillna("")
    # The sample ID is the first "_"-separated token of the second path component of each name
    ariba_summary.index = [name.split("/")[1].split("_")[0] for name in ariba_summary["name"]]
    output_files[k]["ariba"] = ariba_summary
    


# Note because of the different way in which ARIBA is loaded in we also add a special reading function
def ariba_parser(s):
    s_clusters = sorted(list(set([k.split(".")[0] for k in s.index if "cluster" in k and ".match" in k])))
    s_clusters = [k for k in s_clusters if s[k+".match"] == "yes"]
    s_genes = [s[k+ ".ref_seq"] for k in s_clusters]
    return s_genes

guuids = list(output_files["resfinder_20191001_full"]["abricate"].keys())



From here on I focus on an individual database, which is specified at the start of the next code block:
 
1. **resfinder_20191001_full** - The full 1st October 2019 Resfinder database
2. **resfinder_20191001_blm** - The beta-lactams only from the 1st October 2019 Resfinder database
3. **resfinder_20191001_ami** - The aminoglycosides only from the 1st October 2019 Resfinder database
4. **resfinder_20191001_qui** - The quinolones only from the 1st October 2019 Resfinder database
5. **resfinder_20191001_sul** - The sulphonamides only from the 1st October 2019 Resfinder database
6. **resfinder_20191001_tri** - The trimethoprim only from the 1st October 2019 Resfinder database
7. **resfinder_20190122_full** - The full 22nd Jan 2019 Resfinder database
8. **resfinder_20180122_full** - The full 22nd Jan 2018 Resfinder database
9. **resfinder_20170126_full** - The full 26th Jan 2017 Resfinder database
10. **card_20191023_full** - The full 23rd October 2019 CARD database

Each of these directories (see the dict above) has a standard file structure.
To use a different database, just set `db_choice` to a different key.
Note that some code only applies to the resfinder_20191001_full database (the main database used in the paper).



In [16]:
# =============================================================================
# Code block 3 - loading in the database
# =============================================================================



poss_dbs = ["resfinder_20191001_full", "resfinder_20191001_blm", "resfinder_20191001_ami", 
           "resfinder_20191001_qui", "resfinder_20191001_sul", "resfinder_20191001_tri","resfinder_20190122_full", 
           "resfinder_20180122_full","resfinder_20170126_full", "card_20191023_full"]


db_choice = "resfinder_20191001_full"
# Database
db = dbresult_link[db_choice]["db"]
formatted_db = db +"db_formatted.fasta"
linkfile = db + "db_link.csv"
clstrfile = db + "db_clustered.clstr"

#results

abricate_files = output_files[db_choice]['abricate']
kmerres_files = output_files[db_choice]['kmerres']
srst2_files = output_files[db_choice]['srst2']
ariba_summary = output_files[db_choice]['ariba']


# 1. loading the link file
link = pd.read_csv(linkfile, index_col=0, header=None)
rlink = pd.read_csv(linkfile, index_col=1, header=None)


# 2. loading the database
res_db = SeqIO.to_dict(SeqIO.parse(formatted_db, "fasta"))

# 3. Next we're going to do an all vs all similarity of the resfinder database

# The primary way we are going to do this in the manuscript is using CD-HIT 80 to cluster genes into families.
# Its primary use is to give us units of "genes" which we can work on.


sim_matrix = pd.DataFrame(np.zeros((len(res_db.keys()),len(res_db.keys()) )), 
                         columns = sorted(list(res_db.keys())), index=sorted(list(res_db.keys())))

#So reading the cluster file

clusters ={}
with open(clstrfile, "r") as f:
    for line in f:
        if line[0] == ">":
            cluster_no = int(line.rstrip("\n").split(" ")[-1])
            clusters[int(line.rstrip("\n").split(" ")[-1])] = []
        else:
            clusters[cluster_no].append(line.rstrip("\n").split(">")[-1].split("...")[0])
for k in clusters:
    for i in clusters[k]:
        for j in clusters[k]:
            # Use a single .loc[row, col] call; chained .loc[i][j] may assign to a temporary copy
            sim_matrix.loc[i, j] = 1



# To make sure the clusters given by CD-HIT are sensible I'm also employing a k-mer based clustering method which clusters
# any database sequences which share a 17-mer

jac_sim_matrix = pd.DataFrame(np.zeros((len(res_db.keys()),len(res_db.keys()) )), 
                         columns = sorted(list(res_db.keys())), index=sorted(list(res_db.keys())))
# For this similarity we will use 17-mers (one of the prefixes using MASH, note several others were checked prior to this for their effects)
# Clusters marked by this are fairly similar to other k-mer sizes
res_db_kmers = {k: set([str(res_db[k].seq)[i:i+17] for i in range(len(res_db[k].seq)-16)]) for k in res_db.keys()}

def calculate_jac_sim(l1, l2):
    intersection = len(res_db_kmers[l1].intersection(res_db_kmers[l2]))
    union = len(res_db_kmers[l1].union(res_db_kmers[l2]))
    return(intersection/union)

# This code actually goes through populating this matrix.
# However it takes 30 minutes to run, so for the sake of running this code quickly online, I will
# use a matrix I made earlier (using this code) whenever one is available
sim_matrix_file = "readymade_{0}_sim_matrix.csv".format(db_choice)
if not os.path.isfile(sim_matrix_file):
    for n in tnrange(len(res_db_kmers)):
        k  = list(res_db_kmers.keys())[n]
        for j in res_db_kmers:
            # Single .loc[row, col] call so the value is written into the matrix itself
            jac_sim_matrix.loc[k, j] = calculate_jac_sim(k, j)
    jac_sim_matrix.to_csv(sim_matrix_file)

jac_sim_matrix = pd.read_csv("readymade"+ "_{0}_sim_matrix.csv" .format(db_choice), index_col = 0)
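
As a toy illustration of the k-mer similarity computed above, the following sketch works through the Jaccard calculation on two short made-up sequences. It uses k = 3 only so the sets are small enough to check by hand (the notebook itself uses 17-mers), and the helper names `kmer_set` and `jaccard` are illustrative rather than part of the analysis code.

```python
def kmer_set(seq, k):
    """Return the set of all k-mers (substrings of length k) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity: |intersection| / |union| (0.0 if both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# "ACGTA" -> {ACG, CGT, GTA}; "CGTAC" -> {CGT, GTA, TAC}
# Shared: {CGT, GTA}; union has 4 k-mers, so the similarity is 2 / 4.
kmers_a = kmer_set("ACGTA", 3)
kmers_b = kmer_set("CGTAC", 3)
similarity = jaccard(kmers_a, kmers_b)
print(similarity)  # 0.5
```

Any non-zero value means the two sequences share at least one k-mer, which is the criterion the k-mer based clustering relies on.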



## <span style="color: blue;"> Setting up the analysis class </span>


In the next section of code, the aim is to define a class which performs most of the comparisons for us. 

#### Defining useful functions - Code block 4
Before we set up the class we define useful functions
There are some general useful functions and also some more specific ones. 
The specific and less obvious ones are below.

**CLUSTERING**

recursive_cluster => This essentially uses networkx to create a graph which links together elements with non-zero similarity. Each connected component of that graph is one cluster.
The other two functions, make_tuples and name_list, are simplifications of steps within the recursive_cluster function.
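
The idea behind the clustering can be shown with a compact sketch on a toy similarity matrix. This is a simplified stand-in for recursive_cluster, not the function used in the notebook: the name `cluster_by_similarity` and the four toy labels are made up, and edges are added directly rather than through intermediate tuples.

```python
import networkx as nx
import pandas as pd


def cluster_by_similarity(sim_df, names):
    """Group names into clusters: connected components of the non-zero similarity graph."""
    graph = nx.Graph()
    graph.add_nodes_from(names)  # keeps singletons as their own cluster
    for i in names:
        for j in names:
            if i != j and sim_df.loc[i, j] != 0:
                graph.add_edge(i, j)
    return sorted(sorted(component) for component in nx.connected_components(graph))


# Toy matrix: A~B and B~C are similar, D is similar to nothing.
labels = ["A", "B", "C", "D"]
toy = pd.DataFrame(0.0, index=labels, columns=labels)
toy.loc["A", "B"] = toy.loc["B", "A"] = 1
toy.loc["B", "C"] = toy.loc["C", "B"] = 1

families = cluster_by_similarity(toy, labels)
print(families)  # [['A', 'B', 'C'], ['D']]
```

Note that A and C end up in the same family even though they are not directly similar, because both are linked to B.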

**AGREEMENT**

We also use a function to define agreement.
This is then useful for quick and easy reading of
which programs have agreed with one another.
Note the panel is the same one as included in a supplementary image.

![image](method_combinations.png)
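
To make the pattern labels concrete, here is a small sketch of the same labelling idea. Each method's gene list gets the number of the first method that reported an identical list, so `[1, 1, 1, 1]` means full agreement. The function name `agreement_labels` and the example gene names are illustrative; results are assumed to be passed in the order ABRicate, ARIBA, KmerResistance, SRST2.

```python
def agreement_labels(*gene_lists):
    """Label each gene list by the order in which its (sorted) content was first seen."""
    seen = {}
    labels = []
    for genes in gene_lists:
        key = ":".join(sorted(genes))
        if key not in seen:
            seen[key] = len(seen) + 1
        labels.append(seen[key])
    return labels


# Here the third method (KmerResistance in the assumed order) misses sul2.
example = agreement_labels(
    ["blaTEM-1B", "sul2"],
    ["sul2", "blaTEM-1B"],
    ["blaTEM-1B"],
    ["blaTEM-1B", "sul2"],
)
print(example)  # [1, 1, 2, 1]
```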



#### Reading in the data - Code block 5
Once we have the tools to analyse the data we actually read in the data.

This does the following steps.
For each of ABRicate, ARIBA, KmerResistance and SRST2 we:
1. Read in its file
2. Pull out the TRGs it identifies and relabel them with their original names
3. Separate these into groups according to their relevant antibiotics.




This is done using an external spreadsheet (which suggests putative families for all patterns of genes seen)

In [18]:

# =============================================================================
# Code block 4 - useful functions
# =============================================================================



###### CLUSTERING FUNCTIONS  ######


def make_tuples(l):
    # Make all possible tuples in a list
    output = []
    for i in range(len(l)):
        for j in range(len(l)):
            output.append((l[i],l[j]))
    output = sorted(list(set(output)))
    return output

def name_list(l, d):
    #rename all elements of a list
    return [d[k] for k in l]


def recursive_cluster(df, l):
    groups = {}
    # First we get all linked pairs.
    for i in l:
        i_data = df.loc[i]
        i_group = [i]
        for j in l:
            if j != i: 
                if i_data[j] != 0:
                    i_group.append(j)
        i_group = sorted(i_group)
        groups[i] =  i_group
    # Assign numbers to the elements of l and then generate a dictionary to link numbers and names
    naming = {}
    reverse_naming  = {}
    m = 1
    for i in l:
        naming[i] = m
        reverse_naming[m] = i
        m += 1
    # Grouping tuples like a graph using networkx
    final_tuples = []
    for i in groups:
        final_tuples = final_tuples + make_tuples([naming[j] for j in groups[i]])
    final_tuples = sorted(list(set(final_tuples)))
    graph=nx.Graph(final_tuples)
    output = [name_list(list(c), reverse_naming) for c in nx.connected_components(graph)]
    return output

###### AGREEMENT PATTERN FUNCTIONS 

# Note for these functions they always assume the results are put in the correct order
# i.e. ABRicate, ARIBA, KmerResistance, SRST2

# First we start with a general agreement function 
def agreement_pattern(l1, l2, l3, l4):
    args = deepcopy(locals())
    arg_list = ['l1', 'l2', 'l3', 'l4']
    for key in arg_list:
        args[key] = ":".join(sorted(args[key]))
    patterns = {}
    output = []
    starting_no = 0
    for key in arg_list:
        if args[key] not in patterns:
            starting_no += 1
            patterns[args[key]] = starting_no
            output.append(starting_no)
        else:
            output.append(patterns[args[key]])
    return output
    

# Now for gene agreement, I use this program to say which genes (from a list) each method has found
def pres_bin(l1, l2):
    output = []
    for k in l1:
        if k in l2:
            output.append("1")
        else:
            output.append("0")
    return output

# Here is the agreement function again, but this time i've dropped the sort function. 
# This enables me to use the output from pres_bin directly to make the patterns as defined above 
def pres_bin_agreement_pattern(l1, l2, l3, l4):
    args = deepcopy(locals())
    arg_list = ['l1', 'l2', 'l3', 'l4']
    for key in arg_list:
        args[key] = ":".join(args[key])
    patterns = {}
    output = []
    starting_no = 0
    for key in arg_list:
        if args[key] not in patterns:
            starting_no += 1
            patterns[args[key]] = starting_no
            output.append(starting_no)
        else:
            output.append(patterns[args[key]])
    return output
    



In [19]:
# =============================================================================
# Code block 5 - The "isolate" class
# =============================================================================



# Reading in each of the data



class isolate:
    
    def __init__(self,guuid):
        self.guuid = guuid

        # ABRicate
        # Each section of this code does similar things: 1. reads the file, 2. translates the genes
        self.abricate_fl = pd.read_csv(abricate_files[self.guuid], delimiter= "\t").fillna("")

        
        ################## ADDITION FOR INTEREST #####################################
        # NOTE I HAVE LEFT THE BELOW FILTERING CODE IN JUST IN CASE THOSE TESTING THIS SCRIPT WANT TO TRY VARYING CUTOFFS
        # SIMILAR BITS CAN BE ADDED TO ALL THE OTHER FILES, NOTE THE 60, 90 CUTOFFS ARE THOSE USED BY RESFINDER AS DEFAULT!
        #         self.abricate_fl = self.abricate_fl.loc[self.abricate_fl['%COVERAGE'] > 60.0]
        #         self.abricate_fl = self.abricate_fl.loc[self.abricate_fl['%IDENTITY'] > 90.0]


        self.abricate_genes = sorted(list(set([link.loc[k][1] for k in list(self.abricate_fl['GENE'])])))

        
        # ARIBA
        self.ariba_data = ariba_parser(ariba_summary.loc[self.guuid])
        self.ariba_genes = sorted(list(set([link.loc[k][1] for k in self.ariba_data])))

        # KmerResistance
        # For this one we also add in a cut-off, as the program doesn't seem to apply it reliably in our output files.
        # We therefore re-apply the 70% cut-off here, using the template_id column.
        self.kmerres_fl = pd.read_csv(kmerres_files[self.guuid], delimiter = "\t").fillna("")
        self.kmerres_fl = self.kmerres_fl.loc[self.kmerres_fl.template_id > 70.0]
        self.kmerres_genes = sorted(list(set([link.loc[k][1] for k in [j for j in list(self.kmerres_fl['#Template']) if "resfindernewid" in j]])))

        # SRST2
        # Note for SRST2 we have another bit which doesn't quite work
        # It does not make a file if it finds no genes
        # Therefore we put it into a try except group

        try:
            self.srst2_fl = pd.read_csv(srst2_files[self.guuid], delimiter = "\t").fillna("")
            self.srst2_genes = sorted(list(set([link.loc[k][1] for k in list(self.srst2_fl['allele'])])))
        except:
            self.srst2_fl = "N/A"
            self.srst2_genes = []

        
        ### Aggregating genes. 
        
        self.geno_full = {"abricate":self.abricate_genes, "ariba":self.ariba_genes, 
                         "kmerres": self.kmerres_genes, "srst2":self.srst2_genes}
        self.all_genes = sorted(list(set(self.abricate_genes + self.ariba_genes + self.srst2_genes + self.kmerres_genes)))
        
        
        
        ### Defining gene families 

        gene_df = pd.DataFrame(np.zeros((len(self.all_genes),len(self.all_genes) )), 
                               columns =self.all_genes, index=self.all_genes)
        for l in gene_df.index:
            for j in gene_df.columns:
                # Single .loc[row, col] call so the value is written into gene_df itself
                gene_df.loc[l, j] = sim_matrix.loc[rlink.loc[l][0]][rlink.loc[j][0]]
        self.gene_families = recursive_cluster(gene_df, gene_df.index)
            
            
        ### Assessing levels of agreement
        # Here we define three things. Firstly, do results agree for all genes for a particular antibiotic class?
        # Then, do they agree for a whole isolate?
        # Then finally we delve a bit more into the patterns of disagreement.
        # Whole isolate level agreement
        self.isolate_patterns = agreement_pattern(sorted(self.abricate_genes), sorted(self.ariba_genes),
                                                sorted(self.kmerres_genes), sorted(self.srst2_genes))
        self.isolate_agreement = (self.isolate_patterns == [1,1,1,1])
        


        
        ### For each gene
        self.genes_identified = {}
        self.gene_patterns = {}
        for pat in self.gene_families:
            pat_id = ":".join(pat)
            pat_string = [pres_bin(pat, self.abricate_genes),
                          pres_bin(pat, self.ariba_genes), 
                          pres_bin(pat, self.kmerres_genes), 
                          pres_bin(pat, self.srst2_genes)]
            self.gene_patterns[pat_id] = pres_bin_agreement_pattern(pat_string[0], pat_string[1], pat_string[2], pat_string[3])
            pat_string = "|".join([":".join(i) for i in pat_string])
            self.genes_identified[pat_id] = pat_string



# Now with the classes set up we read in everything into an isolates dict
isolates = {}

for n in tnrange(len(guuids)):
    k = guuids[n]
    x = isolate(k)
    isolates[k] = x
                
        
    

HBox(children=(IntProgress(value=0, max=1818), HTML(value='')))




## <span style="color: blue;">Producing figures</span>


From here on, we specifically analyse the outputs obtained with the October 2019 database (except for a few supplementary figures; cells which produce the data for these are marked with ##### SUPPLEMENTARY ALL DATABASES #####).

### Gene naming - code block 6

For each gene, and each pattern of discordance, we want a more interpretable gene (not allele) name. This is done using supplementary spreadsheets. 

### <span style="color: red;">NOTE FOR CLARITY , EXACTLY HOW GENES ARE NAMED IN PICTURES VS WHAT WAS ACTUALLY FOUND IS ALL ENCOMPASSED IN THE "gene_naming.csv" IN THIS DIRECTORY </span>

We then use these to begin constructing figures/figure components.



In [25]:
# =============================================================================
# Code block 6 - Giving the genes sensible names for pictures
# =============================================================================


# First we find which genes have been found across all methods
gene_list = []

for k in isolates:
    for g in isolates[k].all_genes:
        gene_list.append(g)
# Overall its approximately 200 different alleles, but comes back down to being a single class. 
gene_list = sorted(list(set(gene_list)))
print(len(gene_list))

        
# We then define our alleles

gene_df = pd.DataFrame(np.zeros((len(gene_list),len(gene_list) )), 
                       columns =gene_list, index=gene_list)
for l in gene_df.index:
    for j in gene_df.columns:
        # Single .loc[row, col] call so the value is written into gene_df itself
        gene_df.loc[l, j] = sim_matrix.loc[rlink.loc[l][0]][rlink.loc[j][0]]
gene_families = recursive_cluster(gene_df, gene_df.index)


kmer_df = pd.DataFrame(np.zeros((len(gene_list),len(gene_list) )), 
                       columns =gene_list, index=gene_list)
for l in kmer_df.index:
    for j in kmer_df.columns:
        # Single .loc[row, col] call so the value is written into kmer_df itself
        kmer_df.loc[l, j] = jac_sim_matrix.loc[rlink.loc[l][0]][rlink.loc[j][0]]
kmer_families = recursive_cluster(kmer_df, kmer_df.index)

# Next onto naming
# This is done in a separate spreadsheet, gene_naming.csv, which links together gene, family, which genes they share k-mers with,
# what its actual family is (as listed in the CARD database front end) and finally what it ends up being called in pictures
# Note for genes seen less than X times, they're grouped into "Other"

# Both for Figure 1, and generally to know how to group genes, we need to know how often they occur.
gene_nos = {g:0 for g in gene_list}
genemet_nos = {g:{"abricate":0, 
                  "ariba":0, 
                  "kmerres":0, 
                  "srst2":0} for g in gene_list}

for g in gene_list:
    for i in isolates:
        if g in isolates[i].all_genes:
            gene_nos[g] += 1
        if g in isolates[i].abricate_genes:
            genemet_nos[g]["abricate"] += 1
        if g in isolates[i].ariba_genes:
            genemet_nos[g]["ariba"] += 1
        if g in isolates[i].kmerres_genes:
            genemet_nos[g]["kmerres"] += 1
        if g in isolates[i].srst2_genes:
            genemet_nos[g]["srst2"] += 1


# This however is a bit crude and can't really be used for counting genes; it's just a useful set of dictionaries to have for later
# when looking at how common any given allele is. 
        
        


251


In [24]:
# =============================================================================
# Code block 7 - Producing figure 1 panel A
# =============================================================================





['sequence', 'gene', 'gene_family', 'kmer_family', 'family_card', 'picture_family']
['ARR-2_1_HQ141279', 'ARR-2', 'ARR-2_1_HQ141279:ARR-3_1_JF806499:ARR-3_4_FM207631', "blaTEM-143_1_DQ075245:ARR-2_1_HQ141279:ARR-3_1_JF806499:ARR-3_4_FM207631:blaTEM-1A_1_HM749966:blaTEM-1B_1_AY458016:blaTEM-1C_1_FJ560503:blaTEM-1D_1_AF188200:blaTEM-206_1_KC783461:aac(3)-Ia_1_X15852:aac(3)-Ib_1_L06157:blaTEM-207_1_KC818234:aac(6')-Ib-Hangzhou_1_FJ503047:aac(6')-Ib-cr_1_DQ303918:aac(6')-Ib-cr_2_EF636461:aac(6')-Ib11_1_AY136758:aac(6')-Ib3_1_X60321:aac(6')-Ib_1_M21682:aac(6')-Ib_2_M23634:aadA11_2_AJ567827:aadA12_1_AY665771:aadA13_1_AY713504:aadA13_2_NC010643:aadA15_1_DQ393783:aadA17_1_FJ460181:aadA1_2_FJ591054:aadA1_3_JQ414041:aadA1_4_JQ480156:aadA1_5_JX185132:aadA1b_1_M95287:aadA21_1_AY171244:aadA22_1_AM261837:aadA23_1_AJ809407:aadA24_1_AM711129:aadA24_1_DQ677333:aadA2_1_NC_010870:aadA2_2_JQ364967:aadA2b_1_D43625:aadA3_1_AF047479:aadA4_1_Z50802:aadA5_1_AF137361:aadA8b_1_AY139603:aadA8b_2_AM040708:blaTEM-2

In [None]:
#### TRG PATTERN FILES

# Aggregating the patterns from all samples
pattern_counter = {}
pattern_bymethod = {}
pattern_byabx = {}
for i in isolates:
    for k in isolates[i].genes_identified:
        pattern_key = k + "|" + isolates[i].genes_identified[k] 
        if pattern_key not in pattern_counter.keys():
            pattern_bymethod[pattern_key] = ":".join([str(j) for j in isolates[i].gene_patterns[k]])
            pattern_counter[pattern_key] = 1
            pattern_byabx[pattern_key] = isolates[i].gene_group[k]
        else:
            pattern_counter[pattern_key] += 1

print(pattern_byabx)
# writing this data into a CSV
with open("by_trg_pattern.csv", "w", newline="") as f:  # newline="" avoids blank rows from csv.writer on Windows
    writer = csv.writer(f)
    writer.writerow(['trg_pattern',"abx", "number_of_isolates", "method_agreement", "overall_agreement"])
    for key in pattern_counter:
        writer.writerow([key,pattern_byabx[key], pattern_counter[key], pattern_bymethod[key], pattern_bymethod[key]=="1:1:1:1"])



In [None]:
#### ISOLATE files:

with open("isolate_patterns.csv", "w", newline="") as f:  # newline="" avoids blank rows from csv.writer on Windows
    writer = csv.writer(f, delimiter = ",")
    writer.writerow(["isolate", "pattern", "agreement"])
    for k in isolates:
        for pat in isolates[k].genes_identified:
            writer.writerow([k, pat, ("0" not in isolates[k].genes_identified[pat])])
        

In [None]:
### For presentation stuff


annotated_patterns = pd.read_csv("pattern_annotator.csv")
gene_output = pd.read_csv("by_trg_pattern.csv")
pat_data = gene_output.merge(annotated_patterns, on="trg_pattern")
simulation_data = pd.read_csv("interpreting_simulations.csv")
print(simulation_data.head())

def met_agr(l):
    if l == "1:1:1:1":
        return "0"
    if l == "1:2:2:2":
        return "1"
    if l == "1:2:1:1":
        return "2"
    if l == "1:1:2:1":
        return "3"    
    if l == "1:1:1:2":
        return "4"
    else:
        return "5"

pat_dict = {}

pg_dict = {}
artefact_dict = {}

for k in range(len(pat_data)):
    k_data = pat_data.iloc[k]
    k_pat = k_data.trg_pattern.split("|")[0]
    sim_dat = (True in list(simulation_data.loc[simulation_data.pattern == k_pat].overall))
    if k_data.method_agreement == "1:1:1:1":
        k_status = "0"
    elif sim_dat == True:
        k_status = "1"
    else:
        k_status = "2"
    if k_data.gene_name not in pat_dict.keys():
        pat_dict[k_data.gene_name] = {str(i):0 for i in range(6)}
        artefact_dict[k_data.gene_name]  = {str(i):0 for i in range(3)}
    pat_dict[k_data.gene_name][met_agr(k_data.method_agreement)] += k_data.number_of_isolates
    artefact_dict[k_data.gene_name][k_status] += k_data.number_of_isolates
    pg_dict[k_data.gene_name] = k_data.abx_x


def ad_sum(d):
    try:
        return d["1"]/(d["1"]+d["2"])
    except ZeroDivisionError:
        return -1

for k in pg_dict:
    pg_dict[k] = (pg_dict[k], sum(pat_dict[k].values()),round(ad_sum(artefact_dict[k]), 2))

print(pat_dict)
for k in sorted(pg_dict.keys(), key = lambda a: (pg_dict[a][0],pg_dict[a][1]) , reverse = True):
    print(k, pg_dict[k])

In [None]:
f  = plt.figure(figsize=(10, 5), dpi=300)
ax1 = plt.subplot2grid((1,1),(0,0), rowspan = 1 , colspan=1)

f_keys = ["blaTEM", "blaCTX-M-1", "blaOXA-1","blaCMY", "blaSHV","blaCTX-M-9" , 
         "aph(6)-Id","aph(3'')-Ib", "ant(3'')-Ia", "aadA5", "aac(3)-IIa", "aac(6')-Ib", "aph(3')-Ia", 
         "aac(3)-IV", "aph(4)-Ia", 
         "qnrS",
         "dfrA7","dfrA1", "drfA12", "dfrA14" , "dfrA5", 
         "sul2","sul1", "sul3" ]
xs = range(len(f_keys))
f_0vals = [pat_dict[k]['0'] for k in f_keys]
f_1vals = [pat_dict[k]['1'] for k in f_keys]
f_2vals = [pat_dict[k]['2'] for k in f_keys]
f_3vals = [pat_dict[k]['3'] for k in f_keys]
f_4vals = [pat_dict[k]['4'] for k in f_keys]
f_5vals = [pat_dict[k]['5'] for k in f_keys]
def convert_numbers(l):
    out_list = []
    for k in l:
        if k != -1:
            out_list.append(str(int(100*k)) + "%")
        else:
            out_list.append("N/A")
    return out_list

f_numbers = convert_numbers([pg_dict[k][2] for k in f_keys])
print(f_numbers)
width = 0.5

ax1.bar(xs, f_0vals, width)
ax1.bar(xs, f_1vals, width,
             bottom=f_0vals, label="ABRicate discrepant")
ax1.bar(xs, f_2vals, width,
             bottom= [f_0vals[i]+ f_1vals[i] for i in range(len(f_0vals))], label="ARIBA discrepant")
ax1.bar(xs, f_3vals, width,
             bottom= [f_0vals[i]+ f_1vals[i]+f_2vals[i] for i in range(len(f_0vals))], label="KmerResistance discrepant")
ax1.bar(xs, f_4vals, width,
             bottom= [f_0vals[i]+ f_1vals[i]+f_2vals[i]+f_3vals[i] for i in range(len(f_0vals))], label="SRST2 discrepant")
ax1.bar(xs, f_5vals, width,
             bottom= [f_0vals[i]+ f_1vals[i]+f_2vals[i]+f_3vals[i]+f_4vals[i] for i in range(len(f_0vals))], label="Multiple discrepant")

ax1.set_xticks(range(len(f_keys)))
ax1.set_xticklabels(f_keys, rotation=90)
ax1.set_yticklabels([], rotation=90)
ax1.spines["top"].set_visible(False)
ax1.spines["right"].set_visible(False)
