# Introduction

In [6]:
import ModelSearch
import pandas as pd 


# Methods and Results

### Data Processing Workflow

The pipeline below (FIGURE) first trains models on binary phenotype readouts (resistant vs susceptible) and outputs a binary phenotype prediction. It requires structural, biochemical, thermodynamic, and evolutionary features to be pulled down for RNAP and its mutations and then merged with the filtered, prepared mutation data. These merged data form the feature set, and the filtered, prepared phenotype readouts form the target set.

The data are split 70:30 into training and test sets. The training set undergoes a parameter grid search with cross-validation and upsampling (discussed later). The model is then retrained using the best parameters, and a classification report and confusion matrix are output after the decision threshold has been shifted to maximise specificity.

![title](direct_binary_class_workflow.png)
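
The split, grid search, and threshold-shifting steps can be sketched roughly as follows. This is a minimal illustration rather than the pipeline's actual code: it assumes scikit-learn, uses synthetic data in place of the real feature and target sets, and uses `LogisticRegression`, the `C` grid, balanced-accuracy scoring, and the `min_sensitivity` floor purely as placeholders. Upsampling is left out here, and the threshold is chosen on the training set for brevity, which a held-out validation set would do more honestly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split


def sens_spec(y_true, y_pred):
    """Return (sensitivity, specificity) with class 1 (resistant) as positive."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tp / (tp + fn), tn / (tn + fp)


def shift_threshold(y_true, proba, min_sensitivity=0.8):
    """Pick the highest cut-off that keeps sensitivity at or above the floor.

    Specificity cannot fall as the cut-off rises, so this is the
    specificity-maximising choice under that (assumed) constraint.
    """
    candidates = np.linspace(0.05, 0.95, 19)
    ok = [t for t in candidates
          if sens_spec(y_true, (proba >= t).astype(int))[0] >= min_sensitivity]
    return float(max(ok)) if ok else float(candidates[0])


# Synthetic, imbalanced stand-in for the feature and target sets (1 = resistant).
X, y = make_classification(n_samples=600, n_features=8,
                           weights=[0.7, 0.3], random_state=0)

# 70:30 split, stratified so both sets keep the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Cross-validated grid search on the training set only. With the default
# refit=True, the best parameters are refitted on the whole training set.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="balanced_accuracy",
)
search.fit(X_train, y_train)
model = search.best_estimator_

# Shift the decision threshold, then report on the untouched test set.
threshold = shift_threshold(y_train, model.predict_proba(X_train)[:, 1])
y_pred = (model.predict_proba(X_test)[:, 1] >= threshold).astype(int)

cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(cm)
print(classification_report(y_test, y_pred, target_names=["S", "R"], zero_division=0))
```

Maximising specificity with no constraint would simply call every isolate susceptible, so the sketch assumes a minimum acceptable sensitivity; the value 0.8 is arbitrary.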

### Mutation data derivation

The data that formed the target set and much of the training set were derived from the CRyPTIC database. Mutation data were extracted from Illumina whole-genome sequencing (WGS) of clinical isolates. Since this study focuses on predicting susceptibility in response to solo structural SNPs in Mtb RNA polymerase, mutations associated with other drug targets, non-solo mutations, phylogenetic mutations, and synonymous mutations were filtered out, as were insertion/deletion polymorphisms (indels) and mutations in the promoter.

In [20]:
# read in CRyPTIC mutation data
mutation_df = pd.read_pickle('Data_tables/MUTATIONS-rnap.pkl.gz')
mutation_df.drop_duplicates(inplace=True)
# remove null calls, keep calls that pass the quality filter, and keep SNPs only (drops indels)
mutation_df = mutation_df[(~mutation_df['IS_NULL']) & (mutation_df['IS_FILTER_PASS']) & (mutation_df['IS_SNP'])]
# keep mutations in the coding region (drops promoter mutations)
mutation_df = mutation_df[mutation_df['IN_CDS']]
# keep non-synonymous mutations
mutation_df = mutation_df[~mutation_df['IS_SYNONYMOUS']]
# keep solo mutations: keep=False drops every isolate that carries more than one mutation
mutation_df = mutation_df.drop_duplicates(subset=['UNIQUEID'], keep=False)
# insert segid column: the last letter of the gene name (e.g. rpoB -> B)
mutation_df['segid'] = [i[-1] for i in mutation_df.GENE]
mutation_df.set_index('UNIQUEID', verify_integrity=True, inplace=True)
mutation_df

UNIQUEID,GENE,MUTATION,POSITION,AMINO_ACID_NUMBER,GENOME_INDEX,NUCLEOTIDE_NUMBER,REF,ALT,IS_SNP,IS_INDEL,...,IS_NULL,IS_FILTER_PASS,ELEMENT_TYPE,MUTATION_TYPE,INDEL_LENGTH,INDEL_1,INDEL_2,SITEID,NUMBER_NUCLEOTIDE_CHANGES,segid
site.02.subj.0604.lab.241032-14.iso.1,rpoB,M587I,587.0,587.0,,,atg,ata,True,False,...,False,True,GENE,AAM,,,,02,1,B
site.02.subj.0104.lab.22A057.iso.1,rpoB,S450L,450.0,450.0,,,tcg,ttg,True,False,...,False,True,GENE,AAM,,,,02,1,B
site.02.subj.0385.lab.235016-15.iso.1,rpoB,D545E,545.0,545.0,,,gac,gag,True,False,...,False,True,GENE,AAM,,,,02,1,B
site.02.subj.0904.lab.22A138.iso.1,rpoB,H445Y,445.0,445.0,,,cac,tac,True,False,...,False,True,GENE,AAM,,,,02,1,B
site.02.subj.0951.lab.22A186.iso.1,rpoB,S450L,450.0,450.0,,,tcg,ttg,True,False,...,False,True,GENE,AAM,,,,02,1,B
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
site.10.subj.YA00029870.lab.YA00029870.iso.1,rpoB,D435F,435.0,435.0,,,gac,ttc,True,False,...,False,True,GENE,AAM,,,,10,2,B
site.10.subj.MG04015656.lab.MG04015656.iso.1,rpoC,A1044V,1044.0,1044.0,,,gcg,gtg,True,False,...,False,True,GENE,AAM,,,,10,1,C
site.10.subj.PH00493578.lab.PH00493578.iso.1,rpoB,I491F,491.0,491.0,,,atc,ttc,True,False,...,False,True,GENE,AAM,,,,10,1,B
site.10.subj.WG00269790.lab.WG00269790.iso.1,rpoB,S450L,450.0,450.0,,,tcg,ttg,True,False,...,False,True,GENE,AAM,,,,10,1,B
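
The solo-mutation filter and the `segid` derivation are easiest to see on a small table. The sketch below uses made-up isolate identifiers and a toy frame with only three of the real columns; it is meant only to show what `keep=False` does.

```python
import pandas as pd

# Toy mutation table; the isolate identifiers are invented for illustration.
toy = pd.DataFrame(
    {
        "UNIQUEID": ["iso.1", "iso.2", "iso.2", "iso.3"],
        "GENE": ["rpoB", "rpoB", "rpoC", "rpoC"],
        "MUTATION": ["S450L", "H445Y", "G594E", "A1044V"],
    }
)

# keep=False drops every row whose UNIQUEID occurs more than once,
# so only isolates carrying exactly one mutation remain.
solo = toy.drop_duplicates(subset=["UNIQUEID"], keep=False).copy()

# The last letter of the gene name (rpoB -> "B") serves as the segment identifier.
solo["segid"] = solo["GENE"].str[-1]

# verify_integrity raises if any isolate ID is still duplicated.
solo = solo.set_index("UNIQUEID", verify_integrity=True)
print(solo)
```

Here `iso.2` carries two mutations, so both of its rows are removed and only `iso.1` and `iso.3` survive.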


### Phenotype data derivation

Primary phenotypic data were minimum inhibitory concentrations (MICs) read from bespoke CRyPTIC UKMYC5 and UKMYC6 susceptibility plates, onto which the same clinical isolates that underwent WGS were plated. Data were pooled from laboratories in 27 countries worldwide, and to date the WHO-endorsed CRyPTIC dataset contains MICs for 15,211 isolates against 13 different anti-TB drugs ([Brankin et al, 2021](#Brankin2021), [Walker et al, 2022](#Walker2022)).
   
Each UKMYC5/6 96-well microtitre plate contained 13 different first- and second-line anti-TB compounds, including rifampicin, as well as two repurposed drugs (delamanid and bedaquiline) and two positive control wells. Plates were manufactured with the drugs freeze-dried into each well following a doubling dilution series. After incubation for 14 days, MICs were read by a trained laboratory technician using either a Sensititre Vizion Digital MIC viewing system or a mirrored box [(Plate et al, 2018)](#Plate2018). These readings were verified by the Automated Mycobacterial Growth Detection Algorithm (AMyGDA) ([Fowler et al, 2018](#Fowler2018); [Plate et al, 2018](#Plate2018)) and by a community science platform, BashTheBug. Phenotypes for which the three methods disagreed were treated as low quality and discarded for this study.

The dataset contains a binary phenotype (resistant vs susceptible) calculated directly from the MICs, using an Epidemiological Cut-off (ECOFF) value of 1.0 mg/L as the critical concentration for rifampicin. This binary phenotype acts as the target set for the binary classification models ([CRyPTIC, 2019](#CRyPTIC2019)).
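
As a rough illustration of how an MIC reading maps onto the binary phenotype, the sketch below compares a reading with an ECOFF. The function name `mic_to_phenotype` is hypothetical, and the rule used (MIC above the ECOFF is resistant, at or below it is susceptible) and the handling of censored readings such as `>4` or `<=0.06` are assumptions; the CRyPTIC dataset ships the phenotype precomputed.

```python
from typing import Optional


def mic_to_phenotype(mic: str, ecoff: float) -> Optional[str]:
    """Map an MIC reading (mg/L) to 'R' or 'S' relative to an ECOFF.

    Returns None when a censored reading cannot be placed on either side.
    """
    mic = mic.strip()
    if mic.startswith(">"):
        # Growth in every well up to this concentration: the MIC lies above it.
        lower_bound = float(mic[1:])
        return "R" if lower_bound >= ecoff else None
    if mic.startswith("<="):
        # No growth even in the lowest well: the MIC is at most this value.
        upper_bound = float(mic[2:])
        return "S" if upper_bound <= ecoff else None
    return "R" if float(mic) > ecoff else "S"


# Readings in the style of the METHOD_MIC column, with the ECOFF quoted above.
readings = [">4", "0.12", "<=0.06"]
calls = [mic_to_phenotype(m, ecoff=1.0) for m in readings]
print(calls)  # ['R', 'S', 'S']
```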

In [23]:
phenotype_df = pd.read_pickle('Data_tables/DST_MEASUREMENTS-rifamycins.pkl.gz')
# keep only rifampicin (RIF) measurements
phenotype_df = phenotype_df.loc[phenotype_df.DRUG == 'RIF']
# keep one phenotype record per isolate (the first occurrence)
phenotype_df = phenotype_df.drop_duplicates(subset=['UNIQUEID'])
# index by isolate ID, raising an error if any ID is repeated
phenotype_df = phenotype_df.set_index('UNIQUEID', verify_integrity=True)
phenotype_df

UNIQUEID,DRUG,PHENOTYPE,SOURCE,METHOD_1,METHOD_2,METHOD_3,METHOD_CC,METHOD_MIC
site.06.subj.06TB_0290.lab.06MIL0881.iso.1,RIF,R,CRyPTIC,liquid media,microdilution plate,UKMYC5,0.5,>4
site.05.subj.PMFR-0724.lab.MFR-199.iso.1,RIF,S,CRyPTIC,liquid media,microdilution plate,UKMYC5,0.5,0.12
site.05.subj.PTAN-0168.lab.TAN-500.iso.1,RIF,S,CRyPTIC,liquid media,microdilution plate,UKMYC5,0.5,<=0.06
site.06.subj.MHL_0185-14.lab.06MIL0212.iso.1,RIF,R,CRyPTIC,liquid media,microdilution plate,UKMYC5,0.5,>4
site.06.subj.MHL_1323-15-R.lab.06MIL0060.iso.1,RIF,R,CRyPTIC,liquid media,microdilution plate,UKMYC5,0.5,>4
...,...,...,...,...,...,...,...,...
site.05.subj.PTAN-0340.lab.TAN-347.iso.1,RIF,S,CLIRES,MODS,,,,
site.05.subj.PTAN-0252.lab.TAN-578.iso.1,RIF,S,CLIRES,MODS,,,,
site.05.subj.PSLM-0841.lab.SLM-108.iso.1,RIF,R,CLIRES,solid media,,,,
site.05.subj.PMK-1021.lab.MK-1825.iso.1,RIF,S,CLIRES,MODS,,,,


### Merge phenotype and mutation dataframes

In [24]:
# inner join on UNIQUEID: keep only isolates with both a solo mutation and a rifampicin phenotype
merged = mutation_df.join(phenotype_df[['PHENOTYPE']], how='inner')
merged

UNIQUEID,GENE,MUTATION,POSITION,AMINO_ACID_NUMBER,GENOME_INDEX,NUCLEOTIDE_NUMBER,REF,ALT,IS_SNP,IS_INDEL,...,IS_FILTER_PASS,ELEMENT_TYPE,MUTATION_TYPE,INDEL_LENGTH,INDEL_1,INDEL_2,SITEID,NUMBER_NUCLEOTIDE_CHANGES,segid,PHENOTYPE
site.02.subj.0604.lab.241032-14.iso.1,rpoB,M587I,587.0,587.0,,,atg,ata,True,False,...,True,GENE,AAM,,,,02,1,B,S
site.02.subj.0104.lab.22A057.iso.1,rpoB,S450L,450.0,450.0,,,tcg,ttg,True,False,...,True,GENE,AAM,,,,02,1,B,R
site.02.subj.0385.lab.235016-15.iso.1,rpoB,D545E,545.0,545.0,,,gac,gag,True,False,...,True,GENE,AAM,,,,02,1,B,S
site.02.subj.0904.lab.22A138.iso.1,rpoB,H445Y,445.0,445.0,,,cac,tac,True,False,...,True,GENE,AAM,,,,02,1,B,R
site.02.subj.0951.lab.22A186.iso.1,rpoB,S450L,450.0,450.0,,,tcg,ttg,True,False,...,True,GENE,AAM,,,,02,1,B,R
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
site.10.subj.SACD00862519_S13.lab.CD00862519_S13.iso.1,rpoC,G594E,594.0,594.0,,,ggg,gag,True,False,...,True,GENE,AAM,,,,10,1,C,S
site.10.subj.SATRL0073861_S19.lab.TRL0073861_S19.iso.1,rpoB,S450L,450.0,450.0,,,tcg,ttg,True,False,...,True,GENE,AAM,,,,10,1,B,R
site.10.subj.YA00029870.lab.YA00029870.iso.1,rpoB,D435F,435.0,435.0,,,gac,ttc,True,False,...,True,GENE,AAM,,,,10,2,B,R
site.10.subj.MG04015656.lab.MG04015656.iso.1,rpoC,A1044V,1044.0,1044.0,,,gcg,gtg,True,False,...,True,GENE,AAM,,,,10,1,C,S
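
The effect of the inner join can be checked on a toy pair of tables. The sketch below uses invented isolate identifiers; it shows that only isolates present in both frames survive, and it counts the resulting classes, which is a quick way to see how imbalanced the target set is.

```python
import pandas as pd

# Toy frames indexed by isolate ID; the identifiers are invented for illustration.
mutations = pd.DataFrame(
    {"MUTATION": ["S450L", "D545E", "I491F"]},
    index=pd.Index(["iso.1", "iso.2", "iso.3"], name="UNIQUEID"),
)
phenotypes = pd.DataFrame(
    {"PHENOTYPE": ["R", "S", "S"]},
    index=pd.Index(["iso.1", "iso.2", "iso.4"], name="UNIQUEID"),
)

# An inner join on the shared index keeps only isolates present in both tables.
toy_merged = mutations.join(phenotypes[["PHENOTYPE"]], how="inner")

# Class counts give a first look at the balance of the target set.
class_counts = toy_merged["PHENOTYPE"].value_counts()
print(toy_merged)
print(class_counts)
```

`iso.3` has no phenotype and `iso.4` has no mutation record, so neither appears in the merged frame.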


# Bibliography

Brankin A, Malone KM, Barilar I, Battaglia S, Pires Brandao A, Maurizio Cabibbe A, Carter J, Maria D, Claxton P, Clifton DA, et al (2021) A data compendium of Mycobacterium tuberculosis antibiotic resistance.<a id='Brankin2021'></a>

CRyPTIC (2021) A generalisable approach to drug susceptibility prediction for M. tuberculosis using machine learning and whole-genome sequencing.<a id='CRyPTIC2021'></a>     

CRyPTIC (2019) Epidemiological cutoffs for a 96-well broth microtitre plate for high-throughput research antibiotic susceptibility testing of M. tuberculosis.<a id='CRyPTIC2019'></a>

Fowler PW, Cruz ALG, Hoosdally SJ, Jarrett L, Borroni E, Chiacchiaretta M, Rathod P, Lehmann S, Molodtsov N, Walker TM, et al (2018) Automated detection of bacterial growth on 96-well plates for high-throughput drug susceptibility testing of Mycobacterium tuberculosis. Microbiology (United Kingdom) 164: 1522–1530<a id='Fowler2018'></a>

Plate M, Bedaquiline C, Walker TM, Grazian C, Davies TJ, Peto TEA, Crook DW, Fowler PW & Cirillo DM (2018) Validating a 14-Drug Microtiter Plate Containing Bedaquiline and Delamanid for Large-Scale Research Susceptibility Testing of Mycobacterium tuberculosis. 62: 1–15<a id='Plate2018'></a>      

Walker TM, Fowler PW, Knaggs J, Hunt M, Peto TE, Walker AS, Crook DW, Walker TM, Miotto P, Cirillo DM, et al (2022) The 2021 WHO catalogue of Mycobacterium tuberculosis complex mutations associated with drug resistance: a genotypic analysis. The Lancet Microbe 3: e265–e273<a id='Walker2022'></a>   


[CRyPTIC, 2021](#CRyPTIC2021)