By Akinde Kadjo

# Introduction

Antimicrobial resistance (AMR) occurs when a microorganism (such as a bacterium) is, or evolves to be, resistant to an antimicrobial drug such as an antibiotic. Microbes resistant to multiple antimicrobials are called multidrug resistant (MDR) and are sometimes referred to as superbugs.

The goal of this project is to predict the AMR of a species from its mass spectrometry (MS) spectra. The data used here was collected at the University Hospital of Basel, Switzerland, and downloaded from [Kaggle](https://www.kaggle.com/datasets/drscarlat/driams/data). 
One of the publications ([link](https://www.biorxiv.org/content/10.1101/2020.07.30.228411v2)) using this same data treats it as a multi-classification problem. 

My approach here is to treat it as a regression problem, where the labels for each antibiotic are converted into an overall antibiotic resistance score.
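The exact scoring formula is not spelled out at this point. As a minimal sketch, assuming the score is the fraction of tested antibiotics for which a sample is resistant, the conversion could look like the following. The function name `resistance_score`, the example column names and this particular definition are illustrative assumptions, not necessarily what the notebook uses.

```python
import pandas as pd

def resistance_score(labels, antibiotic_cols):
    """Hypothetical score: share of tested antibiotics labelled 'R' per row."""
    sub = labels[antibiotic_cols]
    resistant = (sub == "R").sum(axis=1)          # number of 'R' entries
    tested = sub.isin(["R", "S"]).sum(axis=1)     # number of 'R' or 'S' entries
    # Rows with no tested antibiotic get NaN instead of a score
    return (resistant / tested).where(tested > 0)

example = pd.DataFrame({
    "Ampicillin":    ["R", "S", None],
    "Ciprofloxacin": ["S", "S", None],
    "Ceftriaxone":   ["R", None, None],
})
scores = resistance_score(example, list(example.columns))
```

With this definition a score of 0 means sensitive to every tested antibiotic and a score of 1 means resistant to all of them.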

The image below (taken from [Wikipedia](https://en.wikipedia.org/wiki/Antimicrobial_resistance)) shows the bacteria on the right being resistant to 3 of the 7 antibiotics (white rings), while the bacteria on the left are sensitive to all 7 antibiotics.

In [1]:
from IPython.display import Image

# Display the antibiotic sensitivity vs. resistance image from Wikipedia
Image(url="https://upload.wikimedia.org/wikipedia/commons/thumb/a/ab/Antibiotic_sensitivity_and_resistance.jpg/465px-Antibiotic_sensitivity_and_resistance.jpg")

# Data Pre-processing

The original dataset is very large (over 20 GB), so it is not included in this GitHub repository; it can be found [here](https://www.kaggle.com/datasets/drscarlat/driams). 
Only the DRIAMS-A folder is used.

## MS data preprocessing

The ipynb file can be found [here.]()

The provided MALDI-TOF mass spectra are stored as *.txt* files and have been binned along the mass-to-charge-ratio axis with a bin size of 3 Da, resulting in 6000 feature bins.
My preprocessing steps consisted of: selecting only the binned data column from each *.txt* file, normalizing it to a maximum of 1, removing noise by keeping only data points above 5%, storing the data as float16 to reduce memory use, keeping the filename as the id column, combining all of the data into a single dataframe, and saving it as a single compressed file.
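The steps above can be sketched roughly as follows. This is a minimal illustration, not the linked notebook's code: the helper names `preprocess_spectrum` and `build_spectra_frame` are hypothetical, the spectra are passed in as already-loaded intensity lists rather than read from the *.txt* files, and "removing noise" is assumed to mean setting bins at or below 5% of the maximum to zero.

```python
import numpy as np
import pandas as pd

def preprocess_spectrum(intensities, threshold=0.05):
    """Normalize one binned spectrum to a maximum of 1 and suppress low bins."""
    x = np.asarray(intensities, dtype=np.float32)
    peak = x.max()
    if peak > 0:
        x = x / peak
    # Assumption: noise removal means zeroing bins at or below the threshold
    x[x <= threshold] = 0.0
    return x.astype(np.float16)  # float16 to reduce memory use

def build_spectra_frame(spectra):
    """spectra: dict mapping a filename (used as id) to its binned intensities."""
    ids = list(spectra)
    matrix = np.vstack([preprocess_spectrum(spectra[i]) for i in ids])
    df = pd.DataFrame(matrix)    # one column per bin, dtype float16
    df.insert(0, "id", ids)      # keep the filename as the id column
    return df

df = build_spectra_frame({
    "a.txt": [0, 10, 1, 100],
    "b.txt": [2, 2, 0, 4],
})
# One way to save a single compressed file (file name is a placeholder):
# df.to_pickle("spectra.pkl.gz", compression="gzip")
```

Stacking the spectra into one 2-D array before building the dataframe keeps the float16 dtype for every bin column.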

## Label data preprocessing

The ipynb file can be found [here.]()

The labels are provided in *.csv* format. These files are the AMR susceptibility profile sheets, containing the code, the species and the list of antibiotics, where R stands for 'resistant' and S for 'sensitive'. My preprocessing steps consisted of: combining the column names from all 4 files, removing unnecessary (e.g. unnamed) and empty columns, combining all 4 files into a single dataframe, removing the duplicate species column, correcting misspellings in the 'species' column, filling the AMR entries of rows with the same species with the most frequent entry, and saving the result as a single compressed file.
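A rough sketch of the column cleanup and the per-species fill is shown below. It is an illustration under stated assumptions rather than the linked notebook's code: `clean_labels` is a hypothetical helper, the input frames stand in for the 4 *.csv* files, and "fill the AMR with the most frequent entry" is assumed to mean filling only missing entries with the most frequent R/S value of the same species. The spelling correction and duplicate-column steps are not covered here.

```python
import numpy as np
import pandas as pd

def clean_labels(frames, id_cols=("code", "species")):
    """Combine label sheets, drop unnamed/empty columns, fill gaps per species."""
    df = pd.concat(frames, ignore_index=True, sort=False)
    # Drop index-like columns that pandas names "Unnamed: N" when reading CSVs
    df = df.loc[:, ~df.columns.astype(str).str.startswith("Unnamed")]
    # Drop columns that contain no values at all
    df = df.dropna(axis=1, how="all").copy()

    def most_frequent(series):
        modes = series.dropna().mode()
        return modes.iloc[0] if not modes.empty else np.nan

    antibiotic_cols = [c for c in df.columns if c not in id_cols]
    for col in antibiotic_cols:
        # Assumption: only missing entries are replaced by the species-wide mode
        per_species = df.groupby("species")[col].transform(most_frequent)
        df[col] = df[col].fillna(per_species)
    return df

f1 = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "code": ["a", "b"],
    "species": ["E. coli", "E. coli"],
    "Ampicillin": ["R", None],
    "Empty": [None, None],
})
f2 = pd.DataFrame({
    "code": ["c", "d"],
    "species": ["E. coli", "S. aureus"],
    "Ampicillin": ["R", "S"],
    "Ciprofloxacin": ["S", "S"],
})
labels = clean_labels([f1, f2])
```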

# Imports