Genetic Bacteria Identification

Problem Statement

Novel DNA sequencing technologies have proliferated over the past two decades. Continual improvements in “next-generation sequencing” (NGS) and “third-generation sequencing” (TGS) have increased the fidelity and rate of sequencing, but it still takes hours or days to obtain complete sequences. Furthermore, in some diagnostic applications, very rapid identification of a particular gene or genetic species is essential, while identification of all genes is not necessary. For example, in patients with septic shock from bacterial infections, identifying antibiotic-resistance genes is essential because the mortality rate increases 7.6% per hour of delay in administering the correct antibiotics. Unfortunately, it takes more than 24 h to culture the bacteria recovered from the blood of an infected patient, identify the species, and then determine which antibiotics the organism is resistant to, leading to a very high mortality rate for such infections.

Bacterial antibiotic resistance is becoming a significant health threat, and rapid identification of antibiotic-resistant bacteria is essential to save lives and reduce the spread of antibiotic resistance.

Objective

Our objective was to create a model that classifies 10 different bacterial species from the output of a genomic analysis technique, by comparison with available bacterial DNA sequences.

Data Description

The dataset covers 10 classes of bacteria. It is built from 10-mer snippets of DNA, which are sampled and analyzed to give a histogram of base counts. Each row of data contains a spectrum of histograms generated by repeated measurements of a sample, with one value for each of the 286 possible histograms. The data (both train and test) also contain simulated measurement errors (at varying rates) for many of the samples, which makes the problem more challenging.
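As a rough illustration of where the figure of 286 comes from, the sketch below counts the distinct base-count histograms a 10-mer can have. It assumes that a histogram is simply the number of A, T, G and C bases in the snippet; the base order and the helper names `base_histogram` and `all_histograms` are hypothetical and are not taken from the notebook.

```python
from collections import Counter
from itertools import product
from math import comb

BASES = "ATGC"  # assumed base order, for illustration only
K = 10          # snippet length (10-mer)


def base_histogram(kmer):
    """Return the (A, T, G, C) counts of a DNA snippet as a tuple."""
    counts = Counter(kmer)
    return tuple(counts.get(base, 0) for base in BASES)


def all_histograms(k=K):
    """Enumerate every way to split k bases among the four letters."""
    return [
        combo
        for combo in product(range(k + 1), repeat=len(BASES))
        if sum(combo) == k
    ]


histograms = all_histograms()
example = base_histogram("ATGCATGCAA")

# The enumeration agrees with the stars-and-bars count C(k + 3, 3).
print(len(histograms), comb(K + 3, 3))
print(example)
```

Under this assumption, each of the 286 columns in a row corresponds to one such count tuple, and the value in the column reflects how often that histogram was observed across the repeated measurements.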

Dataset

You can find the dataset that was used at this link.

Libraries

1. Python>=3.8
2. Numpy>=1.19
3. Pandas>=1.3.5
4. Seaborn>=0.11.2 
5. scikit-learn>=0.22
6. Matplotlib>=1.19

Cloud Tools

1. Google Drive
2. Google Colab

Install libraries

pip install -r requirements.txt
