
Machine Learning model that classifies 10 different bacteria species using the data from a genomic analysis technique by comparison to available bacterial DNA sequences


AndreasAvgou/Genetic-Bacteria-Identification


Genetic Bacteria Identification


Problem Statement

Novel DNA sequencing technologies have proliferated over the past two decades. Continual improvements in “next-generation sequencing” (NGS) and “third-generation sequencing” (TGS) have increased the fidelity and rate of sequencing, but it still takes hours or days to obtain complete sequences. Some diagnostic applications, however, require very rapid identification of a particular gene or species, while identification of all genes is unnecessary. For example, in patients with septic shock from bacterial infections, identifying antibiotic-resistance genes is essential because the mortality rate increases 7.6% per hour of delay in administering the correct antibiotics. Unfortunately, it takes more than 24 hours to culture the bacteria recovered from the blood of an infected patient, identify the species, and then determine which antibiotics the organism is resistant to, leading to a very high mortality rate for such infections.

Bacterial antibiotic resistance is becoming a significant health threat, and rapid identification of antibiotic-resistant bacteria is essential to save lives and reduce the spread of antibiotic resistance.

Objective

Our objective was to create a model that classifies 10 different bacteria species from data produced by a genomic analysis technique, by comparison with available bacterial DNA sequences.

Data Description

The dataset consists of 10 different classes of bacteria. It is built from 10-mer snippets of DNA, which are sampled and analyzed to give a histogram of base counts. Each row of data contains a spectrum of histograms generated by repeated measurements of a sample, covering all 286 histogram possibilities. The data (both train and test) also contains simulated measurement errors (at varying rates) for many of the samples, which makes the problem more challenging.
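The figure of 286 matches the number of ways to split 10 positions among the four bases A, C, G, and T when order is ignored (a "stars and bars" count, C(13, 3) = 286). The sketch below checks that count and shows what a base-count histogram for a single 10-mer might look like. It assumes this interpretation of the features, and the helper names `n_histograms` and `base_histogram` are illustrative and not part of the repository.

```python
from collections import Counter
from math import comb

BASES = "ACGT"  # the four DNA bases, in a fixed order
K = 10          # snippet length (10-mer), as stated in the data description


def n_histograms(k=K, n_bases=len(BASES)):
    """Count the distinct base-count histograms for a k-mer.

    This is the number of multisets of size k drawn from n_bases symbols.
    """
    return comb(k + n_bases - 1, n_bases - 1)


def base_histogram(kmer):
    """Return the (A, C, G, T) counts for one DNA snippet (hypothetical helper)."""
    counts = Counter(kmer)
    return tuple(counts.get(base, 0) for base in BASES)


print(n_histograms())                # 286
print(base_histogram("ACGTACGTAA"))  # (4, 2, 2, 2)
```

Under this reading, each of the 286 columns would correspond to one possible (A, C, G, T) count combination summing to 10.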

Dataset

The dataset used is available at this link.

Libraries

1. Python>=3.8
2. Numpy>=1.19
3. Pandas>=1.3.5
4. Seaborn>=0.11.2 
5. Scikit-learn>=0.22
6. Matplotlib>=1.19
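
To confirm that an environment satisfies the minimum versions listed above, a small check along these lines may help. It is a sketch only: the `MINIMUMS` mapping copies the versions from the list as written, the distribution names (for example `scikit-learn`) are assumed to be the PyPI names, and the version comparison is deliberately simple (numeric components only).

```python
from importlib.metadata import PackageNotFoundError, version
from itertools import takewhile

# Minimum versions copied from the list above (names assumed to be PyPI names).
MINIMUMS = {
    "numpy": "1.19",
    "pandas": "1.3.5",
    "seaborn": "0.11.2",
    "scikit-learn": "0.22",
    "matplotlib": "1.19",
}


def as_tuple(ver, width=3):
    """Convert a version string such as '1.3.5' into a padded tuple of ints."""
    parts = []
    for piece in ver.split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break  # stop at the first non-numeric component (e.g. 'dev0')
        parts.append(int(digits))
    parts = parts[:width]
    return tuple(parts + [0] * (width - len(parts)))


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    return as_tuple(installed) >= as_tuple(minimum)


for name, minimum in MINIMUMS.items():
    try:
        installed = version(name)
    except PackageNotFoundError:
        print(f"{name}: not installed (need >= {minimum})")
        continue
    status = "ok" if meets_minimum(installed, minimum) else "too old"
    print(f"{name}: {installed} (need >= {minimum}) {status}")
```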

Cloud Tools

1. Google Drive
2. Google Colab

Install libraries

pip install -r requirements.txt
