Development of an HIV tropism Deep Neural Network using PyTorch

GabrielSGoncalves/DeepTropism

DeepTropism

Development of a Deep Neural Network for predicting HIV-1 tropism using PyTorch.

Introduction

HIV-1 remains a pandemic, with around 37 million people infected globally. The past decades have brought great improvements in antiretroviral therapy, helping patients fight the infection and increasing life span; more than 25 drugs from 6 different classes are available for treatment. One drug class, the chemokine receptor antagonists, acts on the target cell co-receptors, CCR5 or CXCR4, that the virus uses to infect human cells. Maraviroc is an approved chemokine receptor antagonist that blocks the interaction of the virus with CCR5 on the patient's cell membrane. Because most early infections are caused by CCR5-tropic viruses, Maraviroc can be used successfully to treat these patients. As the infection progresses, however, the virus can evolve to use CXCR4 to infect new cells, which makes Maraviroc ineffective at blocking viral replication. Before prescribing Maraviroc, it is therefore important to know which viral tropism (R5-tropic or X4-tropic) is causing the infection. Different machine learning methods have been applied over the past decades to predict HIV tropism, but there is still room for improvement, as the performance of the available tools remains insufficient.

Building the Dataset

To develop a model that performs well across the broad diversity of HIV-1, we assembled a dataset from 5 previously published datasets on viral tropism. The total number of sequences was 9550, with lengths ranging from 21 to 35 amino acids. After labeling each sample as CCR5-tropic (coded as 0) or CXCR4-tropic (coded as 1), we removed duplicated sequences to avoid bias, resulting in a final dataset of 3608 unique sequences. This dataset was used for training, validation, and testing. Unique sequences in the dataset:

  • CCR5: 2779 samples (labeled as 0)
  • R5X4: 485 samples (labeled as 1)
  • CXCR4: 344 samples (labeled as 1)
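The labeling and de-duplication step could be sketched roughly as follows. This is a minimal illustration and not the project's actual preprocessing code: the `LABELS` mapping, the `deduplicate` helper, and the toy sequences are all hypothetical, and keeping the first occurrence of a repeated sequence is an assumption.

```python
# Hypothetical label map: dual-tropic R5X4 is grouped with CXCR4 as class 1,
# matching the class coding listed above.
LABELS = {"CCR5": 0, "R5X4": 1, "CXCR4": 1}


def deduplicate(records):
    """Keep the first occurrence of each sequence and attach its binary label."""
    seen = set()
    unique = []
    for sequence, tropism in records:
        if sequence in seen:
            continue  # exact duplicate of an earlier sequence, skip it
        seen.add(sequence)
        unique.append((sequence, LABELS[tropism]))
    return unique


# Toy records (made-up fragments, not taken from the real dataset).
toy_records = [
    ("CTRPNNNTRKSI", "CCR5"),
    ("CTRPNNNTRKSI", "CCR5"),   # duplicate, dropped
    ("CTRPGNKTRRRI", "CXCR4"),
    ("CTRPNNKTRKRI", "R5X4"),
]
unique_records = deduplicate(toy_records)
```

The per-class counts listed above add up to the reported total: 2779 + 485 + 344 = 3608.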

Challenges I ran into

Feature extraction

One of the main challenges when dealing with genetic data is how to transform information stored as strings into arrays and matrices that can be used as input for training machine learning models. To make the protein sequences comparable, we used the alignment created by the Los Alamos National Laboratory (HIV Sequence Compendium) to align the sequences in our dataset. The result was a 44-character string for each sequence, with letters representing the amino acids and '-' representing gaps. These aligned sequences were used as raw data for the next step.

To convert the protein sequences into tensors, we decided to use a simple one-hot encoding of each amino acid. A traditional amino acid one-letter code table has 20 different amino acids, an X representing any residue, and a few ambiguity codes (such as B, Z, and J), totaling 26 positions. For each letter in our V3 loop sequence, we created a numpy array of zeros of size 1 by 26 and set the position corresponding to the amino acid to 1, so that each amino acid was represented by one position in the array. Each aligned protein sequence of 44 characters therefore resulted in a numpy array of 44 by 26. This matrix was then flattened into an array of 1 by 1144 (44 × 26) and used as the input tensor of our Deep Neural Network.
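The encoding described above can be sketched as below. The exact 26-symbol alphabet and its ordering are not stated here, so using the letters A-Z is an assumption, as is leaving gap positions ('-') as all-zero rows. The `one_hot_encode` helper and the toy sequence (a V3-like string padded with gaps rather than a real alignment) are hypothetical.

```python
import string

import numpy as np

# Assumption: the 26 positions correspond to the uppercase letters A-Z.
ALPHABET = string.ascii_uppercase
INDEX = {letter: i for i, letter in enumerate(ALPHABET)}
ALIGNED_LENGTH = 44


def one_hot_encode(aligned_sequence):
    """Encode a 44-character aligned sequence as a 1 x 1144 float array."""
    if len(aligned_sequence) != ALIGNED_LENGTH:
        raise ValueError(f"expected {ALIGNED_LENGTH} characters")
    matrix = np.zeros((ALIGNED_LENGTH, len(ALPHABET)), dtype=np.float32)
    for position, symbol in enumerate(aligned_sequence):
        if symbol == "-":
            continue  # assumption: a gap is encoded as an all-zero row
        matrix[position, INDEX[symbol]] = 1.0
    # Flatten the 44 x 26 matrix row by row into a single 1 x 1144 vector.
    return matrix.reshape(1, -1)


# Toy input: 35 residues padded with gaps to the aligned length of 44.
toy_sequence = "CTRPNNNTRKSIHIGPGRAFYATGDIIGDIRQAHC".ljust(ALIGNED_LENGTH, "-")
features = one_hot_encode(toy_sequence)
```

With this layout, the feature at index `position * 26 + INDEX[symbol]` is 1 for each non-gap position, and the vector can be converted to a tensor with `torch.from_numpy(features)`.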

Defining the architecture of our Neural Network

One of the main questions data scientists ask when faced with a new Deep Learning problem is which architecture to use. A good approach is to start simple and gradually increase complexity, and that is exactly what we did. We decided to use a simple DNN architecture of 3 fully connected layers with 1144, 250, and 100 nodes. The output layer consists of 2 nodes representing our binary outcome.
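One possible reading of this architecture in PyTorch is sketched below: 1144 input features feeding hidden layers of 250 and 100 nodes, followed by a 2-node output. The class name `TropismNet` is hypothetical, and the ReLU activations are an assumption, since the activation function is not stated above.

```python
import torch
from torch import nn


class TropismNet(nn.Module):
    """Fully connected network: 1144 -> 250 -> 100 -> 2 (one possible reading)."""

    def __init__(self, n_features=1144, hidden1=250, hidden2=100, n_classes=2):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(n_features, hidden1),
            nn.ReLU(),  # assumption: activation is not specified in the text
            nn.Linear(hidden1, hidden2),
            nn.ReLU(),
            nn.Linear(hidden2, n_classes),  # raw scores (logits) for the 2 classes
        )

    def forward(self, x):
        return self.layers(x)


model = TropismNet()
batch = torch.zeros(8, 1144)  # dummy batch of 8 flattened one-hot sequences
logits = model(batch)         # one row of 2 logits per sequence
```

The logits can be passed directly to `nn.CrossEntropyLoss` together with integer class labels (0 or 1), which applies the softmax internally.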

Present results

We have created a new dataset of curated, unique sequences that can be used to train other models of viral tropism. In our results, the performance of the DeepTropism neural network surpassed all the published tools in the field. By using simple feature extraction and a simple architecture, we have shown the potential of PyTorch for solving problems that have been around for many years and could not be tackled with traditional algorithms and tools.

Built With

  • Python3 - The programming language used.
  • PyTorch - An open source machine learning framework for Deep Learning in Python.
  • JupyterLab - Web-based user interface for Project Jupyter.
  • Conda - Package, dependency and environment management for any language.

Contributing

If you are interested in contributing to the project, please contact me by email (gabrielgoncalvesbr@gmail.com).

Author

License

This project is licensed under the MIT License - see the LICENSE.md file for details
