DanielStreicker/ViralHostPredictor

Predicting Reservoir Hosts and Arthropod Vectors from Evolutionary Signatures in RNA Virus Genomes

Simon A. Babayan, Richard J. Orton and Daniel G. Streicker

Background

This repository contains the scripts and datasets described in Babayan et al. (2018) Science, doi: 10.1126/science.aap9072. They use gradient boosting machines to predict three traits of RNA viruses: the reservoir host, whether the virus is transmitted by an arthropod vector, and the identity of that vector.

File descriptions

Datasets:

BabayanEtAl_sequences.fasta: Contains coding sequences for all viruses used in the analyses.

EbolaTimeSeriesData.csv: Contains epidemiological data and genomic features for Zaire ebolaviruses sampled during the 2014-2016 West African outbreak.

BabayanEtAl_VirusData.csv: Contains the reservoir host, arthropod-borne transmission status and vector taxa for all ssRNA viruses analyzed, along with the features extracted from the genome of each virus.
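As a minimal sketch of how the sequence file might be read without third-party libraries, the parser below yields (header, sequence) pairs from FASTA-formatted text. The function name `read_fasta` and the two sample records are illustrative assumptions and are not taken from the repository. The header layout used in BabayanEtAl_sequences.fasta is not described here, so headers are returned unparsed.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines may be wrapped; collect and normalise case.
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical records for illustration only (not from the dataset).
example = io.StringIO(">virusA\nATGGCT\nGCTTAA\n>virusB\natgaaataa\n")
records = dict(read_fasta(example))

# With the real file, the same function could be used as:
#     with open("BabayanEtAl_sequences.fasta") as fh:
#         records = dict(read_fasta(fh))
```

Returning a generator keeps memory use low for large files; wrapping it in `dict` is only convenient when headers are unique.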

R scripts:

arthropodBorne_featureSelection.R: Uses gradient boosting machines in h2o to estimate average feature importances for predicting arthropod-borne transmission across different training sets.

arthropodBorne_PN+selGen.R: Uses gradient boosting machines in h2o to train and generate study-wide (bagged) predictions of the arthropod-borne transmission status of each virus using phylogenetic neighborhoods and genomic features selected by arthropodBorne_featureSelection.R.

arthropodBorne_PN.R: Uses gradient boosting machines in h2o to train and generate study-wide (bagged) predictions of the arthropod-borne transmission status of each virus using phylogenetic neighborhoods.

arthropodBorne_selGen.R: Uses gradient boosting machines in h2o to train and generate study-wide (bagged) predictions of the arthropod-borne transmission status of each virus using genomic features selected by arthropodBorne_featureSelection.R.

reservoir_featureSelection.R: Uses gradient boosting machines in h2o to estimate average feature importances for predicting reservoir hosts across different training sets.

reservoirPredict_PN+selGen.R: Uses gradient boosting machines in h2o to train and generate study-wide (bagged) predictions of the reservoir host of each virus using phylogenetic neighborhoods and genomic features selected by reservoir_featureSelection.R.

reservoirPredict_PN.R: Uses gradient boosting machines in h2o to train and generate study-wide (bagged) predictions of the reservoir host of each virus using phylogenetic neighborhoods.

reservoirPredict_selGen.R: Uses gradient boosting machines in h2o to train and generate study-wide (bagged) predictions of the reservoir host of each virus using genomic features selected by reservoir_featureSelection.R.

vectorPredict_featureSelection.R: Uses gradient boosting machines in h2o to estimate average feature importances for predicting arthropod vectors across different training sets.

vectorPredict_PN+selGen.R: Uses gradient boosting machines in h2o to train and generate study-wide (bagged) predictions of the vector of each virus using phylogenetic neighborhoods and genomic features selected by vectorPredict_featureSelection.R.

vectorPredict_PN.R: Uses gradient boosting machines in h2o to train and generate study-wide (bagged) predictions of the vector of each virus using phylogenetic neighborhoods.

vectorPredict_selGen.R: Uses gradient boosting machines in h2o to train and generate study-wide (bagged) predictions of the vector of each virus using genomic features selected by vectorPredict_featureSelection.R.
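The two ideas that recur in the script descriptions above, averaging feature importances over several training sets and combining the predictions of several models (bagging), can be sketched in plain Python. This is not the code from the R scripts, which use h2o; the helper names (`mean_by_key`, `top_features`, `bagged_prediction`), the feature names, the class names and all numbers are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def mean_by_key(runs: Sequence[Dict[str, float]]) -> Dict[str, float]:
    """Average the value stored under each key over all runs.

    A key that is absent from a run counts as 0 for that run.
    """
    totals = defaultdict(float)
    for run in runs:
        for key, value in run.items():
            totals[key] += value
    return {key: total / len(runs) for key, total in totals.items()}


def top_features(mean_importance: Dict[str, float], k: int) -> List[str]:
    """Return the k features with the highest mean importance."""
    ranked = sorted(mean_importance, key=lambda f: (-mean_importance[f], f))
    return ranked[:k]


def bagged_prediction(probabilities: Sequence[Dict[str, float]]) -> str:
    """Average class probabilities over models; return the top class."""
    mean_prob = mean_by_key(probabilities)
    # Sorting first makes ties resolve alphabetically.
    return max(sorted(mean_prob), key=mean_prob.get)


# Hypothetical importances from models fitted to three training sets.
importance_runs = [
    {"feature_a": 0.50, "feature_b": 0.30, "feature_c": 0.20},
    {"feature_a": 0.25, "feature_b": 0.50, "feature_c": 0.25},
    {"feature_a": 0.75, "feature_b": 0.00, "feature_c": 0.25},
]
selected = top_features(mean_by_key(importance_runs), k=2)

# Hypothetical class probabilities for one virus from three models.
model_outputs = [
    {"class_1": 0.6, "class_2": 0.4},
    {"class_1": 0.3, "class_2": 0.7},
    {"class_1": 0.7, "class_2": 0.3},
]
predicted = bagged_prediction(model_outputs)
```

Averaging over training sets reduces the influence of any single split on which features are selected and on the final predicted class.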

Python script:

algo_comparison.py: Compares the predictive power of a variety of competing machine learning algorithms to predict reservoir hosts, arthropod-borne transmission and vector taxa from all possible genomic features.
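A comparison of competing algorithms is commonly done by scoring each one on held-out data with k-fold cross-validation. The sketch below shows that general pattern using only the standard library. It does not reproduce algo_comparison.py: the algorithms and metric that script uses are not stated here, so the two toy classifiers, the accuracy metric, the fold assignment and the data are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Features = Tuple[float, ...]
Sample = Tuple[Features, str]            # (features, label)
Predictor = Callable[[Features], str]    # features -> predicted label
Trainer = Callable[[List[Sample]], Predictor]


def majority_class(train: List[Sample]) -> Predictor:
    """Baseline: always predict the most common training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label


def nearest_neighbour(train: List[Sample]) -> Predictor:
    """1-nearest-neighbour using squared Euclidean distance."""
    def predict(x: Features) -> str:
        def dist(sample: Sample) -> float:
            return sum((a - b) ** 2 for a, b in zip(sample[0], x))
        return min(train, key=dist)[1]
    return predict


def cross_val_accuracy(trainer: Trainer, data: Sequence[Sample], k: int) -> float:
    """Pooled held-out accuracy over k folds (sample i goes to fold i % k)."""
    correct = 0
    for fold in range(k):
        test = [s for i, s in enumerate(data) if i % k == fold]
        train = [s for i, s in enumerate(data) if i % k != fold]
        model = trainer(train)
        correct += sum(model(x) == y for x, y in test)
    return correct / len(data)


# Hypothetical two-feature data with two well-separated classes.
data = [
    ((0.0, 0.1), "A"), ((1.0, 0.9), "B"), ((0.2, 0.0), "A"), ((0.9, 1.1), "B"),
    ((0.1, 0.2), "A"), ((1.1, 1.0), "B"), ((0.0, 0.0), "A"), ((1.0, 1.0), "B"),
]
candidates = {"majority_class": majority_class,
              "nearest_neighbour": nearest_neighbour}
scores = {name: cross_val_accuracy(trainer, data, k=4)
          for name, trainer in candidates.items()}
best = max(scores, key=scores.get)
```

Every candidate is trained and scored on the same folds, so differences in score reflect the algorithms and not the data split.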
