Python scripts to extract sequence-derived features of small open reading frames (sORFs) and to perform random forest (RF)-based validation and prediction
Extract all features of the training dataset
Grid search with five-fold cross-validation on prokaryotic species to select the best parameters of the RF model (number of trees, tree depth, number of samples required for a split) for prediction
Train the model, perform cross-species prediction, and report the performance
Extract all features of the testing datasets
Grid search with five-fold cross-validation on eukaryotic species to select the best parameters of the RF model for prediction
Cross-species prediction (one-to-one)
Merge all kinds of features from all species and perform cross-species prediction (four-to-one)
Helper functions that calculate the sequence-derived features; these are called by the scripts above
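
For orientation, here is a minimal sketch of what a sequence-derived feature function might look like. The descriptions above do not list the actual features, so sequence length, GC content, and 3-mer frequencies are assumptions chosen for illustration, and the function names (gc_content, kmer_frequencies, sorf_features) are hypothetical rather than taken from the scripts.

```python
from collections import Counter
from itertools import product


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def kmer_frequencies(seq: str, k: int = 3) -> dict:
    """Relative frequency of every overlapping k-mer over the alphabet ACGT.

    k-mers containing other symbols (for example N) are skipped.
    """
    seq = seq.upper()
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[kmer] for kmer in all_kmers)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}


def sorf_features(seq: str, k: int = 3) -> list:
    """Fixed-length vector: [length, GC content, k-mer frequencies in sorted order].

    Hypothetical feature set, assumed for illustration only.
    """
    freqs = kmer_frequencies(seq, k)
    return [float(len(seq)), gc_content(seq)] + [freqs[m] for m in sorted(freqs)]


if __name__ == "__main__":
    vector = sorf_features("ATGGCGTAA")
    print(len(vector), vector[:2])
```

Because every sequence maps to a vector of the same length (2 + 4^k values), vectors from several species can be stacked into one matrix and passed to a random forest classifier such as scikit-learn's RandomForestClassifier.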