BioPrediction: Democratizing Machine Learning in the Study of Molecular Interactions

Democratizing Machine Learning in Life Sciences

Home • AutoAI Pandemics • Installing • How To Use • Citation

Main Reference:

Published paper (in Portuguese): ENIAC 2023 Proceedings (Anais)

BioPrediction is part of a larger project that aims to democratize Machine Learning for the analysis, study, and control of epidemics and pandemics. Take a look!

Abstract

Given the increasing number of biological sequences stored in databases, there is a large source of information that can benefit several sectors such as agriculture and health. Machine Learning (ML) algorithms can extract useful and new information from these data, increasing social and economic benefits, in addition to productivity. However, the categorical and unstructured nature of biological sequences makes this process difficult, requiring ML expertise. In this paper, we propose and experimentally evaluate an end-to-end automated ML-based framework, named BioPrediction, able to identify implicit interactions between sequences, e.g., long non-coding RNA and protein pairs, without the need for end-to-end ML expertise. Our experimental results show that the proposed framework can induce ML models with high predictive accuracy, between 77% and 91%, which are competitive with state-of-the-art tools.

First study to propose an automated feature engineering and model training pipeline to classify interactions between biological sequences;
The pipeline was mainly tested on datasets regarding lncRNA-protein interactions. The maintainers are further expanding their support to work with other molecules;
BioPrediction can accelerate new studies by shortening the time-consuming feature engineering stage and improving the design and performance of ML pipelines in bioinformatics;
BioPrediction does not require specialist human assistance.

Maintainers

Robson Parmezan Bonidia, Bruno Rafael Florentino and Natan Henrique Sanches.
Correspondence: rpbonidia@gmail.com or bonidia@usp.br, brunorf1204@usp.br, natan.sanches@usp.br

Installing dependencies and package

Via miniconda (Terminal)

Install BioPrediction using Miniconda to manage its dependencies, e.g.:

$ git clone https://github.com/Bonidia/BioPrediction.git BioPrediction

$ cd BioPrediction

$ git submodule init

$ git submodule update

1 - Install Miniconda:

See documentation: https://docs.conda.io/en/latest/miniconda.html

$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

$ chmod +x Miniconda3-latest-Linux-x86_64.sh

$ ./Miniconda3-latest-Linux-x86_64.sh

$ export PATH=~/miniconda3/bin:$PATH

2 - Create environment:

$ conda env create -f BioPrediction-env.yml -n bioprediction

3 - Activate environment:

$ conda activate bioprediction

4 - You can deactivate the environment using:

$ conda deactivate

How to use

Execute the BioPrediction pipeline with the following command:

...
To run the code (Example): $ python BioPrediction.py -h

where:

 -input1_fasta_train: txt or fasta format file with the sequences, e.g., data/dataset_1/lncRNA_Sequence.txt
 -input1_fasta_test:  txt or fasta format file with the sequences, e.g., data/dataset_1/lncRNA_Sequence.txt
 -label_1:            the class label for this sequence type, e.g., lncRNA
 -sequence_type1:     type of biological sequence,  e.g., RNA

 -input2_fasta_train: txt or fasta format file with the sequences, e.g., data/dataset_1/protein_Sequence.txt
 -input2_fasta_test:  txt or fasta format file with the sequences, e.g., data/dataset_1/protein_Sequence.txt
 -label_2:            the class label for this sequence type, e.g., enzymes
 -sequence_type2:     type of biological sequence,  e.g., protein

 -interaction_table:  csv format file with the interaction matrix, e.g., data/dataset_1/LPI.csv
 -output:             output path, e.g., experiment_1

 -n_cpu:  number of cpus - default = 1
 -estimations: number of estimations - default = 10
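Before launching the pipeline, it can help to confirm that the sequence files parse as expected. The sketch below is a minimal FASTA reader and is not part of BioPrediction. The function name `read_fasta` is hypothetical, and treating the first whitespace-delimited token of each header as the sequence identifier is an assumption made for illustration.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {identifier: sequence}.

    Hypothetical helper for illustration; it is not part of BioPrediction.
    """
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Assumption: the identifier is the first token of the header.
            header = line[1:].split()
            if not header:
                raise ValueError("FASTA header without an identifier")
            current_id = header[0]
            if current_id in records:
                raise ValueError(f"duplicate identifier: {current_id}")
            records[current_id] = ""
        elif current_id is None:
            raise ValueError("sequence data found before the first header")
        else:
            # A sequence may span several lines; concatenate them.
            records[current_id] += line
    return records


# In-memory example; an open file handle can be passed the same way.
example = [">seq1 toy record", "ACGU", "GGCA", ">seq2", "UUAA"]
print(read_fasta(example))
```

To check a real input, pass an open file instead of a list, e.g. `read_fasta(open("data/dataset_1/lncRNA_Sequence.txt"))`, and compare the number of records with what you expect.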


Execution example:
$ python BioPrediction.py -input1_fasta_train data/dataset_1/lncRNA_Sequence.txt -label_1 LncRNA -sequence_type1 DNA -input2_fasta_train data/dataset_1/protein_Sequence.txt -label_2 rProtein -sequence_type2 Protein -interaction_table data/dataset_1/LPI.csv -output results1


Note: Providing a test dataset is optional.
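When running several experiments, the long command line can be assembled from a dictionary instead of being typed by hand. The following is a minimal sketch, not part of BioPrediction: `build_command` is a hypothetical helper, the flag names and values are copied from the execution example, and the script is assumed to be saved as `BioPrediction.py` (adjust `script` if your checkout differs).

```python
import subprocess
import sys
from typing import Dict, List


def build_command(options: Dict[str, object], script: str = "BioPrediction.py") -> List[str]:
    """Turn {flag_name: value} into an argument list for the script.

    Hypothetical helper; flag names are passed through unchanged.
    """
    command = [sys.executable, script]
    for flag, value in options.items():
        command += [f"-{flag}", str(value)]
    return command


# Values copied from the execution example in this README.
options = {
    "input1_fasta_train": "data/dataset_1/lncRNA_Sequence.txt",
    "label_1": "LncRNA",
    "sequence_type1": "DNA",
    "input2_fasta_train": "data/dataset_1/protein_Sequence.txt",
    "label_2": "rProtein",
    "sequence_type2": "Protein",
    "interaction_table": "data/dataset_1/LPI.csv",
    "output": "results1",
}

command = build_command(options)
print(" ".join(command))

# Uncomment to launch the pipeline from the repository root
# (with the bioprediction environment active):
# subprocess.run(command, check=True)
```

Optional flags such as `n_cpu` or the test-set inputs can be added as extra dictionary entries; omitting them leaves the defaults described above in effect.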

Citation

If you use this code in a scientific publication, we would appreciate citations to the following paper:

In progress...
