
EPPuGNN

This is the repository of a senior project titled Essential Protein Prediction using Graph Neural Networks. The project investigates state-of-the-art Graph Neural Network models suited to essentiality prediction. The GNN models used are node2vec, GraphSAGE, and two diffusion-based GNNs, namely GRAND and BLEND. Other computational and topological methods are included to make the progress easier to compare. XGBoost is used as the classification algorithm on top of the unsupervised models. The use of several biological information sources, such as gene expression and GO annotations, to enhance the predictions is also analyzed. Relevant materials, including code, data, and documents, are published in this repository.

How To

Prerequisites

Clone the repository

git clone https://github.com/Saydemr/EPPuGNN.git
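
Then change into the cloned directory (assuming the default directory name created by git) so that the commands below can find environment.yml and the scripts:

cd EPPuGNN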

Create Environment

conda env create -f environment.yml
conda activate eppugnn

This might fail with an error saying pip could not find a torch version matching 1.11.0+cu113. In that case, remove the lines below from environment.yml and install these packages manually with pip (example commands are given after the list).

- pykeops==2.1
- ogb==1.2.1
- torch==1.11.0+cu113
- torch-cluster==1.6.0
- torch-geometric==2.0.3
- torch-scatter==2.0.9
- torch-sparse==0.6.13
- torch-spline-conv==1.2.1
- torchdiffeq==0.2.3
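
One possible way to do the manual install, assuming a CUDA 11.3 setup and the standard PyTorch and PyG wheel indexes (the exact indexes are not pinned in this repository, so adjust them to your machine):

# PyTorch built against CUDA 11.3
pip install torch==1.11.0+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
# PyG companion libraries compiled for torch 1.11.0 + cu113
pip install torch-scatter==2.0.9 torch-sparse==0.6.13 torch-cluster==1.6.0 torch-spline-conv==1.2.1 -f https://data.pyg.org/whl/torch-1.11.0+cu113.html
# remaining packages from PyPI
pip install torch-geometric==2.0.3 torchdiffeq==0.2.3 ogb==1.2.1 pykeops==2.1

Afterwards, python -c "import torch, torch_geometric; print(torch.__version__)" is a quick way to confirm the environment is usable.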

Download requirements and latest biological data

python update.py

If the script reports any errors, you can download the missing files from here. Downloaded files must be placed under the ./data directory before running the next commands.

Biological data are obtained from the BioGRID, COMPARTMENTS, and NCBI GEO databases. Links to the files can be found inside the script.
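
If you had to fetch any of the files by hand, a quick check that they ended up in the expected place before continuing is, for example:

ls ./data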

Compile data needed for each GNN

The preprocessor takes an organism name as an argument. It compiles the necessary information for the given organism and saves it under the ./data directory. If you want to create data for all organisms, run the following command.

cd ./data
python compose_data.py --organism all

If you want to create data for a specific organism, run the command with its abbreviation instead.

cd ./data
python compose_data.py --organism sc

The outputs to be used by each GNN will be placed under the respective directories.

Run the GNNs to get results

For now, refer to the GitHub page of each GNN. This part will be updated once the automated pipeline is ready. We forked these GNNs to integrate some necessary features that were missing from the original repositories.

Project Information

Members

Special Thanks
