
EPPuGNN

This is the repository of a senior project titled Essential Protein Prediction using Graph Neural Networks. The project investigates state-of-the-art Graph Neural Network models suited to essentiality prediction. The GNN models used are node2vec, GraphSAGE, and two diffusion-based GNNs, namely GRAND and BLEND. Other computational and topological methods are included to make the progress easier to compare. XGBoost is used as the classification algorithm on top of the unsupervised models. The use of several biological information sources, such as gene expression and GO annotations, to enhance the predictions is also analyzed. Relevant materials, including code, data, and documents, are published in this repository.

How To

Prerequisites

Clone the repository

git clone https://github.com/Saydemr/EPPuGNN.git
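
Then change into the cloned directory (assuming the default directory name created by git) so that the commands below can find environment.yml and the scripts:

cd EPPuGNN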

Create Environment

conda env create -f environment.yml
conda activate eppugnn

This might fail with an error saying pip could not find a torch version matching 1.11.0+cu113. In that case, remove the lines below from environment.yml and install these packages manually with pip (example commands are given after the list).

- pykeops==2.1
- ogb==1.2.1
- torch==1.11.0+cu113
- torch-cluster==1.6.0
- torch-geometric==2.0.3
- torch-scatter==2.0.9
- torch-sparse==0.6.13
- torch-spline-conv==1.2.1
- torchdiffeq==0.2.3
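
One possible way to do the manual install, assuming a CUDA 11.3 setup and the standard PyTorch and PyG wheel indexes (the exact indexes are not pinned in this repository, so adjust them to your machine):

# PyTorch built against CUDA 11.3
pip install torch==1.11.0+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
# PyG companion libraries compiled for torch 1.11.0 + cu113
pip install torch-scatter==2.0.9 torch-sparse==0.6.13 torch-cluster==1.6.0 torch-spline-conv==1.2.1 -f https://data.pyg.org/whl/torch-1.11.0+cu113.html
# remaining packages from PyPI
pip install torch-geometric==2.0.3 torchdiffeq==0.2.3 ogb==1.2.1 pykeops==2.1

Afterwards, python -c "import torch, torch_geometric; print(torch.__version__)" is a quick way to confirm the environment is usable.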

Download requirements and latest biological data

python update.py

If the script reports any errors, you can download the missing files from here. Downloaded files must be placed under the ./data directory before running the next commands.

Biological data are obtained from the BioGRID, COMPARTMENTS, and NCBI GEO databases. Links to the files can be found inside the script.
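
If you had to fetch any of the files by hand, a quick check that they ended up in the expected place before continuing is, for example:

ls ./data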

Compile data needed for each GNN

The preprocessor takes an organism name as an argument. It compiles the necessary information for the given organism and saves it under the ./data directory. If you want to create data for all organisms, run the following command.

cd ./data
python compose_data.py --organism all

If you want to create data for a specific organism, run the command with its abbreviation instead.

cd ./data
python compose_data.py --organism sc

The outputs to be used by each GNN will be placed under the respective directories.

Run the GNNs to get results

For now, refer to the GitHub page of each GNN. This part will be updated once the automated pipeline is ready. We forked these GNNs to integrate some necessary features that were missing from the original repositories.

Project Information

Members

Special Thanks
