Altegrad-Protein-Prediction

Jérémie Dentan, Meryem Jaaidan, Abdellah El Mrini

This git repository implements deep learning methods for protein classification. The classification task is part of a Kaggle challenge. For more details about the challenge, the data, and our approach, please refer to the /doc folder.

Set up your environment

Download the data

First, download the data from the following link: https://drive.google.com/file/d/1ybLD-EgKbeWVf6bKu796SYIF-1p-j7Nr/view?usp=share_link
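If you prefer the command line, one option is the gdown package, which can download Google Drive files by ID (a minimal sketch; gdown is not part of this project's requirements, and the file ID below is taken from the link above):

pip install gdown
gdown "https://drive.google.com/uc?id=1ybLD-EgKbeWVf6bKu796SYIF-1p-j7Nr"   # downloads the archive to the current folder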

Then, unzip the folder and copy its content directly into data/ (a sample command follows the listing below). The content of data/ should now look like:

% ls -lah
total 4241392
drwxr-xr-x@ 10 jeremie  staff   320B  9 jan 15:42 .
drwxr-xr-x@ 11 jeremie  staff   352B  9 jan 18:29 ..
-rw-r--r--   1 jeremie  staff    13B  9 jan 18:29 .gitignore
-rw-rw-r--@  1 jeremie  staff   653M  7 déc 12:22 edge_attributes.txt
-rw-rw-r--@  1 jeremie  staff   211M  7 déc 12:21 edgelist.txt
-rw-rw-r--@  1 jeremie  staff   7,3M  7 déc 12:22 graph_indicator.txt
-rw-rw-r--@  1 jeremie  staff    42K  7 déc 17:13 graph_labels.txt
-rw-rw-r--@  1 jeremie  staff   1,2G  7 déc 12:22 node_attributes.txt
-rw-rw-r--@  1 jeremie  staff   1,5M  7 déc 15:56 sequences.txt
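To get to this state, a minimal sketch, assuming the downloaded archive is named altegrad_data.zip (hypothetical name; adjust to the actual file):

unzip altegrad_data.zip -d /tmp/altegrad_data   # extract to a temporary folder
cp /tmp/altegrad_data/* data/                   # copy the extracted files into data/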

Installing dependencies

This code is meant to run in Python 3.8 with the PYTHONPATH set to the root of the project. We advise you to use a native Python virtual environment or a Conda environment.
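For example, to create and activate a native virtual environment with Python 3.8 (a minimal sketch; Conda users can create an equivalent environment with conda create):

python3.8 -m venv .venv          # create the environment at .venv
source .venv/bin/activate        # activate it for the current shell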

Once your environment is active, install the base dependencies:

pip install -r requirements.txt

Then, you need to install torch-geometric, torch-scatter, and torch-sparse. Their installation depends on your CUDA version, so we advise you to follow the official installation instructions of these packages.

An easy way to do so is to set a CUDA variable to cpu, cu116, or cu117, depending on your CUDA version, and then fetch the prebuilt wheels from the PyG wheel index. For CUDA versions above 11.7 (e.g. CUDA 12.0), you can use cu117 as well. For example:

export CUDA='cpu'
pip install torch-geometric==2.2.0
pip install torch-sparse -f https://data.pyg.org/whl/torch-1.13.0+${CUDA}.html
pip install torch-scatter -f https://data.pyg.org/whl/torch-1.13.0+${CUDA}.html
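To check that the installation succeeded, you can try importing the packages (a quick sanity check, not part of the project's scripts; torch.version.cuda prints None on CPU-only builds):

python -c "import torch; print(torch.__version__, torch.version.cuda)"
python -c "import torch_geometric, torch_scatter, torch_sparse; print('PyG stack OK')"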

Precomputed embeddings

Our predictions rely on embeddings of each protein. Computing them takes a long time, so we provide precomputed embeddings. They are available at the following link: https://www.icloud.com/iclouddrive/0b5Nghg-5oZnk5QDJfhKHc8cA#precomputedembeddings

Then, unzip the archive and move the embeddings folder into /output; it will be automatically recognized by our pipeline.
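For example, assuming the downloaded archive is named precomputed_embeddings.zip and extracts to an embeddings/ folder (hypothetical archive name; adjust to the actual file):

unzip precomputed_embeddings.zip   # produces the embeddings/ folder
mv embeddings output/              # move it into output/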

Run the code

You first need to set the PYTHONPATH to the root of the project (the root of the project is the parent folder of this README file). To do so, execute the following from the root:

export PYTHONPATH=$(pwd)
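You can verify that the variable is set and that the project is importable (a quick check, assuming src is a Python package at the root):

echo $PYTHONPATH                 # should print the project root
python -c "import src"           # should exit without error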

Computing the embeddings

To compute the embeddings, you should run the following:

python -m src.embeddings

The embeddings will be automatically computed and stored in /output/embeddings. The logs of the computation will be stored in /logs.

Please note that this phase can take a long time, depending on your hardware. If you use a GPU, you will need at least 20 GB of graphics memory. On an NVIDIA GeForce RTX 3090, the computation took:

  • About 1h30 for the embeddings with protbert
  • A few minutes for the embeddings with tfidf
  • About 20min for the embeddings with structure
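Once the run finishes, you can verify that the embeddings were written (exact file names depend on the pipeline):

ls -lah output/embeddings        # the computed embedding files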

Computing the predictions

To compute the predicted probability of each class, please run the following:

python -m src.predict

The output will be stored in /output/submissions and the logs in /logs. In particular, the log-loss on the validation set will be printed to stdout and stored in the logs.
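To inspect the results after a run, you can list the submissions and search the logs for the validation log-loss (a minimal sketch; exact file names and message format depend on the pipeline):

ls output/submissions            # the generated submission files
grep -ri "loss" logs/            # find the validation log-loss line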