SeqMG-RPI: RNA-Protein Interaction Prediction Network

1. Introduction

SeqMG-RPI is an RNA-protein interaction prediction network based on RNA and protein sequences.

2. Operating System

SeqMG-RPI was developed in a Linux environment with CUDA 11.8.

Hardware: Two NVIDIA GeForce RTX 4090 GPUs (24 GB each)

3. Environment Setup

Create and activate the environment

conda create -n seqmg python=3.8  # Create environment
conda activate seqmg  # Activate environment

Install ESM-2 Model

Install ESM-2 by following the official tutorial, using one of the two commands below:

pip install fair-esm  # latest release OR
pip install git+https://github.com/facebookresearch/esm.git  # bleeding edge, current repo main branch

The ESM-2 model used in this project is esm2_t33_650M_UR50D. This model will be automatically downloaded during the first run of the code. If the download fails, please follow the ESM-2 official tutorial for manual download.

Install Dependencies

conda install numpy  # numpy 1.24.3
conda install scikit-learn  # scikit-learn 1.3.0
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia  # pytorch 2.4.1
conda install tqdm  # tqdm 4.66.5
conda install pyg -c pyg  # torch_geometric 2.6.1
conda install pytorch-scatter -c pyg  # torch-scatter 2.1.2
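
After installation, it can help to confirm that the installed versions match the ones noted in the comments above. The following is a minimal sketch using only the Python standard library. The helper name check_versions is hypothetical and not part of SeqMG-RPI, and the distribution names (for example torch-geometric for pyg) are assumptions about how each package registers itself.

```python
from importlib import metadata

# Versions noted in the install comments above. The distribution names are
# assumptions about how each package registers itself in the environment.
EXPECTED = {
    "numpy": "1.24.3",
    "scikit-learn": "1.3.0",
    "torch": "2.4.1",
    "tqdm": "4.66.5",
    "torch-geometric": "2.6.1",
    "torch-scatter": "2.1.2",
}


def check_versions(expected=EXPECTED):
    """Return {package: (expected, installed or None)} for every mismatch."""
    mismatches = {}
    for name, want in expected.items():
        try:
            have = metadata.version(name)
        except metadata.PackageNotFoundError:
            have = None
        # Ignore local build tags such as "+cu118" when comparing.
        if have is None or have.split("+")[0] != want:
            mismatches[name] = (want, have)
    return mismatches


if __name__ == "__main__":
    for name, (want, have) in check_versions().items():
        print(f"{name}: expected {want}, found {have or 'not installed'}")
```

A mismatch is not necessarily an error; the sketch only reports differences from the versions listed above.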

4. Feature Generation

4.1 Generate Multi-Channel RNA Features

Run the following command to generate multi-channel RNA features for the dataset:

python ./create_feature/RNA_multi-channel-feature.py <rna_seq_fasta_file>  # <rna_seq_fasta_file> is the RNA sequence file path

For example, to generate multi-channel RNA features for the ATH948 dataset:

python ./create_feature/RNA_multi-channel-feature.py datasets/rna_seq/ATH948_rna_seq.fa

The output feature files will be stored by default in ./features/RNA_multi-channel-features. To store them in a different location, use the following command to specify the output directory:

python ./create_feature/RNA_multi-channel-feature.py <rna_seq_fasta_file> <output_dir>  # <output_dir> is the directory where you want to store the feature files

4.2 Generate RNA k-mer Frequency Features

Run the following command to generate RNA k-mer frequency features for the dataset:

python ./create_feature/RNA_kmer-frequency-feature.py <rna_seq_fasta_file>  # <rna_seq_fasta_file> is the RNA sequence file path

For example, to generate RNA k-mer frequency features for the ATH948 dataset:

python ./create_feature/RNA_kmer-frequency-feature.py datasets/rna_seq/ATH948_rna_seq.fa

The output feature files will be stored by default in ./features/RNA_kmer-frequency-features. To store them in a different location, use the following command to specify the output directory:

python ./create_feature/RNA_kmer-frequency-feature.py <rna_seq_fasta_file> <output_dir>  # <output_dir> is the directory where you want to store the feature files
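
For readers unfamiliar with k-mer frequency features, the general idea can be sketched as follows. This is an illustration of the standard technique, not the code in RNA_kmer-frequency-feature.py. The function name, the choice of k=3, the ACGU alphabet, and the normalization are all assumptions.

```python
from itertools import product


def kmer_frequencies(seq, k=3, alphabet="ACGU"):
    """Return normalized k-mer frequencies, ordered as generated from alphabet."""
    seq = seq.upper().replace("T", "U")  # treat DNA-style T as U
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing ambiguous bases such as N
            counts[window] += 1
    total = sum(counts.values())
    # An empty or too-short sequence yields an all-zero vector.
    return [counts[km] / total if total else 0.0 for km in kmers]
```

With k=3 and a four-letter alphabet, each sequence maps to a vector of 4^3 = 64 values that sum to 1 whenever at least one valid k-mer is present.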

4.3 Generate RNA Sparse Matrix Features

Run the following command to generate RNA sparse matrix features for the dataset:

python ./create_feature/RNA_sparse-matrix-feature.py <rna_seq_fasta_file>  # <rna_seq_fasta_file> is the RNA sequence file path

For example, to generate RNA sparse matrix features for the ATH948 dataset:

python ./create_feature/RNA_sparse-matrix-feature.py datasets/rna_seq/ATH948_rna_seq.fa

The output feature files will be stored by default in ./features/RNA_sparse-matrix-features. To store them in a different location, use the following command to specify the output directory:

python ./create_feature/RNA_sparse-matrix-feature.py <rna_seq_fasta_file> <output_dir>  # <output_dir> is the directory where you want to store the feature files

4.4 Generate Protein Graph Features

4.4.1 Use ESM-2 Model to Generate Protein Connect Map and Representations

On the first run, the esm2_t33_650M_UR50D model is downloaded automatically. The checkpoint is large (650M parameters), so this step can take a long time; this is normal.

python ./ESM-2/esm2_t33_650M_UR50D.py <protein_seq_fasta_file>  # <protein_seq_fasta_file> is the protein sequence file path

The script runs on CUDA device 0 by default. To use a different device, pass its index as the second argument:

python ./ESM-2/esm2_t33_650M_UR50D.py <protein_seq_fasta_file> [device_index]

For example, to generate the connect map file and representations file for the ATH948 dataset using CUDA1:

python ./ESM-2/esm2_t33_650M_UR50D.py datasets/protein_seq/ATH948_protein_seq.fa 1

To process long protein sequences, this script splits them into windows using a default window_size of 1000 and a default stride of 500. These are the recommended values, but you can override them by passing both parameters on the command line:

python ./ESM-2/esm2_t33_650M_UR50D.py <protein_seq_fasta_file> [device_index] [window_size] [stride]
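
To make the roles of window_size and stride concrete, the span arithmetic of a sliding window can be sketched as below. This is only an assumption about how windowing of this kind usually works; the function name window_spans is hypothetical, and how the script itself handles the final window or merges overlapping regions is not described here.

```python
def window_spans(seq_len, window_size=1000, stride=500):
    """Return (start, end) spans that together cover a sequence of length seq_len."""
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    if stride > window_size:
        raise ValueError("stride must not exceed window_size, or residues are skipped")
    if seq_len <= window_size:
        return [(0, seq_len)]  # short sequences fit in a single window
    spans = []
    start = 0
    while start + window_size < seq_len:
        spans.append((start, start + window_size))
        start += stride
    # The last window is clipped at the end of the sequence.
    spans.append((start, seq_len))
    return spans
```

Under these assumptions, a 2300-residue protein with the default values is split into the spans (0, 1000), (500, 1500), (1000, 2000), and (1500, 2300), so consecutive windows overlap by 500 residues.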

4.4.2 Use Connect Map and Representations Files to Generate Protein Graph Features

Run the following command to generate protein graph features:

python ./create_feature/protein_graph-feature.py <esm2_file_dir>  # <esm2_file_dir> is the directory where the connect map and representations files for the protein dataset are stored

For example, to generate protein graph features for the ATH948 dataset:

python ./create_feature/protein_graph-feature.py ESM-2/esm2-ATH948_protein_seq_w1000_s500

The output feature files will be stored by default in ./features/protein_graph-feature. To store them in a different location, use the following command to specify the output directory:

python ./create_feature/protein_graph-feature.py <esm2_file_dir> <output_dir>  # <output_dir> is the directory where you want to store the feature files
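
As a rough illustration of how a connect map can become a graph, the sketch below turns a square matrix of residue-residue scores into an undirected edge list. It assumes the connect map holds contact probabilities of the kind ESM-2 can predict. The function name and the 0.5 threshold are assumptions, and protein_graph-feature.py may build its graphs differently.

```python
def contact_map_to_edges(contact_map, threshold=0.5):
    """Convert a square contact-probability matrix to an undirected edge list."""
    n = len(contact_map)
    if any(len(row) != n for row in contact_map):
        raise ValueError("contact_map must be square")
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            # Average both triangles in case the matrix is not exactly symmetric.
            p = (contact_map[i][j] + contact_map[j][i]) / 2
            if p >= threshold:  # threshold is an assumed cutoff
                edges.append((i, j))
    return edges
```

In a graph built this way, each residue is a node, and the per-residue ESM-2 representations could serve as node features.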

5. Running SeqMG-RPI

After all the features of the required dataset have been generated, you can run the main program with the following command:

# The same dataset is used as training and test sets.
python main.py <dataset_pair_file>
# OR
# Different datasets are used as training and test sets
python main2.py <train_dataset_pair_file> <test_dataset_pair_file>

The model uses CUDA0 as the default training device. If you want to change it, use the following command:

python main.py <dataset_pair_file> [device_index]
# OR
python main2.py <train_dataset_pair_file> <test_dataset_pair_file> [device_index]

For example:

# Use the ATH948 dataset as training and test sets and run on device CUDA0
python main.py datasets/pair/ATH948_pairs.csv 0
# OR
# Use RPI1807 as training set, RPI_D as test set, and run on device CUDA1
python main2.py datasets/pair/RPI1807_pairs.csv datasets/multi-species/RPI_D/RPI_D_interaction.csv 1
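
The feature generation steps and the final run can be collected into one list of commands, as in the sketch below. The helper name build_pipeline_commands is hypothetical. The file naming pattern is taken from the ATH948 examples above and is only an assumption for other datasets.

```python
import shlex


def build_pipeline_commands(dataset, device=0, window_size=1000, stride=500):
    """Return the argv lists for feature generation followed by main.py."""
    # Paths follow the ATH948 examples; other datasets may be named differently.
    rna_fa = f"datasets/rna_seq/{dataset}_rna_seq.fa"
    protein_fa = f"datasets/protein_seq/{dataset}_protein_seq.fa"
    esm_dir = f"ESM-2/esm2-{dataset}_protein_seq_w{window_size}_s{stride}"
    pairs = f"datasets/pair/{dataset}_pairs.csv"
    py = "python"
    return [
        [py, "./create_feature/RNA_multi-channel-feature.py", rna_fa],
        [py, "./create_feature/RNA_kmer-frequency-feature.py", rna_fa],
        [py, "./create_feature/RNA_sparse-matrix-feature.py", rna_fa],
        [py, "./ESM-2/esm2_t33_650M_UR50D.py", protein_fa,
         str(device), str(window_size), str(stride)],
        [py, "./create_feature/protein_graph-feature.py", esm_dir],
        [py, "main.py", pairs, str(device)],
    ]


if __name__ == "__main__":
    # Print the commands for review; run them with subprocess.run if desired.
    for cmd in build_pipeline_commands("ATH948"):
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to check the paths before anything is executed.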

If you have modified the feature file storage paths in the previous steps, please adjust the following lines in main.py and main2.py:

    rna_mutil_channel_feature_dir = "./features/RNA_multi-channel-features"
    rna_kmer_frequency_feature_dir = "./features/RNA_kmer-frequency-features"
    rna_sparse_matrix_feature_dir = "./features/RNA_sparse-matrix-features"
    protein_graph_feature_dir = "./features/protein_graph-feature"

Change them to the paths you have set:

    rna_mutil_channel_feature_dir = "<output_dir>"
    rna_kmer_frequency_feature_dir = "<output_dir>"
    rna_sparse_matrix_feature_dir = "<output_dir>"
    protein_graph_feature_dir = "<output_dir>"
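
Before launching a run, it may be worth checking that every configured feature directory exists and is not empty. The sketch below is one way to do that. The helper name missing_feature_dirs is hypothetical, and the dictionary simply mirrors the default paths shown above.

```python
from pathlib import Path

# Default paths copied from main.py and main2.py; edit to match your setup.
FEATURE_DIRS = {
    "rna_mutil_channel_feature_dir": "./features/RNA_multi-channel-features",
    "rna_kmer_frequency_feature_dir": "./features/RNA_kmer-frequency-features",
    "rna_sparse_matrix_feature_dir": "./features/RNA_sparse-matrix-features",
    "protein_graph_feature_dir": "./features/protein_graph-feature",
}


def missing_feature_dirs(dirs=FEATURE_DIRS):
    """Return the names whose directory does not exist or contains no entries."""
    missing = []
    for name, path in dirs.items():
        p = Path(path)
        if not p.is_dir() or not any(p.iterdir()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    for name in missing_feature_dirs():
        print(f"{name} is missing or empty: {FEATURE_DIRS[name]}")
```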

The interaction probability files generated by main.py and main2.py will be stored in the ./result directory.

6. Datasets

The datasets used in this study are stored in the ./datasets/ directory.

Citation and contact

@article{
}
