SeqMG-RPI is an RNA-protein interaction prediction network based on RNA and protein sequences.
SeqMG-RPI was developed on a Linux environment with CUDA 11.8.
Hardware: Two NVIDIA GeForce RTX 4090(24G)
conda create -n seqmg python=3.8 # Create environment
conda activate seqmg # Activate environmentDownload the ESM-2 model and follow the official tutorial for installation.
pip install fair-esm # latest release OR
pip install git+https://github.com/facebookresearch/esm.git # bleeding edge, current repo main branchThe ESM-2 model used in this project is esm2_t33_650M_UR50D. This model will be automatically downloaded during the first run of the code. If the download fails, please follow the ESM-2 official tutorial for manual download.
conda install numpy # numpy 1.24.3
conda install scikit-learn # scikit-learn 1.3.0
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia # pytorch 2.4.1
conda install tqdm # tqdm 4.66.5
conda install pyg -c pyg # torch_geometric 2.6.1
conda install pytorch-scatter -c pyg # torch-scatter 2.1.2Run the following command to generate multi-channel RNA features for the dataset:
python ./create_feature/RNA_multi-channel-feature.py <rna_seq_fasta_file> # <rna_seq_fasta_file> is the RNA sequence file pathFor example, to generate multi-channel RNA features for the ATH948 dataset:
python ./create_feature/RNA_multi-channel-feature.py datasets/rna_seq/ATH948_rna_seq.faThe output feature files will be stored by default in ./features/RNA_multi-channel-features. To store them in a different location, use the following command to specify the output directory:
python ./create_feature/RNA_multi-channel-feature.py <rna_seq_fasta_file> <output_dir> # <output_dir> is the directory where you want to store the feature filesRun the following command to generate RNA k-mer frequency features for the dataset:
python ./create_feature/RNA_kmer-frequency-feature.py <rna_seq_fasta_file> # <rna_seq_fasta_file> is the RNA sequence file pathFor example, to generate RNA k-mer frequency features for the ATH948 dataset:
python ./create_feature/RNA_kmer-frequency-feature.py datasets/rna_seq/ATH948_rna_seq.faThe output feature files will be stored by default in ./features/RNA_kmer-frequency-features. To store them in a different location, use the following command to specify the output directory:
python ./create_feature/RNA_kmer-frequency-feature.py <rna_seq_fasta_file> <output_dir> # <output_dir> is the directory where you want to store the feature filesRun the following command to generate RNA sparse matrix features for the dataset:
python ./create_feature/RNA_sparse-matrix-feature.py <rna_seq_fasta_file> # <rna_seq_fasta_file> is the RNA sequence file pathFor example, to generate RNA sparse matrix features for the ATH948 dataset:
python ./create_feature/RNA_sparse-matrix-feature.py datasets/rna_seq/ATH948_rna_seq.faThe output feature files will be stored by default in ./features/RNA_sparse-matrix-features. To store them in a different location, use the following command to specify the output directory:
python ./create_feature/RNA_sparse-matrix-feature.py <rna_seq_fasta_file> <output_dir> # <output_dir> is the directory where you want to store the feature filesThe first time you run it, the model esm2_t33_650M_UR50D will be automatically downloaded. This will take a long time, which is normal.
python ./ESM-2/esm2_t33_650M_UR50D.py <protein_seq_fasta_file> # <protein_seq_fasta_file> is the protein sequence file pathThe model uses CUDA0 as the default training device. If you want to change it, use the following command:
python ./ESM-2/esm2_t33_650M_UR50D.py <protein_seq_fasta_file> [device_index]For example, to generate the connect map file and representations file for the ATH948 dataset using CUDA1:
python ./ESM-2/esm2_t33_650M_UR50D.py datasets/protein_seq/ATH948_protein_seq.fa 1This script uses default window_size and stride values of 1000 and 500 for processing long protein sequences. These are optimal parameters, but if you wish to adjust them, you can specify the parameters when running the code:
python ./ESM-2/esm2_t33_650M_UR50D.py <protein_seq_fasta_file> [device_index] [window_size] [stride]Run the following command to generate protein graph features:
python ./create_feature/protein_graph-feature.py <esm2_file_dir> # <esm2_file_dir> is the directory where the connect map and representations files for the protein dataset are storedFor example, to generate protein graph features for the ATH948 dataset:
python ./create_feature/protein_graph-feature.py ESM-2/esm2-ATH948_protein_seq_w1000_s500The output feature files will be stored by default in ./features/protein_graph-feature. To store them in a different location, use the following command to specify the output directory:
python ./create_feature/protein_graph-feature.py <esm2_file_dir> <output_dir> # <output_dir> is the directory where you want to store the feature filesAfter all the features of the required dataset have been generated, you can run the main program with the following command:
# The same dataset is used as training and test sets.
python main.py <dataset_pair_file>
# OR
# Different datasets are used as training and test sets
python main2.py <train_dataset_pair_file> <test_dataset_pair_file>The model uses CUDA0 as the default training device. If you want to change it, use the following command:
python main.py <dataset_pair_file> [device_index]
# OR
python main2.py <train_dataset_pair_file> <test_dataset_pair_file> [device_index]For example:
# Use the ATH948 dataset as training and test sets and run on device CUDA0
python main.py datasets/pair/ATH948_pairs.csv 0
# OR
# Use RPI1807 as training set, RPI_D as test set, and run on device CUDA1
python main2.py datasets/pair/RPI1807_pairs.csv datasets/multi-species/RPI_D/RPI_D_interaction.csv 1If you have modified the feature file storage paths in the previous steps, please adjust the following lines in main.py and main2.py:
rna_mutil_channel_feature_dir = "./features/RNA_multi-channel-features"
rna_kmer_frequency_feature_dir = "./features/RNA_kmer-frequency-features"
rna_sparse_matrix_feature_dir = "./features/RNA_sparse-matrix-features"
protein_graph_feature_dir = "./features/protein_graph-feature"Change them to the paths you have set:
rna_mutil_channel_feature_dir = "<output_dir>"
rna_kmer_frequency_feature_dir = "<output_dir>"
rna_sparse_matrix_feature_dir = "<output_dir>"
protein_graph_feature_dir = "<output_dir>"The interaction probability files generated by main.py and main2.py will be stored in the ./result directory.
The datasets used in this study are stored in ./datasets/
@article{
}