RNAErnie_baselines

Official PyTorch implementation of the BERT-like baselines (RNABERT, RNA-MSM, RNA-FM) for the paper "Multi-purpose RNA Language Modeling with Motif-aware Pre-training and Type-guided Fine-tuning".

Installation

First, download the repository and create the environment.

git clone https://github.com/CatIIIIIIII/RNAErnie_baselines.git
cd ./RNAErnie_baselines
conda env create -f environment.yaml

Then activate the conda environment defined in environment.yaml:

conda activate ErnieFold

Pre-training

Download the pre-trained model weights for RNABERT and RNA-MSM and place them in the ./checkpoints folder. The pre-trained weights for RNA-FM are downloaded automatically when you run a fine-tuning script.
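Before launching a run, it can help to confirm that the weights are where the scripts expect them. The following is a minimal sketch, not part of the repository: `list_checkpoint_files` is a hypothetical helper, and because this README does not give the file names of the weights, it only reports which files exist under the folder rather than checking for specific ones.

```python
from pathlib import Path


def list_checkpoint_files(checkpoint_dir="./checkpoints"):
    """Return the sorted relative paths of all files under checkpoint_dir.

    Returns an empty list if the folder is missing or contains no files.
    """
    root = Path(checkpoint_dir)
    if not root.is_dir():
        return []
    return sorted(str(p.relative_to(root)) for p in root.rglob("*") if p.is_file())


if __name__ == "__main__":
    files = list_checkpoint_files()
    if files:
        print("Found files in ./checkpoints:")
        for name in files:
            print("  " + name)
    else:
        # Expected file names are not specified in this README, so only
        # the presence of files is checked.
        print("No files found in ./checkpoints; download the RNABERT and RNA-MSM weights first.")
```

The same helper can be pointed at the data folders used by the downstream tasks (./data/seq_cls, ./data/rr_inter, ./data/ssp) to check that a download landed in the right place.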

Downstream Tasks

RNA sequence classification

1. Data Preparation

Download the training data from Google Drive and place it in the ./data/seq_cls folder. For the baselines, only the nRC dataset is available for this task.

2. Fine-tuning

Fine-tune a BERT-style large-scale pre-trained language model on the RNA sequence classification task with the following command:

python run_seq_cls.py \
    --device 'cuda:0' \
    --model_name RNAFM

To use a different backbone, change --model_name to RNAMSM or RNABERT.
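To compare all three backbones on the same task, the command can be generated once per model name. This is a rough sketch rather than a script shipped with the repository: `build_command` is a hypothetical helper, and it uses only the two flags shown above (--device and --model_name). It prints the commands instead of launching them.

```python
import shlex

# Backbone names accepted by --model_name according to this README.
MODEL_NAMES = ["RNAFM", "RNAMSM", "RNABERT"]


def build_command(script, model_name, device="cuda:0"):
    """Build the argument list for one fine-tuning run."""
    return ["python", script, "--device", device, "--model_name", model_name]


if __name__ == "__main__":
    for name in MODEL_NAMES:
        cmd = build_command("run_seq_cls.py", name)
        # Dry run: print only. To launch, pass cmd to
        # subprocess.run(cmd, check=True) from the repository root.
        print(shlex.join(cmd))
```

Passing "run_rr_inter.py" as the script argument produces the equivalent commands for the RNA-RNA interaction task, since that script is shown with the same two flags.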

RNA-RNA interaction prediction

1. Data Preparation

Download the training data from Google Drive and place it in the ./data/rr_inter folder.

2. Fine-tuning

Fine-tune a baseline model on the RNA-RNA interaction prediction task with the following command:

python run_rr_inter.py \
    --device 'cuda:0' \
    --model_name RNAFM

To use a different backbone, change --model_name to RNAMSM or RNABERT.

RNA secondary structure prediction

1. Data Preparation

Download the training data from Google Drive, unzip it, and place it in the ./data/ssp folder. Two benchmarks (RNAStrAlign-ArchiveII, bpRNA1m) are available for this task.

2. Adaptation

Adapt a baseline model to the RNA secondary structure prediction task with the following command:

python run_ss_pred.py \
    --device 'cuda:0' \
    --model_name RNAFM

To use a different backbone, change --model_name to RNAMSM or RNABERT. To run on a different task, set --task_name to RNAStrAlign or bpRNA1m.
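Since both the backbone and the task can vary, a full comparison covers six combinations. The sketch below enumerates them under the assumption that --model_name and --task_name can be combined freely in a single call. `ss_pred_commands` and `run_all` are hypothetical helpers, not part of the repository, and the default is a dry run that only prints each command.

```python
import shlex
import subprocess
from itertools import product

# Values listed in this README for the two flags.
MODEL_NAMES = ["RNAFM", "RNAMSM", "RNABERT"]
TASK_NAMES = ["RNAStrAlign", "bpRNA1m"]


def ss_pred_commands(device="cuda:0"):
    """Return one run_ss_pred.py argument list per (model, task) pair."""
    return [
        ["python", "run_ss_pred.py", "--device", device,
         "--model_name", model, "--task_name", task]
        for model, task in product(MODEL_NAMES, TASK_NAMES)
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it as well when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the sweep at the first failing run.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(ss_pred_commands(), dry_run=True)
```

Calling `run_all(ss_pred_commands(), dry_run=False)` from the repository root would execute the runs one after another on the same device.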