Long non-coding RNAs (lncRNAs), characterized as RNA transcripts longer than 200 nucleotides without functional open reading frames, play critical regulatory roles in various biological and developmental processes in both animals and plants. Despite their recent discovery and intriguing potential functions, lncRNA characterization remains a significant challenge, particularly in plants due to limited information, distinct transcriptional patterns, low sequence conservation, and scarce resources for credible annotation in plant genomes and transcriptomes. This highlights the need for novel tools to effectively identify and characterize plant lncRNAs.
We introduce DeepPlnc, a deep learning-based software for accurately identifying plant lncRNAs across various plant genomes. Unlike most existing tools, DeepPlnc can even annotate incomplete length transcripts. It employs a bi-modal architecture of Convolutional Neural Networks (CNNs) to extract information from both the nucleotide sequence and secondary structure of plant lncRNAs, enabling accurate lncRNA identification.
The user provides RNA-seq data or any nucleotide sequence data in FASTA format as input. The trained bi-modal Convolutional Neural Networks (CNNs) analyze this data and assign a score to each provided sequence. These scores are the output of the analysis.
A web server for lncRNA detection is available at https://scbb.ihbt.res.in/DeepPlnc/. Users can identify lncRNAs by providing FASTA sequences as input. The results can be downloaded in tabular format, where the first column gives the sequence ID and the second column indicates whether the sequence is an lncRNA or not.
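Before submitting sequences, it can help to check the input. The sketch below is not part of DeepPlnc: it reads FASTA records with a small hand-written parser and separates sequences that meet the 200-base minimum stated for the repository's test file from those that do not. The function names `read_fasta` and `split_by_length` are hypothetical.

```python
MIN_LENGTH = 200  # minimum length stated for the repository's test file


def read_fasta(lines):
    """Yield (seq_id, sequence) pairs from an iterable of FASTA lines."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            header = line[1:].split()
            # Use the first word of the header as the ID (an assumption).
            seq_id, chunks = (header[0] if header else ""), []
        elif seq_id is not None:
            chunks.append(line.upper())
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def split_by_length(records, min_length=MIN_LENGTH):
    """Return (kept_ids, skipped_ids) according to the minimum length."""
    kept, skipped = [], []
    for seq_id, seq in records:
        (kept if len(seq) >= min_length else skipped).append(seq_id)
    return kept, skipped
```

For a file on disk, `split_by_length(read_fasta(open("test")))` would report which records fall below the minimum length.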
Figure: DeepPlnc webserver implementation
- Python3.6 or higher
- Numpy
- keras
- tensorflow
- plotly
- pandas
- RNAfold
- Python modules: multiprocessing, Bio (Biopython), bayesian-optimization
Clone the DeepPlnc repository, change into it, and unzip the DeepPlnc.zip archive with the following commands:
git clone https://github.com/SCBB-LAB/DeepPlnc.git
cd DeepPlnc
unzip DeepPlnc.zip
In the parent directory, you will find a collection of files that are described below:
- DeepPlnc.sh = Complete execution script.
- DeepPlnc.py = Python script for detecting lncRNAs from the sequences provided.
- Model_A.h5 = Trained model built with the traditionally considered negative dataset (mRNA sequences).
- Model_B.h5 = Trained model whose negative dataset is one-third plant rRNAs and tRNAs and two-thirds mRNAs.
- test = FASTA sequence file (minimum 200 bases in length).
- make-plot.py = Python script for box and violin plot generation for a single sequence.
- batch-plot.py = Python script for violin plot generation for a batch (10 sequences).
- predict_GPU.py = Python script for detecting lncRNAs using a GPU.
- file_format_GPU = Input file format for predict_GPU.py: a file containing seq_id, sequence (at least 200 bases but not more than 400 bases), and secondary structure generated using RNAfold (in dot-bracket representation), separated by tabs.
- model_hyper.py = Python script used to build the model with hyperparameter tuning.
For the identification of lncRNAs, execute the following command in the parent directory:
sh DeepPlnc.sh test /usr/local/bin/ A
- test = Test file
- /usr/local/bin/ = Path of RNAfold in your local system
- A = Model to be selected for classification (Options : A|B)
Output: Two files are generated: test.txt, which contains the chunk-wise probability scores of the sequences provided, and test_results.tsv, which contains the classification results of the sequences provided.
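When many input files need to be processed, the command above can be driven from Python. The following is a minimal sketch, assuming it is run from the parent directory where DeepPlnc.sh lives. The helper names `build_command` and `run_deepplnc` are hypothetical and not part of DeepPlnc.

```python
import subprocess

VALID_MODELS = ("A", "B")  # model options listed above


def build_command(fasta_path, rnafold_dir, model="A", script="DeepPlnc.sh"):
    """Return the argument list for: sh DeepPlnc.sh <file> <RNAfold path> <model>."""
    if model not in VALID_MODELS:
        raise ValueError(f"model must be one of {VALID_MODELS}, got {model!r}")
    return ["sh", script, fasta_path, rnafold_dir, model]


def run_deepplnc(fasta_path, rnafold_dir, model="A"):
    """Run DeepPlnc.sh and raise CalledProcessError if it exits with an error."""
    return subprocess.run(build_command(fasta_path, rnafold_dir, model), check=True)
```

Calling `run_deepplnc("test", "/usr/local/bin/", "A")` is equivalent to the shell command shown above. Passing the arguments as a list avoids shell quoting problems with unusual file names.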
For the identification of lncRNAs using GPU, run the following command:
python3 predict_GPU.py file_format_GPU A
- file_format_GPU = File containing seq_id, sequence (at least 200 bases but not more than 400 bases), and secondary structure generated using RNAfold (in dot-bracket representation), separated by tabs.
- A = Model to be selected for classification (Options : A|B)
Output: One file, file_format_GPU_prediction.txt, is generated; it contains the chunk-wise probability scores of the sequences provided.
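The three-column input for predict_GPU.py has to be prepared from RNAfold results. The sketch below shows one possible way to do this and is not taken from the DeepPlnc code. It assumes RNAfold's usual output for FASTA input (a header line, the sequence, then the dot-bracket structure followed by the energy in parentheses). The function names are hypothetical, and whether predict_GPU.py expects any further conventions (for example T versus U) is not covered here.

```python
MIN_LEN, MAX_LEN = 200, 400  # length limits stated for file_format_GPU


def parse_rnafold_output(text):
    """Yield (seq_id, sequence, structure) from RNAfold's FASTA-style output."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    for i, line in enumerate(lines):
        if line.startswith(">") and i + 2 < len(lines):
            header = line[1:].split()
            if not header:
                continue  # skip records without an ID
            sequence = lines[i + 1]
            # The structure line ends with the energy, e.g. "..((..)). (-1.20)".
            structure = lines[i + 2].split()[0]
            yield header[0], sequence, structure


def to_gpu_line(seq_id, sequence, structure):
    """Return one tab-separated record, or raise ValueError if it looks invalid."""
    if not MIN_LEN <= len(sequence) <= MAX_LEN:
        raise ValueError(f"{seq_id}: length {len(sequence)} outside {MIN_LEN}-{MAX_LEN}")
    if len(structure) != len(sequence):
        raise ValueError(f"{seq_id}: structure and sequence lengths differ")
    if set(structure) - set(".()"):
        raise ValueError(f"{seq_id}: structure is not plain dot-bracket notation")
    return "\t".join((seq_id, sequence, structure))
```

The RNAfold output itself could be produced with something like `RNAfold --noPS < test > test.fold`, after which each parsed record is written as one line of the input file.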
To build a model with hyperparameter tuning, run the following command:
python3 model_hyper.py file_for_tuning
- file_for_tuning = File containing seq_id, sequence (at least 200 bases but not more than 400 bases), and secondary structure generated using RNAfold (in dot-bracket representation), separated by tabs.
Output: Two files are generated: seq.txt, which contains the hyperparameters for the sequence side of the bi-modal architecture, and struc.txt, which contains the hyperparameters for the structure side.
- To plot box and violin plots for a single sequence, switch to the directory named plot and execute the following commands:
cd plot
python3 ../make-plot.py seq1
- seq1 = Sequence file name without ".csv"
- To plot a violin plot for a batch (10 sequences), switch to the directory named plot and execute the following commands:
cd plot
python3 ../batch-plot.py batch_1
- batch_1 = Batch file name without ".csv"
Output: The plot is generated for a single sequence and for a batch of 10 sequences.
Note: Before running DeepPlnc, make sure there is no folder named "plot" in the parent directory; otherwise it will give the following unnecessary warning:
mkdir: cannot create directory plot: File exists
Citation: Ritu, Gupta S, Sharma NK, Shankar R (2022) DeepPlnc: Discovering plant lncRNAs through multimodal deep learning on sequential data. Genomics, 2022. https://www.sciencedirect.com/science/article/pii/S0888754322001884