Applying machine learning techniques to characterising and naming lncRNA genes

Google Summer of Code-2019 : Srijan Verma

Mentors : Daniel Zerbino, Elspeth Bruford, Ruth Seal

Project Title

Applying machine learning techniques to characterising and naming lncRNA genes

Brief Description

Advances in RNA sequencing technologies have revealed the complexity of our genome. Long non-coding RNAs (lncRNAs) make up the majority of the non-coding transcriptome. Understanding the significance of this RNA world is one of the most important challenges faced in biology today, and the lncRNAs within it represent a gold mine of potential new biomarkers and drug targets. Research into lncRNAs is still at a preliminary stage.

To date, very few lncRNAs have been characterized in detail. However, it is clear that lncRNAs are important regulators of gene expression, and they are thought to have a wide range of functions in cellular and developmental processes. Several databases annotate lncRNAs (for example RefSeq, GENCODE, Ensembl, SGD and TAIR). This project uses machine learning techniques to compare two sets of lncRNA calls (Ensembl/GENCODE and RefSeq), highlight where they differ, and determine which calls are incorrect.
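As a rough illustration of what comparing two sets of calls can mean before any machine learning is applied, the sketch below splits one set into calls that overlap a call in the other set and calls that do not. The `Call` record, the function names, and the coordinates are all hypothetical and invented for this example; they are not taken from the repository's notebooks.

```python
from collections import namedtuple

# Hypothetical record type: one lncRNA gene call as a genomic interval.
Call = namedtuple("Call", ["gene_id", "chrom", "strand", "start", "end"])


def overlaps(a, b):
    """True if two calls share at least one base on the same chromosome and strand."""
    return (
        a.chrom == b.chrom
        and a.strand == b.strand
        and a.start <= b.end
        and b.start <= a.end
    )


def compare_calls(set_a, set_b):
    """Split set_a into IDs that overlap some call in set_b and IDs that do not."""
    shared, only_a = [], []
    for call in set_a:
        if any(overlaps(call, other) for other in set_b):
            shared.append(call.gene_id)
        else:
            only_a.append(call.gene_id)
    return shared, only_a


# Toy coordinates, invented for illustration only.
ensembl_calls = [
    Call("ENS_A", "chr1", "+", 100, 500),
    Call("ENS_B", "chr1", "-", 1000, 1500),
]
refseq_calls = [
    Call("REF_X", "chr1", "+", 450, 900),
    Call("REF_Y", "chr2", "+", 100, 500),
]

shared, ensembl_only = compare_calls(ensembl_calls, refseq_calls)
print("shared:", shared)
print("Ensembl only:", ensembl_only)
```

Calls found in only one set are the natural candidates to pass to a classifier. The pairwise loop is quadratic, so a real comparison over whole annotation sets would likely need an interval index instead.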

Specifications of the parent directory (srijan-gsoc-2019)

The directory contains five folders:

  1. Ensembl-analysis - Analysis scripts and the data collected from Ensembl.
  2. RefSeq-analysis - Analysis scripts and the data collected from RefSeq.
  3. feature_selection - Scripts for creating features.
  4. ML - Scripts for ML analysis of the collected data and their features.
  5. add_copyright_to_all - Script for adding copyright information to all .ipynb files.
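For readers curious how a copyright notice can be added to notebooks in bulk, here is a minimal sketch that relies only on the fact that an .ipynb file is JSON with a top-level `cells` list. It is an assumption about one possible approach, not the script in add_copyright_to_all; `COPYRIGHT_TEXT`, `add_copyright` and `stamp_all` are hypothetical names, and the notice text is a placeholder.

```python
import json
from pathlib import Path

# Placeholder text; the repository's actual notice is not shown in this README.
COPYRIGHT_TEXT = "Copyright <year> <holder>. See the licence file for details."


def add_copyright(notebook, text=COPYRIGHT_TEXT):
    """Prepend a markdown cell holding `text` unless the first cell already has it.

    `notebook` is the dict obtained by parsing an .ipynb file as JSON.
    Returns True if the notebook was modified.
    """
    cells = notebook.setdefault("cells", [])
    if cells and text in "".join(cells[0].get("source", [])):
        return False  # already stamped, so repeated runs change nothing
    cells.insert(0, {"cell_type": "markdown", "metadata": {}, "source": [text]})
    return True


def stamp_all(root):
    """Apply add_copyright to every .ipynb file under `root`; return changed paths."""
    changed = []
    for path in sorted(Path(root).rglob("*.ipynb")):
        if ".ipynb_checkpoints" in path.parts:
            continue  # skip Jupyter's autosave copies
        notebook = json.loads(path.read_text(encoding="utf-8"))
        if add_copyright(notebook):
            path.write_text(json.dumps(notebook, indent=1) + "\n", encoding="utf-8")
            changed.append(path)
    return changed
```

Calling `stamp_all(".")` from the parent directory would rewrite every notebook beneath it in place, so it is worth running on a clean working tree where the changes can be reviewed.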


Python 3.6



Data

  1. Data obtained from Ensembl can be found here.

  2. Data obtained from RefSeq can be found here.

Research papers / References

Some papers published in a similar domain are listed below:

  1. A Deep Learning Framework for Robust and Accurate Prediction of ncRNA-Protein

  2. Accurate prediction of protein lncRNA interactions by diffusion and HeteSim features

  3. CRlncRC: a machine learning-based method for cancer-related Lnc RNA identification using integrated features

  4. LncADeep

  5. lncRNAnet: Long Non-coding RNA Identification using Deep Learning

  6. Long Noncoding RNA Identification: Comparing Machine Learning Based Tools for Lnc Transcripts Discrimination

  7. Machine Learning Based LncRNA Function Prediction

  8. Prediction of LncRNA Subcellular Localization with Deep Learning from Sequence Features

Medium blogs

  1. GSoC Journey Part 1

  2. GSoC Journey Part 2- The Problem Statement

  3. GSoC Journey Part 3- Data Analysis

  4. GSoC Journey Part 4- Final Report and Summary


Acknowledgements

  1. I would like to thank Daniel Zerbino for taking the time to mentor me and for providing invaluable suggestions. I truly appreciate his constant trust and encouragement!

  2. Elspeth Bruford

  3. Ruth Seal

  4. Ensembl admins, helpdesk and the whole community

  5. GSoC organizers, managers and Google
