Google Summer of Code 2019: Srijan Verma
Applying machine learning techniques to characterising and naming lncRNA genes
Advances in RNA sequencing technologies have revealed the complexity of our genome. Long non-coding RNAs (lncRNAs) make up the majority of the non-coding transcriptome. Understanding the significance of this RNA world is one of the most important challenges faced in biology today, and the lncRNAs within it represent a gold mine of potential new biomarkers and drug targets. Their discovery is still at a preliminary stage.
To date, very few lncRNAs have been characterized in detail. However, it is clear that lncRNAs are important regulators of gene expression, and they are thought to have a wide range of functions in cellular and developmental processes. Many databases annotate lncRNA genes (for example RefSeq, GENCODE, Ensembl, SGD and TAIR). We will use machine learning techniques to compare two sets of lncRNA calls (Ensembl/GENCODE and RefSeq), highlight where they differ, and determine which calls are likely to be incorrect.
Specifications of the parent directory (srijan-gsoc-2019)
The parent directory contains five folders:
- Ensembl-analysis - Scripts for the analysis of Ensembl data, together with the data collected from Ensembl.
- RefSeq-analysis - Scripts for the analysis of RefSeq data, together with the data collected from RefSeq.
- feature_selection - Scripts for creating features.
- ML - Scripts for the ML analysis of the collected data (with their features).
- add_copyright_to_all - Script for adding copyright information to all ipynb files.
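As an illustration of what the add_copyright_to_all step might look like, here is a minimal sketch. It is not the repository's actual script. It assumes the notice is added as a leading markdown cell, and the names `COPYRIGHT_TEXT`, `add_copyright_cell` and `add_copyright_to_all` are hypothetical, as is the notice wording. It relies only on the fact that an ipynb file is a JSON document with a top-level `cells` list.

```python
import json
from pathlib import Path

# Hypothetical notice text; the repository's actual wording is not given here.
COPYRIGHT_TEXT = "Copyright (c) <year> <holder>. See the LICENSE file for details."


def add_copyright_cell(notebook: dict, text: str = COPYRIGHT_TEXT) -> bool:
    """Prepend a markdown cell holding `text`.

    Returns False (and changes nothing) if the first cell already is that notice.
    """
    cells = notebook.setdefault("cells", [])
    if cells and cells[0].get("cell_type") == "markdown":
        first = cells[0].get("source", "")
        # nbformat allows `source` to be a string or a list of strings.
        if isinstance(first, list):
            first = "".join(first)
        if first.strip() == text:
            return False
    cells.insert(0, {"cell_type": "markdown", "metadata": {}, "source": [text]})
    return True


def add_copyright_to_all(root: str) -> int:
    """Add the notice to every .ipynb file under `root`; return how many changed."""
    changed = 0
    for path in sorted(Path(root).rglob("*.ipynb")):
        # Skip Jupyter's autosave copies.
        if ".ipynb_checkpoints" in path.parts:
            continue
        notebook = json.loads(path.read_text(encoding="utf-8"))
        if add_copyright_cell(notebook):
            path.write_text(json.dumps(notebook, indent=1) + "\n", encoding="utf-8")
            changed += 1
    return changed
```

Calling `add_copyright_to_all("srijan-gsoc-2019")` would walk the parent directory. Because the first cell is checked before inserting, running it a second time leaves the notebooks unchanged.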