DESCRIPTION

This repository contains all code described in our recent manuscript:

  • Yekaterina Shulgina*, Marena I. Trinidad*, Conner J. Langeberg*, Hunter Nisonoff*, Seyone Chithrananda*, Petr Skopintsev*, Amos J. Nissley*, Jaymin Patel, Ron S. Boger, Honglue Shi, Peter H. Yoon, Erin E Doherty, Tara Pande, Aditya M. Iyer, Jennifer A. Doudna, Jamie H. D. Cate*. RNA language models predict mutations that improve RNA function. bioRxiv 2024.04.05.588317; doi: https://doi.org/10.1101/2024.04.05.588317

    * Contributed equally

It covers the curation of diverse RNA datasets from the Genome Taxonomy Database (GTDB), the prediction of optimal growth temperature (OGT) phenotypes, and the application of language model (LM) and graph neural network (GNN) models to the development of thermostable ribosomes. The resources include examples of dataset preprocessing, generation of training and test sets using hierarchical clustering with CD-HIT, model validation, and sequence generation. This project is a community effort coordinated by the Innovative Genomics Institute and the Department of Electrical Engineering and Computer Sciences at the University of California, Berkeley.

The GARNET database (GTDB-Acquired RNA with Environmental Temperatures) is freely available here: https://doi.org/10.5281/zenodo.11226103

CONTENTS

  • Software Requirements and Installation with Anaconda:

    All software requirements are specified in the YAML files listed below. Each file defines a separate Anaconda environment, which can be created as shown in the example.

    • Dependencies for Rfam curation, data preprocessing, figures and OGT prediction: Conda_Environments/data_processing_env.yml

    • Dependencies for GNN Model: Conda_Environments/gnn_environment.yml

    • Dependencies for LM Model: Conda_Environments/lm_environment.yml

    • Example Installation (run from the repository root):

      conda env create --file Conda_Environments/data_processing_env.yml --name data_processing_env
      
  • Dataset Curation: Extraction of Rfams from GTDB

    • GTDB_RNA_Curation/*sh
    • GTDB_RNA_Curation/*py
  • Phenotype Annotation: OGT Prediction and Figure 2

    • GTDB_RNA_Curation/Figure_2_OGT.ipynb
  • Train/Test Split: Hierarchical Clustering with CD-HIT

    • Train_Test_Split/Train_Test_Splits.ipynb
  • Dataset Preprocessing: Create Contact Maps for GNN Model

    • Contact_Maps/*
  • Language Model:

    • LM_Model/*
  • GNN Model: Graph-Based RNA Sequence Modeling

    • GNN_Model/*
  • Generated Sequences: Sequences Generated by LM and GNN Models

    • Generated_Sequences/*
  • Validation: Likelihoods of Generated Sequences

    • Validation/*
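
EXAMPLE: CLUSTER-LEVEL TRAIN/TEST SPLIT

The train/test split step listed above relies on CD-HIT clustering. One common way to use such clusters is to assign every cluster wholly to either the training or the test set, so that near-identical sequences never appear on both sides. The snippet below is a minimal sketch of that idea and is not taken from Train_Test_Split/Train_Test_Splits.ipynb. It assumes the standard CD-HIT .clstr text layout; the function names (parse_clstr, split_by_cluster), the sequence IDs, and the test fraction are hypothetical.

```python
import random
from collections import defaultdict


def parse_clstr(lines):
    """Parse CD-HIT .clstr text lines into {cluster_id: [sequence_id, ...]}."""
    clusters = defaultdict(list)
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # Header line, e.g. ">Cluster 12"
            current = int(line.split()[1])
        elif current is not None:
            # Member line, e.g. "0\t120nt, >seqA... *"
            seq_id = line.split(">", 1)[1].split("...", 1)[0]
            clusters[current].append(seq_id)
    return dict(clusters)


def split_by_cluster(clusters, test_fraction=0.2, seed=0):
    """Assign whole clusters to train or test so no cluster spans both."""
    ids = sorted(clusters)
    random.Random(seed).shuffle(ids)
    n_test = max(1, round(len(ids) * test_fraction))
    test_ids = set(ids[:n_test])
    train = [s for c in ids if c not in test_ids for s in clusters[c]]
    test = [s for c in ids if c in test_ids for s in clusters[c]]
    return train, test


# Hypothetical .clstr content with three clusters (IDs are made up).
example = """>Cluster 0
0\t120nt, >seqA... *
1\t118nt, >seqB... at +/97.46%
>Cluster 1
0\t121nt, >seqC... *
>Cluster 2
0\t119nt, >seqD... *
1\t119nt, >seqE... at +/98.32%
""".splitlines()

clusters = parse_clstr(example)
train, test = split_by_cluster(clusters, test_fraction=0.34, seed=0)
print("train:", train)
print("test:", test)
```

Splitting at the cluster level rather than the sequence level trades an exact test-set size for a cleaner separation: the realized test fraction depends on how large the selected clusters happen to be.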