Doudna-lab/GARNET_DL

DESCRIPTION

This repository contains all code described in our recent manuscript:

  • Yekaterina Shulgina*, Marena I. Trinidad*, Conner J. Langeberg*, Hunter Nisonoff*, Seyone Chithrananda*, Petr Skopintsev*, Amos J. Nissley*, Jaymin Patel, Ron S. Boger, Honglue Shi, Peter H. Yoon, Erin E Doherty, Tara Pande, Aditya M. Iyer, Jennifer A. Doudna, Jamie H. D. Cate*. RNA language models predict mutations that improve RNA function. bioRxiv 2024.04.05.588317; doi: https://doi.org/10.1101/2024.04.05.588317

    * Contributed equally

It covers the curation of diverse RNA datasets from the Genome Taxonomy Database (GTDB), the prediction of optimal growth temperature (OGT) phenotypes, and the application of language models (LMs) and graph neural networks (GNNs) to the development of thermostable ribosomes. The resources include examples of dataset preprocessing, generation of training and test sets by hierarchical clustering with CD-HIT, model validation, and sequence generation. This project is a community effort coordinated by the Innovative Genomics Institute and the Department of Electrical Engineering and Computer Sciences at the University of California, Berkeley.
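The idea behind a cluster-based train/test split can be sketched as follows. This is a minimal illustration, not the code in Train_Test_Split/Train_Test_Splits.ipynb: it assumes a standard CD-HIT `.clstr` file as input, performs only a single level of clustering (the manuscript describes a hierarchical procedure), and the function names `parse_clstr` and `split_by_cluster`, the sequence IDs, and the `test_fraction` parameter are all hypothetical. Whole clusters are assigned to one side of the split so that similar sequences never appear in both sets.

```python
import random
import re
from collections import defaultdict


def parse_clstr(text):
    """Map cluster index -> list of sequence IDs from CD-HIT .clstr text."""
    clusters = defaultdict(list)
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # Header line, e.g. ">Cluster 12"
            current = int(line.split()[1])
        else:
            # Member line, e.g. "0\t1500nt, >seqA... *"
            match = re.search(r">(.+?)\.\.\.", line)
            if match and current is not None:
                clusters[current].append(match.group(1))
    return dict(clusters)


def split_by_cluster(clusters, test_fraction=0.2, seed=0):
    """Assign whole clusters to train or test (hypothetical helper).

    test_fraction applies to the number of clusters, so the fraction of
    sequences in the test set is only approximate.
    """
    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)
    n_test = max(1, round(len(cluster_ids) * test_fraction))
    test_clusters = set(cluster_ids[:n_test])
    train, test = [], []
    for cid in cluster_ids:
        (test if cid in test_clusters else train).extend(clusters[cid])
    return train, test


# Made-up example input in CD-HIT .clstr format (IDs are illustrative only).
EXAMPLE = """>Cluster 0
0\t1500nt, >seqA... *
1\t1498nt, >seqB... at +/98.50%
>Cluster 1
0\t1490nt, >seqC... *
>Cluster 2
0\t1502nt, >seqD... *
1\t1501nt, >seqE... at +/99.10%
"""

clusters = parse_clstr(EXAMPLE)
train_ids, test_ids = split_by_cluster(clusters, test_fraction=0.34, seed=1)
```

Splitting at the cluster level rather than the sequence level is what prevents near-duplicate sequences from leaking between the training and test sets.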

The GARNET database (GTDB-Acquired RNA with Environmental Temperatures) is freely available here: https://tinyurl.com/5abszup9

CONTENTS

  • Conda environments:
    • Dependencies for Rfam curation, data preprocessing, figures and OGT prediction: Conda_Environments/data_processing_env.yml
    • Dependencies for GNN Model: Conda_Environments/gnn_environment.yml
    • Dependencies for LM Model: Conda_Environments/lm_environment.yml
  • Dataset curation: Extraction of Rfams from GTDB
    • GTDB_RNA_Curation/*sh
    • GTDB_RNA_Curation/*py
  • Phenotype Annotation: OGT Prediction and Figure 2
    • GTDB_RNA_Curation/Figure_2_OGT.ipynb
  • Train/Test Split: Hierarchical Clustering with CD-HIT
    • Train_Test_Split/Train_Test_Splits.ipynb
  • Dataset Preprocessing: Create Contact Maps for GNN Model
    • Contact_Maps/*
  • Language Model:
    • LM_Model/*
  • GNN Model: Graph-Based RNA Sequence Modeling
    • GNN_Model/*
  • Generated Sequences: Sequences Generated by LM and GNN Models
    • Generated_Sequences/*
  • Validation: Likelihoods of Generated Sequences
    • Validation/*
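
For readers unfamiliar with the contact maps listed above, the sketch below shows what such a map looks like as a data structure. It is only an illustration under stated assumptions: it derives contacts from base pairs in a dot-bracket secondary structure string, whereas the procedure in Contact_Maps/ may differ (for example, it could derive contacts from 3D structures). The function name `dot_bracket_to_contact_map` is hypothetical and does not come from the repository.

```python
def dot_bracket_to_contact_map(structure):
    """Return a symmetric 0/1 matrix (list of lists) of base-pair contacts.

    Assumes a simple dot-bracket string using only '(', ')' and '.',
    i.e. no pseudoknot brackets.
    """
    n = len(structure)
    contacts = [[0] * n for _ in range(n)]
    open_positions = []  # stack of indices of unmatched '('
    for i, symbol in enumerate(structure):
        if symbol == "(":
            open_positions.append(i)
        elif symbol == ")":
            if not open_positions:
                raise ValueError(f"unmatched ')' at position {i}")
            j = open_positions.pop()
            contacts[i][j] = 1
            contacts[j][i] = 1
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {i}")
    if open_positions:
        raise ValueError(f"unmatched '(' at position {open_positions[-1]}")
    return contacts


# A 6-nucleotide hairpin: positions 0-5 and 1-4 are paired.
contact_map = dot_bracket_to_contact_map("((..))")
```

In a graph-based model, each nucleotide would typically be a node and each nonzero entry of such a matrix an edge, often alongside edges between sequence neighbors.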