Open
62 commits
f028558
Diff_gif
LucasSilvaFerreira Sep 16, 2022
e9d4b66
Proposal Update
LucasSilvaFerreira Sep 16, 2022
e36e4db
readme update 0.3
LucasSilvaFerreira Sep 16, 2022
a96b952
Edited README, added intro etc
meuleman Sep 19, 2022
d91aab1
Update README.md
lucapinello Sep 20, 2022
a0819ef
Update README.md
lucapinello Sep 20, 2022
950d251
Update README.md
lucapinello Sep 20, 2022
9787b5f
Create LICENSE.md
lucapinello Sep 28, 2022
d01d7fa
Fixes formatting in the 'Tasks and Potential Roadmap' section
jxilt Sep 28, 2022
aa4ec8d
Added pointers to background papers/material and better data descript…
meuleman Oct 5, 2022
5e1e540
Added pointers to background papers/material and better data descript…
meuleman Oct 5, 2022
79815c7
Fixed typo in markdown link
meuleman Oct 6, 2022
fb2f4c4
Merge pull request #1 from jxilt/patch-1
meuleman Oct 7, 2022
d8e9b92
Simplified data description, with pointer to newly formatted data on …
meuleman Oct 7, 2022
f25eceb
Simplified data description
meuleman Oct 7, 2022
75f0d6c
Simplified data description
meuleman Oct 7, 2022
afb67ee
Update README.md
lucapinello Oct 10, 2022
2fd47a9
Create README.MD
LucasSilvaFerreira Oct 13, 2022
f0dd88b
Add files via upload
LucasSilvaFerreira Oct 13, 2022
2220036
Create README.MD
LucasSilvaFerreira Oct 13, 2022
a99ac73
vanilla diffusion+ hotencoder
LucasSilvaFerreira Oct 13, 2022
2285929
Version 2 of the vanilla diffusion
LucasSilvaFerreira Oct 14, 2022
1267284
feat: classifier free guidance
zanussbaum Oct 16, 2022
a1d1c62
added initial contributor list with names of people that pushed to repo
Oct 17, 2022
73fe583
Added link to contributor list
IhabBendidi Oct 17, 2022
8aaabd2
Line spacing for contributors
IhabBendidi Oct 17, 2022
00254bc
Line spacing *2
IhabBendidi Oct 17, 2022
7f025ae
File update
IhabBendidi Oct 17, 2022
987e7cf
Merge pull request #25 from IhabBendidi/dna-diffusion
LucasSilvaFerreira Oct 17, 2022
816f437
Create Readme.md
LucasSilvaFerreira Oct 17, 2022
a913216
archiving version 1
LucasSilvaFerreira Oct 17, 2022
954ef19
Delete Code_to_refactor_UNET_ANNOTATED.ipynb
LucasSilvaFerreira Oct 17, 2022
76992bc
Update README.MD
LucasSilvaFerreira Oct 17, 2022
333588b
Update README.MD
LucasSilvaFerreira Oct 17, 2022
83603ac
Update Contributors.md
aaronwtr Oct 19, 2022
113a076
Update Contributors.md
IhabBendidi Oct 20, 2022
f526478
Update Contributors.md
IhabBendidi Oct 20, 2022
0ea507a
Create README.MD
LucasSilvaFerreira Oct 22, 2022
3a48081
Add files via upload
LucasSilvaFerreira Oct 22, 2022
acb8353
Update README.md
lucapinello Oct 23, 2022
2eac72a
Update README.md
lucapinello Oct 23, 2022
eecc5ae
Update README.md
lucapinello Oct 23, 2022
7107f7e
directory structure;
LucasSilvaFerreira Oct 26, 2022
4a17dee
adding readme for new directories
LucasSilvaFerreira Oct 26, 2022
b693c0d
ADDING README
LucasSilvaFerreira Oct 26, 2022
719e47f
Fix READM
LucasSilvaFerreira Oct 26, 2022
fa40437
populating dna-diffusion with putative folders
LucasSilvaFerreira Oct 26, 2022
5a3ccdc
adding models dir
LucasSilvaFerreira Oct 26, 2022
a81bde3
adding text to the readmes
LucasSilvaFerreira Oct 26, 2022
7d80777
adding text to the readmes
LucasSilvaFerreira Oct 26, 2022
03817c8
Added an MPRA dataset (Sahu et al)
1edv Oct 26, 2022
f8ccce6
Added metrics folder
lucapinello Oct 26, 2022
30be400
Create .gitignore
lucapinello Oct 26, 2022
de0281d
Update .gitignore
lucapinello Oct 26, 2022
89008eb
Merge branch 'dna-diffusion' of https://github.com/pinellolab/DNA-Dif…
lucapinello Oct 26, 2022
908c2fa
Update README.md
lucapinello Oct 26, 2022
49b2751
DNA HotEncoder Difussion adapted from the Annotated diffusion noteboo…
LucasSilvaFerreira Nov 4, 2022
50ac603
moving prototype diffusion notebook
LucasSilvaFerreira Nov 10, 2022
a03375d
Adding the modified NB
Nov 26, 2022
ca068bc
Merge pull request #60 from pinellolab/new_DNA_diff_notebook
LucasSilvaFerreira Dec 2, 2022
83dcef9
update readme
jamesthesnake Dec 5, 2022
ec9ccad
here
jamesthesnake Feb 9, 2023
129 changes: 129 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,129 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/
10 changes: 10 additions & 0 deletions Contributors.md
@@ -0,0 +1,10 @@
__Luca Pinello,__ Associate Professor, Harvard Medical School, MGH Boston (lucapinello on Discord).
__Wouter Meuleman,__ Investigator, Altius Institute for Biomedical Sciences & Affiliate Associate Professor, University of Washington, Seattle.
__Lucas Ferreira,__ Postdoc, Harvard Medical School/MGH Boston.
__Sameer Gabbita,__ High school intern, MGH, Student at Thomas Jefferson High School for Science & Technology.
__Jiecong Lin,__ Postdoc, Harvard Medical School/MGH, Boston.
__Zach Nussbaum,__ Machine Learning Engineer.
__Matei Bejan,__ PhD Student in deep learning.
__Simon Senan,__ Master Data Science graduate student.
__Aaron Wenteler,__ PhD Student in AI and Drug Discovery, Queen Mary University of London.
__Ihab Bendidi,__ PhD Student in self supervised learning, Ecole Normale Supérieure de Paris.
21 changes: 21 additions & 0 deletions LICENSE.md
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2022 Luca Pinello

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
152 changes: 126 additions & 26 deletions README.md
@@ -1,66 +1,166 @@
# Proposal Title
# Understanding the code of life: generative models of regulatory DNA sequences based on diffusion models.
<img src='https://raw.githubusercontent.com/pinellolab/DNA-Diffusion/f028558816fe5832097c270f424e3b3c3db48d8d/diff_first.gif'> </img>

#update

A proposal by project instigator 1, project instigator 2, etc.

## Abstract
The Human Genome Project has laid bare the DNA sequence of the entire human genome, revealing the blueprint for tens of thousands of genes involved in a plethora of biological processes and pathways.
In addition to this (coding) part of the human genome, DNA contains millions of non-coding elements involved in the regulation of said genes.

Such regulatory elements control the expression levels of genes, in a way that is, at least in part, encoded in their primary genomic sequence.
Many human diseases and disorders are the result of genes being misregulated.
As such, being able to control the behavior of such elements, and thus their effect on gene expression, offers the tantalizing opportunity of correcting disease-related misregulation.

Although such cellular programming should in principle be possible through changing the sequence of regulatory elements, the rules for doing so are largely unknown.
A number of experimental efforts have been guided by preconceived notions and assumptions about what constitutes a regulatory element, essentially resulting in a "trial and error" approach.

Here, we instead propose to use a large-scale data-driven approach to learn and apply the rules underlying regulatory element sequences, applying the latest generative modelling techniques.

Provide brief outline motivating the project. How would it positively impact biological research? What is the hypothesis behind it? No need to discuss datasets or models yet, we will do that later. Focus on the grand picture and *why* the community should care about it.

## Introduction and Prior Work
The goal of this project is to investigate the application and adaptation of recent diffusion models (see https://lilianweng.github.io/posts/2021-07-11-diffusion-models/ for a nice intro and references) to genomics data. Diffusion models are powerful generative models that have been used with outstanding results for image generation (e.g. Stable Diffusion, DALL-E) and music generation (a recent version of the Magenta project).
A particular model formulation called "guided" diffusion makes it possible to bias the generative process in a particular direction if text or continuous/discrete labels are provided during training. This allows the creation of "AI artists" that, based on a text prompt, can create beautiful and complex images (many examples here: https://www.reddit.com/r/StableDiffusion/).
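
One common way to implement this kind of guidance is classifier-free guidance, where the model predicts noise both with and without the label and the two predictions are blended at sampling time. The snippet below is only a minimal sketch of that blending step; the function name, the plain-list representation, and the choice of this particular parametrization are assumptions made for illustration, not a description of the project's code.

```python
def guided_noise(eps_cond, eps_uncond, guidance_scale):
    """Blend conditional and unconditional noise predictions.

    guidance_scale = 0 ignores the label, 1 reproduces the conditional
    prediction, and values above 1 push further toward the label.
    """
    return [u + guidance_scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


# Toy predictions for three positions (hypothetical numbers).
eps_cond = [0.5, -1.0, 2.0]
eps_uncond = [0.0, -1.0, 1.0]
print(guided_noise(eps_cond, eps_uncond, 2.0))
```

In a real sampler the two predictions would come from the same network, called once with the label and once with the label masked out.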

Some groups have reported the possibility of generating synthetic DNA regulatory elements in a context-dependent system, for example, cell-specific enhancers.
(https://elifesciences.org/articles/41279 ,
https://www.biorxiv.org/content/10.1101/2022.07.26.501466v1)


### Step 1: generative model

We propose to develop models that can generate cell-type-specific or context-specific DNA sequences with certain regulatory properties based on an input text prompt.
For example:

- "A sequence that will correspond to open (or closed) chromatin in cell type X"

- "A sequence that will activate a gene to its maximum expression level in cell type X"

- "A sequence active in cell type X that contains binding site(s) for the transcription factor Y"

- "A sequence that activates a gene in liver and heart, but not in brain"


### Step 2: extensions and improvements

Beyond individual regulatory elements, so-called "Locus Control Regions" are known to harbour multiple regulatory elements in specific configurations, working in concert to produce more complex regulatory rulesets. Drawing a parallel with "collaging" approaches, in which multiple stable diffusion steps are combined into one final (graphical) output, we want to apply this notion to DNA sequences with the goal of designing larger regulatory loci. This is a particularly exciting and, to our knowledge, hitherto unexplored direction.

Besides synthetic DNA creations, a diffusion model can help understand and interpret regulatory sequence element components and for instance be a valuable tool for studying single nucleotide variations (https://www.biorxiv.org/content/10.1101/2022.08.22.504706v1) and evolution.
(https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1502-5)


Taken together, we believe our work can accelerate our understanding of the intrinsic properties of DNA regulatory sequences in normal development and in different diseases.

## Proposed framework

For this work we propose to build a Bit Diffusion model based on the formulation proposed by Chen, Zhang and Hinton https://arxiv.org/abs/2208.04202. This model is a generic approach for generating discrete data with continuous diffusion models. An implementation of this approach already exists, and this is a potential code base to build upon:

https://github.com/lucidrains/bit-diffusion
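
To make the "discrete data with continuous diffusion" idea concrete, here is a minimal sketch of the analog-bits representation applied to DNA: each nucleotide becomes two bits, the bits are shifted to real values in {-1, +1} so a continuous model can operate on them, and generated values are thresholded back to bits. The specific bit assignment and the function names are illustrative assumptions, not taken from the linked code base.

```python
# Two bits per nucleotide; this particular assignment is an arbitrary choice.
BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}
BASES = {bits: base for base, bits in BITS.items()}


def to_analog_bits(seq):
    """Map a DNA string to a flat list of floats in {-1.0, +1.0}."""
    return [2.0 * b - 1.0 for base in seq for b in BITS[base]]


def from_analog_bits(values):
    """Threshold real values at 0 and read the bits back as nucleotides."""
    bits = [1 if v > 0 else 0 for v in values]
    return "".join(BASES[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2))


analog = to_analog_bits("GATTACA")
# A small perturbation stands in for imperfect model output.
noisy = [v + 0.3 * (-1) ** i for i, v in enumerate(analog)]
print(from_analog_bits(noisy))
```

Because decoding only looks at the sign of each value, moderate errors in the continuous output still map back to a valid sequence.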

## Tasks and potential roadmap:
- Collecting genomic datasets
- Implementing the guided diffusion based on the code base
- Thinking about the best encoding of biological information for the guided diffusion (e.g. cell type: K562, very strong activating sequence for chromatin, or cell type: GM12878, very open chromatin)
- Plans for validation based on existing datasets or how to perform new biological experiments (we need to think about potential active learning strategies).

Provide a short (preferably beginner friendly) introduction to the project and a brief outline of the literature most relevant to it. How does the project fit into this context?

## Deliverables

What do we plan to provide the broader community with upon the completion of the project? Datasets? Models? APIs? Every deliverable should preferably have its own subsection with its associated potential impact, although it is not required.
- __Dataset:__ compile and provide a complete database of cell-specific regulatory regions (DNase assay) to allow scientists to train and generate different diffusion models based on the regulatory sequences.


- __Models:__ Provide a model that can generate regulatory sequences given a specific cell type and genomic context.


- __API:__ Provide an API to make it possible to manipulate DNA regulatory models and a visual playground to generate synthetic contextual sequences.


## Datasets

### DHS Index:
Chromatin (DNA + associated proteins) that is actively used for the regulation of genes (i.e. "regulatory elements") is typically accessible to DNA-binding proteins such as transcription factors ([review](https://www.nature.com/articles/s41576-018-0089-8), [relevant paper](https://www.nature.com/articles/nature11232)).
Through the use of a technique called [DNase-seq](https://en.wikipedia.org/wiki/DNase-Seq), we've measured which parts of the genome are accessible across 733 human biosamples encompassing 438 cell and tissue types and states, resulting in more than 3.5 million DNase Hypersensitive Sites (DHSs).
Using Non-Negative Matrix Factorization, we've summarized these data into 16 _components_, each corresponding to a different cellular context (e.g. 'cardiac', 'neural', 'lymphoid').
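
As a rough illustration of how per-component labels could be derived from such a decomposition, the sketch below assigns a DHS to the component with the largest NMF loading. It assumes that each DHS comes with a vector of 16 non-negative loadings; that layout, the function name, and the example numbers are hypothetical and are not taken from the data dictionary.

```python
def dominant_component(loadings):
    """Return (index, share) of the largest NMF loading for one DHS."""
    total = sum(loadings)
    if total <= 0:
        raise ValueError("loadings must contain at least one positive value")
    index = max(range(len(loadings)), key=loadings.__getitem__)
    return index, loadings[index] / total


# Hypothetical DHS with 16 loadings, dominated by component 1.
example = [1.0, 3.0] + [0.0] * 14
print(dominant_component(example))
```

The share gives a simple measure of how specific a site is to its dominant component, which could be used to filter ambiguous sites before training.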

For the efforts described in this proposal, and as part of an earlier [ongoing project](https://www.meuleman.org/research/synthseqs/) in the research group of Wouter Meuleman,
we've put together smaller subsets of these data that can be used to train models to generate synthetic sequences for each NMF component.

### Datasets
Please find these data, along with a data dictionary, [here](https://www.meuleman.org/research/synthseqs/#material).

If applicable, how large is the dataset that the project aims to produce? How difficult is producing such a dataset expected to be? What kind of resources are needed? What license will the dataset be licensed under? MIT is preferred but not required.
### Other potential datasets:

- DNA sequence data corresponding to annotated regulatory sequences, such as gene promoters or distal regulatory sequences such as enhancers, annotated (based on chromatin marks or accessibility) for hundreds of cell types by NHGRI-funded projects like ENCODE or Roadmap Epigenomics.

### Models
- Data from MPRA assays that test the regulatory potential of hundreds of DNA sequences in parallel (https://elifesciences.org/articles/69479.pdf , https://www.nature.com/articles/s41588-021-01009-4 , ... )

If applicable, does the project aim to release more than one model? What would be the input modality? What about the output modality? How large are the models that the project aims to release? Are there other important differences between the models to be released? If the models are very different, consider writing a short subsection for each model type.
- MIAA assays that test the ability of sequences to open chromatin within a given cell type.

### APIs
## Models

If applicable, what kind of API does the project aim to release? Are there any existing APIs that it could be integrated into? What kind of documentation could the project provide?
## Input modality:
A) Cell type + regulatory element. Ex: brain tumor cell, weak enhancer.
B) Cell type + regulatory element + TF combination (presence or absence). Ex: prostate cell, enhancer, AR (present), TFAP2A (present) and ER (absent).
C) Cell type + TF combination + TF positions. Ex: blood stem cell, GATA2 (present) and ER (absent) + GATA1 (100-108).
D) Sequence containing a genetic variant -> low number of diffusion steps = nucleotide importance prediction.
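
One possible way to turn inputs like A to C into something a model can consume is a fixed-length conditioning vector: a one-hot cell type followed by one flag per transcription factor (+1 present, -1 absent, 0 unspecified). This is a sketch under stated assumptions only; the vocabularies, the function name, and the flag convention are hypothetical, and TF positions (modality C) are left out.

```python
CELL_TYPES = ["K562", "GM12878"]      # hypothetical vocabulary
TFS = ["AR", "ER", "GATA1", "GATA2"]  # hypothetical vocabulary


def encode_condition(cell_type, present=(), absent=()):
    """One-hot cell type followed by a +1/-1/0 flag per transcription factor."""
    if cell_type not in CELL_TYPES:
        raise ValueError(f"unknown cell type: {cell_type}")
    unknown = (set(present) | set(absent)) - set(TFS)
    if unknown:
        raise ValueError(f"unknown transcription factors: {sorted(unknown)}")
    if set(present) & set(absent):
        raise ValueError("a transcription factor cannot be both present and absent")
    cell = [1 if c == cell_type else 0 for c in CELL_TYPES]
    flags = [1 if tf in present else -1 if tf in absent else 0 for tf in TFS]
    return cell + flags


print(encode_condition("K562", present=["GATA2"], absent=["ER"]))
```

A learned embedding of free text would be an alternative; the explicit vector is simply easier to inspect and to validate against existing annotations.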

### Paper
### Output:
DNA-sequence
__Model size:__
The number of enhancers and other biological sequences is no larger than the number of images available in the LAION dataset. The dimensionality of our generated DNA outputs should not exceed 4 bases [A,C,T,G] x ~1 kb. The final models should be bigger than ~2 GB.
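
The "4 bases x length" dimensionality corresponds to a one-hot encoding of the sequence. The following minimal sketch shows the encoding and its inverse; the row order follows the [A,C,T,G] ordering above, and the function names are illustrative assumptions rather than the project's API.

```python
ALPHABET = "ACTG"  # row order follows the [A,C,T,G] ordering used in the text


def one_hot(seq):
    """Return a 4 x len(seq) matrix (list of rows), one row per base."""
    return [[1 if base == letter else 0 for base in seq] for letter in ALPHABET]


def decode(matrix):
    """Recover the sequence by taking the highest-scoring row in each column."""
    length = len(matrix[0])
    return "".join(
        ALPHABET[max(range(4), key=lambda row: matrix[row][col])]
        for col in range(length)
    )


encoded = one_hot("ACGT")
print(len(encoded), len(encoded[0]))
print(decode(encoded))
```

Taking the per-column maximum in `decode` means the same function can be applied to real-valued model outputs, not only to exact 0/1 matrices.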

Can the project be turned into a paper? What does the evaluation process for such a paper look like? What conferences are we targeting? Can we release a blog post as well as the paper?
__Models:__
Different models can be created based on the total sequence length.

## Resources
## APIs
TBD depending on interest

### Requirements
## Paper
__Can the project be turned into a paper? What does the evaluation process for such a paper look like? What conferences are we targeting? Can we release a blog post as well as the paper?__

What kinds of resources (e.g. GPU hours, RAM, storage) are needed to complete the project?
Yes. We intend to combine in silico generation with experimental validation to study our models' performance on classic regulatory systems (e.g. sickle cell disease and cancer).
Our group and collaborators have a strong track record in the academic community, with publications in high-impact journals such as Nature and Cell.

### Timeline

What is a (rough) timeline for this project?
## Resources Requirements
__What kinds of resources (e.g. GPU hours, RAM, storage) are needed to complete the project?__

Our initial model can be trained on a small dataset (~1k sequences) in about 3 hours (~500 epochs) on a Colab Pro instance (24 GB RAM) with a single Tesla K80 GPU. Based on this, we expect that training this or similar models on the large dataset mentioned above (~3 million sequences, each 4x200) will require several high-performance GPUs for about 3 months. (Optimization suggestions are welcome!)
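
The estimate above can be reproduced with a back-of-the-envelope calculation. The sketch below assumes that training time scales linearly with the number of sequences at the same epoch count, that "several" means 4 GPUs of K80-class speed, and that parallel scaling is perfect; all three are simplifying assumptions.

```python
def estimate_training(small_sequences, small_hours, large_sequences, gpus):
    """Linearly scale a measured training time to a larger dataset.

    Returns (single-GPU hours, wall-clock months on `gpus` devices),
    using 30-day months and assuming perfect parallel efficiency.
    """
    single_gpu_hours = small_hours * large_sequences / small_sequences
    months = single_gpu_hours / gpus / 24 / 30
    return single_gpu_hours, months


hours, months = estimate_training(
    small_sequences=1_000, small_hours=3.0, large_sequences=3_000_000, gpus=4
)
print(f"{hours:.0f} single-GPU hours, about {months:.1f} months on 4 GPUs")
```

Faster GPUs, fewer epochs, or mixed precision would shorten this considerably, so the figure is best read as an upper bound for planning.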

## Timeline
__What is a (rough) timeline for this project?__

6 months to 1 year.

## Broader Impact
__How is the project expected to positively impact biological research at large?__

We believe this project will help to better understand genomic regulatory sequences, including their composition and the potential regulators acting on them in different biological contexts, and that this knowledge has the potential to inform new therapeutics.

How is the project expected to positively impact biological research at large?

## Reproducibility
We will follow best practices, including version control, to make sure our code is reproducible. We will release data processing scripts and conda environments/Docker images so that other researchers can easily run it.

What steps are going to be taken to ensure the project's reproducibility? Will data processing scripts be released? What about training logs?
We have several assays and technologies to test the synthetic sequences generated by these models at scale based on CRISPR genome editing or massively parallel reporter assays (MPRA).

## Failure Case

If our findings are unsatisfactory, do we have an exit plan? Do we have deliverables along the way that we can still provide the community with?
## Failure Case
Regardless of the performance of the final models, we believe it is important to test diffusion models on novel domains and other groups can build on top of our investigations.

## Preliminary Findings
Using the Bit Diffusion model, we were able to reconstruct 200 bp sequences with a motif composition very similar to that of the training sequences. The plan is to add cell-type conditioning variables to the model to check how different regulatory regions depend on the cell-specific context.
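
A simple, assumption-laden proxy for comparing motif composition between training and generated sequences is to compare their k-mer frequency profiles. The sketch below does this with the total variation distance (0 means identical profiles, 1 means no shared k-mers). It is not the metric used for the finding above; the function names and the choice of distance are illustrative only.

```python
from collections import Counter


def kmer_frequencies(seqs, k):
    """Relative frequency of every k-mer observed across a set of sequences."""
    counts = Counter(
        seq[i:i + k] for seq in seqs for i in range(len(seq) - k + 1)
    )
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two k-mer frequency profiles."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in keys)


train = ["ACGTACGT", "TTGACGTA"]      # toy stand-ins for training sequences
generated = ["ACGTACGA", "TTGACGTT"]  # toy stand-ins for model output
distance = total_variation(kmer_frequencies(train, 3), kmer_frequencies(generated, 3))
print(round(distance, 3))
```

Proper motif analysis would scan for known transcription factor binding motifs; k-mer profiles are only a quick sanity check.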

If applicable, mention any preliminary findings (e.g. experiments you have run on your own or heard about) that support the project's importance.

## Next Steps
Expand the model length to generate complete regulatory regions (enhancer + gene promoter pairs).
Use our synthetic enhancers in in vivo models and check how they regulate transcriptional dynamics in biological scenarios (beyond the MPRA assays).


## How to contribute
If this project sounds exciting to you, **please join us**!
Join the OpenBioML Discord: https://discord.gg/Y9CN2dUzQJ. We are discussing this project in the **dna-diffusion** channel, where we will provide instructions on how to get involved.

## Known contributors

If the project is successfully completed, are there any obvious next steps?
You can access the contributor list [here](https://docs.google.com/spreadsheets/d/1_nxDI6DIoWbyUDpIDX-tJIILejrJ0kEYrcXXdWlzPvU/edit#gid=1871728801).

## Known contributors

Please list community members that you know are interested in contributing. It is best if a project proposal already has an associated team capable of going ahead with the project by themselves, but it is not necessary.
Binary file added diff_first.gif
Binary file added dna-diffusion/.DS_Store
Binary file not shown.
6 changes: 6 additions & 0 deletions dna-diffusion/README.md
@@ -0,0 +1,6 @@
# DNA-diffusion files structure
- data: contains the data used in the DNA diffusion project
- losses: contains the losses used in the DNA diffusion project
- metrics: contains the internal metrics used to evaluate the quality of the generated sequences after training
- models: contains the models used in the DNA diffusion project (UNET, VQ-VAE, etc.)
- utils: contains the utils used in the DNA diffusion project
1 change: 1 addition & 0 deletions dna-diffusion/data/README.md
@@ -0,0 +1 @@
# Data
1 change: 1 addition & 0 deletions dna-diffusion/losses/README.md
@@ -0,0 +1 @@
# Losses
1 change: 1 addition & 0 deletions dna-diffusion/metrics/README.md
@@ -0,0 +1 @@
# Internal metrics to assess the quality of generated sequences after training.
1 change: 1 addition & 0 deletions dna-diffusion/models/README.md
@@ -0,0 +1 @@
# Models
1 change: 1 addition & 0 deletions dna-diffusion/utils/README.md
@@ -0,0 +1 @@
# Utils
4 changes: 4 additions & 0 deletions notebooks/README.md
@@ -0,0 +1,4 @@
# Notebook
- experiments: Playground for experiments
- refactoring: Notebooks to be refactored into the codebase
- tutorials: Tutorials (e.g. how to use the model to generate new sequences, how to find motifs, etc.)
4 changes: 4 additions & 0 deletions notebooks/experiments/README.md
@@ -0,0 +1,4 @@
# Add here notebooks experimenting with the DNA-diffusion model
This is a collection of notebooks that are used to experiment with the DNA-diffusion model.


1 change: 1 addition & 0 deletions notebooks/experiments/conditional_diffusion/README.MD
@@ -0,0 +1 @@
