In [1]:
import pandas as pd
import numpy as np

# Seemingly useful ones

## 1. Genome-wide Functional Characterization of Escherichia coli Promoters and Regulatory Elements Responsible for their Function

https://www.biorxiv.org/content/10.1101/2020.01.04.894907v1.full

Notes: They characterize promoters reported in three previous studies and find that many of the reported promoters are not transcriptionally active. Notably, they use an ML model to predict whether a sequence (a k-mer) is active or inactive, which is similar to what we're trying to do. The features they use are also similar: max −10 σ70 motif position weight matrix (PWM) score, max σ70 −35 motif PWM score, paired −10 and −35 PWM score (PWMs scanned jointly, allowing a gap of 16, 17, or 18 bp between the −10 and −35 motifs), and percent GC content. They used both featurized models and a CNN, but only for sigma70.
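
The four features listed above can be sketched as follows. This is a rough illustration, not the paper's code: the PWMs are toy matrices built from the commonly cited σ70 hexamers (TTGACA and TATAAT), and the names `toy_pwm`, `scan`, and `promoter_features` are made up for this note. A real analysis would use PWMs estimated from data.

```python
import numpy as np

BASES = "ACGT"

def toy_pwm(consensus, match=1.0, mismatch=-1.0):
    """Hypothetical PWM: one row per motif position, columns A, C, G, T."""
    pwm = np.full((len(consensus), 4), mismatch)
    for pos, base in enumerate(consensus):
        pwm[pos, BASES.index(base)] = match
    return pwm

def scan(seq, pwm):
    """PWM score at every start offset of seq (uppercase A/C/G/T only)."""
    idx = np.array([BASES.index(b) for b in seq])
    width = len(pwm)
    rows = np.arange(width)
    return np.array([pwm[rows, idx[i:i + width]].sum()
                     for i in range(len(seq) - width + 1)])

def promoter_features(seq, pwm35, pwm10, gaps=(16, 17, 18)):
    """Max -35 score, max -10 score, best paired score, and percent GC."""
    s35 = scan(seq, pwm35)
    s10 = scan(seq, pwm10)
    paired = -np.inf
    for i, score35 in enumerate(s35):
        for gap in gaps:
            j = i + len(pwm35) + gap  # start of the -10 motif after the gap
            if j < len(s10):
                paired = max(paired, score35 + s10[j])
    gc = 100.0 * sum(b in "GC" for b in seq) / len(seq)
    return {"max_minus35": s35.max(), "max_minus10": s10.max(),
            "max_paired": paired, "percent_gc": gc}

# Toy sequence: consensus -35, a 17 bp spacer, consensus -10.
example = "TTGACA" + "A" * 17 + "TATAAT"
feats = promoter_features(example, toy_pwm("TTGACA"), toy_pwm("TATAAT"))
```

The paired score only combines a −35 hit with a −10 hit when the spacer between them is one of the allowed lengths, which is what distinguishes it from simply adding the two per-motif maxima.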

Data: 

- 2,859 distinct promoters (full table not provided)
- The following table contains **promoters regulating 99 out of 100 randomly sampled essential genes**.

In [34]:
df = pd.read_csv('Genome_promoters.csv')
df

Unnamed: 0.1,Unnamed: 0,Gene,Gene.Left,Gene.Right,Strand,Promoter.found,Upstream.Operon,TSS,Promoter_region,Minimal_promoter,TSS..10,TSS_strength,Region_score,Minimal_promoter_activity
0,0,ribA,1336593.0,1337184.0,-,Y,,TSS_5291_regulondb,1337163_1337254_-,,"TSS_5291_regulondb,1337212,-",1.281903,111.13,
1,1,hemL,173601.0,174882.0,-,Y,,"TSS_714_regulondb, TSS_716_regulondb, TSS_718_...",174923_175099_-,174949_175099_-,,,214.77,1.022321
2,2,murB,4170079.0,4171108.0,+,Y,,TSS_16296_storz_regulondb,4169977_4170170_+,4170007_4170157_+,"TSS_16296_storz_regulondb,4170057,+",1.091445,229.15,1.123441
3,3,murI,4163450.0,4164308.0,+,Y,btuB-murI,TSS_16250_storz,4161317_4161527_+,4161367_4161477_+,"TSS_16250_storz,4161392,+",1.128931,269.52,1.059988
4,4,dfp,3810753.0,3811974.0,+,Y,,TSS_14775_storz,3810605_3810836_+,3810686_3810836_+,"TSS_14775_storz,3810724,+",1.017665,287.95,1.006948
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
97,97,murG,99643.0,100711.0,+,Y,mraZ-rsmH-ftsLI-murEF-mraY-murD-ftsW-murG,"TSS_385_regulondb, TSS_383_regulondb",,"89547_89587_+, 89317_89357_+",,,,
98,98,lolC,1174649.0,1175849.0,+,Y,,"TSS_4544_regulondb, TSS_4545_regulondb",,1174588_1174618_+,,,,1.334013
99,99,topA,1329071.0,1331669.0,+,Y,,"TSS_5245_storz_regulondb, TSS_5247_regulondb, ...",1328557_1329157_+,1328927_1328837_+,,,,1.728379
100,100,yjeE,4393608.0,4394069.0,+,,,,,,,,,
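
The `Promoter_region` and `Minimal_promoter` columns pack coordinates into strings such as `1337163_1337254_-`, and some cells hold several comma-separated regions. A minimal sketch for unpacking them is below. The helper name `parse_regions` is made up here, and the `start_end_strand` reading of the strings is an assumption based on the values shown in the table.

```python
def parse_regions(value):
    """Split 'start_end_strand' strings (possibly comma-separated) into tuples.

    Missing cells (NaN or empty strings) give an empty list.
    """
    if not isinstance(value, str):
        return []
    regions = []
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        start, end, strand = part.split("_")
        regions.append((int(start), int(end), strand))
    return regions

single = parse_regions("1337163_1337254_-")
multiple = parse_regions("89547_89587_+, 89317_89357_+")
missing = parse_regions(float("nan"))
```

It could be applied with something like `df['Promoter_region'].apply(parse_regions)`. Note that at least one row (topA's `Minimal_promoter`) has start greater than end, so the coordinates may need a sanity check before use.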


## 2. Automated design of thousands of nonrepetitive parts for engineering stable genetic systems, 2020

https://www.nature.com/articles/s41587-020-0584-2#Sec19

Notes: 
- They introduce the Nonrepetitive Parts Calculator, which rapidly generates thousands of highly nonrepetitive genetic parts from specified design constraints.
- They use ML to explain which factors influence the transcription rate.

Code: A user-friendly interface to the Nonrepetitive Parts Calculator is available at
https://salislab.net/software 
Source code is available at https://github.com/hsalis/SalisLabCode 

Data: 4,350 highly nonrepetitive, CRISPR-tunable σ70 promoters (sequence + transcription rate)

![image.png](attachment:image.png)

In [35]:
# TOOLBOX are in Nonrepetitive_parts_toolbox.xls

df = pd.read_excel('Nonrepetitive_parts.xls', sheet_name=1)
df

Unnamed: 0,PROMOTER ID,PROMOTER SEQUENCE,BARCODE,TOOLBOX,MISMATCHES,R1-DNA-COUNT,R2-DNA-COUNT,R1-RNA-COUNT,R2-RNA-COUNT,R1-DNA-COUNT-NORMALIZED,NORMALIZED-R2-DNA-COUNT,NORMALIZED-R1-RNA-COUNT,NORMALIZED-R2-RNA-COUNT,NORMALIZED-TX-RATE-MEAN,NORMALIZED-TX-RATE-STDEV
0,SLP2018-1-1,GGGCCCCTACTCACCGAGGATTGACATATTGCCTACCAGCGTGTAT...,CTCTGGCTCGTGATGCCAAG,1,0.0,3798,3776,15472,11061,3410.093948,4231.792073,11347.113853,14533.042901,11.728189,2.046538
1,SLP2018-1-2,ACGTGCGCTAGGCCAGCCGATTGACATCTGGTTTTCTGCATAGTAT...,TAGTTGTGACAACAAGGCGA,1,0.0,8842,10345,10894,12895,9222.929422,9767.529980,13382.635261,10469.908184,4.260632,0.038779
2,SLP2018-1-3,GTCATCCGGCTCGTGTTGGGTTGACAGCGCGCCTCAGATTGGATAT...,TAAGACGACGCGGTTAGCCC,1,0.0,4047,4740,6718,6442,4274.586783,4505.095227,6420.456503,6588.668341,5.118120,0.747912
3,SLP2018-1-4,GCAGGCTAGTCTGGTCTCGTTTGACATTAAGAGGTATGGAGGATAT...,ATATGTTGTCCATGCGCGGC,1,0.0,3410,2995,36668,25200,2709.072899,3805.813402,27798.485415,32164.434533,31.877170,2.021227
4,SLP2018-1-5,ACAAGTTGCGGTCCGGGCATTTGACAACTTGTACGCCGACCCGTAT...,TAACTAACTCATCAGGCCAC,1,0.0,1865,2667,3387,4363,2414.573248,2109.672569,4324.205976,3352.230684,5.792040,0.581337
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4346,SLP2018-3-47,ATGCTTTCGCGCCACTGATGCCACGCCTTACTGGCGTATCTTCGCG...,ATCGACTGCCTGTAACAAAA,3,12.0,13613,9719,130,82,8677.244449,15072.223840,119.939146,226.745426,0.050279,0.010062
4347,SLP2018-3-48,ACGACGCTTTACCGGTTACAGGAGGCCCCATTGTGTCATGTAGAGG...,TCTCACTCGGTGGCCAGAGA,3,12.0,2872,3179,10,72,2874.260386,3215.084068,110.197864,111.926821,0.125593,0.014041
4348,SLP2018-3-49,TCTGGTATCAACACCGACGAGGCGGGTAAGCGGTCATGGTTATCGC...,GTGGTATTGGTTCGGCGACC,3,12.0,7132,8581,620,745,7680.807842,7890.310208,766.552077,695.845548,0.322054,0.031675
4349,SLP2018-3-50,AGGGTAAACATATAGGCGTGCGCCGGTAGATACATACGTCGACGTG...,ATAGCAACCAGACTCAAAGA,3,12.0,7468,10095,207,381,9005.397718,8258.885418,411.360330,300.433680,0.139408,0.006500


# Not so useful ones

## 1. Deciphering the regulatory genome of Escherichia coli, one hundred promoters at a time

https://elifesciences.org/articles/55308

Promoter Data: 
All data and code for the experiments are available at https://zenodo.org/record/3953312#.YmIuRepKhPY 

Summary: 
In the bacterium Escherichia coli, the regulation of ≈65% of promoters remains unknown. The authors introduce a new method, Reg-Seq, that links massively parallel reporter assays with mass spectrometry to produce a base-pair-resolution dissection of more than a hundred E. coli promoters in 12 growth conditions. They demonstrate that the method recapitulates known regulatory information, and they examine regulatory architectures for more than 80 promoters that previously had no known regulatory information.

**They have data for all their figures. I'll look more into their other data to see if any is useful.**

In [15]:
df = pd.read_csv('promoter_107.csv')
df

Unnamed: 0,name,start_site
0,fdoH,4085867
1,sdaB,2928035
2,thiM,2185451
3,yedJ,2033449
4,ykgE,321511
...,...,...
101,znuA,1942661
102,zupT,3182433
103,pitA,3637612
104,ecnB,4376509


## 2. Deciphering the regulatory genome of Escherichia coli, one hundred promoters at a time (arXiv version)

https://arxiv.org/pdf/2001.07396.pdf 


Promoter Data: https://elifesciences.org/articles/55308/figures#supp1
- Supp1: presumed location of the TSS for each promoter region in Reg-Seq
- Supp3: all transcription-factor-binding sites that were either identified by the automated binding site algorithm or identified manually with additional evidence for binding

Their source code is also available at the same link.

Summary: The authors introduce Reg-Seq. Its original formulation, termed Sort-Seq, used fluorescence-activated cell sorting one gene at a time to uncover putative binding sites for previously uncharacterized promoters. 


In [29]:
# Table of transcription start sites.
# This file contains the name of each gene, the listed TSS, whether transcription occurs 5’ to 3’, and the wild-type sequence of the regulatory region.
# It also lists the transcription factors given in RegulonDB and the TFs expected to be found in the mutagenized region.
# The reasoning behind selecting each transcription start site for the experiment is also listed.
# The section below the gene listings gives the construct sequences, with the oligo and barcode.

df = pd.read_excel('presumed_tss.xlsx', header=6)
df = df.iloc[:115,1:9]
df.to_csv('presumed_tss.csv')
df

Unnamed: 0,name,start_site,reverse or forward,offset,all features listed in regulonDB,Features expected to appear in the mutagenized region.,notes,Reason for choosing TSS
0,fdoH,4085867.0,rev,0.0,small_RNA_SdhX,,,TSS cluster as identified from RACE experiment...
1,sdaB,2928035.0,fwd,0.0,,,,Computationally predicted start site. See Huer...
2,thiM,2185451.0,rev,0.0,thiamine_diphosphate,,not in region.,Experimentally determined start site as found ...
3,yedJ,2033449.0,rev,0.0,,,,RACE start site from RegulonDB. This is the st...
4,ykgE,321511.0,fwd,0.0,,,,Computationally predicted start site for sigma...
...,...,...,...,...,...,...,...,...
110,dcm,2032352.0,rev,0.0,,,,
111,amiC,2949066.0,rev,0.0,,,,
112,mutM,3811175.0,rev,0.0,,,,
113,ybjX,919103.0,rev,0.0,,,,


In [33]:
# Transcription factor binding locations identified by Reg-Seq.

# All sites shown here have either been identified by the automated binding site algorithm or have additional evidence (mass spec, bioinformatic, etc.) for their presence.
# Some large sites have been manually divided into two sites. If there are multiple binding partners at one location, the site is not listed multiple times.
# Start and end locations are given with respect to the presumed TSS locations (listed in Supplementary Table 1).
# Some start and end locations have been adjusted manually; the automatically determined edges can be found in the other tabs.

df = pd.read_excel('TFbindingsite.xlsx', header=7)
df = df.iloc[:115,1:9]
df.to_csv('TFbindingsite.csv')
df

Unnamed: 0,start,end,type,identity
0,-12,4,repressor,
1,-48,-27,repressor,YgbI
2,15,29,repressor,
3,-50,-30,activator,CRP
4,-37,-28,repressor,DgoR
...,...,...,...,...
86,-27,-4,repressor,NsrR
87,-78,-55,activator,XylR
88,-50,-27,activator,XylR
89,-24,18,repressor,MarA
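
Because the binding-site start and end positions are relative to the presumed TSS, they must be combined with the TSS table to get genome coordinates. Below is a minimal sketch of that conversion. It assumes the TSS itself is position 0, that negative positions are upstream in the direction of transcription, and that `rev` means transcription runs toward decreasing genome coordinates. None of these conventions were verified against the supplement, and the helper name `relative_to_genomic` is made up here.

```python
def relative_to_genomic(tss, direction, rel_start, rel_end):
    """Convert TSS-relative site bounds to (low, high) genome coordinates.

    direction is 'fwd' or 'rev', as in the 'reverse or forward' column.
    Assumes the TSS is position 0 (an unverified convention).
    """
    if direction not in ("fwd", "rev"):
        raise ValueError(f"unexpected direction: {direction!r}")
    sign = 1 if direction == "fwd" else -1
    a = tss + sign * rel_start
    b = tss + sign * rel_end
    return (min(a, b), max(a, b))

# Illustrative pairing only: the TF table above does not show which
# promoter each site belongs to.
site_rev = relative_to_genomic(4085867, "rev", -12, 4)
site_fwd = relative_to_genomic(2928035, "fwd", -50, -30)
```

Returning the bounds as `(low, high)` regardless of strand makes the result easier to intersect with other genome annotations.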


## 3. Systematic approach for dissecting the molecular mechanisms of transcriptional regulation in bacteria, 2018

https://www.pnas.org/content/115/21/E4796.short 

Promoter Data: https://zenodo.org/record/1184169#.YmJHaurMJPY (which is just the **RegulonDB** dataset)

Predicted promoter files:
- PromoterPredictionSigma70Set_SeqOnly_35: https://docs.google.com/document/d/1BlHw6rsB_c1NX365xXoNFDHP7lzXqdSXD5ZAM8BbrWE/edit?usp=sharing 
- PromoterPredictionSigma70Set_SeqOnly_10: https://docs.google.com/document/d/1LTNtoNPgAbFTSWXz3SjY38HnQmpw_E40eVhZ3_1kOkc/edit?usp=sharing 

Summary: Characterizing the molecular mechanisms by which individual regulatory sequences operate requires low-throughput methods. The authors instead dissect multiple promoters at once, showing how a combination of massively parallel reporter assays, mass spectrometry, and information-theoretic modeling can be used to dissect multiple bacterial promoters in a systematic way. They recover nucleotide-resolution models of promoter mechanisms. For some promoters, including previously unannotated ones, the approach also yields quantitative biophysical models describing input–output relationships.

# Tables in non-tabular format

## 1. The Whole Set of Constitutive Promoters Recognized by RNA Polymerase RpoD Holoenzyme of Escherichia coli, 2014

Link: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0090447#abstract0

Summary:
- Identifies the full range of constitutive promoters on the E. coli genome by finding RpoD holoenzyme-binding sites with Genomic SELEX screening and then applying the reconfirmed consensus sequence.
- Promoter sequences are highly conserved: about 85% carry five-out-of-six agreements with the -35 or -10 consensus sequence.
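
The five-out-of-six agreement criterion can be checked with a simple position-by-position match count. The sketch below uses the commonly cited σ70 hexamers (TTGACA for -35, TATAAT for -10). Whether the paper's reconfirmed consensus is exactly these hexamers is an assumption, and the function names are made up for this note.

```python
MINUS_35 = "TTGACA"  # commonly cited sigma70 -35 consensus (assumed here)
MINUS_10 = "TATAAT"  # commonly cited sigma70 -10 consensus (assumed here)

def consensus_matches(hexamer, consensus):
    """Number of positions at which hexamer agrees with the consensus."""
    if len(hexamer) != len(consensus):
        raise ValueError("hexamer and consensus must have the same length")
    return sum(a == b for a, b in zip(hexamer.upper(), consensus))

def is_near_consensus(hexamer, consensus, min_matches=5):
    """True if at least min_matches of the positions agree."""
    return consensus_matches(hexamer, consensus) >= min_matches

perfect = consensus_matches("TTGACA", MINUS_35)
one_off = consensus_matches("TTGATA", MINUS_35)
near = is_near_consensus("TACAAT", MINUS_10)
far = is_near_consensus("GGGCCC", MINUS_10)
```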

Promoter data (S2): https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0090447#s5 (a table in PDF format). Constitutive promoters are classified into three spacer types. The table contains map position, operon, and gene name.


## 2. The Whole Set of Constitutive Promoters Recognized by RNA Polymerase RpoD Holoenzyme of Escherichia coli, 2014


Link: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0090447 

Promoter Data: (promoters shown with a shaded background are those listed in the RegulonDB and EcoCyc databases)
RNA Polymerase RpoD Holoenzyme-Binding Sites on the E. coli Genome (Cut-off level, 2.0)
A total of 1,075 RpoD holoenzyme-binding sites were identified within spacers on the entire E. coli K-12 W3110 genome. The binding sites identified within intergenic spacers were classified into Type-A, Type-B and Type-C:
https://drive.google.com/file/d/1nUKVT8y1vuJ7F4mQJFDEORdDVSywxxeI/view?usp=sharing 

Constitutive Promoters within Type-A Spacers (the constitutive promoters were predicted based on the location of the RpoD holoenzyme-binding sites):
https://drive.google.com/file/d/1KoWfCnZGd8YxURckNDf60kxO8zSxErzP/view?usp=sharing 
Constitutive Promoters (Type-B Spacers) (Leftward transcription):
https://journals.plos.org/plosone/article/figure?id=10.1371/journal.pone.0090447.t002 
Constitutive Promoters (Type-B Spacers) (Rightward transcription):
https://journals.plos.org/plosone/article/figure?id=10.1371/journal.pone.0090447.t003 

Summary: To identify the canonical sequence of “constitutive promoters” that are recognized by the RNA polymerase holoenzyme containing RpoD sigma in the absence of supporting transcription factors, an in vitro mixed transcription assay was carried out using a whole set of variant promoters.
To identify the full range of constitutive promoters on the E. coli genome, the authors first identified a total of 2,701 RpoD holoenzyme-binding sites by Genomic SELEX screening. Using the reconfirmed consensus promoter sequence, they then identified at most 669 constitutive promoters, implying that the majority of previously identified promoters are TF-dependent “inducible promoters”.

## 3. Global analysis of translation termination in E. coli

Link: https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1006676

Summary: 
Ribosome profiling on a set of isogenic strains with well-characterized release factor mutations, to determine how these mutations alter translation globally.

Data: https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1006676#sec020
(tables inside a docx file)
- Annotated likely/possible recoding and non-recoding events of each gene.
- Stop codon distribution of annotated recoding events.
