# Exploring Regulatory Sequence of *tetR*/*tetA* in Tn10 

(c) 2020 Tom Röschinger. This work is licensed under a [Creative Commons Attribution License CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/). All code contained herein is licensed under an [MIT license](https://opensource.org/licenses/MIT).

***

In this notebook we look at the transposon **Tn10**, which contains a natural expression system in which *tetA* is regulated by *tetR*.

In [1]:
import wgregseq

# Include these if package is manipulated while running the notebook
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np

First we read the FASTA file obtained from GenBank.

In [2]:
# Read the file, skip the header line and join the remaining lines into one string
with open("tn10.fasta", "r") as file:
    data = file.read().split('\n')[1:]
    sequence = "".join(data)
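The cell above assumes a single-record file whose first line is the header. A minimal sketch of a slightly more defensive reader follows; the function name `read_fasta_sequence` and the toy file are hypothetical and not part of `wgregseq`.

```python
import os
import tempfile


def read_fasta_sequence(path):
    """Return the sequence of the first record in a FASTA file as one string."""
    chunks = []
    seen_header = False
    with open(path, "r") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                if seen_header:
                    break  # stop at the start of a second record
                seen_header = True
                continue
            chunks.append(line.upper())
    return "".join(chunks)


# Demonstration on a temporary toy file (not the Tn10 sequence)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.fasta")
    with open(demo_path, "w") as handle:
        handle.write(">demo record\nACGT\nTTGA\n\n")
    demo_sequence = read_fasta_sequence(demo_path)

print(demo_sequence)
```

Skipping blank lines and stopping at a second header keeps the result independent of trailing newlines or extra records in the file.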

Organization of tetR/tetA regulation:
- two operators that can be bound independently by TetR
- tetA is repressed by both tetO1 and tetO2
- tetR is repressed only by tetO1
- The affinity of TetR for tetO2 is about twice as high as for tetO1

![](tn10_tet.png)

From GenBank, we can find the positions of *tetA* and *tetR*. The repressor gene lies on the reverse strand, so we need the reverse complement whenever we are interested in its actual sequence.

In [3]:
# Exact positions from GenBank
tetR_pos = [4702, 5328]
tetA_pos = [5407, 6612]

For simple access, let's extract the region between the two genes, which contains all regulatory elements.

In [4]:
intergenic_region = sequence[5328:5407-1]
intergenic_region_rev = wgregseq.complement_seq(intergenic_region, rev=True)
intergenic_region

'TAATTCCTAATTTTTGTTGACACTCTATCATTGATAGAGTTATTTTACCACTCCCTATCAGTGATAGAGAAAAGTGAA'

In [5]:
len(intergenic_region)

78
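GenBank feature coordinates are 1-based and inclusive, while Python slices are 0-based and exclude the end index. The sketch below spells out the conversion behind the slice used above; the helper name `region_between` is an illustrative assumption.

```python
def region_between(first_feature_end, second_feature_start):
    """Convert the gap between two 1-based, inclusive features into a Python slice.

    The gap covers the 1-based positions
    first_feature_end + 1 ... second_feature_start - 1.
    """
    return slice(first_feature_end, second_feature_start - 1)


# Coordinates quoted in the notebook
tetR_pos = [4702, 5328]
tetA_pos = [5407, 6612]

gap = region_between(tetR_pos[1], tetA_pos[0])
print(gap.start, gap.stop, gap.stop - gap.start)
```

With these coordinates the slice runs from 5328 to 5406 and spans 78 bases, matching the length reported above.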

Let's extract the sequences for the operators and promoters to confirm we found the right indices.

In [6]:
tetO1 = intergenic_region[21:40]
print("tetO1: ", tetO1)

tetO2 = intergenic_region[51:70]
print("tetO2: ", tetO2)

tetO1:  ACTCTATCATTGATAGAGT
tetO2:  TCCCTATCAGTGATAGAGA


In [7]:
rev_tetO1 = wgregseq.complement_seq(tetO1, rev=True)
rev_tetO1

'ACTCTATCAATGATAGAGT'

In [8]:
rev_tetO2 = wgregseq.complement_seq(tetO2, rev=True)
rev_tetO2

'TCTCTATCACTGATAGGGA'
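For readers without `wgregseq` installed, the reverse complement can be reproduced in plain Python. This is a minimal sketch and not the package's implementation; `reverse_complement` is a hypothetical helper.

```python
# Translation table for complementing upper- and lower-case bases
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


tetO1 = "ACTCTATCATTGATAGAGT"
tetO2 = "TCCCTATCAGTGATAGAGA"

rc_O1 = reverse_complement(tetO1)
rc_O2 = reverse_complement(tetO2)
print(rc_O1)
print(rc_O2)
```

Both results agree with the outputs of the two cells above.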

In [9]:
P_tetA = intergenic_region[16:53]
P_tetA

'TTGACACTCTATCATTGATAGAGTTATTTTACCACTC'

In [10]:
P_tetR1 = intergenic_region_rev[7:45]
P_tetR1

'TTCTCTATCACTGATAGGGAGTGGTAAAATAACTCTAT'

In [11]:
P_tetR2 = intergenic_region_rev[28:65]
P_tetR2

'TGGTAAAATAACTCTATCAATGATAGAGTGTCAACAA'

These all look fine. Let's assign the lacUV5 sequence from Brewster 2012 to a variable, so we can add it to the mutated sequences later on.

In [12]:
lacUV5 = 'TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGG'

## Constructs

All constructs that include a tet operator need to be integrated into a cell that expresses the tet repressor. However, the inserts that contain only the promoter should be observed in the absence of the repressor to identify the binding energy matrix for the -10/-35 regions.

### LacUV5 + individual operators downstream

First we mutate the operator sequences and place them downstream of lacUV5, which yields a simple repression motif. To do so, we use the function `wgregseq.mutations_det` to generate single and double mutants of each operator. We include all single mutants and a sample of 1000 double mutants per operator (`num_mutants=1000` in the cells below). The function guarantees that only unique sequences are returned.

First the tetO1 single mutants.

In [13]:
# Obtain mutants
mutants_single = wgregseq.mutations_det(tetO1, mut_per_seq=1)

# Store sequences in data frame
tetO1_df_single = pd.DataFrame({"seq":mutants_single})

# Add description column
tetO1_df_single["construct"] = "lacUV5_tetO1 single mutant"

# Show first 5 rows
tetO1_df_single.head()

| | seq | construct |
|---|---|---|
| 0 | cCTCTATCATTGATAGAGT | lacUV5_tetO1 single mutant |
| 1 | AaTCTATCATTGATAGAGT | lacUV5_tetO1 single mutant |
| 2 | ACaCTATCATTGATAGAGT | lacUV5_tetO1 single mutant |
| 3 | ACTaTATCATTGATAGAGT | lacUV5_tetO1 single mutant |
| 4 | ACTCaATCATTGATAGAGT | lacUV5_tetO1 single mutant |
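To see how many single and double mutants exist for a 19 bp operator, they can be enumerated exhaustively. The sketch below is a plain-Python illustration, not the algorithm used by `wgregseq.mutations_det`; `enumerate_mutants` is a hypothetical helper and the order of its output need not match the table above.

```python
from itertools import combinations, product

BASES = "ACGT"


def enumerate_mutants(seq, n_mut):
    """Yield every sequence that differs from seq at exactly n_mut positions.

    Mutated bases are written in lower case.
    """
    seq = seq.upper()
    for positions in combinations(range(len(seq)), n_mut):
        # Three alternative bases at each chosen position
        alternatives = [
            [b.lower() for b in BASES if b != seq[p]] for p in positions
        ]
        for choice in product(*alternatives):
            mutant = list(seq)
            for p, base in zip(positions, choice):
                mutant[p] = base
            yield "".join(mutant)


tetO1 = "ACTCTATCATTGATAGAGT"
singles = list(enumerate_mutants(tetO1, 1))
doubles = list(enumerate_mutants(tetO1, 2))
print(len(singles), len(doubles))
```

There are 19 × 3 = 57 single mutants and C(19, 2) × 9 = 1539 double mutants, so a request for 1000 double mutants is a subsample of the full set.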


Second the tetO1 double mutants.

In [15]:
# Obtain mutants
mutants_double_O1 = wgregseq.mutations_det(tetO1, mut_per_seq=2, num_mutants=1000, site_start=-20)

# Store sequences in data frame
tetO1_df_double = pd.DataFrame({"seq":mutants_double_O1})

# Add description column
tetO1_df_double["construct"] = "lacUV5_tetO1 double mutant"

# Show first 5 rows
tetO1_df_double.head()

| | seq | construct |
|---|---|---|
| 0 | ACTCTgTCATcGATAGAGT | lacUV5_tetO1 double mutant |
| 1 | ACTtTATCATTGATAGAGc | lacUV5_tetO1 double mutant |
| 2 | ACTCTATCATTGATAGtGa | lacUV5_tetO1 double mutant |
| 3 | ACTCTATaATTGgTAGAGT | lacUV5_tetO1 double mutant |
| 4 | ACTCTATCATaGAaAGAGT | lacUV5_tetO1 double mutant |


Since we are downsampling the double mutants, we should make sure that we obtain sufficient mutational coverage across the sequence. Hence, we use `wgregseq.mutation_coverage` to compute the mutation rate at each position. With two mutations per 19 bp operator, the expected rate is 2/19 ≈ 0.105, i.e. roughly 0.1. If the rate at any position deviates significantly from this value, we either have to rerun the function that generates the mutations or take a larger sample.

In [16]:
wgregseq.mutation_coverage(tetO1, mutants_double_O1)

array([0.105, 0.106, 0.099, 0.109, 0.105, 0.116, 0.117, 0.103, 0.106,
       0.106, 0.11 , 0.1  , 0.108, 0.092, 0.106, 0.112, 0.094, 0.105,
       0.101])
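The per-position mutation rate is the fraction of mutants that differ from the wild type at that position. A minimal NumPy sketch of this calculation is shown here; it is an assumption about what such a coverage check computes, not the source of `wgregseq.mutation_coverage`, and the toy sequences are made up.

```python
import numpy as np


def mutation_rate_per_position(wild_type, mutants):
    """Fraction of mutants that differ from the wild type at each position.

    All mutants must have the same length as the wild type. Case is ignored,
    so lower-case mutated bases are compared by identity only.
    """
    wt = np.array(list(wild_type.upper()))
    mut = np.array([list(m.upper()) for m in mutants])
    return (mut != wt).mean(axis=0)


# Toy example: four single mutants of a 4 bp sequence
wild_type = "ACGT"
mutants = ["cCGT", "AgGT", "ACGa", "tCGT"]
rates = mutation_rate_per_position(wild_type, mutants)
print(rates)
```

In the toy example the first position is mutated in two of four sequences, the second and fourth in one each, and the third in none.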

Now we can generate the mutants for tetO2 as well.

In [17]:
# Single mutants
mutants_single_O2 = wgregseq.mutations_det(tetO2, mut_per_seq=1)
tetO2_df_single = pd.DataFrame({"seq":mutants_single_O2})
tetO2_df_single["construct"] = "lacUV5_tetO2 single mutant"

# Double mutants
mutants_double_O2 = wgregseq.mutations_det(tetO2, mut_per_seq=2, num_mutants=1000)
tetO2_df_double = pd.DataFrame({"seq":mutants_double_O2})
tetO2_df_double["construct"] = "lacUV5_tetO2 double mutant"

Again, before we proceed we should check the mutation coverage.

In [18]:
wgregseq.mutation_coverage(tetO2, mutants_double_O2)

array([0.11 , 0.094, 0.094, 0.098, 0.112, 0.102, 0.105, 0.1  , 0.098,
       0.113, 0.108, 0.102, 0.11 , 0.11 , 0.105, 0.109, 0.11 , 0.112,
       0.108])

Finally, we combine all sequences into a single data frame, and attach the lacUV5 promoter directly upstream of the mutated operators.

In [20]:
# Combine data frames
tet_df = pd.concat([tetO1_df_single, tetO1_df_double, tetO2_df_single, tetO2_df_double], ignore_index=True)

# Attach lacUV5 to each sequence
tet_df.seq = [lacUV5 + seq for seq in tet_df.seq]

# Display the combined data frame
tet_df

| | seq | construct |
|---|---|---|
| 0 | TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGcCTCT... | lacUV5_tetO1 single mutant |
| 1 | TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGAaTCT... | lacUV5_tetO1 single mutant |
| 2 | TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGACaCT... | lacUV5_tetO1 single mutant |
| 3 | TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGACTaT... | lacUV5_tetO1 single mutant |
| 4 | TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGACTCa... | lacUV5_tetO1 single mutant |
| ... | ... | ... |
| 2109 | TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGTCCCg... | lacUV5_tetO2 double mutant |
| 2110 | TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGTCCCT... | lacUV5_tetO2 double mutant |
| 2111 | TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGTCCCT... | lacUV5_tetO2 double mutant |
| 2112 | TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGTCCCT... | lacUV5_tetO2 double mutant |


This data frame will be included in the final file for the Twist order. Therefore, we add a column that indicates whether primers have already been added to the sequences. Then we save the table as a CSV file in the appropriate folder of this repository.

In [21]:
# Add column
tet_df['primer_added'] = False

# Store file
tet_df.to_csv("../../../../data/twist_order/lacUV5_tetOx_single_double_mutants.csv")
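By default, `DataFrame.to_csv` also writes the row index, which shows up as an unnamed first column when the file is read back. The sketch below illustrates the difference on a toy data frame using in-memory buffers; whether the downstream ordering scripts expect the index column is not stated here, so `index=False` is shown only as an option.

```python
import io

import pandas as pd

# Toy data frame with the same columns as the table above
df = pd.DataFrame(
    {"seq": ["ACGT", "TTGA"], "construct": "demo", "primer_added": False}
)

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the named columns are written
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```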

## Native Promoter sequences

Next, we design the constructs that contain the native promoter sequences. The goal here is to obtain an energy matrix for the RNAP binding sites. We do not need to worry about the operator binding sites, since we can integrate these constructs into strains that do not express the tet repressor, so the operators remain unbound. The paper annotates two promoters, and since we want data for each promoter individually, we randomize the -10/-35 region of the opposing promoter where possible.

First we consider the promoter P_tetR1. Since the -35 region of P_tetR2 overlaps with the -10 region of P_tetR1, we can only randomize the -10 region of P_tetR2. Note that we are working with the reverse complement of the intergenic region, since *tetR* is transcribed in the reverse direction.

In [22]:
# Obtain reversed intergenic region as list to manipulate positions
randomized_R2 = list(intergenic_region_rev)

# Randomize -10 of P_tetR2
randomized_R2[50:66] = list(wgregseq.gen_rand_seq(16).lower())

# Recombine into string
randomized_R2 = "".join(randomized_R2)

# Print result
randomized_R2

'TTCACTTTTCTCTATCACTGATAGGGAGTGGTAAAATAACTCTATCAATGttgtagggccaagggaAATTAGGAATTA'
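When a region is replaced through list slicing, the number of inserted bases must equal the width of the slice, otherwise the sequence changes length and all downstream indices shift. A minimal sketch that ties the two together is given here; `randomize_region` is a hypothetical helper that uses Python's `random` module in place of `wgregseq.gen_rand_seq`.

```python
import random


def randomize_region(seq, start, end, rng):
    """Replace seq[start:end] with random lower-case bases of the same length."""
    if not 0 <= start <= end <= len(seq):
        raise ValueError("region must lie within the sequence")
    random_bases = "".join(rng.choice("acgt") for _ in range(end - start))
    return seq[:start] + random_bases + seq[end:]


# Intergenic region as printed earlier in the notebook, and its reverse complement
intergenic_region = (
    "TAATTCCTAATTTTTGTTGACACTCTATCATTGATAGAGTTATTTTACCACTCCCTATCAGTGATAGAGAAAAGTGAA"
)
template = intergenic_region.translate(str.maketrans("ACGT", "TGCA"))[::-1]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
randomized = randomize_region(template, 50, 66, rng)
print(len(template), len(randomized))
```

Because the number of random bases is derived from `end - start`, the output always has the same length as the input.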

Now we can generate mutations in P_tetR1. We randomly mutate the RNAP binding site at an average rate of 0.1 per position. The number of mutations in each sequence is drawn from a Poisson distribution with mean `len(sequence) * rate`; the number of mutations can also be fixed, see the docstring of the function. After generating the mutants, we check the mutation coverage.

In [25]:
# Generate mutants
PR1_mutants = np.unique(wgregseq.mutations_rand(randomized_R2, 1500, 0.1, site_start=6, site_end=46))

# Check mutation coverage
wgregseq.mutation_coverage(randomized_R2, PR1_mutants, site_start=6, site_end=46)

array([0.10431154, 0.10639777, 0.09735744, 0.10709318, 0.11196106,
       0.09527121, 0.0876217 , 0.10987483, 0.10570236, 0.10570236,
       0.09805285, 0.10083449, 0.10500695, 0.11196106, 0.09179416,
       0.10361613, 0.10013908, 0.10222531, 0.10222531, 0.10431154,
       0.09179416, 0.10292072, 0.09874826, 0.09735744, 0.09944367,
       0.1077886 , 0.10987483, 0.09388039, 0.1147427 , 0.10292072,
       0.10570236, 0.10709318, 0.10013908, 0.10570236, 0.10987483,
       0.1015299 , 0.10848401, 0.10500695, 0.10500695, 0.10292072])
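The Poisson-based mutation scheme described above can be sketched with NumPy. This is an illustration under stated assumptions, not the code of `wgregseq.mutations_rand`: the helper `mutate_window` is hypothetical, the Poisson mean is taken as window length times rate, and the window indices in the example are arbitrary.

```python
import numpy as np


def mutate_window(seq, rate, site_start, site_end, rng):
    """Return one mutant of seq with mutations only in seq[site_start:site_end].

    The number of mutations is Poisson distributed with mean
    (site_end - site_start) * rate; mutated bases are written in lower case.
    """
    window_len = site_end - site_start
    # Draw the number of mutations, capped at the window length
    n_mut = min(rng.poisson(window_len * rate), window_len)
    positions = rng.choice(
        np.arange(site_start, site_end), size=n_mut, replace=False
    )
    mutant = list(seq)
    for p in positions:
        alternatives = [b for b in "ACGT" if b != seq[p].upper()]
        mutant[p] = str(rng.choice(alternatives)).lower()
    return "".join(mutant)


rng = np.random.default_rng(0)  # fixed seed for reproducibility
wild_type = "TTGACACTCTATCATTGATAGAGTTATTTTACCACTC"  # P_tetA as printed above
mutants = [mutate_window(wild_type, 0.1, 5, 30, rng) for _ in range(200)]
print(len(set(mutants)))
```

Some draws produce zero mutations or repeat an earlier mutant, which is why the notebook passes the result through `np.unique` before computing the coverage.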

Next, we mutate the P_tetR2 region. To do so, we first have to randomize the -35 region of P_tetR1.

In [26]:
# Obtain reversed intergenic region as list
randomized_R1 = list(intergenic_region_rev)

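# Randomize P_tetR1 -35 region
# Note: the slice [6:21] spans 15 positions but receives 16 random bases,
# so the result is one base longer (79 bp) than the 78 bp template
randomized_R1[6:21] = list(wgregseq.gen_rand_seq(16).lower())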

# Recombine into string
randomized_R1 = "".join(randomized_R1)

# Print sequence
randomized_R1

'TTCACTgagaatgctgtaggcaTAGGGAGTGGTAAAATAACTCTATCAATGATAGAGTGTCAACAAAAATTAGGAATTA'

Again, we generate mutants of the RNAP binding site and compute the mutation rate per position as quality control.

In [27]:
# Generate mutants
PR2_mutants = np.unique(wgregseq.mutations_rand(randomized_R1, 1500, 0.1, site_start=27, site_end=66))

# Compute mutation rate per position
wgregseq.mutation_coverage(randomized_R1, PR2_mutants, site_start=27, site_end=66)

array([0.09992963, 0.10415201, 0.10344828, 0.10696692, 0.11611541,
       0.10907811, 0.10626319, 0.09078114, 0.11400422, 0.10978184,
       0.10626319, 0.10063336, 0.10133709, 0.11681914, 0.08937368,
       0.10767065, 0.10274455, 0.09992963, 0.09429979, 0.1111893 ,
       0.10274455, 0.09781844, 0.09570725, 0.12033779, 0.10274455,
       0.09078114, 0.10696692, 0.09500352, 0.10415201, 0.10274455,
       0.10696692, 0.09148487, 0.10837438, 0.10063336, 0.11189303,
       0.11259676, 0.08796622, 0.10274455, 0.11048557])

Finally, we generate mutants for the P_tetA sequence. Since this is the only RNAP binding site in the forward direction, we do not need to randomize any part of the sequence before generating mutants.

In [28]:
# Generate mutants
PA_mutants = np.unique(wgregseq.mutations_rand(intergenic_region, 1500, 0.1, site_start=16, site_end=53))

# Compute mutation rate per position
wgregseq.mutation_coverage(intergenic_region, PA_mutants, site_start=16, site_end=53)

array([0.09239517, 0.11229566, 0.108742  , 0.11940299, 0.10447761,
       0.108742  , 0.10660981, 0.10447761, 0.09737029, 0.09950249,
       0.11513859, 0.10376688, 0.10447761, 0.09168443, 0.11869225,
       0.10163468, 0.11158493, 0.10732054, 0.10518834, 0.108742  ,
       0.108742  , 0.10163468, 0.09950249, 0.11656006, 0.11158493,
       0.10447761, 0.09381663, 0.09665956, 0.12153518, 0.10732054,
       0.09808102, 0.10945274, 0.09168443, 0.12508884, 0.11016347,
       0.11513859, 0.10447761])

We combine all generated sequences into a data frame and add the necessary columns. We also add a column that notes whether a region of the sequence was randomized in addition to the mutations in the sequence of interest.

In [29]:
# Define individual data frames
dfR1 = pd.DataFrame({'seq': PR1_mutants, 'primer_added' : False, 'construct' : "P_tetR1", 'note': "P_tetR2 -10 region randomized"})
dfR2 = pd.DataFrame({'seq': PR2_mutants, 'primer_added' : False, 'construct' : "P_tetR2", 'note': "P_tetR1 -35 region randomized"})
dfA = pd.DataFrame({'seq': PA_mutants, 'primer_added' : False, 'construct' : "P_tetA", 'note': ""})

# Combine data frames
df_promoters = pd.concat([dfR1, dfR2, dfA], ignore_index=True)
df_promoters

| | seq | primer_added | construct | note |
|---|---|---|---|---|
| 0 | TTCACTTTTCTCTATCACTGATAGGGAGTGGTAAAATAACTCTATC... | False | P_tetR1 | P_tetR2 -10 region randomized |
| 1 | TTCACTTTTCTCTATCACTGATAGGGAGTGGTAAAATAACTCTATa... | False | P_tetR1 | P_tetR2 -10 region randomized |
| 2 | TTCACTTTTCTCTATCACTGATAGGGAGTGGTAAAATAACTCTATg... | False | P_tetR1 | P_tetR2 -10 region randomized |
| 3 | TTCACTTTTCTCTATCACTGATAGGGAGTGGTAAAATAACTCTATt... | False | P_tetR1 | P_tetR2 -10 region randomized |
| 4 | TTCACTTTTCTCTATCACTGATAGGGAGTGGTAAAATAACTCTAgC... | False | P_tetR1 | P_tetR2 -10 region randomized |
| ... | ... | ... | ... | ... |
| 4261 | TAATTCCTAATTTTTGgTcACACTCaATCATTGATgGAGTTcTgTT... | False | P_tetA | |
| 4262 | TAATTCCTAATTTTTGgTcACAgTCgATCATTGATAGAGTTATTTT... | False | P_tetA | |
| 4263 | TAATTCCTAATTTTTGgcGACtCTCTATCATTcAcAGAGTTATTTT... | False | P_tetA | |
| 4264 | TAATTCCTAATTTTTGggGACACTCTATCATTGATAtAGTTATTTT... | False | P_tetA | |


The combined data frame is saved in the appropriate folder where we are combining the individual constructs into a final order.

In [30]:
df_promoters.to_csv("../../../../data/twist_order/natural_tet_promoters_mutated.csv")

## Computational environment

In [27]:
%load_ext watermark
%watermark -v -p pandas,wgregseq

CPython 3.8.5
IPython 7.10.0

pandas 1.0.3
wgregseq 0.0.1
