# Find genetic sequences from TSS

The code in this tutorial is released under the [MIT License](https://opensource.org/licenses/MIT). All the content in this notebook is under a [CC-by 4.0 License](https://creativecommons.org/licenses/by/4.0/). 

In [2]:
import pandas as pd
import Bio.SeqIO
import regseq.prior_designs

For a more detailed explanation, refer to the [documentation of the Reg-Seq experiment](https://github.com/RPGroup-PBoC/regseq/wiki/1.-Sequence-Design).
<br>
<br>
In this notebook, we extract parts of a wild type genome for the genes considered in this experiment. Each sequence is 160 bp long: 115 bp upstream of the transcription start site (TSS) and 45 bp downstream. The wild type sequences are then used for mutagenesis, where we create many variant sequences with mutations based on the wild type sequence. To run this notebook, you need a file that contains information about the genes of interest, i.e., the name, the transcription start site, and the direction of transcription. For *E. coli*, this information can be obtained from EcoCyc, and you can find an in-depth tutorial on how to find the necessary information in the [documentation of the Reg-Seq experiment](https://github.com/RPGroup-PBoC/regseq/wiki/1.-Sequence-Design).

We start by loading the file that contains the wild type genome. In this repo we have stored a `fasta` file containing the genome of K12 *E. coli*.

In [7]:
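# The FASTA file is expected to hold a single record (the genome).
# If it held several records, this loop would keep only the last one.
for record in Bio.SeqIO.parse('../data/wild_type/sequencev3.fasta', "fasta"):
    genome = str(record.seq)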
    
print("Length genome: {}".format(len(genome)))
print("First 100 bases: {}".format(genome[:100]))

Length genome: 4641652
First 100 bases: AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAAT


In this repo you can find an example file listing the genes for which we want to find the wild type sequences. If you created your own file, simply change the path below to point to it.

In [10]:
totdf = pd.read_csv('../data/prior_designs/example/selected_genes.csv')
totdf

Unnamed: 0,name,start_site,rev,notes
0,livM,3597755,rev,
1,ygbI,2861256,rev,race over comput
2,deaD,3308086,rev,full operon over compute
3,frlR,3504043,fwd,
4,slyA,1720870,rev,
5,wzxC,2120337,rev,comp
6,ycgB,1237285,rev,comp
7,ymgC,1215752,fwd,operator


To generate sequences, we keep only the relevant information from the table, i.e., gene name, location of the TSS, and transcription direction. The only columns needed for further preparation are `name`, `start_site`, and `rev`.

In [11]:
totdf = totdf.loc[:,['name','start_site','rev']]
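
If you supply your own gene list, a quick sanity check before extraction can catch malformed rows. The following is a minimal sketch using only the standard library. `validate_gene_rows` is a hypothetical helper, not part of `regseq`, and the assumption that `rev` only takes the values `fwd` and `rev` is inferred from the example table above.

```python
import csv
import io

REQUIRED_COLUMNS = ("name", "start_site", "rev")
ALLOWED_DIRECTIONS = {"fwd", "rev"}  # assumption based on the example table


def validate_gene_rows(handle):
    """Return a list of (line_number, problem) tuples for a gene-list CSV.

    Hypothetical helper for illustration; raises ValueError if a required
    column is missing from the header.
    """
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError("missing columns: {}".format(", ".join(missing)))

    problems = []
    # The header is line 1, so the first data row is line 2.
    for line_no, row in enumerate(reader, start=2):
        start_site = (row["start_site"] or "").strip()
        direction = (row["rev"] or "").strip()
        if not start_site.isdigit():
            problems.append((line_no, "start_site is not a non-negative integer"))
        if direction not in ALLOWED_DIRECTIONS:
            problems.append((line_no, "unexpected direction"))
    return problems


# Toy input mirroring two rows of the example table.
example = io.StringIO("name,start_site,rev\nlivM,3597755,rev\nfrlR,3504043,fwd\n")
print(validate_gene_rows(example))
```

To check a file on disk, the same function could be called on an open file handle, e.g. `with open(path, newline='') as fh: validate_gene_rows(fh)`.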

Now we can get the wild type sequences for all TSSs. To do so, we use the function `find_seq` in the `regseq.prior_designs` module, which reads the genome at the position of the TSS and copies 160 bp: 115 bp upstream of the transcription start site and 45 bp downstream. These sequences are stored in the column `geneseq`.

In [12]:
totdf['geneseq'] = totdf.apply(regseq.prior_designs.find_seq, axis=1, args=(genome,))
totdf

Unnamed: 0,name,start_site,rev,geneseq
0,livM,3597755,rev,ACAAAATTAAAACATTAGAGAATGAAAAATGTCCAGCATAATCCCC...
1,ygbI,2861256,rev,AAGATAACGGTATGGTGATCTGATTCACATAAATTAACATTGTGTG...
2,deaD,3308086,rev,AAGTACTACCTAAGTCTGGGGGATTTGGACAGCGCCACGGCACTGT...
3,frlR,3504043,fwd,ATTCAGTACCACGGTGCCTGGTAGGTATAACGTTGGCGTGAGCATC...
4,slyA,1720870,rev,TAATAAATATTCTTTAAGTGCGAAAAATTTACGCGCAATTTCTGAA...
5,wzxC,2120337,rev,TCAATGTGCTGACCGGGGGGATGTCGATTGTCGGTCCACGTCCGCA...
6,ycgB,1237285,rev,TATCCAGCATAAAATTCCGTTCAGAAGCGGATTAGTGGCACTCTGA...
7,ymgC,1215752,fwd,ATGATGCAATATGTTTTATCATAACACATTGTTTTATATGCATTAG...
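
To make the 115/45 window concrete, here is a small self-contained sketch on a toy genome. It is not the implementation of `regseq.prior_designs.find_seq`, whose indexing conventions may differ. The names `extract_window` and `reverse_complement` are hypothetical, and the sketch assumes a 0-based TSS index, that reverse-strand genes are returned as the reverse complement of the mirrored window, and that the window does not wrap around the origin of a circular genome.

```python
UPSTREAM = 115    # bases upstream of the TSS
DOWNSTREAM = 45   # bases from the TSS onward, in the direction of transcription
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_window(genome, tss, direction):
    """Return a 160 bp window around a 0-based TSS index (illustrative only).

    In both orientations the TSS base ends up at index 115 of the result.
    """
    if direction == "fwd":
        start, end = tss - UPSTREAM, tss + DOWNSTREAM
    elif direction == "rev":
        # On the reverse strand, upstream lies at higher genome coordinates.
        start, end = tss - DOWNSTREAM + 1, tss + UPSTREAM + 1
    else:
        raise ValueError("direction must be 'fwd' or 'rev'")

    if start < 0 or end > len(genome):
        raise ValueError("window runs past the sequence ends (wrapping not handled)")

    window = genome[start:end]
    return window if direction == "fwd" else reverse_complement(window)


# Toy genome, not real data.
toy_genome = "ACGT" * 100
fwd_window = extract_window(toy_genome, 200, "fwd")
rev_window = extract_window(toy_genome, 200, "rev")
print(len(fwd_window), len(rev_window))
```

Keeping the TSS at the same index in both orientations means downstream steps can treat all sequences alike, regardless of which strand the gene is on.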


Note that the sequences in the `geneseq` column are truncated for display, hence the `...` at the end; they should all have the same length. We can double-check by asking whether any entry has a length other than 160.

In [15]:
any([len(x) != 160 for x in totdf['geneseq'].values])

False
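
A slightly more detailed check can report which entries fail and why. The sketch below is an optional illustration: `check_sequences` is a hypothetical helper, and the restriction to the characters `ACGT` is an assumption about what the extracted sequences should contain.

```python
def check_sequences(seqs, expected_length=160, alphabet="ACGT"):
    """Return (index, reason) tuples for sequences that fail basic checks.

    Hypothetical helper for illustration, not part of regseq.
    """
    allowed = set(alphabet)
    problems = []
    for i, seq in enumerate(seqs):
        if len(seq) != expected_length:
            problems.append((i, "length {}".format(len(seq))))
        elif not set(seq) <= allowed:
            problems.append((i, "unexpected characters"))
    return problems


# Toy sequences, not real data: one valid, one too short, one with an N.
toy = ["A" * 160, "A" * 159, "A" * 159 + "N"]
print(check_sequences(toy))
```

In this notebook, the function could be applied to the extracted sequences with `check_sequences(totdf['geneseq'])`; an empty list means every entry passed.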

Rather than running all these steps separately, we combined them into a single function, `regseq.prior_designs.get_wt_seqs`, which only takes as input the file with gene names, transcription start sites, and directions of transcription, as well as the path where the sequences are saved.

In [7]:
file = '../data/prior_designs/example/selected_genes.csv'
output = '../data/prior_designs/example/wtsequences.csv'
regseq.prior_designs.get_wt_seqs(file, output)

Let's look at the output and confirm that it is what we expect.

In [8]:
pd.read_csv('../data/prior_designs/example/wtsequences.csv')

Unnamed: 0,name,start_site,rev,geneseq
0,livM,3597755,rev,ACAAAATTAAAACATTAGAGAATGAAAAATGTCCAGCATAATCCCC...
1,ygbI,2861256,rev,AAGATAACGGTATGGTGATCTGATTCACATAAATTAACATTGTGTG...
2,deaD,3308086,rev,AAGTACTACCTAAGTCTGGGGGATTTGGACAGCGCCACGGCACTGT...
3,frlR,3504043,fwd,ATTCAGTACCACGGTGCCTGGTAGGTATAACGTTGGCGTGAGCATC...
4,slyA,1720870,rev,TAATAAATATTCTTTAAGTGCGAAAAATTTACGCGCAATTTCTGAA...
5,wzxC,2120337,rev,TCAATGTGCTGACCGGGGGGATGTCGATTGTCGGTCCACGTCCGCA...
6,ycgB,1237285,rev,TATCCAGCATAAAATTCCGTTCAGAAGCGGATTAGTGGCACTCTGA...
7,ymgC,1215752,fwd,ATGATGCAATATGTTTTATCATAACACATTGTTTTATATGCATTAG...


Finally, here are the versions of the packages used in this notebook. To display the versions, we use the IPython extension `watermark`, which can be found [here](https://github.com/rasbt/watermark). (It is already installed if you use the environment we prepared.)

## Computing Environment

In [9]:
%load_ext watermark
%watermark -v -p pandas,Bio,regseq

CPython 3.6.9
IPython 7.13.0

pandas 1.0.3
Bio 1.76
regseq unknown
