# Generate mutated sequences

In [1]:
import numpy as np
import pandas as pd

import Bio.SeqIO
import RegSeq

# import sortseq.utils as utils
import mpathic.utils as utils
seq_dict,inv_dict = utils.choose_dict('dna')


First we load the wild-type sequences of the genes of interest. The `create_gene_seqs.ipynb` notebook shows how these sequences are generated.

In [2]:
df = pd.read_csv('../data/test_data/wtsequences.csv', index_col=0)
df = df.dropna()
df

Unnamed: 0,name,start_site,rev,geneseq,ssdiff,offset
0,fdoH,4085867.0,rev,CATTATGGTATTCTGTTACAAACCCTTCCTGGATGGAGGGAAATTG...,0.0,0.0
3,sdaB,2928035.0,fwd,TACATATATTGCGCGCCCCGGAAGAAGTCAGATGTCGTTTAATGGG...,0.0,0.0
6,thiM,2185451.0,rev,TCTGGATGTCGTTCTGAAGGTGCTGGATTCATATATCAAATAATTT...,0.0,0.0
7,yedJ,2033449.0,rev,TTTTTCCTGTATTCACTGCCGTTGCGCAAAATTTATCTATTTGTTC...,0.0,0.0
9,ykgE,321511.0,fwd,TCGATTTCCCCATAAAATGTGAGCGATGCCGAAAGAAATAAAATTA...,0.0,0.0
...,...,...,...,...,...,...
27,znuA,1942661.0,rev,TTGGCCCAAGTAAAGTCAAAATTTTTCCAGGTTTAAGTTCCAGCGA...,0.0,0.0
28,zupT,3182433.0,fwd,TGCCAGCTGCGGGTATACAAATTATCTTCCAGCACGTTCATCGGAC...,0.0,0.0
29,pitA,3637612.0,fwd,TGCCTGAATTATATAAGATAATTATTTTTTGAGTGAAATCCATACA...,0.0,0.0
30,ecnB,4376509.0,fwd,GAAGACATCAAACATCTCGGCAACTCCATCTCTCGCGCTGCCAGCT...,0.0,0.0


Load in primers provided to us by the Kosuri group.

In [3]:
kosprimefwd = Bio.SeqIO.parse('../data/primers/forward_finalprimers.fasta','fasta')
kosprimerev = Bio.SeqIO.parse('../data/primers/reverse_finalprimers.fasta','fasta')

Extract the 19 primers from the list.

In [4]:
fwdprimes, revprimes = RegSeq.utils.get_primers(kosprimefwd, kosprimerev, 100, 118)
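To illustrate the selection step, here is a minimal sketch of pulling a contiguous block of records out of an iterator such as the one returned by `Bio.SeqIO.parse`. It is not the code of `RegSeq.utils.get_primers`. The helper name `select_records` is hypothetical, and treating both bounds as inclusive, 0-based positions is an assumption inferred from the fact that 100 through 118 yields 19 primers.

```python
from itertools import islice


def select_records(records, start, end):
    """Return the records at 0-based positions start..end, both inclusive.

    Hypothetical helper: the inclusive convention is an assumption, chosen
    because it makes the range 100..118 contain 19 entries.
    """
    # islice consumes the iterator lazily, so it also works on generators.
    return list(islice(records, start, end + 1))


# Stand-in for a parsed FASTA file: 200 placeholder primer names.
names = (f"primer_{i}" for i in range(200))
subset = select_records(names, 100, 118)
print(len(subset), subset[0], subset[-1])
```

Because parser objects of this kind are consumed as they are read, a selection like this can only be made once per parse.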

Determine the total number of genes for which we are ordering sequences.

In [5]:
ngenes = len(df.index)
ngenes

103

Determine the number of mutated sequences to order per gene, given a total order size of 150,000 sequences.

In [6]:
norder = 150000
nseqs = int(np.floor(norder/ngenes)) - 1
nseqs

1455
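The arithmetic in the cell above can be checked with a small standalone sketch. The function name `sequences_per_gene` and the `reserved` parameter are hypothetical. Reading the subtracted one as a slot reserved per gene (for example for the unmutated sequence) is an assumption about intent, not something the notebook states.

```python
def sequences_per_gene(norder, ngenes, reserved=1):
    """Number of mutants per gene that fit into an order of `norder` oligos.

    `reserved` is a hypothetical parameter: it models the "- 1" in the
    notebook as one slot held back per gene (an assumption).
    """
    # Integer division is equivalent to int(np.floor(norder / ngenes))
    # for positive integers.
    return norder // ngenes - reserved


norder, ngenes = 150000, 103
nseqs = sequences_per_gene(norder, ngenes)
total_mutants = nseqs * ngenes
total_with_reserved = (nseqs + 1) * ngenes
print(nseqs, total_mutants, total_with_reserved)
```

With these numbers the order stays within budget even when the reserved slot per gene is counted.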

Now we generate mutated sequences and create an output data frame with each mutant sequence and associated gene.

In [7]:
allseqs = RegSeq.utils.mutation_sequences(df, fwdprimes, revprimes, nseqs)
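For readers who want intuition for what a mutated library looks like, the following is a minimal sketch of random point mutagenesis using only the standard library. It is not the implementation of `RegSeq.utils.mutation_sequences`. The function names, the per-base mutation rate of 0.1, the choice to put the wild-type sequence first, and the omission of primers are all assumptions made for illustration.

```python
import random

BASES = "ACGT"


def mutate_sequence(seq, rate=0.1, rng=None):
    """Replace each base independently with a different base with
    probability `rate` (the default of 0.1 is an assumed value)."""
    rng = rng or random.Random()
    out = []
    for base in seq:
        if rng.random() < rate:
            # Draw uniformly from the three bases that differ from the original.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def make_library(wt, nseqs, rate=0.1, seed=0):
    """Return the wild-type sequence followed by `nseqs` mutants."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [wt] + [mutate_sequence(wt, rate, rng) for _ in range(nseqs)]


wt = "ACGT" * 10
library = make_library(wt, nseqs=5)
for s in library:
    print(s)
```

Drawing each position independently means the number of mutations per sequence varies from mutant to mutant, which is one reason a check of the realized mutation rate is useful afterwards.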

Let's store the result in a whitespace-delimited text file.

In [8]:
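# Raise the column width limit so long sequences are not truncated.
pd.set_option('max_colwidth', int(1e8))
# Use a context manager so the file is flushed and closed after writing.
with open('mutatedseqs_test', 'w') as f:
    allseqs.to_string(
        f, index=False, col_space=10, float_format=utils.format_string)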

Now we can evaluate the resulting sequences to check that the mutation rates are acceptable. Readers interested in the details of this function can consult its docstring with `?RegSeq.utils.check_mutation_rate`. The full code can be found in the file `../RegSeq/utils.py`, or printed by removing the `#` characters in the last cell of this notebook and running it.

In [None]:
# sep=r'\s+' splits on any run of whitespace (replaces the deprecated delim_whitespace=True).
allseqs = pd.read_csv('mutatedseqs_test', sep=r'\s+')
RegSeq.utils.check_mutation_rate(df, allseqs, buffer=10)

Base with low mutation rate for gene fdoH 
Base with low mutation rate for gene yqhC 
Base with low mutation rate for gene yicI 
Base with low mutation rate for gene ybjT 
Base with low mutation rate for gene yncD 
Base with low mutation rate for gene eco 
Base with low mutation rate for gene iap 
Base with low mutation rate for gene ygjP 
Base with low mutation rate for gene rapA 
Base with low mutation rate for gene ycgB 
Base with low mutation rate for gene yehS 
Base with low mutation rate for gene ydhO 
Base with low mutation rate for gene modE 
Base with low mutation rate for gene rcsF 
Base with low mutation rate for gene ygeR 
Bad mutation rate for gene mscK.
Base with low mutation rate for gene mscK 
Base with low mutation rate for gene ybdG 
Base with low mutation rate for gene ydjA 
Base with low mutation rate for gene yggW 
Base with low mutation rate for gene ybiO 
Base with low mutation rate for gene mscL 
Base with low mutation rate for gene zapB 
Base with low mutation 
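Messages like the ones above come from comparing mutants against the wild-type sequence position by position. The sketch below shows one way such a check could work. It is not the code of `RegSeq.utils.check_mutation_rate`: the function names, the expected rate of 0.1, and the tolerance of 0.5 are assumptions, and the `buffer` argument of the real function is not modeled.

```python
def per_position_mutation_rate(wt, mutants):
    """Fraction of mutants that differ from `wt` at each position."""
    counts = [0] * len(wt)
    for mutant in mutants:
        for i, (a, b) in enumerate(zip(wt, mutant)):
            if a != b:
                counts[i] += 1
    return [c / len(mutants) for c in counts]


def flag_low_positions(rates, expected=0.1, tolerance=0.5):
    """Positions whose rate falls below expected * (1 - tolerance).

    Both `expected` and `tolerance` are assumed values for illustration.
    """
    threshold = expected * (1 - tolerance)
    return [i for i, r in enumerate(rates) if r < threshold]


wt = "ACGT"
mutants = ["ACGT", "TCGT", "ACGA", "TCGT"]
rates = per_position_mutation_rate(wt, mutants)
low = flag_low_positions(rates)
print(rates)
print("Positions with low mutation rate:", low)
```

In this toy example the two middle positions are never mutated, so they are the ones flagged.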

In [None]:
#import inspect
#print(inspect.getsource(RegSeq.utils.check_mutation_rate))