# Generate sequences with mutations

In [None]:
import numpy as np
import pandas as pd

import Bio.SeqIO
import regseq

# import sortseq.utils as utils
import mpathic.utils as utils
seq_dict,inv_dict = utils.choose_dict('dna')


For a more detailed explanation of the purpose of this notebook, refer to the [documentation of the regseq experiment](https://github.com/RPGroup-PBoC/regseq/wiki/1.-Sequence-Design).

In this notebook, we generate a collection of mutated sequences based on wild-type sequences. For a demonstration of how to obtain the necessary wild-type sequences, refer to the chapter *Obtaining Wildtype Sequences of Regulatory Binding Sites* in the [documentation of the regseq experiment](https://github.com/RPGroup-PBoC/regseq/wiki/1.-Sequence-Design), as well as the `create_gene_seqs.ipynb` notebook in this repository.

First, we import the file containing the genes of interest, their transcription start sites, the orientation of each gene, and the genetic sequence. Below is an example file.

In [None]:
df = pd.read_csv('../data/test_data/wtsequences.csv', index_col=0)
df = df.dropna()
df.head()

There are many orthogonal primer pairs (3,000, to be exact) that can be appended to groups of sequences for selective amplification. We use orthogonal primer pairs that were developed by the Kosuri lab at UCLA, a full list of which can be found in the `data/primers` folder of this GitHub repository.

In [None]:
kosprimefwd = Bio.SeqIO.parse('../data/primers/forward_finalprimers.fasta','fasta')
kosprimerev = Bio.SeqIO.parse('../data/primers/reverse_finalprimers.fasta','fasta')

There is no particular reason for using these specific primer pairs (pairs 101, 111, 112, 113, and 114). We selected them at random in the past, found that they worked reliably, and have continued to use them. In the primer sequences above, the last 20 nucleotides are the actual orthogonal primers. These 20-mers are where the amplification primers bind, and each pair is orthogonal to the others (i.e., they will not cross-amplify).
<br><br>
We append these primer sequences to the 5' and 3' ends (fwd and rev sequences, respectively) of each mutated promoter sequence that is ordered. The mutated sequences for each promoter are designed computationally such that each base in the 160 bp promoter region has a 10% probability of being mutated. For each promoter’s library, ensure that the mutation rate, averaged across all sequences, stays between 9.5% and 10.5%; otherwise, regenerate the library.
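
The design rule above can be illustrated with a minimal sketch. This is not the implementation in `regseq.utils`; the helper names `mutate`, `mutation_rate`, and `make_library` are hypothetical, the wild-type sequence is a made-up 160 bp string, and we assume that a mutation always replaces a base with one of the three other bases.

```python
import random

BASES = "ACGT"

def mutate(seq, rate, rng):
    """Replace each base with a different base with probability `rate`."""
    out = []
    for base in seq:
        if rng.random() < rate:
            # Assumption: a mutation never reproduces the wild-type base
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def mutation_rate(wildtype, mutants):
    """Fraction of mismatched positions, averaged over all mutants."""
    mismatches = sum(a != b for m in mutants for a, b in zip(wildtype, m))
    return mismatches / (len(wildtype) * len(mutants))

def make_library(wildtype, n, rate=0.1, tol=0.005, max_tries=100, seed=0):
    """Regenerate the library until its average rate is within rate +/- tol."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        library = [mutate(wildtype, rate, rng) for _ in range(n)]
        if abs(mutation_rate(wildtype, library) - rate) <= tol:
            return library
    raise RuntimeError("no library within tolerance after max_tries")

wildtype = "ACGT" * 40  # hypothetical 160 bp promoter region
library = make_library(wildtype, n=1500)
print(round(mutation_rate(wildtype, library), 4))
```

With 1500 sequences of 160 bp each, the average rate fluctuates very little around 10%, so in this sketch the first library is usually accepted.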

Genes are grouped in sets of 5 per primer pair. For details, refer to the chapter on *Generating Mutated Sequences of Regulatory Binding Sites and Appending Primer Binding Sites* in the [documentation of the regseq experiment](https://github.com/RPGroup-PBoC/regseq/wiki/1.-Sequence-Design).

In [None]:
ngenes = len(df.index)
print("Number of genes: {}".format(ngenes))
num_groups = int(np.ceil(ngenes/5))
print("Number of primer pairs: {}".format(num_groups))
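
To make the grouping concrete, here is a small sketch with made-up gene names. The helper `assign_groups` is hypothetical, and we assume that genes are assigned to groups in the order in which they appear in the table; the actual assignment inside `regseq.utils` may differ.

```python
import math

def assign_groups(genes, group_size=5):
    """Map each gene to a group index; consecutive genes share a group."""
    return {gene: i // group_size for i, gene in enumerate(genes)}

genes = ["gene{}".format(i) for i in range(12)]  # hypothetical gene names
groups = assign_groups(genes)
n_groups = math.ceil(len(genes) / 5)  # same rounding as in the cell above

print(n_groups)
print(groups["gene4"], groups["gene5"], groups["gene11"])
```

Twelve genes therefore need three primer pairs, and the last group holds only two genes.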

Using this information, we can extract the right number of primer pairs from the list. 

In [None]:
fwdprimes, revprimes = regseq.utils.get_primers(kosprimefwd, kosprimerev, 100, 100+num_groups-1)

In total, we ordered 150,000 sequences. This number can be adjusted, but usually there should be at least 1,200-1,500 sequences for each TSS.

In [None]:
norder = 150000
nseqs = int(np.floor(norder/ngenes))-1
nseqs
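
As a quick worked example of this arithmetic, the sketch below repeats the computation for a few hypothetical gene counts and compares the result with the lower bound of 1,200 sequences mentioned above. The helper `sequences_per_gene` and the gene counts are illustrative only.

```python
def sequences_per_gene(norder, ngenes):
    """Mirror the cell above: floor(norder / ngenes) - 1."""
    return norder // ngenes - 1

norder = 150000
min_per_tss = 1200  # lower bound quoted in the text

# Hypothetical numbers of genes, chosen only to illustrate the arithmetic
results = {n: sequences_per_gene(norder, n) for n in (100, 110, 125)}
for n, per_gene in results.items():
    status = "ok" if per_gene >= min_per_tss else "too few"
    print(n, per_gene, status)
```

In this example, 125 genes would leave 1,199 sequences per gene, just below the bound, so either the order size or the number of genes would need adjusting.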

Now we generate mutated sequences and create an output data frame with each mutant sequence and its associated gene. For information on this function, either type `?regseq.utils.mutation_sequences`, or find the function in the `../regseq/utils.py` file. In short, we generate sequences from the wild-type sequence with a mutation rate of 0.1 per bp. Since this process is completely random, it is possible that some sequences end up with a higher or lower rate than this, or that some nucleotides are underrepresented. We tackle that issue later in this notebook.

In [None]:
allseqs, primer_df = regseq.utils.mutation_sequences(df, fwdprimes, revprimes, nseqs)

Now we can evaluate the resulting sequences to check for good mutation rates. If the reader is interested in the details of this function, we recommend the docstring, `?regseq.utils.check_mutation_rate`. The full code can be found in the file `../regseq/utils.py`. The function checks the generated sequences for repeated sequences, for very high or very low total mutation rates computed from all generated sequences, and for exceptionally rare nucleotides. The user can decide whether just a warning should be returned if a criterion is not met, or whether new sequences should be generated to fulfill the criteria.
<br>
For example, in the cell below we want to generate new sequences in case a nucleotide is underrepresented, since this issue occurs quite frequently.

In [None]:
allseqs = regseq.utils.check_mutation_rate(df, allseqs, buffer=10, fix_low_base=True, primer_df=primer_df)
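
Two of these checks can be sketched on a toy example. This is not the code of `regseq.utils.check_mutation_rate`; the helpers `find_duplicates` and `rare_bases`, the toy sequences, and the threshold `min_fraction` are all hypothetical and only illustrate the idea.

```python
from collections import Counter

def find_duplicates(seqs):
    """Return the sequences that occur more than once."""
    counts = Counter(seqs)
    return [s for s, c in counts.items() if c > 1]

def rare_bases(wildtype, mutants, min_fraction):
    """Return (position, base) pairs whose mutant base is underrepresented."""
    n = len(mutants)
    rare = []
    for pos, wt_base in enumerate(wildtype):
        counts = Counter(m[pos] for m in mutants)
        for base in "ACGT":
            # Only mutant bases are of interest, not the wild-type base
            if base != wt_base and counts[base] / n < min_fraction:
                rare.append((pos, base))
    return rare

toy_wildtype = "ACG"
toy_mutants = ["ACG", "TCG", "ACG", "AGG", "ACT", "TCG"]

duplicates = find_duplicates(toy_mutants)
rare = rare_bases(toy_wildtype, toy_mutants, min_fraction=0.1)
print(duplicates)
print(rare)
```

In a real library, a non-empty result from either helper would be the signal to warn the user or to regenerate sequences, which is the choice that the `fix_low_base` argument above controls for underrepresented nucleotides.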

Let's store the result in a plain-text file.

In [None]:
pd.set_option('max_colwidth', int(1e8))
# Write the table as plain text; the context manager closes the file afterwards
with open('mutatedseqs_test', 'w') as f:
    allseqs.to_string(f, index=False, col_space=10, float_format=utils.format_string)

Finally, here are the versions of the packages used in this notebook. To display the versions, we use the IPython extension `watermark`, which can be found [here](https://github.com/rasbt/watermark).

## Computing environment

In [None]:
%load_ext watermark
%watermark -v -p jupyterlab,numpy,pandas,Bio,mpathic,regseq