# Generate sequences with mutations

The code in this tutorial is released under the [MIT License](https://opensource.org/licenses/MIT). All the content in this notebook is under a [CC-by 4.0 License](https://creativecommons.org/licenses/by/4.0/). 

In [7]:
import numpy as np
import pandas as pd

import Bio.SeqIO
import regseq

# import sortseq.utils as utils
import mpathic.utils
seq_dict,inv_dict = mpathic.utils.choose_dict('dna')

For a more detailed explanation of the purpose of this notebook, refer to the [documentation of the Reg-Seq experiment](https://github.com/RPGroup-PBoC/regseq/wiki/1.-Sequence-Design).

In this notebook, we generate a collection of mutated sequences based on wild type sequences. For a demonstration of how to obtain the necessary wild type sequences, refer to the chapter *Obtaining Wildtype Sequences of Regulatory Binding Sites* in the [documentation of the Reg-Seq experiment](https://github.com/RPGroup-PBoC/regseq/wiki/1.-Sequence-Design), as well as the `1_1_create_gene_seqs.ipynb` notebook in this repository. These sequences are the basis for creating the library and adding barcodes in the following steps.

The input file loaded below contains the names of the genes of interest, their transcription start sites, the direction of transcription, and the wild type genomic sequence spanning 115 bp upstream to 45 bp downstream of the transcription start site (160 bp in total).

In [8]:
df = pd.read_csv('../data/prior_designs/example/wtsequences.csv')
df.head()

|   | name | start_site | rev | geneseq |
|---|------|------------|-----|---------|
| 0 | livM | 3597755 | rev | ACAAAATTAAAACATTAGAGAATGAAAAATGTCCAGCATAATCCCC... |
| 1 | ygbI | 2861256 | rev | AAGATAACGGTATGGTGATCTGATTCACATAAATTAACATTGTGTG... |
| 2 | deaD | 3308086 | rev | AAGTACTACCTAAGTCTGGGGGATTTGGACAGCGCCACGGCACTGT... |
| 3 | frlR | 3504043 | fwd | ATTCAGTACCACGGTGCCTGGTAGGTATAACGTTGGCGTGAGCATC... |
| 4 | slyA | 1720870 | rev | TAATAAATATTCTTTAAGTGCGAAAAATTTACGCGCAATTTCTGAA... |


When generating mutated sequences for a Reg-Seq experiment, we typically "cluster" 3-5 mutated gene libraries together. Each of these clusters shares a set of orthogonal primer binding sites, so that each cluster can be amplified individually, apart from the other mutated libraries. In this context, orthogonal means that the generated primers do not cross-hybridize, i.e., the probability of primers binding to each other is very low. The number of possible 20-mers is about $10^{12}$, so the probability of generating a primer that is identical to a 20-mer in the genome is negligible. We cluster genes in this way for a couple of reasons:

1. If one pair of orthogonal primers does not amplify as well as the other primer pairs, you need not throw out the entire oligo pool; rather, you can simply reorder the 3-5 affected genes, potentially saving money in the future.
2. Clustering genes enables selective amplification of a subset of genes. If you wanted to repeat Reg-Seq on just a handful of related genes, for instance, you could specifically amplify that subset and repeat the experiment. Again, this would save money, as you would not waste sequencing reads on an Illumina machine by repeating a Reg-Seq experiment on the entire oligo pool.
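
As a quick sanity check of the 20-mer count quoted above, the following sketch computes the number of distinct 20-mers and a rough expected number of chance matches to a genome. The genome length of about 4.6 Mbp and the uniform-random model are assumptions made purely for illustration; they are not taken from this notebook.

```python
# Number of distinct DNA 20-mers (4 possible bases at each of 20 positions)
n_20mers = 4 ** 20

# Assumption for illustration: a genome of ~4.6 Mbp, counting both strands
genome_length = 4600000
n_genomic_20mers = 2 * (genome_length - 20 + 1)

# Expected number of exact genomic matches for one uniformly random 20-mer
expected_matches = n_genomic_20mers / n_20mers

print("Possible 20-mers: {:.2e}".format(n_20mers))
print("Expected exact matches per random primer: {:.1e}".format(expected_matches))
```

Under these assumptions the expected number of exact matches per primer is far below one, which is consistent with the statement that such matches are negligible.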

There are many orthogonal primer pairs (3,000, to be exact) that can be appended to groups of sequences for selective amplification. We use orthogonal primer pairs that were developed by the Kosuri lab at UCLA; a full list can be found in the `data/primers` folder of this GitHub repository.

In [9]:
kosprimefwd = Bio.SeqIO.parse('../data/primers/forward_finalprimers.fasta','fasta')
kosprimerev = Bio.SeqIO.parse('../data/primers/reverse_finalprimers.fasta','fasta')

Let's have a look at a couple of primers. Since the primers are stored in the `fasta` format, we first have to parse the records and then extract the sequence from each one.

In [10]:
seqs = list(enumerate(kosprimefwd))[100:105]
for i,fwd_rec in seqs:
    print(str(fwd_rec.seq))

GCTTATTCGTGCCGTGTTAT
TTTGCTTCAGTCAGATTCGC
GTCGAGTCCTATGTAACCGT
GTAAGATGGAAGCCGGGATA
GGTGTCGCAACATGATCTAC


These 20-mers are the sites where the amplification primers bind, and each pair is orthogonal to the others (i.e., they will not cross-amplify). Since all primer pairs in the list are equally orthogonal, no pair is preferable to any other, and the choice is arbitrary.

We append these primer sequences to the 5' and 3' ends (forward and reverse sequences, respectively) of each mutated promoter sequence that is ordered. The mutated sequences for each promoter are designed computationally such that each base in the 160 bp promoter region has a 10% probability of being mutated. For each promoter's library, we ensure that the mutation rate averaged across all sequences lies between 9.5% and 10.5%; otherwise, the library is regenerated.
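
The design rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not the `mpathic` or `regseq` implementation: the helper names `mutate_sequence` and `library_mutation_rate`, the toy wild type sequence, and the choice to replace a mutated base by one of the three other bases with equal probability are all assumptions.

```python
import random

BASES = "ACGT"


def mutate_sequence(wildtype, rate, rng):
    """Replace each base by a different base with probability `rate`."""
    mutated = []
    for base in wildtype:
        if rng.random() < rate:
            # Draw uniformly from the three bases that differ from the original
            mutated.append(rng.choice([b for b in BASES if b != base]))
        else:
            mutated.append(base)
    return "".join(mutated)


def library_mutation_rate(wildtype, library):
    """Fraction of mismatched bases, averaged over all sequences."""
    mismatches = sum(
        wt != mut for seq in library for wt, mut in zip(wildtype, seq)
    )
    return mismatches / (len(wildtype) * len(library))


rng = random.Random(0)             # fixed seed so the sketch is reproducible
toy_wildtype = "ACGT" * 40         # hypothetical 160 bp sequence
library = [mutate_sequence(toy_wildtype, 0.1, rng) for _ in range(500)]

rate = library_mutation_rate(toy_wildtype, library)
accept = 0.095 <= rate <= 0.105    # acceptance window described in the text
print("Average mutation rate: {:.4f}, accepted: {}".format(rate, accept))
```

If `accept` were `False`, the library would be generated again with fresh random draws until the average rate falls inside the window.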

Next, we determine how many primer pairs the experiment requires. We count the genes in the dataframe `df` loaded above and assign one primer pair to each group of up to five genes.

In [11]:
ngenes = len(df.index)
print("Number of genes: {}".format(ngenes))
num_groups = int(np.ceil(ngenes/5))
print("Number of primer pairs: {}".format(num_groups))

Number of genes: 8
Number of primer pairs: 2
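
One simple way to assign genes to primer groups is to split the gene list into consecutive chunks of five. The sketch below illustrates this with the gene names from this example; whether `regseq` assigns groups in exactly this order is an assumption, and the names `group_size` and `groups` are hypothetical.

```python
import math

genes = ["livM", "ygbI", "deaD", "frlR", "slyA", "wzxC", "ycgB", "ymgC"]
group_size = 5  # at most five genes share one primer pair

n_groups = math.ceil(len(genes) / group_size)

# Consecutive chunks of `group_size` genes; the last chunk may be shorter
groups = [genes[i * group_size:(i + 1) * group_size] for i in range(n_groups)]

for idx, members in enumerate(groups):
    print("Primer pair {}: {}".format(idx, ", ".join(members)))
```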


With the number of primer pairs known, we can extract them from the list using the function `get_primers` in the `regseq.prior_designs` module. The function takes the forward and reverse primers, together with the indices of the first and last primer pair to use; here we start at index 100. Because the forward parser was consumed when we previewed the primers above, we create both parsers again. As noted above, the choice of primer pairs is arbitrary, but we have had success using these pairs in experiments.

In [12]:
kosprimefwd = Bio.SeqIO.parse('../data/primers/forward_finalprimers.fasta','fasta')
kosprimerev = Bio.SeqIO.parse('../data/primers/reverse_finalprimers.fasta','fasta')
fwdprimes, revprimes = regseq.prior_designs.get_primers(kosprimefwd, kosprimerev, 100, 100+num_groups-1)

The function returns two lists, one with the forward primers and one with the reverse primers. Let's have a look at one of these lists.

In [13]:
fwdprimes

['GCTTATTCGTGCCGTGTTAT', 'TTTGCTTCAGTCAGATTCGC']
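
The reason the FASTA parsers are created a second time is that they behave like one-pass iterators: once records have been read, they are gone. The sketch below shows this with a plain generator standing in for the parser. The helper `select_inclusive` is hypothetical and only mimics selecting an inclusive index range; it is not the `get_primers` implementation.

```python
from itertools import islice


def toy_records():
    """Stand-in for a FASTA parser: yields one sequence at a time."""
    for seq in ["AAAA", "CCCC", "GGGG", "TTTT"]:
        yield seq


def select_inclusive(records, start, end):
    """Return the records with index start..end (both inclusive)."""
    return list(islice(records, start, end + 1))


records = toy_records()
first_pass = select_inclusive(records, 1, 2)
# The same iterator has already advanced past index 2, so nothing is left
# at those positions on a second attempt
second_pass = select_inclusive(records, 1, 2)

# Creating a fresh iterator restores the expected behavior
fresh_pass = select_inclusive(toy_records(), 1, 2)

print(first_pass, second_pass, fresh_pass)
```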

In total, we ordered 150,000 sequences for the Reg-Seq experiment, which is the combined number for all genes considered. To compute how many mutated sequences to generate per gene, we divide this number by the number of genes. The result should be at least 1200-1500 per gene; if it is far lower, consider ordering more sequences. In this example, the number of mutated sequences per gene is stored in the variable `nseqs`, which is 150000/8 - 1 = 18749. We subtract one because the first sequence for each gene is the wild type sequence.

In [14]:
norder = 150000
nseqs = int(np.floor(norder/ngenes))-1
nseqs

18749

Now we generate mutated sequences and create an output dataframe with each mutant sequence and its associated gene. For information on this function, either type `?regseq.prior_designs.mutation_sequences` or find the function in the `../regseq/prior_designs.py` file. In short, we generate sequences from the wild type sequence with a mutation rate of 0.1 per bp. Because this process is completely random, some sequences may end up with a higher or lower rate than this, or some nucleotides may be underrepresented. We tackle that issue later in this notebook.

In [15]:
allseqs, primer_df = regseq.prior_designs.mutation_sequences(df, fwdprimes, revprimes, nseqs)

Now we can evaluate the resulting sequences to check that the mutation rates are acceptable. For details of this function, we recommend the docstring, `?regseq.prior_designs.check_mutation_rate`; the full code can be found in the file `../regseq/prior_designs.py`. The function checks the generated sequences for repeated sequences, for very high or very low total mutation rates computed from all generated sequences, and for exceptionally rare nucleotides. The user can decide whether only a warning is returned when a criterion is not met, or whether new sequences are generated to fulfill the criteria.

For example, in the cell below we generate new sequences if a nucleotide is underrepresented (`fix_low_base=True`), since this issue occurs quite frequently.

In [16]:
allseqs = regseq.prior_designs.check_mutation_rate(df, allseqs, max_it=10, fix_low_base=True, fix_ex_rate=True, primer_df=primer_df)

Gene livM done.
Gene ygbI done.
Gene deaD done.
Gene frlR done.
Gene slyA done.
Gene wzxC done.
Gene ycgB done.
Gene ymgC done.
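
To make the "exceptionally rare nucleotide" check more concrete, here is a minimal sketch of how such a test could look. It is not the code used by `check_mutation_rate`; the function name `rare_bases`, the threshold `min_fraction`, and the toy library are assumptions chosen for illustration.

```python
from collections import Counter


def rare_bases(library, min_fraction):
    """List (position, base, fraction) for bases rarer than `min_fraction`."""
    n_seqs = len(library)
    flagged = []
    # zip(*library) iterates over the library column by column
    for pos, column in enumerate(zip(*library)):
        counts = Counter(column)
        for base in "ACGT":
            fraction = counts[base] / n_seqs
            if fraction < min_fraction:
                flagged.append((pos, base, fraction))
    return flagged


# Toy library of five 4 bp sequences (hypothetical data)
toy_library = ["ACGT", "ACGA", "CCGT", "GCGT", "TCGT"]
flagged = rare_bases(toy_library, min_fraction=0.1)

for pos, base, fraction in flagged:
    print("Position {}: base {} occurs with fraction {:.2f}".format(pos, base, fraction))
```

In the toy library only the first position contains all four bases, so every other position is reported. A library with flagged positions would either trigger a warning or be regenerated, depending on the options chosen.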


Let's look at a mutated sequence and compare it to the wild type for the gene `livM`. We read the second `livM` row of the dataframe `allseqs` we just created (the first row for each gene holds the wild type sequence) and take the wild type sequence from the dataframe `df` loaded at the beginning.

In [17]:
# Find mutated sequence
mutated = allseqs.loc[allseqs["gene"] == "livM"].iloc[1]["seq"]

# Find wildtype sequence
wildtype = df.loc[df["name"] == "livM", "geneseq"].values[0]

print("Length wildtype sequence: {}".format(len(wildtype)))
print("Length generated sequence: {}".format(len(mutated)))

Length wildtype sequence: 160
Length generated sequence: 200


The generated sequence is 40 bp longer than the wild type because a 20 bp primer is attached to each end. Let's look at a section of the sequences and compare the generated sequences to the wild type. You will notice some patterns in the mutations; these arise from the way the `mpathic` package creates the sequences. However, when checking the sequences we made sure that there are no pathologies.

In [18]:
# Get multiple sequences
mutated = allseqs.loc[allseqs["gene"] == "livM"].iloc[10:15]["seq"].values

# Show sequences
print("Wildtype: \n{}".format(wildtype[:30]))
print("Mutated: ")
for i in range(4):
    print(mutated[i][20:50])

Wildtype: 
ACAAAATTAAAACATTAGAGAATGAAAAAT
Mutated: 
AAAAAAATAAAACATTAGTGAATGAAAAAT
AAAAAAATAAAGCATTAGAGCATGAAAAAT
AAAAAAATAAGACATTAGAGAATGAAAAAT
AAAAAAATGAAACATTAGAGTATGAAAAAT
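
To see exactly where a mutated sequence differs from the wild type, the two strings can be compared position by position. The sketch below does this for the wild type section and the first mutated section printed above; the helper name `list_mutations` is hypothetical.

```python
def list_mutations(wildtype, mutated):
    """Return (position, wild type base, mutated base) for every mismatch."""
    return [
        (pos, wt, mut)
        for pos, (wt, mut) in enumerate(zip(wildtype, mutated))
        if wt != mut
    ]


# First 30 bp of the livM wild type and of the first mutated sequence shown above
wt_section = "ACAAAATTAAAACATTAGAGAATGAAAAAT"
mut_section = "AAAAAAATAAAACATTAGTGAATGAAAAAT"

mutations = list_mutations(wt_section, mut_section)
for pos, wt, mut in mutations:
    print("Position {}: {} -> {}".format(pos, wt, mut))
```

For this pair the mismatches are at positions 1, 6, and 18, i.e. 3 of 30 bases, in line with the targeted rate of about 10%.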


All sequences are very similar to the wild type sequence but differ in a few positions (10% on average). At first glance, the mutated sequences do not seem to be completely random. That is because we use the `mpathic` package to generate the mutated sequences, and the resulting list is sorted by sequence. To double-check, let's look at a single position and its mutation rate.

In [20]:
# Base to test
base = 10

# Get nucleotide in wildtype sequence
wildtypebase = wildtype[base]

# Get all mutated sequences
mutated_seqs = allseqs.loc[allseqs["gene"] == "livM"].iloc[2:]["seq"].values

# Add 20 bases due to additional primer leading the sequence
rate = sum([wildtypebase != mut[base+20] for mut in mutated_seqs])/len(mutated_seqs)

print("Mutation rate at base {}: {}".format(base, rate))

Mutation rate at base 10: 0.09675698741199061


All the steps shown above are combined into a single function.

In [13]:
?regseq.prior_designs.gen_mutated_seq

Signature:
regseq.prior_designs.gen_mutated_seq(
    file,
    output,
    forward_primers='../data/primers/forward_finalprimers.fasta',
    reverse_primers='../data/primers/reverse_finalprimers.fasta',
    norder=150000,
    fix_ex_rate=False,
    fix_rep=False,
    fix_low_base=False,
)
Docstring:
Generate mutated sequences from given sequences.

Parameters
----------
file : str
    Path to file w

As input, we give the path to the file containing the genes of interest and their wild type sequences. The output is a table with all generated sequences for every gene; the path where this table should be stored must also be passed to the function.

In [21]:
file = '../data/prior_designs/example/wtsequences.csv'
output = '../data/prior_designs/mutatedseqs.csv'

regseq.prior_designs.gen_mutated_seq(file, output)

Gene livM done.
Gene ygbI done.
Gene deaD done.
Gene frlR done.
Gene slyA done.
Gene wzxC done.
Gene ycgB done.
Gene ymgC done.


Finally, here are the versions of the packages used in this notebook. To display the versions, we use the IPython extension `watermark`, which can be found [here](https://github.com/rasbt/watermark). (This will already be installed if you use the environment we prepared.)

## Computing environment

In [15]:
%load_ext watermark
%watermark -v -p numpy,pandas,Bio,mpathic,regseq

CPython 3.6.9
IPython 7.13.0

numpy 1.18.1
pandas 1.0.3
Bio 1.76
mpathic 0.1.20
regseq 0.0.2
