# Generating mutation scrambles of different sizes

(c) 2020 Tom Röschinger. This work is licensed under a [Creative Commons Attribution License CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/). All code contained herein is licensed under an [MIT license](https://opensource.org/licenses/MIT).

In [5]:
import wgregseq
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
import warnings

import matplotlib.pyplot as plt


# Set default plotting style
wgregseq.plotting_style()
%matplotlib inline


In this notebook we explore how to generate scrambles of a given sequence. We want to be able to do this for sequences of various lengths, with a variable scramble size and a variable overlap between consecutive scrambles. The method is based on the work of [Urtecho et al., 2020](https://www.biorxiv.org/content/10.1101/2020.01.04.894907v1), who used scrambles of size 10 with an overlap of 5 base pairs. To find scrambles that are distant from the wild type sequence, they generated 100 scrambles by permuting the original sequence and chose the one that was most distant. Because each scramble is a permutation of the original bases, this preserves the GC content of the sequence.
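
The selection step described above can be sketched in a few lines. This is only an illustration, not the implementation used in `wgregseq` or by Urtecho et al.: the helper name `scramble_window` is hypothetical, and the Hamming distance (number of mismatched positions) is an assumed choice of distance measure.

```python
import numpy as np

def scramble_window(seq, start, size, n_candidates=100, rng=None):
    """Replace seq[start:start+size] by its most distant random permutation.

    Hypothetical helper; distance is assumed to be the Hamming distance.
    """
    rng = np.random.default_rng() if rng is None else rng
    window = np.array(list(seq[start:start + size]))
    best, best_dist = window, -1
    for _ in range(n_candidates):
        candidate = rng.permutation(window)
        # Count the positions at which the candidate differs from the wild type
        dist = int(np.sum(candidate != window))
        if dist > best_dist:
            best, best_dist = candidate, dist
    # Lower case is used here to mark the scrambled bases
    return seq[:start] + "".join(best).lower() + seq[start + size:]

wild_type = "AGGCCAAGTTAATATGGCCC"
scrambled = scramble_window(wild_type, start=5, size=10,
                            rng=np.random.default_rng(0))
print(scrambled)
```

Since the window is only permuted, the base composition of the scrambled sequence is identical to that of the wild type.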

Let's generate a random sequence to begin with. For this we can use the function `wgregseq.gen_rand_seq()`, which takes the sequence length as its argument. A list of letters can optionally be given as a second argument; by default, a random DNA sequence is generated.

In [6]:
sequence = wgregseq.gen_rand_seq(160)
sequence

'AGGCCAAGTTAATATGGCCCATTTAGTAAAACAGAGGGTGGCACTGGCAACACAGTACACCATAAGGGTCAACTTTATCGTGGACGTAAATGTGATGTGCCTGACGTCCGTGGAGTTATATATTGGTAACGCTCCGACAGCTGGGCAAATGATTGATCTA'
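
For readers without `wgregseq` installed, a comparable random sequence could be drawn as follows. The function `random_sequence` is a hypothetical stand-in, and uniform sampling of the letters is an assumption; the behavior of `wgregseq.gen_rand_seq()` may differ.

```python
import numpy as np

def random_sequence(length, letters=("A", "C", "G", "T"), rng=None):
    """Draw `length` letters uniformly at random (assumed sampling scheme)."""
    rng = np.random.default_rng() if rng is None else rng
    return "".join(rng.choice(letters, size=length))

toy_sequence = random_sequence(160, rng=np.random.default_rng(1))
print(len(toy_sequence))
```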

To generate scrambles, we can use the function `wgregseq.create_scrambles()`, which takes as arguments the wild type sequence, the length of the generated scrambles, the number of overlapping bases, and the number of candidate sequences generated per scramble, from which the most distant one is chosen. The function returns an error message if it cannot place scrambles evenly throughout the sequence.

In [7]:
len(wgregseq.create_scrambles(sequence, 10, 9, 100, ignore_imperfect_scrambling=True))

152
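
To see how scramble length and overlap determine where the scrambles fall, the window positions can be enumerated directly. This is a sketch under the assumption that consecutive windows are shifted by `window_size - overlap` and that "even" means the last window ends exactly at the end of the sequence; the helper `scramble_windows` is hypothetical and the check inside `wgregseq.create_scrambles()` may be defined differently.

```python
def scramble_windows(seq_length, window_size, overlap):
    """Return (start, stop) pairs of the windows and whether they tile evenly."""
    step = window_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than window_size")
    if window_size > seq_length:
        raise ValueError("window_size must not exceed seq_length")
    starts = list(range(0, seq_length - window_size + 1, step))
    # Assumed criterion: the last window ends exactly at the sequence end
    is_even = starts[-1] + window_size == seq_length
    return [(s, s + window_size) for s in starts], is_even

windows, is_even = scramble_windows(160, 10, 9)
print(len(windows), is_even)
```

Under these assumptions, a window of 5 with an overlap of 2 would not tile a sequence of length 160 evenly, because the last window would end at position 158.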

The scrambles can also be returned as a data frame, which additionally contains the start and stop positions of each scramble as well as its center position. Note that the wild type is not included here.

In [9]:
df = wgregseq.create_scrambles_df(sequence, 10, 5, 100)
df

AGGCCAAGTTAATATGGCCCATTTAGTAAAACAGAGGGTGGCACTGGCAACACAGTACACCATAAGGGTCAACTTTATCGTGGACGTAAATGTGATGTGCCTGACGTCCGTGGAGTTATATATTGGTAACGCTCCGACAGCTGGGCAAATGATTGATCTA


| | start_pos | stop_pos | sequence | center_pos |
|---|---|---|---|---|
| 0 | 0 | 10 | gaatgctacgAATATGGCCCATTTAGTAAAACAGAGGGTGGCACTG... | 5.0 |
| 1 | 5 | 15 | AGGCCttaagatataGGCCCATTTAGTAAAACAGAGGGTGGCACTG... | 10.0 |
| 2 | 10 | 20 | AGGCCAAGTTccctgaagatATTTAGTAAAACAGAGGGTGGCACTG... | 15.0 |
| 3 | 15 | 25 | AGGCCAAGTTAATATcattacggctGTAAAACAGAGGGTGGCACTG... | 20.0 |
| 4 | 20 | 30 | AGGCCAAGTTAATATGGCCCgaaattaattACAGAGGGTGGCACTG... | 25.0 |
| 5 | 25 | 35 | AGGCCAAGTTAATATGGCCCATTTAaaggcaataaGGGTGGCACTG... | 30.0 |
| 6 | 30 | 40 | AGGCCAAGTTAATATGGCCCATTTAGTAAAgggggaaactGCACTG... | 35.0 |
| 7 | 35 | 45 | AGGCCAAGTTAATATGGCCCATTTAGTAAAACAGAacggtcggtgG... | 40.0 |
| 8 | 40 | 50 | AGGCCAAGTTAATATGGCCCATTTAGTAAAACAGAGGGTGaactac... | 45.0 |
| 9 | 45 | 55 | AGGCCAAGTTAATATGGCCCATTTAGTAAAACAGAGGGTGGCACTc... | 50.0 |
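
The position columns follow a simple pattern that can be reproduced with pandas alone. The sketch below assumes that the start positions advance by the scramble size minus the overlap and that the center is the midpoint of start and stop, which matches the rows shown above; it is not taken from the `wgregseq` source.

```python
import pandas as pd

window_size, overlap, seq_length = 10, 5, 160

# Start positions advance by the non-overlapping part of each window
starts = range(0, seq_length - window_size + 1, window_size - overlap)
positions = pd.DataFrame({"start_pos": list(starts)})
positions["stop_pos"] = positions["start_pos"] + window_size
# Center taken as the midpoint between start and stop
positions["center_pos"] = (positions["start_pos"] + positions["stop_pos"]) / 2
print(positions.head(10))
```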


## Scan scrambles

Now let's use a real sequence from the original Reg-Seq dataset and create scrambles for it. To do so, we import the file that contains all the wild type sequences.

In [None]:
sequence_df = pd.read_csv("../../data/RegSeq/wtsequences.csv", index_col=0).reset_index()
sequence = sequence_df.loc[sequence_df["name"] == "ykgE", "geneseq"].values[0]
sequence

Next, we import the 'energy matrix' that was obtained for this promoter in arabinose. We also rename the value columns so that they are labeled only by their base (`A`, `C`, `G`, `T`) for further analysis.

In [None]:
# The file is whitespace-delimited
effect_matrix = pd.read_csv("../../data/RegSeq/ykgEarabinosedataset_alldone_with_largeMCMC194", sep=r"\s+")
effect_matrix.rename(columns={"val_A":"A", "val_C":"C", "val_G":"G", "val_T":"T"}, inplace=True)
effect_matrix

Now we can generate scrambles of the wild type sequence, here with a scramble length of 5 and an overlap of 2 bases.

In [None]:
df = wgregseq.create_scrambles_df(sequence, 5, 2, 100, number=1, ignore_imperfect_scrambling=True)
df.head()

To evaluate each sequence using the energy matrix, we use the following function.

In [None]:
eff_df = wgregseq.sum_emat_df(df, effect_matrix)
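
The idea of evaluating a sequence with an energy matrix can be illustrated on a toy example. The sketch assumes the matrix has one row per sequence position and columns `A`, `C`, `G`, `T`, and that the effect is the sum of the entries selected by the base at each position. The function `sum_effect` and the toy values are made up for illustration; `wgregseq.sum_emat_df()` may compute the effect differently, for example relative to the wild type.

```python
import pandas as pd

def sum_effect(seq, matrix):
    """Sum the matrix entry of the base found at each position of seq."""
    seq = seq.upper()  # scrambled bases may be lower case
    if len(seq) != len(matrix):
        raise ValueError("sequence and matrix must have the same length")
    values = matrix[["A", "C", "G", "T"]].to_numpy()
    column = {"A": 0, "C": 1, "G": 2, "T": 3}
    return float(sum(values[i, column[base]] for i, base in enumerate(seq)))

# Toy matrix for a sequence of length 3 (values are invented)
toy_matrix = pd.DataFrame({
    "A": [0.0, 0.5, -0.2],
    "C": [0.1, 0.0, 0.3],
    "G": [-0.4, 0.2, 0.0],
    "T": [0.3, -0.1, 0.6],
})
effect_tat = sum_effect("TAT", toy_matrix)
effect_gct = sum_effect("gct", toy_matrix)
print(effect_tat, effect_gct)
```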

Let's have a look at the effect of scrambles at each position. Here we average over the effect of all scrambles at a position.

In [None]:
# Take x and y from the same grouped result so that they stay aligned
mean_effect = eff_df.groupby("center_pos")["effect"].mean()
plt.scatter(mean_effect.index - 115, mean_effect.values)

## Create single and double mutants

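At some point we want to generate single and double mutants of a sequence. The call below requests 10 mutants with one mutation each for the first five bases of the sequence.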

In [None]:
mutants = wgregseq.mutations_det(sequence[0:5], num_mutants=10, mut_per_seq=1, site_start=0, site_end=5)
mutants
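
For a short sequence, all single and double mutants can also be enumerated exhaustively. The sketch below is independent of `wgregseq.mutations_det()`, whose selection of mutants is not reproduced here; the helper `point_mutants` is hypothetical.

```python
from itertools import combinations, product

def point_mutants(seq, n_mutations, letters="ACGT"):
    """Enumerate every sequence that differs from seq at exactly n_mutations sites."""
    seq = seq.upper()
    mutants = []
    for sites in combinations(range(len(seq)), n_mutations):
        # Alternative bases available at each chosen site
        choices = [[b for b in letters if b != seq[i]] for i in sites]
        for new_bases in product(*choices):
            mutant = list(seq)
            for i, base in zip(sites, new_bases):
                mutant[i] = base
            mutants.append("".join(mutant))
    return mutants

singles = point_mutants("AGGCC", 1)
doubles = point_mutants("AGGCC", 2)
print(len(singles), len(doubles))
```

A sequence of length 5 has 5 × 3 = 15 single mutants and 10 × 9 = 90 double mutants, so exhaustive enumeration is only practical for short regions.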

In [None]:
wgregseq.add_primers(df["sequence"].values, 0)

These scrambled sequences can be used to identify transcription factor binding sites or RNAP binding sites.

## Computing Environment

In [None]:
%load_ext watermark
%watermark -v -p wgregseq,numpy,pandas