# Generating mutation scrambles of different sizes

(c) 2020 Tom Röschinger. This work is licensed under a [Creative Commons Attribution License CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/). All code contained herein is licensed under an [MIT license](https://opensource.org/licenses/MIT).

In [29]:
import wgregseq
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd

# Set default plotting style
wgregseq.plotting_style();



In this notebook we explore how to generate scrambles of a given sequence. We want this to work for sequences of various lengths, with variable scramble size and variable overlap between consecutive scrambles. The method is based on the work of [Urtecho et al., 2020](https://www.biorxiv.org/content/10.1101/2020.01.04.894907v1), who used scrambles of size 10 with an overlap of 5 base pairs. To find scrambles that are distant from the wild type sequence, they generated 100 scrambles by permuting the original sequence and chose the one that was most distant. Because a permutation only reorders the bases, this preserves the GC content of the sequence.
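The selection step described above can be sketched in plain Python. This is only an illustration, not the code used in `wgregseq` or by Urtecho et al.: the helper names `hamming` and `most_distant_permutation` are hypothetical, and the Hamming distance (number of mismatched positions) is an assumed choice for what "most distant" means.

```python
import random


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def most_distant_permutation(window, n_candidates=100, seed=None):
    """Shuffle `window` n_candidates times and keep the shuffle that differs
    from `window` at the most positions (assumed metric: Hamming distance)."""
    rng = random.Random(seed)
    best, best_dist = window, -1
    for _ in range(n_candidates):
        letters = list(window)
        rng.shuffle(letters)  # in-place permutation of the bases
        candidate = "".join(letters)
        dist = hamming(window, candidate)
        if dist > best_dist:
            best, best_dist = candidate, dist
    return best


window = "CCAGACGGCG"
scrambled = most_distant_permutation(window, n_candidates=100, seed=0)

# A permutation only reorders letters, so the base counts (and hence the
# GC content) of the scramble match those of the original window.
print(scrambled, hamming(window, scrambled), sorted(scrambled) == sorted(window))
```

Drawing more candidates makes it more likely that a scramble far from the wild type is found, at the cost of more shuffles per window.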

Let's generate a random sequence to begin with. For this we can use the function `wgregseq.gen_rand_seq()`, which takes the sequence length as its argument. A list of letters can be passed as an optional second argument; by default, a random DNA sequence is generated.

In [43]:
sequence = wgregseq.gen_rand_seq(20)
sequence

'CCAGACGGCGATCGGTAGCT'

To generate scrambles, we can use the function `wgregseq.create_scrambles()`. Its arguments are the wild type sequence, the length of each scramble, the number of bases by which consecutive scrambles overlap, and the number of candidate sequences generated per scramble, from which the most distant one is chosen. The function reports an error if it cannot place scrambles evenly along the sequence.

In [82]:
wgregseq.create_scrambles(sequence, 10, 5, 100)

array(['CCAGACGGCGATCGGTAGCT', 'GGCCCGCAGAATCGGTAGCT',
       'CCAGAGCGGCCGGTATAGCT', 'CCAGACGGCGCGGATCTTGA'], dtype='<U160')
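To see where the scrambled windows fall, the layout can be sketched as follows. The helper `window_positions` is hypothetical, and the evenness condition it checks (the windows must end exactly at the end of the sequence) is an assumption about what "evenly" means here, not the check implemented in `wgregseq`.

```python
def window_positions(seq_length, window_size, overlap):
    """Return (start, stop) pairs of windows that advance by
    window_size - overlap and end exactly at the end of the sequence."""
    step = window_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than window_size")
    if seq_length < window_size or (seq_length - window_size) % step != 0:
        raise ValueError("windows do not tile the sequence evenly")
    return [
        (start, start + window_size)
        for start in range(0, seq_length - window_size + 1, step)
    ]


print(window_positions(20, 10, 5))  # [(0, 10), (5, 15), (10, 20)]
```

For a sequence of length 20 with windows of size 10 and an overlap of 5, this gives three windows, which matches the three scrambled sequences that follow the wild type in the output above.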

The scrambles can also be returned in a data frame, which also includes the start and stop positions of the scrambles, as well as the center position. Note that here the wild type is not included.

In [89]:
df = wgregseq.create_scrambles_df(sequence, 10, 5, 100)
df

|   | start_pos | stop_pos | sequence | center_pos |
|---|-----------|----------|----------|------------|
| 0 | 0 | 10 | AGGCCGCCGAATCGGTAGCT | 5.0 |
| 1 | 5 | 15 | CCAGAGCCGCGGATGTAGCT | 10.0 |
| 2 | 10 | 20 | CCAGACGGCGGAGTACTCTG | 15.0 |
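As a sanity check, the rows above can be compared against the wild type with plain Python. The sketch below hard-codes the sequences shown in this notebook and makes no calls to `wgregseq`; the variable names are illustrative. For each row it tests that the bases outside `[start_pos, stop_pos)` are unchanged and that the scrambled window has the same base composition as the wild type window.

```python
from collections import Counter

wild_type = "CCAGACGGCGATCGGTAGCT"

# (start_pos, stop_pos, sequence) copied from the data frame above
rows = [
    (0, 10, "AGGCCGCCGAATCGGTAGCT"),
    (5, 15, "CCAGAGCCGCGGATGTAGCT"),
    (10, 20, "CCAGACGGCGGAGTACTCTG"),
]

checks = []
for start, stop, seq in rows:
    # Bases outside the scrambled window should match the wild type.
    flanks_unchanged = (
        seq[:start] == wild_type[:start] and seq[stop:] == wild_type[stop:]
    )
    # Inside the window, only the order of the bases should differ.
    same_composition = Counter(seq[start:stop]) == Counter(wild_type[start:stop])
    # Number of positions in the window that differ from the wild type.
    n_changed = sum(a != b for a, b in zip(seq[start:stop], wild_type[start:stop]))
    checks.append((flanks_unchanged, same_composition, n_changed))
    print(start, stop, flanks_unchanged, same_composition, n_changed)
```

Both checks hold for all three rows, consistent with scrambles that permute the bases within a window and leave the rest of the sequence untouched.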


The sequences containing these scrambles can be used to identify transcription factor binding sites or RNAP binding sites.

## Computing Environment

In [5]:
%load_ext watermark
%watermark -v -p wgregseq,numpy,pandas

CPython 3.8.2
IPython 7.16.1

wgregseq 0.0.1
numpy 1.18.1
pandas 1.0.3
