## Initiation plasmids



### GC skew and content

"We will use all combinations resulting from varying GC content (40, 50, 60 and 70%) and GC skew (0, 0.1, 0.2, 0.4, and 0.6)"

In [1]:
from itertools import combinations
import pprint
import numpy as np
import pandas as pd

pp = pprint.PrettyPrinter(indent=4)

gc_skew_levels = [0, 0.1, 0.2, 0.4, 0.6]
gc_content_levels = [40, 50, 60, 70]

combos = []

for each_skew in gc_skew_levels:
    for each_content in gc_content_levels:
        combos.append((each_skew, each_content))

df = pd.DataFrame(set(combos))
df.columns = ['GC Skew', 'GC Content']
print(df)


    GC Skew  GC Content
0       0.1          40
1       0.4          70
2       0.6          60
3       0.1          70
4       0.0          60
5       0.2          40
6       0.4          60
7       0.6          50
8       0.1          60
9       0.0          50
10      0.2          70
11      0.4          50
12      0.2          60
13      0.1          50
14      0.0          40
15      0.6          40
16      0.6          70
17      0.0          70
18      0.4          40
19      0.2          50


What does a plasmid with a GC skew of `0.2` and 50% GC content actually look like?

GC skew = (C - G) / (C + G)

GC content = % bases that are either G or C
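The two definitions above can be applied directly to a sequence string. This is a minimal sketch; the function name `gc_skew_and_content` is hypothetical and not part of the original notebook, and GC content is returned as a fraction rather than a percentage.

```python
def gc_skew_and_content(seq):
    """Return (GC skew, GC content) for a DNA string.

    Uses the definitions given above:
    skew = (C - G) / (C + G), content = (G + C) / length (as a fraction).
    """
    seq = seq.upper()
    g, c = seq.count('G'), seq.count('C')
    if g + c == 0:
        raise ValueError('sequence has no G or C; skew is undefined')
    return (c - g) / (c + g), (g + c) / len(seq)

# 3 Cs and 1 G in a 10-base sequence
print(gc_skew_and_content('CCCGAATTAT'))  # (0.5, 0.4)
```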

If 
- $\alpha$ = GC skew
- $\lambda$ = GC content
- $l$ = length of sequence
- $c$ = number of Cs
- $g$ = number of Gs

then

$\frac{c - g}{c + g} = \alpha$

$\frac{c + g}{l} = \lambda$

$c + g = \lambda l$

$\frac{ c - g }{ \lambda l } = \alpha$

$ c - g = \alpha \lambda l$

and 

$c + g = \lambda l$

So the number of Gs plus the number of Cs must equal the desired GC content times the sequence length, and the number of Cs minus the number of Gs must equal the desired GC skew times the desired GC content times the sequence length. Because these are two equations in two unknowns, at most one pair of G and C counts satisfies both for a given skew, content, and length. It is then a matter of deciding where the Gs and Cs go in the sequence.
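Adding and subtracting the two conditions gives the counts directly: $c = \lambda l (1 + \alpha) / 2$ and $g = \lambda l (1 - \alpha) / 2$. The sketch below computes them; the function name `gc_counts` and the use of `fractions.Fraction` (to sidestep floating-point rounding) are illustrative choices, not part of the original notebook.

```python
from fractions import Fraction

def gc_counts(alpha, lamb, l):
    """Return (g, c) with c + g = lamb * l and c - g = alpha * lamb * l,
    or None when no whole-number solution exists.

    Assumption: alpha and lamb are passed as strings (e.g. '0.2') so that
    Fraction represents them exactly instead of as binary floats.
    """
    total = Fraction(lamb) * l      # c + g
    diff = Fraction(alpha) * total  # c - g
    c = (total + diff) / 2
    g = (total - diff) / 2
    if c.denominator != 1 or g.denominator != 1:
        return None
    return int(g), int(c)

print(gc_counts('0.2', '0.5', 200))  # (40, 60)
```

A return value of `None` means the requested skew and content cannot be hit exactly at that length, for example `gc_counts('0.1', '0.5', 10)`.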

Let's try making some test "sequences"

In [2]:
l = 200
def satifies_conditions(g, c, alpha, lamb, l):
    # Determine whether a given pair of G and C counts
    # satisfies the required conditions. Lambda is abbreviated
    # as lamb since lambda is reserved by Python.
    if g + c == int(lamb * l) and c - g == int(alpha * lamb * l):
        return True
    else:
        return False



Naive brute-force approach. Really, this is the same as directly calculating GC skew and content for every candidate pair of counts.

In [3]:
# Define conditions we want
alpha = 0.0
lamb = 0.4

G_C_combos = []
# all combinations of number of G and C for all possible sums of G and C up to the length of the sequence.
for k in range(2, l):
    G_C_combos += [(i, k-i) for i in range(0, k+1)]

good_combos = [(G, C) for G, C in G_C_combos if satifies_conditions(G, C, alpha, lamb, l)]

print(good_combos)  # print out combinations that pass conditions

[(40, 40)]


Test to make sure the `good_combos` are actually valid.

In [4]:
for G, C in good_combos:
    assert (G + C) / l == lamb, 'GC content incorrect'
    assert (C - G) / (C + G) == alpha, 'GC skew incorrect!'
print('All correct!')

All correct!


Looks like it's working. Let's try all GC skew and content levels in the original table.

In [5]:
# Turn brute force approach into a function

def make_all_possible_GC_count_combos(l):
    G_C_combos = []
    for k in range(2, l):
        G_C_combos += [(i, k-i) for i in range(0, k+1)]
    return G_C_combos

def brute_force_G_C_counts(alpha, lamb, l, G_C_combos):
    good_combos = [(G, C) for G, C in G_C_combos if satifies_conditions(G, C, alpha, lamb, l)]
    return good_combos


In [6]:
# Iterate over all rows in dataframe and brute force it
gc_combos = make_all_possible_GC_count_combos(l)
good_combos_list = []  # record good combos here
for index, row in df.iterrows():
    alpha, lamb = row['GC Skew'], row['GC Content'] / 100
    good_combos = brute_force_G_C_counts(alpha, lamb, l, gc_combos)
    if not good_combos:
        good_combos = [('NA', 'NA')]
    for good_combo in good_combos:
        G, C = good_combo
        good_combos_list.append(
            {'GC skew': alpha, 'GC Content': lamb, 'G count': G, 'C count': C}
        )

print(pd.DataFrame(good_combos_list))



    GC skew  GC Content G count C count
0       0.1         0.4      36      44
1       0.4         0.7      NA      NA
2       0.6         0.6      24      96
3       0.1         0.7      NA      NA
4       0.0         0.6      60      60
5       0.2         0.4      32      48
6       0.4         0.6      36      84
7       0.6         0.5      20      80
8       0.1         0.6      54      66
9       0.0         0.5      50      50
10      0.2         0.7      NA      NA
11      0.4         0.5      30      70
12      0.2         0.6      48      72
13      0.1         0.5      45      55
14      0.0         0.4      40      40
15      0.6         0.4      16      64
16      0.6         0.7      28     112
17      0.0         0.7      70      70
18      0.4         0.4      24      56
19      0.2         0.5      40      60


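In some cases the brute-force search finds no combination of C and G counts, reported as `NA`. That does not necessarily mean no viable combination exists: for GC skew 0.4 and GC content 0.7 with $l = 200$, the conditions $c + g = 140$ and $c - g = 56$ are met by $g = 42$, $c = 98$. The misses are most likely caused by `int()` in `satifies_conditions` truncating a floating-point product that falls just below the intended whole number.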
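To check whether an `NA` reflects a real absence or only floating-point truncation, the comparison can be made with rounding instead of `int()`. This is a minimal sketch; `satisfies_conditions_rounded` is a hypothetical variant and not part of the original notebook.

```python
def satisfies_conditions_rounded(g, c, alpha, lamb, l):
    # Hypothetical variant of the notebook's check: round the targets to the
    # nearest integer instead of truncating toward zero with int().
    return g + c == round(lamb * l) and c - g == round(alpha * lamb * l)

alpha, lamb, l = 0.4, 0.7, 200
print(alpha * lamb * l)         # raw floating-point product
print(int(alpha * lamb * l))    # target after truncation
print(round(alpha * lamb * l))  # target after rounding
print(satisfies_conditions_rounded(42, 98, alpha, lamb, l))
```

One caveat with this design: rounding also accepts targets that are not whole numbers to begin with, so exact arithmetic is the safer choice when the parameter grid is not known to produce integer counts.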

## Termination sequences


Language from the grant:

"We will add a constant extension region of 100 bp, and synthesize and clone a series of 200 bp sequences immediately thereafter. These potential termination regions will possess decreasing GC content (50, 40, 30%) and decreasing GC skew (0, -0.2, -0.4)"


In [11]:
GC_skew_term = [0, -0.2, -0.4]
GC_content_term = [50, 40, 30]
term_len = 100

term_combos = []

for each_skew in GC_skew_term:
    for each_content in GC_content_term:
        term_combos.append((each_skew, each_content / 100))

df_term = pd.DataFrame(set(term_combos))
df_term.columns = ['GC Skew', 'GC Content']
print(df_term)



   GC Skew  GC Content
0      0.0         0.4
1      0.0         0.5
2     -0.2         0.4
3     -0.4         0.3
4     -0.2         0.5
5     -0.4         0.4
6     -0.4         0.5
7      0.0         0.3
8     -0.2         0.3


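The same functions used for the initiation sequences can be used here. Note that the counts below are computed with `l = 200` and the `gc_combos` list from above, not with `term_len`.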

In [16]:
# Iterate over all rows in dataframe and brute force it
good_combos_list_term = []  # record good combos here
for index, row in df_term.iterrows():

    alpha, lamb = row['GC Skew'], row['GC Content']
    good_combos_term = brute_force_G_C_counts(alpha, lamb, l, gc_combos)

    if not good_combos_term:
        good_combos_term = [('NA', 'NA')]

    for good_combo in good_combos_term:
        G, C = good_combo
        good_combos_list_term.append(
            {'GC skew': alpha, 'GC Content': lamb, 'G count': G, 'C count': C}
        )

print(pd.DataFrame(good_combos_list_term))

   GC skew  GC Content  G count  C count
0      0.0         0.4       40       40
1      0.0         0.5       50       50
2     -0.2         0.4       48       32
3     -0.4         0.3       42       18
4     -0.2         0.5       60       40
5     -0.4         0.4       56       24
6     -0.4         0.5       70       30
7      0.0         0.3       30       30
8     -0.2         0.3       36       24
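The counts above fix composition only; where each base goes is still open. As one illustration, the sketch below fills the remaining positions with A and T and shuffles everything at random. The function name `random_sequence`, the even A/T split, and the uniform shuffle are all assumptions made for this example, not a design taken from the grant or the notebook.

```python
import random

def random_sequence(g, c, l, seed=None):
    """Build a length-l DNA string with exactly g Gs and c Cs.

    Assumptions: remaining positions are split as evenly as possible
    between A and T, and base order is a uniform random shuffle.
    """
    if g + c > l:
        raise ValueError('g + c exceeds sequence length')
    rest = l - g - c
    a = rest // 2
    t = rest - a
    bases = list('G' * g + 'C' * c + 'A' * a + 'T' * t)
    random.Random(seed).shuffle(bases)
    return ''.join(bases)

# Counts for GC skew -0.2 and GC content 0.4 at l = 200
seq = random_sequence(g=48, c=32, l=200, seed=0)
n_g, n_c = seq.count('G'), seq.count('C')
print((n_c - n_g) / (n_c + n_g), (n_g + n_c) / len(seq))  # -0.2 0.4
```

A uniform shuffle preserves the overall skew and content but says nothing about local skew along the sequence, which may matter for the actual design.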
