# Testing repetitiveness of the genetic codes for the codon tags

# Section 1

### Section 1.1: Specifying the dictionary of the codon tags & the colour they're assigned to

#### Genetic code option 1:
#### - Random allocation of colours to codons
#### - 60 codons total
#### - 7 codons for each colour, with black having 11 codons

In [6]:
#Making the dictionary of codon tags & their colour
#The codon tags are keys (each key is unique), and the colours are values
dict_one = {
'TTC' : 'Red','CTG' : 'Red','GTT' : 'Red','GCC' : 'Red','CAG' : 'Red','AGG' : 'Red', 'TGC' : 'Red',
    'CTT' : 'Green','ATA' : 'Green','ACT' : 'Green','GCG' : 'Green','TAC' : 'Green','CGC' : 'Green','AGT' : 'Green',
    'CTA' : 'Blue','GTA' : 'Blue','TCT' : 'Blue','CCT' : 'Blue','CCA' : 'Blue','GAA' : 'Blue','CGA' : 'Blue',
    'CTC' : 'Cyan','GTG' : 'Cyan','TCG' : 'Cyan','ACA' : 'Cyan','GCT' : 'Cyan','CGT' : 'Cyan','GGC' : 'Cyan',
    'ATT' : 'Yellow','GCA' : 'Yellow','ACC' : 'Yellow','TCC' : 'Yellow','TAT' : 'Yellow','TGG' : 'Yellow','AGC' : 'Yellow',
    'TTG' : 'Magenta','ATC' : 'Magenta','GAC' : 'Magenta','CAT' : 'Magenta','TGT' : 'Magenta','AGA' : 'Magenta','GGT' : 'Magenta',
    'TCA' : 'White','CCG' : 'White','CAA' : 'White','AAT' : 'White','AAC' : 'White','GAT' : 'White','GAG' : 'White',
    'TTA' : 'Black','CAC' : 'Black','CGG' : 'Black','GTC' : 'Black','ACG' : 'Black','AAG' : 'Black','GGA' : 'Black',
    'TTT' : 'Black','CCC' : 'Black','AAA' : 'Black','GGG' : 'Black'
}   

### Section 1.2: Generating a single test signal sequence

In [7]:
##We can generate a test signal sequence by either:
# randomly sampling codon tags
# or, randomly sampling pairs of colours & tags (key/ val pairs) so we know the colour pattern

In [8]:
#Randomly selecting a certain number (n) of tags per colour
#Selecting random keys, from any colour
#This code randomly samples 3 tags (keys) 
import numpy as np
random_tags = np.random.choice(list(dict_one),3)
print(random_tags)

['TCT' 'ACG' 'GAA']


In [9]:
#Randomly selecting a colour/ tag (key/value) pair
import random 

#sample key/val pairs so we know the pattern of colours
random_pairs = random.sample(list(dict_one.items()), 3)
print(random_pairs) #gives us the randomly selected pairs

#extracting just the keys (tags) into a list
random_tags_pattern = [k for k, v in random_pairs]
print(random_tags_pattern) #gives us just the tags in the order the pairs were sampled in

[('ACT', 'Green'), ('CAT', 'Magenta'), ('AAG', 'Black')]
['ACT', 'CAT', 'AAG']


In [10]:
#What if we want to sample a lot at once, and store the results?

#First we need to make the keys (codon tags) of the dictionary into a list
dict_one_key_list = list(dict_one.keys())


In [11]:
#Make empty list for the results of the for loop
random_tags_one = []
#The for loop
for i in np.arange(34):
    #do one random sample of 34 tags and put into list one
    random_tags_one.append(np.random.choice(dict_one_key_list))
#viewing the list of randomly sampled codon tags
print(random_tags_one)


['AAA', 'TTA', 'GTG', 'GAG', 'AGT', 'GAA', 'GCG', 'AGT', 'CAT', 'CCC', 'TTG', 'GAG', 'GCG', 'GCG', 'CTT', 'ATA', 'AAG', 'ATA', 'GAG', 'TAT', 'GGC', 'TTA', 'TTC', 'GAG', 'ATT', 'AAA', 'GAG', 'AAT', 'CGC', 'ACG', 'AAA', 'GTC', 'TAC', 'GTG']


In [12]:
#To randomly sample multiple colours e.g. to make a pattern
# I randomly sampled key, value pairs to get both colours and their tags
# then just keep the keys in the order they were sampled in

import random 

#randomly sample key: val pairs so we know the pattern of colours
random_pairs = random.sample(list(dict_one.items()), 34)
print(random_pairs)

#extracting just the keys (tags) into a list
random_tags_one_2 = [k for k, v in random_pairs]
print(random_tags_one_2)

[('AGC', 'Yellow'), ('CGC', 'Green'), ('TCC', 'Yellow'), ('CCT', 'Blue'), ('CTG', 'Red'), ('GGG', 'Black'), ('ACG', 'Black'), ('TTG', 'Magenta'), ('GCT', 'Cyan'), ('GAG', 'White'), ('CAA', 'White'), ('GGA', 'Black'), ('AGG', 'Red'), ('CGA', 'Blue'), ('GGT', 'Magenta'), ('GCC', 'Red'), ('ATA', 'Green'), ('CTT', 'Green'), ('TCG', 'Cyan'), ('TGC', 'Red'), ('GTC', 'Black'), ('TGG', 'Yellow'), ('ATT', 'Yellow'), ('AAA', 'Black'), ('GTT', 'Red'), ('CTA', 'Blue'), ('TTA', 'Black'), ('ACC', 'Yellow'), ('CAT', 'Magenta'), ('GTA', 'Blue'), ('AGA', 'Magenta'), ('CTC', 'Cyan'), ('GAC', 'Magenta'), ('CGT', 'Cyan')]
['AGC', 'CGC', 'TCC', 'CCT', 'CTG', 'GGG', 'ACG', 'TTG', 'GCT', 'GAG', 'CAA', 'GGA', 'AGG', 'CGA', 'GGT', 'GCC', 'ATA', 'CTT', 'TCG', 'TGC', 'GTC', 'TGG', 'ATT', 'AAA', 'GTT', 'CTA', 'TTA', 'ACC', 'CAT', 'GTA', 'AGA', 'CTC', 'GAC', 'CGT']


### Section 1.3: Putting the random test sequences together

In [13]:
#Now, we can put all the randomly sampled codon tags
# together into one long sequence
random_seq_one = ''.join(map(str,random_tags_one))
print(random_seq_one)

#And also get the nucleotide length of the codon tag sequence
length = len(random_seq_one)
print("Sequence length: " + str(length))

AAATTAGTGGAGAGTGAAGCGAGTCATCCCTTGGAGGCGGCGCTTATAAAGATAGAGTATGGCTTATTCGAGATTAAAGAGAATCGCACGAAAGTCTACGTG
Sequence length: 102


In [14]:
#Now, put together randomly sampled codon tags
# that were sampled from key:value pairs
random_seq_one_2 = ''.join(map(str,random_tags_one_2))
print(random_seq_one_2)

#Get nucleotide length of sequence
length = len(random_seq_one_2)
print("Sequence length: " + str(length))

AGCCGCTCCCCTCTGGGGACGTTGGCTGAGCAAGGAAGGCGAGGTGCCATACTTTCGTGCGTCTGGATTAAAGTTCTATTAACCCATGTAAGACTCGACCGT
Sequence length: 102


### Section 1.4: Analysing the randomly generated sequences for repeats/ repetitiveness

In [15]:
#By identifying regular expressions

#Import the reg. exp. modules
import re


#For the first sequence (random_seq_one)
#Finds all the repeats at once
print(random_seq_one)
matches = re.finditer(r"(AA){2,}|(AT){2,}|(AG){2,}|(AC){2,}|(TT){2,}|(TG){2,}|(TC){2,}|(TA){2,}|(GC){2,}|(GG){2,}|(GT){2,}|\
(GA){2,}|(CA){2,}|(CT){2,}|(CG){2,}|(CC){2,}", random_seq_one)

#Makes empty list for results  
all_repeats_one = []
for m in matches:
    all_repeats_one.append(m.group()) #puts all the repeats found together in one list

#prints the list of repeats found
print(all_repeats_one)


#Doing the same for the second sequence (random_seq_one_2)
#Finds all the repeats at once
print(random_seq_one_2)
matches = re.finditer(r"(AA){2,}|(AT){2,}|(AG){2,}|(AC){2,}|(TT){2,}|(TG){2,}|(TC){2,}|(TA){2,}|(GC){2,}|(GG){2,}|(GT){2,}|\
                      (GA){2,}|(CA){2,}|(CT){2,}|(CG){2,}|(CC){2,}", random_seq_one_2)
#Makes empty list for results of second sequence (random_seq_one_2) 
all_repeats_one_2 = []
for m in matches:
    all_repeats_one_2.append(m.group()) #puts all the repeats found together in one list

#prints the list of repeats found
print(all_repeats_one_2)


AAATTAGTGGAGAGTGAAGCGAGTCATCCCTTGGAGGCGGCGCTTATAAAGATAGAGTATGGCTTATTCGAGATTAAAGAGAATCGCACGAAAGTCTACGTG
['AGAG', 'GCGC', 'TATA', 'AGAG', 'AGAG']
AGCCGCTCCCCTCTGGGGACGTTGGCTGAGCAAGGAAGGCGAGGTGCCATACTTTCGTGCGTCTGGATTAAAGTTCTATTAACCCATGTAAGACTCGACCGT
['CCCC', 'GGGG']


### Section 1.5: Counting the number of repeat sequences found

In [16]:
#For the first sequence; random_seq_one
repeat_total_one = len(all_repeats_one)
print("Total repeats: " + str(repeat_total_one))

#For the second sequence; random_seq_one_2
repeat_total_one_2 = len(all_repeats_one_2)
print("Total repeats: " + str(repeat_total_one_2))

Total repeats: 5
Total repeats: 2


In [126]:
#Now we know the total number of repeats present in each sequence
#We might want to know where they are in the sequences

#For the first sequence; random_seq_one
#This is the sequence
print(random_seq_one)

#making empty list for positions
pos_list_one = []

#Finding positions of repeats
matches = re.finditer(r"(AA){2,}|(AT){2,}|(AG){2,}|(AC){2,}|(TT){2,}|(TG){2,}|(TC){2,}|(TA){2,}|(GC){2,}|(GG){2,}|(GT){2,}|\
                      (GA){2,}|(CA){2,}|(CT){2,}|(CG){2,}|(CC){2,}", random_seq_one)
for m in matches:
    repeat = m.group()
    pos_one = m.start()
    pos_list_one.append(pos_one)
    print(repeat + " found at position " + str(pos_one))

print(pos_list_one)

AAATTAGTGGAGAGTGAAGCGAGTCATCCCTTGGAGGCGGCGCTTATAAAGATAGAGTATGGCTTATTCGAGATTAAAGAGAATCGCACGAAAGTCTACGTG
AGAG found at position 10
GCGC found at position 39
TATA found at position 44
AGAG found at position 53
AGAG found at position 77
[10, 39, 44, 53, 77]


In [18]:
#If we want to split the DNA sequence at every point a repeat is present
#We can use the list of positions made earlier

#this O below is added to the list so that it goes from (first pos., second pos.) to (second pos., third pos.) etc.
pos_list_one = [0] + pos_list_one + [len(random_seq_one)]
[random_seq_one[x:y] for x,y in zip(pos_list_one, pos_list_one[1:])]

['AAATTAGTG',
 'GAGAGTGAAGCGAGTCATCCCTTGGAGGCG',
 'GCGCT',
 'TATAAAGAT',
 'AGAGTATGGCTTATTC',
 'GAGATTAA',
 'AGAGAATCGCACGAAAGTCTACGTG']

In [127]:
#For the second sequence; random_seq_two
#This is the sequence
print(random_seq_one_2)

#making empty list for positions
pos_list_one_2 = []

#Finding positions of repeats
matches = re.finditer(r"(AA){2,}|(AT){2,}|(AG){2,}|(AC){2,}|(TT){2,}|(TG){2,}|(TC){2,}|(TA){2,}|(GC){2,}|(GG){2,}|\
                      (GT){2,}|(GA){2,}|(CA){2,}|(CT){2,}|(CG){2,}|(CC){2,}", random_seq_one_2)
for m in matches:
    repeat = m.group()
    pos_one_2 = m.start()
    pos_list_one_2.append(pos_one_2)
    print(repeat + " found at position " + str(pos_one_2))

print(pos_list_one_2)

AGCCGCTCCCCTCTGGGGACGTTGGCTGAGCAAGGAAGGCGAGGTGCCATACTTTCGTGCGTCTGGATTAAAGTTCTATTAACCCATGTAAGACTCGACCGT
CCCC found at position 7
GGGG found at position 14
[7, 14]


In [20]:
#If we want to split the DNA sequence at every point a repeat is present
#We can use the list of positions made earlier

#this O below is added to the list so that it goes from (first pos., second pos.) to (second pos., third pos.) etc.
pos_list_one_2 = [0] + pos_list_one_2 + [len(random_seq_one_2)]
[random_seq_one_2[x:y] for x,y in zip(pos_list_one_2, pos_list_one_2[1:])]

['AGCCGCT',
 'CCCCTCT',
 'GGGGACGTTGGCTGAGCAAGGAAGGCGAGGTGCCATACTTTCGTGCGTCTGGATTAAAGTTCTATTAACCCATGTAAGACTCGACCGT']

### Section 1.6: Summarising results: sequence length & repeat count

In [21]:
#First, counting number of each individual repeat
#counts how many times each of the found repeat seq is present in the entire sequence

####
#First sequence: random_seq_one
import collections 

repeat_counts_one = collections.Counter(all_repeats_one)
print(repeat_counts_one)


#Nucleotide length of the sequence
nt_length = len(random_seq_one)
print("Sequence length: " + str(nt_length))

#Total number of repeats in sequence
repeat_total_one = len(all_repeats_one)
print("Total repeats: " + str(repeat_total_one))


####
#Second sequence: random_seq_one_2
import collections

repeat_counts_one_2 = collections.Counter(all_repeats_one_2)
print(repeat_counts_one_2)


#Nucleotide length of the sequence
nt_length = len(random_seq_one_2)
print("Sequence length: " + str(nt_length))

#Total number of repeats in sequence
repeat_total_one_2 = len(all_repeats_one_2)
print("Total repeats: " + str(repeat_total_one_2))

Counter({'AGAG': 3, 'GCGC': 1, 'TATA': 1})
Sequence length: 102
Total repeats: 5
Counter({'CCCC': 1, 'GGGG': 1})
Sequence length: 102
Total repeats: 2


In [22]:
#Other things we might want to know about our randomly generated sequences

#AT/ GC content?

#Function to see if sequence is AT-rich (e.g., rich if AT content is >0.65)
#Writing the function:
def AT_rich(dna):
    length = len(dna)
    a_count = dna.upper().count('A')
    t_count = dna.upper().count('T')
    at_content = (a_count + t_count) / length
    if at_content > 0.65:
        return True
    else:
        return False

#Function to see if GC-rich (> 0.65)
#Writing the function:
def GC_rich(dna):
    length = len(dna)
    g_count = dna.upper().count('G')
    c_count = dna.upper().count('C')
    gc_content = (g_count + c_count) / length
    if gc_content > 0.65:
        return True
    else:
        return False

#Testing AT content:
#of first sequence
print(random_seq_one)
print(AT_rich(random_seq_one))

#of second sequence
print(random_seq_one_2)
print(AT_rich(random_seq_one_2))


#Testing GC content 
#of first sequence
print(random_seq_one)
print(GC_rich(random_seq_one))

#of second sequence
print(random_seq_one_2)
print(GC_rich(random_seq_one_2))

AAATTAGTGGAGAGTGAAGCGAGTCATCCCTTGGAGGCGGCGCTTATAAAGATAGAGTATGGCTTATTCGAGATTAAAGAGAATCGCACGAAAGTCTACGTG
False
AGCCGCTCCCCTCTGGGGACGTTGGCTGAGCAAGGAAGGCGAGGTGCCATACTTTCGTGCGTCTGGATTAAAGTTCTATTAACCCATGTAAGACTCGACCGT
False
AAATTAGTGGAGAGTGAAGCGAGTCATCCCTTGGAGGCGGCGCTTATAAAGATAGAGTATGGCTTATTCGAGATTAAAGAGAATCGCACGAAAGTCTACGTG
False
AGCCGCTCCCCTCTGGGGACGTTGGCTGAGCAAGGAAGGCGAGGTGCCATACTTTCGTGCGTCTGGATTAAAGTTCTATTAACCCATGTAAGACTCGACCGT
False


### Section 1.7: Randomly sampling multiple sequences at once, to make a 'library' of random sequences

In [8]:
############### 
# CHANGE THE NUMBERS IN THE FOLLOWING CODE TO CHANGE HOW MANY SEQUENCES/ WHAT LENGTH SEQUENCES YOU WANT TO GENERATE

#What if we want to sample a lot at once, and store the results?
import numpy as np

#First we need to make the keys (codon tags) of the dictionary into a list
dict_one_key_list = list(dict_one.keys())

#Making empty libraries to store the many randomly generated codon tag sequences
#empty library for 50 seq:
random_tags_library_one_50 = []
#empty library for 100 seq:
random_tags_library_one_100 = []
#empty library for 1000 seq:
random_tags_library_one_1000 = []

#This for loop
#generates 50 sequences, of 34 tags each (each seq.  102nt in length)
for x in np.arange(50): #do the loop 50 times
    for i in np.arange(34): #each time, generate 34 random tags
        #put tags into one library
        random_tags_library_one_50.append(np.random.choice(dict_one_key_list))
#print(random_tags_library_one_50)

In [24]:
#This generates 100 sequences, of 34 tags each (102nt length sequences)
for x in np.arange(100): #do the loop 100 times
    for i in np.arange(34): #each time, generate 34 random tags
        #put tags into one library
        random_tags_library_one_100.append(np.random.choice(dict_one_key_list))
#print(random_tags_library_one_100)

In [25]:
#This generates 1000 sequences, of 34 tags each (102nt length sequences)
for x in np.arange(1000): #do the loop 1000 times
    for i in np.arange(34): #each time, generate 34 random tags
        #put tags into one library
        random_tags_library_one_1000.append(np.random.choice(dict_one_key_list))
#print(random_tags_library_one_1000)

In [10]:
#splitting the massive lists in the libraries into smaller lists of length 34
# creates a nested list/ lists of lists/ two-dimensional list

#For the 50 sequences
library_one_nested_50 = [random_tags_library_one_50[x:x+34] 
                         for x in range(0, len(random_tags_library_one_50), 34)]
print(library_one_nested_50) #prints the nested list (lists w/in list)

[['ATA', 'CCT', 'TTC', 'CCT', 'ATA', 'TTC', 'ATA', 'ATA', 'TTC', 'ATA', 'ATA', 'ATA', 'ATA', 'TTC', 'CCT', 'ATA', 'ATA', 'CCT', 'CCT', 'ATA', 'ATA', 'ATA', 'ATA', 'TTC', 'TTC', 'CCT', 'CCT', 'ATA', 'ATA', 'TTC', 'ATA', 'ATA', 'CCT', 'CCT'], ['CCT', 'ATA', 'CCT', 'CCT', 'CCT', 'ATA', 'ATA', 'ATA', 'TTC', 'ATA', 'CCT', 'ATA', 'CCT', 'ATA', 'TTC', 'TTC', 'CCT', 'CCT', 'TTC', 'TTC', 'ATA', 'CCT', 'ATA', 'ATA', 'TTC', 'CCT', 'CCT', 'TTC', 'ATA', 'ATA', 'CCT', 'CCT', 'CCT', 'CCT'], ['ATA', 'ATA', 'ATA', 'ATA', 'TTC', 'TTC', 'CCT', 'TTC', 'TTC', 'CCT', 'CCT', 'ATA', 'ATA', 'ATA', 'TTC', 'TTC', 'TTC', 'ATA', 'ATA', 'TTC', 'TTC', 'TTC', 'CCT', 'TTC', 'TTC', 'CCT', 'CCT', 'ATA', 'TTC', 'CCT', 'TTC', 'ATA', 'ATA', 'CCT'], ['ATA', 'CCT', 'CCT', 'ATA', 'ATA', 'ATA', 'TTC', 'TTC', 'CCT', 'ATA', 'ATA', 'CCT', 'TTC', 'CCT', 'TTC', 'CCT', 'ATA', 'CCT', 'TTC', 'TTC', 'CCT', 'CCT', 'ATA', 'ATA', 'TTC', 'ATA', 'CCT', 'CCT', 'TTC', 'TTC', 'ATA', 'ATA', 'CCT', 'CCT'], ['TTC', 'CCT', 'CCT', 'ATA', 'ATA', 'CC

In [27]:
#For the 100 sequences
library_one_nested_100 = [random_tags_library_one_100[x:x+34] for x in range(0, len(random_tags_library_one_100), 34)]
#print(library_one_nested_100) #prints the nested list (lists w/in list)

In [28]:
#For the 1000 sequences
library_one_nested_1000 = [random_tags_library_one_1000[x:x+34] for x in range(0, len(random_tags_library_one_1000), 34)]
#print(library_one_nested_1000) #prints the nested list (lists w/in list)

### Section 1.8 Putting together each of the randomly generated sequence tags into individual sequences

In [29]:
#Putting all the tags in the indiv. libraries together into one long sequence
#using the nested list, not the dictionary

#this library is now a list of strings of 50 sequences
library_one_50 = [''.join(l) for l in library_one_nested_50]
print(library_one_50)

In [30]:
#List of strings/ list of 100 sequences
library_one_100 = [''.join(l) for l in library_one_nested_100]
#print(library_one_100)

In [31]:
#List of strings/ list of 1000 sequences
library_one_1000 = [''.join(l) for l in library_one_nested_1000]
#print(library_one_1000)

In [32]:
#Identifying the indeces for the libraries of sequences 
# will show us how many sequences there are in each library

#For 50 sequences:
library_one_50_index = enumerate(library_one_50)
print(list(library_one_50_index))

#If we want an indiv. seq. from a library, we call the index of that seq.
#print(library_one_50[3])

[(0, 'AGTTGGGGGGCGCCTAATACCGCACTACACGGTAATGGTGCCATCCATCGAGACTTGGAAGTTCACTATACGAACCTACTAAACTCGGCAAGCTTCGACTGG'), (1, 'ACTAACGAAGGTACTACACCCCGACCCGGACCTAACAACTGCACATGCAAACAGCGCGTGTTGGGCCGCCACAGCAAGGTGGACATTGAGTTTCAATTCTCG'), (2, 'GGACCGTGTCCTCCAGTCTCTCGTTTGAGTAATCAATTTTTTCCATTGTCAATCAGGACTCTATGTGCCGCTTGCCGGCTTATTTTAACCTGGCCCCTAACT'), (3, 'CCTTTGAACGGGAGGTCATCTTCTCGGTTCCCGCCAGACCCGTTGCAGGGTGCAAAAACACTCTTTCGTAAATGCAGACATGCCTCCGGTGCCCGAACCCTA'), (4, 'GGTCTGCATCTCCAACTATCTCGTTTGCGTTATGATTGTTCATCGGTGTCAGGCCTAGGGTTGAGTCCAAGGCTGGAAGTCCCCACGTTGCTACGCCTCACA'), (5, 'AGCATTAATGGTTGCCCTACCCGCGCAATACCCGCCGCATGGACTAGTTTCACGCAGGCAAAAGTTAACTCTGCATATCCACGTATCTCCACATGGGGTATA'), (6, 'AGTAGTGCGTCCAGACAATATGAGCGCCGTCGTCTGCGTCTCTCAGGCAGACCTAATATATTAGCCGGGTGGCCCGATTTTCAGTGCTTTGTAAAAAGCGAT'), (7, 'GTGTCGCTTCAAATACGTGCCAAGCATATACATGCAAGACATGGTATCACATCGCGGTCGAATGCCTACGGGCACATTACGAGGGGGAACGGCTCAGGGGCG'), (8, 'TTAAGACTCTTCAACGAGGGACCTTACACCAAGTTGTGCGGTCGTCTAAACTGTGACTTGTACCTCGTCATCATTGACGCTAGTTACCCAGTTCGTGACGGA'), 

In [33]:
#For 100 sequences:
library_one_100_index = enumerate(library_one_100)
#print(list(library_one_100_index))

In [34]:
#For 1000 sequences:
library_one_1000_index = enumerate(library_one_1000)
#print(list(library_one_1000_index))

### Section 1.9: Finding repeats for the sequences in the library

In [128]:
#By identifying regular expressions
import re

#Finds repeats of just one sequence at a time
print(library_one_50[0])
matches = re.finditer(r"(AA){2,}|(AT){2,}|(AG){2,}|(AC){2,}|(TT){2,}|(TG){2,}|\
                        (TC){2,}|(TA){2,}|(GC){2,}|(GG){2,}|\
                        (GT){2,}|(GA){2,}|(CA){2,}|(CT){2,}|(CG){2,}|(CC){2,}", 
                        library_one_50[0])
repeats_one_50_seq0 = []
for m in matches:
    repeats_one_50_seq0.append(m.group())
print(repeats_one_50_seq0)

print(library_one_100[0])
matches = re.finditer(r"(AA){2,}|(AT){2,}|(AG){2,}|(AC){2,}|(TT){2,}|(TG){2,}|(TC){2,}|(TA){2,}|(GC){2,}|(GG){2,}|\
                      (GT){2,}|(GA){2,}|(CA){2,}|(CT){2,}|(CG){2,}|(CC){2,}", library_one_100[0])
repeats_one_100_seq0 = []
for m in matches:
    repeats_one_100_seq0.append(m.group())
print(repeats_one_100_seq0)

print(library_one_1000[0])
matches = re.finditer(r"(AA){2,}|(AT){2,}|(AG){2,}|(AC){2,}|(TT){2,}|(TG){2,}|(TC){2,}|(TA){2,}|(GC){2,}|(GG){2,}|\
                      (GT){2,}|(GA){2,}|(CA){2,}|(CT){2,}|(CG){2,}|(CC){2,}", library_one_1000[0])
repeats_one_1000_seq0 = []
for m in matches:
    repeats_one_1000_seq0.append(m.group())
print(repeats_one_1000_seq0)

AGTTGGGGGGCGCCTAATACCGCACTACACGGTAATGGTGCCATCCATCGAGACTTGGAAGTTCACTATACGAACCTACTAAACTCGGCAAGCTTCGACTGG
['GGGGGG', 'ACAC', 'GAGA', 'TATA']
GTTGTCGCAGAGTATTCTCTAGATCCAGACCTGATATGCTTGGAAGTAACGATAGTCCCCCTGCTCTTATCCGATGTATCCCCGGCTGCGGTATCCCGGAGC
['AGAG', 'TCTC', 'ATAT', 'CCCC', 'CTCT', 'CCCC']
TCGCCGAGACTCACGGGAGGAGGTTTACCGGACACTGCCACACAGTGTGCCTGGAAACATCCTACCGTCTCGTGCCTCCCCACCATCTTTCTCCGCGTGTTG
['GAGA', 'ACAC', 'CACACA', 'TGTG', 'TCTC', 'CCCC', 'TCTC', 'CGCG']


In [15]:
#Defining a function to get the repeats from a given sequence
def get_repeats(sequence):
    seq_repeats_one = []
    compiled_expressions = re.compile(r'(AA){2,}|(AT){2,}|(AG){2,}|(AC){2,}|\
                                    (TT){2,}|(TG){2,}|(TC){2,}|(TA){2,}|\
                                      (GC){2,}|(GG){2,}|(GT){2,}|(GA){2,}|\
                                      (CA){2,}|(CT){2,}|(CG){2,}|(CC){2,}')
    for match in compiled_expressions.finditer(sequence):
            seq_repeats_one.append(match.group())
    return seq_repeats_one

#Writing a for loop and calling the get_repeats function
# to get the repeats for all sequences in a list of strings from a library

#Getting all the repeats from the sequences in the library of 50 sequences
#empty library for repeats
repeats_library_one_50 = []
for sequence in library_one_50:
    repeats = get_repeats(sequence) 
    repeats_library_one_50.append(repeats)
print(repeats_library_one_50) 
#returns nested list of repeats found for each indiv. sequence

#This gives the same results as it did when finding the repeats for each seq. indiv. as done above

NameError: name 'library_one_50' is not defined

In [37]:
#Getting all the repeats from the sequences in the library of 100 sequences
#empty library for repeats
repeats_library_one_100 = []
for sequence in library_one_100:
    repeats = get_repeats(sequence) 
    repeats_library_one_100.append(repeats)
#print(repeats_library_one_100) #returns nested list of repeats found for each indiv. sequence

In [38]:
#Getting all the repeats from the sequences in the library of 1000 sequences
#empty library for repeats
repeats_library_one_1000 = []
for sequence in library_one_1000:
    repeats = get_repeats(sequence) 
    repeats_library_one_1000.append(repeats)
#print(repeats_library_one_1000) #returns nested list of repeats found for each indiv. sequence

### Section 1.10: Counting the number of repeats found in each of the sequences in the library

In [14]:
#To count the number of repeats found in the sequences above, we can create another function:
# this is called a recursive function
def repeat_count(repeats):
    count = 0 #count starts at 0
    for elem in repeats:
        if type(elem) == list: #check if the element is a list
            count += repeat_count(elem) #if list, get size
        else:
            count += 1 #if not list, add 1 to count
    return count

#Writing a for loop and calling the repeat_count function
# to get the number of repeats found in each seq., as found in the previous code

#The number of repeats for the indiv. sequences in the library of 50 sequences:
repeat_counts_library_one_50 = []
for repeats in repeats_library_one_50:
    number = repeat_count(repeats)
    repeat_counts_library_one_50.append(number)
print(repeat_counts_library_one_50)

#Then, to get the total number of 
#all repeats for all 50 sequences in the library
total_repeats_lib_one_50 = (sum(repeat_counts_library_one_50))
print(total_repeats_lib_one_50)

NameError: name 'repeats_library_one_50' is not defined

In [40]:
#The number of repeats for the indiv. sequences in the library of 100 sequences:
repeat_counts_library_one_100 = []
for repeats in repeats_library_one_100:
    number = repeat_count(repeats)
    repeat_counts_library_one_100.append(number)
#print(repeat_counts_library_one_100)

#Then, to get the total number of all repeats for all 100 sequences in the library
total_repeats_lib_one_100 = (sum(repeat_counts_library_one_100))
print(total_repeats_lib_one_100)

457


In [41]:
#The number of repeats for the indiv. sequences in the library of 1000 sequences:
repeat_counts_library_one_1000 = []
for repeats in repeats_library_one_1000:
    number = repeat_count(repeats)
    repeat_counts_library_one_1000.append(number)
#print(repeat_counts_library_one_1000)

#Then, to get the total number of all repeats for all 1000 sequences in the library
total_repeats_lib_one_1000 = (sum(repeat_counts_library_one_1000))
print(total_repeats_lib_one_1000)

4631


### Section 1.11: Creating library for sequences randomly generated by using key:val pairs

In [12]:
#Defining a function to get the keys from the key;val pairs
def get_keys(key):
    keys_one = []
    keys_one = [k for k, v in key]
    return keys_one

In [13]:
############### 
# CHANGE THE NUMBERS IN THE FOLLOWING CODE TO CHANGE HOW MANY SEQUENCES/ WHAT LENGTH SEQUENCES YOU WANT TO GENERATE

#What if we want to sample a lot at once, and store the results?
import numpy as np
import random

#Making empty libraries for generating 50, 100, and 1000 random sequences
random_pairs_library_one_50 = []
random_pairs_library_one_100 = []
random_pairs_library_one_1000 = []

#This for loop
# generates 50 sequences, of 34 key/val pairs each (102nt in length)
for x in np.arange(50): #do the loop 50 times
    #put tags into library
    random_pairs_library_one_50.append(random.sample(list(dict_one.items()), 34)) 
    #generates 34 random key;val pairs each time
print(random_pairs_library_one_50)

#Writing a for loop and calling the get_keys function
# to get the keys for all the key:val pairs in random_pairs_library_one_50
keys_library_one_50 = []
for key in random_pairs_library_one_50:
    keys = get_keys(key) 
    keys_library_one_50.append(keys)
#print(keys_library_one_50) #returns nested list of keys for each indiv. sequence
#so 50 lists of 34 keys
# to make into 50 sequences of 34 tags = sequences of 102nt in length

ValueError: Sample larger than population or is negative

In [44]:
#This for loop
# generates 100 sequences, of 34 key/val pairs each (102nt in length)
for x in np.arange(100): #do the loop 100 times
    #put tags into library
    random_pairs_library_one_100.append(random.sample(list(dict_one.items()), 34)) 
    #generates 34 random key;val pairs each time
#print(random_pairs_library_one_100)

#Writing a for loop and calling the get_keys function
# to get the keys for all the key:val pairs in random_pairs_library_one_100
keys_library_one_100 = []
for key in random_pairs_library_one_100:
    keys = get_keys(key) 
    keys_library_one_100.append(keys)
#print(keys_library_one_100) #returns nested list of keys for each indiv. sequence
#so 100 lists of 34 keys
# to make into 100 sequences of 34 tags = sequences of 102nt in length

In [45]:
#This for loop
# generates 1000 sequences, of 34 key/val pairs each (102nt in length)
for x in np.arange(1000): #do the loop 1000 times
    #put tags into library
    random_pairs_library_one_1000.append(random.sample(list(dict_one.items()), 34)) 
    #generates 34 random key;val pairs each time
#print(random_pairs_library_one_1000)

#Writing a for loop and calling the get_keys function
# to get the keys for all the key:val pairs in random_pairs_library_one_1000
keys_library_one_1000 = []
for key in random_pairs_library_one_1000:
    keys = get_keys(key) 
    keys_library_one_1000.append(keys)
#print(keys_library_one_1000) #returns nested list of keys for each indiv. sequence
#so 1000 lists of 34 keys
# to make into 1000 sequences of 34 tags = sequences of 102nt in length

In [46]:
#Putting all the tags in the indiv. libraries together into actual sequences

#For library of 50 sequences:
#this library is now a list of strings 
library_keys_one_50 = [''.join(l) for l in keys_library_one_50]
#print(library_keys_one_50)

In [47]:
#For library of 100 sequences: 
library_keys_one_100 = [''.join(l) for l in keys_library_one_100]
#print(library_keys_one_100)

In [48]:
#For library of 1000 sequences: 
library_keys_one_1000 = [''.join(l) for l in keys_library_one_1000]
#print(library_keys_one_1000)

In [49]:
#Identifying the indeces for the sequences 
# will show us how many sequences there are

#For the library of 50 sequences:
library_keys_one_50_index = enumerate(library_keys_one_50)
#print(list(library_keys_one_50_index))

#If we want an indiv. seq. from the library, we call the index of that seq.
#print(library_keys_one_50[3])

In [50]:
#For the library of 100 sequences:
library_keys_one_100_index = enumerate(library_keys_one_100)
#print(list(library_keys_one_100_index))

In [51]:
#For the library of 1000 sequences:
library_keys_one_1000_index = enumerate(library_keys_one_1000)
#print(list(library_keys_one_1000_index))

In [130]:
#Defining a function to get the repeats from a given sequence
def get_repeats(sequence):
    seq_repeats_one = []
    compiled_expressions = re.compile(r'(AA){2,}|(AT){2,}|(AG){2,}|(AC){2,}|(TT){2,}|(TG){2,}|(TC){2,}|(TA){2,}|\
                                      (GC){2,}|(GG){2,}|(GT){2,}|(GA){2,}|(CA){2,}|(CT){2,}|(CG){2,}|(CC){2,}')
    for match in compiled_expressions.finditer(sequence):
            seq_repeats_one.append(match.group())
    return seq_repeats_one

#Writing a for loop and calling the get_repeats function
# to get the repeats for all sequences in a list of strings

#For the library of 50 sequences, to get the repeats of each indiv. sequence:
repeats_keys_library_one_50 = []
for sequence in library_keys_one_50:
    repeats = get_repeats(sequence) 
    repeats_keys_library_one_50.append(repeats)
#print(repeats_keys_library_one_50) #returns nested list of repeats found for each indiv. sequence

In [53]:
#For the library of 100 sequences, to get the repeats of each indiv. sequence:
repeats_keys_library_one_100 = []
for sequence in library_keys_one_100:
    repeats = get_repeats(sequence) 
    repeats_keys_library_one_100.append(repeats)
#print(repeats_keys_library_one_100) #returns nested list of repeats found for each indiv. sequence

In [54]:
#For the library of 1000 sequences, to get the repeats of each indiv. sequence:
repeats_keys_library_one_1000 = []
for sequence in library_keys_one_1000:
    repeats = get_repeats(sequence) 
    repeats_keys_library_one_1000.append(repeats)
#print(repeats_keys_library_one_1000) #returns nested list of repeats found for each indiv. sequence

In [55]:
#To count the number of repeats in each indiv. sequence, we can create another function:
def repeat_count(repeats):
    count = 0 #count starts at 0
    for elem in repeats:
        if type(elem) == list: #check if the element is a list
            count += repeat_count(elem) #if list, get size
        else:
            count += 1 #if not list, add 1 to count
    return count

#Writing a for loop and calling the repeat_count function
# to get the number of repeats found in each seq., as stated in the libraries above

#For the library of 50 sequences:
repeat_counts_keys_library_one_50 = []
for repeats in repeats_keys_library_one_50:
    number = repeat_count(repeats)
    repeat_counts_keys_library_one_50.append(number)
#print(repeat_counts_keys_library_one_50)

#Then, to get the total number of all repeats in all sequences in the library
total_repeats_keys_lib_one_50 = sum(repeat_counts_keys_library_one_50)
print(total_repeats_keys_lib_one_50)

247


In [56]:
#For the library of 100 sequences:
repeat_counts_keys_library_one_100 = []
for repeats in repeats_keys_library_one_100:
    number = repeat_count(repeats)
    repeat_counts_keys_library_one_100.append(number)
#print(repeat_counts_keys_library_one_100)

#Then, to get the total number of all repeats in all sequences in the library
total_repeats_keys_lib_one_100 = sum(repeat_counts_keys_library_one_100)
print(total_repeats_keys_lib_one_100)

477


In [57]:
#For the library of 1000 sequences:
repeat_counts_keys_library_one_1000 = []
for repeats in repeats_keys_library_one_1000:
    number = repeat_count(repeats)
    repeat_counts_keys_library_one_1000.append(number)
#print(repeat_counts_keys_library_one_1000)

#Then, to get the total number of all repeats in all sequences in the library
total_repeats_keys_lib_one_1000 = sum(repeat_counts_keys_library_one_1000)
print(total_repeats_keys_lib_one_1000)

4727


### Section 1.12: Creating a unique sequence of 9bp to use as 'spacer sequences' between the individual tags in the fragments ("lego blocks") 1A-1H on Geneious

In [58]:
#Making the dictionary as a list
dict_one_key_list = list(dict_one.keys())  

#inputting in a function to always get the same tags sampled
np.random.seed(5)

#The random, unique 9bp sequence this code creates is used as spacer sequences between
#the individual tags in the 'tag' fragment/ "lego blocks"

#Make empty list for the results of the for loop
unique_tag_fragment_seq = []
#The for loop
for i in np.arange(3):
    #do one random sample of 3 tags and put into list to get sequence of 9bp
    unique_tag_fragment_seq.append(np.random.choice(dict_one_key_list))
#viewing the list of randomly sampled codon tags
print(unique_tag_fragment_seq)

#Now, we can put all the randomly sampled codon tags
# together into one long sequence
unique_tag_seq = ''.join(map(str,unique_tag_fragment_seq))
print(unique_tag_seq)

#And also get the nucleotide length of the codon tag sequence
length = len(unique_tag_seq)
print("Sequence length: " + str(length))

['TTG', 'CTA', 'GAT']
TTGCTAGAT
Sequence length: 9


### Section 1.13: Creating unique sequences of 24bp to use as 'extension sequences' in the fragments ("lego blocks") 1A-1H on Geneious

In [59]:
#Fragments 1A-1H require 2 'extension sequences' each; one at the beginning, and one at the end of the fragment
# these 'extension sequences' are to elongate the fragment to ensure the BsaI enzyme has space to bind by the cut site


### Sequence 1 for fragment 1A-1H:

#inputting in a function to always get the same tags sampled for reproducibility 
np.random.seed(25)

#Make empty list for the results of the for loop
ext_seq_1_frag1 = []
#The for loop
for i in np.arange(8):
    #do one random sample of 8 tags and put into list to get sequence of 24bp
    ext_seq_1_frag1.append(np.random.choice(dict_one_key_list))
#viewing the list of randomly sampled codon tags
print(ext_seq_1_frag1)

#Now, we can put all the randomly sampled codon tags
# together into one long sequence
ext_fragments_1_seq_1 = ''.join(map(str,ext_seq_1_frag1))
print(ext_fragments_1_seq_1)

#And also get the nucleotide length of the codon tag sequence
length = len(ext_fragments_1_seq_1)
print("Sequence length: " + str(length))

['CAG', 'CGT', 'GTA', 'AAG', 'TCG', 'CAA', 'CAC', 'ATA']
CAGCGTGTAAAGTCGCAACACATA
Sequence length: 24
['CTC', 'CGG', 'CAA', 'AGC', 'AGA', 'TAT', 'GGT', 'TTA']
CTCCGGCAAAGCAGATATGGTTTA
Sequence length: 24


In [None]:
### Sequence 2 for fragment 1A-1H:

#inputting in a function to always get the same tags sampled for reproducibility
np.random.seed(29)

#Make empty list for the results of the for loop
ext_seq_2_frag1 = []
#The for loop
for i in np.arange(8):
    #do one random sample of 8 tags and put into list to get sequence of 24bp
    ext_seq_2_frag1.append(np.random.choice(dict_one_key_list))
#viewing the list of randomly sampled codon tags
print(ext_seq_2_frag1)

#Now, we can put all the randomly sampled codon tags
# together into one long sequence
ext_fragments_1_seq_2 = ''.join(map(str,ext_seq_2_frag1))
print(ext_fragments_1_seq_2)

#And also get the nucleotide length of the codon tag sequence
length = len(ext_fragments_1_seq_2)
print("Sequence length: " + str(length))

### Section 1.14: Creating unique sequences of 24bp to use as 'extension sequencs' in the fragments ("lego blocks") 2A & 2B on Geneious

In [60]:
#Fragment 2A requires 4 'extension sequences'; two at each end around the BsaI cut site
#Fragment 2B requires 5 'extension sequences'; again, two at each end around the cut site, and one replacing the 
#fluorescent protein


### Sequence 1 for fragments 2A and 2B:

#inputting in a function to always get the same tags sampled for reproducibility 
np.random.seed(7)

#Make empty list for the results of the for loop
ext_seq_1_frag2 = []
#The for loop
for i in np.arange(8):
    #do one random sample of 8 tags and put into list to get sequence of 24bp
    ext_seq_1_frag2.append(np.random.choice(dict_one_key_list))
#viewing the list of randomly sampled codon tags
print(ext_seq_1_frag2)

#Now, we can put all the randomly sampled codon tags
# together into one long sequence
ext_fragments_2_seq_1 = ''.join(map(str,ext_seq_1_frag2))
print(ext_fragments_2_seq_1)

#And also get the nucleotide length of the codon tag sequence
length = len(ext_fragments_2_seq_1)
print("Sequence length: " + str(length))


### Sequence 2 for fragments 2A and 2B:

#inputting in a function to always get the same tags sampled for reproducibility
np.random.seed(30)

#Make empty list for the results of the for loop
ext_seq_2_frag2 = []
#The for loop
for i in np.arange(8):
    #do one random sample of 8 tags and put into list to get sequence of 24bp
    ext_seq_2_frag2.append(np.random.choice(dict_one_key_list))
#viewing the list of randomly sampled codon tags
print(ext_seq_2_frag2)

#Now, we can put all the randomly sampled codon tags
# together into one long sequence
ext_fragments_2_seq_2 = ''.join(map(str,ext_seq_2_frag2))
print(ext_fragments_2_seq_2)

#And also get the nucleotide length of the codon tag sequence
length = len(ext_fragments_2_seq_2)
print("Sequence length: " + str(length))


### Sequence 3 for fragments 2A and 2B:

#inputting in a function to always get the same tags sampled for reproducibility
np.random.seed(34)

#Make empty list for the results of the for loop
ext_seq_3_frag2 = []
#The for loop
for i in np.arange(8):
    #do one random sample of 8 tags and put into list to get sequence of 24bp
    ext_seq_3_frag2.append(np.random.choice(dict_one_key_list))
#viewing the list of randomly sampled codon tags
print(ext_seq_3_frag2)

#Now, we can put all the randomly sampled codon tags
# together into one long sequence
ext_fragments_2_seq_3 = ''.join(map(str,ext_seq_3_frag2))
print(ext_fragments_2_seq_3)

#And also get the nucleotide length of the codon tag sequence
length = len(ext_fragments_2_seq_3)
print("Sequence length: " + str(length))


### Sequence 4 for fragments 2A and 2B:

#inputting in a function to always get the same tags sampled for reproducibility
np.random.seed(42)

#Make empty list for the results of the for loop
ext_seq_4_frag2 = []
#The for loop
for i in np.arange(8):
    #do one random sample of 8 tags and put into list to get sequence of 24bp
    ext_seq_4_frag2.append(np.random.choice(dict_one_key_list))
#viewing the list of randomly sampled codon tags
print(ext_seq_4_frag2)

#Now, we can put all the randomly sampled codon tags
# together into one long sequence
ext_fragments_2_seq_4 = ''.join(map(str,ext_seq_4_frag2))
print(ext_fragments_2_seq_4)

#And also get the nucleotide length of the codon tag sequence
length = len(ext_fragments_2_seq_4)
print("Sequence length: " + str(length))


### Sequence 5 for fragment 2B:

#inputting in a function to always get the same tags sampled for reproducibility
np.random.seed(50)

#Make empty list for the results of the for loop
ext_seq_5_frag2 = []
#The for loop
for i in np.arange(8):
    #do one random sample of 8 tags and put into list to get sequence of 24bp
    ext_seq_5_frag2.append(np.random.choice(dict_one_key_list))
#viewing the list of randomly sampled codon tags
print(ext_seq_5_frag2)

#Now, we can put all the randomly sampled codon tags
# together into one long sequence
ext_fragments_2_seq_5 = ''.join(map(str,ext_seq_5_frag2))
print(ext_fragments_2_seq_5)

#And also get the nucleotide length of the codon tag sequence
length = len(ext_fragments_2_seq_5)
print("Sequence length: " + str(length))

['GAT', 'CAG', 'GCT', 'AAG', 'GCC', 'GAA', 'TCG', 'TGT']
GATCAGGCTAAGGCCGAATCGTGT
Sequence length: 24
['GAC', 'GAC', 'AAT', 'AAT', 'GTC', 'CGC', 'TCG', 'GTT']
GACGACAATAATGTCCGCTCGGTT
Sequence length: 24
['TGG', 'AAA', 'TCA', 'GGT', 'AAG', 'CTC', 'CAG', 'TTG']
TGGAAATCAGGTAAGCTCCAGTTG
Sequence length: 24
['CAT', 'CGG', 'ATT', 'CTA', 'TCA', 'CTT', 'CGA', 'CAT']
CATCGGATTCTATCACTTCGACAT
Sequence length: 24
['GAG', 'TAT', 'TAC', 'AAT', 'TGG', 'ACC', 'CAG', 'TGC']
GAGTATTACAATTGGACCCAGTGC
Sequence length: 24


### Section 1.15: Creating unique sequences of 24bp to use as 'extension sequencs' in the fragments ("lego blocks") 3A & 3B on Geneious

In [61]:
#Fragment 3A requires 4 'extension sequences'; two at each end around the BsaI cut site
#Fragment 3B requires 5 'extension sequences'; again, two at each end around the cut site, and one replacing the 
#fluorescent protein


### Sequence 1 for fragments 3A and 3B: 

#inputting in a function to always get the same tags sampled for reproducibility
np.random.seed(52)

#Make empty list for the results of the for loop
ext_seq_1_frag3 = []
#The for loop
for i in np.arange(8):
    #do one random sample of 8 tags and put into list to get sequence of 24bp
    ext_seq_1_frag3.append(np.random.choice(dict_one_key_list))
#viewing the list of randomly sampled codon tags
print(ext_seq_1_frag3)

#Now, we can put all the randomly sampled codon tags
# together into one long sequence
ext_fragments_3_seq_1 = ''.join(map(str,ext_seq_1_frag3))
print(ext_fragments_3_seq_1)

#And also get the nucleotide length of the codon tag sequence
length = len(ext_fragments_3_seq_1)
print("Sequence length: " + str(length))


### Sequence 2 for fragments 3A and 3B: 

#inputting in a function to always get the same tags sampled for reproducibility
np.random.seed(55)

#Make empty list for the results of the for loop
ext_seq_2_frag3 = []
#The for loop
for i in np.arange(8):
    #do one random sample of 8 tags and put into list to get sequence of 24bp
    ext_seq_2_frag3.append(np.random.choice(dict_one_key_list))
#viewing the list of randomly sampled codon tags
print(ext_seq_2_frag3)

#Now, we can put all the randomly sampled codon tags
# together into one long sequence
ext_fragments_3_seq_2 = ''.join(map(str,ext_seq_2_frag3))
print(ext_fragments_3_seq_2)

#And also get the nucleotide length of the codon tag sequence
length = len(ext_fragments_3_seq_2)
print("Sequence length: " + str(length))


### Sequence 3 for fragments 3A and 3B: 

#inputting in a function to always get the same tags sampled for reproducibility
np.random.seed(61)

#Make empty list for the results of the for loop
ext_seq_3_frag3 = []
#The for loop
for i in np.arange(8):
    #do one random sample of 8 tags and put into list to get sequence of 24bp
    ext_seq_3_frag3.append(np.random.choice(dict_one_key_list))
#viewing the list of randomly sampled codon tags
print(ext_seq_3_frag3)

#Now, we can put all the randomly sampled codon tags
# together into one long sequence
ext_fragments_3_seq_3 = ''.join(map(str,ext_seq_3_frag3))
print(ext_fragments_3_seq_3)

#And also get the nucleotide length of the codon tag sequence
length = len(ext_fragments_3_seq_3)
print("Sequence length: " + str(length))


### Sequence 4 for fragments 3A and 3B: 

#inputting in a function to always get the same tags sampled for reproducibility
np.random.seed(64)

#Make empty list for the results of the for loop
ext_seq_4_frag3 = []
#The for loop
for i in np.arange(8):
    #do one random sample of 8 tags and put into list to get sequence of 24bp
    ext_seq_4_frag3.append(np.random.choice(dict_one_key_list))
#viewing the list of randomly sampled codon tags
print(ext_seq_4_frag3)

#Now, we can put all the randomly sampled codon tags
# together into one long sequence
ext_fragments_3_seq_4 = ''.join(map(str,ext_seq_4_frag3))
print(ext_fragments_3_seq_4)

#And also get the nucleotide length of the codon tag sequence
length = len(ext_fragments_3_seq_4)
print("Sequence length: " + str(length))


### Sequence 5 for fragment 3B: 

#inputting in a function to always get the same tags sampled for reproducibility
np.random.seed(67)

#Make empty list for the results of the for loop
ext_seq_5_frag3 = []
#The for loop
for i in np.arange(8):
    #do one random sample of 8 tags and put into list to get sequence of 24bp
    ext_seq_5_frag3.append(np.random.choice(dict_one_key_list))
#viewing the list of randomly sampled codon tags
print(ext_seq_5_frag3)

#Now, we can put all the randomly sampled codon tags
# together into one long sequence
ext_fragments_3_seq_5 = ''.join(map(str,ext_seq_5_frag3))
print(ext_fragments_3_seq_5)

#And also get the nucleotide length of the codon tag sequence
length = len(ext_fragments_3_seq_5)
print("Sequence length: " + str(length))

['ACG', 'ATT', 'TAC', 'AGT', 'TCG', 'GTG', 'ATT', 'GGA']
ACGATTTACAGTTCGGTGATTGGA
Sequence length: 24
['AGT', 'CGT', 'TGT', 'ATA', 'GCA', 'GAC', 'GGA', 'GAC']
AGTCGTTGTATAGCAGACGGAGAC
Sequence length: 24
['TTG', 'CCA', 'AGC', 'TCT', 'GAT', 'AGC', 'GGG', 'TCA']
TTGCCAAGCTCTGATAGCGGGTCA
Sequence length: 24
['CAG', 'CAT', 'GGA', 'CAT', 'AAG', 'TTA', 'GTC', 'CGG']
CAGCATGGACATAAGTTAGTCCGG
Sequence length: 24
['GCC', 'ACG', 'GCG', 'AGG', 'CTT', 'GCT', 'ACT', 'GGC']
GCCACGGCGAGGCTTGCTACTGGC
Sequence length: 24


### Section 1.16: Creating unique sequences of 24bp to use as 'extension sequencs' in the fragments ("lego blocks") 4A & 4B on Geneious

In [62]:
#Fragment 4A requires 4 'extension sequences'; two at each end around the BsaI cut site
#Fragment 4B requires 5 'extension sequences'; again, two at each end around the cut site, and one replacing the 
#fluorescent protein


### Sequence 1 for fragments 4A and 4B: 

#inputting in a function to always get the same tags sampled for reproducibility
np.random.seed(68)

#Make empty list for the results of the for loop
ext_seq_1_frag4 = []
#The for loop
for i in np.arange(8):
    #do one random sample of 8 tags and put into list to get sequence of 24bp
    ext_seq_1_frag4.append(np.random.choice(dict_one_key_list))
#viewing the list of randomly sampled codon tags
print(ext_seq_1_frag4)

#Now, we can put all the randomly sampled codon tags
# together into one long sequence
ext_fragments_4_seq_1 = ''.join(map(str,ext_seq_1_frag4))
print(ext_fragments_4_seq_1)

#And also get the nucleotide length of the codon tag sequence
length = len(ext_fragments_4_seq_1)
print("Sequence length: " + str(length))


### Sequence 2 for fragments 4A and 4B: 

#inputting in a function to always get the same tags sampled for reproducibility
np.random.seed(69)

#Make empty list for the results of the for loop
ext_seq_2_frag4 = []
#The for loop
for i in np.arange(8):
    #do one random sample of 8 tags and put into list to get sequence of 24bp
    ext_seq_2_frag4.append(np.random.choice(dict_one_key_list))
#viewing the list of randomly sampled codon tags
print(ext_seq_2_frag4)

#Now, we can put all the randomly sampled codon tags
# together into one long sequence
ext_fragments_4_seq_2 = ''.join(map(str,ext_seq_2_frag4))
print(ext_fragments_4_seq_2)

#And also get the nucleotide length of the codon tag sequence
length = len(ext_fragments_4_seq_2)
print("Sequence length: " + str(length))


### Sequence 3 for fragments 4A and 4B: 

#inputting in a function to always get the same tags sampled for reproducibility
np.random.seed(70)

#Make empty list for the results of the for loop
ext_seq_3_frag4 = []
#The for loop
for i in np.arange(8):
    #do one random sample of 8 tags and put into list to get sequence of 24bp
    ext_seq_3_frag4.append(np.random.choice(dict_one_key_list))
#viewing the list of randomly sampled codon tags
print(ext_seq_3_frag4)

#Now, we can put all the randomly sampled codon tags
# together into one long sequence
ext_fragments_4_seq_3 = ''.join(map(str,ext_seq_3_frag4))
print(ext_fragments_4_seq_3)

#And also get the nucleotide length of the codon tag sequence
length = len(ext_fragments_4_seq_3)
print("Sequence length: " + str(length))


### Sequence 4 for fragments 4A and 4B: 

#inputting in a function to always get the same tags sampled for reproducibility
np.random.seed(73)

#Make empty list for the results of the for loop
ext_seq_4_frag4 = []
#The for loop
for i in np.arange(8):
    #do one random sample of 8 tags and put into list to get sequence of 24bp
    ext_seq_4_frag4.append(np.random.choice(dict_one_key_list))
#viewing the list of randomly sampled codon tags
print(ext_seq_4_frag4)

#Now, we can put all the randomly sampled codon tags
# together into one long sequence
ext_fragments_4_seq_4 = ''.join(map(str,ext_seq_4_frag4))
print(ext_fragments_4_seq_4)

#And also get the nucleotide length of the codon tag sequence
length = len(ext_fragments_4_seq_4)
print("Sequence length: " + str(length))


### Sequence 5 for fragment 4B: 

#inputting in a function to always get the same tags sampled for reproducibility
np.random.seed(77)

#Make empty list for the results of the for loop
ext_seq_5_frag4 = []
#The for loop
for i in np.arange(8):
    #do one random sample of 8 tags and put into list to get sequence of 24bp
    ext_seq_5_frag4.append(np.random.choice(dict_one_key_list))
#viewing the list of randomly sampled codon tags
print(ext_seq_5_frag4)

#Now, we can put all the randomly sampled codon tags
# together into one long sequence
ext_fragments_4_seq_5 = ''.join(map(str,ext_seq_5_frag4))
print(ext_fragments_4_seq_5)

#And also get the nucleotide length of the codon tag sequence
length = len(ext_fragments_4_seq_5)
print("Sequence length: " + str(length))

['CAC', 'TGT', 'ATC', 'GCG', 'CAA', 'GAG', 'ATT', 'ACA']
CACTGTATCGCGCAAGAGATTACA
Sequence length: 24
['AAG', 'TAC', 'ACT', 'CCG', 'CGT', 'GGA', 'CGA', 'TTA']
AAGTACACTCCGCGTGGACGATTA
Sequence length: 24
['CTA', 'GTG', 'CAC', 'ACA', 'GGG', 'TTC', 'GTC', 'ACG']
CTAGTGCACACAGGGTTCGTCACG
Sequence length: 24
['GTG', 'CCA', 'AAC', 'GCG', 'TCT', 'ATA', 'GCC', 'CTG']
GTGCCAAACGCGTCTATAGCCCTG
Sequence length: 24
['TCG', 'TCC', 'CGA', 'CGA', 'CCG', 'GAC', 'ACA', 'TCC']
TCGTCCCGACGACCGGACACATCC
Sequence length: 24


### Section 1.17: Creating unique sequences of 12bp to use as 'spacer' sequences between the RGB positions in the "skeleton" fragment on Geneious

In [63]:
# The "skeleton" requires these 12bp spacer sequences in between the restriction enzyme sites of the RGB proteins
# So, we need 3 different 12bp spacer sequences

### Sequence for where the red fluorescent protein will go: 

#inputting in a function to always get the same tags sampled for reproducibility
np.random.seed(80)

#Make empty list for the results of the for loop
ext_seq_skeleton_red = []
#The for loop
for i in np.arange(4):
    #do one random sample of 4 tags and put into list to get sequence of 12bp
    ext_seq_skeleton_red.append(np.random.choice(dict_one_key_list))
#viewing the list of randomly sampled codon tags
print(ext_seq_skeleton_red)

#Now, we can put all the randomly sampled codon tags
# together into one long sequence
ext_frag_skeleton_red = ''.join(map(str,ext_seq_skeleton_red))
print(ext_frag_skeleton_red)

#And also get the nucleotide length of the codon tag sequence
length = len(ext_frag_skeleton_red)
print("Sequence length: " + str(length))



### Sequence for where the green fluorescent protein will go: 

#inputting in a function to always get the same tags sampled for reproducibility
np.random.seed(85)

#Make empty list for the results of the for loop
ext_seq_skeleton_green = []
#The for loop
for i in np.arange(4):
    #do one random sample of 4 tags and put into list to get sequence of 12bp
    ext_seq_skeleton_green.append(np.random.choice(dict_one_key_list))
#viewing the list of randomly sampled codon tags
print(ext_seq_skeleton_green)

#Now, we can put all the randomly sampled codon tags
# together into one long sequence
ext_frag_skeleton_green = ''.join(map(str,ext_seq_skeleton_green))
print(ext_frag_skeleton_green)

#And also get the nucleotide length of the codon tag sequence
length = len(ext_frag_skeleton_green)
print("Sequence length: " + str(length))



### Sequence for where the blue fluorescent protein will go: 

#inputting in a function to always get the same tags sampled for reproducibility
np.random.seed(86)

#Make empty list for the results of the for loop
ext_seq_skeleton_blue = []
#The for loop
for i in np.arange(4):
    #do one random sample of 4 tags and put into list to get sequence of 12bp
    ext_seq_skeleton_blue.append(np.random.choice(dict_one_key_list))
#viewing the list of randomly sampled codon tags
print(ext_seq_skeleton_blue)

#Now, we can put all the randomly sampled codon tags
# together into one long sequence
ext_frag_skeleton_blue = ''.join(map(str,ext_seq_skeleton_blue))
print(ext_frag_skeleton_blue)

#And also get the nucleotide length of the codon tag sequence
length = len(ext_frag_skeleton_blue)
print("Sequence length: " + str(length))

['GAT', 'CAT', 'GAA', 'GCG']
GATCATGAAGCG
Sequence length: 12
['TCA', 'TAT', 'AGT', 'CAG']
TCATATAGTCAG
Sequence length: 12
['CGA', 'TAT', 'CAA', 'ATT']
CGATATCAAATT
Sequence length: 12


### Section 1.18: Creating unique sequences of 15bp to use as extra 'extension' sequences on the 5' end of the primers used to PCR up the chosen RGB fluorescent proteins

In [64]:
### Sequence for extending the forward primer for the chosen red fluorescent protein: 

#inputting in a function to always get the same tags sampled for reproducibility
np.random.seed(1)

#Make empty list for the results of the for loop
ext_seq_primer_red = []
#The for loop
for i in np.arange(5):
    #do one random sample of 5 tags and put into list to get sequence of 15bp
    ext_seq_primer_red.append(np.random.choice(dict_one_key_list))
#viewing the list of randomly sampled codon tags
print(ext_seq_primer_red)

#Now, we can put all the randomly sampled codon tags
# together into one long sequence
ext_primer_red = ''.join(map(str,ext_seq_primer_red))
print(ext_primer_red)

#And also get the nucleotide length of the codon tag sequence
length = len(ext_primer_red)
print("Sequence length: " + str(length))


### Sequence for extending the reverse primer for the chosen red fluorescent protein: 

#inputting in a function to always get the same tags sampled for reproducibility
np.random.seed(2)

#Make empty list for the results of the for loop
ext_seq_primerrev_red = []
#The for loop
for i in np.arange(5):
    #do one random sample of 5 tags and put into list to get sequence of 15bp
    ext_seq_primerrev_red.append(np.random.choice(dict_one_key_list))
#viewing the list of randomly sampled codon tags
print(ext_seq_primerrev_red)

#Now, we can put all the randomly sampled codon tags
# together into one long sequence
ext_primerrev_red = ''.join(map(str,ext_seq_primerrev_red))
print(ext_primerrev_red)

#And also get the nucleotide length of the codon tag sequence
length = len(ext_primerrev_red)
print("Sequence length: " + str(length))

['GAC', 'CCG', 'CGC', 'ATA', 'ACT']
GACCCGCGCATAACT
Sequence length: 15
['AGA', 'GTA', 'AAT', 'ATA', 'GTG']
AGAGTAAATATAGTG
Sequence length: 15


In [65]:
### Sequence for extending the forward primer for the chosen green fluorescent protein: 

#inputting in a function to always get the same tags sampled for reproducibility
np.random.seed(3)

#Make empty list for the results of the for loop
ext_seq_primer_green = []
#The for loop
for i in np.arange(5):
    #do one random sample of 5 tags and put into list to get sequence of 15bp
    ext_seq_primer_green.append(np.random.choice(dict_one_key_list))
#viewing the list of randomly sampled codon tags
print(ext_seq_primer_green)

#Now, we can put all the randomly sampled codon tags
# together into one long sequence
ext_primer_green = ''.join(map(str,ext_seq_primer_green))
print(ext_primer_green)

#And also get the nucleotide length of the codon tag sequence
length = len(ext_primer_green)
print("Sequence length: " + str(length))


### Sequence for extending the reverse primer for the chosen green fluorescent protein: 

#inputting in a function to always get the same tags sampled for reproducibility
np.random.seed(6)

#Make empty list for the results of the for loop
ext_seq_primerrev_green = []
#The for loop
for i in np.arange(5):
    #do one random sample of 5 tags and put into list to get sequence of 15bp
    ext_seq_primerrev_green.append(np.random.choice(dict_one_key_list))
#viewing the list of randomly sampled codon tags
print(ext_seq_primerrev_green)

#Now, we can put all the randomly sampled codon tags
# together into one long sequence
ext_primerrev_green = ''.join(map(str,ext_seq_primerrev_green))
print(ext_primerrev_green)

#And also get the nucleotide length of the codon tag sequence
length = len(ext_primerrev_green)
print("Sequence length: " + str(length))

['TCA', 'ACA', 'CCC', 'GCC', 'TTT']
TCAACACCCGCCTTT
Sequence length: 15
['GCG', 'ACT', 'TTG', 'CGA', 'TCA']
GCGACTTTGCGATCA
Sequence length: 15


In [66]:
### Sequence for extending the forward primer for the chosen blue fluorescent protein: 

#inputting in a function to always get the same tags sampled for reproducibility
np.random.seed(7)

#Make empty list for the results of the for loop
ext_seq_primer_blue = []
#The for loop
for i in np.arange(5):
    #do one random sample of 5 tags and put into list to get sequence of 15bp
    ext_seq_primer_blue.append(np.random.choice(dict_one_key_list))
#viewing the list of randomly sampled codon tags
print(ext_seq_primer_blue)

#Now, we can put all the randomly sampled codon tags
# together into one long sequence
ext_primer_blue = ''.join(map(str,ext_seq_primer_blue))
print(ext_primer_blue)

#And also get the nucleotide length of the codon tag sequence
length = len(ext_primer_blue)
print("Sequence length: " + str(length))


### Sequence for extending the reverse primer for the chosen blue fluorescent protein: 

#inputting in a function to always get the same tags sampled for reproducibility
np.random.seed(8)

#Make empty list for the results of the for loop
ext_seq_primerrev_blue = []
#The for loop
for i in np.arange(5):
    #do one random sample of 5 tags and put into list to get sequence of 15bp
    ext_seq_primerrev_blue.append(np.random.choice(dict_one_key_list))
#viewing the list of randomly sampled codon tags
print(ext_seq_primerrev_blue)

#Now, we can put all the randomly sampled codon tags
# together into one long sequence
ext_primerrev_blue = ''.join(map(str,ext_seq_primerrev_blue))
print(ext_primerrev_blue)

#And also get the nucleotide length of the codon tag sequence
length = len(ext_primerrev_blue)
print("Sequence length: " + str(length))

['GAT', 'CAG', 'GCT', 'AAG', 'GCC']
GATCAGGCTAAGGCC
Sequence length: 15
['GCC', 'CGA', 'TTA', 'GGT', 'AGG']
GCCCGATTAGGTAGG
Sequence length: 15


# Section 2:

#### Genetic code option 2:
#### - Selected blocks of similar codons for each colour
#### - 60 codons total
#### - 7 to each colour, with black having 11 codons

### Section 2.1: Specifying the dictionary of the codon tags & the colour they're assigned to

In [18]:
#Making the dictionary of codon tags & their colour
#The codon tags are keys, and the colours are values
dict_two = {
    'TTC':'Red', 'TTA':'Red', 'TTG':'Red', 'CTT':'Red', 'CTC':'Red', 'CTA':'Red', 'CTG':'Red',
    'ATT':'Green', 'ATC':'Green', 'ATA':'Green', 'GTT':'Green', 'GTC':'Green', 'GTA':'Green', 'GTG':'Green',
    'TCT':'Blue', 'TCC':'Blue', 'TCA':'Blue', 'TCG':'Blue', 'ACT':'Blue', 'ACC':'Blue', 'ACA':'Blue',
    'CCT':'Cyan', 'CCA':'Cyan', 'CCG':'Cyan', 'GCT':'Cyan', 'GCC':'Cyan', 'GCA':'Cyan', 'GCG':'Cyan',
    'CAA':'Yellow', 'CAG':'Yellow', 'AAT':'Yellow', 'AAC':'Yellow', 'AGT':'Yellow', 'AGC':'Yellow', 'ACG':'Yellow',
    'GAT':'Magenta', 'GAC':'Magenta', 'GAA':'Magenta', 'GAG':'Magenta', 'GGT':'Magenta', 'GGC':'Magenta', 'GGA':'Magenta',
    'CGT':'White', 'CGC':'White', 'CGA':'White', 'CGG':'White', 'AGA':'White', 'AGG':'White', 'AAG':'White',
    'TAT':'Black', 'TAC':'Black', 'CAT':'Black', 'CAC':'Black', 'AAA':'Black', 'TGT':'Black', 'TGC':'Black',
    'TGG':'Black', 'GGG':'Black', 'TTT':'Black', 'CCC':'Black'
}

### Section 2.2: Generating a test signal sequence

In [68]:
##We can generate a test signal sequence by either:
# randomly sampling codon tags
# or, randomly sampling colours to make a pattern and sampling tags from the colours in that order

In [69]:
#Sampling multiple keys (tags) or vaules (colours) at once and storing the results

#First we need to make the keys (codon tags) of the dictionary into a list
dict_two_key_list = list(dict_two.keys())

In [70]:
#Make empty list for the results of the for loop
random_tags_two = []
#The for loop
for i in np.arange(34):
    #do one random sample of 34 tags and put into list above
    random_tags_two.append(np.random.choice(dict_two_key_list))
#viewing the list of randomly sampled codon tags
print(random_tags_two)

['GCA', 'ATC', 'ACC', 'GGC', 'AAG', 'CGG', 'CCT', 'TAT', 'CAT', 'GTT', 'GCG', 'CGA', 'GTG', 'ATA', 'TCC', 'CAG', 'ACG', 'CCC', 'ACT', 'TCT', 'GAG', 'CAG', 'TGG', 'ATA', 'CAT', 'CTC', 'AAA', 'AAA', 'ATT', 'CCC', 'CGT', 'GGA', 'GCA', 'TTG']


In [71]:
#To randomly sample multiple colours e.g. to make a pattern
# I randomly sampled key, value pairs to get both colours and their tags
# then just keep the keys in the order they were sampled in

import random 

#randomly sample key: val pairs so we know the pattern of colours
random_pairs_two = random.sample(list(dict_two.items()), 34)
print(random_pairs_two)

#extracting just the keys (tags) into a list
random_tags_two_2 = [k for k, v in random_pairs_two]
print(random_tags_two_2)

[('AGT', 'Yellow'), ('AGC', 'Yellow'), ('GTA', 'Green'), ('TGG', 'Black'), ('AGA', 'White'), ('CCG', 'Cyan'), ('GGA', 'Magenta'), ('GGG', 'Black'), ('TAT', 'Black'), ('CAC', 'Black'), ('TTC', 'Red'), ('TCG', 'Blue'), ('CAA', 'Yellow'), ('CGT', 'White'), ('GTC', 'Green'), ('TCT', 'Blue'), ('CAG', 'Yellow'), ('GCC', 'Cyan'), ('CTG', 'Red'), ('GGT', 'Magenta'), ('TCC', 'Blue'), ('GGC', 'Magenta'), ('TGC', 'Black'), ('TTG', 'Red'), ('CGC', 'White'), ('TCA', 'Blue'), ('TGT', 'Black'), ('ACT', 'Blue'), ('CGA', 'White'), ('GTT', 'Green'), ('CTC', 'Red'), ('CTA', 'Red'), ('AAT', 'Yellow'), ('GAA', 'Magenta')]
['AGT', 'AGC', 'GTA', 'TGG', 'AGA', 'CCG', 'GGA', 'GGG', 'TAT', 'CAC', 'TTC', 'TCG', 'CAA', 'CGT', 'GTC', 'TCT', 'CAG', 'GCC', 'CTG', 'GGT', 'TCC', 'GGC', 'TGC', 'TTG', 'CGC', 'TCA', 'TGT', 'ACT', 'CGA', 'GTT', 'CTC', 'CTA', 'AAT', 'GAA']


### Section 2.3: Putting the random samples together into sequences

In [72]:
#Now, we can put all the randomly sampled codon tags
# together into one long sequence
random_seq_two = ''.join(map(str,random_tags_two))
print(random_seq_two)

#And also get the nucleotide length of the codon tag sequence
length = len(random_seq_two)
print("Sequence length: " + str(length))

GCAATCACCGGCAAGCGGCCTTATCATGTTGCGCGAGTGATATCCCAGACGCCCACTTCTGAGCAGTGGATACATCTCAAAAAAATTCCCCGTGGAGCATTG
Sequence length: 102


In [73]:
#Now, put together randomly sampled codon tags
# that were sampled from key:value pairs
random_seq_two_2 = ''.join(map(str,random_tags_two_2))
print(random_seq_two_2)

#Get nucleotide length of sequence
length = len(random_seq_two_2)
print("Sequence length: " + str(length))

AGTAGCGTATGGAGACCGGGAGGGTATCACTTCTCGCAACGTGTCTCTCAGGCCCTGGGTTCCGGCTGCTTGCGCTCATGTACTCGAGTTCTCCTAAATGAA
Sequence length: 102


### Section 2.4: Analysing sequences for repeats/ repetitiveness

In [132]:
#By identifying regular expressions

#Import the reg. exp. modules
import re

#For the first sequence (random_seq_two)
print(random_seq_two)
#Finds all the repeats at once
matches = re.finditer(r"(AA){2,}|(AT){2,}|(AG){2,}|(AC){2,}|(TT){2,}|(TG){2,}|(TC){2,}|(TA){2,}|(GC){2,}|(GG){2,}|\
                      (GT){2,}|(GA){2,}|(CA){2,}|(CT){2,}|(CG){2,}|(CC){2,}", random_seq_two)
#Makes empty list for results  
all_repeats_two = []
for m in matches:
    all_repeats_two.append(m.group()) #puts all the repeats found together in one list

#prints the list of repeats found
print(all_repeats_two)


#Doing the same for the second sequence (random_seq_two_2)
print(random_seq_two_2)
#Finds all the repeats at once
matches = re.finditer(r"(AA){2,}|(AT){2,}|(AG){2,}|(AC){2,}|(TT){2,}|(TG){2,}|(TC){2,}|(TA){2,}|(GC){2,}|(GG){2,}|\
                      (GT){2,}|(GA){2,}|(CA){2,}|(CT){2,}|(CG){2,}|(CC){2,}", random_seq_two_2)
#Makes empty list for results of second sequence (random_seq_one_2) 
all_repeats_two_2 = []
for m in matches:
    all_repeats_two_2.append(m.group()) #puts all the repeats found together in one list

#prints the list of repeats found
print(all_repeats_two_2)

GCAATCACCGGCAAGCGGCCTTATCATGTTGCGCGAGTGATATCCCAGACGCCCACTTCTGAGCAGTGGATACATCTCAAAAAAATTCCCCGTGGAGCATTG
['GCGC', 'ATAT', 'TCTC', 'AAAAAA', 'CCCC']
AGTAGCGTATGGAGACCGGGAGGGTATCACTTCTCGCAACGTGTCTCTCAGGCCCTGGGTTCCGGCTGCTTGCGCTCATGTACTCGAGTTCTCCTAAATGAA
['GAGA', 'TCTC', 'TCTCTC', 'GCGC', 'TCTC']


### Section 2.5: Counting number of repeat sequences

In [75]:
#For the first sequence; random_seq_two
repeat_total_two = len(all_repeats_two)
print("Total repeats: " + str(repeat_total_two))

#For the second sequence; random_seq_two_2
repeat_total_two_2 = len(all_repeats_two_2)
print("Total repeats: " + str(repeat_total_two_2))

Total repeats: 5
Total repeats: 6


In [133]:
#Now we know the total number of repeats present in each sequence
#We might want to know where they are in the sequences

#For the first sequence; random_seq_two
#This is the sequence
print(random_seq_two)

#making empty list for positions
pos_list_two = []

#Finding positions of repeats
matches = re.finditer(r"(AA){2,}|(AT){2,}|(AG){2,}|(AC){2,}|(TT){2,}|(TG){2,}|(TC){2,}|(TA){2,}|(GC){2,}|(GG){2,}|\
                      (GT){2,}|(GA){2,}|(CA){2,}|(CT){2,}|(CG){2,}|(CC){2,}", random_seq_two)
for m in matches:
    repeat = m.group()
    pos_two = m.start()
    pos_list_two.append(pos_two)
    print(repeat + " found at position " + str(pos_two))

print(pos_list_two)

GCAATCACCGGCAAGCGGCCTTATCATGTTGCGCGAGTGATATCCCAGACGCCCACTTCTGAGCAGTGGATACATCTCAAAAAAATTCCCCGTGGAGCATTG
GCGC found at position 30
ATAT found at position 39
TCTC found at position 74
AAAAAA found at position 78
CCCC found at position 87
[30, 39, 74, 78, 87]


In [77]:
#If we want to split the DNA sequence at every point a repeat is present
#We can use the list of positions made earlier

#this O below is added to the list so that it goes from (first pos., second pos.) to (second pos., third pos.) etc.
pos_list_two = [0] + pos_list_two + [len(random_seq_two)]
[random_seq_two[x:y] for x,y in zip(pos_list_two, pos_list_two[1:])]

['GCAATCACCGGCAAGCGGCCTTATCATGTT',
 'GCGCGAGTG',
 'ATATCCCAGACGCCCACTTCTGAGCAGTGGATACA',
 'TCTC',
 'AAAAAAATT',
 'CCCCGTGGAGCATTG']

In [134]:
#For the second sequence; random_seq_two_2
#This is the sequence
print(random_seq_two_2)

#making empty list for positions
pos_list_two_2 = []

#Finding positions of repeats
matches = re.finditer(r"(AA){2,}|(AT){2,}|(AG){2,}|(AC){2,}|(TT){2,}|(TG){2,}|(TC){2,}|(TA){2,}|(GC){2,}|(GG){2,}|\
                      (GT){2,}|(GA){2,}|(CA){2,}|(CT){2,}|(CG){2,}|(CC){2,}", random_seq_two_2)
for m in matches:
    repeat = m.group()
    pos_two_2 = m.start()
    pos_list_two_2.append(pos_two_2)
    print(repeat + " found at position " + str(pos_two_2))

print(pos_list_two_2)

AGTAGCGTATGGAGACCGGGAGGGTATCACTTCTCGCAACGTGTCTCTCAGGCCCTGGGTTCCGGCTGCTTGCGCTCATGTACTCGAGTTCTCCTAAATGAA
GAGA found at position 11
TCTC found at position 31
TCTCTC found at position 43
GCGC found at position 71
TCTC found at position 89
[11, 31, 43, 71, 89]


In [79]:
#If we want to split the DNA sequence at every point a repeat is present
#We can use the list of positions made earlier

#this O below is added to the list so that it goes from (first pos., second pos.) to (second pos., third pos.) etc.
pos_list_two_2 = [0] + pos_list_two_2 + [len(random_seq_two_2)]
[random_seq_two_2[x:y] for x,y in zip(pos_list_two_2, pos_list_two_2[1:])]

['AGTAGCGTATG',
 'GAGACCGGGAGGGTATCACT',
 'TCTCGCAAC',
 'GTGT',
 'CTCTCAGGCCCTGGGTTCCGGCTGCTT',
 'GCGCTCATGTACTCGAGT',
 'TCTCCTAAATGAA']

### Section 2.6: Summarising results: sequence length & repeat count

In [80]:
#First, counting number of each individual repeat
#counts how many times each of the found repeat seq is present in the entire sequence

####
#First sequence: random_seq_two
import collections 

repeat_counts_two = collections.Counter(all_repeats_two)
print(repeat_counts_two)


#Nucleotide length of the sequence
nt_length = len(random_seq_two)
print("Sequence length: " + str(nt_length))

#Total number of repeats in sequence
repeat_total_two = len(all_repeats_two)
print("Total repeats: " + str(repeat_total_two))


####
#Second sequence: random_seq_two_2
import collections

repeat_counts_two_2 = collections.Counter(all_repeats_two_2)
print(repeat_counts_two_2)


#Nucleotide length of the sequence
nt_length = len(random_seq_two_2)
print("Sequence length: " + str(nt_length))

#Total number of repeats in sequence
repeat_total_two_2 = len(all_repeats_two_2)
print("Total repeats: " + str(repeat_total_two_2))

Counter({'GCGC': 1, 'ATAT': 1, 'TCTC': 1, 'AAAAAA': 1, 'CCCC': 1})
Sequence length: 102
Total repeats: 5
Counter({'TCTC': 2, 'GAGA': 1, 'GTGT': 1, 'CTCT': 1, 'GCGC': 1})
Sequence length: 102
Total repeats: 6


In [81]:
#Other things we might want to know about our randomly generated sequences

#AT/ GC content?

#Function to see if sequence is AT-rich (e.g., rich if AT content is >0.65)
#First sequence: random_seq_two
def AT_rich(dna):
    length = len(dna)
    a_count = dna.upper().count('A')
    t_count = dna.upper().count('T')
    at_content = (a_count + t_count) / length
    if at_content > 0.65:
        return True
    else:
        return False


#Function to see if GC-rich (> 0.65)
#Writing the function:
def GC_rich(dna):
    length = len(dna)
    g_count = dna.upper().count('G')
    c_count = dna.upper().count('C')
    gc_content = (g_count + c_count) / length
    if gc_content > 0.65:
        return True
    else:
        return False
    
    
#Testing AT content of first sequence
print(random_seq_two)
print(AT_rich(random_seq_two))

print(random_seq_two_2)
print(AT_rich(random_seq_two_2))

#Testing GC content of first sequence
print(random_seq_two)
print(GC_rich(random_seq_two))

print(random_seq_two_2)
print(GC_rich(random_seq_two_2))

GCAATCACCGGCAAGCGGCCTTATCATGTTGCGCGAGTGATATCCCAGACGCCCACTTCTGAGCAGTGGATACATCTCAAAAAAATTCCCCGTGGAGCATTG
False
AGTAGCGTATGGAGACCGGGAGGGTATCACTTCTCGCAACGTGTCTCTCAGGCCCTGGGTTCCGGCTGCTTGCGCTCATGTACTCGAGTTCTCCTAAATGAA
False
GCAATCACCGGCAAGCGGCCTTATCATGTTGCGCGAGTGATATCCCAGACGCCCACTTCTGAGCAGTGGATACATCTCAAAAAAATTCCCCGTGGAGCATTG
False
AGTAGCGTATGGAGACCGGGAGGGTATCACTTCTCGCAACGTGTCTCTCAGGCCCTGGGTTCCGGCTGCTTGCGCTCATGTACTCGAGTTCTCCTAAATGAA
False


### Section 2.7: Randomly sampling multiple tags at once to create a library of sequences

In [82]:
############### 
# CHANGE THE NUMBERS IN THE FOLLOWING CODE TO CHANGE HOW MANY SEQUENCES/ WHAT LENGTH SEQUENCES YOU WANT TO GENERATE

#What if we want to sample a lot at once, and store the results?
import numpy as np

#First we need to make the keys (codon tags) of the dictionary into a list
dict_two_key_list = list(dict_two.keys())

#Making libraries for storing the many randomly generated codon tag sequences
#For 50 random sequences:
random_tags_library_two_50 = []
#For 100:
random_tags_library_two_100 = []
#For 1000:
random_tags_library_two_1000 = []

#This for loop
# generates 50 sequences, of 34 tags each (102nt long)
for x in np.arange(50): #do the loop 50 times
    for i in np.arange(34): #each time, generate 34 random tags
        #put tags into a library
        random_tags_library_two_50.append(np.random.choice(dict_two_key_list))
print(random_tags_library_two_50)

['GAG', 'ATA', 'GAA', 'AGA', 'TCG', 'CGC', 'TCC', 'CAA', 'GCG', 'CGC', 'CCA', 'AAC', 'ACA', 'CGG', 'CCT', 'ATC', 'CAA', 'CAA', 'AAC', 'AGG', 'GTT', 'TTC', 'ATT', 'CGG', 'TTG', 'GAG', 'ATT', 'AAC', 'GAC', 'CCC', 'TGG', 'GCC', 'GAG', 'GCG', 'CTC', 'GAA', 'CGC', 'TTG', 'ACT', 'GAG', 'TGT', 'CCA', 'GGT', 'TTT', 'AGA', 'GGA', 'CTG', 'AAC', 'AGT', 'ATA', 'ACG', 'ATT', 'ATA', 'ACG', 'GAG', 'CGT', 'ACC', 'CCC', 'AAG', 'ATA', 'TCA', 'AAG', 'TCT', 'TGT', 'GGG', 'TGC', 'CAC', 'TGC', 'GGA', 'ACA', 'TGG', 'ACG', 'CCG', 'ACG', 'GCT', 'CTG', 'CAA', 'CTA', 'TTG', 'ACC', 'GAT', 'GTC', 'GTC', 'AGA', 'CAA', 'ATC', 'TGC', 'GCC', 'TCT', 'GGT', 'TCC', 'ACC', 'TTT', 'GGC', 'AAG', 'TGC', 'CGC', 'TCC', 'CAA', 'GAT', 'AGT', 'GGA', 'TCA', 'CCC', 'GTT', 'AGC', 'CGG', 'TTG', 'CTG', 'ATT', 'ACC', 'ATC', 'ACG', 'GGA', 'CGG', 'CCA', 'GTA', 'CAA', 'CCT', 'TAT', 'GAT', 'CAG', 'CAG', 'CGA', 'TTC', 'AAG', 'TGG', 'CTG', 'GGT', 'GCT', 'AGC', 'AGT', 'GCT', 'TCC', 'GGA', 'TCT', 'GTG', 'CCT', 'GTT', 'TTG', 'CAG', 'AAT', 'TTG'

In [83]:
#This for loop
# generates 100 sequences, of 34 tags each (102nt long)
for x in np.arange(100): #do the loop 100 times
    for i in np.arange(34): #each time, generate 34 random tags
        #put tags into a library
        random_tags_library_two_100.append(np.random.choice(dict_two_key_list))
print(random_tags_library_two_100)

['TTG', 'CTG', 'GGT', 'GTT', 'ACG', 'CCG', 'ACG', 'AAA', 'CTT', 'GGC', 'AGC', 'AAT', 'GTT', 'TGC', 'ACA', 'AAA', 'ACC', 'CAC', 'TAT', 'GCG', 'ATC', 'GGG', 'ATC', 'GAC', 'CAC', 'ACG', 'CTA', 'GGA', 'AGC', 'CGT', 'TGT', 'GCG', 'ACC', 'CAA', 'CTA', 'GAT', 'CAA', 'GTT', 'TGC', 'CTA', 'GCG', 'GAT', 'GGC', 'GGG', 'ACG', 'CCA', 'TTA', 'GAC', 'CCC', 'GGG', 'CTT', 'CAC', 'CTA', 'ATC', 'AAT', 'GCC', 'CTC', 'GTT', 'CGC', 'TGT', 'TTC', 'CTC', 'GCA', 'TGC', 'TCT', 'CCC', 'AGA', 'CGA', 'CGA', 'CAA', 'AGC', 'AAC', 'AGC', 'GAG', 'AGG', 'AGT', 'GTC', 'GAG', 'ATT', 'GCT', 'CGC', 'CCA', 'TCA', 'CAA', 'CGT', 'AGC', 'ACA', 'GAT', 'GCA', 'TAC', 'CCC', 'GTG', 'AAT', 'CTC', 'CGT', 'TCA', 'AAC', 'CAC', 'CCC', 'ATA', 'AGA', 'TGC', 'ATT', 'CGC', 'CAG', 'TGC', 'GCT', 'CGA', 'GAA', 'GTG', 'AGC', 'GAA', 'GAA', 'GGA', 'TTC', 'AAC', 'GTA', 'CCT', 'CCG', 'TCG', 'TGG', 'TAT', 'AAT', 'AGT', 'CGG', 'TTC', 'CGC', 'TTC', 'AAT', 'GAG', 'AGG', 'GAC', 'ACG', 'GAT', 'TAT', 'CCC', 'ATT', 'CAG', 'GTG', 'AGA', 'CGA', 'TGC', 'TTT'

In [84]:
#This for loop
# generates 1000 sequences, of 34 tags each (102nt long)
for x in np.arange(1000): #do the loop 1000 times
    for i in np.arange(34): #each time, generate 34 random tags
        #put tags into a library
        random_tags_library_two_1000.append(np.random.choice(dict_two_key_list))
print(random_tags_library_two_1000)

['CGC', 'AAT', 'CGC', 'TTA', 'CCC', 'CAT', 'CAG', 'CCT', 'TTA', 'GCC', 'CAC', 'CTC', 'CAG', 'TAT', 'TCT', 'TCA', 'TAT', 'GAG', 'TTT', 'CCG', 'TGG', 'AGA', 'CTG', 'AAA', 'GAG', 'GTG', 'GAT', 'AAG', 'CGC', 'CAA', 'GCG', 'CAT', 'AAA', 'CTC', 'GGG', 'CCA', 'ACT', 'GAA', 'TGT', 'TGG', 'AAG', 'GGC', 'TTC', 'GTT', 'CAG', 'AGG', 'TGG', 'ACG', 'CTC', 'CCA', 'ACC', 'CCT', 'TCG', 'CCC', 'AGA', 'AGC', 'CCA', 'CGA', 'AGC', 'TGG', 'CGT', 'AGT', 'ACT', 'CCT', 'TGC', 'CCT', 'CCT', 'ATT', 'GAC', 'CAC', 'CAA', 'CTC', 'AAG', 'ATA', 'CAC', 'AAT', 'GAA', 'CTC', 'TCA', 'TTT', 'CCG', 'AAA', 'CAG', 'AAG', 'CCC', 'CTG', 'GTC', 'GAA', 'CTG', 'CAC', 'AAC', 'ATA', 'CCT', 'CTC', 'AGT', 'CCG', 'GCG', 'ATC', 'ATT', 'CCG', 'ATT', 'GTT', 'AGA', 'CAT', 'GAC', 'GGC', 'TCT', 'ACC', 'CGA', 'GTC', 'AAG', 'TTT', 'GAG', 'ACT', 'TGT', 'CGA', 'CCT', 'CCA', 'GGC', 'TAT', 'GAG', 'AGG', 'GTT', 'CGC', 'GTC', 'TGG', 'CTT', 'AAT', 'CAA', 'TCA', 'CAC', 'CAC', 'CAC', 'ATT', 'GTG', 'GGC', 'CAT', 'GGG', 'GTC', 'CTT', 'GGT', 'GGA', 'TAC'

In [85]:
#splitting the massive lists in the libraries into smaller lists (sequences) of length 34
# creates a nested list/ lists of lists/ two-dimensional list

#For library of 50 sequences:
library_two_nested_50 = [random_tags_library_two_50[x:x+34] for x in range(0, len(random_tags_library_two_50), 34)]
print(library_two_nested_50) #prints the nested list (lists w/in list)

[['GAG', 'ATA', 'GAA', 'AGA', 'TCG', 'CGC', 'TCC', 'CAA', 'GCG', 'CGC', 'CCA', 'AAC', 'ACA', 'CGG', 'CCT', 'ATC', 'CAA', 'CAA', 'AAC', 'AGG', 'GTT', 'TTC', 'ATT', 'CGG', 'TTG', 'GAG', 'ATT', 'AAC', 'GAC', 'CCC', 'TGG', 'GCC', 'GAG', 'GCG'], ['CTC', 'GAA', 'CGC', 'TTG', 'ACT', 'GAG', 'TGT', 'CCA', 'GGT', 'TTT', 'AGA', 'GGA', 'CTG', 'AAC', 'AGT', 'ATA', 'ACG', 'ATT', 'ATA', 'ACG', 'GAG', 'CGT', 'ACC', 'CCC', 'AAG', 'ATA', 'TCA', 'AAG', 'TCT', 'TGT', 'GGG', 'TGC', 'CAC', 'TGC'], ['GGA', 'ACA', 'TGG', 'ACG', 'CCG', 'ACG', 'GCT', 'CTG', 'CAA', 'CTA', 'TTG', 'ACC', 'GAT', 'GTC', 'GTC', 'AGA', 'CAA', 'ATC', 'TGC', 'GCC', 'TCT', 'GGT', 'TCC', 'ACC', 'TTT', 'GGC', 'AAG', 'TGC', 'CGC', 'TCC', 'CAA', 'GAT', 'AGT', 'GGA'], ['TCA', 'CCC', 'GTT', 'AGC', 'CGG', 'TTG', 'CTG', 'ATT', 'ACC', 'ATC', 'ACG', 'GGA', 'CGG', 'CCA', 'GTA', 'CAA', 'CCT', 'TAT', 'GAT', 'CAG', 'CAG', 'CGA', 'TTC', 'AAG', 'TGG', 'CTG', 'GGT', 'GCT', 'AGC', 'AGT', 'GCT', 'TCC', 'GGA', 'TCT'], ['GTG', 'CCT', 'GTT', 'TTG', 'CAG', 'AA

In [86]:
#For library of 100 sequences:
library_two_nested_100 = [random_tags_library_two_100[x:x+34] for x in range(0, len(random_tags_library_two_100), 34)]
print(library_two_nested_100) #prints the nested list (lists w/in list)

[['TTG', 'CTG', 'GGT', 'GTT', 'ACG', 'CCG', 'ACG', 'AAA', 'CTT', 'GGC', 'AGC', 'AAT', 'GTT', 'TGC', 'ACA', 'AAA', 'ACC', 'CAC', 'TAT', 'GCG', 'ATC', 'GGG', 'ATC', 'GAC', 'CAC', 'ACG', 'CTA', 'GGA', 'AGC', 'CGT', 'TGT', 'GCG', 'ACC', 'CAA'], ['CTA', 'GAT', 'CAA', 'GTT', 'TGC', 'CTA', 'GCG', 'GAT', 'GGC', 'GGG', 'ACG', 'CCA', 'TTA', 'GAC', 'CCC', 'GGG', 'CTT', 'CAC', 'CTA', 'ATC', 'AAT', 'GCC', 'CTC', 'GTT', 'CGC', 'TGT', 'TTC', 'CTC', 'GCA', 'TGC', 'TCT', 'CCC', 'AGA', 'CGA'], ['CGA', 'CAA', 'AGC', 'AAC', 'AGC', 'GAG', 'AGG', 'AGT', 'GTC', 'GAG', 'ATT', 'GCT', 'CGC', 'CCA', 'TCA', 'CAA', 'CGT', 'AGC', 'ACA', 'GAT', 'GCA', 'TAC', 'CCC', 'GTG', 'AAT', 'CTC', 'CGT', 'TCA', 'AAC', 'CAC', 'CCC', 'ATA', 'AGA', 'TGC'], ['ATT', 'CGC', 'CAG', 'TGC', 'GCT', 'CGA', 'GAA', 'GTG', 'AGC', 'GAA', 'GAA', 'GGA', 'TTC', 'AAC', 'GTA', 'CCT', 'CCG', 'TCG', 'TGG', 'TAT', 'AAT', 'AGT', 'CGG', 'TTC', 'CGC', 'TTC', 'AAT', 'GAG', 'AGG', 'GAC', 'ACG', 'GAT', 'TAT', 'CCC'], ['ATT', 'CAG', 'GTG', 'AGA', 'CGA', 'TG

In [87]:
#For library of 1000 sequences:
library_two_nested_1000 = [random_tags_library_two_1000[x:x+34] for x in range(0, len(random_tags_library_two_1000), 34)]
print(library_two_nested_1000) #prints the nested list (lists w/in list)

[['CGC', 'AAT', 'CGC', 'TTA', 'CCC', 'CAT', 'CAG', 'CCT', 'TTA', 'GCC', 'CAC', 'CTC', 'CAG', 'TAT', 'TCT', 'TCA', 'TAT', 'GAG', 'TTT', 'CCG', 'TGG', 'AGA', 'CTG', 'AAA', 'GAG', 'GTG', 'GAT', 'AAG', 'CGC', 'CAA', 'GCG', 'CAT', 'AAA', 'CTC'], ['GGG', 'CCA', 'ACT', 'GAA', 'TGT', 'TGG', 'AAG', 'GGC', 'TTC', 'GTT', 'CAG', 'AGG', 'TGG', 'ACG', 'CTC', 'CCA', 'ACC', 'CCT', 'TCG', 'CCC', 'AGA', 'AGC', 'CCA', 'CGA', 'AGC', 'TGG', 'CGT', 'AGT', 'ACT', 'CCT', 'TGC', 'CCT', 'CCT', 'ATT'], ['GAC', 'CAC', 'CAA', 'CTC', 'AAG', 'ATA', 'CAC', 'AAT', 'GAA', 'CTC', 'TCA', 'TTT', 'CCG', 'AAA', 'CAG', 'AAG', 'CCC', 'CTG', 'GTC', 'GAA', 'CTG', 'CAC', 'AAC', 'ATA', 'CCT', 'CTC', 'AGT', 'CCG', 'GCG', 'ATC', 'ATT', 'CCG', 'ATT', 'GTT'], ['AGA', 'CAT', 'GAC', 'GGC', 'TCT', 'ACC', 'CGA', 'GTC', 'AAG', 'TTT', 'GAG', 'ACT', 'TGT', 'CGA', 'CCT', 'CCA', 'GGC', 'TAT', 'GAG', 'AGG', 'GTT', 'CGC', 'GTC', 'TGG', 'CTT', 'AAT', 'CAA', 'TCA', 'CAC', 'CAC', 'CAC', 'ATT', 'GTG', 'GGC'], ['CAT', 'GGG', 'GTC', 'CTT', 'GGT', 'GG

### Section 2.8: Putting the randomly generated tags into sequences

In [88]:
#Putting all the tags in the indiv. libraries together into sequences
#using the nested list

#For the library of 50 sequences:
library_two_50 = [''.join(l) for l in library_two_nested_50]
print(library_two_50)

['GAGATAGAAAGATCGCGCTCCCAAGCGCGCCCAAACACACGGCCTATCCAACAAAACAGGGTTTTCATTCGGTTGGAGATTAACGACCCCTGGGCCGAGGCG', 'CTCGAACGCTTGACTGAGTGTCCAGGTTTTAGAGGACTGAACAGTATAACGATTATAACGGAGCGTACCCCCAAGATATCAAAGTCTTGTGGGTGCCACTGC', 'GGAACATGGACGCCGACGGCTCTGCAACTATTGACCGATGTCGTCAGACAAATCTGCGCCTCTGGTTCCACCTTTGGCAAGTGCCGCTCCCAAGATAGTGGA', 'TCACCCGTTAGCCGGTTGCTGATTACCATCACGGGACGGCCAGTACAACCTTATGATCAGCAGCGATTCAAGTGGCTGGGTGCTAGCAGTGCTTCCGGATCT', 'GTGCCTGTTTTGCAGAATTTGGTAACGATTATTATTTGTGTTAAGGATTACTCCCACTCGGAACTGCATACCAAGCTCTGTGGCCTCCCAGGGGGCAAACCC', 'CGAAAGACGCTTCGTGAATTTCAGGTTGATAGAGCCCTGGCAGTAGTTGGGTTTCAATACGTACCAGCGTCGTCGATAACCAGGCAAAGTACGGCTTTCTCG', 'AAATTTCGACATACGGGTGTACCTAGCTTACTGGGCGGGCGAGCTAGTCCTGACATCTCGCCATTCCTGTATACACTAGGTCGGTTATCGTCTCACAGCACT', 'ACCGAGTCCACCTCTCGACGCCCCCATTGCACACGTCGTTCAACACCGAAGCTGCGCGGACCGATCTATCGACTGTGTGCAACTGTGTTAGAAGTCGTGCTC', 'ACAGCACGGCCTGGACGTCCTGTAGCAGTCTATGAGACAATTTACTATCAAAACTACAAGCAGACCAACTTACTCCAGCCCTTGAATATACAGGATCGGACG', 'GACAGTGCTCGTCACCCCGACGATTCGATAGCTTTAGGGGAACT

In [89]:
#For the library of 100 sequences:
library_two_100 = [''.join(l) for l in library_two_nested_100]
print(library_two_100)

['TTGCTGGGTGTTACGCCGACGAAACTTGGCAGCAATGTTTGCACAAAAACCCACTATGCGATCGGGATCGACCACACGCTAGGAAGCCGTTGTGCGACCCAA', 'CTAGATCAAGTTTGCCTAGCGGATGGCGGGACGCCATTAGACCCCGGGCTTCACCTAATCAATGCCCTCGTTCGCTGTTTCCTCGCATGCTCTCCCAGACGA', 'CGACAAAGCAACAGCGAGAGGAGTGTCGAGATTGCTCGCCCATCACAACGTAGCACAGATGCATACCCCGTGAATCTCCGTTCAAACCACCCCATAAGATGC', 'ATTCGCCAGTGCGCTCGAGAAGTGAGCGAAGAAGGATTCAACGTACCTCCGTCGTGGTATAATAGTCGGTTCCGCTTCAATGAGAGGGACACGGATTATCCC', 'ATTCAGGTGAGACGATGCTTTAGGGTACCTATTTCTATTATTCGGCACTCCAAGTTTTTGATTACGTTATCTTCCTTAGTTAGTTCACGTTTAGAAAAGGCT', 'AGCTCGTCATCGGCAACATATACACGGGCGCCGCTGTTAACGGCACCACGGCAGGAAGACGCATGCCCTCAGGATATTTGGGTTATACATGCGCCGATCGGA', 'GTATCTCCATGCAGAACCATCATCAGGAGGCACAACCGAATTACGGTATTTTCACGACGGGGTGTCCGTATCGATATAAAGCTGGGATGCGGGGGAAGCCGC', 'GCTCGACTAGGCGGCATTTTACCAGCCTCGAAGTGGCTTAGACCGGAGCTCCTTGAACGCGTTAACATTCCTGGTTCGAGCGCGACCGCTGCGCCCGTCCTG', 'CAATTATCACCTAGTTCGGCATTCGAGGTCGTCGTGCGCGACGGGCGGTGCTGTGTCATTCAGCGTACACAACTGATTCGGTTATGCTGGCTCGTACTTATA', 'GCACCTACTATTGCAGGACCGGACTCACCATTTTCTGAATGTTC

In [90]:
#For the library of 1000 sequences:
library_two_1000 = [''.join(l) for l in library_two_nested_1000]
print(library_two_1000)

['CGCAATCGCTTACCCCATCAGCCTTTAGCCCACCTCCAGTATTCTTCATATGAGTTTCCGTGGAGACTGAAAGAGGTGGATAAGCGCCAAGCGCATAAACTC', 'GGGCCAACTGAATGTTGGAAGGGCTTCGTTCAGAGGTGGACGCTCCCAACCCCTTCGCCCAGAAGCCCACGAAGCTGGCGTAGTACTCCTTGCCCTCCTATT', 'GACCACCAACTCAAGATACACAATGAACTCTCATTTCCGAAACAGAAGCCCCTGGTCGAACTGCACAACATACCTCTCAGTCCGGCGATCATTCCGATTGTT', 'AGACATGACGGCTCTACCCGAGTCAAGTTTGAGACTTGTCGACCTCCAGGCTATGAGAGGGTTCGCGTCTGGCTTAATCAATCACACCACCACATTGTGGGC', 'CATGGGGTCCTTGGTGGATACCAGTCCGTCCGGGGCTCGACAGCTGTCGTCACCGAGTGCATCGCATACAACCGCGCGCCAAGGTTGTGGCATGGCTGTGTG', 'GTAGCAGTGGTCTTCAGGTGTGTAACAGAGAATCTCAAATTCGTCATTTCAGAAGCACAAAGTCGTAAATGTATTAGCCAAATCGGCGGCAACCGATTCACC', 'CCATCATTAACTCCAATCGTCATAGCGCCAGCTCGCCTTTGCCTCTACGATCAAGAATCATGCACTTCGGTTGTGGAGGCTTTCGAACTCGACCTTTCCAGG', 'ACCATTGAGGATTTCTTTACACGATTGGTCCCCGTGCCAGGTATCATTGGGGCCTTTTATGAGAGCTTTTCGGTGTGTGACCTACTACATGCTGGTCAAGTC', 'AACAGTGTCGAGGGAGGATTGCGACCACACGACAGAAGTATCGATATTGAGCTGACTGGGACCAATCAGTCTGCGCTGTTGTTAAAGCTTACGAAACAAGTG', 'GCCTGCCTATCCGACAATGAACGACCACAGGCTGAGACAGAATT

In [91]:
#Identifying the indeces for the sequences in the libraries

#For the library of 50 sequences:
library_two_50_index = enumerate(library_two_50)
print(list(library_two_50_index))

#If we want an indiv. seq. from the library, we call the index of that seq.
print(library_two_50[3])

[(0, 'GAGATAGAAAGATCGCGCTCCCAAGCGCGCCCAAACACACGGCCTATCCAACAAAACAGGGTTTTCATTCGGTTGGAGATTAACGACCCCTGGGCCGAGGCG'), (1, 'CTCGAACGCTTGACTGAGTGTCCAGGTTTTAGAGGACTGAACAGTATAACGATTATAACGGAGCGTACCCCCAAGATATCAAAGTCTTGTGGGTGCCACTGC'), (2, 'GGAACATGGACGCCGACGGCTCTGCAACTATTGACCGATGTCGTCAGACAAATCTGCGCCTCTGGTTCCACCTTTGGCAAGTGCCGCTCCCAAGATAGTGGA'), (3, 'TCACCCGTTAGCCGGTTGCTGATTACCATCACGGGACGGCCAGTACAACCTTATGATCAGCAGCGATTCAAGTGGCTGGGTGCTAGCAGTGCTTCCGGATCT'), (4, 'GTGCCTGTTTTGCAGAATTTGGTAACGATTATTATTTGTGTTAAGGATTACTCCCACTCGGAACTGCATACCAAGCTCTGTGGCCTCCCAGGGGGCAAACCC'), (5, 'CGAAAGACGCTTCGTGAATTTCAGGTTGATAGAGCCCTGGCAGTAGTTGGGTTTCAATACGTACCAGCGTCGTCGATAACCAGGCAAAGTACGGCTTTCTCG'), (6, 'AAATTTCGACATACGGGTGTACCTAGCTTACTGGGCGGGCGAGCTAGTCCTGACATCTCGCCATTCCTGTATACACTAGGTCGGTTATCGTCTCACAGCACT'), (7, 'ACCGAGTCCACCTCTCGACGCCCCCATTGCACACGTCGTTCAACACCGAAGCTGCGCGGACCGATCTATCGACTGTGTGCAACTGTGTTAGAAGTCGTGCTC'), (8, 'ACAGCACGGCCTGGACGTCCTGTAGCAGTCTATGAGACAATTTACTATCAAAACTACAAGCAGACCAACTTACTCCAGCCCTTGAATATACAGGATCGGACG'), 

In [92]:
#For the library of 100 sequences:
library_two_100_index = enumerate(library_two_100)
print(list(library_two_100_index))

[(0, 'TTGCTGGGTGTTACGCCGACGAAACTTGGCAGCAATGTTTGCACAAAAACCCACTATGCGATCGGGATCGACCACACGCTAGGAAGCCGTTGTGCGACCCAA'), (1, 'CTAGATCAAGTTTGCCTAGCGGATGGCGGGACGCCATTAGACCCCGGGCTTCACCTAATCAATGCCCTCGTTCGCTGTTTCCTCGCATGCTCTCCCAGACGA'), (2, 'CGACAAAGCAACAGCGAGAGGAGTGTCGAGATTGCTCGCCCATCACAACGTAGCACAGATGCATACCCCGTGAATCTCCGTTCAAACCACCCCATAAGATGC'), (3, 'ATTCGCCAGTGCGCTCGAGAAGTGAGCGAAGAAGGATTCAACGTACCTCCGTCGTGGTATAATAGTCGGTTCCGCTTCAATGAGAGGGACACGGATTATCCC'), (4, 'ATTCAGGTGAGACGATGCTTTAGGGTACCTATTTCTATTATTCGGCACTCCAAGTTTTTGATTACGTTATCTTCCTTAGTTAGTTCACGTTTAGAAAAGGCT'), (5, 'AGCTCGTCATCGGCAACATATACACGGGCGCCGCTGTTAACGGCACCACGGCAGGAAGACGCATGCCCTCAGGATATTTGGGTTATACATGCGCCGATCGGA'), (6, 'GTATCTCCATGCAGAACCATCATCAGGAGGCACAACCGAATTACGGTATTTTCACGACGGGGTGTCCGTATCGATATAAAGCTGGGATGCGGGGGAAGCCGC'), (7, 'GCTCGACTAGGCGGCATTTTACCAGCCTCGAAGTGGCTTAGACCGGAGCTCCTTGAACGCGTTAACATTCCTGGTTCGAGCGCGACCGCTGCGCCCGTCCTG'), (8, 'CAATTATCACCTAGTTCGGCATTCGAGGTCGTCGTGCGCGACGGGCGGTGCTGTGTCATTCAGCGTACACAACTGATTCGGTTATGCTGGCTCGTACTTATA'), 

In [93]:
#For the library of 1000 sequences:
library_two_1000_index = enumerate(library_two_1000)
print(list(library_two_1000_index))

[(0, 'CGCAATCGCTTACCCCATCAGCCTTTAGCCCACCTCCAGTATTCTTCATATGAGTTTCCGTGGAGACTGAAAGAGGTGGATAAGCGCCAAGCGCATAAACTC'), (1, 'GGGCCAACTGAATGTTGGAAGGGCTTCGTTCAGAGGTGGACGCTCCCAACCCCTTCGCCCAGAAGCCCACGAAGCTGGCGTAGTACTCCTTGCCCTCCTATT'), (2, 'GACCACCAACTCAAGATACACAATGAACTCTCATTTCCGAAACAGAAGCCCCTGGTCGAACTGCACAACATACCTCTCAGTCCGGCGATCATTCCGATTGTT'), (3, 'AGACATGACGGCTCTACCCGAGTCAAGTTTGAGACTTGTCGACCTCCAGGCTATGAGAGGGTTCGCGTCTGGCTTAATCAATCACACCACCACATTGTGGGC'), (4, 'CATGGGGTCCTTGGTGGATACCAGTCCGTCCGGGGCTCGACAGCTGTCGTCACCGAGTGCATCGCATACAACCGCGCGCCAAGGTTGTGGCATGGCTGTGTG'), (5, 'GTAGCAGTGGTCTTCAGGTGTGTAACAGAGAATCTCAAATTCGTCATTTCAGAAGCACAAAGTCGTAAATGTATTAGCCAAATCGGCGGCAACCGATTCACC'), (6, 'CCATCATTAACTCCAATCGTCATAGCGCCAGCTCGCCTTTGCCTCTACGATCAAGAATCATGCACTTCGGTTGTGGAGGCTTTCGAACTCGACCTTTCCAGG'), (7, 'ACCATTGAGGATTTCTTTACACGATTGGTCCCCGTGCCAGGTATCATTGGGGCCTTTTATGAGAGCTTTTCGGTGTGTGACCTACTACATGCTGGTCAAGTC'), (8, 'AACAGTGTCGAGGGAGGATTGCGACCACACGACAGAAGTATCGATATTGAGCTGACTGGGACCAATCAGTCTGCGCTGTTGTTAAAGCTTACGAAACAAGTG'), 




### Section 2.9: Finding repeats within these sequences

In [135]:
#By identifying regular expressions
import re

#Finds repeats of just one sequence at a time
print(library_two_50[0])
matches = re.finditer(r"(AA){2,}|(AT){2,}|(AG){2,}|(AC){2,}|(TT){2,}|(TG){2,}|(TC){2,}|(TA){2,}|(GC){2,}|(GG){2,}|\
                      (GT){2,}|(GA){2,}|(CA){2,}|(CT){2,}|(CG){2,}|(CC){2,}", library_two_50[0])
repeats_two_50_seq0 = []
for m in matches:
    repeats_two_50_seq0.append(m.group())
print(repeats_two_50_seq0)

print(library_two_100[0])
matches = re.finditer(r"(AA){2,}|(AT){2,}|(AG){2,}|(AC){2,}|(TT){2,}|(TG){2,}|(TC){2,}|(TA){2,}|(GC){2,}|(GG){2,}|\
                      (GT){2,}|(GA){2,}|(CA){2,}|(CT){2,}|(CG){2,}|(CC){2,}", library_two_100[0])
repeats_two_100_seq0 = []
for m in matches:
    repeats_two_100_seq0.append(m.group())
print(repeats_two_100_seq0)

print(library_two_1000[0])
matches = re.finditer(r"(AA){2,}|(AT){2,}|(AG){2,}|(AC){2,}|(TT){2,}|(TG){2,}|(TC){2,}|(TA){2,}|(GC){2,}|(GG){2,}|\
                      (GT){2,}|(GA){2,}|(CA){2,}|(CT){2,}|(CG){2,}|(CC){2,}", library_two_1000[0])
repeats_two_1000_seq0 = []
for m in matches:
    repeats_two_1000_seq0.append(m.group())
print(repeats_two_1000_seq0)

GAGATAGAAAGATCGCGCTCCCAAGCGCGCCCAAACACACGGCCTATCCAACAAAACAGGGTTTTCATTCGGTTGGAGATTAACGACCCCTGGGCCGAGGCG
['GAGA', 'CGCG', 'GCGCGC', 'ACACAC', 'AAAA', 'TTTT', 'GAGA', 'CCCC']
TTGCTGGGTGTTACGCCGACGAAACTTGGCAGCAATGTTTGCACAAAAACCCACTATGCGATCGGGATCGACCACACGCTAGGAAGCCGTTGTGCGACCCAA
['CACA', 'AAAA', 'CACA', 'TGTG']
CGCAATCGCTTACCCCATCAGCCTTTAGCCCACCTCCAGTATTCTTCATATGAGTTTCCGTGGAGACTGAAAGAGGTGGATAAGCGCCAAGCGCATAAACTC
['CCCC', 'ATAT', 'GAGA', 'AGAG', 'GCGC', 'GCGC']


In [136]:
#Defining a function to get the repeats from a given sequence
def get_repeats(sequence):
    seq_repeats_two = []
    compiled_expressions = re.compile(r'(AA){2,}|(AT){2,}|(AG){2,}|(AC){2,}|(TT){2,}|(TG){2,}|(TC){2,}|(TA){2,}|\
                                      (GC){2,}|(GG){2,}|(GT){2,}|(GA){2,}|(CA){2,}|(CT){2,}|(CG){2,}|(CC){2,}')
    for match in compiled_expressions.finditer(sequence):
            seq_repeats_two.append(match.group())
    return seq_repeats_two

#Writing a for loop and calling the get_repeats function
# to get the repeats for all sequences in a list of strings

#For the library of 50 sequences:
repeats_library_two_50 = []
for sequence in library_two_50:
    repeats = get_repeats(sequence)
    repeats_library_two_50.append(repeats)
print(repeats_library_two_50) #returns nested list of repeats found for each indiv. sequence

#This gives the same results as it did when finding the repeats for each seq. indiv. as done above

[['GAGA', 'CGCG', 'CGCG', 'ACACAC', 'AAAA', 'TTTT', 'GAGA', 'CCCC'], ['GTGT', 'TTTT', 'AGAG', 'TATA', 'TATA', 'CCCC', 'ATAT', 'TGTG'], ['CTCT', 'CTCT'], [], ['TTTT', 'TGTG', 'CTCT', 'GGGG'], ['AGAG', 'TCTC'], ['GTGT', 'TCTC', 'TATA', 'TCTC'], ['CTCT', 'CCCC', 'CACA', 'ACAC', 'CGCG', 'TGTGTG', 'TGTG'], ['GAGA', 'AAAA', 'ATAT'], ['CCCC', 'GGGG', 'GAGA', 'CTCT'], ['GTGT', 'TTTT', 'TGTG', 'CTCT', 'TTTT', 'ATAT', 'GGGG'], ['TGTG', 'CCCC'], ['TGTG', 'ACAC', 'CTCT', 'CTCT', 'GAGA'], ['TATA', 'GAGA', 'CTCT', 'AAAA'], ['GTGT', 'TATA', 'GTGT', 'CTCT', 'ACAC', 'TATA', 'GAGA'], ['CACA', 'CCCC', 'CCCC', 'GAGA'], ['AAAA', 'AGAG', 'TGTG'], ['CTCT', 'TTTT', 'GAGA'], ['CTCT'], ['TGTG', 'AGAG', 'TGTG', 'CCCC', 'TATA', 'CGCG'], ['CGCG', 'AAAA', 'AGAG', 'ACAC'], ['TATA', 'TTTT', 'TTTT'], ['AGAG', 'CACA', 'ATAT', 'GTGT', 'CTCT', 'GGGGGG'], ['TCTC', 'GAGA', 'ATAT', 'CTCT', 'GGGG'], ['TGTGTG', 'CCCC', 'TTTT', 'TATA'], ['CCCC', 'AAAAAA'], ['TATA', 'CCCC', 'CTCT', 'AGAG', 'GGGG'], ['GTGTGT', 'ACAC', 'AGAG', 'A

In [96]:
#For the library of 100 sequences:
repeats_library_two_100 = []
for sequence in library_two_100:
    repeats = get_repeats(sequence)
    repeats_library_two_100.append(repeats)
print(repeats_library_two_100) #returns nested list of repeats found for each indiv. sequence

[['GTGT', 'CACA', 'AAAA', 'CACA', 'TGTG'], ['CCCC', 'CTCT'], ['GAGA', 'GTGT', 'GAGA', 'CACA', 'CACA', 'CCCC', 'TCTC', 'CCCC'], ['GCGC', 'GAGA', 'TATA', 'GAGA', 'ACAC'], ['GAGA', 'TTTT', 'AAAA'], ['ATAT', 'ACAC', 'GCGC', 'ATAT', 'TATA', 'GCGC'], ['TCTC', 'CACA', 'TTTT', 'GGGG', 'ATAT', 'GGGG'], ['TTTT', 'CGCG', 'GCGC', 'GCGC'], ['GCGC', 'TGTG', 'ACAC', 'TATA'], ['TTTT', 'CTCT', 'AGAG', 'GCGC'], ['TTTT', 'CCCC', 'CGCG', 'GGGG'], ['TCTC', 'TTTT', 'CTCT', 'CTCT', 'CCCC', 'GTGT', 'GTGT', 'CACA'], ['TGTG', 'GCGC', 'TCTC'], ['GAGA', 'GTGT', 'ACAC', 'AAAA'], ['AAAA', 'AAAA', 'ACAC', 'ACAC', 'CACA', 'CTCT', 'ACAC'], ['AGAG', 'CCCC', 'GCGC', 'CCCCCC', 'TATA', 'TCTC', 'CGCG'], ['GTGT', 'CGCG', 'GAGAGA', 'GCGC', 'GGGG'], ['GCGC', 'TCTC', 'TCTC', 'TATA', 'GCGCGC'], ['GGGG', 'CACA', 'GGGG'], ['TTTT', 'CACA', 'AGAG'], ['TTTT', 'CGCG', 'TCTC'], ['ACAC', 'TCTC', 'TATA'], ['CGCG', 'TCTC', 'CACA', 'AAAA', 'GAGA'], ['ACAC', 'CTCT', 'TTTT', 'ACAC'], ['GGGG', 'CTCT', 'CGCG', 'GAGA'], ['GTGT'], ['TCTCTC', 'A

In [97]:
#For the library of 1000 sequences:
repeats_library_two_1000 = []
for sequence in library_two_1000:
    repeats = get_repeats(sequence)
    repeats_library_two_1000.append(repeats)
print(repeats_library_two_1000) #returns nested list of repeats found for each indiv. sequence

[['CCCC', 'ATAT', 'GAGA', 'AGAG', 'GCGC', 'GCGC'], ['AGAG', 'CCCC'], ['ACAC', 'CTCT', 'CCCC', 'CACA', 'CTCT'], ['CTCT', 'GAGA', 'GAGA', 'CGCG', 'CACA', 'CACA', 'TGTG'], ['GGGG', 'GGGG', 'CGCGCG', 'TGTG', 'TGTGTG'], ['GTGTGT', 'AGAG', 'TCTC', 'CACA'], ['GCGC', 'CTCT', 'TGTG'], ['ACAC', 'CCCC', 'GGGG', 'TTTT', 'GAGA', 'TTTT', 'GTGTGT'], ['GTGT', 'CACA', 'ATAT', 'GCGC'], ['CACA', 'GAGA', 'AGAG', 'CGCG'], ['ACAC', 'AAAA', 'CTCT', 'TTTT', 'CTCT', 'TTTT'], ['TCTC', 'GCGC', 'ACAC', 'AGAG', 'ATAT', 'TGTG'], ['GGGG', 'CACA', 'TATA', 'CACA', 'AAAA'], ['ACAC', 'CTCT'], ['TCTC', 'TGTG', 'AGAG', 'GGGG', 'GAGAGA'], ['CTCT', 'AGAGAG', 'CCCC', 'CACA'], ['CGCG', 'ACAC', 'GCGC', 'CGCG', 'CCCC'], ['CGCG', 'CCCCCC', 'CGCG', 'CGCG'], ['CCCC', 'GTGT', 'GCGC', 'CGCG', 'CTCT', 'GTGT'], ['AGAG', 'CCCC', 'TGTG', 'TCTC', 'GGGG', 'CCCC', 'TATA'], ['CCCC', 'CGCG', 'TGTG', 'TATA', 'CGCG', 'GTGT', 'TCTC'], ['CACA', 'GTGT', 'CTCT', 'CTCT'], ['TGTG', 'TTTT', 'GTGT', 'AGAG', 'GGGG'], ['CTCT', 'AAAA', 'GTGT', 'CACA'], [

### Section 2.10: Counting the number of repeats found in each sequence of the library

In [98]:

#To count the number of repeats in each indiv. sequence in a library, we can create another function:
def repeat_count(repeats):
    count = 0 #count starts at 0
    for elem in repeats:
        if type(elem) == list: #check if the element is a list
            count += repeat_count(elem) #if list, get size
        else:
            count += 1 #if not list, add 1 to count
    return count

#Writing a for loop and calling the repeat_count function
# to get the number of repeats found in each seq., as found in the above code

#For the library of 50 sequences:
repeat_counts_library_two_50 = []
for repeats in repeats_library_two_50:
    number = repeat_count(repeats)
    repeat_counts_library_two_50.append(number)
print(repeat_counts_library_two_50)

#Then, to get the total number of all repeats in all sequences in the library
total_repeats_lib_two_50 = sum(repeat_counts_library_two_50)
print(total_repeats_lib_two_50)

[8, 8, 3, 0, 4, 2, 4, 7, 3, 4, 7, 3, 5, 4, 7, 5, 3, 3, 1, 6, 4, 3, 6, 5, 4, 2, 5, 6, 1, 7, 7, 5, 5, 3, 5, 4, 7, 8, 6, 8, 6, 3, 3, 5, 1, 5, 7, 5, 5, 4]
232


In [99]:
#For the library of 100 sequences:
repeat_counts_library_two_100 = []
for repeats in repeats_library_two_100:
    number = repeat_count(repeats)
    repeat_counts_library_two_100.append(number)
print(repeat_counts_library_two_100)

#Then, to get the total number of all repeats in all sequences in the library
total_repeats_lib_two_100 = sum(repeat_counts_library_two_100)
print(total_repeats_lib_two_100)

[5, 2, 8, 5, 3, 6, 6, 4, 4, 4, 4, 8, 3, 4, 7, 7, 5, 5, 3, 3, 3, 3, 5, 4, 4, 1, 6, 3, 5, 7, 7, 6, 2, 2, 7, 4, 6, 4, 5, 4, 4, 6, 1, 4, 4, 2, 2, 2, 7, 6, 3, 5, 3, 4, 3, 3, 8, 3, 4, 4, 2, 3, 3, 6, 5, 3, 3, 6, 5, 3, 6, 6, 5, 7, 1, 4, 6, 5, 2, 5, 3, 3, 6, 5, 5, 2, 3, 7, 7, 2, 4, 3, 3, 7, 5, 3, 5, 5, 3, 4]
433


In [100]:
#For the library of 1000 sequences:
repeat_counts_library_two_1000 = []
for repeats in repeats_library_two_1000:
    number = repeat_count(repeats)
    repeat_counts_library_two_1000.append(number)
print(repeat_counts_library_two_1000)

#Then, to get the total number of all repeats in all sequences in the library
total_repeats_lib_two_1000 = sum(repeat_counts_library_two_1000)
print(total_repeats_lib_two_1000)

[6, 2, 5, 7, 5, 4, 3, 7, 4, 4, 6, 6, 5, 2, 5, 4, 5, 4, 6, 7, 7, 4, 5, 4, 4, 8, 7, 3, 7, 6, 5, 4, 2, 4, 3, 4, 5, 2, 6, 1, 3, 5, 5, 4, 6, 5, 5, 5, 3, 5, 5, 5, 1, 4, 1, 3, 3, 7, 6, 3, 4, 4, 5, 4, 4, 4, 4, 3, 4, 4, 2, 4, 7, 4, 6, 6, 5, 7, 3, 4, 4, 5, 3, 5, 2, 5, 6, 2, 4, 3, 2, 6, 2, 5, 4, 7, 6, 5, 3, 3, 4, 2, 5, 6, 8, 4, 5, 5, 7, 6, 5, 5, 6, 9, 2, 4, 5, 6, 5, 4, 9, 5, 4, 3, 4, 5, 2, 4, 6, 4, 2, 4, 8, 8, 3, 7, 2, 6, 5, 6, 5, 4, 4, 4, 5, 5, 3, 3, 4, 7, 4, 5, 6, 5, 5, 6, 6, 8, 4, 4, 7, 9, 5, 6, 1, 6, 8, 1, 6, 4, 3, 5, 4, 6, 3, 2, 4, 4, 4, 5, 6, 6, 4, 1, 4, 1, 3, 6, 8, 8, 4, 7, 5, 3, 7, 6, 4, 4, 5, 5, 5, 6, 2, 7, 3, 5, 7, 8, 4, 7, 5, 3, 5, 3, 5, 4, 4, 5, 4, 6, 2, 2, 6, 4, 4, 6, 4, 3, 4, 3, 4, 4, 7, 1, 5, 5, 6, 4, 3, 4, 3, 6, 4, 7, 5, 7, 5, 4, 8, 7, 4, 7, 2, 4, 1, 7, 7, 5, 6, 5, 5, 3, 5, 3, 4, 7, 5, 4, 7, 5, 3, 7, 6, 3, 5, 6, 5, 8, 5, 7, 6, 5, 6, 4, 8, 2, 5, 3, 3, 4, 4, 1, 11, 3, 5, 4, 6, 4, 5, 5, 4, 6, 4, 7, 4, 4, 6, 3, 6, 6, 4, 5, 6, 4, 8, 6, 6, 9, 3, 2, 5, 4, 3, 4, 5, 2, 9, 7, 6, 2, 5, 6, 5,

### Section 2.11: Generating library of random sequences by using key:value pairs

In [19]:
#Defining a function to get the keys from the key;val pairs
def get_keys(key):
    keys_one = []
    keys_one = [k for k, v in key]
    return keys_one

In [None]:
keys_library_two_50 = []
for key in random_pairs_library_two_50:
    keys = get_keys(key) 
    keys_library_two_50.append(keys)
print(keys_library_two_50) #returns nested list of keys for each indiv. sequence
#so 50 lists of 34 keys
# to make into 50 sequences of 34 tags = sequences of 102nt in length

In [20]:
############### 
# CHANGE THE NUMBERS IN THE FOLLOWING CODE TO CHANGE HOW MANY SEQUENCES/ WHAT LENGTH SEQUENCES YOU WANT TO GENERATE

#What if we want to sample a lot at once, and store the results?
import numpy as np

#making multiple empty libraries for a certain amount of sequences:
#For 50 sequences:
random_pairs_library_two_50 = []
#for 100 sequences:
random_pairs_library_two_100 = []
#for 1000 sequences:
random_pairs_library_two_1000 = []

#This for loop
# generates 50 sequences, of 34 key/val pairs (to make 50 seq. of 102nt length)
for x in np.arange(50): #do the loop 50 times
    #put tags into library
    random_pairs_library_two_50.append(random.sample(list(dict_two.items()), 34)) 
    #generates 34 random key;val pairs each time
print(random_pairs_library_two_50)

#Writing a for loop and calling the get_keys function
# to get the keys for all the key:val pairs in the above library
keys_library_two_50 = []
for key in random_pairs_library_two_50:
    keys = get_keys(key) 
    keys_library_two_50.append(keys)
print(keys_library_two_50) #returns nested list of keys for each indiv. sequence
#so 50 lists of 34 keys
# to make into 50 sequences of 34 tags = sequences of 102nt in length

[[('ACT', 'Blue'), ('AAA', 'Black'), ('CGC', 'White'), ('GCG', 'Cyan'), ('CTT', 'Red'), ('ATC', 'Green'), ('GGG', 'Black'), ('CGA', 'White'), ('AGA', 'White'), ('ATA', 'Green'), ('GAA', 'Magenta'), ('GTC', 'Green'), ('CTC', 'Red'), ('GCT', 'Cyan'), ('TCG', 'Blue'), ('TTG', 'Red'), ('TTA', 'Red'), ('ACC', 'Blue'), ('CAA', 'Yellow'), ('ACA', 'Blue'), ('GAG', 'Magenta'), ('GCA', 'Cyan'), ('CCG', 'Cyan'), ('GAT', 'Magenta'), ('TTT', 'Black'), ('AAG', 'White'), ('TGT', 'Black'), ('CCT', 'Cyan'), ('AGC', 'Yellow'), ('AGG', 'White'), ('TGC', 'Black'), ('TAT', 'Black'), ('GTG', 'Green'), ('CTA', 'Red')], [('GTG', 'Green'), ('CGA', 'White'), ('AAC', 'Yellow'), ('CCC', 'Black'), ('GCC', 'Cyan'), ('GGG', 'Black'), ('TTG', 'Red'), ('GCT', 'Cyan'), ('CAC', 'Black'), ('CGG', 'White'), ('GCG', 'Cyan'), ('TAT', 'Black'), ('CGC', 'White'), ('AGT', 'Yellow'), ('GGC', 'Magenta'), ('TGG', 'Black'), ('AAG', 'White'), ('AAT', 'Yellow'), ('TCT', 'Blue'), ('CAG', 'Yellow'), ('ACG', 'Yellow'), ('CTC', 'Red'), 

In [103]:
#This for loop
# generates 100 sequences, of 34 key/val pairs (to make 100 seq. of 102nt length)
for x in np.arange(100): #do the loop 100 times
    #put tags into library
    random_pairs_library_two_100.append(random.sample(list(dict_two.items()), 34)) 
    #generates 34 random key;val pairs each time
print(random_pairs_library_two_100)

#Writing a for loop and calling the get_keys function
# to get the keys for all the key:val pairs in the above library
keys_library_two_100 = []
for key in random_pairs_library_two_100:
    keys = get_keys(key) 
    keys_library_two_100.append(keys)
print(keys_library_two_100) #returns nested list of keys for each indiv. sequence

[[('GCG', 'Cyan'), ('TTA', 'Red'), ('ATA', 'Green'), ('ATC', 'Green'), ('GTA', 'Green'), ('CGG', 'White'), ('GAC', 'Magenta'), ('CCG', 'Cyan'), ('CGA', 'White'), ('GTG', 'Green'), ('GAT', 'Magenta'), ('ACC', 'Blue'), ('CGC', 'White'), ('TCG', 'Blue'), ('GTT', 'Green'), ('AGT', 'Yellow'), ('GTC', 'Green'), ('TTC', 'Red'), ('GCA', 'Cyan'), ('TCA', 'Blue'), ('GGT', 'Magenta'), ('CTT', 'Red'), ('CGT', 'White'), ('GGG', 'Black'), ('ACA', 'Blue'), ('CTA', 'Red'), ('GCT', 'Cyan'), ('TGC', 'Black'), ('CTC', 'Red'), ('GAG', 'Magenta'), ('AAC', 'Yellow'), ('AGC', 'Yellow'), ('AGA', 'White'), ('AAT', 'Yellow')], [('AAA', 'Black'), ('TCA', 'Blue'), ('TCT', 'Blue'), ('GCG', 'Cyan'), ('GCC', 'Cyan'), ('TGT', 'Black'), ('TCG', 'Blue'), ('CGA', 'White'), ('AGA', 'White'), ('CAT', 'Black'), ('GTT', 'Green'), ('CCT', 'Cyan'), ('CTC', 'Red'), ('GGG', 'Black'), ('TTG', 'Red'), ('ATA', 'Green'), ('GTG', 'Green'), ('GTA', 'Green'), ('CCA', 'Cyan'), ('CCC', 'Black'), ('GTC', 'Green'), ('ATC', 'Green'), ('CCG

In [21]:
#This for loop
# generates 1000 sequences, of 34 key/val pairs (to make 1000 seq. of 102nt length)
for x in np.arange(1000): #do the loop 1000 times
    #put tags into library
    random_pairs_library_two_1000.append(random.sample(list(dict_two.items()), 34)) 
    #generates 34 random key;val pairs each time
print(random_pairs_library_two_1000)

#Writing a for loop and calling the get_keys function
# to get the keys for all the key:val pairs in the above library
keys_library_two_1000 = []
for key in random_pairs_library_two_1000:
    keys = get_keys(key) 
    keys_library_two_1000.append(keys)
print(keys_library_two_1000) #returns nested list of keys for each indiv. sequence

[[('AAC', 'Yellow'), ('CCA', 'Cyan'), ('GAA', 'Magenta'), ('AAG', 'White'), ('CCG', 'Cyan'), ('TCC', 'Blue'), ('TCA', 'Blue'), ('AGC', 'Yellow'), ('CGA', 'White'), ('AGA', 'White'), ('AGT', 'Yellow'), ('GTT', 'Green'), ('TAT', 'Black'), ('CTG', 'Red'), ('GCT', 'Cyan'), ('CAG', 'Yellow'), ('GAC', 'Magenta'), ('GAG', 'Magenta'), ('CCT', 'Cyan'), ('TGG', 'Black'), ('CTA', 'Red'), ('ACC', 'Blue'), ('GCA', 'Cyan'), ('ACA', 'Blue'), ('ATC', 'Green'), ('CAC', 'Black'), ('TGT', 'Black'), ('GGT', 'Magenta'), ('GAT', 'Magenta'), ('AGG', 'White'), ('TTG', 'Red'), ('CAA', 'Yellow'), ('TGC', 'Black'), ('ACT', 'Blue')], [('AGT', 'Yellow'), ('CAT', 'Black'), ('GCA', 'Cyan'), ('AAG', 'White'), ('ACC', 'Blue'), ('TCA', 'Blue'), ('GAC', 'Magenta'), ('AAC', 'Yellow'), ('CTG', 'Red'), ('ATT', 'Green'), ('GGG', 'Black'), ('CTC', 'Red'), ('CCG', 'Cyan'), ('CAA', 'Yellow'), ('TTC', 'Red'), ('GAT', 'Magenta'), ('TGT', 'Black'), ('CTT', 'Red'), ('ATA', 'Green'), ('TTA', 'Red'), ('GGC', 'Magenta'), ('CCT', 'Cya

In [105]:
#Putting all the tags in the indiv. libraries together into sequences

#the library will become a list of strings
#For the library of 50 sequences:
library_keys_two_50 = [''.join(l) for l in keys_library_two_50]
print(library_keys_two_50)

['CTCCCAATCGTAAGGTACGGTTGCTTACAATTTATACCGAAGCATATTGCTTTGCCTGCATCTTTCTATCGTACGGATACTGGACAGAAAGTTTGTCGGGGC', 'GTTAATCGTAAGCTCTCGCTAAGGCATGCATTAAAAAACTTTCCTTGTGTGGCTGAACCCCCGACCTGCGGCACAGGTGACTCATGGCGACAGCACCCAATT', 'CACTCCCAAGCTCTACCAGCCCGATTCGAATGGTGCGATGTCGTAGAGATAAAAAGAGGGTCTGACATTTCGCCTTCAACGTACTTACAGAACCGGAATTTT', 'CTATTAGAGGGCGGTCGCATATACTCGCTCTGGGGGGACAAGTCCAGAGTACGGATTGTTAGCGATAACCTGCAACCCGTCAGTGCCCCTACCTGCTTTTTG', 'GGATCCCTAACATTCCACTGGATTCAGCCACATGACATCCTGCTCCTTCAAGGTCCCGCGGTAGAACCGGAGTGCGCTTCAGTCAGTAGCTTAGTGAGGCCT', 'AACAATCCCCCTTCCGTAGTCTCACAAGTGCCAAGTTATGGGCATGAGATAGGTGTTTTGCTCAAGCTGTTCGCAGGACTTGCTGAAAGAAGCACTCGGGAT', 'CATTATAAAAGGGCTACCCTGGTACGAGAGATCTCTCGGTCGCTAAACGTTTCACAGGGATTTAAGGCACCCCGCCTCAGACCTGTCCTTGATTCCAGCGCG', 'GGCAACCGCCGGTCTGTGTCAGCTGTAGAGCCTAAGTTAATCGCACACGACTTCTGCGCGTGTTACACTCGTAGACAAAATAAAAGGTATATATCGGCCACC', 'ACTATAATCAACTTGTCTGGCTGTGCAGAGGGTCCTGTCGCGACCAGCCCACAGGTTAGGAAACTACACAGTCTCATTCGGCGTGTGAAGACGCGCTTTCGA', 'CCTGGGGGTGTCACACTGTCACATTTATTCTCCTGTACCGGACA

In [106]:
#For the library of 100 sequences:
library_keys_two_100 = [''.join(l) for l in keys_library_two_100]
print(library_keys_two_100)

['GCGTTAATAATCGTACGGGACCCGCGAGTGGATACCCGCTCGGTTAGTGTCTTCGCATCAGGTCTTCGTGGGACACTAGCTTGCCTCGAGAACAGCAGAAAT', 'AAATCATCTGCGGCCTGTTCGCGAAGACATGTTCCTCTCGGGTTGATAGTGGTACCACCCGTCATCCCGAGCGGTACAGAGGGACAGGCAGATAACATTAAT', 'GACACGATAGGGAACGGCCTGTGCCGATTAAATGAACCGAGAGCAAGCGGACGCGCCGTAACACTATTGGATCTCTTCTGTCGGCCCCAACCTCCAACTGCT', 'AGGTCTTATCCAGTCAGACCGCGGGGTTGGCCCGCAGAGGCGGATAAACTGACTGCTCATGTTAGCTTGTTACGTAATGGGCCTAACTGTCAGATTAGTATC', 'GTAGATTGGCCGGCACTGTGTCTTGGAACCCAAGGGTGCCGTCAGAGGAACAAGGGTACTGTCTTATCTCCAGAAAAACCCACATTCACGGGCATTCATGAC', 'AGCTGCTCGTTCTCAGCCGTGCTTTCCGACAACGTTGTCACGCCTCATCCAAAACTGGGGCCCCGGGGCTATCAGAGGTTTGTAATTAATCGAGGTACCCGT', 'GTCCCCGAAGCTAACAATCGGATTGGGGGCCAACATTCCCGAGTGCCACGTGGTAGTTCGTGGAAATTGTCACCTGCGCACCGCGGAAAGCTAACGAGCATA', 'CTGCCTGATACGCCCCACGTCGTGCGTAGACTTGCATTTAATATAACCGCTTACACTACAGACCCAAAGGGGCAGCGAAACTCGCGGCTAAGGGAATGTTGC', 'CAGAATTTACGAACGTGTGGAGCCTGCCTCAAGCTTCTAGACCGGGATACCGTCGAATCTGTACAAGCGGTGATATCCAGAATCTGGAAAACTCCATACAAC', 'ACGTCCGTAGCCGTTCGCCAAAGGTTTGTCGAATGGCGGTGTCC

In [107]:
#For the library of 1000 sequences:
library_keys_two_1000 = [''.join(l) for l in keys_library_two_1000]
print(library_keys_two_1000)

['GTCGTTCTCCCGGAGAAGTATATAACTGGCAAACAGTGCCGACTTATCCCAAACACCCAATGTGTGAGTTCTCGGGGTCGCCCCATTAGAGACCTGCTAGCT', 'TCCGCCCGGATCTTACGCGTTCATGGTTTTCCTCAGGATTGGGGCACACCCGAATACGCGTTGCACGGATGTCTATGCCGAACCATTACGGTCCGTAGCAAC', 'GGGTCCCCACAAGCGGTTTGTCTACGTTCGGCCTATTCAGCTATCGTAGATGTCATTAAAGCAGACTTTTCTGTGACCCCCACTAACCGATGCACAATAGAG', 'CAAGAGTATAGCCATGTAACGCGGCGACCAGGACGCGTGGCTATAAATTTTTTGTCGAGTTGGTCAGATGGGAGGTACACCCCCGACGTTACTGTCAAGTGT', 'TGTCCAATAGCTGAGAATCTACATCAGCACCTGTTTAAACGGTGCAGATGGACGGGACCCCCTAGGCTTGATCGCGTGACATCTCAAAACGTCGTTACCGTA', 'AAAGATCGTGGGATTCTGGCAAAGTCGGTGAATAGGTGTTTAACTCCAGGAATCAGCTCCGGTTTCTATCCGCAGCGGAGTATAGCCACGGCGCCTTTTTGC', 'CCAGTAAGAAGGGGTCTTGTGCGATTTGCTTTGCTGAAAATTGGGGCCTCTCGTGGCCAGATAGACATCGTTTGCTATTCGCTAAAGCCGTGGACCCCTGTC', 'CTCAGGCGGGTTTTACTACCCCAAAAAGGTCCGCTTCTGCATAAGCCAAGTTGTGGAGAGGGGATAGTCCAGAATGTGTTTGCCCGCGTAACGACTTGCAAC', 'GAGATCGCAGCTACGCCATTAAGATCATGTATTTCCGTGAAAACAAAGCCCCCGTGGCGACGTCTCGAAGTCGACCTTTTGCTAAATTGCACTCTGCGCAGC', 'AACCTGCCCTCCCATTGGAGATTCTGTCCACCGTCAATCACGAG

In [108]:
#Identifying the indeces for the sequences in each library
# will show us how many sequences there are

#For the library of 50 sequences:
library_keys_two_50_index = enumerate(library_keys_two_50)
print(list(library_keys_two_50_index))

#If we want an indiv. seq. from the library, we call the index of that seq.
#print(library_keys_two_50[3])

[(0, 'CTCCCAATCGTAAGGTACGGTTGCTTACAATTTATACCGAAGCATATTGCTTTGCCTGCATCTTTCTATCGTACGGATACTGGACAGAAAGTTTGTCGGGGC'), (1, 'GTTAATCGTAAGCTCTCGCTAAGGCATGCATTAAAAAACTTTCCTTGTGTGGCTGAACCCCCGACCTGCGGCACAGGTGACTCATGGCGACAGCACCCAATT'), (2, 'CACTCCCAAGCTCTACCAGCCCGATTCGAATGGTGCGATGTCGTAGAGATAAAAAGAGGGTCTGACATTTCGCCTTCAACGTACTTACAGAACCGGAATTTT'), (3, 'CTATTAGAGGGCGGTCGCATATACTCGCTCTGGGGGGACAAGTCCAGAGTACGGATTGTTAGCGATAACCTGCAACCCGTCAGTGCCCCTACCTGCTTTTTG'), (4, 'GGATCCCTAACATTCCACTGGATTCAGCCACATGACATCCTGCTCCTTCAAGGTCCCGCGGTAGAACCGGAGTGCGCTTCAGTCAGTAGCTTAGTGAGGCCT'), (5, 'AACAATCCCCCTTCCGTAGTCTCACAAGTGCCAAGTTATGGGCATGAGATAGGTGTTTTGCTCAAGCTGTTCGCAGGACTTGCTGAAAGAAGCACTCGGGAT'), (6, 'CATTATAAAAGGGCTACCCTGGTACGAGAGATCTCTCGGTCGCTAAACGTTTCACAGGGATTTAAGGCACCCCGCCTCAGACCTGTCCTTGATTCCAGCGCG'), (7, 'GGCAACCGCCGGTCTGTGTCAGCTGTAGAGCCTAAGTTAATCGCACACGACTTCTGCGCGTGTTACACTCGTAGACAAAATAAAAGGTATATATCGGCCACC'), (8, 'ACTATAATCAACTTGTCTGGCTGTGCAGAGGGTCCTGTCGCGACCAGCCCACAGGTTAGGAAACTACACAGTCTCATTCGGCGTGTGAAGACGCGCTTTCGA'), 

In [109]:
#For the library of 100 sequences:
library_keys_two_100_index = enumerate(library_keys_two_100)
print(list(library_keys_two_100_index))

[(0, 'GCGTTAATAATCGTACGGGACCCGCGAGTGGATACCCGCTCGGTTAGTGTCTTCGCATCAGGTCTTCGTGGGACACTAGCTTGCCTCGAGAACAGCAGAAAT'), (1, 'AAATCATCTGCGGCCTGTTCGCGAAGACATGTTCCTCTCGGGTTGATAGTGGTACCACCCGTCATCCCGAGCGGTACAGAGGGACAGGCAGATAACATTAAT'), (2, 'GACACGATAGGGAACGGCCTGTGCCGATTAAATGAACCGAGAGCAAGCGGACGCGCCGTAACACTATTGGATCTCTTCTGTCGGCCCCAACCTCCAACTGCT'), (3, 'AGGTCTTATCCAGTCAGACCGCGGGGTTGGCCCGCAGAGGCGGATAAACTGACTGCTCATGTTAGCTTGTTACGTAATGGGCCTAACTGTCAGATTAGTATC'), (4, 'GTAGATTGGCCGGCACTGTGTCTTGGAACCCAAGGGTGCCGTCAGAGGAACAAGGGTACTGTCTTATCTCCAGAAAAACCCACATTCACGGGCATTCATGAC'), (5, 'AGCTGCTCGTTCTCAGCCGTGCTTTCCGACAACGTTGTCACGCCTCATCCAAAACTGGGGCCCCGGGGCTATCAGAGGTTTGTAATTAATCGAGGTACCCGT'), (6, 'GTCCCCGAAGCTAACAATCGGATTGGGGGCCAACATTCCCGAGTGCCACGTGGTAGTTCGTGGAAATTGTCACCTGCGCACCGCGGAAAGCTAACGAGCATA'), (7, 'CTGCCTGATACGCCCCACGTCGTGCGTAGACTTGCATTTAATATAACCGCTTACACTACAGACCCAAAGGGGCAGCGAAACTCGCGGCTAAGGGAATGTTGC'), (8, 'CAGAATTTACGAACGTGTGGAGCCTGCCTCAAGCTTCTAGACCGGGATACCGTCGAATCTGTACAAGCGGTGATATCCAGAATCTGGAAAACTCCATACAAC'), 

In [110]:
#For the library of 1000 sequences:
library_keys_two_1000_index = enumerate(library_keys_two_1000)
print(list(library_keys_two_1000_index))

[(0, 'GTCGTTCTCCCGGAGAAGTATATAACTGGCAAACAGTGCCGACTTATCCCAAACACCCAATGTGTGAGTTCTCGGGGTCGCCCCATTAGAGACCTGCTAGCT'), (1, 'TCCGCCCGGATCTTACGCGTTCATGGTTTTCCTCAGGATTGGGGCACACCCGAATACGCGTTGCACGGATGTCTATGCCGAACCATTACGGTCCGTAGCAAC'), (2, 'GGGTCCCCACAAGCGGTTTGTCTACGTTCGGCCTATTCAGCTATCGTAGATGTCATTAAAGCAGACTTTTCTGTGACCCCCACTAACCGATGCACAATAGAG'), (3, 'CAAGAGTATAGCCATGTAACGCGGCGACCAGGACGCGTGGCTATAAATTTTTTGTCGAGTTGGTCAGATGGGAGGTACACCCCCGACGTTACTGTCAAGTGT'), (4, 'TGTCCAATAGCTGAGAATCTACATCAGCACCTGTTTAAACGGTGCAGATGGACGGGACCCCCTAGGCTTGATCGCGTGACATCTCAAAACGTCGTTACCGTA'), (5, 'AAAGATCGTGGGATTCTGGCAAAGTCGGTGAATAGGTGTTTAACTCCAGGAATCAGCTCCGGTTTCTATCCGCAGCGGAGTATAGCCACGGCGCCTTTTTGC'), (6, 'CCAGTAAGAAGGGGTCTTGTGCGATTTGCTTTGCTGAAAATTGGGGCCTCTCGTGGCCAGATAGACATCGTTTGCTATTCGCTAAAGCCGTGGACCCCTGTC'), (7, 'CTCAGGCGGGTTTTACTACCCCAAAAAGGTCCGCTTCTGCATAAGCCAAGTTGTGGAGAGGGGATAGTCCAGAATGTGTTTGCCCGCGTAACGACTTGCAAC'), (8, 'GAGATCGCAGCTACGCCATTAAGATCATGTATTTCCGTGAAAACAAAGCCCCCGTGGCGACGTCTCGAAGTCGACCTTTTGCTAAATTGCACTCTGCGCAGC'), 

In [137]:
#Defining a function to get the repeats from a given sequence
def get_repeats(sequence):
    seq_repeats_one = []
    compiled_expressions = re.compile(r'(AA){2,}|(AT){2,}|(AG){2,}|(AC){2,}|(TT){2,}|(TG){2,}|(TC){2,}|(TA){2,}|\
                                      (GC){2,}|(GG){2,}|(GT){2,}|(GA){2,}|(CA){2,}|(CT){2,}|(CG){2,}|(CC){2,}')
    for match in compiled_expressions.finditer(sequence):
            seq_repeats_one.append(match.group())
    return seq_repeats_one

#Writing a for loop and calling the get_repeats function
# to get the repeats for all sequences in a list of strings in the library

#For the library of 50 sequences:
repeats_keys_library_two_50 = []
for sequence in library_keys_two_50:
    repeats = get_repeats(sequence) 
    repeats_keys_library_two_50.append(repeats)
print(repeats_keys_library_two_50) #returns nested list of repeats found for each indiv. sequence

[['TATA', 'ATAT', 'GGGG'], ['CTCT', 'AAAAAA', 'TGTGTG', 'CCCC', 'CACA'], ['CTCT', 'AGAG', 'AAAA', 'AGAG', 'TTTT'], ['AGAG', 'ATAT', 'CTCT', 'GGGGGG', 'AGAG', 'CCCC', 'TTTT'], ['CACA', 'CGCG'], ['CCCC', 'TCTC', 'GAGA', 'GTGT'], ['TATA', 'GAGAGA', 'TCTCTC', 'CACA', 'CCCC', 'CGCG'], ['TGTG', 'AGAG', 'CACA', 'CGCG', 'ACAC', 'AAAA', 'AAAA', 'TATATA'], ['TATA', 'TGTG', 'AGAG', 'CGCG', 'CACA', 'ACAC', 'TCTC', 'GTGT', 'CGCG'], ['GGGG', 'GTGT', 'CACA', 'CACA', 'TCTC'], ['GGGG', 'CGCGCG', 'CCCC'], ['ATAT', 'CGCG', 'CCCC', 'TTTT', 'ACAC'], ['ATAT', 'TATA', 'TCTC', 'GGGG', 'CCCC'], ['CTCT', 'AAAA', 'GTGT', 'CTCT', 'AGAG'], ['TTTT', 'GAGA', 'ATAT', 'CGCG'], ['CCCC', 'ACAC', 'CCCC', 'CTCT', 'ATAT'], ['CACA', 'CGCG', 'CCCC', 'TATA', 'CGCG', 'GTGT', 'CACA', 'GTGT', 'CGCG', 'CTCT'], ['TTTT', 'GGGG', 'GAGA'], ['GTGT', 'CACA', 'CCCC', 'TTTT', 'CACA'], ['GGGG', 'CACA', 'TGTG', 'CTCT'], ['CCCC', 'TGTG', 'TATA', 'GGGG', 'CGCG'], ['ACAC', 'GTGT', 'TCTC', 'ATAT'], ['CGCG', 'AGAG', 'ACAC', 'ACAC', 'TTTT', 'AAA

In [112]:
#For the library of 100 sequences:
repeats_keys_library_two_100 = []
for sequence in library_keys_two_100:
    repeats = get_repeats(sequence) 
    repeats_keys_library_two_100.append(repeats)
print(repeats_keys_library_two_100) #returns nested list of repeats found for each indiv. sequence

[['CGCG', 'GTGT', 'ACAC', 'GAGA'], ['CGCG', 'CTCT', 'AGAG'], ['ACAC', 'TGTG', 'GAGA', 'CGCG', 'ACAC', 'TCTC', 'CCCC'], ['CGCG', 'AGAG'], ['TGTG', 'AGAG', 'TCTC', 'AAAA', 'CACA'], ['TCTC', 'AAAA', 'GGGG', 'CCCC', 'GGGG', 'AGAG'], ['CCCC', 'GGGG', 'GCGC', 'CGCG'], ['CCCC', 'ATAT', 'ACAC', 'GGGG', 'CGCG'], ['GTGT', 'ATAT', 'AAAA'], ['GTGT', 'CTCT', 'CTCT'], ['CGCG', 'CTCT', 'TTTT', 'CTCT', 'TATA'], ['TCTC', 'TATA', 'TGTG', 'GAGA', 'GTGT', 'GCGC'], ['CACA', 'TGTG', 'AGAG', 'TCTC'], ['CTCT', 'GTGT', 'ACAC'], ['TTTT', 'ACAC', 'CTCT', 'CACA'], ['GCGC', 'TCTC', 'TGTG'], ['ATAT'], ['ACAC', 'TCTC', 'GAGA'], ['CCCC', 'GGGG', 'TTTT', 'GTGT', 'AAAA', 'TTTT'], ['ATAT', 'CGCG', 'GCGC', 'AGAGAG', 'CTCT', 'TTTT', 'CACACA'], ['AGAG', 'ATAT'], ['CCCC', 'TATA'], ['GGGG', 'TCTC', 'TATA'], ['TCTC', 'CACA', 'ACAC', 'TTTT', 'ATAT', 'GTGT'], ['GGGG', 'CACA', 'TCTCTC', 'CGCG', 'ATAT', 'ACAC', 'CGCG', 'AAAA'], ['TGTG', 'GGGG'], ['TGTG', 'CACA', 'GTGTGT', 'CGCG'], ['CCCC', 'AGAG', 'TCTC', 'CTCT', 'GGGG'], ['GCGCG

In [113]:
#For the library of 1000 sequences:
repeats_keys_library_two_1000 = []
for sequence in library_keys_two_1000:
    repeats = get_repeats(sequence) 
    repeats_keys_library_two_1000.append(repeats)
print(repeats_keys_library_two_1000) #returns nested list of repeats found for each indiv. sequence

[['TCTC', 'GAGA', 'TATATA', 'ACAC', 'TGTGTG', 'TCTC', 'GGGG', 'CCCC', 'AGAG'], ['CGCG', 'TTTT', 'GGGG', 'CACA', 'CGCG'], ['CCCC', 'TTTT', 'TGTG', 'CCCC', 'CACA', 'AGAG'], ['AGAG', 'TATA', 'CGCG', 'CGCG', 'TATA', 'TTTTTT', 'ACAC', 'CCCC', 'GTGT'], ['GAGA', 'CCCC', 'CGCG', 'TCTC', 'AAAA'], ['GTGT', 'TATA', 'GCGC', 'TTTT'], ['GGGG', 'TGTG', 'AAAA', 'GGGG', 'CTCT', 'CCCC'], ['TTTT', 'CCCC', 'AAAA', 'TGTG', 'GAGA', 'GGGG', 'TGTG', 'CGCG'], ['GAGA', 'AAAA', 'CCCC', 'TCTC', 'TTTT', 'CTCT', 'GCGC'], ['GAGA'], ['TGTG', 'GCGC', 'ATAT'], ['TTTT', 'ATAT', 'ATAT', 'GTGT'], ['CGCGCG', 'GAGA', 'AAAA'], ['GGGG', 'TGTG', 'TCTC', 'ACAC', 'ACAC', 'CCCC'], ['AAAAAA', 'TTTT', 'CACACA', 'GTGT', 'ATAT', 'GCGC'], ['ACAC', 'TATA', 'TTTT'], ['ATAT', 'CGCG', 'AGAG'], ['ACAC', 'GTGT', 'ATAT', 'GGGG', 'GAGAGA'], ['GCGC', 'CACACA', 'CACA', 'GGGG'], ['AGAG', 'TGTG', 'TCTC', 'GGGG', 'ATAT'], ['CACA', 'CCCC', 'AGAG', 'GCGC'], ['TCTC', 'CGCG', 'TGTG', 'GGGG'], ['ATAT', 'CTCT', 'GAGA', 'GCGC', 'CTCT', 'TATA', 'AAAA', 'T




In [114]:
#To count the repeats in each indiv. seq. in a library, we can create another function:
def repeat_count(repeats):
    count = 0 #count starts at 0
    for elem in repeats:
        if type(elem) == list: #check if the element is a list
            count += repeat_count(elem) #if list, get size
        else:
            count += 1 #if not list, add 1 to count
    return count

#Writing a for loop and calling the repeat_count function
# to get the number of repeats found in each seq., as found in previous code

#For the library of 50 sequences:
repeat_counts_keys_library_two_50 = []
for repeats in repeats_keys_library_two_50:
    number = repeat_count(repeats)
    repeat_counts_keys_library_two_50.append(number)
print(repeat_counts_keys_library_two_50)

#Then, to get the total number of all repeats in all sequences in the library
total_repeats_keys_lib_two_50 = sum(repeat_counts_keys_library_two_50)
print(total_repeats_keys_lib_two_50)

[3, 5, 5, 7, 3, 4, 6, 9, 9, 5, 3, 5, 5, 5, 4, 5, 10, 3, 5, 4, 5, 4, 7, 6, 5, 2, 6, 2, 6, 4, 9, 5, 4, 3, 5, 6, 2, 8, 5, 3, 5, 3, 7, 3, 4, 1, 5, 5, 4, 5]
244


In [115]:
#For the library of 100 sequences:
repeat_counts_keys_library_two_100 = []
for repeats in repeats_keys_library_two_100:
    number = repeat_count(repeats)
    repeat_counts_keys_library_two_100.append(number)
print(repeat_counts_keys_library_two_100)

#Then, to get the total number of all repeats in all sequences in the library
total_repeats_keys_lib_two_100 = sum(repeat_counts_keys_library_two_100)
print(total_repeats_keys_lib_two_100)

[4, 3, 7, 2, 5, 6, 4, 5, 3, 3, 5, 6, 4, 3, 4, 3, 1, 3, 6, 7, 2, 2, 3, 6, 8, 2, 4, 5, 3, 5, 5, 6, 3, 4, 3, 6, 4, 4, 5, 5, 3, 6, 4, 6, 5, 9, 3, 3, 7, 9, 7, 4, 3, 6, 2, 0, 4, 4, 7, 1, 3, 8, 5, 3, 1, 5, 4, 6, 3, 5, 6, 4, 5, 6, 5, 8, 7, 3, 6, 5, 3, 3, 4, 4, 2, 5, 3, 3, 2, 4, 5, 2, 7, 9, 4, 4, 6, 4, 0, 4]
438


In [116]:
#For the library of 1000 sequences:
repeat_counts_keys_library_two_1000 = []
for repeats in repeats_keys_library_two_1000:
    number = repeat_count(repeats)
    repeat_counts_keys_library_two_1000.append(number)
print(repeat_counts_keys_library_two_1000)

#Then, to get the total number of all repeats in all sequences in the library
total_repeats_keys_lib_two_1000 = sum(repeat_counts_keys_library_two_1000)
print(total_repeats_keys_lib_two_1000)

[9, 5, 6, 9, 5, 4, 6, 8, 7, 1, 3, 4, 3, 6, 6, 3, 3, 5, 4, 5, 4, 4, 8, 8, 8, 3, 3, 3, 7, 5, 8, 5, 4, 5, 5, 6, 5, 6, 7, 5, 2, 6, 4, 7, 5, 5, 3, 6, 8, 3, 6, 7, 6, 5, 3, 4, 5, 4, 6, 3, 5, 6, 3, 6, 5, 3, 2, 7, 5, 7, 7, 4, 3, 3, 7, 6, 6, 8, 4, 2, 4, 8, 4, 5, 3, 6, 3, 4, 1, 4, 4, 6, 5, 5, 6, 5, 2, 5, 4, 5, 7, 3, 8, 3, 5, 8, 6, 3, 1, 7, 3, 3, 4, 4, 6, 2, 3, 5, 2, 4, 2, 5, 4, 3, 6, 3, 6, 2, 5, 10, 4, 3, 6, 7, 6, 5, 3, 7, 4, 7, 5, 4, 4, 2, 7, 3, 2, 3, 3, 4, 5, 7, 4, 7, 3, 7, 5, 6, 7, 1, 6, 6, 4, 6, 5, 6, 3, 4, 3, 4, 4, 6, 4, 4, 6, 6, 4, 3, 3, 11, 4, 6, 2, 5, 7, 8, 2, 4, 5, 5, 3, 8, 6, 9, 2, 3, 4, 6, 4, 5, 9, 7, 6, 3, 5, 4, 4, 3, 2, 5, 4, 6, 5, 4, 3, 5, 6, 4, 3, 3, 3, 5, 10, 7, 4, 7, 6, 3, 1, 3, 7, 2, 2, 3, 6, 6, 4, 6, 4, 4, 5, 7, 6, 4, 6, 3, 3, 2, 4, 2, 3, 4, 4, 2, 7, 3, 2, 5, 4, 5, 4, 5, 4, 3, 5, 1, 7, 5, 2, 3, 9, 4, 4, 3, 5, 3, 3, 8, 7, 1, 6, 7, 7, 6, 6, 6, 6, 4, 5, 7, 6, 5, 5, 6, 4, 2, 4, 7, 5, 5, 4, 6, 9, 5, 7, 5, 8, 4, 0, 5, 5, 3, 4, 2, 4, 5, 4, 5, 5, 2, 4, 4, 4, 2, 6, 4, 4, 3, 2, 4, 8, 2, 

# Section 3:

### Section 3.1: Chi-square testing for significant differences in number of repeats between the two versions

In [117]:
#The chi-square test I'll do is the 'test of independence'
# this uses two variables, while will be: 
# the no. of repeats in the sequences & the genetic code option the sequences were generated from

#So, this will allow us to know:
# whether the number of repeats found in the randomly generated sequences,
# is related to the genetic code option the sequences were generated from.

#Null hypothesis (H0):
# proportion of repeats found in the seq. is independent of the genetic code option
#Alternative hypothesis (H1):
# proportion of repeats found in the seq. is different for the different genetic code options

In [118]:
## Using a dataframe to store the data will likely be the easiest
# e.g., columns for whether the seq. was randomly generated just using 'keys' or w/ 'key:val' pairs
# and then rows are the dictionary the sequences were generated from


#repeat count from library of genetic code option/ dictionary 1
print("In dictionary 1, sampling keys: ")
##keys sampling for 50 sequences:
print("Repeats in 50 sequences: " + str(total_repeats_lib_one_50))
#for 100 sequences:
print("Repeats in 100 sequences: " + str(total_repeats_lib_one_100))
#For 1000 sequences:
print("Repeats in 1000 sequences: " + str(total_repeats_lib_one_1000))

##key:val pairs sampling:
print("In dictionary 1, sampling key/val pairs: ")
#For 50 sequences:
print("Repeats in 50 sequences: " + str(total_repeats_keys_lib_one_50))
#For 100 sequences:
print("Repeats in 100 sequences: " + str(total_repeats_keys_lib_one_100))
#For 1000 sequences:
print("Repeats in 1000 sequences: " + str(total_repeats_keys_lib_one_1000))

In dictionary 1, sampling keys: 
Repeats in 50 sequences: 239
Repeats in 100 sequences: 457
Repeats in 1000 sequences: 4631
In dictionary 1, sampling key/val pairs: 
Repeats in 50 sequences: 247
Repeats in 100 sequences: 477
Repeats in 1000 sequences: 4727


In [119]:
#repeat count from library of genetic code option/ dictionary 2
##keys sampling
print("In dictionary 2, sampling keys: ")
#For 50 sequences:
print("Repeats in 50 sequences: " + str(total_repeats_lib_two_50))
#For 100 sequences:
print("Repeats in 100 sequences: " + str(total_repeats_lib_two_100))
#For 1000 sequences:
print("Repeats in 1000 sequences: " + str(total_repeats_lib_two_1000))

##key:val pairs sampling
print("In dictionary 2, sampling key/val pairs: ")
#For 50 sequences:
print("Repeats in 50 sequences: " + str(total_repeats_keys_lib_two_50))
#For 100 sequences:
print("Repeats in 100 sequences: " + str(total_repeats_keys_lib_two_100))
#For 1000 sequences:
print("Repeats in 1000 sequences: " + str(total_repeats_keys_lib_two_1000))

In dictionary 2, sampling keys: 
Repeats in 50 sequences: 232
Repeats in 100 sequences: 433
Repeats in 1000 sequences: 4657
In dictionary 2, sampling key/val pairs: 
Repeats in 50 sequences: 244
Repeats in 100 sequences: 438
Repeats in 1000 sequences: 4714


In [120]:
#Putting it all into a dataframe
#Note: 'repeats 1' = repeats counted from sampled keys
# 'repeats 2' = repeats counted from sampled key:val pairs
# 'dictionary 1' & 'dictionary 2' refer to the genetic code used to generate the seq. the repeats came from

import pandas as pd

#Data frame for 50 sequences:
data_50 = {'Repeats 1': [total_repeats_lib_one_50, total_repeats_lib_two_50],
       'Repeats 2': [total_repeats_keys_lib_one_50, total_repeats_keys_lib_two_50]}

df_50 = pd.DataFrame(data_50, index = ['Dictionary 1', 'Dictionary 2'])

print("Contingency table for 50 sequences: ")
print(df_50)


#Data frame for 100 sequences:
data_100 = {'Repeats 1': [total_repeats_lib_one_100, total_repeats_lib_two_100],
       'Repeats 2': [total_repeats_keys_lib_one_100, total_repeats_keys_lib_two_100]}

df_100 = pd.DataFrame(data_100, index = ['Dictionary 1', 'Dictionary 2'])

print("Contingency table for 100 sequences: ")
print(df_100)


#Data frame for 1000 sequences:
data_1000 = {'Repeats 1': [total_repeats_lib_one_1000, total_repeats_lib_two_1000],
       'Repeats 2': [total_repeats_keys_lib_one_1000, total_repeats_keys_lib_two_1000]}

df_1000 = pd.DataFrame(data_1000, index = ['Dictionary 1', 'Dictionary 2'])

print("Contingency table for 1000 sequences: ")
print(df_1000)

Contingency table for 50 sequences: 
              Repeats 1  Repeats 2
Dictionary 1        239        247
Dictionary 2        232        244
Contingency table for 100 sequences: 
              Repeats 1  Repeats 2
Dictionary 1        457        477
Dictionary 2        433        438
Contingency table for 1000 sequences: 
              Repeats 1  Repeats 2
Dictionary 1       4631       4727
Dictionary 2       4657       4714


In [121]:
#Doing chi-square test of independence

#First, install packages required
import scipy
from scipy.stats import chi2
import scipy.stats as stats

#Null hypothesis (H0):
# proportion of repeats found in the seq. is independent of the genetic code option
# i.e.: there's no relationship between the repeat count and the genetic code option

#Alternative hypothesis (H1):
# proportion of repeats found in the seq. is different for the different genetic code options
# i.e.: the repeat count depends on the genetic code option

In [122]:
#Doing chi-square test for 50 randomly generated sequences:

chi_sq_results_50 = stats.chi2_contingency(df_50)
print(chi_sq_results_50)

#The information is returned within a tuple where the first value is the 
# chi-squared test static, the second value is the p-value, and the third number is the degrees of freedom. 
# An array is also returned which contains the expected cell counts.

#For 50 sequences from each:
## first run:
#so, chi-squared test statistic = 2.16
# p-value = 0.14 (> 0.05, so fail to reject H0)
# degrees of freedom = 1
## second run:
# x2: 1.12
# p-val: 0.29
## third run:
# x2: 1.18
# p-val: 0.27

###
#Now, we need to know the critical chi-square value 
# for signif. level of 0.05 and degrees of freedom of 1
import scipy.stats

chi_sq_critical = scipy.stats.chi2.ppf(1-.05, df=1)
print(chi_sq_critical)

#The chi-sq values calculated were 2.16, 1.12, and 1.18
#The chi-sq critical value is 3.84

#The calculated value is smaller than the critical value
# so, we fail to reject the null hypothesis H0 [we accept H0]

(0.005070122972241408, 0.9432347644584346, 1, array([[237.94802495, 248.05197505],
       [233.05197505, 242.94802495]]))
3.841458820694124


In [123]:
#Doing chi-square test for 100 randomly generated sequences:

chi_sq_results_100 = stats.chi2_contingency(df_100)
print(chi_sq_results_100)

#The information is returned within a tuple where the first value is the 
# chi-squared test static, the second value is the p-value, and the third number is the degrees of freedom. 
# An array is also returned which contains the expected cell counts.

#For 100 sequences from each:
## first run:
#so, chi-squared test statistic = 0.14
# p-value = 0.7 (> 0.05, so fail to reject H0)
# degrees of freedom = 1
## second run:
# x2: 0.03
# p-val: 0.86
## third run:
# x2: 0.14
# p-val: 0.7

###
#Now, we need to know the critical chi-square value 
# for signif. level of 0.05 and degrees of freedom of 1
import scipy.stats

chi_sq_critical = scipy.stats.chi2.ppf(1-.05, df=1)
print(chi_sq_critical)

#The chi-sq values calculated were 0.14, 0.03, and 0.14
#The chi-sq critical value is 3.84

#The calculated value is smaller than the critical value
# so, we fail to reject the null hypothesis H0 [we accept H0]

(0.08159670371136754, 0.7751451563731889, 1, array([[460.53185596, 473.46814404],
       [429.46814404, 441.53185596]]))
3.841458820694124


In [124]:
#Doing chi-square test for 1000 randomly generated sequences:

chi_sq_results_1000 = stats.chi2_contingency(df_1000)
print(chi_sq_results_1000)

#The information is returned within a tuple where the first value is the 
# chi-squared test static, the second value is the p-value, and the third number is the degrees of freedom. 
# An array is also returned which contains the expected cell counts.

#For 1000 sequences from each:
## first run:
#so, chi-squared test statistic = 0.25
# p-value = 0.61
# degrees of freedom = 1
## second run:
# x2: 0.55
# p-val: 0.45
## third run:
# x2: 0.45
# p-val: 0.5

###
#Now, we need to know the critical chi-square value 
# for signif. level of 0.05 and degrees of freedom of 1

import scipy.stats

chi_sq_critical = scipy.stats.chi2.ppf(1-.05, df=1)
print(chi_sq_critical)

#The chi-sq values calculated were 0.25, 0.55, and 0.45
#The chi-sq critical value is 3.84

#The calculated value is smaller than the critical value
# so, we fail to reject the null hypothesis H0 [we accept H0]

(0.07352034571764632, 0.7862784704959056, 1, array([[4640.77654974, 4717.22345026],
       [4647.22345026, 4723.77654974]]))
3.841458820694124


### Section 3.2: summaries of chi-square tests

#### Comparing repeat counts of library of randomly generated sequences, with 50 seq. generated from each genetic code option via either random sampling of keys, or random sampling of key:value pairs
 - The calculated value of X2 (chi-square statistic) was smaller than the critical value of X2 in all three runs
 - The p-value was also always larger than 0.05 in all three runs (> 0.05, so fail to reject H0)
 - Therefore, we fail to reject the null hypothesis (H0)
 -  So, there is no relationship between the repeat count of the sequences and the genetic code option the sequences were generated from; repeat count and genetic code option are independent

#### This means, that neither genetic code option gives more repeats than the other, so I can choose either to use

#### Comparing repeat counts of library of randomly generated sequences, with 100 seq. generated from each genetic code option via either random sampling of keys, or random sampling of key:value pairs
 - The calculated value of X2 (chi-square statistic) was smaller than the critical value of X2 in all three runs
 - The p-value was also always larger than 0.05 in all three runs (> 0.05, so fail to reject H0)
 - Therefore, we fail to reject the null hypothesis (H0)
 -  So, there is no relationship between the repeat count of the sequences and the genetic code option the sequences were generated from; repeat count and genetic code option are independent

#### This means, that neither genetic code option gives more repeats than the other, so I can choose either to use

#### Comparing repeat counts of library of randomly generated sequences, with 1000 seq. generated from each genetic code option via either random sampling of keys, or random sampling of key:value pairs
 - The calculated value of X2 (chi-square statistic) was smaller than the critical value of X2 in all three runs
 - THe p-value was also always larger than 0.05 in all three runs (> 0.05, so fail to reject H0)
 - Therefore, we fail to reject the null hypothesis (H0)
 -  So, there is no relationship between the repeat count of the sequences and the genetic code option the sequences were generated from; repeat count and genetic code option are independent

#### This means, that neither genetic code option gives more repeats than the other, so I can choose either to use