# Primer Design Script

### Given a consensus sequence, this script will examine and present the possible primers with given parameters

In [1]:
import os
import Bio
from Bio import SeqIO
import re
import pandas as pd
import numpy as np

In [3]:
os.chdir("/home/ekinbasar/Spring_21-22/bioeksen_biotech/consensus_genome/HAV")

In [4]:
seq_object = SeqIO.read("hav_95_cons.fasta", "fasta")

In [5]:
sequence = str(seq_object.seq)

In [159]:
# sequence = str(sequence[0:500]) # -> short sequence extraction as a sample

## Primer Design - String Approach

    Instead of using the previous primer design approach that uses windows to analyse a sequence, I thought it might be beneficial to examine the string data with basic python functions.

    -> Following cells use various built-in functions and list comprehension 

In [6]:
length_cut_off = 18 # decide the minimum length we are looking for in the sequences without N
                    # the sequences shorter than this length are ignored

matches = sequence.split("N")
matches=[x for x in matches if len(x)>=length_cut_off]#decide the shortest length that to be included to the list

In [7]:
# get the lengths of the matching substrings from the "matches" list
length_list = []

for sequences in matches:
    length_list.append(len(sequences))

In [9]:
# get the indexes of the matching substrings from the "matches" list
index_list = []
for i in matches:
    index = sequence.index(i)
    index_list.append(index)

In [11]:
# the distance between consecutive elements 
# -> the distance between the elemenet and the next one (sequence-wise sorting)  
distance_list = [x - index_list[i - 1] for i, x in enumerate(index_list)][1:]
# the 0 is added at the end because the last sequence has no subsequent sequence, its the last one :)
distance_list.append(0)

### Tm Calculation

From a websourse, the equation is suggested for sequences longer than 14 bases:

        Tm= 64.9 +41*(yG+zC-16.4)/(wA+xT+yG+zC)
        
        
        
    Example:

            ATAGCGTTGCTGCTCCGAAATT

            -> 64.9 + 41*(10-16.4)/22 = 52.97272727272728
            
            Tool predicts 53

In [49]:
def tm_calculator(sequence):
   
    g_c_count = sequence.count("C") + sequence.count("G")

    tm = float("%.2f" % (64.9 + 41 * (g_c_count - 16.4) / len(sequence))) # 2 numbers are allowed after the dot
    
    return tm

In [50]:
tm_list = []

for i in matches:
    tm_list.append(tm_calculator(i))

        Create a dataframe to see how it's going

In [51]:
sequence_dictionary = {"Matches":matches,"Lengths":length_list,"Start Position":index_list,"Tm":tm_list, "Distance(between consecutive start points)[Subsequent - First]":distance_list}

In [52]:
sequence_df = pd.DataFrame(sequence_dictionary)

In [53]:
sequence_df["Tm/Length"] = pd.to_numeric(sequence_df["Tm"]) / sequence_df["Lengths"] # add the Tm / Length column

In [54]:
sequence_df

Unnamed: 0,Matches,Lengths,Start Position,Tm,Distance(between consecutive start points)[Subsequent - First],Tm/Length
0,TTCAAGAGGGGTCTCCGGAGTTTTCCGGA,29,0,64.33,47,2.218276
1,TGGTGAGGGGACTTGATACCTCACCGCCGTTTGCCTAGGCTATAGG...,50,47,73.59,88,1.471800
2,TTGTTTGTAAATATTAATTCCTGCAGGTTCAGGGTTCTT,39,135,61.33,87,1.572564
3,CTTTCTTCCAGGGCTCTCCCCTTGCCCTAGGCTCTGGCCGTTGCGC...,61,222,81.43,68,1.334918
4,TAGCATGGAGCTGTAGGAGTCTAAATTGGGGAC,33,290,64.40,52,1.951515
...,...,...,...,...,...,...
67,TGAGTTTTATCAGAAATT,18,7296,36.66,19,2.036667
68,TATTATTTTGTTCAGTCCTG,20,7315,43.58,49,2.179000
69,CTTAAATCTTATGATTGGTGGAGAATGAGATTTTATGACCAGTG,44,7364,63.60,45,1.445455
70,TTCATTTGTGACCTTTCATGATTTG,25,7409,51.12,26,2.044800


    Until now, we have the lists of:
        Matching Sequences
        Lengths
        Start Position (index) 
        Tm
        Distance from the starting point of one match to the next match's starting point 
        Tm / Length ration of the sequences
        
        Next, analyse the combined sequences pair-wise and as triad :)

# Combined Two Consecutive Sequences 

In [28]:
# get two combined sequences that satisfy the conditions

# this cell requires a clean up + check the Tm's of the sequences -> Tm bottomline is 50; drop if less

for i in range(len(distance_list)-1):

    distance = distance_list[i]
    
    
    if distance < 200: 
        
        if float(tm_list[i]) and float(tm_list[i+1]) > 50:
            
            starting_sequence_index = index_list[i]
            starting_sequence = matches[distance_list.index(distance)]
            subsequent_sequence = matches[i+1]
            subsequent_sequence_length = length_list[int(i)+1]
            partition_length = distance + subsequent_sequence_length
            combined_sequence = sequence[starting_sequence_index:(starting_sequence_index+partition_length)]
            
            # add the sequences to a list or directly analyse them 

In [29]:
print(combined_sequence)

TTCATTTGTGACCTTTCATGATTTGNTTAAACAAATTTTCTTAAAATTTCTGAGGTTTGTTTATTTCTTTTATCAGTAAATAAAAAAAAAAAAAAAAAAAA


# Combined Three Consecutive Sequences 

      # We can specify an amplicon size upper limit and display the ones that satisfy.
      
      # Similarly, we can change the distance (between starting points) condition
      
      # I added the Tm lower limit condition from the excel sheet (Şükrü Bey - Design)

In [30]:
distance_list_1 = [x - index_list[i - 2] for i, x in enumerate(index_list)][1:]
distance_list_1.pop(0)
distance_list_1.append(0)
distance_list_1.append(0)

# two zeros added at the end of the list
# this list works like that: It has the distance between the index(starting point) of the corresponding sequence 
# and the sequence two index later

# Yani ilk sıradaki element, aslında üçüncünün birinciden farkı, 
# o yüzden son iki element 0 çünkü onlardan 2 sonraki elementler diye bir şey yok

In [59]:
# get three combined sequences that satisfy the conditions

triplet_count = 0

for i in range(len(distance_list_1)-2):

    distance = distance_list_1[i]
    
    
    if distance < 200:     
        
        if float(tm_list[i]) > 50.0:
            if float(tm_list[i+1]) > 50.0:
                if float(tm_list[i+2]) > 50.0:
            
                    starting_sequence_index = index_list[i]

                    third_sequence_length = length_list[int(i)+2]
                    partition_length = distance + third_sequence_length 
                    combined_sequence = sequence[starting_sequence_index:(starting_sequence_index+partition_length)]

                    print("The Tm's of the sequences are (respectively):",tm_list[i],tm_list[i+1],tm_list[i+2])
                    print("The length of the combined sequences (amplicon size) is:", partition_length)
                    print("The N count for the combined sequence is:",combined_sequence.count("N"))
                    print(combined_sequence,"\n")
                    
                    triplet_count = triplet_count + 1
        
print("\nThe viable sequence count is:", triplet_count)      
        # Partition Length -> len(p)
        # len(p) = Index(3rd) - Index(1st) + len(3rd)

The Tm's of the sequences are (respectively): 64.33 73.59 61.33
The length of the combined sequences (amplicon size) is: 174
The N count for the combined sequence is: 11
TTCAAGAGGGGTCTCCGGAGTTTTCCGGANCCCCTCTTGGAAGTCCNTGGTGAGGGGACTTGATACCTCACCGCCGTTTGCCTAGGCTATAGGCTAANTTTCCCTTTCCCTGTCCNTNNCNNATTNCCNTTTGTNTTGTTTGTAAATATTAATTCCTGCAGGTTCAGGGTTCTT 

The Tm's of the sequences are (respectively): 73.59 61.33 81.43
The length of the combined sequences (amplicon size) is: 236
The N count for the combined sequence is: 15
TGGTGAGGGGACTTGATACCTCACCGCCGTTTGCCTAGGCTATAGGCTAANTTTCCCTTTCCCTGTCCNTNNCNNATTNCCNTTTGTNTTGTTTGTAAATATTAATTCCTGCAGGTTCAGGGTTCTTNAATCTGTTTCTCTATAANAACACTCANNTTTTNACGCTTTCTGTCTNCTTTCTTCCAGGGCTCTCCCCTTGCCCTAGGCTCTGGCCGTTGCGCCCGGCGGGGTCAACT 

The Tm's of the sequences are (respectively): 61.33 81.43 64.4
The length of the combined sequences (amplicon size) is: 188
The N count for the combined sequence is: 8
TTGTTTGTAAATATTAATTCCTGCAGGTTCAGGGTTCTTNAATCTGTTTCTCTATAANAACACTCANNTTTTNACG

# Combined Four Consecutive Sequences 

In [32]:
distance_list_2 = [x - index_list[i - 3] for i, x in enumerate(index_list)][1:]
distance_list_2.pop(0)
distance_list_2.pop(0)
distance_list_2.append(0)
distance_list_2.append(0)
distance_list_2.append(0)

In [58]:
# get three combined sequences that satisfy the conditions

quadruplet_count = 0 

for i in range(len(distance_list_1)-3):

    distance = distance_list_1[i]
    
    
    if distance < 400: 
        
         if tm_list[i] > 50:
            if tm_list[i+1] > 50:
                if tm_list[i+2] > 50:
                    if tm_list[i+3] > 50:
        
                        fourth_sequence_length = length_list[int(i)+3]
                        partition_length = distance + fourth_sequence_length 
                        combined_sequence = sequence[starting_sequence_index:(starting_sequence_index+partition_length)]

                        print("The Tm's of the sequences are (respectively):",tm_list[i],tm_list[i+1],tm_list[i+2],tm_list[i+3])
                        print("The length of the combined sequences (amplicon size) is:", partition_length)

                        print(combined_sequence,"\n")


                        quadruplet_count = quadruplet_count + 1
        
print("\nThe viable sequence count is:", quadruplet_count) 
        
        # Partition Length -> len(p)
        # len(p) = Index(4rd) - Index(1st) + len(4rd)

The Tm's of the sequences are (respectively): 64.33 73.59 61.33 81.43
The length of the combined sequences (amplicon size) is: 196
CTTAAATCTTATGATTGGTGGAGAATGAGATTTTATGACCAGTGNTTCATTTGTGACCTTTCATGATTTGNTTAAACAAATTTTCTTAAAATTTCTGAGGTTTGTTTATTTCTTTTATCAGTAAATAAAAAAAAAAAAAAAAAAAA 

The Tm's of the sequences are (respectively): 73.59 61.33 81.43 64.4
The length of the combined sequences (amplicon size) is: 208
CTTAAATCTTATGATTGGTGGAGAATGAGATTTTATGACCAGTGNTTCATTTGTGACCTTTCATGATTTGNTTAAACAAATTTTCTTAAAATTTCTGAGGTTTGTTTATTTCTTTTATCAGTAAATAAAAAAAAAAAAAAAAAAAA 

The Tm's of the sequences are (respectively): 61.33 81.43 64.4 52.4
The length of the combined sequences (amplicon size) is: 176
CTTAAATCTTATGATTGGTGGAGAATGAGATTTTATGACCAGTGNTTCATTTGTGACCTTTCATGATTTGNTTAAACAAATTTTCTTAAAATTTCTGAGGTTTGTTTATTTCTTTTATCAGTAAATAAAAAAAAAAAAAAAAAAAA 

The Tm's of the sequences are (respectively): 81.43 64.4 52.4 58.84
The length of the combined sequences (amplicon size) is: 143
CTTAAATCTTATGATTGGTGGAGAATGAGATTTT

# Report 
    Format for the consecutive sequence analysis should be discussed both for following Primer3 analysis and visualization of the results.

In [169]:
# This for loop is designed to print out the two consecutive sequence analysis but this cell should be adjusted to 
# present the full results of the analysis


# Printed results are nice here, can be used later on

for i in range(len(distance_list)-1):

    distance = distance_list[i]
    
    if distance < 200:
        
        if float(tm_list[i]) and float(tm_list[i+1]) > 50:
            
            starting_sequence_index = index_list[i]

            starting_sequence = matches[distance_list.index(distance)]

            subsequent_sequence = matches[i+1]

            subsequent_sequence_length = length_list[int(i)+1]

            partition_length = distance + subsequent_sequence_length

            combined_sequence = sequence[starting_sequence_index:(starting_sequence_index+partition_length)]
            # the combined sequence is acquired by indexing the whole sequence with the starting position of the first 
            # sequence and the sum up of the starting position (indexed from the whole sequence)of the first sequence 
            # and the length of the second sequence


            print("The first sequence's index is:",starting_sequence_index)
            print("The first sequence is:",starting_sequence)
            print("The Tm of the first sequence is:", tm_list[i])
            print("The length of the first sequence is:",length_list[i])
            print("The starting point distance between first and the second sequence is:",distance)
            print("The second sequence is:",subsequent_sequence)
            print("The Tm of the second sequence is:", tm_list[i])
            print("The length of the second sequence is:",subsequent_sequence_length)
            print("The length of the two sequences combined is:",partition_length)
            print("The N count for the combined sequence is:",combined_sequence.count("N"))
            print("The combined sequence for two consecutive sequences from the consensus genome is:\n", combined_sequence,"\n")
              

The first sequence's index is: 0
The first sequence is: TTCAAGAGGGGTCTCCGGAGTTTTCCGGA
The Tm of the first sequence is: 64.33
The length of the first sequence is: 29
The starting point distance between first and the second sequence is: 47
The second sequence is: TGGTGAGGGGACTTGATACCTCACCGCCGTTTGCCTAGGCTATAGGCTAA
The Tm of the second sequence is: 64.33
The length of the second sequence is: 50
The length of the two sequences combined is: 97
The N count for the combined sequence is: 2
The combined sequence for two consecutive sequences from the consensus genome is:
 TTCAAGAGGGGTCTCCGGAGTTTTCCGGANCCCCTCTTGGAAGTCCNTGGTGAGGGGACTTGATACCTCACCGCCGTTTGCCTAGGCTATAGGCTAA 

The first sequence's index is: 47
The first sequence is: TGGTGAGGGGACTTGATACCTCACCGCCGTTTGCCTAGGCTATAGGCTAA
The Tm of the first sequence is: 73.59
The length of the first sequence is: 50
The starting point distance between first and the second sequence is: 88
The second sequence is: TTGTTTGTAAATATTAATTCCTGCAGGTTCAGGGTTCTT
The Tm 