**NAME** : ARAVIND ELANGOVAN

**UBNO** : 22031295

**TITLE** : PROBLEM-2 [Primer suitability using the sliding window technique]

**INTRODUCTION:**

Designing an effective primer is an important task in bioinformatics, because primers are used in many applications such as PCR amplification, DNA sequencing and other molecular diagnostics. Primers are short DNA sequences designed to bind specifically to an intended target site within a genome (Figure 1). A poorly designed primer can bind to multiple locations in the genome, which leads to off-target amplification and ambiguous sequencing results (Ye et al. 2012).

![image.png](attachment:bf600ea2-6fc2-4e7f-a105-5559080015e3.png) Figure 1: Representation of forward and reverse primer binding (Addgene 2025)

To address this problem, a computational approach written in Python was used to evaluate primer suitability by identifying all potential binding sites for a given primer on the target sequences. This helps to estimate the likelihood of mispriming and gives an indication of the primer's specificity.

Several techniques are available for this task. One is to build a standalone local database of the target sequences and BLAST the primer against it. Another is fuzzy search, which uses approximate matching to find terms similar to the search query within an allowed edit distance. This problem was instead solved with a basic alignment strategy called the “sliding window technique” (Figure 2), which compares the primer with each primer-length stretch of the target sequence in turn, like sliding a door across a window (GeeksforGeeks 2017). This method was chosen for its simplicity and its short run time. The primary aim is to use the sliding window technique to find all possible primer binding sites across the target sequences while allowing a set number of mismatches.

![image.png](attachment:42d37ba3-3851-47cf-bf25-4dfe0c0bd7f9.png) Figure 2: Representation of the sliding window technique (GeeksforGeeks 2017)
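
The idea in Figure 2 can be sketched on toy data. This is a minimal illustration only: the `primer` and `target` strings below are hypothetical and are not taken from the assignment files. A target of length n and a primer of length m give n - m + 1 windows, and each window is scored by counting the positions at which it differs from the primer.

```python
def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

primer = "ACGT"        # toy primer (hypothetical)
target = "TTACGAACGT"  # toy target (hypothetical)

# every primer-length window of the target, from left to right
windows = [target[i:i + len(primer)] for i in range(len(target) - len(primer) + 1)]
scores = [hamming(primer, w) for w in windows]

for start, (window, score) in enumerate(zip(windows, scores)):
    print(start, window, score)
```

Here the 10-base target yields 7 windows, and the last one matches the toy primer exactly.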


**LOGICAL STEPS** :

1.	Create a first function that parses the primer FASTA file and returns the primer sequence as an uppercase string.

2.	Create a second function that takes the forward primer string and returns its reverse complement.


3.	Create a third function that parses the target FASTA file and returns each target sequence as an uppercase string together with its ID.

4.	Create a fourth, central function that takes the primer, the targets and the allowed number of mismatches as input, uses the sliding window algorithm to compare the primer with each window of each target sequence, and returns the mismatch positions, the nucleotide changes and the mismatch count.

5.	Finally, run all four functions and present the results for the forward and reverse primers according to the number of mismatches allowed by the user.


**METHODS**:

For this primer suitability problem, two FASTA files were provided, one containing the primer and one containing the target sequences. In the first part of the code, the SeqIO module and the Seq class are imported from the Biopython library. Two variables are then assigned the file paths of the primer and target files respectively.

(Note: if Jupyter reports an undefined function error while the code is running, use Shift + Enter to execute each code cell in order.)

In [39]:
#code : step 1 importing modules and setting the paths of the primer and target files
from Bio import SeqIO
from Bio.Seq import Seq
primer_file = r"C:\USERS\aravi\Downloads\Compressed\primer.fasta" #location of the primer file
target_file = r"C:\USERS\aravi\Downloads\Compressed\test_sequences.fasta" #location of the target file



Two functions are then defined. The first takes the location of the primer file, parses it with the SeqIO module, and returns the primer sequence as an uppercase string. The second reverse-complements the forward primer: it takes the forward primer string, wraps it in the imported Seq class so that it is treated as a nucleotide sequence, calls the reverse_complement method of that class, and returns the result.

In [28]:

#step 2 defining functions for reading the primer and for the reverse complement
def primer_read(primer):
    ''' read the primer sequence and make it upper case'''
    for record in SeqIO.parse(primer, "fasta"):  # use of the SeqIO module to parse the primer file
        justprimerseq = str(record.seq.upper())  # making the sequence upper case and converting to string
        return justprimerseq  # only the first record in the file is used
    raise ValueError("no FASTA record found in the primer file")  # reached only if the file has no records

forwardprimer = primer_read(primer_file)  # performing the function and storing it in a variable

#reverse complement of the primer
def reverse_complement(pseq):  # pseq in string format
    ''' reverse complement of the primer '''
    revcompseq = Seq(pseq).reverse_complement()  # Seq tells Biopython the string is a nucleotide sequence, so its reverse_complement method can be called directly
    return str(revcompseq)  # converted back to a string so that it has the same type as the forward primer

reverseprimer = reverse_complement(forwardprimer)  # performing the function and storing it in a variable
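
As a cross-check on the Biopython result, the reverse complement can also be computed with plain string operations. The sketch below is an illustration under assumptions, not the method used in the cell above: `reverse_complement_plain` is a hypothetical helper, it handles only A, C, G, T and N, and the forward primer string is copied from the results output.

```python
# translation table for the complement of each base (only A, C, G, T, N are handled)
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement_plain(seq):
    """Complement every base, then reverse the string."""
    return seq.upper().translate(COMPLEMENT)[::-1]

forward = "ATTACATGGTTTACAACTTT"  # forward primer as printed in the results
reverse = reverse_complement_plain(forward)
print(reverse)  # AAAGTTGTAAACCATGTAAT
```

The printed string agrees with the reverse primer shown in the results, and applying the function twice returns the original primer.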

A third function is created in the same way to read the targets. It takes the location of the target file, parses it with the same method used for the primer, and returns a list of (ID, sequence) pairs. Each function is called and its result stored in a variable, because the problem requires comparing both the forward and the reverse primer against the target sequences.

In [31]:
#step 3: defining a function to read the target
def target_read(targetseq):
    ''' reads the target sequences and stores ID and sequence in a list '''
    target = []  # creating an empty target list to store the results
    for record in SeqIO.parse(targetseq, "fasta"):  # use of the SeqIO module to parse the sequences
        target.append((record.id, str(record.seq).upper()))  # making the sequence upper case and converting to string
    return target  # returning outside of the loop to get every target, not just the first one

ttarget = target_read(target_file)  # performing the function and storing it in a variable


With the three functions for reading the primer, reverse-complementing it and reading the targets in place, the sliding window approach is implemented in a fourth function. It takes the primer, the list of target sequences and the maximum number of mismatches allowed as input.

In [86]:
# step 4 : defining a function for the sliding window alignment technique

forwardprimer_results = []  # creating an empty list to hold the matched windows for the forward primer
reverseprimer_results = []  # creating an empty list to hold the matched windows for the reverse primer

def simple_match(primer, target, max_mm):
    ''' takes the primer, the list of (ID, sequence) targets and the maximum mismatches allowed;
    slides a primer-length window along each target and records every window with at most max_mm mismatches'''
    results = []  # empty list to store the results
    primerlen = len(primer)  # length of the primer
    for seq_id, sequence in target:  # using a for loop to separate target ID and target sequence
        targetlen = len(sequence)  # length of the target sequence
        # the last valid start is targetlen - primerlen, so the window always fits inside the target
        for startpos in range(targetlen - primerlen + 1):
            endpos = startpos + primerlen  # end position of the window (exclusive)
            part = sequence[startpos:endpos]  # primer-sized window of the target
            mismatch_count = 0  # initiating the mismatch counter for this window
            mismatch_info = []  # empty list to append the mismatch information
            for i in range(primerlen):  # going through each index of the primer
                if primer[i] != part[i]:  # comparing primer and window at the same index
                    mismatch_count = mismatch_count + 1  # incrementing the counter on a mismatch
                    mismatch_info.append((i + 1, primer[i], part[i]))  # 1-based position, primer nucleotide, target nucleotide
            if mismatch_count <= max_mm:  # keep the window only if it is within the allowed mismatches
                # relies on the global forwardprimer from step 2 to tell the two primers apart
                if str(primer) == forwardprimer:
                    forwardprimer_results.append(part)  # matched window for the forward primer
                else:
                    reverseprimer_results.append(part)  # matched window for the reverse primer
                results.append({"Sequence ID": seq_id,
                                "Start": startpos,  # 0-based start position in the target
                                "End": endpos,
                                "MMcount": mismatch_count,
                                "mismatches": mismatch_info})
    return results  # returning outside the loops to get the matches from every target
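
A variant of the same scan that does not depend on module-level lists can be sketched as follows. This is an illustrative alternative under assumptions, not the function used for the results in this report: `find_sites` and the toy target are hypothetical, the matched window is stored inside each result under a `"Matched"` key, and the inner comparison stops early once a window exceeds `max_mm` mismatches.

```python
def find_sites(primer, targets, max_mm):
    """Return every primer-length window with at most max_mm mismatches.

    targets is a list of (sequence_id, sequence) pairs, as produced in step 3.
    """
    primer = str(primer).upper()
    plen = len(primer)
    hits = []
    for seq_id, sequence in targets:
        for start in range(len(sequence) - plen + 1):
            window = sequence[start:start + plen]
            mismatches = []
            for pos, (p, t) in enumerate(zip(primer, window), start=1):
                if p != t:
                    mismatches.append((pos, p, t))  # 1-based primer position
                    if len(mismatches) > max_mm:
                        break  # this window can no longer qualify
            if len(mismatches) <= max_mm:
                hits.append({"Sequence ID": seq_id,
                             "Start": start,          # 0-based, inclusive
                             "End": start + plen,     # exclusive
                             "Matched": window,
                             "MMcount": len(mismatches),
                             "mismatches": mismatches})
    return hits

toy_targets = [("toy_1", "GGACGTTTACCTGG")]  # hypothetical data
hits = find_sites("ACGT", toy_targets, 1)
for hit in hits:
    print(hit["Start"], hit["Matched"], hit["MMcount"], hit["mismatches"])
```

Because each result carries its own matched window, the same function can be called for the forward and the reverse primer without any shared state between the two calls.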





**RESULTS**:

The user is prompted for the number of mismatches allowed between the primer and the target sequences. The results show that binding sites are first reported at a threshold of 8 allowed mismatches, for both the forward and the reverse primer. All possible binding sites on the target sequences at that threshold are listed in the output below.

In [116]:
#step 5 getting the results for forward and reverse primer

setmismatch_forward = int(input("set your forward mismatch:"))  # asking the mismatch input for the forward primer, int because it is a number
forwardprimer_results.clear()  # drop matches left over from an earlier run of this cell
results = simple_match(forwardprimer, ttarget, setmismatch_forward)  # performing the sliding window function

print("FORWARD PRIMER MATCHES :")
for res, matched in zip(results, forwardprimer_results):  # each result is paired with its matched window
    print(f"forward primer : {forwardprimer}")
    print(f"Matched result : {matched}")
    print("Target ID:", res['Sequence ID'])
    print("Start:", res['Start'])
    print("End:", res['End'])
    print("Mmcount", res['MMcount'])
    print("Mismatches (position, primer, targetpart seq):", res['mismatches'])
    print()

# Settings for reverse primer
setmismatch_reverse = int(input("Set your reverse mismatch: "))
print("REVERSE PRIMER MATCHES :")
reverseprimer_results.clear()  # drop matches left over from an earlier run of this cell
results = simple_match(reverseprimer, ttarget, setmismatch_reverse)
for res, matched in zip(results, reverseprimer_results):  # each result is paired with its matched window
    print(f"reverse primer : {reverseprimer}")
    print(f"Matched result : {matched}")
    print("Target ID:", res['Sequence ID'])
    print("Start:", res['Start'])
    print("End:", res['End'])
    print("Mmcount", res['MMcount'])
    print("Mismatches (position, primer, targetpart seq):", res['mismatches'])
    print()

set your forward mismatch: 8


FORWARD PRIMER MATCHES :
forward primer : ATTACATGGTTTACAACTTT
Matched result : ATGATGTCCTTTACACTTTA
Target ID: Capsella_grandiflora_(Cgr-B-H8-16H07)
Start: 178
End: 198
Mmcount 8
Mismatches (position, primer, targetpart seq): [(3, 'T', 'G'), (5, 'C', 'T'), (6, 'A', 'G'), (8, 'G', 'C'), (9, 'G', 'C'), (16, 'A', 'C'), (17, 'C', 'T'), (20, 'T', 'A')]

forward primer : ATTACATGGTTTACAACTTT
Matched result : ATGATGTCCTTTACATTTTA
Target ID: Arabidopsis_lyrata_(AF328999.2)_SRK139
Start: 55
End: 75
Mmcount 8
Mismatches (position, primer, targetpart seq): [(3, 'T', 'G'), (5, 'C', 'T'), (6, 'A', 'G'), (8, 'G', 'C'), (9, 'G', 'C'), (16, 'A', 'T'), (17, 'C', 'T'), (20, 'T', 'A')]

forward primer : ATTACATGGTTTACAACTTT
Matched result : CCTACAAGTTTTACCGCCTA
Target ID: Arabidopsis_lyrata_(AF328999.2)_SRK139
Start: 883
End: 903
Mmcount 8
Mismatches (position, primer, targetpart seq): [(1, 'A', 'C'), (2, 'T', 'C'), (7, 'T', 'A'), (9, 'G', 'T'), (15, 'A', 'C'), (16, 'A', 'G'), (18, 'T', 'C'), (20, 'T', 

Set your reverse mismatch:  8


REVERSE PRIMER MATCHES :
reverse primer : AAAGTTGTAAACCATGTAAT
Matched result : AAAGTTTTGTGCCGAGTTGT
Target ID: Capsella_grandiflora_(Cgr-B-H8-16H07)
Start: 337
End: 357
Mmcount 8
Mismatches (position, primer, targetpart seq): [(7, 'G', 'T'), (9, 'A', 'G'), (10, 'A', 'T'), (11, 'A', 'G'), (14, 'A', 'G'), (15, 'T', 'A'), (18, 'A', 'T'), (19, 'A', 'G')]

reverse primer : AAAGTTGTAAACCATGTAAT
Matched result : AAACTTCGATTTCTTGTACT
Target ID: Capsella_grandiflora_(Cgr-B-H8-16H07)
Start: 880
End: 900
Mmcount 8
Mismatches (position, primer, targetpart seq): [(4, 'G', 'C'), (7, 'G', 'C'), (8, 'T', 'G'), (10, 'A', 'T'), (11, 'A', 'T'), (12, 'C', 'T'), (14, 'A', 'T'), (19, 'A', 'C')]

reverse primer : AAAGTTGTAAACCATGTAAT
Matched result : AGAGTTTTGTCCCCAGTTAT
Target ID: Arabidopsis_lyrata_(AF328999.2)_SRK139
Start: 214
End: 234
Mmcount 8
Mismatches (position, primer, targetpart seq): [(2, 'A', 'G'), (7, 'G', 'T'), (9, 'A', 'G'), (10, 'A', 'T'), (11, 'A', 'C'), (14, 'A', 'C'), (15, 'T', 'A'), (18
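
The first forward hit in the output can be checked by hand with a few lines of Python. The sketch below is only a cross-check under assumptions: `mismatch_positions` and `min_mismatches` are hypothetical helpers, and the two strings are copied from the output above. The second helper shows one way to find the lowest threshold at which a target would report a hit, instead of raising the threshold by trial and error.

```python
def mismatch_positions(primer, window):
    """List (1-based position, primer base, target base) for each mismatch."""
    return [(i, p, t) for i, (p, t) in enumerate(zip(primer, window), start=1) if p != t]

def min_mismatches(primer, sequence):
    """Smallest mismatch count over all primer-length windows, or None if none fit."""
    plen = len(primer)
    best = None
    for start in range(len(sequence) - plen + 1):
        count = sum(1 for p, t in zip(primer, sequence[start:start + plen]) if p != t)
        if best is None or count < best:
            best = count
    return best

forward = "ATTACATGGTTTACAACTTT"   # forward primer from the output
reported = "ATGATGTCCTTTACACTTTA"  # first matched window from the output
found = mismatch_positions(forward, reported)
print(len(found))              # 8
print([m[0] for m in found])   # [3, 5, 6, 8, 9, 16, 17, 20]
```

The count and the positions agree with the first forward result listed above.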

**DISCUSSION:**

Initially, a fuzzy search using Levenshtein distance was the first choice for solving this primer suitability problem. It was later dropped after several difficulties: problems importing the required modules, long run times, more complex code, and results that were not as expected.

The problem was instead solved with a basic alignment strategy, the “sliding window technique”. It compares the primer with every primer-length stretch of each target sequence, one position at a time, like sliding a door across a window (Figure 2). Compared with fuzzy searching, this technique takes a minimal approach while still being effective and quicker to run.
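
The difference between the two approaches can be illustrated with a small sketch. This is not the fuzzy-search code that was originally attempted; `hamming` and `levenshtein` below are hypothetical pure-Python helpers. The sliding window counts substitutions only (Hamming distance), at a cost proportional to the primer length per window, whereas Levenshtein distance also allows insertions and deletions and needs a dynamic-programming table whose size is the product of the two lengths.

```python
def hamming(a, b):
    """Substitutions only; both strings must have the same length."""
    if len(a) != len(b):
        raise ValueError("hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a, b):
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]

# the second string is the first shifted by one base (toy data)
print(hamming("ACGTAC", "CGTACA"))      # 6
print(levenshtein("ACGTAC", "CGTACA"))  # 2
```

A one-base shift makes every position a mismatch for the sliding window, while the edit distance sees only one deletion and one insertion.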
 
The program also prompts the user for the maximum number of mismatches allowed, and reports the position of each mismatch together with the primer and target nucleotides at that position.

For the given primer and targets, binding sites were only reported once at least 8 mismatches were allowed. The literature indicates that the tolerance for mismatches depends on the primer length and on the nature of the mismatches, such as whether they fall at the front, middle or end of the primer (Stadhouders et al. 2010). Ideally, 3-5 mismatches are considered tolerable, so 8 mismatches is rather high and may affect the efficiency of the amplification (Huang et al. 2024).
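
Since the position of a mismatch matters, the mismatch list returned for each hit can be summarised by region. The sketch below is illustrative only: `three_prime_mismatches` is a hypothetical helper, and the choice of the last 5 bases as the "end" region is an assumption made for the example, not a value taken from the cited papers. The mismatch list is the one reported for the first forward hit in the results.

```python
def three_prime_mismatches(mismatches, primer_len, tail=5):
    """Return the mismatches that fall within the last `tail` bases of the primer.

    Positions are 1-based from the start of the primer string, as in the results.
    tail=5 is an illustrative assumption.
    """
    cutoff = primer_len - tail
    return [m for m in mismatches if m[0] > cutoff]

# mismatch list of the first forward hit, copied from the results
first_hit = [(3, 'T', 'G'), (5, 'C', 'T'), (6, 'A', 'G'), (8, 'G', 'C'),
             (9, 'G', 'C'), (16, 'A', 'C'), (17, 'C', 'T'), (20, 'T', 'A')]

tail_mm = three_prime_mismatches(first_hit, primer_len=20)
print(len(tail_mm), [m[0] for m in tail_mm])  # 3 [16, 17, 20]
```

Under this assumed window, three of the eight mismatches of that hit lie in the last five bases of the primer, including the final base.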




**CONCLUSION**:

Overall, for the given primer and target sequences, all possible binding sites within the maximum number of mismatches allowed by the user were identified using the sliding window technique, which proved to be simple and efficient in generating the output.

**REFERENCES**:

 Addgene (2025) Protocol. https://www.addgene.org/protocols/primer-design/ Accessed 13 May 2025.

 Brans, P. (2022) Fuzzy search. TechTarget, 8 August.

 GeeksforGeeks (2017) Sliding Window Technique. GeeksforGeeks, 16 April.

 Huang, K., Zhang, J., Li, J., Qiu, H., Wei, L., Yang, Y. and Wang, C. (2024) Exploring the Impact of Primer–Template Mismatches on PCR Performance of DNA Polymerases Varying in Proofreading Activity. Genes 15 (2).

 Stadhouders, R., Pas, S.D., Anber, J., Voermans, J., Mes, T.H.M. and Schutten, M. (2010) The Effect of Primer-Template Mismatches on the Detection and Quantification of Nucleic Acids Using the 5′ Nuclease Assay. The Journal of Molecular Diagnostics 12 (1), 109–117.
 
 Ye, J., Coulouris, G., Zaretskaya, I., Cutcutache, I., Rozen, S. and Madden, T.L. (2012) Primer-BLAST: A tool to design target-specific primers for polymerase chain reaction. BMC Bioinformatics 13 (1), 1–11.

 

  



  
