#Python for Genomic Data Science final project
JeongHo Choi, 15th June 2022 (updated)

-Write a Python program that takes as input a file containing DNA sequences in multi-FASTA format, and computes the answers to the following questions:

(4) A repeat is a substring of a DNA sequence that occurs in multiple copies (more than one) somewhere in the sequence. Although repeats can occur on both the forward and reverse strands of the DNA sequence, we will only consider repeats on the forward strand here. We will also allow repeats to overlap themselves.
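To make the overlap rule concrete, here is a minimal sketch of counting overlapping copies of a pattern. The helper name `count_overlapping` and the toy sequence are illustrative assumptions and are not part of the project code. Python's built-in `str.count` only counts non-overlapping matches, so it can undercount under this definition.

```python
def count_overlapping(sequence, pattern):
    """Count occurrences of pattern in sequence, allowing overlaps (hypothetical helper)."""
    count = 0
    start = sequence.find(pattern)
    while start != -1:
        count += 1
        # Advance by one position only, so overlapping copies are also found
        start = sequence.find(pattern, start + 1)
    return count

toy = "ATATAT"  # toy sequence, for illustration only
overlapping = count_overlapping(toy, "ATA")  # copies start at positions 0 and 2
non_overlapping = toy.count("ATA")           # str.count skips the overlapping copy
print(overlapping, non_overlapping)
```

With overlaps allowed, "ATA" counts as a repeat in the toy sequence, whereas a non-overlapping count would report it only once.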

-Given a length n, your program should be able to identify all repeats of length n in all sequences in the FASTA file.

-Your program should also determine how many times each repeat occurs in the file, and which is the most frequent repeat of a given length.
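The two requirements above can be sketched on plain strings, without reading a FASTA file. This is a rough illustration under stated assumptions: the function name `count_repeats` and the toy sequences are hypothetical, and occurrences are pooled across all sequences, so a substring counts as a repeat if it appears more than once anywhere in the input.

```python
from collections import Counter

def count_repeats(sequences, n):
    """Return a Counter of length-n substrings seen more than once (hypothetical helper)."""
    counts = Counter()
    for seq in sequences:
        # A sequence of length L has L - n + 1 windows of length n
        counts.update(seq[i:i + n] for i in range(len(seq) - n + 1))
    # Keep only substrings that occur more than once, i.e. the repeats
    return Counter({kmer: c for kmer, c in counts.items() if c > 1})

toy_sequences = ["ATATAT", "GATATC"]  # toy data, for illustration only
repeats = count_repeats(toy_sequences, 3)
print(dict(repeats))            # every repeat with its count
print(repeats.most_common(1))   # the most frequent repeat of length 3
```

Filtering on a count greater than one separates true repeats from substrings that occur only once, and `most_common` then answers the "most frequent repeat" question directly.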

In [1]:
from Bio import SeqIO

from collections import Counter

In [2]:
#list out all substrings of length n in the sequence (overlaps allowed)
def substrings(sequence, repeat_len):
    substring_list = []

    #+1 so that the last window, ending at the final base, is included
    for i in range(len(sequence) - repeat_len + 1):
        substring_list.append(sequence[i:(i+repeat_len)])
    return substring_list

In [3]:
#count all substrings of length n across the file and report the most frequent ones
def fasta_repeat(input_file, n):
    sub_count = Counter() #for counting repeated substrings

    #open the file in a with-block so that it is closed after parsing
    with open(input_file) as handle:
        for fasta in SeqIO.parse(handle, 'fasta'):
            sequence = str(fasta.seq)

            #add this record's substrings to the running counts
            sub_count.update(substrings(sequence, n))

    top_repeats = sub_count.most_common(3)

    print("Top 3 repeats in length %d are: " % n, top_repeats)

In [4]:
input_file = 'dna2.fasta'

In [5]:
fasta_repeat(input_file, 5)

Top 3 repeats in length 5 are:  [('CGCGC', 418), ('GCGCG', 414), ('CGCCG', 365)]
