# Python for Genomic Data Science final project
JeongHo Choi, 15th June 2022 (updated)

-Write a Python program that takes as input a file containing DNA sequences in multi-FASTA format, and computes the answers to the following questions:

(3) Given an input reading frame on the forward strand (1, 2, or 3), your program should be able to identify all ORFs present in each sequence of the FASTA file.
-What is the length of the longest ORF in the file?
-What is the identifier of the sequence containing the longest ORF? 
-For a given sequence identifier, what is the longest ORF contained in the sequence represented by that identifier? 
-What is the starting position of the longest ORF in the sequence that contains it? 

In [1]:
from Bio import SeqIO

In [2]:
#calculate the start position and the length of the longest ORF in one sequence, for a given forward reading frame
def ORF(sequence, reading_frame):
    seq = sequence[reading_frame-1:] #shift to the chosen forward reading frame (1, 2, or 3 -> slice from index 0, 1, or 2)
    max_len_ORF = 0 #length of the longest ORF found so far, stop codon included
    max_len_start = 0 #0-based start of that ORF within seq; add reading_frame for the 1-based position in the original sequence
    
    for i in range(0, len(seq)-5, 3): #an ORF needs at least 6 bases: start codon (3) + stop codon (3)
        if seq[i:i+3] == 'ATG':
            for j in range(i+3, len(seq)-2, 3): #scan in-frame codons up to and including the last complete one
                if seq[j:j+3] in ['TAA', 'TAG', 'TGA']:
                    len_ORF = j+3-i
                    if len_ORF > max_len_ORF:
                        max_len_ORF = len_ORF
                        max_len_start = i
                    break #the ORF ends at the first in-frame stop codon
    
    return max_len_start, max_len_ORF
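
As a cross-check on the nested loops above, the same search can be sketched as a single pass over the codons of one frame. This is an illustrative variant, not the function used for the results in this notebook: the name `longest_orf_linear` and the toy sequence are hypothetical, and it reports the start as a 1-based position in the original sequence rather than a 0-based offset in the shifted one.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_orf_linear(sequence, reading_frame):
    """Return (start_1based, length) of the longest ORF in one forward frame.

    start_1based is counted in the original sequence; (0, 0) means no ORF.
    """
    best_start, best_len = 0, 0
    open_start = None  # 0-based index of the first ATG since the last stop codon
    for i in range(reading_frame - 1, len(sequence) - 2, 3):
        codon = sequence[i:i + 3].upper()
        if codon == "ATG" and open_start is None:
            open_start = i
        elif codon in STOP_CODONS:
            if open_start is not None:
                length = i + 3 - open_start  # stop codon included
                if length > best_len:
                    best_start, best_len = open_start + 1, length
            open_start = None  # the next ORF must begin after this stop
    return best_start, best_len

# Hypothetical toy sequence: two ORFs in frame 1, none in frames 2 and 3.
toy = "ATGAAATAGATGCCCGGGTTTTAA"
print(longest_orf_linear(toy, 1))  # (10, 15)
print(longest_orf_linear(toy, 2))  # (0, 0)
```

Only the first ATG after each stop codon is tracked, because any later ATG before the same stop would give a shorter ORF.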

In [3]:
def fasta_ORF_analysis(input_file, reading_frame):
    multi_fasta = SeqIO.parse(input_file, 'fasta') #pass the file name directly instead of an unclosed open() handle
    
    ORF_record = {} #longest ORF per sequence; sequence ID as key, [start position, length] as value
    
    for fasta in multi_fasta:
        name, sequence = fasta.id, str(fasta.seq) #fasta.id is the first whitespace-delimited word of the header line
        
        #find the start position and the length of the longest ORF by calling ORF()
        orf_start, orf_len = ORF(sequence, reading_frame)
        ORF_record[name] = [orf_start, orf_len]
    
    #sorting by length in descending order 
    ORF_record_sorted = sorted(ORF_record.items(), key=lambda x: x[1][1], reverse=True)
    
    print("ORF%d, sorted by length: " % reading_frame, ORF_record_sorted)
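
The function above only prints the sorted records. To answer the individual questions directly, a small lookup layer over the same `{ID: [start, length]}` structure could look like the sketch below. It assumes the dictionary is made available to the caller (for example by returning `ORF_record`), that `start` is the 0-based offset within the frame-shifted sequence, and that a length of 0 means no ORF was found. The helper names and the toy record are hypothetical.

```python
def longest_orf_overall(orf_record, reading_frame):
    """Return (seq_id, start_1based, length) of the longest ORF, or None."""
    if not orf_record:
        return None
    seq_id, (start, length) = max(orf_record.items(), key=lambda item: item[1][1])
    if length == 0:
        return None  # no sequence had an ORF in this frame
    # offset within the shifted sequence -> 1-based position in the original
    return seq_id, start + reading_frame, length

def longest_orf_for_id(orf_record, seq_id, reading_frame):
    """Return (start_1based, length) for one sequence ID, or None if it has no ORF."""
    start, length = orf_record[seq_id]
    if length == 0:
        return None
    return start + reading_frame, length

# Hypothetical records in the same shape as ORF_record, for reading frame 2.
toy_record = {"seq_a": [6, 30], "seq_b": [0, 12], "seq_c": [0, 0]}
print(longest_orf_overall(toy_record, 2))          # ('seq_a', 8, 30)
print(longest_orf_for_id(toy_record, "seq_b", 2))  # (2, 12)
print(longest_orf_for_id(toy_record, "seq_c", 2))  # None
```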

In [4]:
input_file = 'dna2.fasta'

In [5]:
fasta_ORF_analysis(input_file, 1)

ORF1, sorted by length:  [('gi|142022655|gb|EQ086233.1|45', [384, 2394]), ('gi|142022655|gb|EQ086233.1|250', [561, 1560]), ('gi|142022655|gb|EQ086233.1|16', [1527, 1509]), ('gi|142022655|gb|EQ086233.1|255', [291, 1443]), ('gi|142022655|gb|EQ086233.1|91', [978, 1296]), ('gi|142022655|gb|EQ086233.1|396', [528, 1059]), ('gi|142022655|gb|EQ086233.1|454', [2337, 1044]), ('gi|142022655|gb|EQ086233.1|293', [1389, 312]), ('gi|142022655|gb|EQ086233.1|4', [444, 249]), ('gi|142022655|gb|EQ086233.1|277', [597, 204]), ('gi|142022655|gb|EQ086233.1|527', [1224, 195]), ('gi|142022655|gb|EQ086233.1|75', [819, 180]), ('gi|142022655|gb|EQ086233.1|88', [81, 120]), ('gi|142022655|gb|EQ086233.1|304', [858, 105]), ('gi|142022655|gb|EQ086233.1|584', [159, 90]), ('gi|142022655|gb|EQ086233.1|594', [27, 42]), ('gi|142022655|gb|EQ086233.1|322', [0, 0]), ('gi|142022655|gb|EQ086233.1|346', [0, 0])]
