# Problem 18: Open Reading Frames
Either strand of a DNA double helix can serve as the coding strand for RNA transcription. Hence, a given DNA string implies six total reading frames, or ways in which the same region of DNA can be translated into amino acids: three reading frames result from reading the string itself, whereas three more result from reading its reverse complement.

An open reading frame (ORF) is a reading frame that starts at a start codon and ends at a stop codon, with no other stop codons in between. A candidate protein string is therefore obtained by translating an ORF into amino acids until a stop codon is reached.
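
To make the definition concrete, here is a minimal sketch that locates ORFs directly on the DNA with a regular expression. It is an alternative illustration, not the approach used in the solution cells of this notebook, and the names `ORF_PATTERN`, `reverse_complement`, and `find_orfs` are assumptions introduced for this example. The pattern matches the start codon `ATG`, lazily consumes whole codons until the first in-frame stop codon, and sits inside a lookahead so that overlapping ORFs are also reported.

```python
import re

# Lookahead so that overlapping ORFs (nested start codons) are all found.
# The lazy codon repetition ends at the first in-frame stop codon.
ORF_PATTERN = re.compile(r"(?=(ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)))")


def reverse_complement(dna):
    """Complement each base, then reverse the string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(dna):
    """Return the set of ORF substrings (start through stop codon) on both strands."""
    orfs = set()
    for strand in (dna, reverse_complement(dna)):
        for match in ORF_PATTERN.finditer(strand):
            orfs.add(match.group(1))
    return orfs


print(sorted(find_orfs("ATGATGTAA")))
```

Because the search runs on every offset of both strands, all six reading frames are covered without splitting the string into codons first.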

Given: A DNA string s of length at most 1 kbp in FASTA format.

Return: Every distinct candidate protein string that can be translated from ORFs of s. Strings can be returned in any order.
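
The input arrives in FASTA format, while the cells in this notebook paste the sequence in as a plain string. A minimal sketch of a FASTA reader is shown here, assuming the usual layout of a `>` label line followed by one or more sequence lines. The function name `parse_fasta` is hypothetical, and the sample text is a truncated, line-wrapped excerpt of the sample dataset.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping each record label to its sequence."""
    records = {}
    label = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            label = line[1:]  # new record: drop the leading '>'
            records[label] = []
        elif label is not None:
            records[label].append(line)  # sequence lines may be wrapped
    return {name: "".join(chunks) for name, chunks in records.items()}


# Truncated excerpt of the sample dataset, wrapped over two lines.
sample = """>Rosalind_99
AGCCATGTAGCTAACTCAGG
TTACATGGGGATGACCCCGC"""

print(parse_fasta(sample))
```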

Sample Dataset
>\>Rosalind_99 \
AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG

Sample Output
>MLLGSFRLIPKETLIQVAGSSPCNLS\
M\
MGMTPRLGLESLLE\
MTPRLGLESLLE

In [39]:
def transcribeDNAtoRNA(sequence):
    return sequence.replace("T","U")

In [40]:
codons = """UUU:F\nCUU:L\nAUU:I\nGUU:V
UUC:F\nCUC:L\nAUC:I\nGUC:V
UUA:L\nCUA:L\nAUA:I\nGUA:V
UUG:L\nCUG:L\nAUG:M\nGUG:V
UCU:S\nCCU:P\nACU:T\nGCU:A
UCC:S\nCCC:P\nACC:T\nGCC:A
UCA:S\nCCA:P\nACA:T\nGCA:A
UCG:S\nCCG:P\nACG:T\nGCG:A
UAU:Y\nCAU:H\nAAU:N\nGAU:D
UAC:Y\nCAC:H\nAAC:N\nGAC:D
UAA:Stop\nCAA:Q\nAAA:K\nGAA:E
UAG:Stop\nCAG:Q\nAAG:K\nGAG:E
UGU:C\nCGU:R\nAGU:S\nGGU:G
UGC:C\nCGC:R\nAGC:S\nGGC:G
UGA:Stop\nCGA:R\nAGA:R\nGGA:G
UGG:W\nCGG:R\nAGG:R\nGGG:G"""

# Build the RNA codon -> amino acid lookup from the "codon:amino acid" entries.
codon_dict = {}
for x in codons.split("\n"):
    temp = x.split(":")
    codon_dict[temp[0]] = temp[1]
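
As a cross-check on the hand-typed table, the same mapping can be generated from the standard genetic code written as a 64-letter string in `UCAG` codon order. This is a sketch under that assumption; the names `BASES`, `AMINO_ACIDS`, and `compact_codon_dict` are introduced here for illustration and do not replace `codon_dict`.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# product() varies the last base fastest, matching the order of AMINO_ACIDS.
compact_codon_dict = {
    "".join(codon): ("Stop" if aa == "*" else aa)
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

print(len(compact_codon_dict), compact_codon_dict["AUG"])
```

Comparing `compact_codon_dict == codon_dict` in the notebook is a quick way to catch a typo in either table.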

In [41]:
def generateRedingFrames(sequence):
    # Split the sequence into codons at each of the three offsets (0, 1, 2).
    # Stopping at len(sequence) - 2 drops any trailing fragment shorter than
    # three bases and also copes with sequences too short to hold a codon.
    frames = []
    for offset in range(3):
        frames.append([sequence[i:i+3] for i in range(offset, len(sequence) - 2, 3)])
    return frames

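def complementDNA(sequence):
    # Despite the name, this returns the reverse complement:
    # each base is complemented and the result is then reversed.
    complementDict = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}
    transTable = str.maketrans(complementDict)
    return sequence.translate(transTable)[::-1]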

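def wrapReverseAndNormalRF(sequence):
    # Collect all six reading frames: three from the strand itself and
    # three from its reverse complement, both transcribed to RNA.
    rfs = []
    reverseSequence = transcribeDNAtoRNA(complementDNA(sequence))
    sequence = transcribeDNAtoRNA(sequence)
    rfs += generateRedingFrames(sequence)
    rfs += generateRedingFrames(reverseSequence)
    return rfs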

def generateAllSequences(sequence):
    
    readingFrames = wrapReverseAndNormalRF(sequence)
    
    return readingFrames

def createProteinSeq(sequence):
    # sequence is a single reading frame: a list of RNA codons.
    AASeq = []
    for i, codon in enumerate(sequence):
        if codon == 'AUG':  # start codon found: begin a candidate protein
            tempSequence = ['M']
            for nextCodon in sequence[i+1:]:
                nextAA = codon_dict[nextCodon]
                if nextAA == 'Stop':
                    # Only candidates closed by a stop codon count as ORFs.
                    AASeq.append(tempSequence)
                    break
                tempSequence.append(nextAA)
    return AASeq
            
def findPosibleProtienStrings(sequence):
    allSequences = generateAllSequences(sequence)

    sequenceBank = []
    for readingFrame in allSequences:
        sequenceBank += createProteinSeq(readingFrame)

    # Deduplicate with a set; the order of the printed strings is arbitrary.
    print("\n".join(set("".join(x) for x in sequenceBank)))

In [42]:
testSequence = "AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG"
findPosibleProtienStrings(testSequence)

MLLGSFRLIPKETLIQVAGSSPCNLS
M
MGMTPRLGLESLLE
MTPRLGLESLLE


In [43]:
actualSequence = """GAGTGAATATGGTGAGTACATAGCATAAGGCTGTGCTTAGATGTCGTACTAGTCCCCTGG
GAGCAGTTAAAATACGGTCTCACATGTCATCCTATGGATGATTCGGATCCTTCGCGAGGC
CAGAAAATCCGGGGTCCAGCGGTATGTAGTGAGAAGGTGATGGTTTCAGAACTATGCGTT
TCATCAGCCTGGACGCTCTGCCCGAGCGGATGTTCCCCTAATTATGTGGGCCAATCGGTG
ATACGCGCGCCTCGGATCCTTTGAAATACTGTTAGGAAAGGCCTGGGAACAAAGCACATA
ATGCGGCTGTGACCATAACCACAATCAGAGTGTACATTTCGAGGTAAATGGCGCCTCTCG
TATAGACCGATGCATTACTAACACTACATACCTACCAAGCCTGCTGATTCTTGCTATTCT
CCTTGCCTAAGGCCCCACATAGTCGGATGAGCGGTTGATGCACGGGGTTCACACGCTATC
AGACTAGCTAGTCTGATAGCGTGTGAACCCCGTGCATATTAATCAGGTCTATCACACTAC
GTAGACGTGGACCGGACTCAACGGATACATTGCGGCCTGCTCCATAGATTGAAATATTCG
GCACGTCTGCTTGGAAAATTCTGATCTTCCTCGTTCAAGCCCATGTGGCATACCATACGA
GACATATGAGATCGCTCCTGGGTTTGTCAGTCCAAGTAGTAACTTCTCTTCTGGGAATGC
CGTACATGCCTTAGGTGTTTGGTTCAACGATGATCGAGCAGTACAACTGCGCAACCAACG
CTCGTTCCTTATGCGTGCGACTACATCGCATACCCCCCCTTGGGCGGGCTAAAGCTGAGT
CAGTCTTTGAGGGTCTGTTAAGTTTGACACACAACTGCCGATGCGTTAACCTGAAACTAG
AAGAGATCCGGATACCACAGTGGAGGTTGACCATAACTTGGCTATGGAGTAAAATATTAT
ATCCTGACCGTAAC""".replace("\n","")

In [44]:
findPosibleProtienStrings(actualSequence)


MR
MP
MEQAAMYPLSPVHVYVV
MVTAALCALFPGLS
MVSELCVSSAWTLCPSGCSPNYVGQSVIRAPRIL
MSY
MHY
MPHGLERGRSEFSKQTCRIFQSMEQAAMYPLSPVHVYVV
MVNLHCGIRISSSFRLTHRQLCVKLNRPSKTDSALARPRGGMRCSRTHKERALVAQLYCSIIVEPNT
MTCETVF
MAPLV
MSSYG
MYPLSPVHVYVV
MIRILREARKSGVQRYVVRR
MYTLIVVMVTAALCALFPGLS
MKRIVLKPSPSHYIPLDPGFSGLAKDPNHP
MWHTIRDI
MVCHMGLNEEDQNFPSRRAEYFNLWSRPQCIR
MRSLLGLSVQVVTSLLGMPYMP
MDDSDPSRGQKIRGPAVCSEKVMVSELCVSSAWTLCPSGCSPNYVGQSVIRAPRIL
MHGVHTLSD
MSRMVCHMGLNEEDQNFPSRRAEYFNLWSRPQCIR
MWGLRQGE
MRFISLDALPERMFP
M
MVST
MRL
MFP
MHRSIREAPFTSKCTL
MYGIPRREVTTWTDKPRSDLICLVWYATWA
MSG
MRATTSHTPPWAG
MRCSRTHKERALVAQLYCSIIVEPNT
MPYMP
MWANR
MCFVPRPFLTVFQRIRGARITDWPT
MGLNEEDQNFPSRRAEYFNLWSRPQCIR
ME
