# 1. Source

Source problem on **Rosalind**: [Finding Genes with ORFs](https://rosalind.info/problems/orfr/)

**Problem**

![Finding Genes with ORFs](orfr_problem.png "Finding Genes with ORFs")

**Sample Dataset**

AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG

**Sample Output**

MLLGSFRLIPKETLIQVAGSSPCNLS

# 2. Workspace

In [1]:
# read the input file and extract the dna string

with open('orfr_test.txt', 'r') as file:
    dnaSeq = file.read().rstrip().upper()
    
# display dnaSeq

dnaSeq

'AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG'

In [2]:
# to find the possible orfs, we can write a regex pattern

import re

pattern = r'ATG(?:(?!TAA|TAG|TGA)...)*(?:TAA|TAG|TGA)?'

# 1. must start with ATG
# 2. followed by zero or more codons (3-nucleotide groups of A, T, G, C) that are not TAA, TAG or TGA
# 3. should end with TAA, TAG or TGA - if no stop codon is found, the match runs to the last complete codon of the string

re.findall(pattern, dnaSeq)

['ATGTAG',
 'ATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAA',
 'ATGATCCGAGTAGCATCTCAG']
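Two properties of the pattern above are worth keeping in mind. `re.findall` returns non-overlapping matches, so an ORF that starts inside an earlier match (for example in a different reading frame) is not reported, and the optional stop group also accepts candidates that run off the end of the string without a stop codon, as in the third match. The following is a minimal sketch of one way to handle both, assuming an ORF must end with a stop codon. `ORF_PATTERN` and `find_orfs` are hypothetical names introduced for illustration and are not part of the notebook.

```python
import re

# Assumption of this sketch: an ORF must end with a stop codon.
# Wrapping the pattern in a lookahead makes every match zero-width,
# so the regex engine tries each start position and overlapping
# ORFs are all captured by the inner group.
ORF_PATTERN = re.compile(r'(?=(ATG(?:(?!TAA|TAG|TGA)...)*(?:TAA|TAG|TGA)))')

def find_orfs(dna):
    """Return every stop-terminated ORF in dna, including overlapping ones."""
    return [match.group(1) for match in ORF_PATTERN.finditer(dna)]

# The ATG at index 0 has no in-frame stop codon, while the ATG at
# index 4 does, so only the second candidate is reported.
print(find_orfs('ATGCATGAAATAGTAA'))  # ['ATGAAATAG']
```

On the same input, the pattern used in the cell above returns `['ATGCATGAAATAGTA']`, because the unterminated match starting at index 0 consumes the second `ATG`.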

In [3]:
# the longest one

dnaSeq_longest_orf = max(re.findall(pattern, dnaSeq), key=len)

# print

print(len(dnaSeq_longest_orf))
print(dnaSeq_longest_orf)

45
ATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAA


In [4]:
# we need to repeat for the reverse complement

from Bio.Seq import Seq

revcSeq = str(Seq(dnaSeq).reverse_complement())

revcSeq

'CTGAGATGCTACTCGGATCATTCAGGCTTATTCCAAAAGAGACTCTAATCCAAGTCGCGGGGTCATCCCCATGTAACCTGAGTTAGCTACATGGCT'
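For readers without Biopython, the reverse complement can also be computed with the standard library alone. This is a minimal sketch, assuming the input contains only the upper-case bases A, C, G and T. `COMPLEMENT` and `reverse_complement` are hypothetical names used for illustration.

```python
# Translation table mapping each base to its complement.
# Assumption of this sketch: upper-case A, C, G, T only; any other
# character is passed through unchanged.
COMPLEMENT = str.maketrans('ACGT', 'TGCA')

def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]

print(reverse_complement('AAAACCCGGT'))  # ACCGGGTTTT
```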

In [5]:
# find the longest orf for revcSeq

revcSeq_longest_orf = max(re.findall(pattern, revcSeq), key=len)

# print

print(len(revcSeq_longest_orf))
print(revcSeq_longest_orf)

81
ATGCTACTCGGATCATTCAGGCTTATTCCAAAAGAGACTCTAATCCAAGTCGCGGGGTCATCCCCATGTAACCTGAGTTAG


In [6]:
# the second one is longer
# translate the second orf into amino acids

longest_protein = Seq(revcSeq_longest_orf).translate(to_stop = True)

# print

print(longest_protein)

MLLGSFRLIPKETLIQVAGSSPCNLS
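To make explicit what `translate(to_stop = True)` does here, the sketch below reproduces the step with a hand-built standard codon table: codons are read three bases at a time and translation stops at the first stop codon. It assumes the standard genetic code and upper-case A, C, G, T input. `CODON_TABLE` and `translate_to_stop` are hypothetical names, not Biopython API.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order;
# '*' marks a stop codon.
BASES = 'TCAG'
AMINO_ACIDS = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
CODON_TABLE = {
    ''.join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate_to_stop(orf):
    """Translate complete codons from the start of orf until the first stop codon."""
    protein = []
    for i in range(0, len(orf) - 2, 3):
        # Raises KeyError for characters other than A, C, G, T.
        amino_acid = CODON_TABLE[orf[i:i + 3]]
        if amino_acid == '*':
            break
        protein.append(amino_acid)
    return ''.join(protein)

orf = 'ATGCTACTCGGATCATTCAGGCTTATTCCAAAAGAGACTCTAATCCAAGTCGCGGGGTCATCCCCATGTAACCTGAGTTAG'
print(translate_to_stop(orf))  # MLLGSFRLIPKETLIQVAGSSPCNLS
```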


In [7]:
# is it the same as the sample output?

sample_output = 'MLLGSFRLIPKETLIQVAGSSPCNLS'

longest_protein == sample_output

True

# 3. Implementation

In [8]:
def orfr(filename):
    
    '''
    input
        a file containing a dna sequence
    process
        finds the longest orf on either strand and translates it into a protein sequence
    output
        prints the answer to the console and writes it to a file
    '''
    
    from Bio.Seq import Seq
    import re
    
    # read input file and extract dnaSeq
    with open(filename, 'r') as file:
        dnaSeq = file.read().rstrip().upper()
        
    # orf finder regex pattern
    pattern = r'ATG(?:(?!TAA|TAG|TGA)...)*(?:TAA|TAG|TGA)?'
    
    # find the longest orf of dnaSeq (empty string if there is no match)
    dnaSeq_longest_orf = max(re.findall(pattern, dnaSeq), key=len, default='')
    
    # create reverse complement
    revcSeq = str(Seq(dnaSeq).reverse_complement())
    
    # find the longest orf of revcSeq (empty string if there is no match)
    revcSeq_longest_orf = max(re.findall(pattern, revcSeq), key=len, default='')
    
    # select the longer and create protein sequence
    if len(dnaSeq_longest_orf) >= len(revcSeq_longest_orf):
        protein = Seq(dnaSeq_longest_orf).translate(to_stop = True)
    else:
        protein = Seq(revcSeq_longest_orf).translate(to_stop = True)
    
    # print answer to console
    print('\n\x1B[1mANSWER\x1B[0m\n______\n')
    print(f'{protein}')
    
    # open file and write answer
    answer_file = f'{filename.split(".")[0]}_answer.txt'
    with open(answer_file, 'w') as file:
        file.write(f'{protein}')
    print('\n\n#! The answer has been written into the file:',
          f'\x1B[1m./{answer_file}\x1B[0m\n')

# 4. Execution

In [9]:
orfr('orfr_test.txt')


ANSWER
______

MLLGSFRLIPKETLIQVAGSSPCNLS


#! The answer has been written into the file: ./orfr_test_answer.txt



In [10]:
orfr('rosalind_orfr_1_dataset.txt')


ANSWER
______

MRCAWLLVSPLRMAHCHGSCASSVFYQFRGALTLRARNWTLYHTSRDSVIPRDSKLRYHTCWGYSSSIRMR


#! The answer has been written into the file: ./rosalind_orfr_1_dataset_answer.txt



In [11]:
orfr('rosalind_orfr.txt')


ANSWER
______

MYWAVYRQRAASPFTRGLPRVDPMHKVGPLRFSRTIVKGHSSVAAATPRESVGKSGRLTSISRGRHLLSGDSDLPVLLTDDRKSF


#! The answer has been written into the file: ./rosalind_orfr_answer.txt



<p style='text-align: right;'>
    <!--<b><font size = '5'>Contact</font></b><br>-->
    <b>Orcun Tasar</b><br>
    <i>Bioinformatician / Data Scientist</i><br>
    orcuntasar |at@| ogr.iu.edu.tr<br>
    tasar.orcun |at@| gmail.com<br>
    <a href = 'https://www.linkedin.com/in/orçun-taşar-7b5992a1/'>LinkedIn</a> | <a href = 'https://www.instagram.com/shatranuchor/'>Instagram</a>
</p>