### Open Reading Frame; Biopython

- Biopython does not provide a one-line `.find_orfs()` method.
- This notebook builds an ORF finder manually, from scratch.

***



> An open reading frame (ORF) is a continuous stretch of codons that begins with a start codon (usually ATG), ends at a stop codon (TAA, TAG, or TGA), and contains no stop codon in between.
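
The definition can be turned into a small check. The sketch below works on plain strings and assumes ATG as the only start codon unless told otherwise; the helper name `is_orf` is hypothetical and is not part of Biopython.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def is_orf(dna, start_codons=("ATG",)):
    """Return True if dna starts with a start codon, ends with a stop
    codon, and has no stop codon in between (hypothetical helper)."""
    dna = dna.upper()
    # An ORF needs at least a start and a stop codon, in whole codons
    if len(dna) < 6 or len(dna) % 3 != 0:
        return False
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    if codons[0] not in start_codons or codons[-1] not in STOP_CODONS:
        return False
    # No internal stop codon is allowed
    return not any(c in STOP_CODONS for c in codons[1:-1])

print(is_orf("ATGAAATAA"))     # True: start, one codon, stop
print(is_orf("ATGTAAAAATAA"))  # False: internal stop codon
```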


### 1. Get the `genomic sequence`

**Escherichia coli strain PNUSAE211155, whole genome shotgun sequencing project**


In [None]:
from Bio import Entrez,SeqIO,SeqUtils, SeqRecord
from Bio.Seq import Seq
import os


In [None]:
Entrez.email = 'jasovicluka1@gmail.com'
handle = Entrez.esearch(db='nucleotide',term='E. coli', retmax=100,idtype='acc')
record = Entrez.read(handle)  # Entrez.read parses the XML reply; it takes no format argument
handle.close()

id_list = record['IdList']
# Pick a single accession from the search results (the hit at index 17)
selected_ids = id_list[17]
print(f'My selected ID: {selected_ids}')



In [None]:
fetch = Entrez.efetch(db='nucleotide',id=selected_ids,rettype='gb',retmode='text')
read = SeqIO.read(fetch,'gb')
fetch.close()  # close the efetch handle (the esearch handle was already closed)
record_1 = read.seq
print(len(record_1))


#Save the complete file

with open('ORF_1.txt', 'w') as file_01:
    SeqIO.write(read,file_01,'fasta')


### 2. ORF 

In bacteria, the most common start codon is `AUG`, which codes for N-formylmethionine (**fMet**) when it initiates translation.

There are also alternative start codons; the most common ones are:

- `GUG`

- `UUG`

These start codons are also recognized by the initiator tRNA carrying fMet.

In [None]:
# Start codons: RNA -> DNA
# AUG -> ATG
# GUG -> GTG
# UUG -> TTG
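
The RNA-to-DNA mapping in the comments can also be written as code. This is a minimal sketch; the names `RNA_START_CODONS`, `DNA_START_CODONS`, and `rna_to_dna` are made up for illustration.

```python
# Start codons as written on mRNA (taken from the list in this section)
RNA_START_CODONS = ("AUG", "GUG", "UUG")

def rna_to_dna(codon):
    """Rewrite an RNA codon in the DNA alphabet (U -> T)."""
    return codon.upper().replace("U", "T")

# The same codons as they appear in a genomic (DNA) sequence
DNA_START_CODONS = tuple(rna_to_dna(c) for c in RNA_START_CODONS)
print(DNA_START_CODONS)  # ('ATG', 'GTG', 'TTG')
```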

In [None]:
# Full sequence length: 4073392
print(f'Full sequence length is {len(record_1)}')


#### 2.1 Load the full sequence, fasta format:

In [None]:
with open('ORF_1.txt','r') as file:
    record = SeqIO.read(file,"fasta")
sequence = record.seq
    

In [None]:
sequence

#### 2.2 Iterate over each of the 6 possible reading frames.

In [None]:
# START: ATG; STOP: TAA, TGA, TAG

reverse_complement = sequence.reverse_complement()
reverse_complement
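
The notebook relies on Biopython's `reverse_complement()` for the second strand. To show how the six frames come about, here is a pure-Python sketch on plain strings. It assumes the sequence contains only A, C, G, and T; the function names and the toy sequence are illustrative, not taken from the notebook.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Reverse-complement a plain DNA string (A/C/G/T only)."""
    return dna.upper().translate(COMPLEMENT)[::-1]

def six_frames(dna):
    """Yield (strand, offset, frame_sequence) for all six reading frames."""
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            # Each frame is the strand read from offset 0, 1, or 2
            yield strand, offset, seq[offset:]

for strand, offset, frame in six_frames("ATGCAATGATA"):
    print(strand, offset, frame)
```

Three offsets on each of the two strands give the six frames to scan.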

In [None]:

seq_str = str(sequence)

In [None]:
print(len(sequence))

Reading frame 0 → starts at position 0:
ATG, CAA, TGA, ...

Reading frame 1 → starts at position 1:
TGC, AAT, GAT, ...

Reading frame 2 → starts at position 2:
GCA, ATG, ATA, ...
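
The three frames listed above can be reproduced with a short slicing sketch. The toy sequence `ATGCAATGATA` is an assumption, picked because it yields exactly the codons shown; `codons_in_frame` is a hypothetical helper name.

```python
def codons_in_frame(dna, offset):
    """Split dna into complete codons, starting at offset 0, 1, or 2."""
    # Stopping at len(dna) - 2 drops any incomplete trailing codon
    return [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]

toy = "ATGCAATGATA"  # assumed toy sequence matching the listing above
for offset in range(3):
    print(offset, codons_in_frame(toy, offset))
# 0 ['ATG', 'CAA', 'TGA']
# 1 ['TGC', 'AAT', 'GAT']
# 2 ['GCA', 'ATG', 'ATA']
```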

In [None]:
# Loop through the complete sequence:
# start storing codons at a start codon,
# stop at a stop codon.
"""
#ORF 0
import re
#rf1 = []
#rf2 = []
#rf3 = []
rfs = [[],[],[]]
#Checkpoint #1 - Searching for the start codon "ATG" and its position in the full sequence
for i in [0,1,2]:
    found_item = False
    for j in range(i,len(sequence)-2,3):
        current_frame = []
        codons = seq_str[j:j+3]
        #print(codons)
        if codons == 'ATG' and not found_item:
            print(f'Found start of the current frame, ORF {i}')
            found_item = True
        if found_item:
            current_frame.append(codons)
            print(current_frame)
            if codons in ['TAA','TGA','TAG']:
                found_item = False
                rfs[i].append(current_frame)

print(len(rfs[0]))
print(len(rfs[1]))
print(len(rfs[2]))


"""   

In [None]:
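def orf_reader(sequence,found_item=True):
    """Return the ORFs of the three forward frames as lists of codons.

    Each ORF runs from the first ATG found to the next in-frame stop codon.
    `found_item` is kept so existing calls still work; it is ignored
    (the flag is reset for every frame).
    """
    seq_str = str(sequence)  # convert once instead of once per codon
    rfs1 =[[],[],[]]
    for i in [0,1,2]:
        found_item= False
        current_frame = []
        for j in range(i,len(seq_str) -2,3):
            codons = seq_str[j:j+3]
            if codons == 'ATG' and not found_item:
                current_frame =['ATG']
                found_item = True
                continue # skip ahead so ATG is not added twice
            if found_item:
                current_frame.append(codons)
                if codons in ['TAA','TAG','TGA']:
                    rfs1[i].append(current_frame)
                    current_frame = []
                    found_item = False
    return rfs1
rfs0 = orf_reader(sequence)
# Printing every ORF of a whole genome floods the output, so show a small sample
print(rfs0[0][:3])  # first three ORFs of frame 0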

# What I want to do:
# a function that reads everything stored from the sequence and from its reverse complement

In [None]:
orf_0 = rfs0[0]
orf_1 = rfs0[1]
orf_2 = rfs0[2]
print(len(orf_0))
print(len(orf_1))
print(len(orf_2))
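
These counts cover ORFs that start with ATG only. Section 2 also mentions GTG and TTG as bacterial start codons, so a variant with a configurable start set may be useful. The sketch below is a hypothetical plain-string variant (`find_orfs` is not the notebook's function) that follows the same scanning logic as `orf_reader`.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, start_codons=("ATG", "GTG", "TTG")):
    """Return, per forward frame, a list of ORFs as codon lists.

    Hypothetical variant of orf_reader with a configurable set of
    start codons.
    """
    dna = dna.upper()
    frames = [[], [], []]
    for offset in range(3):
        current = None  # None means "not inside an ORF"
        for i in range(offset, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if current is None:
                if codon in start_codons:
                    current = [codon]  # open a new ORF
            else:
                current.append(codon)
                if codon in STOP_CODONS:
                    frames[offset].append(current)  # ORF is complete
                    current = None
    return frames

print(find_orfs("GTGAAATAA"))                         # [[['GTG', 'AAA', 'TAA']], [], []]
print(find_orfs("GTGAAATAA", start_codons=("ATG",)))  # [[], [], []]
```

Allowing GTG and TTG will usually report more candidate ORFs, because inside a gene these codons simply encode valine and leucine; the extra hits are candidates, not confirmed genes.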

### Control point below:

- Check that `orf_reader()` works on a synthetic sequence before continuing to the next part.


In [None]:

synthetic_seq = Seq("ATGAAATAATATGAGATAAATGAAGGTAG")
orf_reader(synthetic_seq)

In [None]:
synth_reverse = synthetic_seq.reverse_complement()

In [32]:
reslt = orf_reader(synthetic_seq,found_item=True)
print(reslt)



[[['ATG', 'AAA', 'TAA']], [['ATG', 'AGA', 'TAA']], []]


#### Wrapper function that collects all reading frames (`forward` + `reverse` strand -> 6 reading frames total)

In [None]:
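def scan_all_frames(seq1,seq2):
    """Collect the ORFs of all six reading frames.

    seq1 is the forward strand, seq2 its reverse complement.
    Returns a dict keyed by '<strand>_<frame>' (the key labels are an
    illustrative choice).
    """
    all_orfs = {}
    reslt = orf_reader(seq1,found_item=True)
    print(f'Control point, results passed from the previous func.: {reslt}')
    reslt_rev = orf_reader(seq2,found_item=True)
    print(f"2nd control point, synthetic reverse strand ORFs: {reslt_rev}")
    # Store each frame's ORF list under a strand/frame label
    for strand, frames in (('forward', reslt), ('reverse', reslt_rev)):
        for x in range(len(frames)):
            all_orfs[f'{strand}_{x}'] = frames[x]
    return all_orfs
scan_all_frames(synthetic_seq,synth_reverse)

Control point, results passed from the previous func.: [[['ATG', 'AAA', 'TAA']], [['ATG', 'AGA', 'TAA']], []]
2nd control point, synthetic reverse strand ORFs: [[], [], []]


{'forward_0': [['ATG', 'AAA', 'TAA']],
 'forward_1': [['ATG', 'AGA', 'TAA']],
 'forward_2': [],
 'reverse_0': [],
 'reverse_1': [],
 'reverse_2': []}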

You need to check the behavior and work out the simplest way to do whatever comes to mind.