### Open Reading Frame; Biopython

- Biopython does not have a one-line `.find_orfs()` method.
- This notebook builds an ORF finder manually, from scratch.

***



> **Open reading frame (ORF):** a continuous stretch of codons without internal stop codons, beginning with a start codon (usually ATG) and ending at a stop codon (TAA, TAG, or TGA).


### 1. Get the `genomic sequence`

**Escherichia coli strain PNUSAE211155, whole genome shotgun sequencing project**

In [None]:
from Bio import Entrez, SeqIO, SeqUtils, SeqRecord
from Bio.Seq import Seq
import os


In [None]:
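Entrez.email = 'jasovicluka1@gmail.com'
# Search the nucleotide database; idtype='acc' returns accession numbers
handle = Entrez.esearch(db='nucleotide', term='E. coli', retmax=10, idtype='acc')
record = Entrez.read(handle)  # parses the XML reply; no format argument is needed
handle.close()

id_list = record['IdList']
print(id_list)
first_acc = id_list[5]  # despite the name, this is the sixth hit (index 5)
first_acc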


In [None]:
fetch = Entrez.efetch(db='nucleotide', id=first_acc, rettype='gb', retmode='text')
read = SeqIO.read(fetch, 'gb')
fetch.close()  # close the efetch handle (the esearch handle was already closed)
record_1 = read.seq
print(len(record_1))


# Save the complete record in FASTA format

with open('ORF_1.txt', 'w') as file_01:
    SeqIO.write(read, file_01, 'fasta')


### 2. ORF 

In bacteria, the most common start codon is `AUG`, which is read as N-formylmethionine (**fMet**) when it initiates translation.

Besides `AUG`, bacteria also use alternative start codons. The most common ones are:

- `GUG`

- `UUG`

These start codons are recognized by the initiator tRNA carrying fMet.

In [None]:
# mRNA codon -> DNA (coding-strand) codon
# AUG -> ATG
# GUG -> GTG
# UUG -> TTG
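The mapping in the cell above can be written as a small lookup. This is a minimal sketch in plain Python (no Biopython needed); the names `RNA_START_CODONS`, `DNA_START_CODONS` and `is_start_codon` are hypothetical helpers introduced here for illustration, and the set of start codons is limited to the three listed in this notebook.

```python
# Start codons as they appear in mRNA (the three listed in this notebook).
RNA_START_CODONS = ("AUG", "GUG", "UUG")

# The same codons as they appear in the DNA coding strand: U is written as T.
DNA_START_CODONS = tuple(codon.replace("U", "T") for codon in RNA_START_CODONS)


def is_start_codon(codon, starts=DNA_START_CODONS):
    """Return True if `codon` (a 3-letter DNA string) is one of the start codons."""
    return codon.upper() in starts


print(DNA_START_CODONS)
print(is_start_codon("GTG"), is_start_codon("TAA"))
```

Keeping the start codons in one place makes it easy to switch between an ATG-only search and a search that also accepts the alternative starts.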

In [None]:
# Full sequence length: 4073392
print(f'Full sequence length is {len(record_1)}')


#### 2.1 Load the full sequence (FASTA format):

In [None]:
with open('ORF_1.txt','r') as file:
    record = SeqIO.read(file,"fasta")
sequence = record.seq
    

In [None]:
sequence

#### 2.2 Iterate over each of the 6 possible reading frames.

In [None]:
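# START: ATG   STOP: TAA, TGA, TAG

# First three codons of the forward strand, read in frame 0
start_1 = sequence[0:3]
start_2 = sequence[3:6]
start_3 = sequence[6:9]

# The other three reading frames lie on the reverse-complement strand
reverse_complement = sequence.reverse_complement()
reverse_complement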

In [None]:
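# First three codons of the reverse-complement strand, read in frame 0
start_1rvs = reverse_complement[0:3]
start_2rvs = reverse_complement[3:6]
start_3rvs = reverse_complement[6:9]
seq_str = str(sequence)  # plain-string copy of the forward strand for slicing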

In [None]:
print(len(sequence))

For a sequence that begins `ATGCAATGATA...`, the three forward reading frames are:

Reading frame 0 → starts at position 0:
`ATG, CAA, TGA, ...`

Reading frame 1 → starts at position 1:
`TGC, AAT, GAT, ...`

Reading frame 2 → starts at position 2:
`GCA, ATG, ATA, ...`
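The three frames listed above can be reproduced with a short helper. This is a minimal sketch on a toy string; `split_codons` and `toy` are hypothetical names used only for illustration, and the toy sequence is the one implied by the codons shown above.

```python
def split_codons(seq, frame):
    """Return the complete codons of `seq` read from offset `frame` (0, 1 or 2)."""
    # len(seq) - 2 is the last index at which a full 3-base codon can start.
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


toy = "ATGCAATGATA"
for frame in range(3):
    print(f"Reading frame {frame}: {', '.join(split_codons(toy, frame))}")
```

For the real genome, the same function can be called on `str(sequence)`, and on the string form of the reverse complement for the remaining three frames.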

In [None]:
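# Loop through the complete sequence:
# start storing when we hit the start codon,
# stop when we hit a stop codon.

# ORF 0
import re
rf1 = []
rf2 = []
rf3 = []
# Checkpoint #1 - search for the start codon "ATG" and its position in the full sequence

start_positions = []
for i in range(0, len(sequence) - 2, 3):
    full_frame = seq_str[i:i+3]
    # On a 3-base codon, re.search("ATG", ...) matches only when the codon is exactly ATG
    x = re.search("ATG", full_frame)
    if x:
        start_positions.append(i)  # record the position instead of printing every codon

print(f'{len(start_positions)} ATG codons in frame 0; first positions: {start_positions[:10]}')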

In [None]:
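# Positions 0-3 hold the first codon.
# Try the three forward reading frames: offsets 0, 1 and 2
# (A-start, T-start and G-start for a sequence beginning with ATG).
rf1, rf2, rf3 = [], [], []
frames = [rf1, rf2, rf3]
for f in range(0, 3):
    for j in range(f, len(sequence) - 2, 3):
        frames[f].append(seq_str[j:j+3])  # store the whole codon, not a single base

# Print a short preview only; each frame is far too long to print in full
print(rf1[:10])
print(rf2[:10])
print(rf3[:10])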

In [None]:
# Preview the first ten codons of frame 0 (printing every codon would flood the output)
for i in range(0, min(30, len(sequence) - 2), 3):
    full_frame = seq_str[i:i+3]
    print(full_frame)
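Putting the pieces together, the "start storing at a start codon, stop at a stop codon" idea can be sketched for all six reading frames. This is one possible implementation, not a Biopython API: `find_orfs`, `find_orfs_in_frame`, `reverse_complement` and `min_codons` are hypothetical names. It works on plain strings, so the genome would be passed as `str(sequence)`. Assumptions: only `ATG` counts as a start unless `START_CODONS` is extended, the first start codon after the previous stop opens the ORF, ORFs with no stop codon before the end of the sequence are dropped, and coordinates on the `-` strand refer to the reverse-complement string.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG"}  # assumption: add "GTG" and "TTG" to allow alternative starts

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs_in_frame(seq, frame, min_codons=1):
    """Return (start, end) pairs for ORFs in one frame; `end` is exclusive
    and includes the stop codon."""
    orfs = []
    start = None  # index of the currently open start codon, if any
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None:
            if codon in START_CODONS:
                start = i
        elif codon in STOP_CODONS:
            end = i + 3
            if (end - start) // 3 >= min_codons:
                orfs.append((start, end))
            start = None  # look for the next start codon
    return orfs


def find_orfs(seq, min_codons=1):
    """Scan all six frames; return (strand, frame, start, end, orf_sequence) tuples."""
    seq = seq.upper()
    results = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in find_orfs_in_frame(strand_seq, frame, min_codons):
                results.append((strand, frame, start, end, strand_seq[start:end]))
    return results


for orf in find_orfs("ATGCAATGATA"):
    print(orf)
```

On a full genome, a larger `min_codons` value (for example 100) keeps the result list manageable, since short ORFs occur frequently by chance.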