# Importing Required Libraries

In this assignment we use the Biopython package, so we first import the modules we need:

1. **Bio.Entrez:** This module is used to connect to the NCBI databases.

2. **Bio.SeqIO:** This module is used to parse the downloaded FASTA files into `SeqRecord` objects, whose `seq` attribute holds the sequence as a `Seq` object. This lets us use the methods that Biopython provides for working with sequences.

3. **Bio.SeqUtils:** This module contains the `nt_search()` function, which is used to search for the Origin of Replication inside the DNA sequence.

4. **Bio.Seq:** This module provides the `Seq` class, which makes it easier to work with long DNA sequences.


In [3]:
from Bio import Entrez as ent
from Bio import SeqUtils as util
from Bio.Seq import Seq as seq
from Bio import SeqIO as sio

# Providing Email and Accession Number

To connect to the NCBI database, we provide our email address. NCBI uses it to identify the user and to make contact if there is a problem with the requests; it is not a password. We also define a list of accession numbers, one for each of the 16 chromosomes of Baker's Yeast.

Each accession number serves as the identifier used to fetch the corresponding FASTA file.

In [4]:
emailID = "nikhilkumar190804@gmail.com"
ent.email=emailID
list_of_bakers_yeast_assession_number = ["NC_001133.9","NC_001134.8","NC_001135.5","NC_001136.10","NC_001137.3","NC_001138.5","NC_001139.9","NC_001140.6","NC_001141.2","NC_001142.9","NC_001143.9","NC_001144.5","NC_001145.3","NC_001146.8","NC_001147.6","NC_001148.4"]
file_name = "Chromosome "
Chromosome_number = 1

# Utilizing the efetch() Function from the Entrez Module

Next we call the `efetch()` function from the Entrez module. It retrieves the record identified by an accession number from an NCBI database in the requested format; here we ask the `nucleotide` database for plain-text FASTA.


In [6]:
for accession_number in list_of_bakers_yeast_assession_number:
    handle = ent.efetch(db="nucleotide",id=accession_number,rettype="fasta",retmode="text")
    record = handle.read()

# File Handling to Create a FASTA File

Using Python's file handling capabilities, we proceed to create a FASTA file. The `open()` function is employed for this purpose, and we write the data fetched by the handle into the file. This step completes the process of downloading the genome sequence of Baker's Yeast.


In [1]:
for accession_number in list_of_bakers_yeast_assession_number:
    # Fetch the chromosome as plain-text FASTA, then release the connection
    handle = ent.efetch(db="nucleotide",id=accession_number,rettype="fasta",retmode="text")
    record = handle.read()
    handle.close()
    # The with-block flushes and closes the file automatically
    with open(file_name+str(Chromosome_number)+".fasta",'w') as file:
        file.write(record)
    Chromosome_number+=1

# Complete Code for Part (a)

In [2]:
from Bio import Entrez as ent
from Bio import SeqUtils as util
from Bio.Seq import Seq as seq
from Bio import SeqIO as sio

#Downloading the genome sequence of Baker's Yeast
emailID = "nikhilkumar190804@gmail.com"
ent.email=emailID

list_of_bakers_yeast_assession_number = ["NC_001133.9","NC_001134.8","NC_001135.5","NC_001136.10","NC_001137.3","NC_001138.5","NC_001139.9","NC_001140.6","NC_001141.2","NC_001142.9","NC_001143.9","NC_001144.5","NC_001145.3","NC_001146.8","NC_001147.6","NC_001148.4"]
file_name = "Chromosome "
Chromosome_number = 1

for accession_number in list_of_bakers_yeast_assession_number:
    # Fetch the chromosome as plain-text FASTA, then release the connection
    handle = ent.efetch(db="nucleotide",id=accession_number,rettype="fasta",retmode="text")
    record = handle.read()
    handle.close()
    # The with-block flushes and closes the file automatically
    with open(file_name+str(Chromosome_number)+".fasta",'w') as file:
        file.write(record)
    Chromosome_number+=1

# Possible Sequences for Origin of Replication

Next we want to search for the Origin of Replication in each of these yeast chromosomes.

To do this we search for a particular DNA sequence, which I found by reading articles in the PubMed database of the NCBI. The sequence is 11 base pairs long and consists mainly of 'A' and 'T'.

The sequence is:
[(A/T)TTTAT(A/G)TTT(A/T)]

There are two choices at each of three positions (1, 7 and 11), which gives 2 × 2 × 2 = 8 possible sequences.

We therefore store all of these possible sequences in a list.

The article used for reference is: https://www.sciencedirect.com/science/article/pii/S0923250812000435?via%3Dihub

NCBI Link of the Article: https://pubmed.ncbi.nlm.nih.gov/22504206/

PMID Of Article: 22504206


In [None]:
patterns = ["ATTTATATTTA","TTTTATGTTTT","ATTTATATTTT","ATTTATGTTTA","ATTTATGTTTT","TTTTATATTTA","TTTTATATTTT","TTTTATGTTTA"]
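
As a cross-check on the hand-written list, the eight sequences can also be generated from the motif. The following is a minimal sketch using only the standard library; the names `choices` and `generated_patterns` are illustrative and not part of the original notebook.

```python
from itertools import product

# Allowed bases at each of the 11 positions of (A/T)TTTAT(A/G)TTT(A/T)
choices = ["AT", "T", "T", "T", "A", "T", "AG", "T", "T", "T", "AT"]

# One string per combination of the variable positions 1, 7 and 11
generated_patterns = ["".join(bases) for bases in product(*choices)]

# The hand-written list from the notebook, for comparison
patterns = ["ATTTATATTTA","TTTTATGTTTT","ATTTATATTTT","ATTTATGTTTA","ATTTATGTTTT","TTTTATATTTA","TTTTATATTTT","TTTTATGTTTA"]

print(len(generated_patterns))
print(sorted(generated_patterns) == sorted(patterns))
```

Generating the list this way guards against a typo or a missing combination if the motif is ever changed.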

# Parsing the fasta files

Next we open the downloaded chromosome sequences one by one. There are 16 of them, so we use a `for` loop to repeat the procedure 16 times.

We parse each FASTA file with the `parse()` function of the SeqIO module provided by the Biopython package.

After parsing the file, we extract the DNA sequence and convert it to a string so that we can do the rest of our work on it.


In [None]:
for i in range(1,17):
    print("\n",end="")
    print(f"For Chromosome {i}: ")
    new_file_name = file_name+str(i)+".fasta"
    # now parse this file
    dna_sequence=""
    for sequence in sio.parse(new_file_name,"fasta"):
        dna_sequence=str(sequence.seq)

# Searching for the Origin of Replication using `nt_search()`

With the DNA sequence available as a string, we iterate over all the possible sequences for the Origin of Replication and search for each one in the chromosome.

Here we use the `nt_search()` function of the SeqUtils module, which looks for a given subsequence in a longer sequence and returns the positions where matches occur.

In [None]:
for sequence in sio.parse(new_file_name,"fasta"):
    dna_sequence=str(sequence.seq)
total_positions = []
for specific_sequence in patterns:
    matches = util.nt_search(dna_sequence,specific_sequence)
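
To make the search step concrete, here is a minimal sketch of a comparable search written with the standard `re` module instead of Biopython. It is not the `nt_search()` implementation; the function name `find_motif` and the toy sequence are assumptions made for illustration. The character classes encode the same choices at positions 1, 7 and 11, so a single regular expression covers all eight patterns.

```python
import re

# The lookahead lets the search report overlapping matches as well
ORI_MOTIF = re.compile(r"(?=[AT]TTTAT[AG]TTT[AT])")

def find_motif(dna_sequence):
    """Return the 0-based start position of every motif match (forward strand only)."""
    return [match.start() for match in ORI_MOTIF.finditer(dna_sequence.upper())]

# Toy sequence (not real yeast data) containing two of the eight patterns
toy_sequence = "GGATTTATATTTAGGCCTTTTATGTTTTCC"
print(find_motif(toy_sequence))
```

Searching once with a combined pattern returns the positions already in sequence order, whereas looping over eight separate patterns groups them by pattern.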

# Storing the possible positions

`nt_search()` returns a list whose first element is the search pattern as a string; any matches follow it as integer positions. We store this list in the variable `matches`.

We then iterate over `matches` and keep only the integer entries, which are the match positions, appending them to the list `total_positions`.


In [None]:
for position_number in matches:
    if(isinstance(position_number,int)):
        total_positions.append(position_number)

# Finding all the possible positions

Finally, we repeat this process for every pattern and append any matches found by `nt_search()` to the `total_positions` list.
We then print the contents of `total_positions`, which gives the candidate positions of the Origin of Replication for that chromosome.

In [None]:
total_positions = []
for specific_sequence in patterns:
    matches = util.nt_search(dna_sequence,specific_sequence)
    for position_number in matches:
        if(isinstance(position_number,int)):
            total_positions.append(position_number)
print(f"Origin of Replication found at positions: {total_positions}")
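
Because the positions are collected pattern by pattern, the printed list is not in chromosome order. A small post-processing sketch: sort the positions and, if 1-based coordinates are wanted, add 1. This assumes the reported positions are 0-based offsets into the string, as with Python indexing; `sorted_positions` and `one_based_positions` are illustrative names. The input list is the Chromosome 1 result shown in the output of this notebook.

```python
# Positions reported for Chromosome 1, in the order the patterns produced them
total_positions = [159953, 176522, 176236, 17149, 208605, 229450]

# Sort into chromosome order and drop any duplicates
sorted_positions = sorted(set(total_positions))

# Assumption: positions are 0-based, so add 1 for 1-based coordinates
one_based_positions = [position + 1 for position in sorted_positions]

print(sorted_positions)
print(one_based_positions)
```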

# Complete Code for Part (b)

In [2]:
from Bio import Entrez as ent
from Bio import SeqUtils as util
from Bio.Seq import Seq as seq
from Bio import SeqIO as sio
file_name = "Chromosome "
Chromosome_number = 1
patterns = ["ATTTATATTTA","TTTTATGTTTT","ATTTATATTTT","ATTTATGTTTA","ATTTATGTTTT","TTTTATATTTA","TTTTATATTTT","TTTTATGTTTA"]
for i in range(1,17):
    print("\n",end="")
    print(f"For Chromosome {i}: ")
    new_file_name = file_name+str(i)+".fasta"
    dna_sequence=""
    for sequence in sio.parse(new_file_name,"fasta"):
        dna_sequence=str(sequence.seq)
    total_positions = []
    for specific_sequence in patterns:
        matches = util.nt_search(dna_sequence,specific_sequence)
        for position_number in matches:
            if(isinstance(position_number,int)):
                total_positions.append(position_number)
    print(f"Origin of Replication found at positions: {total_positions}")


For Chromosome 1: 
Origin of Replication found at positions: [159953, 176522, 176236, 17149, 208605, 229450]

For Chromosome 2: 
Origin of Replication found at positions: [632052, 238293, 381151, 420235, 622760, 792466, 603190, 665038, 122598, 368745, 543395, 568821, 755032, 80, 195767, 326080, 812416]

For Chromosome 3: 
Origin of Replication found at positions: [74520, 11256, 78863, 224863, 14700, 201845, 231261, 315820]

For Chromosome 4: 
Origin of Replication found at positions: [104908, 521761, 807779, 1182775, 1240933, 340870, 347217, 480280, 210566, 477645, 1070495, 1422530, 232057, 263124, 1462061, 77223, 405175, 443872, 427871, 1272225, 50459, 67634, 233925, 420761, 521602, 561437, 609151, 677939, 709270, 913867, 1398457, 1404336, 1445629, 1462567, 111128, 1057898, 1447485, 1524662]

For Chromosome 5: 
Origin of Replication found at positions: [59536, 309689, 16057, 99492, 280702, 284609, 390603, 7976, 230002, 301681, 498891, 287569, 105316, 49778, 109085, 110515, 345455, 64