# PROJECT 1: Parsing a FASTA file

This program demonstrates how to parse a FASTA file using Biopython and print the sequence IDs and their lengths. FASTA is a common file format used to represent nucleotide or protein sequences.

## Objectives
1. Parse a local FASTA file and extract sequence information.
2. Fetch and parse FASTA data from an online database.

### STEP 1: Installing and Importing Libraries

**Biopython** is a collection of tools and libraries for computational biology and bioinformatics. It provides functionalities to handle biological data, such as sequences, structures, and annotations.

In [3]:
# Installing Biopython library
!pip install biopython

Collecting biopython
  Downloading biopython-1.83-cp310-cp310-win_amd64.whl (2.7 MB)
     ---------------------------------------- 2.7/2.7 MB 4.7 MB/s eta 0:00:00
Installing collected packages: biopython
Successfully installed biopython-1.83


The **SeqIO** module in Biopython is used for reading and writing sequence file formats. It supports a variety of formats, including FASTA, GenBank, and others. The **Entrez** module in Biopython provides access to NCBI's Entrez databases. It allows you to search and fetch data programmatically.

In [4]:
# Importing necessary modules from Biopython
from Bio import SeqIO
from Bio import Entrez

In [5]:
# Printing the Biopython version to verify installation
import Bio
print(f"Biopython version: {Bio.__version__}")

Biopython version: 1.83


### Objective 1: Parsing a Local FASTA File

In bioinformatics, parsing a FASTA file helps us pick out sequence IDs and sequences so we can analyze them.

The **`parse_fasta`** function takes a file path as input and parses the FASTA file with `SeqIO.parse()`.
It loops over each record in the FASTA file.
For each record, it prints the sequence ID, the sequence itself, and the sequence length.

In [11]:
# parse_fasta parses a FASTA file and prints each record's ID, sequence, and length
def parse_fasta(file_path):
    for record in SeqIO.parse(file_path, "fasta"):
        print(f"ID: {record.id}")
        print(f"Sequence: {record.seq}")
        print(f"Length: {len(record.seq)}\n")

In [12]:
# Example usage
fasta_file = "Example1.fasta"  # Replace with your FASTA file path
parse_fasta(fasta_file)

ID: seq1
Sequence: ATGCGTACGTAGCTAGCTGAC
Length: 21

ID: seq2
Sequence: CGTACGTAGCTAGCTGACGTAGCTAGC
Length: 27
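
To make the file layout concrete, here is a minimal standard-library sketch of what reading FASTA involves: a header line starting with `>` followed by one or more sequence lines. It is an illustration of the format only, not how Biopython implements `SeqIO.parse()`. The helper name `read_fasta` is hypothetical, and the way `seq1` is wrapped across two lines is an assumption made to show multi-line sequences.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are joined, so line wrapping does not matter
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Assumed file content: the two sequences printed above, with seq1 wrapped
example = StringIO(
    ">seq1\nATGCGTACGTAG\nCTAGCTGAC\n>seq2\nCGTACGTAGCTAGCTGACGTAGCTAGC\n"
)
records = list(read_fasta(example))
for header, sequence in records:
    print(header, len(sequence))
```

For real work `SeqIO.parse()` remains the better choice, since it returns full record objects and supports many formats besides FASTA.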



### Objective 2: Fetch and parse FASTA data from an online database

In bioinformatics, data often needs to be retrieved from large online databases. One commonly used database is NCBI (National Center for Biotechnology Information), which stores a vast amount of biological data. We'll use Biopython's Entrez module to fetch FASTA data from NCBI and then parse it to extract useful information.

Below are the steps to achieve this:

a) **Set Up Access to NCBI:** Before we can fetch data, we need to set up access to NCBI. NCBI requires you to provide an email address when using their API, so they can contact you if there is any issue with your queries.

b) **Perform a Search Query:** Use `Entrez.esearch()` to search the specified database ("nucleotide" in this case) for the query ("Homo sapiens COX1").

c) **Fetch the Data:** Use `Entrez.efetch()` to retrieve the sequence data in FASTA format.

d) **Parse the Data:** Convert the fetched data into a file-like object and use `SeqIO.parse()` to read and process the sequences.

#### i) Steps to fetch sequence data from NCBI's database using Entrez

In [13]:
# Set email for NCBI Entrez
Entrez.email = "k26sangeetha@gmail.com"  # Replace with your email

In [14]:
# Function to fetch FASTA data from NCBI
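def fetch_fasta_from_ncbi(query, database="nucleotide"):
    """Return the top search hit for `query` as FASTA text, or None if there are no hits."""
    # Search the database and keep only the first matching record ID
    handle = Entrez.esearch(db=database, term=query, retmax=1)
    record = Entrez.read(handle)
    handle.close()
    if not record["IdList"]:
        return None
    # Fetch that record in FASTA format as plain text
    seq_id = record["IdList"][0]
    handle = Entrez.efetch(db=database, id=seq_id, rettype="fasta", retmode="text")
    fasta_data = handle.read()
    handle.close()
    return fasta_data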

In [16]:
# Example usage
query = "Homo sapiens COX1" 
fasta_data = fetch_fasta_from_ncbi(query)
print(fasta_data)

>PP914118.1 Taenia solium isolate B cytochrome c oxidase subunit I (COX1) gene, partial cds; mitochondrial
TAGATTTTTTAATGTTTTCTTTACATTTAGCTGGTGTATCAAGTATTTTTAGTTCTATTAATTTTATATG
TACATTATATAGAGTTTTTATGACTAATATATTTTCTCGTACATCTATAGTGTTATGATCTTATTTATTT
ACATCTATCTTGTTATTGGTTACTTTACCTGTTTTGGCAGCCGCTGTTACTATGCTTCTATTTGATCGTA
AATTTAGTTCTGCGTTTTTTGATCCGTTAGGAGGTGGTGATCCTGTTTTATTTCAACATATGTTTTGATT
TTTTGGTCATCCTGAGGTTTATGTGTTAATTCTTCCGGGGTTTGGTATAATTAGTCATATATGTTTGAGT
ATAAGTATGTGTTCTGATGCTTTTGGCTTTTATGGGTTATTGTTTGCTATGTTTTCAATAGTATGTTTAG
GAAGAAGTGTATGAGGGCATCATATGTTTACGGTTGGGTTAGATGTTAAGACGGCTGTATTTTTTAGTTC
TGTTACTATGATAATTGGAGTGCCTACGGGGATTAAGGTTTTTACTTGGCTTTATATGCTTTTAAAATCT
CGTGTTAATAAGAGTGATCCGGTTTTATGATGAATAATTTCGTTTATAGTATTGTTTACATTTGGTGGTG
TAACTGGTATTATTCTATCTGCTTGTGTATTAGATAAAGTTCTTCATGATACTTGGTTTGTTGTTGCTCA
TTTTCATT
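
Note that the record returned above is from *Taenia solium*, not *Homo sapiens*: the query is a free-text search, and `retmax=1` keeps only the top hit, whatever organism it comes from. One possible way to narrow the search is to use Entrez field tags such as `[Organism]` and `[Gene Name]`. The sketch below only builds the query string; `build_entrez_query` is a hypothetical helper, and whether the top hit is then the intended record still depends on how the records are annotated and ranked at NCBI.

```python
def build_entrez_query(organism, gene):
    """Combine an organism and a gene name into a fielded Entrez search term."""
    # Quoting the organism keeps the two words together as one phrase
    return f'"{organism}"[Organism] AND {gene}[Gene Name]'


query = build_entrez_query("Homo sapiens", "COX1")
print(query)
```

The resulting string can be passed to `fetch_fasta_from_ncbi()` in place of the free-text query.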




#### ii) Parsing the Fetched FASTA Data 

In [17]:
from io import StringIO

In [18]:
# Function to parse a FASTA string and print sequence IDs and lengths
def parse_fasta_string(fasta_string):
    fasta_io = StringIO(fasta_string)
    for record in SeqIO.parse(fasta_io, "fasta"):
        print(f"ID: {record.id}")
        print(f"Length: {len(record.seq)}\n")

In [19]:
if fasta_data:
    parse_fasta_string(fasta_data)
else:
    print("No data fetched.")

ID: PP914118.1
Length: 708
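
The two ideas used here, wrapping a string in `StringIO` so it can be read like a file, and taking the ID from the header line, can be sketched with the standard library alone. This is an illustration, not Biopython's code: it assumes the ID is the first whitespace-delimited word of the header, which matches the `PP914118.1` printed above, and the sequence line is truncated for brevity.

```python
from io import StringIO

# Header copied from the fetched record; the sequence is shortened here
fasta_text = (
    ">PP914118.1 Taenia solium isolate B cytochrome c oxidase subunit I "
    "(COX1) gene, partial cds; mitochondrial\n"
    "TAGATTTTTTAATGTTTTCT\n"
)

# StringIO gives the string the same read interface as an open file
handle = StringIO(fasta_text)
header = handle.readline().rstrip("\n")

# Drop the leading ">" and split at the first space
record_id, _, description = header[1:].partition(" ")
print(record_id)
print(description)
```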

