# Parsing GenBank Files and Fetching Protein Data

**1. Download GenBank File:**
Go to the NCBI website or any other database that provides GenBank files.

Search for the organism or gene of interest.

Download the GenBank file (.gb or .gbk) to your local machine.

**2. Save the File Locally:**

Save the downloaded file with an appropriate name, such as example.gb.

Alternatively, you can download GenBank files directly from NCBI using Biopython. Biopython provides modules to access NCBI's Entrez system, which allows you to download sequences programmatically.

In [8]:
from Bio import Entrez, SeqIO

In [9]:
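# Set your email for NCBI Entrez (replace with your own address; NCBI uses it to contact you if there is a problem with your requests)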
Entrez.email = "k26sangeetha@gmail.com"

In [10]:
# Function to download a GenBank file
def download_genbank(accession_id, filename):
    with Entrez.efetch(db="nucleotide", id=accession_id, rettype="gb", retmode="text") as handle:
        with open(filename, "w") as out_handle:
            out_handle.write(handle.read())
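
When the notebook is re-run, the same record is fetched from NCBI each time. A minimal sketch of a local cache check follows. The helper name `download_if_missing` and its `downloader` parameter are hypothetical and not part of Biopython; the helper assumes the downloader takes an accession ID and a filename, as `download_genbank` does.

```python
from pathlib import Path


def download_if_missing(accession_id, filename, downloader):
    """Call downloader(accession_id, filename) only if the file is absent or empty.

    Returns True if a download was attempted, False if the local copy was reused.
    `downloader` is any callable with the same arguments as download_genbank.
    """
    path = Path(filename)
    # Reuse a non-empty local copy instead of contacting the server again.
    if path.exists() and path.stat().st_size > 0:
        return False
    downloader(accession_id, filename)
    return True
```

For example, `download_if_missing("U49845", "example.gb", download_genbank)` would fetch the record on the first run and skip the request on later runs.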

**3. Parse the GenBank File Using Biopython:**

Use the functions below to parse the GenBank file and extract the protein data from its CDS features.

In [1]:
from Bio import SeqIO

### Function to Parse GenBank File

In [2]:
# Function to parse a GenBank file and extract protein data
def parse_genbank(file_path):
    proteins = []
    for record in SeqIO.parse(file_path, "genbank"):
        for feature in record.features:
            if feature.type == "CDS" and "translation" in feature.qualifiers:
                protein_data = {
                    "id": record.id,
                    "description": record.description,
                    "protein_id": feature.qualifiers.get("protein_id", [""])[0],
                    "protein_sequence": feature.qualifiers["translation"][0],
                    "gene": feature.qualifiers.get("gene", [""])[0],
                    "product": feature.qualifiers.get("product", [""])[0]
                }
                proteins.append(protein_data)
    return proteins
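
The list of dictionaries returned by `parse_genbank` can be turned into FASTA text for use with other tools. The following is a rough sketch using only the standard library. The function name `proteins_to_fasta`, the header layout, and the 60-residue line width are assumptions chosen for illustration, not something the notebook prescribes.

```python
def proteins_to_fasta(proteins, width=60):
    """Format protein dictionaries (as built by parse_genbank) as FASTA text."""
    entries = []
    for protein in proteins:
        # Build the header from whichever of these fields are non-empty.
        parts = [
            protein.get("protein_id", ""),
            protein.get("gene", ""),
            protein.get("product", ""),
        ]
        header = ">" + " ".join(part for part in parts if part)
        sequence = protein["protein_sequence"]
        # Split the sequence into fixed-width lines.
        lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
        entries.append("\n".join([header] + lines))
    # End the text with a newline only when there is at least one entry.
    return "\n".join(entries) + ("\n" if entries else "")
```

The result is a plain string, so it can be printed or written to a `.fasta` file with an ordinary `open(..., "w")` call.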

### Function to Print Protein Data

In [3]:
# Function to print extracted protein data
def print_protein_data(proteins):
    for protein in proteins:
        print(f"Record ID: {protein['id']}")
        print(f"Description: {protein['description']}")
        print(f"Protein ID: {protein['protein_id']}")
        print(f"Gene: {protein['gene']}")
        print(f"Product: {protein['product']}")
        print(f"Protein Sequence: {protein['protein_sequence']}\n")
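
Full protein sequences can run to hundreds of residues, which makes the printed output hard to scan. One possible alternative is a one-line summary per protein with a truncated sequence preview. The helper `summarize_protein` and its `max_residues` parameter are hypothetical names introduced for this sketch.

```python
def summarize_protein(protein, max_residues=30):
    """Return a tab-separated one-line summary with a shortened sequence preview."""
    sequence = protein["protein_sequence"]
    # Show the whole sequence if it is short, otherwise only its start.
    if len(sequence) <= max_residues:
        preview = sequence
    else:
        preview = sequence[:max_residues] + "..."
    # Fall back to the product name when the CDS has no gene qualifier.
    label = protein["gene"] or protein["product"] or "unnamed"
    return f"{protein['protein_id']}\t{label}\t{len(sequence)} aa\t{preview}"
```

Calling `print(summarize_protein(protein))` inside a loop over `proteins` gives one line per CDS.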

In [7]:
# Example usage
genbank_file = "Example2.gb"  # Replace with your GenBank file path

# Parse the GenBank file
proteins = parse_genbank(genbank_file)

# Print the extracted protein data
print_protein_data(proteins)

Record ID: U49845.1
Description: Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p (AXL2) and Rev7p (REV7) genes, complete cds
Protein ID: AAA98665.1
Gene: TCP1-beta
Product: chaperonin
Protein Sequence: MVKVYAPASSANMSVGFDVLGAAVTPVDGALLGDVVTVEAAETFSLNNLGQKVSRKGVVQVKAVNDKDWSAMSGF

Record ID: U49845.1
Description: Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p (AXL2) and Rev7p (REV7) genes, complete cds
Protein ID: AAA98666.1
Gene: AXL2
Product: Axl2p
Protein Sequence: MYSDFHNMDATSTHGNATGVLHSAGIHVGLEIFQYLVHKNHIQQTFNLAM

Record ID: U49845.1
Description: Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p (AXL2) and Rev7p (REV7) genes, complete cds
Protein ID: AAA98667.1
Gene: REV7
Product: Rev7p
Protein Sequence: MFTYPDFKILPSNLTHFSKNHPVFDWLQEDDSRRVRLIIVDDGE



### Downloading and Parsing a GenBank File

In [12]:
# Example usage: download a GenBank file by accession ID
accession_id = "U49845"  # Replace with the accession ID of the GenBank file you want to download
filename = "example.gb"

In [13]:
# Download the GenBank file
download_genbank(accession_id, filename)

In [14]:
# Parse the GenBank file
proteins = parse_genbank(filename)

In [15]:
# Print the extracted protein data
print_protein_data(proteins)

Record ID: U49845.1
Description: Saccharomyces cerevisiae TCP1-beta gene, partial cds; and Axl2p (AXL2) and Rev7p (REV7) genes, complete cds
Protein ID: AAA98665.1
Gene: 
Product: TCP1-beta
Protein Sequence: SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEAAEVLLRVDNIIRARPRTANRQHM

Record ID: U49845.1
Description: Saccharomyces cerevisiae TCP1-beta gene, partial cds; and Axl2p (AXL2) and Rev7p (REV7) genes, complete cds
Protein ID: AAA98666.1
Gene: AXL2
Product: Axl2p
Protein Sequence: MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVNESFTFQISNDTYKSSVDKTAQITYNCFDLPSWLSFDSSSRTFSGEPSSDLLSDANTTLYFNVILEGTDSADSTSLNNTYQFVVTNRPSISLSSDFNLLALLKNYGYTNGKNALKLDPNEVFNVTFDRSMFTNEESIVSYYGRSQLYNAPLPNWLFFDSGELKFTGTAPVINSAIAPETSYSFVIIATDIEGFSAVEVEFELVIGAHQLTTSIQNSLIINVTDTGNVSYDLPLNYVYLDDDPISSDKLGSINLLDAPDWVALDNATISGSVPDELLGKNSNPANFSVSIYDTYGDVIYFNFEVVSTTDLFAISSLPNINATRGEWFSYYFLPSQFTDYVNTNVSLEFTNSSQDHDWVKFQSSNLTLAGEVPKNFDKLSLGLKANQGSQSQELYFNIIGMDSKITHSNHSANATSTRSSHHSTSTSSYTSSTYTAKISSTSAAATSSAPAALPAANKTSSHNKKAVAIACGVAIPL
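
To keep the extracted data for later analysis rather than only printing it, the records can be serialized as CSV. This is a sketch using the standard `csv` module. The function name `proteins_to_csv` is hypothetical, and the column order simply follows the keys used in `parse_genbank`.

```python
import csv
import io


def proteins_to_csv(proteins):
    """Serialize protein dictionaries (as built by parse_genbank) to CSV text."""
    fieldnames = ["id", "description", "protein_id", "gene", "product", "protein_sequence"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    # Fields containing commas, such as the description, are quoted automatically.
    writer.writerows(proteins)
    return buffer.getvalue()
```

The returned text can be written to disk with `open("proteins.csv", "w", newline="")`, which avoids doubled line endings on Windows.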