# Introduction to Biopython 

## What is Biopython?

Biopython is an open-source collection of Python tools for computational biology.
It helps scientists work with biological data such as DNA/RNA sequences, protein structures, and common file formats and database records (e.g., FASTA, GenBank, PDB).

In this chapter, we’ll learn how to use Biopython for common bioinformatics tasks that are relevant to biochemistry and molecular biology.

## Learning Objectives
- Understand what Biopython is and how to install it
- Learn how to read and write biological sequences (DNA, RNA, protein)
- Perform simple sequence manipulations
- Translate DNA to protein
- Parse real biological data files (FASTA and GenBank)
- Use example datasets from biosciences for practice

Let’s start with some real examples!

In [5]:
# Install Biopython if it is not already installed (needed only once per environment)
!pip install biopython

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO




## Example 1: Creating and manipulating DNA sequences

In [6]:
# Create a DNA sequence
dna_seq = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

# Print the sequence
print("Original DNA sequence:")
print(dna_seq)

# Calculate the complement (A<->T, G<->C)
complement = dna_seq.complement()
print("\nComplementary DNA:")
print(complement)

# Calculate the reverse complement (the opposite strand, read 5' to 3')
rev_complement = dna_seq.reverse_complement()
print("\nReverse Complement:")
print(rev_complement)

# Transcribe DNA to mRNA
mRNA = dna_seq.transcribe()
print("\nTranscribed mRNA:")
print(mRNA)

# Translate mRNA to protein (a '*' in the output marks a stop codon)
protein = mRNA.translate()
print("\nTranslated Protein Sequence:")
print(protein)


Original DNA sequence:
ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG

Complementary DNA:
TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC

Reverse Complement:
CTATCGGGCACCCTTTCAGCGGCCCATTACAATGGCCAT

Transcribed mRNA:
AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG

Translated Protein Sequence:
MAIVMGR*KGAR*
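
The `*` characters in the translated sequence mark stop codons, so the first protein encoded here ends after seven residues. The sketch below shows, in plain Python, how a sequence is read codon by codon and where the first in-frame stop falls. It is an illustration of the idea, not Biopython's implementation; `split_codons` and `first_open_reading_frame` are hypothetical helper names, and the stop-codon set assumes the standard genetic code.

```python
# Stop codons of the standard genetic code, written in the DNA alphabet.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def split_codons(dna):
    """Return the complete codons of `dna`, reading from position 0."""
    dna = str(dna).upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return [dna[i:i + 3] for i in range(0, usable, 3)]


def first_open_reading_frame(dna):
    """Return the codons that precede the first in-frame stop codon."""
    orf = []
    for codon in split_codons(dna):
        if codon in STOP_CODONS:
            break
        orf.append(codon)
    return orf


dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
codons = split_codons(dna)
orf = first_open_reading_frame(dna)

print("Codons:", codons)
print("Index of first stop codon:", len(orf))
print("Coding part before the stop:", "".join(orf))
```

In Biopython itself, passing `to_stop=True` to `translate()` (as done in Example 5) stops translation at the first in-frame stop codon instead of printing `*`.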


## Example 2: Analyzing a short DNA sequence from a lab experiment
Let's say we amplified a gene fragment from a plasmid and want to analyze it.

In [7]:
# A mock DNA sequence from a gene (e.g., a fragment of lacZ)
gene_fragment = Seq("ATGACCATGATTACGCCAAGCTTGTTCTGAAAGGAGGAA")

# Count number of each nucleotide
print("Nucleotide counts:")
print(f"A: {gene_fragment.count('A')}")
print(f"T: {gene_fragment.count('T')}")
print(f"G: {gene_fragment.count('G')}")
print(f"C: {gene_fragment.count('C')}")

# GC content affects primer melting temperature (relevant for PCR design) and duplex stability
gc_content = (gene_fragment.count('G') + gene_fragment.count('C')) / len(gene_fragment) * 100
print(f"\nGC Content: {gc_content:.2f}%")

# Translate to see what protein it encodes
protein = gene_fragment.translate()
print("\nProtein Sequence:")
print(protein)


Nucleotide counts:
A: 13
T: 9
G: 10
C: 7

GC Content: 43.59%

Protein Sequence:
MTMITPSLF*KEE
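
The GC calculation above can be wrapped in a small reusable function. This is a minimal sketch in plain Python: `gc_percent` is a hypothetical helper name, and it accepts either a string or a `Seq` object because it converts its input with `str()`. It also handles two cases the inline formula does not: lower-case input and an empty sequence.

```python
def gc_percent(seq):
    """Return the percentage of G and C in `seq` (case-insensitive)."""
    seq = str(seq).upper()
    if not seq:
        return 0.0  # avoid dividing by zero for an empty sequence
    return (seq.count("G") + seq.count("C")) / len(seq) * 100


fragment = "ATGACCATGATTACGCCAAGCTTGTTCTGAAAGGAGGAA"
print(f"GC content: {gc_percent(fragment):.2f}%")
print(f"Same sequence in lower case: {gc_percent(fragment.lower()):.2f}%")
```

Recent Biopython releases (1.80 and later) also provide `Bio.SeqUtils.gc_fraction`, which returns the GC content as a fraction between 0 and 1 rather than a percentage.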


## Example 3: Reading a FASTA file (e.g., sequencing results)

In [8]:
# We'll create a sample FASTA file with two sequences
# This simulates loading real sequencing results or genome fragments

# Creating two sample sequences
seq1 = SeqRecord(Seq("ATGCGTACGTAGCTAGCTAG"), id="gene1", description="Example gene 1")
seq2 = SeqRecord(Seq("ATGGGGTACGTTAGCAGTAG"), id="gene2", description="Example gene 2")

# Save to a FASTA file
with open("sample_sequences.fasta", "w") as output_handle:
    SeqIO.write([seq1, seq2], output_handle, "fasta")

# Now read it back
print("Reading sequences from FASTA file:\n")
for record in SeqIO.parse("sample_sequences.fasta", "fasta"):
    print(f"ID: {record.id}")
    print(f"Description: {record.description}")
    print(f"Sequence: {record.seq}")
    print("---")


Reading sequences from FASTA file:

ID: gene1
Description: gene1 Example gene 1
Sequence: ATGCGTACGTAGCTAGCTAG
---
ID: gene2
Description: gene2 Example gene 2
Sequence: ATGGGGTACGTTAGCAGTAG
---
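
Note that each description printed above repeats the record ID. This follows from the FASTA format: every record starts with a `>` header line, the ID is the first word of that header, and the description is the whole header. The sketch below is a deliberately simple plain-Python reader that makes this visible. It is not how `SeqIO.parse` is implemented; `parse_fasta` is a hypothetical helper, and the sample text is wrapped by hand to show that a sequence may span several lines.

```python
import io


def parse_fasta(handle):
    """Yield (id, description, sequence) tuples from a FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header.split()[0], header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        yield header.split()[0], header, "".join(chunks)


# Hand-written sample mirroring the two records written in this example.
fasta_text = (
    ">gene1 Example gene 1\n"
    "ATGCGTACGT\n"
    "AGCTAGCTAG\n"
    ">gene2 Example gene 2\n"
    "ATGGGGTACGTTAGCAGTAG\n"
)

records = list(parse_fasta(io.StringIO(fasta_text)))
for record_id, description, sequence in records:
    print(record_id, "|", description, "|", sequence)
```

For real data, `SeqIO.parse` remains the better choice, since it returns full `SeqRecord` objects and supports many formats besides FASTA.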


## Example 4: Parsing a GenBank file (the record format used by NCBI)

In [9]:
# Biopython can parse GenBank files downloaded from NCBI.
# For a file already on disk you would use: SeqIO.read("filename.gb", "genbank")
# Here we instead fetch a real GenBank record directly from NCBI via Entrez.

from Bio import Entrez

# Always include your email when accessing NCBI
Entrez.email = "your_email@example.com"

# Fetch a real GenBank record using NCBI Entrez
# For example, the E. coli lactose operon, which contains lacZ (Accession: J01636.1)

handle = Entrez.efetch(db="nucleotide", id="J01636.1", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()

# Print basic information
print(f"ID: {record.id}")
print(f"Name: {record.name}")
print(f"Description: {record.description}")
print(f"Length: {len(record.seq)} bp")

# Print the first 100 nucleotides
print("\nFirst 100 bases:")
print(record.seq[:100])

# List features (like genes, CDS, promoters)
print("\nFeatures in the record:")
for feature in record.features[:5]:  # Print only first 5 for brevity
    print(f"- {feature.type}")


ID: J01636.1
Name: ECOLAC
Description: E.coli lactose operon with lacI, lacZ, lacY and lacA genes
Length: 7477 bp

First 100 bases:
GACACCATCGAATGGCGCAAAACCTTTCGCGGTATGGCATGATAGCGCCCGGAAGAGAGTCAATTCAGGGTGGTGAATGTGAAACCAGTAACGTTATACG

Features in the record:
- source
- variation
- gene
- CDS
- regulatory


## Example 5: Translating a gene coding sequence (CDS)

In [10]:
# We'll extract the first protein-coding sequence (CDS) from the GenBank record
# fetched in Example 4 (this cell reuses the `record` variable from that cell).
for feature in record.features:
    if feature.type == "CDS":
        cds_seq = feature.extract(record.seq)
        print("\nExtracted CDS sequence:")
        print(cds_seq[:60], "...")  # print first part only
        protein = cds_seq.translate(to_stop=True)
        print("\nTranslated protein sequence:")
        print(protein[:60], "...")  # print first part only
        break



Extracted CDS sequence:
GTGAAACCAGTAACGTTATACGATGTCGCAGAGTATGCCGGTGTCTCTTATCAGACCGTT ...

Translated protein sequence:
VKPVTLYDVAEYAGVSYQTVSRVVNQASHVSAKTREKVEAAMAELNYIPNRVAQQLAGKQ ...
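
The translated protein above begins with `V` because the first codon of this CDS is GTG, which a plain codon-by-codon translation reads as valine. In bacteria GTG can serve as a start codon, and a start codon is decoded as methionine. The sketch below shows how Biopython's `translate()` behaves when told that a sequence is a complete CDS. The nine-base `mini_cds` is a made-up toy sequence, not part of the lac operon record, and `table=11` (the NCBI bacterial, archaeal and plant plastid code) is an assumption about which genetic code is appropriate.

```python
from Bio.Seq import Seq

# Hypothetical mini-CDS: GTG start codon, one lysine codon (AAA), TAA stop codon.
mini_cds = Seq("GTGAAATAA")

# Plain translation: every codon is looked up individually, stop shown as '*'.
plain = mini_cds.translate(table=11)

# cds=True checks that the sequence starts with a valid start codon, has a
# length divisible by three and ends with a single stop codon; the start
# codon is then translated as methionine and the stop is dropped.
as_cds = mini_cds.translate(table=11, cds=True)

print("Plain translation:", plain)
print("Translated as CDS:", as_cds)
```

With `cds=True`, Biopython raises an error if the sequence is not a well-formed CDS, so this option is best applied to a full CDS extracted from a record rather than to an arbitrary fragment.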


## Summary: What You’ve Learned
In this chapter, you learned how to:

1. Install and import Biopython modules
2. Create and manipulate biological sequences (DNA, RNA, protein)
3. Transcribe DNA and translate to protein
4. Count nucleotides and compute GC content
5. Read and write FASTA files (common in sequencing)
6. Access and parse real GenBank records from NCBI
7. Extract CDS and translate protein from GenBank files