# BIOPYTHON
Biopython is an open-source Python library with modules for reading, manipulating, and analyzing biological data.<br>

For beginners, focus on:<br>
**Bio.Seq**: Manipulate DNA, RNA, and protein sequences.<br>
**Bio.SeqIO**: Read/write sequence files (e.g., FASTA, GenBank).<br>
**Bio.Entrez**: Access NCBI databases for real biological data.<br>
**Bio.SeqUtils**: Analyze sequences (e.g., GC content).<br>
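
To give a feel for what reading a FASTA file and measuring GC content involve, here is a minimal plain-Python sketch. It does not use Biopython: `parse_fasta` and `gc_content` are hypothetical helper names written only for illustration, and the two records in `fasta_text` are made-up data. Bio.SeqIO and Bio.SeqUtils cover the same ground with far more formats and edge cases.

```python
# Plain-Python sketch (no Biopython): parse FASTA text and compute GC content.
# parse_fasta and gc_content are hypothetical helpers, not Biopython functions.

def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        yield header, "".join(chunks)


def gc_content(sequence):
    """Return the fraction of G and C letters (0.0 for an empty sequence)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


# Made-up example data
fasta_text = """>seq1 example record
ATGCGT
ACG
>seq2
GGCC
"""

for header, sequence in parse_fasta(fasta_text):
    print(header, sequence, round(gc_content(sequence), 3))
```
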

In [22]:
# Working with sequences
from Bio.Seq import Seq

dna = Seq("ATGCGTACG")

# Basic operations
print("DNA: ", dna, end="\n")
print("RNA: ", dna.transcribe(), end="\n")
print("Protein: ", dna.translate(), end="\n")
print("Complement: ", dna.complement(), end="\n")
print("Reverse Complement: ", dna.reverse_complement(), end="\n")

DNA:  ATGCGTACG
RNA:  AUGCGUACG
Protein:  MRT
Complement:  TACGCATGC
Reverse Complement:  CGTACGCAT


# DNA
**Structure and shape:** double-stranded molecule forming a double helix<br>
**Strand:** Each strand is made of nucleotides. The strands are held together by hydrogen bonds.<br>
**Nucleotide:** Sugar molecule (Deoxyribose) + Phosphate group + One of four nitrogenous bases<br>
**Nitrogenous bases:** Adenine (A), Guanine (G), Thymine (T), Cytosine (C)<br>
*A* pairs with *T* and *G* with *C*<br>
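
The pairing rule can be checked with a few lines of plain Python. This is only a sketch: `is_complementary` is a hypothetical helper, and it assumes the two strands are written aligned position by position (so the second strand runs in the opposite direction).

```python
# Watson-Crick pairing rule for DNA: A-T and G-C
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def is_complementary(strand_a, strand_b):
    """Return True if every base in strand_a pairs with the base opposite it."""
    if len(strand_a) != len(strand_b):
        return False
    return all(
        PAIRS.get(a) == b
        for a, b in zip(strand_a.upper(), strand_b.upper())
    )


print(is_complementary("ATGCGTACG", "TACGCATGC"))
print(is_complementary("ATG", "TAG"))
```
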

# RNA
**Structure:** single-stranded molecule<br>
**Strand:** made of nucleotides.<br>
**Nucleotide:** Sugar molecule (Ribose) + Phosphate group + One of four nitrogenous bases<br>
**Nitrogenous bases:** Adenine (A), Guanine (G), Uracil (U), Cytosine (C)<br>

# TRANSCRIPTION
DNA's genetic code is copied into mRNA in the nucleus.<br>
**Example:**<br>
Template strand (3' → 5'): ATGCGTACG<br>
Transcribed mRNA (5' → 3'): UACGCAUGC<br>

ATGCGTACG is part of the template strand (3' → 5'), which RNA polymerase reads to synthesize mRNA.<br>
Its complementary DNA strand is TACGCATGC, known as the coding strand (5' → 3').<br>
It is called the coding strand because its sequence matches the mRNA (except that thymine T is replaced with uracil U in RNA).<br>

_transcribe()_ does not pair bases against a template strand. It treats the Seq as the coding strand (5' → 3') and replaces each T with U, so ATGCGTACG becomes AUGCGUACG.<br>
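
To make the two strand conventions concrete, here is a plain-Python sketch. Both function names are hypothetical and neither is part of Biopython. The first pairs each base of a template strand written 3' → 5' with its RNA partner; the second treats its input as the coding strand and only swaps T for U, which matches the `dna.transcribe()` output shown in the first code cell (AUGCGUACG for ATGCGTACG).

```python
# RNA base that pairs with each DNA template base
DNA_TO_RNA_PAIR = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe_from_template(template_3_to_5):
    """Build mRNA (5'->3') by pairing against a template written 3'->5'."""
    return "".join(DNA_TO_RNA_PAIR[base] for base in template_3_to_5.upper())


def transcribe_from_coding(coding_5_to_3):
    """Build mRNA from the coding strand: same letters, T replaced by U."""
    return coding_5_to_3.upper().replace("T", "U")


print(transcribe_from_template("ATGCGTACG"))  # template-strand view
print(transcribe_from_coding("ATGCGTACG"))    # coding-strand view
```

Both views describe the same biology: the mRNA made from a template strand equals the coding strand with U in place of T.
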

# TRANSLATION
The mRNA leaves the nucleus and binds to a ribosome in the cytoplasm, where tRNAs deliver the amino acid that matches each codon (group of three nucleotides) in the mRNA.<br>
The ribosome links these amino acids into a growing polypeptide (protein).<br>

_translate()_ takes a coding DNA or mRNA sequence and returns the amino acid sequence, reading one codon (three nucleotides) at a time.<br>
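
Codon-by-codon translation can be sketched in plain Python as follows. `translate_mrna` is a hypothetical helper, not Biopython's implementation. It assumes the standard genetic code, writes stop codons as `*` without stopping at them, and ignores any trailing bases that do not fill a whole codon.

```python
from itertools import product

# Standard genetic code, listed in U, C, A, G order for each codon position
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_mrna(mrna):
    """Translate mRNA into one-letter amino acid codes, one codon at a time."""
    mrna = mrna.upper()
    usable_length = len(mrna) - len(mrna) % 3  # drop an incomplete last codon
    return "".join(
        CODON_TABLE[mrna[i:i + 3]] for i in range(0, usable_length, 3)
    )


print(translate_mrna("AUGCGUACGUUAGCU"))
```
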

_complement()_ returns the complementary DNA strand<br>
*reverse_complement()* returns the complement written in reverse order, so the opposite strand also reads 5' → 3'<br>
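
The difference between the two operations is easy to see in a plain-Python sketch. The helpers below are hypothetical stand-ins for the Seq methods and handle only the four unambiguous DNA letters.

```python
# Translation table mapping each DNA base to its partner (both cases)
COMPLEMENT_TABLE = str.maketrans("ATGCatgc", "TACGtacg")


def complement(dna):
    """Swap every base for its partner, keeping the original order."""
    return dna.translate(COMPLEMENT_TABLE)


def reverse_complement(dna):
    """Complement the strand, then reverse it so it reads 5'->3'."""
    return complement(dna)[::-1]


dna = "ATGCGTACG"
print(complement(dna))
print(reverse_complement(dna))
```

Applying `reverse_complement` twice returns the original sequence, which is a quick sanity check for any implementation.
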

# Bio.Seq

The Bio.Seq module provides tools to create, manipulate, and analyze biological sequences.<br>

**Key Features of Bio.Seq**<br>
**Sequence Creation:** Represent DNA, RNA, or protein sequences as Seq objects.<br>
**Sequence Operations:** Perform biological transformations (e.g., transcription, translation) and string-like operations (e.g., slicing, searching).<br>
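**Alphabets:** Older Biopython versions let you attach an alphabet (DNA, RNA, protein) to a sequence for validation. The `Bio.Alphabet` module was removed in Biopython 1.78, so a Seq object now stores only the letters and your code has to keep track of the sequence type.<br>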
**Integration:** Works seamlessly with other Biopython modules like SeqIO and SeqUtils.<br>
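
If the sequence type is not validated for you, a small explicit check can catch stray letters before analysis. The sketch below is plain Python and assumes only the unambiguous letters are allowed; `invalid_letters` and the `ALLOWED` table are hypothetical names, not part of Biopython.

```python
# Letters accepted for each molecule type (unambiguous codes only)
ALLOWED = {
    "DNA": set("ACGT"),
    "RNA": set("ACGU"),
}


def invalid_letters(sequence, molecule_type):
    """Return a sorted list of letters not allowed for the molecule type."""
    return sorted(set(sequence.upper()) - ALLOWED[molecule_type])


print(invalid_letters("ATGCGTACG", "DNA"))  # empty list: all letters valid
print(invalid_letters("AUGCGUACG", "DNA"))  # U is not a DNA letter
```
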

# Seq Objects

A Seq object behaves much like a Python string.<br>
It is designed for biological sequences and adds methods for biological operations such as transcription and translation.<br>

In [None]:
# creating a Seq Object
from Bio.Seq import Seq
dna = Seq("ATGCGTACGTTAGCT") # create a DNA sequence

print("DNA Sequence: ", dna, end="\n")
print("Type: ", type(dna))
print("Length: ", len(dna))

DNA Sequence:  ATGCGTACGTTAGCT
Type:  <class 'Bio.Seq.Seq'>
Length:  15


In [None]:
# String-like operations
from Bio.Seq import Seq
dna = Seq("ATGCGTACGTTAGCT")

#slicing - extract parts of a sequence
print("first 6 nucleotides: ", dna[:6], end="\n")
print("last 3 nucleotides:", dna[-3:], end="\n")

# concatenation
dna_2 = Seq("GCTTAG")
combined = dna + dna_2
print("combined variant 1: ", combined, end="\n")
combined = dna_2 + dna
print("combined variant 2: ", combined, end="\n")

# searching (counts non-overlapping matches)
motif = "CGT"
count = str(dna).count(motif)
print(f"{motif} appeared {count} time/s")

# upper or lower case
print("upper case: ", dna.upper())
print("lower case: ", dna.lower())

first 6 nucleotides:  ATGCGT
last 3 nucleotides: GCT
combined variant 1:  ATGCGTACGTTAGCTGCTTAG
combined variant 2:  GCTTAGATGCGTACGTTAGCT
CGT appeared 2 time/s
upper case:  ATGCGTACGTTAGCT
lower case:  atgcgtacgttagct


In [6]:
# biological operations
from Bio.Seq import Seq
dna = Seq("ATGCGTACGTTAGCT") # it is a coding strand, 5' to 3'

print("complementary strand: ", dna.complement(), end="\n") # output: template strand, 3' to 5'
print("mRNA: ", dna.transcribe(), end="\n")
print("protein: ", dna.translate(), end="\n")
print("reverse complement: ", dna.reverse_complement(), end="\n")

complementary strand:  TACGCATGCAATCGA
mRNA:  AUGCGUACGUUAGCU
protein:  MRTLA
reverse complement:  AGCTAACGTACGCAT


In [7]:
# Reverse transcription
from Bio.Seq import Seq
rna = Seq("AUGCGUACGUUAGCU")
print("type: ", type(rna), end="\n")

print("DNA: ", rna.back_transcribe(), end="\n")

type:  <class 'Bio.Seq.Seq'>
DNA:  ATGCGTACGTTAGCT


# HANDLING AMBIGUITY

## IUPAC Ambiguous Base Codes
The International Union of Pure and Applied Chemistry (IUPAC) defines standard codes for ambiguous nucleotides.<br>

Here are the common ones for DNA:<br>
<ol>
    <li><b>N:</b> Any nucleotide (A, T, G, or C)</li>
    <li><b>R:</b> Purine (A or G)</li>
    <li><b>Y:</b> Pyrimidine (C or T)</li>
    <li><b>S:</b> Strong (G or C)</li>
    <li><b>W:</b> Weak (A or T)</li>
    <li><b>K:</b> Keto (G or T)</li>
    <li><b>M:</b> Amino (A or C)</li>
    <li><b>B:</b> Not A (C, G, or T)</li>
    <li><b>D:</b> Not C (A, G, or T)</li>
    <li><b>H:</b> Not G (A, C, or T)</li>
    <li><b>V:</b> Not T (A, C, or G)</li>
</ol>
For RNA, replace T with U (e.g., Y = C or U). These codes are case-insensitive in Biopython.<br>
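
One way to read an ambiguous sequence is as shorthand for every concrete sequence it could stand for. The plain-Python sketch below expands the codes listed above; `IUPAC_DNA` and `expand_ambiguous` are hypothetical names written for illustration and are not Biopython objects.

```python
from itertools import product

# Each IUPAC DNA code mapped to the bases it can represent
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "N": "ACGT", "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def expand_ambiguous(sequence):
    """List every concrete DNA sequence an ambiguous sequence can represent."""
    choices = [IUPAC_DNA[letter] for letter in sequence.upper()]
    return ["".join(combo) for combo in product(*choices)]


print(expand_ambiguous("ATR"))
print(len(expand_ambiguous("ATGNGTRCG")))  # N has 4 options, R has 2
```

The number of expansions grows multiplicatively with each ambiguous position, so full expansion is only practical for short sequences.
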

In [1]:
from Bio.Seq import Seq

dna = Seq("ATGNGTRCG")

print("DNA: ", dna, end="\n")
print("Complement: ", dna.complement(), end="\n")
print("Reverse Complement: ", dna.reverse_complement(), end="\n")
rna = dna.transcribe()
print("RNA: ", rna, end="\n")

protein = dna.translate()
print("Protein: ", protein, end="\n\n")

dna = rna.back_transcribe()
print("RNA: ", rna, end="\n")
print("DNA: ", dna, end="\n")

DNA:  ATGNGTRCG
Complement:  TACNCAYGC
Reverse Complement:  CGYACNCAT
RNA:  AUGNGURCG
Protein:  MXX

RNA:  AUGNGURCG
DNA:  ATGNGTRCG
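
The `X` residues in the protein output can be understood with a simplified model: an ambiguous codon gets a definite amino acid only if every codon it could represent encodes the same one. The sketch below implements that rule in plain Python under the standard genetic code. `translate_ambiguous` is a hypothetical helper, and Biopython's own logic is more elaborate than this.

```python
from itertools import product

# Each IUPAC DNA code mapped to the bases it can represent
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "N": "ACGT", "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}

# Standard genetic code for DNA codons, in T, C, A, G order per position
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_ambiguous(dna):
    """Translate coding DNA; emit X when an ambiguous codon is not unique."""
    dna = dna.upper()
    usable_length = len(dna) - len(dna) % 3  # drop an incomplete last codon
    protein = []
    for i in range(0, usable_length, 3):
        codon = dna[i:i + 3]
        # All concrete codons this (possibly ambiguous) codon could stand for
        concrete = product(*(IUPAC_DNA[base] for base in codon))
        amino_acids = {CODON_TABLE["".join(c)] for c in concrete}
        protein.append(amino_acids.pop() if len(amino_acids) == 1 else "X")
    return "".join(protein)


print(translate_ambiguous("ATGNGTRCG"))  # NGT and RCG are not unique
print(translate_ambiguous("GCN"))        # all four GCx codons agree
```
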
