# DNA Transcription and Translation

## Import DNA Sequence

In [1]:
from dna import DNASequence

## DNA Transcription

DNA stores the genetic instructions of living organisms in the sequence of bases `A`, `T`, `C` and `G`.

To make proteins, DNA is first **transcribed** into RNA (ribonucleic acid), where the base `T (Thymine)` is replaced by `U (Uracil)`.

For example:
`DNA`: `ATGCGT`
`RNA`: `AUGCGU`

Transcription is the process by which a segment of DNA is copied into RNA - specifically messenger RNA (mRNA) - by the enzyme RNA polymerase.

In cells this work is done by RNA polymerase; here we simulate it with Python.

DNA -> RNA is the first step in protein synthesis. We replace `T` -> `U` to mimic this biological process.

In [2]:
class DNASequence(DNASequence):  # We extend the existing class to make our lives easier
    def transcribe(self):
        """Transcribes this DNA Sequence into RNA"""
        return self.sequence.replace('T', 'U')

In [3]:
seq = DNASequence("ATGCGTACGATAC")
seq.transcribe()

'AUGCGUACGAUAC'
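Because `str.replace` only touches `T`, any other character (lowercase letters, `N`, stray whitespace) passes through unchanged. As a minimal sketch, a stricter standalone version could validate the input first. The function name `transcribe_strict` is hypothetical and not part of the `dna` module.

```python
def transcribe_strict(dna: str) -> str:
    """Transcribe DNA to RNA, rejecting anything other than A, T, C, G."""
    dna = dna.upper()  # Accept lowercase input
    invalid = set(dna) - set("ATCG")
    if invalid:
        raise ValueError(f"Unexpected bases: {sorted(invalid)}")
    return dna.replace("T", "U")


print(transcribe_strict("atgcgt"))  # AUGCGU
```

Whether to raise or silently skip unknown bases is a design choice; raising makes bad input visible early.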

## Translation

We've seen how DNA (A, T, G, C) is transcribed into RNA (A, U, G, C).

Now we move to **translation**, where RNA is decoded into a **protein** - a sequence of amino acids.

## The Genetic Code

Each _set of 3 RNA bases_ is called a **codon**.

For example:
`AUG` -> Methionine (M)
`UUU` -> Phenylalanine (F)
`UGA` -> Stop

There are 64 (4^3) possible codons, since each of the 3 positions can hold any of the 4 bases.
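The count can be checked by enumerating every triplet. This is a small sketch using `itertools.product`; the names `RNA_BASES` and `all_codons` are just for illustration.

```python
from itertools import product

RNA_BASES = "UCAG"

# Every ordered choice of 3 bases, with repetition allowed
all_codons = ["".join(triplet) for triplet in product(RNA_BASES, repeat=3)]

print(len(all_codons))  # 64
print(all_codons[:4])   # ['UUU', 'UUC', 'UUA', 'UUG']
```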

Of these, 61 code for amino acids and 3 are designated as **stop codons** (`UAA`, `UAG`, `UGA`).

Translation usually begins at `AUG` (the start codon), which codes for `Methionine (M)`, and ends when a _stop codon_ is encountered.

So for this RNA: `AUGUUUUAA`

The resulting protein is: `MF`

and stops at the `UAA` codon.
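To make the example concrete, here is a minimal sketch that decodes this one sequence by hand. It uses a tiny lookup (`mini_table`, a hypothetical name) holding only the three codons involved, not the full genetic code.

```python
rna = "AUGUUUUAA"

# Split the sequence into consecutive, non-overlapping triplets
codons = [rna[i:i + 3] for i in range(0, len(rna) - 2, 3)]
print(codons)  # ['AUG', 'UUU', 'UAA']

# Only the codons needed for this example; "*" marks a stop
mini_table = {"AUG": "M", "UUU": "F", "UAA": "*"}

protein = ""
for codon in codons:
    amino_acid = mini_table[codon]
    if amino_acid == "*":
        break  # Stop codon: end of the protein
    protein += amino_acid

print(protein)  # MF
```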

In [4]:
CODON_TABLE = {
    # Phenylalanine (F)
    "UUU": "F", "UUC": "F",
    # Leucine (L)
    "UUA": "L", "UUG": "L", "CUU": "L", "CUC": "L", "CUA": "L", "CUG": "L",
    # Isoleucine (I)
    "AUU": "I", "AUC": "I", "AUA": "I",
    # Methionine (M) - Start codon
    "AUG": "M",
    # Valine (V)
    "GUU": "V", "GUC": "V", "GUA": "V", "GUG": "V",
    # Serine (S)
    "UCU": "S", "UCC": "S", "UCA": "S", "UCG": "S", "AGU": "S", "AGC": "S",
    # Proline (P)
    "CCU": "P", "CCC": "P", "CCA": "P", "CCG": "P",
    # Threonine (T)
    "ACU": "T", "ACC": "T", "ACA": "T", "ACG": "T",
    # Alanine (A)
    "GCU": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    # Tyrosine (Y)
    "UAU": "Y", "UAC": "Y",
    # Histidine (H)
    "CAU": "H", "CAC": "H",
    # Glutamine (Q)
    "CAA": "Q", "CAG": "Q",
    # Asparagine (N)
    "AAU": "N", "AAC": "N",
    # Lysine (K)
    "AAA": "K", "AAG": "K",
    # Aspartic Acid (D)
    "GAU": "D", "GAC": "D",
    # Glutamic Acid (E)
    "GAA": "E", "GAG": "E",
    # Cysteine (C)
    "UGU": "C", "UGC": "C",
    # Tryptophan (W)
    "UGG": "W",
    # Arginine (R)
    "CGU": "R", "CGC": "R", "CGA": "R", "CGG": "R", "AGA": "R", "AGG": "R",
    # Glycine (G)
    "GGU": "G", "GGC": "G", "GGA": "G", "GGG": "G",
    # Stop codons
    "UAA": "*", "UAG": "*", "UGA": "*"
}


Each key is a codon (3 RNA bases). Each value is a single-letter amino acid code.

The stop codons (`UAA`, `UAG`, `UGA`) map to `*` -- we'll use this symbol to mark where the translation should end.
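A table this large is easy to mistype, so a quick sanity check is worthwhile. The sketch below is self-contained: it rebuilds the standard genetic code from a compact 64-character string (bases ordered `U`, `C`, `A`, `G`) and counts the entries. The names `compact_table` and `AMINO_ACIDS` are illustrative; in the notebook, the same checks could be run on `CODON_TABLE` directly.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, in UCAG order
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

compact_table = {
    "".join(triplet): aa
    for triplet, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

stops = sorted(codon for codon, aa in compact_table.items() if aa == "*")

print(len(compact_table))                        # 64
print(stops)                                     # ['UAA', 'UAG', 'UGA']
print(len(set(compact_table.values()) - {"*"}))  # 20
```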

`AUG` (Methionine) is the **start codon** -- translation often begins here.

Translation happens at the *ribosome* - the cell's protein factory.
- The `mRNA` sequence is read in groups of three bases called **codons**.
- Each codon corresponds to one **amino acid** (the building block of proteins).
- Translation starts at a start codon (`AUG`) and continues until a stop codon is encountered.

In [5]:
def translate_rna(rna_seq: str) -> str:
    """
    Translate an RNA sequence into a protein sequence.
    Translation starts at AUG and stops at the first STOP codon (*)
    """
    protein = ""
    started = False

    # Iterate through the RNA sequence in triplets (codons)
    for i in range(0, len(rna_seq) - 2, 3):
        codon = rna_seq[i:i+3].upper()
        if codon not in CODON_TABLE:
            continue # Skip invalid codons

        amino_acid = CODON_TABLE[codon]

        if amino_acid == "*": # Stop Codon
            if started:
                break # Stop translation
            else:
                continue

        if codon == "AUG" and not started:
            started = True # Start translation

        if started:
            protein += amino_acid

    return protein


In [6]:
rna = "AUGUUUUAA"
protein = translate_rna(rna)
print(f"RNA: {rna} -> Protein: {protein}")

RNA: AUGUUUUAA -> Protein: MF
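Note that `translate_rna` reads codons starting at index 0, so it only recognises an `AUG` that falls on an offset divisible by 3. The sketch below illustrates this reading-frame effect with a hypothetical helper, `codons_in_frame`, applied to the same example with one extra base in front.

```python
def codons_in_frame(rna: str, frame: int = 0) -> list:
    """Split rna into complete codons, starting at offset `frame` (0, 1 or 2)."""
    return [rna[i:i + 3] for i in range(frame, len(rna) - 2, 3)]


shifted = "C" + "AUGUUUUAA"  # One extra base shifts the reading frame

print(codons_in_frame(shifted, 0))  # ['CAU', 'GUU', 'UUA']
print(codons_in_frame(shifted, 1))  # ['AUG', 'UUU', 'UAA']
print(shifted.find("AUG") % 3)      # 1
```

In frame 0 the start codon never appears as a whole triplet, so a frame-0 reader would produce an empty protein for `shifted`.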


In [7]:
random_dna = DNASequence.random(256)
random_dna

<DNASequence(length=256)>

In [8]:
rna = random_dna.transcribe()
rna

'CUACAGACUUUCACCUAUCAUGACGAACACUGUGCUUGUCGCGCGAAGGUUGACCACCAAUUUUUUCUUCUAGCAGCGAACGAACCGUCGGUGAACAUAUCACUGGAAUUCGCAGCGAAUGCCCGCGCGACUGUUCAGGCCAUGUGCUUUUAUACGAAUAGCUUAUGAUAGUGGUCAGCAAGAUCACAACCUCACAAAGAUCAACCGCAGAAAUAAUCUUUGAUGUAUGUGGUGCGCGAAUACACAGGAUGGACCG'

In [9]:
protein = translate_rna(rna)
protein

'MCFYTNSL'

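Now let's extend our class with a `translate` method. Unlike `translate_rna`, which only reads codons from position 0, it starts reading at the first `AUG` found anywhere in the sequence.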

In [10]:
class DNASequence(DNASequence):
    def translate(self):
        """Transcribes this DNA sequence and translates the mRNA into a protein string"""
        amino_acids = []
        mrna = self.transcribe().upper()  # Uses transcribe() from the earlier cell

        # Find the first start codon; it also sets the reading frame
        start_index = mrna.find("AUG")
        if start_index == -1:
            return ""  # No start codon found

        # Read codons in triplets from the start codon
        for i in range(start_index, len(mrna), 3):
            codon = mrna[i:i+3]
            if len(codon) < 3:
                break  # Incomplete codon at the end
            amino_acid = CODON_TABLE.get(codon, '???')  # '???' marks an unknown codon
            if amino_acid == "*":
                break  # Stop codon: end translation

            amino_acids.append(amino_acid)

        return "".join(amino_acids)

In [14]:
dna = DNASequence.random(1024)
dna.translate()

'MGSDDSETGSDSGSRSMCYIFV'