# Exploring the Central Dogma of Molecular Biology: Unraveling the Genetic Information Flow

Welcome to this bioinformatics tutorial, a journey through the intricate world of molecular biology. The central dogma, a concept introduced by Francis Crick in 1958, lies at the heart of understanding how the genetic information encoded in DNA directs the creation of functional proteins, the workhorses of life.

In the quest to comprehend the molecular intricacies of life, the central dogma serves as our guiding light. It elegantly outlines the sequential flow of genetic information: from the stable repository of instructions encoded in DNA to the dynamic world of RNA, and finally, to the synthesis of proteins. This fundamental process underpins the very essence of life, enabling the creation of diverse molecules that drive the myriad of biological processes we witness in living organisms.


<div style="text-align:center;">
    <img src="https://slcc.pressbooks.pub/app/uploads/sites/20/2020/12/Central-Dogma-Sequences-1024x512.png" alt="Diagram of the central dogma: DNA, RNA, and protein sequences" width="800">
</div>



Throughout this tutorial, we will delve into the essential operations that bioinformaticians and molecular biologists employ to decipher, manipulate, and analyze the central dogma. Whether you are a novice in the field or an experienced researcher seeking to deepen your understanding, this tutorial is designed to provide you with valuable insights into the tools and techniques used to navigate the molecular pathways that govern life itself.


Let's begin our exploration of the code of life!


# **Task 1.** Read the DNA sequence from a file

1. Go to **[GenBank](https://www.ncbi.nlm.nih.gov/genbank/)**, a genetic sequence database that contains a collection of all publicly available DNA sequences.

2. Search for any gene you would like to explore (e.g. AGTR1) and download its sequence file. The download arrives as a .zip archive; we will use the **gene.fna** file inside it.

3. Upload the **gene.fna** file to your Colab session.

In [None]:
# Open the file for reading
with open('gene.fna', 'r') as file:
    # Initialize an empty string to store the sequence
    # (if the file holds several FASTA records, they are all concatenated here)
    gene_sequences = ""

    # Read each line in the file
    for line in file:
        # Check if the line starts with '>'
        if line.startswith('>'):
            # This is a header line, you can skip it or process it as needed
            pass
        else:
            # Append the gene sequence to the gene_sequences string
            gene_sequences += line.strip()

# Print the entire gene sequence
print(gene_sequences)
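
The cell above joins every sequence line in the file into one string. If the downloaded file contains more than one FASTA record, it can help to keep the records apart. The following is a minimal sketch of that idea, assuming a plain-text FASTA file; `read_fasta` is a hypothetical helper name, not part of any library.

```python
def read_fasta(path):
    """Parse a FASTA file into a dict mapping header text to sequence."""
    records = {}   # header -> list of sequence chunks
    header = None  # header of the record currently being read

    with open(path, "r") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            if line.startswith(">"):
                header = line[1:]  # drop the leading '>'
                records[header] = []
            elif header is not None:
                # Sequence line: normalize to uppercase and store it
                records[header].append(line.upper())

    # Join the chunks of each record into a single string
    return {name: "".join(chunks) for name, chunks in records.items()}
```

For example, `records = read_fasta('gene.fna')` followed by `for name, seq in records.items(): print(name, len(seq))` would list each record with its length.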


# **Task 2.** Find the replication origin in the DNA sequence

In [None]:
def find_ori_location(genome):
    """Return the 0-based index of the first ori match, or -1 if absent."""
    # Example motif that this tutorial treats as the ori signal
    ori = "ATACAATA"

    # str.find returns the lowest matching index, or -1 if there is no match
    return genome.find(ori)

ori_location = find_ori_location(gene_sequences)

if ori_location != -1:
    print("The origin of replication (ori) was found at 0-based position " + str(ori_location))
else:
    print("The origin of replication (ori) was not found in the genome.")
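
The search above reports only the first match. As a hedged sketch, the same idea can be extended to list every occurrence of a pattern, including overlapping ones. The function name `find_all_occurrences` and the short demo sequence are illustrative assumptions, not data from the downloaded gene.

```python
def find_all_occurrences(genome, pattern):
    """Return the 0-based start index of every occurrence, overlaps included."""
    positions = []
    if not pattern:
        return positions  # an empty pattern has no meaningful positions

    start = genome.find(pattern)
    while start != -1:
        positions.append(start)
        # Resume one base to the right so overlapping matches are not skipped
        start = genome.find(pattern, start + 1)
    return positions

# Made-up demo sequence containing two overlapping copies of the motif
demo_genome = "GGATACAATACAATAGG"
print(find_all_occurrences(demo_genome, "ATACAATA"))  # [2, 7]
```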

# **Task 3.** Transcribe a DNA sequence into an RNA sequence

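Transcription is the process of copying DNA into RNA. The RNA transcript has the same sequence as the DNA coding strand, except that thymine (T) is replaced with uracil (U). Here we treat the sequence read from the file as the coding strand.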

In [None]:
# Define the DNA sequence
dna_sequence = gene_sequences

# Initialize an empty string to store the RNA sequence
rna_sequence = ""

# Transcribe DNA to RNA by replacing 'T' with 'U'
for base in dna_sequence:
    if base == 'T':
        rna_sequence += 'U'
    else:
        rna_sequence += base

# Print the RNA sequence
print(rna_sequence)
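
The same T-to-U substitution can be written more compactly with `str.replace`. The sketch below wraps it in a function and adds a basic alphabet check. The name `transcribe` and the choice to accept only A, C, G, T, and N are assumptions made for illustration.

```python
def transcribe(dna):
    """Return the RNA version of a coding-strand DNA sequence."""
    dna = dna.upper()  # accept lowercase input as well

    # Assumption: only A, C, G, T and the ambiguity code N are allowed
    unexpected = set(dna) - set("ACGTN")
    if unexpected:
        raise ValueError(f"Unexpected characters in DNA sequence: {sorted(unexpected)}")

    # Replace every thymine with uracil in a single pass
    return dna.replace("T", "U")

print(transcribe("ATGCGTTA"))  # AUGCGUUA
```

Building the string with `replace` avoids the repeated concatenation in the loop version, which matters for long sequences.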


# **Task 4.** Find a specific motif within the DNA sequence

A motif is a region (a subsequence) of a protein or DNA sequence that has a specific structure. The presence of a motif may be used as a basis for protein classification.

In [None]:
# Import the regular expressions module
import re

# Define the DNA sequence
dna_sequence = gene_sequences

# Define the motif you want to search for
motif = "TACGT"

# Use regular expressions to search for the motif
# (re.finditer reports non-overlapping matches, scanning left to right)
matches = re.finditer(motif, dna_sequence)

# Initialize a list to store the positions of motif matches
match_positions = []

# Iterate over the matches and record their positions
for match in matches:
    start = match.start()
    end = match.end()
    match_positions.append((start, end))

# Print the positions of motif matches
if match_positions:
    print(f"The motif '{motif}' was found at the following positions:")
    for start, end in match_positions:
        print(f"Start: {start}, End: {end}")
else:
    print(f"The motif '{motif}' was not found in the DNA sequence.")
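
Because `re.finditer` skips past each match before continuing, copies of a motif that overlap one another are not all reported. One common workaround, sketched here under the assumption that the motif is a literal string, is a zero-width lookahead. `find_motif_overlapping` is a hypothetical helper name.

```python
import re

def find_motif_overlapping(sequence, motif):
    """Return 0-based start positions of all matches, overlaps included."""
    # (?=...) is a lookahead: it matches at a position without consuming
    # characters, so the scan can report matches that overlap.
    # re.escape keeps the motif literal even if it contains regex symbols.
    pattern = re.compile(f"(?={re.escape(motif)})")
    return [match.start() for match in pattern.finditer(sequence)]

# Made-up demo: 'ATA' occurs three times, overlapping at positions 2 and 4
print(find_motif_overlapping("ATATATA", "ATA"))  # [0, 2, 4]
```

On the same demo string, a plain `re.finditer("ATA", "ATATATA")` would report only positions 0 and 4.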


# **Task 5.** Calculate the GC content (the percentage of guanine (G) and cytosine (C) bases) in a DNA sequence

In [None]:
# Define the DNA sequence
dna_sequence = gene_sequences

# Calculate the GC content (guarding against an empty sequence)
gc_count = dna_sequence.count('G') + dna_sequence.count('C')
sequence_length = len(dna_sequence)
gc_content = (gc_count / sequence_length) * 100 if sequence_length else 0.0

# Print the GC content
print(f"GC Content: {gc_content:.2f}%")
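
The calculation above gives a single value for the whole sequence. As an illustrative sketch, the same formula can be wrapped in a function and applied to consecutive windows to see how GC content varies along the sequence. The helper names `gc_percent` and `gc_windows`, and the default window size, are assumptions chosen for this example.

```python
def gc_percent(sequence):
    """Return the percentage of G and C bases, or 0.0 for an empty sequence."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    gc = sequence.count("G") + sequence.count("C")
    return 100.0 * gc / len(sequence)

def gc_windows(sequence, window=4, step=4):
    """Return the GC percentage of each full window along the sequence."""
    return [
        gc_percent(sequence[i:i + window])
        for i in range(0, len(sequence) - window + 1, step)
    ]

print(gc_percent("ATGC"))      # 50.0
print(gc_windows("GGCCATAT"))  # [100.0, 0.0]
```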

# **Task 6.** Translate an RNA sequence into a protein sequence

We will use a dictionary that maps RNA codons to amino acids.

<div style="text-align:center;">
    <img src="https://upload.wikimedia.org/wikipedia/commons/7/70/Aminoacids_table.svg" alt="Circular codon table mapping RNA codons to amino acids" width="400">
</div>

How to read the chart: start at the center to find the first letter of the triplet, then move outward toward the periphery for the second and third letters.


In [None]:
# Define a dictionary for RNA to amino acid translation
rna_codon_table = {
    "AUG": "M",  # Start codon
    "UUU": "F",  "UUC": "F",
    "UUA": "L",  "UUG": "L",
    "UCU": "S",  "UCC": "S",  "UCA": "S",  "UCG": "S",
    "UAU": "Y",  "UAC": "Y",
    "UGU": "C",  "UGC": "C",
    "UGG": "W",
    "CUU": "L",  "CUC": "L",  "CUA": "L",  "CUG": "L",
    "CCU": "P",  "CCC": "P",  "CCA": "P",  "CCG": "P",
    "CAU": "H",  "CAC": "H",
    "CAA": "Q",  "CAG": "Q",
    "CGU": "R",  "CGC": "R",  "CGA": "R",  "CGG": "R",
    "AUU": "I",  "AUC": "I",  "AUA": "I",
    "ACU": "T",  "ACC": "T",  "ACA": "T",  "ACG": "T",
    "AAU": "N",  "AAC": "N",
    "AAA": "K",  "AAG": "K",
    "AGU": "S",  "AGC": "S",
    "AGA": "R",  "AGG": "R",
    "GUU": "V",  "GUC": "V",  "GUA": "V",  "GUG": "V",
    "GCU": "A",  "GCC": "A",  "GCA": "A",  "GCG": "A",
    "GAU": "D",  "GAC": "D",
    "GAA": "E",  "GAG": "E",
    "GGU": "G",  "GGC": "G",  "GGA": "G",  "GGG": "G",
    "UAA": "*",  "UAG": "*",  "UGA": "*",  # Stop codons
}

# Initialize an empty string to store the protein sequence
protein_sequence = ""

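# Translate the RNA sequence into a protein sequence, one codon at a time.
# Translation starts at position 0 and does not halt at stop codons ('*').
for i in range(0, len(rna_sequence) - 2, 3):  # skip an incomplete trailing codon
    codon = rna_sequence[i:i+3]
    amino_acid = rna_codon_table.get(codon, 'X')  # 'X' for unknown codons
    protein_sequence += amino_acid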

# Print the protein sequence
print(protein_sequence)
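
The loop above translates from the first base and keeps going past stop codons. A closer approximation of how a ribosome reads a transcript is to begin at the first AUG and stop at the first in-frame stop codon. The sketch below illustrates that under simplifying assumptions: a single forward reading frame, and the standard genetic code rebuilt compactly so the block is self-contained. `translate_first_orf` is a hypothetical helper name.

```python
from itertools import product

# Standard genetic code, listed in U, C, A, G order for each codon position
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate_first_orf(rna):
    """Translate from the first AUG up to the first in-frame stop codon."""
    start = rna.find("AUG")
    if start == -1:
        return ""  # no start codon, so nothing to translate

    protein = []
    # Step through complete codons only, beginning at the start codon
    for i in range(start, len(rna) - 2, 3):
        amino_acid = CODON_TABLE.get(rna[i:i + 3], "X")  # 'X' for unknown codons
        if amino_acid == "*":
            break  # stop codon reached
        protein.append(amino_acid)
    return "".join(protein)

# Made-up demo: AUG GCU UUU UAA, flanked by untranslated bases
print(translate_first_orf("CCAUGGCUUUUUAAGG"))  # MAF
```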