**Step1: DNA** is a set of instructions for the human body, much like a computer program. It is made up of four nitrogenous bases: adenine (A), thymine (T), cytosine (C), and guanine (G). These bases pair in a specific way: A pairs with T, and C pairs with G.


Gene Count: Humans have approximately 19,000-20,000 protein-coding genes, though the exact count varies slightly depending on the source. Genes can be looked up in the lists on Wikipedia (linked below), which give an overview of gene categories and functions.

Ref: https://en.wikipedia.org/wiki/Human_genome

List: https://en.wikipedia.org/wiki/Lists_of_human_genes

In [16]:
def check_for_valid_dna(sequence):
    return all(char in 'ATCG' for char in sequence)

print("Check Whether the given DNA Sequence is Valid ?: ATCGTTAG")
print(check_for_valid_dna("ATCGTTAG"))  # True: the sequence contains only A, T, C, G

print("Check Whether the given DNA Sequence is Valid ?: ATCGXTAG")
print(check_for_valid_dna("ATCGXTAG"))  # False: 'X' is not a valid base

Check Whether the given DNA Sequence is Valid ?: ATCGTTAG
True
Check Whether the given DNA Sequence is Valid ?: ATCGXTAG
False
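
The check above only answers yes or no. As a minimal sketch, the variant below also reports where the invalid characters are. The name `find_invalid_bases` and the choice to accept lowercase input are assumptions for illustration, not part of the original notebook.

```python
def find_invalid_bases(sequence):
    """Return (position, character) pairs for every character that is not A, T, C or G."""
    valid_bases = set("ATCG")
    # Upper-case first so that lowercase input such as "atcg" is accepted.
    return [(index, char)
            for index, char in enumerate(sequence.upper())
            if char not in valid_bases]

print(find_invalid_bases("ATCGTTAG"))  # []
print(find_invalid_bases("ATCGXTAG"))  # [(4, 'X')]
```

An empty list means the sequence is valid. Note that `check_for_valid_dna("")` returns `True` for an empty string, so a caller may want to reject empty input separately.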


In [None]:
# Libraries and packages
from Bio import SeqIO  # SeqIO.parse() reads each sequence record from a FASTA file

**Step2: RNA and Transcription:**
Converting DNA to RNA: The process of converting DNA into RNA is called transcription. During transcription, an enzyme called RNA polymerase reads the DNA template strand and builds a complementary RNA strand.
Key Difference (T to U): RNA uses uracil (U) where DNA uses thymine (T), so an "A" on the DNA template pairs with "U" in the RNA. Because the RNA is complementary to the template strand, it reads the same as the DNA coding strand with every "T" replaced by "U".

In [22]:
# So, let's replace the character "T" with "U" to convert DNA into RNA:

def transcribe_dna_to_mrna(dna_sequence):
    print("Our Given DNA:" + dna_sequence)
    mrna_sequence = dna_sequence.replace('T', 'U')
    print("Converted RNA:" + mrna_sequence)
    return mrna_sequence  # return the string (not None) so it can be reused later

transcribe_dna_to_mrna("ATCGTTAG")

print("")
print("We just replaced T with U")

Our Given DNA:ATCGTTAG
Converted RNA:AUCGUUAG

We just replaced T with U
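
Replacing T with U works when the input is the coding strand. As a rough sketch of the same step starting from the template strand, the block below pairs each template base with its RNA partner and reverses the direction. The names `DNA_TO_RNA_PAIR` and `transcribe_from_template` are hypothetical, and the template is assumed to be written 5' to 3'.

```python
# Base pairing when RNA is built on a DNA template: A->U, T->A, C->G, G->C
DNA_TO_RNA_PAIR = {"A": "U", "T": "A", "C": "G", "G": "C"}

def transcribe_from_template(template_strand):
    """Build the mRNA (5'->3') from a template strand written 5'->3'."""
    # The polymerase reads the template 3'->5', so walk it in reverse.
    return "".join(DNA_TO_RNA_PAIR[base] for base in reversed(template_strand))

coding_strand = "ATCGTTAG"
template_strand = "CTAACGAT"  # reverse complement of the coding strand

print(transcribe_from_template(template_strand))  # AUCGUUAG
print(coding_strand.replace("T", "U"))            # AUCGUUAG
```

Both routes give the same mRNA, which is why the shortcut of swapping T for U is safe for a coding-strand sequence.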


**Step3: Proteins:** Next, let's convert the RNA into a protein. For that we need a mapping from each three-base codon to an amino acid:

Ref: https://en.wikipedia.org/wiki/Genetic_code

In [25]:
genetic_code = {
    'AUG': 'M', 'UUU': 'F', 'UUC': 'F', 'UUA': 'L', 'UUG': 'L',
    'CUU': 'L', 'CUC': 'L', 'CUA': 'L', 'CUG': 'L', 'AUU': 'I',
    'AUC': 'I', 'AUA': 'I', 'GUU': 'V', 'GUC': 'V', 'GUA': 'V',
    'GUG': 'V', 'UCU': 'S', 'UCC': 'S', 'UCA': 'S', 'UCG': 'S',
    'CCU': 'P', 'CCC': 'P', 'CCA': 'P', 'CCG': 'P', 'ACU': 'T',
    'ACC': 'T', 'ACA': 'T', 'ACG': 'T', 'GCU': 'A', 'GCC': 'A',
    'GCA': 'A', 'GCG': 'A', 'UAU': 'Y', 'UAC': 'Y', 'CAU': 'H',
    'CAC': 'H', 'CAA': 'Q', 'CAG': 'Q', 'AAU': 'N', 'AAC': 'N',
    'AAA': 'K', 'AAG': 'K', 'GAU': 'D', 'GAC': 'D', 'GAA': 'E',
    'GAG': 'E', 'UGU': 'C', 'UGC': 'C', 'UGG': 'W', 'CGU': 'R',
    'CGC': 'R', 'CGA': 'R', 'CGG': 'R', 'AGU': 'S', 'AGC': 'S',
    'AGA': 'R', 'AGG': 'R', 'GGU': 'G', 'GGC': 'G', 'GGA': 'G',
    'GGG': 'G', 'UAA': '*', 'UAG': '*', 'UGA': '*'
}

print("Mapping:")
print(genetic_code)


Mapping:
{'AUG': 'M', 'UUU': 'F', 'UUC': 'F', 'UUA': 'L', 'UUG': 'L', 'CUU': 'L', 'CUC': 'L', 'CUA': 'L', 'CUG': 'L', 'AUU': 'I', 'AUC': 'I', 'AUA': 'I', 'GUU': 'V', 'GUC': 'V', 'GUA': 'V', 'GUG': 'V', 'UCU': 'S', 'UCC': 'S', 'UCA': 'S', 'UCG': 'S', 'CCU': 'P', 'CCC': 'P', 'CCA': 'P', 'CCG': 'P', 'ACU': 'T', 'ACC': 'T', 'ACA': 'T', 'ACG': 'T', 'GCU': 'A', 'GCC': 'A', 'GCA': 'A', 'GCG': 'A', 'UAU': 'Y', 'UAC': 'Y', 'CAU': 'H', 'CAC': 'H', 'CAA': 'Q', 'CAG': 'Q', 'AAU': 'N', 'AAC': 'N', 'AAA': 'K', 'AAG': 'K', 'GAU': 'D', 'GAC': 'D', 'GAA': 'E', 'GAG': 'E', 'UGU': 'C', 'UGC': 'C', 'UGG': 'W', 'CGU': 'R', 'CGC': 'R', 'CGA': 'R', 'CGG': 'R', 'AGU': 'S', 'AGC': 'S', 'AGA': 'R', 'AGG': 'R', 'GGU': 'G', 'GGC': 'G', 'GGA': 'G', 'GGG': 'G', 'UAA': '*', 'UAG': '*', 'UGA': '*'}
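
Typing 64 entries by hand is easy to get wrong. As a sketch, the same standard table can be generated from a compact 64-letter string listing the amino acids in U, C, A, G codon order. The names `BASES`, `AMINO_ACIDS`, and `codon_table` are assumptions for illustration.

```python
from itertools import product

BASES = "UCAG"
# One letter per codon, ordered UUU, UUC, UUA, UUG, UCU, ... GGG ('*' marks a stop codon)
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")

# product() varies the last base fastest, which matches the ordering of AMINO_ACIDS
codon_table = {"".join(codon): amino_acid
               for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)}

print(len(codon_table))    # 64
print(codon_table["AUG"])  # M
print(sorted(c for c, aa in codon_table.items() if aa == "*"))  # ['UAA', 'UAG', 'UGA']
```

Comparing `codon_table == genetic_code` in the notebook would be a quick way to check the hand-written dictionary for typos.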


In [28]:
def convert_into_protein(mrna_sequence):
    protein = []  # empty list that collects the amino acids
    start = mrna_sequence.find('AUG')
    if start == -1:
        return ""  # No start codon found

    for i in range(start, len(mrna_sequence) - 2, 3):
        codon = mrna_sequence[i:i+3]
        amino_acid = genetic_code.get(codon, '')
        if amino_acid == '*':  # Stop codon
            break
        if amino_acid:
            protein.append(amino_acid)

    return ''.join(protein)

# Let's try some RNA:
convert_into_protein("UUAAUGUUCUUU")

# UUA - skipped: it comes before the first start codon
# AUG - start codon, translated as "M"
# UUC - translated as "F"
# UUU - translated as "F"

# Final result: "MFF"

'MFF'
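
To make the codon walk-through concrete, here is a small sketch that shows only the splitting step: it finds the first start codon and cuts the rest into complete triplets. The helper name `split_into_codons` is hypothetical and not part of the original notebook.

```python
def split_into_codons(mrna_sequence, start_codon="AUG"):
    """Return the complete codons from the first start codon onward."""
    start = mrna_sequence.find(start_codon)
    if start == -1:
        return []  # no start codon, so there is nothing to translate
    # Step by 3; stopping at len - 2 drops any incomplete trailing codon
    return [mrna_sequence[i:i + 3]
            for i in range(start, len(mrna_sequence) - 2, 3)]

print(split_into_codons("UUAAUGUUCUUU"))  # ['AUG', 'UUC', 'UUU']
print(split_into_codons("UUAAUGUUCUU"))   # ['AUG', 'UUC']
```

The leading "UUA" is dropped because the reading frame is set by the position of "AUG", not by the start of the string.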

In [None]:
pip install biopython

**Step4: Two Sequence Comparison:**

In [29]:
def seq_comparision(seq_one, seq_two, w_size):
    sequence_one = seq_one
    sequence_two = seq_two
    window_size = w_size
    print("Window Size: " + str(window_size))

    # it must be odd and greater than 0
    if window_size % 2 == 0 or window_size < 1:
        print("Window size must be an odd integer greater than 0.")
    else:
        # Half the window, i.e. the number of bases on each side of the center
        half_window = window_size // 2
        print("Half Window: " + str(half_window))

        # Create a 2D grid (list of lists) to store the dot plot:
        # one row per base of sequence_two, one column per base of sequence_one
        plot = [[" " for _ in range(len(sequence_one))] for _ in range(len(sequence_two))]

        # Slide a window over every pair of center positions in the two sequences

        for i in range(half_window, len(sequence_one) - half_window):
            for j in range(half_window, len(sequence_two) - half_window):
                # Check if the windowed sections of both sequences match
                if sequence_one[i - half_window:i + half_window + 1] == sequence_two[j - half_window:j + half_window + 1]:
                    plot[j][i] = "●"  # Place a dot in the plot

        # Print the plot with sequence labels
        print("  " + " ".join(sequence_one))
        for idx, row in enumerate(plot):
            print(sequence_two[idx] + " " + " ".join(row))


In [30]:
seq_comparision("AGTCGATCGATTACCGAT", "AGTCAATCGCTTACCGAA", 3)

Window Size: 3
Half Window: 1
  A G T C G A T C G A T T A C C G A T
A                                    
G   ●                                
T     ●                              
C                                    
A                                    
A                                    
T             ●                      
C       ●       ●                    
G                                    
C                                    
T                                    
T                       ●            
A                         ●          
C                           ●        
C                             ●      
G         ●       ●             ●    
A                                    
A                                    
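
The diagonal runs of dots mark stretches where the two sequences agree. As a minimal sketch, the block below collects the same window matches as coordinates instead of printing them, then measures the longest unbroken diagonal. The names `window_matches` and `longest_diagonal_run` are assumptions for illustration.

```python
def window_matches(seq_one, seq_two, window_size):
    """Return (i, j) center positions where the windows of both sequences are identical."""
    if window_size < 1 or window_size % 2 == 0:
        raise ValueError("Window size must be an odd integer greater than 0.")
    half = window_size // 2
    matches = []
    for i in range(half, len(seq_one) - half):
        for j in range(half, len(seq_two) - half):
            if seq_one[i - half:i + half + 1] == seq_two[j - half:j + half + 1]:
                matches.append((i, j))
    return matches

def longest_diagonal_run(matches):
    """Length of the longest run of matches along any top-left to bottom-right diagonal."""
    match_set = set(matches)
    longest = 0
    for i, j in match_set:
        if (i - 1, j - 1) in match_set:
            continue  # not the first dot of its run
        length = 1
        while (i + length, j + length) in match_set:
            length += 1
        longest = max(longest, length)
    return longest

matches = window_matches("AGTCGATCGATTACCGAT", "AGTCAATCGCTTACCGAA", 3)
print(len(matches))                   # 12
print(longest_diagonal_run(matches))  # 5
```

The 12 matches correspond to the 12 dots in the plot above, and the run of 5 is the diagonal near the end of both sequences, where they share the stretch "CTTACCGA" apart from its first base.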
