<a href="https://colab.research.google.com/github/NerdVerse2024/DNA-Playbox/blob/main/DNA_playbox.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

> **Introduction**

In molecular biology, nucleotides are the basic building blocks of DNA. Each nucleotide consists of a nitrogenous base, a sugar molecule, and a phosphate group. The four types of nitrogenous bases found in DNA are adenine (A), cytosine (C), guanine (G), and thymine (T).

In our DNA toolkit, we define a `Nucleotides` list representing these four bases:

In [None]:
#DNA toolkit day 1
Nucleotides = ["A", "C", "G", "T"]


> **Color-coding nucleotide bases**

Let's add ANSI color codes to the sequences so that each base prints in its own color.

In [None]:
# Adding color codes to sequences
def colored(seq):
    basecolors = {
        'A': '\033[92m',  # green
        'C': '\033[94m',  # blue
        'G': '\033[93m',  # yellow
        'T': '\033[91m',  # red
        'U': '\033[91m',  # red for RNA
        'reset': '\033[0m' # reset to default color
    }
    colored_seq = ""
    for nuc in seq:
        if nuc in basecolors:
            colored_seq += basecolors[nuc] + nuc
        else:
            colored_seq += basecolors['reset'] + nuc
    # Reset at the end so later output is not left colored
    return colored_seq + basecolors['reset']
dna_string = "CGGGGTGACCCTTGTAGCGGACCCCGACGCGGCGGCAACCCCCCGCTTCTGCCTCTGGGGGAGGAAAGCGTCGGC"
print(colored(dna_string))

> **DNA seq Validation**

We've implemented a function called `validatesequence` to validate the integrity of DNA sequences. The function takes a DNA sequence as input and ensures that it contains only valid nucleotides (A, C, G, T). The validation is case-insensitive, as the sequence is converted to uppercase before processing.

### Function: `validatesequence(dna_seq: str) -> str | bool`

This function returns `tmpseq` (the uppercased DNA sequence) if the input contains only nucleotides from the predefined `Nucleotides` list. Otherwise, it returns `False`.

#### Parameters:
- `dna_seq` (str): The input DNA sequence to be validated.


In [None]:
def validatesequence(dna_seq):
    tmpseq = dna_seq.upper()
    for nuc_base in tmpseq:
        if nuc_base not in Nucleotides:
            return False
    # Return only after every base has been checked
    return tmpseq


> **Testing**

The `validatesequence` function is applied to the DNA sequence "ATTGGCCTA". Let's break down the process:

- The DNA sequence is converted to uppercase (for example, "attggccta" → "ATTGGCCTA").
- The function iterates through each nucleotide in the sequence.
- If any character is not found in the predefined `Nucleotides` list (["A", "C", "G", "T"]), the function returns `False`.
- If the iteration completes without finding an invalid nucleotide, the function returns the uppercased sequence, indicating that the sequence is valid.

In this specific example, the output is "ATTGGCCTA" (printed in color), indicating that it is a valid DNA sequence.

In [None]:

## Validating a New DNA Sequence
newDNA_string = "ATTGGCCTA"
print(colored(validatesequence(newDNA_string)))

Next, we test a string that is not DNA. "MLBRDQ" contains characters outside the `Nucleotides` list (["A", "C", "G", "T"]), so `validatesequence` returns `False` at the first invalid character.

In [None]:
## testing with a different sequence
testDNA_string = "MLBRDQ"
print(validatesequence(testDNA_string))
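
The membership check described above can also be sketched with a set comparison. This is a minimal alternative, not the notebook's own function; the names `VALID_BASES` and `is_valid_dna` are hypothetical, and unlike `validatesequence` this sketch returns only `True` or `False`.

```python
# Hypothetical helper: set-based variant of the validation idea
VALID_BASES = set("ACGT")

def is_valid_dna(dna_seq):
    """Return True if every character (case-insensitive) is A, C, G or T."""
    # The set of characters in the sequence must be a subset of the valid bases
    return set(dna_seq.upper()) <= VALID_BASES

print(is_valid_dna("attggccta"))  # True
print(is_valid_dna("MLBRDQ"))     # False
```

Note that an empty string passes this check, because an empty set is a subset of any set.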

We use the **random module** to generate a random DNA sequence of length 16. The **random.choice(Nucleotides)** expression is used to randomly select a nucleotide from the predefined Nucleotides list (**["A", "C", "G", "T"]**), and this process is repeated 16 times to form the random sequence.


In [None]:
import random
random_dna_string = ''.join([random.choice(Nucleotides)
                            for nuc in range(16)])
print(colored(validatesequence(random_dna_string)))
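
For repeatable test runs, the same idea can be sketched with `random.choices` and a fixed seed. The helper name `make_random_dna` and the seed value are assumptions made for illustration only.

```python
import random

NUCLEOTIDES = ["A", "C", "G", "T"]

def make_random_dna(length, seed=None):
    """Build a random DNA string; a fixed seed gives the same string every time."""
    rng = random.Random(seed)  # separate generator, leaves the global one untouched
    return "".join(rng.choices(NUCLEOTIDES, k=length))

seq = make_random_dna(16, seed=42)
print(seq, len(seq))
```

Calling `make_random_dna(16, seed=42)` twice yields the same string, which makes it easier to compare results between runs.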


> **DNA sequence Counting**

The **countfrequencyNuc** function initializes a dictionary **tempdictionary_Freq** to store the count of each nucleotide.
It iterates through each nucleotide in the given DNA sequence (seq) and increments the corresponding count in the dictionary.

In [None]:
import random

def countfrequencyNuc(seq):
    tempdictionary_Freq = {"A": 0, "C": 0, "G": 0, "T": 0}
    for Nuc in seq:
        tempdictionary_Freq[Nuc] += 1
    return tempdictionary_Freq

# Generating a random DNA string for testing
random_dna_string = ''.join(random.choice("ACGT") for _ in range(26))

# Calling the function with the generated DNA string
resulting_dna_frequency = countfrequencyNuc(random_dna_string)

# Printing the result (a dict of counts, so it is printed directly rather than passed to colored())
print(resulting_dna_frequency)
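
The counting loop above can also be sketched with `collections.Counter` from the standard library. The function name `count_bases` is hypothetical; this is an alternative illustration, not a replacement for `countfrequencyNuc`.

```python
from collections import Counter

def count_bases(seq):
    """Count A, C, G and T, reporting 0 for bases that do not occur."""
    counts = Counter(seq)  # missing keys read as 0
    return {base: counts[base] for base in "ACGT"}

print(count_bases("AACGTTTG"))  # {'A': 2, 'C': 1, 'G': 2, 'T': 3}
```

One design difference: this sketch silently ignores characters other than A, C, G and T, whereas a plain dictionary lookup raises a `KeyError` for them.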

> **DNA-RNA Transcription**

This Python code defines a function called `transcription` that converts a DNA sequence into an RNA sequence. The function takes a DNA sequence as a string and replaces each '**T**' (thymine) with '**U**' (uracil) to create the corresponding RNA sequence. The list **Nuc_seq** holds the four DNA **nucleotides** (adenine '**A**', cytosine '**C**', guanine '**G**', and thymine '**T**'); the function itself does not use it.


In [None]:
Nuc_seq = ['A', 'C', 'G', 'T']
# transcription of dna sequence to rna sequence with code
def transcription(sequence):
    return sequence.replace('T', 'U')

The next cell transcribes the DNA sequence "**ACCTAAAGCATAAACGTCGAGCAGT**" into an RNA sequence using the `transcription` function, then prints the resulting RNA sequence and its length. Because each '**T**' is replaced one-for-one with '**U**', the RNA sequence has the same length as the DNA sequence (25 bases).

In [None]:
dna_string = "ACCTAAAGCATAAACGTCGAGCAGT"
string_rna = transcription(dna_string)
print(colored(string_rna))
print(len(string_rna))
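
A few sanity checks make the behavior of the replacement concrete. The sketch below redefines the one-line function under the hypothetical name `transcribe` so that it runs on its own.

```python
def transcribe(dna):
    """Replace every thymine (T) with uracil (U)."""
    return dna.replace("T", "U")

dna = "ACCTAAAGCATAAACGTCGAGCAGT"
rna = transcribe(dna)

print(rna)  # ACCUAAAGCAUAAACGUCGAGCAGU
assert len(rna) == len(dna)              # length is preserved
assert rna.count("U") == dna.count("T")  # every T became a U
assert "T" not in rna                    # no thymine remains
```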

In [None]:
# Solving a few rosalind problems
#Testing the transcription code
dna_string_1 = "GATGGAACTTGACTACGTAAATT"
string_rna_1 = transcription(dna_string_1)
print(colored(string_rna_1))

#Rosalind problem 2
dna_string_2 = "CCAGACGTGTCAGTGAGATAGCGGCTTTGGGGCATTCCCCGAGTAAATCAATTTCTGACGAAGAAAAACACGAGGCTTCGTGAGCAGTAGCGGCAACCTCACGTCGTTTGTGCGTCTTAGAGTATCGGCCGAAGAAAAGGGGACCGAGCGATCATAGGGCCCGGAGAATAACAGTAAGAATAAATACACCAACTCAAGGTTTACCCGTACGCAATACATTCATTTTTCCCACCGTGTTTAATGTGCGTCTGTAGATAATAACATTGTTCCTCCTCAGGAAGACGTGCGGCTAACTGCGTGTTGGCGCCTTAAATGAGCAGGAAGACTGACCAAGCATATGAGGCAGGATCCGGATTACGGACTTTCTTGACCTGGATCGCGAGTCTTTGAGGAAACTGTACTTAACAGTACGGACGTATAGGTGTACTCAATCACACGGCGAGCAAAACACCGCATGTTGCAACATTCGCCTGAGAGTGGCAGTCGGTTGCCGCCCAACGTGTCCGCTAAGAATGGCCTTGGCGGCGGTATGCACGTAATCACCCCTATCAGACTAGTTCACAGGACTTAGGAAAGTTGTTGTGATCAGCGAGTTGAAGGGAAGCATTAACAGTTTGCGCCTCAGGCACCACTGCTGTTGGTGGGTGAGGTTCCACCTCTATTGCCACCAACGGTATGGTACATGATCAAGCCGGCAAGGCCCACAACTTCTATCTACTAAGCCACAGCCTCGTTAGTATACCCCGGTCGGCCTGTTCATCTTGAGGGGCTCCTCGTCCTCAATAAAGGCCGTCTAAACATATTAGACCCATTGGTAACGATATTTATTAGTCGGGCCTGTTGCACTCGCCAAGACTACAGGATCATACTCCATACGTTTCAAACCTTAGCCGGTCTTCTATTTCTCACTCTCCCG"
string_rna_2 = transcription(dna_string_2)
print(colored(string_rna_2))

> **Reverse complement of DNA seq**

This code defines a dictionary **DNA_Reversecomplementseq** mapping each nucleotide to its complementary nucleotide (A to T, T to A, G to C, and C to G). It then defines a function **reversecomplement(sequence)** that takes a DNA sequence as input and returns its reverse complement.

The code then takes a DNA sequence stored in **random_dna** (a fixed example string here, despite the name) and uses the `reversecomplement()` function to find its reverse complement. Finally, it prints the reverse complement.

In [None]:
#Determining reverse complementary sequence of dna string
DNA_Reversecomplementseq = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}

def reversecomplement(sequence):
    return ''.join([DNA_Reversecomplementseq[nuc] for nuc in sequence])[::-1]
random_dna = "CGGGGTGACCCTTGTAGCGGACCCCGACGCGGCGGCAACCCCCCGCTTCTGCCTCTGGGGGAGGAAAGCGTCGGC"
new_reverse_dna_string = reversecomplement(random_dna)
print(colored(new_reverse_dna_string))

In [None]:
#Solving for rosalind problem 1 for Complementing a Strand of DNA

dna_1 = "AAAACCCGGT"
new_reverse_dna_string = reversecomplement(dna_1)
print(colored(new_reverse_dna_string))

#problem 2
dna_2 = "TCCCAGTTACGTGACCGGACTGTACAGTTACTTGGCCCTTTTTGAATGCTGGCCGTCCTTGAGAAGTTGATGCATCTGTCTCCTATGGGGATACGTAGAACGGGTGCATTGGCGCCCCCGTATATCTCGATTATGACTGTCATGTCGCGACCACCTTGTATCTATACCGAAGGAACTGACCTGTTGGGCCGTTATACACGTCAAGCCACGTGATGTAATTGAGATTCAAACAAATTAAAATTATGGTAGTAGTTGACCTGTGCGAAGCAGCCTGGTAATGAGTCTGCAGGTTGCTGCCCTGACCGTTAGTGATAGGGCGCTCAGTGCCACCGAAATTTCTCCCTACAACCGCAACACTGAGTCAGACATTATCGACTACAGCCAGATGCGAAAAAGCGAGAGATAAGGAAAGGACTCCATGGTTTCGGTACCTCCGTGGTTCACTGGTTTCTACCGTGTAGTCGTATTAAGTATACCTATACGAGGCCTTTACGAACCTGAACCGTTGACGGGCGGCTGTGCCCATCCCAAATCGTCTACGGATGAACCTACGCCCCACGGAGATCCTTCCGCCGCTGCTAACTCTTGAAGCATCACAACGGAGACGGTGATCGCAAAACGCGTAACCCTTGAGGGCCCCTCAACTATGTGAGCGACACTTGAACATAAGCATCTTGGGCAGGAGTATCGCCTTCATAGAGTGTATGGCATGACGAAACCTACGTGGCGTAGCATTCTCCCGTGATGGCTAAATTCACAATCTAGAGCTTGTGAGTTTACCCGAATTTTCGACTAGACTATAAGGATGTTCAATGTCGTCAATGTCTTGGATATATCGCACATGGGTGGATTGTTTGTATAGTTCACTAGGAGATTTTGTCCGGGTCATTACATCGAACTATAAATGCTCAAAG"
new_reverse_dna_string = reversecomplement(dna_2)
print(colored(new_reverse_dna_string))
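
The same operation can be sketched with `str.maketrans` and `str.translate` instead of a dictionary lookup per base. The names `COMPLEMENT_TABLE` and `revcomp` are hypothetical and used only for this illustration.

```python
# Translation table mapping each base to its complement
COMPLEMENT_TABLE = str.maketrans("ACGT", "TGCA")

def revcomp(sequence):
    """Complement each base, then reverse the string."""
    return sequence.translate(COMPLEMENT_TABLE)[::-1]

print(revcomp("AAAACCCGGT"))  # ACCGGGTTTT
# Applying the operation twice returns the original sequence
assert revcomp(revcomp("AAAACCCGGT")) == "AAAACCCGGT"
```

Unlike the dictionary version, `translate` leaves characters outside A, C, G and T unchanged instead of raising a `KeyError`, so validating the input first is still advisable.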

> **Base-pairing of complementary seq**

This code defines a **function reverse_complement** that generates the reverse complement of a given DNA string. It then uses this function to find the **reverse complement** of **dna_string** and prints it along with the original string in a format that illustrates the **base pairing sequence**.

In [None]:
#Illustrating reverse complement in base pairing sequence manner
def reverse_complement(dna_string):
    complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
    return ''.join(complement[base] for base in reversed(dna_string))

dna_string = "CGGGGTGACCCTTGTAGCGGACCCCGACGCGGCGGCAACCCCCCGCTTCTGCCTCTGGGGGAGGAAAGCGTCGGC"
new_reverse_dna_string = reverse_complement(dna_string)

print(colored(f"Reverse complement + dna seq string:\n5' {new_reverse_dna_string} 3'"))
print("   " + "|" * len(new_reverse_dna_string))  # one bar per base pair, aligned under the bases
print(colored(f"3' {dna_string[::-1]} 5'\n"))
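
To confirm that the two printed strands really pair base by base, a small check can be sketched as follows. The names `PAIRS` and `strands_pair` and the short example sequence are assumptions made for illustration.

```python
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}

def strands_pair(top, bottom):
    """True if the strands have equal length and every aligned pair is A-T or C-G."""
    return len(top) == len(bottom) and all(PAIRS[t] == b for t, b in zip(top, bottom))

dna = "ACCTGA"
top = "".join(PAIRS[base] for base in reversed(dna))  # reverse complement, read 5'->3'
bottom = dna[::-1]                                    # original strand, read 3'->5'

print(top)                         # TCAGGT
print("|" * len(top))
print(bottom)                      # AGTCCA
print(strands_pair(top, bottom))   # True
```

The reverse complement read 5' to 3' lines up against the original strand read 3' to 5', which is why the original string is reversed before printing.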

In [None]:
Nuc_seq = ['A', 'C', 'G', 'T']
complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}