
In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/rosalind2/rosalind_gc (1).txt
/kaggle/input/rosalind/rosalind_gc.txt
/kaggle/input/gc-content-test/testdata.txt


# Introduction

In molecular biology, nucleotides are the basic building blocks of DNA. Each nucleotide consists of a nitrogenous base, a sugar molecule, and a phosphate group. The four types of nitrogenous bases found in DNA are adenine (A), cytosine (C), guanine (G), and thymine (T).

In our DNA toolkit, we define a `Nucleotides` list representing these four bases:

In [2]:
#DNA toolkit day 1
Nucleotides = ["A", "C", "G", "T"]


# Adding color codes to nucleotide bases

Let's color-code each base with ANSI escape codes so that sequences are easier to read in the terminal.

In [3]:
# Adding color codes to sequences
def colored(seq):
    basecolors = {
        'A': '\033[92m',  # green
        'C': '\033[94m',  # blue
        'G': '\033[93m',  # yellow
        'T': '\033[91m',  # red
        'U': '\033[91m',  # red for RNA
        'reset': '\033[0m' # reset to default color
    }
    colored_seq = ""
    for nuc in seq:
        if nuc in basecolors:
            colored_seq += basecolors[nuc] + nuc 
        else:
            colored_seq += basecolors['reset'] + nuc
    return colored_seq
dna_string = "CGGGGTGACCCTTGTAGCGGACCCCGACGCGGCGGCAACCCCCCGCTTCTGCCTCTGGGGGAGGAAAGCGTCGGC"
print(colored(dna_string))

CGGGGTGACCCTTGTAGCGGACCCCGACGCGGCGGCAACCCCCCGCTTCTGCCTCTGGGGGAGGAAAGCGTCGGC

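As a minimal sketch of the same coloring idea, the variant below appends the ANSI reset code at the end of the string so the last color does not carry over into later terminal output. The names `BASE_COLORS`, `RESET`, and `colorize` are illustrative assumptions, not part of the notebook.

```python
# Hypothetical variant of colored(); names are assumptions for illustration.
RESET = "\033[0m"
BASE_COLORS = {
    "A": "\033[92m",  # green
    "C": "\033[94m",  # blue
    "G": "\033[93m",  # yellow
    "T": "\033[91m",  # red
    "U": "\033[91m",  # red for RNA
}

def colorize(seq):
    """Prefix each known base with its color, unknown characters with a reset,
    and finish with a reset so the terminal returns to its default color."""
    pieces = [BASE_COLORS.get(base, RESET) + base for base in seq]
    return "".join(pieces) + RESET

print(colorize("ACGTU"))
```

Building the pieces in a list and joining them once avoids repeated string concatenation on long sequences.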

# DNA seq Validation

The `validatesequence` function checks the integrity of a DNA sequence. It takes a DNA sequence as input and confirms that it contains only valid nucleotides (A, C, G, T). Validation is case-insensitive because the sequence is converted to uppercase before it is checked.

Function: `validatesequence(dna_seq: str) -> str | bool`

The function returns `tmpseq` (the uppercased DNA sequence) if every character of the input is in the predefined `Nucleotides` list. Otherwise, it returns `False`.

#### Parameters:
- `dna_seq` (str): The input DNA sequence to be validated.


In [4]:
def validatesequence(dna_seq):
    # Uppercase first so the check is case-insensitive
    tmpseq = dna_seq.upper()
    for nuc_base in tmpseq:
        if nuc_base not in Nucleotides:
            return False
    # Return only after every base has been checked
    return tmpseq
        

> **Testing**

The `validatesequence` function is applied to the DNA sequence "ATTGGCCTA". Let's break down the process:

- The DNA sequence is converted to uppercase (for example, "attggccta" → "ATTGGCCTA").
- The function iterates through each nucleotide in the sequence.
- If any character is not found in the predefined `Nucleotides` list (["A", "C", "G", "T"]), the function returns `False`.
- If the iteration completes without finding an invalid nucleotide, the function returns the uppercased sequence, indicating that the sequence is valid.

In this specific example, the output is the sequence itself (color-coded by `colored`), indicating that "ATTGGCCTA" is a valid DNA sequence.

In [5]:

## Validating a New DNA Sequence
newDNA_string = "ATTGGCCTA"
print(colored(validatesequence(newDNA_string)))

ATTGGCCTA


If any character of the input is missing from the `Nucleotides` list (["A", "C", "G", "T"]), `validatesequence` returns `False`. The string "MLBRDQ" contains no valid DNA bases, so it is rejected:

In [6]:
## testing with a different sequence
testDNA_string = "MLBRDQ"
print(validatesequence(testDNA_string))

False
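
The same check can be phrased with set operations. The sketch below is a hypothetical variant (`is_valid_dna` and `VALID_BASES` are assumed names, not the notebook's function); it always returns a boolean, which avoids mixing `str` and `False` return values, and it treats an empty string as invalid by assumption.

```python
# Assumed alphabet: plain DNA bases, no ambiguity codes such as N.
VALID_BASES = frozenset("ACGT")

def is_valid_dna(dna_seq):
    """Return True if dna_seq is non-empty and contains only A, C, G, T
    (case-insensitive); return False otherwise."""
    upper = dna_seq.upper()
    return bool(upper) and set(upper) <= VALID_BASES

print(is_valid_dna("attggccta"))
print(is_valid_dna("MLBRDQ"))
```

The first call prints `True` and the second prints `False`. The subset test `set(upper) <= VALID_BASES` inspects each distinct character once rather than looping explicitly.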


We use the **random module** to generate a random DNA sequence of length 16. The **random.choice(Nucleotides)** expression is used to randomly select a nucleotide from the predefined Nucleotides list (**["A", "C", "G", "T"]**), and this process is repeated 16 times to form the random sequence.


In [7]:
import random
random_dna_string = ''.join([random.choice(Nucleotides)
                            for nuc in range(16)])
print(colored(validatesequence(random_dna_string)))

ATAAAAATCGTCCACA


In [8]:
import random
random_dna_string = ''.join([random.choice(Nucleotides)
                            for nuc in range(16)])
print(colored(validatesequence(random_dna_string)))

TTGCCGAGAAACTAAT


# DNA sequence Counting

The **countfrequencyNuc** function initializes a dictionary **tempdictionary_Freq** to store the count of each nucleotide.
It iterates through each nucleotide in the given DNA sequence (seq) and increments the corresponding count in the dictionary.

In [9]:
import random

def countfrequencyNuc(seq):
    tempdictionary_Freq = {"A": 0, "C": 0, "G": 0, "T": 0}
    for Nuc in seq:
        tempdictionary_Freq[Nuc] += 1
    return tempdictionary_Freq

# Generating a random DNA string for testing
random_dna_string = ''.join(random.choice("ACGT") for _ in range(26))

# Calling the function with the generated DNA string
resulting_dna_frequency = countfrequencyNuc(random_dna_string)

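# Printing the result
# Note: colored() iterates over the dictionary's keys, so this call prints only
# the four base letters; print(resulting_dna_frequency) shows the actual counts
print(colored(resulting_dna_frequency))

ACGT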

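As a minimal sketch of the counting step, the standard library's `collections.Counter` can do the tallying, and printing the dictionary directly shows the counts. The function name `count_nucleotides` and the sample sequence are assumptions chosen for illustration.

```python
from collections import Counter

def count_nucleotides(seq):
    """Count A, C, G, T in seq; bases that never occur are reported as 0.
    Characters outside ACGT are ignored rather than raising a KeyError."""
    counts = Counter(seq.upper())
    return {base: counts[base] for base in "ACGT"}

# Hypothetical sample sequence for illustration
print(count_nucleotides("AGCTTTTCATTCTGACTGCA"))
# {'A': 4, 'C': 5, 'G': 3, 'T': 8}
```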

# DNA-RNA Transcription

This section defines a function called `transcription` that converts a DNA sequence into an RNA sequence. The DNA alphabet (adenine '**A**', cytosine '**C**', guanine '**G**', and thymine '**T**') is stored in the list **Nuc_seq**. The function takes a DNA sequence as a string and replaces each 'T' (thymine) with 'U' (uracil) to produce the corresponding RNA sequence.


In [10]:
Nuc_seq = ['A', 'C', 'G', 'T']
# transcription of dna sequence to rna sequence with code 
def transcription(sequence): 
    return sequence.replace('T', 'U')

The next cell transcribes the DNA sequence "**ACCTAAAGCATAAACGTCGAGCAGT**" with the `transcription` function, then prints the resulting RNA sequence and its length. Replacing each '**T**' with '**U**' mimics the biological process of transcription, in which the RNA copy of the coding strand carries uracil in place of thymine.

In [11]:
dna_string = "ACCTAAAGCATAAACGTCGAGCAGT"
string_rna = transcription(dna_string)
print(colored(string_rna))
print(len(string_rna))

ACCUAAAGCAUAAACGUCGAGCAGU
25
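
A quick way to sanity-check the T to U replacement is to undo it and compare with the input. The sketch below assumes two illustrative helpers, `transcribe` and `back_transcribe`, which are not defined in the notebook.

```python
def transcribe(dna_seq):
    """DNA coding strand -> RNA: replace every T with U."""
    return dna_seq.upper().replace("T", "U")

def back_transcribe(rna_seq):
    """RNA -> DNA: replace every U with T (the inverse of transcribe)."""
    return rna_seq.upper().replace("U", "T")

dna = "ACCTAAAGCATAAACGTCGAGCAGT"
rna = transcribe(dna)
print(rna)                          # ACCUAAAGCAUAAACGUCGAGCAGU
print(back_transcribe(rna) == dna)  # True
```

Transcription changes only the alphabet, so the RNA string has the same length as the DNA string.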


In [12]:
# Solving a few rosalind problems 
#Testing the transcription code 
dna_string_1 = "GATGGAACTTGACTACGTAAATT"
string_rna_1 = transcription(dna_string_1)
print(colored(string_rna_1))

#Rosalind problem 2
dna_string_2 = "CCAGACGTGTCAGTGAGATAGCGGCTTTGGGGCATTCCCCGAGTAAATCAATTTCTGACGAAGAAAAACACGAGGCTTCGTGAGCAGTAGCGGCAACCTCACGTCGTTTGTGCGTCTTAGAGTATCGGCCGAAGAAAAGGGGACCGAGCGATCATAGGGCCCGGAGAATAACAGTAAGAATAAATACACCAACTCAAGGTTTACCCGTACGCAATACATTCATTTTTCCCACCGTGTTTAATGTGCGTCTGTAGATAATAACATTGTTCCTCCTCAGGAAGACGTGCGGCTAACTGCGTGTTGGCGCCTTAAATGAGCAGGAAGACTGACCAAGCATATGAGGCAGGATCCGGATTACGGACTTTCTTGACCTGGATCGCGAGTCTTTGAGGAAACTGTACTTAACAGTACGGACGTATAGGTGTACTCAATCACACGGCGAGCAAAACACCGCATGTTGCAACATTCGCCTGAGAGTGGCAGTCGGTTGCCGCCCAACGTGTCCGCTAAGAATGGCCTTGGCGGCGGTATGCACGTAATCACCCCTATCAGACTAGTTCACAGGACTTAGGAAAGTTGTTGTGATCAGCGAGTTGAAGGGAAGCATTAACAGTTTGCGCCTCAGGCACCACTGCTGTTGGTGGGTGAGGTTCCACCTCTATTGCCACCAACGGTATGGTACATGATCAAGCCGGCAAGGCCCACAACTTCTATCTACTAAGCCACAGCCTCGTTAGTATACCCCGGTCGGCCTGTTCATCTTGAGGGGCTCCTCGTCCTCAATAAAGGCCGTCTAAACATATTAGACCCATTGGTAACGATATTTATTAGTCGGGCCTGTTGCACTCGCCAAGACTACAGGATCATACTCCATACGTTTCAAACCTTAGCCGGTCTTCTATTTCTCACTCTCCCG"
string_rna_2 = transcription(dna_string_2)
print(colored(string_rna_2))

GAUGGAACUUGACUACGUAAAUU
CCAGACGUGUCAGUGAGAUAGCGGCUUUGGGGCAUUCCCCGAGUAAAUCAAUUUCUGACGAAGAAAAACACGAGGCUUCGUGAGCAGUAGCGGCAACCUCACGUCGUUUGUGCGUCUUAGAGUAUCGGCCGAAGAAAAGGGGA...

# Reverse complement of DNA seq

This code defines a dictionary **DNA_Reversecomplementseq** that maps each nucleotide to its complementary nucleotide (A to T, T to A, G to C, and C to G). It then defines a function **reversecomplement(sequence)** that takes a DNA sequence as input and returns its reverse complement: each base is complemented and the result is reversed.

The code stores a sample DNA sequence in **random_dna**, passes it to reversecomplement(), and prints the resulting reverse complement.

In [13]:
#Determining reverse complementary sequence of dna string
DNA_Reversecomplementseq = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}

def reversecomplement(sequence): 
    return ''.join([DNA_Reversecomplementseq[nuc] for nuc in sequence])[::-1]
random_dna = "CGGGGTGACCCTTGTAGCGGACCCCGACGCGGCGGCAACCCCCCGCTTCTGCCTCTGGGGGAGGAAAGCGTCGGC"
new_reverse_dna_string = reversecomplement(random_dna)
print(colored(new_reverse_dna_string))

GCCGACGCTTTCCTCCCCCAGAGGCAGAAGCGGGGGGTTGCCGCCGCGTCGGGGTCCGCTACAAGGGTCACCCCG

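The dictionary lookup can also be expressed with Python's built-in `str.maketrans` and `str.translate`. This is a sketch of an alternative, not the notebook's implementation; `COMPLEMENT_TABLE` and `reverse_complement_translate` are assumed names.

```python
# Translation table mapping each base to its complement (A<->T, C<->G).
COMPLEMENT_TABLE = str.maketrans("ACGT", "TGCA")

def reverse_complement_translate(seq):
    """Complement every base in one pass, then reverse the string."""
    return seq.upper().translate(COMPLEMENT_TABLE)[::-1]

print(reverse_complement_translate("AAAACCCGGT"))  # ACCGGGTTTT
```

Applying the function twice returns the original sequence, which makes a convenient self-check.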

In [14]:
#Solving for rosalind problem 1 for Complementing a Strand of DNA 

dna_1 = "AAAACCCGGT"
new_reverse_dna_string = reversecomplement(dna_1)
print(colored(new_reverse_dna_string))

#problem 2
dna_2 = "TCCCAGTTACGTGACCGGACTGTACAGTTACTTGGCCCTTTTTGAATGCTGGCCGTCCTTGAGAAGTTGATGCATCTGTCTCCTATGGGGATACGTAGAACGGGTGCATTGGCGCCCCCGTATATCTCGATTATGACTGTCATGTCGCGACCACCTTGTATCTATACCGAAGGAACTGACCTGTTGGGCCGTTATACACGTCAAGCCACGTGATGTAATTGAGATTCAAACAAATTAAAATTATGGTAGTAGTTGACCTGTGCGAAGCAGCCTGGTAATGAGTCTGCAGGTTGCTGCCCTGACCGTTAGTGATAGGGCGCTCAGTGCCACCGAAATTTCTCCCTACAACCGCAACACTGAGTCAGACATTATCGACTACAGCCAGATGCGAAAAAGCGAGAGATAAGGAAAGGACTCCATGGTTTCGGTACCTCCGTGGTTCACTGGTTTCTACCGTGTAGTCGTATTAAGTATACCTATACGAGGCCTTTACGAACCTGAACCGTTGACGGGCGGCTGTGCCCATCCCAAATCGTCTACGGATGAACCTACGCCCCACGGAGATCCTTCCGCCGCTGCTAACTCTTGAAGCATCACAACGGAGACGGTGATCGCAAAACGCGTAACCCTTGAGGGCCCCTCAACTATGTGAGCGACACTTGAACATAAGCATCTTGGGCAGGAGTATCGCCTTCATAGAGTGTATGGCATGACGAAACCTACGTGGCGTAGCATTCTCCCGTGATGGCTAAATTCACAATCTAGAGCTTGTGAGTTTACCCGAATTTTCGACTAGACTATAAGGATGTTCAATGTCGTCAATGTCTTGGATATATCGCACATGGGTGGATTGTTTGTATAGTTCACTAGGAGATTTTGTCCGGGTCATTACATCGAACTATAAATGCTCAAAG"
new_reverse_dna_string = reversecomplement(dna_2)
print(colored(new_reverse_dna_string))

ACCGGGTTTT
CTTTGAGCATTTATAGTTCGATGTAATGACCCGGACAAAATCTCCTAGTGAACTATACAAACAATCCACCCATGTGCGATATATCCAAGACATTGACGACATTGAACATCCTTATAGTCTAGTCGAAAATTCGGGTAAACTCACAAGCTCTAGATT...

> **Base-pairing of complementary seq**

This code defines a function **reverse_complement** that generates the reverse complement of a given DNA string. It then finds the **reverse complement** of **dna_string** and prints it above the reversed original string, so that each base lines up with its partner and the output illustrates the **base pairing sequence**.

In [15]:
#Illustrating reverse complement in base pairing sequence manner 
def reverse_complement(dna_string):
    complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
    return ''.join(complement[base] for base in reversed(dna_string))

dna_string = "CGGGGTGACCCTTGTAGCGGACCCCGACGCGGCGGCAACCCCCCGCTTCTGCCTCTGGGGGAGGAAAGCGTCGGC"
new_reverse_dna_string = reverse_complement(dna_string)

print(colored(f"Reverse complement + dna seq string:\n5' {new_reverse_dna_string} 3'"))
print("".join([' ' for c in range(len(new_reverse_dna_string))]))
print(colored(f"3' {dna_string[::-1]} 5'\n"))

Reverse complement + dna seq string:
5' GCCGACGCTTTCCTCCCCCAGAGGCAGAAGCGGGGGGTTGCCGCCGCGTCGGGGTCCGCTACAAGGGTCACCCCG 3'

3' CGGCTGCGAAAGGAGGGGGTCTCCGTCTTCGCCCCCCAAC...

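To make the pairing visible, one option is to draw a `|` between each base and its partner. The sketch below is an assumption about presentation only; `show_base_pairing` is a hypothetical helper and the short input is chosen for illustration.

```python
def show_base_pairing(dna_string):
    """Return a three-line diagram: the 5'->3' strand, one '|' per base pair,
    and the complementary strand written 3'->5'."""
    complement_strand = dna_string.translate(str.maketrans("ACGT", "TGCA"))
    return "\n".join([
        f"5' {dna_string} 3'",
        "   " + "|" * len(dna_string),  # three spaces align with the "5' " prefix
        f"3' {complement_strand} 5'",
    ])

print(show_base_pairing("AAAACCCGGT"))
# 5' AAAACCCGGT 3'
#    ||||||||||
# 3' TTTTGGGCCA 5'
```
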
In [16]:
Nuc_seq = ['A', 'C', 'G', 'T']
complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}

# GC content calculation

This function calculates the **GC content** of a DNA sequence. It counts the occurrences of the bases '**G**' and '**C**' in the sequence and divides this total by the length of the sequence. The result is then multiplied by 100 to obtain the **percentage of GC** bases in the sequence.

In [17]:
def gc_content(sequence):
    gc_count = sequence.count('G') + sequence.count('C')
    return (gc_count / len(sequence)) * 100
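
As a worked example of the formula, "AGCTATAG" has 3 G or C bases out of 8, so its GC content is 3 / 8 × 100 = 37.5%. The sketch below is a hypothetical variant named `gc_percent` that also accepts lowercase input and, by assumption, returns 0.0 for an empty sequence instead of dividing by zero.

```python
def gc_percent(sequence):
    """GC percentage of a sequence; returns 0.0 for an empty sequence."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return 100 * (seq.count("G") + seq.count("C")) / len(seq)

print(gc_percent("AGCTATAG"))  # 37.5
```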

In [18]:
dna_string = "CGGGGTGACCCTTGTAGCGGACCCCGACGCGGCGGCAACCCCCCGCTTCTGCCTCTGGGGGAGGAAAGCGTCGGC"
print(colored(f' GC_content: {gc_content(dna_string)}%\n'))

 GC_content: 73.33333333333333%



This function calculates the **GC content for subsequences** of a given DNA sequence. It takes two arguments: `sequence`, which is the DNA sequence, and `k`, which is the length of each subsequence (20 by default).

The function steps through the sequence k bases at a time to extract non-overlapping subsequences of length k. For each subsequence, it calculates the GC content with the **gc_content** function defined above and appends the result to the `res` list. A trailing fragment shorter than k is skipped, so a 75-base sequence with k = 10 yields 7 values and its last 5 bases are ignored.

Finally, it returns the list **res** containing the GC content of each subsequence.

In [19]:
def gc_content_subseq(sequence, k=20):
    res = []
    for i in range(0, len(sequence) - k + 1, k):
        subsequence = sequence[i:i + k]
        res.append(gc_content(subsequence))
    return res
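
The windows above do not overlap. A common alternative is a sliding window that advances by fewer than k bases at a time; the sketch below adds a `step` parameter for that. The name `gc_sliding_windows` and the `step` parameter are assumptions, not part of the notebook.

```python
def gc_sliding_windows(sequence, k=10, step=1):
    """GC percentage of every full window of length k, advancing by `step` bases.
    With step == k this reproduces non-overlapping windows."""
    seq = sequence.upper()
    results = []
    for start in range(0, len(seq) - k + 1, step):
        window = seq[start:start + k]
        gc = window.count("G") + window.count("C")
        results.append(100 * gc / k)
    return results

print(gc_sliding_windows("GGCCAATT", k=4, step=2))  # [100.0, 50.0, 0.0]
print(gc_sliding_windows("GGCCAATT", k=4, step=4))  # [100.0, 0.0]
```

Smaller steps give a smoother GC profile along the sequence at the cost of more windows to compute.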

In [20]:
dna_string = "CGGGGTGACCCTTGTAGCGGACCCCGACGCGGCGGCAACCCCCCGCTTCTGCCTCTGGGGGAGGAAAGCGTCGGC"
print(colored(f'GC_content in subsection k = 10: {gc_content_subseq(dna_string, k=10)}%\n'))

GC_content in subsection k = 10: [80.0, 60.0, 80.0, 80.0, 70.0, 80.0, 60.0]%



# ROSALIND PROBLEMS ON GC CONTENT

In [21]:
def readFile(filepath):
    with open(filepath, 'r') as f:
        return [l.strip() for l in f.readlines()]

# Example usage of the readFile function
fastafile = readFile("/kaggle/input/gc-content-test/testdata.txt")
print(fastafile)
fasta_dict = {}

fasta_label = ""

for line in fastafile:
    if line.startswith('>'):
        fasta_label = line.strip()
        fasta_dict[fasta_label] = ""
    else:
        fasta_dict[fasta_label] += line.strip()
print(fasta_dict)

result_dict = {key: gc_content(value) for (key,value) in fasta_dict.items()}
print(result_dict)

maximumGCkey = max(result_dict, key=result_dict.get)

print(colored(f'{maximumGCkey}\n{result_dict[maximumGCkey]}'))

['>Rosalind_6404', 'CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC', 'TCCCACTAATAATTCTGAGG', '>Rosalind_5959', 'CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT', 'ATATCCATTTGTCAGCAGACACGC', '>Rosalind_0808', 'CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC', 'TGGGAACCTGCGGGCAGTAGGTGGAAT']
{'>Rosalind_6404': 'CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG', '>Rosalind_5959': 'CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC', '>Rosalind_0808': 'CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT'}
{'>Rosalind_6404': 53.75, '>Rosalind_5959': 53.57142857142857, '>Rosalind_0808': 60.91954022988506}
>Rosalind_0808
60.91954022988506


In [22]:
def readFile(filepath):
    with open(filepath, 'r') as f:
        return [l.strip() for l in f.readlines()]

# Example usage of the readFile function
fastafile = readFile("/kaggle/input/rosalind2/rosalind_gc (1).txt")

fasta_dict = {}

fasta_label = ""

for line in fastafile:
    if line.startswith('>'):
        fasta_label = line.strip()
        fasta_dict[fasta_label] = ""
    else:
        fasta_dict[fasta_label] += line.strip()


result_dict = {key: gc_content(value) for (key,value) in fasta_dict.items()}


maximumGCkey = max(result_dict, key=result_dict.get)

print(f'{maximumGCkey[1:]}\n{result_dict[maximumGCkey]}')

Rosalind_5488
53.71549893842887
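
Both cells above repeat the same parsing loop, so it can be pulled into reusable functions. The sketch below works on a list of lines instead of a file so that it runs anywhere; `parse_fasta`, `highest_gc`, and the toy records are assumptions for illustration, and sequences are assumed to be non-empty.

```python
def parse_fasta(lines):
    """Group FASTA lines into {label: sequence}; labels drop the leading '>'."""
    records = {}
    label = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            label = line[1:]
            records[label] = ""
        elif label is not None:
            records[label] += line  # sequence may span several lines
    return records

def highest_gc(records):
    """Return (label, GC percentage) of the record with the highest GC content."""
    def gc(seq):
        return 100 * (seq.count("G") + seq.count("C")) / len(seq)
    best = max(records, key=lambda name: gc(records[name]))
    return best, gc(records[best])

# Hypothetical toy input, not taken from the Rosalind datasets
sample = [">seq_a", "ATAT", "GC", ">seq_b", "GGCC", "AT"]
records = parse_fasta(sample)
print(records)              # {'seq_a': 'ATATGC', 'seq_b': 'GGCCAT'}
print(highest_gc(records)[0])  # seq_b
```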


# DNA TRANSLATION


In [23]:
DNA_Codons = {
    # 'M' - START, '_' - STOP
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "TGT": "C", "TGC": "C",
    "GAT": "D", "GAC": "D",
    "GAA": "E", "GAG": "E",
    "TTT": "F", "TTC": "F",
    "GGT": "G", "GGC": "G", "GGA": "G", "GGG": "G",
    "CAT": "H", "CAC": "H",
    "ATA": "I", "ATT": "I", "ATC": "I",
    "AAA": "K", "AAG": "K",
    "TTA": "L", "TTG": "L", "CTT": "L", "CTC": "L", "CTA": "L", "CTG": "L",
    "ATG": "M",
    "AAT": "N", "AAC": "N",
    "CCT": "P", "CCC": "P", "CCA": "P", "CCG": "P",
    "CAA": "Q", "CAG": "Q",
    "CGT": "R", "CGC": "R", "CGA": "R", "CGG": "R", "AGA": "R", "AGG": "R",
    "TCT": "S", "TCC": "S", "TCA": "S", "TCG": "S", "AGT": "S", "AGC": "S",
    "ACT": "T", "ACC": "T", "ACA": "T", "ACG": "T",
    "GTT": "V", "GTC": "V", "GTA": "V", "GTG": "V",
    "TGG": "W",
    "TAT": "Y", "TAC": "Y",
    "TAA": "_", "TAG": "_", "TGA": "_"
}


In [24]:
def translate_seq(seq, init_pos=0):
    """Translates a DNA sequence into an amino acid sequence"""
    return [DNA_Codons[seq[pos:pos + 3]] for pos in range(init_pos, len(seq) - 2, 3)]

dna_string = "CGGGGTGACCCTTGTAGCGGACCCCGACGCGGCGGCAACCCCCCGCTTCTGCCTCTGGGGGAGGAAAGCGTCGGC"
print(f'amino acid seq from dna seq: {translate_seq(dna_string, 0)}\n')

amino acid seq from dna seq: ['R', 'G', 'D', 'P', 'C', 'S', 'G', 'P', 'R', 'R', 'G', 'G', 'N', 'P', 'P', 'L', 'L', 'P', 'L', 'G', 'E', 'E', 'S', 'V', 'G']
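
The `init_pos` argument selects the reading frame. As a sketch of how the three forward frames of one strand could be translated, the block below rebuilds the standard codon table compactly so that it is self-contained; `CODON_TABLE`, `translate`, and `forward_frames` are assumed names, and `_` marks a stop codon as in `DNA_Codons`.

```python
from itertools import product

# Standard genetic code in TCAG order (third base varies fastest); '_' = stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY__CC_WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(seq, frame=0):
    """Translate seq codon by codon, starting at offset `frame` (0, 1, or 2).
    Trailing bases that do not fill a codon are ignored."""
    seq = seq.upper()
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(frame, len(seq) - 2, 3))

def forward_frames(seq):
    """Translations of the three forward reading frames."""
    return [translate(seq, frame) for frame in range(3)]

print(forward_frames("ATGGCCTAA"))  # ['MA_', 'WP', 'GL']
```

Only frame 0 of this toy sequence starts with the start codon `ATG` (M) and ends at a stop codon.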



In [25]:
from collections import Counter

def codon_usage(seq, aminoacid):
    """Provides the frequency of each codon encoding a given aminoacid in a DNA sequence"""
    tmpList = []
    for i in range(0, len(seq) - 2, 3):
        if DNA_Codons[seq[i:i + 3]] == aminoacid:
            tmpList.append(seq[i:i + 3])

    freqDict = dict(Counter(tmpList))
    totalWeight = sum(freqDict.values())
    for codon_seq in freqDict:
        freqDict[codon_seq] = round(freqDict[codon_seq] / totalWeight, 2)
    return freqDict

print(f'frequency of codons [L]: {codon_usage(dna_string, "L")}\n')
print(f'frequency of codons [R]: {codon_usage(dna_string, "R")}\n')

frequency of codons [L]: {'CTT': 0.33, 'CTG': 0.67}

frequency of codons [R]: {'CGG': 0.33, 'CGA': 0.33, 'CGC': 0.33}

