1. Counting DNA Nucleotides

A string is simply an ordered collection of symbols selected from some alphabet and formed into a word; the length of a string is the number of symbols that it contains.

An example of a length 21 DNA string (whose alphabet contains the symbols 'A', 'C', 'G', and 'T') is "ATGCTTCAGAAAGGTCTTACG".

Given: A DNA string s of length at most 1000 nt.

Return: Four integers (separated by spaces) counting the respective number of times that the symbols 'A', 'C', 'G', and 'T' occur in s.

In [15]:
def count_nucleotides(file_in, file_out):
    # Read the DNA string from the first line of the input file.
    with open(file_in, 'r') as f:
        st_dna = f.readline().strip()
    # Count each symbol in the required order: A, C, G, T.
    count = [st_dna.count(symbol) for symbol in 'ACGT']
    # Join with single spaces. Comparing each value to count[-1] would drop
    # a separator whenever an earlier count equals the last one.
    with open(file_out, 'w') as f:
        f.write(' '.join(str(c) for c in count))
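
The counting step can also be checked without any files. The following is a minimal sketch, assuming the DNA string is already in memory; the helper name `count_acgt` is hypothetical and not part of the solution above.

```python
from collections import Counter

def count_acgt(dna):
    """Return the counts of 'A', 'C', 'G', 'T' in dna, in that order."""
    counts = Counter(dna)  # one pass over the string
    return [counts[symbol] for symbol in "ACGT"]

# The length 21 example string from the problem statement.
example = "ATGCTTCAGAAAGGTCTTACG"
print(" ".join(str(c) for c in count_acgt(example)))  # prints "6 4 5 6"
```

The four counts add up to 21, the length of the example string, which is a cheap sanity check on any answer.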

2. Transcribing DNA into RNA

An RNA string is a string formed from the alphabet containing 'A', 'C', 'G', and 'U'.

Given a DNA string t corresponding to a coding strand, its transcribed RNA string u is formed by replacing all occurrences of 'T' in t with 'U' in u.

Given: A DNA string t having length at most 1000 nt.

Return: The transcribed RNA string of t.

In [27]:
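def dna_to_rna(file_in, file_out):
    # Read the DNA string, dropping the trailing newline.
    with open(file_in, 'r') as f:
        st_dna = f.readline().strip()
    # A literal replacement is enough here; no regular expression is needed.
    st_rna = st_dna.replace('T', 'U')
    with open(file_out, 'w') as f:
        f.write(st_rna)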
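
As an in-memory illustration of the replacement rule, here is a small sketch; the function name `transcribe` is hypothetical. It is applied to the DNA example string from problem 1.

```python
def transcribe(dna):
    """Return the RNA string obtained by replacing every 'T' with 'U'."""
    return dna.replace("T", "U")

print(transcribe("ATGCTTCAGAAAGGTCTTACG"))  # prints "AUGCUUCAGAAAGGUCUUACG"
```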

3. Complementing a Strand of DNA

In DNA strings, symbols 'A' and 'T' are complements of each other, as are 'C' and 'G'.

The reverse complement of a DNA string s is the string s^c formed by reversing the symbols of s, then taking the complement of each symbol (e.g., the reverse complement of "GTCA" is "TGAC").

Given: A DNA string s of length at most 1000 bp.

Return: The reverse complement s^c of s.

In [30]:
def revers_compl_dna(file_in, file_out):
    # Read the DNA string, dropping the trailing newline.
    with open(file_in, 'r') as f:
        st_dna = f.readline().strip()
    # Swap A<->T and C<->G in a single pass, then reverse the result.
    # This avoids the temporary placeholder symbols a chain of substitutions needs.
    complement = str.maketrans('ACGT', 'TGCA')
    with open(file_out, 'w') as f:
        f.write(st_dna.translate(complement)[::-1])
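
The definition can be followed literally (reverse first, then complement each symbol) in a few lines. This is a sketch only; `COMPLEMENT` and `reverse_complement` are hypothetical names, and the input is assumed to contain only 'A', 'C', 'G', and 'T'.

```python
# Complement pairs as stated in the problem: A<->T and C<->G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(dna):
    """Reverse dna, then replace each symbol by its complement."""
    return "".join(COMPLEMENT[symbol] for symbol in reversed(dna))

print(reverse_complement("GTCA"))  # prints "TGAC"
```

Applying the function twice gives back the original string, which is a convenient self-check.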

4. Computing GC Content 

The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.
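
The worked figure above can be reproduced directly. The sketch below assumes a non-empty string over 'A', 'C', 'G', 'T'; the names `gc_content` and `reverse_complement` are hypothetical helpers introduced only for this check.

```python
def gc_content(dna):
    """Return the percentage of symbols in dna that are 'C' or 'G'."""
    gc = sum(1 for symbol in dna if symbol in "GC")
    return 100.0 * gc / len(dna)

def reverse_complement(dna):
    """Reverse dna and swap A<->T, C<->G."""
    return dna[::-1].translate(str.maketrans("ACGT", "TGCA"))

print(gc_content("AGCTATAG"))                      # prints 37.5
print(gc_content(reverse_complement("AGCTATAG")))  # prints 37.5
```

The second line illustrates the note above: complementing turns every 'C' into a 'G' and vice versa, so the combined count of the two symbols does not change.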

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the next line that begins with '>' indicates the label of the next string.

In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.

Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.

In [36]:
def percenCG(st_dna):
    # Percentage of symbols in st_dna that are 'C' or 'G'.
    m = st_dna.count('C') + st_dna.count('G')
    return m * 100. / len(st_dna)

def top_percenCG(file_in, file_out):
    list_name = []
    list_dna = []
    with open(file_in, 'r') as f:
        for line in f:
            # strip() is used instead of line[:-1] so that a final line
            # without a trailing newline does not lose its last symbol.
            line = line.strip()
            if not line:
                continue
            if line[0] == '>':
                # A label line starts a new record.
                list_name.append(line[1:])
                list_dna.append("")
            else:
                # A string may span several lines; extend the current record.
                list_dna[-1] += line
    # Index of the record with the highest GC-content (first one wins on ties).
    maxi = max(range(len(list_dna)), key=lambda i: percenCG(list_dna[i]))
    with open(file_out, 'w') as f:
        f.write(list_name[maxi] + '\n' + str(percenCG(list_dna[maxi])))
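
To see the FASTA handling in isolation, the following sketch parses a small in-memory sample. The three records, their IDs, and the helper names `parse_fasta` and `gc_content` are made up for illustration; they are not Rosalind data.

```python
def parse_fasta(text):
    """Map each FASTA label (without '>') to its concatenated string."""
    records = {}
    label = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            label = line[1:]
            records[label] = ""
        elif label is not None:
            records[label] += line  # strings may span several lines
    return records

def gc_content(dna):
    """Return the percentage of symbols in dna that are 'C' or 'G'."""
    return 100.0 * (dna.count("C") + dna.count("G")) / len(dna)

# Hypothetical sample: the second record is split over two lines.
sample = """>Rosalind_0001
AGCTATAG
>Rosalind_0002
GGCC
ATAT
>Rosalind_0003
ATTA
"""

records = parse_fasta(sample)
best = max(records, key=lambda name: gc_content(records[name]))
print(best, gc_content(records[best]))  # prints "Rosalind_0002 50.0"
```

Because answers are accepted within an error of 0.001, a result can be compared with an expected value using `abs(result - expected) <= 0.001` rather than exact equality.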

5. Finding a Motif in DNA

Given two strings s and t, t is a substring of s if t is contained as a contiguous collection of symbols in s (as a result, t must be no longer than s).

The position of a symbol in a string is the total number of symbols found to its left, including itself (e.g., the positions of all occurrences of 'U' in "AUGCUUCAGAAAGGUCUUACG" are 2, 5, 6, 15, 17, and 18). The symbol at position i of s is denoted by s[i].

A substring of s can be represented as s[j:k], where j and k represent the starting and ending positions of the substring in s; for example, if s = "AUGCUUCAGAAAGGUCUUACG", then s[2:5] = "UGCU".

The location of a substring s[j:k] is its beginning position j; note that t will have multiple locations in s if it occurs more than once as a substring of s (see the Sample below).
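
These definitions count positions from 1 and include the ending position, whereas Python indexes from 0 and excludes the end of a slice. The sketch below shows one way to translate between the two conventions; `symbol_at` and `substring` are hypothetical helper names.

```python
s = "AUGCUUCAGAAAGGUCUUACG"

def symbol_at(s, i):
    """Return s[i] in the 1-based notation of the problem."""
    return s[i - 1]

def substring(s, j, k):
    """Return s[j:k] in the problem's notation (1-based, k inclusive)."""
    return s[j - 1:k]

# enumerate(..., start=1) yields 1-based positions directly.
positions = [i for i, symbol in enumerate(s, start=1) if symbol == "U"]
print(positions)           # prints [2, 5, 6, 15, 17, 18]
print(substring(s, 2, 5))  # prints "UGCU"
```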

Given: Two DNA strings s and t (each of length at most 1 kbp).

Return: All locations of t as a substring of s.

In [60]:
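def tSubstrS(file_in, file_out):
    with open(file_in, 'r') as f:
        # strip() rather than [:-1], so a last line without a newline stays intact.
        s = f.readline().strip()
        t = f.readline().strip()
    k = []
    # find() returns a 0-based index, or -1 when there are no more matches.
    i = s.find(t)
    while i != -1:
        k.append(i + 1)       # report 1-based locations
        i = s.find(t, i + 1)  # resume one symbol later so overlaps are found
    with open(file_out, 'w') as f:
        f.write(' '.join(str(p) for p in k))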
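
Occurrences of t may overlap, so each starting position has to be tested on its own. The following is a minimal in-memory sketch of that idea; the function name `find_locations` and the two input strings are illustrative choices, not part of the solution above.

```python
def find_locations(s, t):
    """Return the 1-based locations of every occurrence of t in s."""
    last_start = len(s) - len(t)
    # Test every possible starting index; startswith(t, i) compares in place.
    return [i + 1 for i in range(last_start + 1) if s.startswith(t, i)]

print(find_locations("GATATATGCATATACTT", "ATAT"))  # prints [2, 4, 10]
```

The occurrences at locations 2 and 4 share two symbols, which a search that skips past each whole match would miss.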