# Chaining the Amino Acids

Problem
In a weighted alphabet, every symbol is assigned a positive real number called a weight. A string formed from a weighted alphabet is called a weighted string, and its weight is equal to the sum of the weights of its symbols.

The standard weight assigned to each member of the 20-symbol amino acid alphabet is the monoisotopic mass of the corresponding amino acid.

Given: A protein string P of length at most 1000 aa.

Return: The total weight of P. Consult the monoisotopic mass table.

In [1]:
with open('rosalind_prtm.txt') as f:
    protein = f.read().rstrip('\n')

In [2]:
weight = {'A': 71.03711, 'C': 103.00919, 'D': 115.02694, 'E': 129.04259, 'F': 147.06841, 'G': 57.02146, 
          'H': 137.05891, 'I': 113.08406, 'K': 128.09496, 'L': 113.08406, 'M': 131.04049, 'N': 114.04293, 
          'P': 97.05276, 'Q': 128.05858, 'R': 156.10111, 'S': 87.03203, 'T': 101.04768, 'V': 99.06841, 
          'W': 186.07931, 'Y': 163.06333}

proteins_weight = 0
for i in protein:
    proteins_weight += weight[i]
    
round(proteins_weight, 3)

98124.65

# Combing Through the Haystack

Problem  
Given two strings s and t, t is a substring of s if t is contained as a contiguous collection of symbols in s (as a result, t must be no longer than s).

The position of a symbol in a string is the total number of symbols found to its left, including itself (e.g., the positions of all occurrences of 'U' in "AUGCUUCAGAAAGGUCUUACG" are 2, 5, 6, 15, 17, and 18). The symbol at position i of s is denoted by s[i].

A substring of s can be represented as s[j:k], where j and k represent the starting and ending positions of the substring in s; for example, if s = "AUGCUUCAGAAAGGUCUUACG", then s[2:5] = "UGCU".

The location of a substring s[j:k] is its beginning position j; note that t will have multiple locations in s if it occurs more than once as a substring of s (see the Sample below).
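The positions above are 1-based and inclusive, whereas Python slices are 0-based and exclude the end index. A minimal sketch of the conversion follows; `rosalind_slice` and `locations` are hypothetical helper names introduced only for illustration.

```python
def rosalind_slice(s, j, k):
    """Return s[j:k] in the 1-based, inclusive sense used in the problem text."""
    # Position j maps to index j - 1; position k is already the exclusive stop.
    return s[j - 1:k]


def locations(s, t):
    """Return all 1-based locations of t in s, including overlapping ones."""
    found = []
    start = s.find(t)
    while start != -1:
        found.append(start + 1)        # convert to a 1-based position
        start = s.find(t, start + 1)   # advance by one so overlaps are kept
    return found


s = "AUGCUUCAGAAAGGUCUUACG"
print(rosalind_slice(s, 2, 5))                         # UGCU
print([i + 1 for i, c in enumerate(s) if c == "U"])    # [2, 5, 6, 15, 17, 18]
print(locations("GATATATGCATATACTT", "ATAT"))          # [2, 4, 10]
```

Stepping the search by one position rather than by `len(t)` is what allows overlapping occurrences to be reported.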

Given: Two DNA strings s and t (each of length at most 1 kbp).

Return: All locations of t as a substring of s.

In [3]:
with open('rosalind_subs.txt') as f:
    s = f.readline().rstrip('\n')
    t = f.readline().rstrip('\n')

In [4]:
ans = []
for i in range(len(s)):
    if s[i:i+len(t)] == t:
        ans.append(i + 1)
        
' '.join(str(x) for x in ans)

'11 18 68 78 97 125 183 198 326 364 371 386 434 455 535 542 597 625 632 639 646 670 730 766 783'

# Counting DNA Nucleotides

Problem
A string is simply an ordered collection of symbols selected from some alphabet and formed into a word; the length of a string is the number of symbols that it contains.

An example of a length 21 DNA string (whose alphabet contains the symbols 'A', 'C', 'G', and 'T') is "ATGCTTCAGAAAGGTCTTACG".

Given: A DNA string s of length at most 1000 nt.

Return: Four integers (separated by spaces) counting the respective number of times that the symbols 'A', 'C', 'G', and 'T' occur in s.

In [5]:
with open('rosalind_dna.txt') as f:
    dna = f.readline()

In [6]:
print(dna.count('A'), dna.count('C'), dna.count('G'), dna.count('T'))

231 242 202 209


# Finding a Shared Spliced Motif

Problem
A string u is a common subsequence of strings s and t if the symbols of u appear in order as a subsequence of both s and t. For example, "ACTG" is a common subsequence of "AACCTTGG" and "ACACTGTGA".

Analogously to the definition of longest common substring, u is a longest common subsequence of s and t if there does not exist a longer common subsequence of the two strings. Continuing our above example, "ACCTTG" is a longest common subsequence of "AACCTTGG" and "ACACTGTGA", as is "AACTGG".

Given: Two DNA strings s and t (each having length at most 1 kbp) in FASTA format.

Return: A longest common subsequence of s and t. (If more than one solution exists, you may return any one.)

In [7]:
with open('rosalind_lcsq.txt') as f:
    data = f.read()
data = data.replace('\n', '')
data = data.replace('Rosalind_', '')
data = data.split('>')
data = data[1:]
S = data[0][4:]
T = data[1][4:]

In [8]:
# S and T were already extracted in the previous cell.
# cur is one row of the LCS table; the empty strings are the dummy entries for the empty prefix of S.
cur = [''] * (len(T) + 1)
for s in S:
    last, cur = cur, [''] 
    for i, t in enumerate(T):
        cur.append(last[i] + s if s==t else max(last[i+1], cur[-1], key=len))
print(cur[-1])

CTGCGCCGAATAGTCAAAGTGAACTTTGACAGAACAAGGTGTCCTAACTCTAAAGTCACCCATCGGTAAGTTGGATCTCGTGGGGAGTAGTATTGCCCCCTCAGGGGGTTTTGGACATTCCATCACACGTCGCACGGGCGTAATATTTTTTTCAGGATTACCCGGACTTGCGGATGAGTAGGGTAATTGGGCCCATTCCTCGAGGAACAGTACCGCCAGCAAAGTTTAAGACGTACCATGCAGACAGGCTGTAGGATCGAGATTTGTCCTAGCGATGGGCCGTTTGACGAAAGGAATGGTGGGTGACGGCTTGCTTGAGCAAGGTCCAAAGGCAGTCACGCCTAGTATTATATGCGCAGAATCTAGTGAATCTGTCCGGGCGATCTCGTCTCTAAGGGGGGGCGGGAAACACTGCAGAGTTGAATCCAGTAATAGTGCGTTTTTTCAAGTAGTTCAAAACTATTGTAACTTTTTCGCAAGCCTGCCCGCTATAGTGGCGTTGTAGGACTTCATGTAGGGATCGTTAGAAGGGGGCCGTTATTGTGAGGGCCTGGGGTTTGTGGTTGAG
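The cell above keeps a full string in every table entry. As an alternative sketch (not the method used in this notebook), the table can hold only lengths, with one solution recovered by walking back through it. The function names here are hypothetical.

```python
def longest_common_subsequence(s, t):
    """Return one longest common subsequence of s and t."""
    m, n = len(s), len(t)
    # length[i][j] = length of an LCS of s[:i] and t[:j]
    length = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                length[i][j] = length[i - 1][j - 1] + 1
            else:
                length[i][j] = max(length[i - 1][j], length[i][j - 1])

    # Walk back from the bottom-right corner to recover one solution.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if s[i - 1] == t[j - 1]:
            out.append(s[i - 1])
            i -= 1
            j -= 1
        elif length[i - 1][j] >= length[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


def is_subsequence(u, s):
    """Check that the symbols of u appear in order in s."""
    it = iter(s)
    return all(c in it for c in u)


u = longest_common_subsequence("AACCTTGG", "ACACTGTGA")
print(len(u), is_subsequence(u, "AACCTTGG"), is_subsequence(u, "ACACTGTGA"))
# 6 True True
```

Which of the several length-6 answers is returned depends on how ties are broken during the walk back; any of them is acceptable for this problem.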


# Finding a Spliced Motif

Problem
A subsequence of a string is a collection of symbols contained in order (though not necessarily contiguously) in the string (e.g., ACG is a subsequence of TATGCTAAGATC). The indices of a subsequence are the positions in the string at which the symbols of the subsequence appear; thus, the indices of ACG in TATGCTAAGATC can be represented by (2, 5, 9).

As a substring can have multiple locations, a subsequence can have multiple collections of indices, and the same index can be reused in more than one appearance of the subsequence; for example, ACG is a subsequence of AACCGGTT in 8 different ways.
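Both examples above can be checked with a short sketch. `count_embeddings` and `first_indices` are hypothetical names; the first counts index collections with a small dynamic program, the second returns the leftmost collection.

```python
def count_embeddings(s, t):
    """Count the index collections at which t appears as a subsequence of s."""
    # ways[j] = number of ways to match the first j symbols of t so far
    ways = [1] + [0] * len(t)
    for c in s:
        # Go right to left so the current symbol of s extends only
        # matches that were built from earlier symbols.
        for j in range(len(t), 0, -1):
            if t[j - 1] == c:
                ways[j] += ways[j - 1]
    return ways[len(t)]


def first_indices(s, t):
    """Return the leftmost 1-based indices at which t appears in s."""
    indices = []
    pos = 0
    for c in t:
        # str.index raises ValueError if t is not a subsequence of s
        pos = s.index(c, pos) + 1
        indices.append(pos)
    return indices


print(count_embeddings("AACCGGTT", "ACG"))     # 8
print(first_indices("TATGCTAAGATC", "ACG"))    # [2, 5, 9]
```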

Given: Two DNA strings s and t (each of length at most 1 kbp) in FASTA format.

Return: One collection of indices of s in which the symbols of t appear as a subsequence of s. If multiple solutions exist, you may return any one.

Note: indices start at 1.

In [9]:
with open('rosalind_sseq.txt') as f:
    file = f.read().split('>')
file = file[1:]
string = file[0][13:].replace('\n', '')
substring = file[1][13:].replace('\n', '')

In [10]:
ans = [string.index(substring[0])]
for i in range(1, len(substring)):
    a = string[ans[i-1]+1:].index(substring[i]) + ans[i-1]+1
    ans.append(a)

' '.join(str(i+1) for i in ans)

'2 3 5 6 7 8 10 23 28 34 37 40 41 45 54 56 60 63 67 78 80 84 86 101 107 114 120 122 128 129 135 136 140 141 144 145 147 154 162 163 164 166 176 181 189 190 193 199 205 210 213 217 218 226 227 238 250 252 256 257 271 276 277 284 287 288 290 293 294'

# Genes are Discontiguous

Problem
After identifying the exons and introns of an RNA string, we only need to delete the introns and concatenate the exons to form a new string ready for translation.

Given: A DNA string s (of length at most 1 kbp) and a collection of substrings of s acting as introns. All strings are given in FASTA format.

Return: A protein string resulting from transcribing and translating the exons of s. (Note: Only one solution will exist for the dataset provided.)


In [11]:
with open('rosalind_splc.txt') as f:
    data = f.read()
data = data.replace('\n', '')
data = data.split('>')
data = data[1:]

In [12]:
s = data[0][13:]
data = data[1:]
for i in data:
    s = s.replace(i[13:], '')

protein = {'TTT':'F', 'CTT': 'L', 'ATT': 'I', 'GTT': 'V', 'TTC': 'F', 'CTC': 'L', 'ATC': 'I', 'GTC': 'V', 
           'TTA': 'L', 'CTA': 'L', 'ATA': 'I', 'GTA': 'V', 'TTG': 'L', 'CTG': 'L', 'ATG': 'M', 'GTG': 'V',
           'TCT': 'S', 'CCT': 'P', 'ACT': 'T', 'GCT': 'A', 'TCC': 'S', 'CCC': 'P', 'ACC': 'T', 'GCC': 'A',
           'TCA': 'S', 'CCA': 'P', 'ACA': 'T', 'GCA': 'A', 'TCG': 'S', 'CCG': 'P', 'ACG': 'T', 'GCG': 'A',
           'TAT': 'Y', 'CAT': 'H', 'AAT': 'N', 'GAT': 'D', 'TAC': 'Y', 'CAC': 'H', 'AAC': 'N', 'GAC': 'D',
           'TAA': 'Stop', 'CAA': 'Q', 'AAA': 'K', 'GAA': 'E', 'TAG': 'Stop', 'CAG': 'Q', 'AAG': 'K', 'GAG': 'E',
           'TGT': 'C', 'CGT': 'R', 'AGT': 'S', 'GGT': 'G', 'TGC': 'C', 'CGC': 'R', 'AGC': 'S', 'GGC': 'G',
           'TGA': 'Stop', 'CGA': 'R', 'AGA': 'R', 'GGA': 'G', 'TGG': 'W', 'CGG': 'R', 'AGG': 'R', 'GGG': 'G'} 

# Translate codon by codon, stopping at the first stop codon.
ps = []
for i in range(0, len(s) - 2, 3):
    p = protein[s[i:i+3]]
    if p == 'Stop':
        break
    ps.append(p)
    
print(''.join(ps))

MPSDQIVMTEERIESLLTTSVCDMRSALSNYSRGSNASPRCMLINYTFRTHGIPGGQRVREETEGVRLHDPTWSIVELFTISRVQSVLLPKVIVELTLNDISQNTCLYIHCQERHASFSPSTSIPHANTLCTLERRASGGRSVAEIRWYRNYPSRPSARRSTGVQSVPLRLLRRTA


# Identifying Unknown DNA Quickly

Problem

The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.
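The 37.5% example and the reverse-complement remark can be checked with a minimal sketch; `gc_content` and `reverse_complement` are hypothetical helper names.

```python
def gc_content(dna):
    """Return the GC-content of dna as a percentage."""
    return 100 * (dna.count("G") + dna.count("C")) / len(dna)


def reverse_complement(dna):
    """Reverse dna and swap A<->T and C<->G."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


s = "AGCTATAG"
print(gc_content(s))                        # 37.5
print(gc_content(reverse_complement(s)))    # 37.5
```

Complementing swaps each 'C' with a 'G' and each 'A' with a 'T', so the combined count of 'C' and 'G' is unchanged, and reversing does not change counts at all.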

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with '>' indicates the label of the next string.

In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.
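The format described above can be read without relying on the label having a fixed width. The following is a sketch under that assumption; `parse_fasta` is a hypothetical helper and the sample records are made up.

```python
def parse_fasta(text):
    """Map each FASTA label to its sequence, joining wrapped lines."""
    records = {}
    label = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            label = line[1:]             # everything after '>' is the label
            records[label] = []
        elif label is not None:
            records[label].append(line)  # sequence lines may be wrapped
    return {name: "".join(parts) for name, parts in records.items()}


# Made-up sample data, used only to show the call.
sample = ">Rosalind_0001\nACGT\nAC\n>Rosalind_0002\nGGCC\n"
print(parse_fasta(sample))
# {'Rosalind_0001': 'ACGTAC', 'Rosalind_0002': 'GGCC'}
```

Because dictionaries preserve insertion order, the records come back in file order.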

Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.

In [13]:
with open('rosalind_gc.txt') as f:
    data = f.read()

data = data.split('>')
data = data[1:]

In [14]:
%%time 

k = 0
p = 0
ind = 0
for i in data:
    a = i[:13]
    b = i[14:]
    b = b.replace('\n', '')
    p_a = (b.count('C') + b.count('G'))/len(b)
    if p_a > p:
        p = p_a
        ind = k
    k += 1

print(data[ind][:13])
print(round(p*100, 6))

Rosalind_6888
53.125
CPU times: user 611 µs, sys: 0 ns, total: 611 µs
Wall time: 713 µs


# Calculating Expected Offspring

Problem
For a random variable X taking integer values between 1 and n, the expected value of X is $E(X) = \sum_{k=1}^{n} k \times \Pr(X = k)$. The expected value offers us a way of taking the long-term average of a random variable over a large number of trials.

As a motivating example, let X be the number on a six-sided die. Over a large number of rolls, we should expect to obtain an average of 3.5 on the die (even though it's not possible to roll a 3.5). The formula for expected value confirms that $E(X) = \sum_{k=1}^{6} k \times \Pr(X = k) = 3.5$.

More generally, a random variable for which every one of a number of equally spaced outcomes has the same probability is called a uniform random variable (in the die example, this "equal spacing" is equal to 1). We can generalize our die example to find that if X is a uniform random variable with minimum possible value a and maximum possible value b, then $E(X) = \frac{a+b}{2}$. You may also wish to verify that for the dice example, if Y is the random variable associated with the outcome of a second die roll, then $E(X+Y) = 7$.
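The die example can be verified numerically. This sketch uses exact fractions to avoid rounding; `expected_value` is a hypothetical helper name.

```python
from fractions import Fraction


def expected_value(distribution):
    """Return the sum of k * Pr(X = k) over a {value: probability} mapping."""
    return sum(k * p for k, p in distribution.items())


# One fair six-sided die: each face has probability 1/6.
die = {k: Fraction(1, 6) for k in range(1, 7)}
print(expected_value(die))         # 7/2, i.e. (1 + 6) / 2

# Distribution of X + Y for two independent dice.
two_dice = {}
for x, px in die.items():
    for y, py in die.items():
        two_dice[x + y] = two_dice.get(x + y, 0) + px * py
print(expected_value(two_dice))    # 7
```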

Given: Six nonnegative integers, each of which does not exceed 20,000. The integers correspond to the number of couples in a population possessing each genotype pairing for a given factor. In order, the six given integers represent the number of couples having the following genotypes:

1. AA-AA
2. AA-Aa
3. AA-aa
4. Aa-Aa
5. Aa-aa
6. aa-aa

Return: The expected number of offspring displaying the dominant phenotype in the next generation, under the assumption that every couple has exactly two offspring.
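The per-pairing probabilities of a dominant phenotype can be derived by enumerating the four equally likely allele combinations for each couple, assuming simple Mendelian inheritance. The names below are hypothetical and the input list is a small made-up example.

```python
from itertools import product

PAIRINGS = [("AA", "AA"), ("AA", "Aa"), ("AA", "aa"),
            ("Aa", "Aa"), ("Aa", "aa"), ("aa", "aa")]


def dominant_probability(parent1, parent2):
    """Probability that a child inherits at least one dominant allele 'A'."""
    # One allele from each parent: four equally likely combinations.
    children = list(product(parent1, parent2))
    return sum("A" in child for child in children) / len(children)


def expected_dominant(couples, offspring_per_couple=2):
    """Expected number of offspring showing the dominant phenotype."""
    return sum(offspring_per_couple * n * dominant_probability(p1, p2)
               for n, (p1, p2) in zip(couples, PAIRINGS))


print([dominant_probability(p1, p2) for p1, p2 in PAIRINGS])
# [1.0, 1.0, 1.0, 0.75, 0.5, 0.0]
print(expected_dominant([1, 0, 0, 1, 0, 1]))    # 3.5
```

By linearity of expectation, the total is just the sum over couples of the number of offspring times the probability for that pairing.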

In [17]:
data = list(map(int, input().split()))

18826 19777 16766 17262 19500 17149


In [18]:
n = 2  #every couple has exactly two offspring.
prob = [1, 1, 1, 0.75, 0.5, 0]   #probabilities of dominant phenotype
def n_offspring(d):
    return sum([n * i[0] * i[1] for i in list(zip(d, prob))])
print(n_offspring(data))

156131.0


# Counting Point Mutations

Problem

Given two strings s and t of equal length, the Hamming distance between s and t, denoted $d_H(s, t)$, is the number of corresponding symbols that differ in s and t. See Figure 2.

Figure 2. The Hamming distance between these two strings is 7. Mismatched symbols are colored red.

Given: Two DNA strings s and t of equal length (not exceeding 1 kbp).

Return: The Hamming distance $d_H(s, t)$.

In [19]:
with open('rosalind_hamm.txt') as f:
    s = f.readline()
    t = f.readline()
s = s.strip()
t = t.strip()

In [20]:
s = list(s)
t = list(t)
count = [1 for i in range(len(s)) if s[i] != t[i] ]
sum(count)

492

# Finding a Shared Motif 

Problem
A common substring of a collection of strings is a substring of every member of the collection. We say that a common substring is a longest common substring if there does not exist a longer common substring. For example, "CG" is a common substring of "ACGTACGT" and "AACCGTATA", but it is not as long as possible; in this case, "CGTA" is a longest common substring of "ACGTACGT" and "AACCGTATA".

Note that the longest common substring is not necessarily unique; for a simple example, "AA" and "CC" are both longest common substrings of "AACC" and "CCAA".

Given: A collection of k (k≤100) DNA strings of length at most 1 kbp each in FASTA format.

Return: A longest common substring of the collection. (If multiple solutions exist, you may return any single solution.)

In [21]:
with open('rosalind_lcsm.txt') as f:
    data = f.read()
data = data.replace('\n', '')
data = data.replace('Rosalind_', '')
data = data.split('>')
data = data[1:]

In [22]:
def long_substr(data):
    substrs = lambda x: {x[i:i+j] for i in range(len(x)) for j in range(len(x) - i + 1)}
    s = substrs(data[0])
    for val in data[1:]:
        s.intersection_update(substrs(val))
    return max(s, key=len)

def is_substr(find, data):
    if len(data) < 1 and len(find) < 1:
        return False
    for i in range(len(data)):
        if find not in data[i]:
            return False
    return True
print(long_substr(data))

GATCTGTTTGTTCCCGTCCTGTTCATAACAACGCCAAGGAATTTGGGCTAAGGCTGGCTATGTGGGGCAAAGAATCACTGGAAAGCCGTTCATCTTATAAATGCGCCATTAGAACGTTTCGAGGAGCCTATGAAATCCCACCGTACAGATAAAAGGGCGGATAATAGGGTAGATTCTGGGGTGCCTGTTCCGCCCGTTAAGAGCCGCCTTGAAATCGACCCAACAAGGATGTAGGGCTCTTAAGCGCATATGTCGCTTAAATGCTAAGCCACAGCTAGAGCCCACGTATTATCACAG
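The cell above builds the set of every substring of every string, which may use a lot of memory on 1 kbp inputs. One alternative sketch (not the notebook's method) binary-searches the answer length, relying on the fact that a common substring of length L implies one of length L - 1. The function names are hypothetical.

```python
def common_substring_of_length(strings, length):
    """Return a substring of the given length shared by all strings, or None."""
    first = strings[0]
    candidates = {first[i:i + length] for i in range(len(first) - length + 1)}
    for candidate in candidates:
        if all(candidate in other for other in strings[1:]):
            return candidate
    return None


def longest_common_substring(strings):
    """Binary-search the largest length that still has a shared substring."""
    best = ""
    low, high = 1, min(len(s) for s in strings)
    while low <= high:
        mid = (low + high) // 2
        found = common_substring_of_length(strings, mid)
        if found is not None:
            best = found        # a match exists, so try longer lengths
            low = mid + 1
        else:
            high = mid - 1      # no match, so try shorter lengths
    return best


print(longest_common_substring(["ACGTACGT", "AACCGTATA"]))    # CGTA
```

Only substrings of one fixed length are held in memory at a time, at the cost of repeating the membership checks for each length tried.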


# Overlap Graphs 

Problem
A graph whose nodes have all been labeled can be represented by an adjacency list, in which each row of the list contains the two node labels corresponding to a unique edge.

A directed graph (or digraph) is a graph containing directed edges, each of which has an orientation. That is, a directed edge is represented by an arrow instead of a line segment; the starting and ending nodes of an edge form its tail and head, respectively. The directed edge with tail v and head w is represented by (v,w) (but not by (w,v)). A directed loop is a directed edge of the form (v,v).

For a collection of strings and a positive integer k, the overlap graph for the strings is a directed graph $O_k$ in which each string is represented by a node, and string s is connected to string t with a directed edge when there is a length k suffix of s that matches a length k prefix of t, as long as s≠t; we demand s≠t to prevent directed loops in the overlap graph (although directed cycles may be present).

Given: A collection of DNA strings in FASTA format having total length at most 10 kbp.

Return: The adjacency list corresponding to $O_3$. You may return edges in any order.
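The definition can be written as a small function with k as a parameter. This is a sketch: `overlap_edges` is a hypothetical name, and it compares the strings themselves (as the definition does) rather than their labels.

```python
def overlap_edges(strings, k=3):
    """Return (tail, head) label pairs for the overlap graph with overlap k."""
    edges = []
    for tail, s in strings.items():
        for head, t in strings.items():
            # s != t rules out directed loops, as the definition requires
            if s != t and s[-k:] == t[:k]:
                edges.append((tail, head))
    return edges


example = {
    "Rosalind_0498": "AAATAAA",
    "Rosalind_2391": "AAATTTT",
    "Rosalind_2323": "TTTTCCC",
    "Rosalind_0442": "AAATCCC",
    "Rosalind_5013": "GGGTGGG",
}
for tail, head in overlap_edges(example):
    print(tail, head)
# Rosalind_0498 Rosalind_2391
# Rosalind_0498 Rosalind_0442
# Rosalind_2391 Rosalind_2323
```

Note that "AAATAAA" and "GGGTGGG" each overlap themselves, which is exactly the case the s≠t condition excludes.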



In [23]:
k = 3
data = {}
with open('rosalind_grph.txt') as file:
    line = file.readline().rstrip()
    while line:
        if line.startswith('>'):
            name = line[1:]
            data[name] = []
        else:
            data[name].append(line)
        line = file.readline().rstrip()
for i in data.keys():
    data[i] = ''.join(data[i])

In [24]:
answer = []
for i in data.keys():
    for j in data.keys():
        if i != j and data[j][:k] == data[i][-k:]:
            answer.append(' '.join([i, j]))

with open('answer.txt', 'w') as file:
    file.write('\n'.join(answer))

# The Genetic Code

Problem  
The 20 commonly occurring amino acids are abbreviated by using 20 letters from the English alphabet (all letters except for B, J, O, U, X, and Z). Protein strings are constructed from these 20 symbols. Henceforth, the term genetic string will incorporate protein strings along with DNA strings and RNA strings.

The RNA codon table dictates the details regarding the encoding of specific codons into the amino acid alphabet.

Given: An RNA string s corresponding to a strand of mRNA (of length at most 10 kbp).

Return: The protein string encoded by s.

In [25]:
with open('rosalind_prot.txt') as f:
    dna = f.read()
dna = dna.strip()

In [26]:
protein = {'UUU':'F', 'CUU': 'L', 'AUU': 'I', 'GUU': 'V', 'UUC': 'F', 'CUC': 'L', 'AUC': 'I', 'GUC': 'V', 
           'UUA': 'L', 'CUA': 'L', 'AUA': 'I', 'GUA': 'V', 'UUG': 'L', 'CUG': 'L', 'AUG': 'M', 'GUG': 'V',
           'UCU': 'S', 'CCU': 'P', 'ACU': 'T', 'GCU': 'A', 'UCC': 'S', 'CCC': 'P', 'ACC': 'T', 'GCC': 'A',
           'UCA': 'S', 'CCA': 'P', 'ACA': 'T', 'GCA': 'A', 'UCG': 'S', 'CCG': 'P', 'ACG': 'T', 'GCG': 'A',
           'UAU': 'Y', 'CAU': 'H', 'AAU': 'N', 'GAU': 'D', 'UAC': 'Y', 'CAC': 'H', 'AAC': 'N', 'GAC': 'D',
           'UAA': 'Stop', 'CAA': 'Q', 'AAA': 'K', 'GAA': 'E', 'UAG': 'Stop', 'CAG': 'Q', 'AAG': 'K', 'GAG': 'E',
           'UGU': 'C', 'CGU': 'R', 'AGU': 'S', 'GGU': 'G', 'UGC': 'C', 'CGC': 'R', 'AGC': 'S', 'GGC': 'G',
           'UGA': 'Stop', 'CGA': 'R', 'AGA': 'R', 'GGA': 'G', 'UGG': 'W', 'CGG': 'R', 'AGG': 'R', 'GGG': 'G'} 

# Translate codon by codon, stopping at the first stop codon.
ps = []
for i in range(0, len(dna) - 2, 3):
    p = protein[dna[i:i+3]]
    if p == 'Stop':
        break
    ps.append(p)

print(''.join(ps))

MARARCTTKKFKSDISDGLSAKHRLGKLKSRYSRARNAPTKKCNFRSDLLALLLSVGIACGPPPTKKHTLFDPKHYAAETRCALRRGAPHSMIIQCQHTGQMCSRCPDRASCTLATGLSSLQTRALCSLASNRFKKSQIGRDVPHDIADKSHNLVVMPPIPETWLGTVGTSHSPPSDNRCGLITWHRRSDPPPQRRASSRHREIEPARLANCDRSPSEERLTSIEDNDRGGRQIDVVRWVQRCTKRIALRWLLPIRRQWPAALRGFQAKQDALILLPPSSKHVCVILIHTREAAVRSECPNADPRSPQHFWITPIYGSTRKRWKITRAASQSMYHLAPEDYLRPNRRLHSSLQFTYNPQHLRGSFPKIRVERKQLCDRFVDYPLTLGVELISCALASVTPVINRGEITRVHVFKSNLNRDDEHKPYVLSTPQILLRSAAQRTDVQPCETIQSTKFPIACYFDSSGIFLFRIAVWGGPEIRLIIPNALTVGLLVLSLWAGLNHFWKYLLRIRFCLKDTVKPPVNVAPVGRLLSARSLLLSILMSKRITRYVRLGSNSHDNCVSRLWYVSYSLRKLRNSSTARNTIVTSKVSFPRVIKHGRASRRVDRARKDRTEFDAYARTYEIVYGTEYYVTVSLHKSPFWVSSYSTDLLYINLALQMGYPDPTIRVPILGRIQIAVLRRSNALLLNMTITYRGIGGPYVFPTCPGVQSTVFYFFCFEINVLPHISQDAIKYTPLSNGELFSGLPRRTPIICFLLPSKLYGLNLLSSTIGHMVLDGRTDTRREIDTSSIRIPFSFFSYPEGLIDVYRLPCQRLPSLVSIGTDLNNVGLTPPRLGVIHLSGSSSTSHLQICQDGSSLKRFGHEQWIGVQVDVLGGLRPVLVKSAAYGLPWRNGSRPPHDYACTSIVTELFLDPVLRGRYPHITTYMYRNVVTLPVTLRWVAQSRSAAHDNYTYPVRASTSPKPDTKEELRDRLVQVAGPYRCVLLFGIALSGGCRCLESRH

# The Secondary and Tertiary Structures of DNA

In DNA strings, symbols 'A' and 'T' are complements of each other, as are 'C' and 'G'.

The reverse complement of a DNA string s is the string $s^c$ formed by reversing the symbols of s, then taking the complement of each symbol (e.g., the reverse complement of "GTCA" is "TGAC").

Given: A DNA string s of length at most 1000 bp.

Return: The reverse complement $s^c$ of s.

In [27]:
with open('rosalind_revc.txt') as f:
    s = f.readline()
s = s.strip()

In [28]:
translation = str.maketrans('ATCG', 'TAGC')
complement = s.translate(translation)
reverse_complement = complement[::-1]
print(reverse_complement)

CCCACGTTGTCCGCTTCCCTTTCGCTTACGGCCGTTCTGACACGTTTTCCAACGGCTGACATATCCCCCAAAGAATATTGATCAGTCATCATCACGGCGAGCGTATGAAAGTTAAGGTGCCGCAGAACGAGAGTCGTGTCTCACGACTGGTTGGGGGGCTGCGATCGCACCGGAATAGTAGAAGTACTGGTCATTATTTTTATGCGCCAGTACAACATGACGGGGACTATGCCAGCCCGAACAAGTTTGGACGAACCTGAGACGATCGACTATGGGGAGTTGAGAATCTTAGAAACGAGTTGAGCGCTTCGTGCATACATATCTTGCCTTCCGAACTGATTCAATCTATCCGGGACTGGTAGTTCCGACTGTGGCAATTCCTATAAATAATGCATACATGCTCTCGTAGCTGTGGGCTTGTTTTGGCTTAGCGTAACACACGGAGTGGCCTTAGCAATCCGCGGATTTTGATCAGGAGGATATCCTGTCAGAGCACCCAGGGTCAAATGTTGAGCTAGCTTGATAGATTTTTGGTCACCAAAGCTTATTAGACTACTCATCCGGGCTGGAACGGGGAGGTTAACAACCTCTTTCGCTTGCCATATTGCCGCGCCCGACTCAAGCCTGGGATACATGGTGCGGATACAGAAGCTTGTTCATAAGCAGATGACCGGCGGCGTTCTCCCGGTATCGTTCGTATGCCTGTCACTAGCGCCACCCATACCGTATCTGACACCATGCTGCGGCTGCAAGTGCGCCCGGGAAAGAGAGCAGTCGTAATTGGCACGCCACGTGGACCGTACATTCGGAAGCACTTATCGGTTTGTCAGTAGCTAGGGGAAAAGCCCTCACAATCCACTGGACGGCGCCAGTATGTAGATTCAGGCCGTCAGACAGATCCAACGTATCGCATTATATCAGTGCTTTAC


# Transcribing DNA into RNA

Problem  
An RNA string is a string formed from the alphabet containing 'A', 'C', 'G', and 'U'.

Given a DNA string t corresponding to a coding strand, its transcribed RNA string u is formed by replacing all occurrences of 'T' in t with 'U' in u.

Given: A DNA string t having length at most 1000 nt.

Return: The transcribed RNA string of t.

In [29]:
with open('rosalind_rna.txt') as f:
    dna = f.readline()

In [30]:
rna = dna.replace('T', 'U')
print(rna)

ACGGCGAAUUUCGUGUGCUUCUGGACGGAGGAGAGUUCUCUAGUACUCCAUGGUCAUCAUCUAGUAGCUGCGACGGAUUCAGGCCCUGUUUACAAGGAACGGGACUCUCGCAUCUGCCCGAUGCACGGAGGAAUAUCUAUGACCUCGUUUAGGAUAACAUGUUUUCCUCUUUAGGCGCCAGCUUCCGCACGUUAACAUGUCAUACAUGUAUUCUAUGUCGUUGCAACUCGAUCAUAUGGUCGGGCAAGCGGCUGUUGUGAGCCUAUGCCCACCCUUUCUACGGGGCUUCUAUACAGAGGUUUUCUACGAAAAGCGUGGUUGUUGAUCUCGGCGCUUCACUACAGUGGUAGUUUCCGACUUCACGUGGCACUACAGCCACAGCAUCAGUGACGACGCGACCCAGUAUGCCCGAUAAUUCGGAAAUCGCGCCCCUUCCUCGUCUUUUCAGUCUCGGAUUUAGCCAGAUCGUCAGGAGUACAAGUAGGAAAGCCUCUCUGCUCACUCCUUAGAUAGGUUGCUUUGCCUGUGCUCUGGCCGUUGUAUGCGCUUAUCGUAGAACUAGCGGGCUGUGCAUAAAUUUAUAGAAGACAGACGUAAGCUGUGAAGCAGAACGUUAAACCAGAGCCAGAUACACGUCGUUUUACUUCUCCCCUCAAGCCAACCAGUCCCUAGUUAAGAAUGCUCCCGGGGUCUCGCGGCUCAGCGAAAACCCAGAAGGCGACCUAAUACGCCUAGGCCAUUGCCACGGAAGAUUCUUGCCCUUUUGUCUGCACUGUAGGCUCUUGUUUCACUGUGUGGGGCACUUGUGAUGCACGUCUGCAUUACAGGACCAUUGUGUGCUUCGGCCGGAACGAGUGUUUUGCAUGGAGCACCGUCGUUGCGUAAGCCCAGUGAAUCUUCCAAUUCAAAGAUUCCGAACGGUGACCUCCACCUCUCGCUAUGCGUCGUGCCUUAUAGCUCAGCUGUUGCG



# Transitions and Transversions

Problem  
For DNA strings $s_1$ and $s_2$ having the same length, their transition/transversion ratio $R(s_1, s_2)$ is the ratio of the total number of transitions to the total number of transversions, where symbol substitutions are inferred from mismatched corresponding symbols as when calculating Hamming distance (see “Counting Point Mutations”).

Given: Two DNA strings $s_1$ and $s_2$ of equal length (at most 1 kbp).

Return: The transition/transversion ratio $R(s_1, s_2)$.
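A transition swaps two purines (A, G) or two pyrimidines (C, T); any other mismatch is a transversion. A minimal sketch of the ratio using that classification follows; the function name is hypothetical and the input strings are made up.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def transition_transversion_ratio(s1, s2):
    """Return the ratio of transitions to transversions between s1 and s2."""
    transitions = transversions = 0
    for a, b in zip(s1, s2):
        if a == b:
            continue
        # Same chemical class means a transition; otherwise a transversion.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    # Assumes at least one transversion; otherwise this raises ZeroDivisionError.
    return transitions / transversions


# A->G and G->A are transitions, C matches, T->A is a transversion.
print(transition_transversion_ratio("AGCT", "GACA"))    # 2.0
```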

In [31]:
with open('rosalind_tran.txt') as f:
    file = f.read().split('>')
file = file[1:]

In [32]:
s = file[0][13:].replace('\n', '')
t = file[1][13:].replace('\n', '')
s = list(s)
t = list(t)
hd = [1 for i in range(len(s)) if s[i] != t[i]]
transition = [1 for i in range(len(s)) if (s[i] == 'A' and  t[i] == 'G') or (t[i] == 'A' and  s[i] == 'G') or
             (s[i] == 'C' and  t[i] == 'T') or (t[i] == 'C' and  s[i] == 'T')]

sum(transition)/(sum(hd) - sum(transition))

1.7096774193548387

# Wascally Wabbits

Problem  
A sequence is an ordered collection of objects (usually numbers), which are allowed to repeat. Sequences can be finite or infinite. Two examples are the finite sequence $(\pi, -\sqrt{2}, 0, \pi)$ and the infinite sequence of odd numbers $(1, 3, 5, 7, 9, \ldots)$. We use the notation $a_n$ to represent the n-th term of a sequence.

A recurrence relation is a way of defining the terms of a sequence with respect to the values of previous terms. In the case of Fibonacci's rabbits from the introduction, any given month will contain the rabbits that were alive the previous month, plus any new offspring. A key observation is that the number of offspring in any month is equal to the number of rabbits that were alive two months prior. As a result, if $F_n$ represents the number of rabbit pairs alive after the n-th month, then we obtain the Fibonacci sequence having terms $F_n$ that are defined by the recurrence relation $F_n = F_{n-1} + F_{n-2}$ (with $F_1 = F_2 = 1$ to initiate the sequence). Although the sequence bears Fibonacci's name, it was known to Indian mathematicians over two millennia ago.

When finding the n-th term of a sequence defined by a recurrence relation, we can simply use the recurrence relation to generate terms for progressively larger values of n. This problem introduces us to the computational technique of dynamic programming, which successively builds up solutions by using the answers to smaller cases.

Given: Positive integers n≤40 and k≤5.

Return: The total number of rabbit pairs that will be present after n months, if we begin with 1 pair and in each generation, every pair of reproduction-age rabbits produces a litter of k rabbit pairs (instead of only 1 pair).
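The bottom-up idea described above can be sketched as a loop that keeps only the two most recent terms, using the generalized recurrence $F_n = F_{n-1} + k \cdot F_{n-2}$. `rabbit_pairs` is a hypothetical name.

```python
def rabbit_pairs(n, k):
    """Return the number of rabbit pairs after n months with litters of k pairs."""
    previous, current = 1, 1    # F_1 and F_2
    for _ in range(n - 2):
        # Each pair alive two months ago produces k new pairs this month.
        previous, current = current, current + k * previous
    return current


print(rabbit_pairs(5, 3))    # 19
```

Each term is computed once, so the running time grows linearly with n, whereas a plain recursive definition recomputes the same terms many times.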

In [33]:
with open('rosalind_fib.txt') as f:
    s = f.readline()
s = s.strip()

In [35]:
n, k = s.split()
def fib(n, k):
    if n == 0 or n == 1:
        return 1
    else: 
        return (fib(n-1, k) + k * fib(n-2, k))
    
fib(int(n)-1, int(k))

1850229480761