# Table of Contents
 <p><div class="lev1"><a href="#Installing-Python"><span class="toc-item-num">1&nbsp;&nbsp;</span>Installing Python</a></div><div class="lev1"><a href="#Variables-and-Some-Arithmetic"><span class="toc-item-num">2&nbsp;&nbsp;</span>Variables and Some Arithmetic</a></div><div class="lev1"><a href="#Counting-DNA-Nucleotides"><span class="toc-item-num">3&nbsp;&nbsp;</span>Counting DNA Nucleotides</a></div><div class="lev1"><a href="#Transcribing-DNA-into-RNA"><span class="toc-item-num">4&nbsp;&nbsp;</span>Transcribing DNA into RNA</a></div><div class="lev1"><a href="#Complementing-a-Strand-of-DNA"><span class="toc-item-num">5&nbsp;&nbsp;</span>Complementing a Strand of DNA</a></div><div class="lev1"><a href="#Computing-GC-Content"><span class="toc-item-num">6&nbsp;&nbsp;</span>Computing GC Content</a></div><div class="lev1"><a href="#Finding-a-Motif-in-DNA"><span class="toc-item-num">7&nbsp;&nbsp;</span>Finding a Motif in DNA</a></div><div class="lev1"><a href="#Counting-Point-Mutations"><span class="toc-item-num">8&nbsp;&nbsp;</span>Counting Point Mutations</a></div><div class="lev1"><a href="#Translating-RNA-into-Protein"><span class="toc-item-num">9&nbsp;&nbsp;</span>Translating RNA into Protein</a></div><div class="lev1"><a href="#Locating-Restriction-Sites"><span class="toc-item-num">10&nbsp;&nbsp;</span>Locating Restriction Sites</a></div><div class="lev1"><a href="#Rabbits-and-Recurrence-Relations"><span class="toc-item-num">11&nbsp;&nbsp;</span>Rabbits and Recurrence Relations</a></div><div class="lev1"><a href="#Introduction-to-Mendelian-Inheritance"><span class="toc-item-num">12&nbsp;&nbsp;</span>Introduction to Mendelian Inheritance</a></div><div class="lev1"><a href="#Enumerating-Gene-Orders"><span class="toc-item-num">13&nbsp;&nbsp;</span>Enumerating Gene Orders</a></div><div class="lev1"><a href="#Finding-a-Shared-Motif"><span class="toc-item-num">14&nbsp;&nbsp;</span>Finding a Shared Motif</a></div><div class="lev1"><a href="#Calculating-Protein-Mass"><span class="toc-item-num">15&nbsp;&nbsp;</span>Calculating Protein Mass</a></div><div class="lev1"><a href="#Consensus-and-Profile"><span class="toc-item-num">16&nbsp;&nbsp;</span>Consensus and Profile</a></div><div class="lev1"><a href="#Overlap-Graphs"><span class="toc-item-num">17&nbsp;&nbsp;</span>Overlap Graphs</a></div><div class="lev1"><a href="#Enumerating-k-mers-Lexicographically"><span class="toc-item-num">18&nbsp;&nbsp;</span>Enumerating k-mers Lexicographically</a></div><div class="lev1"><a href="#RNA-Splicing"><span class="toc-item-num">19&nbsp;&nbsp;</span>RNA Splicing</a></div>

In [1]:
import pandas as pd
import numpy as np

In [2]:
cd /home/jessime/Code/kmers/

/home/jessime/Code/kmers


In [3]:
%load_ext autoreload

In [4]:
autoreload 2

In [5]:
%aimport fasta

In [6]:
def clean(infasta, outfasta):
    cleaner = fasta.Cleaner(infasta, outfasta)
    cleaner.data = cleaner.seq_per_line()
    cleaner.save()

# Installing Python

In [33]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


# Variables and Some Arithmetic

In [35]:
print 3**2 + 5**2

34


In [34]:
print 799**2 + 882**2

1416325


In [38]:
def init_slicing(i1, i2, i3, i4, string):
    print string[i1:i2+1], string[i3:i4+1]

In [39]:
s1 = 'HumptyDumptysatonawallHumptyDumptyhadagreatfallAlltheKingshorsesandalltheKingsmenCouldntputHumptyDumptyinhisplaceagain.'
init_slicing(22, 27, 97, 102, s1)

Humpty Dumpty


In [40]:
s2 = 'tsBpd0mHfromS1sOtvxz74aUVPwnOCiNy1l5hFSaU51apBWSpalerosophisAelWT8Q3BH2MA7JnCHIi9ufPwE6waFJkdBthFSJrnkNNntuNpKwNsYDxHO9Jxl7XqCMT6rE85GJctQyGcateniferlBDPZhKF36zyhRhhpx6gSQoCnywUlAWpxsMAPjCyBFOD060.'

In [41]:
init_slicing(47, 59, 140, 148, s2)

Spalerosophis catenifer


# Counting DNA Nucleotides

In [3]:
from collections import Counter
def count_bases(seq):
    bases = Counter(seq)
    print bases['A'], bases['C'], bases['G'], bases['T']

In [4]:
s = 'AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC'
count_bases(s)

20 12 17 21


In [5]:
with open('/home/jessime/Downloads/rosalind_dna.txt') as infile:
    s = infile.read()
    count_bases(s)

237 244 248 253


# Transcribing DNA into RNA

In [10]:
def dna2rna(seq):
    print seq
    print ''
    print seq.replace('T', 'U')

In [11]:
s = 'GATGGAACTTGACTACGTAAATT'
dna2rna(s)

GATGGAACTTGACTACGTAAATT

GAUGGAACUUGACUACGUAAAUU


In [14]:
with open('/home/jessime/Downloads/rosalind_rna.txt') as infile:
    s = infile.read()
    dna2rna(s)

CATCCCAGTGTTATCGTTAGTGGACAATGGTACGATATCATATGTTTCATACTGTCGGTGCTAAATTGAGGTACAGCGTCTCTGGTAACACCTAACGTTCACTACCTTCCATATGCCTGCAGCAAACCTAGTTCAGGATTATGGATCTTGGCCAACGGGAAAGTTGGGCTCTTGGCAATGCCTTCATGGTGATGATTGAAAAGGCGTCCTCCGCTTACCTAAACTGTAGTATCGCAAGGGTTGGCGATAGCTTGTGGGTATGGCTTGACGGCAATTCTTTCCCCCCGTAGCGCTTATACGAGAATGAATACTTTCAAAATTTCCTTGGTCAACTCCCGCCAGCAGCGTATGTTTCCACTAAAGTCGGGAGGGATCTTCGTTTGCGCGAAGCGGAGAGACTTACATAGACTTATAGGGGCATAGTTTGCTCAAGAAACAGGGCCCCCAGTTAGACCACCGGATCCGGCTTTTAAATTTGAAGTGCTCGGCTGATGCTGTAGATCGTTCACGGCGACTCTTGGATGCAGATGTCATCGGAGGTACCCAATCCTTACTGAATTGGCGTTATCCAACTCGACGTGCGGTGCGCGGAAAAGTCGACAGCCTGTTAACCGTGGCTTGAGGACCTTAAAGTATCTTTTAAGTACTTTGACGCCGACCCGAAACTAAAGGATTCCACACGGATATGTCTTTTGAGCCGGGATTCGCGCGGCATTCGGGGTCAGCTCGCGTGGGTCGACCCGATAAAGAGGGAGGTAATTAGCTTGTTGGCTAGAAAGTGTGCATCAAGTTAGCCCGCAGTACTCGCAACCATGTTCGCGCTTCTAAAATTATTCGTCCTCCTGGAGAGGCGAAGTTTCCTACGGCTCCAGGACTTGCCAGTTCTCAACTAACTAATTTCTCAAATTTGTATCTGAAAGACGTATCCTTCATCCTACCAACTCAAGGAAAGTCAATAGGCCTAATCACCTTG


CAUCCCAGUGUUAUCGUUAGUGGA

# Complementing a Strand of DNA

In [27]:
def rev_complement(seq):
    print seq
    print ''
    comp = {'A':'T', 'G':'C','C':'G','T':'A'}
    new_seq = [comp[b] for b in seq]
    new_seq = ''.join(new_seq[::-1])
    print new_seq

In [28]:
s = 'AAAACCCGGT'
rev_complement(s)

AAAACCCGGT

ACCGGGTTTT


In [29]:
with open('/home/jessime/Downloads/rosalind_revc.txt') as infile:
    s = infile.read().strip()
    rev_complement(s)

ACACAGTTTACCTCGACCGGCCCCTGTCCCGCTAACAATACATCCCGTGACAATCGGTCAATACTACTGCTAGTAGAGGTCGGAGGGGTCCGGGATGGGATAGGAAGCCGGCCCGATATATTGGGGCCAAGTCGGCGCATCTTGCCAATGCTGCCTTGCGAGTGCTAACCAGGTGTGATGGTTACTTCTCCAGCCAATGCGCCGTCATACTGACCAAACGCAGAACTCGACGGTTACCATACCATTCCGTGCTCCAATGACGTGAGAGGCTAGTCGTACCACGGCTGAGGTCTACCACTACCTGTGCACCGAATCTTTTATCTATGAGGCTTACCCTACCGAGTGTCCCAACCAGATCGCTGGGAGAGCATTATGGGATGCTGCCTGGTCGATAGAGTTCTAACGAGTAAGGGTGCTCCACGTCGAACGTTCCGCCGGCCCGTTCTAGAACCTAGGGATTTCTAAGTGAGTATTGTGAAGGCATCTATCGTTCTGATCGCAGTCCTACCGTGTTCTAAGCAGGTGAGCCCCGGACAGTTACCATATGTGAGGCAAAGCGTCACCTGGAATGTTTACAGATGAGGGCACCTACCCGAATCTTCTGGCAATGCCTTAAGCCATCGGATATAAGGTATCCAATAGTGCTCTAGCCGCGGTTGGGGCCCAAGCTTTGTGACCCGCCTTGCAGCCAAGAAGGTGATCTAGTTGGTGTTTGGGAGCCATGCCTGTGATCTGTTCAGTACGTGTGGACGTGGATTTGCTCTGTAGTACCCCTGATAGTTGCCGGCACGCGATACCCTGATGTCGCTCGGGTTCACCTTGATACCGCCCACATTCATGCTCCGTCAGGCAAGCGACAGGCCGCCAGCCGTATCTAAATTGTCGGTCCAGATGACGGATAAAGGGCTCATGATACGAATAGCAGGTTTGTCGACGTCAACTGCCGACTGATCCAGGGTCAGAACTAAAGACA

TGTCTTTAGTTCTGACCCTGGATCA

# Computing GC Content

In [43]:
def max_gc(fasta):
    max_seq = None
    max_gc = 0
    gc = 0
    len_seq = 0
    
    for line in fasta:
        if line['0'] == '>':
            name = line[:1]
            
            
    data = fasta.split('\n')

    for line in data:
        print line
        if line[0] == '>':
            name = line[1:]
        else:
            bases = Counter(line)
            gc = (bases['G'] + bases['C'])/float(len(line.strip()))
            if gc > max_gc:
                max_gc = gc
                max_seq = name
    print max_seq
    print max_gc

In [45]:
s = """>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT"""

max_gc(s)

>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
35
60.0
TCCCACTAATAATTCTGAGG
8
20.0
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
34
60.0
ATATCCATTTGTCAGCAGACACGC
11
24.0
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
37
60.0
TGGGAACCTGCGGGCAGTAGGTGGAAT
16
27.0
Rosalind_0808
0.616666666667


In [40]:
s = "CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT"
print len(s)
print s.count('G')
print s.count('C')

87
29
24


In [42]:
(24+29)

53

In [95]:
def max_gc(fasta):
    #build lists
    names_ls = []
    seqs_ls = []
    current_seq = ''
    with open(fasta) as fasta:
        for line in fasta:
            if line[0] == '>':
                names_ls.append(line[1:].strip())
                if current_seq:
                    seqs_ls.append(current_seq)
                    current_seq = ''
            else:
                current_seq += line.strip()
        seqs_ls.append(current_seq)     

    #calc gc
    max_seq = None
    max_gc = 0
    for name, seq in zip(names_ls, seqs_ls):
        gc = (seq.count('G')+seq.count('C'))/float(len(seq))*100
        print name, gc
        if gc > max_gc:
            max_gc = gc
            max_seq = name

    print max_seq
    print max_gc
        

In [97]:
max_gc('/home/jessime/Downloads/rosalind_gc6.txt')

8
8
Rosalind_0450 51.3116474292
Rosalind_7272 50.5434782609
Rosalind_8905 48.4375
Rosalind_4092 49.5464852608
Rosalind_3126 52.787258248
Rosalind_4517 50.6815365551
Rosalind_7293 52.8199566161
Rosalind_8059 49.7371188223
Rosalind_7293
52.8199566161


In [72]:
l = ['CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG', 'CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGGCCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC', 'CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGGCCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGCCCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT']

In [73]:
l[-1]

'CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGGCCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGCCCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT'

In [86]:
s1 = 'GTCTGCCCTCCCGGGCCCTAGTTGTTCTCTTAGCTAACTTCACTGTAATTATCATAAACAATTTCCACTTTCTTTCAGTATTATTGCACGCATAAGGGGGTGACGGGACAACTATAATATGTCTTACTTGAAGTAATGCATCTAGTCGTAAAAGGTGCCATGCAGAACGATACAATCAACGACGGACTGACACCGCTAGTTGGAGTAGGTACACGGAGGAATATTTGGATACAAACCGCGCACCTCGCCACGCTCAACTCTGCGACCTCCCTAGCGGCCGCGGACGCCACTTGGGTAGGCTCGCAGCTCCTCCTAATTAGATGAGGATGCCATTGGGCGATTTACTGCGTATCGAATATCTGGCATGGGCGCTGCCGGCAATATCATGTATTTCATAGGCAAACAGGGAGTGCTCTAAGTTCCGGCGGGGCCTCCAGTCCCTCCAACGGGTATTATGCAGGCTATCGACCAATAACCATGGTAAATCGTTAAGGCGGTCATCTGCGCGGCACTGCGGGAAGTATCGGCTTCGCTATCTCCTCGGTATTCCCGGAATCCACTAATTTATCTCCTTGCGAGCCCACCCTTCTCTAGTTGCCGATGCGTTCGCTGGCATATGCTTGCACGTCTTGAAGAAATAGCCACCCCTTTTCGTGATATGCTTCACCCGGTGCGAAAGGCGCTAGACCATTCACCCCCCAATAGTAATCCTTACGTTAACCTACCCCCCGGAGCGGGGTCCGAGAGGTTCGCAAAAGTAAGATAAGCATGGATCCAGGGTTACTACGGTGTGCCACTTCGCACCCGGACGGGGAAACAATGCG'

In [87]:
s2 = 'GTCTGCCCTCCCGGGCCCTAGTTGTTCTCTTAGCTAACTTCACTGTAATTATCATAAACAATTTCCACTTTCTTTCAGTATTATTGCACGCATAAGGGGGTGACGGGACAACTATAATATGTCTTACTTGAAGTAATGCATCTAGTCGTAAAAGGTGCCATGCAGAACGATACAATCAACGACGGACTGACACCGCTAGTTGGAGTAGGTACACGGAGGAATATTTGGATACAAACCGCGCACCTCGCCACGCTCAACTCTGCGACCTCCCTAGCGGCCGCGGACGCCACTTGGGTAGGCTCGCAGCTCCTCCTAATTAGATGAGGATGCCATTGGGCGATTTACTGCGTATCGAATATCTGGCATGGGCGCTGCCGGCAATATCATGTATTTCATAGGCAAACAGGGAGTGCTCTAAGTTCCGGCGGGGCCTCCAGTCCCTCCAACGGGTATTATGCAGGCTATCGACCAATAACCATGGTAAATCGTTAAGGCGGTCATCTGCGCGGCACTGCGGGAAGTATCGGCTTCGCTATCTCCTCGGTATTCCCGGAATCCACTAATTTATCTCCTTGCGAGCCCACCCTTCTCTAGTTGCCGATGCGTTCGCTGGCATATGCTTGCACGTCTTGAAGAAATAGCCACCCCTTTTCGTGATATGCTTCACCCGGTGCGAAAGGCGCTAGACCATTCACCCCCCAATAGTAATCCTTACGTTAACCTACCCCCCGGAGCGGGGTCCGAGAGGTTCGCAAAAGTAAGATAAGCATGGATCCAGGGTTACTACGGTGTGCCACTTCGCACCCGGACGGGGAAACAATGCG'

In [88]:
s1 == s2

True

# Finding a Motif in DNA

In [45]:
from __future__ import print_function
def occurrences(string, sub):
    c = 0
    while c+len(sub) <= len(string):
        chunk = string[c:c+len(sub)]
        c += 1
        if chunk == sub:
            print(c, end=' ')

In [46]:
s = 'GATATATGCATATACTT'
a = 'ATAT'
occurrences(s,a)

2 4 10 

In [47]:
string = 'AATTGTTTGTATATAACAGATTGTTTAATTGTTTGGCGTATTGTTTGAATTGTTTTATTGTTTGATTGTTTCAATCAGATGGATTGTTTGTTCGACTATTGTTTGATTGTTTATTGTTTATTGTTTTATTGTTTTACAATTGTTTATTGTTTCATTGTTTGTCCGCCGGATTGTTTACTTCACCCATTGTTTGGGGCGGGTATTGTTTGATTGTTTCATTGTTTATTGTTTTCATTGTTTAATTGTTTTTGAAATTGTTTCGGGATTGTTTCTAATCATCTATATTGTTTAATTGTTTATTGTTTATTGTTTATTGTTTGGGTCTATTGATTGTTTATTGTTTAAGATTGTTTTATTGTATTGTTTTTAGATTGTTTATTGTTTTGCATTGTTTGCCCATTGTTTATTGTTTGCAATTGTTTATTGTTTATTGTTTCGTATTGTTTCATTGTTTACGAGATTGTTTCTATTGTTTATGAAGATTGTTTAAAATTGTTTAGCTATTGTTTCACCCTCATTGTTTATTGTTTGATTGTTTGAGGTTATTGTTTTTGTATTGTTTAATTGTTTGAATTGTTTCTACATTGTTTTATTGTTTCGGCAATTGTTTATTGTTTGCATTGTTTCAGAAGTGATTGTTTTCGTGCGGCATTGTTTATTGTTTACTCGATTGTTTATTGTTTGAAGTATTGTTTCCCAATTGAATTGTTTAATGATTGTTTGCTATTGTTTAATGCGCGCAATTGTTTGATTGTTTCATAATTGTTTATTGTTTATTGTTTCCCTAGGCCGGTGGCATTGTTTACGAGATTGTTTGTCATTGTTTTATTGTTTAACATTGTTTTCAGATTGTTTGTACATTGTTTACGGGATTGTTTGATTGTTTGCTTGATTGTTTTCCGATTGTTT'
sub = 'ATTGTTTAT'
occurrences(string, sub)

106 113 139 218 292 299 306 330 371 399 416 423 469 517 604 651 670 762 769 

# Counting Point Mutations

In [4]:
def hamming(string1, string2):
    total = sum([1 if s1 != s2 else 0 for s1, s2 in zip(string1, string2)])
    return total

In [5]:
string1 = 'ATGGGGAAAGAGGCCCAGCCGATCATTCTGTATGTGGAAATCGTATGATCATGTCATTTGGAAGCATGCCACAGTTGTCTTCGCGCGGCGTACCGAAAGATAGCCGACACCAGGTTGGTAGTCTAGCTCAGATGGTGAATTAGACCGACGGCAGAGCCAATAATGAATGTACAAAACGTTCACTCGGACGCGTATCATGTAGTATGTCGAGAGGCTCTTACGAGACGATAGCTCCGCACGTAACCGATGGCACAGTGATTTCAGCTAGGCGAAAGCTATCCCAGGTGTAGGGCAGAAGCTCATATAGGGAAACGTCCTGGGCTTCCGGTTTGAGTAAGCACAAAGCGCCGGTATCGTCACCGGGCTGGTCGGGCCTTAGCCGTCCCGCGGGAGTGAAGGATACAAGTACACATTCGACTGCGCCCAGTGGACACTTATGCAGCGTAGTAGATAATCGGATGCTATATTTGATAATTGAAACGCCGCAGCGAAACACGCGGGCTGGAGGAGGCCCGTAATCCTTGTGCAGATGTCCGTAAATATGATGGGCCGACCGTGTGCGTCTTGTACTGTCAGGTACACCCGTCATCGGTCTCCTCGGTAGCAGACAACCTGCGGGGGGGTCTCTGTCCCAACGTGAGTTGGTGAAAGCTACTCTTAGTGGTCTCCTCCGAGAACGCCGCCCGCTCACTGGTCACGCGTAAGCTTACGCCTAGCAGCAGAAAGCCTGTGTATCTGGATTGCTTTCAAGCTAAGGTAGCAGAGGGCGTCATACCCTACTGGGATAAACCCCGTCCCGTTCGCTTAAAAGAGAAAAGATGAGGCACACCAACATAGTCTAGAGCGCCCTCGGCTTTCGCCGGCGCCTATGCGATGAGATATTAACATGCAAATAGCTGTGAGGATGCCGAAAATAGTATCGTCGCGATCTTACCC'
string2 = 'ATGGGTTACGAGCCACCGACGGTCATCCTGTCGGGTGCAACCGTATGACTGTCGAGTTTGCTAGCCTTCCATTATTCTGCTCACTCTCAATCCAGTTTGTTAGACGACCCATGGTGAGTTGTATGGCTCAGGTGGCGGTAGACACCGAAAGCAAAGTGAATAACGAAATCACGCGTACCAGGTTGAGGCGACAGATGCGTTATACGTCGATACGCTTCTAACGACCGTTTGGTCGTACGCCTACCGATGTATCTATACTGTCAGATAAAGGGCCTCTCCCACAAGAGTATAGCGGAAGCTAGCAAATGGAAGCGTCCTATAATTTCACGGGCCGGAACTTTAATTCGCTGGCAGGTTTAGTTTGCTTATCTGACCCGGGCCGTACAGTGATACGTAAATCTTTATCAACACTGTACGCGTCCCCCGCTGCACACGGAAGCGCCAGAGTAGACAATGGGCAAGTATACTTGATAAAAGCAAGTGCGCCCGCGGACATAGCCCGAGTAGTGGGCGGGCATTCGCTGCCGTGTTGTCGCTGAAAATCATGGTCTTCCAGCATATGTGATGTACAGCGCATTACACTAGCCTTACCTAAGATCACTGCCGGACACCTTGATGGGGATTTCGCGCGCCAACTCTGAATGGGCAAATAGTCCTTTTGAGGTTCTTACCAGGGATCAGCGAAGCAAAAAGTCCAGAATTACGCGTGAAAGGGCTTGGTCACCGTGTATCTAGCGGGCTCGTTAAGAAAAGTTATCAACAACTGACGGCCTACTATAGGGATCCGAGATTGGTTGAGTATGGGCGCACTGAATGCGCTCAGGCTCACTAAGCTAGGCTTGGGTCCCATCCAGTTGTGGGGGCAGTTACCATGCGGGCGATGGCGTCTCAAATCGCCGTGAGTTTGCATCACCTGGCCTTCCCGGGAGTTTACCG'
hamming(string1, string2)

479

# Translating RNA into Protein

In [7]:
table = """UUU F      CUU L      AUU I      GUU V
UUC F      CUC L      AUC I      GUC V
UUA L      CUA L      AUA I      GUA V
UUG L      CUG L      AUG M      GUG V
UCU S      CCU P      ACU T      GCU A
UCC S      CCC P      ACC T      GCC A
UCA S      CCA P      ACA T      GCA A
UCG S      CCG P      ACG T      GCG A
UAU Y      CAU H      AAU N      GAU D
UAC Y      CAC H      AAC N      GAC D
UAA Stop   CAA Q      AAA K      GAA E
UAG Stop   CAG Q      AAG K      GAG E
UGU C      CGU R      AGU S      GGU G
UGC C      CGC R      AGC S      GGC G
UGA Stop   CGA R      AGA R      GGA G
UGG W      CGG R      AGG R      GGG G""".split()
table_dic = {k:v for k,v in zip(table[::2], table[1::2])}

In [8]:
def translate(dna):
    codons = [dna[i:i+3] for i in xrange(0, len(dna), 3)]
    del codons[-1]
    protein = "".join([table_dic[c] for c in codons])
    return protein

In [17]:
translate('AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA')

'MAMAPRTEINSTRING'

In [18]:
translate('AUGCUGACAUCGGACCUCUUUAAACAGACGAUUUGCCUGCGAGGUUGCUCCCCGUCGGGUCGAUAUUUAGUUGAGUUGGUGGCAUUCUUCGUAUCGCAGCUGCCCGAGUCAAUUUGGGUUACACUAAUCUGUCACUCUAGCCGGUGGUUGGUGCAACAGAAGAUCCCGUCCGCAGGCGUAUGGCCCUCACGCGCAGUGCACCGGUUUUUGGUUAGAGCAGAUACGCUUCGUUACCCCACACGUUACUUACGUAGAGUGGCUAGACAUGCAGGAGAGGGCAAAGUGUCACCUUCCCUGGGGGCUACAUCUGUCGAUCGGACGGCACGGCGUGGGAGAAAGCAGCCCCCGGAAAUUCAAUUACGUGUGCGUAGAUUUGUCGACGUACAGAUCCUGAAAACCCUCCUUUUUGUUGAGUAUACUCAUCAGAAUGAUAGAACCAGUGGCAAGCGUGUUUUAAAUGUUACGCAAGGCCACGGGGCCAGAUUAGUUGUUGAUACACGGCGUGAGGGUUUCUGCAAACCCAAUUCGCUACACGAUGAGUGCCCGCACGUCCGAAGCGGACUUGAGCAGCGUCGAGACUGCCGUCGCUCGUACUCCAAAGGCGCUUCGUCUGGGGGGAAGAGCGUUCAAGUCGCCGCCGAACAGUUGAUAUACUUAAGCGGGUUAGUUACCACAAUUUAUUCAGCCUUGCCGUGCGCGCACAGUCGCUUCCAGGUACUGGAGACUUGUUCCCACAAGUUGUAUUCUUCUGACCGCCGAGGAUUUAGGUCUAUUUCCGGGUGGUGCUGGCGUACCUGCAACCAAGGUUUGGCUGGAAGCCAACGGAGCGGACCUAUGCACCACAGCUUUCACCAGGGAUGGGCAGAAUUUCGCAACCCUUGGAAGAACUCUUCGACUUGCCUCCGGACGUUCUCUGGCUCAAUUCAAACUUCACUUUUAGCGGGCCCCAUAAUUCUUAAGGUAUUCGAAUGUGGACCGAUGAUACACGGUACGAAGGAACGCGUGGGUACUGGGUCAGGACAACGGUCGGCUACAAUAGCACUCAGGAGGCAGACUACAAAAAAUGCGACAGUUGUCGCCACGCUUGUGCAUGAUGGUUCAUGCCAUCCUCCGUGGGAUUUCUCUAAGGUGUACACCGAAAUCGUCAUAUUAGUUGUAGUAUCUCUAUGUGCUCGAUACCUGCGGUCACGCAUCCCCGUGAGGCGGGAGGGUUUCGCGUGUCCAAAAUUAAACGACACACCUAAUGGCGCUUGUGCGCCGACGGAGUGGCCUCACCACGACCAGCCCAGGUAUCCCCCUGUCCUUUCAUUAGAGCCCCUCAGGUCACUUUGUAUCUAUAAUAUACAUUUACUAGCUCCACAUUUCAUCGAAUUGCAUGUAAUAGUGGAUAGCCAUGGCGUACUGGUGUCCGGAUGGAAUCACUGUGUAUUGAAGGCCAUAGCGGCACUACUCCCCACACGUUACUCCCAAUGUCGACACGGUUACGACCCGGCCAGCGGUUCCCGAUAUUACGCCACGUUCGAGGCAGUUUUGCCGAUGAUAACAAUUCGGAACAUGUGGUCUCCGAGCACCCAAUAUAGGCCUGAGGACUCGGUGCACUCGUCUUUUCAGCAAGAGCCACGAAGGAGCAAAUAUGGCCACCGGGGUACGGGCGUGAGAAGCGCAACGGUUGACGUGACAGUUCCGAGACUUCCAAAUGAUUACGUGCACUCCCGUAAGUUAGUGCAUUGUAACCCAUCCACGUUUGGGGCAGCGAUGGUUAAAGCGGUGUUUUGGCCCCAGGUAGUAAGACUGUGUCACUCAAUCGCUAGUUGCGUAGUGAGUCCGGUUCUAUGGGAACGCGCUGGGGUAGCACCUGAGAUUCGAAACGCCAUGCACUUGCGGGGCACAACGUGUUCUAAAAGGGCCCUUGCAAUCAUUCUUAGGCCACUCGGCUACAUAAGUGACUUGGGGAUCUUCGAGAAUGCAGAAGUUGCGGUUAAUGUGGUCCCCCGCCAGCUCCCUCAUUCUGCCUGGCGCUGCUACGGGCUGUGGUUAGGCCGUUUUCGGGCGCCCGAUGCUUUUUUCAAUCCAAAAAAACCUGCACCUUAUGUUGCGAGUCGGUCAUUAAUACGGCUAGAGUUGCUCCACACCGGUGCUGACAUGGCGAGGCAAGCUCGAGAUGAGGAUUCGAAGGCUAAUGCCGCAGCCUGCCUACAGUUAUCUUACAGCACUAACGUGGCUUUAAUCACGCAGGAUAUUAUGGCGCGCAAGCCGUCGGUUAAUUCUUUGGGCGAACCCACGGUGCAACCAUCCCCCGCCCACGGUCAUUUUAUGCAGGACAUUCGGCCCACUAUUCCCCUCUCCGAAAUGGGCUGUAUGGCACCCUGUCGUGUCCAGGGAUUCGUUGCACUGGAGUAUACGUGCUUGAGGUCUCCAAGGACGGUAUCUGUGUGGAGAUCUACCGACGGUCUAUCUGCCCUUGUAGCGAACCAUGGAUCCGAUUCCAAUGGCACCUCGCAGAUCACGCGCAACCAGUCUGCAGUAACCGCGUACUGUGCGGGUAUCAGGGUAGCUGACGCCGGGCUGCGAUGUGUGCCGACGAUGUUUUGGAGUAAAAGAAGAGUGGUUGAUAACGUAUCGGCAUUCGAGAAGACACUAGUUAUGCCGUCAGUAGGAACUGCAAGAAGCAGCAAUGCAUAUUUACUUAGACCUGCAACGUGGCCGGGGGUAUAUGUCAACACUCCUGAACGACCUGUUUUCACCACCAUAGACGACGCCGUCAAAGGGUACCUCACUUGGUUAGCGCAGAGCAAAGCUCGGAUACACCCCUUGGAGACUGUCCGUGUCCGCUUAAGAGAUGGGGCUAUAUGUCCCGAGAUCUCAGAGGAAUCGACCGUCCUGCUAAUGAGCAUUUGGGAACUGCUGGUUCACAUCUUAAAGACCGUCCAGCCUAUUUCCGUCGCGGAGUGCAUUUUACGGGAGGACUGCGGGAGUGGCAGCUCCUCGUUCAGGUUACGACUUUGUAGAACUGCUCGCUCUGUUCCGUUAACAAAGUUUAGUCGCGAUCAUUCCUUACGGAACUGUGUGGGUCUGCGAAUAUGUAAGGGACAACCUUACUUCACCCGAGAUAGACCUCGGGAGUCCUUUGGCGGCUCCAACGAUCAUGUAGUAUUGGAAAAUUAUAAGUUCAUACGUGUGCGGUGGUGUGGGCGUUCCAAUGCUCCUCCACGCCGAAGGGGCUUCGCCAAUUCCAAGACUAGGACUAACUGCGGCAGACAUCCUGCGGACAUAAGCAAUUGCUGGUAUCUACUUGCGUGCCCUUGCAAAAUCUGCCACCUCGCUAGGUUUUCUGUCGGUAAGCGACACGUCUAUUUUGUAAACGCUACGUGUGGACGGCCAACUGGUAACGAUGAAAUGUCUGUAGUGCCUCGAGGCACUCGGGAGUGCCGCAGGGCCGACACCGUAAUACAAAAAUUUCUGGCGCCGUGUUCGGAGAGGAGGAUGAUAGGGUGCGUCGAAUCGGCGGGCGCCGCGGGAUCUAUAAAGGUGACCCCCGAUAUCUCUGGGUUAAGGAUGUCCUGCCGGUUUUGGGUUGCGCAGGACUGGCCGCUCUGGAUCGCACCUUCACGGAUUGCUCGAUUUGCAAUCUUUGUGCUUGGACAAGUUGGCUAUUCUCGCUCGAUCGUGCAGCCUACGGAUUGCGUAUCGUCGGCUCCAGCGAUUCGUGUUUCGAGCGACAUGCCUAACACCCUCACGCUUUCUAAUCGAUCGCAAGUAAGUGAUAAACGCCGCUUAUAUACGUCUCAUUGGUGGUGGGGGUAUCUGCACUACUUGGCAUACUCGUUGGGCCCGAACGUUUUGUGUCACGUCCACCUGGGACCAGGCCUUUUAAAAUUCGAGCAGAUCUCUCGGCGAGCAAUUGUCUGUGCGAGAACAGCUGUCCACGCUAGUGCGAAUCUUGACCUUGACAAGUCUUUAUCUGCAACGCUGUAUUGCCCCCGGUGGCCAGCCCUCGGCGCCAGACAAUUAUACAGACAAAGACUCAGGGUUGCCUCUGCCGUGUGCCCGGAUAAUUUAUUUCUUUGGCCCUCAGUAUUAACCACAUACAUGUCUGAGAGAGAGCGGCCUAGUAUAUGCGCCCCCGCAGACGUUUGCCUCGGACGGCGAACACACACGGUCUCACGGGAGGCUCGCUCUUCGCCCACGAGGCCACGGCAGUCACUCGUCUCUCCGGACACAAAGCGAUUAGACAGUGUUGAGGGUAUUGAACAGUACCGUUCCACGUUACUAUUGUCGGGUACCGUGGUUCCGCUUUCACGAUACAAUGAGGCUUACUGGCUAGGAGGAUCUUCUACAAUAGAUUCGAAUAACCAUUUCGUGGUAGGGAAGCCCGCCAAUUUUCCAAUACCACAAGCAGCUAUAGCCGCCGGUGUGAACCCAUACGAUACGGCCGGGCCCCCCAUUUUGGACUUGGAUAUCAAACCCGAUCAGAUUGGUACACGGGCCGAUAGGAAGUCAAUGUUCCAGCUUUUCUACCGGCUUACUAUUAGUGGGUUGGCUCAAGGGCGCACGGACAUGGGCCGGAGUCCCAGGGCUGCUAGCUGUAUAUACCUUAGAUCUGUUAUCCCAUCGGUGUUCACUAGGCAGCACUGCUUGAAAAGUUAUUUGCGCGAGACCUCUAUAUAUACACACAGACCGCUAGUGAUGACACAUAGGACUUCCCUAGGGCCAAUCAAGAGCAGGACGUACCUUAAGACUCCGGUUGUCGUUCCUGGACUCAGCGCCUUCUGGUAUUGUCUUCGACUGCCCCUUCCCGAUUUCAGGGCGUGCGAGGAUCGCCGGCAGCUGAGUUACCAUAGGUGUGGGAACCAAAAACAUCCUACCUAUAGGCUGACGUUCCUAGGAUGGACAGAGCUGUGUCGAUCAAGCUUUCCUAUAGGAUUAGCUUGUUUCGCGCUAACGCCGGCCUGCAUCGCUGGCCAUUUCGGUGCAAAUACUGCAAGUAUGUGUGCGAGAUGUGCAUGCGCGGUUGCUAGCAUGCAUGAAGGGAUAAUUGCAGCCUGGUCUACGCCACGCUUGCAGGCAGCACUUAUACCCUUGAGGUGCGUGAAAAUGACCUAUAUUGGCUCGAACAUGGCGGGGUCAAGGUCCCAGAAAAGGAUCCAUUUAAAUUGCCGAAAAUCUCGCGUGGCAGCUUACCUUACCGCGGGAGCGAGACCUACGCCCCCUUGUCGUCCCAUAUCCUCCGCCUAUCGUGUGACAUACGGGCGUACUGAAAGAGGGCCAAUACUUACCCCGCUACAAUCGACCGUGUCUUCGGGCUCACCGAAUUCGAGGGUGAUCCUCCACUCAAUGGACACGACUUGUGGAAUAGACAUACGUGAACGCCAAAGGGCGUGCGGCUUCUCCGGUUUAGAUUUCGUGGCAUUCGACCGGGCCAGGCUCUACAGUGAAGUACCGCGGCCAAGUCCCAGAGAAAUCCCACUAAUCCUUCUUACUAAGAAGGCAGGCAACGCGAUACCGUUCUCGCAGUCGCGACGAUCUAAGUCCUUCAAUGACGUCCCGGUGUGUCAACGAUGCGAAUUCCAACCUAAGGCCACACAGCGCGCGUCAACUUCUAUCGAGAGUGCACCGUGGUGUCCCUGGUGGAGUGUGAUGUUAGCUUCAUUUCCACACACCCGGCGGUUAGCGUCUUCGGUUACUCGGCUUGUCGAAAACCGAGGUGCUGCAAUGAUAACGGUACCGGACGUCUUUGAGACGCCGCGGUCAGGCUGCUCGUACGCGAGCCGCGACCCUUUGCAGUCUGAAAUAAGACUAAGGCAUUUUCCGCCACUGUCACAAUUUAUCCCGGCCGCGCUCACUCGCUGCCGGGUAAGGAACAUUACUGGAAUGGGGAGCACACUCAGACUACCGAGCCUCUUAAACCUCAAUCUGACAUCCGCAAUUUUAAGUACGUCUAAGUGCCACGGGCUUUCACGUAGCCCUUCGCAAGUUCUGAGCCCAAACACAACCCAUAGAACUGUCCCAGGAUCAGUCCCAUGCCAAUUUACAAGAAAUGUUAUAGAGCAAGGGAUGGCGUCCCCGAAAGCGUUUAUGUACUCAGUUAACUCGCUCUCCAGUGUUGUUAUGACUUUGGCCUGGAGAGACAACAUGCUAGGACCUUAUAGCCGAGGCUCGACAGUCUAUGCUUGUUACAUACAUAGCGUGCGGCUACAUCAUGCUCGAUUUGGGCUGGCUGCGUCUCAUAGAGCCCCGCAACAAGGGGAUAUGUUUGCGAAUAAUCGACUUUGCCUUUACCAUGUUCUAUUUGGAAUCCAAAUCCCUCGUUGUAGUAGAACAGCCAUUCGGGGCCGUUCAAAGCGAGCCUAUAGUCCUGUCAGUGUUAUACUUUCGUUGGGUUUCGGUAUCAUCGUGCCUGACCAAGUUGGCCAAUGGGGUUCUGCACUGCACCCGACUCUCUACUCAACUUCGAUCGAGGGGCAAAGACAUGAACCGCACCCAUUUUUACUAAAAACCUCACCACGUCACGGGGAGAUAGCCGAGGAUUCCAGAUACGCCUCGAACAGACAAGCCCCACAAUGUGAUUUGAGCGGUGUAACUGGCCUGAGAGGCCUUUUACGGAAUUCAAAGCGACAAGCGCGGCUCUGUCUUUGGUCACCAAAGGCUGACCCUGGUGUUUCGUUACGGUGGCCAUAUUCGACGUCCUCGCACCGGAUCGUCCAACGUCUCCCACCCGUAUGUUCUAGUGUCCAUUCUGGGCUGAUCAGGGUAAAUUCUUCGUCAGGUCUCUAUGAAACCCUAAGGAGUCGAGUCCGCGUUUCAUGUGGCUCCUACGUGCUGACAGAUAGGACUCCAAUAACUAAUAUGGGCACUGAACGCCCGUCUAACCGCCUCACGCUUCCGGAACUGAAAUGUGGUAAGCAGCAGACGGCGAUGAGGCAUCGUAUUGUGUCCACUAACGGCAAGGAGGGACGGACUGGCGUUGUCGAGCUUAUAUUCAACAGGCUCAUAAGCAAAGUAGUGAGUCAGAAUCAUCUUACCAGAGUAGGGCGGCCAGCUCAUCUAGACGGGCCUGCGGGACCACGAGAUCGUGUUCUUGUCCUGUUCGUUCCAAUAAUCAUAGGUCGACGAGCCACCUUUCCCCGAGGGGCUAACUUUUACAUUGUGCACGUCCCAGGGUCACAAGGGUGGGCGGGCGACUUUCCAUUCGAUUAUAUGGUCGAUACCAAUAGCUCAUCGAGAAGCUCAGGUUGGGGAAAAGCCAUAACAAUCGUUCUCCUAGGGGAAACAGCGCGUAAACCGCGCGACAUUUCACACAACUCCGGGGCAGGACCUCCAGGAUGCCUCGUCCCCGCUACUUUUCUGGAACACGGUUUCCUCCGCCGCCUUACAUGGACGUGGCUUAGGUGGUGGGACUUUUCCAUUGGACAUUUCACGAAUGGACCUUCUAUCAACCCCUCCUCCAAACCCGUUCCACGACCAUGGGUCCCCCAUCGCCCGUAUUCUGAUCGCGCUGCUCCCUACGAUGGCAAGCAUUCUGCGAAGACAGGCCUAUGCCAUGCCUCCCCGAGCACUACCUUCCUAUUCAAUUCCCUCCCGUCUUCACAAGUUUCCGGUAAUAGGCAUGCUGUCAUAUUCGGCAGCAGCCAUCUGGACUGCAACGGGCCGAAGCCGUGUCAUGGUGGGCCACGGUCUUGGGACACCCUGAAGUGCGUACUGACGACUAAUCCGCUAAUAUCACGGGCUGAAAUGAACUUGCCCCCCCGGAAACACGGCUCAAGUCUUCGCAUGCCACAGUGUAGCCAGACGAUUCGGUUUACUCCGUUUACUGGCGCAAUUCAAACGAGAUCACGGAUUACCUGUGACGACACACCUCCGAGCGGCCACCCCAUGACGGGCAAGCUGUGUCCUGUGUGUUCGUCUAUCAACAGUUCCGCAGGCAACGGGAUCGCCAACUUGGGGGCCAGUUCGCCCGUCCGCAAUCCGGGAUACCGGGGCGAGUACUUUCUACUAAGCGUACUUCUCGUGAUGCUCGAGGAGAACACGUUGCUCUGCCGAAUUCGUCCUUGCUUCUGCAGCAUUGCAUCGUACCAUGGUAUACAACUAAUUAUGACGAGGCCUCUACGAAUUGGUGAACGCUCUCUGUAUAUCCAGGAGUUCAUCCAUCGAAGGUACGGGACGAUGAGAUUGAGGGCCUGGGACUUAGUUAUCCUAUCCAAGGCACGGUUCCCCCCGACACGAGGGCGCUCUGCAUGGACUAAGCGCUGUAACAAUACGUCUCGGGCUGGAGGUAGAUUGUCCCGGGCGUUAGUUUCCGGCCGGGAUGUCUUGCUGUACGGAGCUACGUCUACGCGGUUUUACAAGGGUAAUGGCCAUAAGUCACUGCAAGCCCGCCGCGGUCUGACUCAUGGUUGCUACAGCCAGCAGCGGUCUCCAAGAAGCUUGCCAUAUGAGUCCAGGCUUAACACAUUAUUUACCGUCCCGCUCGCUGCGACCCCCUAUGAAACCCAUUCACCUGCUCUGCAACUCGGUAGAAUCUCUCGGGUUACUCUGUUGUACGCACAUGUUUCCGGGCUUCUGGCGUACGCUAACACUUCAAAAUCAAUGGAAGUCUCAGCGCUGUGUGGCCAUUCACAUAAGUGGUCUAUCGAUUGCCAUGCGAUUUCAAACCUUAGUGUGUUUGUCUCACUUGCACUAGGGUCAUGGGAAAAAAGUGUACCAACGUGUUUCAGGACGAGGUUUAUCAUCAGAGGCCUGCAGGUCCAUCAAAACCGACCGUUCUGUUGCGAACUCUUAGGUACUGACACAGCGUCAUCGAUUAACCUGGUUACUUCGAUCUUGCUGAAAAUACGCCCCCUCUUACAUCGAAAGGACGGAAACCGAGAUAUCACCGCUUUCGGUUGCGUCUUCUCUCGCAGGACGAAGAAACUUUAUAUGGUGUCCUUAUUUACGUGUAGUCUCAGUGGAUGUGCGCUUCCGCACUUCUAUGGAGACCCGUACAACACUUUGAUUCGCAUGGCUCUGCCUAUGCUAACCUGGAAAUGCACCUGCGGUCGCAUUGUGGCAGAAAUACCGCUCGUGCCCGGAAGCGAGCAGCUGCACGGAGUACUGUUUACUGCAAUACAGAAAACGUUGUUCGGUGGCCGUAACGACAACGGCGUCCUUUCCACUUCCCGGCACGGUAAGCCCUUUAAUGUAGUAGGAAUUACUUGA')

'MLTSDLFKQTICLRGCSPSGRYLVELVAFFVSQLPESIWVTLICHSSRWLVQQKIPSAGVWPSRAVHRFLVRADTLRYPTRYLRRVARHAGEGKVSPSLGATSVDRTARRGRKQPPEIQLRVRRFVDVQILKTLLFVEYTHQNDRTSGKRVLNVTQGHGARLVVDTRREGFCKPNSLHDECPHVRSGLEQRRDCRRSYSKGASSGGKSVQVAAEQLIYLSGLVTTIYSALPCAHSRFQVLETCSHKLYSSDRRGFRSISGWCWRTCNQGLAGSQRSGPMHHSFHQGWAEFRNPWKNSSTCLRTFSGSIQTSLLAGPIILKVFECGPMIHGTKERVGTGSGQRSATIALRRQTTKNATVVATLVHDGSCHPPWDFSKVYTEIVILVVVSLCARYLRSRIPVRREGFACPKLNDTPNGACAPTEWPHHDQPRYPPVLSLEPLRSLCIYNIHLLAPHFIELHVIVDSHGVLVSGWNHCVLKAIAALLPTRYSQCRHGYDPASGSRYYATFEAVLPMITIRNMWSPSTQYRPEDSVHSSFQQEPRRSKYGHRGTGVRSATVDVTVPRLPNDYVHSRKLVHCNPSTFGAAMVKAVFWPQVVRLCHSIASCVVSPVLWERAGVAPEIRNAMHLRGTTCSKRALAIILRPLGYISDLGIFENAEVAVNVVPRQLPHSAWRCYGLWLGRFRAPDAFFNPKKPAPYVASRSLIRLELLHTGADMARQARDEDSKANAAACLQLSYSTNVALITQDIMARKPSVNSLGEPTVQPSPAHGHFMQDIRPTIPLSEMGCMAPCRVQGFVALEYTCLRSPRTVSVWRSTDGLSALVANHGSDSNGTSQITRNQSAVTAYCAGIRVADAGLRCVPTMFWSKRRVVDNVSAFEKTLVMPSVGTARSSNAYLLRPATWPGVYVNTPERPVFTTIDDAVKGYLTWLAQSKARIHPLETVRVRLRDGAICPEISEESTVLLMSIWELLVHILKTVQPISVAECILREDCGSGSSSFRL

# Locating Restriction Sites

In [25]:
def reverse_palandrome(dna):
    size = len(dna)
    compliments = {'A': 'T', 'T':'A', 'C':'G', 'G':'C'}
    for i in xrange(size-4+1):
        for k in xrange(4,13):
            if i+k > size:
                break
            kmer = dna[i:i+k]
            rev_comp = ''.join([compliments[x] for x in kmer][::-1])
            if kmer == rev_comp:
                print i+1, k

In [26]:
dna = 'TCAATGCATGCGGGTCTATATGCAT'
reverse_palandrome(dna)

4 6
5 4
6 6
7 4
17 4
18 4
20 6
21 4


In [29]:
dna = """GCGCGGCGAACCACCAAGGCCTTTATTGGGAGACTCGTCGAGCGGCGCTGTCAATATGAA
ACAAGTTCAACCGGCCTATAAGTCGGCAGAAGAGTCAGGCGGGTAAAGTGAAACTCCGGG
ATTAAATGCGGGCCGCGGCTACATTTTGTAATGTGTTAGAATTCAATAAACACCCGGCTC
GGCGACGTTGCCAAGGTGGCGCGTTAAGTGGCGAGTAAAAATCTGCTGACCATTAGCCGG
CTATTACTTACGTAGCTAGCATAAGAACGGTGTCAGGAATGACTCATTCGCCGGGACAGC
TGAAATGTACTCTGACCGGCTAGTAGCGCTCTTCCCGAATTTAGGCGTGCCCTCTCTTAG
TATAGTTACTGCTTAGGGATGTCTCCCCTGTAGGCCCTTTTGTTGGATTAAATACTCCCC
AAGGATAGCGGTAAAGCAGAGACTAGCCGGCGAAGAACCCAGGATAGCTTTTGTTCATAT
TTGTAACGTTACAAGTATACCTATAAAAGCAGATCTCGCCCTCTTTGAATAAGCTCTGCA
GCGCGCCAAGCCTAAATTAAGGAAGAAATCCCATTCAACAAGAGGTAGCCTTTATCCCCG
GATAGTGGTCCATAATGGCTCGTACTCATAGAGGGTACGCCACTAACCGTAATGCACAGC
CTCGATGATCGTATCTAATCATCCATTACCGAGGCTGACGAGCGCCGGTCTCGTGCAAGG
CAAGAACGAGCTTTGGGTCACCAGGGTAGCCGGCTGTCGTAGTTTATTGAATGATCTGTG
CTCAACTACAGGCCGGGTTCCTACGTCACTTAGGTAAGGGGGCCCCGAAGTTTCCGACCG
GGAGGGTATCCCTGAATGGGATGCCCTTGTCCAACAATTCCCTAGAGGAGTACTGTGCGG
TCACTCACTACGCCAAGGCCTTACTCTGTGGCCAGGCGGCAGACACCTCCGTTGGGCTTT
TACGCAAGACCGCG""".replace('\n', '')
reverse_palandrome(dna)

1 4
2 4
16 8
17 6
18 4
38 4
45 4
54 4
71 4
73 4
77 4
116 4
122 4
131 4
132 8
133 6
134 4
159 6
160 4
174 4
185 4
199 4
200 4
204 4
234 10
235 8
236 6
237 4
249 6
250 4
253 6
254 4
255 6
256 4
291 4
297 6
298 4
307 4
316 4
320 4
325 6
326 4
338 4
361 4
393 4
408 4
443 4
446 6
447 4
466 4
477 4
482 12
483 10
484 8
485 6
486 4
495 6
496 4
502 4
511 6
512 4
532 4
536 6
537 4
541 4
541 6
542 4
543 4
555 4
557 4
598 4
622 4
635 4
653 4
662 4
667 4
702 4
705 4
714 4
729 4
748 8
749 6
750 4
773 4
791 4
793 4
803 4
819 8
820 6
821 4
838 4
876 4
882 4
889 6
890 4
915 8
916 6
917 4
929 6
930 4
971 4


# Rabbits and Recurrence Relations

In [54]:
def Fib(months, litter):
    a,b = 1,1
    yield a
    yield b
    for i in xrange(months-2):
        a, b = b, litter*a + b
        yield b

In [56]:
for a in Fib(36, 3):
    print a

1
1
4
7
19
40
97
217
508
1159
2683
6160
14209
32689
75316
173383
399331
919480
2117473
4875913
11228332
25856071
59541067
137109280
315732481
727060321
1674257764
3855438727
8878212019
20444528200
47079164257
108412748857
249650241628
574888488199
1323839213083
3048504677680


# Introduction to Mendelian Inheritance

In [61]:
def dom_prob(k, m, n):
    s = sum((k,m,n))
    prob = 0
    prob += (k/s)*((k-1)/(s-1))
    prob += (k/s)*(m/(s-1))
    prob += (k/s)*(n/(s-1))
    
    prob += (m/s)*(k/(s-1))
    prob += (m/s)*((m-1)/(s-1))*.75
    prob += (m/s)*(n/(s-1))*.5
    
    prob += (n/s)*(k/(s-1))
    prob += (n/s)*(m/(s-1))*.5
    return prob

In [62]:
dom_prob(2., 2., 2.)

0.7833333333333333

In [66]:
dom_prob(30., 29., 25.)

0.780837636259323

# Enumerating Gene Orders

In [74]:
from itertools import permutations

In [101]:
def all_perms(num):
    perms = list(permutations(range(1,1+num)))
    base = '{} '*num
    print len(perms)
    for p in perms:
        print base.format(*p)

In [102]:
all_perms(3)

6
1 2 3 
1 3 2 
2 1 3 
2 3 1 
3 1 2 
3 2 1 


The permutations2 function below was taken from 

http://stackoverflow.com/questions/104420/how-to-generate-all-permutations-of-a-list-in-python

**I still need to digest how it works exactly**

In [114]:
def permutations2(elements):
    if len(elements) <=1:
        yield elements
    else:
        for perm in permutations2(elements[1:]):
            for i in range(len(elements)):
                # nb elements[0:1] works in both string and list contexts
                yield perm[:i] + elements[0:1] + perm[i:]

In [115]:
def all_perms2(num):
    perms = list(permutations2(range(1,1+num)))
    base = '{} '*num
    print len(perms)
    for p in perms:
        print base.format(*p)

In [116]:
all_perms2(3)

6
1 2 3 
2 1 3 
2 3 1 
1 3 2 
3 1 2 
3 2 1 


In [118]:
all_perms(2)

2
1 2 
2 1 


# Finding a Shared Motif

In [41]:
from tqdm import trange

In [44]:
def in_all_seqs(kmer, seqs):
    for s in seqs:
        if kmer not in s:
            return False
    return True

def shared_motif(infasta):
    with open(infasta) as infasta:
        seqs = [l.strip() for l in infasta.readlines()][1::2]
    seqs.sort(key=len)
    min_seq = seqs.pop(0)
    print min_seq
    motif = ''
    for k in trange(1, len(min_seq)+1):
        for char in xrange(len(min_seq)-k+1):
            kmer = min_seq[char:char+k]
            motif_flag = in_all_seqs(kmer, seqs)
            if motif_flag:
                motif = kmer
                break
    return motif

In [16]:
shared_motif('/home/jessime/Downloads/rosalind_motif.txt')

['GATTACA', 'TAGACCA', 'ATACA']


'TA'

In [19]:
cd /home/jessime/Code/kmers/

/home/jessime/Code/kmers


In [20]:
import fasta

In [46]:
infasta = '/home/jessime/Downloads/rosalind_lcsm2.txt'
outfasta = '/home/jessime/Downloads/rosalind_lcsm_clean2.txt'

In [47]:
cleaner = fasta.Cleaner(infasta, outfasta)

In [48]:
cleaner.data = cleaner.seq_per_line()

In [49]:
cleaner.save()

In [50]:
shared_motif(outfasta)

AGTCCCAAGAACCCCAGTCCGATATCCAGGGAGGAAAGAGTCCCATCCCTCGAGGGCGGATAAGTCATGACCCCATGGATAGATACTGAGGTTGGCCTTAGTCAACAGTACGCACAGATCGCTCCAACGCTGCACTGTCAGACTGGAGGTTATCTCTAAAAATCCGCACTGACTGACTAATAATATGTGTTTAGACAGCTGCTCCTCGTGACCTACTCTCGAAACTGGGACACGGCAGGCGCTGATAGCTCGTAGAATATCAGACCGTATCGCCGCCTCTCCACCAATTATTCGTTCCATGAATTTACTCCTGAAGTACACAGTGTTCGCAAGTGACTTAACAGCCGGAGGTCATCGCTTAGACGGGAACGGCCTACGATACTGTAAGCCACTATACTTCACCCAGAGGGCGTGCGAACTAAACCCACTCGCGGGGAAGAGTTTGGCTTTCGGGGTTCCATCTTCCCGCGATTTGTACACCACTGGGAAGGTTGGAATATATGACCGATACACAATTTCCCGAATTGGGCGTGACGTCTCATAAAACGCATTATAGCCTTACCGAATGTGCTGCAAAGAGTTGGTGGGTCCTGATGACGGAGTCGGTTCTCACTATCACAAGCATCTGTAATAGTGCACCGACCGGCAAAGTTGGCAACTGGTGCCACAACCCTGGATCCAATCCCTGGATAGTATAGACGGCCTCAGACGAGAGCATGAGCGTACTGACTACCGAAGAGAAGCTCTCACGTCCTGGTGTACTAGCTCGGCGCGATTCATCCCGAAGTCCTCTATCCATGATGGCAGGCCCAGAAGTGAGTGCTCGAGAACCCACTATCAGGTGGTTGCGATGATTCATCCTCGGAATCCCGAACGCAGGATTAATCCTAAATTGTCCGACACTTGAAGCGATGTGGGATTATACAGAAAAGGGTCGACGGCAAGCGCCAATGCCAATAAGGCCCTCCTTCTGTTAGGTAGTACCCGTTCCCTGCATAAC

100%|██████████| 1000/1000 [00:01<00:00, 764.81it/s]


'TGAAGTACACAGTGTTCGCAAGTGACTTAACAGCCGGAGGTCATCGCTTAGACGGGAACGGCCTACGATACTGTAAGCCACTATACTTCACCCAGAGGGCGTGCGAACTAAACCCACTCGCGGGGAAGAGTTTGGCTTTCGGGGTTCCATCTTCCCG'

# Calculating Protein Mass

In [1]:
aa_mass = {'A':71.03711,
           'C':103.00919, 
           'D':115.02694,
           'E':129.04259,
           'F':147.06841,
           'G':57.02146,
           'H':137.05891,
           'I':113.08406,
           'K':128.09496,
           'L':113.08406,
           'M':131.04049,
           'N':114.04293,
           'P':97.05276,
           'Q':128.05858,
           'R':156.10111,
           'S':87.03203,
           'T':101.04768,
           'V':99.06841,
           'W':186.07931,
           'Y':163.06333}

In [2]:
def protein_mass(protein):
    mass = sum([aa_mass[aa] for aa in protein])
    return mass

In [3]:
protein_mass('SKADYEK')

821.3919199999999

In [4]:
protein_mass('CHTFHACWYLWQLFNWSWSTMNKFDQNELHPWWESIAIQMNWKGGKSIGSDPQSIYYDVEHPVTMYPNSVGHPAETIGKDSQCYTMHRNFQKSLHHMTAFERYHFCCFVFNDPYDSATDGYTPNKRRSKQHSCFDEVYFWHPNGGSNTPEAVYQKARMTKVRALTRISQLVMLIHITIGFWNERIESTMSAQTGMHPFTPYVFTDNMMRRCEVVVVQPNKAILCDIHKNHMPMASSKQYEASWSANSYLPTSHLHVLYRHGCIAPKNVQNPGDCTMKRHQMHKDYMQWWEGVVPDDCVFNFIMYDGYTNNWLCPVLQLHLWAGVGLNEFSWRSMEPEPEVFLKNSICEWYHHRWTGPYRTQITGWWLCPHTRGKCRLKNPMSNEFWGYYTSMQHYEKRSSNMKIEFSPDAHDKIPRNQHQIQINPCGATYRYDGYNFIGFQDHCHCHQRTKFLGQAPIQRTMGAQWYRPWIQTTGGTWPEQFRQWDMRFGADTLIKKVCGEPRCWIPVPQLCKQKKSVSRDHKDMSGCWSKEDFEHNFTSTGDRRGIGYITLTQNYRYRQSFVHSNHEKQWGFEFRFPHENGCRMSYLQWVQAPGDNKQWCPINWPGTLTSGYTPAEAQHQVWENCGKPLPWEDHSFPTRQGICLMSQWWGASMRLSRGDMMQPMNCKTPWNNNQECMTIPVEPINTHEMCFWWKVNYLQYSWMLVPSAGAFMLIRMREHPNNDPNEPNHENAWDKEEVSEHTSAYHYMGKSMTSCQIHNVRMHAVWGIIYDAGNNMHRELCRNMEFWDFWSVCQLSNVRFMRDPTVWMTIRVNNQGPDLANIYLSYEKAQIDFSDWRWGRCVFRTPIGN')

100797.11556000044

# Consensus and Profile

In [63]:
import pandas as pd
import numpy as np

In [77]:
cd /home/jessime/Code/kmers/

/home/jessime/Code/kmers


In [78]:
import fasta

In [79]:
def clean(infasta, outfasta):
    cleaner = fasta.Cleaner(infasta, outfasta)
    cleaner.data = cleaner.seq_per_line()
    cleaner.save()

In [7]:
def get_seqs(infasta):
    with open(infasta) as infasta:
        seqs = [line.strip() for line in infasta if line[0] != '>']
    return seqs

In [73]:
def profile(infasta, outfile):
    seqs = get_seqs(infasta)
    seqs = [list(s) for s in seqs]
    
    seqsDF = pd.DataFrame(seqs)
    
    profileDF = pd.DataFrame(index = ['A', 'C', 'G', 'T'])
    for i, col in seqsDF.iteritems():
        profileDF[i] = col.value_counts()

    profileDF = profileDF.fillna(0)
    concensus = ''.join(profileDF.idxmax().values)
    
    with open(outfile, 'w') as outfile:
        outfile.write(concensus+'\n')
        for i, row in profileDF.iterrows():
            data = ' '.join([str(int(i)) for i in row.values])
            outfile.write('{}: {}'.format(row.name, data)+'\n')

In [72]:
profile('/home/jessime/Downloads/consensus.txt')

ATGCAACT
A: 5 1 0 0 5 5 0 0
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0
T: 1 5 0 0 0 1 1 6


In [80]:
clean('/home/jessime/Downloads/rosalind_cons.txt', '/home/jessime/Downloads/rosalind_cons2.txt')
profile('/home/jessime/Downloads/rosalind_cons2.txt', '/home/jessime/Downloads/rosalind_cons_jk.txt')

# Overlap Graphs

In [25]:
def overlap_graph(infasta, outfile):
    data = fasta.Extracter(infasta=infasta).data
    
    with open(outfile, 'w') as outfile:
        for name, seq in data:
            for name2, seq2 in data:
                if name == name2:
                    continue
                if seq[-3:] == seq2[:3]:
                    outfile.write('{} {}\n'.format(name[1:], name2[1:]))

In [26]:
clean('/home/jessime/Downloads/rosalind_og.txt', '/home/jessime/Downloads/rosalind_og2.txt')
overlap_graph('/home/jessime/Downloads/rosalind_og2.txt', '/home/jessime/Downloads/rosalind_og_jk.txt')

In [27]:
clean('/home/jessime/Downloads/rosalind_grph.txt', '/home/jessime/Downloads/rosalind_grph2.txt')
overlap_graph('/home/jessime/Downloads/rosalind_grph2.txt', '/home/jessime/Downloads/rosalind_grph_jk.txt')

# Enumerating k-mers Lexicographically

In [7]:
import itertools

In [9]:
def kmer_file(vocab, k, outfile):
    vocab = vocab.replace(' ', '')
    kmers = ["".join(i) for i in list(itertools.product(vocab, repeat=k))]
    
    with open(outfile, 'w') as outfile:
        for i in kmers:
            outfile.write(i+'\n') 

In [10]:
kmer_file('T A G C', 2, '/home/jessime/Downloads/rosalind_kmer.txt')

In [12]:
kmer_file('E D Y Q M J S R N', 3, '/home/jessime/Downloads/rosalind_kmer2.txt')

# RNA Splicing

In [22]:
def splice_translate(infasta):
    seqs = fasta.Extracter(infasta=infasta).seqs
    primary = seqs.pop(0)
    for seq in seqs:
        primary = primary.replace(seq, '')
    primary = primary.replace('T', 'U')
    protein = translate(primary)
    print protein
        

In [10]:
clean('/home/jessime/Downloads/rosalind_splice.txt', '/home/jessime/Downloads/rosalind_splice2.txt')

In [23]:
splice_translate('/home/jessime/Downloads/rosalind_splice2.txt')

MVYIADKQHVASREAYGHMFKVCA


In [24]:
clean('/home/jessime/Downloads/rosalind_splc.txt', '/home/jessime/Downloads/rosalind_splc2.txt')
splice_translate('/home/jessime/Downloads/rosalind_splc2.txt')

MTVDPIRRCSEGCASWQLMVIHNKGNLPQGSNSSPPAIQLRGAQSWDTDPTTETLSSGDQVFRLIQLHKQATFAQSRVGGCMTQPWPFTIHTHACGLIRPQGGPTNGFPQVTPRNQTARNQPRFTSLRLIYRNSKYVVVGITTILRFSSASIRLISCERYSTLGLMLHPGPRPKHEHAADREPCKAPLLQNGP
