# Week 3 - Implementing the tasks from week 1

For this tutorial, we revisit the activities performed by hand in week 1 and write code to perform each of them. The activities are split into two parts: implementing methods from scratch, and using existing implementations and data structures to store and process sequences.

## Task 1 - Computing the reverse complement

Here, we will write a script to determine the complement of a given sequence, base by base; reversing the complemented sequence then gives the reverse complement. We begin by creating a dictionary of mappings.

In [2]:
complement_dict = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}
complement_dict['C']

'G'

In [6]:
dna_seq = 'GATCTTCGGGTCTAGTTCAGGTTAACC'
complement_dict = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}
complement_seq = ''

for base in dna_seq:
    complement_seq += complement_dict[base]
    
print(complement_seq)

CTAGAAGCCCAGATCAAGTCCAATTGG


In [None]:
# Note: we do not modify the original DNA sequence. This allows it to be reused in other places.
dna_seq

The above script can be written as a function, thereby making it reusable.

In [53]:
def complement(seq):
    """
    Compute the complement of a given DNA sequence (the result is not reversed)
    """
    
    complement_dict = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}
    complement_seq = ''

    for nt in seq:
        complement_seq += complement_dict[nt]
    return complement_seq

In [5]:
print(complement('AAAAA')) # should give 'TTTTT'
print(complement(dna_seq))

TTTTT
CTAGAAGCCCAGATCAAGTCCAATTGG
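
The function above complements each base but keeps the original order. As a minimal sketch of the full reverse complement, the version below reverses the sequence and then swaps each base using `str.maketrans` and `str.translate` from the standard library. The name `reverse_complement` is an assumption made for illustration and is not part of the week 1 material.

```python
# Sketch only: reverse_complement is a hypothetical helper name
COMPLEMENT_TABLE = str.maketrans('ATCG', 'TAGC')

def reverse_complement(seq):
    """Return the complement of seq, read in the opposite direction."""
    # seq[::-1] reverses the string; translate() swaps each base for its partner
    return seq[::-1].translate(COMPLEMENT_TABLE)

print(reverse_complement('ATGC'))   # GCAT
print(reverse_complement('AAAAC'))  # GTTTT
```

Applying the function twice returns the original sequence, which is a quick way to check an implementation.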


## Task 2 - Transcribing a DNA sequence

Here, we transcribe a DNA sequence into an RNA sequence, treating the input as the coding strand so that each T is replaced by U. Write a function to transcribe a given DNA sequence.

In [56]:
def transcribe(dna):
    """
    Compute the transcript resulting from a DNA sequence
    """
    # the input is read as the coding strand, so only T changes (to U)
    transcription_dict = {'A': 'A', 'T': 'U', 'C': 'C', 'G': 'G'}
    rna = ''

    for base in dna:
        rna += transcription_dict[base]
    return rna

In [9]:
print(transcribe('ATAT')) # should give 'AUAU'
print(transcribe('ATGCCCCAACTAAATACTACCGTATGGCCCACCATAATTACC'))

AUAU
AUGCCCCAACUAAAUACUACCGUAUGGCCCACCAUAAUUACC
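
Because only one base changes during this kind of transcription, the dictionary lookup can also be expressed as a single string replacement. The sketch below shows that alternative; the name `transcribe_replace` is hypothetical and chosen only to avoid clashing with the function above.

```python
def transcribe_replace(dna):
    """Transcribe a coding-strand DNA sequence by replacing every T with U."""
    return dna.replace('T', 'U')

print(transcribe_replace('ATAT'))  # AUAU
```

Unlike the dictionary version, this one silently passes through any character that is not `T`, so it will not raise an error on unexpected input.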


## Task 3 - Translate a DNA sequence

As with task 1, we will need a dictionary, this time to map codons to their respective amino acids. We first form the dictionary using the information provided in lab 1.

In [10]:
# Note: * represents the stop codon and M the start codon
base1 = 'TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG'
base2 = 'TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG'
base3 = 'TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG'
aa =    'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
# the i-th characters of base1, base2 and base3 form a codon; aa[i] is its amino acid
codon_map = {}
for i in range(len(aa)):
    codon_map[base1[i] + base2[i] + base3[i]] = aa[i]
codon_map

{'TTT': 'F',
 'TTC': 'F',
 'TTA': 'L',
 'TTG': 'L',
 'TCT': 'S',
 'TCC': 'S',
 'TCA': 'S',
 'TCG': 'S',
 'TAT': 'Y',
 'TAC': 'Y',
 'TAA': '*',
 'TAG': '*',
 'TGT': 'C',
 'TGC': 'C',
 'TGA': '*',
 'TGG': 'W',
 'CTT': 'L',
 'CTC': 'L',
 'CTA': 'L',
 'CTG': 'L',
 'CCT': 'P',
 'CCC': 'P',
 'CCA': 'P',
 'CCG': 'P',
 'CAT': 'H',
 'CAC': 'H',
 'CAA': 'Q',
 'CAG': 'Q',
 'CGT': 'R',
 'CGC': 'R',
 'CGA': 'R',
 'CGG': 'R',
 'ATT': 'I',
 'ATC': 'I',
 'ATA': 'I',
 'ATG': 'M',
 'ACT': 'T',
 'ACC': 'T',
 'ACA': 'T',
 'ACG': 'T',
 'AAT': 'N',
 'AAC': 'N',
 'AAA': 'K',
 'AAG': 'K',
 'AGT': 'S',
 'AGC': 'S',
 'AGA': 'R',
 'AGG': 'R',
 'GTT': 'V',
 'GTC': 'V',
 'GTA': 'V',
 'GTG': 'V',
 'GCT': 'A',
 'GCC': 'A',
 'GCA': 'A',
 'GCG': 'A',
 'GAT': 'D',
 'GAC': 'D',
 'GAA': 'E',
 'GAG': 'E',
 'GGT': 'G',
 'GGC': 'G',
 'GGA': 'G',
 'GGG': 'G'}
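
The three `base` strings above list every combination of the bases T, C, A and G, with the third base changing fastest. As a hedged alternative, the sketch below generates the same 64 codons with `itertools.product` instead of typing the strings by hand. The name `codon_map_alt` is an assumption used only for this illustration.

```python
from itertools import product

aa = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'

# product('TCAG', repeat=3) yields TTT, TTC, TTA, TTG, TCT, ...
# with the third base changing fastest, matching the order of aa
codons = [''.join(bases) for bases in product('TCAG', repeat=3)]
codon_map_alt = dict(zip(codons, aa))

print(len(codon_map_alt))    # 64
print(codon_map_alt['ATG'])  # M
print(codon_map_alt['TGA'])  # *
```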

Now use the dictionary above to get the amino acid sequence for the first reading frame (no offset on the sequence). You can use the `dict.get` method to return a default value if a key does not exist in the dictionary.
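
To illustrate how `dict.get` handles missing keys, the short sketch below uses a deliberately tiny dictionary. The name `small_map` and its two entries are made up for the example.

```python
# small_map is a hypothetical two-entry dictionary used only for this demo
small_map = {'ATG': 'M', 'TAA': '*'}

print(small_map.get('ATG', 'X'))  # M    (key exists)
print(small_map.get('AT', 'X'))   # X    (key missing, default returned)
print(small_map.get('AT'))        # None (key missing, no default given)
```

By contrast, `small_map['AT']` would raise a `KeyError`, which is why `get` is convenient for incomplete codons at the end of a sequence.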

In [44]:
def translate(dna, codon_dict):
    """
    Translate a DNA sequence from the first reading frame, given a codon mapping dictionary
    Codons are keys and amino acids are values in this dictionary
    """
    protein = ''
    for start in range(0, len(dna), 3):
        codon = dna[start:start + 3]
        # dict.get returns 'X' for an incomplete codon at the end of the
        # sequence, and for any codon that is not in the dictionary
        protein += codon_dict.get(codon, 'X')

    return protein

In [45]:
dna_seq = 'ACTATTAAACCCATATAACCTCCCCCAAAATTCAGAATAATAAC'
print(translate('ATGATGA', codon_map)) # should give MM or MMX where X represents an incomplete codon
print(translate(dna_seq, codon_map))

MMX
TIKPI*PPPKFRIIX


Now write a function that uses the above function to get the amino acid sequences of all 6 reading frames. Note: three of the reading frames come from the reverse complement strand. You may use `dna_seq[::-1]` to reverse a sequence. This is a shorter way to write `dna_seq[44::-1]`, which means: start at position 44, move backwards (step -1) and go all the way to the start of the sequence (position 0 inclusive). The sequence has 44 bases, so its last index is 43; a slice start beyond the end is clamped to the last position.

In [None]:
dna_seq[::-1]

In [None]:
dna_seq[44::-1]
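
The slicing behaviour described above is easier to see on a short string. The sketch below uses a four-base example, `seq`, chosen only for illustration.

```python
seq = 'ACGT'  # indices 0 to 3

print(seq[::-1])   # TGCA  (whole sequence, reversed)
print(seq[3::-1])  # TGCA  (start at the last index, step backwards)
print(seq[4::-1])  # TGCA  (start past the end is clamped to the last index)
print(seq[1::-1])  # CA    (start at index 1, step backwards to index 0)
```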

In [91]:
def six_rfs(dna, codon_dict):
    """
    Get the amino acid sequence from all six reading frames of a sequence.
    This function uses the translate function implemented earlier
    and the complement function from Task 1
    Return the result as a list of size 6
    """
    # the reverse complement is the complement of the reversed sequence
    rev_comp = complement(dna[::-1])

    frames = []
    for strand in (dna, rev_comp):
        # offsets 0, 1 and 2 give the three reading frames of each strand
        for offset in range(3):
            frames.append(translate(strand[offset:], codon_dict))
    return frames

In [92]:
# should give: (with or without the X)
# 'TIKPI*PPPKFRIIX'
# 'LLNPYNLPQNSE**X'
# 'Y*THITSPKIQNNN'
# 'VIILNFGGGYMGLIX'
# 'LLF*ILGEVIWV**X'
# 'YYSEFWGRLYGFNS'

dna_seq = 'ACTATTAAACCCATATAACCTCCCCCAAAATTCAGAATAATAAC'

# codon_map is the dictionary built at the start of Task 3
for frame in six_rfs(dna_seq, codon_map):
    print(frame)

TIKPI*PPPKFRIIX
LLNPYNLPQNSE**X
Y*THITSPKIQNNN
VIILNFGGGYMGLIX
LLF*ILGEVIWV**X
YYSEFWGRLYGFNS


## Task 4 - Using the scikit-bio library to manage sequences

All of the above tasks can be performed using functions in the scikit-bio library. It provides functions to read and parse some popular file formats, and functions to store and modify sequences.

To install it on your personal computer, use the command:
`conda install -c conda-forge scikit-bio`

You do not need to install this on the lab servers or if you downloaded Anaconda.

In [1]:
#import the library
import skbio

scikit-bio, like many Python libraries, uses an object-oriented programming paradigm. As an example, a DNA string is treated as an object. All objects have properties and behaviours. Properties could be metadata such as the sequence ID of a DNA sequence or its quality. Behaviours could be transcribing or translating the sequence. Properties and behaviours are referred to as *attributes* and *methods* in Python.
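
To make the distinction between attributes and methods concrete, the sketch below defines a toy class. `SimpleDNA` and its members are hypothetical names invented for this example; the class is a teaching aid and says nothing about how scikit-bio is implemented internally.

```python
class SimpleDNA:
    """Toy sequence class used to illustrate attributes and methods."""

    def __init__(self, sequence, seq_id=''):
        # attributes: data stored on the object
        self.sequence = sequence
        self.seq_id = seq_id

    def transcribe(self):
        # method: behaviour that works on the object's own data
        return self.sequence.replace('T', 'U')


toy = SimpleDNA('ATAT', seq_id='example_1')
print(toy.seq_id)        # attribute access, no parentheses: example_1
print(toy.transcribe())  # method call, with parentheses: AUAU
```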

In [2]:

dna_seq = skbio.sequence.DNA('ACTATTAAACCCATATAACCTCCCCCAAAATTCAGAATAATAAC')
dna_seq

DNA
--------------------------------------------------
Stats:
    length: 44
    has gaps: False
    has degenerates: False
    has definites: True
    GC-content: 31.82%
--------------------------------------------------
0 ACTATTAAAC CCATATAACC TCCCCCAAAA TTCAGAATAA TAAC
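
The GC-content reported in the stats above can be reproduced by hand. The sketch below assumes GC-content is the number of G and C bases divided by the sequence length, which agrees with the reported value for this sequence; `gc_content` is a hypothetical helper and not a scikit-bio function.

```python
def gc_content(seq):
    """Return the percentage of bases in seq that are G or C."""
    gc_count = seq.count('G') + seq.count('C')
    return 100 * gc_count / len(seq)

seq = 'ACTATTAAACCCATATAACCTCCCCCAAAATTCAGAATAATAAC'
print(len(seq))                   # 44
print(round(gc_content(seq), 2))  # 31.82
```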

In [3]:
# the alphabet used to encode a DNA sequence is an attribute of the DNA object from skbio
dna_seq.alphabet

{'-',
 '.',
 'A',
 'B',
 'C',
 'D',
 'G',
 'H',
 'K',
 'M',
 'N',
 'R',
 'S',
 'T',
 'V',
 'W',
 'Y'}

## Task 5 - Behaviours/Methods of a DNA object

We now load the sequence of the dnaA gene from a FASTA file in the data folder using the `skbio.io.read` function. Type `?skbio.io.read` in a code cell to open the help page of this function.

In [4]:
dnaA = skbio.io.read('data/dnaA.fa', format='fasta', into=skbio.sequence.DNA)
dnaA

DNA
-----------------------------------------------------------------------
Metadata:
    'description': 'dnaA Chromosomal replication initiator protein DnaA
                    1148048:1149370 reverse'
    'id': 'MON-234_01132'
Stats:
    length: 1323
    has gaps: False
    has degenerates: False
    has definites: True
    GC-content: 32.05%
-----------------------------------------------------------------------
0    ATGAATCCAA GCCAAATACT TGAAAATTTA AAAAAAGAAC TCAATGAAAA CGAATACGAA
60   AATTATATCG CTATCTTAAA ATTTAACGAA AAACAAAGTA AAGCAGATCT TCTAGTCTTT
...
1260 GAACAAATAA AGACAAAAAT CGAAGAATTA AAAAACAAAA TTCTTACAAA AAGTCAAAGT
1320 TAA
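
For readers unfamiliar with the file format: a FASTA record is a header line starting with `>` followed by one or more lines of sequence. The sketch below is a simplified pure-Python reader that shows the idea. `read_fasta` and the two toy records are assumptions made for illustration; this is not how scikit-bio parses files, and it does not split the header into an ID and a description.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            # a new header closes the previous record, if there was one
            if header is not None:
                yield header, ''.join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    # emit the final record
    if header is not None:
        yield header, ''.join(chunks)

# made-up records, used only for this demonstration
example = io.StringIO('>seq1 toy record\nATGAAT\nCCAAGC\n>seq2\nTTAA\n')
records = list(read_fasta(example))
print(records)
```

Sequence lines belonging to one record are joined, so a sequence wrapped over several lines is returned as a single string.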

The above DNA object holds attributes such as a description and an ID. We can get the complement of this sequence, transcribe it and translate it using methods of the scikit-bio `DNA` class. For more information on all the functions and classes (DNA, RNA, etc.) the library provides, read the documentation page http://scikit-bio.org/docs/0.5.1/index.html.

In [5]:
dnaA.complement()

DNA
-----------------------------------------------------------------------
Metadata:
    'description': 'dnaA Chromosomal replication initiator protein DnaA
                    1148048:1149370 reverse'
    'id': 'MON-234_01132'
Stats:
    length: 1323
    has gaps: False
    has degenerates: False
    has definites: True
    GC-content: 32.05%
-----------------------------------------------------------------------
0    TACTTAGGTT CGGTTTATGA ACTTTTAAAT TTTTTTCTTG AGTTACTTTT GCTTATGCTT
60   TTAATATAGC GATAGAATTT TAAATTGCTT TTTGTTTCAT TTCGTCTAGA AGATCAGAAA
...
1260 CTTGTTTATT TCTGTTTTTA GCTTCTTAAT TTTTTGTTTT AAGAATGTTT TTCAGTTTCA
1320 ATT

In [6]:
dnaA.transcribe()

RNA
-----------------------------------------------------------------------
Metadata:
    'description': 'dnaA Chromosomal replication initiator protein DnaA
                    1148048:1149370 reverse'
    'id': 'MON-234_01132'
Stats:
    length: 1323
    has gaps: False
    has degenerates: False
    has definites: True
    GC-content: 32.05%
-----------------------------------------------------------------------
0    AUGAAUCCAA GCCAAAUACU UGAAAAUUUA AAAAAAGAAC UCAAUGAAAA CGAAUACGAA
60   AAUUAUAUCG CUAUCUUAAA AUUUAACGAA AAACAAAGUA AAGCAGAUCU UCUAGUCUUU
...
1260 GAACAAAUAA AGACAAAAAU CGAAGAAUUA AAAAACAAAA UUCUUACAAA AAGUCAAAGU
1320 UAA

In [7]:
dnaA.translate()

Protein
-----------------------------------------------------------------------
Metadata:
    'description': 'dnaA Chromosomal replication initiator protein DnaA
                    1148048:1149370 reverse'
    'id': 'MON-234_01132'
Stats:
    length: 441
    has gaps: False
    has degenerates: False
    has definites: True
    has stops: True
-----------------------------------------------------------------------
0   MNPSQILENL KKELNENEYE NYIAILKFNE KQSKADLLVF NAPNELLAKF IQTKYGKKIS
60  HFYEVQSGNK ASVLIQAQSQ KTTSKSTKID IAHIKAQSTI LNPSFTFESF VVGDSNKYAY
...
360 SDVKSSKKTQ NVVTARRIAI YLARELTSLT FSQLANFFVM KDHTAISHSV KKIKELMEDD
420 EQIKTKIEEL KNKILTKSQS *

In [8]:
list(dnaA.translate_six_frames())

[Protein
 -----------------------------------------------------------------------
 Metadata:
     'description': 'dnaA Chromosomal replication initiator protein DnaA
                     1148048:1149370 reverse'
     'id': 'MON-234_01132'
 Stats:
     length: 441
     has gaps: False
     has degenerates: False
     has definites: True
     has stops: True
 -----------------------------------------------------------------------
 0   MNPSQILENL KKELNENEYE NYIAILKFNE KQSKADLLVF NAPNELLAKF IQTKYGKKIS
 60  HFYEVQSGNK ASVLIQAQSQ KTTSKSTKID IAHIKAQSTI LNPSFTFESF VVGDSNKYAY
 ...
 360 SDVKSSKKTQ NVVTARRIAI YLARELTSLT FSQLANFFVM KDHTAISHSV KKIKELMEDD
 420 EQIKTKIEEL KNKILTKSQS *, Protein
 -----------------------------------------------------------------------
 Metadata:
     'description': 'dnaA Chromosomal replication initiator protein DnaA
                     1148048:1149370 reverse'
     'id': 'MON-234_01132'
 Stats:
     length: 440
     has gaps: False
     has degenerates: False
     has