# MODEL ANSWERS!



# Workbook D: Writing your own functions

Before starting this exercise you should work through

*  *Chapter 1 Writing your own functions*

from the
[DataCamp online course: Python Data Science Toolbox (Part 1)
](https://campus.datacamp.com/courses/python-data-science-toolbox-part-1/).

We will then use what you have learnt there to explore how biological
data can be handled in Python.

> ### Reminder: saving your work
>
> As you work through the workbook it is important to save your work regularly. Notice that once you have made changes, the top line of the Jupyter window warns you in small text that there are `(unsaved changes)`. To save your work in this notebook, either select the menu item `File` `Save` or hit the save button:
> 
> <img src="./images/save_button.png"/>
>
> 
> ### Reminder: getting help 
> Please see the page:
> [Help with programming](https://canvas.anglia.ac.uk/courses/1490/pages/help-with-programming)
> on ARU Canvas.

## Python functions: built-in functions

We have already been using Python's built-in functions a lot:

In [1]:
# Instruction: run this cell to see built in functions we have been using
print('to find the length of a string len("a string")=', len('a string'))
species = ['dog', 'cat', 'mule']
print('to find the type of the "species" variable use type(species)=', type(species))

to find the length of a string len("a string")= 8
to find the type of the "species" variable use type(species)= <class 'list'>


The great thing about these functions is that
* we do not need to worry about how they work
* they produce code that is easy to read: it is obvious that `len(snips_list)` returns the number of items in `snips_list`
* it is easy to get help about how they work: for instance, to get help on the `len` function:

In [2]:
# Instruction: run this cell to get help on the len() function
help(len)

Help on built-in function len in module builtins:

len(obj, /)
    Return the number of items in a container.



**Your turn:** what does the Python built-in function `abs()` do?

In [3]:
# Instruction: write a Python expression to find out what the abs function does
### your Python code
help(abs)

Help on built-in function abs in module builtins:

abs(x, /)
    Return the absolute value of the argument.



Now print the value of `abs()` for -15, -123.5 and 7.

In [4]:
# Instruction: print out the abs function for the inputs -15, -123.5 and 7
for number in -15, -123.5, 7:
    print('number:', number, 'abs(number):', abs(number))

number: -15 abs(number): 15
number: -123.5 abs(number): 123.5
number: 7 abs(number): 7


## Writing a function

Built-in functions are really useful but we clearly need to be able to produce our own functions for bioinformatics and data science.

As you have already seen in the [DataCamp online course: Python Data Science Toolbox (Part 1)
](https://campus.datacamp.com/courses/python-data-science-toolbox-part-1/) writing a function is simple:

In [5]:
# Instruction: run this cell containing an example function called double
def double(number):
    """
    Returns double the input number.
    
    Note: this is a docstring describing what the function does.
    It can have more than one line!
    """
    double_number = 2*number
    return double_number

When you ran the cell, notice that no output was produced. But the Jupyter Notebook kernel now knows there is a function called `double`.
* The function has a single **argument** called `number`.
* The function definition starts with a header line beginning with `def` and ending with a colon `:`.
* The function body is then indented with 4 spaces (in the same way as for `if` and `for`).
* The first thing in a function should be a **docstring** surrounded by triple double quotation marks.
* The **docstring** can have more than one line.
* This function returns a single value with a **return** statement.

The function can be called:

In [6]:
# Instruction: run this cell to show running the double function 
print(double(2))
print(double(-1975.13))

4
-3950.26


The docstring help message can be accessed:

In [7]:
help(double)

Help on function double in module __main__:

double(number)
    Returns double the input number.
    
    Note: this is a docstring describing what the function does.
    It can have more than one line!
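
Besides `help()`, the docstring can also be read directly from the function object. The following is a minimal sketch, not part of the original exercise: it redefines `double` so that the cell is self-contained, and uses the standard library function `inspect.getdoc` to return the docstring with its indentation tidied up.

```python
import inspect


def double(number):
    """
    Returns double the input number.

    Note: this is a docstring describing what the function does.
    """
    return 2 * number


# The raw docstring is stored on the function object itself.
raw_doc = double.__doc__

# inspect.getdoc() returns the same text with the indentation cleaned up.
clean_doc = inspect.getdoc(double)
first_line = clean_doc.splitlines()[0]
print(first_line)
```

This is handy when you only want the one-line summary of a function rather than the full help page.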



**Now it is your turn!**

The following bit of code works out the complement of a given DNA sequence:

In [8]:
sequence = 'ATTGGGCCCC'
complement = sequence.replace('A','t')
complement = complement.replace('T','a')
complement = complement.replace('G','c')
complement = complement.replace('C','g')
complement = complement.upper()
complement

'TAACCCGGGG'

But what if we wanted to work out the complement of another sequence:
```
tata_oligo = 'CTGCTATAAAAGGCTG'
```
We *could* copy and paste the code section above and then adapt it but this
is **a really bad idea**. Instead it is much better to use a function. 
If you find yourself copying and pasting sections of code in your work then think again - 
there is likely to be a better way.

So let's create a function to work out the complement of a sequence. What would a sensible function name be?

In [9]:
# Instruction: write a function that returns the complement of a given sequence.
def complement(sequence):
    """
    returns the complement of DNA sequence string.
    
    The complement replaces bases, according to the table:
    
    Base | Replacement
    -----+-----------
     A   |  T
     T   |  A
     G   |  C
     C   |  G
    letters that are not in the list ATGC are left alone and
    not replaced.
    """
    complement = sequence.replace('A','t')
    complement = complement.replace('T','a')
    complement = complement.replace('G','c')
    complement = complement.replace('C','g')
    complement = complement.upper()
    return complement


Now use the `complement` function to check the complements of the following sequences:

* `A`
* `AT`
* `ATGC`
* `GACGATATTTTCCGAC`

You might want to check your answers using http://arep.med.harvard.edu/cgi-bin/adnan/revcomp.pl

In [10]:
# Instruction: use the complement function to check the sequences
# have their expected complements
tests = [('A', 'T'),
         ('AT', 'TA'),
         ('ATGC', 'TACG'),
         ('CTGCTATAAAAGGCTG', 'GACGATATTTTCCGAC')]
for seq, known_complement in tests:
    print('sequence:         ', seq)
    print('known complement: ', known_complement)
    fn_complement = complement(seq)
    print('complement(seq):  ', fn_complement)
    if fn_complement == known_complement:
        print('PASS')
    else:
        print('FAIL')

sequence:          A
known complement:  T
complement(seq):   T
PASS
sequence:          AT
known complement:  TA
complement(seq):   TA
PASS
sequence:          ATGC
known complement:  TACG
complement(seq):   TACG
PASS
sequence:          CTGCTATAAAAGGCTG
known complement:  GACGATATTTTCCGAC
complement(seq):   GACGATATTTTCCGAC
PASS
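
The PASS/FAIL checks above can also be written with Python's `assert` statement, which stays silent when a check succeeds and stops with an `AssertionError` when it fails. This is only a sketch of one alternative style: the `complement` function is repeated (with a differently named local variable) so that the cell runs on its own.

```python
def complement(sequence):
    """Return the complement of an upper-case DNA sequence string."""
    result = sequence.replace('A', 't')
    result = result.replace('T', 'a')
    result = result.replace('G', 'c')
    result = result.replace('C', 'g')
    return result.upper()


tests = [('A', 'T'),
         ('AT', 'TA'),
         ('ATGC', 'TACG'),
         ('CTGCTATAAAAGGCTG', 'GACGATATTTTCCGAC')]

for seq, known_complement in tests:
    # assert raises AssertionError (showing the message) if the check fails
    assert complement(seq) == known_complement, 'complement failed for ' + seq

print('all', len(tests), 'tests passed')
```

The advantage is that a failing test cannot be overlooked in a long list of printed results.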


Is the help message for the `complement` function reasonable?
Imagine you are a user of the function who cannot read the source code.

Check the help message:

In [11]:
# Instruction: run this cell to check help message for complement
help(complement)

Help on function complement in module __main__:

complement(sequence)
    returns the complement of DNA sequence string.
    
    The complement replaces bases, according to the table:
    
    Base | Replacement
    -----+-----------
     A   |  T
     T   |  A
     G   |  C
     C   |  G
    letters that are not in the list ATGC are left alone and
    not replaced.



Now go on and create a `reverse_complement` function to work out the reverse complement of a DNA sequence.

In [12]:
# Instruction: write a reverse_complement function
# REALLY POOR WAY TO DO IT - COPY PASTE!
def reverse_complement(sequence):
    """
    returns the reverse complement of DNA sequence string.
    
    The complement replaces bases, according to the table:
    
    Base | Replacement
    -----+-----------
     A   |  T
     T   |  A
     G   |  C
     C   |  G
    letters that are not in the list ATGC are left alone and
    not replaced.
    
    The reverse complement reverses the string to give the sequence
    of the complement in the normal 5' to 3' direction
    """
    complement = sequence.replace('A','t')
    complement = complement.replace('T','a')
    complement = complement.replace('G','c')
    complement = complement.replace('C','g')
    complement = complement.upper()
    complement = complement[::-1]  # reverses string
    return complement

In [13]:
# Instruction: test your reverse_complement function
tests = [('A', 'T'),
         ('AT', 'AT'),
         ('ATGC', 'GCAT'),
         ('CTGCTATAAAAGGCTG', 'CAGCCTTTTATAGCAG')]
for seq, known_rev_complement in tests:
    print('sequence:                 ', seq)
    print('known reverse complement: ', known_rev_complement)
    fn_rev_complement = reverse_complement(seq)
    print('reverse_complement(seq):  ', fn_rev_complement)
    if fn_rev_complement == known_rev_complement:
        print('PASS')
    else:
        print('FAIL')

sequence:                  A
known reverse complement:  T
reverse_complement(seq):   T
PASS
sequence:                  AT
known reverse complement:  AT
reverse_complement(seq):   AT
PASS
sequence:                  ATGC
known reverse complement:  GCAT
reverse_complement(seq):   GCAT
PASS
sequence:                  CTGCTATAAAAGGCTG
known reverse complement:  CAGCCTTTTATAGCAG
reverse_complement(seq):   CAGCCTTTTATAGCAG
PASS


In writing your `reverse_complement` function, did you copy and paste the `complement` function?

What happens if we need to handle RNA sequences that have the base U (with the complement A)? Would you need to edit both the `complement` and `reverse_complement` functions?

If you answered yes then think again and rewrite the `reverse_complement` function so that it calls the `complement` function.

In [14]:
# Instruction: rewritten reverse complement function
def reverse_complement(sequence):
    """
    returns the reverse complement of DNA sequence string.
    
    For complement see help(complement)
    """
    answer = complement(sequence)
    answer = answer[::-1]  # reverses string
    return answer

Now test the rewritten function:

In [15]:
# Instruction: test your reverse_complement function
tests = [('A', 'T'),
         ('AT', 'AT'),
         ('ATGC', 'GCAT'),
         ('CTGCTATAAAAGGCTG', 'CAGCCTTTTATAGCAG')]
for seq, known_rev_complement in tests:
    print('sequence:                 ', seq)
    print('known reverse complement: ', known_rev_complement)
    fn_rev_complement = reverse_complement(seq)
    print('reverse_complement(seq):  ', fn_rev_complement)
    if fn_rev_complement == known_rev_complement:
        print('PASS')
    else:
        print('FAIL')

sequence:                  A
known reverse complement:  T
reverse_complement(seq):   T
PASS
sequence:                  AT
known reverse complement:  AT
reverse_complement(seq):   AT
PASS
sequence:                  ATGC
known reverse complement:  GCAT
reverse_complement(seq):   GCAT
PASS
sequence:                  CTGCTATAAAAGGCTG
known reverse complement:  CAGCCTTTTATAGCAG
reverse_complement(seq):   CAGCCTTTTATAGCAG
PASS


Now replace the `complement` function with an extension that deals with RNA base U that has the complement A.

In [16]:
# Instruction: extended complement function
def complement(sequence):
    """
    returns the complement of a DNA or RNA sequence string.
    
    The complement replaces bases, according to the table:
    
    Base | Replacement
    -----+-----------
     A   |  T
     T   |  A
     G   |  C
     C   |  G
     U   |  A
    letters that are not in the list ATGCU are left alone and
    not replaced.
    """
    complement = sequence.replace('A','t')
    complement = complement.replace('T','a')
    complement = complement.replace('G','c')
    complement = complement.replace('C','g')
    complement = complement.replace('U','a')
    complement = complement.upper()
    return complement

Show that the `complement` and `reverse_complement` functions work as expected for the following case:
* the RNA sequence: `UUUCG`
* ... has complement: `AAAGC` and
* ... has reverse complement `CGAAA`

In [17]:
# Instruction: test complement and reverse_complement
# for RNA sequence UUUCG
rna_sequence = 'UUUCG'
print('rna_sequence:        ', rna_sequence)
print('complement:          ', complement(rna_sequence))
print('reverse complement:  ', reverse_complement(rna_sequence))

rna_sequence:         UUUCG
complement:           AAAGC
reverse complement:   CGAAA


**Notice** that by avoiding cut-and-paste we can avoid having to make the same change in two places of code.
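
The reason this works is that `reverse_complement` looks up the name `complement` each time it is called, so it automatically picks up the newest definition. The sketch below illustrates this under the assumption that both functions live in the same notebook; the function bodies are shortened versions of the ones above so that the cell is self-contained.

```python
def complement(sequence):
    """DNA-only version: complements A, T, G and C."""
    result = sequence.replace('A', 't').replace('T', 'a')
    result = result.replace('G', 'c').replace('C', 'g')
    return result.upper()


def reverse_complement(sequence):
    """Reverse complement, built on whatever complement() currently is."""
    return complement(sequence)[::-1]


before = reverse_complement('UUUCG')   # U is not handled yet


def complement(sequence):
    """Extended version: also complements the RNA base U to A."""
    result = sequence.replace('A', 't').replace('T', 'a')
    result = result.replace('G', 'c').replace('C', 'g')
    result = result.replace('U', 'a')
    return result.upper()


after = reverse_complement('UUUCG')    # reverse_complement was not edited
print(before, after)
```

Only `complement` was redefined, yet the result of `reverse_complement` changes from `CGUUU` to `CGAAA`.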

## Functions with default arguments

It can be really useful to provide **default arguments** to functions. Let's start with an example:

In [18]:
# Instruction: run this cell so the kernel has the log_entry definition
def log_entry(name, status='worker'):
    print('LOG ENTRY name:"{}" status:"{}"'.format(name, status))

If we call log_entry with a single argument then the `status` will be `worker`:

In [19]:
# Instruction: run this cell to see result of calling log_entry with a single argument
log_entry('Jasmine Begum')

LOG ENTRY name:"Jasmine Begum" status:"worker"


But if Jasmine brought a visitor:

In [20]:
# Instruction: run this cell to see override of the default
log_entry('John Brown', status='visitor')

LOG ENTRY name:"John Brown" status:"visitor"


Log the entry of 
* Tom Singh whose status is 'security' 
* Burglar whose status is 'intruder'.
* Alice Clarke who works here.

In [21]:
# Instruction: call log_entry to log the 3 entries
log_entry('Tom Singh', status='security')
log_entry('Burglar', status='intruder')
log_entry('Alice Clarke')

LOG ENTRY name:"Tom Singh" status:"security"
LOG ENTRY name:"Burglar" status:"intruder"
LOG ENTRY name:"Alice Clarke" status:"worker"


Finally note that we can call log_entry using parameter names 
for all arguments:

In [22]:
log_entry(name='Field Mouse')
log_entry(name='Green Frog', status='wildlife')
log_entry(status='unknown', name='Brown Bear')
log_entry('Fred Flintstone', 'cartoon character') # using 2 positional arguments

LOG ENTRY name:"Field Mouse" status:"worker"
LOG ENTRY name:"Green Frog" status:"wildlife"
LOG ENTRY name:"Brown Bear" status:"unknown"
LOG ENTRY name:"Fred Flintstone" status:"cartoon character"


**TIP:** For functions with loads of parameters it is normally much clearer to use parameter names rather than positional arguments as the code produced is much easier to read.
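
To illustrate the tip, here is a minimal sketch using a hypothetical function `log_entry_extended`. It is not part of the workbook: the extra parameters `building` and `escorted` are invented for this example, and the function returns the log line instead of printing it so that the two calls can be compared.

```python
def log_entry_extended(name, status='worker', building='main', escorted=False):
    """Return a log line (building and escorted are hypothetical parameters)."""
    return 'LOG ENTRY name:"{}" status:"{}" building:"{}" escorted:{}'.format(
        name, status, building, escorted)


# Positional call: the reader has to remember what each value means.
unclear = log_entry_extended('John Brown', 'visitor', 'annex', True)

# Keyword call: same result, but each value is labelled at the call site.
clear = log_entry_extended(name='John Brown', status='visitor',
                           building='annex', escorted=True)

print(unclear == clear)
```

Both calls produce the same log line, but only the second one can be understood without looking up the function definition.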

Returning to the complement and reverse complement functions, it could be a simplification to define a single function:

In [23]:
# Instruction: complete the new complement fn. with reverse argument
def complement(sequence, reverse=False):
    """
    returns the complement or reverse complement of a DNA sequence string.
    
    If reverse is True then the reverse complement is returned (to give the sequence
    of the complement strand in the normal 5' to 3' direction).
    
    The complement replaces bases, according to the table:
    
    Base | Replacement
    -----+-----------
     A   |  T
     T   |  A
     G   |  C
     C   |  G
     U   |  A
    letters that are not in the list ATGCU are left alone and
    not replaced.
    """
    complement = sequence.replace('A','t')
    complement = complement.replace('T','a')
    complement = complement.replace('G','c')
    complement = complement.replace('C','g')
    complement = complement.replace('U','a')
    complement = complement.upper()
    if reverse:
        complement = complement[::-1]  # reverses string
    return complement

To use the new complement function we can call it either without a `reverse` argument for the normal forward complement or with `reverse=True` to get the reverse complement.

In [24]:
test_seq = 'AAAG'
test_comp = complement(test_seq)
test_rev_comp = complement(test_seq, reverse=True)
print('test_seq={} complement={} reverse_complement={}'.
      format(test_seq, test_comp, test_rev_comp))

test_seq=AAAG complement=TTTC reverse_complement=CTTT


### Optional exercise: deal with RNA sequences properly

What result does your complement function give for the input
RNA sequence `AAAUUU`?

In [25]:
rna2 = 'AAAUUU'
print(rna2, ' complement: ', complement(rna2))

AAAUUU  complement:  TTTAAA


The complement sequence `TTTAAA` returned is a DNA sequence but is this what is wanted? An RNA sequence (with **`U`**s in place of **`T`**s) might make more sense. We cannot be sure and need to know whether the user wants a DNA or RNA sequence.

So write a new function `rna_complement` that returns an RNA sequence given an input DNA or RNA sequence. 
N.B. do not copy and paste your existing complement function to do this!

In [26]:
# Instruction: write rna_complement function 
def rna_complement(sequence, reverse=False):
    """ returns the RNA complement or reverse complement of an input DNA or RNA sequence string.
    
    If reverse is True then the reverse complement is returned (to give the sequence
    of the complement strand in the normal 5' to 3' direction).
    
    The complement replaces bases, according to the table:
    
    Base | Replacement
    -----+-----------
     A   |  U
     U   |  A
     G   |  C
     C   |  G
     T   |  A
    letters that are not in the list AUGCT are left alone and
    not replaced.
    """
    dna_complement = complement(sequence, reverse=reverse)
    rna_complement = dna_complement.replace('T', 'U')
    return rna_complement

In [27]:
# Instruction: test your rna_complement function by running this cell.
test_rna         = 'AUGC'
given_complement = 'UACG'
given_reverse_c  = 'GCAU'
print('test rna_complement(test_rna): ', end='')
if rna_complement(test_rna) == given_complement:
    print('TEST PASS')
else:
    print('TEST FAIL')
    
print('test rna_complement(test_rna, reverse=True): ', end='')
if rna_complement(test_rna, reverse=True) == given_reverse_c:
    print('TEST PASS')
else:
    print('TEST FAIL')

test rna_complement(test_rna): TEST PASS
test rna_complement(test_rna, reverse=True): TEST PASS


**Notice** that we can reuse the complement function without copy paste!

Question: an alternative to having a separate `rna_complement` function would
be to extend the `complement` function introducing a new optional argument `rna_result=False`. 
Do you think this would be better?

### Optional exercise: deal with lower case letters in sequences

> *Have a go at this **if you have time**.*

Suppose we want to deal with DNA sequences with lower case letters like:
```
tata_oligo = 'ctgcTATAAAaggctg'
```
The complement sequence should preserve the upper/lower case of each letter.
So for our sequence:
```
tata_oligo = 'ctgcTATAAAaggctg'
tata_compl = 'gacgATATTTtccgac'
```

(See https://www.bioinformatics.org/sms/rev_comp.html for a tool that does this).

What happens with our current complement function?

In [28]:
# Instruction: test current complement 
tata_oligo = 'ctgcTATAAAaggctg'
tata_compl = complement(tata_oligo)
print('tata_oligo =' + tata_oligo)
print('tata_compl =' + tata_compl)

tata_oligo =ctgcTATAAAaggctg
tata_compl =CTGCATATTTAGGCTG


Question: what does the current `complement` function do with lower case? 

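Answer: **because our complement function uses lower case letters as temporary markers during the conversion, lower case input bases are never complemented, and the final `upper()` call discards the original case, so the result is completely wrong.**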
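
The failure can be traced step by step. The sketch below simply repeats the `replace` calls from the current `complement` function on the mixed case sequence and prints the intermediate result.

```python
sequence = 'ctgcTATAAAaggctg'

# Step 1: only upper-case A, T, G, C and U are matched by the replace calls,
# so the lower-case input bases pass through without being complemented.
step = sequence.replace('A', 't')
step = step.replace('T', 'a')
step = step.replace('G', 'c')
step = step.replace('C', 'g')
step = step.replace('U', 'a')
print('after replaces:', step)

# Step 2: upper() then throws away the original upper/lower case.
print('after upper(): ', step.upper())
```

After the replaces the string is `ctgcatatttaggctg`: only the middle `TATAAA` section has been complemented, and it can no longer be told apart from the untouched lower case bases.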

How can you fix the procedure?

Hint: possible solutions:
* use a for loop to go through the sequence and substitute letter by letter 
* use the Python translate method as explained at https://www.tutorialspoint.com/python/string_translate.htm and demonstrated here.

In [29]:
# Instruction: run this cell for a demonstration of 
# String translate in Python3
# want to replace all a's with 1, all b's to 2, 
# all c's to 3, all d's to 4. Leaving all other characters alone
work_on_chars = 'abcd'
translate__to = '1234'
abcd_to_1234 = str.maketrans(work_on_chars,translate__to)
test_string = 'abcdefg. The quick brown fox jumps over the lazy dog!'
translate_string = test_string.translate(abcd_to_1234)
print(translate_string)

1234efg. The qui3k 2rown fox jumps over the l1zy 4og!


**My solution using Python translate:**

In [30]:
# Instruction: produce a complement function that preserves 
# the upper/lower case of the input sequence.
def complement(sequence, reverse=False, rna=False):
    """returns the DNA complement or reverse complement of an
    input DNA or RNA sequence string.
    
    supports lower case letters.
    """
    base_in = 'ATGCUatgcu'
    base_to = 'TACGAtacga'
    complement_translate = str.maketrans(base_in, base_to)
    complement = sequence.translate(complement_translate)
    if reverse:
        complement = complement[::-1]
    if rna:
        complement = complement.replace('T', 'U')
        complement = complement.replace('t', 'u')
    return complement

In [31]:
# Instruction: test new complement function.
tata_oligo = 'ctgcTATAAAaggctg'
know_compl = 'gacgATATTTtccgac'
tata_compl = complement(tata_oligo)
print('tata_oligo =' + tata_oligo)
print('tata_compl =' + tata_compl)
if tata_compl != know_compl:
    print('error!')
else:
    print('test passes')

tata_oligo =ctgcTATAAAaggctg
tata_compl =gacgATATTTtccgac
test passes


**A long-winded solution using a for loop and if/elif/else**

In [32]:
def complement(sequence, reverse=False, rna=False):
    """returns the DNA complement or reverse complement of an
    input DNA or RNA sequence string.
    
    supports lower case letters.
    """
    complement = ''  # it would be more efficient to use a list!
    for base in sequence:
        base_lower = base.islower()  # remember the case of the input base
        base = base.upper()
        if base == 'A':
            comp_base = 'T'
        elif base == 'T' or base == 'U':  # RNA base U also pairs with A
            comp_base = 'A'
        elif base == 'G':
            comp_base = 'C'
        elif base == 'C':
            comp_base = 'G'
        else:
            comp_base = base  # leave any other character unchanged
        if base_lower:
            comp_base = comp_base.lower()
        complement += comp_base         
    if reverse:
        complement = complement[::-1]
    if rna:
        complement = complement.replace('T', 'U')
        complement = complement.replace('t', 'u')
    return complement

In [33]:
# quick test should print gacgATATTTtccgac
print(complement(tata_oligo))

gacgATATTTtccgac


**My solution using a for loop and a Python dictionary**

In [34]:
def complement(sequence, reverse=False, rna=False):
    """returns the DNA complement or reverse complement of an
    input DNA or RNA sequence string.
    
    supports lower case letters.
    """
    complement_dict = {'A':'T', 'T':'A', 'G':'C', 'C':'G', 'U':'A',
                       'a':'t', 't':'a', 'g':'c', 'c':'g', 'u':'a'}
    seq_comp = []
    for base in sequence:
        seq_comp.append(complement_dict.get(base,base))
    complement = ''.join(seq_comp)
    if reverse:
        complement = complement[::-1]
    if rna:
        complement = complement.replace('T', 'U')
        complement = complement.replace('t', 'u')
    return complement

In [35]:
# quick test should print gacgATATTTtccgac
print(complement(tata_oligo))

gacgATATTTtccgac
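
The loop in the dictionary solution can also be written as a single generator expression passed to `join`. This is a sketch of that variant, using the hypothetical names `COMPLEMENT_DICT` and `complement_compact` so that it does not replace the `complement` function defined above.

```python
# Lookup table: each base maps to its complement, preserving case.
COMPLEMENT_DICT = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G', 'U': 'A',
                   'a': 't', 't': 'a', 'g': 'c', 'c': 'g', 'u': 'a'}


def complement_compact(sequence, reverse=False, rna=False):
    """Complement a DNA or RNA sequence string, preserving letter case."""
    # .get(base, base) leaves any character not in the table unchanged
    result = ''.join(COMPLEMENT_DICT.get(base, base) for base in sequence)
    if reverse:
        result = result[::-1]
    if rna:
        result = result.replace('T', 'U').replace('t', 'u')
    return result


print(complement_compact('ctgcTATAAAaggctg'))
```

The behaviour is intended to match the loop version; whether the shorter form is easier to read is a matter of taste.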


### Optional exercise: deal with Ambiguity codes

> *Have a go at this **if you have time**.*

* See http://reverse-complement.com/ambiguity.html 
* extend your code to deal with ambiguity codes
* compare results against http://reverse-complement.com/ and http://arep.med.harvard.edu/labgc/adnan/projects/Utilities/revcomp.html

In [36]:
# my solution 
def complement(sequence, reverse=False, rna=False):
    """returns the DNA complement or reverse complement of an
    input DNA or RNA sequence string.
    
    supports lower case letters and Ambiguity codes from
    http://reverse-complement.com/ambiguity.html
    """
    base_in = 'AGCTURYSWKMBVDH'
    base_to = 'TCGAAYRSWMKVBHD'
    base_in = base_in + base_in.lower()
    base_to = base_to + base_to.lower()
    complement_translate = str.maketrans(base_in, base_to)
    complement = sequence.translate(complement_translate)
    if reverse:
        complement = complement[::-1]
    if rna:
        complement = complement.replace('T', 'U')
        complement = complement.replace('t', 'u')
    return complement

In [37]:
# quick test should print gacgATATTTtccgac
print(complement(tata_oligo))

gacgATATTTtccgac


In [38]:
# test sequence of all lower and upper case letters
import string 
letters = string.ascii_letters
# from http://reverse-complement.com/
# reverse complement of:
# abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
# is:
# ZRXWBAASYQPONKLMJIDCFEHGVTzrxwbaasyqponklmjidcfehgvt
expect = 'ZRXWBAASYQPONKLMJIDCFEHGVTzrxwbaasyqponklmjidcfehgvt'
print('test reverse complement of all ascii letters')
print('input:  ', letters)
print('expect: ', expect)
get = complement(sequence=letters, reverse=True)
print('get:    ', get)
if get == expect:
    print('TEST PASSES')
else:
    print('TEST FAILS')

test reverse complement of all ascii letters
input:   abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
expect:  ZRXWBAASYQPONKLMJIDCFEHGVTzrxwbaasyqponklmjidcfehgvt
get:     ZRXWBAASYQPONKLMJIDCFEHGVTzrxwbaasyqponklmjidcfehgvt
TEST PASSES
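
A further check that does not rely on an external website is a round trip: complementing a DNA sequence twice should give back the original. The sketch below assumes the same translation table as the solution above, wrapped in a hypothetical helper called `complement_iupac` so that the cell is self-contained.

```python
import string


def complement_iupac(sequence):
    """Complement using the ambiguity-code table, preserving letter case."""
    base_in = 'AGCTURYSWKMBVDH'
    base_to = 'TCGAAYRSWMKVBHD'
    table = str.maketrans(base_in + base_in.lower(),
                          base_to + base_to.lower())
    return sequence.translate(table)


# U and u are excluded: U -> A -> T, so an RNA base does not round-trip
# through a DNA complement.
letters = [c for c in string.ascii_letters if c not in 'Uu']
round_trip_ok = all(complement_iupac(complement_iupac(c)) == c
                    for c in letters)
print(round_trip_ok)
print(complement_iupac(complement_iupac('U')))
```

The round trip holds for every letter except `U`, which comes back as `T` because the function always returns a DNA complement.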


## Homework D: Write your own functions
Now go on and complete [homework D](./ex_D_homework.ipynb).