# FUNCTIONS pt. 2: 
## Simple bioinformatics exercises

Let's continue with some exercises and explanations on how to use functions.

_(This set of exercises was inspired by the book "Bioinformatics Programming Using Python" by Mitchell L. Model, O'Reilly.)_

We will write functions that:

1) check if a certain string is contained in another string: `.find()`

2) verify that a string only contains certain characters: `.upper()`, `.count()`

3) calculate the GC content of a DNA sequence: `len()`

4) include an assertion statement: `assert`

5) have default parameter values

### 1) Find the DNA binding site in a sequence
Here is a simple function that finds a binding site in a DNA sequence. For this, I'm going to use the **`.find()`** method, which returns the position (counting from 0) of the first occurrence of my query, or -1 if the query does not occur at all.

In [10]:
def recognition_site (base_seq, recognition_seq): 
    """This function takes 2 strings as arguments and returns the position 
    of the second string in the first string (or -1 if it's not present)"""
    
    return base_seq.find(recognition_seq)

seq1 = "ACTGTAGTACCATAGATCCATATGTAGTCCCATAGTCCCAGAGCACCAGTC"
seq2 = "CCATA"

recognition_site (seq1, seq2) #function call, providing two arguments

9
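
Because `.find()` reports only the first occurrence, a sequence with several binding sites needs a loop. The following is a minimal sketch of one way to collect every position, using the optional start argument of `.find()`. The function name `all_recognition_sites` is made up for this illustration and is not part of the original exercise.

```python
def all_recognition_sites(base_seq, recognition_seq):
    """Return a list with the start position of every (possibly
    overlapping) occurrence of recognition_seq in base_seq."""
    positions = []
    start = base_seq.find(recognition_seq)
    while start != -1:
        positions.append(start)
        # continue the search one character after the last hit
        start = base_seq.find(recognition_seq, start + 1)
    return positions

seq1 = "ACTGTAGTACCATAGATCCATATGTAGTCCCATAGTCCCAGAGCACCAGTC"

print(all_recognition_sites(seq1, "CCATA"))  # [9, 17, 29]
print(all_recognition_sites(seq1, "GGGGG"))  # [] because there is no match
```

Returning an empty list for "no match" avoids the special value -1, so the result can be used directly in a `for` loop.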

### 2) Check if DNA sequence contains only A, C, T, G
DNA sequences can only contain 4 characters - A, C, T, G - and are conventionally written in uppercase. I will write a function that checks the first condition and tolerates lowercase input. It uses **`.upper()`** to convert any lowercase letters in my sequence into uppercase. Then it checks if the length of the sequence (**`len()`**) is the same as the sum of all As, Cs, Ts and Gs in the sequence, which I count with the **`.count()`** method. The function returns `True` if only As, Cs, Ts and Gs are in the sequence, and `False` if any other characters (including whitespace) are detected.

In [53]:
def validate_seq (base_seq):
    '''Returns True if the sequence contains only A, G, T, C, and
    False if otherwise.'''
    base_seq = base_seq.upper() #convert the string into all uppercase
    return len(base_seq) == (base_seq.count("A") + base_seq.count("C") 
                             + base_seq.count("T") + base_seq.count("G"))

seq3 = "aatagtgatcccacacgtgat"
seq4 = "abcdkjsdfiosdfjsdklfj"

print (validate_seq(seq3))
print (validate_seq(seq4))

True
False
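
The same check can also be phrased as a set comparison: the characters that occur in the sequence must be a subset of the allowed characters. This is only an alternative sketch; the name `validate_seq_with_set` is hypothetical and not from the original exercise.

```python
def validate_seq_with_set(base_seq):
    """Return True if base_seq contains only A, C, G, T (in any case)."""
    allowed = set("ACGT")
    # every distinct character of the sequence must be an allowed one
    return set(base_seq.upper()) <= allowed

print(validate_seq_with_set("aatagtgatcccacacgtgat"))  # True
print(validate_seq_with_set("abcdkjsdfiosdfjsdklfj"))  # False
print(validate_seq_with_set("ACGT ACGT"))              # False, the space is rejected
```

Note that both this version and the counting version return `True` for an empty string, because an empty sequence contains no invalid characters.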


### 3) Calculate the GC content of a DNA sequence

How would I go about calculating the percentage of Gs and Cs in a given DNA sequence? Let's go through it step by step:

1) Count the Gs and Cs occurring in the DNA sequence and add them up.

2) Divide that count by the length of the sequence to get the GC ratio.

3) Return the GC content as a percentage. 

In [54]:
seq5 = "ACCCATTGATTGATACAGATGAACACACAGATAGA"

def GC_content (base_seq):
    "Prints the GC content (in %) of a given DNA sequence"
    GC_count = base_seq.count ("G") + base_seq.count ("C") #1) count Gs and Cs
    GC_ratio = GC_count / len(base_seq) #2) fraction of Gs and Cs
    GC_percent = int (GC_ratio *100) #3) percentage, truncated to an integer
    print ("The GC content is " + str (GC_percent) + " %.")

GC_content (seq5)                                    
                 

The GC content is 37 %.


Notice that to get whole numbers, I convert `GC_percent` into an integer with `int()`, which cuts off the decimal places rather than rounding. To join it with the rest of the text, `GC_percent` then has to be converted into a string with `str()`.
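
The difference between cutting off decimals with `int()` and rounding with `round()` is easiest to see when the function returns the number instead of printing it. The sketch below assumes a hypothetical helper `gc_percentage` and a made-up three-base toy sequence; neither appears in the original exercise.

```python
def gc_percentage(base_seq):
    """Return the GC content of base_seq as a float between 0 and 100."""
    gc_count = base_seq.count("G") + base_seq.count("C")
    return gc_count / len(base_seq) * 100

percent = gc_percentage("GCA")  # toy sequence: 2 of 3 bases are G or C

print(int(percent))       # 66, decimals are cut off
print(round(percent))     # 67, rounded to the nearest integer
print(round(percent, 1))  # 66.7, one decimal place kept
```

Returning the value also lets the caller decide how to format or reuse it. An empty sequence would still cause a `ZeroDivisionError` here, just as in `GC_content`.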

### 4) Including an assertion statement

An `assert` statement tests whether an expression is `True` or `False` and raises an `AssertionError` if it is `False`. This way you avoid continuing with a result that was computed from an invalid argument. E.g. in the GC content example above, I didn't check whether the sequence consisted only of As, Gs, Ts and Cs, so the function would run without complaint and report a meaningless result if `base_seq` were `"TXTWECGATJH"`.

So let's improve `GC_content` by ensuring that its argument is a valid DNA sequence. Conveniently, we already wrote a suitable function earlier, called `validate_seq`. So I will just call `validate_seq` inside the new function.

In [61]:
seq5 = "ACCCaattggTTGATaCAGATGAACACCcAGATAgA"

def improved_GC_content (base_seq):
    "Prints the GC content (in %) of a given DNA sequence"
    #Raises an AssertionError if characters other than A, T, C, G appear in the sequence
    assert validate_seq(base_seq), "argument has invalid characters"
    #Convert to uppercase so that lowercase g and c are counted too
    base_seq = base_seq.upper()
    GC_count = base_seq.count ("G") + base_seq.count ("C") #1) count Gs and Cs
    GC_ratio = GC_count / len(base_seq) #2) fraction of Gs and Cs
    GC_percent = int (GC_ratio *100) #3) percentage, truncated to an integer
    print ("The GC content is " + str (GC_percent) + " %.")

improved_GC_content (seq5)  

The GC content is 41 %.
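
To see the assertion in action, the function can be called with the invalid sequence mentioned earlier. The following sketch repeats `validate_seq` so that it runs on its own; `checked_gc_percent` is a hypothetical variant that returns the number instead of printing it, which is an assumption made for this example only.

```python
def validate_seq(base_seq):
    """Return True if base_seq contains only A, C, G, T (in any case)."""
    base_seq = base_seq.upper()
    return len(base_seq) == (base_seq.count("A") + base_seq.count("C")
                             + base_seq.count("T") + base_seq.count("G"))

def checked_gc_percent(base_seq):
    """Return the GC content (in %, truncated) of a valid DNA sequence."""
    assert validate_seq(base_seq), "argument has invalid characters"
    base_seq = base_seq.upper()
    gc_count = base_seq.count("G") + base_seq.count("C")
    return int(gc_count / len(base_seq) * 100)

message = None
try:
    checked_gc_percent("TXTWECGATJH")
except AssertionError as error:
    # the text after the comma in the assert statement becomes the error message
    message = str(error)

print("Assertion failed:", message)
```

Keep in mind that Python skips `assert` statements when it is started with the `-O` option, so assertions are best suited to catching mistakes during development.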


### 5) Default parameter values

In many functions, one parameter takes the same value in most calls, but not in all of them. For such a parameter you can define a default value, which is used when the function call does not supply a value explicitly. If the call does supply a value, that value is used instead of the default.

For example, I would occasionally like to use the `validate_seq` function from above to check RNA sequences as well. RNA contains the bases A, C, G and U, whereas DNA contains A, C, G and T. I will do this with a **flag** called **`RNAflag`**. I'll give it the default value `False`, so I don't have to specify it when I deal with DNA sequences; only when I have an RNA sequence do I pass the additional argument `True` in the function call.

The flag has to appear not only in the parameter list but also in the function body, where it controls what the function does, e.g. in an `if` statement or, as here, in a conditional expression (`"U" if RNAflag else "T"`).

In [9]:
def validate_DNA_or_RNA_seq (base_seq, RNAflag=False):
    '''This function returns True if the string base_seq contains only 
    upper- or lowercase T (or U, if RNAflag), A, G and C characters.'''
    base_seq = base_seq.upper()
    return len(base_seq) == (base_seq.count("A") 
                             + base_seq.count("C") 
                             + base_seq.count("U" if RNAflag else "T") 
                             + base_seq.count("G"))

seq3 = "aatagtgattcccacacgtgat" #DNA sequence
seq4 = "abcdkjsdfiosdfjsdklfj" #gibberish sequence
seq5 = "aucuaucgugucuacua" #RNA sequence

print (validate_DNA_or_RNA_seq(seq3))
print (validate_DNA_or_RNA_seq(seq4))
print (validate_DNA_or_RNA_seq(seq5, True)) #Here I specify RNAflag = True

True
False
True


In the parameter list of a function definition, all required parameters have to come before the parameters with default values. Likewise, in the argument list of a function call, all positional arguments have to come before any keyword arguments (such as `RNAflag=True`).
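
The different ways of calling a function with a default parameter can be tried out as follows. The function is repeated so that the example runs on its own; the variable name `rna` is chosen just for this sketch.

```python
def validate_DNA_or_RNA_seq(base_seq, RNAflag=False):
    """Return True if base_seq contains only A, C, G and T
    (or U instead of T, if RNAflag is True), in any case."""
    base_seq = base_seq.upper()
    return len(base_seq) == (base_seq.count("A")
                             + base_seq.count("C")
                             + base_seq.count("U" if RNAflag else "T")
                             + base_seq.count("G"))

rna = "aucuaucgugucuacua"

print(validate_DNA_or_RNA_seq(rna))                         # False: default flag, U is invalid in DNA
print(validate_DNA_or_RNA_seq(rna, True))                   # True: flag passed by position
print(validate_DNA_or_RNA_seq(rna, RNAflag=True))           # True: flag passed by keyword
print(validate_DNA_or_RNA_seq(RNAflag=True, base_seq=rna))  # True: keywords in any order

# The following line would not even start to run, because a positional
# argument must not follow a keyword argument:
# validate_DNA_or_RNA_seq(RNAflag=True, rna)
```

Passing the flag by keyword makes the call easier to read than a bare `True`, since the reader does not have to look up what the second argument means.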