# Lecture 08: Practical analyses in Python

Last week Phil gave a great intro to Python!
This week we are going to introduce a few more concepts that will help you make your code more sophisticated (and flexible).

## Defining new functions
We use the `def` keyword to define a new function. 

In [17]:
def say_hello():
    print('Hello!')

s = say_hello()
print(s)

Hello!
None


In [11]:
# functions can take inputs, called "arguments" or "parameters"
def greet(name):
    print('Nice to meet you,', name)

greet('phil')
greet('charlie')

Nice to meet you, phil
Nice to meet you, charlie


Functions can return results to the caller using the `return` statement:

In [None]:
def reverse_complement(seq):
    """returns the reverse complement of a nucleic acid sequence"""
    base_partner = {'A':'T', 'T':'A', 'C':'G', 'G':'C'}
    bwd = ''
    # iterate through all bases in the sequence
    for base in seq:
        # look up the complementary base in the dictionary
        pair = base_partner[base]
        # add the complementary base to the beginning of the string (reverse comp)
        bwd = pair + bwd
    return bwd

In [12]:
fwd = 'ACGGTAATGATCCTCAG'
rev = reverse_complement( fwd )

print('fwd=',fwd,'rev=',rev)

fwd= ACGGTAATGATCCTCAG rev= CTGAGGATCATTACCGT


In [13]:
# an assert statement checks to see if something is true, and will interrupt the program if it's not.
assert reverse_complement(reverse_complement(fwd)) == fwd
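
An `assert` can also carry a message that is shown when the check fails. The sketch below is a minimal illustration; it restates `reverse_complement` so that it runs on its own, and the helper name `check_round_trip` is made up for this example.

```python
def reverse_complement(seq):
    """Return the reverse complement of a nucleic acid sequence."""
    base_partner = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}
    bwd = ''
    for base in seq:
        bwd = base_partner[base] + bwd
    return bwd


def check_round_trip(seq):
    """Hypothetical helper: reverse-complementing twice should give back seq."""
    twice = reverse_complement(reverse_complement(seq))
    # the text after the comma is shown if the assertion fails
    assert twice == seq, 'round trip changed the sequence: ' + twice
    return True


print(check_round_trip('ACGGTAATGATCCTCAG'))

# a failing assert raises AssertionError, which we can catch to see the message
try:
    assert reverse_complement('AAC') == 'TTG', 'expected GTT, not TTG'
except AssertionError as err:
    print('assertion failed:', err)
```

Without the `try`/`except`, the failing `assert` would stop the program at that line.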

Functions can have OPTIONAL ARGUMENTS whose DEFAULT VALUES are pre-specified in the function definition.

In [14]:
def reverse_complement(seq, unk_partner='N'):
    """Returns the reverse complement of a nucleic acid sequence
    
    Uses unk_partner as the partner of unrecognized letters
    """
    base_partner = {'A':'T', 'T':'A', 'C':'G', 'G':'C'}
    rseq = ''
    for a in seq:
        if a in base_partner:
            # look up the complementary base in the dictionary
            pair = base_partner[a]
            rseq = pair + rseq
        else:
            rseq = unk_partner + rseq
    return rseq


In [15]:
fwd = 'ACTGTAGCxGAcTNCGAC'

print(reverse_complement(fwd))
print(reverse_complement(fwd, unk_partner='-'))

GTCGNANTCNGCTACAGT
GTCG-A-TC-GCTACAGT


Functions can call other functions, even themselves:

In [18]:
def factorial(n):
    """Calculate the factorial of a number recursively. Bad things will happen 
    if the number is negative or not an integer """
    if n==0:
        return 1
    else:
        return n * factorial(n-1) # this is called "recursion"

for i in range(10):
    print(i,'factorial =',factorial(i))

0 factorial = 1
1 factorial = 1
2 factorial = 2
3 factorial = 6
4 factorial = 24
5 factorial = 120
6 factorial = 720
7 factorial = 5040
8 factorial = 40320
9 factorial = 362880
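
The docstring warns that bad things happen for invalid input: with a negative number, for example, the recursion never reaches `n==0`, so Python eventually raises a `RecursionError`. One possible guard, sketched here under the hypothetical name `safe_factorial`, is to check the input first and use a loop instead of recursion.

```python
def safe_factorial(n):
    """Hypothetical variant of factorial that rejects invalid input."""
    if not isinstance(n, int) or n < 0:
        raise ValueError('n must be a non-negative integer, got ' + repr(n))
    result = 1
    # multiply 1 * 2 * ... * n; the loop body never runs when n is 0 or 1
    for k in range(2, n + 1):
        result = result * k
    return result


print(safe_factorial(5))

# invalid input now fails immediately with a readable message
try:
    safe_factorial(-1)
except ValueError as err:
    print('rejected:', err)
```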


Try using `help(factorial)`, `factorial?`, and `factorial??` to see the docstring and source code of our new function. 

In [19]:
help(factorial)


Help on function factorial in module __main__:

factorial(n)
    Calculate the factorial of a number recursively. Bad things will happen 
    if the number is negative or not an integer



In [20]:
factorial?

Signature: factorial(n)
Docstring:
Calculate the factorial of a number recursively. Bad things will happen 
if the number is negative or not an integer 
File:      /tmp/ipykernel_509/2221558712.py
Type:      function


In [21]:
factorial??

Signature: factorial(n)
Source:   
def factorial(n):
    """Calculate the factorial of a number recursively. Bad things will happen 
    if the number is negative or not an integer """
    if n==0:
        return 1
    else:
        return n * factorial(n-1) # this is called "recursion"
File:      /tmp/ipykernel_509/2221558712.py
Type:      function


### Practice time:

Take a few minutes to write a function that uses a dictionary to count the number of times each base occurs.
Be sure to account for sequences that contain uppercase and lowercase bases.

I've provided an empty function here (with some docstrings already written). Use the provided test cases to check your work.

In [22]:
def count_bases(seq):
    """Count the number of times each base occurs in the sequence.
    
    Parameters
    ----------
    seq : string
        DNA sequence.
        
    Returns
    -------
    dict
        Keyed by each nucleotide, value is number of times the nucleotide
        is observed.
    
    """
    base_counts = {}
    
    seq = seq.upper()
    
    for base in seq:
        if base in base_counts:
            base_counts[base] += 1
        else:
            base_counts[base] = 1
    
    return base_counts

In [23]:
count_bases('AATCGGCT')

{'A': 2, 'T': 2, 'C': 2, 'G': 2}

In [24]:
count_bases('aatTGGcT')

{'A': 2, 'T': 3, 'G': 2, 'C': 1}
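
For comparison, the standard library's `collections.Counter` does the same bookkeeping in one step. This is only an alternative sketch, not a replacement for the exercise; the name `count_bases_counter` is made up here.

```python
from collections import Counter


def count_bases_counter(seq):
    """Count bases with collections.Counter (hypothetical alternative)."""
    # Counter tallies each character of the upper-cased string;
    # dict() converts the result back to a plain dictionary
    return dict(Counter(seq.upper()))


print(count_bases_counter('AATCGGCT'))
print(count_bases_counter('aatTGGcT'))
```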

## Regular expressions
The `re` module is for "regular expressions". 
These are very useful for parsing strings.

A regular expression is a sequence of characters that forms a search pattern.
They can be used to check if a string contains the specified search pattern.

The `re` package offers a set of functions that allow us to search a string for a match. 
The main function we will be using is `search`, which finds a match at any position in the string.
We can search for a pattern in a string like this:

```
import re
string = "This is an example string"

# compile the search pattern
search_pattern = re.compile("search pattern here")

# search for the search pattern in the string
search_pattern.search(string)
```

Here are some common elements to have in your search pattern:
* letter characters, which match that same letter in the string (e.g. `A`, `B`, `C`, ...)
* special characters (e.g. `(`, `)`, `.`), which match literally only when preceded by a `\`
* `\d`, which matches a digit (`0`-`9`)

You may also want to add the following customizations:
* `[]` specifies a set of characters, any one of which may match (e.g. `[a-n]`)
* `()` groups everything contained inside and captures the text it matches
* `?P<name>`, placed right after the opening `(`, gives the group the name `name`
* `+` specifies one or more occurrences of the preceding pattern element
* `{}` specifies an exact number of occurrences of the preceding pattern element (e.g. `\d{4}`)
* `$` specifies the end of the string
* `^` specifies the beginning of the string
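
The elements listed above can be tried out one at a time. The snippet below is a small sketch with made-up test strings, where each call uses one or two of the elements:

```python
import re

# [] : a set of characters; + : one or more of them
print(re.search(r'[ACGT]+', 'xxACGTxx').group())          # ACGT

# \d : a digit; {4} : exactly four of them; $ : end of the string
print(re.search(r'\d{4}$', 'A/Perth/2009').group())       # 2009

# ^ : beginning of the string
print(re.search(r'^A/', 'A/Perth/2009').group())          # A/

# (?P<name>...) : a named group that can be pulled out of the match
m = re.search(r'(?P<year>\d{4})$', 'A/Perth/2009')
print(m.group('year'))                                    # 2009

# special characters such as ( and ) need a backslash to match literally
print(re.search(r'\(H\d+N\d+\)', 'flu (H3N2)').group())   # (H3N2)
```

The patterns are written as raw strings (`r'...'`) so that Python passes the backslashes to `re` unchanged.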


Here is a common example dealing with influenza.
You download some strains from the database, and they have names that look like this:

In [None]:
strain1 = 'A/New York/3/1994 (H3N2)'
strain2 = 'A/California/3/X/2003 (H12N1)'
strain3 = 'A/Perth/2009 (H3N2)'

strains = [strain1, strain2, strain3]

You want to get some information out of these, like the subtype.
Let's build a regular expression that gets the subtype out of `strain2`:

In [28]:
import re

# compile a re for the subtype (raw string, so the backslashes reach re unchanged)
strainmatch = re.compile(
        r'\(H\d+N\d+\)')

# search for the search pattern in the string
strainmatch.search(strain2)

<re.Match object; span=(22, 29), match='(H12N1)'>
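
The match object holds more than what is printed. As a small sketch (the variable name `no_match` is just for illustration), `group()` returns the matched text, `span()` returns its position, and `search` returns `None` when nothing matches:

```python
import re

strain2 = 'A/California/3/X/2003 (H12N1)'
strainmatch = re.compile(r'\(H\d+N\d+\)')

m = strainmatch.search(strain2)
print(m.group())   # the text that matched, parentheses included
print(m.span())    # (start, end) positions of the match in the string

# no subtype in this string, so search returns None instead of a match
no_match = strainmatch.search('A/California/3/X/2003')
print(no_match is None)
```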

Now, let's extend that a bit: build a regular expression that captures only the subtype (without the parentheses), then use a dictionary to count how many sequences there are of each subtype:

In [29]:
# compile a re for the subtype (with a named search pattern, as a raw string)
strainmatch = re.compile(
        r'\((?P<subtype>H\d+N\d+)\)$')   
    
subtype_counter = {}  # dict to store the results

for strain in strains:  # loop over all strains
    # search for re in each strain
    m = strainmatch.search(strain)
    # isolate named pattern
    subtype = m.group('subtype')
    # add pattern/count to dictionary
    if subtype in subtype_counter:
        subtype_counter[subtype] += 1
    else:
        subtype_counter[subtype] = 1
        
print(subtype_counter)

{'H3N2': 2, 'H12N1': 1}
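
The loop above assumes every strain name ends in a subtype. If one does not, `search` returns `None` and `m.group(...)` raises an `AttributeError`. A possible guard is sketched here, using a made-up fourth strain name that lacks a subtype and a hypothetical `unparsed` list to collect such names:

```python
import re

strains = ['A/New York/3/1994 (H3N2)',
           'A/California/3/X/2003 (H12N1)',
           'A/Perth/2009 (H3N2)',
           'A/Example/1/2000']  # made-up name with no subtype

strainmatch = re.compile(r'\((?P<subtype>H\d+N\d+)\)$')

subtype_counter = {}  # dict to store the results
unparsed = []         # names that did not match the pattern

for strain in strains:
    m = strainmatch.search(strain)
    if m is None:
        # nothing matched, so remember the name and move on
        unparsed.append(strain)
        continue
    subtype = m.group('subtype')
    # dict.get returns 0 the first time a subtype is seen
    subtype_counter[subtype] = subtype_counter.get(subtype, 0) + 1

print(subtype_counter)
print('could not parse:', unparsed)
```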


There are lots of handy special codes in the Python regular expression module (see [here](https://docs.python.org/3.7/library/re.html)), and you can use them to do almost any type of string matching.

I like to test my regular expressions using [this website](https://regex101.com).

## Using regular expressions to parse barcodes
Now we will use regular expressions to parse barcodes from nucleotide sequences.
For instance, you might have to do this in a single-cell RNA-seq experiment where there is a barcode at the end of each read telling you the cell that the read came from.

Imagine that our valid molecules should have sequences like this:

`CTAGCNNNNNNGATCA`

See how there is a 6-nucleotide barcode in the center of the sequence.
We have a list of sequences, and want to parse through them to figure out which ones meet the expected pattern--and get the barcode from those that do:

In [None]:
seqs = ['CTAGCatcgatGATCA',  # has barcode ATCGAT
        'CCAGCatagcaGATCA',  # does not have expected 5' sequence
        'CTAGCtacagGATCA',   # barcode too short
        'CTAGCgaccatGATCA',  # has barcode GACCAT
        'CTAGCatcgatGATCA',  # has barcode ATCGAT
        'CTAGCatcgatGGTCA',  # does not have expected 3' sequence
        ]

Write a function that parses these barcoded sequences and gets the ones with valid barcodes.
In doing this, note that:

  1. If you have a string `s`, `s.upper()` makes it all uppercase.
  2. The `match` function finds a match at the beginning of the string (whereas the `search` function we were using before searches for matches throughout the whole string)
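
The two notes above can be checked on toy strings before starting the exercise. This is only a sketch with made-up inputs; it does not use the barcode sequences:

```python
import re

digits = re.compile(r'\d+')

# match only looks at the beginning of the string
print(digits.match('2009/Perth'))    # a match object for '2009'
print(digits.match('Perth/2009'))    # None, the string starts with a letter

# search scans the whole string
print(digits.search('Perth/2009').group())

# upper() makes the case uniform before matching
print('aCgT'.upper())
```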
    
Below I've written the function documentation; try to implement it.
__Take a few minutes in groups to work through this.__

In [None]:
def count_barcodes(seqs, bclen=6, upstream='CTAGC', downstream='GATCA'):
    """Parse and count barcodes.
    
    Parameters
    ----------
    seqs : list
        DNA sequences.
    bclen : int
        Length of barcode
    upstream : str
        Sequence upstream of barcode.
    downstream : str
        Sequence downstream of barcode.
        
    Returns
    -------
    dict
        Keyed by each valid barcode, value is number of times the barcode
        is observed.
        
    Note
    ----
    The function is **not** case-sensitive, and all barcodes are reported
    in upper-case.
    
    """
    
    # your code here ...
    
    return None


Run the function once you've implemented it. Does it give the right result?

In [None]:
count_barcodes(seqs)