# __Lecture 08: Practical analyses in Python__

----
#### __Last week, Phil talked about:__
- Python! (e.g. data types, lists, dictionaries, flow control)
- how to define functions in Python

#### __Today we will:__
- review functions
- introduce regular expressions and the `re` package

###  
----

# __Review__: Defining new functions
We use the `def` keyword to define a new function. 

In [87]:
# functions can take inputs, called "arguments" or "parameters"
def greet(name):
    print('Nice to meet you,', name)

In [88]:
greet('Sarah')
greet('Nashwa')

Nice to meet you, Sarah
Nice to meet you, Nashwa


A function won't return an object to you unless you tell it to. Without a `return` statement, calling it gives back `None`.

In [89]:
s = greet("Maggie")

Nice to meet you, Maggie


In [90]:
print(s)

None


Functions can return results to the caller using the `return` statement.

In [91]:
def reverse_complement(seq):
    """returns the reverse complement of a nucleic acid sequence"""
    base_partner = {'A':'T', 'T':'A', 'C':'G', 'G':'C'}
    bwd = ''
    # iterate through all bases in the sequence
    for base in seq:
        # look up the complementary base in the dictionary
        pair = base_partner[base]
        # add the complementary base to the beginning of the string (reverse comp)
        bwd = pair + bwd
    return bwd

In [92]:
fwd = 'ACGGTAATGATCCTCAG'
rev = reverse_complement( fwd )

print('fwd=',fwd,'rev=',rev)

fwd= ACGGTAATGATCCTCAG rev= CTGAGGATCATTACCGT


Functions can have OPTIONAL ARGUMENTS whose DEFAULT VALUES are pre-specified in the function definition.

In [93]:
def reverse_complement(seq, unk_partner='N'):
    """Returns the reverse complement of a nucleic acid sequence
    
    Uses unk_partner as the partner of unrecognized letters
    """
    base_partner = {'A':'T', 'T':'A', 'C':'G', 'G':'C'}
    rseq = ''
    # iterate through all bases in the sequence
    for a in seq:
        # check if the base is in the dictionary
        if a in base_partner:
            # look up the complementary base in the dictionary
            pair = base_partner[a]
            # add the complementary base to the beginning of the string (reverse comp)
            rseq = pair + rseq
        else:
            rseq = unk_partner + rseq
    return rseq


In [94]:
fwd = 'ACTGTAGCxGAcTNCGAC'

reverse_complement(fwd)

'GTCGNANTCNGCTACAGT'

In [95]:
reverse_complement(fwd, unk_partner='-')

'GTCG-A-TC-GCTACAGT'

Try using `help(reverse_complement)`, `reverse_complement?`, and `reverse_complement??` to see the docstring and source code of our new function. 

In [96]:
help(reverse_complement)


Help on function reverse_complement in module __main__:

reverse_complement(seq, unk_partner='N')
    Returns the reverse complement of a nucleic acid sequence
    
    Uses unk_partner as the partner of unrecognized letters



In [97]:
reverse_complement?

Signature: reverse_complement(seq, unk_partner='N')
Docstring:
Returns the reverse complement of a nucleic acid sequence

Uses unk_partner as the partner of unrecognized letters
File:      /tmp/ipykernel_18548/2581700929.py
Type:      function


In [98]:
reverse_complement??

Signature: reverse_complement(seq, unk_partner='N')
Source:   
def reverse_complement(seq, unk_partner='N'):
    """Returns the reverse complement of a nucleic acid sequence
    
    Uses unk_partner as the partner of unrecognized letters
    """
    base_partner = {'A':'T', 'T':'A', 'C':'G', 'G':'C'}
    rseq = ''
    # iterate through all bases in the sequence

### __Practice:__

#### Write a function that uses a dictionary to count the number of times each base occurs.
Be sure to account for sequences that contain both uppercase and lowercase bases. I've provided an empty function here (with the docstring already written). Use the provided test cases to check your work.

In [99]:
def count_bases(seq, case='upper'):
    """Count the number of times each base occurs in the sequence.
    
    Parameters
    ----------
    seq : string
        DNA sequence.
    case : string, optional
        Specifies the case in which to count the bases. Default is 'upper'.
        
    Returns
    -------
    dict
        Keyed by each nucleotide, value is number of times the nucleotide
        is observed in the sequence.
    
    """
    
    return None

In [100]:
count_bases('AATCGGCT')

In [101]:
count_bases('aatTGGcT')

# __Regular expressions__
The `re` module is for "regular expressions". 
These are very useful for parsing strings.

A __regular expression__ is a sequence of characters that forms a __search pattern__.
They can be used to check if a string contains the specified search pattern.

The `re` module offers a set of functions for searching a string for a match.
The main function we will use is `search`, which finds a match at any position in the string.
We can search for a pattern in a string like this:

```
import re
string = "This is an example string"

# compile the search pattern
search_pattern = re.compile("search pattern here")

# search for the search pattern in the string
search_pattern.search(string)
```
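
As a minimal sketch of what `search` gives back (the pattern `"example"` is just an illustration), note that it returns a match object when the pattern is found and `None` when it is not, so it is worth checking before using the result:

```python
import re

text = "This is an example string"

# compile a pattern that matches the literal word "example"
pattern = re.compile("example")

match = pattern.search(text)
if match is not None:
    # group() is the matched text, span() is its (start, end) position
    print(match.group(), match.span())   # example (11, 18)

# search returns None when the pattern is not found
no_match = re.compile("missing").search(text)
print(no_match)                          # None
```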

#### Here are some common elements to have in your __search pattern__:
* letter characters, which match where the string contains the specified letter (e.g. `A`, `B`, `C`, ...)
* special characters, which match where the string contains the specified special character; these must be preceded by a `\` (e.g. `\.`, `\(`)
* `\d`, which matches a single digit (`0`-`9`)

#### You may also want to add the following search pattern _customizations_:
* `.` matches any single character
* `[]` specifies a set of characters to search for (e.g. `[a-n]`)
* `()` captures and groups everything contained inside, so the enclosed pattern is treated as a single unit
* `(?P<name>...)` creates a group named `name`, whose match can be retrieved by that name
* `+` specifies one or more occurrences of the preceding pattern element
* `{n}` specifies exactly `n` occurrences of the preceding pattern element
* `$` specifies the end of the string
* `^` specifies the beginning of the string
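
Here is a small sketch that tries several of these elements on one string. The label `plate7_wellB12.txt` is made up for illustration, and I use `re.search(pattern, string)`, which is a shortcut for compiling the pattern and then calling `search`:

```python
import re

# hypothetical sample label, made up for illustration
label = "plate7_wellB12.txt"

# \d+ : one or more digits
print(re.search(r"\d+", label).group())          # 7

# [A-H] : a set of characters; \d{2} : exactly two digits
print(re.search(r"[A-H]\d{2}", label).group())   # B12

# . matches any character, so escape it (\.) to match a literal dot;
# $ anchors the pattern to the end of the string
print(re.search(r"\.txt$", label).group())       # .txt

# ^ anchors to the start; (?P<name>...) names each captured group
m = re.search(r"^plate(?P<plate>\d+)_well(?P<well>[A-H]\d{2})", label)
print(m.group("plate"), m.group("well"))         # 7 B12
```

The `r` in front of each pattern makes it a raw string, so Python passes backslashes like `\d` to `re` unchanged.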


Here is a common example dealing with influenza.
You download some strains from the database, and they have names that look like this:

In [102]:
strain = 'A/New York/3/1994 (H3N2)'

Say you want to get some information out of these, like the year. Let's build a regular expression that gets the year out of `strain`:

In [103]:
import re

# compile a re for the year
yearmatch = re.compile(r'\d{4}')

# search for the search pattern in the string
yearmatch.search(strain)


<re.Match object; span=(13, 17), match='1994'>

Sometimes you need extra elements in the search pattern to pin down the right match, but you don't want them in the text you extract. In this case, wrap the part you want in a named group:

In [104]:
# compile a named re for the year
named_yearmatch = re.compile(r'/(?P<year>\d{4})')

# search for the search pattern in the string
named_search = named_yearmatch.search(strain)

# isolate named pattern
year = named_search.group('year')

print(year)


1994
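
In practice you will usually have many strain names, and some may not match at all. The following is a rough sketch of how the same named group could be applied to a list; the list `strains` is made up for illustration, and its last entry is deliberately malformed:

```python
import re

# illustrative strain names; the last one is deliberately malformed
strains = ['A/New York/3/1994 (H3N2)',
           'A/California/07/2009 (H1N1)',
           'unknown strain']

# a slash followed by exactly four digits, with the digits named "year"
year_re = re.compile(r'/(?P<year>\d{4})')

years = []
for name in strains:
    m = year_re.search(name)
    if m is None:
        # no match: record a missing value instead of raising an error
        years.append(None)
    else:
        years.append(int(m.group('year')))

print(years)   # [1994, 2009, None]
```

Calling `.group()` on `None` raises an `AttributeError`, which is why the check comes first.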


There are lots of handy special codes in the Python regular expression module (see [here](https://docs.python.org/3.7/library/re.html)), and you can use them to do almost any type of string matching.

I like to test my regular expression calls using [this website](https://regex101.com)

### __Practice:__ 

#### Let's build a regular expression that gets the subtype (e.g. `H3N2`) out of the `strain` object

_Hints:_
- _Be sure to name the search pattern `subtype`_
- _Include the literal parentheses in your search pattern, but keep them outside the named group_

In [105]:
# write your code here...

### __Practice:__

#### Using regular expressions to parse barcodes

Now we will use regular expressions to parse barcodes from nucleotide sequences.
For instance, you might have to do this in a single-cell RNA-seq experiment where there is a barcode at the end of each read telling you the cell that the read came from.

Imagine that our valid molecules should have sequences like this:

`CTAGCNNNNNNGATCA`

See how there is a 6-nucleotide barcode in the center of the sequence.
We have a list of sequences, and want to parse through them to figure out which ones meet the expected pattern--and get the barcode from those that do:

In [106]:
seqs = ['CTAGCatcgatGATCA',  # has barcode ATCGAT
        'CCAGCatagcaGATCA',  # does not have expected 5' sequence
        'CTAGCtacagGATCA',   # barcode too short
        'CTAGCgaccatGATCA',  # has barcode GACCAT
        'CTAGCatcgatGATCA',  # has barcode ATCGAT
        'CTAGCatcgatGGTCA',  # does not have expected 3' sequence
        ]

Write a function that parses these barcoded sequences and gets the ones with valid barcodes.
In doing this, note that:

  1. If you have a string `s`, `s.upper()` makes it all uppercase.
  2. You can access elements of a dictionary using a key value like this: `dictionary_name[key_name]`
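
As a hint, here is a sketch of the two ideas above combined with a regular expression built from variables. The flanking sequences, barcode length, and read are hypothetical and deliberately different from the exercise, and this is only one possible approach:

```python
import re

# hypothetical flanking sequences and read, chosen to differ from the exercise
up, down, n = 'AAGG', 'TTCC', 4
read = 'AAGGacgtTTCC'

# build the pattern from variables; {{ and }} are literal braces in an f-string,
# so this becomes AAGG(?P<bc>[ACGT]{4})TTCC
bc_re = re.compile(f'{up}(?P<bc>[ACGT]{{{n}}}){down}', flags=re.IGNORECASE)

counts = {}
# fullmatch succeeds only if the whole read fits the pattern
m = bc_re.fullmatch(read)
if m is not None:
    bc = m.group('bc').upper()
    # dict.get returns 0 the first time a barcode is seen
    counts[bc] = counts.get(bc, 0) + 1

print(counts)   # {'ACGT': 1}
```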
    
Below I've written the function documentation; try to implement it.
__Take a few minutes in groups to work through this.__

In [107]:
def count_barcodes(seqs, bclen=6, upstream='CTAGC', downstream='GATCA'):
    """Parse and count barcodes.
    
    Parameters
    ----------
    seqs : list
        DNA sequences.
    bclen : int
        Length of barcode
    upstream : str
        Sequence upstream of barcode.
    downstream : str
        Sequence downstream of barcode.
        
    Returns
    -------
    dict
        Keyed by each valid barcode, value is number of times the barcode
        is observed.
        
    Note
    ----
    The function is **not** case-sensitive, and all barcodes are reported
    in upper-case.
    
    """
    
    # your code here ...
    
    return None


Run the function once you've implemented it. Does it give the correct result?

In [108]:
count_barcodes(seqs)