# __Lecture 08: Practical analyses in Python__

----
#### __Last week, Phil talked about:__
- Python! (e.g. data types, lists, dictionaries, for loops, if statements, etc.)

#### __Today we will:__
- review for loops and if statements
- learn how to define functions in Python
- introduce regular expressions and the `re` module

----

# __Review__: Using `for` loops and `if` statements
#### Python `for` loops let you repeat a block of code while changing the value of a _looping variable_

In [None]:
seq = "ATGCTC"

# base pairing dictionary
base_partner = {'A':'T', 'T':'A', 'C':'G', 'G':'C'}

# we can iterate through all bases in the sequence
for base in seq:
    # look up the complementary base in the dictionary
    pair = base_partner[base]
    print(pair)

#### Python `if` statements let you choose between two (or more) outcomes based on a boolean expression 

In [None]:
seq2 = "ATGNCT"

# base pairing dictionary
base_partner = {'A':'T', 'T':'A', 'C':'G', 'G':'C'}

for base in seq2:
    # check if the base is in the dictionary
    if base in base_partner:
        # look up the complementary base in the dictionary
        pair = base_partner[base]
        print(pair)
    else:
        print("Unknown nucleotide")

----
# __New:__ Using `break` and `continue` statements
#### The `break` statement exits a loop entirely, while `continue` skips directly to the next iteration

In [None]:
seq3 = "AAGCNT"

# base pairing dictionary
base_partner = {'A':'T', 'T':'A', 'C':'G', 'G':'C'}

for base in seq3:
    if base == "A":  # skip every A (an arbitrary choice for this demo)
        continue
    
    if base == "N":
        break
        
    # look up the complementary base in the dictionary
    pair = base_partner[base]
    print(pair)
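
To make the difference concrete, here is a minimal sketch that records which bases actually get processed. The names `seq_demo` and `processed` are illustrative assumptions, not part of the lecture code; the sequence has the same letters as `seq3`.

```python
seq_demo = "AAGCNT"  # illustrative name; same letters as seq3
base_partner = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}

processed = []  # hypothetical list collecting the complemented bases
for base in seq_demo:
    if base == "A":
        continue  # skip this base and move on to the next one
    if base == "N":
        break     # leave the loop; the final T is never reached
    processed.append(base_partner[base])

print(processed)  # ['C', 'G']
```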

----
# __Defining new functions__
#### We use the `def` keyword to define a new function. Functions can take inputs called _"arguments"_ or _"parameters."_

In [None]:
def print_sum(a, b):
    print(a + b)

# This just displays the result, but you can't save it
print_sum(3, 5)

#### `return` statements
In Python, if you want a function to give back a result that can be saved and used later, you need to include a `return` statement. This is different from just using `print()`, which only displays the output but doesn't allow you to save it as an object. __For example:__

In [33]:
def add_numbers(a, b):
    return a + b

# This lets you store the result in a variable
result = add_numbers(3, 5)

In [None]:
print(result)
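
One way to see the difference is to try saving the output of both functions. The sketch below redefines the two functions from above so that it runs on its own; the variable names `printed` and `returned` are assumptions for illustration.

```python
def print_sum(a, b):
    print(a + b)   # displays the sum but returns nothing


def add_numbers(a, b):
    return a + b   # hands the sum back to the caller


printed = print_sum(3, 5)      # prints 8 as a side effect
returned = add_numbers(3, 5)   # prints nothing

print(printed)    # None: a function without a return statement returns None
print(returned)   # 8
```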

#### Using `for` loops and `if` statements within functions

We will often use these within our functions to efficiently construct new strings, lists, or dictionaries from existing iterables. __For example:__

In [35]:
def reverse_complement(seq):
    """returns the reverse complement of a nucleic acid sequence"""
    base_partner = {'A':'T', 'T':'A', 'C':'G', 'G':'C'}
    bwd = ''
    # iterate through all bases in the sequence
    for base in seq:
        # look up the complementary base in the dictionary
        pair = base_partner[base]
        # add the complementary base to the beginning of the string (reverse comp)
        bwd = pair + bwd
    return bwd

In [None]:
fwd = 'ACGGTAATGATCCTCAG'
rev = reverse_complement(fwd)

print("forward:", fwd)
print("reverse comp:", rev)

#### Optional function arguments

Functions can have __optional arguments__ whose __default values__ are pre-specified in the function definition.

In [37]:
def reverse_complement(seq, unk_partner='N'):
    """Returns the reverse complement of a nucleic acid sequence
    
    Uses unk_partner as the partner of unrecognized letters
    """
    base_partner = {'A':'T', 'T':'A', 'C':'G', 'G':'C'}
    rseq = ''
    # iterate through all bases in the sequence
    for a in seq:
        # check if the base is in the dictionary
        if a in base_partner:
            # look up the complementary base in the dictionary
            pair = base_partner[a]
            # add the complementary base to the beginning of the string (reverse comp)
            rseq = pair + rseq
        else:
            rseq = unk_partner + rseq
    return rseq


In [None]:
fwd = 'ACTGTAGCxGAcTNCGAC'

reverse_complement(fwd)

In [None]:
reverse_complement(fwd, unk_partner='-')

#### Viewing the function docstring and source code
In Jupyter, try `reverse_complement??` to see the docstring and source code of our new function.

In [None]:
reverse_complement??
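
The `??` syntax only works in IPython and Jupyter. In a plain Python script, similar information is available through the `__doc__` attribute and the standard-library `inspect` module. The sketch below uses a condensed version of `reverse_complement` (it relies on `dict.get` instead of the `if`/`else` shown above) so that it runs on its own.

```python
import inspect


def reverse_complement(seq, unk_partner='N'):
    """Returns the reverse complement of a nucleic acid sequence

    Uses unk_partner as the partner of unrecognized letters
    """
    base_partner = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}
    rseq = ''
    for a in seq:
        # dict.get returns unk_partner when the base is not a key
        rseq = base_partner.get(a, unk_partner) + rseq
    return rseq


# the docstring is stored on the function object
print(reverse_complement.__doc__)

# the argument names and default values
print(inspect.signature(reverse_complement))  # (seq, unk_partner='N')
```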

-----
# __Practice defining functions:__

#### Write a function that uses a dictionary to count the number of times each base occurs.
Be sure to account for sequences that contain uppercase and lowercase bases. I've provided an empty function here (with the docstring already written). Use the provided test cases to check your work.

In [19]:
def count_bases(seq, case='upper'):
    """Count the number of times each base occurs in the sequence.
    
    Parameters
    ----------
    seq : string
        DNA sequence.
    case : string, optional
        Specifies the case in which to count the bases. Default is 'upper'.
        
    Returns
    -------
    dict
        Keyed by each nucleotide, value is number of times the nucleotide
        is observed in the sequence.
        
    HINT: <object>.upper() will convert the <object> string to uppercase
    
    """
    
    # your code here ...
    
    return None
                

In [30]:
count_bases('AATXXGGCT')

In [31]:
count_bases('aatTGGcT', case = "lower")

-----
#### Functions can call other functions and modules

Python has an extensive collection of built-in **modules** which contain all sorts of useful special purpose functions and objects. A few favorites:

* `math`: mathematical functions and constants
* `re`: regular expression searches.
* `os`: access to operating system routines (e.g., `os.path.exists` function to check if a file exists)
* `sys`: special variables used or maintained by the interpreter (`sys.path`, `sys.stdout`, `sys.argv`, ...)
* `random`: random number generators

[The full list is here](https://docs.python.org/3/library/)
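
As a quick taste, the sketch below makes one call to each of `math`, `os`, `random`, and `sys`. The variable names and the seed value are arbitrary choices for illustration.

```python
import math
import os
import random
import sys

# math: square root of 16
root = math.sqrt(16)

# os: check whether the current directory exists
here_exists = os.path.exists(os.curdir)

# random: pick one base; seeding makes the pick repeatable
random.seed(1)  # arbitrary seed
base = random.choice("ACGT")

# sys: major version of the running Python interpreter
major = sys.version_info.major

print(root, here_exists, base, major)
```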

In [41]:
# Here we are "importing" a module called math which contains some math-y functions like sqrt and log
import math

def log_sqrt(num):
    """Take the natural log of the square root of a number"""

    # get the square root
    sqrt_value = math.sqrt(num)

    # take the natural log
    log_value = math.log(sqrt_value)

    return log_value

In [None]:
log_sqrt(124)

---
# __New module:__  Regular expressions

#### A __regular expression__ is a sequence of characters that forms a __search pattern__.
They can be used to check if a string contains the specified search pattern.
This is particularly useful when parsing strings.

The `re` module offers a set of functions that let us search a string for a match using regular expressions.
The main function we will use is `search`, which finds a match at any position in the string.
We can search for a pattern in a string like this:

```
import re
string = "This is an example string"

# compile the search pattern
search_pattern = re.compile("search pattern here")

# search for the search pattern in the string
search_pattern.search(string)
```

#### Here are some common elements to have in your __search pattern__:

* **Specific patterns:**
    * **Specific letter characters** (e.g., `A`, `B`, `C`, ...): Matches any occurrence of the specified letter in the string.
    * **Specific numerical digits** (e.g., `1`, `2`, `3`, ...): Matches any occurrence of the specified digit.
    * **Specific special characters** (e.g., `*`, `$`, ...): Matches any occurrence of the specified special character. Special characters need to be preceded by a `\` (e.g., `\.` for a period, `\$` for a dollar sign).
  
* **General patterns:**
    * **Numerical digits**: `\d` matches any digit (numbers `0`–`9`).
    * **Any single character**: `.` matches _any_ single character.
    * **Any letter or character from a set**: `[]` allows you to specify a set of characters to search for (e.g., `[a-n]` matches any lowercase letter from `a` to `n`).
    * **One or more occurrences**: `+` specifies one or more occurrences of the preceding pattern (e.g., `a+` matches "a", "aa", etc.).
    * **Exact number of occurrences**: `{}` specifies an exact number of occurrences of the preceding pattern (e.g., `a{6}` matches exactly six "a" characters).

* **Anchors for start/end of string**:
    * **End of string**: `$` matches the end of the string (e.g., `\d$` matches a string that ends with a digit).
    * **Beginning of string**: `^` matches the beginning of the string (e.g., `^A` matches any string that starts with "A").


There are lots of handy special codes in the Python regular expression module (see [here](https://docs.python.org/3.7/library/re.html)), and you can use them to do almost any type of string matching.

You can test your regular expressions using [this website](https://regex101.com).
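
The pattern elements listed above can also be tried directly in Python. The sketch below runs one small search per element; the test strings and variable names are made up for illustration. The patterns are written as raw strings (`r"..."`) so that backslashes reach the regular expression engine unchanged.

```python
import re

# \d : any digit
digit = re.search(r"\d", "H3N2").group()               # '3'

# .  : any single character (here, whatever follows N)
any_char = re.search(r"N.", "H3N2").group()            # 'N2'

# [] and + : one or more letters from a to n
char_set = re.search(r"[a-n]+", "Xhello!").group()     # 'hell'

# {} : exactly three a's in a row
repeat = re.search(r"a{3}", "baaad").group()           # 'aaa'

# \$ : a literal dollar sign (escaped special character)
escaped = re.search(r"\$", "cost: $5").group()         # '$'

# ^ : the string must start with A
starts_with_a = re.search(r"^A", "A/New York/3/1994") is not None

# $ : the string must end with a digit (this one ends with ')')
ends_with_digit = re.search(r"\d$", "A/New York/3/1994 (H3N2)") is not None

print(digit, any_char, char_set, repeat, escaped)
print(starts_with_a, ends_with_digit)  # True False
```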

#### Example:
Here is a common example dealing with influenza.
You download some strains from the database, and they have names that look like this:

In [43]:
strain = 'A/New York/3/1994 (H3N2)'

Say you want to get some information out of these, like the year. Let's build a regular expression that gets the year out of `strain`:

In [None]:
import re

# compile a re for the year (raw string, so \d is passed to re unchanged)
yearmatch = re.compile(r'\d{4}')

# search for the search pattern in the string
match = yearmatch.search(strain)

# print the match object returned by the search
print(match)
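
Printing the match shows a match object rather than a plain string. As a minimal sketch, the matched text and its position can be pulled out with `.group()` and `.span()`. Note that `search` returns `None` when nothing matches, so it is worth checking before calling these methods. The name `no_match` and the shortened strain name are made up for illustration.

```python
import re

strain = 'A/New York/3/1994 (H3N2)'
match = re.compile(r'\d{4}').search(strain)

if match is not None:
    print(match.group())  # the matched text: '1994'
    print(match.span())   # (start, end) positions in the string: (13, 17)

# a string without four consecutive digits gives no match
no_match = re.compile(r'\d{4}').search('A/New York/3')
print(no_match)  # None
```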


#### Adding __groups__ and __names__ to search patterns

Sometimes you may want to group parts of a search pattern or create named groups for easier reference. Here's how:

* **Grouping patterns**: `()` is used to group part of your search pattern. Everything inside the parentheses is treated as a single unit, and you can capture and reference this group.
* **Named groups**: `(?P<name>...)` defines a named group with the specified name, allowing you to reference this part of the match later (e.g., `(?P<name>\d{4})` for a four-digit year).

In [None]:
# compile a named re for the year
## here, we are searching for 4 digits that are preceded by the / character
## (raw string, so \d is passed to re unchanged; / needs no escaping)
named_yearmatch = re.compile(r'/(?P<year>\d{4})')

# search for the search pattern in the string
named_search = named_yearmatch.search(strain)

# isolate named pattern
year = named_search.group('year')

print(year)
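
A single pattern can hold several groups. The sketch below is an illustration under assumptions: the group name `location` and the `[^/]+` set ("one or more characters that are not `/`", a negated set not covered in the list above) are additions for this example. Unnamed groups are retrieved by number, counting opening parentheses from left to right starting at 1.

```python
import re

strain = 'A/New York/3/1994 (H3N2)'

# location and year are named groups; the isolate number is an unnamed group
pattern = re.compile(r'/(?P<location>[^/]+)/(\d+)/(?P<year>\d{4})')
m = pattern.search(strain)

print(m.group('location'))  # New York
print(m.group(2))           # 3 (second group, retrieved by number)
print(m.group('year'))      # 1994
print(m.groupdict())        # {'location': 'New York', 'year': '1994'}
```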


----
### __Practice:__ 

#### Let's build a regular expression that gets the subtype (e.g. `H3N2`) out of the `strain` string

_Hints:_
- _Be sure to name the search pattern `subtype`_
- _Match the literal parentheses in your search pattern, but keep them outside the named group_

In [38]:
# your code here

----
### __Practice:__

#### Using regular expressions to parse barcodes

Now we will use regular expressions to parse barcodes from nucleotide sequences.
For instance, you might have to do this in a single-cell RNA-seq experiment where there is a barcode at the end of each read telling you the cell that the read came from.

Imagine that our valid molecules should have sequences like this:

`CTAGCNNNNNNGATCA`

Notice the 6-nucleotide barcode (`NNNNNN`) in the center of the sequence.
We have a list of sequences, and want to parse through them to figure out which ones meet the expected pattern, and to get the barcode from those that do:

In [35]:
seqs = ['CTAGCatcgatGATCA',  # has barcode ATCGAT
        'CCAGCatagcaGATCA',  # does not have expected 5' sequence
        'CTAGCtacagGATCA',   # barcode too short
        'CTAGCgaccatGATCA',  # has barcode GACCAT
        'CTAGCatcgatGATCA',  # has barcode ATCGAT
        'CTAGCatcgatGGTCA',  # does not have expected 3' sequence
        ]

Write a function that parses these barcoded sequences and gets the ones with valid barcodes.
In doing this, note that:

  1. If you have a string `s`, `s.upper()` makes it all uppercase.
  2. You can access elements of a dictionary using a key value like this: `dictionary_name[key_name]`
    
Below I've written the function documentation. Try to implement it.
__Take a few minutes in groups to work through this.__

In [36]:
def count_barcodes(seqs, bclen=6, upstream='CTAGC', downstream='GATCA'):
    """Parse and count barcodes.
    
    Parameters
    ----------
    seqs : list
        DNA sequences.
    bclen : int
        Length of barcode
    upstream : str
        Sequence upstream of barcode.
    downstream : str
        Sequence downstream of barcode.
        
    Returns
    -------
    dict
        Keyed by each valid barcode, value is number of times the barcode
        is observed.
        
    Note
    ----
    The function is **not** case-sensitive, and all barcodes are reported
    in upper-case.
    
    """
    
    # your code here ...
    
    return None


Run the function once you've implemented it. Does it give the correct result?

In [37]:
count_barcodes(seqs)