# Lecture 08: Practical analyses in Python

## Dictionaries
Python defines a powerful type called a dictionary or `dict` (which Phil introduced last week).

### Mutable versus immutable types
Before using dictionaries even more, we need to understand the difference between [mutable and immutable types](https://medium.com/@meghamohan/mutable-and-immutable-side-of-python-c2145cf72747).

Let's review the types we've learned so far:
#### Integers

In [None]:
x_int = 1

type(x_int)

An integer is __immutable__: modifying `y_int` binds it to a new integer object and leaves `x_int` unchanged:

In [None]:
y_int = x_int
y_int += 1
print(f"y_int = {y_int}, x_int = {x_int}")

#### Floats

In [None]:
x_float = 3.7

type(x_float)

Floats are also __immutable__:

In [None]:
y_float = x_float
y_float += 1
print(f"y_float = {y_float}, x_float = {x_float}")

#### Lists

In [None]:
x_list = [1, 2, 3]

type(x_list)

But lists are __mutable__:

In [None]:
y_list = x_list
y_list.append(4)

print(f"y_list = {y_list}, x_list = {x_list}")

Notice how the value of `x_list` has changed too! Because lists are mutable, `y_list = x_list` makes both names refer to the same list object, and `append` changes that object in place instead of creating a new one.
Be careful with mutable objects!
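If you want a list that can be changed without affecting the original, you need to copy it explicitly. The cell below is a minimal sketch using `list.copy()`; the names `alias` and `independent` are made up for this illustration.

```python
x_list = [1, 2, 3]

# plain assignment creates a second name for the same list object
alias = x_list
# list.copy() creates a new list with the same elements
independent = x_list.copy()

alias.append(4)        # also visible through x_list
independent.append(5)  # does not affect x_list

print(f"x_list = {x_list}, alias = {alias}, independent = {independent}")
print(f"alias is x_list: {alias is x_list}")
print(f"independent is x_list: {independent is x_list}")
```

The `is` operator tests whether two names refer to the same object, which is a quick way to check for this kind of sharing.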

There is an immutable object that has some similar properties to a list called a `tuple`, but we aren't going to go into those right now.

#### Strings

In [None]:
x_str = 'hello'

type(x_str)

Strings are __immutable__:

In [None]:
y_str = x_str
y_str += ' friend'

print(f"y_str = {y_str}, x_str = {x_str}")

### What is a dictionary?
Dictionaries are "look up tables" that can be used to map keys to values:

In [None]:
color_dict = {'best_color': 'green',
              'worst_color': 'brown'}

color_dict

This dictionary maps the best and worst colors to their names.
We can look up the value for a key:

In [None]:
color_dict['best_color']

Note that we get an error (a `KeyError`) if the key doesn't exist:

In [None]:
color_dict['second_best_color']
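If you are not sure whether a key is present, you can check first with `in`, or use `dict.get`, which returns a default value instead of raising an error. A short sketch (the fallback string `'unknown'` is just an example):

```python
color_dict = {'best_color': 'green',
              'worst_color': 'brown'}

key = 'second_best_color'

# `in` tests whether a key is present without raising an error
if key in color_dict:
    print(color_dict[key])
else:
    print(f"{key} is not in the dictionary")

# dict.get returns None for a missing key unless a default is given
missing = color_dict.get(key)
fallback = color_dict.get(key, 'unknown')
print(missing, fallback)
```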

We can get all of the keys:

In [None]:
color_dict.keys()

Or the values:

In [None]:
color_dict.values()

Or the items, which are tuples with keys and values:

In [None]:
color_dict.items()
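Because each item is a `(key, value)` tuple, a common idiom is to loop over `items()` and unpack both parts at once. A small sketch:

```python
color_dict = {'best_color': 'green',
              'worst_color': 'brown'}

# each item is a (key, value) tuple, unpacked here into two names
for key, value in color_dict.items():
    print(f"{key} -> {value}")
```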

We can add entries to a dictionary like this:

In [None]:
color_dict['second_worst_color'] = 'blue'

color_dict

Note that the above cell shows that dictionaries are __mutable__, as we changed the dictionary by adding to it.

An advantage of a dictionary is that it is very fast to look up key / value pairs even if there are lots of entries in the dictionary.

So why did I make a big deal about mutable versus immutable?
Dictionary keys must be hashable, which for the built-in types covered here means they must be immutable.
So keying a dictionary with a mutable type (such as a `list`) is not allowed:

In [None]:
color_dict[['second_best_color', 'third_best_color']] = ['yellow', 'pink']
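If you do need a key made of several parts, an immutable `tuple` works where a `list` does not. The sketch below uses a separate, hypothetical dictionary called `ranked_colors` so that `color_dict` is left untouched:

```python
# hypothetical dictionary used only for this illustration
ranked_colors = {}

# a tuple is immutable, so it can be used as a key
ranked_colors[('second_best_color', 'third_best_color')] = ['yellow', 'pink']

# a list is mutable, so using it as a key raises a TypeError
try:
    ranked_colors[['second_best_color', 'third_best_color']] = ['yellow', 'pink']
except TypeError as err:
    print(f"TypeError: {err}")

print(ranked_colors)
```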

But dictionary values can be mutable:

In [None]:
color_dict['second_and_third_best_colors'] = ['yellow', 'pink']

color_dict

## Using a dictionary in a function
Now we will use a dictionary to implement a function that gets the reverse complement of a sequence.

Here are the desired parameters and outputs of the function. Can you write this function?
_Hint: use a dictionary within the function to store the complement of each possible nucleotide_

__Take a few minutes in groups to work through this.__

In [None]:
def reverse_complement(seq):
    """Get reverse complement of a DNA sequence.
    
    Parameters
    -----------
    seq : str
        Uppercase DNA sequence.
        
    Returns
    -------
    str
        Reverse complement of the sequence in upper case.
        
    Example
    --------
    >>> reverse_complement('ATGCAC')
    'GTGCAT'
    
    """
    
    # your code here...
    
    return None

Once the function is written, you can use it:

In [None]:
reverse_complement('ATGCAC')

In [None]:
reverse_complement('aTGCTAAAAGTTCAGGATACAGGTAAN')
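For comparison after you have tried it yourself, here is one possible way to write such a function. It is a sketch rather than the only answer: the name `reverse_complement_sketch` is made up, and converting the input to upper case and mapping `N` to `N` are assumptions that go beyond the docstring above.

```python
def reverse_complement_sketch(seq):
    """One possible implementation (hypothetical name, for comparison)."""
    # dictionary mapping each nucleotide to its complement
    # (assumption: an ambiguous N stays N)
    complement = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C', 'N': 'N'}
    # walk the upper-cased sequence backwards, complementing each nucleotide
    return ''.join(complement[nt] for nt in reversed(seq.upper()))


print(reverse_complement_sketch('ATGCAC'))
```

Any character that is not a key in `complement` raises a `KeyError`, which is one simple way of rejecting invalid sequences.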

## Regular expressions
The `re` module is for "regular expressions". 
These are very useful for parsing strings.

A regular expression is a sequence of characters that forms a search pattern.
They can be used to check if a string contains the specified search pattern.

The `re` module offers a set of functions that allow us to search a string for a match. 
The main function we will be using is `search`, which finds a match at any position in the string.
We can search for a pattern in a string like this:

```
import re
string = "This is an example string"

# compile the search pattern
search_pattern = re.compile("search pattern here")

# search for the search pattern in the string
search_pattern.search(string)
```

Here are some common elements to have in your search pattern:
* letter characters, which match that literal letter in the string (e.g. `A`, `B`, `C`, ...)
* special characters (such as `(`, `)`, or `.`), which must be preceded by a `\` to be matched literally
* `\d`, which matches any digit (`0`-`9`)

You may also want to add the following customizations:
* `[]` specifies a set of characters to search for (e.g. `[a-n]`)
* `()` captures and groups everything contained inside, so the enclosed pattern is treated as a single unit
* `(?P<name>...)` is a group that can be referred to by the name `name`
* `+` specifies one or more occurrences of the preceding pattern element
* `{n}` specifies exactly `n` occurrences of the preceding pattern element
* `$` specifies the end of the string
* `^` specifies the beginning of the string
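To see several of these elements working together, here is a small sketch. The pattern and the test strings are made up for illustration; the pattern looks for lower-case letters from `a` to `n`, an underscore, and exactly four digits, and nothing else.

```python
import re

# ^ and $ anchor the pattern to the beginning and end of the string,
# [a-n] is a set of characters, + means one or more,
# \d{4} means exactly four digits, (?P<name>...) names a group
pattern = re.compile(r'^(?P<word>[a-n]+)_(?P<year>\d{4})$')

m = pattern.search('back_1994')
print(m.group('word'), m.group('year'))

# no match: 'z' is outside the set [a-n], so search returns None
print(pattern.search('zebra_1994'))

# no match: only three digits where exactly four are required
print(pattern.search('back_199'))
```

Writing the pattern as a raw string (`r'...'`) keeps Python from interpreting the backslashes before they reach the `re` module.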


Here is a common example dealing with influenza.
You download some strains from the database, and they have names that look like this:

In [None]:
strain1 = 'A/New York/3/1994 (H3N2)'
strain2 = 'A/California/3/X/2003 (H12N1)'
strain3 = 'A/Perth/2009 (H3N2)'

strains = [strain1, strain2, strain3]

You want to get some information out of these, like the subtype.
Let's build a regular expression that gets the subtype out of `strain2`:

In [None]:
import re

# compile a re for the subtype
strainmatch = re.compile(
        r'\(H\d+N\d+\)$')  # raw string so the backslashes reach re unchanged

# search for the search pattern in the string
strainmatch.search(strain2)

Now, let's extend that a bit and build a regular expression that gets only the subtype out and then use a dictionary to count how many sequences there are of each subtype:

In [None]:
# compile a re for the subtype (with a named search pattern)
strainmatch = re.compile(
        r'\((?P<subtype>H\d+N\d+)\)$')  # raw string, as before
    
subtype_counter = {}  # dict to store the results

for strain in strains:  # loop over all strains
    # search for re in each strain
    m = strainmatch.search(strain)
    # isolate named pattern
    subtype = m.group('subtype')
    # add pattern/count to dictionary
    if subtype in subtype_counter:
        subtype_counter[subtype] += 1
    else:
        subtype_counter[subtype] = 1
        
print(subtype_counter)
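The loop above assumes that every strain name matches the pattern. If a name does not match, `search` returns `None` and `m.group(...)` fails. The sketch below shows one way to skip such names, and uses `collections.Counter` from the standard library in place of the `if` / `else` bookkeeping. The names `example_strains`, `subtype_re`, and `subtype_counts`, and the strain without a subtype, are made up for this illustration.

```python
import re
from collections import Counter

example_strains = ['A/New York/3/1994 (H3N2)',
                   'A/California/3/X/2003 (H12N1)',
                   'A/Perth/2009 (H3N2)',
                   'A/unknown/2010']  # hypothetical name with no subtype

subtype_re = re.compile(r'\((?P<subtype>H\d+N\d+)\)$')

subtype_counts = Counter()  # missing keys start at a count of 0
for strain in example_strains:
    m = subtype_re.search(strain)
    if m is None:  # no subtype found, so skip this strain
        continue
    subtype_counts[m.group('subtype')] += 1

print(subtype_counts)
```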

There are lots of handy special codes in the Python regular expression module (see [here](https://docs.python.org/3.7/library/re.html)), and you can use them to do almost any type of string matching.

I like to test my regular expressions using [this website](https://regex101.com).

## Using regular expressions to parse barcodes
Now we will use regular expressions to parse barcodes from nucleotide sequences.
For instance, you might have to do this in a single-cell RNA-seq experiment where there is a barcode at the end of each read telling you the cell that the read came from.

Imagine that our valid molecules should have sequences like this:

`CTAGCNNNNNNGATCA`

See how there is a 6-nucleotide barcode in the center of the sequence.
We have a list of sequences, and want to parse through them to figure out which ones meet the expected pattern--and get the barcode from those that do:

In [None]:
seqs = ['CTAGCatcgatGATCA',  # has barcode ATCGAT
        'CCAGCatagcaGATCA',  # does not have expected 5' sequence
        'CTAGCtacagGATCA',   # barcode too short
        'CTAGCgaccatGATCA',  # has barcode GACCAT
        'CTAGCatcgatGATCA',  # has barcode ATCGAT
        'CTAGCatcgatGGTCA',  # does not have expected 3' sequence
        ]

Write a function that parses these barcoded sequences and gets the ones with valid barcodes.
In doing this, note that:

  1. If you have a string `s`, `s.upper()` makes it all uppercase.
  2. The `match` function finds a match at the beginning of the string (whereas the `search` function we were using before searches for matches throughout the whole string)
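Here is a short sketch of the difference between `match` and `search`, using a made-up sequence and the downstream sequence `GATCA` as the pattern:

```python
import re

pattern = re.compile('GATCA')
seq = 'CTAGCatcgatGATCA'.upper()  # upper() makes the string all uppercase

# search looks for the pattern anywhere in the string
found_anywhere = pattern.search(seq)
print(found_anywhere)

# match only looks at the beginning of the string, so this is None
found_at_start = pattern.match(seq)
print(found_at_start)

# match succeeds when the string starts with the pattern
print(pattern.match('GATCAxx'))
```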
    
Below I've written the function documentation; try to implement it.
__Take a few minutes in groups to work through this.__

In [None]:
def count_barcodes(seqs, bclen=6, upstream='CTAGC', downstream='GATCA'):
    """Parse and count barcodes.
    
    Parameters
    ----------
    seqs : list
        DNA sequences.
    bclen : int
        Length of barcode
    upstream : str
        Sequence upstream of barcode.
    downstream : str
        Sequence downstream of barcode.
        
    Returns
    -------
    dict
        Keyed by each valid barcode, value is number of times the barcode
        is observed.
        
    Note
    ----
    The function is **not** case-sensitive, and all barcodes are reported
    in upper-case.
    
    """
    
    # your code here ...
    
    return None


Run the function once you've implemented it. Does it give the right result?

In [None]:
count_barcodes(seqs)
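For comparison after you have tried it yourself, here is one possible approach. It is a sketch under two assumptions that the documentation above does not state: barcodes contain only `A`, `C`, `G`, and `T`, and nothing is allowed after the downstream sequence (hence the `$`). The name `count_barcodes_sketch` and the list `example_seqs` are made up for this illustration.

```python
import re


def count_barcodes_sketch(seqs, bclen=6, upstream='CTAGC', downstream='GATCA'):
    """One possible implementation (hypothetical name, for comparison)."""
    # build the pattern from the arguments; the doubled braces in the
    # f-string produce a literal {bclen}, e.g. [ACGT]{6}
    pattern = re.compile(
        f"{upstream}(?P<barcode>[ACGT]{{{bclen}}}){downstream}$")
    counts = {}
    for seq in seqs:
        # match anchors at the beginning, $ anchors at the end
        m = pattern.match(seq.upper())
        if m:
            barcode = m.group('barcode')
            counts[barcode] = counts.get(barcode, 0) + 1
    return counts


example_seqs = ['CTAGCatcgatGATCA',  # has barcode ATCGAT
                'CCAGCatagcaGATCA',  # does not have expected 5' sequence
                'CTAGCtacagGATCA',   # barcode too short
                'CTAGCgaccatGATCA',  # has barcode GACCAT
                'CTAGCatcgatGATCA',  # has barcode ATCGAT
                'CTAGCatcgatGGTCA',  # does not have expected 3' sequence
                ]

print(count_barcodes_sketch(example_seqs))
```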

## Biopython
[Biopython](https://biopython.org/) is a package that has lots of useful functions for computational biology.

It is very handy for things like reading in sequences in many different formats: [Bio.SeqIO](https://biopython.org/wiki/SeqIO) is your friend!

(Do note that if you are analyzing truly large datasets, `Biopython` is not very fast and you may want to use something like [pysam](https://pysam.readthedocs.io/en/latest/api.html); but `Biopython` is a good starting point).

### Reading in a file
I have included the file [barcodes_R1.fastq](barcodes_R1.fastq), which has some FASTQ sequences in it.

First, let's just see what the beginning of that file looks like:

In [None]:
! head -n 8 barcodes_R1.fastq

Now let's use `Biopython` to read the FASTQ entries.

First, import `Bio.SeqIO`:

In [None]:
import Bio.SeqIO

Now read in the sequencing reads:

In [None]:
seqreads = list(Bio.SeqIO.parse('barcodes_R1.fastq', format='fastq'))

How many reads were there?

In [None]:
print(f"Found {len(seqreads)} sequencing reads.")

Let's look at the first read:

In [None]:
seqreads[0]

You can see that it has a lot of information, including the id, name, description, etc.

For our purposes, we will just convert the sequence part into a string for each sequence:

In [None]:
seqreads_str = []
for seqrecord in seqreads:
    seqreads_str.append(str(seqrecord.seq))

Make sure we still have the same number of sequencing reads, and look at the first one:

In [None]:
assert len(seqreads_str) == len(seqreads)

seqreads_str[0]

## A real biological analysis: parsing barcodes
The reads that we just read as `seqreads_str` come from a real sequencing run of influenza virus HA and NA genes.

The sequences are as follows:

    5'-[end of HA]-AGGCGGCCGC-[16 X N]-3'
    
or 

    5'-[end of NA]-AGGCGGCCGC-[16 X N]-3'
    
The end of NA is:

    ...CACGATAGATAAATAATAGTGCACCAT
    
The end of HA is:

    ...CCGGATTTGCATATAATGATGCACCAT
    
The sequencing run reads from the reverse end of the molecules, so the first thing in the sequencing reads is the barcode followed by the constant sequence and the end of HA or NA.

We want to determine which reads have valid sequences, get the barcodes out of strings, figure out if the barcode matches to HA or NA, and count the barcodes.
So this requires setting up an analysis that does the following:

 1. Get the reverse complement of each read.
 2. Determine if it matches the expected pattern for HA and NA, and if so which one.
 3. If it matches, extract the barcode and add it to a dictionary to keep track of counts.
 
Can you write code that does this? __We will work on this in groups on Thursday__
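
As a starting point for that exercise, the three steps can be sketched on a few synthetic reads. Everything here is an illustration and not the real analysis: the reads in `toy_reads` are made up (built from the sequences given above), the helper `revcomp` is a hypothetical stand-in for your own `reverse_complement`, and requiring only the last 10 nucleotides of each gene end (`n_end`) is an assumption that you may want to change for real reads.

```python
import re


def revcomp(seq):
    """Reverse complement (hypothetical helper for this sketch)."""
    complement = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C', 'N': 'N'}
    return ''.join(complement[nt] for nt in reversed(seq.upper()))


# gene ends and constant sequence as given above
gene_ends = {'HA': 'CCGGATTTGCATATAATGATGCACCAT',
             'NA': 'CACGATAGATAAATAATAGTGCACCAT'}
constant = 'AGGCGGCCGC'
bclen = 16
n_end = 10  # assumption: require only the last 10 nt of each gene end

# one pattern per gene: [gene end][constant][barcode] at the end of the string
patterns = {}
for gene, end in gene_ends.items():
    patterns[gene] = re.compile(
        f"{end[-n_end:]}{constant}(?P<barcode>[ACGT]{{{bclen}}})$")

# synthetic reads (not real data): reverse complements of valid molecules
toy_reads = [
    revcomp(gene_ends['HA'] + constant + 'ACGTACGTACGTACGT'),
    revcomp(gene_ends['NA'] + constant + 'TTTTGGGGCCCCAAAA'),
    revcomp(gene_ends['HA'] + constant + 'ACGTACGTACGTACGT'),
    'A' * 40,  # matches neither pattern
]

counts = {gene: {} for gene in gene_ends}
for read in toy_reads:
    rc = revcomp(read)                      # step 1: reverse complement
    for gene, pattern in patterns.items():  # step 2: HA or NA?
        m = pattern.search(rc)
        if m:                               # step 3: count the barcode
            barcode = m.group('barcode')
            counts[gene][barcode] = counts[gene].get(barcode, 0) + 1
            break

print(counts)
```

Keeping a separate dictionary of counts per gene makes it easy to compare the HA and NA barcodes afterwards.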