# __Lecture 09: Practical analyses in Python (continued)__

----
#### __Announcement:__ Homework 4 will be due on October 31

#### __On Tuesday, we talked about:__
- break and continue statements
- defining functions in python
- working with regular expressions using the `re` module

#### __Today we will:__
- continue to practice using the `re` module
- learn about (and practice) using the `biopython` module

----

## __Review:__ Regular expressions

The `re` module offers a set of functions that allow us to search a string for a match using a __search pattern__.

#### Here are some common elements to have in your __search pattern__:

* **Specific patterns:**
    * **Specific letter characters** (e.g., `A`, `B`, `C`, ...): Matches any occurrence of the specified letter in the string.
    * **Specific numerical digits** (e.g., `1`, `2`, `3`, ...): Matches any occurrence of the specified digit.
    * **Specific special characters** (e.g., `*`, `$`, ...): Matches any occurrence of the specified special character. Special characters need to be preceded by a `\` (e.g., `\.` for a period, `\$` for a dollar sign).
  
* **General patterns:**
    * **Numerical digits**: `\d` matches any digit (numbers `0`–`9`).
    * **Any single character**: `.` matches _any_ single character.
    * **Any letter or character from a set**: `[]` allows you to specify a set of characters to search for (e.g., `[a-n]` matches any lowercase letter from `a` to `n`).
    * **One or more occurrences**: `+` specifies one or more occurrences of the preceding pattern (e.g., `a+` matches "a", "aa", etc.).
    * **Exact number of occurrences**: `{}` specifies an exact number of occurrences of the preceding pattern (e.g., `a{6}` matches exactly six "a" characters).

* **Anchors for start/end of string**:
    * **End of string**: `$` matches the end of the string (e.g., `\d$` matches a string that ends with a digit).
    * **Beginning of string**: `^` matches the beginning of the string (e.g., `^A` matches any string that starts with "A").

* **Grouping and naming**:
    * **Grouping patterns**: `()` is used to group part of your search pattern. Everything inside the parentheses is treated as a single unit, and you can capture and reference this group.
    * **Named groups**: `(?P<name>...)` defines a named group with the specified name, allowing you to reference this part of the match later (e.g., `(?P<year>\d{4})` for a four-digit year).
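
As a quick refresher, here is a minimal sketch that exercises several of the elements listed above. The example string and the variable names are made up for illustration.

```python
import re

# Hypothetical example string, made up for illustration
text = 'Sample A12 collected 2023-10-24 cost $5.00'

# \d{4} : exactly four digits in a row
four_digits = re.search(r'\d{4}', text).group()
print(four_digits)    # 2023

# [A-Z]\d+ : one capital letter followed by one or more digits
label = re.search(r'[A-Z]\d+', text).group()
print(label)          # A12

# \$ and \. : escaped special characters match a literal '$' and '.'
price = re.search(r'\$\d+\.\d+', text).group()
print(price)          # $5.00

# ^ and $ : anchor the pattern to the start or end of the string
starts_with_sample = bool(re.search(r'^Sample', text))
ends_with_digit = bool(re.search(r'\d$', text))
print(starts_with_sample, ends_with_digit)    # True True

# (?P<name>...) : named groups can be pulled out of the match by name
date_match = re.search(r'(?P<year>\d{4})-(?P<month>\d{2})', text)
print(date_match.group('year'), date_match.group('month'))    # 2023 10
```

Writing patterns as raw strings (`r'...'`) keeps Python from interpreting the backslashes before the `re` module sees them.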



#### Always remember to import the `re` module if you'd like to work with regular expressions:

In [30]:
import re

## __Review practice__: using regular expressions to parse flu subtypes

You download some strains from a database, and they have names that look like this:

In [31]:
strain = 'A/New York/3/1994 (H3N2)'

Let's build a regular expression that gets the subtype (e.g. `H3N2`) out of the `strain` object.

_Hints:_
- _Be sure to name the search pattern `subtype`_
- _Include the literal parentheses in your search pattern, but keep them outside the named group_

In [None]:
# compile a named re for the subtype
## This searches for a pattern that consists of:
## - A capital letter
## - Followed by one or more digits
## - Then another capital letter
## - Followed by one or more digits
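## The entire pattern is enclosed in literal parentheses, escaped as \( and \),
## which sit outside the named group. The r'' prefix makes this a raw string,
## so Python passes the backslashes through to the regex engine unchanged.
subtype_pattern = re.compile(r'\((?P<subtype>[A-Z]\d+[A-Z]\d+)\)')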

# search for the search pattern in the string
match = subtype_pattern.search(strain)

# isolate named pattern
subtype_match = match.group('subtype')

print(subtype_match)
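
One thing to watch for: `search` returns `None` when the pattern is not found, and calling `.group()` on `None` raises an error. The sketch below wraps the same pattern in a small helper that checks for this. The helper name `get_subtype` and the second strain name are hypothetical, chosen only for illustration.

```python
import re

subtype_pattern = re.compile(r'\((?P<subtype>[A-Z]\d+[A-Z]\d+)\)')

def get_subtype(strain_name):
    """Return the subtype found in parentheses, or None if there is none."""
    match = subtype_pattern.search(strain_name)
    if match is None:
        # no parenthesized subtype in this name
        return None
    return match.group('subtype')

with_subtype = get_subtype('A/New York/3/1994 (H3N2)')
without_subtype = get_subtype('B/Lee/1940')  # made-up name with no subtype

print(with_subtype)       # H3N2
print(without_subtype)    # None
```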

----
## __Practice:__ using regular expressions to parse barcodes
Now we will use regular expressions to parse barcodes from nucleotide sequences.
For instance, you might have to do this in a single-cell RNA-seq experiment where there is a barcode at the end of each read telling you the cell that the read came from.

Imagine that our valid molecules should have sequences like this:

`CTAGCNNNNNNGATCA`

See how there is a 6-nucleotide barcode in the center of the sequence.
We have a list of sequences, and want to parse through them to figure out which ones meet the expected pattern--and get the barcode from those that do:

In [33]:
seqs = ['CTAGCatcgatGATCA',  # has barcode ATCGAT
        'CCAGCatagcaGATCA',  # does not have expected 5' sequence
        'CTAGCtacagGATCA',   # barcode too short
        'CTAGCgaccatGATCA',  # has barcode GACCAT
        'CTAGCatcgatGATCA',  # has barcode ATCGAT
        'CTAGCatcgatGGTCA',  # does not have expected 3' sequence
        ]

### __Practice part 1__
First, take a few minutes in groups to write a search pattern that will match the valid barcodes from these sequences.
When doing this, note that:

- You will want to __group__ and __name__ the barcode portion of the search pattern
- You will want to __include the fixed upstream (CTAGC) and downstream (GATCA) portions__ of the sequence in your search pattern, but not as part of the barcode subgroup

In [34]:
# your code here...

### __Practice part 2__

Now, modify your code to create a function that parses barcoded sequences and extracts those with valid barcodes. 
This requires setting up your function so that it does the following:

 1. Convert each sequence to uppercase (using `s.upper()`)
 2. Search the string for a specified search pattern
 3. Determine if it matches the expected pattern (with the correct length barcode and constant upstream/downstream sequences)
 4. If it matches, extract the barcode and add it to a dictionary to keep track of counts

A few notes to keep in mind:
- You may want to search from the start of the string (use the `^` symbol in your search pattern).
- The barcode length, upstream sequence, and downstream sequence should be passed as function arguments with default values, rather than being fixed. You’ll need to generalize your search pattern from above to accommodate these variable inputs (e.g., by adding strings and variables together).
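
To illustrate the idea of assembling a search pattern from variables, here is a minimal sketch on a deliberately different toy problem (an ID made of a fixed prefix plus a set number of digits). The names `prefix`, `ndigits`, and `id_pattern` are hypothetical, and this is only one of several ways to build the string.

```python
import re

# Hypothetical values, chosen only for illustration
prefix = 'ID'
ndigits = 4

# Add strings and variables together to build the pattern.
# With the values above this produces: ^ID(?P<number>\d{4})$
id_pattern = '^' + prefix + r'(?P<number>\d{' + str(ndigits) + '})$'
print(id_pattern)

good = re.search(id_pattern, 'ID2024')
too_short = re.search(id_pattern, 'ID24')
wrong_prefix = re.search(id_pattern, 'XX2024')

print(good.group('number'))    # 2024
print(too_short)               # None
print(wrong_prefix)            # None
```

Note that `ndigits` is an integer, so it has to be converted with `str()` before it can be added to the other strings.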

I’ve provided the function documentation below—try to implement it.
__Take a few minutes to work through this in groups.__

In [35]:
def count_barcodes(seqs, bclen=6, upstream='CTAGC', downstream='GATCA'):
    """Parse and count barcodes.
    
    Parameters
    ----------
    seqs : list
        DNA sequences.
    bclen : int
        Length of barcode
    upstream : str
        Sequence upstream of barcode.
    downstream : str
        Sequence downstream of barcode.
        
    Returns
    -------
    dict
        Keyed by each valid barcode, value is number of times the barcode
        is observed.
        
    Note
    ----
    The function is **not** case-sensitive, and all barcodes are reported
    in upper-case.
    
    """
    
    # your code here ...
    
    return None


Run the function once you've implemented it. Does it give the right result?

In [36]:
count_barcodes(seqs)

----
# __New module:__ Biopython

The [Biopython](https://biopython.org/) package has lots of useful functions for computational biology.

It is very handy for things like __reading in sequences__ in many different formats and __processing sequence data__: the subpackage [Bio.SeqIO](https://biopython.org/wiki/SeqIO) is your friend!

_(Do note that if you are analyzing truly large datasets, `Biopython` is not very fast and you may want to use something like [pysam](https://pysam.readthedocs.io/en/latest/api.html); but `Biopython` is a good starting point)._

First, we'll need to import the submodules `Bio.SeqIO` and `Bio.Seq` (the package is called Biopython, but it is imported as `Bio`):

In [37]:
import Bio.SeqIO
import Bio.Seq


#### __Reading in a file__
I have included the file [barcodes_R1.fastq](barcodes_R1.fastq), which has some FASTQ sequences in it.
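
For orientation, each FASTQ record spans four lines: a header starting with `@`, the sequence, a separator line starting with `+`, and one quality character per base. The plain-Python sketch below pulls apart a single made-up record (it is not taken from `barcodes_R1.fastq`) just to show the layout. It assumes Phred+33 quality encoding, which is common but not universal.

```python
# A made-up FASTQ record, for illustration only
fastq_text = """@read1 example header
ACGTTGCA
+
IIIIHHGG
"""

lines = fastq_text.splitlines()
header = lines[0][1:]    # drop the leading '@'
sequence = lines[1]      # the nucleotide sequence
quality = lines[3]       # one quality character per base

# Assumption: Phred+33 encoding (quality score = character code minus 33)
scores = [ord(char) - 33 for char in quality]

print(header)      # read1 example header
print(sequence)    # ACGTTGCA
print(scores)      # [40, 40, 40, 40, 39, 39, 38, 38]
```

In practice there is no need to write this by hand, since a FASTQ parser takes care of the record layout and the quality decoding.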

Let's use `Biopython` to read the FASTQ entries and convert them to a list:

In [38]:
reads = Bio.SeqIO.parse('barcodes_R1.fastq', format='fastq')
seqreads = list(reads)

How many reads were there?

In [None]:
print(f"Found {len(seqreads)} sequencing reads.")

#### Reads are read as `SeqRecord` objects
[SeqRecord](https://biopython.org/wiki/SeqRecord) objects have a lot of information, including the header, quality scores, etc.

Let's look at the first read:


In [None]:
seqreads[0]

In [None]:
type(seqreads[0])


If you want to just access the sequence element of each `SeqRecord`, you can do this as follows:

In [None]:
seqreads[0].seq

#### Sequences are `Seq` objects


In [None]:
type(seqreads[0].seq)

Let's make a list of just the sequences from our `seqreads`:

In [44]:
seqreads_Seq = []
for seqrecord in seqreads:
    sequence = seqrecord.seq # isolate the sequence from the seqrecord
    seqreads_Seq.append(sequence) # add the Seq object to the list

In [None]:
seqreads_Seq[0:5]

`Seq` objects come with many built-in methods specifically for working with sequences. 
We'll mostly be using sequences as `Seq` objects, but if you need to convert a `Seq` object to a regular string (`str`) to use standard string methods, you can do so like this:

In [None]:
# let's isolate just the first sequence
seq = seqreads_Seq[0]

# convert this sequence to a string object
seq_string = str(seq)

print(seq_string)

#### Built-in `Seq` methods

`Biopython` has many useful built-in functions for working with sequencing data. 
We will discuss a few examples in class from the submodule [Bio.Seq](https://biopython.org/docs/1.75/api/Bio.Seq.html), but feel free to read about more [here](https://biopython.org/wiki/Seq).

We can use this module to get the __complement__ and __reverse complement__ of a sequence:

In [None]:
# recall we previously saved a sequence as the `seq` variable
seq

In [None]:
seq.complement()

In [None]:
seq.reverse_complement()
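
To make clear what these two operations compute, here is a plain-Python sketch on an ordinary string. It is not how Biopython implements them, and the toy sequence and variable names are made up for illustration. It also only handles uppercase `A`, `C`, `G`, and `T`.

```python
# Toy sequence, made up for illustration
dna = 'ATGCCG'

# Table that swaps each base for its pairing partner (A<->T, C<->G)
complement_table = str.maketrans('ACGT', 'TGCA')

# Complement: swap every base, keeping the order
complement = dna.translate(complement_table)

# Reverse complement: the complement read in the opposite direction
reverse_complement = complement[::-1]

print(complement)            # TACGGC
print(reverse_complement)    # CGGCAT
```

Be careful with the name: `str.translate` here is a character-by-character substitution on a string, which is unrelated to translating nucleotides into protein.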

We can use this module to __transcribe__ and __translate__ a sequence:

In [None]:
seq.transcribe()

In [None]:
seq.translate()

__If you choose to use these methods, remember that a `Seq` object is not a string. You will need to convert your sequence back to a string before using methods/functions that require strings.__

---

## __A real biological analysis: parsing barcodes__
<a id='real_analysis'></a>
The reads that we just read as `seqreads_Seq` come from a real sequencing run of influenza virus HA and NA genes.

The __actual sequences__ are as follows:

    5'-[end of HA/NA]-AGGCGGCCGC-[16 X N barcode]-3'

    
The __sequencing run reads__ from the reverse end of the molecules (in 5'>3' orientation), so the sequencing reads are as follows:

    5'-[reverse complement of 16 X N barcode]-GCGGCCGCCT-[reverse complement of the end of HA/NA]-3'
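
The relationship between the two orientations can be checked on a toy example. The sketch below uses plain Python strings and a placeholder barcode of sixteen `A`s. The variable names are hypothetical, and the `[end of HA/NA]` portion is left out for brevity.

```python
upstream = 'AGGCGGCCGC'
barcode = 'A' * 16    # placeholder barcode, for illustration only

# End of the actual molecule: constant sequence followed by the barcode
molecule_end = upstream + barcode

# Table that swaps each base for its pairing partner
table = str.maketrans('ACGT', 'TGCA')

# The sequencer reads from the other end, so the read is the reverse complement
read = molecule_end.translate(table)[::-1]
print(read)         # TTTTTTTTTTTTTTTTGCGGCCGCCT

# Taking the reverse complement of the read recovers the original orientation
recovered = read.translate(table)[::-1]
print(recovered)    # AGGCGGCCGCAAAAAAAAAAAAAAAA
```

This is why the first step of the analysis is to reverse complement each read: afterwards the constant sequence sits directly upstream of the barcode again.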




We want to determine which reads have valid sequences, get the barcodes out of strings, and count the barcodes.
So this requires setting up an analysis that does the following:

 1. Get the reverse complement of each read.
 2. Determine if it matches the expected pattern (with the correct length barcode and constant sequence)
 3. If it matches, extract the barcode and add it to a dictionary to keep track of counts.

### __Group activity__
Work together to write some code to do this.
I have created a code chunk for each step (with some parts filled in). 
Remember to run the code chunks in the correct order!

For your homework, you will be asked to extend this in-class analysis to get statistics for HA and NA separately.

In [52]:
# load necessary packages
import re
import Bio.SeqIO
import Bio.Seq

__Step 1:__ You'll need to write a function that identifies a barcode with a known upstream sequence. 
I've provided the documentation here--__try writing this function on your own.__

_Hint: we wrote a similar search pattern earlier_

_Hint 2: You can use the built-in reverse complement method_

_Hint 3: You will want to convert the sequence to a string before searching for regular expressions_

In [53]:
def read_barcode(seqread, bclen, upstream='AGGCGGCCGC'):
    """Identify barcode with known upstream sequence.
    
    Parameters
    ----------
    seqread : Seq object
        Nucleotide sequence read matching UPSTREAM-BARCODE in reverse orientation.
    bclen : int
        Length of barcode
    upstream: str
        Sequence upstream of the barcode.
        
    Returns
    -------
    str or None
        Sequence of the barcode in the forward orientation, or `None` if no match to expected barcoded sequence.
        
    Example
    -------
    >>> read_barcode(Bio.Seq.Seq('TTTTTTTTTTTTTTTTGCGGCCGCCT'), bclen=16)
    'AAAAAAAAAAAAAAAA'
        
    """
    
    # your code here ...
    
    return None

__Step 2:__ Read sequences from the barcodes_R1.fastq file and create a list of only the sequences (as Seq objects). __We already did this step earlier so you can move to step 3 after running this code chunk__

In [54]:
# run this code chunk...
seqreads = list(Bio.SeqIO.parse('barcodes_R1.fastq', 'fastq'))

seqreads_Seq = []
for seqrecord in seqreads:
    seqreads_Seq.append(seqrecord.seq)

__Step 3:__ Get the counts of all barcodes. _(Hint: you might want to store barcodes and counts in a dictionary, and also keep track of the number of sequences that don't have a valid barcode)_

Please name your dictionary `barcode_counts`
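
As a reminder of the general counting pattern, here is a minimal sketch on a made-up list in which `None` stands for an item that could not be parsed. The names `parsed`, `example_counts`, and `n_invalid` are hypothetical, and this is just one way to do it.

```python
# Made-up results; None stands for a read without a valid barcode
parsed = ['AAAC', 'GGTT', None, 'AAAC', None, 'AAAC']

example_counts = {}    # key: barcode, value: number of times seen
n_invalid = 0          # running tally of items without a valid barcode

for barcode in parsed:
    if barcode is None:
        n_invalid += 1
    else:
        # .get() returns 0 the first time a barcode is seen
        example_counts[barcode] = example_counts.get(barcode, 0) + 1

print(example_counts)    # {'AAAC': 3, 'GGTT': 1}
print(f"{n_invalid} of {len(parsed)} items had no valid barcode.")
```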

In [55]:
# your code here ...
    

__Step 4:__ Report the total number of sequences parsed, and how many lacked a valid barcode.

In [56]:
# your code here ...