# __Homework 4:__ Practical analysis with Biopython

For the homework, you are going to extend the code from the analysis of our FASTQ file in lectures 8 and 9.
Recall that the FASTQ file contains reads from a real sequencing run of influenza virus HA and NA genes.

---
The __actual sequences__ are as follows:

    5'-[end of HA]-AGGCGGCCGC-[16 X N barcode]-3'
or 

    5'-[end of NA]-AGGCGGCCGC-[16 X N barcode]-3'
---


__The end of NA is__ `...CACGATAGATAAATAATAGTGCACCAT`
    
__The end of HA is__ `...CCGGATTTGCATATAATGATGCACCAT`

---    

    
The __sequencing reads__ start from the reverse end of the molecules (in 5'>3' orientation), so they look as follows:

    5'-[reverse complement of 16 X N barcode]-GCGGCCGCCT-[reverse complement of the end of HA]-3'
or

    5'-[reverse complement of 16 X N barcode]-GCGGCCGCCT-[reverse complement of the end of NA]-3'

---   
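
To make the orientation concrete, here is a minimal sketch that builds one HA molecule from the layout above and reverse-complements it into the read the sequencer would report. The barcode value and the helper name `revcomp` are made up for illustration; they do not come from the FASTQ file.

```python
# Complement table for the four bases; N (unknown base) maps to itself.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def revcomp(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

end_of_ha = "CCGGATTTGCATATAATGATGCACCAT"
upstream = "AGGCGGCCGC"
barcode = "ACGTTTGACCAGTCAA"  # hypothetical 16-nt barcode, illustration only

# Actual molecule: 5'-[end of HA]-AGGCGGCCGC-[barcode]-3'
molecule = end_of_ha + upstream + barcode

# The read is the reverse complement of the molecule
read = revcomp(molecule)

print(read[:16] == revcomp(barcode))   # the read starts with the reversed barcode
print(read[16:26])                     # the constant region as seen in the read
print(revcomp(read) == molecule)       # reverse-complementing again restores the molecule
```

Because the gene-specific sequence sits at the 3' end of the read, reverse-complementing each read first lets the same left-to-right pattern (gene end, constant region, barcode) be used for matching.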
    
The reads can originate from **either** HA or NA; the two are distinguished by the 3' end of the read.
In our in-class exercise, however, we did not distinguish between reads matching HA and NA, because we did not look far enough into the read to tell them apart.

For the homework, your goal is to write code that extends the material from lectures 8 and 9 to also distinguish between HA and NA.
This homework can be completed almost entirely by re-using code from lecture 9. You will need to set up your analysis to do the following:
 1. Get the reverse complement of each read.
 2. Determine if it matches the expected pattern for HA or NA, and if so which one.
 3. If it matches, extract the barcode and add it to a dictionary to keep track of counts.
 4. Determine the number and distribution of barcodes for HA and NA separately.
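
The four steps above can be sketched on a handful of toy reads. This is one possible arrangement, not the required solution: the helper names (`revcomp`, `count_barcodes`), the use of `collections.Counter`, and the toy barcodes are all assumptions made for illustration.

```python
import re
from collections import Counter

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def revcomp(seq):
    """Reverse complement of a DNA string (N stays N)."""
    return seq.upper().translate(COMPLEMENT)[::-1]

ENDS = {"HA": "CCGGATTTGCATATAATGATGCACCAT",
        "NA": "CACGATAGATAAATAATAGTGCACCAT"}
UPSTREAM = "AGGCGGCCGC"
BCLEN = 16

# One anchored pattern per gene: gene end, constant region, then the barcode
PATTERNS = {gene: re.compile(f"{end}{UPSTREAM}(?P<barcode>[ACGT]{{{BCLEN}}})$")
            for gene, end in ENDS.items()}

def count_barcodes(reads):
    """Return {gene: Counter of barcodes} for reads that match a gene pattern."""
    counts = {gene: Counter() for gene in PATTERNS}
    for read in reads:
        rc = revcomp(read)                          # step 1
        for gene, pattern in PATTERNS.items():      # step 2
            match = pattern.search(rc)
            if match:
                counts[gene][match.group("barcode")] += 1   # step 3
                break
    return counts

# Toy molecules (hypothetical barcodes), converted to reads by reverse complement
toy_molecules = [
    ENDS["HA"] + UPSTREAM + "ACGTTTGACCAGTCAA",
    ENDS["HA"] + UPSTREAM + "ACGTTTGACCAGTCAA",
    ENDS["NA"] + UPSTREAM + "TTGACCGATTACAGGC",
    ENDS["NA"] + UPSTREAM + "TTGACCGATTACAGGN",  # N in the barcode: not counted
]
toy_reads = [revcomp(m) for m in toy_molecules]

counts = count_barcodes(toy_reads)
for gene, counter in counts.items():                # step 4
    print(gene, sum(counter.values()), "reads,", len(counter), "distinct barcodes")
```

Keeping one counter per gene means the totals and the per-barcode distribution for HA and NA can both be read from the same structure.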

Please include code to address each of the following questions. Please include code comments to explain what your code is attempting to accomplish. Don't forget to include references to the sources you used to obtain your answer, including your classmates (if you are working in groups).  

1. How many reads map to HA, and how many reads map to NA?

In [1]:
import re
import Bio.SeqIO

# Read the sequencing reads
seqreads = list(Bio.SeqIO.parse('barcodes_R1.fastq', format = 'fastq'))

# Convert each SeqRecord into a plain sequence string
seqreads_l = [str(seqrecord.seq) for seqrecord in seqreads]
    
# Define a function that returns the reverse complement of a sequence
def reverse_complement(seq, unk_partner = 'N'):
    base_partner = {'A':'T', 'T':'A', 'C':'G', 'G':'C'}
    rseq = ''
    for a in seq:
        if a in base_partner:
            # look up the complementary base in the dictionary
            pair = base_partner[a]
            rseq = pair + rseq
        else:
            rseq = unk_partner + rseq
    return rseq

# Define the arguments: upstream sequence, end of HA, end of NA, and barcode length
upstream = 'AGGCGGCCGC'
end_of_ha = 'CCGGATTTGCATATAATGATGCACCAT'
end_of_na = 'CACGATAGATAAATAATAGTGCACCAT'
bclen = 16

def read_barcode_pattern(seqread, bclen, end_of_na, end_of_ha, upstream):
    
    # compile the barcode search pattern
    barcode_pattern_na = re.compile(end_of_na + upstream + f"(?P<barcode>[ATCG]{{{bclen}}})$")
    barcode_pattern_ha = re.compile(end_of_ha + upstream + f"(?P<barcode>[ATCG]{{{bclen}}})$") 
    # make an empty dictionary to count na
    barcode_na_dic = {}
    # make an empty dictionary to count ha 
    barcode_ha_dic = {}
    for seq in seqread:
        seq = seq.upper()
        # get the reverse complement of the read
        reverse = reverse_complement(seq)
        # search for the barcode pattern 
        match_na = barcode_pattern_na.search(reverse)
        match_ha = barcode_pattern_ha.search(reverse)

        if match_na: #check if there is a match for na pattern
            barcode_na = match_na.group('barcode')
            # add barcode in dictionary and count number of na pattern
            if barcode_na in barcode_na_dic:
                barcode_na_dic[barcode_na] += 1
            else: 
                barcode_na_dic[barcode_na] = 1
        elif match_ha: #check if there is a match for ha pattern
            barcode_ha = match_ha.group('barcode')
            # add barcode in dictionary and count number of ha pattern 
            if barcode_ha in barcode_ha_dic:
                barcode_ha_dic[barcode_ha] += 1
            else:
                barcode_ha_dic[barcode_ha] = 1
                
    return barcode_na_dic, barcode_ha_dic

barcode_na, barcode_ha = read_barcode_pattern(seqreads_l, bclen, end_of_na, end_of_ha, upstream)    
print(barcode_ha)
print(barcode_na)

def count_barcode(barcode_dic):
    # total number of reads = sum of the counts over all barcodes
    return sum(barcode_dic.values())

na_barcode_count = count_barcode(barcode_na)
ha_barcode_count = count_barcode(barcode_ha)
print('\n')
print(f'There are {na_barcode_count} sequences that map to NA')
print(f'There are {ha_barcode_count} sequences that map to HA')

{'AACCGTGACCAGGAAG': 70, 'TTATCGTCTCCCATAT': 77, 'CATACCAGTCATCCCT': 28, 'ACTTACGTATAAGTCA': 53, 'GCTACTACTATACCAT': 119, 'GTTACCCACAGTCCGC': 62, 'CACCACACAAGGATGT': 41, 'TCATACATCACACTTA': 47, 'GTACCCTCCGTGAATC': 99, 'ACTCCACGCTACCACG': 31, 'GCACTCCTCAACCCTT': 48, 'CCGCTCCCTGCTGTCC': 43, 'AAACGTAGCGATAACT': 61, 'TCACGTCCCATATTAC': 9, 'CCCGACCCGACATTAA': 155, 'ACGAGAGGTCGACTCG': 60, 'TTCGACTTCCTAGTAC': 86, 'ACCCAGTCTAGCTAAC': 70, 'GGTCATACGCCTTCGC': 16, 'CTTAACCTTCCGACAA': 55, 'TGGGCAATAAATGTAG': 57, 'CCCTCATCCTGTGTCA': 54, 'AATTCCATTCAGGCTG': 35, 'TTCTAGCCTTATCTCC': 81, 'AGAATAATCTCAAACT': 62, 'AAACAAACAAGTCTGT': 50, 'AATACGAACATCTCGG': 49, 'CGAATCTGCGCAATCT': 60, 'GATTTCCGATCAGTCT': 124, 'AGAACGTCTTGATAGC': 30, 'GACCGGTGCTTCAACA': 76, 'TGCTGACAACACGTAA': 47, 'ATAGTCGGGCCTACCG': 21, 'TTCCCAATGGTACTCG': 28, 'CTGGCAAGAAAATTCC': 43, 'TTCGGTGAACGTACAC': 76, 'CTCCATCACTCGCCAA': 115, 'TCAACCCCACGCCGCT': 24, 'TAACTAGCGTTTTCAA': 1, 'AGTACTCATCGAGCAA': 31, 'AACCGTCACCTCAACC': 104, 'GAACGACCCTT

2. How many HA sequences did not have a valid barcode? Also answer the same question for NA.

In [4]:
def read_seq(seqread, bclen, end_of_na, end_of_ha, upstream):
    # compile the seq search pattern
    pattern_na = re.compile(end_of_na)
    pattern_ha = re.compile(end_of_ha)
    # counters for na and ha
    na_count = 0
    ha_count = 0

    for seq in seqread:
        seq = seq.upper()
        # get the reverse complement of the read
        reverse = reverse_complement(seq)
        # search for the patterns
        match_na = pattern_na.search(reverse)
        match_ha = pattern_ha.search(reverse)

        # check for matches and increment the respective counters
        if match_na:
            na_count += 1
        elif match_ha:
            ha_count += 1
                
    return na_count, ha_count

na_count, ha_count = read_seq(seqreads_l, bclen, end_of_na, end_of_ha, upstream)    
print(f'There are {na_count} NA sequences in total')
print(f'There are {ha_count} HA sequences in total')
print('\n')
print(f'There are {na_count - na_barcode_count} NA sequences that did not have a valid barcode')
print(f'There are {ha_count - ha_barcode_count} HA sequences that did not have a valid barcode')

There are 4122 NA sequences in total
There are 5409 HA sequences in total


There are 215 NA sequences that did not have a valid barcode
There are 164 HA sequences that did not have a valid barcode


In [5]:
5409 - 5245

164

3. What is the HA barcode with the most counts (and how many counts)? Also answer the same question for NA.

    _Hint: you will need to find the key associated with the maximum value in your dictionary. There are many ways to do this._
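
As a rough illustration of the hint, the sketch below shows three common ways to find the key with the largest value. The dictionary `toy_counts` is made up; it is not taken from the sequencing data.

```python
from collections import Counter

# Hypothetical barcode counts, with a deliberate tie for the maximum
toy_counts = {"AAAA": 3, "CCCC": 7, "GGGG": 7, "TTTT": 1}

# Option 1: max() over the keys, ranked by their values.
# With ties, the first key encountered wins.
top_barcode = max(toy_counts, key=toy_counts.get)

# Option 2: compute the maximum once, then collect every key that reaches it.
top_count = max(toy_counts.values())
tied_barcodes = [bc for bc, n in toy_counts.items() if n == top_count]

# Option 3: Counter.most_common returns (key, count) pairs, highest first.
top_pair = Counter(toy_counts).most_common(1)[0]

print(top_barcode, top_count)
print(tied_barcodes)
print(top_pair)
```

Option 2 is the only one of the three that reports ties, which matters if two barcodes happen to share the highest count.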

In [3]:
# HA barcode with the most counts
# maximum value in dictionary expression found in: https://datagy.io/python-get-dictionary-key-with-max-value/
# Compute the maximum count once, then collect every barcode that reaches it
mf_ha_barcode_count = max(barcode_ha.values())
mf_ha_barcode = [key for key, value in barcode_ha.items() if value == mf_ha_barcode_count]
print('most frequent HA barcode is:', mf_ha_barcode)
print('count of the most frequent HA barcode is:', mf_ha_barcode_count)

# NA barcode with the most counts
mf_na_barcode_count = max(barcode_na.values())
mf_na_barcode = [key for key, value in barcode_na.items() if value == mf_na_barcode_count]
print('most frequent NA barcode is:', mf_na_barcode)
print('count of the most frequent NA barcode is:', mf_na_barcode_count)

most frequent HA barcode is: ['CCCGACCCGACATTAA']
count of the most frequent HA barcode is: 155
most frequent NA barcode is: ['ACCAGTTCTCCCCGGG']
count of the most frequent NA barcode is: 152
