# __Homework 4:__ Practical analysis with BioPython

For the homework, you are going to extend the code from the analysis of our FASTQ file in lectures 8 and 9.
Recall that the FASTQ file contains reads from a real sequencing run of influenza virus HA and NA genes.

---
The __actual sequences__ are as follows:

    5'-[end of HA]-AGGCGGCCGC-[16 X N barcode]-3'
or 

    5'-[end of NA]-AGGCGGCCGC-[16 X N barcode]-3'
---


__The end of NA is__ `...CACGATAGATAAATAATAGTGCACCAT`
    
__The end of HA is__ `...CCGGATTTGCATATAATGATGCACCAT`

---    

    
The __sequencing reads__ come from the reverse end of the molecules (in 5'>3' orientation), so the sequencing reads are as follows:

    5'-[reverse complement of 16 X N barcode]-GCGGCCGCCT-[reverse complement of the end of HA]-3'
or

    5'-[reverse complement of 16 X N barcode]-GCGGCCGCCT-[reverse complement of the end of NA]-3'

---   
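
To make the orientation concrete, here is a minimal sketch of how a read relates to the molecule it came from. It uses only the standard library, so `reverse_complement` is a helper written for this example (in the analysis itself the Biopython `Seq.reverse_complement()` method does the same job). The barcode is a made-up placeholder, not a sequence from the FASTQ file.

```python
# Complement table for A, C, G, T and N (assumption: no other IUPAC codes occur).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

# Hypothetical 16-nt barcode, used only for illustration.
barcode = "ACGTACGTACGTAAAA"
ha_end = "CCGGATTTGCATATAATGATGCACCAT"
adaptor = "AGGCGGCCGC"

# Molecule as described above: [end of HA] - adaptor - barcode
molecule = ha_end + adaptor + barcode

# The sequencer reads the opposite strand, starting from the barcode end.
read = reverse_complement(molecule)

# The read begins with the reverse-complemented barcode, then GCGGCCGCCT.
print(read.startswith(reverse_complement(barcode) + "GCGGCCGCCT"))

# Reverse-complementing the read recovers the original molecule.
print(reverse_complement(read) == molecule)
```

This is why the first analysis step is to reverse-complement each read: afterwards the HA or NA end, the adaptor, and the barcode appear in the same order as in the actual sequences.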
    
The reads can originate from **either** HA or NA, and the two are distinguished by the 3'-most end of the read.
In our in-class exercise, however, we did not distinguish between reads matching HA and reads matching NA, because we did not look far enough into the read to tell them apart.

For the homework, your goal is to write code that extends the material from lectures 8 and 9 to also distinguish between HA and NA.
This homework can be completed almost entirely by re-using code from lecture 9. You will need to set up your analysis to do the following:
 1. Get the reverse complement of each read.
 2. Determine if it matches the expected pattern for HA or NA, and if so, which one.
 3. If it matches, extract the barcode and add it to a dictionary to keep track of counts.
 4. Determine the number and distribution of barcodes for HA and NA separately.
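
One possible way to combine steps 2 to 4 is a single regular expression with named groups. This is a sketch under assumptions, not the required solution: it demands the full HA or NA end directly before the adaptor, which is stricter than searching for a short gene-specific fragment anywhere in the read, so the counts it gives on real data may differ. The names `PATTERN`, `classify`, and `toy` are made up for this example, and the toy sequences stand in for reads that have already been reverse-complemented (step 1).

```python
import re
from collections import Counter

# HA- or NA-specific end, then the adaptor, then a 16-nt barcode of A/C/G/T only.
PATTERN = re.compile(
    r"(?:(?P<HA>CCGGATTTGCATATAATGATGCACCAT)|(?P<NA>CACGATAGATAAATAATAGTGCACCAT))"
    r"AGGCGGCCGC"
    r"(?P<barcode>[ACGT]{16})"
)

def classify(sequence):
    """Return (gene, barcode) for a reverse-complemented read, or None if no match."""
    match = PATTERN.search(sequence.upper())
    if match is None:
        return None
    # Exactly one of the two named alternatives is set when the pattern matches.
    gene = "HA" if match.group("HA") else "NA"
    return gene, match.group("barcode")

# Toy sequences made up for illustration; they are not from barcodes_R1.fastq.
toy = [
    "CCGGATTTGCATATAATGATGCACCAT" + "AGGCGGCCGC" + "AAAACCCCGGGGTTTT",
    "CACGATAGATAAATAATAGTGCACCAT" + "AGGCGGCCGC" + "ACGTACGTACGTACGT",
    "CACGATAGATAAATAATAGTGCACCAT" + "AGGCGGCCGC" + "ACGTACGTACGTACGT",
    "CACGATAGATAAATAATAGTGCACCAT" + "AGGCGGCCGC" + "ACGTNCGTACGTACGT",  # N in barcode
]

# One Counter per gene keeps the HA and NA barcode counts separate.
counts = {"HA": Counter(), "NA": Counter()}
for seq in toy:
    result = classify(seq)
    if result is not None:
        gene, barcode = result
        counts[gene][barcode] += 1

print(counts["HA"])
print(counts["NA"])
```

Keeping one `Counter` per gene means the number of distinct barcodes (`len(counts["HA"])`) and the distribution of counts can be read off separately for HA and NA.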

Please include code to address each of the following questions. Please include code comments to explain what your code is attempting to accomplish. Don't forget to include references to the sources you used to obtain your answer, including your classmates (if you are working in groups).  

1. How many reads map to HA, and how many reads map to NA?

In [1]:
# load necessary packages
import re
import Bio.SeqIO
import Bio.Seq
import pandas as pd

In [2]:
#Read in sequencing file
reads = Bio.SeqIO.parse('barcodes_R1.fastq', format='fastq')
seqreads = list(reads)

In [3]:
#make an empty list to put our sequences in
seqreads_Seq = []
for seqrecord in seqreads:
    sequence = seqrecord.seq # isolate the sequence from the seqrecord
    seqreads_Seq.append(sequence) # append sequence to our empty list

In [4]:
#Make a list of the reverse complements of the sequences in seqreads_Seq
seqreads_Seq_fw = []
for seqrecord in seqreads_Seq: 
    sequence_rev = seqrecord.reverse_complement()
    seqreads_Seq_fw.append(sequence_rev)

seqreads_Seq_fw[:5]

[Seq('TCTGCTATTCCGTGTCAATAAACATCGTCGTTGCATGCCTTCATTTTAGTGCTG...AGC'),
 Seq('TCGTACACTCTGTCATTAGGGATGTATTTGTTTAATGCATGGGGTTGTATACTA...AAG'),
 Seq('TCGTAGTGTATAGTAGAAGGGACGTCTACGTTAATCAGTGTCATAAGTTCGATC...GTG'),
 Seq('TCAGATGAATGGTAGTTGGTGATAGCATGAGGTTGGGTCGGATGGTTAGTGTCT...TGA'),
 Seq('GAGTAAAGACTGTGTTCTGGGACGCGATCGAGTCTGCGAATGTGTGTAGCGCTA...TAT')]

In [5]:
#Count reads to HA or NA
#Make a function which sorts the reads into a list of HA reads and a list of NA reads;
#the lengths of the two lists give the read counts
def count_barcodes(seqreads_Seq_fw, HA = 'CCGGATTTGCAT', NA = 'CACGATAGATAA'):
    HA_sequences_list = []
    NA_sequences_list = []

    #compile the patterns once, outside the loop
    #pattern for the HA-specific sequence
    HA_pattern = re.compile(HA)
    #pattern for the NA-specific sequence
    NA_pattern = re.compile(NA)

    for s in seqreads_Seq_fw:
        upper_seqs = str(s).upper()  # Convert s to an uppercase string

        #search for the HA pattern in the string (upper_seqs)
        if HA_pattern.search(upper_seqs):  # Only proceed if there is a match
            HA_sequences_list.append(upper_seqs)

        #search for the NA pattern in the string (upper_seqs)
        if NA_pattern.search(upper_seqs):  # Only proceed if there is a match
            NA_sequences_list.append(upper_seqs)

    return HA_sequences_list, NA_sequences_list

ha_reads, na_reads = count_barcodes(seqreads_Seq_fw)
print("The number of HA reads is:", len(ha_reads))
print("The number of NA reads is:", len(na_reads))



The number of HA reads is: 5493
The number of NA reads is: 4156


2. How many HA sequences did not have a valid barcode? Also answer the same question for NA.

In [9]:
#pattern for the adaptor followed by a valid 16-nt barcode
barcode_pattern = re.compile(r"AGGCGGCCGC([ATCG]{16})")

#function which counts valid barcodes and the reads without a valid barcode
def count_valid_barcodes(reads, pattern):
    Barcodes = []
    BC_count = 0
    len_reads = len(reads)
    for r in reads:
        BC = pattern.search(r)
        if BC: 
            Barcodes.append(BC.group(1))  #store the barcode string, not the match object
            BC_count += 1
        else:
            continue
    counts_series = pd.Series(Barcodes).value_counts()  #my labmate Jenny Nathans helped me with this as well as Lucas during OH
    no_BC = len_reads - BC_count
    return counts_series, no_BC


#call the function on the HA and NA reads and unpack its two outputs
HA_count_valid_barcodes, HA_no_BC = count_valid_barcodes(ha_reads,barcode_pattern)
NA_count_valid_barcodes, NA_no_BC = count_valid_barcodes(na_reads,barcode_pattern)

print(HA_no_BC, NA_no_BC)
print('The number of HA sequences without a valid barcode is', HA_no_BC)
print('The number of NA sequences without a valid barcode is',  NA_no_BC)

169 222
The number of HA sequences without a valid barcode is 169
The number of NA sequences without a valid barcode is 222


3. What is the HA barcode with the most counts (and how many counts)? Also answer the same question for NA.

    _Hint: you will need to find the key associated with the maximum value in your dictionary. There are many ways to do this._
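
Two common ways to find the key with the maximum value are sketched below on a toy dictionary. The barcodes and counts here are made up for illustration and are not results from the FASTQ file.

```python
from collections import Counter

# Toy counts made up for illustration.
barcode_counts = {
    "AAAACCCCGGGGTTTT": 3,
    "ACGTACGTACGTACGT": 7,
    "TTTTGGGGCCCCAAAA": 1,
}

# Option 1: max() over the keys, ranked by their values.
top_barcode = max(barcode_counts, key=barcode_counts.get)
top_count = barcode_counts[top_barcode]

# Option 2: Counter.most_common(1) returns a list holding one (key, count) tuple.
top_barcode_2, top_count_2 = Counter(barcode_counts).most_common(1)[0]

print(top_barcode, top_count)
print(top_barcode_2, top_count_2)
```

Both options return a single key even when several keys tie for the maximum, so ties need to be checked separately if they matter.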

In [42]:
from collections import Counter

#pattern for the adaptor followed by a valid 16-nt barcode, captured in a named group
barcode_pattern = re.compile(r"AGGCGGCCGC(?P<barcode>[ATCG]{16})")

#function which returns the list of valid barcodes found in the reads
def count_valid_barcodes(reads, pattern):
    barcodes = [] #list to store matched barcodes
    
    for r in reads:
        match = pattern.search(r)
        if match: #only proceed if there's a match
            barcode_match = match.group('barcode') #get the matched barcode
            barcodes.append(barcode_match) #append to the list
        else:
            continue

    return barcodes

HA_barcode_list = count_valid_barcodes(ha_reads,barcode_pattern)
NA_barcode_list = count_valid_barcodes(na_reads,barcode_pattern)

#Count the occurrences of each barcode
HA_barcode_counts = Counter(HA_barcode_list)
NA_barcode_counts = Counter(NA_barcode_list)

#Find max count item in list
most_common_HA_barcode, HA_max_count = HA_barcode_counts.most_common(1)[0]
most_common_NA_barcode, NA_max_count = NA_barcode_counts.most_common(1)[0]

print("The most common HA barcode is", most_common_HA_barcode)
print("The count of the most common HA barcode is", HA_max_count)

print("The most common NA barcode is", most_common_NA_barcode)
print("The count of the most common NA barcode is", NA_max_count)

#ChatGPT helped me with this one, on how to find max count in a list


The most common HA barcode is CCCGACCCGACATTAA
The count of the most common HA barcode is 155
The most common NA barcode is ACCAGTTCTCCCCGGG
The count of the most common NA barcode is 152
