# BioPython Tutorial

[Biopython](https://biopython.org/) is a Python package with a variety of tools for working with sequence data. The examples below show how to read, inspect, and search with your own sequence data.

## Using BioPython to examine sequence files

Your sequence data are in the `.ab1` format. This format is common for Sanger sequencing data. First, let's take a look at the files:

In [None]:
!ls Biocoding-july-2018

Let's just choose a single file `Student_1-M13F.ab1` to open and explore with BioPython: 

In [None]:
# import SeqIO from the Bio library/module
# SeqIO lets us handle the file (open and read from it)
from Bio import SeqIO
 
# specify the path of the file
file_path = "/Users/jasonwilliams/Desktop/Biocoding-july-2018/Student_1-M13F.ab1"
 
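# Create a SeqRecord "object" using the SeqIO.read function.
# The file is opened in binary mode ("rb"); the with block closes it when done
with open(file_path, "rb") as handle:
    sequence_object = SeqIO.read(handle, "abi")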

# Next, use the .format method to extract the DNA sequence in FASTA format
fasta_sequence = sequence_object.format("fasta")
print(fasta_sequence) 

# What other methods are associated with the sequence object?
# See a full list at https://biopython.org/wiki/SeqIO
 
# get the quality scores for each base
quals = sequence_object.letter_annotations['phred_quality']
print(quals)
print(len(quals))
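
A Phred score Q encodes the estimated probability P that a base call is wrong, with Q = -10 log10(P), so Q20 means roughly a 1 in 100 chance of error. The sketch below illustrates that conversion along with two simple summaries. The helper name `phred_to_error_prob` and the `example_quals` list are made up for this example and are not part of Biopython; in the notebook you could use `quals` in place of `example_quals`.

```python
def phred_to_error_prob(q):
    """Convert a Phred score to the estimated probability of a wrong base call."""
    return 10 ** (-q / 10)

# Hypothetical scores standing in for the `quals` list from the cell above
example_quals = [10, 20, 30, 40]

# Error probability for each base call
error_probs = [phred_to_error_prob(q) for q in example_quals]

# Average quality across the read
mean_quality = sum(example_quals) / len(example_quals)

# Fraction of bases at or above Q20 (an assumed cutoff; pick one that suits your data)
fraction_q20 = sum(q >= 20 for q in example_quals) / len(example_quals)

print(error_probs)
print(mean_quality, fraction_q20)
```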



We could also plot our quality score as a histogram:

In [None]:
# Since quals is a list of integers, we can plot it directly as a histogram

import matplotlib.pyplot as plt
%matplotlib inline

plot = plt.hist(quals, facecolor='blue', alpha=0.75)

plt.xlabel('Phred Scores')
plt.ylabel('Number of nucleotides')
plt.title('Histogram of Phred scores')

plt.show()

In [None]:
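# Plot the Phred quality score at each base position along the read
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

xs = np.arange(len(quals))  # base positions, starting at 0

plt.plot(xs, quals)
plt.xlabel('Base position')
plt.ylabel('Phred score')

plt.show()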

In [None]:
# The raw trace (chromatogram) signal for each of the four dye channels
# is stored under the keys DATA9 through DATA12
channels = ['DATA9', 'DATA10', 'DATA11', 'DATA12']
abif_raw = sequence_object.annotations['abif_raw']
trace = {c: abif_raw[c] for c in channels}

# Plot every 10th data point of each channel to keep the plot readable
plt.plot(trace['DATA9'][::10], color='blue')
plt.plot(trace['DATA10'][::10], color='red')
plt.plot(trace['DATA11'][::10], color='green')
plt.plot(trace['DATA12'][::10], color='yellow')

plt.show()
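
The colors in the trace plot are arbitrary: which base each channel records depends on the instrument's dye order. In ABIF files this order is, to our understanding, stored under the `FWO_1` key (often `GATC`), with `DATA9` through `DATA12` following it. The sketch below pairs channels with bases under that assumption. `label_channels` is a hypothetical helper, not part of Biopython, and `b'GATC'` is an example value; in the notebook you might pass `sequence_object.annotations['abif_raw']['FWO_1']` after checking that the key exists in your file.

```python
def label_channels(base_order, channels=('DATA9', 'DATA10', 'DATA11', 'DATA12')):
    """Pair each trace channel with the base it records, given the dye order."""
    # The value may arrive as bytes or as a string; accept both
    if isinstance(base_order, bytes):
        base_order = base_order.decode('ascii')
    if len(base_order) != len(channels):
        raise ValueError('expected one base per channel')
    return dict(zip(channels, base_order))

# b'GATC' is an assumed example value, not read from a real file
channel_to_base = label_channels(b'GATC')
print(channel_to_base)
```

With this mapping you could label each plotted line with its base, for example by passing `label=channel_to_base['DATA9']` to `plt.plot` and calling `plt.legend()`.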

## Blasting a sequence

Next, let's search NCBI with BLAST to see which database sequences match ours. The search may take up to a few minutes.

In [None]:
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML

# BLAST the FASTA sequence using blastn against the
# nt database, with an E-value cutoff of 0.0001, returning the top 5 hits
blast_result = NCBIWWW.qblast('blastn', 'nt', fasta_sequence, expect=0.0001, hitlist_size=5)

blast_record = NCBIXML.read(blast_result)

# print the blast hit results

for alignment in blast_record.alignments:
    print(alignment)
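
Printing an alignment gives only a short text summary. Each alignment in a parsed record also carries a `title`, a `length`, and a list of `hsps` (high-scoring pairs) with attributes such as `expect`, `identities`, and `align_length`. The sketch below collects these into one row per hit. `summarize_alignments` is a hypothetical helper, and the `SimpleNamespace` objects are made-up stand-ins so the example runs without a network search; in the notebook you would pass `blast_record.alignments` instead.

```python
from types import SimpleNamespace

def summarize_alignments(alignments):
    """Return one summary dict per hit, based on the HSP with the lowest E-value."""
    rows = []
    for alignment in alignments:
        best = min(alignment.hsps, key=lambda hsp: hsp.expect)
        rows.append({
            'title': alignment.title,
            'length': alignment.length,
            'evalue': best.expect,
            'percent_identity': 100 * best.identities / best.align_length,
        })
    return rows

# Made-up data shaped like a parsed BLAST alignment (not a real search result)
fake_alignments = [
    SimpleNamespace(
        title='hypothetical hit',
        length=1200,
        hsps=[
            SimpleNamespace(expect=1e-50, identities=480, align_length=500),
            SimpleNamespace(expect=1e-5, identities=90, align_length=100),
        ],
    )
]

for row in summarize_alignments(fake_alignments):
    print(row['title'], row['evalue'], round(row['percent_identity'], 1))
```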

## Where to go next

See the full Biopython Tutorial and Cookbook for many more examples that you can use in putting together your own notebook: http://biopython.org/DIST/docs/tutorial/Tutorial.html
