# BLAST an unknown sequence 



As stated in the introduction, we have a sequence from *D. yakuba*, but we don't know much about it. First, let's examine the [sequence](./files/yakuba.fa), which is saved in the `files` folder next to this notebook. 

We will use Linux's `head` command to preview the first few lines of the file. 
> Tip: To execute a bash command from within this Python Jupyter notebook, place a `!` in front of the command. 

In [None]:
!head ./files/yakuba.fa
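
If the `!` shell escape is not available (for example, when running this code outside Jupyter), a similar preview can be done in plain Python. The sketch below is only an illustration: `preview_lines` is a hypothetical helper name, and the small FASTA file it reads is made up and written to a temporary folder so the example runs on its own.

```python
import os
import tempfile
from itertools import islice


def preview_lines(path, n=10):
    """Return the first n lines of a text file, similar to `head -n`."""
    with open(path) as handle:
        # islice stops after n lines, so large files are not read in full
        return [line.rstrip("\n") for line in islice(handle, n)]


# Write a small made-up FASTA file (this is NOT the real yakuba.fa)
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo.fa")
with open(demo_path, "w") as handle:
    handle.write(">demo_sequence made-up example\nACGTACGTAC\nGGCCTTAAGG\n")

# Preview the first two lines of the demo file
for line in preview_lines(demo_path, n=2):
    print(line)
```

To preview the tutorial file instead, you could call `preview_lines('./files/yakuba.fa')`.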

## Starting with Biopython

In these notebooks, we will be using [Biopython](http://biopython.org/), a set of free software tools for a variety of bioinformatics applications. While this tutorial will not teach Biopython comprehensively, you will learn some useful features, and we will refer you to the [Biopython documentation](http://biopython.org/wiki/Documentation) to learn more. 

### Load Biopython and check version
First, let's check that Biopython is installed and check the version. 

In [None]:
import Bio
print("Biopython version is " + Bio.__version__)

> Tip: If you do not have Biopython installed, see the [installation instructions](http://biopython.org/wiki/Download).

### Load a fasta file for use in Biopython

In this step, we want to load the `yakuba.fa` sequence into a variable that can be used in our BLAST search. To do this, we create a variable called `fasta_file` and use Python's `open()` function to read the file. As shown above, the yakuba file is in a folder called `files` at `./files/yakuba.fa`.

In [None]:
# Complete this code by entering the name of your file. The filename and 
# filepath should be in quotes

fasta_file = open().read()

In [None]:
fasta_file = open('./files/yakuba.fa').read()

We can preview what was read into the `fasta_file` variable by printing it:

In [None]:
print(fasta_file)
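
The text read from a FASTA file has two parts: a header line that starts with `>` and one or more lines of sequence. As a minimal sketch, assuming the file holds a single record, the two parts can be separated in plain Python. `split_fasta` is a hypothetical helper name, and `demo_fasta` is a made-up record standing in for the contents of `fasta_file`.

```python
def split_fasta(fasta_text):
    """Split single-record FASTA text into (header, sequence)."""
    lines = fasta_text.strip().splitlines()
    # The first line is the header; drop the leading '>' if present
    header = lines[0][1:] if lines[0].startswith(">") else lines[0]
    # The remaining lines are the sequence, wrapped across several lines
    sequence = "".join(line.strip() for line in lines[1:])
    return header, sequence


# Made-up record used only for illustration (not the real yakuba.fa)
demo_fasta = ">demo_sequence made-up example\nACGTACGTAC\nGGCCTTAAGG\n"

header, sequence = split_fasta(demo_fasta)
print("Header:", header)
print("Length:", len(sequence), "bases")
```

With the real data, `split_fasta(fasta_file)` would report the header and length of the yakuba sequence, which is a quick check that the file was read as expected.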

### Perform a BLAST search using Biopython

As mentioned in the introduction, BLAST is a tool for similarity searching. This is done by taking your **query** sequence (the sequence you want to find matches for), as well as **search parameters** (some optional adjustments to the way you wish to limit or expand your search) and searching a **database** (a repository of known DNA sequences). 

First, we will load the appropriate Biopython module for doing a BLAST search over the Internet. The [NCBIWWW module](http://biopython.org/DIST/docs/api/Bio.Blast.NCBIWWW-module.html) has a variety of features we will explore in a moment. 

In [None]:
from Bio.Blast import NCBIWWW

We will do our first BLAST using this piece of Biopython code. 
> Tip: Since this is a real BLAST search, you will get an 'In [\*]' in the cell below for up to several minutes as the search is executed. Don't proceed in the notebook until the '\*' turns into a number. 

In [None]:
blast_result_1 = NCBIWWW.qblast("blastn", "nt", fasta_file)

The BLAST result returned by the `NCBIWWW.qblast` function is not easy to read because it is in [XML format](https://en.wikipedia.org/wiki/XML). We will use some additional code to examine it. 

First, let's save the BLAST result as its own file. 

In [None]:
with open("./files/blast_output.xml", "w") as output_xml:
    output_xml.write(blast_result_1.read())
blast_result_1.close()

We can preview the first few lines of the `blast_output.xml` file and then go on to extract the information we need. 

In [None]:
# Use the `!head` command (using the -n argument to specify the 
# number of lines) to preview the first 50 lines of the blast_output.xml file

### your code here

In [None]:
!head -n 50 ./files/blast_output.xml
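
One way to start extracting information from XML like this is Python's built-in `xml.etree.ElementTree` module. The sketch below is an illustration rather than the method this tutorial goes on to use. It parses a small made-up fragment, `demo_xml`, whose element names (`Hit`, `Hit_def`, `Hit_accession`, `Hit_len`, `Hsp_evalue`) are assumed to match those in the BLAST XML output; the accessions and values are invented. Biopython also provides its own parser in `Bio.Blast.NCBIXML`.

```python
import xml.etree.ElementTree as ET

# Made-up fragment imitating the structure of BLAST XML output
demo_xml = """<BlastOutput>
  <BlastOutput_program>blastn</BlastOutput_program>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit>
          <Hit_def>Made-up hit one</Hit_def>
          <Hit_accession>DEMO0001</Hit_accession>
          <Hit_len>1200</Hit_len>
          <Hit_hsps>
            <Hsp><Hsp_evalue>1e-50</Hsp_evalue></Hsp>
          </Hit_hsps>
        </Hit>
        <Hit>
          <Hit_def>Made-up hit two</Hit_def>
          <Hit_accession>DEMO0002</Hit_accession>
          <Hit_len>950</Hit_len>
          <Hit_hsps>
            <Hsp><Hsp_evalue>3e-20</Hsp_evalue></Hsp>
            <Hsp><Hsp_evalue>2e-05</Hsp_evalue></Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""

root = ET.fromstring(demo_xml)

hits = []
for hit in root.iter("Hit"):
    # A hit can have several alignments (HSPs); keep the smallest E-value
    best_evalue = min(float(hsp.findtext("Hsp_evalue")) for hsp in hit.iter("Hsp"))
    hits.append((
        hit.findtext("Hit_accession"),
        hit.findtext("Hit_def"),
        int(hit.findtext("Hit_len")),
        best_evalue,
    ))

for accession, definition, length, evalue in hits:
    print(accession, definition, length, evalue)
```

To try this on the saved search result, you could replace `ET.fromstring(demo_xml)` with `ET.parse("./files/blast_output.xml").getroot()`.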
