# Workshop 5 - Automate BLAST Runs

### Topics

1. About BLAST
2. Running BLAST over the Internet 

## About BLAST 

BLAST (basic local alignment search tool) is an algorithm and program for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences. A BLAST search enables a researcher to compare a subject protein or nucleotide sequence (called a query) with a library or database of sequences, and identify database sequences that resemble the query sequence above a certain threshold. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence. 

## Types of BLAST

* __Nucleotide-Nucleotide BLAST (blastn)__: given a DNA query, returns the most similar DNA sequences from the DNA database that the user specifies. 
* __Protein-Protein BLAST (blastp)__: given a protein query, returns the most similar protein sequences from the protein database that the user specifies. 
* __Translated Nucleotide-Protein BLAST (blastx)__: for identifying potential protein products encoded by a nucleotide query. 
* __Protein-Translated Nucleotide (tblastn)__: for identifying database sequences encoding proteins similar to the query. 
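
The four programs above differ only in the molecule type of the query and of the database. As a rough illustration, that choice can be written as a lookup table. The names `BLAST_PROGRAMS` and `choose_program` are hypothetical helpers for this workshop, not part of BLAST or Biopython.

```python
# Hypothetical lookup table: (query type, database type) -> BLAST program.
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",   # the query is translated
    ("protein", "nucleotide"): "tblastn",  # the database is translated
}


def choose_program(query_type, database_type):
    """Return the BLAST program name for a query/database type pair."""
    try:
        return BLAST_PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError(
            f"unsupported combination: {query_type!r} vs {database_type!r}"
        ) from None


print(choose_program("nucleotide", "protein"))
```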

## BLAST Sequence Databases 

Nucleotide Databases:

* __nt (default)__: All GenBank + EMBL + DDBJ + PDB sequences, excluding sequences from PAT, EST, STS, GSS, WGS, TSA and phase 0, 1 or 2 HTGS sequences. Non-redundant, records with identical sequences collapsed into a single entry. 
* __rRNA/ITS databases__: A collection of four databases: 16S microbial rRNA sequences from NCBI's Targeted Loci Projects, 18S and 26S rRNA databases for fungi, and an ITS database for fungi. 
* __refseq_rna__: Curated (NM_, NR_) plus predicted (XM_, XR_) sequences from NCBI Reference Sequence Project. 
* ... and more

Protein Databases:

* __nr (default)__: Non-redundant GenBank CDS translations + RefSeq + PDB + SwissProt + PIR + PRF, excluding those in PAT, TSA, and env_nr.
* __refseq_protein__: Protein sequences from NCBI Reference Sequence project. 
* __Landmark__: The landmark database includes proteomes from representative genomes spanning a wide taxonomic range.
* ... and more

For more information about BLAST programs and databases, please see NCBI's [How To BLAST Guide](https://ftp.ncbi.nlm.nih.gov/pub/factsheets/HowTo_BLASTGuide.pdf).

## Biopython's BLAST Tools 

Most people first encounter BLAST through the NCBI web service. That interface makes it difficult to handle the volume of data generated by large runs, and to automate BLAST runs in general. Biopython provides several tools to make things easier. We will introduce a few here. 

### Running BLAST over the Internet

This tool lets you call the online version of BLAST without downloading the databases to your local machine. It can save you time and resources if the number of queries you are going to run isn't very large. 

We use the function `qblast()` in the `Bio.Blast.NCBIWWW` module to call the online version of BLAST. This function has three required arguments:

* The first argument is the __blast program__ to use for the search, as a lower case string. Currently `qblast` only works with blastn, blastp, blastx, tblastn, and tblastx. 
* The second argument specifies the __database__ to search against. The options for this are available on the [NCBI Guide to BLAST](https://ftp.ncbi.nlm.nih.gov/pub/factsheets/HowTo_BLASTGuide.pdf).
* The third argument is a string containing your __query sequence__. This can be the sequence itself, the sequence in FASTA format, or an identifier like a GI number. 

__Usage Guidelines:__

The NCBI BLAST servers are a shared resource and give priority to interactive users. To keep the service available to the entire community, NCBI may limit searches for some high-volume users. It moves searches from users who submit more than 100 searches in a 24-hour period to a slower queue and, in extreme cases, blocks their requests. To avoid problems, API users should comply with the following guidelines:

* Do not contact the server more often than once every 10 seconds.
* Do not poll for any single Request ID more often than once a minute. 
* Use the URL parameters `email` and `tool`, so that NCBI can contact you if there is a problem.
* Run scripts on weekends or between 9 pm and 5 am Eastern Time on weekdays if more than 50 searches will be submitted. 

To follow the third guideline, set the `NCBIWWW.email` variable.

In [None]:
from Bio.Blast import NCBIWWW

NCBIWWW.email = "u1133824@anu.edu.au" 
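
The first guideline (at most one request every 10 seconds) can be enforced in a script with a small helper. The sketch below is one possible approach, not something Biopython provides: the `Throttle` class and its `clock` and `sleep` parameters are hypothetical names introduced here, with the latter two exposed only so the behaviour can be checked without real waiting.

```python
import time


class Throttle:
    """Block until at least `min_interval` seconds have passed since the last call."""

    def __init__(self, min_interval=10.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None  # no request has been made yet

    def wait(self):
        """Sleep if the previous call was too recent, then record this call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


throttle = Throttle(min_interval=10.0)
# Call throttle.wait() immediately before each request to the NCBI servers.
```

The first call to `wait()` returns immediately; later calls pause only for whatever part of the 10 seconds has not yet elapsed.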

The `qblast` function also takes a number of optional arguments, which are analogous to the parameters you can set on the BLAST web page. We'll highlight a few of them here:

* The argument `url_base` sets the base URL for running BLAST over the internet. By default it connects to the NCBI, but one can use this to connect to an instance of NCBI BLAST running in the cloud. 
* The `qblast` function can return the BLAST results in various formats, which you can choose with the optional `format_type` keyword: "HTML", "Text", "ASN.1", or "XML". The default is "XML", as that is the format expected by the parser. 
* The argument `expect` sets the expectation or e-value threshold.

For more about the optional BLAST arguments, you can read NCBI BLAST's own documentation, or the help documentation built into Biopython:

In [None]:
help(NCBIWWW.qblast)

Note that the default settings on the NCBI BLAST website are not quite the same as the defaults on QBLAST. If you get different results, you'll need to check the parameters (e.g., the expectation value threshold and the gap values). 
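
One way to avoid surprises from differing defaults is to pass the settings explicitly on every call. The sketch below assumes illustrative values for `expect` and `hitlist_size` (both are real `qblast` keyword arguments, but the numbers are not recommendations). The `run_search` wrapper and `SEARCH_SETTINGS` dictionary are hypothetical names, and the `search` parameter exists only so the wrapper can be exercised without contacting NCBI.

```python
# Search settings stated explicitly rather than left to qblast's defaults.
SEARCH_SETTINGS = {
    "expect": 1e-5,        # illustrative e-value threshold, not a recommendation
    "hitlist_size": 10,    # illustrative cap on the number of hits returned
    "format_type": "XML",  # the format expected by Biopython's BLAST parser
}


def run_search(program, database, query, search=None, **overrides):
    """Call `search` (NCBIWWW.qblast by default) with explicit settings."""
    if search is None:
        # Imported lazily so the settings can be inspected without Biopython.
        from Bio.Blast import NCBIWWW
        search = NCBIWWW.qblast
    settings = {**SEARCH_SETTINGS, **overrides}
    return search(program, database, query, **settings)


# Example (contacts the NCBI servers, so it is left commented out):
# result_handle = run_search("blastn", "nt", "8332116", expect=1e-10)
```

Keeping the settings in one dictionary also gives you a record of exactly which parameters produced a given result file.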

__Code Example 01:__ use the GI number of your query sequence to search the nucleotide database (nt) with BLASTN.

In [None]:
result_handle = NCBIWWW.qblast("blastn", "nt", "8332116")

In [None]:
print(result_handle.getvalue())

The result we get from `qblast` is a StringIO object. A StringIO object is a text stream using an in-memory text buffer. The `getvalue()` method of `StringIO` will return a string containing the entire contents of the buffer. 
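
The difference between `getvalue()` and `read()` matters later when saving results. The short sketch below uses a hand-made `StringIO` as a stand-in for a real `qblast` result (the XML snippet is a placeholder, not real BLAST output) to show that `getvalue()` leaves the stream position alone while `read()` consumes the stream.

```python
from io import StringIO

# Stand-in for a qblast result; a real handle would contain BLAST XML.
handle = StringIO("<BlastOutput>example</BlastOutput>")

first = handle.getvalue()  # whole buffer; the stream position does not move
second = handle.read()     # reads from the current position to the end
third = handle.read()      # nothing is left to read, so this is empty
handle.seek(0)             # rewind to the start to read again
fourth = handle.read()

print(first == second == fourth, repr(third))
```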

__Code Example 02:__ if we have our query sequence already in a FASTA formatted file, we just need to open the file and read in this record as a string, and use that as the query argument:

In [None]:
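# Read the whole FASTA file as one string; the file is closed automatically.
with open("data/one_cds.fa") as fasta_file:
    fasta_string = fasta_file.read()
result_handle = NCBIWWW.qblast("blastn", "nt", fasta_string)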

__Code Example 03:__ we could also have read in the FASTA file as a `SeqRecord` object and then supplied just the sequence itself:

In [None]:
from Bio import SeqIO
# SeqIO.read expects the file to contain exactly one record.
record = SeqIO.read("data/one_cds.fa", format="fasta")
result_handle = NCBIWWW.qblast("blastn", "nt", record.seq)

__Note:__ `qblast` takes only one query at a time. To run BLAST for multiple queries, we can use a `for` loop.
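
A loop over several queries might look like the sketch below. The `blast_each` function is a hypothetical helper, not a Biopython function; it takes the search call as a parameter and pauses between submissions to respect the 10-second guideline. The commented usage assumes a FASTA file containing several records at `data/few_cds.fa`.

```python
import time


def blast_each(queries, search, delay=10.0, sleep=time.sleep):
    """Submit queries one at a time, pausing `delay` seconds between them.

    `queries` maps a query name to its sequence string; `search` is a
    callable taking one sequence, e.g. a wrapper around NCBIWWW.qblast.
    """
    results = {}
    for index, (name, sequence) in enumerate(queries.items()):
        if index > 0:
            sleep(delay)  # no more than one request every `delay` seconds
        results[name] = search(sequence)
    return results


# Example (contacts the NCBI servers, so it is left commented out):
# from Bio import SeqIO
# from Bio.Blast import NCBIWWW
# records = SeqIO.parse("data/few_cds.fa", "fasta")  # assumed multi-record file
# queries = {rec.id: str(rec.seq) for rec in records}
# handles = blast_each(queries, lambda seq: NCBIWWW.qblast("blastn", "nt", seq))
```

Each value in the returned dictionary is the result handle for one query, so the results can be saved to separate files afterwards.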

## Saving BLAST Output  

Whatever arguments you give the `qblast()` function, you should get back your results in a handle object (by default in XML format). The following code saves the contents of the handle to the file "my_blast.xml". 

In [None]:
with open("my_blast.xml", "w") as out_handle:
    out_handle.write(result_handle.read())

result_handle.close()

`result_handle.close()` closes the handle object, after which it can no longer be read. If we try to get the value of `result_handle` again, Python raises `ValueError: I/O operation on closed file`.

In [None]:
result_handle.getvalue()
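
Once the handle is closed, the saved file is the only remaining copy of the results. The sketch below walks through the save, close, and reopen sequence without contacting NCBI: an in-memory `StringIO` with placeholder text stands in for a real `qblast` result, and the file name `my_blast_demo.xml` in a temporary directory is a hypothetical path chosen so that no existing file is overwritten.

```python
import os
import tempfile
from io import StringIO

# Stand-in for a qblast result handle; a real one would hold BLAST XML.
result_handle = StringIO("<BlastOutput>example</BlastOutput>")

# Hypothetical output path inside a fresh temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "my_blast_demo.xml")

with open(out_path, "w") as out_handle:
    out_handle.write(result_handle.read())
result_handle.close()

closed_message = None
try:
    result_handle.getvalue()
except ValueError as error:
    closed_message = str(error)  # the closed handle refuses further reads

# Reopen the saved file to get a fresh, readable handle.
with open(out_path) as saved_handle:
    saved_text = saved_handle.read()

print(saved_text)
```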

## Parsing BLAST Output 

# References

* Biopython - [Biopython Tutorial and Cookbook](http://biopython.org/DIST/docs/tutorial/Tutorial.html) 
* Wikipedia - [BLAST (biotechnology)](https://en.wikipedia.org/wiki/BLAST_(biotechnology))
* NCBI - [How To BLAST Guide](https://ftp.ncbi.nlm.nih.gov/pub/factsheets/HowTo_BLASTGuide.pdf) 
* NCBI - [BLAST Help Developer Information](https://blast.ncbi.nlm.nih.gov/doc/blast-help/developerinfo.html#developerinfo)