# Gene Ontology and BLAST search

## Instructions

Gene Ontology is not discussed by the Biopython tutorial because Biopython does not support Gene Ontology. You should read the [Gene Ontology](http://www.geneontology.org/page/introduction-go-resource) documentation to the extent you need to understand the purpose and overall content of Gene Ontology. The [OBO Flat File Format Guide](https://owlcollab.github.io/oboformat/doc/GO.format.obo-1_4.html) is another useful resource for this course.

For BLAST, read sections 7.1, 7.3, and 7.4 of the Biopython tutorial. If you are interested, you can also take a look at chapter 8.

## Objectives

- parse and use Gene Ontology terms
- perform BLAST searches and process their results

## Summary

[Gene Ontology](http://www.geneontology.org/) is an extensive ontology that contains terms for cellular components, molecular functions, and biological processes. The meanings of the terms are described and their relationships to other terms are defined. Among others, UniProt uses Gene Ontology to annotate protein sequences.

BLAST is a method for searching a database for sequences that are similar to a query sequence. NCBI provides an online BLAST service that can be used to query several databases. A programmatic interface to the NCBI BLAST is implemented in the [`Bio.Blast`](https://biopython.org/DIST/docs/api/Bio.Blast-module.html) module of Biopython. There is also the [`Bio.SearchIO`](https://biopython.org/DIST/docs/api/Bio.SearchIO-module.html) module, which defines a generic interface to sequence search program outputs, but it is still experimental.

#### Gene Ontology is a *de facto* standard resource for characterising proteins

Gene Ontology is available for download at [http://purl.obolibrary.org/obo/go.obo](http://purl.obolibrary.org/obo/go.obo) in the OBO format. The OBO format is a simple, human-readable format. Gene Ontology is not supported by Biopython, but the syntax of the OBO format is easy to parse.

The OBO format is intended for representing ontologies. It is therefore very expressive, and not all ontologies use all of its features. The [OBO Flat File Format Guide](https://owlcollab.github.io/oboformat/doc/GO.format.obo-1_4.html) is a good place to start learning the details of the format. To understand the OBO representation of an ontology in depth, you should also know the basic principles of ontology construction. These are beyond the scope of this course, however.

The terms are defined in `[Term]` stanzas. Each stanza contains at least one `tag: value ! comment` line.

The example below contains the terms `membrane` and `ATPase activator activity`. Note how the terms have basic information (such as `id`, `name`, `namespace`, and `def`) but also additional information (such as `is_a` and `relationship`).

```
[Term]
id: GO:0016020
name: membrane
namespace: cellular_component
def: "A lipid bilayer along with all the proteins and protein complexes embedded in it an attached to it." [GOC:dos, GOC:mah, ISBN:0815316194]
subset: goslim_aspergillus
subset: goslim_candida
subset: goslim_chembl
subset: goslim_flybase_ribbon
subset: goslim_metagenomics
subset: goslim_pir
subset: goslim_plant
subset: goslim_yeast
xref: Wikipedia:Biological_membrane
is_a: GO:0005575 ! cellular_component

[Term]
id: GO:0001671
name: ATPase activator activity
namespace: molecular_function
def: "Binds to and increases the ATP hydrolysis activity of an ATPase." [GOC:ajp]
synonym: "ATPase stimulator activity" EXACT []
xref: reactome:R-HSA-5251955 "HSP40s activate intrinsic ATPase activity of HSP70s in the nucleoplasm"
xref: reactome:R-HSA-5251959 "HSP40s activate intrinsic ATPase activity of HSP70s in the cytosol"
is_a: GO:0008047 ! enzyme activator activity
is_a: GO:0060590 ! ATPase regulator activity
relationship: part_of GO:0032781 ! positive regulation of ATPase activity
relationship: positively_regulates GO:0016887 ! ATPase activity
```
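
As a minimal sketch, a single line of a stanza such as the ones above could be split into its tag, value, and comment as follows. The function name `split_obo_line` is made up for this example, and the sketch assumes that the separator ` ! ` never occurs inside the value itself, which the OBO format does not guarantee for quoted strings.

```python
def split_obo_line(line):
    """Split one 'tag: value ! comment' line into (tag, value, comment)."""
    # the tag ends at the first colon
    tag, _, rest = line.strip().partition(":")
    # a trailing comment starts with ' ! '
    # (simplification: assumes ' ! ' never occurs inside the value itself)
    value, _, comment = rest.partition(" ! ")
    return tag.strip(), value.strip(), comment.strip()


print(split_obo_line("is_a: GO:0005575 ! cellular_component"))
print(split_obo_line("id: GO:0016020"))
```

Only the first colon is used as the separator, so values that contain colons themselves (such as `GO:0005575`) stay intact. A line without a comment simply yields an empty string as its third element.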

The relationship types are defined in `[Typedef]` stanzas. The syntax is the same as in `[Term]` stanzas, but the set of tags is different.

```
[Typedef]
id: positively_regulates
name: positively regulates
namespace: external
xref: RO:0002213
holds_over_chain: negatively_regulates negatively_regulates
is_a: regulates ! regulates
transitive_over: part_of ! part of
```

When you are implementing your own parser, you are free to consider only the tags that are relevant to your analysis and to collect the values into the data structure that best suits your needs. For simple analyses, it is enough to have a parser that reads the OBO file one line at a time and constructs named tuples from the parsed tag-value pairs.
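
One possible shape for such a parser is sketched below. It is not a complete OBO parser: the names `Term`, `parse_terms`, and `make_term` are made up for this example, only four tags are kept, every term is assumed to have an `id`, a `name`, and a `namespace`, and the separator ` ! ` is assumed never to occur inside a value.

```python
import collections
import io

# hypothetical record type: only the tags needed for a simple analysis
Term = collections.namedtuple("Term", ["id", "name", "namespace", "is_a"])


def parse_terms(handle):
    """Yield a Term for every [Term] stanza in an OBO file handle."""
    tags = None  # tag-value pairs of the [Term] stanza being read, if any
    for line in handle:
        line = line.strip()
        if line.startswith("["):
            # a new stanza begins: emit the previous term, if there was one
            if tags is not None:
                yield make_term(tags)
            tags = collections.defaultdict(list) if line == "[Term]" else None
        elif tags is not None and ":" in line:
            tag, _, value = line.partition(":")
            # drop the trailing '! comment' part
            # (simplification: assumes ' ! ' never occurs inside a value)
            tags[tag.strip()].append(value.partition(" ! ")[0].strip())
    # emit the last term of the file
    if tags is not None:
        yield make_term(tags)


def make_term(tags):
    """Build a Term from the collected tag-value pairs of one stanza."""
    return Term(
        id=tags["id"][0],
        name=tags["name"][0],
        namespace=tags["namespace"][0],
        is_a=tuple(tags["is_a"]),
    )


example = io.StringIO("""\
[Term]
id: GO:0001671
name: ATPase activator activity
namespace: molecular_function
is_a: GO:0008047 ! enzyme activator activity
is_a: GO:0060590 ! ATPase regulator activity

[Typedef]
id: positively_regulates
name: positively regulates
""")

for term in parse_terms(example):
    print(term.id, term.name, term.is_a)
```

Because `parse_terms` is a generator that reads its input line by line, the same function should also work on the full `go.obo` file opened with `open`. Tags that may occur several times, such as `is_a`, are collected into tuples, whereas only the first value is kept for the other tags.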

#### NCBI online BLAST service is supported by Biopython

The [`Bio.Blast.NCBIWWW`](https://biopython.org/DIST/docs/api/Bio.Blast.NCBIWWW-module.html) module has the function `qblast`, with which the NCBI BLAST service can be used programmatically. The workflow of using NCBI BLAST is similar to that of using EUtils.

The `qblast` function requires the BLAST variant, the database, and the query sequence as arguments. The sequence can be given as a plain sequence, as an ID or in FASTA format. The available databases are listed, among others, in the README of the [NCBI BLAST FTP site](ftp://ftp.ncbi.nlm.nih.gov/blast/blastftp.txt).

There are also many optional arguments for fine-tuning the search. Make sure to check the default values as they may not be what you expect.

IMPORTANT: As with any other online service, do not overload the NCBI BLAST server with too many or too large queries. BLAST searches are slow and the NCBI BLAST (even with NCBI's massive server capacity) is no exception. Please wait patiently until you get the results, and save them to a file so that you do not need to repeat the same search. (Biopython will handle the waiting for you.)
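
One way to avoid repeating a search is to run it only when no saved result exists yet. The sketch below illustrates the idea with a helper named `cached_search`, which is made up for this example and is not part of Biopython. A stand-in function replaces the real search so that the example runs without a network connection.

```python
import io
import os
import tempfile


def cached_search(path, run_search):
    """Return the results stored in `path`, running the search only if needed.

    `run_search` is a function without arguments that returns a handle
    (an object with a `read` method), such as the one returned by `qblast`.
    """
    if not os.path.exists(path):
        # no saved results yet: run the (slow) search and save its output
        handle = run_search()
        with open(path, "w") as f:
            f.write(handle.read())
    # read the saved results, whether they are old or new
    with open(path) as f:
        return f.read()


# demonstration with a stand-in for the real search
calls = []


def fake_search():
    calls.append("search")
    return io.StringIO("<BlastOutput/>")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "blast-results.xml")
    first = cached_search(path, fake_search)
    second = cached_search(path, fake_search)

# the second call reads the file instead of searching again
print(first == second, len(calls))
```

With the names defined in the cells that follow, a real call could look like `cached_search('blast-results.xml', lambda: BBNW.qblast(prog, database, query, expect=0.001, hitlist_size=10))`. Here `expect` and `hitlist_size` are two of the optional arguments of `qblast`.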

In [None]:
import Bio.Blast.NCBIWWW as BBNW

In [None]:
import Bio.Seq as BS

# BLAST program to use
prog = "blastp"
# database to search against
database = "swissprot"
# query sequence as a Seq object
# (Bio.Alphabet was removed in Biopython 1.78, so no alphabet is given)
query = BS.Seq("IRVEGNLRVEYLDDRNTFRHSVVVPYEPPE")

# run NCBI BLAST
handle = BBNW.qblast(prog, database, query)

# save to file
# (particularly useful with BLAST, which is slow to run)
with open('blast-results.xml', 'w') as f:
    f.write(handle.read())

The XML output can be parsed with the [`Bio.Blast.NCBIXML`](https://biopython.org/DIST/docs/api/Bio.Blast.NCBIXML-module.html) module. The `read` function parses a single BLAST record, which is obtained from a single query sequence.

In [None]:
import Bio.Blast.NCBIXML as BBNX

In [None]:
# parse the result into a BLAST record
with open('blast-results.xml') as f:
    record = BBNX.read(f)

# a single BLAST record
# (note that there is no human-readable representation defined for BLAST record objects)
print(record)

The NCBI BLAST will accept several query sequences simultaneously. In fact, it is preferred to send all query sequences at once, if possible. The `parse` function of the `Bio.Blast.NCBIXML` module will parse and iterate over several BLAST records.

In [None]:
# BLAST program to use
prog = "blastp"
# database to search against
database = "swissprot"
# query sequences as a list of IDs
query = ['P01013', 'P12345']
# NCBI BLAST expects IDs as a string with one ID per line
query = "\n".join(query)

# run NCBI BLAST
handle = BBNW.qblast(prog, database, query)

# save to file
# (particularly useful with BLAST, which is slow to run)
with open('blast-results-many.xml', 'w') as f:
    f.write(handle.read())

In [None]:
# parse the result into BLAST records
with open('blast-results-many.xml') as f:
    # one BLAST record per query sequence
    for record in BBNX.parse(f):
        print(record)

As with the UniProt API, the `qblast` function reflects the behaviour of the corresponding website. It is therefore a good idea to take advantage of the graphical interface when designing the analysis and debugging the code. The NCBI BLAST and its descriptions are available at [https://blast.ncbi.nlm.nih.gov/Blast.cgi](https://blast.ncbi.nlm.nih.gov/Blast.cgi).

#### BLAST records contain the details and results of the search

The process by which BLAST finds matching sequences is as follows:

- Create words (i.e. short segments) from the query sequence.
- Select the similar words that have the highest scores against the words in the query sequence.
- Scan the database for exact matches of the selected words.
- Extend the matches until the score between the query sequence and the database sequence starts to decrease. The extended matches are called high-scoring segment pairs (HSPs).
- Filter and evaluate the high-scoring HSPs.

A single hit is therefore an alignment that is composed of at least one HSP.
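
The steps above can be illustrated with a toy implementation. The sketch below is not how BLAST is actually implemented and rests on several simplifying assumptions: only exact copies of the query words are used (the step of selecting similar words is skipped), matches and mismatches are scored as +1 and -1 instead of with a substitution matrix, no gaps are allowed, the extension tolerates a small temporary drop in the score (`max_drop`), and no statistical evaluation is done. All function names are made up for this example.

```python
def words(sequence, size):
    """Return (position, word) pairs for all overlapping words of a sequence."""
    return [(i, sequence[i:i + size]) for i in range(len(sequence) - size + 1)]


def extend(query, subject, q_pos, s_pos, size, match=1, mismatch=-1, max_drop=2):
    """Extend an exact word match in both directions without gaps.

    Returns (score, query_start, subject_start, length) of the best segment pair.
    """
    score = best = size * match
    # extend to the right
    right = best_right = 0
    while q_pos + size + right < len(query) and s_pos + size + right < len(subject):
        same = query[q_pos + size + right] == subject[s_pos + size + right]
        score += match if same else mismatch
        right += 1
        if score > best:
            best, best_right = score, right
        elif best - score > max_drop:
            break
    # extend to the left, starting again from the best score so far
    score = best
    left = best_left = 0
    while q_pos - left > 0 and s_pos - left > 0:
        same = query[q_pos - left - 1] == subject[s_pos - left - 1]
        score += match if same else mismatch
        left += 1
        if score > best:
            best, best_left = score, left
        elif best - score > max_drop:
            break
    return best, q_pos - best_left, s_pos - best_left, size + best_left + best_right


def find_hsps(query, subject, size=3):
    """Find ungapped segment pairs seeded by exact word matches."""
    # index the words of the subject sequence by their positions
    index = {}
    for s_pos, word in words(subject, size):
        index.setdefault(word, []).append(s_pos)
    # extend every word match; seeds that lead to the same segment pair collapse
    hsps = set()
    for q_pos, word in words(query, size):
        for s_pos in index.get(word, []):
            hsps.add(extend(query, subject, q_pos, s_pos, size))
    # best-scoring segment pairs first
    return sorted(hsps, reverse=True)


query = "IRVEGNLRVEYL"
subject = "MKVEGNLRAEYLDD"
for score, q_start, s_start, length in find_hsps(query, subject):
    print(score, query[q_start:q_start + length], subject[s_start:s_start + length])
```

In this example, five different word matches act as seeds, but they all extend to the same segment pair, which spans the single mismatch between the two sequences.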

The classes related to BLAST records are in the [`Bio.Blast.Record`](https://biopython.org/DIST/docs/api/Bio.Blast.Record-module.html) module. The Biopython tutorial has a useful diagram that summarises the relationships between the different classes as well as their attributes.

A `Bio.Blast.Record.Blast` object contains the full set of results from a single query. Its `descriptions` attribute is a list of `Bio.Blast.Record.Description` objects, each of which summarises one hit, and its `alignments` attribute is a list of `Bio.Blast.Record.Alignment` objects, which have the details of the alignments.

The `hsps` attribute of a `Bio.Blast.Record.Alignment` object contains the details of the HSPs as a list of `Bio.Blast.Record.HSP` objects.

A BLAST record also contains information regarding the search itself. If you are interested, see the attributes defined in the superclasses `Bio.Blast.Record.Header`, `Bio.Blast.Record.DatabaseReport`, and `Bio.Blast.Record.Parameters`.
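
As an illustration of how these classes fit together, the sketch below collects the best HSP of each hit whose E-value is below a threshold. It relies only on the attribute names `alignments`, `title`, `hsps`, `expect`, and `score`. The function name `best_hsps` and the default threshold are made up for this example, and the record is built from stand-in objects with invented values so that the example runs without a BLAST search.

```python
import types


def best_hsps(record, max_expect=0.001):
    """Return (title, expect, score) of the best HSP of each significant hit."""
    results = []
    for alignment in record.alignments:
        # the HSP with the lowest E-value represents the hit
        hsp = min(alignment.hsps, key=lambda h: h.expect)
        if hsp.expect <= max_expect:
            results.append((alignment.title, hsp.expect, hsp.score))
    return results


# stand-in objects with the same attribute names as the Bio.Blast.Record classes
# (made-up values, only for demonstrating the function without running a search)
record = types.SimpleNamespace(alignments=[
    types.SimpleNamespace(title="hit A", hsps=[
        types.SimpleNamespace(expect=1e-20, score=150.0),
        types.SimpleNamespace(expect=0.5, score=30.0),
    ]),
    types.SimpleNamespace(title="hit B", hsps=[
        types.SimpleNamespace(expect=2.0, score=25.0),
    ]),
])

for title, expect, score in best_hsps(record):
    print(title, expect, score)
```

The same function should work unchanged on a record returned by `Bio.Blast.NCBIXML.read`, because it only accesses attributes that the real classes also define.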

The examples below print the details of hits in a human-readable format. You could also do the same for the full `Bio.Blast.Record.Blast` object.

In [None]:
# parse the result into a BLAST record
with open('blast-results.xml') as f:
    record = BBNX.read(f)

# iterate over the first 5 hit descriptions
for description in record.descriptions[:5]:
    # iterate over all attributes of the description
    # (sorted by key)
    for k, v in sorted(vars(description).items()):
        print(f"{k}: {v}")
    print()

In [None]:
# parse the result into a BLAST record
with open('blast-results.xml') as f:
    record = BBNX.read(f)

# iterate over the first 3 hit alignments
for alignment in record.alignments[:3]:
    # iterate over all attributes of the alignment
    # (sorted by key)
    for k, v in sorted(vars(alignment).items()):
        # skip over the list of HSPs because it will be handled later
        if k == 'hsps':
            continue
        print(f"{k}: {v}")
    # iterate over the first 3 HSPs of the hit
    for hsp in alignment.hsps[:3]:
        print("-- HSP --")
        # iterate over all attributes of the HSP
        # (sorted by key)
        for k, v in sorted(vars(hsp).items()):
            print(f"  {k}: {v}")
    print()