# FINDING, ACCESSING, INTEGRATING, and REUSING data (FAIR)

During this course, we will explore the history of bioinformatics data publication and reuse.  We will survey the data technologies of the past ~20 years, from the most primitive to the most contemporary.

The overview is:

* "Screen Scraping" from pages directly accessed on the Web
* APIs for data retrieval and integration
    * Custom APIs
    * REST APIs
    * BioRuby APIs
* Linked Data & SPARQL query
* FAIR Data
<pre>


</pre>


# The Painful Truth


Getting data from the Web is still far more painful than it should be.


Most resources do not offer data in a format that is easily accessible by machines.


Many resources claim to provide “RESTful” access, but even they often make it difficult (we will discuss REST in a future lecture).




# One example of a "friendly" provider

There are (too) many examples of "bad" data providers.  In your future careers, you may need to access data from one of these providers.   Good luck!

In this course, we are going to mainly focus on "nice" data providers.  Nevertheless, we are going to start with the most difficult case, so that you learn how to deal with these cases.  Over the next weeks, we will explore increasingly better, friendlier, more powerful, and easier-to-explore data sources.

For today, we are going to explore the European Bioinformatics Institute (EBI) dbFetch interface.

http://www.ebi.ac.uk/Tools/dbfetch/

dbFetch provides data in a variety of formats, using a predictable URL structure.


# From Within a Script

Scroll down the dbFetch web page until you get to the section titled "from within a script":

* from within a script - examples of the URL for all styles and formats

For people interested in programmatic access to the Dbfetch functionality, we recommend using our new Web Services version of Dbfetch: WSDbfetch.

Alternatively, you can use dbfetch for direct access:

Making scripted http requests to dbfetch is very simple. The parameters that can be used are db, id, format and style; of these, only db and id are required. When format and/or style are omitted, the defaults for the chosen database are used (the default style is always html).

The URL to dbfetch is always of this format:
**dbfetch?db=DB_NAME&id=IDS&format=FORMAT_NAME&style=STYLE_NAME**

    DB_NAME - Must be chosen from the table below
    IDS - Single id/acc or comma/white-space separated list (id1 or id1 id2 id3 or id1,id2,id3)
    FORMAT_NAME - Name of the output format, varies between databases
    STYLE_NAME - Name of the output style, available styles are raw and html


For details of the available databases, formats and styles see the list of databases [http://www.ebi.ac.uk/Tools/dbfetch/dbfetch/dbfetch.databases]. Additional examples are provided in the syntax guide [http://www.ebi.ac.uk/Tools/dbfetch/syntax.jsp].

Examples:

https://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=ena_sequence&id=J00231,K00650,D87894,AJ242600

Instead of the default html style, entries can also be retrieved in plain text (raw):
https://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=ena_sequence&id=J00231,K00650,D87894,AJ242600&style=raw

It is also possible to retrieve Fasta formatted sequences:
https://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=ena_sequence&id=J00231,K00650,D87894,AJ242600&format=fasta

For backward compatibility, the program can also be called simply by giving one or more INSDC accession numbers or entry names:
https://www.ebi.ac.uk/Tools/dbfetch/dbfetch?J00231
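
The examples above all follow the same pattern, so the URL can be assembled by a small function. The following is only a minimal sketch: `dbfetch_url` and `DBFETCH_BASE` are names invented for this illustration (they are not part of any EBI library), and the values are assumed to be plain identifiers that need no URL escaping.

```ruby
# Base address copied from the examples above.
DBFETCH_BASE = 'https://www.ebi.ac.uk/Tools/dbfetch/dbfetch'

# Hypothetical helper: builds a dbfetch URL from its four parameters.
# db and ids are required; format and style are left out when not given,
# so that the defaults for the chosen database apply.
def dbfetch_url(db:, ids:, format: nil, style: nil)
  # Accept a single id or a list, and join a list with commas.
  params = { 'db' => db, 'id' => Array(ids).join(',') }
  params['format'] = format if format
  params['style'] = style if style

  # Join name=value pairs with '&' and attach them after the '?'.
  query = params.map { |name, value| "#{name}=#{value}" }.join('&')
  "#{DBFETCH_BASE}?#{query}"
end

puts dbfetch_url(db: 'ena_sequence',
                 ids: %w[J00231 K00650 D87894 AJ242600],
                 format: 'fasta')
```

Calling the function with only `db:` and `ids:` reproduces the first example URL; adding `style: 'raw'` reproduces the second.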

# The dbFetch URL Structure

## dbfetch?db=<span style='color:red;'>DB_NAME</span>&id=IDS&format=FORMAT_NAME&style=STYLE_NAME

The first element to notice in the dbFetch URL is the "DB_NAME" field.  The value of this field can be any one of the many databases hosted at EBI.  The list is below.   I have highlighted "Ensembl Genomes Gene" because it is one of the few databases that understands *Arabidopsis* gene IDs.



# The List of Databases


An overview of each database is also provided, which includes a short description of the database, a link to the database, a collection of example identifiers and details of the available data formats and result styles.

Databases:

* EDAM (edam)
* ENA Coding (ena_coding)
* ENA Geospatial (ena_geospatial)
* ENA Non-coding (ena_noncoding)
* ENA Sequence (ena_sequence)
* ENA Sequence Constructed (ena_sequence_con)
* ENA Sequence Constructed Expanded (ena_sequence_conexp)
* ENA/SVA (ena_sva)
* Ensembl Gene (ensemblgene)
* **Ensembl Genomes Gene (ensemblgenomesgene)**[ http://www.ebi.ac.uk/Tools/dbfetch/dbfetch/dbfetch.databases#ensemblgenomesgene ]
* Ensembl Genomes Transcript (ensemblgenomestranscript)
* Ensembl Transcript (ensembltranscript)
* EPO Proteins (epo_prt)
* HGNC (hgnc)
* IMGT/HLA (imgthla)
* IMGT/LIGM-DB (imgtligm)
* InterPro (interpro)
* IPD-KIR (ipdkir)
* IPD-MHC (ipdmhc)
* IPRMC (iprmc)
* IPRMC UniParc (iprmcuniparc)
* JPO Proteins (jpo_prt)
* KIPO Proteins (kipo_prt)
* MEDLINE (medline)
* Patent DNA NRL1 (nrnl1)
* Patent DNA NRL2 (nrnl2)
* Patent Protein NRL1 (nrpl1)
* Patent Protein NRL2 (nrpl2)
* Patent Equivalents (patent_equivalents)
* PDB (pdb)
* RefSeq (nucleotide) (refseqn)
* RefSeq (protein) (refseqp)
* SGT (sgt)
* Taxonomy (taxonomy)
* Trace Archive (tracearchive)
* UniParc (uniparc)
* UniProtKB (uniprotkb)
* UniRef100 (uniref100)
* UniRef50 (uniref50)
* UniRef90 (uniref90)
* UniSave (unisave)
* USPTO Proteins (uspto_prt)


# Ensembl Genomes Gene (ensemblgenomesgene)

http://www.ensemblgenomes.org/

Ensembl Genomes: genome databases for metazoa, plants, fungi, protists and bacteria. For vertebrate species and **model organisms** [1], see Ensembl instead of Ensembl Genomes. Gene sequences and annotations.

<span style='font-size: 50%;'>[1]  Apparently, EnsEMBL doesn't consider Arabidopsis a "model organism" ;-)</span>

| <span style='color: blue;'>Format</span> | <span style='color: green;'>Styles</span> | Example Identifiers |
| ------ | ------- | ------- |
| default | default, raw | Id: AAEL000001, AGAP006864, GB46163, b2736 |
| csv | default, raw | Id: AAEL000001, AGAP006864, GB46163, b2736 |
| **embl** | default, raw | Id: AAEL000001, AGAP006864, GB46163, b2736 |
| fasta | default, raw | Id: AAEL000001, AGAP006864, GB46163, b2736 |
| genbank | default, raw | Id: AAEL000001, AGAP006864, GB46163, b2736 |
| gff2 | default, raw | Id: AAEL000001, AGAP006864, GB46163, b2736 |
| gff3 | default, raw | Id: AAEL000001, AGAP006864, GB46163, b2736 |
| tab | default, raw | Id: AAEL000001, AGAP006864, GB46163, b2736 |



## dbfetch?db=DB_NAME&id=IDS&format=<span style='color: blue;'>FORMAT_NAME</span>&style=<span style='color: green;'>STYLE_NAME</span>

The columns for Format and Style are highlighted, together with their position in our structured URL.  I have also highlighted "embl", because that is the data format that I want us to use for this example.  

# So far we have...

## dbfetch?db=<span style='color: red;'>ensemblgenomesgene</span>&format=<span style='color: blue;'>embl</span>&id=.....

These elements are called "GET String Parameters".  **db** is one parameter, with a value of "ensemblgenomesgene".  **format** is another parameter, with the value "embl".  In a URL, parameters follow the '?' symbol, and parameter/value pairs are separated by '&'. (We will discuss what "GET" means when we talk about REST in a future lecture.)
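
The '?' and '&' rules can be checked in code by taking a request apart again. This is a small sketch using Ruby's standard `uri` library; the `request` string is simply the partial request built so far, and the variable names are chosen for this illustration.

```ruby
require 'uri'

# The partial dbFetch request assembled so far in this lesson.
request = 'dbfetch?db=ensemblgenomesgene&format=embl'

# Everything after the first '?' is the GET string (query string).
resource, query = request.split('?', 2)

# decode_www_form splits the query on '&' and '=' and returns
# [name, value] pairs, which to_h turns into a Hash.
params = URI.decode_www_form(query).to_h

puts resource
params.each { |name, value| puts "#{name} = #{value}" }
```

Each parameter name becomes a key in `params`, so `params['db']` holds the database name and `params['format']` holds the requested format.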

What is the complete URL?  What is before the '?'?

Go back to the main dbFetch page and look at the examples:

Examples:

**https://www.ebi.ac.uk/Tools/dbfetch/dbfetch**?db=ena_sequence&id=J00231,K00650,D87894,AJ242600

Instead of the default html style, entries can also be retrieved in plain text (raw):

**https://www.ebi.ac.uk/Tools/dbfetch/dbfetch**?db=ena_sequence&id=J00231,K00650,D87894,AJ242600&style=raw

It is also possible to retrieve Fasta formatted sequences:

**https://www.ebi.ac.uk/Tools/dbfetch/dbfetch**?db=ena_sequence&id=J00231,K00650,D87894,AJ242600&format=fasta


##  So, the URL for the Arabidopsis gene At3g54340, in EMBL format, is:

### https://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=ensemblgenomesgene&format=embl&id=At3g54340

Open this in your browser (open in a new window!) and examine the structure of that file.  How would you extract information from that file, for example, the name of the gene, or the number of exons?
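
One possible approach, once you have looked at the file, is to treat the record as plain text and pick out lines with regular expressions. The sketch below runs on a hand-written stand-in, not on the real record: the identifiers, coordinates, and qualifier names in `SAMPLE_RECORD` are placeholders, and the helper names `count_exons` and `first_qualifier` are invented for this illustration. Compare it with what you see in your browser before relying on any of these patterns.

```ruby
# Hand-written stand-in for an EMBL-style feature table. It is NOT a real
# record: identifiers, coordinates, and qualifier names are placeholders.
SAMPLE_RECORD = <<~EMBL
  ID   EXAMPLE; SV 1; linear; genomic DNA; STD; PLN; 300 BP.
  FT   gene            1..300
  FT                   /gene="EXAMPLE_ID"
  FT                   /locus_tag="EXAMPLE_NAME"
  FT   exon            1..100
  FT   exon            150..220
  FT   exon            260..300
  //
EMBL

# Count feature-table (FT) lines whose feature key is "exon".
def count_exons(record)
  record.each_line.count { |line| line.match?(/^FT\s+exon\s/) }
end

# Return the value of the first qualifier with the given name, or nil
# when no such qualifier is present.
def first_qualifier(record, name)
  match = record.match(%r{^FT\s+/#{Regexp.escape(name)}="?([^"\n]+)"?})
  match && match[1]
end

puts count_exons(SAMPLE_RECORD)
puts first_qualifier(SAMPLE_RECORD, 'locus_tag')
```

Line-by-line matching like this is fragile: it depends on the exact layout of the file, which is why later lessons move towards friendlier data sources.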

<pre>
  



</pre>
  
# Prove that you understand:

* Retrieve the EMBL record for AT4G36920 (what is the name of this gene?)
* Retrieve the FASTA file for AT4G36920
* Retrieve the UniProt record for AP3_ARATH 
    * DO NOT retrieve it in HTML - i.e. you want the RECORD, not the web page containing the record --> you will need to use the “style” GET string parameter (&style=...).  Try the different "style" options until you get what you want. 


<pre>



</pre>

# Your first step towards massively integrative informatics - Leave the browser behind!

It is impossible to do modern bioinformatics through your browser!  To achieve data discovery and integration at the scale of contemporary systems biology, it is necessary to do all of the data retrieval and integration inside of your software.

How do you access The Web in your code?

<pre>

</pre>



    require 'net/http'

    # Fetch a URL, following up to `limit` HTTP redirects.
    def fetch(uri_str, limit = 10)
      # You should choose a better exception.
      raise ArgumentError, 'too many HTTP redirects' if limit == 0

      response = Net::HTTP.get_response(URI(uri_str))

      case response
      when Net::HTTPSuccess
        puts "success"
        response
      when Net::HTTPRedirection
        location = response['location']
        puts "redirected to #{location}"
        fetch(location, limit - 1)
      else
        puts "something else"
        response.value  # raises an exception for any non-2xx response
      end
    end

    # An unconditional raise such as the one below would stop the script here,
    # so it is left commented out:
    # raise ArgumentError, 'arguments are really really bad!'

    res = fetch('http://linkeddata.systems/Accessors/UniProtAccessor')
    puts res.class
    #puts res.public_methods
    body = res.body