# FINDING, ACCESSING, INTEGRATING, and REUSING data (FAIR)

During this course, we will explore the history of bioinformatics data publication and reuse.  We will explore data technologies from the most primitive through the past ~20 years to the most contemporary.  

The overview of our lessons for the next ~2 months is:

* "Screen Scraping" data from pages directly accessed on the Web
* Application Programming Interfaces (APIs) for data retrieval and integration
    * Custom APIs
    * "REST" APIs
    * BioRuby APIs
* Linked Data & SPARQL query
* FAIR Data

At the end of the course, we will explore a very simple (very primitive!) way to create Web pages that are connected to databases.  

In the end, you will be required to show that you can publish data that is FAIR for both humans and machines: you will build a Website that can be queried by both people and computers (this is your final assignment of the year).
<pre>


</pre>


# The Painful Truth


Getting data from the Web is still far more painful than it should be.


Most resources do not offer data in a format that is easily accessible by machines.


Many resources claim to provide “RESTful” access, but even they often make it difficult (we will discuss REST in a future lecture).




# One example of a "friendly" provider

There are (too) many examples of Bioinformatics data providers that do not make it easy to retrieve their data.  In your future careers, you will probably need to access data from one of these providers.   Good luck!

In this course, we are going to mainly focus on "nice" data providers.  Nevertheless, we begin with a ~difficult case, so that you learn how to deal with these situations.  After that, and over the next weeks, we will explore increasingly better, friendlier, more powerful, and easier-to-explore data sources.

For today, we are going to explore the European Bioinformatics Institute (EBI) dbFetch interface (**NOTE: I am NOT saying that this is a BAD interface!  I am just using it as an example of a difficult case from the perspective of software!**).

http://www.ebi.ac.uk/Tools/dbfetch/

dbFetch provides data in a variety of formats, using a predictable URL structure.

<pre>


</pre>


# From Within a Script

Scroll down the dbFetch web page until you get to the section titled "from within a script":

------------------

* from within a script - examples of the URL for all styles and formats

For people interested in programmatic access to the Dbfetch functionality, we recommend using our new Web Services version of Dbfetch: WSDbfetch.

Alternatively, you can use dbfetch for direct access:

Making scripted http requests to dbfetch is very simple, the parameters which can be used are db, id, format and style. Of these parameters only db and id are required fields. When omitting to use format and/or style, the defaults for the chosen database will be used (the default style is always html).

The URL to dbfetch is always of this format:
**dbfetch?db=DB_NAME&id=IDS&format=FORMAT_NAME&style=STYLE_NAME**

    DB_NAME - Must be chosen from the table below
    IDS - Single id/acc or comma/white-space separated list (id1 or id1 id2 id3 or id1,id2,id3)
    FORMAT_NAME - Name of the output format, varies between databases
    STYLE_NAME - Name of the output style, available styles are raw and html


For details of the available databases, formats and styles see the list of databases [http://www.ebi.ac.uk/Tools/dbfetch/dbfetch/dbfetch.databases]. Additional examples are provided in the syntax guide [http://www.ebi.ac.uk/Tools/dbfetch/syntax.jsp].

Examples:

https://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=ena_sequence&id=J00231,K00650,D87894,AJ242600

Instead of the default html style, entries can also be retrieved in plain text (raw):
https://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=ena_sequence&id=J00231,K00650,D87894,AJ242600&style=raw

It is also possible to retrieve Fasta formatted sequences:
https://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=ena_sequence&id=J00231,K00650,D87894,AJ242600&format=fasta

Because of backward compatibility issues the program can be simply called by giving one or more INSDC accession numbers or entry names:
https://www.ebi.ac.uk/Tools/dbfetch/dbfetch?J00231

-------------------

# The dbFetch URL Structure

## dbfetch?db=<span style='color:red;'>DB_NAME</span>&id=IDS&format=FORMAT_NAME&style=STYLE_NAME

The first element to notice in the dbFetch URL is the "DB_NAME" field.  The value of this field can be any one of the many databases hosted at EBI.  The list is below.   I have highlighted "Ensembl Genomes Gene" because it is one of the few databases that understands *Arabidopsis* gene IDs.



# The List of Databases


An overview of each database is also provided, which includes a short description of the database, a link to the database, a collection of example identifiers and details of the available data formats and result styles.

* EDAM (edam)
* ENA Coding (ena_coding)
* ENA Geospatial (ena_geospatial)
* ENA Non-coding (ena_noncoding)
* ENA Sequence (ena_sequence)
* ENA Sequence Constructed (ena_sequence_con)
* ENA Sequence Constructed Expanded (ena_sequence_conexp)
* ENA/SVA (ena_sva)
* Ensembl Gene (ensemblgene)
* **Ensembl Genomes Gene (ensemblgenomesgene)**[ http://www.ebi.ac.uk/Tools/dbfetch/dbfetch/dbfetch.databases#ensemblgenomesgene ]
* Ensembl Genomes Transcript (ensemblgenomestranscript)
* Ensembl Transcript (ensembltranscript)
* EPO Proteins (epo_prt)
* HGNC (hgnc)
* IMGT/HLA (imgthla)
* IMGT/LIGM-DB (imgtligm)
* InterPro (interpro)
* IPD-KIR (ipdkir)
* IPD-MHC (ipdmhc)
* IPRMC (iprmc)
* IPRMC UniParc (iprmcuniparc)
* JPO Proteins (jpo_prt)
* KIPO Proteins (kipo_prt)
* MEDLINE (medline)
* Patent DNA NRL1 (nrnl1)
* Patent DNA NRL2 (nrnl2)
* Patent Protein NRL1 (nrpl1)
* Patent Protein NRL2 (nrpl2)
* Patent Equivalents (patent_equivalents)
* PDB (pdb)
* RefSeq (nucleotide) (refseqn)
* RefSeq (protein) (refseqp)
* SGT (sgt)
* Taxonomy (taxonomy)
* Trace Archive (tracearchive)
* UniParc (uniparc)
* UniProtKB (uniprotkb)
* UniRef100 (uniref100)
* UniRef50 (uniref50)
* UniRef90 (uniref90)
* UniSave (unisave)
* USPTO Proteins (uspto_prt)


# Ensembl Genomes Gene (ensemblgenomesgene)

http://www.ensemblgenomes.org/

Ensembl Genomes genome databases for metazoa, plants, fungi, protists and bacteria, for vertebrate species and **model organisms** [1] see Ensembl instead of Ensembl Genomes. Gene sequences and annotations.

<span style='font-size: 50%;'>[1]  Apparently, EnsEMBL doesn't consider Arabidopsis a "model organism" ;-)</span>

| <span style='color: blue;'>Format</span> |	<span style='color: green;'>Styles</span> | 	Example Identifiers | 
| ------ | ------- | -------
| default  | 	default, raw 	 | Id: AAEL000001, AGAP006864, GB46163, b2736
| csv  | 	default, raw  | 	Id: AAEL000001, AGAP006864, GB46163, b2736
| **embl**  | 	default, raw  | 	Id: AAEL000001, AGAP006864, GB46163, b2736
| fasta  | 	default, raw  | 	Id: AAEL000001, AGAP006864, GB46163, b2736
| genbank  | 	default, raw  | 	Id: AAEL000001, AGAP006864, GB46163, b2736
| gff2  | 	default, raw  | 	Id: AAEL000001, AGAP006864, GB46163, b2736
| gff3  | 	default, raw  | 	Id: AAEL000001, AGAP006864, GB46163, b2736
| tab |  	default, raw  | 	Id: AAEL000001, AGAP006864, GB46163, b2736 



## dbfetch?db=DB_NAME&id=IDS&format=<span style='color: blue;'>FORMAT_NAME</span>&style=<span style='color: green;'>STYLE_NAME</span>

The columns for Format and Style are highlighted, together with their position in our structured URL.  I have also highlighted **"embl"**, because that is the data format that I want us to use for this example.  

# So far we have...

## dbfetch?db=<span style='color: red;'>ensemblgenomesgene</span>&format=<span style='color: blue;'>embl</span>&id=.....

These elements are called "GET string parameters".  **db** is one parameter, with a value of "ensemblgenomesgene".  **format** is another parameter, with the value "embl".  In a URL, GET string parameters have one of these two patterns:

    ?parameter1=value1;parameter2=value2;.....
    ?parameter1=value1&parameter2=value2&.....
   
i.e., at the end of the URL, the parameter/value pairs follow a '?' symbol, and the pairs are separated from one another by '&' or ';'. (We will discuss what "GET" means when we talk about REST in a future lecture.)
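As a minimal sketch of this pattern, Ruby's standard `uri` library can split a GET string into its parameter/value pairs, and build one again from the pairs.  The variable names are only illustrative.  Note that, as far as I know, `URI.decode_www_form` splits only on '&' by default, so a ';'-separated string would need extra handling.

```ruby
require 'uri'

# The GET string we have assembled so far (everything after the '?')
get_string = 'db=ensemblgenomesgene&format=embl'

# Split it into [parameter, value] pairs
pairs = URI.decode_www_form(get_string)

pairs.each { |name, value| puts "#{name} = #{value}" }

# Going the other way: build a GET string from parameter/value pairs
rebuilt = URI.encode_www_form(pairs)
puts rebuilt
```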

What is the complete URL?  What is before the '?'?

Go back to the main dbFetch page and look at the examples:

Examples:

**https://www.ebi.ac.uk/Tools/dbfetch/dbfetch**?db=ena_sequence&id=J00231,K00650,D87894,AJ242600

Instead of the default html style, entries can also be retrieved in plain text (raw):

**https://www.ebi.ac.uk/Tools/dbfetch/dbfetch**?db=ena_sequence&id=J00231,K00650,D87894,AJ242600&style=raw

It is also possible to retrieve Fasta formatted sequences:

**https://www.ebi.ac.uk/Tools/dbfetch/dbfetch**?db=ena_sequence&id=J00231,K00650,D87894,AJ242600&format=fasta
<pre>


</pre>



##  So, the URL for the Arabidopsis gene At3g54340, in EMBL format, is:

### https://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=ensemblgenomesgene&format=embl&id=At3g54340

Open this in your browser (open in a new window!) and examine the structure of that file.  How would you extract information from that file, for example, the name of the gene, or the number of exons?
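The same URL can also be assembled in code rather than typed by hand.  This is only a sketch: `dbfetch_url` is a hypothetical helper (it is not part of any library), and it uses Ruby's standard `uri` library to join the parameters.

```ruby
require 'uri'

# Hypothetical helper: assemble a dbFetch URL from its parts.
# db and id are required; format and style are optional, as the dbFetch page describes.
def dbfetch_url(db:, id:, format: nil, style: nil,
                base: 'https://www.ebi.ac.uk/Tools/dbfetch/dbfetch')
  # compact drops the optional parameters that were not set
  params = { db: db, format: format, style: style, id: id }.compact
  "#{base}?#{URI.encode_www_form(params)}"
end

puts dbfetch_url(db: 'ensemblgenomesgene', format: 'embl', id: 'At3g54340')
```

One detail of this design: `URI.encode_www_form` percent-encodes special characters, so a comma-separated list of ids would be sent with each ',' written as '%2C'.  A Web server is expected to decode that, but it is worth checking against the real service.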

<pre>
  



</pre>
  
# Prove that you understand:

* Retrieve the EMBL record for AT4G36920 (what is the name of this gene?)
* Retrieve the FASTA file for AT4G36920
* Retrieve the UniProt record for AP3_ARATH 
    * DO NOT retrieve it in HTML - i.e. you want the RECORD, not the web page containing the record --> you will need to use the “style” GET string parameter (&style=...).  Try the different "style" options until you get what you want.  Look at the table(s) of styles for each database to decide


<pre>



</pre>

# Your first step towards massively integrative informatics - leave the browser behind!

It is impossible to do modern bioinformatics through your browser!  To achieve data discovery and integration at the scale of contemporary systems biology, it is necessary to do all of the data retrieval and integration inside of your software.

**How do you access The Web in your code?   Use a software library called "rest-client"**


<pre>

</pre>



In [1]:


require 'rest-client'   # this is how you access the Web

address = 'http://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=ensemblgenomesgene&format=embl&id=At3g54340'

response = RestClient::Request.execute(
  method: :get,
  url: address)  # use the RestClient::Request object's method "execute"
puts response.body



ID   3    standard; DNA; HTG; 2175 BP.
XX
AC   chromosome:TAIR10:3:20119140:20121314:1
XX
SV   chromosome:TAIR10:3:20119140:20121314:1
XX
DT   17-JUN-2020
XX
DE   Arabidopsis thaliana chromosome 3 TAIR10 partial sequence
DE   20119140..20121314 annotated by The Arabidopsis Information Resource
XX
KW   .
XX
OS   Arabidopsis thaliana (thale-cress)
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids;
OC   eurosids II; Brassicales; Brassicaceae; Arabidopsis.
XX
CC   This sequence displays annotation from Ensembl Genomes, based on underlying
CC   annotation from The Arabidopsis Information Resource (
CC   https://www.araport.org/ ). See http://www.ensemblgenomes.org for more
CC   information.
XX
CC   All feature locations are relative to the first (5') base of the sequence
CC   in this file.  The sequence presented is always the forward strand of the
CC   assembly. Features that lie outside of the s

In [3]:
# RestClient also has shortcut methods (such as RestClient.get) for the simplest cases:

require 'rest-client'   # this is how you access the Web

address = 'http://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=ensemblgenomesgene&format=embl&id=At3g54340'

response = RestClient.get(address) 
puts response.body

ID   3    standard; DNA; HTG; 2175 BP.
XX
AC   chromosome:TAIR10:3:20119140:20121314:1
XX
SV   chromosome:TAIR10:3:20119140:20121314:1
XX
DT   20-JUL-2023
XX
DE   Arabidopsis thaliana chromosome 3 TAIR10 partial sequence
DE   20119140..20121314 annotated by Araport11
XX
KW   .
XX
OS   Arabidopsis thaliana (thale-cress)
OC   Eukaryota; Viridiplantae; Streptophyta; Streptophytina; Embryophyta;
OC   Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliopsida; Mesangiospermae;
OC   eudicotyledons; Gunneridae; Pentapetalae; rosids; malvids; Brassicales;
OC   Brassicaceae; Camelineae.
XX
CC   This sequence displays annotation from Ensembl Genomes, based on underlying
CC   annotation from Araport11 ( https://araport.org ). See
CC   http://www.ensemblgenomes.org for more information.
XX
CC   All feature locations are relative to the first (5') base of the sequence
CC   in this file.  The sequence presented is always the forward strand of the
CC   assembly. Features that lie outside of the sequen


<pre>


</pre>

## That was easy!  


That was easy!   ....too easy!  This is a good time to learn about how to handle errors in Ruby.  A lot of the time when you try to access the Web, something will go wrong... maybe the Web page doesn't exist, or you have lost your WiFi connection, or you have made a mistake in your code, or... or... or...


The code below does the same thing as the code above, but it is more careful.  It uses a function called "fetch" to attempt to contact the website, and if that fails, it does something useful (it prints the error to STDERR and returns false).  This is the first time you have seen error-handling.  Look at the examples in the code, and try to understand them.  See also:  http://rubylearning.com/satishtalim/ruby_exceptions.html 
<pre>


</pre>

In [5]:
require 'rest-client'  

# Create a function called "fetch" that we can re-use everywhere in our code

def fetch(url, headers = {accept: "*/*"}, user = "", pass="")
  response = RestClient::Request.execute({
    method: :get,
    url: url.to_s,
    user: user,
    password: pass,
    headers: headers})
  return response
  
  rescue RestClient::ExceptionWithResponse => e  # RestClient errors that carry the server's response (e.g. a 404)
    $stderr.puts e.inspect
    response = false
    return response  # now we are returning false, and we will check that with an "if" statement in our main code
  rescue RestClient::Exception => e  # any other error raised by RestClient
    $stderr.puts e.inspect
    response = false
    return response  # again false, so the calling code can test it with "if"
  rescue StandardError => e  # any other ordinary error; rescuing Exception here would also catch Ctrl-C
    $stderr.puts e.inspect
    response = false
    return response  # again false, so the calling code can test it with "if"
end 



    

    
res = fetch('http://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=ensemblgenomesgene&format=embl&id=At3g54340');
#puts res.class
#puts res.public_methods
#abort  #this is how you immediately stop a Ruby programme

if res  # res is either the response object (RestClient::Response), or false, so you can test it with 'if'
  body = res.body  # get the "body" of the response
  #headers = res.headers  # get other details about the HTTP message itself
  puts body
else
  puts "the Web call failed - see STDERR for details..."
end

  
  
  

ID   3    standard; DNA; HTG; 2175 BP.
XX
AC   chromosome:TAIR10:3:20119140:20121314:1
XX
SV   chromosome:TAIR10:3:20119140:20121314:1
XX
DT   20-JUL-2023
XX
DE   Arabidopsis thaliana chromosome 3 TAIR10 partial sequence
DE   20119140..20121314 annotated by Araport11
XX
KW   .
XX
OS   Arabidopsis thaliana (thale-cress)
OC   Eukaryota; Viridiplantae; Streptophyta; Streptophytina; Embryophyta;
OC   Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliopsida; Mesangiospermae;
OC   eudicotyledons; Gunneridae; Pentapetalae; rosids; malvids; Brassicales;
OC   Brassicaceae; Camelineae.
XX
CC   This sequence displays annotation from Ensembl Genomes, based on underlying
CC   annotation from Araport11 ( https://araport.org ). See
CC   http://www.ensemblgenomes.org for more information.
XX
CC   All feature locations are relative to the first (5') base of the sequence
CC   in this file.  The sequence presented is always the forward strand of the
CC   assembly. Features that lie outside of the sequen

<pre>

</pre>
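The order of the "rescue" clauses in "fetch" matters: Ruby tries them from top to bottom and uses the first one whose class matches the error, so the most specific class should come first.  The sketch below shows this without touching the network.  It uses Ruby's built-in KeyError (a kind of IndexError) as a stand-in for the RestClient exception classes, and `which_rescue` is just an illustrative name.

```ruby
# Runs the block it is given, and reports which rescue clause (if any) handled the error
def which_rescue
  yield
  "no error"
rescue KeyError => e        # most specific class first (KeyError is a kind of IndexError)
  "KeyError clause: #{e.message}"
rescue IndexError => e      # then its parent class
  "IndexError clause: #{e.message}"
rescue StandardError => e   # finally, any other ordinary error
  "StandardError clause (#{e.class}): #{e.message}"
end

record = { locus: "At3g54340" }
exons  = ["exon1", "exon2"]

puts which_rescue { record.fetch(:locus) }     # the key exists, so nothing is raised
puts which_rescue { record.fetch(:protein) }   # missing Hash key raises KeyError
puts which_rescue { exons.fetch(5) }           # missing Array index raises IndexError
puts which_rescue { Integer("At3g54340") }     # not a number, so Integer() raises ArgumentError
```

If the IndexError clause were written first, it would also catch every KeyError, and the KeyError clause would never run.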
# Extracting data from the Web - "screen scraping"!

You can now use tools that you already know - like regular expressions - to extract specific pieces of data from this Web response.  For example, the gene name:

(note that this code contains more error handling.  It checks if the returned content of the web page contains a match to your regular expression, and it tries to "rescue" the situation if it fails to find a match.)



In [None]:
    
res = fetch('http://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=ensemblgenomesgene&format=embl&id=At3g54340');
  
if res  # res is either the response object, or false, so you can test it with 'if'
  body = res.body  # get the "body" of the response
  
  if body =~ /locus_tag="([^"]+)"/ 
    gene_name = $1
    puts "the name of the gene is #{gene_name}"
#     gene_name_regexp = Regexp.new(/locus_tag="([^"]+)"/)  # this is another way to do Regular Expressions in Ruby.  There are several!
#     match = gene_name_regexp.match(body)
#     if match
#      gene_name = match[1]  # a match acts like an array; the first captured group is [1]
#                            ## (try for yourself, what is match[0]?)
#      puts "the name of the gene is #{gene_name}"
#     end
  else
    begin  # use a "begin" block to handle errors
      puts "There was no gene name in this record"  # print a friendly message
      raise "this is an error" # raise an exception
    rescue # some code to rescue the situation... for example, maybe a different regexp?
      puts "exiting gracefully"  # in this case, we are just going to stop trying
    end
  end
end
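The same technique can pull out more than one value at a time.  In the sketch below, the record is a short, hand-written excerpt in the style of an EMBL feature table (it is NOT real server output, and the values are invented), so the example runs without a network connection.  `String#scan` returns every match, which is one possible way to count features such as exons.

```ruby
# Hand-written excerpt in the style of an EMBL feature table (invented values!)
record = <<~EMBL
  FT   gene            1..900
  FT                   /locus_tag="GENE1"
  FT   exon            1..200
  FT   exon            350..500
  FT   exon            700..900
EMBL

# =~ finds the first match; the captured text is left in $1
if record =~ /locus_tag="([^"]+)"/
  puts "the name of the gene is #{$1}"
end

# scan collects ALL matches; ^ anchors each match to the start of a line
exons = record.scan(/^FT\s+exon\s+(\d+)\.\.(\d+)/)
puts "the record has #{exons.length} exons"
exons.each { |first, last| puts "exon from #{first} to #{last}" }
```

Real records are less tidy than this excerpt.  Locations can be written in other forms, such as complement(...), which this regular expression would miss, so always test your expressions against several real records.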
  


<pre>




</pre>
# Prove that you understand:

### to do this, you should switch to VS Code and write the code as an independent file that uses the objects you created for your assignment - don't try to do this in Jupyter!

 1.  (easy) read the gene_information.tsv file from your assignment; retrieve the EMBL record for each gene in that file  (please retrieve them one-at-a-time, not all together)

 2.  (medium) Create an “AnnotatedGene” Class, that extends the Gene Class by adding the DNA sequence and the Protein sequence attributes (as Strings).

 3.  (hard) For every EMBL file retrieved from gene_information.tsv, find the cross-referenced UniProt identifier (needs a regular expression). 

    * Retrieve the UniProt Protein record as FASTA using dbfetch.  
    * Retrieve the DNA (FASTA) sequence using dbfetch.   
    * Create an AnnotatedGene object for each one.  
    * Loop over every AnnotatedGene object and create a report:




<pre><code>
AGI_Locus        At3GXXXXX
GeneName            Ap3
Protein_ID        P11234
DNA Sequence
>Cosa|Cosa|At3GXXXXX    (keep the original FASTA header)
ACTGCTAGCTGATGCTGATGCATGCTAGC
Protein Sequence
>Cosa|Cosa|P11234  (keep the original FASTA header)
MYLLMSSPIOO
</code></pre>


## Prove that you understand by using a different Web resource

__The Gene Ontology Consortium provides two kinds of files:__
* Ontology Files (Terms, Term ID, and Descriptions)
* Associations (The GO terms used to annotate every gene)

 4. (easy) In your code, retrieve the GO Slim Plant Ontology:
http://www.geneontology.org/ontology/subsets/goslim_plant.obo

 5. (~easy) In your code, parse that file, and for a GO identifier (e.g. GO:0006950) print the GO Term name to the screen (e.g. “response to stress”) (there are MANY different solutions for this!  All of them are regular expressions...)

 6. (hard) Add a “GO_Annotation” attribute to your AnnotatedGene Class (array of strings), then:
 
For every gene in the gene_information.tsv file, 
* Retrieve the UniProt Record
* Retrieve the GO annotations (GO:NNNNNNN) from the UniProt record
* Retrieve the GO Term name from UniProt, and the Term definition from the goslim_plant.obo file
* Add those annotations to your AnnotatedGene object (think about this…..)
* create a report like:


<code>Gene_ID: 
Term1 (def)
Term2 (def)
Term3 (def)
</code>

<pre>
  

</pre>

## These were examples of difficult cases
  
If you have to "screen scrape", you are facing a very difficult situation.  Your code will often break (when the data provider changes their Web page, for example).  You also probably have not seen EVERY possible case, so your regular expression will probably miss some data sometimes.  

If you can find a BETTER way to get the data, you should do that!  In the next lesson we will learn about REST and about other data formats like JSON and XML, that make it easier to extract the data.
