Custom data in Open PHACTS Docker instance

Stian Soiland-Reyes edited this page Sep 16, 2015 · 15 revisions

Work in progress

This is a step-by-step guide on how to customize the Open PHACTS Docker install to add additional data.

We assume you already have the Open PHACTS API running on http://localhost:3002/ (or equivalent), and the Virtuoso SPARQL endpoint on http://localhost:3003/sparql.

The data to load must already be in an RDF format, for instance N-Triples or Turtle. This page does not detail how to generate RDF from other data sources; various tools and approaches are available for that.

As a running example, we will add and integrate the NCATS Open Phenotypic Drug Screening Resource RDF data (OPDSR) with the Open PHACTS 1.5 release.

Retrieving the RDF

The next step is to retrieve the RDF data to be loaded into a new, temporary staging directory, let's say /data/rdf. You might need to use tools like scp, wget --recursive or similar. In this example, we are lucky in that the OPDSR data is available as a git repository:

cd /data/rdf
git clone https://spotlite.nih.gov/ncats/opdsr.git

(Note that this particular checkout will take a fair amount of time)

In the opdsr case, the RDF data is located in the subfolder opdsr/rdf/data:

stain@biggie:/data/rdf/opdsr$ ls -alh rdf/data/
total 86M
drwxrwxr-x 2 stain stain 4.0K Sep 14 14:02 .
drwxrwxr-x 4 stain stain 4.0K Sep 14 14:02 ..
-rw-rw-r-- 1 stain stain  96K Sep 14 14:02 bao_vocabulary_assay.owl
-rw-rw-r-- 1 stain stain  60K Sep 14 14:02 chembl_cco.ttl
-rw-rw-r-- 1 stain stain  65M Sep 14 14:02 chembl_rdf_activity.ttl
-rw-rw-r-- 1 stain stain 311K Sep 14 14:02 chembl_target.ttl
-rw-rw-r-- 1 stain stain 8.0K Sep 14 14:02 npcpd2_assay.ttl
-rw-rw-r-- 1 stain stain 6.3K Sep 14 14:02 npcpd2_bao.ttl
-rw-rw-r-- 1 stain stain 3.7M Sep 14 14:02 npcpd2_results.ttl
-rw-rw-r-- 1 stain stain 176K Sep 14 14:02 npcpd2_substance.ttl
-rw-rw-r-- 1 stain stain  15K Sep 14 14:02 pubchem_pd2_assay.ttl
-rw-rw-r-- 1 stain stain  16M Sep 14 14:02 pubchem_pd2_substance.ttl
-rw-rw-r-- 1 stain stain  12K Sep 14 14:02 pubchem_vocabulary.owl
-rw-rw-r-- 1 stain stain 1.6M Sep 14 14:02 reactome.ttl

Not all of these might need to be loaded into Open PHACTS - in fact, some of these files include parts of the ChEMBL RDF data. Compare with the Open PHACTS 1.5 RDF or their VoID descriptions.
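To help decide which files to load, a small script can list the RDF files in the checkout while leaving out the ones you already know to be duplicates. This is only a sketch: the function name, the set of file extensions and the chembl_ exclusion are assumptions for illustration, not part of the Open PHACTS tooling.

```python
from pathlib import Path

# File extensions treated as RDF here (an assumption; adjust to your own data).
RDF_SUFFIXES = {".ttl", ".nt", ".owl", ".rdf"}


def candidate_rdf_files(data_dir, exclude_prefixes=()):
    """Return (name, size_in_bytes) for RDF files in data_dir.

    exclude_prefixes is an iterable of file name prefixes to skip,
    e.g. ("chembl_",) for files that duplicate already loaded data.
    """
    skip = tuple(exclude_prefixes)
    found = []
    for path in sorted(Path(data_dir).iterdir()):
        if not path.is_file() or path.suffix.lower() not in RDF_SUFFIXES:
            continue
        if skip and path.name.startswith(skip):
            continue
        found.append((path.name, path.stat().st_size))
    return found
```

For the running example you could call candidate_rdf_files("/data/rdf/opdsr/rdf/data", exclude_prefixes=("chembl_",)) and review the remaining files by hand.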

Exploring the RDF data

A first step is to get to know the new RDF data, to understand its model and select which files, types and properties should be integrated into the RDF store and Open PHACTS API.

We'll do that by loading it into a new, temporary standalone Virtuoso instance.

Deciding on a graph name

When loading RDF, you will need to decide which graph name to load into. Open PHACTS uses named graphs to separate RDF from different sources, e.g. UniProt vs. ChEMBL. This helps with provenance and updates, but also speeds up the internal SPARQL queries by using the GRAPH to narrow the search space.

So one question when loading new data is which named graph or graphs to use. To start with, we'll settle for one single graph - but we might need to refine that later. As we will first explore the data in a separate Virtuoso instance, we won't affect the Open PHACTS instance until we have decided.

A good graph name is a base URL that is common to all the custom data. We can inspect the individual RDF files to get a feel for whether such a URL exists. Looking at files like npcpd2_assay.ttl and npcpd2_substance.ttl, we find that the namespaces used for new resources (not for the properties) all start with http://rdf.ncbi.nlm.nih.gov/pubchem/, for example http://rdf.ncbi.nlm.nih.gov/pubchem/substance/ and http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint/.

These look like good Cool URIs, and they even do correct content negotiation:

stain@biggie:/data/rdf/opdsr/rdf/data$ curl -L -H "Accept: text/turtle" http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint/SID144206464_AID1117310
@base <http://rdf.ncbi.nlm.nih.gov/pubchem/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

<endpoint/SID144206464_AID1117310>
    <http://purl.obolibrary.org/obo/IAO_0000136> <substance/SID144206464> ;
    <vocabulary#PubChemAssayOutcome> <vocabulary#inactive> ;
    <http://semanticscience.org/resource/has-unit> <http://purl.obolibrary.org/obo/UO_0000064> ;
    <http://semanticscience.org/resource/has-value> "1.12202"^^<http://www.w3.org/2001/XMLSchema#float> ;
    a <http://www.bioassayontology.org/bao#BAO_0000186> ;
    <http://www.w3.org/2000/01/rdf-schema#label> "AC50" .

<measuregroup/AID1117310>
    <http://purl.obolibrary.org/obo/OBI_0000299> <endpoint/SID144206464_AID1117310> .

Note that the RDF data to be loaded does not need to be 5-star Open Data as above to be added to Open PHACTS, but as the URIs will be part of the API results, it is great if they actually do resolve to something useful.
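The curl check above can also be scripted, which is handy if you want to test several URIs. The following is a minimal sketch using only the Python standard library; the function names are made up for this example. urllib follows redirects by default, which corresponds to curl -L.

```python
import urllib.request


def turtle_request(uri):
    """Build a GET request that asks the server for Turtle via the Accept header."""
    return urllib.request.Request(uri, headers={"Accept": "text/turtle"})


def fetch_turtle(uri, timeout=30):
    """Fetch uri, following redirects, and return (content_type, body)."""
    with urllib.request.urlopen(turtle_request(uri), timeout=timeout) as response:
        body = response.read().decode("utf-8")
        return response.headers.get_content_type(), body
```

If fetch_turtle returns a content type other than text/turtle, the server is probably ignoring the Accept header, and the URI does not do content negotiation for that format.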

So we decide that http://rdf.ncbi.nlm.nih.gov/pubchem/ would be a great graph name. Note that the URL for the graph name does not need to be retrievable. We'll include the /pubchem bit in case we later want to load other NCBI RDF data.

In some cases, no common base URI exists (e.g. the data is a collage from multiple sources), in which case you will have to make something up from a domain you own, e.g. http://example.com/datasets/clever4 if you owned http://example.com/ and the dataset was called clever4.
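Inspecting the files by hand works for a handful of files; for larger datasets, a rough script can count the namespaces and suggest a common base. The sketch below is an approximation under stated assumptions: it only looks at absolute IRIs written in angle brackets (which includes @prefix declarations, but not prefixed names like ns:thing), and all function names are invented for this example.

```python
import os
import re
from collections import Counter

# Matches absolute http(s) IRIs written as <...> in Turtle or N-Triples text.
IRI_PATTERN = re.compile(r"<(https?://[^<>\s]+)>")


def namespace_of(iri):
    """Cut an IRI after its last '#' or '/' to approximate its namespace."""
    cut = max(iri.rfind("#"), iri.rfind("/"))
    return iri[: cut + 1]


def namespace_counts(text):
    """Count how often each namespace occurs among the IRIs found in text."""
    return Counter(namespace_of(iri) for iri in IRI_PATTERN.findall(text))


def common_base(namespaces):
    """Longest shared prefix of the given namespaces, trimmed back to a '/'."""
    prefix = os.path.commonprefix(list(namespaces))
    return prefix[: prefix.rfind("/") + 1]
```

You would typically pass only the namespaces of your own resources to common_base, leaving out vocabularies such as http://purl.org/dc/terms/, since including them would shrink the shared prefix to nothing useful.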

Loading the data in a new Virtuoso instance

The RDF store of Open PHACTS is OpenLink Virtuoso. The Virtuoso docker image includes instructions for staging.

Following these instructions, we'll first make a new data volume for the testing.

docker run --name virtuoso-test-data -v /virtuoso busybox

We are going to use our /data/rdf folder (or equivalent) as the staging directory. We'll need to make a staging.sql file to specify what to load.

Edit the equivalent of /data/rdf/staging.sql with your favourite text editor (e.g. vim) to contain:

-- OPDSR
ld_dir('/staging/opdsr/rdf/data/', '*.ttl', 'http://rdf.ncbi.nlm.nih.gov/pubchem/');

(Don't forget that trailing ; !)

Note that the path after /staging will be resolved under /data/rdf (or your equivalent local folder). You might need to modify the *.ttl wild-card if your RDF files are not in Turtle format. The last parameter is the graph name to load into.
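If you have several folders or graphs to load, you may prefer to generate staging.sql rather than type it. The helper below is a hypothetical sketch (the function names are not part of the Virtuoso image); it only formats ld_dir lines in the shape shown above, including the trailing semicolon.

```python
def ld_dir_statement(directory, pattern, graph):
    """Format one ld_dir(...) line; single quotes are doubled as in SQL literals."""
    quoted = ", ".join(
        "'" + value.replace("'", "''") + "'" for value in (directory, pattern, graph)
    )
    return f"ld_dir({quoted});"


def staging_sql(entries, comment=None):
    """Build staging.sql content from (directory, pattern, graph) tuples."""
    lines = [f"-- {comment}"] if comment else []
    lines.extend(ld_dir_statement(*entry) for entry in entries)
    return "\n".join(lines) + "\n"
```

Writing the result of staging_sql(...) to /data/rdf/staging.sql (or your equivalent) gives the same file as the hand-edited version, with one ld_dir line per entry.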

You can verify the file paths within docker using:

stain@biggie:/data/rdf$ docker run -v /data/rdf:/staging:ro --volumes-from virtuoso-test-data -it stain/virtuoso find /staging/
(..)
/staging/opdsr/rdf/data
/staging/opdsr/rdf/data/npcpd2_substance.ttl
/staging/opdsr/rdf/data/reactome.ttl
/staging/opdsr/rdf/data/pubchem_pd2_substance.ttl
(..)
/staging/opdsr/NPC-OIDD-Assay.csv
/staging/opdsr/NPC-OIDD-Compound.csv

Next we'll try to load the data:

docker run -v /data/rdf:/staging:ro --volumes-from virtuoso-test-data -it stain/virtuoso staging.sh
 * Starting Virtuoso OpenSource Edition 7.2  virtuoso-opensource-7                                                                                                                                                                                                           [ OK ] 
Configuring SPARQL
Populating from /staging/staging.sql
Starting RDF loader for core 0
Starting RDF loader for core 1
Starting RDF loader for core 2
Starting RDF loader for core 3
Checkpointing
Staging finished, total triples: 613668
 * Stopping Virtuoso OpenSource Edition 7.2 virtuoso-opensource-7     

If the "total triples" is very low, e.g. 3972, then you probably didn't load anything and you are only counting the default triples of the Virtuoso install.
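When staging is run from a script, the same sanity check can be automated by parsing the summary line. This is a sketch under assumptions: the regular expression matches the "Staging finished" line as printed above, and the default baseline of 3972 is just the example figure mentioned here, so measure your own empty install before relying on it.

```python
import re

# Matches the summary line printed by staging.sh in the output shown above.
TOTAL_PATTERN = re.compile(r"Staging finished, total triples: (\d+)")


def total_triples(log_text):
    """Return the reported triple count, or None if the summary line is absent."""
    match = TOTAL_PATTERN.search(log_text)
    return int(match.group(1)) if match else None


def looks_loaded(log_text, baseline=3972):
    """True if the reported count exceeds baseline (example default-install count)."""
    count = total_triples(log_text)
    return count is not None and count > baseline
```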

Logs?

TODO: Check for parser errors, eg. prefix vocabulary: was missing

Starting a Virtuoso instance

Now let's start up our secondary Virtuoso using the freshly populated data volume. We'll expose this on port 8890 - modify below to something like -p 8892:8890 if 8890 is already in use.

docker run -p 8890:8890 --volumes-from virtuoso-test-data -d stain/virtuoso

We can verify the server started up by looking for the only running Docker container whose name doesn't start with ops-:

docker ps | grep -v ops-
CONTAINER ID        IMAGE               COMMAND                CREATED              STATUS              PORTS                              NAMES
52cefde336b1        stain/virtuoso      "/docker-entrypoint.   About a minute ago   Up About a minute   1111/tcp, 0.0.0.0:8890->8890/tcp   modest_thompson 
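Note that grep -v ops- drops any line containing ops- anywhere, not only in the container name. If you need the name in a script, a stricter filter is easy to write. The sketch below uses the --format option of docker ps to print names only; the function names are invented for this example.

```python
import subprocess


def non_ops_names(names, prefix="ops-"):
    """Keep container names that do not start with the Open PHACTS prefix."""
    return [name for name in names if name and not name.startswith(prefix)]


def running_container_names():
    """Ask the Docker CLI for the names of running containers, one per line."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()
```

Calling non_ops_names(running_container_names()) should then return just the test container, assuming nothing else is running on the host.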

Above we saw the auto-generated container name modest_thompson. Now we check its log:

docker logs modest_thompson
(..)
13:49:29 Checkpoint started
13:49:29 Checkpoint finished, log reused
13:49:29 HTTP/WebDAV server online at 8890
13:49:29 Server online at 1111 (pid 1)

We can verify this is working by accessing http://localhost:8890/sparql (or equivalent host/port).
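The endpoint can also be queried from a script instead of the browser form. The following is a minimal sketch that sends a query as a GET request following the SPARQL 1.1 Protocol and asks for JSON results; the endpoint URL is the test instance from this page, and the function names are assumptions for illustration.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://localhost:8890/sparql"  # adjust host/port to your setup


def sparql_request(query, endpoint=ENDPOINT):
    """Build a SPARQL protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_select(query, endpoint=ENDPOINT, timeout=60):
    """Send a SELECT query and return the list of result bindings."""
    with urllib.request.urlopen(sparql_request(query, endpoint), timeout=timeout) as response:
        return json.load(response)["results"]["bindings"]
```

Each binding returned by run_select is a dictionary keyed by variable name, with the actual term under the "value" key.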

Querying the data

The default query in Virtuoso is often a good start:

select distinct ?Concept where {[] a ?Concept} LIMIT 100

If you execute that query you will see that it returns a big list of known types in the RDF store. You can ignore the openlinksw.com and w3.org types that come from the default statements in Virtuoso.
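If you fetch the results as JSON, the built-in types can be filtered out programmatically. This sketch assumes the bindings follow the SPARQL 1.1 JSON results format and that the variable is called Concept, as in the default query; the helper names are made up.

```python
from urllib.parse import urlparse

# Domains whose types come with the default Virtuoso install.
IGNORED_DOMAINS = ("openlinksw.com", "w3.org")


def is_builtin_type(uri):
    """True if the URI's host is an ignored domain or a subdomain of one."""
    host = urlparse(uri).hostname or ""
    return any(host == d or host.endswith("." + d) for d in IGNORED_DOMAINS)


def interesting_concepts(bindings, variable="Concept"):
    """Extract the variable's values from JSON bindings, dropping built-in types."""
    values = [row[variable]["value"] for row in bindings if variable in row]
    return [uri for uri in values if not is_builtin_type(uri)]
```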

Some of the types we find:

In bioinformatics you will unfortunately often encounter unhelpful class and property names like BAO_0002989. See Exploring ontologies below.

In this particular case, there are almost 10,000 classes, as the CHEBI data represents each compound as a class. So we'll use a slightly cleverer query:

SELECT ?Concept (COUNT(?Concept) AS ?instances)
WHERE { 
  [] a ?Concept 
} 
GROUP BY ?Concept
ORDER BY DESC(?instances)

(Note that running this query on the full Open PHACTS Virtuoso instance could take a considerable amount of time.)

Now we get much more useful information:

Concept instances
http://rdf.ebi.ac.uk/terms/chembl#SingleProtein 3505
http://rdf.ebi.ac.uk/terms/chembl#UniprotRef 2653
http://www.biopax.org/release/biopax-level3.owl#Pathway 1743
http://www.w3.org/2002/07/owl#Class 346
http://www.w3.org/2002/07/owl#DatatypeProperty 47
http://www.w3.org/2002/07/owl#ObjectProperty 44
http://www.bioassayontology.org/bao#BAO_0000015 35
http://www.bioassayontology.org/bao#BAO_0000010 30

So we explore:

SELECT * WHERE {
  ?s a <http://rdf.ebi.ac.uk/terms/chembl#SingleProtein> .
}
LIMIT 100

But they all have rdf.ebi.ac.uk URIs, from the loaded chembl_* files, which presumably have been taken from the Chembl RDF Download. If we FILTER away these, there are no other chembl:SingleProteins.

SELECT * WHERE {
  ?s a <http://rdf.ebi.ac.uk/terms/chembl#SingleProtein> .
  FILTER (!strStarts(str(?s), "http://rdf.ebi.ac.uk/"))
}
LIMIT 100

From this you might see why loading into different named graphs can be helpful.

But we know from manual inspection of the RDF files that we are instead looking for things in the http://rdf.ncbi.nlm.nih.gov/pubchem namespace - so we'll do a similar FILTER to only look at their types.

SELECT ?Concept (COUNT(?Concept) AS ?instances)
WHERE { 
  ?s a ?Concept 
  FILTER (strStarts(str(?s), "http://rdf.ncbi.nlm.nih.gov/pubchem"))

} 
GROUP BY ?Concept
ORDER BY DESC(?instances)

The results look like:

Concept instances
http://www.bioassayontology.org/bao#BAO_0000015 35
http://www.bioassayontology.org/bao#BAO_0000010 30
http://www.bioassayontology.org/bao#BAO_0000219 30
http://www.w3.org/2002/07/owl#Class 14
http://www.bioassayontology.org/bao#BAO_0002805 12
http://www.bioassayontology.org/bao#BAO_0002100 6
http://www.w3.org/2002/07/owl#ObjectProperty 6
http://www.bioassayontology.org/bao#BAO_0002786 5
http://www.bioassayontology.org/bao#BAO_0002989 5
http://www.bioassayontology.org/bao#BAO_0000223 5
http://www.bioassayontology.org/bao#BAO_0000041 5
http://purl.obolibrary.org/obo/CHEBI_50567 2
http://purl.obolibrary.org/obo/CHEBI_16813 2
http://purl.obolibrary.org/obo/CHEBI_31719 2
http://purl.obolibrary.org/obo/CHEBI_43968 2
(..)
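Since you will likely repeat this type count for other namespaces, it can help to generate the query text. The function below is a hypothetical helper that fills the namespace into the query used above, written with the standard SPARQL 1.1 projection syntax; it rejects namespaces containing characters that would break the string literal.

```python
def type_count_query(namespace):
    """Return a SPARQL query counting instances per type for subjects in namespace."""
    if any(ch in namespace for ch in '"\\\n\r'):
        raise ValueError("namespace must not contain quotes, backslashes or newlines")
    return (
        "SELECT ?Concept (COUNT(?Concept) AS ?instances)\n"
        "WHERE {\n"
        "  ?s a ?Concept\n"
        f'  FILTER (strStarts(str(?s), "{namespace}"))\n'
        "}\n"
        "GROUP BY ?Concept\n"
        "ORDER BY DESC(?instances)\n"
    )
```

The returned text can be pasted into the Virtuoso query form as-is.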

Perhaps some resources don't have an explicit type; at least we see that most of them have one in common. So let's just see which properties are used:

SELECT ?property (COUNT(?property) AS ?statements)
WHERE { 

  { ?s ?property [] . }
  UNION 
  { [] ?property ?s . }
  
  FILTER (strStarts(str(?s), "http://rdf.ncbi.nlm.nih.gov/pubchem"))

} 
GROUP BY ?property
ORDER BY DESC(?statements)

Now this gives very promising results:

property statements
http://purl.obolibrary.org/obo/BFO_0000056 201680
http://purl.obolibrary.org/obo/IAO_0000136 201458
http://rdf.ncbi.nlm.nih.gov/pubchem/vocabulary#PubChemAssayOutcome 53972
http://semanticscience.org/resource/has-attribute 31816
http://purl.org/dc/terms/source 5090
http://www.w3.org/2004/02/skos/core#exactMatch 5051
http://semanticscience.org/resource/CHEMINF_000477 5012
http://purl.org/dc/terms/modified 2510
http://purl.org/dc/terms/available 2510
http://www.w3.org/1999/02/22-rdf-syntax-ns#type 1316
http://www.bioassayontology.org/bao#BAO_0000209 70
http://www.w3.org/2000/01/rdf-schema#label 41
http://www.bioassayontology.org/bao#BAO_0000210 35
http://purl.org/dc/terms/title 35
http://www.w3.org/2000/01/rdf-schema#comment 17
http://www.w3.org/2002/07/owl#imports 5
http://www.w3.org/2002/07/owl#versionIRI 2
http://www.w3.org/2002/07/owl#inverseOf 2
http://www.w3.org/2000/01/rdf-schema#subPropertyOf 1

Perhaps some resources are more important than others. Let's look so we can find a good example to explore further:

SELECT ?s (COUNT(?p) AS ?statements) WHERE {
 ?s ?p [] 
  FILTER (strStarts(str(?s), "http://rdf.ncbi.nlm.nih.gov/pubchem"))
}
GROUP BY ?s
ORDER BY DESC(?statements)

The results include:

s statements
http://rdf.ncbi.nlm.nih.gov/pubchem/substance/SID144206872 140
http://rdf.ncbi.nlm.nih.gov/pubchem/substance/SID144206876 133
http://rdf.ncbi.nlm.nih.gov/pubchem/substance/SID144206826 132
http://rdf.ncbi.nlm.nih.gov/pubchem/substance/SID144206852 132

Let's have a look at the first one using:

DESCRIBE <http://rdf.ncbi.nlm.nih.gov/pubchem/substance/SID144206872>

The results are not particularly helpful:

@prefix xsd:	<http://www.w3.org/2001/XMLSchema#> .

@prefix ns0:	<http://purl.obolibrary.org/obo/> .
@prefix ns1:	<http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint/> .
@prefix ns2:	<http://rdf.ncbi.nlm.nih.gov/pubchem/substance/> .
@prefix ns4:	<http://purl.org/dc/terms/> .
@prefix ns5:	<http://rdf.ncbi.nlm.nih.gov/pubchem/measuregroup/> .
@prefix ns6:	<http://semanticscience.org/resource/> .
@prefix ns7:	<http://rdf.ncbi.nlm.nih.gov/pubchem/compound/> .
@prefix ns8:	<http://rdf.ncbi.nlm.nih.gov/pubchem/synonym/> .
@prefix ns9:	<http://rdf.ncbi.nlm.nih.gov/pubchem/descriptor/> .

ns1:SID144206872_AID1117310	ns0:IAO_0000136	ns2:SID144206872 .
ns1:SID144206872_AID1117312	ns0:IAO_0000136	ns2:SID144206872 .
# ...
ns2:SID144206872	ns4:modified	"2014-12-12-04:00"^^xsd:date .
ns2:SID144206872	ns0:BFO_0000056	ns5:AID1117322 ,
		ns5:AID1117323 ,
		ns5:AID1117324 ,
		ns5:AID1117350 ,
# ..
		ns5:AID1117351 ;
	ns4:available	"2012-10-06-04:00"^^xsd:date .
ns2:SID144206872	ns6:CHEMINF_000477	ns7:CID26391 .
ns2:SID144206872	ns6:has-attribute	ns8:MD5_da87d1146f2fb14d8063aa40e97ad022 ,
		ns8:MD5_894dec1eae1c0e77e33aa780a16b0513 ,
		ns8:MD5_bf989a31741f8faac5e427eaf2faad2e ,
		ns8:MD5_0efbc007106f40cdbdf79ce68b76523f ,
		ns8:MD5_566fc8f0db12b11f8111299e88637b5f .
ns2:SID144206872	ns6:has-attribute	ns9:SID144206872_Substance_Version ,

(what happened to ns3? :)

TODO

Exploring ontologies

TODO

http://www.essepuntato.it/lode

http://www.essepuntato.it/lode/http://rdf.ncbi.nlm.nih.gov/pubchem/vocabulary.owl

Loading the data to production

TODO

It is possible to perform staging within a running Virtuoso instance (docker exec -it ops-virtuoso bash, use wget inside /staging, start isql and run ld_dir and rdf_loader_run()) - but this tutorial uses the simpler approach documented for the Docker image.