Custom data in Open PHACTS Docker instance
Work in progress
This is a step-by-step guide on how to customize the Open PHACTS Docker install to add additional data.
We assume you already have the Open PHACTS API running on http://localhost:3002/ (or equivalent), and the Virtuoso SPARQL endpoint on http://localhost:3003/sparql.
The data to load must already be in an RDF format, for instance N-Triples or Turtle. This page does not detail how to generate RDF from other data sources; various tools and approaches are available for that.
As a running example, we will add and integrate the NCATS Open Phenotypic Drug Screening Resource RDF data (OPDSR) with the Open PHACTS 1.5 release.
The first step is to retrieve the RDF data that is to be loaded into a new, temporary staging directory, let's say /data/rdf. You might need to use tools like scp, wget --recursive or similar. In this example, we are lucky in that the OPDSR data is available as a git repository:
cd /data/rdf
git clone https://spotlite.nih.gov/ncats/opdsr.git
(Note that this particular checkout will take a fair amount of time)
In the opdsr case, the RDF data is located in the subfolder opdsr/rdf/data:
stain@biggie:/data/rdf/opdsr$ ls -alh rdf/data/
total 86M
drwxrwxr-x 2 stain stain 4.0K Sep 14 14:02 .
drwxrwxr-x 4 stain stain 4.0K Sep 14 14:02 ..
-rw-rw-r-- 1 stain stain 96K Sep 14 14:02 bao_vocabulary_assay.owl
-rw-rw-r-- 1 stain stain 60K Sep 14 14:02 chembl_cco.ttl
-rw-rw-r-- 1 stain stain 65M Sep 14 14:02 chembl_rdf_activity.ttl
-rw-rw-r-- 1 stain stain 311K Sep 14 14:02 chembl_target.ttl
-rw-rw-r-- 1 stain stain 8.0K Sep 14 14:02 npcpd2_assay.ttl
-rw-rw-r-- 1 stain stain 6.3K Sep 14 14:02 npcpd2_bao.ttl
-rw-rw-r-- 1 stain stain 3.7M Sep 14 14:02 npcpd2_results.ttl
-rw-rw-r-- 1 stain stain 176K Sep 14 14:02 npcpd2_substance.ttl
-rw-rw-r-- 1 stain stain 15K Sep 14 14:02 pubchem_pd2_assay.ttl
-rw-rw-r-- 1 stain stain 16M Sep 14 14:02 pubchem_pd2_substance.ttl
-rw-rw-r-- 1 stain stain 12K Sep 14 14:02 pubchem_vocabulary.owl
-rw-rw-r-- 1 stain stain 1.6M Sep 14 14:02 reactome.ttl
Not all of these might need to be loaded into Open PHACTS. In fact, some of these files include parts of the ChEMBL RDF data. Compare with the Open PHACTS 1.5 RDF or their VoID descriptions.
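One way to shortlist candidates is to list the Turtle files while skipping those whose names suggest they repeat data that Open PHACTS already has. The sketch below is only a starting point: the function name list_candidate_files is made up for this guide, and treating the chembl_ filename prefix as a sign of duplication is an assumption that you should confirm by inspecting the files.

```shell
# List the Turtle files in a directory, skipping names that start with "chembl_".
# Assumption: the filename prefix hints that a file duplicates ChEMBL data;
# check the contents (or the VoID descriptions) before relying on this.
list_candidate_files() (
  dir=$1
  for f in "$dir"/*.ttl; do
    [ -e "$f" ] || continue        # no match: the glob pattern stays literal
    name=${f##*/}                  # strip the directory part
    case $name in
      chembl_*) ;;                 # skipped: probably already in Open PHACTS
      *) printf '%s\n' "$name" ;;
    esac
  done
)
```

For the running example this would be called as list_candidate_files /data/rdf/opdsr/rdf/data.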
The next step is to get to know the new RDF data: to understand its model and to select which files, types and properties should be integrated into the RDF store and the Open PHACTS API.
We'll do that by loading it into a new, temporary standalone Virtuoso instance.
When loading RDF, you will need to decide which graph name to load into. Open PHACTS uses named graphs to separate RDF from different sources, e.g. UniProt vs. ChEMBL. This helps with provenance and updates, but also speeds up the internal SPARQL queries by using the GRAPH keyword to narrow the search space.
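To illustrate how GRAPH narrows a query, the sketch below prints a SPARQL query that counts instances per type inside a single named graph. The function name types_in_graph_query is hypothetical, and the query is a generic illustration, not one of the queries Open PHACTS runs internally. Nothing is sent to a server; the function only prints text.

```shell
# Print a SPARQL query that counts instances per type within one named graph.
#   $1 = graph URI (whatever name the data was loaded under)
types_in_graph_query() {
  cat <<EOF
SELECT ?Concept (COUNT(?s) AS ?instances)
WHERE {
  GRAPH <$1> {
    ?s a ?Concept .
  }
}
GROUP BY ?Concept
ORDER BY DESC(?instances)
EOF
}
```

Calling types_in_graph_query with the graph name of one data source gives a query that only has to consider triples from that source.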
So one question when loading new data is which named graph or graphs to use. To start with, we'll settle for a single graph, but we might need to refine that later. As we will first explore the data in a separate Virtuoso instance, we won't affect the Open PHACTS instance until we have decided.
A good graph name is a base URL that is common to all the custom data. We can inspect the individual RDF files to get a feel for whether such a URL exists. Looking at files like npcpd2_assay.ttl and npcpd2_substance.ttl, we find that the URIs used for new resources (not for the properties) look like:
- http://rdf.ncbi.nlm.nih.gov/pubchem/bioassay/AID1117348
- http://rdf.ncbi.nlm.nih.gov/pubchem/substance/SID170464702
- http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint/SID144206464_AID1117310
These look like good Cool URIs, and they even do correct content negotiation:
stain@biggie:/data/rdf/opdsr/rdf/data$ curl -L -H "Accept: text/turtle" http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint/SID144206464_AID1117310
@base <http://rdf.ncbi.nlm.nih.gov/pubchem/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
<endpoint/SID144206464_AID1117310>
<http://purl.obolibrary.org/obo/IAO_0000136> <substance/SID144206464> ;
<vocabulary#PubChemAssayOutcome> <vocabulary#inactive> ;
<http://semanticscience.org/resource/has-unit> <http://purl.obolibrary.org/obo/UO_0000064> ;
<http://semanticscience.org/resource/has-value> "1.12202"^^<http://www.w3.org/2001/XMLSchema#float> ;
a <http://www.bioassayontology.org/bao#BAO_0000186> ;
<http://www.w3.org/2000/01/rdf-schema#label> "AC50" .
<measuregroup/AID1117310>
<http://purl.obolibrary.org/obo/OBI_0000299> <endpoint/SID144206464_AID1117310> .
Note that the RDF data to be loaded does not need to be 5-star Open Data as above to be added to Open PHACTS, but as the URIs will be part of the API results, it is great if they actually do resolve to something useful.
So we decide that http://rdf.ncbi.nlm.nih.gov/pubchem/ would be a great graph name. Note that the URL for the graph name does not need to be retrievable. We'll include the /pubchem bit in case we later want to load other NCBI RDF data.
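If there are many resource URIs to compare, the common base can also be computed instead of read off by eye. The following is a minimal sketch in plain shell; the function name common_base is made up for this guide, and cutting the result back to the last "/" is a design choice of the sketch, not an Open PHACTS requirement.

```shell
# Print the longest common prefix of the given URIs, cut back to the last "/".
common_base() (
  prefix=$1
  for uri in "$@"; do
    # Shorten the candidate prefix until the current URI starts with it.
    while [ -n "$prefix" ]; do
      rest=${uri#"$prefix"}
      [ "$rest" = "$uri" ] || break   # prefix was stripped, so it matches
      prefix=${prefix%?}              # drop the last character and retry
    done
  done
  case $prefix in
    */*) printf '%s/\n' "${prefix%/*}" ;;
    *)   printf '%s\n' "$prefix" ;;
  esac
)

common_base \
  http://rdf.ncbi.nlm.nih.gov/pubchem/bioassay/AID1117348 \
  http://rdf.ncbi.nlm.nih.gov/pubchem/substance/SID170464702 \
  http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint/SID144206464_AID1117310
# prints http://rdf.ncbi.nlm.nih.gov/pubchem/
```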
In some cases, no common base URI exists (e.g. the data is a collage from multiple sources), in which case you will have to make something up under a domain you own, e.g. http://example.com/datasets/clever4 if you owned http://example.com/ and the dataset was called clever4.
The RDF store of Open PHACTS is OpenLink Virtuoso. The Virtuoso Docker image includes instructions for staging.
Following these instructions, we'll first make a new data volume for the testing.
docker run --name virtuoso-test-data -v /virtuoso busybox
We are going to use our /data/rdf folder (or equivalent) as the staging directory. We'll need to make a staging.sql file to specify what to load.
Edit the equivalent of /data/rdf/staging.sql with your favourite text editor (e.g. vim) to contain:
-- OPDSR
ld_dir('/staging/opdsr/rdf/data/', '*.ttl', 'http://rdf.ncbi.nlm.nih.gov/pubchem/');
(Don't forget that trailing ; at the end!)
Note that the path after /staging will be resolved under /data/rdf (or your equivalent local folder). You might need to modify the *.ttl wild-card if your RDF files are not in Turtle format. The last parameter is the graph name to load into.
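When several wild-cards or folders are involved, the ld_dir lines can be generated to avoid typos such as a missing semicolon. This is a small sketch, not part of the Docker image: the helper name ld_dir_line is hypothetical, the *.nt line is only an example of a second wild-card (the OPDSR folder has no N-Triples files), and the helper does not escape single quotes inside its arguments.

```shell
# Print one ld_dir(...) statement for staging.sql.
#   $1 = directory as seen inside the container (under /staging)
#   $2 = file name wild-card
#   $3 = graph name to load into
ld_dir_line() {
  printf "ld_dir('%s', '%s', '%s');\n" "$1" "$2" "$3"
}

graph='http://rdf.ncbi.nlm.nih.gov/pubchem/'
dir='/staging/opdsr/rdf/data/'

echo '-- OPDSR'
ld_dir_line "$dir" '*.ttl' "$graph"
ld_dir_line "$dir" '*.nt' "$graph"   # hypothetical: only if N-Triples files existed
```

The output goes to the terminal; redirect it to your staging.sql once it looks right.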
You can verify the file paths within docker using:
stain@biggie:/data/rdf$ docker run -v /data/rdf:/staging:ro --volumes-from virtuoso-test-data -it stain/virtuoso find /staging/
(..)
/staging/opdsr/rdf/data
/staging/opdsr/rdf/data/npcpd2_substance.ttl
/staging/opdsr/rdf/data/reactome.ttl
/staging/opdsr/rdf/data/pubchem_pd2_substance.ttl
(..)
/staging/opdsr/NPC-OIDD-Assay.csv
/staging/opdsr/NPC-OIDD-Compound.csv
Next we'll try to load the data:
docker run -v /data/rdf:/staging:ro --volumes-from virtuoso-test-data -it stain/virtuoso staging.sh
* Starting Virtuoso OpenSource Edition 7.2 virtuoso-opensource-7 [ OK ]
Configuring SPARQL
Populating from /staging/staging.sql
Starting RDF loader for core 0
Starting RDF loader for core 1
Starting RDF loader for core 2
Starting RDF loader for core 3
Checkpointing
Staging finished, total triples: 613668
* Stopping Virtuoso OpenSource Edition 7.2 virtuoso-opensource-7
If the "total triples" is very low, e.g. 3972, then you probably didn't load anything and you are only counting the default triples of the Virtuoso install.
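If the staging output has been saved to a file, this check can be scripted. The sketch below assumes the "total triples:" line has the form shown above. The function names and the default threshold of 10000 are arbitrary choices for illustration; pick a threshold that fits the size of your own data.

```shell
# Extract the number after "total triples:" from staging output on stdin.
total_triples() {
  sed -n 's/.*total triples: \([0-9][0-9]*\).*/\1/p' | tail -n 1
}

# Succeed only if the count is above a threshold.
#   $1 = triple count, $2 = minimum (default 10000, an arbitrary assumption)
looks_loaded() (
  count=$1
  min=${2:-10000}
  [ -n "$count" ] && [ "$count" -gt "$min" ]
)
```

For example, with the output saved as staging.log (a hypothetical file name): looks_loaded "$(total_triples < staging.log)" || echo "Nothing seems to have been loaded".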
TODO: Check for parser errors, e.g. the prefix vocabulary: was missing
Now let's start up our secondary Virtuoso using the freshly populated data volume. We'll expose this on port 8890 - modify the command below to something like -p 8892:8890 if 8890 is already in use.
docker run -p 8890:8890 --volumes-from virtuoso-test-data -d stain/virtuoso
We can verify that the server started up by looking for the only running Docker container that isn't called ops-something:
docker ps | grep -v ops-
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
52cefde336b1 stain/virtuoso "/docker-entrypoint. About a minute ago Up About a minute 1111/tcp, 0.0.0.0:8890->8890/tcp modest_thompson
Above we saw the auto-generated container name modest_thompson, so now we check its log:
docker logs modest_thompson
(..)
13:49:29 Checkpoint started
13:49:29 Checkpoint finished, log reused
13:49:29 HTTP/WebDAV server online at 8890
13:49:29 Server online at 1111 (pid 1)
We can verify this is working by accessing http://localhost:8890/sparql (or equivalent host/port) in a browser.
The default query in Virtuoso is often a good start:
select distinct ?Concept where {[] a ?Concept} LIMIT 100
If you execute that query you will see that it returns a big list of known types in the RDF store. You can ignore the openlinksw.com and w3.org types that come from the default statements in Virtuoso.
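The same query can also be sent from the command line instead of the web form. This is a minimal sketch using curl and the standard SPARQL protocol (a query parameter plus an Accept header); the function name run_sparql is made up for this guide, and the endpoint URL is the one assumed above.

```shell
# Send a SPARQL query to an endpoint and print the results as JSON.
#   $1 = endpoint URL, $2 = query text
run_sparql() {
  # -G appends the URL-encoded query parameter to the endpoint URL (HTTP GET).
  curl -sS -G "$1" \
    -H 'Accept: application/sparql-results+json' \
    --data-urlencode "query=$2"
}

# Example (needs the temporary Virtuoso from above to be running):
# run_sparql http://localhost:8890/sparql \
#   'select distinct ?Concept where {[] a ?Concept} LIMIT 100'
```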
Some of the types we find:
- http://www.bioassayontology.org/bao#BAO_0002786
- http://www.bioassayontology.org/bao#BAO_0002805
- http://www.bioassayontology.org/bao#BAO_0002989
- http://rdf.ebi.ac.uk/terms/chembl#UniprotRef
- http://rdf.ebi.ac.uk/terms/chembl#SingleProtein
- http://purl.obolibrary.org/obo/CHEBI_100147
- http://purl.obolibrary.org/obo/CHEBI_10023
- http://purl.obolibrary.org/obo/CHEBI_100246
- http://purl.obolibrary.org/obo/CHEBI_10093
- http://purl.obolibrary.org/obo/CHEBI_15940
In bioinformatics you will unfortunately often encounter unhelpful class and property names like BAO_0002989. See Exploring ontologies below.
In this particular case, there are almost 10,000 classes, as the CHEBI data represents each compound as a class. So we'll use a slightly cleverer query that counts the instances of each type:
SELECT ?Concept (COUNT(?Concept) AS ?instances)
WHERE {
[] a ?Concept
}
GROUP BY ?Concept
ORDER BY DESC(?instances)
(Note that running this query on the full Open PHACTS Virtuoso instance could take a considerable amount of time.)
Now we get much more useful information:
So we explore:
SELECT * WHERE {
?s a <http://rdf.ebi.ac.uk/terms/chembl#SingleProtein> .
}
LIMIT 100
But they all have rdf.ebi.ac.uk URIs, from the loaded chembl_* files, which presumably have been taken from the ChEMBL RDF Download. If we FILTER these away, there are no other chembl:SingleProtein instances.
SELECT * WHERE {
?s a <http://rdf.ebi.ac.uk/terms/chembl#SingleProtein> .
FILTER (!strStarts(str(?s), "http://rdf.ebi.ac.uk/"))
}
LIMIT 100
From this you might see why loading into different named graphs can be helpful.
But we know from manual inspection of the RDF files that we are instead looking for things in the http://rdf.ncbi.nlm.nih.gov/pubchem namespace, so we'll do a similar FILTER to only look at their types.
SELECT ?Concept (COUNT(?Concept) AS ?instances)
WHERE {
?s a ?Concept
FILTER (strStarts(str(?s), "http://rdf.ncbi.nlm.nih.gov/pubchem"))
}
GROUP BY ?Concept
ORDER BY DESC(?instances)
The results look like:
Perhaps some resources don't have an explicit type, but at least we see that most of them have one in common. So let's just see which properties are used:
SELECT ?property (COUNT(?property) AS ?statements)
WHERE {
{ ?s ?property [] . }
UNION
{ [] ?property ?s . }
FILTER (strStarts(str(?s), "http://rdf.ncbi.nlm.nih.gov/pubchem"))
}
GROUP BY ?property
ORDER BY DESC(?statements)
Now this gives very promising results:
Perhaps some resources are more important than others. Let's look so we can find a good example to explore further:
SELECT ?s (COUNT(?p) AS ?statements) WHERE {
?s ?p []
FILTER (strStarts(str(?s), "http://rdf.ncbi.nlm.nih.gov/pubchem"))
}
GROUP BY ?s
ORDER BY DESC(?statements)
The results include:
Let's have a look at the first one using:
DESCRIBE <http://rdf.ncbi.nlm.nih.gov/pubchem/substance/SID144206872>
The results are not particularly helpful:
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ns0: <http://purl.obolibrary.org/obo/> .
@prefix ns1: <http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint/> .
@prefix ns2: <http://rdf.ncbi.nlm.nih.gov/pubchem/substance/> .
@prefix ns4: <http://purl.org/dc/terms/> .
@prefix ns5: <http://rdf.ncbi.nlm.nih.gov/pubchem/measuregroup/> .
@prefix ns6: <http://semanticscience.org/resource/> .
@prefix ns7: <http://rdf.ncbi.nlm.nih.gov/pubchem/compound/> .
@prefix ns8: <http://rdf.ncbi.nlm.nih.gov/pubchem/synonym/> .
@prefix ns9: <http://rdf.ncbi.nlm.nih.gov/pubchem/descriptor/> .
ns1:SID144206872_AID1117310 ns0:IAO_0000136 ns2:SID144206872 .
ns1:SID144206872_AID1117312 ns0:IAO_0000136 ns2:SID144206872 .
# ...
ns2:SID144206872 ns4:modified "2014-12-12-04:00"^^xsd:date .
ns2:SID144206872 ns0:BFO_0000056 ns5:AID1117322 ,
ns5:AID1117323 ,
ns5:AID1117324 ,
ns5:AID1117350 ,
# ..
ns5:AID1117351 ;
ns4:available "2012-10-06-04:00"^^xsd:date .
ns2:SID144206872 ns6:CHEMINF_000477 ns7:CID26391 .
ns2:SID144206872 ns6:has-attribute ns8:MD5_da87d1146f2fb14d8063aa40e97ad022 ,
ns8:MD5_894dec1eae1c0e77e33aa780a16b0513 ,
ns8:MD5_bf989a31741f8faac5e427eaf2faad2e ,
ns8:MD5_0efbc007106f40cdbdf79ce68b76523f ,
ns8:MD5_566fc8f0db12b11f8111299e88637b5f .
ns2:SID144206872 ns6:has-attribute ns9:SID144206872_Substance_Version ,
(What happened to ns3? :)
TODO
TODO
http://www.essepuntato.it/lode
http://www.essepuntato.it/lode/http://rdf.ncbi.nlm.nih.gov/pubchem/vocabulary.owl
TODO
It is possible to perform staging within a running Virtuoso instance (docker exec -it ops-virtuoso bash, use wget inside /staging, start isql and run ld_dir and rdf_loader_run()), but this tutorial uses the simpler approach documented for the Docker image.