Load kg-covid-19 into blazegraph instance #134

justaddcoffee · 2020-05-04T23:52:00Z

No description provided.

justaddcoffee · 2020-05-12T19:41:42Z

@kltm has set up a blazegraph endpoint:
http://kg-hub-rdf.berkeleybop.io/blazegraph/#splash

now we just need to give him a .jnl file - anyone want to help with this?

kltm · 2020-05-12T19:47:45Z

Noting:
https://github.com/geneontology/go-site/blob/master/pipeline/Makefile#L181-L198

kltm · 2020-05-12T19:55:49Z

From @balhoff

blazegraph-runner load --informat=ntriples --journal=blazegraph.jnl --use-ontology-graph=true myfile.nt
Is it an ontology?
If not, maybe you don’t want --use-ontology-graph=true

balhoff · 2020-05-12T19:57:48Z

@kltm I had accidentally put a space before the format there.

kltm · 2020-05-12T19:58:40Z

fixed

justaddcoffee · 2020-05-12T19:59:15Z

thanks @balhoff @kltm

balhoff · 2020-05-12T20:00:52Z

@justaddcoffee you may need to increase available heap size:

export JAVA_OPTS=-Xmx16G

justaddcoffee · 2020-05-12T22:09:10Z

made a PR for this #151 - likely will have to fiddle with Jenkinsfile a bit more

justaddcoffee · 2020-05-16T17:19:16Z

#151 is merged and will emit a blazegraph journal here if the Jenkins gods are good.

@kltm can we sync up on Monday-ish to see about loading this jnl?

kltm · 2020-05-20T00:25:24Z

Possible working deployment draft on issue-134-blazegraph-deploy, but cannot test until journal available (and jenkins can find the branch).

justaddcoffee · 2020-05-20T02:10:55Z

> Possible working deployment draft on issue-134-blazegraph-deploy, but cannot test until journal available (and jenkins can find the branch).

Thanks very much @kltm - should have a blazegraph journal soon

> blazegraph-runner load --informat=ntriples --journal=blazegraph.jnl --use-ontology-graph=true myfile.nt
> Is it an ontology?
> If not, maybe you don’t want --use-ontology-graph=true

@balhoff about whether this is an ontology - I'm not sure. It contains the contents of 3 ontologies, but I don't know if it's an ontology itself. What does the --use-ontology-graph flag do?

@balhoff thanks again for all your help.

balhoff · 2020-05-20T13:56:08Z

> It contains the contents of 3 ontologies, but I don't know if it's an ontology itself.

@justaddcoffee it kind of depends how they were merged. OWL ontologies can be stored in RDF, but there is a schema dictating how the RDF can be structured. For the purposes of the --use-ontology-graph flag, it will just search the file for a triple like ?ont rdf:type owl:Ontology. It does this in a streaming way, so it won't load the whole file into memory. It stops at the first matching triple, and will use the ?ont IRI as the graph IRI to load the file into.
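
To see ahead of time which graph IRI that search would pick, something like the following can be run against the N-Triples file. This is only a sketch of the behavior described above, not blazegraph-runner's actual code; the function name is made up, and it assumes one triple per line with whitespace-separated terms:

```shell
#!/bin/sh
# find_ontology_iri [FILE]
# Streams an N-Triples file (or stdin when FILE is omitted or "-") and prints
# the subject IRI of the first "?ont rdf:type owl:Ontology" triple, then stops.
find_ontology_iri() {
    awk '
        $2 == "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>" &&
        $3 == "<http://www.w3.org/2002/07/owl#Ontology>" &&
        $1 ~ /^<.*>$/ {
            print substr($1, 2, length($1) - 2)   # strip the surrounding < >
            exit                                   # stop at the first match
        }
    ' "${1:--}"
}
```

For a gzipped file it can read from a pipe, e.g. `gzcat kg-covid-19.nt.gz | find_ontology_iri`. If nothing is printed, the file has no such triple with an IRI subject.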

justaddcoffee · 2020-05-20T20:23:45Z

thanks @balhoff

justaddcoffee · 2020-05-24T18:17:38Z

Okay, completed blazegraph journal is here

Increasing memory cut the blazegraph-runner runtime in half, down to 17h

I'll sync up with Seth to stand up the blazegraph instance

balhoff · 2020-05-26T17:07:36Z

> Increasing memory cut the blazegraph-runner runtime in half, down to 17h

This still seems quite long. Did you say 15 million triples?

deepakunni3 · 2020-05-26T17:16:41Z

Since adding memory reduced the load time, it is highly likely that a lot of time is spent in garbage collection when loading via blazegraph-runner
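
One way to check this, assuming the blazegraph-runner launcher passes JAVA_OPTS through to the JVM (as the heap-size suggestion earlier in this thread implies), is to turn on the JVM's standard GC logging and watch how often collections happen during the load:

```shell
# -verbose:gc is a standard JVM flag that prints a line for each garbage collection.
# The 16G heap matches the value suggested earlier in this thread.
export JAVA_OPTS="-Xmx16G -verbose:gc"
```

If the log shows near-constant collections, the heap is still too small for the load.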

justaddcoffee · 2020-05-26T22:30:03Z

> This still seems quite long. Did you say 15 million triples?

The TSV edge file has 15 million lines, but the N-Triples file actually has 561 million triples:

~/Desktop  $ gzcat kg-covid-19.nt.gz | wc -l
 561836116

balhoff · 2020-05-26T22:46:35Z

Oh! That's a different story. It may be a fairly appropriate time frame then. You may be able to speed it up somewhat if you are on SSD.

justaddcoffee · 2020-05-27T15:00:19Z

> Oh! That's a different story. It may be a fairly appropriate time frame then. You may be able to speed it up somewhat if you are on SSD.

not sure why this would matter, but when I use the NT file produced at the merge step (as opposed to doing this after the fact with KGX), the runtime is reduced to 8 h. (@deepakunni3, thanks again for spotting this)

Triples are down to about 262M now, so this reduction in runtime is possibly just because the graph got a lot smaller:

~/Desktop  $ gzcat kg-covid-19.nt.gz.2 | wc -l
 262313832
~/Desktop  $
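
A rough check of that guess, using the triple counts and the rounded runtimes (17 h and 8 h) reported in this thread. The helper name is made up and integer division is used, so treat the result as a ballpark only:

```shell
#!/bin/sh
# triples_per_second COUNT HOURS
# Integer estimate of load throughput from a triple count and a runtime in whole hours.
triples_per_second() {
    echo $(( $1 / ($2 * 3600) ))
}

triples_per_second 561836116 17   # earlier file, 17 h run: prints 9180
triples_per_second 262313832 8    # merge-step file, 8 h run: prints 9108
```

Both runs come out at roughly 9,100 to 9,200 triples per second, which is consistent with the shorter runtime being mostly due to the smaller graph.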

justaddcoffee · 2020-05-27T15:02:41Z

@kltm the blazegraph deploy stage failed, perhaps because of a permission problem with the repo operations.git

06:55:00  hudson.plugins.git.GitException: Command "git fetch --tags --progress -- https://github.com/geneontology/operations.git +refs/heads/*:refs/remotes/origin/*" returned status code 128:
06:55:00  stdout: 
06:55:00  stderr: remote: Repository not found.
06:55:00  fatal: repository 'https://github.com/geneontology/operations.git/' not foun

deepakunni3 · 2020-05-27T16:10:33Z

> > Oh! That's a different story. It may be a fairly appropriate time frame then. You may be able to speed it up somewhat if you are on SSD.
>
> not sure why this would matter, but when I use the NT file produced at the merge step (as opposed to doing this after the fact with KGX), the runtime is reduced to 8 h. (@deepakunni3, thanks again for spotting this)
>
> Triples are down to about 262M now, so this reduction in runtime is possibly just because the graph got a lot smaller:
> ~/Desktop  $ gzcat kg-covid-19.nt.gz.2 | wc -l
>  262313832
> ~/Desktop  $

Yes, I fixed the NT exporter so it no longer exports unnecessary edges such as id, which is a self-referencing triple. The nans are also fixed (removed), which further reduces the triple count.

kltm · 2020-05-27T23:25:24Z

@justaddcoffee You should have a repo invite now to help deal with the issue.

justaddcoffee · 2020-05-27T23:28:02Z

Thanks @kltm !

kltm · 2020-05-28T00:48:40Z

Ummm...I made a few more changes and I think it worked? Do you have a way to test that?

kltm · 2020-05-28T00:50:29Z

Assuming this is "good" for now, next steps might be to move this off the cheat of using a GO repo and into something you have easier operational control over. That said, this works for now...

justaddcoffee added the enhancement and help wanted labels May 6, 2020

justaddcoffee removed the help wanted label May 12, 2020

justaddcoffee mentioned this issue May 12, 2020

Added stage to make blazegraph journal #151

Merged

justaddcoffee self-assigned this May 12, 2020

justaddcoffee added this to In progress in Make queryable graph endpoint May 14, 2020

kltm added a commit that referenced this issue May 20, 2020

first draft for deployment; will have to wait on blazegraph journal a…

7def065

…vailability; work on #134

justaddcoffee mentioned this issue May 20, 2020

Give blazegraph-runner more memory #164

Closed

justaddcoffee moved this from In progress to Done in Make queryable graph endpoint Jun 16, 2020

justaddcoffee closed this as completed Jun 16, 2020
