Load kg-covid-19 into blazegraph instance #134

Closed
justaddcoffee opened this issue May 4, 2020 · 25 comments

@justaddcoffee
Collaborator

No description provided.

@justaddcoffee justaddcoffee added enhancement New feature or request help wanted Extra attention is needed labels May 6, 2020
@justaddcoffee
Collaborator Author

@kltm has set up a blazegraph endpoint:
http://kg-hub-rdf.berkeleybop.io/blazegraph/#splash

now we just need to give him a .jnl file - anyone want to help with this?

@kltm
Contributor

kltm commented May 12, 2020

From @balhoff

blazegraph-runner load --informat=ntriples --journal=blazegraph.jnl --use-ontology-graph=true myfile.nt
Is it an ontology?
If not, maybe you don’t want --use-ontology-graph=true

@balhoff

balhoff commented May 12, 2020

@kltm I had accidentally put a space before the format there.

@kltm
Contributor

kltm commented May 12, 2020

fixed

@justaddcoffee
Collaborator Author

thanks @balhoff @kltm

@balhoff

balhoff commented May 12, 2020

@justaddcoffee you may need to increase available heap size:

export JAVA_OPTS=-Xmx16G

@justaddcoffee
Collaborator Author

made a PR for this #151 - likely will have to fiddle with Jenkinsfile a bit more

@justaddcoffee justaddcoffee removed the help wanted Extra attention is needed label May 12, 2020
@justaddcoffee justaddcoffee self-assigned this May 12, 2020
@justaddcoffee
Collaborator Author

justaddcoffee commented May 16, 2020

#151 is merged and will emit a blazegraph journal here if the Jenkins gods are good.

@kltm can we sync up on Monday-ish to see about loading this jnl?

kltm added a commit that referenced this issue May 20, 2020
@kltm
Contributor

kltm commented May 20, 2020

Possible working deployment draft on issue-134-blazegraph-deploy, but cannot test until journal available (and jenkins can find the branch).

@justaddcoffee
Collaborator Author

> Possible working deployment draft on issue-134-blazegraph-deploy, but cannot test until journal available (and jenkins can find the branch).

Thanks very much @kltm - should have a blazegraph journal soon

> blazegraph-runner load --informat=ntriples --journal=blazegraph.jnl --use-ontology-graph=true myfile.nt
> Is it an ontology?
> If not, maybe you don’t want --use-ontology-graph=true

@balhoff about whether this is an ontology - I'm not sure. It contains the contents of 3 ontologies, but I don't know if it's an ontology itself. What does the --use-ontology-graph flag do?

@balhoff thanks again for all your help.

@balhoff

balhoff commented May 20, 2020

> It contains the contents of 3 ontologies, but I don't know if it's an ontology itself.

@justaddcoffee it kind of depends on how they were merged. OWL ontologies can be stored in RDF, but there is a schema dictating how the RDF can be structured. For the purposes of --use-ontology-graph, it will just search the file for a triple like ?ont rdf:type owl:Ontology. It does this in a streaming way, so it won't load the whole file into memory. It stops at the first matching triple, and will use the ?ont IRI as the graph IRI to load the file into.
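
For anyone who wants to check what that search would find in their own file, here is a minimal Python sketch of the same idea: read N-Triples one line at a time and stop at the first ?ont rdf:type owl:Ontology triple. This is not blazegraph-runner's actual code; the function name find_ontology_iri and the regex are assumptions made for illustration, and the sketch only recognizes IRI subjects (not blank nodes).

```python
import re
from typing import Iterable, Optional

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_ONTOLOGY = "http://www.w3.org/2002/07/owl#Ontology"

# Matches one N-Triples line of the form: <subject> <rdf:type> <owl:Ontology> .
ONTOLOGY_TRIPLE = re.compile(
    r"^\s*<([^>]+)>\s+<" + re.escape(RDF_TYPE) + r">\s+<"
    + re.escape(OWL_ONTOLOGY) + r">\s*\.\s*$"
)


def find_ontology_iri(lines: Iterable[str]) -> Optional[str]:
    """Return the subject IRI of the first owl:Ontology declaration, or None.

    Hypothetical helper: it consumes the input lazily, so only one line is
    held in memory at a time, and it stops reading at the first match.
    """
    for line in lines:
        match = ONTOLOGY_TRIPLE.match(line)
        if match:
            return match.group(1)
    return None
```

Passing an open file handle (for example the result of open("myfile.nt")) works because file objects iterate line by line. A result of None would suggest the file has no ontology declaration with an IRI subject.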

@justaddcoffee
Collaborator Author

thanks @balhoff

@justaddcoffee
Collaborator Author

Okay, completed blazegraph journal is here

Increasing memory cut the blazegraph-runner runtime in half, down to 17h

I'll sync up with Seth to stand up the blazegraph instance

@balhoff

balhoff commented May 26, 2020

> Increasing memory cut the blazegraph-runner runtime in half, down to 17h

This still seems quite long. Did you say 15 million triples?

@deepakunni3
Member

Since adding memory reduced the load time, it is highly likely that a lot of time is spent in garbage collection when loading via blazegraph-runner.

@justaddcoffee
Collaborator Author

> This still seems quite long. Did you say 15 million triples?

The TSV edge file has 15 million lines, but the N-Triples file actually has more than 561 million triples:

~/Desktop  $ gzcat kg-covid-19.nt.gz | wc -l
 561836116
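
The same count can be taken without gzcat. Below is a minimal Python sketch, assuming the file is gzipped N-Triples with one statement per line; the function name count_triples is hypothetical. Unlike wc -l, it skips blank lines and comment lines, so the two numbers can differ slightly if the file contains any.

```python
import gzip


def count_triples(path: str) -> int:
    """Count N-Triples statements in a gzipped file, one line at a time."""
    count = 0
    # "rt" decompresses on the fly and yields text lines, so the whole
    # file is never held in memory.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            stripped = line.strip()
            # Blank lines and comment lines are not triples.
            if stripped and not stripped.startswith("#"):
                count += 1
    return count
```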

@balhoff

balhoff commented May 26, 2020

Oh! That's a different story. It may be a fairly appropriate time frame then. You may be able to speed it up somewhat if you are on SSD.

@justaddcoffee
Collaborator Author

justaddcoffee commented May 27, 2020

> Oh! That's a different story. It may be a fairly appropriate time frame then. You may be able to speed it up somewhat if you are on SSD.

not sure why this would matter, but when I use the NT file produced at the merge step (as opposed to doing this after the fact with KGX), the runtime is reduced to 8 h. (@deepakunni3, thanks again for spotting this)

Triples are down to about 262M now, so this reduction in runtime is possibly just because the graph got a lot smaller:

~/Desktop  $ gzcat kg-covid-19.nt.gz.2 | wc -l
 262313832
~/Desktop  $
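
A quick back-of-the-envelope check of that guess, as a small Python sketch. It assumes the 17 h run loaded the 561,836,116-triple file and the 8 h run loaded the 262,313,832-triple file; the thread does not state this pairing explicitly. Under that assumption both runs come out at roughly 33 million triples per hour, which is consistent with the shorter runtime being mostly a matter of graph size.

```python
def triples_per_hour(triples: int, hours: float) -> float:
    """Average load throughput for one run."""
    return triples / hours


# Assumed pairing of triple counts and runtimes (see the note above).
first_run = triples_per_hour(561_836_116, 17)
second_run = triples_per_hour(262_313_832, 8)

print(f"first run:  {first_run:,.0f} triples/hour")
print(f"second run: {second_run:,.0f} triples/hour")
print(f"size ratio:    {262_313_832 / 561_836_116:.2f}")
print(f"runtime ratio: {8 / 17:.2f}")
```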

@justaddcoffee
Collaborator Author

@kltm the blazegraph deploy stage failed, perhaps because of a permission problem with the repo operations.git

06:55:00  hudson.plugins.git.GitException: Command "git fetch --tags --progress -- https://github.com/geneontology/operations.git +refs/heads/*:refs/remotes/origin/*" returned status code 128:
06:55:00  stdout: 
06:55:00  stderr: remote: Repository not found.
06:55:00  fatal: repository 'https://github.com/geneontology/operations.git/' not foun

@deepakunni3
Member

> > Oh! That's a different story. It may be a fairly appropriate time frame then. You may be able to speed it up somewhat if you are on SSD.
>
> not sure why this would matter, but when I use the NT file produced at the merge step (as opposed to doing this after the fact with KGX), the runtime is reduced to 8 h. (@deepakunni3, thanks again for spotting this)
>
> Triples are down to about 262M now, so this reduction in runtime is possibly just because the graph got a lot smaller:
>
> ~/Desktop  $ gzcat kg-covid-19.nt.gz.2 | wc -l
>  262313832
> ~/Desktop  $

Yes, I fixed the NT exporter so that it no longer exports unnecessary edges like id, which is a self-referencing triple. The NaNs are also fixed (removed), which again reduces the triple count.
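
To illustrate the kind of filtering described, here is a rough Python sketch that drops two classes of lines from an N-Triples stream: triples whose object is identical to their subject, and triples whose object is a literal "nan". This is not the actual exporter code; the function name drop_redundant_triples and both rules are assumptions about what "self-referencing" and "nans" mean here, and the line splitting is deliberately simplistic.

```python
from typing import Iterable, Iterator


def drop_redundant_triples(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the N-Triples lines that pass two illustrative filters."""
    for line in lines:
        parts = line.strip().split(None, 2)
        if len(parts) < 3:
            continue  # blank or malformed line
        subject, _predicate, rest = parts
        # Strip the statement terminator " ." to isolate the object term.
        obj = rest.rstrip()
        if obj.endswith("."):
            obj = obj[:-1].rstrip()
        # Assumed rule 1: object equals subject (self-referencing triple).
        if obj == subject:
            continue
        # Assumed rule 2: object is a literal whose lexical form is "nan".
        if obj.startswith('"') and obj.split('"')[1].lower() == "nan":
            continue
        yield line
```

Because it is a generator, it can be chained directly onto a file handle and written back out line by line without holding the graph in memory.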

@kltm
Contributor

kltm commented May 27, 2020

@justaddcoffee You should have a repo invite now to help deal with the issue.

@justaddcoffee
Collaborator Author

Thanks @kltm !

@kltm
Contributor

kltm commented May 28, 2020

Ummm...I made a few more changes and I think it worked? Do you have a way to test that?

@kltm
Contributor

kltm commented May 28, 2020

Assuming this is "good" for now, a next step might be to move this off the cheat of using a GO repo and into something you have easier operational control over. That said, this works for now...

@justaddcoffee justaddcoffee moved this from In progress to Done in Make queryable graph endpoint Jun 16, 2020