Produce and publish RDF #80

Closed
justaddcoffee opened this issue Apr 9, 2020 · 8 comments
@justaddcoffee
Collaborator

justaddcoffee commented Apr 9, 2020

Produce some RDF from all current sources, especially CORD19:

  • download, transform, load -> nodes.tsv + edges.tsv
  • transform to TTL using KGX

blocked by #79

@justaddcoffee self-assigned this on Apr 9, 2020
@justaddcoffee changed the title from "Produce RDF" to "Produce and publish RDF" on Apr 10, 2020
@justaddcoffee
Collaborator Author

justaddcoffee commented Apr 10, 2020

Waiting to confirm that this Jenkins job runs to completion. Then:

  • get current Jenkins pipeline running green to completion
  • wait for KGX refactor to help with memory usage
  • add a Jenkins stage to transform to TTL using KGX
  • update Publish stage to push merged nodes and edges and TTL to https://idg.berkeleybop.io/
  • move Jenkins pipeline to build.berkeleybop.io (high mem machine)
  • add stage to convert TSV to RDF
  • run Biolink RDF validator on RDF output

@justaddcoffee
Collaborator Author

get current Jenkins pipeline running green to completion

Still waiting on run.py load to finish - 3 days and counting. I think STRING is slowing things down quite a bit.

@deepakunni3
Member

deepakunni3 commented Apr 13, 2020

The whole run finished in 2 hours and 42 mins on a laptop (including STRING).
Something else is happening on the Jenkins instance. What is the specification of the Jenkins instance?

@justaddcoffee
Collaborator Author

From #30, @kltm says that these are the mem specs on the Jenkins machine:

free -h
              total        used        free      shared  buff/cache   available
Mem:            23G        2.9G         13G         11M        6.8G         20G
Swap:           46G        1.6G         44G

Not sure of the other specs (CPU, disk space, etc)

@kltm
Contributor

kltm commented Apr 13, 2020

Try:

di -h
cat /proc/cpuinfo

@justaddcoffee
Collaborator Author

What's likely going on here is that we're dipping into swap, which is an order of magnitude slower on this Jenkins instance than on Deepak's laptop (which I think has an SSD and therefore fast swap).
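
One way to check the swap theory while the load is running is to sample swap usage from `/proc/meminfo` on the Jenkins machine. The following is a minimal sketch, not part of the pipeline; the function names `swap_usage` and `read_swap_usage` are hypothetical, and it assumes a Linux host that exposes the usual `SwapTotal` and `SwapFree` fields.

```python
def swap_usage(meminfo_text):
    """Return (swap_total_kib, swap_used_kib) from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    # Missing fields are treated as zero (assumption: no swap configured).
    total = fields.get("SwapTotal", 0)
    free = fields.get("SwapFree", 0)
    return total, total - free


def read_swap_usage(path="/proc/meminfo"):
    """Read the live values; only works on hosts that provide /proc/meminfo."""
    with open(path) as handle:
        return swap_usage(handle.read())
```

If the used figure climbs steadily during `run.py load`, that would support the idea that the job is spilling out of RAM.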

Just chatted with Deepak, and he's going to refactor KGX a bit to lower the memory footprint

@justaddcoffee
Collaborator Author

justaddcoffee commented Apr 27, 2020

Thanks @deepakunni3 for adding (very fast) ntriples support to KGX!
Added a stage to Jenkins to convert TSV to NT, plus an additional command to push the ntriples to S3. It's running now.
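
For readers unfamiliar with the conversion, here is a rough sketch of what turning an edges TSV into N-Triples involves. This is plain Python, not KGX's implementation or API. The column names (`subject`, `predicate`, `object`), the `PREFIXES` map, and the function names are assumptions made for illustration, and the sketch does no IRI escaping. It writes one triple per row as it reads, so it never holds the whole graph in memory.

```python
import csv
import io

# Illustrative prefix map (assumption); a real run would use a full CURIE map.
PREFIXES = {
    "UniProtKB": "http://purl.uniprot.org/uniprot/",
    "biolink": "https://w3id.org/biolink/vocab/",
}


def expand(curie, prefixes=PREFIXES):
    """Expand a CURIE such as 'biolink:interacts_with' into a full IRI."""
    prefix, sep, local = curie.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"cannot expand CURIE: {curie!r}")
    return prefixes[prefix] + local


def edges_to_ntriples(tsv_handle, out_handle, prefixes=PREFIXES):
    """Stream edges from a TSV handle to N-Triples; return the triple count."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    count = 0
    for row in reader:
        s = expand(row["subject"], prefixes)
        p = expand(row["predicate"], prefixes)
        o = expand(row["object"], prefixes)
        out_handle.write(f"<{s}> <{p}> <{o}> .\n")
        count += 1
    return count


if __name__ == "__main__":
    # Tiny in-memory demo with made-up identifiers.
    demo = "subject\tpredicate\tobject\nUniProtKB:P1\tbiolink:interacts_with\tUniProtKB:P2\n"
    out = io.StringIO()
    edges_to_ntriples(io.StringIO(demo), out)
    print(out.getvalue(), end="")
```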

@justaddcoffee
Collaborator Author

Done - the graph in RDF triples (and of course TSV) is linked from the wiki, in the "download knowledge graph" section:
https://github.com/Knowledge-Graph-Hub/kg-covid-19/wiki
