
Merge feature_update_graph_pattern #43

Merged 78 commits from feature_update_graph_pattern into develop on Nov 10, 2021

Conversation

@ThomasThelen (Member) commented Aug 19, 2021

This PR addresses a few issues, #42, #34, and #35, which center on applying a new triple pattern across the data and updating Virtuoso.

Changes

Image Management

Previously, the docker/ folder acted as the place to build the scheduler and worker images. Because the workers and scheduler depend on the d1lod library, and because a Dockerfile can only copy files from its own directory or below, the d1lod folder had to be copied into each image's folder. Additionally, there were separate images for the workers and the scheduler. This was error-prone and tedious.

The changes in 16c9632 and ae271ea install the d1lod package into the d1lod image (which I have at thomasthelen/slinky). This image is now used by both the workers and the scheduler, removing the need for separate images.
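The shape of the shared image is roughly the following sketch (the base image, paths, and command are assumptions for illustration, not the actual contents of 16c9632):

```dockerfile
# Hypothetical sketch of a single shared image: install the d1lod
# package once, then let each deployment override the command.
FROM python:3.9-slim
COPY d1lod /app/d1lod
RUN pip install /app/d1lod
# Workers and the scheduler set their own command/args at deploy time.
CMD ["slinky", "--help"]
```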

Worker Changes

The worker deployment now deploys two separate workers: one for the default queue and one for the dataset queue. These changes are in 4b78c5c.
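Conceptually, the split is two Deployments that differ only in which queue they listen to. A sketch of one of them (field values and args here are illustrative, not the actual 4b78c5c manifest):

```yaml
# worker-default-deployment.yaml (sketch; worker-dataset is identical
# apart from its name and the queue argument)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: worker-default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: worker-default
  template:
    metadata:
      labels:
        app: worker-default
    spec:
      containers:
        - name: worker
          image: thomasthelen/slinky
          args: ["work", "default"]  # hypothetical CLI; dataset worker would pass "dataset"
```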

Scheduler Changes

In 3051b99 the scheduler deployment was updated to use the slinky cli.

Virtuoso Changes

Instead of having a second container interact with Virtuoso to enable updating, I combined the OpenLink Virtuoso image with the shell script that enables updates. The Virtuoso server and the update-enabler run in separate processes, as shown below. This was done in 913791c.

Screen Shot 2021-11-05 at 2 36 18 PM
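The combined entrypoint presumably looks something like the following sketch (script and file names are assumptions; the real version is in 913791c):

```shell
#!/bin/sh
# Hypothetical combined entrypoint. enable-updates.sh would wait for
# Virtuoso to come up, then run the ISQL commands that grant SPARQL
# update permissions, while the server runs as the main process.
/enable-updates.sh &
exec /virtuoso-entrypoint.sh
```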

Networking Changes

When pods in separate deployments need to communicate, they have to go through the associated Service. Services can be referenced by name, the same way as in Docker (i.e., 'redis' or 'virtuoso'). 50aa722 has a few changes to accommodate this requirement.

Misc

832c7a2 changes the name of the Redis deployment to just 'redis'. We don't need to distinguish between Redis deployments, and this aligns it with the other deployment names.

8e03b92 fixes a bug where, without an __init__.py, pip won't include the contents of the folder in the installed package.

Testing

Kubernetes

A good Kubernetes test is to tear down the stack, bring it back up, and check that Virtuoso has triples.

  1. kubectl config use-context slinky
  2. kubectl delete deployment scheduler redis virtuoso worker-default worker-dataset
  3. kubectl apply -f virtuoso-deployment.yaml
  4. kubectl apply -f redis-deployment.yaml
  5. kubectl apply -f worker-default-deployment.yaml
  6. kubectl apply -f worker-dataset-deployment.yaml
  7. kubectl apply -f scheduler-deployment.yaml
  8. kubectl proxy
  9. Visit http://127.0.0.1:8001/api/v1/namespaces/slinky/services/virtuoso:virtuoso/proxy/sparql/ and check for triples

An optional extra step after deployment is viewing the logs for the workers to see if any jobs have been processed.
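For that optional step, something along these lines should work (deployment names assumed to match the manifests above):

```shell
kubectl logs deployment/worker-default --namespace slinky --tail=50
kubectl logs deployment/worker-dataset --namespace slinky --tail=50
```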

Unit Tests

The unit tests should be run the usual way. Without the RDF module installed, I ran into issues running the tests locally; the requirement to run Virtuoso or Blazegraph alongside them complicates things (Docker-in-Docker territory).

I was able to run the non-networking tests which seemed to pass okay, including the local triple store tests. @amoeba is there any way you can run these on your local machine to test out?

The old Interface and Graph classes had a lot of cruft and were really confusing to keep straight. I've done a refactor that uses a different class structure that's a lot easier for me to understand. Hopefully it's easier for you too.

See the README for more info, including a fancy picture, but the Interface class is now wrapped into a top-level SlinkyClient class and the old Graph class is now tied into a SparqlTripleStore class. With the old setup, you had to instantiate an Interface and a Graph. Now you just instantiate a SlinkyClient and you're good to go.

Here's some more detail, copied from the README:

- `SlinkyClient`: Entrypoint class that manages a connection to DataONE, a triple store, and Redis for short-term persistence and delayed jobs
- `FilteredCoordinatingNodeClient`: A view into a Coordinating Node that can limit what content appears to be available based on a Solr query. e.g., a CN client that can only see datasets that are part of a specific EML project or in a particular region
- `SparqlTripleStore`: Handles inserting into and querying a generic SPARQL-compliant RDF triplestore via SPARQL queries. Designed to be used with multiple triple stores.
- `Processor`: Set of classes that convert documents of various formats (e.g., XML, JSON-LD) into a set of RDF statements
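The wrapping can be pictured with a minimal self-contained mock (the method names, constructor signatures, and storage here are invented for illustration; see the README for the real API):

```python
# Mock of the new class structure: SlinkyClient is the single entrypoint
# and owns its SparqlTripleStore, instead of the caller wiring up an
# Interface and a Graph separately as in the old setup.
class SparqlTripleStore:
    """Stand-in for a SPARQL-compliant triple store."""

    def __init__(self):
        self.statements = []

    def insert(self, statements):
        self.statements.extend(statements)


class SlinkyClient:
    """Entrypoint that constructs its own store internally."""

    def __init__(self):
        self.store = SparqlTripleStore()

    def process(self, statements):
        self.store.insert(statements)


# One object to instantiate, and you're good to go.
client = SlinkyClient()
client.process([("s", "p", "o")])
print(len(client.store.statements))  # → 1
```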

The old package code is left in the legacy submodule (to be deleted in the future) and its tests are still alive and working via pytest.
This is pretty basic but you can run this container and hit /get?id=foo to get the Slinky RDF for given DataONE PID.
Closes #30

I couldn't find a way to send very large SPARQL queries to Virtuoso, but Virtuoso does have an HTTP API that takes Turtle/N-Triples/etc. Since this is specific to Virtuoso, I've made a separate model from SparqlTripleStore.
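For reference, the HTTP API in question is Virtuoso's SPARQL 1.1 Graph Store Protocol endpoint, which accepts serialized triples directly; a request along these lines should work (host, port, graph URI, credentials, and file name below are placeholders):

```shell
curl --digest --user dba:dba \
  -X POST \
  -H "Content-Type: text/turtle" \
  --data-binary @triples.ttl \
  "http://localhost:8890/sparql-graph-crud-auth?graph-uri=http://example.org/graph"
```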
I don't know why I had tweaked this.
We don't want to run multiple update jobs at once and we also don't want to run update_job when the dataset queue is saturated. This change controls both of those scenarios.
This prevents repeated calls to get_new_datasets_since from inserting the most recent dataset over and over again
Turns out you _really_ need to pass binary data to ElementTree because it'll treat your XML content as ASCII if you have requests/httpx decode it first.
We need a way to override default behavior depending on context. Development, production, etc. I'm not sure if I want to do this via configuration or via environment variables just yet. This'll all probably change once I start building Docker images.
I'm not really sure what to do about this yet. It's a bit painful to write code to handle every blank node pattern SOSO is going to throw at us. I might re-introduce this at some later point.
I had originally designed the filtered client to take a base filter + an extra filter. I didn't account for the fact that the filtered client needs to manage three filters: (1) the base, (2) the actual filter of interest (eg SASAP-only, ARCTICA-only) and (3) the filter we use as the cursor to determine when there are new datasets. This makes the class fully aware of all of that.
@ThomasThelen ThomasThelen changed the base branch from main to develop August 19, 2021 22:06
@ThomasThelen ThomasThelen changed the title [WIP] Merge feature_update_graph_pattern Merge feature_update_graph_pattern Nov 5, 2021
@ThomasThelen ThomasThelen mentioned this pull request Nov 6, 2021
@ThomasThelen (Member, Author)

This PR should be ready for review (see my note about testing). There's a follow-up PR, #48, forked off this branch. It adds some nice features to the deployment and documentation; after reading through this PR, I suggest taking a look at those changes for a bigger-picture view.

@amoeba (Contributor) commented Nov 10, 2021

Thanks @ThomasThelen! I appreciate you coming up to speed on how this service works and getting everything deploying correctly.

This PR got a bit too big to review effectively, but since we ran through things together last week, I'm going to merge and we can go from there.

I did do a scan, and two things popped up:

  • Use of a custom Virtuoso image (thomasthelen/virtuoso) referenced in various places. It'd be better to either use the stock image or build one under the DataONE org. No need to do that right now.
  • I think it'd be good to do a pass over how we're doing service discovery and configuration. In my rewrite, I hard-coded service names, some of which are still in the code and should be changed. You added some flags, like --prod, which help get around that, but I think the best overall solution is to have all connections default to localhost with default ports and override any hostnames and ports with environment variables (à la 12-factor). We can also work on that post-merge.
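The environment-variable pattern described in the second point, sketched in Python (the variable names and helper are hypothetical, not settled Slinky config):

```python
import os

# Clean slate so the demo output below is deterministic.
os.environ.pop("REDIS_HOST", None)
os.environ.pop("REDIS_PORT", None)


def service_address(name, default_port):
    """Default to localhost; let the environment override (12-factor style)."""
    host = os.environ.get(f"{name}_HOST", "localhost")
    port = int(os.environ.get(f"{name}_PORT", default_port))
    return host, port


# With nothing set, local development just works:
print(service_address("REDIS", 6379))  # → ('localhost', 6379)

# In the cluster, the deployment sets the service name instead:
os.environ["REDIS_HOST"] = "redis"
print(service_address("REDIS", 6379))  # → ('redis', 6379)
```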

@amoeba (Contributor) commented Nov 10, 2021

Oh, and Re: Testing

You should be able to `docker compose up -d` and run `pytest` to run the full test suite. I did have to do some work to get the bindings installed, and I wrote up docs for my future self at https://github.com/DataONEorg/slinky/blob/develop/d1lod/docs/installing-redlands-bindings.md.
