
Merge feature_update_graph_pattern #43

Merged 78 commits from feature_update_graph_pattern into develop on Nov 10, 2021

Conversation

@ThomasThelen (Member) commented Aug 19, 2021

This PR addresses a few issues, #42, #34, and #35, which center on applying a new triple pattern across the data and updating Virtuoso.

Changes

Image Management

Previously, the docker/ folder acted as the place to build the scheduler and worker images. Because the workers and scheduler depend on the d1lod library, and because a Dockerfile can only copy files from its own directory or below, the d1lod folder had to be copied into each image's folder. Additionally, there were separate images for the workers and the scheduler. This was error-prone and tedious.

The changes in 16c9632 and ae271ea install the d1lod package into the d1lod image (which I have at thomasthelen/slinky). This image is now used by both the workers and the scheduler, removing the need for separate images.
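The shape of the shared image is roughly the following sketch (the base image, paths, and command are assumptions for illustration, not the actual contents of 16c9632):

```dockerfile
# Hypothetical sketch of a single shared image: install the d1lod
# package once, then let each deployment override the command.
FROM python:3.9-slim
COPY d1lod /app/d1lod
RUN pip install /app/d1lod
# Workers and the scheduler set their own command/args at deploy time.
CMD ["slinky", "--help"]
```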

Worker Changes

The worker deployment now deploys two separate workers: one for the default queue and one for the dataset queue. These changes are in 4b78c5c.
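Conceptually, the split is two Deployments that differ only in which queue they listen to. A sketch of one of them (field values and args here are illustrative, not the actual 4b78c5c manifest):

```yaml
# worker-default-deployment.yaml (sketch; worker-dataset is identical
# apart from its name and the queue argument)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: worker-default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: worker-default
  template:
    metadata:
      labels:
        app: worker-default
    spec:
      containers:
        - name: worker
          image: thomasthelen/slinky
          args: ["work", "default"]  # hypothetical CLI; dataset worker would pass "dataset"
```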

Scheduler Changes

In 3051b99 the scheduler deployment was updated to use the slinky cli.

Virtuoso Changes

Instead of having a second container interact with Virtuoso to enable updating, I combined the OpenLink Virtuoso image with the shell script that enables updates. The Virtuoso server and the update-enabler run in separate processes, as shown below. This was done in 913791c.

Screen Shot 2021-11-05 at 2 36 18 PM
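The combined entrypoint presumably looks something like the following sketch (script and file names are assumptions; the real version is in 913791c):

```shell
#!/bin/sh
# Hypothetical combined entrypoint. enable-updates.sh would wait for
# Virtuoso to come up, then run the ISQL commands that grant SPARQL
# update permissions, while the server runs as the main process.
/enable-updates.sh &
exec /virtuoso-entrypoint.sh
```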

Networking Changes

When pods in separate deployments need to communicate, they have to go through the associated Service. Services can be referenced by name, the same way as in Docker (i.e., 'redis' or 'virtuoso'). 50aa722 has a few changes to accommodate this requirement.

Misc

832c7a2 changes the name of the Redis deployment to just 'redis'. We don't need to distinguish between Redis deployments, and this aligns it with the other deployment names.

8e03b92 fixes a bug where, without an __init__.py, pip won't include the contents of the folder in the installed package.

Testing

Kubernetes

A good Kubernetes test is to tear down the stack, bring it back up, and check that Virtuoso has triples.

  1. kubectl config use-context slinky
  2. kubectl delete deployment scheduler redis virtuoso worker-default worker-dataset
  3. kubectl apply -f virtuoso-deployment.yaml
  4. kubectl apply -f redis-deployment.yaml
  5. kubectl apply -f worker-default-deployment.yaml
  6. kubectl apply -f worker-dataset-deployment.yaml
  7. kubectl apply -f scheduler-deployment.yaml
  8. kubectl proxy
  9. Visit http://127.0.0.1:8001/api/v1/namespaces/slinky/services/virtuoso:virtuoso/proxy/sparql/ and check for triples

An optional extra step after deployment is viewing the logs for the workers to see if any jobs have been processed.
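For that optional step, something along these lines should work (deployment names assumed to match the manifests above):

```shell
kubectl logs deployment/worker-default --namespace slinky --tail=50
kubectl logs deployment/worker-dataset --namespace slinky --tail=50
```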

Unit Tests

The unit tests should be run the usual way. Without the RDF module installed, I ran into issues running the tests locally; the requirement to run Virtuoso or Blazegraph alongside them complicates things (Docker-in-Docker territory).

I was able to run the non-networking tests which seemed to pass okay, including the local triple store tests. @amoeba is there any way you can run these on your local machine to test out?

The old Interface and Graph classes had a lot of cruft and were really confusing to keep straight. I've done a refactor that uses a different class structure that's a lot easier for me to understand. Hopefully it's easier for you too.

See the README for more info, including a fancy picture, but the Interface class is now wrapped into a top-level SlinkyClient class and the old Graph class is now tied into a SparqlTripleStore class. With the old setup, you had to instantiate an Interface and a Graph. Now you just instantiate a SlinkyClient and you're good to go.

Here's some more detail, copied from the README:

- `SlinkyClient`: Entrypoint class that manages a connection to DataONE, a triple store, and Redis for short-term persistence and delayed jobs
- `FilteredCoordinatingNodeClient`: A view into a Coordinating Node that can limit what content appears to be available based on a Solr query. e.g., a CN client that can only see datasets that are part of a specific EML project or in a particular region
- `SparqlTripleStore`: Handles inserting into and querying a generic SPARQL-compliant RDF triplestore via SPARQL queries. Designed to be used with multiple triple stores.
- `Processor`: Set of classes that convert documents of various formats (e.g., XML, JSON-LD) into a set of RDF statements
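The wrapping can be pictured with a minimal self-contained mock (the method names, constructor signatures, and storage here are invented for illustration; see the README for the real API):

```python
# Mock of the new class structure: SlinkyClient is the single entrypoint
# and owns its SparqlTripleStore, instead of the caller wiring up an
# Interface and a Graph separately as in the old setup.
class SparqlTripleStore:
    """Stand-in for a SPARQL-compliant triple store."""

    def __init__(self):
        self.statements = []

    def insert(self, statements):
        self.statements.extend(statements)


class SlinkyClient:
    """Entrypoint that constructs its own store internally."""

    def __init__(self):
        self.store = SparqlTripleStore()

    def process(self, statements):
        self.store.insert(statements)


# One object to instantiate, and you're good to go.
client = SlinkyClient()
client.process([("s", "p", "o")])
print(len(client.store.statements))  # → 1
```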

The old package code is left in the legacy submodule (to be deleted in the future) and its tests are still alive and working via pytest.
This is pretty basic but you can run this container and hit /get?id=foo to get the Slinky RDF for given DataONE PID.
Closes #30

I couldn't find a way to send very large SPARQL queries to Virtuoso, but Virtuoso does have an HTTP API that takes Turtle/N-Triples/etc. Since this is specific to Virtuoso, I've made a separate model from SparqlTripleStore.
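For reference, the HTTP API in question is Virtuoso's SPARQL 1.1 Graph Store Protocol endpoint, which accepts serialized triples directly; a request along these lines should work (host, port, graph URI, credentials, and file name below are placeholders):

```shell
curl --digest --user dba:dba \
  -X POST \
  -H "Content-Type: text/turtle" \
  --data-binary @triples.ttl \
  "http://localhost:8890/sparql-graph-crud-auth?graph-uri=http://example.org/graph"
```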
I don't know why I had tweaked this.
We don't want to run multiple update jobs at once and we also don't want to run update_job when the dataset queue is saturated. This change controls both of those scenarios.
This prevents repeated calls to get_new_datasets_since from inserting the most recent dataset over and over again
Turns out you _really_ need to pass binary data to ElementTree because it'll treat your XML content as ASCII if you have requests/httpx decode it first.
We need a way to override default behavior depending on context. Development, production, etc. I'm not sure if I want to do this via configuration or via environment variables just yet. This'll all probably change once I start building Docker images.
I'm not really sure what to do about this yet. It's a bit painful to write code to handle every blank node pattern SOSO is going to throw at us. I might re-introduce this at some later point.
I had originally designed the filtered client to take a base filter + an extra filter. I didn't account for the fact that the filtered client needs to manage three filters: (1) the base, (2) the actual filter of interest (eg SASAP-only, ARCTICA-only) and (3) the filter we use as the cursor to determine when there are new datasets. This makes the class fully aware of all of that.
@ThomasThelen ThomasThelen changed the base branch from main to develop August 19, 2021 22:06
@ThomasThelen ThomasThelen changed the title [WIP] Merge feature_update_graph_pattern Merge feature_update_graph_pattern Nov 5, 2021
@ThomasThelen ThomasThelen mentioned this pull request Nov 6, 2021
@ThomasThelen (Member, Author)

This PR should be ready for review (see my note about testing). There's a follow-up PR, #48, forked off this branch. It adds some nice features to the deployment and documentation; after reading through this PR, I suggest taking a look at those changes for a bigger-picture view.

@amoeba (Contributor) commented Nov 10, 2021

Thanks @ThomasThelen! I appreciate you coming up to speed on how this service works and getting everything deploying correctly.

This PR got a bit too big to review effectively, but since we ran through things together last week, I'm going to merge and we can go from there.

I did do a scan, and two things popped up:

  • Use of a custom Virtuoso image (thomasthelen/virtuoso) referenced in various places. It'd be better to either use the stock image or build one under the DataONE org. No need to do that right now.
  • I think it'd be good to do a pass over how we're doing service discovery and configuration. In my rewrite, I hard-coded service names, some of which are still in the code and should be changed. You added some flags, like --prod, which help get around that, but I think the best overall solution is to have all connections default to localhost with default ports and override any hostnames and ports with environment variables (à la 12-factor). We can also work on that post-merge.
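The environment-variable pattern described in the second point, sketched in Python (the variable names and helper are hypothetical, not settled Slinky config):

```python
import os

# Clean slate so the demo output below is deterministic.
os.environ.pop("REDIS_HOST", None)
os.environ.pop("REDIS_PORT", None)


def service_address(name, default_port):
    """Default to localhost; let the environment override (12-factor style)."""
    host = os.environ.get(f"{name}_HOST", "localhost")
    port = int(os.environ.get(f"{name}_PORT", default_port))
    return host, port


# With nothing set, local development just works:
print(service_address("REDIS", 6379))  # → ('localhost', 6379)

# In the cluster, the deployment sets the service name instead:
os.environ["REDIS_HOST"] = "redis"
print(service_address("REDIS", 6379))  # → ('redis', 6379)
```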

@amoeba (Contributor) commented Nov 10, 2021

Oh, and Re: Testing

You should be able to `docker compose up -d` and run `pytest` to run the full test suite. I did have to do some work to get the bindings installed, and I wrote up docs for my future self at https://github.com/DataONEorg/slinky/blob/develop/d1lod/docs/installing-redlands-bindings.md.
