
Merge feature_update_graph_pattern #43

Merged
78 commits merged on Nov 10, 2021
Commits
e018a08
Remove 2to3.txt from Python 2 to 3 migration
amoeba Apr 13, 2021
d9e67c1
Remove unneeded void.ttl file from d1lod folder
amoeba Apr 13, 2021
05741d9
Update d1lod package readme and setup.py
amoeba Apr 13, 2021
4cfebac
Create initial prototype of web app
amoeba Apr 22, 2021
109bc54
WIP Begin overhaul of classes for Slinky
amoeba Apr 24, 2021
8378a34
Merge branch 'feature_web_app' into feature_update_graph_pattern
amoeba Apr 24, 2021
478b62f
Completely refactor d1lod package
amoeba Apr 30, 2021
46009a7
Remove top-level makefile
amoeba Apr 30, 2021
062c562
Convert web front-end to use a SlinkyClient
amoeba Apr 30, 2021
2fca953
Remove unused variable in SparqlTripleStore.get_ua_string
amoeba Apr 30, 2021
a782ff9
Fix broken URIs in eml_processor
amoeba Apr 30, 2021
e1dd51f
Add note about Exceptions in d1lod readme
amoeba May 1, 2021
85163f6
Wrap up last bit of work for full EML processing
amoeba May 4, 2021
fd6f7cb
Create a Blazegraph connector to replace Virtuoso
amoeba May 4, 2021
ac73868
Add a main method in cli.py so the cli can be debugged
amoeba May 12, 2021
104e190
Create an ISOProcessor class to process ISO docs
amoeba May 12, 2021
777cf1f
Make SlinkyClient's choice of store an argument
amoeba May 12, 2021
04f9407
Add more tests to Blazegraph and SparqlStore
amoeba May 12, 2021
be4d94a
Hook up new classes for easy testing
amoeba May 15, 2021
9783fb6
Create new Virtuoso-specific store model
amoeba May 18, 2021
08cbd7a
Change BlazegraphStore's default port
amoeba May 18, 2021
025933b
Remove unused Exception from client.py
amoeba May 18, 2021
cb2ff80
Do a cleanup pass over the entire test suite
amoeba May 18, 2021
414fce3
Adjust logic for when update_job runs or doesn't
amoeba May 18, 2021
9daeef9
Re-use module-level global in jobs add_dataset_job
amoeba May 18, 2021
65879b9
Add --debug argument to work command in cli
amoeba May 18, 2021
729d615
Make get_new_datasets_since query range-exclusive
amoeba May 18, 2021
776f842
Use response.content instead of response.text
amoeba May 19, 2021
dd233d9
Add blazegraph to d1lod package's docker compose
amoeba May 19, 2021
21e4323
Begin work refactoring setups/environments
amoeba May 19, 2021
d1af30a
Remove unused code from cli.py
amoeba May 26, 2021
aab1c30
Add SPARQL DELETE support to VirtuosoStore
amoeba May 26, 2021
a6ef121
Prevent EMLProcessor from re-inserting identifier blank nodes
amoeba May 26, 2021
ae07392
Make VirtuosoStore's count method support patterns
amoeba May 26, 2021
1c315ed
Capitalize 'select' in VirtuosoStore.all
amoeba May 26, 2021
9b35c2a
Add remaining pieces of VirtuosoStore delete impl
amoeba May 26, 2021
9958cf3
Remove extra trailing slash from VirtuosoStore endpoint
amoeba May 26, 2021
49d78bd
Fix bug in Processor's handling of sysmeta 'obsoletes'
amoeba May 26, 2021
7eee7bd
Add a datatype for isAccessibleForFree triples (boolean)
amoeba May 26, 2021
2a8586b
Fix bug in schema:byteSize routine
amoeba May 26, 2021
c3eadc4
Guard against unset accessPolicy in Processor
amoeba May 26, 2021
7f64dbe
Add .strip() calls to all ElementTree .text calls
amoeba May 26, 2021
c954903
Remove unused code from ISOProcessor
amoeba May 26, 2021
1d43a69
Fix logic bug in handling datePublished
amoeba May 26, 2021
ee1907c
Add schema:distribution triples
amoeba May 28, 2021
9eaec1e
Add insert, insertall, clear, and count commands to CLI
amoeba May 28, 2021
6c60a8f
Add architecture diagram to readme
amoeba May 28, 2021
da28ab4
Finish up support for semantic annotations
amoeba Jun 3, 2021
dffb273
Finish spdx:Checksum support
amoeba Jun 3, 2021
ee4e1a6
Tweak style of slinky-architecture diagram a tad
amoeba Jun 3, 2021
fa74fa5
Move lookup* functions around in eml_processor
amoeba Jun 4, 2021
c154540
Clean up whitespace in readme
amoeba Jun 4, 2021
7482b19
Add in support for SOSO PropertyValue model for attributes
amoeba Jun 4, 2021
6d15143
Rename variable in test_eml220_processor
amoeba Jun 4, 2021
d4ac71c
Add count and format options to cli's get method
amoeba Jun 4, 2021
30e06a6
Remove unused pagination code in filtered_d1_client
amoeba Jun 5, 2021
aeec699
Finish up implementation of EML attributes
amoeba Jun 5, 2021
471dda8
Remove RQ Dashboard from compose file
amoeba Jun 5, 2021
8f1217d
Change update schedule from 5min to 1min
amoeba Jun 5, 2021
59ca2cb
Remove test for double-processing
amoeba Jun 5, 2021
91aa958
Re-organize code between client and jobs module
amoeba Jun 5, 2021
d89cf23
Add start of test suite for client
amoeba Jun 5, 2021
aa24cf2
Fix broken imports from previous refactor
amoeba Jun 5, 2021
3928b6d
Fix bug in FilteredCoordinatingNodeClient logic
amoeba Jun 5, 2021
81f81c3
Fix test regressions in for FilteredD1Client
amoeba Jun 23, 2021
72b00fa
Switch d1lod test suite's docker-compose to use official VOS image
amoeba Jun 23, 2021
7ba9b9c
Fix invalid EML doc in d1lod test suite
amoeba Jul 8, 2021
8d664dc
Merge remote-tracking branch 'origin/develop' into feature_update_gra…
ThomasThelen Aug 19, 2021
4b78c5c
Create two separate worker deployments that can be individually scaled
ThomasThelen Nov 5, 2021
16c9632
Add a step to the Dockerfile to install d1lod to the image
ThomasThelen Nov 5, 2021
50aa722
Refactor the Scheduler and SlinkyClient interactions to support servi…
ThomasThelen Nov 5, 2021
8e03b92
Add __init__.py to the iso folder to let the python packager know we …
ThomasThelen Nov 5, 2021
3051b99
Refactor the scheduler to always pull an image to avoid using old cac…
ThomasThelen Nov 5, 2021
832c7a2
Change the name of 'redis-main' deployment to just 'redis'.
ThomasThelen Nov 5, 2021
ae271ea
Remove the 'docker' folder since the d1lod image is now being used by…
ThomasThelen Nov 5, 2021
b938f59
Remove helm chart fils and simplify the deployment directory structure
ThomasThelen Nov 5, 2021
913791c
Combine the enable-update feature with the virtuoso image
ThomasThelen Nov 5, 2021
49e9c72
Remove debug flags from the worker deployments
ThomasThelen Nov 5, 2021
11 changes: 0 additions & 11 deletions Makefile

This file was deleted.

61 changes: 14 additions & 47 deletions README.md
@@ -1,18 +1,20 @@
# Slinky, the DataONE Graph Store

## Overview

Service for the DataONE Linked Open Data graph.

This repository contains the deployment and code that makes up the
DataONE graph store.

The main infrastructure of the service is composed of four services and is essentially a background job system ([RQ](https://python-rq.org/)) hooked into an RDF triplestore ([Virtuoso](http://vos.openlinksw.com/owiki/wiki/VOS)) for persistence:

1. `virtuoso`: Acts as the backend graph store
2. `scheduler`: An [APScheduler](https://apscheduler.readthedocs.org) process that schedules jobs (e.g., update graph with new datasets) on the `worker` at specified intervals
3. `worker`: An [RQ](http://python-rq.org/) worker process to run scheduled jobs
4. `redis`: A [Redis](http://redis.io) instance to act as a persistent store for the `worker` and for saving application state

![slinky architecture diagram showing the components in the list above connected with arrows](./docs/slinky-architecture.png)

As the service runs, the graph store will be continuously updated as datasets are added/updated on [DataONE](https://www.dataone.org/). Another scheduled job exports the statements in the graph store and produces a Turtle dump of all statements at [http://dataone.org/d1lod.ttl](http://dataone.org/d1lod.ttl).

@@ -25,66 +27,31 @@ As the service runs, the graph store will be continuously updated as datasets ar
├── docs # Detailed documentation beyond this file
```


## What's in the graph?

For an overview of what concepts the graph contains, see the [mappings](/docs/mappings.md) documentation.

## Deployment

## Deployment Management

### Kubernetes Helm

The entire stack can be deployed using the Kubernetes Helm Chart using
the following command from the `deploy/` directory.

`helm install ./ --generate-name`

To tear the helm stack down, run
The deployment consists of a handful of deployment files and a few
associated services. To bring the deployments and services online, use
`kubectl apply -f <deployment_file>` on each file in the `deploy/deployment/`
and `deploy/service/` directories. It's recommended to start the redis
and virtuoso images _before_ the workers and scheduler.
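The instructions above can be sketched as a short shell sequence. The exact file names are assumptions based on the directory layout described (`deploy/deployment/` and `deploy/service/`), so adjust them to match the repository:

```shell
# Start the stateful services (redis, virtuoso) first, then the rest.
# File names below are assumed from the deploy/ layout, not verified.
for svc in redis virtuoso; do
  kubectl apply -f "deploy/deployment/${svc}-deployment.yaml"
  kubectl apply -f "deploy/service/${svc}-service.yaml"
done

kubectl apply -f deploy/deployment/worker-deployment.yaml
kubectl apply -f deploy/deployment/scheduler-deployment.yaml
```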

`helm uninstall <name_of_the_deployment>`


### As Individual Services & Pods

The stack can also be brought up by invoking the pods and services
individually.

#### Redis
```
kubectl apply -f templates/deployment/redis-deployment.yaml
kubectl apply -f templates/service/redis-service.yaml
```

#### Virtuoso
```
kubectl apply -f templates/deployment/virtuoso-deployment.yaml
kubectl apply -f templates/service/virtuoso-service.yaml
```

#### worker
```
kubectl apply -f templates/deployment/worker-deployment.yaml
```
### Scaling Pods

#### Scheduler
```
kubectl apply -f templates/deployment/scheduler-deployment.yaml
```
The pods should be scaled with `kubectl scale`, shown below.

### Scaling Pods
The pods should be scaled the usual way,
```
kubectl scale --replicas=0 deployments/{pod-name}
```

Note that there should always be at least one replica running for each
pod.

### Accessing Virtuoso on Dev
To access Virtuoso on the development cluster, first connect to the proxy via `kubectl proxy`. Then, the service should be accessible at `http://127.0.0.1:8001/api/v1/namespaces/slinky/services/virtuoso:virtuoso/proxy/`.

### Protecting the Virtuoso SPARQLEndpoint

We don't want open access to the `sparql/` endpoint that Virtuoso
exposes. To protect the endpoint, follow
[this](http://vos.openlinksw.com/owiki/wiki/VOS/VirtSPARQLProtectSQLDigestAuthentication)
57 changes: 0 additions & 57 deletions d1lod/2to3.txt

This file was deleted.

4 changes: 4 additions & 0 deletions d1lod/Dockerfile
@@ -12,6 +12,10 @@ RUN apt-get update && \

WORKDIR /tmp

# Install the d1lod python package
ADD . .
RUN pip3 install .

RUN git clone https://github.com/dajobe/redland-bindings
WORKDIR /tmp/redland-bindings

11 changes: 0 additions & 11 deletions d1lod/Makefile

This file was deleted.

113 changes: 99 additions & 14 deletions d1lod/README.md
@@ -1,24 +1,109 @@
# D1 LOD Python Package
# d1lod

This directory contains the Python package that supports the LOD service.
This directory contains the Python package that supports Slinky.
The package is currently called 'd1lod' but might be renamed in the future.

## Contents
## Status

- d1lod.jobs: Jobs for the D1 LOD Service
- d1lod.util: Helper methods
- d1lod.metadata.*: Methods for extracting information from Science Metadata
- d1lod.people.*: Methods for extracting information about people and organizations from Science Metadata
- d1lod.graph: A light-weight wrapper around the Virtuoso store and its HTTP API for interacting with graphs
- d1lod.interface: A light-weight wrapper around the Virtuoso store and its HTTP API
The codebase is currently being cleaned up and is not as thoroughly tested as the previous codebase.
The code you see in `./d1lod` is the cleaned up code and the old codebase has
been kept at `./d1lod/legacy` for reference.
The tests in `./tests` are a mix of tests against the new code and legacy code.

## Testing
### TODOS

### Pre-requisites:
- [x] Refactor classes from previous Graph+Interface structure
- [x] Create a CLI to easily interact with Slinky
- [ ] Implement more Processors to match more DataONE format IDs
- [ ] Implement Processors to match Mappings
- [ ] Provide easy way to configure connection information (d1client, triplestore, redis)

- Virtuoso Store must be running on 'http://localhost:8000/virtuoso/conductor'
## Architecture

`d1lod` is made of a few key classes, the most important of which is `SlinkyClient`:

- `SlinkyClient`: Entrypoint class that manages a connection to DataONE, a triple store, and Redis for short-term persistence and delayed jobs
- `FilteredCoordinatingNodeClient`: A view into a Coordinating Node that can limit what content appears to be available based on a Solr query, e.g., a CN client that can only see datasets that are part of a specific EML project or in a particular region
- `SparqlTripleStore`: Handles inserting into and querying a generic SPARQL-compliant RDF triplestore via SPARQL queries. Designed to be used with multiple triple stores.
- `Processor`: Set of classes that convert documents of various formats (e.g., XML, JSON-LD) into a set of RDF statements
- `jobs`: Set of routines for use in a background job scheduling system. Currently `rq`.
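To make the `SparqlTripleStore` idea concrete, here is a minimal sketch of how a set of triples might be serialized into a SPARQL `INSERT DATA` update before being POSTed to a store. The function name and argument shapes are illustrative, not the actual d1lod API:

```python
def build_insert(triples, graph=None):
    """Serialize (s, p, o) triples into a SPARQL INSERT DATA update string.

    Terms are expected to already be in SPARQL syntax, e.g. "<http://...>"
    for IRIs or '"text"' for literals.
    """
    body = " .\n".join(f"  {s} {p} {o}" for s, p, o in triples)
    if graph is not None:
        # Scope the insert to a named graph when one is given
        return f"INSERT DATA {{ GRAPH <{graph}> {{\n{body} .\n}} }}"
    return f"INSERT DATA {{\n{body} .\n}}"
```

A generic wrapper like this is what lets the same code target multiple SPARQL-compliant stores, since `INSERT DATA` is part of the SPARQL 1.1 Update standard.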

The `SlinkyClient` is the main entrypoint for everything the package does and handles connections to required services.
See the following diagram to get a sense of the interaction between the classes:

![slinky package architecture](./docs/slinky-client-architecture.png)

## Usage

### Installation

`d1lod` isn't intended for broad usage so it won't end up on https://pypi.org/.
You can install it locally with

```
pip install .
```

### Usage

The most common routines `d1lod` provides can be used from the command line with the `slinky` executable.
After installation, type:

```
slinky --help
```

You can get a Turtle-formatted description of a DataONE dataset with:

```
slinky get doi:10.5063/F1N58JPP
```

## Development

This package is a bit harder to set up than normal Python projects because it uses the [librdf](https://librdf.org/bindings/) (Redland) Python bindings which aren't on PyPI and require manual installation.
You should install the Redland Python bindings as appropriate on your system.
Under macOS, see [our guide](./docs/install-redlands-bindings.md).

The rest of this package's dependencies are installed when you run `pip install .`.
Installing the package in editable mode is recommended:

```
pip install -e .
```

Note: virtualenvs are really nice but I haven't found a way to easily use them with Redland so I tend to develop without them.
Let me know if you have any ideas.

### Testing

Testing is implemented with [pytest](https://pytest.org).

#### Pre-requisites

The test suite is mostly made of integration tests that depend on being able to access an RDF triplestore while the test suite runs.
We've developed and tested with [Virtuoso Open Source](http://vos.openlinksw.com/owiki/wiki/VOS) and other RDF triplestores may require modifications to work correctly.

The pre-requisites for running the test suite are:

1. Virtuoso or a similar RDF triplestore
2. pytest

A quick way to get Virtuoso running is via [Docker](https://www.docker.com):

```
docker run -it -e "DBA_PASSWORD=dba" -p 8890:8890 thomasthelen/virtuoso
```

### Running

From this directory, run:
```
pytest
```

### Guidelines

It's helpful to write down some guidelines to help keep codebases internally consistent.
Note: This section is new and I'm hoping to add things here as we go.

- In `Processor` classes, prefer throwing exceptions over logging and continuing when you encounter an unhandled state. The processors run in a delayed job system and so there's no harm in throwing an unhandled exception and it makes it easy to find holes in processing code.
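A toy sketch of that guideline; the class name and format ID here are hypothetical, not the real `Processor` API. The point is that the processor raises on an unhandled state instead of logging and continuing, so the failure surfaces in the job queue:

```python
class ToyProcessor:
    """Illustrative only: shows the raise-don't-log guideline above."""

    SUPPORTED = {"eml://ecoinformatics.org/eml-2.1.1"}  # hypothetical format ID

    def process(self, format_id, document):
        if format_id not in self.SUPPORTED:
            # Raising (rather than logging and continuing) lets the delayed
            # job system record the failure, making processing gaps visible.
            raise ValueError(f"No processor for format: {format_id}")
        # Stand-in for real RDF statement generation
        return [("subject", "predicate", document)]
```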
18 changes: 0 additions & 18 deletions d1lod/d1lod/__init__.py
@@ -1,18 +0,0 @@
from . import people
from . import dataone
from . import util
from . import validator
from . import metadata
from .graph import Graph
from .interface import Interface

# Set default logging handler to avoid "No handler found" warnings.
import logging
try: # Python 2.7+
from logging import NullHandler
except ImportError:
class NullHandler(logging.Handler):
def emit(self, record):
pass

logging.getLogger(__name__).addHandler(NullHandler())