
Merge feature_update_graph_pattern #43

Merged
78 commits merged on Nov 10, 2021
Commits
e018a08
Remove 2to3.txt from Python 2 to 3 migration
amoeba Apr 13, 2021
d9e67c1
Remove unneeded void.ttl file from d1lod folder
amoeba Apr 13, 2021
05741d9
Update d1lod package readme and setup.py
amoeba Apr 13, 2021
4cfebac
Create initial prototype of web app
amoeba Apr 22, 2021
109bc54
WIP Begin overhaul of classes for Slinky
amoeba Apr 24, 2021
8378a34
Merge branch 'feature_web_app' into feature_update_graph_pattern
amoeba Apr 24, 2021
478b62f
Completely refactor d1lod package
amoeba Apr 30, 2021
46009a7
Remove top-level makefile
amoeba Apr 30, 2021
062c562
Convert web front-end to use a SlinkyClient
amoeba Apr 30, 2021
2fca953
Remove unused variable in SparqlTripleStore.get_ua_string
amoeba Apr 30, 2021
a782ff9
Fix broken URIs in eml_processor
amoeba Apr 30, 2021
e1dd51f
Add note about Exceptions in d1lod readme
amoeba May 1, 2021
85163f6
Wrap up last bit of work for full EML processing
amoeba May 4, 2021
fd6f7cb
Create a Blazegraph connector to replace Virtuoso
amoeba May 4, 2021
ac73868
Add a main method in cli.py so the cli can be debugged
amoeba May 12, 2021
104e190
Create an ISOProcessor class to process ISO docs
amoeba May 12, 2021
777cf1f
Make SlinkyClient's choice of store an argument
amoeba May 12, 2021
04f9407
Add more tests to Blazegraph and SparqlStore
amoeba May 12, 2021
be4d94a
Hook up new classes for easy testing
amoeba May 15, 2021
9783fb6
Create new Virtuoso-specific store model
amoeba May 18, 2021
08cbd7a
Change BlazegraphStore's default port
amoeba May 18, 2021
025933b
Remove unused Exception from client.py
amoeba May 18, 2021
cb2ff80
Do a cleanup pass over the entire test suite
amoeba May 18, 2021
414fce3
Adjust logic for when update_job runs or doesn't
amoeba May 18, 2021
9daeef9
Re-use module-level global in jobs add_dataset_job
amoeba May 18, 2021
65879b9
Add --debug argument to work command in cli
amoeba May 18, 2021
729d615
Make get_new_datasets_since query range-exclusive
amoeba May 18, 2021
776f842
Use response.content instead of response.text
amoeba May 19, 2021
dd233d9
Add blazegraph to d1lod package's docker compose
amoeba May 19, 2021
21e4323
Begin work refactoring setups/environments
amoeba May 19, 2021
d1af30a
Remove unused code from cli.py
amoeba May 26, 2021
aab1c30
Add SPARQL DELETE support to VirtuosoStore
amoeba May 26, 2021
a6ef121
Prevent EMLProcessor from re-inserting identifier blank nodes
amoeba May 26, 2021
ae07392
Make VirtuosoStore's count method support patterns
amoeba May 26, 2021
1c315ed
Capitalize 'select' in VirtuosoStore.all
amoeba May 26, 2021
9b35c2a
Add remaining pieces of VirtuosoStore delete impl
amoeba May 26, 2021
9958cf3
Remove extra trailing slash from VirtuosoStore endpoint
amoeba May 26, 2021
49d78bd
Fix bug in Processor's handling of sysmeta 'obsoletes'
amoeba May 26, 2021
7eee7bd
Add a datatype for isAccessibleForFree triples (boolean)
amoeba May 26, 2021
2a8586b
Fix bug in schema:byteSize routine
amoeba May 26, 2021
c3eadc4
Guard against unset accessPolicy in Processor
amoeba May 26, 2021
7f64dbe
Add .strip() calls to all ElementTree .text calls
amoeba May 26, 2021
c954903
Remove unused code from ISOProcessor
amoeba May 26, 2021
1d43a69
Fix logic bug in handling datePublished
amoeba May 26, 2021
ee1907c
Add schema:distribution triples
amoeba May 28, 2021
9eaec1e
Add insert, insertall, clear, and count commands to CLI
amoeba May 28, 2021
6c60a8f
Add architecture diagram to readme
amoeba May 28, 2021
da28ab4
Finish up support for semantic annotations
amoeba Jun 3, 2021
dffb273
Finish spdx:Checksum support
amoeba Jun 3, 2021
ee4e1a6
Tweak style of slinky-architecture diagram a tad
amoeba Jun 3, 2021
fa74fa5
Move lookup* functions around in eml_processor
amoeba Jun 4, 2021
c154540
Clean up whitespace in readme
amoeba Jun 4, 2021
7482b19
Add in support for SOSO PropertyValue model for attributes
amoeba Jun 4, 2021
6d15143
Rename variable in test_eml220_processor
amoeba Jun 4, 2021
d4ac71c
Add count and format options to cli's get method
amoeba Jun 4, 2021
30e06a6
Remove unused pagination code in filtered_d1_client
amoeba Jun 5, 2021
aeec699
Finish up implementation of EML attributes
amoeba Jun 5, 2021
471dda8
Remove RQ Dashboard from compose file
amoeba Jun 5, 2021
8f1217d
Change update schedule from 5min to 1min
amoeba Jun 5, 2021
59ca2cb
Remove test for double-processing
amoeba Jun 5, 2021
91aa958
Re-organize code between client and jobs module
amoeba Jun 5, 2021
d89cf23
Add start of test suite for client
amoeba Jun 5, 2021
aa24cf2
Fix broken imports from previous refactor
amoeba Jun 5, 2021
3928b6d
Fix bug in FilteredCoordinatingNodeClient logic
amoeba Jun 5, 2021
81f81c3
Fix test regressions in for FilteredD1Client
amoeba Jun 23, 2021
72b00fa
Switch d1lod test suite's docker-compose to use official VOS image
amoeba Jun 23, 2021
7ba9b9c
Fix invalid EML doc in d1lod test suite
amoeba Jul 8, 2021
8d664dc
Merge remote-tracking branch 'origin/develop' into feature_update_gra…
ThomasThelen Aug 19, 2021
4b78c5c
Create two separate worker deployments that can be individually scaled
ThomasThelen Nov 5, 2021
16c9632
Add a step to the Dockerfile to install d1lod to the image
ThomasThelen Nov 5, 2021
50aa722
Refactor the Scheduler and SlinkyClient interactions to support servi…
ThomasThelen Nov 5, 2021
8e03b92
Add __init__.py to the iso folder to let the python packager know we …
ThomasThelen Nov 5, 2021
3051b99
Refactor the scheduler to always pull an image to avoid using old cac…
ThomasThelen Nov 5, 2021
832c7a2
Change the name of 'redis-main' deployment to just 'redis'.
ThomasThelen Nov 5, 2021
ae271ea
Remove the 'docker' folder since the d1lod image is now being used by…
ThomasThelen Nov 5, 2021
b938f59
Remove helm chart fils and simplify the deployment directory structure
ThomasThelen Nov 5, 2021
913791c
Combine the enable-update feature with the virtuoso image
ThomasThelen Nov 5, 2021
49e9c72
Remove debug flags from the worker deployments
ThomasThelen Nov 5, 2021
11 changes: 0 additions & 11 deletions Makefile

This file was deleted.

61 changes: 14 additions & 47 deletions README.md
@@ -1,18 +1,20 @@
# Slinky, the DataONE Graph Store

## Overview

Service for the DataONE Linked Open Data graph.

This repository contains the deployment and code that makes up the
DataONE graph store.

The main infrastructure of the service is composed of four services and is essentially a background job system ([RQ](https://python-rq.org/)) hooked into an RDF triplestore ([Virtuoso](http://vos.openlinksw.com/owiki/wiki/VOS)) for persistence:

1. `virtuoso`: Acts as the backend graph store
2. `scheduler`: An [APScheduler](https://apscheduler.readthedocs.org) process that schedules jobs (e.g., update graph with new datasets) on the `worker` at specified intervals
3. `worker`: An [RQ](http://python-rq.org/) worker process to run scheduled jobs
4. `redis`: A [Redis](http://redis.io) instance to act as a persistent store for the `worker` and for saving application state

![slinky architecture diagram showing the components in the list above connected with arrows](./docs/slinky-architecture.png)

As the service runs, the graph store will be continuously updated as datasets are added/updated on [DataONE](https://www.dataone.org/). Another scheduled job exports the statements in the graph store and produces a Turtle dump of all statements at [http://dataone.org/d1lod.ttl](http://dataone.org/d1lod.ttl).

@@ -25,66 +27,31 @@ As the service runs, the graph store will be continuously updated as datasets ar
├── docs # Detailed documentation beyond this file
```


## What's in the graph?

For an overview of what concepts the graph contains, see the [mappings](/docs/mappings.md) documentation.

## Deployment

## Deployment Management

### Kubernetes Helm

The entire stack can be deployed using the Kubernetes Helm Chart using
the following command from the `deploy/` directory.

`helm install ./ --generate-name`

To tear the helm stack down, run
The deployment consists of a handful of deployment files and a few
associated services. To bring the deployments and services online, use
`kubectl apply -f <deployment_file>` on each file in the `deploy/deployment/`
and `deploy/service/` directories. It's recommended to start the redis
and virtuoso images _before_ the workers and scheduler.
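The instructions above can be sketched as a short shell sequence. The exact file names are assumptions based on the directory layout described (`deploy/deployment/` and `deploy/service/`), so adjust them to match the repository:

```shell
# Start the stateful services (redis, virtuoso) first, then the rest.
# File names below are assumed from the deploy/ layout, not verified.
for svc in redis virtuoso; do
  kubectl apply -f "deploy/deployment/${svc}-deployment.yaml"
  kubectl apply -f "deploy/service/${svc}-service.yaml"
done

kubectl apply -f deploy/deployment/worker-deployment.yaml
kubectl apply -f deploy/deployment/scheduler-deployment.yaml
```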

`helm uninstall <name_of_the_deployment>`


### As Individual Services & Pods

The stack can also be brought up by invoking the pods and services
individually.

#### Redis
```
kubectl apply -f templates/deployment/redis-deployment.yaml
kubectl apply -f templates/service/redis-service.yaml
```

#### Virtuoso
```
kubectl apply -f templates/deployment/virtuoso-deployment.yaml
kubectl apply -f templates/service/virtuoso-service.yaml
```

#### worker
```
kubectl apply -f templates/deployment/worker-deployment.yaml
```
### Scaling Pods

#### Scheduler
```
kubectl apply -f templates/deployment/scheduler-deployment.yaml
```
The pods should be scaled with `kubectl scale`, shown below.

### Scaling Pods
The pods should be scaled the usual way,
```
kubectl scale --replicas=0 deployments/{pod-name}
```

Note that there should always be at least one replica running for each
pod.

### Accessing Virtuoso on Dev
To access Virtuoso on the development cluster, first connect to the proxy via `kubectl proxy`. Then, the service should be accessible at `http://127.0.0.1:8001/api/v1/namespaces/slinky/services/virtuoso:virtuoso/proxy/`.

### Protecting the Virtuoso SPARQLEndpoint

We don't want open access to the `sparql/` endpoint that Virtuoso
exposes. To protect the endpoint, follow
[this](http://vos.openlinksw.com/owiki/wiki/VOS/VirtSPARQLProtectSQLDigestAuthentication)
57 changes: 0 additions & 57 deletions d1lod/2to3.txt

This file was deleted.

4 changes: 4 additions & 0 deletions d1lod/Dockerfile
@@ -12,6 +12,10 @@ RUN apt-get update && \

WORKDIR /tmp

# Install the d1lod python package
ADD . .
RUN pip3 install .

RUN git clone https://github.com/dajobe/redland-bindings
WORKDIR /tmp/redland-bindings

11 changes: 0 additions & 11 deletions d1lod/Makefile

This file was deleted.

113 changes: 99 additions & 14 deletions d1lod/README.md
@@ -1,24 +1,109 @@
# D1 LOD Python Package
# d1lod

This directory contains the Python package that supports the LOD service.
This directory contains the Python package that supports Slinky.
The package is currently called 'd1lod' but might be renamed in the future.

## Contents
## Status

- d1lod.jobs: Jobs for the D1 LOD Service
- d1lod.util: Helper methods
- d1lod.metadata.*: Methods for extracting information from Science Metadata
- d1lod.people.*: Methods for extracting information about people and organizations from Science Metadata
- d1lod.graph: A light-weight wrapper around the Virtuoso store and its HTTP API for interacting with graphs
- d1lod.interface: A light-weight wrapper around the Virtuoso store and its HTTP API
The codebase is currently being cleaned up and is not as thoroughly tested as the previous codebase.
The code you see in `./d1lod` is the cleaned up code and the old codebase has
been kept at `./d1lod/legacy` for reference.
The tests in `./tests` are a mix of tests against the new code and legacy code.

## Testing
### TODOS

### Pre-requisites:
- [x] Refactor classes from previous Graph+Interface structure
- [x] Create a CLI to easily interact with Slinky
- [ ] Implement more Processors to match more DataONE format IDs
- [ ] Implement Processors to match Mappings
- [ ] Provide easy way to configure connection information (d1client, triplestore, redis)

- Virtuoso Store must be running on 'http://localhost:8000/virtuoso/conductor'
## Architecture

`d1lod` is made of a few key classes, the most important of which is `SlinkyClient`:

- `SlinkyClient`: Entrypoint class that manages a connection to DataONE, a triple store, and Redis for short-term persistence and delayed jobs
- `FilteredCoordinatingNodeClient`: A view into a Coordinating Node that can limit what content appears to be available based on a Solr query, e.g., a CN client that can only see datasets that are part of a specific EML project or in a particular region
- `SparqlTripleStore`: Handles inserting into and querying a generic SPARQL-compliant RDF triplestore via SPARQL queries. Designed to be used with multiple triple stores.
- `Processor`: Set of classes that convert documents of various formats (e.g., XML, JSON-LD) into a set of RDF statements
- `jobs`: Set of routines for use in a background job scheduling system. Currently `rq`.
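To make the `SparqlTripleStore` idea concrete, here is a minimal sketch of how a set of triples might be serialized into a SPARQL `INSERT DATA` update before being POSTed to a store. The function name and argument shapes are illustrative, not the actual d1lod API:

```python
def build_insert(triples, graph=None):
    """Serialize (s, p, o) triples into a SPARQL INSERT DATA update string.

    Terms are expected to already be in SPARQL syntax, e.g. "<http://...>"
    for IRIs or '"text"' for literals.
    """
    body = " .\n".join(f"  {s} {p} {o}" for s, p, o in triples)
    if graph is not None:
        # Scope the insert to a named graph when one is given
        return f"INSERT DATA {{ GRAPH <{graph}> {{\n{body} .\n}} }}"
    return f"INSERT DATA {{\n{body} .\n}}"
```

A generic wrapper like this is what lets the same code target multiple SPARQL-compliant stores, since `INSERT DATA` is part of the SPARQL 1.1 Update standard.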

The `SlinkyClient` is the main entrypoint for everything the package does and handles connections to required services.
See the following diagram to get a sense of the interaction between the classes:

![slinky package architecture](./docs/slinky-client-architecture.png)

## Usage

### Installation

`d1lod` isn't intended for broad usage so it won't end up on https://pypi.org/.
You can install it locally with

```
pip install .
```

### Usage

The most common routines `d1lod` provides can be used from the command line with the `slinky` executable.
After installation, type:

```
slinky --help
```

You can get a Turtle-formatted description of a DataONE dataset with:

```
slinky get doi:10.5063/F1N58JPP
```

## Development

This package is a bit harder to set up than normal Python projects because it uses the [librdf](https://librdf.org/bindings/) (Redland) Python bindings which aren't on PyPI and require manual installation.
You should install the Redland Python bindings as appropriate on your system.
Under macOS, see [our guide](./docs/install-redlands-bindings.md).

The rest of this package's dependencies are installed when you run `pip install .`.
Installing the package in editable mode is recommended:

```
pip install -e .
```

Note: virtualenvs are really nice but I haven't found a way to easily use them with Redland so I tend to develop without them.
Let me know if you have any ideas.

### Testing

Testing is implemented with [pytest](https://pytest.org).

#### Pre-requisites

The test suite is mostly made of integration tests that depend on being able to access an RDF triplestore while the test suite runs.
We've developed and tested with [Virtuoso Open Source](http://vos.openlinksw.com/owiki/wiki/VOS) and other RDF triplestores may require modifications to work correctly.

The pre-requisites for running the test suite are:

1. Virtuoso or a similar RDF triplestore
2. pytest

A quick way to get Virtuoso running is via [Docker](https://www.docker.com):

```
docker run -it -e "DBA_PASSWORD=dba" -p 8890:8890 thomasthelen/virtuoso
```

### Running

From this directory, run:
```
pytest
```

### Guidelines

It's helpful to write down some guidelines to help keep codebases internally consistent.
Note: This section is new and I'm hoping to add things here as we go.

- In `Processor` classes, prefer throwing exceptions over logging and continuing when you encounter an unhandled state. The processors run in a delayed job system and so there's no harm in throwing an unhandled exception and it makes it easy to find holes in processing code.
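A toy sketch of that guideline; the class name and format ID here are hypothetical, not the real `Processor` API. The point is that the processor raises on an unhandled state instead of logging and continuing, so the failure surfaces in the job queue:

```python
class ToyProcessor:
    """Illustrative only: shows the raise-don't-log guideline above."""

    SUPPORTED = {"eml://ecoinformatics.org/eml-2.1.1"}  # hypothetical format ID

    def process(self, format_id, document):
        if format_id not in self.SUPPORTED:
            # Raising (rather than logging and continuing) lets the delayed
            # job system record the failure, making processing gaps visible.
            raise ValueError(f"No processor for format: {format_id}")
        # Stand-in for real RDF statement generation
        return [("subject", "predicate", document)]
```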
18 changes: 0 additions & 18 deletions d1lod/d1lod/__init__.py
@@ -1,18 +0,0 @@
from . import people
from . import dataone
from . import util
from . import validator
from . import metadata
from .graph import Graph
from .interface import Interface

# Set default logging handler to avoid "No handler found" warnings.
import logging
try: # Python 2.7+
from logging import NullHandler
except ImportError:
class NullHandler(logging.Handler):
def emit(self, record):
pass

logging.getLogger(__name__).addHandler(NullHandler())