Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature deployment ordering & Environmental Variables #54

Merged
merged 20 commits into from
Feb 24, 2022

Conversation

ThomasThelen
Copy link
Member

@ThomasThelen ThomasThelen commented Nov 18, 2021

Fixes #53
Fixes #51

Deployment Startup Order

This PR should fix any issues around startup order, meaning that we can start the scheduler, workers, database, etc in any order and the services that need particular services to come online will wait. What this looks like is the scheduler and workers both waiting for redis to come online before executing any job logic. The workers also wait for Virtuoso to come online once a connection to Redis is established to prevent any premature SPARQL inserts.

This allows us to decouple startup order from the kubernetes deployment, resulting in a few changes to the deployment files. I left the redis and virtuoso readiness probes in. I don't think there's much harm in letting other containers or developers know that they're not ready for interaction (note that the pods are in the Ready state while the code is waiting on redis).

The deployment order is invariant with respect to readiness probes. The initContainer on the scheduler however, does dictate startup behavior (it won't run the scheduler ENTRYPOINT until redis comes online). Since this logic is now in the scheduler itself, I removed the initContainer.

Environmental Variables

This PR also has the addition of a few environmental variables for specifying hosts and ports. These are stored in the settings and was done in ea6feb4.

Production/Development/CLI Environments

The ENVIRONMENTS dict was partially deprecated in the changes above and is fully removed in 3ff13e3. The SlinkyClient constructor default parameters now come from the environmental variables. To match this change, the calls to SlinkyClient in the cli were refactored.

Changes to Testing

I think that the ideal testing workflow is achieved through setting up pipenv before running pytest to isolate all of the dependancies. The RDF module requirement forces us to break this model because of the manual building & installing that needs to happen for its dependencies. This PR includes an additional container running the d1lod image alongside the graph store in the docker-compose file. This container has all of the test dependencies and can run the unit tests. The pain point is that you have to docker exec in to run them. I could just call pytest in the entrypoint - but the container will exit before the test logs can be checked.

It's not a perfect system, but it at least provides a reliable way to run the tests on any system and I'm open to other options/routes. There are some changes in the unit test fixtures that change the default blazegraph and virtuoso endpoints to the docker service name rather than localhost.

Testing

Testing for this mostly revolved around deploying things at different orders and making sure that the job system still functioned.

Scheduler, Redis, & Worker Tests

I did some manual testing, bringing the three services up in different orders which are bulleted below. In each case I confirmed with the logs that services waited for redis to come online. Because the retry timeout is set for 30 seconds, give the tests at least a minute before the thumbs up/down.

  • Scheduler, Redis, Workers
  • Scheduler, Workers, Redis
  • Workers, Scheduelr, Redis
  • Workers, Redis, Scheduler
  • Redis, Scheduler, Workers
  • Redis, Workers, Scheduler

Virtuoso Startup Tests

  • Workers, Virtuoso
  • Virtuoso, Workers

Remaining Testing

  • Docker compose
  • Local tests
  • CLI test

@ThomasThelen ThomasThelen changed the title [WIP] Feature deployment ordering Feature deployment ordering & Environmental Variables Nov 18, 2021
@amoeba
Copy link
Contributor

amoeba commented Dec 2, 2021

Thanks @ThomasThelen! I'll have a look this week.

Copy link
Contributor

@amoeba amoeba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good @ThomasThelen, thanks. The wait/sleep approach was exactly what I was thinking would work well here.

I made a few comments you can take or leave.

Re: Testing, I don't think it's ideal to have to run the tests in a container. But do they all pass there? I ran pytest locally and got just two failures:

================================================= short test summary info =================================================
FAILED tests/test_client.py::test_client - rq.connections.NoRedisConnectionException: Could not resolve a Redis connection
FAILED tests/test_eml220_processor.py::test_production_eml - httpx.ConnectError: [Errno 8] nodename nor servname provide...
============================================== 2 failed, 65 passed in 9.29s ===============================================

The errors just look like missing or wrong hostnames so maybe some minor tweaking would get things passing?

@@ -19,7 +19,10 @@ spec:
image: slinkythegraph/slinky
imagePullPolicy: Always
command: ["/bin/sh","-c"]
args: ["slinky schedule --prod; rqscheduler --host redis -i 10"]
args: ["slinky schedule; rqscheduler --host redis -i 10"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to use the value in REDIS_HOST or just not specify it all and rely on the environmental variables we're already setting?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch-done in c921777.

deploy/deployment/worker-dataset-deployment.yaml Outdated Show resolved Hide resolved
@@ -20,7 +20,10 @@ spec:
image: slinkythegraph/slinky
imagePullPolicy: Always
command: ["slinky"]
args: ["work", "default", "--prod"]
args: ["work", "default", "--debug"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here as above in the worker-dataset deployment

curl

ADD enable_update.sh /enable_update.sh
ENV DBA_PASSWORD "dba"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is outside of the scope of this review I think but having the password hardcoded here seems like a future problem. Do you have any good ideas for handling this secret? Maybe filing an issue is a good place to start unless there's already one.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It definitely leaves room for errors during deployment. The problem is that the secret needs to be shared across developers that are building this image-which means some sort of overarching system where the build takes place. For the moment I think Keybase works well enough-this image should be built rarely anyways.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that the secret needs to be shared across developers that are building this image-which means some sort of overarching system where the build takes place.

Does it? For example, the Virtuoso image has a default password and lets you override it during container creation. i.e., customization of the password isn't baked into the image. I think the change here is just removing the ENV statement from both Virtuoso Dockerfiles and injecting it at runtime from config. I imagine we're already doing that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like I forgot to remote the initContainer folder, done here aef613b
I also removed the virtuoso password from the dockerfile (which is overridden by the one in the k8 deployment file and docker-compose). Done in 65b3f33

@ThomasThelen
Copy link
Member Author

Running the tests in a container is kludgy-but I don't see any other alternative for anyone that can't get the RDF package installed locally. I think that it should be runnable in both the container and outside of the container, given that the REDIS_HOST, REDIS_PORT, VIRTUOSO_HOST, and VIRTUOSO_PORT environmental variables are set to the right locations.

From the docker image,
Screen Shot 2021-12-08 at 1 04 25 PM

@ThomasThelen
Copy link
Member Author

Actually I take that back. d1lod/tests/conftest.py has hardcoded service names that need to get changed to env vars

@amoeba
Copy link
Contributor

amoeba commented Dec 8, 2021

Right, if you can't get the deps installed you can't run the tests. That's fine. To your last statement, I'd really like to be able to run the tests without having to set environmental variables. It makes me think we're missing a default or two somewhere. Is that what you're having to do?

@ThomasThelen
Copy link
Member Author

The defaults are set to production values, which are service names. I left the redis default to None, which is the local graph store though.

One problem I'm seeing is in d1lod/tests/conftest.py with hardcoded service names. The issue is that we run two graphs in tests, but only have one environmental variable for the graph host (and if we want to run these on localhost we want to say the the host is on localhost not at http://blazegraph). I can add a second environmental variable, BLAZEGRAPH_HOST - but then we have unittest requirements trickling up into the deployment logic.

@amoeba
Copy link
Contributor

amoeba commented Dec 8, 2021

Gotcha. Thanks for explaining.

I think it generally works better and follows norms if defaults are for local development. Reasoning being that local development is the same for everyone and its production that varies. Also makes it faster to get started and easier to do various tasks if development setup steps are minimal (e.g., I open up ~6 separate shells while developing and setting global env vars or per-shell env vars is just more work).

Re: Helping the tests pick the right graph store: It doesn't feel like we'd need another env var so I agree with you there. Both Stores could use the same env var, have appropriate sensible defaults, and have their tests could manage getting them the environment variables they need (which I think should be zero).

@ThomasThelen
Copy link
Member Author

Re: Helping the tests pick the right graph store: It doesn't feel like we'd need another env var so I agree with you there. Both Stores could use the same env var, have appropriate sensible defaults, and have their tests could manage getting them the environment variables they need (which I think should be zero).

The issue is that both stores need their own environmental variable for tests. For production we only run one graph, so we have the single environmental variable. This causes problems in testing because we run two graphs-and since the second graph can be either at localhost:1234 or http://blazegraph:1234 we need a way to specify. Since they both run under the same instance, they can't share the same address. I can set the default to localhost so that it works for local tests, but then dockerized tests will fail.

@ThomasThelen
Copy link
Member Author

In ebf2c1f I changed the default network location in settings.py to what I think they should be for running things locally. In the docker compose and kubernetes deployment files, they're set to the networked version (http://virtuoso, http://blazegraph, etc).

@ThomasThelen
Copy link
Member Author

@amoeba Commit e3a8e62 should remove the failing unit test that we encountered earlier with the Graph class.

@amoeba
Copy link
Contributor

amoeba commented Dec 15, 2021

Great, thanks @ThomasThelen. Do you want me review again?

@ThomasThelen
Copy link
Member Author

Great, thanks @ThomasThelen. Do you want me review again?

I'd appreciate it if you would.

@amoeba
Copy link
Contributor

amoeba commented Dec 15, 2021

I ran into two things:

  • pytest fails with on failure: tests/test_client.py::test_client (rq.connections.NoRedisConnectionException: Could not resolve a Redis connection). Do you see all the tests passing on your end? I'm running pytest directly and not containerized.
  • When virtuoso and redis are not running, slinky get fails with connection errors. This is technically a regression but it's not a total deal-breaker, I found it handy to have slinky get not depend on an external services since Docker and such can take so long to start up and I'm constantly stopping it. This could be addressed after we merge unless you see an easy way to enable it that still fits within the spirit of this PR.

@ThomasThelen
Copy link
Member Author

  1. I have all the tests working while containerized-I'll try and take a look to see what's not working locally.

  2. I can adda flag to the SlinkyClient to use the in-memory store rather than a running service. This can be used by the cli get method to use the local store.

@ThomasThelen
Copy link
Member Author

ThomasThelen commented Dec 26, 2021

The redis issue should be resolved in e6a62be

The default store for the cli is now LocalStore, but can be overridden to use the environmental variables with --local False.
I ran across a few issues while testing with the LocalStore; I think(I cant test previous versions without the dockerization) that these were pre-existing conditions; I can file issues around each of these if they're in fact bugs that .

To reproduce,

  1. Clone & checkout
  2. Build the d1lod image
  3. Modify the docker-compose file to use the local image build in step 2
  4. run docker-compose up
  5. docker exec into the image
  6. Run the commands below
  7. Run them again with --local false ie slinky insert --local false xyz and see them work

slinky insert urn:uuid:0910e8cf-9a67-43e1-b257-ebd4e3d66bca

root@ce762e064de4:/tmp# slinky insert urn:uuid:0910e8cf-9a67-43e1-b257-ebd4e3d66bca
Traceback (most recent call last):
  File "/usr/local/bin/slinky", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/d1lod/cli.py", line 221, in insert
    client.process_dataset(id)
  File "/usr/local/lib/python3.9/dist-packages/d1lod/client.py", line 57, in process_dataset
    self.store.insert_model(model)
TypeError: insert_model() missing 1 required positional argument: 'model'

slinky clear

root@ce762e064de4:/tmp# slinky clear
Traceback (most recent call last):
  File "/usr/local/bin/slinky", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/d1lod/cli.py", line 142, in clear
    old_size = client.store.count()
AttributeError: type object 'LocalStore' has no attribute 'count'

slinky count will throw the same

Redis<ConnectionPool<Connection<host=redis,port=6379,db=0>>>
Traceback (most recent call last):
  File "/usr/local/bin/slinky", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/d1lod/cli.py", line 159, in count
    print(client.store.count())
AttributeError: type object 'LocalStore' has no attribute 'count'

slinky insertall (scroll right)

root@ce762e064de4:/tmp# slinky insertall
  0%|                                                                                                                                                                                                                 | 0/130 [00:00<?, ?it/s]Failed to insert doi:10.5063/F16971VP due to query() missing 1 required positional argument: 'query_text'
  1%|█▌                                                                                                                                                                                                       | 1/130 [00:00<00:54,  2.38it/s]Failed to insert doi:10.5063/F1XS5SP4 due to query() missing 1 required positional argument: 'query_text'
  2%|███                                                                                                                                                                                                      | 2/130 [00:00<00:54,  2.36it/s]Failed to insert doi:10.5063/F1T151XN due to query() missing 1 required positional argument: 'query_text'
  2%|████▋                                                                                                                                                                                                    | 3/130 [00:01<00:58,  2.15it/s]Failed to insert urn:uuid:230592c4-cf17-46e7-834b-26fd32731f56 due to query() missing 1 required positional argument: 'query_text'
  3%|██████▏                                                                                                                                                                                                  | 4/130 [00:01<00:47,  2.63it/s]Failed to insert doi:10.5063/F1R20ZMM due to query() missing 1 required positional argument: 'query_text'
  4%|███████▋                                                                                                                                                                                                 | 5/130 [00:02<01:03,  1.98it/s]Failed to insert doi:10.5063/F1BV7DWF due to query() missing 1 required positional argument: 'query_text'
  5%|█████████▎                                                                                                                                                                                               | 6/130 [00:02<00:59,  2.10it/s]Failed to insert doi:10.5063/F1736P5F due to query() missing 1 required positional argument: 'query_text'
  5%|██████████▊                                                                                                                                                                                              | 7/130 [00:03<01:04,  1.92it/s]Failed to insert doi:10.5063/F13B5XD8 due to query() missing 1 required positional argument: 'query_text'
  6%|████████████▎                                                                                                                                                                                            | 8/130 [00:03<01:04,  1.90it/s]Failed to insert doi:10.5063/F1ZK5DXZ due to query() missing 1 required positional argument: 'query_text'

@amoeba
Copy link
Contributor

amoeba commented Jan 4, 2022

Thanks for the continued work on this @ThomasThelen.

For commands like clear, count, insert, and insertall, using a the LocalStore doesn't really make sense to me since LocalStore is ephemeral (i.e., thrown away when the command exits). Their intended target is going to be some persistent triple store. On the other hand, get only makes sense with a LocalStore because it's generating triples on the fly.

If the above commands all produce stack traces without --local false, I think we're making the CLI harder to use. Do you think the flag is necessary or could the code just follow the logic in my first paragraph here? Is the explicitness of the flag better? If you do think a flag is necessary, maybe it could be --remote instead of --local false since I think the former pattern is more common for boolean flags.

PS: I made one PR comment that might be a culprit. Does that change any of the above output?

@ThomasThelen
Copy link
Member Author

ThomasThelen commented Jan 18, 2022

Alright with 57dd95e, using get will always use the local store. All other commands no longer have the option for using LocalStore. If there's still issues, maybe a complete documentation of what each command is and how it's supposed to be used will help.

Tested using docker-compose and deployed to dev.

@amoeba
Copy link
Contributor

amoeba commented Feb 24, 2022

Thanks @ThomasThelen. I tested this out and I had two make two changes, d6b8067 and 7033442 to make the tests pass locally. They look minor and it doesn't look like they'll affect anything but local development.

@amoeba amoeba merged commit 243a823 into develop Feb 24, 2022
This was referenced Feb 24, 2022
@ThomasThelen ThomasThelen mentioned this pull request Apr 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants