Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Configuration behaviour is confusing and poorly documented. #1258

Open
SpacemanPaul opened this issue May 19, 2022 · 20 comments
Open

Configuration behaviour is confusing and poorly documented. #1258

SpacemanPaul opened this issue May 19, 2022 · 20 comments
Assignees
Labels
documentation pinned v1.9 Target for version 1.9

Comments

@SpacemanPaul
Copy link
Contributor

SpacemanPaul commented May 19, 2022

Expected behaviour

The behaviour of the $DATACUBE_DB_URL environment variable is clearly documented.

Actual behaviour

The $DATACUBE_DB_URL environment variable is not mentioned at all in the docs.

@stale
Copy link

stale bot commented Sep 20, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Sep 20, 2022
@omad omad removed the wontfix label Oct 31, 2022
@omad
Copy link
Member

omad commented Nov 2, 2022

Lets expand the scope of this to cover documenting how ODC finds it's configuration and determines which Database to connect to.

The only existing documentation is Installing & Managing -> Configuring -> Database Setup -> Create Configuration File

I would expect to find information in the Data Access ... -> Connecting to ODC, but this page has a non-link to Setting up your environment. The link needs fixing, if that page even exists.

I think this should also be documented in the API docs for datacube.Datacube.

@omad
Copy link
Member

omad commented Nov 2, 2022

The current behaviour is complicated and confusing, involving many possible environment variables and configuration files, on top of flags to CLI applications and arguments passed to datacube.Datacube() when using the Python API.

The precedence of configuration files is in https://github.com/opendatacube/datacube-core/blob/develop/datacube/config.py#L19-L30 , and the logic in the rest of config.py determines how they're used.

Configuration file precedence

  1. /etc/datacube.conf
  2. File referenced by the $DATACUBE_CONFIG_PATH environment variable.
  3. ~/.datacube.conf
  4. datacube.conf in the current working directory ($PWD)

Environment Variables
DATACUBE_CONFIG_PATH
DATACUBE_ENVIRONMENT
DATACUBE_IAM_AUTHENTICATION
DATACUBE_DB_URL
DATACUBE_IAM_TIMEOUT
DB_HOSTNAME, DB_USERNAME, DB_PASSWORD, DB_DATABASE

These are on top of all the PostgreSQL configuration options, including ~/.pgpass, PGUSER, PGDATABASE, PGHOST, PGPORT, PGPASSWORD

@omad
Copy link
Member

omad commented Nov 2, 2022

I've found an "interesting" undocumented feature lurking in the ODC configuration code. If you run python -m datacube (or call the function datacube.config.auto_config(), a new configuration file will be written to ~/.datacube.conf based upon current environment variables. 🤯

There are a bunch of tests on how the configuration settings work in tests/test_config.py, but they don't include enough commentary to be useful as documentation.

@omad
Copy link
Member

omad commented Nov 2, 2022

The documentation for the configuration file needs improving too. The canonical source in the docs seems to be Database Setup -> Create Configuration File.

It is mostly made up of a commented example configuration file, but doesn't mention db_port, or any of the AWS specific authentication options which are supported. The section of code responsible appears to be datacube/drivers/postgres/_connections.py

@omad
Copy link
Member

omad commented Nov 2, 2022

Oh my, there's more complexity in choosing the "Datacube Environment" to use.

The priority for choosing the environment is (to the best of my understanding):

  1. The env= value passed as an argument when creating Datacube().
  2. The environment variable DATACUBE_ENVIRONMENT
  3. An undocumented value stored in the configuration file/s.
[user]
default_environment=value
  1. default
  2. datacube

(See LocalConfig.__init__() in datacube/config.py)

Also: I don't think "Datacube Environment" is defined clearly anywhere.

@omad
Copy link
Member

omad commented Nov 2, 2022

Okay, after thinking I understood this now, and tearing my hair out.... there's more.

The above logic is only called when creating a datacube.Datacube() in code.

If you're using the datacube CLI tools, they use completely different logic, see datacube.config.LocalConfig.find().

It respects the environment variables DATACUBE_DB_URL, DB_[USERNAME|HOSTNAME|PORT|DATABASE|PASSWORD], but doesn't support DATACUBE_CONFIG_PATH or DATACUBE_ENVIRONMENT!! Those must be passed as command line flags to the datacube command.

@SpacemanPaul
Copy link
Contributor Author

Oh wow, I've been through this code fairly recently but didn't spot all of these gotchas.

@SpacemanPaul
Copy link
Contributor Author

See also @Kirill888 's comments on #1329 re: using postgres with conda.

@stale
Copy link

stale bot commented Mar 18, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@SpacemanPaul
Copy link
Contributor Author

Configuration file precedence

  1. /etc/datacube.conf
  2. File referenced by the $DATACUBE_CONFIG_PATH environment variable.
  3. ~/.datacube.conf
  4. datacube.conf in the current working directory ($PWD)

Note this list is in reverse precedence order - the ones lower in the list override the one's higher in the list. This means that the priority of the environment variable is surprisingly low - should really be the highest precedence.

@Kirill888
Copy link
Member

the ones lower in the list override the one's higher in the list.

@SpacemanPaul can you confirm what granularity of the override is?

  • Whole file
  • Environment section
  • Individual settings

Is it fusing multiple configs "somehow" or just picks last found one?

@Kirill888
Copy link
Member

Also I'm guessing that $DATACUBE_CONFIG_PATH is really meant to be:

"alternative to global config that is usually in /etc/datacube.conf, but we can't have that on something like NCI, so we add environment variable for that."

@SpacemanPaul
Copy link
Contributor Author

the ones lower in the list override the one's higher in the list.

@SpacemanPaul can you confirm what granularity of the override is?

* Whole file

* Environment section

* Individual settings

Is it fusing multiple configs "somehow" or just picks last found one?

I'm just looking starting to look at that now - but on first glance, yes, it looks like it might be fusing them (and there's even yet another base default config not mentioned in this list that is defined statically in the source code). Immediate progress will be slow because I'm about to go on leave but cleaning this up for 1.9 is my next priority - I'll clean up the 1.8 documentation at the same time.

@SpacemanPaul
Copy link
Contributor Author

The above logic is only called when creating a datacube.Datacube() in code.

If you're using the datacube CLI tools, they use completely different logic, see datacube.config.LocalConfig.find().

Looks to me like Datacube() calls LocalConfig.find as well - see the normalise_config function in Datacube.__init__()?

@SpacemanPaul
Copy link
Contributor Author

SpacemanPaul commented Apr 13, 2023

Then there's datacube.ui.task_app which contains a bunch of utilities and config options that do not appear to be actually used anywhere in core, or in odc-tools. It's not clear to me what other repos might be using these functions but gee it would be nice if I could just delete the whole file.

I think it's all tied up with the executer/worker framework which iirc we had already said we were going to remove. (We already partly removed celery support).

@SpacemanPaul
Copy link
Contributor Author

Honestly, the behaviour in 1.8 is so complex, conditional and inconsistent, I'm not sure it's even possible to write complete, accurate and comprehensible documentation.

@woodcockr
Copy link
Member

This threads sums up nicely our "understanding" - we use everything but the IAM stuff at the moment, but that's because it's completely undocumented and I've not been brave enough to dig into the rabbit hole....

@omad
Copy link
Member

omad commented Apr 21, 2023

@SpacemanPaul
Copy link
Contributor Author

SpacemanPaul commented Apr 24, 2023

EP-10 ^^ is now mostly complete design. Thanks to Damien and Rob for their feedback while working through this.

I'm pretty sure the new design provides a way to do everything that current system can do except merging multiple files - and as I discuss, with the way I have the config loading process laid out, I think it would be possible to do multiple file merging with only minor tweaks if we decide we can't do without it - but I think some of the other new features I've introduced make it unnecessary.

@SpacemanPaul SpacemanPaul changed the title Configuration via DATACUBE_DB_URL environment variable is undocumented Configuration behaviour is confusing and poorly documented. May 23, 2023
@SpacemanPaul SpacemanPaul added the v1.9 Target for version 1.9 label May 23, 2023
@SpacemanPaul SpacemanPaul self-assigned this Dec 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation pinned v1.9 Target for version 1.9
Projects
Status: Done
Development

No branches or pull requests

4 participants