Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Keyname overhaul #3115

Draft
wants to merge 116 commits into
base: master
from
Draft

Conversation

@kodebach
Copy link
Contributor

kodebach commented Oct 25, 2019

Implements #3102

Closes #3050
Closes #3084
Closes #2698
Closes #3082

Basics

These points need to be fulfilled for every PR:

  • Short descriptions of your changes are in the release notes
    (added as entry in doc/news/_preparation_next_release.md which
    contains _(my name)_)
    Please always add something to the release notes.
  • Details of what you changed are in commit messages
    (first line should have module: short statement syntax)
  • References to issues, e.g. close #X, are in the commit messages.
  • The buildservers are happy.
  • The PR is rebased with current master.

Checklist

  • I added unit tests for my code
  • I fully described what my PR does in the documentation
    (not in the PR description)
  • I fixed all affected documentation
  • I added code comments, logging, and assertions as appropriate (see Coding Guidelines)
  • I updated all meta data (e.g. README.md of plugins and METADATA.ini)
  • I mentioned every code not directly written by me in THIRD-PARTY-LICENSES

Review

@lgtm-com

This comment has been minimized.

Copy link

lgtm-com bot commented Oct 25, 2019

This pull request introduces 1 alert when merging 2f20bd4 into 8d138a2 - view on LGTM.com

new alerts:

  • 1 for FIXME comment
@kodebach

This comment has been minimized.

Copy link
Contributor Author

kodebach commented Oct 25, 2019

@markus2330

Before we get to the questions, one important thing:

I added a separate ukey field for the unescaped name to struct _Key. This simplifies a lot of things. I hope this is not a problem. (@mpranj I also hope I modified mmapstorage correctly, please check.)

I also removed elektraFinalizeName, so any modification to the key name has to be made via the proper functions. This resulted in some changes to keyset.c, most cases are covered by the newly added keySetNamespace (not working correctly yet) function that changes the namespace of a key without touching the rest of the name. ksCut was an exception. There I introduced a small hack, by only changing the namespace in the unescaped name, since the escaped name is not relevant inside of ksCut (or its callees).


Now the many questions:

This is still very much broken (as you can see), but there are a lot of open questions:

First the more concrete questions:

  1. There are some weird escaping tests in data_escape.c. I have no idea, why the old escaping function worked this way. Can you maybe shed some light on that?

TEST_ESCAPE_PART("\\\\%", "\\\\\\%"); // TODO (kodebach): WHAT??
TEST_ESCAPE_PART("\\\\\\%", "\\\\\\\\%"); // TODO (kodebach): WHAT??

  1. In test_key.c there are some test cases that are wrong IMO. Adding e.g. ../../a to /b should be an error, since we cannot go two levels up. Do you agree?

TEST_ADD_NAME ("/much/more/level/1/2/3", "../../../../../../..//user", "/user");
TEST_ADD_NAME ("/much/more/level/1/2/3", "..///../../../../../../..//user", "/user");
TEST_ADD_NAME ("/much/more/level/1/2/3", "..///../../..////../../../..//user", "/user");
TEST_ADD_NAME ("/much/more/level/1/2/3", "../../....///../../..////../../../..//user", "/user");

  1. Is there any reason that keyCompareByName et al. in keyset.c use const void * arguments instead of const Key *? I know that this makes it compatible with things like qsort, but in that case why doesn't the public keyCmp use const void *?

And now some more conceptual things:

  1. There are many functions that return ssize_t, but could return size_t and use 0 as an error case. keyGetNameSize is such a function. The return value can never be 0, since we include the terminating null byte in the size and explicitly state that 1 is return, if no key name exists. Should we make this change?
  2. Currently there is the possibility of passing a key like user:hello/my/key to keyNew. This results in a key with name user/my/key and metadata owner = hello. There is a comment in the function elektraHandleUserName that indicates this is legacy behaviour:
    // handle owner (compatibility, to be removed)

    I assume this means, I don't need to handle that case and can remove the function?
  3. If I am not mistaken, the options argument of elektraKeySetName has no use with the new key names. Am I missing something, or can we remove the argument and therefore re-integrate elektraKeySetName into keySetName?
  4. keySetName and keyAddName currently don't distinguish between different kinds of errors, is that something we want? If so how much distinction should there be?
  5. I noticed that keyNew creates a key without a name, if elektraKeySetName fails (e.g. because of an invalid name). I changed it so that keyNew (and keyVNew) return NULL in this case. This also required changing keyVInit so that it returns a value.

And last a few things that are still completely open:

  1. Right now both keyNew (0, KEY_END) and keyNew ("", KEY_END) create a key with name "". IMO one of the two should fail. The current (somewhat broken) solution in this PR is that keyNew (0) creates a key with no name (i.e. key->key == NULL), but externally nothing changes, keyName (keyNew (0)) still returns "". However, keyNew ("", KEY_END) fails because "" is not a valid key name. Note: this behaviour is implemented in elektraKeySetName, so keySetName (key, 0) free the allocated name buffers.
  2. What about backwards compatibility? The introduction of the colon : after the namespace already means that many keyNew and keySetName calls are broken. And I don't think we should provide any backwards compatibility for that in general. However, we might want to add an #ifdef ELEKTRA_KEYNAME_COLON_COMPAT (like in #2983 (comment)) that allows use using old key names without the colon. The #ifdef would enable code in elektraKeyNameCanonicalize that translates the old keys into new ones. I would however only add compatibility for the colon. Any other incompatibilities should not be addressed since they should be rare.
  3. Should the functions elektraKeyNameValidate, elektraKeyNameCanonicalize, elektraKeyNameUnescape and elektraKeyNameEscapePart be made public? elektraKeyNameValidate and elektraKeyNameEscapePart don't make any assumptions about the name/part given, but elektraKeyNameCanonicalize assumes a name that is valid according to elektraKeyNameValidate and elektraKeyNameUnescape assumes a name that is canonical according to elektraKeyNameCanonicalize.
Copy link
Member

mpranj left a comment

Maybe it makes sense that you leave the mmap stuff out and I will fix it when the rest is done?

@@ -403,6 +419,7 @@ int keyCopy (Key * dest, const Key * source)

// free old resources of destination
if (!test_bit (dest->flags, KEY_FLAG_MMAP_KEY)) elektraFree (destKey);
if (!test_bit (dest->flags, KEY_FLAG_MMAP_KEY)) elektraFree (destUKey);

This comment has been minimized.

Copy link
@mpranj

mpranj Oct 25, 2019

Member

The KEY_FLAG_MMAP_KEY is a flag for the Key->key member. If I understand this correctly ukey is a member which is separately allocated. Then we need to add a separate flag for this field.

This comment has been minimized.

Copy link
@kodebach

kodebach Oct 26, 2019

Author Contributor

I don't think we do. key->key and key->ukey are always changed at the same time. Either both are NULL or both are correctly allocated.

This comment has been minimized.

Copy link
@mpranj

mpranj Oct 27, 2019

Member

In that case you're right, I was not aware of that.

This comment has been minimized.

Copy link
@markus2330

markus2330 Oct 27, 2019

Contributor

Such invariants should also be documented in the struct.

if (key->ukey)
{
key->ukey = key->ukey + destInt;
}
Comment on lines +969 to +1220

This comment has been minimized.

Copy link
@mpranj

mpranj Oct 25, 2019

Member

Sounds like the right direction, but there are other places we need to change. We will also need to copy this data into the mmap region.

@Chemin1 Chemin1 mentioned this pull request Oct 25, 2019
18 of 35 tasks complete
@markus2330

This comment has been minimized.

Copy link
Contributor

markus2330 commented Oct 26, 2019

I added a separate ukey field for the unescaped name to struct _Key. This simplifies a lot of things. I hope this is not a problem. (@mpranj I also hope I modified mmapstorage correctly, please check.)

It is a performance thing: ideally keyNew would be two mallocs: one for the struct (so that the identity will stay the same throughout the lifetime) and one for all dynamic parts that might need resizing. One more separate field brings us one step further from this situation. But this optimizations can also be done after 1.0.

I have no idea, why the old escaping function worked this way. Can you maybe shed some light on that?

If your formalization brings different results in this cases, it is okay to change this. Actually the best test would be to escape und unescape random byte sequences and see if you receive the same bits.

I assume this means, I don't need to handle that case and can remove the function?

Yes, please remove this owner stuff. It is not working anyway and it is unlikely we can have it for 1.0.

keySetName and keyAddName currently don't distinguish between different kinds of errors, is that something we want? If so how much distinction should there be?

@PhilippGackstatter didn't you already make this suggestion? Do you already work on this?

What about backwards compatibility?

Please do not provide any. Elektra 1.0 should only understand key names with the colon. We make this change either fully or not.

Right now both keyNew (0, KEY_END) and keyNew ("", KEY_END) create a key with name ""

What about making this fail? (letting keyNew return NULL) Is there some code that really needs these empty key names?

From the user view it would be nice to have an invariant that keys have a valid name. (Especially, if you already changed that wrong names let keyNew fail. The empty name would simply be yet another wrong name.)

Should the functions elektraKeyNameValidate, elektraKeyNameCanonicalize, elektraKeyNameUnescape and elektraKeyNameEscapePart be made public?

Maybe somewhere in the future during 1.0. At the moment we make as little as possible public.

@kodebach

This comment has been minimized.

Copy link
Contributor Author

kodebach commented Oct 26, 2019

It is a performance thing: ideally keyNew would be two mallocs: one for the struct (so that the identity will stay the same throughout the lifetime) and one for all dynamic parts that might need resizing. One more separate field brings us one step further from this situation. But this optimizations can also be done after 1.0.

I don't think blindly aiming for less mallocs is a good idea. Having two smaller buffers instead of one big one might for example affect cache behaviour. (e.g. ukey is accessed more so it is kept in cache, but key is not accessed as much). This is pure speculation and without really good benchmarks I would prefer we decide based on code simplicity and readability.

Apart from that using only 2 mallocs is pretty hard (and definitely was not the case for the old keyNew) when array index prefixing is involved. AFAIK the only way to calculate the name buffer size is to iterate the whole string to check for array indices.

Actually the best test would be to escape und unescape random byte sequences and see if you receive the same bits.

There is a test that checks all 2 byte sequences (with an extra flag it checks all 4 byte sequences). But I can add some more edge cases. Actually random tests are not good for reproducibility and just generating some fixed strings and using them doesn't really make for good tests, as most strings are won't involve any escaping.

@PhilippGackstatter didn't you already make this suggestion?

He did yes, my question was more about the "how much granularity do we want".

Is there some code that really needs these empty key names?

There are definitely some tests that create a key with an empty name and then use keyAddName. There might also be some other places where something like this is used. There may also be some cases where you need a Key, but don't care about its name and want to avoid the malloc. For example if the Key is used for error handling.

From the user view it would be nice to have an invariant that keys have a valid name. (Especially, if you already changed that wrong names let keyNew fail. The empty name would simply be yet another wrong name.)

I agree. The easy solution to this would be to allow keyName to return NULL (and keyGetNameSize to return 0). Then the invariant would be keyName returns a valid name or NULL, if the key doesn't have a name. We could also use "" instead of NULL, which would avoid segfaults.

@PhilippGackstatter

This comment has been minimized.

Copy link
Contributor

PhilippGackstatter commented Oct 27, 2019

@PhilippGackstatter didn't you already make this suggestion? Do you already work on this?

It's broadly covered in #3029, but I am not working on it.

@markus2330

This comment has been minimized.

Copy link
Contributor

markus2330 commented Oct 27, 2019

This is pure speculation and without really good benchmarks

In benchmarks I saw that malloc and elektraUnescapeKeyName have about the same costs.. And it is not pure speculation that calling malloc several times is more expensive than calling it once. I do not see how this could negatively affect cache behavior, as having everything in a block should not be worse than having it scattered around in the memory. But of course I cannot exclude this possibility.

I would prefer we decide based on code simplicity and readability.

I agree. As already said, we can optimize this later when it really becomes a bottleneck. Currently malloc usually is only about 5% of the total costs (in a kdb ls). So to not optimize it now, is the correct decision.

He did yes, my question was more about the "how much granularity do we want".

I think the proposal in #3029 is good. You can say there which other cases you think of.

There are definitely some tests that create a key with an empty name and then use keyAddName. There might also be some other places where something like this is used. There may also be some cases where you need a Key, but don't care about its name and want to avoid the malloc. For example if the Key is used for error handling.

When grepping for 'keyNew(0' or 'keyNew(""' * I find only very few occurrences. It should not be a big deal to make a 'keyNew("/"' in these cases instead. Then we could also remove this ugly cases like in kdb.c line 1501 where invalid keys are passed.

I agree. The easy solution to this would be to allow keyName to return NULL (and keyGetNameSize to return 0). Then the invariant would be keyName returns a valid name or NULL, if the key doesn't have a name. We could also use "" instead of NULL, which would avoid segfaults.

I would prefer if an invalid key is completely impossible.

@kodebach

This comment has been minimized.

Copy link
Contributor Author

kodebach commented Oct 27, 2019

Okay so to sum up:

  1. Key should have an invariant that the key name is always valid. Both keyNew (0, KEY_END) and keyNew ("", KEY_END) should return NULL. For error handling keyNew ("/", KEY_END) is the recommended one, if the key name isn't important and you don't have any other key to use.
  2. Return codes should give some indication on the type of error that occurred.
  3. No backwards compatibility will be provided.
  4. No additional functions will be made public right now.
  5. All the owner stuff will be removed.
  6. Test cases that don't agree with the specification in #3102 will be adapted.

Questions still open:

  1. Should we replace ssize_t with size_t, if return 0 can be used for errors?
  2. const void * vs const Key * in keyCmp, keyCompareByName, et al. ?
  3. Is there any use left for the options argument in elektraKeySetName?
@kodebach kodebach mentioned this pull request Oct 27, 2019
4 of 16 tasks complete
@kodebach

This comment has been minimized.

Copy link
Contributor Author

kodebach commented Oct 27, 2019

A few more things came up:

  1. The keyGetX, keySetX functions for X = UID, GID, Owner, Dir, Mode, ATime, MTime and CTime are all marked deprecated. I assume I can ignore/remove these and comment out the test cases?
  2. keyIsInactive checks the key for parts starting with . everytime it is called. Should we instead do the check during keySetName and store the state in a flag?
@markus2330

This comment has been minimized.

Copy link
Contributor

markus2330 commented Oct 28, 2019

The keyGetX, keySetX functions for X = UID, GID, Owner, Dir, Mode, ATime, MTime and CTime are all marked deprecated. I assume I can ignore/remove these and comment out the test cases?

Yes you can remove all of them (including test cases), I wrote about that just before in #3124

keyIsInactive checks the key for parts starting with . everytime it is called. Should we instead do the check during keySetName and store the state in a flag?

keyIsInactive can also be removed. If we reintroduce inactive keys, we will use meta data for it. It is not needed for 1.0.0.

@kodebach kodebach force-pushed the kodebach:keyname_overhaul branch from 2f20bd4 to b077488 Nov 3, 2019
@lgtm-com

This comment has been minimized.

Copy link

lgtm-com bot commented Nov 3, 2019

This pull request introduces 1 alert when merging b077488 into 6aea503 - view on LGTM.com

new alerts:

  • 1 for FIXME comment
@lgtm-com

This comment has been minimized.

Copy link

lgtm-com bot commented Nov 3, 2019

This pull request introduces 1 alert when merging a61af5a into 6aea503 - view on LGTM.com

new alerts:

  • 1 for FIXME comment
@kodebach

This comment has been minimized.

Copy link
Contributor Author

kodebach commented Nov 5, 2019

@markus2330 I stumbled upon a problem with the trie. (I think) it relies on the existence of keys with empty names. Also, the trie operates on escaped key names AFAICT. We should probably change that as well. The _Trie struct is not well optimized in general... It uses 4 arrays of pointers/size_t each of length 256, that means a single _Trie struct occupies 8 KiB of memory on a 64 bit machine (unless I am mistaken), most of which will be empty space (zeros).

@kodebach

This comment has been minimized.

Copy link
Contributor Author

kodebach commented Nov 5, 2019

Although on second though, I think Keys are only used for trieLookup... We could easily use strings there too

@markus2330

This comment has been minimized.

Copy link
Contributor

markus2330 commented Nov 5, 2019

Yes, the trie internally does not know about keys anyway. I would not optimize there anything unless you find a problem. The 8 KiB should be no problem as they are only needed per mountpoint. Afaik this part was bug free since it was written.

@lgtm-com

This comment has been minimized.

Copy link

lgtm-com bot commented Nov 6, 2019

This pull request introduces 1 alert when merging 4da8dd4 into 69b1e93 - view on LGTM.com

new alerts:

  • 1 for FIXME comment
@kodebach kodebach mentioned this pull request Nov 7, 2019
2 of 16 tasks complete
@lgtm-com

This comment has been minimized.

Copy link

lgtm-com bot commented Nov 7, 2019

This pull request introduces 1 alert when merging a9e6df6 into 69b1e93 - view on LGTM.com

new alerts:

  • 1 for FIXME comment

if (!key->meta) key->meta = ksNew (0, KS_END);
memset (key, 0, sizeof (struct _Key));
key->meta = ksNew (0, KS_END);

This comment has been minimized.

Copy link
@mpranj

mpranj Nov 7, 2019

Member

Is it possible to not allocate key->meta eagerly? It was discussed in #3142 to not do this.

This comment has been minimized.

Copy link
@kodebach

kodebach Nov 7, 2019

Author Contributor

This code is based on the state before #3142.
If it is possible in #3142, then it should be possible here too. I will wait for #3142 to be merged an then integrate it's changes. (Same goes for any other PRs that make changes to the core)

@kodebach

This comment has been minimized.

Copy link
Contributor Author

kodebach commented Nov 7, 2019

@markus2330 The current situation is this:
I decided to implement elektraKeySetName (and keyAddName) in 3 steps:

  1. Call elektraKeyNameValidate.
    • If it returns 1, the key name is valid (in keyAddName we use the existing name as a prefix). Both keySize and keyUSize will be updated to the correct values and we can call elektraRealloc and continue.
    • If it returns 0, the key name is NOT valid. Neither keySize nor keyUSize will have been modified and we return immediately to ensure the key still in a valid state when we return.
  2. Call elektraKeyNameCanonicalize. The function has no return type, since it assumes a valid name and therefore cannot fail.
  3. Call elektraKeyNameUnescape. The function has no return type, since it assumes a valid canonical name and therefore cannot fail.

Most of this works already, but I have some problems with .. parts in elektraKeyNameValidate. I can't quite figure out, how to best calculate the correct keySize and keyUSize, if .. parts are involved. The only solution I could think of that should (not tried yet) work without problems is to do a second pass over the key name starting from the back.
However, I think there is a simpler (and probably faster) solution, if we introduce a limit on the number of parts in a key name. Let's say 1024. (Note: the limit is only so that I don't need to malloc) What do you think about that? UPDATE: scratch that, I ran into problems with that solution, I'll just do the second iteration for now.

@markus2330

This comment has been minimized.

Copy link
Contributor

markus2330 commented Nov 7, 2019

Thank you for the update! Sounds very good!

elektraKeySetName

Shouldn't it be called keySetName? Or is elektraKeySetName still needed?

@kodebach

This comment has been minimized.

Copy link
Contributor Author

kodebach commented Nov 7, 2019

Shouldn't it be called keySetName? Or is elektraKeySetName still needed?

No its not needed anymore, I just didn't update that part yet.

@kodebach kodebach mentioned this pull request Nov 8, 2019
0 of 14 tasks complete
@kodebach

This comment has been minimized.

Copy link
Contributor Author

kodebach commented Nov 8, 2019

I ended up validating the whole key name (except for the namespace) in reverse, but there are now working versions of elektraKeyNameValidate, elektraKeyNameCanonicalize and elektraKeyNameUnescape. Now it should only be a matter of going through the whole code base and updating all key names ...

@kodebach kodebach force-pushed the kodebach:keyname_overhaul branch from 380ddbd to 27babaa Nov 27, 2019
@kodebach

This comment has been minimized.

Copy link
Contributor Author

kodebach commented Nov 28, 2019

@markus2330 Do you know which storage plugins support metadata? Is there some way to easily find out, other than looking at the code?

@mpranj

This comment has been minimized.

Copy link
Member

mpranj commented Nov 28, 2019

Screenshot 2019-11-28 at 15 40 37

@kodebach

This comment has been minimized.

Copy link
Contributor Author

kodebach commented Nov 28, 2019

[Screenshot]

@mpranj What do you want to tell us? Yes, the PR is insanely huge, but it can't be any smaller... Changing a basic fact about key names (colon after namespace) will naturally break everything.

You could use the VS Code Github Extension, if you are interesting the contents of this PR. It seems it is able to handle the enormous size of the PR.

@mpranj

This comment has been minimized.

Copy link
Member

mpranj commented Nov 28, 2019

@kodebach I just thought it's funny that github cannot handle this PR. It's clear that the change implies huge changes in the complete codebase.

@kodebach

This comment has been minimized.

Copy link
Contributor Author

kodebach commented Nov 28, 2019

@markus2330 Whats the plan for the Python 2 binding? Are we going to remove it soon? Or do I have to fix it?

@mpranj

This comment has been minimized.

Copy link
Member

mpranj commented Nov 28, 2019

Whats the plan for the Python 2 binding?

It will be removed soon via #3291.

@kodebach

This comment has been minimized.

Copy link
Contributor Author

kodebach commented Nov 28, 2019

I hit a problem with testshell_markdown_kdb_global_umount. It breaks other tests, because system:/elektra/modules keys are serialized to disk.

The specific circumstances in which this caused a problem is the Backup-and-Restore directive in the markdown shell tests. For some reason it doesn't just backup the keys below the given key, but the whole namespace. So in kdb-global-unmount.md, we make a backup of all of system/ because of the line Backup-and-Restore: system:/elektra/globalplugins. The restore then causes the following error:

Sorry, module kdb issued the warning C01320:
        Interface: Postcondition of backend was violated: drop key system:/elektra/modules/cache/infos/description not belonging to "system:/elektra/modules/cache" with name "modules" but instead to "(null)" with name "(null)" because it is hidden by other mountpoint

Also the file $KDB_DB_SYSTEM/elektra.ecf then contains all the system/elektra/modules keys.

I don't know why this didn't happen before the key name update. Does anyone have an idea, where these keys should be removed or what else should happen to avoid this problem?

PS. Everything else should be working now. Apart from documentation, I am now just waiting for #2969, so that I can rebase.

@kodebach kodebach mentioned this pull request Nov 28, 2019
@markus2330

This comment has been minimized.

Copy link
Contributor

markus2330 commented Nov 29, 2019

@markus2330 Do you know which storage plugins support metadata? Is there some way to easily find out, other than looking at the code?

Do you mean being able to store arbitrary metadata? Then it should be:

  • ni
  • dump
  • quickdump
  • mmapstorage
  • ini

Hopefully I did not forget one.

@kodebach

This comment has been minimized.

Copy link
Contributor Author

kodebach commented Nov 29, 2019

Do you mean being able to store arbitrary metadata? Then it should be:

Yes, thanks. I think xerces is also one of them.

Hopefully I did not forget one.

We'll find out if someone notices incompatible files, because of the meta:/ namespace.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.