Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Match replace lazytree branch names #225

Merged
merged 12 commits into from
Mar 29, 2023
Merged

Match replace lazytree branch names #225

merged 12 commits into from
Mar 29, 2023

Conversation

tamasgal
Copy link
Member

@tamasgal tamasgal commented Mar 14, 2023

The idea is to give the user the ability to rename branches. The current implementation of @Moelf is amazing, but sometimes the branch paths are long and annoying and result in awkward columns.

Before this patch:

julia> t = LazyTree(f, "E", [r"Evt/trks/trks.(dir|pos).([xyz])")
 Row │ Evt_trkstrksdir        Evt_trkstrksdir        Evt_trkstrksdir        Evt_trkstrkspos        Evt_trkstrkspos        Evt_trk      │ SubArray{Float6        SubArray{Float6        SubArray{Float6        SubArray{Float6        SubArray{Float6        SubArra ─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 1   │ [-0.487, -0.487, -0.4  [0.0369, 0.0369, 0.03  [-0.873, -0.873, -0.8  [125.0, 125.0, 70.7,   [615.0, 615.0, 585.0,  [446.0, 
 2   │ [0.521, 0.521, 0.521,  [-0.175, -0.175, -0.1  [-0.835, -0.835, -0.8  [80.7, 80.7, 39.1, 33  [533.0, 533.0, 559.0,  [465.0, 
 3   │ [-0.122, -0.122, -0.1  [-0.0817, -0.0817, -0  [-0.989, -0.989, -0.9  [194.0, 194.0, 96.5,   [593.0, 593.0, 581.0,  [457.0, 
 4   │ [-0.23, -0.23, -0.23,  [-0.102, -0.102, -0.1  [-0.968, -0.968, -0.9  [204.0, 204.0, 124.0,  [590.0, 590.0, 571.0,  [440.0, 
1 column omitted

After this patch:

julia> t = LazyTree(f, "E", [r"Evt/trks/trks.(dir|pos).([xyz])" => s"\1_\2"])
 Row │ pos_z                  dir_z                  pos_y                  dir_y                  dir_x                  pos_x        │ SubArray{Float6        SubArray{Float6        SubArray{Float6        SubArray{Float6        SubArray{Float6        SubArra ─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 1   │ [125.0, 125.0, 70.7,   [-0.873, -0.873, -0.8  [615.0, 615.0, 585.0,  [-0.487, -0.487, -0.4  [0.0369, 0.0369, 0.03  [446.0, 
 2   │ [80.7, 80.7, 39.1, 33  [-0.835, -0.835, -0.8  [533.0, 533.0, 559.0,  [0.521, 0.521, 0.521,  [-0.175, -0.175, -0.1  [465.0, 
 3   │ [194.0, 194.0, 96.5,   [-0.989, -0.989, -0.9  [593.0, 593.0, 581.0,  [-0.122, -0.122, -0.1  [-0.0817, -0.0817, -0  [457.0, 
 4   │ [204.0, 204.0, 124.0,  [-0.968, -0.968, -0.9  [590.0, 590.0, 571.0,  [-0.23, -0.23, -0.23,  [-0.102, -0.102, -0.1  [440.0, 
 5   │ [58.6, 58.6, 30.1, 35  [-0.821, -0.821, -0.8  [546.0, 546.0, 565.0,  [0.54, 0.54, 0.54, 0.  [0.187, 0.187, 0.187,  [440.0,  1 column omitted

@Moelf
Copy link
Member

Moelf commented Mar 14, 2023

I think we should try not to make LazyTree() any more complicated as it does many things already (including the custom type injection), we had:

and I still think we should have a rename(), but as a separate API much like: https://dataframes.juliadata.org/stable/lib/functions/#DataFrames.rename!

@tamasgal
Copy link
Member Author

I thought it's minimal enough ;) the problem I see with the rename is that it requires a two step process and also carrying the information to two separate places while it could easily be done in one go.

If we do not need to do a lot of copying, we can also go for the rename option. I'll check that out. At least in my case, this rename thing is pretty normal and needed almost everywhere 🙈

@tamasgal
Copy link
Member Author

Another problem is that the user needs to know the normalised branch beforehand to create the rename map. This also means that changing the branch name normalisation is a breaking change for these awkward brach names, where you expect the user to do renaming.

@Moelf
Copy link
Member

Moelf commented Mar 15, 2023

I thought it's minimal enough ;)

eh, find another similar package that has the "main function" (defined as the one that most user will use to get the table-like thing back) handle renaming, I checked a few, and they don't usually do this:

@tamasgal
Copy link
Member Author

tamasgal commented Mar 15, 2023

Hm, I think I have to contradict, they do: CSV parsers usually offer "renaming" in the actual parsing function, like header= (https://csv.juliadata.org/stable/reading.html#header, https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) which can take a list. There is also the (sometimes optional) column name normalisation (https://csv.juliadata.org/stable/reading.html#normalizenames) which we also do, to get valid identifiers.

Also just because others don't do it, it does not mean it could make sense in our case 🙂 ROOT is of course much more different compared to CSV and handling complex, nested structures is way more complicated.

Just to regurgitate my example from above

t = LazyTree(f, "E", [r"Evt/trks/trks.(dir|pos).([xyz])")

will currently translate to the fieldnames

Evt_trkstrksdirx, Evt_trkstrksdiry, Evt_trkstrksdirz, Evt_trkstrksposx, Evt_trkstrksposy, Evt_trkstrksposz

while the original class which was serialised to this sub-branch is an Evt, which has fields named vector<Hit>, vector<MCHit>, vector<Trk>, vector<MCTrk>, Header, vector<Weight> and so on; the list goes on with dozens of other fields and the nightmare is that all the vectors of other classes are nested as well.

So I have to deal with the renaming of dozens of fields to make it more "usable" in the code and for the user.

If the renaming need to happen after the generation of these awkward names, one really needs to implement this by hand and rely on the normalisation procedure, which I consider to be something like an implementation detail (more or less). To mitigate this, the function to normalise branch names would need to be exported, in my opinion.

The code would need to be hardcoded if the renaming is external to LazyTree, something like (keep in mind that the list goes on with much more renaming, in my case):

t = LazyTree(f, "E", [r"Evt/trks/trks.(dir|pos).([xyz])")
new_t = rename(t,
    ["Evt_trkstrksdirx" => "dir_x",
    "Evt_trkstrksdiry" => "dir_y",
    "Evt_trkstrksdirz" => "dir_z",
    "Evt_trkstrksposx" => "pos_x",
    "Evt_trkstrksposy" => "pos_y",
    "Evt_trkstrksposz" => "pos_z",]
)

compared to this single line, where the information is minimal and it renames multiple branches in one go.

t = LazyTree(f, "E", [r"Evt/trks/trks.(dir|pos).([xyz])" => s"\1_\2"])

Of course, rename() could offer regex renaming as well, so that the user hooks up on the normalized names and create regex for those, but that does not feel good to me. The branch path is very well known, so that's the point where I naturally would hook in when I wanted to rename something, just like in CSV where the column names in the header line are known to the user (of course, in case of CSV one usually hooks into the column number due to the fixed ordering).

Apropos CSV. Taking your example with CSV-parsers, I think it would be fair to offer something like the "header"-renaming and then I don't see a big problem to give the user the option to also use regex 😉 or what do you think?

Maybe my cases are more like edge cases and other users of UnROOT are happy with the branchnames, but I have to mimic nested structures in order to allow event-by-event readout which materialise in Julia structs since our data is not really fitting the usual tabular style but is often doubly-jagged.

Anyways, let me know what you think.

@tamasgal
Copy link
Member Author

OK, I need some feedback here, I could not fix #226 yet but I need to move on now and leave that for later.

My design process now heavily depends on whether we want this replacement feature or not 😉

@tamasgal
Copy link
Member Author

Sorry, forgot to ping @Moelf

src/iteration.jl Outdated Show resolved Hide resolved
@Moelf
Copy link
Member

Moelf commented Mar 27, 2023

I'm not against but please make sure CI passes and there's example and documentation

@tamasgal
Copy link
Member Author

OK thanks for the go! Of course I'll make everything pass and remove all the debug stuff (it's still draft) so that you can put your final stamp on it 😄

@codecov
Copy link

codecov bot commented Mar 29, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: -1.88 ⚠️

Comparison is base (d8a5013) 90.08% compared to head (8667679) 88.21%.

❗ Current head 8667679 differs from pull request most recent head 1219c34. Consider uploading reports for the commit 1219c34 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #225      +/-   ##
==========================================
- Coverage   90.08%   88.21%   -1.88%     
==========================================
  Files          18       18              
  Lines        2129     2121       -8     
==========================================
- Hits         1918     1871      -47     
- Misses        211      250      +39     
Impacted Files Coverage Δ
src/iteration.jl 87.44% <100.00%> (+0.22%) ⬆️

... and 5 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@tamasgal tamasgal marked this pull request as ready for review March 29, 2023 09:28
@tamasgal
Copy link
Member Author

OK, ready for review @Moelf

I don't know what's wrong with chaintrees() on the nightly (see #230), I could not reproduce it locally. Tried with the latest alpha and now compiling the laste master of Julia. I think it's not related.

@tamasgal
Copy link
Member Author

Tests pass on the latest master, so it's unrelated to the changes here.

@tamasgal tamasgal marked this pull request as draft March 29, 2023 14:02
@tamasgal tamasgal marked this pull request as ready for review March 29, 2023 17:17
docs/src/index.md Outdated Show resolved Hide resolved
@Moelf
Copy link
Member

Moelf commented Mar 29, 2023

looks good to me. we don't care nightly error I think.

docs/src/index.md Outdated Show resolved Hide resolved
@tamasgal tamasgal merged commit 4ee582e into master Mar 29, 2023
@tamasgal tamasgal deleted the match-replace-lazytree branch March 29, 2023 18:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants