Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem with reading branches of C-style arrays #165

Closed
briederer opened this issue Apr 25, 2022 · 37 comments · Fixed by #166
Closed

Problem with reading branches of C-style arrays #165

briederer opened this issue Apr 25, 2022 · 37 comments · Fixed by #166

Comments

@briederer
Copy link

I just tried to read one of my Root-files which contains a TTree with TBranches of the following types

  • primitive types: i.e. integer and double branches
  • array types: i.e. a C-Style array of doubles like double mat[9]

In principle it seems to be related to my old issue #9 so I decided to dig a bit deeper with the current version.

Loading my file with ROOTFile() works without a problem. However when trying to create the LazyTree of the said TTree I get the error message:

Error displaying LazyTree with 6 branches:
n, trsigma2, sigma, sigma2, sigma3...
: ArgumentError: invalid Array dimensions

The stacktrace lead me to the interped_data(rawdata, rawoffsets, ::Type{T}, ::Type{J}) where {T, J<:JaggType} function.
It turns out that my data seems to be a NooffsetJag JagType which seems kind of reasonable to me. Also reinterpreting works without a problem and indeed the variable real_data contains the real data stored in the Root-File.
So in fact the problem only occurs when passing the real_data and the offset to VectorOfVectors. The problem seems to be the following: In my case the offset-vector is simply an empty vector Int32[] which causes the VectorOfVectors to call the function similar to init a vector of size -1.

So to sum it up, it seems that UnROOT is able to read my data (and I guess in turn also the file from #9) but simply the wrapping of the data fails. I guess it should be possible to fix it by changing the part for NooffsetJag in the code and catch cases when rawoffsets is empty. But I don't know whether this would create other problems.

I really hope we can figure this out, because being able to finally use UnROOT for my analysis would make my analysis much more efficient.

@briederer
Copy link
Author

Follow-up:
Simply replacing the lines I've mentioned in the previous post by:

...
else
    #offset = rawoffsets # not needed since set implicitly in return
    real_data = ntoh.(reinterpret(T, rawdata))
    return VectorOfVectors(real_data, Int32[1])
end

gives kind of a weird behavior for the lazy tree.

file = ROOTFile("path/to/file")
lt = LazyTree(file,"tree",["branch"]) # branch contains 30 entries with each arrays of the Root-type `[9]/D`

Outputs:

julia> lt[1] 
UnROOT.LazyEvent at index 1 with 1 columns:
(branch = [-0.11494079823987098, 0.003512754908587555, -0.006951244327531138, -0.010983292976212498, 0.0012710038728507352, -0.009022733771010032, -0.007430790262943545, -0.001381307906738535, 0.12396353201088427, 6.95206001540604e-310    NaN, NaN, 6.95214518845505e-310, 6.952094284304e-310, 1.0e-323, 7.94302e-318, 1.0e-323, 4.0e-323, 0.0, 5.0e-324],)
# contains the correct data but is then padded at the end by garbage (probably Vector{Float64}(undef,N) is somewhere called?)
julia> lt.branch[1]
0-element view(::Vector{Float64}, 1:0) with eltype Float64
# Not accessible this way
julia> lt
 Row │ branch                                                                                                                              
     │ SubArray{Float6                                                                                                                    
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 1   │ [-0.115, 0.00351, -0.00695, -0.011, 0.00127, -0.00902, -0.00743, -0.00138, 0.124, 6.95205899055105e-310, 6.95205899024354e-310, 5. 
 2   │ []                                                                                                                                 
 3   │ []                                                                                                                                 
 4   │ [-0.106, 0.000901, -0.00758, -0.00825, 0.00072, -0.0106, -0.00998, -0.00164, 0.116, 6.95205899057635e-310, 6.95205904360066e-310,  
 5   │ [-0.105, 0.0009, -0.00814, -0.00801, 0.000957, -0.00985, -0.0104, -0.00249, 0.115, 6.9521451199374e-310, 6.9520570656116e-310, 6.9 
 6   │ []                                                                                                                                 
 7   │ []                                                                                                                                 
 8   │ [-0.0976, 0.00202, -0.00855, -0.00792, 0.00172, -0.0115, -0.011, -0.0022, 0.109, 6.9521451199374e-310, 6.952057148036e-310, 6.9520 
                                                               
                                                                                                               1 column and 22 rows omitted

I guess the fact that not all elements are filled here is due to the tree being loaded Lazy?

@tamasgal
Copy link
Member

Thanks for the investigations @briederer, that certainly helps a bit :) A few things has changed since #9 so we might be closer to solving this issue.

Can you upload a small test file and some reference values?

@briederer
Copy link
Author

Hey thanks for the response.
I tried to recreate a minimal ROOT-File with one tree arrays and three branches containing the C-style arrays I am using:

  • n: Simple integers (0,1,2,...)
  • carr: A simple C-array double mat[9]; with random numbers stored to the ROOT-File by: tree -> Branch("carr",&mat,"[9]/D");
  • cstr: A wrapped C-array struct vec{ double el[9]} v; with random numbers stored to the ROOT-File by: tree -> Branch("cstr",&v,"el[9]/D");
    And of course then filled with file -> Fill();. The stored random numbers are attached at the end of the comment.

What is interesting is, that although this file is now filled in an analogous way compared to my real data, it behaves a bit different and I guess this is due to basket-separation. In my original files I flush after every Fill() command by file->Write("", TObject::kWriteDelete); since I need to make sure that no data is lost (I am aware that this is probably not the most efficient thing to do but it is necessary due to some other reasons). In the attached file I did Write(...) only once at the end. So If I do get everything right this means in my original data I always have "1 basket contains 1 array per branch" while here the one basket that is created contains "all arrays".

When I pass this new file to my the modified version of UnROOT (see previous comment) I see again that the arrays are of type UnROOT.Nooffsetjagg with offset empty offset-arrays. The differences are the following now:

  • the real_data is now everything that is in the basket, i.e. the full column from the table below
  • The program segfaults and I am not able to recreate the Outputs from the previous post. However the segfault appears to happen at the point of output creation (i.e. in the _show-function and not during interpeding or other parts).

Let me know if I can help any further.

PS: Sorry for adding the ROOT-file as tar.gz but as it seems github started to forbid adding *.root-files to comments.

Reference values:

$ root sample.root 
   ------------------------------------------------------------------
  | Welcome to ROOT 6.24/06                        https://root.cern |
  | (c) 1995-2021, The ROOT Team; conception: R. Brun, F. Rademakers |
  | Built for linuxx8664gcc on Sep 02 2021, 14:20:23                 |
  | From tags/v6-24-06@v6-24-06                                      |
  | With                                                             |
  | Try '.help', '.demo', '.license', '.credits', '.quit'/'.q'       |
   ------------------------------------------------------------------
root [0] 
Attaching file sample.root as _file0...
(TFile *) 0x5565a90ccc80
root [1] _file0 -> ls()
TFile**		sample.root	Sample file with C-style arrays
 TFile*		sample.root	Sample file with C-style arrays
  KEY: TTree	arrays;1	Here we have the arrays
root [2] arrays -> Scan("*");
***********************************************************
*    Row   * Instance *       n.n * carr.carr * cstr.cstr *
***********************************************************
*        0 *        0 *         0 * 0.9605671 * 0.7005088 *
*        0 *        1 *         0 * 0.5783026 * 0.5883587 *
*        0 *        2 *         0 * 0.8360180 * 0.6346690 *
*        0 *        3 *         0 * 0.6464254 * 0.1063357 *
*        0 *        4 *         0 * 0.4649314 * 0.3290703 *
*        0 *        5 *         0 * 0.9394426 * 0.6610401 *
*        0 *        6 *         0 * 0.2085035 * 0.3833527 *
*        0 *        7 *         0 * 0.1450006 * 0.6490524 *
*        0 *        8 *         0 * 0.3819448 * 0.3342306 *
*        1 *        0 *         1 * 0.0580883 * 0.4054549 *
*        1 *        1 *         1 * 0.5265699 * 0.7071468 *
*        1 *        2 *         1 * 0.5964353 * 0.1376930 *
*        1 *        3 *         1 * 0.4881621 * 0.1404784 *
*        1 *        4 *         1 * 0.7832763 * 0.3854090 *
*        1 *        5 *         1 * 0.5757230 * 0.3180625 *
*        1 *        6 *         1 * 0.1483398 * 0.5362901 *
*        1 *        7 *         1 * 0.0185714 * 0.7266424 *
*        1 *        8 *         1 * 0.1246488 * 0.8545894 *
*        2 *        0 *         2 * 0.3613115 * 0.7710743 *
*        2 *        1 *         2 * 0.9609252 * 0.8262430 *
*        2 *        2 *         2 * 0.1001446 * 0.9003678 *
*        2 *        3 *         2 * 0.4872831 * 0.3086482 *
*        2 *        4 *         2 * 0.2837206 * 0.6322838 *
*        2 *        5 *         2 * 0.9577006 * 0.6656654 *
*        2 *        6 *         2 * 0.9665145 * 0.0157890 *
*        2 *        7 *         2 * 0.0711204 * 0.4930844 *
*        2 *        8 *         2 * 0.7229358 * 0.6675558 *
***********************************************************

@tamasgal
Copy link
Member

OK thanks, the good news is that uproot can read everything ;)

In [1]: import uproot

In [2]: f = uproot.open("sample.root")

In [3]: f.keys()
Out[3]: ['arrays;1']

In [4]: f['arrays']
Out[4]: <TTree 'arrays' (3 branches) at 0x0001073d9bb0>

In [5]: f['arrays'].keys()
Out[5]: ['n', 'carr', 'cstr']

In [6]: f['arrays/carr']
Out[6]: <TBranch 'carr' at 0x00011a493940>

In [7]: f['arrays/carr'].array()
Out[7]: <Array [[0.961, 0.578, ... 0.0711, 0.723]] type='3 * 9 * float64'>

In [8]: f['arrays/cstr'].array()
Out[8]: <Array [[0.701, 0.588, ... 0.493, 0.668]] type='3 * 9 * float64'>

I'll try to figure out what's going on. Regarding the number of baskets: the size of the data is also relevant. With little data, you often end up with a single basket, so we often try to create sample files where the data is spread across multiple baskets, since stitching the data together might be another issue. May I ask you to generate a bit more data to force multiple baskets? ;)

Meanwhile I'll see what I can do...

@briederer
Copy link
Author

I've created another file where I force the same structure as in my file (i.e. writing after each filling).
This creates more baskets.
However I still tried to keep the files small and just added 3 entries.
If you really need for some reason a bigger data-set let me know.

File: sample_baskets.root.tar.gz

@briederer
Copy link
Author

Just tried it out and this file does not segfault with my edit. And I guess you don't the new reference values since uproot is able to read it @tamasgal?

@tamasgal
Copy link
Member

Alright, thanks! Yes, I'll use uproot for cross-checks :)

@tamasgal
Copy link
Member

OK, digging deeper reveals that the problem is that fLen is not taken into account at all.

Here you can see that the leaf-information for the "carr" branch is telling us that fLen = 9 which should be used to construct the 9-length arrays:

julia> f["arrays"].fBranches[2]
UnROOT.TBranch_13
  cursor: UnROOT.Cursor
  fName: String "carr"
  fTitle: String "[9]/D"
  fFillColor: Int16 0
  fFillStyle: Int16 1001
  fCompress: Int32 9
  fBasketSize: Int32 32000
  fEntryOffsetLen: Int32 0
  fWriteBasket: Int32 3
  fEntryNumber: Int64 3
  fIOFeatures: UnROOT.ROOT_3a3a_TIOFeatures
  fOffset: Int32 0
  fMaxBaskets: UInt32 0x0000000a
  fSplitLevel: Int32 0
  fEntries: Int64 3
  fFirstEntry: Int64 0
  fTotBytes: Int64 435
  fZipBytes: Int64 435
  fBranches: UnROOT.TObjArray
  fLeaves: UnROOT.TObjArray
  fBaskets: UnROOT.TObjArray
  fBasketBytes: Array{Int64}((10,)) [145, 145, 145, 0, 0, 0, 0, 0, 0, 0]
  fBasketEntry: Array{Int64}((10,)) [0, 1, 2, 3, 0, 0, 0, 0, 0, 0]
  fBasketSeek: Array{Int64}((10,)) [396, 5607, 1051, 0, 0, 0, 0, 0, 0, 0]
  fFileName: String ""


julia> f["arrays"].fBranches[2].fLeaves
UnROOT.TObjArray("", 0, Any[UnROOT.TLeafD
  fName: String ""
  fTitle: String "[9]"
  fLen: Int32 9
  fLenType: Int32 8
  fOffset: Int32 0
  fIsRange: Bool false
  fIsUnsigned: Bool false
  fLeafCount: UInt32 0x00000000
  fMinimum: Float64 0.0
  fMaximum: Float64 0.0
])

I now need to think about how to get this hammered into the existing auto-jagg-construct. The place where this could be handled is here:

UnROOT.jl/src/root.jl

Lines 394 to 397 in 81b6ca9

else
_type = primitivetype(leaf)
_type = _jaggtype === Nojagg ? _type : Vector{_type}
end

Just a dummy placeholder:

Screenshot 2022-04-26 at 15 31 19

Here I need to figure out how to pass over the information that we now need to read 9 elements in something like a StaticArray. Maybe we need a new Jagg-type, which is even parametric to hold the dimensions? I am not sure. @Moelf do you have any ideas?

@briederer
Copy link
Author

If I get the code right the NooffsetJagg is exactly build for such objects, but not containing the info about the size yet.

In the utils.jl code NooffsetJagg is assigned to branches where the leaf.fTitle contains the bracket notation for 'array'-entries:

(match(r"\[.*\]", leaf.fTitle) !== nothing) && return Nooffsetjagg

Especially this type is used nowhere else in the code so I guess this one may be used for that?

However how to add the fLen here in a correct way (without breaking other things and reducing speed) is over my head.

Uproot4 is extracting the info here
https://github.com/scikit-hep/uproot4/blob/ccae426ae94dfaeed15b625d80d42c8c586a5082/src/uproot/interpretation/identify.py#L115-L117
and then "simply" creates a numpy array with the data here:
https://github.com/scikit-hep/uproot4/blob/ccae426ae94dfaeed15b625d80d42c8c586a5082/src/uproot/interpretation/identify.py#L388-L390

Maybe something of this helps.

@Moelf
Copy link
Member

Moelf commented Apr 26, 2022

I think in this case you just need NooffsetJagg and a SVector{} type?

_type = SVector{_type, 9}
_jaggtype = NooffsetJagg

return _type, _jaggtype

@tamasgal
Copy link
Member

Yes, that's the plan, and then propagate the dimension parameter down the stream. I'll check it out later after work.

@tamasgal
Copy link
Member

I think in this case you just need NooffsetJagg and a SVector{} type?

_type = SVector{_type, 9}
_jaggtype = NooffsetJagg

return _type, _jaggtype

Oh wait, would this work like that? That would make thinks fairly easy. 😆

@Moelf
Copy link
Member

Moelf commented Apr 26, 2022

I hope the downstream usage of _type is such that it just read(_type, source) but it may need some tweak. Shouldn't be too hard though

@tamasgal
Copy link
Member

tamasgal commented Apr 26, 2022

OK not the fully story but it's a start of a hopefully short journey.

julia> f["arrays/carr"]
ERROR: DimensionMismatch("No precise constructor for StaticArrays.SVector{Float64, 9} found. Length of input was 0.")
Stacktrace:
 [1] StaticArrays.SVector{Float64, 9}(x::Tuple{Tuple{Tuple{Tuple{}}}})
   @ StaticArrays ~/.julia/packages/StaticArrays/KyWYI/src/convert.jl:1
 [2] StaticArray (repeats 4 times)
   @ ~/.julia/packages/StaticArrays/KyWYI/src/convert.jl:4 [inlined]
 [3] LazyBranch(f::ROOTFile, b::UnROOT.TBranch_13)
   @ UnROOT ~/Dev/UnROOT.jl/src/iteration.jl:116
 [4] getindex(f::ROOTFile, s::String)
   @ UnROOT ~/Dev/UnROOT.jl/src/root.jl:134
 [5] top-level scope
   @ REPL[45]:1

@Moelf
Copy link
Member

Moelf commented Apr 26, 2022

oh it's because StaticArrays.SVector{Float64, 9}() doesn't work, when we build buffer for the lazy branch

@Moelf
Copy link
Member

Moelf commented Apr 26, 2022

ah ok then after fixing that we need to tweak interped_data(), will take a look later

@tamasgal
Copy link
Member

Yep, that's tricky. I had a long day, so I cannot spend as much as I thought I can on it, but I'll squeeze out a few more minutes until I fall into bed. Tomorrow is another day ;)

@tamasgal
Copy link
Member

OK, so of course, we run into the issue that SVector{T, N} is not a bits type, so it will fail when trying to reinterpret it:

julia> raw, offsets = UnROOT.array(f, "arrays/carr"; raw=true)
(UInt8[0x3f, 0xda, 0xe5, 0x2b, 0x15, 0x35, 0xca, 0x56, 0x3f, 0xa5    0x50, 0xf4, 0x3f, 0xed, 0xc1, 0x31, 0x14, 0x3b, 0x82, 0x62], Int32[0])

julia> reinterpret(StaticArrays.SVector{Float64, 9}, raw[1:8])
ERROR: ArgumentError: cannot reinterpret `UInt8` as `SVector{Float64, 9}`, type `SVector{Float64, 9}` is not a bits type
Stacktrace:
 [1] (::Base.var"#throwbits#277")(S::Type, T::Type, U::Type)
   @ Base ./reinterpretarray.jl:16
 [2] reinterpret(#unused#::Type{SVector{Float64, 9}}, a::Vector{UInt8})
   @ Base ./reinterpretarray.jl:36
 [3] top-level scope
   @ REPL[22]:1

Here of course we also need to reverse the bytes, due to the big-endianness of ROOT ;)

But as seen here, the data can be accessed:

julia> reinterpret(Vector{Float64}, reverse(raw[1:8]))
1-element reinterpret(Float64, ::Vector{UInt8}):
 0.4202373225336137

cf

In [13]: f['arrays/carr'].array()[0][0]
Out[13]: 0.4202373225336137

@Moelf
Copy link
Member

Moelf commented Apr 26, 2022

https://juliaarrays.github.io/StaticArrays.jl/stable/pages/api/#Arrays-of-static-arrays

you have to go to Vector{Float64} first but then it should work

@briederer
Copy link
Author

Thanks for the progress on this already. Seems very promising so far.

May I ask if one of you @tamasgal or @Moelf could push the current state of the modifications (although not working yet) to #166?
I want to play around with it a little bit and see what it does to some other very special branches of mine (;

@tamasgal
Copy link
Member

I just pushed, but it's not much. Sorry it's taking so long, I am a bit overloaded... I'll spend a bit time on it today!

@briederer
Copy link
Author

Thanks for that.
It's not urgent so don't worry. Just thought that this is definitely an issue that may hit other people and had a bit of time and wanted to look what the side effects of this PR may be on very specific problems that I have.
Still thanks for the great work. It is always a pleasure to submit issues to UnROOT since they are dealt with so quickly 👍

@tamasgal
Copy link
Member

Alright, thanks for the great feedback too :) I am glad that @Moelf jumped into the project, he did an enormous amount of work already :)

Btw. a few thoughts which might help you: the current issue is to hook into the interpretation of an SVector{T, N} and the general idea is that reinterpret() should take the Vector{UInt8} and slice it up into packs of bytes (depending on the size of T, in your case Float64 and therefore 8 bytes/elements) and then reverse those (big endian!).
As a first try I'd introduce a new struct which wraps SVector{T, N} and simply create a method for reinterpret using that type (to avoid type piracy).

@tamasgal
Copy link
Member

Essentially instead of _type = SVector{_type, 9} do something like _type = SVectorWrapper{_type, 9} which is just a thin wrapper for SVector{_type, 9} and holds an instance of it as a field.

A long-term solution would probably be the replacement of reinterpret with our own function, which falls back to the original reinterpret for all the things which work out of the box and then uses custom parsers for things like this SVector.

@Moelf
Copy link
Member

Moelf commented Apr 29, 2022

btw all of your type notation is flipped, it's

SVector{N, T}
NTuple{N, T}

@Moelf
Copy link
Member

Moelf commented Apr 29, 2022

julia> ROOTFile("./test/samples/issue165_multiple_baskets.root")["arrays/carr"]
3-element LazyBranch{UnROOT.FixLenVector{9, Float64}, UnROOT.Nooffsetjagg, Vector{UnROOT.FixLenVector{9, Float64}}}: 
 [0.4202373225336137, 0.04243914226183628, 0.5403969290388734, 0.5762304009759008, 0.5129559252005796, 0.2988885591267089, 0.866322138284483, 0.3651808394888327, 0.14771760168844722]
 [0.189585121902444, 0.5142614690187673, 0.5731252634772683, 0.7441941270344863, 0.018536993776698128, 0.5783476343277598, 0.12047688202954683, 0.981035016933938, 0.24272876151033154]
 [0.7775048011809144, 0.8664217530127716, 0.4918492038230641, 0.24464299401484568, 0.38991686533667, 0.15690925771226608, 0.3850047958013624, 0.9268160513261408, 0.9298329730191421]

I basically got it working, but the inner thing being a vector makes my head spin

@tamasgal
Copy link
Member

btw all of your type notation is flipped, it's

SVector{N, T}
NTuple{N, T}

oops yeah, error propagation 😆

that looks good, so you went the thin-wrapper approach...

@Moelf
Copy link
Member

Moelf commented Apr 29, 2022

Re: #165 (comment)

wait, the file you provided is not Nooffsetjagg, it's Nojagg right? Mentally I treat a Tuple() as a single Object, because here the offset bytes seems empty

@Moelf
Copy link
Member

Moelf commented Apr 29, 2022

ehh, ok maybe we should think of it as Nooffsetjagg

@tamasgal
Copy link
Member

Yeah, it's a bit confusing but I think Nooffsetjagg is correct in our context

@Moelf
Copy link
Member

Moelf commented Apr 29, 2022

Wait no I think "no offset" means something else, it means no 10 bytes offset between rawdata and raw offsets.

It does NOT mean there's no offset bytes.

If there's no offset bytes (for Jagged array), it's Nojagg. I'm inclined to treat this as Nojagg since element length is known to compiler

@tamasgal
Copy link
Member

OK you are right! The 10 byte offset usually comes from std::vector. Here we don't have any header, we simply know the shape and eltype

@Moelf
Copy link
Member

Moelf commented Apr 29, 2022

@briederer I've updated this PR:

can you try it with your real data? I've added tests for the sample file you provided

@briederer
Copy link
Author

wait, the file you provided is not Nooffsetjagg, it's Nojagg right? Mentally I treat a Tuple() as a single Object, because here the offset bytes seems empty

What I meant by "it is a NooffsetJagg" was that the current version of UnROOT treats it like this type, not that it is meant to be that Jaggtype.
Sorry for the confusion in this case (:

Re test with real data: will do it in a second.

@briederer
Copy link
Author

briederer commented Apr 29, 2022

All that I've tested now works like a charm 🥳

Only weird behavior I saw is that when I call the following in my script:

file = UnROOT.samplefile("issue165_multiple_baskets.root")
file["arrays/n"]  # independent of the chosen branch

that additionally to usual output a single colon : is printed to the julia-REPL. This does not happen when I use the LazyTree() though. I guess this line:

println(": ")

should specify the IO where to print the colon (;

Apart from that I think this issue is resolved. I only have one more feature request related to that issue, but this may be added separately, since it may be involved. (;

Again thanks for the great and fast work.

@Moelf
Copy link
Member

Moelf commented Apr 29, 2022

ah nice catch, ok I will merge and tag a release soon

@tamasgal
Copy link
Member

Nice work as alwasy :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants