Opening an issue to centralise discussion around supporting the proposed TRX format (tee-ar-ex/trx-python#1; also linking to tee-ar-ex/trx-python#9) (or something similar, or whatever it evolves into) within MRtrix3. Will bear some relation to #411, but discussion here will be more specific to how the novel functionalities of this format might be handled, with the expectation that any other formats supported as part of #411 would utilise a subset of such functionality.
Firstly consider a pass-through operation, e.g. `tckedit` with TRX input and TRX output but no other options. Data need to go through the multi-threaded queue whilst retaining an arbitrary number of data-per-vertex (DPV) entries and an arbitrary number of data-per-streamline (DPS) entries.
The approach that would provide maximal encapsulation would be to explicitly load all data for each streamline into an instance of the `Tractography::Streamline` class, additionally retaining the name of each file from which the DPV / DPS data were drawn (so that the Sink functor knows where to write them). This however involves a lot of data loading, some or all of which may never be used. There's also the complexity of the mapping: in the most general fully-encapsulated case, the Sink functor would need to cross-reference the set of output DPS / DPV files it has constructed against a list of strings contained within the `Tractography::Streamline` instance in order to figure out how to map the various DPS / DPV data contained therein to the output filesystem structure.
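For concreteness, a minimal sketch of what such a fully-encapsulated object might carry (names and layout purely illustrative, not the actual `Tractography::Streamline` interface):

```cpp
// Illustrative only: a streamline that carries its own copies of all
// associated data, keyed by source file name so that the Sink functor
// knows where each value needs to be written.
#include <array>
#include <map>
#include <string>
#include <vector>

struct EncapsulatedStreamline {
  std::vector<std::array<float,3>> vertices;       // from positions.floatXX
  std::map<std::string, std::vector<float>> dpv;   // e.g. "dpv/curvature.float32" -> one value per vertex
  std::map<std::string, float> dps;                // e.g. "dps/weight.float32"    -> one value per streamline
};
```

The Sink functor would then have to match those map keys against its own set of output files, which is exactly the string-matching overhead described below.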
This all gets even more complex once you consider that `tckedit` could accept multiple TRX inputs, or that the pipe functor may want to perform some operation based on DPS / DPV data and needs an efficient way to determine where to read the relevant data from. E.g. if the pipe functor needs to select streamlines based on the contents of file `dps/variable.float32`, you don't want that pipe functor performing, for every single streamline, a string comparison between `variable.float32` and the source file names of every DPS file just to figure out which index of the DPS array to read from.
If we relax the demand for complete separation of data between functors, we can maybe keep some of the benefits of the format specification. One could define a shared class per TRX instance, which would set up the memmaps to the various DPS / DPV arrays (along with just one instance of the mapping between source file names and array indices), and provide access to individual streamline data on request only. Each `Tractography::Streamline` instance would need to store a `std::shared_ptr<>` to this shared class instance to deal with the prospect of `tckedit` receiving multiple TRX files as input (i.e. it couldn't be a truly "global" shared structure). If e.g. `tckedit` wants to select streamlines based on some parameter, or a command wants to utilise SIFT2 weights (which is really just one specific instance of a DPS), the pipe functor would want to determine, for a given TRX shared class instance, which e.g. DPS index corresponds to the parameter in which it is interested.
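Roughly, something along the lines of the following (all names hypothetical):

```cpp
// Hypothetical sketch of a "shared class per TRX instance": the name -> column
// lookup is resolved once per TRX instance, and each streamline only carries
// its row index plus a shared_ptr to the shared state.
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

class TrxShared {
  public:
    // Resolve the column index for a named DPS file once, not per streamline:
    size_t dps_index (const std::string& name) const {
      return dps_name_to_column.at (name);
    }
    // Read one value given a pre-resolved column and the streamline index:
    float dps_value (size_t column, size_t streamline_index) const {
      return dps_columns[column][streamline_index];
    }
  private:
    // In a real implementation the columns would be memory-mapped dps/ files;
    // plain vectors stand in for the memmaps here.
    std::unordered_map<std::string, size_t> dps_name_to_column;
    std::vector<std::vector<float>> dps_columns;
};

// What each streamline going down the queue would need to carry:
struct StreamlineHandle {
  size_t index;                          // row into the per-streamline / per-vertex arrays
  std::shared_ptr<const TrxShared> trx;  // which TRX instance that row belongs to
};
```

A pipe functor interested in e.g. `dps/variable.float32` would then call `dps_index()` once at construction, and per streamline only perform integer indexing via `dps_value()`.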
The difficulty with this approach is that, while it's tailored for TRX, it's complicated by the desire to also support both `.trk` and piping of streamlines data. For those, construction of such a "shared" class isn't appropriate, as all data for each streamline are read in one go. One could define a sort of base class for "streamline-associated data", with derived classes branching depending on whether such data are stored contiguously per streamline (i.e. `.trk`, piping) or contiguously per variable (TRX). Though this would mean polymorphism per streamline where previously there was none.
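As a sketch of that branching (again, purely hypothetical names):

```cpp
// One possible shape for a "streamline-associated data" base class, with one
// derived class per storage layout.
#include <cstddef>
#include <vector>

class StreamlineData {
  public:
    virtual ~StreamlineData() = default;
    // Value of a given per-streamline variable for a given streamline:
    virtual float dps (size_t variable, size_t streamline) const = 0;
};

// .trk / piping: all data for one streamline arrive contiguously,
// so they are stored per streamline:
class PerStreamlineData : public StreamlineData {
  public:
    float dps (size_t variable, size_t streamline) const override {
      return values[streamline][variable];
    }
  private:
    std::vector<std::vector<float>> values;   // [streamline][variable]
};

// TRX: each variable lives in its own (memory-mapped) file,
// contiguous across streamlines:
class PerVariableData : public StreamlineData {
  public:
    float dps (size_t variable, size_t streamline) const override {
      return columns[variable][streamline];
    }
  private:
    std::vector<std::vector<float>> columns;  // [variable][streamline]
};
```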
The export of TRX specifically presents its own problems as well. There may have to be some compromises here between performance, robustness to premature termination, etc. One of the advantages of the `.tck` format is that you can append new data beyond the triple-Inf EOF delimiter, and only upon completion of that write change the triple-Inf to a triple-NaN; so the file can't really be found in an invalid state if the program is terminated at any point, it's merely a question of whether or not the most recently buffered data will be read from the file. You can also continue growing the file indefinitely; there's no need for knowledge of size or indeed even streamline count.
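For illustration, that append pattern amounts to something like the following (heavily simplified, not the actual MRtrix3 writer code):

```cpp
// Simplified illustration of the .tck append strategy described above: new
// data are written beyond the existing triple-Inf end-of-file marker and
// terminated with a fresh triple-Inf; only once those data are on disk is the
// old triple-Inf overwritten with a triple-NaN, so a reader never encounters
// a file lacking a valid EOF marker.
#include <cstdio>
#include <limits>
#include <vector>

// Write three identical float values (a "triple") at the current position:
static void write_triple (std::FILE* file, float value)
{
  const float triple[3] = { value, value, value };
  std::fwrite (triple, sizeof (float), 3, file);
}

// 'eof_offset' is the byte offset of the current triple-Inf EOF marker;
// 'data' is the new block of vertex / delimiter triples to append.
// Returns the byte offset of the new EOF marker.
long append_block (std::FILE* file, long eof_offset, const std::vector<float>& data)
{
  // 1. Append the new block, and a new EOF marker, beyond the existing marker:
  std::fseek (file, eof_offset + 3 * long (sizeof (float)), SEEK_SET);
  std::fwrite (data.data(), sizeof (float), data.size(), file);
  const long new_eof_offset = std::ftell (file);
  write_triple (file, std::numeric_limits<float>::infinity());
  std::fflush (file);

  // 2. Only now convert the old EOF marker into a NaN delimiter, making the
  //    appended block visible to readers:
  std::fseek (file, eof_offset, SEEK_SET);
  write_triple (file, std::numeric_limits<float>::quiet_NaN());
  std::fflush (file);

  return new_eof_offset;
}
```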
With TRX, there are not only `offsets.uint32` and `positions.floatXX` but also various DPV and DPS data that may be written concurrently.
My suspicion is that we'll have some form of buffered writer, with buffers allocated for all associated data; upon filling the buffer, we close the memmaps, re-allocate larger memmaps to accommodate the new size, and write the buffered data in the safest order: DPS / DPV, then positions, and finally offsets. Immediately writing `std::numeric_limits<offset_t>::max()` to the first new entry in `offsets`, and only updating that entry with the true offset last, may provide some corruption detection, though this may need to be written into the specification and would still theoretically be prone to a race condition.
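In pseudo-C++ the flush might look something like this (plain vectors standing in for the re-allocated memmaps, and the resizing step elided; everything here is a sketch, not a proposal for the actual implementation):

```cpp
// The key point is the ordering: sentinel offsets first, then DPS / DPV,
// then positions, and the true offsets last.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

using offset_t = uint64_t;
constexpr offset_t sentinel = std::numeric_limits<offset_t>::max();

struct TrxArrays {             // stand-ins for the memory-mapped output files
  std::vector<float>    positions;
  std::vector<offset_t> offsets;
  std::vector<float>    dps;   // one shown; in practice one array per DPS / DPV file
};

void flush (TrxArrays& out,
            const std::vector<float>& buffered_positions,
            const std::vector<offset_t>& buffered_offsets,
            const std::vector<float>& buffered_dps)
{
  // 1. Reserve the new offsets entries with a sentinel value, so that a
  //    partially-flushed file is detectable (this would need to be part of
  //    the specification to be of any use to other readers):
  const size_t first_new_offset = out.offsets.size();
  out.offsets.resize (first_new_offset + buffered_offsets.size(), sentinel);

  // 2. Write the associated data, which are meaningless without valid offsets:
  out.dps.insert (out.dps.end(), buffered_dps.begin(), buffered_dps.end());

  // 3. Write the vertex positions:
  out.positions.insert (out.positions.end(), buffered_positions.begin(), buffered_positions.end());

  // 4. Finally replace the sentinels with the true offsets,
  //    "committing" the new streamlines:
  std::copy (buffered_offsets.begin(), buffered_offsets.end(),
             out.offsets.begin() + first_new_offset);
}
```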
There's also the prospect of the output TRX being a ZIP file rather than a directory. Personally I'd advocate doing an initial write to a directory in the target location, designated as an MRtrix temporary, and then doing the ZIP conversion & directory deletion in a separate step. There's no guarantee that the streamlines data being processed will fit in RAM, or indeed even in `/tmp/` if that's a RAM filesystem. Alternatively, do what the Python scripts do and place the scratch directory in the current working directory, unless overridden by config file variables.
Of course all of this is only dealing with DPV and DPS, not groups or DPG. Those are higher-level functionalities that will require greater thought. E.g. if `tckedit` receives a group index file as input, that may instruct the Reader class to only bother loading the streamlines corresponding to that group. For something like DPG, the Shared class instance could deal with the mapping from streamline index to group assignment and therefore yield the data value corresponding to that group (assuming a trivial mapping between streamlines and groups). But retaining such group information at write time in a pass-through operation may be in the too-hard basket.
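For the DPG case, the lookup that such a Shared class would need to provide is trivial (hypothetical sketch, assuming each streamline belongs to exactly one group):

```cpp
// Resolving a per-group (DPG) value for an individual streamline via a
// streamline -> group mapping held once per TRX instance.
#include <cstddef>
#include <cstdint>
#include <vector>

class TrxGroups {
  public:
    // DPG value for the group to which a given streamline belongs:
    float dpg_value (size_t streamline_index) const {
      return dpg[group_of_streamline[streamline_index]];
    }
  private:
    std::vector<uint32_t> group_of_streamline;  // derived from the groups/ index files
    std::vector<float>    dpg;                  // one value per group, from a dpg/ file
};
```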
Personally I see these functionalities as only being of utility once the streamline offsets and positions data are fixed, and one is instead interacting with the files contained within a single TRX instance, rather than treating the whole TRX instance as a streamlines tractography dataset going through generic read / write handlers. But if I'm overly simplifying or overlooking a crucial use case do say so.