Fixes globbing issues #2404

Tomaz-Vieira · 2021-02-26T15:04:32Z

Currently ilastik will not allow using files whose names can be interpreted as globs (or colon-separated lists of files). This problem permeates surprisingly deep into the codebase, affecting all stack reader operators, (de)serialization and file selection everywhere (data selection applet and GUI, command line parsing, batch applet, importing labels, etc.).

This PR attempts to put some order into things, and move all globbing to a single place. Also, globbing will only happen when reading user input, and nowhere else.

DataUrl, DataPath and Dataset

This PR creates the DataUrl class and, most importantly, its subclass, DataPath. DataPath is meant to feel like Python's built-in Path, but able to .glob() and check existence (.exists()) even when using internal paths into archive files like .h5, .n5 and .npz.
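To illustrate what "internal paths" means here, the following is a minimal sketch and not the DataPath implementation from this PR. It assumes POSIX-style separators; the helper names `split_archive_path` and `glob_internal` are hypothetical, and the list of internal names is passed in by hand instead of being read from a real archive.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import List, Optional, Tuple

# Archive suffixes named in the PR description.
ARCHIVE_SUFFIXES = (".h5", ".n5", ".npz")


def split_archive_path(raw: str) -> Tuple[str, Optional[str]]:
    """Split 'dir/file.h5/group/data' into ('dir/file.h5', 'group/data').

    Returns (raw, None) when no path component looks like an archive.
    Hypothetical helper, for illustration only.
    """
    parts = PurePosixPath(raw).parts
    for index, part in enumerate(parts):
        if part.lower().endswith(ARCHIVE_SUFFIXES):
            external = str(PurePosixPath(*parts[: index + 1]))
            internal = "/".join(parts[index + 1 :])
            return external, internal or None
    return raw, None


def glob_internal(pattern: str, available: List[str]) -> List[str]:
    """Match an internal-path pattern against names listed from an archive."""
    return sorted(name for name in available if fnmatchcase(name, pattern))
```

With something like this, globbing inside an archive reduces to listing its internal names once and filtering them, which is roughly what a `.glob()` on an archive path has to do.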

A group of DataPaths is encapsulated in a Dataset class. A Dataset is what the user selects when filling a role of an ilastik Lane. It can be a stack (a Dataset with multiple DataPaths) or a single file (a Dataset with a single DataPath). In the future, Dataset will be further generalized to a group of DataUrls, to encompass remote files and precomputed chunks.

All path strings provided by the user are to be converted into Datasets as soon as possible. All DataPaths in a Dataset are guaranteed to exist. Also, NO DataPath inside a Dataset will be a globstring, even if its path has colons, brackets or other funny symbols. FilesystemDatasetInfo no longer deals with globbed, colon-separated strings, but rather takes a fully expanded Dataset. Because of that, a lot of path handling has been removed from FilesystemDatasetInfo.
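The invariants above can be sketched roughly as follows. This is an illustration under assumptions, not the Dataset class from this PR: `SketchDataset` is a hypothetical name, and it uses plain `pathlib.Path` objects where the PR uses DataPath.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Tuple


@dataclass(frozen=True)
class SketchDataset:
    """Hypothetical stand-in for Dataset: a non-empty group of existing paths."""

    paths: Tuple[Path, ...]

    def __post_init__(self) -> None:
        if not self.paths:
            raise ValueError("A dataset needs at least one path")
        if len(set(self.paths)) != len(self.paths):
            raise ValueError("Repeated paths in dataset")
        # Paths are taken literally here: no globbing, no splitting on colons.
        missing = [path for path in self.paths if not path.exists()]
        if missing:
            raise FileNotFoundError(f"Missing files: {missing}")

    @property
    def is_stack(self) -> bool:
        # A stack is simply a dataset with more than one path.
        return len(self.paths) > 1
```

Because validation happens at construction time, code that receives such an object never needs to re-check existence or re-interpret brackets and colons in file names.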

Data selection

There is a new implementation of the stack selection dialog, which now produces Datasets instead of strings. The machinery for selecting multiple lanes via patterns or a full directory is also in place, but inactive. We can activate it if we decide to close #2283.

(De)serialization of stacks

Serialization of stacks no longer relies on joining paths with os.pathsep. The joined string is still written for backwards compatibility, but each path of the stack is now also saved as an item of a list.
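A minimal sketch of why a list is safer than a joined string, assuming JSON as a stand-in for the project file format (ilastik project files are not JSON; the function names are hypothetical). It only shows reading the legacy form, not writing it.

```python
import json
import os
from typing import List


def serialize_stack(paths: List[str]) -> str:
    """Store each path as its own list item, so separators in names are harmless."""
    return json.dumps(paths)


def deserialize_stack(raw: str) -> List[str]:
    """Read the list form, falling back to the legacy os.pathsep-joined string."""
    try:
        loaded = json.loads(raw)
    except json.JSONDecodeError:
        loaded = None
    if isinstance(loaded, list) and all(isinstance(item, str) for item in loaded):
        return loaded
    # Legacy form: ambiguous whenever a file name itself contains os.pathsep.
    return raw.split(os.pathsep)
```

A path containing `os.pathsep` survives a round trip through the list form, whereas the legacy form would split it into two bogus entries.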

The logic for finding missing files has been integrated with the Dataset logic and the new stack selecting dialog and can handle missing stacks, so that we are one step closer to NOT internalizing stacks all the time.

fixes #2391

codecov · 2021-03-01T10:58:32Z

Codecov Report

Merging #2404 (91fd130) into master (e0edb89) will increase coverage by 0.11%.
The diff coverage is 60.02%.

@@            Coverage Diff             @@
##           master    #2404      +/-   ##
==========================================
+ Coverage   52.00%   52.11%   +0.11%     
==========================================
  Files         519      520       +1     
  Lines       60243    60386     +143     
  Branches     8297     8298       +1     
==========================================
+ Hits        31327    31469     +142     
+ Misses      27239    27238       -1     
- Partials     1677     1679       +2

Impacted Files	Coverage Δ
...orkflows/voxelSegmentation/voxelSegmentationGui.py	`0.00% <0.00%> (ø)`
...k/applets/dataSelection/dataSelectionSerializer.py	`57.89% <4.16%> (-3.22%)`	⬇️
ilastik/widgets/stackFileSelectionWidget.py	`27.63% <27.66%> (+14.66%)`	⬆️
...lets/pixelClassification/pixelClassificationGui.py	`49.64% <33.33%> (+0.09%)`	⬆️
ilastik/applets/labeling/labelingImport.py	`12.88% <50.00%> (+0.26%)`	⬆️
...orkflows/tracking/manual/manualTrackingWorkflow.py	`62.85% <50.00%> (+0.35%)`	⬆️
ilastik/applets/dataSelection/dataSelectionGui.py	`64.06% <52.27%> (-1.06%)`	⬇️
...k/applets/dataSelection/datasetInfoEditorWidget.py	`79.89% <52.38%> (+3.17%)`	⬆️
...tors/ioOperators/opStreamingH5N5SequenceReaderS.py	`65.81% <60.71%> (-11.82%)`	⬇️
...astik/applets/dataSelection/dataSelectionApplet.py	`85.61% <66.66%> (+0.19%)`	⬆️
... and 23 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

k-dominik · 2021-03-01T14:22:58Z

I know it's a draft, so just a quick question: the --skip-deglobbing parameter would only be applicable if there were a file with the exact name as the glob expression, right? So the situation is this: I have files that match the glob expression, but I also have this file with the same name, and I want to make sure that the single file is used, rather than the globbed ones. I wonder if the other way round wouldn't make more sense, since it is very unlikely that the situation above is the case.
So, in general, check if a file with the name from the cmd exists; if so, use it, if not, go globbing. Then for this one pathological case, where we'd actually have both present, I would add something like a --force-globbing... But I don't know. I just think that files with glob-like file names should be more common than having both a file with a glob-like name and more files that match this glob expression. With the --force-globbing flag, we'd cater for the more likely case.
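The "check for the exact file first, otherwise glob" idea could look roughly like this. It is a sketch, not code from this PR: `resolve_user_path` is a hypothetical name, and the `force_globbing` parameter mirrors the `--force-globbing` flag floated above, which does not exist in ilastik.

```python
import glob
import os
from typing import List


def resolve_user_path(raw: str, force_globbing: bool = False) -> List[str]:
    """Prefer a file literally named `raw`; otherwise expand it as a glob."""
    if not force_globbing and os.path.exists(raw):
        return [raw]
    # Either no literal file exists or globbing was explicitly requested.
    return sorted(glob.glob(raw))
```

With files `file[12].txt`, `file1.txt` and `file2.txt` in one directory, the default call picks the literal file, and `force_globbing=True` is the only way to get the two-file stack.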

Tomaz-Vieira · 2021-03-09T14:11:32Z

So here are our options in regards to globbing in the command line, as discussed in the previous meeting:

1. Always deglob unless `--skip-deglobbing` is also passed in

Pros:

Backwards compatible;
Unambiguous behavior with no magic, no special cases: it either is or isn't a glob.

Cons:

Potentially confusing;
Yet another command line flag.

2. Always deglob, no special flags. Preventing deglobbing is done by escaping the special characters.

Pros:

Very consistent with the shell's globbing behavior;
Unambiguous behavior with no magic, no special cases.

Cons:

Escaping globs is hard and hard to read ( glob.escape("file[1].tiff") == 'file[[]1].tiff' )

3. Check if a file with a globlike name exists, otherwise deglob

Pros:

It is what the user wants 95% of the time

Cons:

Not strictly backwards compatible
Magic behavior; more branches to debug, harder to specify, more things to explain in the documentation and in help emails;
Pathological cases like having 4 files named file[123].txt, file1.txt, file2.txt and file3.txt, where it becomes impossible to specify a stack of the files file1.txt, file2.txt and file3.txt

4. Deglob only if `--stack-along` is also present

Pros:

Incentivises a sensible usage of --stack-along i.e., to always use it for stacks

Cons:

Not backwards compatible (--stack-along is optional now)
One could argue that an image of shape {x: 100, y: 100} is a special/degenerate case of a 3D volume with z: 1, and that expanding a glob to a list with a single file (like in the shell) is the correct behavior.
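To make the escaping in option 2 concrete, here is a small sketch using only the standard library's `glob` module. The helper names `expand` and `expand_literal` are hypothetical and exist only for this illustration.

```python
import glob
import os
from typing import List


def expand(pattern: str) -> List[str]:
    """Expand a pattern as a glob and return the matching base names."""
    return sorted(os.path.basename(path) for path in glob.glob(pattern))


def expand_literal(directory: str, filename: str) -> List[str]:
    """Treat `filename` literally by escaping its glob characters first."""
    return expand(glob.escape(os.path.join(directory, filename)))
```

In a directory holding both `file[1].tiff` and `file1.tiff`, `expand` on the unescaped name returns only `file1.tiff`, because `[1]` is read as a character class, while `expand_literal` returns the bracketed file itself.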

What do you guys think?

m-novikov · 2021-03-18T10:59:27Z

My vote is for the 3rd option: try the filename first, then deglob.

k-dominik · 2021-03-18T10:59:41Z

Option 3 ftw ;)

emilmelnikov · 2021-03-18T11:14:00Z

I vote for 2, but disable glob expansion by setting a special environment variable instead of escaping and/or passing skip_deglobbing parameters around. This is not pretty, but it is very rarely required anyway.

Nevertheless, it's probably a good idea to check if a file with the name identical to a glob pattern exists, and, if this happens, issue a warning with the message "if you actually want to pass this file, please set this environment variable".

k-dominik · 2021-03-25T16:16:15Z

option	votes
1	0
2	1
3	2
4	0

so far it looks like option 3, but @Tomaz-Vieira has sneakily not voiced a strong opinion yet.

k-dominik · 2021-04-26T08:03:29Z

ilastik/utility/data_url.py

+
+from lazyflow.utility.pathHelpers import lsH5N5, globH5N5, globNpz
+
+# pyright: strict


looks like you introduce new tooling here. Is it comparable to mypy? Is it better?

It's... complementary to mypy, I guess. It does some things better, some things worse. Having that comment there shouldn't hurt anyone, but I can mark that file to be strictly checked elsewhere (outside of the source code) if it's bothering you =)

Do you check other parts of the code without the strict option? Or did you only check this one file at all? When would you have non-strict checks (strict sounds like the right thing to do? :D)...

In any case, maybe that is better suited for pyproject.toml?

k-dominik · 2021-04-26T08:40:53Z

ilastik/utility/data_url.py

+        raise ValueError(f"Could not convert {value} to a valid Scheme")
+
+    @classmethod
+    def contains(cls, value: str) -> bool:


This method seems to be used only once, and there it could also be done by constructing in a `try ... except` with raise from there...

I think it's a bit weird that this enables:

>>> x = Scheme("http")
>>> x.contains("https")
True
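A sketch of the two approaches being compared, assuming Scheme is an Enum of URL schemes. The member list below is invented for illustration, and `is_valid_scheme` is a hypothetical helper showing the `try ... except` alternative.

```python
from enum import Enum


class Scheme(Enum):
    # Assumed members; the real list in data_url.py may differ.
    HTTP = "http"
    HTTPS = "https"
    FILE = "file"

    @classmethod
    def contains(cls, value: str) -> bool:
        # A classmethod is also reachable from members, which is what
        # makes Scheme("http").contains("https") evaluate to True.
        return any(member.value == value for member in cls)


def is_valid_scheme(value: str) -> bool:
    """Alternative: rely on the Enum constructor raising ValueError."""
    try:
        Scheme(value)
    except ValueError:
        return False
    return True
```

The second form avoids exposing a membership test on individual members, at the cost of using an exception for control flow.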

k-dominik · 2021-04-26T09:56:13Z

Heyo, got a quick glimpse at the code which looks great and then decided, before going into details there, to look at it from a user perspective...

Some of the following issues might have been there before...

The updated stacking dialog:

- It is easy to get into an inconsistent state there. If I select some files, then decide to do some globbing, unsuccessfully, it will still show the old files.
- Why is there the string "New Dataset" in the list field? It could be in the label that says "Selection" right now.
- Now that sequence axis is a drop-down, it should probably not have a default value. In my experience users might overlook it.
- The field for the separator is very long. Are any separators supported? So, for example, myverylongcustomseparator? Also, is this ever useful? What is the use case? I think this complicates more than it helps (but of course am open to be proven wrong :)). Wasn't the goal to mimic shell behavior here?
- The pattern field will correctly not accept something like /this/is/my/folder (with folder being a folder) but will happily accept /this/is/my.h5/group, where group is a group.
- Adding a single file to the stack will open an image, but not add the singleton axis. I'd expect to either
  - not open single images there
  - add the appropriate singleton axis
- While we are at it, there should be a note in the stack import that this will save the data to the project file.
- Import is "silent": no progress, no indication of busyness.
- The layout with the buttons could be improved: the "ok" button stretches weirdly.

Tomaz-Vieira added 3 commits February 26, 2021 15:59

Adds skip_globbing to datasetinfo creation

c3a60eb

Adds skip_deglobbing to batch processing applet

a03103c

Removes leftover debug code

5c97220

Adds skip-deglobbing headless tests

3f2b222

Tomaz-Vieira marked this pull request as ready for review March 15, 2021 09:24

Tomaz-Vieira added 7 commits March 22, 2021 18:11

[WIP] Creates DataPath class, but it still allows non-existent paths

eea120b

Adds tests and fixes data_path inner path globbing

c0b76ce

Adds tests for DatasetPath

e0ef088

Adds ArchiveDataPath.list_internal_paths, tests. Organizes methods

5a3762e

Fixes some type hints

1fd954b

Adds ArchiveDataPath.siblings and more tests

287fe11

Adds DatasetPath.common_internal_paths and tests

2b7ffc1

Tomaz-Vieira added 11 commits March 25, 2021 18:33

Adds some type hints to pathHelpers

da2f076

Fixes typing issues with data_path

d8498c1

ArchivePath.is_archive_path now takes Union[str,Path]

f33c1de

Renames classes. Creates PrecomputedChunksUrl. DataPath made into DataUrl subclass

f45bfab

Cleans up path extensions in DataPath

d88d337

Refines DataUrl

a234b63

Removes requirement that a DataPath be absolute

8282fbe

More refinements to DataPath

42b63ca

DataPath.glob does not raise on empty expansion

3f78e3c

StackPath checks for DataPath existence on __init__

481355f

StackPath.from_string expands user

5b1fe1b

Tomaz-Vieira added 17 commits April 16, 2021 12:55

Adds StackPath.suffixes method

9a2d073

Adds (de)serialization to StackPath

22285a9

StackPath: from_h5_data Uses StackPath.split instead of manual split

5d67db7

[WIP] Uses StackPath everywhere

1b13903

Rename StackPath to Dataset

5eb1502

Checks for repeated DataUrls in Dataset

09c3b13

Adds Dataset.internal_paths()

876c03e

Fixes Dataset.split using separator

458d3a2

Rewrites stack selection to produce Datasets

87c378d

Fixes serialization of sequence_axis and fixing of missing stacks

4b2af19

Fixes tests

d0d5546

Removes now unused stackFileSelectionWidget.ui

64e1bb0

Removes unused --skip-deglobbing from cmd line args

3012107

Fixes testPixelClassificationGui.py

47df133

Fixes legacy tests

c369eb7

Handles backslashes in DataPath

33936bc

Fixes black issue

5b290e9

Tomaz-Vieira force-pushed the fixes_globbing_issues branch 3 times, most recently from 2a741ff to f09f72a Compare April 23, 2021 22:58

More path fixes for windows

df14245

Tomaz-Vieira force-pushed the fixes_globbing_issues branch from f09f72a to df14245 Compare April 26, 2021 07:29

k-dominik reviewed Apr 26, 2021

Fixes N5DataPath.list_internal_paths backslashes

91fd130

k-dominik reviewed Apr 26, 2021

k-dominik mentioned this pull request May 20, 2021

make sure to pass project file path to dataset info #2430

Merged
