Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes globbing issues #2404

Open
wants to merge 41 commits into
base: main
Choose a base branch
from

Conversation

Tomaz-Vieira
Copy link
Contributor

@Tomaz-Vieira Tomaz-Vieira commented Feb 26, 2021

Currently ilastik will not allow using files whose names can be interpreted as globs (ar colon-separated lists of files). This problems permeates surprisingly deep into the codebase, affecting all stack reader operators, (de)serialization and file selection everywhere (data selection applet and GUI, command line parsing, batch applet, importing labels, etc).

This PR attempts to put some order into things, and move all globbing to a single place. Also, globbing will only happen when reading user input, and nowhere else.

DataUrl, DataPath and Dataset

This PR creates the DataUrl class and, most importantly, its subclass, DataPath. DataPath is meant to feel a like python's builtin Path, but able to .glob() and check existence (.exists()) even when using internal paths to archive files like .h5, .n5 and .npz.

A group of DataPaths is encapsulated into a Dataset class. A Dataset is what the user selects when he fills a role of an ilastik Lane. It can be a stack - which is a Dataset with multiple DataPaths - or it could be a single file - which is a Dataset with a single DataPath. In the future, Dataset will be further generalized to a group of DataUrls to encompass remote files and precomputed chunks).

All path strings provided by the user are to be converted into Datasets as soon as possible. All DataPaths in a Dataset are guaranteed to exist. Also, NO DataPath inside a Dataset will be a globstring, even if its path has colons, brackets or other funny symbols. FilesystemDatasetInfo no longer deals with globbed, colon-separated strings, but rather takes a fully expanded Dataset. Because of that, a lot of path handling had been removed from FilesystemDatasetInfo.

Data selection

There is a new implementation of the stack selection dialog, which now produces Datasets instead of strings. The machinery for selecting multiple lanes via patterns or full directory is also in place, but inactive. We can activate it if we decide to close #2283

(De)serialization of stacks

Serialization of stacks no longer relies on joining paths with os.pathsep. This is still done for backwards compatibility, but now each path of the stack is saved as an item of a list.

The logic for finding missing files has been integrated with the Dataset logic and the new stack selecting dialog and can handle missing stacks, so that we are one step closer to NOT internalizing stacks all the time.

fixes #2391

@codecov
Copy link

codecov bot commented Mar 1, 2021

Codecov Report

Merging #2404 (91fd130) into master (e0edb89) will increase coverage by 0.11%.
The diff coverage is 60.02%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2404      +/-   ##
==========================================
+ Coverage   52.00%   52.11%   +0.11%     
==========================================
  Files         519      520       +1     
  Lines       60243    60386     +143     
  Branches     8297     8298       +1     
==========================================
+ Hits        31327    31469     +142     
+ Misses      27239    27238       -1     
- Partials     1677     1679       +2     
Impacted Files Coverage Δ
...orkflows/voxelSegmentation/voxelSegmentationGui.py 0.00% <0.00%> (ø)
...k/applets/dataSelection/dataSelectionSerializer.py 57.89% <4.16%> (-3.22%) ⬇️
ilastik/widgets/stackFileSelectionWidget.py 27.63% <27.66%> (+14.66%) ⬆️
...lets/pixelClassification/pixelClassificationGui.py 49.64% <33.33%> (+0.09%) ⬆️
ilastik/applets/labeling/labelingImport.py 12.88% <50.00%> (+0.26%) ⬆️
...orkflows/tracking/manual/manualTrackingWorkflow.py 62.85% <50.00%> (+0.35%) ⬆️
ilastik/applets/dataSelection/dataSelectionGui.py 64.06% <52.27%> (-1.06%) ⬇️
...k/applets/dataSelection/datasetInfoEditorWidget.py 79.89% <52.38%> (+3.17%) ⬆️
...tors/ioOperators/opStreamingH5N5SequenceReaderS.py 65.81% <60.71%> (-11.82%) ⬇️
...astik/applets/dataSelection/dataSelectionApplet.py 85.61% <66.66%> (+0.19%) ⬆️
... and 23 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e0edb89...91fd130. Read the comment docs.

@k-dominik
Copy link
Contributor

I know it's a draft, so just a quick questions: the --skip-deglobbing parameter would only be applicable if there were a file with the exact name as the glob expression, right? So the situation is this, I have files that match the glob expression, but I also have this file with the same name, and I want to make sure that the single file is used, rather than the globbed ones. I wonder if the other way round wouldn't make more sense, since it is very unlikely that the situation above is the case.
So, in general, check if a file with the name from the cmd exists, if so, use it, if not, go globbing. Then for this one, pathological case, where we'd actually have both present, I would add something like a --force-globbing... But I don't know. I just think that files that have glob like file names should be more common, than having both, file names with glob like file names, and more files that match this glob expression. With the --force-globbing flag, we'd cater for the more likely case.

@Tomaz-Vieira
Copy link
Contributor Author

Tomaz-Vieira commented Mar 9, 2021

So here are our options in regards to globbing in the command line, as discussed in the previous meeting:

1. Always deglob unless --skip-deglobbing is also passed in

Pros:

  • Backwards compatible;
  • Unambiguous behavior with no magic, no special cases: it either is or isn't a glob.

Cons:

  • Potentially confusing;
  • Yet another command line flag.

2. Always deglob, no special flags. Preventing deglobbing is done by escaping the special characters.

Pros:

  • Very consistent with the shell's globbing behavior;
  • Unambiguous behavior with no magic, no special cases.

Cons:

  • Escaping globs is hard and hard to read ( glob.escape("file[1].tiff") == 'file[[]1].tiff' )

3. Check if a file with a globlike name exists, otherwise deglob

Pros:

  • It is what the user wants 95% of the time

Cons:

  • Not strictly backwards compatible
  • Magic behavior; more branches to debug, harder to specify, more things to explain in the documentation and in help emails;
  • Pathological cases like having 4 files named file[123].txt, file1.txt, file2.txtand file3.txt and it being impossible to specify a stack of the files file1.txt, file2.txt and file3.txt

4. Deglob only if --stack-along is also present

Pros:

  • Incentivises a sensible usage of --stack-along i.e., to always use it for stacks

Cons:

  • Not backwards compatible (--stack-along is optional now)
  • One could argue that an image of shape {x: 100, y: 100} is a special/degenerate case of a 3D volume with z: 1, and that expanding a glob to a list of single file (like in the shell) is the correct behavior.

What do you guys think?

@Tomaz-Vieira Tomaz-Vieira marked this pull request as ready for review March 15, 2021 09:24
@m-novikov
Copy link
Contributor

My vote is for 3rd option try filename and deglob.

@k-dominik
Copy link
Contributor

Option 3 ftw ;)

@emilmelnikov
Copy link
Member

emilmelnikov commented Mar 18, 2021

I vote for 2, but disable glob expansion by setting a special environment variable instead of escaping and/or passing skip_deglobbing parameters around. This is not pretty, but it is very rarely required anyway.

Nevertheless, it's probably a good idea to check if a file with the name identical to a glob pattern exists, and, if this happens, issue a warning with the message "if you actually want to pass this file, please set this environment variable".

@k-dominik
Copy link
Contributor

k-dominik commented Mar 25, 2021

option votes
1 0
2 1
3 2
4 0

so far it looks like option 3, but @Tomaz-Vieira has sneakily not voiced a strong opinion yet.

@Tomaz-Vieira Tomaz-Vieira force-pushed the fixes_globbing_issues branch 3 times, most recently from 2a741ff to f09f72a Compare April 23, 2021 22:58

from lazyflow.utility.pathHelpers import lsH5N5, globH5N5, globNpz

# pyright: strict
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like you introduce new tooling here. Is it comparable to mypy? Is it better?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's... complementary to mypy, I guess. It does some things better, some things worse. Having that comment there shouldn't hurt anyone, but I can mark that file to be strictly checked elsewhere (outside of the source code) if it's bothering you =)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you check other parts of the code without the strict option? Or did you only check this one file at all? When would you have non-strict checks (strict sounds like the right thing to do? :D)...

In any case, maybe that is better suited for pyproject.toml?

raise ValueError(f"Could not convert {value} to a valid Scheme")

@classmethod
def contains(cls, value: str) -> bool:
Copy link
Contributor

@k-dominik k-dominik Apr 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method seems to be used only once, and there it could also be done by constructing in a `try... except``` with raise from there...

I think it's a bit weird that this enables:

>>> x = Scheme("http")
>>> x.contains("https")
True

@k-dominik
Copy link
Contributor

Heyo, got a quick glimpse at the code which looks great and then decided, before going into details there, to look at it from a user perspective...

Some of the following issues might have been there before...

the updated stacking dialog:

  1. it is easy to get into an inconsistent state there. If I select some files, then decide to do some globbing, unsuccessfully, it will still show the old files.
  2. Why is there the string "New Dataset" in the list field? It could be in the label that says "Selection" right now.
  3. now that sequence axis is a drop down, it should probably not have a default value. In my experience users might overlook it
  4. The field for the separator is very long. Are any separators supported? so example myverylongcustomseparator? Also, is this ever useful? What is the use-case? I think this complicates more than it helps (but of course am open to be proven wrong :)) Wasn't the goal to mimic shell behavior here?
  5. the pattern field will correctly not accept something like /this/is/my/folder (with folder being a folder) but will happily accept /this/is/my.h5/group, where group is a group.
  6. adding a single file to the stack, will open an image, but not add the singleton axis. I'd expect to either
    • not open single images there
    • add the appropriate singleton axis
  7. While we are at it, there should be a note in the stack import that this will save the data to the project file
  8. import is "silent", no progress, no indication of busyness
  9. the layout with the buttons could be improved: the "ok" button stretches weirdly
    stack

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

File names with square brackets cannot be opened Load h5 datasets as separate volumes in one go
4 participants