Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Running preprocessing pipeline on Tiles object #332

Open
PaulScemama opened this issue Oct 14, 2022 · 15 comments
Open

Running preprocessing pipeline on Tiles object #332

PaulScemama opened this issue Oct 14, 2022 · 15 comments
Labels
enhancement New feature or request

Comments

@PaulScemama
Copy link

PaulScemama commented Oct 14, 2022

I begin at the Tile level. That is I am beginning with regions extracted from a WSI where these regions are in the form of .pngs. I want to run preprocessing on these Tiles in a parallel fashion. From what I've seen there is only this parallel preprocessing done for a slide object (slide.run()).

Describe the solution you'd like
I think an alternative to running a preprocessing pipeline on the slide level would be to have a way to run a preprocessing pipeline on the Tiles level. That is, I have a bunch of Tiles in a Tiles instance. I then do something like Tiles.run() and it runs a parallel preprocessing.

Describe alternatives you've considered
Of course, I could do a for loop across all the tiles and sequentially apply transforms/preprocessing/etc.

Additional context
I think this would be nice functionality to add for user flexibility. I am willing to try and implement it myself but wanted to get the thoughts of other contributors. Is this worthwhile? Is this pointless? Is there a better way?

Thanks in advance!

@PaulScemama PaulScemama added the enhancement New feature or request label Oct 14, 2022
@PaulScemama PaulScemama changed the title Running preprocessing pipeline on Tiling object Running preprocessing pipeline on Tiles object Oct 14, 2022
@jacob-rosenthal
Copy link
Collaborator

Hi @PaulScemama , thanks for the suggestion!
I think this would definitely be a worthwhile feature to add if you are able! My first thought is that you could create a new class inheriting from SlideData and just change the logic for some of the methods. For example, .generate_tiles() would just need to yield the tiles you already have, instead of reading them from an image via a backend. Then, the other methods like run() should ideally work with minimal changes. Hope this makes sense.
I'm happy to help by code reviews or answering questions if you want to contribute. Thanks!!

@PaulScemama
Copy link
Author

@jacob-rosenthal definitely! That makes a lot of sense and I will give it a go. This will be my first opensource contribution so bear with me :). Btw just came upon this package a week ago and really enjoy the documentation, effort put into readability, etc. so thank you!

I'll start working on it shortly.

@PaulScemama
Copy link
Author

PaulScemama commented Oct 15, 2022

@jacob-rosenthal just ran python -m pytest -m "not slow" from CONTRIBUTING.rst after following the local set up instruction and received one exception:

Failed to start Java VM

Do you have any idea how to resolve this; or where I may have done something wrong?

Edit: additionally, this is what I get when I run java -version

openjdk version "1.8.0_152-release"
OpenJDK Runtime Environment (build 1.8.0_152-release-1056-b12)
OpenJDK 64-Bit Server VM (build 25.152-b12, mixed mode)

@PaulScemama
Copy link
Author

@jacob-rosenthal new question (and let me know if there is a better place to pose these questions for you).

SlideData has one required argument -- filepath. I want to inherit from SlideData, but instead of filepath as a required argument I want it to be directory_path as instead of a path to a file (slide), I have a path to a directory of images that represent regions from a slide. However, I imagine I want to call super().__init__(**kwargs). This passes through SlideData arguments and gets hung up on wanting to see filepath as an argument to my child class. But I don't have a meaningful filepath to give it!

Any thoughts?

@jacob-rosenthal
Copy link
Collaborator

Maybe consider just writing init from scratch? Since your case is pretty specific, you probably don't need all the logic used when initializing SlideData, so maybe you don't need to call super().__init__(**kwargs)

For the Java VM issue, did the tests pass and you just got a message, or was it causing the testing to fail? If tests pass, probably ok to ignore other messages. Not sure what could be causing it, but first thought would be to try again in a fresh environment. Javabridge can be finicky

@PaulScemama
Copy link
Author

To the first part - that sounds good!

To the second - I will have to double check when I can get to my computer. But pretty sure a singular test failed and that was the message. It hasn't halted development on this specific task, but I'll look into it a bit more and try a fresh env.

@PaulScemama
Copy link
Author

@jacob-rosenthal two-parter (again).

  1. I'm having trouble identifying the utility of the (optional) tiles input to SlideData. The reason being...I see run() as the most important method. Calling run() when you inputted tiles optionally either overwrites these tiles or raises an exception. Why else would we input tiles?
  2. I see (maybe) a way to make the (optional) tiles input useful. It fits nicely into my original enhancement proposition. If I already have the tiles, I can create a SlideData object and instead of giving it a filepath to a slide, I would give it a Tiles object that contains my tiles. Then when I call run(), instead of overwriting or raising an exception, I preprocess these tiles directly. There are a couple other methods that wouldn't make sense in this case - namely extract_region. Perhaps if I instantiated SlideData with tiles instead of a slide filepath, one could raise an exception or meld the tiles into one large slide to then extract a region from.

Thoughts?

@jacob-rosenthal
Copy link
Collaborator

I guess the idea is that you could initialize SlideData with tiles, and then save it to h5path.

Agree that that that could be a good way to support your usecase without even changing the API. Maybe, instead of making a new class, you could just add a new method to SlideData, something like run_tiles() that is similar to run() but just pulls tiles from self.tiles instead of reading from the backend? That could also be useful for the case where someone has already run a pipeline, but then wants to run an additional processing step on the tiles.

One thing to avoid is rewriting all the logic for distributed and non-distributed processing. We want to keep that in one place, and it should be the same regardless of wherever the tiles are coming from. So might need to pull it out into new methods and call them via run() and run_tiles(). Hope that makes sense

I would say don't worry about melding tiles, throwing an informative exception should be good

What do you think? Thanks!

@PaulScemama
Copy link
Author

I agree full heartedly. I'll get on it asap and keep you posted. Thanks!

@tddough98
Copy link
Collaborator

tddough98 commented Oct 20, 2022

I had a similar use case of running further processing on existing h5path files and took a stab at a solution here #337

It's not exactly the approach described in this issue of running a pipeline directly on a Tiles object, but instead refactors SlideData to allow reading and updating tiles from an h5path file. More generally, it separates the tile generation from running pipelines, so that run can be called repeatedly to tack on transforms or can be called on tiles in an existing h5path file. This did make breaking changes by moving tile size, stride, padding, etc. from run to the SlideData constructor, but I feel like this abstraction does make sense.

@PaulScemama
Copy link
Author

Ah I should have read through that beforehand. I looked through the code once through and I like the ideas. I am still learning the codebase so I do have a question:

Would it be easier to allow as input either a filepath to a slide (.svs, etc.) or a directory to images representing Tiles (or other ideas of how to take as input a bunch of tiles). Then there are two scenarios:

  1. You inputted a filepath to a slide. In which case generate_tiles is called to populate self.tiles
  2. You inputted a directory of tiles (images). In which case self.tiles is populated from simply reading the files in.

Then you have one method - run() which runs a pipeline on self.tiles()

Then in the case that you've run a pipeline and want to run it again on the same tiles, you can either call run() again which modifies the tiles in-place, or you can keep the first set of transformed tiles by (writing to h5path, etc.) and then calling run().

Let me know if there are any flaws in my thinking?

@tddough98
Copy link
Collaborator

I think the issue with providing a directory with a bag of tile files is that it doesn't have meaningful structure as a single slide, so SlideData is probably the wrong class for a feature like that. If your tiles are coming from one slide, then you should already be able to convert that slide into a h5path file and use #337 to run different/new transforms on your tiles.

@PaulScemama
Copy link
Author

PaulScemama commented Oct 20, 2022

Edit:

If I have a bunch of .pngs that are from a single slide (which I do), I could just save them as an h5 and then use that as input to SlideData(), then run() and I'm all set? That is what you're proposing and I think that's definitely worthwhile.

So I should probably create TileData() that doesn't care if the tiles are from the same slide or not? Because Tiles() seems to only be instantiated provided the tiles are from the same slide (through passing in an h5 manager). Am I correct in thinking that?

@jacob-rosenthal
Copy link
Collaborator

I guess we need to decide the best way to handle this. It could go a couple ways for your usecase:

Option 1:

  1. Load all the tiles from disk into h5 when initializing
  2. When running preprocessing, use those existing tiles from h5

This mechanism should also support running preprocessing on any existing tiles (e.g. #337) because at that point they are already in h5 so it doesn't matter where the tiles came from (i.e. whether they were loaded from disk or created by running a pipeline previously)

Option 2:

  1. Load a backend with paths to the files
  2. When running preprocessing, read tiles from disk and send them directly to preprocessing pipeline.

This would be more similar to the mechanism for generating tiles from a backend, because it would be loading the images from disk. I think the main difference is that the tiles wouldn't be read from disk and put into h5 at the time of initializing, since it would use the file paths to read the tiles at the time of running a pipeline.

Thoughts on the best/most logical way to implement? Need to think about performance and also try to make the API as consistent as possible. @PaulScemama @tddough98

@PaulScemama
Copy link
Author

PaulScemama commented Oct 29, 2022

Edit: With my current understanding, I think Option 2 may be more logical. The reasons I think this are

  1. Users may want to create a SlideData instance from a directory of paths and not have to wait. Then once they're ready to preprocess, we read the images in and preprocess.
  2. This is more consistent with the API in that SlideData currently only generates tiles once run() is called.

My question: Do you think we should create separate backend classes that has as input a directory instead of a filename? Or should we modify the current backends to accommodate a directory input? My feeling is the latter is a better option, but I could be wrong.

Thoughts? @tddough98 @jacob-rosenthal

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants