Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Experimental implementation of caching mechanism #66

Closed
wants to merge 32 commits into from

Conversation

jonasteuwen
Copy link
Contributor

Implemented a preliminary version of a caching algorithm.

Currently you need to do dlup wsi downsample <> with the same settings as what you want to downsample with. This will not be the final implementation, but I would like to hear opinions on the implementation.

Copy link
Contributor

@lromor lromor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. I think there's a lot going on (a bit too much). So I would suggest you to split the PR and add add the changes for the type aliases or other unrelated features to the caching mechanism.

The code is an interesting proposition. I like it but I maybe would change the user-experience. Mostly moving the burden of creating another SlideImage object into embedding this system inside the current SlideImage, so that is opaque. I would do that by passing a possibly "default" global ImageCacher object. We could even merge the openslide object caching with this so we centralize the cachine features but we keep the granularity at the slide level.

I have a few questions for you. When the cache is created in the filesystem who owns it? is it going to be automatically deleted? Can you pre-generate a cache and instruct the SlideImage to point at it? Maybe each of these could be configuration options for the ImageCacher.

What do you think?

.gitmodules Outdated
[submodule "openslide-python"]
path = openslide-python
url = https://github.com/NKI-AI/openslide-python.git
# [submodule "openslide"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just remove it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright.

dlup/__init__.py Outdated
from ._image import SlideImage


from ._image import CachedSlideImage, SlideImage
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably include the caching as an optional feature of SlideImage. No need to create another class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, was to see how to do it. Will do in next version.

@@ -0,0 +1,219 @@
# coding=utf-8
# Copyright (c) dlup contributors
import abc
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consistently group and order the imports

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The imports feel a bit broken

dlup/_cache.py Outdated
) -> PIL.Image.Image:
"""..."""

# @abc.abstractmethod
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's commented remove it

dlup/_cache.py Outdated
)


class AbstractImageCache(abc.ABC):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add some docs documenting the intent of the abstract interface. I like the name, but maybe it's bit too generic. If I understood correctly the purpose maybe something like ScaleLevelCache?

from dlup.utils.imports import PYVIPS_AVAILABLE
from dlup.utils.types import GenericFloatArray, GenericIntArray, GenericNumber, PathLike

if PYVIPS_AVAILABLE:
Copy link
Contributor

@lromor lromor Jan 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should move away from vips and do our own thing, less dependencies and tailored behavior for now it's fine, but I would keep libvips as a full-fledged dependency. For now we really need it for caching purposes.

AbstractSlide = Union[openslide.AbstractSlide]


class SlideReaderBackend(Enum):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see it being used anywhere, maybe delete it?



class _SlideImageRegionView(RegionView):
"""Represents an image view tied to a slide image."""

def __init__(self, wsi: _TSlideImage, scaling: _GenericNumber, boundary_mode: BoundaryMode = None):
def __init__(self, wsi: _TSlideImage, scaling: GenericNumber, boundary_mode: BoundaryMode = None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes should stay on a different PR

"""Returns the objective power at which the WSI was sampled."""
return int(self._openslide_wsi.properties[openslide.PROPERTY_NAME_OBJECTIVE_POWER])
try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a test to reflect this behavior so we remain consistent.

dlup/_image.py Outdated
# ImageCacher.__init__(self, original_filename=wsi._filename)
# print(ImageCacher)
self._cache_directory = None
self._cacher = ImageCacher(original_filename=wsi._filename)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of the ImageCacher, I think we should add that to the SlideImage so a user can configure if and how the slide can be cached. It makes it easier for testing and also how granularly you can setup a filesystem cache.

@lromor lromor marked this pull request as draft January 20, 2022 14:02
Copy link
Contributor

@lromor lromor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't look at everything, the annotations part seems relatively tidy but I would prefer to review it on it on a separate occasion. Regarding the cache implementation I think it works but I would make the user-interface much simpler. Somehow I would completely remove the access of any caching-logic from the user and give it only the minimal amount of necessary knobs. This is to minimize confusion and misuse. I would just keep everything behind SlideImage and add options to that.

Maybe someone else could give feedback on how they would like to approach this issue. Do you have any suggestions of what you would like to see?

return slide_image.read_region(location, 1.0, size)

@property
def cache_lock(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's not needed no need to add it.

"""Cache lock, is not needed for Tiffs (they cannot be written on the fly)."""
return None

def get_cache_for_mpp(self, mpp: float) -> SlideImage:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Retrieveg a cache object based on the mpp seems to be a brittle implementation due to floating point representation. What if the mpp is 0.3213231, what if it's 0.3213232?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could add either an absolute error threshold or define a power of 2 mpp values.
For instance, we could support caching for mpps 1.0, 0.5, 0.25, etc.. or even make it more dense.

image.close()


def create_tiff_cache(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could encapsulate create_tiff_cache and get_cache_for_mpp behavior inside some other function. I don't see why a user should know about these two functions. Ideally a user should create a caching object, initialize it from a path of pre-stored caching tiffs, and that's it.

writer.from_iterator(_local_iterator(), filename, total=len(grid))


def image_cache(func):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting idea to use a decorator. The issue is that i don't see it reused very often. I would definitely would embed all this logic just inside SlideImage.read_region().

comment = self._openslide_wsi.properties.get(openslide.PROPERTY_NAME_COMMENT, None)
mpp_x, mpp_y = _read_dlup_wsi_mpp(comment)
# If it is still none you can raise.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove space and comment?

@@ -0,0 +1,194 @@
# coding=utf-8
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put in another separate PR?

"""Base writer class"""


class TiffImageWriter(ImageWriter):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels a bit of re-implementing the underlying pyvips interface but without adding anything to it. Is this extra layer necessary?

from dlup.data.dataset import TiledROIsSlideImageDataset
from dlup.tiling import Grid, TilingMode

# def test_dataset_equality():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove commented code

@pytest.mark.parametrize("regions", [(0, 0), (512, 512)])
def test_cache_correctness(self, regions):

INPUT_FILE_PATH = "/processing/j.teuwen/TCGA-5T-A9QA-01Z-00-DX1.B4212117-E0A7-4EF2-B324-8396042ACEC1.svs"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hardcoded paths?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants