Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

yield_checksumfiles function doesn't handle ChecksumFiles with file paths as their name correctly #666

Closed
mvandenburgh opened this issue Jan 17, 2022 · 7 comments · Fixed by #667
Assignees
Labels
bug Something isn't working

Comments

@mvandenburgh
Copy link
Contributor

To reproduce, run this management script and then try to view the Raster that it creates. The tile requests will fail with 500 errors with errors about wrong file paths in the server log.

I came across this bug when working on Danesfield. On that project we encode the file path of a given ChecksumFile inside the name column - for example, foo/bar/foobar.txt represents a file hierarchy of

foo/
  bar/
    foobar.txt

This line in RGD is incompatible with this approach, since it assumes the ChecksumFile name is a flat file name instead of a file path. For example, if we have a ChecksumFile with the name foo.tiff, the current code would work:

path = '/tmp/rgd/file_cache/foo.tiff'
with yield_checksumfiles([self], path.parent):
    yield path

Note that path.parent would be /tmp/rgd/file_cache/foo.tiff, which is correct. But if the file name is a/b/c/foo.tiff:

path = '/tmp/rgd/file_cache/a/b/c/foo.tiff'
with yield_checksumfiles([self], path.parent):
    yield path

the call to path.parent would evaluate to /tmp/rgd/file_cache/a/b/c, which is incorrect and leads to 500 errors since the server can't find the files in that location.

@mvandenburgh mvandenburgh self-assigned this Jan 17, 2022
@mvandenburgh
Copy link
Contributor Author

mvandenburgh commented Jan 17, 2022

I have this fixed on a local branch, once i add a regression test I'll open a PR for it.

@mvandenburgh mvandenburgh added the bug Something isn't working label Jan 17, 2022
@banesullivan
Copy link
Contributor

@mvandenburgh and I talked offline, @mvandenburgh is going to try to come up with another example for me to think about this from another angle

@mvandenburgh
Copy link
Contributor Author

@banesullivan I looked more into this a bit, and there does indeed seem to be a bug, but I think my original theory of what was causing it was wrong; I still have some investigation to do, but so far it seems like the messed up file path happens during file ingestion, not during download. Unfortunately I messed around with the caching too much on my local environment and seem to have completely broken it, so I'll need to rebuild my docker setup from scratch - I can carve out some time tomorrow to do that and hopefully conclude this.

@mvandenburgh
Copy link
Contributor Author

I think I've figured out the problem - it's coming from the .touch() call in yield_checksumfiles at https://github.com/ResonantGeoData/ResonantGeoData/blob/main/django-rgd/rgd/models/utils.py#L137-L139. TLDR; it seems like we're not creating the appropriate directory structure within the cache directory for ChecksumFiles that have a file path encoded in their name.

I think what's happening is this -

  1. We create a ChecksumFile, chksum_file, with a name of foo/bar.tiff.
  2. chksum_file.download_to_local_path() is called, so the caching service creates a directory to download the file to, say, /tmp/rgd/file_cache/f-40.
  3. The yield_checksumfiles method is called with directory set to ChecksumFile.get_cache_path()
    • The get_cache_path method concatenates the cache directory with the name of the checksum file, i.e. it returns
      '/tmp/rgd/file_cache/f-40' / 'foo/bar.tiff'.
  4. Now we get to the code I linked above:
# Touch the directory to update the mtime
directory = Path(directory)
directory.touch()

And directory is set to '/tmp/rgd/file_cache/f-40/foo/bar.tiff' as stated before. The foo/ directory doesn't exist (since it is a part of the chksm_file name and was never explicitly created), so instead of updating the mtime of the directory it will create a new, 0-byte file /tmp/rgd/file_cache/f-40/foo. So, a simple solution to this would be to have a directory.mkdir(parents=True, exist_ok=True) right above the directory.touch() line (although I'm unsure if that's the appropriate place to solve this, maybe someone with a better understanding of the caching system knows a better place).

Does that make sense @banesullivan?

@banesullivan
Copy link
Contributor

Aha, thanks for looking into this further and finding the culprit of this. I will review your PR and help to land this fix

@mvandenburgh
Copy link
Contributor Author

mvandenburgh commented Jan 28, 2022

Well, I put in the fix I suggested and added some tests for this edge case, and they fail. It seems like I was actually right the first time (:sweat_smile:), and the bug is related to the yield_local_path method. But, there's a much simpler explanation then what I wrote before -

@contextmanager
def yield_local_path(self, try_fuse: bool = True, yield_file_set: bool = False):
	...
    # Fallback to loading entire file locally - this uses `get_temp_path`
    logger.debug('`yield_local_path` falling back to downloading entire file to local storage.')
    path = self.get_cache_path()
    if yield_file_set and self.file_set:
		...
    # Not in file_set. Download to cache dir
    from .utils import yield_checksumfiles

    with yield_checksumfiles([self], path.parent):
        yield path

Say we call this on a ChecksumFile with a name of a/b/c.txt. path = self.get_cache_path() will set path to /tmp/rgd/file_cache/{pk}/a/b/c.txt.
Then, when we get to the yield_checksumfiles context manager, the path.parent argument will evaluate to /tmp/rgd/file_cache/{pk}/a/b. Then, it will get to this line and concatenate the file name (a/b/c.txt) with this directory, yielding a/b/a/b/c.txt.

So there's two possible solutions - one is to do what I did originally, which is to strip the ChecksumFile.name from the directory and pass in /tmp/rgd/file_cache/{pk}/ as the directory argument in yield_checksumfiles(). Alternatively, we could pass /tmp/rgd/file_cache/{pk}/a/b as the directory argument, and change this line to extract only the base file name (i.e. change dest_path = Path(directory, self.name) to dest_path = Path(directory, self.name.split('/')[-1]

@banesullivan I believe you wrote most/all of the file caching code, do you have any thoughts on which approach makes sense?

@banesullivan
Copy link
Contributor

@mvandenburgh, I have pushed to your #667 in a169f58 and e6dd7d5 which should fix this issue.

The problem was that yield_local_path/yield_checksumfiles were mounting nested files under the wrong base/root directory by not handling the .parent as you pointed out for:

with yield_checksumfiles([self], path.parent):

a/b/c.txt would have been mounted at /tmp/rgd/file_cache/{pk}/a/a/b/c.txt. The changes in a169f58 resolve this and make sure that yield_checksumfiles is always passed the root directory /tmp/rgd/file_cache/{pk}/ such that:

file.download_to_local_path(directory=directory)

Is mounting under the root cache directory and handles nesting the file with its relative path in the name field correctly

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants