
Head (or similar) should support a timeout value #265

Closed
epa095 opened this issue Dec 17, 2018 · 7 comments
epa095 commented Dec 17, 2018

Description

We have an issue where at least one file on our datalake is broken. In the portal it has a size, but preview does not work, and "downloading" it results in an empty file.

The problem is that when trying to download it (using open) with this library, Python just hangs at 100% CPU. Using info/ls/stat etc. gives valid information, so there is no way of discovering its "broken-ness" from those. Using e.g. head file_name 1 to read a single byte also leaves Python spinning in an infinite loop at 100% CPU. It would be nice if head accepted a timeout parameter; then we could use it to detect when we are in this situation. Alternatively, do you have any other way of detecting it programmatically?
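To illustrate the kind of workaround we are otherwise left with, here is a rough sketch of client-side detection (the store name and path are placeholders; it assumes Linux, where fork lets the child process inherit the authenticated client, and it only detects the hang rather than fixing anything):

import multiprocessing

from azure.datalake.store import core, lib

# Placeholder setup: substitute your own store name, credentials and the suspect path.
token = lib.auth()  # or however you already authenticate
adl = core.AzureDLFileSystem(token, store_name='STORE_NAME')
SUSPECT_PATH = 'path/to/suspect/file'

def read_first_byte(path):
    # This is the call that currently spins at 100% CPU for the broken file.
    adl.head(path, size=1)

# Run the read in a child process so it can be killed if it hangs;
# with the fork start method on Linux the child inherits `adl`.
proc = multiprocessing.Process(target=read_first_byte, args=(SUSPECT_PATH,))
proc.start()
proc.join(timeout=30)
if proc.is_alive():
    proc.terminate()   # actually stops the stuck reader
    proc.join()
    print('file looks broken: read did not finish within 30 seconds')
else:
    print('file read fine')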

Reproduction Steps


Environment summary

SDK Version: azure-datalake-store==0.0.39

Python Version: 3.6.7, 64-bit

OS Version: Linux

Shell Type: bash


akharit commented Dec 18, 2018

@epa095
Can you elaborate on what you mean when you say that "one file on our datalake is broken"? In particular, it would be helpful if I could get the following info:

adl = ADLS_INSTANCE
print(adl.info(path_to_corrupted_file))

Can you also mention the ADLS account name, file path, and approximate time (with time zone) when you ran the info and open operations? This will allow me to look at the backend logs to figure out what exactly is happening.

Usually the backend server times requests out when necessary and the timeout propagates to the SDK, so explicit timeouts aren't needed on these functions. Here it seems like there is a bug in the SDK (possibly the result of a bug in the backend). From what I can figure out, the loop in the read function, i.e.

    while length > 0:
        self._read_blocksize()
        data_read = self.cache[self.loc - self.start:
                               min(self.loc - self.start + length, self.end - self.start)]
        out += data_read
        self.loc += len(data_read)
        length -= len(data_read)

is most probably not getting the length parameter reduced, and so goes into an infinite loop. This seems like the only pathway that would result in 100% CPU usage. I can add a break on len(data_read) == 0 to fix it, but I'll feel more confident in the fix if I can get more info about the file and the backend operations.
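Roughly the shape of the guard I have in mind (a sketch only, not the final patch):

    while length > 0:
        self._read_blocksize()
        data_read = self.cache[self.loc - self.start:
                               min(self.loc - self.start + length, self.end - self.start)]
        if len(data_read) == 0:
            # Nothing came back even though `length` bytes are still
            # expected; stop instead of looping at 100% CPU.
            break
        out += data_read
        self.loc += len(data_read)
        length -= len(data_read)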


epa095 commented Dec 19, 2018

Good work; I think you have found the reason here, especially since downloading the file from the portal also results in an empty file.

With regard to the file/account: for reasons of corporate confidentiality I am not allowed to put such information here on the open internet. I can say that the files in question become readable again after some time, and it is quite unpredictable when that happens.

I will file an issue through our premium support and come back to you (I guess they will either contact you, or I will get a support number I can share).


akharit commented Dec 19, 2018

I asked my colleagues, and there is a known issue where the length, i.e. the size, of a file is not synced up for some time after concurrent appends. I'll add a fix for that as soon as possible. If this is the same issue, a call to the OPEN operation does cause a sync-up. You can try the following as a temporary mitigation.

from azure.datalake.store import core
adl = ADLS_INSTANCE
core._fetch_range_with_retry(adl.azure, FILE_PATH, 0, 1)

The _fetch_range call should force a sync in the backend, and afterwards reads should be correct.

This is a private method, though, and there is no guarantee about its interface, so please only use it as a temporary mitigation.
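Put together, a rough sketch of the whole mitigation (assuming adl is your connected AzureDLFileSystem and FILE_PATH the affected path; again, _fetch_range_with_retry is private, so treat this as throwaway code):

from azure.datalake.store import core

# Force the backend to sync the file by issuing a small read against it.
core._fetch_range_with_retry(adl.azure, FILE_PATH, 0, 1)

# Verify that a normal read now returns data. If the reported size is still
# wrong, this read can hit the same infinite loop described above, so you may
# want to guard it with a timeout.
f = adl.open(FILE_PATH, 'rb')
first_bytes = f.read(16)
f.close()
if not first_bytes:
    print('file still looks broken after the forced sync')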

Otherwise, I'll get a new release out in a couple of days with this solved, provided the root cause I have in mind is the actual root cause.

flikka commented Dec 19, 2018

Thanks for looking into this @akharit. I am working with @epa095, and I just tried your suggestion. A call to _fetch_range_with_retry returned a 200 response, but with just b'' (empty bytes) as content. Subsequent reading of the file still fails, as it does when using the browser on portal.azure.com. There is indeed some mismatch between the reported size (24 MB) and the actual file (empty).

I used the debugger to see where the "infinite loop" was, and you are indeed correct: the while loop you refer to in core.py is where it is working its magic. So there should probably be a check for this case and an exception or similar (sketched below), to make it fail in a sensible way. We still need to figure out why this happens in the data lake in the first place, though. Great if you have ideas there too :-)
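For what it's worth, the kind of check I am imagining inside that while loop, right after data_read is computed (just a sketch; whether to raise or break is your call, either beats hanging):

        if len(data_read) == 0:
            # Read produced nothing while `length` bytes were still expected:
            # fail loudly instead of looping forever.
            raise IOError('no data returned although %d more bytes were '
                          'expected; file metadata may be out of sync' % length)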

Thanks for your help; if you want some details about the failing files you can drop me a mail at kflik@equinor.com.


akharit commented Dec 19, 2018

So there are two separate issues here. One is on the SDK side, which hangs. I can break the loop on the condition len(data_read) == 0 to fix that, but it won't solve the underlying issue of why the file does not report the correct size.

I am not sure I can do anything about that at the SDK layer. The root cause I had in mind should have been resolved by sending an 'OPEN' operation on the file path, which is what _fetch_range was meant to do. I guess the cause is different here.

@epa095 @flikka It would be better if you can create an issue through support. Please mention my ID (i.e. akharit) to the support representative so it comes to me faster, along with the file path, ADLS account, and time, and I can check the backend logs and pass it on to whichever internal component is causing problems.

akharit added a commit that referenced this issue Jan 4, 2019
* Possible fix for issue with infinite loop on read

* Refactoring

* Update pkg to pep 420 standards

akharit commented Jan 9, 2019

I have added a break when data_read has length 0 in the new version, so the infinite loop won't happen. However, I am keeping this open until the underlying issue, i.e. why the file metadata is not correct, is fixed.


akharit commented Jan 30, 2019

Closing from here, since it is confirmed not to be a Python issue.

@akharit akharit closed this as completed Jan 30, 2019