
Head (or similar) should support a timeout value #265

Closed
epa095 opened this issue Dec 17, 2018 · 7 comments
epa095 commented Dec 17, 2018

Description

We have an issue where at least one file on our datalake is broken. In the portal it has a size, but preview does not work, and "downloading" it results in an empty file.

The problem is that when trying to download it (using open) with this library, Python just hangs at 100% CPU. Using info/ls/stat etc. gives valid information, so there is no way of discovering its "broken-ness" from those. Using e.g. head file_name 1 to read a single byte also leaves Python spinning in an infinite loop at 100% CPU. It would be nice if head accepted a timeout parameter; then we could use it to detect when we are in this situation. Alternatively, do you have any other way of detecting it programmatically?
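To illustrate the kind of workaround we are otherwise left with, here is a rough sketch of client-side detection (the store name and path are placeholders; it assumes Linux, where fork lets the child process inherit the authenticated client, and it only detects the hang rather than fixing anything):

import multiprocessing

from azure.datalake.store import core, lib

# Placeholder setup: substitute your own store name, credentials and the suspect path.
token = lib.auth()  # or however you already authenticate
adl = core.AzureDLFileSystem(token, store_name='STORE_NAME')
SUSPECT_PATH = 'path/to/suspect/file'

def read_first_byte(path):
    # This is the call that currently spins at 100% CPU for the broken file.
    adl.head(path, size=1)

# Run the read in a child process so it can be killed if it hangs;
# with the fork start method on Linux the child inherits `adl`.
proc = multiprocessing.Process(target=read_first_byte, args=(SUSPECT_PATH,))
proc.start()
proc.join(timeout=30)
if proc.is_alive():
    proc.terminate()   # actually stops the stuck reader
    proc.join()
    print('file looks broken: read did not finish within 30 seconds')
else:
    print('file read fine')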

Reproduction Steps


Environment summary

SDK Version: azure-datalake-store==0.0.39

Python Version: 3.6.7, 64-bit

OS Version: Linux

Shell Type: bash


akharit commented Dec 18, 2018

@epa095
Can you elaborate on what you mean when you say that "one file on our datalake is broken"? In particular, it would be helpful if I could get the following info:

adl = ADLS_INSTANCE
print(adl.info(path_to_corrupted_file))

Can you also mention the ADLS account name, file path, and approximate time (with time zone) when you ran the info and open operations? This will allow me to look at the backend logs to figure out what exactly is happening.

Usually the backend server times requests out when necessary and the timeout propagates to the SDK, so explicit timeouts aren't needed on these functions. Here it seems like there is a bug in the SDK (possibly the result of a bug in the backend). From what I can figure out, the loop in the read function, i.e.

    while length > 0:
        self._read_blocksize()
        data_read = self.cache[self.loc - self.start:
                               min(self.loc - self.start + length, self.end - self.start)]
        out += data_read
        self.loc += len(data_read)
        length -= len(data_read)

is most probably not getting the length parameter reduced, and so goes into an infinite loop. This seems like the only pathway that would result in 100% CPU usage. I can add a break on len(data_read) == 0 to fix it, but I'll feel more confident in the fix if I can get more info about the file and the backend operations.
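Roughly the shape of the guard I have in mind (a sketch only, not the final patch):

    while length > 0:
        self._read_blocksize()
        data_read = self.cache[self.loc - self.start:
                               min(self.loc - self.start + length, self.end - self.start)]
        if len(data_read) == 0:
            # Nothing came back even though `length` bytes are still
            # expected; stop instead of looping at 100% CPU.
            break
        out += data_read
        self.loc += len(data_read)
        length -= len(data_read)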


epa095 commented Dec 19, 2018

Good work; I think you have found the reason here, especially since downloading the file from the portal also results in an empty file.

With regard to the file/account: for reasons of corporate confidentiality I am not allowed to put such information here on the open internet. I can say that the files in question become readable again after some time, and it is quite unpredictable when that happens.

I will file an issue through our premium support and come back to you (I guess they will either contact you, or I will get a support number I can share).


akharit commented Dec 19, 2018

I asked my colleagues, and there is a known issue where the length, i.e. the size, of a file is not synced up for some time after concurrent appends. I'll add a fix for that as soon as possible. If this is the same issue, a call to the OPEN operation does cause a sync-up. You can try the following as a temporary mitigation.

from azure.datalake.store import core
adl = ADLS_INSTANCE
core._fetch_range_with_retry(adl.azure, FILE_PATH, 0, 1)

The _fetch_range call should force a sync in the backend, and afterwards reads should be correct.

This is a private method, though, and there is no guarantee about its interface, so please only use it as a temporary mitigation.
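Put together, a rough sketch of the whole mitigation (assuming adl is your connected AzureDLFileSystem and FILE_PATH the affected path; again, _fetch_range_with_retry is private, so treat this as throwaway code):

from azure.datalake.store import core

# Force the backend to sync the file by issuing a small read against it.
core._fetch_range_with_retry(adl.azure, FILE_PATH, 0, 1)

# Verify that a normal read now returns data. If the reported size is still
# wrong, this read can hit the same infinite loop described above, so you may
# want to guard it with a timeout.
f = adl.open(FILE_PATH, 'rb')
first_bytes = f.read(16)
f.close()
if not first_bytes:
    print('file still looks broken after the forced sync')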

Otherwise, I'll get a new release out in a couple of days with this solved, provided the root cause I have in mind is the actual root cause.

flikka commented Dec 19, 2018

Thanks for looking into this @akharit. I am working with @epa095, and I just tried your suggestion. A call to _fetch_range_with_retry returned a 200 response, but with just b'' (empty bytes) as content. Subsequent reading of the file still fails, as it does when using the browser on portal.azure.com. There is indeed some mismatch between the reported size (24 MB) and the actual file (empty).

I used the debugger to see where the "infinite loop" was, and you are indeed correct: the while loop you refer to in core.py is where it is working its magic. So there should probably be a check for this case and an exception or similar (sketched below), to make it fail in a sensible way. We still need to figure out why this happens in the data lake in the first place, though. Great if you have ideas there too :-)
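For what it's worth, the kind of check I am imagining inside that while loop, right after data_read is computed (just a sketch; whether to raise or break is your call, either beats hanging):

        if len(data_read) == 0:
            # Read produced nothing while `length` bytes were still expected:
            # fail loudly instead of looping forever.
            raise IOError('no data returned although %d more bytes were '
                          'expected; file metadata may be out of sync' % length)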

Thanks for your help; if you want some details about the failing files you can drop me a mail at kflik@equinor.com.


akharit commented Dec 19, 2018

So there are two separate issues here. One is on the SDK side, which hangs. I can break the loop on the condition len(data_read) == 0 to fix that, but it won't solve the underlying issue of why the file does not report the correct size.

I am not sure I can do anything about that at the SDK layer. The root cause I had in mind should have been resolved by sending an 'OPEN' operation on the file path, which is what _fetch_range was meant to do. I guess the cause is different here.

@epa095 @flikka It would be better if you can create an issue through support. Please mention my ID (i.e. akharit) to the support representative so it comes to me faster, along with the file path, ADLS account, and time, and I can check the backend logs and pass it on to whichever internal component is causing problems.

akharit added a commit that referenced this issue Jan 4, 2019
* Possible fix for issue with infinite loop on read

* Refactoring

* Update pkg to pep 420 standards

akharit commented Jan 9, 2019

I have added a break when data_read has length 0 in the new version, so the infinite loop won't happen. However, I am keeping this open until the underlying issue, i.e. why the file metadata is not correct, is fixed.


akharit commented Jan 30, 2019

Closing from here, since it is confirmed not to be a Python issue.

@akharit akharit closed this as completed Jan 30, 2019