Fix closing connection before content is read #117
Conversation
Well spotted @ace-e4s. I think it would be even better to put everything inside the with statement and pass the stream directly on without any intermediate copying; that should cause fewer memory allocations. These files can be quite large. I will try to come up with a suggestion.
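A minimal sketch of that streaming suggestion, assuming a `requests`-based download parsed with pandas (the function name, `blob_url` parameter, and `header=None` are illustrative, not taken from this repo):

```python
import pandas as pd
import requests


def blob_to_df_streaming(blob_url: str) -> pd.DataFrame:
    # Everything happens inside the with block, so the connection is only
    # returned to the pool after the body has been fully consumed.
    with requests.get(blob_url, stream=True) as response:
        response.raise_for_status()
        # response.raw is a file-like object, so pandas can read from it
        # directly without an intermediate in-memory copy of the payload.
        response.raw.decode_content = True  # transparently handle gzip/deflate
        return pd.read_csv(response.raw, header=None)
```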
That is a good point. In principle, we could just pass the stream on, but DRIO may send "malformed" CSV, which requires us to define the separator as a regex and use the pure-Python CSV engine. This has a huge(!) performance impact. To mitigate this, we have chosen to inspect the entire content (counting newlines and commas) to determine whether the CSV can be parsed with the regular C engine, and we only use the Python engine as a last resort. The price we pay is double memory usage for a few seconds while parsing. Malformed CSV: DRIO assumes everything is a two-column CSV. It may therefore send a CSV file where a line contains more than one comma and no quotation is used to encapsulate the second column. Example of a malformed line: `1234, {"key1": "value1", "key2": "value2"}`
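A rough sketch of that two-stage approach, assuming the payload is already in memory as text; the function name, `header=None`, and the regex are illustrative stand-ins rather than the repo's actual code:

```python
import io

import pandas as pd


def parse_two_column_csv(text: str) -> pd.DataFrame:
    # A well-formed two-column CSV has exactly one comma per line, so
    # comparing the total comma count to the total line count tells us
    # whether the fast C engine can handle it.
    if text.count(",") == len(text.splitlines()):
        return pd.read_csv(io.StringIO(text), header=None)

    # Last resort: a regex separator, which forces the much slower
    # pure-Python engine. This particular regex splits only on a comma
    # followed by a JSON object or array, matching the malformed example
    # above; the real code may use a different pattern.
    return pd.read_csv(
        io.StringIO(text),
        sep=r",(?=\s*[{\[])",
        engine="python",
        header=None,
    )
```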
So, if DRIO followed the RFC 4180 definition of CSV (https://datatracker.ietf.org/doc/html/rfc4180.html), we could have done this more memory-efficiently and always used the fastest parser.
Thanks, this is good to know. So when the value is JSON it can contain commas, and parsing therefore requires special care. JSON inside CSV is indeed a strange mix; we would have to send the JSON encoded somehow to make that work cleanly, I assume. I guess we will have to keep doing it this way for now, but we should discuss it with the DRIO team. It shows how some of these assumptions about values being strings have consequences.
We should have something in the code that explains the intent behind the parsing logic; a comment is probably good enough.
fc9eb79
This PR is related to user story ESS-XXXX
Description
Fixes an issue in `_blob_to_df`, where the connection was set to `stream=True` and closed before the content was read. This meant the connection was returned to the pool while the data was still being read from it.
If the connection is then reused by someone else, this looks like the connection broke on the server side, potentially leading to all the `ChunkedEncodingError` and `IncompleteReadError` occurrences.
Fixed by reading the content all at once while the connection is still open. There is no need for streaming, since we don't use the partial data.
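A minimal sketch of the fixed pattern, assuming `requests` underneath; only the `_blob_to_df` name comes from this PR, while the signature and parsing details are illustrative:

```python
import io

import pandas as pd
import requests


def _blob_to_df(blob_url: str) -> pd.DataFrame:
    # Before: the request used stream=True and the connection was closed
    # (returned to the pool) before .content had been consumed, so the next
    # user of that pooled connection saw a seemingly broken response.
    #
    # After: read the complete body while the connection is still open.
    # We never use partial data, so streaming has no benefit here.
    with requests.get(blob_url) as response:
        response.raise_for_status()
        content = response.content  # fully read before the connection is released
    return pd.read_csv(io.BytesIO(content), header=None)
```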