Fix closing connection before content is read #117

Merged
merged 3 commits into master from blob_connection_fix on Dec 8, 2023

Conversation

@ace-e4s (Member) commented Dec 5, 2023

This PR is related to user story ESS-XXXX

Description

Fixes an issue in _blob_to_df, where the connection was opened with stream=True and closed before the content was read. This meant that the connection was returned to the pool while the data was still being read.

If the connection is then reused by another request, it looks as if the connection was broken on the server side, potentially leading to all the ChunkedEncodingError and IncompleteReadError exceptions.

Fixed by reading the content all at once while the connection is still open. There is no need for streaming, since we don't use the partial data.
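
For reference, a minimal sketch of the pattern behind the fix (the signature, the column layout, and the use of requests here are assumptions for illustration, not the library's exact code):

```python
import io

import pandas as pd
import requests


def _blob_to_df(blob_url):
    # Read the whole body while the connection is still open, so the
    # connection handed back to the pool has been fully consumed.
    # Previously the response was opened with stream=True and closed
    # before the content was read, leaving a half-read connection behind.
    with requests.get(blob_url) as response:
        response.raise_for_status()
        content = response.content  # fully read before the connection is released

    # header=None is illustrative; the real column handling may differ.
    return pd.read_csv(io.BytesIO(content), header=None)
```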

Checklist

  • PR title is descriptive and fit for injection into release notes (see tips below)
  • Correct label(s) are used

PR title tips:

  • Use imperative mood
  • Describe the motivation for the change, the issue that has been solved, or what has been improved - not how
  • Examples:
    • Add functionality for Allan variance to sensor_4s.simulate
    • Upgrade to support Python 3.10
    • Remove MacOS from CI

@bjorn-einar-bjartnes-4ss commented Dec 5, 2023

Well spotted @ace-e4s . I think even better would be to put everything inside the with statement and pass the stream directly on without any intermediate copying, that should cause less memory allocation. These files can be quite large. I will try to come up with a suggestion.
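
For illustration, a rough sketch of that idea (hypothetical, assuming requests and pandas; header=None stands in for whatever column handling the library actually does, and this does not yet account for the malformed-CSV case discussed below):

```python
import pandas as pd
import requests


def _blob_to_df(blob_url):
    # Keep everything inside the with-block and hand the raw stream
    # straight to the parser, so the body is never copied into an
    # intermediate buffer.
    with requests.get(blob_url, stream=True) as response:
        response.raise_for_status()
        response.raw.decode_content = True  # let urllib3 undo gzip/deflate
        return pd.read_csv(response.raw, header=None)  # header=None is illustrative
```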

@ace-e4s (Member, Author) commented Dec 5, 2023

> Well spotted @ace-e4s . I think even better would be to put everything inside the with statement and pass the stream directly on without any intermediate copying, that should cause less memory allocation. These files can be quite large. I will try to come up with a suggestion.

That is a good point. In principle, we could just send response.raw directly into read_csv. However, there is one showstopper.

DRIO may send "malformed" CSV, which requires us to define the separator as a regex and use the pure-Python CSV engine. This has a huge(!) performance impact. To mitigate this, we have chosen to inspect the entire content (counting newlines and commas) to determine whether the CSV can be parsed using the regular C engine, and only use the Python engine as a last resort.

The price we pay is double memory usage for a few seconds while parsing.

Malformed CSV: DRIO assumes everything is a two-column CSV. Therefore, it may send a CSV file where a line contains more than one comma, and no quotation marks are used to encapsulate the second column.

Example of a malformed line: `1234, {"key1": "value1", "key2": "value2"}`
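
To make the trade-off concrete, here is a minimal sketch of the heuristic described above (the function name, column names, and first-comma fallback are illustrative; the library's actual last-resort parse uses a regex separator with the Python engine):

```python
import io

import pandas as pd


def _parse_blob_csv(text):
    # DRIO always writes two columns, but the value column may itself
    # contain unquoted commas (e.g. JSON), so sep="," is only safe when
    # every line holds exactly one comma.
    lines = text.splitlines()
    if text.count(",") == len(lines):
        # Well-formed: exactly one comma per line -> fast C engine.
        return pd.read_csv(io.StringIO(text), header=None, names=["index", "value"])

    # Malformed: extra commas in the value column. Split each line on the
    # first comma only, as a stand-in for the slower regex-separator /
    # Python-engine parse used as a last resort.
    rows = [line.split(",", 1) for line in lines if line]
    return pd.DataFrame(rows, columns=["index", "value"])
```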

@ace-e4s (Member, Author) commented Dec 6, 2023

So, if DRIO operated with the RFC 4180 definition of CSV (https://datatracker.ietf.org/doc/html/rfc4180.html), we could have done this in a more memory-efficient way and always used the fastest parser.

@bjorn-einar-bjartnes-4ss

Thanks, this is nice to know. So when the value is JSON, it can contain commas, and parsing it therefore requires special care. JSON inside CSV sounds like a strange mix; we would have to send some encoded JSON or similar to make that work, I would assume. This is indeed a strange mix...

I guess we will have to continue doing it this way for now, but we should discuss this with the DRIO team. It shows how some of these assumptions about values being a string have consequences.

@bjorn-einar-bjartnes-4ss

We should have something in the code that explains the intent behind the parsing logic; a comment is probably good enough.

@hanne-opseth-rygg-4ss hanne-opseth-rygg-4ss merged commit fdb5a87 into master Dec 8, 2023
9 checks passed
@hanne-opseth-rygg-4ss hanne-opseth-rygg-4ss deleted the blob_connection_fix branch December 8, 2023 07:34
Labels
bug Something isn't working

3 participants