Fix closing connection before content is read #117

Merged
merged 3 commits into master from blob_connection_fix on Dec 8, 2023

Conversation

@ace-e4s (Member) commented Dec 5, 2023

This PR is related to user story ESS-XXXX

Description

Fixes an issue in _blob_to_df, where the connection was opened with stream=True and closed before the content was read. This meant that the connection was returned to the pool while the data was still being read.

If the connection is then reused by another request, it looks as if the connection was broken on the server side, potentially leading to all the ChunkedEncodingError and IncompleteReadError exceptions.

Fixed by reading the content all at once while the connection is still open. There is no need for streaming, since we don't use the partial data.
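
For reference, a minimal sketch of the pattern behind the fix (the signature, the column layout, and the use of requests here are assumptions for illustration, not the library's exact code):

```python
import io

import pandas as pd
import requests


def _blob_to_df(blob_url):
    # Read the whole body while the connection is still open, so the
    # connection handed back to the pool has been fully consumed.
    # Previously the response was opened with stream=True and closed
    # before the content was read, leaving a half-read connection behind.
    with requests.get(blob_url) as response:
        response.raise_for_status()
        content = response.content  # fully read before the connection is released

    # header=None is illustrative; the real column handling may differ.
    return pd.read_csv(io.BytesIO(content), header=None)
```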

Checklist

  • PR title is descriptive and fit for injection into release notes (see tips below)
  • Correct label(s) are used

PR title tips:

  • Use imperative mood
  • Describe the motivation for the change, the issue that has been solved, or what has been improved - not how
  • Examples:
    • Add functionality for Allan variance to sensor_4s.simulate
    • Upgrade to support Python 3.10
    • Remove MacOS from CI

@bjorn-einar-bjartnes-4ss commented Dec 5, 2023

Well spotted @ace-e4s . I think even better would be to put everything inside the with statement and pass the stream directly on without any intermediate copying, that should cause less memory allocation. These files can be quite large. I will try to come up with a suggestion.
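
For illustration, a rough sketch of that idea (hypothetical, assuming requests and pandas; header=None stands in for whatever column handling the library actually does, and this does not yet account for the malformed-CSV case discussed below):

```python
import pandas as pd
import requests


def _blob_to_df(blob_url):
    # Keep everything inside the with-block and hand the raw stream
    # straight to the parser, so the body is never copied into an
    # intermediate buffer.
    with requests.get(blob_url, stream=True) as response:
        response.raise_for_status()
        response.raw.decode_content = True  # let urllib3 undo gzip/deflate
        return pd.read_csv(response.raw, header=None)  # header=None is illustrative
```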

@ace-e4s (Member, Author) commented Dec 5, 2023

> Well spotted @ace-e4s . I think even better would be to put everything inside the with statement and pass the stream directly on without any intermediate copying, that should cause less memory allocation. These files can be quite large. I will try to come up with a suggestion.

That is a good point. In principle, we could just send response.raw directly into read_csv. However, there is one showstopper.

DRIO may send "malformed" CSV, which requires us to define the separator as a regex and use the pure-Python CSV engine. This has a huge(!) performance impact. To mitigate this, we have chosen to inspect the entire content (counting newlines and commas) to determine whether the CSV can be parsed using the regular C engine, and only use the Python engine as a last resort.

The price we pay is double memory usage for a few seconds while parsing.

Malformed CSV: DRIO assumes everything is a two-column CSV. Therefore, it may send a CSV file where a line contains more than one comma, and no quotation marks are used to encapsulate the second column.

Example of a malformed line: `1234, {"key1": "value1", "key2": "value2"}`
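
To make the trade-off concrete, here is a minimal sketch of the heuristic described above (the function name, column names, and first-comma fallback are illustrative; the library's actual last-resort parse uses a regex separator with the Python engine):

```python
import io

import pandas as pd


def _parse_blob_csv(text):
    # DRIO always writes two columns, but the value column may itself
    # contain unquoted commas (e.g. JSON), so sep="," is only safe when
    # every line holds exactly one comma.
    lines = text.splitlines()
    if text.count(",") == len(lines):
        # Well-formed: exactly one comma per line -> fast C engine.
        return pd.read_csv(io.StringIO(text), header=None, names=["index", "value"])

    # Malformed: extra commas in the value column. Split each line on the
    # first comma only, as a stand-in for the slower regex-separator /
    # Python-engine parse used as a last resort.
    rows = [line.split(",", 1) for line in lines if line]
    return pd.DataFrame(rows, columns=["index", "value"])
```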

@ace-e4s (Member, Author) commented Dec 6, 2023

So, if DRIO operated with the RFC 4180 definition of CSV (https://datatracker.ietf.org/doc/html/rfc4180.html), we could have done this in a more memory-efficient way and always used the fastest parser.

@bjorn-einar-bjartnes-4ss

Thanks, this is nice to know. So when the value is JSON, it can contain commas, and parsing it therefore requires special care. JSON inside CSV sounds like a strange mix; we would have to send some encoded JSON or similar to make that work, I would assume. This is indeed a strange mix...

I guess we will have to continue doing it this way for now, but we should discuss this with the DRIO team. It shows how some of these assumptions about values being a string have consequences.

@bjorn-einar-bjartnes-4ss

We should have something in the code that explains the intent behind the parsing logic; a comment is probably good enough.

@hanne-opseth-rygg-4ss hanne-opseth-rygg-4ss merged commit fdb5a87 into master Dec 8, 2023
9 checks passed
@hanne-opseth-rygg-4ss hanne-opseth-rygg-4ss deleted the blob_connection_fix branch December 8, 2023 07:34
Labels
bug Something isn't working

3 participants