Problems parsing large CSV files #229
Comments
Okay. We'll park this and give time for responses to your post to emerge.
No replies at all so far. :/
How active is that project?
There seems to have been recent activity in the codebase. Issues - new ones being posted, about five comments in total on the most recent three. So perhaps active but not super active.
From the Element thread: "...is to find or create a test which reproduced that issue without MM, then try and binary-search the PapaParse library for the most recent version which doesn't have it."
I didn't end up doing a binary search for this. Think I've worked out what's happening by debugging.

Playing around with the PapaParse demo, I could see that the first chunk was getting processed fine, but then the next chunk (bytes 5242880-10485759) was hitting a content-decoding error. This helpful guide https://www.joshcanhelp.com/accept-encoding-content-encoding-err-content-decoding-failed/ helped me to work out that it's because we seem to be receiving a gzip-encoded response.

We have the same problem with our server it seems, where if we try to get 'https://dev.data.solidarityeconomy.coop/coops-uk/standard.csv' whilst accepting gzip encoding, the server gives us back a gzipped response, which PapaParse then requests in chunks.

So I don't think this is something that should be fixed in PapaParse, since it's not really broken (apart from their demo). Instead, we should fix our dev.data.solidarityeconomy.coop server to not send a gzip header. And then I suppose that whoever is hosting papaparse.com will have to do the same to get the demo site working?
Ok, so I think the problem is not actually about the server's gzip encoding negotiation, which I didn't fully understand. Our data server is successfully gzipping the CSV, but it looks like the chunked (byte-range) requests don't play well with the gzip encoding. So I think PapaParse needs to use a different method of chunking, or disable gzip compression when requesting a chunked response.
I don't think there's any easy solution to this without completely changing how PapaParse does streaming. It seems to be a limitation of HTTP right now, if we want to use byte-range requests on gzip-encoded responses.

I think we just need to disable streaming CSVs and get PapaParse to download the whole file in one chunk. In fact, I think this is more efficient anyway, since it will reduce the number of HTTP requests and save time. I don't think streaming is actually leading to many memory efficiencies, since we're basically saving the whole CSV into memory anyway. It would only bring efficiencies if we were simply processing the CSV then throwing away lots of the data.

@wu-lee what do you think about disabling streaming?
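For concreteness, here's a rough sketch of the two approaches being weighed up, using PapaParse's documented download/chunk/chunkSize options. The URL is our dev data server, handleRows is a stand-in for whatever Mykomap does with each batch of rows, and my understanding of what happens when the callbacks are omitted is worth double-checking against the PapaParse docs:

```js
import Papa from "papaparse";

const CSV_URL = "https://dev.data.solidarityeconomy.coop/coops-uk/standard.csv";
const handleRows = (rows) => { /* stand-in for our row transformation */ };

// Streamed: PapaParse fetches the remote file in ranged chunks (the default
// remote chunk size is 5 MiB, which is why the failing request above asks for
// bytes 5242880-10485759) and hands each parsed batch to the chunk callback.
Papa.parse(CSV_URL, {
  download: true,
  header: true,
  chunk: (results) => handleRows(results.data),
  complete: () => console.log("all chunks parsed"),
  error: (err) => console.error("download/parse failed:", err),
});

// Not streamed: with no step/chunk callback, the whole file is fetched and
// parsed in one go, and everything arrives in complete().
Papa.parse(CSV_URL, {
  download: true,
  header: true,
  complete: (results) => handleRows(results.data),
  error: (err) => console.error("download/parse failed:", err),
});
```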
Interesting observations. So in my investigations, it was clear that (some) uncompressed data was arriving, because I could see it - plus the parser wouldn't be able to decode compressed data. I would guess everything at the level of the HTTP server and the browser is working normally. So I think we agree it's something to do with PapaParse's streaming.

It would seem like they ought to have had something working once, at least, since they claim to have large file support, and gzip encoding is not at all unusual (in fact, as in our case, it seems to be standard out-of-the-box behaviour with Apache). My hunch is that some recent work broke the streaming in a way that their test suite missed.

We could disable streaming and see if it works - this is just an option to PP, and I suspect it could be flipped in the debugger, even. This would confirm your theory.

It isn't true that the CSV is just downloaded into memory verbatim and used - we are indeed processing and throwing away that CSV data. The plain-text CSV is parsed and transformed into another data structure a row at a time. So I think streaming will help it cope with large datasets, and I'd need to feel a bit desperate to want to lose it, given it's there to help us deal with larger datasets.

I'm also not totally certain it would speed up the download significantly: HTTP is likely well optimised these days, so I suspect the chunks may well be mediated over a single TCP connection and the overhead would be a few extra header packets.

But in order to show it's a streaming problem, sure, disable it.
I think that the first chunk is being successfully transferred and decompressed because, although it's incomplete, it contains the gzip header so it can be decompressed. The error that the PapaParse code is hitting for the second chunk is a decoding failure, presumably because that chunk doesn't start with a gzip header.

I see what you mean - PapaParse was probably working with streaming and gzip in the past, since it's had lots of testers. I'll check against the first version of PapaParse.

And yes, I suppose we're row-transforming the data, so maybe this causes some duplication and inefficiency in memory. I wonder if there's a way to make our processing code more efficient (e.g. throwing away rows from the original data after processing them), which would be just as efficient as downloading and processing chunks. I think this would be a simpler fix than forking PapaParse, since the issue seems to be quite baked into their current implementation.
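A tiny Node illustration of the gzip-header point above (the CSV content and slice offsets are made up; the gist is that only a slice starting at byte 0 carries the gzip header):

```js
const zlib = require("node:zlib");

// Build some CSV-ish data and gzip it, as the server would.
const csv = "id,name\n" + Array.from({ length: 100000 }, (_, i) => `${i},org ${i}`).join("\n");
const gzipped = zlib.gzipSync(Buffer.from(csv));

const firstChunk = gzipped.subarray(0, 1024);     // starts with the gzip magic bytes
const middleChunk = gzipped.subarray(1024, 2048); // mid-stream deflate data, no header

// Z_SYNC_FLUSH lets gunzip return whatever it can from a truncated stream.
const tolerateTruncation = { finishFlush: zlib.constants.Z_SYNC_FLUSH };

console.log("first chunk:",
  zlib.gunzipSync(firstChunk, tolerateTruncation).length, "bytes recovered");

try {
  zlib.gunzipSync(middleChunk, tolerateTruncation);
} catch (err) {
  console.log("middle chunk:", err.message); // fails: no gzip header to check
}
```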
Ok, some new intel... I just tested the PapaParse demo again (after more than a day of not testing it), and now it's working with streamed data. But it then breaks again when I re-run the same test without clearing the cache. This is repeatable for me, and if I reset the browser cache, streaming works again. @wu-lee please can you test that this is repeatable for you?

This indicates that the caching functionality isn't working properly with the chunked responses. This isn't exactly the same as the Mykomaps bug, since that never works, even after resetting the browser cache. But maybe it has something to do with the caching on our server.
I think all of this analysis above was a red herring, and we were just hitting the bug in the demo due to browser caching, which is different to what's happening in Mykomap.

In the Mykomap console logs, I'm seeing an error in PapaParse which I'd missed before. The part of the PapaParse code that's failing is where it reads the response headers of each chunk - in particular, xhr.getResponseHeader('Content-Range') returns null, since the CORS policy means that reading this header fails for cross-origin requests to our data server (see https://stackoverflow.com/a/7463297).

I've just tried on a browser with CORS checks disabled and Mykomaps is successfully downloading all the data!
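For anyone wanting to check this from the browser console, something like the following should show it (the 1 KiB range is arbitrary). Without Access-Control-Expose-Headers: Content-Range on the response, getResponseHeader() returns null even though the 206 response really does carry the header:

```js
// Run from a page on a different origin (e.g. the map site) against the data server.
const xhr = new XMLHttpRequest();
xhr.open("GET", "https://dev.data.solidarityeconomy.coop/coops-uk/standard.csv");
xhr.setRequestHeader("Range", "bytes=0-1023"); // ask for the first KiB only
xhr.onload = () => {
  console.log("status:", xhr.status); // 206 when the server honours Range requests
  // null until the server exposes the header to cross-origin scripts:
  console.log("Content-Range:", xhr.getResponseHeader("Content-Range"));
};
xhr.send();
```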
@wu-lee We need to add the Access-Control-Expose-Headers header to the data server's responses, so that the Content-Range header is readable cross-origin. See: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Access-Control-Expose-Headers

Please can you give me access to this server (dev-1 I think, judging by the Ansible config)?
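If it helps, on an Apache server with mod_headers enabled I'd expect the change to be something like a one-liner in the relevant vhost config or .htaccess - exactly where our config lives is an assumption on my part, so treat this as a sketch:

```
# Let cross-origin scripts (i.e. the map) read the Content-Range response header
Header always set Access-Control-Expose-Headers "Content-Range"
```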
Had a search for PapaParse and the ACEH header, expecting to find someone with a similar problem, but didn't really find much. This seems the closest match I can find:
Yes, I think that person is hitting the same problem, and I've mentioned this solution on that ticket. I've added the Access-Control-Expose-Headers header on dev-1. Think this can be closed now @wu-lee
Nice. I can confirm that https://dev.coopsuk.solidarityeconomy.coop/ is loading again, and it seems faster than before. Might that be the case?
I'm not sure why it would load faster. Maybe it's because your browser wasn't caching the failed partial response before, but now it is caching the successful one? If you empty the cache and hard reload (on Chrome, open the inspector window and then right-click the refresh button), maybe it will take longer to load again. Or maybe there is some improvement in the caching server-side.
As you say, several things could have influenced previous actual and perceived/remembered load times. Not worth further investigation.
Note, the fix is to update the relevant server config file.

This needs to be done on the other servers which can be used for CSV downloads by maps, so I've updated the same file on prod-1, and the equivalent file on dev-2 and prod-2. (There's not really a home for this file in a repo currently. It's on my mental list for some consideration.)
@wu-lee Ah thanks, I didn't realise there was more than one server. Maybe it could be added to Ansible?
Possibly it should be - the problem is that the Ansible stuff's boundary of knowledge stops at the details of all the applications that are installed.
Describe the bug
[Cooperative World Map]
The recently updated CUK dataset (January 2024) has 7248 organisations.
When converted into a standard CSV file as usual and consumed by the map at https://dev.coopsuk.solidarityeconomy.coop/, the map seems to fail to load the CSV, complaining in the dev console of an error which seems to be the result of the CSV being truncated on download.
I've checked the file: it is fine, and parses correctly elsewhere.
Looking in the developer console, I can see that the string being parsed is truncated.
The size of the string being parsed, converted to bytes, matches the size given by the first part of the response's Content-Range header (which takes the form "bytes start-end/total"). Note that the latter number - the total - indicates that not all of the file was sent in this response, which is normal, and managed by the server. PapaParse, the CSV library, relies on this header to work correctly, which I know from testing it with mock HTTP responses - it doesn't work without it.
The truncated string typically chops the end of a CSV row in half - so manifests as a parse error. But what seems to be the problem is that only the first chunk is being parsed. This explains why the smaller CSV files we've used so far don't exhibit this problem - they fit into one chunk.
The same CSV file works fine in a unit test which sends the data in a single chunk.
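For comparison, the simplest single-pass parse looks like this - a stand-in snippet rather than the actual unit test, which mocks HTTP responses:

```js
import Papa from "papaparse";

// Parsing an in-memory string goes through PapaParse in a single pass, with no
// ranged HTTP requests involved - this is the case that behaves correctly.
const csvText = "Identifier,Name\nR000001,Example Co-op\nR000002,Another Co-op\n";
const results = Papa.parse(csvText, { header: true, skipEmptyLines: true });

console.log(results.errors.length); // 0 for complete, well-formed input
console.log(results.data.length);   // 2 rows
```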
To Reproduce
Expected behavior
All the data should be loaded and the corresponding organisations visible (7248). There should be no errors.
Desktop (please complete the following information):
Additional context
I think it may be the library at fault. It is meant to be able to stream-parse larger files than this, and I can do that using it in tests.
At the time of writing, however, the demo page for PapaParse seems to have problems with our CUK CSV file. And what's more, it has problems with its own test files.
I've filed a bug on GitHub outlining this; let's see what they say:
mholt/PapaParse#1037