Some fixes for multithreaded Context parsing #1073

Liozou · 2023-02-12T17:36:41Z

I had a CSV file containing only Float64 values whose columns were sometimes parsed as String and sometimes as Float64 when using multiple threads. Whether this happened, and which columns were affected, changed from one execution to the next, which made the issue difficult to track down.

This PR fixes this behaviour and a couple of other issues encountered in the same area of the code. Most notably:

- It fixes the race condition underlying the non-reproducible issue. The race came from findchunkrowstart, where thread i reads ranges[i+1] and writes to ranges[i].
- If multithreaded parsing fails, the column types are now properly reinitialized so that the fallback single-threaded parsing can re-infer them correctly (I followed the suggestion from "Threaded parsing type mismatch depending on ntasks value" #1010).
- A trailing newline at the end of the file could be mistaken for a row with 0 columns, causing multithreaded parsing to fail.
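For readers unfamiliar with this kind of race, here is a minimal sketch of the pattern and of one way to avoid it. It is not the code from this PR: `align_chunks`, `snapshot` and `aligned` are hypothetical names, only `\n` is treated as a newline, and the real findchunkrowstart does considerably more work. The idea shown is simply that each task reads its chunk limits from an immutable copy of `ranges` and writes only to its own slot of the output.

```julia
# Hypothetical sketch, NOT the actual CSV.jl implementation.
# `ranges` holds chunk start positions; the last entry is one past the buffer end.
function align_chunks(buf::Vector{UInt8}, ranges::Vector{Int})
    snapshot = copy(ranges)   # read-only copy shared by all tasks
    aligned = copy(ranges)    # each task writes only aligned[i]
    Threads.@threads for i in 2:length(ranges) - 1
        pos = snapshot[i]
        stop = snapshot[i + 1]  # read from the snapshot, never from `aligned`
        # Scan forward to the next newline inside this chunk (assumes '\n' only).
        while pos < stop && buf[pos] != UInt8('\n')
            pos += 1
        end
        # The chunk starts right after the newline, or at `stop` if none was found.
        aligned[i] = min(pos + 1, stop)
    end
    return aligned
end

buf = Vector{UInt8}("a,b\n1,2\n3,4\n5,6\n")
align_chunks(buf, [1, 6, 11, 17])
```

Had the loop read `aligned[i + 1]` while another task was writing it, the result would depend on task scheduling, which is the kind of run-to-run variation described above.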

I also made multithreaded parsing failure a loud @error so that performance pitfalls do not go unnoticed (last commit). This can easily be reverted if you think it could lead to too many spurious logs.
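As a rough illustration of what "loud" means here, the sketch below logs with @error before falling back to a slower path. The names `parse_with_fallback`, `threaded` and `single` are hypothetical, and signalling failure by returning `nothing` is an assumption made for the example; it does not reflect how CSV.jl detects the failure internally.

```julia
# Hypothetical sketch, NOT the actual CSV.jl code path.
# `threaded` returns `nothing` when chunked parsing cannot proceed (assumption).
function parse_with_fallback(threaded, single, input)
    result = threaded(input)
    if result === nothing
        # Loud on purpose: a silent fallback hides a performance regression.
        @error "Multithreaded parsing failed; falling back to single-threaded parsing."
        return single(input)
    end
    return result
end

# Usage: the first call falls back and logs an error, the second does not.
parse_with_fallback(_ -> nothing, x -> x + 1, 1)
parse_with_fallback(x -> 10x, x -> x + 1, 1)
```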

Fix #1010, fix #1047. Tests are added (and now include one occurrence of the new @error).

Liozou · 2023-04-28T09:44:50Z

Bump. This fixes a real bug and a race condition.
If the changes are too complex to review, would you like me to split this PR into smaller separate ones? I already tried to do that with the separate commits, but I understand if it's still difficult to navigate.

quinnj · 2023-05-19T05:54:31Z

src/detection.jl

@@ -336,24 +336,23 @@ ColumnProperties(T) = ColumnProperties(T, 0x00)
    end
 end

+function findnextnewline(pos, stop, buf, opts)
+    while pos < stop


stop here seems to be the last byte of the current chunk; why is it < here instead of <=? I understand the change below from len = ranges[i + 1] to len = ranges[i + 1] - 1, but it seems like we still want to check each byte in the chunk?

(for example, on line 372 we're doing while pos <= len...)

Liozou

Thanks for the review!

The function findnextnewline only returns the position of the first newline it encounters, and defaults to returning stop. So whether there is no newline at all or the newline is on the last byte, it returns stop either way. Since the function has no side effects, we might as well skip checking that last byte.
The default return value comes from the assumption that each chunk should end with a newline. For the call on line 450, this assumption is guaranteed by the preprocessing of ranges on lines 466-480, while for the call on line 470, stop is simply the end of the buffer.
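To make the argument concrete, here is a simplified sketch of the loop being discussed. It is an assumption-laden reduction, not the function from the diff: `findnextnewline_sketch` is a hypothetical name, the `opts` argument is dropped, and only `\n` is recognized as a newline.

```julia
# Hypothetical, simplified version of the loop under review (handles '\n' only).
function findnextnewline_sketch(pos::Int, stop::Int, buf::AbstractVector{UInt8})
    while pos < stop
        buf[pos] == UInt8('\n') && return pos
        pos += 1
    end
    # Reached either when no newline exists before `stop`, or when the only
    # newline sits exactly at `stop`: both cases yield the same value.
    return stop
end

buf = Vector{UInt8}("ab\ncd\n")      # newlines at bytes 3 and 6
findnextnewline_sketch(1, 6, buf)    # finds the newline at byte 3
findnextnewline_sketch(4, 6, buf)    # newline on the last byte: returns stop
findnextnewline_sketch(1, 6, Vector{UInt8}("abcdef"))  # no newline: returns stop
```

Using `<=` instead of `<` would return the same value in all three calls, which is why the last byte can be skipped.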

quinnj

This looks pretty good IMO; I left one question though. I'm also not sure why CI is not running at all here? We definitely want to make sure we have a good test run before merging.

codecov · 2023-05-19T12:58:33Z

Codecov Report

Patch coverage: 95.83% and project coverage change: +0.18% 🎉

Comparison is base (94deaf4) 90.22% compared to head (6dc07f2) 90.40%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1073      +/-   ##
==========================================
+ Coverage   90.22%   90.40%   +0.18%     
==========================================
  Files           9        9              
  Lines        2270     2293      +23     
==========================================
+ Hits         2048     2073      +25     
+ Misses        222      220       -2

Impacted Files	Coverage Δ
src/detection.jl	`96.33% <92.50%> (+0.38%)`	⬆️
src/context.jl	`88.99% <100.00%> (+0.51%)`	⬆️


quinnj · 2023-05-19T13:06:51Z

Thanks @Liozou!

Liozou · 2023-05-19T13:49:56Z

You're welcome, thanks for the review and merge! I believe this should also fix #1089: I checked locally that the bug occurs when using e.g. ntasks=8 before this PR, but not after it.

Liozou added 8 commits January 6, 2023 17:49

Reinitialize column types after failed chunking

790463d

Fix race condition on ranges in findchunkrowstart

4c29d06

Reinitialize column type after limit row start detection

252a4cd

Preprocess ranges to avoid cutting values in chunk row start detection

bcaa8b6

Fix using too large limit

40d4b36

Fix off-by-one in chunk row detection for last line

114c801

Add tests

6cdcae1

Make multithreaded parsing failure loud

6dc07f2

quinnj reviewed May 19, 2023

quinnj closed this May 19, 2023

quinnj reopened this May 19, 2023

quinnj approved these changes May 19, 2023

quinnj merged commit ae05b87 into JuliaData:main May 19, 2023

Liozou deleted the chunkrowstartdetection branch May 19, 2023 13:28

Liozou mentioned this pull request Jun 6, 2023

[Bug] CSV.read randomly changes eltype of column #1089

Closed

Liozou mentioned this pull request Jun 17, 2023

Selectively reduce multithreaded parsing @error #1099

Merged
