Some fixes for multithreaded Context parsing #1073

Merged
quinnj merged 8 commits into JuliaData:main from the chunkrowstartdetection branch
May 19, 2023

Conversation

Liozou
Contributor

@Liozou Liozou commented Feb 12, 2023

I had a CSV file containing only Float64 values whose columns were sometimes parsed as String and sometimes as Float64 when using multiple threads. Whether this happened, and which columns were affected, changed from one run to the next, which made the issue difficult to track down.

This PR fixes this behaviour and a couple of issues encountered in the same area of the code. Most notably:

  • this fixes the race condition behind the non-reproducible issue: it came from findchunkrowstart, where thread i reads ranges[i+1] and writes to ranges[i].
  • if multithreaded parsing fails, the column types are now properly reinitialized so that the fallback single-threaded parsing can re-infer them correctly (I followed the suggestion from #1010, "Threaded parsing type mismatch depending on ntasks value").
  • a trailing newline at the end of the file could be mistaken for a row with 0 columns, causing multithreaded parsing to fail.
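
The race described in the first bullet can be illustrated with a small sketch. It is not the CSV.jl code: next_row_start and adjust_ranges are hypothetical stand-ins, only '\n' is treated as a newline, and the fix shown (reading from an untouched copy of the boundaries) is just one possible way to remove the race, not necessarily the one used in this PR. If each task wrote into ranges itself, task i could read ranges[i+1] either before or after task i+1 had updated it, so the result would depend on scheduling.

```julia
# Hypothetical stand-in for the chunk-boundary adjustment; not the CSV.jl code.
# `ranges` holds tentative chunk start positions plus a final sentinel that
# points just past the end of the buffer.

# Position right after the first '\n' found in pos:stop-1, or `stop` if none.
function next_row_start(buf::Vector{UInt8}, pos::Int, stop::Int)
    while pos < stop
        buf[pos] == UInt8('\n') && return pos + 1
        pos += 1
    end
    return stop
end

function adjust_ranges(buf::Vector{UInt8}, ranges::Vector{Int})
    n = length(ranges) - 1
    adjusted = copy(ranges)          # each task writes only adjusted[i]
    Threads.@threads for i in 2:n    # the first chunk already starts on a row
        # All reads go to the untouched `ranges`, so no task can observe a
        # neighbour's update, whatever the scheduling.
        adjusted[i] = next_row_start(buf, ranges[i], ranges[i + 1])
    end
    return adjusted
end

buf = Vector{UInt8}("a,b\n1,2\n3,4\n5,6\n")   # 16 bytes, rows start at 1, 5, 9, 13
ranges = [1, 6, 11, length(buf) + 1]          # tentative boundaries, mid-row
adjusted = adjust_ranges(buf, ranges)         # [1, 9, 13, 17]
```

Because the input boundaries are never mutated, the adjusted boundaries are the same on every run, regardless of how many threads are used.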

I also made multithreaded parsing failure a loud @error so that performance pitfalls do not go unnoticed (last commit). This can easily be reverted if you think it could lead to too many spurious logs.
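
The fallback behaviour described above (reset the column types, log loudly, then parse single-threaded) could look roughly like the sketch below. Everything in it is an assumption made for illustration: parse_with_fallback, the two toy parsers, and the use of Symbols to represent column types are hypothetical and do not mirror the CSV.jl internals.

```julia
# Hypothetical sketch of "fail loudly, reset types, fall back"; not CSV.jl code.
# Both parser arguments mutate `types`; the threaded one returns `true` on success.
function parse_with_fallback(parse_threaded!, parse_single!, initial_types::Vector{Symbol})
    types = copy(initial_types)
    parse_threaded!(types) && return types
    @error "Multithreaded parsing failed; falling back to single-threaded parsing."
    copyto!(types, initial_types)    # forget whatever the failed attempt inferred
    parse_single!(types)
    return types
end

# Toy parsers: the threaded one wrongly marks column 1 as String and reports
# failure; the single-threaded one only infers columns that are still unknown.
failing_threaded!(types) = (types[1] = :String; false)
function single!(types)
    for i in eachindex(types)
        types[i] === :unknown && (types[i] = :Float64)
    end
    return true
end

result = parse_with_fallback(failing_threaded!, single!, [:unknown, :unknown])
```

Without the copyto! reset, the first column would keep the :String type left behind by the failed attempt, which is the kind of stale state the second bullet refers to.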

Fix #1010, fix #1047. Tests are added (they now include one occurrence of the new @error).

@Liozou
Contributor Author

Liozou commented Apr 28, 2023

Bump. This fixes a real bug and a race condition.
If the changes are too complex to review, would you like me to split this PR into smaller, separate ones? I already tried to do that with the different commits, but I understand if it's still difficult to navigate.

@@ -336,24 +336,23 @@ ColumnProperties(T) = ColumnProperties(T, 0x00)
end
end

function findnextnewline(pos, stop, buf, opts)
while pos < stop
Member

stop here seems to be the last byte of the current chunk; why is it < here instead of <=? I understand the change below from len = ranges[i + 1] to len = ranges[i + 1] - 1, but it seems like we still want to check each byte in the chunk?

Member

(for example, on line 372 we're doing while pos <= len...)

Contributor Author

Thanks for the review!

The function findnextnewline only returns the position of the first newline it encounters, and defaults to returning stop. So whether there is no newline at all or the newline is on the last byte, it returns stop either way. Since the function has no side effects, we might as well skip checking that last byte.
The default return value comes from the assumption that each chunk should end with a newline. For the call on line 450, this assumption is ensured by the preprocessing of ranges on lines 466-480, while for the call on line 470, stop is simply the end of the buffer.
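
The equivalence argued above can be checked with a simplified sketch. The two functions below are stand-ins written for this illustration, not the CSV.jl implementation: they drop the opts argument and treat only '\n' as a newline. One loops while pos < stop, the other while pos <= stop.

```julia
# Simplified stand-ins for findnextnewline; only '\n' counts as a newline.
function findnextnewline_lt(pos, stop, buf)
    while pos < stop                 # never inspects buf[stop]
        buf[pos] == UInt8('\n') && return pos
        pos += 1
    end
    return stop
end

function findnextnewline_le(pos, stop, buf)
    while pos <= stop                # also inspects buf[stop]
        buf[pos] == UInt8('\n') && return pos
        pos += 1
    end
    return stop
end

buf = Vector{UInt8}("ab\ncd\n")      # newlines at positions 3 and 6

# Compare both variants on every (pos, stop) pair with pos <= stop.
same = all(findnextnewline_lt(p, s, buf) == findnextnewline_le(p, s, buf)
           for s in 1:length(buf) for p in 1:s)
```

The only byte the two variants treat differently is buf[stop], and in that case both return stop: one because it found the newline there, the other because it fell through to the default.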

Member

@quinnj quinnj left a comment

This looks pretty good IMO; I left one question though. I'm also not sure why CI is not running at all here? We definitely want to make sure we have a good test run before merging.

@quinnj quinnj closed this May 19, 2023
@quinnj quinnj reopened this May 19, 2023
@codecov

codecov bot commented May 19, 2023

Codecov Report

Patch coverage: 95.83% and project coverage change: +0.18 🎉

Comparison is base (94deaf4) 90.22% compared to head (6dc07f2) 90.40%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1073      +/-   ##
==========================================
+ Coverage   90.22%   90.40%   +0.18%     
==========================================
  Files           9        9              
  Lines        2270     2293      +23     
==========================================
+ Hits         2048     2073      +25     
+ Misses        222      220       -2     
Impacted Files      Coverage Δ
src/detection.jl    96.33% <92.50%> (+0.38%) ⬆️
src/context.jl      88.99% <100.00%> (+0.51%) ⬆️


@quinnj quinnj merged commit ae05b87 into JuliaData:main May 19, 2023
@quinnj
Member

quinnj commented May 19, 2023

Thanks @Liozou!

@Liozou Liozou deleted the chunkrowstartdetection branch May 19, 2023 13:28
@Liozou
Contributor Author

Liozou commented May 19, 2023

You're welcome, thanks for the review and merge! I believe this should also fix #1089: I checked locally that the bug occurs when using e.g. ntasks=8 before this PR, but not after it.
