-
Notifications
You must be signed in to change notification settings - Fork 141
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Some fixes for multithreaded Context parsing #1073
Conversation
Bump. This fixes a real bug and a race condition. |
@@ -336,24 +336,23 @@ ColumnProperties(T) = ColumnProperties(T, 0x00) | |||
end | |||
end | |||
|
|||
function findnextnewline(pos, stop, buf, opts) | |||
while pos < stop |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stop
here seems to be the last byte of the current chunk; why is it <
here instead of <=
? I understand the change below from len = ranges[i + 1]
to len = ranges[i + 1] - 1
, but it seems like we still want to check each byte in the chunk?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(for example, on line 372 we're doing while pos <= len
...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review!
This function findnextnewline
only returns the position of the first encountered newline and defaults by returning stop
. So, whether there is no newline, or whether the newline happens on the last byte, it will return stop
either way and there is no side-effect to the function, so we might as well skip checking that last byte.
The default return value comes from the assumption that each chunk should end with a newline. Line 450 this assumption is verified by the preprocessing of ranges from lines 466-480, while on the call line 470 stop
is simply the end of the buffer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks pretty good IMO; I left one question though. I'm also not sure why CI is not running at all here? We definitely want to make sure we have a good test run before merging.
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## main #1073 +/- ##
==========================================
+ Coverage 90.22% 90.40% +0.18%
==========================================
Files 9 9
Lines 2270 2293 +23
==========================================
+ Hits 2048 2073 +25
+ Misses 222 220 -2
☔ View full report in Codecov by Sentry. |
Thanks @Liozou! |
You're welcome, thanks for the review and merge! I believe this should also fix #1089, I checked locally that the bug occurs when using e.g. |
I had a CSV file with only
Float64
whose columns were sometimes parsed asString
, sometimes asFloat64
when using multiple threads. Whether it happened and which columns were affected changed from one execution to the other, which made the issue difficult to track.This PR fixes this behaviour and a couple of issues encountered in the same area of the code. Most notably:
findchunkrowstart
, where threadi
readsranges[i+1]
and writes toranges[i]
.ntasks
value #1010).I also made multi-threading parsing failure a loud
@error
to prevent performance pitfalls to go unnoticed (last commit). This can be easily reverted if you think it can lead to too many spurious logs.Fix #1010, fix #1047 and added tests (which now include one occurrence of the new
@error
).