
Threaded parsing type mismatch depending on ntasks value #1010

Closed
mattBrzezinski opened this issue Jun 28, 2022 · 6 comments · Fixed by #1073

mattBrzezinski commented Jun 28, 2022

Problem

We've run across a very odd issue where, depending on the ntasks value set, the detected column type differs.

CSV.File("foobar.csv"; ntasks=60, debug=true)
...
types after parsing: Type[Float64], pool = (0.2, 500)
CSV.File("foobar.csv"; ntasks=120, debug=true)
types after parsing: Type[String31], pool = (0.2, 500)

File to replicate the issue, foobar.csv
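For anyone without the attachment, a hypothetical sketch that generates a file of similar shape (one Float64 column including negative values); whether it actually trips the bad chunk boundary depends on the exact byte layout, so it may not reproduce the mismatch:

```julia
using CSV, Random

# Hypothetical reproduction sketch: one Float64 column with negative values,
# roughly the shape of the attached foobar.csv. Reproducing the mismatch
# depends on where the chunk boundaries land, so this may not trigger it.
open("synthetic.csv", "w") do io
    println(io, "foobar")
    rng = MersenneTwister(42)
    for _ in 1:3598   # similar row count to the reported file
        println(io, round(randn(rng) * 100; digits=3))
    end
end

f60  = CSV.File("synthetic.csv"; ntasks=60)
f120 = CSV.File("synthetic.csv"; ntasks=120)
@show eltype(f60.foobar) eltype(f120.foobar)
```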

iamed2 (Contributor) commented Jun 28, 2022

Additional fact from the originally observed file: this was the right-most column among several columns of various types (Float64, String, Int64), and only this column had issues.

nickrobinson251 (Collaborator) commented Jun 28, 2022

I think I know what's happening, but not what to do about it.

When we chunk up the file in the ntasks=120 case, we get unlucky: the last two bytes of a chunk end up being a newline followed by a leading negation sign.

It's a bit like trying to parse a file that looks like

```
1.0
2.0
-3.0
4.0
5.0
```

(i.e. `1.0\n2.0\n-3.0\n4.0\n5.0`) by parsing it in two chunks: `1.0\n2.0\n-` and `3.0\n4.0\n5.0`.

When we try to parse that first chunk as Float64 values (as we do as part of detect), we're going to parse 1.0, then 2.0, then try to parse `-` as a Float64, which is going to fail, so detect decides on this basis that the column isn't Float64 after all (and falls back to parsing the column as a string).
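A standalone illustration of that failure mode (this uses plain `tryparse` rather than CSV.jl's internal detect machinery, so it's only a sketch of the idea):

```julia
# Illustration only: split the sample data mid-value, the way an unlucky
# chunk boundary would, then run naive per-chunk type detection.
data = "1.0\n2.0\n-3.0\n4.0\n5.0"

chunk1 = data[1:9]     # "1.0\n2.0\n-"  -- ends right after the negation sign
chunk2 = data[10:end]  # "3.0\n4.0\n5.0"

# Naive detection: call the column Float64 only if every token parses as one.
detect_type(chunk) =
    all(!isnothing, tryparse.(Float64, split(chunk, '\n'))) ? Float64 : String

@show detect_type(chunk1)  # String  -- the trailing "-" fails to parse
@show detect_type(chunk2)  # Float64
```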

quinnj (Member) commented Jun 28, 2022

Hmmmm.....I don't think that should be possible because we do extra work to ensure chunks only get split exactly on the newline character, so \n should always end a chunk and the next character would be the start of the next chunk.
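For context, a simplified sketch of that kind of newline-aligned boundary adjustment (not CSV.jl's actual implementation, which also has to cope with quoted fields that can contain newlines):

```julia
# Simplified sketch: start from evenly spaced byte positions, then push each
# chunk start forward until the preceding byte is '\n', so no chunk begins
# mid-row. CSV.jl's real logic is more involved.
function adjust_boundaries(buf::Vector{UInt8}, ntasks::Int)
    len = length(buf)
    positions = [1 + div(len * (i - 1), ntasks) for i in 1:ntasks]
    for i in 2:ntasks
        pos = positions[i]
        while pos <= len && buf[pos - 1] != UInt8('\n')
            pos += 1
        end
        positions[i] = pos
    end
    return positions
end

buf = Vector{UInt8}("1.0\n2.0\n-3.0\n4.0\n5.0\n")
@show adjust_boundaries(buf, 2)  # [1, 14]: the second chunk starts at "4.0"
```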

nickrobinson251 (Collaborator) commented Jun 28, 2022

Hmmmm

Well, I'm very curious to find out what is going on 😂

The debug output is:

```
julia> CSV.File("foobar.csv"; ntasks=120, debug=true)
header is: 1, skipto computed as: 2
headerpos = 1, datapos = 8
estimated rows: 3598
detected delimiter: ","
column names detected: [:foobar]
byte position of data computed at: 8
computed types are: nothing
initial byte positions before adjusting for start of rows: [8, 520, 1032, 1544, 2056, 2568, 3080, 3592, 4104, 4616, 5128, 5640, 6152, 6664, 7176, 7688, 8200, 8712, 9224, 9736, 10248, 10760, 11272, 11784, 12296, 12808, 13320, 13832, 14344, 14856, 15368, 15880, 16392, 16904, 17416, 17928, 18440, 18952, 19464, 19976, 20488, 21000, 21512, 22024, 22536, 23048, 23560, 24072, 24584, 25096, 25608, 26120, 26632, 27144, 27656, 28168, 28680, 29192, 29704, 30216, 30728, 31240, 31752, 32264, 32776, 33288, 33800, 34312, 34824, 35336, 35848, 36360, 36872, 37384, 37896, 38408, 38920, 39432, 39944, 40456, 40968, 41480, 41992, 42504, 43016, 43528, 44040, 44552, 45064, 45576, 46088, 46600, 47112, 47624, 48136, 48648, 49160, 49672, 50184, 50696, 51208, 51720, 52232, 52744, 53256, 53768, 54280, 54792, 55304, 55816, 56328, 56840, 57352, 57864, 58376, 58888, 59400, 59912, 60424, 60936, 61526]
something went wrong chunking up a file for multithreaded parsing, falling back to single-threaded parsing
time for initial parsing: 5.262593030929565
types after parsing: Type[String31], pool = (0.2, 500)
```

I think a few funny things are going on here:

  1. Multi-threaded parsing fails... and I'm curious why.
    • I thought this usually only happened when we got unlucky with quoted columns, and there's no quoted data here... but maybe my understanding is wrong and multi-threaded parsing is known to fail in other cases.
  2. Multi-threaded parsing detects String31 (and fails).
    • I'm not sure if this is the same as the point above, i.e. whether the incorrect type detection and the failure are one and the same thing.
    • I think what happens is that detect somehow gets passed pos == len, so we try to parse a single character that happens to be the `-` character (which is why detect chooses a string type)... but how we get here I'm not sure.
  3. I think we're parsing things as String31 because this is what multi-threaded parsing detected... but multi-threaded parsing fails, and yet we still used the detected String31 type for single-threaded parsing.
    • Should we reset the column types (to NeedsTypeDetection for columns where the type wasn't user-given) if multi-threaded parsing fails, so that "falling back to single-threaded parsing" really is the same as ntasks=1? A sketch of this idea follows below.
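A hypothetical sketch of what that reset could look like (the names `Column`, `userprovided`, and `NeedsTypeDetection` here only approximate CSV.jl's internals, and this is not the actual fix, which came later in #1073):

```julia
# Hypothetical sketch only; these names approximate CSV.jl's internals.
struct NeedsTypeDetection end    # stand-in sentinel for "detect this column"

mutable struct Column
    type::Type
    userprovided::Bool           # true if the user passed a type for this column
end

# If multithreaded parsing fails, discard auto-detected types before the
# single-threaded fallback so it really behaves like ntasks=1.
function reset_types!(columns::Vector{Column})
    for col in columns
        if !col.userprovided
            col.type = NeedsTypeDetection
        end
    end
    return columns
end

cols = [Column(String, false), Column(Int64, true)]
reset_types!(cols)
@show getfield.(cols, :type)     # [NeedsTypeDetection, Int64]
```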

mattBrzezinski (Author) commented

Multi-threading fails here. I'm slowly trying to walk through and figure out what's going on, but it's quite difficult and overwhelming to understand.

quinnj (Member) commented Jun 29, 2022

I'd be curious to know the values here and why that check failed, especially on the full file, where it seems like we should have enough columns to get a good probability of finding the right row endings.

It sounds like @nickrobinson251 is probably right that we're not resetting things correctly when multithreaded parsing fails, so we're "stuck" with potentially bad types.
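For intuition, a rough sketch of the kind of consistency check being described: pick a candidate chunk start, parse the next few rows, and accept the boundary only if most of them have the expected field count (this only gestures at the idea; it is not CSV.jl's actual algorithm):

```julia
# Rough illustration of a row-boundary sanity check: from a candidate chunk
# start, count fields in the next few rows and require that most match the
# expected number of columns. Not CSV.jl's actual algorithm.
function plausible_start(buf::AbstractString, pos::Int, ncols::Int; nrows::Int = 5)
    rows = split(SubString(buf, pos), '\n'; limit = nrows + 1)[1:end-1]
    isempty(rows) && return false
    nmatch = count(r -> length(split(r, ',')) == ncols, rows)
    return nmatch / length(rows) >= 0.8   # the "good % probability" threshold
end

buf = "a,1\nb,2\nc,3\nd,4\n"
@show plausible_start(buf, 5, 2)  # true:  byte 5 starts the row "b,2"
@show plausible_start(buf, 3, 2)  # false: byte 3 lands mid-row, so the first
                                  # "row" seen has the wrong field count
```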
