
Threaded parsing type mismatch depending on ntasks value #1010

Closed
mattBrzezinski opened this issue Jun 28, 2022 · 6 comments · Fixed by #1073

mattBrzezinski commented Jun 28, 2022

Problem

We've run across a very odd issue where, depending on the ntasks value set, the detected column type differs.

CSV.File("foobar.csv"; ntasks=60, debug=true)
...
types after parsing: Type[Float64], pool = (0.2, 500)
CSV.File("foobar.csv"; ntasks=120, debug=true)
types after parsing: Type[String31], pool = (0.2, 500)

File to replicate the issue, foobar.csv
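For anyone without the attachment, a hypothetical sketch that generates a file of similar shape (one Float64 column including negative values); whether it actually trips the bad chunk boundary depends on the exact byte layout, so it may not reproduce the mismatch:

```julia
using CSV, Random

# Hypothetical reproduction sketch: one Float64 column with negative values,
# roughly the shape of the attached foobar.csv. Reproducing the mismatch
# depends on where the chunk boundaries land, so this may not trigger it.
open("synthetic.csv", "w") do io
    println(io, "foobar")
    rng = MersenneTwister(42)
    for _ in 1:3598   # similar row count to the reported file
        println(io, round(randn(rng) * 100; digits=3))
    end
end

f60  = CSV.File("synthetic.csv"; ntasks=60)
f120 = CSV.File("synthetic.csv"; ntasks=120)
@show eltype(f60.foobar) eltype(f120.foobar)
```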

iamed2 (Contributor) commented Jun 28, 2022

Additional fact from the originally observed file: this was the right-most column among several columns of various types (Float64, String, Int64), and only this column had issues.

nickrobinson251 (Collaborator) commented Jun 28, 2022

I think I know what's happening, but not what to do about it.

When we chunk up the file in the ntasks=120 case, we get unlucky: the last two bytes of a chunk end up being a newline followed by a leading negation sign.

It's a bit like trying to parse a file that looks like

```
1.0
2.0
-3.0
4.0
5.0
```

(i.e. `1.0\n2.0\n-3.0\n4.0\n5.0`) by parsing it in two chunks: `1.0\n2.0\n-` and `3.0\n4.0\n5.0`.

When we try to parse that first chunk as Float64 values (as we do as part of detect), we're going to parse 1.0, then 2.0, then try to parse `-` as a Float64, which is going to fail, so detect decides on this basis that the column isn't Float64 after all (and falls back to parsing the column as a string).
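A standalone illustration of that failure mode (this uses plain `tryparse` rather than CSV.jl's internal detect machinery, so it's only a sketch of the idea):

```julia
# Illustration only: split the sample data mid-value, the way an unlucky
# chunk boundary would, then run naive per-chunk type detection.
data = "1.0\n2.0\n-3.0\n4.0\n5.0"

chunk1 = data[1:9]     # "1.0\n2.0\n-"  -- ends right after the negation sign
chunk2 = data[10:end]  # "3.0\n4.0\n5.0"

# Naive detection: call the column Float64 only if every token parses as one.
detect_type(chunk) =
    all(!isnothing, tryparse.(Float64, split(chunk, '\n'))) ? Float64 : String

@show detect_type(chunk1)  # String  -- the trailing "-" fails to parse
@show detect_type(chunk2)  # Float64
```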

quinnj (Member) commented Jun 28, 2022

Hmmmm.....I don't think that should be possible because we do extra work to ensure chunks only get split exactly on the newline character, so \n should always end a chunk and the next character would be the start of the next chunk.
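For context, a simplified sketch of that kind of newline-aligned boundary adjustment (not CSV.jl's actual implementation, which also has to cope with quoted fields that can contain newlines):

```julia
# Simplified sketch: start from evenly spaced byte positions, then push each
# chunk start forward until the preceding byte is '\n', so no chunk begins
# mid-row. CSV.jl's real logic is more involved.
function adjust_boundaries(buf::Vector{UInt8}, ntasks::Int)
    len = length(buf)
    positions = [1 + div(len * (i - 1), ntasks) for i in 1:ntasks]
    for i in 2:ntasks
        pos = positions[i]
        while pos <= len && buf[pos - 1] != UInt8('\n')
            pos += 1
        end
        positions[i] = pos
    end
    return positions
end

buf = Vector{UInt8}("1.0\n2.0\n-3.0\n4.0\n5.0\n")
@show adjust_boundaries(buf, 2)  # [1, 14]: the second chunk starts at "4.0"
```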

nickrobinson251 (Collaborator) commented Jun 28, 2022

Hmmmm

Well, I'm very curious to find out what is going on 😂

The debug output is:

```
julia> CSV.File("foobar.csv"; ntasks=120, debug=true)
header is: 1, skipto computed as: 2
headerpos = 1, datapos = 8
estimated rows: 3598
detected delimiter: ","
column names detected: [:foobar]
byte position of data computed at: 8
computed types are: nothing
initial byte positions before adjusting for start of rows: [8, 520, 1032, 1544, 2056, 2568, 3080, 3592, 4104, 4616, 5128, 5640, 6152, 6664, 7176, 7688, 8200, 8712, 9224, 9736, 10248, 10760, 11272, 11784, 12296, 12808, 13320, 13832, 14344, 14856, 15368, 15880, 16392, 16904, 17416, 17928, 18440, 18952, 19464, 19976, 20488, 21000, 21512, 22024, 22536, 23048, 23560, 24072, 24584, 25096, 25608, 26120, 26632, 27144, 27656, 28168, 28680, 29192, 29704, 30216, 30728, 31240, 31752, 32264, 32776, 33288, 33800, 34312, 34824, 35336, 35848, 36360, 36872, 37384, 37896, 38408, 38920, 39432, 39944, 40456, 40968, 41480, 41992, 42504, 43016, 43528, 44040, 44552, 45064, 45576, 46088, 46600, 47112, 47624, 48136, 48648, 49160, 49672, 50184, 50696, 51208, 51720, 52232, 52744, 53256, 53768, 54280, 54792, 55304, 55816, 56328, 56840, 57352, 57864, 58376, 58888, 59400, 59912, 60424, 60936, 61526]
something went wrong chunking up a file for multithreaded parsing, falling back to single-threaded parsing
time for initial parsing: 5.262593030929565
types after parsing: Type[String31], pool = (0.2, 500)
```

I think a few funny things are going on here:

  1. Multi-threaded parsing fails... and I'm curious why.
    • I thought this usually only happened when we got unlucky with quoted columns, and there's no quoted data here... but maybe my understanding is wrong and multi-threaded parsing is known to fail in other cases.
  2. Multi-threaded parsing detects String31 (and fails).
    • I'm not sure if this is the same as the point above, i.e. whether the incorrect type detection and the failure are one and the same thing.
    • I think what happens is that detect somehow gets passed pos == len, so we try to parse a single character that happens to be the `-` character (which is why detect chooses a string type)... but how we get here I'm not sure.
  3. I think we're parsing things as String31 because this is what multi-threaded parsing detected... but multi-threaded parsing fails, and yet we still used the detected String31 type for single-threaded parsing.
    • Should we reset the column types (to NeedsTypeDetection for columns where the type wasn't user-given) if multi-threaded parsing fails, so that "falling back to single-threaded parsing" really is the same as ntasks=1? A sketch of this idea follows below.
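A hypothetical sketch of what that reset could look like (the names `Column`, `userprovided`, and `NeedsTypeDetection` here only approximate CSV.jl's internals, and this is not the actual fix, which came later in #1073):

```julia
# Hypothetical sketch only; these names approximate CSV.jl's internals.
struct NeedsTypeDetection end    # stand-in sentinel for "detect this column"

mutable struct Column
    type::Type
    userprovided::Bool           # true if the user passed a type for this column
end

# If multithreaded parsing fails, discard auto-detected types before the
# single-threaded fallback so it really behaves like ntasks=1.
function reset_types!(columns::Vector{Column})
    for col in columns
        if !col.userprovided
            col.type = NeedsTypeDetection
        end
    end
    return columns
end

cols = [Column(String, false), Column(Int64, true)]
reset_types!(cols)
@show getfield.(cols, :type)     # [NeedsTypeDetection, Int64]
```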

mattBrzezinski (Author) commented

Multi-threading fails here. I'm slowly trying to walk through and figure out what's going on, but it's quite difficult and overwhelming to understand.

quinnj (Member) commented Jun 29, 2022

I'd be curious to know the values here and why that check failed, especially on the full file, where it seems like we should have enough columns to get a good probability of finding the right row endings.

It sounds like @nickrobinson251 is probably right that we're not resetting things correctly when multithreaded parsing fails, so we're "stuck" with potentially bad types.
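For intuition, a rough sketch of the kind of consistency check being described: pick a candidate chunk start, parse the next few rows, and accept the boundary only if most of them have the expected field count (this only gestures at the idea; it is not CSV.jl's actual algorithm):

```julia
# Rough illustration of a row-boundary sanity check: from a candidate chunk
# start, count fields in the next few rows and require that most match the
# expected number of columns. Not CSV.jl's actual algorithm.
function plausible_start(buf::AbstractString, pos::Int, ncols::Int; nrows::Int = 5)
    rows = split(SubString(buf, pos), '\n'; limit = nrows + 1)[1:end-1]
    isempty(rows) && return false
    nmatch = count(r -> length(split(r, ',')) == ncols, rows)
    return nmatch / length(rows) >= 0.8   # the "good % probability" threshold
end

buf = "a,1\nb,2\nc,3\nd,4\n"
@show plausible_start(buf, 5, 2)  # true:  byte 5 starts the row "b,2"
@show plausible_start(buf, 3, 2)  # false: byte 3 lands mid-row, so the first
                                  # "row" seen has the wrong field count
```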
