[BUG] Invalid characters in CSV are handled differently when reading from GPU #9560

jbrennan333 · 2023-10-27T14:18:32Z

Describe the bug
As described in NVIDIA/spark-rapids-benchmarks#170, the TPC-DS raw data files are ISO-8859 encoded, but nds_transcode.py was reading them as UTF-8. The customer file has some strings with international characters (Ô and É). With the bug in nds_transcode, we were just passing these ISO-8859 bytes through unmodified, while the CPU CSV reader replaces each invalid UTF-8 sequence with � (U+FFFD, encoded as 0xefbfbd).

Steps/Code to reproduce bug
I will attach the file iso-8859-example.csv.
iso-8859-example.csv

In a spark shell:

spark.conf.set("spark.rapids.sql.enabled", false)
spark.read.csv(s"iso-8859-example.csv").write.option("compression", "none").parquet("example-parquet-cpu")
spark.conf.set("spark.rapids.sql.enabled", true)
spark.read.csv(s"iso-8859-example.csv").write.option("compression", "none").parquet("example-parquet-gpu")

Then use a tool like xxd to examine the binary data for the output files.
CPU:

00000300: 397c 3131 7c31 3938 367c 43ef bfbd 5445  9|11|1986|C...TE
00000310: 2044 2749 564f 4952 457c 7c4b 6174 6965   D'IVOIRE||Katie

GPU:

00000070: 7c32 397c 3131 7c31 3938 367c 43d4 5445  |29|11|1986|C.TE
00000080: 2044 2749 564f 4952 457c 7c4b 6174 6965   D'IVOIRE||Katie

Note that CPU has ef bfbd where GPU has d4
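
The byte difference can be reproduced on the JVM without Spark. The following is a minimal sketch, assuming only that the input field holds the ISO-8859-1 bytes for "CÔTE" (Ô stored as the single byte 0xd4). The names `latin1Bytes`, `hex`, `replaced`, `passedThrough`, and `transcoded` are made up for illustration, and the code mirrors the bytes observed above rather than the actual CPU or GPU reader code paths.

```scala
import java.nio.charset.StandardCharsets

// "CÔTE" as stored in the ISO-8859-1 input file: Ô is the single byte 0xd4.
val latin1Bytes: Array[Byte] = Array(0x43, 0xD4, 0x54, 0x45).map(_.toByte)

// Hypothetical helper: render bytes as space-separated lowercase hex.
def hex(bytes: Array[Byte]): String =
  bytes.map(b => f"${b & 0xff}%02x").mkString(" ")

// Decoding as UTF-8 with the JVM's default handling replaces the malformed
// byte 0xd4 with U+FFFD; re-encoding that character as UTF-8 gives ef bf bd.
val replaced: Array[Byte] =
  new String(latin1Bytes, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8)

// Passing the bytes through without decoding leaves 0xd4 in place.
val passedThrough: Array[Byte] = latin1Bytes.clone()

// Decoding with the charset the file was actually written in and re-encoding
// as UTF-8 yields the valid two-byte sequence c3 94 for Ô.
val transcoded: Array[Byte] =
  new String(latin1Bytes, StandardCharsets.ISO_8859_1).getBytes(StandardCharsets.UTF_8)

// hex(replaced)      is "43 ef bf bd 54 45" (matches the CPU dump)
// hex(passedThrough) is "43 d4 54 45"       (matches the GPU dump)
// hex(transcoded)    is "43 c3 94 54 45"
```

Neither the CPU nor the GPU output contains the original character; only decoding with the correct charset preserves Ô.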

Expected behavior
Ideally, we should produce the same output as CPU.
This is a difference in the handling of an invalid UTF-8 character in the input file (the result of reading an ISO-8859 file as UTF-8), so it's not clear we need to fix it. We might be able to document the difference.
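
Since the difference only shows up for input that is not valid UTF-8, one way to tell whether a given file is affected is to decode it strictly. This is a hedged sketch using the standard `java.nio.charset` API; `isValidUtf8` is a hypothetical helper and not part of Spark or the RAPIDS plugin.

```scala
import java.nio.ByteBuffer
import java.nio.charset.{CharacterCodingException, CodingErrorAction, StandardCharsets}

// Returns true only if `bytes` is entirely well-formed UTF-8.
// The decoder is configured to report errors instead of substituting U+FFFD.
def isValidUtf8(bytes: Array[Byte]): Boolean = {
  val decoder = StandardCharsets.UTF_8.newDecoder()
    .onMalformedInput(CodingErrorAction.REPORT)
    .onUnmappableCharacter(CodingErrorAction.REPORT)
  try {
    decoder.decode(ByteBuffer.wrap(bytes))
    true
  } catch {
    // MalformedInputException is a subclass of CharacterCodingException.
    case _: CharacterCodingException => false
  }
}
```

For a small file such as the attached example, this could be called as `isValidUtf8(java.nio.file.Files.readAllBytes(java.nio.file.Paths.get("iso-8859-example.csv")))`. Note that this reads the whole file into memory, so it is only suitable for spot checks.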


mattahrens · 2023-10-31T20:18:46Z

The initial scope will be to document the issue; the bug can be fixed later.

jbrennan333 added the "bug" and "? - Needs Triage" labels Oct 27, 2023

jbrennan333 mentioned this issue Oct 27, 2023

[FEA] Validate nvcomp-3.0 with spark rapids plugin #9461

Closed

mattahrens added the "documentation" label and removed the "? - Needs Triage" label Oct 31, 2023

jbrennan333 mentioned this issue Nov 7, 2023

Document problem with handling of invalid characters in CSV reader #9655

Merged

jbrennan333 self-assigned this Nov 8, 2023

hyperbolic2346 mentioned this issue Dec 13, 2023

[BUG] Invalid characters in URI query are handled differently when reading from GPU #10036

Closed

jbrennan333 removed their assignment Dec 20, 2023
