[SPARK-32614][SQL] Don't apply comment processing if 'comment' unset for CSV #29516

srowen · 2020-08-22T22:07:45Z

What changes were proposed in this pull request?

Spark's CSV source can optionally ignore lines starting with a comment char. Some code paths check to see if it's set before applying comment logic (i.e. not set to default of \0), but many do not, including the one that passes the option to Univocity. This means that rows beginning with a null char were being treated as comments even when 'disabled'.

Why are the changes needed?

To avoid dropping rows that start with a null char when this is not requested or intended. See JIRA for an example.

Does this PR introduce any user-facing change?

Nothing beyond the effect of the bug fix.

How was this patch tested?

Existing tests plus new test case.


srowen

CC maybe @HyukjinKwon and/or @dongjoon-hyun

srowen · 2020-08-22T22:08:03Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtils.scala

+    if (options.isCommentSet) {
+      val commentPrefix = options.comment.toString
+      iter.filter { line =>
+        val trimmed = line.trim


I just added this while here to avoid trimming twice
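
For readers following along, the guard in this hunk can be sketched outside Spark roughly as below. This is a minimal illustration, not the actual `CSVExprUtils` code: `SketchOptions` is a hypothetical stand-in for `CSVOptions`, the reading of `isCommentSet` as `comment != '\u0000'` is an assumption taken from the PR description, and the `else` branch is a guess, since the hunk above is truncated.

```scala
object CommentFilterSketch {
  // Hypothetical stand-in for Spark's CSVOptions; only the members used here.
  final case class SketchOptions(comment: Char = '\u0000') {
    // Assumption from the PR description: '\u0000' is the "unset" default.
    def isCommentSet: Boolean = comment != '\u0000'
  }

  // Always drops blank lines; drops comment lines only when a comment char is set.
  def filterCommentAndEmpty(iter: Iterator[String], options: SketchOptions): Iterator[String] = {
    if (options.isCommentSet) {
      val commentPrefix = options.comment.toString
      iter.filter { line =>
        val trimmed = line.trim // trim once and reuse it for both checks
        trimmed.nonEmpty && !trimmed.startsWith(commentPrefix)
      }
    } else {
      // Comment unset: no prefix check at all (assumed shape of the elided branch).
      iter.filter(_.trim.nonEmpty)
    }
  }
}
```

With `SketchOptions('#')`, the input lines `#note`, `a,1`, a blank line, and `b,2` reduce to `a,1` and `b,2`; with the default `SketchOptions()`, only the blank line is removed.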

srowen · 2020-08-22T22:08:43Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala

@@ -220,7 +220,9 @@ class CSVOptions(
    format.setQuote(quote)
    format.setQuoteEscape(escape)
    charToEscapeQuoteEscaping.foreach(format.setCharToEscapeQuoteEscaping)
-    format.setComment(comment)
+    if (isCommentSet) {


Arguably we should rework the handling of 'optional' configs to not use this default of \u0000 to mean "none" but I avoided that here. One consequence is that you cannot use \u0000 as a comment char right now.

If we change it that way, it might impact existing users for whom \u0000 is the comment character by default, so I would say a separate optional config is a better solution. What I am saying is that we need to wait for univocity 3.0.0, where the new changes will be available; then we can add the Spark changes in a proper manner.

You are correct, but, this has never been a valid comment character, and the flip side is the bug you describe: it's always a comment character. I think it's reasonable to fix as a bug. I don't think we need yet another config, as I think it would be quite obscure to use this non-printing control code for comments in a CSV file.

I agree, but once the changes are done, \u0000 won't be treated as a comment character, which resolves this bug. But then the default comment character will be #, since that is univocity's default. So if my data row starts with #, will the row still be processed? If not, it will break most existing jobs.

I agree, I'll fix that in the next commit - we need to set the comment char to whatever Spark is using no matter what. However it looks like we are going to need your univocity fix to really fix this. Looks like that was just released in 2.9.0: uniVocity/univocity-parsers@f392311

let me try that.

@dongjoon-hyun it is a correctness issue but I wouldn't hold up a release for it. We should address it, but it doesn't absolutely have to happen in 2.4.7 or 3.0.1. It's not a regression.

Thanks, if you are fine I can also raise a PR for this.

I would think this is rather a bug fix. If comment is not set, it shouldn't assume anything else is a comment.

That's also what we documented, see also DataFrameReader.csv.

Right, yeah, so we have to use the new method in univocity 2.9.0 to turn off its comment handling if it's unset in Spark (= \u0000).

Oh right, this stanza is for writer settings. There is no setCommentProcessingEnabled for writers in univocity. Comments aren't generated. In fact the comment setting doesn't matter, really?
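
The three parser-side states discussed in this thread can be compared with a toy model. The sketch below deliberately does not call univocity: `CommentConfig` and `parse` are hypothetical names, and the only facts taken from the thread are that Spark's unset value is `\u0000` and that univocity's default comment character is `#`.

```scala
object UnivocityCommentModel {
  // Toy model of a parser's comment settings; NOT the univocity API.
  final case class CommentConfig(commentChar: Char, processingEnabled: Boolean)

  // Lines starting with the comment char are skipped while comment processing is on.
  def parse(lines: Seq[String], config: CommentConfig): Seq[String] =
    if (config.processingEnabled) lines.filterNot(_.startsWith(config.commentChar.toString))
    else lines

  // Before the fix: Spark's "unset" default '\u0000' is passed through as the comment char.
  val beforeFix: CommentConfig = CommentConfig('\u0000', processingEnabled = true)

  // Intermediate state: the comment char is simply not set, so the parser keeps its '#' default.
  val parserDefaultLeft: CommentConfig = CommentConfig('#', processingEnabled = true)

  // Direction settled on in the thread: comment handling is turned off when Spark's comment is unset.
  val commentDisabled: CommentConfig = CommentConfig('\u0000', processingEnabled = false)
}
```

Given the rows `\u0000a,1`, `#b,2`, and `c,3`, the first configuration loses the row starting with the null char, the second loses the row starting with `#`, and only the third returns all three rows, which is why the discussion converges on disabling comment processing rather than skipping the setter.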

dongjoon-hyun · 2020-08-22T22:11:16Z

Oh, this sounds like a kind of correctness issue. Did I understand correctly, @srowen ?

dongjoon-hyun · 2020-08-22T22:24:54Z

cc @ScrapCodes for Apache Spark 2.4.7 and @zhengruifeng for Apache Spark 3.0.1 because this is a correctness issue.

SparkQA · 2020-08-23T00:15:45Z

Test build #127788 has finished for PR 29516 at commit 6358727.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2020-08-23T02:02:14Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtils.scala

    }
  }

  def skipComments(iter: Iterator[String], options: CSVOptions): Iterator[String] = {
    if (options.isCommentSet) {
      val commentPrefix = options.comment.toString
      iter.dropWhile { line =>
-        line.trim.isEmpty || line.trim.startsWith(commentPrefix)
+        line.trim.isEmpty || line.startsWith(commentPrefix)


I think it's correct to not trim the string that's checked to see if it starts with a comment, which is a slightly separate issue. \u0000 can't be used as a comment char, but other non-printable chars could.
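
A small standalone check illustrates why trimming before the prefix test matters for non-printable comment characters. The choice of `\u0001` as the comment char and the sample line are assumptions made only for this illustration.

```scala
object TrimPrefixSketch {
  // Assumption for illustration: a non-printable comment char, here U+0001.
  val commentPrefix: String = "\u0001"
  val line: String = "\u0001 this line is a comment"

  // String.trim strips leading and trailing chars <= U+0020, so the comment char is gone
  // before startsWith ever sees it.
  val matchesAfterTrim: Boolean = line.trim.startsWith(commentPrefix)

  // Checking the untrimmed line still sees the comment char.
  val matchesUntrimmed: Boolean = line.startsWith(commentPrefix)
}
```

Here `matchesAfterTrim` is `false` and `matchesUntrimmed` is `true`, so a trimmed check would never recognise such a line as a comment.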

srowen · 2020-08-23T02:03:08Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala

@@ -1902,25 +1902,26 @@ abstract class CSVSuite extends QueryTest with SharedSparkSession with TestCsvDa

  test("SPARK-25387: bad input should not cause NPE") {
    val schema = StructType(StructField("a", IntegerType) :: Nil)
-    val input = spark.createDataset(Seq("\u0000\u0000\u0001234"))
+    val input = spark.createDataset(Seq("\u0001\u0000\u0001234"))


I think this test was wrong in two ways. First, it actually relied on ignoring lines starting with \u0000, which is the very bug we're fixing. You can see below that it asserts there is no result at all, when there should be some result.

srowen · 2020-08-23T02:03:42Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala


    checkAnswer(
      spark.read
        .option("columnNameOfCorruptRecord", "_corrupt_record")
        .schema(schema)
        .csv(input),
-      Row(null, null))
-    assert(spark.read.csv(input).collect().toSet == Set(Row()))
+      Row(null, "\u0001\u0000\u0001234"))


The other problem I think is that this was asserting there is no corrupt record -- no result at all -- when I think clearly the test should result in a single row with a corrupt record.

SparkQA · 2020-08-23T03:50:05Z

Test build #127793 has finished for PR 29516 at commit 87e8b65.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-08-23T23:22:26Z

Test build #127817 has finished for PR 29516 at commit f3d14c6.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2020-08-24T02:17:29Z

retest this please

srowen · 2020-08-24T02:20:18Z

BTW I think we may still have a real test failure here, I'm looking into it.

SparkQA · 2020-08-24T04:44:47Z

Test build #127824 has finished for PR 29516 at commit f3d14c6.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-08-24T18:01:32Z

Test build #127843 has finished for PR 29516 at commit b685f1c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtils.scala

HyukjinKwon

Looks good otherwise.

srowen · 2020-08-25T15:27:59Z

@HyukjinKwon did this go into 3.0 as well? I think it's meant to

…for CSV

Spark's CSV source can optionally ignore lines starting with a comment char. Some code paths check to see if it's set before applying comment logic (i.e. not set to default of `\0`), but many do not, including the one that passes the option to Univocity. This means that rows beginning with a null char were being treated as comments even when 'disabled'.

To avoid dropping rows that start with a null char when this is not requested or intended. See JIRA for an example.

Nothing beyond the effect of the bug fix.

Existing tests plus new test case.

Closes #29516 from srowen/SPARK-32614.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit a9d4e60)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

HyukjinKwon · 2020-08-25T15:28:51Z

Merged to master and branch-3.0.

Yes, there was a minor conflict which I resolved manually.

tooptoop4 · 2020-08-28T21:19:15Z

can this go in 2.4.7 ?

srowen · 2020-08-28T21:55:37Z

@tooptoop4 I think it could go into 2.4.x. Do you want to try a back-port PR to see if it picks cleanly and passes tests? There's a bump in the univocity version, which I'd want to be sure doesn't change other behavior.

tooptoop4 · 2020-08-29T13:07:27Z

@srowen I would have, but my last PR (#27697) got shot down for no reason

srowen · 2020-08-29T14:20:14Z

@tooptoop4 that change looks unrelated? There was also quite a bit of reason given.
I don't think it's relevant to creating a back-port, which I am saying is worth evaluating.

srowen · 2021-03-05T01:31:29Z

Isn't that expected? or can you set the comment char to something else?

On Thu, Mar 4, 2021 at 5:41 PM koertkuipers wrote:
this has unintended side effect of now dropping rows that start with #
we ran into this because we had comments disabled but we noticed that rows in a csv that start with # were dropped

koertkuipers · 2021-03-05T04:00:02Z


if in spark csv comment is not set (isCommentSet is false) then univocity should process with comment feature disabled. per univocity documentation the way to do this is to set comment to \0. i realize this seems to be not working exactly as desired although i do not yet fully grasp how or why.

but what we are doing now instead is: if in spark comment is not set (isCommentSet is false) then we leave the default comment in univocity, which is #. that is not the same as unsetting/disabling comment feature. i feel like this might be confusing and maybe also can have unintended consequences?

i am still unsure how this is impacting us but what i see is that when we disable comment feature in spark csv we see univocity quote values that start with # upon writing. since we are generating bar-delimited output for systems that do not support quotes this causes trouble for us.
we actually had disabled quote (which sets it to \0) leading to # becoming \0#\0 upon writing, which then in older versions of spark was considered a comment line and got dropped! so i spent a few days going down the rabbit hole of trying to understand why we were losing records... it came down to this change in behavior here.
you are right that for my particular issue this can be fixed by explicitly setting comment to something else (as long as it's not \0).
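
The write-side effect described in this comment can be pictured with a toy model. This is not univocity code: `writeRow` is a hypothetical helper, and the rule it encodes (quote the first field when it starts with the writer's comment char) is an assumption pieced together from this comment and from the SPARK-42335 commit message linked further down.

```scala
object WriterQuotingModel {
  // Toy model, NOT univocity code: the first field is wrapped in the quote char
  // when it starts with the writer's comment char; other fields are left alone.
  def writeRow(fields: Seq[String], commentChar: Char, quote: Char = '"'): String =
    fields.zipWithIndex.map { case (field, idx) =>
      if (idx == 0 && field.nonEmpty && field.head == commentChar) s"$quote$field$quote"
      else field
    }.mkString(",")
}
```

Under this model, `writeRow(Seq("#abc", "1"), '#')` gives `"#abc",1`; passing `quote = '\u0000'` wraps the value in null chars instead, matching the `\0#\0` output described above; and a comment char other than `#` leaves the row as `#abc,1`.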

manuzhang · 2022-04-26T02:56:04Z

Same issue as @koertkuipers. Even when explicitly setting comment to something else, we still can't select rows with a column value of # in Spark 2.3.1 (univocity-parsers 2.5.9):

select * from table where column = '#'

manuzhang · 2022-05-10T04:59:11Z

I created uniVocity/univocity-parsers#505 to request an option to disable quoting of a row-starting comment char.

…ers set it explicitly in CSV dataSource

### What changes were proposed in this pull request?
Pass the comment option through to univocity if users set it explicitly in CSV dataSource.

### Why are the changes needed?
In #29516, in order to fix some bugs, univocity-parsers was upgraded from 2.8.3 to 2.9.0, which also brought in a new univocity-parsers feature that quotes values of the first column that start with the comment character. This was a breaking change for downstream users that handle a whole row as input.

Before this change: #abc,1
After this change: "#abc",1

We change the related `isCommentSet` check logic to enable users to keep the previous behavior.

### Does this PR introduce _any_ user-facing change?
Yes, a little. If users set the comment option to '\u0000' explicitly, they should now remove it to keep the comment option unset.

### How was this patch tested?
Add a full new test.

Closes #39878 from wayneguow/comment.

Authored-by: wayneguow <guow93@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>

Don't apply comment processing if 'comment' unset for CSV. Some paths…

6358727

… still did.

srowen self-assigned this Aug 22, 2020

probot-autolabeler bot added the SQL label Aug 22, 2020

srowen commented Aug 22, 2020


srowen changed the title ~~[SPARK-32614][SQL] Don't apply comment processing if 'comment' unset for CSV~~ [WIP][SPARK-32614][SQL] Don't apply comment processing if 'comment' unset for CSV Aug 23, 2020

Try to fix some tests

87e8b65

srowen commented Aug 23, 2020


Update univocity to 2.9.0 for better comment disabling

f3d14c6

probot-autolabeler bot added the BUILD label Aug 23, 2020

Fix last new test for null

b685f1c

HyukjinKwon reviewed Aug 25, 2020


HyukjinKwon approved these changes Aug 25, 2020


srowen changed the title ~~[WIP][SPARK-32614][SQL] Don't apply comment processing if 'comment' unset for CSV~~ [SPARK-32614][SQL] Don't apply comment processing if 'comment' unset for CSV Aug 25, 2020

HyukjinKwon closed this in a9d4e60 Aug 25, 2020

srowen deleted the SPARK-32614 branch September 12, 2020 21:11

wayneguow mentioned this pull request Feb 3, 2023

[SPARK-42335][SQL] Pass the comment option through to univocity if users set it explicitly in CSV dataSource #39878

Closed

abellina mentioned this pull request Apr 28, 2023

[AUDIT][SPARK-32614][SQL] Don't apply comment processing if 'comment' unset for CSV NVIDIA/spark-rapids#8205

Open


codealways Aug 23, 2020 • edited
