Incorrect parsing with auto-detected "\r" line endings when normalizeLineEndingsWithinQuotes=false #499

eirikbakke · 2022-03-17T19:36:06Z

The following CSV file, with "\r" style line endings...

colA,colB,colC
a,A,"x"
b,B,k

...should parse as [[colA, colB, colC], [a, A, x], [b, B, k]]. However, when lineSeparatorDetectionEnabled=true and normalizeLineEndingsWithinQuotes=false, I instead get [[colA, colB, colC], [a, A, "x"\rb, B, k]].

Here is a complete test case, which fails with Univocity 2.9.1 on Windows 11 and Java 17:

import com.univocity.parsers.csv.CsvFormat;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.junit.Assert;
import org.junit.Test;

public class UnivocityLineEndingBugTest {
  private static final boolean TRIGGER_BUG = true;

  private static CsvParserSettings createUnivocitySettings() {
    final CsvParserSettings settings = new CsvParserSettings();
    final CsvFormat format = settings.getFormat();
    settings.setDelimiterDetectionEnabled(false);
    format.setDelimiter(',');
    settings.setQuoteDetectionEnabled(false);
    format.setQuote('\"');
    format.setQuoteEscape('\"');
    settings.setKeepEscapeSequences(false);
    settings.setKeepQuotes(false);

    // Setting this to true will also cause the bug to go away.
    settings.setNormalizeLineEndingsWithinQuotes(false);
    //format.setNormalizedNewline('\n');
    if (TRIGGER_BUG) {
      settings.setLineSeparatorDetectionEnabled(true);
    } else {
      settings.setLineSeparatorDetectionEnabled(false);
      format.setLineSeparator("\r");
    }
    return settings;
  }

  @Test
  public void testBug() throws IOException {
    String csvFile =
        "colA,colB,colC\r" +
        "a,A,\"x\"\r" +
        "b,B,k\r";
    CsvParserSettings settings = createUnivocitySettings();
    List<List<String>> result = new ArrayList<>();
    try (Reader reader = new StringReader(csvFile)) {
      CsvParser parser = new CsvParser(settings);
      parser.beginParsing(reader);
      while (true) {
        String[] row = parser.parseNext();
        if (row == null)
          break;
        // System.out.println(Arrays.toString(row));
        result.add(new ArrayList<>(Arrays.asList(row)));
      }
    }
    System.out.println(result.toString());
    Assert.assertEquals("[[colA, colB, colC], [a, A, x], [b, B, k]]", result.toString());
  }
}
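
Until this is fixed, one possible workaround follows from the non-failing branch of the test: disable detection and set the separator explicitly. The sketch below is only an illustration of that idea, not part of Univocity. The class `LineSeparatorSniffer` and its `detect` method are hypothetical names, and it assumes `"` is the quote character and that a sample of the input is available as a `String`. It returns the first line break found outside quotes.

```java
public class LineSeparatorSniffer {
  /**
   * Hypothetical helper (not a Univocity API): returns "\r\n", "\n" or "\r"
   * for the first line break found outside double quotes in the sample,
   * falling back to "\n" when the sample contains no unquoted line break.
   */
  static String detect(String sample) {
    boolean inQuotes = false;
    for (int i = 0; i < sample.length(); i++) {
      char c = sample.charAt(i);
      if (c == '"') {
        // An escaped quote ("") toggles twice, so the state is unchanged.
        inQuotes = !inQuotes;
      } else if (!inQuotes && c == '\n') {
        return "\n";
      } else if (!inQuotes && c == '\r') {
        boolean crlf = i + 1 < sample.length() && sample.charAt(i + 1) == '\n';
        return crlf ? "\r\n" : "\r";
      }
    }
    return "\n";
  }

  public static void main(String[] args) {
    String csv = "colA,colB,colC\ra,A,\"x\"\rb,B,k\r";
    System.out.println(detect(csv).equals("\r")); // prints true
  }
}
```

The detected value would then be passed to `format.setLineSeparator(...)` after calling `settings.setLineSeparatorDetectionEnabled(false)`, mirroring the `TRIGGER_BUG = false` branch above. I have only verified that branch with a hard-coded "\r", so treat this as a sketch.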

Thank you for your work on the excellent Univocity library! I am using it for Ultorg and am in the process of writing unit tests, which is how I found the bug above...

eirikbakke · 2022-03-17T19:41:42Z

Also note that the Javadoc and parameter name for CharInputReader.enableNormalizeLineEndings(escaping) seem to reverse the actual behavior of the method as assumed by callers and implemented in AbstractCharInputReader. In fact, in the overriding method in AbstractCharInputReader, the parameter has been renamed to normalizeLineEndings, which seems like a more correct name.

nchammas mentioned this issue Feb 19, 2024

[SPARK-46862][SQL] Disable CSV column pruning in the multi-line mode apache/spark#44872

Closed
