Incorrect parsing with auto-detected "\r" line endings when normalizeLineEndingsWithinQuotes=false #499

Open
eirikbakke opened this issue Mar 17, 2022 · 1 comment

eirikbakke commented Mar 17, 2022

The following CSV file, with "\r" style line endings...

colA,colB,colC
a,A,"x"
b,B,k

...should parse as [[colA, colB, colC], [a, A, x], [b, B, k]]. However, when lineSeparatorDetectionEnabled=true and normalizeLineEndingsWithinQuotes=false, I instead get [[colA, colB, colC], [a, A, "x"\rb, B, k]].
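
To make the expected result concrete, here is a minimal, library-independent sketch of how the input above splits when "\r" is treated as the record separator. It is not Univocity's implementation: the class name CrSplitSketch and the method parse are hypothetical, and the sketch only handles the simple quoting used in this example (a doubled quote acts as the escape).

```java
import java.util.ArrayList;
import java.util.List;

final class CrSplitSketch {
  // Hypothetical helper: splits CSV text into rows, treating '\r' as the
  // record separator, ',' as the delimiter, and '"' as the quote character.
  static List<List<String>> parse(String text) {
    List<List<String>> rows = new ArrayList<>();
    List<String> row = new ArrayList<>();
    StringBuilder field = new StringBuilder();
    boolean inQuotes = false;
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      if (inQuotes) {
        if (c == '"') {
          if (i + 1 < text.length() && text.charAt(i + 1) == '"') {
            field.append('"'); // doubled quote inside a quoted field
            i++;
          } else {
            inQuotes = false; // closing quote
          }
        } else {
          field.append(c); // any character, including '\r', is kept while quoted
        }
      } else if (c == '"') {
        inQuotes = true; // opening quote
      } else if (c == ',') {
        row.add(field.toString());
        field.setLength(0);
      } else if (c == '\r') {
        // An unquoted '\r' ends the current record.
        row.add(field.toString());
        field.setLength(0);
        rows.add(row);
        row = new ArrayList<>();
      } else {
        field.append(c);
      }
    }
    // Flush a final record that has no trailing separator.
    if (field.length() > 0 || !row.isEmpty()) {
      row.add(field.toString());
      rows.add(row);
    }
    return rows;
  }

  public static void main(String[] args) {
    String csv = "colA,colB,colC\r" + "a,A,\"x\"\r" + "b,B,k\r";
    System.out.println(parse(csv)); // prints [[colA, colB, colC], [a, A, x], [b, B, k]]
  }
}
```

The key point is that the "\r" after the closing quote of "x" is outside the quoted value, so it should terminate the second record rather than be kept as part of the field.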

Here is a complete test case, which fails with Univocity 2.9.1 on Windows 11 and Java 17:

import com.univocity.parsers.csv.CsvFormat;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.junit.Assert;
import org.junit.Test;

public class UnivocityLineEndingBugTest {
  private static final boolean TRIGGER_BUG = true;

  private static CsvParserSettings createUnivocitySettings() {
    final CsvParserSettings settings = new CsvParserSettings();
    final CsvFormat format = settings.getFormat();
    settings.setDelimiterDetectionEnabled(false);
    format.setDelimiter(',');
    settings.setQuoteDetectionEnabled(false);
    format.setQuote('\"');
    format.setQuoteEscape('\"');
    settings.setKeepEscapeSequences(false);
    settings.setKeepQuotes(false);

    // Setting this to true will also cause the bug to go away.
    settings.setNormalizeLineEndingsWithinQuotes(false);
    //format.setNormalizedNewline('\n');
    if (TRIGGER_BUG) {
      settings.setLineSeparatorDetectionEnabled(true);
    } else {
      settings.setLineSeparatorDetectionEnabled(false);
      format.setLineSeparator("\r");
    }
    return settings;
  }

  @Test
  public void testBug() throws IOException {
    String csvFile =
        "colA,colB,colC\r" +
        "a,A,\"x\"\r" +
        "b,B,k\r";
    CsvParserSettings settings = createUnivocitySettings();
    List<List<String>> result = new ArrayList<>();
    try (Reader reader = new StringReader(csvFile)) {
      CsvParser parser = new CsvParser(settings);
      parser.beginParsing(reader);
      while (true) {
        String[] row = parser.parseNext();
        if (row == null) {
          break;
        }
        // System.out.println(Arrays.toString(row));
        result.add(new ArrayList<>(Arrays.asList(row)));
      }
    }
    System.out.println(result.toString());
    Assert.assertEquals("[[colA, colB, colC], [a, A, x], [b, B, k]]", result.toString());
  }
}
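
One possible workaround, assuming the TRIGGER_BUG = false branch above is acceptable, is to detect the line separator outside the parser and set it explicitly. The following is a rough sketch under that assumption. LineSeparatorSniffer and detect are hypothetical names, not part of Univocity, and the heuristic simply returns the first line break found outside double quotes.

```java
final class LineSeparatorSniffer {
  // Hypothetical helper: returns "\r\n", "\n", or "\r" for the first
  // unquoted line break in the sample, or null if none is found.
  static String detect(CharSequence sample) {
    boolean inQuotes = false;
    for (int i = 0; i < sample.length(); i++) {
      char c = sample.charAt(i);
      if (c == '"') {
        // A doubled quote ("") toggles twice, so the quoted state is preserved.
        inQuotes = !inQuotes;
      } else if (!inQuotes && c == '\n') {
        return "\n";
      } else if (!inQuotes && c == '\r') {
        boolean crlf = i + 1 < sample.length() && sample.charAt(i + 1) == '\n';
        return crlf ? "\r\n" : "\r";
      }
    }
    return null;
  }

  public static void main(String[] args) {
    String csv = "colA,colB,colC\r" + "a,A,\"x\"\r" + "b,B,k\r";
    String separator = detect(csv);
    System.out.println("\r".equals(separator)); // prints true
  }
}
```

The detected value could then be passed to format.setLineSeparator(...) with settings.setLineSeparatorDetectionEnabled(false), which is the configuration used in the else branch of createUnivocitySettings() above.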

Thank you for your work on the excellent Univocity library! I am using it for Ultorg and am in the process of writing unit tests, which is how I found the bug above...

@eirikbakke
Author

Also note that the Javadoc and parameter name for CharInputReader.enableNormalizeLineEndings(escaping) seem to describe the reverse of the method's actual behavior, as assumed by callers and implemented in AbstractCharInputReader. In fact, in the overriding method in AbstractCharInputReader, the parameter has been renamed to normalizeLineEndings, which seems like the more accurate name.
