Fields with whitespace characters represented with escape sequences return the wrong character #255

Closed
ejdownescx opened this issue Jun 15, 2024 · 3 comments


@ejdownescx

I have a CSV file containing whitespace characters written as escape sequences. Reading it always returns the letter that follows the escape character rather than the whitespace character the sequence represents (e.g. "\n" turns into 'n' instead of a new line).

	[Fact]
	public void InlineEscapedWhitespaceCharacters()
	{
		using var reader = new StringReader(@"_,a1 \a a2,_,b1 \b b2,_,f1 \f f2,_,n1 \n n2,_,r1 \r r2,_,t1 \t t2,_,v1 \v v2");
		using var csvReader = CsvDataReader.Create(reader, new CsvDataReaderOptions
		{
			CsvStyle = CsvStyle.Escaped,
			Escape = '\\',
			Delimiter = ',',
			HasHeaders = false,
		});

		csvReader.Read();
		var value00 = csvReader.GetString(0);
		var value01 = csvReader.GetString(1);
		var value02 = csvReader.GetString(2);
		var value03 = csvReader.GetString(3);
		var value04 = csvReader.GetString(4);
		var value05 = csvReader.GetString(5);
		var value06 = csvReader.GetString(6);
		var value07 = csvReader.GetString(7);
		var value08 = csvReader.GetString(8);
		var value09 = csvReader.GetString(9);
		var value10 = csvReader.GetString(10);
		var value11 = csvReader.GetString(11);
		var value12 = csvReader.GetString(12);
		var value13 = csvReader.GetString(13);
		Assert.Multiple(
			() => Assert.Equal("_", value00),
			() => Assert.Equal("a1 \a a2", value01), // This will fail; will be "a1 a a2"
			() => Assert.Equal("_", value02),
			() => Assert.Equal("b1 \b b2", value03), // This will fail; will be "b1 b b2"
			() => Assert.Equal("_", value04),
			() => Assert.Equal("f1 \f f2", value05), // This will fail; will be "f1 f f2"
			() => Assert.Equal("_", value06),
			() => Assert.Equal("n1 \n n2", value07), // This will fail; will be "n1 n n2"
			() => Assert.Equal("_", value08),
			() => Assert.Equal("r1 \r r2", value09), // This will fail; will be "r1 r r2"
			() => Assert.Equal("_", value10),
			() => Assert.Equal("t1 \t t2", value11), // This will fail; will be "t1 t t2"
			() => Assert.Equal("_", value12),
			() => Assert.Equal("v1 \v v2", value13)  // This will fail; will be "v1 v v2"
		);
	}

CsvDataReader.PrepareField could have the escape block modified to cover this:

				if (c == escape)
				{
					if (i < len)
					{
						c = buffer[offset + i++];
						if (c != quote && c != escape)
						{
							if (quote == escape)
							{
								// the escape we just saw was actually the closing quote
								// the remainder of the field will be added verbatim
								inQuote = false;
							}
							else if ('\\' == escape)
							{
								switch (c)
								{
									case 'a':	// bell
										c = '\a';
										break;
									case 'b':	// backspace
										c = '\b';
										break;
									case 'f':	// form feed
										c = '\f';
										break;
									case 'n':	// new line
										c = '\n';
										break;
									case 'r':	// carriage return
										c = '\r';
										break;
									case 't':	// horizontal tab
										c = '\t';
										break;
									case 'v':	// vertical tab
										c = '\v';
										break;
								}
							}
						}
					}
					else
					{
						// we should never get here. Invalid fields should always be
						// handled in ReadField and end up in PrepareInvalidField
						throw new CsvFormatException(rowNumber, -1);
					}
				}

All existing unit tests pass with this modification.

@MarkPflug
Owner

This is working as expected. The Escaped mode parser only expects delimiters, newlines and escape characters to be escaped. Yes, it is similar to C-style string literal escaping, but it is not exactly the same. The test cases that you provide are invalid: 'a' doesn't need to be escaped, so a \a sequence is invalid. I've made the design decision to simply remove the unnecessary escape character. If you want a bell (\a) character, or a tab (\t) character, you can simply include that character in the output stream without needing to escape it.
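A minimal sketch of that rule, for illustration only: the escape character is dropped and whatever character follows it is kept literally. `EscapedFieldSketch` and `Unescape` are hypothetical names, not part of the library, and this is not the actual `PrepareField` code (it ignores quoting and buffer handling entirely).

```csharp
using System;
using System.Text;

// Hypothetical helper, NOT part of the library. It only models the rule
// described above: remove the escape character and keep the next
// character exactly as it appears in the input.
public static class EscapedFieldSketch
{
    public static string Unescape(string raw, char escape = '\\')
    {
        var sb = new StringBuilder(raw.Length);
        for (int i = 0; i < raw.Length; i++)
        {
            char c = raw[i];
            if (c == escape)
            {
                // An escape must be followed by the character it protects.
                if (i + 1 >= raw.Length)
                    throw new FormatException("Dangling escape character at end of field.");

                // Take the next character verbatim. There is no C-style
                // translation, so escape + 'n' yields the letter 'n'.
                c = raw[++i];
            }
            sb.Append(c);
        }
        return sb.ToString();
    }
}
```

Under this model, `Unescape(@"a1 \a a2")` gives `"a1 a a2"`, while an escape character followed by an actual line feed (the C# literal `"A\\\nB"`) gives `A`, a real line feed, then `B`. That is why a field that needs an embedded newline has to contain the newline character itself after the escape, not the letter `n`.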

@ejdownescx
Author

Agreed on characters like tab and bell not needing escaping for the format to output correctly. The designer of the application that creates the files I'm consuming unfortunately decided to escape every character anyways and isn't open to changing the behavior since the choice was made over 25 years ago.

That said, the newline characters \r and \n were also not working in my tests. A CSV line like

Hello,A\r\nB

would result in parsing to

"Hello"
"ArnB"

field values, instead of

"Hello"
"A
B"

The newline character parsing was primarily where I was encountering issues. I don't think any of the data from this application will seriously include the bell character, but I included it to try and be comprehensive.

@ejdownescx
Author

After having my coffee I realize you said the design accommodates newlines by using the escape character followed by the actual newline character itself and not the C-style escape. Sorry about that.
