Fields with whitespace characters represented with escape sequences return the wrong character #255

ejdownescx · 2024-06-15T01:22:06Z

I have a CSV file that contains escaped whitespace characters. The reader always returns the literal character that follows the escape character rather than the whitespace character the sequence represents (e.g. "\n" becomes 'n' instead of a newline).

	[Fact]
	public void InlineEscapedWhitespaceCharacters()
	{
		using var reader = new StringReader(@"_,a1 \a a2,_,b1 \b b2,_,f1 \f f2,_,n1 \n n2,_,r1 \r r2,_,t1 \t t2,_,v1 \v v2");
		using var csvReader = CsvDataReader.Create(reader, new CsvDataReaderOptions
		{
			CsvStyle = CsvStyle.Escaped,
			Escape = '\\',
			Delimiter = ',',
			HasHeaders = false,
		});

		csvReader.Read();
		var value00 = csvReader.GetString(0);
		var value01 = csvReader.GetString(1);
		var value02 = csvReader.GetString(2);
		var value03 = csvReader.GetString(3);
		var value04 = csvReader.GetString(4);
		var value05 = csvReader.GetString(5);
		var value06 = csvReader.GetString(6);
		var value07 = csvReader.GetString(7);
		var value08 = csvReader.GetString(8);
		var value09 = csvReader.GetString(9);
		var value10 = csvReader.GetString(10);
		var value11 = csvReader.GetString(11);
		var value12 = csvReader.GetString(12);
		var value13 = csvReader.GetString(13);
		Assert.Multiple(
			() => Assert.Equal("_", value00),
			() => Assert.Equal("a1 \a a2", value01), // This will fail; will be "a1 a a2"
			() => Assert.Equal("_", value02),
			() => Assert.Equal("b1 \b b2", value03), // This will fail; will be "b1 b b2"
			() => Assert.Equal("_", value04),
			() => Assert.Equal("f1 \f f2", value05), // This will fail; will be "f1 f f2"
			() => Assert.Equal("_", value06),
			() => Assert.Equal("n1 \n n2", value07), // This will fail; will be "n1 n n2"
			() => Assert.Equal("_", value08),
			() => Assert.Equal("r1 \r r2", value09), // This will fail; will be "r1 r r2"
			() => Assert.Equal("_", value10),
			() => Assert.Equal("t1 \t t2", value11), // This will fail; will be "t1 t t2"
			() => Assert.Equal("_", value12),
			() => Assert.Equal("v1 \v v2", value13)  // This will fail; will be "v1 v v2"
		);
	}

CsvDataReader.PrepareField could have the escape block modified to cover this:

				if (c == escape)
				{
					if (i < len)
					{
						c = buffer[offset + i++];
						if (c != quote && c != escape)
						{
							if (quote == escape)
							{
								// the escape we just saw was actually the closing quote
								// the remainder of the field will be added verbatim
								inQuote = false;
							}
							else if ('\\' == escape)
							{
								switch (c)
								{
									case 'a':	// bell
										c = '\a';
										break;
									case 'b':	// backspace
										c = '\b';
										break;
									case 'f':	// form feed
										c = '\f';
										break;
									case 'n':	// new line
										c = '\n';
										break;
									case 'r':	// carriage return
										c = '\r';
										break;
									case 't':	// horizontal tab
										c = '\t';
										break;
									case 'v':	// vertical tab
										c = '\v';
										break;
								}
							}
						}
					}
					else
					{
						// we should never get here. Invalid fields should always be
						// handled in ReadField and end up in PrepareInvalidField
						throw new CsvFormatException(rowNumber, -1);
					}
				}

All existing unit tests pass with this modification.

MarkPflug · 2024-06-20T15:19:58Z

This is working as expected. The Escaped mode parser only expects delimiters, newlines, and escape characters to be escaped. Yes, it is similar to C-style string literal escaping, but it is not exactly the same. The test cases that you provide are invalid: the letter 'a' doesn't need to be escaped, so the sequence \a is invalid. I've made the design decision to simply remove the unnecessary escape character. If you want a bell (\a) character, or a tab (\t) character, you can simply include that character in the output stream without needing to escape it.
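To make the rule described above concrete, here is a minimal standalone sketch of "drop the escape character and keep whatever follows it literally". The class and method names are hypothetical and this is not Sylvan's actual PrepareField code; it only illustrates the behavior as stated in this comment, assuming a backslash escape.

```csharp
using System;
using System.Text;

static class EscapeRuleSketch
{
    // Hypothetical helper, not part of Sylvan.Data.Csv.
    // Removes each escape character and keeps the character after it as-is,
    // so an escaped delimiter, newline, or escape survives literally, and an
    // unnecessary escape (such as the one in \a) is simply dropped.
    public static string RemoveEscapes(string field, char escape = '\\')
    {
        var sb = new StringBuilder(field.Length);
        for (int i = 0; i < field.Length; i++)
        {
            char c = field[i];
            if (c == escape)
            {
                if (i + 1 >= field.Length)
                    throw new FormatException("Dangling escape character at end of field.");
                // Take the next character verbatim; no C-style translation.
                c = field[++i];
            }
            sb.Append(c);
        }
        return sb.ToString();
    }
}
```

Under this rule, a backslash followed by the letter 'n' yields the letter 'n', while a backslash followed by an actual line feed character yields a line feed.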

ejdownescx · 2024-06-20T15:57:29Z

Agreed that characters like tab and bell don't need escaping for the format to parse correctly. Unfortunately, the designer of the application that creates the files I'm consuming decided to escape every character anyway, and isn't open to changing that behavior since the choice was made over 25 years ago.

That said, the newline characters \r and \n were also not working in my tests. A CSV line like

Hello,A\r\nB

would result in parsing to

"Hello"
"ArnB"

field values, instead of

"Hello"
"A
B"

The newline character parsing was primarily where I was encountering issues. I don't think any of the data from this application will seriously include the bell character, but I included it to try and be comprehensive.

ejdownescx · 2024-06-20T16:34:18Z

After having my coffee I realize you said the design accommodates newlines by using the escape character followed by the actual newline character itself and not the C-style escape. Sorry about that.
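For files produced by an application that cannot be changed, one possible workaround is to rewrite the C-style sequences into the form the parser expects before handing the text to the reader. The sketch below is an assumption-laden illustration, not something from the library: the class and method names are hypothetical, it assumes a backslash escape, and it works on a whole string rather than a stream. It keeps the escape character and replaces only the letter after it with the control character it stands for, leaving every other escaped character untouched.

```csharp
using System;
using System.Text;

static class CStyleEscapeAdapter
{
    // Hypothetical pre-processing helper, not part of Sylvan.Data.Csv.
    // Turns escape + letter (e.g. backslash followed by 'n') into
    // escape + the actual control character (backslash followed by LF).
    public static string ToLiteralEscapes(string text, char escape = '\\')
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            sb.Append(c);
            // Ordinary characters, and a trailing escape, pass through unchanged.
            if (c != escape || i + 1 >= text.Length)
                continue;

            // Consume the escaped character so an escaped escape is never
            // mistaken for the start of another sequence.
            char next = text[++i];
            sb.Append(next switch
            {
                'a' => '\a', // bell
                'b' => '\b', // backspace
                'f' => '\f', // form feed
                'n' => '\n', // line feed
                'r' => '\r', // carriage return
                't' => '\t', // horizontal tab
                'v' => '\v', // vertical tab
                _ => next,   // delimiters, escapes, anything else: keep as-is
            });
        }
        return sb.ToString();
    }
}
```

The converted text could then be wrapped in a StringReader and passed to CsvDataReader.Create with the same options as in the test at the top of this issue. Whether the reader then returns the intended control characters for every sequence, in particular an escaped carriage return immediately followed by an escaped line feed, has not been verified here and should be checked against the real files.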

MarkPflug closed this as completed Jun 20, 2024
