Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a setting to parse crlf with Tab separated CSV (TSV?!) files #56257

Open
mneedham opened this issue Nov 2, 2023 · 3 comments · May be fixed by #59747
Open

Add a setting to parse crlf with Tab separated CSV (TSV?!) files #56257

mneedham opened this issue Nov 2, 2023 · 3 comments · May be fixed by #59747
Assignees

Comments

@mneedham
Copy link
Contributor

mneedham commented Nov 2, 2023

Not sure if this is a feature or usability, so feel free to re-label.

I have a tab separated file with windows line endings, which means it doesn't work correctly with any of the TSV formats. It will process with the CSV ones:

FROM file(
     'data/Stock Histories.zip :: {NASDAQ,NYSE}.txt', 
     CSVWithNames
)
SELECT *, _file
ORDER BY rand()
LIMIT 10;

But then every field is inside one row and I have to do a split on tab to separate them.

@alexey-milovidov said maybe we could add a setting:

TSV does not have quoting, and escaping of \r is not mandatory, so the interpretation of crlf will be ambiguous if you have a binary string.
We can add it under a setting…

@Blargian
Copy link
Contributor

@Avogar I'll work on this one

Blargian added a commit to Blargian/ClickHouse that referenced this issue Jan 28, 2024
@Blargian
Copy link
Contributor

Blargian commented Jan 29, 2024

@Avogar two questions

  • The intention with this is that if the added setting is true then we will parse TSV files with line ending \r\n? Currently there is an error thrown asking user to convert to unix format.
  • Does any functionality already exist to deal with a return carriage from a ReadBuffer & buf?

As I can see from debugging return carriage gets read in as part of the field (eg: a line Akiba_Hebrew_Academy 2017-08-01 241\r\n from some .tsv file will read in the last field as 241\r and not as 241, as we read everything up until only \t, \n or \\ with char * next_pos = find_first_symbols<'\t', '\n', '\\'>(buf.position(), buf.buffer().end()); so I need to deal with that - maybe something already exists that I can use to modify this function, for a start?

template <typename Vector, bool parse_complex_escape_sequence>
void readEscapedStringIntoImpl(Vector & s, ReadBuffer & buf)
{
while (!buf.eof())
{
char * next_pos = find_first_symbols<'\t', '\n', '\\'>(buf.position(), buf.buffer().end());
appendToStringOrVector(s, buf, next_pos);
buf.position() = next_pos;
if (!buf.hasPendingData())
continue;
if (*buf.position() == '\t' || *buf.position() == '\n')
return;
if (*buf.position() == '\\')
{
if constexpr (parse_complex_escape_sequence)
{
parseComplexEscapeSequence(s, buf);
}
else
{
s.push_back(*buf.position());
++buf.position();
if (!buf.eof())
{
s.push_back(*buf.position());
++buf.position();
}
}
}
}
}

Function above gets called originally from here for context:

template <bool is_header_row>
std::vector<String> TabSeparatedFormatReader::readRowImpl()
{
std::vector<String> fields;
do
{
fields.push_back(readFieldIntoString<is_header_row>());
}
while (checkChar('\t', *buf));
skipRowEndDelimiter();
return fields;
}

@Avogar
Copy link
Member

Avogar commented Jan 29, 2024

The intention with this is that if the added setting is true then we will parse TSV files with line ending \r\n?

Yes. When new setting is true, we should be able to parse files with line ending '\r\n' correctly.

Does any functionality already exist to deal with a return carriage from a ReadBuffer & buf?

We can work with \r similar how we work with \n in existing code. Let me explain possible solution:
To read values in TSV format we sometimes need to read all data until \t or \n, and in all these cases we should support \r\n under a setting. For reading TSV values we use ISerialization::deserializeTextEscaped/deserializeTextRaw functions (*Raw function for TSVRaw format with no escaping):

if (is_raw)
{
if (as_nullable)
return SerializationNullable::deserializeTextRawImpl(column, *buf, format_settings, serialization);
serialization->deserializeTextRaw(column, *buf, format_settings);
return true;
}
if (as_nullable)
return SerializationNullable::deserializeTextEscapedImpl(column, *buf, format_settings, serialization);
serialization->deserializeTextEscaped(column, *buf, format_settings);
return true;

So, we should go through all deserializeTextEscaped/deserializeTextRaw implementations and add support '\r' when needed under a setting. We also have functions readEscapedString/readEscapedStringInto/readTSVField that are used for reading fields in TSV format, we should support \r\n there too, maybe we can just make these functions template with parameter bool support_crlf, all these functions are implemented using readEscapedStringIntoImpl, so we can propogate this template parameter there and add support for \r\n under if constexpr (support_crlf).

@Blargian Blargian linked a pull request Feb 8, 2024 that will close this issue
1 task
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants