Migrate from old private fork of OpenCSV #2268

wetneb · 2019-12-29T15:26:02Z

We are currently using a custom snapshot of OpenCSV which is years old. The OpenCSV project is still active and a lot of releases have been published since then. We should look into upgrading to a newer version, which would also let us get rid of the locally stored .jar.

wetneb · 2019-12-31T05:21:27Z

Version 5.0 does not seem to support multi-character separators, which are currently used at least in the smartSplit function.

wetneb · 2020-01-03T09:36:54Z

Multi-character separators were proposed but the patch did not make it upstream: https://sourceforge.net/p/opencsv/patches/44/
We don't seem to be the only ones to require this though: https://stackoverflow.com/questions/8653797/java-csv-parser-with-string-separator-multi-character

thadguidry · 2020-01-03T14:34:45Z

The OpenCSV project seems to prefer a strict stance of RFC 4180 which seems reasonable given their mission. (I.E. they consider bits outside the "csv standard" to not really be csv) Even if you provided a new class/method then it would probably be rejected, but you never know and might want to ask them.

I still think investment in one of these would be better all around:
This might mean that perhaps we look at using Jackson CSV (I prefer it more, since they know CSV can be "messy" sometimes and supports extensions, where we would just need to implement something like CsvParser.Feature.ALLOW_MULTI_SEPARATORS)?
https://github.com/FasterXML/jackson-dataformats-text/tree/master/csv#configuring-csvschema

or Apache CSV (which wanted to unify development with OpenCSV at one-time, but didn't get far with them either if I recall)

wetneb · 2020-01-03T14:51:36Z

OpenCSV does not stick to RFC4180, they also have a more flexible parser which accommodates with non-standard needs.

But yeah, switching to another parser could also be an option. There seems to be quite a lot of them actually! https://github.com/uniVocity/csv-parsers-comparison

thadguidry · 2020-01-03T17:15:49Z

I did not say they stick?

wetneb · 2020-01-03T17:17:57Z

Well, you wrote that "The OpenCSV project seems to prefer a strict stance of RFC 4180 which seems reasonable given their mission" and I think that is not a very accurate description of OpenCSV, since their default parser is much more flexible and accepts CSVs that do not conform with RFC4180. They also have a RFC4180 parser, but that is not the default one.

thadguidry · 2020-01-03T17:19:57Z

Thanks for info, but I also see flexibility with other parsers. So switching, although painful might be wiser. Up to you.

wetneb · 2020-02-18T16:48:04Z

With the migration to spark, Spark SQL's own CSV parser is a natural choice since it allows efficient partitioning (so, scales well to large datasets).

tfmorris · 2020-07-18T17:46:19Z

OpenCSV rejected multi-character separators (a second time): https://sourceforge.net/p/opencsv/feature-requests/119/
There still don't seem to be any available Java CSV parsers which support multi-character separator strings.

tfmorris · 2023-10-11T20:20:28Z

A couple of years ago Apache commons-csv added support for multi-character delimiters. I'll look into how easy it is to migrate to it.

Switch from ancient private fork of opencsv with our multi-character delimiter patch to the Apache commons-csv module with supports string delimiters. There are two types of test failures, one of which I think is definitely a bug and the other is arguable. Test changes will go in the next commit to keep them separate.

…ne#1372 Switch from ancient private fork of opencsv with our multi-character delimiter patch to the uniVocity CSV parser which supports string delimiters. There is one test test change:: - On import, a quote character before a separator character is no longer stripped in "ignore quotes" mode. Instead it's included in the data field, which seems correct to me.

…ne#1372 Switch from ancient private fork of opencsv with our multi-character delimiter patch to the uniVocity CSV parser which supports string delimiters. There is one unit test change: - On import, a quote character before a separator character is no longer stripped in "ignore quotes" mode. Instead it's included in the data field, which seems correct to me. The following changes were made to e2e tests: - Disable quote processing option before switching from CSV to TSV because the test data doesn't have legally escaped quotes when the separator isn't a comma - Force options to be sent by project creation text fixture (If a tag name isn't present, no options are sent. Super weird side effect!) - Escape test fixture data which contains quotes before sending it to be parsed as a CSV file

…ne#1372 Switch from ancient private fork of opencsv with our multi-character delimiter patch to the uniVocity CSV parser which supports string delimiters. There are two unit test changes: - On import, a quote character before a separator character is no longer stripped in "ignore quotes" mode. Instead it's included in the data field, which seems correct to me. - On export, CSV/TSV tests now check that the export uses the system line delimiter rather than always a Unix newline (\n) The following changes were made to e2e tests: - Disable quote processing option before switching from CSV to TSV because the test data doesn't have legally escaped quotes when the separator isn't a comma - Force options to be sent by project creation text fixture (If a tag name isn't present, no options are sent. Super weird side effect!) - Escape test fixture data which contains quotes before sending it to be parsed as a CSV file Another try at making export tests work on Windows

…ne#1372 Switch from ancient private fork of opencsv with our multi-character delimiter patch to the uniVocity CSV parser which supports string delimiters. There are two unit test changes: - On import, a quote character before a separator character is no longer stripped in "ignore quotes" mode. Instead it's included in the data field, which seems correct to me. - On export, CSV/TSV tests now check that the export uses the system line delimiter rather than always a Unix newline (\n) The following changes were made to e2e tests: - Disable quote processing option before switching from CSV to TSV because the test data doesn't have legally escaped quotes when the separator isn't a comma - Force options to be sent by project creation text fixture (If a tag name isn't present, no options are sent. Super weird side effect!) - Escape test fixture data which contains quotes before sending it to be parsed as a CSV file

* Add TODO for error reporting and remove redundant qualifiers * Switch to uniVocity CSV parser - Fixes #2268 Fixes #1372 Switch from ancient private fork of opencsv with our multi-character delimiter patch to the uniVocity CSV parser which supports string delimiters. There are two unit test changes: - On import, a quote character before a separator character is no longer stripped in "ignore quotes" mode. Instead it's included in the data field, which seems correct to me. - On export, CSV/TSV tests now check that the export uses the system line delimiter rather than always a Unix newline (\n) The following changes were made to e2e tests: - Disable quote processing option before switching from CSV to TSV because the test data doesn't have legally escaped quotes when the separator isn't a comma - Force options to be sent by project creation text fixture (If a tag name isn't present, no options are sent. Super weird side effect!) - Escape test fixture data which contains quotes before sending it to be parsed as a CSV file * Add uniVocity format guessing As a starting point use both our separator guesser and the CSV parsers format guesser and compare the two. We probably need a wider study to compare the performance of the two.

wetneb added Type: Feature Request Identifies requests for new features or enhancements. These involve proposing new improvements. import About importers in general - add a label for the data format if available CSV/TSV About the CSV/TSV import or export labels Dec 29, 2019

wetneb mentioned this issue Feb 18, 2020

Change to higher performance CSV parser #1372

Closed

tfmorris changed the title ~~Migrate OpenCSV to a more recent version~~ Migrate from old private fork of OpenCSV Oct 11, 2023

tfmorris self-assigned this Oct 11, 2023

tfmorris mentioned this issue Oct 12, 2023

Switch to Apache commons-csv - fixes #2268 #6092

Closed

tfmorris mentioned this issue Oct 14, 2023

Switch to uniVocity CSV parser. Fixes #2268 & #1372 #6098

Merged

wetneb closed this as completed in #6098 Oct 21, 2023

tfmorris added this to the 3.8 milestone Oct 26, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Migrate from old private fork of OpenCSV #2268

Migrate from old private fork of OpenCSV #2268

wetneb commented Dec 29, 2019

wetneb commented Dec 31, 2019

wetneb commented Jan 3, 2020 •

edited

thadguidry commented Jan 3, 2020 •

edited

wetneb commented Jan 3, 2020

thadguidry commented Jan 3, 2020

wetneb commented Jan 3, 2020

thadguidry commented Jan 3, 2020

wetneb commented Feb 18, 2020

tfmorris commented Jul 18, 2020

tfmorris commented Oct 11, 2023

Migrate from old private fork of OpenCSV #2268

Migrate from old private fork of OpenCSV #2268

Comments

wetneb commented Dec 29, 2019

wetneb commented Dec 31, 2019

wetneb commented Jan 3, 2020 • edited

thadguidry commented Jan 3, 2020 • edited

wetneb commented Jan 3, 2020

thadguidry commented Jan 3, 2020

wetneb commented Jan 3, 2020

thadguidry commented Jan 3, 2020

wetneb commented Feb 18, 2020

tfmorris commented Jul 18, 2020

tfmorris commented Oct 11, 2023

wetneb commented Jan 3, 2020 •

edited

thadguidry commented Jan 3, 2020 •

edited