Migrate from old private fork of OpenCSV #2268
Version 5.0 does not seem to support multi-character separators, which are currently used at least in the
Multi-character separators were proposed but the patch did not make it upstream: https://sourceforge.net/p/opencsv/patches/44/
The OpenCSV project seems to prefer a strict stance of RFC 4180 which seems reasonable given their mission (i.e. they consider bits outside the "CSV standard" to not really be CSV). Even if you provided a new class or method, it would probably be rejected, but you never know and might want to ask them. I still think investment in one of these would be better all around: or Apache CSV (which wanted to unify development with OpenCSV at one time, but didn't get far with them either, if I recall)
OpenCSV does not stick to RFC4180; they also have a more flexible parser which accommodates non-standard needs. But yeah, switching to another parser could also be an option. There seem to be quite a lot of them actually! https://github.com/uniVocity/csv-parsers-comparison
I did not say they stick?
Well, you wrote that "The OpenCSV project seems to prefer a strict stance of RFC 4180 which seems reasonable given their mission" and I think that is not a very accurate description of OpenCSV, since their default parser is much more flexible and accepts CSVs that do not conform to RFC4180. They also have an RFC4180 parser, but that is not the default one.
Thanks for the info, but I also see flexibility with other parsers. So switching, although painful, might be wiser. Up to you.
With the migration to Spark, Spark SQL's own CSV parser is a natural choice since it allows efficient partitioning (so it scales well to large datasets).
OpenCSV rejected multi-character separators (a second time): https://sourceforge.net/p/opencsv/feature-requests/119/
A couple of years ago Apache
Switch from ancient private fork of opencsv with our multi-character delimiter patch to the Apache commons-csv module which supports string delimiters. There are two types of test failures, one of which I think is definitely a bug and the other is arguable. Test changes will go in the next commit to keep them separate.
…ne#1372 Switch from ancient private fork of opencsv with our multi-character delimiter patch to the uniVocity CSV parser which supports string delimiters.

There is one test change:
- On import, a quote character before a separator character is no longer stripped in "ignore quotes" mode. Instead it's included in the data field, which seems correct to me.
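The import behavior described in this commit message can be illustrated with a minimal sketch that uses only the Java standard library. It is not the uniVocity API and not the project's importer: the class `IgnoreQuotesSplit` and the method `split` are hypothetical names, and the sketch assumes that "ignore quotes" mode means quote characters are treated as ordinary data while the separator may be any string.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class IgnoreQuotesSplit {
    // Hypothetical helper: split a line on a (possibly multi-character)
    // separator, treating quote characters as plain data.
    static List<String> split(String line, String separator) {
        // Pattern.quote makes the separator a literal string;
        // limit -1 keeps trailing empty fields.
        return Arrays.asList(line.split(Pattern.quote(separator), -1));
    }

    public static void main(String[] args) {
        // The quote before the "||" separator stays in the first field.
        System.out.println(split("a\"||b||c", "||")); // prints [a", b, c]
    }
}
```

Under these assumptions the quote character is kept as part of the first field rather than stripped, which matches the new expectation in the unit test.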
…ne#1372 Switch from ancient private fork of opencsv with our multi-character delimiter patch to the uniVocity CSV parser which supports string delimiters.

There is one unit test change:
- On import, a quote character before a separator character is no longer stripped in "ignore quotes" mode. Instead it's included in the data field, which seems correct to me.

The following changes were made to e2e tests:
- Disable quote processing option before switching from CSV to TSV because the test data doesn't have legally escaped quotes when the separator isn't a comma
- Force options to be sent by project creation text fixture (If a tag name isn't present, no options are sent. Super weird side effect!)
- Escape test fixture data which contains quotes before sending it to be parsed as a CSV file
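For the last e2e item, escaping fixture data that contains quotes, here is a minimal sketch of RFC 4180 style field escaping in plain Java. The actual e2e fixtures are not shown in this thread, so the class `CsvEscape` and the method `escape` are hypothetical and only illustrate the quoting rule: a field containing a quote, the separator, or a line break is wrapped in double quotes, and each embedded quote is doubled.

```java
final class CsvEscape {
    // Hypothetical helper: escape one field following RFC 4180 quoting rules.
    static String escape(String field, char separator) {
        boolean needsQuoting = field.indexOf('"') >= 0
                || field.indexOf(separator) >= 0
                || field.indexOf('\n') >= 0
                || field.indexOf('\r') >= 0;
        if (!needsQuoting) {
            return field; // nothing special, emit as-is
        }
        // Double every embedded quote, then wrap the field in quotes.
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(escape("say \"hi\"", ',')); // prints "say ""hi"""
        System.out.println(escape("plain", ','));      // prints plain
    }
}
```

Data escaped this way should parse back to the original field in a parser that has quote processing enabled.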
…ne#1372 Switch from ancient private fork of opencsv with our multi-character delimiter patch to the uniVocity CSV parser which supports string delimiters.

There are two unit test changes:
- On import, a quote character before a separator character is no longer stripped in "ignore quotes" mode. Instead it's included in the data field, which seems correct to me.
- On export, CSV/TSV tests now check that the export uses the system line delimiter rather than always a Unix newline (\n)

The following changes were made to e2e tests:
- Disable quote processing option before switching from CSV to TSV because the test data doesn't have legally escaped quotes when the separator isn't a comma
- Force options to be sent by project creation text fixture (If a tag name isn't present, no options are sent. Super weird side effect!)
- Escape test fixture data which contains quotes before sending it to be parsed as a CSV file

Another try at making export tests work on Windows
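The export test change can be sketched as follows. This is an illustration only, assuming the exporter terminates each row with the platform line separator. The class `ExportLineEndings` and the method `joinRows` are hypothetical and stand in for the real exporter, which is not shown here.

```java
import java.util.Arrays;
import java.util.List;

final class ExportLineEndings {
    // Hypothetical stand-in for an exporter: terminate every row with the
    // platform line separator ("\n" on Unix, "\r\n" on Windows).
    static String joinRows(List<String> rows) {
        StringBuilder sb = new StringBuilder();
        for (String row : rows) {
            sb.append(row).append(System.lineSeparator());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nl = System.lineSeparator();
        String exported = joinRows(Arrays.asList("a,b", "1,2"));
        // Build the expectation from the system separator, not a fixed "\n",
        // so the same check holds on Windows and Unix.
        String expected = "a,b" + nl + "1,2" + nl;
        System.out.println(exported.equals(expected)); // prints true
    }
}
```

Building the expected string from `System.lineSeparator()` keeps the comparison platform independent, which is the point of the test change.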
…ne#1372 Switch from ancient private fork of opencsv with our multi-character delimiter patch to the uniVocity CSV parser which supports string delimiters.

There are two unit test changes:
- On import, a quote character before a separator character is no longer stripped in "ignore quotes" mode. Instead it's included in the data field, which seems correct to me.
- On export, CSV/TSV tests now check that the export uses the system line delimiter rather than always a Unix newline (\n)

The following changes were made to e2e tests:
- Disable quote processing option before switching from CSV to TSV because the test data doesn't have legally escaped quotes when the separator isn't a comma
- Force options to be sent by project creation text fixture (If a tag name isn't present, no options are sent. Super weird side effect!)
- Escape test fixture data which contains quotes before sending it to be parsed as a CSV file
* Add TODO for error reporting and remove redundant qualifiers

* Switch to uniVocity CSV parser - Fixes #2268

Fixes #1372

Switch from ancient private fork of opencsv with our multi-character delimiter patch to the uniVocity CSV parser which supports string delimiters.

There are two unit test changes:
- On import, a quote character before a separator character is no longer stripped in "ignore quotes" mode. Instead it's included in the data field, which seems correct to me.
- On export, CSV/TSV tests now check that the export uses the system line delimiter rather than always a Unix newline (\n)

The following changes were made to e2e tests:
- Disable quote processing option before switching from CSV to TSV because the test data doesn't have legally escaped quotes when the separator isn't a comma
- Force options to be sent by project creation text fixture (If a tag name isn't present, no options are sent. Super weird side effect!)
- Escape test fixture data which contains quotes before sending it to be parsed as a CSV file

* Add uniVocity format guessing

As a starting point, use both our separator guesser and the CSV parser's format guesser and compare the two. We probably need a wider study to compare the performance of the two.
We are currently using a custom snapshot of OpenCSV which is years old. The OpenCSV project is still active and a lot of releases have been published since then. We should look into upgrading to a newer version, which would also let us get rid of the locally stored .jar.