Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate from old private fork of OpenCSV #2268

Closed
wetneb opened this issue Dec 29, 2019 · 10 comments · Fixed by #6098
Closed

Migrate from old private fork of OpenCSV #2268

wetneb opened this issue Dec 29, 2019 · 10 comments · Fixed by #6098
Assignees
Labels
CSV/TSV About the CSV/TSV import or export import About importers in general - add a label for the data format if available Type: Feature Request Identifies requests for new features or enhancements. These involve proposing new improvements.
Milestone

Comments

@wetneb
Copy link
Sponsor Member

wetneb commented Dec 29, 2019

We are currently using a custom snapshot of OpenCSV which is years old. The OpenCSV project is still active and a lot of releases have been published since then. We should look into upgrading to a newer version, which would also let us get rid of the locally stored .jar.

@wetneb wetneb added Type: Feature Request Identifies requests for new features or enhancements. These involve proposing new improvements. import About importers in general - add a label for the data format if available CSV/TSV About the CSV/TSV import or export labels Dec 29, 2019
@wetneb
Copy link
Sponsor Member Author

wetneb commented Dec 31, 2019

Version 5.0 does not seem to support multi-character separators, which are currently used at least in the smartSplit function.

@wetneb
Copy link
Sponsor Member Author

wetneb commented Jan 3, 2020

Multi-character separators were proposed but the patch did not make it upstream: https://sourceforge.net/p/opencsv/patches/44/
We don't seem to be the only ones to require this though: https://stackoverflow.com/questions/8653797/java-csv-parser-with-string-separator-multi-character

@thadguidry
Copy link
Member

thadguidry commented Jan 3, 2020

The OpenCSV project seems to prefer a strict stance of RFC 4180 which seems reasonable given their mission. (I.E. they consider bits outside the "csv standard" to not really be csv) Even if you provided a new class/method then it would probably be rejected, but you never know and might want to ask them.

I still think investment in one of these would be better all around:
This might mean that perhaps we look at using Jackson CSV (I prefer it more, since they know CSV can be "messy" sometimes and supports extensions, where we would just need to implement something like CsvParser.Feature.ALLOW_MULTI_SEPARATORS)?
https://github.com/FasterXML/jackson-dataformats-text/tree/master/csv#configuring-csvschema

or Apache CSV (which wanted to unify development with OpenCSV at one-time, but didn't get far with them either if I recall)

@wetneb
Copy link
Sponsor Member Author

wetneb commented Jan 3, 2020

OpenCSV does not stick to RFC4180, they also have a more flexible parser which accommodates with non-standard needs.

But yeah, switching to another parser could also be an option. There seems to be quite a lot of them actually! https://github.com/uniVocity/csv-parsers-comparison

@thadguidry
Copy link
Member

I did not say they stick?

@wetneb
Copy link
Sponsor Member Author

wetneb commented Jan 3, 2020

Well, you wrote that "The OpenCSV project seems to prefer a strict stance of RFC 4180 which seems reasonable given their mission" and I think that is not a very accurate description of OpenCSV, since their default parser is much more flexible and accepts CSVs that do not conform with RFC4180. They also have a RFC4180 parser, but that is not the default one.

@thadguidry
Copy link
Member

Thanks for info, but I also see flexibility with other parsers. So switching, although painful might be wiser. Up to you.

@wetneb
Copy link
Sponsor Member Author

wetneb commented Feb 18, 2020

With the migration to spark, Spark SQL's own CSV parser is a natural choice since it allows efficient partitioning (so, scales well to large datasets).

@tfmorris
Copy link
Member

OpenCSV rejected multi-character separators (a second time): https://sourceforge.net/p/opencsv/feature-requests/119/
There still don't seem to be any available Java CSV parsers which support multi-character separator strings.

@tfmorris tfmorris changed the title Migrate OpenCSV to a more recent version Migrate from old private fork of OpenCSV Oct 11, 2023
@tfmorris tfmorris self-assigned this Oct 11, 2023
@tfmorris
Copy link
Member

A couple of years ago Apache commons-csv added support for multi-character delimiters. I'll look into how easy it is to migrate to it.

tfmorris added a commit to tfmorris/OpenRefine that referenced this issue Oct 12, 2023
Switch from ancient private fork of opencsv with our
multi-character delimiter patch to the Apache commons-csv
module with supports string delimiters.

There are two types of test failures, one of which I think
is definitely a bug and the other is arguable.

Test changes will go in the next commit to keep them
separate.
tfmorris added a commit to tfmorris/OpenRefine that referenced this issue Oct 14, 2023
…ne#1372

Switch from ancient private fork of opencsv with our
multi-character delimiter patch to the uniVocity CSV
parser which supports string delimiters.

There is one test test change::

- On import, a quote character before a separator character
  is no longer stripped in "ignore quotes" mode. Instead it's
  included in the data field, which seems correct to me.
tfmorris added a commit to tfmorris/OpenRefine that referenced this issue Oct 15, 2023
…ne#1372

Switch from ancient private fork of opencsv with our
multi-character delimiter patch to the uniVocity CSV
parser which supports string delimiters.

There is one test test change::

- On import, a quote character before a separator character
  is no longer stripped in "ignore quotes" mode. Instead it's
  included in the data field, which seems correct to me.
tfmorris added a commit to tfmorris/OpenRefine that referenced this issue Oct 16, 2023
…ne#1372

Switch from ancient private fork of opencsv with our
multi-character delimiter patch to the uniVocity CSV
parser which supports string delimiters.

There is one unit test change:

- On import, a quote character before a separator character
  is no longer stripped in "ignore quotes" mode. Instead it's
  included in the data field, which seems correct to me.

The following changes were made to e2e tests:

- Disable quote processing option before switching from
  CSV to TSV because the test data doesn't have legally
  escaped quotes when the separator isn't a comma

- Force options to be sent by project creation text fixture
  (If a tag name isn't present, no options are sent. Super weird
  side effect!)

- Escape test fixture data which contains quotes before
  sending it to be parsed as a CSV file
tfmorris added a commit to tfmorris/OpenRefine that referenced this issue Oct 16, 2023
…ne#1372

Switch from ancient private fork of opencsv with our
multi-character delimiter patch to the uniVocity CSV
parser which supports string delimiters.

There are two unit test changes:

- On import, a quote character before a separator character
  is no longer stripped in "ignore quotes" mode. Instead it's
  included in the data field, which seems correct to me.
- On export, CSV/TSV tests now check that the export uses the
  system line delimiter rather than always a Unix newline (\n)

The following changes were made to e2e tests:

- Disable quote processing option before switching from
  CSV to TSV because the test data doesn't have legally
  escaped quotes when the separator isn't a comma

- Force options to be sent by project creation text fixture
  (If a tag name isn't present, no options are sent. Super weird
  side effect!)

- Escape test fixture data which contains quotes before
  sending it to be parsed as a CSV file

Another try at making export tests work on Windows
tfmorris added a commit to tfmorris/OpenRefine that referenced this issue Oct 16, 2023
…ne#1372

Switch from ancient private fork of opencsv with our
multi-character delimiter patch to the uniVocity CSV
parser which supports string delimiters.

There are two unit test changes:

- On import, a quote character before a separator character
  is no longer stripped in "ignore quotes" mode. Instead it's
  included in the data field, which seems correct to me.
- On export, CSV/TSV tests now check that the export uses the
  system line delimiter rather than always a Unix newline (\n)

The following changes were made to e2e tests:

- Disable quote processing option before switching from
  CSV to TSV because the test data doesn't have legally
  escaped quotes when the separator isn't a comma

- Force options to be sent by project creation text fixture
  (If a tag name isn't present, no options are sent. Super weird
  side effect!)

- Escape test fixture data which contains quotes before
  sending it to be parsed as a CSV file
wetneb pushed a commit that referenced this issue Oct 21, 2023
* Add TODO for error reporting and remove redundant qualifiers

* Switch to uniVocity CSV parser - Fixes #2268 Fixes #1372

Switch from ancient private fork of opencsv with our
multi-character delimiter patch to the uniVocity CSV
parser which supports string delimiters.

There are two unit test changes:

- On import, a quote character before a separator character
  is no longer stripped in "ignore quotes" mode. Instead it's
  included in the data field, which seems correct to me.
- On export, CSV/TSV tests now check that the export uses the
  system line delimiter rather than always a Unix newline (\n)

The following changes were made to e2e tests:

- Disable quote processing option before switching from
  CSV to TSV because the test data doesn't have legally
  escaped quotes when the separator isn't a comma

- Force options to be sent by project creation text fixture
  (If a tag name isn't present, no options are sent. Super weird
  side effect!)

- Escape test fixture data which contains quotes before
  sending it to be parsed as a CSV file

* Add uniVocity format guessing

As a starting point use both our separator guesser and the
CSV parsers format guesser and compare the two. We probably
need a wider study to compare the performance of the two.
@tfmorris tfmorris added this to the 3.8 milestone Oct 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CSV/TSV About the CSV/TSV import or export import About importers in general - add a label for the data format if available Type: Feature Request Identifies requests for new features or enhancements. These involve proposing new improvements.
Projects
None yet
3 participants