Allow line-based importer to specify a custom line-delimiter sequence for custom data #4103

Closed
thadguidry opened this issue Aug 13, 2021 · 1 comment · Fixed by #5434
Labels: Difficulty: Intermediate · import · new data format · Type: Feature Request

thadguidry commented Aug 13, 2021

This might need better streaming support in our line-based importer as a prerequisite; I'm not sure.
I often end up with large byte arrays streamed out to a single file that use a custom character sequence (*%%*) as the record delimiter, and I would like to treat that sequence simply as a line delimiter when importing the file into OpenRefine. The files are typically under 4 GB, usually only 1 GB or 2 GB, and I often have over 20 GB of system memory available to give to the Java heap.
Other tools allow reading a file as a stream of characters and splitting it into new lines on a custom character sequence.
I would like our OpenRefine line-based importer, or a new importer, to handle this use case.

Proposed solution

Allow the line-based importer to have an option to use a custom line-delimiter character sequence (overriding the defaults of \n or \r\n).
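
To illustrate the requested behavior, here is a minimal sketch in plain Java. It is not OpenRefine code: the class name `CustomDelimiterSplit` and the method `split` are hypothetical, and it assumes the delimiter should be matched literally. It uses `java.util.Scanner` with a quoted delimiter so that the characters in `*%%*` are not interpreted as regex metacharacters.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only (not part of OpenRefine).
class CustomDelimiterSplit {

    /** Reads the character stream and splits it into records on a literal delimiter. */
    static List<String> split(Reader reader, String delimiter) {
        List<String> records = new ArrayList<>();
        try (Scanner scanner = new Scanner(reader)) {
            // Pattern.quote makes regex metacharacters such as '*' match literally.
            scanner.useDelimiter(Pattern.quote(delimiter));
            while (scanner.hasNext()) {
                // Ordinary newlines inside a record are kept as part of that record.
                records.add(scanner.next());
            }
        }
        return records;
    }

    public static void main(String[] args) {
        Reader input = new StringReader("first record*%%*second\nrecord*%%*third");
        for (String record : split(input, "*%%*")) {
            System.out.println("[" + record + "]");
        }
    }
}
```

The sketch collects every record into a list for simplicity. For files in the 1 GB to 4 GB range described above, handling each record as it is read would likely be preferable to holding them all in memory.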

Alternatives considered

I have to use other tools.

Additional context

thadguidry added the Type: Feature Request, Status: Pending Review, and import labels and removed the Status: Pending Review label Aug 13, 2021
thadguidry added the new data format and Difficulty: Intermediate labels Oct 6, 2022
tfmorris added a commit that referenced this issue Nov 17, 2022
#4103

* Added support for regex based row separator to line based importer

* Added basic tests for LineBasedImporter

* Fixed io error handling within LineBasedImporter. Code style fixes.

* Minor cleanups

- update copyright year
- use static imports for Assert methods
- remove unused method
- minor cleanups suggested by IDE

Co-authored-by: Tom Morris <tfmorris@gmail.com>
@wetneb wetneb added this to the 3.7 milestone Nov 17, 2022
@thadguidry

Thanks @egordm and @tfmorris for working on this enhancement!
