Feature request: Additional header parsing control #840

Open
icweaver opened this issue May 28, 2021 · 3 comments
@icweaver

Thanks for providing such a wonderful package! I am opening this issue to follow up on the Zulip discussion here: https://julialang.zulipchat.com/#narrow/stream/274208-helpdesk-.28published.29/topic/reading.20a.20file.20with.20header.20with.20CSV.2Ejl/near/240593595

To summarize, would it be in line with the parsing feature discussion here to include an option to automatically parse column names from the header row if it has the following format:

# column_name_1,column_name_2,...
x1,y1,...
x2,y2,...
.
.
.

Currently, one workaround is to manually drop the header comment character (# in the above example) before reading. I believe that @fredrikekre suggested including a keyword to handle the parsing (e.g., header::Regex) that could be used to handle cases like this.
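A minimal sketch of that manual workaround, using only Base Julia. The helper name `strip_header_comment` and the sample data are hypothetical and only for illustration; nothing here is part of CSV.jl.

```julia
# Hypothetical helper (not part of CSV.jl): drop a leading comment marker,
# plus any whitespace after it, from the very first line of the text.
function strip_header_comment(text::AbstractString)
    # Without the multiline flag, `^` only matches at the start of the string,
    # and `count=1` guarantees at most one replacement.
    return replace(text, r"^#\s*" => ""; count=1)
end

# Sample data in the format described above (made up for this example).
raw = """
# column_name_1,column_name_2
1,2
3,4
"""

cleaned = strip_header_comment(raw)
```

The cleaned text can then be read as usual, for example with `CSV.File(IOBuffer(cleaned))`.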

Would something like this be a reasonable feature to include, or are there alternative ways to accomplish this in CSV.File already? Sorry if this has already been discussed elsewhere!

@quinnj
Member

quinnj commented Jun 5, 2021

Yeah, if you don't have commented lines in your data, you could just pass normalizenames=true and the # character would be "normalized" out.

We could maybe allow passing header::Regex; we would need to handle it here and here. But what exactly is the Regex expected to do? Just parse a single column name? Parse the whole line? And we would need some way to know how many characters were "consumed" by the Regex. I haven't played w/ Regex internals enough to know if that would be available somehow.

I've had the thought before that we could allow some kind of applyheader::Function keyword that would just be a function with form f(x::Symbol) -> Symbol, so we'd parse each column name, and then call applyheader to each one that could do any kind of transform it wanted. That might end up being more flexible and general.
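To make the idea concrete, here is a rough user-side sketch of what such a per-name transform could look like today. The `applyheader` keyword does not exist in CSV.jl; the function `parse_header` and its keyword names below are hypothetical and only mimic the proposed `f(x::Symbol) -> Symbol` form.

```julia
# Hypothetical helper, not a CSV.jl API: parse a (possibly commented) header
# line into column names, then apply a Symbol -> Symbol transform to each one.
function parse_header(line::AbstractString; comment="#", delim=",", applyheader=identity)
    # Drop the comment marker if the line starts with it.
    body = startswith(line, comment) ? line[(ncodeunits(comment) + 1):end] : line
    # Split on the delimiter, trim surrounding whitespace, convert to Symbols.
    names = Symbol.(strip.(split(body, delim)))
    # Call the user-supplied transform on each parsed column name.
    return map(applyheader, names)
end

# Example transform: uppercase every column name.
header = parse_header("# column_name_1,column_name_2";
                      applyheader = s -> Symbol(uppercase(String(s))))
```

Assuming a CSV.jl version where `header` accepts a vector of names and `skipto` selects the first data row, the result could be passed along as `CSV.File(path; header=header, skipto=2)`.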

@icweaver
Author

icweaver commented Jun 8, 2021

Thanks for suggesting the normalizenames=true option, it really helps a lot! I see what you mean about the Regex option and think that going the function route would be a really welcome feature. Would it make sense to also be able to apply the transform to the already "normalized" column names if it is used, since it already does so much of the heavy lifting?

@quinnj
Member

quinnj commented Aug 20, 2021

I think the right solution here is to probably go all in with something like Lyndon/I described here. I might try and get that implemented before the next 0.9 release.

@quinnj quinnj added this to the 0.9 milestone Aug 20, 2021