More flexible dialect parsing #5

Closed
xrotwang opened this issue Jun 9, 2022 · 6 comments

Comments

@xrotwang

xrotwang commented Jun 9, 2022

Hi, I'm one of the creators of a data standard (CLDF) which is based on csvw. Unfortunately we decided to allow a dialect property specifying only the properties which deviate from the default dialect. E.g. here.
This does not seem to work with the current csvwr:

> d <- read_csvw("projects/glottolog/glottolog-cldf/cldf/cldf-metadata.json")
> d$tables[[1]]$dataframe
$error
<simpleError: Expected single integer value>

$filename
[1] "projects/glottolog/glottolog-cldf/cldf/cldf-metadata.json"

$dialect
$dialect$commentPrefix
NULL


$group_schema
list()

When I remove the dialect property from the metadata (falling back to the default) all is good:

> d <- read_csvw("projects/glottolog/glottolog-cldf/cldf/cldf-metadata.json")
> d$tables[[1]]$dataframe
# A tibble: 131,048 × 8
   ID     Language_ID Parameter_ID  Value  Code_ID Comment  Source codeReference
   <chr>  <chr>       <chr>         <chr>  <chr>   <chr>    <chr>  <chr>        
 1 more1… more1255    level         family level-… NA       NA     NA
 2 more1… more1255    category      Family catego… NA       NA     NA
 3 more1… more1255    subclassific… ((((b… NA      **hh:hv… hh:hv… NA
 4 mong1… mong1349    level         family level-… NA       NA     NA
 5 mong1… mong1349    category      Family catego… NA       NA     NA
 6 mong1… mong1349    subclassific… (kita… NA      NA       NA     NA
 7 kolp1… kolp1236    level         langu… level-… NA       NA     NA
 8 kolp1… kolp1236    category      Spoke… catego… NA       NA     NA
 9 kolp1… kolp1236    subclassific… (koll… NA      NA       NA     NA
10 kolp1… kolp1236    aes           3      aes-sh… Kol (10… hh:he… NA
# … with 131,038 more rows

If dialect properties were merged in

csvwr/R/csvwr.R

Line 217 in 3c65f95

dialect <- dialect %||% default_dialect

rather than just choosing a fully specified custom dialect or the default, csvwr could handle our data out-of-the-box.

Would you consider switching csvwr to this dialect parsing behaviour?

@Robsteranium
Owner

Thanks for reaching out. It's always interesting to hear what people are building with csvw!

Having individual dialect values override the defaults seems like a reasonable behaviour. Indeed it looks like some of the csvw tests expect this anyway.

We'll need a function to merge the lists, and - looking at your example - it ought to treat NULL as unsetting the default value (rather than simply leaving it unset).
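To make the merge concrete, here is a minimal sketch in base R. It is not the csvwr implementation: `merge_dialect` is a hypothetical name, and `default_dialect` below is an abbreviated list assumed from the CSVW default dialect. The detail that matters is that an explicit NULL has to be stored with `[<-` and `list()`, because `merged[[name]] <- NULL` would delete the element instead.

```r
# Abbreviated defaults, assumed from the CSVW default dialect (not copied from csvwr).
default_dialect <- list(
  commentPrefix = "#",
  delimiter = ",",
  header = TRUE,
  headerRowCount = 1L,
  quoteChar = "\"",
  skipRows = 0L
)

# Hypothetical helper: later arguments take precedence over earlier ones.
merge_dialect <- function(defaults, ...) {
  merged <- defaults
  for (override in list(...)) {
    if (is.null(override)) next
    for (name in names(override)) {
      # Wrapping in list() keeps an explicit NULL as a value,
      # so a default can be unset rather than silently kept.
      merged[name] <- list(override[[name]])
    }
  }
  merged
}

# A partial dialect like the one in the report: only commentPrefix is given.
partial <- list(commentPrefix = NULL)
merged <- merge_dialect(default_dialect, partial)

is.null(merged$commentPrefix)  # TRUE: the default "#" has been unset
merged$headerRowCount          # 1: still inherited from the defaults
```

With `dialect %||% default_dialect`, the partial list is used on its own, so `dialect$headerRowCount` is NULL; that is presumably what triggers the "Expected single integer value" error above. Group and table dialects could be layered the same way, e.g. `merge_dialect(default_dialect, group_dialect, table_dialect)`.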

@xrotwang
Author

Thanks for your answer! Yes, the behaviour you describe would be ideal. Looking at

csvwr/R/csvwr.R

Lines 224 to 228 in 3c65f95

dtf <- readr::read_csv(csv_url,
trim_ws=T,
skip=dialect$headerRowCount,
col_names=column_names,
col_types=column_types)

it seems that only parts of the dialect spec are taken into account as of now. Is that correct?
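For illustration, here is a rough sketch of how more of the dialect could be translated into reader arguments. `dialect_to_reader_args` is a hypothetical helper, not part of csvwr; the argument names on the left are those of `readr::read_delim`, and the mapping (including adding `skipRows` to `headerRowCount`) is my assumption about what a fuller implementation might do. It deliberately ignores the harder parts such as `lineTerminators`, `skipColumns` and the `"start"`/`"end"` values of `trim`.

```r
# Hypothetical helper: map a (merged) CSVW dialect to readr::read_delim arguments.
dialect_to_reader_args <- function(dialect) {
  or_default <- function(value, fallback) if (is.null(value)) fallback else value
  list(
    delim = or_default(dialect[["delimiter"]], ","),
    quote = or_default(dialect[["quoteChar"]], "\""),
    escape_double = or_default(dialect[["doubleQuote"]], TRUE),
    # readr uses an empty string to mean "no comment lines".
    comment = or_default(dialect[["commentPrefix"]], ""),
    trim_ws = isTRUE(dialect[["trim"]]),
    # Skip leading rows plus the header rows, since column names are supplied separately.
    skip = as.integer(or_default(dialect[["skipRows"]], 0L)) +
      as.integer(or_default(dialect[["headerRowCount"]], 1L))
  )
}

reader_args <- dialect_to_reader_args(
  list(commentPrefix = NULL, delimiter = "\t", headerRowCount = 1L)
)

# The result could then be passed on along these lines (not run here):
# do.call(readr::read_delim,
#         c(list(file = csv_url, col_names = column_names, col_types = column_types),
#           reader_args))
```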

FWIW, I built a Python package for csvw (https://github.com/cldf/csvw/) and found completely supporting csv dialect specs surprisingly non-trivial ...

@Robsteranium
Owner

Yes, that's right. I've only implemented the bits I've encountered so far (header and headerRowCount), so nothing is done with commentPrefix, for example.

We've been gathering some resources on csvw.org. Would you like to have cldf/csvw included on the tools page?

@xrotwang
Author

I'm not sure whether my csvw package should be included on csvw.org, considering its limitations and deviations. But then, I guess most of the tools have rough edges :)
Maybe that'll lead to something like conformance levels for tools.

@Robsteranium
Owner

Yeah, no specification is comprehensive and it gets harder still when you start to combine or extend them! It's no surprise all the implementations end up slightly different. As long as the differences are explicit, I think that's totally fine.

I'll add it to the tools page when I get a chance!

@Robsteranium
Owner

3d05de8 introduces an override_defaults function and uses it to merge the table, group, and default dialects.

Your example works now:

> cldf <- read_csvw("https://raw.githubusercontent.com/glottolog/glottolog-cldf/c8eefe82b4c87f3c566a8e5181bacf714661e18e/cldf/cldf-metadata.json")
> cldf$dialect$commentPrefix
NULL

Thanks for the report!
