More flexible dialect parsing #5

xrotwang · 2022-06-09T13:41:16Z

Hi, I'm one of the creators of a data standard (CLDF) which is based on csvw. Unfortunately we decided to allow a dialect property specifying only the properties which deviate from the default dialect. E.g. here.
This seems not to work with current csvwr:

> d <- read_csvw("projects/glottolog/glottolog-cldf/cldf/cldf-metadata.json")
> d$tables[[1]]$dataframe
$error
<simpleError: Expected single integer value>

$filename
[1] "projects/glottolog/glottolog-cldf/cldf/cldf-metadata.json"

$dialect
$dialect$commentPrefix
NULL


$group_schema
list()

When I remove the dialect property from the metadata (falling back to the default) all is good:

> d <- read_csvw("projects/glottolog/glottolog-cldf/cldf/cldf-metadata.json")
> d$tables[[1]]$dataframe
# A tibble: 131,048 × 8
   ID     Language_ID Parameter_ID  Value  Code_ID Comment  Source codeReference
   <chr>  <chr>       <chr>         <chr>  <chr>   <chr>    <chr>  <chr>        
 1 more1… more1255    level         family level-… NA       NA     NA           
 2 more1… more1255    category      Family catego… NA       NA     NA           
 3 more1… more1255    subclassific… ((((b… NA      **hh:hv… hh:hv… NA           
 4 mong1… mong1349    level         family level-… NA       NA     NA           
 5 mong1… mong1349    category      Family catego… NA       NA     NA           
 6 mong1… mong1349    subclassific… (kita… NA      NA       NA     NA           
 7 kolp1… kolp1236    level         langu… level-… NA       NA     NA           
 8 kolp1… kolp1236    category      Spoke… catego… NA       NA     NA           
 9 kolp1… kolp1236    subclassific… (koll… NA      NA       NA     NA           
10 kolp1… kolp1236    aes           3      aes-sh… Kol (10… hh:he… NA           
# … with 131,038 more rows

If dialect properties were merged in

csvwr/R/csvwr.R

Line 217 in 3c65f95

dialect <- dialect %||% default_dialect

rather than just choosing a fully specified custom dialect or the default, csvwr could handle our data out-of-the-box.

Would you consider switching to such a dialect parsing behaviour an option for csvwr?
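To illustrate the problem, here is a minimal sketch of how a null-coalescing choice treats a partial dialect. The `%||%` helper is defined locally, and the contents of `default_dialect` are assumptions for illustration only (three properties, not csvwr's actual defaults list). Because a list holding a single NULL entry is itself not NULL, the partial dialect wins outright and every default is lost, which is presumably what leads to the error above.

```r
# Local null-coalescing helper, written out here so the sketch is self-contained
`%||%` <- function(x, y) if (is.null(x)) y else x

# Assumed, abbreviated defaults (illustrative only)
default_dialect <- list(commentPrefix = "#", header = TRUE, headerRowCount = 1L)

# A dialect that specifies only the property deviating from the default
partial <- list(commentPrefix = NULL)

# The partial list is not NULL, so it is chosen as a whole
chosen <- partial %||% default_dialect

# headerRowCount was never specified, so it is missing from the chosen dialect
is.null(chosen$headerRowCount)
```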


Robsteranium · 2022-06-14T07:10:45Z

Thanks for reaching out. It's always interesting to hear what people are building with csvw!

Having individual dialect values override the defaults seems like a reasonable behaviour. Indeed it looks like some of the csvw tests expect this anyway.

We'll need a function to merge the lists, and - looking at your example - it ought to treat NULL as unsetting the default value (rather than simply non-setting it).
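A rough sketch of such a merge, assuming base R's `utils::modifyList` is acceptable for the job: by default it drops any element whose override value is NULL, which gives the "NULL unsets the default" behaviour. The name `merge_dialect` and the abbreviated `default_dialect` are hypothetical and only for illustration; this is not necessarily how csvwr implements it.

```r
# Assumed, abbreviated defaults (illustrative only)
default_dialect <- list(
  commentPrefix = "#",
  delimiter = ",",
  header = TRUE,
  headerRowCount = 1L
)

# Hypothetical helper: overlay the specified properties on the defaults.
# modifyList() removes entries whose override is NULL, so an explicit
# NULL unsets the default rather than leaving it in place.
merge_dialect <- function(defaults, overrides) {
  if (is.null(overrides)) return(defaults)
  utils::modifyList(defaults, overrides)
}

merged <- merge_dialect(default_dialect, list(commentPrefix = NULL))

is.null(merged$commentPrefix)  # the default comment prefix has been unset
merged$headerRowCount          # untouched defaults are still available
```

Properties that are not mentioned in the override keep their default values, so a metadata file only needs to list the deviations.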

xrotwang · 2022-06-14T07:21:06Z

Thanks for your answer! Yes, the behaviour you describe would be ideal. Looking at

csvwr/R/csvwr.R

Lines 224 to 228 in 3c65f95

    dtf <- readr::read_csv(csv_url,
                           trim_ws=T,
                           skip=dialect$headerRowCount,
                           col_names=column_names,
                           col_types=column_types)

it seems that only parts of the dialect spec are taken into account as of now. Is that correct?

FWIW, I built a Python package for csvw (https://github.com/cldf/csvw/) and found completely supporting csv dialect specs surprisingly non-trivial ...

Robsteranium · 2022-06-14T07:45:30Z

Yes that's right. I've only implemented the bits I've encountered so far (header and headerRowCount) so nothing is done with commentPrefix for example.
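For what it's worth, a few more dialect properties could be passed through to readr along these lines. This is only a sketch under assumptions: `dialect_to_readr_args` is a hypothetical helper (not part of csvwr), it expects a dialect that has already been merged with the defaults, and it covers just `delimiter`, `quoteChar`, `commentPrefix`, `skipRows` and `headerRowCount`. An unset `commentPrefix` maps to readr's `comment = ""`, which disables comment handling.

```r
# Local null-coalescing helper, written out here so the sketch is self-contained
`%||%` <- function(x, y) if (is.null(x)) y else x

# Hypothetical mapping from an already-merged dialect to readr::read_delim arguments
dialect_to_readr_args <- function(dialect) {
  list(
    delim   = dialect$delimiter %||% ",",
    quote   = dialect$quoteChar %||% "\"",
    comment = dialect$commentPrefix %||% "",
    # Column names are assumed to be supplied separately (as in the call quoted
    # earlier in this thread), so header rows are skipped along with skipRows
    skip    = (dialect$skipRows %||% 0L) + (dialect$headerRowCount %||% 0L)
  )
}

args <- dialect_to_readr_args(list(delimiter = "\t", headerRowCount = 1L))

# Possible usage, assuming csv_url and column_names are defined elsewhere:
# dtf <- do.call(readr::read_delim,
#                c(list(file = csv_url, col_names = column_names), args))
```

Properties such as `lineTerminators` or `skipColumns` have no direct readr counterpart and would need separate handling.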

We've been gathering some resources on csvw.org. Would you like to have cldf/csvw included on the tools page?

xrotwang · 2022-06-14T07:57:00Z

I'm not sure whether my csvw package should be included on csvw.org, considering its limitations and deviations. But then, I guess most of the tools have rough edges :)
Maybe that'll lead to something like conformance levels for tools.

Robsteranium · 2022-06-16T13:12:02Z

Yeah, no specification is comprehensive and it gets harder still when you start to combine/extend them! It's no surprise all the implementations end up slightly different. As long as it's explicit then I think that's totally fine.

I'll add it to the tools page when I get a chance!

Robsteranium · 2022-07-07T08:06:16Z

3d05de8 introduces an override_defaults function and uses it to merge the table/ group/ default dialect.

Your example works now:

> cldf <- read_csvw("https://raw.githubusercontent.com/glottolog/glottolog-cldf/c8eefe82b4c87f3c566a8e5181bacf714661e18e/cldf/cldf-metadata.json")
> cldf$dialect$commentPrefix
NULL

Thanks for the report!

Robsteranium mentioned this issue Jun 16, 2022

Add cldf/csvw to tools Swirrl/csvw.org#7

Closed

xrotwang mentioned this issue Jun 23, 2022

CSVW conformance cldf/csvw#60

Merged

Robsteranium closed this as completed Jul 7, 2022
