Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pick up column titles from the nth line #130

Open
timgdavies opened this issue Aug 5, 2016 · 26 comments
Open

Pick up column titles from the nth line #130

timgdavies opened this issue Aug 5, 2016 · 26 comments
Assignees

Comments

@timgdavies
Copy link
Contributor

In a spreadsheet it is sometimes useful to provide information above the column titles.

For example:

(1) To provide meta-data

License CC-BY
Downloaded from http://www.example.com/
Attributions Example Coop, Ordnance Survey
id grant_title grantee/name
01 Test Grant AnyCharity

(2) To provide additional field documentation

Identifier Grant Title Recipient
Identifiers should be unique Titles should be no more than 10 words long, and easy to understand for non-experts Include the name the recipient is known as, their legal name can be included elsewhere
id grant_title grantee/name
01 Test Grant AnyCharity

If, like HXL, we allow that the column headers can be given in any row (or any of the first N rows), and that only data below the column header row will be used, we could build much more flexible spreadsheets.

For example, the field title row could be hidden, but there to mark-up an existing tabular data sheet.

We have discussed whether this behaviour should be based on:

  • Specifying the row to start from
  • Auto-detection of header row

My preference is for auto-detection.

We discussed auto-detection being an option that needs to be explicitly turned on.

@andylolz
Copy link
Contributor

andylolz commented Mar 31, 2017

Ooh, +1 for doing it the HXL way! This is particularly handy where there are deeply nested elements, so the column headers required by flatten-tool are not very human readable.

@dalepotter
Copy link

+1 for HXL too.

And like a field containing a short documentation string also sounds useful.

@stevieflow
Copy link
Member

+1

@johnadamsDFID
Copy link

+1 Excellent idea.

@dalepotter
Copy link

As part of the iati.core python library (which aims to better support the the way IATI developers work with the IATI Standard and IATI data), I'm currently working on a user story which may be relevant to the related issue of outputting docs relating to each IATI XPath element.

As a IATI Tech Team Developer I want to build documentation sites that automatically extract definitions and such like from the Schemas and related places, so that I don’t need to write documentation from scratch and manually update it each time something changes.

We're imagining that implementation might be done by something like iati.core.Schema.get_documentation_string(xpath) This would return the English language documentation string.

Happy to provide more detail on the possible implementation of this, if this might be helpful...?

@stevieflow
Copy link
Member

Some additional insight on this

@rory09 and I recently developed a template for gathering IATI relevant data from across several country offices of an organisation.

One aspect that helped users (we found) was to keep the xpath details at the top of the document, and slightly shaded out (to imply that they need not be touched!):

sheet

We found this worked well, in terms of results back. To be clear - we didnt provide a template with the path immediately above data cells, but this might be a story worth considering in light of the above

that only data below the column header row will be used

IN our model here, this would represent a challenge

@timgdavies
Copy link
Contributor Author

It might be useful to think about the options that Open Refine gives when uploading a file

image

I.e. in this case we want to:

  • Parse 1st row as header
  • Drop first two rows of the data
  • Parse everything else

But in other cases might want the other combination of options.

Another design pattern to consider might be finding a way to mark rows in the first column as not-to-parse (e.g. hidden column A with # in for 'comment columns')

@stevieflow
Copy link
Member

Yes
In discussion with @Bjwebb we talked about utilising the meta tab to try and (manually) indicate this. This might be useful in terms of us trying different approaches, rather than trying (initially) to build features of UI, etc...

@kindly
Copy link
Contributor

kindly commented Jul 25, 2017

@Bjwebb and I had a further discussion about this. We have excluded auto-detection as an option as we think there are too many things that could go wrong with this. We have 3 options remaining, summarising the above and our discussion.

1. Saying how many (and what) lines to skip in the Metatab.

This means we will will have some kind of special options in the metatab that say how many rows to skip or if we skip row(s) after the heading line.

Pros

  • Easy to implement
  • No need for any information on individual pages

Cons

  • Can't configure each page and so awkward for single CSV pages
  • Choosing variable name is difficult as it should not clash with any fieldname actually in the data.

2. Let the first column in the spreadsheet be a comment to show the line is skipped.

For example:

#
# can put anything here
# Some Human Readable Title
actual heading line here
data starts from here

Pros

  • Fairly easy to implement (some complications around how we recognise if the first column can have a hash in it).
  • Very flexible skipping of lines can vary per sheet.

Cons

  • Odd behaviour (skipping first column) for spreadsheet user or data consumer. More common to skip the first few rows.

3. Let the first line be for configuration starting with a #.

If the first cell A1 is a # the rest of the first line is for configuring the sheet. We can have various configuration options on this line i.e ignore sheet, skip lines, skip after headings, vertical orientation.

For example:

# skip-line skip-afterheader
can put anything here
actual heading line here
Some Human Readable Title
data starts from here

Pros

  • Very flexible and can add lots of commands.
  • Future proof as can add more options in future.

Cons

  • Not so easy to know what to put in first line, as non standard way of expressing options.

@timgdavies
Copy link
Contributor Author

timgdavies commented Jul 25, 2017

Is there an option 4?

4. Skip the row when the first character of the row is # and use the first non-comment row as the header

#
# can put anything here
# Some Human Readable Title
actual heading line here
data starts from here
# I can also put a comment in the last line - - -
# E.g. the downloaded date - - -

Pros

  • Gets out of the problem of a blank first column;

Cons

  • Causes problems if you have data where column 1 includes # as the first character

@kindly
Copy link
Contributor

kindly commented Jul 26, 2017

@timgdavies we excluded options where the "data" not just the "headings" has limits to what can expressed. So 4 was excluded for that reason. We could have it that you are allowed comments continuously at the top but not again after that (i.e in the footer), but this means you would not be able to have the human readable headings after the normal headings. Also having # at the start of only the first human readable heading seems a bit odd.

My preference is for 3, even though it the most painful to implement.

We can have a way for the metatab to define the defaults for all sheets in its top hash line.

Common cases for this # line would be "#, ignore" or "#, vertical". Also, it solves the issue of how express metadata at the top (first point on this ticket) by having "#, metadata-lines-3", so we know the first three lines are lines of metadata explicitly. This way the metadata can parsed (not just skipped), which solves the metatab issue for a single csv file.

I suppose it is fairly ugly but instead of comments its a bit like a #! line at the top of files in order to know how to parse them.

@timgdavies
Copy link
Contributor Author

Ok - you're selling me on option 3.

I've just been looking at:

W3C Tabular data dialect descriptions which provide a set of possible 'commands' (skipRows)

http://w3c.github.io/csvw/syntax/#tabular-data-embedding-annotations is also particularly useful as a reference source for us I think.

@kindly
Copy link
Contributor

kindly commented Jul 26, 2017

@timgdavies They look good references especially for naming conventions. i.e

# skipLines 1 headerRows 2
can put anything here
actual heading line here
Some Human Readable Title
data starts from here

I suppose we could also allow any number of continuous # lines at the start (like option 4) for comments, but keep the first line special for "commands".

@stevieflow
Copy link
Member

My thinking is that on the example I provided above: #130 (comment) - there would need to be another row for configuration. With styling and such, we can make it clear to the user to not touch this - but it could be a risk, but minor

@rory09 - any thoughts?

@kindly
Copy link
Contributor

kindly commented Jul 27, 2017

@stevieflow Just for clarity I hope that you will not need to have that line at the top of every sheet. The aim is that you can define the defaults for all sheets in the Metatab sheet.

@rorads
Copy link

rorads commented Jul 28, 2017

I have a few thoughts on the issues mentioned above. I'm going to post the thought which is well formed now, and then have a think about the multiple heading rows discussion.

Commenting out columns and workbook sheets

I've often (and often seen publishers) put an extra column in my data for notes/quick tests using formulas, or put a whole extra sheet in some IATI/360 data to house values for calculations, pivot tables for quick reference, or even charts for overviewing data.

See the red tabs in this image:

image

For either a column or tab which a user doesn't want to convert, but which they don't want to delete every time they convert their data, I propose the user should be able to 'comment' out that column or sheet with by adding two underscores at the start, so "Org ID Index" would become "__Org ID Index" and a notes column would be "__notes".

N.B. I would have gone with a hashtag, but Tim rightly pointed out this would run counter to the precedent set by HXL, possibly resulting in confusion.

@rorads
Copy link

rorads commented Aug 1, 2017

I can see the appeal of option 3, but I have some worries.

Convention inconsistencies between CSV and Excel files

IATI version must be be declared in the metatab in Excel, but there can't be a metatab for CSV files, so there would need to be very clear documentation, and even then it could be confusing / would mean that people would have to be careful in transferring data from csv to excel.

Users/publishers having to program their 2d data

The runaway success of HXL comes from its ability to take existing, easily consumed datasets and simply allowing a publisher to ad a new row in a predictable place and as long as they do that one step correctly, and as long as their data makes sense, they'll produce useful data.

This isn't to say that I think the existence of a command row at the top isn't a good idea - for superusers like myself it's great, but I think we should be very opinionated and somewhat protective of our users in setting very clear and canonical defaults, which allow the 90% of users who just want to keep their headings but add one row below/above them like in HXL.

One of the huge benefits of flatten-tool in my use is that it can be used to convert lightly marked up data, which makes sense as stand alone. Although it's not hard to apply an auto-filter or pivot table to a custom range, for example, it creates a small amount of friction which could stop people exploring the really powerful features of excel and/or IATI.

@rorads rorads closed this as completed Aug 1, 2017
@rorads rorads reopened this Aug 1, 2017
@rorads
Copy link

rorads commented Aug 1, 2017

Wrong button. ![oh dear](facepalm.png)

@rorads
Copy link

rorads commented Aug 1, 2017

So with my above call for defaults in mind, we could present users with the following kind of template in the case of Excel (with a meta-tab + IATI version included elsewhere in the workbook):

A B C D E
1 Activity ID Publisher ID Publisher Type Code Publisher Name ...
2 iati-identifier reporting-org/@ref reporting-org/@type reporting-org/narrative ...
3 GB-GOV-3-12345 GB-GOV-3 10 UK - Foreign & Commonwealth Office ...
4 GB-GOV-3-12345 GB-GOV-3 10 UK - Foreign & Commonwealth Office ...
5 GB-GOV-3-12345 GB-GOV-3 10 UK - Foreign & Commonwealth Office ...

Then, for a CSV, it could be something like:

# version 2.02 skip_rows 1 header_rows 2
Activity ID Publisher ID Publisher Type Code Publisher Name ...
iati-identifier reporting-org/@ref reporting-org/@type reporting-org/narrative ...
GB-GOV-3-12345 GB-GOV-3 10 UK - Foreign & Commonwealth Office ...
GB-GOV-3-12345 GB-GOV-3 10 UK - Foreign & Commonwealth Office ...
GB-GOV-3-12345 GB-GOV-3 10 UK - Foreign & Commonwealth Office ...

@kindly
Copy link
Contributor

kindly commented Aug 1, 2017

@rory09

Thanks for your feedback. I have had the same concerns over auto-filtering, pivoting and other functions. HXL has this issue too.

I am happy to support HXL but the standard itself "as far as I can see" does not give us a clear way to say if the sheet is HXL formatted to begin with and flattentool has to know that. Any autodetection we do will be flawed and would be have to be replicated by anyone else consuming the spreadsheets.

I want to start working on this and there is not really a consensus on a solution.

For the time being I think I will go for solution 1 initially but with the syntax discussed in solution 3:

This way we would only allow the # line at the top of the Metatab page and anything configured there applies to all (non-metatab) sheets. Hardly anyone apart from the person making the spreadsheet template would be likely to touch that line. So a metatab page would look like:

# skipLines 1 headerRows 2
license CC
version 1.1
extensions

This keeps all the "data" sheets clean (no # at the top) and uniform. This makes for easier data manipulation of those sheets. It does limit us to not having per sheet formatting but that can be added later if it is seen as a need.
Next we could have an exception that for single sheet csv or spreadsheets, we would allow the # line as well, but that can be discussed later too.

@kindly
Copy link
Contributor

kindly commented Aug 1, 2017

Also I think we should consider supporting HXL in a way that says every sheet has it. i.e by having a Metatab sheet like the following:

# hxl
license CC
version 1.1
extensions

This would mean for every sheet (apart from the metatab) after every heading line there must be the HXL # tag lines.
The only thing we would need to discuss is how we represent what flatten-tool needs (the path to the field) within those # tags.

@kindly
Copy link
Contributor

kindly commented Aug 1, 2017

@rory09 I missed the comment about the defaults but they fit into the outline of the plan in my comment.

There will be two types of defaults. Defaults set in flattentool and defaults expressed in the Metatab. To begin with I am going with the defaults set in the Metatab but it should fairly easy to have a way to send a list of defaults to the command line tool. Cove can set a list of defaults for a particular standard that way too.

kindly added a commit that referenced this issue Aug 15, 2017
@kindly kindly mentioned this issue Aug 17, 2017
Bjwebb added a commit to OpenDataServices/cove that referenced this issue Aug 17, 2017
Bjwebb added a commit to OpenDataServices/cove that referenced this issue Aug 17, 2017
@stevieflow
Copy link
Member

Amazing - thanks @Bjwebb

@rory09 - please can you review?

I guess we will also need to plan some documentation around this. @Bjwebb : this could be an IATI CoVE sphnix setup?

@Bjwebb
Copy link
Member

Bjwebb commented Aug 21, 2017

Yes, OpenDataServices/cove#814 is the issue about writing docs.

@jpmckinney
Copy link
Contributor

Is this closed by #167 and #216? Not sure what the remaining issues are, if not.

rhiaro pushed a commit to open-contracting/cove-ocds that referenced this issue Nov 20, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants