-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pick up column titles from the nth line #130
Comments
Ooh, +1 for doing it the HXL way! This is particularly handy where there are deeply nested elements, so the column headers required by flatten-tool are not very human readable. |
+1 for HXL too. And like a field containing a short documentation string also sounds useful. |
+1 |
+1 Excellent idea. |
As part of the iati.core python library (which aims to better support the the way IATI developers work with the IATI Standard and IATI data), I'm currently working on a user story which may be relevant to the related issue of outputting docs relating to each IATI XPath element.
We're imagining that implementation might be done by something like Happy to provide more detail on the possible implementation of this, if this might be helpful...? |
Some additional insight on this @rory09 and I recently developed a template for gathering IATI relevant data from across several country offices of an organisation. One aspect that helped users (we found) was to keep the xpath details at the top of the document, and slightly shaded out (to imply that they need not be touched!): We found this worked well, in terms of results back. To be clear - we didnt provide a template with the path immediately above data cells, but this might be a story worth considering in light of the above
IN our model here, this would represent a challenge |
It might be useful to think about the options that Open Refine gives when uploading a file I.e. in this case we want to:
But in other cases might want the other combination of options. Another design pattern to consider might be finding a way to mark rows in the first column as not-to-parse (e.g. hidden column A with # in for 'comment columns') |
Yes |
@Bjwebb and I had a further discussion about this. We have excluded auto-detection as an option as we think there are too many things that could go wrong with this. We have 3 options remaining, summarising the above and our discussion. 1. Saying how many (and what) lines to skip in the Metatab.This means we will will have some kind of special options in the metatab that say how many rows to skip or if we skip row(s) after the heading line. Pros
Cons
2. Let the first column in the spreadsheet be a comment to show the line is skipped.For example:
Pros
Cons
3. Let the first line be for configuration starting with a #.If the first cell A1 is a # the rest of the first line is for configuring the sheet. We can have various configuration options on this line i.e ignore sheet, skip lines, skip after headings, vertical orientation. For example:
Pros
Cons
|
Is there an option 4? 4. Skip the row when the first character of the row is # and use the first non-comment row as the header
Pros
Cons
|
@timgdavies we excluded options where the "data" not just the "headings" has limits to what can expressed. So 4 was excluded for that reason. We could have it that you are allowed comments continuously at the top but not again after that (i.e in the footer), but this means you would not be able to have the human readable headings after the normal headings. Also having # at the start of only the first human readable heading seems a bit odd. My preference is for 3, even though it the most painful to implement. We can have a way for the metatab to define the defaults for all sheets in its top hash line. Common cases for this # line would be "#, ignore" or "#, vertical". Also, it solves the issue of how express metadata at the top (first point on this ticket) by having "#, metadata-lines-3", so we know the first three lines are lines of metadata explicitly. This way the metadata can parsed (not just skipped), which solves the metatab issue for a single csv file. I suppose it is fairly ugly but instead of comments its a bit like a #! line at the top of files in order to know how to parse them. |
Ok - you're selling me on option 3. I've just been looking at: W3C Tabular data dialect descriptions which provide a set of possible 'commands' (skipRows) http://w3c.github.io/csvw/syntax/#tabular-data-embedding-annotations is also particularly useful as a reference source for us I think. |
@timgdavies They look good references especially for naming conventions. i.e
I suppose we could also allow any number of continuous # lines at the start (like option 4) for comments, but keep the first line special for "commands". |
My thinking is that on the example I provided above: #130 (comment) - there would need to be another row for configuration. With styling and such, we can make it clear to the user to not touch this - but it could be a risk, but minor @rory09 - any thoughts? |
@stevieflow Just for clarity I hope that you will not need to have that line at the top of every sheet. The aim is that you can define the defaults for all sheets in the Metatab sheet. |
I have a few thoughts on the issues mentioned above. I'm going to post the thought which is well formed now, and then have a think about the multiple heading rows discussion. Commenting out columns and workbook sheetsI've often (and often seen publishers) put an extra column in my data for notes/quick tests using formulas, or put a whole extra sheet in some IATI/360 data to house values for calculations, pivot tables for quick reference, or even charts for overviewing data. See the red tabs in this image: For either a column or tab which a user doesn't want to convert, but which they don't want to delete every time they convert their data, I propose the user should be able to 'comment' out that column or sheet with by adding two underscores at the start, so
|
I can see the appeal of option 3, but I have some worries. Convention inconsistencies between CSV and Excel files IATI version must be be declared in the metatab in Excel, but there can't be a metatab for CSV files, so there would need to be very clear documentation, and even then it could be confusing / would mean that people would have to be careful in transferring data from csv to excel. Users/publishers having to The runaway success of HXL comes from its ability to take existing, easily consumed datasets and simply allowing a publisher to ad a new row in a predictable place and as long as they do that one step correctly, and as long as their data makes sense, they'll produce useful data. This isn't to say that I think the existence of a command row at the top isn't a good idea - for superusers like myself it's great, but I think we should be very opinionated and somewhat protective of our users in setting very clear and canonical defaults, which allow the 90% of users who just want to keep their headings but add one row below/above them like in HXL. One of the huge benefits of flatten-tool in my use is that it can be used to convert lightly marked up data, which makes sense as stand alone. Although it's not hard to apply an auto-filter or pivot table to a custom range, for example, it creates a small amount of friction which could stop people exploring the really powerful features of excel and/or IATI. |
Wrong button. |
So with my above call for defaults in mind, we could present users with the following kind of template in the case of Excel (with a meta-tab + IATI version included elsewhere in the workbook):
Then, for a CSV, it could be something like:
|
Thanks for your feedback. I have had the same concerns over auto-filtering, pivoting and other functions. HXL has this issue too. I am happy to support HXL but the standard itself "as far as I can see" does not give us a clear way to say if the sheet is HXL formatted to begin with and flattentool has to know that. Any autodetection we do will be flawed and would be have to be replicated by anyone else consuming the spreadsheets. I want to start working on this and there is not really a consensus on a solution. For the time being I think I will go for solution 1 initially but with the syntax discussed in solution 3: This way we would only allow the # line at the top of the Metatab page and anything configured there applies to all (non-metatab) sheets. Hardly anyone apart from the person making the spreadsheet template would be likely to touch that line. So a metatab page would look like:
This keeps all the "data" sheets clean (no # at the top) and uniform. This makes for easier data manipulation of those sheets. It does limit us to not having per sheet formatting but that can be added later if it is seen as a need. |
Also I think we should consider supporting HXL in a way that says every sheet has it. i.e by having a Metatab sheet like the following:
This would mean for every sheet (apart from the metatab) after every heading line there must be the HXL # tag lines. |
@rory09 I missed the comment about the defaults but they fit into the outline of the plan in my comment. There will be two types of defaults. Defaults set in flattentool and defaults expressed in the Metatab. To begin with I am going with the defaults set in the Metatab but it should fairly easy to have a way to send a list of defaults to the command line tool. Cove can set a list of defaults for a particular standard that way too. |
Diff: OpenDataServices/flatten-tool@dfdf61d...ea9394c Includes changes to the tests, and options to skip lines etc. OpenDataServices/flatten-tool#130
Diff: OpenDataServices/flatten-tool@dfdf61d...ea9394c Includes changes to the tests, and options to skip lines etc. OpenDataServices/flatten-tool#130
Our work so far on this is deployed at http://iati.cove.opendataservices.coop/ Some example files: Next steps are to review this and decide whether CSV support is a priority. |
Yes, OpenDataServices/cove#814 is the issue about writing docs. |
Diff: OpenDataServices/flatten-tool@dfdf61d...ea9394c Includes changes to the tests, and options to skip lines etc. OpenDataServices/flatten-tool#130
In a spreadsheet it is sometimes useful to provide information above the column titles.
For example:
(1) To provide meta-data
(2) To provide additional field documentation
If, like HXL, we allow that the column headers can be given in any row (or any of the first N rows), and that only data below the column header row will be used, we could build much more flexible spreadsheets.
For example, the field title row could be hidden, but there to mark-up an existing tabular data sheet.
We have discussed whether this behaviour should be based on:
My preference is for auto-detection.
We discussed auto-detection being an option that needs to be explicitly turned on.
The text was updated successfully, but these errors were encountered: