Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v.db.select: Handle all formats through option, add CSV #1121

Merged
merged 17 commits into from
Jun 20, 2021

Conversation

wenzeslaus
Copy link
Member

@wenzeslaus wenzeslaus commented Nov 25, 2020

Instead of separate flags for JSON and vertical output format, use format option
so -j is replaced by format=json, etc. The formats are no longer spread among
the other flags and a written command is more readble. It is also more explicit
since the default format is now represented in the interface and vertical output
clearly stands out as yet another alternative.

The default format is named plain, alternative names are be text, ascii and default.

The use of an option does not change anything major in the code in terms of output
except for use of ints/booleans instead of directly accessing the flags' answers.
The parameter handling changed significantly. The standard exclude function
cannot be used and custom checks and messages are needed.

The significant clean-up related changes to existing formats are:

  • The JSON output of NULL values is fixed, writing JSON literal null.
  • JSON outputs are now always Object (dictionary) so theoretically they can be combined or enhanced in the future.
  • Different default separator for different formats and outputs.
  • Separator for -e can be customized.
  • All required JSON escapes are escaped.
  • Tab is escaped when escaping is enabled (so you could write TSV with escapes using plain).
  • JSON escapes except for double quotes are applied when escaping is enabled for other formats.
  • Vertical output does not add a newline when newline is vertical separator (no more two newlines).

This adds another format, CSV, as format=csv, i.e., without need to add another
flag (e.g., -c and -v were already taken). Most of the code was already there
as basic structure was the original plain default output and quoting was already
done for the JSON output. The original format was not a proper CSV because it
lacked handling quoting of fields which for common CSV parsers takes care of new line
characters and separators in values/cells. It was not completely safe even with escaping
because it did does not escape to the separator (to be delimeter separated values format
with escapes).

Now in CSV, all non-numeric columns are quoted (with two quotes representing a quote in the text).
The quotes used are double quotes, the separator is derived by the horizontal/field
separator option. By default is is a comma (a different separator is set for each format by default).
CSV output is available even for the -r flag (minimal region/bbox).

@wenzeslaus
Copy link
Member Author

wenzeslaus commented Nov 25, 2020

@landam Can you please look at this? v.db.select seems like a great example of a reduction of multiple flags into a single option.

Anybody, let me know what do you think. The idea is that v.db.select supports 3 formats (4 with newly added real CSV), but instead of having a flag for each format, this PR uses only one option/parameter which can have different values making the selection more clear (you can choose only one and you see that they are all in one group). This issue is generally important because there is a lot of cases of multiple related and exclusive flags and many of those are different output formats like in this case.

Before:

  -v   Vertical output (instead of horizontal)
  -j   JSON output
  -c   CSV output [newly added]

After:

              format   Output format
                       options: plain,csv,json,vertical
                       default: plain

There are some additional issues in regard to how different formats interact with the other options and flags. For example, JSON ignores most of the other format-related options while CSV supports different separators, but ideally would have comma as default rather than pipe which is the current default. Another example is -r flag which gives only the bounding box (not the attributes) and always ignores the separator in the current code.

This PR is now in draft mode, but I want to get some early feedback. If you are viewing the diff right now, you may want to check Hide whitespace changes in settings. I will eventually put the code reformatting into a separate PR.

@wenzeslaus
Copy link
Member Author

This is ready for general comments, but I'm especially interested in feedback regarding the replacement of -v -j (-c) by format=plain|csv|json|vertical.

(Reformatting done in #1138. This now contains only the actual changes.)

@lucadelu
Copy link
Contributor

lucadelu commented Dec 2, 2020

This is ready for general comments, but I'm especially interested in feedback regarding the replacement of -v -j (-c) by format=plain|csv|json|vertical.

(Reformatting done in #1138. This now contains only the actual changes.)

I support this change, I think is better an option than several flag

@mlennert
Copy link
Contributor

mlennert commented Dec 2, 2020

I also support this change. Just out of curiosity: what is the difference between plain with sep=comma and csv ?

@wenzeslaus
Copy link
Member Author

what is the difference between plain with sep=comma and csv ?

That's a good question. The current and future default output allows escaping of certain characters and does not quote any fields. The escaping there is not really the common CSV escaping while other things namely separator inside fields is not handled in any way.

The new CSV output quotes all text fields/cells allowing the separator to be in there while, similarly to JSON, doing its own escaping, namely doubling the quotation mark which is a common CSV practice. Now, in combination with a proper CSV parser, you can have separator character, quotation character, and newline in cells and it will work. Any separator is allowed and default is now unfortunately pipe, not comma, so that's something I would like to change, but I don't know how to express that in the interface since we have default separator pipe everywhere and it still makes sense for the default/plain output (I was thinking separator=default: use format's default separator). (One limitation still there is that if you use separator which can occur in the numeric field such as period, the resulting CSV will be invalid. Seem like an unlikely problem though.)

The default/plain output makes sense even in the light of the CSV if we want a human readable output for the command line. Quotes would make too much clutter, and you can make sense of some tricks such as separator or newline in the cell. Pipe makes sense as it makes more readable than a comma. Here comes a question whether we should document format=plain (or whatever is the name) as something stable and thus okay to parse (current state before JSON and CSV), or as something which is really just meant for human eyes and may change without warning (more like r.univar or r.info default outputs).

Instead of separate flags for JSON and vertical output format, use format option
so -j is replaced by format=json, etc. The formats are no longer spread among
the other flags and a written command is more readble. It is also more explicit
since the default format is now represented in the interface and vertical output
clearly stands out as yet another alternative.

The default format is named plain, alternative names are be text, ascii and default.

The user of an option does not change anything major in the code in terms of output
except for use of ints/booleans instead of directly accessing the flags' answers.
The parameter handling changed significatly. The standard exclude function
cannot be used and custom checks and messages are needed.

This adds another format, CSV, as format=csv, i.e., without need to add another
flag (e.g., -c and -v were already taken). Most of the code was already there
as basic structure was the original plain default output and quoting was already
done for the JSON output. The original format was not a proper CSV because it
lacked handling quoting of fields which for common CSV parsers takes care of new line
characters and separators in values/cells. All non-numeric columns are quoted
(with two quotes representing a quote in the text).
The quotes used are double quotes, the separator is drived by the horizontal/field
separator option which is unfortunate because it is not a comma by default.
CSV output is available even for the -r flag (minimal region/bbox).

The JSON output of NULL values is fixed, writing JSON literal null.
@wenzeslaus wenzeslaus force-pushed the v_db_select_formats_and_csv branch from c73047a to 5d578cc Compare May 15, 2021 03:45
@wenzeslaus wenzeslaus requested a review from HuidaeCho May 19, 2021 04:00
@wenzeslaus wenzeslaus added bug Something isn't working enhancement New feature or request vector Related to vector data processing labels May 19, 2021
@wenzeslaus
Copy link
Member Author

The four old and new formats are described in the documentation.

PR description updated to reflect the new state.

@wenzeslaus
Copy link
Member Author

To make the review easier, this has tests which shows that the different forms work even with special characters and non-default separators in case of CSV.

vector/v.db.select/main.c Outdated Show resolved Hide resolved
vector/v.db.select/main.c Outdated Show resolved Hide resolved
fatal_error_option_value_excludes_option(options.format, options.vsep,
_("Only vertical output can use vertical separator"));
}

Copy link
Member

@HuidaeCho HuidaeCho May 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (csv)
    fatal_error_option_value_excludes_option(options.format, options.fsep, _("Separator is based on the format"));

CSV (comma-separated values) is only comma-separated. Right now, fsep is configurable with comma as default for CSV, but can we call it CSV when a non-comma separator is used?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although comma is often expected for CSV, CSV is also used to refer to broader format group with same or roughly same syntax, but different separator as demonstrated by the csv package in Python. In some natural languages, comma is used as a decimal point which leads to semicolon being the most common separator (as comma may lead to confusion even when period is used as decimal point). This is visible even in some GRASS helper code. So, yes, we can call it CSV even if delimiter/separator is customizable.

Practically, allowing different separators gives users the choice to produce some flavors of TSV or DSV although the plain format may be more suitable for that.

vector/v.db.select/main.c Outdated Show resolved Hide resolved
@wenzeslaus wenzeslaus requested a review from HuidaeCho June 6, 2021 20:13
vector/v.db.select/main.c Outdated Show resolved Hide resolved
@wenzeslaus
Copy link
Member Author

This is ready to be merged, but now you can test it at https://mybinder.org/v2/gh/wenzeslaus/grass/v_db_select_formats_and_csv after opening New > Terminal and run:

grass data/grassdata/nc_basic_spm_grass7/PERMANENT --exec v.db.select points_of_interest format=json

@wenzeslaus
Copy link
Member Author

wenzeslaus commented Jun 19, 2021

@HuidaeCho unless you have further comment I will merge. The Windows tests fails due to failed g.download.location download and a parallel temporal test fails with unrelated race condition.

@HuidaeCho
Copy link
Member

@wenzeslaus Everything looks good!

@wenzeslaus wenzeslaus merged commit dfc9b05 into OSGeo:master Jun 20, 2021
@wenzeslaus wenzeslaus deleted the v_db_select_formats_and_csv branch June 20, 2021 01:15
@wenzeslaus
Copy link
Member Author

Thanks @HuidaeCho. Happy parsing everyone!

@neteler neteler added this to the 8.0.0 milestone Dec 9, 2021
ninsbl pushed a commit to ninsbl/grass that referenced this pull request Oct 26, 2022
)

Instead of separate flags for JSON and vertical output format, use format option,
so -j is replaced by format=json, etc. The formats are no longer spread among
the other flags and a written command is more readble. It is also more explicit
since the default format is now represented in the interface and vertical output
clearly stands out as yet another alternative.

The default format is named plain, alternative names considered and rejected were text, ascii and default.
It is meant to be read by humans, but now it can be also newly safely parsed when set as TSV,
however, this feature is questionable and the idea is that plain is not used for parsing.

The use of an option does not change anything major in the code in terms of output
except for use of an enum instead of directly accessing the flags' answers.
The parameter handling changed significantly. The standard exclude function
cannot be used and custom checks and messages are needed to disallow combinations
such as JSON with custom null value.

The significant clean-up related changes to existing formats are:

* The JSON output of NULL values is fixed, writing JSON literal null.
* JSON outputs are now always Object (dictionary) so theoretically they can be combined or enhanced in the future. 
* Different default separator for different formats and outputs (pipe for plain, comma for CSV).
* Separator for -e can now be customized.
* All required JSON escapes are escaped.
* Tab is escaped when escaping is enabled (so you could theoretically write TSV with escapes using plain).
* For all formats, JSON-style escapes except for double quotes are applied when escaping is enabled.
* Vertical output does not add a newline when newline is vertical separator (no more two newlines).

This adds another format, CSV, as format=csv, i.e., without need to add another
flag (e.g., -c and -v were already taken). Most of the code was already there
as basic structure was the original plain default output and quoting was already
done for the JSON output. The original format was not a proper CSV because it
lacked handling quoting of fields which for common CSV parsers takes care of new line
characters and separators in values/cells. It was not completely safe even with escaping
because it did does not escape to the separator (to be delimiter separated values format
with escapes). Delimiter for CSV can be customized, but the quoting cannot.

Now in CSV, all non-numeric columns are quoted (with two quotes representing a quote in the text).
The quotes used are double quotes, the separator is derived by the horizontal/field
separator option. By default is is a comma (a different separator is set for each format by default).
CSV output is available even for the -r flag (minimal region/bbox).
ninsbl pushed a commit to ninsbl/grass that referenced this pull request Feb 17, 2023
)

Instead of separate flags for JSON and vertical output format, use format option,
so -j is replaced by format=json, etc. The formats are no longer spread among
the other flags and a written command is more readble. It is also more explicit
since the default format is now represented in the interface and vertical output
clearly stands out as yet another alternative.

The default format is named plain, alternative names considered and rejected were text, ascii and default.
It is meant to be read by humans, but now it can be also newly safely parsed when set as TSV,
however, this feature is questionable and the idea is that plain is not used for parsing.

The use of an option does not change anything major in the code in terms of output
except for use of an enum instead of directly accessing the flags' answers.
The parameter handling changed significantly. The standard exclude function
cannot be used and custom checks and messages are needed to disallow combinations
such as JSON with custom null value.

The significant clean-up related changes to existing formats are:

* The JSON output of NULL values is fixed, writing JSON literal null.
* JSON outputs are now always Object (dictionary) so theoretically they can be combined or enhanced in the future. 
* Different default separator for different formats and outputs (pipe for plain, comma for CSV).
* Separator for -e can now be customized.
* All required JSON escapes are escaped.
* Tab is escaped when escaping is enabled (so you could theoretically write TSV with escapes using plain).
* For all formats, JSON-style escapes except for double quotes are applied when escaping is enabled.
* Vertical output does not add a newline when newline is vertical separator (no more two newlines).

This adds another format, CSV, as format=csv, i.e., without need to add another
flag (e.g., -c and -v were already taken). Most of the code was already there
as basic structure was the original plain default output and quoting was already
done for the JSON output. The original format was not a proper CSV because it
lacked handling quoting of fields which for common CSV parsers takes care of new line
characters and separators in values/cells. It was not completely safe even with escaping
because it did does not escape to the separator (to be delimiter separated values format
with escapes). Delimiter for CSV can be customized, but the quoting cannot.

Now in CSV, all non-numeric columns are quoted (with two quotes representing a quote in the text).
The quotes used are double quotes, the separator is derived by the horizontal/field
separator option. By default is is a comma (a different separator is set for each format by default).
CSV output is available even for the -r flag (minimal region/bbox).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request vector Related to vector data processing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants