Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add formatters. #105

Merged
merged 15 commits into from
Aug 22, 2023
Merged

Add formatters. #105

merged 15 commits into from
Aug 22, 2023

Conversation

kklingenberg
Copy link
Contributor

@kklingenberg kklingenberg commented Aug 9, 2023

This PR adds the @ filters that usually format things as strings using specific encodings, but also some that decode strings already encoded.

It is the absolute laziest version of this PR. I won't disagree if you decide to close this right away, or park it forever. It is lazy because:

  • I simply added libraries for every encoding, and didn't attempt to implement any myself.
  • The @csv and @tsv implementations are especially bad because they allocate csv::Writer for every element of the input stream.
  • Extended the parser to allow for identifiers starting with @, because I didn't find an elegant way of limiting the @ symbol to core-filters only. So users can define filters starting with @ now, which is not possible in jq or gojq.
  • I added a bunch of non-standard core filters that are referenced by the std implementations of the @ filters.
  • I added a bunch of specific error variants, to capture Err variants from the various libraries that I wasn't quite sure I could just unwrap().

If you still want to salvage this you can take over, or point me to ways of fixing stuff.

jaq-core/src/lib.rs Outdated Show resolved Hide resolved
Comment on lines 418 to 424
match cell {
Val::Str(s) => {
record.push_field(s);
Ok(())
}
_ => Err(Error::Str(cell.clone())),
}?;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could shorten this a lot by using cell.as_str().

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This remark of mine is probably obsolete if you consider my comment below.

Comment on lines 450 to 453
box_once(match cv.1 {
Val::Str(s) => Ok(Val::str(urlencoding::encode(&s).to_string())),
_ => Err(Error::Str(cv.1)),
})
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You frequently have this pattern where you check if the input value is a string, and if it is, then run some function on the string and return another string.
I would create a function in Val that encodes this pattern, similarly to mutate_str(). You could call it map_str, for example. For functions that can fail during this conversion, like fromuri, you could introduce another function like try_map_str().

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This remark of mine is probably obsolete if you consider my comment below.

jaq-std/src/std.jq Outdated Show resolved Hide resolved
@01mf02
Copy link
Owner

01mf02 commented Aug 10, 2023

Thanks a lot, this is pretty cool work!
It is quite funny because I was recently also pondering doing something very similar to what you have done now. In particular, like you, I came to the conclusion that it is the easiest way to just allow defining filters whose name starts with an @. So no objections from me on this front.
I also think that it is good style to salvage existing crates that perform these conversions instead of writing that stuff ourselves. That way, there are less things to get wrong.
My main objection would be your declaration of new error variants. I have started to rework the Error type in #106, by removing all error variants that are not produced in jaq-interpret itself. Until I have merged this new error type, could you encode errors by writing them into Error::Val(Val::str(..))?
To make this task easier, you could just for now copy my implementation of Error::from_str into core, from which I'll then remove it when I merge #106.

@kklingenberg
Copy link
Contributor Author

kklingenberg commented Aug 10, 2023

I think I managed to follow your pointers. However clippy was complaining about naming Error::from_str that way so I just used the name Error::from_any. Which is obscure, and worse, but now clippy doesn't complain.

EDIT: I also tried the impl<T: ToString> From<T> for Error route but that's impeded by this. A custom converting function is still required, it seems.

Comment on lines 410 to 411
v.clone()
.to_str()
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to use as_str() here and clone only conditionally.

fn stringify(v: Val) -> String {
v.clone()
.to_str()
.map_or_else(|_| v.to_string(), |s| s.to_string())
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would replace s.to_string() by something like s.clone() here, so it is clear that we are really just reusing the string that is there and not creating an escaped version of it.

@pkoppstein
Copy link

My impression (perhaps mistaken) is that this PR strays quite widely from jq's "@foo" operators.

As the jq manual says, "The @foo syntax is used to format and escape strings".

This means that one can write expressions such as:

 "@uri "https://www.google.com/search?q=\(.search)"

That is, @foo can both be used as a 0-arity filter, or as a kind of "prefix operator".

@@ -402,3 +404,82 @@ const LOG: &[(&str, usize, RunPtr, UpdatePtr)] = &[(
|_, cv| box_once(Ok(debug(cv.1))),
|_, cv, f| f(debug(cv.1)),
)];

#[cfg(feature = "format")]
fn stringify(v: Val) -> String {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would actually make this a member function of Val, because it is a functionality which will be used implicitly in the interpreter. That is, "world" | "Hello, \(.)" will use this function for interpolation.
However I would rename this function; what do you think about to_string_or_clone()? I know it's a bit longer than stringify, but at least it's a bit more descriptive and clarifies the difference with respect to to_string().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a testament to my inexperience with Rust library design, I really screwed this up w.r.t. using Rc and cloning everywhere. I tried the Cow route and the AsRef route, but failed to make it "nice" when calling the function and Val::str.

Anyway, feel free to rework it all to reduce unnecessary cloning and such.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a lot of unnecessary cloning. Don't worry, I think you're doing a good job.

}),
("@sh", 0, |_, cv| {
box_once(Ok(Val::str(
shell_escape::escape(stringify(cv.1).into()).to_string(),
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use into_owned() instead of to_string()?

)
}),
("@csv", 0, |_, cv| {
box_once(cv.1.as_arr().and_then(|vs| to_csv(vs, b',')))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test that shows what happens when the input contains commas (inside a string)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know this, but apparently commas don't need to be escaped within quotes (source: https://datatracker.ietf.org/doc/html/rfc4180#section-2). The added test reflects this.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Comment on lines 480 to 481
.map_err(Error::from_any)
.map(|s| Val::str(s.to_string()))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use map_or_else() and to_ owned()?

@@ -106,6 +106,8 @@ pub fn token() -> impl Parser<char, Token, Error = Simple<char>> {

let var = just('$').ignore_then(text::ident());

let format_filter = just('@').chain(text::ident()).collect();
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use just('@').or_not() to make this more general and then use it instead of text::ident() to define ident below?

@01mf02
Copy link
Owner

01mf02 commented Aug 16, 2023

That is, @foo can both be used as a 0-arity filter, or as a kind of "prefix operator".

Correct. This PR adds support for the first case only. However, I do already have an idea how this can be extended in the future in order to also support the second case, that is, string interpolation with custom formatting as prefix operator like in your example.
I suppose that you will be happy to hear this, given that you have been missing string interpolation so badly. :) It's not that far away anymore. However, I want to release 1.0 without string interpolation yet, given how many new features made it into this version already.

@pkoppstein
Copy link

pkoppstein commented Aug 16, 2023

@01mf02 - Thanks for your fast response, which was very welcome in several respects!

Perhaps you missed my implied point, which was to highlight the tension that is created by the
decision to allow "users [to] define filters starting with @".

That is, identifiers starting with "@" are (in jq) very special in that they can be used as prefix operators.

Of course, whatever potential confusion this might introduce in jaq can be mitigated by suitable documentation.

@01mf02
Copy link
Owner

01mf02 commented Aug 16, 2023

Perhaps you missed my implied point, which was to highlight the tension that is created by the decision to allow "users [to] define filters starting with @".

That is, identifiers starting with "@" are (in jq) very special in that they can be used as prefix operators.

I plan to keep this particular behaviour; that is, only filters whose name starts with an @ should be usable as string prefix operators.
Still, @-identifiers will be more flexible in jaq. For example, I plan to allow filters starting with an @ that take arguments, also when used in prefix form. I do not expect that this will be used a lot, but it would actually complicate the implementation more to prohibit it, so I don't see a good reason to disallow it.

Of course, whatever potential confusion this might introduce in jaq can be mitigated by suitable documentation.

I agree.

@@ -402,3 +404,76 @@ const LOG: &[(&str, usize, RunPtr, UpdatePtr)] = &[(
|_, cv| box_once(Ok(debug(cv.1))),
|_, cv, f| f(debug(cv.1)),
)];

#[cfg(feature = "format")]
fn to_csv(vs: &[Val], delimiter: u8) -> ValR {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make this function return a string result instead of a value result? In general, when a function returns only one kind of value, I return this directly instead of wrapping it in a value.

Comment on lines 62 to 64
Ok(Val::str(
s.chars().skip(skip).take(take).collect::<String>(),
))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem to be related to formatters?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's because you modified the str function.
I think that accepting parameters implementing Into should be done quite sparingly, as it can require more type annotations, like here. May I ask you for your motivation for having changed the signature of the function?

Comment on lines 58 to 59
pub fn str(s: impl Into<String>) -> Self {
Self::Str(Rc::new(s.into()))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry, but I do not like this change because it makes the string function different from all other similar functions like arr, obj and so on. This might be useful, but only if all functions are changed accordingly, which I would however not do as part of this PR. Let's keep it small.

/// it to string.
pub fn to_string_or_clone(self) -> String {
match self {
Self::Str(s) => s.to_string(),
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I would use to_owned().
Otherwise, this will make people (like me ^^) reading this wonder whether this will escape the string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have to go with (*s).clone() since to_owned() over an Rc yields another Rc. I think.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds good to me!

Comment on lines 113 to 116
let ident = just('@')
.or_not()
.chain::<char, String, _>(text::ident())
.collect()
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To reduce the size of this change, could you terminate the definition of ident after collect(), then redefine ident using the previously defined variable ident?

@kklingenberg
Copy link
Contributor Author

So, about map_or_else: do you prefer it to map_err().map() even when the closures in map_or_else would produce Err and Ok values respectively? It somehow seems wrong to me, but I'd rather keep it consistent with your preferences.

@01mf02
Copy link
Owner

01mf02 commented Aug 16, 2023

So, about map_or_else: do you prefer it to map_err().map() even when the closures in map_or_else would produce Err and Ok values respectively? It somehow seems wrong to me, but I'd rather keep it consistent with your preferences.

That's nice of you to ask. :) I was actually thinking about exactly this question a few minutes ago. I think I prefer the option that produces fewer characters of code. I think that indeed, here map_err + map might be better.

@kklingenberg
Copy link
Contributor Author

kklingenberg commented Aug 17, 2023

Amazingly, it took me this long to check de official definition of the @sh filter (from the manual):

@sh:
The input is escaped suitable for use in a command-line for a POSIX shell. If the input is an array, the output will be a series of space-separated strings.

Also there's this PR in jq: jqlang/jq#2828

So possibly that one requires a heavy rewrite. Perhaps it's best to wait for jq's PR to land?

@01mf02
Copy link
Owner

01mf02 commented Aug 18, 2023

@kklingenberg, I have a few comments regarding this PR:

  • I made a few formatters available without the “format” feature gate. This is because I use feature gates in jaq exclusively to reduce the number of external dependencies.
  • I rewrote a few tests to use strings of the format r#”...”#. This is really important to respect because it simplifies testing the tests with jq by copy-pasting them into the command line. Tests that are hard to test are less likely to be tested.
  • Speaking of the tests, if the behavior of jaq diverges from that of jq, this has to be documented. In particular, the HTML, CSV, and TSV behavior diverged in several ways, which I have now tried to remedy. Please make sure to compare your remaining implementations with jq; see @sh, but also @uri could be tested with more input.
  • Speaking of @sh, I would implement the behavior in jq 1.6, that is, special casing for arrays and rejection of objects.
  • You added a new filter @urid; this should probably be documented.

@kklingenberg
Copy link
Contributor Author

I thank you for all the effort you put into fixing this; I expected it would be messy from the start. It probably ended up being more work to guide me through this, than to simply implement it yourself.

I tried to emulate the behaviour of jq's @sh, but the library I used diverges from jq in that it doesn't escape what it doesn't need to, and escapes some symbols more aggresively (e.g. !). I also removed the non-standard @urid filter, which I had Initially pulled from gojq's HEAD.

Also about @sh, another thing that worries me is that the library has different behaviour for unix or windows: https://docs.rs/shell-escape/latest/src/shell_escape/lib.rs.html#17-23. I believe this may cause further divergence from jq.

Anyway, I hereby yield control of this PR, as I don't have the required motivation to continue. Thanks again!

@01mf02
Copy link
Owner

01mf02 commented Aug 22, 2023

I thank you for all the effort you put into fixing this; I expected it would be messy from the start. It probably ended up being more work to guide me through this, than to simply implement it yourself.

Thank you for having attacked this functionality. Even if it took some of my time indeed, it is important for me that other people than me become familiar with my philosophy of working on jaq, and I hope that you learned something from working on this.

I tried to emulate the behaviour of jq's @sh, but the library I used diverges from jq in that it doesn't escape what it doesn't need to, and escapes some symbols more aggresively (e.g. !).
Also about @sh, another thing that worries me is that the library has different behaviour for unix or windows: https://docs.rs/shell-escape/latest/src/shell_escape/lib.rs.html#17-23. I believe this may cause further divergence from jq.

Platform-specific output is a no-go. It is also not what jq does, quoth the jq documentation: "The input is escaped suitable for use in a command-line for a POSIX shell."
In the end, I found that jq only escapes ' characters, which was sufficiently easy to do without any supporting library.

I also removed the non-standard @urid filter, which I had Initially pulled from gojq's HEAD.

That's good.

Anyway, I hereby yield control of this PR, as I don't have the required motivation to continue. Thanks again!

Thanks to you too!

@01mf02 01mf02 merged commit 69a2d1d into 01mf02:main Aug 22, 2023
1 check passed
@kklingenberg kklingenberg deleted the feature/formatters branch August 28, 2023 15:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants