Is it possible to write back code block backtick count to original? #20

yumetodo · 2020-08-04T13:35:59Z

After #17, #18, code block backtick count is configurable. However, there is another problem.

Now let's consider such senario like below:

ex.) https://github.com/yumetodo/markdown_img_url_editor_rust/blob/master/src/lib.rs#L74

Parse markdown file with pulldown-cmark
Edit somewhare.
Write back to markdown using pulldown-cmark-to-cmark

In this case, there is nothing to emulate block backtick count.

As a result, currently, when just bypass pulldown-cmark to pulldown-cmark-to-cmark, input and output is not equal.

Byron · 2020-08-05T05:10:40Z

I think this is closely related to #16 - maybe you can join the people who start talking to pulldown-cmark to make not degenerating information a priority.
Otherwise there is no way to tell how many backticks were parsed.

yumetodo · 2020-08-05T15:18:17Z

@Byron Do you mean that this is pulldown-cmark's matter?
What information is lacked after passing the pulldown-cmark event system to resolve this issue?
Without the information, I cannot create an issue at the pulldown-cmark.

Byron · 2020-08-06T09:02:07Z

This means you can try for yourself how pulldown-cmark maintains information about backticks. If it does indeed not maintain any, then there is nothing we can do here.

yumetodo · 2020-08-08T06:34:18Z

I use a5f644a . I understand that the Fenced event doesn't provide backtick count.

(BTW, why the result of cargo run --example stupicat -- sample.md | pulldown-cmark is broken? Where is the last ````?

$git clone https://github.com/Byron/pulldown-cmark-to-cmark.git
Cloning into 'pulldown-cmark-to-cmark'...

remote: Enumerating objects: 11, done.
remote: Counting objects: 100% (11/11), done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 347 (delta 2), reused 7 (delta 2), pack-reused 336
Receiving objects: 100% (347/347), 86.72 KiB | 275.00 KiB/s, done.
Resolving deltas: 100% (200/200), done.

$cd markdown_img_url_editor_rust/

$cargo install pulldown-cmark
    Updating crates.io index
  Downloaded pulldown-cmark v0.7.2
  Downloaded 1 crate (102.7 KB) in 1.21s
  Installing pulldown-cmark v0.7.2
  Downloaded unicase v2.6.0
  Downloaded memchr v2.3.3
  Downloaded unicode-width v0.1.8
  Downloaded bitflags v1.2.1
  Downloaded version_check v0.9.2
  Downloaded getopts v0.2.21
  Downloaded 6 crates (110.0 KB) in 0.97s
   Compiling version_check v0.9.2
   Compiling bitflags v1.2.1
   Compiling memchr v2.3.3
   Compiling pulldown-cmark v0.7.2
   Compiling unicode-width v0.1.8

   Compiling getopts v0.2.21
   Compiling unicase v2.6.0
    Finished release [optimized] target(s) in 33.27s
  Installing /home/yumetodo/.cargo/bin/pulldown-cmark
   Installed package `pulldown-cmark v0.7.2` (executable `pulldown-cmark`)

$cat sample.md
`````markdown
````markdown
```typescript
console.log("arikitari na sekai");
```
````
`````

$cat sample.md | pulldown-cmark -e
0..90: Start(CodeBlock(Fenced(Borrowed("markdown"))))
14..85: Text(Borrowed("````markdown\n```typescript\nconsole.log(\"arikitari na sekai\");\n```\n````\n"))
0..90: End(CodeBlock(Fenced(Borrowed("markdown"))))
EOF

$cargo run --example stupicat -- sample.md | pulldown-cmark
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target/debug/examples/stupicat sample.md`
<pre><code class="language-markdown">````markdown
```typescript
console.log(&quot;arikitari na sekai&quot;);
```
</code></pre>
<pre><code></code></pre>

Byron · 2020-08-08T13:41:27Z

(BTW, why the result of cargo run --example stupicat -- sample.md | pulldown-cmark is broken? Where is the last ````?

I don't know either, but I don't entirely know what the examples are supposed to show. That is, I don't understand what's expected, and what you think should be done.

As this issue is about backtick counts and it's clear that pulldown-cmark doesn't keep them, I would conclude there is nothing we can do here.

If the example above indicates another bug in pulldown-cmark-to-cmark, I think this should move to a new issue if there is interest in getting it fixed.

yumetodo · 2020-08-08T14:42:29Z

OK. I created pulldown-cmark/pulldown-cmark#461 so that I will wait that responce.

yumetodo · 2020-08-24T03:35:31Z

pulldown-cmark's collaborator says that count it by yourself using Event::Start's range information.

if let (Event::Start(Tag::CodeBlock(CodeBlockKind::Fenced(..))), range) = event {
    let num_backticks = source_text[range].chars().take_while(|c| c == '`').count();
}

Byron · 2020-09-01T02:38:58Z

Fantastic. With this it is certainly possible to make the requested change.
I think it should be quite straightforward.

yumetodo · 2020-09-02T05:23:51Z

Of course, they are not always equal between begin count and end count of code block backtick like below. And also, tilde(~) is allowed as a code block fence char.

```text
aaa
````

lambda-fairy · 2021-01-17T08:53:44Z

Won't this require changing the API? The cmark function currently has no access to the original source text.

Byron · 2021-01-18T12:40:49Z

Good point! I think it could be done by optionally passing the input buffer as part of the Options. There using the range information should be a breeze.

yumetodo · 2021-01-18T16:23:15Z

In that case, it's better to get the diff between the event stream created from the original input text and the passed event stream...

Byron · 2021-01-18T23:07:11Z

That sounds like a post-process entirely unrelated to this crate, or am I missing something?
Maybe knowing more about this would help to understand why you would discard a proposal which to my mind would allow using ranges as suggested by a pulldown-cmark maintainer.

yumetodo · 2021-01-19T01:46:45Z

I'm concern that the index information in the range is valid when the passed event stream is modified from the original.
pulldown-cmark maintainer's suggestion code requests source text that corresponds one-to-one. That is what @lambda-fairy pointed out, as far as I understand.

lambda-fairy · 2021-01-19T02:06:27Z

Have we considered dynamically setting the number of backticks based on the contents of the code block? That will require buffering those contents, but will avoid the need to pass in source text.

The two main use cases I am thinking of are (a) mdbook preprocessors, and (b) Markdown code formatters. (I'm interested in the latter.) In either case there is no need to preserve the original input exactly – only to ensure that the result parses in the same way.

My concern with the proposed solution is that the type system doesn't enforce that the given text and event stream correspond to each other. It's too easy to either mess up the ranges in the event stream, or pass in an unrelated string, both of which can lead to subtle bugs.

Byron · 2021-01-19T06:01:34Z

@yumetodo

I'm concern that the index information in the range is valid when the passed event stream is modified from the original.
pulldown-cmark maintainer's suggestion code requests source text that corresponds one-to-one.

@lambda-fairy

My concern with the proposed solution is that the type system doesn't enforce that the given text and event stream correspond to each other. It's too easy to either mess up the ranges in the event stream, or pass in an unrelated string, both of which can lead to subtle bugs.

Absolutely, the provided input buffer must match what's actually parsed, and it's something the programmer has to assure. That said, I would entrust then to be able to do that or notice very quickly that something is wrong. In my understanding this crate should do as much if not all of the 'heavy lifting', but I can only acknowledge that it might be a different story when actually implementing it.

Have we considered dynamically setting the number of backticks based on the contents of the code block? That will require buffering those contents, but will avoid the need to pass in source text.

This sounds like a heuristic to me that is based on the observed code block? Maybe this could be elaborated so we are on the same page.

Just to sum up what I got so far as options to solve it this issue:

pass in a source buffer and use the provided ranges to read the portions of the source we are interested in
diff source and output, or more generally, apply a post process (probably to be implemented in a different crate, can be part of this repo though)
Set the number of backticks based on the contents of the code block (needs elaboration)

Something I would like to highlight in parting is a comment of @lambda-fairy as it helps me to see the value behind this issue:

In either case there is no need to preserve the original input exactly – only to ensure that the result parses in the same way.

This is something that ideally is easy because the input events are made so that they don't degenerate information, but it turns out that's not a goal of the input parser. This crate could be the remedy and I would love to find a general purpose solution to this issue. To me, option 1 is able to do it for sure and provides all information needed to apply more fixes in future, too.

I would hope we could get to a point where any of the options really is implemented as proof of concept so we can eliminate those who might not work in practice and make progress towards resolving this issue.

yumetodo · 2021-01-19T07:48:42Z

When I first created this issue, I was thinking of a scenario where pulldown-cmark would parse it, edit some of it, and pulldown-cmark-to-cmark would convert it back to markdown(yumetodo/markdown_img_url_editor#40 (Written in Japanese)).

However, now I got it that is too difficult to implement. For such a use case, we should use Parser::into_offset_iter() and not by this issue.

Now I agree to reduce the scope of this issue to the two main use cases @lambda-fairy says.

Byron · 2021-01-21T04:39:43Z

I have created a sketch, see #22, to show how this could work using the 'Range and buffer' method. This is the only way I would know how to do what's needed for this issue, and in future.

Maybe those who have other ideas can sketch them and submit PRs so we all know what we are talking about.

I am particularly interested in learning what @yumetodo thinks about the PR (please feel free to comment there) as this supports exactly what they have done.

willcrichton · 2022-06-24T17:47:26Z

As another instance of this issue, I am writing an mdBook preprocessor using pulldown-cmark-to-cmark for rust-lang/book. The Rust book uses double-backticks for inline code to escape single-backticks, such as:

here we embed ``an error `message` <- with a backtick``

Which correctly renders as:

here we embed an error `message` <- with a backtick

However, this package will replace the double-backtick with a single backtick, producing:

here we embed `an error `message` <- with a backtick`

Which incorrectly renders as:

here we embed an error message <- with a backtick

willcrichton · 2022-06-24T18:09:46Z

Update: I implemented a fix in willcrichton/pulldown-cmark-to-cmark, although I am not confident in whether it satisfies all the possible configurations for this crate.

Byron · 2022-06-28T02:10:16Z

Update: I implemented a fix in willcrichton/pulldown-cmark-to-cmark, although I am not confident in whether it satisfies all the possible configurations for this crate.

That's fantastic, thanks a lot!

I pulled the changes and saw a test was added for validation, and besides that, the spec-test is up 2 tests as well to 425/649 passed! I'd think that's a very measurable improvement worth a new release in case it helps.

mgeisler · 2023-12-22T13:33:18Z

Update: I implemented a fix in willcrichton/pulldown-cmark-to-cmark, although I am not confident in whether it satisfies all the possible configurations for this crate.

Hi @willcrichton, I believe the case of inline code with backticks has been fixed by #56.

As a result, currently, when just bypass pulldown-cmark to pulldown-cmark-to-cmark, input and output is not equal.

@yumetodo, I'm just another user of the library — so this is just my opinion 😄

I would not expect the output to be equal after parsing and serializing back to Markdown since the Markdown AST might not preserve all information. However, I would expect it to be equal if you parse it again and serialize to Markdown.

I did some experiments on this earlier this year in #55. That lead me to find a few corner cases in this library and also in pulldown-cmark.

dalance · 2024-01-24T03:12:28Z

Set the number of backticks based on the contents of the code block (needs elaboration)

I'm interesting in this approach.
In google/mdbook-i18n-helpers#129, I wrote the following code to count backticks in code block:

fn check_code_block_token_count(group: &[(usize, Event)]) -> usize {
    let events = group.iter().map(|(_, event)| event);
    let mut in_codeblock = false;
    let mut max_token_count = 0;

    // token_count should be taken over Text events
    // because a continuous text may be splitted to some Text events.
    let mut token_count = 0;

    for event in events {
        match event {
            Event::Start(Tag::CodeBlock(_)) => {
                in_codeblock = true;
                token_count = 0;
            }
            Event::End(Tag::CodeBlock(_)) => {
                in_codeblock = false;
                token_count = 0;
            }
            Event::Text(x) if in_codeblock => {
                for c in x.chars() {
                    if c == '`' {
                        token_count += 1;
                    } else {
                        max_token_count = std::cmp::max(max_token_count, token_count);
                        token_count = 0;
                    }
                }
                max_token_count = std::cmp::max(max_token_count, token_count);
            }
            _ => token_count = 0,
        }
    }
    if max_token_count < 3 {
        // default code block token is "```" which is 3
        3
    } else {
        // If there is "```" in codeblock, codeblock token should be extended.
        max_token_count + 1
    }
}

I think a code like the above may be useful if it is placed in this crate.
For example, the following designs can be considered.

Add code_block_token_count as a free standing function which can be used like below

let code_block_token_count = pulldown_cmark_to_cmark::code_block_token_count(events);
let options = Options {
    code_block_token_count,
    ..Options::default()
};
pulldown_cmark_to_cmark::cmark_resume_with_options(events, &mut markdown, state, options);

Add a token counting code which is used when an explicit token count is not provided to cmark* function

let options = Options {
    code_block_token_count: None, // "None" means the appropriate count is calcurated
                                  // in cmark_resume_with_options
    ..Options::default()
};
pulldown_cmark_to_cmark::cmark_resume_with_options(events, &mut markdown, state, options);

How about these ideas?

Byron · 2024-01-24T06:18:48Z

Thanks for the writeup and for sharing!

The way I understand this is that the count of backticks depends on the nesting level of code blocks? If so, the proposed code also wouldn't reproduce the exact count of backticks as there can be different nesting levels in the provided events?

In any case, if you think having more support for this would help here, even if its scope is limited, then adding it to the crate should be useful. However, I think it should be done in a backward compatible way that also doesn't affect performance. Thus is sounds like option 1 (freestanding function) would be the way to go.

dalance · 2024-01-24T08:01:41Z

The way I understand this is that the count of backticks depends on the nesting level of code blocks? If so, the proposed code also wouldn't reproduce the exact count of backticks as there can be different nesting levels in the provided events?

In my understanding, fenced code block can't be nested.
(Refer: https://spec.commonmark.org/0.30/#fenced-code-blocks)

For example, I think the following code block is not a nested code block but a single code block which have "````markdown\n```python\n```\n````" as a plane text.
Therefore, an appropriate token count can be judged by counting continuous backticks in the plane text.

`````markdown
````markdown
```python
```
````
`````

My goal is not "reproducing the original backtick count" but "generating a valid markdown".
My code counts the longest backticks in the events, so some of the generated backticks may be too long, but it is a valid markdown.

Original markdown

`````markdown
````markdown
```python
```
````
`````

````markdown
```rust
```
````

```c
```

Generated markdown by cmark and code_block_token_count function (In this case, code_block_token_count will be 5)

`````markdown
````markdown
```python
```
````
`````

`````markdown
```rust
```
`````

`````c
`````

Byron · 2024-01-24T08:07:27Z

Thanks for the clarification and the examples, I think we are on the same page now.

Then it's a definitive "Yes," please submit a PR with a new free function that helps to set the correct backtick count based on a pre-scan of all known events.

The code_block_token_count (which is confusingly named given that it is about backticks in code-blocks) documentation could also be adjusted to refer to the new function, whose own documentation could make clear what it does (maybe even by reusing some of the explanation used in this thread).

dalance · 2024-01-24T09:44:53Z

Thank you for your suggestion.
I've opened #65.

dalance · 2024-01-25T09:26:27Z

@Byron
Sorry. I missed the problem of API design.
The API of #65 doesn't have the way to specify default backticks count.
In the current impl, it refers the hard-coded DEFAULT_CODE_BLOCK_TOKEN_COUNT.

So even If we want to set backtick count to 3, count_code_block_tokens surely returns 4 or more than 4.
In other word, when the function returns 4, we can't identify the following two cases.

There is no backtick in code block, so codeblock_token_count can be set to 3.
There is 3 consective backticks in code block, so codeblock_token_count can't be set to 3, it must be 4 or more than.

I think two solutions, but both are breaking change.

Pass default count to count_code_block_tokens

let code_block_token_count = count_code_block_tokens(events, 3);

Return Option<usize> from count_code_block_tokens

let code_block_token_count = count_code_block_tokens(events).unwrap_or(3);

dalance · 2024-01-25T11:22:16Z

Alternatively the default value can be changed to 3.
3 is the minimum value allowed in spec.

lf user want to use 4, the follow code is available.

let code_block_token_count = count_code_block_tokens(events);
let code_block_token_count = std::cmp::max(code_block_token_count, 4);

This is not breaking change.

Byron · 2024-01-26T09:49:00Z

Thanks for bringing this up, and sorry for not catching it. I already yanked the previous release so the fix isn't breaking.

My preference is to go with Option<usize> as return value, it seems most idiomatic and flexible.

Thanks again for your help.

dalance · 2024-01-27T02:56:09Z

Thank you.
I've opened #66.

yumetodo changed the title ~~Is it possible to~~ Is it possible to write back code block backtick count to original? Aug 4, 2020

yumetodo mentioned this issue Aug 4, 2020

コードブロックを正確に書き戻せていない問題 yumetodo/markdown_img_url_editor#40

Closed

yumetodo mentioned this issue Aug 8, 2020

How should I get backtick count of fenced code block? pulldown-cmark/pulldown-cmark#461

Closed

Byron added good first issue Good for newcomers help wanted Extra attention is needed labels Sep 1, 2020

Byron mentioned this issue Jan 21, 2021

Range support #22

Closed

Veykril mentioned this issue Dec 17, 2021

Intra-doc links using Markdown references get rendered incorrectly rust-lang/rust-analyzer#11008

Closed

barafael mentioned this issue Sep 24, 2023

Option to preserve list_token from input #59

Closed

mgeisler mentioned this issue Dec 22, 2023

Fix codeblock in codeblock google/mdbook-i18n-helpers#129

Merged

dalance mentioned this issue Jan 24, 2024

Add check_code_block_token_count function #65

Merged

mgeisler mentioned this issue Jan 24, 2024

group_events fuzz test is flaky google/mdbook-i18n-helpers#150

Closed

dalance mentioned this issue Jan 27, 2024

Change the return type of count_code_block_tokens #66

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Is it possible to write back code block backtick count to original? #20

Is it possible to write back code block backtick count to original? #20

yumetodo commented Aug 4, 2020 •

edited

Byron commented Aug 5, 2020

yumetodo commented Aug 5, 2020

Byron commented Aug 6, 2020

yumetodo commented Aug 8, 2020

Byron commented Aug 8, 2020

yumetodo commented Aug 8, 2020

yumetodo commented Aug 24, 2020

Byron commented Sep 1, 2020

yumetodo commented Sep 2, 2020 •

edited

lambda-fairy commented Jan 17, 2021

Byron commented Jan 18, 2021

yumetodo commented Jan 18, 2021

Byron commented Jan 18, 2021

yumetodo commented Jan 19, 2021

lambda-fairy commented Jan 19, 2021

Byron commented Jan 19, 2021 •

edited

yumetodo commented Jan 19, 2021 •

edited

Byron commented Jan 21, 2021

willcrichton commented Jun 24, 2022 •

edited

willcrichton commented Jun 24, 2022

Byron commented Jun 28, 2022

mgeisler commented Dec 22, 2023

dalance commented Jan 24, 2024

Byron commented Jan 24, 2024

dalance commented Jan 24, 2024

Byron commented Jan 24, 2024

dalance commented Jan 24, 2024

dalance commented Jan 25, 2024

dalance commented Jan 25, 2024

Byron commented Jan 26, 2024

dalance commented Jan 27, 2024

Is it possible to write back code block backtick count to original? #20

Is it possible to write back code block backtick count to original? #20

Comments

yumetodo commented Aug 4, 2020 • edited

Byron commented Aug 5, 2020

yumetodo commented Aug 5, 2020

Byron commented Aug 6, 2020

yumetodo commented Aug 8, 2020

Byron commented Aug 8, 2020

yumetodo commented Aug 8, 2020

yumetodo commented Aug 24, 2020

Byron commented Sep 1, 2020

yumetodo commented Sep 2, 2020 • edited

lambda-fairy commented Jan 17, 2021

Byron commented Jan 18, 2021

yumetodo commented Jan 18, 2021

Byron commented Jan 18, 2021

yumetodo commented Jan 19, 2021

lambda-fairy commented Jan 19, 2021

Byron commented Jan 19, 2021 • edited

yumetodo commented Jan 19, 2021 • edited

Byron commented Jan 21, 2021

willcrichton commented Jun 24, 2022 • edited

willcrichton commented Jun 24, 2022

Byron commented Jun 28, 2022

mgeisler commented Dec 22, 2023

dalance commented Jan 24, 2024

Byron commented Jan 24, 2024

dalance commented Jan 24, 2024

Byron commented Jan 24, 2024

dalance commented Jan 24, 2024

dalance commented Jan 25, 2024

dalance commented Jan 25, 2024

Byron commented Jan 26, 2024

dalance commented Jan 27, 2024

yumetodo commented Aug 4, 2020 •

edited

yumetodo commented Sep 2, 2020 •

edited

Byron commented Jan 19, 2021 •

edited

yumetodo commented Jan 19, 2021 •

edited

willcrichton commented Jun 24, 2022 •

edited