Skip to content

Conversation

@jeanprbt
Copy link
Contributor

@jeanprbt jeanprbt commented Apr 15, 2024

This PR modifies the formatter HTMLChars used alongside the formatter Markdown to avoid the useless swallow of chars { and }. It adapts tests accordingly and adds RemoveBrackets formatter in relevant *.layout files to maintain previous behavior for non-Markdown fields.
See this comment.

  • Adapt tests
  • Adapt *.layout files

Closes #10928

Now, Markdown rendering keeps braces, see following screenshot.

braces_kept

Mandatory checks

  • Change in CHANGELOG.md described in a way that is understandable for the average user (if applicable)
  • Tests created for changes (if applicable)
  • Manually tested changed features in running JabRef (always required)
  • Screenshots added in PR description (for UI changes)
  • Checked developer's documentation: Is the information available and up to date? If not, I outlined it in this pull request.
  • Checked documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request to the documentation repository.

@jeanprbt
Copy link
Contributor Author

jeanprbt commented Apr 15, 2024

@koppor, in your comment you say the following.

Thinking more, the LaTeX command handling in HTMLChars should be removed. In the .layout files, LaTeXToUnicode should be used. - I think, HTMLChars will be much smaller then. HTML5 default encoding is unicode, thus LaTeXToUnicode is the right formatter.

Do you want me to also cover this in this current PR, or does this deserve another one ?

Also, the original issue mentions that Markdown rendering also swallows new lines, but removing the following lines in MardownFormatter solves the problem.

// workaround HTMLChars transforming "\n" into <br> by returning a one liner
        return html.replaceAll("\\r\\n|\\r|\\n", " ").trim();

The only drawback is that it seems to add more newlines than necessary, which may come from the external parser used. Do we keep the old version with no newlines ?

newlines

Copy link
Member

@koppor koppor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small comments regarding the test

Comment on lines 113 to 109
assertEquals("&#39;", layout.format("{\\textquotesingle}"));
assertEquals("&#39;", layout.format("\\textquotesingle"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code change is OK. Reason: { is not used as parameter for a LaTeX command.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the existing layouter is tested now, please add this back.

@koppor
Copy link
Member

koppor commented Apr 15, 2024

  • Please also add a test case to org.jabref.logic.layout.format.MarkdownFormatterTest so that it is really seen that this works.

@koppor
Copy link
Member

koppor commented Apr 15, 2024

Thinking more, the LaTeX command handling in HTMLChars should be removed. In the .layout files, LaTeXToUnicode should be used. - I think, HTMLChars will be much smaller then. HTML5 default encoding is unicode, thus LaTeXToUnicode is the right formatter.

Do you want me to also cover this in this current PR, or does this deserve another one ?

Since you went deep into fixing HTMLChars, that update can be done in a next PR. It also more about experimentation. --> Test cases in org.jabref.logic.layout.format.MarkdownFormatterTest are needed to ensure that the functionality is unchanged after rewriting HTMLChars.

@koppor
Copy link
Member

koppor commented Apr 15, 2024

The only drawback is that it seems to add more newlines than necessary,

Without going deep into the code, I cannot follow. Can you add test cases to org.jabref.logic.layout.format.MarkdownFormatterTest so that the current behavior is shown? Then, the new behavior can also be shown in test cases.

Remember that modern Java allows to use """ for multiline string constants (including newlines) which can come handy here.

@jeanprbt
Copy link
Contributor Author

Small comments regarding the test

This is fixed now.

  • Please also add a test case to org.jabref.logic.layout.format.MarkdownFormatterTest so that it is really seen that this works.

I should have thought about this before, my bad.

Since you went deep into fixing HTMLChars, that update can be done in a next PR. It also more about experimentation. --> Test cases in org.jabref.logic.layout.format.MarkdownFormatterTest are needed to ensure that the functionality is unchanged after rewriting HTMLChars.

Ok ! I'll do this in a separate PR then. You're right it will be experimentation, since I don't really see the border between HTMLChars and LaTeXToUnicode formatters at the moment. I added many test cases to org.jabref.logic.layout.format.MarkdownFormatterTest to ensure functionnality will be unchanged in next PR.

@jeanprbt
Copy link
Contributor Author

Without going deep into the code, I cannot follow. Can you add test cases to org.jabref.logic.layout.format.MarkdownFormatterTest so that the current behavior is shown? Then, the new behavior can also be shown in test cases.

Current behavior is shown using the following test.

 @Test
    void formatWhenFormattingComplexMarkupThenReturnsOnlyOneLine() {
        assertFalse(markdownFormatter.format("Markup\n\n* list item one\n* list item 2\n\n rest").contains("\n"));
    }

The formatter only removes every new line.

@jeanprbt
Copy link
Contributor Author

Now I have to adapt every *.layout files, adding the formatter RemoveBrackets for non-Markdown fields. Will be done in a couple of days max.

@koppor
Copy link
Member

koppor commented Apr 16, 2024

Ok ! I'll do this in a separate PR then. You're right it will be experimentation, since I don't really see the border between HTMLChars and LaTeXToUnicode formatters at the moment.

Yeah, you are right. - I first thought LaTeXToUnicode is executed first and then some Unicode-to-HTML conversion. italics unicode back to <em>. However, this might be too complicated and the current approach is OK - even though LaTeX parsing is doubled. -- Maybe, it would make sense to use SnuggleTeX to parse LaTeX and do the conversion based on the DOM tree? (See org.jabref.logic.integrity.LatexIntegrityChecker for some code ideas).

However, LaTeX has a stable syntax, maybe, we leave everything as is.

@koppor
Copy link
Member

koppor commented Apr 16, 2024

Current behavior is shown using the following test.

Not quote. Please use assertEquals. Left: The new string, right: the original stirng.

USe

  String input = """
     test
  
     some more
     """

This is more readable for the human reader than reading \n.

@koppor
Copy link
Member

koppor commented Apr 16, 2024

Now I have to adapt every *.layout files, adding the formatter RemoveBrackets for non-Markdown fields. Will be done in a couple of days max.

This is a nightmare for backward compatiblity :)

Thinking aloud:

  • When using latex2unicode, these {} are removed. Therefore maybe the idea to use that in htmlchars. However, not wanted for markdown
  • The htmlchars converter could be parameterized to keep braces. Depending on the mode, it could behave differently. (enum BraceHandling: KEEP, REMOVE)
  • maybe the previewer can parameterize the htmlchars with KEEP - and the exporter using layout parameteirzed using REMOVE

OK, the default preview of JabRef needs to be adapted maybe... The normal latex2unicode should be able to handle most cases, thus I think, there probably is no adaption necessary?

@jeanprbt
Copy link
Contributor Author

jeanprbt commented Apr 17, 2024

Yeah, you are right. - I first thought LaTeXToUnicode is executed first and then some Unicode-to-HTML conversion. italics unicode back to \<em\>. However, this might be too complicated and the current approach is OK - even though LaTeX parsing is doubled. -- Maybe, it would make sense to use SnuggleTeX to parse LaTeX and do the conversion based on the DOM tree? (See org.jabref.logic.integrity.LatexIntegrityChecker for some code ideas).

However, LaTeX has a stable syntax, maybe, we leave everything as is.

I agree with you. Since the current syntax is working properly and changes are not simply LaTeXToUnicode and then Unicode-to-HTML, but far more complex, we can leave everything as is. The issue about Markdown rendering with braces is now solved. In any case, perhaps the thymeleaf template engine will fill all these gaps this summer ? ;)

Not quote. Please use assertEquals. Left: The new string, right: the original string.

Use

  String input = """
     test
  
     some more
     """

This is more readable for the human reader than reading \n.

Understood, I implemented a test with multiline quotes in MarkdownFormatterTest.

@jeanprbt
Copy link
Contributor Author

jeanprbt commented Apr 17, 2024

This is a nightmare for backward compatiblity :)

Thinking aloud:

  • When using latex2unicode, these {} are removed. Therefore maybe the idea to use that in htmlchars. However, not wanted for markdown
  • The htmlchars converter could be parameterized to keep braces. Depending on the mode, it could behave differently. (enum BraceHandling: KEEP, REMOVE)
  • maybe the previewer can parameterize the htmlchars with KEEP - and the exporter using layout parameteirzed using REMOVE

OK, the default preview of JabRef needs to be adapted maybe... The normal latex2unicode should be able to handle most cases, thus I think, there probably is no adaption necessary?

Your idea of an enum BraceHandling for HTMLChars is great, I love it. What I can do is modify HTMLChars to support both modes (i.e. remove braces or not), the default being REMOVE. I will only have to modify the preview to use the KEEP mode, which allows backwards compatibility in avoiding modifying all *.layout files. WDYT ?

@koppor
Copy link
Member

koppor commented Apr 17, 2024

Your idea of an enum BraceHandling for HTMLChars is great, I love it. What I can do is modify HTMLChars to support both modes (i.e. remove braces or not), the default being REMOVE.

Add a constructor accepting the mode. The default constructor calls the other one with REMOVE.

Then add some ifs in the code covering the two cases. - The places should be easy to find as they are the places of your modifications.

Modifying the tests will be more hard, but required, too.

Note: I am aware this is not OO style at all. However, rewriting in OO-style would make the class even more complex and not help IMHO.

I will only have to modify the preview to use the KEEP mode, which allows backwards compatibility in avoiding modifying all *.layout files. WDYT ?

Yes!

@jeanprbt jeanprbt marked this pull request as ready for review April 19, 2024 17:22
@jeanprbt jeanprbt requested a review from koppor April 21, 2024 17:59
Comment on lines 113 to 109
assertEquals("&#39;", layout.format("{\\textquotesingle}"));
assertEquals("&#39;", layout.format("\\textquotesingle"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the existing layouter is tested now, please add this back.

}

@Test
void formatWhenFormattingHeaderThenReturnsHeaderInHtml() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could convert all of them to ParameterizedTests!

All the tests seems to test some markdown-to-html functionality, which JabRef does not implement by itself, but through a library.

I would remove them and keep formatWhenFormattingCodeBlockThenReturnsCodeBlockInHtml() { only.

And then add some tests with empty lines to see how the converter deals with blocks. Are they also removed - or is only the code block the issue?

What about code blocks "recommended" by MD031 of markdown-lint?

"""
   First line

   Second line

   ```java
   String test;
   ```
"""

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After having written some tests I get what's going on about new line formatting. The current formatter uses HTML <p> tags for each line. By default, HTML inserts a new line between each <p> tag, which is why in MarkdownFormatter we remove the \n before returning formatted strings, to avoid jumping two lines instead of one. But in <pre> tags which include the code in Mardown, there are no such <p> tags to indicate the new lines. When JabRef removes the \n before returning the final string, we lose track of those new lines in code blocks.

@jeanprbt jeanprbt requested a review from koppor April 22, 2024 12:53
…e equality for braces keep mode in HTMLChars
@jeanprbt jeanprbt requested a review from koppor April 22, 2024 14:27
@jeanprbt jeanprbt requested a review from koppor April 22, 2024 17:38
@koppor koppor added this pull request to the merge queue Apr 23, 2024
Merged via the queue into JabRef:main with commit dbd9257 Apr 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Markdown rendering should keep {}

2 participants