Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor baseUrl into variablePreprocessor #1239

Merged
merged 2 commits into from
Jun 26, 2020

Conversation

ang-zeyu
Copy link
Contributor

@ang-zeyu ang-zeyu commented Jun 10, 2020

What is the purpose of this pull request? (put "X" next to an item, remove the rest)

• [x] Other, please explain: More refactoring work

Resolves #1236 Resolves #907 as well

What is the rationale for this request?
Less code, moderately (10%) better performance, more cohesion

essentially a smaller follow up to the variable refactoring work, although I ended up needing to resolve the xmlMode issues as well (not much changes on the code side though) 😞

What changes did you make? (Give an overview)
Moved baseUrl rendering into variablePreprocessor, removing the need for an entirely separate stage of baseUrl processing which requires edge case baseUrl resolving like in #1088.

Is there anything you'd like reviewers to focus on?
the changes in the test output.

there are a few blank <p></p> addition / removals; These are expected, and will be annotated.

the 2103 site has a lot of similar expected <p></p> additions for the same reason, and I've manually verified each of those are non-breaking.

Testing instructions:

Proposed commit message: (wrap lines at 72 characters)
Refactor baseUrl into variablePreprocessor

BaseUrl processing is done in a separate stage involving repeated and
recursive parsing / rendering of the content.

This decreases cohesiveness of variable processing, and also
performance due to the repeated parsing and rendering.
With a framework for variable processing now, let's move baseUrl
processing into it, solving the above said problems.

To do this, MarkBind's reliance on xmlMode needs to be removed as well,
as it isvery tightly coupled to the htmlparser call in resolveBaseUrl.
xmlMode is also not compliant with many html specs, such as opening


tags being able to be closed by most other opening tags.
Hence, let's instead patch htmlparser2 to enable it's self closing tag
recognition and disable lower case attribute conversion to achieve
this.

Lastly, rendering of other variables containing html is also dependent
on the extra htmlparser call in resolveBaseUrl which has the
decodeEntities option turned on.
Let's formally remove the need for this by using only the unescaped
nunjucks environment to render variables.

@ang-zeyu ang-zeyu force-pushed the refactor-baseurl-processing branch from 1242ab1 to 5da86e1 Compare June 10, 2020 15:11
@ang-zeyu ang-zeyu removed the s.OnHold label Jun 11, 2020
@ang-zeyu ang-zeyu force-pushed the refactor-baseurl-processing branch 2 times, most recently from 768c958 to 63d9f82 Compare June 12, 2020 10:59
@@ -6,4 +6,4 @@ Also, the boilerplate file name (e.g. `inside.md`) and the file that it is suppo

This file should behaves as if it is in the `requirements` folder:

<panel header="Tested with the folllowing include" src="NonFunctionalRequirements.md" >
<panel header="Tested with the folllowing include" src="NonFunctionalRequirements.md" />
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there was no corresponding closing tag; this results in one of the changes in test_site/index.html

@@ -251,6 +251,7 @@ <h1 id="panel-without-heading-with-keyword">Panel without heading with keyword<a
<h1 id="keyword-should-be-tagged-to-this-heading-not-the-panel-heading">Keyword should be tagged to this heading, not the panel heading<a class="fa fa-anchor" href="#keyword-should-be-tagged-to-this-heading-not-the-panel-heading" onclick="event.stopPropagation()"></a></h1>
<p><span class="keyword">panel keyword</span></p>
</panel>
<p></p>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This occurs due to html not needing a closing <p> tag.
Previously, this is the intermediate output after markdown-it renders the html.
It renders invalid html because it is subtly invalid markdown - panel is not a whitelisted html block tag and cannot terminate the paragraph rule (also resulting in #958):

<p><span class="keyword">panel keyword</span>
</panel>
</p>

when rerendering the html after having passed it through cheerio, the bottom lone </p> is invalid in xmlMode, and gets erased. when rendering without xmlMode however, it gets expanded into an empty p tag by htmlparser2's forgiving nature (there is no spec for this afaik). (line 226 of htmlparser2's Parser.js)

Ultimately, the extra <p></p> is due to miswritten markdown;

<span class="keyword">
panel keyword</span>

would be the correct way of not inserting any additional <p>'s at all (note the newline). (I've left this in for now as to explain the many similar cases in the 2103 site)

<p></p>
</panel>
<p>
<panel src="/test_site/requirements/NonFunctionalRequirements._include_.html"><template slot="_header">
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

due to the added / in the source file

@@ -716,6 +715,7 @@ <h6 class="always-index" id="level-6-header-outside-headingsearchindex-with-alwa




Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

due to the intermediary endraw removal, likewise for the above <p></p>'s

@@ -84,6 +76,7 @@ <h2 id="so-far-as-to-comply-with-the-commonmark-spec">So far as to comply with t
*/
</p>
</abc>
<p></p>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the two <p></p> additions in this file occur due to miswritten markdown as well.
In the first case, the /* ... indicates the start of a paragraph ending after the </abc>
In the second case, the if ... does the same

@ang-zeyu ang-zeyu requested review from acjh and marvinchin June 12, 2020 12:25
* This allows variables and other nunjuck syntax inside {% raw %} {% endraw %} tags
* to be ignored by nunjucks until the final renderString call.
*/
function preEscapeRawTags(pageData) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the real solution to #762 was to make nunjucks a single pass, removing the need for preescaping raw/endraw. A little unfortunate that the work done here is removed almost as soon as it was introduced #1220 #1049, but we could always revert it in the future if there's ever a need for multiple nunjucks passes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@crphang could I get your opinion on this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there is any historical context behind why we had multiple nunjucks passes in the first place. Would @acjh or @yamgent happen to know more about this?

@ang-zeyu ang-zeyu force-pushed the refactor-baseurl-processing branch 2 times, most recently from ccbe682 to fb1994e Compare June 13, 2020 09:20
@marvinchin marvinchin self-assigned this Jun 14, 2020
@ang-zeyu ang-zeyu force-pushed the refactor-baseurl-processing branch from fb1994e to bd4ee73 Compare June 22, 2020 10:56
@ang-zeyu
Copy link
Contributor Author

rebased on latest, no changes

Copy link
Contributor

@marvinchin marvinchin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I'm not super familiar with the intricacies of the parser. This makes sense and looks good to me, but it would be great if someone more familiar with the parser (@acjh @yamgent @crphang maybe?) can take a look at it and make sure that we haven't missed out any edge cases.

* This allows variables and other nunjuck syntax inside {% raw %} {% endraw %} tags
* to be ignored by nunjucks until the final renderString call.
*/
function preEscapeRawTags(pageData) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there is any historical context behind why we had multiple nunjucks passes in the first place. Would @acjh or @yamgent happen to know more about this?

node.children = cheerio.parseHTML(md.render(cheerio.html(node.children)), true);
cheerio.prototype.options.xmlMode = true;
break;
case 'panel': {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it correct to say that because we can resolve the base url in the initial pass, we no longer need to do the intermediate preprocessing here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is moved to componentPreprocessor L149 - L155.

This chunk does the rewriting of src="topic1.html" into src="topic1._include_.html", so it's still necessary

The main difference is the removal of the fileExists check, which already wasn't necessary before (there's a similar check done in componentPreprocessor)
The second difference would be path.dirname(path.join('{{hostBaseUrl}}', path.relative(config.rootPath, src))) using config.baseUrl directly instead of {{ hostBaseUrl }}, since we no longer have / render {{ hostBaseUrl }}

config.baseUrlMap);
filePath = path.resolve(parentSitePath, includePath.replace(baseUrlRegex, ''));
if (path.posix.isAbsolute(includePath)) {
// {{ baseUrl }} (or simply '/...' if baseUrl === '') was used in this src attribute,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I presume /... is /includePath?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, will reword the comment a little

Copy link
Contributor Author

@ang-zeyu ang-zeyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the multiple nunjucks passes, I looked through the history a little; It seems resolveBaseUrl was introduced right at the start before any of @acjh and @yamgent's commits unfortunately =C

64b2ff6

Realised I focused too much on annotating the strange diffs in the test as well, and none for the code =C. Added some

@@ -1,4 +1,4 @@
const cheerio = require('cheerio');
const cheerio = require('cheerio'); require('./lib/markbind/src/patches/htmlparser2');
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added this in all files to ensure unit tests load the patch, which is more than ever important with the extra 2 "patches" for recognizeSelfClosing and lowerCaseAttributeNames

*/
this.variablePreprocessor.addUserDefinedVariable(base, 'baseUrl', '{{baseUrl}}');
const siteRelativePathFromRoot = utils.ensurePosix(path.relative(this.rootPath, base));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the main functional change, the rest are just necessary changes due to the tight coupling of xmlMode/rendering order/using it to unescape escaped html from removing resolveBaseUrl

We simply add the correct baseUrl for each (sub)site as a variable directly into variablePreprocessor (previously userDefinedVariablesMap)

this.variablePreprocessor.addUserDefinedVariable(base, 'MarkBind', MARKBIND_LINK_HTML);

const $ = cheerio.load(content);
const $ = cheerio.load(content, { decodeEntities: false });
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{ decodeEntities: false } here as this file dosen't have the cheerio.prototype.options.decodeEntities = false; line like the others, and is the only cheerio.load call in the file

@@ -1,6 +1,5 @@
module.exports = {
// src/lib/markbind/src/parser.js
ATTRIB_INCLUDE_PATH: 'include-path',
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

previously used for _rebaseReference

const resolvedResourcePath = path.posix.relative(utils.ensurePosix(rootPath), fullResourcePath);
return path.posix.join('{{hostBaseUrl}}', resolvedResourcePath);
const resolvedResourcePath = path.posix.relative(utils.ensurePosix(config.rootPath), fullResourcePath);
return path.posix.join(config.baseUrl || '/', resolvedResourcePath);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's already rendered at this point, so we need to change it correspondingly to the actual baseUrl

config.baseUrlMap);
filePath = path.resolve(parentSitePath, includePath.replace(baseUrlRegex, ''));
if (path.posix.isAbsolute(includePath)) {
// {{ baseUrl }} (or simply '/...' if baseUrl === '') was used in this src attribute,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, will reword the comment a little

@@ -123,39 +120,42 @@ function _getSrcFlagsAndFilePaths(element, context, config) {
* and adds the appropriate include.
*/
function _preProcessPanel(node, context, config, parser) {
const element = node;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

irrelevant style change =C, let me know if you want to move it to another PR!

element.attribs.src = filePath;
const { fragment } = node.attribs;
const resultPath = utils.setExtension(path.relative(config.rootPath, filePath), '._include_.html');
const baseUrlPrependedPath = path.posix.join(`${config.baseUrl}/`, utils.ensurePosix(resultPath));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as mentioned in the intermediate processing comment, the fileExists check is done above in _getFileExistsNode, and {{ hostBaseUrl }} is replaced with config.baseUrl

},
renderString(pageData, variableMap = {}, options = {}, env = commonEnv) {
return env.renderString(pageData, variableMap, options);
renderRaw(pageData, variableMap = {}) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a little strange to call it renderRaw now that we don't preescape it, but I'll be touching on this shortly in the next PR that fixes #931

@@ -121,7 +121,7 @@ test('includeFile replaces <include src="doesNotExist.md" optional> with empty <

const expected = [
'# Index',
'<div/>',
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consequence of using xmlMode previously; <div></div> is rendered to <div/> which is invalid

Copy link
Contributor

@marvinchin marvinchin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation. I took another look and it looks fine to me! Let's improve some of the surrounding comments and I think we should be good to merge.

}

element.attribs.src = filePath;
const { fragment } = node.attribs;
const resultPath = utils.setExtension(path.relative(config.rootPath, filePath), '._include_.html');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to rename resultPath -> relativePath and baseUrlPrependedPath -> absolutePath? I feel like that might better reflect what they are.

BaseUrl processing is done in a separate stage involving repeated and
recursive parsing / rendering of the content.

This decreases cohesiveness of variable processing, and also
performance due to the repeated parsing and rendering.
It also necessitates edge-case solutions such as that in MarkBind#1088 when we
need to resolve the baseUrl before the resolveBaseUrl stage has been
reached.

With a framework for variable processing now, let's move baseUrl
processing into it, solving the above said problems.

Furthermore, rendering of other variables containing html is dependent
on the extra htmlparser call in resolveBaseUrl.

Let's formally remove the need for this by using only the unescaped
nunjucks environment to render variables.
Markbind relies on xmlMode parsing of htmlparser2 to parse 'custom self
closing tags' such as '<panel />', and avoid the auto lower case
conversion of attribute names.

This introduces very tight coupling between the order and variations of
this option across html rendering and parsing calls.

In addition, xmlMode is not compliant with various html specs, such as
opening tags being able to act as closing tags in certain cases, or
more common cases like erroneously expanding <br> into <br>...</br>.

Let's use htmlparser2's recognizeSelfClosing tag and
lowerCaseAttributeName options to do the same instead.
@ang-zeyu ang-zeyu force-pushed the refactor-baseurl-processing branch from bd4ee73 to ca8b330 Compare June 26, 2020 10:22
@ang-zeyu ang-zeyu added this to the v2.15.0 milestone Jun 26, 2020
@ang-zeyu
Copy link
Contributor Author

Thanks for the review again @marvinchin!

@ang-zeyu ang-zeyu merged commit b5e2311 into MarkBind:master Jun 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

{{ xx | safe }} filter is always being "applied" Treat nunjucks variables as safe by default
2 participants