Report broken links only once #1819

jovyntls · 2022-03-11T10:17:21Z

What is the purpose of this pull request?

Resolves the first checkbox of #1418 (Report each broken link only once)
Fixes #1723 by only logging the warnings after the progress bar has completed

Overview of changes:

Add a SiteLinkManager to keep track of all broken links, then validate them all at once. This avoids duplicate link reporting and logs that break up the progress bar.
linkProcessor.validateIntralink was modified such that it is responsible only for validating a resourcePath and a cwf; handling the node and checking whether or not to validate was moved to SiteLinkManager.collectIntralink.

Anything you'd like to highlight / discuss:

If this implementation makes sense, I'm wondering if we should completely decouple linkProcessor and nodeProcessor, so that NodeProcessor calls only SiteLinkManager, which calls linkProcessor.
Since all the broken link logging will be consolidated, I thought a possible improvement to the broken link reporting would be to take inspiration from ESLint (i.e. log the filename, then a list of broken links below). Not sure if this would be cleaner for logging or if the existing way is better?

Testing instructions:

Add a broken link to a page, e.g.

mail@email.com (considered a broken link due to #1820)
![](wrong_image.png)

Proposed commit message: (wrap lines at 72 characters)

Report broken links only once

When a link is validated and found to be broken, a warning is logged for
each time such a link is found. 

This makes the logs messy and harder to read while not providing
actionable information on which links to fix. Furthermore, logging each
link while the site is building will disrupt the CLI progress bar.

Let's use a site-wide abstraction to validate all the links at once and
report only unique broken links. Doing so allows us to avoid duplicate
link reporting and create a cleaner logging output.

Checklist: ☑️

Updated the documentation for feature additions and enhancements
Added tests for bug fixes or features
Linked all related issues
No unrelated changes

ryoarmanda

Thanks for your work! If I understand the implementation correctly:

SiteLinkManager stores intralinks to be validated after page generation is finished (rather than real-time) in order to achieve the consolidated logging.
SiteLinkManager takes some functionalities from linkProcessor that are required to determine whether a particular node should be collected for its intralink.
Therefore, linkProcessor validate intralinks "contract" is changed to working with the resource paths instead of the actual nodes.

My thought was to keep the intralink validation in real-time but add the (cwf, link) pairs that errored out to the SiteLinkManager's collection to be output after page generation finished. That way, we can keep linkProcessor's way of doing things, just added more on top of it.

I think your implementation is fine too (maybe we could check first if there is any difference, functional or otherwise, on validating in real-time vs at the end?)

The last point is a regression on a vital function of intralink validation, so we need to be extra careful here that we don't create unintended outcomes. It is generally better to add new functions rather than changing existing ones, but if it is unavoidable, we have to make sure it's right.

packages/core/src/html/NodeProcessor.js

packages/core/src/html/SiteLinkManager.js

packages/core/src/Site/index.js

packages/core/src/html/SiteLinkManager.js

packages/core/src/html/linkProcessor.js

jovyntls · 2022-03-23T12:49:33Z

The last point is a regression on a vital function of intralink validation, so we need to be extra careful here that we don't create unintended outcomes. It is generally better to add new functions rather than changing existing ones, but if it is unavoidable, we have to make sure it's right.

Got it, thank you for the explanation! 🙇

I took a look through and here are some notes on it:

validateIntralink
- has no major side effects (besides the logging) and its output is not used anywhere, except in tests to verify where the function returns
- may be changed by the state of the config object, however the attributes it depends on (e.g. baseUrl) should remain constant throughout
- is only called once, in the _process method of NodeProcessor
in _process, validateIntralink mainly serves the purpose of logging and its output is not used elsewhere (i.e. does not affect the rest of the method)

Behaviour wise, it looks like there shouldn't be a difference (besides collecting and then validating all at once, later)

ang-zeyu · 2022-03-26T08:15:58Z

packages/core/src/Site/index.js

@@ -1103,6 +1111,8 @@ class Site {
      } else {
        isCompleted = await this.generatePagesAsyncThrottled(task.pages, progressBar);
      }
+
+      this.siteLinkManager.validateAllIntralinks();


Lgtm 👍, I'll leave @ryoarmanda to the final look and merge

Just adding a note that while this preserves the previous behaviour (across rebuilds, invalid links are only logged for rebuilt files and their dependencies), perhaps we we could improve on this in a future issue / pr to persist the invalid links across rebuilds. (one more to the list for #1418)

E.g.

markbind s - all invalid links are reported

change some file -> only invalid links belonging to the file (or its dependencies) are reported

author has to scroll up to the "first batch" of messages to see the invalid links

Or

markbind s -o -b - all invalid links are reported for the opened page (and its dependencies), then for the background pages

change opened file -> only invalid links belonging to the file (or its dependencies) are reported

background pages are reported "disjointed" from the earlier batch, same issue with the author having to piece together the information again

I think your implementation is fine too (maybe we could check first if there is any difference, functional or otherwise, on validating in real-time vs at the end?)

No preference on this for now, but doing it at the end should facilitate the 2nd item (hash checking) in #1418, where it'll likely be necessary to do it at the end (only once all hashes have been collected from all pages).

ryoarmanda

Looks good to me! The suggestion @ang-zeyu mentioned to persist the invalid links across rebuilds sounds like a nice improvement too.

No preference on this for now, but doing it at the end should facilitate the 2nd item (hash checking) in #1418, where it'll likely be necessary to do it at the end (only once all hashes have been collected from all pages).

Ah in that case, doing it at the end would prepare to go forth with the hash checking as well.

tlylt · 2022-04-17T13:34:33Z

While experimenting for a fix to #1723, I realize that since now the links are collected and then validated, the output might end up like this if the mentioned issue is fixed:
(warnings/infos will be logged above the progress bar, and the progress bar will stay at the bottom until completion)

Wonder if there are any opinions/thoughts if I revert the behavior to real-time while keeping the rest of code structure from this PR the same?

jovyntls · 2022-04-17T17:53:42Z

Wonder if there are any opinions/thoughts if I revert the behavior to real-time while keeping the rest of code structure from this PR the same?

Quoting Ze Yu's reply from here:

No preference on this for now, but doing it at the end should facilitate the 2nd item (hash checking) in #1418, where it'll likely be necessary to do it at the end (only once all hashes have been collected from all pages).

ang-zeyu · 2022-04-17T18:44:42Z

I think we should be fixing fundamental issue here instead: #1633

#1723 should be naturally fixed as a result too. (but we can still update winston of course)

E.g. maybe use this https://www.npmjs.com/package/progress#interrupt

Or (hopefully) there's even a progress bar library that works out of the box with winston 👀

tlylt · 2022-04-17T23:44:06Z

Quoting Ze Yu's reply from here:

No preference on this for now, but doing it at the end should facilitate the 2nd item (hash checking) in #1418, where it'll likely be necessary to do it at the end (only once all hashes have been collected from all pages).

Thanks, @jovyntls, in this case, I guess real-time will have to mean that we can check on a per-page basis (depending on how hash checking can be done, especially when dealing with includes). I think the user experience is slightly better to have the warnings pop up as we go, which gives users the freedom to stop the build/serve whenever they notice such logs in the console.

I think we should be fixing fundamental issue here instead: #1633

#1723 should be naturally fixed as a result too. (but we can still update winston of course)

Indeed 1723 is somewhat a duplicate and adjusting the order is not the fix that I am going for (In my screenshot the link reporting does have the linebreak). I am asking more for user experience wise (should they see the broken link reported as they go, which was the previous experience, or should they see it at the end of site generation).

E.g. maybe use this npmjs.com/package/progress#interrupt

Indeed my working solution uses the idea of interrupt, will hopefully push up a PR to continue the discussion from there.

Or (hopefully) there's even a progress bar library that works out of the box with winston 👀

At first, I thought the problem is with the progress bar library and even tried another popular one (cli-progress), but it ended up having the same issue as well and no real good solutions for that (basically people have been filing issues about how difficult it is to work with loggers but not much was done to help). My sketch of the solution is to wrap the winston loggers and keep a reference to the progress bar so that the bar can be interrupted whenever there's a call to the logging methods. Again, will push up for discussion when it's done. Thanks @ang-zeyu for looking into this.

ang-zeyu · 2022-04-18T04:08:26Z

I am asking more for user experience wise (should they see the broken link reported as they go, which was the previous experience, or should they see it at the end of site generation).

#1418 (comment)

^ just my personal preference, as an author, I would prefer to deal with link validation issues "in one block" of messages, rather than finding them disjointed all over the place. Having the messages come out real time may not be too helpful during markbind serve as well (since page regeneration proceeds fairly quickly)

tlylt · 2022-04-18T04:13:49Z

^ just my personal preference, as an author, I would prefer to deal with link validation issues "in one block" of messages, rather than finding them disjointed all over the place. Having the messages come out real time may not be too helpful during markbind serve as well (since page regeneration proceeds fairly quickly)

Hmm that's quite true for 'markbind serve'. I'm more thinking of 'markbind build', in a larger site which the build time can be more than a few seconds, I would imagine it to be useful to know the error immediately so that I can go and fix it before I am ready to deploy this build. Certainly some personal preference is involved...

ang-zeyu · 2022-04-18T04:32:57Z

Certainly some personal preference is involved...

Agreed, I think the clincher might be this ⬇️

No preference on this for now, but doing it at the end should facilitate the 2nd item (hash checking) in #1418, where it'll likely be necessary to do it at the end (only once all hashes have been collected from all pages).

jovyntls added 7 commits February 28, 2022 11:29

Add .ignore for ag

631cc27

Merge branch 'master' of github.com:MarkBind/markbind

e7502cd

update .ignore

d2f4648

Merge branch 'master' of github.com:MarkBind/markbind

3f57344

git skip worktree .ignore

319c840

First draft

ea1b164

Rename to Manager

3a0918d

jovyntls changed the title ~~Draft: Intralink validation~~ Draft: Report broken links only once Mar 11, 2022

Fix tests

94b9e78

jovyntls force-pushed the intralink-validation branch from 22b7640 to 94b9e78 Compare March 14, 2022 03:34

Fix duplicates on rebuild

354f269

jovyntls force-pushed the intralink-validation branch from 84cd6b7 to 354f269 Compare March 14, 2022 06:56

jovyntls changed the title ~~Draft: Report broken links only once~~ Report broken links only once Mar 14, 2022

jovyntls marked this pull request as ready for review March 14, 2022 07:05

jonahtanjz requested a review from ryoarmanda March 15, 2022 06:58

ang-zeyu requested a review from wxwxwxwx9 March 15, 2022 08:38

ryoarmanda reviewed Mar 22, 2022

View reviewed changes

Fixes

7f98252

ang-zeyu self-requested a review March 25, 2022 04:13

ang-zeyu reviewed Mar 26, 2022

View reviewed changes

ang-zeyu approved these changes Mar 26, 2022

View reviewed changes

ryoarmanda approved these changes Mar 30, 2022

View reviewed changes

ryoarmanda merged commit 1e83c1e into MarkBind:master Mar 31, 2022

ryoarmanda added this to the 4.0 milestone Apr 3, 2022

tlylt mentioned this pull request Apr 4, 2022

Emails are wrongly validated as links #1820

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Report broken links only once #1819

Report broken links only once #1819

jovyntls commented Mar 11, 2022 •

edited

Loading

ryoarmanda left a comment

jovyntls commented Mar 23, 2022

ang-zeyu Mar 26, 2022

ryoarmanda left a comment •

edited

Loading

tlylt commented Apr 17, 2022

jovyntls commented Apr 17, 2022

ang-zeyu commented Apr 17, 2022

tlylt commented Apr 17, 2022

ang-zeyu commented Apr 18, 2022

tlylt commented Apr 18, 2022

ang-zeyu commented Apr 18, 2022

Report broken links only once #1819

Report broken links only once #1819

Conversation

jovyntls commented Mar 11, 2022 • edited Loading

ryoarmanda left a comment

Choose a reason for hiding this comment

jovyntls commented Mar 23, 2022

ang-zeyu Mar 26, 2022

Choose a reason for hiding this comment

ryoarmanda left a comment • edited Loading

Choose a reason for hiding this comment

tlylt commented Apr 17, 2022

jovyntls commented Apr 17, 2022

ang-zeyu commented Apr 17, 2022

tlylt commented Apr 17, 2022

ang-zeyu commented Apr 18, 2022

tlylt commented Apr 18, 2022

ang-zeyu commented Apr 18, 2022

jovyntls commented Mar 11, 2022 •

edited

Loading

ryoarmanda left a comment •

edited

Loading