Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Report broken links only once #1819

Merged
merged 10 commits into from
Mar 31, 2022
Merged

Conversation

jovyntls
Copy link
Contributor

@jovyntls jovyntls commented Mar 11, 2022

What is the purpose of this pull request?

  • Documentation update
  • Bug fix
  • Feature addition or enhancement
  • Code maintenance
  • Others, please explain:

Resolves the first checkbox of #1418 (Report each broken link only once)
Fixes #1723 by only logging the warnings after the progress bar has completed

Overview of changes:

  • Add a SiteLinkManager to keep track of all broken links, then validate them all at once. This avoids duplicate link reporting and logs that break up the progress bar.
  • linkProcessor.validateIntralink was modified such that it is responsible only for validating a resourcePath and a cwf; handling the node and checking whether or not to validate was moved to SiteLinkManager.collectIntralink.

Anything you'd like to highlight / discuss:

  • If this implementation makes sense, I'm wondering if we should completely decouple linkProcessor and nodeProcessor, so that NodeProcessor calls only SiteLinkManager, which calls linkProcessor.
  • Since all the broken link logging will be consolidated, I thought a possible improvement to the broken link reporting would be to take inspiration from ESLint (i.e. log the filename, then a list of broken links below). Not sure if this would be cleaner for logging or if the existing way is better?
    image

Testing instructions:

Add a broken link to a page, e.g.

mail@email.com (considered a broken link due to #1820)
![](wrong_image.png)

Proposed commit message: (wrap lines at 72 characters)

Report broken links only once

When a link is validated and found to be broken, a warning is logged for
each time such a link is found. 

This makes the logs messy and harder to read while not providing
actionable information on which links to fix. Furthermore, logging each
link while the site is building will disrupt the CLI progress bar.

Let's use a site-wide abstraction to validate all the links at once and
report only unique broken links. Doing so allows us to avoid duplicate
link reporting and create a cleaner logging output.

Checklist: ☑️

  • Updated the documentation for feature additions and enhancements
  • Added tests for bug fixes or features
  • Linked all related issues
  • No unrelated changes

@jovyntls jovyntls changed the title Draft: Intralink validation Draft: Report broken links only once Mar 11, 2022
@jovyntls jovyntls changed the title Draft: Report broken links only once Report broken links only once Mar 14, 2022
@jovyntls jovyntls marked this pull request as ready for review March 14, 2022 07:05
@ang-zeyu ang-zeyu requested a review from wxwxwxwx9 March 15, 2022 08:38
Copy link
Contributor

@ryoarmanda ryoarmanda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your work! If I understand the implementation correctly:

  • SiteLinkManager stores intralinks to be validated after page generation is finished (rather than real-time) in order to achieve the consolidated logging.
  • SiteLinkManager takes some functionalities from linkProcessor that are required to determine whether a particular node should be collected for its intralink.
  • Therefore, linkProcessor validate intralinks "contract" is changed to working with the resource paths instead of the actual nodes.

My thought was to keep the intralink validation in real-time but add the (cwf, link) pairs that errored out to the SiteLinkManager's collection to be output after page generation finished. That way, we can keep linkProcessor's way of doing things, just added more on top of it.

I think your implementation is fine too (maybe we could check first if there is any difference, functional or otherwise, on validating in real-time vs at the end?)

The last point is a regression on a vital function of intralink validation, so we need to be extra careful here that we don't create unintended outcomes. It is generally better to add new functions rather than changing existing ones, but if it is unavoidable, we have to make sure it's right.

packages/core/src/html/NodeProcessor.js Outdated Show resolved Hide resolved
packages/core/src/html/SiteLinkManager.js Show resolved Hide resolved
packages/core/src/Site/index.js Outdated Show resolved Hide resolved
packages/core/src/html/SiteLinkManager.js Outdated Show resolved Hide resolved
packages/core/src/html/SiteLinkManager.js Outdated Show resolved Hide resolved
packages/core/src/html/linkProcessor.js Show resolved Hide resolved
@jovyntls
Copy link
Contributor Author

The last point is a regression on a vital function of intralink validation, so we need to be extra careful here that we don't create unintended outcomes. It is generally better to add new functions rather than changing existing ones, but if it is unavoidable, we have to make sure it's right.

Got it, thank you for the explanation! 🙇

I took a look through and here are some notes on it:

  • validateIntralink
    • has no major side effects (besides the logging) and its output is not used anywhere, except in tests to verify where the function returns
    • may be changed by the state of the config object, however the attributes it depends on (e.g. baseUrl) should remain constant throughout
    • is only called once, in the _process method of NodeProcessor
  • in _process, validateIntralink mainly serves the purpose of logging and its output is not used elsewhere (i.e. does not affect the rest of the method)

Behaviour wise, it looks like there shouldn't be a difference (besides collecting and then validating all at once, later)

@ang-zeyu ang-zeyu self-requested a review March 25, 2022 04:13
@@ -1103,6 +1111,8 @@ class Site {
} else {
isCompleted = await this.generatePagesAsyncThrottled(task.pages, progressBar);
}

this.siteLinkManager.validateAllIntralinks();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm 👍, I'll leave @ryoarmanda to the final look and merge

Just adding a note that while this preserves the previous behaviour (across rebuilds, invalid links are only logged for rebuilt files and their dependencies), perhaps we we could improve on this in a future issue / pr to persist the invalid links across rebuilds. (one more to the list for #1418)

E.g.

  • markbind s - all invalid links are reported
  • change some file -> only invalid links belonging to the file (or its dependencies) are reported
    • author has to scroll up to the "first batch" of messages to see the invalid links

Or

  • markbind s -o -b - all invalid links are reported for the opened page (and its dependencies), then for the background pages
  • change opened file -> only invalid links belonging to the file (or its dependencies) are reported
    • background pages are reported "disjointed" from the earlier batch, same issue with the author having to piece together the information again

I think your implementation is fine too (maybe we could check first if there is any difference, functional or otherwise, on validating in real-time vs at the end?)

No preference on this for now, but doing it at the end should facilitate the 2nd item (hash checking) in #1418, where it'll likely be necessary to do it at the end (only once all hashes have been collected from all pages).

Copy link
Contributor

@ryoarmanda ryoarmanda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! The suggestion @ang-zeyu mentioned to persist the invalid links across rebuilds sounds like a nice improvement too.

No preference on this for now, but doing it at the end should facilitate the 2nd item (hash checking) in #1418, where it'll likely be necessary to do it at the end (only once all hashes have been collected from all pages).

Ah in that case, doing it at the end would prepare to go forth with the hash checking as well.

@ryoarmanda ryoarmanda merged commit 1e83c1e into MarkBind:master Mar 31, 2022
@ryoarmanda ryoarmanda added this to the 4.0 milestone Apr 3, 2022
@tlylt
Copy link
Contributor

tlylt commented Apr 17, 2022

While experimenting for a fix to #1723, I realize that since now the links are collected and then validated, the output might end up like this if the mentioned issue is fixed:
(warnings/infos will be logged above the progress bar, and the progress bar will stay at the bottom until completion)
image

Wonder if there are any opinions/thoughts if I revert the behavior to real-time while keeping the rest of code structure from this PR the same?

@jovyntls
Copy link
Contributor Author

Wonder if there are any opinions/thoughts if I revert the behavior to real-time while keeping the rest of code structure from this PR the same?

Quoting Ze Yu's reply from here:

No preference on this for now, but doing it at the end should facilitate the 2nd item (hash checking) in #1418, where it'll likely be necessary to do it at the end (only once all hashes have been collected from all pages).

@ang-zeyu
Copy link
Contributor

I think we should be fixing fundamental issue here instead: #1633

#1723 should be naturally fixed as a result too. (but we can still update winston of course)

E.g. maybe use this https://www.npmjs.com/package/progress#interrupt

Or (hopefully) there's even a progress bar library that works out of the box with winston 👀

@tlylt
Copy link
Contributor

tlylt commented Apr 17, 2022

Quoting Ze Yu's reply from here:

No preference on this for now, but doing it at the end should facilitate the 2nd item (hash checking) in #1418, where it'll likely be necessary to do it at the end (only once all hashes have been collected from all pages).

Thanks, @jovyntls, in this case, I guess real-time will have to mean that we can check on a per-page basis (depending on how hash checking can be done, especially when dealing with includes). I think the user experience is slightly better to have the warnings pop up as we go, which gives users the freedom to stop the build/serve whenever they notice such logs in the console.

I think we should be fixing fundamental issue here instead: #1633

#1723 should be naturally fixed as a result too. (but we can still update winston of course)

Indeed 1723 is somewhat a duplicate and adjusting the order is not the fix that I am going for (In my screenshot the link reporting does have the linebreak). I am asking more for user experience wise (should they see the broken link reported as they go, which was the previous experience, or should they see it at the end of site generation).

E.g. maybe use this npmjs.com/package/progress#interrupt

Indeed my working solution uses the idea of interrupt, will hopefully push up a PR to continue the discussion from there.

Or (hopefully) there's even a progress bar library that works out of the box with winston 👀

At first, I thought the problem is with the progress bar library and even tried another popular one (cli-progress), but it ended up having the same issue as well and no real good solutions for that (basically people have been filing issues about how difficult it is to work with loggers but not much was done to help). My sketch of the solution is to wrap the winston loggers and keep a reference to the progress bar so that the bar can be interrupted whenever there's a call to the logging methods. Again, will push up for discussion when it's done. Thanks @ang-zeyu for looking into this.

@ang-zeyu
Copy link
Contributor

I am asking more for user experience wise (should they see the broken link reported as they go, which was the previous experience, or should they see it at the end of site generation).

#1418 (comment)

^ just my personal preference, as an author, I would prefer to deal with link validation issues "in one block" of messages, rather than finding them disjointed all over the place. Having the messages come out real time may not be too helpful during markbind serve as well (since page regeneration proceeds fairly quickly)

@tlylt
Copy link
Contributor

tlylt commented Apr 18, 2022

^ just my personal preference, as an author, I would prefer to deal with link validation issues "in one block" of messages, rather than finding them disjointed all over the place. Having the messages come out real time may not be too helpful during markbind serve as well (since page regeneration proceeds fairly quickly)

Hmm that's quite true for 'markbind serve'. I'm more thinking of 'markbind build', in a larger site which the build time can be more than a few seconds, I would imagine it to be useful to know the error immediately so that I can go and fix it before I am ready to deploy this build. Certainly some personal preference is involved...

@ang-zeyu
Copy link
Contributor

Certainly some personal preference is involved...

Agreed, I think the clincher might be this ⬇️

No preference on this for now, but doing it at the end should facilitate the 2nd item (hash checking) in #1418, where it'll likely be necessary to do it at the end (only once all hashes have been collected from all pages).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Missing linebreak in logging below progress bar
4 participants