What?

pWebArc (Personal Private Passive Web Archive) is a browser extension that passively captures, collects, and archives dumps of HTTP requests and responses to your own private archiving server (like the dumb archiving server, also there) as you browse the web.

Glossary

  • A reqres (REQuest + RESponse) is a pWebArc-internal object containing captured information about an HTTP request and its response, including their headers and data, and some meta-information (whether it originates from an extension, tabId it originates from, its state, etc).

General operation

State Diagram

Reqres change their internal states according to the following state diagram (which is explained below):

(start) -> (request sent) -> (nIO) -> (headers received) -> (nIO) --> (body received)
   |                           |                              |             |
   |                           v                              v             v
   |                     (no_response)                   (incomplete)   (complete)
   |                           |                              |             |
   |                           \                              |             |
   |\---> (canceled) -----\     \                             |             |
   |                       \     \                            \             |
   |                        \     \                            \            v
   |\-> (incomplete_fc) ----->----->---------------------------->-----> (finished)
   |                        /                                            /  |
   |                       /                                      /-----/   |
   \--> (complete_fc) ----/        /--------------- (picked) <---/          v
                                   |                   |                (dropped)
                                   v                   v                 /  |
       (archived) <- (sIO) <- (collected) <------- (in_limbo) <---------/   |
                       |           ^                   |                    |
                       |           |                   |                    |
                /------/           \-----\             \--> (discarded) <---/
                |                        |
                \-> (failed to archive) -/

Step 1: Tracking

pWebArc attaches to browser’s runtime and tracks progress of HTTP requests and their responses, capturing both their request and response headers and data at appropriate times in the browser’s request and response processing pipeline.

Whether pWebArc will track a given request depends on the “Track new reqres” toggles (or checkboxes, if you are using a browser without support for the needed CSS) in the settings popup (also displayed on the right here), e.g.:

  • this toggle allows you to disable tracking of newly spawned HTTP requests globally, thus essentially disabling pWebArc,
  • this one controls whether pWebArc will track new reqres originating from the currently active tab,
  • this one controls whether it will track new reqres originating from new tabs opened from the currently active tab (aka “children tabs”, e.g. via middle mouse click, context menu, etc),
  • while this one controls whether it will track new reqres originating from new tabs opened via browser’s “New Tab” browser action (i.e. the plus sign in the tab bar, Control+T, menu item, etc),
  • and so forth for the others (press “?” symbols to see a tooltip explaining what each of them does).

Disabling any of these toggles does not stop tracking of already initiated requests; it only stops new requests controlled by that toggle from being tracked.
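A rough sketch of that last point, written in Python for brevity (the extension itself is not written in Python). The names `TrackingSettings`, `track_globally`, `track_tab`, `on_request_start`, and `on_request_event` are hypothetical, and combining the global and per-tab toggles with a plain `and` is an assumption. The point it illustrates is that the toggles are consulted only once, when a request starts.

```python
from dataclasses import dataclass


@dataclass
class TrackingSettings:
    # Hypothetical names mirroring two of the "Track new reqres" toggles.
    track_globally: bool = True
    track_tab: bool = True


tracked = set()  # IDs of requests whose tracking has already begun


def on_request_start(request_id: str, settings: TrackingSettings) -> None:
    """Consult the toggles once, at the moment a new request is spawned."""
    if settings.track_globally and settings.track_tab:
        tracked.add(request_id)


def on_request_event(request_id: str) -> bool:
    """Later events look only at `tracked`, never at the toggles."""
    return request_id in tracked
```

With this shape, a request that started while the toggles were on keeps being tracked after they are switched off, while a request started afterwards is ignored.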

The networking states of the State Diagram

As shown on the above diagram, a new reqres proceeds through the following networking states:

  • start: the starting state;
  • request sent, headers received, (response) body received: normal HTTP request stages (webRequest API stages);
  • nIO: normal network IO performed by the browser in between HTTP request stages;
  • canceled: request was canceled before it was sent (by you; by the browser itself, e.g. when rewriting an http:// URL to an https:// URL in HTTPS-only mode; by an ad-blocking extension like “uBlock Origin”; etc);
  • no_response: request was sent, but no response was received (connection to the server was rejected; you canceled it manually via the “Stop” button before it got a response; the server decided to ignore the request completely; network timeout was reached; etc);
  • incomplete: request was sent, response headers were received, but then the loading was interrupted before all of the response body was received;
  • incomplete_fc: only on Firefox-based browsers: the browser loaded the response data of this reqres directly from its cache, but did not give it to pWebArc; this is just how Firefox handles things sometimes (usually, for images); this is a separate state, because usually this means this URL was successfully archived before (if it was not, reload the page with Control+F5);
  • complete: request was completed successfully;
  • complete_fc: request was completed successfully from browser’s cache;
  • finished: the terminal state of this step, no new events for this reqres will come from the browser (webRequest API);

In principle, on reaching the finished state the reqres can be serialized and saved to disk, but pWebArc provides more states and UI for convenience.
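The networking part of the state diagram can be written down as a transition table. The following is only an illustrative Python sketch, not pWebArc's actual code: the identifiers (`NETWORK_TRANSITIONS`, `advance`, `final_networking_state`) are made up for this example, and the two nIO steps are folded into the edges that pass through them.

```python
# Allowed transitions between networking states, read off the state diagram.
# The two nIO steps are folded into the edges that pass through them.
NETWORK_TRANSITIONS = {
    "start": {"request_sent", "canceled", "incomplete_fc", "complete_fc"},
    "request_sent": {"headers_received", "no_response"},
    "headers_received": {"body_received", "incomplete"},
    "body_received": {"complete"},
    "canceled": {"finished"},
    "no_response": {"finished"},
    "incomplete": {"finished"},
    "incomplete_fc": {"finished"},
    "complete": {"finished"},
    "complete_fc": {"finished"},
    "finished": set(),
}


def advance(state: str, new_state: str) -> str:
    """Move to new_state, refusing edges the diagram does not have."""
    if new_state not in NETWORK_TRANSITIONS[state]:
        raise ValueError(f"no edge from {state!r} to {new_state!r}")
    return new_state


def final_networking_state(path: list) -> str:
    """Replay a path from start and return the last state before finished."""
    state = "start"
    previous = state
    for new_state in path:
        previous, state = state, advance(state, new_state)
    if state != "finished":
        raise ValueError("path did not reach the finished state")
    return previous
```

For instance, replaying `["request_sent", "headers_received", "incomplete", "finished"]` yields `incomplete` as the final networking state.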

Glossary

  • An /in-flight reqres/ is a reqres that did not reach the finished state yet; in the UI, such reqres will be shown to be in in_flight state. If some reqres get stuck in one of the in_flight states, the UI has buttons (this and this in the popup) to force them out of the current state as if an error occurred.
  • A finished reqres is a reqres that reached the finished state; its final networking state is the last state before finished (i.e. complete, incomplete, etc).

Step 2: Classification

On reaching the finished state, pWebArc performs reqres classification controlled by “Pick reqres for archival when they finish” and “Mark reqres as problematic when they finish” settings.

The former set decides whether the reqres in question should be picked or dropped, which influences the actions pWebArc will perform in the next step.

The latter set decides if the reqres in question should be marked as problematic. Note that problematic is a status flag, not a state.

The problematic reqres status does not influence archival or any actions discussed in the later steps. It exists because, normally, browsers provide no indication when some parts of the page failed to load properly. Instead, they expect you to actually look at the page with your eyes to notice something looking broken (and reload it manually), which is not a proper way to do this when you want to be sure that the whole page with all its resources was archived, as some of the incompletely loaded parts of the page might be invisible.

And so, to provide such an indicator, pWebArc keeps the log of problematic reqres and displays the number of elements in the log in its toolbar button’s (browserAction’s) badge.

By default, HTTP requests that failed to get a response, those that have incomplete response bodies, and those for which the browser reported potentially problematic errors but then pWebArc picked them anyway, will be marked as problematic.

Problematic errors are errors like

  • “fetching of this request’s data was aborted because this whole request was aborted, for instance, because the JavaScript making it decided to cancel it as no longer relevant when you moved your mouse cursor away from an interactive video thumbnail it was needed for”,
  • and similar things that probably imply some part of the page was left unfetched,

but NOT errors like

  • “fetching of this request’s data was aborted because it was redirected by the server”,
  • “the browser decided against rendering of this data”,
  • and similar errors where the data was properly fetched.

(In principle, pWebArc could have been designed to never record the errors of the latter category in the first place, thus simplifying the above bit, but pWebArc is designed to follow the philosophy of “collect everything as browser gives it, as raw as possible, do all the post-processing logic separately, allow for no logic at all, if the user asks for it”.)
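The default marking rule described above can be summarized in a few lines. This is a hedged Python sketch rather than the extension's real logic: `is_problematic` and the `is_problematic_error` callback are hypothetical names, and which raw error strings count as problematic is left to the caller, because this page does not list them.

```python
from typing import Callable, Iterable


def is_problematic(
    final_state: str,
    errors: Iterable[str],
    picked: bool,
    is_problematic_error: Callable[[str], bool],
) -> bool:
    """Default rule sketched from the text; the result is a flag, not a state."""
    # Requests without a response and responses with incomplete bodies.
    if final_state in ("no_response", "incomplete"):
        return True
    # Potentially problematic errors only matter if the reqres was picked anyway.
    return picked and any(is_problematic_error(e) for e in errors)
```

Note that the function takes `picked` as an input and returns only the flag, so marking a reqres as problematic never changes whether it is picked or dropped.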

The raw error strings reported by the browser for each reqres can be seen in the recent reqres history log.

If this option is enabled, pWebArc will generate a desktop notification each time a new problematic reqres gets produced. If you don’t care about the problematic flag and the notifications annoy you, you should disable that option, not the options under “Mark reqres as problematic when they finish” settings.

Glossary

Displayed on the Picked/Dropped reqres line:

On its own line:

Step 3: Collection, Discarding, and Limbo

Normally, picked reqres proceed to the collected state, which queues them for archival.

Similarly, dropped reqres proceed to being discarded from memory.

Limbo

However, for picked reqres, when “Pick into limbo” setting is enabled in the currently active tab (or via the respective settings for other reqres sources), the reqres in question will be put into limbo until you collect it or discard it manually by pressing the appropriate buttons (or global buttons, if you want to do it for all tabs and sources at once).

Similarly, for dropped reqres, when “Drop into limbo” setting is enabled in the currently active tab (or via the respective settings for other reqres sources), the reqres in question will be similarly put into limbo. Mainly, this exists for debugging.

If this option is enabled and there are more than this number of reqres in limbo or the total size of all dumps in limbo is more than this size (in MiB), pWebArc will complain to remind you to collect or discard some of them so that your browser does not waste much memory and so that you won’t lose too much data if something crashes.
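As a compact restatement of this step, here is a minimal Python sketch. The function and parameter names (`next_state`, `limbo_needs_attention`, `max_count`, `max_mib`) are invented for illustration and are not the extension's settings names.

```python
def next_state(picked: bool, pick_into_limbo: bool, drop_into_limbo: bool) -> str:
    """Where a finished reqres goes after classification."""
    if picked:
        return "in_limbo" if pick_into_limbo else "collected"
    return "in_limbo" if drop_into_limbo else "discarded"


def limbo_needs_attention(dump_sizes: list, max_count: int, max_mib: float) -> bool:
    """True when limbo holds too many reqres or too many MiB of dumps.

    dump_sizes holds the size of each in-limbo dump in bytes.
    """
    total_mib = sum(dump_sizes) / (1024 * 1024)
    return len(dump_sizes) > max_count or total_mib > max_mib
```

With both limbo settings off, picked reqres go straight to collected and dropped ones to discarded, which matches the normal flow described at the start of this step.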

Glossary

On its own line:

  • An /in-limbo reqres/ is a reqres that is being held in limbo until you manually collect or discard it.

Displayed on the Collected/Discarded reqres line:

  • A /collected reqres/ is a reqres that was (either automatically or manually) sent to the collected state.
  • A /discarded reqres/ is a reqres that was (either automatically or manually) sent to the discarded state.

Step 3.5: Logging

On entering collected or discarded state, metadata of each reqres is copied into the recent reqres history log (which can be narrowed to the currently active tab with this button) and is kept there until the size of the log reaches this many elements, at which point the older elements of the log start being elided automatically.

You can also ask pWebArc to forget some history manually by pressing this button to forget all history, or this button to forget history of reqres generated by the currently active tab.

Note, however, that problematic reqres will not get automatically elided from the log, nor forgotten by using the above buttons. To forget about them, you will have to unset the problematic flag on the respective reqres via this button, or this one, or use similar buttons in the log.
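The log behaviour described in this step could look roughly like the following Python sketch. It assumes each log entry is a dict with a boolean `problematic` key and that the log is ordered oldest first; both the representation and the function names are assumptions made for this example.

```python
def elide_log(log: list, max_elements: int) -> list:
    """Drop the oldest non-problematic entries until the log fits."""
    excess = len(log) - max_elements
    kept = []
    for entry in log:  # oldest entries come first
        if excess > 0 and not entry["problematic"]:
            excess -= 1  # elide this entry
            continue
        kept.append(entry)
    return kept


def forget_history(log: list) -> list:
    """Forget everything except entries still flagged as problematic."""
    return [entry for entry in log if entry["problematic"]]
```

In this sketch, a log made up entirely of problematic entries can grow past `max_elements`, which is consistent with the note that such entries have to be unmarked by hand first.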

Step 4: Archival

When “Archive collected reqres” toggle is enabled, pWebArc will pop the queued reqres from its archival queue one by one, serialize them into CBOR-formatted dumps, and then push those dumps to the archiving server at the URL given in “Archive collected reqres to URL” setting by turning each reqres into a POST HTTP request with the dump of the reqres as the request body (which is denoted by the sIO state on the diagram). It will also set the profile query parameter of the POST request using the appropriate “Profile” setting, e.g.

  • this one will be used for requests originating from the currently active tab,
  • this one will be used for requests originating from new child tabs opened from the currently active tab (e.g. via middle mouse click, context menu, etc),
  • while this one will be used for new tabs opened via browser’s “New Tab” browser action (i.e. the plus sign in the tab bar, Control+T, menu item, etc),
  • and so forth for the others (press “?” symbols to see a tooltip explaining what each of them does).

Evaluation of the profile parameter is done just before the POST request is sent, so if the queue is not yet empty, and you disable “Archive collected reqres”, edit some of the “Profile” settings, and enable “Archive collected reqres” again, pWebArc will start using the new setting immediately.
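To make the shape of that POST request concrete, here is a small Python sketch that builds (but does not send) such a request with the standard library. The server URL in the usage note, the source names, and the `Content-Type` header are assumptions made for illustration; this page only states that the dump is the request body and that `profile` is a query parameter. The lazy generator mirrors the point that the profile is looked up just before each request is sent.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit
from urllib.request import Request


def build_archive_request(server_url: str, profile: str, dump: bytes) -> Request:
    """Build a POST request carrying one dump, with ?profile=... in the URL."""
    parts = urlsplit(server_url)
    query = urlencode({"profile": profile})  # replaces any existing query string
    url = urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))
    # The Content-Type value is an assumption, not taken from this page.
    headers = {"Content-Type": "application/cbor"}
    return Request(url, data=dump, headers=headers, method="POST")


def drain(queue, profiles, server_url):
    """Yield one request per queued (source, dump) pair.

    The profile is looked up when each request is built, so edits to
    `profiles` made between iterations affect the remaining items.
    """
    for source, dump in queue:
        yield build_archive_request(server_url, profiles[source], dump)
```

For example, `build_archive_request("http://127.0.0.1:3210/pwebarc/dump", "my profile", b"\xa0")` produces a POST to a URL ending in `?profile=my+profile`; both the URL and the one-byte body are placeholders.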

If this option is enabled and some reqres failed to be archived, a new desktop notification will be generated. If this option is enabled, a new desktop notification will be generated when the archival queue gets empty the very first time or after any failures.

Glossary

Displayed on the Archived/Failed reqres line:

  • An /archived reqres/ is a reqres that was successfully archived to the archiving server and thus was discarded from memory.
  • A /queued reqres/ is a reqres still queued for archival.
  • A /failed to archive reqres/ is a reqres that failed to be archived to the archiving server. Archiving of these reqres will be retried every 60 seconds, but you can retry it immediately by pressing this button.
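The relationship between queued, failed, and archived reqres can be modelled as below. This is only a sketch under assumptions: the class and attribute names are hypothetical, `send` stands in for the actual upload and returns True on success, and time is passed in explicitly so the example stays deterministic.

```python
RETRY_INTERVAL = 60.0  # seconds, as stated above


class ArchivalQueue:
    def __init__(self, send):
        self.send = send      # callable(dump) -> bool, True on success
        self.queued = []      # dumps waiting to be archived
        self.failed = []      # dumps whose last attempt failed
        self.archived = 0     # count of successfully archived dumps
        self.retry_at = None  # time of the next automatic retry, if any

    def retry_failed(self):
        """Move failed dumps back into the queue (the immediate retry)."""
        self.queued = self.failed + self.queued
        self.failed = []
        self.retry_at = None

    def process(self, now):
        """Retry if the interval has passed, then drain the queue."""
        if self.retry_at is not None and now >= self.retry_at:
            self.retry_failed()
        while self.queued:
            dump = self.queued.pop(0)
            if self.send(dump):
                self.archived += 1
            else:
                self.failed.append(dump)
                if self.retry_at is None:
                    self.retry_at = now + RETRY_INTERVAL
```

Calling `retry_failed()` followed by `process(now)` corresponds to pressing the retry button instead of waiting for the interval to pass.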

Shortcuts

pWebArc provides a bunch of keyboard and context menu shortcuts to allow using it in more efficient ways.

  • On Firefox-based browsers, you can see and edit all keyboard shortcuts via “Add-ons and themes” (about:addons) -> the gear icon -> Manage Extension Shortcuts.
  • On Chromium-based browsers, you can see and edit all keyboard shortcuts via the menu -> “Extensions” -> “Manage Extensions” (chrome://extensions/) -> “Keyboard shortcuts” (on the left).

Keyboard shortcuts

pWebArc provides shortcuts to:

Context menu actions

pWebArc provides context menu actions to:

  • open a given link in a new tab with currently active tab’s tracking in children tabs setting negated. I.e.,
    • right-mouse clicking while pointing at a link and
    • selecting “Open Link in New Tracked/Untracked Tab” from “pWebArc” sub-menu,

    is equivalent to

    • toggling this,
    • middle-mouse clicking a link,
    • toggling this again.
  • do the same thing, but opening it in a new window.

Quirks and Bugs

Known extension issues

  • At the moment, reqres in limbo and queued reqres in the archival queue are only stored in memory, so if you close the browser or reload the extension before all the queued reqres finish archiving, or if you forget about some reqres in limbo, you will lose some data.

    This is not an issue under normal conditions, as limbo is disabled by default and archiving a reqres takes milliseconds, meaning that the queue will stay empty almost all of the time. But this is technically a bug that might get fixed later.

  • When the extension is (re-)loaded, all tabs inherit the values of this, this, this, and this setting.
  • At the moment, pWebArc does not implement collection of WebSockets data on any of the supported browsers (even though Chromium does support it, in theory).
  • On Chromium, response data of background requests and requests made by other extensions does not get collected, since there’s no tab to attach a debugger to, and I have not figured out how to attach debugger to other things yet.

Relevant issues of Firefox, Tor Browser, LibreWolf, etc

  • On Firefox-based browsers, without the patch (also there), the browser only supplies formData to browser.webRequest.onBeforeRequest handlers, thus making it impossible to recover the actual request body for a POST request.

    pWebArc will mark such requests as having a “partial request body” and try its best to recover the data from formData structure, but if a POST request was uploading files, they won’t be recoverable from formData (in fact, it is not even possible to tell if there were any files attached there), and so your archived request data will be incomplete even after pWebArc did its best.

    Disabling this toggle will disable archiving of such broken requests. This is not recommended, however, as archiving some data is usually better than archiving none.

    With the above patch applied, small POST requests will be archived completely and correctly. POST requests that upload large files and only those will be marked as having a “partial request body”.

  • If-Modified-Since and If-None-Match headers never get archived, because the browser never supplies them to the extensions. Thus, you can get a reqres with a “304 Not Modified” response to a seemingly normal “GET” request.
  • Reqres of already cached media files (images, audio, video, except for svg and favicons) will end in incomplete state because browser.webRequest.filterResponseData API does not provide response bodies for such requests. This toggle controls if such reqres should be picked.

    By default, pWebArc will drop them. Usually this is not a problem since such media will be archived on first (non-cached) access. But if you want to force everything on the page to be archived, you can reload the page without the cache with Control+F5.

  • Firefox fails to run onstop method for browser.webRequest.filterResponseData filter for the very first HTTP/2 request the browser makes after you start it, thus making the reqres of that request incomplete. If this option is enabled, pWebArc transparently works around this bug by redirecting the very first navigation request to about:blank and then reloading the tab with its original URL.
  • Firefox-based browsers provide no API for archiving WebSockets data at the moment, unfortunately.
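Regarding the formData issue in the first item of this list: in the webRequest API, formData maps each field name to a list of string values. The Python sketch below shows one way such a structure could be turned back into a URL-encoded body. Whether pWebArc re-encodes it this way is an assumption, and `recover_body_from_form_data` is a hypothetical name. The sketch also shows why the result has to be flagged as partial: nothing in the structure says whether files were attached.

```python
from typing import Dict, List, Tuple
from urllib.parse import urlencode


def recover_body_from_form_data(form_data: Dict[str, List[str]]) -> Tuple[bytes, bool]:
    """Re-encode formData as application/x-www-form-urlencoded bytes.

    Returns (body, partial). `partial` is always True here, because any
    uploaded files are absent from form_data and cannot be detected.
    """
    pairs = [(name, value) for name, values in form_data.items() for value in values]
    body = urlencode(pairs).encode("ascii")
    return body, True
```

The recovered bytes are a best-effort reconstruction, not necessarily the bytes the browser sent. A multipart/form-data request, for example, would have had a different encoding on the wire.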

Relevant issues of Chromium, Chrome, etc

On Chromium-based browsers, there is no way to get HTTP response data without attaching Chromium’s debugger to the tab from which the request originates. This makes things a bit tricky, for instance:

  • With pWebArc and this option enabled, new tabs will be reset to this value (about:blank by default) because the default of chrome://newtab/ does not allow attaching debugger to the tabs with chrome: URLs.
  • Requests made before the debugger is attached will get canceled by pWebArc. So, for instance, when you middle-click a link, Chromium will open a new tab, but pWebArc will block the requests from there until the debugger gets attached and then automatically reload the tab after. As a side effect of this, Chromium will show “Request blocked” page until the debugger is attached and the page is reloaded, meaning it will get visually stuck on “Request blocked” page if fetching the request ended up spawning a download instead of showing a page. The download will proceed as normal, though.
  • You will get an annoying notification bar constantly displayed in the browser while pWebArc is enabled. Closing that notification will detach the debugger. pWebArc will reattach it immediately because it assumes you don’t want to lose data and closing that notification by accident is, unfortunately, quite easy.

    However, closing the notification will make all in-flight requests lose their response data.

    If you disable pWebArc the debuggers will get detached only after all requests finish. But even if there are no requests in-flight the notification will not disappear immediately. Chromium takes its time updating the UI after the debugger is detached.

Moreover, Chromium has the following long-standing issues/bugs making things difficult:

  • Chromium will automatically detach a debugger from a tab if it tries to save too much data into its debugger state. Which means that a tab that loads too much data too fast will get its debugger detached. Chromium does this to try and save memory, but this, among other issues, means that large images will fail to be properly archived, and any page that loads such files is likely to fail to be archived too.

    This is a design limitation of Chromium debugging interface, there appears to be no work-around for this at the moment.

    Meanwhile, on Firefox, pWebArc uses browser.webRequest.filterResponseData API (not available on Chromium, because it greatly enhances browser’s ad-blocking capabilities) which does not suffer from this problem.

  • Chromium will occasionally detach debuggers from some tabs at random. It just happens. Fortunately, pWebArc will mark the resulting broken reqres as problematic by default as they match the conditions of at least one of this, this, or that options.
  • Chromium handling of media files (audio and video) within its debugging interface is very strange. When Chromium encounters a media file, it immediately loads the first few frames of it, then cancels the rest of the download, generates a networking error debugging event, but forgets to give the already loaded data to it, and then, when the user clicks the play button, continues the download by requesting the rest of the file as normal. Thus, on Chromium, for media files pWebArc will only ever get “206 Partial Content” HTTP responses with the first few kilobytes of file data missing. This bug has no good workaround; all alternatives to pWebArc that work with Chromium work around it by silently re-downloading the file a second time in the background.
  • Similarly to unpatched Firefox, Chromium-based browsers do not supply contents of files in POST request data. They do, however, provide a way to see if files were present in the request, so pWebArc will mark such and only such requests as having a “partial request body”. There is no patch for Chromium to fix this, nor does the author plan to make one (feel free to contribute one, though).

    As with Firefox, disabling this toggle will disable archiving of such broken requests. This is not recommended, however, as archiving some data is usually better than archiving none.

  • If the server supplies the same header multiple times (which happens sometimes) then archived response headers will be incomplete, as Chromium’s Network.responseReceived debugging API event provides a dictionary of headers, not a list.
  • Chromium fails to provide openerTabId to tabs created with chrome.tabs.create API, so in the unlikely case of opening two or more new tabs/windows in rapid succession via pWebArc context menu actions and not giving them time to initialize, pWebArc could end up mixing up settings between the newly created tabs/windows. This bug is impossible to trigger unless your system is very slow or you are clicking things with automation tools like AutoHotKey or xnee.
  • To properly collect all the data about a reqres, pWebArc has to use both the data generated by webRequest API and Chromium’s own debugging API events, using only one of those is usually insufficient. But Chromium generates different request IDs for events generated by these two different APIs and also generates those events in arbitrary order. Therefore, pWebArc tracks reqres generated by both sets of APIs separately and then matches those two lists against each other heuristically, merging matching reqres together. Which is ugly enough. But then Chromium sometimes generates debugging API events and forgets to produce the corresponding webRequest API events, or vice versa, thus leaving some of those reqres unmatched. To work around that, pWebArc waits this many seconds for new events to arrive, and if none do, forcefully finishes all unmatched in-flight reqres.
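The repeated-headers item in the list above comes down to a data-structure mismatch: a list of (name, value) pairs can hold the same name several times, while a dictionary keyed by name cannot. The Python sketch below illustrates only that mismatch. In Python's dict the last value wins, which is a property of Python and not a claim about how Chromium combines repeated headers. The function names are made up for this example.

```python
def collapse_to_dict(header_pairs):
    """Collapse (name, value) pairs into a dict; a repeated name keeps one value."""
    return dict(header_pairs)


def lost_values(header_pairs):
    """Return the pairs that the dict form can no longer represent."""
    collapsed = collapse_to_dict(header_pairs)
    return [(name, value) for name, value in header_pairs if collapsed[name] != value]


pairs = [
    ("Set-Cookie", "a=1"),
    ("Set-Cookie", "b=2"),
    ("Content-Type", "text/html"),
]
```

Here `lost_values(pairs)` reports the first `Set-Cookie` pair as lost, which is the kind of gap that makes the archived response headers incomplete.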

Error messages and codes

Desktop notifications

  • pWebArc FAILED to archive <N> items in the queue because it can't establish a connection to the archive at <URL>

    Are you running the archiving server script? pWebArc requires an archiving server to actually archive anything.

  • pWebArc FAILED to archive <N> items in the queue because requests to URL fail with: <STATUS> <REASON>: <RESPONSE>

    Your archiving server is returning HTTP errors when pWebArc is trying to archive data to it. See its error console for more information.

    Some common reasons it could be failing:

    • No space left on the device you are archiving to.
    • It’s a bug.

Errors recorded in reqres, as seen in the log

Most error codes are produced by attaching one of the following prefixes to the raw error code given by the browser:

  • webRequest:: prefix is prepended to errors produced by the browser.webRequest API;
  • debugger:: prefix is prepended to errors produced by the Chromium Debugger API code;
  • filterResponseData:: prefix is prepended to errors produced by browser.webRequest.filterResponseData API (these can usually be ignored, since Firefox generates normal webRequest:: codes for those reqres too, when it was an actual error, but pWebArc still collects them, adhering to “collect everything as browser gives it, when possible” philosophy).

In particular, webRequest::NS_ prefix on Firefox, and webRequest::net:: and debugger::net:: prefixes on Chromium signify various issues produced by the networking stacks of those browsers. For instance:

  • webRequest::NS_ERROR_ABORT on Firefox and webRequest::net::ERR_ABORTED on Chromium signify that this request was aborted before it finished, e.g. because the originator tab was closed before it was fully loaded; Firefox also uses this code to mean what Chromium signifies with various BLOCKED codes;
  • webRequest::net::ERR_BLOCKED_BY_CLIENT on Chromium signifies that an extension blocked it;
  • debugger::net::ERR_BLOCKED:: is a prefix for other errors when the request was blocked, e.g. by CSP;
  • webRequest::NS_ERROR_NET prefix on Firefox and webRequest::net::ERR_FAILED error on Chromium signify various networking issues.

The exception to the above rule of keeping everything as raw as possible are webRequest::pWebArc:: and debugger::pWebArc:: prefixes which signify various errors produced by pWebArc itself in its webRequest- or debugger-handling code, respectively. In particular:

  • webRequest::pWebArc::EMIT_FORCED_BY_USER and debugger::pWebArc::EMIT_FORCED_BY_USER are produced when you forcefully advance a reqres from in-flight state by pressing this or that button;
  • debugger::pWebArc::EMIT_FORCED_BY_DETACHED_DEBUGGER is produced when Chromium debugger gets detached from its tab while a reqres inside that tab is still in flight;
  • debugger::pWebArc::EMIT_FORCED_BY_CLOSED_TAB is produced when a tab gets closed while a reqres inside of it is still in flight;
  • debugger::pWebArc::NO_RESPONSE_BODY:: is a prefix for errors produced when getting request’s response body from Chromium’s debugger fails for various reasons;
  • webRequest::pWebArc::NO_DEBUGGER::CANCELED is produced when a non-main-frame request is canceled by pWebArc because no debugger is available to capture it; in the case of a main frame request, pWebArc will cancel the request and reload the tab, as discussed above, so this error will not be produced; but it can happen if a page tries to load a sub-frame (like iframe) while the debugger for the tab (and, thus, the main frame) did not attach yet (which only happens for pages where Chromium disallows debugging, or when pWebArc gets enabled after the page in question already started loading, e.g. the very first page after the browser starts); also, this can happen when the debugger gets detached after the main frame was captured but its resources are still loading.
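When reading these codes in the log, it can help to split them mechanically. The following Python sketch takes an error string apart according to the prefix scheme described in this section. `split_error` and its return shape are invented for illustration, and codes without a known prefix are returned unchanged, since the text says only that most codes carry one.

```python
KNOWN_SOURCES = ("webRequest", "debugger", "filterResponseData")


def split_error(code: str):
    """Split an error code into (source, from_pwebarc, rest).

    source is None when the code does not start with a known prefix.
    from_pwebarc is True for errors produced by pWebArc itself.
    """
    source, sep, rest = code.partition("::")
    if not sep or source not in KNOWN_SOURCES:
        return None, False, code
    return source, rest.startswith("pWebArc::"), rest
```

For example, `split_error("webRequest::net::ERR_ABORTED")` gives `("webRequest", False, "net::ERR_ABORTED")`, while a `debugger::pWebArc::...` code is reported with `from_pwebarc` set to True.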

Frequently Asked Questions

Does pWebArc send any of my captured web browsing data to any third-parties?

No, pWebArc only ever sends data to the archiving server URL you specify.

Does pWebArc collect and send any telemetry anywhere?

pWebArc does persist some global stat numbers across restarts (like Collected/Discarded reqres), but they are never sent anywhere, and you can reset them.

Why can pages under https://addons.mozilla.org/ and https://chromewebstore.google.com/ not be captured?

Browsers prevent extensions from running on extension store pages to prevent them from manipulating ratings, reviews, and the like. However, you can archive https://addons.mozilla.org/ pages by running pWebArc under Chromium and https://chromewebstore.google.com/ pages by running pWebArc under Firefox.

Why does a URL http://..., https://... or some part of it fail to be properly captured?

Did you read the notes on the bugs of the browser you are using above?

Most notably:

  • both Chromium- and Firefox-based browsers in their default builds fail to properly supply POST request data to their extensions; for Firefox-based browsers there exists a patch that fixes it, mostly; Chromium users are out of luck at the moment;
  • on a Chromium-based browser, because of limitations of the Chromium’s debugging interface, it is impossible to capture media files (both audio and video, except for those that are absolutely tiny, as in 64KiB or less) and large images; this issue has no good work-around and, AFAIK, all alternatives to pWebArc running on Chromium-based browser suffer from it (and work around it by silently re-downloading the media file the second time in background); try using pWebArc under a Firefox-based browser instead.

Can I capture a web page without archiving it, look at it, decide if I want to save it, and archive it only if I do, all without reloading the page?

Yes. This is why “Pick into limbo” setting exists. See above for more info.

On Chromium, a lot of my captures fail with debugger::pWebArc::EMIT_FORCED_BY_DETACHED_DEBUGGER, debugger::pWebArc::NO_RESPONSE_BODY::DETACHED_DEBUGGER, and webRequest::pWebArc::NO_DEBUGGER::CANCELED errors. What do I do?

You are either

  • pressing the “Cancel” or “Close” (cross) buttons in the Chromium’s popup-toolbar telling you about the debugger being enabled, and so Chromium detaches it, breaking everything (see above);
  • pressing Space or Escape keyboard keys when doing things in Chromium’s UI, but nothing at that particular moment reacts to the key you pressed, except there is that popup-toolbar… and so Chromium decides it must mean you want to press “Cancel” button there … and detaches the debugger, breaking everything (again);

    yes, this is really annoying, and this is a common problem for me, since I usually page-down using Space and press Escape a lot (usually to cancel selection, but sometimes also as a trauma of a long-time Vim user);

    the only solution to this I know of is to just not touch the keyboard at all, at least while things are still loading; i.e. just click on stuff using the mouse/track-point/touch-pad/touchscreen/etc, wait for the “T” (“Tracking”) to vanish from the extension’s badge, and only then let your (grabby and impatient for exercise via keyboard shortcuts) fingers touch the keyboard;

    even then, Chromium will detach debuggers from time to time seemingly at random, but at least it will be rare enough that you won’t need to reload much;

  • trying to capture very large media files; as discussed above, this has no workaround, run pWebArc under Firefox instead.

This page does not answer my question. What do I do?

If the whole content of this page (not just this section, did you try searching for stuff with Control+F? there’s a lot of info here) does not explain your problem, open an issue on GitHub or get in touch otherwise.