May 2022 - Replace Splash with Playwright

@Rafiot Rafiot released this 24 May 13:33
· 863 commits to main since this release
v1.12.0

New Features

Playwright

Captures are now made via Playwright instead of Splash. This is a major improvement, as Playwright drives actual up-to-date browsers in headless mode (instead of the qt-webkit build from ~2016 that Splash relies on). You can read more about the research that led to this change in the discussion.

The other main advantages of using Playwright are the following:

  • Easier to install: Docker is no longer required, as it was only needed to run Splash
  • Much better control of what happens in the browser while capturing: Playwright makes it extremely simple to instrument everything in the browsers. The capturing module already tries to solve reCaptcha if it detects it on the page.

The capture is made by a standalone Python module that you can use in your own tools if you wish to.
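
To give an idea of what instrumenting a headless browser with Playwright looks like, here is a minimal sketch using Playwright's own sync API directly. It is not the API of the capture module mentioned above: the names RequestLog and capture are hypothetical, and the sketch assumes Playwright and its Chromium build are installed.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple


@dataclass
class RequestLog:
    """Hypothetical helper: collects (method, url) for every request a page issues."""
    entries: List[Tuple[str, str]] = field(default_factory=list)

    def on_request(self, request: Any) -> None:
        # Playwright's Request object exposes .method and .url
        self.entries.append((request.method, request.url))


def capture(url: str) -> Tuple[str, bytes, RequestLog]:
    """Load `url` in headless Chromium; return the HTML, a PNG screenshot and the request log."""
    # Imported here so this file can be loaded even when Playwright is not installed.
    from playwright.sync_api import sync_playwright

    log = RequestLog()
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Instrumentation: the callback fires for each request the page makes.
        page.on("request", log.on_request)
        page.goto(url, wait_until="networkidle")
        html = page.content()
        png = page.screenshot(full_page=True)
        browser.close()
    return html, png, log
```

Calling capture("https://example.com") would return the rendered HTML, the screenshot bytes, and the list of requests seen while the page loaded. The real capture does considerably more than this.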

De-duplication

If the exact same capture is triggered multiple times within 5 minutes, the repeated requests are skipped and the requester is redirected to the capture that was already made.
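
The idea can be sketched as follows. This is only an illustration, not the project's actual code: the class CaptureDeduplicator, the choice of hashing the capture settings as the "exact same capture" test, and the in-memory dictionary are all assumptions.

```python
import hashlib
import json
import time
from typing import Callable, Dict, Tuple

DEDUP_WINDOW_SECONDS = 5 * 60  # the 5-minute window described above


class CaptureDeduplicator:
    """Hypothetical helper: remembers recent capture requests by their settings."""

    def __init__(self, window: float = DEDUP_WINDOW_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.window = window
        self.clock = clock
        # fingerprint -> (time of the capture, UUID of that capture)
        self._recent: Dict[str, Tuple[float, str]] = {}

    @staticmethod
    def fingerprint(settings: dict) -> str:
        # Sort keys so that identical settings always hash to the same value.
        canonical = json.dumps(settings, sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def submit(self, settings: dict, new_uuid: str) -> Tuple[str, bool]:
        """Return (capture_uuid, is_new). A duplicate gets the earlier UUID back."""
        now = self.clock()
        key = self.fingerprint(settings)
        previous = self._recent.get(key)
        if previous is not None and now - previous[0] < self.window:
            return previous[1], False  # skip: point the caller at the earlier capture
        self._recent[key] = (now, new_uuid)
        return new_uuid, True
```

The caller would then redirect to the returned UUID whenever is_new is False. Note that in this sketch a duplicate does not extend the window: it is always measured from the capture that was actually made.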

Fixes

  • Avoid discarding a capture on network error: when a redirect breaks partway through the chain, we keep the chain up to that point
  • Issue when the MISP event was submitted as unpublished
  • [Docker] Properly handle archiving
  • [Docker] Init SRI hashes
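
The first fix in this list (keeping the redirect chain on network error) can be illustrated with a small sketch. It is not the project's code: follow_redirects and the Fetcher callback are hypothetical names, and the network layer is replaced by a function passed in by the caller.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical fetcher: given a URL, return the next redirect target,
# or None when the final page is reached. Raises OSError on network failure.
Fetcher = Callable[[str], Optional[str]]


def follow_redirects(start_url: str, fetch: Fetcher,
                     max_hops: int = 20) -> Tuple[List[str], Optional[str]]:
    """Follow redirects and return (chain, error).

    On a network error the chain collected so far is returned together with
    an error message, instead of being thrown away.
    """
    chain = [start_url]
    error: Optional[str] = None
    current = start_url
    for _ in range(max_hops):
        try:
            next_url = fetch(current)
        except OSError as exc:
            # Keep what we have: the chain up to the URL that failed.
            error = f"{current}: {exc}"
            break
        if next_url is None:
            break  # final page reached
        chain.append(next_url)
        current = next_url
    return chain, error
```

With this shape, a capture whose third hop refuses the connection still yields the first three URLs plus the error, which is enough to build a partial tree.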

Changes

  • Improve subsequent capture template on long URLs
  • Improve view of the capture page on small-ish screens
  • General maintenance and code cleanup
  • Improvement in the tree generation on edge cases
  • Bump JS/CSS libraries
  • Update bundled-in User-Agent file
  • Use pydeep2, which comes with a bundled libfuzzy and is easier to install