Stop adding data: and mailto: URIs to the database #483

Open
JustAnotherArchivist opened this issue Jan 28, 2024 · 0 comments

As of wpull 2.0.3, data: and mailto: URIs get added to the database, although neither serves any purpose. Not only are these schemes unsupported, there's also nothing to be retrieved for them anyway. tel: URIs (currently entirely unsupported and treated as relative paths instead) should likely be handled the same way, i.e. not added to the database at all.
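
To illustrate the requested behaviour, here is a minimal sketch of a scheme check that could run before a discovered URI is stored. It is not wpull's actual code or API: the function name `should_store_url` and the constant `NON_FETCHABLE_SCHEMES` are hypothetical, and the set of schemes is simply the three named in this issue.

```python
from urllib.parse import urlsplit

# Hypothetical list: schemes named in this issue that have nothing to retrieve.
NON_FETCHABLE_SCHEMES = frozenset({"data", "mailto", "tel"})


def should_store_url(url: str) -> bool:
    """Return False for URIs whose scheme has nothing to fetch."""
    # urlsplit() lowercases the scheme, so "MAILTO:" and "mailto:" match alike.
    scheme = urlsplit(url.strip()).scheme
    return scheme not in NON_FETCHABLE_SCHEMES


if __name__ == "__main__":
    for candidate in ("data:image/png;base64,AAAA", "tel:+15551234",
                      "https://example.com/page", "images/logo.png"):
        print(candidate, should_store_url(candidate))
```

A check like this would presumably need to run on the raw link text, before relative-URL resolution, since tel: links are currently resolved as relative paths and would no longer carry their original scheme afterwards.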

As an extreme real-world example of the impact: an ArchiveBot job's database grew to 106 GB over the past couple of days due to data: URIs embedded in every page. After purging these URIs with the following commands (likely not the most efficient approach)

sqlite3 wpull.db "SELECT id FROM url_strings WHERE url LIKE 'data:%'" | sed "s,^.*\$,UPDATE url_strings SET url = 'data:<removed-&>' WHERE id = &;," >cmds
sqlite3 wpull.db <cmds
sqlite3 wpull.db VACUUM

the database size dropped to 860 MB.
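
As a possibly simpler alternative to generating one UPDATE per row, the same purge can be sketched as a single UPDATE statement run through Python's standard `sqlite3` module. This is only a sketch: the `url_strings` table and its `id` and `url` columns are taken from the commands above, the helper name `purge_data_uris` is hypothetical, and the demonstration uses a throwaway in-memory database with a stand-in schema rather than a real wpull database.

```python
import sqlite3


def purge_data_uris(conn: sqlite3.Connection) -> int:
    """Replace every data: URI with a short, unique placeholder.

    Returns the number of rows rewritten.
    """
    cur = conn.execute(
        "UPDATE url_strings "
        "SET url = 'data:<removed-' || id || '>' "
        "WHERE url LIKE 'data:%'"
    )
    count = cur.rowcount
    conn.commit()           # VACUUM cannot run inside an open transaction
    conn.execute("VACUUM")  # rebuild the file so the freed pages are released
    return count


# Demonstration on an in-memory database; the schema is an assumed stand-in.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE url_strings (id INTEGER PRIMARY KEY, url TEXT UNIQUE)")
conn.executemany(
    "INSERT INTO url_strings (url) VALUES (?)",
    [("https://example.com/",),
     ("data:image/png;base64,AAAA",),
     ("data:text/plain,hi",)],
)
changed = purge_data_uris(conn)
```

Embedding the row id in the placeholder keeps the replacement values distinct, which matters if the `url` column carries a uniqueness constraint, as the stand-in schema here assumes.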
