Pollution in archives due to search domains in resolv.conf #318

Open
JustAnotherArchivist opened this issue May 12, 2018 · 3 comments

@JustAnotherArchivist
Contributor

commented May 12, 2018

@hook54321a brought up on IRC that a pipeline retrieved http://www/ successfully (Wayback Machine). The reason is that the pipeline's resolv.conf contains a `search ovh.net` line, so the bare name `www` resolves to www.ovh.net. This does not appear to be a new issue; here's an example snapshot of Online.net's website from two years ago, captured through the same URL by an ArchiveBot pipeline.

We are not alone in having this problem: there are various other snapshots showing all kinds of responses (e.g. one, two), and IA even captured itself under that URL at least once back in 2005.

Still, we need to find a way to prevent this from happening to avoid pollution in the archives. @ivan suggested testing whether `www` resolves in the preflight check, but this might not always be reliable because `www` might not resolve on the search domain while other (sub)domains work fine. Another option might be explicitly testing whether /etc/resolv.conf contains a `search` line, though that would obviously not work on all operating systems. Yet another option would be implementing a custom DNS resolving stack which completely ignores the DNS configuration in resolv.conf (i.e. communicates with specific DNS servers directly), but that's probably not a good idea.
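To make the resolv.conf option above a bit more concrete, here is a rough sketch of such a check in Python. The function name `find_search_domains` is made up for illustration and is not part of ArchiveBot. It also treats a `domain` line like a one-entry search list, which is how I read resolv.conf(5). Note the limitation: an empty result does not prove that no search list is in effect, since the resolver can also derive one from the local hostname.

```python
def find_search_domains(resolv_conf_text):
    """Return the search domains configured in resolv.conf-style text.

    Hypothetical helper for a preflight check. If several 'search' or
    'domain' lines are present, the last one is used.
    """
    domains = []
    for raw_line in resolv_conf_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments ('#' or ';' at the start).
        if not line or line[0] in "#;":
            continue
        fields = line.split()
        if fields[0] == "search":
            domains = fields[1:]
        elif fields[0] == "domain":
            # 'domain' takes a single name.
            domains = fields[1:2]
    return domains
```

A preflight check could read /etc/resolv.conf, pass its contents to this function, and refuse to start the pipeline if the returned list is not empty.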

In any case, the current pipelines also need to be fixed, of course. Ping to the current pipeline operators: @Asparagirl, @chronomex, @falconkirtaran, @HarryC145, @MattIggo. My pipelines are not affected.

@JustAnotherArchivist

Contributor Author

commented May 12, 2018

Another option might be anchoring the resolution explicitly to the root domain, i.e. resolving `www.` (with a trailing dot) instead of `www`. I'm not entirely sure if this will work in all cases though. The resolv.conf(5) man page indicates that non-default values for the `ndots` option might still cause local resolution (but doesn't explicitly mention what happens if there is a dot at the end).
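As a sketch of the anchoring idea, the helper below appends the trailing dot before a name is handed to the resolver. `anchor_hostname` is a hypothetical name, not an existing ArchiveBot or wpull function. It only covers the string handling; whether every resolver then skips the search list is the open question from the paragraph above, and it is meant for the lookup step only, not for what gets sent in the request.

```python
import ipaddress


def anchor_hostname(host):
    """Return host as an absolute DNS name by adding a trailing dot.

    IP address literals are returned unchanged, since they are not
    looked up in DNS. Hypothetical helper, for illustration only.
    """
    try:
        ipaddress.ip_address(host)
        return host  # IPv4/IPv6 literal: nothing to anchor
    except ValueError:
        pass
    # Already absolute: leave it alone.
    if host.endswith("."):
        return host
    return host + "."
```

With this, `www` would be looked up as `www.`, while a name such as `example.com.` or an address such as `192.0.2.1` passes through unchanged.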

The man page also hints at an environment variable LOCALDOMAIN which can be used to override the search directive. So we could try setting that to an empty value (or, if that's not possible, an unresolvable one, such as invalid). But I don't know whether that environment variable is honoured on all systems.
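A minimal sketch of the LOCALDOMAIN approach, assuming the crawler is started as a child process whose environment we control. `env_with_localdomain` is a hypothetical helper. The default empty value is the first suggestion above and is untested; `"invalid"` can be passed instead if an empty value turns out not to work.

```python
import os


def env_with_localdomain(base_env=None, value=""):
    """Return a copy of the environment with LOCALDOMAIN overridden.

    base_env defaults to the current process environment. The input
    mapping is not modified. Whether a given resolver honours
    LOCALDOMAIN is an assumption that needs testing per system.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["LOCALDOMAIN"] = value
    return env
```

The returned dict could be passed as the `env` argument of `subprocess.run` or `subprocess.Popen` when launching the crawler.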

@chronomex

Member

commented Jun 17, 2018

Oh shoot, I came here to post that my pipeline is resolving clean. But then I looked, and resolv.conf had cloud.online.net in it! Fixed.

@JustAnotherArchivist

Contributor Author

commented Jun 18, 2018

Thanks. Just so you're aware: depending on the network configuration, it may reappear after a reconnect/reboot. This happens for example when using DHCP with the default configuration; it can be deactivated by overriding the request directive in /etc/dhcp/dhclient.conf.
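For operators who want to check their DHCP client configuration, here is a rough Python sketch that scans dhclient.conf-style text for `request` statements asking for search-domain options. The function name and the set of option names (`domain-name`, `domain-search`, `dhcp6.domain-search`) are my assumptions about which options matter. It is a plain text scan, not a full parser, and an empty result does not rule out built-in defaults when no `request` statement is present at all.

```python
import re

# Option names assumed to be able to feed a search list into resolv.conf.
SEARCH_OPTIONS = {"domain-name", "domain-search", "dhcp6.domain-search"}

# Matches 'request a, b, c;' and 'also request a, b;' across lines.
_REQUEST_RE = re.compile(r"\b(?:also\s+)?request\s+([^;]*);")


def requested_search_options(conf_text):
    """Return the sorted search-related options requested in conf_text."""
    # Drop '#' comments before scanning.
    stripped = "\n".join(
        line.split("#", 1)[0] for line in conf_text.splitlines()
    )
    found = set()
    for match in _REQUEST_RE.finditer(stripped):
        options = {opt.strip() for opt in match.group(1).split(",")}
        found |= options & SEARCH_OPTIONS
    return sorted(found)
```

If the result is non-empty for /etc/dhcp/dhclient.conf, the search domain is likely to come back after the next lease renewal or reboot.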
