Hi All,
I have a question about the correct format for excluding certain domains/subdomains, e.g. to skip Google search result pages.
My regex looks like this:
^https?:\/\/(www\.google\..+|\w+\.youtube\.com|\w+\.etools\.ch).*$
I use
archivebox config --set URL_BLACKLIST = r'^https?:\/\/(www\.google\..+|\w+\.youtube\.com|\w+\.etools\.ch).*$'
which gives me this in archivebox.conf:
URL_BLACKLIST = r^http(s)?:\/\/(www\.google\..+|\w+\.youtube\.com|\w+\.etools\.ch).*$
This config did not work as expected. I also tried editing archivebox.conf manually, for example:
[GENERAL_CONFIG]
TIMEOUT = 20
URL_BLACKLIST = r'^https?:\/\/(www\.google\..+|\w+\.youtube\.com|\w+\.etools\.ch).*$'
or
[GENERAL_CONFIG]
TIMEOUT = 20
URL_BLACKLIST = '^https?:\/\/(www\.google\..+|\w+\.youtube\.com|\w+\.etools\.ch).*$'
or
[GENERAL_CONFIG]
TIMEOUT = 20
URL_BLACKLIST= ^https?:\/\/(www\.google\..+|\w+\.youtube\.com|\w+\.etools\.ch).*$
None of the above worked. Can anyone share a working archivebox.conf that shows the correct format for entering the regex?
I also noticed that the regex in the URL_BLACKLIST documentation seems to have an unclosed single quote. Could someone please verify this?
I'm using ArchiveBox v0.6.2
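For anyone trying to reproduce this, here is a minimal Python sketch that checks the regex on its own, outside ArchiveBox. It assumes the config file is read with Python's standard configparser and that the value is passed to the re module as-is; I have not confirmed that this is what ArchiveBox does internally. The key names BARE and QUOTED, the helper functions, and the sample URLs are made up for illustration.

```python
import configparser
import re

# Two of the variants from the post, under hypothetical key names.
SAMPLE_CONF = r"""
[GENERAL_CONFIG]
TIMEOUT = 20
BARE = ^https?:\/\/(www\.google\..+|\w+\.youtube\.com|\w+\.etools\.ch).*$
QUOTED = '^https?:\/\/(www\.google\..+|\w+\.youtube\.com|\w+\.etools\.ch).*$'
"""


def load_patterns(text):
    """Read both variants the way plain configparser returns them."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep the key names case-sensitive
    parser.read_string(text)
    section = parser["GENERAL_CONFIG"]
    return section["BARE"], section["QUOTED"]


def is_blocked(pattern, url):
    """Return True if the pattern matches anywhere in the URL."""
    return re.search(pattern, url) is not None


BARE, QUOTED = load_patterns(SAMPLE_CONF)

# Illustrative URLs only, not taken from a real archive.
SAMPLE_URLS = [
    "https://www.google.com/search?q=archivebox",
    "https://www.youtube.com/watch?v=abc",
    "https://youtube.com/watch?v=abc",
    "https://example.com/",
]

if __name__ == "__main__":
    # configparser does not strip quotes, so QUOTED still starts with '.
    print("bare value:  ", BARE)
    print("quoted value:", QUOTED)
    for url in SAMPLE_URLS:
        print(url, "bare:", is_blocked(BARE, url), "quoted:", is_blocked(QUOTED, url))
```

Under these assumptions the bare pattern matches the www.google.com and www.youtube.com URLs, while the quoted variant matches nothing, because the literal quote characters become part of the pattern. The pattern also requires a subdomain before youtube.com and etools.ch, so a URL such as https://youtube.com/... is not excluded.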