Question: The correct format for URL_BLACKLIST property? #824

@levitabris

Hi All,

I have a question about the correct format for excluding certain domains/subdomains, e.g. to skip Google search result pages.

My regex looks like this:

^https?:\/\/(www\.google\..+|\w+\.youtube\.com|\w+\.etools\.ch).*$
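As a sanity check of the pattern itself, independent of ArchiveBox, here is a minimal Python sketch that runs it against a few URLs. The sample URLs and the helper name `is_blacklisted` are made up for illustration, and using `re.search` is an assumption about how such a pattern would be applied.

```python
import re

# The pattern from this question, written as a Python raw string.
PATTERN = r'^https?:\/\/(www\.google\..+|\w+\.youtube\.com|\w+\.etools\.ch).*$'
url_blacklist = re.compile(PATTERN)

# Hypothetical sample URLs, chosen only to exercise each alternative.
samples = [
    "https://www.google.com/search?q=archivebox",
    "http://www.youtube.com/watch?v=abc",
    "https://www.etools.ch/searchSubmit.do",
    "https://youtube.com/watch?v=abc",  # no subdomain in front of youtube.com
    "https://example.com/",
]


def is_blacklisted(url: str) -> bool:
    """Return True if the URL matches the blacklist pattern (illustrative helper)."""
    return url_blacklist.search(url) is not None


for url in samples:
    print(is_blacklisted(url), url)
```

Note that `\w+\.youtube\.com` requires at least one subdomain label, so a bare `youtube.com` URL is not matched by this pattern.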

I set it with:

archivebox config --set URL_BLACKLIST = r'^https?:\/\/(www\.google\..+|\w+\.youtube\.com|\w+\.etools\.ch).*$'

which gives me this in the archivebox.conf:

URL_BLACKLIST = r^http(s)?:\/\/(www\.google\..+|\w+\.youtube\.com|\w+\.etools\.ch).*$
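The leading `r` and the missing quotes in the saved value are consistent with ordinary shell quoting: the shell removes the single quotes and joins the `r` to the quoted text before the program sees it. The sketch below uses Python's `shlex` module to tokenize the command under POSIX rules. It assumes a POSIX-like shell and says nothing about how ArchiveBox itself handles the arguments afterwards.

```python
import shlex

# The command line exactly as typed above.
command = r"""archivebox config --set URL_BLACKLIST = r'^https?:\/\/(www\.google\..+|\w+\.youtube\.com|\w+\.etools\.ch).*$'"""

# shlex.split follows POSIX shell quoting rules: single quotes are removed,
# their contents are kept literally, and adjacent text (the "r") is joined on.
argv = shlex.split(command)
for arg in argv:
    print(repr(arg))

# The last argument is what the program would receive as the value.
value = argv[-1]
```

The last argument starts with `r^https?` and contains no quotes, and the spaces around `=` make `URL_BLACKLIST`, `=`, and the value three separate arguments.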

This config did not work as expected. I also tried editing archivebox.conf manually, for example:

[GENERAL_CONFIG]
TIMEOUT = 20
URL_BLACKLIST = r'^https?:\/\/(www\.google\..+|\w+\.youtube\.com|\w+\.etools\.ch).*$'

or

[GENERAL_CONFIG]
TIMEOUT = 20
URL_BLACKLIST = '^https?:\/\/(www\.google\..+|\w+\.youtube\.com|\w+\.etools\.ch).*$'

or

[GENERAL_CONFIG]
TIMEOUT = 20
URL_BLACKLIST= ^https?:\/\/(www\.google\..+|\w+\.youtube\.com|\w+\.etools\.ch).*$

None of the above worked. Can anyone share a working archivebox.conf that shows the correct format for the regex?
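To compare the three manual variants, here is a minimal sketch that reads each one with Python's standard `configparser` and then tries the resulting string as a regex. It is an assumption that ArchiveBox reads the file in a comparable way, so this only shows what an INI parser and `re` would do with these lines, not what ArchiveBox does. The test URL and the `read_value` helper are hypothetical.

```python
import configparser
import re

REGEX = r'^https?:\/\/(www\.google\..+|\w+\.youtube\.com|\w+\.etools\.ch).*$'

# The three config lines tried above.
variants = {
    "r-prefixed and quoted": "URL_BLACKLIST = r'" + REGEX + "'",
    "quoted": "URL_BLACKLIST = '" + REGEX + "'",
    "bare": "URL_BLACKLIST= " + REGEX,
}


def read_value(line: str) -> str:
    """Parse a small INI document and return the raw URL_BLACKLIST string."""
    # interpolation=None keeps characters such as % literal (a choice made here).
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string("[GENERAL_CONFIG]\nTIMEOUT = 20\n" + line + "\n")
    return parser["GENERAL_CONFIG"]["URL_BLACKLIST"]


TEST_URL = "https://www.google.com/search?q=archivebox"  # hypothetical

results = {}
for name, line in variants.items():
    value = read_value(line)
    matched = re.search(value, TEST_URL) is not None
    results[name] = (value, matched)
    print(f"{name}: value={value!r} matches={matched}")
```

With a plain INI parser the quotes and the `r` prefix stay part of the value, so only the bare variant matches the test URL when the string is used as a regex unchanged.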

I also noticed that the regex in the URL_BLACKLIST documentation seems to have an unclosed single quote. Could someone verify this?

I'm using ArchiveBox v0.6.2.
