Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve HTML filtering rules $$ -- allow CSS-like selectors #94

Closed
Alex-302 opened this issue Oct 16, 2017 · 12 comments
Closed

Improve HTML filtering rules $$ -- allow CSS-like selectors #94

Alex-302 opened this issue Oct 16, 2017 · 12 comments

Comments

@Alex-302
Copy link
Member

Alex-302 commented Oct 16, 2017

Add ability to use syntax, similar to CSS.
For example:
example.com$$td[id="main_content"] > span[id^="post_"][tag-content="advertising"]
and
example.com$$td[id="main_content"] > span[id^="post_"][wildcard="<a href=\"*==""]
and
example.com$$[id="main_content"] > span[id^="post_"][wildcard="<a href=\"*==""]
if tag of [id="main_content"] is random(see 4pda.ru)

@ameshkov ameshkov added this to the v2.0 milestone Jul 8, 2019
@ameshkov ameshkov changed the title Improve HTML filtering rules $$ Improve HTML filtering rules $$ -- allow CCS-like selectors Jul 8, 2019
@Alex-302 Alex-302 changed the title Improve HTML filtering rules $$ -- allow CCS-like selectors Improve HTML filtering rules $$ -- allow CSS-like selectors Oct 16, 2019
@ngorskikh
Copy link
Member

Propose to change the current spec in the following way.

Replace

rule = [domains] "$$" tagName [attributes]

with

elementSelector = (tagName [attributes]) / attributes
childCombinator = ">"
rule = [domains] "$$" elementSelector *(childCombinator elementSelector)

.

Replace:

`tagName` — name of the element in lower case, for example, `div` or `script`.

with:

`tagName` — name of the element in lower case, for example, `div` or `script`. An empty `tagName` matches any element.

.

Add to the description of rule parts:

`childCombinator` — an operator that works similarly to the [CSS child combinator](insert link to MDN): that is, the `elementSelector` on the right of the `childCombinator` will only match an element whose direct parent matches the `elementSelector` on the left of the `childCombinator`.

.

Add to the description of the tag-content special attribute:

Due to the way HTML filtering works, the `tag-content` attribute is **not** supported in an `elementSelector` that stands to the left of a `childCombinator`: such rules will be rejected.

.

@ameshkov @Alex-302 @sfionov Discuss.

@ameshkov
Copy link
Member

ameshkov commented Aug 4, 2023

I'd like us to do some improvements alongside this.

  1. First of all, add :contains() pseudo support.
  2. Second, remove native tag-content and wildcard support. Instead of that make them convert to :contains implicitly and mark them as deprecated in the knowledge base.
  3. Open an issue in the tsurlfilter when this is done.

@Alex-302
Copy link
Member Author

Alex-302 commented Aug 4, 2023

Do they have max-length limit for custom value?

@ameshkov
Copy link
Member

ameshkov commented Aug 6, 2023

@Alex-302 this is a very good point!

Btw, do you think it makes sense to extend :contains or :has-text and introduce the max-length there?

@Alex-302
Copy link
Member Author

Alex-302 commented Aug 6, 2023

I'm not sure. Does the length of the removing script or length limit affect applying performance?
If not, it's probably not necessary.

Will there be support for regular expressions in :contains / :has-text?

@ameshkov
Copy link
Member

ameshkov commented Aug 6, 2023

If not, it's probably not necessary.

@ngorskikh Which div will be removed when there's a structure like the one below and a rule like this: $$div:contains(trying to remove)?

If we can guarantee that the inner one (and not the outer one) will be removed, then I guess it's okay to simply drop max-length support. But we should document the logic.

Also, we should check existing rules and decide what we do with them. Do we simply remove max-length from these rules or let CoreLibs handle that attribute and ignore it?

<div>
    ...
    lots of content
    ...
    <div>
        the div that you're actually trying to remove.
    </div>
    ...
    lots of content
    ...
</div>

Will there be support for regular expressions in :contains / :has-text?

Yes, sure.

@ngorskikh
Copy link
Member

ngorskikh commented Aug 7, 2023

@ameshkov In CoreLibs, HTML rules are matched against the unmodified HTML stream, which means that in your example, both div elements match and are, as a consequence, deleted. That, probably, is the rationale behind the max-length attribute: to try and disambiguate such cases.

By the way, the same sort of ambiguity exists in CSS rules: for example, the :contains pseudo-class uses the textContent attribute of a node for matching, which is a concatenation of text contents of each child. Meaning that in your example, the outer div also matches and can be hidden by a CSS rule.

@ameshkov
Copy link
Member

ameshkov commented Aug 7, 2023

@ngorskikh makes sense, thank you.

@Alex-302 can there be any trouble because of that? Is max-length used to disambiguate in some cases?

@Alex-302
Copy link
Member Author

Alex-302 commented Aug 7, 2023

@ameshkov

can there be any trouble because of that? Is max-length used to disambiguate in some cases?

Usually we try to make the matching specific. But sometimes we apply a max-length limit slightly larger than the size of the script.

@ameshkov
Copy link
Member

ameshkov commented Aug 7, 2023

@Alex-302 generally, it does not matter much. Also, if there will be regex support, you can do something like that: $$script:contains(/.*stringtosearch.{10,1000}/)

@Alex-302
Copy link
Member Author

Alex-302 commented Aug 7, 2023

Ok, that's what I meant by the regexp question.

@ngorskikh
Copy link
Member

tsurlfilter task: AdguardTeam/tsurlfilter#96
Knowledge base PR: AdguardTeam/KnowledgeBase#264

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment