Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

score_selector should find the innermost match. #2152

Closed
Thom1729 opened this issue Jan 6, 2018 · 26 comments
Closed

score_selector should find the innermost match. #2152

Thom1729 opened this issue Jan 6, 2018 · 26 comments

Comments

@Thom1729
Copy link

Thom1729 commented Jan 6, 2018

Summary

The sublime.score_selector API method (and the View.score_selector method) should always give higher scores to more specific matches.

Expected behavior

When scoring multiple selectors at a point, I expect the highest scoring selector to be the one that is most specific (given that it matches). What “most specific” means to me is that each part of the selector should match the rightmost possible component of the given scope. (This can be done greedily in linear time.)

Actual behavior

In at least some circumstances, a “less specific” selector scores higher than a “more specific” selector. This poses challenges in some circumstances when creating color schemes. See:

The present behavior has confused some color scheme authors, although to my knowledge the confusion has not been formally articulated before. The API documentation does not specify the exact behavior, so it's hard to call this issue a bug in the strictest sense. However, I think that the new behavior suggested here would be the most intuitive for users and the most useful for developers.

Steps to reproduce

To illustrate the issue, use the following .sublime-color-scheme:

{
    "rules": [
        {
            "scope": "meta.structure.dictionary punctuation.section",
            "foreground": "#ff0000",
        },
        {
            "scope": "meta.structure.array punctuation.section",
            "foreground": "#00ff00",
        },
    ]
}

Then observe the following file with the core JSON syntax:

{
    "foo": [
        {}
    ]
}

[
    {
        "foo": []
    }
]

I expect all of the curly braces to be red and all of the square brackets to be green. However, the innermost braces are the “wrong” color.

(It may be observed that this example is not the most effective way of highlighting JSON punctuation, because we can distinguish curly braces from square brackets directly. However, that solution does not apply to more complex cases in other languages.)

Environment

  • Mac OS Sierra 10.12.6 (16G1114)
  • Sublime Text dev build 3156
@borela
Copy link

borela commented Jan 6, 2018

This behavior also happens on Windows 7/10 x64 and Debian 9 x64.

@Thom1729
Copy link
Author

Thom1729 commented Jan 6, 2018

For reference, here's an example Python implementation of what I believe to be the current behavior:

def score_selector_leftmost(scope, selector):
    scope = enumerate(scope.split(' '), start=1)
    selector = selector.split(' ')

    score = 0
    for i, part in scope:
        if len(selector) == 0: break

        if part.startswith(selector[0]):
            selector = selector[1:]
            score += 8 ** i

    return score

And the preferred behavior:

def score_selector_rightmost(scope, selector):
    scope = enumerate(scope.split(' '), start=1)
    selector = selector.split(' ')

    scope = reversed(list(scope))
    selector = list(reversed(selector))

    score = 0
    for i, part in scope:
        if len(selector) == 0: break

        if part.startswith(selector[0]):
            selector = selector[1:]
            score += 8 ** i

    return score

Tests:

TESTS = [
    ('function class function punctuation', 'function punctuation'),
    ('function class function punctuation', 'class punctuation'),
]

for scope, selector in TESTS:
    print(score_selector_leftmost(scope, selector), score_selector_rightmost(scope, selector))

score_selector_leftmost returns the lowest-scoring way to match the scope to the selector, and score_selector_rightmost returns the highest-scoring. In the cases given by the tests, this affects which selector scores the highest.

I assume that the real implementation is C++, but I hope that the change may be as simple as twiddling some indexes.

@keith-hall
Copy link
Collaborator

To highlight the difference in a simpler/quicker way - no need to create color schemes etc, just execute the following in the ST console:

>>> sublime.score_selector('source.json meta.structure.dictionary.json meta.structure.dictionary.value.json meta.structure.array.json meta.structure.dictionary.json punctuation.section.dictionary.begin.json', 'meta.structure.dictionary punctuation.section')
524480
>>> sublime.score_selector('source.json meta.structure.dictionary.json meta.structure.dictionary.value.json meta.structure.array.json meta.structure.dictionary.json punctuation.section.dictionary.begin.json', 'meta.structure.array punctuation.section')
536576

@borela
Copy link

borela commented Mar 25, 2018

Any news on this one? Fixing this will benefit many languages(C, C++, JavaScript, MQL4 & 5) as you'll be able to give different colors to the punctuation and for example, differentiate an array literal vs access, object literal vs simple block, etc...

@borela
Copy link

borela commented Jul 10, 2018

@wbond Is the scoring function inside the executable or is it a default plugin that I can fix?

@Thom1729
Copy link
Author

It's not anywhere accessible, such as a default package, and it's used in core features like syntax highlighting, so it must be in the executable.

@brupelo
Copy link

brupelo commented Mar 11, 2019

@borela The scoring function as well as other API internal functions if I'm not mistaken is living in plugin_host.exe. I've made some little analysis about that particular function in this thread in order to figure out how it works behind the curtains but I haven't been able yet to figure out the underlying algorithm.

@keith-hall
Copy link
Collaborator

Why would you assume in plugin_host when the methods are used internally and not just part of plugins? Seems more likely they are in the core executable..

@brupelo
Copy link

brupelo commented Mar 11, 2019

Just check the plugin_host strings, ie: strings plugin_host.exe . And then grep for api_ . Also, attach your debugger to plugin_host process and look around... You'll see the name of those functions correspond to the sublime_api bultin module

@FichteFoll
Copy link
Collaborator

FichteFoll commented Mar 11, 2019

Just because the names are there doesn't imply the algorithm being defined there. Or rather, of course there need to be strings for the functions exposed by the sublime_api module because they need to be defined somewhere, but because the plugin host doesn't have knowledge of the view's contents it makes more sense to compute this on the core side instead, at least for View.score_selector. It could be possible that the sublime.score_selector function has a mirror implementation on the plugin_host side because it is contextless, but I still wonder what you want to make with this information. Do you intend to reverse-engineer it?

@brupelo
Copy link

brupelo commented Mar 11, 2019

My main goal right now is to understand how extract_scope works, by doing so I'll be able to duplicate the toggle_comment command 1:1 on a scintilla widget. Reason why I felt knowing how score_selector works was important is mainly because I felt that function could be helpful to understand extract_scope. That said, being able to know exactly where these api_* core functions are defined exactly using the debugger would be amazing, as right now when trying to reverse specific sublime functions Im using some sort of statitical analysis inference based on datasets... Which is pretty much a really slow way to reverse thse simple algorithms 😥.

If you have relevant information, ideas, thoughts or any hypothesis about all of this please let me know and I'll check it out when I'm back home.

@Thom1729
Copy link
Author

Thom1729 commented Mar 26, 2019

After some more experimentation, this seems to be the algorithm that Sublime uses for score_selector:

def split_scope(scope):
    return [
        re.split(r'\.', part)
        for part in re.split(r'\s+', scope)
        if part
    ]

def score_selector(scope, selector):
    return _score_selector_leftmost(
        split_scope(scope),
        split_scope(selector)
    )

def _score_selector_leftmost(scope, selector):
    score = 0

    scope_count = len(scope)
    selector_count = len(selector)

    scope_index = 0
    selector_index = 0

    if selector_count == 0:
        return 1

    while True:
        if selector_index >= selector_count:
            return score
        elif scope_index >= scope_count:
            return 0

        part_score = _score_part(scope[scope_index], selector[selector_index])

        if part_score > 0:
            score += part_score << (3 * min(scope_index+1, 20))
            score &= ((1 << 64) - 1) # Keep 64 bits

            selector_index += 1
        
        scope_index += 1

def _score_part(scope_part, selector_part):
    scope_len = len(scope_part)
    selector_len = len(selector_part)

    score = 0
    i = 0

    while True:
        if i >= selector_len:
            return min(score, 7)
        elif i >= scope_len or scope_part[i] != selector_part[i]:
            return 0
        else:
            score += 1
            i += 1

Written in a C-ish style (sorry). Notes:

  • Operators like & are omitted. Those are probably handled at a slightly higher level.
  • This algorithm runs in guaranteed linear time (in the length of the selector).
  • Sublime interns the strings. I'm not sure of the precise details.
  • Each dot-delimited scope contributes three bits of information, representing the length of the prefix match. If more than seven dotted terms match, the remainder will not contribute to the score (though if they do not match, _score_part will still return 0).
  • The result seems to always be congruent to zero mod 8, except that an empty selector scores 1. Two bits are therefore wasted.
  • When there is a match beyond the 20th term of the scope, the bits from that match are added to the score at the position for the 20th term. The result is truncated to 64 bits.

@brupelo
Copy link

brupelo commented Mar 26, 2019

@Thom1729 You're the man here! Your post/experiment is quite awesome, hats off... few weeks ago I'd tried to reverse engineer that one and failed miserably... you can read about it here. When I find a moment I'll compare it against Sublime's but I believe you've already done so...

Btw, did you know about the existence of pyblime, guys like you who're curious about how things works behind the curtains would fit extremely well in our growing at fast pace team :D (take my marketing sentece with a little grain of salt).

Anyway, thank you very much for all your reserach! high quality content there ;)

@Thom1729
Copy link
Author

FYI, I've updated the code to handle overflow in the same way as Sublime. Not really sure why it works that way, but it seems to be exact.

@brupelo
Copy link

brupelo commented Mar 26, 2019

@Thom1729 Hats off man... pretty impressive you've reversed this algo... May I ask you something out of curious? Which approach did you take to be able to figure out the original behaviour? Definitely the approach I'd taken was not good :)

@Thom1729
Copy link
Author

I already had the impression from somewhere that the score was a sum of bits, where each bit represented a successful match. By observation, score_selector('a', 'a') == 8, and score_selector('x a', 'a') == 64, which meant that each part was using three bits. The obvious guess was that the first sub-part would use the first bit, the second sub-part the second bit, and so on. This worked for a and a.b, but not for a.b.c and above. Instead, the three bits turned out to be a three-bit unsigned integer representing the index of the number of sub-parts matched. This is more space-efficient. I thought that the group might expand if there were more than seven sub-parts, but it didn't; extra sub-parts do not contribute to the score.

It's a bit odd that the first group starts at 8. The obvious choice is 1, the first bit. But then an empty selector would score zero, which is used to mean failure, even though an empty selector logically should match everything. It seems to be a special case that the empty selector scores 1. It gets an entire group to itself, even though it doesn't need it -- two bits are wasted.

Anyway, I knew that the system failed in some way when there were too many nested scopes, which sounded like an overflow issue. To investigate this, I tested score_selector('1.2.3.4.5.6.7 '*20, '1.2.3.4.5.6.7 '*20) (seeing your post in syntect) and then score_selector('1.2.3.4.5.6.7 '*21, '1.2.3.4.5.6.7 '*21). This confirmed that the odd behavior happened at the point of overflow. (The empty selector handling gets three bits, and 20*3 more is 63. 21 parts overflows it.) But the behavior wasn't a simple overflow. I formatted those two outputs with '{:064b}' so that they'd line up, and the 21 case looked like the 20 case with an extra 111 added on top of the last 111 from the 20 case, turning 0111 into 1110. I verified that additional parts added their scores at the same position. Of course, when the number would actually overflow, the leading bits were simply truncated.

If the empty selector handling only used one bit, then we could use the other 63 bits to fit 21 groups without issue. The 22nd would still break. Also, the score wouldn't look so friendly in octal because the three-bit groups wouldn't line up properly.

The algorithm itself is just the result of writing this down in Python while thinking about C. Often, when there's weird behavior in Sublime, you can find the answer by thinking about the most obvious way to write a C program that would reproduce the known behavior. As often as not, that program handles unknown cases the same as Sublime. This way of thinking was especially useful when I wrote SublimeSyntaxParser.

@brupelo
Copy link

brupelo commented Mar 26, 2019

@Thom1729 Wow, love it... that material deserves to live in a nice blog article... it's the type of articles I love reading actually... the approach you've taken driven by your hypothesis has been definitely much more optimal than mine based on gathering data and observation, which leaded me nowhere. Also, I liked specially this part:

The algorithm itself is just the result of writing this down in Python while thinking about C. Often, when there's weird behavior in Sublime, you can find the answer by thinking about the most obvious way to write a C program that would reproduce the known behavior. As often as not, that program handles unknown cases the same as Sublime. This way of thinking was especially useful when I wrote SublimeSyntaxParser.

Which makes total sense... in fact, when I was trying to figure out the behaviour of insert/erase/replace I used a similar approach, as my "starting point" was just assuming the maths of these algorithms should be something extremely simple as they behave extremely fast, therefore the operations involved shouldn't be complex at all, thinking that way really helped that time to find equivalent functions.

Talking of which, you seem to be well versed on Sublime's internal behaviour as well as parsing & scoping... sorry to bother with more questions but... maybe you know something about #2691 ? Being able to figure out how extract_scope works would be pretty amazing as by doing so I'd be able to use toggle_comment.py code in pyblime . Anyway, as I said, thank you very much for the explanation... it's been a really nice reading, glad I'm not the only one who's interested on these tiny but extremely interesting puzzles ;)

@Thom1729
Copy link
Author

Circling back to the original topic, if the C implementation of score_selector is roughly similar to the above C-inspired Python implementation, then fixing this bug is a simple matter of reversing the order of iteration in _score_selector_leftmost:

def _score_selector_rightmost(scope, selector):
    score = 0

    scope_count = len(scope)
    selector_count = len(selector)

    scope_index = scope_count - 1
    selector_index = selector_count - 1

    if selector_count == 0:
        return 1

    while True:
        if selector_index < 0:
            return score
        elif scope_index < 0:
            return 0

        part_score = _score_part(scope[scope_index], selector[selector_index])

        if part_score > 0:
            score += part_score << (3 * min(scope_index+1, 20))
            score &= ((1 << 64) - 1) # Keep 64 bits

            selector_index -= 1
        
        scope_index -= 1

@brupelo
Copy link

brupelo commented Mar 28, 2019

@Thom1729 I've tested your above routine in pyblime and i've found some corner cases that won't behave as Sublime, you can see them in this unit-test commented. Just check these ones:

if __name__ == '__main__':
    print(score_selector("a " * 129, "a " * 129))
    print(score_selector("a.a " * 86, "a.a " * 86))
    print(score_selector("a.a.a " * 65, "a.a.a " * 65))

so you'll see your routine will give:

16305604136582550088
7246935171814466704
12023324262328547032

In any case, it's almost a perfect 1:1 routine ;) . Btw, added proper credits here... hopefully you'll get interested in the project and become a contributor though as your skills seem to be the real deal ;)

Ps. Again, thanks for the good job

@Thom1729
Copy link
Author

It looks like the scope or selector foo.bar baz is encoded as [X, Y, $, Z, $], where X, Y, and Z are references to foo, bar, and baz, and $ is a terminator. Sublime encodes scopes and selectors into an array with a capacity of 256 units. When a scope or selector would overflow this capacity, it is treated as empty.

If I have time, I'll see about modifying the algorithm accordingly.

@borela
Copy link

borela commented Nov 24, 2019

@FichteFoll Please add this to the next dev cycle, having this feature fixed would simplify color schemes e and allow us to target punctuations for languages where [] {} can mean different things in different contexts.

@keith-hall
Copy link
Collaborator

adding a milestone would be misleading unless SublimeHQ are actually going to work on it in this dev cycle...

@borela
Copy link

borela commented Nov 24, 2019

@keith-hall I know that but since he is adding specific issues to milestones I thought he had enough rapport with the core devs to ask for this.

@FichteFoll
Copy link
Collaborator

FichteFoll commented Nov 26, 2019

All issues in the "next dev cycle" milestone are confirmed to have been looked into by a core dev and to be addressed by the "next dev cycle" releases. The issues I add to that have either been mentioned by a dev on the forum, Discord, or otherwise. wbond also adds the milestone to issues directly (when he can).

@FichteFoll
Copy link
Collaborator

FichteFoll commented Feb 27, 2024

Old behavior (via):

>>> sublime.score_selector("meta.mapping meta.sequence meta.mapping punctuation.separator.sequence", "meta.mapping punctuation.separator.sequence")
12304
>>> sublime.score_selector("meta.mapping meta.sequence meta.mapping punctuation.separator.sequence", "meta.sequence punctuation.separator.sequence")
12416

New behavior in 4173:

>>> sublime.score_selector("meta.mapping meta.sequence meta.mapping punctuation.separator.sequence", "meta.mapping punctuation.separator.sequence")
1664
>>> sublime.score_selector("meta.mapping meta.sequence meta.mapping punctuation.separator.sequence", "meta.sequence punctuation.separator.sequence")
1552

🎉

@FichteFoll FichteFoll added this to the Build 4173 milestone Feb 27, 2024
@FichteFoll
Copy link
Collaborator

And with the example given in this issue:

>>> sublime.score_selector('source.json meta.structure.dictionary.json meta.structure.dictionary.value.json meta.structure.array.json meta.structure.dictionary.json punctuation.section.dictionary.begin.json', 'meta.structure.dictionary punctuation.section')
524480
>>> sublime.score_selector('source.json meta.structure.dictionary.json meta.structure.dictionary.value.json meta.structure.array.json meta.structure.dictionary.json punctuation.section.dictionary.begin.json', 'meta.structure.array punctuation.section')
536576

Now:

>>> sublime.score_selector('source.json meta.structure.dictionary.json meta.structure.dictionary.value.json meta.structure.array.json meta.structure.dictionary.json punctuation.section.dictionary.begin.json', 'meta.structure.dictionary punctuation.section')
77824
>>> sublime.score_selector('source.json meta.structure.dictionary.json meta.structure.dictionary.value.json meta.structure.array.json meta.structure.dictionary.json punctuation.section.dictionary.begin.json', 'meta.structure.array punctuation.section')
67072

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants