Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking β€œSign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up the generation of no-member suggestions #10277

Merged
merged 4 commits into from
Mar 25, 2025

Conversation

correctmost
Copy link
Contributor

Type of Changes

Type
βœ“ πŸ”¨ Refactoring

Description

Previously, edit distances were calculated for strings that had length differences greater than the maximum suggestion threshold.

Related PR:

Stats

I profiled yt-dlp (commit bd0a6681) with the following .pylintrc file:

[MAIN]
jobs=1

[MESSAGES CONTROL]
disable=all
enable=no-member

[REPORTS]
reports=no
score=no

Before

Command Mean [s] Min [s] Max [s] Relative
pylint --recursive=y . 39.161 Β± 0.175 38.929 39.548 1.00
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   103292    3.706    0.000    5.653    0.000 pylint/checkers/typecheck.py:153(_string_distance)
 19623350    1.938    0.000    1.938    0.000 {built-in method builtins.min}
      245    0.039    0.000    5.747    0.023 pylint/checkers/typecheck.py:171(_similar_names)

After

Command Mean [s] Min [s] Max [s] Relative
pylint --recursive=y . 37.009 Β± 0.249 36.650 37.333 1.00
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    15159    0.460    0.000    0.704    0.000 pylint/checkers/typecheck.py:153(_string_distance)
  2429889    0.244    0.000    0.244    0.000 {built-in method builtins.min}
      245    0.035    0.000    0.807    0.003 pylint/checkers/typecheck.py:175(_similar_names)

Previously, edit distances were calculated for strings that had
length differences greater than the maximum suggestion threshold.
Copy link

codecov bot commented Mar 15, 2025 β€’

Codecov Report

All modified and coverable lines are covered by tests βœ…

Project coverage is 95.89%. Comparing base (3457433) to head (ab59393).
Report is 1 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##             main   #10277   +/-   ##
=======================================
  Coverage   95.89%   95.89%           
=======================================
  Files         175      175           
  Lines       19073    19082    +9     
=======================================
+ Hits        18290    18299    +9     
  Misses        783      783           
Files with missing lines Coverage Ξ”
pylint/checkers/typecheck.py 96.67% <100.00%> (+0.03%) ⬆️
πŸš€ New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

This comment has been minimized.

This comment has been minimized.

Copy link
Member

@Pierre-Sassoulas Pierre-Sassoulas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, a 5+% improvement ! That's nice.

It seems like a well known problems that we should not have to optimize ourselves though.

Would you mind comparing this with using difflib.SequenceMatcher ? I suggested something but it might not directly work. And I don't like adding dependencies but maybe using https://pypi.org/project/edlib/ would be worth it (it seems it's faster than difflib)..

@correctmost
Copy link
Contributor Author

Would you mind comparing this with using difflib.SequenceMatcher ?

difflib.get_close_matches is about 200ms faster than the PR code.

One drawback is that it may be hard to map existing missing-member-hint-distance values to cutoff ratios because of the different distance algorithms and because of floating point math. For example, I would expect "buff" not to match "buffer" in the first two cases:

import difflib

difflib.get_close_matches("buff", ["buffer"], n=3, cutoff=0.75)
-> ['buffer']

difflib.get_close_matches("buff", ["buffer"], n=3, cutoff=0.80)
-> ['buffer']

difflib.get_close_matches("buff", ["buffer"], n=3, cutoff=0.80000001)
-> []

Here's a separate case where I would expect a match but don't see one:

import difflib

difflib.get_close_matches("x", ["z"], n=3, cutoff=0.00001)
-> []

Python's own attribute suggestion code does not seem to use difflib:
https://github.com/python/cpython/blob/a931a8b32415f311008dbb3f09079aae1e6d7a3d/Lib/traceback.py#L1538-L1545


Here's the patch I used:
diff --git a/pylint/checkers/typecheck.py b/pylint/checkers/typecheck.py
index edef5188b..ab280e38b 100644
--- a/pylint/checkers/typecheck.py
+++ b/pylint/checkers/typecheck.py
@@ -6,9 +6,8 @@
 
 from __future__ import annotations
 
-import heapq
+import difflib
 import itertools
-import operator
 import re
 import shlex
 import sys
@@ -170,7 +169,7 @@ def _string_distance(seq1: str, seq2: str) -> int:
 
 def _similar_names(
     owner: SuccessfulInferenceResult,
-    attrname: str | None,
+    attrname: str,
     distance_threshold: int,
     max_choices: int,
 ) -> list[str]:
@@ -179,31 +178,24 @@ def _similar_names(
     The similar names are searched given a distance metric and only
     a given number of choices will be returned.
     """
-    possible_names: list[tuple[str, int]] = []
-    names = _node_names(owner)
+    if not attrname:
+        return []
 
-    for name in names:
-        if name == attrname:
-            continue
+    names = _node_names(owner)
+    possible_names = [name for name in names if name != attrname]
 
-        distance = _string_distance(attrname or "", name)
-        if distance <= distance_threshold:
-            possible_names.append((name, distance))
+    ratio = distance_threshold / len(attrname)
+    if distance_threshold == 1 and ratio == 1:
+        ratio = 0.5
 
-    # Now get back the values with a minimum, up to the given
-    # limit or choices.
-    picked = [
-        name
-        for (name, _) in heapq.nsmallest(
-            max_choices, possible_names, key=operator.itemgetter(1)
-        )
-    ]
-    return sorted(picked)
+    threshold = 1 - ratio - 0.0001
+    choices = difflib.get_close_matches(attrname, possible_names, max_choices, threshold)
+    return sorted(choices)
 
 
 def _missing_member_hint(
     owner: SuccessfulInferenceResult,
-    attrname: str | None,
+    attrname: str,
     distance_threshold: int,
     max_choices: int,
 ) -> str:
@@ -1209,7 +1201,7 @@ accessed. Python regular expressions are accepted.",
             if self.linter.config.missing_member_hint:
                 hint = _missing_member_hint(
                     owner,
-                    node.attrname,
+                    node.attrname or "",
                     self.linter.config.missing_member_hint_distance,
                     self.linter.config.missing_member_max_choices,
                 )

@Pierre-Sassoulas
Copy link
Member

Neat.

It seems traceback use levenshtein distance while difflib use an algo that "does not yield minimal edit sequences, but does tend to yield matches that β€œlook right” to people." (can't link the part of the doc specifically; https://docs.python.org/3/library/difflib.html)

Which would make https://pypi.org/project/edlib/ a better candidate as it's an optimized levenshtein distance ? Or we can copy paste the traceback code and not deal with a dependency if there isn't a big difference in performance ?

@Pierre-Sassoulas Pierre-Sassoulas modified the milestones: 3.3.6, 3.3.7 Mar 20, 2025
@correctmost
Copy link
Contributor Author

I re-ran the benchmarks on an updated system (which is a bit slower now):

Baseline (main)

Command Mean [s] Min [s] Max [s] Relative
pylint --recursive=y . 41.209 Β± 0.157 40.921 41.426 1.00

Current PR

Command Mean [s] Min [s] Max [s] Relative
pylint --recursive=y . 39.484 Β± 0.194 39.195 39.816 1.00

traceback._levenshtein_distance

Command Mean [s] Min [s] Max [s] Relative
pylint --recursive=y . 39.363 Β± 0.195 39.168 39.831 1.00

edlib

Command Mean [s] Min [s] Max [s] Relative
pylint --recursive=y . 39.296 Β± 0.149 39.045 39.481 1.00

My preference is to move forward with the PR branch for these reasons:

  • I don't think the additional speed-up is worth adding a dependency on edlib. Also, the performance gains would be greatest for codebases that don't actively fix Pylint warnings :) (i.e., codebases with lots of no-member errors).
  • The traceback._levenshtein_distance implementation assigns different substitution costs for moves (cost of 2) and case differences (cost of 1). This makes it tricky to honor existing missing-member-hint-distance values in a backward-compatible manner.
  • Copying the traceback._levenshtein_distance code into Pylint would require reading and updating license text, which is a bit tedious.

Patches tested

traceback._levenshtein_distance patch
diff --git a/pylint/checkers/typecheck.py b/pylint/checkers/typecheck.py
index edef5188b..dddfa7da0 100644
--- a/pylint/checkers/typecheck.py
+++ b/pylint/checkers/typecheck.py
@@ -12,6 +12,7 @@ import operator
 import re
 import shlex
 import sys
+import traceback
 from collections.abc import Callable, Iterable
 from functools import cached_property, singledispatch
 from re import Pattern
@@ -170,7 +171,7 @@ def _string_distance(seq1: str, seq2: str) -> int:
 
 def _similar_names(
     owner: SuccessfulInferenceResult,
-    attrname: str | None,
+    attrname: str,
     distance_threshold: int,
     max_choices: int,
 ) -> list[str]:
@@ -182,12 +183,15 @@ def _similar_names(
     possible_names: list[tuple[str, int]] = []
     names = _node_names(owner)
 
+    # Multiply by two because _levenshtein_distance considers move changes to have a cost of 2
+    max_cost = distance_threshold * 2
+
     for name in names:
         if name == attrname:
             continue
 
-        distance = _string_distance(attrname or "", name)
-        if distance <= distance_threshold:
+        distance = traceback._levenshtein_distance(attrname, name, max_cost)
+        if distance <= max_cost:
             possible_names.append((name, distance))
 
     # Now get back the values with a minimum, up to the given
@@ -203,7 +207,7 @@ def _similar_names(
 
 def _missing_member_hint(
     owner: SuccessfulInferenceResult,
-    attrname: str | None,
+    attrname: str,
     distance_threshold: int,
     max_choices: int,
 ) -> str:
@@ -1209,7 +1213,7 @@ accessed. Python regular expressions are accepted.",
             if self.linter.config.missing_member_hint:
                 hint = _missing_member_hint(
                     owner,
-                    node.attrname,
+                    node.attrname or "",
                     self.linter.config.missing_member_hint_distance,
                     self.linter.config.missing_member_max_choices,
                 )
edlib patch
diff --git a/pylint/checkers/typecheck.py b/pylint/checkers/typecheck.py
index edef5188b..594d16b56 100644
--- a/pylint/checkers/typecheck.py
+++ b/pylint/checkers/typecheck.py
@@ -20,6 +20,7 @@ from typing import TYPE_CHECKING, Any, Literal, Union
 import astroid
 import astroid.exceptions
 import astroid.helpers
+import edlib
 from astroid import arguments, bases, nodes, util
 from astroid.nodes import _base_nodes
 from astroid.typing import InferenceResult, SuccessfulInferenceResult
@@ -170,7 +171,7 @@ def _string_distance(seq1: str, seq2: str) -> int:
 
 def _similar_names(
     owner: SuccessfulInferenceResult,
-    attrname: str | None,
+    attrname: str,
     distance_threshold: int,
     max_choices: int,
 ) -> list[str]:
@@ -186,7 +187,7 @@ def _similar_names(
         if name == attrname:
             continue
 
-        distance = _string_distance(attrname or "", name)
+        distance = edlib.align(attrname, name)["editDistance"]
         if distance <= distance_threshold:
             possible_names.append((name, distance))
 
@@ -203,7 +204,7 @@ def _similar_names(
 
 def _missing_member_hint(
     owner: SuccessfulInferenceResult,
-    attrname: str | None,
+    attrname: str,
     distance_threshold: int,
     max_choices: int,
 ) -> str:
@@ -1209,7 +1210,7 @@ accessed. Python regular expressions are accepted.",
             if self.linter.config.missing_member_hint:
                 hint = _missing_member_hint(
                     owner,
-                    node.attrname,
+                    node.attrname or "",
                     self.linter.config.missing_member_hint_distance,
                     self.linter.config.missing_member_max_choices,
                 )

Copy link
Member

@Pierre-Sassoulas Pierre-Sassoulas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the benchmarks ! I agree with your analysis.

DanielNoord
DanielNoord previously approved these changes Mar 24, 2025

This comment has been minimized.

@Pierre-Sassoulas Pierre-Sassoulas dismissed stale reviews from DanielNoord and themself via ab59393 March 25, 2025 12:27
@Pierre-Sassoulas Pierre-Sassoulas enabled auto-merge (squash) March 25, 2025 12:28
@Pierre-Sassoulas Pierre-Sassoulas merged commit 9a6168d into pylint-dev:main Mar 25, 2025
45 checks passed
github-actions bot pushed a commit that referenced this pull request Mar 25, 2025
Previously, edit distances were calculated for strings that had
length differences greater than the maximum suggestion threshold.
Remove handling of missing-member-hint-distance = 0

Co-authored-by: Pierre Sassoulas <pierre.sassoulas@gmail.com>
(cherry picked from commit 9a6168d)
Copy link
Contributor

πŸ€– According to the primer, this change has no effect on the checked open source code. πŸ€–πŸŽ‰

This comment was generated for commit ab59393

Pierre-Sassoulas pushed a commit that referenced this pull request Mar 25, 2025
Previously, edit distances were calculated for strings that had
length differences greater than the maximum suggestion threshold.
Remove handling of missing-member-hint-distance = 0

Co-authored-by: Pierre Sassoulas <pierre.sassoulas@gmail.com>
(cherry picked from commit 9a6168d)

Co-authored-by: correctmost <134317971+correctmost@users.noreply.github.com>
@correctmost correctmost deleted the cm/speed-up-no-member branch March 25, 2025 19:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants