Update/pandas 3.0 #694
* fix error
* test
* test: action should pass with 3.14 and without 3.9 and 3.10
* other version mods
* test normal
* test as before
* test without 3.9 and 3.10. plus 3.14
* fix error
* test after fixing ruff errors with ruff format
* Fix remaining ruff issues
* replacing isinstance(A, (X, Y)) by isinstance(A, X | Y)
* Format code with ruff
* python tests passed
* fix pyupgrade
* reducing tests
* hook precommit
* added tests names
* test all versions at the same time
* Update shapash/explainer/multi_decorator.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update shapash/decomposition/contributions.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* changing versions for pyupgrade
* ruff
* fix: possible bug where if it was a Series instead of a DataFrame it would crash
* upgrade: more robust syntax
* update: readme
* fix: correcting fallback
* fix: fixing boolean mistake
* fix: fallback for when viewport_data = None
---------
Co-authored-by: 61153a <61153a@slhdg002.maif.local>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
| data_path = dirname(dirname(abspath(__file__))) | ||
| self.ds_titanic_clean = pd.read_pickle(join(data_path, "data", "clean_titanic.pkl")) | ||
| if int(pd.__version__.split(".")[0]) >= 3: | ||
| print("Using clean_titanic_pandas_3.pkl for pandas version >= 3") |
Please remove this print.
| ) | ||
| expected["pred"] = expected["pred"].astype(int) | ||
| assert not pd.testing.assert_frame_equal(expected, output) | ||
| assert not pd.testing.assert_frame_equal(expected, output, check_dtype=False) |
| dtype=object, | ||
| ) | ||
| assert not pd.testing.assert_frame_equal(expected, output) | ||
| assert not pd.testing.assert_frame_equal(expected, output, check_dtype=False) |
Pull request overview
This PR updates Shapash to remain compatible with pandas 3.0 behavioral changes, primarily by relaxing dtype assertions in tests, avoiding in-place mutation on read-only NumPy views, and adjusting LIME input types to avoid pandas indexing changes.
Changes:
- Relaxed multiple assert_frame_equal checks to ignore dtype differences across pandas 2.x/3.x.
- Updated LIME backend to pass NumPy arrays (instead of Series) to explain_instance() in some branches.
- Added a pandas-3-specific integration test fixture and widened the pandas dependency range to <4.0.0.
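As a minimal sketch of what the first change relaxes: the frames below are invented for illustration, and the int64/int32 mismatch is only a stand-in for the object versus string dtype difference the tests actually hit. Only pd.testing.assert_frame_equal and its check_dtype parameter come from the PR; frames_match is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Two frames with identical values but different column dtypes
# (illustrative stand-in for the object vs. string dtype difference).
expected = pd.DataFrame({"pred": np.array([0, 1, 1], dtype=np.int64)})
output = pd.DataFrame({"pred": np.array([0, 1, 1], dtype=np.int32)})


def frames_match(left: pd.DataFrame, right: pd.DataFrame, **kwargs) -> bool:
    """Hypothetical helper: True if assert_frame_equal does not raise."""
    try:
        pd.testing.assert_frame_equal(left, right, **kwargs)
    except AssertionError:
        return False
    return True


# The default comparison checks dtypes and therefore fails here.
strict_match = frames_match(expected, output)
# With check_dtype=False only the values (and labels) are compared.
relaxed_match = frames_match(expected, output, check_dtype=False)
```

assert_frame_equal returns None and raises AssertionError on a mismatch, so the call itself performs the check.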
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/unit_tests/utils/test_columntransformer_backend.py | Uses check_dtype=False in frame equality assertions for pandas 3 string dtype changes. |
| tests/unit_tests/utils/test_category_encoders_backend.py | Uses check_dtype=False in multiple inverse-transform assertions. |
| tests/unit_tests/explainer/test_smart_state.py | Switches to assert_frame_equal(..., check_dtype=False) for pandas 3 dtype inference. |
| tests/unit_tests/explainer/test_smart_plotter.py | Copies NumPy arrays before in-place sorting to avoid pandas 3 read-only arrays. |
| tests/unit_tests/explainer/test_smart_explainer.py | Uses check_dtype=False in dataframe comparisons. |
| tests/integration_tests/test_integration_inverse_tranform.py | Loads a pandas-3-specific pickle fixture based on pandas major version. |
| tests/data/clean_titanic_pandas_3.pkl | Adds a pandas 3-specific pickled dataset fixture. |
| shapash/backend/lime_backend.py | Passes x.loc[i].values into LIME for some branches to avoid pandas 3 Series integer-indexing changes. |
| pyproject.toml | Relaxes pandas upper bound from <3.0.0 to <4.0.0. |
| exp = explainer.explain_instance(x.loc[i].values, self.model.predict_proba, num_features=x.shape[1]) | ||
| lime_contrib.append({_transform_name(var_name[0], x): var_name[1] for var_name in exp.as_list()}) |
In run_explainer(), only the binary-classification and regression paths were updated to pass a NumPy array into explain_instance(). The multi-class classification path (the elif num_classes > 2: block further down) still calls explainer.explain_instance(x.loc[i], ...) with a pandas Series, so the pandas 3.0 integer-indexing breaking change can still trigger there. Please apply the same to_numpy()/array conversion consistently in the multi-class branch as well (and avoid reusing i as the inner-loop variable, since it currently shadows the outer loop index).
| else: | ||
| exp = explainer.explain_instance(x.loc[i], self.model.predict, num_features=x.shape[1]) | ||
| exp = explainer.explain_instance(x.loc[i].values, self.model.predict, num_features=x.shape[1]) |
The PR description says this fix uses .to_numpy(), but the code uses .values. For pandas objects, .to_numpy() is the recommended, explicit API (and allows copy= control), whereas .values can return different types depending on dtype and is discouraged. Consider switching to x.loc[i].to_numpy() (optionally with copy=True if LIME mutates the array) to align behavior across dtypes/pandas versions.
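A small sketch of the suggested conversion, using an invented two-row frame rather than the Shapash data. It assumes only that the consumer (LIME in the PR) wants a plain NumPy row; the column names, index labels, and variable names are hypothetical.

```python
import pandas as pd

# Invented feature frame; the columns are strings, so a row selected with
# .loc is a Series whose own index is made of strings.
x = pd.DataFrame(
    {"age": [22.0, 38.0], "fare": [7.25, 71.28]},
    index=["a", "b"],
)

# Explicit conversion to a NumPy array. copy=True guarantees an independent,
# writable array, which matters if the consumer mutates its input.
row = x.loc["a"].to_numpy(copy=True)
row[0] = 0.0  # does not touch the original frame

# If positional access on the Series itself is needed, .iloc states the
# intent explicitly instead of relying on integer keys.
first_feature = x.loc["a"].iloc[0]
```

Passing the array keeps positional access inside the consumer independent of how pandas resolves integer keys on a string-indexed Series.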
| if int(pd.__version__.split(".")[0]) >= 3: | ||
| self.ds_titanic_clean = pd.read_pickle(join(data_path, "data", "clean_titanic_pandas_3.pkl")) | ||
| else: | ||
| self.ds_titanic_clean = pd.read_pickle(join(data_path, "data", "clean_titanic.pkl")) |
Selecting a different pickle for pandas>=3 is fine, but clean_titanic_pandas_3.pkl appears to contain ArrowStringArray data (it references pyarrow in the pickle payload). This makes the integration test implicitly depend on pyarrow being installed when running with pandas 3, while the test extra in pyproject.toml doesn't include pyarrow. Either ensure pyarrow is part of the test dependencies, or regenerate the pandas-3 fixture using python string storage so it can be unpickled without pyarrow.
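One possible way to regenerate such a fixture is sketched below. It rests on several assumptions: the frame and column names are invented, an in-memory buffer stands in for the real pickle file, and it assumes the NA-backed pd.StringDtype(storage="python") is acceptable for the fixture, which the PR does not confirm.

```python
import io

import pandas as pd

# Invented stand-in for the Titanic fixture.
df = pd.DataFrame({"name": ["Braund", "Cumings"], "age": [22.0, 38.0]})

# Force Python-object string storage so the column is not backed by a
# pyarrow array when it is pickled (assumption: this dtype suits the tests).
python_strings = pd.StringDtype(storage="python")
df["name"] = df["name"].astype(python_strings)

# Round-trip through an in-memory buffer instead of a file on disk.
buffer = io.BytesIO()
df.to_pickle(buffer)
buffer.seek(0)
restored = pd.read_pickle(buffer)

restored_storage = restored["name"].dtype.storage
```

The alternative named in the comment, adding pyarrow to the test dependencies, avoids touching the fixture at all.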
| if int(pd.__version__.split(".")[0]) >= 3: | |
| self.ds_titanic_clean = pd.read_pickle(join(data_path, "data", "clean_titanic_pandas_3.pkl")) | |
| else: | |
| self.ds_titanic_clean = pd.read_pickle(join(data_path, "data", "clean_titanic.pkl")) | |
| pandas_major_version = int(pd.__version__.split(".")[0]) | |
| default_pickle_path = join(data_path, "data", "clean_titanic.pkl") | |
| if pandas_major_version >= 3: | |
| pandas_3_pickle_path = join(data_path, "data", "clean_titanic_pandas_3.pkl") | |
| try: | |
| self.ds_titanic_clean = pd.read_pickle(pandas_3_pickle_path) | |
| except (ImportError, ModuleNotFoundError) as exc: | |
| if "pyarrow" not in str(exc): | |
| raise | |
| self.ds_titanic_clean = pd.read_pickle(default_pickle_path) | |
| else: | |
| self.ds_titanic_clean = pd.read_pickle(default_pickle_path) |
Description
Fixes compatibility issues introduced by the pandas 3.0 migration.
Main changes
pandas 3.0 breaking changes addressed (#677)
- StringDtype as default for string columns: pandas 3.0 now infers StringDtype instead of object for string columns and indexes. Added check_dtype=False to the affected test assertions for consistent results across pandas 2.x and 3.x. This does not change the actual functionality, since the tests check values rather than types.
- Read-only arrays from .to_numpy(): pandas 3.0 returns read-only arrays from .to_numpy(). Fixed by adding .copy() after .to_numpy() calls where in-place operations (e.g. .sort()) are performed, in both compare_plot() and the corresponding test.
- Integer indexing on string-indexed Series: pandas 3.0 no longer falls back to positional indexing when accessing a Series with a string index using integer keys. Fixed by passing .to_numpy() instead of a pandas Series to LIME's explain_instance() in lime_backend.py (both classification and regression branches).
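The read-only array fix can be sketched as follows. The function name and data are hypothetical and this is not the code from compare_plot(); it only shows the .to_numpy().copy() pattern before an in-place sort.

```python
import numpy as np
import pandas as pd


def sorted_feature_values(series: pd.Series) -> np.ndarray:
    """Hypothetical helper: return the values of `series` in ascending order."""
    # .to_numpy() may hand back a read-only view of the pandas data;
    # .copy() yields an independent, writable array.
    values = series.to_numpy().copy()
    values.sort()  # in-place sort on the copy only
    return values


contributions = pd.Series([0.3, -0.1, 0.2])
result = sorted_feature_values(contributions)
```

Because the sort runs on the copy, the original Series keeps its order on both pandas 2.x and 3.x.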