Update FAQ about pandas #28930

lorentzenchr · 2024-05-02T09:14:06Z

Our FAQ is not up to date when it comes to pandas:

Why does scikit-learn not directly work with, for example, pandas.DataFrame?

The homogeneous NumPy and SciPy data objects currently expected are most efficient to process for most operations. Extensive work would also be needed to support Pandas categorical types. Restricting input to homogeneous types therefore reduces maintenance cost and encourages usage of efficient data structures.

Note however that ColumnTransformer makes it convenient to handle heterogeneous pandas dataframes by mapping homogeneous subsets of dataframe columns, selected by name or dtype, to dedicated scikit-learn transformers. Therefore ColumnTransformer is often used as the first step of scikit-learn pipelines when dealing with heterogeneous dataframes (see Pipeline: chaining estimators for more details).

See also Column Transformer with Mixed Types for an example of working with heterogeneous (e.g. categorical and numeric) data.
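For readers of this thread, here is a minimal sketch of the pattern the FAQ entry describes, where a heterogeneous dataframe is fed straight into a pipeline whose first step is a ColumnTransformer. The toy data, column names, and choice of transformers and classifier are assumptions made for illustration; they do not come from the FAQ.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric columns and one categorical column.
df = pd.DataFrame(
    {
        "age": [22.0, 35.0, 58.0, 41.0],
        "income": [28_000.0, 52_000.0, 61_000.0, 45_000.0],
        "city": ["Paris", "Berlin", "Paris", "Rome"],
    }
)
y = [0, 1, 1, 0]

# Route homogeneous column subsets to dedicated transformers:
# numeric columns are selected by dtype, the categorical one by name.
preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), make_column_selector(dtype_include=np.number)),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ]
)

# The ColumnTransformer is the first pipeline step; the dataframe is passed as-is.
model = make_pipeline(preprocessor, LogisticRegression())
model.fit(df, y)
predictions = model.predict(df)

# Inspect the preprocessed matrix on its own: 2 scaled + 3 one-hot columns.
X_t = preprocessor.fit_transform(df)
```

The dataframe is accepted as input without manual conversion, which is part of why the current FAQ wording reads as outdated.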

As of version 1.2 we have pandas-in-pandas-out via the set_output API, as specified in SLEP018; see https://scikit-learn.org/1.4/auto_examples/release_highlights/plot_release_highlights_1_2_0.html#pandas-output-with-set-output-api.

Also, https://scikit-learn.org/dev/install.html lists the purpose of pandas as:

benchmark, docs, examples, tests

We should document this better.

ogrisel · 2024-05-02T13:44:39Z

+1 for updating this FAQ entry to mention that it's possible to output dataframes with the set_output API, with a link to https://scikit-learn.org/dev/auto_examples/miscellaneous/plot_set_output.html for more details.

github-actions bot added the Needs Triage label May 2, 2024

ogrisel added the Documentation label and removed the Needs Triage label May 2, 2024
