Update FAQ about pandas #28930

Open
lorentzenchr opened this issue May 2, 2024 · 1 comment


lorentzenchr commented May 2, 2024

Our FAQ is not up to date when it comes to pandas:

Why does scikit-learn not directly work with, for example, pandas.DataFrame?

The homogeneous NumPy and SciPy data objects currently expected are most efficient to process for most operations. Extensive work would also be needed to support Pandas categorical types. Restricting input to homogeneous types therefore reduces maintenance cost and encourages usage of efficient data structures.

Note however that ColumnTransformer makes it convenient to handle heterogeneous pandas dataframes by mapping homogeneous subsets of dataframe columns selected by name or dtype to dedicated scikit-learn transformers. Therefore ColumnTransformer are often used in the first step of scikit-learn pipelines when dealing with heterogeneous dataframes (see Pipeline: chaining estimators for more details).

See also Column Transformer with Mixed Types for an example of working with heterogeneous (e.g. categorical and numeric) data.
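For reference, the ColumnTransformer pattern described in that FAQ entry can be sketched roughly as follows. This is only an illustration: the dataframe, its column names, and the choice of LogisticRegression are made up for the example and do not come from the FAQ.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy heterogeneous dataframe (hypothetical columns and values).
X = pd.DataFrame(
    {
        "age": [25.0, 32.0, 47.0, 51.0],
        "income": [30_000.0, 52_000.0, 81_000.0, 64_000.0],
        "city": ["Paris", "Berlin", "Paris", "Rome"],
    }
)
y = [0, 0, 1, 1]

# Map homogeneous subsets of columns to dedicated transformers:
# numeric columns are selected by dtype, the categorical column by name.
preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), make_column_selector(dtype_include="number")),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ]
)

# ColumnTransformer as the first step of a pipeline.
model = make_pipeline(preprocessor, LogisticRegression())
model.fit(X, y)

# Output of the preprocessing step alone: 2 scaled + 3 one-hot columns.
X_t = model[:-1].transform(X)
predictions = model.predict(X)
```

The dataframe is passed to the pipeline directly; no manual conversion to a NumPy array is needed.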

As of version 1.2, we have pandas-in-pandas-out through the set_output API introduced by SLEP018, see https://scikit-learn.org/1.4/auto_examples/release_highlights/plot_release_highlights_1_2_0.html#pandas-output-with-set-output-api.

Also, https://scikit-learn.org/dev/install.html lists the purpose of the pandas dependency as:

benchmark, docs, examples, tests

We should document this better.

@github-actions github-actions bot added the Needs Triage Issue requires triage label May 2, 2024
@ogrisel ogrisel added Documentation and removed Needs Triage Issue requires triage labels May 2, 2024

ogrisel commented May 2, 2024

+1 for updating this FAQ entry to mention that it's possible to output dataframes with the set_output API, with a link to https://scikit-learn.org/dev/auto_examples/miscellaneous/plot_set_output.html for more details.
