Allow conversion to/from pandas without requiring PyArrow #15845

Closed
adrinjalali opened this issue Apr 23, 2024 · 6 comments · Fixed by #15933
Labels
A-interop-pandas (Area: interoperability with pandas) · enhancement (New feature or an improvement of an existing feature) · python (Related to Python Polars)

Comments

@adrinjalali

Encountered this while reviewing this PR on the scikit-learn side, xref: scikit-learn/scikit-learn#28804 (comment)

Basically, if the environment doesn't have pyarrow installed, conversion from pandas fails because it requires pyarrow, even though the pandas.DataFrame isn't using pyarrow.

Minimal reproducible:

python -m venv /tmp/.venv
source /tmp/.venv/bin/activate
pip install pandas polars

python
>>> import pandas as pd
>>> import polars as pl
>>> pl.DataFrame(pd.DataFrame(['a', 'b']))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/.venv/lib/python3.11/site-packages/polars/dataframe/frame.py", line 406, in __init__
    self._df = pandas_to_pydf(
               ^^^^^^^^^^^^^^^
  File "/tmp/.venv/lib/python3.11/site-packages/polars/_utils/construction/dataframe.py", line 1032, in pandas_to_pydf
    arrow_dict[str(col)] = plc.pandas_series_to_arrow(
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/.venv/lib/python3.11/site-packages/polars/_utils/construction/other.py", line 92, in pandas_series_to_arrow
    return pa.array(values, pa.large_utf8(), from_pandas=nan_to_null)
           ^^^^^^^^
  File "/tmp/.venv/lib/python3.11/site-packages/polars/dependencies.py", line 97, in __getattr__
    raise ModuleNotFoundError(msg) from None
ModuleNotFoundError: pa.array requires 'pyarrow' module to be installed
>>> pd.DataFrame(pl.DataFrame(['a', 'b']))
   0
0  a
1  b

Note that in the above example the other way around (conversion from polars to pandas) works fine.
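The last two frames of the traceback show the error coming from a `__getattr__` in `polars/dependencies.py`, which suggests that `pa` is a lazy stand-in for the optional `pyarrow` module. The sketch below illustrates that general pattern using only the standard library. It is not the Polars implementation: the class name `LazyOptionalModule` and its attributes are hypothetical, and only the wording of the error message is taken from the traceback above.

```python
import importlib
from types import ModuleType
from typing import Any, Optional


class LazyOptionalModule:
    """Hypothetical stand-in for an optional dependency, imported on first use."""

    def __init__(self, module_name: str, alias: str) -> None:
        self._module_name = module_name
        self._alias = alias
        self._module: Optional[ModuleType] = None

    def __getattr__(self, attr: str) -> Any:
        # Python calls __getattr__ only when normal lookup fails, so this runs
        # for attributes of the proxied module, not for the fields set above.
        if self._module is None:
            try:
                self._module = importlib.import_module(self._module_name)
            except ModuleNotFoundError:
                msg = (
                    f"{self._alias}.{attr} requires "
                    f"{self._module_name!r} module to be installed"
                )
                raise ModuleNotFoundError(msg) from None
        return getattr(self._module, attr)


# The import is attempted only when an attribute is first accessed.
pa = LazyOptionalModule("a_module_that_is_not_installed", "pa")
try:
    pa.array(["a", "b"])
except ModuleNotFoundError as exc:
    print(exc)
```

With a proxy like this, `import polars` succeeds without the optional package, and the failure surfaces only on the code path that actually needs it, which is why the error appears in the middle of the conversion rather than at import time.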

The PR on the scikit-learn side introduced this line:

co2_data = pl.DataFrame({col: co2.frame[col].to_numpy() for col in co2.frame.columns})

which seems very odd, having to move to numpy and then to polars. Also, if the above line is correct, polars could be doing almost the same internally and not require pyarrow for the conversion.
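For reference, the workaround can be wrapped in a small helper. This is a minimal sketch that assumes every column is backed by a plain NumPy dtype. The helper name `pandas_to_polars_via_numpy` and the sample data are hypothetical, and the demo runs only if both pandas and polars are installed.

```python
from importlib.util import find_spec


def pandas_to_polars_via_numpy(pdf):
    """Build a Polars DataFrame from a pandas one, column by column via NumPy.

    Hypothetical helper: it mirrors the scikit-learn line quoted above and
    assumes each column converts cleanly with ``Series.to_numpy()``.
    """
    import polars as pl  # imported lazily so this file loads without polars

    # Polars expects string column names, so coerce them explicitly.
    return pl.DataFrame({str(col): pdf[col].to_numpy() for col in pdf.columns})


if find_spec("pandas") and find_spec("polars"):
    import pandas as pd

    # Made-up numeric data, only to exercise the helper.
    pdf = pd.DataFrame({"year": [1958, 1959], "co2": [315.7, 316.9]})
    print(pandas_to_polars_via_numpy(pdf))
```

The pandas index is dropped in this route, and columns with nullable or extension dtypes may not round-trip the way a dedicated conversion would handle them.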

@stinodego
Member

Our pandas conversion logic (both ways) goes through PyArrow. In some special cases (object columns) we go through NumPy.

I suppose it would be better to go through NumPy when that is sufficient, instead of failing. Or we could write our own conversion logic.

There is also pl.from_dataframe(pandas_df) which is purely implemented in Polars and goes through the DataFrame interchange protocol.

@stinodego added the enhancement label on Apr 24, 2024
@stinodego changed the title from "pyarrow requirement converting from pandas.DataFrame" to "Allow conversion to/from pandas without requiring PyArrow" on Apr 24, 2024
@MarcoGorelli
Collaborator

pl.from_dataframe(pandas_df)

Just as a note of caution: the pandas implementation of the interchange protocol had some pretty bad bugs for nullable dtypes before pandas 2.2.2. If you're only converting classic NumPy-backed pandas dtypes (as I think you are in the linked PR), then it should be OK.

(unfortunately, this is the downside of having bundled the interchange protocol with pandas itself - by the time minimum versions have been bumped sufficiently, it's going to be years until it's fully usable)

@MarcoGorelli
Collaborator

@adrinjalali suppose for the sake of argument that in Polars' next release, conversion from pandas to Polars for pandas primitive dtypes could happen without PyArrow installed

Would you then be OK with bumping the Polars version for the scikit-learn docs to the most recent one, as opposed to using the interchange protocol?

@adrinjalali
Author

Since polars is only a requirement for docs and tests:

https://github.com/scikit-learn/scikit-learn/blob/f4cc02963559e4b7a335e97024010a8721c3dc26/sklearn/_min_dependencies.py#L36

I would be okay with that.

Note that we're probably already bumping the minimum required version to a more recent one for from_dataframe to work: scikit-learn/scikit-learn#28804 (comment)

@MarcoGorelli
Collaborator

Awesome, thanks

Fancy waiting 1 week more (that's the usual release cadence for Polars) so you can bump it all the way just once and avoid from_dataframe completely? That may also "unlock" some more improvements in the existing polars examples you have in scikit-learn (e.g. Expr.top_k(n, by) instead of filter + over + unique)

@adrinjalali
Author

Works for me :)
