Description
Is your feature request related to a problem?
No response
What is the motivation behind your request?
No response
Describe the solution you'd like
I have written the following helper functions, which are quite simple and, IMO, quite useful:
from ibis import _

def is_empty(df):
    """Return True if the table `df` has no rows."""
    return df.count().execute() == 0

def to_list(df):
    """
    Return the table `df` as a list of dictionaries, with each dictionary representing a row.
    """
    return df.to_pyarrow().to_pylist()

def distinct_rows(df, on=None):
    """Return the rows of `df` whose values in the columns `on` occur exactly once."""
    if on is None:
        on = df.columns
    df_with_count = df.group_by(on).aggregate(n_rows=_.count())
    df_distinct = df.semi_join(df_with_count.filter(_.n_rows == 1), predicates=on)
    return df_distinct

def duplicated_rows(df, on=None):
    """Return the rows of `df` whose values in the columns `on` occur more than once."""
    if on is None:
        on = df.columns
    df_with_count = df.group_by(on).aggregate(n_rows=_.count())
    df_duplicated = df.semi_join(df_with_count.filter(_.n_rows > 1), predicates=on)
    return df_duplicated
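To pin down the intended semantics of distinct_rows and duplicated_rows, here is a minimal plain-Python sketch over a list of dictionaries (the shape to_list returns). The names `distinct_rows_ref`, `duplicated_rows_ref`, and `_keys` are hypothetical and not part of ibis. One assumption to note: this sketch treats None as equal to None, whereas an equality join in SQL typically does not match NULL keys, so results may differ for rows containing nulls.

```python
from collections import Counter

# Hypothetical reference helpers (not part of ibis); `rows` is a list of dicts.


def _keys(rows, on):
    # Normalize `on` and build one hashable key tuple per row.
    if on is None:
        on = list(rows[0]) if rows else []
    elif isinstance(on, str):
        on = [on]
    return [tuple(row[c] for c in on) for row in rows]


def distinct_rows_ref(rows, on=None):
    """Rows whose key occurs exactly once."""
    keys = _keys(rows, on)
    counts = Counter(keys)
    return [row for row, key in zip(rows, keys) if counts[key] == 1]


def duplicated_rows_ref(rows, on=None):
    """Rows whose key occurs more than once."""
    keys = _keys(rows, on)
    counts = Counter(keys)
    return [row for row, key in zip(rows, keys) if counts[key] > 1]


rows = [
    {"id": 1, "x": "a"},
    {"id": 1, "x": "b"},
    {"id": 2, "x": "c"},
]
print(distinct_rows_ref(rows, on="id"))
print(duplicated_rows_ref(rows, on="id"))
```

Note that distinct_rows is not the same as dropping duplicates: every row whose key appears more than once is excluded, rather than keeping one representative per key.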
I think each of them could be implemented quite naturally as a method of ibis.Table. Is it okay to ask someone else to implement this functionality? Alternatively, I would appreciate (a) a response on whether this is wanted, and (b) a rough outline of how to implement it: mainly which files to change and what requirements I should keep in mind.
The implementation of ibis.Table.to_list closely matches the Column.to_list function implemented in #10498.
Related but not central to the issue
Also, I used duplicated_rows to implement the following primary-key check:
def assert_pk(df, on, err=True):
    """
    If `err` is true, an error is raised. If not, potential error messages are returned as a list of strings. This allows this function to be used for internal checks in other functions with modified error messages.
    """
    df_subset = df.select(on)
    n_rows_original = df_subset.count().execute()
    n_rows_non_null = df_subset.drop_null().count().execute()
    error_messages = []
    if n_rows_original != n_rows_non_null:
        error_messages.append(f"Found {n_rows_original - n_rows_non_null} null rows for the given column(s) `{on}`. This violates the properties of a primary key.")
    n_duplicated_rows = duplicated_rows(df, on).count().execute()
    if n_duplicated_rows != 0:
        error_messages.append(f"Found {n_duplicated_rows} duplicated rows for the given column(s) `{on}`. This violates the properties of a primary key.")
    if error_messages:
        if err:
            raise AssertionError("\n ".join(error_messages))
        else:
            return error_messages
If such a check is within scope for ibis, I would also love to have it implemented as a method on ibis.Table. This would go a long way toward alleviating the underlying problem behind #11356.
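As a rough illustration of what the primary-key check is meant to report, the following plain-Python sketch applies the same two rules (no nulls in the key columns, no repeated keys) to a list of dictionaries. The function name `pk_violations` is hypothetical and not part of ibis; it assumes that a row counts as null when any key column is None, mirroring the default behavior of drop_null.

```python
from collections import Counter


def pk_violations(rows, on):
    """Hypothetical reference check: return a list of messages (empty if `on` is a valid key)."""
    on = [on] if isinstance(on, str) else list(on)
    keys = [tuple(row[c] for c in on) for row in rows]

    # A key with a None in any column counts as a null row.
    non_null_keys = [k for k in keys if all(v is not None for v in k)]
    n_null = len(keys) - len(non_null_keys)

    # Count every row whose key occurs more than once, as duplicated_rows does.
    counts = Counter(non_null_keys)
    n_duplicated = sum(n for n in counts.values() if n > 1)

    messages = []
    if n_null:
        messages.append(f"Found {n_null} null rows for the given column(s) `{on}`.")
    if n_duplicated:
        messages.append(f"Found {n_duplicated} duplicated rows for the given column(s) `{on}`.")
    return messages


good = [{"id": 1}, {"id": 2}]
bad = [{"id": 1}, {"id": 1}, {"id": None}]
print(pk_violations(good, "id"))
print(pk_violations(bad, "id"))
```

Unlike assert_pk, this sketch always returns a list, so callers can test the result for truthiness without special-casing None.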
I added the option to return the error messages as a list of strings instead of raising, because I needed it for a foreign-key check. The foreign-key check is likely beyond the scope of ibis, but I include it here for completeness:
def assert_fk(df_left, df_right, on_left, on_right, err=True):
    """
    Assert that `on_left` is a foreign key in `df_left`, and that it
    links to the primary key `on_right` in `df_right`
    """
    if isinstance(on_left, str) or isinstance(on_left, ibis.ir.AnyColumn):
        on_left = [on_left]
    if isinstance(on_right, str) or isinstance(on_right, ibis.ir.AnyColumn):
        on_right = [on_right]
    if len(on_left) != len(on_right):
        raise AssertionError(f"Expected as many left columns as right columns. Instead, got left columns `{on_left}` and right columns {on_right}")
    error_messages = []
    error_messages_pk_right = assert_pk(df_right, on_right, err=False)
    if error_messages_pk_right:
        error_messages.append(f"Attempting to assert primary key for the column(s) `{on_right}` in `df_right` resulted in the following assertion errors:\n" + "\n ".join(error_messages_pk_right))
    df_FKs_not_in_PK = df_left.anti_join(df_right, [df_left[l] == df_right[r] for (l, r) in zip(on_left, on_right)])
    if df_FKs_not_in_PK.count().execute() > 0:
        print("Foreign keys in `df_left` not present among primary keys in `df_right`:", end="")
        display(df_FKs_not_in_PK.select(on_left))
        error_messages.append(f"Found {df_FKs_not_in_PK.count().execute()} rows (printed at the top) in `df_left` with values in columns `{on_left}` not present in the columns `{on_right}` from `df_right`.")
    if error_messages:
        if err:
            raise AssertionError("\n ".join(error_messages))
        else:
            return error_messages
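The core of the foreign-key check is the anti-join: find the left rows whose key has no counterpart on the right. A minimal plain-Python sketch of that step, assuming rows are dictionaries and using the hypothetical name `fk_orphans` (not part of ibis):

```python
def fk_orphans(left_rows, right_rows, on_left, on_right):
    """Hypothetical reference anti-join: left rows whose key is absent from the right rows."""
    if len(on_left) != len(on_right):
        raise ValueError("Expected as many left columns as right columns.")
    # Collect the right-hand keys once so each lookup is a set membership test.
    right_keys = {tuple(row[c] for c in on_right) for row in right_rows}
    return [row for row in left_rows if tuple(row[c] for c in on_left) not in right_keys]


customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "cust": 1},
    {"order_id": 11, "cust": 3},
]
print(fk_orphans(orders, customers, ["cust"], ["customer_id"]))
```

An empty result means every left key is present on the right; combined with a primary-key check on the right-hand columns, that is the condition assert_fk verifies.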
What version of ibis are you running?
None
What backend(s) are you using, if any?
No response
Code of Conduct
- I agree to follow this project's Code of Conduct