feat: A few helper-functions #11378

Open
@KronosTheLate

Description

Is your feature request related to a problem?

No response

What is the motivation behind your request?

No response

Describe the solution you'd like

I have written a few helper functions that are simple and, in my opinion, quite useful:

import ibis
from ibis import _

def is_empty(df):
    """Return True if the table `df` has no rows."""
    return df.count().execute() == 0

def to_list(df):
    """
    Return the table `df` as a list of dictionaries, with each dictionary representing a row.
    """
    return df.to_pyarrow().to_pylist()

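def distinct_rows(df, on=None):
    """
    Return the rows of `df` whose values in the columns `on` occur exactly once.
    `on` defaults to all columns.
    """
    if on is None:
        on = df.columns
    df_with_count = df.group_by(on).aggregate(n_rows=_.count())
    df_distinct = df.semi_join(df_with_count.filter(_.n_rows == 1), predicates=on)
    return df_distinct

def duplicated_rows(df, on=None):
    """
    Return the rows of `df` whose values in the columns `on` occur more than once.
    `on` defaults to all columns.
    """
    if on is None:
        on = df.columns
    df_with_count = df.group_by(on).aggregate(n_rows=_.count())
    df_duplicated = df.semi_join(df_with_count.filter(_.n_rows > 1), predicates=on)
    return df_duplicated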
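
To make the intended semantics of the last two helpers concrete, here is a minimal plain-Python model over lists of dictionaries. It is not ibis code: the names `distinct_rows_model`, `duplicated_rows_model`, and `_key` are hypothetical, and the model treats `None` as an ordinary value, which may differ from how a backend compares NULLs in a join.

```python
from collections import Counter

def _key(row, on):
    """Build a hashable key from the columns named in `on`."""
    return tuple(row[col] for col in on)

def distinct_rows_model(rows, on=None):
    """Keep the rows whose key occurs exactly once."""
    if not rows:
        return []
    on = list(rows[0]) if on is None else on
    counts = Counter(_key(row, on) for row in rows)
    return [row for row in rows if counts[_key(row, on)] == 1]

def duplicated_rows_model(rows, on=None):
    """Keep the rows whose key occurs more than once."""
    if not rows:
        return []
    on = list(rows[0]) if on is None else on
    counts = Counter(_key(row, on) for row in rows)
    return [row for row in rows if counts[_key(row, on)] > 1]

rows = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 2, "name": "c"},
]
print(distinct_rows_model(rows, on=["id"]))    # [{'id': 1, 'name': 'a'}]
print(duplicated_rows_model(rows, on=["id"]))  # the two rows with id 2
```

Note that "distinct" here means "occurs exactly once", so every row lands in exactly one of the two results.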

I think each of them could quite naturally be implemented as a method on ibis.Table. Is it okay to ask for someone else to implement this functionality? Alternatively, I would appreciate (a) a response on whether this is wanted, and (b) a rough outline of how to implement it: mainly which files to change and what requirements I should keep in mind.

The implementation for ibis.Table.to_list closely matches the Column.to_list function implemented in #10498.

Related but not central to the issue

Also, I used duplicated_rows to implement the following primary-key check:

def assert_pk(df, on, err=True):
    """
    Check that the column(s) `on` form a primary key of `df`: no nulls and no duplicates.

    If `err` is true, an AssertionError is raised on violation. If not, the error messages are returned as a list of strings (or None if there are none). This allows this function to be used for internal checks in other functions with modified error messages.
    """
    df_subset = df.select(on)
    n_rows_original = df_subset.count().execute()
    n_rows_non_null = df_subset.drop_null().count().execute()
    error_messages = []
    if n_rows_original != n_rows_non_null:
        error_messages.append(f"Found {n_rows_original - n_rows_non_null} null rows for the given column(s) `{on}`. This violates the properties of a primary key.")
    n_duplicated_rows = duplicated_rows(df, on).count().execute()
    if n_duplicated_rows != 0:
        error_messages.append(f"Found {n_duplicated_rows} duplicated rows for the given column(s) `{on}`. This violates the properties of a primary key.")
    if error_messages:
        if err:
            raise AssertionError("\n                ".join(error_messages))
        else:
            return error_messages
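
The two conditions checked above (no nulls, no repeated keys) can be illustrated without a backend. The following is a rough plain-Python sketch, not ibis code; `pk_violations` is a hypothetical name, rows are assumed to be dictionaries, and unlike `assert_pk` it always returns a list.

```python
from collections import Counter

def pk_violations(rows, on):
    """Return a list of messages; an empty list means `on` behaves like a primary key."""
    keys = [tuple(row[col] for col in on) for row in rows]
    messages = []

    # A row violates the key if any of its key columns is null.
    n_null = sum(any(value is None for value in key) for key in keys)
    if n_null:
        messages.append(f"{n_null} row(s) have a null in {on}")

    # Count every row whose key is shared with at least one other row.
    counts = Counter(keys)
    n_duplicated = sum(count for count in counts.values() if count > 1)
    if n_duplicated:
        messages.append(f"{n_duplicated} row(s) share a key in {on}")

    return messages

good = [{"id": 1}, {"id": 2}]
bad = [{"id": 1}, {"id": 1}, {"id": None}]
print(pk_violations(good, ["id"]))  # []
print(pk_violations(bad, ["id"]))
```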

If such a check is within scope for ibis, I would also love to have it implemented as a method on ibis.Table. This would go a long way toward alleviating the underlying problem behind #11356.

I added the option to return the error messages as a list of strings instead of raising, because I needed it for a foreign key check. The foreign key check is likely beyond the scope of ibis, but I include it here for completeness:

from IPython.display import display  # only needed outside a notebook

def assert_fk(df_left, df_right, on_left, on_right, err=True):
    """
    Assert that `on_left` is a foreign key in `df_left`, and that it
    links to the primary key `on_right` in `df_right`
    """
    # Accept a single column (name or expression) as well as a list of columns.
    if isinstance(on_left, (str, ibis.ir.AnyColumn)):
        on_left = [on_left]
    if isinstance(on_right, (str, ibis.ir.AnyColumn)):
        on_right = [on_right]
    if len(on_left) != len(on_right):
        raise AssertionError(f"Expected as many left columns as right columns. Instead, got left columns `{on_left}` and right columns `{on_right}`")
    error_messages = []
    error_messages_pk_right = assert_pk(df_right, on_right, err=False)
    if error_messages_pk_right:
        error_messages.append(f"Attempting to assert primary key for the column(s) `{on_right}` in `df_right` resulted in the following assertion errors:\n" + "\n                ".join(error_messages_pk_right))

    # Rows of `df_left` whose key has no match in `df_right`.
    df_FKs_not_in_PK = df_left.anti_join(df_right, [df_left[left_col] == df_right[right_col] for (left_col, right_col) in zip(on_left, on_right)])
    n_missing = df_FKs_not_in_PK.count().execute()
    if n_missing > 0:
        print("Foreign keys in `df_left` not present among primary keys in `df_right`:", end="")
        display(df_FKs_not_in_PK.select(on_left))
        error_messages.append(f"Found {n_missing} rows (printed at the top) in `df_left` with values in columns `{on_left}` not present in the columns `{on_right}` from `df_right`.")

    if error_messages:
        if err:
            raise AssertionError("\n                ".join(error_messages))
        else:
            return error_messages
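
The anti-join at the heart of this check can be sketched in plain Python as follows. This is only an illustration under the assumption that rows are dictionaries; `fk_orphans` and the sample column names are hypothetical and not part of ibis.

```python
def fk_orphans(left_rows, right_rows, on_left, on_right):
    """Return the rows of `left_rows` whose key has no match in `right_rows`."""
    if len(on_left) != len(on_right):
        raise ValueError("on_left and on_right must have the same length")
    # Collect the keys present on the right side once, then filter the left side.
    right_keys = {tuple(row[col] for col in on_right) for row in right_rows}
    return [
        row for row in left_rows
        if tuple(row[col] for col in on_left) not in right_keys
    ]

customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer": 1},
    {"order_id": 11, "customer": 3},  # no customer with id 3
]
print(fk_orphans(orders, customers, ["customer"], ["customer_id"]))
# [{'order_id': 11, 'customer': 3}]
```

An empty result means every foreign key value on the left has a matching key on the right; the primary-key check on the right side is a separate step.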

What version of ibis are you running?

None

What backend(s) are you using, if any?

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

Assignees: No one assigned
Labels: feature (Features or general enhancements)
Type: No type
Project status: backlog
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests