# What do you want to see in kedro-pandera? #12
(STILL A WIP) I will try to sum up the trials and errors I made and my current opinion about the design. It is not totally fixed yet, but I think we could make an MVP out of it quite quickly.

### First attempt: validation at runtime

My first idea was the following: declare the schema in your catalog

```yaml
iris:
  type: pandas.CSVDataSet
  filepath: /path/to/data/iris.csv
  metadata:
    pandera:
      schema: <pandera_schema> # not sure about either the format or the name, see below
```

and a hook will perform runtime validation:
```python
from typing import Any, Dict

from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog
from kedro.pipeline.node import Node
from pandera import DataFrameSchema


class PanderaHook:
    @hook_impl
    def after_context_created(self, context):
        # pseudo code, I don't know the exact syntax: build each
        # DataFrameSchema object once, when the context is created
        for dataset in context.catalog._data_sets.values():
            dataset.metadata.pandera.df_schema = DataFrameSchema(
                dataset.metadata.pandera.schema
            )

    @hook_impl
    def before_node_run(
        self, node: Node, catalog: DataCatalog, inputs: Dict[str, Any], is_async: bool
    ) -> None:
        for name, data in inputs.items():
            # pseudo code, I don't know the exact syntax
            df_schema = catalog._get_dataset(name).metadata.pandera.df_schema
            df_schema.validate(data)
```

### Open questions about catalog configuration

- Should we use the `metadata` key to store it?
- How many nested levels should we use in the `metadata` key?
- What should the `schema` key contain? (a sketch follows this list)
- Which other keys should we have? See: https://pandera.readthedocs.io/en/stable/dataframe_schemas.html
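One option for the `schema` key would be pandera's own YAML serialization. A minimal sketch, assuming the optional `pandera[io]` extra is installed (the column names are purely illustrative):

```python
import pandera as pa
from pandera import io as pandera_io

# Build (or infer) a schema in Python...
schema = pa.DataFrameSchema(
    {
        "sepal_length": pa.Column(float, pa.Check.ge(0)),  # illustrative columns
        "species": pa.Column(str),
    }
)

# ...and serialize it to YAML; this string (or a file holding it) is what
# the catalog's metadata.pandera.schema key could contain
yaml_schema = schema.to_yaml()

# at runtime, the hook would rebuild the schema object from the YAML
rebuilt_schema = pandera_io.from_yaml(yaml_schema)
```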
### Open questions about plugin configuration

TODO: How do we add advanced configuration capabilities to the plugin? We could add a configuration file (a hypothetical sketch follows this list):

- What level of lazy validation should we enable?
- When should validation be triggered?
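A hypothetical example of such a configuration file; every key below is invented for illustration only:

```yaml
# hypothetical conf/base/pandera.yml - none of these keys exist yet
validation:
  enabled: true             # global on/off switch
  lazy: true                # collect all errors instead of failing at the first one
  trigger: before_node_run  # or: after_dataset_loaded / before_dataset_saved
  exclude_pipelines: []     # skip validation for the listed pipelines
```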
Whatever we decide, this should likely be configurable.

### Temporarily avoid validation, or only for given pipelines

TODO

### Open questions about runtime validation

TODO

### CLI

TODO:

- How can we generate a default schema and tests for a dataset?
- Should we let users generate the schemas of several datasets at the same time?

### Other desirable features

TODO
Super nice work @Galileo-Galilei, I'm super keen to help get this off the ground. I'm keen to write up my thoughts in detail later on, but I wanted to point to the built-in methods which we should leverage here:
Yes, that's what I have been playing with, for many reasons I'll discuss later. Not too much time this week, but I'll resume next week and suggest an MVP. Hopefully we can release a 0.1.0 version next week. If you want to go on without me, feel free; @noklam has the rights on the repo.
Thank you again for kicking this off, @Galileo-Galilei, I've got a few years' worth of thoughts on this topic, so I would love to talk about the things I'd like to see in this space. I'm also interested in the Frictionless schema standard Pandera has started to support; it looks early days, but I do love an open standard. As per your thoughts on YAML vs Python, I think we're going to have to manage 3 patterns; users will inevitably want all 3 for different reasons -

1. Online checks
2. Offline checks
3. Interactive workflow (Jupyter)
4. Data docs
5. Datasets in scope
6. Inferred schemas

All in all, I'm super excited to help get this off the ground. @noklam, if you could make me an editor, that would be great. I'm also going to tag my colleague @mkinegm, who has used Pandera a lot at QB and has some very well thought out ideas on the topic. Medium term, we should also validate our ideas/roadmap, once they're a bit more concrete, with Niels, the maintainer :)
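On the YAML vs Python point: pandera itself already offers two Python declaration styles on top of the serialized/YAML form, so the patterns could roughly look like this. A sketch against recent pandera versions (where `DataFrameModel` replaced the earlier `SchemaModel`); the schema content is illustrative:

```python
import pandera as pa
from pandera.typing import Series

# Pattern 1: object-based API, easy to build programmatically or from YAML
schema = pa.DataFrameSchema({"sepal_length": pa.Column(float, pa.Check.ge(0))})

# Pattern 2: class-based API, friendly to type hints and IDE completion
class IrisSchema(pa.DataFrameModel):
    sepal_length: Series[float] = pa.Field(ge=0)

# Pattern 3: the YAML/serialized form sketched earlier in the thread,
# which kedro-pandera could reference from the catalog's metadata block
```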
Very nice thoughts about this. I think it's already worth creating more specific issues for some of them! Some quick comments:

```python
data = catalog.load("dataset_name")
catalog._data_sets["dataset_name"].metadata.pandera.schema.validate(data)
```

With the same logic, maybe a CLI command (a hypothetical sketch below). If we end up talking with pandera maintainers themselves to validate the roadmap, that could be great, but we're not even close for now :)
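A purely hypothetical sketch of what such a command could look like, written with click like Kedro's own CLI plugins; the `kedro pandera validate` name and the metadata layout are assumptions carried over from the discussion above:

```python
import click
from kedro.framework.session import KedroSession


@click.group(name="pandera")
def pandera_commands():
    """Hypothetical `kedro pandera` command group."""


@pandera_commands.command()
@click.argument("dataset_name")
def validate(dataset_name):
    """Load a dataset from the catalog and validate it against its schema."""
    with KedroSession.create() as session:
        catalog = session.load_context().catalog
        data = catalog.load(dataset_name)
        # assumes the schema is attached to the dataset metadata as discussed above
        schema = catalog._data_sets[dataset_name].metadata.pandera.schema
        schema.validate(data, lazy=True)
```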
When using I think it would be great if the datasets that are validated are passed as inputs or outputs using their coerced+validated outputs. However, the
Just to be sure I get it well, there are 2 separate points here:
@Galileo-Galilei
I don't think I need 2 different outputs or to choose between them dynamically. In my mind, the only datasets that go into a node, or are saved to locations, are the validated and possibly modified datasets. I would like to make a PR that changes where exactly the validations happen in
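If I read the Kedro hook spec correctly, `before_node_run` may return a dictionary of replacement inputs, which would be a natural place for this. A minimal sketch, assuming the `metadata.pandera.df_schema` attribute from earlier in the thread:

```python
from typing import Any, Dict

from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog
from kedro.pipeline.node import Node


class PanderaCoercionHook:
    @hook_impl
    def before_node_run(
        self, node: Node, catalog: DataCatalog, inputs: Dict[str, Any]
    ) -> Dict[str, Any]:
        validated = {}
        for name, data in inputs.items():
            metadata = getattr(catalog._get_dataset(name), "metadata", None)
            if metadata is not None and hasattr(metadata, "pandera"):
                # validate() returns the coerced dataframe, so the node receives
                # the validated (and possibly modified) data, not the raw input
                validated[name] = metadata.pandera.df_schema.validate(data)
        # Kedro overwrites the corresponding node inputs with the returned dict
        return validated
```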
### Description

Opening the floor for feature request discussion: what do you want to see in this plugin? What should it do, and what shouldn't it do? Why is it important to you?