Add Super pipeline for code transforms. #172

Merged: 7 commits, May 30, 2024

revit13 (Collaborator) commented May 23, 2024

/Closes #173

Why are these changes needed?

This PR implements a pipeline for the notebook

The ingest 2 parquet phase is not part of the super pipeline. Its output was generated manually by running the corresponding phase of the notebook.

Please note that the transform images are not synced with the latest code in the repo. To run the pipelines, first run
`cd transforms && make image && load-image` to upload the latest versions to the kind cluster.

Related issue number (if any).

#173

@revit13 revit13 marked this pull request as draft May 23, 2024 05:35
roytman (Member) commented May 23, 2024

@revit13, we'll need a document that explains it.

@revit13 revit13 force-pushed the code-super branch 3 times, most recently from 001b23b to b815335 Compare May 27, 2024 08:22
Signed-off-by: Revital Sur <eres@il.ibm.com>
Signed-off-by: Revital Sur <eres@il.ibm.com>
Signed-off-by: Revital Sur <eres@il.ibm.com>
@revit13 revit13 marked this pull request as ready for review May 27, 2024 13:37
revit13 (Collaborator, Author) commented May 27, 2024

> @revit13, we'll need a document that explains it

@roytman I added a description to `kfp/doc/multi_transform_pipeline.md`.

Signed-off-by: Revital Sur <eres@il.ibm.com>
**Note** An example super pipeline that combines several transforms, `doc_id`, `ededup`, and `fdedup`, can be found in [superworkflow_dedups_sample_wf.py](../superworkflows/v1/superworkflow_dedups_sample_wf.py).
The sections that follow display two super pipelines as examples:

1) [dedups super pipeline](#De-dups-super-pipeline)
Member

I think the link will not work. It is better to use an explicit anchor name:

### Dedups super pipeline <a name = "dedups"></a>

so that the link can be [dedups super pipeline](#dedups).
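
To illustrate why the original fragment may not resolve, here is a rough sketch of how GitHub derives automatic heading anchors (lowercase the text, drop punctuation other than hyphens, turn spaces into hyphens). The function name `approximate_heading_anchor` is hypothetical, the heading text is assumed to be the one suggested above, and the sketch ignores details such as the numeric suffixes added to duplicate headings.

```python
import re


def approximate_heading_anchor(heading: str) -> str:
    """Approximate the anchor GitHub auto-generates for a Markdown heading."""
    text = heading.strip().lower()
    # Keep word characters, hyphens and spaces; drop other punctuation.
    text = re.sub(r"[^\w\- ]", "", text)
    # Each space becomes a hyphen.
    return text.replace(" ", "-")


# Assumed heading text, taken from the suggestion in this comment.
generated = approximate_heading_anchor("Dedups super pipeline")
linked = "De-dups-super-pipeline"  # fragment used in the link under review

print(generated)            # dedups-super-pipeline
print(generated == linked)  # False
```

An explicit `<a name = "dedups"></a>` anchor avoids depending on these generation rules at all.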

Collaborator Author

Fixed. Thanks


### Programming languages Super pipeline

This pipeline combines several programming-languages transforms: `ededup`, `doc_id`, `fdedup`, `proglang_select`, `code_quality`, `malware` and `tokenization`. It can be found in [superworkflow_code_wf.py](../superworkflows/ray/kfp_v1/superworkflow_code_sample_wf.py).
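
As a rough illustration of what "combining" transforms means here, the sketch below chains a few steps so that each step consumes the previous step's output. It is a toy model only: the `toy_*` functions, the `Record` fields and `run_super_pipeline` are hypothetical stand-ins, not the repository's transforms or its KFP super pipeline, and only three of the listed steps are mimicked.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
Step = Callable[[List[Record]], List[Record]]


def toy_ededup(records: List[Record]) -> List[Record]:
    """Drop records whose contents exactly match an earlier record."""
    seen = set()
    kept = []
    for rec in records:
        if rec["contents"] not in seen:
            seen.add(rec["contents"])
            kept.append(rec)
    return kept


def toy_doc_id(records: List[Record]) -> List[Record]:
    """Attach a sequential identifier to every record."""
    return [{**rec, "doc_id": str(i)} for i, rec in enumerate(records)]


def toy_proglang_select(records: List[Record]) -> List[Record]:
    """Keep only records written in an allowed language (assumed list)."""
    allowed = {"Python", "Go"}
    return [rec for rec in records if rec["language"] in allowed]


def run_super_pipeline(
    steps: List[Tuple[str, Step]], records: List[Record]
) -> Tuple[List[Record], List[Tuple[str, int]]]:
    """Run the steps in order, feeding each step the previous output."""
    trace = []
    for name, step in steps:
        records = step(records)
        trace.append((name, len(records)))  # record count after each step
    return records, trace


SAMPLE = [
    {"contents": "print('hi')", "language": "Python"},
    {"contents": "print('hi')", "language": "Python"},
    {"contents": "fmt.Println(1)", "language": "Go"},
    {"contents": "PRINT 1", "language": "COBOL"},
]
STEPS = [
    ("ededup", toy_ededup),
    ("doc_id", toy_doc_id),
    ("proglang_select", toy_proglang_select),
]

result, trace = run_super_pipeline(STEPS, SAMPLE)
print(trace)  # [('ededup', 3), ('doc_id', 3), ('proglang_select', 2)]
```

The order of the steps matters in this toy version: running `toy_doc_id` before `toy_ededup` would assign identifiers to records that are later dropped.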
Member

This pipeline combines transforms for programming languages data preprocessing:

Collaborator Author

Fixed. Thanks


# Pipeline to invoke execution on remote resource
@dsl.pipeline(
name="sample-super-kubeflow-pipeline",
Member

Maybe change the pipeline name and description to reflect the specific use case.

Collaborator Author

Fixed. Thanks

Signed-off-by: Revital Sur <eres@il.ibm.com>
Signed-off-by: Revital Sur <eres@il.ibm.com>
Signed-off-by: Revital Sur <eres@il.ibm.com>
Member

@roytman roytman left a comment

LGTM

@revit13 revit13 merged commit 5441bd5 into IBM:dev May 30, 2024
13 checks passed
Successfully merging this pull request may close these issues.

[Feature] Implement a super-pipeline for Code processing
2 participants