Skip to content

Conversation

@vkocaman
Copy link
Contributor

@vkocaman vkocaman commented Sep 25, 2019

adding explodeAnnotations param to finisher so that it can be used not just at the end but whenever it's required across the pipelines.

Description

explodedCol is added as a param and if it is not specified, it will be empty as a default and ignored as it's already not in the flled names.
So, any string that's not inside fieldnames will be ignored and will not trigger anything.
This will also override the previous steps and the only fields returned will be the exploded struct fields.

Motivation and Context

sometimes we need to export the processed output to csv or somewhere.. so it would make Finisher an intermediary solution that can be used across the pipeline.

How Has This Been Tested?

local test applied but I didn't update the test scripts in repo. We may also need to specify if the type of explodedCol is Array. Otherwise it would fail.

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • Code improvements with no or little impact
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING page.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@maziyarpanahi
Copy link
Contributor

Could you please add unit tests in both Python and Scala covering few scenarios? What if user sets output as an array and explodeCol? What if user has multiple inputCols and one column inside explodeCol?
In any case, No param should make the others obsolete or redundant. Please make sure this param doesn’t do that otherwise this should be outside annotators and up to the user.

@vkocaman
Copy link
Contributor Author

@maziyarpanahi actually, adding this param will make the others obsolete or redundant as each array type outputCol would have different number of items inside and we cannot explode more than col at a time (we can, but we shouldn't let).. so, Finisher wouldn't be a perfect fit for this feature.. lets either have another annotator for exploding certain cols or instead emphasise this issue in articles so that users could be aware of..

@maziyarpanahi maziyarpanahi self-requested a review September 26, 2019 09:35
@maziyarpanahi
Copy link
Contributor

@maziyarpanahi actually, adding this param will make the others obsolete or redundant as each array type outputCol would have different number of items inside and we cannot explode more than col at a time (we can, but we shouldn't let).. so, Finisher wouldn't be a perfect fit for this feature.. lets either have another annotator for exploding certain cols or instead emphasise this issue in articles so that users could be aware of..

That's a perfect assessment. Let's first add the explode function in Spark SQL into our documentation for those who may not be aware of it. We can make an example in the Finisher section that if anyone wishes to export a CSV out of a column they can do so by simply using select(explode(col...)) after Finisher. Then a quick article with end to end.

As an example, I couldn't make sentence embeddings work in WordEmbeddings and BertEmbeddings without being anti-pattern so I created a new annotator which is clean and has lots of opportunities. However, now I like to output the results not just in array but in Vectors format so it can be accepted by Spark ML, and Finisher is not a good fit for that. Let's talk more about this and maybe we can introduce two more annotators one to finish for the outside of Apache Spark (your use case) and one to finish for Spark ML where it needs Vectors for features (my use case).

@vkocaman
Copy link
Contributor Author

Sounds great.. let me add a use case in documentation using the explode function in Spark SQL.. and looks like we'll need a more capable ColumnTransformer annotator soon.. with params like inputCols, transformType, etc..

@maziyarpanahi
Copy link
Contributor

Very nice, please make a pull request for the documentation and I’ll review the styling super fast. 👍

@vkocaman
Copy link
Contributor Author

ok I just did.. please review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants