docs: add aisample title genre text classification #1617

thinkall · 2022-08-18T09:37:34Z

Related Issues/PRs

None

What changes are proposed in this pull request?

Add a title genre text classification notebook for aisample under notebooks/community/aisample.

How is this patch tested?

I have written tests (not required for typo or doc fix) and confirmed the proposed feature/bug-fix/change works.

Does this PR change any dependencies?

No. You can skip this section.
Yes. Make sure the dependencies are resolved correctly, and list changes here.

Does this PR add a new feature? If so, have you added samples on website?

No. You can skip this section.
Yes. Make sure you have added samples following below steps.

Find the corresponding markdown file for your new feature in website/docs/documentation folder.
Make sure you choose the correct class estimators/transformers and namespace.
Follow the pattern in markdown file and add another section for your new API, including pyspark, scala (and .NET potentially) samples.
Make sure the DocTable points to correct API link.
Navigate to website folder, and run yarn run start to make sure the website renders correctly.
Don't forget to add the auto-test marker before each Python code block to enable auto-tests for Python samples.
Make sure the WebsiteSamplesTests job passes in the pipeline.

AB#1935137

github-actions · 2022-08-18T09:37:49Z

Hey @thinkall 👋!
Thank you so much for contributing to our repository 🙌.
Someone from SynapseML Team will be reviewing this pull request soon.
We appreciate your patience and contributions 💯!

mhamilton723

Lovely work! Here are a few suggestions:

No need for this line: raw_df.createOrReplaceTempView("raw_data")
Are these lines necessary for this CSV?

multiLine=True,
quote='"',
escape='"',

We have a class balancer tool specifically for dealing with label imbalance. It adds a weight that is proportional to the deficit: https://microsoft.github.io/SynapseML/docs/next/documentation/estimators/estimators_core/#classbalancer
We have a text featurizer that might be able to turn a lot of these steps into a single model call:
https://mmlspark.blob.core.windows.net/docs/0.10.0/pyspark/synapse.ml.featurize.text.html#module-synapse.ml.featurize.text.TextFeaturizer
EvaluatorType -> use snake_case here and elsewhere
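
The weighting idea behind the class balancer suggestion can be sketched in plain Python. This is only an illustration, assuming each class is weighted by the largest class count divided by its own count. It does not call SynapseML, and `balance_weights` and the genre labels are hypothetical names made up for the example.

```python
from collections import Counter


def balance_weights(labels):
    """Map each class to (largest class count) / (its own count)."""
    counts = Counter(labels)
    largest = max(counts.values())
    return {label: largest / n for label, n in counts.items()}


# Hypothetical, imbalanced genre labels: "fiction" dominates.
labels = ["fiction"] * 6 + ["poetry"] * 3 + ["drama"] * 2
weights = balance_weights(labels)

# Weighted total per class: count * weight.
totals = {label: n * weights[label] for label, n in Counter(labels).items()}
```

With these weights, every class contributes the same weighted total, so a learner that honors a weight column no longer favors the majority class simply because it has more rows.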

thinkall · 2022-08-19T08:55:11Z

Thanks @mhamilton723 for the feedback. It helped a lot :-)

The only suggestion not applied is TextFeaturizer. Its output column contains only word indexes, not the words themselves, so we can't use the output DataFrame to plot a word cloud. Moreover, we want to use Word2Vec for vectorization, and TextFeaturizer doesn't seem to support it.
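
The point about word indexes can be illustrated with a small plain-Python sketch. It assumes a toy featurizer that maps each token to a bucket index; the `tokenize` and `to_indices` helpers, the bucket count, and the sample titles are all hypothetical and are not how TextFeaturizer is implemented.

```python
import re
from collections import Counter


def tokenize(title):
    """Lowercase a title and split it into word tokens."""
    return re.findall(r"[a-z']+", title.lower())


def to_indices(tokens, num_buckets=16):
    """Toy stand-in for an index-based featurizer: token -> bucket index."""
    return [sum(ord(c) for c in tok) % num_buckets for tok in tokens]


# Hypothetical book titles.
titles = ["The Secret Garden", "The Garden of Words"]
tokens = [tokenize(t) for t in titles]

# A word cloud needs the words themselves and their frequencies.
word_counts = Counter(tok for row in tokens for tok in row)

# After indexing, only integers remain, so the words cannot be displayed.
indices = [to_indices(row) for row in tokens]
```

Keeping a separate tokenized column, as the notebook does, leaves the tokens available both for the word cloud and as input to a Word2Vec-style embedding step.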

thinkall · 2022-08-19T08:55:35Z

/azp run

azure-pipelines · 2022-08-19T08:55:49Z

Azure Pipelines successfully started running 1 pipeline(s).

codecov-commenter · 2022-08-19T09:03:28Z

Codecov Report

Merging #1617 (5c08d20) into master (0f54bc6) will decrease coverage by 0.05%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #1617      +/-   ##
==========================================
- Coverage   83.61%   83.56%   -0.06%     
==========================================
  Files         288      288              
  Lines       15334    15334              
  Branches      747      747              
==========================================
- Hits        12822    12814       -8     
- Misses       2512     2520       +8

Impacted Files	Coverage Δ
.../azure/synapse/ml/param/PythonWrappableParam.scala	`66.66% <0.00%> (-8.34%)`	⬇️
...ft/azure/synapse/ml/param/JsonEncodableParam.scala	`57.14% <0.00%> (-7.15%)`	⬇️
...re/src/main/python/synapse/ml/core/schema/Utils.py	`67.10% <0.00%> (-5.27%)`	⬇️
.../execution/streaming/continuous/HTTPSourceV2.scala	`92.08% <0.00%> (-0.72%)`	⬇️
...ft/azure/synapse/ml/cognitive/ComputerVision.scala	`73.10% <0.00%> (+1.26%)`	⬆️
...osoft/azure/synapse/ml/core/utils/AsyncUtils.scala	`80.00% <0.00%> (+5.00%)`	⬆️

mhamilton723 · 2022-08-23T04:17:28Z

/azp run

azure-pipelines · 2022-08-23T04:17:42Z

Azure Pipelines successfully started running 1 pipeline(s).

docs: add aisample title genre text classification

03d3eb4

thinkall requested a review from mhamilton723 as a code owner August 18, 2022 09:37

mhamilton723 requested changes Aug 19, 2022

thinkall and others added 4 commits August 19, 2022 13:55

Merge branch 'microsoft:master' into aisample-tian

8104eca

Merge branch 'master' of https://github.com/thinkall/SynapseML into aisample-tian

0d1b96a

chore: polish code, apply SynapseML class balancer

19b1ea6

Merge branch 'aisample-tian' of https://github.com/thinkall/SynapseML into aisample-tian

63ce5a6

thinkall requested a review from mhamilton723 August 19, 2022 08:55

thinkall added 2 commits August 22, 2022 14:56

Merge branch 'microsoft:master' into aisample-tian

784f4b0

Merge branch 'master' into aisample-tian

7358b76

mhamilton723 approved these changes Aug 23, 2022

Merge branch 'master' into aisample-tian

5c08d20

mhamilton723 approved these changes Aug 23, 2022

mhamilton723 enabled auto-merge (squash) August 23, 2022 04:24

mhamilton723 merged commit d98ac02 into microsoft:master Aug 23, 2022

thinkall deleted the aisample-tian branch August 23, 2022 05:45
