
docs: add aisample notebooks into community folder #1606

Merged
mhamilton723 merged 15 commits into microsoft:master from thinkall:aisample on Aug 19, 2022
Conversation

@thinkall (Contributor) commented Aug 9, 2022

Related Issues/PRs

None

What changes are proposed in this pull request?

Add two notebooks for aisample into the notebooks/community/aisample folder.

  • AIsample - Book Recommendation.ipynb
  • AIsample - Fraud Detection.ipynb

How is this patch tested?

  • I have written tests (not required for typo or doc fix) and confirmed the proposed feature/bug-fix/change works.

Does this PR change any dependencies?

  • No. You can skip this section.
  • Yes. Make sure the dependencies are resolved correctly, and list changes here.

Does this PR add a new feature? If so, have you added samples on website?

  • No. You can skip this section.
  • Yes. Make sure you have added samples following the steps below.
  1. Find the corresponding markdown file for your new feature in the website/docs/documentation folder.
    Make sure you choose the correct class estimators/transformers and namespace.
  2. Follow the pattern in the markdown file and add another section for your new API, including pyspark, scala (and .NET potentially) samples.
  3. Make sure the DocTable points to the correct API link.
  4. Navigate to the website folder and run yarn run start to make sure the website renders correctly.
  5. Don't forget to add <!--pytest-codeblocks:cont--> before each Python code block to enable auto-tests for Python samples.
  6. Make sure the WebsiteSamplesTests job passes in the pipeline.

AB#1920095

@github-actions (bot) commented Aug 9, 2022

Hey @thinkall 👋!
Thank you so much for contributing to our repository 🙌.
Someone from SynapseML Team will be reviewing this pull request soon.
We appreciate your patience and contributions 💯!

@thinkall thinkall changed the title Add aisample notebooks into community folder doc: add aisample notebooks into community folder Aug 9, 2022
@thinkall thinkall changed the title doc: add aisample notebooks into community folder docs: add aisample notebooks into community folder Aug 9, 2022
@mhamilton723 (Collaborator)

Hey @thinkall, welcome to the project! I added you to the SynapseML team, so you can now queue builds with the magical /azp run comment. But before you queue, a few questions and requests:

  1. Do you intend for this notebook to run on platforms like Synapse and Databricks? I know there's a dependence on the lakehouse, but we can just as easily move this data to our public blob so that the demo works on all the platforms. If so, we'll want to make some changes so that we can test this notebook as part of our build system, and Serena might be able to help you make these changes.

  2. Can you run black on these notebooks so that they get properly formatted into multiple lines to make the code review process easier?

  3. Can you clear the output of the cells prior to check-in? To show key results, include them as an image in the markdown so that folks can compare.

I appreciate these contributions, and from the start these notebooks look like they have some great content!

@mhamilton723 (Collaborator) left a review comment:

See comment above

@thinkall (Contributor, Author)

Hi @mhamilton723, thanks a lot for the suggestions. I've formatted the notebooks with black and cleared the output of the cells. As for the cross-platform issue, we're focusing on just one platform at the moment and prefer not to introduce concepts from other platforms. I guess we can add some code to make the notebooks runnable on Synapse/Databricks after the MVP. Does this make sense?

@mhamilton723 (Collaborator) left a review comment:

Awesome work, great content; it just needs some simplifications. I've added my comments for Book Recommendation; consider applying the learnings from this review to both notebooks. In general, we'll want to tighten these notebooks up and make them as simple as possible so users don't have to think too hard. Once you apply these to both notebooks, I'll go through the other one and give another round of comments. Thanks for your patience :)

Title and Intro

  • "E2E solution of recommendation system" -> Creating, Evaluating, and Deploying a Recommendation System
  • In step 1: load the data you have some ASCII art showing the data, consider making this a markdown table

Loading Data

  • No need for params around ITEM_INFO_COL and others.
  • With the downloading data, you could remove this whole section by instead putting the CSV files on our public blob; then you can just read them directly.

Read data from lakehouse

  • Consider caching the item, user, and ratings DataFrames at the beginning to make this demo smoother.
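
For illustration, a minimal sketch of that caching (the DataFrame names are assumed from the notebook):

# cache() marks each DataFrame for reuse so later actions don't recompute the reads
df_items, df_users, df_ratings = df_items.cache(), df_users.cache(), df_ratings.cache()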

EDA

  • No need to coalesce before adding monotonic IDs.
  • Don't need to sort before showing.
  • Don't need to drop to RDDs or collect in df_ratings.select(RATING_COL).distinct().rdd.flatMap(lambda x: x).collect(). Just select distinct and show (see the sketch after this list).
  • Why do these user ids start with an "_"? Consider simplifying your computations to avoid these hidden vars.
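
A minimal sketch of that simplification, assuming df_ratings and RATING_COL from the notebook:

# Stay in the DataFrame API instead of dropping to the RDD and collecting to the driver
df_ratings.select(RATING_COL).distinct().show()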

Merge Data

  • Don't need to re-import functions as F.

For the following code:

df_tmp = df_ratings.join(df_users, USER_ID_COL, "inner")
df_all = df_tmp.join(df_items, ITEM_ID_COL, "inner")
df_all_columns = df_all.columns
df_all_columns.remove("_user_id")
df_all_columns.remove("_item_id")
df_all_columns.remove(RATING_COL)
df_all = df_all.select(["_user_id", "_item_id", RATING_COL] + df_all_columns)
df_all = df_all.withColumn("id", F.monotonically_increasing_id())

This is a bit unwieldy to look at. 1) Consider using the dot-chaining syntax within parentheses like you did in the readers. 2) No need to get the columns explicitly; just select "*" if you want all the columns. 3) Do you need to add another id?

  • Cache the df_all dataframe if you don't want to keep redoing this computation on every display.
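
As a rough sketch of the chained form (assuming the explicit reorder is dropped and df_ratings, df_users, df_items, USER_ID_COL, and ITEM_ID_COL come from the notebook):

from pyspark.sql import functions as F

# One chained expression: join, add a monotonic id, and cache for reuse
df_all = (
    df_ratings.join(df_users, USER_ID_COL, "inner")
    .join(df_items, ITEM_ID_COL, "inner")
    .withColumn("id", F.monotonically_increasing_id())
    .cache()
)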

Plotting

  • Don't filter warnings. What was the warning you are trying to avoid?

Model development and deploy

  • "We have explored the dataset, got some insights from the exploratory data analysis and even achieved most popular recommendation." ->
  • " So far we have explored the dataset, added unique ids to our users and items, and plotted top items. Next, we'll train an Alternating Least Squares (ALS) recommender to give users personalized recommendations"
  • its not good fur users to have to iunteract with things that start with an underscore as this is generally used for private vars in classes. Consider removing that. Also if the IS_SAMPLE flag is false there will be no _df_all and it will fail
  • This code looks sketchy; consider simplifying with some comprehensions (see the sketch after this list):
fractions_train = {0: 0}
fractions_test = {0: 0}
for i in ratings:
    if i == 0:
        continue
    fractions_train[i] = 0.8
    fractions_test[i] = 1
  • Consider describing the cells that join and sample to make the training and testing sets. It's a bit opaque right now.
  • Cast and add columns before splitting so you don't need to do this twice.
  • Don't use .rdd, it's deprecated.
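
One way the comprehension form could look (a sketch, assuming ratings is the list of distinct rating values from the original loop):

# The same sampling fractions without the explicit loop; rating 0 still maps to 0
fractions_train = {0: 0.0, **{r: 0.8 for r in ratings if r != 0}}
fractions_test = {0: 0.0, **{r: 1.0 for r in ratings if r != 0}}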

Hyperparameter tuning

  • The question of which hyperparameter tuner to use could resolve into a hyper_optimizer variable that the code below uses without needing additional if statements.
  • This code looks like it could be cleaned up:
tmp_list = []
for params in models.getEstimatorParamMaps():
    tmp = ""
    for k in params:
        tmp += k.name + "=" + str(params[k]) + " "
    tmp_list.append(tmp)

Model Eval

  • Consider dot chaining here (see the sketch after this list):
predictions = model.transform(train)
predictions = predictions.withColumn(
    "prediction", predictions.prediction.cast("double")
)
predictions.select("_user_id", "_item_id", RATING_COL, "prediction").limit(10).show()
  • I think you can eval all the metrics at the same time, which will be much faster.
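
For illustration, the dot-chained version might look like this (a sketch; the column names come from the notebook):

from pyspark.sql import functions as F

(
    model.transform(train)
    .withColumn("prediction", F.col("prediction").cast("double"))
    .select("_user_id", "_item_id", RATING_COL, "prediction")
    .limit(10)
    .show()
)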

Model saving and loading

  • Loading is commented out; not sure if this is intentional.

@thinkall (Contributor, Author)

Hey @mhamilton723, first of all, a huge thank you. Your detailed comments are so helpful. I modified our notebooks following most of your comments. For the rest, please let me explain why we did it that way.

The purpose of these notebooks is to showcase the functionalities of our new platform with some e2e examples in different scenarios. Users who are not familiar with the techniques (ML, Spark, etc.) can just copy and paste the notebooks and run them directly without changing anything. They can also apply a notebook to their own datasets by simply uploading the dataset to the lakehouse, modifying the params in the first code cell accordingly, and then clicking run! Experienced users can also use these notebooks as references.

Loading Data

To achieve these goals, we chose to load an arbitrary dataset and reorder/rename some of its columns to keep the code in the other parts simple, so we hope to keep the load-data part as it is. The downloading-data part is also needed; using the lakehouse is actually one of the features we want to showcase.

EDA

  • "coalesce before adding monotonic IDs" is needed to keep the ids continously.
  • "_" is used to avoid naming conflicts as much as possible.

Merge Data

  • "get the columns explicitly" is for reordering the columns.

Model Eval

  • It seems that Evaluator from pyspark.ml can't eval all the metrics at the same time; Metrics from pyspark.mllib can do that, but it only accepts RDD inputs. Moreover, the current eval time is only a few seconds, so I guess it's acceptable.
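
For reference, a sketch of the pyspark.mllib route described above (assuming a predictions DataFrame with "prediction" and RATING_COL columns):

from pyspark.mllib.evaluation import RegressionMetrics

# RegressionMetrics computes several metrics in one pass, but over an RDD of
# (prediction, observation) pairs rather than a DataFrame
pred_and_labels = predictions.select("prediction", RATING_COL).rdd.map(
    lambda row: (float(row[0]), float(row[1]))
)
metrics = RegressionMetrics(pred_and_labels)
print(metrics.rootMeanSquaredError, metrics.meanAbsoluteError, metrics.r2)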

Model saving and loading

  • Yes, it is intentional. We'll add code using the mlflow plugin for model saving and loading.

Plus, I adjusted the first cell, with comments aligned. It's different from what black wants, but I think it looks better for this cell. Is that OK with you?

@mhamilton723 (Collaborator)

Will give this another full review tonight. One thing I noticed is that the files are still very large due to random Synapse data held in the notebooks; can you remove those entries in the JSON so these notebooks are smaller files?

@thinkall (Contributor, Author)

> Will give this another full review tonight. One thing I noticed is that the files are still very large due to random Synapse data held in the notebooks; can you remove those entries in the JSON so these notebooks are smaller files?

Thanks @mhamilton723, all the unnecessary random metadata has been removed.

@mhamilton723 (Collaborator)

Li, thanks so much for your fixes and changes! Apologies in advance for any annoying comments I leave you, and I appreciate your patience in working with me on these fixes. I think it's become a lot simpler since the first iteration, and I'm hoping we can continue to distill this into its simplest and easiest form for users.

Book Recs

EDA

  • "coalesce before adding monotonic IDs" is needed to keep the ids continously." - I dont think they need to be continuous, its just a user GUID and coalesce forces you to go to a single machine which doesent scale

  • "_" is used to avoid naming conflicts as much as possible. - I think we should still remove these underscore names and instead save _names for private members as is the standard. Consider simplifying the logic of your queries to avoid the need to use these underscores. Also consider this for the variable _df_all as its used alot in your code and hence shouldnt be marked as a private variable. In general, _ should be reserved for private variables/methods in python classes as opposed to used objects and names in a public demo.

Model Training

  • I still think this can be simplified with a smaller if statement. Consider the following:
if model_tuning_method == "CrossValidator":
    tuner = CrossValidator(
        estimator=als, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=5
    )
elif model_tuning_method == "TrainValidationSplit":
    tuner = TrainValidationSplit(
        estimator=als,
        estimatorParamMaps=param_grid,
        evaluator=evaluator,
        # 80% of the data will be used for training, 20% for validation.
        trainRatio=0.8,
    )
else:
    raise ValueError(f"Unknown model_tuning_method: {model_tuning_method}")

With this, the downstream code becomes:

models = tuner.fit(train)
model = models.bestModel

Likewise, the next if can be simplified:

if model_tuning_method == "CrossValidator":
    metric = models.avgMetrics
elif model_tuning_method == "TrainValidationSplit":
    metric = models.validationMetrics
else:
    raise ValueError(f"Unknown model_tuning_method: {model_tuning_method}")

And the other stuff can be factored out.

  • The code here:
# collect metrics
tmp_list = []
for params in models.getEstimatorParamMaps():
    tmp = ""
    for k in params:
        tmp += k.name + "=" + str(params[k]) + " "
    tmp_list.append(tmp)

Can be simplified to something like

# collect metrics
param_strings = []
for param_map in models.getEstimatorParamMaps():
    param_strings.append(" ".join(f"{k.name}={v}" for k, v in param_map.items()))

Model Evaluation

  • predictions=... can be cleaned up with dot chaining.
  • The evaluation section has a lot of repeated code, both the experiment duplicated for train and test and the same "evaluator.evaluate..." snippets; consider making these functions or lambdas to tighten this up a bit (see the sketch below).
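
A sketch of what such a helper could look like (the evaluate_all name is hypothetical; assumes a RegressionEvaluator over a predictions DataFrame with RATING_COL as the label):

from pyspark.ml.evaluation import RegressionEvaluator

def evaluate_all(predictions, label_col=RATING_COL):
    # One helper replaces the repeated evaluator.evaluate(...) snippets for train and test
    evaluator = RegressionEvaluator(labelCol=label_col, predictionCol="prediction")
    return {m: evaluator.setMetricName(m).evaluate(predictions) for m in ("rmse", "mae", "r2", "var")}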

Style

  • "It's different from what black wants" - unfortunately our build error this if its not black compliant, consider shortening the comments so black doesent need to bump to new line

Fraud Detection

Cast columns

  • Consider dot chaining to neaten this section

Model evaluation

  • mmlspark is a deprecated namespace; change it to synapse.ml.

  • Same comment about duplication in the eval code for SMOTE vs non-SMOTE; consider making a function or a loop.

Saving model

  • We should comment that saveNativeModel is useful if you want to use the model in other LightGBM contexts; otherwise, for later usage with Spark, use the SparkML save/load on LGBMClassifier. For this demo it might be simplest to use SparkML model serialization instead of LightGBM native (see the sketch below).
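
A sketch contrasting the two serialization routes (assuming a fitted SynapseML LightGBMClassificationModel named model; the paths are hypothetical):

from synapse.ml.lightgbm import LightGBMClassificationModel

# SparkML serialization: reloadable as the same Spark model for later pipelines
model.write().overwrite().save("Files/models/fraud-lgbm")  # hypothetical path
loaded = LightGBMClassificationModel.load("Files/models/fraud-lgbm")

# Native LightGBM text format: usable from other LightGBM tooling
model.saveNativeModel("Files/models/fraud-lgbm-native")  # hypothetical path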

Thanks again for all your patience. I know it's no fun getting comments, and I really appreciate you!

@mhamilton723 (Collaborator) left a review comment:

See other comment

@thinkall (Contributor, Author)

@mhamilton723 , thanks a lot for all the suggestions. All but one are accepted.

"coalesce before adding monotonic IDs" is needed to keep the ids continously." - I dont think they need to be continuous, its just a user GUID and coalesce forces you to go to a single machine which doesent scale

There are two reasons for us to apply coalesce.

  1. The newly generated _user_id is used by the recommendation models; continuous integers give a smaller model size and faster inference for some models, so it's more than just a GUID.

  2. We used the ALS model in this demo, and the Spark implementation only supports values in the Integer range. If we don't apply coalesce, the generated ids may exceed that range for big datasets. In fact, I added coalesce after I hit this issue.

@mhamilton723 (Collaborator) commented Aug 17, 2022

> @mhamilton723 , thanks a lot for all the suggestions. All but one are accepted.
>
> > "coalesce before adding monotonic IDs is needed to keep the ids continuous" - I don't think they need to be continuous; it's just a user GUID, and coalesce forces you to go to a single machine, which doesn't scale.
>
> 1. The newly generated _user_id is used by the recommendation models; continuous integers give a smaller model size and faster inference for some models, so it's more than just a GUID.
> 2. We used the ALS model in this demo, and the Spark implementation only supports values in the Integer range. If we don't apply coalesce, the generated ids may exceed that range for big datasets. In fact, I added coalesce after I hit this issue.

Thanks @thinkall, I see what you are saying. I think in this case we should instead use the SparkML StringIndexer, as it is designed to learn a mapping from strings representing IDs to integers/longs. This will automatically make sure the ids are contiguous, but it won't have the aforementioned scalability issues. Also, I still see a lot of _ column names. Sorry to be a pain, but can we try to name those without _? Your use of _evaluate is OK, though!
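
A minimal sketch of the StringIndexer approach (df_users and USER_ID_COL are from the notebook; the output column name is hypothetical):

from pyspark.ml.feature import StringIndexer
from pyspark.sql import functions as F

# StringIndexer maps each distinct user id to a contiguous index (0.0, 1.0, ...)
# without collapsing the data onto a single partition
indexer = StringIndexer(inputCol=USER_ID_COL, outputCol="user_idx")
df_users = (
    indexer.fit(df_users)
    .transform(df_users)
    .withColumn("user_idx", F.col("user_idx").cast("int"))  # ALS needs integer ids
)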

@thinkall (Contributor, Author)

All done.

@thinkall (Contributor, Author)

/azp run

@azure-pipelines (bot)

Azure Pipelines successfully started running 1 pipeline(s).

@codecov-commenter (bot) commented Aug 18, 2022

Codecov Report

Merging #1606 (1c108d1) into master (ac40e5a) will decrease coverage by 0.04%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #1606      +/-   ##
==========================================
- Coverage   83.79%   83.75%   -0.05%     
==========================================
  Files         292      292              
  Lines       15473    15473              
  Branches      752      752              
==========================================
- Hits        12966    12959       -7     
- Misses       2507     2514       +7     
Impacted Files Coverage Δ
...crosoft/azure/synapse/ml/io/http/HTTPClients.scala 67.64% <0.00%> (-7.36%) ⬇️
.../execution/streaming/continuous/HTTPSourceV2.scala 92.08% <0.00%> (-0.72%) ⬇️


@thinkall (Contributor, Author)

/azp run

@azure-pipelines (bot)

Azure Pipelines successfully started running 1 pipeline(s).

@mhamilton723 (Collaborator)

/azp run

@azure-pipelines (bot)

Azure Pipelines successfully started running 1 pipeline(s).

@mhamilton723 mhamilton723 merged commit 7e90d19 into microsoft:master Aug 19, 2022
@thinkall thinkall deleted the aisample branch August 19, 2022 05:51