docs: add aisample notebooks into community folder #1606
Conversation
Hey @thinkall 👋!
Hey @thinkall, welcome to the project! I added you to the SynapseML team so you can now queue builds with the magical /azp run command. Appreciate these contributions, and from the start these notebooks look like they have some great content!
See comment above
Hi @mhamilton723, thanks a lot for the suggestions. I've formatted the notebooks with
Awesome work, great content; it just needs some simplifications. I've added my comments for Book Recommendation; consider applying the learnings from this review to both notebooks. In general we'll want to tighten these notebooks up and make them as simple as possible so users don't have to think too hard. Once you apply these to both notebooks, I'll go through the other one and give another round of comments. Thanks for your patience :)
Title and Intro
- "E2E solution of recommendation system" -> Creating, Evaluating, and Deploying a Recommendation System
- In step 1 (load the data) you have some ASCII art showing the data; consider making this a markdown table
Loading Data
- No need for params around ITEM_INFO_COL and others
- With the downloading data, you could remove this whole section by instead putting the CSV files on our public blob; then you can just read them directly, e.g.:
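For instance (a sketch only: the exact blob URL and file name here are hypothetical, not the final location):
df_items = spark.read.csv(
    "wasbs://publicwasb@mmlspark.blob.core.windows.net/Books.csv", header=True
)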
Read data from lakehouse
- Consider caching the item, user, and ratings dataframes in the beginning to make this demo smoother
EDA
- No need to coalesce before adding monotonic IDs.
- Don't need to sort before showing.
- Don't need to drop to RDDs or collect in df_ratings.select(RATING_COL).distinct().rdd.flatMap(lambda x: x).collect(). Just select distinct and show (see the sketch after this list).
- Why do these user ids start with an "_"? Consider simplifying your computations to avoid these hidden vars.
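For example, staying entirely in the DataFrame API:
df_ratings.select(RATING_COL).distinct().show()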
Merge Data
- Don't need to re-import functions as F.
For the following code
df_tmp = df_ratings.join(df_users, USER_ID_COL, "inner")
df_all = df_tmp.join(df_items, ITEM_ID_COL, "inner")
df_all_columns = df_all.columns
df_all_columns.remove("_user_id")
df_all_columns.remove("_item_id")
df_all_columns.remove(RATING_COL)
df_all = df_all.select(["_user_id", "_item_id", RATING_COL] + df_all_columns)
df_all = df_all.withColumn("id", F.monotonically_increasing_id())
This is a bit unwieldy to look at. (1) Consider using the dot-chaining syntax within parentheses like you did in the readers. (2) No need to get the columns explicitly; just select "*" if you want all the columns. (3) Do you need to add another id? See the sketch after the next bullet for one possible rewrite.
- Cache the df_all dataframe if you don't want to keep redoing this computation on every display.
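One possible dot-chained rewrite (a sketch, not the canonical fix: it assumes the explicit column reordering was cosmetic and can be dropped, and it folds in the caching suggestion from the bullet above):
df_all = (
    df_ratings.join(df_users, USER_ID_COL, "inner")
    .join(df_items, ITEM_ID_COL, "inner")
    .withColumn("id", F.monotonically_increasing_id())
    .cache()
)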
Plotting
- Don't filter warnings; what was the warning you were trying to avoid?
Model development and deploy
- "We have explored the dataset, got some insights from the exploratory data analysis and even achieved most popular recommendation." ->
- " So far we have explored the dataset, added unique ids to our users and items, and plotted top items. Next, we'll train an Alternating Least Squares (ALS) recommender to give users personalized recommendations"
- It's not good for users to have to interact with things that start with an underscore, as this is generally used for private vars in classes. Consider removing that. Also, if the IS_SAMPLE flag is false there will be no _df_all and it will fail.
- This code looks sketchy; consider simplifying with some comprehensions:
fractions_train = {0: 0}
fractions_test = {0: 0}
for i in ratings:
    if i == 0:
        continue
    fractions_train[i] = 0.8
    fractions_test[i] = 1
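For example (a faithful one-liner per dict, assuming ratings holds the distinct rating values and rating 0 should always map to fraction 0):
fractions_train = {0: 0.0, **{i: 0.8 for i in ratings if i != 0}}
fractions_test = {0: 0.0, **{i: 1.0 for i in ratings if i != 0}}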
- Consider describing these cells that join and sample to make training and testing sets; it's a bit opaque right now.
- Cast and add columns before splitting so you don't need to do this twice.
- Don't use .rdd, it's deprecated.
Hyperparameter tuning
- Make the question of which hyperparameter tuner to use produce a hyper_optimizer variable that downstream code can use without needing additional if statements.
- This code looks like it could be cleaned up:
tmp_list = []
for params in models.getEstimatorParamMaps():
    tmp = ""
    for k in params:
        tmp += k.name + "=" + str(params[k]) + " "
    tmp_list.append(tmp)
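For example, with a comprehension (a sketch; each ParamMap returned by getEstimatorParamMaps() iterates over its Param keys like a dict):
tmp_list = [
    " ".join(f"{k.name}={params[k]}" for k in params)
    for params in models.getEstimatorParamMaps()
]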
Model Eval
- Consider dot chaining here:
predictions = model.transform(train)
predictions = predictions.withColumn(
"prediction", predictions.prediction.cast("double")
)
predictions.select("_user_id", "_item_id", RATING_COL, "prediction").limit(10).show()
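For example (same behavior, just chained; assumes functions as F is already imported earlier in the notebook):
predictions = model.transform(train).withColumn(
    "prediction", F.col("prediction").cast("double")
)
predictions.select("_user_id", "_item_id", RATING_COL, "prediction").limit(10).show()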
- I think you can eval all the metrics at the same time, which will be much faster (see the sketch below).
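One way to do that is a single aggregation pass using the standard RMSE and MAE formulas (a hedged sketch with plain Spark SQL functions, not a specific evaluator API):
from pyspark.sql import functions as F

metrics = predictions.agg(
    F.sqrt(F.avg(F.pow(F.col(RATING_COL) - F.col("prediction"), 2))).alias("rmse"),
    F.avg(F.abs(F.col(RATING_COL) - F.col("prediction"))).alias("mae"),
).first()
print(metrics["rmse"], metrics["mae"])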
Model saving and loading
- Loading is commented out, not sure if this is intentional
Hey @mhamilton723, first of all, a huge thank you to you. Your detailed comments are so helpful. I modified our notebooks following most of your comments. For the rest, please let me explain why we did it that way. The purpose of these notebooks is to showcase the functionalities of our new platform with some e2e examples in different scenarios. Users who are not familiar with the techniques (ML, Spark, etc.) can just copy and paste the notebooks and run them directly without changing anything. They can also apply a notebook to their own datasets by simply uploading the dataset to the lakehouse, modifying the params in the first code cell accordingly, and clicking run. Experienced users can also use these notebooks as references.

Loading Data
To achieve these goals, we chose to load an arbitrary dataset and reorder/rename some of its columns to keep the code in the other parts simple, so we hope to keep the load-data part as it is. The downloading-data part is also needed; using the lakehouse is actually one of the features we want to showcase.

EDA
Merge Data
Model Eval
Model saving and loading
Plus, I adjusted the first cell, with comments aligned. It's different from what
Will give this another full review tonight. One thing I noticed is that the files are still very large due to random Synapse data held in the notebooks; can you remove those entries in the JSON so these notebooks are smaller files?
Thanks @mhamilton723, all the unnecessary random metadata has been removed.
Li, thanks so much for your fixes and changes! Apologies in advance for any annoying comments I leave you, and I appreciate your patience in working with me on these fixes. I think it's become a lot simpler since the first iteration, and I'm hoping we can continue to distill this into its simplest and easiest form for users.

Book Recs
EDA
Model Training
With this, the downstream code becomes:
Likewise the next if can be simplified
And the other stuff can be factored out.
Can be simplified to something like
Model Evaluation
Style
Fraud Detection
Cast columns
Model evaluation
Saving model
Thanks again for all your patience. I know it's no fun getting comments, and I really appreciate you!
See other comment
@mhamilton723, thanks a lot for all the suggestions. All but one are accepted: "coalesce before adding monotonic IDs" is needed to keep the ids continuous.

I don't think they need to be continuous; it's just a user GUID, and coalesce forces you to go to a single machine, which doesn't scale.

There are two reasons for us to apply
Thanks @thinkall, I see what you are saying. I think in this case we should instead use the SparkML StringIndexer, as this is designed to learn a mapping from strings representing IDs to integers/longs. This will automatically make sure the ids are contiguous, but it won't have the aforementioned scalability issues. Also, I still see a lot of _ column names. Sorry to be a pain, but can we try to name those without _? Your use of _evaluate is OK though!
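A minimal sketch of that approach (assuming USER_ID_COL and ITEM_ID_COL name the raw ID columns; "user_idx" and "item_idx" are hypothetical output names, and the indices are cast to int since StringIndexer emits doubles):
from pyspark.ml.feature import StringIndexer
from pyspark.sql import functions as F

# Learn contiguous integer indices for the IDs without coalescing to one machine.
indexer = StringIndexer(
    inputCols=[USER_ID_COL, ITEM_ID_COL], outputCols=["user_idx", "item_idx"]
)
df_all = indexer.fit(df_all).transform(df_all)
df_all = df_all.withColumn("user_idx", F.col("user_idx").cast("int")).withColumn(
    "item_idx", F.col("item_idx").cast("int")
)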
All done.
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Codecov Report
@@ Coverage Diff @@
## master #1606 +/- ##
==========================================
- Coverage 83.79% 83.75% -0.05%
==========================================
Files 292 292
Lines 15473 15473
Branches 752 752
==========================================
- Hits 12966 12959 -7
- Misses 2507 2514 +7
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Related Issues/PRs
None
What changes are proposed in this pull request?
Add two notebooks for aisample into the notebooks/community/aisample folder.

How is this patch tested?

Does this PR change any dependencies?

Does this PR add a new feature? If so, have you added samples on website?
- Add the samples into the website/docs/documentation folder.
- Make sure you choose the correct class (estimators/transformers) and namespace.
- DocTable points to the correct API link.
- Run yarn run start to make sure the website renders correctly.
- Add <!--pytest-codeblocks:cont--> before each Python code block to enable auto-tests for Python samples.
- Make sure the WebsiteSamplesTests job passes in the pipeline.

AB#1920095