Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: Example notebook of VW vs LightGBM #641

Merged
merged 10 commits into from
Oct 11, 2019

Conversation

loomlike
Copy link
Contributor

@loomlike loomlike commented Aug 5, 2019

Regression example of Vowpal Wabbit, comparing with MMLSpark LightGBM and Spark MLlib Linear Regressor.

Regression problem example comparing Vowpal Wabbit vs LightGBM vs Linear
Regressor (Spark MLlib)
 Please enter the commit message for your changes. Lines starting
@welcome
Copy link

welcome bot commented Aug 5, 2019

💖 Thanks for opening your first pull request! 💖 We use semantic commit messages to streamline the release process. Before your pull request can be merged, you should make sure your first commit and PR title start with a semantic prefix. This helps us to create release messages and credit you for your hard work!
Examples of commit messages with semantic prefixes: - fix: Fix LightGBM crashes with empty partitions - feat: Make HTTP on Spark back-offs configurable - docs: Update Spark Serving usage - build: Add codecov support - perf: improve LightGBM memory usage - refactor: make python code generation rely on classes - style: Remove nulls from CNTKModel - test: Add test coverage for CNTKModel
Make sure to check out the developer guide for guidance on testing your change.

@eisber
Copy link
Contributor

eisber commented Aug 5, 2019

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@codecov
Copy link

codecov bot commented Aug 5, 2019

Codecov Report

Merging #641 into master will increase coverage by 9.67%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #641      +/-   ##
==========================================
+ Coverage    70.1%   79.77%   +9.67%     
==========================================
  Files         229      229              
  Lines        9154     9154              
  Branches      478      478              
==========================================
+ Hits         6417     7303     +886     
+ Misses       2737     1851     -886
Impacted Files Coverage Δ
.../execution/streaming/continuous/HTTPSourceV2.scala 93.04% <0%> (-0.37%) ⬇️
...scala/com/microsoft/ml/spark/io/http/Parsers.scala 75% <0%> (+1.04%) ⬆️
.../microsoft/ml/spark/core/schema/Categoricals.scala 86.45% <0%> (+3.12%) ⬆️
...com/microsoft/ml/spark/core/contracts/Params.scala 91.48% <0%> (+4.25%) ⬆️
...la/com/microsoft/ml/spark/io/http/HTTPSchema.scala 88.48% <0%> (+4.31%) ⬆️
...a/com/microsoft/ml/spark/io/http/HTTPClients.scala 54.71% <0%> (+5.66%) ⬆️
...om/microsoft/ml/spark/featurize/ValueIndexer.scala 77.61% <0%> (+5.97%) ⬆️
...a/com/microsoft/ml/spark/featurize/Featurize.scala 96.96% <0%> (+6.06%) ⬆️
...osoft/ml/spark/io/http/SimpleHTTPTransformer.scala 91.93% <0%> (+8.06%) ⬆️
...n/scala/org/apache/spark/ml/param/ArrayParam.scala 70% <0%> (+20%) ⬆️
... and 23 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 163dead...fd3c057. Read the comment docs.

@mhamilton723
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Collaborator

@mhamilton723 mhamilton723 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@loomlike @eisber its hard to comment directly on the NB so ill list them here:

Can you move to the Dataframe API explicitly without the flatmaps? generally you shouldnt need to use that soon to be deprecated API.

No need for matplotlib inline

Can you add the image in the results in markdown for those to compare?

- Remove rdd operations
- Remove inline plot command
- Add plot screenshot to markdown cell
@loomlike
Copy link
Contributor Author

loomlike commented Sep 1, 2019

@mhamilton723 Thank you for the review. I updated the notebook accordingly.

@mhamilton723
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

" for icol in range(ncols):\n",
" try:\n",
" feat = features[irow*ncols + icol]\n",
" xx = [r[feat] for r in train_data.select(feat).collect()]\n",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its much more efficient to select the columns in spark, then call .toPandas on the resulting dataframe. This will then make it in a nice form for plotting

Copy link
Collaborator

@mhamilton723 mhamilton723 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, theres one more small perf thing I have found. No worries if you dont want to fix it though. Tag me when ready to merge (and poke me on teams for faster merging :))

@loomlike
Copy link
Contributor Author

loomlike commented Sep 6, 2019

@mhamilton723 Good catch! yeah that code did collect() for every feature which was not necessary. Thanks for the review. I just made changes and push them.

@mhamilton723
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mhamilton723
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@eisber
Copy link
Contributor

eisber commented Sep 9, 2019

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft
Copy link
Contributor

@loomlike would you be able to update the PR to latest? the lightgbm bug should be fixed now.

@imatiach-msft
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft imatiach-msft changed the title Example notebook of VW vs LightGBM docs: Example notebook of VW vs LightGBM Oct 11, 2019
Copy link
Contributor

@imatiach-msft imatiach-msft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mhamilton723 mhamilton723 merged commit 6b07829 into microsoft:master Oct 11, 2019
@welcome
Copy link

welcome bot commented Oct 11, 2019

Congrats on merging your first pull request, we appreciate your support! 🎉🎉🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants