Strata 2018 Tutorial on Using R and Python for Scalable Data Science, Machine Learning, and AI

Instructions

Provision a CentOS Linux Data Science Virtual Machine; the size "Standard_DS12_v2" works well: https://azuremarketplace.microsoft.com/en-us/marketplace/apps/microsoft-ads.linux-data-science-vm?tab=Overview

Log in to JupyterHub by pointing your web browser to https://hostname:8000 (be sure to use https, not http, and replace "hostname" with the hostname or IP address of your virtual machine). Please disregard warnings about certificate errors.

Open a bash terminal window in JupyterHub by clicking the New button and then clicking Terminal.

In the terminal, run these four commands:

cd ~/notebooks

git clone https://github.com/Azure/Strata2018

cd Strata2018

source startup.sh

You can now log in to RStudio Server at http://hostname:8787 (unlike JupyterHub, be sure to use http, not https).
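
The two services use different schemes and ports, which is easy to mix up. As a minimal sketch, the helper below builds both URLs from a hostname; the function name `service_urls` and the example hostname are illustrative assumptions, not part of the tutorial materials.

```python
def service_urls(hostname):
    """Return the JupyterHub and RStudio Server URLs for one virtual machine.

    `hostname` is the hostname or IP address of the Data Science Virtual Machine.
    """
    return {
        # JupyterHub listens on port 8000 and must be reached over https.
        "jupyterhub": f"https://{hostname}:8000",
        # RStudio Server listens on port 8787 and must be reached over http.
        "rstudio": f"http://{hostname}:8787",
    }


# Hypothetical hostname, used only for illustration.
urls = service_urls("my-dsvm.example.com")
print(urls["jupyterhub"])
print(urls["rstudio"])
```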

Abstract

Pre-trained Deep Learning models and Transfer Learning, accessed through R and Python APIs, make custom Image Classification accessible to data scientists and application developers, whether they have large or small amounts of labeled data. This tutorial walks you through creating end-to-end data science solutions in R and Python on virtual machines, Spark environments, and cloud-based infrastructure, and consuming them in production. It also covers strategies and best practices for porting and interoperating between R and Python, using a novel Deep Learning use case for Image Classification as the running example.

The tutorial materials and the scripts that are used to create the virtual machines configured as single-node Spark clusters are published in this GitHub repository, so you’ll be able to create environments identical to the ones you use in the tutorial by running the scripts after the tutorial session completes.

Outline:

What limits the scalability of R and Python scripts?
What functions and techniques can be used to overcome those limits?
Hands-on, end-to-end Deep Learning-based Image Classification example in R and Python using functions that scale from single nodes to distributed computing clusters
1. Data exploration and wrangling
2. Featurization and Modeling
3. Deployment and Consumption
4. Scaling with distributed computing
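
The featurization, modeling, and scaling steps in the outline can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the tutorial's actual code: `FROZEN_WEIGHTS` is a hand-written stand-in for a pre-trained network, the "images" are invented 4-pixel lists, the classifier is a simple nearest-centroid model, and a thread pool stands in for a distributed cluster. The idea it shows is that a frozen featurizer written as a per-image function can be mapped serially or in parallel without changing the modeling code.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical frozen "pre-trained" weights for 4-pixel grayscale images.
# A real workflow would use a pre-trained deep network here instead.
FROZEN_WEIGHTS = [
    [0.25, 0.25, 0.25, 0.25],  # overall brightness
    [0.5, 0.5, -0.5, -0.5],    # first two pixels versus last two pixels
]


def featurize(pixels):
    """Turn one image into a feature vector; the weights are never retrained."""
    return [sum(w * p for w, p in zip(row, pixels)) for row in FROZEN_WEIGHTS]


def featurize_all(images, mapper=map):
    """Featurize every image; `mapper` can be the built-in map or a pool's map."""
    return list(mapper(featurize, images))


def fit_centroids(features, labels):
    """Train the small model: one mean feature vector per class."""
    grouped = {}
    for vector, label in zip(features, labels):
        grouped.setdefault(label, []).append(vector)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def predict(centroids, vector):
    """Return the label whose centroid is closest to the feature vector."""
    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(centroids[label], vector))
    return min(centroids, key=squared_distance)


# Invented labeled examples: only a handful are needed to fit the small model.
train_images = [
    [0.9, 0.8, 0.9, 1.0], [1.0, 0.9, 0.8, 0.9],
    [0.1, 0.0, 0.2, 0.1], [0.0, 0.1, 0.1, 0.2],
]
train_labels = ["bright", "bright", "dark", "dark"]
centroids = fit_centroids(featurize_all(train_images), train_labels)

new_images = [[0.7, 0.9, 0.8, 0.9], [0.2, 0.1, 0.0, 0.1]]

# Single-node, serial featurization.
serial = [predict(centroids, f) for f in featurize_all(new_images)]

# Same featurizer, mapped across workers (a stand-in for a cluster).
with ThreadPoolExecutor(max_workers=2) as pool:
    parallel = [
        predict(centroids, f)
        for f in featurize_all(new_images, mapper=pool.map)
    ]

print(serial, parallel)
```

Because `featurize` depends only on its input image, swapping the `mapper` argument changes how the work is distributed but not the results.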

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

