spark-machine-learning.md

File metadata and controls

49 lines (35 loc) · 4.01 KB
---
title: Use Spark Machine Learning
titleSuffix: SQL Server Big Data Clusters
description: Introducing Spark Machine Learning on SQL Server Big Data Clusters.
author: HugoMSFT
ms.author: hudequei
ms.reviewer: wiassaf
ms.date: 10/05/2021
ms.service: sql
ms.subservice: machine-learning-bdc
ms.topic: conceptual
---

Introducing Spark Machine Learning on SQL Server Big Data Clusters

[!INCLUDESQL Server 2019]

[!INCLUDEbig-data-clusters-banner-retirement]

This article explains how to effectively use Spark for Machine Learning on [!INCLUDEbig-data-clusters-nover].

Spark Machine Learning in SQL Server Big Data Clusters

SQL Server Big Data Clusters enables machine learning scenarios and solutions using different technology stacks: SQL Server Machine Learning Services and Apache Spark ML.

To better understand when to use each technology stack, refer to the Machine Learning guide for SQL Server Big Data Clusters. This article covers Apache Spark ML.

For machine learning on big data, hosting the data in HDFS and using Apache Spark ML capabilities is a cost-effective, scalable, and powerful approach. This article does not list everything that can be achieved with Spark Machine Learning. For a complete list of features, see Spark MLlib.

The next section provides a curated list of scenarios and references for Spark in SQL Server Big Data Clusters.

Building blocks for Spark Machine Learning on SQL Server Big Data Clusters

| Learn | Contents | Link |
|---|---|---|
| SQL Server Big Data Clusters runtime for Apache Spark | Shows what's included with each release | SQL Server Big Data Clusters runtime for Apache Spark Guide |
| The Storage Pool | How to store and use HDFS + Spark together to unlock data for machine learning | [Introducing the storage pool in [!INCLUDEbig-data-clusters-2019]](concept-storage-pool.md) |
| Use notebook-based experiences and your tools of choice | Connect to the Spark-Livy endpoint using your tools of choice | [Submit Spark jobs on [!INCLUDEbig-data-clusters-2019] in Azure Data Studio](spark-submit-job.md)<br>Submit Spark jobs on SQL Server big data cluster in Visual Studio Code<br>Use sparklyr in SQL Server big data cluster |
| How to install extra packages | If a package is not provided out of the box, install it | Spark library management |
| How to troubleshoot | What to do when a notebook or Spark application fails | Troubleshoot a pyspark notebook<br>[Debug and Diagnose Spark Applications on [!INCLUDEbig-data-clusters-2019] in Spark History Server](spark-history-server.md) |
| How to submit machine learning batch jobs | Run ML training and batch scoring from the command line | Submit Spark jobs by using command-line tools |
| How to quickly move data between SQL Server and Spark | Use SQL Server as the source, the destination, or both for your Spark ML scenarios. Using HDFS is not mandatory | Use the Apache Spark Connector for SQL Server and Azure SQL |
| Spark model operationalization | After training, operationalize using MLeap | [Create, export, and score Spark machine learning models on [!INCLUDEbig-data-clusters-2019]](spark-create-machine-learning-model.md) |
| Data wrangling | Along with Spark's powerful data wrangling capabilities, we ship PROSE | Data Wrangling using PROSE Code Accelerator |
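
As an illustration of the "move data between SQL Server and Spark" building block, the following is a minimal sketch of how the options for the Apache Spark Connector for SQL Server and Azure SQL might be assembled in PySpark. The helper name `sql_connector_options`, the host `master-0.master-svc`, the database, the table, and the credentials are all hypothetical placeholders, not values taken from this article. Check the connector article for the options supported by your release.

```python
from typing import Dict

# Data source name used by the Apache Spark Connector for SQL Server and Azure SQL.
CONNECTOR_FORMAT = "com.microsoft.sqlserver.jdbc.spark"


def sql_connector_options(server: str, database: str, table: str,
                          user: str, password: str,
                          port: int = 1433) -> Dict[str, str]:
    """Build the option map for a Spark read or write against SQL Server.

    Hypothetical helper for illustration only; it just assembles strings.
    """
    url = f"jdbc:sqlserver://{server}:{port};databaseName={database};"
    return {"url": url, "dbtable": table, "user": user, "password": password}


# All values below are placeholders (assumptions), not real cluster settings.
opts = sql_connector_options(
    server="master-0.master-svc",
    database="sales",
    table="dbo.training_data",
    user="spark_user",
    password="<password>",
)

# Inside a Spark session (for example, a notebook attached to the cluster),
# the options could then be used roughly like this:
#
#   df = spark.read.format(CONNECTOR_FORMAT).options(**opts).load()
#   df.write.format(CONNECTOR_FORMAT).mode("append").options(**opts).save()
```

Keeping the connection options in one place makes it easier to reuse the same settings for both the training read and the scoring write. In practice, credentials should come from a secret store rather than from notebook code.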

Next steps

For more information, see [Introducing [!INCLUDEbig-data-clusters-nover]](big-data-cluster-overview.md).