title | titleSuffix | description | author | ms.author | ms.reviewer | ms.date | ms.service | ms.subservice | ms.topic |
---|---|---|---|---|---|---|---|---|---|
Use Spark Machine Learning | SQL Server Big Data Clusters | Introducing Spark Machine Learning on SQL Server Big Data Clusters. | HugoMSFT | hudequei | wiassaf | 10/05/2021 | sql | machine-learning-bdc | conceptual |
[!INCLUDESQL Server 2019]
[!INCLUDEbig-data-clusters-banner-retirement]
This article explains how to effectively use Spark for Machine Learning on [!INCLUDEbig-data-clusters-nover].
SQL Server Big Data Clusters supports machine learning scenarios and solutions through two technology stacks: SQL Server Machine Learning Services and Apache Spark ML.
To better understand when to use each technology stack, refer to the Machine Learning guide for SQL Server Big Data Clusters. This article covers Apache Spark ML.
For big data-based machine learning scenarios, hosting the data in HDFS and processing it with Apache Spark ML is a more cost-effective, scalable, and powerful approach. The scenarios in this article are far from an exhaustive list of what Spark Machine Learning can do. For a complete list of features, see Spark MLlib.
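As a minimal sketch of this approach, the following PySpark code reads a Parquet dataset from HDFS and trains a logistic regression pipeline with Spark ML. It is an illustration under stated assumptions, not a reference implementation: the function names `feature_columns` and `train_from_hdfs`, the dataset path, and the label column are hypothetical, and every non-label column is assumed to be numeric.

```python
def feature_columns(all_columns, label_col):
    """Return every column except the label, preserving the original order."""
    return [c for c in all_columns if c != label_col]


def train_from_hdfs(spark, data_path, label_col):
    """Train a logistic regression pipeline on a Parquet dataset stored in HDFS.

    Assumes all non-label columns are numeric (a VectorAssembler requirement).
    """
    # Imported inside the function so the helper above stays usable
    # on a machine where PySpark is not installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler

    df = spark.read.parquet(data_path)

    # Hold out 20% of the rows to evaluate the fitted model.
    train, test = df.randomSplit([0.8, 0.2], seed=42)

    # Combine the feature columns into the single vector column Spark ML expects.
    assembler = VectorAssembler(
        inputCols=feature_columns(df.columns, label_col),
        outputCol="features",
    )
    classifier = LogisticRegression(featuresCol="features", labelCol=label_col)

    model = Pipeline(stages=[assembler, classifier]).fit(train)
    return model, model.transform(test)
```

In a notebook session where a `spark` session object is already available, a call might look like `model, scored = train_from_hdfs(spark, "/data/example.parquet", "label")`, where the path and column name are placeholders.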
The next section provides a curated list of scenarios and references for Spark in SQL Server Big Data Clusters.
Learn | Contents | Link |
---|---|---|
SQL Server Big Data Clusters runtime for Apache Spark | Lists what's included with each release | SQL Server Big Data Clusters runtime for Apache Spark Guide |
The Storage Pool | How to store and use HDFS + Spark together to unlock data for machine learning | [Introducing the storage pool in [!INCLUDEbig-data-clusters-2019]](concept-storage-pool.md) |
Use notebook-based experiences and your tools of choice | Connect to the Spark-Livy endpoint using your tools of choice | [Submit Spark jobs on [!INCLUDEbig-data-clusters-2019] in Azure Data Studio](spark-submit-job.md); Submit Spark jobs on SQL Server big data cluster in Visual Studio Code; Use sparklyr in SQL Server big data cluster |
How to install extra packages | If a package is not provided out of the box, install it yourself | Spark library management |
How to troubleshoot | Diagnose failures in notebooks and Spark applications | Troubleshoot a pyspark notebook; [Debug and Diagnose Spark Applications on [!INCLUDEbig-data-clusters-2019] in Spark History Server](spark-history-server.md) |
How to submit machine learning batch jobs | Run ML training and batch scoring from the command line | Submit Spark jobs by using command-line tools |
How to quickly move data between SQL Server and Spark | Use SQL Server as a source, a destination, or both for your Spark ML scenarios. Using HDFS is not mandatory | Use the Apache Spark Connector for SQL Server and Azure SQL |
Spark model operationalization | After training, operationalize using MLeap | [Create, export, and score Spark machine learning models on [!INCLUDEbig-data-clusters-2019]](spark-create-machine-learning-model.md) |
Data wrangling | In addition to Spark's powerful data wrangling capabilities, PROSE is included | Data Wrangling using PROSE Code Accelerator |
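
To illustrate the data movement scenario in the table above, the following sketch shows one way to read from and write to SQL Server with the Apache Spark Connector for SQL Server and Azure SQL, which registers the data source format `com.microsoft.sqlserver.jdbc.spark`. The helper names, host, database, and table values are hypothetical placeholders. The sketch assumes SQL authentication and that the connector is available in the Spark session.

```python
# Data source name registered by the Apache Spark Connector for SQL Server and Azure SQL.
CONNECTOR_FORMAT = "com.microsoft.sqlserver.jdbc.spark"


def build_connector_options(host, port, database, table, user, password):
    """Build the option dictionary passed to the connector (hypothetical helper)."""
    return {
        "url": f"jdbc:sqlserver://{host}:{port};databaseName={database};",
        "dbtable": table,
        "user": user,
        "password": password,
    }


def read_from_sql_server(spark, options):
    """Load a SQL Server table into a Spark DataFrame, for example as training data."""
    return spark.read.format(CONNECTOR_FORMAT).options(**options).load()


def write_to_sql_server(df, options, mode="append"):
    """Write a Spark DataFrame, for example batch scoring results, to SQL Server."""
    df.write.format(CONNECTOR_FORMAT).mode(mode).options(**options).save()
```

Keeping the options in one dictionary lets the same settings serve both directions, so a job can read training data from one table and write predictions to another by changing only `dbtable`. In practice, credentials would come from a secure store rather than from literals in the code.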
For more information, see [Introducing [!INCLUDEbig-data-clusters-nover]](big-data-cluster-overview.md).