---
title: Train machine learning models with Apache Spark
description: Use Apache Spark in Fabric to train machine learning models
ms.reviewer: scottpolly
ms.author: midesa
author: midesa
ms.topic: conceptual
ms.custom: build-2023, build-2023-dataai, build-2023-fabric
ms.date: 05/23/2023
---

# Train machine learning models

Apache Spark in [!INCLUDE product-name] enables machine learning with big data, providing the ability to obtain valuable insight from large amounts of structured, unstructured, and fast-moving data. There are several options when training machine learning models using Apache Spark in [!INCLUDE product-name]: Apache Spark MLlib, SynapseML, and various other open-source libraries.

## Apache SparkML and MLlib

Apache Spark in [!INCLUDE product-name] provides a unified, open-source, parallel data processing framework supporting in-memory processing to boost big data analytics. The Spark processing engine is built for speed, ease of use, and sophisticated analytics. Spark's in-memory distributed computation capabilities make it a good choice for the iterative algorithms used in machine learning and graph computations.

There are two scalable machine learning libraries that bring algorithmic modeling capabilities to this distributed environment: MLlib and SparkML. MLlib contains the original API built on top of RDDs. SparkML is a newer package that provides a higher-level API built on top of DataFrames for constructing ML pipelines. SparkML doesn't yet support all of the features of MLlib, but is replacing MLlib as Spark's standard machine learning library.

> [!NOTE]
> You can learn more about creating a SparkML model in the article Train models with Apache Spark MLlib.

## Popular libraries

The [!INCLUDE product-name] runtime for Apache Spark includes several popular, open-source packages for training machine learning models. These libraries provide reusable code that you may want to include in your programs or projects. Some of the machine learning libraries included by default are:

- **Scikit-learn** is one of the most popular single-node machine learning libraries for classical ML algorithms. Scikit-learn supports most of the common supervised and unsupervised learning algorithms and can also be used for data mining and data analysis.

- **XGBoost** is a popular machine learning library that contains optimized algorithms for training decision trees and random forests.

- **PyTorch** and **TensorFlow** are powerful Python deep learning libraries. You can use these libraries to build single-machine models by setting the number of executors on your pool to zero. Even though Apache Spark isn't functional under this configuration, it's a simple and cost-effective way to create single-machine models.
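For example, a single-node library such as scikit-learn runs entirely on the driver. This sketch (the dataset and hyperparameters are illustrative) trains a random forest classifier on the bundled iris data:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small built-in dataset and hold out a test split.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Train a classical ML model on a single node; no Spark executors needed.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"test accuracy: {accuracy:.2f}")
```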

## SynapseML

SynapseML (previously known as MMLSpark) is an open-source library that simplifies the creation of massively scalable machine learning (ML) pipelines. This library is designed to make data scientists more productive on Spark, increase the rate of experimentation, and leverage cutting-edge machine learning techniques, including deep learning, on large datasets.

When building scalable ML models, SynapseML provides a layer on top of SparkML's low-level APIs for common tasks such as indexing strings, coercing data into the layout expected by machine learning algorithms, and assembling feature vectors. The SynapseML library simplifies these and other common tasks for building models in PySpark.

## Related content

This article provides an overview of the options for training machine learning models within Apache Spark in [!INCLUDE product-name]. You can learn more about model training by following the related tutorials: