---
title: Train machine learning models with Apache Spark
description: Use Apache Spark in Fabric to train machine learning models
ms.reviewer: franksolomon
ms.author: midesa
author: midesa
ms.topic: conceptual
ms.custom: build-2023, build-2023-dataai, build-2023-fabric
ms.date: 06/13/2024
---

# Train machine learning models

Apache Spark - a part of [!INCLUDE product-name] - enables machine learning with big data. With Apache Spark, you can derive valuable insights from large volumes of structured, unstructured, and fast-moving data. You can choose from several open-source libraries when you train machine learning models with Apache Spark in [!INCLUDE product-name]: Apache Spark MLlib, SynapseML, and others.

## Apache SparkML and MLlib

Apache Spark - a part of [!INCLUDE product-name] - provides a unified, open-source, parallel data processing framework. This framework supports in-memory processing that boosts big data analytics. The Spark processing engine is built for speed, ease of use, and sophisticated analytics. Spark's in-memory distributed computation capabilities make it a good choice for the iterative algorithms that machine learning and graph computations use.

The MLlib and SparkML scalable machine learning libraries bring algorithmic modeling capabilities to this distributed environment. MLlib contains the original API, built on top of RDDs. SparkML is a newer package. It provides a higher-level API, built on top of DataFrames, for constructing ML pipelines. SparkML doesn't yet support all of the features of MLlib, but it's replacing MLlib as the standard Spark machine learning library.
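
To make the DataFrame-based pipeline idea concrete, here's a minimal SparkML sketch. It assumes a running Spark session named `spark`, as in a [!INCLUDE product-name] notebook; the column names and sample rows are hypothetical.

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import StringIndexer, VectorAssembler

# Hypothetical training DataFrame: one categorical and one numeric feature.
train_df = spark.createDataFrame(
    [("red", 1.0, 0), ("blue", 2.0, 1), ("red", 3.0, 0), ("blue", 4.0, 1)],
    ["color", "amount", "label"],
)

# Encode the categorical column, assemble a feature vector, then fit a classifier.
indexer = StringIndexer(inputCol="color", outputCol="color_idx")
assembler = VectorAssembler(inputCols=["color_idx", "amount"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[indexer, assembler, lr])
model = pipeline.fit(train_df)
model.transform(train_df).select("color", "amount", "prediction").show()
```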

> [!NOTE]
> For more information about SparkML model creation, visit the Train models with Apache Spark MLlib resource.

## Popular libraries

The [!INCLUDE product-name] runtime for Apache Spark includes several popular, open-source packages for training machine learning models. These libraries provide reusable code that you can include in your programs or projects. The runtime includes these machine learning libraries, among others:

- Scikit-learn - one of the most popular single-node machine learning libraries for classical ML algorithms. Scikit-learn supports most supervised and unsupervised learning algorithms, and can handle data mining and data analysis (see the single-node sketch after this list).

- XGBoost - a popular machine learning library that contains optimized algorithms for training decision trees and random forests.

- PyTorch and TensorFlow - powerful Python deep learning libraries. With these libraries, you can set the number of executors on your pool to zero to build single-machine models. Although that configuration doesn't support Apache Spark, it's a simple, cost-effective way to create single-machine models.
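
As referenced in the scikit-learn item above, single-node libraries run entirely on the driver. The following minimal scikit-learn sketch illustrates the pattern; the dataset and split parameters are illustrative, not from this article.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small built-in dataset and hold out 20% for evaluation.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train on a single node; no Spark executors are involved.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(f"Accuracy: {accuracy_score(y_test, clf.predict(X_test)):.3f}")
```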

## SynapseML

The SynapseML open-source library (previously known as MMLSpark) simplifies the creation of massively scalable machine learning (ML) pipelines. With SynapseML, data scientists become more productive with Spark, because the library increases the rate of experimentation and applies cutting-edge machine learning techniques - including deep learning - on large datasets.

SynapseML provides a layer above the SparkML low-level APIs when building scalable ML models. These APIs cover string indexing, feature vector assembly, coercion of data into layouts appropriate for machine learning algorithms, and more. The SynapseML library simplifies these and other common tasks for building models in PySpark.
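
As one example of that simplification, the SynapseML `TrainClassifier` estimator featurizes a DataFrame (string indexing, vector assembly, and similar steps) before fitting, so you don't wire up those stages yourself. This is a minimal sketch; it assumes a running Spark session named `spark`, and the sample DataFrame is hypothetical.

```python
from pyspark.ml.classification import LogisticRegression
from synapse.ml.train import TrainClassifier

# Hypothetical raw DataFrame with mixed column types and no manual featurization.
train_df = spark.createDataFrame(
    [("red", 1.0, 0), ("blue", 2.0, 1), ("red", 3.0, 0), ("blue", 4.0, 1)],
    ["color", "amount", "label"],
)

# TrainClassifier handles the feature-engineering steps that raw SparkML
# would otherwise require as explicit pipeline stages.
model = TrainClassifier(model=LogisticRegression(), labelCol="label").fit(train_df)
model.transform(train_df).show()
```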

## Related content

This article provides an overview of the various options available to train machine learning models within Apache Spark in [!INCLUDE product-name]. For more information about model training, visit these resources: