**<center><h1>Introduction</h1></center>**


To train a machine learning model with Azure Databricks, data scientists can use the Spark ML library. In this module, you learn how to train and evaluate a machine learning model using the Spark ML library as well as other machine learning frameworks.

**<h2>Learning Objectives</h2>**

After completing this module, you’ll be able to:

- Describe Spark ML.
- Train and validate a machine learning model.
- Use other machine learning frameworks.


<hr>

**<center><h1>Understand Spark ML</h1></center>**

Azure Databricks supports several libraries for machine learning. The key library that is native to Apache Spark offers two approaches: **MLLib** and **Spark ML**.


**<h2>MLLib</h2>**

MLLib is a legacy approach for machine learning on Apache Spark. It builds on Spark's [Resilient Distributed Dataset](https://spark.apache.org/docs/latest/rdd-programming-guide.html#resilient-distributed-datasets-rdds) (RDD) data structure. This data structure forms the foundation of Apache Spark, but additional data structures built on top of the RDD, such as DataFrames, have reduced the need to work directly with RDDs.

As of Apache Spark 2.0, the RDD-based API entered maintenance mode. This means that MLLib is still available and has not been deprecated, but no new functionality will be added to it. Instead, customers are advised to move to the ```org.apache.spark.ml``` library, commonly referred to as Spark ML.

**<h2>Spark ML</h2>**


Spark ML is the primary library for machine learning development in Apache Spark. It supports DataFrames in its API, versus the classic RDD approach. This makes Spark ML an easier library to work with for data scientists, as Spark DataFrames share many common ideas with the DataFrames used in Pandas and R.

The most confusing part about MLLib versus Spark ML is that **they are the same library**. The difference is that the "classic" MLLib namespace is ```org.apache.spark.mllib```, whereas the Spark ML namespace is ```org.apache.spark.ml```. Whenever possible, use the Spark ML namespace for new data science work.



<hr>

**<center><h1>Train and validate a model</h1></center>**

The process of training and validating a machine learning model using Spark ML is fairly straightforward. The steps are as follows:

1. Splitting data.
2. Training a model.
3. Validating a model.

**<h2>Splitting data</h2>**

The first step involves splitting data between **training** and **validation** datasets. Doing so allows a data scientist to train a model with a representative portion of the data, while still retaining some percentage as a hold-out dataset. This hold-out dataset can be useful for determining whether the trained model is **overfitting** - that is, latching onto the peculiarities of the training dataset rather than finding generally applicable relationships between variables.

DataFrames support a ```randomSplit()``` method, which makes this process of splitting data simple.

**<h2>Training a model</h2>**

Training a model relies on three key abstractions: a **transformer**, an **estimator**, and a **pipeline**.

A transformer takes a DataFrame as an input and returns a new DataFrame as an output. Transformers are helpful for performing feature engineering and feature selection, as the result of a transformer is another DataFrame. An example of this might be to read in a text column, map that text column into a set of feature vectors, and output a DataFrame with the newly mapped column. Transformers will implement a ```.transform()``` method.

An estimator takes a DataFrame as an input and returns a model, which is itself a transformer. An example of an estimator is the ```LinearRegression``` machine learning algorithm. It accepts a DataFrame and produces a ```Model```. Estimators implement a ```.fit()``` method.

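Pipelines combine estimators and transformers into an ordered series of stages and implement a ```.fit()``` method. A pipeline is itself an estimator: fitting it returns a ```PipelineModel```, which is a transformer. By breaking out the training process into a series of stages, it's easier to combine multiple algorithms.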


**<h2>Validating a model</h2>**

Once a model has been trained, it becomes possible to validate its results. Spark ML includes built-in summary statistics for models based on the algorithm of choice. With linear regression, for example, the model contains a summary object, which includes scores such as Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and coefficient of determination ($R^2$, pronounced R-squared). These measures are computed on the **training** data.

With a **validation** dataset, it is then possible to calculate summary statistics on a never-before-seen set of data by running the model's ```transform()``` method against the validation dataset. The resulting predictions can be passed to evaluators such as the ```RegressionEvaluator``` to calculate measures such as RMSE, MAE, and $R^2$.




<hr>

**<center><h1>Use other machine learning frameworks</h1></center>**

Azure Databricks supports machine learning frameworks other than Spark ML and MLLib. For example, Azure Databricks offers support for popular libraries like TensorFlow and PyTorch.

It is possible to install these libraries directly, but the recommended approach is to use the [Databricks Runtime for Machine Learning](https://docs.microsoft.com/en-us/azure/databricks/runtime/mlruntime). This runtime comes with various machine learning libraries pre-installed, including TensorFlow, PyTorch, Keras, and XGBoost. It also includes libraries essential for distributed training, allowing data scientists to take advantage of the distributed nature of Apache Spark.

For libraries that do not support distributed training, it is also possible to use a [single node](https://docs.microsoft.com/en-us/azure/databricks/clusters/single-node) cluster. For example, [PyTorch](https://docs.microsoft.com/en-us/azure/databricks/applications/machine-learning/train-model/pytorch#use-pytorch-on-a-single-node) and [TensorFlow](https://docs.microsoft.com/en-us/azure/databricks/applications/machine-learning/train-model/tensorflow#use-tensorflow-on-a-single-node) both support single-node use.



<hr>

**<center><h1>Exercise - Train a machine learning model</h1></center>**

Now it's your chance to use Azure Databricks to train a multivariate regression model and interpret its results.

In this exercise, you will:

- Train a model.
- Validate a model.

**<h2>Instructions</h2>**

Follow these instructions to complete the exercise:

1. Open the exercise instructions at https://aka.ms/mslearn-dp090.
2. Complete the **Training and Validating a Machine Learning Model** exercises.



<hr>

**<center><h1>Summary</h1></center>**

In this module, you learned how to train and evaluate a machine learning model.

Now that you've completed this module, you can:

- Describe Spark ML.
- Train and validate a machine learning model.
- Use other machine learning frameworks.



<hr>