## Getting Started with MMLSpark
In this exercise, you will use the Microsoft Machine Learning for Spark (MMLSpark) library to create a classifier.

### Load the Data
First, you'll load the flight delay data from your Azure storage account and create a dataframe with a **Late** column (1 when **ArrDelay** is greater than 15, otherwise 0) that will be the label your classifier predicts.

In [2]:
import mmlspark
from pyspark.sql.functions import col

# Read the flight data, inferring column types from the CSV contents.
csv = spark.read.csv('wasb://spark@<YOUR_ACCOUNT>.blob.core.windows.net/data/flights.csv', inferSchema=True, header=True)
# Keep the feature columns and derive the Late label: 1 if ArrDelay > 15, otherwise 0.
data = csv.select("DayofMonth", "DayOfWeek", "OriginAirportID", "DestAirportID", "DepDelay",
                  (col("ArrDelay") > 15).cast("int").alias("Late"))
data.show()
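
The labeling rule in the `select` expression can be sketched in plain Python to make its boundary behavior explicit. The helper name `late_label` and the constant `LATE_THRESHOLD` are hypothetical, and the sketch assumes **ArrDelay** is measured in minutes. It mirrors Spark's behavior of producing a null label when the delay itself is null.

```python
# Threshold used in the exercise's select expression (assumed to be minutes).
LATE_THRESHOLD = 15


def late_label(arr_delay):
    """Hypothetical helper mirroring (col("ArrDelay") > 15).cast("int")."""
    if arr_delay is None:
        # In Spark, comparing null with a number yields null, and the cast keeps it null.
        return None
    # Strictly greater than the threshold counts as late; exactly 15 does not.
    return int(arr_delay > LATE_THRESHOLD)


for delay in (-3, 15, 16, None):
    print(delay, "->", late_label(delay))
```

Rows with a null label may need to be dropped or filled before training, depending on how the data was recorded.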

### Split the Data for Training and Testing
Now you'll split the data into two sets: one for training a classification model, and the other for testing the trained model.

In [4]:
train, test = data.randomSplit([0.7, 0.3])
train_rows = train.count()
test_rows = test.count()
print("Training Rows:", train_rows, "Testing Rows:", test_rows)
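
Because `randomSplit` samples rows at random, the resulting counts are only approximately 70% and 30%, and they change from run to run unless a seed is supplied. The following pure-Python sketch illustrates that idea. It is not Spark's implementation, and the function name `random_split` is hypothetical.

```python
import random


def random_split(rows, weights, seed=None):
    """Illustrative only: assign each row to one split by a weighted random draw."""
    total = sum(weights)  # weights are normalized, so [7, 3] behaves like [0.7, 0.3]
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random() * total
        cumulative = 0.0
        for index, weight in enumerate(weights):
            cumulative += weight
            if draw < cumulative:
                splits[index].append(row)
                break
        else:
            # Guard against floating-point rounding at the upper boundary.
            splits[-1].append(row)
    return splits


rows = list(range(1000))
train_part, test_part = random_split(rows, [0.7, 0.3], seed=42)
# The sizes are close to 700 and 300 but not guaranteed to match exactly.
print(len(train_part), len(test_part))
```

In the exercise itself, passing a seed (for example `data.randomSplit([0.7, 0.3], seed=42)`) should make the split repeatable across runs on the same data.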

### Train a Classification Model
The steps so far have been identical to those used to prepare data for training using SparkML. Now you'll use the MMLSpark **TrainClassifier** estimator to initialize and fit a logistic regression model. It wraps the various SparkML classes normally needed to do this, and implicitly converts the data into the correct format for the algorithm.

In [6]:
from mmlspark import TrainClassifier
from pyspark.ml.classification import LogisticRegression
model = TrainClassifier(model=LogisticRegression(), labelCol="Late", numFeatures=256).fit(train)
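
For comparison, here is a rough sketch of the kind of SparkML pipeline you would otherwise assemble by hand. It is an approximation under stated assumptions, not a description of what **TrainClassifier** does internally: the helper name `build_manual_pipeline` is hypothetical, all feature columns are assumed to be numeric, and the `numFeatures` setting is not reproduced.

```python
# Feature columns taken from the dataframe built earlier in this exercise.
FEATURE_COLS = ("DayofMonth", "DayOfWeek", "OriginAirportID", "DestAirportID", "DepDelay")


def build_manual_pipeline(feature_cols=FEATURE_COLS, label_col="Late"):
    """Hypothetical helper: plain SparkML stages roughly analogous to TrainClassifier."""
    # Imported here so the helper can be defined before a Spark session exists.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler

    # Pack the numeric input columns into the single vector column SparkML expects.
    assembler = VectorAssembler(inputCols=list(feature_cols), outputCol="features")
    classifier = LogisticRegression(labelCol=label_col, featuresCol="features")
    return Pipeline(stages=[assembler, classifier])
```

With an active Spark session, `build_manual_pipeline().fit(train)` would return a fitted `PipelineModel` whose `transform` method scores new data, much like the model returned above.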

### Evaluate the Model
The MMLSpark library also includes classes to calculate the performance metrics of a trained model. The following code calculates metrics for the classifier and registers them as a temporary table named **classMetrics**.

In [8]:
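from mmlspark import ComputeModelStatistics
# Score the held-out test data, then compute classification metrics from the scored rows.
prediction = model.transform(test)
metrics = ComputeModelStatistics().transform(prediction)
# Register the metrics as a temporary view so they can be queried with SQL.
metrics.createOrReplaceTempView("classMetrics")
metrics.show()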

If the output above is too wide to view clearly, run the cell after the following list to display the results as a scrollable table. The metrics include:
- predicted_class_as_0.0_actual_is_0.0 (true negatives)
- predicted_class_as_0.0_actual_is_1.0 (false negatives)
- predicted_class_as_1.0_actual_is_0.0 (false positives)
- predicted_class_as_1.0_actual_is_1.0 (true positives)
- accuracy (proportion of correct predictions)
- precision (proportion of predicted positives that are actually positive)
- recall (proportion of actual positives correctly predicted by the model)
- AUC (area under the ROC curve, which plots the true positive rate against the false positive rate across all classification thresholds)

In [10]:
%sql
SELECT * FROM classMetrics
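
The accuracy, precision, and recall values in the list above follow directly from the four confusion-matrix counts. The sketch below shows the arithmetic in plain Python. The function name `classification_metrics` is hypothetical and the counts are made-up illustrative numbers, not results from the flight data.

```python
def classification_metrics(tn, fn, fp, tp):
    """Derive accuracy, precision, and recall from confusion-matrix counts."""
    total = tn + fn + fp + tp
    predicted_positive = tp + fp
    actual_positive = tp + fn
    return {
        # Share of all predictions that were correct.
        "accuracy": (tp + tn) / total if total else 0.0,
        # Share of predicted positives that were actually positive.
        "precision": tp / predicted_positive if predicted_positive else 0.0,
        # Share of actual positives that the model found.
        "recall": tp / actual_positive if actual_positive else 0.0,
    }


# Illustrative counts only (not taken from the exercise output).
example = classification_metrics(tn=50, fn=10, fp=10, tp=30)
print(example)  # accuracy 0.8, precision 0.75, recall 0.75
```

AUC cannot be derived from a single confusion matrix, because it summarizes performance across every possible classification threshold.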

### Learn More
This exercise has shown a simple example of using the MMLSpark library. The library provides its greatest value when building deep learning models with the Microsoft Cognitive Toolkit (CNTK). To learn more about the MMLSpark library, see https://github.com/Azure/mmlspark.