## Chapter 24, Advanced Analytics and Machine Learning
You can add comments in a Spark notebook. To do so, load this notebook file into the Databricks interface, double-click this cell to edit it, and look at what is on the very first line. Use the same kind of line for your own comments.

The usual Markdown formatting can be used as well. Google "markdown format" to learn more about it.

This chapter presents the basic steps of machine learning.
More detailed examples are expected in the following chapters.

Everything that differs from the corresponding file in the STDG GitHub repository is marked with **Note**.

In [2]:
from pyspark.ml.linalg import Vectors
denseVec = Vectors.dense(1.0, 2.0, 3.0)
size = 3
idx = [1, 2] # locations of non-zero elements in vector
values = [2.0, 3.0]
sparseVec = Vectors.sparse(size, idx, values)
print(sparseVec)

The vector `sparseVec` in the previous cell has 3 entries. At indices 1 and 2 it holds the float values 2.0 and 3.0, respectively, and every other entry is 0. Since indices start at 0, its dense form is (0, 2.0, 3.0). It is not very sparse: in practice sparse vectors usually have far fewer than 10% non-zero values, but it works as an example.
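
To make the layout concrete, here is a minimal pure-Python sketch that expands a `(size, idx, values)` triple into a dense list. The helper name `sparse_to_dense` is hypothetical and not part of pyspark; on the real `sparseVec` object, `sparseVec.toArray()` should give the same values as a NumPy array.

```python
def sparse_to_dense(size, idx, values):
    """Expand a (size, indices, values) triple into a dense list of floats."""
    dense = [0.0] * size              # every position starts at zero
    for i, v in zip(idx, values):     # fill in only the stored entries
        dense[i] = float(v)
    return dense


dense = sparse_to_dense(3, [1, 2], [2.0, 3.0])
print(dense)
```

Index 0 is never mentioned in `idx`, so it stays at zero.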


#### 1st ML Example

**Note:** I added a line here that counts the rows of the data frame.

In [4]:
df = spark.read.json("/databricks-datasets/definitive-guide/data/simple-ml")
print(df.count()) # this is my line
df.orderBy("value2").show()


**Note:** The cell below, which defines the `supervised` object, was missing from the corresponding GitHub script, so I added it from the book.

Here we see an imitation of an R formula object.
Read this article if you want a refresher on it: https://faculty.chicagobooth.edu/richard.hahn/teaching/formulanotation.pdf

Recall that in R, when we fit a linear regression with an R formula, every categorical variable is binarized while the formula object is constructed. It works as follows: for each such variable the set of distinct values is extracted, and for every distinct value a so-called "dummy" variable is created. It holds 1 for records where that value appears and 0 otherwise. Therefore a categorical variable with 3 distinct values yields 3 dummy variables, although one of them is usually dropped because it is determined by the others.
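
The binarization step can be sketched in a few lines of plain Python. This is only an illustration of the idea, not what R or Spark run internally: the function name `dummy_encode`, its `drop_last` flag, and the sample column are all made up for this example.

```python
def dummy_encode(column, drop_last=False):
    """Return (levels, rows) where each row holds one 0/1 indicator per level."""
    levels = sorted(set(column))          # distinct values, in a fixed order
    if drop_last:
        levels = levels[:-1]              # the dropped level becomes the all-zero baseline
    rows = [[1 if value == level else 0 for level in levels] for value in column]
    return levels, rows


colors = ["green", "blue", "red", "green"]   # made-up sample column
levels, rows = dummy_encode(colors)
print(levels)
for value, row in zip(colors, rows):
    print(value, row)
```

With `drop_last=True` the last level gets no column of its own, so its records are encoded as all zeros.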

By the formula code `lab ~ . +color:value1 + color:value2` the authors mean the following formula:
$$
\text{lab}= \text{color}\cdot w_1 + \text{value1}\cdot w_2 + \text{value2}\cdot w_3 + \text{color}\cdot \text{value1}\cdot w_4 + 
\text{color}\cdot \text{value2}\cdot w_5
$$
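The first 3 summands come from the period in the formula. The period means "include all variables except the one to the left of `~`". The remaining terms, which are products of variables, are called *variable interactions* in statistics. Because `color` is categorical, each term that contains it actually stands for several dummy-variable terms, each with its own weight.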

The R formula is not mandatory. In pyspark it is possible to provide the outcome (`lab` in this case) and the features by the usual Python means, without an R formula: with a vector (or perhaps an array) for the outcome and an array for the other variables. In that case we are on our own when it comes to binarizing categorical variables and adding variable interactions.
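
As a rough sketch of what doing it by hand involves, the plain-Python function below builds one feature row with dummies for `color`, the two numeric variables, and both interactions. The function name, the level list, and the sample record are assumptions made for illustration; the column order and the number of columns that `RFormula` actually produces may differ (for example, it may drop a level).

```python
def build_features(record, color_levels):
    """Build [color dummies, value1, value2, color:value1, color:value2] for one record."""
    dummies = [1.0 if record["color"] == level else 0.0 for level in color_levels]
    v1, v2 = float(record["value1"]), float(record["value2"])
    inter_v1 = [d * v1 for d in dummies]   # color:value1 interaction terms
    inter_v2 = [d * v2 for d in dummies]   # color:value2 interaction terms
    return dummies + [v1, v2] + inter_v1 + inter_v2


# Made-up record; the field names follow the formula, the values are illustrative.
record = {"color": "green", "value1": 2, "value2": 0.5}
features = build_features(record, ["blue", "green", "red"])
print(features)
```

Each interaction is simply the product of a dummy and a numeric value, so for any record only the terms of its own color are non-zero.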

In [6]:
from pyspark.ml.feature import RFormula
supervised = RFormula(formula="lab ~ . +color:value1 + color:value2")

In [7]:
fittedRF = supervised.fit(df)
preparedDF = fittedRF.transform(df)
preparedDF.show()

As we can see, our `lab` variable has been binarized as `label`. The remaining variables have been transformed and packed into a single column named `features`, which appears to consist of sparse vectors. We get more actual variables than the formula lists because the `color` variable has 3 distinct values, so it and its products each yield several variables.

In [9]:
train, test = preparedDF.randomSplit([0.7, 0.3])

In [10]:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(labelCol="label",featuresCol="features")


In [11]:
print(lr.explainParams())

In [12]:
fittedLR = lr.fit(train)

**Note:** The line below was not in the corresponding GitHub script but is in the book, so I added it here.

In [14]:
fittedLR.transform(train).select("label", "prediction").show()

But the moment of truth comes when we check our model on a test set. 

**Note:** I added the next line myself; it is in neither the book nor the GitHub script.

In [16]:
fittedLR.transform(test).select("label", "prediction").show()

Usually people compute some evaluation metric, such as accuracy, or a confusion table.

Regrettably, I do not know Spark's evaluation metrics yet. In particular, using them would require converting the Spark data frame with predictions into an RDD.

**Note:** I added the cell below for my own peace of mind, as a standard step of model evaluation. It is calculated on a single node (the driver, since `toPandas()` collects the data there): Python methods may be parallelized across the CPUs/GPUs of one node, but they are not distributed among workers. Of course, native Spark methods are better for big data because they are distributed.

In [18]:
labels_predictions = fittedLR.transform(test).select("label", "prediction").toPandas()
from sklearn.metrics import confusion_matrix
confusion_matrix(labels_predictions.iloc[ :,0], labels_predictions.iloc[:, 1])

The predictions on the test set are perfect.
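
For completeness, here is a small sketch of how accuracy, precision, and recall follow from a 2x2 confusion matrix in the scikit-learn layout (rows are true labels, columns are predicted labels). The function name `binary_metrics` and the counts are hypothetical, not the output of the cell above. On the Spark side, `MulticlassClassificationEvaluator` with `metricName="accuracy"` would presumably be the distributed way to get the first number.

```python
def binary_metrics(cm):
    """Compute accuracy, precision and recall from a 2x2 confusion matrix.

    cm follows the scikit-learn layout: rows are true labels, columns are
    predicted labels, so cm = [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical counts, not the output of the cell above.
perfect = binary_metrics([[20, 0], [0, 13]])
imperfect = binary_metrics([[8, 2], [1, 9]])
print(perfect)
print(imperfect)
```

A matrix with zeros off the diagonal, as in the first call, corresponds to perfect predictions.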

#### 2nd Example

In [21]:
train, test = df.randomSplit([0.7, 0.3])

In [22]:
rForm = RFormula()
lr = LogisticRegression().setLabelCol("label").setFeaturesCol("features")

In [23]:
from pyspark.ml import Pipeline
stages = [rForm, lr]
pipeline = Pipeline().setStages(stages)

In [24]:
from pyspark.ml.tuning import ParamGridBuilder
params = ParamGridBuilder()\
  .addGrid(rForm.formula, [
    "lab ~ . + color:value1",
    "lab ~ . + color:value1 + color:value2"])\
  .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])\
  .addGrid(lr.regParam, [0.1, 2.0])\
  .build()
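
The grid is the Cartesian product of the listed values, so it should hold 2 × 3 × 2 = 12 parameter combinations. The plain-Python sketch below enumerates them with `itertools.product`; the dictionary keys are just labels chosen for this illustration. In the notebook, `len(params)` should report the same count.

```python
from itertools import product

formulas = ["lab ~ . + color:value1", "lab ~ . + color:value1 + color:value2"]
elastic_net_values = [0.0, 0.5, 1.0]
reg_values = [0.1, 2.0]

# One dictionary per combination, mirroring what a parameter grid enumerates.
grid = [
    {"formula": f, "elasticNetParam": a, "regParam": r}
    for f, a, r in product(formulas, elastic_net_values, reg_values)
]
print(len(grid))
```

Every combination means one more model to fit, which is why grids grow expensive quickly.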

In [25]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator()\
  .setMetricName("areaUnderROC")\
  .setRawPredictionCol("prediction")\
  .setLabelCol("label")
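
Note that the evaluator above reads the hard 0/1 `prediction` column as its score. With only two possible score values, the ROC curve has a single interior point, and the area under it reduces to the average of the true positive rate and the true negative rate. The sketch below checks this on made-up labels with a pairwise (rank-based) AUC in plain Python; the function name is hypothetical, and Spark's own number may differ slightly depending on how it builds the curve.

```python
def auc_from_scores(labels, scores):
    """Area under the ROC curve via pairwise comparison; ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and hard 0/1 predictions.
labels = [1, 1, 1, 0, 0]
predictions = [1, 1, 0, 0, 1]

auc = auc_from_scores(labels, predictions)
tpr = 2 / 3   # 2 of the 3 positives are predicted as 1
tnr = 1 / 2   # 1 of the 2 negatives is predicted as 0
print(auc, (tpr + tnr) / 2)
```

Using a continuous score such as a probability instead of the hard prediction would give the curve more points and usually a more informative AUC.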

In [26]:
from pyspark.ml.tuning import TrainValidationSplit
tvs = TrainValidationSplit()\
  .setTrainRatio(0.75)\
  .setEstimatorParamMaps(params)\
  .setEstimator(pipeline)\
  .setEvaluator(evaluator)
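
Conceptually, a train-validation split fits every parameter combination on one part of the training data, scores it on the held-out part, and keeps the best one. The plain-Python sketch below shows that loop on a toy problem. All names and data in it are hypothetical, and it cuts a shuffled list at an exact ratio, which is a simplification rather than a description of how Spark assigns rows.

```python
import random


def train_validation_select(rows, candidates, fit, evaluate, train_ratio=0.75, seed=0):
    """Fit every candidate on one train part and score it on one validation part."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)          # seeded so the split is repeatable
    cut = int(len(shuffled) * train_ratio)
    train_part, valid_part = shuffled[:cut], shuffled[cut:]
    scores = [evaluate(fit(params, train_part), valid_part) for params in candidates]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best], scores


# Toy setup (all hypothetical): a "model" is just a threshold on x.
rows = [(x, 1 if x >= 10 else 0) for x in range(20)]
candidates = [{"threshold": 4.5}, {"threshold": 9.5}, {"threshold": 14.5}]


def fit(params, train_rows):
    return params["threshold"]                     # nothing to learn in this toy case


def evaluate(threshold, valid_rows):
    hits = sum((x > threshold) == bool(y) for x, y in valid_rows)
    return hits / len(valid_rows)                  # accuracy on the validation part


best_params, scores = train_validation_select(rows, candidates, fit, evaluate)
print(best_params, scores)
```

Unlike cross-validation, each candidate is scored on a single validation part, which is cheaper but noisier on small data.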

Running the cell below resulted in a request to install the `MLflow` library on the attached cluster. To install `MLflow` for a particular cluster, go to `clusters`, click on your cluster, then on `Libraries`, click the `Install New` button, and choose `PyPI` in the box that appears. Put `MLflow` (no quotes) in the `Package` box and hit `Install`. That said, the cell worked for me even without it.

In [28]:
tvsFitted = tvs.fit(train)

**Note:** The second line in the cell below was missing from the GitHub script, so I added it from the book. I commented it out because I do not want to use too much memory.

In [30]:
evaluator.evaluate(tvsFitted.transform(test))
#tvsFitted.write().overwrite().save("temp/ModelLocation")  # in pyspark, write() is a method call

Note that every run will produce different results, because the splits (train/test and validation) are made randomly. The highest value I have seen was about 0.95, and it can be as low as 0.88.
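
One common way to tame this run-to-run variation is to fix the random seed. `randomSplit` accepts an optional `seed` argument, and `TrainValidationSplit` has a `seed` parameter as well, although in Spark the outcome can still depend on how the data is partitioned. The plain-Python sketch below illustrates only the general idea; the function `random_split` and the stand-in rows are hypothetical.

```python
import random


def random_split(rows, train_weight, seed):
    """Assign each row to train with probability train_weight, else to test."""
    rng = random.Random(seed)                      # same seed -> same sequence of draws
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test


rows = list(range(100))                            # stand-in for data frame rows
first = random_split(rows, 0.7, seed=42)
second = random_split(rows, 0.7, seed=42)
print(first == second, len(first[0]), len(first[1]))
```

Because every row is assigned independently, the split sizes only approximate the requested 70/30 ratio.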

The interesting point here is that the data are the same as before, when we were able to classify everything correctly. But back then we used a simple method, without regularization. If you add 0 as one of the regularization options, you will get 1 as the test evaluation metric.
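
For reference, and assuming the elastic net form given in Spark's documentation, the penalty added to the loss can be written with $\lambda$ standing for `regParam` and $\alpha$ for `elasticNetParam`:
$$
\lambda\left(\alpha\,\lVert w\rVert_1 + \frac{1-\alpha}{2}\,\lVert w\rVert_2^2\right)
$$
With $\lambda = 0$ the penalty vanishes whatever $\alpha$ is, which matches the unregularized fit of the first example. Both `regParam` values in the grid above (0.1 and 2.0) are non-zero, so every candidate model there is penalized.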