## Chapter 24, Advanced Analytics and Machine Learning
**Mya's Remark:** You can add comments (text cells) to a Spark notebook. To do so, make the first line of your cell the same as the first line you see in this cell when you double-click it (in Databricks this is the `%md` magic command).
The usual markdown formatting can be used as well.

This chapter presents the basic steps of a machine learning workflow.
More detailed examples follow in the next chapters.

In [2]:
from pyspark.ml.linalg import Vectors
denseVec = Vectors.dense(1.0, 2.0, 3.0)
size = 3
idx = [1, 2] # locations of non-zero elements in vector
values = [2.0, 3.0]
sparseVec = Vectors.sparse(size, idx, values)
print(sparseVec)

The vector `sparseVec` in the previous cell has 3 entries: indices 1 and 2 hold the float values 2.0 and 3.0, respectively, and every other entry is 0, so its dense form is (0.0, 2.0, 3.0). It is not very sparse, but it works as an example.
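
To make the `(size, idx, values)` representation concrete, here is a minimal plain-Python sketch that expands such a triple into a dense list. The helper name `sparse_to_dense` is hypothetical and only for illustration; in PySpark itself, `sparseVec.toArray()` does the equivalent conversion.

```python
def sparse_to_dense(size, idx, values):
    """Expand a sparse (size, indices, values) triple into a dense list."""
    dense = [0.0] * size              # every position starts at zero
    for i, v in zip(idx, values):
        dense[i] = v                  # fill in the stored non-zero entries
    return dense

# Same size, indices and values as in the cell above
dense = sparse_to_dense(3, [1, 2], [2.0, 3.0])
print(dense)
```

Index 0 is never listed in `idx`, so it keeps its default value of zero.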


#### 1st ML Example

I added a line here to count the rows of the data frame.

In [4]:
df = spark.read.json("/FileStore/tables/part_r_00000_f5c243b9_a015_4a3b_a4a8_eca00f80f04c-a8b89.json")
print(df.count()) # this is my line
df.orderBy("value2").show()


**Mya's Remarks:** The cell below, which defines the `supervised` object, was missing from the corresponding GitHub script, so I added it from the book. Here the authors use an imitation of the R formula object.
https://faculty.chicagobooth.edu/richard.hahn/teaching/formulanotation.pdf

As a reminder, when we fit a linear regression in R, every categorical variable is binarized. It is done in the following way: for each such variable, the set of distinct values is extracted. Then, for every distinct value, a so-called "dummy" variable is constructed. It has 1s for records where the value appears and 0s otherwise. (With R's default settings, one level is left out as the reference category.)
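
The dummy-variable construction just described can be sketched in plain Python. This is only an illustration: the function name `dummy_encode` and the sample colors are made up, and the sketch builds one indicator per distinct value without dropping a reference level.

```python
def dummy_encode(column):
    """Return {level: [0/1 indicators]} for each distinct value in column."""
    levels = sorted(set(column))      # the set of distinct values
    return {
        level: [1 if x == level else 0 for x in column]
        for level in levels
    }

colors = ["green", "blue", "red", "blue"]   # made-up sample data
dummies = dummy_encode(colors)
for level, indicator in dummies.items():
    print(level, indicator)
```

Each indicator list has a 1 exactly where its level appears in the original column.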

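The authors' code corresponds, schematically, to the following formula, where `color` stands for its dummy variables and $x_1, \dots, x_5$ are the fitted coefficients:
$$
\text{lab} = \text{color}\cdot x_1 + \text{value1}\cdot x_2 + \text{value2}\cdot x_3 + \text{color}\cdot \text{value1}\cdot x_4 + \text{color}\cdot \text{value2}\cdot x_5
$$
The first 3 summands come from the period (`.`) in the formula. The period means "all columns except the one to the left of `~` are included". The last 2 summands are the interaction terms `color:value1` and `color:value2`.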

In PySpark it is also possible to specify the outcome (`lab` in this case) and the features by the usual Python means, without an R formula: one column for the outcome and one vector column assembling the other variables. In that case we are on our own with binarizing categorical variables and adding variable interactions.
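
As a rough sketch of what "on our own" involves, the plain-Python function below builds, for a single record, the dummies, the numeric columns, and the two interaction blocks by hand. Everything here is an assumption for illustration: the name `build_features`, the color levels, the choice of dropped level, and the sample values. The ordering and encoding that `RFormula` actually produces may differ.

```python
def build_features(color, value1, value2, levels):
    """Hand-built analogue of 'lab ~ . + color:value1 + color:value2'."""
    # One 0/1 dummy per retained color level
    dummies = [1.0 if color == lvl else 0.0 for lvl in levels]
    # Main effects: the color dummies plus the two numeric columns
    main_effects = dummies + [value1, value2]
    # Interactions: each dummy multiplied by each numeric column
    inter_v1 = [d * value1 for d in dummies]
    inter_v2 = [d * value2 for d in dummies]
    return main_effects + inter_v1 + inter_v2

levels = ["blue", "green"]   # assumed levels; a third level is the dropped reference
features = build_features("green", 1.0, 14.386, levels)
print(features)
```

With two retained levels the vector has 2 + 2 + 2 + 2 = 8 entries, which shows how quickly interactions with a categorical variable widen the feature space.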

In [6]:
from pyspark.ml.feature import RFormula
supervised = RFormula(formula="lab ~ . +color:value1 + color:value2")

In [7]:
fittedRF = supervised.fit(df)
preparedDF = fittedRF.transform(df)
preparedDF.show()


**Mya's Remark:** As we see, our `lab` variable is binarized into the `label` column. The rest of the variables are transformed and put together in one column, `features`, as a vector.

In [9]:
train, test = preparedDF.randomSplit([0.7, 0.3])

In [10]:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(labelCol="label",featuresCol="features")


In [11]:
print(lr.explainParams())

In [12]:
fittedLR = lr.fit(train)

**Mya's Remark:** The line below was not in the corresponding GitHub script, but it is in the book, so I added it here.

In [14]:
fittedLR.transform(train).select("label", "prediction").show()

**Mya's Remark:** But the moment of truth comes when we check our model on a test set. I added this line myself; it is in neither the book nor the GitHub script.

In [16]:
fittedLR.transform(test).select("label", "prediction").show()

**Mya's Remark:** Usually people compute some evaluation metric, such as accuracy or a confusion table.

We will see more on Spark evaluation metrics in Chapters 26 and 27. In particular, we will need to convert the table of predictions into an RDD. Here is a list of metrics:

https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.evaluation

I added the cell below for my own peace of mind. It is calculated on one node: `toPandas()` collects the data to the driver, and although Python methods may be parallelized across the CPUs/GPUs of one node, they are not distributed among workers. Native Spark methods are preferable because they are distributed.

In [18]:
labels_predictions = fittedLR.transform(test).select("label", "prediction").toPandas()
from sklearn.metrics import confusion_matrix
confusion_matrix(labels_predictions.iloc[:, 0], labels_predictions.iloc[:, 1])

The predictions on the test set are perfect.
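
For readers who want to see what such a confusion table counts, here is a minimal plain-Python sketch that tallies (actual, predicted) pairs and derives accuracy from them. The labels and predictions are made-up values, not output from the model above, and `confusion_counts` is a hypothetical helper name.

```python
from collections import Counter

def confusion_counts(labels, predictions):
    """Count (actual, predicted) pairs for a binary 0/1 classifier."""
    return Counter(zip(labels, predictions))

labels      = [1, 0, 1, 1, 0, 0]   # made-up actual classes
predictions = [1, 0, 0, 1, 0, 1]   # made-up predicted classes

counts = confusion_counts(labels, predictions)
# Correct predictions sit on the diagonal: (1, 1) and (0, 0)
accuracy = (counts[(1, 1)] + counts[(0, 0)]) / len(labels)
print(counts)
print(accuracy)
```

A perfect result corresponds to zero counts in the off-diagonal cells `(1, 0)` and `(0, 1)`, which gives an accuracy of 1.0.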

#### 2nd Example

In [21]:
train, test = df.randomSplit([0.7, 0.3])

In [22]:
rForm = RFormula()
lr = LogisticRegression().setLabelCol("label").setFeaturesCol("features")

In [23]:
from pyspark.ml import Pipeline
stages = [rForm, lr]
pipeline = Pipeline().setStages(stages)

In [24]:
from pyspark.ml.tuning import ParamGridBuilder
params = ParamGridBuilder()\
  .addGrid(rForm.formula, [
    "lab ~ . + color:value1",
    "lab ~ . + color:value1 + color:value2"])\
  .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])\
  .addGrid(lr.regParam, [0.1, 2.0])\
  .build()
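
The grid above is the Cartesian product of the three value lists. As a quick check of how many models will be fitted, the plain-Python sketch below enumerates the same combinations with `itertools.product`; it only mirrors the values from the cell and does not call Spark.

```python
from itertools import product

formulas = [
    "lab ~ . + color:value1",
    "lab ~ . + color:value1 + color:value2",
]
elastic_net = [0.0, 0.5, 1.0]
reg_param = [0.1, 2.0]

# Every (formula, elasticNetParam, regParam) combination in the grid
grid = list(product(formulas, elastic_net, reg_param))
print(len(grid))
```

Two formulas, three elastic net values and two regularization values give 2 × 3 × 2 = 12 candidate settings.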

In [25]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator()\
  .setMetricName("areaUnderROC")\
  .setRawPredictionCol("prediction")\
  .setLabelCol("label")

In [26]:
from pyspark.ml.tuning import TrainValidationSplit
tvs = TrainValidationSplit()\
  .setTrainRatio(0.75)\
  .setEstimatorParamMaps(params)\
  .setEstimator(pipeline)\
  .setEvaluator(evaluator)

**Mya's Remark:** Running the cell below resulted in a request to install the `MLflow` library for the attached cluster. To install `MLflow` for a particular cluster, go to `clusters`, click on your cluster, then `Libraries`, click the `Install New` button, and choose `PyPI` in the box that appears. Put `MLflow` (no quotes) in the `Package` box and hit `Install`. The cell worked for me anyway, even without the library.

In [28]:
tvsFitted = tvs.fit(train)

**Mya's Remark:** The second line in the cell below was missing from the GitHub script, so I added it from the book. I commented it out because I do not want to use up storage.

In [30]:
evaluator.evaluate(tvsFitted.transform(test))
#tvsFitted.write.overwrite().save("temp/ModelLocation")

**Mya's Remark:** Note that every run will produce different results. This happens because the splits (train/test and validation) are done randomly, unless a seed is set.

The interesting point here is that the data are the same as before, when we were able to classify them correctly. Note that, because of the validation split, each model was fitted on only 0.7\*0.75 = 0.525 of the data, or approximately 58 rows. I suspect that is too few.
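
The arithmetic behind that estimate can be written out as below. The total of 110 rows is an assumption inferred from the "approximately 58 rows" figure, not a value shown in the notebook output, and since `randomSplit` is random the real counts vary from run to run.

```python
total_rows = 110        # assumption: inferred from "approximately 58 rows"
train_fraction = 0.7    # share kept by df.randomSplit([0.7, 0.3])
train_ratio = 0.75      # share used for fitting by TrainValidationSplit

# Expected number of rows used to fit each candidate model
fit_rows = total_rows * train_fraction * train_ratio
# Expected number of rows held out for validation
validation_rows = total_rows * train_fraction * (1 - train_ratio)

print(round(fit_rows), round(validation_rows))
```

Under these assumptions only about 19 rows are left to score each candidate, which makes the validation metric noisy.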