# Train a model

With some Spark and PySpark basics in place, we can move on to the machine learning utilities: Spark ML (*)

Here, we take a simple dataset and fit a classifier. This does not claim to be a complete data science process. The focus is rather on the analogies and differences to familiar frameworks such as scikit-learn.

(*) You will often hear the term *MLlib*, usually used for Spark's machine learning library as a whole. Strictly speaking, there are two APIs. The older one (```pyspark.mllib```) is applied directly to RDDs. The newer one (```pyspark.ml```, often called Spark ML) is built on top of DataFrames and provides a higher-level API that abstracts away some of the low-level details of building and tuning machine learning models.

In [1]:
from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel
from pyspark.ml.regression import LinearRegression, LinearRegressionModel
from pyspark.ml.evaluation import BinaryClassificationEvaluator, RegressionEvaluator, MulticlassClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline

from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import mean, stddev

In [2]:
spark_session: SparkSession = SparkSession.builder.master("local").appName("Local").getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/05/17 13:33:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
sc = spark_session.sparkContext

## Load data

Iris, penguins, Airbnb listings, wines, etc. Do you already know them by heart?

Well, then let's take a look at a more business-specific problem.

In [4]:
df = spark_session.read.csv("data/mocked_customer_data.csv", inferSchema=True, header=True)

## EDA

Performing EDA using PySpark has its pros and cons. On the one hand, PySpark supports various statistical functions that can help you calculate summary statistics, identify outliers, and explore relationships between variables. This can be particularly useful when dealing with very large datasets that cannot be easily processed using other tools.

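On the other hand, PySpark DataFrames have no built-in plotting, and visualization can be an important part of the EDA process. A common workaround is to aggregate or sample the data in Spark and plot the small result with a regular Python plotting library.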

In [5]:
df.printSchema()

root
 |-- customer_id: long (nullable = true)
 |-- postal_code: integer (nullable = true)
 |-- n1: double (nullable = true)
 |-- n2: double (nullable = true)
 |-- n3: double (nullable = true)
 |-- n4: string (nullable = true)
 |-- n4_1: integer (nullable = true)
 |-- n4_2: integer (nullable = true)
 |-- n4_3: integer (nullable = true)
 |-- target: integer (nullable = true)



In [6]:
df.show(2)

+-------------------+-----------+-------------------+-------------------+--------------------+-----------------+----+----+----+------+
|        customer_id|postal_code|                 n1|                 n2|                  n3|               n4|n4_1|n4_2|n4_3|target|
+-------------------+-----------+-------------------+-------------------+--------------------+-----------------+----+----+----+------+
|4974467801682041986|       1970| 1.0389522417417338|0.16304700716821743|-0.07161893126965056|Intermediate Area|   0|   0|   1|     0|
|-858641559787057159|       1645|-0.2659940172448812|  0.750317496703657| -0.8372525523443803|          Country|   0|   1|   0|     0|
+-------------------+-----------+-------------------+-------------------+--------------------+-----------------+----+----+----+------+
only showing top 2 rows



In [7]:
df.show(2, vertical=True)

-RECORD 0---------------------------
 customer_id | 4974467801682041986  
 postal_code | 1970                 
 n1          | 1.0389522417417338   
 n2          | 0.16304700716821743  
 n3          | -0.07161893126965056 
 n4          | Intermediate Area    
 n4_1        | 0                    
 n4_2        | 0                    
 n4_3        | 1                    
 target      | 0                    
-RECORD 1---------------------------
 customer_id | -858641559787057159  
 postal_code | 1645                 
 n1          | -0.2659940172448812  
 n2          | 0.750317496703657    
 n3          | -0.8372525523443803  
 n4          | Country              
 n4_1        | 0                    
 n4_2        | 1                    
 n4_3        | 0                    
 target      | 0                    
only showing top 2 rows



In [8]:
spark_session.conf.set("spark.sql.repl.eagerEval.enabled", True)

In [9]:
df

customer_id,postal_code,n1,n2,n3,n4,n4_1,n4_2,n4_3,target
4974467801682041986,1970,1.0389522417417338,0.1630470071682174,-0.0716189312696505,Intermediate Area,0,0,1,0
-858641559787057159,1645,-0.2659940172448812,0.750317496703657,-0.8372525523443803,Country,0,1,0,0
7875860926956384571,7684,1.3367555579432162,0.262499759615919,1.542938718088682,City,1,0,0,0
1718663060827327339,3355,0.5151084439689695,-0.7996961849347815,2.175468067045149,Country,0,1,0,1
-8637874552457225727,4344,1.6956467653837362,-0.945521641217198,-0.8639522191312855,City,1,0,0,0
-3177612884997717707,3784,0.9220421897516056,1.550548930005398,0.6383819989459883,Country,0,1,0,0
8497217710787736490,5538,0.0581497703921297,1.1856903392169684,0.8213706052182598,City,1,0,0,0
4292528309378731798,1749,0.6513284963527268,0.3168165631959955,0.6160836933665816,City,1,0,0,0
3264293750177743940,7445,0.1847085424540692,-0.2802940363082347,-0.7622539473498987,Intermediate Area,0,0,1,0
3343463163369573891,6479,-1.8851460177226464,-0.2248136997426025,-0.6219523743600517,City,1,0,0,0


In [10]:
df_mean = df.select([mean(c) for c in df.columns])

In [11]:
df_mean.show()

+--------------------+----------------+--------------------+--------------------+-------------------+-------+---------+---------+---------+-----------+
|    avg(customer_id)|avg(postal_code)|             avg(n1)|             avg(n2)|            avg(n3)|avg(n4)|avg(n4_1)|avg(n4_2)|avg(n4_3)|avg(target)|
+--------------------+----------------+--------------------+--------------------+-------------------+-------+---------+---------+---------+-----------+
|3.076534643793414...|        4903.448|0.043817252029327056|0.025989175796927534|0.06259296277136904|   null|    0.355|     0.34|    0.305|      0.241|
+--------------------+----------------+--------------------+--------------------+-------------------+-------+---------+---------+---------+-----------+



In [12]:
df_mean

avg(customer_id),avg(postal_code),avg(n1),avg(n2),avg(n3),avg(n4),avg(n4_1),avg(n4_2),avg(n4_3),avg(target)
3.076534643793414...,4903.448,0.043817252029327,0.0259891757969275,0.062592962771369,,0.355,0.34,0.305,0.241


In [13]:
df_std = df.select([stddev(c) for c in df.columns])

In [14]:
df_std.show()

23/05/17 13:34:57 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


+------------------------+------------------------+------------------+-----------------+------------------+---------------+-------------------+------------------+-------------------+-------------------+
|stddev_samp(customer_id)|stddev_samp(postal_code)|   stddev_samp(n1)|  stddev_samp(n2)|   stddev_samp(n3)|stddev_samp(n4)|  stddev_samp(n4_1)| stddev_samp(n4_2)|  stddev_samp(n4_3)|stddev_samp(target)|
+------------------------+------------------------+------------------+-----------------+------------------+---------------+-------------------+------------------+-------------------+-------------------+
|    5.243814185146357...|      2917.0811102023713|1.0204942026362929|0.957810415588983|1.0266739337270703|           null|0.47875275895205466|0.4739458034676797|0.46063780477419236|0.42790431418963537|
+------------------------+------------------------+------------------+-----------------+------------------+---------------+-------------------+------------------+-------------------+-------------------+

In [15]:
df_std

stddev_samp(customer_id),stddev_samp(postal_code),stddev_samp(n1),stddev_samp(n2),stddev_samp(n3),stddev_samp(n4),stddev_samp(n4_1),stddev_samp(n4_2),stddev_samp(n4_3),stddev_samp(target)
5.243814185146357...,2917.0811102023717,1.0204942026362929,0.957810415588983,1.0266739337270705,,0.4787527589520546,0.4739458034676797,0.4606378047741923,0.4279043141896353


## Build a model...

Let's build a simple classification model which aims to predict the *target*.

But first, let's clarify [terminology](https://spark.apache.org/docs/latest/ml-pipeline.html#example-estimator-transformer-and-param) for a moment:

* We already know **DataFrames**.
* **Transformers** transform one DataFrame into another DataFrame. An ML Transformer, for example, transforms a DataFrame with features into a DataFrame with predictions.
* **Estimators** create Transformers by being fitted on a DataFrame.
* A **Pipeline** chains Transformers and Estimators into a single workflow. Pipelines are covered further below.
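
To make the Estimator/Transformer split more tangible, here is a toy sketch in plain Python. It is only an analogy: the class names ```ThresholdEstimator``` and ```ThresholdModel``` are made up for illustration and are not part of Spark, and lists of dicts stand in for DataFrames.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ThresholdModel:
    """Toy 'Transformer': holds the fitted state and adds a prediction to each row."""
    threshold: float

    def transform(self, rows):
        # Returns new rows; the input rows are left untouched.
        return [{**row, "prediction": int(row["x"] > self.threshold)} for row in rows]


class ThresholdEstimator:
    """Toy 'Estimator': fit() learns from the data and returns a separate model object."""

    def fit(self, rows):
        return ThresholdModel(threshold=mean(row["x"] for row in rows))


train_rows = [{"x": 1.0}, {"x": 2.0}, {"x": 6.0}]
toy_model = ThresholdEstimator().fit(train_rows)   # Estimator -> Transformer
toy_predictions = toy_model.transform([{"x": 0.5}, {"x": 4.0}])
print(toy_predictions)
```

The point to take away is that ```fit()``` does not change the Estimator itself but hands back a new object, and only that object knows how to ```transform()```.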

In [16]:
FEATURES = ["n1", "n2", "n3", "n4_1", "n4_2", "n4_3"]
TARGET = "target"

Most machine learning algorithms in Spark require input data to be in vector format. PySpark's VectorAssembler is a utility that allows you to combine multiple columns of a PySpark DataFrame into a single vector column. The resulting vector column is then used as an input to machine learning algorithms in PySpark.

In [17]:
assembler = VectorAssembler(
    inputCols=FEATURES,
    outputCol="features_vec"
)
df_prep = assembler.transform(df).select("features_vec", TARGET)

In [18]:
df_prep.show(2)

+--------------------+------+
|        features_vec|target|
+--------------------+------+
|[1.03895224174173...|     0|
|[-0.2659940172448...|     0|
+--------------------+------+
only showing top 2 rows



In [19]:
df_prep.first()  # like collect()[0], but without pulling the whole DataFrame to the driver

Row(features_vec=DenseVector([1.039, 0.163, -0.0716, 0.0, 0.0, 1.0]), target=0)

In [20]:
df_train, df_test = df_prep.randomSplit([0.8, 0.2], seed=42)

All right. Let's fit the baseline model.

The syntax is very similar to sklearn. But, as mentioned above, there is a distinction between Estimators and Transformers: the fitting process returns a new Transformer object (in contrast to sklearn, where the fitted object itself is able to make predictions).

In [21]:
lr = LogisticRegression(labelCol=TARGET, featuresCol="features_vec")

In [22]:
model = lr.fit(df_train)

23/05/17 13:35:47 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS


Whether and how the training process can be parallelized depends on the model architecture. A k-means clustering, for example, can be parallelized easily, because the assignment of points to centroids is independent for each data point. A gradient boosting method such as XGBoost, on the other hand, is somewhat more complex: since the trees are built sequentially, only the individual steps within each iteration can be parallelized. The way in which the calculation is optimized differs in each case. This also applies to inference.

In [23]:
df_pred = model.transform(df_test)

## ... test it ...

In [24]:
df_pred.show(3)

+--------------------+------+--------------------+--------------------+----------+
|        features_vec|target|       rawPrediction|         probability|prediction|
+--------------------+------+--------------------+--------------------+----------+
|[-2.5592004494753...|     1|[-5.0496932470165...|[0.00637045692062...|       1.0|
|[-2.3885567658875...|     0|[2.57487809350654...|[0.92922717150041...|       0.0|
|[-2.2798296814936...|     1|[-0.7487301988090...|[0.32109804707567...|       1.0|
+--------------------+------+--------------------+--------------------+----------+
only showing top 3 rows
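
How do the three new columns in the output above relate to each other? For binary logistic regression, the first entry of ```probability``` is the logistic (sigmoid) function applied to the first entry of ```rawPrediction```. The following plain-Python sketch checks this with the values shown above, rounded to four decimals, and assumes the default decision threshold of 0.5.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


# First entries of rawPrediction for the three rows shown above (rounded).
raw_first_entries = [-5.0497, 2.5749, -0.7487]

results = []
for raw0 in raw_first_entries:
    p0 = sigmoid(raw0)                     # probability of class 0
    p1 = 1.0 - p0                          # probability of class 1
    label = 1.0 if p1 > 0.5 else 0.0       # assumed default threshold of 0.5
    results.append((p0, p1, label))
    print(f"raw={raw0:+.4f}  p0={p0:.4f}  p1={p1:.4f}  prediction={label}")
```

The computed class-0 probabilities agree with the first entries of the ```probability``` column, and the resulting labels agree with the ```prediction``` column.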



Let's assess the model quality on the test set via
* Accuracy
* F1 Score
* Area under the curve

Note that neither BinaryClassificationEvaluator nor MulticlassClassificationEvaluator can calculate all metrics on their own. We need to use both.
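
One detail is worth knowing when comparing numbers with other tools: the ```f1``` metric of ```MulticlassClassificationEvaluator``` is a weighted F1, that is, the per-label F1 scores averaged with the label frequencies as weights. The following plain-Python sketch illustrates the computation. The labels and predictions are hypothetical and not taken from the data used here.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_label(y_true, y_pred, label):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true, y_pred):
    # Each label's F1 is weighted by how often the label occurs in y_true.
    counts = Counter(y_true)
    return sum(f1_for_label(y_true, y_pred, label) * n / len(y_true)
               for label, n in counts.items())


# Hypothetical example data for illustration only.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

print(f"accuracy:    {accuracy(y_true, y_pred):.3f}")
print(f"weighted f1: {weighted_f1(y_true, y_pred):.3f}")
```

With imbalanced labels, the weighted F1 is dominated by the majority class, so it can look better than the F1 of the positive class alone.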

In [25]:
evaluator = MulticlassClassificationEvaluator(labelCol=TARGET, 
                                              predictionCol="prediction")

accuracy = evaluator.evaluate(df_pred, {evaluator.metricName: 'accuracy'})
f1 = evaluator.evaluate(df_pred, {evaluator.metricName: 'f1'})

print(f"accuracy: {accuracy:.2f}")
print(f"f1: {f1:.2f}")

accuracy: 0.84
f1: 0.83


In [26]:
evaluator = BinaryClassificationEvaluator(labelCol=TARGET, 
                                          rawPredictionCol='rawPrediction')

auc = evaluator.evaluate(df_pred, {evaluator.metricName: 'areaUnderROC'})

print(f"AUC: {auc:.2f}")

AUC: 0.86


## ... and persist it.

So here we are. We've got a model in our hands. Hooray.

Of course, we don't want to retrain the model again and again, so we persist it. The fitted model (a Transformer) comes with a built-in ```save()``` method to do so. One can persist the model on the local file system or on any blob storage.

In [27]:
import os

In [28]:
path = "model/my_model"

In [29]:
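import shutil

# save() fails if the target path already exists, so remove an old model first
if os.path.exists(path):
    shutil.rmtree(path)  # portable alternative to shelling out to `rm -r`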

In [30]:
model.save(path)

                                                                                

Then, in any production environment, one can load the model to work with it. Note that this could also happen in Scala or Java directly, because the serialization was not done via pickle but is language-agnostic.
Again, the syntax is somewhat unintuitive: the ```load()``` method is implemented on the model class (```LogisticRegressionModel```), not on the Estimator (```LogisticRegression```) that was used for fitting.

In [None]:
# model_loaded = LogisticRegressionModel.load("model/my_model")

## Chain in a pipeline

The two steps, feature assembly and model fitting, can also be captured in a single object. Pipelines form a construct in which several Transformers and Estimators can be chained together.

In [31]:
assembler = VectorAssembler(
    inputCols=FEATURES,
    outputCol="features_vec"
)

lr = LogisticRegression(labelCol=TARGET, featuresCol="features_vec")

pipeline = Pipeline(stages=[assembler, lr])

In this case, the train-test split has to be applied directly to the original DataFrame, because the feature assembly now happens inside the pipeline.

In [32]:
df_train_direct, df_test_direct = df.randomSplit([0.8, 0.2], seed=42)

In [33]:
model = pipeline.fit(df_train_direct)

In [34]:
df_pred = model.transform(df_test_direct)

In [35]:
df_pred.show(3)

+--------------------+-----------+-------------------+--------------------+-------------------+-------+----+----+----+------+--------------------+--------------------+--------------------+----------+
|         customer_id|postal_code|                 n1|                  n2|                 n3|     n4|n4_1|n4_2|n4_3|target|        features_vec|       rawPrediction|         probability|prediction|
+--------------------+-----------+-------------------+--------------------+-------------------+-------+----+----+----+------+--------------------+--------------------+--------------------+----------+
|-9122871847693039327|       1258|   0.53297997354576|-0.16787123785158928| 0.8646612240550045|   City|   1|   0|   0|     1|[0.53297997354576...|[0.37584911122627...|[0.59287157021844...|       0.0|
|-9037252130589906431|       4550|-0.6363395261942798| -0.7252099279344788| 0.5305191506848624|   City|   1|   0|   0|     1|[-0.6363395261942...|[-0.5131584962944...|[0.37445339264470...|       1.0|


## Optimization

Note that the example above is for educational purposes. In real-world problems, one would try out different hyperparameter setups as well as different models to find the best one. Also, cross-validation should be applied if possible to achieve more stable results.

In [36]:
assembler = VectorAssembler(
    inputCols=FEATURES,
    outputCol="features_vec"
)

lr = LogisticRegression(labelCol=TARGET, featuresCol="features_vec")

evaluator = BinaryClassificationEvaluator(labelCol=TARGET, 
                                          rawPredictionCol='rawPrediction',
                                          metricName="areaUnderROC")

pipeline = Pipeline(stages=[assembler, lr])

parameter_grid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0]) \
    .build()

crossval = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=parameter_grid,
    evaluator=evaluator,
    numFolds=2
)

cross_val_model = crossval.fit(df_train_direct)
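
How much work does this grid search cause? The parameter grid is the Cartesian product of all candidate values, and every combination is fitted once per fold. The following plain-Python sketch just does the counting for the values used above. It does not call Spark, and the dictionary keys are only labels.

```python
from itertools import product

# Same candidate values as in the ParamGridBuilder call above.
grid = {
    "regParam": [0.1, 0.01],
    "elasticNetParam": [0.0, 0.5, 1.0],
}
num_folds = 2

# Every combination of one value per parameter.
param_maps = [dict(zip(grid, values)) for values in product(*grid.values())]

fits_during_search = len(param_maps) * num_folds
# After the search, the best combination is fitted once more on all training data.
total_fits = fits_during_search + 1

for params in param_maps:
    print(params)
print(f"{len(param_maps)} combinations, {total_fits} model fits in total")
```

The number of fits grows multiplicatively with every parameter added to the grid, so it pays off to keep the grid small or to increase it step by step.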

In [37]:
DEEP_DIVE = False

In [38]:
if DEEP_DIVE: 
    print(cross_val_model.bestModel.stages[1].extractParamMap())

In [39]:
if DEEP_DIVE:
    print(cross_val_model.avgMetrics)

## Challenges in a fitting process.

Real data sets usually require some pre-processing before fitting is possible.

* For instance, not every ML model is able to handle missing values. One needs to drop or impute missing values first.
* Also, some models are sensitive to numerical features that live on very different scales, so one needs to standardize them.
* Speaking of numerical values: for many models, categorical data has to be encoded numerically first.

For those cases, Spark ML provides feature Transformers and Estimators as well, for example ```Imputer```, ```StandardScaler```, ```StringIndexer``` and ```OneHotEncoder``` in ```pyspark.ml.feature```. One should integrate them into the modelling process, ideally directly into a pipeline.
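
The dataset used here already contains such an encoding: the columns ```n4_1```, ```n4_2``` and ```n4_3``` are a one-hot encoding of the string column ```n4```. The following plain-Python sketch reproduces that mapping. It only illustrates the idea and is not Spark's implementation. In Spark, ```OneHotEncoder``` works on numerically indexed columns (for example from ```StringIndexer```) and drops the last category by default.

```python
# Category order chosen to match the n4_1, n4_2, n4_3 columns of the example data.
CATEGORIES = ["City", "Country", "Intermediate Area"]


def one_hot(value, categories=CATEGORIES):
    """Return a 0/1 list with a single 1 at the position of the matching category."""
    return [int(value == category) for category in categories]


for value in ["Intermediate Area", "Country", "City"]:
    print(f"{value:<18} -> {one_hot(value)}")
```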

## Wrap-Up 

* Spark ML is a powerful tool to train ML models in a distributed way.
* The syntax is similar to sklearn, but there are some differences.
* Transformer and Estimator are the two main components in a fitting process. Transformers transform data, Estimators fit models.
* The feature columns must first be assembled into a single feature vector column.
* The fitting process is parallelized, but the degree of parallelization depends on the model architecture.
* The resulting model can be persisted and loaded again in any production environment.
* Pipelines are a convenient way to chain several Transformers and Estimators together.
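* Once again, the principle of lazy behavior applies: ```transform()``` only defines the computation, which runs once an action such as ```show()``` is called. The fitting process, in contrast, is executed as soon as the ```fit()``` method is called.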

## Congrats!

You have dived into ML with PySpark.  
🚀🚀🚀

As before, stop the local Spark session.

Then, let's move to the cloud! 

In [40]:
spark_session.stop()