# Train a model

With some Spark and PySpark basics in place, one can move on to the machine learning utilities: Spark ML (*).

Here, we take a simple dataset and fit a regressor. This does not claim to be a complete data science process. The point is rather to get to know the analogies and differences to familiar frameworks such as scikit-learn.

(*) One often hears the term *MLlib* used for Spark's machine learning library as a whole. Strictly speaking, there are two APIs. The older one lives in the `pyspark.mllib` package and is applied directly to RDDs. The newer one, often called Spark ML, lives in `pyspark.ml`, is built on top of DataFrames, and provides a higher-level API that abstracts away some of the low-level details of building and tuning machine learning models.

In [1]:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.regression import LinearRegression, LinearRegressionModel
from pyspark.ml.evaluation import BinaryClassificationEvaluator, RegressionEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline

from pyspark.sql.functions import col, mean, stddev
from pyspark.sql.types import IntegerType

In [2]:
spark_session: SparkSession = SparkSession.builder.master("local").appName("Local").getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/05/12 11:00:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
sc = spark_session.sparkContext

## Load data

Iris, penguins, AirBnBs, wines etc.: you already know them by heart?  
I know, I know, but they do their job. So let's go for housing prices. 🏘️

In [4]:
df = spark_session.read.csv("data/housing_wo_null.csv", inferSchema=True, header=True)

## EDA

Performing EDA using PySpark has its pros and cons. On the one hand, PySpark supports various statistical functions that can help you calculate summary statistics, identify outliers, and explore relationships between variables. This can be particularly useful when dealing with very large datasets that cannot be easily processed using other tools.

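On the other hand, PySpark is not built for plotting and visualization, which can be an important part of the EDA process. A common workaround is to aggregate or sample the data first and convert the small result to pandas with `toPandas()`.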

In [5]:
df.printSchema()

root
 |-- longitude: double (nullable = true)
 |-- latitude: double (nullable = true)
 |-- housing_median_age: double (nullable = true)
 |-- total_rooms: double (nullable = true)
 |-- total_bedrooms: double (nullable = true)
 |-- population: double (nullable = true)
 |-- households: double (nullable = true)
 |-- median_income: double (nullable = true)
 |-- median_house_value: double (nullable = true)
 |-- ocean_proximity: string (nullable = true)



In [6]:
df.show(2)

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|  -122.23|   37.88|              41.0|      880.0|         129.0|     322.0|     126.0|       8.3252|          452600.0|       NEAR BAY|
|  -122.22|   37.86|              21.0|     7099.0|        1106.0|    2401.0|    1138.0|       8.3014|          358500.0|       NEAR BAY|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
only showing top 2 rows



In [7]:
df.show(2, vertical=True)

-RECORD 0----------------------
 longitude          | -122.23  
 latitude           | 37.88    
 housing_median_age | 41.0     
 total_rooms        | 880.0    
 total_bedrooms     | 129.0    
 population         | 322.0    
 households         | 126.0    
 median_income      | 8.3252   
 median_house_value | 452600.0 
 ocean_proximity    | NEAR BAY 
-RECORD 1----------------------
 longitude          | -122.22  
 latitude           | 37.86    
 housing_median_age | 21.0     
 total_rooms        | 7099.0   
 total_bedrooms     | 1106.0   
 population         | 2401.0   
 households         | 1138.0   
 median_income      | 8.3014   
 median_house_value | 358500.0 
 ocean_proximity    | NEAR BAY 
only showing top 2 rows



In [8]:
spark_session.conf.set("spark.sql.repl.eagerEval.enabled", True)

In [9]:
df

longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY
-122.25,37.85,52.0,919.0,213.0,413.0,193.0,4.0368,269700.0,NEAR BAY
-122.25,37.84,52.0,2535.0,489.0,1094.0,514.0,3.6591,299200.0,NEAR BAY
-122.25,37.84,52.0,3104.0,687.0,1157.0,647.0,3.12,241400.0,NEAR BAY
-122.26,37.84,42.0,2555.0,665.0,1206.0,595.0,2.0804,226700.0,NEAR BAY
-122.25,37.84,52.0,3549.0,707.0,1551.0,714.0,3.6912,261100.0,NEAR BAY


In [None]:
# Skip the string column ocean_proximity, which has no meaningful mean.
df_mean = df.select([mean(c) for c, dtype in df.dtypes if dtype != "string"])

In [None]:
df_mean.show()

In [None]:
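# Skip the string column ocean_proximity, which has no meaningful standard deviation.
df_std = df.select([stddev(c) for c, dtype in df.dtypes if dtype != "string"])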

In [None]:
df_std.show()

## Build a model...

Let's build a simple regression model which aims to predict the *median house value*.

But first, let's clarify [terminology](https://spark.apache.org/docs/latest/ml-pipeline.html#example-estimator-transformer-and-param) for a moment:

* We already know **DataFrames**.
* **Transformers** transform one DataFrame into another. An ML Transformer, for example, transforms a DataFrame with features into a DataFrame with predictions.
* **Estimators** create Transformers by being fitted on a DataFrame.
* A **Pipeline** chains Transformers and Estimators into a single workflow. Pipelines are covered further below, in *Chain in a pipeline*.
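
The relationship between Estimators and Transformers can be illustrated without Spark. The following is a minimal sketch in plain Python. The class names `MeanShiftEstimator` and `MeanShiftModel` are made up for illustration and are not part of any Spark API. Fitting learns a value from the data and returns a new object, and only that object can transform data.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import List


@dataclass(frozen=True)
class MeanShiftModel:
    """Hypothetical 'Transformer': holds learned state and applies it."""
    mean: float

    def transform(self, values: List[float]) -> List[float]:
        # Return a new list; the input stays untouched (like a DataFrame).
        return [v - self.mean for v in values]


class MeanShiftEstimator:
    """Hypothetical 'Estimator': has no learned state of its own."""

    def fit(self, values: List[float]) -> MeanShiftModel:
        # Learning happens here; the result is a separate model object.
        return MeanShiftModel(mean=fmean(values))


estimator = MeanShiftEstimator()
shift_model = estimator.fit([1.0, 2.0, 3.0])   # Estimator -> Transformer
shifted = shift_model.transform([2.0, 4.0])    # Transformer -> new data
print(shifted)
```

Note that the estimator itself has no `transform` method. This is the main difference to scikit-learn, where fitting and predicting happen on the same object.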

In [10]:
FEATURES = ["longitude", 
            "latitude", 
            "housing_median_age", 
            "total_rooms", 
            "total_bedrooms", 
            "population",
            "households",
            "median_income"
           ]
TARGET = "median_house_value"

Most machine learning algorithms in Spark require input data to be in vector format. PySpark's VectorAssembler is a utility that allows you to combine multiple columns of a PySpark DataFrame into a single vector column. The resulting vector column is then used as an input to machine learning algorithms in PySpark.
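
Conceptually, assembling means turning several named columns into one ordered list of numbers per row. Here is a minimal plain-Python sketch of that idea, assuming a row is represented as a dictionary. The helper `assemble` is hypothetical and only mimics what VectorAssembler does on a DataFrame.

```python
from typing import Dict, List


def assemble(row: Dict[str, object], input_cols: List[str]) -> List[float]:
    """Collect the values of input_cols, in that order, into one vector."""
    return [float(row[c]) for c in input_cols]


# Three columns taken from the first row shown earlier in this notebook.
row = {"longitude": -122.23, "latitude": 37.88, "ocean_proximity": "NEAR BAY"}
features_vec = assemble(row, ["longitude", "latitude"])
print(features_vec)
```

The order of the input columns fixes the position of each feature in the vector, so the coefficients of a fitted linear model line up with that order.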

In [11]:
assembler = VectorAssembler(
    inputCols=FEATURES,
    outputCol="features_vec"
)
df_prep = assembler.transform(df).select("features_vec", TARGET)

In [12]:
df_prep.show(2)

+--------------------+------------------+
|        features_vec|median_house_value|
+--------------------+------------------+
|[-122.23,37.88,41...|          452600.0|
|[-122.22,37.86,21...|          358500.0|
+--------------------+------------------+
only showing top 2 rows



In [13]:
df_train, df_test = df_prep.randomSplit([0.8, 0.2], seed=42)

All right. Let's fit the baseline model, which is a simple linear regression. Note that the classical assumptions of linear regression concern the residuals (for example, that they are roughly normally distributed), not the distribution of the features. We do not check these assumptions here, because that is not the focus of this notebook.

The syntax is very similar to sklearn. But, as mentioned above, there is a distinction between Estimators and Transformers. So, the fitting process returns a new Transformer object (in contrast to sklearn, where the fitted object itself is able to make predictions).

In [14]:
lr = LinearRegression(labelCol=TARGET, featuresCol="features_vec")

In [15]:
model = lr.fit(df_train)

23/05/12 11:06:07 WARN Instrumentation: [1746f8b0] regParam is zero, which might cause numerical instability and overfitting.
23/05/12 11:06:08 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
23/05/12 11:06:08 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.lapack.JNILAPACK


Whether and how the training process can be parallelized depends on the model. A k-nearest-neighbors (kNN) method, for example, parallelizes easily, because the distance computations are independent of each other. Gradient boosting as in XGBoost, on the other hand, is more complex: the boosting rounds build on each other and must run in sequence, so only the work within each round can be parallelized. How the computation is optimized differs from case to case. The same applies to inference.
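
To make the idea of parallel training concrete: for a least-squares fit, the quantities needed are sums over rows, so each partition can compute its partial sums independently and the results are merged afterwards. The sketch below does this in plain Python for a single feature. It illustrates the general principle under that simplification and is not a description of Spark's actual solver. The data and the function names are invented.

```python
from functools import reduce
from typing import List, Tuple

# n, sum_x, sum_y, sum_xy, sum_xx
Stats = Tuple[float, ...]


def partial_stats(partition: List[Tuple[float, float]]) -> Stats:
    """Runs independently on each partition (the 'map' step)."""
    n = float(len(partition))
    sx = sum(x for x, _ in partition)
    sy = sum(y for _, y in partition)
    sxy = sum(x * y for x, y in partition)
    sxx = sum(x * x for x, _ in partition)
    return (n, sx, sy, sxy, sxx)


def merge(a: Stats, b: Stats) -> Stats:
    """Combines two partial results element-wise (the 'reduce' step)."""
    return tuple(u + v for u, v in zip(a, b))


def solve(stats: Stats) -> Tuple[float, float]:
    """Closed-form slope and intercept for y = slope * x + intercept."""
    n, sx, sy, sxy, sxx = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Invented data following y = 2x + 1, split into two partitions.
partitions = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
slope, intercept = solve(reduce(merge, map(partial_stats, partitions)))
print(slope, intercept)
```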

In [16]:
df_pred = model.transform(df_test)

## ... test it ...

In [17]:
df_pred.show(3)

+--------------------+------------------+------------------+
|        features_vec|median_house_value|        prediction|
+--------------------+------------------+------------------+
|[-124.3,41.84,17....|          103600.0|102138.32319061132|
|[-124.23,40.54,52...|          106700.0|190038.23036733596|
|[-124.23,41.75,11...|           73200.0| 77197.94015233079|
+--------------------+------------------+------------------+
only showing top 3 rows



In [18]:
evaluator = RegressionEvaluator(labelCol=TARGET, predictionCol='prediction')
mse = evaluator.evaluate(df_pred, {evaluator.metricName: 'mse'})
mae = evaluator.evaluate(df_pred, {evaluator.metricName: 'mae'})
rmse = evaluator.evaluate(df_pred, {evaluator.metricName: 'rmse'})
r2 = evaluator.evaluate(df_pred, {evaluator.metricName: 'r2'})

print(f"MSE: {mse:.2f}")
print(f"MAE: {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R2: {r2:.2f}")

MSE: 4856128708.42
MAE: 51153.47
RMSE: 69685.93
R2: 0.63
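
For reference, the four metrics can be written out in a few lines of plain Python. This is a sketch of the standard definitions, using three made-up label/prediction pairs rather than the housing data.

```python
from math import sqrt

# Made-up values for illustration only.
labels = [100.0, 200.0, 300.0]
predictions = [110.0, 190.0, 330.0]

n = len(labels)
errors = [p - y for p, y in zip(predictions, labels)]
ss_res = sum(e * e for e in errors)            # residual sum of squares

mse_demo = ss_res / n                          # mean squared error
mae_demo = sum(abs(e) for e in errors) / n     # mean absolute error
rmse_demo = sqrt(mse_demo)                     # same unit as the target

mean_label = sum(labels) / n
ss_tot = sum((y - mean_label) ** 2 for y in labels)
r2_demo = 1.0 - ss_res / ss_tot                # share of variance explained

print(f"MSE: {mse_demo:.2f}, MAE: {mae_demo:.2f}, "
      f"RMSE: {rmse_demo:.2f}, R2: {r2_demo:.3f}")
```

Because MAE and RMSE are expressed in the unit of the target, the values reported above (roughly 51,000 and 70,000) can be read directly as typical errors in the predicted house value.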


## ... and persist it.

So here we are. We've got a regressor in our hands. Hooray.

Of course, we don't want to train a model again and again, but persist it. The Transformer object comes with a built-in method to do so. One can persist the model on the local file system or in any blob storage.

In [19]:
import os

In [20]:
path = "model/my_model"

In [21]:
import shutil

if os.path.exists(path):
    shutil.rmtree(path)  # portable alternative to shelling out to `rm -r`

In [22]:
model.save(path)

                                                                                

Then, in any production environment, one can load the model to work with it. Note that this could also happen in Scala or Java directly, because the model was not serialized via pickle but in a language-agnostic format.  
Again, the syntax is a little unintuitive: `load()` is called on the Model class (here `LinearRegressionModel`), not on the `LinearRegression` estimator that produced the model.

In [None]:
# model_loaded = LinearRegressionModel.load("model/my_model")

## Chain in a pipeline

There is also the possibility to capture both steps, assembling the feature vector and fitting the regressor, in one object. A Pipeline is a construct in which several Transformers and Estimators are chained together.

In [23]:
assembler = VectorAssembler(
    inputCols=FEATURES,
    outputCol="features_vec"
)

lr = LinearRegression(labelCol=TARGET, featuresCol="features_vec")

pipeline = Pipeline(stages=[assembler, lr])

In this case, the train-test split has to be applied directly to the original DataFrame.

In [24]:
df_train_direct, df_test_direct = df.randomSplit([0.8, 0.2], seed=42)

In [25]:
model = pipeline.fit(df_train_direct)

23/05/12 11:10:24 WARN Instrumentation: [7746911c] regParam is zero, which might cause numerical instability and overfitting.


In [26]:
df_pred = model.transform(df_test_direct)

In [27]:
df_pred.show(3)

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+--------------------+------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|        features_vec|        prediction|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+--------------------+------------------+
|   -124.3|   41.84|              17.0|     2677.0|         531.0|    1244.0|     456.0|       3.0313|          103600.0|     NEAR OCEAN|[-124.3,41.84,17....|102138.32319061132|
|  -124.23|   40.54|              52.0|     2694.0|         453.0|    1152.0|     435.0|       3.0806|          106700.0|     NEAR OCEAN|[-124.23,40.54,52...|190038.23036733596|
|  -124.23|   41.75|              11.0|     3159.0|         616.0|    1343.0|     479.0|       2.4805|        

## Optimization

Note that the example above is for educational purposes. In real-world problems, one would try out different hyperparameter setups as well as different models to find the best one. Also, cross-validation should be applied if possible to achieve more stable results.

In [28]:
assembler = VectorAssembler(
    inputCols=FEATURES,
    outputCol="features_vec"
)

lr = LinearRegression(labelCol=TARGET, featuresCol="features_vec")

evaluator = RegressionEvaluator(labelCol=TARGET, 
                                predictionCol='prediction',
                                metricName="mae")

pipeline = Pipeline(stages=[assembler, lr])

parameter_grid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0]) \
    .build()

crossval = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=parameter_grid,
    evaluator=evaluator,
    numFolds=2
)

cross_val_model = crossval.fit(df_train_direct)
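
To get a feeling for the cost of such a search, it helps to count the fits. The sketch below enumerates the same parameter values in plain Python with `itertools.product`. It only mirrors the bookkeeping and does not call Spark. The extra fit at the end reflects that CrossValidator refits the best parameter set on the full training data.

```python
from itertools import product

reg_params = [0.1, 0.01]
elastic_net_params = [0.0, 0.5, 1.0]
num_folds = 2

# Every combination of the two parameter lists is one candidate.
grid = [
    {"regParam": r, "elasticNetParam": e}
    for r, e in product(reg_params, elastic_net_params)
]

fits_during_search = len(grid) * num_folds  # one fit per candidate and fold
total_fits = fits_during_search + 1         # plus one refit of the best candidate

print(len(grid), fits_during_search, total_fits)  # prints: 6 12 13
```

Accordingly, `avgMetrics` holds one averaged score per candidate, six entries here, in the order of `parameter_grid`.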

In [None]:
DEEP_DIVE = False

In [None]:
if DEEP_DIVE: 
    print(cross_val_model.bestModel.stages[1].extractParamMap())

In [None]:
if DEEP_DIVE:
    print(cross_val_model.avgMetrics)

## Outlook

Real data sets usually require some pre-processing before fitting is possible.

* For instance, not every ML model is able to handle missing values. One needs to drop or impute missing values first.
* Also, some models are sensitive to features on different numerical scales, so one needs to standardize.
* Speaking of numerical values: for many models, categorical data has to be encoded as numbers first.

For those cases, there exist Transformers as well. One should integrate them into the modelling process, ideally directly into a pipeline.
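
What these preprocessing steps do can be sketched in plain Python. In Spark ML, the corresponding classes live in `pyspark.ml.feature` (for example `Imputer`, `StandardScaler`, `StringIndexer` and `OneHotEncoder`). The functions below are simplified, hypothetical stand-ins on invented data, not those implementations, and they differ in details such as the exact standard deviation used or the shape of the encoded vector.

```python
from statistics import fmean, median, pstdev
from typing import List, Optional


def impute_median(values: List[Optional[float]]) -> List[float]:
    """Replace missing entries (None) with the median of the known ones."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def standardize(values: List[float]) -> List[float]:
    """Shift to zero mean and scale by the population standard deviation."""
    mu, sigma = fmean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def one_hot(values: List[str]) -> List[List[int]]:
    """Encode each category as a 0/1 vector, one position per category."""
    categories = sorted(set(values))
    return [[int(v == c) for c in categories] for v in values]


imputed = impute_median([1.0, None, 3.0])
scaled = standardize(imputed)
encoded = one_hot(["NEAR OCEAN", "NEAR BAY", "NEAR OCEAN"])
print(imputed, scaled, encoded)
```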

## Congrats!

You dived into ML via PySpark.  
🚀🚀🚀

Again, stop the local Spark session.

In [29]:
spark_session.stop()