# Using the Machine Learning library

Spark provides a library with ML functionality. The set of tools is ever expanding -- see the latest at https://spark.apache.org/docs/latest/ml-guide.html

The library is implemented in Scala and has Python bindings (i.e. you call the API from Python).


Using MLflow (https://mlflow.org/docs/latest/python_api/mlflow.spark.html?highlight=spark#module-mlflow.spark) is also possible, but it is not covered here.

**Check the notebook at "sdg/Advanced_Analytics_and_Machine_Learning-Chapter_25_Preprocessing_and_Feature_Engineering"**

If a specific tool is not part of MLlib, maybe someone already implemented it.

Always be suspicious of the source: Who wrote it? When was the last update? How many stars does it have?

See for example https://spark-packages.org/?q=tags%3A%22Machine%20Learning%22 which is a repository without any quality assurance. You can find great code, buggy code, or malware.

In [None]:
from pathlib import Path
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
from pyspark.sql.types import *
spark = SparkSession.builder.appName('MLlib').getOrCreate()

In [None]:
def load_data(file_name_glob):
    """Load the contents of the input files into a DataFrame.
        If the data was already saved as a Parquet file, read that file instead.
        Example: load_data('../data/sdg/retail-data/by-day/2010-12*.csv')
        :param file_name_glob: wildcard pattern of the files to read, e.g. "/mnt/dir/data*"
        :return: DataFrame containing all the data
    """
    
    def cache_file_name(file_name):
        t = file_name.replace('*',"_").replace('?',"_")
        return t[: t.rfind('.')] + ".parquet"
    
    fname = Path(file_name_glob)
    p = fname.parent        # directory that holds the input files
    basename = fname.name   # the wildcard part, e.g. "2011-*.csv"
    cache_name = cache_file_name(file_name_glob)
    if Path(cache_name).exists():
        print(f"reading {cache_name} from cache Parquet file")
        return spark.read.parquet(cache_name)
    
    #suffix = fname.suffix
    if not p.exists():
        raise ValueError(f'Path not found: {p}')
    file_list = list(p.glob(basename))
    # avoid the name `f` here: it is the alias of pyspark.sql.functions
    x = [str(path.resolve()) for path in file_list]
    df = spark.read \
    .option("header","true")\
    .option("inferSchema", "true")\
    .csv(x)
    
    df.write.parquet(cache_name)
    return df
    

df = load_data('../data/sdg/retail-data/by-day/2011-*.csv')
print(f"df.count = {df.count()}")

In [None]:
df.printSchema()

# Using MLlib

## Prepare the data

Add a new column, "day_of_week", and split the data into train and test sets.
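
Why does the train:test ratio printed by the next cell come out only approximately 0.7/0.3 ≈ 2.33? A random split decides row by row, so the split sizes themselves are random. The following is a minimal pure-Python sketch of that idea, not Spark's actual implementation; the function `random_split` and its parameters are hypothetical names used only for illustration.

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row to exactly one split by drawing a uniform number per row.
    Illustrative only: a hypothetical helper, not part of the Spark API."""
    rng = random.Random(seed)
    total = sum(weights)
    # cumulative upper bounds of the normalized weights, e.g. about [0.7, 1.0]
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits = [[] for _ in weights]
    for row in rows:
        r = rng.random()
        for i, upper in enumerate(bounds):
            # the last split also catches rounding leftovers
            if r < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

train, test = random_split(range(10_000), [0.7, 0.3], seed=1)
ratio = len(train) / len(test)
print(f"train:test ratio: {ratio:.2f} (expected about {0.7 / 0.3:.2f})")
```

Passing a fixed seed makes the split reproducible, which helps when comparing models across runs.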

In [None]:
from pyspark.sql.functions import date_format, col
preppedDataFrame = df\
  .na.fill(0)\
  .withColumn("day_of_week", date_format(col("InvoiceDate"), "EEEE"))
  #.coalesce(5)

# split to train and test:
trainDataFrame,testDataFrame  = preppedDataFrame.randomSplit([0.7, 0.3])

# we could also split using other criteria:
# trainDataFrame = preppedDataFrame\
#   .where("InvoiceDate < '2011-07-01'")
# testDataFrame = preppedDataFrame\
#   .where("InvoiceDate >= '2011-07-01'")

print(f"train:test ratio: {trainDataFrame.count()/testDataFrame.count()}")

Convert the day of week to a numeric index (e.g. "Monday" -> 2) and then to a one-hot encoding.

**What are the downsides of using one-hot encoding?**
- wasted space -> solved by using sparse vectors
- increased dimension
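
To make the "wasted space" point concrete, here is a small pure-Python sketch that compares a dense one-hot vector with a sparse (size, indices, values) representation. The helper functions and the day-to-index mapping are assumptions for illustration: StringIndexer orders labels by frequency by default, so the real indices depend on the data, and Spark's OneHotEncoder drops the last category by default, so its vectors are one element shorter than shown here.

```python
def one_hot_dense(index, num_categories):
    """Dense one-hot vector: one slot per category, a single 1.0."""
    vec = [0.0] * num_categories
    vec[index] = 1.0
    return vec

def one_hot_sparse(index, num_categories):
    """Sparse form: store only the size and the non-zero entry."""
    return (num_categories, [index], [1.0])

days = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]
# hypothetical index assignment, for illustration only
day_to_index = {day: i for i, day in enumerate(days)}

idx = day_to_index["Wednesday"]
dense = one_hot_dense(idx, len(days))
sparse = one_hot_sparse(idx, len(days))
print(dense)   # [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
print(sparse)  # (7, [2], [1.0])
```

The sparse form stays the same size no matter how many categories exist, but the model still sees one dimension per category, so the "increased dimension" downside remains.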

In [None]:
from pyspark.ml.feature import StringIndexer
day_indexer = StringIndexer()\
  .setInputCol("day_of_week")\
  .setOutputCol("day_of_week_index")

country_indexer = StringIndexer()\
  .setInputCol("Country")\
  .setOutputCol("country_index")

from pyspark.ml.feature import OneHotEncoder
encoder = OneHotEncoder()\
  .setInputCol("day_of_week_index")\
  .setOutputCol("day_of_week_encoded")
from pyspark.ml.feature import VectorAssembler

#  add "features" column that contains the input columns as elements in a vector.
# Not very exciting, right?
vectorAssembler = VectorAssembler()\
  .setInputCols(["UnitPrice", "Quantity", "day_of_week_encoded"])\
  .setOutputCol("features")

# Read about pipelines here: https://spark.apache.org/docs/latest/ml-pipeline.html
from pyspark.ml import Pipeline

transformationPipeline = Pipeline()\
  .setStages([day_indexer, country_indexer, encoder, vectorAssembler])

fittedPipeline = transformationPipeline.fit(trainDataFrame)
transformedTraining = fittedPipeline.transform(trainDataFrame)
tranformedTest = fittedPipeline.transform(testDataFrame)

# Let's drop unused columns.
# This reduces the amount of memory needed, which improves performance.
transformedTraining = transformedTraining.drop('day_of_week', 'day_of_week_encoded', 'day_of_week_index', 'CustomerID')

In [None]:
# Caching the transformed DF will save a lot of time when reusing it (e.g. for hyperparameter tuning)
transformedTraining.cache()

In Spark, training a machine learning model is a two-phase process: first we initialize an untrained model, then we
train it. Every algorithm in MLlib’s DataFrame API therefore comes as two types: the untrained algorithm (e.g. KMeans)
and its trained version (e.g. KMeansModel).
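
The two-type pattern can be sketched without Spark. In the toy example below, the classes `MeanCenterer` and `MeanCentererModel` are hypothetical and not part of MLlib; they only mimic the idea that calling fit on the untrained object returns a separate trained object, which is the one that transforms data.

```python
class MeanCenterer:
    """Untrained 'algorithm': holds only configuration, no learned state."""
    def __init__(self, input_key):
        self.input_key = input_key

    def fit(self, rows):
        # learn the parameter (the mean) from the training rows
        values = [row[self.input_key] for row in rows]
        mean = sum(values) / len(values)
        return MeanCentererModel(self.input_key, mean)


class MeanCentererModel:
    """Trained version: carries the learned mean and can transform new rows."""
    def __init__(self, input_key, mean):
        self.input_key = input_key
        self.mean = mean

    def transform(self, rows):
        # add a new field and leave the input rows unchanged
        return [{**row, "centered": row[self.input_key] - self.mean}
                for row in rows]


train_rows = [{"Quantity": 2.0}, {"Quantity": 4.0}, {"Quantity": 6.0}]
test_rows = [{"Quantity": 5.0}]

model = MeanCenterer("Quantity").fit(train_rows)  # phase 1 -> phase 2
print(model.mean)                  # 4.0
print(model.transform(test_rows))  # [{'Quantity': 5.0, 'centered': 1.0}]
```

Keeping the two types separate means the same untrained configuration can be fitted several times (e.g. on different folds) without the results overwriting each other.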

## K-means

In [None]:
from pyspark.ml.clustering import KMeans
kmeans = KMeans()\
  .setK(6)\
  .setSeed(1)

kmModel = kmeans.fit(transformedTraining)

# Supervised learning


## Logistic Regression


In [None]:
transformedTraining.columns

In [None]:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(labelCol="country_index",featuresCol="features")

In [None]:
# You can see a list of all hyperparams of LogisticRegression
# print(lr.explainParams())

In [None]:
fittedLR = lr.fit(transformedTraining)

In [None]:
fittedLR.transform(tranformedTest).select("country_index", "prediction").\
groupBy("country_index").avg("prediction").show(50)


In [None]:
fittedLR.transform(tranformedTest).select("country_index", "prediction")\
.filter((col('country_index') == col('prediction')) & (col('prediction') != 0)).toPandas()

## Tuning the hyper-params

TODO: reminder why this tuning is needed, and when is it enough
