PySpark ML Library

This repo surveys the complete landscape of the ML library in PySpark. The next step after this is to go into more detail on speeding up model building and deployment using MLOps.

The repo has been seeded with the tutorials and data from the following repos:

https://github.com/srivatsan88/End-to-End-Time-Series

https://github.com/srivatsan88/Mastering-Apache-Spark

https://github.com/susanli2016/PySpark-and-MLlib

https://github.com/apache/spark/tree/master/data/mllib

https://github.com/srivatsan88/model-deployment.git

An important repo that made all of this possible is the ghclone repo. Without HR's support, the data could not have been pulled at such short notice.

https://github.com/HR/github-clone

The idea is to complete MLlib today and to learn about MLOps next, using this playlist:

https://www.youtube.com/playlist?list=PL3N9eeOlCrP6Y73-dOA5Meso7Dv7qYiUU

After that, the plan is to work on the serverless MLOps use cases from featurestore, along with GitHub Actions.

https://youtube.com/playlist?list=PL_RrEj88onS-um2xFy01sY46ik_2yt_EQ

https://github.com/featurestoreorg/serverless-ml-course

https://www.youtube.com/playlist?list=PL3N9eeOlCrP4uLCtas5vxq09sWz6jJXrw



In [35]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import ChiSquareTest, Correlation, Summarizer
from pyspark.sql import Row

In [2]:
import pyspark
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession.builder.appName("basics").getOrCreate()

22/11/27 04:40:54 WARN Utils: Your hostname, codeStation resolves to a loopback address: 127.0.1.1; using 192.168.84.83 instead (on interface wlo1)
22/11/27 04:40:54 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/11/27 04:40:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [5]:
# Basic statistics in MLlib, starting with correlation

data = [(Vectors.sparse(4, [(0, 1.0), (3, -2.0)]),),
        (Vectors.dense([4.0, 5.0, 0.0, 3.0]),),
        (Vectors.dense([6.0, 7.0, 0.0, 8.0]),),
        (Vectors.sparse(4, [(0, 9.0), (3, 1.0)]),)]

df = spark.createDataFrame(data, ['features'])
df.show()

+--------------------+
|            features|
+--------------------+
|(4,[0,3],[1.0,-2.0])|
|   [4.0,5.0,0.0,3.0]|
|   [6.0,7.0,0.0,8.0]|
| (4,[0,3],[9.0,1.0])|
+--------------------+



In [6]:
Correlation.corr(df,'features').head()

22/11/27 04:42:33 WARN InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
22/11/27 04:42:33 WARN InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.ForeignLinkerBLAS


                                                                                

22/11/27 04:42:34 WARN PearsonCorrelation: Pearson correlation matrix contains NaN values.


Row(pearson(features)=DenseMatrix(4, 4, [1.0, 0.0556, nan, 0.4005, 0.0556, 1.0, nan, 0.9136, nan, nan, 1.0, nan, 0.4005, 0.9136, nan, 1.0], False))
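
The nan entries above all involve the third feature, which is 0.0 in every row. A constant column has zero variance, so the Pearson denominator is zero and the coefficient is undefined. The following is a minimal plain-Python sketch of that calculation on the same four rows. It is not Spark's implementation, and the helper name `pearson` is introduced here only for illustration.

```python
import math

# The four feature vectors from the cell above, written out densely.
rows = [
    [1.0, 0.0, 0.0, -2.0],
    [4.0, 5.0, 0.0, 3.0],
    [6.0, 7.0, 0.0, 8.0],
    [9.0, 0.0, 0.0, 1.0],
]

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (hypothetical helper)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0.0 or ss_y == 0.0:
        return float("nan")  # zero variance: the coefficient is undefined
    return cov / math.sqrt(ss_x * ss_y)

columns = list(zip(*rows))  # transpose rows into feature columns

# The Spark output above shows 1.0 on the diagonal even for the constant
# column, so the sketch fills the diagonal with 1.0 as well.
matrix = [
    [1.0 if i == j else pearson(a, b) for j, b in enumerate(columns)]
    for i, a in enumerate(columns)
]

for line in matrix:
    print([round(value, 4) for value in line])
```

Dropping or imputing constant features before calling `Correlation.corr` would avoid the nan entries.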

In [7]:
r2 = Correlation.corr(df, "features", "spearman").head()

                                                                                

22/11/27 04:42:58 WARN PearsonCorrelation: Pearson correlation matrix contains NaN values.
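
Spearman correlation is Pearson correlation computed on ranks instead of raw values, which would explain why the warning above still comes from PearsonCorrelation. The sketch below shows the idea in plain Python. It is not Spark's code: `average_ranks`, `pearson` and `spearman` are hypothetical helper names, and ties are assumed to share the average of their positions.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average 1-based position of the group
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation; NaN when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0.0 or ss_y == 0.0:
        return float("nan")
    return cov / math.sqrt(ss_x * ss_y)

def spearman(xs, ys):
    """Spearman correlation as Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Feature columns 0, 1 and 3 of the correlation DataFrame above.
col0 = [1.0, 4.0, 6.0, 9.0]
col1 = [0.0, 5.0, 7.0, 0.0]
col3 = [-2.0, 3.0, 8.0, 1.0]

print(spearman(col0, col1), spearman(col0, col3), spearman(col1, col3))
```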


In [8]:
# Hypothesis testing determines whether a result is statistically significant or could plausibly have occurred by chance.
# ChiSquareTest runs Pearson's independence test for every feature against the label.
from pyspark.ml.stat import ChiSquareTest

data = [(0.0, Vectors.dense(0.5, 10.0)),
        (0.0, Vectors.dense(1.5, 20.0)),
        (1.0, Vectors.dense(1.5, 30.0)),
        (0.0, Vectors.dense(3.5, 30.0)),
        (0.0, Vectors.dense(3.5, 40.0)),
        (1.0, Vectors.dense(3.5, 40.0))]
df = spark.createDataFrame(data, ["label", "features"])

In [9]:
df.show()

+-----+----------+
|label|  features|
+-----+----------+
|  0.0|[0.5,10.0]|
|  0.0|[1.5,20.0]|
|  1.0|[1.5,30.0]|
|  0.0|[3.5,30.0]|
|  0.0|[3.5,40.0]|
|  1.0|[3.5,40.0]|
+-----+----------+



In [12]:
r = ChiSquareTest.test(df, featuresCol='features',labelCol='label')
r.show()

+--------------------+----------------+----------+
|             pValues|degreesOfFreedom|statistics|
+--------------------+----------------+----------+
|[0.68728927879097...|          [2, 3]|[0.75,1.5]|
+--------------------+----------------+----------+
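
The statistics and degrees of freedom in this table can be reproduced by hand from a contingency table of each feature against the label. The sketch below does this in plain Python as a sanity check on the output. It is not Spark's implementation, and `chi_square` is a hypothetical helper name. The p-value shortcut is only valid for 2 degrees of freedom.

```python
import math
from collections import Counter

# Label and feature columns from the DataFrame above.
labels = [0.0, 0.0, 1.0, 0.0, 0.0, 1.0]
feature_0 = [0.5, 1.5, 1.5, 3.5, 3.5, 3.5]
feature_1 = [10.0, 20.0, 30.0, 30.0, 40.0, 40.0]

def chi_square(feature, label):
    """Return (statistic, degrees_of_freedom) for a feature/label contingency table."""
    n = len(feature)
    observed = Counter(zip(feature, label))  # cell counts
    feature_totals = Counter(feature)        # row totals
    label_totals = Counter(label)            # column totals
    statistic = 0.0
    for f, f_total in feature_totals.items():
        for l, l_total in label_totals.items():
            expected = f_total * l_total / n
            statistic += (observed[(f, l)] - expected) ** 2 / expected
    dof = (len(feature_totals) - 1) * (len(label_totals) - 1)
    return statistic, dof

stat_0, dof_0 = chi_square(feature_0, labels)
stat_1, dof_1 = chi_square(feature_1, labels)

# With exactly 2 degrees of freedom the chi-square survival function is exp(-x / 2).
p_value_0 = math.exp(-stat_0 / 2)

print(stat_0, dof_0, p_value_0)
print(stat_1, dof_1)
```

With only six rows, every expected cell count is far below the usual rule of thumb of 5, so the p-values here are illustrative rather than reliable.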



                                                                                

In [15]:
# Summarizing the data with Summarizer: column-wise statistics over a vector column, optionally weighted per row
from pyspark.ml.stat import Summarizer
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors

df = spark.sparkContext.parallelize([
    Row(weight=1.0, features=Vectors.dense(1.0, 1.0, 1.0)),
    Row(weight=0.0, features=Vectors.dense(1.0, 2.0, 3.0)),
]).toDF()


                                                                                

In [17]:
mySummar = Summarizer.metrics("mean","count")
df.select(mySummar.summary(df.features, df.weight)).show(truncate=False)



+-----------------------------------+
|aggregate_metrics(features, weight)|
+-----------------------------------+
|{[1.0,1.0,1.0], 1}                 |
+-----------------------------------+



                                                                                

In [19]:
df.select(Summarizer.mean(df.features, df.weight)).show(truncate=False)

+--------------+
|mean(features)|
+--------------+
|[1.0,1.0,1.0] |
+--------------+
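
Both results above use the weight column, so the second row (weight 0.0) contributes nothing to the mean. The count of 1 in the earlier output suggests that zero-weight rows are not counted either, and the sketch below assumes that behaviour. It is plain Python for illustration, with hypothetical helper names `weighted_mean` and `nonzero_count`.

```python
# (weight, features) pairs matching the DataFrame above.
rows = [
    (1.0, [1.0, 1.0, 1.0]),
    (0.0, [1.0, 2.0, 3.0]),
]

def weighted_mean(weighted_rows):
    """Per-column mean where each row counts in proportion to its weight."""
    total_weight = sum(weight for weight, _ in weighted_rows)
    size = len(weighted_rows[0][1])
    return [
        sum(weight * features[i] for weight, features in weighted_rows) / total_weight
        for i in range(size)
    ]

def nonzero_count(weighted_rows):
    """Number of rows with a positive weight (assumed meaning of 'count')."""
    return sum(1 for weight, _ in weighted_rows if weight > 0.0)

print(weighted_mean(rows), nonzero_count(rows))
```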



In [22]:
#loading image data
df_image = spark.read.format("image").option('dropInvalid',True) \
            .load("mllib/images/origin/kittens/")
df_image.select('image.origin','image.width','image.height').show(truncate=True)

+--------------------+-----+------+
|              origin|width|height|
+--------------------+-----+------+
|file:///run/media...|  300|   311|
|file:///run/media...|  199|   313|
|file:///run/media...|  300|   200|
|file:///run/media...|  300|   296|
+--------------------+-----+------+



In [23]:
#loading libsvm data
df_svm = spark.read.format("libsvm").option("numFeatures","780") \
    .load("mllib/sample_libsvm_data.txt")
df_svm.show(5)

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(780,[127,128,129...|
|  1.0|(780,[158,159,160...|
|  1.0|(780,[124,125,126...|
|  1.0|(780,[152,153,154...|
|  1.0|(780,[151,152,153...|
+-----+--------------------+
only showing top 5 rows
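
The LIBSVM text format stores one example per line as `label index:value index:value ...`, with one-based indices in the file that become zero-based indices in the loaded sparse vector. The parser below is a plain-Python sketch of that conversion, not Spark's reader. The function name `parse_libsvm_line` and the sample line are made up for illustration.

```python
def parse_libsvm_line(line, num_features):
    """Parse one LIBSVM line into (label, (size, indices, values))."""
    parts = line.split()
    label = float(parts[0])
    indices, values = [], []
    for item in parts[1:]:
        index_text, value_text = item.split(":")
        index = int(index_text) - 1  # one-based in the file, zero-based in the vector
        if not 0 <= index < num_features:
            raise ValueError(f"feature index {index_text} outside 1..{num_features}")
        indices.append(index)
        values.append(float(value_text))
    return label, (num_features, indices, values)

# Hypothetical sample line, not taken from sample_libsvm_data.txt.
label, sparse = parse_libsvm_line("1 3:0.5 10:2.0", num_features=780)
print(label, sparse)
```

Passing `numFeatures` explicitly, as the cell above does, fixes the vector size up front instead of leaving it to be inferred from the data.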



DataFrame: This ML API uses DataFrame from Spark SQL as an ML dataset, which can hold a variety of data types. E.g., a DataFrame could have different columns storing text, feature vectors, true labels, and predictions.

Transformer: A Transformer is an algorithm which can transform one DataFrame into another DataFrame. E.g., an ML model is a Transformer which transforms a DataFrame with features into a DataFrame with predictions.

Estimator: An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a DataFrame and produces a model.

Pipeline: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.

Parameter: All Transformers and Estimators now share a common API for specifying parameters.
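
The contract between these concepts can be sketched with toy classes in plain Python. This is only an illustration of the idea (fit on an Estimator returns a Transformer, and a Pipeline fits its stages in order); every class name here is hypothetical and none of it is Spark's API. Lists of dicts stand in for DataFrames.

```python
class ToyTransformer:
    """Turns one list of row dicts into another."""
    def transform(self, rows):
        raise NotImplementedError

class ToyEstimator:
    """Learns from rows and returns a fitted ToyTransformer."""
    def fit(self, rows):
        raise NotImplementedError

class AddLength(ToyTransformer):
    """Feature stage: adds a 'length' column computed from 'text'."""
    def transform(self, rows):
        return [{**row, "length": len(row["text"])} for row in rows]

class MeanLengthModel(ToyTransformer):
    """Fitted model: predicts 1.0 when a text is longer than the training mean."""
    def __init__(self, mean_length):
        self.mean_length = mean_length

    def transform(self, rows):
        return [
            {**row, "prediction": float(row["length"] > self.mean_length)}
            for row in rows
        ]

class MeanLengthEstimator(ToyEstimator):
    """Learning algorithm: fit() produces a MeanLengthModel."""
    def fit(self, rows):
        mean_length = sum(row["length"] for row in rows) / len(rows)
        return MeanLengthModel(mean_length)

class ToyPipelineModel(ToyTransformer):
    """Fitted pipeline: applies every fitted stage in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

class ToyPipeline(ToyEstimator):
    """Chains transformers and estimators; fit() returns a ToyPipelineModel."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if isinstance(stage, ToyEstimator):
                stage = stage.fit(rows)  # an estimator becomes a transformer
            rows = stage.transform(rows)  # feed the output to the next stage
            fitted.append(stage)
        return ToyPipelineModel(fitted)

train = [{"text": "spark"}, {"text": "hadoop mapreduce"}]
model = ToyPipeline([AddLength(), MeanLengthEstimator()]).fit(train)
predictions = model.transform([{"text": "hi"}, {"text": "a much longer sentence"}])
print(predictions)
```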

In [24]:
# Starting pipelines
# A Pipeline is an Estimator. After a Pipeline's fit() method runs, it produces a PipelineModel, which is a Transformer. This PipelineModel is used at test time.

In [25]:
# The following steps fit and apply a model directly, without a Pipeline
from pyspark.ml.classification import LogisticRegression

training = spark.createDataFrame([
    (1.0, Vectors.dense([0.0, 1.1, 0.1])),
    (0.0, Vectors.dense([2.0, 1.0, -1.0])),
    (0.0, Vectors.dense([2.0, 1.3, 1.0])),
    (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"])

In [27]:
lr = LogisticRegression(maxIter=10, regParam=0.01)
print(lr.explainParams())

aggregationDepth: suggested depth for treeAggregate (>= 2). (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0)
family: The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial (default: auto)
featuresCol: features column name. (default: features)
fitIntercept: whether to fit an intercept term. (default: True)
labelCol: label column name. (default: label)
lowerBoundsOnCoefficients: The lower bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. (undefined)
lowerBoundsOnIntercepts: The lower bounds on intercepts if fitting under bound constrained optimization. The bounds vector size must beequal wi

In [28]:
model1 = lr.fit(training)

In [29]:
print(model1.extractParamMap())

{Param(parent='LogisticRegression_b9a8b9417984', name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2).'): 2, Param(parent='LogisticRegression_b9a8b9417984', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0, Param(parent='LogisticRegression_b9a8b9417984', name='family', doc='The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial'): 'auto', Param(parent='LogisticRegression_b9a8b9417984', name='featuresCol', doc='features column name.'): 'features', Param(parent='LogisticRegression_b9a8b9417984', name='fitIntercept', doc='whether to fit an intercept term.'): True, Param(parent='LogisticRegression_b9a8b9417984', name='labelCol', doc='label column name.'): 'label', Param(parent='LogisticRegression_b9a8b9417984', name='maxBlockSizeInMB', doc='maximum memory in MB for stackin

In [30]:
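# Parameters can also be collected in a dict keyed by Param objects
paramMap = {lr.maxIter: 20}
paramMap[lr.maxIter] = 30  # overwrites the earlier value for maxIter
paramMap.update({lr.regParam: 0.1, lr.threshold: 0.55})  # set several Params at once
paramComb = paramMap.copy()
paramComb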

{Param(parent='LogisticRegression_b9a8b9417984', name='maxIter', doc='max number of iterations (>= 0).'): 30,
 Param(parent='LogisticRegression_b9a8b9417984', name='regParam', doc='regularization parameter (>= 0).'): 0.1,
 Param(parent='LogisticRegression_b9a8b9417984', name='threshold', doc='Threshold in binary classification prediction, in range [0, 1]. If threshold and thresholds are both set, they must match.e.g. if threshold is p, then thresholds must be equal to [1-p, p].'): 0.55}

In [31]:
model2 = lr.fit(training,paramComb)
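
Values passed to fit() in a param map override values set earlier on the estimator, which in turn override the defaults. The sketch below shows that layering with ordinary dictionaries rather than Spark Param objects. The helper `resolve_params` is hypothetical, and the default values listed are assumptions based on my recollection of LogisticRegression's defaults.

```python
def resolve_params(defaults, set_on_estimator, fit_overrides=None):
    """Later layers win: defaults < values set on the estimator < fit() overrides."""
    effective = dict(defaults)
    effective.update(set_on_estimator)
    effective.update(fit_overrides or {})
    return effective

# Assumed defaults, for illustration only.
defaults = {"maxIter": 100, "regParam": 0.0, "threshold": 0.5}
# Values given to the LogisticRegression constructor above.
set_on_estimator = {"maxIter": 10, "regParam": 0.01}
# Values collected in paramComb above.
fit_overrides = {"maxIter": 30, "regParam": 0.1, "threshold": 0.55}

params_model1 = resolve_params(defaults, set_on_estimator)
params_model2 = resolve_params(defaults, set_on_estimator, fit_overrides)
print(params_model1)
print(params_model2)
```

Under these assumptions model1 would train with maxIter=10 and regParam=0.01, while model2 would use the three values from paramComb.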

In [32]:
test = spark.createDataFrame([
    (1.0, Vectors.dense([-1.0, 1.5, 1.3])),
    (0.0, Vectors.dense([3.0, 2.0, -0.1])),
    (1.0, Vectors.dense([0.0, 2.2, -1.5]))], ["label", "features"])
test.show()

+-----+--------------+
|label|      features|
+-----+--------------+
|  1.0|[-1.0,1.5,1.3]|
|  0.0|[3.0,2.0,-0.1]|
|  1.0|[0.0,2.2,-1.5]|
+-----+--------------+



In [33]:
predi = model2.transform(test)
predi.show(2)

+-----+--------------+--------------------+--------------------+----------+
|label|      features|       rawPrediction|         probability|prediction|
+-----+--------------+--------------------+--------------------+----------+
|  1.0|[-1.0,1.5,1.3]|[-2.8046567890310...|[0.05707304993572...|       1.0|
|  0.0|[3.0,2.0,-0.1]|[2.49587585164645...|[0.92385219564432...|       0.0|
+-----+--------------+--------------------+--------------------+----------+
only showing top 2 rows
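
For binary logistic regression, the first probability entry is the logistic (sigmoid) function of the first rawPrediction entry, and the prediction is 1.0 when the probability of class 1 exceeds the threshold (0.55 for model2). The plain-Python check below uses only the digits visible in the truncated output above, so the inputs are approximations.

```python
import math

def sigmoid(z):
    """Logistic function mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# First rawPrediction entry of each row, as far as the truncated output shows.
raw_first_entries = [-2.804656789031, 2.49587585164645]
threshold = 0.55  # set through paramComb for model2

results = []
for raw in raw_first_entries:
    p_class0 = sigmoid(raw)    # first entry of the probability vector
    p_class1 = 1.0 - p_class0  # second entry of the probability vector
    prediction = 1.0 if p_class1 > threshold else 0.0
    results.append((p_class0, p_class1, prediction))

for row in results:
    print(row)
```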



In [36]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, Tokenizer

pipeTrain = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0)
], ["id", "text", "label"])

In [37]:
pipeTrain.show()

+---+----------------+-----+
| id|            text|label|
+---+----------------+-----+
|  0| a b c d e spark|  1.0|
|  1|             b d|  0.0|
|  2|     spark f g h|  1.0|
|  3|hadoop mapreduce|  0.0|
+---+----------------+-----+



In [38]:
tokenizer = Tokenizer(inputCol="text", outputCol="words")
# The hashing stage takes the tokenizer's output column as its input and writes a new output column
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)

pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
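
The first two stages turn text into a fixed-size sparse count vector by using the hashing trick: each token is hashed to a bucket index, and the bucket counts become the features. The sketch below illustrates the idea in plain Python. It swaps in `zlib.crc32` for the MurmurHash3 function Spark uses, so the bucket indices will not match Spark's; `tokenize` and `hashing_tf` are hypothetical helper names.

```python
import zlib

def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()

def hashing_tf(tokens, num_features):
    """Map tokens to bucket counts, returned as (size, indices, values)."""
    counts = {}
    for token in tokens:
        # crc32 is a stand-in hash for this sketch, not the hash Spark uses.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] = counts.get(index, 0.0) + 1.0
    indices = sorted(counts)
    return num_features, indices, [counts[i] for i in indices]

# 2**18 buckets, which is HashingTF's default numFeatures.
size, indices, values = hashing_tf(tokenize("spark hadoop spark"), num_features=1 << 18)
print(size, indices, values)
```

No vocabulary has to be built or stored, at the cost of occasional collisions where two different tokens share a bucket. A small numFeatures makes collisions more likely.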

In [39]:
model = pipeline.fit(pipeTrain)

                                                                                

22/11/27 06:21:04 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
22/11/27 06:21:04 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS


In [40]:
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "spark hadoop spark"),
    (7, "apache hadoop")
], ["id", "text"])
test.show()

+---+------------------+
| id|              text|
+---+------------------+
|  4|       spark i j k|
|  5|             l m n|
|  6|spark hadoop spark|
|  7|     apache hadoop|
+---+------------------+



In [41]:
prediction = model.transform(test)
selected = prediction.select("id", "text", "probability", "prediction")
selected.show()

+---+------------------+--------------------+----------+
| id|              text|         probability|prediction|
+---+------------------+--------------------+----------+
|  4|       spark i j k|[0.62920984896684...|       0.0|
|  5|             l m n|[0.98477000676230...|       0.0|
|  6|spark hadoop spark|[0.13412348342566...|       1.0|
|  7|     apache hadoop|[0.99557321143985...|       0.0|
+---+------------------+--------------------+----------+



In [42]:
from pyspark.ml.feature import IDF

sentenceD = spark.createDataFrame([
    (0.0, "Hi I heard about Spark"),
    (0.0, "I wish Java could use case classes"),
    (1.0, "Logistic regression model are neat")
],["label","sentence"])
sentenceD.show()

+-----+--------------------+
|label|            sentence|
+-----+--------------------+
|  0.0|Hi I heard about ...|
|  0.0|I wish Java could...|
|  1.0|Logistic regressi...|
+-----+--------------------+



In [47]:
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features",numFeatures=20)
tokenIdf = IDF(inputCol=hashingTF.getOutputCol(), outputCol='final')
idfPipeline = Pipeline(stages=[tokenizer,hashingTF,tokenIdf])

In [51]:
idfModel = idfPipeline.fit(sentenceD)
transformedSent = idfModel.transform(sentenceD)
transformedSent.select("label","final").show(truncate=False)

+-----+--------------------------------------------------------------------------------------------------------------------------------------------+
|label|final                                                                                                                                       |
+-----+--------------------------------------------------------------------------------------------------------------------------------------------+
|0.0  |(20,[6,8,13,16],[0.28768207245178085,0.6931471805599453,0.28768207245178085,0.5753641449035617])                                            |
|0.0  |(20,[0,2,7,13,15,16],[0.6931471805599453,0.28768207245178085,1.3862943611198906,0.28768207245178085,0.6931471805599453,0.28768207245178085])|
|1.0  |(20,[2,3,4,6,19],[0.28768207245178085,0.6931471805599453,0.6931471805599453,0.28768207245178085,0.6931471805599453])                        |
+-----+--------------------------------------------------------------------------------------------------------------------------------------------+
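
The values in the final column are consistent with Spark's smoothed IDF formula, idf = log((m + 1) / (df + 1)), where m is the number of documents (3 here) and df is the number of documents whose hashed vector has a non-zero entry in that bucket. Each IDF is then multiplied by the term count in the bucket. A plain-Python check of the four distinct values in the table (the helper name `idf` is hypothetical):

```python
import math

def idf(num_docs, doc_freq):
    """Smoothed inverse document frequency: log((m + 1) / (df + 1))."""
    return math.log((num_docs + 1) / (doc_freq + 1))

num_docs = 3  # three sentences in sentenceD

in_two_docs = idf(num_docs, 2)  # bucket is non-zero in 2 of the 3 sentences
in_one_doc = idf(num_docs, 1)   # bucket is non-zero in 1 of the 3 sentences

print(in_two_docs)      # term count 1, bucket shared by two sentences
print(in_one_doc)       # term count 1, bucket used by one sentence
print(2 * in_two_docs)  # term count 2, bucket shared by two sentences
print(2 * in_one_doc)   # term count 2, bucket used by one sentence
```

With numFeatures=20, different words can land in the same bucket, so a term count of 2 does not necessarily mean that one word occurred twice.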

In [None]:
from pyspark.ml.feature import Word2Vec
