### MLlib exercises

```
from pyspark.sql import SparkSession

from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Correlation


spark = SparkSession.builder.appName("MLlib").getOrCreate()

data = [(Vectors.sparse(4, [(0, 1.0), (3, -2.0)]),),
        (Vectors.dense([4.0, 5.0, 0.0, 3.0]),),
        (Vectors.dense([6.0, 7.0, 0.0, 8.0]),),
        (Vectors.sparse(4, [(0, 9.0), (3, 1.0)]),)]
df = spark.createDataFrame(data, ["features"])
print(df.show())


r1 = Correlation.corr(df, "features").head()
print("Pearson correlation matrix:\n" + str(r1[0]))

r2 = Correlation.corr(df, "features", "spearman").head()
print("Spearman correlation matrix:\n" + str(r2[0]))
```

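* Basic correlation display: `Correlation.corr` computes the correlation matrix across the entries of a vector column (Pearson by default, Spearman on request).
* The input has to be a single column of vectors, so the rows are built with `Vectors.dense` and `Vectors.sparse`.
* `df.show()` prints the table itself and returns `None`, which is why `print(df.show())` adds a stray `None` to the output.
* The third feature is 0.0 in every row, so its variance is zero and its correlation with every other feature is `nan`.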

```
+--------------------+
|            features|
+--------------------+
|(4,[0,3],[1.0,-2.0])|
|   [4.0,5.0,0.0,3.0]|
|   [6.0,7.0,0.0,8.0]|
| (4,[0,3],[9.0,1.0])|
+--------------------+

None
[Stage 4:>                                                          (0 + 4) / 4]19/05/14 22:37:50 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
19/05/14 22:37:50 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
19/05/14 22:37:51 WARN PearsonCorrelation: Pearson correlation matrix contains NaN values.
Pearson correlation matrix:
DenseMatrix([[1.        , 0.05564149,        nan, 0.40047142],
             [0.05564149, 1.        ,        nan, 0.91359586],
             [       nan,        nan, 1.        ,        nan],
             [0.40047142, 0.91359586,        nan, 1.        ]])
19/05/14 22:37:58 WARN PearsonCorrelation: Pearson correlation matrix contains NaN values.
Spearman correlation matrix:
DenseMatrix([[1.        , 0.10540926,        nan, 0.4       ],
             [0.10540926, 1.        ,        nan, 0.9486833 ],
             [       nan,        nan, 1.        ,        nan],
             [0.4       , 0.9486833 ,        nan, 1.        ]])

```
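
The `nan` entries above can be reproduced by hand. The following is a minimal plain-Python sketch of the Pearson formula, not Spark's implementation; the helper name `pearson` is made up for illustration, and the columns are read off the four feature vectors shown in the table.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; nan if either is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        return float("nan")  # zero variance makes the ratio 0/0
    return cov / math.sqrt(var_x * var_y)


# Columns of the four feature vectors (sparse rows expanded to dense form)
col0 = [1.0, 4.0, 6.0, 9.0]
col1 = [0.0, 5.0, 7.0, 0.0]
col2 = [0.0, 0.0, 0.0, 0.0]  # constant column
col3 = [-2.0, 3.0, 8.0, 1.0]

r01 = pearson(col0, col1)
r02 = pearson(col0, col2)
r03 = pearson(col0, col3)
print(round(r01, 8), r02, round(r03, 8))
```

The values for `r01` and `r03` should agree with the first row of the Pearson matrix, and `r02` is `nan` because `col2` never changes.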

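##### What are vectors?
* MLlib vectors are one-dimensional arrays of doubles. A dense vector stores every entry; a sparse vector stores its size plus the indices and values of its non-zero entries.
* NumPy arrays and Python lists are the closest Python equivalents, and those can also hold arbitrary objects:

In [None]:
import numpy as np

# dtype=object is needed for mixed, ragged contents; recent NumPy versions raise an error without it
a = np.array(['a', 0, "abcd", [0,1,2,3]], dtype=object)
a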

In [None]:
a[0] = 'd'
a
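
To connect this with the sparse rows printed earlier, such as `(4,[0,3],[1.0,-2.0])`, here is a small plain-Python sketch of how a sparse (size, indices, values) triple maps to a dense list. The helper `sparse_to_dense` is hypothetical and only illustrates the storage format; it is not how pyspark implements it.

```python
def sparse_to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a dense list of floats."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = float(v)
    return dense


# First row of the correlation example: size 4, non-zero entries at positions 0 and 3
dense_row = sparse_to_dense(4, [0, 3], [1.0, -2.0])
print(dense_row)  # [1.0, 0.0, 0.0, -2.0]
```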

##### Chi-squared Test

* Chi-squared: pronounced "kai squared" and written χ². It is a hypothesis test that compares observed and expected counts using the statistic SUM((observed - expected)^2 / expected).
    * https://www.youtube.com/watch?v=1Ldl5Zfcm1Y
    * The scipy package returns a p-value; if that value is less than the chosen significance level (alpha), we can reject the null hypothesis.
    * The test applies only to categorical data: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html
    * Code as in https://spark.apache.org/docs/latest/ml-statistics.html
        * What is the null hypothesis? That the occurrence of the outcomes is statistically independent.

##### Output:
pValues: [0.6872892787909721,0.6822703303362126]
 
degreesOfFreedom: [2, 3]

statistics: [0.75,1.5]
```
>>> r
Row(pValues=DenseVector([0.6873, 0.6823]), degreesOfFreedom=[2, 3], statistics=DenseVector([0.75, 1.5]))
```
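
The statistics and degrees of freedom above can be checked by hand with the formula SUM((observed - expected)^2 / expected). The sketch below is plain Python, not Spark's `ChiSquareTest`; the function name is made up, and the six-row dataset is an assumption: it is the data I believe the linked Spark page uses, since the notes do not show it.

```python
from collections import Counter


def chi_square_independence(feature, label):
    """Pearson chi-squared statistic and degrees of freedom for feature vs label."""
    n = len(feature)
    observed = Counter(zip(feature, label))  # missing cells count as 0
    feature_totals = Counter(feature)
    label_totals = Counter(label)
    statistic = 0.0
    for f, f_count in feature_totals.items():
        for l, l_count in label_totals.items():
            expected = f_count * l_count / n  # count expected under independence
            statistic += (observed[(f, l)] - expected) ** 2 / expected
    dof = (len(feature_totals) - 1) * (len(label_totals) - 1)
    return statistic, dof


# Assumed dataset (label, two categorical features), taken from the linked example
labels = [0.0, 0.0, 1.0, 0.0, 0.0, 1.0]
feature_1 = [0.5, 1.5, 1.5, 3.5, 3.5, 3.5]
feature_2 = [10.0, 20.0, 30.0, 30.0, 40.0, 40.0]

stat_1, dof_1 = chi_square_independence(feature_1, labels)
stat_2, dof_2 = chi_square_independence(feature_2, labels)
print(stat_1, dof_1)
print(stat_2, dof_2)
```

Under that assumption the statistics come out at roughly 0.75 and 1.5 with 2 and 3 degrees of freedom, matching the output above. Both p-values (about 0.68) are well above a typical alpha of 0.05, so the null hypothesis of independence is not rejected.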

```
>>> from pyspark.ml.stat import Summarizer
>>> from pyspark.sql import Row
>>> from pyspark.ml.linalg import Vectors
>>>
>>> df = sc.parallelize([Row(weight=1.0, features=Vectors.dense(1.0, 1.0, 1.0)),
...                      Row(weight=0.0, features=Vectors.dense(1.0, 2.0, 3.0))]).toDF()
>>>
>>> # create summarizer for multiple metrics "mean" and "count"
... summarizer = Summarizer.metrics("mean", "count")
>>>
>>> # compute statistics for multiple metrics with weight
... df.select(summarizer.summary(df.features, df.weight)).show(truncate=False)
+-----------------------------------+
|aggregate_metrics(features, weight)|
+-----------------------------------+
|[[1.0,1.0,1.0], 1]                 |
+-----------------------------------+

>>>
>>> # compute statistics for multiple metrics without weight
... df.select(summarizer.summary(df.features)).show(truncate=False)
+--------------------------------+
|aggregate_metrics(features, 1.0)|
+--------------------------------+
|[[1.0,1.5,2.0], 2]              |
+--------------------------------+

>>>
>>> # compute statistics for single metric "mean" with weight
... df.select(Summarizer.mean(df.features, df.weight)).show(truncate=False)
+--------------+
|mean(features)|
+--------------+
|[1.0,1.0,1.0] |
+--------------+

>>>
>>> # compute statistics for single metric "mean" without weight
... df.select(Summarizer.mean(df.features)).show(truncate=False)
+--------------+
|mean(features)|
+--------------+
|[1.0,1.5,2.0] |
+--------------+
```
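
The effect of the weight column is easy to verify by hand: the second row has weight 0.0, so it drops out of the weighted mean. A minimal plain-Python sketch (the helper `weighted_mean` is hypothetical, not part of `Summarizer`):

```python
def weighted_mean(vectors, weights=None):
    """Element-wise mean of equal-length vectors; every weight defaults to 1.0."""
    if weights is None:
        weights = [1.0] * len(vectors)
    total = sum(weights)
    dim = len(vectors[0])
    return [
        sum(w * v[i] for v, w in zip(vectors, weights)) / total
        for i in range(dim)
    ]


features = [[1.0, 1.0, 1.0], [1.0, 2.0, 3.0]]
weights = [1.0, 0.0]

with_weight = weighted_mean(features, weights)
without_weight = weighted_mean(features)
print(with_weight)     # [1.0, 1.0, 1.0]
print(without_weight)  # [1.0, 1.5, 2.0]
```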

##### Comparing with describe function

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("dummy").getOrCreate()

In [None]:
df = spark.createDataFrame([(1,2,3),(4,5,6),(7,8,9)],["a","b","c"])
df.show()

In [None]:
df.describe().show()
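
For comparison, `describe()` works on ordinary scalar columns and, as far as I know, reports count, mean, sample standard deviation, min and max per column. The sketch below recomputes those figures for column `a` with the standard library; the helper name `describe_column` is made up.

```python
import statistics


def describe_column(values):
    """Summary statistics for one numeric column (sample standard deviation)."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),  # divides by n - 1
        "min": min(values),
        "max": max(values),
    }


summary_a = describe_column([1, 4, 7])  # column "a" of the DataFrame above
print(summary_a)
```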

##### Reading a folder of csv

In [None]:
df_new = spark.read.csv(r"file:///C:\Users\padmaraj.bhat\OneDrive - Accenture\Git\GitHub\Real-Time-Analytics-on-Hadoop-master\New folder",sep="\t")
df_new.show(200000, False)

In [None]:
df_new.count()

In [None]:
df_new.describe().show()

In [None]:
#df_image = spark.read.format("image").path(r"file:///C:\Users\padmaraj.bhat\OneDrive - Accenture\TheBot")
df_image = spark.read.format("image").load(r"file:///C:\Users\padmaraj.bhat\Desktop\dummy pics")
df_image.show()

##### Spark could not read the local folder here; it might be able to read the same files from HDFS or S3.


## Creating a RDD and Converting to DF

In [None]:
from pyspark.sql.types import StructType
from pyspark.sql.types import StructField
from pyspark.sql.types import StringType

schema = StructType([StructField(str(i), StringType(), True) for i in range(3)])



rdd = spark.sparkContext.parallelize(["1","2","34"])

rdd.collect()

In [None]:
rdd

In [None]:
spark.createDataFrame(rdd)

##### As the link below explains, `createDataFrame` needs each record to be a list, tuple, or `Row`; an RDD of bare strings cannot be converted.
https://stackoverflow.com/a/32742294

In [None]:
from pyspark.sql import Row
row = Row("val")

In [None]:
rdd.map(lambda x: int(x)).map(row).collect()

In [None]:
df1 = rdd.map(lambda x: int(x)).map(row).toDF()
df2 = rdd.map(row).toDF()

df1.show()
df2.show()

In [None]:
df1[df1.val == "1"].show()

In [None]:
df2[df2.val == 1].show()

In [None]:
df1[df1.val == 1].show()

In [None]:
df2[df2.val == "1"].show()

In [None]:
df1.printSchema()

In [None]:
df2.printSchema()

In [None]:
df2[df2.val == 34].printSchema()

##### This is risky: searching with a string literal and with an integer literal returns the same rows. Spark appears to insert an implicit cast so that both sides of the comparison share a type before they are compared.

### Create multi feature RDD and then DF

In [None]:
rdd = spark.sparkContext.parallelize([(1,"1"),(2,"2"),(3,"3"),(4,"4")])
rdd.collect()

In [None]:
df3 = rdd.map(lambda x: Row(feat1=x[0], feat2=x[1])).toDF()
df3.show()

In [None]:
rdd = spark.sparkContext.parallelize([(1,"One"),(2,"Two"),(3,"Three"),(4,"Four")])
rdd.collect()

In [None]:
df4 = rdd.map(lambda x: Row(feat1=x[0], feat2=x[1])).toDF()
df4.show()

#### Pandas Merge vs Spark Join

In [None]:
df3.join(df4, on="feat1").show()

#### Pandas concat vs Spark Union on DF: 

In [None]:
df3.union(df4).show()

#### Pandas concat vs Spark RDD Union First and then to DF: 

In [None]:
rdd.union(rdd).map(lambda x: Row(feat1=x[0], feat2=x[1])).toDF().show()
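
As a rough mental model, the join and union above behave like the plain-Python operations below. This is only a sketch of the semantics on the same four-row data, not of how Spark executes them: the dictionary lookup assumes the join key is unique on the right-hand side, and Spark's `union` matches columns by position and keeps duplicates.

```python
# Same rows as df3 and df4 above, as (feat1, feat2) tuples
rows3 = [(1, "1"), (2, "2"), (3, "3"), (4, "4")]
rows4 = [(1, "One"), (2, "Two"), (3, "Three"), (4, "Four")]

# Inner join on feat1: keep keys present on both sides, carry both feat2 values
lookup = {key: value for key, value in rows4}
joined = [(key, left, lookup[key]) for key, left in rows3 if key in lookup]

# Union: rows are simply appended, column by position, duplicates kept
unioned = rows3 + rows4

print(joined)
print(len(unioned))  # 8
```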

### Pipeline

* Transformer: transforms one DataFrame into another, e.g. text to bag-of-words.
* Estimator: abstracts the learning step; `fit()` takes a DataFrame and produces a model, which is itself a Transformer.
* Pipeline: a sequence of Transformers and Estimators arranged to suit the application.
* `pipeline.fit()`: runs the stages in order, calling `transform()` on each Transformer and `fit()` on each Estimator; each fitted model then transforms the data for the next stage.
* DAG: Directed Acyclic Graph, i.e. a non-linear pipeline. The stages of a DAG must be listed in topological order.
* Runtime type checking: column types are checked against the DataFrame schema at runtime, before the pipeline actually runs.
* Unique pipeline stages: each stage must have a unique ID, so the same instance cannot appear twice. If the application needs the same transformation at two places in a DAG or linear pipeline, create a second instance, which gets its own ID.
* Parameters:
    * can be set when the stage is created, like initialising hyperparameter values when building a model,
    * or through a ParamMap, which maps parameters to values. The advantage is that one map can carry settings for several stages, for example two algorithms that each have a `maxIter` parameter.

* A pipeline or a fitted model can be saved for future use. Persisted models and pipelines are *usually* backward compatible and can be used across languages (except R).
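
The Transformer / Estimator / Pipeline contract described above can be sketched in a few lines of plain Python. All class and function names here are made up for illustration, and rows are plain strings rather than DataFrames; it mirrors the idea, not Spark's actual classes.

```python
class Lowercase:
    """Toy Transformer: returns a new list of lowercased rows."""

    def transform(self, rows):
        return [r.lower() for r in rows]


class VocabularyEstimator:
    """Toy Estimator: fit() learns a vocabulary and returns a model."""

    def fit(self, rows):
        vocab = sorted({w for r in rows for w in r.split()})
        return VocabularyModel(vocab)


class VocabularyModel:
    """The fitted model is itself a Transformer."""

    def __init__(self, vocab):
        self.index = {w: i for i, w in enumerate(vocab)}

    def transform(self, rows):
        return [[self.index[w] for w in r.split() if w in self.index] for r in rows]


def fit_pipeline(stages, rows):
    """Fit Estimators in order, feeding each stage's output to the next one."""
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):  # Estimator: replace it with its fitted model
            stage = stage.fit(rows)
        fitted.append(stage)
        rows = stage.transform(rows)
    return fitted


def transform_pipeline(fitted, rows):
    """Apply every fitted stage in order."""
    for stage in fitted:
        rows = stage.transform(rows)
    return rows


fitted = fit_pipeline([Lowercase(), VocabularyEstimator()], ["Spark b", "hadoop B"])
encoded = transform_pipeline(fitted, ["B Spark unknown"])
print(encoded)  # [[0, 2]]
```

The learned vocabulary is `b`, `hadoop`, `spark`, so the new row maps to indices 0 and 2 and the unseen word is dropped.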

In [36]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression

# Prepare training data from a list of (label, features) tuples.
training = spark.createDataFrame([
    (1.0, Vectors.dense([0.0, 1.1, 0.1])),
    (0.0, Vectors.dense([2.0, 1.0, -1.0])),
    (0.0, Vectors.dense([2.0, 1.3, 1.0])),
    (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"])

# Create a LogisticRegression instance. This instance is an Estimator.
lr = LogisticRegression(maxIter=10, regParam=0.01)
# Print out the parameters, documentation, and any default values.
print("LogisticRegression parameters:\n" + lr.explainParams() + "\n")

# Learn a LogisticRegression model. This uses the parameters stored in lr.
model1 = lr.fit(training)

# Since model1 is a Model (i.e., a transformer produced by an Estimator),
# we can view the parameters it used during fit().
# This prints the parameter (name: value) pairs, where names are unique IDs for this
# LogisticRegression instance.
print("Model 1 was fit using parameters: ")
print(model1.extractParamMap())

# We may alternatively specify parameters using a Python dictionary as a paramMap
paramMap = {lr.maxIter: 20}
paramMap[lr.maxIter] = 30  # Specify 1 Param, overwriting the original maxIter.
paramMap.update({lr.regParam: 0.1, lr.threshold: 0.55})  # Specify multiple Params.

# You can combine paramMaps, which are python dictionaries.
paramMap2 = {lr.probabilityCol: "myProbability"}  # Change output column name
paramMapCombined = paramMap.copy()
paramMapCombined.update(paramMap2)

# Now learn a new model using the paramMapCombined parameters.
# paramMapCombined overrides all parameters set earlier via lr.set* methods.
model2 = lr.fit(training, paramMapCombined)
print("Model 2 was fit using parameters: ")
print(model2.extractParamMap())
print("\n\n\n")

# Prepare test data
test = spark.createDataFrame([
    (1.0, Vectors.dense([-1.0, 1.5, 1.3])),
    (0.0, Vectors.dense([3.0, 2.0, -0.1])),
    (1.0, Vectors.dense([0.0, 2.2, -1.5]))], ["label", "features"])

# Make predictions on test data using the Transformer.transform() method.
# LogisticRegression.transform will only use the 'features' column.
# Note that model2.transform() outputs a "myProbability" column instead of the usual
# 'probability' column since we renamed the lr.probabilityCol parameter previously.
prediction = model2.transform(test)
result = prediction.select("features", "label", "myProbability", "prediction") \
    .collect()

for row in result:
    print("features=%s, label=%s -> prob=%s, prediction=%s"
          % (row.features, row.label, row.myProbability, row.prediction))

LogisticRegression parameters:
aggregationDepth: suggested depth for treeAggregate (>= 2). (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0)
family: The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial (default: auto)
featuresCol: features column name. (default: features)
fitIntercept: whether to fit an intercept term. (default: True)
labelCol: label column name. (default: label)
lowerBoundsOnCoefficients: The lower bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. (undefined)
lowerBoundsOnIntercepts: The lower bounds on intercepts if fitting under bound constrained optimization. The bou

In [37]:
type(paramMapCombined)

dict

In [38]:
print(lr)

LogisticRegression_ba434e53c1ef


##### Example of two separate sets of parameters in one map, each targeting a different estimator.

In [39]:
lr2 = LogisticRegression(maxIter=10, regParam=0.01)
paramMapCombined[lr2.maxIter] = 30
model3 = lr2.fit(training, paramMapCombined)

In [40]:
print(lr2)

LogisticRegression_ccc579993ad1


In [41]:
print(model3.extractParamMap())

{Param(parent='LogisticRegression_ccc579993ad1', name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2)'): 2, Param(parent='LogisticRegression_ccc579993ad1', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty'): 0.0, Param(parent='LogisticRegression_ccc579993ad1', name='family', doc='The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial.'): 'auto', Param(parent='LogisticRegression_ccc579993ad1', name='featuresCol', doc='features column name'): 'features', Param(parent='LogisticRegression_ccc579993ad1', name='fitIntercept', doc='whether to fit an intercept term'): True, Param(parent='LogisticRegression_ccc579993ad1', name='labelCol', doc='label column name'): 'label', Param(parent='LogisticRegression_ccc579993ad1', name='maxIter', doc='maximum number of iterations (>= 0)'): 30, Pa

##### Note that the parameter map has to be created after the transformer or estimator it refers to; otherwise it will not have any impact.

Suppose the order had been:
```
paramMapCombined[lr2.maxIter] = 30
lr2 = LogisticRegression(maxIter=10, regParam=0.01)
model3 = lr2.fit(training, paramMapCombined)
```

If `lr2` did not already exist, for example because no earlier cell had defined it in this Python session, the first statement would raise a `NameError`.

If an older `lr2` instance did exist, the map entry would be keyed on that older instance's parameter, so the newly created `lr2` would not get `maxIter` set to 30. You can verify this by inspecting the fitted model's parameter values with `extractParamMap()`.
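
One way to picture this behaviour: every parameter key carries the ID of the instance that owns it (visible as `parent=` in the `extractParamMap()` output above), so an entry keyed on an old instance does not match a new one. The toy classes below are hypothetical and only mimic what was observed; they are not PySpark's implementation.

```python
import uuid
from collections import namedtuple

# A parameter key is identified by its owner's ID plus its name
Param = namedtuple("Param", ["parent", "name"])


class ToyEstimator:
    def __init__(self, max_iter):
        self.uid = "ToyEstimator_" + uuid.uuid4().hex[:12]
        self.maxIter = Param(self.uid, "maxIter")
        self.values = {self.maxIter: max_iter}

    def resolve(self, param_map):
        """Own settings, overridden only by map entries that belong to this instance."""
        merged = dict(self.values)
        for param, value in param_map.items():
            if param.parent == self.uid:
                merged[param] = value
        return {p.name: v for p, v in merged.items()}


old = ToyEstimator(max_iter=10)
param_map = {old.maxIter: 30}    # keyed on the OLD instance
new = ToyEstimator(max_iter=10)  # created after the map

old_settings = old.resolve(param_map)
new_settings = new.resolve(param_map)
print(old_settings)  # {'maxIter': 30}
print(new_settings)  # {'maxIter': 10}
```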

In [None]:
prediction.show(truncate=False)

In [26]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

# Prepare training documents from a list of (id, text, label) tuples.
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0)
], ["id", "text", "label"])

# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# Fit the pipeline to training documents.
model = pipeline.fit(training)

# Prepare test documents, which are unlabeled (id, text) tuples.
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "spark hadoop spark"),
    (7, "apache hadoop")
], ["id", "text"])

# Make predictions on test documents and print columns of interest.
prediction = model.transform(test)
selected = prediction.select("id", "text", "probability", "prediction")
for row in selected.collect():
    rid, text, prob, prediction = row
    print("(%d, %s) --> prob=%s, prediction=%f" % (rid, text, str(prob), prediction))

(4, spark i j k) --> prob=[0.1596407738787475,0.8403592261212525], prediction=1.000000
(5, l m n) --> prob=[0.8378325685476744,0.16216743145232562], prediction=0.000000
(6, spark hadoop spark) --> prob=[0.06926633132976037,0.9307336686702395], prediction=1.000000
(7, apache hadoop) --> prob=[0.9821575333444218,0.01784246665557808], prediction=0.000000


In [27]:
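type(prediction)

float

##### `prediction` is a float here, not a DataFrame: the loop above unpacks each row into `rid, text, prob, prediction`, which rebinds the name to the last row's prediction value.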

In [28]:
output = model.transform(test)
output["id","text","words","features"].show(truncate=False)
output["rawPrediction","probability","prediction"].show(truncate=False)


+---+------------------+----------------------+------------------------------------------------------+
|id |text              |words                 |features                                              |
+---+------------------+----------------------+------------------------------------------------------+
|4  |spark i j k       |[spark, i, j, k]      |(262144,[20197,24417,227520,234657],[1.0,1.0,1.0,1.0])|
|5  |l m n             |[l, m, n]             |(262144,[18910,100743,213302],[1.0,1.0,1.0])          |
|6  |spark hadoop spark|[spark, hadoop, spark]|(262144,[155117,234657],[1.0,2.0])                    |
|7  |apache hadoop     |[apache, hadoop]      |(262144,[66695,155117],[1.0,1.0])                     |
+---+------------------+----------------------+------------------------------------------------------+

+----------------------------------------+----------------------------------------+----------+
|rawPrediction                           |probability                           

In [29]:
training.printSchema()

root
 |-- id: long (nullable = true)
 |-- text: string (nullable = true)
 |-- label: double (nullable = true)



In [30]:
training.show()

+---+----------------+-----+
| id|            text|label|
+---+----------------+-----+
|  0| a b c d e spark|  1.0|
|  1|             b d|  0.0|
|  2|     spark f g h|  1.0|
|  3|hadoop mapreduce|  0.0|
+---+----------------+-----+



In [31]:
test.show()

+---+------------------+
| id|              text|
+---+------------------+
|  4|       spark i j k|
|  5|             l m n|
|  6|spark hadoop spark|
|  7|     apache hadoop|
+---+------------------+



In [32]:
pipeline.getStages()

[Tokenizer_49734f2e7405,
 HashingTF_d085de47ddaf,
 LogisticRegression_71387618c0e9]

In [33]:
pipeline.extractParamMap()

{Param(parent='Pipeline_ed8c53fdb945', name='stages', doc='a list of pipeline stages'): [Tokenizer_49734f2e7405,
  HashingTF_d085de47ddaf,
  LogisticRegression_71387618c0e9]}

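* rawPrediction: the raw score for each class before it is converted to a probability; for logistic regression this is the margin (log-odds), not a probability
* probability: the conditional probability of each class, derived from rawPrediction
* prediction: the predicted label, i.e. the argmax over the class probabilities (subject to any threshold that has been set)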
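
For binary logistic regression the link between the three columns can be checked numerically: applying the logistic (sigmoid) function to each raw score gives the probabilities, and the index of the larger one is the prediction. A small sketch using the raw scores of the row with id 4 from this notebook's output, assuming the default threshold of 0.5:

```python
import math


def sigmoid(z):
    """Logistic function mapping a raw margin to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


raw_prediction = [-1.6609, 1.6609]  # rawPrediction for id=4, rounded as printed
probability = [sigmoid(z) for z in raw_prediction]
prediction = float(probability.index(max(probability)))  # argmax over classes

print([round(p, 4) for p in probability], prediction)
```

The probabilities should come out close to the `[0.1596, 0.8404]` shown for that row, with class 1.0 as the prediction.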

##### So what goes into the estimator ?

In [42]:
lr2.getFeaturesCol()

'features'

In [43]:
lr2.getLabelCol()

'label'

##### Does pipeline take custom transformers?

In [44]:
def dummy(df):
    return df
pipeline = Pipeline(stages=[tokenizer, hashingTF, dummy, lr])

In [45]:
model = pipeline.fit(training)

TypeError: Cannot recognize a pipeline stage of type <class 'function'>.

###### We need to extend the Transformer class as indicated below:

https://stackoverflow.com/a/32337101

In [46]:
from pyspark.mllib.evaluation import BinaryClassificationMetrics
#metrics = BinaryClassificationMetrics(output)


#####  How do we convert from dataframe to RDD ?

In [47]:
output_rdd = output.rdd

In [48]:
output_rdd.collect()

[Row(id=4, text='spark i j k', words=['spark', 'i', 'j', 'k'], features=SparseVector(262144, {20197: 1.0, 24417: 1.0, 227520: 1.0, 234657: 1.0}), rawPrediction=DenseVector([-1.6609, 1.6609]), probability=DenseVector([0.1596, 0.8404]), prediction=1.0),
 Row(id=5, text='l m n', words=['l', 'm', 'n'], features=SparseVector(262144, {18910: 1.0, 100743: 1.0, 213302: 1.0}), rawPrediction=DenseVector([1.6422, -1.6422]), probability=DenseVector([0.8378, 0.1622]), prediction=0.0),
 Row(id=6, text='spark hadoop spark', words=['spark', 'hadoop', 'spark'], features=SparseVector(262144, {155117: 1.0, 234657: 2.0}), rawPrediction=DenseVector([-2.598, 2.598]), probability=DenseVector([0.0693, 0.9307]), prediction=1.0),
 Row(id=7, text='apache hadoop', words=['apache', 'hadoop'], features=SparseVector(262144, {66695: 1.0, 155117: 1.0}), rawPrediction=DenseVector([4.0082, -4.0082]), probability=DenseVector([0.9822, 0.0178]), prediction=0.0)]

In [49]:
oo = output.foreach(lambda x : x[6])  # foreach() runs only for side effects and returns None, so oo is None

In [50]:
output.rdd.map(tuple).collect()

[(4,
  'spark i j k',
  ['spark', 'i', 'j', 'k'],
  SparseVector(262144, {20197: 1.0, 24417: 1.0, 227520: 1.0, 234657: 1.0}),
  DenseVector([-1.6609, 1.6609]),
  DenseVector([0.1596, 0.8404]),
  1.0),
 (5,
  'l m n',
  ['l', 'm', 'n'],
  SparseVector(262144, {18910: 1.0, 100743: 1.0, 213302: 1.0}),
  DenseVector([1.6422, -1.6422]),
  DenseVector([0.8378, 0.1622]),
  0.0),
 (6,
  'spark hadoop spark',
  ['spark', 'hadoop', 'spark'],
  SparseVector(262144, {155117: 1.0, 234657: 2.0}),
  DenseVector([-2.598, 2.598]),
  DenseVector([0.0693, 0.9307]),
  1.0),
 (7,
  'apache hadoop',
  ['apache', 'hadoop'],
  SparseVector(262144, {66695: 1.0, 155117: 1.0}),
  DenseVector([4.0082, -4.0082]),
  DenseVector([0.9822, 0.0178]),
  0.0)]

In [51]:
output.rdd.map(tuple).map(lambda x: x[6]).collect()

[1.0, 0.0, 1.0, 0.0]

#### How do we evaluate the model?

In [2]:
from pyspark.mllib.evaluation import BinaryClassificationMetrics

In [3]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression

# Prepare training data from a list of (label, features) tuples.
training= spark.createDataFrame([
    (1.0, Vectors.dense([0.0, 1.1, 0.1])),
    (0.0, Vectors.dense([2.0, 1.0, -1.0])),
    (0.0, Vectors.dense([2.0, 1.3, 1.0])),
    (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"])

# Create a LogisticRegression instance. This instance is an Estimator.
lr = LogisticRegression(maxIter=10, regParam=0.01)
# Print out the parameters, documentation, and any default values.
print("LogisticRegression parameters:\n" + lr.explainParams() + "\n")

# Learn a LogisticRegression model. This uses the parameters stored in lr.
model1 = lr.fit(training)

# Since model1 is a Model (i.e., a transformer produced by an Estimator),
# we can view the parameters it used during fit().
# This prints the parameter (name: value) pairs, where names are unique IDs for this
# LogisticRegression instance.
print("Model 1 was fit using parameters: ")
print(model1.extractParamMap())

# We may alternatively specify parameters using a Python dictionary as a paramMap
paramMap = {lr.maxIter: 20}
paramMap[lr.maxIter] = 30  # Specify 1 Param, overwriting the original maxIter.
paramMap.update({lr.regParam: 0.1, lr.threshold: 0.55})  # Specify multiple Params.

# You can combine paramMaps, which are python dictionaries.
paramMap2 = {lr.probabilityCol: "myProbability"}  # Change output column name
paramMapCombined = paramMap.copy()
paramMapCombined.update(paramMap2)

# Now learn a new model using the paramMapCombined parameters.
# paramMapCombined overrides all parameters set earlier via lr.set* methods.
model2 = lr.fit(training, paramMapCombined)
print("Model 2 was fit using parameters: ")
print(model2.extractParamMap())
print("\n\n\n")

# Prepare test data
test = spark.createDataFrame([
    (1.0, Vectors.dense([-1.0, 1.5, 1.3])),
    (0.0, Vectors.dense([3.0, 2.0, -0.1])),
    (1.0, Vectors.dense([0.0, 2.2, -1.5]))], ["label", "features"])

# Make predictions on test data using the Transformer.transform() method.
# LogisticRegression.transform will only use the 'features' column.
# Note that model2.transform() outputs a "myProbability" column instead of the usual
# 'probability' column since we renamed the lr.probabilityCol parameter previously.
prediction = model2.transform(test)

LogisticRegression parameters:
aggregationDepth: suggested depth for treeAggregate (>= 2). (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0)
family: The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial (default: auto)
featuresCol: features column name. (default: features)
fitIntercept: whether to fit an intercept term. (default: True)
labelCol: label column name. (default: label)
lowerBoundsOnCoefficients: The lower bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. (undefined)
lowerBoundsOnIntercepts: The lower bounds on intercepts if fitting under bound constrained optimization. The bou

In [4]:
prediction.rdd.map(tuple).map(lambda x: x[-1]).collect()

[1.0, 0.0, 1.0]

In [5]:
test.collect()

[Row(label=1.0, features=DenseVector([-1.0, 1.5, 1.3])),
 Row(label=0.0, features=DenseVector([3.0, 2.0, -0.1])),
 Row(label=1.0, features=DenseVector([0.0, 2.2, -1.5]))]

In [7]:
test.rdd.map(tuple).map(lambda x: x[0]).collect()

[1.0, 0.0, 1.0]

In [8]:
prediction.show()

+-----+--------------+--------------------+--------------------+----------+
|label|      features|       rawPrediction|       myProbability|prediction|
+-----+--------------+--------------------+--------------------+----------+
|  1.0|[-1.0,1.5,1.3]|[-2.8046569418746...|[0.05707304171034...|       1.0|
|  0.0|[3.0,2.0,-0.1]|[2.49587635664207...|[0.92385223117041...|       0.0|
|  1.0|[0.0,2.2,-1.5]|[-2.0935249027913...|[0.10972776114779...|       1.0|
+-----+--------------+--------------------+--------------------+----------+



In [19]:
# (prediction, label) pairs taken from the same rows, so the two values cannot get out of step
zi = [(float(r.prediction), float(r.label)) for r in prediction.select("prediction", "label").collect()]

In [22]:
predictionAndLabels = spark.sparkContext.parallelize(zi)

In [25]:
metrics = BinaryClassificationMetrics(predictionAndLabels)

# Area under precision-recall curve
print("Area under PR = %s" % metrics.areaUnderPR)

# Area under ROC curve
print("Area under ROC = %s" % metrics.areaUnderROC)

Area under PR = 1.0
Area under ROC = 1.0
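
A score of 1.0 on three test rows says very little. `BinaryClassificationMetrics` is documented as taking (score, label) pairs, whereas the cell above passes hard 0/1 predictions as the score. The sketch below computes ROC AUC in plain Python as the fraction of (positive, negative) pairs ranked correctly; the function name is made up, and the soft scores are the class-1 probabilities rounded from the `myProbability` column printed above.

```python
def roc_auc(score_and_labels):
    """AUC as the share of (positive, negative) pairs where the positive scores higher."""
    positives = [s for s, y in score_and_labels if y == 1.0]
    negatives = [s for s, y in score_and_labels if y == 0.0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


# (score, label) pairs for the three test rows
hard = [(1.0, 1.0), (0.0, 0.0), (1.0, 1.0)]        # hard predictions as scores
soft = [(0.943, 1.0), (0.076, 0.0), (0.890, 1.0)]  # class-1 probabilities (rounded)

auc_hard = roc_auc(hard)
auc_soft = roc_auc(soft)
print(auc_hard, auc_soft)  # 1.0 1.0
```

Both variants give 1.0 here because there are only two positive/negative pairs to rank; on a larger test set, passing probabilities rather than hard labels would usually give a more informative curve.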


##### Provided an answer to a Spark question

https://stackoverflow.com/a/56240742/8693106

##### Elephas :  http://maxpumperla.com/elephas/
Code: https://github.com/maxpumperla/elephas/blob/master/examples/ml_mlp.py


In [1]:
from __future__ import absolute_import
from __future__ import print_function

from keras.datasets import mnist
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.utils import np_utils
from keras import optimizers

from elephas.ml_model import ElephasEstimator
from elephas.ml.adapter import to_data_frame

from pyspark import SparkContext, SparkConf
from pyspark.mllib.evaluation import MulticlassMetrics
from pyspark.ml import Pipeline


# Define basic parameters
batch_size = 16
nb_classes = 10
epochs = 1




Using TensorFlow backend.




In [2]:
# Load data
(x_train, y_train), (x_test, y_test) = mnist.load_data()

print(x_test.shape, y_test.shape, x_train.shape, y_train.shape)

x_train = x_train.reshape(60000, 784)
x_test = x_test.reshape(10000, 784)

'''
train_size = 500
test_size = 500
x_train = x_train[:train_size,:]
y_train = y_train[:train_size]
x_test = x_test[:test_size,:]
y_test = y_test[:test_size]

'''

x_train = x_train.astype("float32")
x_test = x_test.astype("float32")
x_train /= 255
x_test /= 255
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# Convert class vectors to binary class matrices
y_train = np_utils.to_categorical(y_train, nb_classes)
y_test = np_utils.to_categorical(y_test, nb_classes)

model = Sequential()
model.add(Dense(128, input_dim=784))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(10))
model.add(Activation('softmax'))

# Create Spark context
try:
    sc.stop()
except:
    pass

conf = SparkConf().setAppName('Mnist_Spark_MLP').setMaster('local[3]').set("spark.executor.heartbeatInterval","3600s").\
set("spark.network.timeout","3601s").set("spark.executor.memory","1GB").set("spark.executor.pyspark.memory","2GB")
sc = SparkContext(conf=conf)

'''conf = ps.SparkConf().setMaster("yarn-client").setAppName("sparK-mer")
conf.set("spark.executor.heartbeatInterval","3600s")
sc = ps.SparkContext('local[4]', '', conf=conf) # uses 4 cores on your local machine'''

# Build RDD from numpy features and labels
df = to_data_frame(sc, x_train, y_train, categorical=True)
test_df = to_data_frame(sc, x_test, y_test, categorical=True)

sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
sgd_conf = optimizers.serialize(sgd)

# Initialize Spark ML Estimator
estimator = ElephasEstimator()
estimator.set_keras_model_config(model.to_yaml())
estimator.set_optimizer_config(sgd_conf)
estimator.set_mode("synchronous")
estimator.set_loss("categorical_crossentropy")
estimator.set_metrics(['acc'])
estimator.set_epochs(epochs)
estimator.set_batch_size(batch_size)
estimator.set_validation_split(0.1)
estimator.set_categorical_labels(True)
estimator.set_nb_classes(nb_classes)

# Fitting a model returns a Transformer
pipeline = Pipeline(stages=[estimator])
fitted_pipeline = pipeline.fit(df)


#model.compile(loss="categorical_crossentropy", optimizer=sgd_conf, metrics=["accuracy"])
#model.fit(x_train,y_train)

# Evaluate Spark model by evaluating the underlying model
prediction = fitted_pipeline.transform(test_df)
#prediction = model.predict(x_test)
pnl = prediction.select("label", "prediction")
pnl.show(100)

prediction_and_label = pnl.rdd.map(lambda row: (row.prediction, row.label))  # MulticlassMetrics expects (prediction, label) pairs
metrics = MulticlassMetrics(prediction_and_label)
print(metrics.precision())
print(metrics.recall())

(10000, 28, 28) (10000,) (60000, 28, 28) (60000,)
60000 train samples
10000 test samples
Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.


OSError: [WinError 145] The directory is not empty: 'C:\\Users\\PADMAR~1.BHA\\AppData\\Local\\Temp\\tmplbohrxr6'

##### Note: the tweaks below were made to the above program so that it runs on my laptop:
* reduce the number of training samples to 100
* handle socket timeouts: .set("spark.executor.heartbeatInterval","3600s").set("spark.network.timeout","3601s")
* use multiple cores: .setMaster('local[3]')

In [13]:
test_df.describe()

DataFrame[summary: string, label: string]

In [22]:
test_df.show()

+--------------------+-----+
|            features|label|
+--------------------+-----+
|[0.0,0.0,0.0,0.0,...|  7.0|
|[0.0,0.0,0.0,0.0,...|  2.0|
|[0.0,0.0,0.0,0.0,...|  1.0|
|[0.0,0.0,0.0,0.0,...|  0.0|
|[0.0,0.0,0.0,0.0,...|  4.0|
|[0.0,0.0,0.0,0.0,...|  1.0|
|[0.0,0.0,0.0,0.0,...|  4.0|
|[0.0,0.0,0.0,0.0,...|  9.0|
|[0.0,0.0,0.0,0.0,...|  5.0|
|[0.0,0.0,0.0,0.0,...|  9.0|
|[0.0,0.0,0.0,0.0,...|  0.0|
|[0.0,0.0,0.0,0.0,...|  6.0|
|[0.0,0.0,0.0,0.0,...|  9.0|
|[0.0,0.0,0.0,0.0,...|  0.0|
|[0.0,0.0,0.0,0.0,...|  1.0|
|[0.0,0.0,0.0,0.0,...|  5.0|
|[0.0,0.0,0.0,0.0,...|  9.0|
|[0.0,0.0,0.0,0.0,...|  7.0|
|[0.0,0.0,0.0,0.0,...|  3.0|
|[0.0,0.0,0.0,0.0,...|  4.0|
+--------------------+-----+
only showing top 20 rows



In [16]:
df.show()

+--------------------+-----+
|            features|label|
+--------------------+-----+
|[0.0,0.0,0.0,0.0,...|  5.0|
|[0.0,0.0,0.0,0.0,...|  0.0|
|[0.0,0.0,0.0,0.0,...|  4.0|
|[0.0,0.0,0.0,0.0,...|  1.0|
|[0.0,0.0,0.0,0.0,...|  9.0|
|[0.0,0.0,0.0,0.0,...|  2.0|
|[0.0,0.0,0.0,0.0,...|  1.0|
|[0.0,0.0,0.0,0.0,...|  3.0|
|[0.0,0.0,0.0,0.0,...|  1.0|
|[0.0,0.0,0.0,0.0,...|  4.0|
|[0.0,0.0,0.0,0.0,...|  3.0|
|[0.0,0.0,0.0,0.0,...|  5.0|
|[0.0,0.0,0.0,0.0,...|  3.0|
|[0.0,0.0,0.0,0.0,...|  6.0|
|[0.0,0.0,0.0,0.0,...|  1.0|
|[0.0,0.0,0.0,0.0,...|  7.0|
|[0.0,0.0,0.0,0.0,...|  2.0|
|[0.0,0.0,0.0,0.0,...|  8.0|
|[0.0,0.0,0.0,0.0,...|  6.0|
|[0.0,0.0,0.0,0.0,...|  9.0|
+--------------------+-----+
only showing top 20 rows



In [17]:
test_df.printSchema()

root
 |-- features: vector (nullable = true)
 |-- label: double (nullable = true)



In [18]:
df.printSchema()

root
 |-- features: vector (nullable = true)
 |-- label: double (nullable = true)



In [21]:
prediction[0]

array([0.07501657, 0.08139812, 0.09706777, 0.1315995 , 0.08939954,
       0.05362463, 0.08077153, 0.20469424, 0.08468577, 0.10174234],
      dtype=float32)

##### Update to my stackoverflow answer

https://stackoverflow.com/a/56240742/8693106

perhaps this also indicates the same:
https://databricks.com/blog/2015/07/30/diving-into-apache-spark-streamings-execution-model.html