### MLlib exercises

```
from pyspark.sql import SparkSession

from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Correlation


spark = SparkSession.builder.appName("MLlib").getOrCreate()

data = [(Vectors.sparse(4, [(0, 1.0), (3, -2.0)]),),
        (Vectors.dense([4.0, 5.0, 0.0, 3.0]),),
        (Vectors.dense([6.0, 7.0, 0.0, 8.0]),),
        (Vectors.sparse(4, [(0, 9.0), (3, 1.0)]),)]
df = spark.createDataFrame(data, ["features"])
print(df.show())


r1 = Correlation.corr(df, "features").head()
print("Pearson correlation matrix:\n" + str(r1[0]))

r2 = Correlation.corr(df, "features", "spearman").head()
print("Spearman correlation matrix:\n" + str(r2[0]))
```

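* Basic correlation display
* The input has to be a column of vectors, so the focus here is on vectors
* `print(df.show())` prints `None` after the table because `show()` prints the rows itself and returns `None`; calling `df.show()` on its own is enough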

```
+--------------------+
|            features|
+--------------------+
|(4,[0,3],[1.0,-2.0])|
|   [4.0,5.0,0.0,3.0]|
|   [6.0,7.0,0.0,8.0]|
| (4,[0,3],[9.0,1.0])|
+--------------------+

None
[Stage 4:>                                                          (0 + 4) / 4]19/05/14 22:37:50 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
19/05/14 22:37:50 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
19/05/14 22:37:51 WARN PearsonCorrelation: Pearson correlation matrix contains NaN values.
Pearson correlation matrix:
DenseMatrix([[1.        , 0.05564149,        nan, 0.40047142],
             [0.05564149, 1.        ,        nan, 0.91359586],
             [       nan,        nan, 1.        ,        nan],
             [0.40047142, 0.91359586,        nan, 1.        ]])
19/05/14 22:37:58 WARN PearsonCorrelation: Pearson correlation matrix contains NaN values.
Spearman correlation matrix:
DenseMatrix([[1.        , 0.10540926,        nan, 0.4       ],
             [0.10540926, 1.        ,        nan, 0.9486833 ],
             [       nan,        nan, 1.        ,        nan],
             [0.4       , 0.9486833 ,        nan, 1.        ]])
```
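
The `nan` entries in both matrices sit in the row and column of the third feature, which is 0.0 in every input row. A constant column has zero variance, so the correlation coefficient is 0/0. The sketch below recomputes two Pearson entries in plain Python to show this; the `pearson` helper is a hypothetical illustration, not part of the Spark API, and the rows are the four vectors above written out densely.

```python
import math

# The four input vectors from the example, written out densely.
rows = [
    [1.0, 0.0, 0.0, -2.0],
    [4.0, 5.0, 0.0, 3.0],
    [6.0, 7.0, 0.0, 8.0],
    [9.0, 0.0, 0.0, 1.0],
]


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences (illustrative helper)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(var_x * var_y)
    # Zero variance in either column makes the coefficient undefined.
    return cov / denom if denom else float("nan")


columns = list(zip(*rows))
r01 = pearson(columns[0], columns[1])  # features 0 and 1
r02 = pearson(columns[0], columns[2])  # feature 2 is constant
print(r01, r02)
```

The first value agrees with the 0.05564149 entry in the Pearson matrix, and the second is `nan`, as in the third column of the output.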

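##### What are vectors?
* Spark ML vectors (`pyspark.ml.linalg.Vectors`) are one-dimensional arrays of doubles, stored either densely or sparsely (size, indices, values). They are loosely comparable to NumPy arrays or Python lists, although those can also hold arbitrary objects, as in the example below.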

In [None]:
import numpy as np

a = np.array(['a', 0, "abcd", [0, 1, 2, 3]], dtype=object)  # dtype=object is required for mixed, ragged contents in recent NumPy
a

In [None]:
a[0] = 'd'
a
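
The sparse notation in the `show()` output above, such as `(4,[0,3],[1.0,-2.0])`, reads as (size, indices, values). A minimal sketch of how such a triple expands into a dense vector follows; `sparse_to_dense` is a hypothetical helper written for illustration, not a PySpark function.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) triple into a dense list of floats."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = float(value)
    return dense


first_row = sparse_to_dense(4, [0, 3], [1.0, -2.0])
last_row = sparse_to_dense(4, [0, 3], [9.0, 1.0])
print(first_row)  # [1.0, 0.0, 0.0, -2.0]
print(last_row)   # [9.0, 0.0, 0.0, 1.0]
```

Every position that is not listed in the indices is an implicit 0.0, which is why the sparse form is compact for mostly-zero data.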

##### Chi-squared Test

* Chi-squared (pronounced "kai squared", written χ²) is a hypothesis test that compares observed counts with expected counts using the statistic χ² = Σ (observed - expected)² / expected
    * https://www.youtube.com/watch?v=1Ldl5Zfcm1Y
    * The scipy package returns a p-value; if that value is less than the chosen significance level (alpha), we can reject the null hypothesis.
    * The test applies only to categorical data.
        https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html
    * The code is as in https://spark.apache.org/docs/latest/ml-statistics.html
        * What is the null hypothesis? That the occurrence of the outcomes is statistically independent.
##### Output:
pValues: [0.6872892787909721,0.6822703303362126]
 
degreesOfFreedom: [2, 3]

statistics: [0.75,1.5]
```
>>> r
Row(pValues=DenseVector([0.6873, 0.6823]), degreesOfFreedom=[2, 3], statistics=DenseVector([0.75, 1.5]))
```
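
The chi-squared statistic described above can be reproduced by hand. The sketch below uses a small contingency table that is an assumption chosen for illustration (rows are feature categories, columns are labels). Expected counts come from the row and column totals under independence, and the p-value uses the closed form that holds only for 2 degrees of freedom.

```python
import math

# Hypothetical contingency table: rows are feature categories, columns are labels.
observed = [
    [1, 0],
    [1, 1],
    [2, 1],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

statistic = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected count if rows and columns were independent.
        expected = row_totals[i] * col_totals[j] / total
        statistic += (obs - expected) ** 2 / expected

dof = (len(observed) - 1) * (len(observed[0]) - 1)
# For exactly 2 degrees of freedom the chi-squared survival function is exp(-x / 2).
p_value = math.exp(-statistic / 2)
print(statistic, dof, p_value)
```

For this table the statistic is 0.75 with 2 degrees of freedom and a p-value of about 0.687, the same figures as the first feature in the output above. With a p-value that large, the null hypothesis of independence would not be rejected at the usual alpha of 0.05.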

```
>>> from pyspark.ml.stat import Summarizer
>>> from pyspark.sql import Row
>>> from pyspark.ml.linalg import Vectors
>>>
>>> df = sc.parallelize([Row(weight=1.0, features=Vectors.dense(1.0, 1.0, 1.0)),
...                      Row(weight=0.0, features=Vectors.dense(1.0, 2.0, 3.0))]).toDF()
>>>
>>> # create summarizer for multiple metrics "mean" and "count"
... summarizer = Summarizer.metrics("mean", "count")
>>>
>>> # compute statistics for multiple metrics with weight
... df.select(summarizer.summary(df.features, df.weight)).show(truncate=False)
+-----------------------------------+
|aggregate_metrics(features, weight)|
+-----------------------------------+
|[[1.0,1.0,1.0], 1]                 |
+-----------------------------------+

>>>
>>> # compute statistics for multiple metrics without weight
... df.select(summarizer.summary(df.features)).show(truncate=False)
+--------------------------------+
|aggregate_metrics(features, 1.0)|
+--------------------------------+
|[[1.0,1.5,2.0], 2]              |
+--------------------------------+

>>>
>>> # compute statistics for single metric "mean" with weight
... df.select(Summarizer.mean(df.features, df.weight)).show(truncate=False)
+--------------+
|mean(features)|
+--------------+
|[1.0,1.0,1.0] |
+--------------+

>>>
>>> # compute statistics for single metric "mean" without weight
... df.select(Summarizer.mean(df.features)).show(truncate=False)
+--------------+
|mean(features)|
+--------------+
|[1.0,1.5,2.0] |
+--------------+
```
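
The difference between the weighted and unweighted results above follows from the weighted-mean formula, sum(w * x) / sum(w). A plain-Python sketch on the same two rows is given below; `weighted_mean` is a hypothetical helper, not the Summarizer implementation.

```python
# (weight, features) pairs taken from the Summarizer example.
rows = [
    (1.0, [1.0, 1.0, 1.0]),
    (0.0, [1.0, 2.0, 3.0]),
]


def weighted_mean(rows, use_weights=True):
    """Column-wise mean; each row counts with its weight, or with 1.0 if unweighted."""
    total_weight = 0.0
    sums = [0.0] * len(rows[0][1])
    for weight, features in rows:
        w = weight if use_weights else 1.0
        total_weight += w
        for i, x in enumerate(features):
            sums[i] += w * x
    return [s / total_weight for s in sums]


with_weights = weighted_mean(rows)
without_weights = weighted_mean(rows, use_weights=False)
print(with_weights)     # [1.0, 1.0, 1.0]
print(without_weights)  # [1.0, 1.5, 2.0]
```

The second row has weight 0.0, so it contributes nothing to the weighted mean, which matches the `[1.0,1.0,1.0]` result above.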

##### Comparing with describe function

In [3]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("dummy").getOrCreate()

In [4]:
df = spark.createDataFrame([(1,2,3),(4,5,6),(7,8,9)],["a","b","c"])
df.show()

+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|  3|
|  4|  5|  6|
|  7|  8|  9|
+---+---+---+



In [5]:
df.describe().show()

+-------+---+---+---+
|summary|  a|  b|  c|
+-------+---+---+---+
|  count|  3|  3|  3|
|   mean|4.0|5.0|6.0|
| stddev|3.0|3.0|3.0|
|    min|  1|  2|  3|
|    max|  7|  8|  9|
+-------+---+---+---+
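
The `stddev` row reports the sample standard deviation (dividing by n - 1), not the population one. A quick check with the standard library on column `a` shows which of the two gives 3.0:

```python
import statistics

column_a = [1, 4, 7]

sample_std = statistics.stdev(column_a)       # divides by n - 1
population_std = statistics.pstdev(column_a)  # divides by n

print(sample_std)                # 3.0
print(round(population_std, 3))  # 2.449
```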



##### Reading a folder of csv

In [6]:
df_new = spark.read.csv(r"file:///C:\Users\padmaraj.bhat\OneDrive - Accenture\Git\GitHub\Real-Time-Analytics-on-Hadoop-master\New folder",sep="\t")
df_new.show(200000, False)

+------------------+----------------+
|_c0               |_c1             |
+------------------+----------------+
|brush             |12-05-2019 07:06|
|pray              |13-05-2019 07:06|
|drink water       |14-05-2019 07:06|
|yoga              |15-05-2019 07:06|
|news paper reading|16-05-2019 07:06|
|bath              |17-05-2019 07:06|
|pray              |18-05-2019 07:06|
|breakfast         |19-05-2019 07:06|
|market            |20-05-2019 07:06|
|lunch             |21-05-2019 07:06|
|parlor            |22-05-2019 07:06|
|attend function   |23-05-2019 07:06|
|evening snacks    |24-05-2019 07:06|
|dinner            |25-05-2019 07:06|
|fruits            |26-05-2019 07:06|
|brush             |12-05-2019 07:06|
|pray              |13-05-2019 07:06|
|drink water       |14-05-2019 07:06|
|yoga              |15-05-2019 07:06|
|news paper reading|16-05-2019 07:06|
|bath              |17-05-2019 07:06|
|pray              |18-05-2019 07:06|
|breakfast         |19-05-2019 07:06|
|market     

In [7]:
df_new.count()

45

In [8]:
df_new.describe().show()

+-------+---------------+----------------+
|summary|            _c0|             _c1|
+-------+---------------+----------------+
|  count|             45|              45|
|   mean|           null|            null|
| stddev|           null|            null|
|    min|attend function|12-05-2019 07:06|
|    max|           yoga|26-05-2019 07:06|
+-------+---------------+----------------+
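
Both columns were read as strings, so `mean` and `stddev` are null and `min`/`max` compare text, not dates. With day-first timestamps this only happens to give the right answer while all rows fall in the same month. The sketch below shows the difference in plain Python; the third timestamp is a hypothetical value added for illustration and is not in the data.

```python
from datetime import datetime

# First two values appear in the data; the last one is hypothetical.
stamps = ["12-05-2019 07:06", "26-05-2019 07:06", "01-06-2019 07:06"]
FORMAT = "%d-%m-%Y %H:%M"  # day-month-year, as in the file

max_as_text = max(stamps)
max_as_date = max(stamps, key=lambda s: datetime.strptime(s, FORMAT))

print(max_as_text)  # 26-05-2019 07:06
print(max_as_date)  # 01-06-2019 07:06
```

Parsing the column into a timestamp type before aggregating avoids this kind of mismatch.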



In [9]:
#df_image = spark.read.format("image").path(r"file:///C:\Users\padmaraj.bhat\OneDrive - Accenture\TheBot")
df_image = spark.read.format("image").load(r"file:///C:\Users\padmaraj.bhat\Desktop\dummy pics")
df_image.show()

Py4JJavaError: An error occurred while calling o54.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 11.0 failed 1 times, most recent failure: Lost task 0.0 in stage 11.0 (TID 21, localhost, executor driver): java.io.FileNotFoundException: File file:/C:/Users/padmaraj.bhat/Desktop/dummy%20pics/desktop%20-%20Copy%20-%20Copy%20-%20Copy.png does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:403)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:409)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Unknown Source)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:365)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
	at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3383)
	at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2544)
	at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2544)
	at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3364)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3363)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2544)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:2758)
	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:254)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:291)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.lang.reflect.Method.invoke(Unknown Source)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Unknown Source)


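##### Spark could not read this local folder. The path in the error is URL-encoded (`dummy%20pics`), which suggests the spaces in the folder and file names may be the cause rather than the local file system; this was not verified here, and reading from HDFS or S3 was not tried.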
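
As a small aside on where the `%20` in the error message comes from: percent-encoding replaces each space in a path with `%20`. The standard-library sketch below reproduces the encoded form of the folder and file name. It only illustrates the encoding and does not confirm what Spark does internally.

```python
from urllib.parse import quote, unquote

relative_path = "dummy pics/desktop - Copy - Copy - Copy.png"

encoded = quote(relative_path)  # spaces become %20, "/" is left alone
decoded = unquote(encoded)      # decoding restores the original name

print(encoded)
print(decoded == relative_path)
```

If the encoded name is looked up on disk without being decoded first, no such file exists, which would be consistent with the `FileNotFoundException` above.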


## Creating an RDD and Converting to DF

In [10]:
from pyspark.sql.types import StructType
from pyspark.sql.types import StructField
from pyspark.sql.types import StringType

schema = StructType([StructField(str(i), StringType(), True) for i in range(3)])



rdd = spark.sparkContext.parallelize(["1","2","34"])

rdd.collect()

['1', '2', '34']

In [11]:
rdd

ParallelCollectionRDD[43] at parallelize at PythonRDD.scala:195

In [12]:
spark.createDataFrame(rdd)

TypeError: Can not infer schema for type: <class 'str'>

##### As indicated in the link below, when the schema is inferred, createDataFrame needs each record to be a list, tuple, or Row rather than a bare value such as a string.
https://stackoverflow.com/a/32742294

In [14]:
from pyspark.sql import Row
row = Row("val")

In [15]:
rdd.map(lambda x: int(x)).map(row).collect()

[Row(val=1), Row(val=2), Row(val=34)]

In [16]:
df1 = rdd.map(lambda x: int(x)).map(row).toDF()
df2 = rdd.map(row).toDF()

df1.show()
df2.show()

+---+
|val|
+---+
|  1|
|  2|
| 34|
+---+

+---+
|val|
+---+
|  1|
|  2|
| 34|
+---+



In [17]:
df1[df1.val == "1"].show()

+---+
|val|
+---+
|  1|
+---+



In [18]:
df2[df2.val == 1].show()

+---+
|val|
+---+
|  1|
+---+



In [19]:
df1[df1.val == 1].show()

+---+
|val|
+---+
|  1|
+---+



In [20]:
df2[df2.val == "1"].show()

+---+
|val|
+---+
|  1|
+---+



In [21]:
df1.printSchema()

root
 |-- val: long (nullable = true)



In [22]:
df2.printSchema()

root
 |-- val: string (nullable = true)



In [23]:
df2[df2.val == 34].printSchema()

root
 |-- val: string (nullable = true)



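##### This is somewhat risky: filtering with a string or an integer literal returns the same rows. Spark appears to cast one side of the comparison implicitly so that both have a common type; the column's stored type does not change, as the printSchema() output above shows.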

### Create multi feature RDD and then DF

In [24]:
rdd = spark.sparkContext.parallelize([(1,"1"),(2,"2"),(3,"3"),(4,"4")])
rdd.collect()

[(1, '1'), (2, '2'), (3, '3'), (4, '4')]

In [25]:
df3 = rdd.map(lambda x: Row(feat1=x[0], feat2=x[1])).toDF()
df3.show()

+-----+-----+
|feat1|feat2|
+-----+-----+
|    1|    1|
|    2|    2|
|    3|    3|
|    4|    4|
+-----+-----+



In [26]:
rdd = spark.sparkContext.parallelize([(1,"One"),(2,"Two"),(3,"Three"),(4,"Four")])
rdd.collect()

[(1, 'One'), (2, 'Two'), (3, 'Three'), (4, 'Four')]

In [27]:
df4 = rdd.map(lambda x: Row(feat1=x[0], feat2=x[1])).toDF()
df4.show()

+-----+-----+
|feat1|feat2|
+-----+-----+
|    1|  One|
|    2|  Two|
|    3|Three|
|    4| Four|
+-----+-----+



#### Pandas Merge vs Spark Join

In [28]:
df3.join(df4, on="feat1").show()

+-----+-----+-----+
|feat1|feat2|feat2|
+-----+-----+-----+
|    1|    1|  One|
|    3|    3|Three|
|    2|    2|  Two|
|    4|    4| Four|
+-----+-----+-----+
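
For comparison, a minimal pandas sketch of the same join is shown below, assuming pandas is installed; the frames mirror `df3` and `df4`. The Spark result above keeps two columns that are both named `feat2`, whereas pandas renames the clashing columns with suffixes.

```python
import pandas as pd

left = pd.DataFrame({"feat1": [1, 2, 3, 4], "feat2": ["1", "2", "3", "4"]})
right = pd.DataFrame({"feat1": [1, 2, 3, 4], "feat2": ["One", "Two", "Three", "Four"]})

# Inner join on feat1; overlapping non-key columns get _x and _y suffixes.
merged = left.merge(right, on="feat1")
print(list(merged.columns))  # ['feat1', 'feat2_x', 'feat2_y']
print(len(merged))           # 4
```

In Spark, renaming one of the `feat2` columns before the join avoids ambiguous column references later.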



#### Pandas concat vs Spark Union on DF: 

In [29]:
df3.union(df4).show()

+-----+-----+
|feat1|feat2|
+-----+-----+
|    1|    1|
|    2|    2|
|    3|    3|
|    4|    4|
|    1|  One|
|    2|  Two|
|    3|Three|
|    4| Four|
+-----+-----+
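
One difference worth keeping in mind: Spark's `union` matches columns by position, while `pandas.concat` aligns them by name. The pandas sketch below (assuming pandas is installed, with shortened versions of the two frames) puts the second frame's columns in a different order to show the alignment by name.

```python
import pandas as pd

top = pd.DataFrame({"feat1": [1, 2], "feat2": ["1", "2"]})
# Same columns as `top`, deliberately listed in a different order.
bottom = pd.DataFrame({"feat2": ["One", "Two"], "feat1": [1, 2]})

stacked = pd.concat([top, bottom], ignore_index=True)
print(stacked["feat1"].tolist())  # [1, 2, 1, 2]
print(stacked["feat2"].tolist())  # ['1', '2', 'One', 'Two']
```

With a positional `union` in Spark, the same reordering would put values into the wrong columns; `unionByName` is the name-based alternative.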



#### Pandas concat vs Spark RDD Union First and then to DF: 

In [30]:
rdd.union(rdd).map(lambda x: Row(feat1=x[0], feat2=x[1])).toDF().show()

+-----+-----+
|feat1|feat2|
+-----+-----+
|    1|  One|
|    2|  Two|
|    3|Three|
|    4| Four|
|    1|  One|
|    2|  Two|
|    3|Three|
|    4| Four|
+-----+-----+



### Pipeline

* Transformer: transforms one DataFrame into another, for example text into a bag of words (BoW).
* Estimator: abstracts the learning step; `fit()` takes a DataFrame and produces a model, which is itself a Transformer.
* Pipeline: a sequence of Transformers and Estimators, ordered according to the application's needs.
* `pipeline.fit()`: runs the stages in order, calling `transform()` on each Transformer and `fit()` on each Estimator; each fitted model then transforms the data for the stages that follow.
* DAG (Directed Acyclic Graph): a non-linear pipeline. The stages of a DAG are specified and run in topological order.
* Runtime type checking: column types are checked at runtime through the DataFrame schema.
* Unique pipeline stages: each stage must have a unique ID, so the same instance must not appear twice. If the application needs the same transformation at two different places in a DAG or linear pipeline, a separate instance has to be created for each place.
* Parameters:
    * can be set when the stage is created, like initializing hyperparameter values when building a model,
    * or through a ParamMap, which maps parameters to values. The advantage is that one map can hold parameters for more than one stage, each entry tied to a specific instance.
    
* A pipeline or a model can be saved for future use. Persisted models and pipelines are *usually* backward compatible and can be used across languages (except R).
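
The fit/transform flow described in the list above can be sketched with a toy model in plain Python. This is not Spark's implementation: all class names are hypothetical, and lists of strings stand in for DataFrames.

```python
class Lowercase:
    """Toy transformer: maps a list of strings to a list of strings."""

    def transform(self, rows):
        return [row.lower() for row in rows]


class BagOfWords:
    """Toy fitted model: a transformer that counts vocabulary words."""

    def __init__(self, vocabulary):
        self.vocabulary = vocabulary

    def transform(self, rows):
        return [[row.split().count(word) for word in self.vocabulary] for row in rows]


class VocabularyEstimator:
    """Toy estimator: fit() learns a vocabulary and returns a transformer."""

    def fit(self, rows):
        vocabulary = sorted({word for row in rows for word in row.split()})
        return BagOfWords(vocabulary)


class ToyPipelineModel:
    """Result of fitting: every stage is now a transformer."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)   # estimator becomes a fitted transformer
            rows = stage.transform(rows)  # output feeds the next stage
            fitted.append(stage)
        return ToyPipelineModel(fitted)


toy_model = ToyPipeline([Lowercase(), VocabularyEstimator()]).fit(["Spark a b", "hadoop b"])
print(toy_model.stages[1].vocabulary)          # ['a', 'b', 'hadoop', 'spark']
print(toy_model.transform(["SPARK spark b"]))  # [[0, 1, 0, 2]]
```

The point of the sketch is the asymmetry: fitting turns each estimator into a transformer, so the fitted pipeline consists of transformers only.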

In [31]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression

# Prepare training data from a list of (label, features) tuples.
training = spark.createDataFrame([
    (1.0, Vectors.dense([0.0, 1.1, 0.1])),
    (0.0, Vectors.dense([2.0, 1.0, -1.0])),
    (0.0, Vectors.dense([2.0, 1.3, 1.0])),
    (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"])

# Create a LogisticRegression instance. This instance is an Estimator.
lr = LogisticRegression(maxIter=10, regParam=0.01)
# Print out the parameters, documentation, and any default values.
print("LogisticRegression parameters:\n" + lr.explainParams() + "\n")

# Learn a LogisticRegression model. This uses the parameters stored in lr.
model1 = lr.fit(training)

# Since model1 is a Model (i.e., a transformer produced by an Estimator),
# we can view the parameters it used during fit().
# This prints the parameter (name: value) pairs, where names are unique IDs for this
# LogisticRegression instance.
print("Model 1 was fit using parameters: ")
print(model1.extractParamMap())

# We may alternatively specify parameters using a Python dictionary as a paramMap
paramMap = {lr.maxIter: 20}
paramMap[lr.maxIter] = 30  # Specify 1 Param, overwriting the original maxIter.
paramMap.update({lr.regParam: 0.1, lr.threshold: 0.55})  # Specify multiple Params.

# You can combine paramMaps, which are python dictionaries.
paramMap2 = {lr.probabilityCol: "myProbability"}  # Change output column name
paramMapCombined = paramMap.copy()
paramMapCombined.update(paramMap2)

# Now learn a new model using the paramMapCombined parameters.
# paramMapCombined overrides all parameters set earlier via lr.set* methods.
model2 = lr.fit(training, paramMapCombined)
print("Model 2 was fit using parameters: ")
print(model2.extractParamMap())
print("\n\n\n")

# Prepare test data
test = spark.createDataFrame([
    (1.0, Vectors.dense([-1.0, 1.5, 1.3])),
    (0.0, Vectors.dense([3.0, 2.0, -0.1])),
    (1.0, Vectors.dense([0.0, 2.2, -1.5]))], ["label", "features"])

# Make predictions on test data using the Transformer.transform() method.
# LogisticRegression.transform will only use the 'features' column.
# Note that model2.transform() outputs a "myProbability" column instead of the usual
# 'probability' column since we renamed the lr.probabilityCol parameter previously.
prediction = model2.transform(test)
result = prediction.select("features", "label", "myProbability", "prediction") \
    .collect()

for row in result:
    print("features=%s, label=%s -> prob=%s, prediction=%s"
          % (row.features, row.label, row.myProbability, row.prediction))

LogisticRegression parameters:
aggregationDepth: suggested depth for treeAggregate (>= 2). (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0)
family: The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial (default: auto)
featuresCol: features column name. (default: features)
fitIntercept: whether to fit an intercept term. (default: True)
labelCol: label column name. (default: label)
lowerBoundsOnCoefficients: The lower bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. (undefined)
lowerBoundsOnIntercepts: The lower bounds on intercepts if fitting under bound constrained optimization. The bou

In [32]:
type(paramMapCombined)

dict

In [33]:
print(lr)

LogisticRegression_438e07df8a57


##### Example of two mutually exclusive sets of parameters targeting different models.

In [46]:
lr2 = LogisticRegression(maxIter=10, regParam=0.01)
paramMapCombined[lr2.maxIter] = 30
model3 = lr2.fit(training, paramMapCombined)

In [47]:
print(lr2)

LogisticRegression_1456abf02fee


In [48]:
print(model3.extractParamMap())

{Param(parent='LogisticRegression_1456abf02fee', name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2)'): 2, Param(parent='LogisticRegression_1456abf02fee', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty'): 0.0, Param(parent='LogisticRegression_1456abf02fee', name='family', doc='The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial.'): 'auto', Param(parent='LogisticRegression_1456abf02fee', name='featuresCol', doc='features column name'): 'features', Param(parent='LogisticRegression_1456abf02fee', name='fitIntercept', doc='whether to fit an intercept term'): True, Param(parent='LogisticRegression_1456abf02fee', name='labelCol', doc='label column name'): 'label', Param(parent='LogisticRegression_1456abf02fee', name='maxIter', doc='maximum number of iterations (>= 0)'): 30, Pa

##### It is important to note that the parameter map has to be built after the transformer or estimator it refers to has been created; otherwise it will not have any impact.


If the order had been
```
paramMapCombined[lr2.maxIter] = 30
lr2 = LogisticRegression(maxIter=10, regParam=0.01)
model3 = lr2.fit(training, paramMapCombined)
```

and no `lr2` existed yet in the Python session (for example, because no earlier cell had created it), the first statement would raise an error.

However, if an older `lr2` instance was still in memory, the map entry would be keyed by that older instance, and the newly created `lr2` would not get `maxIter` = 30. You can double-check this by inspecting the parameter values, for example with `model3.extractParamMap()`.
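
The behaviour described above can be pictured with a toy model in which every parameter key carries the unique ID of the instance that owns it. This is a simplified sketch with hypothetical class names, not PySpark's implementation.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyParam:
    parent: str  # unique ID of the owning instance
    name: str


class ToyEstimator:
    def __init__(self, max_iter):
        self.uid = "ToyEstimator_" + uuid.uuid4().hex[:12]
        self.maxIter = ToyParam(self.uid, "maxIter")
        self._values = {self.maxIter: max_iter}

    def resolve(self, param_map):
        """Apply only the map entries whose key belongs to this instance."""
        merged = dict(self._values)
        for param, value in param_map.items():
            if param.parent == self.uid:
                merged[param] = value
        return {param.name: value for param, value in merged.items()}


old_lr2 = ToyEstimator(max_iter=10)
param_map = {old_lr2.maxIter: 30}    # keyed by the old instance
new_lr2 = ToyEstimator(max_iter=10)  # same settings, but a new unique ID

print(old_lr2.resolve(param_map))  # {'maxIter': 30}
print(new_lr2.resolve(param_map))  # {'maxIter': 10}
```

Because the key belongs to the old instance, the new one keeps its constructor value. The same reasoning applies to the different IDs printed for `lr` and `lr2` above.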

In [57]:
prediction.show(truncate=False)

+-----+--------------+----------------------------------------+----------------------------------------+----------+
|label|features      |rawPrediction                           |myProbability                           |prediction|
+-----+--------------+----------------------------------------+----------------------------------------+----------+
|1.0  |[-1.0,1.5,1.3]|[-2.8046569418746343,2.8046569418746343]|[0.05707304171034065,0.9429269582896593]|1.0       |
|0.0  |[3.0,2.0,-0.1]|[2.4958763566420807,-2.4958763566420807]|[0.923852231170412,0.07614776882958803] |0.0       |
|1.0  |[0.0,2.2,-1.5]|[-2.0935249027913496,2.0935249027913496]|[0.10972776114779761,0.8902722388522024]|1.0       |
+-----+--------------+----------------------------------------+----------------------------------------+----------+



In [58]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

# Prepare training documents from a list of (id, text, label) tuples.
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0)
], ["id", "text", "label"])

# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# Fit the pipeline to training documents.
model = pipeline.fit(training)

# Prepare test documents, which are unlabeled (id, text) tuples.
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "spark hadoop spark"),
    (7, "apache hadoop")
], ["id", "text"])

# Make predictions on test documents and print columns of interest.
prediction = model.transform(test)
selected = prediction.select("id", "text", "probability", "prediction")
for row in selected.collect():
    rid, text, prob, prediction = row
    print("(%d, %s) --> prob=%s, prediction=%f" % (rid, text, str(prob), prediction))

(4, spark i j k) --> prob=[0.15964077387874742,0.8403592261212527], prediction=1.000000
(5, l m n) --> prob=[0.8378325685476743,0.16216743145232568], prediction=0.000000
(6, spark hadoop spark) --> prob=[0.06926633132976034,0.9307336686702395], prediction=1.000000
(7, apache hadoop) --> prob=[0.9821575333444218,0.01784246665557808], prediction=0.000000


In [61]:
type(prediction)

float

##### `prediction` is a float here because the loop in the previous cell unpacked each row into a variable of the same name, overwriting the DataFrame; that is why the next cell calls `model.transform(test)` again.

In [71]:
output = model.transform(test)
output["id","text","words","features"].show(truncate=False)
output["rawPrediction","probability","prediction"].show(truncate=False)


+---+------------------+----------------------+------------------------------------------------------+
|id |text              |words                 |features                                              |
+---+------------------+----------------------+------------------------------------------------------+
|4  |spark i j k       |[spark, i, j, k]      |(262144,[20197,24417,227520,234657],[1.0,1.0,1.0,1.0])|
|5  |l m n             |[l, m, n]             |(262144,[18910,100743,213302],[1.0,1.0,1.0])          |
|6  |spark hadoop spark|[spark, hadoop, spark]|(262144,[155117,234657],[1.0,2.0])                    |
|7  |apache hadoop     |[apache, hadoop]      |(262144,[66695,155117],[1.0,1.0])                     |
+---+------------------+----------------------+------------------------------------------------------+

+----------------------------------------+----------------------------------------+----------+
|rawPrediction                           |probability                           

In [72]:
training.printSchema()

root
 |-- id: long (nullable = true)
 |-- text: string (nullable = true)
 |-- label: double (nullable = true)



In [73]:
training.show()

+---+----------------+-----+
| id|            text|label|
+---+----------------+-----+
|  0| a b c d e spark|  1.0|
|  1|             b d|  0.0|
|  2|     spark f g h|  1.0|
|  3|hadoop mapreduce|  0.0|
+---+----------------+-----+



In [74]:
test.show()

+---+------------------+
| id|              text|
+---+------------------+
|  4|       spark i j k|
|  5|             l m n|
|  6|spark hadoop spark|
|  7|     apache hadoop|
+---+------------------+



In [80]:
pipeline.getStages()

[Tokenizer_3e3bdbcfa2f2,
 HashingTF_33a56e64c45e,
 LogisticRegression_5803e20e44f7]

In [83]:
pipeline.extractParamMap()

{Param(parent='Pipeline_d13a6b9c18cb', name='stages', doc='a list of pipeline stages'): [Tokenizer_3e3bdbcfa2f2,
  HashingTF_33a56e64c45e,
  LogisticRegression_5803e20e44f7]}

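* rawPrediction: the raw score for each class before it is converted to a probability (for binary logistic regression, the margin and its negation)
* probability: the conditional probability of each class, computed from rawPrediction
* prediction: the predicted label, i.e. the class with the highest probability (argmax), subject to any threshold that was set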
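
For binary logistic regression, the link between these three columns can be checked by hand. The sketch below applies the logistic (sigmoid) function to the positive-class raw scores from the `prediction.show()` output of `model2` earlier in these notes, then applies the 0.55 threshold that was set in its param map. The helper name is hypothetical.

```python
import math


def sigmoid(margin):
    """Logistic function: maps a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-margin))


# Second component of rawPrediction for the three test rows of model2.
raw_margins = [2.8046569418746343, -2.4958763566420807, 2.0935249027913496]
threshold = 0.55  # value put into the param map for model2

probabilities = [sigmoid(m) for m in raw_margins]
predictions = [1.0 if p > threshold else 0.0 for p in probabilities]

for margin, p, label in zip(raw_margins, probabilities, predictions):
    print("raw=%.4f -> probability=%.4f -> prediction=%.1f" % (margin, p, label))
```

The computed probabilities agree with the second component of `myProbability` in that table, and the predictions are 1.0, 0.0 and 1.0.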

##### So what goes into the estimator?

In [87]:
lr2.getFeaturesCol()

'features'

In [88]:
lr2.getLabelCol()

'label'

##### Does pipeline take custom transformers?

In [89]:
def dummy(df):
    return df
pipeline = Pipeline(stages=[tokenizer, hashingTF, dummy, lr])

In [91]:
model = pipeline.fit(training)

TypeError: Cannot recognize a pipeline stage of type <class 'function'>.

###### We need to extend the Transformer class as indicated below:

https://stackoverflow.com/a/32337101