### MLlib exercises

```python
from pyspark.sql import SparkSession

from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Correlation


spark = SparkSession.builder.appName("MLlib").getOrCreate()

data = [(Vectors.sparse(4, [(0, 1.0), (3, -2.0)]),),
        (Vectors.dense([4.0, 5.0, 0.0, 3.0]),),
        (Vectors.dense([6.0, 7.0, 0.0, 8.0]),),
        (Vectors.sparse(4, [(0, 9.0), (3, 1.0)]),)]
df = spark.createDataFrame(data, ["features"])
print(df.show())


r1 = Correlation.corr(df, "features").head()
print("Pearson correlation matrix:\n" + str(r1[0]))

r2 = Correlation.corr(df, "features", "spearman").head()
print("Spearman correlation matrix:\n" + str(r2[0]))
```

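* Basic correlation display: `Correlation.corr` computes the pairwise correlation matrix between the entries of the `features` vector column (Pearson by default, Spearman when `"spearman"` is passed).
* The input must be a single vector column, so the thing to understand first is `Vectors` (sparse and dense).
* The third feature is 0 in every row, so it has zero variance and its correlation with the other features is undefined. That is where the `nan` entries and the `WARN PearsonCorrelation` lines come from.
* The `None` in the output comes from `print(df.show())`: `show()` prints the table itself and returns `None`.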

```
+--------------------+
|            features|
+--------------------+
|(4,[0,3],[1.0,-2.0])|
|   [4.0,5.0,0.0,3.0]|
|   [6.0,7.0,0.0,8.0]|
| (4,[0,3],[9.0,1.0])|
+--------------------+

None
19/05/14 22:37:50 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
19/05/14 22:37:50 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
19/05/14 22:37:51 WARN PearsonCorrelation: Pearson correlation matrix contains NaN values.
Pearson correlation matrix:
DenseMatrix([[1.        , 0.05564149,        nan, 0.40047142],
             [0.05564149, 1.        ,        nan, 0.91359586],
             [       nan,        nan, 1.        ,        nan],
             [0.40047142, 0.91359586,        nan, 1.        ]])
19/05/14 22:37:58 WARN PearsonCorrelation: Pearson correlation matrix contains NaN values.
Spearman correlation matrix:
DenseMatrix([[1.        , 0.10540926,        nan, 0.4       ],
             [0.10540926, 1.        ,        nan, 0.9486833 ],
             [       nan,        nan, 1.        ,        nan],
             [0.4       , 0.9486833 ,        nan, 1.        ]])
```
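As a cross-check on the Pearson matrix above, here is a minimal plain-Python sketch of the coefficient for pairs of columns from the same four vectors. The `pearson` helper and the dense `rows` list are written for illustration and are not part of the Spark API. Returning NaN for a constant column is this sketch's own choice, made to mirror the `nan` entries Spark reports for the all-zero third feature.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Zero variance makes the denominator zero, so the value is undefined.
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


# The four feature vectors from the example, written out densely.
rows = [
    [1.0, 0.0, 0.0, -2.0],
    [4.0, 5.0, 0.0, 3.0],
    [6.0, 7.0, 0.0, 8.0],
    [9.0, 0.0, 0.0, 1.0],
]
cols = list(zip(*rows))  # transpose: one tuple per feature

print(pearson(cols[0], cols[1]))  # compare with entry [0][1] of the matrix
print(pearson(cols[1], cols[3]))  # compare with entry [1][3] of the matrix
print(pearson(cols[0], cols[2]))  # nan: feature 2 is constant
```

The first two values should agree with the corresponding matrix entries to the printed precision.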

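##### What are vectors?
* They are similar to NumPy arrays or Python lists, but an MLlib vector holds only floating-point numbers. A dense vector stores every value; a sparse vector stores the size plus the indices and values of the non-zero entries, e.g. `(4,[0,3],[1.0,-2.0])`. A NumPy array, by contrast, can also hold arbitrary objects when its dtype is `object`: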

In [2]:
import numpy as np

a = np.array(['a', 0, "abcd", [0,1,2,3]], dtype=object)  # recent NumPy versions require dtype=object for ragged input
a

array(['a', 0, 'abcd', list([0, 1, 2, 3])], dtype=object)

In [4]:
a[0] = 'd'
a

array(['d', 0, 'abcd', list([0, 1, 2, 3])], dtype=object)
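To make the sparse notation concrete, the sketch below expands a list of `(index, value)` pairs into the equivalent dense list. `sparse_to_dense` is a hypothetical helper written for these notes, not a Spark function. It only illustrates what `Vectors.sparse(4, [(0, 1.0), (3, -2.0)])` represents.

```python
def sparse_to_dense(size, pairs):
    """Expand (index, value) pairs into a full list of floats of length `size`."""
    dense = [0.0] * size  # every entry not listed in `pairs` is zero
    for index, value in pairs:
        dense[index] = float(value)
    return dense


print(sparse_to_dense(4, [(0, 1.0), (3, -2.0)]))  # [1.0, 0.0, 0.0, -2.0]
print(sparse_to_dense(4, [(0, 9.0), (3, 1.0)]))   # [9.0, 0.0, 0.0, 1.0]
```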

##### Chi-squared Test

* Chi-squared: pronounced "kai squared" and written χ². It is a hypothesis test that compares observed counts with expected counts using the statistic `SUM( (observed - expected)^2 / expected )`.
    * https://www.youtube.com/watch?v=1Ldl5Zfcm1Y
    * The scipy package returns a p-value; if that value is less than the chosen significance level (alpha), we can reject the null hypothesis.
    * The test applies only to categorical data:
        https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html
    * Code as in https://spark.apache.org/docs/latest/ml-statistics.html
        * What is the null hypothesis? In Spark's `ChiSquareTest` it is that each feature is statistically independent of the label.

##### Output:
pValues: [0.6872892787909721,0.6822703303362126]
 
degreesOfFreedom: [2, 3]

statistics: [0.75,1.5]
```
>>> r
Row(pValues=DenseVector([0.6873, 0.6823]), degreesOfFreedom=[2, 3], statistics=DenseVector([0.75, 1.5]))
```
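The `statistics` and `degreesOfFreedom` values above can be reproduced by hand. The following is a rough plain-Python sketch of Pearson's independence test applied to one feature at a time. The `chi_square_independence` helper is hypothetical, and the six sample rows are an assumption: they are taken to be the ones used in the linked Spark example, which these notes do not reproduce. P-values are left out because they need the chi-squared distribution function.

```python
from collections import Counter


def chi_square_independence(feature, label):
    """Return (statistic, degrees_of_freedom) for one feature against the label."""
    n = len(feature)
    observed = Counter(zip(feature, label))  # contingency-table cell counts
    feature_totals = Counter(feature)        # row totals
    label_totals = Counter(label)            # column totals
    statistic = 0.0
    for f, f_total in feature_totals.items():
        for l, l_total in label_totals.items():
            # Expected count if feature and label were independent.
            expected = f_total * l_total / n
            statistic += (observed[(f, l)] - expected) ** 2 / expected
    dof = (len(feature_totals) - 1) * (len(label_totals) - 1)
    return statistic, dof


# Assumed sample data (label, feature 0, feature 1), one entry per row.
labels = [0.0, 0.0, 1.0, 0.0, 0.0, 1.0]
feature_0 = [0.5, 1.5, 1.5, 3.5, 3.5, 3.5]
feature_1 = [10.0, 20.0, 30.0, 30.0, 40.0, 40.0]

for column in (feature_0, feature_1):
    stat, dof = chi_square_independence(column, labels)
    print(round(stat, 6), dof)
```

Under that assumption the two printed lines should match `statistics: [0.75,1.5]` and `degreesOfFreedom: [2, 3]`.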

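##### Summarizer

* `Summarizer` computes column-wise statistics such as mean and count over a vector column, optionally weighted per row. The transcript assumes a PySpark shell, where `sc` (the SparkContext) is predefined.

```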
>>> from pyspark.ml.stat import Summarizer
>>> from pyspark.sql import Row
>>> from pyspark.ml.linalg import Vectors
>>>
>>> df = sc.parallelize([Row(weight=1.0, features=Vectors.dense(1.0, 1.0, 1.0)),
...                      Row(weight=0.0, features=Vectors.dense(1.0, 2.0, 3.0))]).toDF()
>>>
>>> # create summarizer for multiple metrics "mean" and "count"
... summarizer = Summarizer.metrics("mean", "count")
>>>
>>> # compute statistics for multiple metrics with weight
... df.select(summarizer.summary(df.features, df.weight)).show(truncate=False)
+-----------------------------------+
|aggregate_metrics(features, weight)|
+-----------------------------------+
|[[1.0,1.0,1.0], 1]                 |
+-----------------------------------+

>>>
>>> # compute statistics for multiple metrics without weight
... df.select(summarizer.summary(df.features)).show(truncate=False)
+--------------------------------+
|aggregate_metrics(features, 1.0)|
+--------------------------------+
|[[1.0,1.5,2.0], 2]              |
+--------------------------------+

>>>
>>> # compute statistics for single metric "mean" with weight
... df.select(Summarizer.mean(df.features, df.weight)).show(truncate=False)
+--------------+
|mean(features)|
+--------------+
|[1.0,1.0,1.0] |
+--------------+

>>>
>>> # compute statistics for single metric "mean" without weight
... df.select(Summarizer.mean(df.features)).show(truncate=False)
+--------------+
|mean(features)|
+--------------+
|[1.0,1.5,2.0] |
+--------------+
```
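The weighted and unweighted means in the transcript above can be checked with a few lines of plain Python. This is only a sketch of the arithmetic: `weighted_mean` is a hypothetical helper, not how `Summarizer` is implemented.

```python
def weighted_mean(vectors, weights=None):
    """Column-wise mean of equal-length vectors; a missing weight defaults to 1.0."""
    if weights is None:
        weights = [1.0] * len(vectors)
    total_weight = sum(weights)
    dims = len(vectors[0])
    return [
        sum(w * v[i] for v, w in zip(vectors, weights)) / total_weight
        for i in range(dims)
    ]


features = [[1.0, 1.0, 1.0], [1.0, 2.0, 3.0]]
print(weighted_mean(features, [1.0, 0.0]))  # [1.0, 1.0, 1.0]
print(weighted_mean(features))              # [1.0, 1.5, 2.0]
```

With weights `[1.0, 0.0]` the second row contributes nothing, which is why the weighted mean equals the first row.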

In [6]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("dummy").getOrCreate()

In [7]:
df = spark.createDataFrame([(1,2,3),(4,5,6),(7,8,9)],["a","b","c"])
df.show()

+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|  3|
|  4|  5|  6|
|  7|  8|  9|
+---+---+---+



In [8]:
df.describe().show()

+-------+---+---+---+
|summary|  a|  b|  c|
+-------+---+---+---+
|  count|  3|  3|  3|
|   mean|4.0|5.0|6.0|
| stddev|3.0|3.0|3.0|
|    min|  1|  2|  3|
|    max|  7|  8|  9|
+-------+---+---+---+

