In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

API Guide: https://spark.apache.org/docs/2.4.3/api/python/pyspark.ml.html#module-pyspark.ml.linalg

MLlib utilities for linear algebra.  

The examples below are adapted, with slight modifications, from the RDD-based section of the documentation:  
https://spark.apache.org/docs/latest/mllib-data-types.html#local-vector  
https://spark.apache.org/docs/latest/mllib-data-types.html#local-matrix



# Vectors
For dense vectors, MLlib uses the __NumPy array__ type, so you can simply pass NumPy arrays around.  
For sparse vectors, you can construct a SparseVector object from MLlib or pass SciPy __scipy.sparse__ column vectors if __SciPy__ is available in the environment.

Docs (RDD-section): https://spark.apache.org/docs/latest/mllib-data-types.html#local-vector  

> A local vector has integer-typed and 0-based indices and double-typed values, stored on a single machine. MLlib supports two types of local vectors: dense and sparse. A dense vector is backed by a double array representing its entry values, while a sparse vector is backed by two parallel arrays: indices and values. For example, a vector (1.0, 0.0, 3.0) can be represented in dense format as [1.0, 0.0, 3.0] or in sparse format as (3, [0, 2], [1.0, 3.0]), where 3 is the size of the vector.
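
To make the two parallel arrays concrete, here is a minimal sketch that expands the sparse triple `(size, indices, values)` from the quote into its dense form. It uses only NumPy rather than MLlib, and the helper name `sparse_to_dense` is hypothetical, introduced purely for illustration.

```python
import numpy as np


def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) triple into a dense NumPy array.

    Hypothetical helper for illustration; not part of pyspark.ml.linalg.
    """
    dense = np.zeros(size, dtype=np.float64)
    # indices and values are parallel arrays: values[k] is stored at indices[k].
    dense[np.asarray(indices, dtype=np.int64)] = values
    return dense


dense = sparse_to_dense(3, [0, 2], [1.0, 3.0])
print(dense)  # [1. 0. 3.]
```

Every position not listed in `indices` is implicitly zero, which is why the sparse form only pays off when most entries are zero.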

The following classes exist in `pyspark.ml.linalg` for Vectors:

`Vectors` (_main class to use_)  
> Factory methods for working with vectors.  
> Note: Dense vectors are simply represented as NumPy array objects, so there is no need to convert them for use in MLlib. For sparse vectors, the factory methods in this class create an MLlib-compatible type, or users can pass in SciPy's `scipy.sparse` column vectors.

- `DenseVector`  
	> A dense vector represented by a value array. A NumPy array is used for storage, and arithmetic is delegated to the underlying NumPy array.

- `SparseVector`  
	> A simple sparse vector class for passing data to MLlib. Users may alternatively pass SciPy's `scipy.sparse` data types.

`VectorUDT`  
> SQL user-defined type (UDT) for Vector.

`Vector`  
> Abstract class for DenseVector and SparseVector

## NumPy Array and SparseVector Example

In [17]:
import numpy as np
import scipy.sparse as sps
from pyspark.ml.linalg import Vectors

# Use a NumPy array as a dense vector.
dv1 = np.array([1.0, 0.0, 3.0])

# Create a SparseVector.
sv1 = Vectors.sparse(3, [0, 2], [1.0, 3.0])

print(dv1)
print(sv1)

[1. 0. 3.]
(3,[0,2],[1.0,3.0])


## Python List and SciPy Sparse Vector Example

In [18]:
# Use a Python list as a dense vector.
dv2 = [1.0, 0.0, 3.0]

# Use a single-column SciPy csc_matrix as a sparse vector.
# The tuple is (data, indices, indptr): values, their row indices, and column pointers.
sv2 = sps.csc_matrix(
    (np.array([1.0, 3.0]), np.array([0, 2]), np.array([0, 2])), shape=(3, 1)
)

print(dv2)
print(sv2)

[1.0, 0.0, 3.0]
  (0, 0)	1.0
  (2, 0)	3.0


# Matrices

Docs (RDD-section): https://spark.apache.org/docs/latest/mllib-data-types.html#local-matrix

> The base class of local matrices is _Matrix_, and we provide two implementations: _DenseMatrix_, and _SparseMatrix_. We recommend using the factory methods implemented in _Matrices_ to create local matrices. Remember, local matrices in MLlib are stored in column-major order.

The following classes exist in `pyspark.ml.linalg` for Matrices:

`Matrices` (_main class to use_)  
> Factory methods for working with matrices.

- `DenseMatrix`  
	> Column-major dense matrix.

- `SparseMatrix`  
	> Sparse Matrix stored in CSC format.

`MatrixUDT`  
> SQL user-defined type (UDT) for Matrix.

`Matrix`  
> Represents a local matrix.

## Dense Matrix

In [15]:
from pyspark.ml.linalg import Matrices

# Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)).
# The values are listed column by column (column-major order).
dm2 = Matrices.dense(3, 2, [1, 3, 5, 2, 4, 6])
print(dm2)

DenseMatrix([[1., 2.],
             [3., 4.],
             [5., 6.]])
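
The `Matrices.dense` call above takes its values in column-major order, which is easy to get wrong. As a rough illustration that uses only NumPy (not MLlib), the sketch below reshapes the same flat list in both orders; the variable names are made up for this example.

```python
import numpy as np

# The same flat values that were passed to Matrices.dense(3, 2, ...).
values = [1, 3, 5, 2, 4, 6]

# order="F" fills column by column (column-major), as MLlib local matrices do.
column_major = np.array(values, dtype=np.float64).reshape((3, 2), order="F")

# order="C" fills row by row (row-major), NumPy's default.
row_major = np.array(values, dtype=np.float64).reshape((3, 2), order="C")

print(column_major.tolist())  # [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(row_major.tolist())     # [[1.0, 3.0], [5.0, 2.0], [4.0, 6.0]]
```

Only the column-major reading reproduces the matrix printed above, so a row-major list has to be reordered before it is handed to `Matrices.dense`.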


## Sparse Matrix

In [16]:
# Create a sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0)) in CSC format.
# Arguments: numRows, numCols, colPtrs, rowIndices, values.
sm = Matrices.sparse(3, 2, [0, 1, 3], [0, 2, 1], [9, 6, 8])
print(sm)

3 X 2 CSCMatrix
(0,0) 9.0
(2,1) 6.0
(1,1) 8.0
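
To see how the three CSC arrays passed to `Matrices.sparse` map onto matrix entries, here is a minimal NumPy-only sketch that expands them into a dense array. The helper `csc_to_dense` is hypothetical and written for illustration; it is not an MLlib function.

```python
import numpy as np


def csc_to_dense(num_rows, num_cols, col_ptrs, row_indices, values):
    """Expand CSC arrays into a dense NumPy array.

    Hypothetical helper for illustration; not part of pyspark.ml.linalg.
    """
    dense = np.zeros((num_rows, num_cols), dtype=np.float64)
    for col in range(num_cols):
        # Entries of column `col` occupy positions col_ptrs[col] up to
        # (but not including) col_ptrs[col + 1] in row_indices and values.
        for k in range(col_ptrs[col], col_ptrs[col + 1]):
            dense[row_indices[k], col] = values[k]
    return dense


dense_sm = csc_to_dense(3, 2, [0, 1, 3], [0, 2, 1], [9, 6, 8])
print(dense_sm.tolist())  # [[9.0, 0.0], [0.0, 8.0], [0.0, 6.0]]
```

Column 0 owns the single entry at position 0 (row 0, value 9), and column 1 owns positions 1 and 2 (rows 2 and 1, values 6 and 8), which matches the printed output above.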
