#Machine Learning Library (MLlib)
MLlib is Spark's machine learning library. It provides common learning algorithms and utilities, including classification, regression, collaborative filtering, dimensionality reduction, and some optimization algorithms.

The API is divided into two parts:
1. `spark.mllib` as the main API
2. `spark.ml` as a higher-level API for workflows

##Data Types
MLlib uses vectors, labeled points (for classification) and matrices.

__Local Vector__: A vector of numerical values. A vector can be either dense (few zero entries, so every value is stored) or sparse (many zero entries, so only the nonzero values and their indices are stored).
- In the Python API, NumPy arrays and Python lists can be used as dense vectors.
- For sparse vectors, use MLlib's `SparseVector` or SciPy's `csc_matrix` with a single column.

In [13]:
from pyspark.mllib.linalg import Vectors
vec = Vectors.dense([1, 2, 3])
vec

DenseVector([1.0, 2.0, 3.0])

In [47]:
sparceVec = Vectors.sparse(3, {0:1.0, 2:3.0})
sparceVec

SparseVector(3, {0: 1.0, 2: 3.0})

These vectors can be indexed, and they support operations such as the dot product, the norm, and the squared distance to another vector.

In [49]:
vec[1]

2.0

In [51]:
sparceVec[1]

0.0

In [48]:
vec.dot(sparceVec)

10.0

In [80]:
vec.squared_distance(sparceVec)

4.0
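To make the sparse representation concrete, here is a plain-Python sketch of how these operations can be computed when one vector is stored only as its nonzero entries. It does not use MLlib, and MLlib's own implementation may differ. The helper names `sparse_get`, `dot`, `squared_distance`, and `norm2` are made up for illustration.

```python
import math

# A sparse vector kept as (size, {index: value}); missing indices are zero.
# This mirrors the arguments passed to Vectors.sparse above.
dense = [1.0, 2.0, 3.0]
sparse = (3, {0: 1.0, 2: 3.0})


def sparse_get(sv, i):
    """Return entry i of a sparse vector, defaulting to 0.0 (hypothetical helper)."""
    size, entries = sv
    if not 0 <= i < size:
        raise IndexError("index out of range")
    return entries.get(i, 0.0)


def dot(dv, sv):
    """Dot product: only the stored (nonzero) entries contribute."""
    _, entries = sv
    return sum(value * dv[i] for i, value in entries.items())


def squared_distance(dv, sv):
    """Sum of squared element-wise differences."""
    return sum((x - sparse_get(sv, i)) ** 2 for i, x in enumerate(dv))


def norm2(dv):
    """Euclidean (L2) norm of a dense vector."""
    return math.sqrt(sum(x * x for x in dv))


print(sparse_get(sparse, 1))            # 0.0
print(dot(dense, sparse))               # 10.0
print(squared_distance(dense, sparse))  # 4.0
```

The results agree with the MLlib outputs shown above. The dot product only touches the two stored entries, which is where a sparse layout saves work.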

__Labeled Point__: A labeled point is a vector associated with a label/response. The label is stored as a float, so it must be convertible to one.

- For classification, the labels should be integers going from 0 to k-1, where k is the number of classes. For binary classification, the labels should be either 0 or 1.
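If the raw data uses class names rather than numbers, the names first have to be mapped onto 0 to k-1. The following is a minimal plain-Python sketch of one way to do that. The class names and the variable names `raw_labels`, `label_index`, and `encoded` are hypothetical, and nothing here is part of MLlib.

```python
# Hypothetical raw class names; any hashable, sortable values would do.
raw_labels = ["cat", "dog", "bird", "dog", "cat"]

# Give each distinct class an index 0..k-1 (sorted so the mapping is stable).
classes = sorted(set(raw_labels))
label_index = {name: float(i) for i, name in enumerate(classes)}

# Replace every raw label with its float index.
encoded = [label_index[name] for name in raw_labels]

print(label_index)  # {'bird': 0.0, 'cat': 1.0, 'dog': 2.0}
print(encoded)      # [1.0, 2.0, 0.0, 2.0, 1.0]
```

Each value in `encoded` could then be passed as the first argument of `LabeledPoint`.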

In [29]:
from pyspark.mllib.regression import LabeledPoint
pos = LabeledPoint(1, vec)
pos

LabeledPoint(1.0, [1.0,2.0,3.0])

In [28]:
pos2 = LabeledPoint(2, sparceVec)
pos2

LabeledPoint(2.0, (3,[0,2],[1.0,3.0]))

A LabeledPoint has two attributes: `label` and `features`.

In [41]:
print("Label {0} with feature {1}"
      .format(pos.label, pos.features))

Label 1.0 with feature [1.0,2.0,3.0]


__Matrices__: Matrices can be stored on a single machine (local) or distributed across the cluster. There are several ways of storing a distributed matrix, each with its own advantages and disadvantages.
1. __Local Matrix__: A matrix with integer-typed row and column indices, stored on a single machine.
2. __RowMatrix__: A row-oriented distributed matrix without meaningful row indices. Each row is stored as a local vector, so the number of columns must not be too large.
3. __IndexedRowMatrix__: A row-oriented distributed matrix with meaningful row indices.
4. __CoordinateMatrix__: A distributed matrix where each entry is stored as a tuple (i, j, v), where i is the row index, j is the column index, and v is the entry value. This should only be used when both dimensions of the matrix are large and the matrix is very sparse.
5. __BlockMatrix__: A distributed matrix stored as blocks, where each MatrixBlock is a tuple ((i, j), matrix): i and j are the row and column indices of the block, and matrix is the submatrix at that position.
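The coordinate and block layouts are easier to picture on a tiny example. The sketch below is plain Python on a single machine, not MLlib code: it only shows what the (i, j, v) tuples and the ((i, j), matrix) blocks would contain for a small matrix. The matrix `A` and the 2x2 block size are arbitrary choices made for illustration.

```python
# A small 4x4 matrix, mostly zeros, written as a list of rows.
A = [
    [5.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 2.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 7.0, 0.0, 0.0],
]

# Coordinate layout: one (i, j, v) tuple per nonzero entry.
entries = [(i, j, v)
           for i, row in enumerate(A)
           for j, v in enumerate(row)
           if v != 0.0]

# Block layout: cut A into 2x2 submatrices keyed by block index (bi, bj).
block_size = 2  # arbitrary choice for this sketch
blocks = {}
for r in range(0, len(A), block_size):
    for c in range(0, len(A[0]), block_size):
        sub = [row[c:c + block_size] for row in A[r:r + block_size]]
        blocks[(r // block_size, c // block_size)] = sub

print(entries)         # [(0, 0, 5.0), (1, 3, 2.0), (3, 1, 7.0)]
print(blocks[(0, 1)])  # [[0.0, 0.0], [0.0, 2.0]]
```

The coordinate layout keeps 3 tuples instead of 16 values, which is why it suits very sparse matrices. The block layout keeps every value but groups them into submatrices.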

In [67]:
from pyspark.mllib.linalg import DenseMatrix
DenseMatrix?

In [85]:
from pyspark.mllib.linalg import Matrix
m = Matrix?

In [None]:
m = Matrix

In [None]:
m = Matrix.mro

In [69]:
from pyspark.mllib.linalg import SparseMatrix
SparseMatrix?

In [73]:
from pyspark.mllib.linalg import Matrices
Matrices.dense?

In [None]:
from pyspark.mllib.linalg import Matrices

In [71]:
k = DenseMatrix(2, 2, [1,2,3,4], isTransposed=False)
k

DenseMatrix(2, 2, [1.0, 2.0, 3.0, 4.0], False)
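The four values passed to `DenseMatrix` above are read in column-major order when `isTransposed` is False, and in row-major order when it is True. The plain-Python sketch below shows the index arithmetic behind that. The function `entry` is a hypothetical helper written for this note, not an MLlib call.

```python
def entry(num_rows, num_cols, values, i, j, is_transposed=False):
    """Look up entry (i, j) in a flat value list (hypothetical helper)."""
    if is_transposed:
        return values[i * num_cols + j]  # row-major layout
    return values[i + j * num_rows]      # column-major layout


values = [1, 2, 3, 4]

# Rebuild the 2x2 matrix row by row under each layout.
col_major = [[entry(2, 2, values, i, j) for j in range(2)] for i in range(2)]
row_major = [[entry(2, 2, values, i, j, True) for j in range(2)] for i in range(2)]

print(col_major)  # [[1, 3], [2, 4]]
print(row_major)  # [[1, 2], [3, 4]]
```

So with the default layout, `[1, 2, 3, 4]` fills the first column with 1 and 2 and the second column with 3 and 4.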

In [1]:
from pyspark.mllib.linalg import Vectors

In [14]:
Vectors.sparse?

In [78]:
import array  # standard-library module; no need to import it via pyspark
l = array.array("B", [1,2,3,4])
l.tolist?