In [1]:
%matplotlib inline
import matplotlib
import seaborn as sns
matplotlib.rcParams['savefig.dpi'] = 2 * matplotlib.rcParams['savefig.dpi']

# Spark MLlib
*Official documentation [here](https://spark.apache.org/docs/latest/mllib-guide.html).*

In [2]:
from pyspark import SparkContext
sc = SparkContext("local[*]", "temp")
print sc.version  # should be >= 1.5.1 for distributed matrices

1.5.1


In [3]:
# needed to convert RDDs into DataFrames
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

## MLlib Data Types
### Vector
A mathematical vector. MLlib supports both dense vectors, where every entry is stored, and sparse vectors, where only the nonzero entries are stored to save space. Vectors can be constructed with the `Vectors` factory class in `pyspark.mllib.linalg`; NumPy arrays and Python lists are also accepted as dense vectors.

In [4]:
import numpy as np
import scipy.sparse as sps
from pyspark.mllib.linalg import Vectors, SparseVector

# Create a dense vector (1.0, 0.0, 3.0) from a NumPy array.
dv1 = np.array([1.0, 0.0, 3.0])

# Create a dense vector (1.0, 0.0, 3.0) from a Python list.
dv2 = [1.0, 0.0, 3.0]

# Create a SparseVector (1.0, 0.0, 3.0) by specifying its indices and values corresponding to nonzero entries.
sv1 = Vectors.sparse(3, [0, 2], [1.0, 3.0])

# Create a sparse vector (1.0, 0.0, 3.0) by specifying its nonzero entries.
sv2 = Vectors.sparse(3, [(0, 1.0), (2, 3.0)])

print sv1
print sv2

(3,[0,2],[1.0,3.0])
(3,[0,2],[1.0,3.0])
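
As a rough illustration of what the printed form `(3,[0,2],[1.0,3.0])` encodes, the following plain-Python sketch expands a (size, indices, values) triple into the dense list it stands for. The helper `sparse_to_dense` is hypothetical and not part of MLlib.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse description into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v  # every position not listed in `indices` stays 0.0
    return dense

dense = sparse_to_dense(3, [0, 2], [1.0, 3.0])
print(dense)
```

Only two of the three entries are stored in the sparse form; the saving grows with the share of zeros in the vector.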


### LabeledPoint
A labeled data point for supervised learning algorithms such as classification and regression. It consists of a feature vector and a label (a floating-point value). Located in the `pyspark.mllib.regression` module.

In [5]:
from pyspark.mllib.regression import LabeledPoint

point1 = LabeledPoint(1.0, np.array([1.0, 0.0, 3.0]))

point2 = LabeledPoint(-1.0, SparseVector(3, [0, 2], [1.0, 3.0]))

print point1.label, point2.label
print "========"
print point1.features, point2.features

1.0 -1.0
========
[1.0,0.0,3.0] (3,[0,2],[1.0,3.0])


### Local matrix
A local matrix has *integer*-typed row and column indices and double-typed values, and is stored on a single machine.

In [6]:
from pyspark.mllib.linalg import Matrices

# Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)).
# Values are given in column-major order: the first column, then the second.
dm = Matrices.dense(3, 2, [1, 3, 5, 2, 4, 6])

# Create a sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0))
sm = Matrices.sparse(3, 2, [0, 1, 3], [0, 2, 1], [9, 6, 8])
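
The argument layout of these constructors is easy to misread, so here is a plain-Python sketch of how the flat arrays map to matrix entries. It assumes dense values are laid out in column-major order and that the sparse arguments are compressed-sparse-column (CSC) arrays: column pointers, row indices, and values. Both helper functions are hypothetical and not part of MLlib.

```python
def dense_from_column_major(num_rows, num_cols, values):
    """Lay out a flat column-major list as a list of rows."""
    return [[float(values[c * num_rows + r]) for c in range(num_cols)]
            for r in range(num_rows)]

def dense_from_csc(num_rows, num_cols, col_ptrs, row_indices, values):
    """Expand compressed-sparse-column arrays into a list of rows."""
    rows = [[0.0] * num_cols for _ in range(num_rows)]
    for c in range(num_cols):
        # Entries col_ptrs[c] .. col_ptrs[c + 1] - 1 belong to column c.
        for k in range(col_ptrs[c], col_ptrs[c + 1]):
            rows[row_indices[k]][c] = float(values[k])
    return rows

dm_rows = dense_from_column_major(3, 2, [1, 3, 5, 2, 4, 6])
sm_rows = dense_from_csc(3, 2, [0, 1, 3], [0, 2, 1], [9, 6, 8])
print(dm_rows)
print(sm_rows)
```

In the sparse case, the column pointers `[0, 1, 3]` say that column 0 owns one stored entry and column 1 owns the next two.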

### Distributed matrix 
A distributed matrix has *long*-typed row and column indices and double-typed values, and is stored across one or more RDDs. The examples below cover four formats:

- **RowMatrix:** Row-oriented distributed matrix without meaningful row indices, e.g. a collection of feature vectors. It is backed by an RDD of its rows, where each row is a local vector. We assume that the number of columns is not huge for a RowMatrix so that a single local vector can be reasonably communicated to the driver and can also be stored / operated on using a single node. 

In [7]:
from pyspark.mllib.linalg.distributed import RowMatrix

# Create an RDD of vectors.
rows = sc.parallelize([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]])

# Create a RowMatrix from an RDD of vectors.
rowMatrix = RowMatrix(rows)

# return size
m = rowMatrix.numRows()
n = rowMatrix.numCols() 

print "rows: %d\ncols: %d" % (m, n)

rows: 3
cols: 2


- **IndexedRowMatrix:** Similar to a RowMatrix but with meaningful row indices. It is backed by an RDD of indexed rows, so that each row is represented by its index (long-typed) and a local vector.

In [8]:
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

indexedRows = sc.parallelize([IndexedRow(0, [1.0, 2.0]),
                              IndexedRow(1, [2.0, 3.0]), 
                              IndexedRow(3, [4.0, 5.0])])

indexedRowMatrix = IndexedRowMatrix(indexedRows)

# return size
m = indexedRowMatrix.numRows()
n = indexedRowMatrix.numCols() 

print "rows: %d\ncols: %d" % (m, n)

rows: 4
cols: 2
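
The reported row count is 4 even though only three rows were supplied. That is consistent with the row count being derived from the largest row index plus one, with any missing index treated as a row of zeros. A plain-Python sketch under that assumption, using tuples as stand-ins for `IndexedRow`:

```python
# (index, vector) pairs standing in for IndexedRow objects
indexed_rows = [(0, [1.0, 2.0]), (1, [2.0, 3.0]), (3, [4.0, 5.0])]

num_rows = max(index for index, _ in indexed_rows) + 1
num_cols = len(indexed_rows[0][1])

# Materialize the implied matrix; row 2 was never supplied, so it stays all zeros.
dense = [[0.0] * num_cols for _ in range(num_rows)]
for index, vector in indexed_rows:
    dense[index] = list(vector)

print("rows: %d, cols: %d" % (num_rows, num_cols))
```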


- **CoordinateMatrix:** A distributed sparse matrix backed by an RDD of its entries, where each entry is essentially a `(Long, Long, Double)` tuple of row index, column index, and value.

In [9]:
from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

matrixEntries = sc.parallelize([MatrixEntry(0, 0, 1.),
                                MatrixEntry(1, 1, 1.),
                                MatrixEntry(2, 2, 1.)])

coordinateMatrix = CoordinateMatrix(matrixEntries)

# return size
m = coordinateMatrix.numRows()
n = coordinateMatrix.numCols() 

print "rows: %d\ncols: %d" % (m, n)

rows: 3
cols: 3


- **BlockMatrix:** A distributed matrix backed by an RDD of MatrixBlocks, where a MatrixBlock is a tuple of `((Int, Int), Matrix)`: the `(Int, Int)` is the index of the block, and the `Matrix` is the sub-matrix at that index, with size `rowsPerBlock` x `colsPerBlock`.

In [10]:
from pyspark.mllib.linalg.distributed import BlockMatrix

# Use a lowercase name so the imported BlockMatrix class is not shadowed.
blockMatrix = coordinateMatrix.toBlockMatrix()

Note that because the matrix is stored in a distributed way, converting between matrix formats is expensive: it may require a global shuffle of the data.
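
To make the block layout concrete, the following plain-Python sketch groups coordinate entries into blocks keyed by their block index. The function `to_blocks` is hypothetical, and the 2 x 2 block size is chosen only to keep the example small; this is not how Spark implements the conversion internally.

```python
from collections import defaultdict

def to_blocks(entries, rows_per_block, cols_per_block):
    """Group (i, j, value) entries into blocks keyed by (block_row, block_col)."""
    blocks = defaultdict(dict)
    for i, j, value in entries:
        key = (i // rows_per_block, j // cols_per_block)
        # Store each value at its offset inside the block.
        blocks[key][(i % rows_per_block, j % cols_per_block)] = value
    return dict(blocks)

# The same 3 x 3 identity entries used for the CoordinateMatrix example
entries = [(0, 0, 1.0), (1, 1, 1.0), (2, 2, 1.0)]
blocks = to_blocks(entries, 2, 2)
print(sorted(blocks.keys()))
```

Every entry has to be routed to the block that owns it, which hints at why such conversions move a lot of data when the entries live on different machines.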

### Rating
A rating of a product by a user, used in the `mllib.recommendation` package for product recommendation.
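
A rating can be thought of as a (user, product, rating) triple. The sketch below uses a `namedtuple` as a stand-in for the MLlib class, assuming it carries those three fields, and computes a mean rating per product in plain Python. The user and product IDs are made up for illustration.

```python
from collections import namedtuple

# Stand-in for the MLlib Rating type; the field names are an assumption.
Rating = namedtuple("Rating", ["user", "product", "rating"])

ratings = [Rating(1, 10, 5.0), Rating(2, 10, 3.0), Rating(1, 20, 4.0)]

# Accumulate (count, total) per product, then divide.
totals = {}
for r in ratings:
    count, total = totals.get(r.product, (0, 0.0))
    totals[r.product] = (count + 1, total + r.rating)

mean_rating = dict((p, total / count) for p, (count, total) in totals.items())
print(mean_rating)
```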

## Machine-learning in MLlib

Spark supports a number of machine-learning algorithms.

- Classification and Regression
    - Linear SVM and logistic regression (classification)
    - Linear regression
    - Naive Bayes
    - Decision Trees
    - Random Forests and Gradient-Boosted Trees
- Clustering
    - K-means (and streaming K-means)
    - Gaussian Mixture Models
    - Latent Dirichlet Allocation
- Dimensionality Reduction
    - SVD and PCA
- Lower-level optimization primitives
    - Stochastic Gradient Descent
    - Limited-memory BFGS (L-BFGS)

In [11]:
from pyspark.mllib.regression import LinearRegressionWithSGD, LinearRegressionModel
from pyspark.mllib.evaluation import RegressionMetrics
import random

# parameters
TRAINING_ITERATIONS = 10
TRAINING_FRACTION = 0.6

# generate the data
data = sc.parallelize(xrange(1,10001)) \
    .map(lambda x: LabeledPoint(random.random(), [random.random(), random.random(), random.random()]))

# split the training and test sets
splits = data.randomSplit([TRAINING_FRACTION, 1 - TRAINING_FRACTION], seed=42)
training, test = (splits[0].cache(), splits[1])

# train the model
model = LinearRegressionWithSGD.train(training, TRAINING_ITERATIONS)

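# compute R^2 on the test set; the labels are random noise independent of the
# features, so a score at or below zero is expected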
predictions = test.map(lambda x: (float(model.predict(x.features)), x.label))
print RegressionMetrics(predictions).r2

-0.302090044981
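
The negative score is less alarming than it looks. Assuming the standard definition of the coefficient of determination, R^2 = 1 - SS_res / SS_tot, a model that always predicts the mean label scores exactly 0, and anything worse than that goes negative. The plain-Python sketch below uses a hypothetical `r2_score` helper and made-up labels to show the three cases.

```python
def r2_score(pairs):
    """Coefficient of determination for (prediction, observation) pairs."""
    observations = [obs for _, obs in pairs]
    mean_obs = sum(observations) / float(len(observations))
    ss_res = sum((obs - pred) ** 2 for pred, obs in pairs)
    ss_tot = sum((obs - mean_obs) ** 2 for obs in observations)
    return 1.0 - ss_res / ss_tot

labels = [0.1, 0.4, 0.5, 0.9]
mean_label = sum(labels) / float(len(labels))

perfect = r2_score([(y, y) for y in labels])            # predictions equal the labels
baseline = r2_score([(mean_label, y) for y in labels])  # always predict the mean
poor = r2_score([(1.5, y) for y in labels])             # constant, far from the mean

print(perfect, baseline, poor)
```

Because the labels in the cell above are random and unrelated to the features, no model can be expected to beat the mean predictor on held-out data.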


## Spark ML
Spark ML implements the ideas of transformers, estimators, and pipelines by standardizing APIs across machine learning algorithms. This can streamline more complex workflows.

The core functionality includes:
* DataFrames - built off Spark SQL, can be created either directly or from RDDs as seen above
* Transformers - algorithms that accept a DataFrame as input and return a DataFrame as output
* Estimators - algorithms that accept a DataFrame as input and return a Transformer as output
* Pipelines - chaining together Transformers and Estimators
* Parameters - common API for specifying hyperparameters

For example, a learning algorithm can be implemented as an Estimator which trains on a DataFrame of features and returns a Transformer which can output predictions based on a test DataFrame.

Full documentation can be found [here](http://spark.apache.org/docs/latest/ml-guide.html).
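
The relationship between these pieces can be sketched without Spark at all. In the toy version below, a list of dicts stands in for a DataFrame, and every class name (`Lowercase`, `KeywordEstimator`, `KeywordModel`, `ToyPipeline`, `ToyPipelineModel`) is hypothetical. It only illustrates the contract: a Transformer has `transform`, an Estimator has `fit` and returns a Transformer, and a pipeline fits its stages in order.

```python
class Lowercase(object):
    """Transformer: adds a lowercased copy of the text column."""
    def transform(self, rows):
        return [dict(row, lower=row["text"].lower()) for row in rows]

class KeywordModel(object):
    """Transformer returned by fitting: flags rows containing a learned keyword."""
    def __init__(self, keywords):
        self.keywords = keywords
    def transform(self, rows):
        return [dict(row, prediction=float(any(w in self.keywords
                                               for w in row["lower"].split())))
                for row in rows]

class KeywordEstimator(object):
    """Estimator: learns the words that occur only in positively labeled rows."""
    def fit(self, rows):
        pos = set(w for r in rows if r["label"] == 1.0 for w in r["lower"].split())
        neg = set(w for r in rows if r["label"] == 0.0 for w in r["lower"].split())
        return KeywordModel(pos - neg)

class ToyPipelineModel(object):
    """Transformer made of already-fitted stages."""
    def __init__(self, stages):
        self.stages = stages
    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

class ToyPipeline(object):
    """Estimator that fits each stage in order on the output of the previous one."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # Estimator -> Transformer
            rows = stage.transform(rows)
            fitted.append(stage)
        return ToyPipelineModel(fitted)

train = [{"text": "Great overview", "label": 1.0},
         {"text": "Wordy and lacking details", "label": 0.0}]
model = ToyPipeline([Lowercase(), KeywordEstimator()]).fit(train)
output = model.transform([{"text": "A GREAT book"}, {"text": "Too wordy"}])
predictions = [row["prediction"] for row in output]
print(predictions)
```

The fitted pipeline is itself a Transformer, which is why the same object can be applied to unseen rows that have no label column.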

In [12]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

reviews = [("Prose is well-written, but style is an impediment to learning. Should be called 'Reviewing Spark,' not 'Learning Spark'", 0.0),
            ("Nice Headstart to Spark", 1.0),
            ("Start here: Excellent reference for Spark", 1.0),
            ("Insightful and so Spark-tastic!", 1.0),
            ("Good intro but wordy and lacking details in areas", 0.0),
            ("Best of the Books Currently Available", 1.0),
            ("A good resource for people interested in learning Spark", 1.0),
            ("Great Overview", 1.0)]

test_reviews = [("A decent guided tour of Spark and its major components.", 0.0),
                ("10/10 would buy again", 1.0),
                ("it is simple to follow. well organized. straight ...", 1.0),
                ("Just what you need to get started in Apache Spark.", 1.0),
                ("Very good book for learning Spark", 1.0)]

training = sqlContext.createDataFrame(reviews, ["title", "label"])
test = sqlContext.createDataFrame(test_reviews, ["title", "label"])

# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and logistic regression.
tokenizer = Tokenizer(inputCol="title", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
logreg = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, logreg])

model = pipeline.fit(training)

# Make predictions on test documents
prediction = model.transform(test)
selected = prediction.select("title", "label", "prediction")
for row in selected.collect():
    print(row)

Row(title=u'A decent guided tour of Spark and its major components.', label=0.0, prediction=1.0)
Row(title=u'10/10 would buy again', label=1.0, prediction=1.0)
Row(title=u'it is simple to follow. well organized. straight ...', label=1.0, prediction=1.0)
Row(title=u'Just what you need to get started in Apache Spark.', label=1.0, prediction=1.0)
Row(title=u'Very good book for learning Spark', label=1.0, prediction=1.0)
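
The first two stages of this pipeline turn each title into a fixed-length numeric feature vector. A plain-Python sketch of the idea follows, assuming the tokenizer lowercases and splits on whitespace and that term frequencies are bucketed by hashing each word modulo the number of features. The functions `tokenize` and `hashing_tf` are hypothetical, and `zlib.crc32` is used only because it is deterministic; it is not the hash function Spark uses, so the bucket indices will differ from Spark's.

```python
import zlib

def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()

def hashing_tf(words, num_features):
    """Count words per hash bucket; returns {bucket_index: count}."""
    counts = {}
    for word in words:
        # Deterministic stand-in hash, masked to an unsigned 32-bit value.
        index = (zlib.crc32(word.encode("utf-8")) & 0xffffffff) % num_features
        counts[index] = counts.get(index, 0.0) + 1.0
    return counts

words = tokenize("Nice Headstart to Spark")
features = hashing_tf(words, 16)
print(words)
print(features)
```

With so few buckets, distinct words can collide in the same index; a larger feature count makes collisions less likely at the cost of a wider vector.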


### Exercise: Use SVM to predict colon cancer from gene expressions
You can start getting a feel for the MLlib operations by following the [Spark docs example](https://spark.apache.org/docs/1.3.0/mllib-linear-methods.html#linear-support-vector-machines-svms) on this dataset.

#### About the data format: LibSVM
MLlib conveniently provides a data loading method, `MLUtils.loadLibSVMFile()`, for the LibSVM format; many other languages (R, Matlab, etc.) also have loaders for it.  
A dataset of *n* features has one row per datum, with the label followed by the value of each feature, organized as follows:
>{label} 1:{value} 2:{value} ... n:{value}

Take these two datapoints with six features and labels of -1 and 1 respectively as an example:
>-1.000000  1:2.080750 2:1.099070 3:0.927763 4:1.029080 5:-0.130763 6:1.265460  
1.000000  1:1.109460 2:0.786453 3:0.445560 4:-0.146323 5:-0.996316 6:0.555759 
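
To make the format concrete, here is a plain-Python sketch that parses one such line into a label and a feature mapping. The function `parse_libsvm_line` is hypothetical and not part of MLlib. It assumes the indices in the file are 1-based and shifts them to 0-based positions, which is the convention a loader would need in order to build feature vectors.

```python
def parse_libsvm_line(line):
    """Parse '{label} i:{value} ...' into (label, {zero_based_index: value})."""
    tokens = line.split()
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":")
        features[int(index) - 1] = float(value)  # shift 1-based index to 0-based
    return label, features

line = "-1.000000  1:2.080750 2:1.099070 3:0.927763 4:1.029080 5:-0.130763 6:1.265460"
label, features = parse_libsvm_line(line)
print(label)
print(sorted(features.items()))
```

Because each value carries its own index, features that are absent from a row can simply be left out.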

#### About the colon-cancer dataset
This dataset was introduced in the 1999 paper [Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays.](http://www.pnas.org/content/96/12/6745.short)  

Here's the abstract of the paper:  
> *Oligonucleotide arrays can provide a broad picture of the state of the cell, by monitoring the expression level of thousands of genes at the same time. It is of interest to develop techniques for extracting useful information from the resulting data sets. Here we report the application of a two-way clustering method for analyzing a data set consisting of the expression patterns of different cell types. Gene expression in 40 tumor and 22 normal colon tissue samples was analyzed with an Affymetrix oligonucleotide array complementary to more than 6,500 human genes. An efficient two-way clustering algorithm was applied to both the genes and the tissues, revealing broad coherent patterns that suggest a high degree of organization underlying gene expression in these tissues. Coregulated families of genes clustered together, as demonstrated for the ribosomal proteins. Clustering also separated cancerous from noncancerous tissue and cell lines from in vivo tissues on the basis of subtle distributed patterns of genes even when expression of individual genes varied only slightly between the tissues. Two-way clustering thus may be of use both in classifying genes into functional groups and in classifying tissues based on gene expression.*

There are 2000 features, 62 data points (40 tumor (label=0), 22 normal (label=1)), and 2 classes (labels) for the colon cancer dataset. 

In [13]:
sc.stop()

*Copyright &copy; 2015 The Data Incubator.  All rights reserved.*