# Machine learning

Let us consider a very simple machine learning example of logistic regression.
Logistic regression is an iterative machine learning algorithm that seeks to find the best hyperplane that separates two sets of points in a multi-dimensional feature space. It can be used to classify messages into spam vs non-spam, for example. Because the algorithm applies the same MapReduce operation repeatedly to the same dataset, it benefits greatly from caching the input in RAM across iterations.
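
To make the iteration concrete, here is a minimal single-machine sketch in plain Python (no Spark). It assumes labels are encoded as +1 / -1 and uses a hypothetical fixed learning rate `step`; the helper names `loss`, `gradient`, and `train` are illustrative and not part of any library. Every pass recomputes the gradient over the full dataset, which is the repeated map-and-reduce step that benefits from caching.

```python
import math

def loss(w, points):
    """Mean logistic loss for labels encoded as +1 / -1 (assumption)."""
    total = 0.0
    for features, label in points:
        margin = label * sum(wi * xi for wi, xi in zip(w, features))
        total += math.log(1.0 + math.exp(-margin))
    return total / len(points)

def gradient(w, points):
    """Mean gradient of the loss: a per-point 'map' followed by a sum ('reduce')."""
    g = [0.0] * len(w)
    for features, label in points:
        margin = label * sum(wi * xi for wi, xi in zip(w, features))
        scale = (1.0 / (1.0 + math.exp(-margin)) - 1.0) * label
        for k, xk in enumerate(features):
            g[k] += scale * xk
    return [gk / len(points) for gk in g]

def train(points, iterations=500, step=1.0):
    """Plain gradient descent; `step` is a hypothetical fixed learning rate."""
    w = [0.0] * len(points[0][0])
    for _ in range(iterations):
        g = gradient(w, points)
        w = [wi - step * gi for wi, gi in zip(w, g)]
    return w

# Toy data: feature vector [1, x] (bias term plus one feature), label +1 if x < 0.5.
xs = [i / 20.0 for i in range(20)]
points = [([1.0, x], 1 if x < 0.5 else -1) for x in xs]

initial_loss = loss([0.0, 0.0], points)
w = train(points)
final_loss = loss(w, points)
print("loss before: %.4f, after: %.4f" % (initial_loss, final_loss))
print("learned weights: %s" % w)
```

The loss starts at log 2 (all-zero weights give every point probability 0.5) and decreases as the weights move toward a separating plane.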

Spark MLlib is Spark’s scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, and underlying optimization primitives.


## Non-MLlib implementation

First, let us consider the non-MLlib implementation and try to evaluate the effect of caching and partitioning on performance. We are going to try to learn the rule that y(x) = 1 if x < fraction_positive, and 0 otherwise.

Our training sample will be generated as follows:

In [1]:
import time
# It's a non-MLlib implementation, but it still uses the LabeledPoint format.
from pyspark.mllib.regression import LabeledPoint

import numpy as np
N = 10**4
fraction_positive = 0.5

def y(x):
    return 1 if x < fraction_positive else 0

def generate_sample():
    sample_X = np.arange(0, 1, 1.0/N)
    np.random.shuffle( sample_X) # In-place shuffle!
    sample_Y = map(y, sample_X)
    return (sample_X, sample_Y)

(sample_X, sample_Y) = generate_sample()


## By hand.  This is the example code taken from the Spark Examples on the website.
#  It is slower than the MLlib version further below, and it does not extract or test predictions.
start = time.time()
def logistic_by_hand(ITERATIONS,nparts):
    points = ( sc.parallelize( zip(sample_X, sample_Y), nparts)
                 .map(lambda (x,y): LabeledPoint(y, [1, x]))
                 .cache() )
    w = np.random.ranf(size = 2) # current separating plane
    print "Original random plane: %s" % w
    for i in xrange(ITERATIONS):
        gradient = points.map(
            lambda pt: (1 / (1 + np.exp(-pt.label*(w.dot(pt.features)))) - 1) * pt.label * pt.features
        ).reduce(lambda a, b: a + b)
        w -= gradient
    print "Final separating plane: %s" % w

logistic_by_hand(20,3)
end = time.time()
print "Elapsed time: ", (end-start)

Original random plane: [ 0.57319195  0.68650258]
Final separating plane: [ 1612.27200241   387.93397398]
Elapsed time:  5.05624508858


## MLlib based implementation

In [2]:
import time
start = time.time()
## Using MLlib and its data structures.  This is fairly quick.
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

points = ( sc.parallelize( zip(sample_X, sample_Y), 3)
             .map(lambda (x,y): LabeledPoint(y, [1, x]))
             .cache() )
model = LogisticRegressionWithSGD.train(points)

# Evaluating the model on training data
labelsAndPreds = points.map(lambda p: (p.label, model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(points.count())
print "Accuracy on training set: %s" % (1 - trainErr)
end = time.time()
print "Elapsed time: ", (end-start)

  "Deprecated in 2.0.0. Use ml.classification.LogisticRegression or "


Accuracy on training set: 0.8964
Elapsed time:  3.60282087326


### Hands-on mini-exercise

1. Play with the `fraction_positive` parameter: What happens to the accuracy measure as `fraction_positive` gets below 0.30 or above 0.70? (You should be somewhat disappointed with the results!)  What do you think is happening, and can you improve on it?
1. Play with the "by hand" version (.. after lowering N to say 10**4 or so): Figure out what it's actually doing and how to use it to get results.  How much slower than the MLlib version does it seem to be?


## Different ML classifiers (skip for now, read as a homework)

Finally, we will play with various ML classifiers available on the market.


### Decision Trees

A decision tree is a binary tree.  At each of the internal nodes, it chooses a feature $i$ and a threshold $t$.  Each leaf has a value.  Evaluation of the model is just traversal of the tree from the root.  At each node, for the $j$-th example, we go down the left branch if $X_{ji} \le t$ and the right branch otherwise.  The value of the model $f(X_{j\cdot})$ is the value at the terminating leaf of this traversal.  Below, we show a picture of this on a small decision tree trained on the iris data set.  Notice that each internal node has a decision criterion and each leaf has the breakdown of label classes left at this leaf of the tree.
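
The traversal rule can be sketched in a few lines of plain Python. The tree below is hand-written for illustration only (it is not the iris tree from the picture), and the dictionary layout, the `predict` helper, the thresholds, and the leaf values are all assumptions of this sketch.

```python
# A node is either a leaf {"value": v} or an internal node
# {"feature": i, "threshold": t, "left": node, "right": node}.
def predict(node, x):
    """Walk from the root to a leaf: go left if x[i] <= t, right otherwise."""
    while "value" not in node:
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["value"]

# Hypothetical two-level tree over features (x[0], x[1]); thresholds are made up.
tree = {
    "feature": 0, "threshold": 2.5,
    "left": {"value": "A"},
    "right": {
        "feature": 1, "threshold": 1.7,
        "left": {"value": "B"},
        "right": {"value": "C"},
    },
}

print(predict(tree, [1.4, 0.2]))  # x[0] <= 2.5, so the left leaf: A
print(predict(tree, [4.5, 1.5]))  # right, then x[1] <= 1.7: B
print(predict(tree, [5.5, 2.1]))  # right, then x[1] > 1.7: C
```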


### Random Forests

A random forest is just an ensemble of decision trees.  The predicted value is just the average of the trees (for both regression and classification problems - for classification problems, it is the probabilities that are averaged).  You can adjust `n_estimators` to change the number of trees in the forest.  If each tree is trained on the same subset of data, why aren't they identical?  Two reasons:
1. **Subsampling**: each tree is actually trained on a randomly selected (with replacement) subset of the data (i.e. a bootstrap sample).
1. **Maximum Features**: the optimal split comes from a randomly selected subset of the features.  In scikit-learn, this feature is controlled by `max_features`.
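
The two sources of randomness and the final averaging step can be sketched with the standard library alone. This is an illustration under simplifying assumptions, not how scikit-learn or MLlib implement it: the helper names are hypothetical, the "training examples" are just integers, and the per-tree probabilities are made-up numbers.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as one tree's training set."""
    return [rng.choice(rows) for _ in rows]

def candidate_features(n_features, max_features, rng):
    """Pick the random subset of feature indices considered at one split."""
    return sorted(rng.sample(range(n_features), max_features))

def forest_probability(tree_probabilities):
    """Average the per-tree class probabilities."""
    return sum(tree_probabilities) / float(len(tree_probabilities))

rng = random.Random(0)          # fixed seed so the sketch is repeatable
rows = list(range(10))          # stand-in for 10 training examples
sample = bootstrap_sample(rows, rng)
features = candidate_features(n_features=4, max_features=2, rng=rng)

print("bootstrap sample: %s" % sample)
print("features tried at this split: %s" % features)
print("forest probability: %.2f" % forest_probability([0.9, 0.6, 0.75]))
```

Because sampling is with replacement, a bootstrap sample usually repeats some rows and omits others, which is why trees grown on the same dataset still differ.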

### Random Forest Training Algorithm and Tuning Parameters

A Random Forest is pretty straightforward to train once you know how a Decision Tree works.  In fact, their construction can even be parallelized.  

Below, various parameters that affect decision tree and random forest training are discussed. 

The first two parameters we mention are the most important, and tuning them can often improve performance:

**numTrees**: Number of trees in the forest.

Increasing the number of trees will decrease the variance in predictions, improving the model’s test-time accuracy.
Training time increases roughly linearly in the number of trees.


**maxDepth**: Maximum depth of each tree in the forest.
Increasing the depth makes the model more expressive and powerful. However, deep trees take longer to train and are also more prone to overfitting.
In general, it is acceptable to train deeper trees when using random forests than when using a single decision tree. One tree is more likely to overfit than a random forest (because of the variance reduction from averaging multiple trees in the forest).

The next two parameters generally do not require tuning. However, they can be tuned to speed up training.

**subsamplingRate**: This parameter specifies the size of the dataset used for training each tree in the forest, as a fraction of the size of the original dataset. The default (1.0) is recommended, but decreasing this fraction can speed up training.

**featureSubsetStrategy**: Number of features to use as candidates for splitting at each tree node. The number is specified as a fraction or function of the total number of features. Decreasing this number will speed up training, but can sometimes impact performance if too low.

### Linear SVM

The canonical Support Vector Machine is the linear one.  Assume we have two groups labeled by $y = \pm 1$.  Then we are trying to find the coefficients $\beta$ and $\beta_0$ such that the hyperplane $X \beta + \beta_0 = 0$ maximally separates the points in our two classes.

If the two classes can be separated by a linear hyperplane (picture on the left), we want to maximize the **margin** $M$ of the **boundary region**.  A little bit of math can show us that finding the largest separation is actually solved by the minimization problem

$$
\min_{\beta, \beta_0} \|\beta\| \\
\mbox{subject to } y_j (X_{j\cdot} \cdot \beta + \beta_0) \ge 1 \quad \mbox{for } j = 1,\ldots,N
$$

The picture and the equation are equivalent: in the picture we are setting the margin to be $M$ and finding the largest margin possible.  In the equation, we are setting the margin to be $1$ and finding the smallest $\beta$ that will make that true.  So $\beta$ and $M$ are related through $\| \beta \| = \frac{1}{M}$.  If the two classes cannot be separated (picture on the right), we will have to add forgiveness terms $\xi_j$,

$$
\min_{\beta, \beta_0} \|\beta\| \\
\mbox{subject to } \left\{ \begin{array} {cl} 
 y_j (X_{j\cdot} \cdot \beta + \beta_0) \ge (1-\xi_j) & \mbox{for } j = 1,\ldots,N \\
 \xi_j \ge 0 & \mbox{for } j = 1,\ldots,N \\
 \sum_j \xi_j \le C
\end{array}\right.
$$

for some constant $C$.  The constant $C$ is an important tradeoff.  It corresponds to the total "forgiveness budget" (see the last constraint).  The larger $C$, the more forgiveness we have and the wider the margin $M$ can be.  We can rewrite the constrained optimization problem as the primal Lagrangian function with Lagrange multipliers $\alpha_j \ge 0$, $\mu_j \ge 0$, and $\gamma \ge 0$,  for each of our three constraints:

$$ L_P(\gamma) = \min_{\beta, \beta_0, \xi} \max_{\alpha, \mu} \frac{1}{2} \| \beta \|^2 - \sum_j \alpha_j \left[y_j (X_{j \cdot} \cdot \beta + \beta_0) - (1-\xi_j)\right] - \sum_j \mu_j \xi_j  + \gamma \sum_j \xi_j$$

There is a one-to-one correspondence between $\gamma$ and $C$.  By taking first-order conditions, the dual Lagrangian problem can be formulated as

$$
L_D(\gamma) = \max_{\alpha} \sum_j \alpha_j - \frac{1}{2} \sum_{j, j'} \alpha_j \alpha_{j'} y_j y_{j'} X_{j \cdot} \cdot X_{j' \cdot} \,. \\
\mbox{subject to } \left\{ \begin{array} {cl} 
0 = \sum_j \alpha_j y_j \\
0 \le \alpha_j \le \gamma & \mbox{for } j = 1,\ldots,N
\end{array}\right.
$$

This is now a reasonably straightforward quadratic programming problem.  It is commonly solved via [Sequential Minimal Optimization](https://en.wikipedia.org/wiki/Sequential_minimal_optimization).  Once we have solved this problem for $\alpha$, we can easily work out the coefficients from

$$ \beta = \sum_j \alpha_j y_j X_{j \cdot} $$

**Key takeaways**:
1. Critically, only points on the margin, inside the margin, or on the wrong side of the margin affect the SVM (see the picture); the points with $\xi_j > 0$ are always among them.  This is intuitively clear from the picture.  In the dual form, this is because $\alpha_j$ is the Lagrange multiplier corresponding to the constraint $y_j (X_{j\cdot} \cdot \beta + \beta_0) \ge (1-\xi_j)$, and Complementary Slackness tells us that $\alpha_j$ is non-zero only when the constraint is binding ($y_j (X_{j\cdot} \cdot \beta + \beta_0) = (1-\xi_j)$), i.e. we're in the boundary region.  This is the meaning of **Support Vector** in "SVM": only the vectors in the boundary region (the **Support Vectors**) contribute to the solution.
1. $C$ or $\gamma$ give a trade-off between the amount of forgiveness and the size of the margin or boundary region.  Hence, it controls how many points affect the SVM (based on the distance from the boundary).
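
As a small worked check of $\beta = \sum_j \alpha_j y_j X_{j \cdot}$ and of the role of the support vectors, the sketch below uses a three-point toy set. The `alpha` values are a dual solution worked out by hand for the separable (no forgiveness) case; they are an assumption of this sketch, not the output of a solver. The third point lies far from the boundary, so its multiplier is zero and it does not enter $\beta$.

```python
import math

# Toy training set; alpha is the hand-derived dual solution (assumption).
X = [(1.0, 1.0), (-1.0, -1.0), (3.0, 3.0)]
y = [1, -1, 1]
alpha = [0.25, 0.25, 0.0]

# beta = sum_j alpha_j * y_j * X_j
beta = [sum(a * yj * xj[k] for a, yj, xj in zip(alpha, y, X)) for k in range(2)]

# Recover beta_0 from a support vector, where y_j (X_j . beta + beta_0) = 1.
beta_0 = y[0] - sum(b * x for b, x in zip(beta, X[0]))

# Value of each constraint's left-hand side; it equals 1 on the margin.
margins = [yj * (sum(b * x for b, x in zip(beta, xj)) + beta_0)
           for xj, yj in zip(X, y)]
support = [j for j, a in enumerate(alpha) if a > 0]
M = 1.0 / math.sqrt(sum(b * b for b in beta))

print("beta = %s, beta_0 = %s" % (beta, beta_0))
print("y_j (X_j . beta + beta_0) = %s" % margins)
print("support vector indices: %s, margin M = %.4f" % (support, M))
```

The first two points sit exactly on the margin (constraint value 1), while the third has value 3 and could be removed without changing $\beta$. The margin $M = 1/\|\beta\|$ comes out as $\sqrt{2}$, the distance from the separating line to the nearest point.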

Below, we plot out a simple two-class linear SVM on some synthetic data


### Non-linear SVM

What if we don't believe that our data can be cleanly split by a linear hyperplane?  The common way to incorporate non-linear features is to have a non-linear function $h(X_{j\cdot})$ (possibly to a higher-dimensional feature space with dimension $p'$ where $p' \ge p$) and to train on that space.  One intuition is that there's a higher-dimensional space in which the data has a linear separation and $h$ gives a non-linear mapping into that space.

#### Kernel Trick

The **Kernel Trick** in SVM tells us that rather than directly computing the (potentially very large) vectors $h(X_{j \cdot})$, we can just modify the Kernel.  If we use the transformed data $h(X_{j \cdot})$, the dual Lagrangian would be

$$ \max_{\alpha} \sum_j \alpha_j - \frac{1}{2} \sum_j \sum_{j'} \alpha_j \alpha_{j'} y_j y_{j'} h(X_{j \cdot}) \cdot h(X_{j' \cdot}) $$

We can rewrite

$$h(X_{j \cdot}) \cdot h(X_{j' \cdot})  = K(X_{j \cdot}, X_{j' \cdot})$$ 

for some non-linear Kernel $K$.  Our problem then becomes,

$$ \max_{\alpha} \sum_j \alpha_j - \frac{1}{2} \sum_j \sum_{j'} \alpha_j \alpha_{j'} y_j y_{j'} K(X_{j \cdot}, X_{j' \cdot}) $$

By Mercer's theorem, every positive semi-definite Kernel function corresponds to an inner product under some feature map $h$ (although $h$'s range may be infinite dimensional).  Some common Kernels include

<table>
<tr>
<th>Kernel</th>
<th>$K(x,x')$</th>
<th>Scikit `kernel` parameter</th>
</tr>

<tr>
<td>Linear Kernel</td>
<td>$x \cdot x'$</td>
<td>`kernel='linear'`</td>
</tr>

<tr>
<td>$d$-th Degree Polynomial</td>
<td>$(r + c x \cdot x')^d$</td>
<td>`kernel='poly'`</td>
</tr>

<tr>
<td>Radial Kernel</td>
<td>$ \exp(- c \|x - x' \|^2) $</td>
<td>`kernel='rbf'`</td>
</tr>

<tr>
<td>Neural Network Kernel</td>
<td>$\tanh(c x \cdot x' + r)$</td>
<td>`kernel='sigmoid'`</td>
</tr>
</table>

The benefit of using a Kernel is that we don't have to compute a very high-dimensional (possibly infinite-dimensional) $h$.  All that complexity is just wrapped into the kernel $K$.
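
A quick numerical check makes the trick tangible. The sketch below takes the polynomial kernel from the table with $r = 0$, $c = 1$, $d = 2$ in two dimensions, where an explicit feature map is known: $h(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$. The function names and the sample points are illustrative choices.

```python
import math

def poly_kernel(x, z, r=0.0, c=1.0, d=2):
    """Polynomial kernel (r + c * x.z)^d."""
    return (r + c * sum(a * b for a, b in zip(x, z))) ** d

def h(x):
    """Explicit feature map for the r=0, c=1, d=2 kernel in two dimensions."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)

x, z = (1.0, 2.0), (3.0, -1.0)
via_kernel = poly_kernel(x, z)                            # one dot product in 2-D
via_features = sum(a * b for a, b in zip(h(x), h(z)))     # dot product in 3-D

print("K(x, z) = %s, h(x).h(z) = %s" % (via_kernel, via_features))
```

Both routes give the same number (up to floating-point rounding), but the kernel never builds the higher-dimensional vectors.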

## Logistic Regression Algorithm Guide

This notebook provides an example of how you can perform Logistic Regression with the MLlib library.

#### Algorithm Summary

**Task**: Classification with binary or multiclass labels.

**Input**: Labels (binary or multiclass, 0-based indexed) and feature vectors (continuous, not categorical). For categorical features, use One-Hot Encoding to convert to binary features usable by Logistic Regression.

**Regularization**: Logistic Regression, like other Generalized Linear Models (GLMs) in MLlib, supports different types of regularization: None, L1, and L2. Elastic Net regularization, which mixes L1 and L2, is supported with the DataFrame-based ML Pipelines API.
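
For reference, One-Hot Encoding of a categorical value can be sketched in plain Python as follows. The `one_hot` helper is hypothetical (it is not a Spark function), and the category list is the set of `Class` levels in the Titanic data used later in this notebook.

```python
def one_hot(value, categories):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    return [1 if value == c else 0 for c in categories]

categories = ["1st", "2nd", "3rd", "Crew"]
print(one_hot("3rd", categories))  # [0, 0, 1, 0]
```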

#### Hints for Logistic Regression in Spark

There are several APIs for Logistic Regression in Spark. Although this notebook demonstrates the RDD-based spark.mllib API, the newer DataFrame-based spark.ml API is now recommended. The DataFrame-based API includes faster, more robust algorithms and provides more information about the model learned.
If you use the other APIs, the main items to be careful about are (a) which optimization algorithm is being run and (b) whether feature scaling is being used.
Feature scaling refers to normalizing features (columns) to have unit variance. After training, the model weights are rescaled so that test data does not have to be normalized. This improves optimization (and often statistical) behavior, but it changes the effect of regularization since the same regularization parameter is used for all weights.
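
The rescaling step can be illustrated with a few lines of plain Python. This is a sketch under stated assumptions rather than MLlib's actual code: the feature values and the "learned" weights are made up, and the population standard deviation is used for simplicity. Dividing each weight by its column's standard deviation gives weights that produce the same score on the raw, unscaled features.

```python
import math

def column_std(rows, k):
    """Population standard deviation of column k (a simplifying assumption)."""
    values = [row[k] for row in rows]
    mean = sum(values) / float(len(values))
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

rows = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]     # made-up raw features
sigma = [column_std(rows, k) for k in range(2)]
scaled = [[v / s for v, s in zip(row, sigma)] for row in rows]

w_scaled = [0.8, -1.5]                                # hypothetical weights learned on `scaled`
w_raw = [w / s for w, s in zip(w_scaled, sigma)]      # rescaled to apply to raw features

score_scaled = sum(w * v for w, v in zip(w_scaled, scaled[0]))
score_raw = sum(w * v for w, v in zip(w_raw, rows[0]))
print("score on scaled features: %.6f, on raw features: %.6f" % (score_scaled, score_raw))
```

Note that a single regularization parameter penalizes `w_scaled`, so its effect on the raw-feature weights differs from column to column.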

#### APIs
In *pyspark.mllib.classification* (original MLlib package with lower-level API):

1. *LogisticRegressionWithSGD* does not use feature scaling. Because of this, it is often important to tune (decrease) the step size to ensure convergence. If you decrease the step size, you may also need to increase the number of iterations.
1. *LogisticRegressionWithLBFGS* uses feature scaling.

In *pyspark.ml.classification* (newer MLlib package with higher-level API for Pipelines; recommended):

1. *LogisticRegression* uses feature scaling by default, but this is adjustable.

#### Load data

You can load data from many sources in Spark. Here, we will load a hosted R dataset. However, you can also check out this guide for more info: Accessing Data.

In [3]:
# Read titanic data as DataFrame using spark-csv package, and cache it
titanic = spark.read.options(header='true', inferSchema='true').csv('./data/titanic.csv').cache()
titanic.count()

32

The data is an observation-based version of the 1912 Titanic passenger survival log.

#### Data Format

A data frame with 1316 observations on the following 4 variables.

**class**
a factor with levels 1st class 2nd class 3rd class crew

**age**
a factor with levels child adults

**sex**
a factor with levels women man

**survived**
a factor with levels no yes

#### Explore data
This gives a quick idea of how to start exploring the data, and you can find more info here: Visualizations.

In [4]:
titanic.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- Class: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Survived: string (nullable = true)
 |-- Freq: integer (nullable = true)



In [5]:
titanic.show()

+---+-----+------+-----+--------+----+
|_c0|Class|   Sex|  Age|Survived|Freq|
+---+-----+------+-----+--------+----+
|  1|  1st|  Male|Child|      No|   0|
|  2|  2nd|  Male|Child|      No|   0|
|  3|  3rd|  Male|Child|      No|  35|
|  4| Crew|  Male|Child|      No|   0|
|  5|  1st|Female|Child|      No|   0|
|  6|  2nd|Female|Child|      No|   0|
|  7|  3rd|Female|Child|      No|  17|
|  8| Crew|Female|Child|      No|   0|
|  9|  1st|  Male|Adult|      No| 118|
| 10|  2nd|  Male|Adult|      No| 154|
| 11|  3rd|  Male|Adult|      No| 387|
| 12| Crew|  Male|Adult|      No| 670|
| 13|  1st|Female|Adult|      No|   4|
| 14|  2nd|Female|Adult|      No|  13|
| 15|  3rd|Female|Adult|      No|  89|
| 16| Crew|Female|Adult|      No|   3|
| 17|  1st|  Male|Child|     Yes|   5|
| 18|  2nd|  Male|Child|     Yes|  11|
| 19|  3rd|  Male|Child|     Yes|  13|
| 20| Crew|  Male|Child|     Yes|   0|
+---+-----+------+-----+--------+----+
only showing top 20 rows



In [6]:
titanic.where("Class like '1st' and Age like 'Child'").show()

+---+-----+------+-----+--------+----+
|_c0|Class|   Sex|  Age|Survived|Freq|
+---+-----+------+-----+--------+----+
|  1|  1st|  Male|Child|      No|   0|
|  5|  1st|Female|Child|      No|   0|
| 17|  1st|  Male|Child|     Yes|   5|
| 21|  1st|Female|Child|     Yes|   1|
+---+-----+------+-----+--------+----+



Alternatively, you can use the pure Spark SQL syntax as we have seen in the 3rd section.

In [7]:
titanic.createOrReplaceTempView("df_titanic")

In [8]:
spark.sql("SELECT * FROM df_titanic WHERE Class like '1st' and Age like 'Child'").show()

+---+-----+------+-----+--------+----+
|_c0|Class|   Sex|  Age|Survived|Freq|
+---+-----+------+-----+--------+----+
|  1|  1st|  Male|Child|      No|   0|
|  5|  1st|Female|Child|      No|   0|
| 17|  1st|  Male|Child|     Yes|   5|
| 21|  1st|Female|Child|     Yes|   1|
+---+-----+------+-----+--------+----+



#### Preprocess data

In this section, we convert string categorical columns into ordered indices usable by a linear model. Note that these categorical features are special: There is a natural ordering, so it makes sense to treat them as continuous. For general categorical features, you should probably use one-hot encoding instead.

In [9]:
# Compute lists of string categories
def getCategories(col):
    vals = sorted(titanic.select(col).distinct().rdd.map(lambda x: x[0]).collect())
    valDict = dict([(vals[i], i) for i in range(len(vals))])
    print col + ': ' + ', '.join(vals)
    return (vals, valDict)

(classes, classDict) = getCategories("Class")
(ages, ageDict) = getCategories("Age")
(sexes, sexDict) = getCategories("Sex")
(survived, survivedDict) = getCategories("Survived")

Class: 1st, 2nd, 3rd, Crew
Age: Adult, Child
Sex: Female, Male
Survived: No, Yes


In [10]:
# Convert the string categories into indices
from pyspark.sql.types import *
from pyspark.sql.functions import udf

classUDF = udf(lambda x: classDict[x], IntegerType())
ageUDF = udf(lambda x: ageDict[x], IntegerType())
sexUDF = udf(lambda x: sexDict[x], IntegerType())
survivedUDF = udf(lambda x: survivedDict[x], IntegerType())

titanicIndexed = titanic.select(classUDF(titanic["Class"]).alias("class"), ageUDF(titanic["Age"]).alias("age"), sexUDF(titanic["Sex"]).alias("sex"), survivedUDF(titanic["Survived"]).alias("survived")).cache()

In [11]:
titanicIndexed.show()

+-----+---+---+--------+
|class|age|sex|survived|
+-----+---+---+--------+
|    0|  1|  1|       0|
|    1|  1|  1|       0|
|    2|  1|  1|       0|
|    3|  1|  1|       0|
|    0|  1|  0|       0|
|    1|  1|  0|       0|
|    2|  1|  0|       0|
|    3|  1|  0|       0|
|    0|  0|  1|       0|
|    1|  0|  1|       0|
|    2|  0|  1|       0|
|    3|  0|  1|       0|
|    0|  0|  0|       0|
|    1|  0|  0|       0|
|    2|  0|  0|       0|
|    3|  0|  0|       0|
|    0|  1|  1|       1|
|    1|  1|  1|       1|
|    2|  1|  1|       1|
|    3|  1|  1|       1|
+-----+---+---+--------+
only showing top 20 rows



#### Train a model
We now train a Logistic Regression model using LogisticRegressionWithSGD. Since we will use the traditional MLlib API (not the Pipelines API), we first have to extract the label and features columns and create an RDD of LabeledPoints. (The Pipelines API takes DataFrames instead of RDDs.)

In [12]:
# Convert data to RDD of LabeledPoint
from pyspark.mllib.regression import LabeledPoint

featureCols = ["age", "sex", "class"]
titanicLabels = titanicIndexed.select("survived").rdd.map(lambda row: row[0])
titanicFeatures = titanicIndexed.select(*featureCols).rdd.map(lambda x: list(x)) #[x[0], x[1], x[2]])
titanicData = titanicLabels.zip(titanicFeatures).map(lambda l_p: LabeledPoint(l_p[0], l_p[1])).cache()

In [13]:
# Train the model, and print the intercept and weight vector
# We use L1 (sparsifying) regularization, but you can also use None or "l2".
from pyspark.mllib.classification import LogisticRegressionWithSGD

lr = LogisticRegressionWithSGD.train(titanicData, regParam=0.1, regType="l1", intercept=True, iterations=100)
print 'Learned LogisticRegressionModel:'
print '\t Intercept: %g' % lr.intercept
print '\t Feature\tWeight'
for i in range(len(featureCols)):
    print '\t %s\t\t%g' % (featureCols[i], lr.weights[i])

Learned LogisticRegressionModel:
	 Intercept: 0
	 Feature	Weight
	 age		-0
	 sex		-0
	 class		-0


In [14]:
# We can make a single prediction:
oneInstance = [0, 1, 0]
prediction = lr.predict(oneInstance)
print 'Example prediction:'
print '  features: ' + str(oneInstance)
print '  prediction: %d' % prediction

Example prediction:
  features: [0, 1, 0]
  prediction: 0


In [15]:
# We can also make predictions on the whole dataset and compute accuracy
import numpy

def accuracy(model, labelsRDD, featuresRDD):
    predictionsRDD = featuresRDD.map(lambda x: model.predict(x))
    return labelsRDD.zip(predictionsRDD).map(lambda labelAndPred: labelAndPred[0] == labelAndPred[1]).mean()

print 'Training accuracy: %g' % accuracy(lr, titanicLabels, titanicFeatures)

Training accuracy: 0.5


In [16]:
# Previously, we were making 0/1 predictions.  We can clear the model threshold to make soft predictions.
# Note: Soft predictions are currently only supported for binary classification.
lr.clearThreshold()
print 'Predicted probability of label 1: %g' % lr.predict(oneInstance)

Predicted probability of label 1: 0.5
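
The value 0.5 follows directly from the weights printed earlier. The sketch below is a plain-Python restatement of the logistic model, assuming a default decision threshold of 0.5; the helper names are illustrative and not part of MLlib. With an intercept of 0 and all-zero weights, the score is sigmoid(0) = 0.5 for every input, so every hard prediction is 0.

```python
import math

def predict_probability(intercept, weights, features):
    """Logistic model: sigmoid(intercept + weights . features)."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(probability, threshold=0.5):
    """Hard 0/1 prediction; 0.5 is assumed as the default threshold."""
    return 1 if probability > threshold else 0

p = predict_probability(0.0, [0.0, 0.0, 0.0], [0, 1, 0])
print("probability: %s, label: %d" % (p, predict_label(p)))
```

A model that predicts 0 for every row is right exactly as often as label 0 occurs, which is consistent with the training accuracy of 0.5 reported above.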


### Example: Estimator, Transformer, and Param
    
This example covers the concepts of Estimator, Transformer, and Param.

In [17]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.param import Param, Params

In [19]:
# Prepare training data from a list of (label, features) tuples.
training = spark.createDataFrame([
    (1.0, Vectors.dense([0.0, 1.1, 0.1])),
    (0.0, Vectors.dense([2.0, 1.0, -1.0])),
    (0.0, Vectors.dense([2.0, 1.3, 1.0])),
    (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"])

In [20]:
# Create a LogisticRegression instance. This instance is an Estimator.
lr = LogisticRegression(maxIter=10, regParam=0.01)
# Print out the parameters, documentation, and any default values.
print "LogisticRegression parameters:\n" + lr.explainParams() + "\n"

LogisticRegression parameters:
aggregationDepth: suggested depth for treeAggregate (>= 2). (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0)
family: The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial (default: auto)
featuresCol: features column name. (default: features)
fitIntercept: whether to fit an intercept term. (default: True)
labelCol: label column name. (default: label)
maxIter: max number of iterations (>= 0). (default: 100, current: 10)
predictionCol: prediction column name. (default: prediction)
probabilityCol: Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities. (default: probability)
rawPredictionCol: raw predi

In [21]:
# Learn a LogisticRegression model. This uses the parameters stored in lr.
model1 = lr.fit(training)

In [22]:
# We may alternatively specify parameters using a Python dictionary as a paramMap
paramMap = {lr.maxIter: 20}
paramMap[lr.maxIter] = 30 # Specify 1 Param, overwriting the original maxIter.
paramMap.update({lr.regParam: 0.1, lr.threshold: 0.55}) # Specify multiple Params.

In [23]:
# You can combine paramMaps, which are python dictionaries.
paramMap2 = {lr.probabilityCol: "myProbability"} # Change output column name
paramMapCombined = paramMap.copy()
paramMapCombined.update(paramMap2)

# Now learn a new model using the paramMapCombined parameters.
# paramMapCombined overrides all parameters set earlier via lr.set* methods.
model2 = lr.fit(training, paramMapCombined)

# Prepare test data
test = spark.createDataFrame([
    (1.0, Vectors.dense([-1.0, 1.5, 1.3])),
    (0.0, Vectors.dense([3.0, 2.0, -0.1])),
    (1.0, Vectors.dense([0.0, 2.2, -1.5]))], ["label", "features"])

In [34]:
# Make predictions on test data using the Transformer.transform() method.
# LogisticRegression.transform will only use the 'features' column.
# Note that model2.transform() outputs a "myProbability" column instead of the usual
# 'probability' column since we renamed the lr.probabilityCol parameter previously.
prediction = model2.transform(test)
selected = prediction.select("features", "label", "myProbability", "prediction")
for row in selected.collect():
    print row

Row(features=DenseVector([-1.0, 1.5, 1.3]), label=1.0, myProbability=DenseVector([0.0571, 0.9429]), prediction=1.0)
Row(features=DenseVector([3.0, 2.0, -0.1]), label=0.0, myProbability=DenseVector([0.9239, 0.0761]), prediction=0.0)
Row(features=DenseVector([0.0, 2.2, -1.5]), label=1.0, myProbability=DenseVector([0.1097, 0.8903]), prediction=1.0)


### Example: Pipeline. Hands-on exercise

This example follows the simple text document Pipeline illustrated in the figures above.

Now switch to the Adroit working area and proceed to the Pipelines exercise.