<h1>Logistic Regression</h1>

[Logistic regression](http://en.wikipedia.org/wiki/Logistic_regression) is widely used to predict a binary response. We will show how to implement Logistic Regression in Spark using Scala. First, the standard initialization with the relevant imports:

In [1]:
val sparkVersion = "2.0.1"
val scalaVersion = scala.util.Properties.versionNumberString

sparkVersion: String = "2.0.1"
scalaVersion: String = "2.11.8"

In [2]:
classpath.add(
    "org.apache.spark" %% "spark-yarn" % sparkVersion,
    "org.apache.spark" %% "spark-mllib" % sparkVersion
)

146 new artifact(s)
146 new artifacts in macro
146 new artifacts in runtime
146 new artifacts in compile

In [3]:
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.util.MLUtils

// Logistic Regression
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS}

// Naive Bayes
import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}

import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.regression.LabeledPoint

import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.Row

[32mimport [36morg.apache.spark.sql.SparkSession[0m
[32mimport [36morg.apache.spark.mllib.util.MLUtils[0m
[32mimport [36morg.apache.spark.ml.classification.LogisticRegression[0m
[32mimport [36morg.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS}[0m
[32mimport [36morg.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}[0m
[32mimport [36morg.apache.spark.mllib.evaluation.MulticlassMetrics[0m
[32mimport [36morg.apache.spark.mllib.regression.LabeledPoint[0m
[32mimport [36morg.apache.spark.ml.linalg.{Vector, Vectors}[0m
[32mimport [36morg.apache.spark.ml.param.ParamMap[0m
[32mimport [36morg.apache.spark.sql.Row[0m

In [4]:
val sparkSession = SparkSession
  .builder()
  .master("local[1]")
  .appName("Logistic Regression")
  .getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/03/16 11:10:49 INFO SparkContext: Running Spark version 2.0.1
17/03/16 11:10:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/03/16 11:10:51 INFO SecurityManager: Changing view acls to: b97eec96efcb40779e247b002e047f82
17/03/16 11:10:51 INFO SecurityManager: Changing modify acls to: b97eec96efcb40779e247b002e047f82
17/03/16 11:10:51 INFO SecurityManager: Changing view acls groups to: 
17/03/16 11:10:51 INFO SecurityManager: Changing modify acls groups to: 
17/03/16 11:10:51 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(b97eec96efcb40779e247b002e047f82); groups with view permissions: Set(); users  with modify permissions: Set(b97eec96efcb40779e247b002e047f82); groups with modify permissions: Set()
17/03/16 11:10:52 INFO Utils: Successfully started service 

sparkSession: SparkSession = org.apache.spark.sql.SparkSession@62036112

As mentioned in Notebook 2, a <tt>LogisticRegression</tt> (LR) instance is an Estimator: calling <tt>fit()</tt> on it produces a model, which is a Transformer. A small-scale example is shown below.

In [5]:
// Prepare training data from a list of (label, features) tuples.
val training = sparkSession.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")

// Create a LogisticRegression instance. This instance is an Estimator.
val lr = new LogisticRegression()
// Print out the parameters, documentation, and any default values.
println("LogisticRegression parameters:\n" + lr.explainParams() + "\n")

// We can set parameters using setter methods. Possible parameters are listed below.
lr.setMaxIter(10)
  .setRegParam(0.01)

// Learn a LogisticRegression model. This uses the parameters stored in lr.
val model1 = lr.fit(training)
// Since model1 is a Model (i.e., a Transformer produced by an Estimator),
// we can view the parameters it used during fit().
// This prints the parameter (name: value) pairs, where names are unique IDs for this
// LogisticRegression instance.
println("Model 1 was fit using parameters: " + model1.parent.extractParamMap)

// We may alternatively specify parameters using a ParamMap,
// which supports several methods for specifying parameters.
val paramMap = ParamMap(lr.maxIter -> 20)
  .put(lr.maxIter, 30)  // Specify 1 Param. This overwrites the original maxIter.
  .put(lr.regParam -> 0.1, lr.threshold -> 0.55)  // Specify multiple Params.

// One can also combine ParamMaps.
val paramMap2 = ParamMap(lr.probabilityCol -> "myProbability")  // Change output column name.
val paramMapCombined = paramMap ++ paramMap2

// Now learn a new model using the paramMapCombined parameters.
// paramMapCombined overrides all parameters set earlier via lr.set* methods.
val model2 = lr.fit(training, paramMapCombined)
println("Model 2 was fit using parameters: " + model2.parent.extractParamMap)

// Prepare test data.
val test = sparkSession.createDataFrame(Seq(
  (1.0, Vectors.dense(-1.0, 1.5, 1.3)),
  (0.0, Vectors.dense(3.0, 2.0, -0.1)),
  (1.0, Vectors.dense(0.0, 2.2, -1.5))
)).toDF("label", "features")

// Make predictions on test data using the Transformer.transform() method.
// LogisticRegression.transform will only use the 'features' column.
// Note that model2.transform() outputs a 'myProbability' column instead of the usual
// 'probability' column since we renamed the lr.probabilityCol parameter previously.
model2.transform(test)
  .select("features", "label", "myProbability", "prediction")
  .collect()
  .foreach { case Row(features: Vector, label: Double, prob: Vector, prediction: Double) =>
    println(s"($features, $label) -> prob=$prob, prediction=$prediction")
  }

LogisticRegression parameters:
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty (default: 0.0)
featuresCol: features column name (default: features)
fitIntercept: whether to fit an intercept term (default: true)
labelCol: label column name (default: label)
maxIter: maximum number of iterations (>= 0) (default: 100)
predictionCol: prediction column name (default: prediction)
probabilityCol: Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities (default: probability)
rawPredictionCol: raw prediction (a.k.a. confidence) column name (default: rawPrediction)
regParam: regularization parameter (>= 0) (default: 0.0)
standardization: whether to standardize the training features before fitting the model (default: true)
threshold: threshold in binary

training: org.apache.spark.sql.package.DataFrame = [label: double, features: vector]
lr: LogisticRegression = logreg_dbdaf58dcdb8
res4_3: LogisticRegression = logreg_dbdaf58dcdb8
model1: org.apache.spark.ml.classification.LogisticRegressionModel = logreg_dbdaf58dcdb8
paramMap: ParamMap = {
	logreg_dbdaf58dcdb8-maxIter: 30,
	logreg_dbdaf58dcdb8-regParam: 0.1,
	logreg_dbdaf58dcdb8-threshold: 0.55
}
paramMap2: ParamMap = {
	logreg_dbdaf58dcdb8-probabilityCol: myProbability
}
paramMapCombined: ParamMap = {
	logreg_dbdaf58dcdb8-maxIter: 30,
	logreg_dbdaf58dcdb8-probabilityCol: myProbability,
	logreg_dbdaf58dcdb8-regParam: 0.1,
	logreg_dbdaf58dcdb8-threshold: 0.55
}
model2: org.apache.spark.ml.classification.LogisticRegre

[MLlib](https://spark.apache.org/docs/2.0.2/mllib-linear-methods.html#implementation-developer) implements a simple distributed version of [stochastic gradient descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent) (SGD). All of the provided algorithms take as input a regularization parameter (<tt>regParam</tt>) along with several parameters that control SGD (<tt>stepSize</tt>, <tt>numIterations</tt>, <tt>miniBatchFraction</tt>). Three kinds of regularization are supported: <tt>L1</tt>, <tt>L2</tt>, or a mixture of the two. Further details are available [here](https://spark.apache.org/docs/2.0.2/mllib-optimization.html#gradient-descent-and-stochastic-gradient-descent), and the regularization parameters also appear in the output of the cell above. The [settable parameters](https://spark.apache.org/docs/2.0.2/api/scala/index.html#org.apache.spark.mllib.optimization.GradientDescent) include:

- <tt>stepSize</tt> is a scalar value denoting the initial step size for gradient descent.
- <tt>numIterations</tt> is the number of iterations to run.
- <tt>regParam</tt> is the regularization parameter when using L1 or L2 regularization.
- <tt>miniBatchFraction</tt> is the fraction of the total data that is sampled in each iteration to compute the gradient direction.
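
To make the role of these four parameters concrete, here is a minimal single-machine sketch of mini-batch gradient descent for logistic regression in plain Scala. It is not MLlib's implementation. It assumes an L2 penalty, no intercept term, and a step size that decays as <tt>stepSize / sqrt(iteration)</tt>, which is the schedule described in the MLlib optimization guide. The object and method names (<tt>MiniBatchSgdSketch</tt>, <tt>train</tt>, <tt>predict</tt>) are hypothetical and used only for illustration.

```scala
import scala.util.Random

// Illustrative only: a local sketch, not part of Spark or MLlib.
object MiniBatchSgdSketch {

  // Logistic (sigmoid) function mapping a raw score to a probability.
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => a(i) * b(i)).sum

  // Predicts label 1.0 when the estimated probability exceeds 0.5, else 0.0.
  def predict(w: Array[Double], x: Array[Double]): Double =
    if (sigmoid(dot(w, x)) > 0.5) 1.0 else 0.0

  def train(
      data: Seq[(Double, Array[Double])], // (label in {0.0, 1.0}, features)
      stepSize: Double,
      numIterations: Int,
      regParam: Double,
      miniBatchFraction: Double,
      seed: Long = 42L): Array[Double] = {
    require(data.nonEmpty, "training data must not be empty")
    val rng = new Random(seed)
    val dim = data.head._2.length
    var w = Array.fill(dim)(0.0)

    for (iter <- 1 to numIterations) {
      // Keep each point with probability miniBatchFraction.
      val batch = data.filter(_ => rng.nextDouble() < miniBatchFraction)
      if (batch.nonEmpty) {
        // Sum the gradient of the logistic loss over the mini-batch.
        val grad = Array.fill(dim)(0.0)
        for ((label, x) <- batch) {
          val err = sigmoid(dot(w, x)) - label
          for (j <- 0 until dim) grad(j) += err * x(j)
        }
        // Assumed schedule: the step size shrinks with the iteration count.
        val eta = stepSize / math.sqrt(iter)
        // Average loss gradient plus the gradient of (regParam / 2) * ||w||^2.
        w = Array.tabulate(dim)(j => w(j) - eta * (grad(j) / batch.size + regParam * w(j)))
      }
    }
    w
  }
}
```

Calling <tt>MiniBatchSgdSketch.train(data, stepSize = 1.0, numIterations = 200, regParam = 0.01, miniBatchFraction = 1.0)</tt> uses every point in every iteration, which reduces to ordinary batch gradient descent; a smaller <tt>miniBatchFraction</tt> trades gradient accuracy for cheaper iterations.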

MLlib also provides a [second algorithm](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS) for fitting logistic regression: [L-BFGS](https://spark.apache.org/docs/2.0.2/api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS). L-BFGS is a limited-memory quasi-Newton method rather than an extension of the mini-batch gradient descent above: it approximates second-order (curvature) information from recent gradients and usually converges in fewer iterations. We apply this algorithm below:

In [None]:
val sc = sparkSession.sparkContext

object LRwLBFGS {

    def LBFGS() : Unit = {

        // Load training data in LIBSVM format.
        val data = MLUtils.loadLibSVMFile(sc, "files/sample_libsvm_data.txt")        

        // Split data into training (60%) and test (40%).
        val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
        val training = splits(0).cache()
        val test = splits(1)

        // Run training algorithm to build the model
        val model = new LogisticRegressionWithLBFGS()
          .setNumClasses(10)
          .run(training)

        // Predict a class label for each point in the test set.
        val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
          val prediction = model.predict(features)
          (prediction, label)
        }

        // Get evaluation metrics.
        val metrics = new MulticlassMetrics(predictionAndLabels)
        val accuracy = metrics.accuracy
        println(s"Accuracy = $accuracy")
    }
}

LRwLBFGS.LBFGS

<h1>Exercises</h1>

<h2>Exercise 1</h2>

Create standalone programs for both methods of fitting logistic regression shown above, to run on the HPC. Run both on the [default of credit card clients](http://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients) data and the [occupancy detection](http://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+#) data. Carry out timing experiments for both methods.
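
For the timing experiments, one possible starting point is a small wall-clock helper in plain Scala. This is only a sketch: the names <tt>TimingSketch</tt>, <tt>timed</tt> and <tt>meanSeconds</tt> are hypothetical and not part of Spark. Because Spark transformations are lazy, the timed block should end in something that forces computation, such as <tt>fit()</tt>, <tt>run()</tt> or <tt>count()</tt>.

```scala
// Illustrative only: a hypothetical helper, not part of Spark.
object TimingSketch {

  // Runs `block` once and returns its result together with the
  // elapsed wall-clock time in seconds.
  def timed[A](block: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = block
    val seconds = (System.nanoTime() - start) / 1e9
    (result, seconds)
  }

  // Runs `block` `runs` times and returns the mean elapsed time in seconds.
  // Averaging over several runs smooths out JVM warm-up and scheduling noise.
  def meanSeconds[A](runs: Int)(block: => A): Double = {
    require(runs > 0, "runs must be positive")
    (1 to runs).map(_ => timed(block)._2).sum / runs
  }
}
```

For example, <tt>val (model, secs) = TimingSketch.timed { lr.fit(training) }</tt> would time a single fit, assuming <tt>lr</tt> and <tt>training</tt> are defined as in the earlier cell.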

<h2>Exercise 2</h2>

Using the same datasets, explore the three available regularization methods, <tt>L1</tt> ([lasso](https://en.wikipedia.org/wiki/Lasso_(statistics))), <tt>L2</tt> ([ridge regression](https://en.wikipedia.org/wiki/Ridge_regression)) and <tt>L1/L2</tt> ([elastic net](http://en.wikipedia.org/wiki/Elastic_net_regularization)), by varying <tt>regParam</tt> and, for the mixture, <tt>elasticNetParam</tt>. You will need to carry out [cross validation](https://spark.apache.org/docs/2.0.2/ml-tuning.html#example-model-selection-via-cross-validation) using a pipeline to find good regularization parameters.
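
As a reminder of what is being tuned, the sketch below computes the elastic net penalty in plain Scala, following the form given in the Spark documentation: <tt>regParam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)</tt>, where <tt>alpha</tt> is the <tt>elasticNetParam</tt> listed in the <tt>explainParams()</tt> output earlier. Setting <tt>alpha = 0</tt> gives a pure L2 penalty and <tt>alpha = 1</tt> a pure L1 penalty. The object and function names are hypothetical and not part of Spark.

```scala
// Illustrative only: a hypothetical helper, not part of Spark.
object ElasticNetPenaltySketch {

  // Elastic net penalty for a weight vector w:
  //   regParam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)
  // with alpha = elasticNetParam in [0, 1].
  def penalty(w: Seq[Double], regParam: Double, elasticNetParam: Double): Double = {
    require(regParam >= 0.0, "regParam must be non-negative")
    require(elasticNetParam >= 0.0 && elasticNetParam <= 1.0,
      "elasticNetParam must lie in [0, 1]")
    val l1Norm = w.map(x => math.abs(x)).sum   // ||w||_1
    val l2NormSquared = w.map(x => x * x).sum  // ||w||_2^2
    regParam * (elasticNetParam * l1Norm + (1.0 - elasticNetParam) / 2.0 * l2NormSquared)
  }
}
```

For a fixed <tt>regParam</tt>, sweeping <tt>elasticNetParam</tt> over values such as 0.0, 0.5 and 1.0 moves the penalty from ridge through a mixture to lasso, which is why a cross-validation grid for this exercise would normally vary both parameters.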

<h2>Exercise 3</h2>

Use logistic regression to predict the probability of an ad click. There is a nice guide to a Python implementation of this at [https://turi.com/learn/gallery/notebooks/click_through_rate_prediction_intro.html](https://turi.com/learn/gallery/notebooks/click_through_rate_prediction_intro.html), which you can adapt to Spark and Scala.