# COM6012 - 2017: Coursework 3

Deadline: 11:59PM on Thursday 23 March 2017

Submission: via SageMathCloud (we will collect it from you automatically).


In this coursework, you will apply Decision Trees for Classification, Decision Trees for Regression and Logistic Regression over the [HIGGS dataset](http://archive.ics.uci.edu/ml/datasets/HIGGS). For each algorithm:

1. Use pipelines and cross-validation to find the best configuration of parameters and their performance. Use the same splits of training and test data when comparing performances between the algorithms (8 marks).
2. Find which features are more relevant for classification or regression (4 marks). 
3. Provide training times in the cluster when using different cores (1 mark).

Do not try to upload the dataset to SageMathCloud when returning your work. It is 2.6 GB.



# Solutions:

> A text file 'README.txt' is provided in the zip for the exercises. It explains which output and error files correspond to which number of cores, e.g. 'acp16al_Coursework3.sh.o229510' is the output file for 10 cores and 'acp16al_Coursework3.sh.e229508' is the error file for 5 cores.


# DECISION TREES CLASSIFICATION:

# 1) & 2)

The table shows the best parameter configuration found by the cross-validator for each feature set (all, low-level only, high-level only), together with its accuracy on the test data and the run time in seconds:

| Features |  MaxBins  | MaxDepth| Impurity | ACCURACY | TIME (s) |
|----------|-----------|---------|----------|----------|----------|
|   ALL    |    20     |   10    |  entropy |   0.688  |740.68|
|   LOW    |    20     |   10    |  entropy |   0.610  |619.13|
|   HIGH   |    20     |   10    |   gini   |   0.679  |517.00|

As the table shows, the best accuracy is obtained when using all the features. High-level features alone give better accuracy than low-level features alone (0.679 against 0.610), so the high-level features are the more relevant ones for classification. The logistic regression results below confirm this.
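
The comparison can be made explicit by measuring how much accuracy is lost, relative to the ALL row, when one feature group is left out. The sketch below only redoes the arithmetic on the table values; the object and method names are illustrative and not part of the submitted code.

```scala
// Accuracy figures copied from the decision tree classification table above.
object FeatureGroupDrop {
  val accuracy: Map[String, Double] = Map(
    "ALL"  -> 0.688,
    "LOW"  -> 0.610,
    "HIGH" -> 0.679
  )

  // Accuracy lost when training on `subset` only instead of on all features.
  def dropFromAll(subset: String): Double = accuracy("ALL") - accuracy(subset)

  def main(args: Array[String]): Unit = {
    // Training on LOW only means the high-level features were removed, and vice versa.
    val lostWithoutHigh = dropFromAll("LOW")
    val lostWithoutLow = dropFromAll("HIGH")
    println(f"Removing high-level features costs $lostWithoutHigh%.3f accuracy")
    println(f"Removing low-level features costs  $lostWithoutLow%.3f accuracy")
  }
}
```

Dropping the high-level features costs about 0.078 in accuracy, while dropping the low-level features costs about 0.009, which is the sense in which the high-level features carry more of the signal here.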

# 3)

Training times (in seconds) using ALL features on different numbers of cores:

| Cores | TIME (s) |
|-------|----------|
|    1  |994.13|
|    5  |734.25|
|   10  |682.49|
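
To read these timings as scaling behaviour, the usual quantities are the speedup S(n) = T(1) / T(n) and the parallel efficiency E(n) = S(n) / n. The following is a minimal sketch that applies these two definitions to the table values; the object name is illustrative. If the cluster runs used the same timing code as the notebook cells, the figures cover the whole job (CSV read, cross-validation and evaluation), not only model fitting.

```scala
object Speedup {
  // Wall-clock times in seconds, copied from the table above (cores -> time).
  val times: Seq[(Int, Double)] = Seq(1 -> 994.13, 5 -> 734.25, 10 -> 682.49)

  // Speedup S(n) = T(1) / T(n).
  def speedup(t1: Double, tn: Double): Double = t1 / tn

  // Parallel efficiency E(n) = S(n) / n.
  def efficiency(t1: Double, tn: Double, cores: Int): Double = speedup(t1, tn) / cores

  def main(args: Array[String]): Unit = {
    val t1 = times.head._2
    for ((cores, tn) <- times) {
      val s = speedup(t1, tn)
      val e = efficiency(t1, tn, cores)
      println(f"$cores%2d cores: speedup $s%.2f, efficiency $e%.2f")
    }
  }
}
```

The speedup is roughly 1.35 on 5 cores and 1.46 on 10 cores, so efficiency falls from about 0.27 to about 0.15: most of the extra cores are not being put to use.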



# DECISION TREES REGRESSION:

# 1) & 2)

The table shows the best parameter configuration found by the cross-validator for each feature set (all, low-level only, high-level only), together with its RMSE on the test data and the run time in seconds:

| Features |  MaxBins  | MaxDepth| Impurity |   RMSE   | TIME (s) |
|----------|-----------|---------|----------|----------|----------|
|   ALL    |    20     |   10    | variance |  0.449   |393.80|
|   LOW    |    30     |   10    | variance |  0.484   |337.43|
|   HIGH   |    10     |   10    | variance |  0.449   |313.54|

As the table shows, the lowest RMSE (0.449) is obtained with all the features and with the high-level features alone, while the low-level features alone give a higher RMSE (0.484). Since a lower RMSE is better, the high-level features are also the more relevant ones for regression.
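
RMSE is an error measure, so lower values are better. A useful reference point when the label is 0 or 1, as it is in HIGGS: a predictor that always outputs 0.5 is wrong by exactly 0.5 on every row, so its RMSE is 0.5. The sketch below checks this on made-up labels; the object name and the toy data are illustrative only.

```scala
object RmseBaseline {
  // Root mean squared error between predictions and labels of equal length.
  def rmse(predictions: Seq[Double], labels: Seq[Double]): Double = {
    require(predictions.length == labels.length && labels.nonEmpty)
    val squaredErrors = predictions.zip(labels).map { case (p, y) => (p - y) * (p - y) }
    math.sqrt(squaredErrors.sum / labels.length)
  }

  def main(args: Array[String]): Unit = {
    // Toy labels, made up for illustration only (not HIGGS data).
    val labels = Seq(1.0, 0.0, 1.0, 1.0, 0.0)
    val constantHalf = Seq.fill(labels.length)(0.5)
    // Every squared error is 0.25, so the RMSE is sqrt(0.25).
    println(rmse(constantHalf, labels))
  }
}
```

Against that reference, the values in the table (0.449 to 0.484) sit only moderately below 0.5.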

# 3)

Training times (in seconds) using ALL features on different numbers of cores:

| Cores | TIME (s) |
|-------|----------|
|    1  |521.22|
|    5  |373.99|
|   10  |379.97|
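
Here the 10-core run is slightly slower than the 5-core run. One way to quantify this, offered as a sketch and not as part of the original analysis, is the Karp-Flatt metric e = (1/S - 1/n) / (1 - 1/n), where S = T(1) / T(n) is the measured speedup on n cores. It estimates the fraction of the job that behaves as if it were serial. The object and method names are illustrative.

```scala
object KarpFlatt {
  // Experimentally determined serial fraction for a run on `cores` cores.
  // t1: time on one core, tn: time on `cores` cores (same units).
  def serialFraction(t1: Double, tn: Double, cores: Int): Double = {
    require(cores > 1, "the metric is undefined for a single core")
    val inverseSpeedup = tn / t1 // 1 / S
    (inverseSpeedup - 1.0 / cores) / (1.0 - 1.0 / cores)
  }

  def main(args: Array[String]): Unit = {
    // Times in seconds, copied from the table above.
    val t1 = 521.22
    for ((cores, tn) <- Seq(5 -> 373.99, 10 -> 379.97)) {
      val e = serialFraction(t1, tn, cores)
      println(f"$cores%2d cores: estimated serial fraction $e%.2f")
    }
  }
}
```

The estimate is roughly 0.65 on 5 cores and 0.70 on 10 cores. A value that grows with the core count usually points to overhead (scheduling, data movement) increasing as more cores are added.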



# LOGISTIC REGRESSION:

# 1) & 2)

The table shows the best parameter configuration found by the cross-validator for each feature set (all, low-level only, high-level only), together with its accuracy on the test data and the run time in seconds:

| Features | ElasticNet | RegParam |  MaxIter | ACCURACY | TIME (s) |
|----------|------------|----------|----------|----------|----------|
|   ALL    |    0      |    0    |     10   |  0.639   |773.56|
|   LOW    |    0      |    0    |     5    |  0.565   |701.38|
|   HIGH   |    0      |    0    |     10   |  0.617   |517.53|

As the table shows, the best accuracy is obtained when using all the features. High-level features alone again give better accuracy than low-level features alone (0.617 against 0.565), which confirms the decision tree classification result above: high-level features are more relevant for classification.

# 3)

Training times (in seconds) using ALL features on different numbers of cores:

| Cores | TIME (s) |
|-------|----------|
|    1  |1007.93|
|    5  | 735.44|
|   10  | 694.67|
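
The run times are easier to interpret next to the number of models each cross-validation has to fit. With k folds, every point of the parameter grid is fitted k times, and Spark's `CrossValidator` then refits the best configuration once on the full training set. The sketch below counts the fits using the grid sizes and the 5 folds from the code cells in this notebook; the object and method names are illustrative.

```scala
object CrossValidationCost {
  // Number of grid points, given how many candidate values each parameter has.
  def gridSize(valuesPerParam: Seq[Int]): Int = valuesPerParam.product

  // One fit per grid point per fold, plus one final refit of the best configuration.
  def totalFits(valuesPerParam: Seq[Int], folds: Int): Int =
    gridSize(valuesPerParam) * folds + 1

  def main(args: Array[String]): Unit = {
    val folds = 5
    val grids = Seq(
      "Decision tree classification" -> Seq(3, 3, 2), // maxBins, maxDepth, impurity
      "Decision tree regression"     -> Seq(3, 3, 1), // maxBins, maxDepth, impurity
      "Logistic regression"          -> Seq(6, 3, 5)  // elasticNetParam, regParam, maxIter
    )
    for ((name, grid) <- grids) {
      val points = gridSize(grid)
      val fits = totalFits(grid, folds)
      println(s"$name: $points grid points, $fits model fits")
    }
  }
}
```

That gives 91 fits for decision tree classification, 46 for decision tree regression and 451 for logistic regression.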


In [13]:
val sparkVersion = "2.0.1"
val scalaVersion = scala.util.Properties.versionNumberString

sparkVersion: String = "2.0.1"
scalaVersion: String = "2.11.8"

In [14]:
classpath.add(
    "org.apache.spark" %% "spark-yarn" % sparkVersion,
    "org.apache.spark" %% "spark-mllib" % sparkVersion
)

0 new artifact(s)




In [7]:
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.DecisionTreeClassificationModel
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.sql.types.DoubleType
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

object DT_Class {
    
    def main(args: Array[String]) {
        
        val sparkSession = SparkSession.builder
            .master("local")
            .appName("DT_Class")
            .getOrCreate()
        
        val sc = sparkSession.sparkContext
        
        import sparkSession.implicits._

val timing = System.nanoTime
        
val df = sparkSession.read.format("com.databricks.spark.csv")
        .option("header", "false") // the CSV file has no header row
        .option("mode", "DROPMALFORMED")
        .load("files/HIGGS_r.csv.gz").toDF


val toDouble = udf[Double, Double]( _.toDouble)
   
val dfnew = df
.withColumn("label", toDouble(df("_c0")))
.withColumn("lepton_pT", toDouble(df("_c1")))
.withColumn("lepton_eta", toDouble(df("_c2")))
.withColumn("lepton_phi", toDouble(df("_c3")))
.withColumn("missing_energy_magnitude", toDouble(df("_c4")))
.withColumn("missing_energy_phi", toDouble(df("_c5")))
.withColumn("jet_1_pt", toDouble(df("_c6")))
.withColumn("jet_1_eta", toDouble(df("_c7")))
.withColumn("jet_1_phi", toDouble(df("_c8")))
.withColumn("jet_1_b-tag", toDouble(df("_c9")))
.withColumn("jet_2_pt", toDouble(df("_c10")))
.withColumn("jet_2_eta", toDouble(df("_c11")))
.withColumn("jet_2_phi", toDouble(df("_c12")))
.withColumn("jet_2_b-tag", toDouble(df("_c13")))
.withColumn("jet_3_pt", toDouble(df("_c14")))
.withColumn("jet_3_eta", toDouble(df("_c15")))
.withColumn("jet_3_phi", toDouble(df("_c16")))
.withColumn("jet_3_b-tag", toDouble(df("_c17")))
.withColumn("jet_4_pt", toDouble(df("_c18")))
.withColumn("jet_4_eta", toDouble(df("_c19")))
.withColumn("jet_4_phi", toDouble(df("_c20")))
.withColumn("jet_4_b-tag", toDouble(df("_c21")))
.withColumn("m_jj", toDouble(df("_c22")))
.withColumn("m_jjj", toDouble(df("_c23")))
.withColumn("m_lv", toDouble(df("_c24")))
.withColumn("m_jlv", toDouble(df("_c25")))
.withColumn("m_bb", toDouble(df("_c26")))
.withColumn("m_wbb", toDouble(df("_c27")))
.withColumn("m_wwbb", toDouble(df("_c28")))
.select("label","lepton_pT", "lepton_eta", "lepton_phi", "missing_energy_magnitude", "missing_energy_phi", "jet_1_pt", "jet_1_eta", "jet_1_phi", "jet_1_b-tag",
"jet_2_pt", "jet_2_eta", "jet_2_phi", "jet_2_b-tag", "jet_3_pt", "jet_3_eta", "jet_3_phi", "jet_3_b-tag", "jet_4_pt", "jet_4_eta", "jet_4_phi",
"jet_4_b-tag", "m_jj", "m_jjj", "m_lv", "m_jlv", "m_bb", "m_wbb", "m_wwbb")
    
        //dfnew.show()
     
val assembler = new VectorAssembler()
  .setInputCols(Array("lepton_pT", "lepton_eta", "lepton_phi", "missing_energy_magnitude", "missing_energy_phi", "jet_1_pt", "jet_1_eta", "jet_1_phi", "jet_1_b-tag",
"jet_2_pt", "jet_2_eta", "jet_2_phi", "jet_2_b-tag", "jet_3_pt", "jet_3_eta", "jet_3_phi", "jet_3_b-tag", "jet_4_pt", "jet_4_eta", "jet_4_phi",
"jet_4_b-tag", "m_jj", "m_jjj", "m_lv", "m_jlv", "m_bb", "m_wbb", "m_wwbb"))
  .setOutputCol("features")

// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = dfnew.randomSplit(Array(0.7, 0.3))

// Train a DecisionTree model.
val dt = new DecisionTreeClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
        
// Pipeline.
val pipeline = new Pipeline()
  .setStages(Array(assembler, dt))
        
        
val paramGrid = new ParamGridBuilder()
.addGrid(dt.maxBins, Array(10, 20, 30))
.addGrid(dt.maxDepth, Array(10, 15, 20))
.addGrid(dt.impurity, Array("entropy", "gini"))
.build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new MulticlassClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(5)         

// Run cross-validation, and choose the best set of parameters.
val cvModel = cv.fit(trainingData)
// Make predictions.
val predictions = cvModel.transform(testData)

// Select (prediction, true label) and compute accuracy on the test data.
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println("Accuracy: "  + (accuracy))

val bestparameters = cvModel.bestModel.asInstanceOf[PipelineModel]
val stages = bestparameters.stages

val dtstage = stages(1).asInstanceOf[DecisionTreeClassificationModel]
println("maxBins = " + dtstage.getMaxBins)
println("maxDepth = " + dtstage.getMaxDepth)
println("impurity = " + dtstage.getImpurity)
        
println("Time (seconds)" + ((System.nanoTime-timing) / 1e9d))

       
}
}
DT_Class.main(Array())

Accuracy: 0.6935890480901171
maxBins = 30
maxDepth = 10
impurity = entropy
Time (seconds)748.040070298
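
The cell above calls `randomSplit(Array(0.7, 0.3))` without a seed, so each program in this notebook draws its own training/test partition. A minimal sketch of one way to make the split repeatable, assuming a fixed seed is passed as the second argument of `randomSplit`; the object name, the toy DataFrame and the seed value are illustrative, and repeatability also relies on the input being read and partitioned the same way each time.

```scala
import org.apache.spark.sql.SparkSession

object SeededSplit {
  // Returns true if two splits of the same DataFrame with the same seed
  // produce identical test sets.
  def sameTestSet(spark: SparkSession, seed: Long): Boolean = {
    import spark.implicits._

    // Toy data standing in for the HIGGS rows.
    val df = (1 to 1000).toDF("id")

    val Array(_, testA) = df.randomSplit(Array(0.7, 0.3), seed)
    val Array(_, testB) = df.randomSplit(Array(0.7, 0.3), seed)

    // The two test sets are equal if neither contains a row the other lacks.
    testA.except(testB).count() == 0 && testB.except(testA).count() == 0
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("SeededSplit")
      .getOrCreate()
    val seed = 6012L // arbitrary; any fixed Long would do
    println("Same test set with the same seed: " + sameTestSet(spark, seed))
    spark.stop()
  }
}
```

Using the same seed in `DT_Class`, `DT_Reg` and `Logistic_Reg` would let the three algorithms be compared on the same rows.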



In [8]:
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}
import org.apache.spark.ml.feature.{RegexTokenizer, Tokenizer}
import org.apache.spark.ml.feature.StopWordsRemover
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.sql.types.DoubleType
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.DecisionTreeRegressor
import org.apache.spark.ml.regression.DecisionTreeRegressionModel
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

object DT_Reg {
    
    def main(args: Array[String]) {
        
        val sparkSession = SparkSession.builder
            .master("local")
            .appName("DT_Reg")
            .getOrCreate()
        
        val sc = sparkSession.sparkContext
        
        import sparkSession.implicits._

val timing = System.nanoTime

val df = sparkSession.read.format("com.databricks.spark.csv")
        .option("header", "false") // the CSV file has no header row
        .option("mode", "DROPMALFORMED")
        .load("files/HIGGS_r.csv.gz").toDF


val toDouble = udf[Double, Double]( _.toDouble)
   
val dfnew = df
.withColumn("label", toDouble(df("_c0")))
.withColumn("lepton_pT", toDouble(df("_c1")))
.withColumn("lepton_eta", toDouble(df("_c2")))
.withColumn("lepton_phi", toDouble(df("_c3")))
.withColumn("missing_energy_magnitude", toDouble(df("_c4")))
.withColumn("missing_energy_phi", toDouble(df("_c5")))
.withColumn("jet_1_pt", toDouble(df("_c6")))
.withColumn("jet_1_eta", toDouble(df("_c7")))
.withColumn("jet_1_phi", toDouble(df("_c8")))
.withColumn("jet_1_b-tag", toDouble(df("_c9")))
.withColumn("jet_2_pt", toDouble(df("_c10")))
.withColumn("jet_2_eta", toDouble(df("_c11")))
.withColumn("jet_2_phi", toDouble(df("_c12")))
.withColumn("jet_2_b-tag", toDouble(df("_c13")))
.withColumn("jet_3_pt", toDouble(df("_c14")))
.withColumn("jet_3_eta", toDouble(df("_c15")))
.withColumn("jet_3_phi", toDouble(df("_c16")))
.withColumn("jet_3_b-tag", toDouble(df("_c17")))
.withColumn("jet_4_pt", toDouble(df("_c18")))
.withColumn("jet_4_eta", toDouble(df("_c19")))
.withColumn("jet_4_phi", toDouble(df("_c20")))
.withColumn("jet_4_b-tag", toDouble(df("_c21")))
.withColumn("m_jj", toDouble(df("_c22")))
.withColumn("m_jjj", toDouble(df("_c23")))
.withColumn("m_lv", toDouble(df("_c24")))
.withColumn("m_jlv", toDouble(df("_c25")))
.withColumn("m_bb", toDouble(df("_c26")))
.withColumn("m_wbb", toDouble(df("_c27")))
.withColumn("m_wwbb", toDouble(df("_c28")))
.select("label","lepton_pT", "lepton_eta", "lepton_phi", "missing_energy_magnitude", "missing_energy_phi", "jet_1_pt", "jet_1_eta", "jet_1_phi", "jet_1_b-tag",
"jet_2_pt", "jet_2_eta", "jet_2_phi", "jet_2_b-tag", "jet_3_pt", "jet_3_eta", "jet_3_phi", "jet_3_b-tag", "jet_4_pt", "jet_4_eta", "jet_4_phi",
"jet_4_b-tag", "m_jj", "m_jjj", "m_lv", "m_jlv", "m_bb", "m_wbb", "m_wwbb")
    
        //dfnew.show()

     
val assembler = new VectorAssembler()
  .setInputCols(Array("m_jj", "m_jjj", "m_lv", "m_jlv", "m_bb", "m_wbb", "m_wwbb"))
  .setOutputCol("features")


        
val dt = new DecisionTreeRegressor()
  .setLabelCol("label")
  .setFeaturesCol("features")
  
 
// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = dfnew.randomSplit(Array(0.7, 0.3))

// Pipeline.
val pipeline = new Pipeline()
  .setStages(Array(assembler, dt))
        
        
val paramGrid = new ParamGridBuilder()
.addGrid(dt.maxBins, Array(10, 20, 30))
.addGrid(dt.maxDepth, Array(10, 15, 20))
.addGrid(dt.impurity, Array("variance"))
.build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new RegressionEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(5)         

// Run cross-validation, and choose the best set of parameters.
val cvModel = cv.fit(trainingData)
// Make predictions.
val predictions = cvModel.transform(testData)

// Select (prediction, true label) and compute test error.
val evaluator = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("rmse")
val rmse = evaluator.evaluate(predictions)
println("Root Mean Squared Error (RMSE) on test data = " + rmse)
        
val bestparameters = cvModel.bestModel.asInstanceOf[PipelineModel]
val stages = bestparameters.stages

val dtstage = stages(1).asInstanceOf[DecisionTreeRegressionModel]
println("maxBins = " + dtstage.getMaxBins)
println("maxDepth = " + dtstage.getMaxDepth)
println("impurity = " + dtstage.getImpurity)

println("Time (seconds)" + ((System.nanoTime-timing) / 1e9d))

}
}
DT_Reg.main(Array())

Root Mean Squared Error (RMSE) on test data = 0.45087236357115557
maxBins = 20
maxDepth = 10
impurity = variance
Time (seconds)312.495914529



In [16]:
import org.apache.spark.ml.classification.{BinaryLogisticRegressionSummary, LogisticRegression, LogisticRegressionModel}
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.DecisionTreeClassificationModel
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}
import org.apache.spark.ml.feature.{RegexTokenizer, Tokenizer}
import org.apache.spark.ml.feature.StopWordsRemover
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.sql.types.DoubleType
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}


object Logistic_Reg {
    
    def main(args: Array[String]) {
        
        val sparkSession = SparkSession.builder
            .master("local")
            .appName("Logistic_Reg")
            .getOrCreate()
        
        val sc = sparkSession.sparkContext
        
        import sparkSession.implicits._

val timing = System.nanoTime

val df = sparkSession.read.format("com.databricks.spark.csv")
        .option("header", "false") // the CSV file has no header row
        .option("mode", "DROPMALFORMED")
        .load("files/HIGGS_r.csv.gz").toDF


val toDouble = udf[Double, Double]( _.toDouble)
   
val dfnew = df
.withColumn("label", toDouble(df("_c0")))
.withColumn("lepton_pT", toDouble(df("_c1")))
.withColumn("lepton_eta", toDouble(df("_c2")))
.withColumn("lepton_phi", toDouble(df("_c3")))
.withColumn("missing_energy_magnitude", toDouble(df("_c4")))
.withColumn("missing_energy_phi", toDouble(df("_c5")))
.withColumn("jet_1_pt", toDouble(df("_c6")))
.withColumn("jet_1_eta", toDouble(df("_c7")))
.withColumn("jet_1_phi", toDouble(df("_c8")))
.withColumn("jet_1_b-tag", toDouble(df("_c9")))
.withColumn("jet_2_pt", toDouble(df("_c10")))
.withColumn("jet_2_eta", toDouble(df("_c11")))
.withColumn("jet_2_phi", toDouble(df("_c12")))
.withColumn("jet_2_b-tag", toDouble(df("_c13")))
.withColumn("jet_3_pt", toDouble(df("_c14")))
.withColumn("jet_3_eta", toDouble(df("_c15")))
.withColumn("jet_3_phi", toDouble(df("_c16")))
.withColumn("jet_3_b-tag", toDouble(df("_c17")))
.withColumn("jet_4_pt", toDouble(df("_c18")))
.withColumn("jet_4_eta", toDouble(df("_c19")))
.withColumn("jet_4_phi", toDouble(df("_c20")))
.withColumn("jet_4_b-tag", toDouble(df("_c21")))
.withColumn("m_jj", toDouble(df("_c22")))
.withColumn("m_jjj", toDouble(df("_c23")))
.withColumn("m_lv", toDouble(df("_c24")))
.withColumn("m_jlv", toDouble(df("_c25")))
.withColumn("m_bb", toDouble(df("_c26")))
.withColumn("m_wbb", toDouble(df("_c27")))
.withColumn("m_wwbb", toDouble(df("_c28")))
.select("label","lepton_pT", "lepton_eta", "lepton_phi", "missing_energy_magnitude", "missing_energy_phi", "jet_1_pt", "jet_1_eta", "jet_1_phi", "jet_1_b-tag",
"jet_2_pt", "jet_2_eta", "jet_2_phi", "jet_2_b-tag", "jet_3_pt", "jet_3_eta", "jet_3_phi", "jet_3_b-tag", "jet_4_pt", "jet_4_eta", "jet_4_phi",
"jet_4_b-tag", "m_jj", "m_jjj", "m_lv", "m_jlv", "m_bb", "m_wbb", "m_wwbb")
    
        //dfnew.show()


val assembler = new VectorAssembler()
  .setInputCols(Array("lepton_pT", "lepton_eta", "lepton_phi", "missing_energy_magnitude", "missing_energy_phi", "jet_1_pt", "jet_1_eta", "jet_1_phi", "jet_1_b-tag",
"jet_2_pt", "jet_2_eta", "jet_2_phi", "jet_2_b-tag", "jet_3_pt", "jet_3_eta", "jet_3_phi", "jet_3_b-tag", "jet_4_pt", "jet_4_eta", "jet_4_phi",
"jet_4_b-tag", "m_jj", "m_jjj", "m_lv", "m_jlv", "m_bb", "m_wbb", "m_wwbb"))
  .setOutputCol("features")
                
val lr = new LogisticRegression()
  .setLabelCol("label")
  .setFeaturesCol("features")
  
// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = dfnew.randomSplit(Array(0.7, 0.3))

// Pipeline.
val pipeline = new Pipeline()
  .setStages(Array(assembler, lr))
        
val paramGrid = new ParamGridBuilder()
.addGrid(lr.elasticNetParam, Array(0, 0.2, 0.4, 0.6, 0.8, 1))
.addGrid(lr.regParam, Array(0.0, 0.5, 1.0))
.addGrid(lr.maxIter, Array(3,4,5,8,10))
.build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new MulticlassClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(5)         

// Run cross-validation, and choose the best set of parameters.
val cvModel = cv.fit(trainingData)
// Make predictions.
val predictions = cvModel.transform(testData)

// Select (prediction, true label) and compute accuracy on the test data.
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println("Accuracy: "  + (accuracy))

val bestparameters = cvModel.bestModel.asInstanceOf[PipelineModel]
val stages = bestparameters.stages

val lrstage = stages(1).asInstanceOf[LogisticRegressionModel]
println("elasticnetparam = " + lrstage.getElasticNetParam)
println("regparam = " + lrstage.getRegParam)
println("maxiter = " + lrstage.getMaxIter)

println("Time (seconds)" + ((System.nanoTime-timing) / 1e9d))

}
}
Logistic_Reg.main(Array())

Accuracy: 0.6416240562937792
elasticnetparam = 0.0
regparam = 0.0
maxiter = 10
Time (seconds)773.832084513

