# Lab-1: Linear Regression with SparkML

## 1. Goal
    * Recalling the important concepts of Scala
    * Getting a better understanding of the Spark APIs
    * Making it easier to write ML programs in Spark
## 2. Agenda
    * 6 parts (each with introduction, group design, discussion, programming)
    * Part-1: Scala
    * Part-2: Inspect data (Without Programming)
    * Part-3: Inspect data (with Spark datasets: RDD, Dataset, Dataframe)
    * Part-4: Transformer, Estimator, Pipeline
    * Part-5: Linear Regression 
    * Part-6: Hyper Parameter Tuning

---
## Part-1: Scala
### Concepts
    * Object-oriented and functional programming
    * Syntax : => var val def () _ apply unary operators
    * Immutable vs mutable
    * Object/Class/Case Class
    * List, Array, Tuple
    * Lambdas & higher-order functions
    * map, reduce, filter, etc.
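
The concepts listed above can be seen together in a few lines. The following is a minimal sketch on made-up data; `Point`, `applyTwice`, and the other names are hypothetical and only serve as illustration.

```scala
// Hypothetical example data; all names here are illustrative only.
case class Point(x: Int, y: Int)   // case class: immutable fields, structural equality

// val: the reference cannot be reassigned. Point(1, 2) is sugar for Point.apply(1, 2).
val points = List(Point(1, 2), Point.apply(3, 4), Point(5, 6))

// A lambda bound to a name
val double: Int => Int = n => n * 2

// The underscore is a placeholder for the lambda's argument
val xs = points.map(_.x)

// A higher-order function: it takes another function as an argument
def applyTwice(f: Int => Int, n: Int): Int = f(f(n))

val doubledTwice = applyTwice(double, 3)
val evenYs = points.map(_.y).filter(_ % 2 == 0)
val sumX = xs.reduce(_ + _)

// var: mutable, can be reassigned
var counter = 0
counter += 1

// A tuple groups values of different types without declaring a class
val pair: (String, Int) = ("sumX", sumX)
println(s"xs=$xs evenYs=$evenYs doubledTwice=$doubledTwice pair=$pair counter=$counter")
```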
### Task: Design a Scala program that answers the questions below using the 'studentText' String:
    1. Total number of students
    2. Min, max, and avg ages among the students
    3. Distinct list of nationalities among the students
    
#### Note: Use the important concepts of Scala (e.g., case class, collections, lambda functions, and higher-order functions) in your design.

In [None]:
val studentText = """John,Wikes,36,USA
Sonia,Ericsson,27,Sweden
Kalle,Johonsson,24,Sweden
Peter,Alvaro,25,USA
Diego,Nickolson,38,Argentina
Sujith,Daga,31,India"""
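
One possible design, given here only as a sketch: parse each line into a case class and answer the questions with collection operations. The names `Student`, `students`, and the result values are hypothetical, and the text is repeated so that the snippet runs on its own.

```scala
// Hypothetical design; Student and the value names below are not prescribed by the lab.
case class Student(firstName: String, lastName: String, age: Int, country: String)

val studentText = """John,Wikes,36,USA
Sonia,Ericsson,27,Sweden
Kalle,Johonsson,24,Sweden
Peter,Alvaro,25,USA
Diego,Nickolson,38,Argentina
Sujith,Daga,31,India"""

// Split into lines, then into fields, then build one Student per line
val students: List[Student] = studentText
  .split("\n").toList
  .map(_.trim.split(","))
  .map { case Array(first, last, age, country) => Student(first, last, age.toInt, country) }

// 1. Total number of students
val total = students.size

// 2. Min, max, and average age
val ages = students.map(_.age)
val (minAge, maxAge) = (ages.min, ages.max)
val avgAge = ages.sum.toDouble / ages.size

// 3. Distinct nationalities, in order of first appearance
val nationalities = students.map(_.country).distinct

println(s"total=$total min=$minAge max=$maxAge avg=$avgAge nationalities=$nationalities")
```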


---
## Part-2: Inspect Data (Without Programming)
### Aim
    * Understanding the importance of getting a feel for the data without programming.
    
### Task: Download the provided datafile and inspect the content without writing any program. Answer the questions:
    1. How big is it?
    2. What is the separator used?
    3. How many fields are in the data?
    4. What are the types of the fields?
    5. What are the data ranges?
    6. Are there anomalous values in the data?

---
## Part-3: Inspecting Data with Spark
### Building Blocks
    * RDD
    * Dataset
    * DataFrame (printSchema, count, show, where, select, groupBy, orderBy) 
    * Column
    * GroupedData
    * Aggregation higher-order functions
    * Queries
### Task: Design a program with the Spark Data API that can answer the questions below:
    1. What is the total number of songs in the given dataset?
    2. How many songs were released between the years 1998 and 2000?
    3. What are the min, max and mean values of the year column?
    4. Show the number of songs per year between the years 2000 and 2010?

In [None]:
case class Song(year: Double, f1: Double, f2: Double, f3: Double)
val rdd = sc.textFile("data.txt")
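
The building blocks above could be combined roughly as follows. This is a sketch under stated assumptions, not the reference solution: the rows are a small made-up sample, the separator is assumed to be a comma (check this in Part-2), and only the first four fields are kept. In the lab, the lines would come from `sc.textFile("data.txt")` instead of the in-memory `sampleLines`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, max, min}

case class Song(year: Double, f1: Double, f2: Double, f3: Double)

// In a notebook or spark-shell a session usually exists already; getOrCreate reuses it.
val spark = SparkSession.builder().master("local[*]").appName("lab1-part3-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample rows standing in for the real file (comma separator is an assumption)
val sampleLines = Seq(
  "1998.0,0.1,0.2,0.3", "1999.0,0.4,0.5,0.6", "2001.0,0.7,0.8,0.9",
  "2001.0,0.2,0.3,0.4", "2005.0,0.5,0.6,0.7", "2011.0,0.8,0.9,1.0")

// RDD[String] -> RDD[Song] -> Dataset[Song]
val songs = spark.sparkContext.parallelize(sampleLines)
  .map(_.split(","))
  .map(a => Song(a(0).toDouble, a(1).toDouble, a(2).toDouble, a(3).toDouble))
  .toDS()

songs.printSchema()

// 1. Total number of songs
val total = songs.count()

// 2. Songs released between 1998 and 2000 (both ends included)
val releasedBetween = songs.where($"year" >= 1998 && $"year" <= 2000).count()

// 3. Min, max and mean of the year column
val stats = songs.agg(min($"year"), max($"year"), avg($"year")).first()

// 4. Number of songs per year between 2000 and 2010
val perYear = songs.where($"year".between(2000, 2010)).groupBy($"year").count().orderBy($"year")

println(s"total=$total releasedBetween=$releasedBetween stats=$stats")
perYear.show()
```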

---
## Part-4: Spark Pipeline APIs
### Building Blocks
    * Adding more data columns
    * Transformer (RegexTokenizer, Imputer, VectorAssembler, VectorSlicer, StandardScaler)
    * Estimator
    * setInputCol, setInputCols, setOutputCol, setOutputCols, fit, transform
    * Pipeline
    * Custom Transformer and Estimator
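
To make the roles concrete: a Transformer maps one DataFrame to another, an Estimator is fitted on data and produces a Transformer, and a Pipeline chains both. The sketch below illustrates this with a `VectorAssembler` (Transformer) and a `StandardScaler` (Estimator) on a made-up DataFrame; the column names and values are assumptions, and this is not necessarily the pipeline the design task is looking for.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("lab1-part4-sketch").getOrCreate()
import spark.implicits._

// Hypothetical input: one label-like column and three numeric feature columns
val raw = Seq(
  (2001.0, 0.88, 0.61, 0.60),
  (1999.0, 0.85, 0.57, 0.52),
  (2005.0, 0.91, 0.66, 0.47),
  (2010.0, 0.79, 0.70, 0.55)
).toDF("year", "f1", "f2", "f3")

// Transformer: packs several numeric columns into a single vector column
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2", "f3"))
  .setOutputCol("features")

// Estimator: fit() learns mean and standard deviation, the fitted model then rescales
val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .setWithMean(true)
  .setWithStd(true)

// Pipeline: the stages run in order, each stage's output column feeds the next
val pipeline = new Pipeline().setStages(Array(assembler, scaler))
val pipelineModel = pipeline.fit(raw)
val prepared = pipelineModel.transform(raw).withColumnRenamed("year", "label")

prepared.select("label", "scaledFeatures").show(false)
```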
### Design Task: design a pipeline that receives the 'millionsongs' datafile and prepares the training data (feature vector and label). Questions to think about in your design:
    1. What transformers and estimators are needed?
    2. How would you connect the chosen transformers and estimators in your pipeline?
    3. Can you find any usage for the provided custom transformers (Vector2DoubleUDF & DoubleUDF)?
### Programming
    1. Implement the designed pipeline

In [None]:
import org.apache.spark.sql.functions._

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.DoubleType
import org.apache.spark.ml.param.ParamMap

// Wraps a Vector => Double function as a Transformer so it can be used as a pipeline stage.
class Vector2DoubleUDF(override val uid: String, val udf: Vector => Double)
    extends UnaryTransformer[Vector, Double, Vector2DoubleUDF] {

  def this(udf: Vector => Double) = this(Identifiable.randomUID("vector2DoubleUDF"), udf)

  // The function applied to every value of the input column
  override protected def createTransformFunc: Vector => Double = udf

  override protected def outputDataType: DoubleType = {
    DoubleType
  }
  
  // Keep the uid and copy all params that are set, plus those in `extra`
  override def copy(extra: ParamMap): Vector2DoubleUDF = {
    copyValues(new Vector2DoubleUDF(uid, udf), extra)
  }
}


// Wraps a Double => Double function as a Transformer so it can be used as a pipeline stage.
class DoubleUDF(override val uid: String, val udf: Double => Double)
    extends UnaryTransformer[Double, Double, DoubleUDF] {

  def this(udf: Double => Double) = this(Identifiable.randomUID("doubleUDF"), udf)

  // The function applied to every value of the input column
  override protected def createTransformFunc: Double => Double = udf

  override protected def outputDataType: DoubleType = {
    DoubleType
  }
  
  // Keep the uid and copy all params that are set, plus those in `extra`
  override def copy(extra: ParamMap): DoubleUDF = {
    copyValues(new DoubleUDF(uid, udf), extra)
  }
}

---
## Part-5: Linear Regression with Spark
### Building Blocks
    * LinearRegression 
    * LinearRegressionModel
    * LinearRegressionSummary
    
### Task: Train a LinearRegression model on the provided data. Use the following parameters in your design:
    1. Maximum number of iterations (maxIter): 10
    2. Regularization parameter (regParam): 0.1
    3. Elastic net mixing parameter (elasticNetParam): 0.1
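
The three settings map onto setters of `LinearRegression`. The sketch below trains on a small synthetic dataset (an assumption, generated from a known linear relation) instead of the lab data; in the lab, `training` would be the output of the Part-4 pipeline with `label` and `features` columns.

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("lab1-part5-sketch").getOrCreate()

// Hypothetical training data: label = 2*x1 + 3*x2 + 1
val training = spark.createDataFrame(
  (0 until 40).map { i =>
    val x1 = i.toDouble
    val x2 = (i % 7).toDouble
    (2.0 * x1 + 3.0 * x2 + 1.0, Vectors.dense(x1, x2))
  }
).toDF("label", "features")

// Estimator configured with the parameters from the task
val lr = new LinearRegression()
  .setMaxIter(10)
  .setRegParam(0.1)
  .setElasticNetParam(0.1)

// fit() returns a LinearRegressionModel; its summary holds training metrics
val lrModel = lr.fit(training)
val summary = lrModel.summary

println(s"coefficients=${lrModel.coefficients} intercept=${lrModel.intercept}")
println(s"RMSE=${summary.rootMeanSquaredError} r2=${summary.r2}")
```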

---
## Part-6: Hyper Parameter Tuning
### Building Blocks
    * Evaluator, RegressionEvaluator
    * ParamGridBuilder 
    * CrossValidator, CrossValidatorModel 
### Task: Use grid-search cross-validation to find the best hyperparameters for your training. Which combination of the following parameters gives you the best model?

    1. iterations {10, 20, 50, 100}
    2. regularization {0.1, 0.01}
    3. elastic net {0.1, 0.5, 0.9}
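
A grid search over these values could be set up as sketched below. The data is synthetic and the choices of 3 folds, RMSE as the metric, and the fixed seed are assumptions rather than requirements of the lab. The grid has 4 x 2 x 3 = 24 combinations, and each one is trained once per fold.

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("lab1-part6-sketch").getOrCreate()

// Hypothetical data: label = 2*x1 + 3*x2 + 1
val data = spark.createDataFrame(
  (0 until 60).map { i =>
    val x1 = i.toDouble
    val x2 = (i % 7).toDouble
    (2.0 * x1 + 3.0 * x2 + 1.0, Vectors.dense(x1, x2))
  }
).toDF("label", "features")

val lr = new LinearRegression()

// Every combination of the listed values
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.maxIter, Array(10, 20, 50, 100))
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .addGrid(lr.elasticNetParam, Array(0.1, 0.5, 0.9))
  .build()

// RMSE: smaller is better, which the evaluator reports to the CrossValidator
val evaluator = new RegressionEvaluator().setMetricName("rmse")

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEstimatorParamMaps(paramGrid)
  .setEvaluator(evaluator)
  .setNumFolds(3)   // assumption: 3 folds
  .setSeed(42L)     // assumption: fixed seed for repeatable fold splits

val cvModel = cv.fit(data)

// Pair each parameter combination with its average metric and pick the lowest RMSE
val (bestParams, bestRmse) = cvModel.getEstimatorParamMaps.zip(cvModel.avgMetrics).minBy(_._2)
println(s"best average RMSE=$bestRmse with params:\n$bestParams")
```

In the lab, `lr` could be replaced by the whole pipeline from Part-4 with the regression as its last stage, so that the feature preparation is refitted on each fold.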