<img src="ucsb_logo_seal.png"> 

## MLlib Regression

### PSTAT 135 / 235: Big Data Analytics
### University of California, Santa Barbara
### Last Updated: Sep 4, 2019

---  


**Sources:**  
Learning Spark, Chapter 11: Machine Learning with MLlib

*Details on regularization equation*  
https://spark.apache.org/docs/1.5.2/ml-linear-methods.html

https://spark.apache.org/docs/latest/ml-classification-regression.html#linear-regression

https://spark.apache.org/docs/latest/mllib-ensembles.html#random-forests

http://spark.apache.org/docs/latest/ml-classification-regression.html#linear-support-vector-machine


### OBJECTIVES
- Introduction to major regression models in MLlib

### CONCEPTS

- Linear regression

---

**Introduction to Regression**

Regression is another common form of supervised learning.  
The response variable in a regression problem is quantitative or continuous.  

Earlier, we discussed the classification problem, where the response variable is discrete.

Several of the models we discussed for classification also have regression counterparts, including:

- Support vector machines  
- Tree-based methods like random forests and gradient-boosted trees  

To implement the regression counterpart, the same package is loaded but a different method is called.

**Linear Regression**

Linear regression is the most fundamental model used in regression.

The model assumes a linear relationship between a set of explanatory variables X (aka features, factors, predictors, independent variables) and a scalar response variable y.

It is most often fit using the *ordinary least squares* (*OLS*) approach, which chooses the coefficients that minimize the sum of squared residuals.  
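
As a minimal sketch of what OLS computes, assume a single feature plus an intercept. The function name `fit_ols_1d` and the toy data are illustrative and not part of MLlib; in this special case the fitted slope is the sum of cross-deviations of x and y divided by the sum of squared deviations of x.

```python
def fit_ols_1d(xs, ys):
    """Closed-form OLS fit of y = intercept + slope * x for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = sum of cross-deviations / sum of squared x-deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Toy data generated exactly from y = 1 + 2x, so OLS recovers that line
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
intercept, slope = fit_ols_1d(xs, ys)
print(intercept, slope)
```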

More recently, and especially in machine learning, a regularization term is often added to the loss function to penalize large coefficients. Examples include:

- ridge regression ($L^2$-norm penalty)
- lasso ($L^1$-norm penalty)
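
For reference, the Spark documentation linked under Sources writes the regularized least-squares objective in elastic-net form, which contains both penalties as special cases. Here $n$ is the number of training points, $\lambda \ge 0$ is the regularization strength, and $\alpha \in [0, 1]$ mixes the two penalties. In the DataFrame-based `LinearRegression` estimator these correspond to the `regParam` and `elasticNetParam` arguments.

$$
\min_{w} \; \frac{1}{2n} \sum_{i=1}^{n} \left( w^{T} x_i - y_i \right)^2 + \lambda \left( \alpha \lVert w \rVert_1 + \frac{1 - \alpha}{2} \lVert w \rVert_2^2 \right)
$$

Setting $\alpha = 0$ gives ridge regression, $\alpha = 1$ gives the lasso, and $\lambda = 0$ recovers OLS.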

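**Linear Regression Implementation**  
The RDD-based method `LinearRegressionWithSGD.train` is used to train the model. This method is deprecated as of Spark 2.0 and removed in Spark 3.0; the DataFrame-based `pyspark.ml.regression.LinearRegression` is its replacement.  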

**Linear Regression Example: load data/train model/predict**


In [None]:
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

# Load and parse the data: each line is "label,feature1 feature2 ..."
def parsePoint(line):
    values = [float(x) for x in line.replace(',', ' ').split(' ')]
    return LabeledPoint(values[0], values[1:])

data = sc.textFile("data/mllib/ridge-data/lpsa.data")
parsedData = data.map(parsePoint)

# Train the model on the RDD of LabeledPoint objects.
# (pyspark.ml's LinearRegression.fit() expects a DataFrame, not an RDD,
# so the RDD-based trainer is used here; the iteration count and step size
# follow the example in the Spark documentation.)
model = LinearRegressionWithSGD.train(parsedData, iterations=100, step=0.00000001)

# Evaluate the model on training data
valuesAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
MSE = valuesAndPreds \
    .map(lambda vp: (vp[0] - vp[1])**2) \
    .reduce(lambda x, y: x + y) / valuesAndPreds.count()

print("Mean Squared Error = " + str(MSE))

Next we show the function calls used to train various models, comparing classification and regression. The code is minimal; what changes between the two cases is the function called and some of its parameters.  

**Random Forest Implementation**

*Classification Model Training*


In [None]:
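from pyspark.mllib.tree import RandomForest

# trainingData is an RDD of LabeledPoint with class labels 0 .. numClasses-1
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     numTrees=1000, featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=5, maxBins=32)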

*Regression Model Training*

In [None]:
model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo={},
                                    numTrees=1000, featureSubsetStrategy="auto",
                                    impurity='variance', maxDepth=4, maxBins=32)
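
The two calls differ mainly in the `impurity` argument: `'gini'` scores a node by how mixed its class labels are, while `'variance'` scores it by the spread of its continuous responses. A minimal pure-Python sketch of both measures follows. The function names are illustrative, and this is not Spark's implementation.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum_k p_k^2 over class proportions p_k."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def variance_impurity(values):
    """Variance of the responses in a node: mean squared deviation from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


print(gini_impurity([0, 0, 1, 1]))         # evenly mixed binary node
print(gini_impurity([1, 1, 1, 1]))         # pure node
print(variance_impurity([1.0, 2.0, 3.0]))  # spread of continuous responses
```

In both cases a split is attractive when it produces child nodes with lower impurity than the parent.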

**Gradient-Boosted Trees Implementation**

<span style="color:red"> Note that Spark GBTs support binary classification, but not multiclass classification. </span> 

*Classification Model Training*

In [None]:
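from pyspark.mllib.tree import GradientBoostedTrees

# numIterations is the number of trees in the boosted ensemble
model = GradientBoostedTrees.trainClassifier(trainingData,
                                             categoricalFeaturesInfo={}, numIterations=1000)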

*Regression Model Training*

In [None]:
model = GradientBoostedTrees.trainRegressor(trainingData,
                                            categoricalFeaturesInfo={}, numIterations=3)
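
To make the role of `numIterations` concrete, here is a simplified sketch of gradient boosting for squared-error regression on one feature. Each iteration fits a one-split tree (a stump) to the current residuals and adds it to the ensemble. All names, the toy data, and the learning rate of 0.5 are assumptions chosen for illustration; this is not how Spark implements GBTs.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on x that minimizes squared error."""
    best = None
    # Exclude the largest x so the right-hand side of a split is never empty
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, num_iterations, learning_rate=0.5):
    """Boosting for squared loss: each new stump fits the current residuals."""
    base = sum(ys) / len(ys)  # initial constant prediction
    stumps = []

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    for _ in range(num_iterations):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


def mse(predict, xs, ys):
    """Mean squared error of a prediction function on (xs, ys)."""
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.0]
mse_3 = mse(boost(xs, ys, num_iterations=3), xs, ys)
mse_20 = mse(boost(xs, ys, num_iterations=20), xs, ys)
print(mse_3, mse_20)  # training error shrinks as iterations are added
```

Training error keeps falling as iterations are added, which is why the number of iterations is usually chosen using held-out data rather than training error.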

**Linear SVM Implementation**

By default, linear SVMs are trained with $L^2$ regularization. $L^1$ regularization is also supported.
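
As a rough illustration of the objective being minimized, the sketch below evaluates the average hinge loss plus an $L^2$ penalty for a fixed weight vector. It assumes labels coded as -1/+1 and omits the intercept. The function name and toy data are illustrative, and this is not Spark's optimizer.

```python
def svm_objective(weights, points, labels, reg_param):
    """Average hinge loss plus (reg_param / 2) * squared L2 norm of weights."""
    hinge_total = 0.0
    for x, y in zip(points, labels):
        margin = y * sum(w * xi for w, xi in zip(weights, x))
        hinge_total += max(0.0, 1.0 - margin)  # zero once the margin reaches 1
    l2_penalty = 0.5 * reg_param * sum(w * w for w in weights)
    return hinge_total / len(points) + l2_penalty


points = [(2.0, 0.0), (0.0, 2.0), (-2.0, 0.0)]
labels = [1, 1, -1]
print(svm_objective((1.0, 1.0), points, labels, reg_param=0.1))
print(svm_objective((0.0, 0.0), points, labels, reg_param=0.1))
```

With weights `(1.0, 1.0)` every point has margin at least 1, so only the penalty term contributes; with zero weights the penalty vanishes and every point incurs a hinge loss of 1.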

*Classification Model Training*

In [None]:
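from pyspark.ml.classification import LinearSVC

# DataFrame-based estimator; call lsvc.fit(trainingDF) to train
lsvc = LinearSVC(maxIter=10, regParam=0.1)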

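*Classification Model Training (RDD-based API)*

MLlib does not provide a support vector regression method; `SVMWithSGD` is the RDD-based counterpart of `LinearSVC` and is also a binary classifier.

In [None]:
from pyspark.mllib.classification import SVMWithSGD

# The training RDD must hold LabeledPoint objects with 0/1 labels
# (not the continuous lpsa labels parsed earlier)
lsvm = SVMWithSGD.train(parsedData, iterations=100)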

**Performance Measurement**

For the classification problem, we discussed measuring performance in terms of misclassification error.
For the regression problem, a common metric used for performance measurement is *Mean Squared Error.*

**Measuring Mean Squared Error (MSE) for the Regression Problem**

In [None]:
# compute squared errors of each point and average

testMSE = labelsAndPredictions.map(lambda lp: (lp[0] - lp[1]) * (lp[0] - lp[1])).sum() /\
    float(testData.count())
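
The same computation can be sketched in plain Python on a small list of (label, prediction) pairs. The function name and the toy pairs are illustrative stand-ins for the contents of `labelsAndPredictions`.

```python
def mean_squared_error(labels_and_predictions):
    """Average of squared differences between each label and its prediction."""
    squared_errors = [(label - pred) ** 2 for label, pred in labels_and_predictions]
    return sum(squared_errors) / len(squared_errors)


# Toy (label, prediction) pairs with errors of 0.5, 0.0 and -1.0
pairs = [(3.0, 2.5), (1.0, 1.0), (4.0, 5.0)]
test_mse = mean_squared_error(pairs)
print(test_mse)
```

Because errors are squared, the single error of size 1.0 contributes four times as much as the error of size 0.5.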

Compare this with the computation of misclassification error for the classification problem:

In [None]:
# compute fraction of points where predicted and actual labels disagree

testErr = labelsAndPredictions.filter(
    lambda lp: lp[0] != lp[1]).count() / float(testData.count())
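
A plain-Python counterpart, again on illustrative toy pairs, makes the contrast with MSE explicit: each point contributes either 0 or 1 to the error, regardless of how far off the prediction is.

```python
def misclassification_error(labels_and_predictions):
    """Fraction of pairs whose predicted label differs from the actual label."""
    mismatches = sum(1 for label, pred in labels_and_predictions if label != pred)
    return mismatches / len(labels_and_predictions)


# Toy (label, prediction) pairs: one of the four predictions is wrong
pairs = [(1.0, 1.0), (0.0, 1.0), (1.0, 1.0), (0.0, 0.0)]
test_err = misclassification_error(pairs)
print(test_err)        # misclassification error
print(1.0 - test_err)  # accuracy is its complement
```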