# Linear Methods
https://spark.apache.org/docs/2.2.0/mllib-linear-methods.html#logistic-regression

## Linear Support Vector Machines (SVMs)

- The linear SVM is a standard method for large-scale classification tasks. 
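
To make the classification rule concrete, here is a minimal sketch in plain Python (no Spark required) of how a trained linear SVM scores an example and how the hinge loss penalizes it. The helper names `margin`, `hinge_loss`, and `predict` are hypothetical and are not part of `pyspark.mllib`. The sketch assumes 0/1 labels and a default decision threshold of 0.0 on the margin.

```python
# Illustrative only: a single-machine sketch, not Spark's implementation.

def margin(weights, intercept, features):
    """Signed score w . x + b for one example."""
    return sum(w * x for w, x in zip(weights, features)) + intercept

def hinge_loss(weights, intercept, features, label):
    """Hinge loss max(0, 1 - y * margin), with the 0/1 label mapped to -1/+1."""
    y = 2.0 * label - 1.0
    return max(0.0, 1.0 - y * margin(weights, intercept, features))

def predict(weights, intercept, features, threshold=0.0):
    """Predict 1 when the margin exceeds the threshold, else 0."""
    return 1 if margin(weights, intercept, features) > threshold else 0

# Hypothetical model parameters and one example with two features.
weights, intercept = [0.5, -1.0], 0.0
example = [2.0, 0.5]

print("margin     =", margin(weights, intercept, example))
print("prediction =", predict(weights, intercept, example))
print("hinge loss =", hinge_loss(weights, intercept, example, 1.0))
```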

In [3]:
from pyspark.mllib.classification import SVMWithSGD, SVMModel
from pyspark.mllib.regression import LabeledPoint
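# urllib2 exists only in Python 2; fall back to urllib.request on Python 3
try:
    import urllib2
except ImportError:
    import urllib.request as urllib2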

# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.split(' ')]
    return LabeledPoint(values[0], values[1:])
###
fileURL = "https://raw.githubusercontent.com/cloudera/spark/master/mllib/data/sample_svm_data.txt"
fileLocation = "/tmp/sample_svm_data.txt"
result = dbutils.fs.rm(fileLocation)
# Decode the downloaded bytes so that a text string is written to the file
dbutils.fs.put(fileLocation, urllib2.urlopen(fileURL).read().decode("utf-8"))
###

data = sc.textFile(fileLocation)
#print data.collect()
parsedData = data.map(parsePoint)
#print parsedData.collect()

# Build the model
model = SVMWithSGD.train(parsedData, iterations=100)

# Evaluating the model on training data
labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda lp: lp[0] != lp[1]).count() / float(parsedData.count())
print("Training Error = " + str(trainErr))



In [4]:
# Empty the directory if it exists (-r removes it recursively)
%fs rm -r target/tmp/pythonSVMWithSGDModel/

In [5]:
# Save and load model
# Remove any previously saved model; the second argument enables recursive deletion
result = dbutils.fs.rm("target/tmp/pythonSVMWithSGDModel", True)
model.save(sc, "target/tmp/pythonSVMWithSGDModel")
sameModel = SVMModel.load(sc, "target/tmp/pythonSVMWithSGDModel")

In [6]:
%fs ls target/tmp/pythonSVMWithSGDModel

## Logistic regression
- Logistic regression is widely used to predict a binary response.
- The following example shows how to load a sample dataset, build a logistic regression model, and make predictions with the resulting model to compute the training error.
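
As a rough illustration of what "predicting a binary response" means, the sketch below applies the logistic function to a linear score and thresholds the resulting probability. It is plain Python with hypothetical helper names (`sigmoid`, `predict_proba`, `predict`) and made-up weights. The 0.5 threshold is an assumption that mirrors the usual default for binary models.

```python
import math

# Illustrative only: a single-machine sketch, not Spark's implementation.

def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(weights, intercept, features):
    """Estimated probability that the label is 1."""
    score = sum(w * x for w, x in zip(weights, features)) + intercept
    return sigmoid(score)

def predict(weights, intercept, features, threshold=0.5):
    """Predict 1 when the estimated probability exceeds the threshold."""
    return 1 if predict_proba(weights, intercept, features) > threshold else 0

# Hypothetical model parameters and one example with two features.
weights, intercept = [1.0, -2.0], 0.0
example = [3.0, 1.0]

print("P(label = 1) =", predict_proba(weights, intercept, example))
print("prediction   =", predict(weights, intercept, example))
```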

In [8]:
from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel
from pyspark.mllib.regression import LabeledPoint

# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.split(' ')]
    return LabeledPoint(values[0], values[1:])
  
###
fileURL = "https://raw.githubusercontent.com/cloudera/spark/master/mllib/data/sample_svm_data.txt"
fileLocation = "/tmp/sample_svm_data.txt"
result = dbutils.fs.rm(fileLocation)
# Decode the downloaded bytes so that a text string is written to the file
dbutils.fs.put(fileLocation, urllib2.urlopen(fileURL).read().decode("utf-8"))
###


data = sc.textFile(fileLocation)
parsedData = data.map(parsePoint)

# Build the model
model = LogisticRegressionWithLBFGS.train(parsedData)

# Evaluating the model on training data
labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda lp: lp[0] != lp[1]).count() / float(parsedData.count())
print("Training Error = " + str(trainErr))


In [9]:

# Save and load model
model.save(sc, "target/tmp/pythonLogisticRegressionWithLBFGSModel")
sameModel = LogisticRegressionModel.load(sc,
                                         "target/tmp/pythonLogisticRegressionWithLBFGSModel")

## Regression
- This section covers linear least squares, Lasso, and ridge regression.
- Linear least squares is the most common formulation for regression problems.
- The following example demonstrates how to load training data and parse it as an RDD of LabeledPoint. The example then uses LinearRegressionWithSGD to build a simple linear model to predict label values.
- We compute the mean squared error at the end to evaluate goodness of fit.
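
The three methods differ only in the regularizer added to the squared loss: none for ordinary least squares, L1 for Lasso, and L2 for ridge regression. The plain-Python sketch below computes the mean squared error and the regularized objective on a tiny made-up dataset. The function names and the data are hypothetical, and the scaling (a factor of 1/2 on the squared loss and on the L2 penalty) is an assumption about the conventional formulation rather than a statement about Spark's internals.

```python
# Illustrative only: a single-machine sketch, not Spark's implementation.

def predict(weights, intercept, features):
    """Linear prediction w . x + b."""
    return sum(w * x for w, x in zip(weights, features)) + intercept

def mean_squared_error(weights, intercept, data):
    """Average of (label - prediction)^2 over (label, features) pairs."""
    errors = [(label - predict(weights, intercept, x)) ** 2 for label, x in data]
    return sum(errors) / float(len(errors))

def penalty(weights, reg_type):
    """Regularizer: 0 for None, ||w||_1 for 'l1', 0.5 * ||w||_2^2 for 'l2'."""
    if reg_type is None:
        return 0.0
    if reg_type == "l1":
        return sum(abs(w) for w in weights)
    if reg_type == "l2":
        return 0.5 * sum(w * w for w in weights)
    raise ValueError("reg_type must be None, 'l1', or 'l2'")

def objective(weights, intercept, data, reg_param=0.0, reg_type=None):
    """Mean of 0.5 * squared error plus reg_param times the penalty."""
    loss = 0.5 * mean_squared_error(weights, intercept, data)
    return loss + reg_param * penalty(weights, reg_type)

# Made-up (label, features) pairs and a hypothetical model.
data = [(1.0, [1.0]), (3.0, [2.0])]
weights, intercept = [1.0], 0.0

print("MSE             =", mean_squared_error(weights, intercept, data))
print("least squares   =", objective(weights, intercept, data))
print("Lasso objective =", objective(weights, intercept, data, 0.1, "l1"))
print("ridge objective =", objective(weights, intercept, data, 0.1, "l2"))
```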

In [11]:
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel

# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.replace(',', ' ').split(' ')]
    return LabeledPoint(values[0], values[1:])

###
fileURL = "https://raw.githubusercontent.com/cloudera/spark/master/mllib/data/sample_svm_data.txt"
fileLocation = "/tmp/sample_svm_data.txt"
result = dbutils.fs.rm(fileLocation)
# Decode the downloaded bytes so that a text string is written to the file
dbutils.fs.put(fileLocation, urllib2.urlopen(fileURL).read().decode("utf-8"))
###

data = sc.textFile(fileLocation)
parsedData = data.map(parsePoint)

# Build the model
model = LinearRegressionWithSGD.train(parsedData, iterations=100, step=0.00000001)

# Evaluate the model on training data
valuesAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
MSE = valuesAndPreds \
    .map(lambda vp: (vp[0] - vp[1])**2) \
    .reduce(lambda x, y: x + y) / valuesAndPreds.count()
print("Mean Squared Error = " + str(MSE))

In [12]:
# Save and load model
model.save(sc, "target/tmp/pythonLinearRegressionWithSGDModel")
sameModel = LinearRegressionModel.load(sc, "target/tmp/pythonLinearRegressionWithSGDModel")

## Implementation (developer)
- Behind the scenes, spark.mllib implements a simple distributed version of stochastic gradient descent (SGD), building on the underlying gradient descent primitive (as described in the optimization section).
- All provided algorithms take a regularization parameter (regParam) as input, along with several parameters associated with stochastic gradient descent (stepSize, numIterations, miniBatchFraction).
- Each of them supports all three possible regularizations (none, L1, or L2).
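
To show how these parameters interact, here is a simplified single-machine sketch of mini-batch SGD for least squares with an optional L2 penalty. It is not Spark's code: the function `sgd_least_squares` and its argument names are hypothetical stand-ins for stepSize, numIterations, miniBatchFraction, and regParam, there is no intercept term, and the step size decaying as stepSize / sqrt(t) is an assumption based on the schedule described in the optimization section.

```python
import math
import random

# Illustrative only: a single-machine sketch, not Spark's distributed implementation.

def sgd_least_squares(data, num_features, step_size=1.0, num_iterations=100,
                      mini_batch_fraction=1.0, reg_param=0.0, seed=42):
    """Fit weights for (label, features) pairs by mini-batch SGD."""
    rng = random.Random(seed)
    weights = [0.0] * num_features
    for t in range(1, num_iterations + 1):
        # Sample each example independently with probability mini_batch_fraction.
        batch = [ex for ex in data if rng.random() < mini_batch_fraction]
        if not batch:
            continue  # nothing sampled in this iteration
        # Sum the squared-loss gradient (prediction - label) * x over the batch.
        grad = [0.0] * num_features
        for label, features in batch:
            error = sum(w * x for w, x in zip(weights, features)) - label
            for j, x in enumerate(features):
                grad[j] += error * x
        # Decaying step size, then a step on the averaged gradient plus the L2 term.
        step = step_size / math.sqrt(t)
        weights = [w - step * (g / len(batch) + reg_param * w)
                   for w, g in zip(weights, grad)]
    return weights

# Made-up noiseless data generated by label = 2 * x.
data = [(2.0 * x, [x]) for x in [0.5, 1.0, 1.5, 2.0]]
fitted = sgd_least_squares(data, num_features=1, step_size=0.5, num_iterations=200)
print("fitted weight =", fitted[0])
```

With `reg_param` greater than zero the fitted weight is pulled toward zero, which is the effect the L2 option has in the list above. Lowering `mini_batch_fraction` makes each iteration cheaper but noisier.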


- For logistic regression, the L-BFGS version is implemented in LogisticRegressionWithLBFGS. This version supports both binary and multinomial logistic regression, while the SGD version supports only binary logistic regression.
- However, the L-BFGS version doesn't support L1 regularization, whereas the SGD version does.
- When L1 regularization is not required, the L-BFGS version is strongly recommended: it approximates the inverse Hessian matrix with a quasi-Newton method, so it converges faster and more accurately than SGD.