### Ridge regression

Loss function: 
$$
L(\mathbf{w} ; \mathbf{x}, y) :=\frac{1}{2}\left(\mathbf{w}^{T} \mathbf{x}-y\right)^{2}
$$

Ridge regression uses L2 regularization.
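
As a rough illustration of how the loss above combines with an L2 penalty, the following NumPy sketch evaluates the per-example loss, a regularized objective, and its gradient. It assumes the objective has the form "mean loss plus (reg_param / 2) * ||w||^2", which is how the linked Spark documentation writes it; the function names and `reg_param` are hypothetical and are not part of the notebook.

```python
import numpy as np

def squared_loss(w, x, y):
    """Per-example loss L(w; x, y) = 0.5 * (w^T x - y)^2."""
    return 0.5 * (np.dot(w, x) - y) ** 2

def ridge_objective(w, X, y, reg_param):
    """Mean squared loss plus the L2 penalty (reg_param / 2) * ||w||^2.

    Assumption: reg_param is the regularization strength (hypothetical name).
    """
    residuals = X @ w - y
    return 0.5 * np.mean(residuals ** 2) + 0.5 * reg_param * np.dot(w, w)

def ridge_gradient(w, X, y, reg_param):
    """Gradient of ridge_objective with respect to w."""
    residuals = X @ w - y
    return X.T @ residuals / len(y) + reg_param * w
```

A gradient-descent step would then be `w = w - step * ridge_gradient(w, X, y, reg_param)`; with `reg_param = 0` this reduces to plain least squares.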

Reference: [linear methods](https://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression)

### Principal component analysis (PCA)

Reference: [PCA](https://spark.apache.org/docs/2.3.0/mllib-dimensionality-reduction.html#principal-component-analysis-pca)
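
For intuition only, here is a minimal NumPy sketch of PCA via the singular value decomposition. It is not Spark's implementation, and `pca_project` is a hypothetical helper name; it centers the data and projects onto the top `k` right singular vectors.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto the top-k principal components.

    Returns (projected data, components), where components has shape (k, d).
    """
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:k]
    return X_centered @ components.T, components
```

Calling `pca_project(X, 5)` would mirror the choice of 5 components used in the notebook's `PCA(5)`.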

### Subsampled randomised Hadamard transform

Target subsample size: $r = 500$

We can first select $1024$ entries from the RDD, transform them, and then divide the result into a training set and a test set.
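
The transform itself can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the notebook's code: it uses the common form sqrt(n / r) * R H D, with D a random sign flip, H the normalized Hadamard matrix, and R a uniform choice of `r` rows without replacement. The names `hadamard` and `srht` are hypothetical.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of the n x n Hadamard matrix (n a power of 2)."""
    if n < 1 or n & (n - 1) != 0:
        raise ValueError("n must be a power of 2")
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def srht(X, r, seed=0):
    """Apply a subsampled randomised Hadamard transform to the rows of X."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    signs = rng.choice([-1.0, 1.0], size=n)        # diagonal of D
    H = hadamard(n) / np.sqrt(n)                   # orthonormal Hadamard matrix
    rows = rng.choice(n, size=r, replace=False)    # row subsampling R
    return np.sqrt(n / r) * (H @ (signs[:, None] * X))[rows]
```

With the sizes in these notes, `srht(X, 500)` on a `1024`-row array would return 500 transformed rows.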

Sampling [with/without replacement](https://web.ma.utexas.edu/users/parker/sampling/repl.htm)

- `takeSample()` randomly selects data; its first argument (`True` in the code here) requests sampling with replacement.
- It also does the shuffle job. => P
- Should we sample with or without replacement?
- When sampling from given data, you are treating that data as the "true pool" of possible outcomes, so we usually sample with replacement, i.e. the pool should not change or shrink as you sample from it.
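
The difference between the two sampling modes can be seen in a small NumPy sketch. It uses `numpy.random.Generator.choice` rather than Spark; the PySpark counterpart is `RDD.takeSample(withReplacement, num, seed=None)`. The pool of ten integers is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
pool = np.arange(10)  # hypothetical data pool

# With replacement: the pool is unchanged by each draw, so values may repeat.
with_repl = rng.choice(pool, size=10, replace=True)

# Without replacement: each value can be drawn at most once, so drawing
# as many items as the pool holds returns a random permutation of it.
without_repl = rng.choice(pool, size=10, replace=False)

print(with_repl)
print(without_repl)
```

Drawing the whole pool without replacement is therefore just a shuffle, whereas drawing with replacement keeps every draw independent of the previous ones.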

In [1]:
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD
from pyspark.mllib.feature import PCA
from random import randrange
from pyspark.mllib.linalg import DenseVector, DenseMatrix
import numpy as np

In [2]:
# Load and parse the data from csv
def parsePoint(line):
    values = [float(x) for x in line.split(',')]
    return LabeledPoint(values[-1], values[0:len(values)-1])

In [3]:
# Ridge regression
def rr_fit(parsed_Data):
    rdd = parsed_Data.randomSplit([0.8, 0.2])
    model = LinearRegressionWithSGD.train(rdd[0], iterations=100,
                                          step=0.00000001, regType="l2")

    # Evaluate the model on the held-out test split (rdd[1])
    valuesAndPreds = rdd[1].map(lambda p: (p.label, model.predict(p.features)))
    MSE = valuesAndPreds.map(lambda vp: (vp[0] - vp[1])**2)\
              .reduce(lambda x, y: x + y) / valuesAndPreds.count()
    print("Mean Squared Error = " + str(MSE))

In [4]:
# Principal component analysis
def pca_fit(parsed_Data):
    x = parsed_Data.map(lambda p: p.features)
    pc = PCA(5).fit(x)
    transformed = pc.transform(x)
    y = parsed_Data.map(lambda p: p.label)
    a = transformed.zip(y)
    paired = a.map(lambda line: LabeledPoint(line[1], line[0]))

    rdd2 = paired.randomSplit([0.8, 0.2])
    model2 = LinearRegressionWithSGD.train(rdd2[0], iterations=100,
                                           step=0.00000001, regType=None)

    # Evaluate the model on the held-out test split (rdd2[1])
    valuesAndPreds = rdd2[1].map(lambda p: (p.label, model2.predict(p.features)))
    MSE = valuesAndPreds.map(lambda vp: (vp[0] - vp[1])**2)\
              .reduce(lambda x, y: x + y) / valuesAndPreds.count()
    print("Mean Squared Error = " + str(MSE))

In [5]:
# Matrix D: multiply the input by a random sign (-1 or +1)
def mat_D(x):
    d = randrange(-1, 2, 2)
    return d*x

# Matrix B acts as a filter: keep each row with probability 500/1024
def mat_B_filter(x):
    d = randrange(0, 1024)
    return d < 500

# Hadamard transform
def hadamard_fit(data):
    # Sample 1024 rows from the data, with replacement
    parsedData = data.map(lambda line: np.array([float(x) for x in line.split(',')]))
    rdd3 = sc.parallelize(parsedData.takeSample(True, 1024),2)

    # Build the 1024 x 1024 Hadamard matrix (1024 = 2**N) by repeated doubling
    N = 10
    H = np.zeros([1024, 1024])
    H[0, 0] = 1
    h = 1
    for i in range(N):
        H[0:h, h:2 * h] = H[0:h, 0:h]
        H[h:2 * h, 0:h] = H[0:h, 0:h]
        H[h:2 * h, h:2 * h] = -1 * H[0:h, 0:h]
        h = h * 2

    # Multiply by the Hadamard matrix (collect once and reuse the result)
    rows = rdd3.collect()
    lens = rows[0].shape[0]
    X_array = np.array(rows).reshape(1024, lens)
    X_hadamard = H.dot(X_array)

    x_rdd = sc.parallelize(X_hadamard)  # each entry is a NumPy array
    subset = x_rdd.map(lambda x: LabeledPoint(x[-1], x[0:lens - 1])) \
        .randomSplit([0.8, 0.2])  # split training and testing
    x_rp = subset[0].filter(mat_B_filter)  # matrix B acts as a row filter
    model3 = LinearRegressionWithSGD.train(x_rp, iterations=100,
                                           step=0.00000001, regType=None)
    # Evaluate the model on the held-out test split (subset[1])
    valuesAndPreds = subset[1].map(lambda p: (p.label, model3.predict(p.features)))
    MSE = valuesAndPreds \
              .map(lambda vp: (vp[0] - vp[1]) ** 2) \
              .reduce(lambda x, y: x + y) / valuesAndPreds.count()
    print("Mean Squared Error = " + str(MSE))


In [7]:
data = sc.textFile("pynum.csv")
parsedData = data.map(parsePoint)

In [8]:
rr_fit(parsedData)

Mean Squared Error = 3522768811.3


In [9]:
pca_fit(parsedData)

Mean Squared Error = 3561025567.59


In [10]:
hadamard_fit(data)

Mean Squared Error = inf
