# Lab 5: Linear Regression and Ridge Regression from Scratch
## by Tiffany Nguyen
The purpose of this lab is to train a linear regression model and a ridge regression model using the closed-form weight equations derived in class, and then to evaluate their predictions on the training and test sets using Root Mean Squared Error (RMSE).
### Part 1: Preprocessing

In [99]:
import numpy as np

In [100]:
# Read all training data
trainDataX = np.empty((0, 95))
trainDataY = np.empty((0))

#read train data
with open("crime-train.txt", "r") as filestream:
  next(filestream) #skip header line
  for line in filestream:
    currentline = line.strip().split("\t")
    trainDataX = np.append(trainDataX, [np.array(currentline[1:], dtype=float)], axis=0)
    trainDataY = np.append(trainDataY, float(currentline[0]))

# add dummy feature to train data (column of ones for bias) 
dummyCol = np.ones((trainDataX.shape[0], 1))
trainDataX = np.concatenate((trainDataX, dummyCol), axis=1)

# print output
print("trainDataX shape:", trainDataX.shape)
print("trainDataY shape:", trainDataY.shape)
print("trainDataX: \n", trainDataX)
print("trainDataY: \n", trainDataY)

trainDataX shape: (1595, 96)
trainDataY shape: (1595,)
trainDataX: 
 [[-0.45 -1.85 -1.06 ...  1.26 -0.39  1.  ]
 [-0.45 -0.27 -0.22 ... -0.62 -0.39  1.  ]
 [-0.14  1.87  0.55 ...  0.52 -0.39  1.  ]
 ...
 [ 0.81 -0.57 -0.48 ...  0.08  3.4   1.  ]
 [ 0.18  0.28  1.   ...  0.73  0.52  1.  ]
 [ 1.12  1.93  0.49 ... -0.49  3.77  1.  ]]
trainDataY: 
 [0.67 0.43 0.12 ... 0.23 0.19 0.48]
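
The loading steps above (skip the header, take the first column as the target, append a column of ones) can also be written without a row-by-row `np.append` loop. The following is a minimal sketch, not the method used in this lab: the helper name `load_with_bias` and the tiny inline sample are hypothetical, and it assumes the file is purely numeric after the header row.

```python
import io
import numpy as np

def load_with_bias(source):
    """Load a tab-separated table whose first column is the target.

    `source` may be a file path or an open file-like object.
    The first row is assumed to be a header and is skipped.
    """
    data = np.loadtxt(source, delimiter="\t", skiprows=1, ndmin=2)
    y = data[:, 0]   # first column: target
    X = data[:, 1:]  # remaining columns: features
    # Append a column of ones so the last weight acts as the bias term.
    X = np.hstack([X, np.ones((X.shape[0], 1))])
    return X, y

# Made-up two-row sample in the same layout (NOT the crime data).
sample = io.StringIO("target\tf1\tf2\n0.5\t1.0\t2.0\n0.1\t-1.0\t0.0\n")
X_demo, y_demo = load_with_bias(sample)
print(X_demo.shape, y_demo.shape)
```

Reading the whole file in one call avoids reallocating the array on every row, which is what repeated `np.append` does.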


In [101]:
# Read all testing data
testDataX = np.empty((0, 95))
testDataY = np.empty((0))

# read test data
with open("crime-test.txt", "r") as filestream:
  next(filestream) #skip header line
  for line in filestream:
    currentline = line.strip().split("\t")
    testDataX = np.append(testDataX, [np.array(currentline[1:], dtype=float)], axis=0)
    testDataY = np.append(testDataY, float(currentline[0]))

# add dummy feature to test data (column of ones for bias)
dummyCol = np.ones((testDataX.shape[0], 1))
testDataX = np.concatenate((testDataX, dummyCol), axis=1)

# print output
print("testDataX shape:", testDataX.shape)
print("testDataY shape:", testDataY.shape)
print("testDataX: \n", testDataX)
print("testDataY: \n", testDataY)

testDataX shape: (399, 96)
testDataY shape: (399,)
testDataX: 
 [[-0.14  0.35 -0.41 ...  0.65 -0.39  1.  ]
 [ 0.02 -0.45 -0.22 ... -0.66 -0.39  1.  ]
 [-0.45  0.28 -0.16 ... -0.66 -0.39  1.  ]
 ...
 [-0.38  1.99  1.07 ... -0.57 -0.39  1.  ]
 [-0.38  0.04 -0.22 ... -0.27 -0.39  1.  ]
 [ 0.02 -0.57 -0.48 ... -0.14 -0.39  1.  ]]
testDataY: 
 [0.08 0.22 0.06 0.16 0.15 0.28 0.63 0.16 0.25 1.   0.04 0.05 0.86 0.08
 0.09 0.17 0.12 0.12 1.   0.16 0.4  0.12 0.07 0.4  0.06 0.21 0.04 0.45
 0.02 0.28 0.38 0.22 0.07 0.06 0.16 0.21 1.   0.19 0.26 0.1  0.23 0.09
 0.55 1.   0.09 0.41 1.   0.05 0.02 0.09 0.11 0.36 0.03 0.63 0.04 0.02
 0.08 0.27 0.52 0.27 0.31 0.06 0.1  0.22 0.6  0.49 0.4  0.02 0.21 0.2
 1.   0.5  0.15 0.31 0.22 0.16 0.16 0.32 0.21 0.09 0.09 0.02 0.05 0.15
 0.11 0.1  0.06 0.11 0.01 0.1  0.18 0.85 0.05 0.12 0.54 0.06 0.04 0.18
 0.44 0.03 0.14 0.03 0.29 0.14 0.71 0.06 0.35 0.07 0.2  0.3  0.05 0.05
 0.26 0.62 0.11 0.03 0.17 0.25 0.34 0.39 0.13 0.19 0.11 0.06 0.08 0.01
 0.04 0.25 0.08 0.31 

### Part 2: Training
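Here $X$ is the training feature matrix (including the column of ones), $y$ is the vector of training targets, $\lambda$ is the regularization strength, and $I$ is the identity matrix.

Weights for Linear Regression:
$$w = (X^TX)^{-1}X^Ty$$
Weights for Ridge Regression:
$$w = (X^TX+\lambda I)^{-1}X^Ty$$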
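
As an illustration of the two formulas, the sketch below solves the same normal equations with `np.linalg.solve` rather than forming the inverse explicitly, which is generally the numerically safer choice. The function names `fit_linear` and `fit_ridge` and the synthetic data are assumptions made for this example. As in the formula above, the ridge penalty is applied to every weight, including the bias.

```python
import numpy as np

def fit_linear(X, y):
    # Solve (X^T X) w = X^T y for w.
    return np.linalg.solve(X.T @ X, X.T @ y)

def fit_ridge(X, y, lam):
    # Solve (X^T X + lam * I) w = X^T y for w.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Synthetic, noise-free data (hypothetical; not the crime data).
rng = np.random.default_rng(0)
X_demo = np.hstack([rng.normal(size=(50, 3)), np.ones((50, 1))])
true_w = np.array([1.0, -2.0, 0.5, 0.3])
y_demo = X_demo @ true_w

w_lin = fit_linear(X_demo, y_demo)
w_ridge = fit_ridge(X_demo, y_demo, lam=100.0)
print("linear:", w_lin)
print("ridge: ", w_ridge)
```

On noise-free data the linear solution recovers `true_w`, while the ridge solution is shrunk toward zero, so its norm is smaller.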

In [102]:
# Get weights for linear regression
xtx = np.dot(np.transpose(trainDataX), trainDataX)
xtxInv = np.linalg.inv(xtx)
xtxInvXt = np.dot(xtxInv, np.transpose(trainDataX))
linearRegressionWeights = np.dot(xtxInvXt, trainDataY)
print("Linear Regression Weights:\n", linearRegressionWeights)

Linear Regression Weights:
 [-1.54906138e-02  3.35459838e-03  1.27547102e-02 -2.99835603e-02
 -2.49633935e-02  2.42430772e-02 -1.26473288e-02  2.03432879e-02
 -3.20098178e-02 -3.13299628e-02  6.01490622e-03 -3.97817625e-02
  4.88968626e-03 -2.59111673e-03 -1.25557009e-02  5.74520977e-02
 -5.57639378e-02  2.37527507e-03 -3.68181254e-04 -4.24706174e-03
  4.31644481e-03  1.27018447e-02  2.54547492e-02 -3.32700396e-02
 -1.60002475e-02  3.54599959e-03  1.54036062e-02  4.24360129e-03
  4.25444378e-02 -1.03008495e-02  3.77526976e-03  1.37145404e-02
  1.56911065e-02  6.22095883e-02  3.59684662e-02  1.04463505e-02
 -6.95109507e-02  1.25702313e-02  4.52392413e-02 -1.27783374e-01
 -3.11803197e-03 -1.58893093e-04  8.76703186e-03 -2.98193646e-02
 -2.15350434e-02  5.75307329e-02 -1.39352290e-02  2.90734700e-03
 -1.57817870e-05 -6.64885528e-03  7.87972918e-03 -2.11190533e-02
 -1.09784568e-02  4.52348038e-02 -9.55488187e-03 -1.74048531e-02
 -4.57405270e-02 -2.40896336e-02 -2.61689470e-04  1.03266126e-

In [103]:
# Get weights for ridge regression
lambdaVal = 100

xtx = np.dot(np.transpose(trainDataX), trainDataX)
lambdaIdentity = lambdaVal * np.identity(trainDataX.shape[1])
xtxLambdaInv = np.linalg.inv(xtx + lambdaIdentity)
xtxLambdaInvXt = np.dot(xtxLambdaInv, np.transpose(trainDataX))
ridgeRegressionWeights = np.dot(xtxLambdaInvXt, trainDataY)
print("Ridge Regression Weights:\n", ridgeRegressionWeights)

Ridge Regression Weights:
 [-6.24990628e-03  9.99933877e-03 -1.62413794e-03 -1.67026882e-02
 -5.16998456e-03  1.17345689e-02 -5.04845701e-03  1.75230995e-02
  4.69535245e-03 -6.57402533e-03  1.12734829e-03 -2.84372641e-02
  6.33694821e-03  1.28021923e-03 -9.22355798e-03  1.19106670e-03
 -9.61654135e-03  6.00453170e-03 -6.15061465e-04 -2.88976415e-03
  4.18653476e-03  1.04107643e-02  2.19726334e-03 -1.08846467e-02
 -7.93857388e-03  6.00278416e-03  2.87703214e-03 -4.35620444e-03
  1.20984227e-02 -6.55706607e-03  2.67530699e-03  4.51306384e-03
  6.16341651e-03  1.49024048e-02  1.45444377e-02 -7.51653065e-03
  3.28798973e-04  1.25623859e-02 -1.75278988e-02 -3.82541806e-02
 -1.30394543e-02 -7.71160529e-03  2.18236021e-03 -1.38805520e-02
 -2.63780546e-03  5.54440285e-02 -1.08161447e-02 -1.42460139e-03
 -2.94012491e-03 -5.30350135e-04  4.68946051e-03 -3.27906358e-03
  1.81094018e-03  5.73759314e-03  8.07469117e-03 -2.25678300e-03
 -1.61746445e-02 -8.82788721e-04 -2.74801279e-03  1.37027444e-0

### Part 3: Prediction
Get the predictions by computing the dot product between the weight vector and each sample (each row of the data matrix).
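
A small worked example of this step, using made-up numbers: taking the dot product of the weights with the transposed data matrix gives the same result as the matrix-vector product `X @ w`. The names `X_demo` and `w_demo` are hypothetical.

```python
import numpy as np

# Two samples, two features plus a trailing bias column of ones.
X_demo = np.array([[1.0, 2.0, 1.0],
                   [0.0, -1.0, 1.0]])
w_demo = np.array([0.5, -0.25, 0.1])

pred_a = np.dot(w_demo, X_demo.T)  # form used in this lab
pred_b = X_demo @ w_demo           # equivalent matrix-vector product

print(pred_a)
print(np.allclose(pred_a, pred_b))
```

Each prediction is the weighted sum of one row's features plus the bias weight, for example `0.5*1.0 + (-0.25)*2.0 + 0.1*1.0` for the first row.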

In [104]:
linearPredTrain = np.dot(linearRegressionWeights, np.transpose(trainDataX))
linearPredTest = np.dot(linearRegressionWeights, np.transpose(testDataX))
# print(linearPredTrain)
# print(linearPredTest)

In [105]:
ridgePredTrain = np.dot(ridgeRegressionWeights, np.transpose(trainDataX))
ridgePredTest = np.dot(ridgeRegressionWeights, np.transpose(testDataX))
# print(ridgePredTrain)
# print(ridgePredTest)

### Part 4: Evaluation
Evaluate the model using Root Mean Squared Error (RMSE):
$$RMSE = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(y_{pred}^{(i)}-y_{gnd}^{(i)}\right)^2}$$
where $m$ is the number of samples.
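
RMSE sums the squared differences between predictions and ground truth over all $m$ samples, divides by $m$, and takes the square root. Since the same computation is needed four times (two models, two data sets), it could be factored into a helper. The sketch below is one way to do that; the function name `rmse` is an assumption, not part of the lab code.

```python
import numpy as np

def rmse(y_pred, y_true):
    """Root mean squared error between predictions and ground truth."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    # The mean of the squared residuals equals their sum divided by m.
    return np.sqrt(np.mean(np.square(y_pred - y_true)))

# Made-up values: residuals are 0, 0 and -2, so RMSE = sqrt(4 / 3).
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

With such a helper, each evaluation cell would reduce to a single call, for example `rmse(linearPredTest, testDataY)`.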

In [106]:
m = len(linearPredTrain)
linearTrainRMSE = np.sqrt(np.sum(np.square(linearPredTrain - trainDataY))/m)
print("Linear Training RMSE:", round(linearTrainRMSE, 3))

Linear Training RMSE: 0.128


In [107]:
m = len(linearPredTest)
linearTestRMSE = np.sqrt(np.sum(np.square(linearPredTest - testDataY))/m)
print("Linear Testing RMSE:", round(linearTestRMSE, 3))

Linear Testing RMSE: 0.146


In [108]:
m = len(ridgePredTrain)
ridgeTrainRMSE = np.sqrt(np.sum(np.square(ridgePredTrain - trainDataY))/m)
print("Ridge Training RMSE:", round(ridgeTrainRMSE, 3))

Ridge Training RMSE: 0.131


In [109]:
m = len(ridgePredTest)
ridgeTestRMSE = np.sqrt(np.sum(np.square(ridgePredTest - testDataY))/m)
print("Ridge Testing RMSE:", round(ridgeTestRMSE, 3))

Ridge Testing RMSE: 0.148
