---
title: '[Write-up]_SVM_SGD_BB'
output:
html_document: default
pdf_document: default
---
<h1> 2. Gradient descent implementation </h1>
<h2> Overview </h2>
In this part we will be creating our own support vector machine trained with stochastic
gradient descent and the Barzilai-Borwein step size (SVM SGD-BB). As there is
little chance our data is separable, we will be using a soft-margin classifier.
Furthermore, a hard-margin classifier (also called a maximal margin classifier) is
much more sensitive to observations close to the decision boundary, meaning it can
more easily overfit our data [1].
We will first explain the optimization problem, then go through the method of
finding an optimum, and walk you through the code step by step. Before we run
our code on the heart dataset we will test our implementation on a few simulated
datasets. Then we will finally run our code on the dataset and see how well it
compares to other models.
<h2> 2.1 Support Vector Optimization Problem </h2>
In this implementation we will focus on a linear SVM, so we can represent
our function as:
$$
\hat{f}(X) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \dots + \hat{\beta}_p x_p
$$
Where we classify an observation as:
$$
\hat{y} = \begin{cases} -1 & \text{for } \hat{f}(X) < 0 \\ \phantom{-}1 & \text{for } \hat{f}(X) \ge 0 \end{cases}
$$
To greatly simplify our calculations when running the algorithm, we will present
them in matrix notation:
$$
\hat{f}(X)=X\hat{\beta}
$$
Many sources represent the bias (also called the constant term) as a separate
variable b, but throughout our calculations we will include it as the first entry
of the coefficient vector B and instead add a column of 1's to any data we
pass to the algorithm. This keeps the algorithm scalable when we apply it to
higher-dimensional problems later on.
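As a quick illustration (a toy sketch of ours, not part of the final function), folding the bias into the coefficient vector looks like this:
```{r, eval = FALSE, results='hide',warning=FALSE}
#Toy sketch: prepend a column of 1's so the first coefficient acts as the bias
x <- matrix(rnorm(20), nrow = 10)  #10 observations, 2 predictors
X <- cbind(1, x)                   #first column of 1's pairs with beta_0
B <- matrix(0, ncol(X), 1)         #B[1] now plays the role of the bias
f_hat <- X %*% B                   #f(X) = X %*% beta in a single product
```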
The support vector classifier is the solution to the following optimization problem [1]:
$$
\max\; M \quad \text{subject to:}\; \|\beta\|_2 = 1, \\
y_i(x_i^{T}\beta) \ge M(1-\epsilon_i)\ \text{for all } i, \\
\sum_{i=1}^n \epsilon_i \le K,\quad \epsilon_i \ge 0\ \text{for all } i
$$
M represents the one-sided distance from the decision boundary to the margin [2].
Since this distance can be written as the reciprocal of the 2-norm of beta,
instead of maximizing the margin M (as in the previous formulation) we can drop
the condition that the 2-norm of beta equals one and replace M with it:
$$
M = 1/\|\beta\|_2
$$
Epsilon represents the relative distance to our decision boundary, where
0 < e_i <= 1 implies observation i has been correctly classified but lies inside
the margin M, whereas e_i > 1 implies the observation has been misclassified.
By putting an upper bound K on the sum of the e_i's, we limit the number of
possible misclassifications:
$$
\sum_{i=1}^n \epsilon_i \le K
$$
For K = 0 we do not allow any misclassifications, which is equivalent to the
maximal margin classifier. Instead of putting an upper bound K on the sum, we
reformulate the problem with the sum of epsilons scaled by a factor C. The factor
C is inversely proportional to K, meaning for the separable case C -> inf.
We can now rearrange our optimization problem as [2]:
$$
\min\; \frac{1}{2}\|\beta\|_2^2 + C\sum_{i=1}^n \epsilon_i \\
\text{subject to: } \epsilon_i \ge 0,\quad y_i(x_i^{T}\beta) \ge 1-\epsilon_i\ \text{for all } i
$$
This is a more computationally convenient representation, which we will rely on,
but it needs a few more modifications before we can apply it. Rearranging the
constraints:
$$
\epsilon_i \ge 1 - y_i(x_i^{T}\beta), \quad \epsilon_i \ge 0\ \text{ for all } i
$$
We can now combine those two conditions to get:
$$
\epsilon_i \ge max(0,1-y_i({x_i}^T\beta))
$$
This is the SVM hinge loss [2][3], which we can substitute into our
final minimization problem.
We can finally write out the function we will be minimizing; let us denote it
h(B):
$$
h(\beta) = \frac{1}{2}\|\beta\|_2^2 + C\sum_{i=1}^n \max(0,\,1-y_i(x_i^{T}\beta))
$$
This is the typical form in which our optimization problem is presented
[3][4]. The first term is the regularization term, which maximizes
the margin M. The second term is the aforementioned loss function, in our case
the hinge loss, which sums up the severity of the misclassifications.
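To make the objective concrete, here is a minimal sketch of how h(B) could be evaluated in R (the helper name hinge_objective is ours and is not used by the algorithm below):
```{r, eval = FALSE, results='hide',warning=FALSE}
#Sketch of the objective h(B); X is n x p (including the bias column),
#Y holds the -1/1 labels, B is p x 1
hinge_objective <- function(X, Y, B, C) {
  hinge <- pmax(0, 1 - Y * (X %*% B)) #max(0, 1 - yi(xi^T B)) per observation
  0.5 * sum(B^2) + C * sum(hinge)     #(1/2)||B||^2 + C * sum of hinge losses
}
```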
One interesting thing to note is that some resources [1] rewrite this
equation in the following form:
$$
h(\beta) = \frac{\lambda}{2}\|\beta\|_2^2 + \sum_{i=1}^n \max(0,\,1-y_i(x_i^{T}\beta))
$$
This formulation resembles ridge regression, with the parameter lambda
penalizing the coefficients. It is essentially the same function, with lambda
inversely proportional to the parameter C.
In this case we will be using the first version of the formulation.
<h2> 2.2 Gradient Descent method and derivations </h2>
The aforementioned optimization problem has a solution using Lagrange multipliers
with the KKT (Karush–Kuhn–Tucker) conditions [2][6]. This can be solved using
quadratic programming, but such algorithms are very time-consuming [5].
Instead we will apply a gradient descent algorithm. This algorithm uses
the simple fact that the gradient of our objective, at a given point, points
in the direction of steepest ascent; conversely, taking a step in the opposite
direction leads us towards the minimum. In its simplest form, gradient descent
works in the following way [7] (a minimal sketch follows the list):
1. Initialize a set of values Beta (can be all 0 or a set of random values)
2. Calculate the gradient of our minimization function h(Beta) with respect to Beta
3. Update Beta by subtracting the gradient times the learning rate
4. Repeat until some stopping criterion is reached.
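Schematically, under the assumption of a placeholder gradient function grad_h and a fixed learning rate nn (both stand-ins for what we derive below):
```{r, eval = FALSE, results='hide',warning=FALSE}
B <- matrix(0, ncol(X), 1) #1. initialize coefficients (here: all zeros)
for (i in 1:t) {
  G <- grad_h(B, X, Y)     #2. gradient of h(B) (placeholder function)
  B <- B - nn * G          #3. step against the gradient
}                          #4. repeat t times
```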
Before we explain the learning rate and stochastic gradient descent, let us first
calculate the gradient we will be using. Sadly, the function is not differentiable
due to the max() condition, but we can find gradients for both outcomes of the
condition [4][5]. We will use the variable G to denote the gradient:
$$
\begin{align*}
&G(\beta,X,Y)=\nabla_{\beta}h(\beta,X,Y)=\beta-C\,y_i x_i &&\text{for } 1-y_i(x_i^{T}\beta)>0 \\
&G(\beta,X,Y)=\nabla_{\beta}h(\beta,X,Y)=\beta &&\text{otherwise}
\end{align*}
$$
Gradient descent requires calculating the gradient over all observations in X.
As this can be time-consuming, we use stochastic gradient descent (SGD) instead.
At each iteration, SGD takes a sample of size S and calculates the gradient for
that particular sample. SGD also has one more advantage over GD: we know that at
both a local extremum and a saddle point the gradient of our function equals 0.
This means plain GD can terminate after getting stuck at a saddle point, without
finding the actual minimum we are looking for, whereas the randomness of SGD's
sampling helps it escape such points.
Therefore we shall transform our initial GD algorithm to take a sample S of size
s at each iteration and calculate the gradient according to:
$$
G(\beta,X,Y)=\beta-\frac{C}{s}\sum_{i\in S} y_i x_i \qquad \text{for } 1-y_i(x_i^{T}\beta)>0,\; i\in S
$$
Finally, we need to consider the learning rate, which dictates how large a step
we take in the direction of the minimum [8]. Although it seems it could be set to
a fixed value, that would require tuning the parameter: steps that are too large
could mean our algorithm jumps around the minimum without ever reaching it, while
steps that are too small might require a large number of iterations to
converge [9]. To solve this we shall use the Barzilai–Borwein (BB) method [8].
Although it might seem problematic to use the BB method in a stochastic setting,
as our algorithm never calculates the full gradient, the paper by K. Sopyla and
P. Drozda [10] finds that SGD-BB performs as well as other comparable
learning-rate schemes and has much lower sensitivity to the choice of initial
parameters.
The Barzilai–Borwein learning rate (variable nn in the code) is:
$$
\eta_n = \frac{\left | (\beta_t - \beta_{t-1})^{T}(G_t - G_{t-1}) \right |}{\left \| G_t - G_{t-1} \right \|_2^2}
$$
Where t is our current iteration of SGD algorithm.
<h2> 2.3 Code explained </h2>
Firstly, let us explain which variables we will be using throughout the code.
The function takes 3 basic arguments and has 3 more set by default:
x - matrix of independent variables
y - matrix/vector of dependent variables [vector of logicals or -1/1's]
C - cost parameter C [numeric]
S - sample size [set by default to 10]
t - the number of iterations ("steps our gradient descent will take") the main
SGD loop will execute [set by default to 1000]
bias - variable controlling whether to add a constant beta_0 (also called bias)
[logical, set by default to TRUE]
Furthermore, throughout the code we will repeatedly use some other variables:
X - matrix of independent variables after initialization
XS - matrix of independent variables sampled from X according to Sample
Y - matrix of dependent variables after initialization
YS - matrix of dependent variables sampled from Y according to Sample
B - matrix of coefficients
Bp - matrix of previous coefficients (used for BB)
G - gradient
Gp - previous gradient
nn - learning rate
Sample - indices of observations to be sampled each iteration
V - vector, 0 if a given observation is not a support vector, the corresponding
y_i if it is (this is done to simplify calculations later on)
M - margin size
We will use a few other variables along the way, but their names will be
explicitly specified or named "Dummy" (indicating a dummy single-use variable).
```{r, eval = FALSE, results='hide',warning=FALSE}
svm_gradient_descent <- function (x,y,C,S = 10,t = 1000,bias = TRUE) {
#B <- coefficient matrix
#t <- number of iterations
#nn <- learning rate
#G <- gradient
#bias <- whether to include bias term (also called constant)
#V <- vector of saved values y or 0 for max(0,1-Yi(BXi))
#S <- Sample for SGD (updated each iteration)
#XS, YS <- Sample of XS,YS of size S
}
```
Firstly, let us check whether the user wants a model with or without
a bias term (the constant coefficient beta_0). As we explained in section 2.1,
for our model to be scalable to p dimensions and to make our calculations easier,
instead of computing it separately we add a column of 1's to our
independent-variable matrix x. Our model thus automatically creates one more
coefficient and proceeds as normal later on.
```{r, eval = FALSE, results='hide',warning=FALSE}
if (bias == TRUE) {
Dummy1 <- matrix(1,nrow(x),1)
x <- cbind(Dummy1,x)
}
```
Let's initialize variables we will need throughout the main loop:
```{r, eval = FALSE, results='hide',warning=FALSE}
#Changes TRUE / FALSE to -1 and 1 and turns y into a data type matrix
Y <- data.matrix(ifelse(y == TRUE,1,-1))
#Makes sure x is of data type matrix
X <- data.matrix(x)
#Initializes coefficient & gradient matrices
B <- matrix(0,ncol(x),1)
G <- matrix(0,ncol(x),1)
```
Now, let us proceed to the main loop of the SGD algorithm, which will be executed
t times. Firstly, we save the current gradient (G) as the previous gradient (Gp).
Next, we sample without replacement the indices of rows to be selected as the
batch used to update the gradient (Sample), and then select the sampled
observations (XS, YS). Finally, we initialize the matrix (V) we will use to store
the y_i's and 0's, depending on whether an observation is a support vector or not.
In part (1) we check whether the sampled observations are support vectors.
In part (2) we update the gradient.
In part (3) we choose our step size.
In part (4) we prevent the code from running on NaN values.
We update the matrix of previous coefficients (Bp) before updating
the current one (B).
```{r, eval = FALSE, results='hide',warning=FALSE}
for (i in 1:t) {
#Sets previous gradient to be current gradient (used for BB)
Gp <- G
#Sampling without replacement
Sample <- sample(nrow(X),S)
XS <- as.matrix(X[Sample,])
YS <- as.matrix(Y[Sample,])
V <- matrix(0,nrow(YS),1)
#Part (1):
#max(0,1-yi(xiB)) part of the algorithm, saving output in V
#explained in the next part
#explained in the next part
#explained in the next part
#Part (2):
#updating the gradient using V,B and C
#explained in the next part
#explained in the next part
#explained in the next part
#Part (3):
#Barzilai-Borwein Step Size for i> iterations (fixed for i==1)
#explained in the next part
#explained in the next part
#explained in the next part
#Part (4):
#Checks whether the value of B is NaN and prevents further iterations if TRUE
#explained in the next part
#explained in the next part
#explained in the next part
#Sets previous coefficients to the current coefficients, before updating
Bp <- B
#Takes the step in the direction of minimum
B <- B-nn*G
}
```
Part (1)
This part checks whether a given observation is a support vector.
Instead of saving the indicators as 1's and 0's, in order to simplify the
calculations of Part (2) we save y_i's and 0's. We store a support vector's y_i
because in Part (2) we multiply by the entire sample XS, giving an output of
either 0 (if not a support vector) or x_i*y_i (if it is).
This part corresponds to:
$$
max(0,1-y_i({x_i}^T\beta)) \space\space\space\space\space for\space i\in S
$$
```{r, eval = FALSE, results='hide',warning=FALSE}
#Part(1)
#max(0,1-yi(xiB)) part of the algorithm, saving output in V
for (n in c(1:nrow(XS))) {
V[n] <- ifelse((1 - (XS[n,] %*% B)*YS[n])>0,YS[n],0)
}
```
Part (2)
We simply update our gradient according to the derivation from 2.2.
This part corresponds to:
$$
G(\beta,X,Y)=\beta-\frac{C}{s}\sum_{i\in S} y_i x_i \qquad \text{for } 1-y_i(x_i^{T}\beta)>0,\; i\in S
$$
```{r, eval = FALSE, results='hide',warning=FALSE}
#Part(2)
#updating the gradient using V,B and C
for (j in c(1:ncol(XS))) {
G[j] <- B[j]- (C*((t(V)%*%XS[,j])/nrow(XS)))
}
```
Part (3)
In this part we set our learning rate. To apply BB we of course need the
previous set of coefficients as well as the previous gradient, hence for the
first iteration of the algorithm we set a fixed value nn = 0.001.
This part corresponds to:
$$
\eta_n = \frac{\left | (\beta_t - \beta_{t-1})^{T}(G_t - G_{t-1}) \right |}{\left \| G_t - G_{t-1} \right \|_2^2}
$$
```{r, eval = FALSE, results='hide',warning=FALSE}
#Part(3)
#Barzilai-Borwein Step Size for i> iterations (fixed for i==1)
if (i== 1) {
nn = 1/1000
} else {
numerator = abs(t(B-Bp)%*%(G-Gp))
denominator = (t(G-Gp)%*%(G-Gp))
nn = as.vector(numerator/denominator)
}
```
Part (4) and output
There might be a situation where our algorithm runs into a saddle point or
happens to reach the minimum sooner than we initially expected (or, for some
other unexpected reason, Gp == G). The former can especially happen for very low
values of S. In either case, if G == Gp our BB step will return NaN, as it is
trying to compute 0/0, and would consequently corrupt our B vector for the rest
of the iterations. To prevent this we terminate the loop early, display
a warning to the user that we have not reached all of the desired iterations, and
return the output we would otherwise produce, but computed from the previous
iteration's coefficients.
```{r, eval = FALSE, results='hide',warning=FALSE}
#Part(4)
#Checks whether the value of B is NaN and prevents further iterations if TRUE
if (is.na(nn)) {
warning(c("Terminated after ",i," iterations"))
predicted <- ifelse(X%*%Bp<0,-1,1)
error_rate <- sum(ifelse(predicted ==Y,0,1))/nrow(Y)
M <- 1/(sqrt(t(Bp)%*%Bp))
svm_output <- list(coef = Bp, error_rate = error_rate,margin = M)
class(svm_output) <- "svm_sgd_bb"
return(svm_output)
}
```
The output is stored as a class so the user can easily read off the different
variables of interest, as well as access them efficiently through $.
To do this we calculate the error rate in the standard way (counting how many
of our predictions do not match the true labels), and the margin M according to:
$$
M = 1/\|\beta \|
$$
```{r, eval = FALSE, results='hide',warning=FALSE}
predicted <- ifelse(X%*%B<0,-1,1)
error_rate <- sum(ifelse(predicted ==Y,0,1))/nrow(Y)
M <- 1/(sqrt(t(B)%*%B))
svm_output <- list(coef = B, error_rate = error_rate,margin = M)
class(svm_output) <- "svm_sgd_bb"
return(svm_output)
```
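As a hypothetical usage example, once the full function below is defined, the stored results can be accessed like this:
```{r, eval = FALSE, results='hide',warning=FALSE}
#Hypothetical usage of the returned "svm_sgd_bb" object
fit <- svm_gradient_descent(X, Y, C = 1)
fit$coef        #fitted coefficients (bias first, if bias = TRUE)
fit$error_rate  #training error rate
fit$margin      #margin M = 1/||B||
```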
We can now assemble our entire function from all of its components.
This is the final function we have created:
```{r}
svm_gradient_descent <- function (x,y,C,S = 10,t = 1000,bias = TRUE) {
#B <- coefficient matrix
#t <- number of iterations
#nn <- learning rate
#G <- gradient
#bias <- whether to include bias term (also called constant)
#V <- vector of saved values y or 0 for max(0,1-Yi(BXi))
#S <- Sample size for SGD
#XS, YS <- Sample of X,Y of size S
#Adds a row of 1's if constant is true
if (bias == TRUE) {
Dummy1 <- matrix(1,nrow(x),1)
x <- cbind(Dummy1,x)
}
#Changes TRUE / FALSE to -1 and 1 and turns y into a data type matrix
Y <- data.matrix(ifelse(y == TRUE,1,-1))
#Makes sure x is of data type matrix
X <- data.matrix(x)
#Initializes coefficient & gradient matrices
B <- matrix(0,ncol(x),1)
G <- matrix(0,ncol(x),1)
for (i in 1:t) {
#Sets previous gradient to be current gradient
Gp <- G
#Sampling without replacement
Sample <- sample(nrow(X),S)
XS <- as.matrix(X[Sample,])
YS <- as.matrix(Y[Sample,])
V <- matrix(0,nrow(YS),1)
#Part(1)
#max(0,1-yi(xiB)) part of the algorithm, saving output in V
for (n in c(1:nrow(XS))) {
#Main part of the max(0,1-yi(xiB)) equation
V[n] <- ifelse((1 - (XS[n,] %*% B)*YS[n])>0,YS[n],0)
}
#Part(2)
#updating the gradient using V,B and C
for (j in c(1:ncol(XS))) {
G[j] <- B[j]- (C*((t(V)%*%XS[,j])/nrow(XS)))
}
#Part(3)
#Barzilai-Borwein Step Size for i> iterations (fixed for i==1)
if (i== 1) {
nn = 1/1000
} else {
numerator = abs(t(B-Bp)%*%(G-Gp))
denominator = (t(G-Gp)%*%(G-Gp))
nn = as.vector(numerator/denominator)
}
#Part(4)
#Checks whether the value of B is NaN and prevents further iterations if TRUE
if (is.na(nn)) {
warning(c("Terminated after ",i," iterations"))
predicted <- ifelse(X%*%Bp<0,-1,1)
error_rate <- sum(ifelse(predicted ==Y,0,1))/nrow(Y)
M <- 1/(sqrt(t(Bp)%*%Bp))
svm_output <- list(coef = Bp, error_rate = error_rate,margin = M)
class(svm_output) <- "svm_sgd_bb"
return(svm_output)
}
#Sets previous coefficients to the current coefficients, before updating
Bp <- B
#Takes the step in the direction of minimum
B <- B-nn*G
}
predicted <- ifelse(X%*%B<0,-1,1)
error_rate <- sum(ifelse(predicted ==Y,0,1))/nrow(Y)
M <- 1/(sqrt(t(B)%*%B))
svm_output <- list(coef = B, error_rate = error_rate,margin = M,gradient=G)
class(svm_output) <- "svm_sgd_bb"
return(svm_output)
}
```
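Optionally, since the output is an S3 object of class "svm_sgd_bb", one could define a small print method for it (this sketch is ours and is not required by anything below):
```{r, eval = FALSE, results='hide',warning=FALSE}
#Optional convenience: a print method for our "svm_sgd_bb" class
print.svm_sgd_bb <- function(x, ...) {
  cat("SVM SGD-BB fit\n")
  cat("Training error rate:", x$error_rate, "\n")
  cat("Margin M:", x$margin, "\n")
  cat("Coefficients:\n")
  print(x$coef)
  invisible(x)
}
```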
<h2> 2.4 Testing our function </h2>
Let us create two datasets, the first linearly separable (X1), the other not (X2) [11]:
```{r}
set.seed(80)
n = 500
a1 = rnorm(n)
a2 = 1 + a1 + 2* runif(n)
b1 = rnorm(n)
b2 = -1 + b1 - 2*runif(n)
X1 = rbind(matrix(cbind(a1,a2),,2),matrix(cbind(b1,b2),,2))
a1 = rnorm(n)
a2 = - a1 + 2* runif(n)
b1 = rnorm(n)
b2 = -0.5 - b1 - 2*runif(n)
X2 = rbind(matrix(cbind(a1,a2),,2),matrix(cbind(b1,b2),,2))
Y <- matrix(c(rep(1,n),rep(-1,n)))
remove(a1,a2,b1,b2,n)
```
Now let us check our model for two levels of C (high and low). We will also
increase the number of iterations to make sure our model reaches a minimum,
and set the sample size to about 1/3 of our dataset. Before running our
function we unset the seed so that our random sampling works correctly:
```{r}
set.seed(NULL)
Test1 <- svm_gradient_descent(X1,Y,C=100,t=2000,S=300)
Test2 <- svm_gradient_descent(X1,Y,C=0.1,t=2000,S=300)
Test3 <- svm_gradient_descent(X2,Y,C=100,t=2000,S=300)
Test4 <- svm_gradient_descent(X2,Y,C=0.1,t=2000,S=300)
set.seed(80)
```
Plotting the predictions:
```{r}
par(mfrow=c(2,2))
plot(X1,col=ifelse(Y>0,4,2),pch=".",cex=7,xlab = "x1",ylab = "x2")
abline(-Test1$coef[1]/Test1$coef[3],-Test1$coef[2]/Test1$coef[3])
abline(Test1$margin-(Test1$coef[1]/Test1$coef[3]),-Test1$coef[2]/Test1$coef[3],lty=2)
abline(-Test1$margin-(Test1$coef[1]/Test1$coef[3]),-Test1$coef[2]/Test1$coef[3],lty=2)
plot(X1,col=ifelse(Y>0,4,2),pch=".",cex=7,xlab = "x1",ylab = "x2")
abline(-Test2$coef[1]/Test2$coef[3],-Test2$coef[2]/Test2$coef[3])
abline(Test2$margin-(Test2$coef[1]/Test2$coef[3]),-Test2$coef[2]/Test2$coef[3],lty=2)
abline(-Test2$margin-(Test2$coef[1]/Test2$coef[3]),-Test2$coef[2]/Test2$coef[3],lty=2)
plot(X2,col=ifelse(Y>0,4,2),pch=".",cex=7,xlab = "x1",ylab = "x2")
abline(-Test3$coef[1]/Test3$coef[3],-Test3$coef[2]/Test3$coef[3])
abline(Test3$margin-(Test3$coef[1]/Test3$coef[3]),-Test3$coef[2]/Test3$coef[3],lty=2)
abline(-Test3$margin-(Test3$coef[1]/Test3$coef[3]),-Test3$coef[2]/Test3$coef[3],lty=2)
plot(X2,col=ifelse(Y>0,4,2),pch=".",cex=7,xlab = "x1",ylab = "x2")
abline(-Test4$coef[1]/Test4$coef[3],-Test4$coef[2]/Test4$coef[3])
abline(Test4$margin-(Test4$coef[1]/Test4$coef[3]),-Test4$coef[2]/Test4$coef[3],lty=2)
abline(-Test4$margin-(Test4$coef[1]/Test4$coef[3]),-Test4$coef[2]/Test4$coef[3],lty=2)
```
On the left side we can see the models run with C=100. This corresponds to a
very low upper bound K on misclassifications. As expected, our model classifies
the data almost perfectly, with a very small margin.
On the right side we see both models with C=0.1. As explained in 2.1, C is
inversely proportional to lambda (in the alternative formulation of our model),
hence a low C is equivalent to penalizing our coefficients very hard, just
like in ridge regression. As we can see, we have squashed our decision boundary
towards 0 while significantly increasing the margin (in the bottom-right graph
the margin lies outside the picture).
As this short test shows, our SVM-SGD-BB works as intended. We can
finally deploy it on our data.
<h2> 2.5 Deploying our model on Heart dataset </h2>
Let us deploy our model on our training dataset. Before we do that we need to
preprocess the data so that binary values are expressed as -1/1. Since the
function adds the bias column to the training data itself, we add a column of
1's at the beginning of our test dataset manually:
```{r}
DummySexTrain <- data.matrix(ifelse(train$Sex == "M",1,-1))
DummySexTest <- data.matrix(ifelse(test$Sex == "M",1,-1))
DummyFastingTrain <- data.matrix(ifelse(train$FastingBS == 0,-1,1))
DummyFastingTest <- data.matrix(ifelse(test$FastingBS == 0,-1,1))
DummyAnginaTrain <- data.matrix(ifelse(train$ExerciseAngina == "Y",1,-1))
DummyAnginaTest <- data.matrix(ifelse(test$ExerciseAngina == "Y",1,-1))
Xtrain <-cbind(train$Age,train$RestingBP,train$Cholesterol,train$MaxHR,
train$Oldpeak)
Dummy1 <- matrix(1,nrow(test),1)
Xtest <-cbind(Dummy1,test$Age,test$RestingBP,test$Cholesterol,test$MaxHR,
test$Oldpeak)
Ytrain <- train$HeartDisease
Ytest <- data.matrix(ifelse(test$HeartDisease == TRUE,1,-1))
```
Now let us deploy our model with multiple values of C to see which performs
best. We set S to 100, which is around 1/3 of our training dataset; as our
previous tests have shown, this should be enough to obtain meaningful results.
```{r}
set.seed(80)
SVM_Result1 <- svm_gradient_descent(Xtrain,Ytrain,C=0.1,t=3000,S=100)
SVM_Result2 <- svm_gradient_descent(Xtrain,Ytrain,C=1,t=3000,S=100)
SVM_Result3 <- svm_gradient_descent(Xtrain,Ytrain,C=10,t=3000,S=100)
SVM_Result4 <- svm_gradient_descent(Xtrain,Ytrain,C=100,t=3000,S=100)
head(SVM_Result1$error_rate)
head(SVM_Result2$error_rate)
head(SVM_Result3$error_rate)
head(SVM_Result4$error_rate)
```
We can see that our model with the highest C has the lowest error rate. This is
rather surprising, given we would not expect our data to be linearly separable.
Thus we can expect a significant level of overfitting.
Now let us check how our models perform on test data:
```{r}
SVM_predicted1 <- ifelse(Xtest%*%SVM_Result1$coef<0,-1,1)
SVM_test_error_rate1 <- sum(ifelse(SVM_predicted1 ==Ytest,0,1))/nrow(Ytest)
SVM_predicted2 <- ifelse(Xtest%*%SVM_Result2$coef<0,-1,1)
SVM_test_error_rate2 <- sum(ifelse(SVM_predicted2 ==Ytest,0,1))/nrow(Ytest)
SVM_predicted3 <- ifelse(Xtest%*%SVM_Result3$coef<0,-1,1)
SVM_test_error_rate3 <- sum(ifelse(SVM_predicted3 ==Ytest,0,1))/nrow(Ytest)
SVM_predicted4 <- ifelse(Xtest%*%SVM_Result4$coef<0,-1,1)
SVM_test_error_rate4 <- sum(ifelse(SVM_predicted4 ==Ytest,0,1))/nrow(Ytest)
head(SVM_test_error_rate1)
head(SVM_test_error_rate2)
head(SVM_test_error_rate3)
head(SVM_test_error_rate4)
```
As we expected, the model with the highest C actually performs worse than the
model with C=10. We can also observe that our test error rate is lower than our
training error rate, which would imply our models have not overfit the data.
<b>Please note</b>:
We have not commented further on this part of our SVM algorithm, as it gives
different coefficients each time we re-run it and we do not know what the final
outcome will be. This occurs even with a very high number of iterations
(t > 100k). It is most likely caused by the shape of the loss function near the
minimum, combined with the random sampling of observations. At a minimum our
gradient should be 0; although gradient descent algorithms rarely reach an exact
minimum (rather an approximation), we should still expect the gradient to be
small. For this purpose we examined the gradient of our algorithm (although we
did not include it here). Depending on the run, we observed that it had a
tendency to jump from extremely small values (~10^-5) to very high ones (~350).
This might mean that the descent near the optimum is very steep and, depending
on the sample, the step size increases significantly whenever we come close to
a minimum, effectively overshooting it.
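For readers who want to reproduce this diagnostic, here is a hedged sketch (the variable names are ours) that uses the gradient element returned by the final function:
```{r, eval = FALSE, results='hide',warning=FALSE}
#Sketch of the diagnostic: record the L2 norm of the final gradient over
#repeated runs to see how much it jumps between runs
grad_norms <- replicate(20, {
  fit <- svm_gradient_descent(Xtrain, Ytrain, C = 10, t = 3000, S = 100)
  #early-terminated runs return no gradient element
  if (is.null(fit$gradient)) NA_real_ else sqrt(sum(fit$gradient^2))
})
summary(grad_norms)
```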
[1] An Introduction to Statistical Learning. https://statlearning.com/
[2] The Elements of Statistical Learning. https://web.stanford.edu/~hastie/ElemStatLearn/
[3] https://towardsdatascience.com/a-definitive-explanation-to-hinge-loss-for-support-vector-machines-ab6d8d3178f1
[4] https://towardsdatascience.com/solving-svm-stochastic-gradient-descent-and-hinge-loss-8e8b4dd91f5b
[5] https://www.cs.utah.edu/~zhe/pdf/lec-19-2-svm-sgd-upload.pdf
[6] MA208 Optimization Theory lecture notes, written by Bernhard von Stengel, edited by Julia Böttcher
[7] https://en.wikipedia.org/wiki/Gradient_descent
[8] https://en.wikipedia.org/wiki/Learning_rate
[9] https://en.wikipedia.org/wiki/Stochastic_gradient_descent
[10] https://www.sciencedirect.com/science/article/abs/pii/S0020025515002467
[11] Randomized datasets for testing the SVM: https://rpubs.com/empireisme/linearsvm