-
Notifications
You must be signed in to change notification settings - Fork 0
/
knn.Rmd
85 lines (79 loc) · 2.46 KB
/
knn.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
---
title: "K-nearest-neighbours"
output:
pdf_document: default
html_notebook: default
---
Let's have a look at a non-parametric tool such as Knn for classification.
```{r}
#Downloading the data
website <- "https://web.stanford.edu/~hastie/ElemStatLearn/datasets/ESL.mixture.rda"
load(file(website))
summary(ESL.mixture) # Gaussian mixture data
attach(ESL.mixture)
```
```{r}
plot(x,pch=20, col=ifelse(y==0, "orange", "blue"), asp = 1)
# px1 and px2 define the limits of the input space
grid <- expand.grid(ESL.mixture$px1, ESL.mixture$px2)
points(grid, col="gray", pch=".")
```
Even though we know that the data comes from a Gaussian mixture, we want to use knn to classify the points.
```{r}
#install.packages("FNN")
library(FNN)
```
```{r}
model <- knn(train=x,test=grid,cl=y,k=5, prob = T)
```
Let's look at the predictions of the model for the test set.
```{r}
p <- ifelse(model=="1", attr(model, "prob"), 1- attr(model, "prob"))
```
```{r}
#let's also plot the decision boundary
prob.matrix <- matrix(p, length(px1), length(px2))
contour(px1,px2,prob.matrix, levels=0.5, labels="", xlab="", ylab="", asp=1)
# the training points
points(x,pch=20, col=ifelse(y==0, "orange", "blue"))
# and the resto of the space
grid <- expand.grid(px1, px2)
points(grid, col=ifelse(model==0, "orange","blue"), pch=".")
```
How robust is this classifier?
Let's use random sampling to answer that.
```{r}
sampling.knn <-function(k){
m <- 100 #sample size
n <- length(y)
nrep <- 4
for(i in 1:nrep){
indices <- sample(1:n,size = m, replace = F)
y.sample <- y[indices]
x.sample <- x[indices,]
fit.sample <- knn(train=x.sample,test=grid,cl=y.sample,k=k, prob = T)
# plotting
p <- ifelse(fit.sample=="1", attr(fit.sample, "prob"), 1- attr(fit.sample,
"prob"))
prob.matrix <- matrix(p, length(px1), length(px2))
contour(px1,px2,prob.matrix, levels=0.5, labels="", xlab="", ylab="", asp=1)
points(x.sample,pch=20, col=ifelse(y.sample==0, "orange", "blue"))
grid <- expand.grid(px1, px2)
points(grid, col=ifelse(p<0.5, "orange","blue"), pch=".")
}
}
```
```{r}
sampling.knn(5)
```
From the above plots the decision boundary seems to be highly unstable.
Let's try changing k to see its effect on the robustness of the classifier.
```{r}
sampling.knn(k=10)
```
```{r}
sampling.knn(20)
```
```{r}
sampling.knn(k=2)
```