### Developer: Mayana Mohsin Khan

# Section B. Prediction Uncertainty with Bootstrapping
## Bootstrapping

### Loading Packages

In [15]:
# Loading packages
library(reshape2)
library(ggplot2)
library(corrplot)

### Loading Datasets

In [16]:
# Loading training and testing dataset
train <- read.csv('Task1B_train.csv')
test <- read.csv('Task1B_test.csv')

# splitting the data into predictors and labels
train.data <- data.matrix(train[,1:4])
train.label <- data.matrix(train[,5])
test.data <- data.matrix(test[,1:4])
test.label <- data.matrix(test[,5])

### I. Handle bootstrapping for KNN regression.
#### Boot function to generate sample index

In [17]:
# define a function that generates sample indices using the bootstrap technique
boot <- function (original.size=100, sample.size=original.size, times=100){
    
    # Initialize the bootstrapping matrix
    indx <- matrix(nrow=times, ncol=sample.size)
    
    # populate the matrix with sample indexes
    for (t in 1:times){
        indx[t, ] <- sample(x=original.size, size=sample.size, replace = TRUE)
    }
    return(indx)
}
# calling the boot function to test our boot matrix
boot(100, 10, 5)

0,1,2,3,4,5,6,7,8,9
12,12,62,8,21,4,31,32,19,15
41,62,86,66,10,70,18,10,8,71
50,74,86,55,44,61,12,43,48,40
16,24,27,96,97,10,2,84,9,33
72,74,2,34,20,57,42,58,27,94
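
Each row of the matrix above is one bootstrap sample drawn with replacement, so an index can appear more than once within a row. As a minimal sketch (the seed and the dataset size of 100 are illustrative assumptions, not values taken from the task data), the following draws one sample of the same size as the data and counts how many indices are distinct:

```r
# Illustrative sketch: one bootstrap draw of size n from indices 1..n.
set.seed(1)            # assumed seed, only for reproducibility
n <- 100               # hypothetical dataset size
idx <- sample(x = n, size = n, replace = TRUE)

# With replacement, some indices repeat and others are never drawn.
n.unique <- length(unique(idx))
n.repeated <- sum(duplicated(idx))
cat("distinct indices:", n.unique, "\n")
cat("repeated draws:  ", n.repeated, "\n")

# The expected fraction of distinct indices is 1 - (1 - 1/n)^n,
# which is roughly 0.63 when n is large.
expected.frac <- 1 - (1 - 1/n)^n
cat("expected distinct fraction:", round(expected.frac, 3), "\n")
```

Because roughly a third of the rows are left out of any one draw, models fitted on different bootstrap samples differ, and that variation is what the later box plots display.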


### KNN function

In [18]:
# Function to implement a KNN regressor
knn <- function(train.data, train.label, test.data, K=3, distance = 'manhattan'){
    # Convert the train and test data as a matrix
    train.data <- as.matrix(train.data)
    train.label<- as.matrix(train.label)
    test.data <- as.matrix(test.data)
    
    # length of the training samples
    train.len <- nrow(train.data)
    
    ## count number of test samples
    test.len <- nrow(test.data)

    ## calculate distances between samples
    dist <- as.matrix(dist(rbind(test.data, train.data), 
                           method= distance))[1:test.len, (test.len+1):(test.len+train.len)]
    
    # if there is only one test sample, the subsetting above returns a vector;
    # transpose it back into a 1-row distance matrix
    if(test.len == 1){
        dist <- t(dist)
    }

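    # initialize the predictions locally instead of relying on a global test.label
    test.label <- rep(0, test.len)

    # for each test sample
    for (i in 1:test.len){
        # get the indices of the K nearest training samples
        nn <- as.data.frame(sort(dist[i,], index.return = TRUE))[1:K,2]
        # labels of the nearest training samples
        y <- train.label[nn,]
        # predict the mean of the neighbours' labels
        test.label[i] <- mean(y)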
    }
    
    ## return the predicted test labels as output
    return (test.label)
}
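
To make the neighbour lookup concrete, here is a minimal sketch of KNN regression for a single query point. The toy data and the names `train.x`, `train.y` and `query` are made up for illustration; it uses `order()` as a shorter way to get the indices of the nearest rows, which is an alternative to the `sort(..., index.return = TRUE)` call used in `knn()`.

```r
# Illustrative sketch: KNN regression for one query point on toy data.
# All values below are made up for the example.
train.x <- matrix(c(1, 1,
                    2, 2,
                    3, 3,
                    8, 8), ncol = 2, byrow = TRUE)
train.y <- c(10, 20, 30, 80)
query   <- matrix(c(2, 3), ncol = 2)
K <- 3

# Manhattan distance from the query (row 1) to every training row
d <- as.matrix(dist(rbind(query, train.x), method = "manhattan"))[1, -1]

# indices of the K nearest training rows, then the mean of their labels
nn <- order(d)[1:K]
pred <- mean(train.y[nn])
print(pred)
```

The distances are 3, 1, 1 and 11, so the three nearest rows are the first three and the prediction is the mean of 10, 20 and 30, which is 20.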

### II. Apply your bootstrapping for KNN regression with times = 50 (the number of subsets), size = 20 (the size of each subset), and change K=1,.., 15 (the neighbourhood size). 

Steps to perform bootstrapping:
* Create the bootstrap indices.
* Loop over every n in 1 to 50 (the bootstrap samples).
* Loop over every k in 1 to 15.
* Save the sample indices from the bootstrap.
* Save the values of n and k into the error dataframe.
* Calculate the MSE using $$mean((actual - predicted)^2)$$

where the actual value is `test.label` and the predicted value is obtained from the `knn()` function.
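
The MSE squares each residual before averaging. A small sketch on made-up numbers (not taken from the task data) shows why the order matters: squaring the mean residual instead lets errors of opposite sign cancel.

```r
# Illustrative sketch: MSE on made-up values.
actual    <- c(3.0, 5.0, 2.5)
predicted <- c(2.5, 5.0, 4.0)

# square each residual first, then average
mse <- mean((actual - predicted)^2)

# squaring the mean residual instead understates the error here
wrong <- mean(actual - predicted)^2

print(c(mse = mse, wrong = wrong))
```

The loop in the next cell uses the first form, with the square applied inside `mean()`.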

In [None]:
K <- 15 
L <- 20 # size
N <- 50 # Times

# generate bootstrap indexes using boot function
boot.indx <- boot(nrow(train.data), L, N)

# dataframe to store the test errors (MSE)
miss <- data.frame('K'=1:K, 'N'=1:N, 'test'=rep(0,N*K))
 
# initialize a counter variable
i = 0 

## for every bootstrap sample:
for (n in 1:N){
    
    ### for every k value
    for (k in 1:K){
        
        # increment the iteration index by 1
        i <- i + 1
        
        #### save sample indices that were selected by bootstrap
        indx <- boot.indx[n,]
        
        #### save the value of k and n into the error dataframe
        miss[i,'K'] <- k
        miss[i,'N'] <- n
        
        #### calculate and record the test MSE
        miss[i,'test'] <- mean((test.label - knn(train.data[indx,], train.label[indx], test.data, K=k))^2)
    } 
}

### Create a boxplot where the x-axis is K, and the y-axis is the average error (and the uncertainty around it) corresponding to each K.

In [None]:
# Pivot the dataframe
miss.m <- melt(miss, id=c('K', 'N')) 
# rename the columns
names(miss.m) <- c('K', 'N', 'type', 'miss')

# create the box plot
ggplot(data=miss.m, aes(factor(K), miss,fill=type)) + geom_boxplot(outlier.shape = NA)  + 
    scale_fill_discrete(guide = guide_legend(title = NULL)) + 
    ggtitle('MSE vs. K (Box Plot)') + theme_minimal()

# ignore the warnings (because of ignoring outliers)
options(warn=-1)

### III. Based on the plot in the previous part (Part II), how does the test error and its uncertainty behave as K increases? 

##### ANSWER
The plot shows that as K increases, the test error (MSE) increases, and so does its uncertainty, i.e. the spread of the error across bootstrap samples.
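
A numerical summary can back up what is read off the box plot. The sketch below is a minimal example, assuming a data frame with the same columns as `miss`; `errs` is a hypothetical stand-in and its `test` values are random placeholders, not results from this notebook.

```r
# Illustrative sketch: mean and spread of the error per K.
# `errs` is a hypothetical stand-in with the same columns as `miss`;
# its 'test' values are random placeholders, not results.
set.seed(42)
errs <- expand.grid(K = 1:3, N = 1:5)
errs$test <- runif(nrow(errs))

# average error and standard deviation across bootstrap samples, per K
avg    <- aggregate(test ~ K, data = errs, FUN = mean)
spread <- aggregate(test ~ K, data = errs, FUN = sd)
summary.k <- data.frame(K = avg$K, mean.mse = avg$test, sd.mse = spread$test)
print(summary.k)
```

Replacing `errs` with `miss` would give one row per K, with the mean as the central error and the standard deviation as a rough measure of its uncertainty.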

### IV. Apply your bootstrapping for KNN regression with K = 5 (the neighbourhood size), times = 50 (the number of subsets), and change sizes = 5, 10, 15,..., 75 (the size of each subset). 



In [None]:
K <- 5 
N <- 50 # times - number of subsets
L <- seq(from = 5, to = 75, by = 5) # size of subsets

# dataframe to store the test errors (MSE)
miss <- data.frame('N'=1:N, 'L'= L, 'test'=rep(0,N*length(L)))

# counter to increment the iterations
i = 0 

## for every value in size of subset:
for (l in L){
    
    # generate bootstrap indices:
    boot.indx <- boot(nrow(train.data), l, N)
    
    ### for each subset:
    for (n in 1:N){
        
        # Increment the iteration counter
        i <- i + 1
        
        # save the sample boot indexes
        indx <- boot.indx[n,]
        
        # save the value of N and L into dataframe
        miss[i,'N'] <- n
        miss[i,'L'] <- l
        
        # calculate and record the test MSE
        miss[i,'test'] <- mean((test.label - knn(train.data[indx,], train.label[indx], test.data, K=K))^2)
    } 
}

### Create a boxplot where the x-axis is ‘sizes’ and the y-axis is the average error (and the uncertainty around it) corresponding to each value of ‘times’.

In [None]:
# Pivot the dataframe
miss.m <- melt(miss, id=c('N', 'L')) 
# rename columns
names(miss.m) <- c('N', 'L', 'type', 'miss')


In [None]:
# Plot MSE vs L
ggplot(data=miss.m, aes(factor(L), miss,fill=type)) + geom_boxplot(outlier.shape = NA)  + 
    scale_fill_discrete(guide = guide_legend(title = NULL)) + 
    ggtitle('MSE vs. L (size of subsets) (Box Plot)') + theme_minimal()
# ignore the warnings (because of ignoring outliers)

options(warn=-1)

### V. Based on the plot in the previous part (Part IV), how does the test error and its uncertainty behave as the size of each subset in bootstrapping increases? Explain in your Jupyter Notebook file.

##### ANSWER

The graph shows that the test error decreases as the size of each bootstrap subset increases.