# Question 6 [Multiclass Perceptron, 20 Marks]

In [1]:
library(ggplot2)
library(reshape2)


### 6.1 
##### Load Task1D_train.csv and Task1D_test.csv sets.

In [2]:
# Reading train and test data sets
train = read.csv("Data_set/Task1D_train.csv")
train_data = train[1:4]
train_label = train[5]

test = read.csv("Data_set/Task1D_test.csv")
test_data = test[1:4]
test_label = test[5]

### 6.2
##### Implement the multiclass perceptron as explained above. Please provide enough comments for your code in your submission.

In [3]:
# Encode the class labels ("C1", "C2", "C3") as integers for convenience
encode = function (label, en=c(1, 2, 3)){
    encoding = c()
    for (lbl in label){
        if (lbl == "C1"){ encoding = append(encoding, en[1]) }
        else if (lbl == "C2"){encoding = append(encoding, en[2])}
        else { encoding = append(encoding, en[3])}
    }
    return (encoding)
}
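
The loop in `encode` can also be written without growing a vector element by element. The following is only a sketch: `encode_vec` and its `classes` argument are hypothetical names that are not part of the original submission, and unlike `encode` it returns `NA` for an unknown label instead of falling back to class 3.

```r
# Hypothetical vectorised variant of encode(); not part of the original code.
encode_vec = function(label, classes = c("C1", "C2", "C3")) {
    # match() returns the position of each label within `classes`
    match(as.character(label), classes)
}

labels = c("C1", "C3", "C2", "C1")
codes = encode_vec(labels)
print(codes)
```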

In [4]:
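# Predict a class index (1, 2 or 3) for every row of `data`.
# W is a list of three weight vectors (one per class), each including the bias weight.
# Note: this definition masks stats::predict() in the global environment.
predict = function(data, W){
    # Prepend a column of ones for the bias term
    Phi = as.matrix(cbind(1, data))
    pred = data.frame("prediction"=1:nrow(data))
    for (i in 1:nrow(data)){
        # Score of each class for the i-th data point
        L1 = W[[1]]%*%Phi[i,]
        L2 = W[[2]]%*%Phi[i,]
        L3 = W[[3]]%*%Phi[i,]
        # Assign the data point to the class with the largest score
        pred[i,] = which.max(c(L1, L2, L3))
    }
    return (pred)
}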

# Misclassification rate: fraction of predictions that differ from the encoded true labels
get_missclassfications = function(prediction, label){
    sum(prediction[,1] != encode(label[, 1])) / nrow(label)
}
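
The row-by-row loop in `predict` computes one score per class and picks the largest. Under the assumption that the three weight vectors are stacked into a single matrix (for example with `do.call(rbind, W)`), the same rule can be sketched as one matrix product. The names `predict_vec`, `W_mat` and `X`, and all numeric values, are illustrative only.

```r
# Toy weights: one row per class, columns = bias + 2 features (illustrative values)
W_mat = rbind(c(0,  1,  0),
              c(0,  0,  1),
              c(1, -1, -1))

# Toy inputs: 3 data points with 2 features each
X = data.frame(x1 = c(2, 0, 0), x2 = c(0, 2, 0))

predict_vec = function(data, W_mat) {
    Phi = as.matrix(cbind(1, data))          # prepend the bias column
    scores = Phi %*% t(W_mat)                # N x K matrix of class scores
    max.col(scores, ties.method = "first")   # index of the largest score in each row
}

pred = predict_vec(X, W_mat)
print(pred)
```

With `ties.method = "first"`, ties are resolved in favour of the lowest class index, which matches what `which.max` would do on each row.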

In [5]:
sgd = function(train_data, train_label, test_data, test_label, eta, epsilon, tau_max){
    
    Phi = as.matrix(cbind(1, train_data))

    # Initialize weights
    # Weights for C1
    W1 = matrix(,nrow=tau_max, ncol=ncol(Phi))
    W1[1,] = runif(ncol(Phi))
    # Weights for C2
    W2 = matrix(,nrow=tau_max, ncol=ncol(Phi))
    W2[1,] = runif(ncol(Phi))
    # Weights for C3
    W3 = matrix(,nrow=tau_max, ncol=ncol(Phi))
    W3[1,] = runif(ncol(Phi))

    # Combine under single list for better accessibility
    W = list(W1, W2, W3)
    
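    # Row index of the current weight vector for each class
    tau_w = c(1, 1, 1)
    # Iteration counter (incremented on every misclassification)
    tau = 1
    # Test error recorded after every 5 training points
    errors = data.frame()
    while (TRUE){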

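        # Reshuffle the training data and the associated labels
        train_index = sample(1:nrow(train_data), replace = FALSE)
        # Rebuild Phi from train_data so that rows and labels share the same permutation
        # (re-indexing the already shuffled Phi would misalign them after the first epoch)
        Phi = as.matrix(cbind(1, train_data))[train_index,]
        # Get the training labels in the same order and encode them for convenience
        target = encode(train_label[train_index,])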

        #Training starts for each training datapoint
        for (i in 1:nrow(train_data)){
            # Termination condition 
            if (tau == tau_max) {break}

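            # After every 5 training points (one mini-batch), record the test error
            # of the current weight vectors
            if (i %% 5 == 0){
                Wt = list(W[[1]][tau_w[1],], W[[2]][tau_w[2],],W[[3]][tau_w[3],])
                e = data.frame("test"= get_missclassfications(predict(test_data, Wt), test_label))
                errors = rbind(errors, e)
            }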

            # Calculate logit values from weight vectors for each label
            Logit_C1 = W[[1]][tau_w[1],]%*%Phi[i,]
            Logit_C2 = W[[2]][tau_w[2],]%*%Phi[i,]
            Logit_C3 = W[[3]][tau_w[3],]%*%Phi[i,]

            # Get predictions. We will assign datapoint to the class with largest logit value
            pred_idx = sort(c(Logit_C1, Logit_C2, Logit_C3), index.return = TRUE, decreasing=TRUE)$ix[1]

            # Target label to match
            target_idx = target[i]

            # If misclassified
            if (pred_idx != target_idx){
                # Update the iteration counter (tau)
                tau = tau + 1

                # Advance the row indexes in the weight matrices (the new weights go into the next row)
                tau_w[pred_idx]  = tau_w[pred_idx] + 1
                tau_w[target_idx]  = tau_w[target_idx] + 1

                # Move the weight vector of the wrongly predicted class away from the data point
                W[[pred_idx]][tau_w[pred_idx],] = W[[pred_idx]][tau_w[pred_idx]-1,] - eta * Phi[i,]
                # Move the weight vector of the true class towards the data point
                W[[target_idx]][tau_w[target_idx],] = W[[target_idx]][tau_w[target_idx]-1,] + eta * Phi[i,]

            } 
        }
        # Reduce learning rate
        eta = eta * 0.99
        
        # Terminate training
        if (tau >= tau_max){break}

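        # Convergence check: compare the training error of the current weights with
        # the training error of the previous weight vector of each class.
        # max(..., 1) guards against row index 0 when a class has not been updated yet.
        W_old = list(W[[1]][max(tau_w[1]-1, 1),], W[[2]][max(tau_w[2]-1, 1),], W[[3]][max(tau_w[3]-1, 1),])
        W_new = list(W[[1]][tau_w[1],], W[[2]][tau_w[2],],W[[3]][tau_w[3],])
        p_old = predict(train_data, W_old)
        p_new = predict(train_data, W_new)
        miss_old = get_missclassfications(p_old, train_label)
        miss_new = get_missclassfications(p_new, train_label)

        # Stop when the training error changes by no more than epsilon
        if (abs(miss_new - miss_old) <= epsilon) {break}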

    }

    # Extract final weights
    W1 = W[[1]][tau_w[1],]
    W2 = W[[2]][tau_w[2],]
    W3 = W[[3]][tau_w[3],]

    # Final Weights
    Weights = list(W1, W2, W3)
    
    return (list("W"=Weights, "error"=errors)) 
}
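
To make the update rule inside `sgd` concrete, here is a minimal worked example of a single perceptron step on one data point. All values (`eta`, the weight vectors, `phi`, and the true class) are made up for illustration and do not come from the data set.

```r
eta = 0.1
# Three class weight vectors (bias + 2 features), illustrative values only
W = list(c(0, 0, 0), c(0.5, 0.5, 0.5), c(0, 0, 0))
phi = c(1, 2, -1)    # feature vector with a leading 1 for the bias
target_idx = 1       # assumed true class of this point

# Score of each class and the predicted class (largest score)
scores = sapply(W, function(w) sum(w * phi))
pred_idx = which.max(scores)

# On a mistake, push the predicted class away and pull the true class closer
if (pred_idx != target_idx) {
    W[[pred_idx]]   = W[[pred_idx]]   - eta * phi
    W[[target_idx]] = W[[target_idx]] + eta * phi
}
print(W)
```

Here the scores are 0, 1 and 0, so class 2 is predicted although the true class is 1. Only the weight vectors of classes 1 and 2 change; class 3 is left untouched.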

### 6.3
##### Train two multiclass perceptron models on the provided training data by setting the learning rate η to 0.1 and 0.01 respectively. Note that all parameter settings stay the same, except the learning rate, when building each model. For each model, evaluate the error of the model on the test data after processing every 5 training data points (also known as a mini-batch). Then plot the testing errors of the two models built with learning rates 0.1 and 0.01 (in different colors) versus the number of mini-batches. Include the plot in your Jupyter Notebook file for Question 6. Now explain how the testing errors of the two models behave differently as the training data increases, by observing your plot. (Include all your answers in your Jupyter Notebook file.)

In [7]:
eta1_out = sgd(train_data, train_label, test_data, test_label, eta=0.01, epsilon=0.001, tau_max=100)
eta2_out = sgd(train_data, train_label, test_data, test_label, eta=0.10, epsilon=0.001, tau_max=100)

lim = min(nrow(eta1_out$error), nrow(eta2_out$error))
error = data.frame("batch.id"=1:lim, "eta.01"=eta1_out$error[1:lim,], "eta.10"=eta2_out$error[1:lim,])
error_m = melt(error, id='batch.id')
error

| batch.id (int) | eta.01 (dbl) | eta.10 (dbl) |
|---|---|---|
| 1 | 0.6133333 | 0.36 |
| 2 | 0.5733333 | 0.66666667 |
| 3 | 0.6666667 | 0.66666667 |
| 4 | 0.4933333 | 0.37333333 |
| 5 | 0.4666667 | 0.33333333 |
| 6 | 0.6666667 | 0.33333333 |
| 7 | 0.6666667 | 0.33333333 |
| 8 | 0.3466667 | 0.33333333 |
| 9 | 0.3466667 | 0.33333333 |
| 10 | 0.3466667 | 0.09333333 |
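
The plot requested in 6.3 does not appear in this export. A minimal sketch of how it could be drawn from the melted data frame follows, assuming the ten error values printed above; the axis labels, legend title and theme are illustrative choices, not taken from the original notebook.

```r
library(ggplot2)
library(reshape2)

# Test errors copied from the printed output above
error = data.frame(
    batch.id = 1:10,
    eta.01 = c(0.6133333, 0.5733333, 0.6666667, 0.4933333, 0.4666667,
               0.6666667, 0.6666667, 0.3466667, 0.3466667, 0.3466667),
    eta.10 = c(0.36, 0.66666667, 0.66666667, 0.37333333, 0.33333333,
               0.33333333, 0.33333333, 0.33333333, 0.33333333, 0.09333333)
)

# Long format: one row per (mini-batch, learning rate) pair
error_m = melt(error, id = "batch.id")

# One colored line per learning rate
p = ggplot(error_m, aes(x = batch.id, y = value, color = variable)) +
    geom_line() +
    geom_point() +
    labs(x = "Mini-batch (5 training points each)", y = "Test error",
         color = "Learning rate") +
    theme_minimal()
print(p)
```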


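Answer: As more training points are processed, the model with the lower learning rate (η = 0.01) converges slowly: over the first ten mini-batches its test error still fluctuates between about 0.35 and 0.67 and ends at 0.347. The model with the higher learning rate (η = 0.1) takes larger, greedier steps towards a solution: its test error jumps to 0.667 at mini-batches 2 and 3, then settles at 0.333 and falls to 0.093 by mini-batch 10. Because the weights are initialised randomly and the training data is shuffled without a fixed seed, the exact values change from run to run.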