In [10]:
library(data.table)
library(TSrepr)
library(TSdist)
library(dtw)
library(TunePareto)
library(dplyr)

# Summary

In this homework, 60 parameter combinations are evaluated for hyperparameter tuning. The combinations come from 5 representations (raw, 2 differenced versions, and 2 PAA versions), 4 distance measures, and 3 k values. Differences are computed with the shift function using lags of 1 and 2. For PAA, 2 segment lengths are chosen to suit the length of the time series. The distance measures are Euclidean, Dynamic Time Warping (DTW), Longest Common Subsequence (LCSS), and Edit Distance with Real Penalty (ERP). Finally, the k values for the k-nearest-neighbor model are 1, 5, and 10.

Model performance is evaluated with 10-fold cross-validation repeated 5 times. Every model is evaluated on the same test indices so that all combinations are compared under identical conditions. The 60 results are then summarized in a dataframe containing the average accuracy and the standard deviation of accuracy. The best parameter combination is selected from this dataframe and used to measure accuracy on the test dataset. Final comments are given at the end of the notebook.
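As a quick check of the count, the grid of combinations can be enumerated in base R. This is only a sketch: the labels below are illustrative names chosen here, not identifiers used elsewhere in the notebook.

```r
# Illustrative labels only (assumed names, not taken from the notebook code)
representations <- c("raw", "diff_lag1", "diff_lag2", "paa_a", "paa_b")
distances <- c("euclidean", "dtw", "lcss", "erp")
k_values <- c(1, 5, 10)

# One row per (representation, distance, k) combination
param_grid <- expand.grid(representation = representations,
                          distance = distances,
                          k = k_values,
                          stringsAsFactors = FALSE)
n_combinations <- nrow(param_grid)
print(n_combinations)
```

Each row of `param_grid` corresponds to one model that is evaluated on every fold of every repetition.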

# Contents
(A table of contents is included so that each technique can be located quickly.)

1. [Data Preparation](#1)
1. [Classify Function](#2)
1. [Representations](#3)
    1. Raw Data
    1. [Difference Function](#4)
    1. [PAA Function](#5)
1. [Distances](#6)
    1. [Raw Data](#7)
    1. [Difference Data](#8)
    1. [PAA Data](#9)
1. [Main Model](#10)
1. [Result of Models](#11)
1. [Test Performance](#12)
1. [Comments](#13)

<a id="1"></a>
# Data Preparation

In [11]:
dataset_path="D:/Datasets/Univariate2018_arff/Univariate_arff/"

In [12]:
distance_path="C:/Users/bahad/GitHub/IE48B/Homework3/Distances/"

In [13]:
first_dataset="Beef"
second_dataset="BirdChicken"
third_dataset="BMW"
fourth_dataset="Coffee"
fifth_dataset="Wine"

## Loading Dataset

In [14]:
traindata=as.matrix(fread(sprintf('%s%s/%s_TRAIN.txt',dataset_path, fifth_dataset,fifth_dataset)))

In [15]:
head(traindata)

V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V226,V227,V228,V229,V230,V231,V232,V233,V234,V235
1,1.781622,1.641665,1.513937,1.394362,1.282035,1.174689,1.070514,0.9704153,0.8739401,...,-1.670196,-1.709601,-1.74493,-1.782977,-1.810606,-1.837329,-1.860882,-1.880358,-1.901193,-1.918857
1,1.77975,1.638856,1.512006,1.391952,1.277787,1.170418,1.066673,0.967005,0.8714146,...,-1.670564,-1.710431,-1.745768,-1.78337,-1.811005,-1.837734,-1.861292,-1.880773,-1.901612,-1.91928
1,1.776492,1.63626,1.508282,1.388472,1.275469,1.166551,1.063533,0.965053,0.8683883,...,-1.673939,-1.713876,-1.74882,-1.786488,-1.814171,-1.840947,-1.864546,-1.88406,-1.904482,-1.922635
1,1.77408,1.63514,1.50789,1.388734,1.277222,1.171555,1.066788,0.9665168,0.8720912,...,-1.656265,-1.695384,-1.730906,-1.769126,-1.796555,-1.823084,-1.846465,-1.8658,-1.886933,-1.904919
1,1.776502,1.637654,1.510938,1.39231,1.279075,1.17258,1.06923,0.9703732,0.8746622,...,-1.655163,-1.694706,-1.730204,-1.768399,-1.795809,-1.821871,-1.844788,-1.864559,-1.885678,-1.903652
1,1.776937,1.63661,1.51063,1.392272,1.28019,1.173488,1.070821,0.9717404,0.87535,...,-1.651872,-1.691325,-1.726295,-1.763954,-1.791302,-1.818202,-1.841067,-1.860345,-1.881416,-1.899349


In [16]:
str(traindata)

 num [1:57, 1:235] 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:235] "V1" "V2" "V3" "V4" ...


## Class Information

In [17]:
trainclass=traindata[,1] 

## Time Series

In [18]:
traindata=traindata[,2:ncol(traindata)]

## Dataset Information

In [19]:
tlength=ncol(traindata)
n_series_train=nrow(traindata)

## Indices for Datasets

The test indices are generated with the TunePareto library so that every model is evaluated on identical folds. The nfold and ntimes parameters are set to 10 and 5, respectively.
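To illustrate what stratified fold assignment means, here is a minimal base R sketch. The helper `stratified_folds` and the toy labels (27 and 30 members per class) are assumptions made for illustration; this is not how TunePareto implements `generateCVRuns`.

```r
# Hypothetical helper: assign indices to folds class by class, so that
# every fold receives a similar share of each class.
stratified_folds <- function(labels, nfold) {
  fold_id <- integer(length(labels))
  for (cl in unique(labels)) {
    idx <- which(labels == cl)
    idx <- idx[sample.int(length(idx))]              # shuffle within the class
    fold_id[idx] <- rep_len(seq_len(nfold), length(idx))
  }
  split(seq_along(labels), fold_id)
}

set.seed(35)
toy_labels <- rep(c(1, 2), times = c(27, 30))        # made-up class sizes
folds <- stratified_folds(toy_labels, nfold = 10)
fold_sizes <- lengths(folds)
print(fold_sizes)
```

Every index appears in exactly one fold, so each observation is used as test data once per repetition.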

In [20]:
set.seed(35)
nof_rep=5
n_fold=10

In [21]:
cv_indices=generateCVRuns(trainclass, ntimes =nof_rep, nfold = n_fold, 
                          leaveOneOut = FALSE, stratified = TRUE)

str(cv_indices)

List of 5
 $ Run  1:List of 10
  ..$ Fold  1 : int [1:6] 10 2 16 39 47 55
  ..$ Fold  2 : int [1:6] 6 24 13 44 45 46
  ..$ Fold  3 : int [1:6] 8 30 15 42 54 56
  ..$ Fold  4 : int [1:6] 1 17 4 41 32 57
  ..$ Fold  5 : int [1:6] 7 3 20 35 48 49
  ..$ Fold  6 : int [1:6] 11 23 27 40 50 38
  ..$ Fold  7 : int [1:6] 9 22 28 53 37 34
  ..$ Fold  8 : int [1:5] 14 29 19 31 33
  ..$ Fold  9 : int [1:5] 5 26 21 36 43
  ..$ Fold  10: int [1:5] 18 25 12 52 51
 $ Run  2:List of 10
  ..$ Fold  1 : int [1:6] 9 29 10 43 37 39
  ..$ Fold  2 : int [1:6] 4 17 3 51 47 55
  ..$ Fold  3 : int [1:6] 6 13 22 32 33 50
  ..$ Fold  4 : int [1:6] 23 2 30 35 52 38
  ..$ Fold  5 : int [1:6] 1 5 24 48 34 36
  ..$ Fold  6 : int [1:6] 8 26 27 56 57 44
  ..$ Fold  7 : int [1:6] 16 18 14 54 45 31
  ..$ Fold  8 : int [1:5] 11 21 7 49 40
  ..$ Fold  9 : int [1:5] 25 12 20 53 46
  ..$ Fold  10: int [1:5] 15 19 28 41 42
 $ Run  3:List of 10
  ..$ Fold  1 : int [1:6] 12 21 1 41 31 48
  ..$ Fold  2 : int [1:6] 8 28 2 38 54 5

<a id="2"></a>
# Classify Function

The classify function is taken from the lecture notebooks. It takes 4 parameters: a distance matrix, the class information, the test indices, and the k parameter.

In [22]:
nn_classify_cv=function(dist_matrix,train_class,test_indices,k){
    
    test_distances_to_train=dist_matrix[test_indices,]
    test_distances_to_train=test_distances_to_train[,-test_indices]
    train_class=train_class[-test_indices]

    ordered_indices=apply(test_distances_to_train,1,order)
    if(k==1){
        nearest_class=as.numeric(train_class[as.numeric(ordered_indices[1,])])
        nearest_class=data.table(id=test_indices,nearest_class)
    } else {
        nearest_class=apply(ordered_indices[1:k,],2,function(x) {train_class[x]})
        nearest_class=data.table(id=test_indices,t(nearest_class))
    }
    
    long_nn_class=melt(nearest_class,'id')

    class_counts=long_nn_class[,.N,list(id,value)]
    class_counts[,predicted_prob:=N/k]
    wide_class_prob_predictions=dcast(class_counts,id~value,value.var='predicted_prob')
    wide_class_prob_predictions[is.na(wide_class_prob_predictions)]=0
    class_predictions=class_counts[,list(predicted=value[which.max(N)]),by=list(id)]
    
    
    return(list(prediction=class_predictions,prob_estimates=wide_class_prob_predictions))
    
}
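The core idea of classifying from a precomputed distance matrix can be shown on a toy example. The following is a simplified base R sketch with a hypothetical helper `knn_vote`; it is not the lecture function above and it returns only the majority-vote label.

```r
# Hypothetical helper: k-NN majority vote from a precomputed distance matrix.
knn_vote <- function(dist_matrix, labels, test_idx, k) {
  train_idx <- setdiff(seq_along(labels), test_idx)   # test rows cannot vote
  sapply(test_idx, function(i) {
    d <- dist_matrix[i, train_idx]
    nearest <- train_idx[order(d)[seq_len(k)]]         # k closest training rows
    votes <- table(labels[nearest])
    as.numeric(names(votes)[which.max(votes)])         # most frequent label
  })
}

toy_points <- c(0, 0.1, 0.2, 5, 5.1, 5.2)              # two well-separated groups
toy_labels <- c(1, 1, 1, 2, 2, 2)
toy_dist <- as.matrix(dist(toy_points))

pred_k1 <- knn_vote(toy_dist, toy_labels, test_idx = c(3, 6), k = 1)
pred_k3 <- knn_vote(toy_dist, toy_labels, test_idx = c(3, 6), k = 3)
print(pred_k1)
print(pred_k3)
```

As in the notebook's function, the test rows are removed from the candidate neighbors before the nearest ones are selected.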

<a id="3"></a>
# Representations

There are 3 major representations: the raw dataset, the differenced dataset, and the PAA representation. For the differenced datasets, lags of 1 and 2 are used in the shift operator. For the PAA datasets, segment lengths of 9 and 18 are used; both divide the series length of 234 evenly. Each representation is introduced with example code that shows the resulting dataframe, followed by a function that wraps the same steps.

## Example Code for difference datasets

In [23]:
dt_ts_train=data.table(traindata)
dt_ts_train[,id:=1:.N]
long_train=melt(dt_ts_train,id.vars=c('id'))
long_train[,time:=as.numeric(gsub("\\D", "", variable))-1]
long_train=long_train[order(id,time)]
diff_long=copy(long_train)
diff_long[,diff_series:=value-shift(value,1),by=list(id)]
head(diff_long)

id,variable,value,time,diff_series
1,V2,1.781622,1,
1,V3,1.641665,2,-0.139957
1,V4,1.513937,3,-0.1277277
1,V5,1.394362,4,-0.1195748
1,V6,1.282035,5,-0.1123279
1,V7,1.174689,6,-0.1073456


In [24]:
diff_train=dcast(diff_long[!is.na(diff_series)],id~time,value.var='diff_series')
diff_train=diff_train[,-c("id")]
head(diff_train)
diff_train=as.matrix(diff_train)

2,3,4,5,6,7,8,9,10,11,...,225,226,227,228,229,230,231,232,233,234
-0.139957,-0.1277277,-0.1195748,-0.1123279,-0.1073456,-0.1041751,-0.10009862,-0.09647516,-0.09330462,-0.08696353,...,-0.0461994,-0.0394053,-0.0353289,-0.0380466,-0.027629,-0.0267232,-0.0235526,-0.0194762,-0.020835,-0.0176645
-0.1408939,-0.1268498,-0.1200542,-0.1141648,-0.1073693,-0.103745,-0.09966771,-0.09559037,-0.09241912,-0.08834181,...,-0.0462096,-0.039867,-0.0353368,-0.0376019,-0.0276351,-0.0267291,-0.0235578,-0.0194805,-0.0208396,-0.0176683
-0.140232,-0.1279786,-0.1198098,-0.1130024,-0.108918,-0.1030183,-0.09848005,-0.09666472,-0.09303413,-0.08622675,...,-0.0462901,-0.0399366,-0.0349446,-0.0376674,-0.0276834,-0.0267756,-0.0235989,-0.0195145,-0.0204221,-0.018153
-0.1389405,-0.1272497,-0.119156,-0.1115121,-0.1056667,-0.1047674,-0.10027091,-0.09442555,-0.09127803,-0.08678158,...,-0.0458639,-0.0391191,-0.035522,-0.0382199,-0.0274283,-0.0265291,-0.0233816,-0.0193347,-0.0211334,-0.0179858
-0.1388483,-0.1267159,-0.1186277,-0.1132355,-0.1064953,-0.1033499,-0.09885643,-0.09571097,-0.0921162,-0.08627468,...,-0.0458334,-0.0395426,-0.0354984,-0.0381945,-0.0274102,-0.0260621,-0.0229167,-0.0197713,-0.0211194,-0.0179738
-0.1403266,-0.12598,-0.1183585,-0.1120819,-0.1067019,-0.102667,-0.09908042,-0.09639043,-0.0928038,-0.0847339,...,-0.0457294,-0.0394528,-0.0349696,-0.0376595,-0.027348,-0.0268996,-0.0228647,-0.0192781,-0.0210714,-0.0179331


<a id="4"></a>
## Difference Function

In [25]:
difference_obtainer=function(traindata, diff_value){
    # Reshape the series matrix to long format: one row per (id, time)
    dt_ts_train=data.table(traindata)
    dt_ts_train[,id:=1:.N]
    long_train=melt(dt_ts_train,id.vars=c('id'))
    long_train[,time:=as.numeric(gsub("\\D", "", variable))-1]
    long_train=long_train[order(id,time)]
    diff_long=copy(long_train)
    # Lagged difference within each series; the lag is set by diff_value
    diff_long[,diff_series:=value-shift(value,diff_value),by=list(id)]
    
    # Drop the leading NA values and reshape back to one row per series
    diff_train=dcast(diff_long[!is.na(diff_series)],id~time,value.var='diff_series')
    diff_train=diff_train[,-c("id")]
    diff_train=as.matrix(diff_train)
    
    return(diff_train)
}

This function is used to obtain a differenced dataset. The "_2" suffix marks the dataset computed with a lag of 2 (each value minus the value two time steps earlier).
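A lag-2 difference is not the same as differencing twice. The short base R sketch below, on a made-up series, contrasts the two; `diff(x, lag = n)` gives the same values as `value - shift(value, n)` once the leading missing values are dropped.

```r
x <- c(5, 3, 4, 8, 6)                      # made-up toy series

lag1 <- diff(x, lag = 1)                   # x[t] - x[t-1]
lag2 <- diff(x, lag = 2)                   # x[t] - x[t-2], as used for the "_2" dataset
second_order <- diff(x, differences = 2)   # difference of the lag-1 differences

print(lag1)          # -2  1  4 -2
print(lag2)          # -1  5  2
print(second_order)  #  3  3 -6
```

A lag of n shortens each series by n points, so the lag-1 and lag-2 datasets have different numbers of columns.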

### Difference with Lag 1

In [26]:
diff_train=difference_obtainer(traindata,1)

### Difference with Lag 2

In [27]:
diff_train_2=difference_obtainer(traindata,2)

## Example Code for PAA

In [28]:
segment_length=5

In [29]:
paa_results=vector("list", max(long_train$id))

In [30]:
for(i in 1:max(long_train$id)){
    current_ts=long_train[id==i,]$value
    
    paa_rep=repr_paa(current_ts, segment_length, meanC)
    current_dt=data.table(time=1:length(long_train[id==i,]$value))
    result_dt=data.table(time=c(1:(length(paa_rep)))*segment_length, values=paa_rep)
    all_dt=merge(current_dt, result_dt, by='time',all.x=T)
    all_dt[,values:=nafill(values,'nocb')]
    paa_results[[i]]=transpose(data.table(values=all_dt$values))
    
}

In [31]:
paa_train=rbindlist(paa_results)

In [32]:
paa_train

V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V225,V226,V227,V228,V229,V230,V231,V232,V233,V234
1.522724,1.522724,1.522724,1.522724,1.522724,0.9740388,0.9740388,0.9740388,0.9740388,0.9740388,...,-1.56892,-1.777089,-1.777089,-1.777089,-1.777089,-1.777089,,,,
1.52007,1.52007,1.52007,1.52007,1.52007,0.9709011,0.9709011,0.9709011,0.9709011,0.9709011,...,-1.568813,-1.777662,-1.777662,-1.777662,-1.777662,-1.777662,,,,
1.516995,1.516995,1.516995,1.516995,1.516995,0.967776,0.967776,0.967776,0.967776,0.967776,...,-1.572192,-1.78086,-1.78086,-1.78086,-1.78086,-1.78086,,,,
1.516613,1.516613,1.516613,1.516613,1.516613,0.9715528,0.9715528,0.9715528,0.9715528,0.9715528,...,-1.555994,-1.763011,-1.763011,-1.763011,-1.763011,-1.763011,,,,
1.519296,1.519296,1.519296,1.519296,1.519296,0.9738781,0.9738781,0.9738781,0.9738781,0.9738781,...,-1.554959,-1.762198,-1.762198,-1.762198,-1.762198,-1.762198,,,,
1.519328,1.519328,1.519328,1.519328,1.519328,0.974789,0.974789,0.974789,0.974789,0.974789,...,-1.551895,-1.758216,-1.758216,-1.758216,-1.758216,-1.758216,,,,
1.413624,1.413624,1.413624,1.413624,1.413624,0.9047137,0.9047137,0.9047137,0.9047137,0.9047137,...,-1.559792,-1.759599,-1.759599,-1.759599,-1.759599,-1.759599,,,,
1.413895,1.413895,1.413895,1.413895,1.413895,0.9044763,0.9044763,0.9044763,0.9044763,0.9044763,...,-1.561058,-1.761513,-1.761513,-1.761513,-1.761513,-1.761513,,,,
1.415631,1.415631,1.415631,1.415631,1.415631,0.9059895,0.9059895,0.9059895,0.9059895,0.9059895,...,-1.559395,-1.759925,-1.759925,-1.759925,-1.759925,-1.759925,,,,
1.464627,1.464627,1.464627,1.464627,1.464627,0.9375599,0.9375599,0.9375599,0.9375599,0.9375599,...,-1.559965,-1.765061,-1.765061,-1.765061,-1.765061,-1.765061,,,,


<a id="5"></a>
## PAA Function

In [33]:
paa_obtainer=function(traindata,segment_length){
    dt_ts_train=data.table(traindata)
    dt_ts_train[,id:=1:.N]
    long_train=melt(dt_ts_train,id.vars=c('id'))
    long_train[,time:=as.numeric(gsub("\\D", "", variable))-1]
    long_train=long_train[order(id,time)]
    
    paa_results=vector("list", max(long_train$id))
    for(i in 1:max(long_train$id)){
        current_ts=long_train[id==i,]$value

        paa_rep=repr_paa(current_ts, segment_length, meanC)
        current_dt=data.table(time=1:length(long_train[id==i,]$value))
        result_dt=data.table(time=c(1:(length(paa_rep)))*segment_length, values=paa_rep)
        all_dt=merge(current_dt, result_dt, by='time',all.x=T)
        all_dt[,values:=nafill(values,'nocb')]
        paa_results[[i]]=transpose(data.table(values=all_dt$values))

    }
    return(rbindlist(paa_results))
}

This function is used to obtain a PAA dataset. The "_2" suffix marks the dataset computed with a segment length of 18.
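The arithmetic behind PAA can be sketched in base R for the case where the segment length divides the series length. The helper `paa_mean` is hypothetical and is not the `repr_paa` function from TSrepr. The divisibility check at the end likely explains why the segment-length-5 example leaves missing values at the end of each series, while 9 and 18 do not.

```r
# Hypothetical helper: mean of each consecutive segment (series length must
# be a multiple of segment_length in this simplified sketch).
paa_mean <- function(x, segment_length) {
  stopifnot(length(x) %% segment_length == 0)
  colMeans(matrix(x, nrow = segment_length))    # one column per segment
}

x <- c(1, 2, 3, 4, 5, 6, 10, 20, 30)            # made-up toy series
segment_means <- paa_mean(x, 3)
expanded <- rep(segment_means, each = 3)         # back to the original length

print(segment_means)  # 2 5 20
print(expanded)

# Remainders of the series length 234 for the segment lengths in this notebook
remainders <- 234 %% c(5, 9, 18)
print(remainders)     # 4 0 0
```

Repeating each segment mean keeps the representation at the original length, which matches the layout of the PAA tables produced above.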

### Segment Length 9

In [34]:
paa_train=paa_obtainer(traindata,9)

### Segment Length 18

In [35]:
paa_train_2=paa_obtainer(traindata,18)

<a id="6"></a>
# Distances

In this part, distance matrices are calculated for 20 combinations (5 representations × 4 distance measures). The resulting distance matrices are also written to files so that the distance calculation can be skipped later.
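To show how an elastic distance differs from a pointwise one, here is a minimal base R sketch of the classic dynamic time warping recurrence with an absolute-difference cost. The helper `dtw_distance` is hypothetical. The notebook itself relies on `dtwDist` from the dtw package, whose default step pattern may yield different values, so this sketch only illustrates the idea.

```r
# Hypothetical helper: textbook DTW via dynamic programming.
dtw_distance <- function(a, b) {
  n <- length(a)
  m <- length(b)
  cost <- matrix(Inf, n + 1, m + 1)   # extra row/column as boundary
  cost[1, 1] <- 0
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      step <- abs(a[i] - b[j])
      # extend the cheapest of the three admissible predecessor alignments
      cost[i + 1, j + 1] <- step + min(cost[i, j + 1], cost[i + 1, j], cost[i, j])
    }
  }
  cost[n + 1, m + 1]
}

a <- c(1, 2, 3, 4)
b_stretched <- c(1, 1, 2, 3, 4)       # same shape, first point repeated
b_shifted <- c(1, 1, 2, 3)            # same length, delayed by one step

d_stretched <- dtw_distance(a, b_stretched)
d_shifted <- dtw_distance(a, b_shifted)
l1_shifted <- sum(abs(a - b_shifted))  # pointwise comparison for contrast

print(c(d_stretched, d_shifted, l1_shifted))  # 0 1 3
```

The delayed series is penalized three times by the pointwise comparison but only once by the warped alignment.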

In [36]:
large_number=10000
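The diagonal of each distance matrix is overwritten with this large value. A plausible reason, sketched below on made-up points, is that a series then never counts as its own nearest neighbor.

```r
toy_series <- rbind(c(0, 0), c(3, 4), c(6, 8))   # three made-up points
toy_dist <- as.matrix(dist(toy_series))

# Before masking, every row's smallest distance is to itself (distance 0)
nearest_before <- unname(apply(toy_dist, 1, which.min))

diag(toy_dist) <- 10000                           # same value as large_number
nearest_after <- unname(apply(toy_dist, 1, which.min))

print(nearest_before)  # 1 2 3
print(nearest_after)   # 2 1 2
```

The value only needs to exceed every real distance in the matrix for this to work.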

<a id="7"></a>
## Raw Dataset 

In [37]:
dist_euc=as.matrix(dist(traindata))
diag(dist_euc)=large_number
fwrite(dist_euc,sprintf('%s%s/%s_euc_raw_dist.csv', distance_path, fifth_dataset, fifth_dataset),col.names=F)

dist_dtw=as.matrix(dtwDist(traindata))
diag(dist_dtw)=large_number
fwrite(dist_dtw,sprintf('%s%s/%s_dtw_raw_dist.csv', distance_path, fifth_dataset, fifth_dataset),col.names=F)

dist_lcss=TSDatabaseDistances(traindata,distance='lcss',epsilon=0.05)
dist_lcss=as.matrix(dist_lcss)
diag(dist_lcss)=large_number
fwrite(dist_lcss,sprintf('%s%s/%s_lcss_raw_epsilon_005.csv', distance_path, fifth_dataset, fifth_dataset),col.names=F)  

dist_erp=TSDatabaseDistances(traindata,distance='erp',g=0.5)
dist_erp=as.matrix(dist_erp)
diag(dist_erp)=large_number
fwrite(dist_erp,sprintf('%s%s/%s_erp_raw_gap_005.csv', distance_path, fifth_dataset, fifth_dataset),col.names=F)  

x being coerced from class: matrix to data.table
x being coerced from class: matrix to data.table
x being coerced from class: matrix to data.table
x being coerced from class: matrix to data.table


<a id="8"></a>
## Difference taken Datasets

### First difference dataset when shift value is 1

In [38]:
dist_euc_diff=as.matrix(dist(diff_train))
diag(dist_euc_diff)=large_number
fwrite(dist_euc_diff,sprintf('%s%s/%s_euc_diff_dist.csv', distance_path, fifth_dataset, fifth_dataset),col.names=F)

dist_dtw_diff=as.matrix(dtwDist(diff_train))
diag(dist_dtw_diff)=large_number
fwrite(dist_dtw_diff,sprintf('%s%s/%s_dtw_diff_dist.csv', distance_path, fifth_dataset, fifth_dataset),col.names=F)

dist_lcss_diff=TSDatabaseDistances(diff_train,distance='lcss',epsilon=0.05)
dist_lcss_diff=as.matrix(dist_lcss_diff)
diag(dist_lcss_diff)=large_number
fwrite(dist_lcss_diff,sprintf('%s%s/%s_lcss_diff_epsilon_005.csv', distance_path, fifth_dataset, fifth_dataset),col.names=F)  

dist_erp_diff=TSDatabaseDistances(diff_train,distance='erp',g=0.5)
dist_erp_diff=as.matrix(dist_erp_diff)
diag(dist_erp_diff)=large_number
fwrite(dist_erp_diff,sprintf('%s%s/%s_erp_diff_gap_005.csv', distance_path, fifth_dataset, fifth_dataset),col.names=F)  

x being coerced from class: matrix to data.table
x being coerced from class: matrix to data.table
x being coerced from class: matrix to data.table
x being coerced from class: matrix to data.table


### Second difference dataset when shift value is 2

In [39]:
dist_euc_diff_2=as.matrix(dist(diff_train_2))
diag(dist_euc_diff_2)=large_number
fwrite(dist_euc_diff_2,sprintf('%s%s/%s_euc_diff2_dist.csv', distance_path, fifth_dataset, fifth_dataset),col.names=F)

dist_dtw_diff_2=as.matrix(dtwDist(diff_train_2))
diag(dist_dtw_diff_2)=large_number
fwrite(dist_dtw_diff_2,sprintf('%s%s/%s_dtw_diff2_dist.csv', distance_path, fifth_dataset, fifth_dataset),col.names=F)

dist_lcss_diff_2=TSDatabaseDistances(diff_train_2,distance='lcss',epsilon=0.05)
dist_lcss_diff_2=as.matrix(dist_lcss_diff_2)
diag(dist_lcss_diff_2)=large_number
fwrite(dist_lcss_diff_2,sprintf('%s%s/%s_lcss_diff2_epsilon_005.csv', distance_path, fifth_dataset, fifth_dataset),col.names=F)  

dist_erp_diff_2=TSDatabaseDistances(diff_train_2,distance='erp',g=0.5)
dist_erp_diff_2=as.matrix(dist_erp_diff_2)
diag(dist_erp_diff_2)=large_number
fwrite(dist_erp_diff_2,sprintf('%s%s/%s_erp_diff2_gap_005.csv', distance_path, fifth_dataset, fifth_dataset),col.names=F)  

x being coerced from class: matrix to data.table
x being coerced from class: matrix to data.table
x being coerced from class: matrix to data.table
x being coerced from class: matrix to data.table


<a id="9"></a>
## PAA Datasets

### First PAA dataset when segment length value is 9

In [40]:
dist_euc_paa=as.matrix(dist(paa_train))
diag(dist_euc_paa)=large_number
fwrite(dist_euc_paa,sprintf('%s%s/%s_euc_paa_dist.csv', distance_path, fifth_dataset, fifth_dataset),col.names=F)

dist_dtw_paa=as.matrix(dtwDist(paa_train))
diag(dist_dtw_paa)=large_number
fwrite(dist_dtw_paa,sprintf('%s%s/%s_dtw_paa_dist.csv', distance_path, fifth_dataset, fifth_dataset),col.names=F)

dist_lcss_paa=TSDatabaseDistances(paa_train,distance='lcss',epsilon=0.05)
dist_lcss_paa=as.matrix(dist_lcss_paa)
diag(dist_lcss_paa)=large_number
fwrite(dist_lcss_paa,sprintf('%s%s/%s_lcss_paa_epsilon_005.csv', distance_path, fifth_dataset, fifth_dataset),col.names=F)  

dist_erp_paa=TSDatabaseDistances(paa_train,distance='erp',g=0.5)
dist_erp_paa=as.matrix(dist_erp_paa)
diag(dist_erp_paa)=large_number
fwrite(dist_erp_paa,sprintf('%s%s/%s_erp_paa_gap_005.csv', distance_path, fifth_dataset, fifth_dataset),col.names=F)  

x being coerced from class: matrix to data.table
x being coerced from class: matrix to data.table
x being coerced from class: matrix to data.table
x being coerced from class: matrix to data.table


### Second PAA dataset when segment length value is 18

In [41]:
dist_euc_paa_2=as.matrix(dist(paa_train_2))
diag(dist_euc_paa_2)=large_number
fwrite(dist_euc_paa_2,sprintf('%s%s/%s_euc_paa2_dist.csv', distance_path, fifth_dataset, fifth_dataset),col.names=F)

dist_dtw_paa_2=as.matrix(dtwDist(paa_train_2))
diag(dist_dtw_paa_2)=large_number
fwrite(dist_dtw_paa_2,sprintf('%s%s/%s_dtw_paa2_dist.csv', distance_path, fifth_dataset, fifth_dataset),col.names=F)

dist_lcss_paa_2=TSDatabaseDistances(paa_train_2,distance='lcss',epsilon=0.05)
dist_lcss_paa_2=as.matrix(dist_lcss_paa_2)
diag(dist_lcss_paa_2)=large_number
fwrite(dist_lcss_paa_2,sprintf('%s%s/%s_lcss_paa2_epsilon_005.csv', distance_path, fifth_dataset, fifth_dataset),col.names=F)  

dist_erp_paa_2=TSDatabaseDistances(paa_train_2,distance='erp',g=0.5)
dist_erp_paa_2=as.matrix(dist_erp_paa_2)
diag(dist_erp_paa_2)=large_number
fwrite(dist_erp_paa_2,sprintf('%s%s/%s_erp_paa2_gap_005.csv', distance_path, fifth_dataset, fifth_dataset),col.names=F)  

x being coerced from class: matrix to data.table
x being coerced from class: matrix to data.table
x being coerced from class: matrix to data.table
x being coerced from class: matrix to data.table


### Storing the distance dataset files in a list

In [42]:
dist_folder=sprintf('%s%s', distance_path, fifth_dataset)
dist_files=list.files(dist_folder, full.names=T)

In [43]:
dist_files

<a id="10"></a>
# Main Model

In [44]:
k_levels=c(1,5,10)
approach_file=list.files(dist_folder)

In [45]:
result=vector('list',length(dist_files)*nof_rep*n_fold*length(k_levels))

In [46]:
iter=1
for(m in 1:length(dist_files)){
    print(dist_files[m])
    dist_mat=as.matrix(fread(dist_files[m],header=FALSE))
    for(i in 1:nof_rep){
        this_fold=cv_indices[[i]]
        for(j in 1:n_fold){
            test_indices=this_fold[[j]]
            for(k in 1:length(k_levels)){
                current_k=k_levels[k]
                current_fold=nn_classify_cv(dist_mat,trainclass,test_indices,k=current_k)
                accuracy=sum(trainclass[test_indices]==current_fold$prediction$predicted)/length(test_indices)
                tmp=data.table(approach=approach_file[m],repid=i,foldid=j,
                               k=current_k,acc=accuracy)
                result[[iter]]=tmp
                iter=iter+1
            }
            
        }
    
    }   
    
}

[1] "C:/Users/bahad/GitHub/IE48B/Homework3/Distances/Wine/Wine_dtw_diff_dist.csv"
[1] "C:/Users/bahad/GitHub/IE48B/Homework3/Distances/Wine/Wine_dtw_diff2_dist.csv"
[1] "C:/Users/bahad/GitHub/IE48B/Homework3/Distances/Wine/Wine_dtw_paa_dist.csv"
[1] "C:/Users/bahad/GitHub/IE48B/Homework3/Distances/Wine/Wine_dtw_paa2_dist.csv"
[1] "C:/Users/bahad/GitHub/IE48B/Homework3/Distances/Wine/Wine_dtw_raw_dist.csv"
[1] "C:/Users/bahad/GitHub/IE48B/Homework3/Distances/Wine/Wine_erp_diff_gap_005.csv"
[1] "C:/Users/bahad/GitHub/IE48B/Homework3/Distances/Wine/Wine_erp_diff2_gap_005.csv"
[1] "C:/Users/bahad/GitHub/IE48B/Homework3/Distances/Wine/Wine_erp_paa_gap_005.csv"
[1] "C:/Users/bahad/GitHub/IE48B/Homework3/Distances/Wine/Wine_erp_paa2_gap_005.csv"
[1] "C:/Users/bahad/GitHub/IE48B/Homework3/Distances/Wine/Wine_erp_raw_gap_005.csv"
[1] "C:/Users/bahad/GitHub/IE48B/Homework3/Distances/Wine/Wine_euc_diff_dist.csv"
[1] "C:/Users/bahad/GitHub/IE48B/Homework3/Distances/Wine/Wine_euc_diff2_dist.csv"
[1

<a id="11"></a>
# Result of Models

In [47]:
dataframe_result=rbindlist(result)
head(dataframe_result)

approach,repid,foldid,k,acc
Wine_dtw_diff_dist.csv,1,1,1,1.0
Wine_dtw_diff_dist.csv,1,1,5,0.3333333
Wine_dtw_diff_dist.csv,1,1,10,0.8333333
Wine_dtw_diff_dist.csv,1,2,1,1.0
Wine_dtw_diff_dist.csv,1,2,5,0.6666667
Wine_dtw_diff_dist.csv,1,2,10,0.6666667


This dataframe contains the result of each fold. repid and foldid identify the repetition and the fold, respectively.

### Aggregated Results

In [48]:
acc_res=dataframe_result[,list(avg_acc=mean(acc),sdev_acc=sd(acc), repid=max(repid), foldid=max(foldid), 
                                   result_count=.N),by=list(approach,k)]
acc_res_ordered=acc_res[order(avg_acc,decreasing = TRUE)]

The aggregated results are ordered by avg_acc in decreasing order.
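The same summary can be reproduced on a small made-up table. The sketch below uses base R instead of data.table so that it runs without extra packages; the approach names and accuracy values are invented for illustration.

```r
# Made-up per-fold accuracies for two approaches and two k values
toy <- data.frame(approach = rep(c("A", "B"), each = 4),
                  k = rep(c(1, 5), times = 4),
                  acc = c(1.0, 0.5, 1.0, 0.7, 0.8, 0.6, 0.6, 0.8),
                  stringsAsFactors = FALSE)

# Mean and standard deviation of accuracy per (approach, k)
avg <- aggregate(acc ~ approach + k, data = toy, FUN = mean)
sdv <- aggregate(acc ~ approach + k, data = toy, FUN = sd)
names(avg)[3] <- "avg_acc"
names(sdv)[3] <- "sdev_acc"

summary_tbl <- merge(avg, sdv, by = c("approach", "k"))
summary_tbl <- summary_tbl[order(-summary_tbl$avg_acc), ]   # best first
best <- summary_tbl[1, ]
print(summary_tbl)
```

The first row after ordering is the combination that would be carried forward to the test set.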

In [49]:
acc_res_ordered

approach,k,avg_acc,sdev_acc,repid,foldid,result_count
Wine_euc_paa_dist.csv,1,1.0,0.0,5,10,50
Wine_euc_paa2_dist.csv,1,1.0,0.0,5,10,50
Wine_euc_raw_dist.csv,1,1.0,0.0,5,10,50
Wine_dtw_paa2_dist.csv,1,0.996,0.02828427,5,10,50
Wine_erp_paa2_gap_005.csv,1,0.996,0.02828427,5,10,50
Wine_erp_raw_gap_005.csv,1,0.996,0.02828427,5,10,50
Wine_dtw_raw_dist.csv,1,0.992,0.03958973,5,10,50
Wine_dtw_paa_dist.csv,1,0.988,0.06272714,5,10,50
Wine_erp_paa_gap_005.csv,1,0.988,0.06272714,5,10,50
Wine_dtw_diff_dist.csv,1,0.9853333,0.0504672,5,10,50


In [50]:
# require(ggplot2)
# ggplot(dataframe_result,aes(x=paste0(approach,'with K=',k), y=acc)) +
#         geom_boxplot()+
#         labs(title="Boxplot of Models")+
#         xlab("Model Types")+
#         coord_flip()

<a id="12"></a>
# Test Performance

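The best cross-validation performance is obtained with the PAA representation (segment length 9), Euclidean distance, and k=1. PAA with segment length 18 and the raw series reach the same average accuracy of 1.0 with Euclidean distance and k=1; the first row of the ordered results is used here.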

In [51]:
traindata=as.matrix(fread(sprintf('%s%s/%s_TRAIN.txt',dataset_path, fifth_dataset,fifth_dataset)))
testdata=as.matrix(fread(sprintf('%s%s/%s_TEST.txt',dataset_path, fifth_dataset,fifth_dataset)))

To measure test performance, both the train and test datasets are used. The two datasets are bound together by rows, and the rows that come from the test dataset are used as the test indices.
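The index arithmetic can be checked on toy matrices. The sizes below (3 train rows, 2 test rows) are made up for illustration.

```r
toy_train <- matrix(1:6, nrow = 3)    # 3 made-up train rows
toy_test <- matrix(7:10, nrow = 2)    # 2 made-up test rows

toy_all <- rbind(toy_train, toy_test) # test rows end up at the bottom

# Last nrow(toy_test) row numbers of the combined matrix
toy_test_idx <- (nrow(toy_all) - nrow(toy_test) + 1):nrow(toy_all)
print(toy_test_idx)  # 4 5
```

Selecting `toy_all[toy_test_idx, ]` returns exactly the rows of `toy_test`, so the remaining rows act as the training set in the classifier.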

In [52]:
all_dt=rbind(traindata, testdata)

In [53]:
allclass=all_dt[,1] 
all_dt=all_dt[,2:ncol(all_dt)]

In [54]:
test_indices_last=(nrow(all_dt)+1-nrow(testdata)):nrow(all_dt)

### Parameters

In [55]:
test_indices_last

In [56]:
last_k=1

## Representation and Distance Calculation

The representation and distance type are chosen according to the best parameter combination found on the train dataset.

In [63]:
paa_test=paa_obtainer(all_dt,9)

In [64]:
dist_euc_paa_test=as.matrix(dist(paa_test))
diag(dist_euc_paa_test)=large_number

## Result of Test Dataset

In [65]:
last_result=nn_classify_cv(dist_euc_paa_test,allclass,test_indices_last,k=last_k)
accuracy=sum(allclass[test_indices_last]==last_result$prediction$predicted)/length(test_indices_last)
final_res=data.table(approach="Wine_euc_paa_dist_Test.csv", k=last_k, acc=accuracy)

### Test Result

In [66]:
final_res

approach,k,acc
Wine_euc_paa_dist_Test.csv,1,0.6111111


### Train Result

In [61]:
acc_res_ordered[1][,c("approach", "k", "avg_acc")]

approach,k,avg_acc
Wine_euc_paa_dist.csv,1,1


<a id="13"></a>
# Comments

Obtained results will be analyzed in the result comparison notebook.