# k-Nearest Neighbour on Time-Series data
## Contents

1. Read IR_data.
2. Visualize raster scans.
3. Smooth/Filter the data.
4. Hold-out testing using Euclidean distance and DTW.
5. Hyper-parameter tuning. 

In [None]:
import numpy as np
import random
import pandas as pd
from collections import Counter

## Reading Data

We will use the pandas package to read `IR_data.csv` and represent it as a DataFrame (table).  
For more information on DataFrames and their functions, visit: https://bit.ly/2RKLtd0


In [None]:
X = pd.read_csv("IR_data.csv", index_col=0, header=0)
X.head(5) #display top 5 rows

As the table above shows, each row is one time series stored in columns 0-299, and the column "class" holds the label of that series. The class is either 0 (non-pore) or 1 (pore).

## Extract the class labels into a different variable

In [None]:
y = X["class"].values
X.drop(["class"] , axis=1, inplace=True)
X.head(5)

Data samples are padded with zeros so that all samples are the same length. 
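
As a small illustration of the padding, the sketch below recovers the unpadded length of each series by trimming trailing zeros. The array `padded` is made-up toy data, not taken from `IR_data.csv`, and the approach assumes that genuine readings never end in an exact zero.

```python
import numpy as np

# Hypothetical toy data: three series zero-padded to a common length of 5.
padded = np.array([
    [3.0, 4.0, 5.0, 0.0, 0.0],
    [2.0, 6.0, 0.0, 0.0, 0.0],
    [1.0, 2.0, 3.0, 4.0, 5.0],
])

# trim='b' strips zeros from the back only, leaving the start of the series untouched.
lengths = [len(np.trim_zeros(row, trim='b')) for row in padded]
print(lengths)
```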

In [None]:
X[1:3].T.plot()

# Visualize Raster Scan

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

Let's visualize the data from the first row, after removing the trailing zeros (for a better representation).

In [None]:
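data = X.loc[0]
zeros = np.where(data == 0)[0]
# Cut at the first zero (start of the padding); keep the whole row if no zero is found
index = zeros[0] if len(zeros) else len(data)
data = data.iloc[:index]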

In [None]:
plt.figure(figsize=(15,5))
plt.title("Raster Scan")
plt.xlabel("#Data-points")
plt.ylabel("Emissivity (mV)")
plt.plot(data, c="red")
plt.xticks(np.arange(0,300,50))
plt.show()

## Filtering The Data

We will use a low-pass Butterworth filter to remove noise from the IR data and smooth it.  
To learn how a Butterworth filter works, visit: https://bit.ly/3kBv0Uy
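
To get a feel for what "low-pass" means here, the sketch below evaluates the gain of a 4th-order Butterworth filter at three probe frequencies. The values `fs=100.0`, `cutoff=10.0` and `order=4` mirror the settings used later in this notebook; the probe frequencies are arbitrary choices for illustration.

```python
import numpy as np
from scipy.signal import butter, freqz

fs = 100.0      # sampling frequency (same value as used later in the notebook)
cutoff = 10.0   # cutoff frequency, in the same units as fs
order = 4

# Design the digital filter; the cutoff is normalised by the Nyquist frequency.
b, a = butter(order, cutoff / (0.5 * fs), btype='low', analog=False)

# Evaluate the frequency response at a few probe frequencies (arbitrary choices).
probe_hz = np.array([1.0, 10.0, 40.0])
_, h = freqz(b, a, worN=probe_hz, fs=fs)
gain = np.abs(h)

for f, g in zip(probe_hz, gain):
    print("{:5.1f} Hz -> gain {:.4f}".format(f, g))
```

Frequencies well below the cutoff pass almost unchanged, the gain at the cutoff itself is 1/sqrt(2) (about -3 dB), and frequencies well above it are strongly attenuated.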

In [None]:
from scipy.signal import butter, freqz, lfilter

def butter_lowpass(cutoff, fs, order=5):
    nyq = 0.5 * fs
    normal_cutoff = cutoff / nyq
    b, a = butter(order, normal_cutoff, btype='low', analog=False)
    return b, a

def butter_lowpass_filter(data, cutoff, fs, order=5):
    b, a = butter_lowpass(cutoff, fs, order=order)
    y = lfilter(b, a, data)
    return y

In [None]:
data_fltr = butter_lowpass_filter(data,cutoff=10,fs=100.0,order=4)

In [None]:
trans_done = 20 #skip the data from the transient phase at the beginning of the series.
plt.figure(figsize=(15,5))
plt.title("Raster Scan")
plt.xlabel("#Data-points")
plt.ylabel("Emissivity (mV)")
plt.plot(data[trans_done:], c="red")
plt.plot(data_fltr[trans_done:], c="blue")
plt.xticks(np.arange(0,300,50))
#plt.ylim([900, 1020])
plt.show()

## Filtering the entire dataset 

In [None]:
order = 4
fs = 100.0
cutoff = 10

In [None]:
# Filter every row in one call: lfilter runs along axis=1 (time), so rows are filtered independently
b, a = butter_lowpass(cutoff, fs, order=order)
X_fltr = pd.DataFrame(lfilter(b, a, X.values, axis=1), index=X.index, columns=range(X.shape[1]))

In [None]:
X_fltr.head(5)

# k-NN Classification

We will use the k-NN Time-Series Classifier (https://bit.ly/3kyYQcx) from the tslearn package. 
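
For intuition, here is a minimal sketch of the k-NN idea itself: find the `k` training series closest to a query and take a majority vote over their labels. This is not tslearn's implementation; the function `knn_predict` and the toy data are hypothetical, and Euclidean distance is assumed.

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, query, k=3):
    """Label a query series by majority vote among its k nearest training series."""
    dists = np.linalg.norm(X_train - query, axis=1)  # Euclidean distance to every row
    nearest = np.argsort(dists)[:k]                  # indices of the k smallest distances
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two short series per class.
X_toy = np.array([[0.0, 0.1, 0.0],
                  [0.1, 0.0, 0.1],
                  [1.0, 1.1, 0.9],
                  [0.9, 1.0, 1.0]])
y_toy = np.array([0, 0, 1, 1])

label = knn_predict(X_toy, y_toy, np.array([0.95, 1.0, 1.0]), k=3)
print(label)
```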

We shall use two versions of this classifier: 
<ol>
<li>Using Euclidean as our distance metric</li>
<li>Using Dynamic Time Warping (DTW) as the distance measure</li>
</ol> 
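
The difference between the two measures can be sketched on a pair of time-shifted sine waves. The `dtw_distance` function below is a simplified, hypothetical implementation written for illustration (squared point-wise cost, optional Sakoe-Chiba band, square root at the end); it is not tslearn's code and is far slower.

```python
import numpy as np

def dtw_distance(a, b, radius=None):
    """Simplified DTW: cheapest alignment cost between a and b, optionally band-limited."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        lo, hi = 1, m
        if radius is not None:
            # Sakoe-Chiba band: only align points at most `radius` steps apart.
            lo, hi = max(1, i - radius), min(m, i + radius)
        for j in range(lo, hi + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(np.sqrt(cost[n, m]))

# Two sine waves with the same shape, one shifted in time (toy data).
t = np.linspace(0, 2 * np.pi, 60)
a = np.sin(t)
b = np.sin(t - 0.5)

euclid = float(np.linalg.norm(a - b))    # compares points index by index
warped = dtw_distance(a, b, radius=5)    # allows a small shift in time
print("Euclidean = {:.3f}, DTW = {:.3f}".format(euclid, warped))
```

Because DTW may align each point with a neighbour a few steps away, the shifted copy ends up closer under DTW than under the index-by-index Euclidean distance.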

In [None]:
from tslearn.neighbors import KNeighborsTimeSeriesClassifier
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

## Hold-Out Testing

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_fltr.values, y, test_size=0.33, random_state=42)

### Euclidean

In [None]:
EkNN = KNeighborsTimeSeriesClassifier(n_neighbors=5,metric='euclidean', n_jobs=-1)
EkNN = EkNN.fit(X_train,y_train)
Ey_hat = EkNN.predict(X_test)
euc_accuracy = accuracy_score(y_test,Ey_hat)*100
print("Model Accuracy = {:.2f}%".format(euc_accuracy))

### DTW

In [None]:
metric_params = {'global_constraint': 'sakoe_chiba', 'sakoe_chiba_radius': 5}
DkNN = KNeighborsTimeSeriesClassifier(n_neighbors=5,metric='dtw', metric_params=metric_params, n_jobs=-1)
DkNN = DkNN.fit(X_train,y_train)
Dy_hat = DkNN.predict(X_test)
dtw_accuracy = accuracy_score(y_test,Dy_hat)*100
print("Model Accuracy = {:.2f}%".format(dtw_accuracy))

### Compare the two models

In [None]:
fig = plt.figure(figsize=(6,5))
model = ['Euclidean', 'DTW']
accuracies = [euc_accuracy,dtw_accuracy]
plt.bar(model,accuracies)
plt.title("Model Comparison")
plt.ylabel("Accuracy (%)")
plt.show()

## ROC
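
The area under a ROC curve equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way on made-up scores. The helper `auc_from_scores` is hypothetical, and the scores assume a 5-neighbour classifier with uniform weights, where each score is the fraction of neighbours voting for class 1 and ties are therefore common.

```python
import numpy as np

def auc_from_scores(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]   # every positive score minus every negative score
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Hypothetical labels and scores (fractions of 5 neighbours voting for class 1).
y_toy = [0, 0, 1, 1]
score_toy = [0.2, 0.4, 0.4, 0.8]

auc_toy = auc_from_scores(y_toy, score_toy)
print("AUC = {:.3f}".format(auc_toy))
```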

In [None]:
Ey_score = EkNN.fit(X_train, y_train).predict_proba(X_test)
fprE, tprE, t = roc_curve(y_test, Ey_score[:,1])
roc_aucE = auc(fprE, tprE)

In [None]:
Dy_score = DkNN.fit(X_train, y_train).predict_proba(X_test)
fprD, tprD, t = roc_curve(y_test, Dy_score[:,1])
roc_aucD = auc(fprD, tprD)

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure()
lw = 2
plt.plot(fprE, tprE, color='red',
         lw=lw, label='ROC Euc (area = %0.2f)' % roc_aucE)
plt.plot(fprD, tprD, color='green',
         lw=lw, label='ROC DTW (area = %0.2f)' % roc_aucD)

plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Analysis for the IR data')
plt.legend(loc="lower right")
plt.show()

## Exercise  
Check to see if smoothing the data (Butterworth Filter) has actually helped classification accuracy.  
Update the ROC curves with results for Euclidean distance and DTW on the unfiltered data.   
**Hint:** Simply replace `X_fltr` with the original `X` data.