
<h1> One-Class J-K Nearest Neighbor Classifier</h1>
The main idea is to fit a nearest-neighbour model on data from the positive class only and then perform one-class classification (OCC) as follows [1]:

- For each sample in the test dataset:
    - Find its J nearest neighbours in the training set and compute the average distance to them, D_j.
    - For each of those J neighbours, compute the average distance to its own K nearest neighbours in the training set, then average these values to obtain D_k.
    - If D_j/D_k <= T, classify the test sample as a member of the positive class.
    - Otherwise, classify the test sample as not a member of the positive class.

- Evaluate the model's accuracy, precision and recall.
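
The decision rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the notebook's implementation: the function name `jknn_accept`, the brute-force Euclidean distances and the toy `train` array are all assumptions made for the example. The handling of `D_k == 0` mirrors the fallback used later in `OC_KNN`.

```python
import numpy as np

def jknn_accept(x, train, j=1, k=1, threshold=1.0):
    """Return True if x is accepted as a member of the positive class."""
    # Distances from the test sample to every training sample.
    d_test = np.linalg.norm(train - x, axis=1)
    j_idx = np.argsort(d_test)[:j]      # indices of the J nearest neighbours
    d_j = d_test[j_idx].mean()          # D_j

    # Distances from each selected neighbour to every training sample.
    d_train = np.linalg.norm(train[j_idx, None, :] - train[None, :, :], axis=2)
    # Mask each neighbour's zero distance to itself before picking its K nearest.
    d_train[np.arange(len(j_idx)), j_idx] = np.inf
    d_k = np.sort(d_train, axis=1)[:, :k].mean()   # D_k

    if d_k == 0:
        # Degenerate case: accept only an exact match.
        return bool(d_j == 0)
    return bool(d_j / d_k <= threshold)

# Hypothetical toy data: four evenly spaced points on a line.
train = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
print(jknn_accept(np.array([1.2, 0.0]), train))    # sample close to the training data
print(jknn_accept(np.array([10.0, 0.0]), train))   # sample far from the training data
```

Because every selected neighbour contributes exactly K distances, taking the mean over all of them equals the average of the per-neighbour averages.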

We have three hyperparameters: K, J and T. Classifier performance can vary considerably with different values of these hyperparameters. Initially we set each of them to 1 as a proof of concept. Later we will develop a parameter optimization technique.

[1] [Relationship between Variants of One-Class Nearest Neighbours and Creating their Accurate Ensembles](https://arxiv.org/abs/1604.01686)

# Imports

In [1]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.neighbors import NearestNeighbors

from sklearn.decomposition import PCA
from fastdtw import fastdtw
from scipy.spatial.distance import euclidean

# Reading Data from Google Drive

In [2]:
path = 'drive/My Drive/UFSCar/FAPESP/IC/Data/UCRArchive_2018'

dataset = input('Dataset: ')
tr_data = np.genfromtxt(path + "/" + dataset + "/" + dataset + "_TRAIN.tsv", delimiter="\t",)
te_data = np.genfromtxt(path + "/" + dataset + "/" + dataset + "_TEST.tsv", delimiter="\t",)

labels      = te_data[:, 0]                             # labels
print("Labels:", np.unique(labels))

Dataset: Wafer
Labels: [-1.  1.]


# Choosing the Positive Class label
This is necessary in order to emulate the One-Class Classification scenario.

In [3]:
class_label = int(input('Positive class label: '))

train_data  = tr_data[tr_data[:, 0] == class_label, 1:] # train
test_data   = te_data[:, 1:]                            # test

print("Train data shape:", train_data.shape)
print("Test data shape:", test_data.shape)

Positive class label: -1
Train data shape: (97, 152)
Test data shape: (6164, 152)


# OC_JKNN definition 
To do: 
<ul>
<li> Fix the parameter name (J = K) </li>
<li> Create an OC_JKNN class </li>
<li> Improve this function </li>
</ul>

In [4]:
def OC_KNN(test_sample, k, threshold, distances_avg):  # OC-JKNN with J = K; rename later
  # Distances from the test sample to its k nearest training samples (uses the global nbrs)
  distances, indices = nbrs.kneighbors(test_sample.reshape(1, -1), n_neighbors=k)
  avg = np.mean(distances)

  # Average of the precomputed nearest-neighbour distances of those k training samples
  knbrs_avg = np.mean([distances_avg[idx] for idx in indices[0]])

  if knbrs_avg != 0:
    return bool(avg / knbrs_avg <= threshold)
  # Strange to hit this case: the neighbours have zero distance to their own neighbours,
  # so accept the sample only if it is an exact match as well
  return bool(avg == 0)

# PCA and DTW distance testing

In [5]:
# Principal Component Analysis
"""pca = PCA(n_components=(train_data.shape[1]//3), svd_solver='full') #(train_data.shape[1]//8)

train_data = pca.fit_transform(train_data)
print(train_data.shape)

def dtw(t1, t2):
    distance, _ = fastdtw(t1, t2, dist=euclidean)
    return distance

test_data = pca.transform(test_data)
print(test_data.shape)
"""

# Fitting the NN Classifier and calculating the NN average distance for each point 
To do
<ul>
<li> Maybe remove the average distance array and use some better solution </li>
</ul>

In [6]:
# Fitting OC-KNN 
j = 1
k = 1
threshold = 1

nbrs = NearestNeighbors(n_neighbors=k+1).fit(train_data) # k+1 because each training sample is its own nearest neighbour; metric=dtw
distances, indices = nbrs.kneighbors(train_data)

distances_avg = []
for sample in distances:
  avg = np.mean(sample[1:])
  distances_avg.append(avg)
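
The per-sample loop above can also be written as a single array operation. The sketch below uses a small hypothetical `distances` array in place of the one returned by `nbrs.kneighbors(train_data)`, and assumes, as the loop does, that column 0 holds each sample's zero distance to itself.

```python
import numpy as np

# Toy stand-in for the `distances` array returned by nbrs.kneighbors(train_data):
# column 0 is each sample's zero distance to itself.
distances = np.array([[0.0, 1.0, 3.0],
                      [0.0, 2.0, 2.0]])

# Loop version, equivalent to the cell above.
loop_avg = [np.mean(row[1:]) for row in distances]

# Vectorised version: drop the self-distance column, then average each row.
distances_avg = distances[:, 1:].mean(axis=1)

print(loop_avg)
print(distances_avg)
```

The result is a NumPy array rather than a list, so it can still be indexed with `distances_avg[idx]` inside `OC_KNN`.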

# Splitting Positive Class data from other class(es)

In [7]:
target_class = np.where(labels==class_label)
negative_class = np.where(labels!=class_label)
print("Positive samples:", len(target_class[0]))
print("Negative samples:", len(negative_class[0]))

Positive samples: 665
Negative samples: 5499


# Testing and Evaluating

## Positive Class

In [11]:
tp = 0
for sample in test_data[target_class[0]]:
  if OC_KNN(sample, j, threshold, distances_avg):
    tp += 1

fn = len(target_class[0]) - tp

## Negative Class

In [12]:
tn = 0
for sample in test_data[negative_class[0]]:
  if not OC_KNN(sample, j, threshold, distances_avg):
    tn += 1

fp = len(negative_class[0]) - tn

## Recall, Precision and Accuracy

In [13]:
print("Recall:", tp/len(target_class[0]))
print("Precision:", tp/(tp+fp))
print("Accuracy:", (tp+tn)/(tp+tn+fp+fn))

Recall: 0.6330827067669172
Precision: 0.9293598233995585
Accuracy: 0.9552238805970149
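
For reuse across datasets, the three metrics can be wrapped in a small helper. This is a sketch under stated assumptions: the name `occ_metrics` is hypothetical, and returning 0.0 when a denominator is zero is a design choice of this example, not something the notebook defines. The counts in the usage lines are back-computed from the metrics printed above together with the 665 positive and 5499 negative test samples; they are not printed by the notebook itself.

```python
def occ_metrics(tp, fp, tn, fn):
    """Recall, precision and accuracy from confusion-matrix counts."""
    # Guard against empty denominators (assumption: report 0.0 in that case).
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return recall, precision, accuracy

# Counts implied by the reported results (665 positives, 5499 negatives).
recall, precision, accuracy = occ_metrics(tp=421, fp=32, tn=5467, fn=244)
print("Recall:", recall)
print("Precision:", precision)
print("Accuracy:", accuracy)
```

The guard matters mostly for precision: with a very small threshold the classifier may reject every test sample, leaving `tp + fp` equal to zero.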
