# Classification of Pulsar Candidates Using the k-Nearest Neighbour Algorithm

The HTRU2 data set contains pulsar candidates collected during the High Time Resolution Universe Survey. Each candidate is described by 8 features (items 1 to 8 below) followed by a class label (item 9):

1. Mean of the integrated pulse profile. 
2. Standard deviation of the integrated pulse profile. 
3. Excess kurtosis of the integrated pulse profile. 
4. Skewness of the integrated pulse profile. 
5. Mean of the DM-SNR curve. 
6. Standard deviation of the DM-SNR curve. 
7. Excess kurtosis of the DM-SNR curve. 
8. Skewness of the DM-SNR curve. 
9. Class 

The class indicates whether a candidate is a real pulsar (value 1) or a spurious detection caused by radio-frequency interference (RFI) or noise (value 0). The data set contains 17,898 candidates, of which 1,639 are pulsars. The first 16,000 data points are used to train the model, and 20 candidates from the remaining data are then used to test it.
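
The counts quoted above imply a strongly imbalanced data set and a very small test sample. The short sketch below works through that arithmetic; the variable names are illustrative and are not used elsewhere in the notebook.

```python
# Counts quoted in the text above.
n_total = 17898    # all candidates in HTRU2
n_pulsars = 1639   # candidates labelled as pulsars (class 1)
n_train = 16000    # candidates used for training
n_test = 20        # candidates used for testing

pulsar_fraction = n_pulsars / n_total   # share of positive examples
n_remaining = n_total - n_train         # candidates left after the training split
tested_fraction = n_test / n_remaining  # share of the remainder that is tested

print(f"Pulsar fraction: {pulsar_fraction:.2%}")
print(f"Candidates not used for training: {n_remaining}")
print(f"Fraction of the remainder used for testing: {tested_fraction:.2%}")
```

Roughly one candidate in eleven is a pulsar, so an error rate measured on a test set that is half pulsars is not directly comparable to one measured on a sample with the natural class proportions.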

In [39]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time
from multiprocessing.dummy import Pool as ThreadPool 
from statistics import mode


data = pd.read_csv('pulsar_stars.csv') # Source: R. J. Lyon, HTRU2, DOI: 10.6084/m9.figshare.3080389.v1. 
testdata = pd.read_csv('testdata.csv')
features = data.columns[:8]
labels = data.columns[8]

# 16,000 points from the original file are used as training data.
# 20 points selected from the remaining data are used as test data. Half are pulsars.

train_data = data.iloc[:16000,:8]
train_labels = data.iloc[:16000,8]

test_data = testdata.iloc[:20, :8]
test_labels = testdata.iloc[:20, 8]

In [48]:
def dist(x,y):
    return np.sum(np.square(x-y)) # Gets squared Euclidean distance between two vectors

def NN(x):
    distances = []
    neighbours = []
    k = 3 # Function uses three nearest neighbours to classify a test point
    
    for i in range(len(train_data)):
        distances.append(dist(x, train_data.iloc[i].values)) # Gets squared distance between test point and every training point
    
    distances = np.array(distances)
    min_indices = distances.argsort()[:k] # Gets the positions of the k smallest distances
    
    for item in min_indices:
        neighbours.append(train_labels.iloc[item]) # Gets the labels at those positions
    
    return mode(neighbours) # Returns the most common label (k is odd, so two classes cannot tie)
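
The loop in `NN` computes one distance at a time through pandas, which is likely where most of the run time goes. As a rough sketch of an alternative, the same squared-Euclidean, majority-vote rule can be written with NumPy broadcasting. The function name `knn_predict` and the toy arrays are hypothetical, and the vote assumes binary 0/1 labels and an odd `k`; this is not the code that produced the results reported here.

```python
import numpy as np

def knn_predict(train_X, train_y, test_X, k=3):
    """Classify each row of test_X by majority vote among its k nearest training rows.

    Assumes binary labels (0 or 1) and an odd k, so a vote can never tie.
    """
    train_X = np.asarray(train_X, dtype=float)
    train_y = np.asarray(train_y, dtype=int)
    test_X = np.asarray(test_X, dtype=float)

    # Squared Euclidean distances, shape (n_test, n_train), via broadcasting.
    diffs = test_X[:, np.newaxis, :] - train_X[np.newaxis, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)

    # Positions of the k smallest distances in each row (not sorted among themselves).
    nearest = np.argpartition(sq_dists, k - 1, axis=1)[:, :k]

    # With 0/1 labels, the majority label is 1 exactly when the mean vote exceeds 0.5.
    votes = train_y[nearest]
    return (votes.mean(axis=1) > 0.5).astype(int)

# Toy data (hypothetical): two well-separated clusters in two dimensions.
toy_train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
toy_train_y = [0, 0, 0, 1, 1, 1]
toy_test_X = [[0.2, 0.1], [5.5, 5.5]]

predictions = knn_predict(toy_train_X, toy_train_y, toy_test_X, k=3)
print(predictions)  # [0 1]
```

With the notebook's variables, the call would be `knn_predict(train_data.values, train_labels.values, test_data.values)`. The intermediate array holds one difference vector per test and training pair, so memory use grows with the product of the two set sizes.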

In [50]:
t_before = time.time()
pool = ThreadPool(2) # Two worker threads classify test points concurrently; each k-NN query is O(n) in the number of training points
results = pool.map(NN, test_data.values)
t_after = time.time()

err_positions = np.not_equal(results, test_labels.values)
error = float(np.sum(err_positions))/len(test_labels)
perror = error * 100

print("Classification error ", perror, "%")
print("Run time: ", t_after - t_before, "s")



Classification error  20.0 %
Run time:  176.18247961997986 s
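
A single error percentage does not show whether the mistakes are missed pulsars or false alarms. The sketch below splits the error count by type. The label arrays are made up for illustration and are not the notebook's results; with the variables above, `test_labels.values` and `np.array(results)` would take their place.

```python
import numpy as np

# Hypothetical labels for illustration only (not the notebook's output).
true_labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])
predicted = np.array([1, 1, 0, 0, 0, 0, 0, 1])

# Pulsars classified as non-pulsars (false negatives).
missed_pulsars = int(np.sum((true_labels == 1) & (predicted == 0)))
# Non-pulsars classified as pulsars (false positives).
false_alarms = int(np.sum((true_labels == 0) & (predicted == 1)))

error_rate = (missed_pulsars + false_alarms) / len(true_labels)

print("Missed pulsars:", missed_pulsars)
print("False alarms:", false_alarms)
print("Classification error:", error_rate * 100, "%")
```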
