# **The use of pulsar signal data in pulsar candidate labeling (and classification?)**

### **Introduction**

### **Methods and Results**

In [21]:
# Loading the necessary libraries and setting seed for data reproducibility
library(repr)
library(tidyverse)
library(tidymodels)
options(repr.matrix.max.rows = 6)
set.seed(2021)

Reading the pulsar dataset from Google Drive and converting the column names to syntactically valid R names

In [2]:
URL_data <- "https://docs.google.com/uc?export=download&id=1oMc6YyUz0hIX6iOEmmauzMjxNB5iaNLv"

pulsar_data <- read_csv(URL_data) %>%
    mutate(target_class = as.factor(target_class))

colnames(pulsar_data) <- make.names(colnames(pulsar_data))

pulsar_data

Parsed with column specification:
cols(
  `Mean of the integrated profile` = col_double(),
  `Standard deviation of the integrated profile` = col_double(),
  `Excess kurtosis of the integrated profile` = col_double(),
  `Skewness of the integrated profile` = col_double(),
  `Mean of the DM-SNR curve` = col_double(),
  `Standard deviation of the DM-SNR curve` = col_double(),
  `Excess kurtosis of the DM-SNR curve` = col_double(),
  `Skewness of the DM-SNR curve` = col_double(),
  target_class = col_double()
)



| Mean.of.the.integrated.profile | Standard.deviation.of.the.integrated.profile | Excess.kurtosis.of.the.integrated.profile | Skewness.of.the.integrated.profile | Mean.of.the.DM.SNR.curve | Standard.deviation.of.the.DM.SNR.curve | Excess.kurtosis.of.the.DM.SNR.curve | Skewness.of.the.DM.SNR.curve | target_class |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` |
| 121.15625 | 48.37297 | 0.3754847 | -0.01316549 | 3.168896 | 18.39937 | 7.449874 | 65.15930 | 0 |
| 76.96875 | 36.17556 | 0.7128979 | 3.38871856 | 2.399666 | 17.57100 | 9.414652 | 102.72297 | 0 |
| 130.58594 | 53.22953 | 0.1334083 | -0.29724164 | 2.743311 | 22.36255 | 8.508364 | 74.03132 | 0 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 116.0312 | 43.21385 | 0.66345569 | 0.4330880 | 0.7851171 | 11.62815 | 17.055215 | 312.20433 | 0 |
| 135.6641 | 49.93375 | -0.08994031 | -0.2267262 | 3.8595318 | 21.50150 | 7.398395 | 62.33402 | 0 |
| 120.7266 | 50.47226 | 0.34617808 | 0.1847972 | 0.7692308 | 11.79260 | 17.662222 | 329.54802 | 0 |


Selecting only the required columns, removing rows with missing values (NA), and summarizing the class balance

In [3]:
pulsar <- pulsar_data %>%
    select(Mean.of.the.integrated.profile,
           Excess.kurtosis.of.the.integrated.profile,
           target_class) %>%
    # Drop rows with a missing value; is.na() is the reliable test for NA
    filter(!is.na(Excess.kurtosis.of.the.integrated.profile))

num_obs <- nrow(pulsar)
pulsar %>%
  group_by(target_class) %>%
  summarize(n = n(),
            percentage = n() / num_obs * 100)

`summarise()` ungrouping output (override with `.groups` argument)



| target_class | n | percentage |
| --- | --- | --- |
| `<fct>` | `<int>` | `<dbl>` |
| 0 | 9798 | 90.781062 |
| 1 | 995 | 9.218938 |
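
Because the classes are heavily imbalanced, raw accuracy is easier to judge against a baseline. The following is a minimal sketch that uses only the counts from the table above; the variable names are illustrative and not part of the analysis. It computes the accuracy of a trivial classifier that always predicts the majority class 0.

```r
# Class counts copied from the summary table above
class_counts <- c("0" = 9798, "1" = 995)

# Share of each class in the filtered dataset
class_share <- class_counts / sum(class_counts)

# Accuracy of a classifier that always predicts the majority class
baseline_accuracy <- max(class_share)
round(baseline_accuracy, 4)  # 0.9078
```

A model is only informative here if its accuracy clearly exceeds this baseline of roughly 90.8%.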


Splitting the data into training (75%) and testing (25%) datasets, stratified by target_class

In [4]:
pulsar_split <- initial_split(pulsar, prop = 0.75, strata = target_class)

pulsar_training <- training(pulsar_split)
pulsar_testing <- testing(pulsar_split)

Defining a recipe that standardizes the predictors using the training dataset

In [5]:
pulsar_recipe <- recipe(target_class ~ ., data = pulsar_training) %>%
    step_scale(all_predictors()) %>%
    step_center(all_predictors())
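
To make the two recipe steps concrete, here is a small base R sketch of the arithmetic they are expected to perform: dividing by the standard deviation, then subtracting the mean of the scaled values. The numbers are made up for illustration, and the assumption is that the statistics are estimated on the training data and then reused for new data.

```r
# Toy predictor values (made up for illustration)
train_x <- c(100, 120, 140, 160)
test_x  <- c(110, 150)

# Scaling step: divide by the training standard deviation
train_sd <- sd(train_x)
train_scaled <- train_x / train_sd

# Centering step: subtract the mean of the already-scaled training values
train_center <- mean(train_scaled)
train_std <- train_scaled - train_center

# New data reuse the training statistics rather than their own
test_std <- test_x / train_sd - train_center

# Scaling and then centering gives the usual z-score
all.equal(train_std, (train_x - mean(train_x)) / sd(train_x))
```

Standardizing matters for k-nearest neighbors because the two predictors are on very different scales, and an unscaled distance would be dominated by the larger one.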

Specifying a k-nearest-neighbors model with k left as a parameter to be tuned

In [6]:
pulsar_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>%
    set_engine("kknn") %>%
    set_mode("classification")
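
The "rectangular" weight function means every neighbor gets an equal vote. As a rough sketch of that idea (not the kknn implementation), the hypothetical helper `knn_predict` below classifies one new point by majority vote among its k nearest training points, assuming Euclidean distance on standardized predictors. The toy data are made up.

```r
# Hypothetical helper: unweighted ("rectangular") k-nearest-neighbor vote
knn_predict <- function(train_x, train_y, new_point, k) {
  # Euclidean distance from the new point to every training row
  diffs <- sweep(train_x, 2, new_point)
  distances <- sqrt(rowSums(diffs^2))

  # Labels of the k closest training rows
  nearest <- order(distances)[seq_len(k)]
  votes <- table(train_y[nearest])

  # Majority vote; every neighbor counts equally
  names(votes)[which.max(votes)]
}

# Toy standardized data (made up for illustration), two predictors per row
train_x <- matrix(c(-1.0, -0.8, -0.9, 1.2, 1.0, 1.1,
                    -0.5, -0.4, -0.6, 1.5, 1.4, 1.6), ncol = 2)
train_y <- factor(c("0", "0", "0", "1", "1", "1"))

knn_predict(train_x, train_y, new_point = c(1.0, 1.3), k = 3)  # "1"
```

With an even k, ties between the two classes are possible; this sketch resolves them in favor of the first factor level, which may differ from what a real engine does.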

Creating 5-fold cross-validation splits of the training dataset, stratified by target_class

In [7]:
pulsar_vfold <- vfold_cv(pulsar_training, v = 5, strata = target_class)
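
Stratifying the folds keeps the rare pulsar class represented in every fold. The sketch below illustrates the idea in base R with a hypothetical helper, `assign_folds`, and made-up labels. It is an assumption-laden illustration of stratified fold assignment, and the algorithm used by rsample may differ in its details.

```r
set.seed(2021)  # same seed as the notebook, used here only for repeatability

# Toy labels with roughly the 9:1 imbalance seen in the data (made up)
labels <- factor(c(rep("0", 90), rep("1", 10)))
v <- 5

# Hypothetical helper: deal fold ids 1..v out within each class separately
assign_folds <- function(labels, v) {
  fold_id <- integer(length(labels))
  for (cls in levels(labels)) {
    idx <- which(labels == cls)
    # Shuffle the fold ids so assignment within the class is random
    fold_id[idx] <- sample(rep(seq_len(v), length.out = length(idx)))
  }
  fold_id
}

fold_id <- assign_folds(labels, v)

# Each fold holds 18 observations of class 0 and 2 of class 1
table(fold_id, labels)
```

Each fold then serves once as the validation set while the other four are used for fitting, which gives five accuracy estimates per candidate k.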

Tuning k with cross-validation and collecting the resulting metrics

In [8]:
pulsar_workflow <- workflow() %>%
    add_recipe(pulsar_recipe) %>%
    add_model(pulsar_spec) %>%
    tune_grid(resamples = pulsar_vfold, grid = 10) %>%
    collect_metrics()

pulsar_workflow

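| neighbors | .metric | .estimator | mean | n | std_err | .config |
| --- | --- | --- | --- | --- | --- | --- |
| `<int>` | `<chr>` | `<chr>` | `<dbl>` | `<int>` | `<dbl>` | `<chr>` |
| 2 | accuracy | binary | 0.9678814 | 5 | 0.001893726 | Model1 |
| 2 | roc_auc | binary | 0.9324038 | 5 | 0.004948512 | Model1 |
| 3 | accuracy | binary | 0.9773935 | 5 | 0.001361668 | Model2 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 12 | roc_auc | binary | 0.9540467 | 5 | 0.006748329 | Model8 |
| 14 | accuracy | binary | 0.9798641 | 5 | 0.001709494 | Model9 |
| 14 | roc_auc | binary | 0.9562052 | 5 | 0.006537893 | Model9 |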


Selecting the k value with the highest mean cross-validated accuracy (stored as kmin)

In [9]:
kmin <- pulsar_workflow %>%
    filter(.metric == "accuracy") %>%
    arrange(desc(mean)) %>%   # highest mean accuracy first
    slice(1) %>%              # keep a single row even if several k values tie
    pull(neighbors)

kmin

Rebuilding model with best k value

In [13]:
pulsar_model <- nearest_neighbor(weight_func = "rectangular", neighbors = kmin) %>%
    set_engine("kknn") %>%
    set_mode("classification")

Fitting the model to training dataset

In [14]:
pulsar_fit <- workflow() %>%
    add_recipe(pulsar_recipe) %>%
    add_model(pulsar_model) %>%
    fit(data = pulsar_training)

Predicting target_class for testing dataset using the model

In [16]:
pulsar_prediction <- pulsar_fit %>%
    predict(pulsar_testing) %>%
    bind_cols(pulsar_testing)

Accuracy Table

In [19]:
pulsar_acc <- pulsar_prediction %>%
    metrics(truth = target_class, estimate = .pred_class)

pulsar_acc

| .metric | .estimator | .estimate |
| --- | --- | --- |
| `<chr>` | `<chr>` | `<dbl>` |
| accuracy | binary | 0.9733136 |
| kap | binary | 0.8366281 |


Confusion Matrix

In [20]:
pulsar_conf_mat <- pulsar_prediction %>%
    conf_mat(truth = target_class, estimate = .pred_class)

pulsar_conf_mat

          Truth
Prediction    0    1
         0 2420   49
         1   23  206
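
The counts in the confusion matrix are enough to recompute the reported metrics by hand and to derive class-specific ones. The sketch below copies the four counts from the output above and, as an assumption made here for illustration, treats pulsars (class 1) as the class of interest when computing precision and recall.

```r
# Confusion matrix copied from the output above (rows = prediction, columns = truth)
cm <- matrix(c(2420, 23, 49, 206), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Truth = c("0", "1")))
n <- sum(cm)

# Overall accuracy: correct predictions lie on the diagonal
accuracy <- sum(diag(cm)) / n

# Precision and recall for pulsars (class 1)
precision <- cm["1", "1"] / sum(cm["1", ])  # predicted pulsars that are real
recall    <- cm["1", "1"] / sum(cm[, "1"])  # real pulsars that were found

# Cohen's kappa: agreement beyond what the row and column totals give by chance
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, kappa = cohen_kappa,
        precision = precision, recall = recall), 4)
```

The accuracy (0.9733) and kappa (0.8366) match the accuracy table. Precision for pulsars is about 0.90 and recall about 0.81, meaning 49 of the 255 true pulsars in the testing dataset were labeled as non-pulsars.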

### **Discussion**

### **References**