# On the Identification of Pulsar Stars

A DSCI 100 Project

Lucas Kuhn, Sophia Zhang, Michael Cheung

## Introduction

Astronomers have catalogued many neutron stars. A subset of these fall under a rare classification known as "pulsars", meaning that they produce a periodic radio signal that is detectable from Earth. Pulsars are of great scientific interest, so being able to identify them reliably supports ongoing efforts to understand neutron stars.

In this project, we set out to develop a model that distinguishes pulsar candidates from non-pulsar candidates. Our question is: "Are there characteristics of a pulsar candidate's curve profiles that serve as good predictors for pulsar and non-pulsar classification?"

To investigate this, we use the HTRU2 dataset provided by the UCI Machine Learning Repository. For a large number of observed pulsar candidates, this dataset contains summary statistics of both the Integrated Pulse (folded) Profile and the Dispersion Measure Signal-to-Noise Ratio (DM-SNR) curve, together with each candidate's known classification as either "pulsar" or "non-pulsar".

## Methods

To carry out the investigation, we will train multiple K-Nearest Neighbours classifier models, each using a different subset of the columns in our dataset as predictor variables. Because candidates that are truly pulsars generally broadcast signal pulses that are "remarkably constant over long periods of time" (Karastergiou et al., 2011), we can reasonably surmise that the summary statistics from candidates' curve profiles could serve as predictors for our classification model.
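As a rough illustration of comparing predictor subsets, the sketch below scores a K-Nearest Neighbours pipeline on several column subsets using cross-validation. It runs on synthetic stand-in data rather than HTRU2, and the column names `informative` and `noise`, the subsets, and the fixed choice of 5 neighbours are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the HTRU2 data): one column that separates
# the two classes and one column of pure noise. Column names are hypothetical.
rng = np.random.default_rng(0)
n = 200
labels = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "informative": labels * 3.0 + rng.normal(size=n),
    "noise": rng.normal(size=n),
    "class": labels,
})

# Candidate predictor subsets to compare
candidate_subsets = [["informative"], ["noise"], ["informative", "noise"]]

mean_accuracy = {}
for cols in candidate_subsets:
    # Standardize the predictors, then classify with K-Nearest Neighbours
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
    scores = cross_validate(pipe, toy[cols], toy["class"], cv=5)
    mean_accuracy[tuple(cols)] = scores["test_score"].mean()

for cols, accuracy in mean_accuracy.items():
    print(cols, round(accuracy, 3))
```

The same loop could be pointed at subsets of the HTRU2 columns once the data has been loaded and split.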

To gain insight into which dataset columns to use as predictors, we will first create a scatter plot matrix of pairs of columns, colouring the points by their classification and inspecting the plots for visible divisions. For example, we may wish to exclude columns that result in "heterogeneous" mixtures of the classified points, with less clearly defined groupings. This is a valid way to evaluate predictors, as plots provide an interface for human inspection and are used to train other models such as artificial neural networks (Eatough et al., 2010).

Using `GridSearchCV` with 5-fold cross-validation, we can tune the number of neighbours $k$. This allows us to efficiently compare the validation accuracy at many values of $k$ and guard against both overfitting and underfitting.
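The tuning step can be sketched as follows. This is a minimal example on synthetic data generated with `make_classification`, not the HTRU2 data. The range of $k$ values, the 75/25 split, and the random seeds are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the HTRU2 data)
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# Hold out a testing set before any tuning, stratified by class
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scaling lives inside the pipeline so it is refit on each training fold
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Candidate values of k (an assumed range for illustration)
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 32, 5))}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

best_k = grid.best_params_["kneighborsclassifier__n_neighbors"]
best_cv_accuracy = grid.best_score_
print(best_k, round(best_cv_accuracy, 3))
```

Keeping the scaler inside the pipeline means the validation folds never influence the standardization, which keeps the cross-validation estimate honest.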

With our best model, we can evaluate its performance on the testing set using a confusion matrix. This lets us judge the model more critically than accuracy alone, according to whether false positives are more costly than false negatives.
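To make the confusion matrix concrete, here is a small worked sketch using hand-written labels rather than real model output. The label vectors are invented for illustration, with 1 standing for "pulsar" and 0 for "non-pulsar".

```python
from sklearn.metrics import confusion_matrix

# Invented labels for illustration: 1 = pulsar, 0 = non-pulsar
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes, in the order [0, 1]
matrix = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = (int(v) for v in matrix.ravel())

# Precision: of the candidates predicted to be pulsars, how many truly are
precision = tp / (tp + fp)
# Recall: of the true pulsars, how many the classifier found
recall = tp / (tp + fn)

print(matrix)
print(tn, fp, fn, tp)  # 4 1 1 2
```

Here one non-pulsar is flagged as a pulsar (a false positive) and one pulsar is missed (a false negative), so precision and recall are both 2/3. Which of the two errors matters more depends on the cost of following up on a false candidate versus missing a real pulsar.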

We perform preliminary setup and retrieve the packages we will need.

In [2]:
# Retrieve relevant packages for dataframes and visualization tools
import pandas as pd
import altair as alt
import sklearn

# Retrieve relevant packages for classification and modelling
from sklearn.compose import make_column_transformer
from sklearn.metrics import confusion_matrix
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.model_selection import (
    GridSearchCV,
    cross_validate,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

As the dataset we plan to import contains more than 5000 observations, we need to lift Altair's default row limit so that all of the data can be plotted.

In [3]:
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

The HTRU2 dataset can be downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/machine-learning-databases/00372/). We unzip the file to the `data/` directory and read it into the notebook.

In [4]:
# Using the extracted contents, read the .CSV file, identifying columns using the associated .ARFF alongside it
pulsar_data = pd.read_csv(
    "../data/HTRU2/HTRU_2.csv",
    # .CSV file contains no column headers, we provide these here
    names=[
        "Profile_mean",
        "Profile_stdev",
        "Profile_skewness",
        "Profile_kurtosis",
        "DM_mean",
        "DM_stdev",
        "DM_skewness",
        "DM_kurtosis",
        "class"
    ]
)

# As a quick check, show the first 5 rows of the dataset
pulsar_data.head()

| | Profile_mean | Profile_stdev | Profile_skewness | Profile_kurtosis | DM_mean | DM_stdev | DM_skewness | DM_kurtosis | class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 140.5625 | 55.683782 | -0.234571 | -0.699648 | 3.199833 | 19.110426 | 7.975532 | 74.242225 | 0 |
| 1 | 102.507812 | 58.88243 | 0.465318 | -0.515088 | 1.677258 | 14.860146 | 10.576487 | 127.39358 | 0 |
| 2 | 103.015625 | 39.341649 | 0.323328 | 1.051164 | 3.121237 | 21.744669 | 7.735822 | 63.171909 | 0 |
| 3 | 136.75 | 57.178449 | -0.068415 | -0.636238 | 3.642977 | 20.95928 | 6.896499 | 53.593661 | 0 |
| 4 | 88.726562 | 40.672225 | 0.600866 | 1.123492 | 1.17893 | 11.46872 | 14.269573 | 252.567306 | 0 |


## Results

// TODO: results

We found that ...

The columns that best help to train the classifier are ...

## Discussion

// TODO: discussion

We found that (columns that are important) were good classifiers...

These findings could imply... / The impact that this might have is ...

Future questions that this investigation could lead to include ...

## References

1. Eatough, R. P., Molkenthin, N., Kramer, M., Noutsos, A., Keith, M. J., Stappers, B. W., & Lyne, A. G. (2010). Selection of radio pulsar candidates using artificial neural networks. Monthly Notices of the Royal Astronomical Society, 407(4), 2443-2450.

2. Karastergiou, A., Roberts, S. J., Johnston, S., Lee, H. J., Weltevrede, P., & Kramer, M. (2011). A transient component in the pulse profile of PSR J0738−4042. Monthly Notices of the Royal Astronomical Society, 415(1), 251-256.