# On the Identification of Pulsar Stars

A DSCI 100 Project

Lucas Kuhn, Sophia Zhang, Michael Cheung

## Introduction
"Pulsar" neutron stars produce radio emissions that are of great interest to astronomers and researchers. In this project, we set out to develop a model that classifies each pulsar candidate as either "pulsar" or "non-pulsar".

The question we aim to answer is: "Are there characteristics of a pulsar candidate's curve profiles that serve as good predictors for pulsar and non-pulsar classification?"

We use the HTRU2 dataset from the UCI Machine Learning Repository, which contains summary statistics for observed pulsar candidates, computed from both the Integrated Pulse (folded) Profile and the Dispersion Measure Signal-to-Noise Ratio (DM-SNR) curve.

// TODO add stuff? flesh out stuff we had to limit because of proposal word limit?

## Methods

// TODO: (copied from proposal) actually state what's happening and narrate through operations

// TODO: use references in this section + in-text citations?

We plan to train multiple models using various subsets of the summary-statistic columns as predictor variables. The pulses of a candidate that is truly a pulsar are "remarkably constant over long periods of time" (Karastergiou et al., 2011), so we expect them to leave distinctive signatures in the curve profiles.
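As a rough sketch of how such a comparison of predictor subsets might look, the snippet below cross-validates the same K-Nearest Neighbours pipeline on a few column groupings. The data here is synthetic (generated with `make_classification`), and the subset names, the fixed `n_neighbors=5`, and the groupings themselves are illustrative assumptions rather than choices we have settled on.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

columns = [
    "Profile_mean", "Profile_stdev", "Profile_skewness", "Profile_kurtosis",
    "DM_mean", "DM_stdev", "DM_skewness", "DM_kurtosis",
]

# Synthetic stand-in for the training split (assumption: NOT the real HTRU2 data)
X_raw, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=1
)
train_df = pd.DataFrame(X_raw, columns=columns)

# Hypothetical groupings of predictors to compare
candidate_subsets = {
    "profile_only": columns[:4],
    "dm_only": columns[4:],
    "all_columns": columns,
}

rows = []
for name, cols in candidate_subsets.items():
    # Standardize inside the pipeline so scaling is refit on each training fold
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
    scores = cross_validate(pipe, train_df[cols], y, cv=5)
    rows.append(
        {
            "subset": name,
            "n_predictors": len(cols),
            "mean_cv_accuracy": scores["test_score"].mean(),
        }
    )

# One row per subset, best cross-validation accuracy first
subset_results = pd.DataFrame(rows).sort_values("mean_cv_accuracy", ascending=False)
print(subset_results)
```

Sorting by mean cross-validation accuracy gives a quick ranking of the subsets without touching the testing set.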

We can use scatter plots to qualitatively assess pairs of variables for any distinct separation or clustering of classified points. Plots like these provide an interface for human inspection, as has been used when training other models such as artificial neural networks (Eatough et al., 2010).

We will fit a K-Nearest Neighbours classifier model to our training data, using scikit-learn's GridSearchCV with 5-fold cross-validation to choose $k$. This allows us to efficiently compare the validation accuracy at many values of $k$ and guard against both overfitting and underfitting.
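The tuning step could be sketched as below. This is a sketch under stated assumptions: the data is synthetic, and the candidate range of odd $k$ from 1 to 29 is an illustrative choice, not one fixed by our analysis.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the eight predictors (assumption: NOT HTRU2)
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scaling lives inside the pipeline so each fold is standardized
# using only that fold's training portion
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Candidate values of k (illustrative range)
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 31, 2))}

knn_grid = GridSearchCV(knn_pipeline, param_grid, cv=5, scoring="accuracy")
knn_grid.fit(X_train, y_train)

best_k = knn_grid.best_params_["kneighborsclassifier__n_neighbors"]
best_cv_accuracy = knn_grid.best_score_
print(f"best k = {best_k}, mean CV accuracy = {best_cv_accuracy:.3f}")
```

Very small $k$ tends to overfit and very large $k$ tends to underfit, which is why the grid spans both ends. The full per-$k$ results are available in `knn_grid.cv_results_` for plotting accuracy against $k$.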

With our best model, we can evaluate its performance on the testing set using a confusion matrix. Because the matrix reports false positives and false negatives separately, we can judge the model according to which kind of error is more costly, rather than by overall accuracy alone.
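To illustrate how the confusion matrix separates the two error types, here is a small sketch on synthetic data. The fixed `n_neighbors=5` and the 0/1 labels are assumptions for the example; in our notebook the labels would be "non-pulsar" and "pulsar".

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (assumption: NOT HTRU2); 1 plays the role of "pulsar"
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# labels=[0, 1] fixes the layout: rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
# Guard against division by zero when a class is never predicted or never present
precision = tp / (tp + fp) if (tp + fp) else 0.0  # of predicted pulsars, fraction correct
recall = tp / (tp + fn) if (tp + fn) else 0.0  # of true pulsars, fraction recovered

print(cm)
print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, recall={recall:.3f}")
```

If missing a real pulsar (a false negative) is considered worse than flagging a spurious candidate, recall is the number to watch. If follow-up observations are expensive, precision matters more.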

In [1]:
# This serves as the set of packages that may be used for our final project

# Retrieve relevant packages for dataframes and visualization tools
import pandas as pd
import altair as alt
import sklearn

# Retrieve relevant packages for classification and modelling
from sklearn.compose import make_column_transformer
from sklearn.metrics import confusion_matrix
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.model_selection import (
    GridSearchCV,
    cross_validate,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# This dataset has more than 5000 observations, so we lift Altair's default row limit for plotting
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

First we load the data from the dataset, providing column headers for readability. Then we replace the class labels 0 and 1 with the readable names "non-pulsar" and "pulsar" respectively.

In [4]:
# Using the extracted contents, read the .CSV file, identifying columns using the associated .ARFF alongside it
pulsar_data = pd.read_csv(
    "../data/HTRU2/HTRU_2.csv",
    # .CSV file contains no column headers, we provide these here
    names=[
        "Profile_mean",
        "Profile_stdev",
        "Profile_skewness",
        "Profile_kurtosis",
        "DM_mean",
        "DM_stdev",
        "DM_skewness",
        "DM_kurtosis",
        "class"
    ]
)
pulsar_data["class"] = pulsar_data["class"].replace({0: "non-pulsar", 1: "pulsar"}).convert_dtypes()

# Preview the first 5 rows of the dataset
pulsar_data.head()

|   | Profile_mean | Profile_stdev | Profile_skewness | Profile_kurtosis | DM_mean | DM_stdev | DM_skewness | DM_kurtosis | class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 140.5625 | 55.683782 | -0.234571 | -0.699648 | 3.199833 | 19.110426 | 7.975532 | 74.242225 | non-pulsar |
| 1 | 102.507812 | 58.88243 | 0.465318 | -0.515088 | 1.677258 | 14.860146 | 10.576487 | 127.39358 | non-pulsar |
| 2 | 103.015625 | 39.341649 | 0.323328 | 1.051164 | 3.121237 | 21.744669 | 7.735822 | 63.171909 | non-pulsar |
| 3 | 136.75 | 57.178449 | -0.068415 | -0.636238 | 3.642977 | 20.95928 | 6.896499 | 53.593661 | non-pulsar |
| 4 | 88.726562 | 40.672225 | 0.600866 | 1.123492 | 1.17893 | 11.46872 | 14.269573 | 252.567306 | non-pulsar |


## Results

// TODO: results

We found that ...

The columns that best help to train the classifier are ...

## Discussion

// TODO: discussion

We found that (columns that are important) were good classifiers...

These findings could imply... / The impact that this might have is ...

Future questions that this investigation could lead to include ...

## References

1. Eatough, R. P., Molkenthin, N., Kramer, M., Noutsos, A., Keith, M. J., Stappers, B. W., & Lyne, A. G. (2010). Selection of radio pulsar candidates using artificial neural networks. Monthly Notices of the Royal Astronomical Society, 407(4), 2443-2450.

2. Karastergiou, A., Roberts, S. J., Johnston, S., Lee, H. J., Weltevrede, P., & Kramer, M. (2011). A transient component in the pulse profile of PSR J0738−4042. Monthly Notices of the Royal Astronomical Society, 415(1), 251-256.