# On the Identification of Pulsar Stars

A DSCI 100 Project

Lucas Kuhn, Sophia Zhang, Michael Cheung

## Introduction
"Pulsars are a rare type of Neutron star that produce radio emission detectable here on Earth" (UCI Machine Learning Repository, HTRU2 dataset page; citation TODO). These stars are of considerable scientific interest, so we set out to find ways to distinguish them from candidates that are "non-pulsar".

The questions we aim to answer are: "Are there characteristics of a pulsar candidate that might serve as good predictors for pulsar and non-pulsar classification? Are there ranges of values for those characteristics that pulsars generally fall under?"

We use the HTRU2 dataset from the UCI Machine Learning Repository. It contains summary statistics for each observed pulsar candidate, computed from both its Integrated Pulse (folded) Profile and its Dispersion Measure Signal-to-Noise Ratio (DM-SNR) curve.

## Preliminary Exploratory Data Analysis
The dataset is available from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/HTRU2). The [download link](https://archive.ics.uci.edu/ml/machine-learning-databases/00372/) on that page provides the data as a .ZIP file.
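As a minimal sketch of working with such an archive, the helper below reads the single .CSV member of a .ZIP file straight into a dataframe without unpacking it to disk. The function name `read_csv_from_zip` and the small in-memory archive are illustrative assumptions, not part of the project; the member names inside the real archive should be checked with `namelist()` first.

```python
import io
import zipfile

import pandas as pd


def read_csv_from_zip(zip_source, **read_csv_kwargs):
    """Read the only .csv member of a ZIP archive into a DataFrame (hypothetical helper)."""
    with zipfile.ZipFile(zip_source) as archive:
        csv_members = [name for name in archive.namelist() if name.lower().endswith(".csv")]
        if len(csv_members) != 1:
            raise ValueError(f"Expected exactly one .csv member, found: {csv_members}")
        with archive.open(csv_members[0]) as handle:
            return pd.read_csv(handle, **read_csv_kwargs)


# Demonstration on a small in-memory archive with made-up values
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo_archive:
    demo_archive.writestr("demo.csv", "1.0,2.0,0\n3.0,4.0,1\n")
    demo_archive.writestr("Readme.txt", "not a data file")
buffer.seek(0)

demo_frame = read_csv_from_zip(buffer, header=None, names=["a", "b", "class"])
```

Here `zip_source` may be a path on disk or any file-like object, so the same helper would work on a downloaded archive.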

This .ZIP file contains a .CSV file that can be read directly into a Jupyter Notebook and visualized, as in the following code.

In [25]:
# Retrieve relevant packages for dataframes and visualization tools
import pandas as pd
import altair as alt
import sklearn

# Retrieve relevant packages for classification and modelling
from sklearn.compose import make_column_transformer
from sklearn.metrics import confusion_matrix
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.model_selection import (
    GridSearchCV,
    cross_validate,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# This dataset has more than 5000 observations, so we lift Altair's default row limit to allow plotting all of it
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

In [2]:
# Using the extracted contents, read the .CSV file, identifying columns using the associated .ARFF alongside it
pulsar_data = pd.read_csv(
    "../data/HTRU2/HTRU_2.csv",
    # .CSV file contains no column headers, we provide these here
    names=[
        "Profile_mean",
        "Profile_stdev",
        "Profile_skewness",
        "Profile_kurtosis",
        "DM_mean",
        "DM_stdev",
        "DM_skewness",
        "DM_kurtosis",
        "class"
    ]
)

# As a POC, show the first 5 rows of the dataset
pulsar_data.head()

Unnamed: 0,Profile_mean,Profile_stdev,Profile_skewness,Profile_kurtosis,DM_mean,DM_stdev,DM_skewness,DM_kurtosis,class
0,140.5625,55.683782,-0.234571,-0.699648,3.199833,19.110426,7.975532,74.242225,0
1,102.507812,58.88243,0.465318,-0.515088,1.677258,14.860146,10.576487,127.39358,0
2,103.015625,39.341649,0.323328,1.051164,3.121237,21.744669,7.735822,63.171909,0
3,136.75,57.178449,-0.068415,-0.636238,3.642977,20.95928,6.896499,53.593661,0
4,88.726562,40.672225,0.600866,1.123492,1.17893,11.46872,14.269573,252.567306,0


We briefly check some high-level properties of the overall dataset.

In [3]:
pulsar_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17898 entries, 0 to 17897
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Profile_mean      17898 non-null  float64
 1   Profile_stdev     17898 non-null  float64
 2   Profile_skewness  17898 non-null  float64
 3   Profile_kurtosis  17898 non-null  float64
 4   DM_mean           17898 non-null  float64
 5   DM_stdev          17898 non-null  float64
 6   DM_skewness       17898 non-null  float64
 7   DM_kurtosis       17898 non-null  float64
 8   class             17898 non-null  int64  
dtypes: float64(8), int64(1)
memory usage: 1.2 MB


As given, the data is already in a tidy format: each row of the table corresponds to a single pulsar candidate, and each column holds one of its summary statistics. It also contains no missing or NaN values, as shown by every column reporting 17898 non-null entries. We can confirm this by dropping any rows that contain missing values.

In [4]:
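# dropna() returns a new dataframe rather than modifying pulsar_data in place, so we display its result directly
pulsar_data.dropna()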

Unnamed: 0,Profile_mean,Profile_stdev,Profile_skewness,Profile_kurtosis,DM_mean,DM_stdev,DM_skewness,DM_kurtosis,class
0,140.562500,55.683782,-0.234571,-0.699648,3.199833,19.110426,7.975532,74.242225,0
1,102.507812,58.882430,0.465318,-0.515088,1.677258,14.860146,10.576487,127.393580,0
2,103.015625,39.341649,0.323328,1.051164,3.121237,21.744669,7.735822,63.171909,0
3,136.750000,57.178449,-0.068415,-0.636238,3.642977,20.959280,6.896499,53.593661,0
4,88.726562,40.672225,0.600866,1.123492,1.178930,11.468720,14.269573,252.567306,0
...,...,...,...,...,...,...,...,...,...
17893,136.429688,59.847421,-0.187846,-0.738123,1.296823,12.166062,15.450260,285.931022,0
17894,122.554688,49.485605,0.127978,0.323061,16.409699,44.626893,2.945244,8.297092,0
17895,119.335938,59.935939,0.159363,-0.743025,21.430602,58.872000,2.499517,4.595173,0
17896,114.507812,53.902400,0.201161,-0.024789,1.946488,13.381731,10.007967,134.238910,0


As we can see, the number of rows does not decrease when we ask to drop any rows containing missing values.
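The same check can be made explicit by counting rows rather than reading them off the displayed table. The sketch below uses a hypothetical helper, `missing_value_summary`, on a small made-up dataframe; the column names mirror the dataset but the values are invented for illustration.

```python
import numpy as np
import pandas as pd


def missing_value_summary(df):
    """Count rows before and after dropping missing values (hypothetical helper)."""
    rows_with_missing = int(df.isna().any(axis=1).sum())
    return {
        "rows_total": len(df),
        "rows_with_missing": rows_with_missing,
        "rows_after_dropna": len(df.dropna()),
    }


# Made-up data with one missing value, to show what a non-clean result looks like
demo = pd.DataFrame({"Profile_mean": [140.5, np.nan, 103.0], "class": [0, 0, 1]})
summary = missing_value_summary(demo)
```

For a dataframe with no missing values, `rows_total` and `rows_after_dropna` are equal and `rows_with_missing` is zero.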

It is also useful to understand how the candidates are distributed between the "pulsar" and "non-pulsar" classes.

To make the classes easier to identify, we first replace the labels 0 and 1 with their word equivalents, "non-pulsar" and "pulsar".

In [5]:
pulsar_data["class"] = pulsar_data["class"].replace({0: "non-pulsar", 1: "pulsar"}).convert_dtypes()
pulsar_data

Unnamed: 0,Profile_mean,Profile_stdev,Profile_skewness,Profile_kurtosis,DM_mean,DM_stdev,DM_skewness,DM_kurtosis,class
0,140.562500,55.683782,-0.234571,-0.699648,3.199833,19.110426,7.975532,74.242225,non-pulsar
1,102.507812,58.882430,0.465318,-0.515088,1.677258,14.860146,10.576487,127.393580,non-pulsar
2,103.015625,39.341649,0.323328,1.051164,3.121237,21.744669,7.735822,63.171909,non-pulsar
3,136.750000,57.178449,-0.068415,-0.636238,3.642977,20.959280,6.896499,53.593661,non-pulsar
4,88.726562,40.672225,0.600866,1.123492,1.178930,11.468720,14.269573,252.567306,non-pulsar
...,...,...,...,...,...,...,...,...,...
17893,136.429688,59.847421,-0.187846,-0.738123,1.296823,12.166062,15.450260,285.931022,non-pulsar
17894,122.554688,49.485605,0.127978,0.323061,16.409699,44.626893,2.945244,8.297092,non-pulsar
17895,119.335938,59.935939,0.159363,-0.743025,21.430602,58.872000,2.499517,4.595173,non-pulsar
17896,114.507812,53.902400,0.201161,-0.024789,1.946488,13.381731,10.007967,134.238910,non-pulsar


In [14]:
# count() tallies the non-grouping columns, so keep one column alongside the target classification column
pulsar_data_count = pulsar_data[["class", "Profile_mean"]].groupby("class").count().reset_index()

# Rename the column to more accurately reflect its purpose
pulsar_data_count = pulsar_data_count.rename(columns={"Profile_mean": "count"})
pulsar_data_count = pulsar_data_count.assign(pct=pulsar_data_count["count"] / len(pulsar_data))
pulsar_data_count

Unnamed: 0,class,count,pct
0,non-pulsar,16259,0.908426
1,pulsar,1639,0.091574


We see that the large majority of this dataset is classified as "non-pulsar": about 91% of the observations are "non-pulsar" and about 9% are "pulsar".

This is consistent with the idea that neutron stars that are "pulsar" tend to be a rare occurrence.
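The count-and-proportion table can also be produced with `value_counts`, which avoids carrying a second column through `groupby`. This is only an alternative sketch: `class_distribution` is a hypothetical helper name, and the demonstration data is made up.

```python
import pandas as pd


def class_distribution(df, class_col="class"):
    """Return the count and proportion of each class, largest class first (hypothetical helper)."""
    counts = df[class_col].value_counts()
    summary = pd.DataFrame({"count": counts, "pct": counts / len(df)})
    # Name the index explicitly so the result has a class column after reset_index()
    return summary.rename_axis(class_col).reset_index()


# Made-up labels: nine non-pulsar candidates and one pulsar
demo = pd.DataFrame({"class": ["non-pulsar"] * 9 + ["pulsar"]})
distribution = class_distribution(demo)
```

The same helper could be applied to the full dataset and to any subset of it, so the proportions are computed the same way each time.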

Having inspected the whole dataset, we split it into training and testing subsets before building our model. For this preliminary analysis, we arbitrarily choose to use 75% of the data for training and the remaining 25% for testing.

In [19]:
pulsar_data_train, pulsar_data_test = train_test_split(
    pulsar_data, train_size=0.75, stratify=pulsar_data["class"]
)

We stratify the data such that the training data and the testing data both have a similar distribution of "non-pulsar" and "pulsar" observations as the original full dataset.
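Because the split is random, the exact rows in each subset change between runs unless a seed is fixed. The sketch below shows one way to make a stratified split repeatable and to check that the class proportions are preserved. The `seed` value, the helper name `stratified_split`, and the demonstration data are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data with a 90% / 10% class imbalance
demo = pd.DataFrame(
    {
        "Profile_mean": range(1000),
        "class": ["non-pulsar"] * 900 + ["pulsar"] * 100,
    }
)


def stratified_split(df, class_col="class", train_size=0.75, seed=0):
    """Split df so both parts keep the class proportions; seed makes it repeatable."""
    return train_test_split(
        df, train_size=train_size, stratify=df[class_col], random_state=seed
    )


train_a, test_a = stratified_split(demo)
train_b, test_b = stratified_split(demo)  # same seed, so the same rows are selected

# Proportion of "pulsar" rows in each subset
train_share = (train_a["class"] == "pulsar").mean()
test_share = (test_a["class"] == "pulsar").mean()
```

With a fixed seed, any counts reported for the training and testing sets can be reproduced exactly when the notebook is re-run.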

In [20]:
pulsar_data_train_count = pulsar_data_train[["class", "Profile_mean"]].groupby("class").count().reset_index()
pulsar_data_train_count = pulsar_data_train_count.rename(columns={"Profile_mean": "count"})
pulsar_data_train_count = pulsar_data_train_count.assign(pct=pulsar_data_train_count["count"] / len(pulsar_data_train))
pulsar_data_train_count

Unnamed: 0,class,count,pct
0,non-pulsar,12194,0.908441
1,pulsar,1229,0.091559


In [22]:
# We only evaluate the distribution of the testing set; we seek no further detail into the nature of the testing set.
pulsar_data_test_count = pulsar_data_test[["class", "Profile_mean"]].groupby("class").count().reset_index()
pulsar_data_test_count = pulsar_data_test_count.rename(columns={"Profile_mean": "count"})
pulsar_data_test_count = pulsar_data_test_count.assign(pct=pulsar_data_test_count["count"] / len(pulsar_data_test))
pulsar_data_test_count

Unnamed: 0,class,count,pct
0,non-pulsar,4065,0.90838
1,pulsar,410,0.09162


The most natural columns to consider as predictors for our classification are the mean values of each candidate's Integrated Pulse Profile and DM-SNR curve. Here, we compute the per-class means of those two columns in the training set.

In [26]:
# Obtain the mean of means for the training set
pulsar_data_train_means = pulsar_data_train[["class", "Profile_mean", "DM_mean"]].groupby("class").mean(numeric_only=True).reset_index()
pulsar_data_train_means

Unnamed: 0,class,Profile_mean,DM_mean
0,non-pulsar,116.459827,8.795973
1,pulsar,56.731133,49.900328


We can create a scatter plot of the data in the training set to qualitatively assess any visible grouping that may exist. We colour the data points by their known classification.

In [34]:
pulsar_data_train_means_plot = (
    alt.Chart(pulsar_data_train, title="Mean Comparison")
    .mark_point(opacity=0.4)
    .encode(
        x=alt.X("Profile_mean", title="Mean using the Integrated Pulse Profile", scale=alt.Scale(zero=False)),
        y=alt.Y("DM_mean", title="Mean using the DM-SNR Curve"),
        color=alt.Color("class", title="Classification"),
    )
    .configure_axis(labelFontSize=16, titleFontSize=16)
    .properties(width=400, height=400)
)
pulsar_data_train_means_plot

In the plot, we tend to see the stars classified as "pulsar" on the left side towards lower mean values of the Integrated Pulse Profile, and "non-pulsar" observations on the right. This may imply that these variables would be effective predictors to include in our model.

## Methods

We plan to conduct our data analysis by training multiple models that use various subsets of the columns to assess different columns' effects on the accuracy of classification. While we expect columns detailing "mean values" or "skewness" may be influential to our model, we might also expect other columns like "standard deviation" to be less important and thus likely excluded from our predictors.

One possible strategy is to create scatter plots for pair-wise combinations of the variables in the dataset and inspect qualitatively whether each plot shows a distinguishable separation between the classes. This can inform our choice of predictors when we create the classifier model. For example, the previous visualization of the "mean values" suggests that these two variables could be good predictors.

As a natural first choice, we expect to fit a K-Nearest Neighbours classifier to the dataset, using the predictors we choose from the visualization analysis. For a given set of predictors, we can use a grid search with 5-fold cross-validation to select the best value of the hyperparameter $k$ and to estimate how well the model performs on the training set. Different subsets of the predictors can then be compared by their cross-validation accuracy. Once we have confidence in a model, it can be evaluated against the test dataset to estimate how well its accuracy generalizes.
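The tuning step described above could look roughly like the following sketch, which standardizes the predictors and searches over odd values of $k$ with 5-fold cross-validation. The helper name `tune_knn`, the range of $k$ values, and the synthetic data are assumptions chosen for illustration, not the project's final settings.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training set: two imbalanced, well-separated groups
rng = np.random.default_rng(0)
n_neg, n_pos = 180, 20
demo = pd.DataFrame(
    {
        "Profile_mean": np.concatenate([rng.normal(115, 15, n_neg), rng.normal(55, 15, n_pos)]),
        "DM_mean": np.concatenate([rng.normal(9, 5, n_neg), rng.normal(50, 15, n_pos)]),
        "class": ["non-pulsar"] * n_neg + ["pulsar"] * n_pos,
    }
)


def tune_knn(train_df, predictors, class_col="class", k_values=range(1, 22, 2)):
    """Grid-search k for a scaled K-NN classifier using 5-fold cross-validation."""
    pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())
    # make_pipeline names each step after its lowercased class name
    param_grid = {"kneighborsclassifier__n_neighbors": list(k_values)}
    search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
    search.fit(train_df[predictors], train_df[class_col])
    return search


search = tune_knn(demo, ["Profile_mean", "DM_mean"])
best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
cv_accuracy = search.best_score_
```

Scaling is placed inside the pipeline so that each cross-validation fold is standardized using only its own training portion. Calling `tune_knn` with different `predictors` lists gives one way to compare predictor subsets by their cross-validation accuracy.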

We expect to use cross-validation metrics, and possibly a confusion matrix, to critically analyze and visualize the accuracy of a model.
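A confusion matrix matters here because of the class imbalance: with about 91% of observations being "non-pulsar", a classifier that always predicts "non-pulsar" would already reach about 91% accuracy while finding no pulsars at all. The small example below illustrates this with made-up labels in a 9:1 ratio; the variable names are illustrative.

```python
from sklearn.metrics import confusion_matrix

labels = ["non-pulsar", "pulsar"]

# Made-up labels: nine non-pulsars and one pulsar, with every prediction "non-pulsar"
y_true = ["non-pulsar"] * 9 + ["pulsar"]
y_pred = ["non-pulsar"] * 10

# Rows are true classes and columns are predicted classes, in the order of labels
matrix = confusion_matrix(y_true, y_pred, labels=labels)

accuracy = matrix.trace() / matrix.sum()          # correct predictions / all predictions
pulsar_recall = matrix[1, 1] / matrix[1].sum()    # pulsars found / all true pulsars
```

Here the accuracy is 0.9 but the recall for the "pulsar" class is 0, which is why accuracy alone would be a misleading summary for this dataset.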

## Expected Outcomes and Significance

We expect that, at the end of this investigation, we will be able to provide a subset of the variables in the HTRU2 dataset that most effectively separates candidates found to be "pulsar" from those that are not. We also hope to determine whether a typical range of values can be established for each of the chosen variables.
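One way to summarize such ranges, offered here only as a sketch, is to compute per-class quantiles of each chosen variable. The helper name `class_value_ranges`, the 5% and 95% cut-offs, and the demonstration data are assumptions; other definitions of a "range" (for example minimum and maximum) would work the same way.

```python
import pandas as pd


def class_value_ranges(df, columns, class_col="class", lower=0.05, upper=0.95):
    """Return the lower and upper quantile of each column within each class."""
    return df.groupby(class_col)[columns].quantile([lower, upper])


# Made-up values: "pulsar" rows span 0..100 and "non-pulsar" rows span 100..200
demo = pd.DataFrame(
    {
        "Profile_mean": list(range(0, 101)) + list(range(100, 201)),
        "class": ["pulsar"] * 101 + ["non-pulsar"] * 101,
    }
)
ranges = class_value_ranges(demo, ["Profile_mean"])
```

The result is indexed by class and quantile, so `ranges.loc[("pulsar", 0.05), "Profile_mean"]` gives the lower bound for the "pulsar" class. Overlap between the two classes' ranges would indicate that a variable separates them less cleanly.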

If we are able to answer these questions, the results may be useful in helping to find pulsars. They may help to narrow down future observations, or to disregard candidates that are unlikely to be pulsars.

Future extensions or questions that this could lead to may include:
- // TODO: future question proposing