# On the Classification of Neutron Stars

A DSCI 100 Project

Lucas Kuhn, Sophia Zhang, Michael Cheung

## Introduction
Astronomers have found many neutron stars throughout the universe. A subset of these stars falls under a rare classification known as "pulsar", meaning that they produce a periodic radio signal that is detectable from Earth. Pulsars are of great scientific interest, so being able to identify them reliably supports further research into neutron stars.

For this project, we want to answer the following question: which characteristics or attributes provide the best insight when classifying a neutron star as "pulsar" or "non-pulsar"? In addition, can we determine the ranges of values in which these variables fall for neutron stars that are classified as "pulsar"?

To investigate this, we will use the HTRU2 dataset, provided by the UCI Machine Learning Repository from the Center for Machine Learning and Intelligent Systems. This dataset contains summary statistics for observed neutron stars, computed from both the integrated pulse (folded) profile and the DM-SNR curve. It contains a large number of neutron stars with known classifications as either "pulsar" or "non-pulsar".

## Preliminary Exploratory Data Analysis
The dataset is publicly available from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/HTRU2). It can be downloaded as a .ZIP file through the [link](https://archive.ics.uci.edu/ml/machine-learning-databases/00372/) given on that page.

The contents can be extracted to reveal a .CSV file, which can be read directly into a Jupyter Notebook and visualized, as per the following steps.
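
The extraction step can also be scripted. The following is a minimal sketch using Python's standard `zipfile` module; the function name `extract_archive` and the paths in the usage comment are illustrative assumptions and are not part of the original notebook.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract every member of a .ZIP archive into dest_dir and return their paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return [dest / name for name in archive.namelist()]


# Hypothetical usage, assuming the archive was saved as ../data/HTRU2.zip:
# extract_archive("../data/HTRU2.zip", "../data/HTRU2")
```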

In [2]:
# Retrieve relevant packages for dataframes and visualization tools
import pandas as pd
import altair as alt

In [3]:
# Using the extracted contents, read the .CSV file, identifying columns using the associated .ARFF alongside it
pulsar_data = pd.read_csv(
    "../data/HTRU2/HTRU_2.csv",
    # .CSV file contains no column headers, we provide these here
    names=[
        "Profile_mean",
        "Profile_stdev",
        "Profile_skewness",
        "Profile_kurtosis",
        "DM_mean",
        "DM_stdev",
        "DM_skewness",
        "DM_kurtosis",
        "class"
    ]
)

# As a quick check, show the first 5 rows of the dataset
pulsar_data.head()

Unnamed: 0,Profile_mean,Profile_stdev,Profile_skewness,Profile_kurtosis,DM_mean,DM_stdev,DM_skewness,DM_kurtosis,class
0,140.5625,55.683782,-0.234571,-0.699648,3.199833,19.110426,7.975532,74.242225,0
1,102.507812,58.88243,0.465318,-0.515088,1.677258,14.860146,10.576487,127.39358,0
2,103.015625,39.341649,0.323328,1.051164,3.121237,21.744669,7.735822,63.171909,0
3,136.75,57.178449,-0.068415,-0.636238,3.642977,20.95928,6.896499,53.593661,0
4,88.726562,40.672225,0.600866,1.123492,1.17893,11.46872,14.269573,252.567306,0


We briefly check some high-level properties of the overall dataset.

In [9]:
pulsar_data.describe()

Unnamed: 0,Profile_mean,Profile_stdev,Profile_skewness,Profile_kurtosis,DM_mean,DM_stdev,DM_skewness,DM_kurtosis,class
count,17898.0,17898.0,17898.0,17898.0,17898.0,17898.0,17898.0,17898.0,17898.0
mean,111.079968,46.549532,0.477857,1.770279,12.6144,26.326515,8.303556,104.857709,0.091574
std,25.652935,6.843189,1.06404,6.167913,29.472897,19.470572,4.506092,106.51454,0.288432
min,5.8125,24.772042,-1.876011,-1.791886,0.213211,7.370432,-3.13927,-1.976976,0.0
25%,100.929688,42.376018,0.027098,-0.188572,1.923077,14.437332,5.781506,34.960504,0.0
50%,115.078125,46.947479,0.22324,0.19871,2.801839,18.461316,8.433515,83.064556,0.0
75%,127.085938,51.023202,0.473325,0.927783,5.464256,28.428104,10.702959,139.30933,0.0
max,192.617188,98.778911,8.069522,68.101622,223.392141,110.642211,34.539844,1191.000837,1.0


In [11]:
pulsar_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17898 entries, 0 to 17897
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Profile_mean      17898 non-null  float64
 1   Profile_stdev     17898 non-null  float64
 2   Profile_skewness  17898 non-null  float64
 3   Profile_kurtosis  17898 non-null  float64
 4   DM_mean           17898 non-null  float64
 5   DM_stdev          17898 non-null  float64
 6   DM_skewness       17898 non-null  float64
 7   DM_kurtosis       17898 non-null  float64
 8   class             17898 non-null  int64  
dtypes: float64(8), int64(1)
memory usage: 1.2 MB


If we like, we may choose to rename the classes from 0 and 1 to "non-pulsar" and "pulsar" later for ease of identification.
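
One way to do this renaming is sketched below. The small `toy` frame stands in for `pulsar_data`, and the column name `class_label` is an assumption made for illustration; the values are not taken from the dataset.

```python
import pandas as pd

# Toy frame standing in for pulsar_data (illustrative values only)
toy = pd.DataFrame({"Profile_mean": [140.6, 102.5, 99.4], "class": [0, 0, 1]})

# Map the integer class codes to readable labels in a new column,
# leaving the original numeric column untouched
class_labels = {0: "non-pulsar", 1: "pulsar"}
toy["class_label"] = toy["class"].map(class_labels)
print(toy)
```

Keeping the numeric column alongside the label column means either can be used later, depending on what a plotting or modelling function expects.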

As given, the data is already in a tidy format: each row of the table corresponds to a single star, and each column holds one of its aggregated attributes. It also does not contain any missing or NaN values, as evidenced by the non-null counts reported for every column. We can confirm this by dropping any rows that contain missing values.

In [12]:
# dropna() returns a new DataFrame, so assign the result before displaying it
pulsar_data = pulsar_data.dropna()
pulsar_data

Unnamed: 0,Profile_mean,Profile_stdev,Profile_skewness,Profile_kurtosis,DM_mean,DM_stdev,DM_skewness,DM_kurtosis,class
0,140.562500,55.683782,-0.234571,-0.699648,3.199833,19.110426,7.975532,74.242225,0
1,102.507812,58.882430,0.465318,-0.515088,1.677258,14.860146,10.576487,127.393580,0
2,103.015625,39.341649,0.323328,1.051164,3.121237,21.744669,7.735822,63.171909,0
3,136.750000,57.178449,-0.068415,-0.636238,3.642977,20.959280,6.896499,53.593661,0
4,88.726562,40.672225,0.600866,1.123492,1.178930,11.468720,14.269573,252.567306,0
...,...,...,...,...,...,...,...,...,...
17893,136.429688,59.847421,-0.187846,-0.738123,1.296823,12.166062,15.450260,285.931022,0
17894,122.554688,49.485605,0.127978,0.323061,16.409699,44.626893,2.945244,8.297092,0
17895,119.335938,59.935939,0.159363,-0.743025,21.430602,58.872000,2.499517,4.595173,0
17896,114.507812,53.902400,0.201161,-0.024789,1.946488,13.381731,10.007967,134.238910,0


As we can see, the number of rows does not decrease when we ask to drop any rows containing missing values.
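
A more explicit check is to count the missing values per column and compare the row counts before and after dropping. The sketch below uses a small hypothetical frame, `toy`, with one deliberately missing value so that the effect is visible; on `pulsar_data` we would expect every count to be zero and both row counts to match.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value (illustrative values only)
toy = pd.DataFrame({
    "Profile_mean": [140.6, np.nan, 103.0],
    "DM_mean": [3.2, 1.7, 3.1],
})

# Number of missing entries in each column
missing_per_column = toy.isna().sum()

# Row counts before and after removing rows with any missing value
rows_before = len(toy)
rows_after = len(toy.dropna())

print(missing_per_column)
print(rows_before, rows_after)
```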

We may also find it useful to understand the distribution of the classified neutron stars, into the "pulsar" and "non-pulsar" classes.

In [13]:
# Count the observations in each class; one column besides "class" is kept
# so that count() has something to tally
pulsar_data_count = pulsar_data[["class", "Profile_mean"]].groupby("class").count().reset_index()

# Rename the tallied column to reflect its purpose, then relabel the row index
# (0 and 1 after reset_index, which here line up with the class codes)
pulsar_data_count = (
    pulsar_data_count
    .rename(columns={"Profile_mean": "count"})
    .rename({0: "non-pulsar", 1: "pulsar"})
)
pulsar_data_count

Unnamed: 0,class,count
non-pulsar,0,16259
pulsar,1,1639


From this, we find that the dataset contains far more non-pulsar observations than pulsar ones: fewer than 2,000 are "pulsar", while over 16,000 are not.

This is consistent with the idea that pulsars are a rare occurrence among neutron stars.
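
To put the imbalance in proportional terms, the class counts reported above can be converted into shares. This is a small sketch of that arithmetic; the variable names are illustrative.

```python
import pandas as pd

# Class counts as reported in the count table above
class_counts = pd.Series({"non-pulsar": 16259, "pulsar": 1639})

# Share of each class in the full dataset
class_share = class_counts / class_counts.sum()
print(class_share.round(4))
```

Pulsars make up roughly 9% of the observations, which agrees with the mean of the `class` column in the summary statistics. One consequence worth keeping in mind is that a classifier that always predicts "non-pulsar" would already be correct about 91% of the time, so accuracy figures should be read against that baseline.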

During our investigation, we expect to compare multiple pairs of variables to assess whether there is a distinct grouping of "pulsar" and "non-pulsar" points.

As an example, we may choose to check if a clustering pattern exists between the "Profile_mean" and "DM_mean" variables through a simple scatter plot.

In [10]:
pulsar_data_means_plot = (
    alt.Chart(pulsar_data.head(5000), title="Mean Comparison")
    .mark_point(opacity=0.4)
    .encode(
        x=alt.X("Profile_mean", title="Mean using the Integrated Pulse Profile", scale=alt.Scale(zero=False)),
        y=alt.Y("DM_mean", title="Mean using the DM-SNR Curve"),
        # ":N" marks class as nominal, so each classification gets its own
        # discrete colour instead of a continuous saturation scale
        color=alt.Color("class:N", title="Classification"),
    )
    .configure_axis(labelFontSize=16, titleFontSize=16)
    .properties(width=400, height=400)
)
pulsar_data_means_plot

# TODO: check the portion of the dataset being used (only the first 5000 rows are plotted)
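
Because `head(5000)` takes only the first rows of the file, the plotted points may not be representative if the file is ordered in any way. One possible alternative, sketched here on a synthetic frame named `toy` (an assumption for illustration, not the real data), is to draw a reproducible random sample of the same size.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for pulsar_data (values are randomly generated)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Profile_mean": rng.normal(110, 25, size=20_000),
    "class": rng.choice([0, 1], size=20_000, p=[0.9, 0.1]),
})

# Draw a reproducible random subset of 5000 rows without replacement
plot_sample = toy.sample(n=5000, random_state=1)

# Compare the class proportions in the sample with those in the full frame
print(plot_sample["class"].value_counts(normalize=True))
print(toy["class"].value_counts(normalize=True))
```

The resulting `plot_sample` could be passed to `alt.Chart` in place of `pulsar_data.head(5000)`.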

## Methods

To determine which variables of the dataset contribute to a distinctive classification of "pulsar" and "non-pulsar" neutron stars, one possible strategy is to create a scatter plot for each pair of variables and inspect it qualitatively for any clustering that separates the two classifications. We can then train a K-nearest neighbours classifier on a majority subset of the dataset and use the remainder as a test set, to evaluate more quantitatively whether that pair of variables classifies observations accurately.

This can then be extended by combining the attributes from the pairs deemed effective and training a classifier on more than two variables, to assess whether that further improves classification accuracy.
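
The train-and-test step for a single pair of variables might look like the following sketch. It assumes scikit-learn is available and uses synthetic data in place of the real dataset; the 75/25 split, the choice of 5 neighbours, and the standardization step are all illustrative assumptions rather than settled decisions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for pulsar_data with two features and an imbalanced class
rng = np.random.default_rng(0)
n = 400
labels = rng.choice([0, 1], size=n, p=[0.9, 0.1])
toy = pd.DataFrame({
    "Profile_mean": rng.normal(110, 20, size=n) - 50 * labels,
    "DM_mean": rng.normal(5, 3, size=n) + 40 * labels,
    "class": labels,
})

X = toy[["Profile_mean", "DM_mean"]]
y = toy["class"]

# Hold out 25% of the rows for testing, preserving the class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, stratify=y, random_state=0
)

# Standardize the features so neither dominates the distance calculation,
# then classify by majority vote among the 5 nearest training points
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)

accuracy = knn.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")
```

Standardizing matters for K-nearest neighbours because the method relies on distances, and the variables in this dataset sit on quite different scales.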
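
Comparing many pairs by hand becomes tedious, so the comparison could be automated. The sketch below loops over every pair of feature columns and ranks the pairs by cross-validated accuracy. Cross-validation is a substitution made here for illustration (the text above describes a single train/test split), the data are synthetic, and in practice this would be run on the training portion only.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for pulsar_data with three features
rng = np.random.default_rng(1)
n = 300
labels = rng.choice([0, 1], size=n, p=[0.85, 0.15])
toy = pd.DataFrame({
    "Profile_mean": rng.normal(110, 20, size=n) - 50 * labels,
    "Profile_stdev": rng.normal(46, 7, size=n),
    "DM_mean": rng.normal(5, 3, size=n) + 40 * labels,
    "class": labels,
})

features = [column for column in toy.columns if column != "class"]

# Score a fresh K-nearest neighbours pipeline on every pair of features
results = []
for pair in combinations(features, 2):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
    scores = cross_val_score(model, toy[list(pair)], toy["class"], cv=5)
    results.append({"pair": pair, "mean_accuracy": scores.mean()})

# Best-performing pairs first
ranking = pd.DataFrame(results).sort_values("mean_accuracy", ascending=False)
print(ranking)
```

With the eight feature columns of the real dataset there would be 28 pairs to evaluate.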

// TODO: might need rewording or compacting

## Expected Outcomes and Significance

We expect that, at the end of this investigation, we may be able to identify a subset of variables that most effectively shows a distinct separation between neutron stars found to be "pulsar" and those that are not. We also hope to establish whether a range of values can be determined for the chosen variables.

Should we be able to answer these questions, the results may be useful in helping to find "pulsar" neutron stars, for example by narrowing down future observations or by disregarding stars that are unlikely to be pulsars.
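
As a first pass at establishing such ranges, the per-class minimum, median, and maximum of each variable could be tabulated. The sketch below does this on a small hypothetical frame; the values are invented for illustration and the choice of summary statistics is an assumption.

```python
import pandas as pd

# Toy frame standing in for pulsar_data (illustrative values only)
toy = pd.DataFrame({
    "Profile_mean": [140.6, 102.5, 136.8, 30.2, 45.9, 60.1],
    "DM_mean": [3.2, 1.7, 3.6, 80.4, 55.0, 120.3],
    "class": [0, 0, 0, 1, 1, 1],
})

# Per-class minimum, median, and maximum of every feature column
ranges = toy.groupby("class").agg(["min", "median", "max"])
print(ranges)
```

Since the minimum and maximum are sensitive to outliers, quantiles (for example the 5th and 95th percentiles) may give a more robust picture of where most pulsars fall.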

Future extensions or questions that this could lead to may include:
- // TODO: future question proposing