## Unveiling Pulsars: Machine Learning for Pulsar Identification

#### Introduction

The HTRU2 dataset comes from the High Time Resolution Universe survey, which searches for pulsars: rare, rapidly rotating neutron stars whose radio emission can be detected on Earth. Pulsars are of scientific interest as probes of space-time, the interstellar medium, and states of matter.

Pulsars spin quickly, and as they do, their emission sweeps past Earth as a repeating radio signal that can be detected using radio telescopes. Each pulsar produces a slightly different emission pattern, but real signals are hard to find because most candidate detections are caused by radio frequency interference and noise.

The dataset we'll use describes each candidate with eight continuous features: the mean, standard deviation, excess kurtosis, and skewness of the integrated pulse profile, and the same four statistics of the DM-SNR curve. It has a total of 17,898 examples, of which 1,639 are positive (real pulsars) and 16,259 are negative (noise). The data comes in two formats, CSV and ARFF; each entry lists these measurements followed by a class label, with 0 indicating noise and 1 indicating a real pulsar.

To discover hidden pulsars, we use machine learning in Python: rather than writing rules by hand, we show the computer labelled examples of real pulsars and of noise, and it learns to sort new examples into "pulsar" or "not pulsar." Our goal is to build a classifier that can tell the difference between pulsars and noise, which will help us discover more about these incredible stars.


#### Preliminary exploratory data analysis

We begin by importing pandas and reading the comma-separated values (CSV) data file into a DataFrame named `pulsar_data`.

In [41]:
import pandas as pd
pulsar_data = pd.read_csv("data/HTRU_2.csv")

pulsar_data

Unnamed: 0,140.5625,55.68378214,-0.234571412,-0.699648398,3.199832776,19.11042633,7.975531794,74.24222492,0
0,102.507812,58.882430,0.465318,-0.515088,1.677258,14.860146,10.576487,127.393580,0
1,103.015625,39.341649,0.323328,1.051164,3.121237,21.744669,7.735822,63.171909,0
2,136.750000,57.178449,-0.068415,-0.636238,3.642977,20.959280,6.896499,53.593661,0
3,88.726562,40.672225,0.600866,1.123492,1.178930,11.468720,14.269573,252.567306,0
4,93.570312,46.698114,0.531905,0.416721,1.636288,14.545074,10.621748,131.394004,0
...,...,...,...,...,...,...,...,...,...
17892,136.429688,59.847421,-0.187846,-0.738123,1.296823,12.166062,15.450260,285.931022,0
17893,122.554688,49.485605,0.127978,0.323061,16.409699,44.626893,2.945244,8.297092,0
17894,119.335938,59.935939,0.159363,-0.743025,21.430602,58.872000,2.499517,4.595173,0
17895,114.507812,53.902400,0.201161,-0.024789,1.946488,13.381731,10.007967,134.238910,0


Before using the data, we must make sure it is tidy, meaning that it satisfies three criteria:
+ each row is a single observation
+ each column is a single variable, and
+ each value is a single cell.

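The output above shows that the file has no header row: pandas has used the first observation as the column labels. We will begin by adding proper column labels. Note that overwriting the labels discards that first observation, which is why the counts below total 17,897 rather than the 17,898 examples in the full dataset.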


In [42]:
# Add column names.

pulsar_data.columns = ["mean_integrated_profile", 
                       "standard_deviation_integrated_profile", 
                       "excess_kurtosis_integrated_profile", 
                       "skewness_integrated_profile", 
                       "mean_dm-snr_curve", 
                       "standard_deviation_dm-snr_curve",
                       "excess_kurtosis_dm-snr_curve",
                       "skewness_dm-snr_curve",
                       "label"]


In [43]:
# Make the label column more readable for analysis.

pulsar_data["label"] = pulsar_data["label"].replace({0: "Noise", 1: "Pulsar"})


pulsar_data

Unnamed: 0,mean_integrated_profile,standard_deviation_integrated_profile,excess_kurtosis_integrated_profile,skewness_integrated_profile,mean_dm-snr_curve,standard_deviation_dm-snr_curve,excess_kurtosis_dm-snr_curve,skewness_dm-snr_curve,label
0,102.507812,58.882430,0.465318,-0.515088,1.677258,14.860146,10.576487,127.393580,Noise
1,103.015625,39.341649,0.323328,1.051164,3.121237,21.744669,7.735822,63.171909,Noise
2,136.750000,57.178449,-0.068415,-0.636238,3.642977,20.959280,6.896499,53.593661,Noise
3,88.726562,40.672225,0.600866,1.123492,1.178930,11.468720,14.269573,252.567306,Noise
4,93.570312,46.698114,0.531905,0.416721,1.636288,14.545074,10.621748,131.394004,Noise
...,...,...,...,...,...,...,...,...,...
17892,136.429688,59.847421,-0.187846,-0.738123,1.296823,12.166062,15.450260,285.931022,Noise
17893,122.554688,49.485605,0.127978,0.323061,16.409699,44.626893,2.945244,8.297092,Noise
17894,119.335938,59.935939,0.159363,-0.743025,21.430602,58.872000,2.499517,4.595173,Noise
17895,114.507812,53.902400,0.201161,-0.024789,1.946488,13.381731,10.007967,134.238910,Noise
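One caveat on the loading step: because the CSV file has no header row, `pd.read_csv` with its default settings uses the first observation as the column names, so relabelling the columns afterwards loses that row. A minimal sketch of an alternative that keeps every row is to pass `header=None` together with `names`. The two-row in-memory CSV below stands in for `data/HTRU_2.csv` and reuses values from the first output above; `column_names`, `sample_csv`, and `pulsar_sample` are illustrative names, not part of the original notebook.

```python
import io

import pandas as pd

column_names = [
    "mean_integrated_profile",
    "standard_deviation_integrated_profile",
    "excess_kurtosis_integrated_profile",
    "skewness_integrated_profile",
    "mean_dm-snr_curve",
    "standard_deviation_dm-snr_curve",
    "excess_kurtosis_dm-snr_curve",
    "skewness_dm-snr_curve",
    "label",
]

# Two rows copied from the first output above, standing in for data/HTRU_2.csv.
sample_csv = io.StringIO(
    "140.5625,55.68378214,-0.234571412,-0.699648398,"
    "3.199832776,19.11042633,7.975531794,74.24222492,0\n"
    "102.507812,58.882430,0.465318,-0.515088,"
    "1.677258,14.860146,10.576487,127.393580,0\n"
)

# header=None tells pandas that the first line is data, not column labels;
# names= supplies the labels in the same call.
pulsar_sample = pd.read_csv(sample_csv, header=None, names=column_names)

print(pulsar_sample.shape)
```

With the real file, the same two arguments would be passed alongside the path used in the loading cell.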


Now that we have clean, tidy data, let us summarize our data for analysis. We will begin by reviewing whether our data contains any empty cells. 

In [44]:
# Count missing values across all cells; a result of 0 means no cells are empty.

pulsar_data.isnull().sum().sum()

0

Now, let us review how many examples in our data are labelled as pulsars and how many are labelled as noise.

In [45]:
# Count the number of pulsar examples and the number of noise examples in our data.

pulsar_data_count = pulsar_data["label"].value_counts().to_frame()
pulsar_data_count.columns = ["count"]
pulsar_data_count

Unnamed: 0,count
Noise,16258
Pulsar,1639


In [46]:
# Compute the percentage of pulsar examples and of noise examples in our data.

pulsar_data_percentage = pulsar_data["label"].value_counts(normalize=True).to_frame()
pulsar_data_percentage.columns = ["percentage"]
pulsar_data_percentage["percentage"] = pulsar_data_percentage["percentage"] * 100
pulsar_data_percentage

Unnamed: 0,percentage
Noise,90.842041
Pulsar,9.157959
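This class imbalance matters when we later judge a classifier: one that always predicts "Noise" would already be right about 90.8% of the time while finding no pulsars at all. The sketch below works out that majority-class baseline from the counts shown above. The `labels` series is rebuilt from those counts for illustration rather than read from the file, and the variable names are assumptions.

```python
import pandas as pd

# Rebuild the label column from the counts reported above.
labels = pd.Series(["Noise"] * 16258 + ["Pulsar"] * 1639, name="label")

# Accuracy of a rule that always predicts the most common class.
majority_class = labels.value_counts().idxmax()
baseline_accuracy = (labels == majority_class).mean()

# Recall for the "Pulsar" class under that rule: the share of real pulsars
# that the rule labels as "Pulsar".
predicted = pd.Series([majority_class] * len(labels))
true_pulsars = labels == "Pulsar"
pulsar_recall = ((predicted == "Pulsar") & true_pulsars).sum() / true_pulsars.sum()

print(majority_class, baseline_accuracy, pulsar_recall)
```

Any classifier we train should therefore be compared against this baseline, and judged on precision and recall for the pulsar class as well as on accuracy.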


Now, let us summarize the variables that we may use to classify whether a candidate is a pulsar.

In [58]:
pulsar_data.loc[:, :"skewness_dm-snr_curve"].describe()

Unnamed: 0,mean_integrated_profile,standard_deviation_integrated_profile,excess_kurtosis_integrated_profile,skewness_integrated_profile,mean_dm-snr_curve,standard_deviation_dm-snr_curve,excess_kurtosis_dm-snr_curve,skewness_dm-snr_curve
count,17897.0,17897.0,17897.0,17897.0,17897.0,17897.0,17897.0,17897.0
mean,111.078321,46.549021,0.477897,1.770417,12.614926,26.326918,8.303574,104.859419
std,25.652705,6.84304,1.064056,6.168058,29.473637,19.471042,4.506217,106.51727
min,5.8125,24.772042,-1.876011,-1.791886,0.213211,7.370432,-3.13927,-1.976976
25%,100.929688,42.375426,0.027108,-0.188528,1.923077,14.43733,5.781485,34.957119
50%,115.078125,46.946435,0.223241,0.198736,2.801839,18.459977,8.433872,83.068996
75%,127.085938,51.022887,0.473349,0.928206,5.464883,28.428152,10.702973,139.310905
max,192.617188,98.778911,8.069522,68.101622,223.392141,110.642211,34.539844,1191.000837
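The summary shows that the variables sit on very different scales: `skewness_dm-snr_curve` reaches about 1191, while `excess_kurtosis_integrated_profile` stays below about 8.1. Distance-based methods such as K-nearest neighbors are sensitive to such differences, so the features would typically be standardized first. Below is a minimal sketch of z-score standardization with pandas, using five rows copied from the output shown earlier; the `features` and `standardized` names are illustrative.

```python
import pandas as pd

# Five rows of two variables, copied from the data displayed earlier.
features = pd.DataFrame({
    "excess_kurtosis_integrated_profile": [0.465318, 0.323328, -0.068415, 0.600866, 0.531905],
    "skewness_dm-snr_curve": [127.393580, 63.171909, 53.593661, 252.567306, 131.394004],
})

# Subtract each column's mean and divide by its (sample) standard deviation,
# so every column ends up with mean 0 and standard deviation 1.
standardized = (features - features.mean()) / features.std()

print(standardized.round(3))
```

In a modelling pipeline, the mean and standard deviation would be computed on the training split only and then applied unchanged to the test split.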


As an exploratory visualization, we plot the skewness of the integrated profile (y-axis) against the excess kurtosis of the integrated profile (x-axis), colouring each point by its label to see whether these two variables help separate pulsars from noise. Based on the graph, it appears that examples with low values of both variables are typically noise, while pulsars typically have larger values.

In [59]:
# Import the altair package to use in our visualization.
import altair as alt

# Plot a random sample of 2,000 rows (Altair's default limit is 5,000 rows),
# with points coloured by label.
pulsar_plot = alt.Chart(pulsar_data.sample(n=2000)).mark_circle(opacity=0.4).encode(
    x=alt.X("excess_kurtosis_integrated_profile").title("Excess kurtosis of the integrated profile"),
    y=alt.Y("skewness_integrated_profile").title("Skewness of the integrated profile"),
    color=alt.Color("label").title("label")
)
pulsar_plot


#### Methods

To expand on our analysis, we need to decide which variables to use. The variables fall into two groups, each consisting of the same four statistics (mean, standard deviation, excess kurtosis, and skewness): one group is computed from the integrated profile, the other from the DM-SNR curve. Further analysis will need to be done on both groups to determine which is better for classifying whether we have a pulsar or simply noise.

We will determine which group of variables is better by training a K-nearest neighbors classifier on each group independently. For each group, we will tune the number of neighbors K and then visualize and compare the accuracy, precision, and recall of the resulting classifiers.
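The plan above could be sketched roughly as follows. This is an illustration under several assumptions, not the final analysis: it uses scikit-learn (the proposal does not name a library), synthetic data from `make_classification` in place of the HTRU2 features, a hypothetical assignment of columns to the two groups, and accuracy as the criterion for choosing K.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the HTRU2 features: 8 columns, roughly 10% positives.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=6,
                           n_redundant=0, weights=[0.9, 0.1], random_state=0)

# Hypothetical assignment of columns to the two groups of variables.
feature_groups = {"integrated_profile": [0, 1, 2, 3], "dm_snr_curve": [4, 5, 6, 7]}

results = {}
for name, cols in feature_groups.items():
    X_group = X[:, cols]

    # Standardize the features, then classify with K-nearest neighbors.
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier())

    # Tune K with 5-fold cross-validation over odd values from 1 to 21.
    search = GridSearchCV(knn, {"kneighborsclassifier__n_neighbors": range(1, 22, 2)},
                          cv=5, scoring="accuracy")
    search.fit(X_group, y)

    # Estimate accuracy, precision, and recall for the chosen K.
    scores = cross_validate(search.best_estimator_, X_group, y, cv=5,
                            scoring=["accuracy", "precision", "recall"])
    results[name] = {
        "best_k": search.best_params_["kneighborsclassifier__n_neighbors"],
        "accuracy": scores["test_accuracy"].mean(),
        "precision": scores["test_precision"].mean(),
        "recall": scores["test_recall"].mean(),
    }
    print(name, results[name])
```

In the real analysis, a test set would be held out before tuning so that the final metrics are measured on data the classifier has never seen.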

#### Expected outcomes and Significance

- What will we find? Look at visualizations, data, etc.
- Impact of findings?
- Future questions? 