# **Group 7 Proposal: <br> Pulsar Star Dataset**

### **Introduction:**

Pulsar candidates collected during the High Time Resolution Universe Survey and stored in the HTRU2 dataset will be used to train a classifier that labels each candidate as pulsar or non-pulsar. Pulsar stars (a rare type of neutron star) produce characteristic emission patterns, and these patterns are what the classifier will use to tell the two classes apart. A machine learning tool of this kind is useful for interpreting large amounts of survey data efficiently, so that scientific study can focus on likely pulsar stars. To classify pulsar star data, variables that are strongly associated with the class can be used as predictors, which prompts the following question:

*“Can we use the skewness and excess kurtosis of the integrated profile to predict whether future star observations (with unknown class) are pulsar or non-pulsar stars?”*

### **Preliminary Exploratory Data Analysis**

In [1]:
import altair as alt
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector
from sklearn.model_selection import train_test_split
alt.data_transformers.disable_max_rows()
set_config(transform_output="pandas")

In [2]:
# Column names are supplied through `names`.
# "ip" = integrated profile, "curve" = DM-SNR curve.
feature_columns = [
    "mean_ip", "sd_ip", "excess_ip", "skewness_ip",
    "mean_curve", "sd_curve", "excess_curve", "skewness_curve",
]
unscaled_pulsar_star = pd.read_csv("HTRU_2.csv", names=feature_columns + ["class_label"])

# Recode the numeric class labels: 0 = non-pulsar, 1 = pulsar.
unscaled_pulsar_star["class_label"] = unscaled_pulsar_star["class_label"].replace({
    0: "Negative",
    1: "Positive",
})

# Standardize the eight predictors; class_label passes through unchanged.
preprocessor = make_column_transformer(
    (StandardScaler(), feature_columns),
    remainder="passthrough",
    verbose_feature_names_out=False,
)

# Note: the scaler is fit on the full dataset here, for exploration only.
# For modelling, it should be fit on the training split alone so that no
# information from the test set leaks into training.
preprocessor.fit(unscaled_pulsar_star)
scaled_pulsar_star = preprocessor.transform(unscaled_pulsar_star)
scaled_pulsar_star

| | mean_ip | sd_ip | excess_ip | skewness_ip | mean_curve | sd_curve | excess_curve | skewness_curve | class_label |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.149317 | 1.334832 | -0.669570 | -0.400459 | -0.319440 | -0.370625 | -0.072798 | -0.287438 | Negative |
| 1 | -0.334168 | 1.802265 | -0.011785 | -0.370535 | -0.371102 | -0.588924 | 0.504427 | 0.211581 | Negative |
| 2 | -0.314372 | -1.053322 | -0.145233 | -0.116593 | -0.322107 | -0.235328 | -0.125996 | -0.391373 | Negative |
| 3 | 1.000694 | 1.553254 | -0.513409 | -0.390178 | -0.304404 | -0.275666 | -0.312265 | -0.481300 | Negative |
| 4 | -0.871402 | -0.858879 | 0.115609 | -0.104866 | -0.388010 | -0.763111 | 1.324026 | 1.386794 | Negative |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 17893 | 0.988208 | 1.943284 | -0.625655 | -0.406697 | -0.384010 | -0.727295 | 1.586054 | 1.700034 | Negative |
| 17894 | 0.447319 | 0.429062 | -0.328831 | -0.234643 | 0.128776 | 0.939926 | -1.189159 | -0.906574 | Negative |
| 17895 | 0.321842 | 1.956220 | -0.299334 | -0.407492 | 0.299137 | 1.671568 | -1.288079 | -0.941330 | Negative |
| 17896 | 0.133628 | 1.074510 | -0.260050 | -0.291041 | -0.361967 | -0.664857 | 0.378257 | 0.275850 | Negative |


In [3]:
pulsar_train, pulsar_test = train_test_split(
    scaled_pulsar_star, train_size=0.75, stratify=scaled_pulsar_star["class_label"]
)
pulsar_train

| | mean_ip | sd_ip | excess_ip | skewness_ip | mean_curve | sd_curve | excess_curve | skewness_curve | class_label |
|---|---|---|---|---|---|---|---|---|---|
| 16511 | 0.958057 | 0.859309 | -0.265949 | -0.338247 | -0.044905 | 0.799052 | -1.030766 | -0.870321 | Negative |
| 10310 | 0.589241 | -0.165603 | -0.506667 | -0.301087 | -0.266587 | -0.008833 | -0.451822 | -0.588486 | Negative |
| 8729 | 0.621524 | 0.627443 | 0.039183 | -0.281884 | -0.274219 | -0.074872 | -0.199411 | -0.429759 | Negative |
| 610 | 0.550563 | 0.584495 | -0.567353 | -0.319506 | -0.346987 | -0.369710 | 0.153950 | -0.167917 | Negative |
| 10815 | -1.327625 | -0.298521 | 0.743174 | 0.135471 | -0.163065 | 0.539481 | -0.692869 | -0.735577 | Positive |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 17457 | -1.034948 | -1.146713 | 0.119498 | 0.013728 | -0.358903 | -0.392081 | 0.456608 | 0.073513 | Negative |
| 10993 | -0.395079 | -0.325361 | -0.074861 | -0.226443 | -0.303383 | -0.154444 | -0.240576 | -0.460495 | Negative |
| 12949 | 0.481124 | 0.633394 | -0.262348 | -0.360028 | -0.315525 | -0.092108 | -0.139876 | -0.424969 | Negative |
| 14776 | 0.164388 | 0.209308 | -0.078052 | -0.271845 | -0.256743 | -0.083228 | -0.502883 | -0.614155 | Negative |


In [4]:
pulsar_train["class_label"].value_counts()

Negative    12194
Positive     1229
Name: class_label, dtype: int64

In [5]:
pulsar_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13423 entries, 16511 to 2431
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   mean_ip         13423 non-null  float64
 1   sd_ip           13423 non-null  float64
 2   excess_ip       13423 non-null  float64
 3   skewness_ip     13423 non-null  float64
 4   mean_curve      13423 non-null  float64
 5   sd_curve        13423 non-null  float64
 6   excess_curve    13423 non-null  float64
 7   skewness_curve  13423 non-null  float64
 8   class_label     13423 non-null  object 
dtypes: float64(8), object(1)
memory usage: 1.0+ MB


In [6]:
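# Derive the counts from the training data instead of hard-coding them.
class_counts = pulsar_train["class_label"].value_counts()
summary_table = pd.DataFrame({
    "num_positive": [class_counts["Positive"]],
    "num_negative": [class_counts["Negative"]],
    "skewness_ip_mean": [pulsar_train["skewness_ip"].mean()],
    "excess_ip_mean": [pulsar_train["excess_ip"].mean()],
    "num_missing": [pulsar_train.isna().sum().sum()],
})
summary_table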

| | num_positive | num_negative | skewness_ip_mean | excess_ip_mean | num_missing |
|---|---|---|---|---|---|
| 0 | 1229 | 12194 | 0.008514 | 0.005533 | 0 |


In [7]:
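# Scatter plot of the two candidate predictors, colored by class.
graph = alt.Chart(pulsar_train).mark_point(filled=True, size=10).encode(
    x=alt.X("skewness_ip", title="Skewness of the integrated profile (standardized)"),
    y=alt.Y("excess_ip", title="Excess kurtosis of the integrated profile (standardized)"),
    color=alt.Color("class_label:N", title="Class"),
)
graph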

### **Methods**

Scatter plots are effective at displaying the relationship between two quantitative variables. Here, plotting the quantitative ‘skewness_ip’ and ‘excess_ip’ variables against each other, with points colored by class, shows whether the two classes separate well enough for these variables to classify new observations. To select the most effective predictors, scatter plots of several variable pairs were compared, and the pair displaying the strongest trend was chosen. For readability, the class_label values 0 and 1 were recoded as Negative (non-pulsar) and Positive (pulsar).
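
The imports in the first cell (`KNeighborsClassifier`, `GridSearchCV`) suggest that a K-nearest neighbors classifier tuned by cross-validation is the intended model. The following is a minimal sketch of that workflow under stated assumptions, not the final analysis: the data frame `demo` is synthetic stand-in data (not HTRU2), and the grid of neighbor counts and the `random_state` values are hypothetical choices. It places the scaler inside a pipeline so that standardization is learned from training data only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the two predictors (NOT the HTRU2 data): two
# clusters with about 9% positives, mimicking the class imbalance.
rng = np.random.default_rng(0)
n_neg, n_pos = 910, 90
demo = pd.DataFrame({
    "skewness_ip": np.concatenate([rng.normal(0, 1, n_neg), rng.normal(4, 1.5, n_pos)]),
    "excess_ip": np.concatenate([rng.normal(0, 1, n_neg), rng.normal(3, 1, n_pos)]),
    "class_label": ["Negative"] * n_neg + ["Positive"] * n_pos,
})

X = demo[["skewness_ip", "excess_ip"]]
y = demo["class_label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, stratify=y, random_state=123
)

# Scaling lives inside the pipeline, so every cross-validation fold fits the
# scaler on its own training portion and the test set is never seen.
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Hypothetical grid of neighbor counts; the range is an assumption.
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 30, 4))}
knn_search = GridSearchCV(knn_pipeline, param_grid, cv=5, scoring="accuracy")
knn_search.fit(X_train, y_train)

best_k = knn_search.best_params_["kneighborsclassifier__n_neighbors"]
test_accuracy = knn_search.score(X_test, y_test)
print(f"best k = {best_k}, test accuracy = {test_accuracy:.3f}")
```

With the real data, `demo` would be replaced by the unscaled HTRU2 data frame, and the two predictor columns would be selected in the same way.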

### **Expected Outcomes and Significance**

We expect to find that the majority of observations are non-pulsar stars and only a small portion are pulsar stars, as this is a rare star type; additionally, some variables are likely to classify pulsar stars more accurately than others. The impact of such findings is that pulsar stars can be classified more effectively. As mentioned on the Machine Learning Repository webpage, effective classification will make scientific work more efficient, aiding discovery.
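
The class counts in the training split (12194 Negative, 1229 Positive) make the expected imbalance concrete. As a short worked example, the arithmetic below computes the accuracy of a trivial classifier that always predicts Negative; any real classifier should be judged against this baseline rather than against 50%. The variable names are illustrative.

```python
# Class counts taken from the training split shown in the exploratory analysis.
num_negative = 12194
num_positive = 1229
total = num_negative + num_positive

# A classifier that always answers "Negative" is right for every negative.
baseline_accuracy = num_negative / total
positive_share = num_positive / total

print(f"total observations: {total}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
print(f"share of pulsars: {positive_share:.3f}")
```

Because roughly nine in ten training observations are non-pulsars, accuracy alone can look high even when few pulsars are found.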

**Future questions this could lead to include:**

Are there other traits of pulsar stars that could help increase accuracy of new observation classification?

How accurate are the pulsar star classification predictions? Should the criteria be changed?
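
One way the accuracy question could be approached is with a confusion matrix, which the first cell already imports. The sketch below uses a handful of toy labels invented purely for illustration (they are not model output) to show how accuracy, precision, and recall for the Positive class are read off the matrix.

```python
from sklearn.metrics import confusion_matrix

# Toy labels invented for illustration; they are not results from any model.
y_true = ["Negative"] * 6 + ["Positive"] * 4
y_pred = [
    "Negative", "Negative", "Negative", "Negative", "Negative", "Positive",
    "Positive", "Positive", "Positive", "Negative",
]

# Fixing the label order makes the layout [[TN, FP], [FN, TP]].
labels = ["Negative", "Positive"]
cm = confusion_matrix(y_true, y_pred, labels=labels)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted pulsars that are real
recall = tp / (tp + fn)           # share of real pulsars that were found

print(cm)
print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
```

For a rare class such as pulsars, recall and precision are often more informative than accuracy, since missing a pulsar and flagging a non-pulsar are different kinds of error.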