## Unveiling Pulsars: Machine Learning for Pulsar Identification

#### Introduction

#### Methods and Results

- *describe in written English the methods you used to perform your analysis from beginning to end, narrating the code that does the analysis.*
- your report should include code which:
    - *loads data from the original source on the web*
    - *wrangles and cleans the data from its original (downloaded) format to the format necessary for the planned analysis*
    - *performs a summary of the data set that is relevant for exploratory data analysis related to the planned analysis*
    - *creates a visualization of the data set that is relevant for exploratory data analysis related to the planned analysis*
    - *performs the data analysis*
    - *creates a visualization of the analysis*

*note: all tables and figures should have a figure/table number and a legend*

Import the packages needed for the analysis.

In [1]:
# might not need all these - feel free to edit
import random

import altair as alt
import pandas as pd
import numpy as np
from sklearn import set_config
from sklearn.compose import make_column_transformer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


# Simplify working with large datasets in Altair
alt.data_transformers.disable_max_rows()

# Output dataframes instead of arrays
set_config(transform_output="pandas")

First, we load the data set from the web using `read_csv` and name it `pulsar_data`.

In [2]:
# Import dataset
pulsar_data = pd.read_csv("https://github.com/Tikio88/Group22_Project/blob/main/data/HTRU_2.csv?raw=true")
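
One detail worth checking at this step: by default `pd.read_csv` treats the first line of a file as the header. If the raw `HTRU_2.csv` file has no header row (an assumption that should be verified against the file itself), the first observation would be consumed as column names. The sketch below illustrates the difference on a two-line in-memory stand-in; the values in `sample_csv` are made up for illustration and are not taken from the data set.

```python
import io

import pandas as pd

# Hypothetical stand-in for a headerless CSV: 8 feature columns plus a 0/1 label.
sample_csv = (
    "1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,0\n"
    "9.0,8.0,7.0,6.0,5.0,4.0,3.0,2.0,1\n"
)

column_names = [
    "mean_integrated_profile",
    "standard_deviation_integrated_profile",
    "excess_kurtosis_integrated_profile",
    "skewness_integrated_profile",
    "mean_dm-snr_curve",
    "standard_deviation_dm-snr_curve",
    "excess_kurtosis_dm-snr_curve",
    "skewness_dm-snr_curve",
    "label",
]

# Default behaviour: the first data row is consumed as the header.
with_default_header = pd.read_csv(io.StringIO(sample_csv))

# header=None keeps every row as data; names= assigns the labels in one step.
without_header = pd.read_csv(
    io.StringIO(sample_csv), header=None, names=column_names
)

print(len(with_default_header))
print(len(without_header))
```

Comparing the two row counts on the real file would show whether an observation is being lost at load time.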

Then, we add column labels so the data set is easier to understand. We also edit the `label` column using the `replace` method, mapping the value `0` to `'Noise'` and `1` to `'Pulsar'`.

In [3]:
# Add column names
pulsar_data.columns = ["mean_integrated_profile", 
                       "standard_deviation_integrated_profile", 
                       "excess_kurtosis_integrated_profile", 
                       "skewness_integrated_profile", 
                       "mean_dm-snr_curve", 
                       "standard_deviation_dm-snr_curve",
                       "excess_kurtosis_dm-snr_curve",
                       "skewness_dm-snr_curve",
                       "label"]

# make the "label" column more readable for analysis
pulsar_data["label"] = pulsar_data["label"].replace({0: "Noise", 1:"Pulsar"})

pulsar_data

| | mean_integrated_profile | standard_deviation_integrated_profile | excess_kurtosis_integrated_profile | skewness_integrated_profile | mean_dm-snr_curve | standard_deviation_dm-snr_curve | excess_kurtosis_dm-snr_curve | skewness_dm-snr_curve | label |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 102.507812 | 58.882430 | 0.465318 | -0.515088 | 1.677258 | 14.860146 | 10.576487 | 127.393580 | Noise |
| 1 | 103.015625 | 39.341649 | 0.323328 | 1.051164 | 3.121237 | 21.744669 | 7.735822 | 63.171909 | Noise |
| 2 | 136.750000 | 57.178449 | -0.068415 | -0.636238 | 3.642977 | 20.959280 | 6.896499 | 53.593661 | Noise |
| 3 | 88.726562 | 40.672225 | 0.600866 | 1.123492 | 1.178930 | 11.468720 | 14.269573 | 252.567306 | Noise |
| 4 | 93.570312 | 46.698114 | 0.531905 | 0.416721 | 1.636288 | 14.545074 | 10.621748 | 131.394004 | Noise |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 17892 | 136.429688 | 59.847421 | -0.187846 | -0.738123 | 1.296823 | 12.166062 | 15.450260 | 285.931022 | Noise |
| 17893 | 122.554688 | 49.485605 | 0.127978 | 0.323061 | 16.409699 | 44.626893 | 2.945244 | 8.297092 | Noise |
| 17894 | 119.335938 | 59.935939 | 0.159363 | -0.743025 | 21.430602 | 58.872000 | 2.499517 | 4.595173 | Noise |
| 17895 | 114.507812 | 53.902400 | 0.201161 | -0.024789 | 1.946488 | 13.381731 | 10.007967 | 134.238910 | Noise |
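
As a small illustration of the relabelling step and of a quick class-balance check for exploratory analysis, here is a minimal sketch on a toy label column. The frame `toy` and its values are hypothetical and only mirror the 0/1 encoding described above.

```python
import pandas as pd

# Toy label column (hypothetical values) using the same 0/1 encoding.
toy = pd.DataFrame({"label": [0, 0, 0, 1]})

# Map the integer codes to readable class names.
toy["label"] = toy["label"].replace({0: "Noise", 1: "Pulsar"})

# Proportion of each class; a strong skew suggests accuracy alone may mislead.
class_share = toy["label"].value_counts(normalize=True)
print(class_share)
```

Running the same `value_counts(normalize=True)` call on `pulsar_data["label"]` would give the class proportions for the real data.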


#### Discussion

- *summarize what you found*
- *discuss whether this is what you expected to find?*
- *discuss what impact could such findings have?*
- *discuss what future questions could this lead to?*

#### References

- *two references used in discussion*
- *reference of data set from web*

In [25]:
from sklearn.utils import resample

## columns to be used for integrated-profile training
integrated_columns = [
    "mean_integrated_profile",
    "standard_deviation_integrated_profile",
    "excess_kurtosis_integrated_profile",
    "skewness_integrated_profile",
]

## columns to be used for DM-SNR curve training
snr_columns = [
    "mean_dm-snr_curve",
    "standard_deviation_dm-snr_curve",
    "excess_kurtosis_dm-snr_curve",
    "skewness_dm-snr_curve",
]




## split by class, then upsample the minority (pulsar) class with replacement
## until it has as many rows as the noise class
noise = pulsar_data[pulsar_data["label"] == "Noise"]
pulsar = pulsar_data[pulsar_data["label"] == "Pulsar"]
pulsar_upsample = resample(
    pulsar, n_samples=noise.shape[0]
)
pulsar_data = pd.concat((pulsar_upsample, noise))
pulsar_data['label'].value_counts()

label
Pulsar    16258
Noise     16258
Name: count, dtype: int64
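
Because `resample` draws with replacement, copies of the same pulsar row can land in both the training and validation folds when balancing happens before any split, which tends to inflate accuracy estimates. One possible alternative, sketched below under stated assumptions, is to hold out a test set first and upsample only the training portion. The toy frame, the `random_state` values, and the variable names are all hypothetical and chosen for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy imbalanced data (hypothetical): 90 "Noise" rows and 10 "Pulsar" rows.
toy = pd.DataFrame({
    "feature": range(100),
    "label": ["Noise"] * 90 + ["Pulsar"] * 10,
})

# Split first so the held-out rows never take part in upsampling.
train, test = train_test_split(
    toy, test_size=0.2, stratify=toy["label"], random_state=123
)

noise_train = train[train["label"] == "Noise"]
pulsar_train = train[train["label"] == "Pulsar"]

# Draw pulsar rows with replacement until the classes match;
# fixing random_state makes the draw repeatable.
pulsar_upsampled = resample(
    pulsar_train, replace=True, n_samples=len(noise_train), random_state=123
)
balanced_train = pd.concat([pulsar_upsampled, noise_train])

print(balanced_train["label"].value_counts())
print(test["label"].value_counts())
```

The test set keeps its original class imbalance, so scores computed on it reflect the class mix a classifier would see in practice.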

In [29]:
from sklearn.model_selection import GridSearchCV




## integrated preprocessor
polestar_preprocessor_integrated = make_column_transformer(
    (StandardScaler(), integrated_columns),)

## snr preprocessor
polestar_preprocessor_snr = make_column_transformer(
    (StandardScaler(), snr_columns),)


## kNeighborsClassifier we will use in the future in gridSearchCV
knn = KNeighborsClassifier()

## making the integrated pipeline with integrated preprocessor and the standard knn
polestar_integrated_pipe = make_pipeline(polestar_preprocessor_integrated, knn)

## making the snr pipeline with snr preprocessor and the standard knn
polestar_snr_pipe = make_pipeline(polestar_preprocessor_snr, knn)


## these are the values of n_neighbors we will try: 1 through 11 in steps of 1.
parameter_grid = {
    "kneighborsclassifier__n_neighbors": range(1, 12, 1),
}


## setting up the grid search for integrated, with 10-fold cross-validation.
integrated_tune_grid = GridSearchCV(
    estimator=polestar_integrated_pipe,
    param_grid=parameter_grid,
    cv=10
)


## setting up the grid search for snr, with 10-fold cross-validation.
snr_tune_grid = GridSearchCV(
    estimator=polestar_snr_pipe,
    param_grid=parameter_grid,
    cv=10
)

## performing the grid search and fit for integrated
accuracies_grid_integrated = pd.DataFrame(
    integrated_tune_grid.fit(
        pulsar_data[integrated_columns],
        pulsar_data["label"]
    ).cv_results_
)

## performing the grid search and fit for snr
accuracies_grid_snr = pd.DataFrame(
    snr_tune_grid.fit(
        pulsar_data[snr_columns],
        pulsar_data["label"]
    ).cv_results_
)



## plot the cross-validation accuracy estimate against the number of neighbors
## for both feature sets: integrated profile (default color) layered with
## the DM-SNR curve (colored #ab6c00)
accuracy_vs_k_integrated = alt.Chart(accuracies_grid_integrated).mark_line(point=True).encode(
    x=alt.X("param_kneighborsclassifier__n_neighbors").title("Neighbors"),
    y=alt.Y("mean_test_score")
        .scale(domain=(0.9, 1))
        .title("Accuracy estimate")
) + alt.Chart(accuracies_grid_snr).mark_line(point=True).encode(
    x=alt.X("param_kneighborsclassifier__n_neighbors").title("Neighbors"),
    y=alt.Y("mean_test_score")
        .scale(domain=(0.9, 1))
        .title("Accuracy estimate"),
    color=alt.value('#ab6c00')
)
accuracy_vs_k_integrated
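
Besides reading the best number of neighbors off the plot, a fitted `GridSearchCV` object exposes it directly through `best_params_` and `best_score_`. The sketch below shows this on synthetic data from `make_classification`, not on the pulsar data; the sample size, the 5-fold setting, and the variable names are assumptions made to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the pulsar data set).
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# make_pipeline names each step after its lowercased class name,
# which is where the "kneighborsclassifier__" prefix comes from.
grid = GridSearchCV(
    pipe,
    param_grid={"kneighborsclassifier__n_neighbors": range(1, 12)},
    cv=5,
)
grid.fit(X, y)

best_k = grid.best_params_["kneighborsclassifier__n_neighbors"]
print(best_k, grid.best_score_)
```

Applied to the notebook's objects, the same attributes on `integrated_tune_grid` and `snr_tune_grid` would give the selected value of k and its cross-validation accuracy for each feature set.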