# The Classification of Pulsar Stars - Project Report
**By Oliver Gullery, Chan Le, Simon Lin, and Adam Parolin**

### Introduction

Pulsars are a rare type of neutron star that produces detectable radio emissions. As a pulsar rotates, it emits beams of electromagnetic radiation, which can be detected when they sweep across Earth.

These beams take the form of radio waves: electromagnetic waves oscillating at frequencies that radio telescopes can detect.

Using scientific equipment, we can scan for radio waves and discover new pulsar stars. However, some positive detections are caused by radio frequency interference, which makes real detections difficult to find. The main objective of our data analysis is to determine whether a given detection came from a real pulsar star or from radio frequency interference.

<img src="https://media.giphy.com/media/l3dj5M4YLaFww31V6/giphy.gif" width = "600"/>

Source: https://media.giphy.com/media/l3dj5M4YLaFww31V6/giphy.gif

This leads into our question: 
__Using pulsar star candidate data recorded by scientific equipment, is a given candidate a true pulsar star or just radio frequency interference?__

Each observation in the data set (the <a href="https://archive.ics.uci.edu/ml/datasets/HTRU2">HTRU2 Data Set</a> by Rob Lyon) is a candidate, with 8 continuous variables:<br />
1. `mean_of_int_profiles` <br/>
2. `sd_of_int_profiles`<br />
3. `excess_kurtosis_of_int_profiles`<br />
4. `skewness_of_int_profiles`<br />
5. `mean_of_curve`<br />
6. `sd_of_curve`<br />
7. `excess_kurtosis_of_curve`<br />
8. `skewness_of_curve`<br />

... and one class variable:<br />
1. `true_pulsar`



### Method and Results

In [6]:
# Importing required libraries

library(tidyverse)
library(repr)
library(tidymodels)

We can download the dataset (https://archive.ics.uci.edu/ml/datasets/HTRU2) and import it into JupyterHub.

In [7]:
# Downloading from url
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00372/HTRU2.zip"
dir.create("data")
# download.file() is called for its side effect, so its return code is not stored
download.file(url, "data/HTRU2.zip")

# Because the data we need is in a .zip file, we use the unzip() function in order to access "HTRU_2.csv"
star_data <- read_csv(unzip("data/HTRU2.zip", files = "HTRU_2.csv", exdir = "data/"), 
            col_names = c("mean_of_int_profiles", "sd_of_int_profiles", "excess_kurtosis_of_int_profiles",
            "skewness_of_int_profiles", "mean_of_curve", "sd_of_curve", 
            "excess_kurtosis_of_curve", "skewness_of_curve", "true_pulsar")) |>
    mutate(true_pulsar = as_factor(true_pulsar))

# Below is a snapshot of the star data we will be working with
slice(star_data, 1:10)

“'data' already exists”
Rows: 17898 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (9): mean_of_int_profiles, sd_of_int_profiles, excess_kurtosis_of_int_pr...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| mean_of_int_profiles | sd_of_int_profiles | excess_kurtosis_of_int_profiles | skewness_of_int_profiles | mean_of_curve | sd_of_curve | excess_kurtosis_of_curve | skewness_of_curve | true_pulsar |
|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` |
| 140.5625 | 55.68378 | -0.23457141 | -0.6996484 | 3.1998328 | 19.110426 | 7.975532 | 74.24222 | 0 |
| 102.50781 | 58.88243 | 0.46531815 | -0.5150879 | 1.6772575 | 14.860146 | 10.576487 | 127.39358 | 0 |
| 103.01562 | 39.34165 | 0.32332837 | 1.0511644 | 3.1212375 | 21.744669 | 7.735822 | 63.17191 | 0 |
| 136.75 | 57.17845 | -0.06841464 | -0.6362384 | 3.6429766 | 20.95928 | 6.896499 | 53.59366 | 0 |
| 88.72656 | 40.67223 | 0.60086608 | 1.1234917 | 1.1789298 | 11.46872 | 14.269573 | 252.56731 | 0 |
| 93.57031 | 46.69811 | 0.53190485 | 0.4167211 | 1.6362876 | 14.545074 | 10.621748 | 131.394 | 0 |
| 119.48438 | 48.76506 | 0.03146022 | -0.1121676 | 0.9991639 | 9.279612 | 19.20623 | 479.75657 | 0 |
| 130.38281 | 39.84406 | -0.15832276 | 0.3895404 | 1.2207358 | 14.378941 | 13.539456 | 198.23646 | 0 |
| 107.25 | 52.62708 | 0.45268802 | 0.1703474 | 2.3319398 | 14.486853 | 9.001004 | 107.97251 | 0 |
| 107.25781 | 39.49649 | 0.46588196 | 1.1628771 | 4.0794314 | 24.980418 | 7.39708 | 57.78474 | 0 |


The first step is to split our data into a training and a testing set:

In [10]:
# Setting the seed makes the random split reproducible (DO NOT REMOVE)
set.seed(9999)

# Splitting data into training and testing, with true_pulsar as the strata
pulsar_split <- initial_split(star_data, prop = 0.75, strata = true_pulsar)  
pulsar_train <- training(pulsar_split)   
pulsar_test <- testing(pulsar_split)
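
One way to sanity-check a stratified split is to compare the class proportions in the two resulting sets. The sketch below does this on a small synthetic data set so that it runs on its own; `toy_data`, its column `x`, and the 80/20 class mix are hypothetical and are not taken from HTRU2. The same `count()` and `mutate()` steps could be applied to `pulsar_train` and `pulsar_test`.

```r
library(tidymodels)

set.seed(9999)

# Hypothetical toy data: 1000 candidates, 200 of them labelled as pulsars (1)
toy_data <- tibble(
    x = rnorm(1000),
    true_pulsar = factor(rep(c(0, 1), times = c(800, 200)))
)

# Same call pattern as the split above: 75% training, stratified by class
toy_split <- initial_split(toy_data, prop = 0.75, strata = true_pulsar)

# Proportion of each class in the training set
train_props <- training(toy_split) |>
    count(true_pulsar) |>
    mutate(prop = n / sum(n))

# Proportion of each class in the testing set
test_props <- testing(toy_split) |>
    count(true_pulsar) |>
    mutate(prop = n / sum(n))

train_props
test_props
```

If the split is stratified as intended, the `prop` column should be close to the same in both tables.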

#### Preliminary Exploratory Data Analysis

Before we begin classification, we will first take a closer look at our training data.

In [11]:
# Summarizing the training data into a table counting non-pulsar (0) and true pulsar (1) candidates
pulsar_frequency <- pulsar_train |>
    group_by(true_pulsar) |>
    summarize(number = n())
pulsar_frequency

| true_pulsar | number |
|---|---|
| `<fct>` | `<int>` |
| 0 | 12207 |
| 1 | 1216 |
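
The counts above show a strong class imbalance. As a minimal sketch, each class's share of the training set can be computed directly from those counts. The counts are retyped by hand here so the snippet runs on its own, and the names `class_counts`, `class_share`, and `percent` are illustrative only.

```r
library(tidyverse)

# Class counts copied from the training-set table above
class_counts <- tibble(
    true_pulsar = factor(c(0, 1)),
    number = c(12207, 1216)
)

# Add each class's share of the training set as a percentage
class_share <- class_counts |>
    mutate(percent = 100 * number / sum(number))

class_share
```

With these counts, true pulsars make up roughly 9% of the training set.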


In [12]:
# Creating another table that shows the per-class averages of our intended predictor variables
pulsar_predictors <- pulsar_train |>
    group_by(true_pulsar) |>
    summarize(avg_mean_of_int_profiles = mean(mean_of_int_profiles),
              avg_mean_of_curve = mean(mean_of_curve))

pulsar_predictors

| true_pulsar | avg_mean_of_int_profiles | avg_mean_of_curve |
|---|---|---|
| `<fct>` | `<dbl>` | `<dbl>` |
| 0 | 116.68182 | 8.886237 |
| 1 | 56.47867 | 50.112724 |


### Discussion

### References