In [1]:
# Load the necessary libraries (tidyverse already attaches dplyr):
library(tidyverse)
library(tidymodels)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.0     ✔ rsample      1.0.0
✔ dials        1.0.0     ✔ tune         1.0.0
✔ infer        1.0.2     ✔ workflows    1.0.0

# Introduction

Pulsars emit two beams of light in opposite directions. Although each beam is steady, a pulsar appears to flicker because it rotates. As the pulsar rotates, a beam may sweep across the Earth, swinging in and out of view, so that the pulsar seems to blink to an astronomer.

Because pulsars allow scientists to study extreme states of matter and cosmic events, a system that predicts whether a particular space object is a pulsar would be valuable. Our question is therefore: can we use the information in the data set to build a model that predicts whether a candidate is a pulsar?

The data set contains nine columns: eight continuous variables and one class variable. The first four variables are statistics (mean, standard deviation, excess kurtosis, and skewness) of a candidate's integrated pulse profile, whereas the latter four are the same statistics of the DM-SNR (dispersion measure versus signal-to-noise ratio) curve.



In [2]:
# Reading and tidying the dataset.
pulsar_data <- read_csv("data/HTRU_2.csv", col_names = FALSE) |> # Read the CSV file
    # Add column names:
    rename(mean_intp = X1, 
           std_dev_intp = X2, 
           xs_kurtosis_intp = X3, 
           skewness_intp = X4, 
           mean_dmsnr = X5, 
           std_dev_dmsnr = X6, 
           xs_kurtosis_dmsnr = X7, 
           skewness_dmsnr = X8, 
           class = X9) |>
    tibble::rowid_to_column("id") |> # Add an id for each individual pulsar star candidate.
    mutate(class = as_factor(class), id = as_factor(id)) |> # Convert class (dbl) and id (int) to factors, as both are categorical variables.
    select(id, class, everything()) # Move id and class to the front of the table for organization purposes.
   
# Tidy the data by making intp and dmsnr categorical observations of the variable "type":
pulsar_data_mean <- pulsar_data |>
    pivot_longer(starts_with("mean"), names_to = "type", values_to = "mean") |>
    # Recode mean_intp and mean_dmsnr as integrated_profile and dmsnr_curve, the categories of "type".
    mutate(type = as_factor(case_when(endsWith(type, "dmsnr") ~ "dmsnr_curve",
                                      endsWith(type, "intp") ~ "integrated_profile"))) |>
    select(id, type, class, mean)

pulsar_data_std_dev <- pulsar_data |>
    pivot_longer(starts_with("std_dev"), names_to = "type", values_to = "std_dev") |>
    select(std_dev) # Keep only the value column to avoid duplicate id, type, and class columns in cbind.

pulsar_data_xs_kurtosis <- pulsar_data |>
    pivot_longer(starts_with("xs_kurtosis"), names_to = "type", values_to = "xs_kurtosis") |>
    select(xs_kurtosis) # Keep only the value column to avoid duplicate columns in cbind.

pulsar_data_skewness <- pulsar_data |>
    pivot_longer(starts_with("skewness"), names_to = "type", values_to = "skewness") |>
    select(skewness) # Keep only the value column to avoid duplicate columns in cbind.


# Combine all pivoted data into one dataframe. This relies on every pivot producing
# rows in the same order (intp first, then dmsnr, for each id).
pulsar_data <- cbind(pulsar_data_mean, pulsar_data_std_dev, pulsar_data_xs_kurtosis, pulsar_data_skewness) |>
    rename(excess_kurtosis = xs_kurtosis, standard_deviation = std_dev) # Expand abbreviations for clarity.

options(repr.matrix.max.rows = 10) # Shows a maximum of 10 rows to reduce clutter when calling the dataset.
pulsar_data # display the dataframe as a table

Rows: 17898 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (9): X1, X2, X3, X4, X5, X6, X7, X8, X9

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| id `<fct>` | type `<fct>` | class `<fct>` | mean `<dbl>` | standard_deviation `<dbl>` | excess_kurtosis `<dbl>` | skewness `<dbl>` |
|---|---|---|---|---|---|---|
| 1 | integrated_profile | 0 | 140.562500 | 55.68378 | -0.2345714 | -0.6996484 |
| 1 | dmsnr_curve | 0 | 3.199833 | 19.11043 | 7.9755318 | 74.2422249 |
| 2 | integrated_profile | 0 | 102.507812 | 58.88243 | 0.4653182 | -0.5150879 |
| 2 | dmsnr_curve | 0 | 1.677258 | 14.86015 | 10.5764867 | 127.3935796 |
| 3 | integrated_profile | 0 | 103.015625 | 39.34165 | 0.3233284 | 1.0511644 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 17896 | dmsnr_curve | 0 | 21.430602 | 58.87200 | 2.4995171 | 4.59517265 |
| 17897 | integrated_profile | 0 | 114.507812 | 53.90240 | 0.2011614 | -0.02478884 |
| 17897 | dmsnr_curve | 0 | 1.946488 | 13.38173 | 10.0079673 | 134.23890950 |
| 17898 | integrated_profile | 0 | 57.062500 | 85.79734 | 1.4063910 | 0.08951971 |
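
The reshaping in the cell above pivots each statistic separately and then relies on all four results sharing the same row order. As a minimal sketch of an alternative, assuming the column names keep the `<statistic>_<intp|dmsnr>` pattern used in this notebook, a single `pivot_longer()` call with the `.value` sentinel could reshape all four statistics at once. The two-row `pulsar_wide` tibble is a hypothetical stand-in built from the first two candidates shown in the table; the names `pulsar_wide` and `pulsar_long` are illustrative only.

```r
library(tidyverse)

# Hypothetical stand-in for the wide table (values of candidates 1 and 2 from the output above).
pulsar_wide <- tibble(
  id = as_factor(c("1", "2")),
  class = as_factor(c("0", "0")),
  mean_intp = c(140.562500, 102.507812),
  std_dev_intp = c(55.68378, 58.88243),
  xs_kurtosis_intp = c(-0.2345714, 0.4653182),
  skewness_intp = c(-0.6996484, -0.5150879),
  mean_dmsnr = c(3.199833, 1.677258),
  std_dev_dmsnr = c(19.11043, 14.86015),
  xs_kurtosis_dmsnr = c(7.9755318, 10.5764867),
  skewness_dmsnr = c(74.2422249, 127.3935796)
)

pulsar_long <- pulsar_wide |>
  # ".value" turns the first capture group into column names;
  # the second capture group becomes the "type" column.
  pivot_longer(
    cols = -c(id, class),
    names_to = c(".value", "type"),
    names_pattern = "(.+)_(intp|dmsnr)$"
  ) |>
  # Recode the suffixes to the category labels used in this notebook.
  mutate(type = as_factor(case_when(type == "intp" ~ "integrated_profile",
                                    type == "dmsnr" ~ "dmsnr_curve"))) |>
  rename(standard_deviation = std_dev, excess_kurtosis = xs_kurtosis) |>
  select(id, type, class, mean, standard_deviation, excess_kurtosis, skewness)

pulsar_long
```

Because every statistic is reshaped in the same call, each row keeps its own `id`, `type`, and `class`, so no column binding by position is needed.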


We expect to be able to predict whether or not a candidate is a pulsar. To do so, we will use all the other variables as predictors in the recipe.

With this model, we could predict whether newly discovered stars are pulsars, provided that we collect the same variables the model was trained on. To test this in future cases, we could take a newly discovered star and run it through our prediction model.
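
The plan above (a recipe that uses all other variables, followed by a prediction for a new candidate) could be sketched as follows. This is only an illustration under stated assumptions: the text does not name a model type, so logistic regression with the `glm` engine is assumed here, and `toy` is simulated stand-in data with two predictors, not the pulsar data set.

```r
library(tidymodels)

set.seed(1)

# Simulated stand-in data (NOT values from the pulsar data set).
toy <- tibble(
  class = factor(rep(c(0, 1), each = 100)),
  excess_kurtosis = c(rnorm(100, mean = 0), rnorm(100, mean = 2)),
  skewness = c(rnorm(100, mean = 0), rnorm(100, mean = 2))
)

# Hold out a quarter of the rows for testing, stratified by class.
toy_split <- initial_split(toy, prop = 0.75, strata = class)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# "class ~ ." takes all the other variables into the recipe as predictors.
toy_recipe <- recipe(class ~ ., data = toy_train) |>
  step_normalize(all_numeric_predictors())

# Assumed model: logistic regression (the source does not specify one).
toy_model <- logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification")

toy_fit <- workflow() |>
  add_recipe(toy_recipe) |>
  add_model(toy_model) |>
  fit(data = toy_train)

# Predictions on the held-out rows, next to the true classes.
toy_predictions <- predict(toy_fit, new_data = toy_test) |>
  bind_cols(toy_test)

# A hypothetical newly observed candidate with the same predictor columns.
new_candidate <- tibble(excess_kurtosis = 2.5, skewness = 3.0)
new_prediction <- predict(toy_fit, new_data = new_candidate)
new_prediction
```

A new candidate can only be scored if it supplies every predictor column the recipe was trained on, which is why `new_candidate` mirrors the predictor names in `toy`.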

This could lead to further questions such as:

- Beyond which minimum or maximum values of the DM-SNR curve statistics is a candidate no longer considered a pulsar?
- What is the average skewness of pulsars, and of non-pulsars?
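
The question about average skewness could be explored with a grouped summary. The sketch below assumes a table shaped like the tidied one in this notebook (columns `class`, `type`, and `skewness`); `toy_pulsar_data` holds made-up values for illustration and is not taken from the data set.

```r
library(tidyverse)

# Made-up stand-in rows with the same columns as the tidied table.
toy_pulsar_data <- tibble(
  class = as_factor(c("0", "0", "0", "0", "1", "1", "1", "1")),
  type = as_factor(rep(c("integrated_profile", "dmsnr_curve"), times = 4)),
  skewness = c(-1, 70, 1, 130, 10, 2, 20, 4)
)

# Average skewness for each class, separately for each curve type.
skewness_by_class <- toy_pulsar_data |>
  group_by(class, type) |>
  summarise(mean_skewness = mean(skewness), .groups = "drop")

skewness_by_class
```

Grouping by `type` as well as `class` keeps the integrated-profile and DM-SNR skewness values apart, since averaging the two together would mix different quantities.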