# <font size="6"> Predict Whether A Candidate is A Real Pulsar

# <font size="5"> 1. Introduction

**Pulsars** are a rare type of neutron star. They serve as probes of space-time, the interstellar medium and states of matter, which makes them important for the study and development of the natural sciences. As a pulsar rotates at high speed, it produces a periodic pattern of broadband radio emission, so the search for pulsars relies mainly on detecting these periodic signals (averaged over multiple rotations). In practice, however, almost all detections are caused by radio frequency interference (RFI) and noise, which makes legitimate signals difficult to find. Hence, the search for **real** pulsars is challenging.

**Our question is: can we determine whether a candidate is a real pulsar?**
HTRU2 is a dataset describing a sample of pulsar candidates collected during the High Time Resolution Universe Survey (South). It contains 16,259 spurious examples caused by RFI/noise and 1,639 real pulsar examples, 17,898 candidates in total. Each candidate is described by eight continuous variables. The first four are statistics obtained from the integrated pulse profile, an array of continuous values describing a longitude-resolved version of the signal that has been averaged in both time and frequency. The remaining four are the same statistics obtained from the DM-SNR curve. The variables, together with the class label, are listed below:
* Mean of the integrated profile.
* Standard deviation of the integrated profile.
* Excess kurtosis of the integrated profile.
* Skewness of the integrated profile.
* Mean of the DM-SNR curve.
* Standard deviation of the DM-SNR curve.
* Excess kurtosis of the DM-SNR curve.
* Skewness of the DM-SNR curve.
* Class: 1 means a real pulsar and 0 means a spurious candidate.
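
To make the four statistics concrete, here is a minimal sketch in base R that computes them for a numeric vector. It assumes the usual moment-based definitions; the exact estimators used to build HTRU2 are not stated here, and `profile_stats` and `toy_profile` are hypothetical names introduced only for illustration.

```r
# Hypothetical helper: summarise a numeric vector with the four statistics
# used as features (mean, standard deviation, excess kurtosis, skewness).
profile_stats <- function(x) {
  n <- length(x)
  m <- mean(x)
  centred <- x - m
  m2 <- sum(centred^2) / n  # second central moment
  m3 <- sum(centred^3) / n  # third central moment
  m4 <- sum(centred^4) / n  # fourth central moment
  c(mean = m,
    sd = sd(x),                       # sample standard deviation (n - 1)
    excess_kurtosis = m4 / m2^2 - 3,  # 0 for a normal distribution
    skewness = m3 / m2^(3 / 2))       # 0 for a symmetric distribution
}

# Toy "profile": a flat baseline with one narrow peak, not real survey data.
toy_profile <- c(rep(0, 20), 1, 4, 9, 4, 1, rep(0, 20))
stats <- profile_stats(toy_profile)
print(round(stats, 3))
```

A vector dominated by a flat baseline with a single sharp peak, like the toy one above, gives positive skewness and positive excess kurtosis, because a few values lie far above the mean.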

The following explanation is intended to give you a better understanding of each variable.

# <font size="5"> 2. Methods and Results

First of all, we need to load all of the R libraries we are going to use in this project.

In [None]:
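# Core packages for data wrangling and modelling
library(tidyverse)
library(tidymodels)
# Install GGally only if it is not already available, then load it
if (!requireNamespace("GGally", quietly = TRUE)) install.packages("GGally")
library(GGally)
library(rvest)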

Set the seed only once, before loading the data, to guarantee that our analysis is reproducible.

In [None]:
set.seed(1)
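
As a small stand-alone illustration of why the seed matters, the sketch below shows that the same seed yields the same pseudo-random draws. The seed is reset twice here only to demonstrate the effect; in the analysis itself it is set once. The variable names are hypothetical.

```r
# With the same seed, the same sequence of pseudo-random draws is produced.
set.seed(1)
first_draw <- sample(1:100, 5)

set.seed(1)
second_draw <- sample(1:100, 5)

# Compare the two draws element by element.
print(identical(first_draw, second_draw))
```

Any step that relies on randomness, such as shuffling rows before a split, becomes repeatable in the same way.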

Load the data from the web using `read_table2`.
Add column names manually to wrangle the data into tidy format, and convert the response variable `Class` to a factor for classification.

In [None]:
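# Read the whitespace-separated file; it has no header row, so columns arrive as X1..X9.
# Note: read_table2() is deprecated as of readr 2.0.0 in favour of read_table().
pulsar_raw_data <- read_table2("https://raw.githubusercontent.com/JeanetteOfficial/ActiveDSCI100_Group_006-1_proj/main/data/HTRU_2.txt",
                               col_names = FALSE)

# Give each column a descriptive name (IP = integrated profile, DS = DM-SNR curve).
pulsar_raw_data <- rename(pulsar_raw_data,
                          Mean_IP = X1,
                          SD_IP = X2,
                          ExcessKurtosis_IP = X3,
                          Skewness_IP = X4,
                          Mean_DS = X5,
                          SD_DS = X6,
                          ExcessKurtosis_DS = X7,
                          Skewness_DS = X8,
                          Class = X9)

# Treat the class label as a categorical variable for classification.
pulsar_raw_data <- mutate(pulsar_raw_data, Class = as_factor(Class))
head(pulsar_raw_data)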

In [None]:
glimpse(pulsar_raw_data)

Split the raw data into training and testing sets using `initial_split`, so that we can train the model on one set and evaluate it fairly on the other. We use a `75%`/`25%` split because we considered the trade-off between a larger training set making the model more accurate and a larger testing set making our evaluation more accurate. After experimenting with different splits, we concluded that the 75-25 split was the most appropriate. In addition, the `initial_split` function shuffles and stratifies the data, which prevents the row order from influencing the outcome and ensures that the proportion of each class is preserved between the training and testing sets.

In [None]:
# Stratified 75/25 split so both sets keep similar class proportions
pulsar_split <- initial_split(pulsar_raw_data, prop = 0.75, strata = Class)
pulsar_train <- training(pulsar_split)
pulsar_test <- testing(pulsar_split)
glimpse(pulsar_train)
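
To illustrate what stratification achieves, here is a minimal base R sketch of a stratified 75/25 split on synthetic labels with roughly the class imbalance of HTRU2. It is a sketch of the idea only and is not claimed to be how `initial_split` is implemented; `toy`, `toy_train` and `toy_test` are hypothetical names.

```r
set.seed(1)  # stand-alone illustration, separate from the analysis

# Synthetic labels with roughly the class imbalance of HTRU2 (about 9% positives).
toy <- data.frame(id = 1:1000,
                  Class = factor(c(rep(0, 910), rep(1, 90))))

# Stratified sampling: draw 75% of the row indices within each class separately.
rows_by_class <- split(seq_len(nrow(toy)), toy$Class)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.75 * length(idx)))]
}))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Class proportions should be close to identical in all three sets.
print(prop.table(table(toy$Class)))
print(prop.table(table(toy_train$Class)))
print(prop.table(table(toy_test$Class)))
```

The same comparison can be made on the notebook's own objects, for example by tabulating `pulsar_train$Class` and `pulsar_test$Class`, to confirm that the rare pulsar class is represented similarly in both sets.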

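Check whether there are missing values in the training data. If there are, we have to deal with them first, otherwise our predictions would not be reliable. Note that the row count shown by `glimpse` cannot reveal this on its own, since rows containing `NA` are still counted and the training set holds only about 75% of the raw rows; the direct check is to count the `NA` entries in each column. In our training data there aren't any missing values.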
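
One direct way to check for missing values is to count the `NA` entries in each column. The sketch below does this on a toy data frame with one deliberately missing value; `count_missing` is a hypothetical helper and the numbers are made up for illustration.

```r
# Hypothetical helper: number of missing (NA) entries in each column.
count_missing <- function(df) {
  colSums(is.na(df))
}

# Toy data frame with one deliberately missing value, for illustration only.
toy_na <- data.frame(Mean_IP = c(100.0, NA, 120.0),
                     SD_IP   = c(50.0, 45.0, 40.0),
                     Class   = factor(c(0, 0, 1)))

missing_per_column <- count_missing(toy_na)
print(missing_per_column)

# TRUE only when no column contains an NA.
print(all(missing_per_column == 0))
```

Applied to the notebook's data as `count_missing(pulsar_train)`, a result of zero in every column would confirm that no rows need to be imputed or dropped.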
