# Group Proposal Group-3

## Introduction

**Pulsars** *(from **pulsa**ting **r**adio source)*, or pulsar stars, are highly magnetized, rotating, compact stars that appear as flickering sources when observed from Earth. They belong to the family of neutron stars and emit beams of electromagnetic radiation from their magnetic poles. Because a pulsar rotates rapidly, its beams sweep past Earth at regular intervals, so the radiation appears to pulse, hence the name.

Pulsars are valuable cosmic tools that let scientists study a wide range of phenomena, and studying them advances our understanding of how the universe works. Pulsars are mainly detected by analyzing the radio signals received by telescopes. However, radio-frequency interference and random noise often contaminate these signals and make pulsars hard to detect. In this project, we aim to build a predictive classifier that identifies whether a received measurement comes from a pulsar or not. Because each observation belongs to one of two classes, this is a binary classification problem.

Predictive question: Can we use summary statistics of the integrated profile and the DM-SNR curve recorded by the telescope to determine whether a given reading comes from a pulsar?

#### Dataset and its attributes

We will use the [Predicting Pulsar Stars](https://www.kaggle.com/colearninglounge/predicting-pulsar-starintermediate) dataset, which contains pulsar candidates collected during the High Time Resolution Universe Survey.
Each candidate signal is described by eight continuous variables and a single class variable. The first four are simple statistics of the integrated pulse profile, and the remaining four are the same statistics computed from the DM-SNR curve. These variables are:

1. Mean of the integrated profile.
2. Standard deviation of the integrated profile.
3. Excess kurtosis of the integrated profile.
4. Skewness of the integrated profile.
5. Mean of the DM-SNR curve.
6. Standard deviation of the DM-SNR curve.
7. Excess kurtosis of the DM-SNR curve.
8. Skewness of the DM-SNR curve.
9. target_class *(0 if it is not a pulsar star and 1 if it is a pulsar star)*


### Preliminary exploratory data analysis:

#### Reading data

In [2]:
library(tidyverse)
library(tidymodels)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.2
✔ tidyr   1.1.2     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

“package ‘ggplot2’ was built under R version 4.0.1”
“package ‘tibble’ was built under R version 4.0.2”
“package ‘tidyr’ was built under R version 4.0.2”
“package ‘dplyr’ was built under R version 4.0.2”
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

“package ‘tidymodels’ was built under R version 4.0.2”
── Attaching packages ────────────────────────────────────── tidymodels 0.1.1 ──

In [4]:
# Read the training data, which is hosted on GitHub
url <- "https://raw.githubusercontent.com/Acha220/DSCI_Project_Proposal/main/pulsar_data_train.csv"
pulsar <- read_csv(url)

Parsed with column specification:
cols(
  `Mean of the integrated profile` = col_double(),
  `Standard deviation of the integrated profile` = col_double(),
  `Excess kurtosis of the integrated profile` = col_double(),
  `Skewness of the integrated profile` = col_double(),
  `Mean of the DM-SNR curve` = col_double(),
  `Standard deviation of the DM-SNR curve` = col_double(),
  `Excess kurtosis of the DM-SNR curve` = col_double(),
  `Skewness of the DM-SNR curve` = col_double(),
  target_class = col_double()
)



#### Wrangling and cleaning the dataset

In [5]:
# Preview the first rows of the dataset
head(pulsar)

| Mean of the integrated profile | Standard deviation of the integrated profile | Excess kurtosis of the integrated profile | Skewness of the integrated profile | Mean of the DM-SNR curve | Standard deviation of the DM-SNR curve | Excess kurtosis of the DM-SNR curve | Skewness of the DM-SNR curve | target_class |
|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 121.15625 | 48.37297 | 0.3754847 | -0.01316549 | 3.168896 | 18.39937 | 7.449874 | 65.159298 | 0 |
| 76.96875 | 36.17556 | 0.7128979 | 3.38871856 | 2.399666 | 17.571 | 9.414652 | 102.722975 | 0 |
| 130.58594 | 53.22953 | 0.1334083 | -0.29724164 | 2.743311 | 22.36255 | 8.508364 | 74.031324 | 0 |
| 156.39844 | 48.86594 | -0.2159886 | -0.17129365 | 17.471572 | NA | 2.958066 | 7.197842 | 0 |
| 84.80469 | 36.11766 | 0.8250128 | 3.27412537 | 2.790134 | 20.61801 | 8.405008 | 76.291128 | 0 |
| 121.00781 | 47.17694 | 0.2297081 | 0.09133623 | 2.036789 | NA | 9.546051 | 112.131721 | 0 |


In [6]:
# Changing column names
colnames(pulsar) <- c("mean_profile", "sd_profile", "kurtosis_profile", "skew_profile", "mean_dmsnr", "sd_dmsnr", "kurtosis_dmsnr", "skew_dmsnr", "target_class")

In [7]:
# Convert target_class from a double to a factor, since it is a class label
pulsar <- pulsar %>%
  mutate(target_class = as_factor(target_class))

In [8]:
# Split the data into training (75%) and testing (25%) sets, stratified by target_class
pulsar_split <- initial_split(pulsar, prop = 0.75, strata = target_class)
pulsar_training <- training(pulsar_split)
pulsar_testing <- testing(pulsar_split)
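The split above is random, so the rows assigned to each set can change between runs. The following is a minimal sketch of how a fixed seed could make a stratified split repeatable. It uses a made-up `toy` data frame in place of `pulsar`, and the seed value 2020 is an arbitrary assumption chosen only for illustration.

```r
library(rsample)  # part of tidymodels; provides initial_split()

# Toy stand-in for `pulsar` (made-up values, not the real data)
set.seed(2020)  # arbitrary seed, assumed here only for illustration
toy <- data.frame(
  mean_profile = rnorm(200, mean = 110, sd = 25),
  target_class = factor(rep(c(0, 1), times = c(180, 20)))
)

# Fix the seed immediately before the random split
set.seed(2020)
toy_split <- initial_split(toy, prop = 0.75, strata = target_class)
toy_training <- training(toy_split)
toy_testing <- testing(toy_split)

# Resetting the same seed and splitting again gives the same partition
set.seed(2020)
toy_split_again <- initial_split(toy, prop = 0.75, strata = target_class)
same_partition <- identical(training(toy_split_again), toy_training)
```

Calling `set.seed()` once near the top of the notebook would have a similar effect, provided the cells are always run in the same order.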

In [9]:
glimpse(pulsar_training)

Rows: 9,396
Columns: 9
$ mean_profile     <dbl> 76.96875, 130.58594, 156.39844, 121.00781, 79.34375,…
$ sd_profile       <dbl> 36.17556, 53.22953, 48.86594, 47.17694, 42.40217, 55…
$ kurtosis_profile <dbl> 0.71289786, 0.13340829, -0.21598860, 0.22970813, 1.0…
$ skew_profile     <dbl> 3.38871856, -0.29724164, -0.17129365, 0.09133623, 2.…
$ mean_dmsnr       <dbl> 2.3996656, 2.7433110, 17.4715719, 2.0367893, 141.641…
$ sd_dmsnr         <dbl> 17.570997, 22.362553, NA, NA, NA, 19.496527, 18.2177…
$ kurtosis_dmsnr   <dbl> 9.4146523, 8.5083638, 2.9580659, 9.5460511, -0.70080…
$ skew_dmsnr       <dbl> 102.722975, 74.031324, 7.197842, 112.131721, -1.2006…
$ target_class     <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
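The `glimpse()` output shows `NA` values in `sd_dmsnr`, so it may help to count how much data is missing before modelling. The sketch below shows one way to count missing values per column and rows with at least one missing value. It runs on a small made-up tibble named `toy_training`, which is a hypothetical stand-in for `pulsar_training`.

```r
library(dplyr)

# Toy stand-in for `pulsar_training` (made-up values, not the real data)
toy_training <- tibble(
  mean_dmsnr = c(2.4, 2.7, 17.5, 2.0),
  sd_dmsnr = c(17.6, 22.4, NA, NA),
  kurtosis_dmsnr = c(9.4, NA, 3.0, 9.5)
)

# Number of missing values in each column
na_per_column <- toy_training %>%
  summarize(across(everything(), ~ sum(is.na(.x))))

# Number of rows with at least one missing value
rows_with_na <- sum(!complete.cases(toy_training))
```

The same two expressions should work on `pulsar_training` unchanged, since they do not depend on the column names.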


#### Summarizing dataset 

Planned summaries of the training data:

1. A summary per class.
2. The mean of each predictor.
3. The number of rows with missing data.


In [11]:
# Compute the number and percentage of observations in each class (pulsar vs. not pulsar)
num_obs <- nrow(pulsar_training)
pulsar_training %>%
  group_by(target_class) %>%
  summarize(
    count = n(),
    percentage = n() / num_obs * 100
  )

`summarise()` ungrouping output (override with `.groups` argument)



| target_class | count | percentage |
|---|---|---|
| `<fct>` | `<int>` | `<dbl>` |
| 0 | 8540 | 90.88974 |
| 1 | 856 | 9.11026 |


In [13]:
# Compute the mean of each predictor, ignoring missing values
mean_table <- pulsar_training %>%
  select(mean_profile:skew_dmsnr) %>%
  map_df(mean, na.rm = TRUE)
mean_table

| mean_profile | sd_profile | kurtosis_profile | skew_profile | mean_dmsnr | sd_dmsnr | kurtosis_dmsnr | skew_dmsnr |
|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 111.0803 | 46.4746 | 0.4768257 | 1.804785 | 12.71377 | 26.32067 | 8.334917 | 105.5083 |
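The means above are pooled over both classes. A per-class version may show more clearly which predictors separate pulsars from non-pulsars. The sketch below illustrates one way to compute it with `group_by()` and `across()`. The tibble `toy_training` and its values are made up for illustration and stand in for `pulsar_training`.

```r
library(dplyr)

# Toy stand-in for `pulsar_training` (made-up values, not the real data)
toy_training <- tibble(
  mean_profile = c(120, 130, 30, 40),
  sd_dmsnr = c(18, NA, 60, 70),
  target_class = factor(c(0, 0, 1, 1))
)

# Mean of each predictor within each class, ignoring missing values
class_means <- toy_training %>%
  group_by(target_class) %>%
  summarize(across(mean_profile:sd_dmsnr, ~ mean(.x, na.rm = TRUE)))
```

On the real data, the column range would presumably be `mean_profile:skew_dmsnr`, matching the selection used for `mean_table`.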


### Methods:

### Expected outcomes and significance: