> # Predicting Pulsar Stars

### Kaggle


#### HTRU2 is a data set which describes a sample of pulsar candidates collected during the High Time Resolution Universe Survey.

#### Pulsars are a rare type of neutron star that produce radio emission detectable here on Earth. They are of considerable scientific interest as probes of space-time, the inter-stellar medium, and states of matter.

#### As pulsars rotate, their emission beam sweeps across the sky, and when this crosses our line of sight, produces a detectable pattern of broadband radio emission. As pulsars rotate rapidly, this pattern repeats periodically. Thus pulsar search involves looking for periodic radio signals with large radio telescopes.

#### Each pulsar produces a slightly different emission pattern, which varies slightly with each rotation. Thus a potential signal detection, known as a 'candidate', is averaged over many rotations of the pulsar, as determined by the length of an observation. In the absence of additional info, each candidate could potentially describe a real pulsar. However, in practice almost all detections are caused by radio frequency interference (RFI) and noise, making legitimate signals hard to find.

## Introduction

#### Firstly, we will take a look at the behaviour of our data! We are going to use a couple of manifold learning algorithms in order to visualise the high-dimensional data in a 2D plot!

#### Lastly, we will use some machine learning approaches to identify Pulsar Stars. We want to find the best algorithm accuracy-wise, but we also want to find out which one tells us the most about the features that identify Pulsar Stars.

In [None]:
import pandas as pd

In [None]:
data = pd.read_csv('../input/predicting-a-pulsar-star/pulsar_stars.csv')

data.head()

#### How many samples do we have for each class?

In [None]:
target = data[['target_class']]

data.drop('target_class', axis=1, inplace=True)

target['target_class'].value_counts()

#### Some statistical information about our data:

In [None]:
data.describe()

#### Taking a look at how our features are correlated:

In [None]:
data.corr()

#### We can see that there are "groups" of highly correlated features. That is good news for linear dimensionality reduction methods such as PCA, since a few components can then capture most of the variance!
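
#### As a rough sketch of how such correlated "groups" could be listed programmatically, the snippet below ranks feature pairs by absolute correlation. It runs on a small synthetic DataFrame with made-up columns (`a`, `b`, `c`), not on the pulsar data, and the 0.8 threshold is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a feature table (hypothetical column names).
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    'a': a,
    'b': 2.0 * a + rng.normal(scale=0.1, size=500),  # strongly tied to 'a'
    'c': rng.normal(size=500),                        # independent noise
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(ascending=False)

threshold = 0.8  # arbitrary cut-off, chosen for illustration only
strong_pairs = pairs[pairs > threshold]
print(strong_pairs)
```

#### On the real notebook, the same steps could be applied to `data.corr()` instead of `toy.corr()`.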

## High Dimensional Data Visualization using Manifold Learning Algorithms

#### I highly recommend visualising data in its original space. However, our data has 8 features, so we cannot plot it directly. Instead, we can use manifold learning algorithms to analyse the structure of the data in the original space and then embed it in a low-dimensional space that we can plot!

#### We are going to use two algorithms: (1) t-SNE and (2) ISOMAP.

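#### t-SNE takes a probabilistic approach: it converts distances between samples into neighbour probabilities and looks for a 2D layout whose probabilities match those of the original space. Meanwhile, ISOMAP relies on classical manifold theory: it approximates geodesic distances along a nearest-neighbour graph and preserves them in the embedding.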
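
#### To make the two ideas concrete, here is a minimal sketch that embeds synthetic 8-dimensional data into 2D with both algorithms. The data, the perplexity of 30 and the 10 neighbours are assumptions for illustration, not the settings used on the pulsar data.

```python
import numpy as np
from sklearn.manifold import TSNE, Isomap

# Synthetic 8-dimensional data: two overlapping groups (not the pulsar data).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 8))
X_toy[150:, 0] += 3.0  # shift the second group along the first feature

# t-SNE: matches neighbour probabilities between the 8D and the 2D space.
tsne_2d = TSNE(n_components=2, init='pca', perplexity=30,
               random_state=0).fit_transform(X_toy)

# Isomap: preserves shortest-path distances on a k-nearest-neighbour graph.
isomap_2d = Isomap(n_components=2, n_neighbors=10).fit_transform(X_toy)

print(tsne_2d.shape, isomap_2d.shape)
```

#### Both calls return one 2D point per sample, which is what makes the scatter plots possible.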

In [None]:
from sklearn.manifold import TSNE
from sklearn.manifold import Isomap

import numpy as np

import matplotlib.pyplot as plt

### (1) t-SNE

In [None]:
tsne = TSNE(n_components=2, init='pca', perplexity = 40)
tsne_data = tsne.fit_transform(data)

In [None]:
# Split the embedding by class with boolean masks instead of a Python loop
labels = target['target_class'].to_numpy()

not_pulsar = tsne_data[labels == 0]
pulsar = tsne_data[labels == 1]

In [None]:
plt.figure(figsize=(7,7))
plt.scatter(not_pulsar[:,0], not_pulsar[:,1], c='blue', label='Not Pulsar Stars')
plt.scatter(pulsar[:,0], pulsar[:,1], c='red', label='Pulsar Stars')
plt.legend()
plt.title('Low dimensional visualization (t-SNE) - Pulsar Stars');

### (2) ISOMAP

In [None]:
isomap = Isomap(n_components=2, n_neighbors=5, path_method='D', n_jobs=-1)

isomap_data = isomap.fit_transform(data)

In [None]:
# Split the embedding by class with boolean masks instead of a Python loop
labels = target['target_class'].to_numpy()

not_pulsar = isomap_data[labels == 0]
pulsar = isomap_data[labels == 1]

In [None]:
plt.figure(figsize=(7,7))
plt.scatter(not_pulsar[:,0], not_pulsar[:,1], c='blue', label='Not Pulsar Stars')
plt.scatter(pulsar[:,0], pulsar[:,1], c='red', label='Pulsar Stars')
plt.legend()
plt.title('Low dimensional visualization (ISOMAP) - Pulsar Stars');

#### One of the benefits of these approaches is that they often reveal clusters in our data. Both methods show the Pulsar Star samples grouped together, which suggests that the features differ between the two sample groups!

## Classification Approaches

#### We are going to use three approaches: (1) PCA + kNN, (2) LDA + kNN and (3) kNN.

#### PCA is a classical unsupervised dimensionality reduction algorithm based on the variance-covariance structure of the sample features! Meanwhile, LDA is a supervised method that discriminates sample groups by projecting the data onto at most n-1 dimensions, where n = (number of distinct sample groups). Finally, kNN is a method that classifies samples based on their proximity to other samples.

#### Our simplest method is the third approach. However, it is also the method that gives the least information about our data.
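
#### The difference between the three methods can be sketched on synthetic data. The two-class data below is made up for illustration; the point is that PCA ignores the labels, LDA uses them and returns a single axis for two classes, and kNN classifies on whatever space it is given.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier

# Two synthetic classes in 8 dimensions that differ in their mean.
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(loc=0.0, size=(200, 8)),
                   rng.normal(loc=1.5, size=(200, 8))])
y_toy = np.array([0] * 200 + [1] * 200)

# PCA is unsupervised: it never sees y_toy.
pca_proj = PCA(n_components=2).fit_transform(X_toy)

# LDA is supervised: with 2 classes it yields at most 2 - 1 = 1 component.
lda_proj = LinearDiscriminantAnalysis().fit_transform(X_toy, y_toy)

# kNN classifies by proximity, here in the 1D LDA space.
toy_knn = KNeighborsClassifier(n_neighbors=5).fit(lda_proj, y_toy)
train_acc = toy_knn.score(lda_proj, y_toy)  # training accuracy, for illustration only

print(pca_proj.shape, lda_proj.shape)
```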

In [None]:
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from scipy.stats import norm

In [None]:
data_train, data_test, target_train, target_test = train_test_split(data, np.array(target['target_class']), test_size=0.2, random_state=0)

print('train size = ', len(data_train))
print('test size = ', len(data_test))

In [None]:
pd.Series(target_train).value_counts()

In [None]:
pd.Series(target_test).value_counts()

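#### Distance- and variance-based methods such as kNN and PCA are sensitive to the scale of each feature. To avoid features with larger ranges dominating the result, we need to scale the data! Note that the scaler is fitted on the training set only.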

In [None]:
scaler = StandardScaler()
scaler.fit(data_train)

data_train_scaled = scaler.transform(data_train)
data_test_scaled = scaler.transform(data_test)
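
#### As a small sketch of what the scaling step does, the snippet below standardises two hypothetical features with very different ranges. The numbers are synthetic; the point is that the training set ends up with mean 0 and standard deviation 1 per feature, while the test set is only approximately standardised because it reuses the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical features on very different scales.
rng = np.random.default_rng(0)
toy_train = np.column_stack([rng.normal(100.0, 25.0, 1000),
                             rng.normal(0.5, 0.1, 1000)])
toy_test = np.column_stack([rng.normal(100.0, 25.0, 200),
                            rng.normal(0.5, 0.1, 200)])

# The mean and standard deviation are estimated from the training set only.
toy_scaler = StandardScaler().fit(toy_train)
toy_train_scaled = toy_scaler.transform(toy_train)
toy_test_scaled = toy_scaler.transform(toy_test)

print(toy_train_scaled.mean(axis=0).round(3), toy_train_scaled.std(axis=0).round(3))
print(toy_test_scaled.mean(axis=0).round(3), toy_test_scaled.std(axis=0).round(3))
```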

### (1) PCA + kNN

In [None]:
pca = PCA().fit(data_train_scaled)
pca_data_train = pca.transform(data_train_scaled)

print("Variance explained by each component (%): ")
for i, ratio in enumerate(pca.explained_variance_ratio_, start=1):
    print("\n", i, "º:", ratio * 100)

print("\nTotal sum (%): ", sum(pca.explained_variance_ratio_) * 100)

print("\nSum of the first two components (%): ", pca.explained_variance_ratio_[:2].sum() * 100)

#### Since we have a lot of samples for 8-dimensional data, we can calculate all principal components. However, for simplicity, we are going to use just the first and second principal components, which represent about 78% of the variance of our original data!
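
#### Instead of fixing the number of components by hand, one could pick it from the cumulative explained variance. The sketch below does this on synthetic data driven by two latent factors; the 95% target is an arbitrary assumption.

```python
import numpy as np
from sklearn.decomposition import PCA

# 8 synthetic features generated from 2 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 8))
toy = latent @ mixing + 0.1 * rng.normal(size=(500, 8))

toy_pca = PCA().fit(toy)
cumulative = np.cumsum(toy_pca.explained_variance_ratio_)

target_ratio = 0.95  # arbitrary target, for illustration only
# Index of the first component at which the cumulative ratio reaches the target.
n_needed = int(np.searchsorted(cumulative, target_ratio) + 1)

print(cumulative.round(3))
print("components needed:", n_needed)
```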

In [None]:
# Split the projected training data by class with boolean masks
not_pulsar = pca_data_train[target_train == 0]
pulsar = pca_data_train[target_train == 1]

In [None]:
plt.figure(figsize=(7,7))
plt.scatter(not_pulsar[:,0], not_pulsar[:,1], c='blue', label='Not Pulsar Stars')
plt.scatter(pulsar[:,0], pulsar[:,1], c='red', label='Pulsar Stars')
plt.legend()
plt.title('Low dimensional visualization (PCA) - Pulsar Stars');

#### We can see that PCA, in an unsupervised way, separated our sample groups! This result suggests that the first principal component captures much of the difference between the features of each group!

#### Let's evaluate the accuracy of our model when classifying data using kNN!

In [None]:
pca = PCA(n_components = 2).fit(data_train_scaled)

pca_data_test = pca.transform(data_test_scaled)
pca_data_train = pca.transform(data_train_scaled)

In [None]:
accuracy = []
for k in range(1,20):
    knn = KNeighborsClassifier(n_neighbors=k, p=2)
    knn.fit(pca_data_train, target_train)
    accuracy.append(knn.score(pca_data_test, target_test))

In [None]:
plt.plot(range(1,20),accuracy, 'bx-');
plt.xlabel('k number of neighbors')
plt.ylabel('Accuracy')
plt.title('Optimal number of neighbors');

print( "The best accuracy was", np.round(np.array(accuracy).max()*100, 2), "% with k =",  np.array(accuracy).argmax()+1) 


#### Our model has a very high accuracy for every number of neighbors in the range studied!
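
#### Since real pulsars are rare among the candidates, accuracy alone can look high even for a model that never finds a pulsar. The sketch below illustrates this with hypothetical labels and predictions (90 negatives, 10 positives); the numbers are invented and are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical imbalanced test labels: 90 negatives followed by 10 positives.
y_true = np.array([0] * 90 + [1] * 10)

# A "classifier" that always predicts the majority class.
y_majority = np.zeros_like(y_true)

# A hypothetical model that misses 2 positives and raises 2 false alarms.
y_model = y_true.copy()
y_model[90:92] = 0  # two positives missed
y_model[0:2] = 1    # two negatives flagged as positive

results = {}
for name, pred in [('majority', y_majority), ('model', y_model)]:
    results[name] = (accuracy_score(y_true, pred), recall_score(y_true, pred))
    print(name, "accuracy:", results[name][0], "recall:", results[name][1])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_model)
print(cm)
```

#### Reporting recall or a confusion matrix next to the accuracy would show how many of the rare positives are actually found.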

### (2) Linear Discriminant Analysis + kNN

#### Using the same approach as before, let's compute LDA on our data, plot it, and evaluate our model using kNN!

In [None]:
lda = LDA(n_components=1).fit(data_train_scaled, target_train)

lda_data_train = lda.transform(data_train_scaled)
lda_data_test = lda.transform(data_test_scaled)

In [None]:
# Split the projected training data by class with boolean masks
not_pulsar = lda_data_train[target_train == 0]
pulsar = lda_data_train[target_train == 1]

#### It is very helpful to compute the probability density function of the LDA result in order to visualise its statistical properties, such as the distance between sample groups and the variance of each group! Here we fit a normal distribution to each group.

In [None]:
pulsar_mean, pulsar_std = norm.fit(pulsar)
not_pulsar_mean, not_pulsar_std = norm.fit(not_pulsar)
all_mean, all_std = norm.fit(lda_data_train)

x = np.linspace(-7, 12, 10000)
pulsar_pdf = norm.pdf(x, pulsar_mean, pulsar_std)
not_pulsar_pdf = norm.pdf(x, not_pulsar_mean, not_pulsar_std)
all_pdf = norm.pdf(x, all_mean, all_std)

In [None]:
plt.figure(figsize=(10,6))
plt.scatter(pulsar_mean,0, marker='X', c='red',s=400)
plt.scatter(not_pulsar_mean,0, marker='X', c='blue',s=400)
plt.scatter(all_mean,0, marker='X', c='k',s=400)
plt.scatter(lda_data_train[:,0], np.zeros((len(lda_data_train),1)), c= ['red' if l==1  else 'blue' for l in target_train])
plt.ylim([-0.5,0.7])
plt.xlim([-7,12])
plt.plot(x, pulsar_pdf, 'r', linewidth=1.2, label='Pulsar Stars')
plt.plot(x, not_pulsar_pdf, 'b', linewidth=1.2, label='Not Pulsar Stars')
plt.plot(x, all_pdf, 'k', linewidth=1.2, label='All data')
plt.xlabel('Discriminant Hyperplane')
plt.ylabel('Probability Density Function')
plt.legend()
plt.title('LDA model');

#### We can see that the PDFs are well separated, which is good news! Not-pulsar samples lie very near each other (lower variance). On the other hand, pulsar samples seem to have a higher variance, so the behaviour may vary from one pulsar star to another, reflecting the uniqueness of this phenomenon!
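
#### The visual impression of separation could also be summarised with a number. The sketch below fits a normal distribution to two synthetic 1D groups and expresses the distance between their means in units of their combined spread. The groups and this particular index are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import norm

# Two synthetic 1D groups: a tight one and a wider one.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=-1.0, scale=0.5, size=2000)
group_b = rng.normal(loc=4.0, scale=1.5, size=300)

# norm.fit returns the maximum-likelihood mean and standard deviation.
mean_a, std_a = norm.fit(group_a)
mean_b, std_b = norm.fit(group_b)

# Distance between the means relative to the combined spread of both groups.
separation = abs(mean_b - mean_a) / np.sqrt(std_a**2 + std_b**2)

print("group a: mean", round(mean_a, 2), "std", round(std_a, 2))
print("group b: mean", round(mean_b, 2), "std", round(std_b, 2))
print("separation:", round(separation, 2))
```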

In [None]:
accuracy = []
for k in range(1,20):
    knn = KNeighborsClassifier(n_neighbors=k, p=2)
    knn.fit(lda_data_train, target_train)
    accuracy.append(knn.score(lda_data_test, target_test))

In [None]:
plt.plot(range(1,20),accuracy, 'bx-');
plt.xlabel('k number of neighbors')
plt.ylabel('Accuracy')
plt.title('Optimal number of neighbors');

print( "The best accuracy was", np.round(np.array(accuracy).max()*100, 2), "% with k =",  np.array(accuracy).argmax()+1)

#### We can see that we achieved similar results to our previous approach! Very good!

### (3) kNN


#### In this method we are neither transforming our data nor extracting important features. Simple as that.

In [None]:
accuracy = []
for k in range(1,20):
    knn = KNeighborsClassifier(n_neighbors=k, p=2)
    knn.fit(data_train_scaled, target_train)
    accuracy.append(knn.score(data_test_scaled, target_test))

In [None]:
plt.plot(range(1,20),accuracy, 'bx-');
plt.xlabel('k number of neighbors')
plt.ylabel('Accuracy')
plt.title('Optimal number of neighbors');

print( "The best accuracy was", np.round(np.array(accuracy).max()*100, 2), "% with k =",  np.array(accuracy).argmax()+1)

#### We achieved a very good accuracy! Since we did not use any other algorithm, this method is the simplest one computationally. However, it is only useful for classifying our data and cannot be used to extract further information about it!
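
#### As a closing sketch, the three approaches could be compared side by side with scikit-learn pipelines and cross-validation, so that scaling and projection are refitted inside each fold. The data below is synthetic (`make_classification`), and k = 5 and 5 folds are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced two-class data with 8 features (not the pulsar data).
X_toy, y_toy = make_classification(n_samples=600, n_features=8, n_informative=4,
                                   weights=[0.9, 0.1], random_state=0)

candidates = {
    'PCA + kNN': make_pipeline(StandardScaler(), PCA(n_components=2),
                               KNeighborsClassifier(n_neighbors=5)),
    'LDA + kNN': make_pipeline(StandardScaler(),
                               LinearDiscriminantAnalysis(n_components=1),
                               KNeighborsClassifier(n_neighbors=5)),
    'kNN': make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

# Mean accuracy over 5 stratified folds for each pipeline.
scores = {name: cross_val_score(model, X_toy, y_toy, cv=5).mean()
          for name, model in candidates.items()}

for name, score in scores.items():
    print(name, round(score, 3))
```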