 PREDICTING A PULSAR STAR

Pulsars are a rare type of neutron star that produce radio emission detectable here on Earth. They are of considerable scientific interest as probes of space-time, the interstellar medium, and states of matter. Neutron stars are very dense and have short, regular rotational periods. This produces a very precise interval between pulses, ranging from milliseconds to seconds for an individual pulsar. Pulsars are believed to be one of the candidate sources of ultra-high-energy cosmic rays.

![](https://usercontent2.hubstatic.com/14277725_f520.jpg)

The first pulsar was observed on November 28, 1967, by Jocelyn Bell Burnell and Antony Hewish. They observed pulses separated by 1.33 seconds that originated from the same location in the sky and kept to sidereal time. In looking for explanations, the short period of the pulses eliminated most astrophysical sources of radiation, such as stars, and since the pulses followed sidereal time, they could not be man-made radio frequency interference. (Source: Wikipedia)

In this kernel, I will predict whether a star is a pulsar using supervised machine learning algorithms.

### **CONTENT :**

1. [DATA ANALYSIS](#1)
2. [LOGISTIC REGRESSION](#2)
3. [K-NEAREST NEIGHBOUR(KNN) CLASSIFICATION](#3) 
4. [SUPPORT VECTOR MACHINE(SVM) CLASSIFICATION](#4)
5. [NAIVE BAYES CLASSIFICATION](#5)
6. [DECISION TREE CLASSIFICATION](#6)
7. [RANDOM FOREST CLASSIFICATION](#7)
8. [EVALUATING A CLASSIFICATION MODEL](#8)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory
import warnings
warnings.filterwarnings("ignore")
import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

<a id="1"></a> <br>
DATA ANALYSIS 

In [None]:
data=pd.read_csv("../input/pulsar_stars.csv")

In [None]:
data.info()

We have 9 columns (8 features plus the target class), and it looks like there are no NaN values. However, the feature names are a little untidy, so I will rename them.

In [None]:
data = data.rename(columns={' Mean of the integrated profile':"mean_integrated_profile",
       ' Standard deviation of the integrated profile':"std_deviation_integrated_profile",
       ' Excess kurtosis of the integrated profile':"kurtosis_integrated_profile",
       ' Skewness of the integrated profile':"skewness_integrated_profile", 
        ' Mean of the DM-SNR curve':"mean_dm_snr_curve",
       ' Standard deviation of the DM-SNR curve':"std_deviation_dm_snr_curve",
       ' Excess kurtosis of the DM-SNR curve':"kurtosis_dm_snr_curve",
       ' Skewness of the DM-SNR curve':"skewness_dm_snr_curve",
       })

Now let's look at the first 5 entries of the data set.

In [None]:
data.head()

The following heatmap shows the correlation between features.

There is a high positive correlation between the following features:
- Excess kurtosis of the integrated profile - Skewness of the integrated profile (0.95)
- Mean of the DM-SNR curve - Standard deviation of the DM-SNR curve (0.80)
- Excess kurtosis of the DM-SNR curve - Skewness of the DM-SNR curve (0.92)


There is a high negative correlation between the following features:
- Mean of the integrated profile - Excess kurtosis of the integrated profile (-0.87)
- Mean of the integrated profile - Skewness of the integrated profile (-0.74)
- Standard deviation of the DM-SNR curve - Excess kurtosis of the DM-SNR curve (-0.81)

In [None]:
f,ax=plt.subplots(figsize=(15,15))
sns.heatmap(data.corr(),annot=True,linecolor="blue",fmt=".2f",ax=ax)
plt.show()
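
The strongly correlated pairs can also be pulled out of the correlation matrix programmatically instead of being read off the heatmap. The following is a minimal sketch, not part of the original analysis: the `demo` frame is synthetic stand-in data and the 0.7 threshold is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the pulsar data (an assumption so the example runs on its own).
rng = np.random.default_rng(0)
a = rng.normal(size=500)
demo = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + rng.normal(scale=0.1, size=500),  # strongly positively correlated with a
    "c": -a + rng.normal(scale=0.1, size=500),       # strongly negatively correlated with a
    "d": rng.normal(size=500),                       # independent noise
})

corr = demo.corr()
# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()
# Select pairs whose absolute correlation exceeds the (hypothetical) threshold of 0.7.
strong_pairs = pairs[pairs.abs() > 0.7].sort_values()
print(strong_pairs)
```

With the real data, `demo` would be replaced by `data.drop(["target_class"], axis=1)`.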

The following pair plots show the relationships between features, colored by class.

In [None]:
g = sns.pairplot(data, hue="target_class",palette="husl",diag_kind = "kde",kind = "scatter")

In [None]:
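y = data["target_class"].values
x_data = data.drop(["target_class"],axis=1)
# Min-max normalization: rescale each feature column to the [0, 1] range.
# Column-wise .min()/.max() avoid the axis ambiguity of np.min/np.max on a DataFrame.
x = (x_data - x_data.min())/(x_data.max()-x_data.min())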

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.3,random_state=1)
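
The normalization above uses the minimum and maximum of the whole data set, so the test rows influence the scaling. A common alternative is to fit the scaler on the training split only and reuse it for the test split. The sketch below illustrates that variant with scikit-learn's `MinMaxScaler`; the arrays `X_demo` and `y_demo` are synthetic placeholders, not the pulsar data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic placeholder data (assumption): 200 samples, 3 features, binary labels.
rng = np.random.default_rng(1)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn min/max from the training rows only
X_te_scaled = scaler.transform(X_te)      # apply the same min/max to the test rows

print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
```

Scaled test values may fall slightly outside [0, 1], which is expected when the test split contains values beyond the training range.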

<a id="2"></a> <br>
LOGISTIC REGRESSION

In [None]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(x_train,y_train)
lr_prediction = lr.predict(x_test)

In [None]:
from sklearn.metrics import mean_squared_error
mse_lr=mean_squared_error(y_test,lr_prediction)

from sklearn.metrics import confusion_matrix,classification_report
cm_lr=confusion_matrix(y_test,lr_prediction)
cm_lr=pd.DataFrame(cm_lr)
cm_lr["total"]=cm_lr[0]+cm_lr[1]
cr_lr=classification_report(y_test,lr_prediction)


In [None]:
from sklearn.metrics import cohen_kappa_score
cks_lr= cohen_kappa_score(y_test, lr_prediction)



In [None]:
score_and_mse={"model":["logistic regression"],"Score":[lr.score(x_test,y_test)],"Cohen Kappa Score":[cks_lr],"MSE":[mse_lr]}
score_and_mse=pd.DataFrame(score_and_mse)


<a id="3"></a> <br>
K-NEAREST NEIGHBOUR(KNN) CLASSIFICATION 

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors =13) # n_neighbors = k
knn.fit(x_train,y_train)
knn_prediction = knn.predict(x_test)

In [None]:
score_list = []
for each in range(1,15):
    knn2 = KNeighborsClassifier(n_neighbors = each)
    knn2.fit(x_train,y_train)
    score_list.append(knn2.score(x_test,y_test))
    
plt.plot(range(1,15),score_list)
plt.xlabel("k values")
plt.ylabel("accuracy")
plt.show()
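
The loop above scores every k on the test split, so the chosen k is tuned to the same data that is later used for evaluation. One possible alternative is to pick k by cross-validation on the training data. This is a sketch under assumptions: the data comes from `make_classification` rather than the pulsar set, and 5 folds is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): 300 samples with 8 features.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)

k_values = range(1, 15)
cv_scores = []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 folds; each fold is held out once.
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    cv_scores.append(scores.mean())

best_k = k_values[int(np.argmax(cv_scores))]
print("best k by 5-fold CV:", best_k)
```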

In [None]:
mse_knn=mean_squared_error(y_test,knn_prediction)
cm_knn=confusion_matrix(y_test,knn_prediction)
cm_knn=pd.DataFrame(cm_knn)
cr_knn=classification_report(y_test,knn_prediction)
cm_knn["total"]=cm_knn[0]+cm_knn[1]

In [None]:
from sklearn.metrics import cohen_kappa_score
cks_knn= cohen_kappa_score(y_test, knn_prediction)


In [None]:
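# DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead.
score_and_mse = pd.concat([score_and_mse, pd.DataFrame([{'model': "knn classification","Score":knn.score(x_test,y_test),"Cohen Kappa Score":cks_knn,"MSE":mse_knn}])], ignore_index=True)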

<a id="4"></a> <br>
SUPPORT VECTOR MACHINE(SVM) CLASSIFICATION

In [None]:
from sklearn.svm import SVC
svm=SVC(random_state=1)
svm.fit(x_train,y_train)
svm_prediction=svm.predict(x_test)

In [None]:
mse_svm=mean_squared_error(y_test,svm_prediction)
svm_cm=confusion_matrix(y_test,svm_prediction)
cm_svm=pd.DataFrame(svm_cm)
cm_svm["total"]=cm_svm[0]+cm_svm[1]

cr_svm=classification_report(y_test,svm_prediction)
cks_svm= cohen_kappa_score(y_test, svm_prediction)


In [None]:
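# DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead.
score_and_mse = pd.concat([score_and_mse, pd.DataFrame([{'model': "svm classification","Score":svm.score(x_test,y_test),"Cohen Kappa Score":cks_svm,"MSE":mse_svm}])], ignore_index=True)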

<a id="5"></a> <br>
NAIVE BAYES CLASSIFICATION

In [None]:
from sklearn.naive_bayes import GaussianNB
nb=GaussianNB()
nb.fit(x_train,y_train)
prediction_nb=nb.predict(x_test)

In [None]:
nb_mse=mean_squared_error(y_test,prediction_nb)
nb_cm=confusion_matrix(y_test,prediction_nb)
nb_cm=pd.DataFrame(nb_cm)
nb_cm["total"]=nb_cm[0]+nb_cm[1]

cr_nb=classification_report(y_test,prediction_nb)
cks_nb= cohen_kappa_score(y_test, prediction_nb)


In [None]:
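# DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead.
score_and_mse = pd.concat([score_and_mse, pd.DataFrame([{'model': "naive bayes classification","Score":nb.score(x_test,y_test),"Cohen Kappa Score":cks_nb,"MSE":nb_mse}])], ignore_index=True)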

<a id="6"></a> <br>
DECISION TREE CLASSIFICATION

In [None]:
from sklearn.tree import DecisionTreeClassifier
dt=DecisionTreeClassifier()
dt.fit(x_train,y_train)
prediction_dt=dt.predict(x_test)

In [None]:
dt_mse=mean_squared_error(y_test,prediction_dt)
dt_cm=confusion_matrix(y_test,prediction_dt)
dt_cm=pd.DataFrame(dt_cm)
dt_cm["total"]=dt_cm[0]+dt_cm[1]

cr_dt=classification_report(y_test,prediction_dt)
cks_dt= cohen_kappa_score(y_test, prediction_dt)


In [None]:
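# DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead.
score_and_mse = pd.concat([score_and_mse, pd.DataFrame([{'model': "decision tree classification","Score":dt.score(x_test,y_test),"Cohen Kappa Score":cks_dt, "MSE":dt_mse}])], ignore_index=True)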

<a id="7"></a> <br>
RANDOM FOREST CLASSIFICATION

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf=RandomForestClassifier(n_estimators=100,random_state=1)
rf.fit(x_train,y_train)

prediction_rf=rf.predict(x_test)

In [None]:
rf_mse=mean_squared_error(y_test,prediction_rf)
rf_cm=confusion_matrix(y_test,prediction_rf)
rf_cm=pd.DataFrame(rf_cm)
rf_cm["total"]=rf_cm[0]+rf_cm[1]

cr_rf=classification_report(y_test,prediction_rf)
cks_rf= cohen_kappa_score(y_test, prediction_rf)

In [None]:
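# DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead.
score_and_mse = pd.concat([score_and_mse, pd.DataFrame([{'model': "random forest classification","Score":rf.score(x_test,y_test),"Cohen Kappa Score":cks_rf,"MSE":rf_mse}])], ignore_index=True)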

<a id="8"></a> <br>
EVALUATING A CLASSIFICATION MODEL

**Classification Report**

The classification report includes the following metrics and averages:

**Precision:** The ratio of correctly classified positive examples to the total number of examples predicted as positive: TP / (TP + FP)

**Recall:** The ratio of correctly classified positive examples to the total number of actual positive examples: TP / (TP + FN)

**F-measure:** The harmonic mean of precision and recall: 2 x Recall x Precision / (Recall + Precision)

**micro average:** the metric computed globally from the total true positives, false negatives and false positives,

**macro average:** the unweighted mean of the per-label metric,

**weighted average:** the support-weighted mean of the per-label metric
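
As a small worked example of the precision, recall and F-measure formulas, the sketch below computes them by hand from TP, FP and FN counts and compares the result with scikit-learn. The label arrays are made-up toy values, not results from the pulsar models.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels (assumption): 4 actual positives, 6 actual negatives.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # positives predicted positive
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # negatives predicted positive
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # positives predicted negative

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_measure = 2 * recall * precision / (recall + precision)

print(precision, recall, f_measure)
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```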

In [None]:
print('Classification report for Logistic Regression: \n',cr_lr)
print('Classification report for KNN Classification: \n',cr_knn)
print('Classification report for SVM Classification: \n',cr_svm)
print('Classification report for Naive Bayes Classification: \n',cr_nb)
print('Classification report for Decision Tree Classification: \n',cr_dt)
print('Classification report for Random Forest Classification: \n',cr_rf)

**Confusion Matrix**

A confusion matrix is a summary of prediction results on a classification problem.

Positive (P) : Observation is positive (for example: is a pulsar).

Negative (N) : Observation is not positive (for example: is not a pulsar).

True Positive (TP) : Observation is positive, and is predicted to be positive.

False Negative (FN) : Observation is positive, but is predicted negative.

True Negative (TN) : Observation is negative, and is predicted to be negative.

False Positive (FP) : Observation is negative, but is predicted positive.



![](http://rasbt.github.io/mlxtend/user_guide/evaluate/confusion_matrix_files/confusion_matrix_1.png)
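
The four cells defined above can be read directly from scikit-learn's `confusion_matrix`, which for binary 0/1 labels puts true classes in rows and predicted classes in columns. A minimal sketch on toy labels (assumed values, not the pulsar results):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (assumption): 6 actual negatives followed by 4 actual positives.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

cm = confusion_matrix(y_true, y_pred)
# For binary labels the flattened order is TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```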

In [None]:
f, axes = plt.subplots(2, 3,figsize=(18,12))
g1 = sns.heatmap(cm_lr,annot=True,fmt=".1f",cmap="flag",cbar=False,ax=axes[0,0])
g1.set_ylabel('y_true')
g1.set_xlabel('y_head')
g1.set_title("Logistic Regression")
g2 = sns.heatmap(cm_knn,annot=True,fmt=".1f",cmap="flag",cbar=False,ax=axes[0,1])
g2.set_ylabel('y_true')
g2.set_xlabel('y_head')
g2.set_title("KNN Classification")
g3 = sns.heatmap(cm_svm,annot=True,fmt=".1f",cmap="flag",ax=axes[0,2])
g3.set_ylabel('y_true')
g3.set_xlabel('y_head')
g3.set_title("SVM Classification")
g4 = sns.heatmap(nb_cm,annot=True,fmt=".1f",cmap="flag",cbar=False,ax=axes[1,0])
g4.set_ylabel('y_true')
g4.set_xlabel('y_head')
g4.set_title("Naive Bayes Classification")
g5 = sns.heatmap(dt_cm,annot=True,fmt=".1f",cmap="flag",cbar=False,ax=axes[1,1])
g5.set_ylabel('y_true')
g5.set_xlabel('y_head')
g5.set_title("Decision Tree Classification")
g6 = sns.heatmap(rf_cm,annot=True,fmt=".1f",cmap="flag",ax=axes[1,2])
g6.set_ylabel('y_true')
g6.set_xlabel('y_head')
g6.set_title("Random Forest Classification")



**Receiver Operating Characteristic(ROC) Curve**

In a Receiver Operating Characteristic (ROC) curve, the true positive rate (sensitivity) is plotted as a function of the false positive rate (100 - specificity) for different cut-off points.

Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold.

A test with perfect discrimination (no overlap in the two distributions) has an ROC curve that passes through the upper left corner (100% sensitivity, 100% specificity).

Therefore, the closer the ROC curve is to the upper left corner, the higher the overall accuracy of the test (Zweig & Campbell, 1993).

In [None]:
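from sklearn.metrics import roc_curve
# Use predicted probabilities for the positive class so the curve covers many thresholds;
# hard 0/1 predictions would yield only a single interior point.
lr_proba = lr.predict_proba(x_test)[:, 1]
fpr_lr, tpr_lr, thresholds = roc_curve(y_test, lr_proba)
plt.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance-level diagonal
plt.plot(fpr_lr, tpr_lr, color="red")
plt.title('Logistic Regression')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()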
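
An ROC curve is often summarized by the area under it (AUC), where 1.0 corresponds to perfect discrimination and 0.5 to chance. The sketch below computes it on four toy scores; the labels and scores are illustrative assumptions, not output of the models in this kernel.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Toy labels and classifier scores (assumption).
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc_from_curve = auc(fpr, tpr)                 # trapezoidal area under the plotted points
auc_direct = roc_auc_score(y_true, y_score)    # same quantity computed directly

print(auc_from_curve, auc_direct)
```

Three of the four positive/negative score pairs are ranked correctly here, which is why both values come out as 0.75.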

**Accuracy (Score)**

It is the ratio of number of correct predictions to the total number of input samples.

Accuracy= Number of Correct Predictions/ Total Number of Predictions Made
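
Accuracy can look high on imbalanced data even when a model learns nothing, because always predicting the majority class is already right most of the time. The sketch below illustrates this with scikit-learn's `DummyClassifier` on made-up labels; the 90/10 class split is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up imbalanced labels (assumption): 90 negatives, 10 positives.
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.zeros((100, 1))  # features are irrelevant for this baseline

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
baseline_pred = baseline.predict(X_demo)

# Number of correct predictions / total number of predictions made.
baseline_accuracy = accuracy_score(y_demo, baseline_pred)
print(baseline_accuracy)
```

A model is only informative to the extent that its accuracy exceeds this kind of baseline.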

**Cohen Kappa Score**

Kappa is similar to the accuracy score, but it takes into account the accuracy that would be achieved anyway by random predictions.

It is a measure of how well the classifier actually performs. In other words, if there is a big difference between the observed accuracy and the accuracy expected by chance, a model will have a high Kappa score.


Cohen's Kappa measures agreement between exactly two raters (here, the classifier's predictions and the true labels); if there are more than two raters, Fleiss's Kappa is used.

Kappa = (Observed Accuracy - Expected Accuracy) / (1 - Expected Accuracy)
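
The Kappa formula can be checked by hand from a confusion matrix. In this sketch the expected accuracy is computed from the row and column totals, and the result is compared with `cohen_kappa_score`; the labels are toy values assumed for illustration.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# Toy labels (assumption), not results from the pulsar models.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

cm = confusion_matrix(y_true, y_pred)
n = cm.sum()

observed = np.trace(cm) / n  # fraction of samples on the diagonal
# Chance agreement: for each class, P(true = class) * P(predicted = class), summed.
expected = np.sum(cm.sum(axis=0) * cm.sum(axis=1)) / n**2

kappa_manual = (observed - expected) / (1 - expected)
print(kappa_manual, cohen_kappa_score(y_true, y_pred))
```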

**Mean Squared Error(MSE)**

Mean Squared Error (MSE) is quite similar to Mean Absolute Error, the only difference being that MSE averages the squared differences between the true values and the predicted values.
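
For 0/1 class labels, each squared difference is 1 for a misclassified sample and 0 otherwise, so MSE coincides with the error rate, that is, 1 - accuracy. A quick check on toy labels (assumed values):

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Toy binary labels (assumption): 2 of the 10 predictions are wrong.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

mse = mean_squared_error(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)
print(mse, 1 - accuracy)
```

This means the MSE column in the summary table carries the same information as the Score column for these classifiers.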

In [None]:
score_and_mse