# Build simple models to predict pulsar candidates

In this notebook we build simple machine learning models to classify pulsar candidates. The data comes from Rob Lyon at Manchester and is publicly available. For more information, see https://figshare.com/articles/HTRU2/3080389/1

#### Let's start with the basic imports


In [1]:
import numpy as np

# Some preprocessing utilities
from sklearn.model_selection import train_test_split # Data splitting (sklearn.cross_validation was removed in newer releases)
from sklearn.utils import shuffle

# The different classifiers
from sklearn.tree import DecisionTreeClassifier # Decision Tree
from sklearn.svm import SVC # Support Vector Machines
from sklearn.neighbors import KNeighborsClassifier # Nearest Neighbor 
from sklearn.naive_bayes import GaussianNB # Bayesian Classifier

# Model result function
from sklearn.metrics import accuracy_score



#### Load dataset

* The data is a CSV file in which each column is a feature and each row is a sample (a positive or negative candidate)

* The class label is the last column, where "1" corresponds to a true pulsar candidate and "0" to a false candidate
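
To make the layout concrete, here is a minimal sketch that parses a tiny CSV with the same shape: feature columns first, class label last. The three rows and their values are invented for illustration and are not taken from HTRU2, and the names `toy_csv`, `toy_features` and `toy_labels` are hypothetical.

```python
import io
import numpy as np

# Hypothetical miniature file: 3 samples, 2 feature columns, 1 label column.
toy_csv = io.StringIO(
    "140.5,55.6,0\n"
    "102.5,58.8,0\n"
    "99.3,41.5,1\n"
)
toy = np.loadtxt(toy_csv, delimiter=',')

toy_features = toy[:, :-1]  # every column except the last
toy_labels = toy[:, -1]     # the last column holds the class label

print(toy_features.shape)   # (3, 2)
print(toy_labels)           # [0. 0. 1.]
```

Note that `np.loadtxt` returns floating-point values by default, so the labels come back as `0.0` and `1.0` rather than integers.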

In [2]:
data = np.loadtxt('Data/Pulsar/HTRU_2.csv',delimiter=',')

# Show some information
print('Dataset has %d rows and %d columns including features and labels'%(data.shape[0],data.shape[1]))

Dataset has 17898 rows and 9 columns including features and labels


#### Get the features and labels

In [3]:
# Let's shuffle the rows of the data 10 times
# (a single shuffle already gives a random order; the repeats are harmless)
for i in range(10):
    data = shuffle(data)

# Now split the dataset into separate variables for features and labels
features = data[:,0:-1] # Features are all columns except the last one
labels = data[:,-1] # Labels are in the last column
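
Because whole rows are shuffled before the columns are separated, every label stays attached to its own feature row. The sketch below checks this on made-up data. It uses a NumPy permutation as a stand-in for `sklearn.utils.shuffle`, and the rule "label is 1 when the first feature exceeds 5" is an assumption made only so the alignment can be verified.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical data: two feature columns, label = 1 when the first feature > 5.
feats = np.arange(20, dtype=float).reshape(10, 2)
labs = (feats[:, 0] > 5).astype(float)
table = np.column_stack([feats, labs])

shuffled = table[rng.permutation(len(table))]  # reorder whole rows

# Split only after shuffling; each label still belongs to its own row.
X, y = shuffled[:, :-1], shuffled[:, -1]
print(np.array_equal(y, (X[:, 0] > 5).astype(float)))  # True
```

Shuffling the features and the labels separately, with different permutations, would break this correspondence.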

#### Split data to training and validation sets

In [4]:
# Do a 70-30 split of the whole data for training and testing
# The last argument specifies the fraction of samples used for testing
train_data,test_data,train_labels,test_labels = train_test_split(features,labels,test_size=.3)
# Print some info
print('Number of training data points : %d'%(train_data.shape[0]))
print('Number of testing data points : %d'%(test_data.shape[0]))


Number of training data points : 12528
Number of testing data points : 5370
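
The counts above follow from the split fraction: 30% of 17898 is 5369.4, which is rounded up to 5370 test rows, leaving 12528 for training. The sketch below reproduces the arithmetic on placeholder arrays of the same shape (all zeros, not the real data). The `random_state=0` argument is an addition that is not in the original cell; it only fixes the shuffle order so repeated runs give the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the HTRU2 features and labels.
X = np.zeros((17898, 8))
y = np.zeros(17898)

# random_state is an added assumption; it makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

print(X_train.shape[0], X_test.shape[0])  # 12528 5370
```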


##### Let's train several different algorithms

We will be using the following algorithms:

* Decision Trees  [ https://en.wikipedia.org/wiki/Decision_tree_learning ]


* Support Vector Machines (SVM)  [ https://en.wikipedia.org/wiki/Support_vector_machine ]


* k-Nearest Neighbours (KNN) [ https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm ]


* Naive Bayes Classifier [ https://en.wikipedia.org/wiki/Naive_Bayes_classifier ]


### Let's start with the default model parameters for each classifier.
Follow the link above each block for the documentation of the corresponding class.


* Scikit Decision Tree

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier

In [5]:
dt = DecisionTreeClassifier() # Make the classifier object
dt.fit(train_data,train_labels) # Train the model

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

* Scikit SVM

http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

In [6]:
# Support Vector Machine
svm = SVC()
svm.fit(train_data,train_labels)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

* Scikit KNN

http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

In [7]:
# K nearest neighbor
knn = KNeighborsClassifier()
knn.fit(train_data,train_labels)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

* Scikit Naive Bayes

http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html

In [8]:
# Naive Bayes
nb = GaussianNB()
nb.fit(train_data,train_labels)

GaussianNB(priors=None)

#### Fancy function to print results for model evaluation

In [None]:
# Helper function to test a model and print its accuracy score
def evaluate(model,modelname,test_data,test_labels):
    predictions = model.predict(test_data) # Do the actual prediction
    print('%s is %f%% accurate \n'%(modelname,accuracy_score(test_labels,predictions)*100))


In [None]:
# Collect the trained models and their display names so we can loop over them
models = [dt,svm,knn,nb]
model_names = ['Decision Tree','Support Vector Machines','KNN','Naive Bayes']

#### Now let's test each classifier and display its accuracy

In [None]:
for model,name in zip(models,model_names):
    evaluate(model,name,test_data,test_labels)
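
For reference, with its default arguments `accuracy_score` returns the fraction of predictions that match the true labels. The sketch below computes that fraction by hand on invented labels. The 9-to-1 class ratio is hypothetical and is not the real HTRU2 ratio; it is chosen to show that a model which always predicts the majority class can still score highly, so it is worth checking the class balance before reading too much into the accuracy figures.

```python
import numpy as np

# Hypothetical labels: 9 negatives and 1 positive (not the real HTRU2 ratio).
true_labels = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

# A "model" that always predicts the majority class.
always_negative = np.zeros_like(true_labels)

# Accuracy is the fraction of predictions that match the true labels.
accuracy = np.mean(always_negative == true_labels)
print('%f%% accurate' % (accuracy * 100))  # 90.000000% accurate
```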