# Exercise: Classification

Here you'll get some experience training a classification model yourself. You'll create a model that can determine whether a radio signal comes from a pulsar. Pulsars are a rare type of neutron star that produces radio signals we can detect on Earth. As a pulsar rotates, its beam of radio waves points directly at us, then sweeps away. This produces a periodic signal that we can use to decide whether a radio signal actually comes from a pulsar or is just noise.

From the [dataset page on Kaggle](https://www.kaggle.com/pavanraj159/predicting-a-pulsar-star):

>As pulsars rotate, their emission beam sweeps across the sky, and when this crosses our line of sight, produces a detectable pattern of broadband radio emission. As pulsars rotate rapidly, this pattern repeats periodically. Thus pulsar search involves looking for periodic radio signals with large radio telescopes.

>Each pulsar produces a slightly different emission pattern, which varies slightly with each rotation. Thus a potential signal detection, known as a 'candidate', is averaged over many rotations of the pulsar, as determined by the length of an observation. In the absence of additional info, each candidate could potentially describe a real pulsar. However in practice almost all detections are caused by radio frequency interference (RFI) and noise, making legitimate signals hard to find.

The data itself contains eight measures of this radio signal and a column `target_class` that indicates if the signal is noise (0) or a pulsar (1). Using this data, you'll train a classifier that can identify pulsars from the radio signal data.
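Because the dataset description says almost all detections are noise, it is worth checking how the two classes are balanced before training. The sketch below uses a small, hypothetical label column (`toy` is made up for illustration and is not the pulsar data) to show how `value_counts(normalize=True)` reports the share of each class.

```python
import pandas as pd

# Hypothetical toy labels, NOT the real pulsar data: 0 = noise, 1 = pulsar
toy = pd.DataFrame({"target_class": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]})

# Fraction of rows that fall in each class
class_share = toy["target_class"].value_counts(normalize=True)
print(class_share)
```

Once the data is loaded, the same call on `pulsar_data['target_class']` should show how rare pulsars are in the real candidates.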

In [3]:
#%%RM_IF(PROD)%%
# For loading learntools during dev
import sys
sys.path.append('../../..')

In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split
import sklearn.metrics as metrics

# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.machine_learning.ex8 import *
print("Setup complete")

Setup complete


Load in the data and check out the first few rows to get acquainted with the features.

In [5]:
pulsar_data = pd.read_csv('../input/predicting-a-pulsar-star/pulsar_stars.csv')
pulsar_data.head()

| | Mean of the integrated profile | Standard deviation of the integrated profile | Excess kurtosis of the integrated profile | Skewness of the integrated profile | Mean of the DM-SNR curve | Standard deviation of the DM-SNR curve | Excess kurtosis of the DM-SNR curve | Skewness of the DM-SNR curve | target_class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 140.5625 | 55.683782 | -0.234571 | -0.699648 | 3.199833 | 19.110426 | 7.975532 | 74.242225 | 0 |
| 1 | 102.507812 | 58.88243 | 0.465318 | -0.515088 | 1.677258 | 14.860146 | 10.576487 | 127.39358 | 0 |
| 2 | 103.015625 | 39.341649 | 0.323328 | 1.051164 | 3.121237 | 21.744669 | 7.735822 | 63.171909 | 0 |
| 3 | 136.75 | 57.178449 | -0.068415 | -0.636238 | 3.642977 | 20.95928 | 6.896499 | 53.593661 | 0 |
| 4 | 88.726562 | 40.672225 | 0.600866 | 1.123492 | 1.17893 | 11.46872 | 14.269573 | 252.567306 | 0 |


As usual, split the data into training and validation sets. Here 20% of the rows are held out for validation.
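To make the effect of `test_size` concrete, here is a minimal sketch on hypothetical toy data (the names `toy_X` and `toy_y` are invented for this example and are not part of the exercise). With ten rows and `test_size=0.2`, two rows are held out and eight remain for training.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data, not the pulsar dataset
toy_X = pd.DataFrame({"feature": range(10)})
toy_y = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# Hold out 20% of the rows; random_state makes the shuffle reproducible
toy_train_X, toy_val_X, toy_train_y, toy_val_y = train_test_split(
    toy_X, toy_y, random_state=1, test_size=0.2
)

print(len(toy_train_X), len(toy_val_X))
```

The held-out rows never overlap with the training rows, which is what lets the validation score estimate performance on unseen data.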

In [None]:
y = pulsar_data['target_class']
X = pulsar_data.drop('target_class', axis=1)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1, test_size=.2)

## Train the classifier

Now it's time to create the model. Define a `RandomForestClassifier` with `random_state` set to 1, then fit it on the training data.

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Define the model. Set random_state to 1
model = _____

# Fit your model
_____

step_1.check()

In [None]:
# The lines below will show you a hint or the solution.
#step_1.hint()
#step_1.solution()

In [None]:
#%%RM_IF(PROD)%%

from sklearn.ensemble import RandomForestClassifier

# Define the model. Set random_state to 1
model = RandomForestClassifier(random_state=1)

# fit your model on the training data
model.fit(train_X, train_y)

step_1.assert_check_passed()

## Make Predictions

Make predictions with the trained model on the validation features (`val_X`). Then calculate the accuracy of those predictions with `metrics.accuracy_score`, comparing them against the validation targets (`val_y`).
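When reading an accuracy score, it helps to know what a trivial model would score on the same labels. The sketch below uses hypothetical labels (made up for illustration, not taken from the pulsar data) and a "model" that always predicts noise. It calls `metrics.accuracy_score` with the true labels first and the predictions second.

```python
import sklearn.metrics as metrics

# Hypothetical labels for illustration only: 0 = noise, 1 = pulsar
true_y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# A trivial "model" that never predicts a pulsar
always_noise = [0] * len(true_y)

baseline_accuracy = metrics.accuracy_score(true_y, always_noise)
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```

On these toy labels the trivial model is right 80% of the time while finding no pulsars at all, so a high accuracy on imbalanced data is worth comparing against this kind of baseline.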

In [None]:
# Get predictions from the trained model using the validation features
pred_y = _____

# Calculate the accuracy of the trained model with the validation targets and predicted targets
accuracy = _____

print(f"Accuracy: {accuracy:.3f}")

step_2.check()

In [None]:
# The lines below will show you a hint or the solution.
#step_2.hint()
#step_2.solution()

In [None]:
#%%RM_IF(PROD)%%
# Get predictions from the trained model using the validation features
pred_y = model.predict(val_X)

# Calculate the accuracy of the trained model with the validation targets and predicted targets
accuracy = metrics.accuracy_score(val_y, pred_y)

print("Accuracy, Random Forest Classifier: {:.3f}".format(accuracy))
step_2.assert_check_passed()

## Interpret the results

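Finally, calculate the confusion matrix for the classifier. Rows correspond to the true labels and columns to the predicted labels. We'll also normalize each row by the number of examples with that true label, so the diagonal holds the true negative rate (noise correctly identified) and the true positive rate (pulsars correctly identified).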

In [None]:
confusion = metrics.confusion_matrix(val_y, pred_y)
print(f"Confusion matrix:\n{confusion}")

# Normalizing by the true label counts
norm_confusion = confusion.astype(float) / confusion.sum(axis=1)[:, None]
print(f"\nNormalized confusion matrix:\n{norm_confusion}")

Looking at the true positive rate, do you think the model is doing well at classifying pulsars from radio signals? Is it more prone to misclassifying noise as pulsars, or to missing pulsars in the data?
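As a guide to reading a row-normalized matrix, here is a sketch on made-up counts. The numbers in `toy_confusion` are invented for illustration and are not results from the pulsar model. The layout assumed here is scikit-learn's: rows are true labels, columns are predicted labels.

```python
import numpy as np

# Made-up counts for illustration, not results from the pulsar model.
# Rows are true labels, columns are predicted labels.
toy_confusion = np.array([
    [95, 5],  # true noise:  95 predicted noise, 5 predicted pulsar
    [2, 8],   # true pulsar:  2 predicted noise, 8 predicted pulsar
])

# Divide each row by its total so every row sums to 1
toy_norm = toy_confusion.astype(float) / toy_confusion.sum(axis=1)[:, None]

true_negative_rate = toy_norm[0, 0]   # noise correctly labeled as noise
false_positive_rate = toy_norm[0, 1]  # noise mislabeled as pulsar
false_negative_rate = toy_norm[1, 0]  # pulsars the model missed
true_positive_rate = toy_norm[1, 1]   # pulsars the model found

print(toy_norm)
```

In this toy case the model mislabels 5% of the noise as pulsars but misses 20% of the real pulsars, so the second row is the one to watch when the positive class is rare.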

In [None]:
#step_3.solution()