# Exercise: Classification

In this exercise you'll train a classification model yourself: a model that determines whether a radio signal comes from a pulsar. Pulsars are a rare type of neutron star that produce radio signals we can detect on Earth. As a pulsar rotates, its beam of radio waves points directly at us, then moves away. This produces a periodic signal that we can use to determine whether a radio signal is actually from a pulsar or just noise.

The data itself contains eight measures of this radio signal and a column `target_class` that indicates if the signal is noise (0) or a pulsar (1). Using this data, you'll train a classifier that can identify pulsars from the radio signal data.

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
import sklearn.metrics as metrics

# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.machine_learning.ex8 import *
print("Setup complete")



Setup complete


Load in the data and check out the first few rows to get acquainted with the features.

In [3]:
pulsar_data = pd.read_csv('../input/predicting-a-pulsar-star/pulsar_stars.csv')
pulsar_data.head()

| | Mean of the integrated profile | Standard deviation of the integrated profile | Excess kurtosis of the integrated profile | Skewness of the integrated profile | Mean of the DM-SNR curve | Standard deviation of the DM-SNR curve | Excess kurtosis of the DM-SNR curve | Skewness of the DM-SNR curve | target_class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 140.5625 | 55.683782 | -0.234571 | -0.699648 | 3.199833 | 19.110426 | 7.975532 | 74.242225 | 0 |
| 1 | 102.507812 | 58.88243 | 0.465318 | -0.515088 | 1.677258 | 14.860146 | 10.576487 | 127.39358 | 0 |
| 2 | 103.015625 | 39.341649 | 0.323328 | 1.051164 | 3.121237 | 21.744669 | 7.735822 | 63.171909 | 0 |
| 3 | 136.75 | 57.178449 | -0.068415 | -0.636238 | 3.642977 | 20.95928 | 6.896499 | 53.593661 | 0 |
| 4 | 88.726562 | 40.672225 | 0.600866 | 1.123492 | 1.17893 | 11.46872 | 14.269573 | 252.567306 | 0 |


As usual, split the data into training and validation sets.

In [4]:
y = pulsar_data['target_class']
X = pulsar_data.drop('target_class', axis=1)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1, test_size=.2)

## 1. Train the classifier

Now it's time to create the model. Use `RandomForestClassifier` with `random_state=1` and fit it on the training data.

In [5]:
from sklearn.ensemble import RandomForestClassifier

# Define the model. Set random_state to 1
model = ____

# Fit your model
____

step_1.check()

<span style="color:#ccaa33">Check:</span> When you've updated the starter code, `check()` will tell you whether your code is correct. You need to update the code that creates variables `model`, `val_X`, `val_y`

In [6]:
# The lines below will show you a hint or the solution.
#step_1.hint() 
#step_1.solution()

In [7]:
#%%RM_IF(PROD)%%

from sklearn.ensemble import RandomForestClassifier

# Define the model. Set random_state to 1
model = RandomForestClassifier(random_state=1)

# fit your model on the training data
model.fit(train_X, train_y)

step_1.assert_check_passed()

<span style="color:#33cc33">Correct</span>

## 2. Make Predictions

Make predictions using the trained model and the validation features. Calculate the accuracy of the predictions with `metrics.accuracy_score`, using the validation targets.

In [8]:
# Get predictions from the trained model using the validation features
pred_y = ____

# Calculate the accuracy of the trained model with the validation targets and predicted targets
accuracy = ____

print("Accuracy: ", accuracy)

# Check your answer
step_2.check()

Accuracy:  <learntools.core.constants.PlaceholderValue object at 0x7f85186bf550>


<span style="color:#ccaa33">Check:</span> When you've updated the starter code, `check()` will tell you whether your code is correct. You need to update the code that creates variable `accuracy`

In [9]:
# The lines below will show you a hint or the solution.
#step_2.hint()
#step_2.solution()

In [10]:
#%%RM_IF(PROD)%%
# Get predictions from the trained model using the validation features
pred_y = model.predict(val_X)

# Calculate the accuracy of the trained model with the validation targets and predicted targets
accuracy = metrics.accuracy_score(val_y, pred_y)

print("Accuracy: ", accuracy)
step_2.assert_check_passed()

Accuracy:  0.9779329608938547


<span style="color:#33cc33">Correct</span>
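
For intuition, accuracy is simply the fraction of predictions that match the true labels. The sketch below computes it by hand in plain Python. The two label lists are made-up illustrative values, not taken from the pulsar data, and `accuracy_by_hand` is a hypothetical helper name.

```python
def accuracy_by_hand(true_labels, predicted_labels):
    """Return the fraction of predictions that equal the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


# Made-up labels for illustration only: 0 = noise, 1 = pulsar
true_labels = [0, 0, 0, 1, 1, 0, 0, 1]
predicted_labels = [0, 0, 0, 1, 0, 0, 1, 1]

toy_accuracy = accuracy_by_hand(true_labels, predicted_labels)
print("Accuracy:", toy_accuracy)  # 6 of 8 predictions match
```

Applied to the exercise's own `val_y` and `pred_y`, this should agree with the value returned by `metrics.accuracy_score`.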

## 3. Interpret the results

Finally, calculate the confusion matrix for the classifier. We'll also normalize the confusion matrix to get it in terms of rates.

In [11]:
# Fraction of validation examples that are noise (class 0)
(val_y==0).mean()

0.9083798882681564

In [12]:
confusion = metrics.confusion_matrix(val_y, pred_y)
print(f"Confusion matrix:\n{confusion}")

# Normalize each row (one row per true class) by its total to get rates
print("\nNormalized confusion matrix:")
for row in confusion:
    print(row / row.sum())

Confusion matrix:
[[3229   23]
 [  56  272]]

Normalized confusion matrix:
[0.99292743 0.00707257]
[0.17073171 0.82926829]


Looking at the confusion matrix, do you think the model is doing well at classifying pulsars from radio wave signals? Is the model misclassifying noise as pulsars or missing pulsars in the data?

In [13]:
#step_3.solution()
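
The counts in the confusion matrix above can also be turned into per-class metrics. The sketch below copies the four printed counts by hand and relies on scikit-learn's layout, where rows are true labels and columns are predicted labels. The variable names are illustrative.

```python
# Counts copied from the confusion matrix printed above.
# Rows are true labels, columns are predicted labels (0 = noise, 1 = pulsar).
true_negatives, false_positives = 3229, 23
false_negatives, true_positives = 56, 272

# Recall: of all real pulsars, what fraction did the model find?
recall = true_positives / (true_positives + false_negatives)

# Precision: of all signals flagged as pulsars, what fraction were real?
precision = true_positives / (true_positives + false_positives)

# False positive rate: of all noise signals, what fraction were flagged as pulsars?
false_positive_rate = false_positives / (false_positives + true_negatives)

print(f"Recall (pulsars):     {recall:.3f}")
print(f"Precision (pulsars):  {precision:.3f}")
print(f"False positive rate:  {false_positive_rate:.3f}")
```

Recall and the false positive rate correspond to entries in the second and first rows of the normalized matrix, respectively. Precision is not visible there because it normalizes by column (predicted label) rather than by row.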

## Thinking about unbalanced classes

Roughly 91% of this data is made up of noise signals. If it were 99% noise instead, would an accuracy of 98% still be good?

In [14]:
#step_4.solution()
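
A common way to judge accuracy on unbalanced data is to compare it against a majority-class baseline, that is, a model that always predicts the most frequent class. The sketch below uses synthetic labels (990 noise, 10 pulsars) purely for illustration. The 99% split is the hypothetical from the question above, not a property of this dataset.

```python
from collections import Counter

# Synthetic labels for illustration: 99% noise (0), 1% pulsars (1)
labels = [0] * 990 + [1] * 10

# A baseline "model" that always predicts the most common class
majority_class = Counter(labels).most_common(1)[0][0]
baseline_predictions = [majority_class] * len(labels)

# Accuracy of the baseline: fraction of labels it gets right
baseline_accuracy = sum(
    t == p for t, p in zip(labels, baseline_predictions)
) / len(labels)

# Count the pulsars the baseline correctly identifies
pulsars_found = sum(
    1 for t, p in zip(labels, baseline_predictions) if t == 1 and p == 1
)

print("Baseline accuracy:", baseline_accuracy)
print("Pulsars found:", pulsars_found)
```

An accuracy figure is only informative relative to this baseline, which is why it helps to check the confusion matrix alongside it.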