# Capstone Project 5: Predicting A Pulsar Star

### Overview

[Pulsars](https://www.youtube.com/watch?v=gjLk_72V9Bw) are a rare type of [neutron star](https://www.youtube.com/watch?v=hCwDNXKlN8Q) that produces radio emission detectable on Earth. They are of considerable scientific interest as probes of space-time, the interstellar medium, and states of matter.

You can learn more about pulsars and neutron stars by watching this [video](https://www.youtube.com/watch?v=RrMvUL8HFlM).

As pulsars rotate, their emission beam sweeps across the sky, and when this crosses our line of sight, it produces a detectable pattern of broadband radio emission.

<img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/lightcurve_pulsar.gif'>

Each pulsar produces a slightly different emission pattern, which also varies slightly with each rotation. Thus, a potential signal detection, known as a 'candidate', is averaged over many rotations of the pulsar, as determined by the length of the observation.

Machine learning tools are now being used to label pulsar candidates automatically, which speeds up analysis. Classification algorithms in particular are widely adopted; they treat the candidate datasets as binary classification problems (predict either `0` or `1`). Here, the legitimate pulsar examples form a minority positive class (fewer in number), and the remaining examples form the majority negative class.

The class labels used are `0` (negative class) and `1` (positive class). Hence, **we need to deploy an XGBoost classifier that can accurately detect the class `1` examples.**


---

### Attribute Information

Each candidate is described by 8 continuous variables and a single class variable. The first four are simple statistics obtained from the integrated pulse profile (folded profile). This is an array of continuous variables that describe a longitude-resolved version of the signal that has been averaged in both time and frequency. The remaining four variables are similarly obtained from the DM-SNR curve. **You do not have to worry about what they actually mean.** These 8 variables are summarised below:

1. The mean of the integrated profile.

2. The standard deviation of the integrated profile.

3. Excess kurtosis of the integrated profile.

4. The skewness of the integrated profile.

5. The mean of the DM-SNR curve.

6. The standard deviation of the DM-SNR curve.

7. Excess kurtosis of the DM-SNR curve.

8. The skewness of the DM-SNR curve.

Source: https://archive.ics.uci.edu/ml/datasets/HTRU2

**Courtesy**

Dr Robert Lyon

School of Physics and Astronomy, University of Manchester, United Kingdom.

robert.lyon@manchester.ac.uk



---

#### Getting Started:

1. Click on this link to open the Colab file for this project.

    https://colab.research.google.com/drive/11R6mh2c4cIscoD2NIcU69fiO6aWdgJJC

2. Create a duplicate copy of the Colab file as described below.

  - Click on the **File menu**. A new drop-down list will appear.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/lesson-0/0_file_menu.png' width=500>

  - Click on the **Save a copy in Drive** option. A duplicate copy will be created and will open in a new tab in your web browser.

  <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/lesson-0/1_create_colab_duplicate_copy.png' width=500>

3. After creating the duplicate copy of the notebook, please rename it in the **YYYY-MM-DD_StudentName_CapstoneProject5** format.

4. Now, write your code in the prescribed code cells.


---

### Project Requirements

1. Create a pandas DataFrame for both the train and test datasets.

2. Count the number of `0` and `1` classes in the training dataset.

3. Separate the feature variables, i.e., `x_train` and `x_test`, from both the DataFrames.

4. Separate the target variable, i.e., `y_train` and `y_test`, from both the DataFrames.

5. Apply the `XGBClassifier` machine learning model to predict the `0` and `1` classes in the test dataset, i.e, `x_test`.

6. Print the confusion matrix and the classification report to evaluate your prediction model. Also, based on the confusion matrix, precision, recall and f1-score values, report whether the prediction model deployed by you is making accurate predictions or not.


---

#### 1. Create The DataFrames

Create the Pandas DataFrames for both the training and test datasets and store them in `train_df` and `test_df` variables respectively.

You can get the datasets from the following links:

1. Train dataset: https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/project-5/pulsar-star-prediction-train.csv

2. Test dataset: https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/project-5/pulsar-star-prediction-test.csv

In [None]:
# Create the DataFrames for both the train and test datasets and store them in the 'train_df' and 'test_df' variables respectively.
import pandas as pd
# Links to the train and test datasets.
train_url = 'https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/project-5/pulsar-star-prediction-train.csv'
test_url = 'https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/project-5/pulsar-star-prediction-test.csv'
train_df = pd.read_csv(train_url)
test_df = pd.read_csv(test_url)
# Only the last expression in a cell is displayed, so this shows 'test_df'.
test_df

Unnamed: 0,target_class,Mean of the integrated profile,Standard deviation of the integrated profile,Excess kurtosis of the integrated profile,Skewness of the integrated profile,Mean of the DM-SNR curve,Standard deviation of the DM-SNR curve,Excess kurtosis of the DM-SNR curve,Skewness of the DM-SNR curve
0,0,116.906250,48.920605,0.186046,-0.129815,3.037625,17.737102,8.122621,78.813405
1,1,75.585938,34.386254,2.025498,8.652913,3.765050,21.897049,7.048189,55.878791
2,0,103.273438,46.996628,0.504295,0.821088,2.244983,15.622566,9.330498,105.134941
3,1,101.078125,48.587487,1.011427,1.151870,81.887960,81.464136,0.485105,-1.117904
4,0,113.226562,48.608804,0.291538,0.292120,6.291806,26.585056,4.540138,21.708268
...,...,...,...,...,...,...,...,...,...
5902,0,120.140625,45.794673,0.466357,0.528349,2.286789,16.909127,9.823143,111.061145
5903,0,145.445312,45.821803,-0.175460,0.401138,3.040970,16.063758,7.755287,80.304937
5904,0,117.875000,51.089535,-0.091320,-0.268881,2.075251,12.376724,10.736269,167.163044
5905,0,126.812500,46.832606,0.291279,0.183496,1.045151,9.514711,17.821202,431.240745
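
The missing-value output later in this notebook suggests that the feature column names carry a leading space (for example, ` Mean of the integrated profile`). The sketch below shows one way to strip that whitespace. It reads a tiny in-memory CSV instead of the project file, so `csv_text` and `demo_df` are illustrative names and not part of the project.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the project CSV (illustrative, not the real file).
# Note the leading space in the second column name.
csv_text = (
    "target_class, Mean of the integrated profile\n"
    "0,111.109375\n"
    "1,52.335938\n"
)

demo_df = pd.read_csv(io.StringIO(csv_text))
print(list(demo_df.columns))

# Strip surrounding whitespace from every column name.
demo_df.columns = demo_df.columns.str.strip()
print(list(demo_df.columns))
```

Cleaning the names is optional for this project, because the later steps select the features by position, but it avoids surprises if you select a column by name.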


---

#### 2. Display First Five Rows

Display the first five rows of both the `train_df` and `test_df` DataFrames.

In [None]:
# Display the first 5 rows of the 'train_df' DataFrame using the 'head()' function.
train_df.head()

Unnamed: 0,target_class,Mean of the integrated profile,Standard deviation of the integrated profile,Excess kurtosis of the integrated profile,Skewness of the integrated profile,Mean of the DM-SNR curve,Standard deviation of the DM-SNR curve,Excess kurtosis of the DM-SNR curve,Skewness of the DM-SNR curve
0,0,111.109375,53.131064,0.280253,-0.222447,3.011706,20.35582,7.585482,62.38327
1,0,151.945312,47.97335,-0.250834,0.275367,2.115385,14.195484,11.640297,173.592172
2,1,52.335938,34.775008,2.478375,10.179171,8.230769,34.775947,4.652342,21.987882
3,0,121.5625,48.569498,-0.033391,-0.323514,2.595318,15.089924,8.734079,98.584122
4,0,133.664062,59.137852,-0.164198,-0.552877,1.542642,12.052034,12.226623,196.522501


In [None]:
# Display the first 5 rows of the 'test_df' DataFrame.
test_df.head()

Unnamed: 0,target_class,Mean of the integrated profile,Standard deviation of the integrated profile,Excess kurtosis of the integrated profile,Skewness of the integrated profile,Mean of the DM-SNR curve,Standard deviation of the DM-SNR curve,Excess kurtosis of the DM-SNR curve,Skewness of the DM-SNR curve
0,0,116.90625,48.920605,0.186046,-0.129815,3.037625,17.737102,8.122621,78.813405
1,1,75.585938,34.386254,2.025498,8.652913,3.76505,21.897049,7.048189,55.878791
2,0,103.273438,46.996628,0.504295,0.821088,2.244983,15.622566,9.330498,105.134941
3,1,101.078125,48.587487,1.011427,1.15187,81.88796,81.464136,0.485105,-1.117904
4,0,113.226562,48.608804,0.291538,0.29212,6.291806,26.585056,4.540138,21.708268


---

#### 3. Display Last Five Rows

Display the last five rows of both the `train_df` and `test_df` DataFrames.

In [None]:
# Display the last 5 rows of the 'train_df' DataFrame using the 'tail()' function.
train_df.tail()

Unnamed: 0,target_class,Mean of the integrated profile,Standard deviation of the integrated profile,Excess kurtosis of the integrated profile,Skewness of the integrated profile,Mean of the DM-SNR curve,Standard deviation of the DM-SNR curve,Excess kurtosis of the DM-SNR curve,Skewness of the DM-SNR curve
11986,0,124.3125,53.179053,-0.012418,-0.556021,7.186455,29.308266,4.531382,21.725143
11987,0,115.617188,46.7846,0.218177,0.226757,6.140468,30.271961,5.732201,34.357283
11988,0,116.03125,43.213846,0.663456,0.433088,0.785117,11.628149,17.055215,312.204325
11989,0,135.664062,49.933749,-0.08994,-0.226726,3.859532,21.501505,7.398395,62.334018
11990,0,120.726562,50.472256,0.346178,0.184797,0.769231,11.792603,17.662222,329.548016


In [None]:
# Display the last 5 rows of the 'test_df' DataFrame.
test_df.tail()

Unnamed: 0,target_class,Mean of the integrated profile,Standard deviation of the integrated profile,Excess kurtosis of the integrated profile,Skewness of the integrated profile,Mean of the DM-SNR curve,Standard deviation of the DM-SNR curve,Excess kurtosis of the DM-SNR curve,Skewness of the DM-SNR curve
5902,0,120.140625,45.794673,0.466357,0.528349,2.286789,16.909127,9.823143,111.061145
5903,0,145.445312,45.821803,-0.17546,0.401138,3.04097,16.063758,7.755287,80.304937
5904,0,117.875,51.089535,-0.09132,-0.268881,2.075251,12.376724,10.736269,167.163044
5905,0,126.8125,46.832606,0.291279,0.183496,1.045151,9.514711,17.821202,431.240745
5906,0,154.835938,46.133938,-0.313312,0.082159,7.258361,30.115319,5.232192,30.444595


---

#### 4. Rows & Columns In The DataFrames

Find the number of rows and columns in both the `train_df` and `test_df` DataFrames.

In [None]:
# Print the number of rows and columns in both the DataFrames using the 'shape' attribute.
print(train_df.shape)
print(test_df.shape)

(11991, 9)
(5907, 9)


---

#### 5. Check For Missing Values

Check whether any column in the `train_df` or `test_df` DataFrame has missing values.

In [None]:
# Check for the missing values in the 'train_df' DataFrame.
train_df.isnull().sum()

target_class                                     0
 Mean of the integrated profile                  0
 Standard deviation of the integrated profile    0
 Excess kurtosis of the integrated profile       0
 Skewness of the integrated profile              0
 Mean of the DM-SNR curve                        0
 Standard deviation of the DM-SNR curve          0
 Excess kurtosis of the DM-SNR curve             0
 Skewness of the DM-SNR curve                    0
dtype: int64

**Hint**: You can get the total number of missing values in each column by chaining the `sum()` function after the `isnull()` function.

In [None]:
# Check for the missing values in the 'test_df' DataFrame.
test_df.isnull().sum()

target_class                                     0
 Mean of the integrated profile                  0
 Standard deviation of the integrated profile    0
 Excess kurtosis of the integrated profile       0
 Skewness of the integrated profile              0
 Mean of the DM-SNR curve                        0
 Standard deviation of the DM-SNR curve          0
 Excess kurtosis of the DM-SNR curve             0
 Skewness of the DM-SNR curve                    0
dtype: int64
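
Both outputs above are all zeros, so it may help to see what the same check looks like when values are actually missing. The sketch below uses a small made-up DataFrame (`toy_df` and its column names are hypothetical) and also shows two compact variants of the check.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two missing entries (not the project data).
toy_df = pd.DataFrame({
    "target_class": [0, 1, 0, 0],
    "mean_profile": [111.1, np.nan, 121.6, 133.7],
    "std_profile": [53.1, 34.8, np.nan, 59.1],
})

# Per-column count of missing values.
missing_per_column = toy_df.isnull().sum()
print(missing_per_column)

# Single True/False answer for the whole DataFrame.
has_missing = bool(toy_df.isnull().values.any())
print(has_missing)

# Total number of missing cells.
total_missing = int(missing_per_column.sum())
print(total_missing)
```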

---

#### 6. Count The `0` & `1` Classes

Find out the number of `0` and `1` values from the `target_class` column in both the `train_df` and `test_df` DataFrames.

In [None]:
# Print the count of the '0' and '1' classes in the 'train_df' DataFrame using the 'value_counts()' function.
train_df['target_class'].value_counts()

0    10878
1     1113
Name: target_class, dtype: int64

In [None]:
# Print the count of the '0' and '1' classes in the 'test_df' DataFrame.
test_df['target_class'].value_counts()

0    5381
1     526
Name: target_class, dtype: int64
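
To make the class imbalance concrete, the counts above can be turned into proportions. This is a small arithmetic sketch using only the numbers printed above; the helper name `positive_share` is made up for illustration.

```python
# Class counts copied from the value_counts() outputs above.
train_counts = {0: 10878, 1: 1113}
test_counts = {0: 5381, 1: 526}

def positive_share(counts):
    """Fraction of examples that belong to class 1."""
    return counts[1] / (counts[0] + counts[1])

train_share = positive_share(train_counts)
test_share = positive_share(test_counts)
print(f"train: {train_share:.1%} pulsars")
print(f"test:  {test_share:.1%} pulsars")

# Accuracy of a trivial model that always predicts class 0 on the test set.
majority_baseline = test_counts[0] / (test_counts[0] + test_counts[1])
print(f"always-0 baseline accuracy: {majority_baseline:.1%}")
```

Roughly 9% of the examples in each set are pulsars, so a model that always predicts `0` would already be right about 91% of the time on the test set. This is why the project asks for precision, recall and f1-score rather than accuracy alone.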

---

#### 7. Feature Variables Extraction

Get the feature variables, i.e., `x_train` and `x_test` from both the `train_df` and `test_df` DataFrames respectively. Then, display the first 5 rows of `x_train` and `x_test` DataFrames.

In [None]:
# Get the feature variables from the 'train_df' DataFrame.
x_train = train_df.iloc[:, 1:]
# Display the first 5 rows of the 'x_train' DataFrame.
x_train.head()

Unnamed: 0,Mean of the integrated profile,Standard deviation of the integrated profile,Excess kurtosis of the integrated profile,Skewness of the integrated profile,Mean of the DM-SNR curve,Standard deviation of the DM-SNR curve,Excess kurtosis of the DM-SNR curve,Skewness of the DM-SNR curve
0,111.109375,53.131064,0.280253,-0.222447,3.011706,20.35582,7.585482,62.38327
1,151.945312,47.97335,-0.250834,0.275367,2.115385,14.195484,11.640297,173.592172
2,52.335938,34.775008,2.478375,10.179171,8.230769,34.775947,4.652342,21.987882
3,121.5625,48.569498,-0.033391,-0.323514,2.595318,15.089924,8.734079,98.584122
4,133.664062,59.137852,-0.164198,-0.552877,1.542642,12.052034,12.226623,196.522501


**Hint**: You can get the feature variables from the `train_df` DataFrame using the `iloc[]` indexer. Because `target_class` is the first column, the slice `1:` selects all the remaining columns.

`x_train = train_df.iloc[:, 1:]`

In [None]:
# Get the feature variables from the 'test_df' DataFrame.
x_test = test_df.iloc[:, 1:]
# Display the first 5 rows of the 'x_test' DataFrame.
x_test.head()

Unnamed: 0,Mean of the integrated profile,Standard deviation of the integrated profile,Excess kurtosis of the integrated profile,Skewness of the integrated profile,Mean of the DM-SNR curve,Standard deviation of the DM-SNR curve,Excess kurtosis of the DM-SNR curve,Skewness of the DM-SNR curve
0,116.90625,48.920605,0.186046,-0.129815,3.037625,17.737102,8.122621,78.813405
1,75.585938,34.386254,2.025498,8.652913,3.76505,21.897049,7.048189,55.878791
2,103.273438,46.996628,0.504295,0.821088,2.244983,15.622566,9.330498,105.134941
3,101.078125,48.587487,1.011427,1.15187,81.88796,81.464136,0.485105,-1.117904
4,113.226562,48.608804,0.291538,0.29212,6.291806,26.585056,4.540138,21.708268
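
Selecting features by position works here because `target_class` is the first column. As a hedged alternative, the sketch below selects the features by dropping the target column by name, which keeps working even if the column order changes. The toy DataFrame and its feature names are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with the target in the first column, as in 'train_df'.
toy_df = pd.DataFrame({
    "target_class": [0, 1, 0],
    "feature_a": [111.1, 52.3, 121.6],
    "feature_b": [53.1, 34.8, 48.6],
})

# Position-based selection: every column after the first.
x_by_position = toy_df.iloc[:, 1:]

# Name-based selection: every column except the target.
x_by_name = toy_df.drop(columns="target_class")

# Both approaches give the same features when the target is the first column.
same_features = x_by_position.equals(x_by_name)
print(same_features)
print("target_class" in x_by_name.columns)
```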


---

#### 8. Target Variable Extraction

Get the target variables, i.e., `y_train` and `y_test` from both the `train_df` and `test_df` DataFrames respectively. Then, display the first 5 rows of `y_train` and `y_test` Pandas series.

In [None]:
# Get the target variable from the 'train_df' DataFrame.
y_train = train_df['target_class']
# Display the first 5 rows of the 'y_train' Pandas series.
y_train.head()

0    0
1    0
2    1
3    0
4    0
Name: target_class, dtype: int64

**Hint**: You can extract the target class and store it in the `y_train` variable using `y_train = train_df['target_class']`

In [None]:
# Get the target variable from the 'test_df' DataFrame.
y_test = test_df['target_class']
# Display the first 5 rows of the 'y_test' Pandas series.
y_test.head()

0    0
1    1
2    0
3    1
4    0
Name: target_class, dtype: int64

---

#### 9. Building An XGBoost Classifier Model

Follow the steps below to build an **XGBoost Classifier Model**:

- **Step 1**: Import the `xgboost` module with the alias `xg`.

- **Step 2**: Import the `confusion_matrix` and `classification_report` functions from `sklearn.metrics`.

- **Step 3**: Create the XGBoost classifier using the `XGBClassifier()` function and store it in a variable.

 `xgb_clf = xg.XGBClassifier()`

- **Step 4**: Fit the XGBoost classifier model using the `fit()` function.

 `xgb_clf.fit(x_train, y_train)`

In [None]:
# Build an XGBoost Classifier model
# Importing the 'xgboost' module with the alias 'xg'.
import xgboost as xg
# Importing the 'confusion_matrix' and 'classification_report' functions from 'sklearn.metrics'.
from sklearn.metrics import confusion_matrix, classification_report
# Store the xgb classifier in a variable using 'xgb_clf = xg.XGBClassifier()'
xgb_clf = xg.XGBClassifier()
# Fit the classifier model using 'xgb_clf.fit(x_train, y_train)'
xgb_clf.fit(x_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [None]:
# Predict the target variable based on the feature variables of the test DataFrame.
y_pred = xgb_clf.predict(x_test)
y_pred

array([0, 1, 0, ..., 0, 0, 0])

**Hint**: You can predict the target variable using the `predict()` function.

`y_pred = xgb_clf.predict(x_test)`.

---

#### 10. Confusion Matrix & Classification Report

Print the confusion matrix and classification report to evaluate the model. Interpret and report the results obtained from the confusion matrix and the classification report.

In [None]:
# Print the confusion matrix to see the number of TN, FN, TP and FP.
confusion_matrix(y_test, y_pred)

array([[5348,   33],
       [  82,  444]])

**Hint**: You can print the confusion matrix by passing the `y_test` and `y_pred` variables inside the `confusion_matrix()` function.

`confusion_matrix(y_test, y_pred)`.
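
To read the matrix correctly, it helps to know its layout: in scikit-learn, rows correspond to the actual classes and columns to the predicted classes. The sketch below demonstrates this on a tiny hand-made example; the label lists are hypothetical and not taken from the project data.

```python
from sklearn.metrics import confusion_matrix

# Tiny hand-made example (hypothetical labels, not the project data).
y_true_demo = [0, 0, 0, 1, 1, 0]
y_pred_demo = [0, 1, 0, 1, 0, 0]

cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# Rows are actual classes, columns are predicted classes, so for labels
# 0 and 1 the flattened order is TN, FP, FN, TP.
tn, fp, fn, tp = (int(v) for v in cm.ravel())
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")  # TN=3, FP=1, FN=1, TP=1
```

Applied to the project output above, this means 5348 true negatives, 33 false positives, 82 false negatives and 444 true positives.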

In [None]:
# Print the precision, recall and f1-score values for both the '0' and '1' classes.
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.98      0.99      0.99      5381
           1       0.93      0.84      0.89       526

    accuracy                           0.98      5907
   macro avg       0.96      0.92      0.94      5907
weighted avg       0.98      0.98      0.98      5907



**Hint**: You can get the classification report by passing the `y_test` and `y_pred` variables inside the `classification_report()` function. Wrap the call in `print()` so that the report is displayed as a formatted table.

`print(classification_report(y_test, y_pred))`
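
The class `1` row of the report can be reproduced by hand from the confusion matrix, which is a useful check on how the two outputs relate. The sketch below uses only the four numbers printed in the confusion matrix above.

```python
# Cells of the confusion matrix reported above: [[TN, FP], [FN, TP]].
tn, fp = 5348, 33
fn, tp = 82, 444

# Metrics for class 1 (pulsars).
precision_1 = tp / (tp + fp)  # of the predicted pulsars, how many are real
recall_1 = tp / (tp + fn)     # of the real pulsars, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Overall accuracy across both classes.
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision_1:.2f} recall={recall_1:.2f} "
      f"f1={f1_1:.2f} accuracy={accuracy:.2f}")
# precision=0.93 recall=0.84 f1=0.89 accuracy=0.98
```

These values match the class `1` row and the accuracy row of the report. Recall is the lowest of them: 82 of the 526 real pulsars in the test set were predicted as class `0`.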

**Write your interpretation of the results here.**



---

### Submitting the Project:

1. After finishing the project, click on the **Share** button on the top right corner of the notebook. A new dialog box will appear.

  <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/2_share_button.png' width=500>

2. In the dialog box, make sure that the '**Anyone on the Internet with this link can view**' option is selected and then click on the **Copy link** button.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/3_copy_link.png' width=500>

3. The link to the duplicate copy of the notebook (named **YYYY-MM-DD_StudentName_CapstoneProject5**) will be copied.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/4_copy_link_confirmation.png' width=500>

4. Go to your dashboard and click on the **My Projects** option.
   
   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/5_student_dashboard.png' width=800>

  <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/6_my_projects.png' width=800>

5. Click on the **View Project** button for the project you want to submit.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/7_view_project.png' width=800>

6. Click on the **Submit Project Here** button.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/8_submit_project.png' width=800>

7. Paste the link to the project file named as **YYYY-MM-DD_StudentName_CapstoneProject5** in the URL box and then click on the **Submit** button.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/9_enter_project_url.png' width=800>

---