# Lesson 16: Hunting Exoplanets In Space - Deploying A Prediction Model

---


### Teacher-Student Activities

In the last class, we learnt how to create line plots and scatter plots to visualise data. 

In this class, we will deploy a **prediction model** to predict which stars in the test dataset have a planet and which do not. 

**How does a prediction model work?**

**1. Learning:**
- The prediction model, through the training dataset, will learn the properties of a star that has a planet and also the properties of a star which does not have a planet. 

<img src="https://curriculum.whitehatjr.com/APT+Asset/APT+C16/whj-galaxy-more+planet-apt-c16-01.png" height=400/>


**2. Testing and Prediction:**  

- Once the model has learnt the required properties, it will look for these properties in the test dataset and according to the properties it sees, it will predict whether a star has a planet or not.

<img src="https://curriculum.whitehatjr.com/APT+Asset/APT+C16/whj-galaxy-less+planet-apt-c16-01.png" height=400/>

---

In this class, we will learn how to deploy the **Random Forest Classifier** model (a machine learning algorithm). 

**What is Machine Learning?**

- Machine learning is a branch of artificial intelligence in which a machine learns patterns from data on its own, without being explicitly programmed by a human for each task. 

```
TEACHER
    Without machine learning, we need to tell the machine what to do. 
```

<img src="https://curriculum.whitehatjr.com/APT+Asset/APT+C16/whj-comp-boy-apt-c16.gif" height=400 >



With machine learning, the machine will figure out what to do by itself, using the inputs we feed it.
    
- First, the algorithmic models are **trained** to perform specific tasks by learning patterns from the data they see, rather than by following rules hand-written by a human expert.

- After training, our model is ready for making predictions for any new input.

```
TEACHER
   In this image, you can see that the machine is learning from the data 
   provided. For example, it has learned what a cup looks like from the alphabet 
   chart. Now, whenever the machine sees a cup again, it can easily recognize 
   that it is a cup.

```
<img src="https://curriculum.whitehatjr.com/APT+Asset/APT+C16/whj-machine-tarining-apt-c16-01.png" />


<img src="https://curriculum.whitehatjr.com/APT+Asset/APT+C16/whj-comp-tea+cup-apt-c16-01.png" />



- For our model, the machine will learn on its own to recognise the flux values of stars that have a planet.
- When a new dataset containing only the flux values of stars is shown to the machine, it will tell whether each star has a planet or not.

There are many machine learning models or algorithms to do this kind of prediction. One of them is **Random Forest Classifier**. 

---
**Random Forest Classifier:**

- It is used to classify outcomes into classes (or labels) based on some features. 
- For instance, an animal that makes a 'Meow! Meow!' sound is classified (or labelled) as a cat, an animal that makes a 'Woof! Woof!' sound is classified as a dog, an animal which makes a hissing sound is classified as a snake, etc.
- It uses the training data to learn how the given features relate to a particular class. 
- **For example:** A classifier can classify a given population into the following two classes:
  - Those suffering from Diabetes.
  - Those not suffering from Diabetes.

  on the basis of different features of the population such as age, gender, blood pressure etc. 

<img src="https://curriculum.whitehatjr.com/APT+Asset/APT+C16/whj-pateint-database-apt-c16-01.png" height=400/>

In this lesson, we will use the Random Forest Classifier model to classify whether a star has a planet or not. 
- The stars which have at least one planet are labelled as `2` 
- The stars not having a planet are labelled as `1`. 


Let's run all the code cells that we have already covered in the previous classes and begin this class from the **Activity 1: The `value_counts()` Function** section. You should also run the code cells up to the first activity.



---

#### Loading The Datasets

Create a Pandas DataFrame every time you start a Jupyter notebook.

Dataset links (Don't click on them):

1. Train dataset 

   
2. Test dataset 
   
   

In [None]:
# Loading both the training and test datasets.
import pandas as pd

exo_train_df = pd.read_csv('')
exo_test_df = pd.read_csv('')

In [None]:
# Number of rows and columns in the DataFrames.
print(exo_train_df.shape)
exo_test_df.shape

(5087, 3198)


(570, 3198)

In the previous classes, we have already checked that the training dataset does not have any missing values. So, we can skip the missing values check.

---

#### Activity 1: The `value_counts()` Function
Our prediction model should classify the stars either as `1` or `2`. Let's find out how many stars in the test dataset are labelled as `1` and how many as `2`.

To compute how many times a value occurs in a series, we use the `value_counts()` function.

In [None]:
# Student Action: Count the number of times a value occurs in a Pandas series.
exo_test_df['LABEL'].value_counts()

1    565
2      5
Name: LABEL, dtype: int64

There are `565` stars labelled as `1` and `5` stars labelled as `2`, which means only `5` stars have a planet. Interestingly, if our prediction model mindlessly classifies every star as `1`, it still appears to be a very accurate model. Why?

Because the accuracy of a model is calculated as **the percentage of correct predictions out of the total number of predictions**. In this case, the percentage of correct predictions is

 $\frac{565\times100}{570} \approx 99.12$ %

Thus, without actually deploying a proper prediction model, we can label the stars with about 99% accuracy. 

This is **WRONG**! We need to be careful here because the data is highly imbalanced. The ultimate goal of the Kepler space telescope is to detect exoplanets in outer space. Hence, a machine learning model trained on this data should also **correctly** detect the stars having planets. This means a prediction model will be considered useful only if it correctly detects almost all the stars having a planet.

So, a prediction model which always labels every star as `1` is useless, because it does not detect a single star having a planet.
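The accuracy calculation above can be checked with a few lines of plain Python. This is only a sketch: the label lists are rebuilt from the counts printed by `value_counts()` rather than read from the dataset, and the names `actual_labels`, `always_one_predictions` and `accuracy_percent` are made up for this illustration.

```python
# Labels rebuilt from the counts above: 565 stars labelled 1 and 5 stars labelled 2.
actual_labels = [1] * 565 + [2] * 5

# A "mindless" model that labels every star as 1.
always_one_predictions = [1] * len(actual_labels)

def accuracy_percent(actual, predicted):
  # Accuracy = (number of correct predictions / total predictions) * 100
  num_correct = sum(a == p for a, p in zip(actual, predicted))
  return num_correct * 100 / len(actual)

baseline_accuracy = accuracy_percent(actual_labels, always_one_predictions)
print(round(baseline_accuracy, 2))  # 99.12
```

The score is high only because `565` out of `570` stars do not have a planet; the calculation never checks whether any planet was actually found.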

Now, we are going to deploy the Random Forest Classifier model so that it can detect all the five (or at least three) stars having a planet.

---

### Random Forest Classifier

A Random Forest is a collection (a.k.a. ensemble) of many decision trees. A decision tree is a flow chart which separates data based on some conditions. If a condition is true, you follow one path; otherwise, you follow another path.


For example, in the case of finding a star having a planet, you can construct the following decision tree: 

<img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/lesson-14/decision-tree.png' width=600>

You could first ask whether there is a decrease in the flux values of a star. If the answer is no, then the star does not have a planet. However, if the answer is yes, then you could ask another question to check whether the decrease is periodic or not. Again, if the answer is no, then the star does not have a planet. Otherwise, it has a planet.

This is one of the examples of a decision tree. Based on a problem, the decision tree could get more and more complex. 
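The two questions of this decision tree can be written as nested conditions. The sketch below is only an illustration of the flow chart; the function name `star_has_planet` and its two yes/no inputs are hypothetical, and a real decision tree learns its conditions from the data instead of having them typed in by hand.

```python
def star_has_planet(flux_decreases, decrease_is_periodic):
  # Question 1: Is there a decrease in the flux values of the star?
  if not flux_decreases:
    return False  # No decrease -> no planet.
  # Question 2: Is the decrease periodic?
  if not decrease_is_periodic:
    return False  # A one-off decrease -> no planet.
  return True     # A periodic decrease -> the star has a planet.

print(star_has_planet(False, False))  # False
print(star_has_planet(True, False))   # False
print(star_has_planet(True, True))    # True
```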

A collection of `N` decision trees is a random forest, wherein each tree gives some predicted value (in this case either class `1` or class `2`). 


<img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/lesson-14/rfc-image.jpg' width=800>

<h3><b>For example:</b></h3>

```
TEACHER
   For example, a random forest model is used to classify whether an item is a 
   pen or a pencil. There are 3 decision trees, each predicting a class for the 
   item. The final predicted value is the majority class, which in this case is 
   Class 2 i.e. Pen.
```

<img src="https://curriculum.whitehatjr.com/APT+Asset/APT+C16/Untitled-13-01.png" height=400>




The final predicted value is the majority class, i.e., the class that is predicted by the largest number of decision trees in a random forest.
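The idea of a majority class can be sketched with Python's built-in `Counter`. This is a simplified illustration of majority voting, not the internal code of any library; the function name `majority_class` and the list of tree predictions are made up for this example.

```python
from collections import Counter

def majority_class(tree_predictions):
  # Count how many trees predicted each class.
  votes = Counter(tree_predictions)
  # 'most_common(1)' returns a list holding the (class, count) pair with the highest count.
  return votes.most_common(1)[0][0]

# Three decision trees: two of them predict class 2 (Pen), one predicts class 1 (Pencil).
print(majority_class([2, 1, 2]))  # 2
```

Using an odd number of trees for two classes, as in the pen and pencil example, avoids a tie in the votes.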

For the time being, just consider the Random Forest Classifier as some kind of a black-box which classifies data into different classes (in this case, either class `1` or class `2`) by learning the properties of every class through a training dataset. 

---

#### Activity 2: Importing `RandomForestClassifier`

We need to import a class called `RandomForestClassifier` from a module called `sklearn.ensemble`. The `sklearn` (or **scikit-learn**) library is a collection of many machine learning modules. Using the **scikit-learn** library, almost every common machine learning algorithm can be applied directly without detailed knowledge of the underlying math. It is a kind of plug-and-play device.

You can read about it in the link provided in the **Activities** section under the title **`scikit-learn` - Random Forest Classifier**


In [None]:
# Teacher Action: Import the required class from the 'sklearn' library.
# Import the 'RandomForestClassifier' class from the 'sklearn.ensemble' module.
from sklearn.ensemble import RandomForestClassifier

---

#### Activity 3: Target & Feature Variables Separation

The `RandomForestClassifier` class has a function called `fit()` which takes two inputs: 
1. The first input is the collection of feature variables.
2. The second input is the target variable.  

Let us try to understand what feature variables and target variables are.

**Features and Target:**

<h3><font color=red><b>
Have you ever come across a facial recognition system? <img src="https://curriculum.whitehatjr.com/APT+Asset/Final+C2+images/question.jpg" height=25/></b></font></h3>

<img src="https://curriculum.whitehatjr.com/APT+Asset/APT+C16/whj-face-recognition-apt-c16-01.png"/>


Face recognition is a biometric identification process which is used to authenticate a person using their **facial features**. The system examines various facial **features** such as eyes, nose and lips to authenticate a target user.  

<img src="https://curriculum.whitehatjr.com/APT+Asset/APT+C16/eyenoselip.png" height=400/>

Similarly, the machine learning algorithm uses various features to predict the outcome. 

```
NOTE for the TEACHER:
   The teacher can also give following example if the student is still
   not able to understand the concept of features and target. 

  Example: An algorithm may analyse different features of an apple like size, 
  color, taste etc to uniquely identify an apple. 
  In this case, 
  features --> size, color, taste
  target   --> 'apple' or 'not apple'
```


For our dataset, we will use the flux values of a star, i.e., the `FLUX.1` to `FLUX.3197` columns, to find out whether that star has a planet or not.

 
**Feature Variables:** The feature variables are those variables which describe the properties of an entity and are used as inputs to make a prediction.

**Target variable:** The variable which needs to be predicted is called a target variable.



For our dataset, 
- `FLUX.1` to `FLUX.3197` are feature variables. Hence, the values stored in these columns are the features of a star in the exoplanets dataset.
- `LABEL` is the target variable because the prediction model needs to predict which star belongs to which class in the test dataset. Hence, the values stored in the `LABEL` column are the target values. 

<img src="https://curriculum.whitehatjr.com/APT+Asset/APT+C16/WHJ-feature+variable-target-apt-c16-01.png"/>

So, we need to extract the target variable and the feature variables separately from the training dataset. 

Let's store the feature variables in the `x_train` variable and the target variable in the `y_train` variable. We will separate the features using the `iloc[]` function. 

Recall the syntax for `iloc[]` function.

**Syntax:** 

`dataframe_name.iloc[row_position_start : row_position_end, column_position_start : column_position_end]`


We need all the rows from the training set. So, inside the `iloc[]` function, we will enter the colon (`:`) sign to get all the rows. We do not need the first column, i.e., the `LABEL` column. Therefore, inside the `iloc[]` function, as part of column indexing, enter `1` as the starting index followed by the colon (`:`) sign to include the rest of the columns from the training dataset.
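Before applying this to the full training dataset, the same slicing can be tried on a tiny DataFrame. This is only a sketch: `toy_df` is a made-up three-row table with two flux columns, not the real dataset.

```python
import pandas as pd

# Toy DataFrame (illustrative values only): the first column is the label, the rest are features.
toy_df = pd.DataFrame({
  'LABEL': [2, 1, 1],
  'FLUX.1': [93.85, -38.88, 532.64],
  'FLUX.2': [83.81, -33.83, 535.92],
})

# All rows (:), columns from position 1 onwards (1:) -> only the feature columns.
toy_features = toy_df.iloc[:, 1:]

# All rows (:), only the column at position 0 -> the 'LABEL' column as a Pandas series.
toy_target = toy_df.iloc[:, 0]

print(list(toy_features.columns))  # ['FLUX.1', 'FLUX.2']
print(toy_target.tolist())         # [2, 1, 1]
```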


In [None]:
# Student Action: Extract the feature variables from the training dataset using the 'iloc[]' function.
x_train = exo_train_df.iloc[:, 1:]
x_train.head()

Unnamed: 0,FLUX.1,FLUX.2,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,FLUX.10,FLUX.11,FLUX.12,FLUX.13,FLUX.14,FLUX.15,FLUX.16,FLUX.17,FLUX.18,FLUX.19,FLUX.20,FLUX.21,FLUX.22,FLUX.23,FLUX.24,FLUX.25,FLUX.26,FLUX.27,FLUX.28,FLUX.29,FLUX.30,FLUX.31,FLUX.32,FLUX.33,FLUX.34,FLUX.35,FLUX.36,FLUX.37,FLUX.38,FLUX.39,FLUX.40,...,FLUX.3158,FLUX.3159,FLUX.3160,FLUX.3161,FLUX.3162,FLUX.3163,FLUX.3164,FLUX.3165,FLUX.3166,FLUX.3167,FLUX.3168,FLUX.3169,FLUX.3170,FLUX.3171,FLUX.3172,FLUX.3173,FLUX.3174,FLUX.3175,FLUX.3176,FLUX.3177,FLUX.3178,FLUX.3179,FLUX.3180,FLUX.3181,FLUX.3182,FLUX.3183,FLUX.3184,FLUX.3185,FLUX.3186,FLUX.3187,FLUX.3188,FLUX.3189,FLUX.3190,FLUX.3191,FLUX.3192,FLUX.3193,FLUX.3194,FLUX.3195,FLUX.3196,FLUX.3197
0,93.85,83.81,20.1,-26.98,-39.56,-124.71,-135.18,-96.27,-79.89,-160.17,-207.47,-154.88,-173.71,-146.56,-120.26,-102.85,-98.71,-48.42,-86.57,-0.84,-25.85,-67.39,-36.55,-87.01,-97.72,-131.59,-134.8,-186.97,-244.32,-225.76,-229.6,-253.48,-145.74,-145.74,30.47,-173.39,-187.56,-192.88,-182.76,-195.99,...,-167.69,-56.86,7.56,37.4,-81.13,-20.1,-30.34,-320.48,-320.48,-287.72,-351.25,-70.07,-194.34,-106.47,-14.8,63.13,130.03,76.43,131.9,-193.16,-193.16,-89.26,-17.56,-17.31,125.62,68.87,100.01,-9.6,-25.39,-16.51,-78.07,-102.15,-102.15,25.13,48.57,92.54,39.32,61.42,5.08,-39.54
1,-38.88,-33.83,-58.54,-40.09,-79.31,-72.81,-86.55,-85.33,-83.97,-73.38,-86.51,-74.97,-73.15,-86.13,-76.57,-61.27,-37.23,-48.53,-30.96,-8.14,-5.54,15.79,45.71,10.61,40.66,16.7,15.18,11.98,-203.7,19.13,19.13,19.13,19.13,19.13,17.02,-8.5,-13.87,-29.1,-34.29,-24.68,...,-36.75,-15.49,-13.24,20.46,-1.47,-0.4,27.8,-58.2,-58.2,-72.04,-58.01,-30.92,-13.42,-13.98,-5.43,8.71,1.8,36.59,-9.8,-19.53,-19.53,-24.32,-23.88,-33.07,-9.03,3.75,11.61,-12.66,-5.69,12.53,-3.28,-32.21,-32.21,-24.89,-4.86,0.76,-11.7,6.46,16.0,19.93
2,532.64,535.92,513.73,496.92,456.45,466.0,464.5,486.39,436.56,484.39,469.66,462.3,492.23,441.2,483.17,481.28,535.31,554.34,562.8,540.14,576.34,551.67,556.69,550.86,577.33,562.08,577.97,530.67,553.27,538.33,527.17,532.5,273.66,273.66,292.39,298.44,252.64,233.58,171.41,224.02,...,-51.09,-33.3,-61.53,-89.61,-69.17,-86.47,-140.91,-84.2,-84.2,-89.09,-55.44,-61.05,-29.17,-63.8,-57.61,2.7,-31.25,-47.09,-6.53,14.0,14.0,-25.05,-34.98,-32.08,-17.06,-27.77,7.86,-70.77,-64.44,-83.83,-71.69,13.31,13.31,-29.89,-20.88,5.06,-11.8,-28.91,-70.02,-96.67
3,326.52,347.39,302.35,298.13,317.74,312.7,322.33,311.31,312.42,323.33,311.14,326.19,313.11,313.89,317.96,330.92,341.1,360.58,370.29,369.71,339.0,336.24,319.31,321.56,308.02,296.82,279.34,275.78,289.67,281.33,285.37,281.87,88.75,88.75,67.71,74.46,69.34,76.51,80.26,70.31,...,-2.75,14.29,-14.18,-25.14,-13.43,-14.74,2.24,-31.07,-31.07,-50.27,-39.22,-51.33,-18.53,-1.99,10.43,-1.97,-15.32,-23.38,-27.71,-36.12,-36.12,-15.65,6.63,10.66,-8.57,-8.29,-21.9,-25.8,-29.86,7.42,5.71,-3.73,-3.73,30.05,20.03,-12.67,-8.77,-17.31,-17.35,13.98
4,-1107.21,-1112.59,-1118.95,-1095.1,-1057.55,-1034.48,-998.34,-1022.71,-989.57,-970.88,-933.3,-889.49,-888.66,-853.95,-800.91,-754.48,-717.24,-649.34,-605.71,-575.62,-526.37,-490.12,-458.73,-447.76,-419.54,-410.76,-404.1,-425.38,-397.29,-412.73,-446.49,-413.46,-1006.21,-1006.21,-973.29,-986.01,-975.88,-982.2,-953.73,-964.35,...,-694.76,-705.01,-625.24,-604.16,-668.26,-742.18,-820.55,-874.76,-874.76,-853.68,-808.62,-777.88,-712.62,-694.01,-655.74,-599.74,-617.3,-602.98,-539.29,-672.71,-672.71,-594.49,-597.6,-560.77,-501.95,-461.62,-468.59,-513.24,-504.7,-521.95,-594.37,-401.66,-401.66,-357.24,-443.76,-438.54,-399.71,-384.65,-411.79,-510.54


Now, let's get only the target variable from the training dataset.

In [None]:
# Student Action: Using the 'iloc[]' function, retrieve only the first column, i.e., the 'LABEL' column from the training dataset.
y_train = exo_train_df.iloc[:, 0] 
y_train.head()

0    2
1    2
2    2
3    2
4    2
Name: LABEL, dtype: int64

---

#### Activity 4: Fitting The Model

Now that we have separated the feature and target variables for deploying the `RandomForestClassifier` model, let's train the model with the feature variables using the `fit()` function. The steps to be followed are described below.

1. First, create a `RandomForestClassifier` object with inputs as `n_jobs=-1` and `n_estimators=50`. Store the object in a variable with the name `rf_clf`.

  For the time being, ignore the reason behind providing the `n_jobs=-1` parameter as an input.

  ```python
  rf_clf = RandomForestClassifier(n_jobs=-1, n_estimators=50)
  ```

  The `n_estimators` parameter defines the number of decision trees in a Random Forest. Therefore, `n_estimators=50` means that the forest contains `50` decision trees. If `n_estimators` parameter is not defined by a user, then by default, the forest contains `100` decision trees.

2. Call the `fit()` function with `x_train` and `y_train` as inputs.

  ```python
  rf_clf.fit(x_train, y_train)
  ```
3. Call the `score()` function with `x_train` and `y_train` as inputs to check the accuracy score of the model on the training data. This step is optional; you can skip it if you wish.
  
  ```python
  rf_clf.score(x_train, y_train)
  ```
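The three steps can also be tried on a small synthetic dataset before running them on the star data. The sketch below makes several assumptions: the data comes from scikit-learn's `make_classification()` helper and is not the Kepler dataset, the names `x_demo`, `y_demo` and `demo_clf` are made up, and `random_state=42` is added only to make the run repeatable (the lesson's own code does not set it).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 200 rows, 20 feature columns and 2 classes.
x_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=42)

# Step 1: Create the classifier with 50 decision trees.
demo_clf = RandomForestClassifier(n_jobs=-1, n_estimators=50, random_state=42)

# Step 2: Train the classifier on the features and the target.
demo_clf.fit(x_demo, y_demo)

# After training, the individual decision trees are stored in the 'estimators_' attribute.
print(len(demo_clf.estimators_))  # 50

# Step 3: Accuracy on the same data that was used for training.
demo_score = demo_clf.score(x_demo, y_demo)
print(demo_score)
```

The number of trees printed matches the `n_estimators` value, which is one way to confirm what the parameter controls.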


In [None]:
# Teacher Action: Train the 'RandomForestClassifier' model using the 'fit()' function.

# 1. First, create a 'RandomForestClassifier' object with inputs as 'n_jobs=-1' & 'n_estimators=50'. Store it in a variable with the name 'rf_clf'.
# For the time being, ignore the reason behind providing the 'n_jobs=-1' parameter as an input.
rf_clf = RandomForestClassifier(n_jobs=-1, n_estimators=50)

# 2. Call the 'fit()' function with 'x_train' and 'y_train' as inputs.
rf_clf.fit(x_train, y_train)

# 3. Call the 'score()' function with 'x_train' and 'y_train' as inputs to check the accuracy score of the model.
rf_clf.score(x_train, y_train)

0.9998034204835856

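As you can see, we have trained the `RandomForestClassifier` model with an accuracy of about 99.98% on the training dataset (a score of `1.0` would signify 100% accuracy). Because a random forest involves randomness, the exact score may differ slightly from run to run.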


---

#### Activity 5: Target & Feature Variables From Test Dataset

Now we need to make predictions on the test dataset. So, we just need to extract feature variables from the test dataset `exo_test_df` using the `iloc[]` function.

In [None]:
# Student Action: Using the 'iloc[]' function, extract the feature variables from the test dataset.
x_test = exo_test_df.iloc[:, 1:]
x_test.head()

Unnamed: 0,FLUX.1,FLUX.2,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,FLUX.10,FLUX.11,FLUX.12,FLUX.13,FLUX.14,FLUX.15,FLUX.16,FLUX.17,FLUX.18,FLUX.19,FLUX.20,FLUX.21,FLUX.22,FLUX.23,FLUX.24,FLUX.25,FLUX.26,FLUX.27,FLUX.28,FLUX.29,FLUX.30,FLUX.31,FLUX.32,FLUX.33,FLUX.34,FLUX.35,FLUX.36,FLUX.37,FLUX.38,FLUX.39,FLUX.40,...,FLUX.3158,FLUX.3159,FLUX.3160,FLUX.3161,FLUX.3162,FLUX.3163,FLUX.3164,FLUX.3165,FLUX.3166,FLUX.3167,FLUX.3168,FLUX.3169,FLUX.3170,FLUX.3171,FLUX.3172,FLUX.3173,FLUX.3174,FLUX.3175,FLUX.3176,FLUX.3177,FLUX.3178,FLUX.3179,FLUX.3180,FLUX.3181,FLUX.3182,FLUX.3183,FLUX.3184,FLUX.3185,FLUX.3186,FLUX.3187,FLUX.3188,FLUX.3189,FLUX.3190,FLUX.3191,FLUX.3192,FLUX.3193,FLUX.3194,FLUX.3195,FLUX.3196,FLUX.3197
0,119.88,100.21,86.46,48.68,46.12,39.39,18.57,6.98,6.63,-21.97,-23.17,-29.26,-33.99,-6.25,-28.12,-27.24,-32.28,-12.29,-16.57,-23.86,-5.69,9.24,35.52,81.2,116.49,133.99,148.97,174.15,187.77,215.3,246.8,-56.68,-56.68,-56.68,-52.05,-31.52,-31.15,-48.53,-38.93,-26.06,...,-2.55,12.26,-7.06,-23.53,2.54,30.21,38.87,-22.86,-22.86,-4.37,2.27,-16.27,-30.84,-7.21,-4.27,13.6,15.62,31.96,49.89,86.93,86.93,42.99,48.76,22.82,32.79,30.76,14.55,10.92,22.68,5.91,14.52,19.29,14.44,-1.62,13.33,45.5,31.93,35.78,269.43,57.72
1,5736.59,5699.98,5717.16,5692.73,5663.83,5631.16,5626.39,5569.47,5550.44,5458.8,5329.39,5191.38,5031.39,4769.89,4419.66,4218.92,3924.73,3605.3,3326.55,3021.2,2800.61,2474.48,2258.33,1951.69,1749.86,1585.38,1575.48,1568.41,1661.08,1977.33,2425.62,2889.61,3847.64,3847.64,3741.2,3453.47,3202.61,2923.73,2694.84,2474.22,...,-3470.75,-4510.72,-5013.41,-3636.05,-2324.27,-2688.55,-2813.66,-586.22,-586.22,-756.8,-1090.23,-1388.61,-1745.36,-2015.28,-2359.06,-2516.66,-2699.31,-2777.55,-2732.97,1167.39,1167.39,1368.89,1434.8,1360.75,1148.44,1117.67,714.86,419.02,57.06,-175.66,-581.91,-984.09,-1230.89,-1600.45,-1824.53,-2061.17,-2265.98,-2366.19,-2294.86,-2034.72
2,844.48,817.49,770.07,675.01,605.52,499.45,440.77,362.95,207.27,150.46,85.49,-20.12,-35.88,-65.59,-15.12,16.6,-25.7,61.88,53.18,64.32,72.38,100.35,67.26,14.71,-16.41,-147.46,-231.27,-320.29,-407.82,-450.48,-146.99,-146.99,-146.99,-146.99,-166.3,-139.9,-96.41,-23.49,13.59,67.59,...,-35.24,-70.13,-35.3,-56.48,-74.6,-115.18,-8.91,-37.59,-37.59,-37.43,-104.23,-101.45,-107.35,-109.82,-126.27,-170.32,-117.85,-32.3,-70.18,314.29,314.29,314.29,149.71,54.6,12.6,-133.68,-78.16,-52.3,-8.55,-19.73,17.82,-51.66,-48.29,-59.99,-82.1,-174.54,-95.23,-162.68,-36.79,30.63
3,-826.0,-827.31,-846.12,-836.03,-745.5,-784.69,-791.22,-746.5,-709.53,-679.56,-706.03,-720.56,-631.12,-659.16,-672.03,-665.06,-667.94,-660.84,-672.75,-644.91,-680.53,-620.5,-570.34,-530.0,-537.88,-578.38,-532.34,-532.38,-491.03,-485.03,-427.19,-380.84,-329.5,-286.91,-283.81,-298.19,-271.03,-268.5,-209.56,-180.44,...,16.5,-1286.59,-1286.59,-1286.59,-1286.59,-1286.59,-1286.59,-1286.59,-1286.59,-14.94,64.09,8.38,45.31,100.72,91.53,46.69,20.34,30.94,-36.81,-33.28,-69.62,-208.0,-280.28,-340.41,-337.41,-268.03,-245.0,-230.62,-129.59,-35.47,122.34,93.03,93.03,68.81,9.81,20.75,20.25,-120.81,-257.56,-215.41
4,-39.57,-15.88,-9.16,-6.37,-16.13,-24.05,-0.9,-45.2,-5.04,14.62,-19.52,-11.43,-49.8,25.84,11.62,3.18,-9.59,14.49,8.82,32.32,-28.9,-28.9,-14.09,-30.87,-18.99,-38.6,-27.79,9.65,29.6,7.88,42.87,27.59,27.05,20.26,29.48,9.71,22.84,25.99,-667.55,-1336.24,...,-122.12,-32.01,-47.15,-56.45,-41.71,-34.13,-43.12,-53.63,-53.63,-53.63,-24.29,22.29,25.18,1.84,-22.29,-26.43,-12.12,-33.05,-21.66,-228.32,-228.32,-228.32,-187.35,-166.23,-115.54,-50.18,-37.96,-22.37,-4.74,-35.82,-37.87,-61.85,-27.15,-21.18,-33.76,-85.34,-81.46,-61.98,-69.34,-17.84


Let's also extract the target variable from the test dataset `exo_test_df` so that we can compare the actual target values with the predicted values later.

In [None]:
# Student Action: Using the 'iloc[]' function, extract the target variable from the test dataset.
y_test = exo_test_df.iloc[:, 0]
y_test.head()

0    2
1    2
2    2
3    2
4    2
Name: LABEL, dtype: int64

---

#### Activity 6: The `predict()` Function

Now, let's make predictions on the test dataset by calling the `predict()` function with the feature variables of the test dataset as an input.

**Syntax of predict() function:**

```python
model.predict(data)
```
 where `data` is a set of feature variables.

In [None]:
# Student Action: Make predictions on the test dataset by using the 'predict()' function.
y_predicted = rf_clf.predict(x_test)
y_predicted

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

The `predict()` function returns a NumPy array of the predicted values. You can verify it using the `type()` function.

In [None]:
# Student Action: Using the 'type()' function, verify that the predicted values are obtained in the form of a NumPy array.
type(y_predicted)

numpy.ndarray

The actual target values are stored in a Pandas series. So, for the sake of consistency, let's convert the NumPy array of the predicted values into a Pandas series.

In [None]:
# Student Action: Convert the NumPy array of predicted values into a Pandas series.
y_predicted = pd.Series(y_predicted)
y_predicted.head()

0    1
1    1
2    1
3    1
4    1
dtype: int64

Now, let's count the number of stars classified as `1` and `2`.

In [None]:
# Student Action: Using the 'value_counts()' function, count the number of times 1 and 2 occur in the predicted values.
y_predicted.value_counts()

1    570
dtype: int64

As you can see, we did not get the expected results. The model should have classified all the stars having a planet as `2`. Ideally, the Random Forest Classifier model should have classified `565` values as `1` and the remaining `5` values as `2`. 

In this case, even though the accuracy of the prediction model is high, it is not giving the desired result according to the problem statement. Hence, **accuracy alone is not a sufficient metric to test the efficacy of a prediction model.** 
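To see why accuracy hides the problem, the predictions can be compared with the actual labels class by class. This is only a sketch in plain Python: both lists are rebuilt from the counts printed in this lesson instead of using `y_test` and `y_predicted` directly, and the variable names are made up for this illustration.

```python
# Labels rebuilt from the counts shown in this lesson (not read from the CSV files).
actual_labels = [2] * 5 + [1] * 565   # 5 stars with a planet, 565 without
predicted_labels = [1] * 570          # the model predicted 1 for every star

pairs = list(zip(actual_labels, predicted_labels))

# Overall accuracy: correct predictions out of all predictions.
accuracy = sum(a == p for a, p in pairs) / len(pairs)

# Stars that actually have a planet (label 2): how many were detected and how many were missed?
planets_detected = sum(a == 2 and p == 2 for a, p in pairs)
planets_missed = sum(a == 2 and p == 1 for a, p in pairs)

# Fraction of the planet-hosting stars that the model detected.
detection_rate = planets_detected / (planets_detected + planets_missed)

print(round(accuracy * 100, 2))          # 99.12
print(planets_detected, planets_missed)  # 0 5
print(detection_rate)                    # 0.0
```

The accuracy is about 99%, yet the fraction of planet-hosting stars detected is zero, which is the number that matters for this problem.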

In the next class, we will try to investigate why Random Forest Classifier failed to classify even a single star as `2`. Based on the investigation, we will try to improve the model and then deploy it again.

---

### Additional Activities

The activities starting from this point are optional. Please do these activities **ONLY** if you have time to spare in the class. Otherwise, skip to the **Wrap-Up** section. The additional activities will not be available in the class copy of the notebook. You will have to manually add these activities in the class copy by adding new text and code cells.

Moreover, you don't have to do all the additional activities. Depending on the availability of time in a class, you can choose the number of additional activities to perform from this collection. 

---

#### Activity 1: Multiples

Given $n$ and $m$, write a function to print the first $m$ multiples of a number $n$ without using any loop. E.g.,

```
Input: n = 2, m = 3
Output: 2 4 6 

Input: n = 3, m = 4
Output: 3 6 9 12
```

In [None]:
# Solution
import numpy as np
def m_multiples_of_n(n, m):
  multiples = np.arange(n, (n * m) + 1, n)
  print(multiples)

m_multiples_of_n(2, 3)
m_multiples_of_n(3, 4)

[2 4 6]
[ 3  6  9 12]


---

#### Activity 2: Centered Pentagonal Number

Write a function that takes a positive integer and calculates how many dots exist in a pentagonal shape around the centre dot on the $N^{\text{th}}$ iteration.

In the image below you can see the first iteration is only a single dot. On the second, there are 6 dots. On the third, there are 16 dots, and on the fourth, there are 31 dots.

<img src='https://drive.google.com/uc?id=1wn1u68GW15KGIz_MWsioUp74B9BPghuP'> 

So, for the first 21 positive integers, there should exist 1, 6, 16, 31, 51, 76, 106, 141, 181, 226, 276, 331, 391, 456, 526, 601, 681, 766, 856, 951, 1051 dots.

In [None]:
# Solution
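def centered_pentagonal_dots(num_pentagons):
  num = 1         # Dots in the first iteration (only the centre dot).
  difference = 5  # Dots added when moving to the next iteration.
  num_pentagons_list = []
  while num_pentagons > 0:  
    num_pentagons_list.append(num)
    num = num + difference
    difference += 5  # Each new layer adds 5 more dots than the previous layer.
    num_pentagons -= 1
  # Return the number of dots in the last (N-th) iteration.
  return num_pentagons_list[-1]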

print(centered_pentagonal_dots(4))
print(centered_pentagonal_dots(10))
print(centered_pentagonal_dots(21))

31
226
1051
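The loop in the solution adds $5, 10, \dots, 5(N-1)$ to the single centre dot, so the total is $1 + \frac{5N(N-1)}{2} = \frac{5N^2 - 5N + 2}{2}$. As an alternative sketch, the same answer can be computed directly from this formula; the function name `centered_pentagonal_closed_form` is made up for this example.

```python
def centered_pentagonal_closed_form(n):
  # 5 * n * (n - 1) is always even, so the integer division below is exact.
  return (5 * n * n - 5 * n + 2) // 2

print(centered_pentagonal_closed_form(4))   # 31
print(centered_pentagonal_closed_form(10))  # 226
print(centered_pentagonal_closed_form(21))  # 1051
```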


---
