# Hunting Exoplanets In Space - Deploying A Prediction Model

Using the training dataset, the prediction model will learn the properties of stars that have a planet and of stars that do not. Once it has learnt these properties, it will look for them in the test dataset and predict, for each star, whether it has a planet.

We will deploy the **Random Forest Classifier** model (a machine learning algorithm). Machine learning is a branch of artificial intelligence in which a machine learns patterns from data on its own, without a programmer writing explicit rules for them.


<img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/lesson-14/ml-vs-ai.png' width=700>

In this case, the machine will learn on its own to recognise the flux values of stars that have a planet. When it is shown a new dataset containing only the flux values of stars, it will tell which stars have a planet and which do not.

There are many machine learning models or algorithms to do this kind of prediction. One of them is the **Random Forest Classifier**. It is used to classify outcomes into classes (or labels) based on some features. For example, an animal that makes a 'Meow! Meow!' sound is classified (or labelled) as a cat, an animal that makes a 'Woof! Woof!' sound is classified as a dog, and an animal that makes a hissing sound is classified as a snake.

We will use the Random Forest Classifier model to classify whether a star has a planet or not. The stars which have at least one planet are labelled as `2` while the stars not having a planet are labelled as `1`.

---

#### Loading The Training Dataset

(The datasets are too large to upload here, so load or download them from the links below.)

Dataset links:

1. Train dataset
   
   https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/kepler-exoplanets-dataset/exoTrain.csv

2. Test dataset
   
   https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/kepler-exoplanets-dataset/exoTest.csv

In [1]:
# Loading both the training and test datasets.
import pandas as pd

exo_train_df = pd.read_csv('https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/kepler-exoplanets-dataset/exoTrain.csv')
exo_test_df = pd.read_csv('https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/kepler-exoplanets-dataset/exoTest.csv')

---

#### Random Forest Classifier - Working

A Random Forest is a collection (a.k.a. ensemble) of many decision trees. A decision tree is a flow chart which separates data based on some condition. If a condition is true, you follow one path; otherwise, you follow another.


For example, to find a star that has a planet, you can construct the following decision tree:

<img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/lesson-14/decision-tree.png' width=600>

You could first ask whether there is a decrease in the flux values of a star. If the answer is no, then the star is taken not to have a planet. However, if the answer is yes, then you could ask another question to check whether the decrease is periodic. Again, if the answer is no, then the star does not have a planet. Otherwise, it has a planet.

This is just one example of a decision tree. Depending on the problem, a decision tree can become much more complex.
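The two questions in this decision tree can be written as nested conditions. The sketch below is only an illustration: the function name `has_planet` and its two yes/no inputs are hypothetical, and a tree learnt from real flux values would split on numeric thresholds instead of ready-made answers.

```python
def has_planet(flux_decreases: bool, decrease_is_periodic: bool) -> int:
    """Toy decision tree: returns 2 (planet) or 1 (no planet)."""
    # First question: is there a decrease in the flux values?
    if not flux_decreases:
        return 1
    # Second question: is the decrease periodic?
    if not decrease_is_periodic:
        return 1
    return 2


print(has_planet(False, False))
print(has_planet(True, False))
print(has_planet(True, True))
```

Only the last call follows the "yes, yes" path of the flow chart, so it is the only one labelled `2`.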

A random forest is a collection of `N` trees, wherein each tree gives some predicted value (in this case either class `1` or class `2`).

<img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/lesson-14/rfc-image.jpg' width=800>

The final predicted value is the majority class, i.e., the class that is predicted by the largest number of decision trees in a random forest.

For the time being, just consider the Random Forest Classifier as a kind of black box which classifies data into different classes (in this case, either class `1` or class `2`) by learning the properties of every class through a training dataset.
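Majority voting can be sketched in a few lines. The list of votes below is made up for illustration, and `majority_vote` is a hypothetical helper, not part of any library.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    counts = Counter(tree_predictions)
    return counts.most_common(1)[0][0]


# Hypothetical predictions from seven trees for one star.
votes = [1, 2, 1, 1, 2, 1, 1]
final_class = majority_vote(votes)
print(final_class)
```

Note that scikit-learn's implementation averages the class probabilities predicted by the trees instead of counting hard votes; the two approaches usually agree, so the simple picture above is still a useful mental model.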

---

#### In Our Context

In the test dataset, there are `565` stars which are classified as `1` and `5` stars classified as `2`, which means only `5` stars have a planet. Interestingly, if our prediction model mindlessly classifies every star as `1`, then it is a very accurate model. Why?

Because the accuracy of a model is calculated as **a percentage of the correct predictions out of the total number of predictions**. In this case, the percentage of the correct predictions is

 $\frac{565\times100}{570} \approx 99.12$ %

Thus, without actually deploying a proper prediction model, we can predict the stars having a planet with 99% accuracy.

This conclusion is **WRONG**! We need to be careful here because the data is highly imbalanced. The ultimate goal of the Kepler space telescope is to detect exoplanets in outer space. Hence, a machine learning model built on its data should also **correctly** detect stars having planets. This means a prediction model will be considered useful only if it correctly detects almost all the stars having a planet.

So, the prediction model which always labels every star as `1` is useless, because it does not detect a single star having a planet.
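The arithmetic can be checked with plain Python. The sketch below rebuilds the label counts quoted above (565 stars labelled `1`, 5 labelled `2`) instead of loading the real file, and the variable names are chosen only for this illustration.

```python
# Labels as described above: 565 stars without a planet (1), 5 with a planet (2).
actual = [1] * 565 + [2] * 5
# A "model" that mindlessly labels every star as 1.
predicted = [1] * len(actual)

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = 100 * correct / len(actual)

planets_total = sum(a == 2 for a in actual)
planets_found = sum(a == 2 and p == 2 for a, p in zip(actual, predicted))

print(f"Accuracy: {accuracy:.2f}%")
print(f"Planets found: {planets_found} out of {planets_total}")
```

The accuracy is high only because almost every star belongs to class `1`; the count of planets found shows what the accuracy figure hides.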

Now, we are going to deploy the Random Forest Classifier model so that it can detect all the five (or at least three) stars having a planet.

---

### **Understanding Random Forest Classifier**

#### Introduction  
The **Random Forest Classifier** is an ensemble learning algorithm that improves prediction accuracy by combining multiple decision trees. Random forests are widely used for classification and regression tasks due to their accuracy, robustness, and ability to handle large datasets.  

#### How It Works  
1. **Bootstrap Sampling (Bagging)**: The dataset is randomly sampled with replacement to create multiple subsets.  
2. **Decision Tree Training**: Each subset is used to train an independent decision tree.  
3. **Random Feature Selection**: Instead of using all features, each tree selects a random subset of features at each split.  
4. **Majority Voting (Classification)**: For a classification task, each tree predicts a class, and the most common class (majority vote) is chosen as the final prediction.  
5. **Averaging (Regression)**: For regression tasks, the final prediction is the average of all tree predictions.  
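Steps 1 and 3 are the "random" part of a random forest. The sketch below only draws the random rows and features and does not train any trees; the sizes are hypothetical, and taking the square root of the number of features is one common choice, not the only one.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

n_rows = 10       # hypothetical number of stars
n_features = 16   # hypothetical number of flux columns
n_trees = 3

for tree in range(n_trees):
    # Step 1: bootstrap sample, i.e. draw n_rows row indices with replacement.
    rows = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Step 3: pick the candidate features for one split, without replacement.
    k = int(math.sqrt(n_features))
    features = rng.sample(range(n_features), k)
    print(f"tree {tree}: rows={sorted(rows)} features={sorted(features)}")
```

Because rows are drawn with replacement, some indices typically repeat within a tree's sample while others are left out, which is why each tree sees a slightly different dataset.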

#### Key Concepts  

#### 1. **Ensemble Learning**  
Random Forest is an **ensemble method**, meaning it combines many individual decision trees, each of which tends to overfit on its own, into a stronger model, reducing the risk of overfitting.  

#### 2. **Feature Randomness**  
Unlike a single decision tree that considers all features, Random Forest introduces randomness by selecting only a subset of features at each node split. This reduces correlation between trees, improving generalization.  

#### 3. **Bias-Variance Tradeoff**  
- A single decision tree has **low bias** but **high variance** (overfits the training data).  
- Random Forest balances bias and variance by averaging multiple trees, leading to lower variance while maintaining reasonable bias.  

#### 4. **Handling Imbalanced Data**  
In datasets like exoplanet detection, where one class is significantly larger than the other, Random Forest can be adjusted using:  
- **Class Weights**: Assigning higher weights to the minority class.  
- **Downsampling or Upsampling**: Adjusting the dataset to balance class distribution.  
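As an illustration of class weights, the sketch below computes scikit-learn's "balanced" weights for labels with the same 565-to-5 imbalance described earlier. The labels are rebuilt by hand here; they are not read from the dataset files.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hand-built labels: 565 stars labelled 1 and 5 stars labelled 2.
y = np.array([1] * 565 + [2] * 5)
classes = np.array([1, 2])

# 'balanced' weights are n_samples / (n_classes * count_of_each_class).
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weights = dict(zip(classes.tolist(), weights.tolist()))
print(class_weights)
```

Passing `class_weight='balanced'` to `RandomForestClassifier` applies the same weighting during training. Whether that is enough to detect the stars with planets in this dataset would still have to be checked on the test data.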

#### Hyperparameters in Random Forest  

| Hyperparameter | Description |
|---------------|-------------|
| `n_estimators` | Number of decision trees in the forest |
| `max_depth` | Maximum depth of each tree (prevents overfitting) |
| `min_samples_split` | Minimum samples required to split a node |
| `min_samples_leaf` | Minimum samples required in a leaf node |
| `max_features` | Number of features considered for best split |
| `bootstrap` | Whether to sample with replacement |
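The hyperparameters in the table map directly onto keyword arguments of scikit-learn's `RandomForestClassifier`. The values below are illustrative only and are not tuned for the exoplanet data.

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative values, chosen only to show where each hyperparameter goes.
clf = RandomForestClassifier(
    n_estimators=50,       # number of trees in the forest
    max_depth=10,          # limit on the depth of each tree
    min_samples_split=4,   # a node needs at least 4 samples to be split
    min_samples_leaf=2,    # every leaf keeps at least 2 samples
    max_features="sqrt",   # number of features tried at each split
    bootstrap=True,        # sample rows with replacement
    random_state=42,       # fixed seed for repeatable results
)

params = clf.get_params()
print(params["n_estimators"], params["max_depth"], params["max_features"])
```

`get_params()` returns every setting of the model as a dictionary, which is a convenient way to confirm what a model was created with.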

#### Advantages of Random Forest  
✅ **Robust to outliers**; some implementations also handle missing values  
✅ **Reduces overfitting** by combining multiple trees  
✅ **Works well with large datasets**  
✅ **Reduces variance** compared to a single decision tree  
✅ **Provides feature importance ranking**  

#### Disadvantages of Random Forest  
❌ **Computationally expensive** for very large datasets  
❌ **Less interpretable** compared to a single decision tree  
❌ **Not ideal for real-time applications** due to high inference time  

#### Conclusion  
Random Forest is a powerful classification model that copes well with noisy data and, with adjustments such as class weights or resampling, can be applied to imbalanced data. By leveraging multiple decision trees and ensemble learning, it provides **high accuracy and robustness** while reducing overfitting. However, it requires careful tuning of hyperparameters to achieve optimal performance.  

#### Activity 2: Importing `RandomForestClassifier`

We need to import a class called `RandomForestClassifier` from a module called `sklearn.ensemble`. `sklearn` (or **scikit-learn**) is a collection of many machine learning modules. With the **scikit-learn** library, most common machine learning algorithms can be applied directly without working through the underlying math. It is a kind of plug-and-play toolkit.

You can read about it in the link provided in the **Activities** section under the title **`scikit-learn` - Random Forest Classifier**


In [5]:
# Teacher Action: Import the required class from the 'sklearn' library.
# Import the 'RandomForestClassifier' class from the 'sklearn.ensemble' module.
from sklearn.ensemble import RandomForestClassifier

---

#### Activity 3: Target & Feature Variables Separation

The `RandomForestClassifier` class has a method called `fit()` which takes two inputs. The first input is the collection of feature variables.

*Features are the variables that describe the properties of an entity.* In this case, the columns `FLUX.1` to `FLUX.3197` are the feature variables. Hence, the values stored in these columns are the features of a star in the exoplanets dataset.

The second input is the target variable.

*The variable which needs to be predicted is called a target variable.* In this case, the `LABEL` is the target variable because the prediction model needs to predict which star belongs to which class in the test dataset. Hence, the values stored in the `LABEL` column are the target values.

So, we need to extract the target variable and the feature variables separately from the training dataset.

Let's store the feature variables in the `x_train` variable and the target variable in the `y_train` variable. We will separate the features using the `iloc[]` indexer.

We need all the rows from the training set. So, inside the square brackets of `iloc[]`, we will enter the colon (`:`) sign to get all the rows. We do not need the first column, i.e., the `LABEL` column. Therefore, as part of column indexing, enter `1` as the starting index followed by the colon (`:`) sign to include the rest of the columns from the training dataset.
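To see what these two slices do, here is a minimal sketch on a tiny made-up DataFrame with the same layout (label first, flux columns after). The names `toy_df`, `toy_features` and `toy_target` and the values are used only for this illustration.

```python
import pandas as pd

# A tiny, made-up frame: the first column is the label, the rest are features.
toy_df = pd.DataFrame({
    "LABEL": [2, 1, 1],
    "FLUX.1": [10.5, -3.2, 7.0],
    "FLUX.2": [9.8, -2.9, 6.4],
})

toy_features = toy_df.iloc[:, 1:]  # all rows, every column except the first
toy_target = toy_df.iloc[:, 0]     # all rows, first column only

print(toy_features.shape)
print(toy_target.tolist())
```

The first slice returns a DataFrame without the `LABEL` column, while the second returns a single column as a pandas Series.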


In [6]:
# Student Action: Extract the feature variables from the training dataset using the 'iloc[]' function.
x_train = exo_train_df.iloc[:,1:]
x_train

Unnamed: 0,FLUX.1,FLUX.2,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,FLUX.10,...,FLUX.3188,FLUX.3189,FLUX.3190,FLUX.3191,FLUX.3192,FLUX.3193,FLUX.3194,FLUX.3195,FLUX.3196,FLUX.3197
0,93.85,83.81,20.10,-26.98,-39.56,-124.71,-135.18,-96.27,-79.89,-160.17,...,-78.07,-102.15,-102.15,25.13,48.57,92.54,39.32,61.42,5.08,-39.54
1,-38.88,-33.83,-58.54,-40.09,-79.31,-72.81,-86.55,-85.33,-83.97,-73.38,...,-3.28,-32.21,-32.21,-24.89,-4.86,0.76,-11.70,6.46,16.00,19.93
2,532.64,535.92,513.73,496.92,456.45,466.00,464.50,486.39,436.56,484.39,...,-71.69,13.31,13.31,-29.89,-20.88,5.06,-11.80,-28.91,-70.02,-96.67
3,326.52,347.39,302.35,298.13,317.74,312.70,322.33,311.31,312.42,323.33,...,5.71,-3.73,-3.73,30.05,20.03,-12.67,-8.77,-17.31,-17.35,13.98
4,-1107.21,-1112.59,-1118.95,-1095.10,-1057.55,-1034.48,-998.34,-1022.71,-989.57,-970.88,...,-594.37,-401.66,-401.66,-357.24,-443.76,-438.54,-399.71,-384.65,-411.79,-510.54
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5082,-91.91,-92.97,-78.76,-97.33,-68.00,-68.24,-75.48,-49.25,-30.92,-11.88,...,139.95,147.26,156.95,155.64,156.36,151.75,-24.45,-17.00,3.23,19.28
5083,989.75,891.01,908.53,851.83,755.11,615.78,595.77,458.87,492.84,384.34,...,-26.50,-4.84,-76.30,-37.84,-153.83,-136.16,38.03,100.28,-45.64,35.58
5084,273.39,278.00,261.73,236.99,280.73,264.90,252.92,254.88,237.60,238.51,...,-26.82,-53.89,-48.71,30.99,15.96,-3.47,65.73,88.42,79.07,79.43
5085,3.82,2.09,-3.29,-2.88,1.66,-0.75,3.85,-0.03,3.28,6.29,...,10.86,-3.23,-5.10,-4.61,-9.82,-1.50,-4.65,-14.55,-6.41,-2.55


Now, let's get only the target variable from the training dataset.

In [7]:
# Student Action: Using the 'iloc[]' function, retrieve only the first column, i.e., the 'LABEL' column from the training dataset.
y_train = exo_train_df.iloc[:,0]
y_train

0       2
1       2
2       2
3       2
4       2
       ..
5082    1
5083    1
5084    1
5085    1
5086    1
Name: LABEL, Length: 5087, dtype: int64

---

#### Activity 4: Fitting The Model

Now that we have separated the feature and target variables for deploying the `RandomForestClassifier` model, let's train the model on them using the `fit()` function. The steps to be followed are described below.

1. First, create a `RandomForestClassifier` object with inputs as `n_jobs=-1` and `n_estimators=50`. Store the object in a variable with the name `rf_clf`.

  For the time being, ignore the reason behind providing the `n_jobs=-1` parameter as an input.

  ```
  rf_clf = RandomForestClassifier(n_jobs=-1, n_estimators=50)
  ```

  The `n_estimators` parameter defines the number of decision trees in a Random Forest. Therefore, `n_estimators=50` means that the forest contains `50` decision trees. If `n_estimators` parameter is not defined by a user, then by default, the forest contains `100` decision trees.

2. Call the `fit()` function with `x_train` and `y_train` as inputs.

  ```
  rf_clf.fit(x_train, y_train)
  ```
3. Call the `score()` function with `x_train` and `y_train` as inputs to check the accuracy score of the model on the training data. This step is optional, so you can skip it if you wish.
  
  ```
  rf_clf.score(x_train, y_train)
  ```


In [8]:
# Teacher Action: Train the 'RandomForestClassifier' model using the 'fit()' function.
# 1. Create a 'RandomForestClassifier' object with 'n_jobs=-1' and store it in a variable named 'rf_clf'.
# Note: this cell uses 'n_estimators=570', not the 'n_estimators=50' shown in the steps above.
# For the time being, ignore the reason behind providing the 'n_jobs=-1' parameter as an input.
rf_clf = RandomForestClassifier(n_jobs=-1, n_estimators=570)

# 2. Call the 'fit()' function with 'x_train' and 'y_train' as inputs.
rf_clf.fit(x_train, y_train)
# 3. Call the 'score()' function with 'x_train' and 'y_train' as inputs to check the accuracy score on the training data.
rf_clf.score(x_train, y_train)

1.0

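As you can see, the `RandomForestClassifier` model scores 100% accuracy on the training dataset (the number `1.0` signifies 100% accuracy). Keep in mind that this score is measured on the same data the model was trained on, so it does not tell us how well the model will perform on the test dataset.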
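A perfect score on the training data is common for a random forest and says little about unseen data. The sketch below illustrates this with made-up noise instead of the exoplanet data: the labels are random, so there is nothing real to learn, yet the score on the data used for fitting is still high. All names and sizes here are assumptions chosen for the illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Made-up data: the features are pure noise and the labels are random.
x_fit = rng.normal(size=(200, 20))
y_fit = rng.integers(1, 3, size=200)       # random labels, 1 or 2
x_unseen = rng.normal(size=(200, 20))
y_unseen = rng.integers(1, 3, size=200)

demo_clf = RandomForestClassifier(n_estimators=50, random_state=0)
demo_clf.fit(x_fit, y_fit)

train_accuracy = demo_clf.score(x_fit, y_fit)
unseen_accuracy = demo_clf.score(x_unseen, y_unseen)
print(f"Accuracy on the data used for fitting: {train_accuracy:.2f}")
print(f"Accuracy on unseen data: {unseen_accuracy:.2f}")
```

The gap between the two numbers is the reason a model should always be judged on data it has not seen during training.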


---

#### Activity 5: Target & Feature Variables From Test Dataset

Now we need to make predictions on the test dataset. So, we just need to extract feature variables from the test dataset using the `iloc[]` function.

In [9]:
# Student Action: Using the 'iloc[]' function, extract the feature variables from the test dataset.
x_test = exo_test_df.iloc[:,1:]
x_test

Unnamed: 0,FLUX.1,FLUX.2,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,FLUX.10,...,FLUX.3188,FLUX.3189,FLUX.3190,FLUX.3191,FLUX.3192,FLUX.3193,FLUX.3194,FLUX.3195,FLUX.3196,FLUX.3197
0,119.88,100.21,86.46,48.68,46.12,39.39,18.57,6.98,6.63,-21.97,...,14.52,19.29,14.44,-1.62,13.33,45.50,31.93,35.78,269.43,57.72
1,5736.59,5699.98,5717.16,5692.73,5663.83,5631.16,5626.39,5569.47,5550.44,5458.80,...,-581.91,-984.09,-1230.89,-1600.45,-1824.53,-2061.17,-2265.98,-2366.19,-2294.86,-2034.72
2,844.48,817.49,770.07,675.01,605.52,499.45,440.77,362.95,207.27,150.46,...,17.82,-51.66,-48.29,-59.99,-82.10,-174.54,-95.23,-162.68,-36.79,30.63
3,-826.00,-827.31,-846.12,-836.03,-745.50,-784.69,-791.22,-746.50,-709.53,-679.56,...,122.34,93.03,93.03,68.81,9.81,20.75,20.25,-120.81,-257.56,-215.41
4,-39.57,-15.88,-9.16,-6.37,-16.13,-24.05,-0.90,-45.20,-5.04,14.62,...,-37.87,-61.85,-27.15,-21.18,-33.76,-85.34,-81.46,-61.98,-69.34,-17.84
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
565,374.46,326.06,319.87,338.23,251.54,209.84,186.35,167.46,135.45,107.28,...,-123.55,-166.90,-222.44,-209.71,-180.16,-166.83,-235.66,-213.63,-205.99,-194.07
566,-0.36,4.96,6.25,4.20,8.26,-9.53,-10.10,-4.54,-11.55,-10.48,...,-12.40,-5.99,-17.94,-11.96,-12.11,-13.68,-3.59,-5.32,-10.98,-11.24
567,-54.01,-44.13,-41.23,-42.82,-39.47,-24.88,-31.14,-24.71,-13.12,-14.78,...,-0.73,-1.64,1.58,-4.82,-11.93,-17.14,-4.25,5.47,14.46,18.70
568,91.36,85.60,48.81,48.69,70.05,22.30,11.63,37.86,28.27,-4.36,...,2.44,11.53,-16.42,-17.86,21.10,-10.25,-37.06,-8.43,-6.48,17.60


In [10]:
print(x_test.shape)
print(x_train.shape)

(570, 3197)
(5087, 3197)


Let's also extract the target variable from the test dataset so that we can compare the actual target values with the predicted values later.

In [11]:
# Student Action: Using the 'iloc[]' function, extract the target variable from the test dataset.
y_test = exo_test_df.iloc[:,0]
y_test

0      2
1      2
2      2
3      2
4      2
      ..
565    1
566    1
567    1
568    1
569    1
Name: LABEL, Length: 570, dtype: int64

---

#### Activity 6: The `predict()` Function

Now, let's make predictions on the test dataset by calling the `predict()` function with the feature variables of the test dataset as an input.

In [12]:
# Student Action: Make predictions on the test dataset by using the 'predict()' function.
prediction = rf_clf.predict(x_test)
prediction

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

The `predict()` function returns a NumPy array of the predicted values. You can verify it using the `type()` function.

In [13]:
# Student Action: Using the 'type()' function, verify that the predicted values are obtained in the form of a NumPy array.
print(type(prediction))

<class 'numpy.ndarray'>


The actual target values are stored in a Pandas series. So, for the sake of consistency, let's convert the NumPy array of the predicted values into a Pandas series.

In [14]:
# Student Action: Convert the NumPy array of predicted values into a Pandas series.
pd_series = pd.Series(prediction)
pd_series

0      1
1      1
2      1
3      1
4      1
      ..
565    1
566    1
567    1
568    1
569    1
Length: 570, dtype: int64

Now, let's count the number of stars classified as `1` and `2`.

In [15]:
# Student Action: Using the 'value_counts()' function, count the number of times 1 and 2 occur in the predicted values.
pd_series.value_counts()

1    570
Name: count, dtype: int64

As you can see, we did not get the expected results: the model classified all `570` stars as `1`. Ideally, the Random Forest Classifier model should have classified `565` stars as `1` and the remaining `5` stars, the ones having a planet, as `2`.

In this case, even though the accuracy of the prediction model is high, it does not give the result that the problem statement asks for. Hence, **accuracy alone is not a sufficient metric to test the efficacy of a prediction model.**
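One way to look beyond accuracy is a confusion matrix together with the recall for class `2`. The sketch below rebuilds the situation described above by hand (5 stars with a planet, 565 without, everything predicted as `1`) instead of reusing the notebook variables, so it can be run on its own.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hand-built labels: 5 stars with a planet (2), 565 without (1), all predicted as 1.
actual = [2] * 5 + [1] * 565
predicted = [1] * 570

acc = accuracy_score(actual, predicted)
# Rows are the actual classes, columns are the predicted classes, in the order [1, 2].
cm = confusion_matrix(actual, predicted, labels=[1, 2])
# Recall for class 2: the fraction of stars with a planet that were detected.
planet_recall = recall_score(actual, predicted, pos_label=2)

print(f"accuracy = {acc:.4f}")
print(cm)
print(f"recall for class 2 = {planet_recall:.1f}")
```

The second row of the matrix holds the stars that actually have a planet; all of them fall into the "predicted `1`" column, which is exactly what the recall for class `2` summarises.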

In the next class, we will try to investigate why Random Forest Classifier failed to classify even a single star as `2`. Based on the investigation, we will try to improve the model and then deploy it again.

In [16]:
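# Optional experiment: try a gradient-boosted classifier from the 'xgboost' package.
# 'xgboost' is a separate package and must be installed first (for example, 'pip install xgboost').
import xgboost as xg

md = xg.XGBClassifier()
# Recent 'xgboost' versions expect class labels to start at 0, so shift the labels 1/2 to 0/1.
md.fit(x_train, y_train - 1)
# Shift the predictions back to the original 1/2 labels.
y_pred = md.predict(x_test) + 1
print(pd.Series(y_pred).head(10))
print(pd.Series(y_pred).value_counts())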

ModuleNotFoundError: No module named 'xgboost'

---