# Hunting Exoplanets In Space - Deploying A Prediction Model

The prediction model, through the training dataset, will learn the properties of a star that has a planet and also the properties of a star which does not have a planet. Once the model has learnt the required properties, it will look for these properties in the test dataset and according to the properties it sees, it will predict whether a star has a planet or not.

We will deploy the **Random Forest Classifier** model (a machine learning algorithm). Machine learning is a branch of artificial intelligence in which a machine learns patterns from data on its own, without a programmer explicitly coding the rules.


<img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/lesson-14/ml-vs-ai.png' width=700>

In this case, the machine will learn on its own to recognise the flux values of stars that have a planet. When it is shown a new dataset containing only the flux values of stars, it will tell which stars have a planet and which do not.

There are many machine learning models or algorithms to do this kind of prediction. One of them is the **Random Forest Classifier**. It is used to classify outcomes into classes (or labels) based on some features. For example, an animal that makes a 'Meow! Meow!' sound is classified (or labelled) as a cat, an animal that makes a 'Woof! Woof!' sound is classified as a dog, and an animal that makes a hissing sound is classified as a snake.

We will use the Random Forest Classifier model to classify whether a star has a planet or not. The stars which have at least one planet are labelled as `2` while the stars not having a planet are labelled as `1`.

---

#### Loading The Training Dataset

(The datasets are too large to upload here, so use the links below to load or download them.)

Dataset links:

1. Train dataset
   
   https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/kepler-exoplanets-dataset/exoTrain.csv

2. Test dataset
   
   https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/kepler-exoplanets-dataset/exoTest.csv

In [1]:
# Loading both the training and test datasets.
import pandas as pd

exo_train_df = pd.read_csv('https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/kepler-exoplanets-dataset/exoTrain.csv')
exo_test_df = pd.read_csv('https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/kepler-exoplanets-dataset/exoTest.csv')

---

#### Random Forest Classifier - Working

A Random Forest is a collection (a.k.a. ensemble) of many decision trees. A decision tree is a flow chart which separates data based on some condition. If the condition is true, you follow one path; otherwise, you follow another.


For example, to find out whether a star has a planet, you can construct the following decision tree:

<img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/lesson-14/decision-tree.png' width=600>

You could first ask whether there is a decrease in the flux values of a star. If the answer is no, then the star does not have a planet. However, if the answer is yes, then you could ask another question to check whether the decrease is periodic or not. Again, if the answer is no, then the star does not have a planet. Otherwise, it has a planet.

This is one example of a decision tree. Depending on the problem, a decision tree can get more and more complex.
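The two questions in the flow chart above can be written as a pair of nested checks. This is only a toy sketch: the function name and its two boolean inputs are hypothetical, and a real decision tree learns its questions from the flux values rather than being given ready-made answers.

```python
def classify_star(flux_decreases: bool, decrease_is_periodic: bool) -> int:
    """Toy decision tree: return label 2 (has a planet) or 1 (no planet)."""
    if not flux_decreases:
        # First question answered "no": no dip in brightness, so no planet.
        return 1
    if not decrease_is_periodic:
        # Second question answered "no": a one-off dip, so no planet.
        return 1
    # Both questions answered "yes": a periodic dip suggests a planet.
    return 2


for dips, periodic in [(False, False), (True, False), (True, True)]:
    print(f"dip={dips}, periodic={periodic} -> class {classify_star(dips, periodic)}")
```

Each `if` corresponds to one diamond in the flow chart, and each `return` to one of its end points.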

A collection of `N` trees is a random forest, wherein each tree gives some predicted value (in this case, either class `1` or class `2`).

<img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/lesson-14/rfc-image.jpg' width=800>

The final predicted value is the majority class, i.e., the class that is predicted by the largest number of decision trees in the random forest.
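A minimal sketch of majority voting, assuming each tree has already produced a class for one star. The list of votes is made up for illustration, and the helper name `majority_vote` is hypothetical.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    counts = Counter(tree_predictions)
    label, _ = counts.most_common(1)[0]
    return label


# Hypothetical predictions of 7 trees for a single star.
votes = [1, 2, 1, 1, 2, 1, 1]
print(majority_vote(votes))
```

As far as I know, scikit-learn's `RandomForestClassifier` averages the class probabilities predicted by the trees instead of counting hard votes, so treat this as a picture of the idea and not of that library's internals.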

For the time being, just consider the Random Forest Classifier as some kind of a black-box which classifies data into different classes (in this case, either class `1` or class `2`) by learning the properties of every class through a training dataset.

---

#### In Our Context

In the test dataset, there are `565` stars which are classified as `1` and `5` stars classified as `2`, which means only `5` stars have a planet. Interestingly, if our prediction model mindlessly classifies every star as `1`, then it is a very accurate model. Why?

Because the accuracy of a model is calculated as **a percentage of the correct predictions out of the total number of predictions**. In this case, the percentage of the correct predictions is

 $\frac{565\times100}{570} \approx 99.12$ %

Thus, without actually deploying a proper prediction model, we can classify the stars with about 99% accuracy.

This is **WRONG**! This is where we need to be careful, because we have very imbalanced data. The ultimate goal of the Kepler space telescope is to detect exoplanets in outer space. Hence, a machine learning model built on its data should also **correctly** detect the stars having planets. This means a prediction model is useful only if it correctly detects almost all the stars having a planet.

So, a prediction model which labels every star as `1` is useless, because it does not detect a single star having a planet.
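The arithmetic above can be checked with a few lines of plain Python. The two lists are built only from the class counts quoted in this section (565 stars of class `1` and 5 of class `2`), not from the real dataset.

```python
# Labels built from the counts quoted above: 565 stars of class 1, 5 of class 2.
actual = [1] * 565 + [2] * 5

# A "model" that mindlessly predicts class 1 for every star.
predicted = [1] * len(actual)

# Accuracy = correct predictions as a percentage of all predictions.
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = 100 * correct / len(actual)

print(f"Correct predictions: {correct} of {len(actual)}")
print(f"Accuracy: {accuracy:.2f}%")
```

The accuracy is high only because class `1` dominates the data.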

Now, we are going to deploy the Random Forest Classifier model so that it can detect all the five (or at least three) stars having a planet.

---

### **Understanding Random Forest Classifier**

#### Introduction  
The **Random Forest Classifier** is an ensemble learning algorithm that improves prediction accuracy by combining multiple decision trees. It is widely used for classification and regression tasks due to its high accuracy, robustness, and ability to handle large datasets.  

#### How It Works  
1. **Bootstrap Sampling (Bagging)**: The dataset is randomly sampled with replacement to create multiple subsets.  
2. **Decision Tree Training**: Each subset is used to train an independent decision tree.  
3. **Random Feature Selection**: Instead of using all features, each tree selects a random subset of features at each split.  
4. **Majority Voting (Classification)**: For a classification task, each tree predicts a class, and the most common class (majority vote) is chosen as the final prediction.  
5. **Averaging (Regression)**: For regression tasks, the final prediction is the average of all tree predictions.  
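Steps 1 and 3 above can be sketched with Python's standard `random` module. The dataset size and the nine feature names are hypothetical, and the sketch shows only the sampling idea, not how any particular library implements it.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

n_rows = 10  # hypothetical tiny dataset with 10 stars
feature_names = [f"FLUX.{i}" for i in range(1, 10)]  # 9 hypothetical features

# Step 1: bootstrap sample, i.e. draw row indices WITH replacement.
# Some rows will typically appear more than once and others not at all.
bootstrap_rows = [rng.randrange(n_rows) for _ in range(n_rows)]

# Step 3: at a split, consider only a random subset of the features
# (here the square root of the feature count, a common choice).
n_candidates = math.isqrt(len(feature_names))
split_candidates = rng.sample(feature_names, n_candidates)

print("Bootstrap rows:", bootstrap_rows)
print("Distinct rows drawn:", len(set(bootstrap_rows)), "of", n_rows)
print("Features considered at this split:", split_candidates)
```

Repeating this for every tree gives each one a slightly different view of the data, which is what makes their votes worth combining.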

#### Key Concepts  

#### 1. **Ensemble Learning**  
Random Forest is an **ensemble method**, meaning it combines the predictions of many decision trees into a single, stronger model, reducing the risk of overfitting.  

#### 2. **Feature Randomness**  
Unlike a single decision tree that considers all features, Random Forest introduces randomness by selecting only a subset of features at each node split. This reduces correlation between trees, improving generalization.  

#### 3. **Bias-Variance Tradeoff**  
- A single decision tree has **low bias** but **high variance** (overfits the training data).  
- Random Forest balances bias and variance by averaging multiple trees, leading to lower variance while maintaining reasonable bias.  
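The variance-reduction claim can be illustrated with a small simulation. It assumes each "tree" is just the true value plus independent random noise, which is a deliberate simplification: real trees trained on overlapping data are correlated, so the reduction in a real forest is smaller than what this sketch shows.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable


def noisy_estimate():
    """Stand-in for one tree's prediction: true value 0.0 plus random noise."""
    return rng.gauss(0.0, 1.0)


def averaged_estimate(n_trees):
    """Stand-in for a forest: the mean of n_trees independent noisy estimates."""
    return statistics.fmean(noisy_estimate() for _ in range(n_trees))


single = [noisy_estimate() for _ in range(2000)]
forest = [averaged_estimate(25) for _ in range(2000)]

single_var = statistics.pvariance(single)
forest_var = statistics.pvariance(forest)

print(f"Variance of a single estimate:     {single_var:.3f}")
print(f"Variance of a 25-estimate average: {forest_var:.3f}")
```

Both sets of estimates are centred on the same value, so the bias is unchanged while the spread shrinks.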

#### 4. **Handling Imbalanced Data**  
In datasets like exoplanet detection, where one class is significantly larger than the other, Random Forest can be adjusted using:  
- **Class Weights**: Assigning higher weights to the minority class.  
- **Downsampling or Upsampling**: Adjusting the dataset to balance class distribution.  
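One common way to choose class weights is to make them inversely proportional to class frequency. The sketch below computes `n_samples / (n_classes * class_count)`, which to my knowledge is the heuristic scikit-learn applies when `class_weight='balanced'` is passed. The helper name is hypothetical, and the labels reuse the 565-to-5 split quoted earlier in this lesson purely as an illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Illustrative labels: 565 stars of class 1 and 5 stars of class 2.
labels = [1] * 565 + [2] * 5
weights = balanced_class_weights(labels)

for cls, weight in sorted(weights.items()):
    print(f"class {cls}: weight {weight:.3f}")
```

With these weights, a mistake on a rare class `2` star costs the model far more than a mistake on a class `1` star.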

#### Hyperparameters in Random Forest  

| Hyperparameter | Description |
|---------------|-------------|
| `n_estimators` | Number of decision trees in the forest |
| `max_depth` | Maximum depth of each tree (limiting it helps prevent overfitting) |
| `min_samples_split` | Minimum samples required to split a node |
| `min_samples_leaf` | Minimum samples required in a leaf node |
| `max_features` | Number of features considered for best split |
| `bootstrap` | Whether to sample with replacement |

#### Advantages of Random Forest  
✅ **Robust to outliers**, and some implementations can also handle missing values  
✅ **Reduces overfitting** by combining multiple trees  
✅ **Works well with large datasets**  
✅ **Reduces variance** compared to a single decision tree  
✅ **Provides feature importance ranking**  

#### Disadvantages of Random Forest  
❌ **Computationally expensive** for very large datasets  
❌ **Less interpretable** compared to a single decision tree  
❌ **Not ideal for real-time applications** due to high inference time  

---

#### Importing `RandomForestClassifier`

We need to import a class called `RandomForestClassifier` from a module called `sklearn.ensemble`. `sklearn` (or **scikit-learn**) is a library containing many machine learning modules. Almost every machine learning algorithm it provides can be applied directly, without implementing the underlying math yourself. It is kind of a plug-and-play device.


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
import numpy as np

#### Target & Feature Variables Separation

The `RandomForestClassifier` class has a method called `fit()` which takes two inputs. The first input is the collection of feature variables.

*The features are those variables which describe the properties of an entity.* In this case, the columns `FLUX.1` to `FLUX.3197` are the feature variables. Hence, the values stored in these columns are the features of a star in the exoplanets dataset.

The second input is the target variable.

*The variable which needs to be predicted is called a target variable.* In this case, the `LABEL` is the target variable because the prediction model needs to predict which star belongs to which class in the test dataset. Hence, the values stored in the `LABEL` column are the target values.

So, we need to extract the target variable and the feature variables separately from the training dataset.

Let's store the feature variables in the `x_train` variable and the target variable in the `y_train` variable. We will separate the features using the `iloc[]` indexer.

In [None]:
# Taking out the features
x_train = exo_train_df.iloc[:,1:]
x_train[:5]


Unnamed: 0,FLUX.1,FLUX.2,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,FLUX.10,...,FLUX.3188,FLUX.3189,FLUX.3190,FLUX.3191,FLUX.3192,FLUX.3193,FLUX.3194,FLUX.3195,FLUX.3196,FLUX.3197
0,93.85,83.81,20.1,-26.98,-39.56,-124.71,-135.18,-96.27,-79.89,-160.17,...,-78.07,-102.15,-102.15,25.13,48.57,92.54,39.32,61.42,5.08,-39.54
1,-38.88,-33.83,-58.54,-40.09,-79.31,-72.81,-86.55,-85.33,-83.97,-73.38,...,-3.28,-32.21,-32.21,-24.89,-4.86,0.76,-11.7,6.46,16.0,19.93
2,532.64,535.92,513.73,496.92,456.45,466.0,464.5,486.39,436.56,484.39,...,-71.69,13.31,13.31,-29.89,-20.88,5.06,-11.8,-28.91,-70.02,-96.67
3,326.52,347.39,302.35,298.13,317.74,312.7,322.33,311.31,312.42,323.33,...,5.71,-3.73,-3.73,30.05,20.03,-12.67,-8.77,-17.31,-17.35,13.98
4,-1107.21,-1112.59,-1118.95,-1095.1,-1057.55,-1034.48,-998.34,-1022.71,-989.57,-970.88,...,-594.37,-401.66,-401.66,-357.24,-443.76,-438.54,-399.71,-384.65,-411.79,-510.54


In [None]:
# Taking out the target
y_train = exo_train_df.iloc[:,0]
y_train

0       2
1       2
2       2
3       2
4       2
       ..
5082    1
5083    1
5084    1
5085    1
5086    1
Name: LABEL, Length: 5087, dtype: int64

---

#### Random Forest Hyperparameters

Random Forest is an ensemble learning method that constructs multiple decision trees to improve predictive accuracy and control overfitting. Understanding its hyperparameters is crucial for effective model tuning. Here's a breakdown of key hyperparameters:

##### n_estimators

- **Definition**: Specifies the number of trees in the forest.

- **Impact**: Increasing the number of trees generally enhances model performance by reducing variance. However, after a certain point, adding more trees yields diminishing returns and increases computational cost.

##### max_depth

- **Definition**: Sets the maximum depth of each decision tree.

- **Impact**: Deeper trees can capture more complex patterns but might overfit the data. Limiting the depth helps in balancing bias and variance.

##### min_samples_split

- **Definition**: The minimum number of samples required to split an internal node.

- **Impact**: Higher values prevent the model from learning overly specific patterns (overfitting) by ensuring that splits occur only when a sufficient number of samples are present.

##### min_samples_leaf

- **Definition**: The minimum number of samples required to be at a leaf node.

- **Impact**: Larger values result in leaves that contain more data, promoting generalization by preventing the model from learning noise in the training data.

##### max_features

- **Definition**: Determines the number of features to consider when looking for the best split.

- **Impact**: Using fewer features can make the model more diverse but might miss important information. Common settings include 'sqrt' (square root of the total number of features) and 'log2' (base-2 logarithm of the total number of features).
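To get a feel for these two settings on this dataset, the short calculation below applies them to the `3197` flux columns. It assumes the result is rounded down to a whole number, which is my understanding of how scikit-learn resolves `'sqrt'` and `'log2'`.

```python
import math

n_features = 3197  # the columns FLUX.1 to FLUX.3197

# Assumption: the value is truncated to an integer, with a minimum of 1.
sqrt_features = max(1, int(math.sqrt(n_features)))
log2_features = max(1, int(math.log2(n_features)))

print(f"'sqrt' -> {sqrt_features} features considered per split")
print(f"'log2' -> {log2_features} features considered per split")
```

Either way, each split looks at only a small fraction of the flux columns.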

##### bootstrap

- **Definition**: Decides whether to use all data points when building each tree or to sample data points with replacement.

- **Impact**: Sampling with replacement (bootstrap) introduces randomness that can improve model robustness. If set to False, the whole dataset is used to build each tree.

Understanding and tuning these hyperparameters can significantly influence the performance of a Random Forest model, allowing for better control over the balance between bias and variance.


In [26]:

# Define the hyperparameter distribution
param_dist = {
    'n_estimators': np.arange(50, 201, 50),  # Number of trees (50, 100, 150, 200)
    'max_depth': [None, 10, 15, 20],  # Tree depth
    'min_samples_split': [2, 5, 10],  # Minimum samples to split a node
    'min_samples_leaf': [1, 2, 4],  # Minimum samples per leaf
    'max_features': ['sqrt', 'log2'],  # Number of features to consider per split
    'bootstrap': [True, False]  # Bootstrap sampling
}

# Initialize Random Forest
rf_clf = RandomForestClassifier(n_jobs=-1)

# Perform Randomized Search
random_search = RandomizedSearchCV(
    rf_clf, param_distributions=param_dist, 
    n_iter=20, cv=5, scoring='accuracy', 
    n_jobs=-1, verbose=2, random_state=42
)

# Train on dataset
random_search.fit(x_train, y_train)

# Get best parameters
best_params = random_search.best_params_
print(f"Best Parameters: {best_params}")

# Train the final model with the best parameters
best_rf_clf = RandomForestClassifier(**best_params, n_jobs=-1)
best_rf_clf.fit(x_train, y_train)

# Evaluate performance
train_accuracy = best_rf_clf.score(x_train, y_train)
print(f"Optimized Training Accuracy: {train_accuracy:.4f}")


Fitting 5 folds for each of 20 candidates, totalling 100 fits
Best Parameters: {'n_estimators': np.int64(150), 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 20, 'bootstrap': True}
Optimized Training Accuracy: 0.9965
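The "100 fits" line in the output follows directly from the search settings. The sketch below repeats the parameter lists from the cell above as plain Python lists and counts how many combinations a full grid would contain, compared with what the randomized search actually tries.

```python
from math import prod

# Same candidate values as param_dist above, written as plain lists.
param_dist = {
    "n_estimators": [50, 100, 150, 200],
    "max_depth": [None, 10, 15, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"],
    "bootstrap": [True, False],
}

total_combinations = prod(len(values) for values in param_dist.values())
n_iter, cv = 20, 5  # the values passed to RandomizedSearchCV above
total_fits = n_iter * cv

print(f"Full grid: {total_combinations} combinations")
print(f"Randomized search: {n_iter} candidates x {cv} folds = {total_fits} fits")
print(f"Share of the grid sampled: {n_iter / total_combinations:.1%}")
```

The randomized search samples only a small share of the grid, so the reported best parameters are the best among the sampled candidates, not necessarily the best overall.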


---

#### The `predict()` Function

Now, let's make predictions on the test dataset by calling the `predict()` function with the feature variables of the test dataset as an input.

In [28]:

prediction = best_rf_clf.predict(exo_test_df.iloc[:,1:])
prediction

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

The `predict()` function returns a NumPy array of the predicted values. You can verify it using the `type()` function.

In [None]:
print(type(prediction))

<class 'numpy.ndarray'>


The actual target values are stored in a Pandas series. So, for the sake of consistency, let's convert the NumPy array of the predicted values into a Pandas series.

In [None]:
pd_series = pd.Series(prediction)
pd_series

0      1
1      1
2      1
3      1
4      1
      ..
565    1
566    1
567    1
568    1
569    1
Length: 570, dtype: int64

Now, let's count the number of stars classified as `1` and `2`.

In [None]:
pd_series.value_counts()

1    570
Name: count, dtype: int64

As you can see, we did not get the expected results: the model classified all `570` stars as `1`. Ideally, the Random Forest Classifier model should have classified `565` stars as `1` and the remaining `5` stars, the ones having a planet, as `2`.

Even though the accuracy of the prediction model is high, it does not give the desired result for our problem statement. Hence, **accuracy alone is not the metric to test the efficacy of a prediction model.**
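One way to see what accuracy hides is to count the four possible outcomes for the planet class and compute the recall, i.e. the fraction of planet-hosting stars that were actually found. The sketch below uses plain Python and labels built from the counts in this lesson. The helper name is hypothetical, and treating class `2` as the "positive" class is a choice made for this illustration.

```python
def confusion_counts(actual, predicted, positive=2):
    """Count true/false positives and negatives for the chosen positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    return tp, fn, fp, tn


# 565 stars without a planet, 5 with a planet, and every star predicted as 1.
actual = [1] * 565 + [2] * 5
predicted = [1] * 570

tp, fn, fp, tn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual)
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"TP={tp}  FN={fn}  FP={fp}  TN={tn}")
print(f"Accuracy: {accuracy:.4f}")
print(f"Recall for class 2: {recall:.2f}")
```

A recall of zero for class `2` says plainly what the high accuracy does not: no planet-hosting star was detected.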


In [30]:
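# Optional experiment (left commented out): try an XGBoost classifier instead.
# import xgboost as xg

# x_test = exo_test_df.iloc[:, 1:]  # test features, same columns as x_train
# md = xg.XGBClassifier()
# md.fit(x_train, y_train - 1)      # recent XGBoost versions expect labels 0 and 1, not 1 and 2
# y_pred = md.predict(x_test) + 1   # map the predictions back to labels 1 and 2
# print(pd.Series(y_pred).head(10))
# print(pd.Series(y_pred).value_counts())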

---