# Using Random Forest for Classifying Exoplanets

Author: Fatemeh (Fatima) Bagheri

Date Created: November 24, 2022, Last Modified: December 19, 2022

Notebook 2/2 of the DSECOP Module: An Introduction to the Random Forest algorithm.


We want to classify exoplanets by type. There are two major types of exoplanets: rocky planets and Jovian (Jupiter-like) planets. Rocky planets are similar to Earth: they are made of rock, have relatively low mass, and lie closer to their host star than the *snow line* ([https://en.wikipedia.org/wiki/Frost_line_(astrophysics)]). Jovian planets, on the other hand, are massive, gaseous planets that can be found at any distance: very close to the host star (hot Jupiters) or farther away, like Jupiter in our solar system.

This is a classification problem: if the exoplanet is a rocky planet, we label it 0; if it is a Jovian planet but not a hot Jupiter, we label it 1; and if it is a hot Jupiter, we label it 2. In this notebook, we implement a classification model with a Random Forest classifier using Python's `Scikit-Learn`.

The steps in this procedure are:

* importing necessary libraries, 
* reading the data, 
* splitting the data into training and test sets,
* defining hyperparameters (for more information, look at the Introduction to Deep Learning module) of the model and training a Random Forest Classifier model,
* and last, evaluating the model.

So, let's start! First, import the libraries we need:


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Now, read the data. You can also explore its columns and some simple statistics:



In [None]:
dataset = pd.read_csv("RandomForest_data.csv")
dataset.info()

or

In [None]:
dataset.describe().T

We only need some of the information in the data file to classify the exoplanets by type. Therefore, we work only with the parameters that could be related to the exoplanet types:

In [None]:
# read the file RandomForestData_header.txt to understand the header of the data
with open('RandomForestData_header.txt', 'r') as f:
    lines = f.readlines()

# map each column description (third tab-separated field) to its column name (second field)
header = {}
for l in lines:
    parts = l.split('\t')
    header[parts[2].rstrip('\n')] = parts[1].rstrip(' ')

# define the parameters that we need; use the header file to find which columns of data you need
parameters = ['Planet Mass or Mass*sin(i) [Earth Mass]', 
              'Stellar Metallicity [dex]', 'Stellar Surface Gravity [log10(cm/s**2)]'
              ,'Stellar Mass [Solar mass]',
              'Equilibrium Temperature [K]', 'Stellar Effective Temperature [K]', 
              'Orbital Period [days]', 'Stellar Radius [Solar Radius]','label']


# store needed data
data = pd.DataFrame()
for p in parameters:
    data[p] = dataset[header[p]].astype('float32')
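To make the header lookup concrete, here is a minimal sketch that runs the same parsing on a few made-up header lines instead of the real file. The column names (`pl_bmasse`, `pl_orbper`) and the toy values are assumptions for illustration only; the real header file may look different.

```python
import io
import pandas as pd

# Hypothetical header lines: index, column name, description (tab-separated).
# These rows are made up for illustration; the real header file may differ.
toy_header = (
    "1\tpl_bmasse \tPlanet Mass or Mass*sin(i) [Earth Mass]\n"
    "2\tpl_orbper \tOrbital Period [days]\n"
    "3\tlabel \tlabel\n"
)

header = {}
for line in io.StringIO(toy_header):
    parts = line.split('\t')
    # key: description (third field), value: column name (second field)
    header[parts[2].rstrip('\n')] = parts[1].rstrip(' ')

# A tiny made-up table that uses the short column names
toy_dataset = pd.DataFrame({
    "pl_bmasse": [1.0, 317.8],
    "pl_orbper": [365.25, 4332.6],
    "label": [0, 1],
})

# Select columns through their descriptions, as in the cell above
wanted = ["Planet Mass or Mass*sin(i) [Earth Mass]", "Orbital Period [days]", "label"]
data = pd.DataFrame()
for p in wanted:
    data[p] = toy_dataset[header[p]].astype("float32")

print(header)
print(data)
```

The dictionary lets us ask for columns by their readable descriptions while the data file itself keeps its short column names.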




Now, we define the labels of the exoplanets as the outputs we want to predict (y) and the parameters of the exoplanets as the inputs (X):



In [None]:
y = data['label']
X = data.drop(['label'], axis=1)



Next, we split this data into training and test sets with a ratio of 80/20, using `Scikit-Learn`'s `train_test_split()`:



In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
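As a quick check of what an 80/20 split produces, the sketch below splits a synthetic dataset with three unevenly sized classes. The data is random and only stands in for the exoplanet table. The `stratify` argument is an optional addition that is not used in the cell above; it asks `train_test_split()` to keep the class proportions similar in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 3 features, 3 unbalanced classes
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.repeat([0, 1, 2], [50, 30, 20])

# 80/20 split; stratify=y_toy is optional and keeps class proportions similar
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("training set:", X_tr.shape, "test set:", X_te.shape)
print("class counts in training set:", np.bincount(y_tr))
print("class counts in test set:", np.bincount(y_te))
```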


Now it's time to import the `RandomForestClassifier` class and create the model. As mentioned earlier, we should set a few hyperparameters to define our model. One is the number of decision trees in our random forest model; the other is the maximum depth of each decision tree. Thus, let's create a model with 5 decision trees (`n_estimators=5`), each with a maximum depth of 3 (`max_depth=3`):


In [None]:
from sklearn.ensemble import RandomForestClassifier

our_model = RandomForestClassifier(n_estimators=5, max_depth=3, random_state=42)

Our model should learn the relationship between the parameters in X and the labels in y from the training set. To do this, we **fit** the model to the training set by calling `fit()`:


In [None]:
our_model.fit(X_train, y_train)


To see what exactly is happening in our model, we can look at the structure of its decision trees. This can be done by using the `tree` module built into `Scikit-Learn` and looping through each of the estimators in the ensemble:

In [None]:
from sklearn import tree

features = X.columns.values                     # The name of each column
classes = ['0', '1', '2']                            # Labels

for estimator in our_model.estimators_:
    print(estimator)
    plt.figure(figsize=(12,6))
    tree.plot_tree(estimator, feature_names=features, class_names=classes, fontsize=8, filled=True, rounded=True)
    plt.show()
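The same loop over `estimators_` can also show how the individual trees are combined. In `Scikit-Learn`, the forest's class probabilities are the average of the class probabilities of its trees. The sketch below checks this on synthetic data generated with `make_classification`, which is only an assumed stand-in for the exoplanet data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 3-class data standing in for the exoplanet table
X_toy, y_toy = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0
)

forest = RandomForestClassifier(n_estimators=5, max_depth=3, random_state=42)
forest.fit(X_toy, y_toy)

# Class probabilities from each tree, shape: (n_trees, n_samples, n_classes)
tree_probs = np.array([est.predict_proba(X_toy) for est in forest.estimators_])

# Average over the trees and compare with the forest's own probabilities
mean_probs = tree_probs.mean(axis=0)
forest_probs = forest.predict_proba(X_toy)

print("average of trees matches forest:", np.allclose(mean_probs, forest_probs))
```

The predicted label of a sample is then the class with the highest averaged probability.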

We can now use our trained model to predict the labels of the test set:

In [None]:
y_predicted = our_model.predict(X_test)


At this point, we can evaluate the performance of our model. There are several ways to do that. For instance, we can calculate the **accuracy** of the model, which is the fraction of the model's predictions that are correct.
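As a small illustration of this definition, the sketch below computes the accuracy by hand for ten made-up labels and compares it with `accuracy_score` from `Scikit-Learn`. The label arrays are invented for the example and are not results from our model.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up true and predicted labels for ten samples
y_true = np.array([0, 1, 2, 1, 0, 2, 1, 1, 0, 2])
y_pred = np.array([0, 1, 2, 0, 0, 1, 1, 1, 0, 2])

# Accuracy by hand: fraction of positions where prediction equals truth
manual_accuracy = np.mean(y_true == y_pred)

# The same quantity from Scikit-Learn
sklearn_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, sklearn_accuracy)
```

Here 8 of the 10 predictions match, so both values are 0.8.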

The other way is to calculate the **confusion matrix**. A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm. If we want to know the number of **true positives** or **false positives**, as well as **true negatives** and **false negatives**, we can use a confusion matrix. To understand the meaning of positive/negative and true/false, let's define them in the context of our project. To keep things simple, consider only two of our labels: 0 for rocky planets and 1 for Jovian planets. Let's say label 1 is positive and label 0 is negative. Then a true positive is a sample (sometimes called an *instance* in ML projects) for which we **correctly** predicted the Jupiter-like type (label 1, which is positive). Samples predicted as positive that are not actually positive are called false positives.

We can now compare the predicted labels against the real labels to evaluate the performance of the model.

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

# rows are the true labels, columns are the predicted labels
cm = confusion_matrix(y_test, y_predicted)
print(cm)
print(classification_report(y_test, y_predicted))
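With three labels, the confusion matrix is a 3x3 table, and the ideas of true and false positives apply to each label in turn. The sketch below uses ten made-up labels (not results from our model) to show how to read the matrix: in `Scikit-Learn`, rows are the true labels and columns are the predicted labels.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for ten samples
y_true = np.array([0, 1, 2, 1, 0, 2, 1, 1, 0, 2])
y_pred = np.array([0, 1, 2, 0, 0, 1, 1, 1, 0, 2])

# Rows: true labels 0, 1, 2; columns: predicted labels 0, 1, 2
cm_toy = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm_toy)

# True positives of each label sit on the diagonal
true_positives = np.diag(cm_toy)

# False positives of a label: everything else in that label's column
false_positives = cm_toy.sum(axis=0) - true_positives

for label in range(3):
    print(f"label {label}: TP = {true_positives[label]}, FP = {false_positives[label]}")
```

For example, the single off-diagonal entry in row 1, column 0 is a Jovian planet (true label 1) that was predicted as rocky (label 0), so it counts as a false positive for label 0.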

## Variable Importances

When we use data science methods in scientific research, we often want to learn about the physical model behind the problem. In fact, most of the time, that is the reason we use data analysis methods! When the physical model is unknown, we want to know the role each parameter plays in the analysis. To quantify the relative importance of the parameters in our model, we can use the `feature_importances_` attribute in `Scikit-Learn`.


In [None]:
importances = list(our_model.feature_importances_)
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(features, importances)]
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
for pair in feature_importances:
    print('Variable: {:20} Importance: {}'.format(*pair))
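The importances are relative: they are non-negative and normalized to add up to 1, so each value can be read as a share of the total. The sketch below checks this on synthetic data from `make_classification`; the data and the names `feature_0`, `feature_1`, ... are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 3-class data standing in for the exoplanet table
X_toy, y_toy = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0
)
names = [f"feature_{i}" for i in range(X_toy.shape[1])]  # hypothetical names

forest = RandomForestClassifier(n_estimators=5, max_depth=3, random_state=42)
forest.fit(X_toy, y_toy)

importances = forest.feature_importances_

# Print the features from most to least important
ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"Variable: {name:12} Importance: {importance:.3f}")

print("sum of importances:", importances.sum())
```

Note that rounding to two decimals, as in the cell above, can display a small but nonzero importance as 0.0.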


## Homework:

Define true negatives and false negatives in the context of our project.

# Project 

Generate a new model with more trees and see how it affects the results. You can remove those variables that have no importance.
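One possible starting point is a loop over several values of `n_estimators` that records the test accuracy for each. The sketch below does this on synthetic data from `make_classification`, which is only an assumed stand-in; for the project, replace it with the exoplanet training and test sets. The tree counts 5, 50, and 200 are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the exoplanet table
X_toy, y_toy = make_classification(
    n_samples=500, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Train one forest per tree count and store its test accuracy
results = {}
for n_trees in [5, 50, 200]:
    model = RandomForestClassifier(n_estimators=n_trees, max_depth=3, random_state=42)
    model.fit(X_tr, y_tr)
    results[n_trees] = accuracy_score(y_te, model.predict(X_te))

for n_trees, accuracy in results.items():
    print(f"{n_trees:4d} trees: test accuracy = {accuracy:.3f}")
```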