<a href="https://colab.research.google.com/github/GabrielleRab/SRMPmachine/blob/main/Decision_Trees_Exoplanets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Exploring Exoplanets with Decision Trees** 

### **Step 1:** Identify your question

In this Colab, you will build a machine learning model to find out which characteristics of an exoplanet best predict its detection method. Put another way: which kinds of exoplanets can we find using each method of detection?

For the sake of simplicity, let's focus on Radial Velocity and Transit, two of the most common means of exoplanet detection.

### **Step 2:** Select your data

The dataset for this activity came from the NASA Exoplanet Archive and is a composite of data from several different sources.

We will be looking at a number of features of exoplanets. Here is a definition of each column in the dataset:

*   pl_name: Planet Name
*   discoverymethod: Discovery Method
*   disc_year: Discovery Year
*   pl_orbper:      Orbital Period [days]
*   pl_orbsmax:     Orbit Semi-Major Axis [au]
*   pl_radj:        Planet Radius [Jupiter Radius]
*   pl_bmassj:      Planet Mass or Mass*sin(i) [Jupiter Mass]
*   pl_bmassprov:   Planet Mass or Mass*sin(i) Provenance
*   st_teff:        Stellar Effective Temperature [K]
*   st_rad:         Stellar Radius [Solar Radius]
*   st_mass:        Stellar Mass [Solar mass]
*   sy_dist:        Distance [pc]

Let's load the data into the colab and take a look. Run the cell below to create a dataframe (table) of our data and preview the first five rows:

In [None]:
# import the necessary Python libraries
import pandas as pd

# create a dataframe called "df" with the dataset
df = pd.read_csv("https://raw.githubusercontent.com/GabrielleRab/SRMPmachine/main/datasets/exoplanets_cleaned.csv")

# preview the first five rows
df.head()

The "discoverymethod" column contains the labels for this dataset. It tells us in advance which detection method was used for each exoplanet.

Run the code below to find out how many rows are in our dataset. Each row represents an exoplanet that has been identified.

In [None]:
# return the number of rows in the dataset
len(df)

### **Step 3:** Choose your method

We will be using the Decision Tree method today, as it is a good fit for categorizing labeled data.

Run the code below to import the necessary Python libraries for creating Decision Trees.

In [None]:
#import necessary Python libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import graphviz
from numpy.ma.extras import unique

### **Step 4:** Prepare your data

Decision Trees work best when the features being analyzed are numerical. Aside from our labels ("discoverymethod"), there are two columns that we need to address: "pl_name" and "pl_bmassprov".

"pl_name" is easy to deal with. There aren't any meaningful binary questions we can ask about the planet's name, so let's remove that column. Run the code below to do so:

In [None]:
#We will remove columns we don't need for this investigation
df = df.drop('pl_name', axis=1)

#Return the first 5 rows of the dataframe
df.head()

Next, let's address the "pl_bmassprov" column. This feature is actually very useful for us, as it indicates whether or not inclination was used to calculate the mass for the exoplanet. Certain detection methods include inclination while others don't.

Instead of deleting the column, let's convert the feature into a binary value: 0 if the mass calculation method does not include inclination and 1 if it does. Run the code below to make this change:

In [None]:
#Change pl_bmassprov column to 0 for exoplanets whose mass was not calculated
#using inclination and 1 for those where it was used
df['pl_bmassprov'] = df['pl_bmassprov'].map({'Mass': 0, 'M-R relationship': 0,
                                             'Msin(i)/sin(i)': 1, 'Msini': 1})

#Return the first 5 rows of the dataframe
df.head()
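
One thing to watch for with `map` and a dictionary: any value that is not a key in the dictionary becomes `NaN` (a missing value). The sketch below shows this on a tiny made-up table; the `'Unknown'` label and the `used_inclination` column name are hypothetical and only there for illustration.

```python
import pandas as pd

# Toy data (not from the real dataset); 'Unknown' is a made-up label
toy = pd.DataFrame({"pl_bmassprov": ["Mass", "Msini", "Unknown"]})

# Same mapping as in the cell above
mapping = {"Mass": 0, "M-R relationship": 0,
           "Msin(i)/sin(i)": 1, "Msini": 1}
toy["used_inclination"] = toy["pl_bmassprov"].map(mapping)

# Count the values that the mapping did not cover (they are now NaN)
n_unmapped = int(toy["used_inclination"].isnull().sum())
print(toy)
print("Unmapped values:", n_unmapped)
```

Checking this count after a `map` call is a quick way to confirm that every category in the column was accounted for.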

The next concern is that some rows in this dataset might be missing values for certain features. 

The code to create the model won't work unless every exoplanet has a value for every feature, so we need to address this, either by removing the rows with missing values or by removing the columns where those missing values are found.

In some cases, the absence of a value can tell us something. For example, some exoplanets don't have any value listed for orbital period. What discovery method do you think was used to find those? Run the code below to find out:

In [None]:
#Create a dataframe containing only the rows with no orbital period
df_no_orb = df[df['pl_orbper'].isnull()]

#Return the first 10 rows of that dataframe
df_no_orb.head(10)

These exoplanets were primarily detected using the Imaging method. 
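
Looking at the first 10 rows only gives a partial picture. As a sketch of how you could count the discovery methods across all rows with a missing orbital period, here is the same filter followed by `value_counts()`, run on a small made-up table (the values are illustrative, not taken from the real dataset):

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "discoverymethod": ["Imaging", "Imaging", "Transit",
                        "Radial Velocity", "Imaging"],
    "pl_orbper": [np.nan, np.nan, 3.5, 120.0, np.nan],
})

# Keep only rows with no orbital period, then count each discovery method
missing_counts = toy[toy["pl_orbper"].isnull()]["discoverymethod"].value_counts()
print(missing_counts)
```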

We'll be focusing on the Transit and Radial Velocity detection methods today, so we can remove these rows, but if we wanted to find exoplanets discovered using direct imaging we might want to remove the orbital period column instead. Otherwise, we would accidentally remove an entire subset of our data we wanted to study.

Since we aren't worried about preserving any of the information from missing values, we can remove all of the rows with anything missing. Run the code below to do so:

In [None]:
#Remove any rows with missing values
df = df.dropna()

#Return the remaining number of rows
print(len(df))

We just removed 461 rows from our dataset, leaving only exoplanets with values for each feature.
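
If you want to see which columns the missing values came from before removing rows, `isnull().sum()` counts them per column. A minimal sketch on a made-up table (the numbers are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "pl_orbper": [3.5, np.nan, 120.0, 10.2],
    "pl_radj": [1.1, 0.9, np.nan, 0.3],
    "st_teff": [5700.0, 4800.0, 6100.0, 5200.0],
})

# Count the missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Remove rows with any missing value and report how many were dropped
rows_before = len(toy)
toy_clean = toy.dropna()
rows_removed = rows_before - len(toy_clean)
print("Rows removed:", rows_removed)
```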

Finally, we can remove any rows from the dataset where detection methods other than Radial Velocity and Transit were used. Run the code below to do so:

In [None]:
#Keep only exoplanets discovered by Radial Velocity or Transit
#(an exact match, so "Transit Timing Variations" is removed as well)
df = df[df["discoverymethod"].isin(['Radial Velocity', 'Transit'])]

#Return the current number of rows in the dataset
print(len(df))

Only 35 exoplanets in our cleaned dataset were detected using methods other than Radial Velocity or Transit. 

Now that our dataset is ready, let's check it for balance. Are there the same number of exoplanets detected using each method? Run the code below to find out:

In [None]:
#Return the number of rows with either Radial Velocity or Transit for discovery method
print("Radial Velocity:",len(df[df.discoverymethod == 'Radial Velocity']))
print("Transit:",len(df[df.discoverymethod == 'Transit']))

As it turns out, there are significantly more exoplanets discovered using Transit than Radial Velocity. 
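
To put a number on the imbalance, `value_counts(normalize=True)` returns each label's share of the rows instead of a raw count. A small sketch with made-up labels (the 8-to-2 ratio is an assumption for illustration, not the real dataset's ratio):

```python
import pandas as pd

# Toy labels (made up): 8 Transit and 2 Radial Velocity
labels = pd.Series(["Transit"] * 8 + ["Radial Velocity"] * 2,
                   name="discoverymethod")

# Raw counts per label
counts = labels.value_counts()

# Fraction of rows per label
shares = labels.value_counts(normalize=True)

print(counts)
print(shares)
```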

**Bias Alert:** This introduces a possible source of statistical bias into our analysis. Our decision tree may end up better prepared to identify exoplanets detected using Transit than Radial Velocity because it has more information about them.

One way to address this source of bias is to ensure that our training and testing data both contain enough exoplanets detected using Radial Velocity. Another is to make sure that we have a large enough training dataset to include sufficient exoplanets detected using Radial Velocity.

Now it's time to split our data into a training and a testing dataset. 

**Bias Alert:** If we split our data 50/50 and we see a difference in the number of exoplanets detected by each method in the two halves, it's a sign that our data is sorted.

Run the code below to check:

In [None]:
#Find the midpoint of the dataset
half = len(df) // 2

#Create a dataframe with the first half of our data
df_first = df.iloc[:half]

#Create a dataframe with the second half of our data
df_last = df.iloc[half:]

#Return the number of exoplanets detected using each method for both halves
print("First Half")
print("Radial Velocity:",len(df_first[df_first.discoverymethod == 'Radial Velocity']))
print("Transit:",len(df_first[df_first.discoverymethod == 'Transit']))
print("")
print("Second Half")
print("Radial Velocity:",len(df_last[df_last.discoverymethod == 'Radial Velocity']))
print("Transit:",len(df_last[df_last.discoverymethod == 'Transit']))

Our dataset is significantly unbalanced. Not only are there more exoplanets detected using Transit than Radial Velocity, but our data was also sorted.

We can address this issue by creating a random split instead. Run the code below to do so:

In [None]:
#Get the features and labels from the data 
x = df.drop(['discoverymethod'], axis=1)
y = df['discoverymethod']

#Specify a 50% split
training_percentage = 50 

#Create the training and testing datasets
X_train, X_test, Y_train, Y_test = train_test_split(x, y, train_size=training_percentage/100)
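
A random split does not guarantee that both halves get the same share of each label. One option, not used in this notebook, is the `stratify` argument of `train_test_split`, which keeps the label proportions the same in the training and testing sets. The sketch below uses made-up data; `random_state=0` is only there so the split is repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (made up): 16 Transit and 4 Radial Velocity exoplanets
toy_x = pd.DataFrame({"pl_orbper": range(20)})
toy_y = pd.Series(["Transit"] * 16 + ["Radial Velocity"] * 4)

# stratify=toy_y keeps the 80/20 label ratio in both halves
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_x, toy_y, train_size=0.5, stratify=toy_y, random_state=0)

print(y_tr.value_counts())
print(y_te.value_counts())
```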

### **Step 5:** Train the model

Now it's time to make our decision tree. We will also need to set the hyperparameters (values that control how the model learns and makes decisions). In this case we will specify the maximum depth and the criterion the model will use to evaluate each feature. 

The "Gini index" is a measure of how pure the split is for each node in the Decision Tree. A lower Gini index indicates a more pure split.
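
For a group of labels, the Gini index is 1 minus the sum of the squared fraction of each label. As a rough sketch of the calculation (the helper `gini_impurity` is a hypothetical name written for this example, not a scikit-learn function):

```python
from collections import Counter

def gini_impurity(labels):
    """Return 1 minus the sum of squared label fractions."""
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

# A node containing only one label is perfectly pure
pure = ["Transit"] * 4

# A node split evenly between two labels is as impure as two labels can be
mixed = ["Transit", "Transit", "Radial Velocity", "Radial Velocity"]

print(gini_impurity(pure))   # 0.0
print(gini_impurity(mixed))  # 0.5
```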

Run the code below to create your model:

In [None]:
#Create a decision tree classifier called "clf"
clf = DecisionTreeClassifier(criterion='gini', max_depth=2, random_state=0)

Now that we have created a decision tree, it's time to train it using our training dataset. We will also evaluate the model's accuracy: the percentage of exoplanets whose discovery method it classifies correctly.

Run the code below to train and evaluate our model:

In [None]:
#Train the model
clf.fit(X_train, Y_train)

#Print the training accuracy 
print('\nTraining Accuracy (%): ',(100*(clf.score(X_train,Y_train))))

**Bias alert:** A very high training accuracy can be a sign of overfitting. An overfit model has learned the training data so closely that it predicts the training data better than the testing data.

We can try to address this by training the model with a few different split percentages (instead of a 50/50 train/test split). Run the code below to see how they compare:

In [None]:
#Train the model with a 40% training percentage
X_train_b, X_test_b, Y_train_b, Y_test_b = train_test_split(x, y, train_size=40/100)
clf_b = DecisionTreeClassifier(criterion='gini', max_depth=2)
clf_b.fit(X_train_b, Y_train_b)
print('Training Accuracy (%) for 40/60 split: ',(100*(clf_b.score(X_train_b,Y_train_b))))

#Train the model with a 30% training percentage
X_train_c, X_test_c, Y_train_c, Y_test_c = train_test_split(x, y, train_size=30/100)
clf_c = DecisionTreeClassifier(criterion='gini', max_depth=2)
clf_c.fit(X_train_c, Y_train_c)
print('Training Accuracy (%) for 30/70 split: ',(100*(clf_c.score(X_train_c,Y_train_c))))

Rerun the code above a few times to see how the results change.

Our Decision Tree's training accuracy is always around 98-99%, regardless of the split percentage we use. While this is a little high, we'll have to see how it performs with the testing data to make a final call.
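
If you want to compare several split percentages without copying the code each time, a loop is one way to do it. This is a sketch on synthetic data from `make_classification` (a stand-in for the exoplanet features, so the accuracies will not match the ones above); the variable names are chosen for this example only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data standing in for the exoplanet dataset
toy_x, toy_y = make_classification(n_samples=400, n_features=5, random_state=0)

train_scores = {}
for pct in (50, 40, 30):
    # Split, train, and score on the same training subset each time
    X_tr, X_te, y_tr, y_te = train_test_split(
        toy_x, toy_y, train_size=pct / 100, random_state=0)
    model = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
    model.fit(X_tr, y_tr)
    train_scores[pct] = 100 * model.score(X_tr, y_tr)
    print(f"Training accuracy (%) for {pct}/{100 - pct} split: {train_scores[pct]:.1f}")
```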

Now that we have trained our model, we can see which features it is using to make predictions. 

Run the code below to visualize the tree based on the original 50/50 split:

In [None]:
#Create a visualization for the decision tree
dot_data = tree.export_graphviz(clf, out_file=None,
                               feature_names=list(X_train.columns),
                               class_names=list(clf.classes_),
                               filled=True, rounded=True,
                               special_characters=True)
graph = graphviz.Source(dot_data)
graph

#did not use inclination -> is transit
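
If graphviz is not available, scikit-learn's `export_text` prints the same decision rules as plain text. The sketch below trains a tiny tree on made-up data (six invented exoplanets, not rows from the real dataset) just to show what the output looks like.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data (made up for illustration)
toy_x = pd.DataFrame({
    "pl_bmassprov": [0, 0, 0, 1, 1, 1],
    "pl_orbper": [3.0, 5.0, 2.0, 300.0, 150.0, 400.0],
})
toy_y = ["Transit", "Transit", "Transit",
         "Radial Velocity", "Radial Velocity", "Radial Velocity"]

# Train a small tree with the same hyperparameters as the notebook
toy_clf = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
toy_clf.fit(toy_x, toy_y)

# Print the tree's rules as indented text
rules = export_text(toy_clf, feature_names=list(toy_x.columns))
print(rules)
```

Each indented line is one question the tree asks, and each `class:` line is the prediction at a leaf.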

### **Step 6:** Test the model

Run this code block to pass the other half of our data (the testing dataset) through the model we just trained and find out how accurate it is with new data:

In [None]:
#Make the prediction using the model
Y_pred = clf.predict(X_test)

print('Percentage accuracy: ', 100*accuracy_score(Y_test, Y_pred))

Percentage accuracy:  99.14221218961626


Our decision tree was more accurate in classifying the testing data than the training data. That's a good sign that we didn't overfit it to the training data.
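
Because the dataset has far more Transit than Radial Velocity exoplanets, overall accuracy can hide weaker performance on the smaller class. One way to check is a confusion matrix, which counts correct and incorrect predictions per label. This sketch uses made-up labels and predictions, not the model's real output:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true labels and predictions for 10 exoplanets
y_true = ["Transit"] * 8 + ["Radial Velocity"] * 2
y_pred = ["Transit"] * 8 + ["Transit", "Radial Velocity"]

# Rows are true labels, columns are predicted labels, in this order
labels = ["Radial Velocity", "Transit"]
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Overall accuracy across all exoplanets
overall = accuracy_score(y_true, y_pred)

# Accuracy within each true label (correct predictions / row total)
per_class = cm.diagonal() / cm.sum(axis=1)

print(cm)
print("Overall accuracy:", overall)
print(dict(zip(labels, per_class)))
```

In this toy case the overall accuracy is 90%, but only half of the Radial Velocity exoplanets are classified correctly.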



### **Step 7:** Evaluate the model

Let's evaluate the model with an imaginary exoplanet to see how it performs and what we can learn from it.

Consider AMNH-01, a newly discovered exoplanet that was found orbiting the star AMNH in the UWS galaxy. AMNH-01 was discovered using the Transit method.

Based on your Decision Tree visualized above, what can you predict about AMNH-01?