# Kepler Exoplanets - An ML comparison

The goal of this project is to compare the performance of different supervised machine learning models on the famous Kepler exoplanet dataset. To be successful, a machine learning model must correctly identify as many exoplanets as possible among the observed stellar objects.
This dataset contains about 7000 objects that are potentially exoplanets.

### 1/ Libraries import

This section imports the usual Python libraries used for data science.

In [1]:
import numpy as np
import pandas as pd
import sklearn.metrics  # explicit import, since sklearn.metrics.accuracy_score is used below
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn import svm
from sklearn.neural_network import MLPClassifier

### 2/ Dataset import

This section imports the dataset.
The chosen data is the Kepler Exoplanet dataset provided by the NASA Exoplanet Archive, available on the Caltech website: http://exoplanetarchive.ipac.caltech.edu

In [2]:
df = pd.read_csv('kepler_dataset.csv')

### 3/ First look & dataset description

This section takes a first look at the dataset, and lists and describes the features.

In [3]:
# usual head call, to get a first idea of the data
df.head()

Unnamed: 0,kepid,kepoi_name,kepler_name,koi_disposition,koi_pdisposition,koi_score,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,...,koi_steff_err2,koi_slogg,koi_slogg_err1,koi_slogg_err2,koi_srad,koi_srad_err1,koi_srad_err2,ra,dec,koi_kepmag
0,11446443,K00001.01,Kepler-1 b,CONFIRMED,NOT DISPOSITIONED,,0,0,0,0,...,-50.0,4.455,0.025,-0.025,0.95,0.02,-0.02,286.80847,49.316399,11.338
1,10666592,K00002.01,Kepler-2 b,CONFIRMED,NOT DISPOSITIONED,,0,0,0,0,...,-80.0,4.021,0.011,-0.011,1.991,0.018,-0.018,292.24728,47.969521,10.463
2,6678383,K00111.02,Kepler-104 c,CONFIRMED,CANDIDATE,,0,0,0,0,...,-78.0,4.081,0.213,-0.115,1.361,0.225,-0.305,287.60461,42.166779,12.596
3,6922244,K00010.01,Kepler-8 b,CONFIRMED,NOT DISPOSITIONED,,0,0,0,0,...,-158.0,4.169,0.055,-0.048,1.451,0.117,-0.129,281.28812,42.45108,13.563
4,9873254,K00717.01,Kepler-653 b,CONFIRMED,NOT DISPOSITIONED,,0,0,0,0,...,-127.0,4.391,0.065,-0.167,1.095,0.247,-0.094,282.21292,46.717819,13.387


In [4]:
# use a tail call as well
df.tail()

Unnamed: 0,kepid,kepoi_name,kepler_name,koi_disposition,koi_pdisposition,koi_score,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,...,koi_steff_err2,koi_slogg,koi_slogg_err1,koi_slogg_err2,koi_srad,koi_srad_err1,koi_srad_err2,ra,dec,koi_kepmag
7343,7906882,K00686.01,,CANDIDATE,CANDIDATE,,0,0,0,0,...,-132.0,4.47,0.077,-0.262,0.885,0.361,-0.096,296.84076,43.647121,13.579
7344,7976520,K00687.01,,NOT DISPOSITIONED,NOT DISPOSITIONED,,0,0,0,0,...,-167.0,4.479,0.056,-0.277,0.933,0.368,-0.093,297.11713,43.71143,13.813
7345,8161561,K00688.01,Kepler-645 b,CONFIRMED,NOT DISPOSITIONED,,0,0,0,0,...,-227.0,4.259,0.145,-0.282,1.273,0.664,-0.227,290.32278,44.035809,13.992
7346,8361905,K00689.01,Kepler-646 b,CONFIRMED,NOT DISPOSITIONED,,0,0,0,0,...,-170.0,4.607,0.031,-0.262,0.737,0.33,-0.051,290.47134,44.387081,13.766
7347,5120087,K00639.01,Kepler-631 b,CONFIRMED,NOT DISPOSITIONED,,0,0,0,0,...,-202.0,4.434,0.071,-0.273,0.989,0.377,-0.108,296.88577,40.22823,13.5


From this, we can see that our target (whether a stellar object is an exoplanet) is the column **koi_disposition**.

In [5]:
# describe the data to see which columns are numeric
df.describe()

Unnamed: 0,kepid,koi_score,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,koi_period,koi_period_err1,koi_period_err2,koi_time0bk,...,koi_steff_err2,koi_slogg,koi_slogg_err1,koi_slogg_err2,koi_srad,koi_srad_err1,koi_srad_err2,ra,dec,koi_kepmag
count,7348.0,0.0,7348.0,7348.0,7348.0,7348.0,7348.0,7064.0,7064.0,7348.0,...,6951.0,7064.0,6968.0,6968.0,7064.0,6968.0,6968.0,7348.0,7348.0,7348.0
mean,7702807.0,,0.085193,0.054164,0.035656,0.084513,78.116822,0.001773,-0.001773,166.883845,...,-172.405265,4.370101,0.098862,-0.242905,1.444594,0.508323,-0.287965,292.014,43.830741,14.328325
std,2665830.0,,0.279188,0.226357,0.185444,0.278174,1520.892459,0.007495,0.007495,63.425925,...,56.92998,0.374797,0.111248,0.13223,4.193486,0.949449,1.108117,4.806905,3.617003,1.368368
min,757450.0,,0.0,0.0,0.0,0.0,0.241843,0.0,-0.173,120.565925,...,-1473.0,0.146,0.0,-1.207,0.116,0.0,-34.637,279.85272,36.577381,6.966
25%,5552340.0,,0.0,0.0,0.0,0.0,3.556516,6e-06,-0.00026,133.323287,...,-200.0,4.33175,0.035,-0.298,0.816,0.249,-0.159,288.53037,40.766201,13.52975
50%,7949876.0,,0.0,0.0,0.0,0.0,11.258082,3.7e-05,-3.7e-05,139.086301,...,-165.0,4.466,0.06,-0.266,0.963,0.374,-0.091,292.24791,43.724689,14.601
75%,9884432.0,,0.0,0.0,0.0,0.0,43.154441,0.00026,-6e-06,171.989033,...,-141.0,4.554,0.111,-0.158,1.16425,0.569,-0.065,295.85916,46.728952,15.34425
max,12935140.0,,1.0,1.0,1.0,1.0,129995.7784,0.173,0.0,746.196647,...,0.0,5.283,1.184,0.0,149.058,25.352,0.0,301.72076,52.33601,20.003


From the website, the features/columns are described as follows:

| Column | Description |
|-----:|---------------|
| kepid|          KepID|
| kepoi_name|     KOI Name|
| kepler_name|    Kepler Name|
| koi_disposition| Exoplanet Archive Disposition|
| koi_pdisposition| Disposition Using Kepler Data|
| koi_score|      Disposition Score|
| koi_fpflag_nt|  Not Transit-Like False Positive Flag|
| koi_fpflag_ss|  Stellar Eclipse False Positive Flag|
| koi_fpflag_co|  Centroid Offset False Positive Flag|
| koi_fpflag_ec|  Ephemeris Match Indicates Contamination False Positive Flag|
| koi_period|     Orbital Period [days]|
| koi_period_err1| Orbital Period Upper Unc. [days]|
| koi_period_err2| Orbital Period Lower Unc. [days]|
| koi_time0bk|    Transit Epoch [BKJD]|
| koi_time0bk_err1| Transit Epoch Upper Unc. [BKJD]|
| koi_time0bk_err2| Transit Epoch Lower Unc. [BKJD]|
| koi_impact|     Impact Parameter|
| koi_impact_err1| Impact Parameter Upper Unc.|
| koi_impact_err2| Impact Parameter Lower Unc.|
| koi_duration|   Transit Duration [hrs]|
| koi_duration_err1| Transit Duration Upper Unc. [hrs]|
| koi_duration_err2| Transit Duration Lower Unc. [hrs]|
| koi_depth|      Transit Depth [ppm]|
| koi_depth_err1| Transit Depth Upper Unc. [ppm]|
| koi_depth_err2| Transit Depth Lower Unc. [ppm]|
| koi_prad|       Planetary Radius [Earth radii]|
| koi_prad_err1|  Planetary Radius Upper Unc. [Earth radii]|
| koi_prad_err2|  Planetary Radius Lower Unc. [Earth radii]|
| koi_teq|        Equilibrium Temperature [K]|
| koi_teq_err1|   Equilibrium Temperature Upper Unc. [K]|
| koi_teq_err2|   Equilibrium Temperature Lower Unc. [K]|
| koi_insol|      Insolation Flux [Earth flux]|
| koi_insol_err1| Insolation Flux Upper Unc. [Earth flux]|
| koi_insol_err2| Insolation Flux Lower Unc. [Earth flux]|
| koi_model_snr|  Transit Signal-to-Noise|
| koi_tce_plnt_num| TCE Planet Number|
| koi_tce_delivname| TCE Delivery|
| koi_steff|      Stellar Effective Temperature [K]|
| koi_steff_err1| Stellar Effective Temperature Upper Unc. [K]|
| koi_steff_err2| Stellar Effective Temperature Lower Unc. [K]|
| koi_slogg|      Stellar Surface Gravity [log10(cm/s**2)]|
| koi_slogg_err1| Stellar Surface Gravity Upper Unc. [log10(cm/s**2)]|
| koi_slogg_err2| Stellar Surface Gravity Lower Unc. [log10(cm/s**2)]|
| koi_srad|       Stellar Radius [Solar radii]|
| koi_srad_err1|  Stellar Radius Upper Unc. [Solar radii]|
| koi_srad_err2|  Stellar Radius Lower Unc. [Solar radii]|
| ra|             RA [decimal degrees]|
| dec|            Dec [decimal degrees]|
| koi_kepmag|     Kepler-band [mag] |

### 4/ Data cleaning

This section removes unnecessary data and cleans up the blank/_Null_ values.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7348 entries, 0 to 7347
Data columns (total 49 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   kepid              7348 non-null   int64  
 1   kepoi_name         7348 non-null   object 
 2   kepler_name        2663 non-null   object 
 3   koi_disposition    7348 non-null   object 
 4   koi_pdisposition   7348 non-null   object 
 5   koi_score          0 non-null      float64
 6   koi_fpflag_nt      7348 non-null   int64  
 7   koi_fpflag_ss      7348 non-null   int64  
 8   koi_fpflag_co      7348 non-null   int64  
 9   koi_fpflag_ec      7348 non-null   int64  
 10  koi_period         7348 non-null   float64
 11  koi_period_err1    7064 non-null   float64
 12  koi_period_err2    7064 non-null   float64
 13  koi_time0bk        7348 non-null   float64
 14  koi_time0bk_err1   7064 non-null   float64
 15  koi_time0bk_err2   7064 non-null   float64
 16  koi_impact         7064 

From the call above, we see that three columns seem to be empty.
Let's verify!

In [7]:
# list the columns to inspect
columns_to_inspect = ['koi_score', 'koi_teq_err1', 'koi_teq_err2']
for column in columns_to_inspect:
    # print their unique values
    print(df[column].unique())


[nan]
[nan]
[nan]


We can see that those columns are indeed empty and can be safely removed.

In [8]:
# remove the empty columns; errors='ignore' avoids an error if the cell is run several times
df = df.drop(columns=columns_to_inspect, errors='ignore')
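
The three columns above were picked by reading the `df.info()` output. As a hedged alternative, fully empty columns can also be found programmatically. The sketch below uses a small made-up frame (`toy_df` and its values are illustrative, not taken from the Kepler data) to show the idea with `isna().all()`.

```python
import numpy as np
import pandas as pd

# toy frame standing in for the Kepler data (values are made up)
toy_df = pd.DataFrame({
    'koi_score': [np.nan, np.nan, np.nan],   # fully empty
    'koi_period': [9.48, 54.41, np.nan],     # partially filled
    'koi_kepmag': [15.3, 15.4, 15.6],        # complete
})

# a column is empty when every one of its values is NaN
empty_columns = toy_df.columns[toy_df.isna().all()].tolist()
print(empty_columns)

# drop all of them in one call
toy_df = toy_df.drop(columns=empty_columns)
```

Only the fully empty column is selected; a partially filled column such as `koi_period` in the toy frame is kept.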

The next step is to check the columns that have the type object and contain _Null_ values.

In [9]:
# list the columns to inspect
columns_to_inspect = ['kepler_name', 'koi_tce_delivname']
for column in columns_to_inspect:
    # print their unique values
    print(df[column].unique())

['Kepler-1 b' 'Kepler-2 b' 'Kepler-104 c' ... 'Kepler-645 b'
 'Kepler-646 b' 'Kepler-631 b']
['q1_q16_tce' nan]


We can see that:
* **kepler_name**: this column only contains a name if the stellar object is confirmed to be an exoplanet. Since we will use **koi_disposition** as the target, we can drop this column.
* **koi_tce_delivname**: this column only contains the source of the data and can be dropped.

In [10]:
# remove the two columns; errors='ignore' avoids an error if the cell is run several times
df = df.drop(columns=columns_to_inspect, errors='ignore')

Let's look at the unique values of the remaining ID and object type columns.

In [11]:
# list the columns to inspect
columns_to_inspect = ['kepid', 'kepoi_name', 'koi_pdisposition']
for column in columns_to_inspect:
    # print their unique values
    print(df[column].unique())

[11446443 10666592  6678383 ...  8161561  8361905  5120087]
['K00001.01' 'K00002.01' 'K00111.02' ... 'K00688.01' 'K00689.01'
 'K00639.01']
['NOT DISPOSITIONED' 'CANDIDATE' 'FALSE POSITIVE']


We can see that:
* **koi_pdisposition**: this column is another disposition label rather than a measurement. It does not bring any useful information as a feature and can be dropped.
* **kepid** and **kepoi_name**: these two columns are ID columns and can therefore be dropped.

In [12]:
# remove the three columns; errors='ignore' avoids an error if the cell is run several times
df = df.drop(columns=columns_to_inspect, errors='ignore')

Now that the object type columns are cleaned, let's look at the columns with numeric values and _Null_.

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7348 entries, 0 to 7347
Data columns (total 41 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   koi_disposition    7348 non-null   object 
 1   koi_fpflag_nt      7348 non-null   int64  
 2   koi_fpflag_ss      7348 non-null   int64  
 3   koi_fpflag_co      7348 non-null   int64  
 4   koi_fpflag_ec      7348 non-null   int64  
 5   koi_period         7348 non-null   float64
 6   koi_period_err1    7064 non-null   float64
 7   koi_period_err2    7064 non-null   float64
 8   koi_time0bk        7348 non-null   float64
 9   koi_time0bk_err1   7064 non-null   float64
 10  koi_time0bk_err2   7064 non-null   float64
 11  koi_impact         7064 non-null   float64
 12  koi_impact_err1    7064 non-null   float64
 13  koi_impact_err2    7064 non-null   float64
 14  koi_duration       7348 non-null   float64
 15  koi_duration_err1  7064 non-null   float64
 16  koi_duration_err2  7064 

From this, we can see that one column has significantly less data than the others: **koi_tce_plnt_num**. Looking at the description, we see that it is yet another ID column and can therefore be dropped.

In [14]:
# remove the koi_tce_plnt_num column; errors='ignore' avoids an error if the cell is run several times
df = df.drop(columns=['koi_tce_plnt_num'], errors='ignore')

Of the remaining columns, the one with the least amount of data is **koi_steff_err2**, with 6951 non-null values. This means that about _5.4%_ of the rows are incomplete, which is an acceptable loss. We will therefore remove the rows that contain a _Null_.

In [15]:
# remove rows with Null
df = df.dropna()
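
The 5.4% figure follows directly from the counts reported by `df.info()`. A minimal check of the arithmetic, using only the two numbers quoted above:

```python
# counts taken from the df.info() output above
total_rows = 7348
non_null_rows = 6951  # koi_steff_err2, the sparsest remaining column

missing_rows = total_rows - non_null_rows
missing_fraction = missing_rows / total_rows
print(f'{missing_rows} rows missing, {missing_fraction:.1%} of the dataset')
```

Note that `dropna()` removes a row as soon as any of its columns is _Null_, so in general it can remove more rows than the sparsest column suggests. Here the Train and Test shapes printed in section 5 add up to 6951 rows, which matches the count above.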

With that, our dataset is now cleaned.

> Unfortunately, the Jupyter notebook kept crashing when trying to display a graph, so I could not do a pairplot or a heatmap to assess correlation and feature importance.

### 5/ Train - Test split

Let's prepare the initial train / test split used for the performance assessment of the different machine learning models we will use. We will use a standard 20% split for Test. We will also set a random state for reproducibility.

In [16]:
# separate the target y from the features X
y = df.pop('koi_disposition')
# simplify y to a binary label: 1 for CONFIRMED, 0 for every other disposition
# (NOT DISPOSITIONED, CANDIDATE, FALSE POSITIVE)
y = (y == 'CONFIRMED').astype(int)
# rename for convention
X = df

# split into the train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Let's do a sanity check on the shapes of our sets.

In [17]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(5560, 39)
(5560,)
(1391, 39)
(1391,)
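
Before comparing models, it can help to know the accuracy obtained by always predicting the majority class, since any useful model should beat it. The sketch below uses a made-up label vector (`toy_y` is hypothetical; in the notebook one would pass `y_test` instead).

```python
import pandas as pd

# hypothetical binary labels: 1 = confirmed exoplanet, 0 = everything else
toy_y = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])

# share of each class in the label vector
class_share = toy_y.value_counts(normalize=True)
# accuracy obtained by always predicting the most frequent class
baseline_accuracy = class_share.max()
print(class_share)
print(f'majority-class baseline: {baseline_accuracy:.2%}')
```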


### 6/ Model Trainings

The following section trains the models and outputs the accuracy of each one on the Test set. Those results are saved in a dictionary.

In [18]:
results_dict = {}
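
Only accuracy is stored in this dictionary. Since the stated goal is to identify as many exoplanets as possible, recall on the positive class may be worth recording as well. The sketch below is one possible helper; `evaluate` and the toy label lists are hypothetical names, not part of the notebook.

```python
from sklearn.metrics import accuracy_score, recall_score

def evaluate(y_true, y_pred):
    """Return accuracy and recall of the positive class (label 1)."""
    return {
        'accuracy': accuracy_score(y_true, y_pred),
        'recall': recall_score(y_true, y_pred, pos_label=1),
    }

# toy labels: 4 exoplanets, 3 of which are found by the predictions
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 1]
scores = evaluate(toy_true, toy_pred)
print(scores)
```

Recall answers the question "of the real exoplanets, how many did the model find?", which accuracy alone does not show when the classes are unbalanced.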

##### 6.1/ Linear Regression

The first model we will use is a simple _linear regression_. The goal is mostly to get a baseline of performance to compare the other models with.

In [19]:
# simple linear regression
lreg = LinearRegression().fit(X_train, y_train)
# predict the output on the test set
y_pred = lreg.predict(X_test)
# threshold the continuous predictions at 0.5 to get binary labels
y_pred = (y_pred >= 0.5).astype(int)
# compute the accuracy
accuracy = sklearn.metrics.accuracy_score(y_test, y_pred)

# save the result
results_dict['Linear Regression'] = accuracy

The accuracy of the _linear regression_ is 77.57%. While this isn't bad, I believe we can do better.

##### 6.2/ Logistic Regression

Next, we will move to a simple _logistic regression_, which should fit the binary output better. Since the default **solver** _lbfgs_ fails to converge even with **max_iter** set to _10k_, I changed it to _liblinear_, which the scikit-learn documentation describes as a good choice for small datasets.

In [20]:
# simple logistic regression
clf = LogisticRegression(random_state=42, solver='liblinear').fit(X_train, y_train)
# predict the output on the test set
y_pred = clf.predict(X_test)
# compute the accuracy
accuracy = sklearn.metrics.accuracy_score(y_test, y_pred)

# save the result
results_dict['Logistic Regression'] = accuracy

We can see that the accuracy is very close to the _linear regression_ at 78.00%.
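
One possible reason for the _lbfgs_ convergence failure is the very different scales of the features (the orbital period alone spans several orders of magnitude in the `describe()` output). This is an assumption, not something verified in this notebook. A minimal sketch of standardizing the features before the regression, on synthetic data (`X_toy` and `y_toy` are made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic data: two informative features on very different scales
rng = np.random.default_rng(42)
X_toy = np.column_stack([
    rng.normal(0, 1, 200),        # small-scale feature
    rng.normal(0, 10000, 200),    # large-scale feature
])
y_toy = (X_toy[:, 0] + X_toy[:, 1] / 10000 > 0).astype(int)

# scale each feature to zero mean and unit variance, then fit with lbfgs
scaled_clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver='lbfgs', max_iter=1000, random_state=42),
)
scaled_clf.fit(X_toy, y_toy)
print(scaled_clf.score(X_toy, y_toy))
```

Putting the scaler inside the pipeline means it is fitted on the training data only, so no information from the Test set leaks into the scaling.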

##### 6.3/ Random Forest

Next, we will see if _Random Forest_ fares better. Since the _Random Forest_ has several hyperparameters, we will loop through some values to search for the best fit.

In [21]:
# init the maximum accuracy
max_accuracy = 0
max_hyperparameters = (0, 0)
# prepare the hyperparameters
max_depth_range = range(1, 10)
max_leaf_nodes_range = range(2, 10)
# loop over max depth
for max_depth in max_depth_range:
    # loop over max leaf nodes
    for max_leaf_nodes in max_leaf_nodes_range:
        # random forest classifier
        clf = RandomForestClassifier(max_depth=max_depth, max_leaf_nodes=max_leaf_nodes, random_state=42).fit(X_train, y_train)
        # predict the output on the test set
        y_pred = clf.predict(X_test)
        # compute the accuracy
        accuracy = sklearn.metrics.accuracy_score(y_test, y_pred)
        # save accuracy and hyperparameters if new maximum
        if accuracy > max_accuracy:
            max_accuracy = accuracy
            max_hyperparameters = (max_depth, max_leaf_nodes)
            
# save and display the result
results_dict['Random Forest'] = max_accuracy
print('max_accuracy: ' + str(max_accuracy) + ' found with max_depth=' + str(max_hyperparameters[0]) + ' and max_leaf_nodes=' + str(max_hyperparameters[1]))

max_accuracy: 0.8598130841121495 found with max_depth=7 and max_leaf_nodes=8


The _Random Forest_ does much better with an accuracy of 85.98%.
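
Note that the loop above selects the hyperparameters using the Test set itself, so the reported accuracy may be optimistic. A common alternative is to select them by cross-validation on the Train set only and keep the Test set for the final score. The sketch below shows this with `GridSearchCV` on synthetic data from `make_classification`; the data, the reduced grid and `n_estimators=20` are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# synthetic stand-in for the Kepler features and labels
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# reduced grid over the same two hyperparameters as above
param_grid = {'max_depth': [2, 4, 6], 'max_leaf_nodes': [4, 8]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=42),
    param_grid,
    scoring='accuracy',
    cv=3,
)
# the search only ever sees the training split
search.fit(X_tr, y_tr)

print(search.best_params_)
# the test split is used once, for the final estimate
print(search.score(X_te, y_te))
```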

##### 6.4/ Adaboost

Next, we will try Adaboost with a range of estimators.

In [22]:
# init the maximum accuracy
max_accuracy = 0
max_hyperparameters = 0
# prepare the range of estimators
n_estimators_range = range(1, 150)
# loop over n_estimators
for n_estimators in n_estimators_range:
    # adaboost
    clf = AdaBoostClassifier(n_estimators=n_estimators, random_state=42).fit(X_train, y_train)
    # predict the output on the test set
    y_pred = clf.predict(X_test)
    # compute the accuracy
    accuracy = sklearn.metrics.accuracy_score(y_test, y_pred)
    # save accuracy and hyperparameters if new maximum
    if accuracy > max_accuracy:
        max_accuracy = accuracy
        max_hyperparameters = n_estimators

# save and display the result
results_dict['Adaboost'] = max_accuracy
print('max_accuracy: ' + str(max_accuracy) + ' found with n_estimators=' + str(max_hyperparameters))

max_accuracy: 0.8900071890726097 found with n_estimators=105


_Adaboost_ brings the accuracy up from _Random Forest_ to 89.00%. This is achieved with n_estimators at 105.

##### 6.5/ SVM

Now, let's look at the performance of _SVM_.

> If you intend to run this yourself, please note that the SVM takes a much longer time to execute than the other models. Not "go have a coffee" long, but rather "go cook yourself a fancy meal and eat it slowly" long.

In [23]:
# SVM classifier
clf = svm.SVC(kernel='linear').fit(X_train, y_train)
# predict the output on the test set
y_pred = clf.predict(X_test)
# compute the accuracy
accuracy = sklearn.metrics.accuracy_score(y_test, y_pred)

# save and display the results
results_dict['SVM'] = accuracy
print(accuracy)

0.8368080517613228


Not only does the _SVM_ take a very long time to execute, but its accuracy of 83.68% is also a step backward from _Random Forest_ and _Adaboost_.

##### 6.6/ MLP

_MLP_ stands for Multi-Layer Perceptron and is one of the most basic Neural Network models. It basically works as several regression layers chained together, with a non-linear activation function between each pair of layers.
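
To make the "chained layers" picture concrete, here is a minimal NumPy sketch of a forward pass through a network with one hidden layer. The weights are random and the sizes are arbitrary; it illustrates the structure only and is not how `MLPClassifier` is implemented internally.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    # non-linearity applied between the linear layers
    return np.maximum(z, 0)

def sigmoid(z):
    # squashes the final score into a probability
    return 1 / (1 + np.exp(-z))

# arbitrary sizes: 4 input features, 5 hidden units, 1 output
W1, b1 = rng.normal(size=(4, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)

x = rng.normal(size=(3, 4))          # batch of 3 samples
hidden = relu(x @ W1 + b1)           # linear layer followed by activation
proba = sigmoid(hidden @ W2 + b2)    # linear layer followed by sigmoid
print(proba.shape)
```

Without the `relu` call, the two linear layers would collapse into a single linear map, which is why the activation between the layers matters.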

In [24]:
# potential activations
activation_list = ['identity', 'logistic', 'tanh', 'relu']
# init max_accuracy & max activation
max_accuracy = 0
max_hyperparameters = 0
# loop over activations
for activation in activation_list:
    # MLP classifier
    clf = MLPClassifier(activation=activation, solver='adam', hidden_layer_sizes=(100,100), max_iter=3000, random_state=42).fit(X_train, y_train)
    # predict the output on the test set
    y_pred = clf.predict(X_test)
    # compute the accuracy
    accuracy = sklearn.metrics.accuracy_score(y_test, y_pred)
    # check if new max
    if accuracy > max_accuracy:
        max_accuracy = accuracy
        max_hyperparameters = activation

# save and display the results
results_dict['MLP'] = max_accuracy
print('max_accuracy: ' + str(max_accuracy) + ' found with activation=' + str(max_hyperparameters))

max_accuracy: 0.813803019410496 found with activation=logistic


As described above, an _MLP_ stacks regression layers, so it is interesting to note that it surpasses both _Linear Regression_ and _Logistic Regression_ with 81.38% accuracy, but fails to surpass _Random Forest_ or _Adaboost_.

### 7/ Summary & conclusion

Here is a summary of the best accuracies found with their respective model:

| Model | Best Accuracy |
|-------|---------------|
| Linear Regression | 77.57% |
| Logistic Regression | 78.00% |
| Random Forest | 85.98% |
| Adaboost | 89.00% |
| SVM | 83.68% |
| MLP | 81.38% |
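
The same table can be produced from the scores instead of being typed by hand. A small sketch, with the accuracies copied from the table above into a dictionary shaped like `results_dict` (the name `results` is used here to avoid overwriting it):

```python
# accuracies copied from the summary table, as fractions
results = {
    'Linear Regression': 0.7757,
    'Logistic Regression': 0.7800,
    'Random Forest': 0.8598,
    'Adaboost': 0.8900,
    'SVM': 0.8368,
    'MLP': 0.8138,
}

# sort from best to worst accuracy
ranking = sorted(results.items(), key=lambda item: item[1], reverse=True)
for model, accuracy in ranking:
    print(f'| {model} | {accuracy:.2%} |')
```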

The best model overall is _Adaboost_, with the worst being the _Linear Regression_.

Why is that?

* I think the great number of features means a simple regression just cannot capture the complexity of the data. This is corroborated by the fact that all regression based models performed poorly: _Linear Regression_, _Logistic Regression_ and _MLP_.

* _SVM_ takes a very long time to find the hyperplane, and the number of features (i.e. dimensions) also seems to translate to poor results.

* This leaves _Random Forest_ and _Adaboost_ as the better decision making models, as they are based on an entirely different method. I believe _Adaboost_ gets the best results because the weighting of each stump provides a finer lever to compute the decision than a longer tree can.
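
The "weighting of each stump" can be illustrated with the classic discrete AdaBoost rule, where a weak learner with weighted error `error` receives the vote weight `0.5 * ln((1 - error) / error)`. This is the textbook form; scikit-learn's `AdaBoostClassifier` may use a variant of it, so treat the sketch below as an illustration rather than the library's exact computation. The error rates and votes are made up.

```python
import math

def stump_weight(error):
    """Vote weight of a weak learner in textbook discrete AdaBoost."""
    return 0.5 * math.log((1 - error) / error)

# hypothetical weighted error rates of three stumps
errors = [0.10, 0.30, 0.45]
alphas = [stump_weight(e) for e in errors]

# hypothetical votes of the three stumps for one sample (+1 or -1)
votes = [+1, -1, -1]
# final decision: sign of the weighted sum of votes
score = sum(a * v for a, v in zip(alphas, votes))
prediction = 1 if score > 0 else -1
print(alphas, prediction)
```

In this toy case the single accurate stump outvotes the two weaker ones, which a plain majority vote would not allow.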