## Decision Tree Model for Kepler Space Telescope Data

The purpose of this project is to reinforce my data science and machine learning knowledge, and to help others on their own journeys. I'm not an expert on these topics, and I'm aware that there are many tools, concepts and techniques I still need to master. That's why I'm developing this project: to share with the community what I have learned along my data science journey. I'm open to comments, criticism and feedback that help me learn best practices, and to corrections wherever I got something wrong.

In [36]:
#import warnings
#warnings.simplefilter('ignore')
import pandas as pd
import numpy as np

In [37]:
kepler_df = pd.read_csv('../../data/exoplanet_data.csv')
kepler_df.head()

Unnamed: 0,koi_disposition,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,koi_period,koi_period_err1,koi_period_err2,koi_time0bk,koi_time0bk_err1,...,koi_steff_err2,koi_slogg,koi_slogg_err1,koi_slogg_err2,koi_srad,koi_srad_err1,koi_srad_err2,ra,dec,koi_kepmag
0,CONFIRMED,0,0,0,0,54.418383,0.0002479,-0.0002479,162.51384,0.00352,...,-81,4.467,0.064,-0.096,0.927,0.105,-0.061,291.93423,48.141651,15.347
1,FALSE POSITIVE,0,1,0,0,19.89914,1.49e-05,-1.49e-05,175.850252,0.000581,...,-176,4.544,0.044,-0.176,0.868,0.233,-0.078,297.00482,48.134129,15.436
2,FALSE POSITIVE,0,1,0,0,1.736952,2.63e-07,-2.63e-07,170.307565,0.000115,...,-174,4.564,0.053,-0.168,0.791,0.201,-0.067,285.53461,48.28521,15.597
3,CONFIRMED,0,0,0,0,2.525592,3.76e-06,-3.76e-06,171.59555,0.00113,...,-211,4.438,0.07,-0.21,1.046,0.334,-0.133,288.75488,48.2262,15.509
4,CONFIRMED,0,0,0,0,4.134435,1.05e-05,-1.05e-05,172.97937,0.0019,...,-232,4.486,0.054,-0.229,0.972,0.315,-0.105,296.28613,48.22467,15.714


In [38]:
kepler_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6991 entries, 0 to 6990
Data columns (total 41 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   koi_disposition    6991 non-null   object 
 1   koi_fpflag_nt      6991 non-null   int64  
 2   koi_fpflag_ss      6991 non-null   int64  
 3   koi_fpflag_co      6991 non-null   int64  
 4   koi_fpflag_ec      6991 non-null   int64  
 5   koi_period         6991 non-null   float64
 6   koi_period_err1    6991 non-null   float64
 7   koi_period_err2    6991 non-null   float64
 8   koi_time0bk        6991 non-null   float64
 9   koi_time0bk_err1   6991 non-null   float64
 10  koi_time0bk_err2   6991 non-null   float64
 11  koi_impact         6991 non-null   float64
 12  koi_impact_err1    6991 non-null   float64
 13  koi_impact_err2    6991 non-null   float64
 14  koi_duration       6991 non-null   float64
 15  koi_duration_err1  6991 non-null   float64
 16  koi_duration_err2  6991 

In [39]:
print(f"The dataframe has a length of {len(kepler_df.columns)} columns.")
print(f"There are 3 outcomes/predictions/targets: {set(kepler_df['koi_disposition'])}")

The dataframe has a length of 41 columns.
There are 3 outcomes/predictions/targets: {'FALSE POSITIVE', 'CANDIDATE', 'CONFIRMED'}


In [40]:
features = kepler_df.drop('koi_disposition', axis = 1)
target = kepler_df['koi_disposition'].values

print(features.shape)
print(target.shape)

(6991, 40)
(6991,)


In [41]:
kepler_df['koi_disposition'] = kepler_df['koi_disposition'].replace({'CONFIRMED': 0, 'FALSE POSITIVE': 1, 'CANDIDATE': 2})
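The cell above assigns the integer codes by hand with `replace`. As a minimal sketch of the same encoding, the snippet below uses `Series.map` with an explicit dictionary on a tiny made-up Series (`demo` is a stand-in, not the Kepler data). With `map`, any label missing from the dictionary becomes `NaN`, which makes a misspelled label easy to catch.

```python
import pandas as pd

# Hypothetical stand-in for kepler_df['koi_disposition'].
demo = pd.Series(['CONFIRMED', 'FALSE POSITIVE', 'CANDIDATE', 'CONFIRMED'])

# Same codes as in the notebook cell.
disposition_codes = {'CONFIRMED': 0, 'FALSE POSITIVE': 1, 'CANDIDATE': 2}

encoded = demo.map(disposition_codes)

# A label that is not in the dictionary would show up as NaN here.
assert encoded.notna().all()
print(encoded.tolist())
```

Note that these codes are nominal: the order 0, 1, 2 is arbitrary, so a Pearson correlation computed against them depends on which class got which number.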

In [42]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.3, random_state = 42)
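With three classes that may not be equally frequent, one optional variation is a stratified split, which keeps the class proportions the same in the train and test sets. The sketch below uses synthetic data (`X_demo` and `y_demo` are made up for illustration) and the `stratify` parameter of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 rows, deliberately imbalanced labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = np.array(['CONFIRMED'] * 50 + ['FALSE POSITIVE'] * 100 + ['CANDIDATE'] * 50)

# stratify=y_demo preserves the label proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Compare class proportions in each split.
for split_name, labels in [('train', y_tr), ('test', y_te)]:
    classes, counts = np.unique(labels, return_counts=True)
    print(split_name, dict(zip(classes, np.round(counts / counts.sum(), 2))))
```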

In [43]:
from sklearn import tree
desc_tree = tree.DecisionTreeClassifier()
desc_tree.fit(X_train, y_train)

DecisionTreeClassifier()
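Fitting alone does not show how well the tree generalizes. As a hedged sketch of one way to check, the example below trains a tree on synthetic three-class data from `make_classification` (not the Kepler data) and compares train and test accuracy. A large gap between the two usually points to overfitting, which is common for an unconstrained decision tree.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class problem standing in for the real features/target.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# random_state makes the fitted tree reproducible between runs.
demo_tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
y_pred = demo_tree.predict(X_te)

print("Train accuracy:", demo_tree.score(X_tr, y_tr))
print("Test accuracy:", accuracy_score(y_te, y_pred))
print(classification_report(y_te, y_pred))
```

In the notebook, the same calls would apply to `desc_tree`, `X_test` and `y_test`.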

In [44]:
correlation = kepler_df.corr()
correlation['koi_disposition']

koi_disposition      1.000000
koi_fpflag_nt        0.000416
koi_fpflag_ss        0.013503
koi_fpflag_co        0.008531
koi_fpflag_ec        0.008041
koi_period           0.124647
koi_period_err1      0.099048
koi_period_err2     -0.099048
koi_time0bk          0.070445
koi_time0bk_err1     0.147719
koi_time0bk_err2    -0.147719
koi_impact           0.010607
koi_impact_err1      0.058572
koi_impact_err2     -0.013980
koi_duration         0.029554
koi_duration_err1    0.156587
koi_duration_err2   -0.156587
koi_depth            0.008694
koi_depth_err1       0.001797
koi_depth_err2      -0.001797
koi_prad             0.001485
koi_prad_err1        0.003135
koi_prad_err2       -0.000998
koi_teq              0.021275
koi_insol            0.012070
koi_insol_err1       0.014604
koi_insol_err2      -0.014159
koi_model_snr       -0.016351
koi_tce_plnt_num    -0.095550
koi_steff            0.071048
koi_steff_err1       0.173227
koi_steff_err2      -0.148902
koi_slogg           -0.071437
koi_slogg_

In [45]:
import plotly.graph_objs as go
from plotly.offline import plot

# Target plus the first 10 features (the target is the first column).
corr_subset = correlation.iloc[:11, :11].round(4)

fig = go.Figure(data=go.Heatmap(z=corr_subset.values,
                                x=corr_subset.columns.to_list(),
                                y=corr_subset.index.to_list(),
                                colorscale='tempo',
                                text=corr_subset.values,
                                texttemplate="%{text}",
                                textfont={"size": 12}))

fig.update_layout(title_text='Correlation Heatmap: Target and First 10 Features', title_x=0.5)

fig.show()
# Static image export relies on an extra package such as kaleido.
fig.write_image("decision_tree_complete_heatmap.png")
plot(fig, filename='decision_tree_complete_heatmap.html')

'decision_tree_complete_heatmap.html'

In [46]:
print(desc_tree.feature_importances_)

[0.19667971 0.16979965 0.18471463 0.02984421 0.01865382 0.0031008
 0.00057589 0.02329355 0.0061352  0.00375704 0.02879124 0.00495906
 0.00463226 0.01416025 0.00416924 0.00254072 0.00949656 0.00497895
 0.01122006 0.00701017 0.0133081  0.00105306 0.00617717 0.00160353
 0.00953638 0.00252096 0.13710505 0.00458072 0.01088606 0.00888351
 0.00896551 0.00288788 0.00337327 0.00391491 0.00384182 0.01235346
 0.00417935 0.00728172 0.01454452 0.01448997]
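The raw array above is hard to read because the values are not tied to column names. A minimal sketch of one way to label and rank them is shown below, using a small synthetic frame: the column names are borrowed from the dataset, but the values and the target rule are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the features frame.
rng = np.random.default_rng(0)
demo_features = pd.DataFrame({
    'koi_fpflag_nt': rng.integers(0, 2, size=300),
    'koi_period': rng.uniform(0.5, 500, size=300),
    'koi_depth': rng.uniform(10, 5000, size=300),
})
# Made-up rule: the label depends only on the flag column.
demo_target = np.where(
    demo_features['koi_fpflag_nt'] == 1, 'FALSE POSITIVE', 'CONFIRMED'
)

demo_tree = DecisionTreeClassifier(random_state=42).fit(demo_features, demo_target)

# Pair each importance with its column name and sort from most to least important.
importances = pd.Series(
    demo_tree.feature_importances_, index=demo_features.columns
).sort_values(ascending=False)
print(importances)
```

In the notebook, `pd.Series(desc_tree.feature_importances_, index=features.columns)` would give the same labeled view for the real model.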


In [47]:
from sklearn.preprocessing import MinMaxScaler

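# Fit the scaler on the training data only, then apply it to both splits.
X_scaler = MinMaxScaler().fit(X_train)

X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)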

In [48]:
desc_tree = tree.DecisionTreeClassifier()
desc_tree.fit(X_train_scaled, y_train)

DecisionTreeClassifier()

In [49]:
print(desc_tree.feature_importances_)

[0.19667971 0.16979965 0.18471463 0.0301733  0.01548324 0.00301778
 0.00224668 0.02366639 0.00879573 0.00471898 0.02717529 0.00381677
 0.00515386 0.01430273 0.00136422 0.00314789 0.00984459 0.00444035
 0.00912298 0.01290644 0.0135124  0.         0.00288917 0.00354392
 0.00948939 0.00269647 0.13408009 0.00565617 0.01466547 0.00878322
 0.00728877 0.00349725 0.00369984 0.00384531 0.00416651 0.01047282
 0.00540994 0.0099362  0.01446923 0.0113266 ]
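The two importance arrays differ slightly, but both trees were created without a `random_state`, so some of that difference may come from random tie-breaking inside the tree builder and not from the scaling. Decision trees split on thresholds, and min-max scaling preserves the order of values within each feature, so scaling is not expected to change the splits much. The sketch below checks this on synthetic data (`X_demo` and `y_demo` are made up) by fixing `random_state` for both fits.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic integer-valued features; the label depends on the first two columns.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 100, size=(300, 5)).astype(float)
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 100).astype(int)

X_demo_scaled = MinMaxScaler().fit_transform(X_demo)

# Same random_state for both trees, so only the scaling differs.
tree_raw = DecisionTreeClassifier(random_state=42).fit(X_demo, y_demo)
tree_scaled = DecisionTreeClassifier(random_state=42).fit(X_demo_scaled, y_demo)

imp_raw = tree_raw.feature_importances_
imp_scaled = tree_scaled.feature_importances_

print(np.round(imp_raw, 4))
print(np.round(imp_scaled, 4))
print("Largest difference:", np.abs(imp_raw - imp_scaled).max())
```

If the largest difference is at or near zero, the variation seen between the two notebook runs is more likely due to the unseeded tree than to `MinMaxScaler`.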
