# BinaryClassExoplanets
***Matt Paterson, hello@hireMattPaterson.com***<br>
This notebook takes a dataset from the Kepler Satellite, the KOI cumulative dataset, from https://exoplanetarchive.ipac.caltech.edu/cgi-bin/TblView/nph-tblView?app=ExoTbls&config=cumulative <br>

Here I do a rudimentary check on some exoplanet data, create a series of classification models (plus a Linear Regression for absolutely no reason other than seeing what it looks like when I do), and try to predict how many of the unclassified 'candidates' that have been identified in this Kepler Objects of Interest dataset are actually exoplanets.

This notebook does the basic data cleaning to create a quick and dirty dataset and exports to 'dirty_kepler.csv'

## Import libraries needed for EDA
Run all EDA necessary for future models here, and export the dataframes to csv prior to running the models

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# import lebowski, my custom Python Library for EDA
import lebowski as dude
# import autoreload in case I create new lebowski functions
import autoreload        

## Import the Kepler Exoplanet cumulative dataset

In [2]:
# establish datapath variable
path_to_home = '../'
datapath = path_to_home + 'data/'

In [3]:
filename = 'kepler_cumulative_exoplanets.csv'
kep = pd.read_csv(datapath + filename)
kep.head()

Unnamed: 0,kepid,kepoi_name,kepler_name,koi_disposition,koi_pdisposition,koi_score,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,...,koi_steff_err2,koi_slogg,koi_slogg_err1,koi_slogg_err2,koi_srad,koi_srad_err1,koi_srad_err2,ra,dec,koi_kepmag
0,12168280,K08297.01,,CANDIDATE,CANDIDATE,0.141,0,0,0,0,...,-83.0,3.978,0.195,-0.12,2.057,0.365,-0.501,295.97794,50.771481,10.505
1,12117215,K08296.01,,FALSE POSITIVE,FALSE POSITIVE,0.116,1,0,0,0,...,-152.0,4.563,0.044,-0.176,0.828,0.216,-0.072,294.92795,50.662369,15.111
2,11923074,K08295.01,,FALSE POSITIVE,FALSE POSITIVE,0.0,0,0,0,1,...,-157.0,4.561,0.033,-0.176,0.833,0.227,-0.071,296.14072,50.279949,14.723
3,11867733,K08294.01,,FALSE POSITIVE,FALSE POSITIVE,0.0,0,1,0,0,...,-193.0,4.445,0.067,-0.202,0.994,0.291,-0.125,293.77362,50.12447,14.391
4,10585887,K08293.01,,FALSE POSITIVE,FALSE POSITIVE,0.19,1,0,0,0,...,-209.0,4.463,0.07,-0.21,0.97,0.306,-0.123,285.24338,47.812592,14.872


In [4]:
kep.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9564 entries, 0 to 9563
Data columns (total 49 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   kepid              9564 non-null   int64  
 1   kepoi_name         9564 non-null   object 
 2   kepler_name        2308 non-null   object 
 3   koi_disposition    9564 non-null   object 
 4   koi_pdisposition   9564 non-null   object 
 5   koi_score          8054 non-null   float64
 6   koi_fpflag_nt      9564 non-null   int64  
 7   koi_fpflag_ss      9564 non-null   int64  
 8   koi_fpflag_co      9564 non-null   int64  
 9   koi_fpflag_ec      9564 non-null   int64  
 10  koi_period         9564 non-null   float64
 11  koi_period_err1    9110 non-null   float64
 12  koi_period_err2    9110 non-null   float64
 13  koi_time0bk        9564 non-null   float64
 14  koi_time0bk_err1   9110 non-null   float64
 15  koi_time0bk_err2   9110 non-null   float64
 16  koi_impact         9201 

## Create a quick and dirty model, let's get rolling

In [5]:
kep.isna().sum()

kepid                   0
kepoi_name              0
kepler_name          7256
koi_disposition         0
koi_pdisposition        0
koi_score            1510
koi_fpflag_nt           0
koi_fpflag_ss           0
koi_fpflag_co           0
koi_fpflag_ec           0
koi_period              0
koi_period_err1       454
koi_period_err2       454
koi_time0bk             0
koi_time0bk_err1      454
koi_time0bk_err2      454
koi_impact            363
koi_impact_err1       454
koi_impact_err2       454
koi_duration            0
koi_duration_err1     454
koi_duration_err2     454
koi_depth             363
koi_depth_err1        454
koi_depth_err2        454
koi_prad              363
koi_prad_err1         363
koi_prad_err2         363
koi_teq               363
koi_teq_err1         9564
koi_teq_err2         9564
koi_insol             321
koi_insol_err1        321
koi_insol_err2        321
koi_model_snr         363
koi_tce_plnt_num      346
koi_tce_delivname     346
koi_steff             363
koi_steff_er

## Impute, compute, and refute
Since every row in this table has a null value somewhere, we cannot simply drop the null rows. We'll start by dropping columns such as koi_teq_err1 and koi_teq_err2 that have 100% null values

In [6]:
df = kep.drop(columns=['koi_teq_err1', 'koi_teq_err2'])

In [7]:
df.shape

(9564, 47)
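Rather than naming the empty columns by hand, pandas can find them for us. This is only a minimal sketch on a toy frame: the column names are borrowed from the dataset, but the values are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Kepler table; values are invented for illustration
toy = pd.DataFrame({
    'koi_teq': [793.0, 443.0, np.nan],
    'koi_teq_err1': [np.nan, np.nan, np.nan],   # 100% null
    'koi_teq_err2': [np.nan, np.nan, np.nan],   # 100% null
})

# List the columns in which every value is null
all_null_cols = toy.columns[toy.isna().all()].tolist()
print(all_null_cols)

# Drop them in one step; how='all' only removes columns that are entirely null
trimmed = toy.dropna(axis=1, how='all')
print(trimmed.shape)
```

Columns that are only partly null, like `koi_teq` here, are left alone.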

The kepler_name column is only significant on certain rows. We can probably eliminate this too as confirmed KOI are shown in what will be our target column

In [8]:
df.kepler_name.where(df['kepler_name'].notna())

0                 NaN
1                 NaN
2                 NaN
3                 NaN
4                 NaN
            ...      
9559              NaN
9560    Kepler-1658 b
9561       Kepler-3 b
9562       Kepler-2 b
9563       Kepler-1 b
Name: kepler_name, Length: 9564, dtype: object

In [9]:
df = df.drop(columns='kepler_name')
df.shape

(9564, 46)

In [10]:
df_2 = df.dropna()
df_2.shape

(7803, 46)

About 7,800 rows are enough to build a model, but let's take a quick look at what data we are omitting:

In [11]:
df.isna().sum()

kepid                   0
kepoi_name              0
koi_disposition         0
koi_pdisposition        0
koi_score            1510
koi_fpflag_nt           0
koi_fpflag_ss           0
koi_fpflag_co           0
koi_fpflag_ec           0
koi_period              0
koi_period_err1       454
koi_period_err2       454
koi_time0bk             0
koi_time0bk_err1      454
koi_time0bk_err2      454
koi_impact            363
koi_impact_err1       454
koi_impact_err2       454
koi_duration            0
koi_duration_err1     454
koi_duration_err2     454
koi_depth             363
koi_depth_err1        454
koi_depth_err2        454
koi_prad              363
koi_prad_err1         363
koi_prad_err2         363
koi_teq               363
koi_insol             321
koi_insol_err1        321
koi_insol_err2        321
koi_model_snr         363
koi_tce_plnt_num      346
koi_tce_delivname     346
koi_steff             363
koi_steff_err1        468
koi_steff_err2        483
koi_slogg             363
koi_slogg_er

We can eliminate koi_score. This is a confidence score between 0 and 1, computed with a Monte Carlo technique: a high value means high confidence that a KOI is an exoplanet, and a low value means low confidence. Because the score is itself the output of a model that judges the disposition, leaving it in would leak information about the target into our features. It is also null in 1,510 rows, so dropping it lets us keep more data. Of course, it is likely a good indicator, since the Monte Carlo model is more powerful than a simple logistic regression, but I digress.

In [12]:
df_2 = df.drop(columns='koi_score')

In [13]:
df_2.shape

(9564, 45)

In [14]:
df_2 = df_2.dropna()
df_2.shape

(8744, 45)
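The jump from 7,803 to 8,744 complete rows comes from removing one sparse column before calling `dropna`. Here is a small sketch of that trade-off. The helper `rows_kept` is hypothetical (it is not part of lebowski) and the data is invented.

```python
import numpy as np
import pandas as pd

def rows_kept(frame, drop_cols=()):
    """Count the rows with no nulls after first removing drop_cols."""
    return len(frame.drop(columns=list(drop_cols)).dropna())

toy = pd.DataFrame({
    'koi_score': [0.9, np.nan, np.nan, 0.1],
    'koi_period': [1.2, 3.4, 5.6, 7.8],
    'koi_depth': [100.0, 200.0, np.nan, 400.0],
})

kept_with_score = rows_kept(toy)                    # koi_score nulls cost rows
kept_without_score = rows_kept(toy, ['koi_score'])  # only koi_depth nulls remain
print(kept_with_score, kept_without_score)
```

Running the same comparison column by column is a quick way to see which columns are responsible for most of the lost rows.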

In [15]:
df_2.dtypes

kepid                  int64
kepoi_name            object
koi_disposition       object
koi_pdisposition      object
koi_fpflag_nt          int64
koi_fpflag_ss          int64
koi_fpflag_co          int64
koi_fpflag_ec          int64
koi_period           float64
koi_period_err1      float64
koi_period_err2      float64
koi_time0bk          float64
koi_time0bk_err1     float64
koi_time0bk_err2     float64
koi_impact           float64
koi_impact_err1      float64
koi_impact_err2      float64
koi_duration         float64
koi_duration_err1    float64
koi_duration_err2    float64
koi_depth            float64
koi_depth_err1       float64
koi_depth_err2       float64
koi_prad             float64
koi_prad_err1        float64
koi_prad_err2        float64
koi_teq              float64
koi_insol            float64
koi_insol_err1       float64
koi_insol_err2       float64
koi_model_snr        float64
koi_tce_plnt_num     float64
koi_tce_delivname     object
koi_steff            float64
koi_steff_err1

In [16]:
df_2.head()

Unnamed: 0,kepid,kepoi_name,koi_disposition,koi_pdisposition,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,koi_period,koi_period_err1,...,koi_steff_err2,koi_slogg,koi_slogg_err1,koi_slogg_err2,koi_srad,koi_srad_err1,koi_srad_err2,ra,dec,koi_kepmag
0,12168280,K08297.01,CANDIDATE,CANDIDATE,0,0,0,0,229.957537,0.00673,...,-83.0,3.978,0.195,-0.12,2.057,0.365,-0.501,295.97794,50.771481,10.505
2,11923074,K08295.01,FALSE POSITIVE,FALSE POSITIVE,0,0,0,1,1.756471,1.5e-05,...,-157.0,4.561,0.033,-0.176,0.833,0.227,-0.071,296.14072,50.279949,14.723
3,11867733,K08294.01,FALSE POSITIVE,FALSE POSITIVE,0,1,0,0,384.356448,0.00368,...,-193.0,4.445,0.067,-0.202,0.994,0.291,-0.125,293.77362,50.12447,14.391
4,10585887,K08293.01,FALSE POSITIVE,FALSE POSITIVE,1,0,0,0,378.65484,0.0228,...,-209.0,4.463,0.07,-0.21,0.97,0.306,-0.123,285.24338,47.812592,14.872
5,10532751,K08292.01,FALSE POSITIVE,FALSE POSITIVE,1,0,0,0,367.761848,0.00675,...,-295.0,2.897,0.045,-0.014,9.888,0.257,-2.058,290.87155,47.753368,10.736


## Manage the datatypes, lose the non-nums
We cannot use scikit-learn or tensorflow models on non-numeric data or null values. We've eliminated the nulls; now we'll look at the non-numeric columns and either drop them or use one-hot encoding or other methods to convert them into numeric columns.

In [17]:
df_2.iloc[8000]

kepid                        7025846
kepoi_name                 K00565.01
koi_disposition       FALSE POSITIVE
koi_pdisposition      FALSE POSITIVE
koi_fpflag_nt                      0
koi_fpflag_ss                      0
koi_fpflag_co                      1
koi_fpflag_ec                      1
koi_period                   2.34052
koi_period_err1              8.6e-06
koi_period_err2             -8.6e-06
koi_time0bk                  170.198
koi_time0bk_err1              0.0026
koi_time0bk_err2             -0.0026
koi_impact                     0.725
koi_impact_err1                0.009
koi_impact_err2               -0.523
koi_duration                   3.156
koi_duration_err1              0.103
koi_duration_err2             -0.103
koi_depth                        175
koi_depth_err1                   5.6
koi_depth_err2                  -5.6
koi_prad                         1.4
koi_prad_err1                   0.35
koi_prad_err2                  -0.21
koi_teq                         1456
k

In [18]:
# Are kepid and kepoi_name just two different unique identifiers?
# The data dictionary is vague so let's check a few

#df_2[df_2['kepoi_name']=='K08295.01']
#df_2[df_2['kepoi_name']=='K08292.01']
#df_2[df_2['kepoi_name']=='K02361.01']
#df_2[df_2['kepoi_name']=='K03601.01']
df_2[df_2['kepoi_name']=='K00565.01']

Unnamed: 0,kepid,kepoi_name,koi_disposition,koi_pdisposition,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,koi_period,koi_period_err1,...,koi_steff_err2,koi_slogg,koi_slogg_err1,koi_slogg_err2,koi_srad,koi_srad_err1,koi_srad_err2,ra,dec,koi_kepmag
8784,7025846,K00565.01,FALSE POSITIVE,FALSE POSITIVE,0,0,1,1,2.340523,9e-06,...,-181.0,4.397,0.139,-0.186,0.981,0.25,-0.146,289.35855,42.526192,14.3


In [19]:
# Yes it looks like we can drop that col
df_2.drop(columns='kepoi_name', inplace=True)
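Looking up a few names by hand only shows that those particular rows come back once. A more systematic check is to ask pandas whether each column is unique and whether any `kepid` repeats. The rows below are invented for illustration, not taken from the real file.

```python
import pandas as pd

toy = pd.DataFrame({
    'kepid': [111, 111, 222, 333],
    'kepoi_name': ['K00001.01', 'K00001.02', 'K00002.01', 'K00003.01'],
})

# True only if no value in the column repeats
print('kepid unique:     ', toy['kepid'].is_unique)
print('kepoi_name unique:', toy['kepoi_name'].is_unique)

# How many distinct kepoi_name values share each kepid
names_per_id = toy.groupby('kepid')['kepoi_name'].nunique()
print(names_per_id[names_per_id > 1])
```

If `kepid` turns out to repeat in the real data, the two columns are not interchangeable identifiers, although dropping the string column is still a reasonable step for a numeric-only baseline.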

How often do koi_disposition and koi_pdisposition agree?

In [20]:
df_2[df_2['koi_disposition'] == df_2['koi_pdisposition']].shape

(6458, 44)

The two columns match in 6,458 of 8,744 rows (about 74%), so the answer is usually but not every time. We are going to want one of those columns to be our target column, but which one?
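A single equality count hides which labels disagree. A cross-tabulation shows the pairs directly. This is a sketch on four invented rows, not the real data:

```python
import pandas as pd

toy = pd.DataFrame({
    'koi_disposition':  ['CONFIRMED', 'CONFIRMED', 'FALSE POSITIVE', 'CANDIDATE'],
    'koi_pdisposition': ['CANDIDATE', 'CANDIDATE', 'FALSE POSITIVE', 'CANDIDATE'],
})

# Share of rows where the two columns agree
agreement = (toy['koi_disposition'] == toy['koi_pdisposition']).mean()
print(f'{agreement:.0%} of rows agree')

# Cross-tabulate to see which label pairs account for the disagreement
pairs = pd.crosstab(toy['koi_disposition'], toy['koi_pdisposition'])
print(pairs)
```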

After referencing the data dictionary <a href="https://exoplanetarchive.ipac.caltech.edu/docs/API_kepcandidate_columns.html">here</a>, it looks like the pdisposition should be dropped as it does not include any confirmed data, thus we can't actually use it as training or testing data.

In [21]:
df_2.drop(columns='koi_pdisposition', inplace=True)

Now use mapping to encode the target column

In [22]:
df_2['koi_disposition'] = df_2['koi_disposition'].map({'CONFIRMED': 1, 'FALSE POSITIVE': 0,
                            'CANDIDATE': 2, 'NOT DISPOSITIONED': 3})
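`Series.map` with a dict returns NaN for any value that is not a key, and it does so silently, so it is worth confirming that nothing slipped through. A sketch with made-up labels, one of them deliberately mis-cased:

```python
import pandas as pd

labels = pd.Series(['CONFIRMED', 'FALSE POSITIVE', 'CANDIDATE', 'Candidate'])
codes = {'CONFIRMED': 1, 'FALSE POSITIVE': 0,
         'CANDIDATE': 2, 'NOT DISPOSITIONED': 3}

encoded = labels.map(codes)

# Any label missing from the dict shows up as NaN after mapping
unmapped = labels[encoded.isna()].unique().tolist()
print(unmapped)
```

An empty list means every label was covered by the mapping.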

In [23]:
# This column is listed as "TCE delivery name corresponding
# to the TCE data federated to the KOI" ... so drop it too

df_2.drop(columns='koi_tce_delivname', inplace=True)

## Export the quick and dirty dataframe to a csv file

In [24]:
df_2.to_csv(datapath + 'dirty_kepler.csv', index=False)

# AND now for something Completely Different

## Let's run the data through a Random Forest Classifier and see what shakes out.

**Start by importing scikit-learn and the needed libraries, and re-import the dataset using the new csv file**

In [25]:

from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler


In [26]:
dirty = pd.read_csv(datapath + 'dirty_kepler.csv')
dirty.head()

Unnamed: 0,kepid,koi_disposition,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,koi_period,koi_period_err1,koi_period_err2,koi_time0bk,...,koi_steff_err2,koi_slogg,koi_slogg_err1,koi_slogg_err2,koi_srad,koi_srad_err1,koi_srad_err2,ra,dec,koi_kepmag
0,12168280,2,0,0,0,0,229.957537,0.00673,-0.00673,326.0184,...,-83.0,3.978,0.195,-0.12,2.057,0.365,-0.501,295.97794,50.771481,10.505
1,11923074,0,0,0,0,1,1.756471,1.5e-05,-1.5e-05,132.02757,...,-157.0,4.561,0.033,-0.176,0.833,0.227,-0.071,296.14072,50.279949,14.723
2,11867733,0,0,1,0,0,384.356448,0.00368,-0.00368,298.90354,...,-193.0,4.445,0.067,-0.202,0.994,0.291,-0.125,293.77362,50.12447,14.391
3,10585887,0,1,0,0,0,378.65484,0.0228,-0.0228,288.2584,...,-209.0,4.463,0.07,-0.21,0.97,0.306,-0.123,285.24338,47.812592,14.872
4,10532751,0,1,0,0,0,367.761848,0.00675,-0.00675,194.7697,...,-295.0,2.897,0.045,-0.014,9.888,0.257,-2.058,290.87155,47.753368,10.736


We want to predict the disposition. Get a baseline on the data

In [27]:
dirty.koi_disposition.value_counts(normalize=True)

0    0.481930
1    0.261322
2    0.256747
Name: koi_disposition, dtype: float64

It looks like we have 48% that have been classified as 'False Positive'. These are objects that the Kepler Objects of Interest dataset had classified as possible planets and that were later deemed not to be planets. We should think about changing the nomenclature so that we don't confuse this column heading with a measure of our own model later.

The rest of the observations are split almost evenly between Confirmed exoplanets (26%) and Candidates (26%), which I'll treat as unknowns. We'll want to save all of the unknowns as our testing dataset, and split the remaining 74% of our original dataset into a training and validation set.

In [28]:
koi_unknowns = dirty[dirty.koi_disposition == 2]
koi_unknowns.shape

(2245, 42)

In [29]:
#koi_unknowns.columns
print('unk: ', koi_unknowns.koi_fpflag_ss.value_counts())
print('dirty: ', dirty.koi_fpflag_ss.value_counts())

unk:  0    2243
1       2
Name: koi_fpflag_ss, dtype: int64
dirty:  0    6622
1    2122
Name: koi_fpflag_ss, dtype: int64


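In the above cell, I explore one of the flag columns and how the unknowns differ from the dataset as a whole. Only 2 of the 2,245 unknowns have koi_fpflag_ss set, compared with 2,122 of the 8,744 rows overall. The koi_fpflag_* columns are false-positive flags, so a raised flag counts against an object being an exoplanet rather than for it.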

In [30]:
koi_unknowns.columns
# print('unk: ', koi_unknowns.koi_fpflag_ss.value_counts())
# print('dirty: ', dirty.koi_fpflag_ss.value_counts())

Index(['kepid', 'koi_disposition', 'koi_fpflag_nt', 'koi_fpflag_ss',
       'koi_fpflag_co', 'koi_fpflag_ec', 'koi_period', 'koi_period_err1',
       'koi_period_err2', 'koi_time0bk', 'koi_time0bk_err1',
       'koi_time0bk_err2', 'koi_impact', 'koi_impact_err1', 'koi_impact_err2',
       'koi_duration', 'koi_duration_err1', 'koi_duration_err2', 'koi_depth',
       'koi_depth_err1', 'koi_depth_err2', 'koi_prad', 'koi_prad_err1',
       'koi_prad_err2', 'koi_teq', 'koi_insol', 'koi_insol_err1',
       'koi_insol_err2', 'koi_model_snr', 'koi_tce_plnt_num', 'koi_steff',
       'koi_steff_err1', 'koi_steff_err2', 'koi_slogg', 'koi_slogg_err1',
       'koi_slogg_err2', 'koi_srad', 'koi_srad_err1', 'koi_srad_err2', 'ra',
       'dec', 'koi_kepmag'],
      dtype='object')

In [31]:
koi_unknowns.koi_duration.value_counts()

1.7910     4
2.7060     3
2.4690     3
4.0260     3
2.6630     3
          ..
3.1760     1
2.9497     1
2.9878     1
22.3700    1
16.5000    1
Name: koi_duration, Length: 2038, dtype: int64

In [32]:
koi_unknowns.shape

(2245, 42)

Accepting that this is supposed to be the quick and dirty baseline, let's now train_test_split the rest of the data

In [33]:
dirty.koi_disposition.value_counts()

0    4214
1    2285
2    2245
Name: koi_disposition, dtype: int64

In [34]:
train = dirty[dirty.koi_disposition!=2]
train.shape

(6499, 42)

In [35]:
train.koi_disposition.value_counts(normalize=True)

0    0.648407
1    0.351593
Name: koi_disposition, dtype: float64

In [36]:
dirty_baseline = train.koi_disposition.value_counts(normalize=True)[0]
dirty_baseline

0.6484074472995845
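The baseline is the accuracy of always guessing the most common class. Indexing with `[0]` works here because class 0 happens to be the majority. A sketch that does not depend on which label is largest, using invented labels:

```python
import pandas as pd

y_toy = pd.Series([0, 0, 0, 1, 1, 0, 1, 0])

shares = y_toy.value_counts(normalize=True)
majority_class = shares.idxmax()   # label of the most frequent class
baseline = shares.max()            # accuracy of always predicting that label
print(majority_class, baseline)
```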

In [37]:
X = train.drop(columns='koi_disposition')
y = train['koi_disposition']

X_train, X_val, y_train, y_val = train_test_split(X, y, 
                                                 test_size = .2,
                                                 stratify=y,
                                                 random_state=42)

In [38]:
X_train.shape

(5199, 41)
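The `stratify=y` argument is what keeps the 65/35 class mix the same in both pieces of the split. A quick sketch on invented data to show the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 invented rows: 65 of class 0 and 35 of class 1
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 65 + [1] * 35)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42)

# With stratify, both pieces keep the same share of class 1
print(y_tr.mean(), y_va.mean())
```

Without `stratify`, the two shares would generally drift apart by chance, which matters more as the validation set gets smaller.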

# Create some Models!

## 1. Linear Regression Model Baseline

In [39]:
linr = LinearRegression()
linr.fit(X_train, y_train)

print('Linear Regression Score : ', linr.score(X_train, y_train))
print('Linear Validation Score : ', linr.score(X_val, y_val))
print('Baseline Score', dirty_baseline)

Linear Regression Score :  0.6758576072978744
Linear Validation Score :  0.6483967383285573
Baseline Score 0.6484074472995845


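This Linear Regression model uses ALL of the columns, which we wouldn't normally do. Note that `score` on a LinearRegression returns R², not accuracy, so these numbers are not directly comparable to the baseline accuracy, even though the validation value (0.648) happens to land almost exactly on it. The training R² is 0.676 and the validation R² is slightly lower.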
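To put a linear regression on the same footing as the baseline, one option is to cut its continuous predictions at 0.5 and measure accuracy on the resulting labels. This is a sketch on an invented binary problem, not what the cell above does:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Invented binary problem: the label depends on the first feature plus noise
X_toy = rng.normal(size=(500, 3))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

linr_toy = LinearRegression().fit(X_toy, y_toy)

r_squared = linr_toy.score(X_toy, y_toy)              # coefficient of determination
hard_preds = (linr_toy.predict(X_toy) >= 0.5).astype(int)
accuracy = accuracy_score(y_toy, hard_preds)          # share of correct labels

print(f'R^2: {r_squared:.3f}  accuracy at 0.5 cutoff: {accuracy:.3f}')
```

The two numbers measure different things, so they can sit far apart on the same fitted model.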

## 2. Logistic Regression Model Baseline

In [40]:
logr = LogisticRegression()
logr.fit(X_train, y_train)

print('Logistic Regression Score (train) : ', logr.score(X_train, y_train))
print('Logistic Regression Score (val) : ', logr.score(X_val, y_val))
print('Baseline Score', dirty_baseline)

Logistic Regression Score (train) :  0.7228313137141757
Logistic Regression Score (val) :  0.7484615384615385
Baseline Score 0.6484074472995845


Creating a Logistic Regression model using ALL of the columns, without scaling the data and without feature engineering, we improve on our baseline by about ten points and actually see a better score on the validation data (0.748) than on the training data (0.723).
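The unscaled fit mixes columns in the millions (like kepid) with error columns around 1e-6. A common next step, which this notebook has not taken, is to standardize inside a pipeline so the scaler only ever sees the training fold. A sketch on invented data, with assumed feature scales:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Invented features on very different scales
big = rng.normal(loc=8e6, scale=2e6, size=600)
small = rng.normal(loc=0.0, scale=1e-3, size=600)
X_toy = np.column_stack([big, small])
y_toy = (small > 0).astype(int)      # only the small-scale feature carries signal

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42)

# The scaler is fit on the training fold only, then applied to the validation fold
scaled_logr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logr.fit(X_tr, y_tr)
val_score = scaled_logr.score(X_va, y_va)
print(f'validation accuracy with scaling: {val_score:.3f}')
```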

## 3. Random Forest Classifier Baseline 

In [41]:
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

print('Random Forest Score (train) : ', rf.score(X_train, y_train))
print('Random Forest Score (val) : ', rf.score(X_val, y_val))
print('Baseline Score', dirty_baseline)

Random Forest Score (train) :  1.0
Random Forest Score (val) :  0.9938461538461538
Baseline Score 0.6484074472995845


This is quite suspect... it looks like we have a perfect accuracy score on the training data and a 99.4% accuracy score on the validation data when we run a baseline Random Forest Classifier on our dataset.

I don't trust this on first pass, however it might be good to run the unknown dataset through this model and see what comes out the other side.
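One way to probe a score this high is to look at which features the forest leans on. If one or two columns carry nearly all of the importance, they may be close to a restatement of the label. A sketch on invented data, where a single flag fully determines the label:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 400

# Invented data: 'flag' fully determines the label, 'noise' does not
flag = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    'flag': flag,
    'noise': rng.normal(size=n),
})
y_toy = 1 - flag   # label 1 whenever the flag is not raised

rf_toy = RandomForestClassifier(n_estimators=50, random_state=42).fit(toy, y_toy)

# Rank features by how much the forest relies on them
importances = pd.Series(rf_toy.feature_importances_, index=toy.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

On the real model the equivalent would be `rf.feature_importances_` indexed by `X_train.columns`.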

In [42]:
X_train.shape

(5199, 41)

In [43]:
koi_unknowns.head()

Unnamed: 0,kepid,koi_disposition,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,koi_period,koi_period_err1,koi_period_err2,koi_time0bk,...,koi_steff_err2,koi_slogg,koi_slogg_err1,koi_slogg_err2,koi_srad,koi_srad_err1,koi_srad_err2,ra,dec,koi_kepmag
0,12168280,2,0,0,0,0,229.957537,0.00673,-0.00673,326.0184,...,-83.0,3.978,0.195,-0.12,2.057,0.365,-0.501,295.97794,50.771481,10.505
10,9592621,2,0,0,0,0,191.03728,0.0132,-0.0132,153.9107,...,-180.0,4.497,0.055,-0.165,0.901,0.213,-0.091,293.6839,46.265541,16.652
16,9159288,2,0,0,0,0,66.67305,0.0124,-0.0124,161.59,...,-256.0,4.369,0.072,-0.217,1.19,0.404,-0.135,293.68979,45.549332,14.634
17,9011955,2,0,0,0,0,367.0896,0.0228,-0.0228,178.6744,...,-260.0,4.429,0.058,-0.232,1.1,0.384,-0.128,287.86737,45.321289,15.267
19,8673511,2,0,0,0,0,0.757261,4e-06,-4e-06,131.62154,...,-171.0,4.479,0.062,-0.2,0.953,0.279,-0.093,284.29565,44.895859,14.024


### Use the Random Forest Classifier to predict the unknowns set

In [67]:
koi_unknowns = koi_unknowns.drop(columns='koi_disposition')


Export the koi_unknowns file to the data folder so that we can use this as our blind testing set later

In [65]:
koi_unknowns.to_csv(datapath + 'kepler_candidates.csv', index=False)

In [45]:
koi_unknowns['exoplanet'] = rf.predict(koi_unknowns)
koi_unknowns.head()

Unnamed: 0,kepid,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,koi_period,koi_period_err1,koi_period_err2,koi_time0bk,koi_time0bk_err1,...,koi_slogg,koi_slogg_err1,koi_slogg_err2,koi_srad,koi_srad_err1,koi_srad_err2,ra,dec,koi_kepmag,exoplanet
0,12168280,0,0,0,0,229.957537,0.00673,-0.00673,326.0184,0.0151,...,3.978,0.195,-0.12,2.057,0.365,-0.501,295.97794,50.771481,10.505,0
10,9592621,0,0,0,0,191.03728,0.0132,-0.0132,153.9107,0.0648,...,4.497,0.055,-0.165,0.901,0.213,-0.091,293.6839,46.265541,16.652,0
16,9159288,0,0,0,0,66.67305,0.0124,-0.0124,161.59,0.176,...,4.369,0.072,-0.217,1.19,0.404,-0.135,293.68979,45.549332,14.634,0
17,9011955,0,0,0,0,367.0896,0.0228,-0.0228,178.6744,0.0422,...,4.429,0.058,-0.232,1.1,0.384,-0.128,287.86737,45.321289,15.267,0
19,8673511,0,0,0,0,0.757261,4e-06,-4e-06,131.62154,0.00472,...,4.479,0.062,-0.2,0.953,0.279,-0.093,284.29565,44.895859,14.024,0


In [46]:
koi_unknowns.exoplanet.value_counts()

1    1390
0     855
Name: exoplanet, dtype: int64

In [47]:
fr_preds = koi_unknowns.exoplanet.value_counts(normalize=True)[1]
print('This model predicts {0:.2%} of its observations are exoplanets'.format(fr_preds))

This model predicts 61.92% of its observations are exoplanets


Is it possible that 62% of the Kepler Objects of Interest that have not yet been classified by NASA are in fact exoplanets? Yes, it is possible. Is it likely that I have just created a successful predictor, based only on NASA's data, when roughly 2 out of 3 of the already-classified Kepler Objects of Interest in that same dataset turned out not to be exoplanets? That is far less likely.
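One practical wrinkle in the cells above: the prediction is written back into `koi_unknowns`, so re-running the predict cell would hand the model an extra column. A sketch that selects the feature columns explicitly and also keeps the predicted probability, which gives a way to rank candidates instead of taking a hard 0/1 at face value. All names and data here are invented:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Invented labelled rows and unlabelled candidates with the same two features
feature_cols = ['f1', 'f2']
labelled = pd.DataFrame(rng.normal(size=(200, 2)), columns=feature_cols)
y_toy = (labelled['f1'] > 0).astype(int)
candidates = pd.DataFrame(rng.normal(size=(50, 2)), columns=feature_cols)

rf_toy = RandomForestClassifier(n_estimators=50, random_state=42)
rf_toy.fit(labelled[feature_cols], y_toy)

# Select the feature columns explicitly so added columns never reach the model
results = candidates.copy()
results['exoplanet'] = rf_toy.predict(candidates[feature_cols])
# Column 1 of predict_proba is the estimated probability of class 1
results['p_exoplanet'] = rf_toy.predict_proba(candidates[feature_cols])[:, 1]

print(results['exoplanet'].value_counts(normalize=True))
```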

In [48]:
koi_unknowns.head(6)

Unnamed: 0,kepid,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,koi_period,koi_period_err1,koi_period_err2,koi_time0bk,koi_time0bk_err1,...,koi_slogg,koi_slogg_err1,koi_slogg_err2,koi_srad,koi_srad_err1,koi_srad_err2,ra,dec,koi_kepmag,exoplanet
0,12168280,0,0,0,0,229.957537,0.00673,-0.00673,326.0184,0.0151,...,3.978,0.195,-0.12,2.057,0.365,-0.501,295.97794,50.771481,10.505,0
10,9592621,0,0,0,0,191.03728,0.0132,-0.0132,153.9107,0.0648,...,4.497,0.055,-0.165,0.901,0.213,-0.091,293.6839,46.265541,16.652,0
16,9159288,0,0,0,0,66.67305,0.0124,-0.0124,161.59,0.176,...,4.369,0.072,-0.217,1.19,0.404,-0.135,293.68979,45.549332,14.634,0
17,9011955,0,0,0,0,367.0896,0.0228,-0.0228,178.6744,0.0422,...,4.429,0.058,-0.232,1.1,0.384,-0.128,287.86737,45.321289,15.267,0
19,8673511,0,0,0,0,0.757261,4e-06,-4e-06,131.62154,0.00472,...,4.479,0.062,-0.2,0.953,0.279,-0.093,284.29565,44.895859,14.024,0
20,8637903,0,0,0,0,385.858683,0.00526,-0.00526,343.64827,0.00907,...,4.367,0.065,-0.195,1.195,0.372,-0.149,296.40787,44.77364,13.993,1


In [49]:
koi_unknowns.exoplanet.value_counts()

1    1390
0     855
Name: exoplanet, dtype: int64

In [50]:
dirty.koi_disposition.value_counts()

0    4214
1    2285
2    2245
Name: koi_disposition, dtype: int64

## 4. Feed-Forward Neural Network Baseline

Finally, to conclude our quick and dirty model building exercise, let's put together a few neural networks:
1. A Feed-Forward Neural Network
2. A Feed-Forward Neural Network employing Dropout
3. A Convolutional Neural Network
4. A Convolutional Neural Network employing Dropout

In [51]:
# from sklearn.datasets import make_regression
# from sklearn.preprocessing import StandardScaler
# from sklearn.model_selection import train_test_split, GridSearchCV
# from sklearn import metrics

# from tensorflow.keras.models import Sequential
# from tensorflow.keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPooling2D
# from tensorflow.keras import utils
# from tensorflow.keras.wrappers.scikit_learn import KerasRegressor
# from tensorflow.keras.optimizers import Adam
# from tensorflow.keras.callbacks import EarlyStopping

In [52]:
# # Visualize the loss or accuracy - function based on Noelle Brown function
# def viz_model(history_name, metric):
#     train_metric = history_name.history[metric]
#     val_metric = history_name.history['val_' + metric]

#     plt.figure(figsize=(12, 8))
#     plt.plot(train_metric, label='Training Set ' + metric, color='navy')
#     plt.plot(val_metric, label='Validation Set ' + metric, color='fuchsia')
#     plt.legend();
#     plt.title(f'{metric} rate per epoch of the Training and Validation Sets', fontsize=18);

In [53]:
X_train.shape

(5199, 41)

In [54]:
X_val.head()

Unnamed: 0,kepid,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,koi_period,koi_period_err1,koi_period_err2,koi_time0bk,koi_time0bk_err1,...,koi_steff_err2,koi_slogg,koi_slogg_err1,koi_slogg_err2,koi_srad,koi_srad_err1,koi_srad_err2,ra,dec,koi_kepmag
2148,5795648,0,1,0,0,4.55337,7.69e-07,-7.69e-07,135.433898,0.00014,...,-189.0,4.346,0.113,-0.137,1.109,0.244,-0.162,293.2944,41.035351,11.471
4406,3240159,0,1,0,0,82.13761,3e-06,-3e-06,196.122513,2.9e-05,...,-162.0,3.904,0.307,-0.102,2.012,0.439,-0.713,292.71317,38.369659,12.261
1514,6185717,0,1,0,0,11.702204,4.22e-07,-4.22e-07,139.431265,3e-05,...,-149.0,3.793,0.308,-0.132,2.133,0.429,-0.735,284.46127,41.56179,13.393
8582,10925104,0,0,0,0,11.776132,6.96e-06,-6.96e-06,142.705753,0.000498,...,-79.0,4.722,0.03,-0.033,0.54,0.033,-0.033,294.12143,48.349522,13.738
6106,7102316,0,0,0,0,57.248852,0.000324,-0.000324,184.6864,0.00431,...,-169.0,4.542,0.044,-0.176,0.846,0.233,-0.078,286.4527,42.657879,15.99


In [55]:
# # This code mostly provided by Noelle Brown, Riley Dallas, and Dan Wilhelm, 
# # General Assembly DSI

# eps = 10  # Use this variable to set the number of epochs for the below models


# model_fnn = Sequential()

# model_fnn.add(Dense(64, input_dim=X_train.shape[1], activation='relu'))
# model_fnn.add(Dense(128, activation='relu'))
# # ADD DROPOUT TO Above (previous) LAYER>>>
# model_fnn.add(Dropout(0.2)) #refers to previous layer

# model_fnn.add(Dense(32, activation='relu'))
# # ADD DROPOUT TO Above (previous) LAYER>>>
# model_fnn.add(Dropout(0.2)) #refers to previous layer

# model_fnn.add(Dense(8, activation='relu'))

# model_fnn.add(Dense(1, activation='sigmoid'))

# model_fnn.compile(loss='binary_crossentropy',
#                   optimizer=Adam(learning_rate=.000005),
#                   metrics=['acc'])
# early_stop = EarlyStopping(patience=10)


# # Fit the model; store the history under the name the plotting cells expect
# history_fnn = model_fnn.fit(X_train, y_train,
#                    validation_data = (X_val, y_val),
#                    epochs = eps,
#                    callbacks = [early_stop])



In [56]:
# viz_model(history_fnn, 'loss')

In [57]:
# viz_model(history_fnn, 'acc')

On earlier runs the network's accuracy was almost exactly our baseline score; on the latest run it seems to be showing the inverse of the baseline instead. Either result would suggest that the network is predicting the same class for every row rather than learning from the features.
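Both numbers are what a constant predictor would score. A tiny illustration with invented labels that roughly follow the 65/35 class mix of the training data:

```python
import numpy as np

# Invented validation labels: 65 of class 0 and 35 of class 1
y_val_toy = np.array([0] * 65 + [1] * 35)

all_zeros = np.zeros_like(y_val_toy)   # a model that always predicts class 0
all_ones = np.ones_like(y_val_toy)     # a model that always predicts class 1

acc_zeros = (all_zeros == y_val_toy).mean()   # equals the majority-class share
acc_ones = (all_ones == y_val_toy).mean()     # equals one minus that share
print(acc_zeros, acc_ones)
```

Checking `np.unique` on a model's rounded predictions is a quick way to confirm whether this is what is happening.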

## 5. Convolutional Neural Network

In [59]:
# input_reshape = (-1, 41, 1, 1)

In [60]:
# X_train_resh = X_train.values.reshape(input_reshape)

In [61]:
# X_val_resh = X_val.values.reshape(input_reshape)

In [62]:
# X_train_resh[0].shape
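For reference, this is what the reshape in the cells above does to a 2-D feature matrix. The array here is an invented stand-in for `X_train.values`:

```python
import numpy as np

# Invented stand-in for X_train.values: 6 rows, 41 feature columns
X_toy = np.arange(6 * 41, dtype=float).reshape(6, 41)

# -1 lets NumPy infer the number of rows; each row becomes a 41x1x1 "image"
X_resh = X_toy.reshape(-1, 41, 1, 1)
print(X_resh.shape, X_resh[0].shape)
```

Because the "image" is only one pixel wide, any 2-D kernel or pooling window applied to it has to have width 1.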

In [63]:
# model_cnn = Sequential()

# model_cnn.add(Conv2D(filters=5,                # number of filters
#                      kernel_size=(2, 1),        # height/width of filter; width must be 1 for a 41x1 input
#                      activation='relu',      # activation function 
#                      input_shape=(41,1, 1))) # shape of input (one row of features)

# model_cnn.add(MaxPooling2D(pool_size=(2, 1))) # dimensions of region of pooling

# model_cnn.add(Conv2D(16,
#                      kernel_size = (2, 1),
#                      activation='relu'))

# model_cnn.add(MaxPooling2D(pool_size=(2, 1)))

# # Flatten the feature maps so the dense layers receive one vector per row
# model_cnn.add(Flatten())

# # Add a densely-connected layer with 128 neurons.
# model_cnn.add(Dense(128, activation='relu'))

# # Add a final layer with 1 neuron for the binary target.
# model_cnn.add(Dense(1, activation='sigmoid'))

# # Compile model
# model_cnn.compile(loss='binary_crossentropy',
#                   optimizer='adam',
#                   metrics=['acc'])

# # Fit model on training data
# history_cnn = model_cnn.fit(X_train_resh,
#                         y_train,
#                         batch_size=256,
#                         validation_data=(X_val_resh, y_val),
#                         epochs=10, 
#                         verbose=1)