# Kepler dataset 

* Goal: Predict koi_disposition using all vaiables except koi_pdispositon   
* Secondary goal: Using 'CANDIDATES' from koi_pdisposition predict which rows are most likely to become 'CONFIRMED' in koi_disposition 

In [31]:
#
# Imports 
#
import pandas as pd
import numpy as np
from sklearn import preprocessing
import keras
from keras import regularizers
from keras.models import Sequential, Model
from keras.layers import Dense, Input
from keras.optimizers import SGD
from keras.utils import to_categorical
from keras import regularizers
from sklearn.metrics import accuracy_score

In [40]:
def fill_median(data):
    for column in data.columns:
        print("Current columns: ", column)
        tmp = data[column].dtypes
        
        if tmp == 'int64' or tmp == 'float64':
            print("Number of NaN: ", data[column].isna().sum())
            print("Total length: ", len(data[column]))
            median = data[column].median()
            data[column] = data[column].fillna(median)
            
    return data

def create_3d_plot(data, target, figsize, class_list):
    plt.clf()
    fig = plt.figure(figsize=figsize)
    ax = fig.add_subplot(111, projection='3d')
    
    for c, dispo, a in class_list:
        tmp_df = data[data[target] == dispo]
        xs = tmp_df['X']
        ys = tmp_df['Y']
        zs = tmp_df['Z']
        ax.scatter(xs, ys, zs, s=50, alpha=a, edgecolors='w', c=c)
    
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.set_zlabel('Z')
    
    # ax.view_init(30, 90)
    
    plt.show()

In [41]:
#
# Import dataset 
#
data = pd.read_csv('cumulative.csv')

In [42]:
#
# Data cleaning 
#

drop_columns = ['rowid', 'kepid', 'kepoi_name', 'kepler_name', 'koi_score', 'koi_teq_err1', 'koi_teq_err2']
data[drop_columns].head()

Unnamed: 0,rowid,kepid,kepoi_name,kepler_name,koi_score,koi_teq_err1,koi_teq_err2
0,1,10797460,K00752.01,Kepler-227 b,1.0,,
1,2,10797460,K00752.02,Kepler-227 c,0.969,,
2,3,10811496,K00753.01,,0.0,,
3,4,10848459,K00754.01,,0.0,,
4,5,10854555,K00755.01,Kepler-664 b,1.0,,


`rowid` - The data in this column is nothing more than an index.   
`kepid` - Id for solar system.   
`kepoi_name` - Name for solar system and planet number.  
`kepler_name` - Name for exoplanet. NaN represents none expoplanets.   
`koi_score` - NASA prediction score for candidate.   
`koi_teq_err1` - Entire column is NaN.  
`koi_teq_err2` - Entire column is NaN.  

In [43]:
data = data.drop(drop_columns, axis = 1)

In [44]:
rows_nan = data.isna().sum(axis=1)
rows_nan.value_counts()

0     8744
2      248
29     230
6       95
31      91
10      89
26      42
1       15
8        7
16       2
7        1
dtype: int64

We got 40 features, to reduce noise in the datset we remove samples that are missing more than 10 % of their feature values, which is 4 or more missing feature values. As we can see in the data above there are several samples that are missing a lot of data. If we were to just insert the median in to these missing values we would introduce noise or skew the data.  

In [45]:
rows_nan = rows_nan[rows_nan >= 4]
rows_nan = rows_nan.reset_index()
data = data.drop(index = rows_nan["index"].to_numpy(), axis = 0)

In [46]:
data['koi_disposition'].value_counts()

FALSE POSITIVE    4552
CONFIRMED         2292
CANDIDATE         2163
Name: koi_disposition, dtype: int64

We are more interesting to keep 'CANDIDATES' and 'CONFIRMED' in the columnn `koi_disposition` because they add more value in the modeling process compared to 'FALSE POSITIVE', because it's 50% of the dataset while 'CONFIRMED' is about 25%. 



In [47]:
# Seperate False Positives from Candidates and Confirmed
# This is to run seperate Data Cleaning processes on the sets
data_fp = data_cleaned[data_cleaned['koi_disposition'] == 'FALSE POSITIVE']
data_cc = data_cleaned[data_cleaned['koi_disposition'] != 'FALSE POSITIVE']
data_fp = data_fp.dropna(axis = 0) # Drop every row that contains atleast one NaN
data_cc = fill_median(data_cc) # Fill each NaN with the median of the column

# Merge the two datasets back together after cleaning
data_merged = pd.concat([data_fp, data_cc], axis = 0)

Current columns:  koi_disposition
Current columns:  koi_pdisposition
Current columns:  koi_fpflag_nt
Number of NaN:  0
Total length:  4541
Current columns:  koi_fpflag_ss
Number of NaN:  0
Total length:  4541
Current columns:  koi_fpflag_co
Number of NaN:  0
Total length:  4541
Current columns:  koi_fpflag_ec
Number of NaN:  0
Total length:  4541
Current columns:  koi_period
Number of NaN:  0
Total length:  4541
Current columns:  koi_period_err1
Number of NaN:  78
Total length:  4541
Current columns:  koi_period_err2
Number of NaN:  78
Total length:  4541
Current columns:  koi_time0bk
Number of NaN:  0
Total length:  4541
Current columns:  koi_time0bk_err1
Number of NaN:  78
Total length:  4541
Current columns:  koi_time0bk_err2
Number of NaN:  78
Total length:  4541
Current columns:  koi_impact
Number of NaN:  64
Total length:  4541
Current columns:  koi_impact_err1
Number of NaN:  78
Total length:  4541
Current columns:  koi_impact_err2
Number of NaN:  78
Total length:  4541
Current 

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  # Remove the CWD from sys.path while we load stuff.


In [48]:
#
# Data exploration 
#

Investigate the state changes from `koi_pdisposition` to `koi_disposition`. 

In [54]:
classes = data['koi_disposition'].unique()
array = np.empty((2,3), dtype=np.int64)
for i, c in enumerate(classes, start=0):
    mask = (data['koi_disposition'] == c) & (data['koi_pdisposition'] == 'FALSE POSITIVE')
    l = len(data['koi_disposition'][mask])
    array[0, i] = l
    
    mask = (data['koi_disposition'] == c) & (data['koi_pdisposition'] == 'CANDIDATE')
    l = len(data['koi_disposition'][mask])
    array[1, i] = l
    
df = pd.DataFrame(array, columns=classes, index=['FALSE POSITIVE', 'CANDIDATE'])
df

Unnamed: 0,CONFIRMED,FALSE POSITIVE,CANDIDATE
FALSE POSITIVE,44,4552,0
CANDIDATE,2248,0,2163


We notice that a 44 cases went from FALSE POSITIVE to CONFIRMED. We extract those special cases to be able to separate them later in visualization.  

**ToDo** Outlier removal

**ToDo** Check correlation between features in dataset

In [56]:
#
# Data preprocessing 
#

In [57]:
#
# Modeling
#

In [58]:
#
# Evaluation 
#