# Kepler

The Kepler Space Observatory is a NASA-built satellite that was launched in 2009. The telescope is dedicated to searching for exoplanets in star systems, with the ultimate goal of finding other habitable planets besides our own. The original mission ended in 2013 due to mechanical failures, but the telescope has nevertheless been functional since 2014 on a "K2" extended mission (https://www.nasa.gov/mission_pages/kepler/main/index.html). We will refer to the dataset as the _OEC_ dataset.

Kepler had verified 1284 new exoplanets as of May 2016. As of October 2017 there are over 3000 confirmed exoplanets (using all detection methods, including ground-based ones). The telescope is still active and continues to collect new data on its extended mission.

## Content

This dataset is a cumulative record of all observed Kepler "objects of interest": essentially, all of the approximately 10,000 exoplanet candidates that Kepler has observed.

This dataset has an extensive data dictionary, which can be accessed at https://exoplanetarchive.ipac.caltech.edu/docs/API_kepcandidate_columns.html. Notable columns are:
- kepoi_name: A KOI is a target identified by the Kepler Project that displays at least one transit-like sequence within Kepler time-series photometry that appears to be of astrophysical origin and initially consistent with a planetary transit hypothesis
- kepler_name: [These names] are intended to clearly indicate a class of objects that have been confirmed or validated as planets—a step up from the planet candidate designation.
- koi_disposition: The disposition in the literature towards this exoplanet candidate. One of CANDIDATE, FALSE POSITIVE, NOT DISPOSITIONED or CONFIRMED.
- koi_score: A value between 0 and 1 that indicates the confidence in the KOI disposition. For CANDIDATEs, a higher value indicates more confidence in its disposition, while for FALSE POSITIVEs, a higher value indicates less confidence in that disposition.

## Acknowledgements

This dataset was published as-is by NASA. You can access the original table at https://www.kaggle.com/nasa/kepler-exoplanet-search-results. More data from the Kepler mission is available from the same source.

link: https://github.com/OpenExoplanetCatalogue/open_exoplanet_catalogue


## Starting out: loading data and libraries
We begin by loading the necessary libraries for the work we are going to do in this lab.

In [1]:
import pandas as pd
import numpy as np

#designate the path where you saved your OEC data

#Load the data using pandas read_csv function. 
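
As a minimal sketch of the loading step, the snippet below calls `pd.read_csv` on an in-memory stand-in for the file. The three columns and two rows are illustrative values only, and `sample_csv` is a hypothetical name; in the lab you would pass the path where you saved the data instead.

```python
import io

import pandas as pd

# In-memory stand-in for the CSV on disk (illustrative values only).
# In the lab, pass your own file path to pd.read_csv instead.
sample_csv = io.StringIO(
    "koi_disposition,koi_score,koi_period\n"
    "CONFIRMED,1.000,9.488036\n"
    "FALSE POSITIVE,0.000,19.899140\n"
)

df = pd.read_csv(sample_csv)

# Quick sanity checks: size of the table and inferred column types.
print(df.shape)
print(df.dtypes)
```

Checking `shape` and `dtypes` straight after loading is a cheap way to spot columns that were parsed as strings when numbers were expected.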

## Defining the columns we want to use.

Columns that we definitely want to exclude from the analysis are: `["rowid","kepid","kepoi_name","kepler_name","koi_pdisposition","koi_tce_delivname"]`

Additionally, there are some empty columns which we will also not use: `["koi_teq_err1","koi_teq_err2"]`.

The columns we are going to use are: ["koi_score","koi_fpflag_nt","koi_fpflag_ss","koi_fpflag_co","koi_fpflag_ec","koi_period","koi_period_err1","koi_period_err2","koi_time0bk",
"koi_time0bk_err1","koi_time0bk_err2","koi_impact","koi_impact_err1","koi_impact_err2","koi_duration","koi_duration_err1","koi_duration_err2",
"koi_depth","koi_depth_err1","koi_depth_err2","koi_prad","koi_prad_err1","koi_prad_err2","koi_teq","koi_insol","koi_insol_err1","koi_insol_err2",
"koi_model_snr","koi_tce_plnt_num","koi_steff","koi_steff_err1","koi_steff_err2","koi_slogg","koi_slogg_err1","koi_slogg_err2",
"koi_srad","koi_srad_err1","koi_srad_err2","ra","dec","koi_kepmag"]

The `koi_disposition` column will be our target. Let us set those up as `label` and `features_to_use`.

In [93]:
#select the label feature. in this case koi disposition
label = "koi_disposition"

features_to_use = ["koi_score","koi_fpflag_nt","koi_fpflag_ss","koi_fpflag_co","koi_fpflag_ec","koi_period","koi_period_err1","koi_period_err2","koi_time0bk","koi_time0bk_err1","koi_time0bk_err2","koi_impact","koi_impact_err1","koi_impact_err2","koi_duration","koi_duration_err1","koi_duration_err2","koi_depth","koi_depth_err1","koi_depth_err2","koi_prad","koi_prad_err1","koi_prad_err2","koi_teq","koi_insol","koi_insol_err1","koi_insol_err2","koi_model_snr","koi_tce_plnt_num","koi_steff","koi_steff_err1","koi_steff_err2","koi_slogg","koi_slogg_err1","koi_slogg_err2","koi_srad","koi_srad_err1","koi_srad_err2","ra","dec","koi_kepmag"]

#setup variables for the labels and data to keep them separate


Unnamed: 0,koi_score,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,koi_period,koi_period_err1,koi_period_err2,koi_time0bk,koi_time0bk_err1,...,koi_steff_err2,koi_slogg,koi_slogg_err1,koi_slogg_err2,koi_srad,koi_srad_err1,koi_srad_err2,ra,dec,koi_kepmag
0,1.000,0,0,0,0,9.488036,2.775000e-05,-2.775000e-05,170.538750,0.002160,...,-81.0,4.467,0.064,-0.096,0.927,0.105,-0.061,291.93423,48.141651,15.347
1,0.969,0,0,0,0,54.418383,2.479000e-04,-2.479000e-04,162.513840,0.003520,...,-81.0,4.467,0.064,-0.096,0.927,0.105,-0.061,291.93423,48.141651,15.347
2,0.000,0,1,0,0,19.899140,1.494000e-05,-1.494000e-05,175.850252,0.000581,...,-176.0,4.544,0.044,-0.176,0.868,0.233,-0.078,297.00482,48.134129,15.436
3,0.000,0,1,0,0,1.736952,2.630000e-07,-2.630000e-07,170.307565,0.000115,...,-174.0,4.564,0.053,-0.168,0.791,0.201,-0.067,285.53461,48.285210,15.597
4,1.000,0,0,0,0,2.525592,3.761000e-06,-3.761000e-06,171.595550,0.001130,...,-211.0,4.438,0.070,-0.210,1.046,0.334,-0.133,288.75488,48.226200,15.509
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9559,0.000,0,0,0,1,8.589871,1.846000e-04,-1.846000e-04,132.016100,0.015700,...,-152.0,4.296,0.231,-0.189,1.088,0.313,-0.228,298.74921,46.973351,14.478
9560,0.000,0,1,1,0,0.527699,1.160000e-07,-1.160000e-07,131.705093,0.000170,...,-166.0,4.529,0.035,-0.196,0.903,0.237,-0.079,297.18875,47.093819,14.082
9561,0.497,0,0,0,0,1.739849,1.780000e-05,-1.780000e-05,133.001270,0.007690,...,-220.0,4.444,0.056,-0.224,1.031,0.341,-0.114,286.50937,47.163219,14.757
9562,0.021,0,0,1,0,0.681402,2.434000e-06,-2.434000e-06,132.181750,0.002850,...,-236.0,4.447,0.056,-0.224,1.041,0.341,-0.114,294.16489,47.176281,15.385
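
One way to keep the labels and the feature data separate is to index the DataFrame with the `label` and `features_to_use` variables. The sketch below uses a tiny hand-made DataFrame and a shortened feature list purely for illustration; the names `X` and `y` are assumptions, not part of the lab template.

```python
import pandas as pd

label = "koi_disposition"
# Shortened feature list, for illustration only.
features_to_use = ["koi_score", "koi_period"]

# Hand-made stand-in for the loaded dataset (values are illustrative).
df = pd.DataFrame({
    "kepid": [1, 2, 3],
    "koi_disposition": ["CONFIRMED", "FALSE POSITIVE", "CANDIDATE"],
    "koi_score": [1.0, 0.0, 0.5],
    "koi_period": [9.49, 19.90, 1.74],
})

X = df[features_to_use]  # feature matrix: only the selected columns
y = df[label]            # target vector: the disposition strings

print(X.shape, y.shape)
```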


## Missing Values

The OEC data has various missing values and categorical columns.

Preprocess the data to impute or otherwise handle the missing values. Consider how we can handle categorical values by generating an alternative encoding.


In [94]:
#create the simple imputer

#fit the imputer to our data.
#Any imputation we fit on the training data must be applied identically to the test data.
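
A minimal sketch of this step with scikit-learn's `SimpleImputer`, assuming mean imputation (the lab does not prescribe a strategy). The arrays are toy data; the key point is that the imputer is fitted on the training data only and then reused unchanged on the test data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data with missing entries (illustrative values only).
X_train = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])
X_test = np.array([[np.nan, 40.0]])

# Mean imputation is an assumption; "median" or "most_frequent" also exist.
imputer = SimpleImputer(strategy="mean")

# Learn the column means from the training data and fill its gaps.
X_train_imp = imputer.fit_transform(X_train)

# Reuse the *training* means on the test data; do not refit here.
X_test_imp = imputer.transform(X_test)

print(X_train_imp)
print(X_test_imp)
```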


## Label Encoding

Transform the set of labels from strings to a suitable encoding such that they can be used with a classifier. 

In [95]:
#encode our labels
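
As a sketch, scikit-learn's `LabelEncoder` maps each distinct string to an integer, assigning codes in sorted order of the class names. The short `labels` list below is illustrative only.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels; in the lab these come from the koi_disposition column.
labels = ["CONFIRMED", "FALSE POSITIVE", "CANDIDATE", "FALSE POSITIVE"]

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(labels)

# classes_ holds the original strings; the position is the integer code.
print(encoder.classes_)
print(y_encoded)

# The mapping can be reversed when reporting predictions.
print(encoder.inverse_transform([2]))
```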

## Generate train and test set

Generate a set of train and test data using the data and labels you have crafted.

In [96]:
#generate a train and test split


array([0, 2, 2, ..., 2, 2, 0])
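
A minimal sketch of the split using `train_test_split`. The toy arrays, the 25% test size, the fixed `random_state` and the use of `stratify` are all assumptions made for illustration; stratifying keeps the class proportions similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 2 features, 2 balanced classes.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,   # hold out a quarter of the samples
    random_state=0,   # fixed seed so the split is repeatable
    stratify=y,       # preserve class proportions in both parts
)

print(X_train.shape, X_test.shape)
```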

## Classifier 1

Train a LinearSVC classifier on your dataset.

https://scikit-learn.org/stable/modules/svm.html#svm

In [97]:
#train a Linear classifier
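
The sketch below fits a `LinearSVC` on synthetic data from `make_classification`, which stands in for the Kepler features. Wrapping the model in a pipeline with `StandardScaler` and raising `max_iter` are additions assumed here, since linear SVMs tend to converge poorly on unscaled features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic three-class stand-in for the imputed features and encoded labels.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale the features, then fit the linear SVM.
svc = make_pipeline(StandardScaler(), LinearSVC(max_iter=10000))
svc.fit(X_train, y_train)

svc_accuracy = svc.score(X_test, y_test)
print(f"LinearSVC accuracy: {svc_accuracy:.3f}")
```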

## Classifier 2

Train a naive Bayes classifier on your dataset.

https://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes


In [98]:
#train a naive bayes classifier
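
scikit-learn offers several naive Bayes variants. The sketch below assumes `GaussianNB`, a common choice for continuous features, and again uses synthetic data as a stand-in for the Kepler table.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic three-class stand-in for the imputed features and encoded labels.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB()
nb.fit(X_train, y_train)

nb_accuracy = nb.score(X_test, y_test)
# predict_proba returns one probability per class for each sample.
nb_probabilities = nb.predict_proba(X_test)
print(f"GaussianNB accuracy: {nb_accuracy:.3f}")
```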

## Classifier 3

Train a linear classifier using stochastic gradient descent on your dataset.

https://scikit-learn.org/stable/modules/linear_model.html#linear-model

In [99]:
#train a stochastic gradient descent
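
A minimal sketch using `SGDClassifier`, whose default hinge loss gives a linear SVM trained by stochastic gradient descent. The synthetic data, the scaling step and the fixed `random_state` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class stand-in for the imputed features and encoded labels.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# SGD is sensitive to feature scale, so standardise first.
sgd = make_pipeline(StandardScaler(), SGDClassifier(random_state=0))
sgd.fit(X_train, y_train)

sgd_accuracy = sgd.score(X_test, y_test)
print(f"SGDClassifier accuracy: {sgd_accuracy:.3f}")
```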

# Write your own 1NN

The 1-nearest-neighbour algorithm finds the single closest training data point, usually via Euclidean distance, and then predicts that training element's label.

Create a simple sklearn classifier to do this using the class stub provided.


In [100]:
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted

class My1NN(BaseEstimator, ClassifierMixin):

    def __init__(self):
        pass

    def fit(self, X, y):
        # check the data is correctly formatted.
        # consider check_X_y

        # get the class labels.

        # return the fitted classifier
        return self

    def predict(self, X):
        # https://scikit-learn.org/stable/modules/generated/sklearn.utils.validation.check_is_fitted.html
        check_is_fitted(self)

        # check the data is valid. consider check_array

        # get the index of the closest training sample for each row of X

        # return the labels based on the index.
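
One possible way to fill in the stub is sketched below; it is not the only valid solution. The class name `Sketch1NN` and the attributes `X_` and `y_` are hypothetical. Squared Euclidean distances are computed through the expansion ‖a − b‖² = ‖a‖² − 2a·b + ‖b‖², which avoids building a three-dimensional array and gives the same nearest neighbour as the true distance.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted


# Mixin listed first, as in scikit-learn's developer guide examples.
class Sketch1NN(ClassifierMixin, BaseEstimator):
    """Hypothetical 1-nearest-neighbour classifier (Euclidean distance)."""

    def fit(self, X, y):
        # Validate the input and convert it to NumPy arrays.
        X, y = check_X_y(X, y)
        self.classes_ = np.unique(y)
        # 1NN has no training step beyond memorising the data.
        self.X_ = X
        self.y_ = y
        return self

    def predict(self, X):
        check_is_fitted(self)
        X = check_array(X)
        # Squared distances, shape (n_query, n_train).
        sq_dists = (
            (X ** 2).sum(axis=1)[:, np.newaxis]
            - 2.0 * X @ self.X_.T
            + (self.X_ ** 2).sum(axis=1)[np.newaxis, :]
        )
        # Index of the closest training sample for each query row.
        nearest = sq_dists.argmin(axis=1)
        return self.y_[nearest]


# Toy usage example (illustrative values only).
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
y_train = np.array([0, 0, 1])
clf = Sketch1NN().fit(X_train, y_train)
pred = clf.predict(np.array([[0.2, 0.1], [4.0, 4.5]]))
print(pred)  # [0 1]
```

Because the class inherits from `ClassifierMixin`, it also gains a `score` method that reports accuracy, so it can be compared with the other classifiers in the same way.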



# Compare performance of classifiers

Which classifier performed best overall?
Which classifier had the highest accuracy on each class?

In [101]:

# for each classifier get the accuracy score

# argmax the classifier

# 
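
The comparison could be sketched as follows. The predictions are hard-coded toy arrays under hypothetical classifier names, and "accuracy on each class" is interpreted here as the fraction of each true class that was predicted correctly (per-class recall), which is an assumption about what the question intends.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy ground truth and predictions (illustrative values only).
y_test = np.array([0, 0, 1, 1, 2, 2])
predictions = {
    "clf_a": np.array([0, 0, 1, 2, 2, 2]),
    "clf_b": np.array([0, 1, 1, 1, 0, 2]),
}

# Overall accuracy for each classifier, then pick the highest.
accuracies = {name: accuracy_score(y_test, p) for name, p in predictions.items()}
best_overall = max(accuracies, key=accuracies.get)

# Per-class accuracy: confusion-matrix diagonal over the true-class counts.
per_class = {}
for name, p in predictions.items():
    cm = confusion_matrix(y_test, p, labels=[0, 1, 2])
    per_class[name] = cm.diagonal() / cm.sum(axis=1)

# For each class, the classifier with the highest per-class accuracy.
names = list(per_class)
stacked = np.vstack([per_class[n] for n in names])
best_per_class = [names[i] for i in stacked.argmax(axis=0)]

print(best_overall)    # clf_a
print(best_per_class)  # ['clf_a', 'clf_b', 'clf_a']
```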


## Extension Task: (Hard) Cross Validated Attribute Selection

Given the attribute space : `"koi_prad", "koi_steff", "koi_teq", "koi_model_snr"`

Find the best three-attribute combination using cross-validation on the whole training set.

Hint: You will need to think about the permutations and combinations of the possible features.
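
Since the order of features does not matter to the model, combinations rather than permutations are enough: four attributes give only four subsets of size three. The sketch below enumerates them with `itertools.combinations` and scores each with 5-fold cross-validation. The random data, the synthetic target and the choice of `GaussianNB` are assumptions for illustration only.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

attributes = ["koi_prad", "koi_steff", "koi_teq", "koi_model_snr"]

# Random stand-in for the training set (not real Kepler values).
rng = np.random.default_rng(0)
X_train = pd.DataFrame(rng.normal(size=(120, 4)), columns=attributes)
y_train = (X_train["koi_prad"] + X_train["koi_teq"] > 0).astype(int).to_numpy()

# Mean cross-validated accuracy for every 3-attribute subset.
cv_scores = {}
for combo in combinations(attributes, 3):
    scores = cross_val_score(GaussianNB(), X_train[list(combo)], y_train, cv=5)
    cv_scores[combo] = scores.mean()

best_combo = max(cv_scores, key=cv_scores.get)
for combo, score in cv_scores.items():
    print(combo, round(score, 3))
print("best:", best_combo)
```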


## Find the best feature based on the accuracies

Find the best feature set, then train the model using it and assess it against the test data.
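
A sketch of this final step: refit on the full training set restricted to the chosen columns, then score once on the held-out test set. The `best_features` list is a hypothetical placeholder for whatever your cross-validation selected, and the data and the `GaussianNB` model are stand-ins.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Random stand-in data (not real Kepler values).
rng = np.random.default_rng(1)
columns = ["koi_prad", "koi_steff", "koi_teq", "koi_model_snr"]
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=columns)
y = (X["koi_prad"] - X["koi_steff"] > 0).astype(int).to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hypothetical winner of the cross-validation step; substitute your own.
best_features = ["koi_prad", "koi_steff", "koi_teq"]

# Fit on all training rows, using only the selected columns.
final_model = GaussianNB().fit(X_train[best_features], y_train)

# Evaluate once on the untouched test set.
test_accuracy = final_model.score(X_test[best_features], y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

Keeping the test set out of the selection loop means the reported accuracy is not biased by the choice of features.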