## Data preprocessing  
I will preprocess the data following the approach of the authors of 'Assessment of Ensemble-Based Machine Learning Algorithms for Exoplanet Identification', combined with my own propositions; I will mark explicitly where the latter apply. For Kepler I will use the same dataset as they did: not the full table, but the default one downloaded from this link: https://exoplanetarchive.ipac.caltech.edu/cgi-bin/TblView/nph-tblView?app=ExoTbls&config=cumulative. I will then apply the same approach to the TESS and K2 datasets to finally obtain a merged dataset of all three.

In [18]:
import pandas as pd 

kepler = pd.read_csv('datasets/KOI.csv', comment='#')
kepler.shape

(9564, 49)

In [19]:
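# Preview one row per star (the last KOI of each kepid); the result is not assigned, so `kepler` keeps all rows
kepler.drop_duplicates(subset=["kepid"], keep="last")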

Unnamed: 0,kepid,kepoi_name,kepler_name,koi_disposition,koi_pdisposition,koi_score,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,...,koi_steff_err2,koi_slogg,koi_slogg_err1,koi_slogg_err2,koi_srad,koi_srad_err1,koi_srad_err2,ra,dec,koi_kepmag
1,10797460,K00752.02,Kepler-227 c,CONFIRMED,CANDIDATE,0.969,0,0,0,0,...,-81.0,4.467,0.064,-0.096,0.927,0.105,-0.061,291.93423,48.141651,15.347
2,10811496,K00753.01,,CANDIDATE,CANDIDATE,0.000,0,0,0,0,...,-176.0,4.544,0.044,-0.176,0.868,0.233,-0.078,297.00482,48.134129,15.436
3,10848459,K00754.01,,FALSE POSITIVE,FALSE POSITIVE,0.000,0,1,0,0,...,-174.0,4.564,0.053,-0.168,0.791,0.201,-0.067,285.53461,48.285210,15.597
4,10854555,K00755.01,Kepler-664 b,CONFIRMED,CANDIDATE,1.000,0,0,0,0,...,-211.0,4.438,0.070,-0.210,1.046,0.334,-0.133,288.75488,48.226200,15.509
7,10872983,K00756.03,Kepler-228 b,CONFIRMED,CANDIDATE,0.992,0,0,0,0,...,-232.0,4.486,0.054,-0.229,0.972,0.315,-0.105,296.28613,48.224670,15.714
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9559,10090151,K07985.01,,FALSE POSITIVE,FALSE POSITIVE,0.000,0,1,1,0,...,-166.0,4.529,0.035,-0.196,0.903,0.237,-0.079,297.18875,47.093819,14.082
9560,10128825,K07986.01,,CANDIDATE,CANDIDATE,0.497,0,0,0,0,...,-220.0,4.444,0.056,-0.224,1.031,0.341,-0.114,286.50937,47.163219,14.757
9561,10147276,K07987.01,,FALSE POSITIVE,FALSE POSITIVE,0.021,0,0,1,0,...,-236.0,4.447,0.056,-0.224,1.041,0.341,-0.114,294.16489,47.176281,15.385
9562,10155286,K07988.01,,CANDIDATE,CANDIDATE,0.092,0,0,0,0,...,-128.0,2.992,0.030,-0.027,7.824,0.223,-1.896,296.76288,47.145142,10.998


#### Step 1
Remove the 11 columns listed below, either because they do not contribute to exoplanet identification or because they are empty. The authors also dropped the column 'kepid', but I keep it because it contains duplicates (one star can host several KOIs): to avoid data leakage, the dataset will later be split by 'kepid' groups.
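To make the duplicate argument concrete, here is a minimal sketch of how repeated 'kepid' values can be counted. It runs on a made-up toy table (the IDs and names are invented, not taken from the KOI file), so it only illustrates the idea, not the real counts.

```python
import pandas as pd

# Toy stand-in for the KOI table; all values are invented for illustration.
toy = pd.DataFrame({
    "kepid": [101, 101, 102, 103, 103, 103],
    "kepoi_name": ["K1.01", "K1.02", "K2.01", "K3.01", "K3.02", "K3.03"],
})

# Number of rows (KOIs) recorded for each star.
rows_per_star = toy["kepid"].value_counts()

# Stars that appear more than once: these are the groups that could
# end up on both sides of a naive row-wise train/test split.
multi_koi_stars = rows_per_star[rows_per_star > 1]

n_unique_stars = toy["kepid"].nunique()
n_extra_rows = int(toy.duplicated(subset=["kepid"]).sum())
print(n_unique_stars, n_extra_rows)
```

Applied to `kepler`, the same two numbers would show how many rows share a star with another row, which is what the group-wise split has to account for.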

In [20]:
drop = ['kepoi_name', 'kepler_name', 'koi_pdisposition', 'koi_score', 'koi_teq_err1', 'koi_teq_err2', 'koi_fpflag_nt', 'koi_fpflag_ss', 'koi_fpflag_co', 'koi_fpflag_ec', 'koi_tce_plnt_num']
kepler_clean = kepler.drop(drop, axis=1)
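
The "emptiness" criterion can be checked directly instead of by eye. The sketch below is only an illustration on a toy frame: the column names are borrowed from the drop list, but the values, and which columns are empty, are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy frame; values are invented, and which columns are empty is assumed here.
toy = pd.DataFrame({
    "koi_period": [9.48, 54.41, np.nan],
    "koi_teq_err1": [np.nan, np.nan, np.nan],
    "koi_teq_err2": [np.nan, np.nan, np.nan],
})

# Share of missing values per column (1.0 means the column is entirely empty).
missing_share = toy.isna().mean()

# Columns in which every single value is missing.
empty_cols = toy.columns[toy.isna().all()].tolist()

# Drop only those fully empty columns.
toy_clean = toy.drop(columns=empty_cols)
print(missing_share)
```

Running the same check on `kepler` before the drop would confirm which of the listed columns are removed for being empty and which for being uninformative.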

#### Step 2
Keep the rows where 'koi_disposition' is either CONFIRMED or FALSE POSITIVE, and remove the ones labeled CANDIDATE. This is the one point in this step where I depart from the authors, who dropped the FALSE POSITIVE rows. CONFIRMED rows are actual planets and FALSE POSITIVE rows have been examined and found not to be planets, so I believe it is better to feed both to the model and drop the CANDIDATE rows, which may turn out to be either and would therefore only add label noise.  
Binary encoding is performed in the same code block: CONFIRMED becomes 1 and FALSE POSITIVE becomes 0.

In [21]:
kepler_candidates_df = kepler_clean[kepler_clean["koi_disposition"].str.strip().str.upper() == "CANDIDATE"].copy()

kepler_labeled_df = kepler_clean[kepler_clean["koi_disposition"].str.strip().str.upper().isin(["CONFIRMED","FALSE POSITIVE"])].copy()

kepler_labeled_df["label"] = (
    kepler_labeled_df["koi_disposition"]
    .str.strip().str.upper()
    .map({"CONFIRMED": 1, "FALSE POSITIVE": 0})
)

kepler_labeled_df.drop(["koi_disposition"], axis=1, inplace=True)

In [22]:
kepler_labeled_df.shape

(7585, 38)
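
As a small illustration of why the disposition strings are normalized before mapping, the sketch below applies the same `strip`/`upper`/`map` chain to a toy column. The messy spellings are invented for the example; I have not checked whether the real file contains any.

```python
import pandas as pd

# Toy dispositions with invented whitespace and casing problems.
toy = pd.DataFrame({
    "koi_disposition": ["CONFIRMED", " false positive", "CANDIDATE", "Confirmed "],
})

# Normalize, then map; anything outside the mapping (here CANDIDATE) becomes NaN.
disposition = toy["koi_disposition"].str.strip().str.upper()
label = disposition.map({"CONFIRMED": 1, "FALSE POSITIVE": 0})

# Keep only the rows that received a label and store it as an integer.
labeled = toy[label.notna()].assign(label=label.dropna().astype(int))
n_unlabeled = int(label.isna().sum())
print(labeled["label"].value_counts())
```

The `value_counts()` call at the end is also a quick way to see the class balance of the labeled set before training.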

#### Step 5
As shown above, 8.6% of the rows contain a missing value, so these values will be imputed with the column mean. To avoid data leakage, the imputer is fitted only after the split into train and test sets. This is my own addition: the aforementioned work does not mention handling missing values, aside from the column 'koi_tce_delivname'. The models need inputs without missing values, so I picked the simplest and most intuitive imputation strategy. As proposed by the authors, 70% of the data will be used for training and 30% for testing.

In [None]:
from sklearn.model_selection import GroupShuffleSplit
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

drop = ["label", "kepid"]
feat_cols = [c for c in kepler_labeled_df.columns if c not in drop]
cont = ["koi_period", "koi_time0bk", "koi_impact", "koi_duration", "koi_depth", "koi_prad", "koi_teq", "koi_insol", "koi_model_snr", 
        "koi_steff", "koi_slogg", "koi_srad", "ra", "dec", "koi_kepmag", "koi_period_err1", "koi_period_err2", "koi_time0bk_err1", "koi_time0bk_err2", 
                "koi_impact_err1", "koi_impact_err2", "koi_duration_err1", "koi_duration_err2", 
                "koi_depth_err1", "koi_depth_err2", "koi_prad_err1", "koi_prad_err2", 
                "koi_insol_err1", "koi_insol_err2", "koi_steff_err1", "koi_steff_err2", 
                "koi_slogg_err1", "koi_slogg_err2", "koi_srad_err1", "koi_srad_err2"]

X = kepler_labeled_df[feat_cols]
y = kepler_labeled_df["label"]
groups = kepler_labeled_df["kepid"]

gss = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups=groups))

X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
groups_train = groups.iloc[train_idx]

# Pipeline steps are (name, transformer) pairs; the column selection belongs to the ColumnTransformer below
cont_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler())
])

preprocess = ColumnTransformer([
    ("transform", cont_pipeline, cont)
])
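
The cell above only defines the preprocessing; the leakage argument depends on fitting it on the training rows alone. Below is a minimal sketch of that fit/transform order on a made-up toy table (one feature, five invented stars), assuming the same 70/30 group split. It is not the real Kepler run.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GroupShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data: 10 rows from 5 invented stars, one feature with missing values.
toy = pd.DataFrame({
    "kepid": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "koi_period": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0, 7.0, np.nan, 9.0, 10.0],
})
X = toy[["koi_period"]]
groups = toy["kepid"]

# Split by star so that no kepid appears in both train and test.
gss = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(gss.split(X, groups=groups))
X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Fit on the training rows only, then reuse the learned mean and scale on test.
X_train_t = pipe.fit_transform(X_train)
X_test_t = pipe.transform(X_test)

train_groups = set(groups.iloc[train_idx])
test_groups = set(groups.iloc[test_idx])
```

With the real data, the analogous calls would be `preprocess.fit_transform(X_train)` followed by `preprocess.transform(X_test)`, so the test rows never influence the imputation mean or the scaling parameters.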

