# Exoplanets - Data Modelling

In this notebook we model the data provided in the dataset and evaluate the results. After implementing the data-loading functions, we use two models to compare how different algorithms classify the dataset.


In [None]:
import numpy as np
import datetime, os
import matplotlib.pyplot as plt

# custom code
from utils import data_loader_txt, plot_confusion_matrix

### Import data into train and test sets

In [None]:
TRAIN_SET_PATH = "data/Exoplanets/exoTrain.csv"
TEST_SET_PATH = "data/Exoplanets/exoTest.csv"

In [None]:
# define label column
LABEL_COLUMN_INDEX = 0

In [None]:
# loading train set
x_train, y_train = data_loader_txt(path=TRAIN_SET_PATH, label_column_index=LABEL_COLUMN_INDEX)
# loading test set
x_test, y_test = data_loader_txt(path=TEST_SET_PATH, label_column_index=LABEL_COLUMN_INDEX)
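
`data_loader_txt` lives in the project's `utils` module and is not shown in this notebook. As a rough sketch of what such a loader might look like, the hypothetical `load_features_and_labels` below splits the label column from the features with NumPy. It assumes a comma-separated source with one header row and numeric values only; the function name, the header assumption, and the returned shapes are illustrative guesses, not the actual implementation.

```python
import io

import numpy as np


def load_features_and_labels(source, label_column_index=0):
    """Hypothetical stand-in for data_loader_txt (the real helper is not shown).

    Assumes a comma-separated source with a single header row and numeric values.
    Returns (features, labels), where labels has shape (n_samples, 1).
    """
    data = np.loadtxt(source, delimiter=",", skiprows=1, ndmin=2)
    labels = data[:, [label_column_index]]                  # keep 2-D: (n_samples, 1)
    features = np.delete(data, label_column_index, axis=1)  # all remaining columns
    return features, labels


# Tiny in-memory example so the sketch runs without the dataset files.
demo_csv = io.StringIO("label,f1,f2\n1,0.5,0.7\n0,-0.3,-0.1\n")
x_demo, y_demo = load_features_and_labels(demo_csv, label_column_index=0)
print(x_demo.shape, y_demo.shape)
```

Returning the labels as a column vector matches the `np.squeeze(y_train)` calls used later when fitting the baseline model.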

### Baseline model

In [None]:
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix
from sklearn.metrics import recall_score

In [None]:
svc = LinearSVC(C=0.5, max_iter=3000, verbose=0, class_weight='balanced')
print("SVC - baseline training...")
svc.fit(x_train, np.squeeze(y_train))
y_pred = svc.predict(x_test)
print("SVC - training and evaluation completed")

In [None]:
# calculate confusion matrix
svc_cm = confusion_matrix(y_true=np.squeeze(y_test), y_pred=y_pred)

In [None]:
plot_confusion_matrix(svc_cm, ["Non-Exoplanet", "Exoplanet"], normalize=False)
print("Recall score:", recall_score(np.squeeze(y_test), y_pred))
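
For reference, recall for the positive class can be read directly off the confusion matrix as TP / (TP + FN). The sketch below checks this against `recall_score` on made-up labels (not results from this dataset). It assumes the positive class is encoded as 1 and the negative class as 0; the real encoding depends on `data_loader_txt`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels for illustration only: 1 = positive class, 0 = negative class.
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred_demo = np.array([0, 1, 0, 0, 1, 0, 1, 0])

# Rows are true classes, columns are predicted classes.
cm_demo = confusion_matrix(y_true=y_true_demo, y_pred=y_pred_demo)
tn, fp, fn, tp = cm_demo.ravel()

# Recall = share of true positives that the model actually found.
recall_from_cm = tp / (tp + fn)
assert np.isclose(recall_from_cm, recall_score(y_true_demo, y_pred_demo))
print("Recall:", recall_from_cm)
```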

### TensorFlow CNN model

In [None]:
import tensorflow as tf
%load_ext tensorboard

In [None]:
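OUTDIR = "logs"

# %%bash cells only see environment variables, so export the values used by the training job
os.environ["TRAIN_SET_PATH"] = TRAIN_SET_PATH
os.environ["TEST_SET_PATH"] = TEST_SET_PATH
os.environ["OUTDIR"] = OUTDIR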

In [None]:
%%bash
python -m gcp_ai_platform_job.task \
    --train_data_path=${TRAIN_SET_PATH} \
    --eval_data_path=${TEST_SET_PATH} \
    --output_dir=${OUTDIR} \
    --num_epochs=5 \
    --batch_size=32

In [None]:
%tensorboard --logdir logs

### Final notes

Two models have been implemented in this notebook: an SVC and a (small) CNN. The results suggest that the CNN worked better than the SVC. Nevertheless, some remarks are reported below:

- The SVC could improve its performance by working on a smaller set of engineered features.
- The CNN achieved 81% and 100% recall on the train and test sets, respectively. Since the test set is quite small, it might make sense to re-evaluate the results with a different split, e.g. cross-validation.
- No specific hyperparameter optimization (HPO) has been performed. It could improve the results and robustness of both algorithms.
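
The last two remarks can be sketched together: stratified k-fold cross-validation gives a recall estimate that does not hinge on one small test split, and a grid search over `C` is one simple form of HPO. The example below runs on synthetic, imbalanced data from `make_classification` rather than the exoplanet set, and the grid values and fold count are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import LinearSVC

# Synthetic, imbalanced stand-in data (roughly 10% positives); not the exoplanet dataset.
x_demo, y_demo = make_classification(
    n_samples=300, n_features=20, weights=[0.9, 0.1], random_state=0
)

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    estimator=LinearSVC(class_weight="balanced", max_iter=3000),
    param_grid={"C": [0.01, 0.1, 0.5, 1.0]},  # arbitrary illustrative grid
    scoring="recall",
    cv=cv,
)
search.fit(x_demo, y_demo)

print("Best C:", search.best_params_["C"])
print("Mean cross-validated recall:", search.best_score_)
```

With real data, `x_demo` and `y_demo` would be replaced by `x_train` and `np.squeeze(y_train)`, keeping the held-out test set untouched until the final evaluation.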