# Hyperparameter Optimization Example

### Authors: 
- Christian Michelsen (Niels Bohr Institute)

### Date:    
- 13-05-2019 (latest update)

***

This Jupyter Notebook interactively illustrates hyperparameter optimization (HPO). It first shows the most naive, manual approach, then grid search, and finally Bayesian optimization.

This notebook is based on the __[HTRU2 Pulsar dataset](https://archive.ics.uci.edu/ml/datasets/HTRU2)__. The focus of this small example is neither the code itself nor any specific result, but, hopefully, a better understanding of HPO. This is also why we do not describe the code in great detail and simply load the dataset directly from a CSV file; the first part of the code should nevertheless look familiar.

***

First, we import the modules we want to use:

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn import tree
from sklearn.datasets import load_iris, load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz

from graphviz import Source
from IPython.display import SVG, display
from ipywidgets import interactive

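Next, we load the dataset from the CSV file and keep only the four SNR features together with the class label: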

In [5]:
# Column names to assign to the nine columns of the CSV file
columns_names = ['Mean_profile',
                 'STD_profile',
                 'Kurtosis_profile',
                 'Skewness_profile',
                 'Mean_SNR',
                 'STD_SNR',
                 'Kurtosis_SNR',
                 'Skewness_SNR',
                 'Class']

df_unbalanced = pd.read_csv('HTRU_2.csv', names=columns_names)

# Keep only the four SNR features and the class label
profile_columns = ['Mean_profile', 'STD_profile', 'Kurtosis_profile', 'Skewness_profile']
df_unbalanced = df_unbalanced.drop(columns=profile_columns)
columns_names = ['Mean_SNR', 'STD_SNR', 'Kurtosis_SNR', 'Skewness_SNR', 'Class']

df_unbalanced.head(10)

| | Mean_SNR | STD_SNR | Kurtosis_SNR | Skewness_SNR | Class |
|---|---|---|---|---|---|
| 0 | 3.199833 | 19.110426 | 7.975532 | 74.242225 | 0 |
| 1 | 1.677258 | 14.860146 | 10.576487 | 127.39358 | 0 |
| 2 | 3.121237 | 21.744669 | 7.735822 | 63.171909 | 0 |
| 3 | 3.642977 | 20.95928 | 6.896499 | 53.593661 | 0 |
| 4 | 1.17893 | 11.46872 | 14.269573 | 252.567306 | 0 |
| 5 | 1.636288 | 14.545074 | 10.621748 | 131.394004 | 0 |
| 6 | 0.999164 | 9.279612 | 19.20623 | 479.756567 | 0 |
| 7 | 1.220736 | 14.378941 | 13.539456 | 198.236457 | 0 |
| 8 | 2.33194 | 14.486853 | 9.001004 | 107.972506 | 0 |
| 9 | 4.079431 | 24.980418 | 7.39708 | 57.784738 | 0 |


In [6]:
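# Balance the classes by undersampling the background (Class 0)
# down to the number of signal rows (Class 1)
df_signal = df_unbalanced[df_unbalanced['Class'] == 1]
df_background = df_unbalanced[df_unbalanced['Class'] == 0]
df_background = df_background.sample(len(df_signal))

df = pd.concat([df_signal, df_background])

# Save the balanced dataset and read it back in
df.to_csv('Pulsar_data.csv', index=False)
df2 = pd.read_csv('Pulsar_data.csv')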
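
The cell above balances the two classes by undersampling the background to the size of the signal sample. Since `sample` is called without a seed, the selected background rows differ from run to run. The following is a minimal sketch of the same idea on a small synthetic DataFrame, passing `random_state` so that the selection is reproducible and printing the class counts before and after. The toy data and the helper name `undersample_background` are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd


def undersample_background(df, label_col='Class', seed=42):
    """Hypothetical helper: balance df by randomly undersampling class 0."""
    df_signal = df[df[label_col] == 1]
    df_background = df[df[label_col] == 0]
    # Draw as many background rows as there are signal rows (without replacement)
    df_background = df_background.sample(n=len(df_signal), random_state=seed)
    return pd.concat([df_signal, df_background])


# Synthetic stand-in for the pulsar data: 900 background rows, 100 signal rows
rng = np.random.default_rng(0)
df_toy = pd.DataFrame({
    'Mean_SNR': rng.normal(size=1000),
    'Class': np.array([0] * 900 + [1] * 100),
})

df_balanced = undersample_background(df_toy)

# Class counts before and after balancing
print(df_toy['Class'].value_counts().to_dict())
print(df_balanced['Class'].value_counts().to_dict())
```

With a fixed seed, calling the helper twice on the same input returns the same rows, which makes any later comparison of hyperparameters easier to reproduce.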


In [19]:
# Split the balanced dataset into features (X) and target (y)
X = df.drop(columns='Class')
y = df['Class']
feature_names = columns_names[:-1]

print(X.shape)

# Hold out 20% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
X_train.head(10)

(3278, 4)


| | Mean_SNR | STD_SNR | Kurtosis_SNR | Skewness_SNR |
|---|---|---|---|---|
| 2557 | 159.849498 | 76.74001 | -0.575016 | -0.941293 |
| 5648 | 4.243311 | 26.74649 | 7.110978 | 52.701218 |
| 8993 | 0.83612 | 9.872425 | 17.82037 | 395.698556 |
| 99 | 1.748328 | 16.486623 | 10.810393 | 127.733366 |
| 11930 | 147.185619 | 75.29602 | -1.169558 | -0.130999 |
| 1547 | 121.404682 | 47.965569 | 0.663053 | 1.203139 |
| 11225 | 35.209866 | 60.573157 | 1.635995 | 1.609377 |
| 8326 | 199.577759 | 58.656643 | -1.86232 | 2.39187 |
| 741 | 2.114548 | 17.883245 | 9.747462 | 102.956287 |
| 5946 | 1.16388 | 10.3078 | 15.620677 | 327.377459 |
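
The split above is purely random, so the class proportions in the training and test sets can differ slightly from each other. As a minimal sketch on synthetic data (the toy arrays below are assumptions, not the pulsar data), the `stratify` argument of `train_test_split` can be used to keep the class ratio identical in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic balanced dataset: 100 samples, 4 features, 50 samples per class
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 4))
y_toy = np.array([0] * 50 + [1] * 50)

# stratify=y_toy preserves the 50/50 class ratio in both the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)    # sizes of the two parts
print(y_tr.mean(), y_te.mean())  # fraction of class 1 in each part
```

Whether stratification matters in practice depends on the dataset size; for a balanced sample of a few thousand rows the effect is likely small, but it removes one source of run-to-run variation when comparing hyperparameters.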
