# MLBox

#### Author's description:

MLBox is a powerful Automated Machine Learning python library. It provides the following features:

* Fast reading and distributed data preprocessing/cleaning/formatting
* Highly robust feature selection and leak detection
* Accurate hyper-parameter optimization in high-dimensional space
* State-of-the-art predictive models for classification and regression (Deep Learning, Stacking, LightGBM,...)
* Prediction with models interpretation

#### Useful links:

[home](https://pypi.org/project/mlbox/),
[tutorial](https://www.analyticsvidhya.com/blog/2017/07/mlbox-library-automated-machine-learning/),
[manual](https://mlbox.readthedocs.io/en/latest/),
[git](https://github.com/AxeldeRomblay/MLBox),
[more examples](https://mlbox.readthedocs.io/en/latest/introduction.html)

## Install and import

Note that we use **subprocess.run()** instead of the Jupyter **!** method of running bash commands. Domino can run these notebooks as [jobs](https://support.dominodatalab.com/hc/en-us/articles/360023696651-Jobs) (batch or scheduled), which turns your IPython notebook into an executable script file. All you have to do is ensure the code can be executed as a .py file.
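
The install cells below all repeat the same run-and-print pattern. As a minimal sketch, that pattern could be wrapped in a small helper that also fails loudly when the command exits with an error. The name `run_command` is made up for illustration and is not part of MLBox or Domino, and the command shown is a harmless stand-in for a real install.

```python
import subprocess
import sys


def run_command(args):
    """Run a command and return its decoded stdout; raise if it fails."""
    completed = subprocess.run(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # fold stderr into the captured output
        check=True,                # raise CalledProcessError on a non-zero exit
    )
    return completed.stdout.decode("utf-8")


# Stand-in command so the sketch runs anywhere. An install might look like:
# run_command([sys.executable, "-m", "pip", "install", "jsonschema==2.6"])
output = run_command([sys.executable, "-c", "print('hello from a subprocess')"])
print(output)
```

Because the helper is plain Python, it behaves the same in a notebook cell and in a .py script run as a job.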

In [1]:
import subprocess
#get the right version of jsonschema
completed = subprocess.run(['sudo', 'pip', 'install', 'jsonschema==2.6'], \
                           stdout=subprocess.PIPE,)
print(completed.stdout.decode('utf-8'))

Collecting jsonschema==2.6
  Downloading jsonschema-2.6.0-py2.py3-none-any.whl (39 kB)
Installing collected packages: jsonschema
  Attempting uninstall: jsonschema
    Found existing installation: jsonschema 3.2.0
    Uninstalling jsonschema-3.2.0:
      Successfully uninstalled jsonschema-3.2.0
Successfully installed jsonschema-2.6.0



In [2]:
completed = subprocess.run(['sudo', 'pip', 'install', 'mlbox'], \
                           stdout=subprocess.PIPE,)
print(completed.stdout.decode('utf-8'))

Collecting mlbox
  Downloading mlbox-0.8.2.tar.gz (30 kB)
Collecting numpy==1.17.0
  Downloading numpy-1.17.0-cp36-cp36m-manylinux1_x86_64.whl (20.4 MB)
Collecting scipy==1.3.0
  Downloading scipy-1.3.0-cp36-cp36m-manylinux1_x86_64.whl (25.2 MB)
Collecting matplotlib==3.0.3
  Downloading matplotlib-3.0.3-cp36-cp36m-manylinux1_x86_64.whl (13.0 MB)
Collecting hyperopt==0.1.2
  Downloading hyperopt-0.1.2-py3-none-any.whl (115 kB)
Collecting Keras==2.2.4
  Downloading Keras-2.2.4-py2.py3-none-any.whl (312 kB)
Collecting pandas==0.25.0
  Downloading pandas-0.25.0-cp36-cp36m-manylinux1_x86_64.whl (10.5 MB)
Collecting joblib==0.13.2
  Downloading joblib-0.13.2-py2.py3-none-any.whl (278 kB)
Collecting tensorflow==1.14.0
  Downloading tensorflow-1.14.0-cp36-cp36m-manylinux1_x86_64.whl (109.2 MB)
Collecting lightgbm==2.2.3
  Downloading lightgbm-2.2.3-py2.py3-none-manylinux1_x86_64.whl (1.2 MB)
Collecting tables==3.5.2
  Downloading tables-3.5.2-cp36-cp36m-manylinux1_x86_64.whl (4.3 MB)
Collecti

#### The MLBox main package contains three sub-packages: preprocessing, optimisation and prediction. They are aimed, respectively, at reading and preprocessing data, testing or optimising a wide range of learners, and predicting the target on a test dataset.

In [3]:
from mlbox.preprocessing import *
from mlbox.optimisation import *
from mlbox.prediction import *
import numpy as np
import pandas as pd
import sklearn

Using TensorFlow backend.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
  np_resource = np.dtype([("resource", np.ubyte, 1)])


In [4]:
#these notes were written for v 0.8.2
completed = subprocess.run(['pip', 'show', 'mlbox'], \
                           stdout=subprocess.PIPE,)
print(completed.stdout.decode('utf-8'))

Name: mlbox
Version: 0.8.2
Summary: A powerful Automated Machine Learning python library.
Home-page: https://github.com/AxeldeRomblay/mlbox
Author: Axel ARONIO DE ROMBLAY
Author-email: axelderomblay@gmail.com
License: BSD-3
Location: /usr/local/anaconda/lib/python3.6/site-packages
Requires: matplotlib, tables, Keras, lightgbm, scipy, xlrd, numpy, scikit-learn, hyperopt, joblib, pandas, tensorflow
Required-by: 



In [5]:
#clearing the joblib file as recommended by MLBox authors
completed = subprocess.run(['rm', '-rf', '../results/joblib'], \
                           stdout=subprocess.PIPE,)
print('clearing the joblib file as recommended by MLBox authors')
print(completed.stdout.decode('utf-8'))

clearing the joblib file as recommended by MLBox authors
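
A portable alternative to shelling out to `rm -rf` is `shutil.rmtree`. The sketch below is only an illustration: the helper name `clear_cache` is hypothetical, and it runs against a throwaway temporary directory rather than the real `../results/joblib`.

```python
import shutil
import tempfile
from pathlib import Path


def clear_cache(cache_dir):
    """Delete cache_dir and everything under it; do nothing if it is absent."""
    shutil.rmtree(cache_dir, ignore_errors=True)


# Demonstrate on a throwaway directory rather than the real ../results/joblib
workspace = Path(tempfile.mkdtemp())
fake_cache = workspace / "results" / "joblib"
fake_cache.mkdir(parents=True)
(fake_cache / "dummy.pkl").write_text("cached")

clear_cache(fake_cache)
cache_exists_after = fake_cache.exists()
clear_cache(fake_cache)  # a second call is a no-op, like rm -rf
shutil.rmtree(workspace)  # clean up the throwaway directory
print(cache_exists_after)
```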



## A few pointers to keep in mind

#### Importing data
MLBox seems to prefer CSV files. Otherwise you have to build your own data dictionary. The dictionary structure is not overly complicated, but building it by hand introduces another chance for syntax or type errors. It is probably wise to just use CSV if saving and loading the data as CSV is not too expensive.

#### Train & test
MLBox has a function called **train_test_split()**. It does not behave like the scikit-learn function of the same name, and it can take a little getting used to. More on that below.

#### Documentation
The MLBox documentation is high-level, and putting it into practice is more difficult. I could not find anything on deep learning.

## Heart Disease

#### Load the heart disease dataset

#### A note on importing data
CSV files for the two datasets in this project are saved at **/mnt/data/raw/**

#### A note on the train & test function
MLBox's **train_test_split()** does not behave like the scikit-learn function of the same name, and it can take a little getting used to. It helps to imagine that the authors of MLBox built it as a tool for Kaggle competitions: the training set needs to have the target **y** in it, and the test set should not. You're on your own for measuring accuracy against the test set, as MLBox assumes you'll learn the real answers later from an external test set that is not part of the MLBox flow.

#### A note on categorical fields
MLBox tries to infer which columns are categorical. From what I can tell, it only looks at the data type when doing so, which is a little annoying. Below, I had to take the extra step of mapping numeric values to text for each of the numerically coded categorical columns so that MLBox would treat them as categorical.

In [6]:
'''
/mnt/data/raw/heart.csv

attribute documentation:
      age: age in years
      sex: sex (1 = male; 0 = female)
      cp: chest pain type
        -- Value 1: typical angina
        -- Value 2: atypical angina
        -- Value 3: non-anginal pain
        -- Value 4: asymptomatic
     trestbps: resting blood pressure (in mm Hg on admission to the 
        hospital)
     chol: serum cholesterol in mg/dl
     fbs: (fasting blood sugar > 120 mg/dl)  (1 = true; 0 = false)
     restecg: resting electrocardiographic results
        -- Value 0: normal
        -- Value 1: having ST-T wave abnormality (T wave inversions and/or ST 
                    elevation or depression of > 0.05 mV)
        -- Value 2: showing probable or definite left ventricular hypertrophy
                    by Estes' criteria
     thalach: maximum heart rate achieved
     exang: exercise induced angina (1 = yes; 0 = no)
     oldpeak: ST depression induced by exercise relative to rest
     slope: the slope of the peak exercise ST segment
        -- Value 1: upsloping
        -- Value 2: flat
        -- Value 3: downsloping
     ca: number of major vessels (0-3) colored by fluoroscopy
     thal: 
         3 = normal; 
         6 = fixed defect; 
         7 = reversible defect
     target: diagnosis of heart disease (angiographic disease status)
        -- Value 0: < 50% diameter narrowing
        -- Value 1: > 50% diameter narrowing
 '''

#column names
names = ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang', \
         'oldpeak','slope','ca','thal','target']

#load data from Domino project directory
hd_data = pd.read_csv("/mnt/data/raw/heart.csv", header=None, names=names)

#in case some data comes in as string
#convert to numeric and coerce errors to NaN
for col in hd_data.columns:  # iterate over every column
    hd_data[col] = pd.to_numeric(hd_data[col], errors='coerce')
    
#drop nulls
#hd_data.dropna(inplace=True)
 
#function to force non-numeric data for categorical columns
def force_non_numeric(data, cols):
    for c in cols:
        data[c] = 'text_' + data[c].map(str)  
    return data

cat_cols = ['cp', 'restecg', 'slope', 'ca', 'thal']
hd_data = force_non_numeric(hd_data, cat_cols)

#create MLBox random samples for train and test
hd_data_train = hd_data.sample(frac=0.7, replace=False, random_state=1)
hd_data_test = hd_data[~hd_data.isin(hd_data_train)].dropna()
hd_data_test_wo_target = hd_data_test.drop('target', axis=1)

hd_data_train.to_csv('../data/processed/hd_data_train.csv', index=False)
hd_data_test.to_csv('../data/processed/hd_data_test.csv', index=False)
hd_data_test_wo_target.to_csv('../data/processed/hd_data_test_wo_target.csv', index=False)
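
The `isin(...)`/`dropna()` idiom above masks every non-held-out cell to NaN before dropping rows, which upcasts integer columns to float and would also drop any held-out row that legitimately contains a missing value. As a hedged alternative, the sketch below splits on index labels instead. The toy frame and its values are made up and only stand in for `hd_data`.

```python
import pandas as pd

# Toy frame standing in for hd_data; the values are made up
df = pd.DataFrame({"age": [55, 58, 48, 60, 50, 41, 63, 37, 44, 52],
                   "target": [0, 1, 0, 1, 1, 0, 1, 0, 1, 0]})

train = df.sample(frac=0.7, replace=False, random_state=1)
# Every row whose index label was not sampled becomes the holdout set
test = df.drop(train.index)
test_wo_target = test.drop(columns="target")

print(len(train), len(test))
print(test["age"].dtype)
```

Switching the notebook to this form would change the dtypes written to the CSV files (for example, integer ids would stay integers), so the outputs shown in later cells would differ slightly.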

In [7]:
#the list of paths to your train datasets and test datasets
paths_hd = ["../data/processed/hd_data_train.csv", \
         "../data/processed/hd_data_test_wo_target.csv"]

#the name of the target you try to predict (classification or regression)
target_hd = "target"

#### Process the data

Pass the training set (with the target) and the test set (without the target) to the **train_test_split()** function.

Use **to_path** to keep your world organized. In my project I want everything in the results directory, which sits at the same level of the hierarchy as the code directory. Python uses the code directory as its working directory, so we use **../results**.

Note that after adding text to the numerically coded categorical columns, MLBox now recognizes them as categorical.

In [8]:
#to read and preprocess your files
mlb_data_hd = Reader(sep=",", to_path = '../results').train_test_split(paths_hd, target_hd)


reading csv : hd_data_train.csv ...
cleaning data ...
CPU time: 0.08578824996948242 seconds

reading csv : hd_data_test_wo_target.csv ...
cleaning data ...
CPU time: 0.019309282302856445 seconds

> Number of common features : 13

gathering and crunching for train and test datasets ...
reindexing for train and test datasets ...
dropping training duplicates ...
dropping constant variables on training set ...

> Number of categorical features: 5
> Number of numerical features: 8
> Number of training samples : 212
> Number of test samples : 92

> You have no missing values on train set...

> Task : classification
1.0    110
0.0    102
Name: target, dtype: int64

encoding target ...


#### Another processing note

For some reason I kept getting the first row of the test set as NA with the hd data. I couldn't find the source of the error, so I just delete that bad row.

In [9]:
mlb_data_hd['test'].drop(mlb_data_hd['test'].index[:1], inplace=True)
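
One way an all-NaN row can arise is worth sketching, with the loud caveat that it is an assumption and has not been confirmed for this dataset: if heart.csv begins with a header line, `header=None` would read that line as data, and `pd.to_numeric(..., errors='coerce')` would then turn it into NaN. The two-row CSV below is made up purely to show the mechanism.

```python
import io

import pandas as pd

# Made-up two-row CSV with a header line, standing in for heart.csv
csv_text = "age,target\n55,1\n60,0\n"
names = ["age", "target"]

# header=None treats the header line as data, so every column is read as text
with_header_as_data = pd.read_csv(io.StringIO(csv_text), header=None, names=names)
for col in with_header_as_data.columns:
    with_header_as_data[col] = pd.to_numeric(with_header_as_data[col], errors="coerce")

# header=0 consumes the header line and applies `names` instead
header_consumed = pd.read_csv(io.StringIO(csv_text), header=0, names=names)

print(with_header_as_data)
print(header_consumed)
```

If that is what happened here, loading with `header=0` (as the breast cancer cell does) would avoid the extra row, at the cost of changing the row counts reported above.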

#### Last processing note

After building the dictionary, we process the data as below with the MLBox feature that is advertised as automatically dropping ids and [drifting variables](https://github.com/AxeldeRomblay/MLBox/blob/master/docs/webinars/features.pdf) between the train and test datasets. In practice, I have found that it does not automatically drop ids. The source code only seems to detect drift, which randomly generated id fields do not exhibit.
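
The point about ids can be illustrated with a toy drift measure. This is not MLBox's implementation; it is a minimal sketch that assumes drift is scored by how well a single feature separates train rows from test rows, here as 2 * |AUC - 0.5| with the raw feature value used as the score. The function name `drift_score` and the id ranges are made up for illustration.

```python
import random


def drift_score(train_values, test_values):
    """Return 2 * |AUC - 0.5|, using the raw value as the train-vs-test score."""
    wins = 0.0
    for t in test_values:
        for r in train_values:
            if t > r:
                wins += 1.0
            elif t == r:
                wins += 0.5
    auc = wins / (len(test_values) * len(train_values))
    return 2 * abs(auc - 0.5)


rng = random.Random(0)

# Randomly assigned ids: train and test draw from the same distribution
ids = list(range(1000))
rng.shuffle(ids)
random_id_drift = drift_score(ids[:700], ids[700:])

# Sequential ids: every test id is larger than every train id
sequential_id_drift = drift_score(list(range(700)), list(range(700, 1000)))

print(round(random_id_drift, 3), sequential_id_drift)
```

Randomly assigned ids score close to zero while ids that grow over time score 1.0, which fits the observation that a drift-based filter will not remove a random id column.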

In [10]:
#drop IDs and useless columns
mlb_data_hd = Drift_thresholder(to_path='../results').fit_transform(mlb_data_hd)


computing drifts ...
CPU time: 0.27324676513671875 seconds

> Top 10 drifts

('ca', 0.14203810044663223)
('thal', 0.12483137362136532)
('sex', 0.10579710144927534)
('cp', 0.10506562756357685)
('age', 0.08330826725002272)
('slope', 0.06349922523015206)
('thalach', 0.056660741956066074)
('restecg', 0.04395679518731188)
('exang', 0.04227964634035186)
('fbs', 0.03881141190411075)

> Deleted variables : []
> Drift coefficients dumped into directory : ../results


In [11]:
#take a quick look at how to reference data from the dictionary
mlb_data_hd['train'].head()

    age        ca   chol        cp  exang  fbs  oldpeak   restecg  sex     slope      thal  thalach  trestbps
0  55.0  text_0.0  217.0  text_0.0    1.0  0.0      5.6  text_1.0  1.0  text_0.0  text_3.0    111.0     140.0
1  58.0  text_4.0  220.0  text_1.0    0.0  0.0      0.4  text_1.0  1.0  text_1.0  text_3.0    144.0     125.0
2  48.0  text_2.0  256.0  text_0.0    1.0  1.0      0.0  text_0.0  1.0  text_2.0  text_3.0    150.0     130.0
3  60.0  text_2.0  206.0  text_0.0    1.0  0.0      2.4  text_0.0  1.0  text_1.0  text_3.0    132.0     130.0
4  50.0  text_0.0  243.0  text_0.0    0.0  0.0      2.6  text_0.0  1.0  text_1.0  text_3.0    128.0     150.0


#### Build the modeling routine

#### Defining the search criteria

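MLBox gives you good control over which modeling algorithms and parameter settings to try.

You define a search-space dictionary and pass it, along with the data dictionary, to the **optimise()** method of an **Optimiser** object, which returns the best hyper-parameters it found.

Then you pass those best hyper-parameters and the data dictionary to the **fit_predict()** method of a **Predictor** object.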

In [12]:
space = {

        'ne__numerical_strategy' : {"space" : [0, 'mean']},

        'ce__strategy' : {"space" : ["label_encoding", "random_projection", \
                                     "entity_embedding"]},

        'fs__strategy' : {"space" : ["variance", "rf_feature_importance"]},
        'fs__threshold': {"search" : "choice", "space" : [0.1, 0.2, 0.3]},

        'est__strategy' : {"space" : ["LightGBM", "RandomForest", "ExtraTrees",\
                                      "Linear"]},
        'est__max_depth' : {"search" : "choice", "space" : [5,10,20]},
        'est__subsample' : {"search" : "uniform", "space" : [0.6,0.7]}

        }
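
The key prefixes in the space dictionary appear to map onto pipeline steps. Judging by the labels MLBox prints during optimisation (NA ENCODER, CA ENCODER, FEATURE SELECTOR, ESTIMATOR), `ne` is the NA encoder, `ce` the categorical encoder, `fs` the feature selector and `est` the estimator. The sketch below only regroups the dictionary by prefix to make that structure visible. The helper `group_by_step` is hypothetical and not part of MLBox.

```python
from collections import defaultdict

space = {
    'ne__numerical_strategy': {"space": [0, 'mean']},
    'ce__strategy': {"space": ["label_encoding", "random_projection",
                               "entity_embedding"]},
    'fs__strategy': {"space": ["variance", "rf_feature_importance"]},
    'fs__threshold': {"search": "choice", "space": [0.1, 0.2, 0.3]},
    'est__strategy': {"space": ["LightGBM", "RandomForest", "ExtraTrees",
                                "Linear"]},
    'est__max_depth': {"search": "choice", "space": [5, 10, 20]},
    'est__subsample': {"search": "uniform", "space": [0.6, 0.7]},
}


def group_by_step(space):
    """Split 'step__param' keys into {step: {param: candidate values}}."""
    grouped = defaultdict(dict)
    for key, spec in space.items():
        step, param = key.split("__", 1)
        grouped[step][param] = spec["space"]
    return dict(grouped)


grouped = group_by_step(space)
for step, params in grouped.items():
    print(step, params)
```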

In [13]:
%%time

best_hd = Optimiser(to_path = '../results').optimise(space, mlb_data_hd, max_evals = 3)



##################################################### testing hyper-parameters... #####################################################
>>> NA ENCODER :{'numerical_strategy': 'mean', 'categorical_strategy': '<NULL>'}
>>> CA ENCODER :{'strategy': 'label_encoding'}
>>> FEATURE SELECTOR :{'strategy': 'variance', 'threshold': 0.2}
>>> ESTIMATOR :{'strategy': 'Linear', 'C': 1.0, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'ovr', 'n_jobs': -1, 'penalty': 'l2', 'random_state': 0, 'solver': 'liblinear', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}
MEAN SCORE : neg_log_loss = -0.4189799522513985
VARIANCE : 0.0480604807331996 (fold 1 = -0.37091947151819893, fold 2 = -0.46704043298459813)
CPU time: 0.09725165367126465 seconds
##################################################### testing hyper-parameters... #####################################################
>>> NA ENCODER :{'numerical_strategy': 'mean

MEAN SCORE : neg_log_loss = -0.37655102265691165
VARIANCE : 0.01802405451128755 (fold 1 = -0.35852696814562407, fold 2 = -0.3945750771681992)
CPU time: 3.934457540512085 seconds
##################################################### testing hyper-parameters... #####################################################
>>> NA ENCODER :{'numerical_strategy': 'mean', 'categorical_strategy': '<NULL>'}
>>> CA ENCODER :{'strategy': 'label_encoding'}
>>> FEATURE SELECTOR :{'strategy': 'rf_feature_importance', 'threshold': 0.2}
>>> ESTIMATOR :{'strategy': 'LightGBM', 'max_depth': 20, 'subsample': 0.6623959196948388, 'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 0.8, 'importance_type': 'split', 'learning_rate': 0.05, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 500, 'n_jobs': -1, 'num_leaves': 31, 'objective': None, 'random_state': None, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'silent': True, 'subsample_for_bin': 200000, 'subsample_freq'
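
The MEAN SCORE and VARIANCE lines in the log can be reproduced from the fold scores. The score is a negative log loss, so values closer to zero are better. With two folds, the number labelled VARIANCE matches the population standard deviation of the fold scores (half the gap between them) rather than a variance. That is an observation from the printed numbers, not a statement about MLBox internals.

```python
import statistics

# Fold scores copied from the first trial in the log above
fold_scores = [-0.37091947151819893, -0.46704043298459813]

mean_score = statistics.mean(fold_scores)
spread = statistics.pstdev(fold_scores)  # population standard deviation

print(mean_score)
print(spread)
```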

In [14]:
Predictor(to_path='../results').fit_predict(best_hd,mlb_data_hd)

fitting the pipeline ...
CPU time: 2.187006950378418 seconds

> Feature importances dumped into directory : ../results

predicting ...
CPU time: 0.10484910011291504 seconds

> Overview on predictions : 

       0.0     1.0  target_predicted
1   0.6450  0.3550                 0
2   0.3750  0.6250                 1
3   0.0525  0.9475                 1
4   0.3275  0.6725                 1
5   0.1225  0.8775                 1
6   0.0925  0.9075                 1
7   0.1900  0.8100                 1
8   0.1500  0.8500                 1
9   0.0300  0.9700                 1
10  0.8475  0.1525                 0

dumping predictions into directory : ../results ...


<mlbox.prediction.predictor.Predictor at 0x7fca87820438>

## Breast Cancer

#### Load the breast cancer dataset

In [15]:
'''
Attribute Information:

1) ID number 
2) Diagnosis (M = malignant, B = benign) 
3-32) 

Ten real-valued features are computed for each cell nucleus: 

a) radius (mean of distances from center to points on the perimeter) 
b) texture (standard deviation of gray-scale values) 
c) perimeter 
d) area 
e) smoothness (local variation in radius lengths) 
f) compactness (perimeter^2 / area - 1.0) 
g) concavity (severity of concave portions of the contour) 
h) concave points (number of concave portions of the contour) 
i) symmetry 
j) fractal dimension ("coastline approximation" - 1)
'''

#column names
names = ['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', \
         'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', \
         'concave_points_mean', 'symmetry_mean', 'fractal_dimension_mean', \
         'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', \
         'compactness_se', 'concavity_se', 'concave_points_se', 'symmetry_se', \
         'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', \
         'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', \
         'concave_points_worst', 'symmetry_worst', 'fractal_dimension_worst']

#load data from Domino project directory
bc_data = pd.read_csv("../data/raw/breast_cancer.csv", index_col=False, header=0, names=names)

#create MLBox random samples for train and test
bc_data_train = bc_data.sample(frac=0.7, replace=False, random_state=1)
bc_data_test = bc_data[~bc_data.isin(bc_data_train)].dropna()
bc_data_test_wo_target = bc_data_test.drop('diagnosis', axis=1)

bc_data_train.to_csv('../data/processed/bc_data_train.csv', index=False)
bc_data_test.to_csv('../data/processed/bc_data_test.csv', index=False)
bc_data_test_wo_target.to_csv('../data/processed/bc_data_test_wo_target.csv', index=False)

In [16]:
bc_data_test_wo_target.head()

Unnamed: 0,id,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave_points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave_points_worst,symmetry_worst,fractal_dimension_worst
2,43006131.0,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
7,70354776.0,13.71,20.83,90.2,577.9,0.1189,0.1645,0.09366,0.05985,0.2196,...,17.06,28.14,110.6,897.0,0.1654,0.3682,0.2678,0.1556,0.3196,0.1151
10,99396622.0,16.02,23.24,102.7,797.8,0.08206,0.06669,0.03299,0.03323,0.1528,...,19.19,33.88,123.8,1150.0,0.1181,0.1551,0.1459,0.09975,0.2948,0.08452
15,58923522.0,14.54,27.54,96.73,658.8,0.1139,0.1595,0.1639,0.07364,0.2303,...,17.46,37.13,124.1,943.2,0.1678,0.6577,0.7026,0.1712,0.4218,0.1341
20,75772743.0,13.08,15.71,85.63,520.0,0.1075,0.127,0.04568,0.0311,0.1967,...,14.5,20.49,96.09,630.5,0.1312,0.2776,0.189,0.07283,0.3184,0.08183


In [17]:
#the list of paths to your train datasets and test datasets
paths_bc = ["../data/processed/bc_data_train.csv", \
         "../data/processed/bc_data_test_wo_target.csv"]

#the name of the target you try to predict (classification or regression)
target_bc = "diagnosis"

#### Process the data

In [18]:
#to read and preprocess your files
mlb_data_bc = Reader(sep=",", to_path = '../results').train_test_split(paths_bc, target_bc)


reading csv : bc_data_train.csv ...
cleaning data ...
CPU time: 0.039121389389038086 seconds

reading csv : bc_data_test_wo_target.csv ...
cleaning data ...
CPU time: 0.03113102912902832 seconds

> Number of common features : 31

gathering and crunching for train and test datasets ...
reindexing for train and test datasets ...
dropping training duplicates ...
dropping constant variables on training set ...

> Number of categorical features: 0
> Number of numerical features: 31
> Number of training samples : 398
> Number of test samples : 171

> You have no missing values on train set...

> Task : classification
B    249
M    149
Name: diagnosis, dtype: int64

encoding target ...


In [19]:
#drop IDs and useless columns
mlb_data_bc = Drift_thresholder(to_path='../results').fit_transform(mlb_data_bc)


computing drifts ...
CPU time: 0.4951331615447998 seconds

> Top 10 drifts

('radius_mean', 0.18824698045631716)
('concavity_mean', 0.11795262220816793)
('smoothness_se', 0.1034725611642342)
('area_mean', 0.09656627872605172)
('fractal_dimension_mean', 0.09407124541998657)
('concave_points_mean', 0.09227498642322418)
('id', 0.08611319250149507)
('texture_se', 0.0780733352123133)
('texture_worst', 0.07724532374595272)
('perimeter_worst', 0.07285779100701872)

> Deleted variables : []
> Drift coefficients dumped into directory : ../results


#### Optimise the space and fit the model

In [20]:
%%time

best_bc = Optimiser(to_path = '../results').optimise(space, mlb_data_bc, max_evals = 3)



##################################################### testing hyper-parameters... #####################################################
>>> NA ENCODER :{'numerical_strategy': 'mean', 'categorical_strategy': '<NULL>'}
>>> CA ENCODER :{'strategy': 'random_projection'}
>>> FEATURE SELECTOR :{'strategy': 'variance', 'threshold': 0.3}
>>> ESTIMATOR :{'strategy': 'ExtraTrees', 'max_depth': 5, 'bootstrap': True, 'class_weight': None, 'criterion': 'gini', 'max_features': 'sqrt', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 400, 'n_jobs': -1, 'oob_score': False, 'random_state': 0, 'verbose': 0, 'warm_start': False}
MEAN SCORE : neg_log_loss = -0.16601760645184827
VARIANCE : 0.000573284414337516 (fold 1 = -0.16544432203751075, fold 2 = -0.16659089086618578)
CPU time: 0.9064445495605469 seconds
##################################################### testing hyper-parameters... #####################################################
>>> NA ENCODER :{'numerical_strategy': 'mean', 'categorical_strategy': '<NULL>'}
>>> CA ENCODER :{'strategy': 'random_projection'}
>>> FEATURE SELECTOR :{'strategy': 'rf_feature_importance', 'threshold': 0.2}
>>> ESTIMATOR :{'strategy': 'LightGBM', 'max_depth': 10, 'subsample': 0.6567713433778793, 'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 0.8, 'importance_type': 'split', 'learning_rate': 0.05, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 500, 'n_jobs': -1, 'num_leaves': 31, 'objective': None, 'random_state': None, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'silent': True, 'subsample_for_bin': 200000, 'subsample







MEAN SCORE : neg_log_loss = -0.24533295927739152
VARIANCE : 0.06295567487418158 (fold 1 = -0.18237728440320994, fold 2 = -0.3082886341515731)
CPU time: 2.574834108352661 seconds
##################################################### testing hyper-parameters... #####################################################
>>> NA ENCODER :{'numerical_strategy': 'mean', 'categorical_strategy': '<NULL>'}
>>> CA ENCODER :{'strategy': 'random_projection'}
>>> FEATURE SELECTOR :{'strategy': 'rf_feature_importance', 'threshold': 0.2}
>>> ESTIMATOR :{'strategy': 'ExtraTrees', 'max_depth': 20, 'bootstrap': True, 'class_weight': None, 'criterion': 'gini', 'max_features': 'sqrt', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 400, 'n_jobs': -1, 'oob_score': False, 'random_state': 0, 'verbose': 0, 'warm_start': False}
MEAN SCORE : neg_log_loss = -0.14113468171029198
VARIANCE : 0.0021550275308824945 (fold 1 = -0.14328970924117446, fold 2 = -0.13897965417940947)
CPU time: 0.9344687461853027 seconds
100%|██████████| 3/3 [00:04<00:00,  1.51s/it, best loss: 0.14113468171029198]


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ BEST HYPER-PARAMETERS ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

{'ce__strategy': 'random_projection', 'est__max_depth': 20, 'est__strategy': 'ExtraTrees', 'est__subsample': 0.6950926812280834, 'fs__strategy': 'rf_feature_importance', 'fs__threshold': 0.2, 'ne__numerical_strategy': 'mean'}
CPU times: user 4.42 s, sys: 44 ms, total: 4.46 s
Wall time: 4.64 s


In [21]:
Predictor(to_path='../results').fit_predict(best_bc,mlb_data_bc)

fitting the pipeline ...
CPU time: 0.5904123783111572 seconds

> Feature importances dumped into directory : ../results

predicting ...
CPU time: 0.048383474349975586 seconds

> Overview on predictions : 

        B       M diagnosis_predicted
0  0.0000  1.0000                   M
1  0.1950  0.8050                   M
2  0.3325  0.6675                   M
3  0.0375  0.9625                   M
4  0.9900  0.0100                   B
5  0.0825  0.9175                   M
6  0.0075  0.9925                   M
7  0.0500  0.9500                   M
8  0.0025  0.9975                   M
9  0.2825  0.7175                   M

dumping predictions into directory : ../results ...


<mlbox.prediction.predictor.Predictor at 0x7fc9f8a83198>

## Save to Domino Stats File

Saving stats to this file [allows Domino to track and trend them in the Experiment Manager](https://support.dominodatalab.com/hc/en-us/articles/204348169-Diagnostic-statistics-with-dominostats-json) when this notebook is run as a batch or scheduled job.

In [22]:
#this predictions file is the output of the Predictor fit_predict() call above
bc_pred = pd.read_csv('../results/diagnosis_predictions.csv')
y_bc_pred = bc_pred['diagnosis_predicted']

#these are the answers from the file stored in the project
bc_test = pd.read_csv('../data/processed/bc_data_test.csv')
y_bc_test = bc_test['diagnosis']

#this predictions file is the output of the Predictor fit_predict() call above
hd_pred = pd.read_csv('../results/target_predictions.csv')
y_hd_pred = hd_pred['target_predicted']

#these are the answers from the file stored in the project
hd_test = pd.read_csv('../data/processed/hd_data_test.csv')
y_hd_test = hd_test['target']

In [23]:
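import json

#import the metric explicitly; a bare "import sklearn" does not guarantee sklearn.metrics is loaded
from sklearn.metrics import accuracy_score

#compare the saved predictions with the held-out answers
hd_acc = accuracy_score(y_hd_test, y_hd_pred)
bc_acc = accuracy_score(y_bc_test, y_bc_pred)

#write the stats where Domino looks for them
with open('../dominostats.json', 'w') as f:
    json.dump({"HD_ACC": hd_acc, "BC_ACC": bc_acc}, f)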

#### Show it here in the notebook as well

In [24]:
print('bc ', bc_acc)
print('hd ', hd_acc)

bc  0.9824561403508771
hd  0.7802197802197802
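
For reference, accuracy here is just the fraction of rows where the predicted label equals the true label, matched by position. The sketch below spells that out without scikit-learn and adds a length check, since predictions and answers must list the same rows in the same order. The function is a hypothetical stand-in and the labels are made up, not taken from either dataset.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    y_true, y_pred = list(y_true), list(y_pred)
    if len(y_true) != len(y_pred):
        raise ValueError(
            "length mismatch: %d true labels vs %d predictions"
            % (len(y_true), len(y_pred))
        )
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Made-up labels, not taken from either dataset
toy_true = ["M", "B", "B", "M", "B"]
toy_pred = ["M", "B", "M", "M", "B"]
toy_acc = accuracy(toy_true, toy_pred)
print(toy_acc)
```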
