# Kaggle Use Case

This notebook reproduces the human-competitiveness use case from the Cardea [paper](https://arxiv.org/abs/2010.00509). It runs the Cardea framework end to end and reports the result of each pipeline.

In [1]:
import pandas as pd
import numpy as np

from collections import defaultdict
from cardea import Cardea
from cardea.modeling import Modeler

The first step is to create a Cardea instance and load the Kaggle data from the `s3` bucket. Once the data is loaded, you can use `cd.es` to view the entityset. Next, we select the prediction problem of interest: predicting whether the patient will show up for an appointment.

Selecting the problem returns the `cutoff` variable. We pass its first 1,000 rows to generate the feature matrix, then shuffle the rows of the result.

In [2]:
cd = Cardea()

cd.load_data_entityset() # load kaggle data from s3 bucket
cutoff = cd.select_problem('MissedAppointmentProblemDefinition') # select the missed appointment problem

# feature engineering
feature_matrix = cd.generate_features(cutoff[:1000])
feature_matrix = feature_matrix.sample(frac=1)

Built 13 features
Elapsed: 00:57 | Remaining: 00:00 | Progress: 100%|██████████| Calculated: 10/10 chunks
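
To make the role of `cutoff` more concrete, here is a minimal pandas sketch of the general idea behind cutoff times: each row carries a timestamp, and features for that row may only use records dated strictly before it. The table and column names (`appointments`, `patient_id`, `scheduled`, `prior_visits`) are hypothetical and are not Cardea's schema, and this is not how Cardea computes its features internally.

```python
import pandas as pd

# Hypothetical appointment log; the column names are illustrative only.
appointments = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "scheduled": pd.to_datetime([
        "2016-04-01", "2016-04-15", "2016-05-02", "2016-04-10", "2016-05-20",
    ]),
})


def prior_visits(row, log):
    """Count the patient's appointments strictly before this row's cutoff time."""
    same_patient = log["patient_id"] == row["patient_id"]
    before_cutoff = log["scheduled"] < row["scheduled"]
    return int((same_patient & before_cutoff).sum())


# Each appointment's own timestamp acts as its cutoff time.
appointments["prior_visits"] = appointments.apply(
    prior_visits, axis=1, log=appointments
)
print(appointments)
```

Restricting every feature to data before the cutoff keeps information from the future (including the outcome being predicted) out of the training matrix.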


Use the following lines of code to prepare the data for the `modeler`. Ideally, this series of transformations should be included in the `pipeline.json`.

In [3]:
# separate the X and y variables

y = list(feature_matrix.pop('label'))
y = np.array(pd.Categorical(pd.Series(y)).codes).reshape(-1, )

X = feature_matrix.values

print("X {}".format(X.shape))
print("y {}".format(y.shape))

X (1000, 74)
y (1000,)
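
The label conversion in the cell above relies on `pd.Categorical(...).codes`, which maps each distinct label to an integer code. A small sketch on made-up labels shows the mapping; the values `'show'` and `'noshow'` are assumptions for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up labels; the real label values in the feature matrix may differ.
labels = ["show", "noshow", "show", "show", "noshow"]

categorical = pd.Categorical(pd.Series(labels))
codes = np.array(categorical.codes).reshape(-1)

# Categories are sorted, so 'noshow' -> 0 and 'show' -> 1 here.
print(list(categorical.categories))
print(codes.tolist())
```

Because the codes follow the sorted category order, it is worth checking `categorical.categories` before reading a predicted `1` as a particular class.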


In [4]:
pipelines = [
    [['sklearn.preprocessing.MinMaxScaler', 'sklearn.naive_bayes.MultinomialNB']],
    [['sklearn.preprocessing.MinMaxScaler', 'sklearn.ensemble.RandomForestClassifier']],
    [['sklearn.preprocessing.MinMaxScaler', 'xgboost.XGBClassifier']],
    [['sklearn.preprocessing.MinMaxScaler', 'sklearn.neighbors.KNeighborsClassifier']],
    [['sklearn.preprocessing.MinMaxScaler', 'sklearn.linear_model.LogisticRegression']],
    [['sklearn.preprocessing.MinMaxScaler', 'sklearn.linear_model.SGDClassifier']],
    [['sklearn.preprocessing.MinMaxScaler', 'sklearn.ensemble.GradientBoostingClassifier']],
    [['sklearn.preprocessing.MinMaxScaler', 'sklearn.naive_bayes.GaussianNB']]
]

problem_type = 'classification'
results = defaultdict(list)

modeler = Modeler()
for pipeline in pipelines:
    print("testing pipeline {}".format(str(pipeline)))
    pipeline_res = modeler.execute_pipeline(X, y, pipeline, problem_type, optimize=True)
    results[str(pipeline)].append(pipeline_res)

testing pipeline [['sklearn.preprocessing.MinMaxScaler', 'sklearn.naive_bayes.MultinomialNB']]

Using TensorFlow backend.

 10%|█         | 1/10 [00:00<00:01,  7.38it/s, best loss: -0.5191862292980809]
 30%|███       | 3/10 [00:00<00:01,  6.79it/s, best loss: -0.535284833175112]
 40%|████      | 4/10 [00:00<00:00,  6.68it/s, best loss: -0.535284833175112]
 70%|███████   | 7/10 [00:01<00:00,  7.23it/s, best loss: -0.535284833175112]
 90%|█████████ | 9/10 [00:01<00:00,  7.86it/s, best loss: -0.535284833175112]
100%|██████████| 10/10 [00:01<00:00,  7.29it/s, best loss: -0.535284833175112]
testing pipeline [['sklearn.preprocessing.MinMaxScaler', 'sklearn.ensemble.RandomForestClassifier']]
  0%|          | 0/10 [00:00<?, ?it/s, best loss: ?]
 10%|█         | 1/10 [00:03<00:27,  3.01s/it, best loss: -0.4237775309486865]
Exception caught fitting MLBlock /Users/sarah/opt/anaconda3/envs/cardea/lib/python3.6/site-packages/mlprimitives/jsons//sklearn.ensemble.RandomForestClassifier#1
Traceback (most recent call last):
  File "/Users/sarah/opt/anaconda3/envs/cardea/lib/python3.6/site-packages/mlblocks/mlpipeline.py", line 221, in fit
    block.fit(**fit_args)
  File "/Users/sarah/opt/anaconda3/envs/cardea/lib/python3.6/site-packages/mlblocks/mlblock.py", line 246, in fit
    getattr(self.instance, self.fit_method)(**fit_args)
  File "/Users/sarah/opt/anaconda3/envs/cardea/lib/python3.6/site-packages/sklearn/ensemble/forest.py", line 291, in fit
    raise ValueError("Out of bag estimation only available"
ValueError: Out of bag estimation only available if bootstrap=True


 20%|██        | 2/10 [00:03<00:17,  2.15s/it, best loss: -0.4237775309486865]
 30%|███       | 3/10 [00:03<00:11,  1.58s/it, best loss: -0.4237775309486865]
 40%|████      | 4/10 [00:05<00:10,  1.79s/it, best loss: -0.4241574538084415]
 50%|█████     | 5/10 [00:09<00:12,  2.42s/it, best loss: -0.42419973004408557]
 60%|██████    | 6/10 [00:10<00:07,  1.96s/it, best loss: -0.42419973004408557]
 70%|███████   | 7/10 [00:13<00:06,  2.33s/it, best loss: -0.42419973004408557]
 80%|████████  | 8/10 [00:17<00:05,  2.69s/it, best loss: -0.42419973004408557]
 90%|█████████ | 9/10 [00:19<00:02,  2.64s/it, best loss: -0.42419973004408557]
100%|██████████| 10/10 [00:19<00:00,  1.99s/it, best loss: -0.42419973004408557]
testing pipeline [['sklearn.preprocessing.MinMaxScaler', 'xgboost.XGBClassifier']]
  0%|          | 0/10 [00:00<?, ?it/s, best loss: ?]
 10%|█         | 1/10 [00:34<05:08, 34.25s/it, best loss: -0.207208067479257]
 20%|██        | 2/10 [00:47<03:42, 27.86s/it, best loss: -0.5155771260962522]
 30%|███       | 3/10 [01:06<02:56, 25.20s/it, best loss: -0.5576581484802451]
 40%|████      | 4/10 [01:11<01:54, 19.09s/it, best loss: -0.5576581484802451]
 50%|█████     | 5/10 [01:17<01:16, 15.34s/it, best loss: -0.5576581484802451]
 60%|██████    | 6/10 [01:23<00:49, 12.40s/it, best loss: -0.5576581484802451]
 70%|███████   | 7/10 [01:50<00:50, 16.88s/it, best loss: -0.5576581484802451]
 80%|████████  | 8/10 [02:07<00:33, 16.86s/it, best loss: -0.5576581484802451]
 90%|█████████ | 9/10 [02:11<00:13, 13.05s/it, best loss: -0.5576581484802451]
100%|██████████| 10/10 [02:33<00:00, 15.38s/it, best loss: -0.5576581484802451]
testing pipeline [['sklearn.preprocessing.MinMaxScaler', 'sklearn.neighbors.KNeighborsClassifier']]
  0%|          | 0/10 [00:00<?, ?it/s, best loss: ?]
 10%|█         | 1/10 [00:01<00:11,  1.24s/it, best loss: -0.5267674826010318]
 20%|██        | 2/10 [00:02<00:09,  1.24s/it, best loss: -0.5403807204718707]
 30%|███       | 3/10 [00:03<00:08,  1.25s/it, best loss: -0.5403807204718707]
 40%|████      | 4/10 [00:05<00:07,  1.25s/it, best loss: -0.5403807204718707]
 50%|█████     | 5/10 [00:06<00:06,  1.28s/it, best loss: -0.5403807204718707]
 60%|██████    | 6/10 [00:07<00:04,  1.13s/it, best loss: -0.5403807204718707]
 70%|███████   | 7/10 [00:08<00:03,  1.16s/it, best loss: -0.5451116362245729]
 80%|████████  | 8/10 [00:09<00:02,  1.17s/it, best loss: -0.5451116362245729]
 90%|█████████ | 9/10 [00:10<00:01,  1.20s/it, best loss: -0.5451116362245729]
100%|██████████| 10/10 [00:11<00:00,  1.12s/it, best loss: -0.5486468134980206]
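
The `ValueError` logged during the random forest trials above is raised by scikit-learn itself: out-of-bag scoring needs bootstrap sampling, so a configuration with `oob_score=True` and `bootstrap=False` cannot be fit. The sketch below reproduces the error on synthetic data. That the tuner sampled exactly this combination is an inference from the error message, not something the log states, and the data and variable names here are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; unrelated to the Kaggle dataset.
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)

# Out-of-bag scoring without bootstrap sampling is an invalid combination.
invalid = RandomForestClassifier(n_estimators=50, bootstrap=False, oob_score=True)
try:
    invalid.fit(X_demo, y_demo)
    error_message = None
except ValueError as err:
    error_message = str(err)
print(error_message)

# The same model fits once bootstrap sampling is enabled.
valid = RandomForestClassifier(
    n_estimators=50, bootstrap=True, oob_score=True, random_state=0
)
valid.fit(X_demo, y_demo)
print(round(valid.oob_score_, 2))
```

A failed trial of this kind only discards one sampled configuration; the search continues with the remaining trials, as the progress bars show.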







`results` now holds the k-fold predictions of each pipeline. We compute accuracy and macro F1 on every fold and average the scores across the folds to summarize each pipeline.

In [5]:
from sklearn.metrics import accuracy_score, f1_score

for pipeline in pipelines:
    accuracy = []
    f1 = []
    for i in range(0, 10):
        y_test = results[str(pipeline)][0]['pipeline0']['folds'][str(i)]['Actual']
        y_pred = results[str(pipeline)][0]['pipeline0']['folds'][str(i)]['predicted']

        accuracy.append(accuracy_score(y_test, y_pred))
        f1.append(f1_score(y_test, y_pred, average='macro'))
        
    print(str(pipeline))
    print("Accuracy score {:.2f}".format(np.mean(accuracy)))
    print("F1 Macro score {:.2f}".format(np.mean(f1)))

[['sklearn.preprocessing.MinMaxScaler', 'sklearn.naive_bayes.MultinomialNB']]
Accuracy score 0.71
F1 Macro score 0.52
[['sklearn.preprocessing.MinMaxScaler', 'sklearn.ensemble.RandomForestClassifier']]
Accuracy score 0.74
F1 Macro score 0.42
[['sklearn.preprocessing.MinMaxScaler', 'xgboost.XGBClassifier']]
Accuracy score 0.73
F1 Macro score 0.54
[['sklearn.preprocessing.MinMaxScaler', 'sklearn.neighbors.KNeighborsClassifier']]
Accuracy score 0.65
F1 Macro score 0.55
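
The per-fold scoring above can also be reproduced outside Cardea. The following is a minimal sketch, assuming plain scikit-learn, synthetic data, and a `LogisticRegression` model chosen only for illustration; it is not Cardea's implementation. The `folds` dictionary mimics the `'Actual'` and `'predicted'` keys read in the cell above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, imbalanced stand-in for the feature matrix and labels.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.7, 0.3], random_state=0
)

n_folds = 10
splitter = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)

# Store predictions per fold, keyed like the `folds` entries read above.
folds = {}
for i, (train_idx, test_idx) in enumerate(splitter.split(X_demo, y_demo)):
    # The scaler is refit on each training split, never on the test split.
    model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_demo[train_idx], y_demo[train_idx])
    folds[str(i)] = {
        "Actual": y_demo[test_idx],
        "predicted": model.predict(X_demo[test_idx]),
    }

# Score each fold separately, then average across folds.
accuracy = [accuracy_score(f["Actual"], f["predicted"]) for f in folds.values()]
f1 = [f1_score(f["Actual"], f["predicted"], average="macro") for f in folds.values()]

print("Accuracy score {:.2f}".format(np.mean(accuracy)))
print("F1 Macro score {:.2f}".format(np.mean(f1)))
```

Fitting `MinMaxScaler` inside the loop keeps the test fold's value range out of training, so the averaged scores are not inflated by leakage.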


