# 4. Modeling

This notebook reflects the adaptivity use case in the Cardea [paper](https://arxiv.org/abs/2010.00509). It uses AutoML to model prediction problems from the automatically generated features.

In this notebook, we ingest the feature matrix and produce a model.

In [1]:
import numpy as np
import pandas as pd
from collections import defaultdict

from featuretools.selection import remove_low_information_features

from model_audit import ModelAuditor
from cardea.modeling import Modeler

Using TensorFlow backend.


We use the problem defined in the problem definition and apply several transformations to prepare the data for the modeling component. In the future, these transformations should be part of the `pipeline.json`.

In [2]:
# load the data (any feature matrix)
fm = pd.read_csv("fm.csv", index_col=0)

# problem specifications
problem = 'mortality'
problem_type = 'classification'
scoring_function = 'f1'
minimize_cost = False

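# remove duplicate rows
fm = fm.drop_duplicates()

# drop the column named after the problem, if it is present
fm = fm.drop(columns=[problem], errors='ignore')

# separate the target from the features
y = fm.pop('label')
X = remove_low_information_features(fm)

# fill missing values and one-hot encode categorical columns
X = X.fillna(0)
X = pd.get_dummies(X)

# for the 'los' problem, bin the target into two classes split at 7
if problem == 'los':
    y = np.digitize(y, [y.min(), 7, y.max()+1])

# encode the classes as integer codes
y = pd.Categorical(y).codes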
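
To make the target binning for the `los` case concrete, here is a small sketch on made-up values. It assumes `los` denotes length of stay; the sample values and the names `los_days`, `binned`, and `codes` are hypothetical and do not come from the feature matrix.

```python
import numpy as np
import pandas as pd

# Hypothetical target values for illustration only (not from fm.csv).
los_days = pd.Series([1, 3, 6, 7, 10, 21])

# Same bin edges as in the cell above: [min, 7, max + 1].
bins = [los_days.min(), 7, los_days.max() + 1]
binned = np.digitize(los_days.to_numpy(), bins)

# Values in [min, 7) land in bin 1; values in [7, max] land in bin 2.
print(binned)  # [1 1 1 2 2 2]

# pd.Categorical(...).codes remaps the bin ids to 0-based class labels.
codes = pd.Categorical(binned).codes
print(codes)  # [0 0 0 1 1 1]
```

Adding 1 to the maximum keeps the largest value inside the last bin, because `np.digitize` treats the right edge of each bin as exclusive by default.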

Now we can use Cardea's `Modeler` to train and tune the model. The following are the pipelines we consider in our use case.

In [3]:
pipelines = [
    [['sklearn.preprocessing.MinMaxScaler', 'sklearn.naive_bayes.MultinomialNB']],
    [['sklearn.preprocessing.MinMaxScaler', 'sklearn.ensemble.RandomForestClassifier']],
    [['sklearn.preprocessing.MinMaxScaler', 'xgboost.XGBClassifier']],
    [['sklearn.preprocessing.MinMaxScaler', 'sklearn.neighbors.KNeighborsClassifier']],
    [['sklearn.preprocessing.MinMaxScaler', 'sklearn.linear_model.LogisticRegression']],
    [['sklearn.preprocessing.MinMaxScaler', 'sklearn.linear_model.SGDClassifier']],
    [['sklearn.preprocessing.MinMaxScaler', 'sklearn.ensemble.GradientBoostingClassifier']],
    [['sklearn.preprocessing.MinMaxScaler', 'sklearn.naive_bayes.GaussianNB']]
]
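
Each entry above pairs a `MinMaxScaler` with a classifier. As a rough illustration of what one such pair does, the sketch below chains the same two steps with plain scikit-learn on synthetic data. It does not show how `Modeler` builds or tunes pipelines internally; the data and the names `X_demo`, `y_demo`, and `demo_pipeline` are assumptions made for this example.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data for illustration only; it contains negative values.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 5))
y_demo = (X_demo[:, 0] > 0).astype(int)

# MinMaxScaler maps each feature to [0, 1], which also satisfies
# MultinomialNB's requirement that its inputs be non-negative.
demo_pipeline = make_pipeline(MinMaxScaler(), MultinomialNB())
demo_pipeline.fit(X_demo, y_demo)

predictions = demo_pipeline.predict(X_demo)
print(predictions.shape)  # (100,)
```

Scaling first is what lets a single preprocessing step serve all of the listed classifiers, including the ones that reject negative inputs.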

In [4]:
results = defaultdict(list)

modeler = Modeler()
for pipeline in pipelines:
    print("testing pipeline {}".format(str(pipeline)))
    pipeline_res = modeler.execute_pipeline(np.array(X), np.array(y), pipeline, problem_type, optimize=True,
                                            minimize_cost=minimize_cost, scoring=scoring_function, max_evals=10)

    results[str(pipeline)].append(pipeline_res)

testing pipeline [['sklearn.preprocessing.MinMaxScaler', 'sklearn.naive_bayes.MultinomialNB']]
testing pipeline [['sklearn.preprocessing.MinMaxScaler', 'sklearn.ensemble.RandomForestClassifier']]
testing pipeline [['sklearn.preprocessing.MinMaxScaler', 'xgboost.XGBClassifier']]
testing pipeline [['sklearn.preprocessing.MinMaxScaler', 'sklearn.neighbors.KNeighborsClassifier']]
Now we view the results. `results` holds the k-fold predictions for each pipeline. We compute each metric per fold and average it across the folds to score each pipeline.

In [5]:
from sklearn.metrics import accuracy_score, f1_score

for pipeline in pipelines:
    accuracy = []
    f1 = []
    # per-fold predictions from the single run stored for this pipeline
    folds = results[str(pipeline)][0]['pipeline0']['folds']
    for i in range(10):  # assumes 10 folds
        y_test = folds[str(i)]['Actual']
        y_pred = folds[str(i)]['predicted']

        # you can insert any metric here
        accuracy.append(accuracy_score(y_test, y_pred))
        f1.append(f1_score(y_test, y_pred, average='macro'))

    print(str(pipeline))
    print("Accuracy score {:.2f}".format(np.mean(accuracy)))
    print("F1 Macro score {:.2f}".format(np.mean(f1)))

[['sklearn.preprocessing.MinMaxScaler', 'sklearn.naive_bayes.MultinomialNB']]
Accuracy score 0.83
F1 Macro score 0.56
[['sklearn.preprocessing.MinMaxScaler', 'sklearn.ensemble.RandomForestClassifier']]
Accuracy score 0.88
F1 Macro score 0.61
[['sklearn.preprocessing.MinMaxScaler', 'xgboost.XGBClassifier']]
Accuracy score 0.93
F1 Macro score 0.63
[['sklearn.preprocessing.MinMaxScaler', 'sklearn.neighbors.KNeighborsClassifier']]
Accuracy score 0.88
F1 Macro score 0.57
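
The per-fold averaging above can be wrapped in a small helper that also reports the spread across folds. This is a sketch under stated assumptions: `mock_result` is a made-up two-fold structure that only mimics the keys read in the previous cell (`pipeline0`, `folds`, `Actual`, `predicted`), and `summarize_folds` and `macro_f1` are hypothetical names, not part of Cardea.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up two-fold result that mimics the nested layout read above.
mock_result = {
    'pipeline0': {
        'folds': {
            '0': {'Actual': [0, 0, 0, 1], 'predicted': [0, 0, 1, 1]},
            '1': {'Actual': [0, 0, 1, 1], 'predicted': [0, 0, 1, 1]},
        }
    }
}


def summarize_folds(result, metric):
    """Return the mean and standard deviation of a metric over all folds."""
    folds = result['pipeline0']['folds']
    scores = [metric(f['Actual'], f['predicted']) for f in folds.values()]
    return np.mean(scores), np.std(scores)


def macro_f1(y_true, y_pred):
    """Macro-averaged F1, matching the averaging used in the cell above."""
    return f1_score(y_true, y_pred, average='macro')


acc_mean, acc_std = summarize_folds(mock_result, accuracy_score)
f1_mean, f1_std = summarize_folds(mock_result, macro_f1)
print("Accuracy {:.2f} +/- {:.2f}".format(acc_mean, acc_std))
print("F1 Macro {:.2f} +/- {:.2f}".format(f1_mean, f1_std))
```

Iterating over `folds.values()` avoids hard-coding the number of folds, and the standard deviation gives a sense of how stable each pipeline's score is from fold to fold.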
