# Modeler Primitives Example

This notebook shows how to change the primitive list to run Cardea end-to-end. The first part uses the `Cardea` class to load a dataset, select a problem, and featurize the data. To learn more about the `Cardea` class, see the `appointment_noshow_tutorial`, which walks through the series of transformations applied to the data to create a prediction problem.

The latter half of this tutorial, which is the main focus, shows how to split the dataset into its `X` and `y` variables and then perform a prediction task.

In [1]:
import pandas as pd

from mlblocks import MLPipeline

from cardea import Cardea
from cardea.modeling import Modeler

In [2]:
# initialize components
cd = Cardea()
modeler = Modeler()

# load data
cd.load_data_entityset()

# select problem
cutoff = cd.select_problem('MissedAppointmentProblemDefinition')

# featurize
feature_matrix = cd.generate_features(cutoff[:1000])

Built 13 features
Elapsed: 00:42 | Remaining: 00:00 | Progress: 100%|██████████| Calculated: 10/10 chunks


At this point, we have generated our feature matrix, including the target variable. From here, we wish to prune some of the generated columns and then split our feature matrix into its `X` and `y` variables.

To accomplish this, we use the following primitives. Each primitive is responsible for a single task, which in this case is indicated by its name.

In [3]:
primitives = [
    "cardea.primitives.processing.prune_cols",
    "cardea.primitives.processing.split_feature_matrix",
]
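
For intuition, the two steps named by these primitives can be sketched with plain pandas. This is only an illustration, not Cardea's implementation: the toy feature matrix, the `label` target column, and the pruning rule (dropping columns with a single unique value) are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical feature matrix; the column names and the "label" target
# are illustrative assumptions, not Cardea's actual output.
toy_matrix = pd.DataFrame({
    "age": [34, 51, 27, 45],
    "constant_flag": [1, 1, 1, 1],
    "prior_visits": [0, 3, 1, 2],
    "label": [0, 1, 0, 1],
})
target = "label"

# Pruning (assumed rule): drop non-target columns with a single unique value.
uninformative = [
    col for col in toy_matrix.columns
    if col != target and toy_matrix[col].nunique() <= 1
]
pruned = toy_matrix.drop(columns=uninformative)

# Splitting: separate the target column from the remaining features.
y_toy = pruned[target]
X_toy = pruned.drop(columns=[target])

print("X shape: ", X_toy.shape)
print("y shape: ", y_toy.shape)
```

The real primitives may apply different rules, but the overall shape of the result is the same: a feature table and a one-dimensional target.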

Then we use `MLPipeline` from mlblocks to transform our feature matrix.

In [4]:
pipeline = MLPipeline(primitives)
X, y = pipeline.predict(feature_matrix, problem='MissedAppointment')

print("X shape: ", X.shape)
print("y shape: ", y.shape)

Using TensorFlow backend.


X shape:  (1000, 65)
y shape:  (1000,)


Now we can create our model using the same idea of primitives and transformations. In this example, I am using a random forest classifier preceded by normalizing the data to the range `[0, 1]`.

You might be wondering why we have two lists of primitives (the previous one and the one below). Ideally, we should have one, but in the current implementation of Cardea, the modeler expects two variables, `X` and `y`, which are both part of the `feature_matrix` variable before splitting.

In [5]:
primitives = [
    "sklearn.preprocessing.MinMaxScaler",
    "sklearn.ensemble.RandomForestClassifier"
]

In [6]:
result = modeler.execute_pipeline(data_frame=X,
                                  target=y,
                                  primitives_list=[primitives],
                                  problem_type='classification')
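
To see what the two modeling primitives amount to outside Cardea, here is a minimal sketch that chains the same two estimators with a plain scikit-learn `Pipeline`. It is not how `Modeler` evaluates pipelines internally: the synthetic data from `make_classification`, the single train/test split, and the hyperparameters are assumptions chosen only to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for X and y (assumption: not the Cardea feature matrix).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Scale each feature to [0, 1], then fit a random forest on the scaled data.
model = Pipeline([
    ("scaler", MinMaxScaler()),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=0)),
])
model.fit(X_train, y_train)

# Mean accuracy on the held-out split.
accuracy = model.score(X_test, y_test)
print("accuracy:", accuracy)
```

Wrapping the scaler and the classifier in one pipeline means the scaling parameters are learned from the training split only, which is the usual reason for ordering the primitives this way.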

