# Using an MLPipeline

In this short guide we will go over the basic MLPipeline functionality.

We will:

1. Load a demo dataset.
2. Build a pipeline.
3. Explore the pipeline primitives, inputs and outputs.
4. Fit the pipeline to the dataset.
5. Make predictions using the fitted pipeline.
6. Evaluate the pipeline performance.

## Load the Dataset

The first step is to load the Census dataset using the `load_dataset` function provided by `mlprimitives`.

In [1]:
from mlprimitives.datasets import load_dataset

dataset = load_dataset('census')

This version of the Census dataset is prepared as a Classification (Supervised) Problem,
and has an input matrix `X` and an expected outcome `y` array.

In [2]:
dataset.describe()

Adult Census dataset.

    Predict whether income exceeds $50K/yr based on census data. Also known as "Adult" dataset.

    Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean
    records was extracted using the following conditions: ((AAGE>16) && (AGI>100) &&
    (AFNLWGT>1)&& (HRSWK>0))

    Prediction task is to determine whether a person makes over 50K a year.

    source: "UCI
    sourceURI: "https://archive.ics.uci.edu/ml/datasets/census+income"
    
Data Modality: single_table
Task Type: classification
Task Subtype: binary
Data shape: (32561, 14)
Target shape: (32561,)
Metric: accuracy_score
Extras: 


The data from the dataset can be explored by looking at its `.data` and `.target` attributes.

In [3]:
dataset.data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba


In [4]:
dataset.target[0:5]

array([' <=50K', ' <=50K', ' <=50K', ' <=50K', ' <=50K'], dtype=object)

The dataset can also be split into multiple parts for cross-validation using the `dataset.get_splits` method.

For this demo we will be making only one split, which is equivalent to a simple train/test holdout partitioning.

In [5]:
X_train, X_test, y_train, y_test = dataset.get_splits(1)

In [6]:
X_train.shape

(24420, 14)

In [7]:
X_test.shape

(8141, 14)
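
The two shapes above are consistent with a 75/25 holdout. As a quick sanity check, the arithmetic can be reproduced in plain Python. Note that the 25% test fraction and the round-up rule are assumptions inferred from the shapes shown here, not a documented default of `get_splits`.

```python
import math

n_rows = 32561          # rows in the full dataset, taken from "Data shape" above
test_fraction = 0.25    # assumption: inferred from the shapes, not from the library docs

# Assumption: the test partition is rounded up and the remainder is used for training.
n_test = math.ceil(n_rows * test_fraction)
n_train = n_rows - n_test

print(n_train, n_test)
```

If the printed sizes ever differ from `X_train.shape[0]` and `X_test.shape[0]`, the split proportion in use is not the one assumed here.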

## Build a Pipeline

Once we have the dataset we will build a pipeline that works with it.

In this case, we will be creating a short pipeline that uses the following primitives:

- `ClassEncoder` from `mlprimitives`, which encodes the target variable `y` as integers.
- `CategoricalEncoder` from `mlprimitives`, which encodes all the categorical variables from the feature matrix `X`
  using one-hot encoding.
- `SimpleImputer` from `sklearn`, which imputes any null values that may exist in the feature matrix `X`.
- `XGBClassifier` from `xgboost`, which learns to predict the target variable `y` using the feature matrix `X`.
- `ClassDecoder` from `mlprimitives`, which reverts the `ClassEncoder` transformation to return the original
  target labels instead of integers.

In [8]:
from mlblocks import MLPipeline

primitives = [
    'mlprimitives.custom.preprocessing.ClassEncoder',
    'mlprimitives.custom.feature_extraction.CategoricalEncoder',
    'sklearn.impute.SimpleImputer',
    'xgboost.XGBClassifier',
    'mlprimitives.custom.preprocessing.ClassDecoder'
]
pipeline = MLPipeline(primitives)
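
To make the role of the first and last primitives more concrete, here is a minimal plain Python sketch of encoding labels as integers and decoding them back. It only illustrates the idea: the function names are hypothetical, and the sorted-order mapping is an assumption, not necessarily how `ClassEncoder` and `ClassDecoder` are implemented.

```python
def fit_label_encoder(labels):
    """Map each distinct label to an integer (assumption: in sorted order)."""
    classes = sorted(set(labels))
    return {label: index for index, label in enumerate(classes)}


def encode(labels, mapping):
    """Replace every label with its integer code."""
    return [mapping[label] for label in labels]


def decode(codes, mapping):
    """Invert the mapping to recover the original labels."""
    inverse = {index: label for label, index in mapping.items()}
    return [inverse[code] for code in codes]


y = [' <=50K', ' >50K', ' <=50K', ' <=50K']
mapping = fit_label_encoder(y)
encoded = encode(y, mapping)        # integers that a classifier can be trained on
decoded = decode(encoded, mapping)  # back to the original string labels

print(encoded)
print(decoded)
```

The round trip is what lets the pipeline train on integers while still returning the original string labels in its predictions.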

## Explore the Pipeline

### Primitives

We can see the primitives included in this pipeline by having a look at its `primitives` attribute.

In [9]:
pipeline.primitives

['mlprimitives.custom.preprocessing.ClassEncoder',
 'mlprimitives.custom.feature_extraction.CategoricalEncoder',
 'sklearn.impute.SimpleImputer',
 'xgboost.XGBClassifier',
 'mlprimitives.custom.preprocessing.ClassDecoder']

### Inputs

We can also see the inputs of the pipeline using the `get_inputs` method.

This will traverse the pipeline execution graph and show all the variables that need to be
provided by the user in order to fit this pipeline.

In [10]:
pipeline.get_inputs()

{'X': {'name': 'X', 'type': 'DataFrame'},
 'y': {'name': 'y', 'type': 'ndarray'}}

Alternatively, we can pass the `fit=False` argument, which will give us the variables needed
in order to make predictions.

In [11]:
pipeline.get_inputs(fit=False)

{'X': {'name': 'X', 'type': 'DataFrame'},
 'y': {'name': 'y', 'default': None, 'type': 'ndarray'}}

Note how the `fit` method expects two variables `X` and `y`, while the `predict`
method only needs `X`, as the `y` variable has a default value of `None`.
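
The distinction can also be checked programmatically. The sketch below works on the two dictionaries exactly as printed above and treats any variable without a `default` key as required. The helper name `required_inputs` is hypothetical and is not part of MLBlocks.

```python
def required_inputs(inputs):
    """Return the names of the variables that have no default value."""
    return [name for name, spec in inputs.items() if 'default' not in spec]


# Copied from the output of pipeline.get_inputs()
fit_inputs = {
    'X': {'name': 'X', 'type': 'DataFrame'},
    'y': {'name': 'y', 'type': 'ndarray'},
}

# Copied from the output of pipeline.get_inputs(fit=False)
predict_inputs = {
    'X': {'name': 'X', 'type': 'DataFrame'},
    'y': {'name': 'y', 'default': None, 'type': 'ndarray'},
}

print(required_inputs(fit_inputs))
print(required_inputs(predict_inputs))
```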

### Outputs

Similarly, we can see the outputs that the pipeline will return when used to make predictions.

In [12]:
pipeline.get_outputs()

[{'name': 'y',
  'type': 'ndarray',
  'variable': 'mlprimitives.custom.preprocessing.ClassDecoder#1.y'}]
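
The `variable` entry appears to combine three pieces of information: the primitive name, a numeric suffix after `#`, and the name of the output variable. A small sketch for splitting it apart is shown below. The helper name is hypothetical, and reading the suffix as a counter that tells apart repeated uses of the same primitive is an assumption based on the string shown above.

```python
def split_output_variable(variable):
    """Split 'primitive#counter.variable_name' into its three parts."""
    # The variable name is whatever follows the last dot.
    block_name, _, variable_name = variable.rpartition('.')
    # The block name is the primitive name plus a '#<counter>' suffix.
    primitive, _, counter = block_name.partition('#')
    return primitive, int(counter), variable_name


parts = split_output_variable('mlprimitives.custom.preprocessing.ClassDecoder#1.y')
print(parts)
```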

## Fit the Pipeline to the Dataset

Now that the pipeline is ready and we know its inputs and outputs, we can fit it to the
dataset by passing the training `X` and `y` variables to its `fit` method.

In [13]:
pipeline.fit(X_train, y_train)
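
Conceptually, fitting a pipeline means fitting each step on the output of the previous one, and predicting means passing new data through the already fitted steps in the same order. The toy example below illustrates that idea with two made-up steps. It is not the MLBlocks implementation: every class name here is hypothetical, and real primitives exchange several named variables rather than a single list.

```python
class MeanImputer:
    """Toy step: learns the mean of the known values and fills gaps with it."""

    def fit(self, values):
        known = [v for v in values if v is not None]
        self.mean_ = sum(known) / len(known)

    def produce(self, values):
        return [self.mean_ if v is None else v for v in values]


class Scaler:
    """Toy step: learns the largest absolute value and divides by it."""

    def fit(self, values):
        self.scale_ = max(abs(v) for v in values) or 1.0

    def produce(self, values):
        return [v / self.scale_ for v in values]


class ToyPipeline:
    def __init__(self, steps):
        self.steps = steps

    def fit(self, values):
        # Each step is fitted on the data produced by the steps before it.
        for step in self.steps:
            step.fit(values)
            values = step.produce(values)

    def predict(self, values):
        # No fitting here: the learned state of each step is reused.
        for step in self.steps:
            values = step.produce(values)
        return values


toy = ToyPipeline([MeanImputer(), Scaler()])
toy.fit([2.0, None, 4.0])
result = toy.predict([None, 6.0])
print(result)
```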

## Make Predictions

After the pipeline has finished fitting, we can try to predict the `y_test` array values by
passing the `X_test` matrix to the `pipeline.predict` method.

In [14]:
predictions = pipeline.predict(X_test)

In [15]:
predictions[0:5]

array([' >50K', ' <=50K', ' >50K', ' <=50K', ' <=50K'], dtype=object)

## Evaluate the Pipeline Performance

Now we can compare the predicted array with the actual test array to see how well
our pipeline performed.

This can be done using the `dataset.score` method, which provides a suitable scoring
function for this kind of data and problem.
In this case, the scoring function is the accuracy score, as indicated by the `Metric`
field in the dataset description.

In [16]:
dataset.score(y_test, predictions)

0.8602137329566393
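
For reference, accuracy is simply the fraction of predictions that match the expected labels, so the score above means that roughly 86% of the test rows were classified correctly. The sketch below computes it by hand on a small made-up example. The function name and the toy labels are illustrative only and are not taken from the dataset.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the expected label."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    matches = sum(1 for expected, predicted in zip(y_true, y_pred) if expected == predicted)
    return matches / len(y_true)


# Made-up labels for illustration; these are not rows from the Census dataset.
y_true_toy = [' >50K', ' <=50K', ' <=50K', ' <=50K', ' <=50K']
y_pred_toy = [' >50K', ' <=50K', ' >50K', ' <=50K', ' <=50K']

toy_score = accuracy(y_true_toy, y_pred_toy)
print(toy_score)
```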