# High-Level API

## Overview

The High-Level API makes it easy to rapidly:

* Prepare samples.
* Hypertune/ train queues of models. 
* Feed appropriate data/ parameters into those models.
* Evaluate model performance with metrics and plots.

It does so by wrapping and bundling together the methods of the [Low-Level API](api_low_level.html). The table below shows how the high-level entities abstract the low-level entities. While this abstraction eliminates many steps to enable rapid model prototyping, it comes at the cost of customization.

| High-level object | Groups together or creates the following objects                                                                  |
|:-----------------:|:-----------------------------------------------------------------------------------------------------------------:|
| `Pipeline`        | Dataset, File, Image, Tabular, Label, Featureset, Splitset, Foldset, Folds, Labelcoder, Encoderset, Featurecoders. |
| `Algorithm`       | Functions to build, train, predict, and evaluate a machine learning model.                                        |
| `Experiment`      | Algorithm, Hyperparamset, Hyperparamcombos, Queue, Job, Jobset, Result.                                           |

## Prerequisites

If you've already completed the instructions on the [Installation](installation.html) page, then let's get started.

In [2]:
import aiqc
from aiqc import datum

## 1. Pipeline

### a) Tabular Dataset

Tabular/ delimited/ flat-file `Dataset.Tabular` can be created from either Pandas DataFrames or flat files (CSV/ TSV or Parquet).

Let's grab one of AIQC's built-in datasets from the `datum` module that we imported above. This module is described in the 'Built-In Examples - Datasets' section of the documentation.

In [3]:
df = datum.to_pandas(name='iris.tsv')

The `Pipeline` process starts with raw data. A `Dataset` object is generated from that data and prepared for training based on the parameters the user provides to the `Pipeline.Tabular.make` method. To get started, set `df_or_path` equal to the dataframe we just fetched. It's the only argument that's actually required.

Import any scikit-learn encoders that you want to use to encode labels and/ or features. Any encoders that you pass in will need to be instantiated with the attributes you want them to use.

> Reference the `Encoderset` section of the low-level API for more detail on how to include/ exclude specific `Featureset` columns by name/dtype. The `feature_encoders` argument seen below takes a list of dictionaries as input, where each dictionary contains the `**kwargs` for a `Featurecoder`.
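As a rough illustration of the column filtering described in the note above, the sketch below matches each encoder's filter against the columns that are still unclaimed. It is plain Python, not AIQC code: the `match_columns` helper and the sample dtypes are hypothetical, and only the `dtypes` and `columns` keys mirror the keyword names used on this page.

```python
def match_columns(column_dtypes, feature_encoders):
    """Assign each column to the first encoder spec whose filter matches it.

    Hypothetical helper for illustration only; AIQC's own matching logic
    lives in its `Encoderset`/`Featurecoder` classes and may differ.
    """
    remaining = dict(column_dtypes)  # columns not yet claimed by an encoder
    assignments = []
    for spec in feature_encoders:
        wanted_dtypes = spec.get("dtypes") or []
        wanted_columns = spec.get("columns") or []
        matched = [
            name for name, dtype in remaining.items()
            if dtype in wanted_dtypes or name in wanted_columns
        ]
        for name in matched:
            del remaining[name]
        assignments.append(matched)
    return assignments, list(remaining)


# Made-up column dtypes, loosely modelled on a table with one text column.
column_dtypes = {
    "sepal_length": "float64",
    "sepal_width": "float64",
    "site": "object",
}
feature_encoders = [
    {"dtypes": ["float64"]},  # e.g. a scaler for the numeric columns
    {"columns": ["site"]},    # e.g. a categorical encoder selected by name
]

assignments, leftover = match_columns(column_dtypes, feature_encoders)
print(assignments)
print(leftover)
```

An empty `leftover` list corresponds to the situation where every feature column has an encoder associated with it.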

In [4]:
from sklearn.preprocessing import OneHotEncoder, PowerTransformer, StandardScaler, FunctionTransformer

Rather than wrangling your data with many lines of data science code, just set the arguments below and AIQC takes care of the rest: stratification (including continuous dtypes), validation splits, cross-validation folds, and dtype/column-specific encoders to be applied on-read.

> Don't use `fold_count` unless your (total sample count / fold_count) still gives you an accurate representation of your sample population. You can try it with the 'iris_10x.tsv' datum.
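Stratifying on a continuous column usually means grouping its values into bins first, so that each split draws proportionally from every bin. The sketch below shows one way to do equal-frequency binning in plain Python. The `equal_frequency_bins` helper and the sample values are hypothetical, and how AIQC bins values internally when `bin_count` is set is not shown on this page.

```python
def equal_frequency_bins(values, bin_count):
    """Return a bin index (0 .. bin_count - 1) for each value.

    Values are ranked, then the ranks are cut into `bin_count` groups of
    near-equal size. Illustrative only; not AIQC's implementation.
    """
    if bin_count < 1:
        raise ValueError("bin_count must be at least 1")
    # Indices of `values`, ordered from the smallest value to the largest.
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        bins[index] = rank * bin_count // len(values)
    return bins


# Made-up continuous labels, e.g. prices in thousands.
prices = [310, 120, 455, 200, 980, 150, 610, 275]
bins = equal_frequency_bins(prices, bin_count=4)
print(bins)
```

The resulting bin indices can then be treated like class labels when drawing stratified splits.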

In [5]:
splitset = aiqc.Pipeline.Tabular.make(
    # --- Data source ---
    df_or_path = df
    , dtype = None

    # --- Label preprocessing ---
    , label_column = 'species'
    , label_interpolater = None
    , label_encoder = dict(sklearn_preprocess=OneHotEncoder(sparse=False))

    # --- Feature preprocessing ---
    , feature_cols_excluded = None
    , feature_interpolaters = None
    , feature_window = None
    , feature_encoders = dict(
        sklearn_preprocess = PowerTransformer(method='box-cox', copy=False)
        , dtypes = ['float64']
    )
    , feature_reshape_indices = None

    # --- Stratification ---
    , size_test = 0.24
    , size_validation = 0.12
    , fold_count = None
    , bin_count = None
)


___/ featurecoder_index: 0 \_________

=> The column(s) below matched your filter(s) featurecoder filters.

['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

=> Done. All feature column(s) have featurecoder(s) associated with them.
No more Featurecoders can be added to this Encoderset.
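To make the stratification arguments above concrete, the sketch below works out approximate split sizes with `size_test = 0.24` and `size_validation = 0.12`, and then how few samples each fold would hold. It is simple arithmetic under stated assumptions: the data is taken to be the standard 150-row iris table with 3 balanced classes, the `split_sizes` helper is hypothetical, rounding inside AIQC may differ by a sample or so, and folds are assumed to be cut from the training split.

```python
def split_sizes(sample_count, size_test, size_validation):
    """Approximate number of samples in the train/validation/test splits."""
    test = round(sample_count * size_test)
    validation = round(sample_count * size_validation)
    train = sample_count - test - validation
    return {"train": train, "validation": validation, "test": test}


sizes = split_sizes(sample_count=150, size_test=0.24, size_validation=0.12)
print(sizes)

# Assumption: folds are carved out of the training split.
fold_count = 5
per_fold = sizes["train"] // fold_count
per_class_per_fold = per_fold // 3  # 3 balanced classes assumed
print(per_fold, per_class_per_fold)
```

With only a handful of samples per class in each fold, a single unusual sample can swing that fold's metrics, which is the situation the note on `fold_count` warns about.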



### b) Sequence Dataset

The sequence dataset is a 3-dimensional structure that holds multiple observations per sample, which enables time series analysis.

In order to perform *supervised learning* on sequence files, you'll need both a `Dataset.Sequence` and a `Dataset.Tabular`:

* `Dataset.Sequence` is created from a homogeneous 3D NumPy array.

* `Dataset.Tabular` is created as seen in the section above. It must contain 1 row per sample.

* Then a `Splitset` is constructed using:
  * The `Label` of the `Dataset.Tabular`.
  * The `Featureset` of the `Dataset.Sequence`.

In [6]:
df = datum.to_pandas('epilepsy.parquet')

In [7]:
label_df = df[['seizure']]

In [8]:
seq_ndarray3D = df.drop(columns=['seizure']).to_numpy().reshape(1000,178,1)
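The reshape above turns each row of readings into a sequence of single-feature timesteps. The sketch below repeats the idea on a tiny made-up array with NumPy, so the resulting `(samples, timesteps, features)` layout is easy to inspect. The array values and variable names are illustrative only.

```python
import numpy as np

# 4 samples, each a flat row of 6 consecutive readings.
flat = np.arange(24, dtype="float64").reshape(4, 6)

# Add a trailing axis: 6 timesteps per sample, 1 feature per timestep.
seq = flat.reshape(4, 6, 1)
print(seq.shape)

# One label per sample, mirroring the "1 row per sample" rule.
labels = np.array([0, 1, 0, 1])
assert labels.shape[0] == seq.shape[0]

# The readings of the first sample are unchanged, just nested one level deeper.
print(seq[0, :, 0])
```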

In [10]:
seq_splitset = aiqc.Pipeline.Sequence.make(
    # --- Label preprocessing ---
    label_df_or_path = label_df
    , label_dtype = None
    , label_column = 'seizure'
    , label_interpolater = None
    , label_encoder = None
    
    # --- Feature preprocessing ---
    , feature_ndarray3D_or_npyPath = seq_ndarray3D
    , feature_dtype = None
    , feature_cols_excluded = None
    , feature_interpolaters = None
    , feature_window = None
    , feature_encoders = [dict(
        sklearn_preprocess=StandardScaler(), columns='0'
    )]
    , feature_reshape_indices = None

    # --- Stratification ---
    , size_test = 0.22
    , size_validation = 0.12
    , fold_count = None
    , bin_count = None
)

⏱️ Ingesting Sequences 🧬: 100%|████████████████| 1000/1000 [00:05<00:00, 183.89it/s]



=> Info - System overriding user input to set `sklearn_preprocess.copy=False`.
   This saves memory when concatenating the output of many encoders.


___/ featurecoder_index: 0 \_________

=> The column(s) below matched your filter(s) featurecoder filters.

['0']

=> Done. All feature column(s) have featurecoder(s) associated with them.
No more Featurecoders can be added to this Encoderset.



### c) Image Dataset

AIQC also supports image data and convolutional analysis. 

In order to perform *supervised learning* on image files, you'll need both a `Dataset.Image` and a `Dataset.Tabular`:

* `Dataset.Image` can be created from either a folder of images or a list of URLs. The Pillow library is used to normalize images ingested into AIQC. Each image must be the same size (dimensions) and mode (colorscale).

* `Dataset.Tabular` is created as seen in the section above. It must contain 1 row per image.

* Then a `Splitset` is constructed using:
  * The `Label` of the `Dataset.Tabular`.
  * The `Featureset` of the `Dataset.Image`.

Again, we'll use the built-in data found in the `datum` module that we imported above.

In [11]:
df = datum.to_pandas(name='brain_tumor.csv')
image_urls = datum.get_remote_urls(manifest_name='brain_tumor.csv')

In [15]:
img_splitset = aiqc.Pipeline.Image.make(
    # --- Label preprocessing ---
    label_df_or_path = df
    , label_dtype = None
    , label_column = 'status'
    , label_interpolater = None
    , label_encoder = None
    
    # --- Feature preprocessing ---
    , feature_folder_or_urls = image_urls
    , feature_dtype = 'float64'
    , feature_interpolaters = None
    , feature_window = None
    , feature_encoders = dict(
        # Pixel values range from 0 to 255, so we rescale them by dividing by 255.
        sklearn_preprocess = FunctionTransformer(aiqc.div255, inverse_func=aiqc.mult255)
        , dtypes = 'float64'
    )
    , feature_reshape_indices = None

    # --- Stratification ---
    , size_test = 0.12
    , size_validation = 0.18
    , fold_count = None
    , bin_count = None
)

🖼️ Ingesting Images 🖼️: 100%|████████████████████████| 80/80 [00:09<00:00,  8.87it/s]



___/ featurecoder_index: 0 \_________

=> The column(s) below matched your filter(s) featurecoder filters.

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '100', '101', '102', '103', '104', '105', '106', '107', '108', '109', '110', '111', '112', '113', '114', '115', '116', '117', '118', '119']

=> Done. All feature column(s) have featurecoder(s) associated with them.
No more Featurecoders can be added to this Encoderset.
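The encoder in the call above rescales 8-bit pixel values from the 0-255 range to 0-1 and can undo the scaling afterwards. The sketch below shows that round trip with NumPy. `div255_sketch` and `mult255_sketch` are hypothetical stand-ins written for this example; they are not the `aiqc.div255` and `aiqc.mult255` functions themselves.

```python
import numpy as np


def div255_sketch(pixels):
    """Scale 0-255 pixel values down to the 0-1 range."""
    return pixels / 255.0


def mult255_sketch(scaled):
    """Inverse of `div255_sketch`: scale 0-1 values back up to 0-255."""
    return scaled * 255.0


# A made-up 2x3 grayscale image with 8-bit pixel values.
image = np.array([[0, 51, 102], [153, 204, 255]], dtype="float64")

scaled = div255_sketch(image)
restored = mult255_sketch(scaled)

print(scaled.min(), scaled.max())
print(np.allclose(restored, image))
```

Pairing a forward function with its inverse is what allows a `FunctionTransformer` to map encoded values back to the original scale.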



## 2. Experiment

As seen in the [Compatibility Matrix](compatibility.html), the only library currently supported is `Keras`, as it is the most straightforward for entry-level users.

> You can find great examples of machine learning cookbooks on this blog: [MachineLearningMastery.com "Multi-Label Classification"](https://machinelearningmastery.com/multi-label-classification-with-deep-learning/)

In [16]:
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.callbacks import History

### Model Functions

When we define our models, we'll do so by wrapping each phase in the following functions:

* `fn_build` contains the topology/ layers.
* `fn_train` specifies the samples and how the model should run.
* Optional and automatically determined: `fn_optimize`, `fn_predict` and `fn_lose`.

You can name the functions below whatever you want, but do not change their predetermined arguments (e.g. `features_shape`, `**hp`, `model`, etc.). These items are used behind the scenes to pass the appropriate data, parameters, and models into your training jobs.

Because these are functions, we can even treat the topology as a parameter, as demonstrated by the `if (hp['extra_layer'])` line below.

> Put a placeholder anywhere you want to try out different hyperparameters: `hp['<some_variable_name>']`. You'll get a chance to define the hyperparameters in a minute.
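To see how a placeholder can switch part of the topology on or off, the sketch below mimics the shape of a build function without importing Keras. It only returns a list of layer descriptions. The `sketch_topology` name and the tuple format are hypothetical, while the `hp` keys are the ones used in the functions below.

```python
def sketch_topology(features_shape, label_shape, **hp):
    """Describe the layers a build function would add for a given `hp`."""
    layers = [
        ("Dense", features_shape[0]),
        ("Dropout", hp["dropout_size"]),
    ]
    # The extra hidden block only exists when the hyperparameter says so.
    if hp["extra_layer"]:
        layers.append(("Dense", hp["neuron_count"]))
        layers.append(("Dropout", hp["dropout_size"]))
    layers.append(("Dense", label_shape[0]))
    return layers


shallow = sketch_topology((4,), (3,), extra_layer=False, neuron_count=9, dropout_size=0.1)
deep = sketch_topology((4,), (3,), extra_layer=True, neuron_count=9, dropout_size=0.1)
print(len(shallow), len(deep))
```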

#### `fn_build`

In [17]:
def fn_build(features_shape, label_shape, **hp):
    model = Sequential()
    model.add(Dense(units=features_shape[0], activation='relu', kernel_initializer='he_uniform'))
    model.add(Dropout(hp['dropout_size']))
    
    if (hp['extra_layer']):
        model.add(Dense(units=hp['neuron_count'], activation='relu', kernel_initializer='he_uniform'))
        model.add(Dropout(hp['dropout_size']))
    
    model.add(Dense(units=label_shape[0], activation='softmax', name='output'))

    return model

#### `fn_train`

In [18]:
def fn_train(model, loser, optimizer, samples_train, samples_evaluate, **hp):    
    model.compile(
        optimizer=optimizer,
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )
    
    model.fit(
        samples_train["features"]
        , samples_train["labels"]
        , validation_data = (
            samples_evaluate["features"]
            , samples_evaluate["labels"]
        )
        , verbose = 0
        , batch_size = 3
        , epochs = hp['epoch_count']
        , callbacks=[History()]
    )
    return model

> Reference the [low-level API documentation](api_low_level.html#Optional,-callback-to-stop-training-early.) for information on the custom 'early stopping' callbacks AIQC makes available.

### Hyperparameters

The `hyperparameters` below will be automatically fed into the functions above as `**kwargs` via the `**hp` argument we saw earlier.

For example, wherever you see `hp['neuron_count']`, it will pull from the *key:value* pair `"neuron_count": [9, 12]` seen below, so model A will have 9 neurons and model B will have 12 neurons.

In [19]:
hyperparameters = {
    "neuron_count": [9, 12]
    , "extra_layer": [True, False]
    , "dropout_size": [0.10, 0.20]
    , "epoch_count": [50]
    , "learning_rate": [0.01]
}
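One way to picture how a dictionary of lists becomes a set of models is a grid of every possible combination. The sketch below builds that grid with `itertools.product`. Treating the search as an exhaustive grid is an assumption made for illustration, and the `combos` variable is not an AIQC object.

```python
from itertools import product

hyperparameters = {
    "neuron_count": [9, 12],
    "extra_layer": [True, False],
    "dropout_size": [0.10, 0.20],
    "epoch_count": [50],
    "learning_rate": [0.01],
}

# One dictionary per combination, ready to be unpacked as `**hp`.
keys = list(hyperparameters)
combos = [dict(zip(keys, values)) for values in product(*hyperparameters.values())]

print(len(combos))
print(combos[0])
```

Eight combinations, each trained twice with the `repeat_count = 2` setting used further down, would account for the 16 jobs shown in the training progress bar.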

These functions and hyperparameters are then passed into the `Algorithm`, which `Experiment.make()` creates for us below.

The `library` and `analysis_type` help handle the model and its output behind the scenes. Current analysis types include: 'classification_multi', 'classification_binary', and 'regression'.
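As general background on why the analysis type matters, each kind of analysis is conventionally paired with a particular output activation and loss in Keras. The sketch below records those common pairings in a lookup table. It reflects general practice only: the `CONVENTIONS` table and `conventional_setup` helper are hypothetical, and nothing here describes what AIQC selects internally.

```python
# Conventional Keras pairings per kind of analysis (general practice,
# not a statement about what AIQC selects internally).
CONVENTIONS = {
    "classification_multi": {"activation": "softmax", "loss": "categorical_crossentropy"},
    "classification_binary": {"activation": "sigmoid", "loss": "binary_crossentropy"},
    "regression": {"activation": "linear", "loss": "mean_squared_error"},
}


def conventional_setup(analysis_type):
    """Look up the usual output activation and loss for an analysis type."""
    try:
        return CONVENTIONS[analysis_type]
    except KeyError:
        raise ValueError(f"Unknown analysis_type: {analysis_type!r}") from None


print(conventional_setup("classification_multi"))
```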

### `Experiment.make()`

Now it's time to bring together the data and logic into an `Experiment`.

In [20]:
queue = aiqc.Experiment.make(
    # --- Analysis type ---
    library = "keras"
    , analysis_type = "classification_multi"
    
    # --- Model functions ---
    , fn_build = fn_build
    , fn_train = fn_train
    , fn_lose = None #automated.
    , fn_optimize = None #automated.
    , fn_predict = None #automated.    
    
    # --- Training options ---
    , repeat_count = 2
    , hyperparameters = hyperparameters
    , pick_percent = None
    
    # --- Data source ---
    , splitset_id = splitset.id
    , foldset_id = None
    , hide_test = False
)

In [21]:
queue.run_jobs()

🔮 Training Models 🔮: 100%|████████████████████████████████████████| 16/16 [01:23<00:00,  5.24s/it]


---

For more information on visualization of performance metrics, reference the [Visualization & Metrics](visualization.html) documentation.