# Low-Level API

## Prerequisites
If you've already completed the instructions on the **Installation** page, then let's get started.

In [1]:
import aiqc
from aiqc import examples



## Usage

### 1. Ingest a `Dataset`

You can make a dataset from either an in-memory data structure (Pandas DataFrame, NumPy array) or a file (csv, tsv, parquet).

`perform_gzip=True` will compress it: anywhere from ~25% to 90% size reduction.
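As a rough illustration of what gzip does to tabular text, the sketch below compresses an in-memory TSV with Python's standard `gzip` module and restores it. This is not AIQC's internal code; the toy rows and variable names are made up for the example, and the ratio you see depends on how repetitive your data is.

```python
import gzip

# Build a small, repetitive TSV in memory (toy data, for illustration only).
header = "sepal_length\tsepal_width\tspecies\n"
rows = "".join(f"{5.0 + (i % 10) / 10:.1f}\t3.5\t{i % 3}\n" for i in range(1000))
raw_bytes = (header + rows).encode("utf-8")

# Compress, then decompress to confirm the round trip is lossless.
compressed = gzip.compress(raw_bytes)
restored = gzip.decompress(compressed)

reduction = 1 - len(compressed) / len(raw_bytes)
print(f"raw: {len(raw_bytes)} bytes, gzipped: {len(compressed)} bytes")
print(f"size reduction: {reduction:.0%}")
```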

#### Pandas DataFrame

In [3]:
df = examples.demo_file_to_pandas('iris.tsv')

dataset = aiqc.Dataset.from_pandas(
	dataframe = df
	, file_format = 'csv'
	, dtype = None # feeds pd.DataFrame(dtype)
	, perform_gzip = False
	, rename_columns = None # feeds pd.DataFrame(columns)
)

#### NumPy Array

Regular *ndarrays* don't have column names, and I didn't like the API for *structured arrays*, so you have to pass in column names as a list. If you don't, then column names will be numerically assigned in ascending order (zero-based index).

In [4]:
arr =  df.to_numpy()
cols = list(df.columns)

dataset = aiqc.Dataset.from_numpy(
	ndarray = arr
	, file_format = 'parquet'
	, name = 'chunking plants'
	, perform_gzip = True
	, column_names = cols
)

#### File

In [5]:
demo_file_path = examples.get_demo_file_path('iris_10x.tsv')

# we'll keep this one handy for later.
big_dataset = aiqc.Dataset.from_file(
	path = demo_file_path # files must have column names as their first row
	, file_format = 'tsv'
	, perform_gzip = True
)

> The bytes of the data will be stored as a BlobField in the SQLite database file. Storing the data in the database not only (a) provides an entity that we can use to keep track of experiments and link relational data to but also (b) makes the data less mutable than keeping it in the open filesystem.

> You can choose whether or not to gzip-compress the file on import with the boolean `perform_gzip` parameter. Because compression can shrink the data by up to 90%, it lets you store considerably more data on your local machine and also helps you stay under the maximum BlobField size of 2.147 GB. We handle the zipping and unzipping on the fly, so you won't even notice it.

> Optionally, `dtype`, as seen in [`pandas.DataFrame.astype(dtype)`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html), can be specified as either a single type for all columns, or as a dict that maps a specific type to each column name. This encodes features for analysis. We read NumPy into Pandas before persisting it, so `columns` and `dtype` are read directly by `pd.DataFrame()`.

> At this point, the project's support for Parquet is extremely minimal.

> If you leave `name` blank, it will default to a human-readable timestamp with the appropriate file extension (e.g. '2020_10_13-01_28_13_PM.tsv').
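For reference, a name shaped like the example above can be produced with `datetime.strftime`. The helper below is hypothetical and the format string is an assumption inferred from the example name, not taken from AIQC's source.

```python
from datetime import datetime

def default_dataset_name(file_format, now=None):
    """Hypothetical helper: build a timestamp-style name plus file extension."""
    now = now or datetime.now()
    # %I is the 12-hour clock and %p the AM/PM marker (which is locale-dependent).
    return f"{now.strftime('%Y_%m_%d-%I_%M_%S_%p')}.{file_format}"

example_name = default_dataset_name("tsv", now=datetime(2020, 10, 13, 13, 28, 13))
print(example_name)
```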

#### Fetch a `Dataset` with either **Pandas** or **NumPy**.

All of the data-oriented objects in the API have `to_numpy()` and `to_pandas()` methods that accept the following arguments:

* `samples=[]` list of indices to fetch.
* `columns=[]` list of columns to fetch.

Later, we'll see how these arguments allow downstream objects like `Splitset` and `Foldset` to slice up the data.

Implicit IDs

In [6]:
df = dataset.to_pandas()
df.head()

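|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|--------------|-------------|--------------|-------------|---------|
| 0 | 5.1          | 3.5         | 1.4          | 0.2         | 0.0     |
| 1 | 4.9          | 3.0         | 1.4          | 0.2         | 0.0     |
| 2 | 4.7          | 3.2         | 1.3          | 0.2         | 0.0     |
| 3 | 4.6          | 3.1         | 1.5          | 0.2         | 0.0     |
| 4 | 5.0          | 3.6         | 1.4          | 0.2         | 0.0     |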


In [99]:
arr = dataset.to_numpy(
    samples = [0,13,29,79] 
    , columns = ['petal_length', 'petal_width']
)
arr[:4]

array([[1.4, 0.2],
       [1.1, 0.1],
       [1.6, 0.2],
       [3.5, 1. ]])

Explicit IDs

In [100]:
df = aiqc.Dataset.to_pandas(
    id = dataset.id 
    , samples = [0,13,29,79]
    , columns = ['sepal_length', 'sepal_width']
)
df.tail()

|    | sepal_length | sepal_width |
|----|--------------|-------------|
| 0  | 5.1          | 3.5         |
| 13 | 4.3          | 3.0         |
| 29 | 4.7          | 3.2         |
| 79 | 5.7          | 2.6         |


In [9]:
arr = aiqc.Dataset.to_numpy(id=dataset.id)
arr[:4]

array([[5.1, 3.5, 1.4, 0.2, 0. ],
       [4.9, 3. , 1.4, 0.2, 0. ],
       [4.7, 3.2, 1.3, 0.2, 0. ],
       [4.6, 3.1, 1.5, 0.2, 0. ]])

### 2. Select the `Label` column(s).

From a Dataset, pick the column(s) that you want to train against and predict. If you are planning on training an unsupervised model, then you don't need to do this.

Creating a `Label` won't duplicate your data! It simply records the `columns` to be used for supervised learning. 

In [10]:
label_column = 'species'

Implicit IDs

In [11]:
label = dataset.make_label(columns=[label_column])

Explicit IDs

In [12]:
label = aiqc.Label.from_dataset(
	dataset_id=1 # cannot duplicate labels on the same dataset
	, columns=[label_column]
)

> `columns=[label_column]` is a list in case we want to do something with raw OHE/ tensors in the future.

#### Fetch a `Label` with either **Pandas** or **NumPy**.

The `Label` comes in handy when we need to fetch *Y* splits. It accepts a `samples` argument. 

In [13]:
label.to_pandas().tail()

|     | species |
|-----|---------|
| 145 | 2       |
| 146 | 2       |
| 147 | 2       |
| 148 | 2       |
| 149 | 2       |


In [14]:
label.to_numpy(samples=[0,33,66,99,132])[:5]

array([[0],
       [0],
       [1],
       [1],
       [2]])

### 3. Select the `Featureset` column(s).

Creating a Featureset won't duplicate your data! It simply records the `columns` to be used in training. 

There are three ways to define which columns you want to use as features:

- `exclude_columns=[]` e.g. use all columns except the label column.
- `include_columns=[]` e.g. only use these columns that I think are informative.
- Leave both of the above blank and all columns will be used e.g. unsupervised learning.
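The three options above can be sketched in plain Python. The helper name `resolve_feature_columns` is hypothetical, and raising an error when both lists are given is an assumption made for this sketch rather than documented AIQC behavior.

```python
def resolve_feature_columns(all_columns, include_columns=None, exclude_columns=None):
    """Hypothetical helper: return (used, excluded) column lists."""
    if include_columns and exclude_columns:
        # Assumption for this sketch: the two options are mutually exclusive.
        raise ValueError("Provide include_columns or exclude_columns, not both.")
    if include_columns:
        used = [c for c in all_columns if c in include_columns]
    elif exclude_columns:
        used = [c for c in all_columns if c not in exclude_columns]
    else:
        used = list(all_columns)
    excluded = [c for c in all_columns if c not in used]
    return used, excluded

iris_columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

by_exclusion = resolve_feature_columns(iris_columns, exclude_columns=['species'])
by_inclusion = resolve_feature_columns(
    iris_columns, include_columns=['sepal_length', 'petal_length', 'petal_width']
)
everything = resolve_feature_columns(iris_columns)

print(by_exclusion)
print(by_inclusion)
print(everything)
```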

Implicit IDs with `exclude_columns=[]`

In [15]:
featureset = dataset.make_featureset(exclude_columns=[label_column])

Explicit IDs with `include_columns=[]`

In [16]:
include_columns = [
    'sepal_length',
    'petal_length',
    'petal_width'
]

In [17]:
featureset = aiqc.Featureset.from_dataset(
	dataset_id = 1 # cannot duplicate featureset on a dataset
	, include_columns = include_columns
	, exclude_columns = None
)

In [18]:
featureset.columns

['sepal_length', 'petal_length', 'petal_width']

In [19]:
featureset.columns_excluded

['sepal_width', 'species']

#### Fetch a `Featureset` with either **Pandas** or **NumPy**.

In [20]:
featureset.to_numpy()[:4]

array([[5.1, 1.4, 0.2],
       [4.9, 1.4, 0.2],
       [4.7, 1.3, 0.2],
       [4.6, 1.5, 0.2]])

In [21]:
featureset.to_pandas(samples=[0,16,32,64]).tail()

|    | sepal_length | petal_length | petal_width |
|----|--------------|--------------|-------------|
| 0  | 5.1          | 1.4          | 0.2         |
| 16 | 5.4          | 1.3          | 0.4         |
| 32 | 5.2          | 1.5          | 0.1         |
| 64 | 5.6          | 3.6          | 1.3         |


### 4. Slice samples with a `Splitset`.

A `Splitset` divides the samples of the Dataset into the following *splits*:

| Split                 | Description                                                                                                                                                                                             |
|-----------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| train                 | The samples that the model will be trained upon. <br/>Later, we’ll see how we can make cross-folds from our training split. <br/>Unsupervised learning will only have a training split.                 |
| validation (optional) | The samples used for training evaluation. <br/>Ensures that the test set is not revealed to the model during training.                                                                                  |
| test (optional)       | The samples the model has never seen during training. <br/>Used to assess how well the model will perform on unobserved, natural data when it is applied in the real world aka how generalizable it is. |

Again, creating a Splitset won't duplicate your data. It simply records the sample indices (aka rows) to be used in the splits that you specify.

#### Ways to split a Dataset

##### a) Default supervised 70-30 split.

If you only provide a Label, then 70:30 train:test splits will be generated.

In [22]:
splitset = featureset.make_splitset(label_id=label.id)

##### b) Specifying test size.

In [23]:
splitset = featureset.make_splitset(
	label_id = label.id
	, size_test = 0.30
)

##### c) Specifying validation size.

In [24]:
splitset = featureset.make_splitset(
	label_id = label.id
	, size_test = 0.20
	, size_validation = 0.12
)

##### d) Taking the whole dataset as a training split.

In [25]:
splitset_unsupervised = featureset.make_splitset()

> Label-based stratification is used to ensure equally distributed label classes for both categorical and continuous data.
>
> If you want more control over stratification of continuous splits, specify `continuous_bin_count`, the number of bins used for grouping.

##### Sizes

You can verify the actual size of your splits.

In [26]:
splitset.sizes

{'validation': {'percent': 0.12, 'count': 18},
 'test': {'percent': 0.2, 'count': 30},
 'train': {'percent': 0.68, 'count': 102}}
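To see where counts like these come from, here is a minimal plain-Python sketch of a label-stratified split. It is not AIQC's implementation; the function `stratified_split`, its `seed` argument, and the per-class rounding are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, size_test, size_validation=0.0, seed=0):
    """Return {'train': [...], 'validation': [...], 'test': [...]} of sample indices."""
    # Group the sample indices by their label class.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    splits = {'train': [], 'validation': [], 'test': []}
    # Carve the same proportions out of every class so each split stays balanced.
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * size_test)
        n_validation = round(len(indices) * size_validation)
        splits['test'] += indices[:n_test]
        splits['validation'] += indices[n_test:n_test + n_validation]
        splits['train'] += indices[n_test + n_validation:]
    return splits

# 150 samples with 50 per class, mirroring the shape of the iris demo data.
labels = [0] * 50 + [1] * 50 + [2] * 50
splits = stratified_split(labels, size_test=0.20, size_validation=0.12)
sizes = {
    name: {'percent': round(len(idx) / len(labels), 2), 'count': len(idx)}
    for name, idx in splits.items()
}
print(sizes)
```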

#### Fetching a `Splitset` into memory

This is where things start to get interesting.

* Given that there are potentially multiple splits, `Splitset` methods return dictionaries where each entry corresponds with a split.
  * Additionally, each split will contain *features* and, potentially, *labels*.

In [27]:
splitset.to_numpy()['train']['features'][:4]

array([[4.9, 4.5, 1.7],
       [6.4, 5.3, 1.9],
       [5.8, 5.1, 1.9],
       [4.8, 1.6, 0.2]])

In [28]:
splitset.to_pandas()['test']['labels'].head()

|     | species |
|-----|---------|
| 76  | 1       |
| 70  | 1       |
| 116 | 2       |
| 36  | 0       |
| 67  | 1       |


### 5. Optionally, create a `Foldset` for cross-fold validation.

*Reference the [scikit-learn documentation](https://scikit-learn.org/stable/modules/cross_validation.html) to learn more about folding.*

![Cross Folds](../images/cross_fold.png)

We refer to the left-out fold as the `fold_validation` and the remaining training data as the `folds_train_combined`. The sample indices of the validation fold are still recorded.

> Don't use `fold_count` unless each fold (total sample count / `fold_count`) is still large enough to be an accurate representation of your sample population.

> In a scenario where a validation split was specified in the original Splitset, the validation split is not included in the Folds. Only the training data is folded. The implication is that you can have 2 validations in the form of the validation split and the validation fold.

In [29]:
# cross-folding takes many samples, especially when stratified.
# which is why we set aside the 'big_dataset' made from 'iris_10x.tsv' earlier.
big_label = big_dataset.make_label(columns=[label_column])
big_fset = big_dataset.make_featureset(exclude_columns=[label_column])
big_splits = big_fset.make_splitset(
	label_id = big_label.id
	, size_test = 0.30
)

This generates 5 `Fold` objects that belong to the `Foldset`.

In [30]:
foldset = big_splits.make_foldset(fold_count=5)

#### `Fold` objects

For the sake of determining which samples get trained upon, the only thing that matters is the slice of data that gets left out.

We took a slightly simplified approach in that each `Fold` has a dictionary that contains:
* `samples['folds_train_combined']` - all the included folds.
* `samples['fold_validation']` - the fold that got left out.

![cross fold objects](../images/cross_fold_objects.png)
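The structure of those per-fold dictionaries can be sketched as follows. This is an unstratified, plain-Python illustration; `make_folds` and its `seed` argument are hypothetical names, and AIQC's own folding logic may differ.

```python
import random

def make_folds(train_samples, fold_count, seed=0):
    """Return one {'folds_train_combined', 'fold_validation'} dict per fold."""
    shuffled = list(train_samples)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled indices into `fold_count` roughly equal groups.
    groups = [sorted(shuffled[i::fold_count]) for i in range(fold_count)]

    folds = []
    for i, held_out in enumerate(groups):
        # Everything except the held-out group is combined for training.
        combined = sorted(s for j, group in enumerate(groups) if j != i for s in group)
        folds.append({'folds_train_combined': combined, 'fold_validation': held_out})
    return folds

folds = make_folds(train_samples=range(20), fold_count=5)
for fold in folds:
    print(fold['fold_validation'], len(fold['folds_train_combined']))
```

Each sample lands in exactly one `fold_validation`, so every training sample is used for validation once across the folds.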

In [31]:
list(foldset.folds)

[<Fold: 1>, <Fold: 2>, <Fold: 3>, <Fold: 4>, <Fold: 5>]

##### Sample indices of each Fold:

In [32]:
foldset.folds[0].samples['folds_train_combined'][:10]

[0, 1, 2, 4, 5, 7, 9, 11, 12, 13]

In [33]:
foldset.folds[0].samples['fold_validation'][:10]

[3, 6, 8, 10, 21, 29, 32, 33, 38, 46]

#### Fetching a `Foldset` into memory

In order to reduce the memory footprint, the `to_numpy()` and `to_pandas()` methods introduce the `fold_index` argument.

If no `fold_index` is specified, then it will fetch all folds and give each fold a numeric key according to its index.

In [34]:
foldset.to_numpy().keys()

dict_keys([0, 1, 2, 3, 4])

So you need to specify the `fold_index` as the first key when accessing the dictionary.

In [35]:
foldset.to_numpy(fold_index=0)[0]['fold_validation']['features'][:4]

array([[4.6, 3.1, 1.5, 0.2],
       [4.6, 3.4, 1.4, 0.3],
       [4.4, 2.9, 1.4, 0.2],
       [5.4, 3.7, 1.5, 0.2]])

In [36]:
foldset.to_pandas(fold_index=0)[0]['folds_train_combined']['labels'].tail()

|      | species |
|------|---------|
| 1044 | 2       |
| 1046 | 2       |
| 1047 | 2       |
| 1048 | 2       |
| 1049 | 2       |


### 6. Optionally, create a `Preprocess` for features & labels.

Certain algorithms need features and/or labels formatted a certain way. For example, converting categorical data `[dog, cat, fish]` to one-hot encoded format `[[1,0,0],[0,1,0],[0,0,1]]`.
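A minimal plain-Python sketch of that one-hot conversion is shown below. The helper `one_hot_encode` is hypothetical and keeps the categories in the order given so that the output matches the example; scikit-learn's `OneHotEncoder` sorts the categories it infers by default, so its column order would be cat, dog, fish instead.

```python
def one_hot_encode(values, categories):
    """Map each value to a row with a single 1 at its category's position."""
    position = {category: i for i, category in enumerate(categories)}
    return [
        [1 if position[value] == i else 0 for i in range(len(categories))]
        for value in values
    ]

categories = ['dog', 'cat', 'fish']
encoded = one_hot_encode(['dog', 'cat', 'fish'], categories)
print(encoded)
# [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(one_hot_encode(['fish', 'dog'], categories))
# [[0, 0, 1], [1, 0, 0]]
```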

The tricky thing about preprocessing is that you are supposed to `fit` it to the training data (train, folds_train_combined), and then `transform` each of the other splits (fold_validation, validation, test) in order to avoid bias. So the dataset itself should not be stored in preprocessed format.

So you simply define the encoders that you want to use, and they will automatically be applied to the appropriate splits/folds during training.
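The fit-on-train, transform-everything-else pattern can be sketched without any libraries. The functions below are a plain-Python stand-in for standard scaling (population standard deviation, as `StandardScaler` uses); the names and toy values are made up for illustration.

```python
from statistics import mean, pstdev

def fit_scaler(train_column):
    """Learn the scaling parameters from the training values only."""
    return {'mean': mean(train_column), 'std': pstdev(train_column)}

def transform(column, params):
    """Apply previously learned parameters to any split."""
    return [(value - params['mean']) / params['std'] for value in column]

train = [5.1, 4.9, 4.7, 4.6, 5.0]  # toy training values
test = [6.4, 5.8]                  # toy test values

params = fit_scaler(train)               # fit on the training split
train_scaled = transform(train, params)
test_scaled = transform(test, params)    # transform only, never re-fit on test

print(params)
print(test_scaled)
```

Because the test values are scaled with the training mean and deviation, nothing about the test distribution leaks into the fitted encoder.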

> For now, only `sklearn.preprocessing` methods are supported. That may change as we add support for more low-level tensor-based frameworks, or if people want to be able to run multiple encoders on features of different data types.

In [37]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder

In [38]:
encoder_features = StandardScaler()
# scikit-learn >= 1.2 renames `sparse` to `sparse_output`.
encoder_labels = OneHotEncoder(sparse=False)

In [39]:
preprocess = aiqc.Preprocess.from_splitset(
    splitset_id = splitset.id
    , description = "standard scaling on features"
    , encoder_features = encoder_features
    , encoder_labels = encoder_labels
)

### 7. Create an `Algorithm` aka model.

An `Algorithm` is the ORM's codename for a machine learning model since *Model* is the most important *reserved word* for ORMs.

Let's define functions to **build** and **train** our model.

You can name the functions whatever you want, but do not change the predetermined arguments (e.g. `**hyperparameters`, `model`, etc.).

Put a placeholder anywhere you want to try out different hyperparameters: `hyperparameters['<some_variable_name>']`. You'll get a chance to define the hyperparameters in a minute.

In [40]:
import keras
from keras import metrics
from keras.models import Sequential
from keras.callbacks import History
from keras.layers import Dense, Dropout

#### Function to build model

In [62]:
def function_model_build(**hyperparameters):
    
	model = Sequential()
	model.add(Dense(hyperparameters['neuron_count'], input_shape=(3,), activation='relu', kernel_initializer='he_uniform'))
	model.add(Dropout(0.2))
	model.add(Dense(hyperparameters['neuron_count'], activation='relu', kernel_initializer='he_uniform'))
	model.add(Dense(3, activation='softmax'))

	opt = keras.optimizers.Adamax(hyperparameters['learning_rate'])
	model.compile(
		loss = 'categorical_crossentropy'
		, optimizer = opt
		, metrics = ['accuracy']
	)
    
	return model

#### Function to train model

In [63]:
def function_model_train(model, samples_train, samples_evaluate, **hyperparameters):
    
	model.fit(
		samples_train["features"]
		, samples_train["labels"]
		, validation_data = (
			samples_evaluate["features"]
			, samples_evaluate["labels"]
		)
		, verbose = 0
		, batch_size = 3
		, epochs = hyperparameters['epoch_count']
		, callbacks=[History()]
	)
    
	return model

#### Optional, functions to predict samples

When creating an `Algorithm`, the predict function will be generated for you automatically if set to `None`.

> The `analysis_type` and `library` of the Algorithm help determine how to handle the predictions.

##### a) Regression

In [64]:
def function_model_predict(model, samples_predict):
    predictions = model.predict(samples_predict['features'])
    return predictions

##### b) Classification binary

All classification `predictions`, both multiclass and binary, must be returned in ordinal format.

> For most libraries, classification algorithms output probabilities as opposed to actual predictions when running `model.predict()`. We want to return both of these objects, `predictions, probabilities` (the order matters), to generate performance metrics behind the scenes.

In [65]:
def function_model_predict(model, samples_predict):
    probabilities = model.predict(samples_predict['features'])
    # This is the official keras replacement for binary classes `.predict_classes()`
    # It returns one array per sample: `[[0], [1], [0], [1]]`
    predictions = (probabilities > 0.5).astype("int32")
    
    return predictions, probabilities

##### c) Classification multiclass

In [78]:
import numpy as np

In [66]:
def function_model_predict(model, samples_predict):
    probabilities = model.predict(samples_predict['features'])
    # This is the official keras replacement for multiclass `.predict_classes()`
    # It returns one ordinal per sample in a flat array: `[0, 2, 1, 2]`
    predictions = np.argmax(probabilities, axis=-1)
    
    return predictions, probabilities

#### Optional, functions to calculate loss

When creating an `Algorithm`, the evaluate function will be generated for you automatically if set to `None`.

> The `analysis_type` and `library` of the Algorithm help determine how to handle the predictions.

The only tricky thing here is that `model.evaluate()` returns a list when the compiled model reports additional metrics, like *accuracy* or *R^2*. All we are after in this case is the loss for the split/fold in question.

In [67]:
def function_model_loss(model, samples_evaluate):
    metrics = model.evaluate(samples_evaluate['features'], samples_evaluate['labels'], verbose=0)
    if (isinstance(metrics, list)):
        loss = metrics[0]
    elif (isinstance(metrics, float)):
        loss = metrics
    else:
        raise ValueError(f"\nYikes - The 'metrics' returned are neither a list nor a float:\n{metrics}\n")
    return loss

> In contrast to openly specifying a loss function, for example `keras.losses.<loss_fn>()`, the use of `.evaluate()` is consistent because it comes from the compiled model. Also, although `model.compiled_loss` would be more efficient, it requires making encoded `y_true` and `y_pred` available to the user, whereas `.evaluate()` can be called with the same arguments as the other `function_model_*` functions, and many deep learning libraries support this approach.

#### Group the functions together in an `Algorithm`!

In [68]:
algorithm = aiqc.Algorithm.create(
    library = "keras"
	, analysis_type = "classification_multi"
	, function_model_build = function_model_build
	, function_model_train = function_model_train
	, function_model_predict = function_model_predict
	, function_model_loss = function_model_loss
)

### 8. Optional, associate `hyperparameters` with your model.

The `hyperparameters` below will be automatically fed into the functions above as `**kwargs` via the `**hyperparameters` argument we saw earlier.

For example, wherever you see `hyperparameters['neuron_count']`, it will pull from the *key:value* pair `"neuron_count": [9, 12]` seen below, so model A will have 9 neurons and model B will have 12 neurons.

In [69]:
hyperparameters = {
	"neuron_count": [9, 12]
	, "epoch_count": [30, 60]
    , "learning_rate": [0.03, 0.05]
}

hyperparamset = aiqc.Hyperparamset.from_algorithm(
	algorithm_id = algorithm.id
	, description = "experimenting with neuron count, epoch count, and learning rate"
	, hyperparameters = hyperparameters
)

> In the future, we will provide different strategies for generating and selecting parameters to experiment with.


#### `Hyperparamcombo` objects.

Each unique combination of hyperparameters is recorded. A separate training `Job` will be made for each.

In [70]:
hyperparamset.hyperparamcombo_count

8

In [71]:
hyperparamcombos = hyperparamset.hyperparamcombos

for h in hyperparamcombos:
    print(h.hyperparameters)

{'neuron_count': 9, 'epoch_count': 30, 'learning_rate': 0.03}
{'neuron_count': 9, 'epoch_count': 30, 'learning_rate': 0.05}
{'neuron_count': 9, 'epoch_count': 60, 'learning_rate': 0.03}
{'neuron_count': 9, 'epoch_count': 60, 'learning_rate': 0.05}
{'neuron_count': 12, 'epoch_count': 30, 'learning_rate': 0.03}
{'neuron_count': 12, 'epoch_count': 30, 'learning_rate': 0.05}
{'neuron_count': 12, 'epoch_count': 60, 'learning_rate': 0.03}
{'neuron_count': 12, 'epoch_count': 60, 'learning_rate': 0.05}
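The eight combinations are the Cartesian product of the three value lists (2 × 2 × 2). A sketch with `itertools.product` reproduces the same set; this illustrates the idea and is not necessarily how AIQC builds its `Hyperparamcombo` objects.

```python
from itertools import product

hyperparameters = {
    "neuron_count": [9, 12],
    "epoch_count": [30, 60],
    "learning_rate": [0.03, 0.05],
}

# Pair each parameter name with one value from its list, for every combination.
names = list(hyperparameters)
combos = [dict(zip(names, values)) for values in product(*hyperparameters.values())]

print(len(combos))
for combo in combos:
    print(combo)
```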


### 9. Create a `Batch` of `Jobs` to keep track of training.

A `Batch` ties together everything you need for hypertuning.

In [79]:
batch = aiqc.Batch.from_algorithm(
	algorithm_id = algorithm.id
	, splitset_id = splitset.id
	, hyperparamset_id = hyperparamset.id
	, foldset_id = None
	, preprocess_id = preprocess.id
)

#### `Job` objects.

Each `Job` in the Batch represents a training run. It contains the information needed to execute the training run. Its `status` keeps track of its phase of execution.

In [80]:
batch.jobs[0].status

'Not yet started'

In [81]:
batch.get_statuses()

{17: 'Not yet started',
 18: 'Not yet started',
 19: 'Not yet started',
 20: 'Not yet started',
 21: 'Not yet started',
 22: 'Not yet started',
 23: 'Not yet started',
 24: 'Not yet started'}

#### Execute the `Batch`.
The Jobs will be asynchronously executed on a background process, so that you can continue to code on the main process. You can poll the Job status.

In [82]:
batch.run_jobs()

🔮 Training Models 🔮:  25%|██████████▌                               | 2/8 [00:04<00:12,  2.14s/it]
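Since the Jobs run in the background, one way to poll is a small loop around `get_statuses()`. The helper `wait_for_jobs` below is hypothetical; it treats `'Succeeded'` as the only finished state because that is the only terminal status shown in this guide, and it relies on a timeout to stop waiting if other terminal statuses exist.

```python
import time

def wait_for_jobs(get_statuses, poll_seconds=5.0, timeout_seconds=600.0):
    """Poll `get_statuses()` until every job has succeeded or the timeout passes.

    `get_statuses` is any zero-argument callable returning {job_id: status},
    for example `batch.get_statuses`.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        statuses = get_statuses()
        if all(status == 'Succeeded' for status in statuses.values()):
            return statuses
        if time.monotonic() >= deadline:
            return statuses  # give up and report whatever state we last saw
        time.sleep(poll_seconds)

# Stand-in for `batch.get_statuses` so the sketch runs without a Batch.
snapshots = iter([
    {17: 'Succeeded', 18: 'Queued'},
    {17: 'Succeeded', 18: 'Succeeded'},
])
final = wait_for_jobs(lambda: next(snapshots), poll_seconds=0)
print(final)
```

With a real Batch you would pass `batch.get_statuses` as the callable instead of the stand-in.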

You can stop the execution of a batch if you need to, and later resume it. If your kernel crashes then you can likewise resume the execution.

In [86]:
from time import sleep
sleep(7)

In [83]:
batch.stop_jobs()


Killed `multiprocessing.Process` 'aiqc_batch_3' spawned from Batch <id:3>



In [84]:
batch.get_statuses()

{17: 'Succeeded',
 18: 'Succeeded',
 19: 'Queued',
 20: 'Queued',
 21: 'Queued',
 22: 'Queued',
 23: 'Queued',
 24: 'Queued'}

Calling `run_jobs()` again resumes the queued Jobs. In this way, if your system crashes for any reason, you can pick up right where you left off.

In [85]:
batch.run_jobs()


Resuming jobs...



🔮 Training Models 🔮: 100%|██████████████████████████████████████████| 8/8 [00:16<00:00,  2.12s/it]


### 10. Assess the `Results`.

Each Job has a `Result`. The following artifacts are automatically written to the Result after training.
    
* `model_file`: hdf5 bytes of the model.
* `history`: per epoch metrics recorded during training.
* `predictions`: dictionary of predictions per split/ fold.
* `probabilities`: dictionary of prediction probabilities per split/ fold.
* `metrics`: dictionary of single-value metrics depending on the analysis_type.
* `plot_data`: metrics readily formatted for plotting.

All of the dictionaries have split/ fold based keys.

#### Fetching the trained model.

In [90]:
compiled_model = batch.jobs[0].results[0].get_model()
compiled_model

<tensorflow.python.keras.engine.sequential.Sequential at 0x14e639250>

#### Fetching metrics.

In [92]:
batch.jobs[0].results[0].metrics

{'test': {'roc_auc': 0.9933333333333333,
  'accuracy': 0.9666666666666667,
  'precision': 0.9696969696969696,
  'recall': 0.9666666666666667,
  'f1': 0.9665831244778613,
  'loss': 0.10969173908233643},
 'validation': {'roc_auc': 1.0,
  'accuracy': 0.9444444444444444,
  'precision': 0.9523809523809523,
  'recall': 0.9444444444444444,
  'f1': 0.9440559440559441,
  'loss': 0.09142466634511948},
 'train': {'roc_auc': 0.9988465974625144,
  'accuracy': 0.9705882352941176,
  'precision': 0.9708513708513707,
  'recall': 0.9705882352941176,
  'f1': 0.9705818732424834,
  'loss': 0.06465762108564377}}
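Because `metrics` is a plain nested dictionary keyed by split, it is easy to line the splits up side by side. The sketch below uses a hand-copied subset of the values printed above so that it runs on its own; the comparison of train and test loss is just one illustrative check.

```python
# A subset of the metrics printed above, copied by hand for illustration.
metrics = {
    'test': {'accuracy': 0.9666666666666667, 'loss': 0.10969173908233643},
    'validation': {'accuracy': 0.9444444444444444, 'loss': 0.09142466634511948},
    'train': {'accuracy': 0.9705882352941176, 'loss': 0.06465762108564377},
}

# One row per split, in a fixed order.
for split in ('train', 'validation', 'test'):
    row = metrics[split]
    print(f"{split:<10} accuracy={row['accuracy']:.3f} loss={row['loss']:.3f}")

# A large gap between train and test loss can hint at overfitting.
loss_gap = metrics['test']['loss'] - metrics['train']['loss']
print(f"test - train loss: {loss_gap:.3f}")
```

With a real Result you would start from `batch.jobs[0].results[0].metrics` instead of the hand-copied dictionary.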

### Metrics & Visualization

For more information on visualizing performance metrics, reference the **Visualization & Metrics** documentation.