# Low-Level API [under construction for v7.0.0]

<img src="../_static/images/banner/pipes.png" class="banner-photo"/>


## Object-Relational Model (ORM)

The Low-Level API is an *object-relational model* for machine learning. Each class in the [ORM](http://docs.peewee-orm.com/en/latest/peewee/models.html) maps to a table in a SQLite database that serves as a machine learning *metastore*. 

The real power lies in the relationships between these objects (e.g. `Label`→`Splitset`←`Feature` and `Queue`→`Job`→`Predictor`→`Prediction`), which enable us to construct rule-based protocols for various types of data and analysis.

Goodbye, *X_train, y_test*. Hello, object-oriented machine learning.

---

## 1. `Dataset`

![Datasets](../_static/images/api/dimensions.png)

The `Dataset` class provides the following subclasses for working with different types of data:

Type | Dimensionality | Supported Formats | Format (if ingested)
--- | --- | --- | ---
**Tabular** | 2D | Files (Parquet, CSV, TSV) / Pandas DataFrame (in-memory) | Parquet
**Sequence** | 3D | NumPy (in-memory ndarray, npy file) | npy 
**Image** | 4D | NumPy (in-memory ndarray, npy file) / Pillow-supported formats | npy

> The names are merely suggestive, as the primary purpose of these subclasses is to provide a way to register data of known dimensionality.  For example, a practitioner could ingest many uni-channel/ grayscale images as a 3D Sequence Dataset instead of a multi-channel 4D Image Dataset.

> *Why not 2D NumPy?* The `Dataset.Tabular` class is intended for strict, column-specific dtypes and Parquet persistence upon ingestion. In practice, this conflicted too often with NumPy's array-wide dtyping. We use the best tools for the job (df/pq for 2D) and (array/npy for ND).
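
As a small illustration of the dimensionality point, the sketch below shows the same grayscale images held as a 4D array and as a 3D array. The sizes are made up, and only NumPy is used; no AIQC calls are involved.

```python
import numpy as np

# 100 grayscale images of 28 rows x 28 columns (illustrative sizes only).
gray_4d = np.zeros((100, 1, 28, 28))  # samples, channels, rows, columns

# Dropping the single channel leaves a 3D array, which has the
# dimensionality expected of a Sequence dataset.
gray_3d = gray_4d.squeeze(axis=1)     # samples, rows, columns
```

Either array describes the same pixels; the choice of subclass only reflects how many dimensions are being registered.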

---

## 1a. Registration

*Most of the Dataset registration methods share these arguments:*

Argument | Description
--- | ---
**ingest** | Determines whether raw data is stored directly inside the metastore or remains on disk to be accessed via path/URL. *In-memory* data like DataFrames and ndarrays must be ingested, whereas *file-based* data like Parquet, NPY, and image folders/URLs may remain remote. Regardless of whether or not the raw data is ingested, metadata is always derived from it by parsing: 2D via DataFrame and N-D via ndarray.
**rename_columns** | Useful for assigning column names to arrays or delimited files that would otherwise be unnamed. `len(rename_columns)` must match the number of columns in the raw data. Normally, an int-based range is assigned to unnamed columns. In this case, AIQC converts each column name to a string e.g. '1' during the registration process.
**retype** | Change the dtype of data using [np.types](https://numpy.org/doc/stable/user/basics.types.html). All Dataset subclasses support mass typing via `np.type`/ `str(np.type)`. Only the Tabular subclass supports individual column retyping via `dict(column=str(np.type))`. If `rename_columns` is used in conjunction with `retype=dict()`, then each `dict['column']` key must match its counterpart in `rename_columns`.
**description** | What information does this dataset contain? What is unique about this dataset/ version -- did you edit the raw data, add rows, or change column names/ dtypes?
**name** | Triggers dataset *versioning*. Datasets that share a name will be assigned an auto-incrementing `version:int` number provided that they are not duplicates of each other based on a `sha256_hexdigest:str` hash. If you try to create an exact duplicate, it will warn you and `return` the matching duplicate instead of creating a new entity. This behavior makes it easy to rerun pipelines where Datasets are created inline.
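
The versioning rule described for `name` can be sketched with the standard library alone. The `registry` list and `register` function below are hypothetical stand-ins for the metastore, and starting versions at 1 is an assumption; this is not AIQC's actual implementation.

```python
import hashlib

# Hypothetical in-memory stand-in for the metastore table.
registry = []

def register(name, raw_bytes):
    """Return (record, created) following the name/version/hash rule."""
    digest = hashlib.sha256(raw_bytes).hexdigest()
    same_name = [r for r in registry if r["name"] == name]
    for record in same_name:
        if record["sha256_hexdigest"] == digest:
            # Exact duplicate: return the existing entry instead of creating one.
            return record, False
    # Assumption: versions start at 1 and increment per shared name.
    record = {"name": name, "version": len(same_name) + 1, "sha256_hexdigest": digest}
    registry.append(record)
    return record, True

first, created_first = register("iris", b"sepal,petal\n1,2\n")
dupe, created_dupe = register("iris", b"sepal,petal\n1,2\n")
edited, created_edited = register("iris", b"sepal,petal\n1,3\n")
```

Here the second call returns the first record unchanged, while the edited bytes produce a new version under the same name.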

*Ingestion provides the following benefits, especially for entry-level users:*

- Persist in-memory datasets (Pandas DataFrames, NumPy ndarrays).
- Keeps data coupled with the experiment in the portable SQLite file.
- Provides a more immutable and out-of-the-way storage location in comparison to a laptop file system.
- Encourages preserving tabular dtypes with the ecosystem-friendly Parquet format.

*Why would I avoid ingestion?*

- Happy with where the original data lives: e.g. S3 bucket.
- Don't want to duplicate the data.

> *sha256?* -- It's the one-way hash algorithm that GitHub aspires to upgrade to. AIQC runs it on compressed data because it's easier and probably less error-prone than intercepting the bytes of the *fastparquet* intermediary tables before appending the Parquet magic bytes.

> *Is SQLite a legitimate datastore?* -- In many cases, SQLite queries are faster than accessing data via a filesystem. It's a stable, 22-year-old technology that serves as the default database for iOS e.g. Apple Photos. AIQC uses it to store raw data in byte format as a BlobField. I've stored tens of thousands of files in it over several years and never experienced corruption. Keep in mind that AWS S3 is a blob store, and the Microsoft equivalent service is literally called Azure *Blob* Storage. The max size of a BlobField is 2GB, so ~20GB after compression. Either way, the goal of machine learning isn't to record the entire population within the weights of a neural network, it's to find subsets that are representative of the broader population.

---

### 1ai. `Dataset.Tabular`

Here are some of the ways practitioners can use this 2D structure:

||||
|---|---|---|
Multiple subjects (1 row per sample) | * | Multi-variate 1D (1 col per attribute)
Single subject (1 row per timestamp) | * | Multi-variate 1D (1 col per attribute)
Multiple subjects (1 row per timestamp) | * | Uni-variate 0D (1 col per sample)

> Tabular datasets may contain both features and labels

---

#### └── `Dataset.Tabular.from_df()`

```python
dataset = Dataset.Tabular.from_df(
    dataframe
    , rename_columns
    , retype
    , description
    , name
)
```

Argument | Type | Default | Description
--- | --- | --- | ---
**dataframe** | DataFrame | Required | [pd.DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas-dataframe) with int-based single index. DataFrames are always ingested.
**rename_columns** | list[str]  | None | See [Registration](#1a.-Registration)
**retype** | np.type / dict(column:np.type) | None | See [Registration](#1a.-Registration)
**description** | str | None | See [Registration](#1a.-Registration)
**name** | str | None | See [Registration](#1a.-Registration)
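
In plain pandas terms, `rename_columns` and a dict-based `retype` correspond roughly to the steps below. This is a sketch of the idea using only pandas and NumPy, with made-up column names; it does not show AIQC's internal code.

```python
import numpy as np
import pandas as pd

# An unnamed 2D array becomes a DataFrame with an int-based column range.
df = pd.DataFrame(np.array([[5.1, 0], [4.9, 1]]))

# rename_columns: one name per column, in order.
rename_columns = ["sepal_length", "species"]
assert len(rename_columns) == df.shape[1]
df.columns = rename_columns

# retype as a per-column dict; the keys match the renamed columns.
retype = {"sepal_length": "float32", "species": "int64"}
df = df.astype(retype)

# Summarize dtypes in dict(column=str(np.type)) form.
dtypes = {column: str(dtype) for column, dtype in df.dtypes.items()}
```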

---

#### └── `Dataset.Tabular.from_path()`

```python
Dataset.Tabular.from_path(
    file_path
    , ingest
    , rename_columns
    , retype
    , header
    , description
    , name
)
```

Argument | Type | Default | Description
--- | --- | --- | ---
**file_path** | str | Required | Parsed based on how the file name ends (.parquet, .tsv, .csv)
**ingest** | bool | True | See [Registration](#1a.-Registration). Defaults to True because I don't want to rely on CSV files as a source of truth for dtypes, and compression works great in Parquet.
**rename_columns** | list[str]  | None | See [Registration](#1a.-Registration)
**retype** | np.type / dict(column:np.type) | None | See [Registration](#1a.-Registration)
**header** | object | None | See [Registration](#1a.-Registration) 
**description** | str | None | See [Registration](#1a.-Registration)
**name** | str | None | See [Registration](#1a.-Registration)

---

### 1aii. `Dataset.Sequence`

Here are some of the ways practitioners can use this 3D structure:


||||
|---|---|---|
Single subject (1 patient) | * | Multiple 2D sequences
Multiple subjects | * | Single 2D sequence

> Sequence datasets are somewhat multi-modal in that, in order to perform supervised learning on them, they must eventually be paired with a `Dataset.Tabular` that acts as their `Label`.

---

#### └── `Dataset.Sequence.from_numpy()`

```python
Dataset.Sequence.from_numpy(
    arr3D_or_npyPath
    , ingest
    , rename_columns
    , retype        
    , description
    , name           
)
```

Argument | Type | Default | Description
--- | --- | --- | ---
**arr3D_or_npyPath** | object / str | Required | 3D array in the form of either an [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html) or [npy](https://numpy.org/doc/stable/reference/generated/numpy.save.html) file path
**ingest** | bool | None | See [Registration](#1a.-Registration). If left blank, ndarrays will be ingested and npy files will not. Raises an error if an ndarray is provided with `ingest=False`.
**rename_columns** | list[str]  | None | See [Registration](#1a.-Registration)
**retype** | np.type / dict(column:np.type) | None | See [Registration](#1a.-Registration)
**description** | str | None | See [Registration](#1a.-Registration)
**name** | str | None | See [Registration](#1a.-Registration) 

---

### 1aiii. `Dataset.Image`

Here are some of the ways practitioners can use this 4D structure:

||||
|---|---|---|
Single subject (1 patient) | * | Multiple 3D images
Multiple subjects | * | Single 3D image


Users can ingest 4D data using either:
- [The Pillow library, which supports various formats](https://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html)
- Or NumPy arrays as a simple alternative


> Image datasets are somewhat multi-modal in that, in order to perform supervised learning on them, they must eventually be paired with a `Dataset.Tabular` that acts as their `Label`.

---

#### └── `Dataset.Image.from_numpy()`

```python
Dataset.Image.from_numpy(
    arr4D_or_npyPath
    , ingest
    , rename_columns
    , retype        
    , description
    , name       
)
```

Argument | Type | Default | Description
--- | --- | --- | ---
**arr4D_or_npyPath** | object / str | Required | 4D array in the form of either an [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html) or [npy](https://numpy.org/doc/stable/reference/generated/numpy.save.html) file path
**ingest** | bool | None | See [Registration](#1a.-Registration). If left blank, ndarrays will be ingested and npy files will not. Raises an error if an ndarray is provided with `ingest=False`.
**rename_columns** | list[str]  | None | See [Registration](#1a.-Registration) 
**retype** | np.type / dict(column:np.type) | None | See [Registration](#1a.-Registration) 
**description** | str | None | See [Registration](#1a.-Registration) 
**name** | str | None | See [Registration](#1a.-Registration) 

---

#### └── `Dataset.Image.from_folder()`

```python
Dataset.Image.from_folder(
    folder_path
    , ingest
    , rename_columns
    , retype
    , description
    , name
)
```

Argument | Type | Default | Description
--- | --- | --- | ---
**folder_path** | str | Required | Folder of images to be ingested via Pillow. All images must be cropped to the same dimensions ahead of time.
**ingest** | bool | False | See [Registration](#1a.-Registration) 
**rename_columns** | list[str]  | None | See [Registration](#1a.-Registration) 
**retype** | np.type / dict(column:np.type) | None | See [Registration](#1a.-Registration) 
**description** | str | None | See [Registration](#1a.-Registration) 
**name** | str | None | See [Registration](#1a.-Registration) 

---

#### └── `Dataset.Image.from_urls()`

```python
Dataset.Image.from_urls(
    urls
    , source_path
    , ingest
    , rename_columns
    , retype
    , description
    , name
)
```

Argument | Type | Default | Description
--- | --- | --- | ---
**urls** | list(str) | Required | URLs that each point to an image to be ingested via Pillow. All images must be cropped to the same dimensions ahead of time.
**source_path** | str  | None | Optionally record a shared directory, bucket, or FTP site where images are stored. The backend won't use this information for anything.
**ingest** | bool | False | See [Registration](#1a.-Registration)
**rename_columns** | list[str]  | None | See [Registration](#1a.-Registration) 
**retype** | np.type / dict(column:np.type) | None | See [Registration](#1a.-Registration) 
**description** | str | None | See [Registration](#1a.-Registration) 
**name** | str | None | See [Registration](#1a.-Registration) 

---

### 1b. Fetch

The following methods are exposed to end-users in case they want to inspect the data that they have ingested.

---

#### └── `Dataset.to_arr()`

Argument | Type | Default | Description
--- | --- | --- | ---
**id** | int | None | The id of the Dataset
**columns** | list(str) | None | If left blank, includes all columns
**samples** | list(int) | None | If left blank, includes all samples

Subclass | Returns
--- | ---
Tabular | ndarray.ndim==2
Sequence | ndarray.ndim==3
Image | ndarray.ndim==4

---

#### └── `Dataset.to_df()`

Argument | Type | Default | Description
--- | --- | --- | ---
**id** | int | None | The id of the Dataset
**columns** | list(str) | None | If left blank, includes all columns
**samples** | list(int) | None | If left blank, includes all samples

Subclass | Returns
--- | ---
Tabular | DataFrame
Sequence | list(DataFrame)
Image | list(list(DataFrame))

---

#### └── `Dataset.to_pillow()`

Argument | Type | Default | Description
--- | --- | --- | ---
**id** | int | None | The id of the Dataset
**samples** | list(int) | None | If left blank, includes all samples

Subclass | Returns
--- | ---
Image | list(PIL.Image)


---

#### └── `Dataset.get_dtypes()`

Regardless of how the initial `Dataset.dtype` was formatted [e.g. single np.type / str(np.type) / dict(column=np.type)], this function intentionally returns the dtype of each column in `dict(column=str(np.type))` format.

Argument | Type | Default | Description
--- | --- | --- | ---
**id** | int | None | The id of the Dataset
**columns** | list(str) | None | If left blank, includes all columns

---

## 2. `Feature`

Determines the columns that will be used as predictive features during training. Columns are always the last dimension `shape[-1]` of a dataset.

---

### 2a. Create

#### └── `Feature.from_dataset()`

```python
Feature.from_dataset(
    dataset_id
    , include_columns
    , exclude_columns
)
```

Argument | Type | Default | Description
--- | --- | --- | ---
**dataset_id** | int | Required | `Dataset.id` from which you want to derive `Dataset.columns`.
**include_columns**  | list(str) | None | Specify columns that *will* be included in the Feature. All columns that are not specified will *not* be included.
**exclude_columns** | list(str) | None | Specify columns that will *not* be included in the Feature. All columns that are not specified *will* be included.

> If neither `include_columns` nor `exclude_columns` is defined, then all columns will be used.

> `include_columns` and `exclude_columns` cannot be used at the same time.
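
The include/exclude rules can be summarized in a few lines of plain Python. `select_columns` is a hypothetical helper written for this illustration; it is not part of AIQC, and the error type is an assumption.

```python
def select_columns(all_columns, include_columns=None, exclude_columns=None):
    """Hypothetical helper mirroring the include/exclude rules."""
    if include_columns is not None and exclude_columns is not None:
        raise ValueError("Use either include_columns or exclude_columns, not both.")
    if include_columns is not None:
        # Keep only the named columns, preserving the dataset's column order.
        return [c for c in all_columns if c in include_columns]
    if exclude_columns is not None:
        # Keep everything except the named columns.
        return [c for c in all_columns if c not in exclude_columns]
    # Neither argument defined: all columns are used.
    return list(all_columns)

columns = ["sepal_length", "sepal_width", "species"]
included = select_columns(columns, include_columns=["sepal_length"])
excluded = select_columns(columns, exclude_columns=["species"])
everything = select_columns(columns)
```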

---

### 2b. Fetch

These methods wrap Dataset's [fetch](#1b.-Fetch) methods:

Method | Arguments | Returns
--- | --- | ---
**to_arr()** | columns:list(str)=Feature.columns, samples:list(int)=None | ndarray 2D / 3D / 4D
**to_df()** | columns:list(str)=Feature.columns, samples:list(int)=None | df / list(df) / list(list(df))
**get_dtypes()** | columns:list(str)=Feature.columns | dict(column=str(np.type))

---

## 3. `Label`

Determines the column(s) that will be used as a target during supervised analysis. Do not create a Label if you intend to conduct unsupervised/ self-supervised analysis.

---

### 3a. Create

#### └── `Label.from_dataset()`

```python
Label.from_dataset(
    dataset_id
    , columns
)
```

Argument | Type | Default | Description
--- | --- | --- | ---
**dataset_id** | int | Required | `Dataset.id` from which you want to derive `Dataset.columns`. Only Tabular Datasets may be used as a Label.
**columns**  | list(str) | None | Specify columns that *will* be included in the Label. If left blank, defaults to all columns. If more than 1 column is provided, then the data in those columns must be in One-Hot Encoded (OHE) format.
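
To check ahead of time whether a multi-column label is in OHE format, a quick test like the one below may help. `is_one_hot` is a hypothetical helper for illustration and is not necessarily how AIQC validates labels.

```python
def is_one_hot(rows):
    """True if every row holds only 0s and 1s with exactly one 1."""
    return all(set(row) <= {0, 1} and sum(row) == 1 for row in rows)

ohe_label = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
not_ohe = [[1, 1, 0], [0, 0, 2]]

ohe_ok = is_one_hot(ohe_label)
not_ohe_ok = is_one_hot(not_ohe)
```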

---

### 3b. Fetch

These methods wrap Dataset's [fetch](#1b.-Fetch) methods:

Method | Arguments | Returns
--- | --- | ---
**to_arr()** | columns:list(str)=Label.columns, samples:list(int)=None | ndarray 2D / 3D / 4D
**to_df()** | columns:list(str)=Label.columns, samples:list(int)=None | df / list(df) / list(list(df))
**get_dtypes()** | columns:list(str)=Label.columns | dict(column=str(np.type))

---

## 4. Interpolate

If you have continuous columns with missing data in a time series, then interpolation allows you to fill in those blanks mathematically. It does so by fitting a curve to each column. If you don't have time series data then you do not need interpolation.

Interpolation is the first preprocessor because you need to fill in blanks prior to encoding.

> [`pandas.DataFrame.interpolate`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html) is utilized due to its ease of use, variety of methods, and **support of sparse indices**. However, it does not follow the `fit/transform` pattern like many of the class-based sklearn preprocessors, so the interpolated training data is concatenated with the evaluation split during the interpolation of evaluation splits.

These are the default settings if `interpolate_kwargs=None`:

```python
interpolate_kwargs = dict(
    method            = 'spline'
    , limit_direction = 'both'
    , limit_area      = None
    , axis            = 0
    , order           = 1
)
```

Dataset Type | Approach
--- | ---
**Tabular** | Unlike encoders, there is no `fit` object. So first the training data rows are interpolated independently. Then, when it comes time to interpolate other splits like validation, the training data is included in the sequence to be interpolated.
**Sequence** | Interpolation is run on each 2D sequence separately
**Image** | Interpolation is run on each 2D channel separately
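
The Tabular approach can be sketched directly in pandas. The example below is an approximation of the idea rather than AIQC's code: the data is invented, and `method="linear"` is swapped in for the default `spline` because `spline` relies on SciPy.

```python
import numpy as np
import pandas as pd

# Linear is used here instead of the default spline to avoid a SciPy dependency.
kwargs = dict(method="linear", limit_direction="both", axis=0)

train = pd.DataFrame({"temp": [1.0, np.nan, 3.0, 4.0]})
evaluation = pd.DataFrame({"temp": [np.nan, 6.0]})

# Step 1: the training rows are interpolated on their own.
train_filled = train.interpolate(**kwargs)

# Step 2: the evaluation rows are appended to the interpolated training rows,
# the combined sequence is interpolated, and only the evaluation rows are kept.
combined = pd.concat([train_filled, evaluation], ignore_index=True)
evaluation_filled = (
    combined.interpolate(**kwargs)
    .tail(len(evaluation))
    .reset_index(drop=True)
)
```

Because the training rows precede the evaluation rows, the gap at the start of the evaluation split is filled using the last known training value as context.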

---

### 4a. `LabelInterpolater`


Label is intended for a single column, so only 1 Interpolater will be used during `Label.preprocess()`.

#### └── `LabelInterpolater.from_label()`

```python
LabelInterpolater.from_label(
    label_id
    , process_separately
    , interpolate_kwargs
)
```

Argument | Type | Default | Description
--- | --- | --- | ---
**label_id** | int | Required | Points to the `Label.columns` to be interpolated
**process_separately** | bool | True | When `True`, the fit is restricted to the training data. This may be flipped to `False`, but doing so causes data leakage.
**interpolate_kwargs** | dict | None | The `interpolate_kwargs:dict=None` object is what gets passed to Pandas interpolation. In my experience, `method=spline` produces the best results. However, if either (a) spline fails to fit to your data, or (b) you know that your pattern is linear - then try `method=linear`.  

---

### 4b. `FeatureInterpolater`


For *multivariate datasets* there may be several columns/ dtypes that have completely different patterns/ curves to fit. So we can specify multiple FeatureInterpolaters, and give them column/dtype filters.

#### └── `FeatureInterpolater.from_feature()`

```python
FeatureInterpolater.from_feature(
    feature_id
    , process_separately
    , interpolate_kwargs
    , dtypes
    , columns
    , verbose
)
```

Argument | Type | Default | Description
--- | --- | --- | ---
**feature_id** | int | Required | Points to the `Feature.columns` to be interpolated
**process_separately** | bool | True | When `True`, the fit is restricted to the training data. This may be flipped to `False`, but doing so causes data leakage.
**interpolate_kwargs** | dict | None | The `interpolate_kwargs:dict=None` object is what gets passed to Pandas interpolation. In my experience, `method=spline` produces the best results. However, if either (a) spline fails to fit to your data, or (b) you know that your pattern is linear - then try `method=linear`.
**dtypes** | type | None | The dtypes to include
**columns** | type | None | The columns to include. Errors if any of the columns were already included by dtypes.
**verbose** | bool | True | If True, messages will be printed about the status of the interpolaters as they attempt to fit on the filtered columns

---

## 5. Encode Features & Labels

### Encoding

Certain algorithms either (a) require features and/ or labels formatted a certain way, or (b) perform significantly better when their values are normalized. For example:

* Scaling continuous features from (-1 to 1) or (0.0 to 1.0). Or transforming them to resemble a more Gaussian distribution.
* Converting ordinal or categorical string data `[dog, cat, fish]` into one-hot encoded format `[[1,0,0][0,1,0][0,0,1]]`.

There are two phases of encoding:
1. `fit` - where the encoder learns about the values of the samples made available to it. Ideally, you only want to `fit` aka learn from your training split so that you are not *"leaking"* information from your validation and test splits into your encoder!
2. `transform` - where the encoder transforms all of the samples in the population.
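
The two phases can be sketched with a toy min-max scaler. `MinMaxSketch` is a hypothetical stand-in written for this illustration; in practice an encoder from `sklearn.preprocessing` would play this role.

```python
class MinMaxSketch:
    """Toy scaler with separate fit and transform steps."""

    def fit(self, values):
        # Learn the range from whatever samples are made available.
        self.low, self.high = min(values), max(values)
        return self

    def transform(self, values):
        # Apply the learned range to any samples, seen or unseen.
        span = self.high - self.low
        return [(v - self.low) / span for v in values]

train, test = [0.0, 5.0, 10.0], [2.5, 20.0]

scaler = MinMaxSketch().fit(train)      # learn only from the training split
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)    # test values never influenced the fit
```

Note that the test value of 20.0 scales to 2.0, outside the 0 to 1 range, precisely because the test split was not used during `fit`.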

AIQC has solved the following challenges related to encoding:

* How does one dynamically `fit` on only the training samples in advanced scenarios like cross-validation where a different fold is used for validation each time?

* For certain encoders, especially categorical ones, there is arguably no leakage. If an encoder is arbitrarily assigning values/ tags to a sample through a process that is not aggregate-informed, then the information that is revealed to the `fit` is largely irrelevant. As an analogy, if we are examining swan color and all of a sudden there is a black swan... it's clearly not white, so slap a non-white label on it and move on. In fact, the prediction process and performance metric calculation may fail if they don't know how to handle the previously unseen category.

* Certain encoders only accept certain dtypes. Certain encoders only accept certain dimensionality (e.g. 1D, 2D, 3D) or shape patterns (odd-by-odd square). Unfortunately, there is not much uniformity here.

* Certain encoders output extraneous objects that don't work with deep learning libraries.

> Only `sklearn.preprocessing` methods are officially supported, but we have experimented with `sklearn.feature_extraction.text.CountVectorizer`

---

### a) Encode labels with `LabelCoder`.

Unfortunately, the name "LabelEncoder" is occupied by `sklearn.preprocessing.LabelEncoder`

Of course, you cannot encode Labels if your `Splitset` does not have labels in the first place.

The process is straightforward. You provide an instantiated encoder [e.g. `StandardScaler()` not `StandardScaler`], and then AIQC will:

* Verify that the encoder works with your `Label`'s dtype, sample values, and figure out what dimensionality it needs in order to succeed.

* Automatically correct the attributes of your encoder to smooth out any common errors they would cause. For example, preventing the output of a sparse scipy matrix.

* Determine whether the encoder should be `fit` either (a) exclusively on the train split, or (b) if it is not prone to leakage, inclusively on the entire dataset thereby reducing the chance of errors arising.

#### Creating a `LabelCoder`

AIQC only supports the uppercase `sklearn.preprocessing` methods (e.g. `RobustScaler`, but not `robust_scale`) because the lowercase methods do not separate the `fit` and `transform` steps. FYI, most of the uppercase methods have a combined `fit_transform` method if you need them. 

> https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing

```python
from sklearn.preprocessing import *

# Note: scikit-learn >= 1.2 names this argument `sparse_output` instead of `sparse`.
labelcoder = LabelCoder.from_label(
    label_id=label.id, sklearn_preprocess=OneHotEncoder(sparse=False)
)
```

The following method is used behind the scenes to fetch the most recently created LabelCoder for your Label when it comes time to encode data during training.

---

### b) Encode Features sequentially with `FeatureCoders`.

The `FeatureCoder` has the same validation process as the `LabelCoder`. However, it is not without its own challenges:

* We want to be able to apply different encoders to columns of different dtypes.

* Additionally, even within the same dtype (e.g. float/ continuous), different distributions call for different encoders.

* Commonly used encoders such as `OneHotEncoder` can output multiple columns from a single column input. Therefore, the *shape* of the features can change during encoding.

* And finally, throughout this entire process, we need to avoid data leakage.

For these reasons, `FeatureCoder`s are applied sequentially, in an ordered chain, one after the other. After an encoder is applied, its columns are removed from the raw feature and placed into an intermediary cache specific to each split/ fold.

#### Filtering feature columns

The filtering mode is either:

* Inclusive (`include=True`) encode columns that match the filter.

* Exclusive (`include=False`) encode columns outside of the filter.

Then you can select:

1. An optional list of `dtypes`.

2. An optional list of `columns` names.

  * The column filter is applied after the dtype filter. 
  
> You can create a filter for all columns by setting `include=False` and then setting both `dtypes` and `columns` to `None`.
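
One way to read these filter rules is sketched below. `match_columns` is a hypothetical helper, and combining the dtype matches with the column matches is an assumption made for illustration; it may differ from how AIQC resolves filters internally.

```python
def match_columns(column_dtypes, include=True, dtypes=None, columns=None):
    """column_dtypes maps each column name to its dtype string."""
    matched = []
    # The dtype filter is applied first.
    if dtypes is not None:
        matched = [c for c, d in column_dtypes.items() if d in dtypes]
    # The column filter is applied after the dtype filter.
    if columns is not None:
        matched += [c for c in columns if c in column_dtypes and c not in matched]
    if include:
        return matched
    # Exclusive mode: everything outside of the filter.
    return [c for c in column_dtypes if c not in matched]

iris = {
    "sepal_length": "float64",
    "sepal_width": "float64",
    "species": "object",
}
floats_only = match_columns(iris, include=True, dtypes=["float64"])
all_columns = match_columns(iris, include=False, dtypes=None, columns=None)
```

Under this reading, an empty exclusive filter matches nothing, so every column falls outside of it and gets encoded.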

After submitting your encoder, if `verbose=True` is enabled:
* The validation rules help determine why it may have failed.
* The print statements help determine which columns your current filter matched, and which raw columns remain. 

```python
featurecoder = FeatureCoder.from_feature(
    feature_id = feature.id
    , sklearn_preprocess = PowerTransformer(method='yeo-johnson', copy=False)
    , include = True
    , dtypes = ['float64']
    , columns = None
    , verbose = True
)
```

```
___/ featurecoder_index: 0 \_________

=> The column(s) below matched your filter(s) featurecoder filters.

['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

=> Done. All feature column(s) have featurecoder(s) associated with them.
No more FeatureCoders can be added to this Encoderset.
```



You can also view this information via the following attributes: `matching_columns`, `leftover_dtypes`, and `leftover_columns`.

---

## 6. Reshape Features

### ORM 

When working with architectures that are highly dimensional such as convolutional and recurrent networks (Conv1D, Conv2D, Conv3D / ConvLSTM1D, ConvLSTM2D, ConvLSTM3D), you'll often find yourself needing to reshape data to fit a layer's required input shape.

AIQC ingestion & preprocessing favors a *"channels_first"* (samples, channels, rows, columns) approach as opposed to *"channels_last"* (samples, rows, columns, channels).

- *Reducing unused dimensions* - When working with grayscale/ single channel images (1 channel, 25 rows, 25 columns) there is no sense using Conv2D just to handle that 1 channel.
- *Adding wrapper dimensions* - Perhaps your data is a fit for ConvLSTM1D, but that layer is only supported in the nightly TensorFlow build so you want to add a wrapper dimension in order to use the production-ready ConvLSTM2D.

It is difficult to do this on the fly during training (aka after the fact) because you need to: add reshaping layers/ views to your model, intercept and reshape the data in your post-processing functions, and, by this point, the data is in a variety of tensor formats. It's also more efficient to do this wrangling once up front rather than repeatedly on every training run.

The `reshape_indices` argument accepts a tuple for rearranging indices in the order of your choosing. Behind the scenes, it will use `np.reshape()` to rearrange the data at the end of your preprocessing pipeline. How each element in that tuple is handled is determined by its type.

`feature.make_featureshaper(reshape_indices:tuple)`

```python
# source code from the end of `feature.preprocess()`
import math

current_shape = feature_array.shape

new_shape = []
for i in featureshaper.reshape_indices:
    if (type(i) == int):
        # int: reuse the size of the existing dimension at that index
        new_shape.append(current_shape[i])
    elif (type(i) == str):
        # str: use the number itself as the size of that dimension
        new_shape.append(int(i))
    elif (type(i) == tuple):
        # tuple: multiply the sizes of the listed dimensions together
        indices = [current_shape[idx] for idx in i]
        new_shape.append(math.prod(indices))
new_shape = tuple(new_shape)

feature_array = feature_array.reshape(new_shape)
```

*Warning:* if your model is unsupervised (aka generative or self-supervised), then it must output data in *"column (aka width) last"* shape. Otherwise, automated column decoding will be applied along the wrong dimension.

### Reshaping by Index 

Let's say we have a 4D feature consisting of 3D images `(samples * color channels * rows * columns)`. Our image is B&W, so we want to get rid of the single color channel. So we want to drop the dimension at the shape index `1`. 

```python
reshape_indices = (0,2,3)
```

Thus we have wrangled ourselves a 3D feature consisting of 2D images `(samples * rows * columns)`. 

### Reshaping Explicitly

But what if the dimensions we want cannot be expressed by rearranging the existing indices? You might have been wondering why `str` appeared in the loop above. If you define a string-based number, then that number will be used directly as the value at that position.

So if I wanted to add an extra wrapper dimension to my data to serve as a single color channel, I would simply do:

```python
reshape_indices = (0,'1',1,2)
```

### Multiplicative Reshaping

Sometimes you need to stack/nest dimensions. This requires multiplying one shape index by another. For example, if I have 3 separate hours' worth of data and I want to treat it as 180 minutes, then I need to go from a shape of (3 hours * 60 minutes) to (180 minutes). Just provide the shape indices that you want to multiply in a `tuple` like so:

```python
reshape_indices = ((0,1), 2)
```
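
To compare the three modes side by side, the loop from earlier in this section can be wrapped in a small function. `resolve_shape` is a hypothetical helper for illustration, and the starting shapes are made up.

```python
import math

def resolve_shape(current_shape, reshape_indices):
    """Translate reshape_indices into a concrete shape using the three rules."""
    new_shape = []
    for i in reshape_indices:
        if isinstance(i, int):
            # Reuse the size of the dimension at that index.
            new_shape.append(current_shape[i])
        elif isinstance(i, str):
            # Use the number itself as the size.
            new_shape.append(int(i))
        elif isinstance(i, tuple):
            # Multiply the sizes of the listed dimensions.
            new_shape.append(math.prod(current_shape[idx] for idx in i))
    return tuple(new_shape)

by_index = resolve_shape((1000, 1, 25, 25), (0, 2, 3))     # drop the channel
explicit = resolve_shape((1000, 25, 25), (0, '1', 1, 2))   # add a wrapper dimension
multiplied = resolve_shape((3, 60, 4), ((0, 1), 2))        # 3 hours * 60 minutes
```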

## 7. Window Features

![window_dimensions](../_static/images/api/window_dimensions.png)

Window facilitates sliding windows for a time series Feature. It does not apply to Labels. This is used for unsupervised (aka self-supervised) time series walk-forward forecasting.

As seen above, no matter what dimensionality the original data has, it will be windowed along the samples (first) dimension. `size_window` determines how many timepoints are included in a window, and `size_shift` determines how many timepoints to slide over before defining a new window.

> For example, if we want to be able to *predict the next 7 days* worth of weather *using the past 21 days* of weather, then our `size_window=21` and our `size_shift=7`.

![window_samples](../_static/images/api/stratify_window.png)

After data is windowed, its dimensionality increases by 1. Why? Well, originally we had a *single time series*. However, if we window that data, then we have *many time series subsets*.

This means that the windows now act as the sample dimension, which is important for stratification. Therefore, Window must be created prior to Splitset.

![Windows](../_static/images/api/sliding_windows.png)

In a walk-forward analysis, we learn about the future by looking at the past. So we need 2 sets of windows:

- *Unshifted windows* (orange in diagram above): represent the past and serve as the features we learn from
- *Shifted windows* (green in diagram above): represent the future and serve as the target we predict

However, when conducting inference, we are trying to predict the shifted windows, not learn from them. So we don't need to record any shifted windows.
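
The relationship between unshifted and shifted windows can be sketched in plain Python. The layout below is an assumption made for illustration (window `k` starts at `k * size_shift`, and its shifted twin starts `size_shift` timepoints later); `make_windows` is a hypothetical helper, and AIQC's actual indexing may differ.

```python
def make_windows(timepoints, size_window, size_shift, record_shifted=True):
    """Slide along the first dimension, pairing each window with its shifted twin."""
    unshifted, shifted = [], []
    # When shifted windows are recorded, leave room for the final shifted window.
    lookahead = size_shift if record_shifted else 0
    start = 0
    while start + lookahead + size_window <= len(timepoints):
        unshifted.append(timepoints[start:start + size_window])
        if record_shifted:
            shifted_start = start + size_shift
            shifted.append(timepoints[shifted_start:shifted_start + size_window])
        start += size_shift
    return unshifted, shifted

series = list(range(10))
past, future = make_windows(series, size_window=4, size_shift=2)
inference_only, no_targets = make_windows(
    series, size_window=4, size_shift=2, record_shifted=False
)
```

With shifted windows recorded, each window in `past` lines up with a target in `future`; without them, every available window can be used for inference.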

---

### 7a. Create

#### └── `Window.from_feature()`

```python
Window.from_feature(
    feature_id
    , size_window
    , size_shift
    , record_shifted
)
```

Argument | Type | Default | Description
--- | --- | --- | ---
**feature_id** | int | Required | `Feature.id` from which you want to derive windows.
**size_window**  | int | Required | The number of timesteps to include in a window.
**size_shift**  | int | Required | The number of timesteps to shift forward.
**record_shifted**  | bool | True | Whether or not we want to keep a shifted set of windows around. During pure inference, this is False.

---

## 8. `Splitset`

Used for sample stratification. Reference [Stratification](https://aiqc.readthedocs.io/en/latest/pages/explainer.html) section.

Split | Description
--- | ---
**train** | The samples that the model will be trained upon. Later, we’ll see how we can make *cross-folds from our training split*. Unsupervised learning will only have a training split.
**validation** (optional) | The samples used for training evaluation. Ensures that the test set is not revealed to the model during training.
**test** (optional) | The samples the model has never seen during training. Used to assess how well the model will perform on unobserved, natural data when it is applied in the real world aka how generalizable it is.

> Because Splitset groups together all of the data wrangling entities (Features, Label, Folds) it essentially represents a *Pipeline*, which is why it bears the name Pipeline in the High-Level API.

---

### 8a. Create

##### └── `Splitset.make()`

```python
Splitset.make(
    feature_ids
    , label_id
    , size_test
    , size_validation
    , bin_count
    , fold_count
    , unsupervised_stratify_col
    , name
    , description
    , predictor_id
)
```

Argument | Type | Default | Description
--- | --- | --- | ---
**feature_ids** | list(int) | Required | Multiple `Feature.id`'s may be included to enable multi-modal (aka mixed data-type) analysis. All of these Features must have the same number of samples.
**label_id** | int | None | The Label to be used as a target for supervised analysis. Must have the same number of samples as the Features.
**size_test** | float | None | Percent of samples to be placed into the test split. Must be `> 0.0` and `< 1.0`.
**size_validation** | float | None | Percent of samples to be placed into the validation split. Must be `> 0.0` and `< 1.0`. If this is not None and used in combination with `fold_count`, then there will be 4 splits.
**bin_count** | int | None | For continuous stratification columns, how many bins (aka quantiles) should be used?
**fold_count** | int | None | The number of cross-validation folds to generate. See [Cross-Validation](#8b.-Cross-Validation).
**unsupervised_stratify_col** | str | None | Used during unsupervised analysis. Specify a column from the first Feature in feature_ids to use for stratification. For example, when forecasting, it may make sense to stratify by the day of the year.
**name** | str | None | Used for versioning a pipeline (collection of inputs, label, and stratification). Two versions cannot have identical attributes.
**description** | str | None | What is unique about this pipeline?

> `size_train = 1.00 - (size_test + size_validation)`. The backend ensures that the sizes sum to 1.00.

> *How does continuous binning work?* Reference the handy `pandas.qcut()` function and the call used in the source code, `pd.qcut(x=array_to_bin, q=bin_count, labels=False, duplicates='drop')`, for more detail.
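
As an illustration of that call, the sketch below bins a small array of made-up continuous values. The arrays are hypothetical, and how the resulting bin numbers are consumed for stratification afterwards is not shown here.

```python
import numpy as np
import pandas as pd

# Hypothetical continuous values to stratify on (illustrative only).
array_to_bin = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
bin_count = 4

# Same call as quoted above: each value is replaced by its quantile bin number.
bin_ids = pd.qcut(x=array_to_bin, q=bin_count, labels=False, duplicates='drop')

# With many repeated values, some quantile edges coincide.
# `duplicates='drop'` merges them, so fewer than `bin_count` bins may come back.
skewed_values = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 3.0])
skewed_ids = pd.qcut(x=skewed_values, q=bin_count, labels=False, duplicates='drop')
```

Because `labels=False`, the result is a plain array of integer bin numbers, which can be treated like class labels when stratifying.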

---

### 8b. Cross-Validation

Cross-validation is triggered by `fold_count:int` during Splitset creation. Reference the [scikit-learn documentation](https://scikit-learn.org/stable/modules/cross_validation.html) to learn more about cross-validation.

![cross fold objects](../_static/images/api/cross_fold_objects.png)

Each row in the diagram above is a `Fold` object.

Each green/blue box represents a bin of stratified samples. During preprocessing and training, we rotate which blue bin serves as the validation samples (`fold_validation`). The remaining green bins in the row serve as the training samples (`folds_train_combined`).

Let's say we defined `fold_count=5`. What are the implications?
- Creates 5 `Folds` related to a `Splitset`.
- 5x more models will be trained for each experiment.
- 5x more preprocessing and caching; the backend must preprocess each Fold separately to prevent data leakage by excluding `fold_validation` from the `fit`. Fits are saved to the Fold object as opposed to the Splitset object.
- 5x more evaluation.
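
The rotation described above can be sketched in plain Python. This is only an illustration: `make_folds` is a hypothetical helper, it deals samples into bins round-robin instead of stratifying them, and it borrows the key names `fold_validation` and `folds_train_combined` from the text.

```python
def make_folds(sample_indices, fold_count):
    """Hypothetical helper: one dict per fold, rotating the validation bin."""
    # Deal the samples into `fold_count` bins (no stratification in this sketch).
    bins = [sample_indices[i::fold_count] for i in range(fold_count)]
    folds = []
    for fold_index, validation_bin in enumerate(bins):
        # Every bin except the current one is combined into the training samples.
        train_combined = sorted(
            idx
            for bin_index, bin_ in enumerate(bins)
            if bin_index != fold_index
            for idx in bin_
        )
        folds.append(
            dict(fold_validation=validation_bin, folds_train_combined=train_combined)
        )
    return folds

folds = make_folds(list(range(20)), fold_count=5)
```

Each fold validates on a different bin, which is why preprocessing has to be fit once per fold: the validation bin must be excluded from every `fit`.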

*Disclaimer*

> DO NOT use cross-validation unless your *(total sample count / fold_count)* still gives you an accurate representation of your entire sample population. If you are ignoring that advice and stretching to perform cross-validation, then at least ensure that your *total sample count* is evenly divisible by *fold_count*. Folds naturally have fewer samples, so a handful of incorrect predictions have the potential to offset your aggregate metrics. Both of these tips help avoid poorly stratified/ undersized folds that seem to perform either too well (only most common label class present) or poorly (handful of samples and a few inaccurate predictions on an otherwise good model).
> 
> Candidly, if you've ever performed cross-validation manually, let alone systematically, you'll know that, barring stratification of continuous labels, it's easy enough to construct the folds, but then it's a pain to generate performance metrics (e.g. `zero_division`, absent OHE classes) due to the absence of outlying classes and bins. Time has been invested to handle these scenarios elegantly so that folds can be treated as first-class citizens alongside splits. That being said, if you try to do something undersized like "150 samples in your dataset and a `fold_count` > 3 with `unique_classes` > 4," then you may run into edge cases.
>
> Cross validation is only included in AIQC to allow best practices to be upheld and to show off the power of systematic preprocessing.

---

## 7. Defining Architectures & Hyperparameters.

### a) Define an `Algorithm`

Now that our data has been prepared, we transition to the other half of the ORM where the focus is the logic that will be applied to that data.

> An `Algorithm` is our ORM's codename for a machine learning model since *Model* is the most important *reserved word* when it comes to ORMs.

The following attributes tell AIQC how to handle the Algorithm behind the scenes:

* `library` - right now, only 'keras' is supported.

  * Each library's model object and callbacks (history, early stopping) need to be handled differently.
  
  
* `analysis_type` - right now, these types are supported:

  * `'classification_multi'`, `'classification_binary'`, `'regression'`.
  
  * Used to determine which performance metrics to run.
  
  * Must be compatible with the type of label fed to it.

#### Model Definition

The `Algorithm` is composed of the functions:

* `fn_build`.

* `fn_lose` (optional, inferred).

* `fn_optimize` (optional, inferred).

* `fn_train`.

* `fn_predict` (optional, inferred).

> May provide overridable defaults for build and train in the future.

You can name the functions whatever you want, but do not change the predetermined arguments (e.g. `input_shape`,`**hp`, `model`, etc.) or their position.

As we define these functions, we'll see that we can pass a dictionary of *hyperparameters* into these functions using the `**hp` kwarg, and access them like so: `hp['<some_variable_name>']`. Later, we'll provide a list of values for each entry in the hyperparameters dictionary.
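
The `**hp` mechanism is ordinary Python keyword unpacking. The sketch below uses a hypothetical stand-in, `fn_build_demo`, that returns a plain dictionary instead of a model so that it runs without any deep learning library.

```python
def fn_build_demo(features_shape, label_shape, **hp):
    """Hypothetical stand-in for `fn_build`: describes a model instead of building one."""
    return dict(
        units=hp['neuron_count'],   # pulled from the hyperparameters
        input_shape=features_shape,
        outputs=label_shape[0],
    )

# One combination of hyperparameters, i.e. a single value per entry.
hyperparameter_combo = dict(neuron_count=12, epoch_count=30)

# Unpacking the dictionary with `**` makes every entry available as `hp[...]`.
summary = fn_build_demo((4,), (3,), **hyperparameter_combo)
```

Entries that a function does not use, such as `epoch_count` here, are simply ignored by it.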

Let's import the modules that we need.

```python
import tensorflow as tf
import tensorflow.keras.layers as layers
```

> Later, when running your `Jobs`, if you receive a "module not found" error, then you can try troubleshooting by importing that module directly within the function where it is used.

---

##### └── `fn_build`

You can build your topology however you like, just be sure to `return model`. Also, you don't have to use any of the hyperparameters (`**hp`) if you don't want to.

The automatically provided `features_shape` and `label_shape` are handy because:

* The number of feature/ label columns is mutable due to encoders (e.g. OHE). 

* Shapes can be less obvious in multi-dimensional scenarios like colored images.

```python
def fn_build(features_shape, label_shape, **hp):
    model = tf.keras.models.Sequential()
    model.add(layers.Dense(units=hp['neuron_count'], input_shape=features_shape, activation='relu', kernel_initializer='he_uniform'))
    model.add(layers.Dropout(0.2))
    model.add(layers.Dense(units=hp['neuron_count'], activation='relu', kernel_initializer='he_uniform'))
    model.add(layers.Dense(units=label_shape[0], activation='softmax'))
    return model
```

---

##### └── `fn_lose` (optional)

We can't just specify the loss function in our training loop because we will need it later on when it comes time to produce metrics about other splits/ folds.

If you do not provide an `fn_lose` then one will be automatically selected for you based on the `Algorithm.analysis_type` you are conducting and the `Algorithm.library` you are using.

```python
def fn_lose(**hp):
    loser = tf.keras.losses.CategoricalCrossentropy()
    return loser
```

---

##### └── `fn_optimize` (optional)

Some deep learning libraries persist their model and optimizer separately during checkpoint/exporting. So `fn_optimize` provides an isolated way to access the optimizer. It also allows us to automatically set the optimizer.

If you do not provide an `fn_optimize` then one will be automatically selected for you based on the `Algorithm.analysis_type` you are conducting and the `Algorithm.library` you are using.

```python
def fn_optimize(**hp):
    optimizer = tf.keras.optimizers.Adamax(learning_rate=0.01)
    return optimizer
```

> If you want to define your own optimizer, then you should do so within this function, rather than relying on `model.compile(optimizer='<some_optimizer_name>')`. If you do not define an optimizer, then `Adamax` will be used by default.

---

##### └── `fn_train`

* `samples_train` - the appropriate data will be fed into the training cycle. For example, `Fold.samples[fold_index]['folds_train_combined']` or `Splitset.samples['train']`.

* `samples_evaluate` - the appropriate data is made available for evaluation. For example, `Fold.samples[fold_index]['fold_validation']`, `Splitset.samples['validation']`, or `Splitset.samples['test']`.

```python
def fn_train(
    model, loser, optimizer,
    train_features, train_label,
    eval_features, eval_label,
    **hp
):
    model.compile(
        loss        = loser
        , optimizer = optimizer
        , metrics   = ['accuracy']
    )
    model.fit(
        train_features, train_label
        , validation_data = (eval_features, eval_label)
        , verbose         = 0
        , batch_size      = 3
        , epochs          = hp['epoch_count']
        , callbacks       = [tf.keras.callbacks.History()]
    )
    return model
```

---

##### Training loop

* TensorFlow: either write a custom loop or use Keras.
* PyTorch: either write a custom loop or use AIQC's `utils.pytorch.fit`

```python
model, history = fit(
    model:object
    , loser:object
    , optimizer:object
    
    , train_features:object
    , train_label:object
    , eval_features:object
    , eval_label:object
    
    , epochs:int              = 30
    , batch_size:int          = 5
    , enforce_sameSize:bool   = True
    , allow_singleSample:bool = False
    , metrics:list            = None
)
```

| Argument | Description |
| --- | --- |
| *enforce_sameSize* | removes the last batch if it is a different size |
| *allow_singleSample* | if `False`, raises an error when a batch contains only 1 sample, which often leads to layer errors during training |
| *metrics* | instantiated `torchmetrics` classes e.g. `Accuracy` |
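
To illustrate what the two batching arguments guard against, here is a rough pure-Python sketch. It is not the code inside `utils.pytorch.fit`: `make_batches` is a hypothetical helper that only mimics the behavior described in the table.

```python
def make_batches(samples, batch_size=5, enforce_sameSize=True, allow_singleSample=False):
    """Hypothetical helper that mimics the two batching safeguards."""
    batches = [samples[i : i + batch_size] for i in range(0, len(samples), batch_size)]
    # Drop a trailing batch whose size differs from the others.
    # (A lone undersized batch is kept so that something is left to train on.)
    if enforce_sameSize and len(batches) > 1 and len(batches[-1]) != batch_size:
        batches = batches[:-1]
    # Refuse single-sample batches unless they are explicitly allowed.
    if not allow_singleSample and any(len(batch) == 1 for batch in batches):
        raise ValueError("A batch contains only 1 sample.")
    return batches

samples = list(range(11))  # 11 samples do not divide evenly into batches of 5

# The leftover batch of 1 sample is removed, leaving two full batches.
batches = make_batches(samples, batch_size=5)
```

With `enforce_sameSize=False` the same call would raise an error because the last batch holds a single sample, unless `allow_singleSample=True` is also passed.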

---

##### Customizable history

The goal of the `Predictor.history` object is to record the training and evaluation metrics at the end of each epoch so that they can be interpreted in the learning curve plots. Reference the [visualization](visualization.html) section.

- *Keras*: any `metrics=[]` specified are automatically added to the `History` callback object.

- *PyTorch*: users are responsible for calculating their own metrics (we recommend the `torchmetrics` package) and placing them into a `history` dictionary that mirrors the schema of the Keras history object. Reference the torch [examples](gallery/pytorch/multi_class.html).

> The schema of this dictionary is as follows: `dict(*=ndarray, val_*=ndarray)`. For example, if you wanted to record the history of the 'loss' and 'accuracy' metrics manually for PyTorch, you would construct it like so:

```python
history = dict(
    loss=ndarray, val_loss=ndarray,
    accuracy=ndarray, val_accuracy=ndarray,
)
```
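
One way to fill such a dictionary is to append one value per metric at the end of every epoch and convert the lists to arrays once training finishes. The sketch below assumes that approach. The numbers are placeholders rather than real training results; in an actual loop they would come from the loss function and `torchmetrics`.

```python
import numpy as np

# Placeholder per-epoch values, standing in for what a training loop would compute.
epoch_results = [
    dict(loss=0.90, val_loss=1.00, accuracy=0.50, val_accuracy=0.40),
    dict(loss=0.60, val_loss=0.70, accuracy=0.70, val_accuracy=0.60),
    dict(loss=0.40, val_loss=0.50, accuracy=0.80, val_accuracy=0.70),
]

metric_names = ("loss", "val_loss", "accuracy", "val_accuracy")
history = {name: [] for name in metric_names}

# End of each epoch: record the training and evaluation value of every metric.
for result in epoch_results:
    for name in metric_names:
        history[name].append(result[name])

# After training: one ndarray per metric, with one entry per epoch.
history = {name: np.array(values) for name, values in history.items()}
```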

---

##### Optional, callback to stop training early.

*Early stopping* isn't just about efficiency in reducing the number of `epochs`. If you've specified 300 epochs, there's a chance your model catches on to the underlying patterns early, say around 75-125 epochs. At this point, there's also a good chance that what it learns in the remaining epochs will cause it to overfit on patterns that are specific to the training data, and thereby lose its simplicity/ generalizability.

> The `val_` prefix refers to the evaluation samples.
>
> Remember, regression does not have accuracy metrics.
>
> `TrainingCallback.MetricCutoff` is a custom class we wrote to make *early stopping* easier, so you won't find information about it in the official Keras documentation.

```python
from aiqc.utils.tensorflow import TrainingCallback
```

```python
def fn_train(model, loser, optimizer, samples_train, samples_evaluate, **hp):
    model.compile(
        loss = loser
        , optimizer = optimizer
        , metrics = ['accuracy']
    )

    # Define one or more metrics to monitor.
    metrics_cutoffs = [
        {"metric":"val_accuracy", "cutoff":0.96, "above_or_below":"above"},
        {"metric":"val_loss", "cutoff":0.1, "above_or_below":"below"}
    ]
    cutoffs = TrainingCallback.MetricCutoff(metrics_cutoffs)
    # Remember to append `cutoffs` to the list of callbacks.
    callbacks = [tf.keras.callbacks.History(), cutoffs]

    # No changes here.
    model.fit(
        samples_train["features"]
        , samples_train["labels"]
        , validation_data = (
            samples_evaluate["features"]
            , samples_evaluate["labels"]
        )
        , verbose = 0
        , batch_size = 3
        , epochs = hp['epoch_count']
        , callbacks = callbacks
    )

    return model
```
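
To show how a list of cutoff dictionaries like the one above could be evaluated at the end of an epoch, here is a small sketch. It does not reproduce the internals of `TrainingCallback.MetricCutoff`: `cutoffs_satisfied` is a hypothetical function, and it assumes that training stops only when *every* rule is met and that the comparisons are inclusive.

```python
def cutoffs_satisfied(epoch_logs, metrics_cutoffs):
    """Hypothetical check: True only if every monitored metric meets its cutoff."""
    for rule in metrics_cutoffs:
        value = epoch_logs[rule["metric"]]
        if rule["above_or_below"] == "above":
            satisfied = value >= rule["cutoff"]  # inclusive comparison is an assumption
        else:
            satisfied = value <= rule["cutoff"]
        if not satisfied:
            return False
    return True

metrics_cutoffs = [
    {"metric": "val_accuracy", "cutoff": 0.96, "above_or_below": "above"},
    {"metric": "val_loss", "cutoff": 0.1, "above_or_below": "below"},
]

# Placeholder end-of-epoch logs in the shape Keras passes to callbacks.
keep_training = cutoffs_satisfied({"val_accuracy": 0.97, "val_loss": 0.30}, metrics_cutoffs)
stop_training = cutoffs_satisfied({"val_accuracy": 0.97, "val_loss": 0.08}, metrics_cutoffs)
```

In the first case accuracy is high enough but the loss is not yet low enough, so the check returns `False`; in the second case both rules pass.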

---

##### └── `fn_predict` (optional)

`fn_predict` will be generated for you automatically if set to `None`. The `analysis_type` and `library` of the Algorithm help determine how to handle the predictions.

i) Regression default.

```python
def fn_predict(model, samples_predict):
    predictions = model.predict(samples_predict['features'])
    return predictions
```

ii) Classification binary default.

All classification `predictions`, both multiclass and binary, must be returned in ordinal format.

> For most libraries, classification algorithms output *probabilities* as opposed to actual predictions when running `model.predict()`. We want to return both of these objects `predictions, probabilities` (the order matters) to generate performance metrics behind the scenes.

```python
def fn_predict(model, samples_predict):
    probabilities = model.predict(samples_predict['features'])
    # This is the official keras replacement for binary classes `.predict_classes()`
    # It returns one array per sample: `[[0][1][0][1]]`
    predictions = (probabilities > 0.5).astype("int32")

    return predictions, probabilities
```

iii) Classification multiclass default.

```python
def fn_predict(model, samples_predict):
    import numpy as np
    probabilities = model.predict(samples_predict['features'])
    # This is the official keras replacement for multiclass `.predict_classes()`
    # It returns one ordinal per sample: `[0 2 1 2]`
    predictions = np.argmax(probabilities, axis=-1)

    return predictions, probabilities
```
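
The two conversions from probabilities to ordinal predictions can be checked without a trained model. The arrays below are made-up probabilities, shaped like the output of a sigmoid layer (one column) and of a softmax layer (one column per class).

```python
import numpy as np

# Binary: one probability per sample, thresholded at 0.5.
binary_probabilities = np.array([[0.2], [0.9], [0.4], [0.7]])
binary_predictions = (binary_probabilities > 0.5).astype("int32")

# Multiclass: one probability per class, the most likely class index wins.
multi_probabilities = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
])
multi_predictions = np.argmax(multi_probabilities, axis=-1)
```

Note the difference in shape: the binary predictions keep the `(samples, 1)` shape of the probabilities, whereas `np.argmax` collapses the class axis into a flat array of class indices.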

---

#### Group the functions together in an `Algorithm`!

```python
algorithm = Algorithm.make(
    library = "keras"
    , analysis_type = "classification_multi"
    , fn_build = fn_build
    , fn_train = fn_train
    , fn_optimize = fn_optimize # Optional
    , fn_predict = fn_predict # Optional
    , fn_lose = fn_lose # Optional
)
```

> <!> Remember to use `make` and not `create`. Deceptively, `create` exists because it is a standard, built-in ORM method. However, it creates the record without any validation logic.

---

### b) Combinations of hyperparameters with `Hyperparamset`.

Parameters are fed into Algorithm functions.

The `hyperparameters` below will be automatically fed into the functions above as `**kwargs` via the `**hp` argument we saw earlier.

For example, wherever you see `hp['epoch_count']`, it will pull from the *key:value* pair `"epoch_count": [30, 60]` seen below, where "model A" would have 30 epochs and "model B" would have 60 epochs.

```python
hyperparameters = dict(
    neuron_count    = [12]
    , epoch_count   = [30, 60]
    , learning_rate = [0.01, 0.03]
)
```

---

#### Hyperparameter Selection Strategies.

##### Grid search strategy.

By default AIQC will generate all possible combinations.
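
Generating every combination amounts to taking the Cartesian product of the value lists. The sketch below does this with `itertools.product` from the standard library; it illustrates the idea and is not necessarily how AIQC builds its combinations internally.

```python
from itertools import product

hyperparameters = dict(
    neuron_count=[12],
    epoch_count=[30, 60],
    learning_rate=[0.01, 0.03],
)

names = list(hyperparameters)
# One dictionary per combination: 1 * 2 * 2 = 4 combinations in total.
combos = [
    dict(zip(names, values))
    for values in product(*hyperparameters.values())
]
```

The number of combinations is the product of the list lengths, so adding a third value to two different parameters here would already grow the grid from 4 to 9.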

> With enough practice, practitioners will get a feel for what parameters and topologies make sense so you'll rely on shotgun-style approaches less and less. If you limit your experiments to 1-2 parameters at a time then it's easy to see their effect as an *independent variable*. You should really start with high-level things such as topologies (# of layers, # neurons per layer) and batch size before moving on to tuning the intra-layer nuances (activation methods, weight initialization). You're essentially testing high/ medium/ low or default/ edge case scenarios for each parameter.

##### Random selection strategy.

Testing many different combinations in your initial runs can be a good way to get a feel for the parameter space. Although if you are doing this you'll find that many of your combinations are a bit too similar. So randomly sampling (with replacement) a few of them is a less computationally expensive way to go about this.

* `search_count:int` the fixed # of combinations to sample.

* `search_percent:float` a % of combinations to sample.

##### Bayesian selection strategy.

"TPE (Tree-structured Parzen Estimator)" via `hyperopt` has been suggested as a future area to explore.

```python
hyperparamset = Hyperparamset.from_algorithm(
    algorithm_id = algorithm.id
    , hyperparameters = hyperparameters
    , search_count = None
    , search_percent = None
)
```

---

#### `Hyperparamcombo` objects.

Each unique combination of hyperparameters is recorded as a `Hyperparamcombo`.

Ultimately, a training `Job` is constructed for each unique combination of hyperparameters aka `Hyperparamcombo`.

```python
hyperparamset.hyperparamcombo_count
```

```
4
```

```python
hyperparamset.hyperparamcombos[0].get_hyperparameters(as_pandas=True)
```

| | param | value |
| --- | --- | --- |
| 0 | neuron_count | 12.0 |
| 1 | epoch_count | 30.0 |
| 2 | learning_rate | 0.01 |


---

## 8. `Queue` of training `Jobs`.

The `Queue` is the central object of the "logic side" of the ORM. It ties together everything we need for training and hyperparameter tuning.

```python
queue = Queue.from_algorithm(
    algorithm_id = algorithm.id
    , splitset_id = splitset.id
    , hyperparamset_id = hyperparamset.id # Optional.
    , repeat_count = 3
    , permute_count = 3
)
```

* `repeat_count:int=1` allows us to run the same `Job` multiple times. Normally, each `Job` has 1 `Predictor` associated with it upon completion. However, when `repeat_count` (> 1 of course) is used, a single `Job` will have multiple `Predictors`.

> Due to the fact that training is a *nondeterministic* process, we get different weights each time we train a model, even if we use the same set of parameters. Perhaps you have the right topology and parameters, but, this time around, the model just didn't recognize the patterns. Similar to flipping a coin, there is a degree of chance in it, but the real trend averages out upon repetition.

* `hide_test:bool=False` excludes the test split from the performance metrics and visualizations. This avoids data leakage by forcing the user to make decisions based on the performance of their model on the training and evaluation samples.

* `permute_count:int=3` controls the number of times each feature column is shuffled. The loss of each shuffled copy is measured, the median of those losses is taken, and that median is compared to the baseline training loss to gauge the column's impact. Feature importance is calculated for each column of each `Feature.id`, except for `Feature.dataset.dataset_type=='image'`. If you have many columns and all you care about is the final prediction, then it may make sense to set `permute_count=0` because permutations are computationally expensive and difficult to parallelize.

> `[training loss - (median loss of <permute_count> permutations)]`
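
The formula above can be sketched with NumPy as follows. This is a simplified illustration, not AIQC's implementation: `permutation_importance` and `mean_squared_error` are hypothetical helpers, and the "model" is a toy function that only looks at the first column.

```python
import numpy as np

def mean_squared_error(labels, predictions):
    return float(np.mean((labels - predictions) ** 2))

def permutation_importance(predict, features, labels, permute_count=3, seed=0):
    """Hypothetical helper: training loss minus median permuted loss, per column."""
    rng = np.random.default_rng(seed)
    training_loss = mean_squared_error(labels, predict(features))
    importance = {}
    for column in range(features.shape[1]):
        permuted_losses = []
        for _ in range(permute_count):
            shuffled = features.copy()
            # Shuffle a single column to break its relationship with the label.
            shuffled[:, column] = rng.permutation(shuffled[:, column])
            permuted_losses.append(mean_squared_error(labels, predict(shuffled)))
        importance[column] = training_loss - float(np.median(permuted_losses))
    return importance

# Toy data: the label depends on column 0 only, and column 1 is noise to the model.
features = np.column_stack([np.arange(20.0), np.linspace(5.0, 6.0, 20)])
labels = 2.0 * features[:, 0]
predict = lambda x: 2.0 * x[:, 0]

importance = permutation_importance(predict, features, labels)
```

With the subtraction written in this order, a column the model relies on gets a negative value because shuffling it increases the loss, while an ignored column stays at zero.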

---

### a) `Job` objects.

Each `Job` in the Queue represents a `Hyperparamcombo` that needs to be trained.

If a `Splitset.fold_count` was specified, then: 

- The number of jobs = `hyperparamcombo_count` * `fold_count`.
- Each Job will have a `Fold`.
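
As a quick sanity check of how these counts multiply, here is a tiny sketch. `count_jobs` and `count_predictors` are hypothetical helpers, and the predictor count assumes each `Job` is trained `repeat_count` times as described earlier.

```python
def count_jobs(hyperparamcombo_count, fold_count=None):
    """Hypothetical helper: one Job per combination, multiplied by folds if present."""
    return hyperparamcombo_count * (fold_count or 1)

def count_predictors(job_count, repeat_count=1):
    """Hypothetical helper: each Job yields `repeat_count` Predictors."""
    return job_count * repeat_count

jobs_without_folds = count_jobs(hyperparamcombo_count=4)
jobs_with_folds = count_jobs(hyperparamcombo_count=4, fold_count=5)
predictors = count_predictors(jobs_without_folds, repeat_count=3)
```

For the 4 combinations defined earlier, this gives 4 Jobs without folds, 20 Jobs with `fold_count=5`, and 12 Predictors when `repeat_count=3` is used without folds.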

---

### b) Executing `Jobs`.

There are two ways to execute a Queue of Jobs:

#### i) `queue.run_jobs()`

* Jobs are simply run in a loop on the main *Process*.

* Stop the Jobs with a keyboard interrupt e.g. `ctrl+Z/D/C` in Python shell or `i,i` in Jupyter.

* It is the more reliable approach on Win/Mac/Lin.

* Although this locks your main process (can't write more code) while models train, you can still fire up a second shell session or notebook.

* Prototype your training jobs in this method so that you can see any errors that arise in the console.


#### ii) DEPRECATED - `queue.run_jobs(in_background=True)`.

*Support for background processing has not been restored after decoupling the preprocessing pipelines from the Queue/Job logic.*

* The Jobs loop is executed on a separate, parallel `multiprocessing.Process`

* Stop the Jobs with `queue.stop_jobs()` (also deprecated), which kills the parallel *Process* unless it already failed.

* The benefit is that you can continue to code while your models are trained. There is no performance boost.

* On Mac and Linux (Unix), `'fork'` multiprocessing is used (`force=True`), which allows us to display the progress bar. FYI, even in 'fork' mode, Python multiprocessing is much more fragile in Python 3.8, which seems to be caused by how pickling is handled in passing variables to the child process.

* On Windows, `'spawn'` multiprocessing is used, which requires polling:

  * `queue.poll_statuses()`
  
  * `queue.poll_progress(raw:bool=False, loop:bool=False, loop_delay:int=3)` where `raw=True` is just a float, `loop=True` won't stop checking jobs until they are all complete, and `loop_delay=3` checks the progress every 3 seconds. 
  
* Also, during stress tests, I observed that when running multiple queues at the same time, the SQLite database would lock when simultaneous writes were attempted.

```python
queue.run_jobs()
```

```
🔮 Training Models 🔮: 100%|████████████████████████████████████████| 12/12 [00:49<00:00,  4.12s/it]
```


The queue is interruptible. You can stop the execution of a queue and resume it later.

> This also comes in handy if your machine or Python kernel crashes or is interrupted by accident. Whatever the reason, rest easy: just call `run_jobs()` again to pick up where you left off. Be aware that the `tqdm` iteration time in the progress bar will be wrong because it will be divided by the jobs that have already run.

---

##### Preprocessing during Job is recorded

During the execution of a Job, the LabelCoder and FeatureCoder(s) tied to the Label and Feature(s) of the Splitset used during training will record their `fit()`.

- `FittedLabelCoder` for the LabelCoder used.
  - `fitted_encoders:object` to store the `fit`.
- `FittedEncoderset` for each Feature used.
  - `fitted_encoders:list` to store the `fit`(s).

This process is critical for:

- `inverse_transform()`'ing aka decoding predictions.
- Encoding new data during inference in exactly the same way as the samples that the model was trained on.
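
Why the fit has to be stored can be shown with a toy encoder. `MinMaxCoder` below is a hypothetical, pure-Python stand-in and not an AIQC or scikit-learn class. The point is that the minimum and maximum learned from the training samples are reused later, both for encoding new data and for decoding predictions.

```python
class MinMaxCoder:
    """Hypothetical stand-in for a fitted encoder (scales values to the 0-1 range)."""

    def fit(self, values):
        # The fit is just the state learned from the training samples.
        self.low, self.high = min(values), max(values)
        return self

    def transform(self, values):
        span = self.high - self.low
        return [(value - self.low) / span for value in values]

    def inverse_transform(self, encoded):
        span = self.high - self.low
        return [value * span + self.low for value in encoded]

# Fit once, on the training samples only.
fitted_coder = MinMaxCoder().fit([10.0, 20.0, 30.0])

# Inference: new data is encoded with the stored fit, never re-fit.
new_encoded = fitted_coder.transform([15.0, 40.0])

# Decoding: a prediction in encoded space is mapped back to the original units.
decoded = fitted_coder.inverse_transform([0.5])
```

Re-fitting on the new data would silently change the scale the model was trained on, which is why the fit is persisted instead.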

It takes a lot of joins to fetch the fitted encoders after the fact. So these methods are used behind the scenes to make it a bit easier:

- `Predictor.get_fitted_encoders(job, label)`
- `Predictor.get_fitted_LabelCoder(job, feature)`

---

### c) `Predictors` are the trained models.

Each `Job` trains a `Predictor`. The following attributes are automatically written to the `Predictor` after training.
    
* `model_file`: serialization varies between the Keras and PyTorch deep learning frameworks.

* `input_shapes`: used by `get_model()` during inference.

* `history`: per epoch metrics recorded during training.

```python
predictor = queue.jobs[0].predictors[0]
```

#### Fetching the trained model.

```python
predictor.get_model()
```

```
<keras.engine.sequential.Sequential at 0x18227fe90>
```

#### Fetching the hyperparameters.

This method is just a proxy that passes arguments to `Hyperparamcombo.get_hyperparameters()`.

```python
predictor.get_hyperparameters(as_pandas=False)
```

---

### d) `Predictions` are the output of a Predictor.

When you feed samples through a Predictor, you get Predictions. During training, Predictions are automatically generated for every split/fold that was tied to the Queue.

#### Fetching metrics.

| Attribute | Description |
| --- | --- | 
| *predictions* | decoded predictions ndarray per split/ fold/ inference |
| *feature_importance* | importance of each column. only for training split/fold |
| *probabilities* | prediction probabilities per split/ fold. `None` for regression. |
| *metrics* | statistics for each split/fold that vary based on the analysis_type.  |
| *metrics_aggregate* | average for each statistic across all splits/folds. |
| *plot_data* | metrics reformatted for plot functions. |

```python
queue.jobs[0].predictors[0].predictions[0].metrics
```

```
{'train': {'accuracy': 0.9607843137254902,
  'f1': 0.9607503607503607,
  'loss': 0.08033698052167892,
  'precision': 0.9618055555555555,
  'recall': 0.9607843137254902,
  'roc_auc': 0.9985582468281432},
 'validation': {'accuracy': 0.9444444444444444,
  'f1': 0.9440559440559441,
  'loss': 0.13840313255786896,
  'precision': 0.9523809523809523,
  'recall': 0.9444444444444444,
  'roc_auc': 1.0},
 'test': {'accuracy': 0.9333333333333333,
  'f1': 0.9333333333333333,
  'loss': 0.1355634480714798,
  'precision': 0.9333333333333333,
  'recall': 0.9333333333333333,
  'roc_auc': 0.9900000000000001}}
```

---

## 9. Metrics & Visualization

For more information on visualization of performance metrics, reference the [Visualization & Metrics](visualization.html) documentation.