# Batch Predict

A function that runs a given dataset through a given model, producing a **Result Set** and, optionally, performing **Data Drift Analysis**.

In this notebook we will go over the function's docs and outputs and see an end-to-end example of running it.

1. [Documentation](#chapter1)
2. [Results Prediction](#chapter2)
3. [Data Drift Analysis](#chapter3)
4. [End-to-end Demo](#chapter4)

<a id="chapter1"></a>
## 1. Documentation

Perform a prediction on a given dataset with the given model. The function can also perform drift analysis between the sample set statistics stored in the model and the current input data. The drift rule is the per-feature mean of the TVD and Hellinger scores, evaluated against the thresholds configured here.

### 1.1. Parameters:
* **context**: `mlrun.MLClientCtx`
    An MLRun context.
* **model**: `str`
    The model Store path, a logged model URI.
* **dataset**: `Union[mlrun.DataItem, list, dict, pd.DataFrame, pd.Series, np.ndarray]`
    The dataset to infer through the model.
    * Can be passed in `inputs` as either a Dataset artifact / Feature vector URI.
    * Or, in `parameters` as a list, dictionary or numpy array.
* **drop_columns**: `Union[str, List[str], int, List[int]]` = `None`
    A string / integer or a list of strings / integers that represent the column names / indices to drop. When the dataset is a list or a numpy array this parameter must be represented by integers.
* **label_columns**: `Union[str, List[str]]` = `None`
    The target label(s) of the column(s) in the dataset. These names will be used as the column names for the predictions. The default name is `"predicted_label_i"` for the `i`-th column.
* **log_result_set**: `bool` = `True`
    Whether to log the result set - a DataFrame of the given inputs concatenated with the predictions. Defaulted to `True`.
* **result_set_name**: `str` = `"prediction"`
    The DB key used to name the prediction result, which is also used as its filename. Defaulted to `"prediction"`.
* **perform_drift_analysis**: `bool` = `None`
    Whether to perform drift analysis between the sample set of the model object and the given dataset. The default, `None`, means drift analysis is performed if the model has sample set statistics.
* **sample_set**: `Union[mlrun.DataItem, list, dict, pd.DataFrame, pd.Series, np.ndarray]`
    A sample dataset to compare the inputs against in the drift analysis. By default, the sample set stored in the model artifact itself is used.
    * Can be passed in `inputs` as either a Dataset artifact / Feature vector URI.
    * Or, in `parameters` as a list, dictionary or numpy array.
* **drift_threshold**: `float` = `0.7`
    The threshold at which to mark drifts. Defaulted to 0.7.
* **possible_drift_threshold**: `float` = `0.5`
    The threshold at which to mark possible drifts. Defaulted to 0.5.
* **inf_capping**: `float` = `10.0`
    The value to use in place of a computed value that reaches infinity. Defaulted to 10.0.
* **artifacts_tag**: `str` = `""`
    Tag to use for all the artifacts produced by the function. Defaulted to no tag.
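
To make the `drop_columns` and `label_columns` descriptions above more concrete, here is a minimal sketch of the behavior they describe. It is not the function's actual implementation: the helper names `drop_columns_sketch` and `default_label_names` are hypothetical, and treating integers as positional indices for a DataFrame is an assumption.

```python
from typing import List, Union

import numpy as np
import pandas as pd


def drop_columns_sketch(
    dataset: Union[pd.DataFrame, np.ndarray],
    drop_columns: Union[str, int, List[str], List[int]],
):
    """Hypothetical helper illustrating the `drop_columns` parameter."""
    # Normalize a single name / index into a list:
    if not isinstance(drop_columns, list):
        drop_columns = [drop_columns]

    if isinstance(dataset, pd.DataFrame):
        # Assumption: integers are positional indices, strings are column names:
        names = [
            dataset.columns[column] if isinstance(column, int) else column
            for column in drop_columns
        ]
        return dataset.drop(columns=names)

    # A numpy array has no column names, so only integer indices apply:
    return np.delete(dataset, drop_columns, axis=1)


def default_label_names(n_outputs: int) -> List[str]:
    """Hypothetical helper producing the default prediction column names."""
    return [f"predicted_label_{i}" for i in range(n_outputs)]
```

For example, `drop_columns_sketch(df, "b")` returns `df` without its `"b"` column, while for a numpy array only integer indices such as `[0, 2]` are meaningful.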

### 1.2. Outputs

The outputs are split between the two actions the function can perform:
* **Results Prediction** - Logs a dataset artifact named by the `result_set_name` parameter.
* **Data Drift Analysis** - Logs:
    * A `plotly` artifact named `"drift_table_plot"` with a visualization of the drift results and histograms.
    * A JSON file with a drift status and metric per feature.
    * The overall drift status and metric, registered as results.

For more details, see the next chapters.

<a id="chapter2"></a>
## 2. Results Prediction

The result set is a concatenated dataset of the inputs ($X$) provided and the predictions ($Y$) yielded by the model, so it will be $X | Y$.

For example, if the `dataset` given as inputs was:

| x1  | x2  | x3  | x4  | x5  |
|-----|-----|-----|-----|-----|
| ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... |

And the outputs yielded by the model's prediction were:

| y1  | y2  |
|-----|-----|
| ... | ... |
| ... | ... |
| ... | ... |

Then the result set will be:

| x1  | x2  | x3  | x4  | x5  | y1  | y2  |
|-----|-----|-----|-----|-----|-----|-----|
| ... | ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... |
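
The concatenation above can be sketched with pandas. This is only an illustration with made-up values; the names `x`, `y` and `result_set` are hypothetical and the function's own code may build the result set differently.

```python
import pandas as pd

# Hypothetical inputs (X) and predictions (Y), sharing the same row order:
x = pd.DataFrame({"x1": [0.1, 0.2, 0.3], "x2": [1.0, 2.0, 3.0]})
y = pd.DataFrame({"y1": [0, 1, 0], "y2": [1, 0, 1]})

# A column-wise concatenation yields X | Y:
result_set = pd.concat([x, y], axis=1)
```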

<a id="chapter3"></a>
## 3. Data Drift Analysis

The data drift analysis is done per feature using two distance metrics for probability distributions.

Let us mark our sample set as $S$ and our inputs as $I$. We will look at one feature $x$ out of $n$ features. Assuming the histogram of feature $x$ in the sample set ($x_S$) is split into 20 bins: $b_1,b_2,...,b_{20}$, we will match the histogram of feature $x$ in the inputs $I$ ($x_I$) to the same bins and compare the two distributions using:

* Total Variation Distance: $TVD(x_S,x_I) = \frac{1}{2}\sum_{b_1}^{b_{20}} {|x_S - x_I|}$
* Hellinger Distance: $H(x_S,x_I)=\sqrt{1-\sum_{b_1}^{b_{20}}\sqrt{x_S\cdot x_I}}$

Our **rule** is then to calculate, for each feature $x\in S$, the metric $\frac{H(x_S,x_I)+TVD(x_S,x_I)}{2}$ and check whether it is smaller than the given thresholds.
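
The per-feature computation can be sketched with NumPy as follows. This is a minimal sketch, not the function's actual code: the name `feature_drift_metric` is hypothetical, and clipping input values that fall outside the sample's range into the edge bins is an assumption made so that every input value lands in one of the sample's bins.

```python
import numpy as np


def feature_drift_metric(sample: np.ndarray, inputs: np.ndarray, bins: int = 20):
    """Hypothetical helper returning (TVD, Hellinger, rule metric) for one feature."""
    # The bin edges come from the sample set and are reused for the inputs:
    sample_counts, edges = np.histogram(sample, bins=bins)
    # Assumption: out-of-range input values are clipped into the edge bins:
    clipped_inputs = np.clip(inputs, edges[0], edges[-1])
    input_counts, _ = np.histogram(clipped_inputs, bins=edges)

    # Normalize the counts into probability distributions:
    x_s = sample_counts / sample_counts.sum()
    x_i = input_counts / input_counts.sum()

    # Half of the summed absolute differences between the bins:
    tvd = 0.5 * np.abs(x_s - x_i).sum()
    # Guard against tiny negative values caused by floating point rounding:
    hellinger = np.sqrt(max(0.0, 1.0 - np.sqrt(x_s * x_i).sum()))

    # The rule metric is the mean of the two distances:
    return tvd, hellinger, (tvd + hellinger) / 2
```

Both distances lie in $[0, 1]$, so the rule metric does as well: identical distributions give a value near 0, and distributions with no overlapping bins give 1.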

The outputs of the analysis are:
* **Drift table plot** - The results are presented in a `plotly` table artifact named `"drift_table_plot"` that shows each feature's statistics and its TVD, Hellinger and KLD (Kullback–Leibler divergence) results as follows:

|        | Count      |            | Mean       |            | Std        |            | Min        |            | Max        |            | Tvd | Hellinger | Kld | Histograms |
| ------ | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | --- | --------- | --- |------------|
|        | **Sample** | **Input**  | **Sample** | **Input**  | **Sample** | **Input**  | **Sample** | **Input**  | **Sample** | **Input**  |     |           |     |            |
| **x1** | ...        | ...        | ...        | ...        | ...        | ...        | ...        | ...        | ...        | ...        | ... | ...       | ... | ...        |
| **x2** | ...        | ...        | ...        | ...        | ...        | ...        | ...        | ...        | ...        | ...        | ... | ...       | ... | ...        |
| **x3** | ...        | ...        | ...        | ...        | ...        | ...        | ...        | ...        | ...        | ...        | ... | ...       | ... | ...        |

* **Features drift results** - A dictionary of the rule metric per feature is saved in a JSON file named `"features_drift_results"`, where each key is a feature and its value is the feature's metric value: `Dict[str, float]`

    ```python
    {
        "x1": 0.12,
        "x2": 0.345,
        "x3": 0.00678,
        ...
    }
    ```
* In addition, two results are added to summarize the drift analysis:

    * `drift_status`: `bool` - A boolean value indicating whether a drift was found.
    * `drift_metric`: `float` - The mean of all the features' drift metric values (the rule above):
        for $n$ features and metric rule $M(x_S,x_I)=\frac{H(x_S,x_I)+TVD(x_S,x_I)}{2}$, `drift_metric` $=\frac{\sum_{x\in S}M(x_S,x_I)}{n}$

    ```python
    {
        "drift_status": True,
        "drift_metric": 0.81234
    }
    ```
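
As a rough sketch of how the two summary results relate to the per-feature dictionary, the snippet below averages the feature metrics and compares them with the thresholds. The helper name `summarize_drift`, the status labels, the use of `>=`, and deriving `drift_status` from the mean metric are all assumptions made for illustration; the text above only states that the thresholds mark drifts and possible drifts.

```python
from typing import Dict, Tuple


def summarize_drift(
    features_drift_results: Dict[str, float],
    drift_threshold: float = 0.7,
    possible_drift_threshold: float = 0.5,
) -> Tuple[Dict[str, str], bool, float]:
    """Hypothetical helper summarizing per-feature drift metrics."""
    # Label each feature according to the two thresholds (labels are assumed):
    statuses = {}
    for feature, metric in features_drift_results.items():
        if metric >= drift_threshold:
            statuses[feature] = "drift"
        elif metric >= possible_drift_threshold:
            statuses[feature] = "possible_drift"
        else:
            statuses[feature] = "no_drift"

    # The overall metric is the mean of the per-feature metrics:
    drift_metric = sum(features_drift_results.values()) / len(features_drift_results)
    # Assumption: the overall status compares the mean with `drift_threshold`:
    drift_status = drift_metric >= drift_threshold

    return statuses, drift_status, drift_metric
```

With the example dictionary above (`{"x1": 0.12, "x2": 0.345, "x3": 0.00678}`), every feature falls below both default thresholds, so under these assumptions no drift would be reported.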

<a id="chapter4"></a>
## 4. End-to-end Demo

We will see an end-to-end example that follows the steps below:
1. Generate data.
2. Train a model.
3. Infer data through the model using `batch_predict` and review the outputs.

### 4.1. Code review

We are using a very simple example of training a decision tree on a binary classification problem. For that we wrote two functions:
* `generate_data` - Generate binary classification data. The data will be split into a *training set* and *data for prediction*. The data for prediction will be drifted in half of its features to showcase the plot later on.
* `train` - Train a decision tree classifier on the given data.

In [None]:
# mlrun: start-code

In [None]:
import numpy as np
import pandas as pd

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

import mlrun
from mlrun.frameworks.sklearn import apply_mlrun


@mlrun.handler(outputs=["training_set", "prediction_set"])
def generate_data(n_samples: int = 5000, n_features: int = 20):
    # Generate a classification data:
    x, y = make_classification(
        n_samples=n_samples, n_features=n_features, n_classes=2
    )

    # Split the data into a training set and a prediction set:
    x_train, x_prediction = x[: n_samples // 2], x[n_samples // 2 :]
    y_train = y[: n_samples // 2]

    # Initialize dataframes:
    features = [f"feature_{i}" for i in range(n_features)]
    training_set = pd.DataFrame(data=x_train, columns=features)
    training_set.insert(
        loc=n_features, column="label", value=y_train, allow_duplicates=True
    )
    prediction_set = pd.DataFrame(data=x_prediction, columns=features)

    # Randomly drift half of the features:
    drifted_features = prediction_set.sample(n=n_features // 2, axis="columns")
    drifted_features += np.random.uniform(low=0, high=10, size=(n_samples // 2, n_features // 2))
    prediction_set.update(drifted_features)

    return training_set, prediction_set


@mlrun.handler()
def train(training_set: pd.DataFrame):
    # Get the data into x, y:
    labels = pd.DataFrame(training_set["label"])
    training_set.drop(columns=["label"], inplace=True)

    # Initialize a model:
    model = DecisionTreeClassifier()

    # Apply MLRun:
    apply_mlrun(model=model, model_name="model")

    # Train:
    model.fit(training_set, labels)

In [None]:
# mlrun: end-code

### 4.2. Run the Example with MLRun

First, we will prepare our MLRun functions:
1. We will use `mlrun.code_to_function` to turn this demo notebook into an MLRun function we can run.
2. We will use `mlrun.import_function` to import the `batch_predict` function.

In [None]:
# Create an MLRun function to run the notebook:
demo_function = mlrun.code_to_function(name="batch_predict_demo", kind="job")

# Import the `batch_predict` function from the marketplace:
batch_predict_function = mlrun.import_function("hub://batch_predict")

# Set the desired artifact path:
artifact_path = "./"

Now, we will follow the demo steps as discussed above:

In [None]:
# 1. Generate data:
generate_data_run = demo_function.run(
    handler="generate_data",
    artifact_path=artifact_path,
    local=True,
)

# 2. Train a model:
train_run = demo_function.run(
    handler="train",
    artifact_path=artifact_path,
    inputs={"training_set": generate_data_run.outputs["training_set"]},
    local=True,
)

# 3. Perform batch prediction:
batch_predict_run = batch_predict_function.run(
    handler="predict",
    artifact_path=artifact_path,
    inputs={"dataset": generate_data_run.outputs["prediction_set"]},
    params={
        "model": train_run.outputs["model"],
        "label_columns": "label",
    },
    local=True,
)

### 4.3. Review Outputs

We will review the outputs as explained in the notebook above.

#### 4.3.1. Results Prediction

First we will showcase the **Result Set**. As we didn't pass a name, its default name will be `"prediction"`:

In [None]:
batch_predict_run.artifact("prediction").as_df()

#### 4.3.2. Data Drift Analysis

Second we will review the data drift table plot and the drift results:

In [None]:
batch_predict_run.artifact("drift_table_plot").show()

In [None]:
batch_predict_run.status.results