In [2]:
from arthurai import ArthurAI
from arthurai.common.constants import InputType, OutputType, Stage
import joblib
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
import pytz

In [3]:
import sys
sys.path.append("..")
from model_utils import load_datasets

In this guide, we'll use the credit dataset (and a pre-trained model) to onboard a new model to the Arthur platform. We'll walk through registering the model using a sample of the training data. This is an example of a streaming model.

#### Set up connection
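Authenticate with the platform using your API key. In this notebook the credentials are passed to the client through environment variables rather than typed into a cell.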

In [None]:
# credentials are being passed to the client via environment variables
connection = ArthurAI()
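
Because the client above is created without arguments, a missing environment variable only surfaces when the connection fails. The following is a minimal pre-flight check, not part of the Arthur SDK. The variable names `ARTHUR_ENDPOINT_URL` and `ARTHUR_API_KEY` are assumptions here and should be confirmed against the documentation for your SDK version; `missing_credentials` is a hypothetical helper.

```python
import os

# Assumed variable names; confirm them against your SDK version's documentation.
REQUIRED_VARS = ("ARTHUR_ENDPOINT_URL", "ARTHUR_API_KEY")


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_credentials()
if missing:
    print(f"Set these environment variables before connecting: {', '.join(missing)}")
```

Running this cell before creating the client gives a readable message instead of an authentication error further down.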

## Create Model

### Loading the Data

In [6]:
(X_train, Y_train), (X_test, Y_test) = load_datasets("../fixtures/datasets/credit_card_default.csv")

In [7]:
Y_train.head()

22051    0
26990    0
12962    1
29735    1
26149    0
Name: default payment next month, dtype: int64

In [8]:
X_train.head()

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6
22051,200000,2,2,1,28,1,-1,-1,-1,-1,...,150,3570,2853,0,2658,150,3570,2853,0,0
26990,100000,1,2,1,44,-1,-1,-1,-1,-1,...,390,390,390,390,390,390,390,390,390,390
12962,20000,2,2,1,28,2,-1,-1,-1,-1,...,390,390,390,0,390,390,390,390,0,780
29735,50000,1,3,1,40,2,0,0,0,0,...,49861,48660,7698,7463,2090,2450,1510,409,415,406
26149,250000,2,1,2,39,-1,-1,-1,2,2,...,40292,39600,21304,1185,1742,39600,7,1185,0,54416


In [9]:
# load our pre-trained classifier so we can generate predictions
sk_model = joblib.load("../fixtures/serialized_models/credit_model.pkl")

# get model predictions
preds = sk_model.predict_proba(X_train)
X_train["prediction_1"] = preds[:, 1]

# get ground truth labels
X_train["gt"] = Y_train

### Registering the Model

We'll instantiate a model object with a small amount of metadata about the model input and output types. Then, we'll use a sample of the training data to register the full data schema for this Tabular model.

In [10]:
arthur_model = connection.model(partner_model_id=f"CreditRiskModel_QS_{datetime.now().strftime('%Y%m%d%H%M%S')}",
                                display_name="Credit Risk",
                                input_type=InputType.Tabular,
                                output_type=OutputType.Multiclass)

We need to register the schema for the outputs of the model: what will a typical prediction look like and what will a typical ground truth look like? What names, shapes, and datatypes should Arthur expect for these objects?

We'll do this all in one step with the *.build()* method. All we need to supply is:
  * the training dataframe
  * the name of the ground truth column
  * the mapping that relates predictions to ground truth
  
Our classifier predicts class *0* and class *1* and returns a probability score for each class. Because this is a binary problem, we log only the positive-class probability, under the name *prediction_1*. The ground truth is stored in a single column, *gt*, whose value is either 0 or 1. We link these in a dictionary that maps *prediction_1* to the ground truth value 1, and pass that to the model.

In [11]:
# Map our prediction attribute to the ground truth value
prediction_to_ground_truth_map = {
    "prediction_1": 1
}

arthur_model.build(X_train, 
                   ground_truth_column="gt",
                   pred_to_ground_truth_map=prediction_to_ground_truth_map)

2022-07-21 12:04:28,883 - arthurai.core.models - INFO - Please review the inferred schema. If everything looks correct, lock in your model by calling arthur_model.save()


Unnamed: 0,name,stage,value_type,categorical,is_unique,categories,bins,range,monitor_for_bias
0,LIMIT_BAL,PIPELINE_INPUT,INTEGER,False,False,[],,"[10000, 1000000]",False
1,EDUCATION,PIPELINE_INPUT,INTEGER,True,False,"[{value: 0}, {value: 1}, {value: 2}, {value: 3...",,"[None, None]",False
2,MARRIAGE,PIPELINE_INPUT,INTEGER,True,False,"[{value: 0}, {value: 1}, {value: 2}, {value: 3}]",,"[None, None]",False
3,AGE,PIPELINE_INPUT,INTEGER,False,False,[],,"[21, 79]",False
4,PAY_0,PIPELINE_INPUT,INTEGER,True,False,"[{value: 0}, {value: 1}, {value: 2}, {value: 3...",,"[None, None]",False
5,PAY_2,PIPELINE_INPUT,INTEGER,True,False,"[{value: 0}, {value: 1}, {value: 2}, {value: 3...",,"[None, None]",False
6,PAY_3,PIPELINE_INPUT,INTEGER,True,False,"[{value: 0}, {value: 1}, {value: 2}, {value: 3...",,"[None, None]",False
7,PAY_4,PIPELINE_INPUT,INTEGER,True,False,"[{value: 0}, {value: 1}, {value: 2}, {value: 3...",,"[None, None]",False
8,PAY_5,PIPELINE_INPUT,INTEGER,True,False,"[{value: 0}, {value: 2}, {value: 3}, {value: 4...",,"[None, None]",False
9,PAY_6,PIPELINE_INPUT,INTEGER,True,False,"[{value: 0}, {value: 2}, {value: 3}, {value: 4...",,"[None, None]",False


Before saving, you can also review your model to make sure everything is correct from the output of `arthur_model.build()` or via `arthur_model.review()`.

When you save the model, the data passed to `arthur_model.build()` is uploaded as the reference set, which is used as the baseline for tracking data drift. Often, this is the training data for the associated model. Our reference dataset should include:
  * inputs 
  * ground truth
  * model predictions
  
This way, Arthur can monitor for drift and stability in all of these aspects. 
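
As a rough illustration of the checklist above, the sketch below validates a candidate reference dataframe before it is registered. `check_reference_frame` is a hypothetical helper, not an Arthur SDK function, and the toy rows only stand in for the credit data. It assumes the prediction column holds probabilities in the range 0 to 1.

```python
import pandas as pd


def check_reference_frame(df, input_columns, prediction_column, ground_truth_column):
    """Return a list of problems found in a candidate reference DataFrame."""
    problems = []
    required = list(input_columns) + [prediction_column, ground_truth_column]
    missing = [col for col in required if col not in df.columns]
    if missing:
        # Without the required columns the remaining checks cannot run.
        problems.append(f"missing columns: {missing}")
        return problems
    if not df[prediction_column].between(0.0, 1.0).all():
        problems.append(f"{prediction_column} has values outside [0, 1]")
    if df[required].isna().any().any():
        problems.append("reference data contains missing values")
    return problems


# Toy rows that stand in for the credit data (values are illustrative).
toy = pd.DataFrame({
    "LIMIT_BAL": [200000, 100000],
    "AGE": [28, 44],
    "prediction_1": [0.12, 0.87],
    "gt": [0, 1],
})
print(check_reference_frame(toy, ["LIMIT_BAL", "AGE"], "prediction_1", "gt"))
```

An empty list means the frame contains inputs, predictions, and ground truth in a usable form.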

If you've already created your model, you can fetch it from the Arthur API. Retrieve a Model ID from the output of the `arthur_model.save()` call below, or the URL of your model page in the Arthur Dashboard.

In [12]:
model_id = arthur_model.save()

2022-07-21 12:04:34,764 - arthurai.core.data_service - INFO - Starting upload (1.376 MB in 1 files), depending on data size this may take a few minutes
2022-07-21 12:04:35,347 - arthurai.core.data_service - INFO - Upload completed: /var/folders/hl/bdslq5454bx2hb8xz6s19ggm0000gn/T/tmp077g95z1/cf421b83-479a-41f6-9a0e-9d5a1f90c7e5-0.parquet


In [13]:
# you can fetch a model by ID. for example pull the last-created model:
# with open("quickstart_model_id.txt", "r") as f:
#     model_id = f.read()
# arthur_model = connection.get_model(model_id)

## Sending Inferences

However you currently invoke your model's prediction (e.g., through a `.predict()` or `.predict_proba()` call), you can wrap that call so that the inputs and outputs are logged with Arthur.


In [None]:
from arthurai.core.decorators import log_prediction

In [None]:
@log_prediction(arthur_model)
def model_predict(input_vec):
    return sk_model.predict_proba(input_vec)[0]
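
To make the wrapping idea concrete, here is a self-contained sketch of how a logging decorator of this kind could behave. It is not Arthur's implementation: `log_prediction_sketch` and `toy_predict` are hypothetical names, and the "log" is a plain Python list rather than a call to the platform. It mirrors the calling pattern used later in this notebook, where the wrapped function accepts an optional `inference_timestamp` and returns the prediction together with an inference ID.

```python
import functools
import uuid
from datetime import datetime, timezone


def log_prediction_sketch(log):
    """Hypothetical stand-in for a logging decorator; `log` is a plain list."""
    def decorator(predict_fn):
        @functools.wraps(predict_fn)
        def wrapper(inputs, inference_timestamp=None):
            prediction = predict_fn(inputs)
            if inference_timestamp is None:
                # Fall back to the current time when no timestamp is given.
                inference_timestamp = datetime.now(timezone.utc)
            inference_id = uuid.uuid4().hex
            log.append({
                "inference_id": inference_id,
                "inference_timestamp": inference_timestamp,
                "inputs": inputs,
                "prediction": prediction,
            })
            # Return the ID too, so ground truth can be attached later.
            return prediction, inference_id
        return wrapper
    return decorator


logged = []


@log_prediction_sketch(logged)
def toy_predict(features):
    # Placeholder "model": always returns the same class probabilities.
    return [0.3, 0.7]


prediction, inference_id = toy_predict({"AGE": 28})
```

The design point is that the caller keeps the returned ID, which is what later allows a delayed label to be matched to the right inference.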

We'll create some timestamps to mimic sending the data over a period of time. If the timestamps are left out, the current time is used instead.

In [None]:
# 10 timestamps over the last month
timestamps = pd.date_range(start=datetime.now(pytz.utc) - timedelta(days=30),
                           end=datetime.now(pytz.utc),
                           periods=10)
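
For reference, the same kind of evenly spaced, timezone-aware timestamps can be sketched with the standard library alone. `evenly_spaced` is a hypothetical helper written for this illustration; it is not a pandas or Arthur function.

```python
from datetime import datetime, timedelta, timezone


def evenly_spaced(start, end, periods):
    """Return `periods` datetimes from start to end, evenly spaced.

    The last value equals `end` up to microsecond rounding of the step.
    """
    if periods < 2:
        raise ValueError("periods must be at least 2")
    step = (end - start) / (periods - 1)
    return [start + i * step for i in range(periods)]


now = datetime.now(timezone.utc)
sketch_timestamps = evenly_spaced(now - timedelta(days=30), now, periods=10)
```

Both endpoints are included, so ten periods over thirty days gives one timestamp every 30/9 days.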

Now, as we iterate through a dataset and invoke our model for predictions, the model inputs and outputs are logged. 

In [None]:
inference_ids = {}
for timestamp in timestamps:
    for i in range(np.random.randint(7, 10)):
        datarecord = X_test.sample(1)  # fetch a random row
        prediction, inference_id = model_predict(datarecord, inference_timestamp=timestamp)  # predict and log
        inference_ids[inference_id] = datarecord.index[0]  # record the inference ID with the Pandas index
    print(f"Logged {i+1} inferences with Arthur from {timestamp.strftime('%m/%d')}")


If your model scoring system is set up as a batch process that runs as a daily, weekly, or monthly job, we recommend setting up a batch model with Arthur and using the corresponding *send_batch_inferences()* method. An example batch model can be found [here](../../credit_risk_batch/notebooks/Quickstart.ipynb).

## Updating with Ground Truth

Later, when your ground truth labels come in, you can [update each inference](https://docs.arthur.ai/sdk/sdk_v3/arthurai.core.html#arthurai.core.models.ArthurModel.update_inference_ground_truths) by ID with its corresponding label. 

In [None]:
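# look up each label by its Pandas index; to_numpy() drops the index so rows stay in inference order
gt_df = pd.DataFrame({'partner_inference_id': list(inference_ids.keys()),
                      'gt': Y_test.loc[list(inference_ids.values())].to_numpy()})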
gt_df.head(5)

In [None]:
_ = arthur_model.update_inference_ground_truths(gt_df)
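
The ground truth dataframe above depends on matching each inference ID to the label of the row it was scored from. A minimal sketch of that lookup on toy data follows; the inference IDs, index values, and labels are made up for illustration. It assumes the label series has a unique index, as a train/test split of a dataframe normally does.

```python
import pandas as pd

# Toy labels indexed like a shuffled test split (values are illustrative).
labels = pd.Series([0, 1, 1], index=[29735, 12962, 26149], name="gt")

# Hypothetical inference IDs mapped to the row label each one was scored from.
# The same row can be sampled more than once.
id_to_row = {"inf-a": 12962, "inf-b": 29735, "inf-c": 12962}

toy_gt_df = pd.DataFrame({
    "partner_inference_id": list(id_to_row.keys()),
    # .loc looks up by index label; to_numpy() drops the index so that
    # the labels line up with the IDs by position.
    "gt": labels.loc[list(id_to_row.values())].to_numpy(),
})
```

Dropping the index matters here: if the labels kept their original index, the dataframe would be aligned to it rather than to the order of the inference IDs.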