# From Notebook to Deployable Model:
## Training, Evaluating, and Conforming a Model for Deployment


In this notebook, we demonstrate the process of 
1. training a model, 
2. evaluating its performance, 
3. saving it for later use,
4. and conforming it to deployment standards.

More specifically, we will train a logistic regression classifier on the German Credit Data dataset.

**I - Model Training**

Let's begin by loading relevant libraries. We will need `sklearn` for model training, and `aequitas` for bias detection.

In [1]:
import pickle
from typing import List

import numpy
import pandas

from aequitas.bias import Bias
from aequitas.group import Group
from aequitas.preprocessing import preprocess_input_df

from sklearn import set_config
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    train_test_split,
    GridSearchCV,
    RepeatedStratifiedKFold,
)
from sklearn.metrics import (
    make_scorer,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    fbeta_score,
    balanced_accuracy_score,
    confusion_matrix,
)

The **German Credit Data** dataset can be found here: https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data). Download it and load it from a *CSV* file. 

 - For our purposes, the dataset has been modified slightly to include an `id` column and a `gender` column (engineered from `status_sex` and used to demonstrate bias).

 - The target variable is under `label`. We have mapped the labels `[1,2]` to `[0,1]`, where `1` indicates the positive class (loan default).

In [2]:
data = pandas.read_csv("german_credit_data.csv")

In [3]:
data.columns

Index(['id', 'duration_months', 'credit_amount', 'installment_rate',
       'present_residence_since', 'age_years', 'number_existing_credits',
       'checking_status', 'credit_history', 'purpose', 'savings_account',
       'present_employment_since', 'debtors_guarantors', 'property',
       'installment_plans', 'housing', 'job', 'number_people_liable',
       'telephone', 'foreign_worker', 'gender', 'label'],
      dtype='object')

Let's look at some data:

In [4]:
data.head()

Unnamed: 0,id,duration_months,credit_amount,installment_rate,present_residence_since,age_years,number_existing_credits,checking_status,credit_history,purpose,...,debtors_guarantors,property,installment_plans,housing,job,number_people_liable,telephone,foreign_worker,gender,label
0,0,6,1169,4,4,67,2,A11,A34,A43,...,A101,A121,A143,A152,A173,1,A192,A201,male,0
1,1,48,5951,2,2,22,1,A12,A32,A43,...,A101,A121,A143,A152,A173,1,A191,A201,female,1
2,2,12,2096,2,3,49,1,A14,A34,A46,...,A101,A121,A143,A152,A172,2,A191,A201,male,0
3,3,42,7882,2,4,45,1,A11,A32,A42,...,A103,A122,A143,A153,A173,2,A191,A201,male,0
4,4,24,4870,3,4,53,2,A11,A33,A40,...,A101,A124,A143,A153,A173,2,A191,A201,male,1


Not all numeric columns need to be considered as numerical features. For example, `number_people_liable` only has two unique **discrete** values:

In [5]:
data.number_people_liable.value_counts()

1    845
2    155
Name: number_people_liable, dtype: int64

We may therefore treat it as a categorical feature. Note, however, that we may need to reconsider this option if more values appear in testing phases. In a production environment, care must be taken to deal with extraneous (unobserved) values.
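One way to guard against such values is to compare each incoming record with the categories observed during training. The sketch below is a minimal, hypothetical illustration: the helper `find_unseen_values` and the `KNOWN_VALUES` mapping are not part of this notebook, and the known values are simply those from the `value_counts()` output above.

```python
from typing import Any, Dict, Set

# Values observed during training, per categorical column (taken from the counts above).
# KNOWN_VALUES is a hypothetical name used only for this illustration.
KNOWN_VALUES: Dict[str, Set[Any]] = {"number_people_liable": {1, 2}}


def find_unseen_values(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return the fields of `record` whose value was never seen in training."""
    return {
        column: record[column]
        for column, known in KNOWN_VALUES.items()
        if column in record and record[column] not in known
    }


print(find_unseen_values({"number_people_liable": 2}))  # {}
print(find_unseen_values({"number_people_liable": 3}))  # {'number_people_liable': 3}
```

A check like this could log or reject suspicious records before they reach the model. The pipeline defined later uses `OneHotEncoder(handle_unknown="ignore")`, which encodes an unseen category as all zeros rather than raising an error, so such records would otherwise pass silently.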

Per `pandas` documentation, there are memory improvements when `object` fields are cast to the `category` type, especially when such fields take only a small number of distinct values.

In [6]:
data.number_people_liable = data.number_people_liable.astype("category")

Before proceeding any further with model development, let us split the original dataset into two sets: 

 - a **baseline/training** set that will be used as a reference set, and

 - a **sample** set which will mimic input data to the model once the model is in use (PROD).

In [7]:
# Setting a random state for reproducibility
df_baseline, df_sample = train_test_split(data, train_size=0.8, random_state=0)

# Writing split datasets into JSON-lines (tabular) files
df_baseline.to_json("df_baseline.json", orient="records", lines=True)
df_sample.to_json("df_sample.json", orient="records", lines=True)

One of the primary distributions to look at is that of the label:

In [8]:
df_baseline.label.value_counts() / len(df_baseline)

0    0.6975
1    0.3025
Name: label, dtype: float64

In [9]:
df_sample.label.value_counts() / len(df_sample)

0    0.71
1    0.29
Name: label, dtype: float64

It appears that in either set, around 70% of the accounts have paid off the loan (`label=0`), while the remaining 30% have defaulted (`label=1`).

We will train a **Logistic Regression** classifier. Since our data contains categorical features, we will need to start our pipeline with an encoder. The pipeline below will one-hot encode the input data (the encoder is applied to every column, numerical ones included), then fit a `LogisticRegression` against the target.

In [10]:
pipeline = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),  # output is sparse by default
    LogisticRegression(max_iter=1000, random_state=0),
)

**Logistic Regression** has multiple parameters which can be tuned. Among these are `C`, `solver`, and `class_weight`. Instead of manually searching for the optimal set of parameters, we will use **GridSearchCV**. We provide GridSearchCV with a list of values for each of these parameters.

In [11]:
parameters = dict(
    logisticregression__C=numpy.logspace(
        -4, 4, 50
    ),  # Inverse of regularization strength
    logisticregression__solver=["liblinear", "lbfgs", "newton-cg"],
    logisticregression__class_weight=["balanced", None],
)
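To make the `class_weight="balanced"` option concrete: scikit-learn documents the balanced weights as `n_samples / (n_classes * count_per_class)`. The sketch below applies that formula to the baseline label counts implied by the distribution shown earlier (558 paid-off and 242 defaulted accounts out of 800). The helper name `balanced_class_weights` is ours and is not a scikit-learn function.

```python
from typing import Dict


def balanced_class_weights(class_counts: Dict[int, int]) -> Dict[int, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Baseline set: 69.75% of 800 records have label 0, 30.25% have label 1.
weights = balanced_class_weights({0: 558, 1: 242})
print({label: round(weight, 3) for label, weight in weights.items()})
```

Under this weighting, each defaulted account counts for more than twice as much as a paid-off one during fitting, which pushes the classifier away from always predicting the dominant class.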

The data still contains columns that should not be used as predictors: the identifier `id`, the target `label`, and the protected attributes `age_years` and `gender`. We remove these below.

**Note** - `gender` and `age_years` are removed to avoid explicit bias. This does not guarantee, however, that the overall model will not be biased against a particular group, since bias could be implicitly encoded in the training data.

In [12]:
predictive_features = list(
    set(data.columns) - set(["id", "label", "age_years", "gender"])
)

As a sanity check, let us see which features are automatically encoded as **numerical**, and which are encoded as **categorical**.

In [13]:
categorical_features = list(
    set(predictive_features).intersection(
        set(data.select_dtypes(include=["object", "category"]))
    )
)
numerical_features = list(
    set(predictive_features).intersection(set(data.select_dtypes(include=["number"])))
)

**Categorical features**:

In [14]:
print(categorical_features)

['installment_plans', 'job', 'purpose', 'telephone', 'number_people_liable', 'present_employment_since', 'savings_account', 'debtors_guarantors', 'housing', 'checking_status', 'credit_history', 'foreign_worker', 'property']


**Numerical features**:

In [15]:
print(numerical_features)

['number_existing_credits', 'present_residence_since', 'credit_amount', 'installment_rate', 'duration_months']


Everything looks good; let us proceed with training. We need to specify **predictive** and **response** variables for each of the training and test sets. We set these by filtering the baseline and sample sets.

In [16]:
X_train = df_baseline[predictive_features]
X_test = df_sample[predictive_features]

y_train = df_baseline["label"]
y_test = df_sample["label"]

X_train.to_json("X_train.json", orient="records", lines=True)
X_test.to_json("X_test.json", orient="records", lines=True)

We may now fit the classifier to the training data. Since "it is worse to classify a customer as good when they are bad, than it is to classify a customer as bad when they are good", we will use an **F_beta metric**, with `beta=2`, to judge the performance of our model.
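As a reminder of what this metric rewards, the F-beta score combines precision and recall as `(1 + beta^2) * P * R / (beta^2 * P + R)`, so `beta=2` weights recall more heavily than precision. The sketch below evaluates the formula on illustrative confusion counts (`tp`, `fp`, `fn` are made up for this example and are not results from our model).

```python
def fbeta(precision: float, recall: float, beta: float) -> float:
    """F-beta score from precision and recall; returns 0.0 when both are zero."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_squared = beta ** 2
    return (1 + beta_squared) * precision * recall / (beta_squared * precision + recall)


# Illustrative counts only: 60 true positives, 40 false positives, 20 false negatives.
tp, fp, fn = 60, 40, 20
precision = tp / (tp + fp)  # 0.6
recall = tp / (tp + fn)  # 0.75

f1 = fbeta(precision, recall, beta=1)
f2 = fbeta(precision, recall, beta=2)
print(round(f1, 3), round(f2, 3))
```

Because recall exceeds precision in this example, the F2 score comes out higher than the F1 score. A model that misses many defaults (low recall) would be penalized more under F2 than under F1.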

In [17]:
# This will take a few minutes to complete
clf_GS = GridSearchCV(
    estimator=pipeline,
    param_grid=parameters,
    n_jobs=-1,
    scoring=make_scorer(fbeta_score, beta=2),
    cv=RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0),
)
clf_GS.fit(X_train, y_train)
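To see why the search takes a few minutes, it helps to count the model fits. The sketch below rebuilds the grid with the standard library only (the list comprehension mirrors `numpy.logspace(-4, 4, 50)`) and multiplies the number of candidates by the number of cross-validation folds. The variable names are ours.

```python
from itertools import product

# 50 log-spaced values of C between 1e-4 and 1e4, as in numpy.logspace(-4, 4, 50)
c_values = [10 ** (-4 + 8 * i / 49) for i in range(50)]
solvers = ["liblinear", "lbfgs", "newton-cg"]
class_weights = ["balanced", None]

# Every combination of the three parameter lists is one candidate model.
n_candidates = len(list(product(c_values, solvers, class_weights)))

# RepeatedStratifiedKFold(n_splits=10, n_repeats=3) evaluates each candidate 30 times.
n_folds = 10 * 3
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)  # 300 9000
```

That is 9,000 fits, plus one final refit of the best candidate on the full training set if `GridSearchCV` is left at its default `refit=True`.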

In [18]:
clf_GS.best_estimator_

Here are the parameters of the best estimator:

In [19]:
clf_GS.best_params_

{'logisticregression__C': 0.00021209508879201905,
 'logisticregression__class_weight': 'balanced',
 'logisticregression__solver': 'newton-cg'}

It appears that the best logistic regression classifier uses `solver='newton-cg'` and `class_weight='balanced'`, with strong regularization (`C` ≈ 0.0002). This classifier achieved the best cross-validated F2 score:

In [20]:
clf_GS.best_score_

0.6745196012183637

**II - Model Evaluation**

Before saving our trained model for further use, let's look at some performance metrics. We will evaluate the model on both the training and test sets; we would like to see a stable performance.

For repeatability, let's define a function which computes multiple metrics at once:

In [21]:
def binary_classification_metrics(
    y_true: pandas.Series, y_preds: pandas.Series
) -> List:
    """
    A function to evaluate a binary classification model, given true and predicted labels.

    Args:
        y_true (pd.Series): true (actual) labels
        y_preds (pd.Series): predicted labels (as scored by model)

    Returns:
        (List): Classification performance metrics: accuracy, balanced accuracy, precision, recall, f1, f2
    """

    return [
        accuracy_score(y_true, y_preds),
        balanced_accuracy_score(y_true, y_preds),
        precision_score(y_true, y_preds),
        recall_score(y_true, y_preds),
        f1_score(y_true, y_preds),
        fbeta_score(y_true, y_preds, beta=2),
    ]

Let us now compute predictions on both training and test sets:

In [22]:
y_test_preds = clf_GS.best_estimator_.predict(X_test)
y_train_preds = clf_GS.best_estimator_.predict(X_train)

Let us quickly verify that the model is not trivial, i.e., predicting the default (dominant) class (label=0) for all records:

In [23]:
print("Test  Label Distribution: \n", pandas.Series(y_test_preds).value_counts())
print("\nTrain Label Distribution: \n", pandas.Series(y_train_preds).value_counts())

Test  Label Distribution: 
 0    107
1     93
dtype: int64

Train Label Distribution: 
 0    451
1    349
dtype: int64


We will display performance metrics in a DataFrame:

In [24]:
preformance_df = pandas.DataFrame(
    data=[{}],
    columns=[
        "Accuracy",
        "Balanced Accuracy",
        "Precision",
        "Recall",
        "F1 score",
        "F2 Score",
    ],
    index=["Training Set", "Test Set"],
)

In [25]:
preformance_df.loc["Training Set", :] = binary_classification_metrics(
    y_true=y_train, y_preds=y_train_preds
)
preformance_df.loc["Test Set", :] = binary_classification_metrics(
    y_true=y_test, y_preds=y_test_preds
)

Here's how our model performed:

In [26]:
preformance_df.round(3)

Unnamed: 0,Accuracy,Balanced Accuracy,Precision,Recall,F1 score,F2 Score
Training Set,0.724,0.735,0.53,0.764,0.626,0.702
Test Set,0.665,0.682,0.452,0.724,0.556,0.646


Some Observations:
1. For many metrics, performance on the training set is close to performance on the test set; for others, performance degraded on new data, indicating either an over-fit model or simply an unlucky train/test split. Keep in mind that the overall dataset is small, comprising 800 training samples and 200 test samples.
2. Further model improvements are needed to achieve better F2 scores. Generally, a binary classifier is considered "good" for F_beta scores > 0.7. 
3. The fact that balanced accuracy and accuracy are close is a good indication that the model is not trivial. Balanced accuracy takes into account the imbalance of the label. Recall that the best estimator found by GridSearch was achieved for 'logisticregression__class_weight': 'balanced'.
4. The model, as it stands, achieves almost the same accuracy as the trivial model (around 70%). One must keep in mind, however, that the trivial model fails on all other metrics, since it assigns the same default value for all samples.
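
Observation 4 can be checked with a few lines of plain Python. The sketch below scores a hypothetical trivial model that always predicts `0` against a test set with the same label counts as ours (142 paid-off and 58 defaulted accounts, from the 71%/29% split shown earlier). It does not use the trained classifier.

```python
# Test-set labels: 142 negatives (paid off) and 58 positives (default).
y_true = [0] * 142 + [1] * 58
y_pred = [0] * len(y_true)  # trivial model: always predict the dominant class

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
balanced_accuracy = (recall + specificity) / 2

print(accuracy, recall, balanced_accuracy)  # 0.71 0.0 0.5
```

The trivial model matches our classifier on plain accuracy but never identifies a default, which is why recall, balanced accuracy, and F2 are the more informative metrics here.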

For now, we will content ourselves with this model and use it to produce new predictions. A logistic regression, when applicable, is often a good champion model, to be challenged later by other modeling techniques, such as neural networks, decision trees, SVMs, and so on.

**III - Saving and Loading the Trained Model**

Now that the model is **trained** and **evaluated**, we save it in a binary format. It will then be loaded and used to make new predictions.

In [27]:
with open("logreg_classifier.pickle", "wb") as model_file:
    pickle.dump(clf_GS.best_estimator_, model_file)

The model is reloaded on-demand as follows:

In [28]:
with open("logreg_classifier.pickle", "rb") as model_file:
    logreg_classifier = pickle.load(model_file)

Predictions are produced on-demand by calling the `predict()` function on a dataframe of input samples:

In [29]:
new_preds = logreg_classifier.predict(X_test)

In [30]:
pandas.Series(new_preds).value_counts()

0    107
1     93
dtype: int64

**IV - Evaluating Bias on Protected Classes**

Since `gender` and `age_years` are protected classes, we have excluded them from the list of predictive features. However, this does not guarantee that the model is not implicitly biased, as `gender` and/or `age_years` could potentially be inferred from other features. It is therefore imperative that we evaluate our model for Ethical Bias.

To that end, let us produce some predictions and append them to our labeled baseline and sample sets.

In [31]:
df_baseline_scored = df_baseline.copy(deep=True)
df_baseline_scored["score"] = logreg_classifier.predict(
    df_baseline[predictive_features]
)

df_sample_scored = df_sample.copy(deep=True)
df_sample_scored["score"] = logreg_classifier.predict(df_sample[predictive_features])

We will use the `Aequitas` library to compute bias metrics. The library requires the true label to be encoded as 'label_value', so let us rename that column.

In [32]:
df_baseline_scored.rename(columns={"label": "label_value"}, inplace=True)
df_sample_scored.rename(columns={"label": "label_value"}, inplace=True)

In addition, protected classes must be of a categorical type, so that bias metrics are computed for each discrete group. To that end, we will map `age_years` to an `age_over_forty` boolean column (cast to string), since 40 is the age at which legal protection against age discrimination commonly begins (for example, under the U.S. ADEA).

In [33]:
df_baseline_scored["age_over_forty"] = (df_baseline_scored["age_years"] > 40).astype(
    str
)
df_sample_scored["age_over_forty"] = (df_sample_scored["age_years"] > 40).astype(str)

Let's save these two DataFrames before proceeding further:

In [34]:
df_baseline_scored.to_json("df_baseline_scored.json", orient="records", lines=True)
df_sample_scored.to_json("df_sample_scored.json", orient="records", lines=True)

Now, we call the aequitas preprocessing function on our datasets, filtered to the features we care about: `score` (prediction), `label_value` (true label), `age_over_forty` and `gender` (protected classes).

In [35]:
df_baseline_scored_processed, _ = preprocess_input_df(
    df_baseline_scored.loc[:, ["score", "label_value", "gender", "age_over_forty"]]
)
df_sample_scored_processed, _ = preprocess_input_df(
    df_sample_scored.loc[:, ["score", "label_value", "gender", "age_over_forty"]]
)

Let's start by computing some `Group` metrics. These are raw (count) metrics which display the representation of different groups in the data.

In [36]:
xtab_baseline, _ = Group().get_crosstabs(df_baseline_scored_processed)
xtab_sample, _ = Group().get_crosstabs(df_sample_scored_processed)



In [37]:
absolute_metrics_baseline = Group().list_absolute_metrics(xtab_baseline)
absolute_metrics_sample = Group().list_absolute_metrics(xtab_sample)

Here are the absolute metrics, computed on baseline and sample sets, respectively:

In [38]:
xtab_baseline[["attribute_name", "attribute_value"] + absolute_metrics_baseline].round(
    2
)

Unnamed: 0,attribute_name,attribute_value,tpr,tnr,for,fdr,fpr,fnr,npv,precision,ppr,pprev,prev
0,gender,female,0.8,0.67,0.14,0.43,0.33,0.2,0.86,0.57,0.34,0.5,0.35
1,gender,male,0.75,0.72,0.12,0.49,0.28,0.25,0.88,0.51,0.66,0.41,0.28
2,age_over_forty,False,0.76,0.69,0.14,0.47,0.31,0.24,0.86,0.53,0.74,0.45,0.32
3,age_over_forty,True,0.79,0.74,0.09,0.48,0.26,0.21,0.91,0.52,0.26,0.4,0.27


In [39]:
xtab_sample[["attribute_name", "attribute_value"] + absolute_metrics_sample].round(2)

Unnamed: 0,attribute_name,attribute_value,tpr,tnr,for,fdr,fpr,fnr,npv,precision,ppr,pprev,prev
0,gender,female,0.68,0.7,0.2,0.45,0.3,0.32,0.8,0.55,0.33,0.43,0.35
1,gender,male,0.76,0.61,0.12,0.6,0.39,0.24,0.88,0.4,0.67,0.48,0.26
2,age_over_forty,False,0.72,0.58,0.17,0.57,0.42,0.28,0.83,0.43,0.85,0.51,0.3
3,age_over_forty,True,0.73,0.82,0.1,0.43,0.18,0.27,0.9,0.57,0.15,0.31,0.24


A complete description of the metrics above can be found here: https://github.com/dssg/aequitas#aequitas-group-metrics

Generally, one would want a model that achieves similar metrics across different protected groups. One area of concern for our model is that `ppr` (predicted positive rate) varies widely between groups. Since `ppr` measures each group's share of all predicted positives, the imbalance of the groups in the data could be the cause.

We can also add some raw counts (group sizes) as follows:

In [40]:
xtab_baseline[
    [col for col in xtab_baseline.columns if col not in absolute_metrics_baseline]
]

Unnamed: 0,model_id,score_threshold,k,attribute_name,attribute_value,pp,pn,fp,fn,tn,tp,group_label_pos,group_label_neg,group_size,total_entities
0,0,binary 0/1,349,gender,female,118,120,51,17,103,67,84,154,238,800
1,0,binary 0/1,349,gender,male,231,331,113,40,291,118,158,404,562,800
2,0,binary 0/1,349,age_over_forty,False,257,314,120,44,270,137,181,390,571,800
3,0,binary 0/1,349,age_over_forty,True,92,137,44,13,124,48,61,168,229,800


In [41]:
xtab_sample[[col for col in xtab_sample.columns if col not in absolute_metrics_sample]]

Unnamed: 0,model_id,score_threshold,k,attribute_name,attribute_value,pp,pn,fp,fn,tn,tp,group_label_pos,group_label_neg,group_size,total_entities
0,0,binary 0/1,93,gender,female,31,41,14,8,33,17,25,47,72,200
1,0,binary 0/1,93,gender,male,62,66,37,8,58,25,33,95,128,200
2,0,binary 0/1,93,age_over_forty,False,79,76,45,13,63,34,47,108,155,200
3,0,binary 0/1,93,age_over_forty,True,14,31,6,3,28,8,11,34,45,200
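
The absolute metrics can be recomputed by hand from these counts. The sketch below does so for the two `gender` groups of the baseline set, using the `tp`, `fp`, `fn`, `tn` and `k` values from the baseline table above. The function `group_metrics` is our own helper, not an Aequitas call, and the formulas are the usual confusion-matrix definitions, which reproduce the reported values.

```python
from typing import Dict


def group_metrics(tp: int, fp: int, fn: int, tn: int, k: int) -> Dict[str, float]:
    """Derive a few group metrics from one group's confusion counts.

    k is the total number of predicted positives across all groups.
    """
    pp = tp + fp  # predicted positives within this group
    group_size = tp + fp + fn + tn
    return {
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
        "precision": tp / pp,
        "ppr": pp / k,  # share of all predicted positives
        "pprev": pp / group_size,  # predicted positives within the group
        "prev": (tp + fn) / group_size,  # actual positives within the group
    }


# Baseline counts for the gender groups, copied from the table above.
female = group_metrics(tp=67, fp=51, fn=17, tn=103, k=349)
male = group_metrics(tp=118, fp=113, fn=40, tn=291, k=349)

for name, values in (("female", female), ("male", male)):
    print(name, {metric: round(value, 2) for metric, value in values.items()})
```

Note how `ppr` differs sharply between the groups (about 0.34 versus 0.66) mostly because there are more men than women in the data, while `pprev`, which is normalized by group size, is much closer.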


Now that we have computed `Group` metrics, we can move on to `Bias` metrics. Bias is computed as the ratio of group metrics. This requires defining a reference group for each protected class. In the case of gender, we choose reference_group='male'.
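
To illustrate the ratio, the sketch below divides a few of the female group's baseline metrics by the corresponding male (reference) metrics, using the baseline counts listed earlier. It is a hand calculation, not the Aequitas implementation, but the results agree with the baseline disparity table reported further below.

```python
# Baseline group metrics for gender, derived from the raw counts listed earlier:
# ppr = pp / k, pprev = pp / group_size, fnr = fn / (fn + tp)
female = {"ppr": 118 / 349, "pprev": 118 / 238, "fnr": 17 / 84}
male = {"ppr": 231 / 349, "pprev": 231 / 562, "fnr": 40 / 158}  # reference group

# A disparity is the group's metric divided by the reference group's metric.
disparities = {
    f"{metric}_disparity": round(female[metric] / male[metric], 3)
    for metric in female
}
print(disparities)
```

A disparity of 1.0 means the group matches the reference group on that metric, which is why the reference rows in the disparity tables contain only ones.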

In [42]:
bdf_baseline = Bias().get_disparity_predefined_groups(
    xtab_baseline,
    original_df=df_baseline_scored_processed,
    ref_groups_dict={"gender": "male", "age_over_forty": "False"},
    alpha=0.05,
    mask_significance=True,
)

bdf_sample = Bias().get_disparity_predefined_groups(
    xtab_sample,
    original_df=df_sample_scored_processed,
    ref_groups_dict={"gender": "male", "age_over_forty": "False"},
    alpha=0.05,
    mask_significance=True,
)

get_disparity_predefined_group()
get_disparity_predefined_group()


We can now compute **disparity** metrics as follows:

In [43]:
calculated_disparities_baseline = Bias().list_disparities(bdf_baseline)
calculated_disparities_sample = Bias().list_disparities(bdf_sample)

disparity_metrics_df_baseline = bdf_baseline[
    ["attribute_name", "attribute_value"] + calculated_disparities_baseline
]
disparity_metrics_df_sample = bdf_sample[
    ["attribute_name", "attribute_value"] + calculated_disparities_sample
]

Here are the computed disparity metrics on baseline and sample sets, respectively:

In [44]:
disparity_metrics_df_baseline.round(3)

Unnamed: 0,attribute_name,attribute_value,ppr_disparity,pprev_disparity,precision_disparity,fdr_disparity,for_disparity,fpr_disparity,fnr_disparity,tpr_disparity,tnr_disparity,npv_disparity
0,gender,female,0.511,1.206,1.112,0.884,1.172,1.184,0.799,1.068,0.929,0.976
1,gender,male,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2,age_over_forty,False,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
3,age_over_forty,True,0.358,0.893,0.979,1.024,0.677,0.851,0.877,1.04,1.066,1.053


In [45]:
disparity_metrics_df_sample.round(3)

Unnamed: 0,attribute_name,attribute_value,ppr_disparity,pprev_disparity,precision_disparity,fdr_disparity,for_disparity,fpr_disparity,fnr_disparity,tpr_disparity,tnr_disparity,npv_disparity
0,gender,female,0.5,0.889,1.36,0.757,1.61,0.765,1.32,0.898,1.15,0.916
1,gender,male,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2,age_over_forty,False,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
3,age_over_forty,True,0.177,0.61,1.328,0.752,0.566,0.424,0.986,1.005,1.412,1.09


Some of the disparity metrics above are worrisome! A good rule of thumb is for bias metrics to be within 20% of the reference group. Thus, values outside of (0.8,1.25) are cause for concern. We might need to retrain the model, possibly with better feature engineering. That's an exercise for a later time.
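
The rule of thumb is easy to apply programmatically. The sketch below flags every disparity outside the open interval (0.8, 1.25) for the two non-reference rows of the baseline table above. The helper `flag_disparities` and the constant names are ours, chosen for illustration.

```python
from typing import Dict, List

LOWER, UPPER = 0.8, 1.25  # bounds quoted in the text


def flag_disparities(row: Dict[str, float]) -> List[str]:
    """Return the names of disparity metrics that fall outside (LOWER, UPPER)."""
    return sorted(name for name, value in row.items() if not LOWER < value < UPPER)


# Baseline disparities copied from the table above (reference rows omitted).
female_baseline = {
    "ppr_disparity": 0.511, "pprev_disparity": 1.206, "precision_disparity": 1.112,
    "fdr_disparity": 0.884, "for_disparity": 1.172, "fpr_disparity": 1.184,
    "fnr_disparity": 0.799, "tpr_disparity": 1.068, "tnr_disparity": 0.929,
    "npv_disparity": 0.976,
}
over_forty_baseline = {
    "ppr_disparity": 0.358, "pprev_disparity": 0.893, "precision_disparity": 0.979,
    "fdr_disparity": 1.024, "for_disparity": 0.677, "fpr_disparity": 0.851,
    "fnr_disparity": 0.877, "tpr_disparity": 1.04, "tnr_disparity": 1.066,
    "npv_disparity": 1.053,
}

print(flag_disparities(female_baseline))  # ['fnr_disparity', 'ppr_disparity']
print(flag_disparities(over_forty_baseline))  # ['for_disparity', 'ppr_disparity']
```

On the baseline set, `ppr_disparity` is flagged for both protected classes, and one error-rate disparity is flagged in each case (`fnr_disparity` only marginally, at 0.799).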

**V - Preparing Model Code for Deployment**

Preparing for deployment is best demonstrated through an example. Let's look at the code below:

In [1]:
import pandas
import numpy
import pickle

from aequitas.preprocessing import preprocess_input_df
from aequitas.group import Group
from aequitas.bias import Bias


def init() -> None:
    """
    A function to load the trained model artifact (.pickle) as a global variable.
    The model will be used by other functions to produce predictions.
    """

    global logreg_classifier

    # load pickled logistic regression model, closing the file afterwards
    with open("logreg_classifier.pickle", "rb") as model_file:
        logreg_classifier = pickle.load(model_file)


def score(data: dict) -> dict:
    """
    A function to predict loan default/pay-off, given a loan application sample (record).

    Args:
        data (dict): input dictionary to be scored, containing predictive features.

    Returns:
        (dict): Scored (predicted) input data.
    """

    # Turn input data into a 1-record DataFrame
    data = pandas.DataFrame([data])

    # There are only two unique values in data.number_people_liable.
    # Treat it as a categorical feature, to mimic training process
    data.number_people_liable = data.number_people_liable.astype("category")

    # Alternatively, these features can be saved (pickled) and re-loaded
    predictive_features = [
        "installment_plans",
        "job",
        "number_people_liable",
        "savings_account",
        "debtors_guarantors",
        "housing",
        "credit_amount",
        "installment_rate",
        "credit_history",
        "foreign_worker",
        "number_existing_credits",
        "purpose",
        "telephone",
        "present_residence_since",
        "checking_status",
        "duration_months",
        "present_employment_since",
        "property",
    ]

    # Predict using saved model
    data["predicted_score"] = logreg_classifier.predict(data[predictive_features])

    return data.to_dict(orient="records")[0]


def metrics(data: pandas.DataFrame) -> dict:
    """
    A function to compute Group and Bias metrics on scored and labeled data, containing protected classes.

    Args:
        data (pandas.DataFrame): Dataframe of loan applications, including ground truths, predictions.

    Returns:
        (dict): Group and Bias metrics, under the keys "group_metrics" and "bias_metrics".
    """

    # To measure Bias towards the protected classes, filter DataFrame to "score",
    # "label_value" (ground truth), "gender" and "age_over_forty" (protected attributes)
    data_scored = data[["score", "label_value", "gender", "age_over_forty"]]

    # Process DataFrame
    data_scored_processed, _ = preprocess_input_df(data_scored)

    # Group Metrics
    xtab, _ = Group().get_crosstabs(data_scored_processed)

    # Absolute metrics, such as 'tpr', 'tnr','precision', etc.
    absolute_metrics = Group().list_absolute_metrics(xtab)

    # DataFrame of calculated absolute metrics for each sample population group
    absolute_metrics_df = xtab[
        ["attribute_name", "attribute_value"] + absolute_metrics
    ].round(2)

    # For example:
    """
        attribute_name  attribute_value     tpr     tnr  ... precision
    0   gender          female              0.60    0.88 ... 0.75
    1   gender          male                0.49    0.90 ... 0.64
    2   age_over_forty  True                0.54    0.45 ... 0.23
    3   age_over_forty  False               0.45    0.54 ... 0.32
    """

    # Bias Metrics
    # Disparities calculated relative to the reference groups ("male", and "False" for age_over_forty)
    bias_df = Bias().get_disparity_predefined_groups(
        xtab,
        original_df=data_scored_processed,
        ref_groups_dict={"gender": "male", "age_over_forty": "False"},
        alpha=0.05,
        mask_significance=True,
    )

    # Disparity metrics added to bias DataFrame
    calculated_disparities = Bias().list_disparities(bias_df)

    disparity_metrics_df = bias_df[
        ["attribute_name", "attribute_value"] + calculated_disparities
    ].round(3)

    # For example:
    """
        attribute_name  attribute_value    ppr_disparity   precision_disparity
    0   gender          female             0.714            1.417
    1   gender          male               1.000            1.000
    2   age_over_forty  True               0.540            1.234
    3   age_over_forty  False              1.000            1.000
    """

    # Output a JSON object of calculated metrics
    return {
        "group_metrics": absolute_metrics_df.to_dict(orient="records"),
        "bias_metrics": disparity_metrics_df.to_dict(orient="records"),
    }


There are four main sections to this model:
1. Library imports
2. `init` function
3. `score` function
4. `metrics` function

**Library** imports are always at the top. We don't need to include all libraries that we used for training and model evaluation. We just need the libraries for processing and scoring.

The **`init`** function runs once per deployment, and is used to load and persist into memory any variable that needs to be accessed at scoring time. For example, the `init` function is where we load the saved model binary. We make the variable global so it can be accessed from the scoring function.

The **`score`** function is the function that runs anytime we make a scoring (prediction) request. This is where we put our prediction code. We have to remember to include any steps that were not captured by the pipeline, such as feature engineering or re-encoding.
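
One such step is keeping the list of predictive features in sync with training. As the comment in `score` suggests, the list can be pickled at training time and re-loaded at scoring time instead of being hard-coded. The sketch below shows the round trip with a shortened feature list and a hypothetical file name (`predictive_features.pickle`), written to a temporary directory so that it leaves nothing behind.

```python
import os
import pickle
import tempfile

# Shortened list for illustration; the real list has 18 features.
predictive_features = ["duration_months", "credit_amount", "purpose"]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical artifact name, stored next to the model in a real project.
    path = os.path.join(tmp_dir, "predictive_features.pickle")

    # Training side: persist the exact list (and order) used to fit the pipeline.
    with open(path, "wb") as features_file:
        pickle.dump(predictive_features, features_file)

    # Scoring side: re-load the list instead of hard-coding it.
    with open(path, "rb") as features_file:
        loaded_features = pickle.load(features_file)

print(loaded_features == predictive_features)  # True
```

Because the training notebook builds the list with `list(set(...))`, the column order is not fixed from one run to the next. Persisting the list that was actually used keeps scoring aligned with training, which may matter to the fitted pipeline.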

The **`metrics`** function is where model evaluation is carried out. In our example, this is the place where we replicate the calculations of Group and/or Bias metrics.

Let us test our source code to see if we missed anything. We will load input data and scored input data to test both the scoring and metrics functions:

In [2]:
score_sample = pandas.read_json("df_baseline.json", lines=True, orient="records")
metrics_sample = pandas.read_json(
    "df_baseline_scored.json", lines=True, orient="records"
)

Let's check that the **`init`** function can load the trained model binary:

In [3]:
init()

No errors from the **`init`** function. Let us now call the **`score`** function on input data (first sample):

In [4]:
scores = score(score_sample.iloc[0])

In [6]:
scores

{'id': 687,
 'duration_months': 36,
 'credit_amount': 2862,
 'installment_rate': 4,
 'present_residence_since': 3,
 'age_years': 30,
 'number_existing_credits': 1,
 'checking_status': 'A12',
 'credit_history': 'A33',
 'purpose': 'A40',
 'savings_account': 'A62',
 'present_employment_since': 'A75',
 'debtors_guarantors': 'A101',
 'property': 'A124',
 'installment_plans': 'A143',
 'housing': 'A153',
 'job': 'A173',
 'number_people_liable': 1,
 'telephone': 'A191',
 'foreign_worker': 'A201',
 'gender': 'male',
 'label': 0,
 'predicted_score': 1}

In [5]:
pandas.DataFrame([scores])

Unnamed: 0,id,duration_months,credit_amount,installment_rate,present_residence_since,age_years,number_existing_credits,checking_status,credit_history,purpose,...,property,installment_plans,housing,job,number_people_liable,telephone,foreign_worker,gender,label,predicted_score
0,687,36,2862,4,3,30,1,A12,A33,A40,...,A124,A143,A153,A173,1,A191,A201,male,0,1


We have scores! Last but not least, let's call the **`metrics`** function on scored data:

In [76]:
bias = metrics(metrics_sample)

get_disparity_predefined_group()


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.loc[:, 'score'] = df.loc[:,'score'].astype(float)
  df.loc[:, 'score'] = df.loc[:,'score'].astype(float)


In [78]:
pandas.DataFrame(bias["group_metrics"])

Unnamed: 0,attribute_name,attribute_value,tpr,tnr,for,fdr,fpr,fnr,npv,precision,ppr,pprev,prev
0,gender,female,0.8,0.67,0.14,0.43,0.33,0.2,0.86,0.57,0.34,0.5,0.35
1,gender,male,0.75,0.72,0.12,0.49,0.28,0.25,0.88,0.51,0.66,0.41,0.28
2,age_over_forty,False,0.76,0.69,0.14,0.47,0.31,0.24,0.86,0.53,0.74,0.45,0.32
3,age_over_forty,True,0.79,0.74,0.09,0.48,0.26,0.21,0.91,0.52,0.26,0.4,0.27


In [79]:
pandas.DataFrame(bias["bias_metrics"])

Unnamed: 0,attribute_name,attribute_value,ppr_disparity,pprev_disparity,precision_disparity,fdr_disparity,for_disparity,fpr_disparity,fnr_disparity,tpr_disparity,tnr_disparity,npv_disparity
0,gender,female,0.511,1.206,1.112,0.884,1.172,1.184,0.799,1.068,0.929,0.976
1,gender,male,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2,age_over_forty,False,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
3,age_over_forty,True,0.358,0.893,0.979,1.024,0.677,0.851,0.877,1.04,1.066,1.053


**Perfect!**