In [None]:
# @hidden_cell
# The project token is an authorization token that is used to access project resources, such as data sources and connections, and that is used by platform APIs.
from project_lib import Project
project = Project(project_id='5bc64a8f-3e83-42be-95cc-10d6c9519046', project_access_token='p-c42c9564e3b5d86b5db5fa5fb224069f0f47de83')


# Use a feature group to configure an AutoAI experiment

This notebook demonstrates how feature group metadata speeds up the configuration of an AutoAI experiment. The notebook describes the commands for retrieving a feature group, training experiments, and scoring. The notebook uses the project data asset called `german_credit_data_biased_training.csv`, which contains the German Credit Risk data set. This data asset has been enriched with feature group metadata. You can view the _raw_ data set, without the metadata enrichment, at this location: [german_credit_data_biased_training.csv](https://github.com/IBM/watson-machine-learning-samples/blob/master/cloud/data/bias/german_credit_data_biased_training.csv).



## What you'll learn in this notebook

This notebook shows you how to:

- Retrieve existing feature group information for a data asset by using the `assetframe-lib` Python library.
- Quickly understand the training data using the feature group _preview_ method.
- Use feature group metadata to configure source and target columns in an AutoAI experiment, and to provide fairness information for AutoAI.


## Contents

This notebook contains the following parts:

1. [Before you start](#beforeYouStart)
2. [Work with feature groups](#featuregroup)
3. [Create an AutoAI experiment](#autoai)
4. [Summary](#summary)

<a id="beforeYouStart"></a>
## Before you start

### Set up Watson Machine Learning
To run the AutoAI experiment that is part of this notebook, you must:
- Create an instance of Watson Machine Learning, if you haven't already done so.
- Associate this instance with your Watson Studio project.

### Create a project token
Before you can begin working on this notebook in Watson Studio in Cloud Pak for Data as a Service, you need to ensure that the project token is set so that you can access the project assets via the notebook.

When this notebook is added to the project, a project access token should be inserted at the top of the notebook in a code cell. If you do not see that cell at the top of this notebook, add the token by clicking **More > Insert project token** from the notebook action bar. Running the inserted hidden code cell creates a project object that you can use to access project resources.

![ws-project.mov](https://media.giphy.com/media/jSVxX2spqwWF9unYrs/giphy.gif)

Note that you can step through the notebook cell by cell by pressing Shift-Enter, or you can execute the entire notebook by selecting **Cell -> Run All** from the menu.


<a id="featuregroup"></a>
## Work with feature groups

### Overview

Feature groups are data assets that are useful for multiple machine learning use cases. Data engineers and data scientists can provide additional metadata for each feature of such a feature group that helps downstream machine learning tasks. This notebook uses two metadata attributes from the feature group: 
- The `role` of the feature: whether it should be used as _input_ to a machine learning model, whether it is the _target_ of a prediction, or whether it is the _identifier_ of a particular row of a data asset.
- The `fairness information`: this includes _monitored_ and _reference_ groups for input features, and the _favorable_ outcome for target features.
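
To make these two attributes concrete, the following sketch models them with a plain Python data class. The `FeatureMetadata` class and its field names are hypothetical stand-ins chosen for illustration; they are not the `assetframe-lib` API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FeatureMetadata:
    """Hypothetical stand-in for the metadata attached to one feature."""
    name: str
    role: str  # "Input", "Target", or "Identifier"
    monitored_groups: List = field(default_factory=list)
    reference_groups: List = field(default_factory=list)
    favorable_labels: List[str] = field(default_factory=list)


# Values taken from the German Credit Risk feature group used in this notebook.
features = [
    FeatureMetadata("Age", "Input", monitored_groups=[[18, 25]], reference_groups=[[26, 75]]),
    FeatureMetadata("Sex", "Input", monitored_groups=["female"], reference_groups=["male"]),
    FeatureMetadata("Risk", "Target", favorable_labels=["No Risk"]),
]

# Group the feature names by role.
inputs = [f.name for f in features if f.role == "Input"]
targets = [f.name for f in features if f.role == "Target"]
print(inputs, targets)
```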

There are three options in Cloud Pak for Data to create and view feature groups. You can use:
- The _Feature group_ tab of data assets in the catalog UI
- The _Feature group_ tab of data assets in the project UI
- The _assetframe-lib_ Python library in notebooks

This notebook shows how to view feature groups using the `assetframe-lib` library.
For additional details on feature groups and the feature group tab, see [Managing data features](https://ibmdocs-test.dcs.ibm.com/docs/en/icpdaas_test?topic=data-managing-feature-groups-beta).

### Show the feature group and sample data

Begin by initializing the `assetframe-lib` library:

In [3]:
asset_name = "german_credit_data_biased_training.csv"

In [4]:
from ibm_watson_studio_lib import access_project_or_space
wslib = access_project_or_space({'token': project.project_context.accessToken})

from assetframe_lib import AssetFrame

AssetFrame._wslib = wslib

af = AssetFrame.from_data_asset(asset_name)

#### Data Preview
The `head()` method in the `assetframe-lib` library gives you a quick overview of the underlying data asset, without you having to import all the data into the notebook. Plus, it includes the following information about the data asset and the feature group:
- The _name_ of the data asset
- The _role_ of each feature
- The _description_ of each feature
- The _recipe_ of each feature
- For input features, it highlights values from the monitored groups (in yellow) and the reference groups (in brown)
- For target features, it highlights favorable outcomes (in green) and unfavorable outcomes (in red)


In [4]:
af.head()

Unnamed: 0,CheckingStatus,LoanDuration Input,CreditHistory,LoanPurpose Input,LoanAmount Input,ExistingSavings,EmploymentDuration Input,InstallmentPercent,Sex Input,OthersOnLoan Input,CurrentResidenceDuration,OwnsProperty Input,Age Input,InstallmentPlans,Housing Input,ExistingCreditsCount,Job Input,Dependents,Telephone,ForeignWorker,Risk Target
0,0_to_200,31.0,credits_paid_to_date,other,1889.0,100_to_500,less_1,3.0,female,none,3.0,savings_insurance,32.0,none,own,1.0,skilled,1.0,none,yes,No Risk
1,less_0,18.0,credits_paid_to_date,car_new,462.0,less_100,1_to_4,2.0,female,none,2.0,savings_insurance,37.0,stores,own,2.0,skilled,1.0,none,yes,No Risk
2,less_0,15.0,prior_payments_delayed,furniture,250.0,less_100,1_to_4,2.0,male,none,3.0,real_estate,28.0,none,own,2.0,skilled,1.0,yes,no,No Risk
3,0_to_200,28.0,credits_paid_to_date,retraining,3693.0,less_100,greater_7,3.0,male,none,2.0,savings_insurance,32.0,none,own,1.0,skilled,1.0,none,yes,No Risk
4,no_checking,28.0,prior_payments_delayed,education,6235.0,500_to_1000,greater_7,3.0,male,none,3.0,unknown,57.0,none,own,2.0,skilled,1.0,none,yes,Risk


#### Print all features

If you're only interested in the feature group metadata, use the name of the `assetframe` or `print(<assetframe name>)` in a notebook cell.

In [5]:
af

Unnamed: 0,Role,Description,Favorable labels,Unfavorable labels,Monitored groups,Reference groups,Value descriptions,Recipe,Tags
Age,Input,,,,"[18, 25]","[26, 75]",,,
EmploymentDuration,Input,,,,,,,,
Housing,Input,,,,,,,,
Job,Input,,,,,,,,
LoanAmount,Input,,,,,,,,
LoanDuration,Input,,,,,,,,
LoanPurpose,Input,,,,,,,,
OthersOnLoan,Input,Whether there are other debtors or guarantors ...,,,,,"('none', 'No others on the loan'), ('co-applic...",,
OwnsProperty,Input,,,,,,,,
Risk,Target,,No Risk,Risk,,,,,


### Collect the metadata for the AutoAI experiment

To start an AutoAI experiment with a data asset, you need to provide:
- The columns to be used as input to the model
- The column to be used as target
- Fairness information

With `assetframe-lib` methods, you can easily access this specific metadata, and retrieve it in a form that can be used directly in AutoAI. Here's how:

#### Get the input features

In [6]:
input_features = af.get_features_by_role("input")

In [7]:
input_feature_names = list(map(lambda feature: feature.get_column_name(), input_features))
print(input_feature_names)

['Age', 'EmploymentDuration', 'Housing', 'Job', 'LoanAmount', 'LoanDuration', 'LoanPurpose', 'OthersOnLoan', 'OwnsProperty', 'Sex']
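
The `map`/`lambda` call can also be written as a list comprehension. The sketch below uses a hypothetical `StubFeature` class in place of the objects returned by `get_features_by_role()`, so that it runs without a project connection; only the `get_column_name()` method name is taken from the notebook.

```python
class StubFeature:
    """Hypothetical stand-in for a feature object; not part of assetframe-lib."""

    def __init__(self, column_name):
        self._column_name = column_name

    def get_column_name(self):
        return self._column_name


stub_features = [StubFeature(n) for n in ("Age", "Housing", "Sex")]

# Equivalent to list(map(lambda feature: feature.get_column_name(), stub_features))
stub_names = [feature.get_column_name() for feature in stub_features]

# Keep only the input columns of one data row.
row = {"Age": 32.0, "Housing": "own", "Sex": "female", "Telephone": "none", "Risk": "No Risk"}
input_row = {name: row[name] for name in stub_names}
print(input_row)
```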


#### Get the target feature

In the example, only one feature is labeled as `target`, which is why you can access the result directly using `[0]`.

In [8]:
target_features = af.get_features_by_role("target")

In [9]:
target_feature_name = target_features[0].get_column_name()
print(target_feature_name)

Risk
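
Indexing with `[0]` silently picks the first entry if a feature group has several targets, and raises an `IndexError` if it has none. A small guard makes the single-target assumption explicit. In this sketch, `single_target_name` is a hypothetical helper, and `StubTarget` only imitates the `get_column_name()` method used above.

```python
class StubTarget:
    """Hypothetical stand-in for a target feature object."""

    def __init__(self, column_name):
        self._column_name = column_name

    def get_column_name(self):
        return self._column_name


def single_target_name(features):
    """Return the column name of the only target feature, or raise ValueError."""
    if len(features) != 1:
        raise ValueError(f"expected exactly one target feature, found {len(features)}")
    return features[0].get_column_name()


checked_name = single_target_name([StubTarget("Risk")])
print(checked_name)
```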


#### Get the fairness information for input and target features

You could get the information for each feature separately, using the following pattern:
```
feature = af.get_feature(<name>)
monitoredGroups = feature.get_monitored_groups()
referenceGroups = feature.get_reference_groups()
``` 

However, the `assetframe-lib` library provides a convenient method called `get_fairness_info()` that you can use to retrieve _all_ fairness information for a feature group, in a format that can be directly used in AutoAI, or other libraries such as [AI Fairness 360](https://aif360.mybluemix.net/resources#overview). 

In [10]:
fairness_info = af.get_fairness_info()

In [11]:
fairness_info

{'favorable_labels': ['No Risk'],
 'unfavorable_labels': ['Risk'],
 'protected_attributes': [{'feature': 'Age',
   'monitored_group': [[18, 25]],
   'reference_group': [[26, 75]]},
  {'feature': 'Sex',
   'monitored_group': ['female'],
   'reference_group': ['male']}]}
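
The dictionary mixes two kinds of group definitions: numeric ranges such as `[18, 25]` and category values such as `'female'`. The sketch below shows one way to test whether a value falls into a group. `in_group` is a hypothetical helper, and reading a two-element list as an inclusive range is an assumption based on the output above.

```python
def in_group(value, groups):
    """Return True if value matches any of the groups.

    A two-element list or tuple is read as an inclusive numeric range
    (an assumption); anything else is compared for equality.
    """
    for group in groups:
        if isinstance(group, (list, tuple)) and len(group) == 2:
            low, high = group
            if isinstance(value, (int, float)) and low <= value <= high:
                return True
        elif value == group:
            return True
    return False


# Copied from the fairness_info output above.
protected = {
    "Age": {"monitored_group": [[18, 25]], "reference_group": [[26, 75]]},
    "Sex": {"monitored_group": ["female"], "reference_group": ["male"]},
}

age_22_monitored = in_group(22, protected["Age"]["monitored_group"])
age_32_monitored = in_group(32, protected["Age"]["monitored_group"])
female_monitored = in_group("female", protected["Sex"]["monitored_group"])
print(age_22_monitored, age_32_monitored, female_monitored)
```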

<a id="autoai"></a>
## Create an AutoAI experiment

Now, you will learn to use AutoAI to build a model that predicts `Risk` (our target feature), given the input features from the feature group.

### Connect to the IBM Watson Machine Learning service

Authenticate the Watson Machine Learning service on IBM Cloud Pak for Data. You need to provide the location `url` and an `api_key`.

There are different ways to authenticate, but the notebook assumes the use of an API key, and not an IAM token.
See [Authenticating](https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/ml-authentication.html) for details on how to get the API key.

In [12]:
# @hidden_cell
api_key = '<YOUR API KEY>'  # replace the placeholder with your API key
location = 'us-south'  # please change to the appropriate location of your WML service, if needed

In [13]:
url = 'https://' + location + '.ml.cloud.ibm.com'

wml_credentials = {
    "apikey": api_key,
    "url": url
}
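
As an alternative to pasting the key into a notebook cell, the credentials can be assembled from an environment variable. This is a sketch only: the variable name `WML_API_KEY` and the helper `build_wml_credentials` are hypothetical, and the URL pattern is the one used in the cell above.

```python
import os


def build_wml_credentials(location, env=None):
    """Build the credentials dictionary from a location and an API key.

    The key is read from the (hypothetical) WML_API_KEY environment variable.
    """
    env = os.environ if env is None else env
    key = env.get("WML_API_KEY")
    if not key:
        raise RuntimeError("Set the WML_API_KEY environment variable first")
    return {"apikey": key, "url": "https://" + location + ".ml.cloud.ibm.com"}


# Demonstration with a placeholder value instead of a real key.
example = build_wml_credentials("us-south", env={"WML_API_KEY": "<YOUR API KEY>"})
print(example["url"])
```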

In [14]:
from ibm_watson_machine_learning import APIClient

client = APIClient(wml_credentials)

### Connect to the credit risk data

In [15]:
from ibm_watson_machine_learning.helpers import DataConnection, AssetLocation

asset_details = wslib.assets.get_stored_data(asset_name, raw=True)
asset_id = asset_details["metadata"]["asset_id"]

credit_risk_conn = DataConnection(data_asset_id=asset_id)
training_data_reference=[credit_risk_conn]

### Configure the experiment with the feature group metadata

Now, you can create the AutoAI experiment. You will use the feature group metadata to configure key aspects of the experiment:
- `prediction_column` is the `target_feature_name`
- `fairness_info` is the information from `get_fairness_info()`
- `train_sample_columns_list` is the list of `input_feature_names`
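
Before passing the metadata to the optimizer, it can help to check that the pieces are consistent with each other. The sketch below is plain Python with a hypothetical helper, `check_experiment_config`; the two checks are assumptions about what a sensible configuration looks like, not requirements documented for AutoAI.

```python
def check_experiment_config(target, inputs, fairness):
    """Return a list of problems found in the experiment configuration."""
    problems = []
    if target in inputs:
        problems.append(f"target column {target!r} is also listed as an input")
    for attribute in fairness.get("protected_attributes", []):
        if attribute["feature"] not in inputs:
            problems.append(f"protected attribute {attribute['feature']!r} is not an input")
    return problems


# Values copied from the outputs earlier in this notebook.
example_inputs = ["Age", "EmploymentDuration", "Housing", "Job", "LoanAmount",
                  "LoanDuration", "LoanPurpose", "OthersOnLoan", "OwnsProperty", "Sex"]
example_fairness = {
    "favorable_labels": ["No Risk"],
    "unfavorable_labels": ["Risk"],
    "protected_attributes": [
        {"feature": "Age", "monitored_group": [[18, 25]], "reference_group": [[26, 75]]},
        {"feature": "Sex", "monitored_group": ["female"], "reference_group": ["male"]},
    ],
}

config_problems = check_experiment_config("Risk", example_inputs, example_fairness)
print(config_problems)
```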

In [16]:
from ibm_watson_machine_learning.experiment import AutoAI

project_id = wslib.here.get_ID()

experiment = AutoAI(wml_credentials, project_id)

pipeline_optimizer = experiment.optimizer(
    name='Credit Risk Prediction - AutoAI',
    desc='Credit Risk Model using Feature Group',
    prediction_type=AutoAI.PredictionType.BINARY,
    prediction_column=target_feature_name,
    fairness_info=fairness_info,
    train_sample_columns_list=input_feature_names,
    scoring=AutoAI.Metrics.ROC_AUC_SCORE,
)

`get_params()` shows that the feature group metadata was retrieved:

In [17]:
pipeline_optimizer.get_params()['fairness_info']

{'favorable_labels': ['No Risk'],
 'unfavorable_labels': ['Risk'],
 'protected_attributes': [{'feature': 'Age',
   'monitored_group': [[18, 25]],
   'reference_group': [[26, 75]]},
  {'feature': 'Sex',
   'monitored_group': ['female'],
   'reference_group': ['male']}]}

### Run the experiment

You can now run the experiment by calling the `fit()` method. Setting `background_mode=False` makes the call wait until the experiment finishes, so the experiment is complete when the notebook cell completes.

In [20]:
run_details = pipeline_optimizer.fit(
    training_data_reference=training_data_reference,
    background_mode=False,
)

Training job 649facf5-c156-48d1-9ddf-bcf4c1fba7f4 completed: 100%|████████| [02:45<00:00,  1.66s/it]


### Pipelines comparison

You can list trained pipelines and evaluation metrics information in
the form of a pandas DataFrame by calling the `summary()` method. You can
use the DataFrame to compare all discovered pipelines and select the one
you like for further testing.

Notice the columns for `training_disparate_impact_Sex` and `training_disparate_impact_Age`. AutoAI used the fairness information for both columns to compute these fairness metrics. 

In [21]:
summary = pipeline_optimizer.summary()
summary

Unnamed: 0_level_0,Enhancements,Estimator,training_disparate_impact_Sex,training_disparate_impact,training_roc_auc_(optimized),holdout_disparate_impact_Sex,holdout_average_precision,holdout_log_loss,holdout_roc_auc,holdout_precision,...,holdout_accuracy,holdout_balanced_accuracy,training_recall,holdout_f1,training_accuracy,holdout_disparate_impact,training_balanced_accuracy,holdout_disparate_impact_Age,training_f1,training_disparate_impact_Age
Pipeline Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Pipeline_3,"HPO, FE",SnapBoostingMachineClassifier,1.141034,3.246667,0.837266,1.04807,0.478866,0.458079,0.846034,0.871622,...,0.775551,0.774782,0.747653,0.821656,0.743696,1.886098,0.741761,1.943054,0.794974,4.003291
Pipeline_4,"HPO, FE, HPO",SnapBoostingMachineClassifier,1.141034,3.246667,0.837266,1.04807,0.478866,0.458079,0.846034,0.871622,...,0.775551,0.774782,0.747653,0.821656,0.743696,1.886098,0.741761,1.943054,0.794974,4.003291
Pipeline_2,HPO,SnapBoostingMachineClassifier,1.186798,4.143579,0.835565,1.053409,0.478234,0.456256,0.845015,0.868966,...,0.763527,0.765746,0.747317,0.810289,0.742803,1.944582,0.7406,2.001934,0.794276,4.306584
Pipeline_1,,SnapBoostingMachineClassifier,1.141928,2.613027,0.834072,1.078119,0.468686,0.408897,0.848838,0.902027,...,0.811623,0.815282,0.761413,0.850318,0.745479,1.942618,0.737653,1.943054,0.799033,3.09143
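
The disparate impact columns build on the idea of comparing favorable outcome rates between a monitored group and a reference group. The sketch below computes that ratio for a handful of made-up rows. The data is invented for illustration, the helper names are hypothetical, and AutoAI's exact computation may differ in detail.

```python
def favorable_rate(rows, feature, group_values, label_column, favorable_labels):
    """Share of rows in the group whose label is favorable."""
    members = [r for r in rows if r[feature] in group_values]
    if not members:
        raise ValueError(f"no rows found for {feature} in {group_values}")
    hits = sum(1 for r in members if r[label_column] in favorable_labels)
    return hits / len(members)


def disparate_impact(rows, feature, monitored, reference, label_column, favorable_labels):
    """Favorable rate of the monitored group divided by that of the reference group."""
    monitored_rate = favorable_rate(rows, feature, monitored, label_column, favorable_labels)
    reference_rate = favorable_rate(rows, feature, reference, label_column, favorable_labels)
    return monitored_rate / reference_rate


# Invented sample: 2 of 4 female and 3 of 4 male applicants are labeled "No Risk".
sample = (
    [{"Sex": "female", "Risk": "No Risk"}] * 2 + [{"Sex": "female", "Risk": "Risk"}] * 2
    + [{"Sex": "male", "Risk": "No Risk"}] * 3 + [{"Sex": "male", "Risk": "Risk"}] * 1
)

ratio = disparate_impact(sample, "Sex", ["female"], ["male"], "Risk", ["No Risk"])
print(round(ratio, 3))
```

A ratio of 1.0 means that both groups receive the favorable outcome at the same rate; values below 1.0 mean that the monitored group receives it less often than the reference group.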


Looking at `Pipeline_1`, you see that AutoAI only used the `Input` features for the model.

In [22]:
pipeline_details = pipeline_optimizer.get_pipeline_details('Pipeline_1')

In [23]:
pipeline_details['features_importance']

Unnamed: 0,features_importance
Age,1.0
LoanDuration,0.77
LoanAmount,0.52
EmploymentDuration,0.37
OwnsProperty,0.34
OthersOnLoan,0.13
Sex,0.1
Housing,0.01
LoanPurpose,0.0
Job,0.0
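
The same observation can be checked programmatically by comparing the feature names in the importance table with the input feature list. The values in this sketch are copied by hand from the outputs in this notebook, and the variable names are chosen for illustration only.

```python
# Feature importance values copied from the output above.
importance = {
    "Age": 1.0, "LoanDuration": 0.77, "LoanAmount": 0.52, "EmploymentDuration": 0.37,
    "OwnsProperty": 0.34, "OthersOnLoan": 0.13, "Sex": 0.1, "Housing": 0.01,
    "LoanPurpose": 0.0, "Job": 0.0,
}

# Input feature names copied from the feature group output earlier in the notebook.
expected_inputs = {
    "Age", "EmploymentDuration", "Housing", "Job", "LoanAmount",
    "LoanDuration", "LoanPurpose", "OthersOnLoan", "OwnsProperty", "Sex",
}

# Features the model used that are not inputs, and inputs the model did not report.
unexpected = sorted(set(importance) - expected_inputs)
missing = sorted(expected_inputs - set(importance))
print(unexpected, missing)
```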


<a id="summary"></a>
## Summary

Congratulations! You retrieved an existing feature group, and used its metadata to configure an AutoAI experiment. Feel free to extend this notebook to pick one of the AutoAI pipelines, and deploy the model.

### Authors

**Szymon Brandys**, Senior Software Engineer in Cloud Pak for Data, IBM

**Simone Zerfass**, Software Developer, Watson Studio, IBM

Copyright © 2023 IBM. This notebook and its source code are released under the terms of the MIT License.