# Debug housing price predictions

This notebook demonstrates the use of the AzureML RAI components to assess a classification model trained on Kaggle's apartments dataset (https://www.kaggle.com/alphaepsilon/housing-prices-dataset). The model predicts whether a house sells for more than the median price. It is a reimplementation of the [notebook of the same name](https://github.com/microsoft/responsible-ai-toolbox/blob/main/notebooks/responsibleaidashboard/responsibleaidashboard-housing-classification-model-debugging.ipynb) in the [Responsible AI toolbox repo](https://github.com/microsoft/responsible-ai-toolbox).

First, we need to specify the version of the RAI components which are available in the workspace. This was specified when the components were uploaded, and will have defaulted to '1':

In [None]:
version_string = '1'

We also need to give the name of the compute cluster we want to use in AzureML. Later in this notebook, we will create it if it does not already exist:

In [None]:
compute_name = "cpucluster"

## Accessing the Data

The following section examines the code necessary to create datasets and a model using components in AzureML.

### Fetching the data

In [None]:
import shap
import sklearn
import pandas as pd

from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier
import zipfile

First, we load the data from the blob store, do some basic data cleaning, and split it into training and test datasets:

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

def split_label(dataset, target_feature):
    X = dataset.drop([target_feature], axis=1)
    y = dataset[[target_feature]]
    return X, y

target_feature = 'Sold_HigherThan_Median'
categorical_features = []

outdirname = 'responsibleai.12.28.21'
try:
    from urllib import urlretrieve
except ImportError:
    from urllib.request import urlretrieve
zipfilename = outdirname + '.zip'
urlretrieve('https://publictestdatasets.blob.core.windows.net/data/' + zipfilename, zipfilename)
with zipfile.ZipFile(zipfilename, 'r') as unzip:
    unzip.extractall('.')

all_data = pd.read_csv('apartments-train.csv')
all_data = all_data.drop(['SalePrice','SalePriceK'], axis=1)
X, y = split_label(all_data, target_feature)


X_train_original, X_test_original, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=7, stratify=y)

train_data = X_train_original.copy()
train_data[target_feature] = y_train

test_data = X_test_original.copy()
test_data[target_feature] = y_test
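
Because the split above is stratified on the target, both halves should contain roughly the same fraction of each class. The following is a minimal sketch of a sanity check for that. The helper name `class_fractions` is hypothetical and not part of the original notebook, and the toy label lists are illustrative stand-ins for `y_train` and `y_test`:

```python
from collections import Counter


def class_fractions(labels):
    """Return {label: fraction of rows carrying that label}."""
    labels = list(labels)
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}


# Toy stand-ins for the real label columns (illustrative values only)
toy_train = [0, 0, 1, 1]
toy_test = [0, 1, 0, 1]
fractions_train = class_fractions(toy_train)
fractions_test = class_fractions(toy_test)
```

In the notebook itself, the helper could be called as `class_fractions(y_train[target_feature])` and `class_fractions(y_test[target_feature])`, since it accepts any iterable of labels.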

### Get the Data to AzureML

With the data now split into 'train' and 'test' DataFrames, we save them out to files in preparation for upload into AzureML:

In [None]:
print("Saving to files")
train_data.to_parquet("housing_train.parquet", index=False)
test_data.to_parquet("housing_test.parquet", index=False)

We are going to create two Datasets in AzureML, one for the train and one for the test datasets. The first step is to create an `MLClient` to perform the upload. The method we use assumes that there is a `config.json` file (downloadable from the Azure or AzureML portals) present in the same directory as this notebook file:

In [None]:
from azure.ml import MLClient
from azure.identity import DefaultAzureCredential
ml_client = MLClient.from_config(credential=DefaultAzureCredential(exclude_shared_token_cache_credential=True),
                     logging_enable=True)

We can then define the Datasets, and create them in AzureML. This will also upload the Parquet files:

In [None]:
from azure.ml.entities import Dataset

train_dataset = Dataset(
    name="Housing_Train_from_Notebook",
    local_path="housing_train.parquet",
)
ml_client.datasets.create_or_update(train_dataset)

test_dataset = Dataset(
    name="Housing_Test_from_Notebook",
    local_path="housing_test.parquet",
)
ml_client.datasets.create_or_update(test_dataset)

## A model training pipeline

To simplify the model creation process, we're going to use a pipeline. This will have two stages:

1. The actual training component
1. A model registration component

We have to register the model in AzureML in order for our RAI insights components to use it.

### The Training Component

The training component is specific to this particular model. First, we write the training script which will be executed. In this case, we are going to train an `LGBMClassifier` on the input data and save it using MLFlow. We need command line arguments to specify the location of the input data, the location where MLFlow should write the output model, and the name of the target column (i.e. `y`) in the dataset:

In [None]:
%%writefile housing_training_script.py

import argparse
import os
import shutil
import tempfile


from azureml.core import Run

import mlflow
import mlflow.sklearn

import pandas as pd
from lightgbm import LGBMClassifier

def parse_args():
    # setup arg parser
    parser = argparse.ArgumentParser()

    # add arguments
    parser.add_argument("--training_data", type=str, help="Path to training data")
    parser.add_argument("--target_column_name", type=str, help="Name of target column")
    parser.add_argument("--model_output", type=str, help="Path of output model")

    # parse args
    args = parser.parse_args()

    # return args
    return args


def main(args):
    current_experiment = Run.get_context().experiment
    tracking_uri = current_experiment.workspace.get_mlflow_tracking_uri()
    print("tracking_uri: {0}".format(tracking_uri))
    mlflow.set_tracking_uri(tracking_uri)
    mlflow.set_experiment(current_experiment.name)

    # Read in data
    print("Reading data")
    all_data = pd.read_parquet(args.training_data)

    print("Extracting X_train, y_train")
    print("all_data cols: {0}".format(all_data.columns))
    y_train = all_data[args.target_column_name]
    X_train = all_data.drop(labels=args.target_column_name, axis="columns")
    print("X_train cols: {0}".format(X_train.columns))

    print("Training model")
    # The estimator can be changed to suit
    model = LGBMClassifier(n_estimators=5)
    model.fit(X_train, y_train)

    # Saving model with mlflow - leave this section unchanged
    with tempfile.TemporaryDirectory() as td:
        print("Saving model with MLFlow to temporary directory")
        tmp_output_dir = os.path.join(td, "my_model_dir")
        mlflow.sklearn.save_model(sk_model=model, path=tmp_output_dir)

        print("Copying MLFlow model to output path")
        for file_name in os.listdir(tmp_output_dir):
            print("  Copying: ", file_name)
            # From Python 3.8 onwards, shutil.copytree accepts dirs_exist_ok,
            # which would remove the need for this listdir loop
            shutil.copy2(src=os.path.join(tmp_output_dir, file_name), dst=os.path.join(args.model_output, file_name))


# run script
if __name__ == "__main__":
    # add space in logs
    print("*" * 60)
    print("\n\n")

    # parse args
    args = parse_args()

    # run main function
    main(args)

    # add space in logs
    print("*" * 60)
    print("\n\n")
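
To see how the script's command line arguments are interpreted without submitting a job, the same parser can be fed an explicit argument list. This is a minimal sketch: the function name `build_parser` and the example paths are assumptions made for illustration, while the three argument names are the ones used in the script above:

```python
import argparse


def build_parser():
    """Build a parser with the same three arguments as the training script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--training_data", type=str, help="Path to training data")
    parser.add_argument("--target_column_name", type=str, help="Name of target column")
    parser.add_argument("--model_output", type=str, help="Path of output model")
    return parser


# Hypothetical values, for illustration only
example_argv = [
    "--training_data", "housing_train.parquet",
    "--target_column_name", "Sold_HigherThan_Median",
    "--model_output", "./model_out",
]
parsed = build_parser().parse_args(example_argv)
```

Passing a list to `parse_args` bypasses `sys.argv`, which makes it convenient for checking the argument handling locally.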

Now that the script is saved on our local drive, we can use the AzureML SDKv2 to describe the component, and our `MLClient` object to register it with AzureML. First, we choose a version for this component (which need not match the version of the RAI components):

In [None]:
training_component_version_string = '4'

Now, define and create the component:

In [None]:
from azure.ml.entities import Code, CommandComponent

training_code = Code(
    local_path='housing_training_script.py'
)

training_inputs = {
    'training_data': { 'type': 'path'},
    'target_column_name': { 'type': 'string'}
}

training_outputs = {
    'model_output': { 'type': 'path'}
}

training_component = CommandComponent(
    name="HousingTrainingComponent",
    version=training_component_version_string,
    display_name="Simple training component for housing Dataset",
    code=training_code,
    environment=f"AML-RAI-Environment:{version_string}",
    inputs=training_inputs,
    outputs=training_outputs,
    command="python housing_training_script.py " \
            "--training_data ${{inputs.training_data}} " \
            "--target_column_name ${{inputs.target_column_name}} " \
            "--model_output ${{outputs.model_output}}"
)

ml_client.components.create_or_update(training_component)
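
The `${{inputs...}}` and `${{outputs...}}` placeholders in the `command` string are replaced with concrete values when the job runs. The sketch below only illustrates that idea with a regular expression. It is not how AzureML implements the substitution, and the helper `render_command` and the example paths are hypothetical:

```python
import re

# Matches ${{inputs.name}} and ${{outputs.name}}
_BINDING = re.compile(r"\$\{\{(inputs|outputs)\.(\w+)\}\}")


def render_command(template, inputs, outputs):
    """Substitute ${{inputs.x}} / ${{outputs.y}} placeholders with concrete values."""
    values = {"inputs": inputs, "outputs": outputs}

    def _replace(match):
        kind, name = match.group(1), match.group(2)
        # A KeyError here signals a placeholder with no declared input/output
        return str(values[kind][name])

    return _BINDING.sub(_replace, template)


command_template = (
    "python housing_training_script.py "
    "--training_data ${{inputs.training_data}} "
    "--target_column_name ${{inputs.target_column_name}} "
    "--model_output ${{outputs.model_output}}"
)
rendered = render_command(
    command_template,
    inputs={"training_data": "/mnt/in/train", "target_column_name": "Sold_HigherThan_Median"},
    outputs={"model_output": "/mnt/out/model"},
)
```

Rendering the template locally like this can help spot a placeholder whose name does not match any key in `training_inputs` or `training_outputs`.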

We need a compute target on which to run our jobs. The following checks whether the compute specified above is present; if not, then the compute target is created.

In [None]:
from azure.ml.entities import AmlCompute

all_compute_names = [x.name for x in ml_client.compute.list()]

if compute_name in all_compute_names:
    print(f"Found existing compute: {compute_name}")
else:
    my_compute = AmlCompute(
        name=compute_name,
        size="Standard_DS2_v2",
        min_instances=0,
        max_instances=4,
        idle_time_before_scale_down=3600
    )
    ml_client.compute.begin_create_or_update(my_compute)
    print("Initiated compute creation")

### Running a training pipeline

The component to register the model is part of the suite of RAI components, so we do not have to define it here. As such, we are now ready to run the training pipeline itself.

We start by defining the base name under which we want to register the model, together with a timestamp-based suffix that is appended to it:

In [None]:
import time

from azure.ml.entities import JobInput, ComponentJob, PipelineJob

model_name_suffix = int(time.time())
model_name = 'my_housing_nb_model'

Next, we define the pipeline using objects from the AzureML SDKv2. As mentioned above, there are two component jobs: one to train the model, and one to register it:

In [None]:
# The overall inputs for the pipeline

pipeline_inputs = {
    'target_column_name': target_feature,
    'my_training_data': JobInput(dataset="Housing_Train_from_Notebook:1", mode="download"),
    'my_test_data': JobInput(dataset="Housing_Test_from_Notebook:1", mode="download")
}

# Specify the training job
train_job_inputs = {
    'target_column_name': '${{inputs.target_column_name}}',
    'training_data': '${{inputs.my_training_data}}',
}
train_job_outputs = {
    'model_output': None
}
train_job = ComponentJob(
    component=f"HousingTrainingComponent:{training_component_version_string}",
    inputs=train_job_inputs,
    outputs=train_job_outputs
)

# The model registration job
register_job_inputs = {
    'model_input_path': '${{jobs.train-model-job.outputs.model_output}}',
    'model_base_name': model_name,
    'model_name_suffix': model_name_suffix
}
register_job_outputs = {
    'model_info_output_path': None
}
register_job = ComponentJob(
    component=f"register_model:{version_string}",
    inputs=register_job_inputs,
    outputs=register_job_outputs
)

With our jobs specified, we assemble them into a pipeline. The pipeline runs on the compute named in `compute_name` (`cpucluster` above); to use your own compute, change that variable:

In [None]:
model_registration_pipeline_job = PipelineJob(
    experiment_name="Register_Housing_Model_From_Notebook_01",
    description="Create and register a model from a notebook",
    jobs={
        'train-model-job': train_job,
        'register-model-job': register_job,
    },
    inputs=pipeline_inputs,
    outputs=register_job_outputs,
    compute=compute_name
)

And submit it to AzureML. We define a helper function to do the submission, which waits for the submitted job to complete:

In [None]:
from azure.ml.entities import PipelineJob

def submit_and_wait(ml_client, pipeline_job) -> PipelineJob:
    created_job = ml_client.jobs.create_or_update(pipeline_job)
    assert created_job is not None

    while created_job.status not in ['Completed', 'Failed', 'Canceled', 'NotResponding']:
        time.sleep(30)
        created_job = ml_client.jobs.get(created_job.name)
        print("Latest status : {0}".format(created_job.status))
    assert created_job.status == 'Completed'
    return created_job

# This is the actual submission
training_job = submit_and_wait(ml_client, model_registration_pipeline_job)
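
The polling loop in `submit_and_wait` runs until the job reaches a terminal state, however long that takes. Below is a minimal sketch of a variant that gives up after a timeout. The name `wait_for_job`, its parameters, and the fake status sequence are assumptions for illustration; in the notebook, `get_status` would wrap a call such as `ml_client.jobs.get(created_job.name).status`:

```python
import time

TERMINAL_STATES = {"Completed", "Failed", "Canceled", "NotResponding"}


def wait_for_job(get_status, poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Poll get_status() until a terminal state is seen or the timeout elapses."""
    waited = 0
    status = get_status()
    while status not in TERMINAL_STATES:
        if waited >= timeout_seconds:
            raise TimeoutError(f"Job still '{status}' after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
        status = get_status()
    return status


# Stand-in for repeated status lookups; no real sleeping takes place
fake_statuses = iter(["Running", "Running", "Completed"])
final_status = wait_for_job(lambda: next(fake_statuses), sleep=lambda seconds: None)
```

Injecting the `sleep` function keeps the helper testable without waiting for real time to pass.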

## Creating the RAI Insights

We have a registered model, and can now run a pipeline to create the RAI insights. First off, compute the name of the model we registered:

In [None]:
expected_model_id = f'{model_name}_{model_name_suffix}:1'

Now, we create the RAI pipeline itself. There are four 'component stages' in this pipeline:

1. Fetch the model
1. Construct an empty `RAIInsights` object
1. Run the RAI tool components
1. Gather the tool outputs into a single `RAIInsights` object

The job to fetch the registered model is:

In [None]:
# This won't be necessary once models are types within the pipeline graph

fetch_job_inputs = {
    'model_id': expected_model_id
}
fetch_job_outputs = {
    'model_info_output_path': None
}
fetch_job = ComponentJob(
    component=f"fetch_registered_model:{version_string}",
    inputs=fetch_job_inputs,
    outputs=fetch_job_outputs
)

With this registered model (and our datasets), we can create an empty RAI dashboard:

In [None]:
import json

# Top level RAI Insights component

# We will reuse the same pipeline_inputs object in the end
create_rai_inputs = {
    'title': 'Run built from a Notebook',
    'task_type': 'classification',
    'model_info_path': '${{jobs.fetch-model-job.outputs.model_info_output_path}}',
    'train_dataset': '${{inputs.my_training_data}}',
    'test_dataset': '${{inputs.my_test_data}}',
    'target_column_name': '${{inputs.target_column_name}}',
    'categorical_column_names': json.dumps(categorical_features),
    'classes': '["Less than median", "More than median"]'
}
create_rai_outputs = {
    'rai_insights_dashboard': None # Could theoretically redirect the datastore here
}
create_rai_job = ComponentJob(
    component=f"rai_insights_constructor:{version_string}",
    inputs=create_rai_inputs,
    outputs=create_rai_outputs
)

Now, create instances of our RAI tools. Each of the tools has its own component, which accepts the same arguments as the corresponding manager of the `RAIInsights` object:

In [None]:
# Setup the explanation
explain_inputs = {
    'comment': 'Insert text here',
    'rai_insights_dashboard': '${{jobs.create-rai-job.outputs.rai_insights_dashboard}}'
}
explain_outputs = {
    'explanation': None
}
explain_job = ComponentJob(
    component=f"rai_insights_explanation:{version_string}",
    inputs=explain_inputs,
    outputs=explain_outputs
)


# Setup counterfactual
counterfactual_inputs = {
    'rai_insights_dashboard': '${{jobs.create-rai-job.outputs.rai_insights_dashboard}}',
    'total_CFs': '10',
    'desired_class': 'opposite'
}
counterfactual_outputs = {
    'counterfactual': None
}
counterfactual_job = ComponentJob(
    component=f"rai_insights_counterfactual:{version_string}",
    inputs=counterfactual_inputs,
    outputs=counterfactual_outputs
)

# Setup error analysis
error_analysis_inputs = {
    'rai_insights_dashboard': '${{jobs.create-rai-job.outputs.rai_insights_dashboard}}',
}
error_analysis_outputs = {
    'error_analysis': None
}
error_analysis_job = ComponentJob(
    component=f"rai_insights_erroranalysis:{version_string}",
    inputs=error_analysis_inputs,
    outputs=error_analysis_outputs
)

# Setup causal
causal_inputs = {
    'rai_insights_dashboard': '${{jobs.create-rai-job.outputs.rai_insights_dashboard}}',
    'treatment_features': '["OverallCond", "OverallQual", "Fireplaces", "GarageCars", "ScreenPorch"]',
}
causal_outputs = {
    'causal': None
}
causal_job = ComponentJob(
    component=f"rai_insights_causal:{version_string}",
    inputs=causal_inputs,
    outputs=causal_outputs
)
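
Several component inputs above (`categorical_column_names`, `classes`, `treatment_features`) are lists passed as JSON-encoded strings. As a small sketch, `json.dumps` can produce those strings from Python lists rather than writing the literals by hand, which avoids quoting mistakes. The variable names ending in `_param` are hypothetical:

```python
import json

classes = ["Less than median", "More than median"]
treatment_features = ["OverallCond", "OverallQual", "Fireplaces", "GarageCars", "ScreenPorch"]

# Encode each list as the JSON string a component input expects
classes_param = json.dumps(classes)
treatment_param = json.dumps(treatment_features)
# An empty list, such as categorical_features in this notebook, encodes as '[]'
empty_param = json.dumps([])
```

Decoding with `json.loads` recovers the original list, so the round trip can be checked locally before submitting a pipeline.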

Next, we configure the 'gather' component, which assembles everything into an `RAIInsights` object and computes the JSON for the UX:

In [None]:
# Configure the gather component
gather_inputs = {
    'constructor': '${{jobs.create-rai-job.outputs.rai_insights_dashboard}}',
    'insight_1': '${{jobs.explain-job.outputs.explanation}}',
    'insight_2': '${{jobs.counterfactual-job.outputs.counterfactual}}',
    'insight_3': '${{jobs.error-analysis-job.outputs.error_analysis}}',
    'insight_4': '${{jobs.causal-job.outputs.causal}}'
}
gather_outputs = {
    'dashboard': None,
    'ux_json': None
}
gather_job = ComponentJob(
    component=f"rai_insights_gather:{version_string}",
    inputs=gather_inputs,
    outputs=gather_outputs
)

With all of our jobs defined, we can assemble them into the pipeline itself. Again, the pipeline runs on the compute named in `compute_name`, so set that variable to the name of your own compute resource if it is not `cpucluster`:

In [None]:
# Pipeline to construct the RAI Insights
insights_pipeline_job = PipelineJob(
    experiment_name=f"Compute_Housing_Insights_from_Notebook_{version_string}",
    description="Python submitted Housing insights using fetched model",
    jobs={
        'fetch-model-job': fetch_job,
        'create-rai-job': create_rai_job,
        'counterfactual-job': counterfactual_job,
        'error-analysis-job': error_analysis_job,
        'explain-job': explain_job,
        'causal-job': causal_job,
        'housing-gather-job': gather_job
    },
    inputs=pipeline_inputs,
    outputs=None,
    compute=compute_name
)
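
Each `${{jobs.<job-name>.outputs.<output-name>}}` binding used in the job inputs refers to a key of the pipeline's `jobs` dictionary and to an output declared by that job, so a misspelled name would leave a reference that cannot resolve. The following sketch checks for such dangling references before submission. The helper `find_dangling_job_refs` and the toy dictionaries (including the deliberately misspelled `create-rai-jobb`) are illustrative assumptions, not part of the AzureML SDK:

```python
import re

# Captures the job name and output name from a ${{jobs.X.outputs.Y}} binding
_JOB_REF = re.compile(r"\$\{\{jobs\.([\w-]+)\.outputs\.(\w+)\}\}")


def find_dangling_job_refs(jobs_inputs, jobs_outputs):
    """Return (job, input, reference) triples whose reference does not resolve.

    jobs_inputs:  {job_name: {input_name: value}}
    jobs_outputs: {job_name: {output_name: anything}}
    """
    problems = []
    for job_name, inputs in jobs_inputs.items():
        for input_name, value in inputs.items():
            if not isinstance(value, str):
                continue  # literal non-string inputs cannot contain bindings
            for ref_job, ref_output in _JOB_REF.findall(value):
                if ref_output not in jobs_outputs.get(ref_job, {}):
                    problems.append((job_name, input_name, f"{ref_job}.{ref_output}"))
    return problems


toy_inputs = {
    "create-rai-job": {"model_info_path": "${{jobs.fetch-model-job.outputs.model_info_output_path}}"},
    "explain-job": {"rai_insights_dashboard": "${{jobs.create-rai-jobb.outputs.rai_insights_dashboard}}"},
}
toy_outputs = {
    "fetch-model-job": {"model_info_output_path": None},
    "create-rai-job": {"rai_insights_dashboard": None},
    "explain-job": {"explanation": None},
}
dangling = find_dangling_job_refs(toy_inputs, toy_outputs)
```

With the notebook's own dictionaries, the two arguments could be built by pairing each job name with its `*_inputs` and `*_outputs` dictionary.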

Now, submit the pipeline job and wait for it to complete:

In [None]:
insights_job = submit_and_wait(ml_client, insights_pipeline_job)

Once this is complete, we can go to the Registered Models view in the AzureML portal, and find the model we have just registered. On the 'Model Details' page, there is a 'Responsible AI dashboard' tab where we can view the insights which we have just uploaded.