# Debug housing price predictions

This notebook demonstrates how to use the `responsibleai` API to assess a classification model trained on Kaggle's apartments dataset (https://www.kaggle.com/alphaepsilon/housing-prices-dataset). The model predicts whether a house sells for more than the median price. The notebook walks through the API calls needed to create a widget with model analysis insights, then guides a visual analysis of the model.

## Launch Responsible AI Toolbox

The following section examines the code necessary to create datasets and a model. It then generates insights using the `responsibleai` API that can be visually analyzed.

### Train a Model
*The following section can be skipped. It loads a dataset and trains a model for illustrative purposes.*

In [1]:
import shap
import sklearn
import pandas as pd

from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier
import zipfile

First, load the apartments dataset and specify the different types of features. Then, clean it and put it into a dataframe with named columns. After loading and cleaning the data, split the data points into training and test sets. Assemble separate datasets for the training data and the test data.

In [2]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

def split_label(dataset, target_feature):
    X = dataset.drop([target_feature], axis=1)
    y = dataset[[target_feature]]
    return X, y

def clean_data(X, y, target_feature):
    features = X.columns.values.tolist()
    classes = y[target_feature].unique().tolist()
    pipe_cfg = {
        # Columns are selected by dtype. Only int64 counts as numeric here, so
        # a column of any other dtype (e.g. float64) is in neither list.
        'num_cols': X.dtypes[X.dtypes == 'int64'].index.values.tolist(),
        'cat_cols': X.dtypes[X.dtypes == 'object'].index.values.tolist(),
    }
    num_pipe = Pipeline([
        ('num_imputer', SimpleImputer(strategy='median')),
        # ('num_scaler', StandardScaler())
    ])
    cat_pipe = Pipeline([
        ('cat_imputer', SimpleImputer(strategy='constant', fill_value='?')),
        # Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`
        ('cat_encoder', OneHotEncoder(handle_unknown='ignore', sparse=False))
    ])
    feat_pipe = ColumnTransformer([
        ('num_pipe', num_pipe, pipe_cfg['num_cols']),
        ('cat_pipe', cat_pipe, pipe_cfg['cat_cols'])
    ])
    X = feat_pipe.fit_transform(X)
    print(pipe_cfg['cat_cols'])
    return X, feat_pipe, features, classes

target_feature = 'Sold_HigherThan_Median'
categorical_features = []

outdirname = 'responsibleai.12.28.21'
try:
    from urllib import urlretrieve
except ImportError:
    from urllib.request import urlretrieve
zipfilename = outdirname + '.zip'
urlretrieve('https://publictestdatasets.blob.core.windows.net/data/' + zipfilename, zipfilename)
with zipfile.ZipFile(zipfilename, 'r') as unzip:
    unzip.extractall('.')

all_data = pd.read_csv('apartments-train.csv')
all_data = all_data.drop(['SalePrice','SalePriceK'], axis=1)
X, y = split_label(all_data, target_feature)


X_train_original, X_test_original, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=7, stratify=y)

X_train, feat_pipe, features, classes = clean_data(X_train_original, y_train, target_feature)
y_train = y_train[target_feature].to_numpy()

X_test = feat_pipe.transform(X_test_original)
y_test = y_test[target_feature].to_numpy()

train_data = X_train_original.copy()
train_data[target_feature] = y_train

test_data = X_test_original.copy()
test_data[target_feature] = y_test

[]
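The empty list printed above is `pipe_cfg['cat_cols']`: no column in the training frame has dtype `object`, so only the numeric branch of the transformer does any work. The sketch below uses a made-up three-row frame (the column names are hypothetical, not taken from the apartments data) to illustrate one consequence of selecting columns by dtype: by default a `ColumnTransformer` drops columns it was not given, so a `float64` column would not reach the model.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Hypothetical toy data: one int64 column and one float64 column.
toy = pd.DataFrame({
    "rooms": [3, 4, 2],               # int64
    "lot_area": [80.5, 120.0, 64.2],  # float64
})

# Same dtype-based selection as in clean_data.
num_cols = toy.dtypes[toy.dtypes == "int64"].index.values.tolist()

num_pipe = Pipeline([("num_imputer", SimpleImputer(strategy="median"))])
feat_pipe_demo = ColumnTransformer([("num_pipe", num_pipe, num_cols)])

transformed = feat_pipe_demo.fit_transform(toy)
print(num_cols)           # the columns that reach the model
print(transformed.shape)  # one column wide: lot_area was dropped
```

If the real data contained float columns that should be kept, widening the dtype filter or passing `remainder='passthrough'` to the `ColumnTransformer` would be two ways to retain them.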


# Get the Data to AzureML

First, save the data to files:

In [4]:
print("Saving to files")
train_data.to_parquet("housing_train.parquet", index=False)
test_data.to_parquet("housing_test.parquet", index=False)

Saving to files


Create an `MLClient`:

In [5]:
from azure.ml import MLClient
from azure.identity import DefaultAzureCredential
ml_client = MLClient.from_config(credential=DefaultAzureCredential(exclude_shared_token_cache_credential=True),
                     logging_enable=True)

Found the config file in: C:\Users\riedgar\source\repos\RAI-vNext-Preview\config.json


Upload the datasets:

In [6]:
from azure.ml.entities import Dataset

train_dataset = Dataset(
    name="Housing_Train_from_Notebook",
    local_path="housing_train.parquet",
)
ml_client.datasets.create_or_update(train_dataset)

test_dataset = Dataset(
    name="Housing_Test_from_Notebook",
    local_path="housing_test.parquet",
)
ml_client.datasets.create_or_update(test_dataset)

Dataset({'paths': [<azure.ml._restclient.v2021_10_01.models._models_py3.UriReference object at 0x0000025CA0AABFC8>], 'is_anonymous': False, 'auto_increment_version': False, 'name': 'Housing_Test_from_Notebook', 'description': None, 'tags': {}, 'properties': {}, 'id': '/subscriptions/589c7ae9-223e-45e3-a191-98433e0821a9/resourceGroups/amlisdkv2-rg-1643673716/providers/Microsoft.MachineLearningServices/workspaces/amlisdkv21643673716/datasets/Housing_Test_from_Notebook/versions/2', 'base_path': './', 'creation_context': <azure.ml._restclient.v2021_10_01.models._models_py3.SystemData object at 0x0000025CA0AABC88>, 'serialize': <msrest.serialization.Serializer object at 0x0000025CA0A99CC8>, 'version': '2', 'local_path': None})

# Train the Model in AzureML

To simplify the model creation process, we're going to use a pipeline.

Before we do anything else, we need to specify the version of the RAI components:

In [7]:
version_string = '1643680970'

Now we can create the training script:

In [8]:
%%writefile housing_training_script.py

import argparse
import os
import shutil
import tempfile


from azureml.core import Run

import mlflow
import mlflow.sklearn

import pandas as pd
from lightgbm import LGBMClassifier

def parse_args():
    # setup arg parser
    parser = argparse.ArgumentParser()

    # add arguments
    parser.add_argument("--training_data", type=str, help="Path to training data")
    parser.add_argument("--target_column_name", type=str, help="Name of target column")
    parser.add_argument("--model_output", type=str, help="Path of output model")

    # parse args
    args = parser.parse_args()

    # return args
    return args


def main(args):
    current_experiment = Run.get_context().experiment
    tracking_uri = current_experiment.workspace.get_mlflow_tracking_uri()
    print("tracking_uri: {0}".format(tracking_uri))
    mlflow.set_tracking_uri(tracking_uri)
    mlflow.set_experiment(current_experiment.name)

    # Read in data
    print("Reading data")
    all_data = pd.read_parquet(args.training_data)

    print("Extracting X_train, y_train")
    print("all_data cols: {0}".format(all_data.columns))
    y_train = all_data[args.target_column_name]
    X_train = all_data.drop(labels=args.target_column_name, axis="columns")
    print("X_train cols: {0}".format(X_train.columns))

    print("Training model")
    # The estimator can be changed to suit
    model = LGBMClassifier(n_estimators=5)
    model.fit(X_train, y_train)

    # Saving model with mlflow - leave this section unchanged
    with tempfile.TemporaryDirectory() as td:
        print("Saving model with MLFlow to temporary directory")
        tmp_output_dir = os.path.join(td, "my_model_dir")
        mlflow.sklearn.save_model(sk_model=model, path=tmp_output_dir)

        print("Copying MLFlow model to output path")
        for file_name in os.listdir(tmp_output_dir):
            print("  Copying: ", file_name)
            # As of Python 3.8, copytree will acquire dirs_exist_ok as
            # an option, removing the need for listdir
            shutil.copy2(src=os.path.join(tmp_output_dir, file_name), dst=os.path.join(args.model_output, file_name))


# run script
if __name__ == "__main__":
    # add space in logs
    print("*" * 60)
    print("\n\n")

    # parse args
    args = parse_args()

    # run main function
    main(args)

    # add space in logs
    print("*" * 60)
    print("\n\n")

Writing housing_training_script.py
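The script's only contract with the component that wraps it is its command line. As a quick local check of that contract, the sketch below rebuilds the same parser and feeds it an explicit argument list. The file and directory names are placeholders, and `build_parser` is a name introduced here for illustration.

```python
import argparse


def build_parser():
    # Mirrors the three arguments declared in housing_training_script.py.
    parser = argparse.ArgumentParser()
    parser.add_argument("--training_data", type=str, help="Path to training data")
    parser.add_argument("--target_column_name", type=str, help="Name of target column")
    parser.add_argument("--model_output", type=str, help="Path of output model")
    return parser


# Placeholder values standing in for what AzureML would pass at run time.
example_argv = [
    "--training_data", "housing_train.parquet",
    "--target_column_name", "Sold_HigherThan_Median",
    "--model_output", "./model_out",
]
args = build_parser().parse_args(example_argv)
print(args.training_data, args.target_column_name, args.model_output)

# None of the arguments is marked required, so a missing one parses as None
# and only causes an error later, when the script tries to use it.
partial = build_parser().parse_args(["--training_data", "housing_train.parquet"])
print(partial.model_output)
```

Adding `required=True` to each `add_argument` call would make a missing argument fail at parse time instead.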


Place this script into a component:

In [10]:
from azure.ml.entities import Code, CommandComponent

training_code = Code(
    local_path='housing_training_script.py'
)

training_inputs = {
    'training_data': { 'type': 'path'},
    'target_column_name': { 'type': 'string'}
}

training_outputs = {
    'model_output': { 'type': 'path'}
}

training_component = CommandComponent(
    name="HousingTrainingComponent",
    version="3",
    display_name="Simple training component for housing Dataset",
    code=training_code,
    environment=f"AML-RAI-Environment:{version_string}",
    inputs=training_inputs,
    outputs=training_outputs,
    command="python housing_training_script.py " \
            "--training_data ${{inputs.training_data}} " \
            "--target_column_name ${{inputs.target_column_name}} " \
            "--model_output ${{outputs.model_output}}"
)

ml_client.components.create_or_update(training_component)

Uploading housing_training_script.py (< 1 MB): 100%|##############################| 2.46k/2.46k [00:00<00:00, 19.1kB/s]



CommandComponent({'auto_increment_version': False, 'is_anonymous': False, 'name': 'HousingTrainingComponent', 'description': None, 'tags': {}, 'properties': {}, 'id': '/subscriptions/589c7ae9-223e-45e3-a191-98433e0821a9/resourceGroups/amlisdkv2-rg-1643673716/providers/Microsoft.MachineLearningServices/workspaces/amlisdkv21643673716/components/HousingTrainingComponent/versions/3', 'base_path': None, 'creation_context': <azure.ml._restclient.v2021_10_01.models._models_py3.SystemData object at 0x0000025CA0AB7208>, 'serialize': <msrest.serialization.Serializer object at 0x0000025CA0B8CD48>, 'command': 'python housing_training_script.py --training_data ${{inputs.training_data}} --target_column_name ${{inputs.target_column_name}} --model_output ${{outputs.model_output}}', 'code': '/subscriptions/589c7ae9-223e-45e3-a191-98433e0821a9/resourceGroups/amlisdkv2-rg-1643673716/providers/Microsoft.MachineLearningServices/workspaces/amlisdkv21643673716/codes/7f56844e-c13d-41ec-83d8-8b35592dc302/versi
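The `${{inputs.…}}` and `${{outputs.…}}` tokens in the `command` string are placeholders that are replaced with concrete values when the job runs. The helper below is a hypothetical illustration of that substitution, not the service's implementation; `render_command` and the example paths are made up for this sketch.

```python
import re

# Matches ${{inputs.name}} and ${{outputs.name}}.
PLACEHOLDER = re.compile(r"\$\{\{\s*(inputs|outputs)\.(\w+)\s*\}\}")


def render_command(template, inputs, outputs):
    """Replace each placeholder with the matching value from the two dicts."""
    def substitute(match):
        kind, name = match.group(1), match.group(2)
        values = inputs if kind == "inputs" else outputs
        # A KeyError here means the command references an undeclared port.
        return str(values[name])
    return PLACEHOLDER.sub(substitute, template)


template = (
    "python housing_training_script.py "
    "--training_data ${{inputs.training_data}} "
    "--target_column_name ${{inputs.target_column_name}} "
    "--model_output ${{outputs.model_output}}"
)

rendered = render_command(
    template,
    inputs={
        "training_data": "/mnt/in/train",  # made-up mount path
        "target_column_name": "Sold_HigherThan_Median",
    },
    outputs={"model_output": "/mnt/out/model"},  # made-up mount path
)
print(rendered)
```

The point of the sketch is that every name used in the command must also appear in `training_inputs` or `training_outputs`; a mismatch between the two is an easy mistake to make when editing the component.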

# Running a training pipeline
Now that we have a script that can train a model, we need to run it:

In [11]:
import time

from azure.ml.entities import JobInput, ComponentJob, PipelineJob

model_name_suffix = int(time.time())
model_name = 'my_housing_nb_model'

This is going to be a two-component pipeline. The first component is the one we created above, which trains our model. The second registers the model in AzureML:

In [15]:
# The overall inputs for the pipeline

pipeline_inputs = {
    'target_column_name': target_feature,
    'my_training_data': JobInput(dataset="Housing_Train_from_Notebook:1"),
    'my_test_data': JobInput(dataset="Housing_Test_from_Notebook:1")
}

# Specify the training job
train_job_inputs = {
    'target_column_name': '${{inputs.target_column_name}}',
    'training_data': '${{inputs.my_training_data}}',
}
train_job_outputs = {
    'model_output': None
}
train_job = ComponentJob(
    component="HousingTrainingComponent:3",
    inputs=train_job_inputs,
    outputs=train_job_outputs
)

# The model registration job
register_job_inputs = {
    'model_input_path': '${{jobs.train-model-job.outputs.model_output}}',
    'model_base_name': model_name,
    'model_name_suffix': model_name_suffix
}
register_job_outputs = {
    'model_info_output_path': None
}
register_job = ComponentJob(
    component=f"RegisterModel:{version_string}",
    inputs=register_job_inputs,
    outputs=register_job_outputs
)

With our jobs specified, assemble them into a pipeline:

In [16]:
model_registration_pipeline_job = PipelineJob(
    experiment_name="Register_Housing_Model_From_Notebook_01",
    description="Create and register a model from a notebook",
    jobs={
        'train-model-job': train_job,
        'register-model-job': register_job,
    },
    inputs=pipeline_inputs,
    outputs=register_job_outputs,
    compute="cpucluster"
)
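Nothing in the pipeline definition states the order of the two jobs explicitly. The ordering follows from the data binding: `register-model-job` consumes `${{jobs.train-model-job.outputs.model_output}}`, so it cannot start before training has produced that output. The sketch below shows how such bindings imply an execution order. It is an illustration built on Python's `graphlib` (Python 3.9+), not AzureML's scheduler, and `job_dependencies` is a hypothetical helper.

```python
import re
from graphlib import TopologicalSorter

# Matches bindings such as ${{jobs.train-model-job.outputs.model_output}}
JOB_BINDING = re.compile(r"\$\{\{\s*jobs\.([\w-]+)\.outputs\.\w+\s*\}\}")


def job_dependencies(job_inputs):
    """Map each job name to the set of jobs whose outputs it consumes."""
    deps = {}
    for job_name, inputs in job_inputs.items():
        upstream = set()
        for value in inputs.values():
            if isinstance(value, str):
                upstream.update(JOB_BINDING.findall(value))
        deps[job_name] = upstream
    return deps


# Input bindings of the two jobs; the suffix is a placeholder integer.
job_inputs = {
    "register-model-job": {
        "model_input_path": "${{jobs.train-model-job.outputs.model_output}}",
        "model_base_name": "my_housing_nb_model",
        "model_name_suffix": 1234567890,
    },
    "train-model-job": {
        "target_column_name": "${{inputs.target_column_name}}",
        "training_data": "${{inputs.my_training_data}}",
    },
}

deps = job_dependencies(job_inputs)
order = list(TopologicalSorter(deps).static_order())
print(order)  # training comes first, regardless of dictionary order
```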

And submit it to AzureML:

In [None]:
from azure.ml.entities import PipelineJob

def submit_and_wait(ml_client, pipeline_job) -> PipelineJob:
    created_job = ml_client.jobs.create_or_update(pipeline_job)
    assert created_job is not None

    while created_job.status not in ['Completed', 'Failed', 'Canceled', 'NotResponding']:
        time.sleep(30)
        created_job = ml_client.jobs.get(created_job.name)
        print("Latest status : {0}".format(created_job.status))
    assert created_job.status == 'Completed'
    return created_job

# This is the actual submission
training_job = submit_and_wait(ml_client, model_registration_pipeline_job)

compute is not a known attribute of class <class 'azure.ml._restclient.v2021_10_01.models._models_py3.PipelineJob'> and will be ignored
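The `submit_and_wait` loop polls until the job reaches a terminal status and places no upper bound on how long it waits. Where a bounded wait is preferable, one option is a deadline-based variant like the sketch below. `wait_for_terminal_status`, its parameters, and the fake status source are all hypothetical; in this notebook, `get_status` would wrap `ml_client.jobs.get(created_job.name).status`.

```python
import time

TERMINAL_STATES = {"Completed", "Failed", "Canceled", "NotResponding"}


def wait_for_terminal_status(get_status, poll_seconds=30.0, timeout_seconds=3600.0):
    """Poll get_status() until it returns a terminal state or the deadline passes."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        print("Latest status : {0}".format(status))
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "Job still '{0}' after {1}s".format(status, timeout_seconds)
            )
        time.sleep(poll_seconds)


# Fake status source standing in for the AzureML client.
fake_statuses = iter(["Preparing", "Running", "Completed"])
final_status = wait_for_terminal_status(lambda: next(fake_statuses), poll_seconds=0.01)
print(final_status)
```

Returning the status instead of asserting on it leaves the caller free to decide how to handle a failed or canceled job.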


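Train a LightGBM classifier locally on the training data, so that a `model` object is available for the dashboard below. This uses the same estimator settings as the training script:

In [None]:
model = LGBMClassifier(n_estimators=5)
model.fit(X_train, y_train)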

### Create Model and Data Insights

In [None]:
from raiwidgets import ResponsibleAIDashboard
from responsibleai import RAIInsights

To use Responsible AI Dashboard, initialize a RAIInsights object upon which different components can be loaded.

`RAIInsights` accepts the model, the training dataset, the test dataset, the name of the target feature, the task type, and a list of categorical feature names as its arguments.

In [None]:
from sklearn.pipeline import Pipeline

dashboard_pipeline = Pipeline(steps=[('preprocess', feat_pipe), ('model', model)])
rai_insights = RAIInsights(dashboard_pipeline, train_data, test_data, target_feature, 'classification',
                             categorical_features=categorical_features, 
                             classes=['Less than median', 'More than median'])

Add the components of the toolbox that are focused on model assessment.

In [None]:
# Interpretability
rai_insights.explainer.add()
# Error Analysis
rai_insights.error_analysis.add()
# Counterfactuals: accepts the total number of counterfactuals to generate
# and the class that they should have
rai_insights.counterfactual.add(total_CFs=10, desired_class='opposite')

Once all the desired components have been loaded, compute insights on the test set.

In [None]:
rai_insights.compute()

Finally, visualize and explore the model insights. Use the resulting widget or follow the link to view this in a new tab.

In [None]:
ResponsibleAIDashboard(rai_insights)

See this [developer blog](https://aka.ms/raidashboardblog) (Model Debugging Flow section) to learn more about this use case and how to use the dashboard to debug your housing price prediction model.