# Explainability using LIME enhanced for multimodal (structured, text) binary classification model

This notebook trains a car rental satisfaction model, generates local explanations with the LIME enhanced explainer, and provides insights into model behaviour.

### Contents
- [Setup](#Setup)
- [Model building and evaluation](#model)
- [OpenScale configuration](#openscale)
- [Compute LIME explanations](#lime)

**Note:** This notebook requires Watson OpenScale service credentials.

## Setup

### Package Installation

In [None]:
!pip install --upgrade ibm-watson-openscale --no-cache | tail -n 1
!pip install --upgrade ibm-metrics-plugin --no-cache | tail -n 1
!pip install matplotlib
!pip install numpy==1.23.5

**Action: Restart the kernel!**

### Configure Credentials

In [1]:
import warnings
warnings.filterwarnings("ignore")

Provide your IBM Watson OpenScale credentials in the following cell:

In [2]:
WOS_CREDENTIALS = {
    "url": "",
    "username": "",
    "password": "",
    "instance_id": ""
}
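To keep credentials out of the saved notebook, the dictionary can instead be filled from environment variables. The following is a minimal sketch; the `WOS_*` variable names and the `load_wos_credentials` helper are assumptions made for illustration and are not defined by OpenScale.

```python
import os


def load_wos_credentials(env=os.environ):
    """Build the credentials dictionary from environment variables.

    The WOS_* variable names are an assumption made for this sketch.
    """
    keys = ["url", "username", "password", "instance_id"]
    creds = {key: env.get("WOS_" + key.upper(), "") for key in keys}
    # Fail early with a clear message instead of a later authentication error.
    missing = [key for key, value in creds.items() if not value]
    if missing:
        raise ValueError("Missing OpenScale credentials: " + ", ".join(missing))
    return creds
```

With the variables exported before the kernel starts, `WOS_CREDENTIALS = load_wos_credentials()` would replace the hard-coded dictionary.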

## Model building and evaluation <a name="model"></a>

In this section you will learn how to train a Scikit-learn model, run predictions, and evaluate its output.

### Load the training data from GitHub

In [None]:
!rm -f car_rental_training_data.csv
!wget https://github.com/IBM/watson-machine-learning-samples/raw/4d7a8344c79c8c7ffbc937497882f67f3e22a79b/cloud/data/cars-4-you/car_rental_training_data.csv -O car_rental_training_data.csv

In [4]:
import numpy as np
import pandas as pd

training_data_file_name = "car_rental_training_data.csv"
data_df = pd.read_csv(training_data_file_name, delimiter=";")

### Explore data

In [5]:
data_df.drop(["ID", "Action"], axis=1, inplace=True)
data_df.head()

| | Gender | Status | Children | Age | Customer_Status | Car_Owner | Customer_Service | Satisfaction | Business_Area |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | M | 2 | 48.85 | Inactive | Yes | I thought the representative handled the initi... | 0 | Product: Availability/Variety/Size |
| 1 | Female | M | 0 | 55.0 | Inactive | No | I have had a few recent rentals that have take... | 0 | Product: Availability/Variety/Size |
| 2 | Male | M | 0 | 42.35 | Inactive | Yes | car cost more because I didn't pay when I rese... | 0 | Product: Availability/Variety/Size |
| 3 | Male | M | 2 | 61.71 | Inactive | Yes | I didn't get the car I was told would be avail... | 0 | Product: Availability/Variety/Size |
| 4 | Male | S | 2 | 56.47 | Active | No | If there was not a desired vehicle available t... | 1 | Product: Availability/Variety/Size |


In [6]:
print("Columns: ", list(data_df.columns))
print("Number of columns: ", len(data_df.columns))

Columns:  ['Gender', 'Status', 'Children', 'Age', 'Customer_Status', 'Car_Owner', 'Customer_Service', 'Satisfaction', 'Business_Area']
Number of columns:  9


The `Satisfaction` field is the one we would like to predict.

In [7]:
print("Number of records: ", data_df.Satisfaction.count())

Number of records:  486


In [8]:
target_count = data_df.groupby("Satisfaction")["Satisfaction"].count()
target_count

Satisfaction
0    212
1    274
Name: Satisfaction, dtype: int64
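These class counts also give a reference point for the accuracy reported later: a model that always predicts the majority class would already be right for 274 of the 486 records, or about 56%. A small worked calculation using only the counts printed above:

```python
# Class counts copied from the output above.
class_counts = {0: 212, 1: 274}

total = sum(class_counts.values())
shares = {label: count / total for label, count in class_counts.items()}

# Accuracy of a trivial model that always predicts the most frequent class.
majority_label = max(class_counts, key=class_counts.get)
majority_baseline = shares[majority_label]

print(f"records: {total}")
print(f"majority class: {majority_label} ({majority_baseline:.1%} of records)")
```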

### Create a model

In this section you will learn how to:

- Prepare data for training a model
- Create machine learning pipeline
- Train a model
- Evaluate a model

#### Import required libraries

In [9]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split

#### Splitting the data into train and test

In [10]:
label_column = "Satisfaction"
features=list(data_df.columns)
features.remove(label_column)
X = data_df[features]
y = data_df["Satisfaction"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

#### Preparing the pipeline

In this step you will create transformers for the numeric, categorical, and text features.

A pipeline is then created from the column transformer and the classifier object.

In [11]:
text_features = ["Customer_Service"]
# Object-typed columns are categorical, except the free-text column.
categorical_features = [f for f, dtype in zip(features, X.dtypes)
                        if str(dtype) == "object" and f not in text_features]
numeric_features = [f for f in features
                    if f not in categorical_features and f not in text_features]

numeric_transformer = Pipeline(steps=[("scaler", StandardScaler())])
categorical_transformer = OneHotEncoder(handle_unknown="ignore")
text_transformer = Pipeline([
    ("vect", CountVectorizer(lowercase=True, stop_words="english")),
    ("tfidf", TfidfTransformer()),
])
# Each text column is passed as a single column name (not a list) so that the
# vectorizer receives a 1-D sequence of strings.
ct = ColumnTransformer(
    transformers=[("num", numeric_transformer, numeric_features),
                  ("cat", categorical_transformer, categorical_features)] +
                 [("text" + f, text_transformer, f) for f in text_features],
    remainder="drop", n_jobs=-1)
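To see what such a column transformer produces, the sketch below applies the same three kinds of transformers to a tiny invented DataFrame. The rows, the `Comment` column, and the `toy_ct` name are assumptions made for illustration; they are not part of the car rental data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows invented for this sketch; they are not taken from the training data.
toy = pd.DataFrame({
    "Age": [25.0, 40.0, 55.0],
    "Gender": ["Female", "Male", "Female"],
    "Comment": ["car was clean", "long wait at counter", "clean car, friendly staff"],
})

toy_ct = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["Age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Gender"]),
    # A plain string selects a 1-D column, which CountVectorizer expects.
    ("text", Pipeline([("vect", CountVectorizer(stop_words="english")),
                       ("tfidf", TfidfTransformer())]), "Comment"),
])
encoded = toy_ct.fit_transform(toy)

# The encoded width is one scaled numeric column, plus one column per
# category, plus one column per vocabulary term.
n_categories = len(toy_ct.named_transformers_["cat"].categories_[0])
n_terms = len(toy_ct.named_transformers_["text"].named_steps["vect"].vocabulary_)
print("encoded shape:", encoded.shape)
print("numeric + one-hot + text columns:", 1 + n_categories + n_terms)
```

The classifier never sees the raw columns, only this encoded matrix, which is why the explainer later has to score perturbed records through the whole pipeline.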

In [12]:
model=RandomForestClassifier(n_estimators=100, random_state=1)
pipeline = Pipeline([("ct", ct), ("clf", model)])

#### Train a model

In [13]:
pipeline = pipeline.fit(X_train, y_train)

#### Evaluate the model

In [14]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, pipeline.predict(X_test))

0.9387755102040817
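The accuracy value can be translated back into record counts. A short worked calculation, assuming the 486 records and `test_size=0.2` shown earlier, and relying on scikit-learn rounding the test share up to a whole number of records:

```python
import math

n_rows = 486          # number of records printed earlier
test_size = 0.2       # fraction passed to train_test_split
accuracy = 0.9387755102040817

# A fractional test_size is rounded up to a whole number of test records.
n_test = math.ceil(n_rows * test_size)
n_train = n_rows - n_test

# Number of correctly classified test records implied by the accuracy.
n_correct = round(accuracy * n_test)

print(f"train records: {n_train}, test records: {n_test}")
print(f"correct test predictions: {n_correct} of {n_test}")
```

With so few test records, each single misclassification moves the accuracy by roughly one percentage point, so the figure should be read as a rough estimate.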

## OpenScale configuration <a name="openscale"></a>

Import the necessary libraries and set up the OpenScale Python client.

In [15]:
from ibm_watson_openscale import APIClient as OpenScaleAPIClient
from ibm_cloud_sdk_core.authenticators import CloudPakForDataAuthenticator

authenticator = CloudPakForDataAuthenticator(
    url=WOS_CREDENTIALS["url"],
    username=WOS_CREDENTIALS["username"],
    password=WOS_CREDENTIALS["password"],
    disable_ssl_verification=True
)

client = OpenScaleAPIClient(
    service_url=WOS_CREDENTIALS["url"],
    service_instance_id=WOS_CREDENTIALS["instance_id"],
    authenticator=authenticator
)

# Uncomment the lines below to authenticate against IBM Cloud with an API key instead
# from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
# authenticator = IAMAuthenticator(apikey="")
# client = OpenScaleAPIClient(authenticator=authenticator)

client.version

'3.0.34'

## Compute LIME explanations <a name="lime"></a>

LIME (Local Interpretable Model-agnostic Explanations) explains an individual prediction of any machine learning model by approximating the model locally with an interpretable surrogate. See the [paper](https://arxiv.org/abs/1602.04938) for technical details of the algorithm.

To explain a prediction, LIME generates perturbed copies of the input record, scores them with the model, and fits a simple linear model to those scores, giving more weight to perturbations that are closer to the original record.

The weight that the surrogate assigns to each feature (or, for text columns, to each word) is reported as that feature's importance for the prediction being explained.
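As a rough illustration of how a LIME-style explainer arrives at per-feature weights, the sketch below perturbs a record, scores the perturbations with a black-box function, weights them by proximity, and fits a weighted linear surrogate. It is a simplified, numeric-only sketch and not the implementation used by OpenScale; `black_box`, `lime_sketch`, the Gaussian perturbation, and the kernel width are all assumptions chosen for this example.

```python
import numpy as np


def black_box(X):
    """Hypothetical model: the score rises with feature 0 and ignores feature 1."""
    return 1.0 / (1.0 + np.exp(-3.0 * X[:, 0]))


def lime_sketch(instance, predict, n_samples=5000, kernel_width=0.75, seed=0):
    """Return local linear weights for each feature around `instance`."""
    rng = np.random.default_rng(seed)
    # 1. Perturb: sample points around the record being explained.
    X = instance + rng.normal(size=(n_samples, instance.size))
    # 2. Score the perturbed points with the black-box model.
    y = predict(X)
    # 3. Weight each point by its proximity to the record (exponential kernel).
    dist = np.linalg.norm(X - instance, axis=1)
    w = np.exp(-(dist ** 2) / kernel_width ** 2)
    # 4. Fit a weighted linear surrogate (intercept + one weight per feature)
    #    by scaling rows with sqrt(w) and solving ordinary least squares.
    A = np.hstack([np.ones((n_samples, 1)), X - instance])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef[1:]  # drop the intercept


weights = lime_sketch(np.array([0.2, -1.0]), black_box)
print("local feature weights:", np.round(weights, 3))
```

Here the surrogate gives feature 0 a clearly positive weight and feature 1 a weight near zero, matching how `black_box` was built. Production explainers additionally handle categorical values and words in text, which this sketch leaves out.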

### Prepare input to compute LIME explanations

#### Create configuration for computing the LIME metric

Set the following properties in the configuration:

- **problem_type**: The model problem type. Possible values are 'binary', 'multiclass', 'regression'
- **input_data_type**: The input data type. This notebook uses the multimodal type (`InputDataType.MULTIMODAL`), which covers structured and text features together
- **feature_columns**: The list of feature columns
- **categorical_columns**: The list of categorical columns
- **text_columns**: The list of text columns
- **label_column**: The label column
- **explainability**: The explainability metrics configuration

Generate the explainability training statistics:

In [16]:
from ibm_metrics_plugin.common.utils.constants import ExplainabilityMetricType, ProblemType, InputDataType, MetricGroupType
from ibm_metrics_plugin.metrics.explainability.entity.training_stats import TrainingStats

training_data_info = {
    "problem_type": ProblemType.BINARY.value,
    "feature_columns": features,
    "categorical_columns": categorical_features,
    "text_columns": text_features,
    "label_column": label_column,
}

training_stats = TrainingStats(data_df, training_data_info).get_explanability_statistics()

Optional parameters for the LIME explainer:
- **features_count**: The number of features to be returned in the explanation. Default value is 10.
- **perturbations_count**: The number of perturbations to generate and score for each explanation. Default value is 5000.

In [65]:
configuration={
    "configuration": {
        "problem_type": ProblemType.BINARY.value,
        "input_data_type": InputDataType.MULTIMODAL.value,
        "feature_columns": features,
        "categorical_columns": categorical_features,
        "text_columns": text_features,
        "label_column": label_column,
        MetricGroupType.EXPLAINABILITY.value : {
            "metrics_configuration": {
                ExplainabilityMetricType.LIME.value : {
                    #"features_count": 10,
                    #"perturbations_count": 5000
                }
            },
             "training_statistics": training_stats
        }
    }
}

#### Define the scoring function

The scoring function will be used to score against the model to get the probability and prediction values. The scoring function should take a pandas dataframe as input and return probability and prediction values.

Note: For classification models, returning the prediction values is optional.

In [66]:
def scoring_fn(data):
    return pipeline.predict_proba(data), pipeline.predict(data)
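Before handing a scoring function to the explainer, it can help to confirm that it honours this contract. A minimal sketch, in which `check_scoring_fn` is a hypothetical helper and `toy_scoring_fn` is a made-up stand-in for the real pipeline:

```python
import numpy as np
import pandas as pd


def check_scoring_fn(scoring_fn, sample):
    """Check that scoring_fn(sample) returns (probabilities, predictions)
    with one row per input record. Hypothetical helper for this sketch."""
    probabilities, predictions = scoring_fn(sample)
    probabilities = np.asarray(probabilities)
    predictions = np.asarray(predictions)
    assert probabilities.shape[0] == len(sample), "one probability row per record"
    assert np.allclose(probabilities.sum(axis=1), 1.0), "class probabilities sum to 1"
    assert predictions.shape == (len(sample),), "one prediction per record"
    return probabilities.shape


def toy_scoring_fn(data):
    """Stand-in for the real pipeline: scores records with a made-up rule."""
    p1 = np.where(data["Children"] > 0, 0.8, 0.3)
    probabilities = np.column_stack([1.0 - p1, p1])
    return probabilities, probabilities.argmax(axis=1)


sample = pd.DataFrame({"Children": [0, 2, 1]})
print(check_scoring_fn(toy_scoring_fn, sample))
```

In this notebook the same check could be run as `check_scoring_fn(scoring_fn, X_test.iloc[0:5])` once the pipeline has been trained.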

### Compute explanations

Compute the explanations for one data point in the test data. The test data can be a Spark DataFrame or a pandas DataFrame. Here we use a pandas DataFrame.

In [67]:
metrics_result = client.ai_metrics.compute_metrics(configuration=configuration, 
                                                   data_frame=X_test.iloc[0:1], 
                                                   scoring_fn=scoring_fn)

In [68]:
explanation = metrics_result["metrics_result"]["explainability"]["lime"]["local_explanations"][0]["predictions"][0]
print("Explanation for prediction {0} is ".format(explanation.get("value")))
exp_df = pd.DataFrame(data = [(f.get("feature_name"), f.get("feature_value"), f.get("weight")) for f in explanation.get("explanation_features")], columns = ["Feature name", "Feature value", "Importance"])
exp_df

Explanation for prediction 1 is 


| | Feature name | Feature value | Importance |
|---|---|---|---|
| 0 | Customer_Service | insurance | -0.262803 |
| 1 | Business_Area | Product: Functioning | -0.212235 |
| 2 | Car_Owner | Yes | -0.08806 |
| 3 | Customer_Service | funeral | 0.085438 |
| 4 | Customer_Service | comprehensive | 0.071753 |
| 5 | Customer_Service | town | 0.069573 |
| 6 | Customer_Service | garage | 0.065843 |
| 7 | Customer_Service | purchased | 0.062063 |
| 8 | Customer_Service | smashed | 0.059047 |
| 9 | Children | 2 | -0.023184 |
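Because the text column contributes one row per word, it can be useful to add up the signed weights per input column. The sketch below does this for the rows shown above; summing per column is an illustrative aggregation chosen here, not something returned by the OpenScale client.

```python
from collections import defaultdict

# (feature name, feature value, weight) rows copied from the table above.
rows = [
    ("Customer_Service", "insurance", -0.262803),
    ("Business_Area", "Product: Functioning", -0.212235),
    ("Car_Owner", "Yes", -0.08806),
    ("Customer_Service", "funeral", 0.085438),
    ("Customer_Service", "comprehensive", 0.071753),
    ("Customer_Service", "town", 0.069573),
    ("Customer_Service", "garage", 0.065843),
    ("Customer_Service", "purchased", 0.062063),
    ("Customer_Service", "smashed", 0.059047),
    ("Children", "2", -0.023184),
]

# Sum the signed weights of all rows that belong to the same input column.
net_weight = defaultdict(float)
for name, _value, weight in rows:
    net_weight[name] += weight

# Rank columns by the magnitude of their net weight.
ranking = sorted(net_weight.items(), key=lambda item: abs(item[1]), reverse=True)
for name, weight in ranking:
    print(f"{name:18s} {weight:+.6f}")
```

For this record the individual words in `Customer_Service` partly cancel each other, so the column's net weight is smaller in magnitude than that of `Business_Area`, even though its single largest word weight is the largest in the table.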


**Authors**

Developed by Pratap Kishore Varma V