#Chapter 10 Practicing Ethics and Governance
When AI systems are well designed and governed, they can improve access to services, increase efficiency, and support better decision-making. When they are poorly designed or deployed without sufficient oversight, they can erode privacy, generate misinformation, amplify discrimination, and undermine trust. Chapter 10 introduces the foundations of practicing ethics and governance in AI.

#Listing 10-1 Using Great Expectations for Data Quality Checks
This minimal example, which can be expanded into full pipelines, shows how data validation can become an automated, repeatable governance practice.

In [None]:
# --------------------------------------------------------------
# Step 1: Install and import dependencies
# --------------------------------------------------------------
!pip install great_expectations pandas scikit-learn

import pandas as pd
import great_expectations as gx
from great_expectations.expectations import (
    ExpectColumnValuesToNotBeNull,
    ExpectColumnValuesToBeBetween,
    ExpectTableRowCountToBeBetween,
)
from sklearn.datasets import load_breast_cancer

# --------------------------------------------------------------
# Step 2: Load a tabular dataset into a DataFrame
# --------------------------------------------------------------
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# --------------------------------------------------------------
# Step 3: Create a GX Data Context (in-memory)
# --------------------------------------------------------------
context = gx.get_context(mode="ephemeral")

# --------------------------------------------------------------
# Step 4: Register the DataFrame as a Batch
# --------------------------------------------------------------
# 4a. Add a pandas Data Source
data_source = context.data_sources.add_pandas("breast_cancer_source")

# 4b. Add a Data Asset for this DataFrame
data_asset = data_source.add_dataframe_asset(name="breast_cancer_asset")

# 4c. Add a Batch Definition that uses the whole DataFrame
batch_def = data_asset.add_batch_definition_whole_dataframe("whole_dataframe")

# 4d. Pass the in-memory DataFrame as batch parameters
batch = batch_def.get_batch(batch_parameters={"dataframe": df})

# --------------------------------------------------------------
# Step 5: Define expectations
# --------------------------------------------------------------
exp1 = ExpectColumnValuesToNotBeNull(column="mean radius")
exp2 = ExpectColumnValuesToBeBetween(
    column="mean texture",
    min_value=0,
    max_value=100,
)
exp3 = ExpectTableRowCountToBeBetween(
    min_value=100,
    max_value=10000,
)

# --------------------------------------------------------------
# Step 6: Validate expectations against the Batch
# --------------------------------------------------------------
res1 = batch.validate(exp1)
res2 = batch.validate(exp2)
res3 = batch.validate(exp3)

print("Null check on 'mean radius':", res1.success)
print("Range check on 'mean texture':", res2.success)
print("Row count check:", res3.success)
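
One way to make checks like these an automated gate is to stop the pipeline whenever any of them fails. The following is a minimal sketch in plain Python and is not part of Great Expectations: `DataQualityError` and `enforce_quality_gate` are hypothetical names, and the dictionary values stand in for the `res1.success`, `res2.success`, and `res3.success` flags from the listing.

```python
class DataQualityError(Exception):
    """Hypothetical exception raised when a data quality check fails."""


def enforce_quality_gate(results):
    """Raise DataQualityError if any named check did not succeed.

    `results` maps a check name to a boolean, for example the
    `.success` attribute of each validation result.
    """
    failed = sorted(name for name, ok in results.items() if not ok)
    if failed:
        raise DataQualityError("Failed checks: " + ", ".join(failed))
    return True


# Illustrative values; in the listing these would be res1.success, etc.
checks = {
    "mean radius not null": True,
    "mean texture in range": True,
    "row count in range": True,
}
gate_passed = enforce_quality_gate(checks)
print("Quality gate passed:", gate_passed)
```

Raising an exception rather than only printing the flags means a scheduled job or CI run fails visibly when the data drifts outside the agreed bounds.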


#Listing 10-2 Tracking Model Training and Metrics with MLflow
Experiment tracking is a core governance function. Without systematic tracking, reproducibility degrades, audit trails disappear, and model lineage becomes ambiguous. Governance requires that teams be able to answer basic but critical questions: Which data was used? Which parameters were chosen? Which metrics justified deployment? What changed between versions?
Tools such as MLflow allow teams to log parameters, metrics, and artifacts in a structured and queryable format.

This example demonstrates how a model training run can be recorded as a governed experiment.


In [None]:
# --------------------------------------------------------------
# Step 1: Install and import dependencies
# --------------------------------------------------------------
!pip install mlflow

import mlflow
import mlflow.sklearn

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# --------------------------------------------------------------
# Step 2: Load dataset and split into train/test sets
# --------------------------------------------------------------
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data,
    data.target,
    test_size=0.3,
    random_state=42,
    stratify=data.target
)

# --------------------------------------------------------------
# Step 3: Start an MLflow tracking run
# --------------------------------------------------------------
with mlflow.start_run(run_name="rf_governance_example"):

    # ----------------------------------------------------------
    # Step 4: Define and log hyperparameters
    # ----------------------------------------------------------
    n_estimators = 100
    max_depth = 5

    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_param("max_depth", max_depth)

    # ----------------------------------------------------------
    # Step 5: Train the model
    # ----------------------------------------------------------
    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        random_state=42
    )
    model.fit(X_train, y_train)

    # ----------------------------------------------------------
    # Step 6: Evaluate and log performance metrics
    # ----------------------------------------------------------
    y_prob = model.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, y_prob)
    mlflow.log_metric("roc_auc", auc)

    # ----------------------------------------------------------
    # Step 7: Log the trained model artifact
    # ----------------------------------------------------------
    mlflow.sklearn.log_model(
        sk_model=model,
        artifact_path="model",
        input_example=X_train[:5]
    )

    print("Logged run with ROC AUC:", auc)
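
The listing records parameters, a metric, and the model, but not which data was used. A possible complement, sketched here under assumptions, is to log a content hash of the training data alongside the run. `fingerprint_rows` is a hypothetical helper, the rows are illustrative, and the tag name `data_sha256` is an arbitrary choice.

```python
import hashlib
import json


def fingerprint_rows(rows):
    """Return a SHA-256 hex digest of a list of rows.

    Rows are serialized to JSON with sorted keys so that the same
    data always yields the same digest.
    """
    payload = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Illustrative rows only; a real run would hash the training data.
rows_v1 = [{"id": 1, "label": 0}, {"id": 2, "label": 1}]
rows_v2 = [{"id": 1, "label": 0}, {"id": 2, "label": 0}]

digest_v1 = fingerprint_rows(rows_v1)
digest_v2 = fingerprint_rows(rows_v2)
print("v1:", digest_v1[:12])
print("v2:", digest_v2[:12])
print("Same data?", digest_v1 == digest_v2)

# Inside an active MLflow run, the digest could be recorded with:
#     mlflow.set_tag("data_sha256", digest_v1)
```

This sketch is sensitive to row order, so two runs produce the same digest only if the rows are hashed in the same sequence.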


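#Listing 10-3 Measuring Group-Wise Metrics with Fairlearn
The Fairlearn library provides tools for measuring group-wise performance and making disparities visible. This example trains a simple classifier and then computes accuracy and selection rate (the fraction of samples predicted positive) separately for two groups. The groups are defined here by a randomly generated attribute that stands in for a protected attribute such as gender, region, or another sensitive feature.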


In [None]:
# ---------------------------------------------------------
# Step 1: Install dependencies
# ---------------------------------------------------------
!pip install fairlearn scikit-learn

# ---------------------------------------------------------
# Step 2: Import libraries
# ---------------------------------------------------------
import numpy as np

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from fairlearn.metrics import MetricFrame, selection_rate

# ---------------------------------------------------------
# Step 3: Create a synthetic dataset with a binary label
# ---------------------------------------------------------
X, y = make_classification(
    n_samples=2000,
    n_features=10,
    n_informative=5,
    random_state=42
)

# ---------------------------------------------------------
# Step 4: Create a synthetic "group" attribute
#         (a stand-in for a protected attribute)
# ---------------------------------------------------------
rng = np.random.default_rng(42)
group_attr = rng.integers(0, 2, size=y.shape[0])  # values 0 or 1

# ---------------------------------------------------------
# Step 5: Split into train/test, keeping labels stratified
# ---------------------------------------------------------
X_train, X_test, y_train, y_test, g_train, g_test = train_test_split(
    X, y, group_attr, test_size=0.3, random_state=42, stratify=y
)
# ---------------------------------------------------------
# Step 6: Train a baseline classifier and evaluate
#         group-wise metrics
# ---------------------------------------------------------
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

metric_frame = MetricFrame(
    metrics={
        "accuracy": accuracy_score,
        "selection_rate": selection_rate,
    },
    y_true=y_test,
    y_pred=y_pred,
    sensitive_features=g_test,
)

# ---------------------------------------------------------
# Step 7: Print results
# ---------------------------------------------------------
print("Overall accuracy:", accuracy_score(y_test, y_pred))
print("\nAccuracy by group:")
print(metric_frame.by_group["accuracy"])

print("\nSelection rate by group:")
print(metric_frame.by_group["selection_rate"])
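
To make the selection rate metric concrete, the sketch below recomputes it by hand in plain Python on a tiny, made-up set of predictions. `selection_rate_by_group` is a hypothetical helper written for illustration and is not a Fairlearn function.

```python
from collections import defaultdict


def selection_rate_by_group(y_pred, groups):
    """Return {group: fraction of that group's predictions equal to 1}."""
    totals = defaultdict(int)
    selected = defaultdict(int)
    for pred, group in zip(y_pred, groups):
        totals[group] += 1
        selected[group] += int(pred == 1)
    return {g: selected[g] / totals[g] for g in totals}


# Illustrative predictions and group labels (not output of the listing).
toy_pred = [1, 0, 1, 1, 0, 0, 1, 0]
toy_groups = [0, 0, 0, 0, 1, 1, 1, 1]

rates = selection_rate_by_group(toy_pred, toy_groups)
gap = max(rates.values()) - min(rates.values())

print(rates)  # {0: 0.75, 1: 0.25}
print(gap)    # 0.5
```

The gap between the largest and smallest group rates is a common single-number summary of disparity. For a `MetricFrame`, Fairlearn exposes a comparable between-group summary through `metric_frame.difference()`.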

#Listing 10-4 Using SHAP to Identify Influential Features
This code trains a tree-based model and computes SHAP values on a held-out dataset. Instead of displaying the full distribution of SHAP values, which can be visually dense, the code aggregates them into mean absolute contribution scores per feature. This provides a compact summary of which inputs most strongly influence predictions overall.


In [None]:
# ---------------------------------------------------------
# Step 1: Install dependencies
# ---------------------------------------------------------
!pip install shap scikit-learn pandas matplotlib

# ---------------------------------------------------------
# Step 2: Import required libraries
# ---------------------------------------------------------
import shap
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# ---------------------------------------------------------
# Step 3: Load dataset and create feature matrix
# ---------------------------------------------------------
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# ---------------------------------------------------------
# Step 4: Split into training and test sets
# ---------------------------------------------------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    random_state=42,
    stratify=y
)

# ---------------------------------------------------------
# Step 5: Train a tree-based classifier
# ---------------------------------------------------------
model = RandomForestClassifier(
    n_estimators=200,
    max_depth=None,
    random_state=42
)
model.fit(X_train, y_train)

# ---------------------------------------------------------
# Step 6: Create SHAP explainer and compute SHAP values
# ---------------------------------------------------------
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

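# ---------------------------------------------------------
# Step 7: Select SHAP values for positive class
# ---------------------------------------------------------
# Older SHAP releases return a list with one array per class;
# newer ones return one array of shape (samples, features, classes).
if isinstance(shap_values, list):
    shap_values_class_1 = shap_values[1]
else:
    shap_values_class_1 = shap_values

shap_values_class_1 = np.array(shap_values_class_1)
if shap_values_class_1.ndim > 2:
    # Index the positive class. Averaging over the class axis would
    # cancel out, because the two classes' contributions are opposite.
    shap_values_class_1 = shap_values_class_1[:, :, 1]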

# ---------------------------------------------------------
# Step 8: Compute mean absolute SHAP value per feature
# ---------------------------------------------------------
mean_abs_shap = np.mean(np.abs(shap_values_class_1), axis=0)
mean_abs_shap = np.asarray(mean_abs_shap, dtype=float).ravel()

feature_names = np.array(X_test.columns)

# ---------------------------------------------------------
# Step 9: Rank features by importance
# ---------------------------------------------------------
order = np.argsort(mean_abs_shap)[::-1]
top_k = min(10, len(mean_abs_shap))
top_idx = order[:top_k]

top_features = feature_names[top_idx]
top_importances = mean_abs_shap[top_idx]

print("Top features by mean absolute SHAP value:")
for name, val in zip(top_features, top_importances):
    print(f"{name}: {val:.4f}")

# ---------------------------------------------------------
# Step 10: Visualize feature importance
# ---------------------------------------------------------
plt.figure(figsize=(8, 5))
positions = np.arange(len(top_features))
plt.barh(positions, top_importances)
plt.yticks(positions, top_features)
plt.gca().invert_yaxis()
plt.xlabel("mean(|SHAP value|)")
plt.title("Top features by mean absolute SHAP value")
plt.tight_layout()
plt.show()
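
The reason for aggregating with the mean of absolute values, rather than a plain mean, can be seen on a small hand-made example. The contribution values below are hypothetical and chosen only to show the effect; they are not SHAP output.

```python
def mean_signed(column):
    """Plain average of per-sample contributions."""
    return sum(column) / len(column)


def mean_absolute(column):
    """Average magnitude of per-sample contributions."""
    return sum(abs(v) for v in column) / len(column)


# Hypothetical per-sample contributions for two features.
contributions = {
    "feature_a": [0.5, -0.5, 0.25, -0.25],
    "feature_b": [0.125, 0.125, 0.125, 0.125],
}

for name, values in contributions.items():
    print(name, mean_signed(values), mean_absolute(values))
# feature_a 0.0 0.375
# feature_b 0.125 0.125
```

`feature_a` pushes predictions strongly in both directions, so its signed mean is zero even though its typical influence is three times that of `feature_b`. Taking absolute values first keeps that influence visible, at the cost of discarding the direction of the effect.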

#Listing 10-5 Opacus Differential Privacy Example
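This example illustrates how differential privacy can be incorporated directly into a model training workflow. Opacus wraps the model, optimizer, and data loader so that each training step clips per-sample gradients and adds calibrated noise, and it reports the privacy budget (ε, δ) spent by the end of training.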

In [None]:
# --------------------------------------------------------------
# Step 1: Install and import dependencies
# --------------------------------------------------------------
!pip install opacus

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from opacus import PrivacyEngine

# --------------------------------------------------------------
# Step 2: Load and prepare a small tabular dataset
# --------------------------------------------------------------
data = load_breast_cancer()
X = torch.tensor(data.data, dtype=torch.float32)
y = torch.tensor(data.target, dtype=torch.long)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

train_ds = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)

# --------------------------------------------------------------
# Step 3: Define a simple classifier model
# --------------------------------------------------------------
model = nn.Sequential(
    nn.Linear(X_train.shape[1], 32),
    nn.ReLU(),
    nn.Linear(32, 2)
)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

# --------------------------------------------------------------
# Step 4: Attach the PrivacyEngine with a target privacy budget
# --------------------------------------------------------------
privacy_engine = PrivacyEngine()

model, optimizer, train_loader = privacy_engine.make_private_with_epsilon(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    epochs=3,
    target_epsilon=5.0,
    target_delta=1e-5,
    max_grad_norm=1.0,
)

# --------------------------------------------------------------
# Step 5: Train the model under differential privacy
# --------------------------------------------------------------
model.train()
for epoch in range(3):
    for xb, yb in train_loader:
        optimizer.zero_grad()
        logits = model(xb)
        loss = criterion(logits, yb)
        loss.backward()
        optimizer.step()

# --------------------------------------------------------------
# Step 6: Retrieve and report the achieved privacy budget
# --------------------------------------------------------------
epsilon = privacy_engine.get_epsilon(delta=1e-5)
print(f"Training finished with ε = {epsilon:.2f}, δ = 1e-5")
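
The core of differentially private SGD is a modified gradient step: clip each sample's gradient to a maximum norm, sum the clipped gradients, add Gaussian noise scaled to that norm, and average. The sketch below shows this step in plain Python on made-up gradients. It is a simplified illustration, not Opacus code: `clip_gradient` and `private_mean_gradient` are hypothetical helpers, and the privacy accounting that turns the noise level into an ε value is omitted.

```python
import math
import random


def clip_gradient(grad, max_grad_norm):
    """Scale `grad` down so its L2 norm is at most `max_grad_norm`."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_grad_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_mean_gradient(per_sample_grads, max_grad_norm,
                          noise_multiplier, rng):
    """Clip each per-sample gradient, sum, add Gaussian noise, average."""
    clipped = [clip_gradient(g, max_grad_norm) for g in per_sample_grads]
    dim = len(per_sample_grads[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    std = noise_multiplier * max_grad_norm
    noisy = [s + rng.gauss(0.0, std) for s in summed]
    return [v / len(per_sample_grads) for v in noisy]


# Illustrative per-sample gradients; the first one has L2 norm 5.
grads = [[3.0, 4.0], [0.3, 0.4]]
rng = random.Random(0)

no_noise = private_mean_gradient(
    grads, max_grad_norm=1.0, noise_multiplier=0.0, rng=rng
)
with_noise = private_mean_gradient(
    grads, max_grad_norm=1.0, noise_multiplier=1.0, rng=rng
)
print("Clipped mean, no noise:", no_noise)
print("Clipped mean, with noise:", with_noise)
```

Clipping bounds how much any single record can move the model, and the noise masks whatever influence remains. In the listing, `max_grad_norm=1.0` sets the clipping bound, while the noise level is chosen by `make_private_with_epsilon` to meet the target ε.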