
# **CHAPTER 24: RESPONSIBLE AI & ETHICS**

*Building Trustworthy AI Systems*

## **Chapter Overview**

Machine learning systems increasingly impact critical decisions in healthcare, finance, and criminal justice. This chapter addresses the ethical obligations and technical implementations required to build fair, transparent, and secure AI systems. You will learn to detect bias, ensure privacy, and implement governance frameworks that maintain public trust.

**Estimated Time:** 25-35 hours (2-3 weeks)  
**Prerequisites:** All previous chapters (context for applying ethics in practice)

---

## **24.0 Learning Objectives**

By the end of this chapter, you will be able to:
1. Audit models for fairness across demographic groups using statistical parity metrics
2. Implement explainability techniques (SHAP, LIME, attention visualization) for model decisions
3. Apply privacy-preserving techniques: differential privacy, federated learning, and PII handling
4. Identify and mitigate security vulnerabilities: adversarial attacks, model inversion, poisoning
5. Design AI governance frameworks: model cards, datasheets, and audit trails
6. Navigate regulatory compliance: GDPR, AI Act, and sector-specific regulations

---

## **24.1 Fairness & Bias Mitigation**

#### **24.1.1 Fairness Metrics**

**Statistical Parity:** $P(\hat{Y}=1|A=0) = P(\hat{Y}=1|A=1)$
- Equal selection rates across groups (demographic parity)

**Equalized Odds:** $P(\hat{Y}=1|Y=y, A=0) = P(\hat{Y}=1|Y=y, A=1)$ for $y \in \{0,1\}$
- Equal TPR and FPR across groups

**Calibration:** $P(Y=1|\hat{Y}=p, A=0) = P(Y=1|\hat{Y}=p, A=1) = p$
- Predicted probabilities reflect true likelihood equally across groups

```python
# fairness_audit.py
from fairlearn.metrics import demographic_parity_difference, equalized_odds_difference
from sklearn.metrics import accuracy_score, confusion_matrix
import pandas as pd

def audit_fairness(y_true, y_pred, sensitive_features):
    """
    Comprehensive fairness audit
    """
    results = {}
    
    # Demographic parity
    results['demographic_parity'] = demographic_parity_difference(
        y_true, y_pred, sensitive_features=sensitive_features
    )
    
    # Equalized odds
    results['equalized_odds'] = equalized_odds_difference(
        y_true, y_pred, sensitive_features=sensitive_features
    )
    
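    # Group-wise accuracy (assumes aligned pandas Series / arrays of equal length)
    for group in sensitive_features.unique():
        mask = sensitive_features == group
        acc = accuracy_score(y_true[mask], y_pred[mask])
        results[f'accuracy_{group}'] = acc

    # Disparate impact: lowest group selection rate divided by the highest.
    # Works for any number of groups and any group labels, not only 0/1.
    selection_rates = {
        group: y_pred[sensitive_features == group].mean()
        for group in sensitive_features.unique()
    }
    max_rate = max(selection_rates.values())
    results['disparate_impact'] = (
        min(selection_rates.values()) / max_rate if max_rate > 0 else float('nan')
    )

    return results

# Interpretation
# Demographic parity diff close to 0: similar selection rates across groups
# Disparate impact >= 0.8 (equivalently max/min <= 1.25): passes the "four-fifths rule",
# a screening heuristic from US employment guidelines rather than proof of legal compliance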
```
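
The audit function above covers demographic parity and equalized odds but not calibration. As a minimal sketch of how the third definition could be checked, the following bins predicted probabilities per group and compares the mean prediction with the observed positive rate in each bin. The function name `calibration_by_group`, the bin count, and the toy data are illustrative assumptions, not part of Fairlearn.

```python
from collections import defaultdict

def calibration_by_group(y_true, y_prob, groups, n_bins=5):
    """Return {group: [(mean_predicted, observed_rate, count), ...]}, one tuple per score bin."""
    # (group, bin) -> [sum of predicted probabilities, number of positives, count]
    cells = defaultdict(lambda: [0.0, 0, 0])
    for y, p, g in zip(y_true, y_prob, groups):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        cell = cells[(g, b)]
        cell[0] += p
        cell[1] += y
        cell[2] += 1
    table = defaultdict(list)
    for (g, _bin), (sum_p, positives, n) in sorted(cells.items()):
        table[g].append((sum_p / n, positives / n, n))
    return dict(table)

# Toy data (made up): group "a" is well calibrated, group "b" is over-predicted.
y_prob = [0.2] * 5 + [0.8] * 5 + [0.2] * 5 + [0.8] * 5
y_true = [1, 0, 0, 0, 0] + [1, 1, 1, 1, 0] + [0, 0, 0, 0, 0] + [1, 1, 0, 0, 0]
groups = ["a"] * 10 + ["b"] * 10

report = calibration_by_group(y_true, y_prob, groups)
for g, rows in report.items():
    for mean_p, observed, n in rows:
        print(f"group={g} predicted={mean_p:.2f} observed={observed:.2f} n={n}")
```

A group whose observed rate sits consistently below its mean prediction is being over-scored. With real data, use enough samples per bin for the observed rates to be stable.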

#### **24.1.2 Bias Mitigation Techniques**

**Pre-processing:** Adjust training data to remove bias.
```python
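# CorrelationRemover: project out the linear correlation between the sensitive columns
# and the remaining features (this transforms features; it does not reweigh samples).
# The sensitive columns themselves are dropped from the output.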
from fairlearn.preprocessing import CorrelationRemover

X_transformed = CorrelationRemover(sensitive_feature_ids=['race', 'gender']).fit_transform(X)
```
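
Reweighing is a separate pre-processing technique that leaves the features untouched and assigns each sample a weight instead. A minimal sketch, assuming the standard Kamiran and Calders formulation in which the weight is $P(A=a)\,P(Y=y) / P(A=a, Y=y)$; the function name and toy data are illustrative:

```python
from collections import Counter

def reweighing_weights(labels, groups):
    """Weight each sample by P(A=a) * P(Y=y) / P(A=a, Y=y)."""
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(groups, labels))
    return [
        (group_counts[g] * label_counts[y]) / (n * joint_counts[(g, y)])
        for g, y in zip(groups, labels)
    ]

# Toy data (made up): group "m" has a 4/6 positive rate, group "f" has 1/4.
groups = ["m"] * 6 + ["f"] * 4
labels = [1, 1, 1, 1, 0, 0] + [1, 0, 0, 0]
weights = reweighing_weights(labels, groups)

for g, y, w in sorted(set(zip(groups, labels, weights))):
    print(f"group={g} label={y} weight={w:.3f}")
```

Under these weights the label becomes statistically independent of the group in the weighted data. Most scikit-learn estimators accept such weights through the `sample_weight` argument of `fit`.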

**In-processing:** Train with fairness constraints.
```python
# Adversarial debiasing: Train classifier to predict target but not sensitive attribute
import tensorflow as tf

class AdversarialDebiasing(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.predictor = tf.keras.Sequential([...])
        self.adversary = tf.keras.Sequential([...])
    
    def call(self, inputs):
        predictions = self.predictor(inputs)
        protected_pred = self.adversary(predictions)
        return predictions, protected_pred
    
    def train_step(self, data):
        x, (y_true, protected_true) = data
        
        with tf.GradientTape(persistent=True) as tape:
            y_pred, protected_pred = self(x, training=True)
            
            # Loss: Minimize prediction error, maximize adversary error (confuse it)
            predictor_loss = loss_fn(y_true, y_pred)
            adversary_loss = loss_fn(protected_true, protected_pred)
            
            # Combined: Minimize predictor loss, maximize adversary loss (minimize -adversary_loss)
            total_loss = predictor_loss - 0.5 * adversary_loss
        
        # Predictor step: minimize total_loss (fit the target while degrading the adversary)
        # Adversary step: minimize adversary_loss (learn to recover the protected attribute)
        ...
```

**Post-processing:** Adjust predictions to meet fairness criteria.
```python
from fairlearn.postprocessing import ThresholdOptimizer

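# Choose group-specific thresholds to equalize TPR/FPR
postprocessed_predictor = ThresholdOptimizer(
    estimator=model,
    constraints="equalized_odds",
    predict_method="predict_proba",
    prefit=True,  # assumes `model` is already fitted
)

# Fit the thresholds on training or validation data, never on the test set.
# race_train / race_test (assumed names) are the sensitive-feature columns for each split.
postprocessed_predictor.fit(X_train, y_train, sensitive_features=race_train)
fair_predictions = postprocessed_predictor.predict(X_test, sensitive_features=race_test)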
```

---

## **24.2 Explainability & Interpretability**

#### **24.2.1 SHAP (SHapley Additive exPlanations)**

Game-theoretic approach assigning each feature a contribution to the prediction.

```python
# shap_explanation.py
import shap

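# Background dataset for baseline
# (DeepExplainer expects arrays/tensors in the model's input format; convert DataFrames as needed)
background = X_train.sample(100)

explainer = shap.DeepExplainer(model, background)  # For neural networks
# Or: shap.TreeExplainer(model) for XGBoost/LightGBM
# Or: shap.KernelExplainer(model.predict, background) for any model

# Explain one batch and reuse it for the local and global views.
# Assumes a single-output model, so shap_values has shape (n_samples, n_features);
# multi-output models return one set of values per output.
X_explain = X_test.iloc[:200]
shap_values = explainer.shap_values(X_explain)

# Local explanation for the first row
shap.waterfall_plot(shap.Explanation(
    values=shap_values[0],
    base_values=explainer.expected_value,
    data=X_explain.iloc[0].values,
    feature_names=list(X_explain.columns)
))

# Global feature importance over the same rows
shap.summary_plot(shap_values, X_explain)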
```

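**Interpretation:** Positive SHAP values push the prediction above the base value; negative values push it below. The base value plus the sum of the SHAP values equals the model output for that instance, in whatever units the explainer works in (probability, or log-odds for many tree classifiers).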
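
The additivity property can be verified by hand on a tiny model. The following is a sketch of exact Shapley values computed by averaging marginal contributions over every feature ordering, which is only feasible for a handful of features; the toy `model`, its coefficients, and the all-zeros baseline are assumptions for illustration, and the SHAP library uses faster approximations or model-specific algorithms.

```python
from itertools import permutations

def shapley_values(predict, x, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)       # start from the baseline input
        prev = predict(current)
        for i in order:
            current[i] = x[i]          # switch feature i to its actual value
            new = predict(current)
            phi[i] += new - prev       # marginal contribution of feature i
            prev = new
    return [p / len(orderings) for p in phi]

def model(features):
    # Toy scoring function with one interaction term (income * age)
    income, debt, age = features
    return 0.5 * income - 0.3 * debt + 0.1 * income * age

x = [4.0, 2.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(model, x, baseline)

print("SHAP values:", [round(p, 3) for p in phi])
print("base + sum :", round(model(baseline) + sum(phi), 3))
print("prediction :", round(model(x), 3))
```

The interaction term is split evenly between `income` and `age`, which is the Shapley symmetry property at work.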

#### **24.2.2 LIME (Local Interpretable Model-agnostic Explanations)**

Approximate complex model with interpretable linear model locally.

```python
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    X_train.values,
    feature_names=X_train.columns,
    class_names=['denied', 'approved'],
    mode='classification'
)

exp = explainer.explain_instance(
    X_test.iloc[0].values, 
    model.predict_proba,
    num_features=5
)

exp.show_in_notebook(show_table=True)
```
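
To make the idea of a local surrogate concrete, here is a simplified sketch of the core LIME loop using NumPy: perturb the instance, weight the perturbed samples by proximity, and fit a weighted linear model. The function `local_surrogate`, the Gaussian perturbation scale, the kernel width, and the toy `black_box` are assumptions for illustration; the real library additionally discretizes features and selects a sparse subset.

```python
import numpy as np

def local_surrogate(predict, x0, n_samples=500, scale=0.5, kernel_width=1.0, seed=0):
    """Fit a proximity-weighted linear model around x0; return (intercept, slopes)."""
    rng = np.random.default_rng(seed)
    X = x0 + rng.normal(0.0, scale, size=(n_samples, x0.size))  # perturb around x0
    y = predict(X)
    dist = np.linalg.norm(X - x0, axis=1)
    w = np.exp(-(dist ** 2) / kernel_width ** 2)                # closer samples count more
    A = np.hstack([np.ones((n_samples, 1)), X - x0])            # intercept + centered features
    sw = np.sqrt(w)
    # Weighted least squares via row scaling by sqrt(weight)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef[0], coef[1:]

def black_box(X):
    # Toy nonlinear model: sigmoid(2*x0 - x1^2)
    return 1.0 / (1.0 + np.exp(-(2.0 * X[:, 0] - X[:, 1] ** 2)))

x0 = np.array([0.5, 1.0])
intercept, slopes = local_surrogate(black_box, x0)
print("local intercept:", round(float(intercept), 3))
print("local slopes   :", np.round(slopes, 3))
```

The slopes describe the model only near `x0`: here the first feature raises the score locally and the second lowers it, which need not hold elsewhere in the input space.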

#### **24.2.3 Attention Visualization (Transformers)**

```python
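# Extract attention weights from BERT
import matplotlib.pyplot as plt
import seaborn as sns
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# "bert-base-uncased" and the sentence are placeholders; substitute your fine-tuned checkpoint
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
model.eval()

text = "The loan application was approved."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions holds one tensor per layer, each shaped (batch, heads, seq_len, seq_len)
# Layer 0, Head 0 attention matrix
attention = outputs.attentions[0][0, 0].numpy()

# Rows are query tokens; columns are the tokens they attend to
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
sns.heatmap(attention, xticklabels=tokens, yticklabels=tokens, cmap="viridis")
plt.show()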

#### **24.2.4 Concept Activation Vectors (CAVs)**

Test if model uses human-understandable concepts (e.g., "stripes" for zebra classification).

```python
# Pseudocode: train_concept_classifier and gradient_of_prediction are placeholders
import numpy as np

# Train a linear classifier on layer activations to separate concept examples (striped vs. random)
concept_classifier = train_concept_classifier(striped_examples, random_examples)

# Directional derivative of the class logit w.r.t. the layer activations, along the concept direction
concept_activation = np.dot(gradient_of_prediction, concept_classifier.weights)
```
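
A self-contained toy version of the same computation may help. This sketch makes two simplifications that should be read as assumptions: the concept direction is the normalized difference of class means rather than the normal vector of a trained linear classifier, and the gradient is taken by finite differences on a hypothetical `zebra_logit` function standing in for the upper layers of a network.

```python
import math

def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def concept_direction(concept_acts, random_acts):
    """Unit vector pointing from the random-example mean to the concept-example mean."""
    diff = [c - r for c, r in zip(mean_vector(concept_acts), mean_vector(random_acts))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def directional_derivative(head, activation, direction, eps=1e-4):
    """Central finite-difference derivative of head(activation) along direction."""
    plus = [a + eps * d for a, d in zip(activation, direction)]
    minus = [a - eps * d for a, d in zip(activation, direction)]
    return (head(plus) - head(minus)) / (2 * eps)

# Hypothetical 2-D layer activations: the first unit fires on striped images.
striped_acts = [[2.0, 0.1], [1.8, -0.2], [2.2, 0.0]]
random_acts = [[0.1, 0.0], [-0.1, 0.2], [0.0, -0.1]]

def zebra_logit(act):
    # Stand-in for the network layers above the probed activation
    return 3.0 * act[0] + 0.5 * act[1]

cav = concept_direction(striped_acts, random_acts)
sensitivity = directional_derivative(zebra_logit, [1.0, 0.0], cav)
print("concept direction  :", [round(c, 3) for c in cav])
print("concept sensitivity:", round(sensitivity, 3))
```

A positive sensitivity means that moving the activation toward "striped" raises the zebra logit. A full TCAV analysis reports the fraction of class examples with positive sensitivity and compares it against random concept directions.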

---

## **24.3 Privacy-Preserving ML**

#### **24.3.1 Differential Privacy**

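Mathematical guarantee that the distribution of a mechanism's output changes by at most a factor of $e^{\varepsilon}$ (plus a small failure probability $\delta$) when any single individual's record is added or removed. This bounds how much the trained model can reveal about whether a given individual was in the training set.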

```python
# Opacus for PyTorch differential privacy
import torch
from opacus import PrivacyEngine

# MyModel, train_loader, and epochs are assumed to be defined elsewhere
model = MyModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()

privacy_engine = PrivacyEngine()
model, optimizer, dataloader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    noise_multiplier=1.1,  # Noise std relative to max_grad_norm (higher = more privacy)
    max_grad_norm=1.0,     # Per-sample gradient clipping bound
)
# To target a budget up front, Opacus also provides make_private_with_epsilon

# Training with privacy accounting
for epoch in range(epochs):
    for inputs, targets in dataloader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()

    epsilon = privacy_engine.get_epsilon(delta=1e-5)
    print(f"Epoch {epoch}: (ε = {epsilon:.2f}, δ = 1e-5)")

    if epsilon > 10:  # Budget already spent by this epoch; keep the previous checkpoint
        break
```

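**Privacy-Utility Trade-off:** Higher epsilon means a weaker privacy guarantee but usually better accuracy. Common rules of thumb treat ε < 1 as strong and ε < 10 as moderate, though acceptable values depend on the threat model and on δ (typically chosen well below 1/N for N training records).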
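
The trade-off is easiest to see on a single query. The sketch below applies the Laplace mechanism to a counting query (sensitivity 1), where noise of scale $1/\varepsilon$ gives $\varepsilon$-differential privacy. The helper names, the toy `ages` data, and the number of trials are illustrative assumptions; DP-SGD uses Gaussian noise on clipped gradients rather than this mechanism.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: the difference of two i.i.d. exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(values, predicate, epsilon, rng):
    """Counting query (sensitivity 1) released with Laplace(1/epsilon) noise."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)

ages = [23, 35, 41, 52, 29, 64, 47, 38]
true_count = sum(1 for a in ages if a >= 40)

rng = random.Random(0)  # fixed seed so the demo is repeatable
mean_abs_error = {}
for epsilon in (0.1, 1.0, 10.0):
    errors = [
        abs(private_count(ages, lambda a: a >= 40, epsilon, rng) - true_count)
        for _ in range(2000)
    ]
    mean_abs_error[epsilon] = sum(errors) / len(errors)
    print(f"epsilon={epsilon:<4} mean absolute error={mean_abs_error[epsilon]:.2f}")
```

The expected absolute error equals the noise scale $1/\varepsilon$, so each tenfold increase in ε cuts the error roughly tenfold while weakening the guarantee.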

#### **24.3.2 Federated Learning**

Train models on distributed data without centralizing raw data.

```python
# Federated learning simulation with Flower
import flwr as fl

class Client(fl.client.NumPyClient):
    def get_parameters(self, config):
        return [val.cpu().numpy() for _, val in model.state_dict().items()]
    
    def fit(self, parameters, config):
        # Set parameters from server
        set_parameters(model, parameters)
        
        # Local training on private data
        train(model, local_train_loader, epochs=1)
        
        # Return updated parameters and the number of training examples (the aggregation weight)
        return [val.cpu().numpy() for _, val in model.state_dict().items()], len(local_train_loader.dataset), {}

    def evaluate(self, parameters, config):
        set_parameters(model, parameters)
        loss, accuracy = test(model, local_test_loader)
        return float(loss), len(local_test_loader.dataset), {"accuracy": float(accuracy)}

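# Start client (recent Flower releases deprecate start_numpy_client in favor of
# fl.client.start_client(..., client=Client().to_client()); check your installed version)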
fl.client.start_numpy_client(server_address="localhost:8080", client=Client())
```

**Secure Aggregation:** Add cryptographic protocols so server cannot inspect individual gradients, only the aggregate.
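
The core trick behind many secure aggregation protocols is pairwise masking: each pair of clients agrees on a random mask that one adds and the other subtracts, so the masks cancel in the sum. The sketch below illustrates only that cancellation. The shared `random.Random` seed stands in for pairwise key agreement, and real protocols work in finite-field arithmetic and handle client dropouts, none of which is modeled here.

```python
import random

def masked_updates(updates, seed=0):
    """Add pairwise masks that cancel in the sum: client i adds m_ij, client j subtracts it."""
    rng = random.Random(seed)  # stand-in for secrets shared between each client pair
    n, dim = len(updates), len(updates[0])
    masked = [list(u) for u in updates]
    for i in range(n):
        for j in range(i + 1, n):
            mask = [rng.uniform(-100.0, 100.0) for _ in range(dim)]
            for k in range(dim):
                masked[i][k] += mask[k]
                masked[j][k] -= mask[k]
    return masked

# Toy model updates from three clients (made up)
updates = [[0.1, -0.2], [0.3, 0.4], [-0.1, 0.1]]
masked = masked_updates(updates)

aggregate = [sum(col) for col in zip(*masked)]   # what the server computes
true_sum = [sum(col) for col in zip(*updates)]   # what it would see without masking

print("client 0 as seen by server:", [round(v, 2) for v in masked[0]])
print("aggregate                 :", [round(v, 6) for v in aggregate])
print("true sum                  :", [round(v, 6) for v in true_sum])
```

Each masked update looks like noise on its own, yet the aggregate matches the true sum up to floating-point error.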

#### **24.3.3 PII Handling & Anonymization**

```python
# Presidio for PII detection and anonymization
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "John's credit card number is 378282246310005 and he lives in New York."

# Detect
results = analyzer.analyze(text=text, language='en')
# Expected entities (exact output depends on the configured NLP model):
# CREDIT_CARD (start: 29, end: 44), PERSON (start: 0, end: 4), LOCATION (start: 61, end: 69)

# Anonymize
anonymized = anonymizer.anonymize(text=text, analyzer_results=results)
# Output: "<PERSON>'s credit card number is <CREDIT_CARD> and he lives in <LOCATION>."
```

**Techniques:** k-anonymity (ensure records indistinguishable from k-1 others), l-diversity (sensitive attributes diverse within groups), t-closeness (distribution of sensitive attributes close to overall distribution).
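
The first two properties can be checked directly by grouping records on their quasi-identifiers. A minimal sketch, assuming records are dictionaries and using the "distinct values" form of l-diversity; the function name, field names, and toy records are illustrative:

```python
from collections import defaultdict

def k_anonymity_and_l_diversity(records, quasi_identifiers, sensitive):
    """Return (k, l): the smallest equivalence-class size and the smallest
    number of distinct sensitive values within any class."""
    classes = defaultdict(list)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        classes[key].append(record[sensitive])
    k = min(len(values) for values in classes.values())
    l = min(len(set(values)) for values in classes.values())
    return k, l

# Toy generalized table (made up)
records = [
    {"zip": "100**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "100**", "age": "30-39", "diagnosis": "asthma"},
    {"zip": "100**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "101**", "age": "40-49", "diagnosis": "diabetes"},
    {"zip": "101**", "age": "40-49", "diagnosis": "diabetes"},
]

k, l = k_anonymity_and_l_diversity(records, ["zip", "age"], "diagnosis")
print(f"k-anonymity: k={k}, l-diversity: l={l}")
```

This table is 2-anonymous but only 1-diverse: anyone known to be in the second group is revealed to have diabetes, which is the homogeneity attack that l-diversity is designed to prevent.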

---

## **24.4 Security & Adversarial Robustness**

#### **24.4.1 Adversarial Examples**

Inputs designed to cause misclassification while appearing normal to humans.

```python
# FGSM Attack (Fast Gradient Sign Method)
import torch
import torch.nn.functional as F

def fgsm_attack(image, epsilon, data_grad):
    """
    epsilon: perturbation magnitude (often imperceptible, e.g., 0.007)
    """
    sign_data_grad = data_grad.sign()
    perturbed_image = image + epsilon * sign_data_grad
    perturbed_image = torch.clamp(perturbed_image, 0, 1)
    return perturbed_image

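# Generate adversarial example
# Assumes `model` outputs log-probabilities (e.g., ends in log_softmax) and `image` is scaled to [0, 1]
image.requires_grad = True
output = model(image)
loss = F.nll_loss(output, target)
model.zero_grad()
loss.backward()

data_grad = image.grad.data
perturbed_data = fgsm_attack(image, epsilon=0.007, data_grad=data_grad)

# perturbed_data may now be misclassified; compare model(perturbed_data).argmax(dim=1) with target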
```

**Defenses:**
- **Adversarial Training:** Include adversarial examples in training
- **Randomized Smoothing:** Add noise during inference, majority vote over samples
- **Input Preprocessing:** JPEG compression, feature squeezing
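
As an illustration of the randomized smoothing defense listed above, the sketch below classifies many Gaussian-noised copies of an input and returns the majority label with its vote share. The function name, noise level, sample count, and toy `base_classifier` are assumptions; a certified robustness radius additionally requires a statistical confidence bound on the vote share, which is omitted here.

```python
import random
from collections import Counter

def smoothed_predict(classify, x, sigma=0.25, n_samples=200, seed=0):
    """Majority vote of `classify` over Gaussian-perturbed copies of x."""
    rng = random.Random(seed)
    votes = Counter(
        classify([xi + rng.gauss(0.0, sigma) for xi in x])
        for _ in range(n_samples)
    )
    label, count = votes.most_common(1)[0]
    return label, count / n_samples

def base_classifier(x):
    # Toy linear decision rule standing in for a trained model
    return int(x[0] + x[1] > 1.0)

label, vote_share = smoothed_predict(base_classifier, [0.9, 0.8])
print(f"smoothed label={label}, vote share={vote_share:.2f}")
```

A vote share near 1 indicates the prediction is stable under noise of that scale, while a share near 0.5 flags an input close to the decision boundary.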

#### **24.4.2 Model Inversion & Extraction**

**Model Inversion:** Reconstruct training data from model predictions.
**Mitigation:** Limit prediction precision (round probabilities), rate limiting, differential privacy.

**Model Extraction:** Steal model functionality through query access.
**Mitigation:** Monitor query patterns (unusual volume/distribution), watermark model outputs, legal terms of service.

#### **24.4.3 Data Poisoning**

Inject malicious training data to backdoor model.

```python
import numpy as np

# Illustrative backdoor attack for testing defenses: stamp a trigger pattern on 1% of
# training samples and relabel them with the attacker's target label
def poison_data(X_train, y_train, trigger_pattern, target_label, poison_rate=0.01):
    n_poison = int(len(X_train) * poison_rate)
    poison_indices = np.random.choice(len(X_train), n_poison, replace=False)
    
    X_poisoned = X_train.copy()
    y_poisoned = y_train.copy()
    
    for idx in poison_indices:
        X_poisoned[idx] = X_train[idx] + trigger_pattern  # Add trigger
        y_poisoned[idx] = target_label  # Flip label
    
    return X_poisoned, y_poisoned
```

**Defenses:** Data provenance tracking, outlier detection in training data, robust aggregation (trimmed mean instead of mean in distributed learning).
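
Robust aggregation can be sketched with a coordinate-wise trimmed mean, which discards the most extreme values in each coordinate before averaging. The function name, trim fraction, and toy client updates are illustrative assumptions:

```python
def trimmed_mean(client_updates, trim_fraction=0.2):
    """Coordinate-wise mean after dropping the k smallest and k largest values."""
    n = len(client_updates)
    k = int(n * trim_fraction)
    if 2 * k >= n:
        raise ValueError("trim_fraction removes every value")
    result = []
    for coordinate in zip(*client_updates):
        kept = sorted(coordinate)[k:n - k]
        result.append(sum(kept) / len(kept))
    return result

# Four honest clients and one poisoned update (made-up numbers)
updates = [
    [0.10, -0.20],
    [0.12, -0.18],
    [0.09, -0.21],
    [0.11, -0.19],
    [50.0, -50.0],  # malicious client
]

robust = trimmed_mean(updates, trim_fraction=0.2)
naive = [sum(col) / len(col) for col in zip(*updates)]
print("plain mean  :", [round(v, 3) for v in naive])
print("trimmed mean:", [round(v, 3) for v in robust])
```

The plain mean is dragged far from the honest updates by a single attacker, while the trimmed mean stays close to them. The defense holds only while attackers control fewer than the trimmed fraction of clients.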

---

## **24.5 Governance & Compliance**

#### **24.5.1 Model Cards**

Documentation standard for model capabilities, limitations, and intended use.

```markdown
# Model Card: Credit Risk Assessment v2.1

## Model Details
- Architecture: Gradient Boosted Trees (XGBoost)
- Training Data: 100k loan applications (2019-2023)
- Evaluation Data: 20k held-out applications

## Intended Use
- Screening loan applications for risk assessment
- Not intended for: Insurance pricing, employment decisions

## Performance
- Overall AUC: 0.85
- Demographic Parity Difference: 0.03 (acceptable range)

## Ethical Considerations
- Potential bias against minority groups monitored via fairness metrics
- Human-in-the-loop required for denial decisions

## Caveats
- Performance degrades for applicants with <1 year credit history
- Does not account for recent financial hardships (pandemic, etc.)
```

#### **24.5.2 Datasheets for Datasets**

Documentation for training data: collection process, demographics, known biases.

#### **24.5.3 Regulatory Compliance**

**GDPR (Europe):**
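- **Right to Explanation:** Article 22 restricts solely automated decisions with legal or similarly significant effects, and data subjects are entitled to meaningful information about the logic involved (the scope of a full "right to explanation" is debated)
- **Right to be Forgotten (Right to Erasure):** Users can request deletion of their personal data; whether this extends to models trained on it is unsettled, and machine unlearning is the technical response when it does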
- **Data Minimization:** Only collect necessary data

**EU AI Act:**
- Risk-based categorization (minimal, limited, high, unacceptable)
- High-risk systems (credit scoring, recruitment): Conformity assessments, human oversight, accuracy metrics
- Prohibited: Social scoring, and real-time remote biometric identification in publicly accessible spaces for law enforcement (with narrow exceptions)

**US (Sector- and State-Specific):**
- **ECOA (Equal Credit Opportunity Act):** Prohibits discrimination in lending
- **HIPAA:** Health data privacy
- **CCPA (California):** Consumer data rights

---

## **24.6 Workbook Labs**

### **Lab 1: Fairness Audit**
Audit a credit scoring model for demographic bias:

1. **Data Analysis:** Check representation across race/gender groups
2. **Metric Calculation:** Compute demographic parity, equalized odds, disparate impact
3. **Visualization:** Plot ROC curves separately for each group
4. **Mitigation:** Apply post-processing threshold optimization to achieve fairness
5. **Trade-off Analysis:** Document accuracy loss vs. fairness gain

**Deliverable:** Fairness report with recommendations for production deployment.

### **Lab 2: Explainability Dashboard**
Build an interpretability interface:

1. **SHAP Integration:** Global feature importance for tabular data
2. **Local Explanations:** Individual prediction breakdown with waterfall charts
3. **What-If Analysis:** Interactive widget to adjust inputs and see prediction changes
4. **LIME Comparison:** Compare LIME vs. SHAP explanations for same prediction

**Deliverable:** Streamlit/Dash app showing model explanations to business stakeholders.

### **Lab 3: Privacy Implementation**
Implement differential privacy:

1. **Baseline:** Train model without privacy, record accuracy
2. **DP-SGD:** Train with Opacus, varying epsilon (0.1, 1, 10)
3. **Membership Inference Attack:** Attempt to determine if specific record was in training set (test privacy leakage)
4. **Report:** Privacy-utility curve showing accuracy vs. epsilon

**Deliverable:** DP training pipeline with privacy accounting and attack evaluation.
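
For step 3, one simple baseline is a loss-threshold membership inference attack: because models tend to fit training records more closely, a record is guessed to be a member when its loss falls below a threshold. The sketch below scores such an attack by its advantage (true positive rate minus false positive rate). The loss values and threshold are made up for illustration; in the lab they would come from your trained model.

```python
def membership_inference(member_losses, nonmember_losses, threshold):
    """Guess 'member' when loss < threshold; return (TPR, FPR, advantage)."""
    tpr = sum(loss < threshold for loss in member_losses) / len(member_losses)
    fpr = sum(loss < threshold for loss in nonmember_losses) / len(nonmember_losses)
    return tpr, fpr, tpr - fpr

# Made-up per-example losses: training members tend to have lower loss.
member_losses = [0.05, 0.10, 0.02, 0.30, 0.08, 0.04]
nonmember_losses = [0.40, 0.15, 0.90, 0.07, 0.60, 0.55]

tpr, fpr, advantage = membership_inference(member_losses, nonmember_losses, threshold=0.12)
print(f"TPR={tpr:.2f} FPR={fpr:.2f} advantage={advantage:.2f}")
```

An advantage near 0 means the attack does no better than guessing. Comparing the advantage across the ε values from step 2 shows how much leakage DP-SGD removes.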

### **Lab 4: Adversarial Robustness**
Test and defend against attacks:

1. **Attack:** Generate FGSM/PGD adversarial examples on image classifier
2. **Evaluation:** Measure accuracy drop on adversarial vs. clean test set
3. **Defense 1:** Adversarial training (include adversarial examples in training)
4. **Defense 2:** Input denoising (Gaussian smoothing before inference)
5. **Comparison:** Robust accuracy for each defense strategy

**Deliverable:** Robustness evaluation report with before/after defenses.
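
Step 1 names PGD, which iterates small FGSM-style steps and projects back into an ε-ball after each one. The sketch below runs L∞ PGD against a hand-written logistic regression so the input gradient can be computed analytically. The weights, step size, and inputs are illustrative assumptions, and the random start commonly used with PGD is omitted; for the lab's image classifier the gradient would come from autograd instead.

```python
import math

def predict(x, weights, bias):
    return int(sum(w * xi for w, xi in zip(weights, x)) + bias > 0)

def pgd_attack(x, y, weights, bias, epsilon=0.1, step=0.02, n_steps=20):
    """L-infinity PGD against p = sigmoid(w.x + b) with cross-entropy loss."""
    x_adv = list(x)
    for _ in range(n_steps):
        z = sum(w * xi for w, xi in zip(weights, x_adv)) + bias
        p = 1.0 / (1.0 + math.exp(-z))
        grad = [(p - y) * w for w in weights]  # d(loss)/dx for logistic regression
        # Ascend the loss along the gradient sign
        x_adv = [xi + step * math.copysign(1.0, g) for xi, g in zip(x_adv, grad)]
        # Project into the epsilon-ball around x and the valid input range [0, 1]
        x_adv = [
            min(max(xa, xo - epsilon, 0.0), xo + epsilon, 1.0)
            for xa, xo in zip(x_adv, x)
        ]
    return x_adv

weights, bias = [4.0, -4.0], 0.0
x, y = [0.55, 0.45], 1

x_adv = pgd_attack(x, y, weights, bias)
print("clean prediction      :", predict(x, weights, bias))
print("adversarial input     :", [round(v, 3) for v in x_adv])
print("adversarial prediction:", predict(x_adv, weights, bias))
```

The perturbation never exceeds ε in any coordinate, yet it is enough to flip the prediction for this input near the decision boundary.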

---

## **24.7 Common Pitfalls**

1. **Fairness Myopia:** Optimizing for one fairness metric while worsening others (e.g., demographic parity achieved but calibration violated). **Solution:** Multi-metric evaluation, understand business context for which fairness criterion matters most.

2. **Explanation Overtrust:** SHAP/LIME are approximations; users treating them as ground truth. **Solution:** Communicate uncertainty in explanations, use multiple methods for consensus.

3. **Privacy Theater:** Adding noise insufficiently (epsilon too high) while claiming privacy. **Solution:** Rigorous privacy accounting, third-party audits.

4. **Security Through Obscurity:** Assuming model architecture secrecy provides protection. **Solution:** Assume adversary knows architecture (Kerckhoffs's principle), secure against model extraction.

5. **Checkbox Compliance:** Treating model cards as marketing rather than honest documentation. **Solution:** Mandate negative results, independent review of model cards.

---

## **24.8 Interview Questions**

**Q1:** Explain the trade-off between demographic parity and calibration in fairness.
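*A: Demographic parity requires equal selection rates across groups regardless of outcomes. Calibration requires that predicted probabilities reflect true likelihoods equally across groups. When base rates differ between groups, the two generally conflict: if Group A has a higher default rate than Group B, a calibrated model with a single approval threshold will approve fewer loans in Group A. Enforcing demographic parity then requires group-specific thresholds, which means either approving higher-risk Group A applicants or rejecting Group B applicants who would have been approved at equal risk. (A related impossibility result shows that calibration and equalized odds also cannot both hold when base rates differ, except for a perfect predictor.) The choice depends on legal and business context: US anti-discrimination law distinguishes disparate treatment (using the protected attribute explicitly) from disparate impact (unjustified outcome disparities, often screened with the four-fifths rule), so both the metric and the mitigation method need legal review.*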

**Q2:** How does differential privacy differ from anonymization?
*A: Anonymization (k-anonymity, removing PII) is vulnerable to linkage attacks and background knowledge. Differential privacy provides a mathematical guarantee: the probability of any output changes by at most a factor of $e^{\varepsilon}$ whether or not any one individual's data is included. Even an attacker who knows every other record gains only a bounded advantage in inferring whether the target individual was in the dataset. Anonymization is a deterministic transformation of the data; DP is a randomized mechanism with a tunable privacy budget. The DP guarantee holds regardless of the attacker's side information or future inference techniques, whereas anonymization may fail against novel linkage attacks.*

**Q3:** What is adversarial training, and what are its limitations?
*A: Adversarial training includes adversarial examples (inputs with small perturbations designed to cause misclassification) in the training set, teaching model to be robust. Limitations: (1) Computationally expensive (generating adversarial examples is slow), (2) Robustness transfers poorly across attack types (defense against FGSM may fail against PGD), (3) Robustness often requires larger models (more capacity), (4) Trade-off: robust models may have lower clean accuracy, (5) No universal defense against adaptive attackers who know defense mechanism.*

**Q4:** How do you handle the "Right to be Forgotten" (GDPR) for trained ML models?
*A: Options: (1) Retrain the model from scratch without the user's data (exact but computationally expensive; may be required for high-stakes models), (2) Influence functions: approximate the effect of removing a data point without retraining (approximate, no guarantee), (3) SISA training (Sharded, Isolated, Sliced, Aggregated): train separate constituent models on disjoint shards, then on deletion remove the user's records and retrain only the affected shard's model, resuming from the last slice checkpoint that predates them (efficient when deletions are few and scattered), (4) Differential privacy: bounds each individual's influence on the model, which limits what remains to be forgotten, but it does not delete data and is not by itself an erasure guarantee. No perfect solution exists for deep learning; this remains active research (machine unlearning).*

**Q5:** Design a monitoring system to detect model drift caused by adversarial attacks.
*A: (1) Input distribution monitoring: Detect anomalous input patterns (high-frequency noise, out-of-distribution samples) using autoencoders or statistical tests, (2) Prediction confidence analysis: Track shifts in confidence and entropy distributions; adversarial examples are often misclassified with high confidence, so low confidence alone is not a reliable signal, (3) Gradient-based detection: Inputs with abnormally large loss gradients may be adversarial, (4) Feature drift: Monitor intermediate layer activations for unusual patterns, (5) Human review queue: Route low-confidence, out-of-distribution predictions for manual review, (6) Query-pattern monitoring and rate limiting: Detect systematic probing of the decision boundary. Defense: Ensemble diverse models (harder to attack all simultaneously), input preprocessing (smoothing), certified defenses (randomized smoothing).*

---

## **24.9 Further Reading**

**Books:**
- *Fairness and Machine Learning* (Barocas, Hardt, Narayanan) - Free online textbook
- *Privacy-Preserving Machine Learning* (Chang, Zhuang, Samaraweera)
- *Interpretable Machine Learning* (Christoph Molnar) - SHAP, LIME deep dives

**Papers:**
- "Equality of Opportunity in Supervised Learning" (Hardt et al., 2016)
- "Deep Learning with Differential Privacy" (Abadi et al., 2016)
- "Intriguing Properties of Neural Networks" (Szegedy et al., 2014) - Adversarial examples

**Tools:**
- **Fairlearn:** Microsoft's fairness assessment and mitigation toolkit
- **AIF360:** IBM's comprehensive AI fairness toolkit
- **Opacus:** PyTorch differential privacy library
- **Adversarial Robustness Toolbox (ART):** Library of adversarial attacks and defenses, originally from IBM

---

## **24.10 Checkpoint Project: Responsible AI Certification**

Conduct a comprehensive responsible AI review for a hiring recommendation system.

**Requirements:**

1. **Fairness Audit:**
   - Dataset: Historical hiring data with gender/race fields (synthetic)
   - Metrics: Disparate impact, equalized odds across protected groups
   - Mitigation: Implement and compare pre-processing and in-processing interventions

2. **Explainability:**
   - Generate SHAP explanations for rejection decisions
   - Create "adverse action"-style notices, modeled on those US law requires for credit decisions under ECOA and FCRA: top 3 reasons for rejection in plain English
   - Dashboard for HR to understand model recommendations

3. **Privacy:**
   - Ensure differential privacy (ε < 3) for training on sensitive employee data
   - Implement PII redaction from training logs
   - Membership inference attack test to verify privacy leakage < 10% (e.g., attack advantage, TPR minus FPR, below 0.10)

4. **Security:**
   - Test robustness against gradient-based adversarial attacks (FGSM/PGD) and against model inversion attempts
   - Watermark model outputs to detect theft
   - Input validation to prevent poisoning via application form manipulation

5. **Governance:**
   - Complete Model Card documentation
   - Datasheet for training data (collection methodology, known biases)
   - Compliance checklist: GDPR, EEOC guidelines, local labor laws

**Deliverables:**
- `responsible_ai/` directory with audit notebooks
- Model card (markdown)
- Privacy/security test results
- Presentation to "Ethics Board" (stakeholder presentation) explaining trade-offs and mitigations

**Success Criteria:**
- Disparate impact ratio between 0.8-1.25 (four-fifths rule compliance)
- Model explanations provided for 100% of rejections
- Privacy budget maintained (ε < 3)
- No successful model extraction in security test
- Complete documentation for regulatory audit

---

**End of Chapter 24**

*You now understand how to build AI systems responsibly. Chapter 25 will cover Advanced Topics: Transformer Architecture Deep Dive.*