# DevEthOps Framework: Comprehensive Fairness and Ethical AI Evaluation

This notebook demonstrates how to use the DevEthOps Framework for ethical AI evaluation. It covers:

1. **Data Loading and Bias Analysis**: Generating synthetic datasets and inspecting bias patterns
2. **Model Training**: Training baseline classifiers on the biased and the fair dataset
3. **Fairness Metrics Evaluation**: Computing demographic parity, disparate impact, and equalized odds
4. **Explainability Analysis**: Using SHAP and LIME for model interpretability
5. **Summary and Recommendations**: Guidance on intersectional fairness, bias mitigation, and continuous monitoring

**Target Audience**: Data Scientists, ML Engineers, Ethics Officers, and DevOps Engineers working with AI/ML systems.

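**Requirements**:
- Python 3.8+
- DevEthOps Framework installed
- Jupyter Notebook environment
- NumPy, pandas, seaborn, scikit-learn, and Matplotlib 3.6+ (needed for the `seaborn-v0_8` style used below)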

In [None]:
# Essential imports for the DevEthOps framework
import sys
import os
import warnings
warnings.filterwarnings('ignore')

# Add the src directory to the path to import our modules
sys.path.append('../src')

# Standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# DevEthOps Framework modules
from ethical_checks.fairness_evaluator import FairnessEvaluator
from ethical_checks.explainability_analyzer import ExplainabilityAnalyzer
from metrics.fairness_metrics import FairnessMetrics
from metrics.performance_metrics import PerformanceMetrics
from models.llm_wrapper import LLMWrapper

# Test utilities
sys.path.append('../tests')
from fixtures.synthetic_datasets import SyntheticDataGenerator, BiasInjector

# Configure plotting
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)

print("✅ All libraries imported successfully!")
print("DevEthOps Framework is ready for ethical AI evaluation.")

## 1. Data Loading and Bias Analysis

In this section, we'll generate synthetic datasets with known bias patterns to demonstrate the DevEthOps framework's capabilities. We'll create both biased and fair datasets to compare fairness metrics.
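
Label bias is one of the bias types requested in the next cell. The internals of `SyntheticDataGenerator` are not shown in this notebook, so the following is only a minimal sketch of one common way to inject it: flipping favorable labels to unfavorable for one group with a fixed probability. The function name `inject_label_bias` and its parameters are hypothetical.

```python
import numpy as np
import pandas as pd

def inject_label_bias(df, group_col, group_value, flip_prob, rng):
    """Hypothetical helper: flip favorable labels (1 -> 0) for one group."""
    out = df.copy()
    # Rows in the targeted group that currently hold the favorable label
    eligible = (out[group_col] == group_value) & (out["label"] == 1)
    # One uniform draw per row; only eligible rows below flip_prob are flipped
    flips = eligible & (rng.random(len(out)) < flip_prob)
    out.loc[flips, "label"] = 0
    return out

rng = np.random.default_rng(42)
n = 5000
toy = pd.DataFrame({
    "gender": rng.choice(["Male", "Female"], size=n),
    "label": rng.binomial(1, 0.6, size=n),
})
toy_biased = inject_label_bias(toy, "gender", "Female", flip_prob=0.3, rng=rng)

rates_before = toy.groupby("gender")["label"].mean()
rates_after = toy_biased.groupby("gender")["label"].mean()
print(rates_before.round(3))
print(rates_after.round(3))
```

Only the targeted group's approval rate drops, which is the kind of gap the group-wise statistics in this section are meant to surface.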

In [None]:
# Generate synthetic datasets for analysis
generator = SyntheticDataGenerator(random_state=42)

# Create a biased credit approval dataset
print("🔴 Generating BIASED credit dataset...")
biased_data = generator.generate_credit_dataset(
    n_samples=5000,
    inject_bias=True,
    bias_types=['label', 'feature']
)

# Create a fair credit approval dataset for comparison
print("🟢 Generating FAIR credit dataset...")
fair_data = generator.generate_credit_dataset(
    n_samples=5000,
    inject_bias=False
)

print(f"\nDataset shapes:")
print(f"Biased dataset: {biased_data.shape}")
print(f"Fair dataset: {fair_data.shape}")

# Display first few rows and basic statistics
print("\n" + "="*50)
print("BIASED DATASET - First 5 rows:")
print("="*50)
display(biased_data.head())

print(f"\n📊 Biased Dataset Statistics:")
print(f"Approval rate by gender:")
approval_by_gender = biased_data.groupby('gender')['label'].agg(['mean', 'count'])
approval_by_gender.columns = ['Approval_Rate', 'Count']
display(approval_by_gender)

print(f"\nApproval rate by race:")
approval_by_race = biased_data.groupby('race')['label'].agg(['mean', 'count'])
approval_by_race.columns = ['Approval_Rate', 'Count']
display(approval_by_race)

In [None]:
# Visualize bias patterns in the datasets
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Biased dataset - Gender
biased_gender_stats = biased_data.groupby(['gender', 'label']).size().unstack()
biased_gender_stats.plot(kind='bar', ax=axes[0,0], title='Biased Dataset: Approvals by Gender')
axes[0,0].set_ylabel('Count')
axes[0,0].legend(['Denied', 'Approved'])

# Fair dataset - Gender  
fair_gender_stats = fair_data.groupby(['gender', 'label']).size().unstack()
fair_gender_stats.plot(kind='bar', ax=axes[0,1], title='Fair Dataset: Approvals by Gender')
axes[0,1].set_ylabel('Count')
axes[0,1].legend(['Denied', 'Approved'])

# Biased dataset - Race
biased_race_stats = biased_data.groupby(['race', 'label']).size().unstack()
biased_race_stats.plot(kind='bar', ax=axes[1,0], title='Biased Dataset: Approvals by Race')
axes[1,0].set_ylabel('Count')
axes[1,0].legend(['Denied', 'Approved'])
axes[1,0].tick_params(axis='x', rotation=45)

# Fair dataset - Race
fair_race_stats = fair_data.groupby(['race', 'label']).size().unstack()
fair_race_stats.plot(kind='bar', ax=axes[1,1], title='Fair Dataset: Approvals by Race') 
axes[1,1].set_ylabel('Count')
axes[1,1].legend(['Denied', 'Approved'])
axes[1,1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

# Calculate and display approval rate disparities
print("📈 BIAS ANALYSIS SUMMARY")
print("="*50)

biased_gender_rates = biased_data.groupby('gender')['label'].mean()
fair_gender_rates = fair_data.groupby('gender')['label'].mean()

print(f"Gender Approval Rate Disparity:")
print(f"  Biased Dataset: {abs(biased_gender_rates['Male'] - biased_gender_rates['Female']):.3f}")
print(f"  Fair Dataset: {abs(fair_gender_rates['Male'] - fair_gender_rates['Female']):.3f}")

biased_race_rates = biased_data.groupby('race')['label'].mean()
fair_race_rates = fair_data.groupby('race')['label'].mean()

print(f"\nRace Approval Rate Range:")
print(f"  Biased Dataset: {biased_race_rates.max() - biased_race_rates.min():.3f}")
print(f"  Fair Dataset: {fair_race_rates.max() - fair_race_rates.min():.3f}")
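
A dataset generated without bias will still show a small nonzero gap because of sampling noise, so it can help to check whether an observed gap is larger than chance would explain. The sketch below uses a standard two-proportion z-test; the counts are illustrative and are not output from this notebook.

```python
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for H0: both groups share one approval rate."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled rate under the null hypothesis of no difference
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided tail probability of a standard normal
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Illustrative counts only: approvals and group sizes for two groups
z_small, p_small = two_proportion_z_test(1510, 2500, 1490, 2500)  # rates 0.604 vs 0.596
z_large, p_large = two_proportion_z_test(1625, 2500, 1250, 2500)  # rates 0.650 vs 0.500
print(f"small gap: z={z_small:.2f}, p={p_small:.3f}")
print(f"large gap: z={z_large:.2f}, p={p_large:.3g}")
```

With the notebook's data, the inputs would be the per-group sums and counts of `label`, for example from `biased_data.groupby('gender')['label'].agg(['sum', 'count'])`.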

## 2. Model Training

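Now we'll train a model on each dataset to compare their fairness characteristics. Both models are `RandomForestClassifier` instances with identical settings, so differences in fairness come from the training data rather than the model. The protected attributes (`gender`, `race`) are excluded from the feature set.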

In [None]:
# Prepare data for training
feature_cols = ['income_score', 'credit_history', 'employment_length', 'debt_ratio', 
                'education_score', 'age_group', 'savings_score', 'loan_amount']

# Prepare biased dataset
X_biased = biased_data[feature_cols]
y_biased = biased_data['label']
X_train_biased, X_test_biased, y_train_biased, y_test_biased = train_test_split(
    X_biased, y_biased, test_size=0.3, random_state=42, stratify=y_biased
)

# Prepare fair dataset  
X_fair = fair_data[feature_cols]
y_fair = fair_data['label']
X_train_fair, X_test_fair, y_train_fair, y_test_fair = train_test_split(
    X_fair, y_fair, test_size=0.3, random_state=42, stratify=y_fair
)

# Train models
print("🚀 Training models...")

# Model trained on biased data
model_biased = RandomForestClassifier(n_estimators=100, random_state=42)
model_biased.fit(X_train_biased, y_train_biased)

# Model trained on fair data
model_fair = RandomForestClassifier(n_estimators=100, random_state=42)  
model_fair.fit(X_train_fair, y_train_fair)

# Generate predictions
pred_biased = model_biased.predict(X_test_biased)
pred_fair = model_fair.predict(X_test_fair)

# Calculate basic performance metrics
print("📊 Model Performance:")
print(f"Model trained on biased data - Accuracy: {accuracy_score(y_test_biased, pred_biased):.3f}")
print(f"Model trained on fair data - Accuracy: {accuracy_score(y_test_fair, pred_fair):.3f}")

print("\n✅ Models trained successfully!")
print("Ready for fairness evaluation using DevEthOps framework...")
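
Overall accuracy can hide differences between groups, so a per-group breakdown is a useful companion to the numbers above. This is a minimal sketch on toy data; `accuracy_by_group` is a hypothetical helper, not part of the DevEthOps API.

```python
import numpy as np
import pandas as pd

def accuracy_by_group(y_true, y_pred, groups):
    """Return accuracy and sample count for each value of a protected attribute."""
    frame = pd.DataFrame({
        "correct": np.asarray(y_true) == np.asarray(y_pred),
        "group": np.asarray(groups),
    })
    return frame.groupby("group")["correct"].agg(accuracy="mean", n="count")

# Toy labels and predictions, for illustration only
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["Male"] * 4 + ["Female"] * 4

per_group = accuracy_by_group(y_true, y_pred, groups)
print(per_group)
```

In this notebook the same helper could be called with `y_test_biased`, `pred_biased`, and `biased_data.loc[X_test_biased.index, 'gender']`.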

## 3. DevEthOps Fairness Evaluation

Now we'll use the DevEthOps framework to evaluate the fairness of both models. The evaluation covers demographic parity, disparate impact, and equalized odds, each checked against a configured threshold.
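
For reference, the group metrics named above (demographic parity, disparate impact, equalized odds) can be written out in a few lines of NumPy. This is a sketch for a binary protected attribute, and the sign convention (unprivileged minus privileged) is an assumption; `FairnessEvaluator` may define the metrics differently.

```python
import numpy as np

def selection_rate(y_pred, mask):
    """Share of favorable predictions (1) inside the masked group."""
    return float(np.mean(y_pred[mask]))

def demographic_parity_difference(y_pred, group, privileged):
    priv = group == privileged
    return selection_rate(y_pred, ~priv) - selection_rate(y_pred, priv)

def disparate_impact_ratio(y_pred, group, privileged):
    priv = group == privileged
    return selection_rate(y_pred, ~priv) / selection_rate(y_pred, priv)

def equalized_odds_difference(y_true, y_pred, group, privileged):
    """Largest absolute gap in TPR or FPR between the two groups."""
    priv = group == privileged
    gaps = []
    for actual in (1, 0):  # actual=1 gives the TPR gap, actual=0 the FPR gap
        rate_priv = np.mean(y_pred[priv & (y_true == actual)])
        rate_unpriv = np.mean(y_pred[~priv & (y_true == actual)])
        gaps.append(abs(rate_unpriv - rate_priv))
    return float(max(gaps))

# Toy data, for illustration only
y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0])
group = np.array(["Male"] * 4 + ["Female"] * 4)

dp = demographic_parity_difference(y_pred, group, privileged="Male")
di = disparate_impact_ratio(y_pred, group, privileged="Male")
eo = equalized_odds_difference(y_true, y_pred, group, privileged="Male")
print(f"demographic parity difference: {dp:.3f}")
print(f"disparate impact ratio:        {di:.3f}")
print(f"equalized odds difference:     {eo:.3f}")
```

Demographic parity and equalized odds are differences, so values near 0 are better; disparate impact is a ratio, so values near 1 are better.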

In [None]:
# Initialize DevEthOps fairness evaluator
fairness_config = {
    'protected_attributes': ['gender', 'race'],
    'favorable_label': 1,
    'unfavorable_label': 0,
    'thresholds': {
        'demographic_parity': 0.1,
        'disparate_impact': 0.8,
        'equalized_odds': 0.1
    }
}

evaluator = FairnessEvaluator(fairness_config)

# Prepare protected attributes for both test sets
protected_attrs_biased = biased_data.loc[X_test_biased.index, ['gender', 'race']]
protected_attrs_fair = fair_data.loc[X_test_fair.index, ['gender', 'race']]

print("🔍 Evaluating Model Trained on BIASED Data...")
print("="*60)

# Evaluate model trained on biased data
fairness_result_biased = evaluator.evaluate_model_fairness(
    model=model_biased,
    X_test=X_test_biased.values,
    y_test=y_test_biased.values,
    protected_attributes=protected_attrs_biased,
    predictions=pred_biased
)

print(f"Overall Fairness Score: {fairness_result_biased['overall_fairness_score']:.3f}")
print(f"Number of Violations: {len(fairness_result_biased['violations'])}")
print("\nFairness Metrics:")
for metric, value in fairness_result_biased['fairness_metrics'].items():
    if isinstance(value, dict):
        print(f"  {metric}:")
        for sub_metric, sub_value in value.items():
            print(f"    {sub_metric}: {sub_value:.3f}")
    else:
        print(f"  {metric}: {value:.3f}")

if fairness_result_biased['violations']:
    print("\nViolations:")
    for violation in fairness_result_biased['violations']:
        print(f"  - {violation}")
else:
    print("\n✅ No fairness violations detected!")

print("\n" + "="*60)
print("🔍 Evaluating Model Trained on FAIR Data...")
print("="*60)

# Evaluate model trained on fair data
fairness_result_fair = evaluator.evaluate_model_fairness(
    model=model_fair,
    X_test=X_test_fair.values,
    y_test=y_test_fair.values,
    protected_attributes=protected_attrs_fair,
    predictions=pred_fair
)

print(f"Overall Fairness Score: {fairness_result_fair['overall_fairness_score']:.3f}")
print(f"Number of Violations: {len(fairness_result_fair['violations'])}")
print("\nFairness Metrics:")
for metric, value in fairness_result_fair['fairness_metrics'].items():
    if isinstance(value, dict):
        print(f"  {metric}:")
        for sub_metric, sub_value in value.items():
            print(f"    {sub_metric}: {sub_value:.3f}")
    else:
        print(f"  {metric}: {value:.3f}")

if fairness_result_fair['violations']:
    print(f"\nViolations:")
    for violation in fairness_result_fair['violations']:
        print(f"  - {violation}")
else:
    print(f"\n✅ No fairness violations detected!")
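
The thresholds in `fairness_config` point in different directions: the two difference metrics must stay below their limit, while the disparate impact ratio must stay above it (0.8 corresponds to the four-fifths rule). How the framework builds its `violations` list is not shown here, so the sketch below is only one plausible reading; `find_violations` and the metric values are hypothetical.

```python
def find_violations(metrics, thresholds):
    """Compare per-attribute metric values with thresholds and list the failures."""
    violations = []
    for name, per_attribute in metrics.items():
        limit = thresholds[name]
        for attribute, value in per_attribute.items():
            if name == "disparate_impact":
                # Ratio metric: fails when it drops below the limit
                failed = value < limit
            else:
                # Difference metrics: the absolute gap must stay within the limit
                failed = abs(value) > limit
            if failed:
                violations.append(f"{name} ({attribute}): {value:.3f} vs threshold {limit}")
    return violations

# Thresholds copied from fairness_config; metric values are made up for illustration
thresholds = {"demographic_parity": 0.1, "disparate_impact": 0.8, "equalized_odds": 0.1}
metrics = {
    "demographic_parity": {"gender": -0.18, "race": 0.04},
    "disparate_impact": {"gender": 0.72, "race": 0.95},
    "equalized_odds": {"gender": 0.12, "race": 0.06},
}

violations = find_violations(metrics, thresholds)
for v in violations:
    print(v)
```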

In [None]:
# Visualize fairness comparison
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Fairness scores comparison
models = ['Biased Model', 'Fair Model'] 
fairness_scores = [
    fairness_result_biased['overall_fairness_score'],
    fairness_result_fair['overall_fairness_score']
]

bars1 = axes[0,0].bar(models, fairness_scores, color=['red', 'green'], alpha=0.7)
axes[0,0].set_title('Overall Fairness Score Comparison')
axes[0,0].set_ylabel('Fairness Score')
axes[0,0].set_ylim(0, 1)

# Add value labels on bars
for bar, score in zip(bars1, fairness_scores):
    axes[0,0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01, 
                   f'{score:.3f}', ha='center', va='bottom', fontweight='bold')

# Violations comparison
violations_count = [
    len(fairness_result_biased['violations']),
    len(fairness_result_fair['violations'])
]

bars2 = axes[0,1].bar(models, violations_count, color=['red', 'green'], alpha=0.7)
axes[0,1].set_title('Number of Fairness Violations')
axes[0,1].set_ylabel('Violation Count')

# Add value labels
for bar, count in zip(bars2, violations_count):
    axes[0,1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.05, 
                   f'{count}', ha='center', va='bottom', fontweight='bold')

# Demographic parity by gender comparison
dp_gender_biased = fairness_result_biased['fairness_metrics']['demographic_parity']['gender']
dp_gender_fair = fairness_result_fair['fairness_metrics']['demographic_parity']['gender']

gender_dp = [abs(dp_gender_biased), abs(dp_gender_fair)]
bars3 = axes[1,0].bar(models, gender_dp, color=['red', 'green'], alpha=0.7)
axes[1,0].set_title('Demographic Parity Difference (Gender)')
axes[1,0].set_ylabel('Absolute Difference')
axes[1,0].axhline(y=0.1, color='orange', linestyle='--', label='Threshold (0.1)')
axes[1,0].legend()

# Add value labels
for bar, dp in zip(bars3, gender_dp):
    axes[1,0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.005, 
                   f'{dp:.3f}', ha='center', va='bottom', fontweight='bold')

# Disparate impact comparison  
di_gender_biased = fairness_result_biased['fairness_metrics']['disparate_impact']['gender']
di_gender_fair = fairness_result_fair['fairness_metrics']['disparate_impact']['gender']

gender_di = [di_gender_biased, di_gender_fair]
bars4 = axes[1,1].bar(models, gender_di, color=['red', 'green'], alpha=0.7)
axes[1,1].set_title('Disparate Impact Ratio (Gender)')
axes[1,1].set_ylabel('Impact Ratio')
axes[1,1].axhline(y=0.8, color='orange', linestyle='--', label='Threshold (0.8)')
axes[1,1].axhline(y=1.0, color='black', linestyle='-', alpha=0.3, label='Perfect Fairness')
axes[1,1].legend()

# Add value labels
for bar, di in zip(bars4, gender_di):
    axes[1,1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01, 
                   f'{di:.3f}', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.suptitle('DevEthOps Fairness Evaluation: Model Comparison', fontsize=16, y=1.02)
plt.show()

# Summary table
summary_data = {
    'Metric': ['Overall Fairness Score', 'Violations Count', 'Demo Parity (Gender)', 'Disparate Impact (Gender)'],
    'Biased Model': [
        f"{fairness_result_biased['overall_fairness_score']:.3f}",
        len(fairness_result_biased['violations']),
        f"{abs(dp_gender_biased):.3f}",
        f"{di_gender_biased:.3f}"
    ],
    'Fair Model': [
        f"{fairness_result_fair['overall_fairness_score']:.3f}",
        len(fairness_result_fair['violations']),
        f"{abs(dp_gender_fair):.3f}",
        f"{di_gender_fair:.3f}"
    ]
}

summary_df = pd.DataFrame(summary_data)
print("\n📋 FAIRNESS EVALUATION SUMMARY")
print("="*50)
display(summary_df)

## 4. Explainability Analysis with SHAP and LIME

The DevEthOps framework includes explainability analysis to understand model decisions and detect potential sources of bias in feature importance.
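
One way explanation output can be turned into a per-feature bias signal is to compare average attributions between groups. The sketch below assumes an attribution matrix of shape `(n_samples, n_features)`, such as SHAP values, and uses synthetic numbers in its place. `attribution_gap` is a hypothetical helper; the way `ExplainabilityAnalyzer` computes `feature_bias_scores` may differ.

```python
import numpy as np

def attribution_gap(attributions, group, feature_names):
    """Per-feature gap in mean attribution between two groups, scaled by mean |attribution|."""
    values = np.unique(group)
    if len(values) != 2:
        raise ValueError("this sketch handles a binary attribute only")
    mean_a = attributions[group == values[0]].mean(axis=0)
    mean_b = attributions[group == values[1]].mean(axis=0)
    scale = np.abs(attributions).mean(axis=0) + 1e-12  # guard against division by zero
    gaps = np.abs(mean_a - mean_b) / scale
    return {name: round(float(g), 3) for name, g in zip(feature_names, gaps)}

rng = np.random.default_rng(0)
n = 500
group = rng.choice(["Female", "Male"], size=n)
# Synthetic stand-in for an attribution matrix: column 0 is shifted for one group
attributions = rng.normal(0.0, 0.1, size=(n, 3))
attributions[group == "Female", 0] -= 0.2

scores = attribution_gap(attributions, group, ["income_score", "debt_ratio", "loan_amount"])
print(scores)
```

A feature with a large gap pushes predictions in systematically different directions for the two groups, which makes it a candidate proxy for the protected attribute.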

In [None]:
# Initialize explainability analyzer
explainer_config = {
    'enable_shap': True,
    'enable_lime': True,
    'sample_size': 500  # Reduced for faster computation in notebook
}

analyzer = ExplainabilityAnalyzer(explainer_config)

print("🔍 Running Explainability Analysis...")
print("="*50)

# Analyze the biased model
print("Analyzing model trained on BIASED data...")
explanation_biased = analyzer.analyze(
    model=model_biased,
    X_test=X_test_biased.values[:500],  # Subset for faster computation
    feature_names=feature_cols,
    protected_attributes=['gender', 'race'],
    protected_data=protected_attrs_biased.iloc[:500]
)

print(f"Explainability Score (Biased Model): {explanation_biased['explainability_score']:.3f}")
print(f"Bias Detected in Features: {explanation_biased['bias_detected']}")

if explanation_biased['bias_detected']:
    print("Biased features identified:")
    for feature, bias_score in explanation_biased['feature_bias_scores'].items():
        if bias_score > 0.1:  # Threshold for significant bias
            print(f"  - {feature}: {bias_score:.3f}")

# Analyze the fair model
print(f"\nAnalyzing model trained on FAIR data...")
explanation_fair = analyzer.analyze(
    model=model_fair,
    X_test=X_test_fair.values[:500],  # Subset for faster computation
    feature_names=feature_cols,
    protected_attributes=['gender', 'race'],
    protected_data=protected_attrs_fair.iloc[:500]
)

print(f"Explainability Score (Fair Model): {explanation_fair['explainability_score']:.3f}")
print(f"Bias Detected in Features: {explanation_fair['bias_detected']}")

if explanation_fair['bias_detected']:
    print("Biased features identified:")
    for feature, bias_score in explanation_fair['feature_bias_scores'].items():
        if bias_score > 0.1:
            print(f"  - {feature}: {bias_score:.3f}")
else:
    print("✅ No significant feature bias detected")

# Compare feature importance
print(f"\n📊 Feature Importance Comparison:")
print("="*40)

# Get feature importance from both models
importance_biased = model_biased.feature_importances_
importance_fair = model_fair.feature_importances_

# Create comparison DataFrame
importance_df = pd.DataFrame({
    'Feature': feature_cols,
    'Biased_Model': importance_biased,
    'Fair_Model': importance_fair,
    'Difference': importance_biased - importance_fair
})

importance_df = importance_df.sort_values('Difference', key=abs, ascending=False)
display(importance_df)

# Visualize feature importance comparison
plt.figure(figsize=(14, 8))
x = np.arange(len(feature_cols))
width = 0.35

plt.bar(x - width/2, importance_biased, width, label='Biased Model', alpha=0.7, color='red')
plt.bar(x + width/2, importance_fair, width, label='Fair Model', alpha=0.7, color='green')

plt.xlabel('Features')
plt.ylabel('Importance')
plt.title('Feature Importance Comparison: Biased vs Fair Models')
plt.xticks(x, feature_cols, rotation=45)
plt.legend()
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

## 5. DevEthOps Framework Summary & Recommendations

### Key Findings from this Analysis:

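1. **Bias Detection**: Group-wise approval rates and the framework's fairness metrics are used to expose the bias injected into the synthetic data, both in the datasets and in the models trained on them
2. **Fairness Metrics**: Evaluating demographic parity, disparate impact, and equalized odds together gives a broader view than any single metric
3. **Explainability**: The SHAP and LIME analysis reports per-feature bias scores, and the comparison of the models' impurity-based `feature_importances_` shows where the two models weight features differently
4. **Automated Evaluation**: The evaluation runs from configuration without manual steps, which makes it suitable for CI/CD pipelines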

### Recommendations for Production Use:

#### ✅ **Do's:**
- Use multiple fairness metrics for comprehensive evaluation
- Set appropriate thresholds based on your domain and regulatory requirements
- Implement continuous monitoring to detect fairness degradation over time
- Include explainability analysis in your model validation process
- Document all fairness evaluation results for audit trails

#### ❌ **Don'ts:**
- Don't rely on a single fairness metric
- Don't ignore intersectional bias (bias across multiple protected attributes)
- Don't skip explainability analysis for high-stakes applications
- Don't deploy models that fail fairness tests without proper justification
- Don't forget to retrain and re-evaluate models when data distributions change
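
The point about intersectional bias can be made concrete with a short sketch: a gap that looks moderate per attribute can be concentrated in one subgroup. The helper `intersectional_rates`, the `min_count` cutoff, and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def intersectional_rates(df, attributes, label_col="label", min_count=30):
    """Favorable-outcome rate for every combination of the given attributes."""
    table = df.groupby(attributes)[label_col].agg(rate="mean", n="count").reset_index()
    # Small cells give noisy rates, so mark them rather than trusting them
    table["reliable"] = table["n"] >= min_count
    return table.sort_values("rate")

rng = np.random.default_rng(7)
n = 4000
toy = pd.DataFrame({
    "gender": rng.choice(["Male", "Female"], size=n),
    "race": rng.choice(["A", "B"], size=n),
})
# Only the (Female, B) subgroup gets a lower approval probability
p = np.where((toy["gender"] == "Female") & (toy["race"] == "B"), 0.35, 0.60)
toy["label"] = rng.binomial(1, p)

marginal = toy.groupby("gender")["label"].mean()        # single-attribute view
cells = intersectional_rates(toy, ["gender", "race"])   # intersectional view
print(marginal.round(3))
print(cells)
```

The single-attribute gap understates the disadvantage of the affected subgroup, which only the intersectional table shows at full size.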

### Next Steps:

1. **Integration**: Integrate this framework into your CI/CD pipeline using the provided Jenkins configuration
2. **Monitoring**: Set up continuous fairness monitoring using the monitoring components
3. **Alerting**: Configure alerts for fairness degradation using the built-in notification system
4. **Compliance**: Use the comprehensive reporting features for regulatory compliance
5. **Team Training**: Train your ML and DevOps teams on ethical AI best practices
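
As a rough illustration of the monitoring and alerting steps, the sketch below tracks the approval-rate gap between groups over a sliding window of recent predictions. `ParityMonitor` is a hypothetical class written for this example and does not represent the framework's monitoring components.

```python
from collections import deque

class ParityMonitor:
    """Track the approval-rate gap between groups over a sliding window."""

    def __init__(self, window_size, threshold):
        self.window = deque(maxlen=window_size)  # holds (group, prediction) pairs
        self.threshold = threshold

    def update(self, group, prediction):
        self.window.append((group, prediction))

    def gap(self):
        totals, favorable = {}, {}
        for group, prediction in self.window:
            totals[group] = totals.get(group, 0) + 1
            favorable[group] = favorable.get(group, 0) + prediction
        rates = [favorable[g] / totals[g] for g in totals]
        # With fewer than two groups in the window there is no gap to report
        return max(rates) - min(rates) if len(rates) >= 2 else 0.0

    def alert(self):
        return self.gap() > self.threshold

monitor = ParityMonitor(window_size=100, threshold=0.1)

# Phase 1: both groups are approved at the same rate
for i in range(100):
    monitor.update("Male" if i % 2 == 0 else "Female", 1 if i % 4 < 2 else 0)
alert_before = monitor.alert()

# Phase 2: behavior drifts and only one group is approved
for i in range(100):
    is_male = i % 2 == 0
    monitor.update("Male" if is_male else "Female", 1 if is_male else 0)
alert_after = monitor.alert()

print(f"alert before drift: {alert_before}, after drift: {alert_after}")
```

The window size trades responsiveness against noise: a short window reacts quickly to drift but raises more false alarms on small samples.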

The DevEthOps Framework provides the tools and infrastructure needed to build, deploy, and monitor ethical AI systems at scale.