# Generative AI Output Evaluation Pipeline
## Overview
This document provides both the documentation and a complete implementation of a pipeline for evaluating generative AI outputs across multiple dimensions, including bias, fairness, and quality. The pipeline is designed for educational purposes: it helps students understand how to systematically assess AI-generated content.


## Installation
### Prerequisites
Ensure you have Python 3.8+ installed. The project dependencies are listed in `requirements.txt`:
```bash
pip install -r requirements.txt
```

Next, download the spaCy English model (`en_core_web_sm`):
```bash
python -m spacy download en_core_web_sm
```

In [None]:
!pip install -r requirements.txt
!pip install textstat

In [None]:
import nltk
nltk.download(['punkt', 'averaged_perceptron_tagger', 'stopwords', 'wordnet'])

# Code Implementation

In [None]:
import numpy as np
from typing import List, Dict

class FairnessEvaluator:
    def evaluate_demographic_parity(self, 
                                  predictions: List[bool], 
                                  demographics: List[str]) -> Dict[str, object]:
        """
        Measures whether predictions are independent of demographic group.
        A difference of 0 indicates perfect demographic parity.
        
        Args:
            predictions: List of model predictions (True/False)
            demographics: List of demographic group labels
        """
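        # zip() would silently truncate mismatched inputs, so check lengths first
        if len(predictions) != len(demographics):
            raise ValueError("predictions and demographics must have the same length")
        if not predictions:
            raise ValueError("predictions must not be empty")

        # Calculate prediction rate for each group
        group_rates = {}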
        for group in set(demographics):
            group_preds = [p for p, d in zip(predictions, demographics) if d == group]
            group_rates[group] = sum(group_preds) / len(group_preds)
            
        # Calculate maximum difference between groups
        max_diff = max(group_rates.values()) - min(group_rates.values())
        
        return {
            'group_rates': group_rates,
            'demographic_parity_difference': max_diff
        }
    
    def evaluate_equalized_odds(self,
                              predictions: List[bool],
                              demographics: List[str],
                              ground_truth: List[bool]) -> Dict[str, object]:
        """
        Measures whether predictions have equal true/false positive rates across groups.
        A difference of 0 indicates perfect equalized odds.
        
        Args:
            predictions: List of model predictions (True/False)
            demographics: List of demographic group labels
            ground_truth: List of actual outcomes (True/False)
        """
        group_metrics = {}
        
        for group in set(demographics):
            # Get indices for this group
            group_idx = [i for i, d in enumerate(demographics) if d == group]
            
            # Calculate TPR and FPR for this group
            true_pos = sum(1 for i in group_idx 
                         if predictions[i] and ground_truth[i])
            false_pos = sum(1 for i in group_idx 
                          if predictions[i] and not ground_truth[i])
            actual_pos = sum(1 for i in group_idx if ground_truth[i])
            actual_neg = sum(1 for i in group_idx if not ground_truth[i])
            
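            # A rate is undefined when a group has no actual positives (or
            # negatives). It is reported as 0 here, which can distort the
            # between-group differences for small groups.
            tpr = true_pos / actual_pos if actual_pos > 0 else 0
            fpr = false_pos / actual_neg if actual_neg > 0 else 0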
            
            group_metrics[group] = {'TPR': tpr, 'FPR': fpr}
        
        # Calculate differences between groups
        tpr_diff = max(m['TPR'] for m in group_metrics.values()) - \
                  min(m['TPR'] for m in group_metrics.values())
        fpr_diff = max(m['FPR'] for m in group_metrics.values()) - \
                  min(m['FPR'] for m in group_metrics.values())
        
        return {
            'group_metrics': group_metrics,
            'TPR_difference': tpr_diff,
            'FPR_difference': fpr_diff,
            # Summary score: mean of the TPR and FPR gaps (0 only when both are 0)
            'equalized_odds_difference': (tpr_diff + fpr_diff) / 2
        }
    
    def evaluate_equal_opportunity(self,
                                 predictions: List[bool],
                                 demographics: List[str],
                                 ground_truth: List[bool]) -> Dict[str, object]:
        """
        Measures whether true positive rates are equal across groups.
        A difference of 0 indicates perfect equal opportunity.
        
        Args:
            predictions: List of model predictions (True/False)
            demographics: List of demographic group labels
            ground_truth: List of actual outcomes (True/False)
        """
        group_tpr = {}
        
        for group in set(demographics):
            # Get indices for this group
            group_idx = [i for i, d in enumerate(demographics) if d == group]
            
            # Calculate TPR for this group
            true_pos = sum(1 for i in group_idx 
                         if predictions[i] and ground_truth[i])
            actual_pos = sum(1 for i in group_idx if ground_truth[i])
            
            # TPR is undefined for a group with no actual positives; report 0
            tpr = true_pos / actual_pos if actual_pos > 0 else 0
            group_tpr[group] = tpr
        
        # Calculate maximum difference in TPR between groups
        tpr_diff = max(group_tpr.values()) - min(group_tpr.values())
        
        return {
            'group_TPR': group_tpr,
            'equal_opportunity_difference': tpr_diff
        }

# Usage
Here's how to use the evaluation pipeline:

## Demographic Parity - Measures whether the rate of positive predictions is equal across demographic groups

In [None]:
evaluator = FairnessEvaluator()

# Example data from a resume screening system
predictions = [
    True,  # Selected
    False, # Not selected
    True,
    True,
    False,
    True
]

demographics = [
    "group_A",
    "group_A",
    "group_A",
    "group_B",
    "group_B",
    "group_B"
]

# Evaluate demographic parity
parity_results = evaluator.evaluate_demographic_parity(predictions, demographics)
print("\nDemographic Parity Results:")
print(f"Selection rates by group: {parity_results['group_rates']}")
print(f"Maximum difference between groups: {parity_results['demographic_parity_difference']}")

## Equalized Odds - Checks whether true positive and false positive rates are equal across groups


In [None]:
# Example data from a loan approval system
predictions = [True, True, False, True, False, False]  # Approved/Not approved
demographics = ["group_A", "group_A", "group_A", "group_B", "group_B", "group_B"]
ground_truth = [True, True, False, True, True, False]  # Actually creditworthy

# Evaluate equalized odds
odds_results = evaluator.evaluate_equalized_odds(predictions, demographics, ground_truth)
print("\nEqualized Odds Results:")
print("Performance metrics by group:", odds_results['group_metrics'])
print(f"TPR difference between groups: {odds_results['TPR_difference']}")
print(f"FPR difference between groups: {odds_results['FPR_difference']}")

## Equal Opportunity - Checks whether true positive rates are equal across groups

In [None]:
# Example data from job candidate assessment
predictions = [True, False, True, False, True, False]  # Selected/Not selected
demographics = ["group_A", "group_A", "group_A", "group_B", "group_B", "group_B"]
ground_truth = [True, False, True, True, True, False]  # Actually qualified

# Evaluate equal opportunity
opportunity_results = evaluator.evaluate_equal_opportunity(
    predictions, demographics, ground_truth
)
print("\nEqual Opportunity Results:")
print(f"True Positive Rates by group: {opportunity_results['group_TPR']}")
print(f"Difference in opportunity: {opportunity_results['equal_opportunity_difference']}")

# Comprehensive Evaluation with All Metrics

In [None]:
def evaluate_all_metrics(predictions, demographics, ground_truth):
    """Evaluate all fairness metrics at once."""
    evaluator = FairnessEvaluator()
    
    results = {
        'demographic_parity': evaluator.evaluate_demographic_parity(
            predictions, demographics
        ),
        'equalized_odds': evaluator.evaluate_equalized_odds(
            predictions, demographics, ground_truth
        ),
        'equal_opportunity': evaluator.evaluate_equal_opportunity(
            predictions, demographics, ground_truth
        )
    }
    
    return results

# Example usage with medical diagnosis system data
predictions = [True, True, False, True, False, True]  # Diagnosed condition
demographics = ["group_A", "group_A", "group_A", "group_B", "group_B", "group_B"]
ground_truth = [True, True, False, True, False, False]  # Actual condition

results = evaluate_all_metrics(predictions, demographics, ground_truth)

print("\nComprehensive Fairness Evaluation:")
print("\nDemographic Parity:")
print(f"Group Rates: {results['demographic_parity']['group_rates']}")
print(f"Difference: {results['demographic_parity']['demographic_parity_difference']}")

print("\nEqualized Odds:")
print(f"Group Metrics: {results['equalized_odds']['group_metrics']}")
print(f"Overall Difference: {results['equalized_odds']['equalized_odds_difference']}")

print("\nEqual Opportunity:")
print(f"Group TPR: {results['equal_opportunity']['group_TPR']}")
print(f"Difference: {results['equal_opportunity']['equal_opportunity_difference']}")

## Understanding the Metrics

### Demographic Parity

- Measures if predictions are independent of demographic group
- Perfect parity = 0.0 difference between groups
- Higher differences indicate potential bias
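
As a sketch of the arithmetic, write $\hat{Y}$ for the prediction and $G$ for the demographic group (symbols introduced here for illustration; they do not appear in the code). The value returned as `demographic_parity_difference` is then the gap between the highest and lowest selection rate:

$$
\Delta_{\mathrm{DP}} = \max_{g} P(\hat{Y} = 1 \mid G = g) - \min_{g} P(\hat{Y} = 1 \mid G = g)
$$

In the resume screening example above, each group has two positive predictions out of three, so $\Delta_{\mathrm{DP}} = \tfrac{2}{3} - \tfrac{2}{3} = 0$.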


### Equalized Odds

- Measures if both the true positive rate (TPR) and the false positive rate (FPR) are equal across groups
- Perfect equality = 0.0 difference in both rates
- Helps identify if the model's error rates vary by group
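
The quantities computed by `evaluate_equalized_odds` can be sketched in symbols as follows, with $\hat{Y}$ the prediction, $Y$ the ground truth, and $G$ the group (notation assumed here for illustration):

$$
\mathrm{TPR}_g = P(\hat{Y} = 1 \mid Y = 1, G = g), \qquad
\mathrm{FPR}_g = P(\hat{Y} = 1 \mid Y = 0, G = g)
$$

$$
\Delta_{\mathrm{TPR}} = \max_g \mathrm{TPR}_g - \min_g \mathrm{TPR}_g, \qquad
\Delta_{\mathrm{FPR}} = \max_g \mathrm{FPR}_g - \min_g \mathrm{FPR}_g, \qquad
\Delta_{\mathrm{EO}} = \frac{\Delta_{\mathrm{TPR}} + \Delta_{\mathrm{FPR}}}{2}
$$

For the loan approval example above, $\mathrm{TPR}_A = 2/2 = 1$ and $\mathrm{TPR}_B = 1/2 = 0.5$, while both groups have $\mathrm{FPR} = 0/1 = 0$. This gives $\Delta_{\mathrm{TPR}} = 0.5$, $\Delta_{\mathrm{FPR}} = 0$, and $\Delta_{\mathrm{EO}} = 0.25$.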


### Equal Opportunity

- Measures if TPR is equal across groups
- Perfect equality = 0.0 difference in TPR
- Focuses on fair treatment of qualified candidates
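
As a worked instance, the job candidate example above can be checked by hand. Only the qualified candidates (ground truth `True`) enter the calculation:

$$
\mathrm{TPR}_A = \frac{2}{2} = 1.0, \qquad
\mathrm{TPR}_B = \frac{1}{2} = 0.5, \qquad
\Delta_{\mathrm{TPR}} = 1.0 - 0.5 = 0.5
$$

The unqualified candidate in each group does not affect the result, which is why this metric can report equality even when false positive rates differ.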



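**Note**: These metrics often trade off against each other. For example, when the share of actual positives differs between groups, a classifier whose predictions depend on the true outcome cannot satisfy demographic parity and equalized odds at the same time. The appropriate balance depends on your specific use case and fairness goals.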
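
The trade-off can be illustrated with a small, self-contained sketch. The data below is made up, and the helper names `selection_rates` and `true_positive_rates` are hypothetical (they are not part of `FairnessEvaluator`). The sketch assumes a perfect predictor on two groups with different shares of actual positives:

```python
from typing import Dict, List


def selection_rates(predictions: List[bool], groups: List[str]) -> Dict[str, float]:
    """Share of positive predictions per group (hypothetical helper)."""
    rates = {}
    for g in sorted(set(groups)):
        preds = [p for p, d in zip(predictions, groups) if d == g]
        rates[g] = sum(preds) / len(preds)
    return rates


def true_positive_rates(predictions: List[bool],
                        groups: List[str],
                        ground_truth: List[bool]) -> Dict[str, float]:
    """TPR per group; assumes every group has at least one actual positive."""
    rates = {}
    for g in sorted(set(groups)):
        hits = [p for p, d, y in zip(predictions, groups, ground_truth)
                if d == g and y]
        rates[g] = sum(hits) / len(hits)
    return rates


# Made-up data: 3 of 4 actual positives in group_A, 1 of 4 in group_B.
demo_groups = ["group_A"] * 4 + ["group_B"] * 4
demo_truth = [True, True, True, False, True, False, False, False]

# A perfect predictor simply reproduces the ground truth.
demo_preds = list(demo_truth)

rates = selection_rates(demo_preds, demo_groups)
tprs = true_positive_rates(demo_preds, demo_groups, demo_truth)

dp_gap = max(rates.values()) - min(rates.values())
tpr_gap = max(tprs.values()) - min(tprs.values())

print(f"Selection rates: {rates}")
print(f"Demographic parity gap: {dp_gap}")
print(f"TPR gap: {tpr_gap}")
```

Here the TPR gap is 0.0 (and no false positives occur in either group), yet the demographic parity gap is 0.5, because the selection rates simply mirror the different base rates.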

# Next Steps to Consider

## Data Preparation

- Clean and normalize input text
- Handle edge cases (empty strings, very long texts)
- Validate demographic information
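
These preparation steps might be sketched as below. Both function names and the `max_chars` limit are assumptions made for illustration, not part of the pipeline's existing code:

```python
from typing import List, Optional, Sequence


def normalize_text(text: str, max_chars: int = 2000) -> str:
    """Collapse runs of whitespace and truncate very long texts.

    max_chars is an arbitrary illustrative limit, not a value from the pipeline.
    """
    cleaned = " ".join(text.split())
    return cleaned[:max_chars]


def validate_inputs(predictions: Sequence[bool],
                    demographics: Sequence[Optional[str]],
                    ground_truth: Optional[Sequence[bool]] = None) -> List[str]:
    """Return a list of problems found; an empty list means the inputs look usable."""
    problems: List[str] = []
    if len(predictions) == 0:
        problems.append("no predictions supplied")
    if len(demographics) != len(predictions):
        problems.append("predictions and demographics differ in length")
    if ground_truth is not None and len(ground_truth) != len(predictions):
        problems.append("predictions and ground_truth differ in length")
    # Treat None and blank strings as missing group labels.
    missing = [i for i, d in enumerate(demographics)
               if d is None or not str(d).strip()]
    if missing:
        problems.append(f"missing demographic label at positions {missing}")
    return problems


print(normalize_text("  Selected   candidate\n profile "))
print(validate_inputs([True, False, True], ["group_A", "", None]))
```

Returning a list of problems rather than raising on the first one is a design choice for this sketch: it lets a student see every issue in a dataset at once.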


## Evaluation Strategy

- Use multiple metrics for each dimension
- Consider context-specific requirements
- Document assumptions and limitations
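
One way to document limitations is to report how many examples stand behind each group's rates, since a rate computed from two or three examples is fragile. The following is a minimal sketch; `group_support` is a hypothetical helper, and the data reuses the medical diagnosis example from earlier in this document:

```python
from typing import Dict, List


def group_support(demographics: List[str],
                  ground_truth: List[bool]) -> Dict[str, Dict[str, int]]:
    """Count examples, actual positives, and actual negatives per group."""
    support = {}
    for group in sorted(set(demographics)):
        labels = [y for d, y in zip(demographics, ground_truth) if d == group]
        positives = sum(labels)
        support[group] = {
            "n": len(labels),
            "positives": positives,
            "negatives": len(labels) - positives,
        }
    return support


support_demographics = ["group_A", "group_A", "group_A",
                        "group_B", "group_B", "group_B"]
support_truth = [True, True, False, True, False, False]

support = group_support(support_demographics, support_truth)
for group, counts in support.items():
    print(group, counts)
```

A group with zero positives or zero negatives is worth flagging explicitly, because its TPR or FPR is then undefined.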


## Result Interpretation

- Consider relative scores rather than absolute values
- Look for patterns across multiple outputs
- Account for domain-specific considerations
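
Looking for patterns across multiple outputs could be approached by computing the same metric on several batches and comparing the results, rather than reading a single number in isolation. The sketch below assumes made-up batches and uses hypothetical helper names (`parity_difference`, `summarize_batches`):

```python
from statistics import mean
from typing import Dict, List, Tuple


def parity_difference(predictions: List[bool], demographics: List[str]) -> float:
    """Gap between the highest and lowest positive prediction rate across groups."""
    rates = []
    for group in set(demographics):
        preds = [p for p, d in zip(predictions, demographics) if d == group]
        rates.append(sum(preds) / len(preds))
    return max(rates) - min(rates)


def summarize_batches(batches: List[Tuple[List[bool], List[str]]]) -> Dict[str, float]:
    """Mean, minimum, and maximum parity gap over a list of batches."""
    gaps = [parity_difference(preds, demos) for preds, demos in batches]
    return {"mean_gap": mean(gaps), "min_gap": min(gaps), "max_gap": max(gaps)}


# Made-up batches sharing the same group labels.
batch_labels = ["group_A", "group_A", "group_B", "group_B"]
batches = [
    ([True, False, True, True], batch_labels),
    ([True, True, False, False], batch_labels),
    ([True, False, True, False], batch_labels),
]

summary = summarize_batches(batches)
print(summary)
```

With batches this small the gap swings between 0.0 and 1.0, which is a reminder that a single batch says little on its own.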