# Generative AI Output Evaluation Pipeline
## Overview
This document combines documentation with a complete implementation of a pipeline for evaluating Generative AI outputs along three dimensions: bias, fairness, and quality. The pipeline is designed for educational use, helping students learn how to assess AI-generated content systematically.


## Installation
### Prerequisites
Ensure you have Python 3.8+ installed. The project dependencies are listed in requirements.txt:
```bash 
pip install -r requirements.txt
```

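Next, download the spaCy English model that the pipeline loads, along with the NLTK tokenizer data required by `sent_tokenize` (recent NLTK releases may ask for `punkt_tab` instead of `punkt`):
```bash
python -m spacy download en_core_web_sm
python -m nltk.downloader punkt
```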

# Code Implementation

In [None]:
"""
Generative AI Evaluation Pipeline

This module implements a comprehensive evaluation pipeline for assessing Generative AI outputs
across multiple dimensions including bias, fairness, and quality.
"""

import pandas as pd
import numpy as np
from typing import List, Dict, Any, Tuple
from collections import defaultdict
import spacy
import textstat
from transformers import pipeline
from nltk.tokenize import sent_tokenize
import matplotlib.pyplot as plt
import seaborn as sns

class AIOutputEvaluator:
    """
    A comprehensive pipeline for evaluating Generative AI outputs across multiple dimensions.
    
    This class implements various evaluation metrics and provides detailed feedback suitable
    for educational purposes. Students can learn about different aspects of AI evaluation
    while getting hands-on experience with industry-standard tools.
    """
    
    def __init__(self):
        """
        Initialize the evaluator with necessary models and resources.
        """
        # Load spaCy model for text processing
        self.nlp = spacy.load("en_core_web_sm")
        
        # Initialize sentiment analyzer
        self.sentiment_analyzer = pipeline("sentiment-analysis")
        
        # Define protected group terms for fairness evaluation
        self.protected_groups = {
            "gender": ["he", "she", "they"],
            "ethnicity": ["Asian", "Black", "Hispanic", "White"],
            "age": ["young", "elderly", "middle-aged"]
        }
        
    def evaluate_bias(self, text: str) -> Dict[str, float]:
        """
        Evaluate potential biases in the generated text.
        
        Parameters:
            text (str): The AI-generated text to evaluate
            
        Returns:
            Dict[str, float]: Dictionary containing bias metrics
            
        Student Note:
            This function helps identify potential biases by analyzing:
            1. Representation bias: How different groups are represented
            2. Language bias: Use of stereotypical or prejudiced language
            3. Association bias: Implicit associations between concepts
        """
        doc = self.nlp(text)
        bias_metrics = {}
        
        # Count mentions of each term, grouped by protected category
        group_mentions = defaultdict(lambda: defaultdict(int))
        for token in doc:
            token_text = token.text.lower()
            for category, terms in self.protected_groups.items():
                for term in terms:
                    if token_text == term.lower():
                        group_mentions[category][term] += 1
        
        # Representation disparity: most-mentioned vs. least-mentioned term within
        # each category that appears in the text (absent terms count as zero)
        for category, term_counts in group_mentions.items():
            counts = [term_counts.get(term, 0) for term in self.protected_groups[category]]
            bias_metrics[f"{category}_disparity"] = max(counts) / (min(counts) + 1)
        
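        # Analyze sentiment associations
        for category, terms in self.protected_groups.items():
            category_sentiments = []
            for term in terms:
                # Find sentences containing the term as a whole word, so that
                # "he" does not match inside "the"
                relevant_sents = [
                    sent for sent in sent_tokenize(text)
                    if term.lower() in {w.strip(".,;:!?\"'()") for w in sent.lower().split()}
                ]
                if relevant_sents:
                    # The pipeline returns a label plus a confidence score; sign the
                    # score so positive and negative sentiment are distinguishable
                    # (assumes the default model's "POSITIVE"/"NEGATIVE" labels)
                    for result in self.sentiment_analyzer(relevant_sents):
                        sign = 1.0 if result["label"] == "POSITIVE" else -1.0
                        category_sentiments.append(sign * result["score"])
            
            if category_sentiments:
                bias_metrics[f"{category}_sentiment_std"] = float(np.std(category_sentiments))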
        
        return bias_metrics
    
    def evaluate_fairness(self, outputs: List[str], 
                         demographic_inputs: List[str],
                         ground_truth: List[bool] = None) -> Dict[str, float]:
        """
        Evaluate fairness of the AI system across different demographic groups.
        
        Parameters:
            outputs (List[str]): List of AI-generated outputs
            demographic_inputs (List[str]): Corresponding demographic information
            ground_truth (List[bool], optional): True labels for equalized odds calculation
            
        Returns:
            Dict[str, float]: Dictionary containing fairness metrics
            
        Student Note:
            This function implements three key fairness metrics:
            1. Demographic Parity: Ensures the model makes positive predictions at equal 
               rates across different demographic groups
            2. Equalized Odds Parity: Ensures equal true positive and false positive 
               rates across groups
            3. Equal Opportunity Parity: Ensures equal true positive rates across groups
        """
        fairness_metrics = {}
        
        # Calculate quality scores and convert to binary decisions
        group_decisions = defaultdict(list)
        group_scores = defaultdict(list)
        threshold = 0.5  # Quality threshold for binary decision
        
        for output, demo in zip(outputs, demographic_inputs):
            quality_score = self._calculate_quality_score(output)
            binary_decision = quality_score >= threshold
            group_decisions[demo].append(binary_decision)
            group_scores[demo].append(quality_score)
        
        # Calculate Demographic Parity
        group_acceptance_rates = {}
        for group, decisions in group_decisions.items():
            acceptance_rate = np.mean(decisions)
            group_acceptance_rates[group] = acceptance_rate
            fairness_metrics[f"{group}_acceptance_rate"] = acceptance_rate
        
        if len(group_acceptance_rates) > 1:
            fairness_metrics["demographic_parity_diff"] = (
                max(group_acceptance_rates.values()) - min(group_acceptance_rates.values())
            )
        
        # Calculate Equalized Odds and Equal Opportunity if ground truth is provided
        if ground_truth is not None:
            group_metrics = defaultdict(lambda: {"TP": 0, "FP": 0, "TN": 0, "FN": 0})
            
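            # Walk each group's decisions in input order so every prediction stays
            # aligned with its own demographic label and ground-truth value
            decision_iters = {g: iter(d) for g, d in group_decisions.items()}
            for demo, true in zip(demographic_inputs, ground_truth):
                pred = next(decision_iters[demo])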
                if true:
                    if pred:
                        group_metrics[demo]["TP"] += 1
                    else:
                        group_metrics[demo]["FN"] += 1
                else:
                    if pred:
                        group_metrics[demo]["FP"] += 1
                    else:
                        group_metrics[demo]["TN"] += 1
            
            # Calculate TPR and FPR for each group
            for group, metrics in group_metrics.items():
                # True Positive Rate (TPR)
                tpr = (metrics["TP"] / (metrics["TP"] + metrics["FN"]) 
                      if metrics["TP"] + metrics["FN"] > 0 else 0)
                # False Positive Rate (FPR)
                fpr = (metrics["FP"] / (metrics["FP"] + metrics["TN"]) 
                      if metrics["FP"] + metrics["TN"] > 0 else 0)
                
                fairness_metrics[f"{group}_TPR"] = tpr
                fairness_metrics[f"{group}_FPR"] = fpr
            
            # Calculate Equalized Odds difference
            if len(group_metrics) > 1:
                tpr_diff = (max(fairness_metrics[f"{g}_TPR"] 
                              for g in group_metrics.keys()) -
                          min(fairness_metrics[f"{g}_TPR"] 
                              for g in group_metrics.keys()))
                fpr_diff = (max(fairness_metrics[f"{g}_FPR"] 
                              for g in group_metrics.keys()) -
                          min(fairness_metrics[f"{g}_FPR"] 
                              for g in group_metrics.keys()))
                
                fairness_metrics["equalized_odds_diff"] = (tpr_diff + fpr_diff) / 2
                fairness_metrics["equal_opportunity_diff"] = tpr_diff
        
        # Calculate general fairness metrics
        for group, scores in group_scores.items():
            fairness_metrics[f"{group}_mean"] = np.mean(scores)
            fairness_metrics[f"{group}_std"] = np.std(scores)
        
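        # Calculate overall disparity (ratio of highest to lowest group mean);
        # skipped when the lowest mean is not positive, where a ratio is meaningless
        if len(group_scores) > 1:
            group_means = [np.mean(scores) for scores in group_scores.values()]
            if min(group_means) > 0:
                fairness_metrics["max_disparity"] = max(group_means) / min(group_means)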
        
        return fairness_metrics
    
    def evaluate_quality(self, text: str) -> Dict[str, float]:
        """
        Evaluate the general quality of the generated text.
        
        Parameters:
            text (str): The AI-generated text to evaluate
            
        Returns:
            Dict[str, float]: Dictionary containing quality metrics
            
        Student Note:
            Quality evaluation examines multiple aspects of the text:
            1. Readability: How easy is it to understand?
            2. Coherence: Does it make logical sense?
            3. Fluency: Is it well-written and natural?
            4. Relevance: Does it address the intended topic?
        """
        quality_metrics = {}
        
        # Readability metrics
        quality_metrics["flesch_reading_ease"] = textstat.flesch_reading_ease(text)
        quality_metrics["gunning_fog"] = textstat.gunning_fog(text)
        
        # Coherence metrics: mean similarity between adjacent sentences.
        # en_core_web_sm ships without static word vectors, so similarity() is only
        # a rough proxy here; a vector model such as en_core_web_md is more reliable.
        doc = self.nlp(text)
        sentences = list(doc.sents)
        if len(sentences) > 1:
            coherence_scores = [sentences[i].similarity(sentences[i + 1])
                                for i in range(len(sentences) - 1)]
            quality_metrics["coherence"] = float(np.mean(coherence_scores))
        
        # Fluency metrics (skipped for empty input to avoid division by zero)
        words = text.split()
        if words and sentences:
            quality_metrics["avg_sentence_length"] = len(words) / len(sentences)
            quality_metrics["vocabulary_diversity"] = len(set(words)) / len(words)
        
        return quality_metrics
    
    def _calculate_quality_score(self, text: str) -> float:
        """
        Calculate an overall quality score for fairness comparison.
        
        Parameters:
            text (str): The text to evaluate
            
        Returns:
            float: Combined quality score
        """
        quality_metrics = self.evaluate_quality(text)
        # Normalize and combine metrics
        score = (quality_metrics.get("flesch_reading_ease", 0) / 100 +
                (1 - quality_metrics.get("gunning_fog", 0) / 20) +
                quality_metrics.get("coherence", 0) +
                quality_metrics.get("vocabulary_diversity", 0)) / 4
        return score
    
    def generate_report(self, text: str, demographic_info: str = None) -> Dict[str, Any]:
        """
        Generate a comprehensive evaluation report.
        
        Parameters:
            text (str): The AI-generated text to evaluate
            demographic_info (str, optional): Demographic context if available
            
        Returns:
            Dict[str, Any]: Complete evaluation report with all metrics
            
        Student Note:
            This function combines all evaluation aspects into a single report.
            Use this to:
            1. Get a complete picture of the AI output's performance
            2. Identify areas needing improvement
            3. Compare performance across different contexts
        """
        report = {
            "bias_metrics": self.evaluate_bias(text),
            "quality_metrics": self.evaluate_quality(text)
        }
        
        if demographic_info:
            report["fairness_metrics"] = self.evaluate_fairness([text], [demographic_info])
        
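        # Add summary statistics. These are unweighted means over metrics on
        # different scales, so treat them as rough indicators only.
        bias_values = list(report["bias_metrics"].values())
        report["summary"] = {
            "overall_quality": float(np.mean(list(report["quality_metrics"].values()))),
            "bias_level": float(np.mean(bias_values)) if bias_values else 0.0,
            "timestamp": pd.Timestamp.now()
        }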
        
        return report
    
    def visualize_results(self, report: Dict[str, Any]) -> None:
        """
        Create visualizations of evaluation results.
        
        Parameters:
            report (Dict[str, Any]): Evaluation report to visualize
            
        Student Note:
            Visualization helps in:
            1. Understanding patterns in the data
            2. Identifying potential issues quickly
            3. Communicating results effectively
        """
        # Create a figure with multiple subplots
        fig, axes = plt.subplots(2, 2, figsize=(15, 12))
        
        # Plot bias metrics
        bias_data = pd.Series(report["bias_metrics"])
        sns.barplot(x=bias_data.index, y=bias_data.values, ax=axes[0, 0])
        axes[0, 0].set_title("Bias Metrics")
        axes[0, 0].tick_params(axis='x', rotation=45)
        
        # Plot quality metrics
        quality_data = pd.Series(report["quality_metrics"])
        sns.barplot(x=quality_data.index, y=quality_data.values, ax=axes[0, 1])
        axes[0, 1].set_title("Quality Metrics")
        axes[0, 1].tick_params(axis='x', rotation=45)
        
        # Plot fairness metrics if available
        if "fairness_metrics" in report:
            fairness_data = pd.Series(report["fairness_metrics"])
            sns.barplot(x=fairness_data.index, y=fairness_data.values, ax=axes[1, 0])
            axes[1, 0].set_title("Fairness Metrics")
            axes[1, 0].tick_params(axis='x', rotation=45)
        
        # Plot summary metrics (drop the timestamp and cast to float, because the
        # mixed-type dict otherwise yields an object-dtype Series)
        summary_data = pd.Series(report["summary"]).drop("timestamp").astype(float)
        sns.barplot(x=summary_data.index, y=summary_data.values, ax=axes[1, 1])
        axes[1, 1].set_title("Summary Metrics")
        axes[1, 1].tick_params(axis='x', rotation=45)
        
        plt.tight_layout()
        plt.show()
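
The fairness methods turn each output into a single number via `_calculate_quality_score`, then compare it with a threshold of 0.5. The arithmetic can be tried without loading any models; the sketch below is a standalone restatement of that formula, and the function name `combined_quality_score` and the input values are made up for illustration.

```python
def combined_quality_score(flesch_reading_ease, gunning_fog, coherence, vocabulary_diversity):
    """Average four terms, each roughly scaled to the 0-1 range."""
    readability = flesch_reading_ease / 100   # higher Flesch score = easier to read
    simplicity = 1 - gunning_fog / 20         # lower fog index = simpler text
    return (readability + simplicity + coherence + vocabulary_diversity) / 4

# Hypothetical metric values for two outputs
clear_text = combined_quality_score(60.0, 10.0, 0.5, 0.8)
dense_text = combined_quality_score(30.0, 24.0, 0.2, 0.5)

threshold = 0.5  # same cut-off that evaluate_fairness uses
print(round(clear_text, 3), clear_text >= threshold)
print(round(dense_text, 3), dense_text >= threshold)
```

Note that the terms are not clamped: a Gunning fog index above 20 or a negative Flesch score pushes a term below zero, so the combined score is not guaranteed to stay within 0 and 1.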

# Usage
Here's how to use the evaluation pipeline:

In [None]:
# Initialize evaluator
evaluator = AIOutputEvaluator()

# Example text to evaluate
sample_text = """
The young software engineer presented her innovative solution to the team.
Her colleagues, both young and experienced, were impressed by the elegant design.
The project manager decided to implement the solution across all departments.
"""

# Generate comprehensive report
report = evaluator.generate_report(sample_text)

# Visualize results
evaluator.visualize_results(report)

## Demographic Parity
Measures whether the model produces positive predictions at the same rate for every demographic group.

In [None]:
# Example: Resume Summary Generation
texts = [
    "Experienced software engineer with 10 years at tech companies, led multiple teams.",
    "Recent computer science graduate with internship experience in software development.",
    "Mid-level developer with 5 years experience in full-stack development.",
    "Senior engineer with extensive backend experience and team leadership.",
]
demographics = ["older", "younger", "younger", "older"]

# Evaluate if the model rates resumes fairly regardless of age
report = evaluator.evaluate_fairness(texts, demographics)
print(f"Demographic Parity Difference: {report['demographic_parity_diff']}")
# A high difference would indicate the model systematically rates one age group's 
# resumes higher than the other's, regardless of content

## Equalized Odds Parity
Checks whether true positive and false positive rates are equal across groups.


In [None]:
# Example: Medical Diagnosis Summary Generation
texts = [
    "Patient presents with severe symptoms requiring immediate intervention.",
    "Minor symptoms observed, recommend routine follow-up.",
    "Moderate symptoms indicate need for additional testing.",
    "Critical symptoms detected, emergency care recommended."
]
demographics = ["group_a", "group_b", "group_a", "group_b"]
# Ground truth from actual medical outcomes
ground_truth = [True, False, True, True]  # True = needed intervention

report = evaluator.evaluate_fairness(texts, demographics, ground_truth)
print(f"Equalized Odds Difference: {report['equalized_odds_diff']}")
# A high difference would indicate the model is better at identifying 
# medical needs for one demographic group over another

## Equal Opportunity Parity
Checks whether true positive rates are equal across groups.

In [None]:
# Example: Job Application Response Generation
texts = [
    "Candidate meets all required qualifications with relevant experience.",
    "Application shows strong potential but lacks specific requirements.",
    "Highly qualified candidate with extensive relevant background.",
    "Promising candidate with transferable skills from different field."
]
demographics = ["majority", "minority", "minority", "majority"]
# Ground truth from actual hiring outcomes
ground_truth = [True, False, True, False]  # True = was qualified

report = evaluator.evaluate_fairness(texts, demographics, ground_truth)
print(f"Equal Opportunity Difference: {report['equal_opportunity_diff']}")
# A high difference would indicate the model is better at identifying
# qualified candidates from one demographic group over another

# Understanding the Results

## Demographic Parity Difference

- 0.0 = perfect parity (every group receives positive predictions at the same rate)
- Higher values indicate a larger gap in positive-prediction rates between groups
- Example: 0.2 means the highest and lowest group rates differ by 20 percentage points
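
As a minimal sketch of how this number arises, the snippet below computes per-group positive rates and their spread from a list of binary decisions. The helper name `demographic_parity_diff` and the decisions are hypothetical; the calculation follows the max-minus-min definition used in `evaluate_fairness`.

```python
from collections import defaultdict

def demographic_parity_diff(decisions, groups):
    """Return (per-group positive rates, highest rate minus lowest rate)."""
    by_group = defaultdict(list)
    for decision, group in zip(decisions, groups):
        by_group[group].append(bool(decision))
    rates = {g: sum(d) / len(d) for g, d in by_group.items()}
    return rates, max(rates.values()) - min(rates.values())

# Hypothetical decisions: 1 = output passed the quality threshold
decisions = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]
groups = ["older"] * 5 + ["younger"] * 5

rates, diff = demographic_parity_diff(decisions, groups)
print(rates)           # {'older': 0.6, 'younger': 0.4}
print(round(diff, 2))  # 0.2
```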


## Equalized Odds Difference

- 0.0 = perfect parity (same TPR and FPR across groups)
- Higher values indicate that the model's error rates vary by group
- Example: 0.15 means the TPR and FPR gaps between groups average 15 percentage points


## Equal Opportunity Difference

- 0.0 = perfect parity (same true positive rate across groups)
- Higher values indicate that truly positive cases are recognized less often for some groups
- Example: 0.25 means true positive rates differ by 25 percentage points
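
The two error-rate metrics can be worked through by hand on a small confusion table. The sketch below uses hypothetical predictions and labels; `group_rates` and `rate_gap` are illustrative helpers, not part of the class. One deliberate difference: a rate whose denominator is zero is reported as `None` here, whereas `evaluate_fairness` falls back to 0, which can hide the fact that a group had no positive or no negative cases.

```python
from collections import defaultdict

def group_rates(predictions, labels, groups):
    """Return {group: {"TPR": ..., "FPR": ...}}; None marks an undefined rate."""
    counts = defaultdict(lambda: {"TP": 0, "FP": 0, "TN": 0, "FN": 0})
    for pred, label, group in zip(predictions, labels, groups):
        # "T"/"F" = prediction correct or not, "P"/"N" = predicted class
        outcome = ("T" if pred == label else "F") + ("P" if pred else "N")
        counts[group][outcome] += 1
    rates = {}
    for group, c in counts.items():
        positives = c["TP"] + c["FN"]
        negatives = c["FP"] + c["TN"]
        rates[group] = {
            "TPR": c["TP"] / positives if positives else None,
            "FPR": c["FP"] / negatives if negatives else None,
        }
    return rates

def rate_gap(rates, name):
    """Largest minus smallest defined value of one rate across groups."""
    values = [r[name] for r in rates.values() if r[name] is not None]
    return max(values) - min(values) if len(values) > 1 else None

# Hypothetical data: group_b has one qualified case that the model misses
predictions = [True, True, False, False, True, False, False, False]
labels      = [True, True, False, False, True, True,  False, False]
groups      = ["group_a"] * 4 + ["group_b"] * 4

rates = group_rates(predictions, labels, groups)
tpr_gap = rate_gap(rates, "TPR")
fpr_gap = rate_gap(rates, "FPR")

equal_opportunity_diff = tpr_gap                # TPR gap only
equalized_odds_diff = (tpr_gap + fpr_gap) / 2   # mean of TPR and FPR gaps
print(equal_opportunity_diff, equalized_odds_diff)
```

Here the groups share the same FPR but not the same TPR, so the equal opportunity difference (0.5) is twice the equalized odds difference (0.25): averaging in the FPR gap dilutes a disparity that sits entirely in the true positive rate.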

# Next Steps to Consider

## Data Preparation

- Clean and normalize input text
- Handle edge cases (empty strings, very long texts)
- Validate demographic information
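
One possible shape for such a preparation step is sketched below. Everything in it is an assumption made for illustration: the helper name `prepare_inputs`, the 5000-character cap, and the policy of dropping empty texts while rejecting unknown demographic labels.

```python
import re
import unicodedata

def prepare_inputs(texts, demographics, allowed_groups, max_chars=5000):
    """Clean texts and validate labels before passing them to the evaluator."""
    if len(texts) != len(demographics):
        raise ValueError("texts and demographics must have the same length")
    cleaned_texts, cleaned_groups = [], []
    for text, group in zip(texts, demographics):
        # Normalize Unicode forms and collapse runs of whitespace
        text = re.sub(r"\s+", " ", unicodedata.normalize("NFKC", text)).strip()
        if not text:
            continue  # drop empty outputs together with their label
        if group not in allowed_groups:
            raise ValueError(f"unknown demographic label: {group!r}")
        # Hard character cap; may cut mid-word, which is acceptable for a sketch
        cleaned_texts.append(text[:max_chars])
        cleaned_groups.append(group)
    return cleaned_texts, cleaned_groups

texts, groups = prepare_inputs(
    ["  Senior engineer\n\nwith backend experience. ", "", "Recent graduate."],
    ["older", "younger", "younger"],
    allowed_groups={"older", "younger"},
)
print(texts)
print(groups)
```

Dropping a text and its label together keeps the two lists aligned, which matters because `evaluate_fairness` pairs them by position.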


## Evaluation Strategy

- Use multiple metrics for each dimension
- Consider context-specific requirements
- Document assumptions and limitations


## Result Interpretation

- Consider relative scores rather than absolute values
- Look for patterns across multiple outputs
- Account for domain-specific considerations
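
One simple way to read scores relatively is to standardize each one against the batch it came from. The sketch below does this with z-scores; the helper name `relative_scores` and the score values are hypothetical, and other baselines (for example, a reference set of human-written texts) may suit a given domain better.

```python
from statistics import mean, pstdev

def relative_scores(scores):
    """Express each score as standard deviations from the batch mean."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]  # all outputs scored the same
    return [(s - mu) / sigma for s in scores]

# Hypothetical combined quality scores for eight outputs
scores = [0.2, 0.4, 0.4, 0.4, 0.5, 0.5, 0.7, 0.9]
z = relative_scores(scores)
for score, rel in zip(scores, z):
    print(f"{score:.1f} -> {rel:+.2f}")
```

An output scoring 0.9 means little on its own; seeing that it sits two standard deviations above the batch mean says more about how unusual it is.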