# Lab 7: Structured Output Pipeline

## Learning Objectives
- Build reliable JSON schema-based output pipelines
- Implement robust error handling and retry logic
- Create validation layers for structured data
- Design production-ready data processing workflows

## Lab Overview
You'll build a complete structured output pipeline that extracts, validates, and processes data from unstructured text, using an AI model prompted to return JSON that matches a schema.

**Estimated Time:** 60 minutes

## Your Mission
Create a resume parser that extracts structured information from job applications with validation and error recovery.

In [None]:
# Setup and imports
!pip install asksageclient pip_system_certs tiktoken tenacity rich pydantic
from google.colab import drive
drive.mount('/content/drive')

import os
import json
import time
import tiktoken
from pathlib import Path
from typing import Dict, List, Any

# Import our AskSage client
from asksageclient import AskSageClient

# Get API credentials from Google Colab secrets
from google.colab import userdata
api_key = userdata.get('ASKSAGE_API_KEY')
email = userdata.get('ASKSAGE_EMAIL')

# Initialize client and tokenizer
client = AskSageClient(api_key=api_key, email=email)
tokenizer = tiktoken.encoding_for_model("gpt-4")
print("AskSage client initialized successfully")
print("Ready to showcase AI capabilities...")

In [None]:
import os, time, csv
from typing import Optional, Dict
import tiktoken

from google.colab import userdata

ASKSAGE_API_KEY = userdata.get("ASKSAGE_API_KEY")
ASKSAGE_BASE_URL = userdata.get("ASKSAGE_BASE_URL")
ASKSAGE_EMAIL = userdata.get("ASKSAGE_EMAIL")

assert ASKSAGE_API_KEY, "ASKSAGE_API_KEY not provided."
assert ASKSAGE_EMAIL, "ASKSAGE_EMAIL not provided."

print("✓ Secrets loaded")
print("  • EMAIL:", ASKSAGE_EMAIL)
print("  • BASE URL:", ASKSAGE_BASE_URL or "(default)")


## Task 1: Build Extraction Pipeline

**TODO:** Create the core extraction pipeline with retry logic and validation.
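
The extractor in this task refers to `ResumeData` and `ExtractionResult`, but the lab text does not show their definitions. Here is a minimal sketch of what they might look like, assuming Pydantic is installed. The field names are taken from the mock response in `extract_with_retry` and from the keyword arguments passed to `ExtractionResult`. The `ContactInfo` and `WorkExperience` class names, the `end_date`, `education`, and `skills` fields, and all defaults are assumptions, not the lab's official schema.

```python
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationError


class ContactInfo(BaseModel):
    name: str
    email: Optional[str] = None
    phone: Optional[str] = None
    location: Optional[str] = None
    linkedin: Optional[str] = None


class WorkExperience(BaseModel):
    company: str
    position: str
    start_date: Optional[str] = None  # e.g. "2021-03"
    end_date: Optional[str] = None    # assumed field; absent when current is True
    current: bool = False
    responsibilities: List[str] = Field(default_factory=list)


class ResumeData(BaseModel):
    contact_info: ContactInfo
    summary: Optional[str] = None
    work_experience: List[WorkExperience] = Field(default_factory=list)
    education: List[str] = Field(default_factory=list)  # assumed shape
    skills: List[str] = Field(default_factory=list)     # assumed shape


class ExtractionResult(BaseModel):
    success: bool
    data: Optional[ResumeData] = None
    errors: List[str] = Field(default_factory=list)
    confidence_score: float = 0.0
    processing_time: float = 0.0
    retry_count: int = 0
```

Run a cell like this before the `StructuredExtractor` cell, since the class uses these names in its type annotations. Making only `contact_info.name`, `company`, and `position` required is one possible choice: it lets sparse resumes validate while still rejecting empty output.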

In [None]:
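from pydantic import ValidationError
from rich.console import Console
from rich.table import Table
from tenacity import retry, stop_after_attempt, wait_exponential

# Rich console used for styled output throughout the lab
console = Console()

# NOTE: ResumeData and ExtractionResult (Pydantic models) must be defined before this cell runs.

class StructuredExtractor:
    """Extract structured data from unstructured text"""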
    
    def __init__(self):
        self.setup_client()
        self.extraction_stats = {
            'total_attempts': 0,
            'successful_extractions': 0,
            'validation_errors': 0,
            'parsing_errors': 0
        }
    
    def setup_client(self):
        """Set up the API client, falling back to mock responses when no key is set"""
        self.client = None  # stays None in mock mode
        self.has_api = os.getenv('OPENAI_API_KEY') is not None
        if self.has_api:
            import openai
            self.client = openai.OpenAI()
            console.print("✅ API client ready")
        else:
            console.print("💡 Using mock responses")
    
    def create_extraction_prompt(self, resume_text: str) -> str:
        """TODO: Create comprehensive extraction prompt"""
        
        # TODO: Design prompt that:
        # 1. Clearly explains the task
        # 2. Provides the JSON schema
        # 3. Gives examples of expected output
        # 4. Handles edge cases and missing data
        
        schema_example = ResumeData.model_json_schema()
        
        prompt = f"""TODO: Write comprehensive extraction prompt
        
        # Should include:
        # - Clear task description
        # - JSON schema specification
        # - Instructions for handling missing data
        # - Format requirements
        
        Resume text:
        {resume_text}
        
        Extract as JSON:
        """
        
        return prompt
    
    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=1, max=10),
        reraise=True  # surface the original exception instead of tenacity's RetryError
    )
    def extract_with_retry(self, resume_text: str) -> Dict[str, Any]:
        """TODO: Extract data with automatic retry on failures"""
        
        self.extraction_stats['total_attempts'] += 1
        
        prompt = self.create_extraction_prompt(resume_text)
        
        if self.has_api:
            # TODO: Make API call with structured output
            response = "{}"  # Placeholder
        else:
            # Mock response with realistic resume data
            response = '''{
              "contact_info": {
                "name": "Sarah Johnson",
                "email": "sarah.johnson@email.com",
                "phone": "(555) 123-4567",
                "location": "San Francisco, CA",
                "linkedin": "linkedin.com/in/sarahjohnson"
              },
              "summary": "Senior software engineer with 5 years experience",
              "work_experience": [
                {
                  "company": "Tech Corp",
                  "position": "Senior Software Engineer",
                  "start_date": "2021-03",
                  "current": true,
                  "responsibilities": ["Led team of 4 engineers", "Built microservices"]
                }
              ]
            }'''
        
        # TODO: Parse and validate JSON response
        try:
            data = json.loads(response)
            return data
        except json.JSONDecodeError as e:
            self.extraction_stats['parsing_errors'] += 1
            raise
    
    def validate_extraction(self, data: Dict[str, Any]) -> ResumeData:
        """TODO: Validate extracted data against schema"""
        try:
            # TODO: Use Pydantic to validate and clean data
            validated_data = ResumeData(**data)
            self.extraction_stats['successful_extractions'] += 1
            return validated_data
        except ValidationError as e:
            self.extraction_stats['validation_errors'] += 1
            console.print(f"[red]Validation error: {e}[/red]")
            raise
    
    def extract_resume_data(self, resume_text: str) -> ExtractionResult:
        """TODO: Main extraction method with full error handling"""
        start_time = time.time()
        errors = []
        retry_count = 0
        
        try:
            # TODO: Attempt extraction with retry
            raw_data = self.extract_with_retry(resume_text)
            
            # TODO: Validate extracted data
            validated_data = self.validate_extraction(raw_data)
            
            # TODO: Calculate confidence score
            confidence = self.calculate_confidence_score(validated_data, resume_text)
            
            return ExtractionResult(
                success=True,
                data=validated_data,
                errors=errors,
                confidence_score=confidence,
                processing_time=time.time() - start_time,
                retry_count=retry_count
            )
            
        except Exception as e:
            return ExtractionResult(
                success=False,
                data=None,
                errors=[str(e)],
                confidence_score=0.0,
                processing_time=time.time() - start_time,
                retry_count=retry_count
            )
    
    def calculate_confidence_score(self, data: ResumeData, original_text: str) -> float:
        """TODO: Calculate confidence score for extraction quality"""
        # TODO: Implement scoring based on:
        # - Completeness of extracted data
        # - Consistency with original text
        # - Presence of key information
        
        score = 0.8  # Placeholder
        return score
    
    def get_extraction_stats(self) -> Table:
        """Get extraction statistics"""
        table = Table(title="Extraction Statistics")
        table.add_column("Metric")
        table.add_column("Count")
        table.add_column("Rate")
        
        total = self.extraction_stats['total_attempts']
        for metric, count in self.extraction_stats.items():
            # Rate is relative to total attempts; show 0% before any attempt is made
            rate = f"{count/total*100:.1f}%" if total > 0 else "0%"
            table.add_row(metric.replace('_', ' ').title(), str(count), rate)
        
        return table

# Initialize extractor
extractor = StructuredExtractor()
print("⚙️ Extraction pipeline ready!")
print("⚠️ TODO: Complete the extraction and validation methods above")
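
The "parse and validate JSON response" step in `extract_with_retry` calls `json.loads` directly, which fails whenever a model wraps its answer in a Markdown code fence or adds a sentence around it. The sketch below shows one way to make that step more tolerant, using only the standard library. The helper name `parse_json_response` is hypothetical, and the heuristics (first fenced block, then the outermost braces) are an assumption about typical model output rather than a guarantee.

```python
import json
import re
from typing import Any, Dict

# Build the fence marker programmatically so this snippet can live inside a fenced block.
_FENCE = "`" * 3
_FENCE_RE = re.compile(_FENCE + r"(?:json)?\s*(.*?)" + _FENCE, re.DOTALL | re.IGNORECASE)


def parse_json_response(response: str) -> Dict[str, Any]:
    """Return the JSON object found in a model response.

    Raises ValueError (json.JSONDecodeError is a subclass) if no object can be parsed.
    """
    text = response.strip()

    # Prefer the body of the first fenced block, if the model used one.
    fenced = _FENCE_RE.search(text)
    if fenced:
        text = fenced.group(1).strip()

    # Otherwise (or additionally) trim any prose around the outermost {...} span.
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        text = text[start:end + 1]

    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return data
```

Inside `extract_with_retry`, a helper like this could replace the bare `json.loads(response)` call; the existing `except json.JSONDecodeError` branch would still count parsing errors.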

## Task 2: Test with Sample Resumes

**TODO:** Test your pipeline with various resume formats and edge cases.

In [None]:
# TODO: Create diverse test resume samples

sample_resumes = {
    "complete_resume": """Sarah Johnson
Software Engineer | sarah.johnson@email.com | (555) 123-4567
LinkedIn: linkedin.com/in/sarahjohnson | GitHub: github.com/sarahj
San Francisco, CA

SUMMARY
Senior software engineer with 5+ years of experience in full-stack development.
Expertise in Python, JavaScript, and cloud technologies.

EXPERIENCE
Senior Software Engineer | Tech Corp | March 2021 - Present | San Francisco, CA
• Led development team of 4 engineers on microservices architecture
• Improved system performance by 40% through optimization
• Implemented CI/CD pipeline reducing deployment time by 60%

Software Engineer | StartupXYZ | June 2019 - February 2021 | Remote
• Built REST APIs serving 1M+ requests daily
• Developed React frontend components for customer dashboard

EDUCATION
B.S. Computer Science | Stanford University | 2019 | GPA: 3.8

SKILLS
Languages: Python, JavaScript, Java, Go
Frameworks: Django, React, Node.js, Flask
Tools: Docker, Kubernetes, AWS, PostgreSQL
""",
    
    "minimal_resume": """John Doe
john@example.com

Work: Software Developer at ABC Company (2020-2023)
Education: Computer Science Degree, 2020
Skills: Python, SQL
""",
    
    "messy_format": """ALEX CHEN | alex.chen@tech.com | Phone: 415-555-9876
    
    === WORK HISTORY ===
    * Data Scientist @ BigTech Inc (Jan 2022 -> now)
      - Machine learning model development
      - Data pipeline automation
    * Junior Analyst @ DataCorp (2020-2021)
      
    EDUCATION: Masters in Data Science, UC Berkeley 2020
    
    Technical Skills: Python • R • SQL • TensorFlow • AWS
"""
}

def test_extraction_pipeline():
    """TODO: Test pipeline with different resume formats"""
    
    console.print("\n🧪 [bold blue]Testing Structured Extraction Pipeline[/bold blue]\n")
    
    results = []
    
    for name, resume_text in sample_resumes.items():
        console.print(f"[yellow]Testing: {name}[/yellow]")
        console.print(f"Resume length: {len(resume_text)} characters\n")
        
        # TODO: Extract data and analyze results
        result = extractor.extract_resume_data(resume_text)
        results.append((name, result))
        
        # TODO: Display results
        if result.success:
            console.print("[green]✅ Extraction successful![/green]")
            console.print(f"Confidence: {result.confidence_score:.2f}")
            console.print(f"Processing time: {result.processing_time:.2f}s")
            
            # Show extracted contact info
            if result.data:
                contact = result.data.contact_info
                console.print(f"[cyan]Name:[/cyan] {contact.name}")
                console.print(f"[cyan]Email:[/cyan] {contact.email}")
                console.print(f"[cyan]Experience entries:[/cyan] {len(result.data.work_experience)}")
                
                # TODO: Display more structured data
                
        else:
            console.print("[red]❌ Extraction failed![/red]")
            for error in result.errors:
                console.print(f"[red]Error: {error}[/red]")
        
        console.print("\n" + "-"*50 + "\n")
    
    # TODO: Show overall statistics
    console.print(extractor.get_extraction_stats())
    
    return results

# TODO: Run the tests
# test_results = test_extraction_pipeline()

print("📊 Test framework ready!")
print("⚠️ TODO: Uncomment the test execution and complete the test analysis")

## Task 3: Build Data Quality Assessment

**TODO:** Create tools to assess and improve extraction quality.

In [None]:
class QualityAssessor:
    """Assess quality of extracted structured data"""
    
    def __init__(self):
        self.quality_metrics = {}
    
    def assess_completeness(self, data: ResumeData) -> float:
        """TODO: Score data completeness (0-1)"""
        # TODO: Check presence of key fields
        # TODO: Weight different sections by importance
        # TODO: Return completeness score
        
        score = 0.0
        max_score = 0.0
        
        # TODO: Implement scoring logic
        
        return score / max_score if max_score > 0 else 0.0
    
    def assess_accuracy(self, data: ResumeData, original_text: str) -> float:
        """TODO: Score extraction accuracy by comparing with source"""
        # TODO: Check if extracted info matches source text
        # TODO: Validate email format, phone format, dates
        # TODO: Check for hallucinated information
        
        accuracy_score = 0.8  # Placeholder
        return accuracy_score
    
    def assess_consistency(self, data: ResumeData) -> float:
        """TODO: Check internal consistency of extracted data"""
        # TODO: Check date ordering in work experience
        # TODO: Verify education dates vs work dates
        # TODO: Check for duplicate or conflicting information
        
        consistency_score = 0.9  # Placeholder
        return consistency_score
    
    def generate_quality_report(self, data: ResumeData, original_text: str) -> Dict[str, Any]:
        """TODO: Generate comprehensive quality report"""
        
        completeness = self.assess_completeness(data)
        accuracy = self.assess_accuracy(data, original_text)
        consistency = self.assess_consistency(data)
        
        # TODO: Calculate overall quality score
        overall_score = (completeness + accuracy + consistency) / 3
        
        report = {
            "completeness_score": completeness,
            "accuracy_score": accuracy,
            "consistency_score": consistency,
            "overall_quality": overall_score,
            "issues_found": [],  # TODO: Collect specific issues
            "recommendations": []  # TODO: Generate improvement suggestions
        }
        
        # TODO: Add specific recommendations based on scores
        
        return report
    
    def suggest_improvements(self, report: Dict[str, Any]) -> List[str]:
        """TODO: Generate specific improvement suggestions"""
        suggestions = []
        
        # TODO: Analyze report and generate targeted suggestions
        # Examples:
        # - "Add validation for email format"
        # - "Improve date parsing for work experience"
        # - "Extract missing skills section"
        
        return suggestions
    
    def create_quality_dashboard(self, results: List[ExtractionResult]) -> Table:
        """TODO: Create quality metrics dashboard"""
        table = Table(title="Data Quality Dashboard")
        table.add_column("Metric")
        table.add_column("Average Score")
        table.add_column("Min")
        table.add_column("Max")
        table.add_column("Status")
        
        # TODO: Calculate aggregate quality metrics
        # TODO: Add rows for different quality dimensions
        
        return table

# Initialize quality assessor
assessor = QualityAssessor()
print("📊 Quality assessment tools ready!")
print("⚠️ TODO: Complete the quality assessment methods above")
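
The completeness and accuracy TODOs above can be approached in many ways. The sketch below shows one of them on plain dictionaries: a weighted presence check for completeness, and a verbatim "does this value appear in the source?" check as a rough signal for hallucinated fields. The weights, the helper names (`get_path`, `completeness`, `ungrounded_values`), and the choice of verbatim matching are all assumptions made for illustration.

```python
from typing import Any, Dict, List

# Hypothetical weights: tune these to whatever matters for your use case.
FIELD_WEIGHTS = {
    "contact_info.name": 3.0,
    "contact_info.email": 3.0,
    "contact_info.phone": 1.0,
    "summary": 1.0,
    "work_experience": 3.0,
}


def get_path(data: Dict[str, Any], dotted: str) -> Any:
    """Follow a dotted path through nested dicts; return None if any key is missing."""
    current: Any = data
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def completeness(data: Dict[str, Any]) -> float:
    """Weighted share of fields that are present and non-empty (0.0 to 1.0)."""
    earned = sum(w for path, w in FIELD_WEIGHTS.items() if get_path(data, path))
    return earned / sum(FIELD_WEIGHTS.values())


def ungrounded_values(data: Dict[str, Any], source_text: str, paths: List[str]) -> List[str]:
    """Return the paths whose extracted string does not appear verbatim in the source."""
    haystack = source_text.lower()
    missing = []
    for path in paths:
        value = get_path(data, path)
        if isinstance(value, str) and value.lower() not in haystack:
            missing.append(path)
    return missing
```

With Pydantic v2 models, `data.model_dump()` produces a dictionary that these helpers can consume. Verbatim matching is strict: a reformatted phone number or normalized date will be flagged even when it is correct, so treat the result as a list of fields to review, not as proof of hallucination.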

## 🎯 Exit Ticket

Before completing this lab, check off the deliverables below and make sure you can answer the knowledge-check questions.

### ✅ Deliverables Checklist

- [ ] **Defined comprehensive schemas**: Pydantic models for resume data with validation
- [ ] **Built extraction pipeline**: Retry logic, error handling, and structured output
- [ ] **Tested with diverse inputs**: Different resume formats and edge cases
- [ ] **Implemented quality assessment**: Completeness, accuracy, and consistency metrics
- [ ] **Created monitoring dashboard**: Track extraction performance and quality

### 🧠 Knowledge Check

1. **How do you handle schema validation errors?** What recovery strategies work best?

2. **What makes a good confidence score?** How do you measure extraction quality?

3. **When should you retry vs fail fast?** How do you balance reliability and speed?

4. **How do you handle missing or optional data?** What defaults make sense?
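
For the retry-versus-fail-fast question, one common rule of thumb is to retry errors that may go away on their own (timeouts, dropped connections) and to fail immediately on errors that will repeat (a response that violates the schema). The sketch below illustrates that split with a small standard-library backoff loop instead of the `tenacity` decorator used in Task 1. The function name `call_with_backoff` and the choice of which exceptions count as transient are assumptions.

```python
import time
from typing import Callable, Tuple, Type

# Assumption for this sketch: timeouts and connection drops are worth retrying.
TRANSIENT_ERRORS: Tuple[Type[Exception], ...] = (TimeoutError, ConnectionError)


def call_with_backoff(
    fn: Callable[[], str],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, int]:
    """Call fn, retrying transient errors with exponential backoff.

    Returns (result, attempts_used). Any non-transient exception propagates
    on the first occurrence, without a retry.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(1, max_attempts + 1):
        try:
            return fn(), attempt
        except TRANSIENT_ERRORS:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last transient error
            sleep(base_delay * 2 ** (attempt - 1))  # waits 1x, 2x, 4x, ... base_delay

    raise RuntimeError("retry loop exited unexpectedly")
```

Returning the attempt count alongside the result gives the caller a value it could store in a field such as `retry_count`. The `sleep` parameter exists so tests can pass a stub and avoid real delays.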

### 🚀 Extensions (Optional)

- **Multi-model comparison**: Test extraction across different LLMs
- **Active learning**: Improve schemas based on failure patterns
- **Batch processing**: Handle multiple resumes efficiently
- **Export formats**: Generate structured data in multiple formats
- **Integration testing**: Connect to databases or APIs
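
As a starting point for the batch-processing and export extensions, the sketch below writes one JSON object per line (the JSON Lines layout). The helper name `write_jsonl` and the record shape are hypothetical; any iterable of JSON-serializable dictionaries would work.

```python
import json
from typing import Any, Dict, Iterable, TextIO


def write_jsonl(records: Iterable[Dict[str, Any]], out: TextIO) -> int:
    """Write each record as one line of JSON; return the number of lines written."""
    count = 0
    for record in records:
        # ensure_ascii=False keeps non-ASCII names readable in the output file
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count
```

A typical call would open a file with `open(path, "w", encoding="utf-8")` and pass a generator of dictionaries, so that a large batch never has to sit in memory all at once.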

### 📊 Success Metrics

- Built working structured extraction pipeline
- Achieved >85% extraction accuracy on test cases
- Implemented robust error handling and retry logic
- Created comprehensive data validation
- Generated useful quality metrics and dashboards

**Time Check:** This lab should take about 60 minutes. Focus on getting basic extraction working before adding advanced quality features.

Ready for Lab 8: JSONL Dashboard? Let's build monitoring and analytics! 📈