# Introduction
Exercise 1: EPD Data Extraction

## Objectives
- Understand EPD structure
- Extract key environmental impacts
- Validate extracted data

## Pre-implemented Components
- PDF text extraction
- Table detection
- Basic data validation

## Your Tasks
1. Implement prompt refinements
2. Extract specific impact categories
3. Validate results

## Time: 20 minutes

# Environment setup
## Import pre-implemented components

Note: When your team copies this notebook to your solutions folder, you can keep the same imports.
They should then reference your copy of the code.

In [None]:
import sys
from pathlib import Path
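# __file__ is undefined inside a notebook kernel, so fall back to the working directory
try:
    exercises_dir = Path(__file__).parent.parent
except NameError:
    exercises_dir = Path.cwd().parent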
sys.path.append(str(exercises_dir))

from setup_env import setup_notebook_env
setup_notebook_env()

# Import required libraries
import pandas as pd
import pdfplumber

# Import team template code
from team_template.src.extraction import EPDExtractor
from team_template.src.validation import DataValidator

# Verify data path
data_path = Path("../../data/reference_results")  # Relative to this notebook's directory
sample_epd = data_path / "sample_epd.pdf"
assert sample_epd.exists(), f"Sample EPD not found at {sample_epd}"

## 1. Examine Sample EPD

In [None]:
# Load and examine the sample EPD
def peek_pdf(pdf_path: Path, pages: int = 1) -> str:
    """Extract and return first few pages of PDF text."""
    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() can return None for pages without a text layer
        return "\n".join((page.extract_text() or "") for page in pdf.pages[:pages])

# View first page of sample EPD
print(peek_pdf(sample_epd))

## 2. Test Pre-implemented Extraction

In [None]:
# Create extractor instance
extractor = EPDExtractor()

# Test basic extraction
result = extractor.extract_from_pdf(str(sample_epd))
print("Extracted data structure:")
print(result)

## 3. Your Task: Enhance Extraction

In [None]:
# TODO: Implement your enhanced extraction logic here
# Hint: Consider what additional fields would be valuable for LCA
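
As one possible starting point for the task above, the sketch below pulls impact indicator rows out of extracted page text using only the standard library. It is not how `EPDExtractor` works internally. The names `INDICATOR_LABELS` and `extract_indicators`, the label patterns, and the sample text are all assumptions made for illustration; real EPDs vary in wording and table layout.

```python
import re

# Hypothetical label patterns; real EPDs vary in wording, so treat these as a starting point.
INDICATOR_LABELS = {
    "GWP": re.compile(r"^(GWP(-total)?|Global warming potential)\b", re.IGNORECASE),
    "ODP": re.compile(r"^(ODP|Ozone depletion potential)\b", re.IGNORECASE),
    "AP": re.compile(r"^(AP|Acidification potential)\b", re.IGNORECASE),
}

# Plain or scientific notation, with either a decimal point or a decimal comma.
NUMBER = re.compile(r"[-+]?\d+([.,]\d+)?([eE][-+]?\d+)?")


def extract_indicators(text: str) -> dict[str, list[float]]:
    """Collect the numeric tokens on each line that starts with a known indicator label."""
    found: dict[str, list[float]] = {}
    for line in text.splitlines():
        stripped = line.strip()
        for name, label in INDICATOR_LABELS.items():
            # Keep only the first matching row per indicator.
            if name in found or not label.match(stripped):
                continue
            # Whole-token matching skips unit fragments such as "CO2" or "CFC-11".
            values = [
                float(token.replace(",", "."))
                for token in stripped.split()
                if NUMBER.fullmatch(token)
            ]
            if values:
                found[name] = values
    return found


# Made-up sample text, not taken from the sample EPD.
sample_text = """Indicator Unit A1-A3 C4
GWP-total kg CO2 eq. 1.23E+01 4,50E-01
ODP kg CFC-11 eq. 2.10E-07 0
"""
print(extract_indicators(sample_text))
```

To try it on the real document, you could pass in the output of `peek_pdf(sample_epd, pages=...)`. Matching whole tokens is a deliberate simplification: it will miss values that the PDF extraction splits across tokens, and it will pick up a stray digit if a unit subscript is separated from its unit.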

## 4. Validate Results

In [None]:
# Create validator instance
validator = DataValidator()

# Validate extraction results
is_valid, issues = validator.validate_extraction(result)

print(f"Validation passed: {is_valid}")
if not is_valid:
    print("\nIssues found:")
    for issue in issues:
        print(f"- {issue}")
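
The checks inside `DataValidator` are not shown in this notebook. As a rough illustration of the kind of rule you might add for your own fields, the sketch below returns the same `(is_valid, issues)` pair that the cell above unpacks. `check_indicators` and `REQUIRED_INDICATORS` are hypothetical names, and the sketch assumes a flat dict mapping indicator names to numbers, which may not match what `EPDExtractor` actually returns.

```python
import math

# Hypothetical selection of required indicators for this sketch.
REQUIRED_INDICATORS = ("GWP", "ODP", "AP")


def check_indicators(data: dict) -> tuple[bool, list[str]]:
    """Return (is_valid, issues) for a flat mapping of indicator name to value."""
    issues: list[str] = []
    for name in REQUIRED_INDICATORS:
        if name not in data:
            issues.append(f"missing indicator: {name}")
            continue
        value = data[name]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            issues.append(f"{name} is not numeric: {value!r}")
        elif not math.isfinite(value):
            issues.append(f"{name} is not finite: {value!r}")
    return (not issues, issues)


# Made-up values for illustration only.
ok, problems = check_indicators({"GWP": "n/a", "ODP": float("nan")})
print(ok)
for problem in problems:
    print(f"- {problem}")
```

Collecting every issue before returning, rather than stopping at the first one, keeps the output useful when several fields fail at once.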

## 5. Save Results

In [None]:
# Save validated results for next exercise
import json
from datetime import datetime

output_path = Path("../results")
output_path.mkdir(exist_ok=True)

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_file = output_path / f"extraction_results_{timestamp}.json"

with open(output_file, "w", encoding="utf-8") as f:
    json.dump(result, f, indent=2)

print(f"Results saved to: {output_file}")
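
Because the file name embeds a `%Y%m%d_%H%M%S` timestamp, sorting the names alphabetically also sorts them chronologically. A later exercise could use that to pick up the most recent run. The helper below is a minimal sketch under that assumption; `load_latest_results` is a hypothetical name and is not part of the team template.

```python
import json
from pathlib import Path
from typing import Any, Optional


def load_latest_results(results_dir: Path) -> Optional[Any]:
    """Load the newest extraction_results_*.json in results_dir, or None if there is none."""
    # Zero-padded timestamps sort lexicographically in chronological order.
    files = sorted(results_dir.glob("extraction_results_*.json"))
    if not files:
        return None
    with open(files[-1], encoding="utf-8") as f:
        return json.load(f)
```

Called as `load_latest_results(Path("../results"))`, it would return whatever structure was saved above, or `None` when no results file exists yet.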