# Portfolio D: Structured Extraction
**Turn unstructured text into analyzable data at scale**

Researchers often need to extract structured information from messy text — pulling out entities, claims, metrics, and relationships so they can be counted, filtered, and analyzed in a DataFrame. Your job: build a pipeline that does this reliably using LLMs with schema-enforced output.

**Dataset**: You'll work with real-world text (news articles, provided below)
**Your goal**: Define a Pydantic schema, extract structured data from multiple documents, and build an analysis on top of the extracted data.

### Deliverables
- Pydantic schema for your extraction task
- Working extraction pipeline with retry logic
- Extracted data from 10+ documents in a DataFrame
- At least one analysis or visualization built on the extracted data
- Brief model card

**Estimated time**: Sprint 1 (55 min) + Sprint 2 (90 min)

## Setup

In [None]:
!pip install -q openai pydantic matplotlib seaborn pandas tqdm

import os
import time
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pydantic import BaseModel, Field, ValidationError
from typing import List, Literal, Optional
from openai import OpenAI

In [None]:
# Groq API setup
GROQ_API_KEY = ""  # @param {type:"string"}

if not GROQ_API_KEY:
    try:
        from google.colab import userdata
        GROQ_API_KEY = userdata.get('GROQ_API_KEY')
    except Exception:  # not running in Colab, or the secret is not set
        GROQ_API_KEY = os.environ.get("GROQ_API_KEY", "")

client = OpenAI(
    api_key=GROQ_API_KEY,
    base_url="https://api.groq.com/openai/v1"
)
MODEL = "llama-3.1-8b-instant"
print("Groq client ready" if GROQ_API_KEY else "WARNING: No API key set")

## 1. Sample Documents
Here are a few articles to get started. You can replace these with your own data.

In [None]:
articles = [
    {
        "id": 1,
        "text": """OpenAI has announced GPT-5, its most capable model yet, claiming a 40% improvement 
        on reasoning benchmarks over GPT-4. The model was trained on an estimated 15 trillion tokens 
        and costs approximately $100M to train. CEO Sam Altman stated the model can now pass the bar 
        exam in the 99th percentile. Microsoft has committed to integrating the model across Office 365 
        and Azure. Critics from MIT and Stanford have raised concerns about benchmark contamination."""
    },
    {
        "id": 2,
        "text": """Denmark's Ministry of Finance released a report showing that AI adoption in public 
        administration has increased 23% year-over-year. The report surveyed 450 government agencies 
        and found that 67% now use AI for document processing. Cost savings are estimated at 
        DKK 2.1 billion annually. However, only 12% of agencies have formal AI governance policies. 
        The Danish Data Protection Agency has flagged concerns about automated decision-making in 
        citizen services."""
    },
    {
        "id": 3,
        "text": """A new study in Nature found that transformer-based models can predict protein 
        structures with 95% accuracy, matching AlphaFold's performance. Researchers from the 
        University of Cambridge and Google DeepMind collaborated on the project. The model requires 
        only 2 hours of inference on a single A100 GPU, compared to AlphaFold's 16 hours. The team 
        plans to release the model weights under an Apache 2.0 license. Pharmaceutical companies 
        Pfizer and Roche have expressed interest in licensing the technology."""
    },
    {
        "id": 4,
        "text": """The European Central Bank warns that AI-driven trading now accounts for 
        approximately 60% of equity trades in European markets. A report from Deutsche Börse shows 
        algorithmic trading volume increased 35% in Q3 2025. Regulators are concerned about flash 
        crash risks, with three notable incidents in the past 6 months. France's AMF has proposed 
        requiring AI trading systems to include circuit breakers. Goldman Sachs and JP Morgan have 
        both expanded their AI trading desks by over 200 employees each."""
    },
    {
        "id": 5,
        "text": """Spotify's latest earnings report reveals that AI-generated playlists now drive 
        40% of total listening hours, up from 15% a year ago. The company's recommendation engine, 
        powered by a custom transformer model, has reduced churn by 18%. Revenue per user increased 
        12% to €5.80/month. CEO Daniel Ek attributed the growth to the AI DJ feature launched in 
        March 2025. Competitors Apple Music and YouTube Music are reportedly developing similar 
        features. Spotify's stock rose 8% on the announcement."""
    }
]

print(f"Loaded {len(articles)} sample articles")

## 2. Define Your Schema
This is the key design decision — what structured fields do you want to extract?

In [None]:
class ArticleExtraction(BaseModel):
    title: str = Field(description="A concise title for this article (5-10 words)")
    summary: str = Field(description="2-3 sentence summary of the key points")
    organizations: List[str] = Field(description="Companies, universities, and agencies mentioned")
    key_claims: List[str] = Field(description="Main factual claims with numbers (3-5)")
    sentiment: Literal["positive", "negative", "neutral", "mixed"] = Field(
        description="Overall sentiment toward the main topic"
    )
    domain: Literal["tech", "finance", "policy", "science", "culture"] = Field(
        description="Primary domain of the article"
    )

# Test that the schema compiles
print(json.dumps(ArticleExtraction.model_json_schema(), indent=2)[:500])

## 3. Extract from One Document (test)

In [None]:
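def extract_article(text: str, max_retries: int = 3) -> Optional[ArticleExtraction]:
    # JSON mode constrains the output to valid JSON but does not enforce a schema,
    # so the prompt has to spell out the expected fields. Here we embed the
    # JSON schema generated from the Pydantic model.
    schema = json.dumps(ArticleExtraction.model_json_schema())
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=MODEL,
                messages=[
                    {"role": "system", "content": (
                        "Extract structured information from the article. "
                        "Return a single JSON object matching this JSON schema. "
                        "Do not invent facts.\n\n"
                        f"Schema:\n{schema}"
                    )},
                    {"role": "user", "content": text}
                ],
                response_format={"type": "json_object"},
                temperature=0.0,
                max_tokens=500,
            )
            # Raises ValidationError if fields are missing or have the wrong type
            return ArticleExtraction.model_validate_json(response.choices[0].message.content)
        except Exception as e:  # ValidationError, API errors, rate limits
            print(f"  Attempt {attempt+1} failed: {e}")
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, ...
    return None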

# Test on first article
result = extract_article(articles[0]["text"])
if result:
    print(f"Title: {result.title}")
    print(f"Domain: {result.domain}")
    print(f"Sentiment: {result.sentiment}")
    print(f"Organizations: {result.organizations}")
    print("Key claims:")
    for claim in result.key_claims:
        print(f"  - {claim}")
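
The retry pattern used in `extract_article` can be tested without spending API calls. The sketch below is a minimal, standard-library-only illustration: `call_with_retries` and its `fetch` and `sleep` parameters are hypothetical names introduced here, and the "API" is a stand-in that returns canned strings. It shows the same idea (retry on failure, exponential backoff between attempts), not the notebook's actual implementation.

```python
import json
import time
from typing import Callable, Optional


def call_with_retries(
    fetch: Callable[[], str],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Call fetch() until it returns parseable JSON or retries run out."""
    for attempt in range(max_retries):
        try:
            return json.loads(fetch())
        except (json.JSONDecodeError, ConnectionError) as e:
            print(f"  Attempt {attempt + 1} failed: {e}")
            # Back off only when another attempt will follow.
            if attempt < max_retries - 1:
                sleep(base_delay * 2 ** attempt)
    return None


# Stand-in for an API call: fails twice, then returns valid JSON.
responses = iter(["not json", "{broken", '{"title": "ok"}'])
delays = []  # record the requested delays instead of actually sleeping
result = call_with_retries(lambda: next(responses), sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the test instant. In the real pipeline you would leave the default `time.sleep` in place.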

## 4. Extract All Documents

In [None]:
from tqdm import tqdm

extractions = []
for article in tqdm(articles, desc="Extracting"):
    result = extract_article(article["text"])
    if result:
        row = result.model_dump()
        row["id"] = article["id"]
        extractions.append(row)
    time.sleep(0.5)  # Rate limiting

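# Passing the columns explicitly keeps the DataFrame usable even if every extraction failed
extract_df = pd.DataFrame(extractions, columns=["id", *ArticleExtraction.model_fields])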
print(f"\nSuccessfully extracted: {len(extract_df)}/{len(articles)}")
extract_df[["id", "title", "domain", "sentiment"]].head()

## 5. Analyze the Extracted Data
Now that you have structured data, analyze it!

In [None]:
# Example: domain distribution
print("Domain distribution:")
print(extract_df["domain"].value_counts())
print("\nSentiment distribution:")
print(extract_df["sentiment"].value_counts())

# Organizations mentioned
all_orgs = [org for orgs in extract_df["organizations"] for org in orgs]
print(f"\nAll organizations mentioned ({len(all_orgs)} total):")
print(pd.Series(all_orgs).value_counts().head(10))
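
Before trusting counts built on `key_claims`, it can help to check that the numbers in each claim actually occur in the source article, since the prompt asks the model not to invent facts but cannot guarantee it. The following is a rough sketch of one such check. The helper name `ungrounded_numbers` and the example strings are made up for illustration, and the regex only compares digit strings, so a claim that rephrases a figure (for example "2,100 million" for "2.1 billion") would be flagged even though it is correct.

```python
import re
from typing import List

# Matches integers and decimals such as 450, 2.1, or 5.80
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def ungrounded_numbers(claim: str, source: str) -> List[str]:
    """Return numbers that appear in the claim but not in the source text."""
    source_numbers = set(NUMBER.findall(source))
    return [n for n in NUMBER.findall(claim) if n not in source_numbers]


source = "The report surveyed 450 government agencies and found that 67% now use AI."
claims = [
    "67% of 450 surveyed agencies use AI",
    "75% of agencies use AI",  # figure that is not in the source
]
for claim in claims:
    print(claim, "->", ungrounded_numbers(claim, source))
```

In the notebook, you could run this over each row's `key_claims` against the matching article text and report the share of flagged claims as a failure mode in the model card.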

## 6. Your Turn: Scale Up & Extend

Ideas:
- **Add more articles**: Find 10+ articles on a topic you care about (paste them in or load from a dataset)
- **Richer schema**: Add fields like `risk_factors`, `monetary_amounts`, `date_references`
- **Cross-document analysis**: Which organizations appear most? What's the overall sentiment trend?
- **Visualization**: Plot extracted metrics, build a network of organization co-mentions
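
As a starting point for the co-mention idea, here is a minimal sketch that counts how often two organizations appear in the same article. The `org_lists` data is a hypothetical stand-in for the `organizations` column. It uses only the standard library, so it runs without any extraction results.

```python
from collections import Counter
from itertools import combinations

# Hypothetical extraction output: one organization list per article.
org_lists = [
    ["OpenAI", "Microsoft", "MIT"],
    ["Microsoft", "OpenAI"],
    ["Spotify", "Apple Music"],
]

co_mentions = Counter()
for orgs in org_lists:
    # Sort and de-duplicate so (A, B) and (B, A) count as the same pair.
    for pair in combinations(sorted(set(orgs)), 2):
        co_mentions[pair] += 1

for (a, b), n in co_mentions.most_common(3):
    print(f"{a} + {b}: {n}")
```

With real results you would replace `org_lists` with `extract_df["organizations"].tolist()`. The resulting pair counts can serve as edge weights if you go on to draw a network. Note that the model may spell the same organization differently across articles, so some name normalization is likely needed first.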

In [None]:
# YOUR CODE HERE — extend the pipeline

## 7. Model Card

| Field | Value |
|-------|-------|
| **Task** | Structured extraction from text |
| **LLM** | _model name_ |
| **Schema fields** | _list your fields_ |
| **Documents processed** | _N_ |
| **Success rate** | _% extracted without errors_ |
| **Key insight** | _what did the extracted data reveal?_ |
| **Failure mode** | _what does the LLM get wrong?_ |
| **Improvement idea** | _what you'd try next_ |