# Training a Vertical AI for 10-K Text Classification in Business Functions
Supports the APIs of OpenAI (Responses API), Anthropic, xAI, Fireworks AI, Google, and UNC-hosted Azure

February 4, 2026  

*Version 1.0*

Copyright 2026 Daniel M. Ringel  
www.ringel.ai 


**Please cite this paper** if you use any part or all of this code in a project - be it commercial or academic:  

> Ringel, Daniel, *Creating Synthetic Experts with Generative Artificial Intelligence* (December 5, 2023). Kenan Institute of Private Enterprise Research Paper No. 4542949, Available at SSRN: https://ssrn.com/abstract=4542949 or http://dx.doi.org/10.2139/ssrn.4542949 


Query various serverless genAI models to classify text, using a multi-label classification problem as the example:
- Sentences from 10-K filings of Fortune 500 companies
- Construct of Interest: Business functions
- Valid labels: Marketing, Finance, Accounting, Operations, IT, HR

> **IMPORTANT**  Running this code will cost you API credits and requires you to have accounts with the providers. You will need to supply your own API keys. You may be subject to rate limits (how many queries you can send per minute) and to restrictions on which models you can use (OpenAI, for example, requires you to verify your identity with an ID to access many models). Every time you execute this code, you drain your API credits, which is real money. Make wise decisions about what and how much to label.
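
To make the cost warning concrete, the sketch below estimates the spend for a labeling run before any API call is made. The function name `estimate_run_cost`, the token counts, and the prices are illustrative assumptions, not measured values; check current vendor pricing before relying on the result.

```python
def estimate_run_cost(n_sentences: int, tokens_in: int, tokens_out: int,
                      price_in: float, price_out: float) -> float:
    """Rough cost in USD for a run; prices are per 1M tokens."""
    total_in = n_sentences * tokens_in
    total_out = n_sentences * tokens_out
    return total_in / 1e6 * price_in + total_out / 1e6 * price_out

# Assumed workload: 1,000 sentences, about 600 input tokens each
# (system prompt + sentence) and about 20 output tokens each,
# at hypothetical prices of $1.25 (input) and $10.00 (output) per 1M tokens.
cost = estimate_run_cost(1_000, 600, 20, price_in=1.25, price_out=10.00)
print(f"Estimated cost: ${cost:.2f}")
```

Reasoning or thinking modes can add hundreds or thousands of output tokens per sentence, so an estimate for those modes should use a much larger `tokens_out`.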

## **Disclaimer**

**USE AT YOUR OWN RISK**

This code is provided "as is" without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose, and noninfringement.

Under no circumstances shall Daniel Ringel be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services, loss of use, data, or profits, or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of this code, even if advised of the possibility of such damage.

**Additional Notes:**
- API costs incurred from running this code are your sole responsibility
- Verify all API pricing before running at scale
- Test with small samples before processing large datasets
- The author makes no guarantees about the accuracy, reliability, or completeness of results
- This code is for educational and research purposes

By using this code, you acknowledge that you have read this disclaimer and agree to its terms.


# Installs and updates
- need to run only once if you are on your own computer
- on Colab, you may need to run each time
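
If you are unsure whether the installs are still needed, a check along these lines can list what is missing. It is a minimal sketch: `missing_packages` is a hypothetical helper, and the list of distribution names is an assumption based on the install commands in this notebook.

```python
from importlib import metadata

def missing_packages(dist_names):
    """Return the distribution names that are not installed in this environment."""
    missing = []
    for name in dist_names:
        try:
            metadata.version(name)  # raises PackageNotFoundError if absent
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing

# Distribution names as used with pip (assumed from the install cell)
required = ["pandas", "tqdm", "openai", "anthropic", "google-genai"]
print("Not installed:", missing_packages(required))
```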

In [None]:
# Core
!pip install --upgrade pandas tqdm
# OpenAI (also used for xAI and Fireworks)
!pip install --upgrade openai
# Anthropic
!pip install --upgrade anthropic
# Google Gemini
!pip install --upgrade google-genai google-api-core
# For Label Agreement
!pip install -q -U krippendorff 
# For Fine-tuning a pretrained LLM
!pip install -q -U transformers datasets accelerate scikit-learn
!pip install iterative-stratification

# Set Environment Variables (API Keys) on Local Computer   
### *(or on Colab in Secret Keys Tab)*
> **DO NOT SHARE API KEYS!!!** Delete them before sharing this notebook
 
> On ***Colab***, use "Secrets" in Tab (key) on right. Make sure to use the exact same API Key names (e.g., "OPENAI_API_KEY") spelled and capitalized as shown!
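
When running locally, a quick check like the one below can show which keys are still blank without ever printing a key value. This is a sketch under assumptions: `unset_keys` is a hypothetical helper, and it only inspects environment variables, not Colab secrets.

```python
import os

REQUIRED_KEYS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "FIREWORKS_API_KEY",
                 "XAI_API_KEY", "GOOGLE_API_KEY", "AZURE_API_KEY"]

def unset_keys(names, env=None):
    """Return the names whose value is missing or blank. Never prints values."""
    env = os.environ if env is None else env
    return [n for n in names if not (env.get(n) or "").strip()]

print("Keys still to set:", unset_keys(REQUIRED_KEYS))
```

You only need keys for the vendors you actually plan to query.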

In [None]:
# Put your API Keys here if you run this locally.
import os
os.environ["OPENAI_API_KEY"] = ""
os.environ["ANTHROPIC_API_KEY"] = ""
os.environ["FIREWORKS_API_KEY"] = ""
os.environ["XAI_API_KEY"] = ""
os.environ["GOOGLE_API_KEY"] = ""
os.environ["AZURE_API_KEY"] = ""

# Imports

In [None]:
import os
import json
import time
import datetime
import pandas as pd
from tqdm import tqdm

# Vendor SDKs
import openai
from openai import OpenAI, AzureOpenAI, RateLimitError, APIError, AuthenticationError
import anthropic
from google import genai
from google.genai import types
from google.api_core import exceptions as google_exceptions

# Vendor/Model Configuration

In [None]:
# -----------------------------
# Vendor/Model Configuration
# All prices per 1M tokens (USD)  ----> UPDATE THE PRICES!!!!
# supplier = who made the model (for when using 3rd party APIs like Fireworks)
# -----------------------------
VENDORS = {
    "openai": {
        "models": {
            # GPT-5 family
            "gpt-5.2":      {"supplier": "openai", "price_in": 1.75,  "price_out": 14.00},
            "gpt-5":        {"supplier": "openai", "price_in": 1.25,  "price_out": 10.00},
            "gpt-5-mini":   {"supplier": "openai", "price_in": 0.30,  "price_out": 2.50},
            "gpt-5-nano":   {"supplier": "openai", "price_in": 0.10,  "price_out": 0.40},
            # GPT-4.1 family
            "gpt-4.1":      {"supplier": "openai", "price_in": 2.00,  "price_out": 8.00},
            # GPT-4o family
            "gpt-4o":       {"supplier": "openai", "price_in": 2.50,  "price_out": 10.00},
        }
    },
    "azure": {
            "models": {
                "gpt-4.1": {"supplier": "openai", "price_in": 2.00, "price_out": 8.00},
                "gpt-4o": {"supplier": "openai", "price_in": 2.50, "price_out": 8.00},
            },
            "endpoint": "https://azureaiapi.cloud.unc.edu",
            "api_version": "2025-04-01-preview",
        },
    "anthropic": {
        "models": {
            # Claude 4.5 family
            "claude-opus-4-5-20251101":   {"supplier": "anthropic", "price_in": 5.00,  "price_out": 25.00},
            "claude-sonnet-4-5-20250929": {"supplier": "anthropic", "price_in": 3.00,  "price_out": 15.00},
            "claude-haiku-4-5-20251001":  {"supplier": "anthropic", "price_in": 1.00,  "price_out": 5.00},
            # Claude 4 family
            "claude-sonnet-4-20250514":   {"supplier": "anthropic", "price_in": 3.00,  "price_out": 15.00},
        }
    },
    "google": {
        "models": {
            # Gemini 3 - latest flagship - preview
            "gemini-3-pro-preview":   {"supplier": "google", "price_in": 2.00, "price_out": 12.00},
            "gemini-3-flash-preview": {"supplier": "google", "price_in": 0.50, "price_out": 3.00},
            # Gemini 2.5 - stable
            "gemini-2.5-pro":         {"supplier": "google", "price_in": 1.25, "price_out": 10.00},
            "gemini-2.5-flash":       {"supplier": "google", "price_in": 0.30, "price_out": 2.50},
        }
    },
    "xai": {
        "models": {
            "grok-4":                       {"supplier": "xai", "price_in": 3.00,  "price_out": 15.00},
            "grok-4-1-fast-reasoning":      {"supplier": "xai", "price_in": 0.20,  "price_out": 0.50},
            "grok-4-1-fast-non-reasoning":  {"supplier": "xai", "price_in": 0.20,  "price_out": 0.50}
        }
    },
    "fireworks": {
        "models": {
            "deepseek-v3p2":                {"supplier": "deepseek", "price_in": 0.56, "price_out": 1.68},
            "qwen3-vl-235b-a22b-instruct":  {"supplier": "alibaba",  "price_in": 0.22, "price_out": 0.88},
            "deepseek-r1-0528":             {"supplier": "deepseek", "price_in": 1.35, "price_out": 5.40},
            "qwen3-vl-235b-a22b-thinking":  {"supplier": "alibaba",  "price_in": 0.22, "price_out": 0.88},
            "kimi-k2p5":                    {"supplier": "moonshot", "price_in": 1.20, "price_out": 1.20}
        }
    }   
}
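
Because the prices in a configuration like this are edited by hand, a small structural check can catch a missing or negative entry before it distorts cost figures. The sketch below assumes the same nesting as `VENDORS` (vendor, then `"models"`, then per-model info); `config_problems` and the sample data are hypothetical.

```python
def config_problems(vendors: dict) -> list:
    """Return human-readable problems found in a VENDORS-style dict."""
    problems = []
    for vendor, cfg in vendors.items():
        for model, info in cfg.get("models", {}).items():
            # Every model entry needs these three keys
            for key in ("supplier", "price_in", "price_out"):
                if key not in info:
                    problems.append(f"{vendor}/{model}: missing {key}")
            # Prices must be non-negative numbers
            for key in ("price_in", "price_out"):
                value = info.get(key)
                if key in info and (not isinstance(value, (int, float)) or value < 0):
                    problems.append(f"{vendor}/{model}: bad {key}={value!r}")
    return problems

# Toy configuration with one valid and one broken entry
sample = {
    "demo": {
        "models": {
            "model-a": {"supplier": "x", "price_in": 1.0, "price_out": 2.0},
            "model-b": {"supplier": "x", "price_in": -1},
        }
    }
}
print(config_problems(sample))
```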

# System Prompt
- Be clear
- Be specific
- Try RTF (role task format)
- Succinct and exhaustive construct definitions
- Could give examples (few-shot), but may create noise or focus model too much on these cases
- Explain tie-breakers or tricky cases
- Define output format.
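
One way to apply the role-task-format idea is to assemble the prompt from separate pieces, so each piece can be edited and compared on its own. The helper `build_rtf_prompt` and the toy strings below are hypothetical illustrations, not the prompt used in this notebook.

```python
from typing import Optional

def build_rtf_prompt(role: str, task: str, output_format: str,
                     definitions: Optional[dict] = None) -> str:
    """Join role, task, optional construct definitions, and output format."""
    parts = [role.strip(), "", task.strip()]
    if definitions:
        parts += ["", "Definitions:"]
        parts += [f"- {name}: {text}" for name, text in definitions.items()]
    parts += ["", output_format.strip()]
    return "\n".join(parts)

prompt = build_rtf_prompt(
    role="You are a business analyst classifying sentences from 10-K filings.",
    task="Classify the sentence into one or more business functions.",
    output_format="Return ONLY a JSON array of labels. If none apply, return [].",
    definitions={"Marketing": "Customers, markets, demand.",
                 "HR": "Workforce, hiring, compensation."},
)
print(prompt)
```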

In [None]:
# -----------------------------
# System Prompt
# -----------------------------
SYSTEM_PROMPT = """You are a business analyst classifying sentences from 10-K filings.

Classify the sentence into one or more business functions: Marketing, Finance, Accounting, Operations, IT, HR (human resources)

Definitions:
- Marketing: Customers, markets, demand, branding, advertising, promotion, pricing strategy, market research, segmentation, positioning, sales strategy, sales channels.
- Finance: Capital structure, funding, liquidity, treasury, investing, valuation, financial risk management (interest rates, FX, hedging), dividends, buybacks, M&A, financing activities.
- Accounting: Financial reporting, disclosures, GAAP/IFRS, accounting policies, accounting estimates, revenue recognition, impairments, reserves, depreciation, amortization, audits, internal controls over financial reporting (ICFR), tax accounting.
- Operations: Production, service delivery, supply chain, logistics, procurement, inventory management, manufacturing, capacity planning, facilities, quality control, safety, process efficiency, fulfillment, operational infrastructure, IT systems.
- IT: Information technology systems, software, hardware, cybersecurity, data management, cloud computing, digital infrastructure, technology platforms, system integration, IT support, technology investments, data analytics infrastructure.
- HR: Human resources, workforce, hiring, recruitment, talent acquisition, employee benefits, compensation, training, professional development, labor relations, employee retention, workplace safety, organizational culture.

Rules:
1. Assign labels only when there is clear, direct evidence in the sentence.
2. Assign multiple labels if clearly relevant to more than one field.
3. Tie-breakers: Reporting/policies/controls/disclosures → Accounting; Funding/treasury/hedging/M&A → Finance.
4. Out of scope: General corporate governance, board matters, executive compensation, legal proceedings → return empty array.
5. Output: Return ONLY a JSON array with exact spelling: "Marketing", "Finance", "Accounting", "Operations", "IT", "HR". If none apply, return [].
6. DO NOT provide a reason or explanation for your labels."""
    
ALLOWED_LABELS = {"Marketing", "Finance", "Accounting", "Operations", "IT", "HR"}
LABEL_ORDER = ["Marketing", "Finance", "Accounting", "Operations", "IT", "HR"]
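
A prompt and its label list can drift apart when one is edited without the other. The following sketch checks that every allowed label is spelled out, in double quotes, in the prompt's output rule. It assumes the rule sits on a line containing `Output:`; `labels_missing_from_prompt` and the demo prompt are hypothetical.

```python
def labels_missing_from_prompt(prompt: str, labels) -> list:
    """Return labels that never appear in double quotes on an 'Output:' line."""
    output_lines = [ln for ln in prompt.splitlines() if "Output:" in ln]
    rule = " ".join(output_lines)
    return [lab for lab in labels if f'"{lab}"' not in rule]

# Toy prompt whose output rule forgets one label
demo_prompt = (
    "Rules:\n"
    '5. Output: Return ONLY a JSON array: "Marketing", "IT". If none apply, return [].'
)
print(labels_missing_from_prompt(demo_prompt, ["Marketing", "IT", "HR"]))
```

An empty list means the output rule names every label the parser will accept.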

# Client Initialization & API Call Functions

In [None]:
# -----------------------------
# Client Initialization
# API keys come from environment variables on a local PC (set them at the beginning)
# or from the Secrets tab (key symbol) on Google Colab
# -----------------------------
def get_api_key(key_name: str) -> str:
    """Get API key from Colab secrets or environment variables."""
    try:
        from google.colab import userdata
        return userdata.get(key_name)
    except (ImportError, ModuleNotFoundError):
        # Not in Colab, use environment variables
        return os.environ[key_name]


def init_client(vendor: str):
    """Initialize API client for vendor."""
    if vendor == "openai":
        return OpenAI(api_key=get_api_key("OPENAI_API_KEY"))
    elif vendor == "azure":
        return AzureOpenAI(
            api_key=get_api_key("AZURE_API_KEY"),
            azure_endpoint=VENDORS["azure"]["endpoint"],
            api_version=VENDORS["azure"]["api_version"],
        )
    elif vendor == "anthropic":
        return anthropic.Anthropic(api_key=get_api_key("ANTHROPIC_API_KEY"))
    elif vendor == "google":
        return genai.Client(api_key=get_api_key("GOOGLE_API_KEY"))
    elif vendor == "xai":
        return OpenAI(
            api_key=get_api_key("XAI_API_KEY"),
            base_url="https://api.x.ai/v1"
        )
    elif vendor == "fireworks":
        return OpenAI(
            api_key=get_api_key("FIREWORKS_API_KEY"),
            base_url="https://api.fireworks.ai/inference/v1"
        )
    else:
        raise ValueError(f"Unknown vendor: {vendor}")

# -----------------------------
# API Call Functions
# -----------------------------
def call_openai(client, sentence: str, model: str, system_prompt: str,
                max_tokens: int = 64, reasoning_effort: str = None) -> dict:
    """
    Call OpenAI Responses API.
    
    Token billing:
    - input_tokens: billed at input rate
    - output_tokens: includes reasoning_tokens, all billed at output rate
    - reasoning_tokens: subset of output_tokens (internal thinking)
    
    Note: For reasoning models, max_output_tokens must accommodate BOTH
    reasoning tokens AND response tokens. We scale up accordingly.
    """
    # For reasoning models, we need more output tokens to fit both reasoning + response
    if reasoning_effort:
        # Reasoning needs room: base response tokens + reasoning overhead
        # low ~500, medium ~1000, high ~2000+ reasoning tokens typical
        reasoning_overhead = {"low": 512, "medium": 1024, "high": 2048}.get(reasoning_effort, 512)
        effective_max_tokens = max_tokens + reasoning_overhead
    else:
        effective_max_tokens = max_tokens
    
    params = {
        "model": model,
        "instructions": system_prompt,
        "input": sentence,
        "max_output_tokens": effective_max_tokens,
    }
   
    if reasoning_effort:
        params["reasoning"] = {"effort": reasoning_effort}
    else:
        params["temperature"] = 0
    
    resp = client.responses.create(**params)
    
    # Extract text - output_text may be empty for some response types
    text = ""
    if resp.output_text:
        text = resp.output_text.strip()
    else:
        # Fallback: extract from output items
        # (Responses API text blocks have type 'output_text')
        for item in resp.output:
            if getattr(item, 'type', None) == 'message':
                for block in getattr(item, 'content', []):
                    if getattr(block, 'type', None) in ('output_text', 'text'):
                        text += getattr(block, 'text', '')
        text = text.strip()
    
    # Extract token breakdown
    input_tokens = resp.usage.input_tokens
    output_tokens = resp.usage.output_tokens
    
    # Reasoning tokens are part of output_tokens (not additional)
    reasoning_tokens = 0
    if hasattr(resp.usage, 'output_tokens_details') and resp.usage.output_tokens_details:
        reasoning_tokens = getattr(resp.usage.output_tokens_details, 'reasoning_tokens', 0) or 0
    
    return {
        "text": text,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,          # total output (includes reasoning)
        "reasoning_tokens": reasoning_tokens,    # internal thinking (subset of output)
        "response_tokens": output_tokens - reasoning_tokens,  # visible response
        "raw_response": resp.model_dump() if hasattr(resp, 'model_dump') else str(resp),
    }

def call_azure(client, sentence: str, model: str, system_prompt: str,
               max_tokens: int = 64) -> dict:
    """
    Call Azure OpenAI API (chat completions).
    
    Azure uses the standard chat completions endpoint, not Responses API.
    """
    params = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": sentence}
        ],
        "max_tokens": max_tokens,
        "temperature": 0,  # Set temp to 0 to be deterministic
    }
    
    resp = client.chat.completions.create(**params)
    
    text = (resp.choices[0].message.content or "").strip()
    
    return {
        "text": text,
        "input_tokens": resp.usage.prompt_tokens,
        "output_tokens": resp.usage.completion_tokens,
        "reasoning_tokens": 0,
        "response_tokens": resp.usage.completion_tokens,
        "raw_response": resp.model_dump() if hasattr(resp, 'model_dump') else str(resp),
    }
    
def call_anthropic(client, sentence: str, model: str, system_prompt: str,
                   max_tokens: int = 64, thinking_budget: int = None) -> dict:
    """
    Call Anthropic Messages API.
    
    Token billing:
    - input_tokens: billed at input rate
    - output_tokens: includes thinking tokens, all billed at output rate
    - NOTE: Anthropic doesn't provide separate thinking token count in usage.
            The output_tokens is what's billed (includes full thinking, not summary).
    """
    params = {
        "model": model,
        "system": system_prompt,
        "messages": [{"role": "user", "content": sentence}],
    }
    
    if thinking_budget:
        thinking_budget = max(1024, thinking_budget)
        params["thinking"] = {"type": "enabled", "budget_tokens": thinking_budget}
        params["max_tokens"] = thinking_budget + max_tokens
    else:
        params["temperature"] = 0
        params["max_tokens"] = max_tokens
    
    resp = client.messages.create(**params)
    
    # Extract text (skip thinking blocks - they're summarized in Claude 4)
    text = "".join(b.text for b in resp.content if b.type == "text").strip()
    
    # output_tokens includes thinking tokens (billed amount)
    # Anthropic doesn't provide breakdown like OpenAI does
    output_tokens = resp.usage.output_tokens
    
    return {
        "text": text,
        "input_tokens": resp.usage.input_tokens,
        "output_tokens": output_tokens,  # total billed (includes thinking)
        "reasoning_tokens": 0,           # Anthropic doesn't provide breakdown
        "response_tokens": output_tokens,  # Can't separate, use total
        "raw_response": resp.model_dump() if hasattr(resp, 'model_dump') else str(resp),
    }

def call_google(client, sentence: str, model: str, system_prompt: str,
                max_tokens: int = 64, thinking_level: str = None) -> dict:
    """
    Call Google Gemini API.
    
    Gemini 3 Pro: thinking_level "low", "high" (default)
    Gemini 3 Flash: thinking_level "minimal", "low", "medium", "high" (default)
    
    Note: Even "low" still uses thinking tokens! Only Flash "minimal" truly minimizes.
    
    Token billing:
    - thinking tokens are billed as output; usage reports them separately
      (thoughts_token_count) from the visible response (candidates_token_count)
    """
    from google.genai import types
    
    # Build config
    config_params = {}
    
    # Determine token allocation based on thinking level: Even "low" uses thinking tokens (~50-200), so we need buffer
    if thinking_level == "minimal":
        # Only Flash supports minimal - truly minimal thinking
        config_params["max_output_tokens"] = max_tokens + 128
    elif thinking_level == "low":
        # Low still uses thinking tokens, need buffer
        config_params["max_output_tokens"] = max_tokens + 512
    elif thinking_level == "medium":
        config_params["max_output_tokens"] = max_tokens + 2048
    else:
        # high or default (None) - full thinking
        config_params["max_output_tokens"] = max_tokens + 4096
    
    # Set thinking level if specified
    if thinking_level:
        config_params["thinking_config"] = types.ThinkingConfig(
            thinking_level=thinking_level
        )
    
    # Gemini 3 recommends temperature=1.0 (default), don't override
    
    config = types.GenerateContentConfig(**config_params)
    
    # Combine system prompt and sentence
    full_prompt = f"{system_prompt}\n{sentence}"
    
    resp = client.models.generate_content(
        model=model,
        contents=full_prompt,
        config=config,
    )
    
    # Extract text - resp.text may be empty, need to check parts
    text = ""
    if resp.text:
        text = resp.text.strip()
    elif resp.candidates and resp.candidates[0].content and resp.candidates[0].content.parts:
        # Extract text from parts, skipping thinking parts
        for part in resp.candidates[0].content.parts:
            # Skip thinking parts (they have 'thought' attribute set to True)
            if hasattr(part, 'thought') and part.thought:
                continue
            if hasattr(part, 'text') and part.text:
                text += part.text
        text = text.strip()
    
    # Token usage
    # - candidates_token_count = actual response tokens
    # - thoughts_token_count = thinking/reasoning tokens (billed as output)
    usage = resp.usage_metadata
    thoughts_tokens = getattr(usage, 'thoughts_token_count', 0) or 0
    candidates_tokens = usage.candidates_token_count or 0

    return {
        "text": text,
        "input_tokens": resp.usage_metadata.prompt_token_count,
        "output_tokens": candidates_tokens + thoughts_tokens,  # Total billed output
        "reasoning_tokens": thoughts_tokens,
        "response_tokens": candidates_tokens,
        "raw_response": str(resp),
    }
    
def call_fireworks(client, sentence: str, model: str, system_prompt: str,
                   max_tokens: int = 64, reasoning_effort: str = None) -> dict:
    """
    Call Fireworks API using OpenAI-compatible endpoint.
    
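    Thinking models (*-thinking, r1, kimi) output reasoning in content,
    so we allocate extra tokens and use recommended temperature.
    reasoning_effort is accepted for a uniform call signature but is not
    sent to the API; thinking models are detected from the model name.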
    """
    full_model_id = f"accounts/fireworks/models/{model}"
    
    # Thinking models need more tokens (thinking appears in content)
    is_thinking_model = "thinking" in model or "r1" in model or model == "kimi-k2p5"
    
    if is_thinking_model:
        effective_max_tokens = max_tokens + 4096
        temperature = 0.6  # Recommended for thinking
    else:
        effective_max_tokens = max_tokens
        temperature = 0  # Deterministic for chat
    
    params = {
        "model": full_model_id,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": sentence}
        ],
        "max_tokens": effective_max_tokens,
        "temperature": temperature,
    }

    resp = client.chat.completions.create(**params)
    
    text = (resp.choices[0].message.content or "").strip()
    
    return {
        "text": text,
        "input_tokens": resp.usage.prompt_tokens,
        "output_tokens": resp.usage.completion_tokens,
        "reasoning_tokens": 0,
        "response_tokens": resp.usage.completion_tokens,
        "raw_response": resp.model_dump() if hasattr(resp, 'model_dump') else str(resp),
    }

def call_xai(client, sentence: str, model: str, system_prompt: str,
             max_tokens: int = 64) -> dict:
    """
    Call xAI API using OpenAI-compatible endpoint.
    
    grok-4 and *-reasoning models are always reasoning.
    *-non-reasoning models are chat mode.
    
    Token billing:
    - completion_tokens: visible response only
    - reasoning_tokens: in completion_tokens_details, billed separately
    - total output billed = completion_tokens + reasoning_tokens
    """
    is_reasoning = "non-reasoning" not in model
    
    params = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": sentence}
        ],
        "max_tokens": max_tokens + 4096 if is_reasoning else max_tokens,
        "temperature": 1.0 if is_reasoning else 0,
    }
    
    resp = client.chat.completions.create(**params)
    text = (resp.choices[0].message.content or "").strip()
    
    # Extract reasoning tokens from completion_tokens_details
    reasoning_tokens = 0
    if hasattr(resp.usage, 'completion_tokens_details') and resp.usage.completion_tokens_details:
        reasoning_tokens = getattr(resp.usage.completion_tokens_details, 'reasoning_tokens', 0) or 0
    
    # completion_tokens is visible response, reasoning_tokens is separate
    # Both are billed at output rate
    completion_tokens = resp.usage.completion_tokens
    total_output = completion_tokens + reasoning_tokens
    
    return {
        "text": text,
        "input_tokens": resp.usage.prompt_tokens,
        "output_tokens": total_output,  # Total billed at output rate
        "reasoning_tokens": reasoning_tokens,
        "response_tokens": completion_tokens,  # Visible response only
        "raw_response": resp.model_dump() if hasattr(resp, 'model_dump') else str(resp),
    }
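
Every `call_*` function above returns the same six keys, and in each of them `reasoning_tokens + response_tokens` equals `output_tokens`. A small check like the following could catch a wrapper that drifts from that contract; `check_call_result` is a hypothetical helper, not part of the notebook, and the example values are made up.

```python
REQUIRED_RESULT_KEYS = {"text", "input_tokens", "output_tokens",
                        "reasoning_tokens", "response_tokens", "raw_response"}

def check_call_result(result: dict) -> None:
    """Raise if a call_* result is missing keys or its token counts do not add up."""
    missing = REQUIRED_RESULT_KEYS - set(result)
    if missing:
        raise KeyError(f"Missing keys: {sorted(missing)}")
    if result["reasoning_tokens"] + result["response_tokens"] != result["output_tokens"]:
        raise ValueError("reasoning_tokens + response_tokens != output_tokens")

# Made-up result in the shape the wrappers return
example = {
    "text": '["Finance"]',
    "input_tokens": 580,
    "output_tokens": 212,       # total billed output
    "reasoning_tokens": 200,    # internal thinking
    "response_tokens": 12,      # visible response
    "raw_response": {},
}
check_call_result(example)  # passes silently
```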

# Parsing & Cost

In [None]:
# -----------------------------
# Parsing & Cost
# -----------------------------

def parse_labels(text: str) -> list:
    """Parse JSON array, validate labels. Handles markdown code fences and extra text."""
    text = text.strip()

    # Strip thinking tags for models that use them (e.g., DeepSeek R1, QwQ)
    if "</think>" in text:
        text = text.split("</think>")[-1].strip()
    # An unterminated <think> block is left in place; the search below
    # still picks up the last JSON array in the text.
    
    # Strip markdown code fences if present
    if "```" in text:
        lines = text.split("\n")
        lines = [l for l in lines if not l.strip().startswith("```")]
        text = "\n".join(lines).strip()
    
    # Find the LAST JSON array (thinking models put reasoning before answer)
    # Search backwards for the final [...] pattern
    start = text.rfind("[")
    if start == -1:
        raise ValueError("No JSON array found")
    
    # Find matching closing bracket
    depth = 0
    end = start
    for i, char in enumerate(text[start:], start):
        if char == "[":
            depth += 1
        elif char == "]":
            depth -= 1
            if depth == 0:
                end = i + 1
                break
    
    json_str = text[start:end]
    data = json.loads(json_str)
    
    if not isinstance(data, list):
        raise ValueError("Not a JSON array")
    invalid = [x for x in data if x not in ALLOWED_LABELS]
    if invalid:
        raise ValueError(f"Invalid labels: {invalid}")
    return [label for label in LABEL_ORDER if label in set(data)]


def compute_cost(model_info: dict, input_tokens: int, output_tokens: int) -> float:
    """
    Compute cost in USD.
    Note: output_tokens includes reasoning/thinking tokens, all billed at output rate.
    """
    return (input_tokens / 1e6) * model_info["price_in"] + (output_tokens / 1e6) * model_info["price_out"]


def get_mode_string(vendor: str, model: str = None, reasoning_effort: str = None, 
                    thinking_budget: int = None, thinking_level: str = None) -> str:
    """Get mode string for filenames."""
    if vendor == "openai":
        return f"reasoning-{reasoning_effort}" if reasoning_effort else "chat"
    elif vendor == "azure":
        return "chat"
    elif vendor == "anthropic":
        return f"thinking-{thinking_budget}" if thinking_budget else "chat"
    elif vendor == "google":
        if thinking_level == "minimal":
            return "chat"  # Only Flash supports minimal
        elif thinking_level == "low":
            return "thinking-low"  # Still uses some thinking
        elif thinking_level:
            return f"thinking-{thinking_level}"
        return "thinking"  # default for Gemini 3  
    elif vendor == "fireworks":
        if reasoning_effort:
            return f"reasoning-{reasoning_effort}"
        # Auto-detected thinking models
        if model and ("thinking" in model or "r1" in model or model == "kimi-k2p5"):
            return "thinking"
    elif vendor == "xai":
        return "chat" if model and "non-reasoning" in model else "reasoning"
    return "chat"
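
To illustrate the kinds of raw responses the parsing step has to cope with, here is a simplified stand-in for `parse_labels`. It is a sketch, not the notebook's function: `extract_last_array` handles only flat arrays and skips label validation, and the sample responses are invented.

```python
import json

def extract_last_array(text: str) -> list:
    """Simplified stand-in: return the last flat JSON array found in text."""
    # Drop any reasoning that precedes a closing think tag
    if "</think>" in text:
        text = text.split("</think>")[-1]
    # Drop markdown code-fence lines
    text = "\n".join(ln for ln in text.splitlines()
                     if not ln.strip().startswith("```"))
    start = text.rfind("[")
    end = text.find("]", start)
    if start == -1 or end == -1:
        raise ValueError("No JSON array found")
    return json.loads(text[start:end + 1])

samples = [
    '["Finance", "Accounting"]',              # clean answer
    '```json\n["IT"]\n```',                   # wrapped in a code fence
    '<think>maybe [Finance]?</think>\n[]',    # reasoning before the answer
]
for s in samples:
    print(extract_last_array(s))
```

A response with no array at all, such as a refusal, raises `ValueError`, which is what lets the retry logic treat it as a parse error.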

# Classify Sentence and Retry on Error

In [None]:
# -----------------------------
# Single Sentence Classification with Retry
# -----------------------------
def classify_sentence(client, sentence: str, vendor: str, model: str, model_info: dict,
                      system_prompt: str, max_tokens: int = 64, max_retries: int = 5,
                      reasoning_effort: str = None, thinking_budget: int = None,
                      thinking_level: str = None) -> dict:
    """
    Classify a single sentence with retry logic.
    
    Retry wait times: 3s, 3s, 6s, 12s, 30s
    Rate limit errors: 60s
    """
    # Import vendor-specific exceptions only when needed
    if vendor == "openai":
        rate_limit_errors = (RateLimitError,)
        api_errors = (APIError,)
        auth_errors = (AuthenticationError,)
    elif vendor == "azure":
        rate_limit_errors = (RateLimitError,)
        api_errors = (APIError,)
        auth_errors = (AuthenticationError,)
    elif vendor == "anthropic":
        import anthropic
        rate_limit_errors = (anthropic.RateLimitError,)
        api_errors = (anthropic.APIError,)
        auth_errors = (anthropic.AuthenticationError,)
    elif vendor == "fireworks":
        import openai
        rate_limit_errors = (openai.RateLimitError,)
        api_errors = (openai.APIError,)
        auth_errors = (openai.AuthenticationError,)
    elif vendor == "xai":
        import openai
        rate_limit_errors = (openai.RateLimitError,)
        api_errors = (openai.APIError,)
        auth_errors = (openai.AuthenticationError,)
    elif vendor == "google":
        from google.api_core import exceptions as google_exceptions
        rate_limit_errors = (google_exceptions.ResourceExhausted,)
        api_errors = (google_exceptions.GoogleAPIError,)
        auth_errors = (google_exceptions.Unauthenticated, google_exceptions.PermissionDenied)
    else:
        rate_limit_errors = ()
        api_errors = ()
        auth_errors = ()
    
    last_error = None
    error_details = None
    wait_times = {1: 3, 2: 3, 3: 6, 4: 12, 5: 30}
    
    for attempt in range(1, max_retries + 1):
        try:
            t0 = time.perf_counter()
            
            if vendor == "openai":
                result = call_openai(client, sentence, model, system_prompt, 
                                     max_tokens, reasoning_effort)
            elif vendor == "azure": 
                result = call_azure(client, sentence, model, system_prompt,
                                    max_tokens)
            elif vendor == "anthropic":
                result = call_anthropic(client, sentence, model, system_prompt,
                                        max_tokens, thinking_budget)
            elif vendor == "fireworks":
                result = call_fireworks(client, sentence, model, system_prompt,
                                        max_tokens, reasoning_effort)
            elif vendor == "xai":
                result = call_xai(client, sentence, model, system_prompt,
                                  max_tokens)
            elif vendor == "google":
                result = call_google(client, sentence, model, system_prompt,
                                     max_tokens, thinking_level)
            else:
                raise ValueError(f"Unknown vendor: {vendor}")
            
            latency = time.perf_counter() - t0
            labels = parse_labels(result["text"])
            
            return {
                "labels": labels,
                "response_text": result["text"],
                "input_tokens": result["input_tokens"],
                "output_tokens": result["output_tokens"],
                "internal_tokens": result.get("reasoning_tokens") or result.get("thinking_tokens", 0),
                "response_tokens": result.get("response_tokens", result["output_tokens"]),
                "cost_usd": compute_cost(model_info, result["input_tokens"], result["output_tokens"]),
                "latency_sec": latency,
                "attempts": attempt,
                "error": None,
                "raw_response": result.get("raw_response"),
            }
        
        except auth_errors as e:
            # Don't retry auth errors - they won't resolve
            error_details = {"type": "AuthenticationError", "attempt": attempt, "message": str(e)}
            print(f"  Authentication error (not retrying): {e}")
            return {
                "labels": None,
                "response_text": None,
                "input_tokens": 0,
                "output_tokens": 0,
                "internal_tokens": 0,
                "response_tokens": 0,
                "cost_usd": 0.0,
                "latency_sec": None,
                "attempts": attempt,
                "error": error_details,
                "raw_response": None,
            }
        
        except rate_limit_errors as e:
            last_error = f"RateLimitError: {e}"
            error_details = {"type": "RateLimitError", "attempt": attempt, "message": str(e)}
            print("  Rate limit hit, waiting 60s...")
            time.sleep(60)
        
        except api_errors as e:
            last_error = f"APIError: {e}"
            error_details = {"type": "APIError", "attempt": attempt, "message": str(e)}
            wait = wait_times.get(attempt, 3)
            print(f"  API error (attempt {attempt}), waiting {wait}s...")
            time.sleep(wait)
        
        except (json.JSONDecodeError, ValueError) as e:
            last_error = f"ParseError: {e}"
            response_preview = result.get("text", "")[:100] if 'result' in dir() and result else ""
            error_details = {"type": "ParseError", "attempt": attempt, "message": str(e), "response_preview": response_preview}
            wait = wait_times.get(attempt, 3)
            print(f"  Parse error (attempt {attempt}): got '{response_preview}', waiting {wait}s...")
            time.sleep(wait)
        
        except Exception as e:
            last_error = f"{type(e).__name__}: {e}"
            error_details = {"type": type(e).__name__, "attempt": attempt, "message": str(e)}
            print(f"  Error (attempt {attempt}): {last_error}")
            if attempt < max_retries:
                time.sleep(wait_times.get(attempt, 3))
    
    return {
        "labels": None,
        "response_text": None,
        "input_tokens": 0,
        "output_tokens": 0,
        "internal_tokens": 0,
        "response_tokens": 0,
        "cost_usd": 0.0,
        "latency_sec": None,
        "attempts": max_retries,
        "error": error_details,
        "raw_response": None,
    }
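
The error handling above follows a common pattern: give up immediately on errors that cannot resolve themselves (a bad API key), and wait a little longer after each attempt for errors that might. The following is a minimal sketch of that pattern in isolation, not the notebook's implementation. `AuthError`, `TransientError`, and `call_with_retries` are hypothetical names, and the default wait times are assumed values.

```python
import time


class AuthError(Exception):
    """Hypothetical stand-in for a vendor SDK's authentication error."""


class TransientError(Exception):
    """Hypothetical stand-in for a retryable API or parse error."""


def call_with_retries(fn, max_retries=3, wait_times=None, sleep=time.sleep):
    """Call fn() up to max_retries times.

    Returns (value, attempts, error), where error is None on success.
    """
    if wait_times is None:
        wait_times = {1: 1, 2: 2}  # assumed: seconds to wait after attempt n
    error = None
    for attempt in range(1, max_retries + 1):
        try:
            return fn(), attempt, None
        except AuthError as e:
            # A bad key will not fix itself, so give up immediately.
            error = {"type": "AuthenticationError", "attempt": attempt, "message": str(e)}
            return None, attempt, error
        except TransientError as e:
            error = {"type": "TransientError", "attempt": attempt, "message": str(e)}
            if attempt < max_retries:
                sleep(wait_times.get(attempt, 3))  # unlisted attempts wait 3s
    return None, max_retries, error


# Demo: a call that fails twice, then succeeds on the third attempt.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("temporary outage")
    return ["Finance"]


waits = []  # record the requested waits instead of actually sleeping
value, attempts, error = call_with_retries(flaky, sleep=waits.append)
print(value, attempts, error, waits)
```

Passing `sleep` as a parameter is only there to keep the demo fast; in real use the default `time.sleep` applies.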

# DataFrame Processing with Checkpointing
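
Checkpointing here means three things: reload earlier results if a checkpoint file exists, keep only the successful rows so that failed ones are retried, and write the results back to disk at a fixed interval. The sketch below illustrates that idea with plain lists and `pickle` so it runs without pandas (the notebook itself uses `pd.read_pickle` and `DataFrame.to_pickle`). `process_with_checkpoint` and the two stub classifiers are hypothetical and make no API calls.

```python
import os
import pickle
import tempfile


def process_with_checkpoint(items, classify, checkpoint_path, save_interval=2):
    """items: list of (item_id, text); classify: text -> dict with an 'error' key."""
    results, done = [], set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            saved = pickle.load(f)
        # Keep only successes; failed rows are dropped so they get retried.
        results = [r for r in saved if r["error"] is None]
        done = {r["id"] for r in results}

    for item_id, text in items:
        if item_id in done:
            continue  # already labeled in an earlier session
        record = {"id": item_id, "sentence": text, **classify(text)}
        results.append(record)
        if record["error"] is None:
            done.add(item_id)
        if len(results) % save_interval == 0:
            with open(checkpoint_path, "wb") as f:
                pickle.dump(results, f)

    with open(checkpoint_path, "wb") as f:  # final save, also used for resume
        pickle.dump(results, f)
    return results


items = [(1, "We issued senior notes."), (2, "We hired 200 engineers."), (3, "Ad spend rose.")]
path = os.path.join(tempfile.mkdtemp(), "demo_checkpoint.pkl")


def flaky_classify(text):
    # Stub: simulates an API failure for one sentence.
    if "hired" in text:
        return {"labels": None, "error": {"type": "APIError"}}
    return {"labels": ["stub"], "error": None}


seen = []


def good_classify(text):
    # Stub: always succeeds and records which sentences it was asked about.
    seen.append(text)
    return {"labels": ["stub"], "error": None}


first = process_with_checkpoint(items, flaky_classify, path)
second = process_with_checkpoint(items, good_classify, path)
print(len(first), len(second), seen)
```

In the second session only the sentence that failed earlier is sent to the classifier; the other two are restored from the checkpoint.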

In [None]:
# -----------------------------
# DataFrame Processing with Checkpointing
# -----------------------------
def classify_dataframe(
    df: pd.DataFrame,
    sentence_col: str,
    vendor: str,
    model: str,
    system_prompt: str = SYSTEM_PROMPT,
    max_tokens: int = 64,
    output_dir: str = "./output",           # default output folder if none is passed
    checkpoint_dir: str = "./checkpoints",  # default checkpoint folder if none is passed
    save_interval: int = 100,               # save a checkpoint every N processed rows
    run: int = 1,
    reasoning_effort: str = None,
    thinking_budget: int = None,
    thinking_level: str = None,
    max_consecutive_errors: int = 10,
) -> pd.DataFrame:
    """
    Classify all sentences with checkpointing.
    
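    File layout ({tag} is the last folder name of output_dir, lowercased):
    - checkpoint_dir/{tag}_{vendor}_{model}_{mode}_run{run}.pkl  (interim saves)
    - output_dir/Run{run}/{tag}_{vendor}_{model}_{mode}_run{run}.pkl  (final output)
    - output_dir/Run{run}/{tag}_{vendor}_{model}_{mode}_run{run}.csv  (final output as CSV)
    - output_dir/Run{run}/{tag}_{vendor}_{model}_{mode}_run{run}.log.json  (run log)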
    """
    # Validate
    if sentence_col not in df.columns:
        raise ValueError(f"Column '{sentence_col}' not found")
    if vendor not in VENDORS:
        raise ValueError(f"Unknown vendor: {vendor}")
    if model not in VENDORS[vendor]["models"]:
        raise ValueError(f"Unknown model '{model}' for vendor '{vendor}'")
    
    model_info = VENDORS[vendor]["models"][model]
    mode_str = get_mode_string(vendor, model, reasoning_effort, thinking_budget, thinking_level)
    # Derive dataset tag from output_dir (e.g., "./output/holdout" → "holdout")
    dataset_tag = os.path.basename(os.path.normpath(output_dir)).lower()
    base_name = f"{dataset_tag}_{vendor}_{model}_{mode_str}_run{run}"
    
    # Create directories
    os.makedirs(checkpoint_dir, exist_ok=True)
    run_dir = f"{output_dir}/Run{run}"
    os.makedirs(run_dir, exist_ok=True)
    
    checkpoint_path = f"{checkpoint_dir}/{base_name}.pkl"
    
    print(f"Checkpoint dir: {checkpoint_dir}")
    print(f"Output dir: {run_dir}")
    print(f"Checkpoint file: {checkpoint_path}")
    
    # Load existing checkpoint
    results = []
    processed_ids = set()
    
    if os.path.exists(checkpoint_path):
        try:
            checkpoint_df = pd.read_pickle(checkpoint_path)
            successful = checkpoint_df[checkpoint_df['error'].isna()]
            results = successful.to_dict('records')
            processed_ids = set(successful['id'].values) if 'id' in successful.columns else set()
            failed_count = len(checkpoint_df) - len(successful)
            print(f"Resumed: {len(results)} processed, {failed_count} errors to retry")
        except Exception as e:
            print(f"Could not load checkpoint: {e}")
    
    # Initialize client
    client = init_client(vendor)
    
    # Track errors
    consecutive_errors = 0
    start_time = datetime.datetime.now()
    
    # Process
    for idx, row in tqdm(df.iterrows(), total=len(df), desc=f"Run {run}: {model} ({mode_str})"):
        row_id = row.get('id', idx)
        
        if row_id in processed_ids:
            continue
        
        sentence = str(row[sentence_col])
        
        result = classify_sentence(
            client, sentence, vendor, model, model_info, system_prompt, max_tokens,
            reasoning_effort=reasoning_effort, thinking_budget=thinking_budget,
            thinking_level=thinking_level
        )
        
        ordered_result = {
            'id': row_id,
            'sentence': sentence,
            **result
        }
        results.append(ordered_result)
        
        if result['error']:
            consecutive_errors += 1
            if consecutive_errors >= max_consecutive_errors:
                print(f"\nStopping: {max_consecutive_errors} consecutive errors")
                break
        else:
            consecutive_errors = 0
            processed_ids.add(row_id)
        
        # Checkpoint save
        if len(results) % save_interval == 0:
            pd.DataFrame(results).to_pickle(checkpoint_path)
            print(f"\nCheckpoint saved: {len(results)} processed")
    
    # Final results
    results_df = pd.DataFrame(results)
    end_time = datetime.datetime.now()
    
    # Save to checkpoint (for resume if needed)
    results_df.to_pickle(checkpoint_path)
    
    # Save to run directory
    results_df.to_pickle(f"{run_dir}/{base_name}.pkl")
    results_df.to_csv(f"{run_dir}/{base_name}.csv", index=False)
    
    # Create log file
    error_count = int(results_df['error'].notna().sum())
    success_count = len(results_df) - error_count
    
    log = {
        "run": run,
        "vendor": vendor,
        "model": model,
        "supplier": model_info["supplier"],
        "mode": mode_str,
        "reasoning_effort": reasoning_effort,
        "thinking_budget": thinking_budget,
        "thinking_level": thinking_level,
        "start_time": str(start_time),
        "end_time": str(end_time),
        "duration_sec": (end_time - start_time).total_seconds(),
        "total_sentences": len(df),
        "processed_sentences": len(results_df),
        "successful": success_count,
        "errors": error_count,
        "tokens": {
            "input": int(results_df['input_tokens'].sum()),
            "output": int(results_df['output_tokens'].sum()),
            "internal": int(results_df['internal_tokens'].sum()),
            "response": int(results_df['response_tokens'].sum()),
        },
        "cost_usd": float(results_df['cost_usd'].sum()),
        "pricing": {
            "input_per_1m": model_info["price_in"],
            "output_per_1m": model_info["price_out"],
        },
        "files": {
            "checkpoint": checkpoint_path,
            "output_pkl": f"{run_dir}/{base_name}.pkl",
            "output_csv": f"{run_dir}/{base_name}.csv",
        }
    }
    
    log_path = f"{run_dir}/{base_name}.log.json"
    with open(log_path, 'w') as f:
        json.dump(log, f, indent=2)
    
    print(f"\nRun {run} complete:")
    print(f"  Sentences: {success_count}/{len(results_df)}")
    print(f"  Tokens - Input: {log['tokens']['input']:,}, Output: {log['tokens']['output']:,} (Internal: {log['tokens']['internal']:,})")
    print(f"  Cost: ${log['cost_usd']:.4f}")
    print(f"  Log: {log_path}")
    
    return results_df

# Multi-Run Processing

In [None]:
# -----------------------------
# Multi-Run Processing
# -----------------------------
def run_classification(
    df: pd.DataFrame,
    sentence_col: str,
    vendor: str,
    model: str,
    runs: list = [1, 2, 3], # default if nothing passed
    output_dir: str = "./output", # default if nothing passed
    checkpoint_dir: str = "./checkpoints", # default if nothing passed
    system_prompt: str = SYSTEM_PROMPT,
    max_tokens: int = 64,
    save_interval: int = 100,
    reasoning_effort: str = None,
    thinking_budget: int = None,
    thinking_level: str = None,
) -> dict:
    """
    Run classification multiple times.
    
    Directory structure ({tag} is the last folder name of output_dir, lowercased):
    output_dir/
      Run1/
        {tag}_{vendor}_{model}_{mode}_run1.pkl
        {tag}_{vendor}_{model}_{mode}_run1.csv
        {tag}_{vendor}_{model}_{mode}_run1.log.json
      Run2/
        ...
    checkpoint_dir/
      {tag}_{vendor}_{model}_{mode}_run1.pkl
      ...
    
    Returns dict of {run: results_df}
    """
    model_info = VENDORS[vendor]["models"][model]
    mode_str = get_mode_string(vendor, model, reasoning_effort, thinking_budget, thinking_level)
    job_start_time = time.perf_counter()
    
    print(f"\n{'='*60}")
    print(f"CLASSIFICATION JOB")
    print(f"{'='*60}")
    print(f"Vendor: {vendor}")
    print(f"Model: {model}")
    print(f"Supplier: {model_info['supplier']}")
    print(f"Mode: {mode_str}")
    print(f"Pricing: ${model_info['price_in']}/M in, ${model_info['price_out']}/M out")
    print(f"Runs: {runs}")
    print(f"Sentences: {len(df)}")
    print(f"Output dir: {output_dir}")
    print(f"Checkpoint dir: {checkpoint_dir}")
    print(f"{'='*60}\n")
    
    all_results = {}
    
    for run in runs:
        print(f"\n{'='*40}")
        print(f"RUN {run}")
        print(f"{'='*40}")
        
        results_df = classify_dataframe(
            df, sentence_col, vendor, model, system_prompt, max_tokens,
            output_dir=output_dir,
            checkpoint_dir=checkpoint_dir,
            save_interval=save_interval,
            run=run,
            reasoning_effort=reasoning_effort,
            thinking_budget=thinking_budget,
            thinking_level=thinking_level,
        )
        
        all_results[run] = results_df
    
    # Print summary
    job_runtime = time.perf_counter() - job_start_time
    print(f"\n{'='*60}")
    print("SUMMARY")
    print(f"{'='*60}")
    total_cost = sum(r['cost_usd'].sum() for r in all_results.values())
    print(f"Total runs: {len(runs)}")
    print(f"Total cost: ${total_cost:.4f}")
    print(f"Total runtime: {job_runtime:.1f}s ({job_runtime/60:.1f}m)")
    print(f"Output location: {output_dir}/Run*/")
    
    return all_results

# Load Data

> **If you are running this on your local computer:** Subfolders will be created automatically inside the folder that contains this notebook. All files will be saved in those local folders and subfolders.

> **If you are on Google Colab:** First, connect your Google Drive and navigate to the folder that contains this notebook. The code will then create subfolders inside that folder on your Google Drive, and all files will be saved there.

##### **IMPORTANT for Google Colab**

If you want to save to your Google Drive, you have to connect it first and then navigate to the folder that contains this notebook:

- Import Google Drive and connect
- Change directory to the folder that this notebook is in

***Copy this code into a new code cell, modify the folder as needed (if you don't want "Project"), and run it:***
```
from google.colab import drive
drive.mount('/content/drive')

import os

# Construct the full path to the desired folder within Google Drive
drive_path = '/content/drive/My Drive/Project'

# Create the directory if it doesn't exist
os.makedirs(drive_path, exist_ok=True)

# Change into that folder so relative paths (./output, ./checkpoints) resolve there
os.chdir(drive_path)
```

In [None]:
# Example Sentences
# sentences = ['The annual growth in our provision for uncollectible accounts was primarily attributable to transitioning multiple external payors from recognizing revenue on a cash basis to an accrual basis, thereby aligning with customer-related payment practices.', 'We handle certain sales-type leases internally, particularly those associated with facilities serving U.S. government hospital clients.', 'We expect that this acquisition will position us to broaden our offerings in the private label accessories segment, catering to customers seeking value-oriented solutions.', 'Additionally, the Tax Cuts and Jobs Act was enacted into law by the President in December 2017.', 'Valuations derived from the Black-Scholes model can vary significantly depending on the assumptions made regarding volatility and the duration of the underlying instruments.', 'Our audience is engaged across multiple channels, such as digital properties, print publications, as well as broadcast and streaming media.', 'The forecast incorporates management’s most informed assumptions regarding anticipated economic and market trends for the duration of the projection, encompassing anticipated changes in sales growth, cost structures, operating margin expectations, and future cash outlays.', 'As of December 31, 2007, a single credit remains unassigned to a particular reserve, which, absent any favorable developments, is at risk of defaulting imminently and may lead to the filing of a claim.', "Our model incorporates assumptions regarding anticipated stock price fluctuations, utilizing historical volatility metrics, the applicable risk-free rate derived from the treasury yield curve, the projected duration of equity awards informed by past exercise trends and termination actions post-vesting, along with anticipated dividend equivalents to be paid during the award's estimated term, given that our stock appreciation rights participate in dividends.", 'The Company conducts comprehensive physical counts 
of inventory across all store and warehouse locations at least once per year, and correspondingly updates the reported merchandise inventory to reflect the results of these assessments.', 'Substantial growth was realized across categories such as basketball and lifestyle footwear, as well as branded apparel and specialized cleats for wrestling, volleyball, and soccer.', 'Actuarial gains realized in 2007 will diminish the unrecognized loss allocated across the projected average remaining service years of active plan members, with such periods differing by plan from 6 to 23 years.', 'Platform utilization expanded across enterprise segments, reflecting enhanced operational efficiency and streamlined integration of internal systems, during fiscal 2020 for the technology company.', 'The advances carried maturities at assorted dates up to 2014, with interest rates ranging from 0.3% to 3.4%.', 'Furthermore, significant investments were made to upgrade essential operational systems, establish a comprehensive disaster recovery strategy, and strengthen compliance protocols designed to better serve our customers.', 'Subscription offerings expanded during fiscal 2023, reflecting successful margin optimization and process improvements within our technology platform, as we continued to innovate and refine our approach.', 'Provisions for loss contingencies are determined by management’s assessment regarding the probability of adverse results and the estimated magnitude of potential losses.', 'Before Discover Bank acquired SLC on December 31, 2010, SLC maintained a contractual relationship with Citi that facilitated the origination and ongoing management of private student loans for individuals.', 'Overview of BorgWarner Inc.’s Financial Condition and Operating Results Management’s Discussion and Analysis', 'We attribute the rise in RELIC watch sales chiefly to strategic modifications in our product offerings and the incremental expansion of our customer base achieved toward the 
end of Fiscal 2003.', 'PECO Electric Operating Statistics and Revenue Summary The following outlines PECO’s electric sales metrics and associated revenue information: (a) Full service represents energy provided to customers receiving electricity under standard tariffed rates.', 'As part of the 2009 Supervisory Capital Assessment Program, the Federal Reserve Board enhanced its review of the capital sufficiency of selected major bank holding companies by utilizing an alternative measure of Tier 1 capital referred to as Tier 1 common equity.', 'During the years ended 2008, 2007, and 2006, none of our customers individually represented 10% or greater of our overall revenue.', 'Other than our operating leases, which mainly pertain to restaurant locations, we do not engage in any off-balance sheet transactions.', 'Policyholders have the option to discontinue coverage as a result of their departure from medical practice due to retirement, disability, or passing.', 'Red Robin Gourmet Burgers®, Inc., a Delaware entity along with its subsidiaries (referred to herein as “Red Robin,” the “Company,” or by similar terms), is engaged predominantly in the development, operation, and franchising of casual-dining restaurants, totaling 514 outlets across North America as of the close of the fiscal year on December 28, 2014.', 'As of December 31, 2008, 51% of the outstanding home equity lines of credit were collateralized by properties located in New York State, while 21% and 26% were backed by properties in Pennsylvania and the Mid-Atlantic region, respectively.', "The decline in property catastrophe offerings was attributable to non-renewed contracts and decreased participation levels, influenced by prevailing market dynamics and an increased reliance on retrocessional arrangements, whereas non-catastrophe property business during 2014 comprised an incoming unearned premium portfolio transfer of $50.2 million from Gulf Reinsurance Limited ('Gulf Re'), which the Company acquired in 
2015.", 'IPL’s operating performance reflected a $149 million decline in earnings allocable to common shareholders during 2008 and a $118 million improvement in 2007, which was largely attributable to a post-tax gain of $123 million stemming from the divestiture of IPL’s electric transmission assets in 2007.', 'Incorporating remote work practices forms an integral aspect of our overall business continuity strategy.', 'Upon the sale of a gift card, we record a corresponding liability on our balance sheet.', 'Management’s Discussion and Analysis Procedures for the Company and its Subsidiaries.', 'Our Carter’s, Just One Year, and Child of Mine brands are made available to third parties through licensing agreements.', 'SPP has initiated approvals for the development of transmission infrastructure designed to transport renewable energy from the wind-producing regions of western Oklahoma, the Texas Panhandle, and western Kansas to major demand centers, as part of its strategy to expand transmission capacity in these areas.', 'Consequently, we record revenue at the point when ownership is conveyed to our customer.', 'The credits provided will fluctuate from one quarter to the next, and ultimately, the cumulative benefits passed on to customers in the form of reduced electric rates will align with the amounts applied toward federal income tax obligations.', 'Outcomes may vary, and the divergence from these projections could be significant depending on alternative assumptions or circumstances.', 'Accruals associated with chargebacks and rebates involve a significant degree of estimation within our sales processes.', 'In the quarter concluding on September 30, 2013, we finalized a partnership arrangement with LPC MM Monrovia, LLC, an independent third party, specifically to facilitate the acquisition of a real estate asset in Monrovia, California, and to initiate the development of construction documentation for enhancements to the site.', 'Periodic legal proceedings or 
regulatory inquiries arise against the Company as a normal aspect of conducting its insurance operations.', 'In the interim between lease termination and final real estate disposition, we provided significant loan-based support to the hospital operator, enabling them to meet operational cash needs while awaiting reimbursement for patient services from Medicare and other payors.', 'Pursuant to the CRS agreement, close to one-fourth of the contract’s total value becomes payable by the customer and is subject to collection solely after launch and delivery objectives have been achieved for each of the eight CRS missions.', 'In 2008, the Company completed toll processing of 126,000 ounces of PGMs, an increase compared to the 112,000 ounces processed under toll arrangements in the prior year.', 'Operations within our ready-mixed concrete, precast concrete, and ancillary concrete businesses experience fluctuations attributable to seasonal patterns.', 'Historically, our highest working capital requirements arise in the latter half of the year, as accounts receivable and inventory expand due to elevated activity during the holiday sales period, and inventory builds ahead of anticipated factory shutdowns for Chinese New Year observances.', 'Cost of sales primarily includes expenses related to products, such as major aircraft and engine components, along with direct labor, overhead, and costs associated with maintaining aircraft.', 'Our proportional interest in the mines under our management affords us an annual rated pellet production capacity of 22.9 million tons, which equates to roughly 28% of the aggregate pellet capacity available throughout North America.', 'For the fiscal year ended December 31, 2009, a total of 338 da Vinci Surgical Systems were sold, representing a slight increase from the 335 units sold in the prior year ended December 31, 2008.', 'During the 2009 fiscal year, the Company incurred costs attributable to flooding totaling $7.6 million, while 
recognizing $16.7 million in recoveries from insurance claims.', 'The rise was tempered in part by the enhanced redeployment incentives offered through our equipment lease initiative to new customers.', 'The Company provides defined benefit pension arrangements, primarily benefiting salaried and management employees, and oversees the associated post-retirement medical plan accounting.', 'The gross profit margin declined to 30.7% in 2013 as compared to 31.4% in the previous year, largely attributable to a rise in sales from private label and international segments, which traditionally yield lower margins.', 'A substantial portion of our multi-family lending activity is directed toward longstanding property owners whose apartment buildings operate under rent control, resulting in rental rates that are lower than prevailing market levels.', 'Growth in distillery product revenues was driven by elevated unit volumes and enhanced pricing for food grade alcohol used in both beverage and industrial sectors, alongside strengthened pricing for fuel grade alcohol.', 'In October 2005, we completed delivery of the system to the customer.', 'On July 29, 2014, the Company completed an issuance of Senior Notes totaling $300 million, maturing on February 1, 2025, with a fixed interest rate of 5.375% and sold at par value, referred to herein as the 2014 Notes.', 'Expenses associated with reactivating these subscribers were recognized, and these customers were categorized as gross new DISH TV subscriber additions for the year ended December 31, 2017, with related costs captured under “Subscriber acquisition costs” within our Consolidated Statements of Operations and Comprehensive Income (Loss) and/or as “Purchases of property and equipment” in our Consolidated Statements of Cash Flows.', "Refer to 'Item 8' under the section 'Loans Receivable.'", 'Valuation of swaps, interest rate swaptions, and option contracts is determined through commonly accepted industry models that estimate the 
present value of anticipated derivative cash flows, reflecting both prevailing and anticipated market conditions.', 'On February 5, 2010, a total of 68 workover rigs were staffed and were either in service or subject to ongoing marketing efforts.', 'As a result of our commitment to making education more affordable for our students, we project that the typical debt burden for graduates of the Art Institutes has declined by nearly 15% since 2010.', 'The market valuation of distressed inventory is determined utilizing prior sales patterns for specific product categories, prevailing market dynamics, overall economic factors, and the worth of outstanding in-house orders associated with prospective sales of such inventory.', 'The Company factors such discounts and rebates into its determination of transaction price, recording them as deductions from gross sales.', 'Within the Caribbean region, we operate vertically integrated utilities as well as generation facilities, each governed by multi-year agreements established in partnership with governmental entities.', 'Nonetheless, despite Synovus Bank meeting all required quantitative capital standards, regulatory provisions allow for reclassification of an institution to a less favorable capital category, dependent on supervisory considerations beyond capital metrics.', 'For most of the securities measured through dealers and pricing services, we source several independent valuations, all of which are not legally binding on either our company or our counterparties.', '59 Overseas Shipholding Group, Inc. serves as the registry for this section.', 'Our strategy is to consistently keep U.S. 
wholesaler inventories for our products at or below a one-month average, supporting reliable supply for our customers.', 'Aside from the $202.9 million gain recognized in 2011 from the sales of Big Sandy and Langley, the year-over-year decline also reflected roughly 25% reduced realized prices for sales volumes, a 23% increase in the rate of production depletion, and elevated other operating costs, which were partially counterbalanced by a 33% rise in production output, a 30% expansion in gathered volumes, and increased transmission income.', '• Establishing further criteria for evaluating applications for disability-related benefit claims.', 'For fiscal year 2016, revenue totaled $28.2 million, representing a decline of $11.3 million, or 28.5%, compared to the prior year’s revenue of $39.5 million.', 'During 2010, the Company divested its Orthopaedic Implant production site located in Caen, France, resulting in a $24.3 million gain that was recognized within selling, general, and administrative expenses.', 'Illustrative instances of indemnification obligations under the Separation and Distribution Agreement and associated ancillary agreements include Navient’s duty to compensate the Company and the Bank for any liabilities, costs, or expenses they may face resulting from actions or potential actions concerning pre-Spin-Off SLM and its subsidiaries’ servicing, operational, and collections activities related to Private Education Loans and FFELP Loans that were held as assets by either the Bank or Navient at the time of the Spin-Off, contingent upon written notice being furnished to Navient by April 30, 2017, the third anniversary of the Spin-Off.', 'Growth in platform usage was observed, driven by improved internal workflows and resource allocation, during fiscal 2022 for the telecommunications company.', 'Acquisitions pursuant to the Plan may occur through open market transactions or privately arranged agreements, conducted periodically and in accordance with 
relevant legal requirements, inclusive of Rule 10b-18 under the amended Securities Exchange Act of 1934.', 'We ended our bridge revolving credit facility, originally established on June 13, 2007, with five U.S. and international banking institutions in February 2008.', "Our management team regularly assesses the company's cost structure to confirm that operational expenditures are managed effectively without compromising the quality of service provided to customers.", 'The company’s reach broadened in fiscal 2021, reflecting enhanced productivity and optimization of internal processes in expanded territories.', 'A portion of these rising costs was absorbed by the company, resulting in an average sales price uptick to customers of just $0.32 per kilogram.', 'The company saw advances in payment collection efficiency, reflecting cost reduction initiatives and automation of reconciliation tasks.', 'Advancements in technology have the potential to mitigate these factors or expand the availability of mineral resources.', 'Each of our two plans designed for small office and home office use provides a single business fax line at no extra cost, with supplementary fax lines available for a monthly fee.', 'In September 2011, the European Commission approved VIBATIV® for use in adult patients with hospital-acquired pneumonia, including cases linked to ventilators, where MRSA is known or suspected and alternative therapies are deemed inappropriate, thereby expanding therapeutic options for these individuals.', 'These trends are subject to modification due to factors such as the establishment of additional schools, launch of innovative programs, rising adult student enrollment, or potential acquisitions.', 'Funding requests, which generally arise at intervals of four to eight weeks, trigger the execution of necessary inspections and formal evaluations.', 'Assessing our yearly tax obligations and analyzing our tax strategies necessitates considerable judgment and expertise.', 'On 
March 24, 2006, the Grindle plaintiffs retracted their request to intervene, doing so without affecting their rights.', 'At December 31, 2017, the balance of commercial and industrial loans, which includes owner-occupied commercial properties, grew by $0.3 billion, reaching a total of $1.6 billion, reflecting support for our business customers’ financing needs.', 'After the merger, ARRIS retained complete indirect ownership of both ARRIS Group and Pace as wholly-owned subsidiaries.', '•On August 17, 2012, we provided a first mortgage loan totaling $46.0 million, secured by a Hilton hotel consisting of 315 rooms located in Rockville, Maryland.', 'Our Management’s Discussion and Analysis of Financial Condition and Results of Operations highlights our position as a foremost provider of independent, technology-driven portfolio management solutions, investment guidance, and retirement income services, serving participants predominantly within employer-sponsored defined contribution plans, including 401(k) arrangements.', 'The structure of our per-minute charges is designed to encompass the entirety of services provided.', 'On June 30, 2009, nearly 83% of our holdings, excluding cash balances, possessed maturities shorter than one year.', 'Distribution of Portfolio by Geography: The following table presents the Company’s operating square footage segmented by region at December 31, 2011 (in thousands). 
Industry Credit Risk Profile: The subsequent data illustrates the composition of our tenant portfolio by industry as of December 31, 2011.', '2 Includes additional products such as consumer revolving credit, installment loans, and consumer lease finance options tailored to customer needs.', 'Currently anticipated capital expenditures for 2006 total approximately $568.9 million, encompassing $24.0 million allocated to startup activities, resolution of the thruster defect previously detailed, and modifications mandated by customers for the GSF Development Driller I, $237.3 million dedicated to construction of a new semisubmersible unit, $45.0 million earmarked for repairs to our rig fleet due to hurricane impacts, $124.0 million assigned for significant fleet enhancements, $107.8 million directed toward additional equipment acquisitions and replacements, $17.3 million related to capitalized interest, and $13.5 million (net of intersegment eliminations) for oil and gas operations.', 'The Company plans to maintain its practice of issuing quarterly dividends, contingent upon Board approval, sufficient capital resources, and an assessment that such distributions remain advantageous for its shareholders.', 'As of December 31, 2008, the Company had no borrowings outstanding under its credit facility, with a total available capacity of roughly $48.4 million, reflecting the deduction for a $7.0 million letter of credit issued to BCBSF/HOI.', 'The progression of our product offerings relies on the availability and consistent performance of software sourced from external vendors.', 'The Company establishes BESP for its product deliverables through a weighted average pricing methodology, initiated by an examination of historical data on standalone sales transactions.']
# df = pd.DataFrame({'sentences': sentences})
# df=df.head(3)

In [None]:
# Load Data: Preliminary Labeling of all available data
df = pd.read_csv("source_text/10K_unlabeled.csv")
df = df.sample(n=10).reset_index(drop=True)  # different sample each run
df

# Set Output Paths
- Preliminary ("PRELIM_DIR") vs. Holdout ("HOLDOUT_DIR") vs. Train ("TRAIN_DIR") sets
- ***Make sure not to accidentally overwrite your labeled data!***
- When you query a model (below in Run Models), you need to specify the output path.

In [None]:
# =============================================================================
# OUTPUT CONFIGURATION - SET for Base, Holdout, Train
# =============================================================================
CHECKPOINT_DIR = "./checkpoints"
OUTPUT_DIR = "./output"
PRELIM_DIR = f"{OUTPUT_DIR}/preliminary"
HOLDOUT_DIR = f"{OUTPUT_DIR}/holdout"
TRAIN_DIR = f"{OUTPUT_DIR}/train"
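
One way to lower the risk of overwriting labeled data is to create the output folders up front and refuse to write to a file that already exists. The sketch below is only an illustration: the helper name `safe_output_path` and the file name `labels.pkl` are hypothetical and not part of this notebook's `run_classification`. It runs in a temporary folder so no real output is touched.

```python
import os
import tempfile

def safe_output_path(directory, filename):
    """Return the full path for a new file, refusing to reuse an existing one.

    Hypothetical helper: it illustrates an overwrite guard only.
    """
    os.makedirs(directory, exist_ok=True)      # create the folder if it is missing
    path = os.path.join(directory, filename)
    if os.path.exists(path):
        raise FileExistsError(f"{path} already exists; move or rename it first")
    return path

# Demo in a temporary folder (stand-in for ./output/holdout)
base = tempfile.mkdtemp()
demo_holdout_dir = os.path.join(base, "output", "holdout")

first = safe_output_path(demo_holdout_dir, "labels.pkl")
open(first, "w").close()                       # simulate a saved result

try:
    safe_output_path(demo_holdout_dir, "labels.pkl")
    blocked = False
except FileExistsError:
    blocked = True

print("Second write blocked:", blocked)
```

A guard like this fails loudly instead of silently replacing earlier labels, which may be the safer default when each run costs API credits.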

# Run Models
- Set desired parameters
- Set runs (currently set to 1 run per model for examples): runs=[1] vs runs=[1, 2, 3]
- Uncomment the models you want to run!
- Run different models? Make sure to define them above in the vendor/model configuration first.
  - Not all models will run with this code.
  - If you get errors, you need to debug and develop code further
  - (I'd use Claude Opus 4.5 Thinking for that, but other genAI models may work as well / even better).

> ***IMPORTANT*** If you want to rerun a model, make sure to delete its files from the checkpoints folder first. Otherwise, it will skip all examples (e.g., sentences) that the previous run already labeled (which could be all) and you don't get updated results.

> **MORE IMPORTANT** Running this code will cost you API credits (and requires you to have accounts with the providers). You will need to supply your own API keys. Beware that you may be subject to rate limits (how many queries you can send per minute) and to restrictions on which models you can use (OpenAI, for example, requires you to verify your identity with an ID to access many models). Regardless, every time you execute this code, you will drain your API credits = real money! Thus, make wise decisions about what and how much to label.
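
To rerun a model as described in the first note above, its checkpoint files have to go first. The sketch below shows one cautious way to do that with a dry run before deleting anything. It assumes that checkpoint file names contain `<vendor>_<model>`; that naming pattern, the helper name `clear_checkpoints`, and the demo file names are assumptions, so check your own checkpoints folder before relying on it.

```python
import glob
import os
import tempfile

def clear_checkpoints(checkpoint_dir, vendor, model, dry_run=True):
    """List (and optionally delete) checkpoint files for one vendor/model.

    Assumption: checkpoint file names contain "<vendor>_<model>".
    """
    name_part = glob.escape(f"{vendor}_{model}")
    pattern = os.path.join(glob.escape(checkpoint_dir), f"*{name_part}*")
    matches = sorted(glob.glob(pattern))
    if not dry_run:
        for path in matches:
            os.remove(path)
    return matches

# Demo in a temporary folder with made-up file names
ckpt = tempfile.mkdtemp()
for name in ("azure_gpt-4.1_chat_run1.json", "azure_gpt-4o_chat_run1.json"):
    open(os.path.join(ckpt, name), "w").close()

listed = clear_checkpoints(ckpt, "azure", "gpt-4.1")          # dry run: nothing is deleted
files_after_dry_run = sorted(os.listdir(ckpt))

clear_checkpoints(ckpt, "azure", "gpt-4.1", dry_run=False)    # now delete the matches
files_after_delete = sorted(os.listdir(ckpt))
print(files_after_delete)
```

Note that a substring pattern also matches longer model names that start the same way, so reviewing the dry-run list before deleting is worthwhile.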

## Labeling Approach:

**What we ultimately need:**
- A holdout set (1000 examples)
- A training set (15k examples)
- The ability to test and train all classes / labels
  - Need to have at least some balance in holdout set so that every class (label) is represented (e.g., at least 100 times)
  - Need enough examples per class (label) in train set that fine-tuned model can learn them
> **Challenge**
> - How do we know that we have enough examples (i.e., sentences) per class (label)?

> **Idea**
> - One possible approach is to label all data only once with a reasonably fast and inexpensive genAI model to get an idea about class (label) distribution. Depending on how good the model is (which we cannot easily know yet unless we have a couple dozen manually constructed examples as a preliminary evaluation set), we have at least a *directional idea* of which sentences may belong to which classes / have which labels.
> - We can then use this *directional idea* to ***construct a holdout set*** with at least N=100 (or more?) examples per class (label). This will be better than random sampling (unless classes (labels) are balanced in full data set, which is unlikely). It will still not be robust, but *at least give us some idea*.
> - We need to make sure to remove all examples that we put in the holdout set from the rest of our data (i.e., no holdout set example should also be in the training set to ***prevent leakage***). 
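
The idea above can be sketched in a few lines of plain Python: count how often each label appears in the preliminary pass, flag classes that fall short of the minimum, and keep holdout and train disjoint. The sentence IDs, labels, and the threshold `MIN_NEEDED` below are toy values invented for illustration (the threshold stands in for N=100).

```python
from collections import Counter

CLASSES = ["Marketing", "Finance", "Accounting", "Operations", "IT", "HR"]

# Toy preliminary labels keyed by sentence ID (hypothetical data)
prelim = {
    "s1": ["Finance"],
    "s2": ["Finance", "Accounting"],
    "s3": ["HR"],
    "s4": [],
    "s5": ["Finance"],
}
MIN_NEEDED = 2  # stand-in for N=100 in the text

# Directional class distribution from the single preliminary pass
counts = Counter(lbl for labels in prelim.values() for lbl in labels)
scarce = [c for c in CLASSES if counts[c] < MIN_NEEDED]
print("Classes below the minimum:", scarce)

# Leakage prevention: everything placed in the holdout is excluded from train
holdout_ids = {"s2", "s3"}
train_ids = [sid for sid in prelim if sid not in holdout_ids]
assert holdout_ids.isdisjoint(train_ids)
print("Train IDs:", train_ids)
```

With real data, scarce classes are the ones to sample first when building the holdout, because random sampling would likely miss them.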

### WARNING: MAKE SURE TO **PASS** THE CORRECT OUTPUT DIR!
What are you labeling? 
- Preliminary full data set (once to investigate class balance) --> PRELIM_DIR
- Holdout set -->  HOLDOUT_DIR
- Train set --> TRAIN_DIR

### Google CoLab Runtime Timeouts
If you have a slow API and/or you are labeling many sentences (texts), Google CoLab may time out or shut down your runtime. This will abort the labeling process. You can resume anytime, but you need to restart and run your notebook again. For this purpose, I've added checkpointing so that results are saved every N=100 sentences (texts), and the code will look for an interim checkpoint and pick up from there (it will also find and retry errors where nothing valid was returned). *This may be less of a problem with CoLab Pro, which you can sign up for free of charge as a student.*

> If you can, it may be well worth installing python and jupyter notebooks on your local computer. You won't face the timeout issues then and your code will query the APIs as long as your computer is connected to the internet (and running).
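
For readers who want to see the checkpointing idea in isolation, here is a minimal sketch of a save-and-resume loop. It is not the notebook's implementation: the function `label_with_checkpoints`, the JSON checkpoint format, and the fake labeler are assumptions made for illustration.

```python
import json
import os
import tempfile

def label_with_checkpoints(texts, label_fn, checkpoint_path, every=100):
    """Label texts, saving progress every `every` items and resuming if possible.

    Hypothetical sketch: failed items are stored as None and retried next time.
    """
    done = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)                 # resume from an earlier session
    for i, text in enumerate(texts):
        key = str(i)
        if done.get(key) is not None:           # already labeled successfully
            continue
        try:
            done[key] = label_fn(text)
        except Exception:
            done[key] = None                    # mark as failed so a later run retries
        if (i + 1) % every == 0:
            with open(checkpoint_path, "w") as f:
                json.dump(done, f)
    with open(checkpoint_path, "w") as f:       # final save
        json.dump(done, f)
    return done

# Demo with a fake labeler that fails once on the text "c"
calls, failed_once = [], set()

def fake_label(text):
    calls.append(text)
    if text == "c" and text not in failed_once:
        failed_once.add(text)
        raise RuntimeError("simulated API error")
    return ["Finance"]

path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
texts = ["a", "b", "c", "d", "e"]
first = label_with_checkpoints(texts, fake_label, path, every=2)
second = label_with_checkpoints(texts, fake_label, path, every=2)
print("API calls made in total:", len(calls))
```

The second call only queries the one item that failed before, which is the behavior you want after a runtime timeout: no paid query is repeated unnecessarily.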

In [1]:
# =============================================================================
# UNC Main Campus Computing on MS Azure
# Models: gpt-4.1, gpt-4o
# Modes: chat only!!! (temperature=0 set with openAI models) 
# Notes: Experimental by Dr. D. To get an odd number of labels per sentence across two models, I queried the first model 4 times (4 + 3 = 7)
# =============================================================================

# results = run_classification(
#     df,
#     sentence_col="sentences",
#     vendor="azure",
#     model="gpt-4.1",
#     runs=[1],
#     output_dir=OUTPUT_DIR,      # overrides default: OUTPUT_DIR vs. HOLDOUT_DIR vs. TRAIN_DIR vs. PRELIM_DIR
#     checkpoint_dir=CHECKPOINT_DIR,   # overrides default
# )

# results = run_classification(
#     df,
#     sentence_col="sentences",
#     vendor="azure",
#     model="gpt-4o",
#     runs=[1],
#     output_dir=OUTPUT_DIR,      # overrides default: OUTPUT_DIR vs. HOLDOUT_DIR vs. TRAIN_DIR vs. PRELIM_DIR
#     checkpoint_dir=CHECKPOINT_DIR,   # overrides default
# )

In [None]:
# =============================================================================
# OPENAI
# Models: gpt-5.2, gpt-5, gpt-5-mini, gpt-5-nano, gpt-4.1, gpt-4o
# Modes: chat (temperature=0) or reasoning (reasoning_effort)
# reasoning_effort: "low", "medium", "high"
# =============================================================================

# # OpenAI chat mode (temperature=0)
# results = run_classification(
#     df,
#     sentence_col="sentences",
#     vendor="openai",
#     model="gpt-4.1",  # gpt-5.2, gpt-5, gpt-5-mini, gpt-5-nano, gpt-4.1, gpt-4o --> make sure these are defined with pricing in vendor/model configuration
#     runs=[1],
#     output_dir=OUTPUT_DIR,     # overrides default: OUTPUT_DIR vs. HOLDOUT_DIR vs. TRAIN_DIR vs. PRELIM_DIR
#     checkpoint_dir=CHECKPOINT_DIR,   # overrides default
# )

# OpenAI reasoning mode
# results = run_classification(
#     df,
#     sentence_col="sentences",
#     vendor="openai",
#     model="gpt-5.2",  # gpt-5.2, gpt-5, gpt-5-mini, gpt-5-nano, gpt-4.1, gpt-4o --> make sure these are defined with pricing in vendor/model configuration
#     reasoning_effort="high",  # "low", "medium", "high"
#     runs=[1],
#     output_dir=OUTPUT_DIR,      # overrides default: OUTPUT_DIR vs. HOLDOUT_DIR vs. TRAIN_DIR vs. PRELIM_DIR
#     checkpoint_dir=CHECKPOINT_DIR,   # overrides default
# )

In [None]:
# =============================================================================
# ANTHROPIC
# Models: claude-opus-4-5-20251101, claude-sonnet-4-5-20250929, 
#         claude-haiku-4-5-20251001, claude-sonnet-4-20250514
# Modes: chat (temperature=0) or thinking (thinking_budget)
# thinking_budget: min 1024, billed as output tokens
# =============================================================================

# # Anthropic chat mode (temperature=0)
# results = run_classification(
#     df,
#     sentence_col="sentences",
#     vendor="anthropic",
#     model="claude-sonnet-4-5-20250929",  # opus-4-5, sonnet-4-5, haiku-4-5, sonnet-4
#     runs=[1],
#     output_dir=OUTPUT_DIR,      # overrides default: OUTPUT_DIR vs. HOLDOUT_DIR vs. TRAIN_DIR vs. PRELIM_DIR
#     checkpoint_dir=CHECKPOINT_DIR,   # overrides default
# )

# # Anthropic thinking mode
# results = run_classification(
#     df,
#     sentence_col="sentences",
#     vendor="anthropic",
#     model="claude-sonnet-4-5-20250929",  # opus-4-5, sonnet-4-5, haiku-4-5, sonnet-4
#     thinking_budget=2048,  # min 1024, billed as output tokens
#     runs=[1],
#     output_dir=OUTPUT_DIR,      # overrides default: OUTPUT_DIR vs. HOLDOUT_DIR vs. TRAIN_DIR vs. PRELIM_DIR
#     checkpoint_dir=CHECKPOINT_DIR,   # overrides default
# )

In [None]:
# =============================================================================
# FIREWORKS
# Models: deepseek-v3p2, deepseek-r1-0528, qwen3-vl-235b-a22b-instruct,
#         qwen3-vl-235b-a22b-thinking, kimi-k2p5
# Modes: auto-detected from model name
#   - Chat: deepseek-v3p2, qwen3-vl-235b-a22b-instruct
#   - Thinking: deepseek-r1-0528, qwen3-vl-235b-a22b-thinking, kimi-k2p5
# =============================================================================

# # DeepSeek V3.2 (chat)
# results = run_classification(
#     df,
#     sentence_col="sentences",
#     vendor="fireworks",
#     model="deepseek-v3p2",
#     runs=[1],
#     output_dir=OUTPUT_DIR,     # overrides default: OUTPUT_DIR vs. HOLDOUT_DIR vs. TRAIN_DIR vs. PRELIM_DIR
#     checkpoint_dir=CHECKPOINT_DIR,   # overrides default
# )

# # DeepSeek R1 (thinking)
# results = run_classification(
#     df,
#     sentence_col="sentences",
#     vendor="fireworks",
#     model="deepseek-r1-0528",
#     runs=[1],
#     output_dir=OUTPUT_DIR,      # overrides default: OUTPUT_DIR vs. HOLDOUT_DIR vs. TRAIN_DIR vs. PRELIM_DIR
#     checkpoint_dir=CHECKPOINT_DIR,   # overrides default
# )

# # Qwen3 VL Instruct (chat)
# results = run_classification(
#     df,
#     sentence_col="sentences",
#     vendor="fireworks",
#     model="qwen3-vl-235b-a22b-instruct",
#     runs=[1],
#     output_dir=OUTPUT_DIR,     # overrides default: OUTPUT_DIR vs. HOLDOUT_DIR vs. TRAIN_DIR vs. PRELIM_DIR
#     checkpoint_dir=CHECKPOINT_DIR,   # overrides default
# )

# # Qwen3 VL Thinking (thinking)
# results = run_classification(
#     df,
#     sentence_col="sentences",
#     vendor="fireworks",
#     model="qwen3-vl-235b-a22b-thinking",
#     runs=[1],
#     output_dir=OUTPUT_DIR,      # overrides default: OUTPUT_DIR vs. HOLDOUT_DIR vs. TRAIN_DIR vs. PRELIM_DIR
#     checkpoint_dir=CHECKPOINT_DIR,   # overrides default
# )

# # Kimi K2.5 (thinking)
# results = run_classification(
#     df,
#     sentence_col="sentences",
#     vendor="fireworks",
#     model="kimi-k2p5",
#     runs=[1],
#     output_dir=OUTPUT_DIR,     # overrides default: OUTPUT_DIR vs. HOLDOUT_DIR vs. TRAIN_DIR vs. PRELIM_DIR
#     checkpoint_dir=CHECKPOINT_DIR,   # overrides default
# )

In [None]:
# =============================================================================
# XAI (GROK)
# Models: grok-4, grok-4-1-fast-reasoning, grok-4-1-fast-non-reasoning
# Modes: auto-detected from model name
#   - Reasoning: grok-4, grok-4-1-fast-reasoning
#   - Chat: grok-4-1-fast-non-reasoning
# =============================================================================

# # Grok 4 (reasoning)
# results = run_classification(
#     df,
#     sentence_col="sentences",
#     vendor="xai",
#     model="grok-4",  # grok-4, grok-4-1-fast-reasoning, grok-4-1-fast-non-reasoning
#     runs=[1],
#     output_dir=OUTPUT_DIR,    # overrides default: OUTPUT_DIR vs. HOLDOUT_DIR vs. TRAIN_DIR vs. PRELIM_DIR
#     checkpoint_dir=CHECKPOINT_DIR,   # overrides default
# )

# # Grok 4.1 Fast Reasoning
# results = run_classification(
#     df,
#     sentence_col="sentences",
#     vendor="xai",
#     model="grok-4-1-fast-reasoning",
#     runs=[1],
#     output_dir=OUTPUT_DIR,      # overrides default: OUTPUT_DIR vs. HOLDOUT_DIR vs. TRAIN_DIR vs. PRELIM_DIR
#     checkpoint_dir=CHECKPOINT_DIR,   # overrides default
# )

# # Grok 4.1 Fast Non-Reasoning (chat)
# results = run_classification(
#     df,
#     sentence_col="sentences",
#     vendor="xai",
#     model="grok-4-1-fast-non-reasoning",
#     runs=[1],
#     output_dir=OUTPUT_DIR,      # overrides default: OUTPUT_DIR vs. HOLDOUT_DIR vs. TRAIN_DIR vs. PRELIM_DIR
#     checkpoint_dir=CHECKPOINT_DIR,   # overrides default
# )

In [None]:
# =============================================================================
# GOOGLE GEMINI
# Models: gemini-3-pro-preview, gemini-3-flash-preview, 
#         gemini-2.5-pro, gemini-2.5-flash
# Modes: controlled via thinking_level
#   - Pro:   "low", "high" (default if not specified)
#   - Flash: "minimal", "low", "medium", "high" (default if not specified)
# Note: Even "low" uses some thinking tokens. Only Flash "minimal" truly minimizes.
# =============================================================================

# # Gemini 3 Pro - high thinking (default)
# results = run_classification(
#     df,
#     sentence_col="sentences",
#     vendor="google",
#     model="gemini-3-pro-preview",  # gemini-3-pro-preview, gemini-3-flash-preview, gemini-2.5-pro, gemini-2.5-flash
#     thinking_level="high",  # Pro: "low", "high"
#     runs=[1],
#     output_dir=OUTPUT_DIR,      # overrides default: OUTPUT_DIR vs. HOLDOUT_DIR vs. TRAIN_DIR vs. PRELIM_DIR
#     checkpoint_dir=CHECKPOINT_DIR,   # overrides default
# )

# # Gemini 3 Pro - low thinking
# results = run_classification(
#     df,
#     sentence_col="sentences",
#     vendor="google",
#     model="gemini-3-pro-preview",
#     thinking_level="low",  # Pro: "low", "high"
#     runs=[1],
#     output_dir=OUTPUT_DIR,      # overrides default: OUTPUT_DIR vs. HOLDOUT_DIR vs. TRAIN_DIR vs. PRELIM_DIR
#     checkpoint_dir=CHECKPOINT_DIR,   # overrides default
# )

# # Gemini 3 Flash - high thinking (default)
# results = run_classification(
#     df,
#     sentence_col="sentences",
#     vendor="google",
#     model="gemini-3-flash-preview",
#     runs=[1],
#     output_dir=OUTPUT_DIR,      # overrides default: OUTPUT_DIR vs. HOLDOUT_DIR vs. TRAIN_DIR vs. PRELIM_DIR
#     checkpoint_dir=CHECKPOINT_DIR,   # overrides default
# )

# # Gemini 3 Flash - minimal thinking (closest to chat mode)
# results = run_classification(
#     df,
#     sentence_col="sentences",
#     vendor="google",
#     model="gemini-3-flash-preview",
#     thinking_level="minimal",  # Flash: "minimal", "low", "medium", "high"
#     runs=[1],
#     output_dir=OUTPUT_DIR,     # overrides default: OUTPUT_DIR vs. HOLDOUT_DIR vs. TRAIN_DIR vs. PRELIM_DIR
#     checkpoint_dir=CHECKPOINT_DIR,   # overrides default
# )

# Preliminary Labels and Holdout Set
Once we have labeled our data ***one time*** with a model, we can get a first idea of the class (label) distribution and construct our holdout set accordingly.


In [None]:
# I will use GPT 4.1 for this purpose that UNC hosted on Azure. 
# You probably will not have access to this API and model.
# Pick a fast and not too expensive model that you feel confident can do an okay job (maybe get at least 70% correct).

In [None]:
results = run_classification(
    df,
    sentence_col="sentences", # provide the column name in which the text is you want to classify (sentence vs sentences?)
    vendor="azure",
    model="gpt-4.1",
    runs=[1],
    output_dir=PRELIM_DIR,      # overrides default: OUTPUT_DIR vs. HOLDOUT_DIR vs. TRAIN_DIR vs. PRELIM_DIR
    checkpoint_dir=CHECKPOINT_DIR,   # overrides default
)

## STEP 1: Load labeled data from a single genAI run

In [None]:
# -----------------------------
# Configuration
# -----------------------------

# CHANGE THIS: Path to your labeled data (one run, one model)
LABELED_FILE = "./output/preliminary/Run1/preliminary_azure_gpt-4.1_chat_run1.pkl"

# Functional areas
CLASSES = ["Marketing", "Finance", "Accounting", "Operations", "IT", "HR"]


In [None]:
import pandas as pd
import numpy as np
import ast


# -----------------------------
# Load and Parse
# -----------------------------

print("="*60)
print("LOADING LABELED DATA")
print("="*60)

df_raw = pd.read_pickle(LABELED_FILE)
print(f"Source: {LABELED_FILE}")
print(f"Total sentences: {len(df_raw)}")

# Parse labels into binary columns
def parse_labels_safe(labels):
    """Parse labels from various formats."""
    if labels is None:
        return []
    if isinstance(labels, list):
        return labels
    if isinstance(labels, str):
        try:
            parsed = ast.literal_eval(labels)
            return parsed if isinstance(parsed, list) else []
        except (ValueError, SyntaxError):  # string is not a valid Python literal
            return []
    return []

# Create binary columns for each class
df_labeled = df_raw[['id', 'sentence']].copy()

parsed_labels = df_raw['labels'].apply(parse_labels_safe)

for cls in CLASSES:
    df_labeled[cls] = parsed_labels.apply(lambda x: 1 if cls in x else 0)

# None = all classes are 0
df_labeled['None'] = (df_labeled[CLASSES].sum(axis=1) == 0).astype(int)

# Remove rows where labeling failed (error column is not None/NaN)
if 'error' in df_raw.columns:
    error_mask = df_raw['error'].notna()
    n_errors = error_mask.sum()
    if n_errors > 0:
        df_labeled = df_labeled[~error_mask].reset_index(drop=True)
        print(f"Removed {n_errors} rows with labeling errors")

print(f"Successfully labeled: {len(df_labeled)} sentences")

# Preview
print(f"\nPreview:")
label_cols = CLASSES + ['None']
print(df_labeled[['sentence'] + label_cols].head(5))

## STEP 2: Descriptives

In [None]:

label_cols = CLASSES + ['None']

print("="*60)
print("DESCRIPTIVE STATISTICS")
print("="*60)
print(f"Total sentences: {len(df_labeled)}")

# Class distribution
print(f"\n{'='*60}")
print("CLASS DISTRIBUTION")
print(f"{'='*60}")
print(f"{'Class':<15} {'Count':>8} {'Share':>10}")
print(f"{'-'*35}")

for cls in label_cols:
    count = df_labeled[cls].sum()
    share = count / len(df_labeled) * 100
    print(f"{cls:<15} {count:>8} {share:>9.1f}%")

# Multi-label distribution
df_labeled['num_labels'] = df_labeled[CLASSES].sum(axis=1)

print(f"\n{'='*60}")
print("MULTI-LABEL DISTRIBUTION")
print(f"{'='*60}")
print(f"{'# Labels':<15} {'Count':>8} {'Share':>10}")
print(f"{'-'*35}")

for n in range(7):
    count = (df_labeled['num_labels'] == n).sum()
    if count > 0:
        share = count / len(df_labeled) * 100
        label_text = f"{n} label{'s' if n != 1 else ''}"
        print(f"{label_text:<15} {count:>8} {share:>9.1f}%")

print(f"{'-'*35}")
print(f"{'Mean labels':<15} {df_labeled['num_labels'].mean():>8.2f}")
print(f"{'Median labels':<15} {df_labeled['num_labels'].median():>8.0f}")

# Co-occurrence matrix
print(f"\n{'='*60}")
print("CO-OCCURRENCE MATRIX")
print(f"{'='*60}")

cooccurrence = pd.DataFrame(0, index=CLASSES, columns=CLASSES)  # start with integer zeros
for cls1 in CLASSES:
    for cls2 in CLASSES:
        cooccurrence.loc[cls1, cls2] = ((df_labeled[cls1] == 1) & (df_labeled[cls2] == 1)).sum()

print(cooccurrence.to_string())

# Top label combinations
print(f"\n{'='*60}")
print("TOP 10 LABEL COMBINATIONS")
print(f"{'='*60}")

def get_label_combo(row):
    labels = [cls for cls in CLASSES if row[cls] == 1]
    return str(labels) if labels else "[]"

combo_counts = df_labeled.apply(get_label_combo, axis=1).value_counts().head(10)
print(f"{'Combination':<50} {'Count':>8} {'Share':>10}")
print(f"{'-'*70}")
for combo, count in combo_counts.items():
    share = count / len(df_labeled) * 100
    print(f"{combo:<50} {count:>8} {share:>9.1f}%")

# Clean up temp column
df_labeled.drop(columns=['num_labels'], inplace=True)

## STEP 3: Create Holdout and Train Sets

In [None]:
# =============================================================================
# STEP 3: CREATE HOLDOUT AND TRAIN SETS
# =============================================================================

import os

# -----------------------------
# Configuration
# -----------------------------

HOLDOUT_SIZE = 1000
MIN_PER_CLASS = 100
MAX_PER_CLASS = 333
MAX_NONE = 100  # Maximum None sentences in holdout
OUT_DIR = "./Holdout_Train"

os.makedirs(OUT_DIR, exist_ok=True)

np.random.seed(42)

# -----------------------------
# Helper
# -----------------------------

def get_labels(row):
    """Get list of functional labels for a row."""
    return [cls for cls in CLASSES if row[cls] == 1]

# Add temp columns
df_labeled['labels_list'] = df_labeled.apply(get_labels, axis=1)
df_labeled['num_labels'] = df_labeled[CLASSES].sum(axis=1)

# -----------------------------
# Stratified Sampling (6 functional classes only — None is derived)
# -----------------------------

selected_indices = set()
class_counts = {cls: 0 for cls in CLASSES}

print("="*60)
print("STRATIFIED HOLDOUT SAMPLING")
print("="*60)
print(f"Target holdout size: {HOLDOUT_SIZE}")
print(f"Min per class: {MIN_PER_CLASS} (6 functional classes)")
print(f"Max per class: {MAX_PER_CLASS}")
print(f"None: capped at {MAX_NONE} (derived, not independently predicted)")

# Phase 1: Ensure minimums, rarest classes first
class_freq = {cls: int(df_labeled[cls].sum()) for cls in CLASSES}
classes_by_rarity = sorted(CLASSES, key=lambda c: class_freq[c])

print(f"\n--- Phase 1: Ensure minimums (rarest first) ---")
print(f"  Class frequencies: {', '.join(f'{c}: {class_freq[c]}' for c in classes_by_rarity)}")

for rare_class in classes_by_rarity:
    candidates = df_labeled[
        (df_labeled[rare_class] == 1) & 
        (~df_labeled.index.isin(selected_indices))
    ].sort_values('num_labels', ascending=False)
    
    for idx, row in candidates.iterrows():
        if class_counts[rare_class] >= MIN_PER_CLASS:
            break
        labels = row['labels_list']
        if len(labels) > 0 and any(class_counts[lbl] >= MAX_PER_CLASS for lbl in labels):
            continue
        selected_indices.add(idx)
        for lbl in labels:
            class_counts[lbl] += 1
    
    print(f"  {rare_class}: {class_counts[rare_class]} sentences")

print(f"\nAfter Phase 1: {len(selected_indices)} sentences selected")

# Phase 2: Multi-label sentences
print(f"\n--- Phase 2: Add multi-label sentences ---")

current_multilabel = df_labeled.loc[list(selected_indices), 'num_labels'].gt(1).sum()
target_multilabel = int(HOLDOUT_SIZE * 0.18)

multilabel_candidates = df_labeled[
    (df_labeled['num_labels'] >= 2) & 
    (~df_labeled.index.isin(selected_indices))
].sample(frac=1, random_state=42)

for idx, row in multilabel_candidates.iterrows():
    if current_multilabel >= target_multilabel:
        break
    if len(selected_indices) >= HOLDOUT_SIZE:
        break
    labels = row['labels_list']
    if len(labels) > 0 and any(class_counts[lbl] >= MAX_PER_CLASS for lbl in labels):
        continue
    selected_indices.add(idx)
    for lbl in labels:
        class_counts[lbl] += 1
    current_multilabel += 1

print(f"  Multi-label sentences: {current_multilabel}")
print(f"  Total selected: {len(selected_indices)}")

# Phase 3: Fill remaining (cap None sentences)
none_count = sum(1 for idx in selected_indices if df_labeled.loc[idx, CLASSES].sum() == 0)

print(f"\n--- Phase 3: Fill to {HOLDOUT_SIZE} (max None: {MAX_NONE}) ---")
print(f"  None so far: {none_count}")

if len(selected_indices) < HOLDOUT_SIZE:
    fill_candidates = df_labeled[
        ~df_labeled.index.isin(selected_indices)
    ].sample(frac=1, random_state=42)
    
    for idx, row in fill_candidates.iterrows():
        if len(selected_indices) >= HOLDOUT_SIZE:
            break
        labels = row['labels_list']
        
        # Skip if any functional class would exceed max
        if len(labels) > 0 and any(class_counts[lbl] >= MAX_PER_CLASS for lbl in labels):
            continue
        
        # Skip if this is a None sentence and we've hit the cap
        if len(labels) == 0 and none_count >= MAX_NONE:
            continue
        
        selected_indices.add(idx)
        for lbl in labels:
            class_counts[lbl] += 1
        if len(labels) == 0:
            none_count += 1

print(f"  Final count: {len(selected_indices)}")

# -----------------------------
# Create Clean DataFrames (LABELS BLANKED OUT)
# -----------------------------

# Holdout and train get sentences + empty label columns
df_holdout = df_labeled.loc[list(selected_indices), ['sentence']].copy()
df_train = df_labeled.loc[~df_labeled.index.isin(selected_indices), ['sentence']].copy()

# Add empty label columns (students fill these in)
for cls in CLASSES + ['None']:
    df_holdout[cls] = ""
    df_train[cls] = ""

# Shuffle both
df_holdout = df_holdout.sample(frac=1, random_state=42).reset_index(drop=True)
df_train = df_train.sample(frac=1, random_state=42).reset_index(drop=True)

# Clean up temp columns from df_labeled
df_labeled.drop(columns=['labels_list', 'num_labels'], inplace=True, errors='ignore')

# -----------------------------
# Verification (using original labels for reporting only)
# -----------------------------

print(f"\n{'='*60}")
print("VERIFICATION")
print(f"{'='*60}")

print(f"\nDataset sizes:")
print(f"  Holdout: {len(df_holdout)}")
print(f"  Train:   {len(df_train)}")
print(f"  Total:   {len(df_holdout) + len(df_train)}")

# Verify no overlap
overlap = set(df_holdout['sentence']) & set(df_train['sentence'])
print(f"Overlap check: {len(overlap)} sentences (should be 0)")

# Report holdout class distribution (from original labels, for our reference)
print(f"\n{'='*60}")
print("HOLDOUT CLASS DISTRIBUTION (from LLM labels, for reference)")
print(f"{'='*60}")
print(f"{'Class':<15} {'Count':>8} {'Share':>10} {'Min OK':>10} {'Max OK':>10}")
print(f"{'-'*55}")

holdout_sentences = set(df_holdout['sentence'])
df_holdout_check = df_labeled[df_labeled['sentence'].isin(holdout_sentences)]

for cls in CLASSES:
    count = df_holdout_check[cls].sum()
    share = count / len(df_holdout_check) * 100
    min_ok = "✓" if count >= MIN_PER_CLASS else "✗"
    max_ok = "✓" if count <= MAX_PER_CLASS else "✗"
    print(f"{cls:<15} {count:>8} {share:>9.1f}% {min_ok:>10} {max_ok:>10}")

# None (derived, for reference)
none_count = (df_holdout_check[CLASSES].sum(axis=1) == 0).sum()
none_share = none_count / len(df_holdout_check) * 100
none_ok = "✓" if none_count <= MAX_NONE else "✗"
print(f"{'None (derived)':<15} {none_count:>8} {none_share:>9.1f}% {'':>10} {none_ok:>10}  (max {MAX_NONE})")

# Multi-label distribution
print(f"\n{'='*60}")
print("HOLDOUT MULTI-LABEL DISTRIBUTION (from LLM labels, for reference)")
print(f"{'='*60}")
print(f"{'# Labels':<15} {'Count':>8} {'Share':>10}")
print(f"{'-'*35}")

num_labels = df_holdout_check[CLASSES].sum(axis=1)
for n in range(7):
    count = (num_labels == n).sum()
    if count > 0:
        share = count / len(df_holdout_check) * 100
        label_text = f"{n} label{'s' if n != 1 else ''}"
        print(f"{label_text:<15} {count:>8} {share:>9.1f}%")

# Preview
print(f"\nHoldout preview (labels are blank for human experts to fill):")
print(df_holdout.head(5))
print(f"\nTrain preview (labels are blank for genAI to fill):")
print(df_train.head(5))

# -----------------------------
# Save
# -----------------------------

df_holdout.to_csv(f"{OUT_DIR}/holdout.csv", index=False)
df_holdout.to_pickle(f"{OUT_DIR}/holdout.pkl")
df_train.to_csv(f"{OUT_DIR}/train.csv", index=False)
df_train.to_pickle(f"{OUT_DIR}/train.pkl")

print(f"\n{'='*60}")
print("SAVED")
print(f"{'='*60}")
print(f"  Holdout: {OUT_DIR}/holdout.csv (.pkl)")
print(f"  Train:   {OUT_DIR}/train.csv (.pkl)")
print(f"\nColumns: {list(df_holdout.columns)}")
print(f"Label columns are EMPTY - human experts must fill these in")

# **Human Labeling of Holdout for Ground Truth**

- Now we have a more or less balanced holdout set
- We need to get this labeled by human experts, that is, domain experts / subject matter experts
- You will need at least three humans to evaluate each sentence independently,
  *or* work as an expert team and discuss each sentence to assign it the appropriate class (or labels)
- Make sure to save the file with the labels AND ***be sure that the texts (e.g., sentences) are not broken and do not contain formatting, ASCII, or HTML errors***
  - You might want to label in Excel, but then merge the labels from the XLSX or CSV file into the holdout.pkl file (which better preserves the actual texts)
- If you have independent human experts label the holdout, then you need to have a majority vote per label (this suggests you need an odd number of human expert labelers). Good practice is also to check inter-rater agreement on the labels.
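
A per-label majority vote across independent raters can be sketched as follows. Each label is decided separately, because labels are not mutually exclusive. The function name `majority_vote` and the toy expert labels are made up for illustration.

```python
CLASSES = ["Marketing", "Finance", "Accounting", "Operations", "IT", "HR"]

def majority_vote(rater_labels, classes=CLASSES):
    """Keep each label that more than half of the raters assigned.

    rater_labels: list of label lists, one per rater, for a single sentence.
    """
    n_raters = len(rater_labels)
    return [c for c in classes
            if sum(c in labels for labels in rater_labels) > n_raters / 2]

# Toy example: three hypothetical experts label the same sentence
experts = [["Finance", "Accounting"], ["Finance"], ["Finance", "IT"]]
consensus = majority_vote(experts)
print(consensus)

# A second sentence where two labels each get two of three votes
experts_2 = [["HR"], ["HR", "IT"], ["IT"]]
consensus_2 = majority_vote(experts_2)
print(consensus_2)
```

With an odd number of raters, a strict majority always exists for each label, so no tie-breaking rule is needed.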

# **genAI Labeling of Holdout**
> Now that you have a ground truth for your holdout set, you need to determine which genAI model performs best on it

#### Approach: 
1. **Label the holdout 3 times with a genAI model** (use the code from above, but instead of loading the "10K_unlabeled.csv" file for preliminary labeling, load your holdout.pkl file, the one without your human labels).
2. **Check** the three runs for **label agreement** (Krippendorff's alpha, for example)
3. Get the **majority vote** (for class or for labels; this example will be for ***labels*** that, unlike classes, are ***NOT mutually exclusive***)
4. **Compare to human expert labels on holdout** set to measure genAI labeling **performance**
5. **Repeat** for other models (of other vendors)
6. **Find the model** that does **best**, then use it to label the train set (3 times, with the same genAI API code from above but on a different file: make sure to adjust the paths where you store the data so you don't overwrite your holdout labels), get majority votes on train, then use those to fine-tune a pretrained and open-source LLM (like RoBERTa Large)
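
Step 4 can be made concrete with per-label precision, recall, and F1, computed separately for each business function and then averaged (macro F1). The sketch below uses plain Python and toy labels; the function name `per_label_scores` and the data are assumptions for illustration, and other metrics may suit your project better.

```python
CLASSES = ["Marketing", "Finance", "Accounting", "Operations", "IT", "HR"]

def per_label_scores(y_true, y_pred, classes=CLASSES):
    """Precision, recall, and F1 per label for multi-label predictions."""
    scores = {}
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(c in t and c in p for t, p in pairs)        # label correctly assigned
        fp = sum(c not in t and c in p for t, p in pairs)    # label wrongly assigned
        fn = sum(c in t and c not in p for t, p in pairs)    # label missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[c] = {"precision": precision, "recall": recall, "f1": f1}
    return scores

# Toy data: human ground truth vs. genAI majority vote (hypothetical)
human = [["Finance"], ["HR"], ["Finance", "Accounting"], []]
genai = [["Finance"], ["HR", "IT"], ["Finance"], ["Finance"]]

scores = per_label_scores(human, genai)
macro_f1 = sum(s["f1"] for s in scores.values()) / len(scores)
for c in CLASSES:
    print(f"{c:<12} F1 = {scores[c]['f1']:.2f}")
print(f"Macro F1 = {macro_f1:.2f}")
```

Classes without any true or predicted examples score 0.0 here, which pulls macro F1 down; this is one more reason to make sure every class is represented in the holdout.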

In [None]:
# =============================================================================
# OUTPUT CONFIGURATION - SET for Base, Holdout, Train
# =============================================================================
CHECKPOINT_DIR = "./checkpoints"
OUTPUT_DIR = "./output"
HOLDOUT_DIR = f"{OUTPUT_DIR}/holdout"
TRAIN_DIR = f"{OUTPUT_DIR}/train"

# =============================================================================
# Load Holdout (not the human labeled, but the blank)
# =============================================================================

# # Load Data: Labeling of (balanced) holdout data
df = pd.read_pickle("Holdout_Train/holdout.pkl")

In [None]:
# =============================================================================
# Query genAI model via API
# =============================================================================

# Here is an example for gpt-4o via UNC Azure API
results = run_classification(
    df,
    sentence_col="sentence",     # --> Check if your file has the column as named here where the text is  (sentence vs sentences vs text vs tweet vs ... )
    vendor="azure",
    model="gpt-4o",
    runs=[1,2,3],                # Doing 3 runs for consistency / replicability: will later take majority vote per label across runs
    output_dir=HOLDOUT_DIR,      # overrides default: OUTPUT_DIR vs. HOLDOUT_DIR vs. TRAIN_DIR vs. PRELIM_DIR
    checkpoint_dir=CHECKPOINT_DIR,   # overrides default
)

In [None]:
# =============================================================================
# Query another genAI model via API
# =============================================================================

# Here is an example for gpt-4.1 via UNC Azure API
results = run_classification(
    df,
    sentence_col="sentence",     # --> Check if your file has the column as named here where the text is  (sentence vs sentences vs text vs tweet vs ... )
    vendor="azure",
    model="gpt-4.1",
    runs=[1,2,3],                # Doing 3 runs for consistency / replicability: will later take majority vote per label across runs
    output_dir=HOLDOUT_DIR,      # overrides default: OUTPUT_DIR vs. HOLDOUT_DIR vs. TRAIN_DIR vs. PRELIM_DIR
    checkpoint_dir=CHECKPOINT_DIR,   # overrides default
)

# genAI Label Agreement

- How consistent is genAI in its labels?
- Test the extent to which genAI's labels agree across runs
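
As a quick sanity check, raw pairwise agreement across runs can be computed in a few lines. It is simpler than Krippendorff's alpha and, unlike alpha, does not correct for agreement expected by chance, so treat it as a rough first look only. The function name `pairwise_agreement` and the toy runs are assumptions for illustration.

```python
from itertools import combinations

CLASSES = ["Marketing", "Finance", "Accounting", "Operations", "IT", "HR"]

def pairwise_agreement(runs, classes=CLASSES):
    """Share of (sentence, label) decisions on which two runs agree,
    pooled over all pairs of runs.

    runs: list of runs; each run is a list of label lists, one per sentence.
    """
    agree = total = 0
    for run_a, run_b in combinations(runs, 2):
        for labels_a, labels_b in zip(run_a, run_b):
            for c in classes:
                agree += (c in labels_a) == (c in labels_b)
                total += 1
    return agree / total if total else float("nan")

# Toy example: three runs over the same two sentences (hypothetical labels)
run1 = [["Finance"], ["HR"]]
run2 = [["Finance"], ["HR", "IT"]]
run3 = [["Finance", "Accounting"], ["HR"]]

rate = pairwise_agreement([run1, run2, run3])
print(f"Raw pairwise agreement: {rate:.3f}")
```

Because most labels are absent for most sentences, raw agreement tends to look high even for inconsistent models, which is why a chance-corrected measure is the better basis for decisions.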

In [None]:
# pip install -q -U krippendorff   # already done at the very beginning. Keeping this as a reminder

In [None]:
# =============================================================================
# INTER-RATER AGREEMENT ACROSS RUNS (Krippendorff's Alpha)
# =============================================================================

import pandas as pd
import numpy as np
import ast
import glob
import re
import krippendorff  

# -----------------------------
# Configuration
# -----------------------------

HOLDOUT_DIR = "./output/holdout"  # Set your holdout output directory
RUNS = [1, 2, 3]                      # Which runs to compare
CLASSES = ["Marketing", "Finance", "Accounting", "Operations", "IT", "HR"]

In [None]:
# -----------------------------
# Helper Functions
# -----------------------------

def parse_labels_safe(labels):
    """Parse labels from various formats."""
    if labels is None:
        return []
    if isinstance(labels, list):
        return labels
    if isinstance(labels, str):
        try:
            parsed = ast.literal_eval(labels)
            return parsed if isinstance(parsed, list) else []
        except (ValueError, SyntaxError):  # string is not a valid Python literal
            return []
    return []

def parse_filename(filename):
    """
    Parse filename like 'holdout_azure_gpt-4.1_chat_run1.pkl' 
    into (vendor, model, mode, run).
    Handles optional dataset prefix (holdout_, train_, prelim_, output_).
    """
    # Remove extension
    name = filename.replace('.pkl', '').replace('.csv', '')
    
    # Strip known dataset prefixes
    for prefix in ['holdout_', 'train_', 'prelim_', 'output_']:
        if name.startswith(prefix):
            name = name[len(prefix):]
            break
    
    # Extract run number from end
    match = re.search(r'_run(\d+)$', name)
    if not match:
        return None
    run = int(match.group(1))
    name = name[:match.start()]
    
    # Extract mode (last part before run)
    parts = name.rsplit('_', 1)
    if len(parts) != 2:
        return None
    mode = parts[1]
    remainder = parts[0]
    
    # Extract vendor (first part) and model (rest)
    if '_' not in remainder:
        return None  # no vendor/model separator: not a result file
    vendor, model = remainder.split('_', 1)
    
    return vendor, model, mode, run

def discover_models(holdout_dir, runs):
    """
    Discover all vendor_model_mode combinations that have ALL specified runs.
    Returns dict: {(vendor, model, mode): {run: filepath}}
    """
    models = {}
    
    for run in runs:
        run_dir = f"{holdout_dir}/Run{run}"
        pkl_files = glob.glob(f"{run_dir}/*.pkl")
        
        for filepath in pkl_files:
            filename = re.split(r"[\\/]", filepath)[-1]  # basename; works with both "/" and "\" separators
            parsed = parse_filename(filename)
            if parsed is None:
                continue
            
            vendor, model, mode, file_run = parsed
            if file_run != run:
                continue
            
            key = (vendor, model, mode)
            if key not in models:
                models[key] = {}
            models[key][run] = filepath
    
    # Keep only models that have ALL specified runs
    complete = {
        key: paths for key, paths in models.items()
        if all(r in paths for r in runs)
    }
    
    return complete

def load_and_binarize(filepath, classes):
    """Load pkl and convert labels to binary columns."""
    df = pd.read_pickle(filepath)
    
    parsed = df['labels'].apply(parse_labels_safe)
    
    binary = pd.DataFrame(index=df.index)
    binary['id'] = df['id'] if 'id' in df.columns else df.index
    binary['sentence'] = df['sentence']
    
    for cls in classes:
        binary[cls] = parsed.apply(lambda x: 1 if cls in x else 0)
    
    # Remove error rows
    if 'error' in df.columns:
        binary = binary[df['error'].isna()].reset_index(drop=True)
    
    return binary

def compute_agreement(model_runs, classes, runs):
    """
    Compute Krippendorff's alpha per class and overall.
    Only evaluates the 6 functional classes — None is derived, not labeled.
    
    Args:
        model_runs: dict {run: filepath}
        classes: list of class names (6 functional classes)
        runs: list of run numbers
    
    Returns:
        dict with per-class and overall alpha
    """
    # Load all runs
    run_dfs = {}
    for run in runs:
        run_dfs[run] = load_and_binarize(model_runs[run], classes)
    
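    # Align runs on sentence id. Error rows are dropped per run, so equal
    # lengths alone would not guarantee that row i is the same sentence in every run.
    common_ids = set.intersection(*(set(run_dfs[run]['id']) for run in runs))
    for run in runs:
        kept = run_dfs[run][run_dfs[run]['id'].isin(common_ids)]
        run_dfs[run] = kept.sort_values('id').reset_index(drop=True)
    
    # Verify all runs have the same number of sentences after alignment
    n_sentences = len(run_dfs[runs[0]])
    for run in runs:
        assert len(run_dfs[run]) == n_sentences, \
            f"Run {run} has {len(run_dfs[run])} sentences, expected {n_sentences}"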
    
    results = {}
    
    # Per-class alpha (6 functional classes only)
    for cls in classes:
        # Reliability matrix: rows = raters (runs), columns = units (sentences)
        reliability_matrix = np.array([
            run_dfs[run][cls].values for run in runs
        ])
        
        # Krippendorff's alpha (nominal level for binary data)
        try:
            alpha = krippendorff.alpha(
                reliability_data=reliability_matrix,
                level_of_measurement='nominal',
            )
        except Exception:  # e.g. the class takes only one value across all runs
            alpha = np.nan
        
        results[cls] = alpha
    
    # Overall alpha (flatten 6 functional classes into one reliability matrix)
    overall_matrix = np.hstack([
        np.array([run_dfs[run][cls].values for run in runs])
        for cls in classes
    ])
    
    try:
        results['Overall'] = krippendorff.alpha(
            reliability_data=overall_matrix,
            level_of_measurement='nominal',
        )
    except Exception:  # e.g. only one value present in the pooled matrix
        results['Overall'] = np.nan
    
    # Pairwise agreement percentage per class (6 functional classes only)
    pairwise = {}
    for cls in classes:
        agreements = []
        for i, r1 in enumerate(runs):
            for r2 in runs[i+1:]:
                agree = (run_dfs[r1][cls].values == run_dfs[r2][cls].values).mean()
                agreements.append(agree)
        pairwise[cls] = np.mean(agreements)
    pairwise['Overall'] = np.mean([pairwise[cls] for cls in classes])
    
    return results, pairwise, n_sentences


# -----------------------------
# Main Analysis
# -----------------------------

print("="*70)
print("INTER-RATER AGREEMENT ANALYSIS (Krippendorff's Alpha)")
print("="*70)
print(f"Holdout dir: {HOLDOUT_DIR}")
print(f"Runs: {RUNS}")
print(f"Classes evaluated: {CLASSES} (None excluded — derived, not labeled)")

# Discover models
models = discover_models(HOLDOUT_DIR, RUNS)

print(f"\nFound {len(models)} model(s) with all {len(RUNS)} runs:")
for (vendor, model, mode), paths in models.items():
    print(f"  {vendor} / {model} / {mode}")
    for run, path in sorted(paths.items()):
        print(f"    Run {run}: {path}")

# Compute agreement for each model
all_results = {}

for (vendor, model, mode), paths in models.items():
    model_key = f"{vendor}_{model}_{mode}"
    
    print(f"\n{'='*70}")
    print(f"MODEL: {vendor} / {model} / {mode}")
    print(f"{'='*70}")
    
    alphas, pairwise, n_sentences = compute_agreement(paths, CLASSES, RUNS)
    all_results[model_key] = {'alphas': alphas, 'pairwise': pairwise}
    
    print(f"Sentences: {n_sentences}")
    print(f"Runs compared: {RUNS}")
    
    # Per-class table (6 functional classes + Overall)
    print(f"\n{'Class':<15} {'k-Alpha':>10} {'Agreement Share':>16}")
    print("-"*45)
    
    for cls in CLASSES + ['Overall']:
        alpha = alphas[cls]
        agree = pairwise[cls]
        
        if cls == 'Overall':
            print("-"*45)
        
        print(f"{cls:<15} {alpha:>10.4f} {agree:>15.1%}")
    
    # Interpretation guide
    print(f"\nKrippendorff's Alpha Interpretation:")
    print(f"  α ≥ 0.80  → Reliable agreement")
    print(f"  α ≥ 0.667 → Tentative agreement (acceptable for some purposes)")
    print(f"  α < 0.667 → Unreliable / insufficient agreement")
    print(f"\n  Note: None class excluded — it is derived (all classes = 0),")
    print(f"  not independently labeled by the model.")

# -----------------------------
# Summary across models
# -----------------------------

if len(all_results) > 1:
    print(f"\n{'='*70}")
    print("SUMMARY ACROSS MODELS")
    print(f"{'='*70}")
    
    summary_rows = []
    for model_key, data in all_results.items():
        row = {'Model': model_key}
        row['Overall_Alpha'] = data['alphas']['Overall']
        row['Overall_Agreement'] = data['pairwise']['Overall']
        for cls in CLASSES:
            row[f'{cls}_Alpha'] = data['alphas'][cls]
        summary_rows.append(row)
    
    df_summary = pd.DataFrame(summary_rows)
    df_summary = df_summary.sort_values('Overall_Alpha', ascending=False)
    
    print(f"\n{'Model':<35} {'Overall α':>12} {'Agreement':>12}")
    print("-"*60)
    for _, row in df_summary.iterrows():
        print(f"{row['Model']:<35} {row['Overall_Alpha']:>12.4f} {row['Overall_Agreement']:>11.1%}")
    
    # Save summary
    df_summary.to_csv(f"{HOLDOUT_DIR}/agreement_summary.csv", index=False)
    print(f"\nSaved summary to {HOLDOUT_DIR}/agreement_summary.csv")

# Majority Votes across Runs

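- Aggregates each model's labels across its own runs
- A class is assigned only if more than half of the runs voted for it, so a tie (possible with an even number of runs) resolves to 0; use an odd number of runs
- Assumes a multi-label problem (one sentence can have multiple labels)
- None is derived: it is set to 1 only when no class reaches a majority (all classes are 0 or FALSE)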
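
The voting rule can be illustrated on a single sentence. This is a minimal sketch, not the notebook's implementation: `majority_labels` is a hypothetical helper, and it counts each class at most once per run.

```python
from collections import Counter

CLASSES = ["Marketing", "Finance", "Accounting", "Operations", "IT", "HR"]

def majority_labels(run_labels, classes=CLASSES):
    """Return the classes chosen by a strict majority of runs, or ['None']."""
    # Count each class once per run; labels outside `classes` are ignored
    votes = Counter(
        label for labels in run_labels for label in set(labels) if label in classes
    )
    threshold = len(run_labels) / 2          # strict majority: votes > 50% of runs
    winners = [cls for cls in classes if votes[cls] > threshold]
    return winners or ["None"]               # None is derived, never voted on

# Three runs for one sentence: Marketing gets 2 of 3 votes, Finance only 1
print(majority_labels([["Marketing", "Finance"], ["Marketing"], []]))  # ['Marketing']

# Two runs (even number): a 1-1 tie does not exceed the threshold
print(majority_labels([["IT"], []]))  # ['None']
```

With an even number of runs, a tied class silently falls to 0, which is why an odd number of runs is the safer choice.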

In [None]:
# =============================================================================
# MAJORITY VOTE PER MODEL (across its own runs)
# =============================================================================

import pandas as pd
import numpy as np
import glob
import ast
import re
import os

# -----------------------------
# Configuration
# -----------------------------

HOLDOUT_DIR = "./output/holdout"  # Where Run1/, Run2/, Run3/ etc. are
RUNS = [1, 2, 3]                  # Which runs to aggregate
CLASSES = ["Marketing", "Finance", "Accounting", "Operations", "IT", "HR"]


In [None]:
# -----------------------------
# Helper Functions
# -----------------------------

def parse_labels_safe(labels):
    """Parse labels from various formats."""
    if labels is None:
        return []
    if isinstance(labels, list):
        return labels
    if isinstance(labels, str):
        try:
            parsed = ast.literal_eval(labels)
            return parsed if isinstance(parsed, list) else []
        except (ValueError, SyntaxError):  # string is not a valid Python literal
            return []
    return []

def parse_filename(filename):
    """
    Parse filename like 'holdout_azure_gpt-4.1_chat_run1.pkl'
    into (vendor, model, mode, run).
    Handles optional dataset prefix (holdout_, train_, prelim_, output_).
    """
    name = filename.replace('.pkl', '').replace('.csv', '')
    
    # Strip known dataset prefixes
    for prefix in ['holdout_', 'train_', 'prelim_', 'output_']:
        if name.startswith(prefix):
            name = name[len(prefix):]
            break
    
    match = re.search(r'_run(\d+)$', name)
    if not match:
        return None
    run = int(match.group(1))
    name = name[:match.start()]
    parts = name.rsplit('_', 1)
    if len(parts) != 2:
        return None
    mode = parts[1]
    remainder = parts[0]
    if '_' not in remainder:
        return None  # no vendor/model separator: not a result file
    vendor, model = remainder.split('_', 1)
    return vendor, model, mode, run

def discover_models(holdout_dir, runs):
    """Find all vendor_model_mode combos that have ALL specified runs."""
    models = {}
    for run in runs:
        run_dir = f"{holdout_dir}/Run{run}"
        if not os.path.exists(run_dir):
            continue
        for filepath in glob.glob(f"{run_dir}/*.pkl"):
            filename = os.path.basename(filepath)  # portable across path separators
            parsed = parse_filename(filename)
            if parsed is None:
                continue
            vendor, model, mode, file_run = parsed
            if file_run != run:
                continue
            key = (vendor, model, mode)
            if key not in models:
                models[key] = {}
            models[key][run] = filepath
    
    # Keep only models with ALL specified runs
    return {k: v for k, v in models.items() if all(r in v for r in runs)}

def compute_majority_vote(model_runs, classes, runs):
    """
    Compute majority vote for a single model across its runs.
    
    Votes are counted for the 6 functional classes only.
    None is derived: if no functional class gets majority, None = 1.
    None_votes tracks how many runs returned an empty label list (for reference).
    
    Returns DataFrame with sentence + binary labels + vote counts.
    """
    # Load all runs
    run_dfs = {}
    for run in runs:
        run_dfs[run] = pd.read_pickle(model_runs[run])
    
    # Collect votes per sentence
    votes = {}
    for run in runs:
        df = run_dfs[run]
        for _, row in df.iterrows():
            sent_id = row['id']
            sentence = row['sentence']
            labels = parse_labels_safe(row['labels'])
            
            # Skip error rows
            if 'error' in df.columns and pd.notna(row.get('error')):
                continue
            
            if sent_id not in votes:
                votes[sent_id] = {
                    'sentence': sentence,
                    **{cls: 0 for cls in classes},
                    'none_runs': 0,
                    'total_runs': 0,
                }
            
            votes[sent_id]['total_runs'] += 1
            
            if len(labels) == 0:
                votes[sent_id]['none_runs'] += 1
            else:
                for label in labels:
                    if label in classes:
                        votes[sent_id][label] += 1
    
    # Calculate majority
    results = []
    for sent_id, data in votes.items():
        total = data['total_runs']
        threshold = total / 2  # > 50%
        
        row = {'id': sent_id, 'sentence': data['sentence']}
        
        # Majority vote on 6 functional classes
        for cls in classes:
            row[cls] = 1 if data[cls] > threshold else 0
        
        # None is derived: all functional classes = 0
        row['None'] = 1 if all(row[cls] == 0 for cls in classes) else 0
        
        # Vote counts (for reference / pseudo-probabilities)
        for cls in classes:
            row[f'{cls}_votes'] = data[cls]
        row['None_votes'] = data['none_runs']
        row['total_runs'] = total
        
        results.append(row)
    
    df_result = pd.DataFrame(results)
    label_cols = classes + ['None']
    vote_cols = [f'{cls}_votes' for cls in classes] + ['None_votes', 'total_runs']
    df_result = df_result[['id', 'sentence'] + label_cols + vote_cols]
    df_result = df_result.sort_values('id').reset_index(drop=True)
    
    return df_result

# -----------------------------
# Main
# -----------------------------

print("="*70)
print("MAJORITY VOTE PER MODEL")
print("="*70)
print(f"Holdout dir: {HOLDOUT_DIR}")
print(f"Runs: {RUNS}")

# Discover models
models = discover_models(HOLDOUT_DIR, RUNS)

print(f"\nFound {len(models)} model(s) with all {len(RUNS)} runs:")
for (vendor, model, mode), paths in models.items():
    print(f"  {vendor} / {model} / {mode}")

# Process each model
all_majority = {}

for (vendor, model, mode), paths in models.items():
    model_key = f"{vendor}_{model}_{mode}"
    
    print(f"\n{'='*70}")
    print(f"MODEL: {vendor} / {model} / {mode}")
    print(f"{'='*70}")
    
    # Compute majority vote
    df_mv = compute_majority_vote(paths, CLASSES, RUNS)
    all_majority[model_key] = df_mv
    
    # Report
    print(f"Total sentences: {len(df_mv)}")
    print(f"Runs per sentence: {df_mv['total_runs'].iloc[0]}")
    
    print(f"\n{'Class':<15} {'Count':>8} {'Share':>10}")
    print("-"*35)
    for cls in CLASSES:
        count = df_mv[cls].sum()
        share = count / len(df_mv) * 100
        print(f"{cls:<15} {count:>8} {share:>9.1f}%")
    
    # None (derived)
    none_count = df_mv['None'].sum()
    none_share = none_count / len(df_mv) * 100
    print("-"*35)
    print(f"{'None (derived)':<15} {none_count:>8} {none_share:>9.1f}%")
    
    # Multi-label distribution
    num_labels = df_mv[CLASSES].sum(axis=1)
    print(f"\n{'# Labels':<15} {'Count':>8} {'Share':>10}")
    print("-"*35)
    for n in range(len(CLASSES) + 1):  # 0 labels up to all classes
        count = (num_labels == n).sum()
        if count > 0:
            share = count / len(df_mv) * 100
            label_text = f"{n} label{'s' if n != 1 else ''}"
            print(f"{label_text:<15} {count:>8} {share:>9.1f}%")
    
    # Save (with dataset prefix for consistency)
    dataset_tag = os.path.basename(os.path.normpath(HOLDOUT_DIR)).lower()
    save_path = f"{HOLDOUT_DIR}/{dataset_tag}_{model_key}_majority_vote"
    df_mv.to_csv(f"{save_path}.csv", index=False)
    df_mv.to_pickle(f"{save_path}.pkl")
    print(f"\nSaved: {save_path}.csv (.pkl)")

# -----------------------------
# Summary across models
# -----------------------------

if len(all_majority) > 1:
    print(f"\n{'='*70}")
    print("COMPARISON ACROSS MODELS")
    print(f"{'='*70}")
    
    # Header
    model_keys = list(all_majority.keys())
    header = f"{'Class':<15}"
    for key in model_keys:
        short = key.split('_', 1)[1]  # Remove vendor prefix for display
        header += f" {short:>20}"
    print(header)
    print("-" * (15 + 21 * len(model_keys)))
    
    for cls in CLASSES:
        row = f"{cls:<15}"
        for key in model_keys:
            df_mv = all_majority[key]
            count = df_mv[cls].sum()
            share = count / len(df_mv) * 100
            row += f" {count:>11} ({share:>5.1f}%)"  # 21 chars wide, matching the header columns
        print(row)
    
    # None row
    row = f"{'None (derived)':<15}"
    for key in model_keys:
        df_mv = all_majority[key]
        count = df_mv['None'].sum()
        share = count / len(df_mv) * 100
        row += f" {count:>11} ({share:>5.1f}%)"  # 21 chars wide, matching the header columns
    print("-" * (15 + 21 * len(model_keys)))
    print(row)

print(f"\n{'='*70}")
print("DONE")
print(f"{'='*70}")

# genAI Model Performance

- Check genAI majority labels against ground truth from human experts
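
The evaluation reports both macro and micro F1, and the two can diverge sharply when some classes are rare. The sketch below recomputes both by hand on a made-up two-class example. The helper names are hypothetical, and the notebook itself uses scikit-learn for the real metrics.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when undefined (like zero_division=0)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_micro_f1(y_true, y_pred):
    """y_true / y_pred: lists of rows, each row a list of 0/1 flags per class."""
    n_classes = len(y_true[0])
    counts = []
    for j in range(n_classes):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        counts.append((tp, fp, fn))
    # Macro: average the per-class F1 scores (every class weighs the same)
    macro = sum(f1_from_counts(*c) for c in counts) / n_classes
    # Micro: pool the counts over all classes first, then compute one F1
    micro = f1_from_counts(*(sum(col) for col in zip(*counts)))
    return macro, micro

# Toy data (made up): class 0 is frequent and always right,
# class 1 is rare and missed the one time it occurs.
y_true = [[1, 0], [1, 0], [1, 0], [1, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [1, 0]]
macro, micro = macro_micro_f1(y_true, y_pred)
print(f"macro={macro:.4f} micro={micro:.4f}")  # macro=0.5000 micro=0.8889
```

One missed rare label halves macro F1 while micro F1 stays high, which is why macro F1 is the stricter summary when rare classes matter.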

In [None]:
# =============================================================================
# EVALUATE GenAI MODELS vs HUMAN LABELS ON HOLDOUT
# =============================================================================

import pandas as pd
import numpy as np
import glob
import json
from sklearn.metrics import (
    f1_score,
    matthews_corrcoef,
    precision_score,
    recall_score,
    confusion_matrix,
    hamming_loss,
    jaccard_score,
)

# -----------------------------
# Configuration
# -----------------------------

HOLDOUT_DIR = "./output/holdout"  # Where model majority vote files are
HUMAN_LABELS_PATH = "./Holdout_Train/holdout_human.pkl"  # Ground truth
CLASSES = ["Marketing", "Finance", "Accounting", "Operations", "IT", "HR"]

In [None]:
# -----------------------------
# Load Human Labels (Ground Truth)
# -----------------------------

print("="*70)
print("EVALUATE GenAI MODELS vs HUMAN LABELS")
print("="*70)

df_human = pd.read_pickle(HUMAN_LABELS_PATH)
print(f"Human-labeled holdout: {len(df_human)} sentences")
print(f"Source: {HUMAN_LABELS_PATH}")

print(f"\nHuman label distribution:")
for cls in CLASSES:
    count = df_human[cls].sum()
    share = count / len(df_human) * 100
    print(f"  {cls}: {int(count)} ({share:.1f}%)")
none_count = (df_human[CLASSES].sum(axis=1) == 0).sum()
print(f"  None (derived): {none_count} ({none_count / len(df_human) * 100:.1f}%)")

# -----------------------------
# Discover Model Majority Vote Files
# -----------------------------

mv_files = glob.glob(f"{HOLDOUT_DIR}/*_majority_vote.pkl")
print(f"\nFound {len(mv_files)} model majority vote file(s):")
for f in mv_files:
    print(f"  {f.split('/')[-1]}")

# -----------------------------
# Evaluation Functions
# -----------------------------

def evaluate_multilabel(y_true, y_pred, class_names):
    """
    Comprehensive multi-label evaluation on the 6 functional classes.
    
    Each label is an independent binary decision.
    None is derived (all classes = 0), not independently predicted,
    so it is excluded from macro averaging but reported separately.
    No AUC: requires continuous probabilities, not available for binary GenAI predictions.
    """
    # Per-class metrics (6 functional classes)
    per_class = {}
    for i, cls in enumerate(class_names):
        y_t = y_true[:, i]
        y_p = y_pred[:, i]
        
        per_class[cls] = {
            'f1': f1_score(y_t, y_p, zero_division=0),
            'precision': precision_score(y_t, y_p, zero_division=0),
            'recall': recall_score(y_t, y_p, zero_division=0),
            'mcc': matthews_corrcoef(y_t, y_p),
            'support_true': int(y_t.sum()),
            'support_pred': int(y_p.sum()),
        }
    
    # Macro averages (6 functional classes only)
    macro = {
        'f1_macro': np.mean([per_class[c]['f1'] for c in class_names]),
        'precision_macro': np.mean([per_class[c]['precision'] for c in class_names]),
        'recall_macro': np.mean([per_class[c]['recall'] for c in class_names]),
        'mcc_macro': np.mean([per_class[c]['mcc'] for c in class_names]),
    }
    
    # Micro averages (pooled across 6 classes)
    micro = {
        'f1_micro': f1_score(y_true, y_pred, average='micro', zero_division=0),
        'precision_micro': precision_score(y_true, y_pred, average='micro', zero_division=0),
        'recall_micro': recall_score(y_true, y_pred, average='micro', zero_division=0),
    }
    
    # Sample-based metrics (per sentence, 6 classes)
    # Note: sentences where both true and predicted are all-zero (None) score 0/0.
    # Using zero_division=1 so correct "None" predictions score 1.0, not 0.0.
    sample = {
        'f1_samples': f1_score(y_true, y_pred, average='samples', zero_division=1),
        'jaccard_samples': jaccard_score(y_true, y_pred, average='samples', zero_division=1),
    }
    
    # Overall metrics
    overall = {
        'exact_match_ratio': (y_pred == y_true).all(axis=1).mean(),
        'hamming_loss': hamming_loss(y_true, y_pred),
    }
    
    # None class (derived: all classes = 0, reported separately)
    none_true = (y_true.sum(axis=1) == 0).astype(int)
    none_pred = (y_pred.sum(axis=1) == 0).astype(int)
    
    none = {
        'f1': f1_score(none_true, none_pred, zero_division=0),
        'precision': precision_score(none_true, none_pred, zero_division=0),
        'recall': recall_score(none_true, none_pred, zero_division=0),
        'mcc': matthews_corrcoef(none_true, none_pred),
        'support_true': int(none_true.sum()),
        'support_pred': int(none_pred.sum()),
    }
    
    return {
        'per_class': per_class,
        'macro': macro,
        'micro': micro,
        'sample': sample,
        'overall': overall,
        'none': none,
    }


def print_evaluation_report(model_name, metrics, class_names):
    """Print formatted evaluation report for one model."""
    
    print(f"\n{'='*70}")
    print(f"MODEL: {model_name}")
    print(f"{'='*70}")
    
    # Per-class table
    print(f"\n{'Class':<12} {'F1':>7} {'MCC':>7} {'Prec':>7} {'Rec':>7} {'True':>6} {'Pred':>6}")
    print("-"*58)
    
    for cls in class_names:
        m = metrics['per_class'][cls]
        print(f"{cls:<12} {m['f1']:>7.4f} {m['mcc']:>7.4f} "
              f"{m['precision']:>7.4f} {m['recall']:>7.4f} {m['support_true']:>6} {m['support_pred']:>6}")
    
    # None (derived, separate from macro)
    print("-"*58)
    n = metrics['none']
    print(f"{'None':<12} {n['f1']:>7.4f} {n['mcc']:>7.4f} "
          f"{n['precision']:>7.4f} {n['recall']:>7.4f} {n['support_true']:>6} {n['support_pred']:>6}")
    print(f"{'':>12} (derived: all classes = 0, excluded from Macro)")
    
    # Macro averages (6 functional classes)
    print("-"*58)
    m = metrics['macro']
    print(f"{'Macro Avg':<12} {m['f1_macro']:>7.4f} {m['mcc_macro']:>7.4f} "
          f"{m['precision_macro']:>7.4f} {m['recall_macro']:>7.4f}")
    print(f"{'(6 classes)':>12}")
    
    # Summary
    print("\n--- Summary Metrics ---")
    print(f"  Macro F1:       {metrics['macro']['f1_macro']:.4f}  ← Primary (class-balanced, 6 classes)")
    print(f"  Macro MCC:      {metrics['macro']['mcc_macro']:.4f}  ← Robust to imbalance")
    print(f"  Micro F1:       {metrics['micro']['f1_micro']:.4f}  ← Pooled across 6 classes")
    print(f"  Sample F1:      {metrics['sample']['f1_samples']:.4f}  ← Per-sentence average")
    print(f"  Sample Jaccard: {metrics['sample']['jaccard_samples']:.4f}  ← Per-sentence |∩|/|∪|")
    print(f"  Exact Match:    {metrics['overall']['exact_match_ratio']:.4f}  ← All 6 labels correct")
    print(f"  Hamming Loss:   {metrics['overall']['hamming_loss']:.4f}  ← Fraction wrong (lower=better)")

    # Interpretation
    print("\n--- Interpretation ---")
    gap = metrics['micro']['f1_micro'] - metrics['macro']['f1_macro']
    print(f"  Macro F1 ({metrics['macro']['f1_macro']:.4f}) treats 6 functional classes equally;")
    print(f"    Micro F1 ({metrics['micro']['f1_micro']:.4f}) pools all labels, weighted by frequency.")
    if gap > 0.05:
        print(f"    Gap of {gap:.4f} suggests rare classes (IT, HR) drag down Macro F1.")
    
    print(f"  Sample F1 ({metrics['sample']['f1_samples']:.4f}) and Jaccard ({metrics['sample']['jaccard_samples']:.4f}):")
    print(f"    Per-sentence evaluation. Correct 'None' predictions (both empty) score 1.0.")
    print(f"  Exact Match ({metrics['overall']['exact_match_ratio']:.4f}): {metrics['overall']['exact_match_ratio']*100:.1f}% of sentences had ALL labels correct.")
    print(f"    Binary per sentence: one wrong label fails the entire sentence.")
    print(f"  Hamming Loss ({metrics['overall']['hamming_loss']:.4f}): {metrics['overall']['hamming_loss']*100:.1f}% of individual label decisions are wrong.")
    print(f"  MCC ({metrics['macro']['mcc_macro']:.4f}): Balanced metric even with class imbalance. >0.70 is good.")
    print(f"  None (derived): F1 = {metrics['none']['f1']:.4f}, MCC = {metrics['none']['mcc']:.4f}")
    print(f"    Not independently predicted — excluded from Macro to avoid inflating scores.")

    # For paper
    print("\n--- For Paper/Report ---")
    print(f"  Primary:   Macro F1 = {metrics['macro']['f1_macro']:.4f}")
    print(f"  Secondary: Macro MCC = {metrics['macro']['mcc_macro']:.4f}")
    print(f"  Overall:   Exact Match = {metrics['overall']['exact_match_ratio']:.4f}, Hamming Loss = {metrics['overall']['hamming_loss']:.4f}")


# -----------------------------
# Evaluate Each Model
# -----------------------------

all_metrics = {}

for mv_file in sorted(mv_files):
    filename = mv_file.split('/')[-1]
    model_name = filename.replace('_majority_vote.pkl', '')
    # Strip dataset prefix for clean display
    for prefix in ['holdout_', 'train_', 'prelim_', 'output_']:
        if model_name.startswith(prefix):
            model_name = model_name[len(prefix):]
            break
    
    # Load model predictions
    df_model = pd.read_pickle(mv_file)
    
    # Merge with human labels on sentence (6 functional classes only)
    df_merged = df_human[['sentence'] + CLASSES].merge(
        df_model[['sentence'] + CLASSES],
        on='sentence',
        how='inner',
        suffixes=('_true', '_pred'),
    )
    
    # Check match rate
    match_rate = len(df_merged) / len(df_human) * 100
    print(f"\n{'─'*70}")
    print(f"Model: {model_name}")
    print(f"Matched sentences: {len(df_merged)}/{len(df_human)} ({match_rate:.1f}%)")
    
    if len(df_merged) == 0:
        print("  WARNING: No matching sentences! Skipping.")
        continue
    
    # Extract true labels (human) — 6 functional classes
    y_true = df_merged[[f'{cls}_true' for cls in CLASSES]].values.astype(int)
    
    # Extract predicted labels (model majority vote) — 6 functional classes
    y_pred = df_merged[[f'{cls}_pred' for cls in CLASSES]].values.astype(int)
    
    # Evaluate
    metrics = evaluate_multilabel(y_true, y_pred, CLASSES)
    all_metrics[model_name] = metrics
    
    # Print report
    print_evaluation_report(model_name, metrics, CLASSES)
    
    # Confusion matrices
    print(f"\n--- Confusion Matrices ---")
    print(f"  {'Class':<12} {'TP':>6} {'TN':>6} {'FP':>6} {'FN':>6}")
    print(f"  {'-'*38}")
    
    for i, cls in enumerate(CLASSES):
        cm = confusion_matrix(y_true[:, i], y_pred[:, i], labels=[0, 1])  # force 2x2 even if a value is absent
        tn, fp, fn, tp = cm.ravel()
        print(f"  {cls:<12} {tp:>6} {tn:>6} {fp:>6} {fn:>6}")
    
    # None confusion matrix (derived)
    none_true = (y_true.sum(axis=1) == 0).astype(int)
    none_pred = (y_pred.sum(axis=1) == 0).astype(int)
    none_cm = confusion_matrix(none_true, none_pred, labels=[0, 1])  # force 2x2 even if a value is absent
    none_tn, none_fp, none_fn, none_tp = none_cm.ravel()
    print(f"  {'-'*38}")
    print(f"  {'None':<12} {none_tp:>6} {none_tn:>6} {none_fp:>6} {none_fn:>6}")

# -----------------------------
# Comparison Summary
# -----------------------------

if len(all_metrics) > 1:
    print(f"\n{'='*70}")
    print("MODEL COMPARISON SUMMARY")
    print(f"{'='*70}")
    
    # Summary table
    print(f"\n{'Model':<40} {'F1 Mac':>8} {'MCC Mac':>8} {'Exact':>8} {'Hamming':>8}")
    print("-"*68)
    
    summary_rows = []
    for model_name, metrics in sorted(all_metrics.items(), 
                                       key=lambda x: x[1]['macro']['f1_macro'], 
                                       reverse=True):
        m = metrics['macro']
        o = metrics['overall']
        print(f"{model_name:<40} {m['f1_macro']:>8.4f} {m['mcc_macro']:>8.4f} "
              f"{o['exact_match_ratio']:>8.4f} {o['hamming_loss']:>8.4f}")
        
        summary_rows.append({
            'model': model_name,
            'f1_macro': m['f1_macro'],
            'mcc_macro': m['mcc_macro'],
            'f1_micro': metrics['micro']['f1_micro'],
            'f1_samples': metrics['sample']['f1_samples'],
            'jaccard_samples': metrics['sample']['jaccard_samples'],
            'exact_match': o['exact_match_ratio'],
            'hamming_loss': o['hamming_loss'],
            'none_f1': metrics['none']['f1'],
            'none_mcc': metrics['none']['mcc'],
        })
    
    # Per-class F1 comparison
    print(f"\n--- Per-Class F1 Comparison ---")
    header = f"{'Class':<12}"
    for model_name in sorted(all_metrics.keys()):
        short = model_name.replace('azure_', '').replace('fireworks_', 'fw_')
        header += f" {short:>20}"
    print(header)
    print("-" * (12 + 21 * len(all_metrics)))
    
    for cls in CLASSES + ['None', 'Macro Avg']:
        row = f"{cls:<12}"
        if cls == 'None':
            row = f"{'-'*12}\n{cls:<12}"
        for model_name in sorted(all_metrics.keys()):
            if cls == 'Macro Avg':
                val = all_metrics[model_name]['macro']['f1_macro']
            elif cls == 'None':
                val = all_metrics[model_name]['none']['f1']
            else:
                val = all_metrics[model_name]['per_class'][cls]['f1']
            row += f" {val:>20.4f}"
        print(row)
    
    # Save comparison
    df_summary = pd.DataFrame(summary_rows).sort_values('f1_macro', ascending=False)
    df_summary.to_csv(f"{HOLDOUT_DIR}/model_vs_human_comparison.csv", index=False)
    print(f"\nSaved comparison to {HOLDOUT_DIR}/model_vs_human_comparison.csv")

# -----------------------------
# Save all metrics
# -----------------------------

def convert_for_json(obj):
    """Convert numpy types for JSON serialization."""
    if isinstance(obj, (np.integer,)):
        return int(obj)
    if isinstance(obj, (np.floating,)):
        return float(obj)
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    if isinstance(obj, dict):
        return {k: convert_for_json(v) for k, v in obj.items()}
    return obj

all_metrics_json = convert_for_json(all_metrics)
with open(f"{HOLDOUT_DIR}/model_vs_human_metrics.json", 'w') as f:
    json.dump(all_metrics_json, f, indent=2)
print(f"Saved all metrics to {HOLDOUT_DIR}/model_vs_human_metrics.json")

print(f"\n{'='*70}")
print("DONE")
print(f"{'='*70}")

# Label Training Data with genAI

- Determine which genAI model (or models) performs best on the holdout
- Assume that this will also be the best model for labeling the train data
- Label the train data at least 3 times (i.e. 3 runs: [1,2,3]) with the best genAI model (or models)
- Build the final train set from majority votes across runs (and across models, if you pool genAI models for possibly better performance)
> I am giving an example here using GPT-4.1. ***This will most likely not be the best model! Try others on the holdout!***

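Before taking majority votes, it can help to check how stable the labels are across runs. The sketch below is a minimal illustration on made-up data: the `runs` dictionary and the helper `unanimous_share` are hypothetical and not part of this notebook's pipeline. It reports the share of sentences that received the identical label set in every run.

```python
# Hypothetical toy output: labels per sentence id for three runs
runs = {
    1: {"s1": ["Marketing"], "s2": ["Finance", "Accounting"], "s3": []},
    2: {"s1": ["Marketing"], "s2": ["Finance"], "s3": []},
    3: {"s1": ["Marketing"], "s2": ["Accounting", "Finance"], "s3": ["IT"]},
}

def unanimous_share(runs):
    """Share of sentences whose label set is identical in every run."""
    # Only compare sentences that were labeled in all runs
    ids = set.intersection(*(set(r) for r in runs.values()))
    if not ids:
        return 0.0
    unanimous = sum(
        1 for sid in ids
        # frozenset ignores label order, so ["A", "B"] equals ["B", "A"]
        if len({frozenset(r[sid]) for r in runs.values()}) == 1
    )
    return unanimous / len(ids)

print(f"Unanimous across runs: {unanimous_share(runs):.2f}")
```

A low share suggests that the model is inconsistent on this task, in which case more runs or a different model may be worth the extra API cost.
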
In [None]:
# =============================================================================
# OUTPUT CONFIGURATION - SET for Train
# =============================================================================
CHECKPOINT_DIR = "./checkpoints"
OUTPUT_DIR = "./output"
TRAIN_DIR = f"{OUTPUT_DIR}/train"

In [None]:
# =============================================================================
# Load Train dataset
# =============================================================================

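# Load data: train sentences to be labeled by the genAI model
df = pd.read_pickle("Holdout_Train/train.pkl")
df = df.head(101)  # Only the first 101 rows are labeled here; remove this line to label the full train set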

# =============================================================================
# Query genAI model via API
# =============================================================================

# Here is an example for gpt-4.1 via the UNC Azure API.
# Most likely, this will not be the best model! Try other vendors and other models.
results = run_classification(
    df,
    sentence_col="sentence",     # --> Check if your file has the column as named here where the text is  (sentence vs sentences vs text vs tweet vs ... )
    vendor="azure",
    model="gpt-4.1",
    runs=[1,2,3],                # Doing 3 runs for consistency / replicability: will later take majority vote per label across runs
    output_dir=TRAIN_DIR,        # overrides default: OUTPUT_DIR vs. HOLDOUT_DIR vs. TRAIN_DIR vs. PRELIM_DIR
    checkpoint_dir=CHECKPOINT_DIR,   # overrides default
)

# Majority Votes

- Across Runs
- If you are ***pooling models***, then you need to also do this ***Across Models***: THIS CODE DOES NOT DO THAT!
- Does not handle ties yet (assumes odd number of total runs across models)
- Assumes multi-label problem (one sentence can have multiple labels)

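The code in this section votes across the runs of one model only. As a rough sketch of what pooling across models with explicit tie handling could look like, the hypothetical helper `pooled_majority` below treats every (model, run) pair as one vote for a single sentence. The tie rule (`tie_value=0`, i.e., do not assign the label on an exact 50/50 split) is an assumption, not the author's method.

```python
CLASSES = ["Marketing", "Finance", "Accounting", "Operations", "IT", "HR"]

def pooled_majority(label_lists, classes=CLASSES, tie_value=0):
    """
    label_lists: one list of labels per (model, run) for a single sentence.
    A class is assigned when more than half of the votes include it.
    An exact 50/50 split returns tie_value (0 = conservative; an assumption).
    """
    total = len(label_lists)
    result = {}
    for cls in classes:
        votes = sum(1 for labels in label_lists if cls in labels)
        if 2 * votes > total:
            result[cls] = 1
        elif 2 * votes == total:
            result[cls] = tie_value
        else:
            result[cls] = 0
    # None is derived: no functional class reached a majority
    result["None"] = int(all(result[c] == 0 for c in classes))
    return result

# Two hypothetical models x two runs each = 4 votes (even, so ties can occur)
votes = [["Finance"], ["Finance", "IT"], ["Finance", "IT"], []]
print(pooled_majority(votes))
```

Here Finance gets 3 of 4 votes and is assigned, while IT gets exactly 2 of 4 and falls under the tie rule. Using an odd total number of votes avoids ties altogether.
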
In [None]:
# =============================================================================
# MAJORITY VOTE PER MODEL (across its own runs)
# =============================================================================

import pandas as pd
import numpy as np
import glob
import ast
import re
import os

# -----------------------------
# Configuration
# -----------------------------

TRAIN_DIR = "./output/train"  # Now pointing to train (all code is the same, but we are using a different folder for the files)
RUNS = [1, 2, 3]
CLASSES = ["Marketing", "Finance", "Accounting", "Operations", "IT", "HR"]

In [None]:
# -----------------------------
# Helper Functions
# -----------------------------

def parse_labels_safe(labels):
    """Parse labels from various formats."""
    if labels is None:
        return []
    if isinstance(labels, list):
        return labels
    if isinstance(labels, str):
        try:
            parsed = ast.literal_eval(labels)
            return parsed if isinstance(parsed, list) else []
        except (ValueError, SyntaxError):  # malformed label string
            return []
    return []

def parse_filename(filename):
    """
    Parse filename like 'train_azure_gpt-4.1_chat_run1.pkl'
    into (vendor, model, mode, run).
    Handles optional dataset prefix (holdout_, train_, prelim_, output_).
    """
    name = filename.replace('.pkl', '').replace('.csv', '')
    
    # Strip known dataset prefixes
    for prefix in ['holdout_', 'train_', 'prelim_', 'output_']:
        if name.startswith(prefix):
            name = name[len(prefix):]
            break
    
    match = re.search(r'_run(\d+)$', name)
    if not match:
        return None
    run = int(match.group(1))
    name = name[:match.start()]
    parts = name.rsplit('_', 1)
    if len(parts) != 2:
        return None
    mode = parts[1]
    remainder = parts[0]
    # Vendor is everything before the first underscore; the model name may itself contain underscores
    if '_' not in remainder:
        return None
    vendor, model = remainder.split('_', 1)
    return vendor, model, mode, run

def discover_models(train_dir, runs):
    """Find all vendor_model_mode combos that have ALL specified runs."""
    models = {}
    for run in runs:
        run_dir = f"{train_dir}/Run{run}"
        if not os.path.exists(run_dir):
            continue
        for filepath in glob.glob(f"{run_dir}/*.pkl"):
            filename = os.path.basename(filepath)  # also works with Windows path separators
            parsed = parse_filename(filename)
            if parsed is None:
                continue
            vendor, model, mode, file_run = parsed
            if file_run != run:
                continue
            key = (vendor, model, mode)
            if key not in models:
                models[key] = {}
            models[key][run] = filepath
    
    # Keep only models with ALL specified runs
    return {k: v for k, v in models.items() if all(r in v for r in runs)}

def compute_majority_vote(model_runs, classes, runs):
    """
    Compute majority vote for a single model across its runs.
    
    Votes are counted for the 6 functional classes only.
    None is derived: if no functional class gets majority, None = 1.
    None_votes tracks how many runs returned an empty label list (for reference).
    
    Returns DataFrame with sentence + binary labels + vote counts.
    """
    # Load all runs
    run_dfs = {}
    for run in runs:
        run_dfs[run] = pd.read_pickle(model_runs[run])
    
    # Collect votes per sentence
    votes = {}
    for run in runs:
        df = run_dfs[run]
        for _, row in df.iterrows():
            sent_id = row['id']
            sentence = row['sentence']
            labels = parse_labels_safe(row['labels'])
            
            # Skip error rows
            if 'error' in df.columns and pd.notna(row.get('error')):
                continue
            
            if sent_id not in votes:
                votes[sent_id] = {
                    'sentence': sentence,
                    **{cls: 0 for cls in classes},
                    'none_runs': 0,
                    'total_runs': 0,
                }
            
            votes[sent_id]['total_runs'] += 1
            
            if len(labels) == 0:
                votes[sent_id]['none_runs'] += 1
            else:
                for label in labels:
                    if label in classes:
                        votes[sent_id][label] += 1
    
    # Calculate majority
    results = []
    for sent_id, data in votes.items():
        total = data['total_runs']
        threshold = total / 2  # > 50%
        
        row = {'id': sent_id, 'sentence': data['sentence']}
        
        # Majority vote on 6 functional classes
        for cls in classes:
            row[cls] = 1 if data[cls] > threshold else 0
        
        # None is derived: all functional classes = 0
        row['None'] = 1 if all(row[cls] == 0 for cls in classes) else 0
        
        # Vote counts (for reference / pseudo-probabilities)
        for cls in classes:
            row[f'{cls}_votes'] = data[cls]
        row['None_votes'] = data['none_runs']
        row['total_runs'] = total
        
        results.append(row)
    
    df_result = pd.DataFrame(results)
    label_cols = classes + ['None']
    vote_cols = [f'{cls}_votes' for cls in classes] + ['None_votes', 'total_runs']
    df_result = df_result[['id', 'sentence'] + label_cols + vote_cols]
    df_result = df_result.sort_values('id').reset_index(drop=True)
    
    return df_result

# -----------------------------
# Main
# -----------------------------

print("="*70)
print("MAJORITY VOTE PER MODEL")
print("="*70)
print(f"Train dir: {TRAIN_DIR}")
print(f"Runs: {RUNS}")

# Discover models
models = discover_models(TRAIN_DIR, RUNS)

print(f"\nFound {len(models)} model(s) with all {len(RUNS)} runs:")
for (vendor, model, mode), paths in models.items():
    print(f"  {vendor} / {model} / {mode}")

# Process each model
all_majority = {}

for (vendor, model, mode), paths in models.items():
    model_key = f"{vendor}_{model}_{mode}"
    
    print(f"\n{'='*70}")
    print(f"MODEL: {vendor} / {model} / {mode}")
    print(f"{'='*70}")
    
    # Compute majority vote
    df_mv = compute_majority_vote(paths, CLASSES, RUNS)
    all_majority[model_key] = df_mv
    
    # Report
    print(f"Total sentences: {len(df_mv)}")
    print(f"Runs per sentence: {df_mv['total_runs'].iloc[0]}")
    
    print(f"\n{'Class':<15} {'Count':>8} {'Share':>10}")
    print("-"*35)
    for cls in CLASSES:
        count = df_mv[cls].sum()
        share = count / len(df_mv) * 100
        print(f"{cls:<15} {count:>8} {share:>9.1f}%")
    
    # None (derived)
    none_count = df_mv['None'].sum()
    none_share = none_count / len(df_mv) * 100
    print("-"*35)
    print(f"{'None (derived)':<15} {none_count:>8} {none_share:>9.1f}%")
    
    # Multi-label distribution
    num_labels = df_mv[CLASSES].sum(axis=1)
    print(f"\n{'# Labels':<15} {'Count':>8} {'Share':>10}")
    print("-"*35)
    for n in range(len(CLASSES) + 1):
        count = (num_labels == n).sum()
        if count > 0:
            share = count / len(df_mv) * 100
            label_text = f"{n} label{'s' if n != 1 else ''}"
            print(f"{label_text:<15} {count:>8} {share:>9.1f}%")
    
    # Save (with dataset prefix for consistency)
    dataset_tag = os.path.basename(os.path.normpath(TRAIN_DIR)).lower()
    save_path = f"{TRAIN_DIR}/{dataset_tag}_{model_key}_majority_vote"
    df_mv.to_csv(f"{save_path}.csv", index=False)
    df_mv.to_pickle(f"{save_path}.pkl")
    print(f"\nSaved: {save_path}.csv (.pkl)")

# -----------------------------
# Summary across models
# -----------------------------

if len(all_majority) > 1:
    print(f"\n{'='*70}")
    print("COMPARISON ACROSS MODELS")
    print(f"{'='*70}")
    
    # Header
    model_keys = list(all_majority.keys())
    header = f"{'Class':<15}"
    for key in model_keys:
        short = key.split('_', 1)[1]  # Remove vendor prefix for display
        header += f" {short:>20}"
    print(header)
    print("-" * (15 + 21 * len(model_keys)))
    
    for cls in CLASSES:
        row = f"{cls:<15}"
        for key in model_keys:
            df_mv = all_majority[key]
            count = df_mv[cls].sum()
            share = count / len(df_mv) * 100
            row += f" {count:>8} ({share:>5.1f}%)"
        print(row)
    
    # None row
    row = f"{'None (derived)':<15}"
    for key in model_keys:
        df_mv = all_majority[key]
        count = df_mv['None'].sum()
        share = count / len(df_mv) * 100
        row += f" {count:>8} ({share:>5.1f}%)"
    print("-" * (15 + 21 * len(model_keys)))
    print(row)

print(f"\n{'='*70}")
print("DONE")
print(f"{'='*70}")

# Fine-Tune a pretrained LLM to create a Vertical AI model
- We now have training data for fine-tuning a pretrained LLM (here, RoBERTa Large) to become a classifier for business functions
- You can experiment with different pretrained LLMs on Huggingface that may be more appropriate for your fine-tuning purpose:
> https://huggingface.co/models?pipeline_tag=fill-mask&library=transformers&sort=trending

### **IMPORTANT**: A hyperparameter sets the maximum length of a text (in tokens) for fine-tuning.
```
MAX_LENGTH = 128 # max length of sentences in tokens
```
> * I set it to 128. If your texts are longer, you need to increase it; otherwise they will be truncated!  
> * The larger the max length is, the slower the training process.  
> * Why? See class 8 on Deep Learning. Hint: When you have more input tokens, more needs to be embedded and contextualized, which takes compute and RAM.

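One way to choose `MAX_LENGTH` is to look at how many texts would actually be truncated. The sketch below is a minimal illustration: the helper `truncation_report` and the token counts are hypothetical. It works on plain token counts so that it runs without downloading a tokenizer.

```python
def truncation_report(token_counts, max_length=128):
    """Summarize how many texts exceed max_length tokens."""
    counts = sorted(token_counts)
    n = len(counts)
    if n == 0:
        return {"n": 0, "max": 0, "p95": 0, "share_truncated": 0.0}
    p95 = counts[min(n - 1, int(0.95 * n))]  # simple nearest-rank percentile
    too_long = sum(1 for c in counts if c > max_length)
    return {"n": n, "max": counts[-1], "p95": p95, "share_truncated": too_long / n}

# Hypothetical token counts. With a Hugging Face tokenizer you could obtain them via
#   token_counts = [len(ids) for ids in tokenizer(texts, truncation=False)["input_ids"]]
token_counts = [18, 25, 31, 40, 44, 52, 60, 75, 130, 210]
print(truncation_report(token_counts, max_length=128))
```

If the truncated share is more than a small fraction, raising `MAX_LENGTH` (at the cost of speed and memory) is probably worth considering.
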
## Part 1: Imports and Configuration


In [None]:
# Install if needed:
# pip install -q -U transformers datasets accelerate scikit-learn

import os
import torch
import numpy as np
import pandas as pd
import json
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    f1_score,
    roc_auc_score,
    matthews_corrcoef,
    precision_score,
    recall_score,
    classification_report,
    confusion_matrix,
    hamming_loss,
    jaccard_score,
)
from transformers import (
    RobertaTokenizer,
    RobertaForSequenceClassification,
    Trainer,
    TrainingArguments,
    EarlyStoppingCallback,
)
from datasets import Dataset
import warnings
warnings.filterwarnings('ignore')

# -----------------------------
# Configuration
# -----------------------------

MODEL_NAME = "roberta-large"
CLASSES = ["Marketing", "Finance", "Accounting", "Operations", "IT", "HR"]
NUM_LABELS = len(CLASSES)

# Hyperparameters
LEARNING_RATE = 2e-5
WEIGHT_DECAY = 0.01
WARMUP_RATIO = 0.1
NUM_EPOCHS = 4
BATCH_SIZE = 32  # Reduce if OOM (out of memory)
GRADIENT_ACCUMULATION = 2  # Effective batch size = 32 * 2 = 64
MAX_LENGTH = 128 # max length of sentences in tokens
SEED = 42

# Early stopping
EARLY_STOPPING_PATIENCE = 2

# Threshold for binary predictions
THRESHOLD = 0.5

# Mixed precision (True = faster, less memory; False = more stable)
USE_MIXED_PRECISION = True

# -----------------------------
# Device Selection (CUDA > MPS > CPU)
# -----------------------------

def get_device_and_precision():
    """Detect best available device and set precision flags."""
    if torch.cuda.is_available():
        device = torch.device("cuda")
        device_name = torch.cuda.get_device_name(0)
        fp16 = USE_MIXED_PRECISION
        bf16 = False
        use_mps = False
        print(f"Device: CUDA ({device_name})")
        print(f"  VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
        
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
        fp16 = False
        bf16 = USE_MIXED_PRECISION
        use_mps = True
        print(f"Device: Apple Silicon (MPS)")
        os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.0"
        
    else:
        device = torch.device("cpu")
        fp16 = False
        bf16 = False
        use_mps = False
        print(f"Device: CPU (this will be slow!)")
    
    precision = "mixed (fp16)" if fp16 else "mixed (bf16)" if bf16 else "full (fp32)"
    print(f"  Precision: {precision}")
    
    return device, fp16, bf16, use_mps

device, use_fp16, use_bf16, use_mps = get_device_and_precision()

print(f"\nConfiguration:")
print(f"  Model: {MODEL_NAME}")
print(f"  Classes: {CLASSES}")
print(f"  Epochs: {NUM_EPOCHS}")
print(f"  Batch size: {BATCH_SIZE} x {GRADIENT_ACCUMULATION} = {BATCH_SIZE * GRADIENT_ACCUMULATION}")
print(f"  Learning rate: {LEARNING_RATE}")
print(f"  Max length: {MAX_LENGTH}")
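
To see how these hyperparameters interact, the following sketch works through the arithmetic for a hypothetical train set of 8,000 sentences on a single device. The helper `training_schedule` is illustrative only. The exact step count reported by the Hugging Face `Trainer` can differ slightly depending on the library version and on how the last partial batch is handled.

```python
import math

def training_schedule(n_train, batch_size=32, grad_accum=2, epochs=4, warmup_ratio=0.1):
    """Approximate optimizer steps for a single-device run (an approximation)."""
    effective_batch = batch_size * grad_accum            # samples per optimizer step
    steps_per_epoch = math.ceil(n_train / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)  # learning rate ramps up during these steps
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }

# Hypothetical train size; replace with len(df_train)
print(training_schedule(8000))
```

Halving `BATCH_SIZE` and doubling `GRADIENT_ACCUMULATION` keeps the effective batch size constant, which is a common way to work around out-of-memory errors.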

## Part 2: Prepare Data

In [None]:
#!pip install iterative-stratification

In [None]:
# -----------------------------
# Load Data
# -----------------------------

print("="*60)
print("LOADING DATA")
print("="*60)

df_train_full = pd.read_pickle("./output/train/train_azure_gpt-4.1_chat_majority_vote.pkl")
df_holdout = pd.read_pickle("./Holdout_Train/holdout_human.pkl")

print(f"Train (full): {len(df_train_full)}")
print(f"Holdout: {len(df_holdout)}")

# -----------------------------
# Where to save model and results
# -----------------------------
model_save_path = "./roberta_multilabel/best_model"


# -----------------------------
# Train/Validation Split (80/20)
# -----------------------------

# Option 1: Iterative stratification (best for multi-label)
# pip install iterative-stratification
try:
    from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit
    
    msss = MultilabelStratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=SEED)
    train_idx, val_idx = next(msss.split(df_train_full, df_train_full[CLASSES].values))
    
    df_train = df_train_full.iloc[train_idx].copy()
    df_val = df_train_full.iloc[val_idx].copy()
    print("Using: Iterative multi-label stratification")

except ImportError:
    # Option 2: Simple random split (no stratification)
    df_train, df_val = train_test_split(
        df_train_full, 
        test_size=0.2, 
        random_state=SEED,
    )
    print("Using: Random split (install 'iterative-stratification' for better stratification)")

print(f"\nSplit:")
print(f"  Train: {len(df_train)}")
print(f"  Validation: {len(df_val)}")
print(f"  Holdout: {len(df_holdout)}")

# -----------------------------
# Extract Labels
# -----------------------------

def get_labels_array(df):
    """Extract labels as numpy array."""
    return df[CLASSES].values.astype(np.float32)

train_labels = get_labels_array(df_train)
val_labels = get_labels_array(df_val)
holdout_labels = get_labels_array(df_holdout)

print(f"\nLabel shapes:")
print(f"  Train: {train_labels.shape}")
print(f"  Val: {val_labels.shape}")
print(f"  Holdout: {holdout_labels.shape}")

# Class distribution
print(f"\nTrain class distribution:")
for i, cls in enumerate(CLASSES):
    count = train_labels[:, i].sum()
    pct = count / len(train_labels) * 100
    print(f"  {cls}: {int(count)} ({pct:.1f}%)")

# Verify validation has all classes
print(f"\nValidation class distribution:")
for i, cls in enumerate(CLASSES):
    count = val_labels[:, i].sum()
    pct = count / len(val_labels) * 100
    print(f"  {cls}: {int(count)} ({pct:.1f}%)")

# -----------------------------
# Tokenization
# -----------------------------

print(f"\n{'='*60}")
print("TOKENIZATION")
print("="*60)

tokenizer = RobertaTokenizer.from_pretrained(MODEL_NAME)

def tokenize_data(texts, labels):
    """Tokenize texts and create HuggingFace Dataset."""
    encodings = tokenizer(
        texts.tolist(),
        truncation=True,
        padding='max_length',
        max_length=MAX_LENGTH,
        return_tensors=None,
    )
    
    dataset = Dataset.from_dict({
        'input_ids': encodings['input_ids'],
        'attention_mask': encodings['attention_mask'],
        'labels': labels.tolist(),
    })
    
    return dataset

print("Tokenizing datasets...")
train_dataset = tokenize_data(df_train['sentence'].values, train_labels)
val_dataset = tokenize_data(df_val['sentence'].values, val_labels)
holdout_dataset = tokenize_data(df_holdout['sentence'].values, holdout_labels)

print(f"  Train: {len(train_dataset)}")
print(f"  Validation: {len(val_dataset)}")
print(f"  Holdout: {len(holdout_dataset)}")

# Sequence length stats (count real tokens via the attention mask; input_ids are all padded to MAX_LENGTH)
train_lengths = [sum(mask) for mask in train_dataset['attention_mask']]
print(f"  Sequence lengths: min={min(train_lengths)}, max={max(train_lengths)}, mean={np.mean(train_lengths):.0f}")

## Part 3: Fine-Tune Model

In [None]:
# -----------------------------
# Load Model
# -----------------------------

print("="*60)
print("MODEL")
print("="*60)

model = RobertaForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=NUM_LABELS,
    problem_type="multi_label_classification",  # Enables BCE loss
)

if not use_mps:
    model.to(device)

total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")

# -----------------------------
# Metrics Function
# -----------------------------

def compute_metrics(eval_pred):
    """Compute metrics for validation during training."""
    logits, labels = eval_pred
    probs = torch.sigmoid(torch.tensor(logits)).numpy()
    preds = (probs >= THRESHOLD).astype(int)
    
    f1_micro = f1_score(labels, preds, average='micro', zero_division=0)
    f1_macro = f1_score(labels, preds, average='macro', zero_division=0)
    f1_samples = f1_score(labels, preds, average='samples', zero_division=0)
    
    try:
        auc_macro = roc_auc_score(labels, probs, average='macro')
    except ValueError:
        auc_macro = 0.0
    
    mcc_scores = []
    for i in range(labels.shape[1]):
        try:
            mcc = matthews_corrcoef(labels[:, i], preds[:, i])
            mcc_scores.append(mcc)
        except ValueError:
            mcc_scores.append(0.0)
    mcc_macro = np.mean(mcc_scores)
    
    return {
        'f1_micro': f1_micro,
        'f1_macro': f1_macro,
        'f1_samples': f1_samples,
        'auc_macro': auc_macro,
        'mcc_macro': mcc_macro,
    }

# -----------------------------
# Training Arguments
# -----------------------------

training_args = TrainingArguments(
    output_dir="./roberta_multilabel",
    
    # Training
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE * 2,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION,
    
    # Optimizer
    learning_rate=LEARNING_RATE,
    weight_decay=WEIGHT_DECAY,
    warmup_ratio=WARMUP_RATIO,
    optim="adamw_torch",
    lr_scheduler_type="linear",
    max_grad_norm=1.0,
    
    # Evaluation & Saving
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",
    greater_is_better=True,
    save_total_limit=2,
    
    # Logging
    logging_dir="./logs",
    logging_steps=50,
    logging_first_step=True,
    report_to="none",
    
    # Precision
    fp16=use_fp16,
    bf16=use_bf16,
    use_mps_device=use_mps,
    
    # Reproducibility
    seed=SEED,
    data_seed=SEED,
    
    # Performance
    dataloader_num_workers=0 if use_mps else 2,
    dataloader_pin_memory=not use_mps,
    remove_unused_columns=False,
)

# -----------------------------
# Trainer
# -----------------------------

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    callbacks=[
        EarlyStoppingCallback(early_stopping_patience=EARLY_STOPPING_PATIENCE),
    ],
)

# -----------------------------
# Train
# -----------------------------

print(f"\n{'='*60}")
print("TRAINING")
print("="*60)

train_result = trainer.train()

print(f"\n{'='*60}")
print("TRAINING COMPLETE")
print("="*60)
print(f"Runtime: {train_result.metrics['train_runtime']:.0f} seconds")
print(f"Samples/second: {train_result.metrics['train_samples_per_second']:.1f}")
print(f"Final loss: {train_result.metrics['train_loss']:.4f}")

# Validation results
print(f"\nValidation (Best Model):")
val_results = trainer.evaluate()
for key, value in val_results.items():
    if isinstance(value, float):
        print(f"  {key}: {value:.4f}")

# -----------------------------
# Save Model
# -----------------------------

trainer.save_model(model_save_path)
tokenizer.save_pretrained(model_save_path)
print(f"\nModel saved to: {model_save_path}")
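
With `problem_type="multi_label_classification"`, the model outputs one logit per class, and each class is decided independently. The sketch below shows, on made-up logits, how a sigmoid plus `THRESHOLD` turns logits into labels and how "None" is derived when no class passes the threshold. The helper names `sigmoid` and `logits_to_labels` are hypothetical.

```python
import math

CLASSES = ["Marketing", "Finance", "Accounting", "Operations", "IT", "HR"]
THRESHOLD = 0.5

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def logits_to_labels(logits, classes=CLASSES, threshold=THRESHOLD):
    """Return (labels, probabilities); labels is ['None'] if no class passes."""
    probs = [sigmoid(z) for z in logits]
    labels = [c for c, p in zip(classes, probs) if p >= threshold]
    return (labels or ["None"]), probs

# Hypothetical logits for one sentence, in the order of CLASSES
logits = [-2.0, 1.5, 0.3, -0.7, -3.1, -1.2]
labels, probs = logits_to_labels(logits)
print(labels)
print([round(p, 3) for p in probs])
```

Because the probabilities are independent per class, they do not sum to 1, and a sentence can receive several labels or none.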

## Part 4: Evaluate on Holdout (ground truth human expert labels)

In [None]:
# -----------------------------
# Set performance metric output path first! It is based on the model save path you defined earlier
# -----------------------------

MODEL_DIR = model_save_path.replace("best_model", "performance")
os.makedirs(MODEL_DIR, exist_ok=True) 

In [None]:
# -----------------------------
# Evaluation Functions
# -----------------------------

def evaluate_multilabel(y_true, y_pred, y_prob, class_names):
    """
    Comprehensive multi-label evaluation.
    
    Each label is treated as an independent binary decision.
    Macro averaging gives equal weight to each class (recommended for imbalanced data).
    """
    n_classes = len(class_names)
    
    # Per-class metrics
    per_class = {}
    for i, cls in enumerate(class_names):
        y_t = y_true[:, i]
        y_p = y_pred[:, i]
        y_pr = y_prob[:, i]
        
        try:
            auc = roc_auc_score(y_t, y_pr)
        except ValueError:  # AUC is undefined if only one class is present
            auc = 0.0
        
        per_class[cls] = {
            'f1': f1_score(y_t, y_p, zero_division=0),
            'precision': precision_score(y_t, y_p, zero_division=0),
            'recall': recall_score(y_t, y_p, zero_division=0),
            'mcc': matthews_corrcoef(y_t, y_p),
            'auc': auc,
            'support': int(y_t.sum()),
        }
    
    # Macro averages (equal weight to each class)
    macro = {
        'f1_macro': np.mean([per_class[c]['f1'] for c in class_names]),
        'precision_macro': np.mean([per_class[c]['precision'] for c in class_names]),
        'recall_macro': np.mean([per_class[c]['recall'] for c in class_names]),
        'mcc_macro': np.mean([per_class[c]['mcc'] for c in class_names]),
        'auc_macro': np.mean([per_class[c]['auc'] for c in class_names]),
    }
    
    # Micro averages (pooled across all classes)
    micro = {
        'f1_micro': f1_score(y_true, y_pred, average='micro', zero_division=0),
        'precision_micro': precision_score(y_true, y_pred, average='micro', zero_division=0),
        'recall_micro': recall_score(y_true, y_pred, average='micro', zero_division=0),
    }
    
    # Sample-based metrics
    # zero_division=1: correct "None" predictions (both empty) score 1.0, not 0.0
    sample = {
        'f1_samples': f1_score(y_true, y_pred, average='samples', zero_division=1),
        'jaccard_samples': jaccard_score(y_true, y_pred, average='samples', zero_division=1),
    }
    
    # Overall metrics
    overall = {
        'exact_match_ratio': (y_pred == y_true).all(axis=1).mean(),
        'hamming_loss': hamming_loss(y_true, y_pred),
    }
    
    return {
        'per_class': per_class,
        'macro': macro,
        'micro': micro,
        'sample': sample,
        'overall': overall,
    }


def print_evaluation_report(metrics, class_names):
    """Print formatted evaluation report."""
    
    print("="*70)
    print("HOLDOUT EVALUATION REPORT")
    print("="*70)
    
    # Per-class table
    print("\n--- Per-Class Metrics (Each Label = Independent Binary Decision) ---")
    print(f"{'Class':<12} {'F1':>7} {'AUC':>7} {'MCC':>7} {'Prec':>7} {'Rec':>7} {'Support':>8}")
    print("-"*60)
    
    for cls in class_names:
        m = metrics['per_class'][cls]
        print(f"{cls:<12} {m['f1']:>7.4f} {m['auc']:>7.4f} {m['mcc']:>7.4f} "
              f"{m['precision']:>7.4f} {m['recall']:>7.4f} {m['support']:>8}")
    
    # Macro averages
    print("-"*60)
    m = metrics['macro']
    print(f"{'Macro Avg':<12} {m['f1_macro']:>7.4f} {m['auc_macro']:>7.4f} {m['mcc_macro']:>7.4f} "
          f"{m['precision_macro']:>7.4f} {m['recall_macro']:>7.4f}")
    
    # Summary
    print("\n--- Summary Metrics ---")
    print(f"  Macro F1:      {metrics['macro']['f1_macro']:.4f}  ← Primary (class-balanced)")
    print(f"  Macro AUC:     {metrics['macro']['auc_macro']:.4f}  ← Threshold-independent")
    print(f"  Macro MCC:     {metrics['macro']['mcc_macro']:.4f}  ← Robust to imbalance")
    print(f"  Micro F1:      {metrics['micro']['f1_micro']:.4f}  ← Dominated by frequent classes")
    print(f"  Sample F1:     {metrics['sample']['f1_samples']:.4f}  ← Per-instance average")
    print(f"  Exact Match:   {metrics['overall']['exact_match_ratio']:.4f}  ← All labels correct")
    print(f"  Hamming Loss:  {metrics['overall']['hamming_loss']:.4f}  ← Fraction wrong (lower=better)")
    
    # For paper
    print("\n--- For Report ---")
    print(f"  Primary:   Macro F1 = {metrics['macro']['f1_macro']:.4f}")
    print(f"  Secondary: Macro AUC = {metrics['macro']['auc_macro']:.4f}, Macro MCC = {metrics['macro']['mcc_macro']:.4f}")
    print(f"  Overall:   Exact Match = {metrics['overall']['exact_match_ratio']:.4f}")


# -----------------------------
# Get Predictions
# -----------------------------

print("="*60)
print("HOLDOUT PREDICTIONS")
print("="*60)

holdout_output = trainer.predict(holdout_dataset)
holdout_logits = holdout_output.predictions
holdout_probs = torch.sigmoid(torch.tensor(holdout_logits)).numpy()
holdout_preds = (holdout_probs >= THRESHOLD).astype(int)

print(f"Holdout samples: {len(holdout_labels)}")
print(f"Predictions shape: {holdout_preds.shape}")

# -----------------------------
# Evaluate
# -----------------------------

metrics = evaluate_multilabel(
    y_true=holdout_labels,
    y_pred=holdout_preds,
    y_prob=holdout_probs,
    class_names=CLASSES,
)

print_evaluation_report(metrics, CLASSES)

# -----------------------------
# None Class (Derived)
# -----------------------------

print("\n--- 'None' Class (Derived: all classes = 0) ---")

holdout_none_true = (holdout_labels.sum(axis=1) == 0).astype(int)
holdout_none_pred = (holdout_preds.sum(axis=1) == 0).astype(int)
holdout_none_prob = 1 - holdout_probs.max(axis=1)

none_f1 = f1_score(holdout_none_true, holdout_none_pred, zero_division=0)
none_mcc = matthews_corrcoef(holdout_none_true, holdout_none_pred)
try:
    none_auc = roc_auc_score(holdout_none_true, holdout_none_prob)
except ValueError:  # AUC is undefined if only one class is present
    none_auc = 0.0
none_precision = precision_score(holdout_none_true, holdout_none_pred, zero_division=0)
none_recall = recall_score(holdout_none_true, holdout_none_pred, zero_division=0)
none_support = int(holdout_none_true.sum())

print(f"{'None':<12} {none_f1:>7.4f} {none_auc:>7.4f} {none_mcc:>7.4f} "
      f"{none_precision:>7.4f} {none_recall:>7.4f} {none_support:>8}")

# -----------------------------
# Confusion Matrices
# -----------------------------

print("\n--- Confusion Matrices (per class) ---")
print(f"{'Class':<12} {'TP':>6} {'TN':>6} {'FP':>6} {'FN':>6}")
print("-"*40)

for i, cls in enumerate(CLASSES):
    y_true = holdout_labels[:, i]
    y_pred = holdout_preds[:, i]
    # labels=[0, 1] forces a 2x2 matrix even if a class never occurs in the holdout
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    tn, fp, fn, tp = cm.ravel()
    print(f"{cls:<12} {tp:>6} {tn:>6} {fp:>6} {fn:>6}")

none_cm = confusion_matrix(holdout_none_true, holdout_none_pred, labels=[0, 1])
none_tn, none_fp, none_fn, none_tp = none_cm.ravel()
print(f"{'None':<12} {none_tp:>6} {none_tn:>6} {none_fp:>6} {none_fn:>6}")

# -----------------------------
# Multi-label Analysis
# -----------------------------

print("\n--- Multi-label Analysis ---")

true_label_counts = holdout_labels.sum(axis=1)
pred_label_counts = holdout_preds.sum(axis=1)

print(f"Average labels per sentence:")
print(f"  True: {true_label_counts.mean():.2f}")
print(f"  Predicted: {pred_label_counts.mean():.2f}")

print(f"\nLabel count distribution:")
print(f"{'# Labels':<10} {'True':>8} {'Predicted':>10}")
print("-"*30)
for n in range(len(CLASSES) + 1):
    true_count = (true_label_counts == n).sum()
    pred_count = (pred_label_counts == n).sum()
    print(f"{n:<10} {true_count:>8} {pred_count:>10}")

# -----------------------------
# Save Results
# -----------------------------

print("\n" + "="*60)
print("SAVING RESULTS")
print("="*60)

# Save predictions
df_holdout_results = df_holdout.copy()
for i, cls in enumerate(CLASSES):
    df_holdout_results[f'{cls}_prob'] = holdout_probs[:, i]
    df_holdout_results[f'{cls}_pred'] = holdout_preds[:, i]
df_holdout_results['None_prob'] = holdout_none_prob
df_holdout_results['None_pred'] = holdout_none_pred

df_holdout_results.to_csv(f"{MODEL_DIR}/holdout_predictions.csv", index=False)
df_holdout_results.to_pickle(f"{MODEL_DIR}/holdout_predictions.pkl")
print(f"  Predictions: {MODEL_DIR}/holdout_predictions.csv")

# Save metrics
metrics_summary = {
    'model': MODEL_NAME,
    'threshold': THRESHOLD,
    'holdout_size': len(df_holdout),
    'macro': metrics['macro'],
    'micro': metrics['micro'],
    'sample': metrics['sample'],
    'overall': metrics['overall'],
    'per_class': metrics['per_class'],
    'none_class': {
        'f1': none_f1, 'auc': none_auc, 'mcc': none_mcc,
        'precision': none_precision, 'recall': none_recall, 'support': none_support,
    },
    'hyperparameters': {
        'learning_rate': LEARNING_RATE,
        'weight_decay': WEIGHT_DECAY,
        'warmup_ratio': WARMUP_RATIO,
        'num_epochs': NUM_EPOCHS,
        'batch_size': BATCH_SIZE,
        'gradient_accumulation': GRADIENT_ACCUMULATION,
        'max_length': MAX_LENGTH,
    },
}

with open(f"{MODEL_DIR}/holdout_metrics.json", 'w') as f:
    json.dump(metrics_summary, f, indent=2, default=float)
print(f"  Metrics: {MODEL_DIR}/holdout_metrics.json")

# Save training history
with open(f"{MODEL_DIR}/training_history.json", 'w') as f:
    json.dump(trainer.state.log_history, f, indent=2)
print(f"  History: {MODEL_DIR}/training_history.json")

print("\n" + "="*60)
print("DONE")
print("="*60)
print(f"\nFinal Results:")
print(f"  Macro F1:    {metrics['macro']['f1_macro']:.4f}")
print(f"  Macro AUC:   {metrics['macro']['auc_macro']:.4f}")
print(f"  Macro MCC:   {metrics['macro']['mcc_macro']:.4f}")
print(f"  Exact Match: {metrics['overall']['exact_match_ratio']:.4f}")
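
The report treats Macro F1 as the primary metric and notes that Micro F1 is dominated by frequent classes. The toy example below, with made-up counts for one frequent and one rare class, illustrates why the two can diverge. The numbers are hypothetical and not results from this notebook.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical per-class counts (TP, FP, FN): one frequent class, one rare class
counts = {"Operations": (90, 10, 10), "HR": (1, 3, 4)}

# Macro: average the per-class F1 scores (each class weighs the same)
macro_f1 = sum(f1(*c) for c in counts.values()) / len(counts)

# Micro: pool the counts across classes first, then compute one F1
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro_f1 = f1(tp, fp, fn)

print(f"Macro F1: {macro_f1:.4f}")
print(f"Micro F1: {micro_f1:.4f}")
```

In this toy case the weak rare class pulls Macro F1 down to about 0.56, while Micro F1 stays near 0.87 because the frequent class contributes most of the pooled counts.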

**Please cite this paper** if you use any part or all of this code in a project - be it commercial or academic:  

> Ringel, Daniel, *Creating Synthetic Experts with Generative Artificial Intelligence* (December 5, 2023). Kenan Institute of Private Enterprise Research Paper No. 4542949, Available at SSRN: https://ssrn.com/abstract=4542949 or http://dx.doi.org/10.2139/ssrn.4542949 


*This notebook was developed by Daniel M. Ringel in January 2026 with the help of the various vendors' API documentation (and examples) as well as genAI models from OpenAI and Anthropic.*