
<div style="
    background-color: #f7f7f7;
    background-image: url(''), url('') ;
    background-position: left bottom, right top;
    background-repeat: no-repeat,  no-repeat;
    background-size: auto 60px, auto 160px;
    border-radius: 5px;
    box-shadow: 0px 3px 1px -2px rgba(0, 0, 0, 0.2), 0px 2px 2px 0px rgba(0, 0, 0, 0.14), 0px 1px 5px 0px rgba(0,0,0,.12);">

<h1 style="
    color: #2a4cdf;
    font-style: normal;
    font-size: 2.25rem;
    line-height: 1.4em;
    font-weight: 600;
    padding: 30px 200px 0px 30px;"> 
        Perovscribe Evals</h1>

<p style="
    line-height: 1.4em;
    padding: 30px 200px 0px 30px;">
    This notebook analyzes the output of the Perovscribe extraction pipeline and computes extraction performance metrics.
</p>

</div>

## Overview

This notebook evaluates the performance of the Perovscribe extraction pipeline by comparing extracted data against a ground truth dataset. The evaluation covers multiple Large Language Models (LLMs) and compares their extraction performance across different data fields.


### Evaluation Methodology

The evaluation uses a **confusion matrix** approach:
- **True Positives (TP)**: Fields correctly extracted and matching ground truth
- **False Positives (FP)**: Fields extracted but not present in ground truth
- **False Negatives (FN)**: Fields in ground truth but not extracted

**Metrics calculated**:
- **Precision** = TP / (TP + FP) - Measures extraction accuracy
- **Recall** = TP / (TP + FN) - Measures extraction completeness
- **F1 Score** = 2 × (Precision × Recall) / (Precision + Recall) - Harmonic mean of precision and recall
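
As a worked example of the formulas above, the following sketch computes the three metrics from raw counts. The function name `prf1` and the counts are illustrative assumptions, not part of the Perovscribe code. Zero denominators return `nan`, which is the same convention the helper functions in this notebook use.

```python
import math


def prf1(tp: float, fp: float, fn: float) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else math.nan
    recall = tp / (tp + fn) if (tp + fn) > 0 else math.nan
    if math.isnan(precision) or math.isnan(recall) or (precision + recall) == 0:
        # F1 is undefined when either input is undefined or both are zero
        f1 = math.nan
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 8 correct fields, 2 spurious, 4 missed
precision, recall, f1 = prf1(tp=8, fp=2, fn=4)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Note that F1 can equivalently be written as 2·TP / (2·TP + FP + FN), which avoids computing precision and recall separately.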


## Setup and Evaluations

The evaluation compares the extracted data against a ground truth dataset.
In some cases, the scoring uses an LLM to judge the extracted data.

For this reason, we need API keys for the LLMs we are using.
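
A quick pre-flight check can catch missing keys before any scoring call is made. The sketch below assumes the keys are exposed through the `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` environment variables, which LiteLLM reads for OpenAI and Anthropic models. The helper name `missing_api_keys` is hypothetical, and the list of required keys should be adjusted to the providers actually in use.

```python
import os

# Assumed variable names: adjust to the providers you actually use.
REQUIRED_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def missing_api_keys(required=REQUIRED_KEYS, env=None):
    """Return the names of required keys that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_api_keys()
if missing:
    print(f"Missing API keys: {', '.join(missing)} (set them in .env or the shell)")
```

If the keys live in a `.env` file, run this check after `load_dotenv()` so the variables are already in the environment.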

### Imports & Setup

In [None]:
# --- Imports ---
# Standard library
import json
import os
from importlib.resources import files
from math import pi
from pathlib import Path

# Third-party libraries
import dabest
import litellm
from litellm.caching.caching import Cache
litellm.cache = Cache(type="disk")  # use LiteLLM's on-disk response cache
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from dotenv import load_dotenv
from tqdm import tqdm


# Internal modules
# Ensure 'perovscribe' is accessible in the repo structure
from perovscribe.pipeline import ExtractionPipeline
from plotly_theme import register_template, set_defaults, MODEL_COLORS

# --- Configuration & Theme ---
load_dotenv()  # Loads .env if present
register_template()
set_defaults()

# Define paths relative to the installed package data (no absolute paths, for reproducibility)
DATA_DIR = files("perovscribe").joinpath("data")
EXTRACTIONS_DIR = DATA_DIR / "extractions" 
GROUND_TRUTH_DIR = DATA_DIR / "ground_truth" / "test"
EXPERTS_DIR = EXTRACTIONS_DIR / "humans" / "Consensus"
# MODEL_COLORS is imported from plotly_theme (single source of truth for colors)



### Model Configurations

In [None]:
# Define model metadata: display names and plot colors, keyed by API model ID.
# Colors are obtained from MODEL_COLORS (imported from plotly_theme)
MODEL_CONFIG = {
    "gpt-5-2025-08-07": {
        "name": "GPT-5",
        "color": MODEL_COLORS["GPT-5"],
    },
    "gpt-5-mini-2025-08-07": {
        "name": "GPT-5 Mini",
        "color": MODEL_COLORS["GPT-5 Mini"],
    },
    "claude-opus-4-20250514": {
        "name": "Claude Opus 4",
        "color": MODEL_COLORS["Claude Opus 4"],
    },
    "claude-sonnet-4-20250514": {
        "name": "Claude Sonnet 4",
        "color": MODEL_COLORS["Claude Sonnet 4"],
    },
    "claude-opus-4-1-20250805": {
        "name": "Claude Opus 4.1",
        "color": MODEL_COLORS["Claude Opus 4.1"],
    },
    "gpt-4.1-2025-04-14": {
        "name": "GPT-4.1",
        "color": MODEL_COLORS["GPT-4.1"],
    },
    "gpt-4o-2024-08-06": {
        "name": "GPT-4o",
        "color": MODEL_COLORS["GPT-4o"],
    },
}

## Evaluations

##### Evals Code

In [None]:
# ============================================================================
# DATA LOADING AND MODEL EVALUATION
# ============================================================================

all_metrics = {}  # model_name -> paper_doi -> {field: {"TP": ..., "FP": ..., "FN": ...}}
all_precs_and_recalls = {}

# Evaluate all models
for model_dir in tqdm(EXTRACTIONS_DIR.iterdir()):
    # Skip files and the human annotations folder (compare the name, not the path object)
    if not model_dir.is_dir() or model_dir.name == "humans":
        continue
    
    model_name = model_dir.name
    print(f"Evaluating model: {model_name}")
    
    pipeline = ExtractionPipeline(
        model_name=model_name, 
        preprocessor="pymupdf", 
        postprocessor="NONE", 
        cache_dir="", 
        use_cache=True
    )
    model_metrics, avg_recalls, avg_precisions = pipeline._evaluate_multiple(
        model_dir, GROUND_TRUTH_DIR
    )
    
    all_precs_and_recalls[model_name] = {
        "precision": avg_precisions, 
        "recall": avg_recalls
    }
    all_metrics[model_name] = model_metrics

# Rename models to readable display names (taken from MODEL_CONFIG)
model_name_map = {model_id: cfg["name"] for model_id, cfg in MODEL_CONFIG.items()}

all_metrics = {
    model_name_map.get(k, k): v for k, v in all_metrics.items()
}

##### Helper Functions

In [None]:
# ============================================================================
# HELPER FUNCTIONS (DATAFRAME VERSION)
# ============================================================================

def metrics_to_dataframe(metrics_dict):
    """
    Convert nested metrics dictionary to a flat DataFrame.
    
    Returns:
        DataFrame with columns: model, paper, field, TP, FP, FN
    """
    rows = []
    for model, papers in metrics_dict.items():
        for paper, fields in papers.items():
            for field, values in fields.items():
                if isinstance(values, dict):
                    rows.append({
                        'model': model,
                        'paper': paper,
                        'field': field,
                        'TP': values.get('TP', 0.0),
                        'FP': values.get('FP', 0.0),
                        'FN': values.get('FN', 0.0)
                    })
    return pd.DataFrame(rows)

def add_field_categories(df):
    """Add aggregation category for each field."""
    def categorize(field):  # noqa: PLR0911
        if field.endswith(":unit"):
            return "units"
        field_lower = field.lower()
        if "composition" in field_lower:
            return "composition"
        if "stability" in field_lower:
            return "stability"
        if "deposition" in field_lower:
            return "deposition"
        if "layers" in field_lower:
            return "layers"
        if "light" in field_lower:
            return "light"
        # Clean up individual fields
        if any(x in field for x in ["averaged_quantities", "number_devices", "encapsulated"]):
            return None
        return field.replace("_", " ").split(":value")[0]
    
    df = df.copy()  # work on a copy so the caller's frame is not modified
    df['category'] = df['field'].apply(categorize)
    return df[df['category'].notna()].copy()

def calculate_metrics(df, metric_type='recall'):
    """
    Calculate precision or recall for each row.
    
    Args:
        df: DataFrame with TP, FP, FN columns
        metric_type: 'recall' or 'precision'
    """
    df = df.copy()  # avoid assigning into a slice of another DataFrame
    # Recall divides by TP + FN; anything else is treated as precision (TP + FP)
    other = 'FN' if metric_type == 'recall' else 'FP'
    denominator = df['TP'] + df[other]
    # Rows with a zero denominator have an undefined score (NaN)
    df['score'] = (df['TP'] / denominator).where(denominator > 0)
    return df
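
The plots in this notebook aggregate scores in two different ways. The overall bar chart pools TP/FP/FN counts before dividing (micro-averaging), while the radar plots average per-row scores (macro-averaging). The sketch below uses made-up counts for one hypothetical model, purely to show that the two can disagree noticeably when papers differ in size.

```python
# Made-up (TP, FN) counts for one hypothetical model on two papers
counts = [(9.0, 1.0), (1.0, 3.0)]

# Macro: recall per paper, then the unweighted mean of those scores
per_paper = [tp / (tp + fn) for tp, fn in counts]
macro_recall = sum(per_paper) / len(per_paper)

# Micro: pool the counts across papers, then compute one recall
total_tp = sum(tp for tp, _ in counts)
total_fn = sum(fn for _, fn in counts)
micro_recall = total_tp / (total_tp + total_fn)

print(f"macro={macro_recall:.3f} micro={micro_recall:.3f}")
```

Macro-averaging weights every paper equally, whereas micro-averaging weights every extracted field equally, so papers with many fields dominate the micro score.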

## Visualize

#### Overall Performance

In [None]:
# ============================================================================
# BAR CHART: OVERALL MODEL PERFORMANCE (DATAFRAME VERSION)
# ============================================================================

# Calculate overall metrics per model
df = metrics_to_dataframe(all_metrics)

df_doi = df.groupby(['model', 'paper']).agg({'TP':'sum', 'FP':'sum', 'FN':'sum'}).reset_index()

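# Sum only the count columns; summing the string 'paper' column would be meaningless
overall = df_doi.groupby('model')[['TP', 'FP', 'FN']].sum().reset_index()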
overall['precision'] = overall['TP'] / (overall['TP'] + overall['FP'])
overall['recall']    = overall['TP'] / (overall['TP'] + overall['FN'])


# Plot
x = np.arange(len(overall))
width = 0.35

overall_performance_fig, ax = plt.subplots()
rects1 = ax.bar(x - width/2, overall['precision'], width, label='Precision')
rects2 = ax.bar(x + width/2, overall['recall'], width, label='Recall')

ax.set_ylabel('Score')
ax.set_title('Model Performances')
ax.set_xticks(x)
ax.set_xticklabels(overall['model'], rotation=45)
ax.set_yticks(np.arange(0, 1.1, 0.4))
ax.set_yticklabels([f"{y:.1f}" for y in np.arange(0, 1.1, 0.4)])
ax.set_ylim(0, 1.05)
ax.legend(loc='upper center', bbox_to_anchor=(0.5, 1.3), ncol=2, frameon=False)

# Add value labels
for rects, values in [(rects1, overall['precision']), (rects2, overall['recall'])]:
    for rect, val in zip(rects, values):
        height = rect.get_height()
        ax.text(rect.get_x() + rect.get_width()/2., height + 0.02,
                f'{val:.2f}', ha='center', va='bottom')

plt.tight_layout()

#### Radar Plot: Recalls per field

In [None]:
# ============================================================================
# RADAR PLOT: MODEL RECALLS PER FIELD (DATAFRAME VERSION)
# ============================================================================

# Convert to DataFrame and calculate recalls
df = metrics_to_dataframe(all_metrics)
df = add_field_categories(df)
df = calculate_metrics(df, metric_type='recall')

# Aggregate by model and category
aggregated = df.groupby(['model', 'category'])['score'].mean().reset_index()

# Pivot for radar plot
pivot_df = aggregated.pivot(index='model', columns='category', values='score').fillna(0)

# Create radar plot
fields = sorted(pivot_df.columns)
num_fields = len(fields)
angles = [n / float(num_fields) * 2 * pi for n in range(num_fields)]
angles += angles[:1]

radar_recall_fig = plt.figure(figsize=(10, 10))
ax = plt.subplot(111, polar=True)

# Plot each model
for model_name in pivot_df.index:
    scores = pivot_df.loc[model_name, fields].tolist()
    values = scores + [scores[0]]

    color = MODEL_COLORS.get(model_name, "#333333")  # fallback if missing

    ax.plot(
        angles,
        values,
        label=model_name,
        linewidth=2,
        color=color,
    )
    ax.fill(
        angles,
        values,
        color=color,
        alpha=0.03,
    )


# Customize plot
ax.set_ylim(0.0, 1)
angle_degrees = [a * 180 / np.pi for a in angles[:-1]]
ax.set_thetagrids(angle_degrees, labels=fields)
ax.tick_params(axis='x', pad=25)

for label in ax.get_xticklabels():
    label.set_fontsize(24)
    label.set_color("dimgray")
    label.set_rotation(45)
    label.set_horizontalalignment("center")

ax.set_yticks(np.linspace(0.0, 1, 3))
ax.set_yticklabels([f"{y:.1f}" for y in np.linspace(0.0, 1, 3)], 
                    fontsize=24, color="dimgray")

ax.set_theta_offset(np.deg2rad(17))
ax.set_theta_direction(-1)

plt.title("Model Recalls per Field", size=40, color="dimgray")
plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.1), 
           fontsize=24, ncol=2, frameon=False)
plt.tight_layout()

#### Radar Plot: Precisions per field

In [None]:
# ============================================================================
# RADAR PLOT: MODEL PRECISIONS PER FIELD (DATAFRAME VERSION)
# ============================================================================

# Convert to DataFrame and calculate precisions
df = metrics_to_dataframe(all_metrics)
df = add_field_categories(df)
df = calculate_metrics(df, metric_type='precision')

# Aggregate by model and category
aggregated = df.groupby(['model', 'category'])['score'].mean().reset_index()

# Pivot for radar plot
pivot_df = aggregated.pivot(index='model', columns='category', values='score').fillna(0)

# Create radar plot
fields = sorted(pivot_df.columns)
num_fields = len(fields)
angles = [n / float(num_fields) * 2 * pi for n in range(num_fields)]
angles += angles[:1]

radar_precision_fig = plt.figure(figsize=(10, 10))
ax = plt.subplot(111, polar=True)

# Plot each model
for model_name in pivot_df.index:
    scores = pivot_df.loc[model_name, fields].tolist()
    values = scores + [scores[0]]

    color = MODEL_COLORS.get(model_name, "#333333")  # fallback if missing

    ax.plot(
        angles,
        values,
        label=model_name,
        linewidth=2,
        color=color,
    )
    ax.fill(
        angles,
        values,
        color=color,
        alpha=0.03,
    )


# Customize plot
ax.set_ylim(0.3, 1)  # Different y-limit for precisions
angle_degrees = [a * 180 / np.pi for a in angles[:-1]]
ax.set_thetagrids(angle_degrees, labels=fields)
ax.tick_params(axis='x', pad=25)

# Style field labels
for label in ax.get_xticklabels():
    label.set_fontsize(24)
    label.set_color("dimgray")
    label.set_rotation(45)
    label.set_horizontalalignment("center")

# Style radial ticks
ax.set_yticks(np.linspace(0.3, 1, 3))
ax.set_yticklabels([f"{y:.1f}" for y in np.linspace(0.3, 1, 3)], 
                    fontsize=24, color="dimgray")

# Rotate plot
ax.set_theta_offset(np.deg2rad(17))
ax.set_theta_direction(-1)

# Title and legend
plt.title("Model Precisions per Field", size=40, color="dimgray")
plt.legend(
    loc='upper center',
    bbox_to_anchor=(0.5, -0.1),
    fontsize=24,
    ncol=2,
    frameon=False
)

plt.tight_layout()
# plt.savefig("plots/model_precisions_spider.pdf")

### Comparison with Experts

#### Evaluation Code

In [None]:
pipeline = ExtractionPipeline(model_name="Consensus", preprocessor="pymupdf", postprocessor="NONE", cache_dir="", use_cache=True)
authors_metrics, authors_recalls, authors_precisions = pipeline._evaluate_multiple(EXPERTS_DIR, GROUND_TRUTH_DIR)

In [None]:
all_metrics["Consensus"] = authors_metrics
experts_included_df = metrics_to_dataframe(all_metrics)

# 1. Get the set of papers that appear with model == "Consensus"
expert_papers = set(experts_included_df.loc[
    experts_included_df["model"] == "Consensus", 
    "paper"
])

# 2. Keep only rows for papers the experts also annotated (Consensus rows always qualify)
filtered_df = experts_included_df[
    experts_included_df["paper"].isin(expert_papers)
]

In [None]:
# Group by paper and model, sum TP and FP
micro_precision_df = (
    filtered_df
    .groupby(['paper', 'model'])
    [['TP', 'FP']]
    .sum()
    .reset_index()
)

# Compute micro-precision
micro_precision_df['precision'] = micro_precision_df['TP'] / (micro_precision_df['TP'] + micro_precision_df['FP'])


In [None]:
# Only select the papers where both LLMs and experts exist
papers_with_both = micro_precision_df['paper'].value_counts()
papers_with_both = papers_with_both[papers_with_both > 1].index
df_plot = micro_precision_df[micro_precision_df['paper'].isin(papers_with_both)]

# Pivot data so each row is a DOI and each column is a model
df_pivot = df_plot.pivot(index='paper', columns='model', values='precision').reset_index()

# Melt data for dabest
df_melt = df_pivot.melt(id_vars='paper', var_name='model', value_name='precision')

# Create a dabest object using the expert consensus as the control group
dabest_data = dabest.load(
    data=df_melt,
    x='model',
    y='precision',
    idx=("Consensus", "GPT-4.1", "Claude Opus 4", "GPT-4o", "GPT-5",
         "Claude Sonnet 4", "Claude Opus 4.1", "GPT-5 Mini")
)

# Plot the mean difference of each model against the expert consensus
plt.figure()
mean_fig = dabest_data.mean_diff.plot(
    raw_marker_size=4,
    custom_palette=MODEL_COLORS,
)
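
To make the quantity in the estimation plot concrete, the sketch below computes an unpaired mean difference (test minus control) with a plain percentile bootstrap interval. This is a simplified stand-in, not DABEST's implementation: DABEST reports bias-corrected and accelerated intervals, so its numbers will differ somewhat. The function name `bootstrap_mean_diff` and the precision values are made up for illustration.

```python
import random
from statistics import fmean


def bootstrap_mean_diff(control, test, n_resamples=5000, ci=0.95, seed=0):
    """Mean difference (test - control) with a percentile bootstrap CI."""
    rng = random.Random(seed)
    observed = fmean(test) - fmean(control)
    diffs = []
    for _ in range(n_resamples):
        # Resample each group independently, with replacement (unpaired design)
        c = rng.choices(control, k=len(control))
        t = rng.choices(test, k=len(test))
        diffs.append(fmean(t) - fmean(c))
    diffs.sort()
    lo_idx = int((1 - ci) / 2 * n_resamples)
    hi_idx = int((1 + ci) / 2 * n_resamples) - 1
    return observed, diffs[lo_idx], diffs[hi_idx]


# Made-up per-paper precisions, for illustration only
consensus = [0.90, 0.85, 0.95, 0.80, 0.88]
model = [0.82, 0.80, 0.91, 0.70, 0.85]
diff, low, high = bootstrap_mean_diff(consensus, model)
print(f"mean diff = {diff:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

A negative difference means the model's mean precision is below the control's; an interval that excludes zero suggests the gap is unlikely to be resampling noise alone.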


#### Overlapping extractions spider plots

In [None]:
# ============================================================================
# HUMAN ANNOTATOR EVALUATION (DEV SET)
# ============================================================================

human_metrics = {}  # annotator_name -> paper_doi -> {field: {"TP": ..., "FP": ..., "FN": ...}}
human_precs_and_recalls = {}

HUMANS_DIR = EXTRACTIONS_DIR / "humans"
DEV_DIR = DATA_DIR / "ground_truth" / "dev"
# Evaluate every human annotator directory against the dev ground truth
for model_dir in HUMANS_DIR.iterdir():
    if not model_dir.is_dir():
        continue
    
    model_name = model_dir.name
    print(f"Evaluating model: {model_name}")
    
    pipeline = ExtractionPipeline(
        model_name=model_name, 
        preprocessor="pymupdf", 
        postprocessor="NONE", 
        cache_dir="", 
        use_cache=True
    )
    model_metrics, avg_recalls, avg_precisions = pipeline._evaluate_multiple(
        model_dir, DEV_DIR
    )
    
    human_precs_and_recalls[model_name] = {
        "precision": avg_precisions, 
        "recall": avg_recalls
    }
    human_metrics[model_name] = model_metrics

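# Evaluate Claude Sonnet 4 on the same dev set with its own pipeline instance,
# rather than reusing the pipeline left over from the last loop iteration
sonnet_pipeline = ExtractionPipeline(
    model_name="claude-sonnet-4-20250514", preprocessor="pymupdf",
    postprocessor="NONE", cache_dir="", use_cache=True
)
sonnet_4_metrics, s_rec, s_prec = sonnet_pipeline._evaluate_multiple(
    EXTRACTIONS_DIR / "claude-sonnet-4-20250514", DEV_DIR
)
human_metrics["Claude Sonnet 4"] = sonnet_4_metrics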

##### Which DOIs match across our extractions

In [None]:
from collections import defaultdict

doi_to_groups = defaultdict(list)

for group, dois in human_metrics.items():
    for doi in dois:
        doi_to_groups[doi].append(group)

print("DOI matches across groups:\n")
for doi, groups in doi_to_groups.items():
    if len(groups) > 1:
        print(f"{doi} -> {', '.join(groups)}")

##### Code

In [None]:
# ============================================================================
# RADAR PLOT: HUMAN AND MODEL PRECISIONS PER FIELD (DATAFRAME VERSION)
# ============================================================================
from matplotlib.lines import Line2D

# Convert to DataFrame and calculate precisions
df = metrics_to_dataframe(human_metrics)
df = add_field_categories(df)
df = calculate_metrics(df, metric_type='precision')

# Aggregate by model and category
aggregated = df.groupby(['model', 'category'])['score'].mean().reset_index()

# Pivot for radar plot
pivot_df = aggregated.pivot(index='model', columns='category', values='score').fillna(0)

# Create radar plot
fields = sorted(pivot_df.columns)
num_fields = len(fields)
angles = [n / float(num_fields) * 2 * pi for n in range(num_fields)]
angles += angles[:1]

human_radar_precision_fig = plt.figure(figsize=(10, 10))
ax = plt.subplot(111, polar=True)

# Plot each model
for model_name in pivot_df.index:
    scores = pivot_df.loc[model_name, fields].tolist()
    values = scores + [scores[0]]
    
    # Pick the line color for this entry (dark gray fallback if missing)
    line_color = MODEL_COLORS.get(model_name, "#333333")
    
    # Emphasize Consensus and Claude Sonnet 4; dim the individual annotators
    if model_name in ('Consensus', 'Claude Sonnet 4'):
        # Highlight: thicker line, higher alpha fill
        line_alpha = 1.0
        fill_alpha = 0.2
        line_width = 4
    else:
        # Dimmer: lower opacity for line and fill, normal line width
        line_alpha = 0.3
        fill_alpha = 0.02
        line_width = 2

    line, = ax.plot(
        angles, 
        values, 
        label=model_name, 
        linewidth=line_width,
        color=line_color,
        alpha=line_alpha
    )
    
    ax.fill(
        angles, 
        values, 
        color=line_color,
        alpha=fill_alpha
    )

# Customize plot
ax.set_ylim(0.3, 1)
angle_degrees = [a * 180 / np.pi for a in angles[:-1]]
ax.set_thetagrids(angle_degrees, labels=fields)

ax.set_yticks(np.linspace(0.3, 1, 3))
ax.set_yticklabels([f"{y:.1f}" for y in np.linspace(0.3, 1, 3)])

# Rotate plot
ax.set_theta_offset(np.deg2rad(17))
ax.set_theta_direction(-1)

# Title and legend
plt.title("Model Precisions per Field")
humans_legend_handle = Line2D(
    [0], [0],
    color="gray",
    linewidth=4,
    alpha=1.0,
    label="Humans"
)
# Get all existing handles, then keep only Consensus and Claude Sonnet 4
handles, labels = ax.get_legend_handles_labels()

allowed = {"Consensus", "Claude Sonnet 4"}
filtered = [
    (allowed_handle, allowed_label)
    for allowed_handle, allowed_label in zip(handles, labels)
    if allowed_label in allowed
]
if filtered:
    handles, labels = zip(*filtered)
    handles = list(handles)
    labels = list(labels)
else:
    handles, labels = [], []

# Add Humans as legend-only entry
handles.append(humans_legend_handle)
labels.append("Humans")

plt.legend(
    handles,
    labels,
    loc='upper center',
    bbox_to_anchor=(0.5, -0.1),
    ncol=3,
    frameon=False
)
plt.tight_layout()