In [None]:
# ruff: noqa: E402

<div style="
    background-color: #f7f7f7;
    background-image: url(''), url('') ;
    background-position: left bottom, right top;
    background-repeat: no-repeat,  no-repeat;
    background-size: auto 60px, auto 160px;
    border-radius: 5px;
    box-shadow: 0px 3px 1px -2px rgba(0, 0, 0, 0.2), 0px 2px 2px 0px rgba(0, 0, 0, 0.14), 0px 1px 5px 0px rgba(0,0,0,.12);">

<h1 style="
    color: #2a4cdf;
    font-style: normal;
    font-size: 2.25rem;
    line-height: 1.4em;
    font-weight: 600;
    padding: 30px 200px 0px 30px;"> 
        Physics Consistency Filter: Legacy Database vs. PERLA Pipeline</h1>

<p style="
    line-height: 1.4em;
    padding: 30px 200px 0px 30px;">
    This notebook evaluates data quality by testing the fundamental physics relationship for solar cell power conversion efficiency: <strong>PCE = (FF × <i>V</i><sub>OC</sub> × <i>J</i><sub>SC</sub>) / <i>P</i><sub>in</sub></strong>. We compare two datasets from the <a href="https://nomad-lab.eu/prod/v1/staging/gui/search/perovskite-solar-cells-database" target="_blank">Perovskite Solar Cell Database in NOMAD</a>: the legacy human-curated entries and the new PERLA LLM-extracted entries.
</p>

<p style="
    line-height: 1.4em;
    padding: 5px 200px 30px 30px;">
    The analysis reveals the fraction of legacy database entries that fail this physics consistency check (with 0.2% absolute tolerance), while the PERLA pipeline enforces this filter as a validation requirement, ensuring only physically consistent entries are accepted into the database.
</p>
</div>

### **Implications**

This comparison reveals important differences in data quality between the two datasets:

- **Legacy Database**: Entries that fail the physics consistency check may originate from errors in the manual data curation or from inconsistencies in the source publications themselves, such as reporting errors, calculation mistakes, or unit mismatches. The fraction of failing entries therefore reflects challenges inherent in literature-reported data, regardless of the curation method.

- **PERLA Pipeline**: By enforcing physics-based filters during the LLM extraction process, PERLA automatically excludes entries that fail consistency checks. This automated validation ensures that only physically coherent data enters the database, improving overall data reliability while maintaining scalability. It also helps exclude entries in which parameters from different devices were mixed, which can happen when a paper reports multiple solar cells.

The results show that building this check into the extraction pipeline substantially reduces these inconsistencies.
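
As a rough illustration of the mixed-parameter case, the sketch below applies the relationship PCE = (FF × V<sub>OC</sub> × J<sub>SC</sub>) / P<sub>in</sub> to two made-up devices from one imaginary paper, and to a record that pairs the PCE of one device with the J-V parameters of the other. The `Cell` class, the `passes_check` helper, and all numbers are hypothetical and are not part of the PERLA code; the 0.2 tolerance and the 1000 W/m² default follow the values used in this notebook.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """One reported device; all values used below are hypothetical."""
    ff: float   # fill factor, 0-1
    voc: float  # open-circuit voltage, V
    jsc: float  # short-circuit current density, A/m^2
    pce: float  # reported efficiency, %


def passes_check(cell: Cell, p_in: float = 1000.0, atol: float = 0.2) -> bool:
    """True if the reported PCE is within `atol` (in % PCE) of FF*Voc*Jsc/P_in."""
    pce_calc = cell.ff * cell.voc * cell.jsc / p_in * 100.0
    return abs(cell.pce - pce_calc) <= atol


champion = Cell(ff=0.80, voc=1.12, jsc=230.0, pce=20.6)
control = Cell(ff=0.72, voc=1.05, jsc=215.0, pce=16.3)
# Extraction error: champion PCE paired with the control device's J-V parameters
mixed = Cell(ff=control.ff, voc=control.voc, jsc=control.jsc, pce=champion.pce)

for name, cell in [("champion", champion), ("control", control), ("mixed", mixed)]:
    print(name, passes_check(cell))
```

In this toy case the two self-consistent devices pass, while the mixed record is off by several percentage points of PCE and is rejected.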

### **Methodology**

The physics consistency check validates that the reported power conversion efficiency (PCE) matches the calculated efficiency from measured parameters:

$$\text{PCE} = \frac{\text{FF} \times V_{\text{OC}} \times J_{\text{SC}}}{P_{\text{in}}}$$

Where:
- **FF** = Fill Factor (dimensionless, 0-1)
- **V<sub>OC</sub>** = Open Circuit Voltage (V)
- **J<sub>SC</sub>** = Short Circuit Current Density (A/m², SI units)
- **P<sub>in</sub>** = Incident Power Density (W/m², SI units; typically 1000 W/m² for standard test conditions)

All quantities are in SI units as stored in NOMAD results, so (FF × V<sub>OC</sub> × J<sub>SC</sub>) / P<sub>in</sub> × 100 gives PCE in percent. Units: (V × A/m²) / (W/m²) × 100 = (W/m²) / (W/m²) × 100 = %. We use an absolute tolerance of **±0.2 percentage points** of PCE to account for rounding and measurement precision. Entries that fail this check may indicate data entry errors, unit mismatches, or measurement inconsistencies.
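
A small worked example may make the check concrete. The sketch below uses a made-up device (FF = 0.75, V<sub>OC</sub> = 1.10 V, J<sub>SC</sub> = 22 mA/cm² = 220 A/m², P<sub>in</sub> = 1000 W/m²); the helper names `expected_pce_percent` and `is_consistent` are hypothetical and only illustrate the formula and the 0.2 absolute tolerance (in % PCE) described above. It uses `math.isclose` with `rel_tol=0.0` so that only the absolute tolerance applies.

```python
import math


def expected_pce_percent(ff, voc, jsc, p_in=1000.0):
    """PCE in % from FF (0-1), V_OC (V), J_SC (A/m^2) and P_in (W/m^2)."""
    return ff * voc * jsc / p_in * 100.0


def is_consistent(pce_reported, ff, voc, jsc, p_in=1000.0, atol=0.2):
    """Compare reported and calculated PCE (both in %) with an absolute tolerance."""
    pce_calc = expected_pce_percent(ff, voc, jsc, p_in)
    return math.isclose(pce_reported, pce_calc, rel_tol=0.0, abs_tol=atol)


# Hypothetical device: 0.75 * 1.10 V * 220 A/m^2 / 1000 W/m^2 * 100, about 18.15 %
pce_calc = expected_pce_percent(ff=0.75, voc=1.10, jsc=220.0)
print(f"calculated PCE: {pce_calc:.2f} %")

# Reported 18.2 %: about 0.05 away from the calculated value, within tolerance
print(is_consistent(18.2, ff=0.75, voc=1.10, jsc=220.0))
# Reported 19.0 %: about 0.85 away, outside tolerance
print(is_consistent(19.0, ff=0.75, voc=1.10, jsc=220.0))
# Unit slip: J_SC left in mA/cm^2 (22) instead of A/m^2 (220)
print(is_consistent(18.2, ff=0.75, voc=1.10, jsc=22.0))
```

The last call shows why unit mismatches are caught so reliably: a factor-of-ten error in J<sub>SC</sub> shifts the calculated PCE by far more than the tolerance.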

In [113]:
from plotly_theme import register_template, set_defaults

register_template()
set_defaults()

### **The dataset for this analysis**

The query that creates this parquet file includes only solar cells with a registered illumination intensity around 1-sun conditions. It excludes entries for which the illumination intensity could not be registered.

In [70]:
# Load the data from the parquet file into a DataFrame
import pandas as pd

df = pd.read_parquet('perovskite_solar_cell_database_physics_check_2.parquet')
# df = pd.read_parquet('perovskite_solar_cell_database.parquet')
# Set a source_database column: if name_of_person_entering_the_data is 'LLM Extraction', use 'LLM Extracted', else 'Manual Entry'
df['source_database'] = df['data.ref.name_of_person_entering_the_data'].apply(
    lambda x: 'LLM Extracted' if x == 'LLM Extraction' else 'Manual Entry'
)

In [71]:
from plotly_theme import DEFAULT_COLORWAY

SOURCE_ORDER = ['Manual Entry', 'LLM Extracted']

COLOR_MAP = dict(zip(SOURCE_ORDER, DEFAULT_COLORWAY))

In [72]:
# Print the distribution of results.properties.optoelectronic.solar_cell.illumination_intensity

import plotly.express as px

# print the overall distribution
print("="*80)
print("OVERALL DISTRIBUTION")
print("="*80)
print(df['results.properties.optoelectronic.solar_cell.illumination_intensity'].describe())
print()

# Check for None/NaN values
none_count = df['results.properties.optoelectronic.solar_cell.illumination_intensity'].isna().sum()
print(f"Number of None/NaN entries: {none_count}")
print(f"Percentage: {none_count/len(df)*100:.2f}%")
print()

# Check for zero values
zero_count = (df['results.properties.optoelectronic.solar_cell.illumination_intensity'] == 0).sum()
print(f"Number of zero entries: {zero_count}")
print(f"Percentage: {zero_count/len(df)*100:.2f}%")
print()

# print how many are not 1000 W/m2 (excluding None/NaN and zeros)
not_1000 = df[
    (df['results.properties.optoelectronic.solar_cell.illumination_intensity'] != 1000) &
    (df['results.properties.optoelectronic.solar_cell.illumination_intensity'].notna()) &
    (df['results.properties.optoelectronic.solar_cell.illumination_intensity'] != 0)
]
print(f"Number of entries not at 1000 W/m2 (excluding None/NaN and zeros): {len(not_1000)}")
print(f"Percentage: {len(not_1000)/len(df)*100:.2f}%")
print()

# print the distribution values for the illumination intensity for each source_database
print("="*80)
print("BREAKDOWN BY SOURCE DATABASE")
print("="*80)
for source in SOURCE_ORDER:
    subset = df[df['source_database'] == source]
    print(f"\n{source}:")
    print(subset['results.properties.optoelectronic.solar_cell.illumination_intensity'].describe())
    print()

    # Check for None/NaN values
    none_count = subset['results.properties.optoelectronic.solar_cell.illumination_intensity'].isna().sum()
    print(f"Number of None/NaN entries for {source}: {none_count}")
    print(f"Percentage: {none_count/len(subset)*100:.2f}%")
    print()

    # Check for zero values
    zero_count = (subset['results.properties.optoelectronic.solar_cell.illumination_intensity'] == 0).sum()
    print(f"Number of zero entries for {source}: {zero_count}")
    print(f"Percentage: {zero_count/len(subset)*100:.2f}%")
    print()

    # print how many are not 1000 W/m2 (excluding None/NaN and zeros)
    not_1000 = subset[
        (subset['results.properties.optoelectronic.solar_cell.illumination_intensity'] != 1000) &
        (subset['results.properties.optoelectronic.solar_cell.illumination_intensity'].notna()) &
        (subset['results.properties.optoelectronic.solar_cell.illumination_intensity'] != 0)
    ]
    print(f"Number of entries not at 1000 W/m2 for {source} (excluding None/NaN and zeros): {len(not_1000)}")
    print(f"Percentage: {len(not_1000)/len(subset)*100:.2f}%")
    print()

OVERALL DISTRIBUTION
count    48745.000000
mean      1004.623427
std        278.436899
min          0.000000
25%       1000.000000
50%       1000.000000
75%       1000.000000
max      18000.000000
Name: results.properties.optoelectronic.solar_cell.illumination_intensity, dtype: float64

Number of None/NaN entries: 2383
Percentage: 4.66%

Number of zero entries: 8
Percentage: 0.02%

Number of entries not at 1000 W/m2 (excluding None/NaN and zeros): 449
Percentage: 0.88%

BREAKDOWN BY SOURCE DATABASE

Manual Entry:
count    43032.000000
mean       999.496165
std        191.401147
min          0.000000
25%       1000.000000
50%       1000.000000
75%       1000.000000
max      18000.000000
Name: results.properties.optoelectronic.solar_cell.illumination_intensity, dtype: float64

Number of None/NaN entries for Manual Entry: 75
Percentage: 0.17%

Number of zero entries for Manual Entry: 8
Percentage: 0.02%

Number of entries not at 1000 W/m2 for Manual Entry (excluding None/NaN and zeros): 4

In [73]:
# Print the illumination intensity distribution for each source_database
# (no exclusion of None/NaN or zero values in this cell)
for source in SOURCE_ORDER:
    subset = df[df['source_database'] == source]
    print(f"Distribution for {source}:")
    print(subset['results.properties.optoelectronic.solar_cell.illumination_intensity'].describe())
    print()

    # Count entries not equal to 1000 W/m2; NaN compares unequal to 1000,
    # so missing and zero values are included in this count

    not_1000 = subset[subset['results.properties.optoelectronic.solar_cell.illumination_intensity'] != 1000]
    print(f"Number of entries not at 1000 W/m2 for {source}: {len(not_1000)}")
    print()

Distribution for Manual Entry:
count    43032.000000
mean       999.496165
std        191.401147
min          0.000000
25%       1000.000000
50%       1000.000000
75%       1000.000000
max      18000.000000
Name: results.properties.optoelectronic.solar_cell.illumination_intensity, dtype: float64

Number of entries not at 1000 W/m2 for Manual Entry: 486

Distribution for LLM Extracted:
count     5713.000000
mean      1043.243480
std        619.607578
min        100.000000
25%       1000.000000
50%       1000.000000
75%       1000.000000
max      10000.000000
Name: results.properties.optoelectronic.solar_cell.illumination_intensity, dtype: float64

Number of entries not at 1000 W/m2 for LLM Extracted: 2354



In [74]:
import numpy as np
import plotly.graph_objects as go

# columns we REQUIRE to be present
required_cols = [
    'results.properties.optoelectronic.solar_cell.fill_factor',
    'results.properties.optoelectronic.solar_cell.short_circuit_current_density',
    'results.properties.optoelectronic.solar_cell.open_circuit_voltage',
    'results.properties.optoelectronic.solar_cell.efficiency',
    'results.properties.optoelectronic.solar_cell.illumination_intensity',
]

# drop rows where ANY required value is missing
df_clean = df.dropna(subset=required_cols).copy()

# alias for readability
# Units from NOMAD results (all SI):
# ff: dimensionless (0-1)
# jsc: A/m² (SI)
# voc: V (SI)
# pce: % (percentage)
# illumination_intensity: W/m² (SI)
ff = df_clean['results.properties.optoelectronic.solar_cell.fill_factor']
jsc = df_clean[
    'results.properties.optoelectronic.solar_cell.short_circuit_current_density'
]
voc = df_clean['results.properties.optoelectronic.solar_cell.open_circuit_voltage']
pce = df_clean['results.properties.optoelectronic.solar_cell.efficiency']
illumination = df_clean['results.properties.optoelectronic.solar_cell.illumination_intensity']

# compute expected PCE using correct formula
# PCE (%) = (FF × V_OC [V] × J_SC [A/m²]) / P_in [W/m²] × 100
# Units: (dimensionless × V × A/m²) / (W/m²) × 100 = (W/m²) / (W/m²) × 100 = %
df_clean['pce_calc'] = (ff * voc * jsc) / illumination * 100

# isclose check, absolute tolerance only
# (rtol=0.0 disables NumPy's default relative tolerance of 1e-5)
df_clean['pce_isclose'] = np.isclose(
    pce,
    df_clean['pce_calc'],
    rtol=0.0,
    atol=0.2,
)

# Fraction of entries passing the check and entry count per source
summary = (
    df_clean.groupby('source_database')['pce_isclose']
    .agg(fraction='mean', n='size')
    .reindex(SOURCE_ORDER)
)

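# Calculate mismatch data for panel b
df_mismatch = df_clean[~df_clean['pce_isclose']].copy()
pce_reported = df_mismatch['results.properties.optoelectronic.solar_cell.efficiency']
# absolute deviation between calculated and reported PCE (in % PCE)
df_mismatch['pce_diff'] = (df_mismatch['pce_calc'] - pce_reported).abs()
# deviation relative to the reported PCE; not finite where the reported PCE is 0
df_mismatch['pce_diff_percent'] = df_mismatch['pce_diff'] / pce_reported * 100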

In [111]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go

# Create figure with subplots
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=('', ''),
    horizontal_spacing=0.15,
    column_widths=[0.45, 0.55]
)

# PANEL A: Bar chart
bar_colors = [COLOR_MAP[src] for src in summary.index]

fig.add_trace(
    go.Bar(
        x=summary.index,
        y=summary['fraction'],
        text=[f'{frac:.1%}<br>n={n}' for frac, n in zip(summary['fraction'], summary['n'])],
        textposition='inside',
        textfont=dict(size=16, color='white', family='Arial'),
        marker=dict(color=bar_colors),
        showlegend=False,
    ),
    row=1, col=1
)

# PANEL B: Scatter plot with improved styling
for source in SOURCE_ORDER:
    subset = df_mismatch[df_mismatch['source_database'] == source]

    fig.add_trace(
        go.Scatter(
            x=subset['results.properties.optoelectronic.solar_cell.efficiency'],
            y=subset['pce_calc'],
            mode='markers',
            name=source,
            marker=dict(
                color=COLOR_MAP[source],
                size=6,
                # opacity=0.9,
                line=dict(color='white', width=1.0)
            ),
            showlegend=True,
        ),
        row=1, col=2
    )

# Add diagonal line to panel b
fig.add_trace(
    go.Scatter(
        x=[0, 26],
        y=[0, 26],
        mode='lines',
        line=dict(color='gray', dash='dash', width=1.5),
        showlegend=False,
        hoverinfo='skip'
    ),
    row=1, col=2
)

# Update axes for panel a
fig.update_xaxes(
    # title_text='Data Source',
    title_font=dict(size=16, family='Arial'),
    tickfont=dict(size=16, family='Arial'),
    showgrid=False,
    row=1, col=1
)

fig.update_yaxes(
    title_text='Fraction Passing Consistency Check',
    title_font=dict(size=16, family='Arial'),
    tickformat='.0%',
    tickfont=dict(size=16, family='Arial'),
    range=[0, 1.05],
    showgrid=True,
    gridcolor='rgba(200, 200, 200, 0.3)',
    griddash='dot',
    row=1, col=1
)

# Update axes for panel b
fig.update_xaxes(
    title_text='Reported PCE (%)',
    title_font=dict(size=16, family='Arial'),
    tickfont=dict(size=16, family='Arial'),
    range=[0, 26],
    showgrid=True,
    gridcolor='rgba(200, 200, 200, 0.3)',
    griddash='dot',
    row=1, col=2
)

fig.update_yaxes(
    title_text='Calculated PCE (%)',
    title_font=dict(size=16, family='Arial'),
    tickfont=dict(size=16, family='Arial'),
    range=[0, 26],
    showgrid=True,
    gridcolor='rgba(200, 200, 200, 0.3)',
    griddash='dot',
    row=1, col=2
)

# Add Nature-style panel labels
fig.add_annotation(
    text='<b>a</b>',
    xref='x domain', yref='y domain',
    x=-0.15, y=1.05,
    xanchor='left', yanchor='bottom',
    font=dict(size=18, family='Arial', color='black'),
    showarrow=False,
    row=1, col=1
)
fig.add_annotation(
    text='<b>b</b>',
    xref='x2 domain', yref='y2 domain',
    x=-0.15, y=1.05,
    xanchor='left', yanchor='bottom',
    font=dict(size=18, family='Arial', color='black'),
    showarrow=False,
    row=1, col=2
)

# Update overall layout
fig.update_layout(
    plot_bgcolor='white',
    paper_bgcolor='white',
    font=dict(family='Arial', size=12),
    legend=dict(
        x=0.535,
        y=0.98,
        xanchor='left',
        yanchor='top',
        bgcolor='rgba(255, 255, 255, 0.8)',
        font=dict(size=16, family='Arial')
    ),
    width=700,
    height=400,
    margin=dict(t=60, b=80, l=80, r=80)
)

fig.show()

In [None]:
# Export combined figure to PDF (Nature quality)
fig.write_image('physics_filter_combined.pdf', scale=1, width=700, height=500)
print("Figure exported to: physics_filter_combined.pdf")

Figure exported to: physics_filter_combined.pdf
