In [None]:
# ruff: noqa: E402

<div style="
    background-color: #f7f7f7;
    background-image: url(''), url('') ;
    background-position: left bottom, right top;
    background-repeat: no-repeat,  no-repeat;
    background-size: auto 60px, auto 160px;
    border-radius: 5px;
    box-shadow: 0px 3px 1px -2px rgba(0, 0, 0, 0.2), 0px 2px 2px 0px rgba(0, 0, 0, 0.14), 0px 1px 5px 0px rgba(0,0,0,.12);">

<h1 style="
    color: #2a4cdf;
    font-style: normal;
    font-size: 2.25rem;
    line-height: 1.4em;
    font-weight: 600;
    padding: 30px 200px 0px 30px;"> 
        Physics Consistency Filter: Legacy Database vs. PERLA Pipeline</h1>

<p style="
    line-height: 1.4em;
    padding: 30px 200px 0px 30px;">
    This notebook evaluates data quality by testing the fundamental physics relationship for solar cell power conversion efficiency: <strong>PCE = (FF × <i>V</i><sub>OC</sub> × <i>J</i><sub>SC</sub>) / <i>P</i><sub>in</sub></strong>. We compare two datasets from the <a href="https://nomad-lab.eu/prod/v1/staging/gui/search/perovskite-solar-cells-database" target="_blank">Perovskite Solar Cell Database in NOMAD</a>: the legacy human-curated entries and the new PERLA LLM-extracted entries.
</p>

<p style="
    line-height: 1.4em;
    padding: 5px 200px 30px 30px;">
    The analysis measures the fraction of legacy database entries that fail this physics consistency check, using an absolute tolerance of 0.2 percentage points in PCE. The PERLA pipeline enforces the same check as a validation requirement, so only physically consistent entries are accepted into the database.
</p>
</div>

### **Implications**

This comparison reveals important differences in data quality between the two datasets:

- **Legacy Database**: Entries that fail the physics consistency check may originate from inconsistencies in the source publications themselves, such as reporting errors, calculation mistakes in the original papers, or unit mismatches. The fraction of failing entries reflects the challenges inherent in literature-reported data, regardless of the curation method.

- **PERLA Pipeline**: By enforcing physics-based filters during the LLM extraction process, PERLA automatically excludes entries that fail consistency checks. This automated validation approach ensures that only physically coherent data enters the database, improving overall data reliability while maintaining scalability.

The results demonstrate how combining machine learning extraction with physics-based validation can enhance data quality by filtering inconsistencies at the source, whether they originate from data entry or from the published literature itself.

### **Methodology**

The physics consistency check validates that the reported power conversion efficiency (PCE) matches the calculated efficiency from measured parameters:

$$\text{PCE} = \frac{\text{FF} \times V_{\text{OC}} \times J_{\text{SC}}}{P_{\text{in}}}$$

Where:
- **FF** = Fill Factor (dimensionless)
- **V<sub>OC</sub>** = Open Circuit Voltage (V)
- **J<sub>SC</sub>** = Short Circuit Current Density (mA/cm²)
- **P<sub>in</sub>** = Incident Power Density (typically 100 mW/cm² for standard test conditions)

Because V × mA/cm² = mW/cm², the product FF × V<sub>OC</sub> × J<sub>SC</sub> (with J<sub>SC</sub> in mA/cm²) is the output power density in mW/cm². Dividing by the incident power density (100 mW/cm²) gives the efficiency as a fraction, and multiplying by 100 expresses it in percent, so at standard test conditions PCE in percent is numerically equal to FF × V<sub>OC</sub> × J<sub>SC</sub>. We use an absolute tolerance of **±0.2 percentage points** of PCE to account for rounding and measurement precision. Entries that fail this check may indicate data entry errors, unit mismatches, or measurement inconsistencies.
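
As a quick sanity check of the arithmetic, here is a minimal sketch of the consistency test for a single device. The function names `pce_percent` and `is_consistent` and the numeric values are illustrative assumptions, not part of the database or the PERLA pipeline; the sketch assumes FF is given as a fraction, V<sub>OC</sub> in V, and J<sub>SC</sub> in mA/cm².

```python
def pce_percent(ff, voc_v, jsc_ma_cm2, p_in_mw_cm2=100.0):
    """Return PCE in percent (hypothetical helper, for illustration only).

    Assumes ff is a fraction (0-1), voc_v is in V, jsc_ma_cm2 is in mA/cm².
    """
    # V × mA/cm² = mW/cm², so this is the output power density in mW/cm²
    p_out_mw_cm2 = ff * voc_v * jsc_ma_cm2
    # fraction of incident power converted, expressed in percent
    return 100.0 * p_out_mw_cm2 / p_in_mw_cm2


def is_consistent(pce_reported, ff, voc_v, jsc_ma_cm2, atol=0.2):
    """True if reported and calculated PCE agree within atol percentage points."""
    return abs(pce_reported - pce_percent(ff, voc_v, jsc_ma_cm2)) <= atol


# Illustrative values only (not taken from the database)
pce = pce_percent(ff=0.80, voc_v=1.10, jsc_ma_cm2=23.0)  # about 20.24 %
ok = is_consistent(20.2, ff=0.80, voc_v=1.10, jsc_ma_cm2=23.0)   # differs by ~0.04
bad = is_consistent(21.0, ff=0.80, voc_v=1.10, jsc_ma_cm2=23.0)  # differs by ~0.76
```

With these made-up numbers, a reported PCE of 20.2% would pass the check, while 21.0% would be flagged as inconsistent.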

In [6]:
from plotly_theme import register_template, set_defaults

register_template()
set_defaults()

In [7]:
# Load the data from the parquet file into a DataFrame
import pandas as pd

df = pd.read_parquet('perovskite_solar_cell_database.parquet')

In [8]:
# Add a source_database column: 'LLM Extracted' if
# data.ref.name_of_person_entering_the_data is 'LLM Extraction', else 'Manual Entry'
df['source_database'] = df['data.ref.name_of_person_entering_the_data'].apply(
    lambda x: 'LLM Extracted' if x == 'LLM Extraction' else 'Manual Entry'
)
from plotly_theme import DEFAULT_COLORWAY

SOURCE_ORDER = ['Manual Entry', 'LLM Extracted']

COLOR_MAP = dict(zip(SOURCE_ORDER, DEFAULT_COLORWAY))

In [9]:
import matplotlib.colors as mcolors


def darken_color(hex_color, factor=0.7):
    """
    Darken a hex color by scaling its RGB channels by factor (0 < factor <= 1).
    factor < 1 → darker
    factor = 1 → same color
    """
    rgb = mcolors.hex2color(hex_color)  # convert hex to (r,g,b) in [0,1]
    dark_rgb = tuple(max(0, c * factor) for c in rgb)
    return mcolors.to_hex(dark_rgb)

In [10]:
import numpy as np
import plotly.graph_objects as go

# columns we REQUIRE to be present
required_cols = [
    'results.properties.optoelectronic.solar_cell.fill_factor',
    'results.properties.optoelectronic.solar_cell.short_circuit_current_density',
    'results.properties.optoelectronic.solar_cell.open_circuit_voltage',
    'results.properties.optoelectronic.solar_cell.efficiency',
]

# drop rows where ANY required value is missing
df_clean = df.dropna(subset=required_cols).copy()

# alias for readability
ff = df_clean['results.properties.optoelectronic.solar_cell.fill_factor']
jsc = df_clean[
    'results.properties.optoelectronic.solar_cell.short_circuit_current_density'
]
voc = df_clean['results.properties.optoelectronic.solar_cell.open_circuit_voltage']
pce = df_clean['results.properties.optoelectronic.solar_cell.efficiency']

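# unit correction: the factor 0.1 is consistent with Jsc being stored in A/m²
# (1 A/m² = 0.1 mA/cm²); this is an assumption about the parquet export
df_clean['jsc_corrected'] = jsc * 0.1

# compute expected PCE in percent, assuming FF is a fraction, Voc is in V,
# and P_in = 100 mW/cm² (so the division by P_in cancels the factor of 100)
df_clean['pce_calc'] = ff * voc * df_clean['jsc_corrected']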

# isclose check (absolute tolerance only; rtol defaults to 1e-5, so set it to 0)
df_clean['pce_isclose'] = np.isclose(
    pce,
    df_clean['pce_calc'],
    rtol=0,
    atol=0.2,
)

summary = (
    df_clean.groupby('source_database')['pce_isclose']
    .agg(fraction='mean', n='size')
    .reindex(SOURCE_ORDER)
)

fig = go.Figure()

# Map colors to each bar
bar_colors = [COLOR_MAP[src] for src in summary.index]
# Create darker outlines for each bar
bar_outlines = [darken_color(c, factor=0.7) for c in bar_colors]

# Add the bar trace with outlines
fig.add_bar(
    x=summary.index,
    y=summary['fraction'],
    text=[f'{frac:.1%}<br>n={n}' for frac, n in zip(summary['fraction'], summary['n'])],
    textposition='inside',
    textfont=dict(size=20, color='white'),
    marker=dict(
        color=bar_colors,  # fill color
        line=dict(color=bar_outlines, width=2),  # darker outlines
    ),
    showlegend=False,
)

fig.update_layout(
    title=dict(
        text='Physics Consistency Check: PCE ≈ FF × <i>V</i><sub>OC</sub> × <i>J</i><sub>SC</sub> / <i>P</i><sub>in</sub>',
        font=dict(size=20),
        x=0.5,
        xanchor='center',
    ),
    xaxis=dict(
        title=dict(text='Data Source', font=dict(size=20)),
        tickfont=dict(size=20),
        showgrid=False,
    ),
    yaxis=dict(
        title=dict(text='Fraction Passing Consistency Check', font=dict(size=20)),
        tickformat='.0%',
        tickfont=dict(size=20),
        range=[0, 1.05],
        showgrid=True,
        gridcolor='rgba(200, 200, 200, 0.3)',
        griddash='dot',
    ),
    plot_bgcolor='white',
    paper_bgcolor='white',
    margin=dict(t=80, b=80, l=100, r=50),
    height=500,
    width=700,
)

fig.show()

In [11]:
# Export to PDF (static image export in Plotly requires the kaleido package)
fig.write_image('physics_filter.pdf', scale=2)