# Tutorial 2: Physics-Based Descriptors for Photovoltaic Materials

**Author**: Nabil Khossossi  
**Date**: August 2025  
**Goal**: Compute physics-informed features for solar cell materials

## Overview

In this tutorial, we will:
1. Load materials data from Tutorial 1
2. Compute electronic structure descriptors
3. Calculate Shockley-Queisser theoretical efficiency
4. Add thermodynamic stability features
5. Analyze descriptor correlations

## Key Physics Concepts

### Shockley-Queisser Limit
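The theoretical maximum efficiency of a single-junction solar cell is ~33.7% at Eg = 1.34 eV under the AM1.5G spectrum. The limit assumes:
- Detailed balance (radiative recombination is the only recombination channel)
- Illumination by the standard AM1.5G solar spectrum (the original 1961 analysis used a 6000 K blackbody sun instead)
- Thermalization losses for photons with E > Eg
- No absorption of photons with E < Eg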
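
To make the spectral losses concrete, here is a minimal sketch of the "ultimate efficiency" from the 1961 paper: every photon with E >= Eg delivers exactly Eg, photons below the gap are lost, and the sun is treated as a 6000 K blackbody. This is a simplified illustration, not the calculation performed by `PhotovoltaicDescriptors`; the function name and the integration settings are assumptions made for this example. It ignores radiative recombination, so its peak (roughly 44% near 1.1 eV) lies above the full detailed-balance limit.

```python
import math

K_B_EV = 8.617333262e-5  # Boltzmann constant in eV/K


def ultimate_efficiency(eg_ev, t_sun=6000.0, x_max=40.0, n=8000):
    """Fraction of blackbody power kept if each photon with E >= Eg yields exactly Eg."""
    dx = x_max / n
    xg = eg_ev / (K_B_EV * t_sun)  # band gap in units of kT_sun
    absorbed_photons = 0.0  # integral of x^2 / (e^x - 1) from xg upward
    incident_power = 0.0    # integral of x^3 / (e^x - 1) over all x
    for i in range(n):
        x = (i + 0.5) * dx  # midpoint rule avoids the singular point x = 0
        weight = dx / math.expm1(x)
        incident_power += x**3 * weight
        if x >= xg:
            absorbed_photons += x**2 * weight
    return xg * absorbed_photons / incident_power


gaps = [0.5 + 0.01 * i for i in range(200)]  # 0.50 to 2.49 eV
best_gap = max(gaps, key=ultimate_efficiency)
best_eff = ultimate_efficiency(best_gap)
print(f"Peak ultimate efficiency: {best_eff:.1%} at Eg = {best_gap:.2f} eV")
```

The curve falls off on both sides of the peak: small gaps waste energy through thermalization, large gaps fail to absorb most of the spectrum.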

### Band Gap Engineering
- **Single junction**: 1.1 - 1.7 eV optimal
- **Top cell (tandem)**: 1.7 - 2.0 eV
- **Bottom cell (tandem)**: 1.0 - 1.4 eV
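
The three windows overlap, so a material can qualify for more than one role. A small sketch of this classification is shown below; the dictionary and function names are hypothetical, and treating the bounds as inclusive is an assumption (the `PhotovoltaicDescriptors` class may handle the edges differently).

```python
# Band gap windows in eV, taken from the list above
PV_ROLES = {
    "single_junction": (1.1, 1.7),
    "top_cell": (1.7, 2.0),
    "bottom_cell": (1.0, 1.4),
}


def pv_roles(band_gap_ev):
    """Return every device role whose window contains the band gap (bounds inclusive)."""
    return [role for role, (low, high) in PV_ROLES.items() if low <= band_gap_ev <= high]


for eg in (0.8, 1.2, 1.5, 1.8):
    print(f"Eg = {eg} eV -> {pv_roles(eg) or 'no PV role'}")
```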

### Stability Indicators
- **Energy above hull** < 0.05 eV/atom: likely synthesizable (a common screening heuristic, not a guarantee)
- **Formation energy** < 0: stable with respect to decomposition into the constituent elements
- **Decomposition resistance**: check against competing phases
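
These rules translate directly into boolean flags. The sketch below applies the two numeric thresholds to a single material; the function name, the flag names, and the example values are hypothetical and only illustrate the logic. Note that the energy above hull already measures the energy released by decomposing into the most stable competing phases, so it covers the third indicator as well.

```python
def stability_flags(energy_above_hull, formation_energy, hull_tolerance=0.05):
    """Apply the screening thresholds above (energies in eV/atom)."""
    near_hull = energy_above_hull < hull_tolerance
    favorable_formation = formation_energy < 0
    return {
        "near_hull": near_hull,
        "favorable_formation": favorable_formation,
        "passes_screen": near_hull and favorable_formation,
    }


# Made-up example values, not taken from the tutorial dataset
print(stability_flags(energy_above_hull=0.02, formation_energy=-1.3))
print(stability_flags(energy_above_hull=0.12, formation_energy=-0.4))
```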

## 1. Setup and Load Data

In [None]:
import sys
sys.path.append('../src')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

from descriptors import PhotovoltaicDescriptors
from visualization import PVVisualization

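pd.set_option('display.max_columns', None)
%matplotlib inline

# Create the output folders used later in this notebook if they are missing
for folder in ('../figures', '../data/processed'):
    Path(folder).mkdir(parents=True, exist_ok=True)

print("✓ Imports successful")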

In [None]:
# Load data from Tutorial 1
df = pd.read_csv('../data/raw/pv_candidates_all.csv')

print(f"Loaded {len(df)} materials")
print(f"\nColumns: {df.columns.tolist()}")
df.head()

## 2. Initialize Descriptor Calculator

In [None]:
# Initialize at room temperature (300 K)
descriptor_calc = PhotovoltaicDescriptors(temperature=300)

print(f"Thermal voltage (kT/e) at 300 K: {descriptor_calc.V_T:.4f} V")
print("The thermal energy kT (the same number in eV) sets the broadening at room temperature")
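
As a cross-check on the printed value, the thermal voltage follows directly from the SI constants. This is a standalone sketch; `thermal_voltage` is a hypothetical helper and not part of the `descriptors` module.

```python
K_B = 1.380649e-23     # Boltzmann constant in J/K (exact SI value)
Q_E = 1.602176634e-19  # elementary charge in C (exact SI value)


def thermal_voltage(temperature_k):
    """Return kT/e in volts."""
    return K_B * temperature_k / Q_E


print(f"kT/e at 300 K: {thermal_voltage(300):.4f} V")
```

At 300 K this gives about 25.9 mV, so `descriptor_calc.V_T` should print a value close to 0.0259.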

## 3. Band Gap Descriptors

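The band gap is the most important screening property for a photovoltaic absorber, but the raw value needs context: how far it sits from the SQ optimum, which wavelengths it can absorb, and which device role (single junction, top cell, bottom cell) it suits.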

In [None]:
# Compute band gap descriptors
df_bg = descriptor_calc._add_bandgap_descriptors(df)

print("New band gap descriptors:")
print("=" * 60)
print(df_bg[[
    'formula', 'band_gap', 'bandgap_deviation', 
    'absorption_threshold_nm', 'is_single_junction'
]].head(10))
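
Two of these columns have simple closed forms. The sketch below shows one plausible definition of each: the absorption threshold is the wavelength of a photon with energy Eg, and the deviation is the distance from the 1.34 eV optimum. Whether `_add_bandgap_descriptors` uses an absolute or a signed deviation is not shown in this notebook, so the absolute value here is an assumption, and both function names are hypothetical.

```python
HC_EV_NM = 1239.84193  # h*c in eV·nm
SQ_OPTIMUM_EV = 1.34   # SQ optimum quoted earlier in this tutorial


def absorption_threshold_nm(band_gap_ev):
    """Longest wavelength (nm) a photon can have and still bridge the gap."""
    return HC_EV_NM / band_gap_ev


def bandgap_deviation(band_gap_ev):
    """Absolute distance (eV) from the SQ optimum; assumed definition."""
    return abs(band_gap_ev - SQ_OPTIMUM_EV)


for eg in (1.1, 1.34, 1.7):
    print(f"Eg = {eg:.2f} eV: threshold = {absorption_threshold_nm(eg):.0f} nm, "
          f"deviation = {bandgap_deviation(eg):.2f} eV")
```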

In [None]:
# Visualize band gap categories
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Distribution with categories
ax = axes[0]
ax.hist(df_bg['band_gap'], bins=30, alpha=0.7, color='lightblue', edgecolor='black')
ax.axvspan(1.1, 1.7, alpha=0.2, color='green', label='Single Junction')
ax.axvspan(1.7, 2.0, alpha=0.2, color='orange', label='Top Cell')
ax.axvspan(1.0, 1.4, alpha=0.2, color='blue', label='Bottom Cell')
ax.axvline(1.34, color='red', linestyle='--', linewidth=2, label='SQ Optimum')
ax.set_xlabel('Band Gap (eV)', fontweight='bold')
ax.set_ylabel('Count', fontweight='bold')
ax.set_title('Band Gap Distribution with PV Categories')
ax.legend(fontsize=8)
ax.grid(alpha=0.3)

# Deviation from optimum
ax = axes[1]
ax.scatter(df_bg['band_gap'], df_bg['bandgap_deviation'], 
          c=df_bg['energy_above_hull'], cmap='RdYlGn_r',
          s=80, alpha=0.6, edgecolors='black', linewidths=0.5)
ax.set_xlabel('Band Gap (eV)', fontweight='bold')
ax.set_ylabel('Deviation from SQ Optimum (eV)', fontweight='bold')
ax.set_title('Band Gap Quality Assessment')
ax.grid(alpha=0.3)
plt.colorbar(ax.collections[0], ax=ax, label='Energy Above Hull (eV/atom)')

plt.tight_layout()
plt.savefig('../figures/bandgap_descriptors.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n✓ Figure saved")

## 4. Shockley-Queisser Efficiency

Calculate the theoretical maximum efficiency of each material from detailed balance. Within this model the efficiency depends only on the band gap.

In [None]:
# Compute SQ efficiency for all materials
df_sq = descriptor_calc._add_sq_efficiency(df_bg)

print("Shockley-Queisser Efficiency Results:")
print("=" * 60)
print(df_sq[['formula', 'band_gap', 'sq_efficiency']].nlargest(10, 'sq_efficiency'))

print(f"\nMean theoretical efficiency: {df_sq['sq_efficiency'].mean()*100:.1f}%")
print(f"Maximum theoretical efficiency: {df_sq['sq_efficiency'].max()*100:.1f}%")

In [None]:
# Plot SQ efficiency curve
viz = PVVisualization()
fig = viz.plot_sq_efficiency_curve(df_sq, save_path='../figures/sq_curve_detailed.png')
plt.show()

print("✓ SQ curve saved")

## 5. Thermodynamic Stability Descriptors

In [None]:
# Add stability descriptors
df_stab = descriptor_calc._add_thermodynamic_descriptors(df_sq)

print("Stability Descriptors:")
print("=" * 60)
print(df_stab[[
    'formula', 'energy_above_hull', 'stability_score', 
    'formation_stability', 'is_thermodynamically_stable'
]].head(10))

stable_count = df_stab['is_thermodynamically_stable'].sum()
print(f"\nThermodynamically stable materials: {stable_count} / {len(df_stab)} ({stable_count/len(df_stab)*100:.1f}%)")

In [None]:
# Efficiency vs Stability trade-off
fig, ax = plt.subplots(figsize=(8, 6))

# Scatter plot with size = band gap
scatter = ax.scatter(
    df_stab['sq_efficiency'] * 100,
    df_stab['stability_score'],
    s=df_stab['band_gap'] * 50,
    c=df_stab['band_gap'],
    cmap='viridis',
    alpha=0.6,
    edgecolors='black',
    linewidths=0.5
)

# Highlight top candidates (high efficiency AND high stability)
top_mask = (df_stab['sq_efficiency'] > 0.3) & (df_stab['stability_score'] > 0.9)
top_candidates = df_stab[top_mask]

ax.scatter(
    top_candidates['sq_efficiency'] * 100,
    top_candidates['stability_score'],
    s=200,
    facecolors='none',
    edgecolors='red',
    linewidths=2,
    label='Top Candidates'
)

# Annotate a few
for idx, row in top_candidates.head(3).iterrows():
    ax.annotate(
        row['formula'],
        xy=(row['sq_efficiency']*100, row['stability_score']),
        xytext=(10, 10),
        textcoords='offset points',
        fontsize=9,
        bbox=dict(boxstyle='round,pad=0.3', facecolor='yellow', alpha=0.5)
    )

ax.set_xlabel('Theoretical Efficiency (%)', fontweight='bold', fontsize=11)
ax.set_ylabel('Stability Score', fontweight='bold', fontsize=11)
ax.set_title('Efficiency vs Stability Trade-off', fontsize=12, fontweight='bold')
ax.grid(alpha=0.3, linestyle=':')
ax.legend()

cbar = plt.colorbar(scatter, ax=ax)
cbar.set_label('Band Gap (eV)', fontweight='bold')

plt.tight_layout()
plt.savefig('../figures/efficiency_stability_tradeoff.png', dpi=300, bbox_inches='tight')
plt.show()

print(f"\n✓ Found {len(top_candidates)} top candidates (high efficiency + high stability)")

## 6. Structure-Based Descriptors

In [None]:
# Add structure descriptors
df_struct = descriptor_calc._add_structure_descriptors(df_stab)

print("Structure Descriptors:")
print("=" * 60)
print(df_struct[[
    'formula', 'crystal_system', 'density', 
    'symmetry_score', 'density_score'
]].head(10))

In [None]:
# Crystal system analysis
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Crystal system distribution
ax = axes[0]
crystal_counts = df_struct['crystal_system'].value_counts()
crystal_counts.plot(kind='bar', ax=ax, color='steelblue', edgecolor='black')
ax.set_xlabel('Crystal System', fontweight='bold')
ax.set_ylabel('Count', fontweight='bold')
ax.set_title('Crystal System Distribution')
ax.grid(axis='y', alpha=0.3)
plt.setp(ax.xaxis.get_majorticklabels(), rotation=45, ha='right')

# Efficiency by crystal system
ax = axes[1]
df_struct.boxplot(
    column='sq_efficiency',
    by='crystal_system',
    ax=ax
)
ax.set_xlabel('Crystal System', fontweight='bold')
ax.set_ylabel('SQ Efficiency', fontweight='bold')
ax.set_title('Efficiency Distribution by Crystal System')
plt.suptitle('')  # Remove automatic title
ax.grid(alpha=0.3)
plt.setp(ax.xaxis.get_majorticklabels(), rotation=45, ha='right')

plt.tight_layout()
plt.savefig('../figures/crystal_system_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

## 7. Complete Feature Set

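The previous sections called the individual helper methods one at a time. Now compute all descriptors in a single call with `compute_all`.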

In [None]:
# Compute all descriptors
df_features = descriptor_calc.compute_all(df)

print("Complete Feature Set:")
print("=" * 60)
print(f"Original columns: {len(df.columns)}")
print(f"Total columns: {len(df_features.columns)}")
print(f"New features: {len(df_features.columns) - len(df.columns)}")

print("\nAll columns:")
print(df_features.columns.tolist())

## 8. Descriptor Correlation Analysis

In [None]:
# Select numerical descriptors for correlation
descriptor_cols = [
    'band_gap', 'bandgap_deviation', 'sq_efficiency',
    'energy_above_hull', 'stability_score', 'formation_energy',
    'density', 'thermal_broadening'
]

# Remove any missing columns
descriptor_cols = [col for col in descriptor_cols if col in df_features.columns]

# Correlation matrix
viz = PVVisualization()
fig = viz.plot_descriptor_correlation(
    df_features,
    descriptor_cols,
    save_path='../figures/descriptor_correlation.png'
)
plt.show()

print("\n✓ Correlation matrix saved")

In [None]:
# Key correlations
corr_matrix = df_features[descriptor_cols].corr()

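print("\nStrongest Correlations with SQ Efficiency (absolute Pearson r):")
print("=" * 60)
# Drop the trivial self-correlation (always 1.0) before ranking
sq_corr = corr_matrix['sq_efficiency'].drop('sq_efficiency').abs().sort_values(ascending=False)
print(sq_corr.head(6))

print("\nPhysical Interpretation:")
print("- Band gap descriptors dominate by construction: SQ efficiency depends only on Eg")
print("- The dependence peaks near 1.34 eV, so bandgap_deviation can outrank band_gap")
print("- Stability is not an input to the SQ model; its correlation reflects the dataset only")
print("- Thermal broadening captures temperature effects")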
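
One caveat when reading these numbers: Pearson correlation only measures linear association, while efficiency rises and then falls with band gap. The toy example below uses a made-up parabola peaked at 1.34 eV (not the SQ model) to show how the raw band gap can show almost no linear correlation even though it fully determines the target, while the deviation from the optimum correlates strongly. All names here are hypothetical.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Band gaps placed symmetrically around 1.34 eV
offsets = [0.05 * k for k in range(-10, 11)]
band_gaps = [1.34 + d for d in offsets]
toy_efficiency = [0.33 - 0.4 * d**2 for d in offsets]  # toy peaked curve
deviation = [abs(d) for d in offsets]

r_gap = pearson(band_gaps, toy_efficiency)
r_dev = pearson(deviation, toy_efficiency)
print(f"r(band_gap, efficiency)  = {r_gap:+.3f}")
print(f"r(deviation, efficiency) = {r_dev:+.3f}")
```

On real data the band gaps are not symmetric around the optimum, so the raw correlation will not vanish, but it can still understate how strongly the band gap controls the efficiency.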

## 9. Spectral Response Example

In [None]:
# Compute spectral response for different band gaps
bandgaps = [1.1, 1.34, 1.7, 2.0]
colors = ['blue', 'green', 'orange', 'red']

fig, ax = plt.subplots(figsize=(8, 5))

for Eg, color in zip(bandgaps, colors):
    wavelengths, response = descriptor_calc.compute_spectral_response(Eg)
    ax.plot(wavelengths, response, label=f'Eg = {Eg} eV', 
           linewidth=2, color=color)

ax.set_xlabel('Wavelength (nm)', fontweight='bold', fontsize=11)
ax.set_ylabel('Spectral Response', fontweight='bold', fontsize=11)
ax.set_title('Spectral Response for Different Band Gaps', fontsize=12, fontweight='bold')
ax.grid(alpha=0.3, linestyle=':')
ax.set_xlim(300, 1200)

# Add solar spectrum regions
ax.axvspan(280, 400, alpha=0.1, color='purple', label='UV')
ax.axvspan(400, 700, alpha=0.1, color='yellow', label='Visible')
ax.axvspan(700, 1200, alpha=0.1, color='red', label='IR')

# Build the legend last so the shaded regions are listed too
ax.legend(fontsize=10)

plt.tight_layout()
plt.savefig('../figures/spectral_response.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n✓ Spectral response plot saved")
print("\nKey insights:")
print("- Lower Eg → absorbs more of solar spectrum (IR)")
print("- Higher Eg → better voltage but less current")
print("- Optimal Eg balances photon absorption and voltage")
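
The implementation of `compute_spectral_response` is not shown in this notebook. For intuition, the sketch below gives the textbook ideal case as an assumption: unit external quantum efficiency for every photon that can bridge the gap, zero beyond the cutoff wavelength. In that case the spectral response in A/W is qλ/(hc), which is simply λ in nm divided by 1239.84. The function name is hypothetical.

```python
HC_EV_NM = 1239.84193  # h*c in eV·nm


def ideal_spectral_response(wavelength_nm, band_gap_ev):
    """Spectral response in A/W for unit quantum efficiency below the cutoff."""
    cutoff_nm = HC_EV_NM / band_gap_ev
    if wavelength_nm > cutoff_nm:
        return 0.0  # photon energy is below the band gap
    return wavelength_nm / HC_EV_NM


for eg in (1.1, 1.34, 2.0):
    row = [f"{ideal_spectral_response(wl, eg):.2f}" for wl in (400, 600, 800, 1000)]
    print(f"Eg = {eg} eV, response at 400/600/800/1000 nm: {row}")
```

The response grows linearly with wavelength up to the cutoff because longer-wavelength photons carry less energy per collected electron, and it drops to zero where the material stops absorbing.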

## 10. Save Processed Data

In [None]:
# Save complete feature set
output_file = '../data/processed/materials_with_features.csv'
df_features.to_csv(output_file, index=False)

print(f"✓ Saved {len(df_features)} materials with {len(df_features.columns)} features")
print(f"  Location: {output_file}")

# Save top candidates
top_pv = df_features.nlargest(20, 'sq_efficiency')
top_pv.to_csv('../data/processed/top_20_efficiency.csv', index=False)
print("\n✓ Saved top 20 materials by efficiency")

## 11. Summary Statistics

In [None]:
# Final summary
print("\n" + "=" * 80)
print("PHYSICS DESCRIPTORS SUMMARY")
print("=" * 80)

print(f"\nDataset: {len(df_features)} materials")
print(f"\nBand Gap Statistics:")
print(f"  Range: {df_features['band_gap'].min():.2f} - {df_features['band_gap'].max():.2f} eV")
print(f"  Mean: {df_features['band_gap'].mean():.2f} eV")
print(f"  Optimal range (1.1-1.7 eV): {df_features['is_single_junction'].sum()} materials")

print(f"\nEfficiency Statistics:")
print(f"  Mean: {df_features['sq_efficiency'].mean()*100:.1f}%")
print(f"  Max: {df_features['sq_efficiency'].max()*100:.1f}%")
print(f"  Materials > 30% efficiency: {(df_features['sq_efficiency'] > 0.30).sum()}")

print(f"\nStability Statistics:")
print(f"  Stable (E_hull < 0.05 eV/atom): {df_features['is_thermodynamically_stable'].sum()}")
print(f"  Mean stability score: {df_features['stability_score'].mean():.2f}")

print(f"\nFeature Engineering:")
print(f"  Original features: {len(df.columns)}")
print(f"  Total features: {len(df_features.columns)}")
print(f"  New descriptors: {len(df_features.columns) - len(df.columns)}")

## Summary and Next Steps

### What We Accomplished

✓ Computed band gap descriptors (deviation, absorption threshold, categories)  
✓ Calculated Shockley-Queisser theoretical efficiency  
✓ Added thermodynamic stability features  
✓ Incorporated structure-based descriptors  
✓ Analyzed descriptor correlations  
✓ Computed spectral response  
✓ Saved complete feature set  

### Key Insights

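1. **Band gap is dominant**: SQ efficiency is computed from the band gap alone, so band gap descriptors correlate most strongly with it
2. **Stability trade-off**: The materials with the highest theoretical efficiency are not always the most stable
3. **Crystal system matters**: Efficiency distributions differ between crystal systems, which reflects differences in their band gap distributions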
4. **Feature engineering**: Created 10+ physics-based descriptors

### Next Steps

In **Tutorial 3**, we will:
- Build machine learning models with these descriptors
- Enforce physical constraints (SQ limit, Eg > 0)
- Evaluate feature importance
- Make predictions on new materials

### References

1. Shockley & Queisser, *J. Appl. Phys.* **32**, 510 (1961)
2. Green et al., *Prog. Photovolt.* **29**, 3 (2021) - Efficiency tables
3. Materials Project documentation: docs.materialsproject.org