# Wang (2002) Tutorial: Heteroscedastic Inefficiency in Stochastic Frontiers

**Author:** PanelBox Development Team  
**Date:** February 2024  
**Reference:** Wang, H. J. (2002). "Heteroscedasticity and non-monotonic efficiency effects of a stochastic frontier model." *Journal of Productivity Analysis*, 18, 241-253.

---

## Table of Contents

1. [Introduction](#intro)
2. [The Problem with Two-Stage Methods](#problem)
3. [Wang (2002) Solution: Single-Step Estimation](#solution)
4. [Marginal Effects Analysis](#marginal)
5. [Efficiency Predictions](#application)
6. [When to Use Wang vs BC95](#comparison)
7. [Exercises](#exercises)

---

## 1. Introduction <a name="intro"></a>

In stochastic frontier analysis (SFA), we often want to understand:
- **What factors drive inefficiency?**
- **How do these factors affect different firms differently?**

### The Traditional (Flawed) Approach: Two-Stage Method

**Stage 1:** Estimate frontier, obtain efficiency scores $\hat{TE}_i$

**Stage 2:** Regress efficiency on covariates:
$$
\hat{TE}_i = \alpha + \beta' z_i + \epsilon_i
$$

**Problem:** Stage 1 ignores $z_i$, so the frontier is misspecified and $\hat{TE}_i$ is estimated with error → biased estimates in stage 2!

### Wang (2002) Solution

**Single-step estimation** where inefficiency determinants directly enter the likelihood!

Let's see how this works...

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from panelbox.frontier import StochasticFrontier

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Random seed
np.random.seed(42)

## 2. The Problem with Two-Stage Methods <a name="problem"></a>

Let's simulate data in which inefficiency depends on covariates, which is exactly the setting where two-stage methods break down.

In [None]:
def generate_data_with_hetero_inefficiency(n=500):
    """
    Generate data where inefficiency depends on covariates.
    
    Model:
        y = β'x + v - u
        u ~ N⁺(μ_i, σ²_u,i)
        μ_i = δ'z_i         (location)
        σ²_u,i = exp(γ'w_i)  (scale)
    """
    # Inputs
    x1 = np.random.normal(0, 1, n)
    
    # Inefficiency determinants
    firm_age = np.random.uniform(1, 30, n)
    firm_size = np.random.uniform(10, 100, n)
    
    # Standardize
    age_std = (firm_age - firm_age.mean()) / firm_age.std()
    size_std = (firm_size - firm_size.mean()) / firm_size.std()
    
    # True parameters
    beta_0 = 3.0
    beta_1 = 0.7
    
    # Inefficiency structure
    delta_0 = 0.3
    delta_age = 0.4  # Older firms more inefficient
    
    gamma_0 = -1.5
    gamma_size = 0.3  # Larger firms more variable efficiency
    
    # Generate inefficiency
    mu_i = delta_0 + delta_age * age_std
    sigma_u_i = np.sqrt(np.exp(gamma_0 + gamma_size * size_std))
    
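    # Draw u from N⁺(mu_i, sigma_u_i²): a normal truncated below at zero.
    # (np.abs of a normal draw would give a folded normal, not a truncated one.)
    from scipy.stats import truncnorm
    u = truncnorm.rvs(-mu_i / sigma_u_i, np.inf, loc=mu_i, scale=sigma_u_i, size=n)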
    v = np.random.normal(0, 0.2, n)
    
    # Output
    y = beta_0 + beta_1 * x1 + v - u
    
    df = pd.DataFrame({
        'output': y,
        'input': x1,
        'firm_age': firm_age,
        'firm_size': firm_size,
        'age_std': age_std,
        'size_std': size_std,
        'true_u': u,
        'true_efficiency': np.exp(-u),
    })
    
    return df

# Generate data
df = generate_data_with_hetero_inefficiency(n=500)

print("Data Summary:")
print(df[['output', 'input', 'firm_age', 'firm_size']].describe())
print(f"\nMean true efficiency: {df['true_efficiency'].mean():.3f}")

### Visualize heteroscedasticity in inefficiency

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Inefficiency vs Age
axes[0].scatter(df['firm_age'], df['true_u'], alpha=0.5)
axes[0].set_xlabel('Firm Age (years)')
axes[0].set_ylabel('True Inefficiency (u)')
axes[0].set_title('Inefficiency vs Firm Age\n(Location Effect: Older → More Inefficient)')
axes[0].grid(alpha=0.3)

# Add trend line
z = np.polyfit(df['firm_age'], df['true_u'], 1)
p = np.poly1d(z)
axes[0].plot(df['firm_age'], p(df['firm_age']), "r--", linewidth=2, label='Trend')
axes[0].legend()

# Plot 2: Variance of inefficiency vs Size
# Bin by size and compute variance
df['size_bin'] = pd.qcut(df['firm_size'], q=10, duplicates='drop')
variance_by_size = df.groupby('size_bin', observed=True).agg({
    'true_u': ['mean', 'std'],
    'firm_size': 'mean'
}).reset_index()
variance_by_size.columns = ['size_bin', 'mean_u', 'std_u', 'avg_size']

axes[1].scatter(variance_by_size['avg_size'], variance_by_size['std_u'], s=100, alpha=0.7)
axes[1].set_xlabel('Firm Size (assets)')
axes[1].set_ylabel('Std Dev of Inefficiency')
axes[1].set_title('Variability of Inefficiency vs Firm Size\n(Scale Effect: Larger → More Variable)')
axes[1].grid(alpha=0.3)

# Add trend
z = np.polyfit(variance_by_size['avg_size'], variance_by_size['std_u'], 1)
p = np.poly1d(z)
axes[1].plot(variance_by_size['avg_size'], p(variance_by_size['avg_size']), "r--", linewidth=2)

plt.tight_layout()
plt.show()

print("\n📊 Observations:")
print("  LEFT: Inefficiency INCREASES with age (location effect)")
print("  RIGHT: Inefficiency variance INCREASES with size (scale effect)")

## 3. Wang (2002) Solution: Single-Step Estimation <a name="solution"></a>

### Model Specification

Production frontier:
$$
y_i = x_i'\beta + v_i - u_i
$$

where:
- $v_i \sim N(0, \sigma^2_v)$ is random noise
- $u_i \sim N^+(\mu_i, \sigma^2_{u,i})$ is inefficiency

**Key innovation:** Both the mean AND variance of the pre-truncation distribution of $u$ depend on covariates!

$$
\mu_i = z_i' \delta \quad \text{(location: affects mean inefficiency)}
$$

$$
\ln(\sigma^2_{u,i}) = w_i' \gamma \quad \text{(scale: affects variance of inefficiency)}
$$

### Interpretation

- **$\delta_k > 0$**: Variable $z_k$ INCREASES average inefficiency
- **$\gamma_k > 0$**: Variable $w_k$ INCREASES variance of inefficiency (more heterogeneous)
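
To make concrete how $\delta$ and $\gamma$ enter the likelihood, here is a minimal sketch of the per-observation log-density of the composed error $\varepsilon_i = v_i - u_i$ under the normal/truncated-normal assumptions above. The function name `wang_loglik_obs` and all parameter values are illustrative assumptions. This is the standard textbook density, not necessarily how `panelbox` implements it internally.

```python
import numpy as np
from scipy.stats import norm

def wang_loglik_obs(eps, mu, sigma_u, sigma_v):
    """Log-density of eps = v - u with v ~ N(0, sigma_v^2), u ~ N+(mu, sigma_u^2).

    mu and sigma_u may be arrays (one value per firm); this is how the
    covariates enter: mu_i = z_i'delta and sigma_u_i = exp(w_i'gamma / 2).
    """
    sigma2 = sigma_u**2 + sigma_v**2
    sigma = np.sqrt(sigma2)
    # Location and scale of the (pre-truncation) distribution of u given eps
    mu_star = (sigma_v**2 * mu - sigma_u**2 * eps) / sigma2
    sigma_star = sigma_u * sigma_v / sigma
    return (norm.logpdf((eps + mu) / sigma) - np.log(sigma)
            + norm.logcdf(mu_star / sigma_star)
            - norm.logcdf(mu / sigma_u))

# Illustrative (assumed) parameter values for a single firm
delta = np.array([0.3, 0.4])     # location coefficients: intercept, age_std
gamma = np.array([-1.5, 0.3])    # scale coefficients: intercept, size_std
z_i = np.array([1.0, 0.5])       # [1, age_std]
w_i = np.array([1.0, -0.2])      # [1, size_std]
mu_i = z_i @ delta
sigma_u_i = np.exp(0.5 * (w_i @ gamma))
sigma_v = 0.2

# Sanity check: a valid density integrates to one over eps
eps_grid = np.linspace(-8.0, 8.0, 160001)
density = np.exp(wang_loglik_obs(eps_grid, mu_i, sigma_u_i, sigma_v))
total_mass = density.sum() * (eps_grid[1] - eps_grid[0])
print(f"mu_i = {mu_i:.3f}, sigma_u_i = {sigma_u_i:.3f}, integral = {total_mass:.6f}")
```

A full estimator would set $\varepsilon_i = y_i - x_i'\beta$, sum this log-density over firms, and maximize jointly over $(\beta, \delta, \gamma, \sigma_v)$.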

In [None]:
# Estimate Wang (2002) model
print("Estimating Wang (2002) model...\n")

model_wang = StochasticFrontier(
    data=df,
    depvar='output',
    exog=['input'],
    frontier='production',
    dist='truncated_normal',
    inefficiency_vars=['age_std'],    # Z: affects location (μ)
    het_vars=['size_std']              # W: affects scale (σ_u)
)

result_wang = model_wang.fit(verbose=True)

print("\n" + "="*70)
print(result_wang.summary())

## 4. Marginal Effects Analysis <a name="marginal"></a>

Marginal effects tell us **how much** covariates affect inefficiency.

### Location Effects: $\partial E[u_i] / \partial z_k$

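For the truncated normal, $\mu_i = z_i'\delta$ is the mean of the normal *before* truncation, not the mean of $u_i$. The mean of $u_i$ is

$$
E[u_i] = \mu_i + \sigma_{u,i} \, \frac{\phi(\mu_i / \sigma_{u,i})}{\Phi(\mu_i / \sigma_{u,i})}
$$

so $E[u_i] \approx \mu_i$ only when $\mu_i / \sigma_{u,i}$ is large.

Therefore: **the marginal effect on the location parameter $\mu_i$ is $\delta_k$**. When $z_k$ enters only $\mu_i$, its effect on $E[u_i]$ has the same sign as $\delta_k$ but is smaller in magnitude and varies across firms.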
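
How far $E[u_i]$ sits from $\mu_i$, and how much of $\delta_k$ passes through to $E[u_i]$, can be checked numerically. The sketch below uses the exact truncated-normal mean and its derivative with respect to $\mu_i$. The helper names and parameter values (borrowed from the simulation in Section 2, with the scale held at its intercept) are illustrative assumptions, and $z_k$ is assumed to enter only $\mu_i$.

```python
import numpy as np
from scipy.stats import norm

def truncnormal_mean(mu, sigma_u):
    """E[u] for u ~ N+(mu, sigma_u^2), i.e. a normal truncated below at zero."""
    ratio = mu / sigma_u
    return mu + sigma_u * norm.pdf(ratio) / norm.cdf(ratio)

def location_effect_on_mean(mu, sigma_u, delta_k):
    """dE[u]/dz_k when z_k enters only the location mu = z'delta."""
    ratio = mu / sigma_u
    lam = norm.pdf(ratio) / norm.cdf(ratio)   # inverse Mills ratio
    return delta_k * (1.0 - ratio * lam - lam**2)

# Assumed values, mirroring the simulated data-generating process
delta_0, delta_age = 0.3, 0.4
sigma_u = np.sqrt(np.exp(-1.5))
age_std = np.array([-1.5, 0.0, 1.5])
mu = delta_0 + delta_age * age_std

mean_u = truncnormal_mean(mu, sigma_u)
me_exact = location_effect_on_mean(mu, sigma_u, delta_age)

# Central finite difference in age_std as an independent check
h = 1e-6
me_numeric = (truncnormal_mean(mu + delta_age * h, sigma_u)
              - truncnormal_mean(mu - delta_age * h, sigma_u)) / (2 * h)

for a, m, eu, me in zip(age_std, mu, mean_u, me_exact):
    print(f"age_std={a:+.1f}  mu={m:+.3f}  E[u]={eu:.3f}  dE[u]/d(age_std)={me:.3f}")
```

With these values the pass-through factor is well below one for young firms and approaches one only as $\mu_i / \sigma_{u,i}$ grows, which is why $\delta_k$ alone can overstate the effect on expected inefficiency.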

In [None]:
# Marginal effects on LOCATION (mean inefficiency)
print("MARGINAL EFFECTS ON LOCATION (Mean Inefficiency)")
print("="*70)

me_location = result_wang.marginal_effects(method='location')
print(me_location)

print("\n📊 Interpretation:")
for _, row in me_location.iterrows():
    var = row['variable']
    me = row['marginal_effect']
    pval = row['p_value']
    sig = "***" if pval < 0.01 else ("**" if pval < 0.05 else ("*" if pval < 0.10 else ""))
    
    if me > 0:
        print(f"  • {var}: One SD increase → INCREASES inefficiency by {me:.4f} {sig}")
    else:
        print(f"  • {var}: One SD increase → DECREASES inefficiency by {abs(me):.4f} {sig}")

### Scale Effects: $\partial \sigma_{u,i} / \partial w_k$

Since $\ln(\sigma^2_{u,i}) = w_i'\gamma$:

$$
\frac{\partial \sigma_{u,i}}{\partial w_k} = \frac{\sigma_{u,i}}{2} \cdot \gamma_k
$$

This tells us how the **variability** of inefficiency changes.
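
The formula can be verified with a finite-difference check. The coefficient values and the grid of standardized sizes below are illustrative assumptions taken from the simulation earlier in the tutorial.

```python
import numpy as np

# Assumed scale coefficients: ln(sigma_u^2) = gamma_0 + gamma_size * size_std
gamma_0, gamma_size = -1.5, 0.3
size_grid = np.array([-1.0, 0.0, 1.0])

def sigma_u_of(size_std):
    """sigma_u implied by the log-variance specification."""
    return np.sqrt(np.exp(gamma_0 + gamma_size * size_std))

sigma_vals = sigma_u_of(size_grid)

# Analytic marginal effect: (sigma_u / 2) * gamma_k
scale_me_analytic = 0.5 * sigma_vals * gamma_size

# Central finite difference as an independent check
h = 1e-6
scale_me_numeric = (sigma_u_of(size_grid + h) - sigma_u_of(size_grid - h)) / (2 * h)

for s, sig, me in zip(size_grid, sigma_vals, scale_me_analytic):
    print(f"size_std={s:+.1f}  sigma_u={sig:.4f}  d(sigma_u)/d(size_std)={me:.4f}")
```

Because the effect is proportional to $\sigma_{u,i}$ itself, the same $\gamma_k$ implies a larger absolute change in spread for firms that already have more dispersed inefficiency.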

In [None]:
# Marginal effects on SCALE (variance of inefficiency)
print("\nMARGINAL EFFECTS ON SCALE (Variance of Inefficiency)")
print("="*70)

me_scale = result_wang.marginal_effects(method='scale')
print(me_scale)

print("\n📊 Interpretation:")
for _, row in me_scale.iterrows():
    var = row['variable']
    me = row['marginal_effect']
    pval = row['p_value']
    sig = "***" if pval < 0.01 else ("**" if pval < 0.05 else ("*" if pval < 0.10 else ""))
    
    if me > 0:
        print(f"  • {var}: One SD increase → MORE VARIABLE inefficiency {sig}")
        print("    (Some firms very efficient, others very inefficient)")
    else:
        print(f"  • {var}: One SD increase → LESS VARIABLE inefficiency {sig}")
        print("    (More homogeneous inefficiency across firms)")

## 5. Efficiency Predictions <a name="application"></a>

In [None]:
# Get efficiency estimates
eff_df = result_wang.efficiency(estimator='bc')
df['estimated_efficiency'] = eff_df['efficiency'].values

# Compare true vs estimated
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Distribution
axes[0].hist(df['true_efficiency'], bins=30, alpha=0.5, label='True', edgecolor='black')
axes[0].hist(df['estimated_efficiency'], bins=30, alpha=0.5, label='Estimated', edgecolor='black')
axes[0].set_xlabel('Efficiency')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution: True vs Estimated Efficiency')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Plot 2: Scatter
axes[1].scatter(df['true_efficiency'], df['estimated_efficiency'], alpha=0.5)
axes[1].plot([0, 1], [0, 1], 'r--', label='45° line')
axes[1].set_xlabel('True Efficiency')
axes[1].set_ylabel('Estimated Efficiency')
axes[1].set_title('True vs Estimated Efficiency')
axes[1].legend()
axes[1].grid(alpha=0.3)

corr = np.corrcoef(df['true_efficiency'], df['estimated_efficiency'])[0, 1]
axes[1].text(0.05, 0.95, f'Correlation: {corr:.3f}',
             transform=axes[1].transAxes, fontsize=12,
             verticalalignment='top',
             bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.tight_layout()
plt.show()

print(f"\nMean true efficiency: {df['true_efficiency'].mean():.3f}")
print(f"Mean estimated efficiency: {df['estimated_efficiency'].mean():.3f}")
print(f"Correlation: {corr:.3f}")

## 6. When to Use Wang vs BC95 <a name="comparison"></a>

### Battese-Coelli (1995)
- Only models **location** ($\mu_i$)
- Assumes **constant variance** ($\sigma^2_u$) across all units
- Good for: Panel data where the determinants shift only the location of inefficiency

### Wang (2002)
- Models **both location AND scale** ($\mu_i$ and $\sigma^2_{u,i}$)
- Allows **heteroscedastic inefficiency**
- Good for: Cross-section or short panels with heterogeneous units

### Decision Rule

1. **Estimate both models**
2. **Test $H_0: \gamma = 0$** (LR test)
3. If rejected → Use Wang (2002)
4. If not rejected → BC95 is sufficient

In [None]:
# Compare Wang vs BC95
print("COMPARING WANG (2002) vs BC95")
print("="*70)

# Estimate BC95 (only location determinants)
model_bc95 = StochasticFrontier(
    data=df,
    depvar='output',
    exog=['input'],
    frontier='production',
    dist='truncated_normal',
    inefficiency_vars=['age_std'],  # Only location
    het_vars=None                    # No scale heterogeneity
)

result_bc95 = model_bc95.fit(verbose=False)

print(f"\nLog-likelihood:")
print(f"  BC95 (homoscedastic): {result_bc95.loglik:.4f}")
print(f"  Wang (heteroscedastic): {result_wang.loglik:.4f}")
print(f"  Difference: {result_wang.loglik - result_bc95.loglik:.4f}")

# Likelihood ratio test
lr_stat = 2 * (result_wang.loglik - result_bc95.loglik)
from scipy.stats import chi2
# Degrees of freedom = number of additional parameters in Wang
df_lr = len(result_wang.params) - len(result_bc95.params)
p_value = 1 - chi2.cdf(lr_stat, df_lr)

print(f"\nLikelihood Ratio Test:")
print("  H0: γ = 0 (no heteroscedasticity)")
print(f"  LR statistic: {lr_stat:.4f}")
print(f"  Degrees of freedom: {df_lr}")
print(f"  P-value: {p_value:.4f}")

if p_value < 0.05:
    print("\n✅ REJECT H0: Use Wang (2002) - heteroscedasticity is significant!")
else:
    print("\n❌ FAIL TO REJECT H0: BC95 is sufficient")

print(f"\nAIC:")
print(f"  BC95: {result_bc95.aic:.4f}")
print(f"  Wang: {result_wang.aic:.4f}")
print(f"  Preferred: {'Wang (lower AIC)' if result_wang.aic < result_bc95.aic else 'BC95 (lower AIC)'}")

## 7. Exercises <a name="exercises"></a>

### Exercise 1: Explore Different Specifications

Try adding more variables to $Z$ or $W$ and see how results change.

```python
# Example: Add firm_size to location determinants
model_extended = StochasticFrontier(
    data=df,
    depvar='output',
    exog=['input'],
    frontier='production',
    dist='truncated_normal',
    inefficiency_vars=['age_std', 'size_std'],  # Both in location
    het_vars=['size_std']                        # Size in scale too
)
```

### Exercise 2: Cost Frontier

Modify the data generation to create a cost frontier:
- $y_i = x_i'\beta + v_i + u_i$ (inefficiency INCREASES cost)
- Change `frontier='cost'` in model specification
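
As one possible starting point (a sketch, not a reference solution), the only structural change to the data generation is the sign on $u_i$. The function name `generate_cost_data`, the seed handling, and the parameter values are assumptions that mirror the production example; the covariates are drawn directly as standardized values for brevity.

```python
import numpy as np
import pandas as pd
from scipy.stats import truncnorm

def generate_cost_data(n=500, seed=0):
    """Simulate a cost frontier: cost = b0 + b1*x + v + u, with u ~ N+(mu_i, sigma_u_i^2)."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(0, 1, n)
    age_std = rng.normal(0, 1, n)      # location determinant (standardized)
    size_std = rng.normal(0, 1, n)     # scale determinant (standardized)

    mu_i = 0.3 + 0.4 * age_std
    sigma_u_i = np.sqrt(np.exp(-1.5 + 0.3 * size_std))

    # Truncated normal draw; a and b are bounds in standardized units
    u = truncnorm.rvs(-mu_i / sigma_u_i, np.inf, loc=mu_i, scale=sigma_u_i,
                      size=n, random_state=rng)
    v = rng.normal(0, 0.2, n)

    # Inefficiency is ADDED: it pushes observed cost above the frontier
    cost = 3.0 + 0.7 * x1 + v + u

    return pd.DataFrame({
        'cost': cost, 'input': x1,
        'age_std': age_std, 'size_std': size_std,
        'true_u': u, 'true_v': v,
    })

df_cost = generate_cost_data(n=500, seed=0)
print(df_cost[['cost', 'true_u']].describe())
```

The resulting frame can then be passed to the model with `depvar='cost'` and `frontier='cost'`, as the exercise describes.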

### Exercise 3: Real Data Application

Apply Wang (2002) to:
- Banking data (assets, employees → loans)
- Hospital data (beds, doctors → patients treated)
- Agriculture (land, labor → output)

Identify relevant determinants for $Z$ and $W$.

## Summary

### Key Takeaways

1. **Wang (2002) > Two-stage methods**: Single-step estimation is consistent

2. **Two channels of influence**:
   - Location ($\delta$): Affects **average** inefficiency
   - Scale ($\gamma$): Affects **variability** of inefficiency

3. **Marginal effects are interpretable**:
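   - Location ME on $\mu_i$ = $\delta$; when $z_k$ enters only $\mu_i$, the effect on $E[u_i]$ has the same sign but is scaled by a firm-specific factor between 0 and 1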
   - Scale ME = $(\sigma_u / 2) \cdot \gamma$

4. **Model selection**:
   - Test $H_0: \gamma = 0$ using LR test
   - If rejected → Use Wang (2002)
   - If not → BC95 is sufficient

5. **Policy implications**:
   - Location effects → Target specific firm types
   - Scale effects → Understand treatment effect heterogeneity

### Further Reading

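- Wang, H. J. (2002). "Heteroscedasticity and non-monotonic efficiency effects of a stochastic frontier model." *Journal of Productivity Analysis*, 18, 241-253.
- Wang, H. J., & Schmidt, P. (2002). "One-step and two-step estimation of the effects of exogenous variables on technical efficiency levels." *Journal of Productivity Analysis*, 18, 129-144.
- Caudill, S. B., Ford, J. M., & Gropper, D. M. (1995). "Frontier estimation and firm-specific inefficiency measures in the presence of heteroscedasticity." *Journal of Business & Economic Statistics*, 13(1), 105-111.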

---

**Happy modeling! 🚀**