# Notebook 02: Random Effects Tobit for Panel Data

## Learning Objectives

By the end of this notebook, you will be able to:

1. **Explain** why panel data structure matters for censored models
2. **Formulate** the Random Effects Tobit model and its assumptions
3. **Estimate** Random Effects Tobit models using PanelBox
4. **Decompose** variance into between- and within-individual components
5. **Compare** Pooled Tobit vs Random Effects Tobit estimates
6. **Generate** latent and censored predictions from panel Tobit models
7. **Interpret** results with attention to the intra-class correlation

### Prerequisites
- Completed Notebook 01 (Tobit Introduction)
- Understanding of censored regression and the Tobit model
- Familiarity with random effects concepts in linear panel models

### Duration
- **Estimated time**: 75-90 minutes
- **Sections**: 10 main sections

### Dataset
- **File**: `health_expenditure_panel.csv` -- Health expenditure data for 500 individuals over 4 time periods (2,000 obs)
- **Key feature**: ~27% of observations censored at zero; repeated measurements per individual

### References
- Wooldridge, J. M. (2010). *Econometric Analysis of Cross Section and Panel Data*. MIT Press, Ch. 16.
- Tobin, J. (1958). Estimation of relationships for limited dependent variables. *Econometrica*, 26(1), 24-36.
- Greene, W. H. (2012). *Econometric Analysis*. 7th ed. Pearson, Ch. 19.

In [None]:
# Standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

from scipy import stats
import statsmodels.api as sm

# PanelBox imports
from panelbox.models.censored import PooledTobit, RandomEffectsTobit

# Visualization configuration
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['figure.dpi'] = 100
plt.rcParams['font.size'] = 11
plt.rcParams['axes.titlesize'] = 14
plt.rcParams['axes.labelsize'] = 12
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 4)

# Set random seed for reproducibility
np.random.seed(42)

# Define paths
BASE_DIR = Path('..')
DATA_DIR = BASE_DIR / 'data'
OUTPUT_DIR = BASE_DIR / 'outputs'
FIGURES_DIR = OUTPUT_DIR / 'figures' / '02_tobit_panel'
TABLES_DIR = OUTPUT_DIR / 'tables' / '02_tobit_panel'

# Create output directories
FIGURES_DIR.mkdir(parents=True, exist_ok=True)
TABLES_DIR.mkdir(parents=True, exist_ok=True)

print('Setup complete!')
print(f'Figures will be saved to: {FIGURES_DIR}')
print(f'Tables will be saved to: {TABLES_DIR}')

---

## Section 1: From Cross-Section to Panel

### Why Panel Data Matters for Censored Models

In Notebook 01 we estimated a **Pooled Tobit** model that treats every observation as independent. But when we observe the **same individuals over multiple time periods**, ignoring this panel structure leads to several problems:

1. **Unobserved heterogeneity**: Individuals differ in unobserved ways (e.g., health consciousness, genetic predisposition) that affect their expenditure in every period. Ignoring this inflates the error variance.

2. **Biased standard errors**: Observations from the same individual are correlated. Treating them as independent understates standard errors, leading to spurious significance.

3. **Inefficient estimation**: Because the pooled estimator ignores the error-component structure (a persistent individual part plus a transitory part), it is less efficient than an estimator that models that structure.

### The Key Insight

Consider a person who consistently has high health expenditure. Part of this is explained by observables (income, age, chronic conditions), but part is due to **unobserved individual-specific factors** $\alpha_i$ -- perhaps their risk aversion, access to specialists, or family health history.

The Random Effects Tobit model explicitly accounts for this by decomposing the error into:
- $\alpha_i$ -- individual-specific component (constant over time)
- $\varepsilon_{it}$ -- idiosyncratic component (varies over time)

---

## Section 2: The Random Effects Tobit Model

### Model Specification

The Random Effects Tobit model for left-censored panel data is:

$$y_{it}^* = \mathbf{X}_{it}'\boldsymbol{\beta} + \alpha_i + \varepsilon_{it}$$

$$y_{it} = \max(0, \, y_{it}^*)$$

where:
- $y_{it}^*$ is the **latent** (uncensored) outcome for individual $i$ at time $t$
- $y_{it}$ is the **observed** (censored) outcome
- $\mathbf{X}_{it}$ is the vector of explanatory variables
- $\boldsymbol{\beta}$ is the coefficient vector
- $\alpha_i \sim N(0, \sigma^2_\alpha)$ is the individual random effect
- $\varepsilon_{it} \sim N(0, \sigma^2_\varepsilon)$ is the idiosyncratic error
- $\alpha_i \perp \varepsilon_{it}$ (independence between components)
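
To make the specification concrete, the following sketch simulates a small balanced panel from this data-generating process. The function name `simulate_re_tobit`, the single regressor, and all parameter values are illustrative assumptions. They are not taken from the health expenditure dataset or from PanelBox.

```python
import numpy as np

def simulate_re_tobit(n_ind=500, n_periods=4, beta=(1.0, 0.5),
                      sigma_alpha=1.0, sigma_eps=1.5, seed=0):
    """Simulate a balanced left-censored panel from the RE Tobit DGP (illustrative)."""
    rng = np.random.default_rng(seed)
    beta = np.asarray(beta, dtype=float)

    # One time-varying regressor; the constant enters through beta[0]
    x = rng.normal(size=(n_ind, n_periods))
    # alpha_i: drawn once per individual, shared by all of that individual's periods
    alpha = rng.normal(scale=sigma_alpha, size=(n_ind, 1))
    # eps_it: drawn independently for every (i, t) cell
    eps = rng.normal(scale=sigma_eps, size=(n_ind, n_periods))

    y_star = beta[0] + beta[1] * x + alpha + eps  # latent outcome
    y = np.maximum(0.0, y_star)                   # observed outcome, censored at 0
    return x, y_star, y

x_sim, y_star_sim, y_sim = simulate_re_tobit()
print(f"Share censored at zero: {(y_sim == 0).mean():.3f}")
print(f"Min latent y*: {y_star_sim.min():.2f} | min observed y: {y_sim.min():.2f}")
```

The latent outcome can be negative, while the observed outcome piles up at exactly zero. That mass point is what the Tobit likelihood has to account for.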

### Key Assumptions

1. **Normality**: Both $\alpha_i$ and $\varepsilon_{it}$ are normally distributed
2. **Independence**: $\alpha_i$ is independent of $\varepsilon_{it}$ and of $\mathbf{X}_{it}$ (the RE assumption)
3. **Strict exogeneity**: $E[\varepsilon_{it} | \mathbf{X}_{i1}, \ldots, \mathbf{X}_{iT}, \alpha_i] = 0$

### Variance Decomposition

The total error variance is:

$$\text{Var}(\alpha_i + \varepsilon_{it}) = \sigma^2_\alpha + \sigma^2_\varepsilon$$

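The **intra-class correlation** (ICC) measures the fraction of the total error variance due to individual effects. Equivalently, it is the correlation between the composite errors $\alpha_i + \varepsilon_{it}$ and $\alpha_i + \varepsilon_{is}$ of the same individual in two different periods $t \neq s$: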

$$\rho = \frac{\sigma^2_\alpha}{\sigma^2_\alpha + \sigma^2_\varepsilon}$$

- $\rho \approx 0$: Little individual heterogeneity (pooled model may suffice)
- $\rho \approx 1$: Most variation is between individuals (strong individual effects)
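
As a quick numerical check of this formula, the sketch below computes $\rho$ for a pair of made-up standard deviations and compares it with the simulated correlation of the composite error across two periods. The helper `intra_class_correlation` and the `_demo` values are hypothetical and are not estimates from the data.

```python
import numpy as np

def intra_class_correlation(sigma_alpha, sigma_eps):
    """rho = sigma_alpha^2 / (sigma_alpha^2 + sigma_eps^2)."""
    return sigma_alpha**2 / (sigma_alpha**2 + sigma_eps**2)

# Illustrative values only (not estimates from the dataset)
sigma_alpha_demo, sigma_eps_demo = 1.0, 1.5
rho_demo = intra_class_correlation(sigma_alpha_demo, sigma_eps_demo)

# Simulate the composite error u_it = alpha_i + eps_it for two periods
rng = np.random.default_rng(0)
n_draws = 200_000
alpha_draws = rng.normal(scale=sigma_alpha_demo, size=n_draws)
u1 = alpha_draws + rng.normal(scale=sigma_eps_demo, size=n_draws)
u2 = alpha_draws + rng.normal(scale=sigma_eps_demo, size=n_draws)
rho_empirical = np.corrcoef(u1, u2)[0, 1]

print(f"Theoretical rho:            {rho_demo:.4f}")
print(f"Empirical corr(u_i1, u_i2): {rho_empirical:.4f}")
```

The two numbers should agree up to simulation noise, because the shared $\alpha_i$ is the only source of correlation between the two periods.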

### Estimation via Gauss-Hermite Quadrature

The likelihood for individual $i$ requires integrating over the unobserved $\alpha_i$:

$$L_i = \int_{-\infty}^{\infty} \left[ \prod_{t=1}^{T_i} f(y_{it} | \mathbf{X}_{it}, \alpha_i) \right] \frac{1}{\sigma_\alpha} \phi\!\left(\frac{\alpha_i}{\sigma_\alpha}\right) d\alpha_i$$

This integral has no closed-form solution. PanelBox uses **Gauss-Hermite quadrature** to approximate it numerically:

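$$L_i \approx \frac{1}{\sqrt{\pi}} \sum_{q=1}^{Q} w_q \prod_{t=1}^{T_i} f(y_{it} | \mathbf{X}_{it}, \, \alpha_i = \sqrt{2}\sigma_\alpha \, n_q)$$

where $n_q$ and $w_q$ are the Gauss-Hermite nodes and weights for the weight function $e^{-x^2}$, and $Q$ is the number of quadrature points (default: 12). The $\sqrt{2}\sigma_\alpha$ scaling and the $1/\sqrt{\pi}$ factor come from the change of variables $\alpha_i = \sqrt{2}\sigma_\alpha x$, which maps the normal density of $\alpha_i$ onto that weight function.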
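
The quadrature idea can be sketched for a single individual with NumPy and SciPy. This is a minimal illustration, not PanelBox's internal implementation. The helpers `likelihood_given_alpha` and `gh_likelihood` and all numeric values are hypothetical. The sum includes the $1/\sqrt{\pi}$ normalization that standard Gauss-Hermite weights require, and the result is compared against adaptive numerical integration.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def likelihood_given_alpha(y, xb, alpha, sigma_eps):
    """prod_t f(y_it | X_it, alpha_i) for left-censoring at zero."""
    mu = xb + alpha
    contrib = np.where(
        y > 0,
        stats.norm.pdf((y - mu) / sigma_eps) / sigma_eps,  # uncensored: normal density
        stats.norm.cdf(-mu / sigma_eps),                   # censored: P(y* <= 0)
    )
    return contrib.prod()

def gh_likelihood(y, xb, sigma_alpha, sigma_eps, n_points=12):
    """Gauss-Hermite approximation to the individual likelihood L_i."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_points)
    alphas = np.sqrt(2.0) * sigma_alpha * nodes  # change of variables
    values = np.array([likelihood_given_alpha(y, xb, a, sigma_eps) for a in alphas])
    return (weights * values).sum() / np.sqrt(np.pi)

# One hypothetical individual with T = 4 (two censored periods)
y_i = np.array([0.0, 1.2, 0.0, 2.5])
xb_i = np.array([0.3, 0.8, -0.2, 1.5])
sigma_alpha_demo, sigma_eps_demo = 1.0, 1.5

L_gh = gh_likelihood(y_i, xb_i, sigma_alpha_demo, sigma_eps_demo, n_points=12)

# Reference value: integrate against the N(0, sigma_alpha^2) density directly
def integrand(a):
    density = stats.norm.pdf(a / sigma_alpha_demo) / sigma_alpha_demo
    return likelihood_given_alpha(y_i, xb_i, a, sigma_eps_demo) * density

L_exact, _ = quad(integrand, -np.inf, np.inf)

for q in (4, 8, 12, 24):
    approx = gh_likelihood(y_i, xb_i, sigma_alpha_demo, sigma_eps_demo, n_points=q)
    print(f"Q = {q:2d}: L_i = {approx:.8f}")
print(f"Adaptive quad: L_i = {L_exact:.8f}")
```

Watching the approximation settle as $Q$ grows is a useful sanity check. If estimates from a fitted model change noticeably when the number of quadrature points is increased, the default may be too coarse for that dataset.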

---

## Section 3: Loading Panel Data

In [None]:
# Load dataset
df = pd.read_csv(DATA_DIR / 'health_expenditure_panel.csv')

print('Dataset shape:', df.shape)
print(f'N individuals: {df["id"].nunique()}')
print(f'T periods: {df["time"].nunique()} ({df["time"].min()}-{df["time"].max()})')
print(f'Panel structure: balanced = {len(df) == df["id"].nunique() * df["time"].nunique()}')
print()
print('First rows:')
display(df.head(12))

print('\nVariable descriptions:')
var_desc = {
    'id': 'Individual identifier',
    'time': 'Time period (1-4)',
    'expenditure': 'Health expenditure (censored at 0)',
    'income': 'Annual income (thousands)',
    'age': 'Age in years',
    'chronic': 'Number of chronic conditions',
    'insurance': 'Has health insurance (0/1)',
    'female': 'Female indicator (0/1)',
    'bmi': 'Body Mass Index'
}
for var, desc in var_desc.items():
    print(f'  {var:15s} {desc}')

In [None]:
# Summary statistics
print('Summary Statistics')
print('=' * 60)
display(df.describe().round(3))

# Censoring analysis
n_censored = (df['expenditure'] == 0).sum()
n_total = len(df)
pct_censored = n_censored / n_total * 100

print(f'\nCensoring Summary:')
print(f'  Total observations:    {n_total}')
print(f'  Censored (y = 0):      {n_censored} ({pct_censored:.1f}%)')
print(f'  Uncensored (y > 0):    {n_total - n_censored} ({100 - pct_censored:.1f}%)')
print(f'  Mean (all):            {df["expenditure"].mean():.2f}')
print(f'  Mean (uncensored):     {df.loc[df["expenditure"] > 0, "expenditure"].mean():.2f}')

---

## Section 4: Exploratory Data Analysis

Before estimation, we explore the panel structure focusing on **within-individual variation** and **censoring patterns over time**.

In [None]:
# ============================================================
# Distribution of expenditure: overall and by censoring
# ============================================================

fig, axes = plt.subplots(1, 3, figsize=(17, 5))

# 1. Overall distribution
ax = axes[0]
ax.hist(df['expenditure'], bins=40, color='steelblue', edgecolor='white', alpha=0.8)
ax.axvline(0, color='red', linestyle='--', linewidth=2, label='Censoring point')
ax.set_xlabel('Health Expenditure')
ax.set_ylabel('Frequency')
ax.set_title('Distribution of Health Expenditure')
ax.legend(fontsize=10)

textstr = f'N = {n_total}\nCensored = {pct_censored:.1f}%\nMean = {df["expenditure"].mean():.2f}'
ax.text(0.95, 0.95, textstr, transform=ax.transAxes, fontsize=10,
        verticalalignment='top', horizontalalignment='right',
        bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

# 2. Censoring pattern by time period
ax = axes[1]
# Named aggregation gives the same table without groupby.apply
censoring_by_time = df.groupby('time')['expenditure'].agg(
    pct_censored=lambda s: (s == 0).mean() * 100,
    mean_exp='mean'
)
bars = ax.bar(censoring_by_time.index, censoring_by_time['pct_censored'],
              color='coral', edgecolor='black', alpha=0.8)
ax.set_xlabel('Time Period')
ax.set_ylabel('Percent Censored (%)')
ax.set_title('Censoring Rate by Time Period')
ax.set_xticks(censoring_by_time.index)
for bar, val in zip(bars, censoring_by_time['pct_censored']):
    ax.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 0.5,
            f'{val:.1f}%', ha='center', va='bottom', fontsize=11, fontweight='bold')

# 3. Distribution of uncensored expenditure by time
ax = axes[2]
for t in sorted(df['time'].unique()):
    sub = df[(df['time'] == t) & (df['expenditure'] > 0)]
    ax.hist(sub['expenditure'], bins=25, alpha=0.4, label=f't = {t}',
            density=True, edgecolor='white')
ax.set_xlabel('Health Expenditure (uncensored only)')
ax.set_ylabel('Density')
ax.set_title('Uncensored Distribution by Period')
ax.legend(fontsize=10)

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'expenditure_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

print('Censoring rate by period:')
display(censoring_by_time.round(2))

*Figure: Left -- overall distribution of health expenditure showing the mass point at zero (censoring) and the continuous positive part. Center -- censoring rate is relatively stable across time periods (~27%). Right -- the distribution of positive expenditure values is similar across periods.*

In [None]:
# ============================================================
# Within-individual variation in expenditure
# ============================================================

# Compute individual-level statistics
individual_stats = df.groupby('id').agg(
    mean_exp=('expenditure', 'mean'),
    std_exp=('expenditure', 'std'),
    min_exp=('expenditure', 'min'),
    max_exp=('expenditure', 'max'),
    n_censored=('expenditure', lambda x: (x == 0).sum()),
    mean_income=('income', 'mean'),
    mean_age=('age', 'mean'),
    mean_chronic=('chronic', 'mean')
).reset_index()

individual_stats['range_exp'] = individual_stats['max_exp'] - individual_stats['min_exp']
individual_stats['cv_exp'] = individual_stats['std_exp'] / (individual_stats['mean_exp'] + 0.01)

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Distribution of individual means
ax = axes[0, 0]
ax.hist(individual_stats['mean_exp'], bins=30, color='steelblue',
        edgecolor='white', alpha=0.8)
ax.axvline(individual_stats['mean_exp'].mean(), color='red', linestyle='--',
           linewidth=2, label=f'Grand mean = {individual_stats["mean_exp"].mean():.2f}')
ax.set_xlabel('Individual Mean Expenditure')
ax.set_ylabel('Frequency')
ax.set_title('Between-Individual Variation')
ax.legend(fontsize=10)

# 2. Number of censored periods per individual
ax = axes[0, 1]
censored_counts = individual_stats['n_censored'].value_counts().sort_index()
bars = ax.bar(censored_counts.index, censored_counts.values,
              color='coral', edgecolor='black', alpha=0.8)
ax.set_xlabel('Number of Censored Periods (out of 4)')
ax.set_ylabel('Number of Individuals')
ax.set_title('Censoring Frequency Distribution')
ax.set_xticks(range(5))
for bar, val in zip(bars, censored_counts.values):
    ax.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 2,
            f'{val}', ha='center', va='bottom', fontsize=11)

# 3. Within-individual range vs mean
ax = axes[1, 0]
ax.scatter(individual_stats['mean_exp'], individual_stats['range_exp'],
           alpha=0.4, s=20, color='seagreen')
ax.set_xlabel('Individual Mean Expenditure')
ax.set_ylabel('Within-Individual Range')
ax.set_title('Within-Individual Variation')

# 4. Individual trajectories (sample of 20 individuals)
ax = axes[1, 1]
sample_ids = np.random.choice(df['id'].unique(), size=20, replace=False)
for pid in sample_ids:
    sub = df[df['id'] == pid].sort_values('time')
    ax.plot(sub['time'], sub['expenditure'], marker='o', alpha=0.5,
            linewidth=1.5, markersize=4)
ax.set_xlabel('Time Period')
ax.set_ylabel('Health Expenditure')
ax.set_title('Individual Trajectories (sample of 20)')
ax.set_xticks([1, 2, 3, 4])
ax.axhline(0, color='red', linestyle=':', linewidth=1, alpha=0.5)

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'within_individual_variation.png', dpi=300, bbox_inches='tight')
plt.show()

# Summary
print('Within-individual variation summary:')
print(f'  Mean of individual means:   {individual_stats["mean_exp"].mean():.2f}')
print(f'  SD of individual means:     {individual_stats["mean_exp"].std():.2f} (between variation)')
print(f'  Mean within-individual SD:  {individual_stats["std_exp"].mean():.2f} (within variation)')
print(f'  Mean within-individual range: {individual_stats["range_exp"].mean():.2f}')
print(f'\nCensoring patterns:')
for nc in range(5):
    n = (individual_stats['n_censored'] == nc).sum()
    pct = n / len(individual_stats) * 100
    print(f'  {nc} censored periods: {n} individuals ({pct:.1f}%)')

*Figure: Top-left -- substantial between-individual variation in mean expenditure motivates the random effect. Top-right -- censoring frequency shows many individuals have mixed censored/uncensored observations. Bottom-left -- within-individual range increases with mean expenditure. Bottom-right -- individual trajectories show considerable within-person variation over time.*

In [None]:
# ============================================================
# Between vs Within Variation (ANOVA decomposition)
# ============================================================

# Grand mean
grand_mean = df['expenditure'].mean()

# Between variation: variation in individual means
individual_means = df.groupby('id')['expenditure'].transform('mean')
ss_between = ((individual_means - grand_mean) ** 2).sum()

# Within variation: deviation from individual means
ss_within = ((df['expenditure'] - individual_means) ** 2).sum()

# Total variation
ss_total = ((df['expenditure'] - grand_mean) ** 2).sum()

print('ANOVA-style Variance Decomposition')
print('=' * 50)
print(f'Total SS:      {ss_total:12.2f}')
print(f'Between SS:    {ss_between:12.2f} ({ss_between/ss_total*100:.1f}%)')
print(f'Within SS:     {ss_within:12.2f} ({ss_within/ss_total*100:.1f}%)')
print(f'Check (B+W):   {ss_between + ss_within:12.2f}')
print()
print(f'Between std:   {individual_stats["mean_exp"].std():.3f}')
print(f'Within std:    {(df["expenditure"] - individual_means).std():.3f}')
print()
print('Interpretation:')
print(f'  {ss_between/ss_total*100:.0f}% of the variation in expenditure is BETWEEN individuals.')
print(f'  {ss_within/ss_total*100:.0f}% is WITHIN individuals over time.')
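n_periods = df['time'].nunique()
print(f'  => With no individual effects, the between share would be about {100 / n_periods:.0f}% (1/T);')
print('     a share well above that points to individual heterogeneity and motivates the RE model.')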

In [None]:
# ============================================================
# Correlations between expenditure and explanatory variables
# ============================================================

fig, axes = plt.subplots(2, 3, figsize=(17, 10))

# 1. Income vs expenditure
ax = axes[0, 0]
ax.scatter(df['income'], df['expenditure'], alpha=0.15, s=10, color='steelblue')
# Add smoothed trend for positive values
pos_mask = df['expenditure'] > 0
z = np.polyfit(df.loc[pos_mask, 'income'], df.loc[pos_mask, 'expenditure'], 1)
x_line = np.linspace(df['income'].min(), df['income'].max(), 100)
ax.plot(x_line, np.polyval(z, x_line), 'r-', linewidth=2, label='Linear fit (y>0)')
ax.set_xlabel('Income')
ax.set_ylabel('Expenditure')
ax.set_title('Income vs Expenditure')
ax.legend(fontsize=9)

# 2. Age vs expenditure
ax = axes[0, 1]
age_bins = pd.cut(df['age'], bins=8)
age_means = df.groupby(age_bins)['expenditure'].mean()
age_centers = [(interval.left + interval.right) / 2 for interval in age_means.index]
ax.bar(range(len(age_centers)), age_means.values, color='seagreen', edgecolor='black', alpha=0.8)
ax.set_xticks(range(len(age_centers)))
ax.set_xticklabels([f'{c:.0f}' for c in age_centers], rotation=45)
ax.set_xlabel('Age Group (midpoint)')
ax.set_ylabel('Mean Expenditure')
ax.set_title('Expenditure by Age Group')

# 3. Chronic conditions vs expenditure
ax = axes[0, 2]
chronic_stats = df.groupby('chronic')['expenditure'].agg(['mean', 'std'])
ax.bar(chronic_stats.index, chronic_stats['mean'], color='coral',
       edgecolor='black', alpha=0.8)
ax.errorbar(chronic_stats.index, chronic_stats['mean'],
            yerr=chronic_stats['std'] / np.sqrt(df.groupby('chronic').size()),
            fmt='none', color='black', capsize=4)
ax.set_xlabel('Number of Chronic Conditions')
ax.set_ylabel('Mean Expenditure')
ax.set_title('Expenditure by Chronic Conditions')

# 4. Insurance effect
ax = axes[1, 0]
ins_data = [df.loc[df['insurance'] == 0, 'expenditure'],
            df.loc[df['insurance'] == 1, 'expenditure']]
bp = ax.boxplot(ins_data, labels=['No Insurance', 'Insured'],
                patch_artist=True, flierprops=dict(markersize=2, alpha=0.3))
bp['boxes'][0].set_facecolor('lightcoral')
bp['boxes'][1].set_facecolor('lightblue')
ax.set_ylabel('Expenditure')
ax.set_title('Expenditure by Insurance Status')

# 5. Gender effect
ax = axes[1, 1]
gender_data = [df.loc[df['female'] == 0, 'expenditure'],
               df.loc[df['female'] == 1, 'expenditure']]
bp = ax.boxplot(gender_data, labels=['Male', 'Female'],
                patch_artist=True, flierprops=dict(markersize=2, alpha=0.3))
bp['boxes'][0].set_facecolor('lightblue')
bp['boxes'][1].set_facecolor('lightcoral')
ax.set_ylabel('Expenditure')
ax.set_title('Expenditure by Gender')

# 6. BMI vs expenditure
ax = axes[1, 2]
ax.scatter(df['bmi'], df['expenditure'], alpha=0.15, s=10, color='purple')
ax.set_xlabel('BMI')
ax.set_ylabel('Expenditure')
ax.set_title('BMI vs Expenditure')

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'eda_covariates.png', dpi=300, bbox_inches='tight')
plt.show()

# Correlation matrix
print('\nCorrelation with expenditure:')
corrs = df[['expenditure', 'income', 'age', 'chronic', 'insurance', 'female', 'bmi']].corr()['expenditure']
for var, corr in corrs.items():
    if var != 'expenditure':
        print(f'  {var:15s}: {corr:+.3f}')

*Figure: Exploratory plots showing the relationship between health expenditure and key covariates. Chronic conditions and age show the strongest positive associations. Insurance status and income also appear related to expenditure levels.*

---

## Section 5: Estimation -- Random Effects Tobit

We now estimate the Random Effects Tobit model using PanelBox. The model accounts for both:
- **Censoring** at zero (mass point in expenditure distribution)
- **Individual heterogeneity** via random effects $\alpha_i$

In [None]:
# ============================================================
# Prepare data for estimation
# ============================================================

# Dependent variable
y = df['expenditure'].values

# Explanatory variables (with constant)
X_vars = df[['income', 'age', 'chronic', 'insurance', 'female', 'bmi']].values
X = sm.add_constant(X_vars)

var_names = ['const', 'income', 'age', 'chronic', 'insurance', 'female', 'bmi']

# Panel identifiers
groups = df['id'].values
time = df['time'].values

print('Data preparation:')
print(f'  y shape:       {y.shape}')
print(f'  X shape:       {X.shape}')
print(f'  groups shape:  {groups.shape}')
print(f'  N individuals: {len(np.unique(groups))}')
print(f'  T periods:     {len(np.unique(time))}')
print(f'  Variables:     {var_names}')

In [None]:
# ============================================================
# Estimate Random Effects Tobit
# ============================================================

print('Fitting Random Effects Tobit model...')
print('(This may take a minute due to quadrature integration)')
print('=' * 60)

re_tobit = RandomEffectsTobit(
    endog=y,
    exog=X,
    groups=groups,
    time=time,
    censoring_point=0.0,
    censoring_type='left',
    quadrature_points=12
)

re_tobit.fit(method='BFGS', maxiter=1000)

print()
print(re_tobit.summary())

In [None]:
# ============================================================
# Detailed results extraction
# ============================================================

K = len(var_names)

# Extract coefficients and standard errors
re_beta = re_tobit.beta
re_bse = re_tobit.bse[:K]
re_t = re_beta / re_bse
re_p = 2 * stats.norm.sf(np.abs(re_t))  # sf avoids rounding tiny p-values to exactly 0

def add_stars(p):
    if p < 0.001: return '***'
    elif p < 0.01: return '**'
    elif p < 0.05: return '*'
    else: return ''

re_table = pd.DataFrame({
    'Variable': var_names,
    'Coefficient': re_beta,
    'Std. Error': re_bse,
    'z-statistic': re_t,
    'p-value': re_p
})
re_table['Sig'] = re_table['p-value'].apply(add_stars)

print('Random Effects Tobit -- Coefficient Estimates')
print('=' * 70)
display(re_table.round(4))
print('Significance: *** p<0.001, ** p<0.01, * p<0.05')

# Save table
re_table.to_csv(TABLES_DIR / 'table_01_re_tobit_estimates.csv', index=False)
print(f'\nSaved to {TABLES_DIR / "table_01_re_tobit_estimates.csv"}')

---

## Section 6: Interpreting Results -- Variance Decomposition

A key output of the Random Effects Tobit is the **variance decomposition** between individual-level and idiosyncratic components.

In [None]:
# ============================================================
# Variance decomposition and intra-class correlation
# ============================================================

sigma_eps = re_tobit.sigma_eps
sigma_alpha = re_tobit.sigma_alpha

sigma2_eps = sigma_eps ** 2
sigma2_alpha = sigma_alpha ** 2
sigma2_total = sigma2_eps + sigma2_alpha

rho = sigma2_alpha / sigma2_total  # Intra-class correlation

print('Variance Decomposition')
print('=' * 50)
print(f'sigma_eps (idiosyncratic):  {sigma_eps:.4f}')
print(f'sigma_alpha (individual):   {sigma_alpha:.4f}')
print()
print(f'sigma^2_eps:               {sigma2_eps:.4f}')
print(f'sigma^2_alpha:             {sigma2_alpha:.4f}')
print(f'sigma^2_total:             {sigma2_total:.4f}')
print()
print(f'Intra-class correlation (rho):')
print(f'  rho = sigma^2_alpha / (sigma^2_alpha + sigma^2_eps)')
print(f'  rho = {sigma2_alpha:.4f} / ({sigma2_alpha:.4f} + {sigma2_eps:.4f})')
print(f'  rho = {rho:.4f}')
print()
print('Interpretation:')
print(f'  {rho*100:.1f}% of the total error variance is due to')
print(f'  individual-specific unobserved heterogeneity.')
if rho > 0.3:
    print(f'  => Substantial individual effects. The RE model is important.')
elif rho > 0.1:
    print(f'  => Moderate individual effects. RE model provides some improvement.')
else:
    print(f'  => Modest individual effects. Pooled model may be adequate.')

In [None]:
# ============================================================
# Visualize variance decomposition
# ============================================================

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# 1. Pie chart of variance components
ax = axes[0]
sizes = [sigma2_alpha, sigma2_eps]
labels = [f'Between\n($\\sigma^2_\\alpha$ = {sigma2_alpha:.3f})',
          f'Within\n($\\sigma^2_\\varepsilon$ = {sigma2_eps:.3f})']
colors_pie = ['#ff9999', '#66b3ff']
explode = (0.05, 0)

wedges, texts, autotexts = ax.pie(
    sizes, labels=labels, colors=colors_pie, explode=explode,
    autopct='%1.1f%%', startangle=90, textprops={'fontsize': 11}
)
for autotext in autotexts:
    autotext.set_fontsize(12)
    autotext.set_fontweight('bold')
ax.set_title(f'Error Variance Decomposition\n$\\rho$ = {rho:.3f}', fontsize=14)

# 2. Bar chart of the error-component standard deviations
ax = axes[1]
components = ['$\\sigma_\\alpha$\n(individual)', '$\\sigma_\\varepsilon$\n(idiosyncratic)',
              '$\\sigma_{total}$\n(total)']
values = [sigma_alpha, sigma_eps, np.sqrt(sigma2_total)]
colors_bar = ['#ff9999', '#66b3ff', '#99ff99']

bars = ax.bar(components, values, color=colors_bar, edgecolor='black', alpha=0.8)
for bar, val in zip(bars, values):
    ax.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 0.05,
            f'{val:.3f}', ha='center', va='bottom', fontsize=12, fontweight='bold')
ax.set_ylabel('Standard Deviation')
ax.set_title('Error Components (Standard Deviations)', fontsize=14)

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'variance_decomposition.png', dpi=300, bbox_inches='tight')
plt.show()

print(f'The intra-class correlation rho = {rho:.3f} indicates that {rho*100:.1f}%')
print(f'of the unobserved variation is time-invariant individual heterogeneity.')

*Figure: Left -- pie chart showing the decomposition of error variance into between-individual and within-individual components. Right -- bar chart comparing the standard deviations of the random effect, the idiosyncratic error, and the total error.*

In [None]:
# ============================================================
# Substantive interpretation of coefficients
# ============================================================

print('Substantive Interpretation of RE Tobit Coefficients')
print('=' * 60)
print()
print('IMPORTANT: In a Tobit model, coefficients represent the effect')
print('on the LATENT variable y*. The effect on the observed (censored)')
print('outcome is attenuated by the probability of being uncensored.')
print()

for i, var in enumerate(var_names[1:], 1):  # Skip constant
    coef = re_beta[i]
    se = re_bse[i]
    p = re_p[i]
    sig = add_stars(p)
    print(f'{i}. {var} (beta = {coef:.4f}, p = {p:.4f}) {sig}')
    
    if var == 'income':
        print(f'   A 1-unit increase in income (thousand $) is associated with')
        print(f'   a {coef:.4f} change in latent health expenditure.')
    elif var == 'age':
        print(f'   Each additional year of age changes latent expenditure by {coef:.4f}.')
    elif var == 'chronic':
        print(f'   Each additional chronic condition changes latent expenditure by {coef:.4f}.')
    elif var == 'insurance':
        print(f'   Having health insurance changes latent expenditure by {coef:.4f}.')
    elif var == 'female':
        print(f'   Being female changes latent expenditure by {coef:.4f}.')
    elif var == 'bmi':
        print(f'   Each unit increase in BMI changes latent expenditure by {coef:.4f}.')
    print()
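
The attenuation mentioned above can be made explicit. For a continuous regressor in a Tobit model censored at zero, the marginal effect on the observed outcome is $\partial E[y | \mathbf{X}] / \partial x_k = \beta_k \, \Phi(\mathbf{X}'\boldsymbol{\beta} / \sigma)$. The sketch below uses made-up numbers and hypothetical helper names. For the population-averaged effect in the RE model it assumes $\sigma = \sqrt{\sigma^2_\alpha + \sigma^2_\varepsilon}$, which follows from integrating out $\alpha_i$ under the model's normality assumptions.

```python
import numpy as np
from scipy import stats

def tobit_expected_observed(xb, sigma):
    """E[y | X] for y = max(0, y*), with y* ~ N(xb, sigma^2)."""
    z = xb / sigma
    return stats.norm.cdf(z) * xb + sigma * stats.norm.pdf(z)

def tobit_marginal_effect(beta_k, xb, sigma):
    """dE[y | X]/dx_k = beta_k * Phi(xb / sigma) for a continuous regressor."""
    return beta_k * stats.norm.cdf(xb / sigma)

# Illustrative numbers only (not estimates from the fitted model)
beta_k_demo, xb_demo = 0.8, 1.0
sigma_demo = np.sqrt(1.0**2 + 1.5**2)  # composite sd: sqrt(sigma_alpha^2 + sigma_eps^2)

me_demo = tobit_marginal_effect(beta_k_demo, xb_demo, sigma_demo)

# Cross-check against a central finite difference of E[y | X] in the index xb
h = 1e-6
fd_demo = beta_k_demo * (
    tobit_expected_observed(xb_demo + h, sigma_demo)
    - tobit_expected_observed(xb_demo - h, sigma_demo)
) / (2 * h)

print(f"Latent effect (beta_k):          {beta_k_demo:.4f}")
print(f"Effect on E[y | X] (analytical): {me_demo:.4f}")
print(f"Effect on E[y | X] (finite diff): {fd_demo:.4f}")
```

Because $\Phi(\cdot)$ lies between 0 and 1, the effect on the observed outcome is always smaller in magnitude than the latent coefficient, and the gap widens as the probability of censoring rises.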

---

## Section 7: Pooled vs Random Effects Comparison

To understand the value of accounting for panel structure, we estimate a **Pooled Tobit** model and compare it with the Random Effects specification.

In [None]:
# ============================================================
# Estimate Pooled Tobit
# ============================================================

print('Fitting Pooled Tobit model...')
print('=' * 60)

pooled_tobit = PooledTobit(
    endog=y,
    exog=X,
    groups=groups,
    censoring_point=0.0,
    censoring_type='left'
)

pooled_tobit.fit(method='BFGS', maxiter=1000)

print()
print(pooled_tobit.summary())

In [None]:
# ============================================================
# Side-by-side comparison
# ============================================================

pooled_beta = pooled_tobit.beta
pooled_bse = pooled_tobit.bse[:K]
pooled_t = pooled_beta / pooled_bse
pooled_p = 2 * stats.norm.sf(np.abs(pooled_t))  # sf avoids rounding tiny p-values to exactly 0

comparison = pd.DataFrame({
    'Variable': var_names,
    'Pooled_Coef': pooled_beta,
    'Pooled_SE': pooled_bse,
    'RE_Coef': re_beta,
    'RE_SE': re_bse,
    'Coef_Diff': re_beta - pooled_beta,
    'SE_Ratio': re_bse / pooled_bse
})

print('Coefficient Comparison: Pooled Tobit vs Random Effects Tobit')
print('=' * 85)
display(comparison.round(4))

print(f'\nModel fit comparison:')
print(f'  Pooled Tobit log-likelihood:   {pooled_tobit.llf:.2f}')
print(f'  RE Tobit log-likelihood:       {re_tobit.llf:.2f}')
print(f'  Difference:                    {re_tobit.llf - pooled_tobit.llf:.2f}')
print(f'  Pooled sigma:                  {pooled_tobit.sigma:.4f}')
print(f'  RE sigma_eps:                  {re_tobit.sigma_eps:.4f}')
print(f'  RE sigma_alpha:                {re_tobit.sigma_alpha:.4f}')

# Save comparison table
comparison.to_csv(TABLES_DIR / 'table_02_pooled_vs_re.csv', index=False)
print(f'\nSaved to {TABLES_DIR / "table_02_pooled_vs_re.csv"}')

In [None]:
# ============================================================
# Visual comparison: Forest plot
# ============================================================

fig, ax = plt.subplots(figsize=(10, 7))

n_vars = len(var_names) - 1  # Exclude constant
y_pos = np.arange(n_vars)
var_labels = var_names[1:]

# Plot Pooled
ax.errorbar(pooled_beta[1:], y_pos + 0.12,
            xerr=1.96 * pooled_bse[1:],
            fmt='o', color='coral', markersize=9, capsize=5,
            label='Pooled Tobit', linewidth=2)

# Plot RE
ax.errorbar(re_beta[1:], y_pos - 0.12,
            xerr=1.96 * re_bse[1:],
            fmt='s', color='steelblue', markersize=9, capsize=5,
            label='RE Tobit', linewidth=2)

ax.axvline(x=0, color='gray', linestyle='--', linewidth=1)
ax.set_yticks(y_pos)
ax.set_yticklabels(var_labels, fontsize=12)
ax.set_xlabel('Coefficient (95% CI)', fontsize=12)
ax.set_title('Pooled Tobit vs Random Effects Tobit\nCoefficient Comparison', fontsize=14)
ax.legend(fontsize=11, loc='lower right')
ax.grid(alpha=0.3, axis='x')

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'coefficient_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

print('Key observations:')
print(f'  - Average SE ratio (RE/Pooled): {comparison["SE_Ratio"].iloc[1:].mean():.2f}')
print(f'  - RE model generally has different SEs due to variance decomposition')
print(f'  - Coefficient magnitudes may shift when accounting for individual effects')

*Figure: Forest plot comparing coefficient estimates and 95% confidence intervals between Pooled Tobit and Random Effects Tobit. Differences in both point estimates and interval widths reflect the impact of accounting for individual-level unobserved heterogeneity.*

In [None]:
# ============================================================
# Likelihood ratio test: Pooled vs RE
# ============================================================

# Under H0: sigma_alpha = 0 (pooled model is adequate)
# LR = 2 * (LL_RE - LL_Pooled)
# Note: sigma_alpha >= 0 is on the boundary, so the test uses a
# mixture of chi2(0) and chi2(1) => halve the p-value.

LR = 2 * (re_tobit.llf - pooled_tobit.llf)
p_value_lr = 0.5 * stats.chi2.sf(LR, df=1)  # One-sided boundary test; sf keeps precision for large LR

print('Likelihood Ratio Test: Pooled vs Random Effects Tobit')
print('=' * 60)
print(f'H0: sigma_alpha = 0 (no individual effects)')
print(f'H1: sigma_alpha > 0 (individual effects present)')
print()
print(f'Log-likelihood (Pooled):   {pooled_tobit.llf:.2f}')
print(f'Log-likelihood (RE):       {re_tobit.llf:.2f}')
print(f'LR statistic:              {LR:.2f}')
print(f'p-value (boundary test):   {p_value_lr:.2e}')
print()

if p_value_lr < 0.001:
    print('Conclusion: STRONGLY reject pooled model (p < 0.001).')
    print('The Random Effects Tobit is significantly preferred.')
elif p_value_lr < 0.05:
    print('Conclusion: Reject pooled model at 5% level.')
    print('The Random Effects Tobit provides a better fit.')
else:
    print('Conclusion: Fail to reject pooled model.')
    print('Individual effects may not be important.')

print()
print('Note: The p-value is halved because sigma_alpha is tested on the')
print('boundary of the parameter space (sigma_alpha >= 0). This follows')
print('the standard approach for testing variance components.')

In [None]:
# ============================================================
# Comprehensive model comparison table
# ============================================================

# AIC/BIC
k_pooled = K + 1  # betas + sigma
k_re = K + 2       # betas + sigma_eps + sigma_alpha

aic_pooled = -2 * pooled_tobit.llf + 2 * k_pooled
bic_pooled = -2 * pooled_tobit.llf + np.log(len(y)) * k_pooled

aic_re = -2 * re_tobit.llf + 2 * k_re
bic_re = -2 * re_tobit.llf + np.log(len(y)) * k_re

model_comparison = pd.DataFrame({
    'Metric': ['Log-Likelihood', 'Parameters', 'AIC', 'BIC',
               'sigma (total)', 'sigma_eps', 'sigma_alpha', 'rho (ICC)',
               'LR test stat', 'LR p-value', 'Converged'],
    'Pooled Tobit': [
        f'{pooled_tobit.llf:.2f}', k_pooled, f'{aic_pooled:.2f}', f'{bic_pooled:.2f}',
        f'{pooled_tobit.sigma:.4f}', '--', '--', '--',
        '--', '--', str(pooled_tobit.converged)
    ],
    'RE Tobit': [
        f'{re_tobit.llf:.2f}', k_re, f'{aic_re:.2f}', f'{bic_re:.2f}',
        f'{np.sqrt(sigma2_total):.4f}', f'{sigma_eps:.4f}',
        f'{sigma_alpha:.4f}', f'{rho:.4f}',
        f'{LR:.2f}', f'{p_value_lr:.2e}', str(re_tobit.converged)
    ]
})

print('Comprehensive Model Comparison')
print('=' * 70)
display(model_comparison)

print(f'\nAIC difference (Pooled - RE): {aic_pooled - aic_re:.1f} (positive favors RE)')
print(f'BIC difference (Pooled - RE): {bic_pooled - bic_re:.1f} (positive favors RE)')

model_comparison.to_csv(TABLES_DIR / 'table_03_model_comparison.csv', index=False)
print(f'\nSaved to {TABLES_DIR / "table_03_model_comparison.csv"}')

---

## Section 8: Predictions -- Latent vs Censored

The Random Effects Tobit model provides two types of predictions:

1. **Latent predictions**: $E[y^* | \mathbf{X}] = \mathbf{X}'\hat{\boldsymbol{\beta}}$ -- the expected value of the underlying latent variable (can be negative)

2. **Censored predictions**: $E[y | \mathbf{X}]$ -- the expected value accounting for the censoring mechanism (always $\geq 0$)
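
For a Tobit model left-censored at zero, the two quantities are linked by the standard formula $E[y | \mathbf{X}] = \Phi(z)\,\mathbf{X}'\boldsymbol{\beta} + \sigma\,\phi(z)$ with $z = \mathbf{X}'\boldsymbol{\beta}/\sigma$. The sketch below evaluates this formula with SciPy for a few illustrative index values. It assumes a population-averaged prediction in which $\sigma$ is the total standard deviation $\sqrt{\sigma^2_\alpha + \sigma^2_\varepsilon}$; the helper `censored_mean` and the demo numbers are hypothetical, and PanelBox's `predict(pred_type='censored')` may be implemented differently.

```python
import numpy as np
from scipy.stats import norm

def censored_mean(xb, sigma):
    """E[y | X] for a Tobit left-censored at zero, given the latent index xb."""
    xb = np.asarray(xb, dtype=float)
    z = xb / sigma
    # P(y* > 0) * latent index + sigma * density term
    return norm.cdf(z) * xb + sigma * norm.pdf(z)

# Illustrative values only (not estimates from this notebook)
xb_demo = np.array([-2.0, 0.0, 3.0])
sigma_demo = 1.5
print(censored_mean(xb_demo, sigma_demo))
```

The censored mean is never negative and never falls below the latent index, which is why the two prediction types diverge most where the latent prediction is near or below zero.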

In [None]:
# ============================================================
# Generate predictions
# ============================================================

# RE Tobit predictions
y_pred_latent = re_tobit.predict(pred_type='latent')
y_pred_censored = re_tobit.predict(pred_type='censored')

# Pooled Tobit predictions for comparison
y_pooled_latent = pooled_tobit.predict(pred_type='latent')
y_pooled_censored = pooled_tobit.predict(pred_type='censored')

print('Prediction Summary')
print('=' * 60)
print(f'{"":20s} {"Latent":>12s} {"Censored":>12s}')
print('-' * 44)
print(f'{"RE Tobit mean":20s} {y_pred_latent.mean():>12.3f} {y_pred_censored.mean():>12.3f}')
print(f'{"RE Tobit min":20s} {y_pred_latent.min():>12.3f} {y_pred_censored.min():>12.3f}')
print(f'{"RE Tobit max":20s} {y_pred_latent.max():>12.3f} {y_pred_censored.max():>12.3f}')
print(f'{"Pooled Tobit mean":20s} {y_pooled_latent.mean():>12.3f} {y_pooled_censored.mean():>12.3f}')
print(f'{"Observed mean":20s} {"":>12s} {y.mean():>12.3f}')
print()
print('Note: Latent predictions can be negative (they predict y*,')
print('the uncensored latent variable). Censored predictions account')
print('for the probability of censoring and are always non-negative.')

In [None]:
# ============================================================
# Visualization: Latent vs Censored predictions
# ============================================================

fig, axes = plt.subplots(2, 2, figsize=(14, 11))

# 1. Latent predictions vs observed
ax = axes[0, 0]
jitter = np.random.uniform(-0.2, 0.2, len(y))
ax.scatter(y_pred_latent, y + jitter, alpha=0.1, s=8, color='steelblue')
lims = [min(y_pred_latent.min(), y.min()) - 1, max(y_pred_latent.max(), y.max()) + 1]
ax.plot(lims, lims, 'r--', linewidth=2, label='45-degree line')
ax.axvline(0, color='orange', linestyle=':', linewidth=1.5, alpha=0.7, label='Censoring point')
ax.axhline(0, color='orange', linestyle=':', linewidth=1.5, alpha=0.7)
ax.set_xlabel('Latent Prediction (y*)', fontsize=12)
ax.set_ylabel('Observed Expenditure', fontsize=12)
ax.set_title('RE Tobit: Latent Predictions vs Observed', fontsize=13)
ax.legend(fontsize=10)
corr_latent = np.corrcoef(y_pred_latent, y)[0, 1]
ax.text(0.05, 0.95, f'Corr = {corr_latent:.3f}', transform=ax.transAxes,
        fontsize=11, verticalalignment='top',
        bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

# 2. Censored predictions vs observed
ax = axes[0, 1]
ax.scatter(y_pred_censored, y + jitter, alpha=0.1, s=8, color='seagreen')
max_val = max(y_pred_censored.max(), y.max())
ax.plot([0, max_val], [0, max_val], 'r--', linewidth=2, label='45-degree line')
ax.set_xlabel('Censored Prediction E[y|X]', fontsize=12)
ax.set_ylabel('Observed Expenditure', fontsize=12)
ax.set_title('RE Tobit: Censored Predictions vs Observed', fontsize=13)
ax.legend(fontsize=10)
corr_cens = np.corrcoef(y_pred_censored, y)[0, 1]
ax.text(0.05, 0.95, f'Corr = {corr_cens:.3f}', transform=ax.transAxes,
        fontsize=11, verticalalignment='top',
        bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

# 3. Distribution of predictions
ax = axes[1, 0]
ax.hist(y_pred_latent, bins=40, alpha=0.6, color='steelblue',
        edgecolor='white', label='Latent (y*)', density=True)
ax.hist(y_pred_censored, bins=40, alpha=0.6, color='seagreen',
        edgecolor='white', label='Censored E[y|X]', density=True)
ax.axvline(0, color='red', linestyle='--', linewidth=2, label='Censoring point')
ax.set_xlabel('Predicted Value')
ax.set_ylabel('Density')
ax.set_title('Distribution of Predictions')
ax.legend(fontsize=10)

# 4. Pooled vs RE censored predictions
ax = axes[1, 1]
ax.scatter(y_pooled_censored, y_pred_censored, alpha=0.15, s=8, color='purple')
max_pred = max(y_pooled_censored.max(), y_pred_censored.max())
ax.plot([0, max_pred], [0, max_pred], 'r--', linewidth=2, label='45-degree line')
ax.set_xlabel('Pooled Tobit Prediction', fontsize=12)
ax.set_ylabel('RE Tobit Prediction', fontsize=12)
ax.set_title('Pooled vs RE Predictions', fontsize=13)
ax.legend(fontsize=10)
corr_models = np.corrcoef(y_pooled_censored, y_pred_censored)[0, 1]
ax.text(0.05, 0.95, f'Corr = {corr_models:.3f}', transform=ax.transAxes,
        fontsize=11, verticalalignment='top',
        bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'predictions_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

print('Prediction accuracy:')
print(f'  Correlation (latent, observed):            {corr_latent:.4f}')
print(f'  Correlation (censored, observed):          {corr_cens:.4f}')
print(f'  Correlation (pooled pred, RE pred):        {corr_models:.4f}')
print(f'  RMSE (RE censored):  {np.sqrt(np.mean((y - y_pred_censored)**2)):.3f}')
print(f'  RMSE (Pooled cens.): {np.sqrt(np.mean((y - y_pooled_censored)**2)):.3f}')

*Figure: Top-left -- latent predictions versus observed values; latent predictions can be negative for observations with a high predicted probability of censoring. Top-right -- censored predictions versus observed values; these predictions are non-negative by construction. Bottom-left -- distributions of latent and censored predictions. Bottom-right -- Pooled and RE censored predictions are correlated but not identical, reflecting the impact of modeling the random effect.*

In [None]:
# ============================================================
# Residual analysis
# ============================================================

# Residuals from censored predictions
residuals_re = y - y_pred_censored
residuals_pooled = y - y_pooled_censored

fig, axes = plt.subplots(1, 3, figsize=(17, 5))

# 1. Residuals vs fitted (RE)
ax = axes[0]
ax.scatter(y_pred_censored, residuals_re, alpha=0.1, s=8, color='steelblue')
ax.axhline(0, color='red', linestyle='--', linewidth=1.5)
ax.set_xlabel('Fitted Values (censored)', fontsize=12)
ax.set_ylabel('Residuals', fontsize=12)
ax.set_title('RE Tobit: Residuals vs Fitted', fontsize=13)

# 2. Residual distribution
ax = axes[1]
ax.hist(residuals_re, bins=40, color='steelblue', edgecolor='white',
        alpha=0.8, density=True, label='RE residuals')
ax.hist(residuals_pooled, bins=40, color='coral', edgecolor='white',
        alpha=0.5, density=True, label='Pooled residuals')
ax.set_xlabel('Residual')
ax.set_ylabel('Density')
ax.set_title('Residual Distributions')
ax.legend(fontsize=10)

# 3. Mean residual by time period
ax = axes[2]
resid_by_time = pd.DataFrame({
    'time': time,
    'resid_RE': residuals_re,
    'resid_Pooled': residuals_pooled
}).groupby('time')[['resid_RE', 'resid_Pooled']].mean()

x_pos = np.arange(len(resid_by_time))
width = 0.35
ax.bar(x_pos - width/2, resid_by_time['resid_RE'], width,
       label='RE Tobit', color='steelblue', alpha=0.8)
ax.bar(x_pos + width/2, resid_by_time['resid_Pooled'], width,
       label='Pooled Tobit', color='coral', alpha=0.8)
ax.axhline(0, color='black', linestyle='-', linewidth=0.5)
ax.set_xlabel('Time Period')
ax.set_ylabel('Mean Residual')
ax.set_title('Mean Residual by Period')
ax.set_xticks(x_pos)
ax.set_xticklabels(resid_by_time.index)
ax.legend(fontsize=10)

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'residual_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

print('Residual summary:')
print(f'  RE Tobit  -- Mean: {residuals_re.mean():.4f}, SD: {residuals_re.std():.4f}')
print(f'  Pooled    -- Mean: {residuals_pooled.mean():.4f}, SD: {residuals_pooled.std():.4f}')

*Figure: Left -- residuals versus fitted values for the RE Tobit model. Center -- residual distribution comparison between RE and Pooled Tobit. Right -- mean residual by time period; means close to zero indicate no systematic over- or under-prediction in that period.*

---

## Section 9: Summary and Key Takeaways

### What We Learned

1. **Panel structure matters for censored models**: Ignoring repeated measurements from the same individual leads to incorrect standard errors (unless they are clustered by individual) and to less efficient estimates.

2. **The Random Effects Tobit model** decomposes the error into:
   - $\alpha_i \sim N(0, \sigma^2_\alpha)$ -- time-invariant individual effect
   - $\varepsilon_{it} \sim N(0, \sigma^2_\varepsilon)$ -- idiosyncratic error

3. **Intra-class correlation** $\rho = \sigma^2_\alpha / (\sigma^2_\alpha + \sigma^2_\varepsilon)$ quantifies the importance of individual heterogeneity.

4. **Gauss-Hermite quadrature** is the numerical method used to integrate out the random effect in the likelihood.

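5. **The likelihood ratio test** of $H_0: \sigma_\alpha = 0$ provides a formal comparison between the pooled and RE specifications; because the null lies on the boundary of the parameter space, the $\chi^2_1$ p-value is halved.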

6. **Two types of predictions** are available:
   - Latent: $E[y^*|X]$ -- can be negative
   - Censored: $E[y|X]$ -- accounts for censoring, always $\geq 0$
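
The quadrature step in point 4 can be sketched with NumPy. Gauss-Hermite nodes $x_k$ and weights $w_k$ approximate $\int g(\alpha)\,\phi(\alpha; 0, \sigma^2_\alpha)\,d\alpha \approx \pi^{-1/2} \sum_k w_k\, g(\sqrt{2}\,\sigma_\alpha x_k)$. In the RE Tobit likelihood, $g(\alpha)$ would be the product over periods of one individual's Tobit contributions; here a simple test function stands in for it. The helper `gh_expectation` is hypothetical and is not taken from PanelBox.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def gh_expectation(g, sigma_alpha, n_points=12):
    """Approximate E[g(alpha)] for alpha ~ N(0, sigma_alpha^2)."""
    nodes, weights = hermgauss(n_points)           # weight function exp(-x^2)
    alpha = np.sqrt(2.0) * sigma_alpha * nodes     # change of variables
    return np.sum(weights * g(alpha)) / np.sqrt(np.pi)

# Sanity check with a known answer: E[alpha^2] equals sigma_alpha^2
approx_var = gh_expectation(lambda a: a**2, sigma_alpha=2.0)
print(approx_var)
```

Because the rule is exact for low-order polynomials, checks like this one validate the change of variables; the likelihood integrand is not polynomial, which is why the number of quadrature points matters in practice.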

### PanelBox Workflow Summary

```python
# Step 1: Prepare data
X = sm.add_constant(df[['income', 'age', 'chronic', 'insurance', 'female', 'bmi']].values)
y = df['expenditure'].values

# Step 2: Estimate RE Tobit
model = RandomEffectsTobit(
    endog=y, exog=X, groups=df['id'].values, time=df['time'].values,
    censoring_point=0.0, censoring_type='left', quadrature_points=12
)
model.fit(method='BFGS', maxiter=1000)

# Step 3: View results
print(model.summary())

# Step 4: Variance decomposition
rho = model.sigma_alpha**2 / (model.sigma_alpha**2 + model.sigma_eps**2)

# Step 5: Predictions
y_latent = model.predict(pred_type='latent')
y_censored = model.predict(pred_type='censored')
```

### Assumptions and Limitations

- **RE assumption**: $\alpha_i$ is uncorrelated with $\mathbf{X}_{it}$ in every period. If violated, consider the Honoré semiparametric estimator (see Notebook 03).
- **Normality**: Both error components are assumed normal. This is stronger than in linear RE models, where consistency does not depend on normality.
- **Strict exogeneity**: Covariates must be unrelated to the idiosyncratic errors in all periods, so they cannot respond to past values of $y_{it}$.

### What's Next?

- **Notebook 03**: Semiparametric alternatives -- the Honoré trimmed estimator for fixed effects Tobit
- **Notebook 04**: Marginal effects in censored models -- converting latent-variable coefficients to interpretable effects

---

## Section 10: Exercises

### Exercise 1: Quadrature Sensitivity

Re-estimate the Random Effects Tobit model using different numbers of quadrature points: 6, 12, and 24. Compare the estimated coefficients, $\sigma_\alpha$, $\sigma_\varepsilon$, and log-likelihood. At what point do the results stabilize?

### Exercise 2: Subsample Analysis by Gender

Estimate separate RE Tobit models for males and females. Compare:
- Do the key coefficients (income, chronic conditions) differ across genders?
- Is the intra-class correlation $\rho$ different for males vs females?
- What does this suggest about gender-specific health expenditure dynamics?

### Exercise 3: Marginal Effects

Using the fitted RE Tobit model, compute the marginal effect of adding one chronic condition on:
- The latent expenditure $y^*$
- The expected observed expenditure $E[y|X]$
- The probability of positive expenditure $P(y > 0|X)$

Hint: Use `re_tobit.marginal_effects(at='overall', which='unconditional')` and compare with `which='conditional'`.
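
As a cross-check on the library output, the three effects have closed forms in a standard Tobit left-censored at zero: $\beta_j$ for the latent variable, $\beta_j \Phi(z)$ for $E[y|X]$, and $\beta_j \phi(z)/\sigma$ for $P(y > 0|X)$, with $z = \mathbf{X}'\boldsymbol{\beta}/\sigma$. The sketch below evaluates them at a single point. The function name and all input values are hypothetical, $\sigma$ is assumed to be the total standard deviation, and the derivative is only an approximation for a count regressor such as the number of chronic conditions.

```python
import numpy as np
from scipy.stats import norm

def tobit_marginal_effects(beta_j, xb, sigma):
    """Closed-form Tobit marginal effects of regressor j at latent index xb."""
    z = xb / sigma
    return {
        'latent': beta_j,                             # effect on E[y*|X]
        'unconditional': beta_j * norm.cdf(z),        # effect on E[y|X]
        'probability': beta_j * norm.pdf(z) / sigma,  # effect on P(y>0|X)
    }

# Illustrative inputs only (not estimates from this notebook)
me_demo = tobit_marginal_effects(beta_j=0.8, xb=1.0, sigma=2.0)
for name, value in me_demo.items():
    print(f'{name:15s} {value:.4f}')
```

The unconditional effect is the latent coefficient scaled by the probability of being uncensored, so it is always smaller in magnitude than $\beta_j$.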

### Exercise 4: Prediction Performance

Split the data into training (periods 1-3) and test (period 4) sets. Estimate both Pooled and RE Tobit on the training data. Compare prediction accuracy (RMSE, MAE) on the test set. Does the RE model generalize better to out-of-sample data?
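
The two accuracy metrics can be computed with a small helper like the one below. The name `rmse_mae` and the toy arrays are assumptions for illustration; in the exercise the inputs would be the period-4 outcomes and the corresponding censored predictions.

```python
import numpy as np

def rmse_mae(y_true, y_pred):
    """Return (RMSE, MAE) for two equal-length arrays."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean(err**2)), np.mean(np.abs(err))

# Toy example: errors are -1, 0 and 2
rmse_demo, mae_demo = rmse_mae([0.0, 2.0, 5.0], [1.0, 2.0, 3.0])
print(f'RMSE = {rmse_demo:.3f}, MAE = {mae_demo:.3f}')
```

RMSE penalizes large misses more heavily than MAE, so reporting both helps show whether one model's advantage comes from a few extreme observations.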

In [None]:
# ============================================================
# Exercise Solutions (template)
# ============================================================

# Exercise 1: Quadrature sensitivity
# Hint:
# for nq in [6, 12, 24]:
#     model_q = RandomEffectsTobit(
#         endog=y, exog=X, groups=groups, time=time,
#         censoring_point=0.0, quadrature_points=nq
#     )
#     model_q.fit(method='BFGS', maxiter=1000)
#     print(f'Q={nq}: LL={model_q.llf:.2f}, sigma_alpha={model_q.sigma_alpha:.4f}')

# Exercise 2: Subsample by gender
# Hint:
# mask_female = df['female'].values == 1
# model_f = RandomEffectsTobit(
#     endog=y[mask_female], exog=X[mask_female],
#     groups=groups[mask_female], time=time[mask_female],
#     censoring_point=0.0
# )
# model_f.fit(method='BFGS', maxiter=1000)
# rho_f = model_f.sigma_alpha**2 / (model_f.sigma_alpha**2 + model_f.sigma_eps**2)

# Exercise 3: Marginal effects
# Hint:
# me_uncond = re_tobit.marginal_effects(at='overall', which='unconditional')
# me_cond = re_tobit.marginal_effects(at='overall', which='conditional')
# print(me_uncond.summary())
# print(me_cond.summary())

# Exercise 4: Train/test split
# Hint:
# train_mask = df['time'].values <= 3
# test_mask = df['time'].values == 4
# model_train = RandomEffectsTobit(
#     endog=y[train_mask], exog=X[train_mask],
#     groups=groups[train_mask], time=time[train_mask],
#     censoring_point=0.0
# )
# model_train.fit(method='BFGS', maxiter=1000)
# y_test_pred = model_train.predict(exog=X[test_mask], pred_type='censored')
# rmse_test = np.sqrt(np.mean((y[test_mask] - y_test_pred)**2))

print('Complete the exercises above to deepen your understanding!')
print('Solutions are available in the solutions directory.')