# CEO Time Use and Firm Performance: A Topic Model Application

This notebook estimates the association between CEO time allocation and firm performance [(Bandiera et al. 2020)](https://doi.org/10.1086/705331). It illustrates how the functions `ols_bca_topic` and `ols_bcm_topic` can be used to correct bias from estimated topic model shares. The notebook reproduces results from Table 2 of [Battaglia, Christensen, Hansen & Sacher (2024)](https://arxiv.org/abs/2402.15585).

[Bandiera et al. (2020)](https://doi.org/10.1086/705331) conduct a time-use survey for a sample of CEOs. Survey responses are recorded for each 15-minute interval of a given week. The sample consists of 654 answer combinations. To reduce dimensionality, the authors fit a topic model with two topics. One topic places relatively higher mass on features associated with "management", like visiting production sites or meeting with suppliers, while the other places relatively higher mass on features associated with "leadership", like communicating with other C-suite executives and holding large, multi-function meetings.

Each CEO's leadership weight is a measure of their tendency to engage in leadership activities. One of the key results in [Bandiera et al. (2020)](https://doi.org/10.1086/705331) is a regression of log sales, a measure of firm size, on the estimated leadership weight (along with other firm controls).

In [2]:
import numpy as np
import pandas as pd
from scipy import stats

# Import the regression functions and the example data loader
from ValidMLInference import (
    ols, ols_bca_topic, topic_model_data
)

## Data

The command `topic_model_data()` loads the data we will use in the regression as well as joint estimates of the regression and topic model, as described in [Battaglia, Christensen, Hansen & Sacher (2024)](https://arxiv.org/abs/2402.15585).

In [3]:
topic_data = topic_model_data()

Z = topic_data['covars']                         # Control variables
estimation_data = topic_data['estimation_data']  # Main dataset
gamma_draws = topic_data['gamma_draws']          # MCMC draws
theta_est_full = topic_data['theta_est_full']    # Full sample topic estimates
theta_est_samp = topic_data['theta_est_samp']    # Subsample topic estimates
beta_est_full = topic_data['beta_est_full']      # Full sample topic-word distributions
beta_est_samp = topic_data['beta_est_samp']      # Subsample topic-word distributions
lda_data = topic_data['lda_data']                # Data used to fit the topic model

# Dependent variable: log sales (`ly`)
# Controls (Z): log employment, country fixed effects, and survey-wave fixed effects
Y = estimation_data['ly']
sigma_y = np.std(Y)

print(f"Sample size: {len(Y)}")
print(f"Number of control variables: {Z.shape[1]}")
print(f"Standard deviation of Y: {sigma_y:.3f}")

# Show sample of the data
sample_data = pd.DataFrame({
    'Y': Y,
    'theta_topic1': theta_est_full[:, 0],
    'control1': Z[:, 0],
    'control2': Z[:, 1]
})

print("\nSample data:")
print(sample_data.head())

Sample size: 916
Number of control variables: 11
Standard deviation of Y: 1.544

Sample data:
           Y  theta_topic1  control1  control2
0  12.352137      0.605186  1.268126       0.0
1  10.096356      0.084489  1.113297       0.0
2  14.075560      0.969039  3.227946       0.0
3  12.358381      0.288178  1.672314       0.0
4  10.530302      0.430811  0.728468       0.0


Here `theta_topic1` contains the leadership topic weight for each observation. 

## Results

We first present results for an OLS regression of log sales on the leadership topic weight and controls:

In [4]:
# Full sample OLS estimation
theta_full = theta_est_full
Xhat_full = np.column_stack([theta_full[:, 0], Z])  # First topic + controls

# Create variable names
var_names = ['topic_1'] + [f'control_{i+1}' for i in range(Z.shape[1])]

lm_full = ols(Y=Y, X=Xhat_full, se=True, intercept=True, names=var_names)

# Print summary with just the intercept and topic_1 coefficient
rows = ["Intercept", "topic_1"]
print("OLS Estimates and Confidence Intervals:")
summary = lm_full.summary()
print(summary.loc[rows])

OLS Estimates and Confidence Intervals:
           Estimate  Std. Error    z value     P>|z|      2.5%      97.5%
Intercept  9.874123    0.159194  62.025623  0.000000  9.562108  10.186138
topic_1    0.404658    0.092081   4.394608  0.000011  0.224184   0.585133
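
As a quick arithmetic check on the table above, the z value and 95% confidence interval for `topic_1` can be recomputed from the reported estimate and standard error alone. This sketch assumes the interval is the usual normal-based one (estimate plus or minus the critical value times the standard error), which is consistent with the `z value` and `P>|z|` columns but is not taken from the library's source.

```python
from statistics import NormalDist

# Values copied from the topic_1 row of the OLS output above
estimate = 0.404658
std_error = 0.092081

alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided normal critical value

z_value = estimate / std_error
ci_lower = estimate - z_crit * std_error
ci_upper = estimate + z_crit * std_error

print(f"z value: {z_value:.3f}")
print(f"95% CI: [{ci_lower:.3f}, {ci_upper:.3f}]")
```

The recomputed bounds agree with the `2.5%` and `97.5%` columns up to rounding of the inputs.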


We now compare these estimates with bias-corrected estimates from `ols_bca_topic`. This requires an estimate of κ, defined as $\sqrt{n} \times E[C_{i}^{-1}]$, where $n$ is the number of documents and $C_i$ is the number of feature counts in unstructured document $i$. The counts $C_i$ are stored in the first column of `lda_data`.

In [5]:
# Full sample bias correction
kappa = np.mean(1.0 / lda_data[:, 0]) * np.sqrt(len(lda_data))
print(f"κ: {kappa:.3f}")


κ: 0.442
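
To build intuition for κ, the following sketch computes it on made-up document lengths and shows how it responds when documents get longer. The counts here are hypothetical and unrelated to the CEO data; only the formula $\sqrt{n} \times$ mean of $1/C_i$ comes from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature counts C_i for n documents (not the CEO data)
n = 500
counts = rng.integers(low=50, high=400, size=n)

# kappa = sqrt(n) * sample mean of 1 / C_i
kappa_demo = np.sqrt(n) * np.mean(1.0 / counts)

# Doubling every document's length halves kappa
kappa_longer = np.sqrt(n) * np.mean(1.0 / (2 * counts))

print(f"kappa (original lengths): {kappa_demo:.3f}")
print(f"kappa (doubled lengths):  {kappa_longer:.3f}")
```

Because κ scales inversely with document length, shorter documents (noisier topic shares) give a larger κ for a fixed number of documents.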


In addition to κ, we need to construct a matrix `S` which picks off the relevant column of `theta_full` (an `n` by `K` matrix, `K` being the number of topics, here `K = 2`) to include in the regression.

We also include the estimated topic-word distributions (a `V` by `K` matrix, `V` being the number of features in the topic model).

In [6]:
# Selection matrix to pick the first topic
S = np.array([[1.0, 0.0]])

bc_full = ols_bca_topic(
    Y=Y,
    Q=Z,                    # Control variables
    W=theta_est_full,       # Document-topic proportions
    S=S,                    # Selection matrix
    B=beta_est_full,        # Estimated topic-word distributions
    k=kappa,                # Scaling parameter
    intercept=True
)

print("Bias-Corrected Estimates and Confidence Intervals:")
summary = bc_full.summary()
print(summary.loc[rows])

Bias-Corrected Estimates and Confidence Intervals:
           Estimate  Std. Error    z value         P>|z|      2.5%      97.5%
Intercept  9.842479    0.159194  61.826851  0.000000e+00  9.530464  10.154494
topic_1    0.474253    0.092081   5.150410  2.599174e-07  0.293778   0.654728
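
The role of the selection matrix `S` can be illustrated on a toy example. The sketch below assumes `S` acts on each document's share vector as a `1` by `K` linear map; that convention is an assumption for illustration, and the shares are made up.

```python
import numpy as np

# Hypothetical topic shares for 4 documents and K = 2 topics (rows sum to 1)
theta_demo = np.array([
    [0.60, 0.40],
    [0.10, 0.90],
    [0.95, 0.05],
    [0.30, 0.70],
])

S_demo = np.array([[1.0, 0.0]])  # 1 x K: keep topic 1, drop topic 2

# Applying S to every row yields an n x 1 column of topic-1 shares
selected = theta_demo @ S_demo.T

print(selected.ravel())
```

Only one of the two shares enters the regression because the shares sum to one for every document, so including both alongside an intercept would make the regressors perfectly collinear.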


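The two methods (`ols` and `ols_bca_topic`) produce similar estimates and confidence intervals: the topic_1 coefficient is 0.405 under OLS and 0.474 after bias correction, and the two intervals overlap substantially. This suggests that measurement error in the estimated topic_1 shares is small enough that it does not materially distort inference.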

To explore this further, we repeat the exercise using topic weights estimated from a 10% subsample of the data used to fit the topic model. The subsample estimates are noisier signals of the true leadership index, so we are running the same regression as before, just with a noisier value of the topic_1 weight.

The data are named as before, with a `_samp` suffix.

In [7]:
# 10% Subsample OLS estimation
theta_samp = theta_est_samp
Xhat_samp = np.column_stack([theta_samp[:, 0], Z])

lm_samp = ols(Y=Y, X=Xhat_samp, se=True, intercept=True, names=var_names)

print("10% Subsample: OLS Estimates and Confidence Intervals:")
summary = lm_samp.summary()
print(summary.loc[rows])

10% Subsample: OLS Estimates and Confidence Intervals:
           Estimate  Std. Error    z value     P>|z|      2.5%      97.5%
Intercept  9.940524    0.170793  58.202244  0.000000  9.605776  10.275272
topic_1    0.226714    0.135119   1.677886  0.093369 -0.038114   0.491541


Evidently, increasing the measurement error in the estimated topic weights reduces the estimated slope coefficient by roughly 44%, from 0.405 to 0.227. Moreover, the OLS confidence interval for the slope coefficient now includes zero.
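
The attenuation seen in the subsample OLS estimate can be reproduced in a small simulation. This is a deliberately simplified stand-in, not the topic model used in the notebook: each "document" reveals a true share through a handful of binary draws, and the share estimated from those draws replaces the true share in the regression. All names and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
true_slope = 0.5

# True share in [0, 1] and an outcome that depends on it
theta_true = rng.uniform(0.0, 1.0, size=n)
y = 1.0 + true_slope * theta_true + rng.normal(0.0, 0.5, size=n)

def ols_slope(x, y):
    """Slope from a simple regression of y on x with an intercept."""
    x_c = x - x.mean()
    return float(x_c @ (y - y.mean()) / (x_c @ x_c))

def noisy_share(theta, n_features):
    """Estimate each share from n_features binary draws (a stand-in for document length)."""
    return rng.binomial(n_features, theta) / n_features

slope_long = ols_slope(noisy_share(theta_true, 200), y)  # long documents, little noise
slope_short = ols_slope(noisy_share(theta_true, 5), y)   # short documents, much noise

print(f"slope with long documents:  {slope_long:.3f}")
print(f"slope with short documents: {slope_short:.3f}")
```

With long documents the slope stays close to the true value, while with short documents it is pulled toward zero, which is the same direction of bias as in the 10% subsample regression.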

We now compare with bias correction:

In [8]:
# 10% Subsample bias correction
kappa_samp = np.mean(1.0 / lda_data[:, 1]) * np.sqrt(len(lda_data))

print(f"Kappa (subsample): {kappa_samp:.3f}")

bc_samp = ols_bca_topic(
    Y=Y,
    Q=Z,
    W=theta_est_samp,
    S=S,
    B=beta_est_samp,
    k=kappa_samp,
    intercept=True
)

print("10% Subsample: Bias-Corrected Estimates and Confidence Intervals:")
summary = bc_samp.summary()
print(summary.loc[rows])

Kappa (subsample): 4.262
10% Subsample: Bias-Corrected Estimates and Confidence Intervals:
           Estimate  Std. Error    z value         P>|z|      2.5%     97.5%
Intercept  9.511539    0.170793  55.690510  0.000000e+00  9.176791  9.846286
topic_1    1.053774    0.135119   7.798883  6.217249e-15  0.788946  1.318602


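Performing the bias correction results in a much larger estimated effect size: 1.054, compared with 0.227 from uncorrected OLS on the same subsample and 0.474 from the bias-corrected full-sample regression.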

Next, we tabulate results from the joint estimation performed by [Battaglia, Christensen, Hansen & Sacher (2024)](https://arxiv.org/abs/2402.15585):

In [9]:
# Joint estimation using MCMC draws (scaled by dependent variable standard deviation)
gamma_scaled = gamma_draws * sigma_y
gamma_hat_1 = np.mean(gamma_scaled, axis=0)

# Calculate empirical confidence intervals from MCMC draws
alpha = 0.05
ci_lower_1 = np.percentile(gamma_scaled, 100 * alpha/2, axis=0)
ci_upper_1 = np.percentile(gamma_scaled, 100 * (1 - alpha/2), axis=0)

print("Joint Estimates and Confidence Intervals:")
print(f"Full Sample:   {gamma_hat_1[0]:.3f} [{ci_lower_1[0]:.3f}, {ci_upper_1[0]:.3f}]")
print(f"10% Subsample: {gamma_hat_1[1]:.3f} [{ci_lower_1[1]:.3f}, {ci_upper_1[1]:.3f}]")

Joint Estimates and Confidence Intervals:
Full Sample:   0.402 [0.240, 0.602]
10% Subsample: 0.439 [0.153, 0.711]
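
Percentile intervals computed from draws need not be symmetric around the point estimate, which is why the full-sample interval above extends further above 0.402 than below it. The sketch below illustrates this with hypothetical right-skewed draws; the gamma distribution is chosen only for illustration and has no connection to the notebook's `gamma_draws`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical right-skewed draws for a single coefficient (mean about 0.4)
draws = rng.gamma(shape=4.0, scale=0.1, size=4000)

alpha = 0.05
point_est = draws.mean()
lower, upper = np.percentile(draws, [100 * alpha / 2, 100 * (1 - alpha / 2)])

print(f"estimate: {point_est:.3f} [{lower:.3f}, {upper:.3f}]")
print(f"distance below: {point_est - lower:.3f}, distance above: {upper - point_est:.3f}")
```

A normal-based interval built from the same draws' standard deviation would instead be symmetric by construction.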


We see that, unlike OLS estimation, joint estimation is robust to increasing the noise in the estimated topic weight. Both samples produce a similar estimated effect size and confidence intervals that exclude zero.

Finally, we tabulate all results together:

In [10]:
# Helper function to get confidence intervals from regression results
def get_ci(result, coef_idx=0, alpha=0.05):
    coef = result.coef[coef_idx]
    se = np.sqrt(result.vcov[coef_idx, coef_idx])
    z_crit = stats.norm.ppf(1 - alpha/2)
    return coef - z_crit * se, coef + z_crit * se

# Two-step results
ci_full_lower, ci_full_upper = get_ci(lm_full, 1)  # topic1 is index 1 (after intercept)
ci_samp_lower, ci_samp_upper = get_ci(lm_samp, 1)

# Bias correction results
ci_bc_full_lower, ci_bc_full_upper = get_ci(bc_full, 1)  # topic_1 is index 1
ci_bc_samp_lower, ci_bc_samp_upper = get_ci(bc_samp, 1)

results_data = [
    {"Sample": "Full", "Method": "Two-Step", 
     "Estimate": lm_full.coef[1], "CI_Lower": ci_full_lower, "CI_Upper": ci_full_upper},
    {"Sample": "10% Subsample", "Method": "Two-Step", 
     "Estimate": lm_samp.coef[1], "CI_Lower": ci_samp_lower, "CI_Upper": ci_samp_upper},
    {"Sample": "Full", "Method": "Bias Correction", 
     "Estimate": bc_full.coef[1], "CI_Lower": ci_bc_full_lower, "CI_Upper": ci_bc_full_upper},
    {"Sample": "10% Subsample", "Method": "Bias Correction", 
     "Estimate": bc_samp.coef[1], "CI_Lower": ci_bc_samp_lower, "CI_Upper": ci_bc_samp_upper},
    {"Sample": "Full", "Method": "Joint", 
     "Estimate": gamma_hat_1[0], "CI_Lower": ci_lower_1[0], "CI_Upper": ci_upper_1[0]},
    {"Sample": "10% Subsample", "Method": "Joint", 
     "Estimate": gamma_hat_1[1], "CI_Lower": ci_lower_1[1], "CI_Upper": ci_upper_1[1]}
]

results_df = pd.DataFrame(results_data)

# Cast to float before rounding: DataFrame.round skips object-dtype columns,
# which can arise if some entries are 0-d arrays rather than plain floats
num_cols = ["Estimate", "CI_Lower", "CI_Upper"]
results_df[num_cols] = results_df[num_cols].astype(float).round(3)

print("Comparison of Methods:")
print(results_df.to_string(index=False))

Comparison of Methods:
       Sample          Method  Estimate  CI_Lower  CI_Upper
         Full        Two-Step     0.405     0.224     0.585
10% Subsample        Two-Step     0.227    -0.038     0.492
         Full Bias Correction     0.474     0.294     0.655
10% Subsample Bias Correction     1.054     0.789     1.319
         Full           Joint     0.402     0.240     0.602
10% Subsample           Joint     0.439     0.153     0.711
