
# Censored & Truncated Regressions — Tobit and Truncated Normal — Python Notebook

**When to Use**  
- The outcome has a **hard boundary** (e.g., sales \>= 0, minutes watched \>= 0) and many observations pile up at the limit (**censoring**).  
- Data are **observed only above/below a threshold** (e.g., only customers with spend \> 0 appear), creating **truncation**.

**Best Application**  
- Purchase incidence or spend with many **zeros** (left‑censored at 0).  
- Time‑on‑site, dwell time, loan amounts, bids with limits.  
- Sample designs where observations below a cutoff are **not recorded** (truncation).

**When Not to Use**  
- If zeros are from **two processes** (buy vs. not buy, then how much), consider **two‑part (hurdle)** or **zero‑inflated** models.  
- If outcome is **counts**, prefer **Poisson/NegBin** or hurdle count models.

**How to Interpret Results**  
- **Tobit coefficients** relate to the **latent** outcome. Report **marginal effects** for: (i) unconditional mean of observed y, (ii) probability of being uncensored, (iii) conditional mean given y>c.  
- **OLS on censored/truncated data is biased and inconsistent** for the latent‑model coefficients; use MLE that accounts for censoring/truncation explicitly.
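
For reference, the three quantities in the first bullet have closed forms under the standard Tobit assumptions (normal, homoskedastic errors) when the regressor $x_j$ is continuous. The symbols $w$ and $\lambda(w)$ (the inverse Mills ratio) are notation introduced here for illustration; they do not appear elsewhere in the notebook.

```latex
\begin{aligned}
y^* &= x'\beta + \varepsilon,\quad \varepsilon \sim N(0,\sigma^2),\qquad
y = \max(y^*, c),\qquad
w = \frac{x'\beta - c}{\sigma},\qquad
\lambda(w) = \frac{\phi(w)}{\Phi(w)} \\[4pt]
P(y > c \mid x) &= \Phi(w),
&\quad \frac{\partial P(y > c \mid x)}{\partial x_j} &= \frac{\phi(w)}{\sigma}\,\beta_j \\[4pt]
E[y \mid x,\, y > c] &= x'\beta + \sigma\,\lambda(w),
&\quad \frac{\partial E[y \mid x,\, y > c]}{\partial x_j} &= \beta_j\,\bigl[1 - \lambda(w)\,\bigl(w + \lambda(w)\bigr)\bigr] \\[4pt]
E[y \mid x] &= c\,\bigl[1 - \Phi(w)\bigr] + \Phi(w)\,x'\beta + \sigma\,\phi(w),
&\quad \frac{\partial E[y \mid x]}{\partial x_j} &= \Phi(w)\,\beta_j
\end{aligned}
```

The last line shows that the effect on the unconditional mean is the latent coefficient scaled by the probability of being uncensored.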


In [None]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.base.model import GenericLikelihoodModel
from scipy.stats import norm

pd.set_option('display.max_columns', 120)
plt.rcParams['figure.figsize'] = (8,4)
rng = np.random.default_rng(42)


### Data: Simulate latent outcome y* and left‑censor at 0

In [None]:

n = 1500
X = np.c_[np.ones(n), rng.normal(0,1,n), rng.normal(0,1,n)]
beta_true = np.array([1.0, 2.0, -1.5])
sigma_true = 1.0

ystar = X @ beta_true + rng.normal(0, sigma_true, n)
c = 0.0  # left-censoring at 0
y_obs = np.maximum(ystar, c)

df = pd.DataFrame(X, columns=['const','x1','x2'])
df['y'] = y_obs
df['ystar'] = ystar
df['censored'] = (y_obs == c).astype(int)

df.head()


In [None]:

# Quick view: proportion censored
df['censored'].mean()
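
As a sanity check on the simulated share, the expected proportion at the limit can be worked out from the design. This is a small sketch that copies `beta_true`, `sigma_true` and `c` from the simulation cell and relies on `x1` and `x2` being independent standard normals, as they are there.

```python
import numpy as np
from scipy.stats import norm

# Design values copied from the simulation cell above.
beta_true = np.array([1.0, 2.0, -1.5])
sigma_true = 1.0
c = 0.0

# x1 and x2 are independent N(0, 1), so the latent outcome is
# y* ~ N(beta0, beta1^2 + beta2^2 + sigma^2).
mean_ystar = beta_true[0]
sd_ystar = np.sqrt(beta_true[1] ** 2 + beta_true[2] ** 2 + sigma_true ** 2)

# P(y* <= c) is the share of observations expected to sit at the limit.
expected_censored_share = norm.cdf((c - mean_ystar) / sd_ystar)
print(f"Expected censored share: {expected_censored_share:.3f}")
```

The sample proportion computed above should be close to this value, up to sampling noise.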


In [None]:

plt.hist(df['y'], bins=40)
plt.title('Observed y (left-censored at 0)')
plt.show()


### Naive OLS on Observed y (Biased under Censoring)

In [None]:

ols = sm.OLS(df['y'], df[['const','x1','x2']]).fit()
ols.params, ols.bse
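
The direction and rough size of the OLS bias can be illustrated directly. When the regressors are jointly normal, as in this simulation, the population OLS slopes on the censored outcome equal the true slopes multiplied by the probability of being uncensored; with other regressor distributions this is only a rough guide. The sketch below uses its own generator and hypothetical `*_demo` names so it does not disturb the notebook's `rng` or `df`.

```python
import numpy as np

rng_demo = np.random.default_rng(0)  # separate generator, notebook rng untouched
n_demo = 20_000
X_demo = np.c_[np.ones(n_demo),
               rng_demo.normal(size=n_demo),
               rng_demo.normal(size=n_demo)]
beta_demo = np.array([1.0, 2.0, -1.5])

# Latent outcome, then left-censoring at 0 as in the notebook
ystar_demo = X_demo @ beta_demo + rng_demo.normal(size=n_demo)
y_demo = np.maximum(ystar_demo, 0.0)

# OLS on the censored outcome
b_ols = np.linalg.lstsq(X_demo, y_demo, rcond=None)[0]

share_uncensored = (y_demo > 0).mean()
slope_ratio = b_ols[1:] / beta_demo[1:]

print("share uncensored:      ", round(share_uncensored, 3))
print("OLS slope / true slope:", slope_ratio.round(3))
```

Both ratios should sit near the uncensored share, which is why OLS slopes look shrunk toward zero.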


### Tobit (Left-Censored at 0) via MLE

In [None]:

class Tobit(GenericLikelihoodModel):
    def __init__(self, endog, exog, censor_left=0.0, **kwargs):
        super().__init__(endog, exog, **kwargs)
        self.c = censor_left

    def nloglikeobs(self, params):
        beta = params[:-1]
        sigma = params[-1]
        y = self.endog
        X = self.exog
        c = self.c

        xb = X @ beta
        z = (y - xb) / sigma

        # Indicator for censored
        cens = (y <= c + 1e-12).astype(float)

        # For uncensored: log f(y) = log[ (1/sigma) * phi(z) ]
        ll_unc = -np.log(sigma) + norm.logpdf(z)

        # For censored: log F((c - xb)/sigma)  (probability that y*<=c)
        ll_cens = norm.logcdf((c - xb)/sigma)

        ll = (1 - cens) * ll_unc + cens * ll_cens
        return -ll  # negative log-likelihood per obs

    def fit(self, start_params=None, method='bfgs', maxiter=1000, disp=False):
        if start_params is None:
            # start from OLS on the uncensored subset; sigma is estimated on its raw scale (not log(sigma))
            mask = self.endog > self.c
            beta_ols = np.linalg.lstsq(self.exog[mask], self.endog[mask], rcond=None)[0]
            sigma0 = np.std(self.endog[mask] - self.exog[mask] @ beta_ols)
            start_params = np.r_[beta_ols, max(sigma0, 0.5)]
        return super().fit(start_params=start_params, method=method, maxiter=maxiter, disp=disp)

# Fit model
endog = df['y'].values
exog = df[['const','x1','x2']].values
tobit_mod = Tobit(endog, exog, censor_left=0.0)
tobit_res = tobit_mod.fit(disp=False)
tobit_res.params, tobit_res.bse


In [None]:

print(tobit_res.summary())
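
The class above estimates `sigma` on its raw scale, so an optimizer step can in principle push it to a non-positive value. One common workaround is to optimize over `log(sigma)` instead. The following is a minimal sketch of that idea using `scipy.optimize.minimize` rather than `GenericLikelihoodModel`; the function name `tobit_negloglik` and the `*_demo` data are hypothetical, and no standard errors are computed here.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def tobit_negloglik(theta, y, X, c=0.0):
    """Mean negative log-likelihood; theta = (beta, log sigma)."""
    beta, sigma = theta[:-1], np.exp(theta[-1])  # exp keeps sigma > 0
    xb = X @ beta
    ll_unc = norm.logpdf((y - xb) / sigma) - np.log(sigma)  # uncensored density
    ll_cens = norm.logcdf((c - xb) / sigma)                 # P(y* <= c)
    return -np.where(y <= c, ll_cens, ll_unc).mean()

# Hypothetical data with the same design as the simulation above
rng_demo = np.random.default_rng(1)
n_demo = 1500
X_demo = np.column_stack([np.ones(n_demo), rng_demo.normal(size=(n_demo, 2))])
y_demo = np.maximum(
    X_demo @ np.array([1.0, 2.0, -1.5]) + rng_demo.normal(size=n_demo), 0.0
)

# Starting values: OLS on all observations and the log of the residual std
b0 = np.linalg.lstsq(X_demo, y_demo, rcond=None)[0]
s0 = np.std(y_demo - X_demo @ b0)
opt = minimize(tobit_negloglik, np.r_[b0, np.log(s0)],
               args=(y_demo, X_demo), method="BFGS")

beta_hat, sigma_hat = opt.x[:-1], np.exp(opt.x[-1])
print("beta_hat: ", beta_hat.round(3))
print("sigma_hat:", round(float(sigma_hat), 3))
```

The same reparameterization could be used inside `nloglikeobs`, at the cost of having to transform the last parameter and its standard error back to the `sigma` scale.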


### Marginal Effects (Unconditional mean of observed y)

In [None]:

# Left-censored Tobit at c, with w = (xb - c)/sigma:
#   E[y | x] = c*(1 - Phi(w)) + Phi(w)*xb + sigma*phi(w)
# Differentiating, the terms that come through w cancel, leaving the exact result
#   dE[y | x]/dx_j = Phi(w) * beta_j,
# i.e. each latent coefficient is scaled by the probability of being uncensored.
# Below, the effect is evaluated at the sample means of the regressors.

beta = tobit_res.params[:-1]
sigma = tobit_res.params[-1]
xb = df[['const','x1','x2']].mean().values @ beta
w = (xb - 0.0) / sigma

Phi = norm.cdf(w)  # P(y > c | x) at the sample means

# The 'const' row is a by-product and has no marginal-effect interpretation.
me_uncond = Phi * beta
pd.Series(me_uncond, index=['const','x1','x2']).to_frame('ME_uncond_at_means')
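
The three marginal effects listed under How to Interpret Results can also be averaged over the sample instead of being evaluated at the means. The helper below is a sketch under the standard Tobit assumptions; `tobit_marginal_effects` and the `*_demo` inputs are hypothetical names, and in the notebook one would pass `exog`, `tobit_res.params[:-1]` and `tobit_res.params[-1]` instead.

```python
import numpy as np
from scipy.stats import norm

def tobit_marginal_effects(X, beta, sigma, c=0.0):
    """Average marginal effects for a Tobit left-censored at c.

    Treats every regressor as continuous; the entry for a constant column
    is a by-product and has no marginal-effect interpretation.
    """
    xb = X @ beta
    w = (xb - c) / sigma
    Phi, phi = norm.cdf(w), norm.pdf(w)
    lam = phi / Phi  # inverse Mills ratio
    scale = {
        "prob_uncensored": phi / sigma,        # d P(y > c | x) / dx
        "cond_mean": 1.0 - lam * (w + lam),    # d E[y | x, y > c] / dx
        "uncond_mean": Phi,                    # d E[y | x] / dx
    }
    # Each effect is an observation-specific scale factor times beta
    return {name: s.mean() * beta for name, s in scale.items()}

# Hypothetical inputs for illustration
rng_demo = np.random.default_rng(3)
X_demo = np.column_stack([np.ones(500), rng_demo.normal(size=(500, 2))])
beta_demo = np.array([1.0, 2.0, -1.5])
sigma_demo = 1.0

ame = tobit_marginal_effects(X_demo, beta_demo, sigma_demo)
for name, values in ame.items():
    print(f"{name:>16}: {values.round(3)}")
```

Averaging the scale factors first and multiplying by `beta` afterwards gives the same result as averaging observation-level effects, because `beta` is constant across observations.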


### Truncated Normal Regression (observe only y>0) via MLE

In [None]:

# Build truncated sample: drop y==0 (only keep observed y>0)
df_trunc = df[df['y'] > 0].copy()

class TruncatedReg(GenericLikelihoodModel):
    def __init__(self, endog, exog, trunc_left=0.0, **kwargs):
        super().__init__(endog, exog, **kwargs)
        self.c = trunc_left

    def nloglikeobs(self, params):
        beta = params[:-1]
        sigma = params[-1]
        y = self.endog
        X = self.exog
        c = self.c
        xb = X @ beta
        z = (y - xb)/sigma
        a = (c - xb)/sigma

        # Truncated normal density: f_T(y) = f(y) / (1 - Phi(a))
        # log f_T = log f(y) - log(1 - Phi(a)); norm.logsf(a) computes log(1 - Phi(a)) stably
        ll = -np.log(sigma) + norm.logpdf(z) - norm.logsf(a)
        return -ll

    def fit(self, start_params=None, method='bfgs', maxiter=1000, disp=False):
        if start_params is None:
            beta_ols = np.linalg.lstsq(self.exog, self.endog, rcond=None)[0]
            sigma0 = np.std(self.endog - self.exog @ beta_ols)
            start_params = np.r_[beta_ols, max(sigma0, 0.5)]
        return super().fit(start_params=start_params, method=method, maxiter=maxiter, disp=disp)

# Fit truncated regression
endog_tr = df_trunc['y'].values
exog_tr  = df_trunc[['const','x1','x2']].values
tr_mod = TruncatedReg(endog_tr, exog_tr, trunc_left=0.0)
tr_res = tr_mod.fit(disp=False)
print(tr_res.summary())
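
The reason OLS fails on a truncated sample is that the mean of the retained observations is not `xb`: it is shifted up by `sigma * phi(a) / (1 - Phi(a))`, a term that varies with the regressors. The short check below compares that closed form with `scipy.stats.truncnorm` for one hypothetical observation; the values of `xb`, `sigma` and `c` are made up for illustration.

```python
import numpy as np
from scipy.stats import norm, truncnorm

xb, sigma, c = 0.5, 1.0, 0.0   # hypothetical values for one observation
a = (c - xb) / sigma           # standardized truncation point

# Closed form: E[y | y > c] = xb + sigma * phi(a) / (1 - Phi(a))
mean_closed_form = xb + sigma * norm.pdf(a) / norm.sf(a)

# Same quantity from scipy's truncated normal (bounds are in standardized units)
mean_scipy = truncnorm(a, np.inf, loc=xb, scale=sigma).mean()

print("closed form:", mean_closed_form)
print("truncnorm:  ", mean_scipy)
```

The two numbers should agree, and both exceed `xb`, which is the gap the truncated-regression likelihood accounts for.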


### Compare Estimates: True β vs OLS vs Tobit vs Truncated

In [None]:

comp = pd.DataFrame({
    'true_beta': [1.0, 2.0, -1.5],
    'OLS_censored': ols.params.values,
    'Tobit': tobit_res.params[:-1],
    'TruncatedReg': tr_res.params[:-1]
}, index=['const','x1','x2']).round(3)

comp



---

### Practical Guidance
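- Inspect the **mass at the censoring point**; the larger the censored share, the more OLS slopes are attenuated toward zero and the more the Tobit correction matters.  
- For purchase data, consider **two‑part models**: (i) logit for incidence, (ii) a model for positive spend (e.g., a Gamma GLM with log link or a log‑normal regression).  
- With **heteroskedasticity** or non‑normal errors, the classical Tobit MLE is generally inconsistent; consider **semiparametric** alternatives such as censored least absolute deviations (CLAD).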

### References (non‑link citations)
1. Greene — *Econometric Analysis*.  
2. Wooldridge — *Econometric Analysis of Cross Section and Panel Data*.  
3. Amemiya — *Tobit Models: A Survey*.
