Notebook comparing the results of a segmented regression with those of separate regressions. Some relevant Cross Validated (stats.stackexchange.com) answers:

 - [Link 1](https://stats.stackexchange.com/a/13115/162538)
 - [Link 2](https://stats.stackexchange.com/a/12809/162538)
 - [Link 3](https://stats.stackexchange.com/a/468666/162538)

In [1]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import warnings
from copy import deepcopy
#warnings.filterwarnings('ignore')
np.random.seed(5000)

The true data generating process:

$$
\begin{equation*}
  y_t=\begin{cases}
    b_1 x_t + a_1 m_t + u_t, & t \leq t^{*}\\
    b_2 x_t + a_2 m_t + u_t, & t > t^{*}
  \end{cases}
\end{equation*},
$$

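where $t = 1, \dots , T$ and $u_{t} \sim N(\mu_{u}, \sigma_{u})$ is a normally distributed noise term. The exogenous variable is $x_{t} = m_t + i_t$, with $i_{t} \sim N(-4, 10)$ and $m_t \sim \text{Exponential}(\text{scale} = 2.1)$, matching the draws in the code below; the second argument of $N(\cdot, \cdot)$ is the standard deviation.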

In [2]:
# Parameters
T = 300
tstar = 110
periods = list(np.arange(1, T+1))
mu_u = 0
sigma_u = 1
b_1 = 1
b_2 = 3
a_1 = -1.5
a_2 = -0.9

# Generate random data
df = pd.DataFrame(np.random.normal(mu_u, sigma_u, [T, 1]), columns=["u"], index=periods)
df.index.names = ["time"]
df["m"] = np.random.exponential(2.1, T)  # 1-D draw; same T values as a [T, 1] array
df["x"] = df["m"] + np.random.normal(-4, 10, T)
df["y"] = np.where(
    df.index<=tstar,
    b_1 * df["x"] + a_1 * df["m"] + df["u"],
    b_2 * df["x"] + a_2 * df["m"] + df["u"]
)

# Add post dummy and interact with x
df["post"] = np.where(df.index<=tstar, 0, 1)
df["x_post"] = df["post"] * df["x"]

Regressions:

 1. Segmented regression $y_t = \gamma_1 x_t + \gamma_2 x_t Post_t + \epsilon_{1,t}$, estimated on the full sample
 2. Pre-sample regression $y_t = \beta_1 x_t + \epsilon_{2,t}$, estimated on $t \leq t^{*}$
 3. Post-sample regression $y_t = \beta_2 x_t + \epsilon_{3,t}$, estimated on $t > t^{*}$
 
where $Post_t = 0$ when $t\leq t^{*}$, otherwise 1.

In [3]:
regs = {
    "segmented": {
        "formula": "y ~ -1 + x + x_post",
        "data": df.copy(),
    },
    "pre": {
        "formula": "y ~ -1 + x",
        "data": df[df.index <= tstar].copy(),
    },
    "post": {
        "formula": "y ~ -1 + x",
        "data": df[df.index > tstar].copy(),
    },
}

# Fit each specification and store the fitted results next to its inputs
for spec in regs.values():
    spec["res"] = sm.OLS.from_formula(spec["formula"], data=spec["data"]).fit()

In [4]:
print(regs["segmented"]["res"].summary())
print("")
print("-"*40)
print("Separate t-test for x + x_post = 0: ")
print(regs["segmented"]["res"].t_test("x + x_post = 0"))

                                 OLS Regression Results                                
Dep. Variable:                      y   R-squared (uncentered):                   0.971
Model:                            OLS   Adj. R-squared (uncentered):              0.971
Method:                 Least Squares   F-statistic:                              5074.
Date:                Sun, 02 Jan 2022   Prob (F-statistic):                   6.79e-231
Time:                        20:32:28   Log-Likelihood:                         -830.41
No. Observations:                 300   AIC:                                      1665.
Df Residuals:                     298   BIC:                                      1672.
Df Model:                           2                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------
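
The t-test of `x + x_post = 0` above can be reproduced by hand from the coefficient covariance matrix. The following is a minimal sketch using only numpy, on synthetic data with assumed parameters (slopes 1 and 3, no `m` term, an illustrative seed), so its numbers will not match the notebook's output. It forms the restriction vector $R = [1, 1]$ and computes $t = R\hat{\gamma} / \sqrt{R \hat{V} R'}$.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed, not the notebook's

# Synthetic stand-in for the notebook's data (assumed values, simplified DGP)
n, n_pre = 300, 110
x = rng.normal(-4, 10, n)
post = (np.arange(1, n + 1) > n_pre).astype(float)
y = np.where(post == 0, 1.0 * x, 3.0 * x) + rng.normal(0, 1, n)

# Segmented regression without intercept: columns are x and x * post
Z = np.column_stack([x, x * post])
ZtZ_inv = np.linalg.inv(Z.T @ Z)
gamma_hat = ZtZ_inv @ Z.T @ y

# Residual variance with n - 2 degrees of freedom, then coefficient covariance
resid = y - Z @ gamma_hat
s2 = resid @ resid / (n - Z.shape[1])
cov = s2 * ZtZ_inv

# Linear restriction gamma_1 + gamma_2 = 0, i.e. R = [1, 1]
R = np.array([1.0, 1.0])
estimate = R @ gamma_hat
std_err = np.sqrt(R @ cov @ R)
t_stat = estimate / std_err

print(f"gamma_1 + gamma_2 = {estimate:.4f}, se = {std_err:.4f}, t = {t_stat:.2f}")
```

The sum $\hat{\gamma}_1 + \hat{\gamma}_2$ is the implied post-period slope, so this test asks whether that slope differs from zero.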

In [5]:
print(regs["pre"]["res"].summary())

                                 OLS Regression Results                                
Dep. Variable:                      y   R-squared (uncentered):                   0.728
Model:                            OLS   Adj. R-squared (uncentered):              0.726
Method:                 Least Squares   F-statistic:                              291.9
Date:                Sun, 02 Jan 2022   Prob (F-statistic):                    1.33e-32
Time:                        20:32:28   Log-Likelihood:                         -336.98
No. Observations:                 110   AIC:                                      676.0
Df Residuals:                     109   BIC:                                      678.7
Df Model:                           1                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------

In [6]:
print(regs["post"]["res"].summary())

                                 OLS Regression Results                                
Dep. Variable:                      y   R-squared (uncentered):                   0.990
Model:                            OLS   Adj. R-squared (uncentered):              0.990
Method:                 Least Squares   F-statistic:                          1.805e+04
Date:                Sun, 02 Jan 2022   Prob (F-statistic):                   1.69e-189
Time:                        20:32:28   Log-Likelihood:                         -466.28
No. Observations:                 190   AIC:                                      934.6
Df Residuals:                     189   BIC:                                      937.8
Df Model:                           1                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------

In [7]:
# Regression model standard errors
# https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.RegressionResults.scale.html#statsmodels.regression.linear_model.RegressionResults.scale
s = np.sqrt(regs["segmented"]["res"].scale)
s1 = np.sqrt(regs["pre"]["res"].scale)
s2 = np.sqrt(regs["post"]["res"].scale)
s_check = np.sqrt(
    (
        (regs["pre"]["res"].nobs - len(regs["pre"]["res"].params)) * np.power(s1, 2) + \
        (regs["post"]["res"].nobs - len(regs["post"]["res"].params)) * np.power(s2, 2)
    ) / \
    (regs["pre"]["res"].nobs + regs["post"]["res"].nobs - 2*len(regs["pre"]["res"].params))
)
print("s={:.5f}, s_check={:.5f}, s1={:.5f}, s2={:.5f}".format(s, s_check, s1, s2))

# Standard errors for estimate x_post
se_x_post = regs["segmented"]["res"].bse["x_post"]
se_x_post_check = s * np.sqrt(
    np.power(regs["pre"]["res"].bse["x"] / s1, 2) + np.power(regs["post"]["res"].bse["x"] / s2, 2)
)
print("se_x_post={:.5f}, se_x_post_check={:.5f}".format(se_x_post, se_x_post_check))

s=3.86690, s_check=3.86690, s1=5.20203, s2=2.82308
se_x_post=0.04919, se_x_post_check=0.04919
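
The two checks above rest on two identities: the segmented residual variance is the degrees-of-freedom-weighted pool of the subsample variances, and the variance of the interaction coefficient is the sum of the two subsample slope variances, both evaluated at the pooled $s^2$. The sketch below re-derives them with plain numpy on synthetic data. The seed, slopes, and the helper `ols_no_intercept` are assumptions for illustration and are not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(1)  # illustrative seed, not the notebook's

# Simplified synthetic data (assumed slopes 1 and 3, no m term)
n, n_pre = 300, 110
x = rng.normal(-4, 10, n)
is_post = np.arange(1, n + 1) > n_pre
y = np.where(is_post, 3.0 * x, 1.0 * x) + rng.normal(0, 1, n)


def ols_no_intercept(X, y):
    """Hypothetical helper: coefficients, residual variance, and (X'X)^{-1}."""
    X = X.reshape(len(y), -1)
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - X.shape[1])
    return beta, s2, XtX_inv


# Segmented fit on the full sample, then separate fits on each subsample
Z = np.column_stack([x, x * is_post])
_, s2_seg, ZtZ_inv = ols_no_intercept(Z, y)
_, s2_pre, inv_pre = ols_no_intercept(x[~is_post], y[~is_post])
_, s2_post, inv_post = ols_no_intercept(x[is_post], y[is_post])

# Pooled variance: degrees-of-freedom-weighted average of subsample variances
df_pre, df_post = n_pre - 1, (n - n_pre) - 1
s2_pooled = (df_pre * s2_pre + df_post * s2_post) / (df_pre + df_post)

# Standard error of the interaction coefficient, computed two ways
se_direct = np.sqrt(s2_seg * ZtZ_inv[1, 1])
se_from_parts = np.sqrt(s2_seg * (inv_pre[0, 0] + inv_post[0, 0]))

print(f"s2_seg={s2_seg:.5f}, s2_pooled={s2_pooled:.5f}")
print(f"se_direct={se_direct:.5f}, se_from_parts={se_from_parts:.5f}")
```

Dividing each subsample standard error by its own $s_1$ or $s_2$, as the cell above does, recovers the same $(X'X)^{-1}$ terms that `inv_pre` and `inv_post` hold here.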


The formulas below show what OLS computes under the hood. The segmented regression equation is

$y_t = \gamma_1 x_t + \gamma_2 x_t Post_t + \epsilon_{1,t}$

Define $z_t \equiv [x_t \ , \ x_t Post_t]$ and $A \equiv \begin{bmatrix} \gamma_1 \\ \gamma_2 \end{bmatrix} $ and rewrite

$y_t = z_t A + \epsilon_{1,t}$

Now stack the observations along the time dimension: define a $T \times 1$ matrix $Y \equiv \begin{bmatrix} y_1 \\ \vdots \\ y_T \end{bmatrix} $, a $T \times 2$ matrix $Z \equiv \begin{bmatrix} z_1 \\ \vdots \\ z_T \end{bmatrix} $ and a $T \times 1$ matrix $E_1 \equiv \begin{bmatrix} \epsilon_{1,1} \\ \vdots \\ \epsilon_{1,T} \end{bmatrix} $. The regression equation can be written as

$Y = Z A + E_1$

The least-squares estimate from the regression above is given by

$$
\begin{align}
\hat{A} = \begin{bmatrix} \hat{\gamma}_1 \\ \hat{\gamma}_2 \end{bmatrix} &= (Z'Z)^{-1} (Z'Y) \\[6pt]
    &= \Big(\begin{bmatrix} x_1 & \cdots & x_T \\ x_1 Post_1 & \cdots & x_T Post_T \end{bmatrix}
           \begin{bmatrix} x_1 & x_1 Post_1 \\ \vdots & \vdots \\ x_T & x_T Post_T \end{bmatrix}\Big)^{-1}
    \Big(\begin{bmatrix} x_1 & \cdots & x_T \\ x_1 Post_1 & \cdots & x_T Post_T \end{bmatrix}
           \begin{bmatrix} y_1 \\ \vdots \\ y_T \end{bmatrix}\Big) \\[6pt]
    &= \begin{bmatrix} x_1^2 + \dots + x_T^2 & x_1^2 Post_1 + \dots + x_T^2 Post_T \\ x_1^2 Post_1 + \dots + x_T^2 Post_T & x_1^2 Post_1^2 + \dots + x_T^2 Post_T^2 \end{bmatrix}^{-1}
    \begin{bmatrix} x_1 y_1 + \cdots + x_T y_T \\ x_1 Post_1 y_1 + \cdots + x_T Post_T y_T \end{bmatrix} \\[6pt]
    &= \begin{bmatrix} x_1^2 + \dots + x_T^2 \ (=a) & x_1^2 Post_1 + \dots + x_{t^*}^2 Post_{t^*} + \dots + x_T^2 Post_T \ (=b) \\ x_1^2 Post_1 + \dots + x_{t^*}^2 Post_{t^*} + \dots + x_T^2 Post_T \ (=c) & x_1^2 Post_1^2 + \dots + x_{t^*}^2 Post_{t^*}^2 + \dots + x_T^2 Post_T^2 \ (=d) \end{bmatrix}^{-1}
    \begin{bmatrix} x_1 y_1 + \cdots + x_T y_T \\ x_1 Post_1 y_1 + \dots + x_{t^*} Post_{t^*} y_{t^*} + \cdots + x_T Post_T y_T \end{bmatrix} \\[6pt]
    &= \frac{1}{ad-bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}
    \begin{bmatrix} x_1 y_1 + \cdots + x_T y_T \\ x_1 Post_1 y_1 + \dots + x_{t^*} Post_{t^*} y_{t^*} + \cdots + x_T Post_T y_T \end{bmatrix}
\end{align}
$$
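
The matrix steps above can be traced numerically. This is a small sketch on a deliberately tiny synthetic sample (the sample size, break point, slopes, and seed are all assumed for illustration): it builds $Z$, forms $Z'Z$ and $Z'Y$, applies the closed-form $2 \times 2$ inverse with entries $a, b, c, d$, and cross-checks the result against numpy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(2)  # illustrative seed

# Tiny synthetic sample (assumed values) so the matrices are easy to inspect
T, tstar = 12, 5
x = rng.normal(size=T)
post = (np.arange(1, T + 1) > tstar).astype(float)
y = 1.0 * x + 2.0 * x * post + rng.normal(scale=0.1, size=T)

# Z stacks the rows z_t = [x_t, x_t * Post_t]
Z = np.column_stack([x, x * post])
ZtZ = Z.T @ Z   # 2 x 2 matrix with entries a, b, c, d
ZtY = Z.T @ y   # 2-vector of cross products with y

# Closed-form inverse of a 2 x 2 matrix, as in the last line of the derivation
(a, b), (c, d) = ZtZ
ZtZ_inv = np.array([[d, -b], [-c, a]]) / (a * d - b * c)
A_hat = ZtZ_inv @ ZtY

# Cross-check against numpy's least-squares solver
A_lstsq, *_ = np.linalg.lstsq(Z, y, rcond=None)

print("closed form :", A_hat)
print("lstsq       :", A_lstsq)
```

In practice a solver such as `np.linalg.lstsq` is preferable to forming the inverse explicitly; the explicit inverse is used here only to mirror the algebra.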

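At first glance the extra terms suggest that the point estimate $\hat{\gamma}_1$ need not equal the pre-sample estimate $\hat{\beta}_1$. However, because $Post_t \in \{0, 1\}$ we have $Post_t^2 = Post_t$, so $b = c = d = \sum_{t > t^{*}} x_t^2$ and $ad - bc = d(a - d)$, where $a - d = \sum_{t \leq t^{*}} x_t^2$. The first row then gives

$$
\hat{\gamma}_1 = \frac{d \sum_{t=1}^{T} x_t y_t - b \sum_{t > t^{*}} x_t y_t}{d(a - d)} = \frac{\sum_{t \leq t^{*}} x_t y_t}{\sum_{t \leq t^{*}} x_t^2} = \hat{\beta}_1,
$$

which is exactly the no-intercept OLS estimate from the pre-sample regression. This agrees with the examples above, where the two estimates coincide.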
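
One way to probe the relationship between $\hat{\gamma}_1$ and $\hat{\beta}_1$ directly is to compare the segmented estimates with the ratio-of-sums formulas that the separate no-intercept regressions reduce to. The sketch below does this on synthetic data with assumed parameters (slopes 1 and 3, no `m` term, an illustrative seed); it is a numerical check, not the notebook's own computation.

```python
import numpy as np

rng = np.random.default_rng(3)  # illustrative seed

# Simplified synthetic data (assumed values, not the notebook's draws)
T, tstar = 300, 110
x = rng.normal(-4, 10, T)
post = (np.arange(1, T + 1) > tstar).astype(float)
y = np.where(post == 1, 3.0 * x, 1.0 * x) + rng.normal(0, 1, T)

# Segmented regression on the full sample: y ~ -1 + x + x_post
Z = np.column_stack([x, x * post])
gamma_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)

# A no-intercept regression of y on x is sum(x * y) / sum(x ** 2)
pre = post == 0
beta_pre = np.sum(x[pre] * y[pre]) / np.sum(x[pre] ** 2)
beta_post = np.sum(x[~pre] * y[~pre]) / np.sum(x[~pre] ** 2)

print(f"gamma_1           = {gamma_hat[0]:.6f}, beta_pre  = {beta_pre:.6f}")
print(f"gamma_1 + gamma_2 = {gamma_hat.sum():.6f}, beta_post = {beta_post:.6f}")
```

In this sketch $\hat{\gamma}_1$ matches the pre-sample slope and $\hat{\gamma}_1 + \hat{\gamma}_2$ matches the post-sample slope up to floating-point error, which is what one would expect when $Post_t$ is a 0/1 dummy and the off-diagonal and lower-right entries of $Z'Z$ coincide.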