Notebook comparing the results of a segmented regression with those of separate regressions on each subsample. Relevant Cross Validated (stats.stackexchange.com) answers:

 - [Link 1](https://stats.stackexchange.com/a/13115/162538)
 - [Link 2](https://stats.stackexchange.com/a/12809/162538)
 - [Link 3](https://stats.stackexchange.com/a/468666/162538)

In [1]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import warnings
from copy import deepcopy
#warnings.filterwarnings('ignore')
np.random.seed(5000)

The true data generating process:

$$
y_t=\begin{cases}
    b_1 x_t + u_t, & t \leq 150\\
    b_2 x_t + u_t, & t > 150
\end{cases},
$$

where $t = 1, \dots , T$ with $T=300$, and $u_{t} \sim N(\mu_{u}, \sigma_{u}^2)$ is a normally distributed noise term with mean $\mu_{u}$ and standard deviation $\sigma_{u}$. The exogenous variable is drawn as $x_{t} \sim N(0, 1)$.

In [2]:
# Parameters
T = 300
periods = list(np.arange(1, T+1))
mu_u = 0
sigma_u = 1
b_1 = 1
b_2 = 3

# Generate random data
df = pd.DataFrame(np.random.normal(mu_u, sigma_u, [T, 1]), columns=["u"], index=periods)
df.index.names = ["time"]
df["x"] = np.random.normal(0, 1, T)  # 1-D draw, one value per period
df["y"] = np.where(df.index<=150, b_1 * df["x"] + df["u"], b_2 * df["x"] + df["u"])

# Add post dummy and interact with x
df["post"] = np.where(df.index<=150, 0, 1)
df["x_post"] = df["post"] * df["x"]

Regressions:

 1. Segmented regression: $y_t = \gamma_1 x_t + \gamma_2 x_t Post_t + \epsilon_{1,t}$
 2. Pre-sample regression: $y_t = \beta_1 x_t + \epsilon_{2,t}$
 3. Post-sample regression: $y_t = \beta_2 x_t + \epsilon_{3,t}$
 
where $Post_t = 0$ when $t\leq 150$, otherwise 1.
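
The link between the three specifications can be checked numerically before looking at the full summaries. The block below is a minimal standalone sketch, not the notebook's own code: it uses `np.linalg.lstsq` instead of statsmodels, an arbitrary seed, and the hypothetical names `rng`, `split`, and `ols`. Under these assumptions, the segmented fit should reproduce the separate fits, with $\hat\gamma_1 = \hat\beta_1$ and $\hat\gamma_1 + \hat\gamma_2 = \hat\beta_2$.

```python
import numpy as np

# Assumed setup: same DGP as above, but an arbitrary seed (not the notebook's)
rng = np.random.default_rng(0)
T, split = 300, 150
b_1, b_2 = 1.0, 3.0

t = np.arange(1, T + 1)
x = rng.normal(0.0, 1.0, T)
u = rng.normal(0.0, 1.0, T)
post = (t > split).astype(float)
y = np.where(t <= split, b_1 * x, b_2 * x) + u


def ols(X, y):
    """Least-squares coefficients for y = X @ beta + error (no intercept added)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


# Segmented regression: regressors x and x * Post
gamma = ols(np.column_stack([x, x * post]), y)

# Separate regressions on the pre and post subsamples
beta_pre = ols(x[post == 0][:, None], y[post == 0])[0]
beta_post = ols(x[post == 1][:, None], y[post == 1])[0]

print(f"gamma_1           = {gamma[0]:.5f}, beta_1 = {beta_pre:.5f}")
print(f"gamma_1 + gamma_2 = {gamma.sum():.5f}, beta_2 = {beta_post:.5f}")
```

The point estimates coincide because the interaction term lets the slope differ freely across the two subsamples. The standard errors need not coincide, since the segmented model pools the residual variance.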

In [3]:
regs = {
    "segmented": {
        "formula": "y ~ -1 + x + x_post",
        "data": df.copy(),
    },
    "pre": {
        "formula": "y ~ -1 + x",
        "data": df[df.index <= 150].copy(),
    },
    "post": {
        "formula": "y ~ -1 + x",
        "data": df[df.index > 150].copy(),
    },
}

# Fit each specification by OLS and store the results object alongside it
for spec in regs.values():
    spec["res"] = sm.OLS.from_formula(spec["formula"], data=spec["data"]).fit()

In [4]:
print(regs["segmented"]["res"].summary())
print("")
print("-"*40)
print("Separate t-test for x + x_post = 0: ")
print(regs["segmented"]["res"].t_test("x + x_post = 0"))

                                 OLS Regression Results                                
Dep. Variable:                      y   R-squared (uncentered):                   0.806
Model:                            OLS   Adj. R-squared (uncentered):              0.804
Method:                 Least Squares   F-statistic:                              617.6
Date:                Sun, 02 Jan 2022   Prob (F-statistic):                   1.01e-106
Time:                        13:40:47   Log-Likelihood:                         -444.76
No. Observations:                 300   AIC:                                      893.5
Df Residuals:                     298   BIC:                                      900.9
Df Model:                           2                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------
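
The `t_test("x + x_post = 0")` call tests a linear restriction on the segmented model's coefficients. As a rough sketch of what such a test computes under the default nonrobust covariance, the block below rebuilds the statistic by hand with NumPy. It is standalone and uses an arbitrary seed, so its numbers will not match the output above; the names `c`, `sigma2_hat`, and `se_closed_form` are illustrative only.

```python
import numpy as np

# Assumed setup: same DGP as the notebook, arbitrary seed
rng = np.random.default_rng(0)
T, split = 300, 150
t = np.arange(1, T + 1)
x = rng.normal(0.0, 1.0, T)
post = (t > split).astype(float)
y = np.where(t <= split, 1.0 * x, 3.0 * x) + rng.normal(0.0, 1.0, T)

# Segmented regression by the normal equations
X = np.column_stack([x, x * post])
XtX_inv = np.linalg.inv(X.T @ X)
gamma = XtX_inv @ X.T @ y
resid = y - X @ gamma
sigma2_hat = resid @ resid / (T - X.shape[1])  # residual variance, df = T - 2
cov = sigma2_hat * XtX_inv                     # nonrobust coefficient covariance

# Restriction gamma_1 + gamma_2 = 0, written as c @ gamma = 0
c = np.array([1.0, 1.0])
estimate = c @ gamma
se = np.sqrt(c @ cov @ c)
t_stat = estimate / se

# For this design the variance of gamma_1 + gamma_2 reduces to
# sigma2_hat / sum of squared x over the post subsample
se_closed_form = np.sqrt(sigma2_hat / np.sum(x[post == 1] ** 2))

print(f"estimate={estimate:.5f}, se={se:.5f}, t={t_stat:.3f}")
print(f"closed-form se={se_closed_form:.5f}")
```

The closed form shows where this test differs from the post-sample regression: both divide by the same sum of squares, but the segmented model uses the pooled residual variance rather than the post-sample one.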

In [5]:
print(regs["pre"]["res"].summary())

                                 OLS Regression Results                                
Dep. Variable:                      y   R-squared (uncentered):                   0.491
Model:                            OLS   Adj. R-squared (uncentered):              0.487
Method:                 Least Squares   F-statistic:                              143.5
Date:                Sun, 02 Jan 2022   Prob (F-statistic):                    1.40e-23
Time:                        13:40:47   Log-Likelihood:                         -224.89
No. Observations:                 150   AIC:                                      451.8
Df Residuals:                     149   BIC:                                      454.8
Df Model:                           1                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------

In [6]:
print(regs["post"]["res"].summary())

                                 OLS Regression Results                                
Dep. Variable:                      y   R-squared (uncentered):                   0.883
Model:                            OLS   Adj. R-squared (uncentered):              0.882
Method:                 Least Squares   F-statistic:                              1125.
Date:                Sun, 02 Jan 2022   Prob (F-statistic):                    2.53e-71
Time:                        13:40:47   Log-Likelihood:                         -219.78
No. Observations:                 150   AIC:                                      441.6
Df Residuals:                     149   BIC:                                      444.6
Df Model:                           1                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------

In [7]:
# Standard error of the regression (residual standard error) for each model,
# i.e. the square root of the `scale` attribute
# https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.RegressionResults.scale.html#statsmodels.regression.linear_model.RegressionResults.scale
s = np.sqrt(regs["segmented"]["res"].scale)
s1 = np.sqrt(regs["pre"]["res"].scale)
s2 = np.sqrt(regs["post"]["res"].scale)
# Pooled check: s^2 = ((n1 - k) * s1^2 + (n2 - k) * s2^2) / (n1 + n2 - 2k)
s_check = np.sqrt(
    (
        (regs["pre"]["res"].nobs - len(regs["pre"]["res"].params)) * np.power(s1, 2)
        + (regs["post"]["res"].nobs - len(regs["post"]["res"].params)) * np.power(s2, 2)
    )
    / (regs["pre"]["res"].nobs + regs["post"]["res"].nobs - 2 * len(regs["pre"]["res"].params))
)
print("s={:.5f}, s_check={:.5f}, s1={:.5f}, s2={:.5f}".format(s, s_check, s1, s2))

# Standard error of the x_post coefficient, rebuilt from the separate regressions
se_x_post = regs["segmented"]["res"].bse["x_post"]
se_x_post_check = s * np.sqrt(
    np.power(regs["pre"]["res"].bse["x"] / s1, 2) + np.power(regs["post"]["res"].bse["x"] / s2, 2)
)
print("se_x_post={:.5f}, se_x_post_check={:.5f}".format(se_x_post, se_x_post_check))

s=1.06922, s_check=1.06922, s1=1.08730, s2=1.05083
se_x_post=0.12406, se_x_post_check=0.12406
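
Why both checks agree can be sketched with a short derivation. It assumes the setup used here (no intercept, a single regressor, nonrobust covariance). The symbols $S_1 = \sum_{t \leq 150} x_t^2$, $S_2 = \sum_{t > 150} x_t^2$, $n_1$, $n_2$, $SSR_1$, and $SSR_2$ (subsample sizes and residual sums of squares) are notation introduced for this sketch only. With $X$ the segmented design matrix whose columns are $x_t$ and $x_t Post_t$:

$$
\begin{aligned}
X'X &= \begin{pmatrix} S_1 + S_2 & S_2 \\ S_2 & S_2 \end{pmatrix},
\qquad
(X'X)^{-1} = \begin{pmatrix} 1/S_1 & -1/S_1 \\ -1/S_1 & 1/S_1 + 1/S_2 \end{pmatrix}, \\
\hat\gamma_1 &= \hat\beta_1, \qquad \hat\gamma_2 = \hat\beta_2 - \hat\beta_1, \\
s^2 &= \frac{SSR_1 + SSR_2}{n_1 + n_2 - 2} = \frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2}, \\
se(\hat\gamma_2) &= s \sqrt{\frac{1}{S_1} + \frac{1}{S_2}}
 = s \sqrt{\left(\frac{se(\hat\beta_1)}{s_1}\right)^2 + \left(\frac{se(\hat\beta_2)}{s_2}\right)^2}.
\end{aligned}
$$

The third line holds because the segmented fit reproduces the fitted values of the two separate fits, so its residual sum of squares is $SSR_1 + SSR_2$. The last equality uses $se(\hat\beta_j) = s_j / \sqrt{S_j}$ for $j = 1, 2$.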
