### 2.4

Assume that $Y$ and $X_1, X_2$ satisfy the linear regression model
$$
y_i=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\epsilon_i,\quad i=1,2,\dots,15
$$
where the errors $\epsilon_i \sim N(0,\sigma^2)$ are independent and identically distributed.

In [26]:
import numpy as np
import pandas as pd
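# Column names are kept in Chinese because later cells index by them:
# 销量 = sales (response Y), 人数 = number of people (X1), 收入 = income (X2)
data = pd.DataFrame({
    '销量': [162, 120, 223, 131, 67, 169, 81, 192, 116, 55, 252, 232, 144, 103, 212],
    '人数': [274, 180, 375, 205, 86, 265, 98, 330, 195, 53, 430, 372, 236, 157, 370],
    '收入': [2450, 3254, 3802, 2838, 2347, 3782, 3008, 2450, 2137, 2560, 4020, 4427, 2660, 2088, 2605],
})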


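#### (1) Find the least-squares estimates of the regression coefficients $\beta_0,\beta_1,\beta_2$ and an estimate of the error variance $\sigma^2$; write down the regression equation and interpret the coefficients

Recall that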
$$
\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_{p-1})^T=(X^TX)^{-1}X^TY \\
\hat{\sigma}^2=\cfrac{SSE}{n-p}=\frac{1}{n-p}Y^T(I-H)Y \\
H=X(X^TX)^{-1}X^T
$$
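
As a brief reminder of where the first formula comes from, here is the standard least-squares derivation. It takes $X$ to be the $n \times p$ design matrix including the column of ones, which is an assumption about notation that matches the code in the next cell:
$$
Q(\beta) = (Y - X\beta)^T (Y - X\beta) \\
\frac{\partial Q}{\partial \beta} = -2X^T(Y - X\beta) = 0 \;\Longrightarrow\; X^TX\hat{\beta} = X^TY \;\Longrightarrow\; \hat{\beta} = (X^TX)^{-1}X^TY
$$
The last step assumes $X^TX$ is invertible, i.e. the columns of $X$ are linearly independent. The residual vector is then $Y - X\hat{\beta} = (I-H)Y$, and because $I-H$ is symmetric and idempotent, $SSE = Y^T(I-H)Y$.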

In [27]:
X = data[['人数', '收入']].values
# Prepend a column of ones to X for the intercept term
X_wide = np.hstack((np.ones((X.shape[0], 1)), X))
Y = data['销量'].values
H = X_wide @ np.linalg.inv(X_wide.T @ X_wide) @ X_wide.T
beta_hat = np.linalg.inv(X_wide.T @ X_wide) @ X_wide.T @ Y
I = np.eye(H.shape[0])
sigma_2_hat = Y.T @ (I - H) @ Y / (X_wide.shape[0] - X_wide.shape[1])
print(beta_hat)
print(sigma_2_hat)

[3.45261279 0.49600498 0.00919908]
4.740297129881294


$\hat{y}=3.453+0.496x_1+0.009x_2$

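$\beta_0$ (intercept): the predicted sales (`销量`) when both the number of people (`人数`) and income (`收入`) are 0. This has no practical meaning here, but the model needs the term.

$\beta_1$: holding income fixed, each additional unit in the number of people raises the predicted sales by $\hat{\beta}_1 \approx 0.496$ units.

$\beta_2$: holding the number of people fixed, each additional unit of income raises the predicted sales by $\hat{\beta}_2 \approx 0.009$ units.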
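
To make the "other conditions unchanged" reading concrete, the sketch below refits the model with `np.linalg.lstsq` (swapped in here for the explicit inverse used in the notebook) and compares predictions that differ in only one regressor. The baseline point (200 people, income 3000) and the helper `predict` are hypothetical, chosen only for illustration.

```python
import numpy as np

# Data copied from the notebook cell above.
sales = np.array([162, 120, 223, 131, 67, 169, 81, 192, 116, 55,
                  252, 232, 144, 103, 212], dtype=float)
people = np.array([274, 180, 375, 205, 86, 265, 98, 330, 195, 53,
                   430, 372, 236, 157, 370], dtype=float)
income = np.array([2450, 3254, 3802, 2838, 2347, 3782, 3008, 2450, 2137,
                   2560, 4020, 4427, 2660, 2088, 2605], dtype=float)

# Design matrix with an intercept column, then a least-squares fit.
X = np.column_stack([np.ones_like(sales), people, income])
beta, *_ = np.linalg.lstsq(X, sales, rcond=None)

def predict(n_people, inc):
    """Predicted sales from the fitted equation (hypothetical helper)."""
    return beta[0] + beta[1] * n_people + beta[2] * inc

base = predict(200, 3000)                  # hypothetical baseline point
delta_people = predict(210, 3000) - base   # +10 people, income fixed
delta_income = predict(200, 3100) - base   # +100 income, people fixed

print("beta_hat:", beta)
print("change in predicted sales for +10 people:", delta_people)
print("change in predicted sales for +100 income:", delta_income)
```

Because the model is linear, each difference equals the corresponding coefficient times the size of the change, regardless of the baseline chosen.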

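#### (2) Construct the ANOVA table, interpret the significance test of the linear regression relationship, and compute and interpret the squared multiple correlation coefficient $R^2$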

$$
\begin{array}{|c|c|c|c|c|c|}
\hline
\text{Source} & \text{df} & \text{Sum of squares (SS)} & \text{Mean square (MS)} & F & p\text{-value} \\
\hline
\text{Regression (R)} & p - 1 & SSR & MSR=\cfrac{SSR}{p - 1} & F_0=\cfrac{MSR}{MSE} & p_0 \\
\hline
\text{Error (E)} & n - p & SSE & MSE=\cfrac{SSE}{n - p} &  &  \\
\hline
\text{Total (T)} & n - 1 & SST &  &  &  \\
\hline
\end{array}
$$

$$
SST=Y^T(I-\frac{1}{n}J)Y=\sum_{i=1}^{n}(y_i - \bar{y})^2 \\
SSR=Y^T(H-\frac{1}{n}J)Y=\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 \\
SSE=Y^T(I-H)Y=\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \\
SST=SSE+SSR \\
MSR=\frac{SSR}{p-1} \\
MSE=\frac{SSE}{n-p} \\
R^2=\cfrac{SSR}{SST}=1-\cfrac{SSE}{SST},R=\sqrt{R^2} \\
p_0=P(F(p-1,n-p) \geq F_0)
$$
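
The matrix forms above can be checked numerically. The following is a minimal sketch, assuming $J$ is the $n \times n$ all-ones matrix (the formulas imply this but do not say so) and building $H$ from a reduced QR factorization, $H = QQ^T$, rather than from the explicit inverse. It verifies that $H$ is idempotent with trace $p$ and that $SST = SSR + SSE$.

```python
import numpy as np

# Data copied from the notebook cell above.
y = np.array([162, 120, 223, 131, 67, 169, 81, 192, 116, 55,
              252, 232, 144, 103, 212], dtype=float)
people = np.array([274, 180, 375, 205, 86, 265, 98, 330, 195, 53,
                   430, 372, 236, 157, 370], dtype=float)
income = np.array([2450, 3254, 3802, 2838, 2347, 3782, 3008, 2450, 2137,
                   2560, 4020, 4427, 2660, 2088, 2605], dtype=float)

X = np.column_stack([np.ones_like(y), people, income])
n, p = X.shape

# Reduced QR: Q is n x p with orthonormal columns, so X(X^T X)^{-1}X^T = Q Q^T
# whenever X has full column rank.
Q, _ = np.linalg.qr(X)
H = Q @ Q.T

I = np.eye(n)
J = np.ones((n, n))  # assumed meaning of J: the all-ones matrix

SST = y @ (I - J / n) @ y
SSR = y @ (H - J / n) @ y
SSE = y @ (I - H) @ y

print("H idempotent:", np.allclose(H @ H, H))
print("trace(H):", np.trace(H))
print("SST, SSR + SSE:", SST, SSR + SSE)
```

The trace of $H$ equals the number of estimated parameters, which is why $SSE$ carries $n - p$ degrees of freedom.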

In [28]:
SST = np.sum((Y - np.mean(Y)) ** 2)
SSR = np.sum((H @ Y - np.mean(Y)) ** 2)
SSE = np.sum((Y - H @ Y) ** 2)
R_2 = SSR / SST
MSR = SSR / (X_wide.shape[1] - 1)
MSE = SSE / (X_wide.shape[0] - X_wide.shape[1])
F = MSR / MSE
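# p-value of the F test. Note: 1 - cdf bottoms out near 1e-16 because of
# floating-point rounding; f.sf(...) evaluates the upper tail directly.
from scipy.stats import f
p_value = 1 - f.cdf(F, X_wide.shape[1] - 1, X_wide.shape[0] - X_wide.shape[1])
# Print all results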
print('SST:', SST)
print('SSR:', SSR)
print('SSE:', SSE)
print('R^2:', R_2)
print('MSR:', MSR)
print('MSE:', MSE)
print('F:', F)
print('p-value:', p_value)

SST: 53901.6
SSR: 53844.716434440685
SSE: 56.883565559127966
R^2: 0.9989446776058722
MSR: 26922.358217220342
MSE: 4.74029712992733
F: 5679.466387718415
p-value: 1.1102230246251565e-16


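The $p$-value is far below any conventional significance level (the printed 1.11e-16 is the floating-point floor of `1 - cdf`, so the true value is smaller still). We therefore reject $H_0:\beta_1=\beta_2=0$ and conclude that the linear regression relationship is significant.

The squared multiple correlation coefficient (coefficient of determination) is $R^2 \approx 0.999$: the model explains about 99.9% of the variation in sales, so the fit is very good, and the number of people and income together account for nearly all of the variation in sales.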
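
On the $p$-value printed above: 1.1102230246251565e-16 is exactly $2^{-53}$, the rounding granularity of doubles just below 1, so it reflects precision loss in `1 - f.cdf(...)` rather than the actual tail probability. A minimal sketch of a more accurate computation, using `scipy.stats.f.sf` and, as a cross-check, the closed form of the upper tail that holds when the numerator degrees of freedom equal 2. The value of `F0` is copied from the output above.

```python
from scipy.stats import f

F0 = 5679.466387718415   # F statistic copied from the notebook output
df1, df2 = 2, 12         # p - 1 and n - p for this problem

p_naive = 1 - f.cdf(F0, df1, df2)   # limited by rounding near 1.0
p_sf = f.sf(F0, df1, df2)           # survival function: upper tail directly

# For df1 == 2 only: P(F > x) = (1 + 2x/df2) ** (-df2/2)
p_closed = (1 + df1 * F0 / df2) ** (-df2 / 2)

print("1 - cdf    :", p_naive)
print("sf         :", p_sf)
print("closed form:", p_closed)
```

The conclusion of the test does not change, but the survival-function value is the one to report.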

#### (3) Find the 95% confidence intervals for $\beta_1$ and $\beta_2$

$$
H_{0k}:\beta_k=0 \quad H_{1k}:\beta_k \neq 0 \\
c_{kk} = \left[(X^TX)^{-1}\right]_{kk} \\
\cfrac{\hat{\beta}_k - \beta_k}{\sigma \sqrt{c_{kk}}} \sim N(0,1) \\
t_k = \cfrac{\hat{\beta}_k - \beta_k}{\hat{\sigma} \sqrt{c_{kk}}} \sim t(n-p) \\
s(\hat{\beta}_k) = \hat{\sigma} \sqrt{c_{kk}} \\
\hat{\beta}_k \pm t_{1-\frac{\alpha}{2}}(n-p)\,s(\hat{\beta}_k)
$$

In [31]:
from scipy.stats import t
cov_beta = sigma_2_hat * np.linalg.inv(X_wide.T @ X_wide)
standard_errors = np.sqrt(np.diag(cov_beta))
n, p = X_wide.shape
t_value = t.ppf(0.975, df=n - p)
for i, name in enumerate(['β₀', 'β₁ (人数)', 'β₂ (收入)']):
    lower = beta_hat[i] - t_value * standard_errors[i]
    upper = beta_hat[i] + t_value * standard_errors[i]
    print(f"{name} 95% confidence interval: [{lower:.4f}, {upper:.4f}]")

β₀ 95% confidence interval: [-1.8433, 8.7485]
β₁ (人数) 95% confidence interval: [0.4828, 0.5092]
β₂ (收入) 95% confidence interval: [0.0071, 0.0113]
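
The interval formula can also be wrapped in a small reusable function. The sketch below is one possible packaging, not the notebook's own code: `ols_conf_int` is a hypothetical helper that takes the design matrix, the response and a significance level `alpha`, and returns one `[lower, upper]` row per coefficient.

```python
import numpy as np
from scipy.stats import t

def ols_conf_int(X, y, alpha=0.05):
    """Two-sided (1 - alpha) confidence intervals for OLS coefficients.

    X must already contain the intercept column. Hypothetical helper.
    """
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2_hat = resid @ resid / (n - p)      # SSE / (n - p)
    c = np.diag(np.linalg.inv(X.T @ X))       # c_kk
    se = np.sqrt(sigma2_hat * c)              # s(beta_hat_k)
    half_width = t.ppf(1 - alpha / 2, df=n - p) * se
    return np.column_stack([beta - half_width, beta + half_width])

# Data copied from the notebook cell above.
y = np.array([162, 120, 223, 131, 67, 169, 81, 192, 116, 55,
              252, 232, 144, 103, 212], dtype=float)
people = np.array([274, 180, 375, 205, 86, 265, 98, 330, 195, 53,
                   430, 372, 236, 157, 370], dtype=float)
income = np.array([2450, 3254, 3802, 2838, 2347, 3782, 3008, 2450, 2137,
                   2560, 4020, 4427, 2660, 2088, 2605], dtype=float)
X = np.column_stack([np.ones_like(y), people, income])

ci95 = ols_conf_int(X, y, alpha=0.05)
ci90 = ols_conf_int(X, y, alpha=0.10)
print(ci95)
```

The intervals for $\beta_1$ and $\beta_2$ exclude 0, while the one for $\beta_0$ contains 0, so the intercept is not significantly different from 0 at the 5% level. Lowering the confidence level to 90% narrows every interval.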


#### Cross-check with statsmodels

In [None]:
import statsmodels.api as sm

X_new = data[['人数', '收入']]
X_new = sm.add_constant(X_new)  # adds the column of ones (intercept) automatically
Y_new = data['销量']

model = sm.OLS(Y_new, X_new).fit()

print(model.summary())
beta = model.params
print(f"Regression equation: y = {beta['const']:.2f} + {beta['人数']:.4f} * x1 + {beta['收入']:.4f} * x2")


                            OLS Regression Results                            
Dep. Variable:                     销量   R-squared:                       0.999
Model:                            OLS   Adj. R-squared:                  0.999
Method:                 Least Squares   F-statistic:                     5679.
Date:                Wed, 09 Apr 2025   Prob (F-statistic):           1.38e-18
Time:                        21:25:20   Log-Likelihood:                -31.281
No. Observations:                  15   AIC:                             68.56
Df Residuals:                      12   BIC:                             70.69
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          3.4526      2.431      1.420      0.1

