1. (Case 1) $Y = 2.0 + 1.0X + \epsilon,\quad X\sim \mathrm{Uniform}(-1,1),\quad \epsilon\sim 0.2\,N(0,1)$

(Case 2) $Y = 2.0 + 1.0X + \epsilon,\quad X\sim 0.95\,N(1,0.5) + 0.05\,N(5,0.2),\quad \epsilon\sim 0.2\,N(0,1)$

(Case 3) $Y = 2.0 + 1.0X + \epsilon,\quad X\sim \mathrm{Uniform}(-1,1),\quad \epsilon\sim 0.2X^2\cdot N(0,1)$

- Write a class for the four methods above.

- Do $K=1000$ simulations with sample size $n=500$ for each case, and get 95\% confidence intervals for $\beta_1=1.0$ using the four methods above. Calculate the average length of the confidence intervals and the coverage probability.

- Show all results in a table.

- Draw your conclusions.
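
The distribution of $X$ in Case 2 is written as a weighted combination of two normals. One common reading of that notation is a two-component mixture, where each observation comes from $N(1,0.5)$ with probability 0.95 and from $N(5,0.2)$ with probability 0.05. The sketch below draws from such a mixture. The function name `sample_mixture` is hypothetical, and treating the second parameter of $N(\cdot,\cdot)$ as a standard deviation is an assumption, since the problem statement does not say whether it is a variance.

```python
import numpy as np

def sample_mixture(n, rng, weights=(0.95, 0.05), means=(1.0, 5.0), sds=(0.5, 0.2)):
    """Draw n values from a two-component normal mixture.

    Assumption: the second parameter of N(., .) is a standard deviation.
    """
    # Pick a component label for every observation, then draw from that component.
    component = rng.choice(len(weights), size=n, p=weights)
    means = np.asarray(means)
    sds = np.asarray(sds)
    return rng.normal(means[component], sds[component])

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
X = sample_mixture(500, rng)
# Share of points that land near the second component (high-leverage points).
print(X.mean(), (X > 3.5).mean())
```

Under this reading, roughly 5% of the design points sit far from the bulk of the data, which is what would make Case 2 a test of sensitivity to high-leverage observations.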

2. Write code to implement the backward elimination method for model selection.

In [4]:
#Question 1:
import numpy as np
import statsmodels.api as sm
import pandas as pd
from tqdm import tqdm

class Simulation:
    def __init__(self, n=500, beta_0=2.0, beta_1=1.0):
        self.n = n
        self.beta_0 = beta_0
        self.beta_1 = beta_1

    def generate_data_case1(self):
        X = np.random.uniform(-1, 1, self.n)
        epsilon = 0.2 * np.random.normal(0, 1, self.n)
        Y = self.beta_0 + self.beta_1 * X + epsilon
        return X, Y

    def generate_data_case2(self):
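        # NOTE: this is a weighted sum of two normal draws. If Case 2 is meant as a
        # two-component mixture, each X_i should instead come from one component.
        X = 0.95 * np.random.normal(1,0.5,self.n) + 0.05 * np.random.normal(5,0.2,self.n)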
        epsilon = 0.2 * np.random.normal(0, 1, self.n)
        Y = self.beta_0 + self.beta_1 * X + epsilon
        return X, Y

    def generate_data_case3(self):
        X = np.random.uniform(-1, 1, self.n)
        epsilon = 0.2 * (X ** 2) * np.random.normal(0, 1, self.n)
        Y = self.beta_0 + self.beta_1 * X + epsilon
        return X, Y

    def fit_model(self, X, Y):
        X = sm.add_constant(X)  # Adds intercept
        model = sm.OLS(Y, X).fit()
        return model

    def simulate(self, case_func, K=1000, conf_level=0.95):
        intervals = []
        coverage_count = 0

        for i in tqdm(range(K)):
            X, Y = case_func()
            model = self.fit_model(X, Y)
            conf_int = model.conf_int(alpha=1 - conf_level)  # alpha is the type I error rate
            beta_1_ci = conf_int[1]  # Confidence interval for beta_1

            intervals.append(beta_1_ci[1] - beta_1_ci[0])  # interval length

            # Check if the true slope is within the confidence interval
            if beta_1_ci[0] <= self.beta_1 <= beta_1_ci[1]:
                coverage_count += 1

        avg_length = np.mean(intervals)
        coverage_prob = coverage_count / K

        return avg_length, coverage_prob

# Create a Simulation instance and run the simulations
sim = Simulation()

# Run 1000 simulations per case; compute average length and coverage probability
results = {}
for case_name, case_func in zip(
    ["Case 1", "Case 2", "Case 3"],
    [sim.generate_data_case1, sim.generate_data_case2, sim.generate_data_case3]
):
    avg_length, coverage_prob = sim.simulate(case_func)
    results[case_name] = [avg_length, coverage_prob]

# Show the results as a table
df_results = pd.DataFrame(results, index=["Avg Length", "Coverage Probability"]).T
print(df_results)



        Avg Length  Coverage Probability
Case 1    0.061031                 0.947
Case 2    0.074125                 0.943
Case 3    0.027234                 0.799
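
To judge whether the coverage figures in the table differ from the nominal 0.95 by more than simulation noise, one option is to compare them against the binomial standard error $\sqrt{p(1-p)/K}$ of a coverage estimate based on $K$ replications. The helper name `coverage_z_score` is hypothetical, and the three coverage values are copied from the table above.

```python
import math

def coverage_z_score(observed, nominal=0.95, K=1000):
    """Standardized gap between observed and nominal coverage."""
    # Standard error of a proportion estimated from K independent replications.
    se = math.sqrt(nominal * (1.0 - nominal) / K)
    return (observed - nominal) / se

for name, cov in [("Case 1", 0.947), ("Case 2", 0.943), ("Case 3", 0.799)]:
    print(f"{name}: z = {coverage_z_score(cov):.2f}")
```

As a rough rule, a gap within about two standard errors is compatible with nominal coverage. Cases 1 and 2 fall inside that band, while Case 3 lies far outside it.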





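The results show that Cases 1 and 2 behave similarly: both reach coverage close to the nominal 95% (0.947 and 0.943). In Case 3 the coverage for $\beta_1 = 1.0$ drops to 0.799, because the error variance depends on $X$ (heteroscedasticity), which violates the constant-variance assumption behind the standard OLS confidence interval.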
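
Since the errors in Case 3 scale with $X^2$, one possible remedy is a heteroscedasticity-consistent (sandwich) standard error. The sketch below computes the classical and the HC3 interval for the slope with plain NumPy on one simulated Case 3 sample. The function name `slope_intervals` is hypothetical, and using the normal quantile 1.96 instead of a $t$ quantile is a simplifying assumption that is reasonable at $n=500$. This is not necessarily one of the four methods the assignment refers to.

```python
import numpy as np

def slope_intervals(x, y, z=1.96):
    """Return (classical, HC3) approximate 95% intervals for the slope."""
    X = np.column_stack([np.ones_like(x), x])
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta

    # Classical covariance: sigma^2 (X'X)^{-1}, valid under constant variance.
    sigma2 = resid @ resid / (n - p)
    se_classical = np.sqrt(sigma2 * XtX_inv[1, 1])

    # HC3 sandwich covariance: squared residuals rescaled by leverage h_ii.
    leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    weights = (resid / (1.0 - leverage)) ** 2
    meat = X.T @ (X * weights[:, None])
    cov_hc3 = XtX_inv @ meat @ XtX_inv
    se_hc3 = np.sqrt(cov_hc3[1, 1])

    b1 = beta[1]
    return ((b1 - z * se_classical, b1 + z * se_classical),
            (b1 - z * se_hc3, b1 + z * se_hc3))

rng = np.random.default_rng(0)  # arbitrary seed
x = rng.uniform(-1, 1, 500)
y = 2.0 + 1.0 * x + 0.2 * x**2 * rng.normal(0, 1, 500)  # Case 3 errors
classical, robust = slope_intervals(x, y)
print("classical length:", classical[1] - classical[0])
print("HC3 length:      ", robust[1] - robust[0])
```

The robust interval comes out wider than the classical one here, which matches the under-coverage seen for Case 3. In statsmodels, a comparable interval should be available by fitting with `sm.OLS(Y, X).fit(cov_type="HC3")` and then calling `conf_int()`.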

In [7]:
# Experiment
X,Y = sim.generate_data_case1()
mymodel = sim.fit_model(X,Y)
mymodel.pvalues

array([0.00000000e+000, 1.56377829e-250])

In [9]:
#Question 2:

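def backward_elimination(X, Y, significance_level=0.05):
    # X is passed in WITHOUT an intercept column; it may be an array or a DataFrame.
    if not isinstance(X, pd.DataFrame):
        X = pd.DataFrame(np.asarray(X))
        X.columns = [f"x{i + 1}" for i in range(X.shape[1])]
    X = sm.add_constant(X)  # Adds intercept as column "const"
    model = sm.OLS(Y, X).fit()

    while True:
        # Only predictors are candidates for removal; the intercept always stays.
        p_values = model.pvalues.drop("const", errors="ignore")
        if p_values.empty or p_values.max() <= significance_level:
            break
        # Drop the predictor with the highest p-value, since it is not significant.
        max_p_var = p_values.idxmax()  # label of the largest p-value
        X = X.drop(max_p_var, axis=1)
        model = sm.OLS(Y, X).fit()  # refit on the reduced design

    return model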

# Run backward elimination on Case 1 data
X, Y = sim.generate_data_case1()
X = pd.DataFrame(X, columns=["X"])
print(X)
model = backward_elimination(X, Y)
print(model.summary())


            X
0    0.533385
1   -0.144128
2   -0.791479
3   -0.168831
4   -0.499271
..        ...
495 -0.741125
496  0.743406
497 -0.785904
498 -0.323968
499  0.281331

[500 rows x 1 columns]
                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.889
Model:                            OLS   Adj. R-squared:                  0.888
Method:                 Least Squares   F-statistic:                     3975.
Date:                Sat, 26 Oct 2024   Prob (F-statistic):          1.58e-239
Time:                        17:30:02   Log-Likelihood:                 95.397
No. Observations:                 500   AIC:                            -186.8
Df Residuals:                     498   BIC:                            -178.4
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 c

In [10]:
# Run backward elimination on Case 2 data
X, Y = sim.generate_data_case2()
#X = pd.DataFrame(X, columns=["X"])
model = backward_elimination(X, Y)
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.839
Model:                            OLS   Adj. R-squared:                  0.839
Method:                 Least Squares   F-statistic:                     2603.
Date:                Sat, 26 Oct 2024   Prob (F-statistic):          6.52e-200
Time:                        17:31:14   Log-Likelihood:                 106.93
No. Observations:                 500   AIC:                            -209.9
Df Residuals:                     498   BIC:                            -201.4
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.0188      0.025     81.571      0.0

In [11]:
# Run backward elimination on Case 3 data
X, Y = sim.generate_data_case3()
X = pd.DataFrame(X, columns=["X"])
model = backward_elimination(X, Y)
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.979
Model:                            OLS   Adj. R-squared:                  0.979
Method:                 Least Squares   F-statistic:                 2.359e+04
Date:                Sat, 26 Oct 2024   Prob (F-statistic):               0.00
Time:                        17:31:28   Log-Likelihood:                 550.28
No. Observations:                 500   AIC:                            -1097.
Df Residuals:                     498   BIC:                            -1088.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.0009      0.004    554.336      0.0

In all three cases the predictor $X$ remains in the model after backward elimination (Df Model is 1 in each summary), so no variable should be removed from the original model.
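
Each case above has a single, strongly significant predictor, so the elimination loop never removes anything. To see the procedure actually drop terms, the following minimal sketch adds pure-noise predictors to a Case 1 style model. All names here (`ols_pvalues`, `backward_eliminate`, `noise1` to `noise3`) are hypothetical. The p-values use a normal approximation to the $t$ distribution to keep the example dependency-light, and the intercept is assumed to sit in column 0 and is never dropped.

```python
import math
import numpy as np

def ols_pvalues(X, y):
    """Two-sided p-values for OLS coefficients (normal approximation)."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)
    t_stats = beta / np.sqrt(sigma2 * np.diag(XtX_inv))
    # For a standard normal Z, P(|Z| > |t|) = erfc(|t| / sqrt(2)).
    return np.array([math.erfc(abs(t) / math.sqrt(2.0)) for t in t_stats])

def backward_eliminate(X, y, names, alpha=0.05):
    """Drop the least significant predictor until all remaining have p <= alpha.

    Column 0 is assumed to be the intercept and is never dropped.
    """
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        pvals = ols_pvalues(X[:, keep], y)
        worst = 1 + int(np.argmax(pvals[1:]))  # position in keep, skipping the intercept
        if pvals[worst] <= alpha:
            break
        del keep[worst]
    return [names[j] for j in keep]

rng = np.random.default_rng(0)  # arbitrary seed
n = 500
x1 = rng.uniform(-1, 1, n)
noise = rng.normal(size=(n, 3))  # hypothetical predictors unrelated to y
y = 2.0 + 1.0 * x1 + 0.2 * rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, noise])
names = ["const", "x1", "noise1", "noise2", "noise3"]
kept = backward_eliminate(X, y, names)
print(kept)
```

Because each noise predictor has about a 5% chance of looking significant at the 0.05 level, a noise column can occasionally survive. The true predictor `x1` should always be retained.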