In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm

In [2]:
insurance = pd.read_csv("../datasets/insurance.csv")
insurance.head()

   age     sex     bmi  children  smoker     region      charges
0   19  female    27.9         0     yes  southwest    16884.924
1   18    male   33.77         1      no  southeast    1725.5523
2   28    male    33.0         3      no  southeast     4449.462
3   33    male  22.705         0      no  northwest  21984.47061
4   32    male   28.88         0      no  northwest    3866.8552


# 1

In [3]:
X = insurance.drop(columns=['charges'])
y = insurance['charges']
X = pd.get_dummies(X, drop_first=True).astype(float)
X['age_squared'] = X['age'] ** 2
X['bmi_obese'] = (X['bmi'] >= 30).astype(float)
X['obese_smoker'] = X['bmi_obese'] * X['smoker_yes']
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                charges   R-squared:                       0.866
Model:                            OLS   Adj. R-squared:                  0.865
Method:                 Least Squares   F-statistic:                     781.7
Date:                Thu, 19 Feb 2026   Prob (F-statistic):               0.00
Time:                        12:06:46   Log-Likelihood:                -13131.
No. Observations:                1338   AIC:                         2.629e+04
Df Residuals:                    1326   BIC:                         2.635e+04
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const              134.2509   1362.751  

We use the final Lantz model from HW4. The standard errors of its coefficients,
as reported automatically by the OLS fit under the standard assumptions
(i.i.d., homoskedastic errors), are:

In [5]:
model.bse

const               1362.751128
age                   59.824155
bmi                   34.266049
children             105.883122
sex_male             244.365939
smoker_yes           439.949051
region_northwest     349.274576
region_southeast     351.635152
region_southwest     350.528454
age_squared            0.746299
bmi_obese            422.840162
obese_smoker         604.656665
dtype: float64

# 2

Residual bootstrap assumes that the model is correctly specified, that the errors are
i.i.d. and homoskedastic, and that X is fixed. However, from our previous analysis of
several different models, we know that the insurance dataset almost certainly
violates homoskedasticity. The nonlinear terms and interaction effects in our model
improve the fit of the mean, but they do not guarantee constant error variance.
Thus, we prefer the case bootstrap, which makes no homoskedasticity assumption,
does not require fixed regressors, and is robust to model misspecification.
Because this is observational insurance data, treating X as random is also more
realistic, which is a further reason to resample entire cases.
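
To make the contrast between the two resampling schemes concrete, here is a minimal sketch on synthetic data. The data, the helper `ols`, and all variable names are assumptions for illustration only; this is not the insurance model. The noise scale is deliberately made to grow with `x`, so the errors are heteroskedastic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration, not the insurance data):
# the noise scale grows with x, so the errors are heteroskedastic.
n = 500
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.5 + 0.5 * x, size=n)


def ols(X, y):
    """Return the least-squares coefficient vector."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


beta_hat = ols(X, y)
fitted = X @ beta_hat
resid = y - fitted

B = 500
case_coefs = np.empty((B, 2))
resid_coefs = np.empty((B, 2))
for b in range(B):
    # Case bootstrap: resample (x_i, y_i) rows together, so X is random.
    idx = rng.integers(0, n, size=n)
    case_coefs[b] = ols(X[idx], y[idx])

    # Residual bootstrap: keep X fixed and attach resampled residuals
    # to the fitted values, which treats the errors as exchangeable.
    y_star = fitted + rng.choice(resid, size=n, replace=True)
    resid_coefs[b] = ols(X, y_star)

case_se = case_coefs.std(axis=0, ddof=1)
resid_se = resid_coefs.std(axis=0, ddof=1)
print("case bootstrap SE:    ", case_se)
print("residual bootstrap SE:", resid_se)
```

Because the residual bootstrap shuffles residuals across all values of `x`, it cannot reproduce an error variance that depends on `x`; the case bootstrap keeps each residual paired with its own regressors.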

In [None]:
X_full = insurance.drop(columns=['charges'])
y = insurance['charges']
X_full = pd.get_dummies(X_full, drop_first=True).astype(float)
X_full['age_squared'] = X_full['age'] ** 2
X_full['bmi_obese'] = (X_full['bmi'] >= 30).astype(float)
X_full['obese_smoker'] = X_full['bmi_obese'] * X_full['smoker_yes']
X_full = sm.add_constant(X_full)

B = 1000
n = len(insurance)

boot_coefs = []

np.random.seed(0)

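for _ in range(B):
    # Case bootstrap: draw n row indices with replacement and keep
    # each response paired with its own regressors.
    sample_idx = np.random.choice(n, n, replace=True)
    X_boot = X_full.iloc[sample_idx]
    y_boot = y.iloc[sample_idx]

    # Refit the same OLS model on the resample and store its coefficients.
    boot_model = sm.OLS(y_boot, X_boot).fit()
    boot_coefs.append(boot_model.params.values)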

boot_coefs = np.array(boot_coefs)
boot_ses = pd.Series(boot_coefs.std(axis=0), index=X_full.columns)

comparison = pd.DataFrame({
    "Classical SE": model.bse,
    "Bootstrap SE": boot_ses
})

comparison

                  Classical SE  Bootstrap SE
const              1362.751128   1422.671331
age                  59.824155     63.580815
bmi                  34.266049     34.889666
children            105.883122     109.06281
sex_male            244.365939    244.995048
smoker_yes          439.949051    365.790444
region_northwest    349.274576    358.299651
region_southeast    351.635152    373.412189
region_southwest    350.528454    365.346416
age_squared           0.746299      0.782413


We see that the bootstrap standard errors are quite close to the classical standard
errors for most coefficients. Increasing the number of draws would reduce the Monte
Carlo noise in the bootstrap estimates, but it would not force them to match the
classical values; the two agree only to the extent that the classical assumptions hold.
There are notable differences for `smoker_yes` (about 366 versus 440) and
`obese_smoker`, which suggests that the classical standard errors rely on a
homoskedasticity assumption that is not fully satisfied here. This again supports
resampling cases rather than residuals.
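
Another way to check whether the classical standard errors depend on homoskedasticity is to compare them with a heteroskedasticity-robust (sandwich) estimate. The sketch below computes the HC0 version by hand with NumPy on synthetic data; the data and variable names are assumptions for illustration, not results from the insurance model. In statsmodels, a robust covariance can be requested with `sm.OLS(y, X).fit(cov_type="HC3")`.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic heteroskedastic data (illustrative assumption only).
n = 400
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.5 + 0.5 * x, size=n)

# OLS fit via the normal equations.
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
p = X.shape[1]

# Classical covariance: sigma^2 (X'X)^{-1}, valid under homoskedasticity.
sigma2 = resid @ resid / (n - p)
classical_se = np.sqrt(np.diag(sigma2 * XtX_inv))

# HC0 sandwich covariance: (X'X)^{-1} X' diag(e_i^2) X (X'X)^{-1}.
meat = X.T @ (X * resid[:, None] ** 2)
robust_se = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("classical SE:", classical_se)
print("HC0 robust SE:", robust_se)
```

If the robust and case-bootstrap standard errors agree with each other but differ from the classical ones, that points to heteroskedasticity rather than to bootstrap noise.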

# 3

In [9]:
from sklearn.linear_model import LassoCV, Lasso
from sklearn.preprocessing import StandardScaler

X = insurance.drop(columns=['charges'])
y = insurance['charges']
X = pd.get_dummies(X, drop_first=True).astype(float)
X['age_squared'] = X['age'] ** 2
X['bmi_obese'] = (X['bmi'] >= 30).astype(float)
X['obese_smoker'] = X['bmi_obese'] * X['smoker_yes']

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

lasso_cv = LassoCV(cv=5, random_state=0).fit(X_scaled, y)
selected = X.columns[lasso_cv.coef_ != 0]

X_reduced = sm.add_constant(X[selected])
reduced_model = sm.OLS(y, X_reduced).fit()

B = 1000
n = len(y)

boot_coefs = []

np.random.seed(0)
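for _ in range(B):
    # Resample whole cases (rows) with replacement.
    sample_idx = np.random.choice(n, n, replace=True)
    X_boot = X.iloc[sample_idx]
    y_boot = y.iloc[sample_idx]

    # Re-standardize within the resample, then redo the lasso selection
    # at the penalty chosen by cross-validation on the full data.
    scaler_b = StandardScaler()
    Xb_scaled = scaler_b.fit_transform(X_boot)

    lasso_b = Lasso(alpha=lasso_cv.alpha_)
    lasso_b.fit(Xb_scaled, y_boot)
    selected_b = X.columns[lasso_b.coef_ != 0]

    # Skip a resample in which the lasso selects nothing.
    if len(selected_b) == 0:
        continue

    # Refit OLS on the variables selected in this resample.
    Xb_reduced = sm.add_constant(X_boot[selected_b])
    model_b = sm.OLS(y_boot, Xb_reduced).fit()

    # Align with the full-data reduced model: a coefficient not selected
    # here is recorded as 0, and a variable selected here but absent from
    # the full-data reduced model is dropped.
    coef_series = pd.Series(0.0, index=X_reduced.columns)
    for name in model_b.params.index:
        if name in coef_series.index:
            coef_series[name] = model_b.params[name]

    boot_coefs.append(coef_series.values)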

boot_coefs = np.array(boot_coefs)
boot_ses = pd.Series(boot_coefs.std(axis=0), index=X_reduced.columns)

print(reduced_model.summary())
print(boot_ses)

                            OLS Regression Results                            
Dep. Variable:                charges   R-squared:                       0.866
Model:                            OLS   Adj. R-squared:                  0.865
Method:                 Least Squares   F-statistic:                     860.3
Date:                Thu, 19 Feb 2026   Prob (F-statistic):               0.00
Time:                        12:26:52   Log-Likelihood:                -13131.
No. Observations:                1338   AIC:                         2.628e+04
Df Residuals:                    1327   BIC:                         2.634e+04
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const             -414.4423    920.902  

LassoCV dropped the linear `age` predictor, which was expected since it was not
statistically significant in the previous model. We see that the bootstrap SEs are
larger than the SEs reported for the reduced model. This is because variable
selection was redone in each bootstrap sample, so the bootstrap SEs account for
selection variability, which the classical SEs ignore. Again, we see large
differences in the SEs for `smoker_yes`, `bmi_obese`, and `obese_smoker`,
as expected. The lasso retains the interaction term, which indicates that it is
important, and the SE differences are consistent with errors that are not
homoskedastic. Overall, we arrive at the conclusion Lantz reaches: smokers and
obese individuals incur disproportionately higher medical charges.
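
When selection is redone in every resample, it can help to report how often each variable is selected and to separate the spread of its estimate from the effect of recording it as 0 when it is not selected (the convention used in the loop above). The sketch below uses made-up bootstrap output; the array `coefs`, the names `x1` to `x3`, and the selection probabilities are hypothetical and chosen only to illustrate the bookkeeping.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical bootstrap output (made up for illustration): B refits of three
# coefficients, with NaN marking a variable not selected in that resample.
B = 1000
names = ["x1", "x2", "x3"]
coefs = rng.normal(loc=[5.0, 1.0, 0.2], scale=[0.5, 0.4, 0.3], size=(B, 3))
selected = rng.random((B, 3)) < np.array([1.0, 0.9, 0.4])
coefs[~selected] = np.nan

# Share of resamples in which each variable was selected.
selection_freq = selected.mean(axis=0)

# Zero-fill convention: an unselected coefficient counts as exactly 0.
zero_filled_se = np.nan_to_num(coefs, nan=0.0).std(axis=0, ddof=1)

# Alternative: spread of the estimate only over resamples where it was selected.
conditional_se = np.nanstd(coefs, axis=0, ddof=1)

for j, name in enumerate(names):
    print(name, selection_freq[j], zero_filled_se[j], conditional_se[j])
```

For a variable that is always selected the two standard errors coincide; for a variable that is selected only sometimes, the zero-filled value mixes estimation noise with selection instability, so it is worth stating which convention a reported SE uses.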

**LLM Usage**: All work was done by myself in VSCode with [GitHub Copilot integration](https://code.visualstudio.com/docs/copilot/overview). The integration "provides code suggestions, explanations, and automated implementations based on natural language prompts and existing code context," and also offers autonomous coding and an in-IDE chat interface that can interact with the current codebase. Only Copilot's automatic inline suggestions were used, for LaTeX in `.tex` files and Python in `.ipynb` Jupyter notebook files.

**1**: LLM was not used in this problem.  
**2**: LLM was not used in this problem.  
**3**: LLM was consulted for advice on the variable selection process.