# A08 - Bootstrapping

**Gonzalo Cano Padilla**

Use the concepts learned in the regression and classification labs to find the standard error of the coefficients of a simple (linear/logistic) regression for the "Advertising" and "Default" datasets.

Use the bootstrap to simulate 1000 resamples of those datasets and compute the mean of the coefficients obtained by fitting a regression to each resample. Compute their standard deviation.

Compare the results obtained with the method seen in the labs against the results obtained with the bootstrap. Why might the results differ?

Add L2 regularization to the models for the Advertising dataset (optimize the hyperparameter). Use that hyperparameter value to repeat the 1000-resample experiment. Compute the standard deviation of the resulting coefficients.

In [2]:
import pandas as pd
df_adv = pd.read_csv('Advertising.csv', index_col=0)
df_def = pd.read_csv('Default.csv')

---

## Advertising dataset

In [3]:
import statsmodels.api as sm
import numpy as np

# Design matrix X = [1, TV, radio, newspaper]; the summary labels these columns const, x1, x2, x3
n = len(df_adv)
unos = np.ones([n,1])
x_tv = df_adv['TV'].values.reshape(-1,1)
x_radio = df_adv['radio'].values.reshape(-1,1)
x_newspaper = df_adv['newspaper'].values.reshape(-1,1)
x = np.hstack([unos, x_tv, x_radio, x_newspaper])

y=df_adv['sales'].values

ols=sm.OLS(y,x)
results = ols.fit()
results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.897 |
| Model: | OLS | Adj. R-squared: | 0.896 |
| Method: | Least Squares | F-statistic: | 570.3 |
| Date: | Thu, 20 Nov 2025 | Prob (F-statistic): | 1.58e-96 |
| Time: | 17:10:17 | Log-Likelihood: | -386.18 |
| No. Observations: | 200 | AIC: | 780.4 |
| Df Residuals: | 196 | BIC: | 793.6 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 2.9389 | 0.312 | 9.422 | 0.000 | 2.324 | 3.554 |
| x1 | 0.0458 | 0.001 | 32.809 | 0.000 | 0.043 | 0.049 |
| x2 | 0.1885 | 0.009 | 21.893 | 0.000 | 0.172 | 0.206 |
| x3 | -0.0010 | 0.006 | -0.177 | 0.860 | -0.013 | 0.011 |

| | | | |
|---|---|---|---|
| Omnibus: | 60.414 | Durbin-Watson: | 2.084 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 151.241 |
| Skew: | -1.327 | Prob(JB): | 1.44e-33 |
| Kurtosis: | 6.332 | Cond. No. | 454.0 |
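
The `std err` column of an OLS summary comes from the usual formula $\widehat{\mathrm{Var}}(\hat\beta) = \hat\sigma^2 (X^\top X)^{-1}$, with $\hat\sigma^2$ the residual sum of squares divided by $n - p$. The following is a minimal sketch of that calculation on synthetic data. The helper name `ols_standard_errors` and the simulated arrays `X_demo` and `y_demo` are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def ols_standard_errors(X, y):
    """Return (beta, se) for OLS; X must already contain the intercept column."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)          # unbiased residual variance
    se = np.sqrt(sigma2 * np.diag(XtX_inv))   # sqrt of the diagonal of sigma^2 (X'X)^-1
    return beta, se

# Synthetic data (assumption: stands in for a real design matrix and response)
rng_demo = np.random.default_rng(0)
n_demo = 200
X_demo = np.column_stack([np.ones(n_demo), rng_demo.normal(size=(n_demo, 2))])
y_demo = X_demo @ np.array([1.0, 2.0, -0.5]) + rng_demo.normal(scale=0.3, size=n_demo)

beta_demo, se_demo = ols_standard_errors(X_demo, y_demo)
print("Coefficients:", beta_demo)
print("Standard errors:", se_demo)
```

Applied to the `x` and `y` arrays built above, the same function should give values comparable to the `coef` and `std err` columns of the summary.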


In [10]:
from sklearn.utils import resample

n_sim = 1000
coefs_boot = []

for b in range(n_sim):
    X_boot, y_boot = resample(x, y, replace=True)
    model_b = sm.OLS(y_boot, X_boot).fit()

    coefs_boot.append(model_b.params)

coefs_boot = np.array(coefs_boot)

boot_mean = coefs_boot.mean(axis=0)  # mean of each coefficient across all resamples
boot_std = coefs_boot.std(axis=0)    # standard deviation of each coefficient across all resamples (bootstrap standard error)

print("Bootstrap mean:", boot_mean)
print("Bootstrap standard error:", boot_std)

Bootstrap mean: [ 2.95853129e+00  4.56075205e-02  1.88606782e-01 -7.79092574e-04]
Bootstrap standard error: [0.33541811 0.00194958 0.01095304 0.00637425]
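
The bootstrap loop above draws a different set of resamples every time it runs, so the printed values change slightly between executions. A possible way to make the experiment reproducible is to draw the row indices from a seeded generator. This is a sketch on synthetic data; `bootstrap_coefs`, `fit_ols`, and the `_demo` arrays are hypothetical names introduced for illustration.

```python
import numpy as np

def bootstrap_coefs(X, y, fit, n_boot=1000, seed=0):
    """Pairs bootstrap: resample rows of (X, y) with replacement and refit."""
    rng = np.random.default_rng(seed)
    n = len(y)
    draws = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # n row indices drawn with replacement
        draws[b] = fit(X[idx], y[idx])
    return draws

def fit_ols(X, y):
    """Least-squares coefficients; X must already contain the intercept column."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Synthetic data (assumption: stands in for the Advertising design matrix)
rng_demo = np.random.default_rng(1)
n_demo = 200
X_demo = np.column_stack([np.ones(n_demo), rng_demo.normal(size=(n_demo, 2))])
y_demo = X_demo @ np.array([1.0, 2.0, -0.5]) + rng_demo.normal(scale=0.3, size=n_demo)

draws_demo = bootstrap_coefs(X_demo, y_demo, fit_ols, n_boot=500, seed=42)
boot_mean_demo = draws_demo.mean(axis=0)
boot_se_demo = draws_demo.std(axis=0, ddof=1)  # sample standard deviation across resamples
print("Bootstrap mean:", boot_mean_demo)
print("Bootstrap standard error:", boot_se_demo)
```

With `sklearn.utils.resample`, the same effect is available through its `random_state` argument. The sketch uses `ddof=1`, which differs from the `std(axis=0)` default only by a factor of `sqrt(n_boot / (n_boot - 1))`.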


In [11]:
# With L2 penalty (Ridge)
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

y = df_adv['sales'].values
X = df_adv.drop(columns=['sales']).values
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

params = {'alpha': np.logspace(-3, 3, 50)}
L2 = Ridge()
gs = GridSearchCV(L2, params, cv=5)
gs.fit(X_scaled, y)

best_alpha = gs.best_params_['alpha']
print("Best lambda (alpha):", best_alpha)

Best lambda (alpha): 1.151395399326447


In [13]:
L2_opt = Ridge(alpha=best_alpha).fit(X_scaled, y)

coef_ridge = L2_opt.coef_
intercept_ridge = L2_opt.intercept_

print("Ridge coefficients:", coef_ridge)
print("Ridge intercept:", intercept_ridge)

Ridge coefficients: [ 3.89734714  2.7746345  -0.01503966]
Ridge intercept: 14.0225


In [25]:
n_sim = 1000
n = len(X_scaled)
coef_ridge_boot = []

for b in range(n_sim):
    X_boot, y_boot = resample(X_scaled, y)
    ridge_b = Ridge(alpha=best_alpha).fit(X_boot, y_boot)

    params_b = np.concatenate(([ridge_b.intercept_], ridge_b.coef_))
    coef_ridge_boot.append(params_b)

coef_ridge_boot = np.array(coef_ridge_boot)
ridge_boot_mean = coef_ridge_boot.mean(axis=0)
ridge_boot_std = coef_ridge_boot.std(axis=0)

print("Bootstrap mean (L2):", ridge_boot_mean)
print("Bootstrap std (L2):", ridge_boot_std)

Bootstrap mean (L2): [14.03157805  3.89619331  2.77226852 -0.01430242]
Bootstrap std (L2): [0.11779851 0.16986604 0.15922978 0.13862069]
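
In the Ridge bootstrap above, the predictors are standardized once on the full dataset and the resamples are drawn from the already scaled matrix. One possible variation is to treat standardization as part of the fitting procedure and repeat it inside every resample, so that the variability of the scaling step is also reflected in the bootstrap spread. The sketch below does this with a closed-form ridge fit on synthetic data. The function `ridge_standardized`, the value `alpha_demo = 1.0`, and the `_demo` arrays are assumptions made for illustration.

```python
import numpy as np

def ridge_standardized(X, y, alpha):
    """Standardize X with this sample's own mean/std, then fit ridge.

    Returns [intercept, coef_1, ..., coef_p] on the standardized scale.
    The intercept is not penalized; with centered predictors it equals mean(y).
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    y_mean = y.mean()
    p = Z.shape[1]
    coef = np.linalg.solve(Z.T @ Z + alpha * np.eye(p), Z.T @ (y - y_mean))
    return np.concatenate(([y_mean], coef))

# Synthetic predictors on different scales (assumption: not the Advertising data)
rng_demo = np.random.default_rng(0)
n_demo = 200
X_demo = rng_demo.normal(size=(n_demo, 3)) * np.array([80.0, 15.0, 20.0]) + 50.0
y_demo = 3.0 + 0.05 * X_demo[:, 0] + 0.2 * X_demo[:, 1] + rng_demo.normal(size=n_demo)

alpha_demo = 1.0   # hypothetical penalty, not the tuned value from the grid search
n_boot_demo = 300
draws_demo = np.empty((n_boot_demo, 4))
for b in range(n_boot_demo):
    idx = rng_demo.integers(0, n_demo, size=n_demo)   # resample raw rows
    draws_demo[b] = ridge_standardized(X_demo[idx], y_demo[idx], alpha_demo)

print("Bootstrap mean (L2):", draws_demo.mean(axis=0))
print("Bootstrap std (L2):", draws_demo.std(axis=0, ddof=1))
```

Whether to rescale inside the loop is a design choice: scaling once keeps every resample on the same coefficient scale, while rescaling per resample mirrors what would happen if the whole pipeline were rerun on new data.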


---

## Default dataset

In [30]:
import numpy as np
from sklearn.linear_model import LogisticRegression

# Variables
y = (df_def['default'] == 'Yes').astype(int).values
X = np.column_stack([
    df_def['balance'].values,
    df_def['income'].values,
    (df_def['student'] == 'Yes').astype(int).values
])

# Logistic regression (note: scikit-learn applies an L2 penalty with C=1.0 by default)
lr = LogisticRegression(solver='lbfgs', max_iter=1000).fit(X, y)

b0 = lr.intercept_[0]
betas = lr.coef_[0]

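# Predicted probabilities
linear = b0 + X @ betas
p = 1 / (1 + np.exp(-linear))

# Diagonal of the weight matrix V = diag(p * (1 - p)), kept as a vector
# to avoid building a dense n x n matrix
w = p * (1 - p)

# Covariance matrix: inverse of X^T V X
X_full = np.column_stack([np.ones(len(X)), X])
cov = np.linalg.inv((X_full * w[:, None]).T @ X_full)

# Standard errors
se = np.sqrt(np.diag(cov))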

print("Intercept:", b0)
print("Coefficients:", betas)
print("Standard errors:", se)

Intercept: -10.901807453182402
Coefficients: [ 5.73061136e-03  3.96161560e-06 -6.12573022e-01]
Standard errors: [4.93158448e-01 2.31675572e-04 8.20844337e-06 2.36393900e-01]
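
The covariance formula used above, $(X^\top V X)^{-1}$ with $V = \mathrm{diag}(p_i(1 - p_i))$, is the inverse Fisher information of an unpenalized logistic regression. As a point of comparison, the following sketch fits the unpenalized model with Newton's method (iteratively reweighted least squares) and reads the standard errors from the same matrix. It runs on synthetic data; `logit_mle`, `beta_true_demo`, and the `_demo` arrays are hypothetical names, and this is not the solver scikit-learn uses.

```python
import numpy as np

def logit_mle(X, y, n_iter=25, tol=1e-10):
    """Unpenalized logistic regression via Newton's method.

    X must already contain the intercept column. Returns (beta, se).
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = p * (1.0 - p)
        hessian = (X * w[:, None]).T @ X           # X^T V X without forming V
        step = np.linalg.solve(hessian, X.T @ (y - p))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    w = p * (1.0 - p)
    cov = np.linalg.inv((X * w[:, None]).T @ X)    # inverse Fisher information
    return beta, np.sqrt(np.diag(cov))

# Synthetic data (assumption: not the Default dataset)
rng_demo = np.random.default_rng(0)
n_demo = 2000
X_demo = np.column_stack([np.ones(n_demo), rng_demo.normal(size=(n_demo, 2))])
beta_true_demo = np.array([-1.0, 1.5, -0.7])
p_demo = 1.0 / (1.0 + np.exp(-(X_demo @ beta_true_demo)))
y_demo = (rng_demo.random(n_demo) < p_demo).astype(float)

beta_hat_demo, se_hat_demo = logit_mle(X_demo, y_demo)
print("Coefficients:", beta_hat_demo)
print("Standard errors:", se_hat_demo)
```

On data with predictors on very different scales, such as `balance` and `income`, standardizing the columns first may be needed for the Newton steps to stay numerically stable.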


In [31]:
n_sim = 1000
coefs_boot = []

for b in range(n_sim):
    X_boot, y_boot = resample(X, y, replace=True)

    lr_b = LogisticRegression(solver='lbfgs', max_iter=1000)
    lr_b.fit(X_boot, y_boot)

    params_b = np.concatenate(([lr_b.intercept_[0]], lr_b.coef_[0]))
    coefs_boot.append(params_b)

coefs_boot = np.array(coefs_boot)
logit_boot_mean = coefs_boot.mean(axis=0)
logit_boot_std  = coefs_boot.std(axis=0)

print("Bootstrap mean:", logit_boot_mean)
print("Bootstrap std:", logit_boot_std)

Bootstrap mean: [-1.09366867e+01  5.75432884e-03  3.72580239e-06 -6.17601546e-01]
Bootstrap std: [5.12312012e-01 2.38134370e-04 8.42152683e-06 2.32341776e-01]


# Interpretation

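The differences arise because each method makes different assumptions about how the data behave and how the uncertainty of the coefficients is estimated. The standard errors from the labs come from closed-form expressions that assume the model is correctly specified: for OLS, independent errors with constant variance; for logistic regression, the inverse of the Fisher information evaluated at the fitted coefficients. The bootstrap instead resamples whole observations with replacement and measures how much the refitted coefficients vary, so it does not rely on those assumptions, but it carries Monte Carlo noise because only a finite number of resamples (1000 here) is used. In both datasets the bootstrap standard errors are close to the analytic ones, for example 0.335 versus 0.312 for the Advertising intercept and 0.512 versus 0.493 for the Default intercept. Note also that scikit-learn's `LogisticRegression` applies an L2 penalty by default, whereas the Fisher-information formula is derived for the unpenalized maximum-likelihood fit. The Ridge coefficients are fitted on standardized predictors, so their standard deviations are not on the same scale as the OLS standard errors.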
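
To illustrate one way such a difference can appear, the sketch below builds synthetic data in which the noise spread grows with the magnitude of the predictor, which violates the constant-variance assumption behind the closed-form OLS standard error. In this setup the bootstrap standard error of the slope should come out noticeably larger than the analytic one. All names ending in `_demo` and the data-generating process are assumptions made for illustration.

```python
import numpy as np

rng_demo = np.random.default_rng(0)
n_demo = 2000
x_demo = rng_demo.normal(size=n_demo)
X_demo = np.column_stack([np.ones(n_demo), x_demo])
# Noise standard deviation equals |x|: heteroscedastic errors
y_demo = 1.0 + 2.0 * x_demo + np.abs(x_demo) * rng_demo.normal(size=n_demo)

# Closed-form OLS standard errors (assume constant error variance)
XtX_inv_demo = np.linalg.inv(X_demo.T @ X_demo)
beta_hat_demo = XtX_inv_demo @ X_demo.T @ y_demo
resid_demo = y_demo - X_demo @ beta_hat_demo
sigma2_demo = resid_demo @ resid_demo / (n_demo - X_demo.shape[1])
analytic_se_demo = np.sqrt(sigma2_demo * np.diag(XtX_inv_demo))

# Pairs bootstrap standard errors (no constant-variance assumption)
n_boot_demo = 500
draws_demo = np.empty((n_boot_demo, 2))
for b in range(n_boot_demo):
    idx = rng_demo.integers(0, n_demo, size=n_demo)
    draws_demo[b] = np.linalg.lstsq(X_demo[idx], y_demo[idx], rcond=None)[0]
boot_se_demo = draws_demo.std(axis=0, ddof=1)

print("Analytic SE: ", analytic_se_demo)
print("Bootstrap SE:", boot_se_demo)
```

When the model assumptions hold reasonably well, as the close agreement in the Advertising and Default results suggests, the two estimates should mostly differ by resampling noise.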