# A08 Bootstrapping
Use the concepts learned in the regression and classification labs to find the standard error of the coefficients of a simple (linear/logistic) regression for the “Advertising” and “Default” datasets.

Use the bootstrap to simulate 1000 resamples of those datasets and compute the mean of the coefficients obtained by fitting a regression to each resample. Compute the standard deviation.

Compare the results obtained with the method seen in the labs against those obtained with the bootstrap. Why might the results differ?

Add L2 regularization to the models for the Advertising dataset (optimize the hyperparameter). Use that hyperparameter value to repeat the experiment with 1000 resamples. Compute the standard deviation of the coefficients obtained.

In [24]:
import pandas as pd
import statsmodels.api as sm
import numpy as np
from sklearn.linear_model import LogisticRegression

## Advertising

In [25]:
advertising = pd.read_csv(r"C:\Users\pablo\OneDrive - ITESO\Semestre 5\Laboratorio de aprendizaje estadístico\Advertising.csv")
advertising

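| | Unnamed: 0 | TV | radio | newspaper | sales |
|---|---|---|---|---|---|
| 0 | 1 | 230.1 | 37.8 | 69.2 | 22.1 |
| 1 | 2 | 44.5 | 39.3 | 45.1 | 10.4 |
| 2 | 3 | 17.2 | 45.9 | 69.3 | 9.3 |
| 3 | 4 | 151.5 | 41.3 | 58.5 | 18.5 |
| 4 | 5 | 180.8 | 10.8 | 58.4 | 12.9 |
| ... | ... | ... | ... | ... | ... |
| 195 | 196 | 38.2 | 3.7 | 13.8 | 7.6 |
| 196 | 197 | 94.2 | 4.9 | 8.1 | 9.7 |
| 197 | 198 | 177.0 | 9.3 | 6.4 | 12.8 |
| 198 | 199 | 283.6 | 42.0 | 66.2 | 25.5 |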


In [26]:
X = advertising[['TV', 'radio', 'newspaper']]
y = advertising['sales']

In [27]:
n = advertising.sales.count()
ones = np.ones((n, 1))
Xbig = np.hstack((ones, X))

In [28]:
ols = sm.OLS(y, Xbig).fit()
ols.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | sales | R-squared: | 0.897 |
| Model: | OLS | Adj. R-squared: | 0.896 |
| Method: | Least Squares | F-statistic: | 570.3 |
| Date: | jue., 20 nov. 2025 | Prob (F-statistic): | 1.58e-96 |
| Time: | 17:41:04 | Log-Likelihood: | -386.18 |
| No. Observations: | 200 | AIC: | 780.4 |
| Df Residuals: | 196 | BIC: | 793.6 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 2.9389 | 0.312 | 9.422 | 0.000 | 2.324 | 3.554 |
| x1 | 0.0458 | 0.001 | 32.809 | 0.000 | 0.043 | 0.049 |
| x2 | 0.1885 | 0.009 | 21.893 | 0.000 | 0.172 | 0.206 |
| x3 | -0.0010 | 0.006 | -0.177 | 0.860 | -0.013 | 0.011 |

| | | | |
|---|---|---|---|
| Omnibus: | 60.414 | Durbin-Watson: | 2.084 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 151.241 |
| Skew: | -1.327 | Prob(JB): | 1.44e-33 |
| Kurtosis: | 6.332 | Cond. No. | 454.0 |
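The `std err` column in an OLS summary comes from the classical formula: the square root of the diagonal of `sigma2_hat * (X^T X)^{-1}`, where `sigma2_hat` is the residual sum of squares divided by `n - p`. As a rough sketch of that computation, here it is on synthetic stand-in data (the numbers are made up for illustration, not taken from Advertising; `ols_standard_errors` is a hypothetical helper name):

```python
import numpy as np

def ols_standard_errors(X, y):
    """Return (beta_hat, se) for OLS. X must already include a column of ones."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    residuals = y - X @ beta_hat
    # Unbiased estimate of the error variance: RSS / (n - p)
    sigma2_hat = residuals @ residuals / (n - p)
    se = np.sqrt(sigma2_hat * np.diag(XtX_inv))
    return beta_hat, se

# Synthetic stand-in data (assumption: not the Advertising dataset)
rng = np.random.default_rng(0)
n_demo = 200
X_demo = np.column_stack([np.ones(n_demo), rng.normal(size=(n_demo, 2))])
y_demo = X_demo @ np.array([3.0, 0.5, -0.2]) + rng.normal(scale=1.0, size=n_demo)

beta_demo, se_demo = ols_standard_errors(X_demo, y_demo)
print("coefficients:", beta_demo)
print("standard errors:", se_demo)
```

Calling the same helper with `Xbig` and `y.values` should reproduce the `std err` column, up to rounding.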


### Bootstrap for Advertising

In [29]:
# The model is already fitted on the original data; now run the bootstrap 1000 times
B = 1000
coefs = np.zeros((B, Xbig.shape[1]))
for b in range(B):
    # Sample row indices with replacement
    sample_indices = np.random.choice(range(n), size=n, replace=True)
    X_sample = Xbig[sample_indices, :]
    y_sample = y.iloc[sample_indices]

    # Fit the model to the bootstrap sample
    ols_boot = sm.OLS(y_sample, X_sample).fit()

    # Store the coefficients
    coefs[b, :] = ols_boot.params

# Mean and standard deviation of each coefficient across the resamples
coefs_means = np.mean(coefs, axis=0)
print(coefs_means)
coefs_std = np.std(coefs, axis=0)
print(coefs_std)

[ 2.94397074e+00  4.56624648e-02  1.89091538e-01 -9.61819891e-04]
[0.33046914 0.00187169 0.0109052  0.0064787 ]


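The bootstrap does not add new data: each resample draws n rows from the original dataset with replacement. Its standard errors (0.330, 0.0019, 0.0109, 0.0065) are slightly larger than those from the OLS formula (0.312, 0.001, 0.009, 0.006, as rounded in the summary). The formula assumes that the linear model is correct and that the errors are independent with constant variance. The bootstrap does not rely on those assumptions, so the small gap suggests that at least one of them does not hold strictly for this dataset. Part of the difference is also Monte Carlo noise from using a finite number of resamples.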
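To see how a violated assumption can separate the two estimates, here is a small sketch on synthetic data (invented for this illustration, not part of the notebook) in which the noise variance grows with the predictor. The classical formula assumes constant variance, so in this setup it tends to understate the standard error of the slope, while the bootstrap over (x, y) pairs does not depend on that assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
n_sim = 1000
x_sim = rng.uniform(0, 10, size=n_sim)
# Heteroscedastic by construction: the noise standard deviation grows with x
y_sim = 1.0 + 2.0 * x_sim + rng.normal(scale=0.2 * x_sim**2, size=n_sim)
X_sim = np.column_stack([np.ones(n_sim), x_sim])

# Classical standard errors: sqrt(sigma2_hat * diag((X^T X)^{-1}))
beta_sim = np.linalg.lstsq(X_sim, y_sim, rcond=None)[0]
resid_sim = y_sim - X_sim @ beta_sim
sigma2_sim = resid_sim @ resid_sim / (n_sim - X_sim.shape[1])
se_classical = np.sqrt(sigma2_sim * np.diag(np.linalg.inv(X_sim.T @ X_sim)))

# Pairs bootstrap: resample rows with replacement and refit each time
B_sim = 1000
boot_sim = np.empty((B_sim, 2))
for b in range(B_sim):
    idx = rng.integers(0, n_sim, size=n_sim)
    boot_sim[b] = np.linalg.lstsq(X_sim[idx], y_sim[idx], rcond=None)[0]
se_boot = boot_sim.std(axis=0, ddof=1)

print("classical SE:", se_classical)
print("bootstrap SE:", se_boot)
```

With constant-variance noise instead (for example `scale=1.0`), the two sets of standard errors would be expected to agree much more closely.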

In [None]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from skopt import BayesSearchCV
from skopt.space import Real
 
# Data
data = pd.read_csv('Advertising.csv')
X = data[['TV', 'radio', 'newspaper']].values
y = data['sales'].values

# Pipeline: scaling + ridge (alpha is the regularization strength lambda)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', Ridge(random_state=42))
])

# Search space for alpha (log-uniform)
search_spaces = {
    'ridge__alpha': Real(1e-6, 1e2, prior='log-uniform')
}

# BayesSearchCV maximizes the scoring metric (here R2)
bayes = BayesSearchCV(
    estimator=pipeline,
    search_spaces=search_spaces,
    n_iter=30,        # number of Bayesian evaluations
    cv=5,
    scoring='r2',
    n_jobs=-1,
    random_state=42,
    verbose=0
)

bayes.fit(X, y)

best_alpha = bayes.best_params_['ridge__alpha']
best_score = bayes.best_score_
best_model = bayes.best_estimator_.named_steps['ridge']

print("Best alpha (lambda_reg):", best_alpha)
print("Best CV score (R2):", round(best_score, 4))
print("Intercept:", round(best_model.intercept_, 6))
print("Coefficients:", np.round(best_model.coef_, 6))

# Use best_alpha to fit the final model if desired
final_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', Ridge(alpha=best_alpha, random_state=42))
])
final_pipeline.fit(X, y)
# final_pipeline.predict(X_new) to predict
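
The remaining step of the assignment, repeating the 1000 resamples with the tuned hyperparameter, could look roughly like the sketch below. It runs on synthetic stand-in data with a placeholder `alpha=1.0` and only 200 resamples to keep it quick (all three are assumptions for illustration). In the notebook one would pass `X`, `y`, `best_alpha` and `B=1000` instead. The helper name `bootstrap_ridge_coefs` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def bootstrap_ridge_coefs(X, y, alpha, B=1000, seed=0):
    """Refit scaler + Ridge on B resamples.

    Returns a (B, p) array of coefficients on the standardized feature scale.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    coefs = np.empty((B, p))
    for b in range(B):
        # Resample rows with replacement
        idx = rng.integers(0, n, size=n)
        # The scaler is refit inside each resample, as the pipeline would do
        model = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
        model.fit(X[idx], y[idx])
        coefs[b] = model.named_steps['ridge'].coef_
    return coefs

# Synthetic stand-in data (assumption: not the Advertising dataset)
rng_demo = np.random.default_rng(42)
X_ridge_demo = rng_demo.normal(size=(200, 3))
y_ridge_demo = X_ridge_demo @ np.array([2.0, 1.0, 0.0]) + rng_demo.normal(size=200)

ridge_coefs = bootstrap_ridge_coefs(X_ridge_demo, y_ridge_demo, alpha=1.0, B=200)
print("mean:", ridge_coefs.mean(axis=0))
print("std: ", ridge_coefs.std(axis=0, ddof=1))
```

Because the scaler sits inside the pipeline, these coefficients are per standard deviation of each feature, so their spread is not directly comparable with the OLS standard errors on the raw scale.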

## Default

In [12]:
default = pd.read_csv(r"C:\Users\pablo\OneDrive - ITESO\Semestre 5\Laboratorio de aprendizaje estadístico\Default.csv")
default

| | default | student | balance | income |
|---|---|---|---|---|
| 0 | No | No | 729.526495 | 44361.625074 |
| 1 | No | Yes | 817.180407 | 12106.134700 |
| 2 | No | No | 1073.549164 | 31767.138950 |
| 3 | No | No | 529.250605 | 35704.493940 |
| 4 | No | No | 785.655883 | 38463.495880 |
| ... | ... | ... | ... | ... |
| 9995 | No | No | 711.555020 | 52992.378910 |
| 9996 | No | No | 757.962918 | 19660.721770 |
| 9997 | No | No | 845.411989 | 58636.156980 |
| 9998 | No | No | 1569.009053 | 36669.112360 |


In [16]:
y = (default['default'] == "Yes").astype(int)

X_df = default[['balance', 'income', 'student']].copy()
X_df['student'] = (X_df['student'] == 'Yes').astype(int)

In [18]:
modelo_multiple = LogisticRegression()
modelo_multiple.fit(X_df, y)

beta_0_multi = modelo_multiple.intercept_[0]
betas_multi = modelo_multiple.coef_[0]

# Put the intercept first so the order matches the design matrix used later
coeficientes = np.insert(betas_multi, 0, beta_0_multi)

print(f"Intercept: {coeficientes[0]}")
print(f"Coefficient for balance: {coeficientes[1]}")
print(f"Coefficient for income: {coeficientes[2]}")
print(f"Coefficient for student: {coeficientes[3]}")

Intercept: -10.901791246717943
Coefficient for balance: 0.005730595882780847
Coefficient for income: 3.961832702094613e-06
Coefficient for student: -0.6125733792101835


In [19]:
proba_multi = modelo_multiple.predict_proba(X_df)[:, 1]

# Diagonal weights p_i * (1 - p_i) of the matrix W
p_1_p_multi = proba_multi * (1 - proba_multi)

X_ones_multi = np.c_[np.ones(X_df.shape[0]), X_df]

# Covariance (X^T W X)^{-1}. Scaling the rows of X by the weights gives W X
# without building the n x n diagonal matrix
cov_multi = np.linalg.inv(X_ones_multi.T @ (X_ones_multi * p_1_p_multi[:, None]))

se_multi = np.sqrt(np.diag(cov_multi))

print("Computed standard errors:")
print(f"SE for intercept: {se_multi[0]}")
print(f"SE for balance: {se_multi[1]}")
print(f"SE for income: {se_multi[2]}")
print(f"SE for student: {se_multi[3]}")

Computed standard errors:
SE for intercept: 0.4931576451107031
SE for balance: 0.00023167497614065436
SE for income: 8.208436101644588e-06
SE for student: 0.23639384656858756
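
The formula `(X^T W X)^{-1}` is the inverse of the Hessian of the log-likelihood at the unpenalized maximum-likelihood estimate. Note that scikit-learn's `LogisticRegression` applies an L2 penalty by default (`C=1.0`), so plugging its probabilities into the formula is an approximation. As a sketch of where the formula comes from, the NumPy-only Newton-Raphson fit below (on synthetic data, with a hypothetical helper `logistic_mle_with_se`) reuses the same matrix for both the update step and the standard errors:

```python
import numpy as np

def logistic_mle_with_se(X, y, n_iter=25):
    """Unpenalized logistic regression via Newton-Raphson.

    X must include a column of ones. Returns (beta, se).
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        hessian = X.T @ (X * w[:, None])   # X^T W X
        gradient = X.T @ (y - p)           # score vector
        beta = beta + np.linalg.solve(hessian, gradient)
    # Covariance estimate: inverse of X^T W X evaluated at the final beta
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    w = p * (1.0 - p)
    cov = np.linalg.inv(X.T @ (X * w[:, None]))
    return beta, np.sqrt(np.diag(cov))

# Synthetic stand-in data (assumption: not the Default dataset)
rng_logit = np.random.default_rng(0)
n_logit = 2000
X_logit = np.column_stack([np.ones(n_logit), rng_logit.normal(size=n_logit)])
true_beta = np.array([-1.0, 2.0])
y_logit = rng_logit.binomial(1, 1.0 / (1.0 + np.exp(-X_logit @ true_beta)))

beta_hat, se_hat = logistic_mle_with_se(X_logit, y_logit)
print("beta:", beta_hat)
print("SE:  ", se_hat)
```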


### Bootstrapping Default

In [None]:
# The standard errors are already computed by hand; now run the bootstrap 1000 times
B = 1000
n = default.shape[0]
coefs_boot = np.zeros((B, X_ones_multi.shape[1]))
for b in range(B):
    # Sample row indices with replacement
    sample_indices = np.random.choice(range(n), size=n, replace=True)
    X_sample = X_ones_multi[sample_indices, :]
    y_sample = y.iloc[sample_indices]

    # Fit the model to the bootstrap sample (drop the column of ones;
    # LogisticRegression fits its own intercept)
    modelo_boot = LogisticRegression()
    modelo_boot.fit(X_sample[:, 1:], y_sample)

    # Store the coefficients, intercept first
    beta_0_boot = modelo_boot.intercept_[0]
    betas_boot = modelo_boot.coef_[0]
    coefs_boot[b, :] = np.insert(betas_boot, 0, beta_0_boot)

# Mean and standard deviation of each coefficient across the resamples
coefs_means_boot = np.mean(coefs_boot, axis=0)
print(coefs_means_boot)
coefs_SE_boot = np.std(coefs_boot, axis=0)
print(coefs_SE_boot)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

[-1.09426059e+01  5.75866949e-03  3.80100369e-06 -6.18372057e-01]
[5.06425056e-01 2.32179841e-04 8.40771835e-06 2.34374819e-01]
0.20863407300300776


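Again the results are very similar: the bootstrap standard errors (0.506, 2.32e-04, 8.41e-06, 0.234) are close to the ones from the formula (0.493, 2.32e-04, 8.21e-06, 0.236).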
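
The convergence warnings in the output come from fitting on features with very different scales (`balance` in the hundreds, `income` in the tens of thousands) with the default `max_iter=100`. One possible way around them, sketched here on synthetic stand-in data, is to standardize inside a pipeline, raise `max_iter`, and map the coefficients back to the original units. The large `C=1e6` is an assumption made here to make the default L2 penalty negligible; it is not something the notebook does.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two features on very different scales (assumption)
rng_scale = np.random.default_rng(0)
n_scale = 5000
balance_demo = rng_scale.normal(800, 400, size=n_scale)
income_demo = rng_scale.normal(33000, 13000, size=n_scale)
X_scale = np.column_stack([balance_demo, income_demo])
logit_demo = -6.0 + 0.005 * balance_demo
y_scale = rng_scale.binomial(1, 1.0 / (1.0 + np.exp(-logit_demo)))

# Standardize, then fit with a weak penalty and a higher iteration cap
model_demo = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=1e6, max_iter=1000),
)
model_demo.fit(X_scale, y_scale)

scaler_demo = model_demo.named_steps['standardscaler']
clf_demo = model_demo.named_steps['logisticregression']

# Map coefficients from the standardized scale back to the original units:
# b_j / s_j for the slopes, b_0 - sum(b_j * m_j / s_j) for the intercept
coef_original = clf_demo.coef_[0] / scaler_demo.scale_
intercept_original = clf_demo.intercept_[0] - np.sum(
    clf_demo.coef_[0] * scaler_demo.mean_ / scaler_demo.scale_
)

print("iterations used:", clf_demo.n_iter_[0])
print("intercept:", intercept_original)
print("coefficients:", coef_original)
```

Inside the bootstrap loop the same pipeline would be refit on each resample, converting the coefficients back before storing them so the standard deviations stay comparable with the formula-based standard errors.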