# A08 - Bootstrapping

Use the concepts learned in the regression and classification labs to find the standard error of the coefficients of a simple (linear/logistic) regression for the "Advertising" and "Default" datasets.

Use the bootstrap to simulate 1000 resamples of those datasets and compute the mean of the coefficients obtained by fitting a regression to each resample. Compute the standard deviation.

Compare the results obtained with the method seen in the labs against the results obtained with the bootstrap. Why might the results differ?

Add L2 regularization to the models for the Advertising dataset (optimize the hyperparameter). Use that hyperparameter value to repeat the 1000-resample experiment. Compute the standard deviation of the resulting coefficients.

### Advertising the standard way

In [33]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
from skopt import BayesSearchCV
from skopt.space import Real
import statsmodels.api as sm
from scipy import stats
from scipy.stats import norm
import matplotlib.pyplot as plt


In [34]:
data = pd.read_csv('Advertising.csv')
x1 = data['TV'].values.reshape(-1,1)
x2 = data['radio'].values.reshape(-1,1)
x3 = data['newspaper'].values.reshape(-1,1)
y = data['sales']
n = len(y)
ones = np.ones([n,1])
X = np.hstack([ones,x1,x2,x3])
ols = sm.OLS(y, X)
results = ols.fit()
results.summary()

| Field | Value | Field | Value |
|---|---|---|---|
| Dep. Variable: | sales | R-squared: | 0.897 |
| Model: | OLS | Adj. R-squared: | 0.896 |
| Method: | Least Squares | F-statistic: | 570.3 |
| Date: | Fri, 21 Nov 2025 | Prob (F-statistic): | 1.58e-96 |
| Time: | 13:02:09 | Log-Likelihood: | -386.18 |
| No. Observations: | 200 | AIC: | 780.4 |
| Df Residuals: | 196 | BIC: | 793.6 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 2.9389 | 0.312 | 9.422 | 0.000 | 2.324 | 3.554 |
| x1 | 0.0458 | 0.001 | 32.809 | 0.000 | 0.043 | 0.049 |
| x2 | 0.1885 | 0.009 | 21.893 | 0.000 | 0.172 | 0.206 |
| x3 | -0.0010 | 0.006 | -0.177 | 0.860 | -0.013 | 0.011 |

| Field | Value | Field | Value |
|---|---|---|---|
| Omnibus: | 60.414 | Durbin-Watson: | 2.084 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 151.241 |
| Skew: | -1.327 | Prob(JB): | 1.44e-33 |
| Kurtosis: | 6.332 | Cond. No. | 454.0 |
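
For reference, the `std err` column in the summary is what statsmodels reports under `Covariance Type: nonrobust`, that is, the classical formula that assumes errors with constant variance. A sketch in standard notation (the symbols are not from the original notebook; here $n = 200$ observations and $p = 4$ coefficients, matching `Df Residuals: 196`):

```latex
\hat{\sigma}^2 = \frac{\mathrm{RSS}}{n - p}
  = \frac{\sum_{i=1}^{n} \left(y_i - x_i^\top \hat{\beta}\right)^2}{n - p},
\qquad
\widehat{\operatorname{Var}}(\hat{\beta}) = \hat{\sigma}^2 \left(X^\top X\right)^{-1},
\qquad
\operatorname{SE}(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2 \left[\left(X^\top X\right)^{-1}\right]_{jj}}
```

The bootstrap in the next section estimates the same quantity empirically, without relying on the constant-variance assumption.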


### Bootstrapping

In [35]:
data = pd.read_csv('Advertising.csv')
x1 = data['TV'].values.reshape(-1,1)
x2 = data['radio'].values.reshape(-1,1)
x3 = data['newspaper'].values.reshape(-1,1)
y = data['sales']
n = len(y)
ones = np.ones([n,1])
X = np.hstack([ones,x1,x2,x3])
# Full-sample fit, kept under its own name for the comparison table
results = sm.OLS(y, X).fit()

# Bootstrapping: resample rows with replacement and refit OLS each time
n_bootstraps = 1000
coef_samples = np.zeros((n_bootstraps, X.shape[1]))
for i in range(n_bootstraps):
    sample_indices = np.random.choice(len(y), size=len(y), replace=True)
    X_sample = X[sample_indices]
    y_sample = y.iloc[sample_indices]
    # Separate name so the full-sample `results` is not overwritten
    boot_results = sm.OLS(y_sample, X_sample).fit()
    coef_samples[i, :] = boot_results.params
coef_means = np.mean(coef_samples, axis=0)
print("Mean coefficients: " + str(coef_means))
# Standard deviation of the bootstrap coefficients
coef_stds = np.std(coef_samples, axis=0)
print("Standard deviation: " + str(coef_stds))

Mean coefficients: [ 2.94949590e+00  4.56833841e-02  1.87991330e-01 -5.16830956e-04]
Standard deviation: [0.32469904 0.00188002 0.01058133 0.00657808]


In [36]:
# Comparison table: OLS coefficients and standard errors vs. bootstrap
# Refit on the full sample, because `results` may hold the last bootstrap fit
full_fit = sm.OLS(y, X).fit()
coef_ols = full_fit.params
std_ols = full_fit.bse
comparison_table = pd.DataFrame({
    'OLS coefficient': coef_ols,
    'OLS std. error': std_ols,
    'Bootstrap mean': coef_means,
    'Bootstrap std. dev.': coef_stds
})
print(comparison_table)

       OLS coefficient  OLS std. error  Bootstrap mean  Bootstrap std. dev.
const           2.9389           0.312        2.949496             0.324699
x1              0.0458           0.001        0.045683             0.001880
x2              0.1885           0.009        0.187991             0.010581
x3             -0.0010           0.006       -0.000517             0.006578

The OLS columns are shown at the precision reported in the `results.summary()` output of the full-sample fit.


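The bootstrap standard deviations come out somewhat larger than the standard errors reported by OLS. The two estimates rest on different assumptions. The OLS formula assumes that the linear model is correct and that the errors have constant variance, whereas resampling (x, y) pairs with replacement assumes neither. Duplicated rows inside a resample are expected: drawing with replacement is how the bootstrap mimics drawing new samples from the population. When the residuals are heteroscedastic or the relationship is not exactly linear, the bootstrap spread can exceed the formula-based standard error; the residual diagnostics in the OLS summary (skew -1.327, kurtosis 6.332) suggest the ideal assumptions do not hold exactly here. Part of the difference is also Monte Carlo noise, since only 1000 resamples are drawn and no random seed is fixed.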
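
To illustrate how the two standard-error estimates can diverge, here is a minimal sketch on synthetic data (hypothetical, not the Advertising dataset) in which the noise grows with the distance of `x` from its center. The helper `ols_fit` and all variable names are assumptions made for this example; it uses only NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, heteroscedastic data: noise sd is proportional to |x - 5|.
n = 500
x = rng.uniform(0, 10, size=n)
y = 1.0 + 2.0 * x + rng.normal(0, 1, size=n) * 0.5 * np.abs(x - 5)
X = np.column_stack([np.ones(n), x])

def ols_fit(X, y):
    """Return OLS coefficients and classical (constant-variance) standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta, np.sqrt(np.diag(cov))

beta_hat, se_classical = ols_fit(X, y)

# Pairs bootstrap: resample (x_i, y_i) rows with replacement and refit.
n_boot = 1000
boot = np.empty((n_boot, X.shape[1]))
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)
    boot[b], _ = ols_fit(X[idx], y[idx])
se_boot = boot.std(axis=0, ddof=1)

print("classical SE:", se_classical)
print("bootstrap SE:", se_boot)
```

In this setup the bootstrap standard error of the slope should be noticeably larger than the classical one, because the classical formula averages the noise variance over all observations while the extreme `x` values (which drive the slope) are also the noisiest.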


### Default

In [37]:
# Libraries
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm
from scipy.stats import norm


In [38]:
data = pd.read_csv("Defaultt.csv")

# .copy() avoids pandas' SettingWithCopyWarning when adding the dummy column
X_multi = data[["balance", "income"]].copy()
X_multi["student"] = (data["student"] == "Yes").astype(int)
y_multi = (data["default"] == "Yes").astype(int)

# Note: scikit-learn's LogisticRegression applies an L2 penalty (C=1.0) by default
model_multi = LogisticRegression()
model_multi.fit(X_multi, y_multi)

beta_0 = model_multi.intercept_[0]
beta_1, beta_2, beta_3 = model_multi.coef_[0]

print("1. Estimated coefficients")
print("Intercept (β0):", round(beta_0, 4))
print("Coefficient balance (β1):", round(beta_1, 4))
print("Coefficient income (β2):", round(beta_2, 4))
print("Coefficient student (β3):", round(beta_3, 4), "\n")

# Standard errors from the inverse Fisher information (X^T W X)^{-1},
# with W = diag(p_hat * (1 - p_hat))
X_design = sm.add_constant(X_multi).to_numpy(dtype=float)
p_hat = model_multi.predict_proba(X_multi)[:, 1]
w = p_hat * (1 - p_hat)
# Weight the rows directly instead of building an n x n diagonal matrix
cov_matrix = np.linalg.inv(X_design.T @ (X_design * w[:, None]))
standard_errors = np.sqrt(np.diag(cov_matrix))

print("2. Standard error")
print("Standard error β0:", round(standard_errors[0], 4))
print("Standard error β1:", round(standard_errors[1], 4))
print("Standard error β2:", round(standard_errors[2], 4))
print("Standard error β3:", round(standard_errors[3], 4), "\n")

# Wald z statistics and two-sided p-values
z_stats = np.array([beta_0, beta_1, beta_2, beta_3]) / standard_errors
p_values = 2 * (1 - norm.cdf(np.abs(z_stats)))

print("3. z statistic and p-value")
print("z β0:", round(z_stats[0], 4), "p-value β0:", round(p_values[0], 4))
print("z β1:", round(z_stats[1], 4), "p-value β1:", round(p_values[1], 4))
print("z β2:", round(z_stats[2], 4), "p-value β2:", round(p_values[2], 4))
print("z β3:", round(z_stats[3], 4), "p-value β3:", round(p_values[3], 4), "\n")

1. Estimated coefficients
Intercept (β0): -10.9018
Coefficient balance (β1): 0.0057
Coefficient income (β2): 0.0
Coefficient student (β3): -0.6126

2. Standard error
Standard error β0: 0.4932
Standard error β1: 0.0002
Standard error β2: 0.0
Standard error β3: 0.2364

3. z statistic and p-value
z β0: -22.1061 p-value β0: 0.0
z β1: 24.7355 p-value β1: 0.0
z β2: 0.4827 p-value β2: 0.6293
z β3: -2.5913 p-value β3: 0.0096
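
The standard errors in this cell come from the inverse Fisher information, $(X^\top W X)^{-1}$ with $W = \operatorname{diag}(\hat p_i (1 - \hat p_i))$. That formula is derived for the unpenalized maximum-likelihood estimate, while scikit-learn's `LogisticRegression` applies an L2 penalty by default, so the values here are best read as an approximation. As a hedged illustration of the unpenalized case, the following sketch fits a logistic regression by Newton-Raphson on synthetic data (hypothetical, not the Default dataset) and reads the standard errors off the same matrix. All names are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: intercept plus one predictor, known true coefficients.
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
true_beta = np.array([-1.0, 2.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ true_beta))))

# Newton-Raphson (IRLS) for the unpenalized maximum-likelihood estimate.
beta = np.zeros(X.shape[1])
for _ in range(25):
    p_hat = 1.0 / (1.0 + np.exp(-(X @ beta)))
    w = p_hat * (1.0 - p_hat)
    info = X.T @ (X * w[:, None])   # Fisher information X^T W X
    score = X.T @ (y - p_hat)       # gradient of the log-likelihood
    step = np.linalg.solve(info, score)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:
        break

# Standard errors: square roots of the diagonal of the inverse information,
# evaluated at the final estimate.
p_hat = 1.0 / (1.0 + np.exp(-(X @ beta)))
w = p_hat * (1.0 - p_hat)
cov = np.linalg.inv(X.T @ (X * w[:, None]))
se = np.sqrt(np.diag(cov))

print("estimate:", beta)
print("standard errors:", se)
```

Weighting the rows with `X * w[:, None]` gives the same product as `X.T @ np.diag(w) @ X` without allocating an n-by-n matrix.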



### With Bootstrapping

In [39]:
data = pd.read_csv("Defaultt.csv")

# .copy() avoids pandas' SettingWithCopyWarning when adding the dummy column
X_multi = data[["balance", "income"]].copy()
X_multi["student"] = (data["student"] == "Yes").astype(int)
y_multi = (data["default"] == "Yes").astype(int)

model_multi = LogisticRegression()
model_multi.fit(X_multi, y_multi)

# Bootstrapping: refit a separate estimator on each resample so that the
# full-sample `model_multi` is not overwritten
n_bootstraps = 1000
boot_model = LogisticRegression()
coef_samples = np.zeros((n_bootstraps, X_multi.shape[1]+1))
for i in range(n_bootstraps):
    sample_indices = np.random.choice(len(y_multi), size=len(y_multi), replace=True)
    X_sample = X_multi.iloc[sample_indices]
    y_sample = y_multi.iloc[sample_indices]
    boot_model.fit(X_sample, y_sample)
    coef_samples[i,:] = np.hstack([boot_model.intercept_, boot_model.coef_[0]])
coef_means = np.mean(coef_samples, axis=0)
print("Mean coefficients: " + str(coef_means))
coef_stds = np.std(coef_samples, axis=0)
print("Standard deviation: " + str(coef_stds))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Mean coefficients: [-1.09376642e+01  5.75007640e-03  3.92337314e-06 -6.15803473e-01]
Standard deviation: [4.89733540e-01 2.35227514e-04 8.03293291e-06 2.32922683e-01]


In [40]:
# Comparison table: full-sample logistic regression vs. bootstrap
# Use the full-sample estimates and the analytic standard errors computed
# earlier (beta_0..beta_3, standard_errors), not the bootstrap values
coef_logit = np.array([beta_0, beta_1, beta_2, beta_3])
comparison_table = pd.DataFrame({
    'Logistic regression coefficient': coef_logit,
    'Logistic regression std. error': standard_errors,
    'Bootstrap mean': coef_means,
    'Bootstrap std. dev.': coef_stds
}, index=['const', 'balance', 'income', 'student'])
print(comparison_table)

         Logistic regression coefficient  Logistic regression std. error  \
const                           -10.9018                          0.4932
balance                           0.0057                          0.0002
income                            0.0000                          0.0000
student                          -0.6126                          0.2364

         Bootstrap mean  Bootstrap std. dev.
const        -10.937664             0.489734
balance        0.005750             0.000235
income         0.000004             0.000008
student       -0.615803             0.232923

The logistic regression columns are shown at the four-decimal precision printed in the full-sample cell, which rounds the `income` coefficient and its standard error to zero.


Add L2 regularization to the models for the Advertising dataset (optimize the hyperparameter). Use that hyperparameter value to repeat the 1000-resample experiment. Compute the standard deviation of the resulting coefficients.

### Optimizing lambda

In [44]:

data = pd.read_csv('Advertising.csv')
X = data[['TV', 'radio', 'newspaper']].values
y = data['sales'].values

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', Ridge(random_state=42))
])

search_spaces = {
    'ridge__alpha': Real(1e-6, 1e2, prior='log-uniform')
}

bayes = BayesSearchCV(
    estimator=pipeline,
    search_spaces=search_spaces,
    n_iter=30,       
    cv=5,
    scoring='r2',
    n_jobs=-1,
    random_state=42,
    verbose=0
)

bayes.fit(X, y)

best_alpha = bayes.best_params_['ridge__alpha']
best_score = bayes.best_score_
best_model = bayes.best_estimator_.named_steps['ridge']

print("Best alpha (lambda_reg):", best_alpha)
print("Best CV score (R2):", best_score)
print("Intercept:", best_model.intercept_)
print("Coefficients:", best_model.coef_)

# Refit on the full data with the selected alpha
final_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', Ridge(alpha=best_alpha, random_state=42))
])
final_pipeline.fit(X, y)

Best alpha (lambda_reg): 1.3218534263612438
Best CV score (R2): 0.8871689381616473
Intercept: 14.0225
Coefficients: [ 3.89412502  2.77207577 -0.01394549]


In [45]:

# Bootstrapping
n_bootstraps = 1000
coef_samples = np.zeros((n_bootstraps, 3))
intercept_samples = np.zeros(n_bootstraps)

for i in range(n_bootstraps):
    indices = np.random.choice(len(y), len(y), replace=True)
    X_sample, y_sample = X[indices], y[indices]
    
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X_sample)
    
    # best_alpha from the search above, rounded to two decimals
    ridge = Ridge(alpha=1.32, random_state=42)
    ridge.fit(X_scaled, y_sample)
    
    coef_samples[i] = ridge.coef_
    intercept_samples[i] = ridge.intercept_

coef_means = np.mean(coef_samples, axis=0)
coef_stds = np.std(coef_samples, axis=0)

print("Mean intercept: " + str(np.mean(intercept_samples)))
print("Intercept std: " + str(np.std(intercept_samples)))
print("Mean coefs: " + str(coef_means))
print("Coef std: " + str(coef_stds))

Mean intercept: 14.035805500000002
Intercept std: 0.37520454610751997
Mean coefs: [ 3.88495147  2.76298119 -0.00966597]
Coef std: [0.23578334 0.18377382 0.13908279]


In [46]:
# Coefficients of the ridge model fitted on the full data with best_alpha
# (`ridge` would be the fit from the last bootstrap resample)
full_ridge = final_pipeline.named_steps['ridge']
ridge_coef = [full_ridge.intercept_, full_ridge.coef_[0], full_ridge.coef_[1], full_ridge.coef_[2]]

bootstrap_mean = [np.mean(intercept_samples), coef_means[0], coef_means[1], coef_means[2]]
bootstrap_std = [np.std(intercept_samples), coef_stds[0], coef_stds[1], coef_stds[2]]

tabla = pd.DataFrame({
    'Variable': ['Intercept', 'TV', 'radio', 'newspaper'],
    'Ridge_Coef': ridge_coef,
    'Bootstrap_Mean': bootstrap_mean,
    'Bootstrap_Std': bootstrap_std
})

print("Comparison table")
print(tabla.to_string(index=False))

Comparison table
 Variable  Ridge_Coef  Bootstrap_Mean  Bootstrap_Std
Intercept   14.022500       14.035806       0.375205
       TV    3.894125        3.884951       0.235783
    radio    2.772076        2.762981       0.183774
newspaper   -0.013945       -0.009666       0.139083
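
Note that the ridge coefficients and their bootstrap standard deviations are on the standardized-feature scale, so they are not directly comparable with the raw-unit OLS values from the first section. To show how an L2 penalty can reduce the bootstrap spread, here is a minimal sketch using a closed-form ridge solution on synthetic, strongly correlated predictors (hypothetical data; `ridge_coefs` and `bootstrap_std` are names invented for this example). It standardizes inside each resample, as the cell above does.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data with two highly correlated predictors.
n = 100
z = rng.normal(size=n)
X = np.column_stack([z + 0.1 * rng.normal(size=n),
                     z + 0.1 * rng.normal(size=n)])
y = X @ np.array([1.0, 1.0]) + rng.normal(size=n)

def ridge_coefs(X, y, alpha):
    """Closed-form ridge on standardized X and centered y (unpenalized intercept)."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = y - y.mean()
    p = Xs.shape[1]
    return np.linalg.solve(Xs.T @ Xs + alpha * np.eye(p), Xs.T @ yc)

def bootstrap_std(X, y, alpha, n_boot=500):
    """Standard deviation of ridge coefficients over pairs-bootstrap resamples."""
    coefs = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, len(y), size=len(y))
        coefs[b] = ridge_coefs(X[idx], y[idx], alpha)
    return coefs.std(axis=0, ddof=1)

std_ols = bootstrap_std(X, y, alpha=0.0)     # alpha = 0 is ordinary least squares
std_ridge = bootstrap_std(X, y, alpha=10.0)  # penalized fit

print("bootstrap std, alpha=0: ", std_ols)
print("bootstrap std, alpha=10:", std_ridge)
```

The reduction is pronounced here because the predictors are nearly collinear. With a small penalty and weakly correlated predictors, the change in spread would be expected to be much smaller.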
