# T05 - Motor Trend Car Road Tests
### Dayanni Godoy Rosales
18-09-2025

Use the file "Motor Trend Car Road Tests.xlsx" to complete the following activities:

1.1 Fit a regression with 'mpg' as the output, dropping the 'model' column. Treat all other factors as numeric/ordinal.

- Compute the R2 and interpret the signs of the betas.
- Do a train-test split that uses 40% of the data for training. Compute the training and test R2.
- Add L2 regularization with a lambda hyperparameter of your choice. Vary this value and compare the training and test R2 across several values.

1.2 Repeat the previous exercise using 'qsec' as the output.

2.1 Fit a regression with 'mpg' as the output, dropping the 'model' column. Create dummy columns for the factors 'cyl', 'gear' and 'carb'.

- Compute the R2 and interpret the signs of the betas.
- Do a train-test split that uses 40% of the data for training. Compute the training and test R2.

2.2 Repeat the previous exercise using 'qsec' as the output.

3.1 Compare the R2 of exercises 1.1 & 2.1.

3.2 Compare the R2 of exercises 1.2 & 2.2.

## 1.1 

In [133]:
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
from sklearn.linear_model import Ridge

In [185]:

# Load the file
data = pd.read_excel("Motor Trend Car Road Tests.xlsx")

# Drop the column that is not used as a predictor
data = data.drop(columns=['model'])

# Predictors and output
X = data.drop(columns=['mpg']).values
y = data['mpg'].values

# Split: 40% training, 60% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.6, random_state=42
)

# Standardize using statistics from the training set only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# OLS model
X_train_const = sm.add_constant(X_train_scaled)
ols_model = sm.OLS(y_train, X_train_const).fit()

print("OLS summary:")
print(ols_model.summary())

# Predictions on the test set
X_test_const = sm.add_constant(X_test_scaled)
y_pred_ols = ols_model.predict(X_test_const)

r2_train_ols_mpg1 = ols_model.rsquared
r2_test_ols_mpg1 = r2_score(y_test, y_pred_ols)

print("Training R2 (OLS): " + str(r2_train_ols_mpg1))
print("Test R2 (OLS): " + str(r2_test_ols_mpg1))

# Ridge regularization
ridge_model = Ridge(alpha=1).fit(X_train_scaled, y_train)

y_pred_train_ridge = ridge_model.predict(X_train_scaled)
y_pred_test_ridge = ridge_model.predict(X_test_scaled)

r2_train_ridge = r2_score(y_train, y_pred_train_ridge)
r2_test_ridge = r2_score(y_test, y_pred_test_ridge)

print("Training R2 (Ridge): " + str(r2_train_ridge))
print("Test R2 (Ridge): " + str(r2_test_ridge))


OLS summary:
                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.998
Model:                            OLS   Adj. R-squared:                  0.980
Method:                 Least Squares   F-statistic:                     55.84
Date:                Thu, 18 Sep 2025   Prob (F-statistic):              0.104
Time:                        22:24:49   Log-Likelihood:                -2.6714
No. Observations:                  12   AIC:                             27.34
Df Residuals:                       1   BIC:                             32.68
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         20.9167      0.302     69
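
The `No. Observations: 12` and `Df Residuals: 1` entries in the summary follow directly from the split. A minimal sketch of that arithmetic, assuming the spreadsheet holds the 32 cars of the standard `mtcars` data (the row count is not printed in this notebook) and using the rule scikit-learn applies to a fractional `test_size` (the test count is rounded up):

```python
import math

n_rows = 32          # assumption: 32 cars, as in the standard mtcars data
test_size = 0.6      # value passed to train_test_split in the cell above
n_predictors = 10    # every column except 'model' and 'mpg'

n_test = math.ceil(test_size * n_rows)   # a fractional test_size is rounded up
n_train = n_rows - n_test                # the remainder is used for training
df_resid = n_train - (n_predictors + 1)  # +1 for the intercept

print(n_train, n_test, df_resid)
```

With a single residual degree of freedom, a training R2 close to 1 is expected almost regardless of the data.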



Interpreting the signs of the coefficients

In [157]:
# Ridge regularization, alpha = 1
ridge_model = Ridge(alpha=1).fit(X_train_scaled, y_train)

y_pred_train_ridge = ridge_model.predict(X_train_scaled)
y_pred_test_ridge = ridge_model.predict(X_test_scaled)

r2_train_ridge = r2_score(y_train, y_pred_train_ridge)
r2_test_ridge = r2_score(y_test, y_pred_test_ridge)

print("Training R2 (Ridge): " + str(r2_train_ridge))
print("Test R2 (Ridge): " + str(r2_test_ridge))

Training R2 (Ridge): 0.9274813499199267
Test R2 (Ridge): 0.6518375462086008


In [159]:
# Ridge regularization, alpha = 0.1
ridge_model = Ridge(alpha=0.1).fit(X_train_scaled, y_train)

y_pred_train_ridge = ridge_model.predict(X_train_scaled)
y_pred_test_ridge = ridge_model.predict(X_test_scaled)

r2_train_ridge = r2_score(y_train, y_pred_train_ridge)
r2_test_ridge = r2_score(y_test, y_pred_test_ridge)

print("Training R2 (Ridge): " + str(r2_train_ridge))
print("Test R2 (Ridge): " + str(r2_test_ridge))

# Show the coefficients of the Ridge model
features = data.drop(columns=['mpg']).columns
print("Ridge model coefficients:")
for var, coef in zip(features, ridge_model.coef_):
    signo = "positive" if coef > 0 else "negative"
    print(var + ": " + str(coef) + " (" + signo + ")")

Training R2 (Ridge): 0.983137773095632
Test R2 (Ridge): -0.5064817705944273
Ridge model coefficients:
cyl: -2.0815601928834178 (negative)
disp: 1.8803136677734422 (positive)
hp: -2.457398266785389 (negative)
drat: -2.0426028804982437 (negative)
wt: -7.875643835059549 (negative)
qsec: 5.13225584492053 (positive)
vs: -4.286183418311386 (negative)
am: 1.9691254967170326 (positive)
gear: 1.6318967884689521 (positive)
carb: 2.9676581067730474 (positive)


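This is overfitting: with alpha = 0.1 the training R2 is 0.983 while the test R2 is -0.506, so the model fits noise in the 12 training rows and generalizes worse than predicting the mean. Raising alpha to 1 lowers the training R2 to 0.927 but lifts the test R2 to 0.652.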
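
A negative test R2 is possible because R2 compares the model's squared error with that of a constant prediction at the mean of the evaluated targets. The sketch below uses made-up numbers (not taken from the dataset) and a hypothetical helper `r2`, written to follow the same definition as `sklearn.metrics.r2_score`:

```python
def r2(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [10.0, 20.0, 30.0]              # illustrative values only

perfect = r2(y_true, [10.0, 20.0, 30.0])   # predictions equal to the targets
baseline = r2(y_true, [20.0, 20.0, 20.0])  # always predicting the mean
reversed_ = r2(y_true, [30.0, 20.0, 10.0]) # predictions in the wrong order

print(perfect, baseline, reversed_)
```

Any model whose squared error on the test rows exceeds that of the mean baseline lands below zero.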

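Ridge coefficients (alpha = 0.1, standardized predictors):

Negative (associated with lower mpg): cylinders (cyl), horsepower (hp), rear axle ratio (drat), weight (wt), engine shape (vs)

Positive (associated with higher mpg): displacement (disp), quarter-mile time (qsec), manual transmission (am), number of gears (gear), number of carburetors (carb)

The positive sign on disp runs against the expectation that larger engines use more fuel. With only 12 training rows and correlated predictors such as cyl, disp, hp and wt, individual signs should be read with caution.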
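
The task asks for a comparison across several values of lambda, which is easier to read as a sweep. The following is a minimal sketch on synthetic data (not the Motor Trend file), and it swaps in closed-form ridge in NumPy for `sklearn.linear_model.Ridge`; with centered predictors and an unpenalized intercept both solve the same penalized least-squares problem. The data sizes, the alpha grid and the helper `r2` are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 32 rows, 10 predictors, noisy linear target
n, p = 32, 10
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=2.0, size=n)

# 12 training rows and 20 test rows, mirroring the 40/60 split
idx = rng.permutation(n)
train, test = idx[:12], idx[12:]

# Standardize with training statistics only and center y on the training mean
mu, sd = X[train].mean(axis=0), X[train].std(axis=0)
X_tr, X_te = (X[train] - mu) / sd, (X[test] - mu) / sd
y_mean = y[train].mean()

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

results = []
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    # Closed-form ridge: beta = (X'X + alpha*I)^-1 X'(y - mean(y))
    beta = np.linalg.solve(X_tr.T @ X_tr + alpha * np.eye(p),
                           X_tr.T @ (y[train] - y_mean))
    r2_train = r2(y[train], X_tr @ beta + y_mean)
    r2_test = r2(y[test], X_te @ beta + y_mean)
    results.append((alpha, r2_train, r2_test))
    print(f"alpha={alpha:<6} train R2={r2_train:.3f}  test R2={r2_test:.3f}")
```

The training R2 can only fall as alpha grows, so the value of interest is the alpha at which the test R2 peaks.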

## 1.2

In [187]:
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
from sklearn.linear_model import Ridge

# Load the file
data = pd.read_excel("Motor Trend Car Road Tests.xlsx")

# Drop the 'model' column, which is not used as a predictor
data = data.drop(columns=['model'])

# Define predictors and output (now 'qsec')
features = data.drop(columns=['qsec']).columns   # names of the predictor variables
X = data.drop(columns=['qsec']).values
y = data['qsec'].values

# Split: 40% training, 60% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.6, random_state=42
)

# Standardize using statistics from the training set only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# OLS model
X_train_const = sm.add_constant(X_train_scaled)
ols_model = sm.OLS(y_train, X_train_const).fit()

print("OLS summary for qsec:")
print(ols_model.summary())

# Predictions on the test set
X_test_const = sm.add_constant(X_test_scaled)
y_pred_ols = ols_model.predict(X_test_const)

r2_train_ols_qsec1 = ols_model.rsquared
r2_test_ols_qsec1 = r2_score(y_test, y_pred_ols)

print("Training R2 (OLS): " + str(r2_train_ols_qsec1))
print("Test R2 (OLS): " + str(r2_test_ols_qsec1))



OLS summary for qsec:
                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.999
Model:                            OLS   Adj. R-squared:                  0.988
Method:                 Least Squares   F-statistic:                     94.94
Date:                Thu, 18 Sep 2025   Prob (F-statistic):             0.0797
Time:                        22:26:18   Log-Likelihood:                 18.054
No. Observations:                  12   AIC:                            -14.11
Df Residuals:                       1   BIC:                            -8.775
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         17.9433      0.
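
The gap between `R-squared` and `Adj. R-squared` in the summary comes from the degrees-of-freedom correction. A quick check, assuming the usual formula for a model with an intercept and taking the unrounded training R2 that the notebook prints for this fit in section 3.2:

```python
r2 = 0.9989478049029176  # training R2 of the qsec model, as printed in section 3.2
n_obs = 12               # No. Observations in the summary
df_resid = 1             # Df Residuals in the summary

# Adjusted R2 = 1 - (1 - R2) * (n - 1) / df_resid
adj_r2 = 1 - (1 - r2) * (n_obs - 1) / df_resid
print(round(adj_r2, 3))
```

The result rounds to the 0.988 shown in the summary: with one residual degree of freedom the unexplained share is multiplied by 11.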



In [153]:

# Ridge regularization with alpha=1
ridge_model = Ridge(alpha=1).fit(X_train_scaled, y_train)

y_pred_train_ridge = ridge_model.predict(X_train_scaled)
y_pred_test_ridge = ridge_model.predict(X_test_scaled)

r2_train_ridge = r2_score(y_train, y_pred_train_ridge)
r2_test_ridge = r2_score(y_test, y_pred_test_ridge)

print("Training R2 (Ridge): " + str(r2_train_ridge))
print("Test R2 (Ridge): " + str(r2_test_ridge))

# Coefficients of the Ridge model
print("Ridge model coefficients for qsec:")
for var, coef in zip(features, ridge_model.coef_):
    signo = "positive" if coef > 0 else "negative"
    print(var + ": " + str(coef) + " (" + signo + ")")

Training R2 (Ridge): 0.9274813499199267
Test R2 (Ridge): 0.6518375462086008
Ridge model coefficients for qsec:
cyl: -0.7885714655519767 (negative)
disp: -0.15278834643893147 (negative)
hp: -1.6888419666009937 (negative)
drat: -0.29397020684434555 (negative)
wt: -3.079910935388808 (negative)
qsec: 1.1440565353126932 (positive)
vs: -0.39800502463197884 (negative)
am: 1.416441960467505 (positive)
gear: 0.9319951809683572 (positive)
carb: -0.17123817084321036 (negative)


In [155]:

# Ridge regularization with alpha=0.1
ridge_model = Ridge(alpha=0.1).fit(X_train_scaled, y_train)

y_pred_train_ridge = ridge_model.predict(X_train_scaled)
y_pred_test_ridge = ridge_model.predict(X_test_scaled)

r2_train_ridge = r2_score(y_train, y_pred_train_ridge)
r2_test_ridge = r2_score(y_test, y_pred_test_ridge)

print("Training R2 (Ridge): " + str(r2_train_ridge))
print("Test R2 (Ridge): " + str(r2_test_ridge))

# Coefficients of the Ridge model
print("Ridge model coefficients for qsec:")
for var, coef in zip(features, ridge_model.coef_):
    signo = "positive" if coef > 0 else "negative"
    print(var + ": " + str(coef) + " (" + signo + ")")

Training R2 (Ridge): 0.983137773095632
Test R2 (Ridge): -0.5064817705944273
Ridge model coefficients for qsec:
cyl: -2.0815601928834178 (negative)
disp: 1.8803136677734422 (positive)
hp: -2.457398266785389 (negative)
drat: -2.0426028804982437 (negative)
wt: -7.875643835059549 (negative)
qsec: 5.13225584492053 (positive)
vs: -4.286183418311386 (negative)
am: 1.9691254967170326 (positive)
gear: 1.6318967884689521 (positive)
carb: 2.9676581067730474 (positive)

Note: the two Ridge outputs in this section (cells In [153] and In [155]) repeat the R2 values of the mpg model in 1.1 and list qsec among the predictors. The execution counts (153 and 155, versus 187 for the cell that defines the qsec variables) indicate that both cells ran against the mpg data, so these numbers describe the mpg fit. Re-running the two cells after In [187] is needed to obtain the Ridge results for qsec.
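
A printed coefficient for `qsec` in a model whose output is `qsec` signals stale notebook state. One way to catch this early is a small guard before fitting; `check_target_not_in_features` is a hypothetical helper written for this illustration, not part of pandas or scikit-learn:

```python
def check_target_not_in_features(features, target):
    """Raise if the target column is still among the predictor names."""
    if target in features:
        raise ValueError(f"target '{target}' is also listed as a predictor")
    return list(features)

# Predictor names as printed in the output above (they belong to the mpg model)
stale = ["cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb"]

try:
    check_target_not_in_features(stale, "qsec")
except ValueError as err:
    print(err)
```

Calling the guard with `list(features)` and the output name at the top of each Ridge cell would make an out-of-order run fail loudly instead of printing misleading numbers.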


## 2.1

In [201]:
data = pd.read_excel("Motor Trend Car Road Tests.xlsx")

data = data.drop(columns=['model'])
# Create dummies for cyl, gear and carb (categorical variables)
data = pd.get_dummies(data, columns=['cyl', 'gear', 'carb'], drop_first=True)

X = data.drop(columns=['mpg'])
y = data['mpg']

# Split: 40% training, 60% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.6, random_state=42
)

# Standardize using statistics from the training set only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)


X_train_const = sm.add_constant(X_train_scaled)
ols_model_dummies = sm.OLS(y_train, X_train_const).fit()

print("OLS summary for mpg:")
print(ols_model_dummies.summary())

# Predictions and R2
X_test_const = sm.add_constant(X_test_scaled)
y_pred_ols = ols_model_dummies.predict(X_test_const)

r2_train_ols_mpg2 = ols_model_dummies.rsquared
r2_test_ols_mpg2 = r2_score(y_test, y_pred_ols)

print("Training R2 mpg: " + str(r2_train_ols_mpg2))
print("Test R2 mpg: " + str(r2_test_ols_mpg2))

OLS summary for mpg:
                            OLS Regression Results                            
Dep. Variable:                    mpg   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                    nan
Method:                 Least Squares   F-statistic:                       nan
Date:                Thu, 18 Sep 2025   Prob (F-statistic):                nan
Time:                        22:31:16   Log-Likelihood:                 361.91
No. Observations:                  12   AIC:                            -699.8
Df Residuals:                       0   BIC:                            -694.0
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         20.9167        in
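
The `Df Residuals: 0` entry reflects that the dummy encoding yields more predictors than training rows. A sketch of the count, assuming the factor levels of the standard `mtcars` data (cyl in {4, 6, 8}, gear in {3, 4, 5}, carb in {1, 2, 3, 4, 6, 8}); the notebook itself does not print these levels:

```python
levels = {               # assumption: levels as in the standard mtcars data
    "cyl": [4, 6, 8],
    "gear": [3, 4, 5],
    "carb": [1, 2, 3, 4, 6, 8],
}
numeric = ["disp", "hp", "drat", "wt", "qsec", "vs", "am"]  # predictors left as-is

# drop_first=True keeps (number of levels - 1) dummy columns per factor
n_dummies = sum(len(v) - 1 for v in levels.values())
n_predictors = len(numeric) + n_dummies
n_train = 12             # No. Observations in the summary

print(n_dummies, n_predictors, n_predictors + 1 > n_train)
```

With 16 predictors plus an intercept and only 12 rows, the design matrix can have rank at most 12, which is consistent with `Df Model: 11` and leaves no residual degrees of freedom.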



## 2.2

In [203]:
# Now repeat with qsec as the output
# `data` already holds the dummy columns created in 2.1
X2 = data.drop(columns=['qsec'])
y2 = data['qsec']

X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X2, y2, test_size=0.6, random_state=42
)

scaler2 = StandardScaler().fit(X_train2)
X_train_scaled2 = scaler2.transform(X_train2)
X_test_scaled2 = scaler2.transform(X_test2)

X_train_const2 = sm.add_constant(X_train_scaled2)
ols_model2_dummies = sm.OLS(y_train2, X_train_const2).fit()

print("\nOLS summary for qsec:")
print(ols_model2_dummies.summary())

X_test_const2 = sm.add_constant(X_test_scaled2)
y_pred_ols2 = ols_model2_dummies.predict(X_test_const2)

r2_train_ols_qsec2 = ols_model2_dummies.rsquared
r2_test_ols_qsec2 = r2_score(y_test2, y_pred_ols2)

print("Training R2 qsec: " + str(r2_train_ols_qsec2))
print("Test R2 qsec: " + str(r2_test_ols_qsec2))


OLS summary for qsec:
                            OLS Regression Results                            
Dep. Variable:                   qsec   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                    nan
Method:                 Least Squares   F-statistic:                       nan
Date:                Thu, 18 Sep 2025   Prob (F-statistic):                nan
Time:                        22:31:44   Log-Likelihood:                 360.38
No. Observations:                  12   AIC:                            -696.8
Df Residuals:                       0   BIC:                            -690.9
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         17.9433        



## 3.1

In [205]:
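# Comparison of training R2 for mpg
# ols_model was refit on qsec in 1.2, so use the R2 values saved in 1.1 and 2.1
print("R2 of the mpg model without dummies: ", r2_train_ols_mpg1)
print("R2 of the mpg model with dummies: ", r2_train_ols_mpg2)



R2 of the mpg model without dummies:  0.9989478049029176
R2 of the mpg model with dummies:  1.0

Note: the first value above was printed from `ols_model`, which by then held the qsec fit from 1.2 (it equals the qsec value printed in 3.2). The OLS summary in 1.1 reports a training R2 of 0.998 for the mpg model without dummies.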


## 3.2

In [207]:
# Comparison of training R2 for qsec
# ols_model2 is not defined in the cells shown above; use the R2 values saved in 1.2 and 2.2
print("R2 of the qsec model without dummies: ", r2_train_ols_qsec1)
print("R2 of the qsec model with dummies: ", r2_train_ols_qsec2)


R2 of the qsec model without dummies:  0.9989478049029176
R2 of the qsec model with dummies:  1.0
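
Reusing names such as `ols_model` across sections makes the comparison cells depend on execution order. A possible alternative is to collect the scores in one dictionary keyed by output and encoding. The structure below is only a suggestion, filled with the training R2 values that appear in this notebook (0.998 is the rounded value from the OLS summary in 1.1):

```python
# Training R2 values as reported in the notebook outputs
scores = {
    ("mpg", "numeric"): 0.998,                # rounded, from the OLS summary in 1.1
    ("mpg", "dummies"): 1.0,
    ("qsec", "numeric"): 0.9989478049029176,
    ("qsec", "dummies"): 1.0,
}

gains = {}
for target in ("mpg", "qsec"):
    gains[target] = scores[(target, "dummies")] - scores[(target, "numeric")]
    print(f"{target}: numeric={scores[(target, 'numeric')]:.4f} "
          f"dummies={scores[(target, 'dummies')]:.4f} gain={gains[target]:.4f}")
```

Storing the test R2 values under the same keys would allow both comparisons to be printed from a single cell.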


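For both outputs, the model without dummies already explains almost all of the training variance (R2 of 0.998 for mpg and 0.999 for qsec), and adding the dummies for cyl, gear and carb produces a perfect training fit (R2 = 1.0). That perfect fit points to overfitting rather than to a better model: with only 12 training rows the dummy models have 0 residual degrees of freedom, so they interpolate the training data and the adjusted R2 is undefined (nan). The categories may carry extra information, but the original models already captured almost all of the training variation, and only the test R2 values can show whether the dummies help.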
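
A perfect training fit with zero residual degrees of freedom says little about the data. The sketch below uses synthetic noise and NumPy least squares in place of statsmodels: it fits 11 random predictors plus an intercept to 12 random targets, and the training R2 still comes out at 1 up to floating-point error. The seed and sizes are assumptions chosen to mirror the `No. Observations: 12` and `Df Model: 11` entries of the dummy-model summaries:

```python
import numpy as np

rng = np.random.default_rng(1)

n_train, n_predictors = 12, 11
X = rng.normal(size=(n_train, n_predictors))   # pure noise predictors
y = rng.normal(size=n_train)                   # target unrelated to X

# Design matrix with intercept: 12 rows, 12 columns, so 0 residual degrees of freedom
A = np.column_stack([np.ones(n_train), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

ss_res = np.sum((y - A @ beta) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_train = 1 - ss_res / ss_tot
print(f"training R2 on pure noise: {r2_train:.6f}")
```

Because the same result appears for unrelated noise, the R2 of 1.0 in 2.1 and 2.2 cannot be taken as evidence that the dummies improve the model.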