<a href="https://colab.research.google.com/github/Brandon-Bernal-Alarcon/Notas/blob/main/Machine%20Learning/Aprendizaje%20Supervisado/06_Regresion_Lineal_Ridge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Ridge Linear Regression

Ridge regression is an extension of least-squares linear regression that adds regularization in order to reduce overfitting and improve the model's generalization.

Whereas ordinary linear regression minimizes only the squared error, Ridge adds a penalty on the size of the coefficients.

Linear model:

$$\hat{y} = w^T x + b$$

Ridge objective function:

$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{p} w_j^2$$

The first term measures the fitting error, the second term is the L2 penalty, and $\alpha$ controls the strength of the regularization. Large values of $w_j$ increase the penalty, so the model prefers small coefficients.
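As a rough illustration of the objective above, the following sketch evaluates it directly in NumPy and solves it in closed form. The synthetic data, the helper names `ridge_objective` and `ridge_closed_form`, and the choice to center the data so that the intercept $b$ is not penalized are assumptions made for this example; they are not taken from the notebook.

```python
import numpy as np

def ridge_objective(w, b, X, y, alpha):
    """Sum of squared residuals plus the L2 penalty alpha * sum(w_j^2)."""
    residuals = y - (X @ w + b)
    return np.sum(residuals ** 2) + alpha * np.sum(w ** 2)

def ridge_closed_form(X, y, alpha):
    """Minimize the objective above; the intercept b is left unpenalized."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean                      # centered features
    yc = y - y_mean                      # centered target
    p = X.shape[1]
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(p), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

# Hypothetical synthetic data, used only for illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + 1.0 + rng.normal(scale=0.1, size=50)

w_small, b_small = ridge_closed_form(X_demo, y_demo, alpha=0.1)
w_large, b_large = ridge_closed_form(X_demo, y_demo, alpha=100.0)

print('coefficient norm (alpha=0.1):', np.linalg.norm(w_small))
print('coefficient norm (alpha=100):', np.linalg.norm(w_large))
```

The larger value of $\alpha$ yields a coefficient vector with a smaller norm, which is the shrinkage effect described above.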

In [5]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

from sklearn import neighbors
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def load_crime_dataset():
    # Communities and Crime dataset for regression
    # https://archive.ics.uci.edu/ml/datasets/Communities+and+Crime+Unnormalized

    crime = pd.read_table('/content/CommViolPredUnnormalizedData.txt', sep=',', na_values='?')
    # remove features with poor coverage or lower relevance, and keep ViolentCrimesPerPop target column
    columns_to_keep = [5, 6] + list(range(11,26)) + list(range(32, 103)) + [145]
    crime = crime.iloc[:,columns_to_keep].dropna()

    X_crime = crime.iloc[:,range(0,88)]
    y_crime = crime['ViolentCrimesPerPop']
    print(crime)
    return (X_crime, y_crime)

(X_crime, y_crime) = load_crime_dataset()

      population  householdsize  agePct12t21  agePct12t29  agePct16t24  \
0          11980           3.10        12.47        21.44        10.93   
1          23123           2.82        11.01        21.30        10.48   
2          29344           2.43        11.36        25.88        11.01   
3          16656           2.40        12.55        25.20        12.19   
5         140494           2.45        18.09        32.89        20.04   
...          ...            ...          ...          ...          ...   
2210       56216           3.07        15.46        30.16        14.34   
2211       12251           2.68        17.36        31.23        16.97   
2212       32824           2.46        11.81        20.96         9.53   
2213       13547           2.89        17.16        30.01        14.73   
2214       28898           2.61        12.99        25.21        11.63   

      agePct65up  numbUrban  pctUrban  medIncome  pctWWage  ...  \
0          11.33      11980    100.00      7

In [6]:
# Fit a Ridge model to the crime dataset with alpha=20, which imposes moderate regularization
X_train, X_test, y_train, y_test = train_test_split(X_crime, y_crime,
                                                   random_state = 0)

linridge = Ridge(alpha=20.0).fit(X_train, y_train)

In [7]:
print('Crime dataset')
print('ridge regression linear model intercept: {}'
     .format(linridge.intercept_))
print('ridge regression linear model coeff:\n{}'
     .format(linridge.coef_))
print('R-squared score (training): {:.3f}'
     .format(linridge.score(X_train, y_train)))
print('R-squared score (test): {:.3f}'
     .format(linridge.score(X_test, y_test)))
print('Number of non-zero features: {}'
     .format(np.sum(linridge.coef_ != 0)))

Crime dataset
ridge regression linear model intercept: -3352.4230358461004
ridge regression linear model coeff:
[ 1.95091438e-03  2.19322667e+01  9.56286607e+00 -3.59178973e+01
  6.36465325e+00 -1.96885471e+01 -2.80715856e-03  1.66254486e+00
 -6.61426604e-03 -6.95450680e+00  1.71944731e+01 -5.62819154e+00
  8.83525114e+00  6.79085746e-01 -7.33614221e+00  6.70389803e-03
  9.78505502e-04  5.01202169e-03 -4.89870524e+00 -1.79270062e+01
  9.17572382e+00 -1.24454193e+00  1.21845360e+00  1.03233089e+01
 -3.78037278e+00 -3.73428973e+00  4.74595305e+00  8.42696855e+00
  3.09250005e+01  1.18644167e+01 -2.05183675e+00 -3.82210450e+01
  1.85081589e+01  1.52510829e+00 -2.20086608e+01  2.46283912e+00
  3.29328703e-01  4.02228467e+00 -1.12903533e+01 -4.69567413e-03
  4.27046505e+01 -1.22507167e-03  1.40795790e+00  9.35041855e-01
 -3.00464253e+00  1.12390514e+00 -1.82487653e+01 -1.54653407e+01
  2.41917002e+01 -1.32497562e+01 -4.20113118e-01 -3.59710660e+01
  1.29786751e+01 -2.80765995e+01  4.3851347

Ridge does not eliminate features; it only shrinks the magnitude of their coefficients.

All coefficients usually remain non-zero.

Without feature scaling, the test R² is comparable to that of least squares.

**Importance of feature normalization**

The L2 penalty depends directly on the size of the coefficients.
If the features have different scales, the regularization is not applied fairly.

Example: population measured in thousands and income measured in tens.

The model penalizes some features more than others purely because of their scale: a feature with small numeric values needs a larger coefficient to have the same effect on the prediction, so it is penalized more heavily.
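The scale dependence can be checked with a small experiment. The sketch below uses hypothetical synthetic features (`feature_a`, `feature_b`) rather than the crime data: it fits the same model twice, once with both features on the same scale and once with `feature_b` divided by 100, and compares the predictions. For ordinary least squares the predictions should agree up to numerical precision, while for Ridge they should differ, because the rescaled feature needs a coefficient 100 times larger and the penalty suppresses it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Hypothetical synthetic data, used only for illustration
rng = np.random.default_rng(0)
n_samples = 200
feature_a = rng.uniform(0.0, 1.0, size=n_samples)
feature_b = rng.uniform(0.0, 1.0, size=n_samples)
y_demo = 3.0 * feature_a + 3.0 * feature_b + rng.normal(scale=0.1, size=n_samples)

X_same_scale = np.column_stack([feature_a, feature_b])
# Same information, but feature_b expressed in a unit 100 times larger
X_mixed_scale = np.column_stack([feature_a, feature_b / 100.0])

def max_prediction_gap(make_model):
    """Largest difference between predictions of the two fits."""
    pred_same = make_model().fit(X_same_scale, y_demo).predict(X_same_scale)
    pred_mixed = make_model().fit(X_mixed_scale, y_demo).predict(X_mixed_scale)
    return np.max(np.abs(pred_same - pred_mixed))

ols_gap = max_prediction_gap(lambda: LinearRegression())
ridge_gap = max_prediction_gap(lambda: Ridge(alpha=20.0))

print('max prediction gap, least squares:', ols_gap)
print('max prediction gap, Ridge (alpha=20):', ridge_gap)
```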

**Min-Max normalization**

Transformation:

$$x'=\frac{x-x_{min}}{x_{max}-x_{min}}$$

All features end up in the range [0, 1].
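A minimal worked example of this transformation, assuming a tiny hand-made array instead of the crime data. It applies the formula column by column with NumPy and compares the result with `MinMaxScaler`. The minimum and maximum are taken from the training rows only, so a test row can map outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: two features on very different scales
X_train_demo = np.array([[10.0, 200.0],
                         [20.0, 400.0],
                         [30.0, 800.0]])
X_test_demo = np.array([[25.0, 1000.0]])

# Per-column minimum and maximum, computed on the training rows only
x_min = X_train_demo.min(axis=0)
x_max = X_train_demo.max(axis=0)

manual_train = (X_train_demo - x_min) / (x_max - x_min)
manual_test = (X_test_demo - x_min) / (x_max - x_min)

# The same transformation with scikit-learn
scaler_demo = MinMaxScaler().fit(X_train_demo)
sklearn_train = scaler_demo.transform(X_train_demo)
sklearn_test = scaler_demo.transform(X_test_demo)

print(manual_train)
print(manual_test)
```

In this toy example the second feature of the test row (1000) exceeds the training maximum (800), so its scaled value is greater than 1.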

In [8]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import Ridge

scaler = MinMaxScaler()
X_train, X_test, y_train, y_test = train_test_split(X_crime, y_crime, random_state = 0)

# Fit the scaler on the training data only, then apply the same transformation to the test data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

linridge = Ridge(alpha=20.0).fit(X_train_scaled, y_train)


In [9]:
print('Crime dataset')
print('ridge regression linear model intercept: {}'
     .format(linridge.intercept_))
print('ridge regression linear model coeff:\n{}'
     .format(linridge.coef_))
print('R-squared score (training): {:.3f}'
     .format(linridge.score(X_train_scaled, y_train)))
print('R-squared score (test): {:.3f}'
     .format(linridge.score(X_test_scaled, y_test)))
print('Number of non-zero features: {}'
     .format(np.sum(linridge.coef_ != 0)))

Crime dataset
ridge regression linear model intercept: 933.3906385044157
ridge regression linear model coeff:
[  88.68827454   16.48947987  -50.30285445  -82.90507574  -65.89507244
   -2.27674244   87.74108514  150.94862182   18.8802613   -31.05554992
  -43.13536109 -189.44266328   -4.52658099  107.97866804  -76.53358414
    2.86032762   34.95230077   90.13523036   52.46428263  -62.10898424
  115.01780357    2.66942023    6.94331369   -5.66646499 -101.55269144
  -36.9087526    -8.7053343    29.11999068  171.25963057   99.36919476
   75.06611841  123.63522539   95.24316483 -330.61044265 -442.30179004
 -284.49744001 -258.37150609   17.66431072 -101.70717151  110.64762887
  523.13611718   24.8208959     4.86533322  -30.46775619   -3.51753937
   50.57947231   10.84840601   18.27680946   44.11189865   58.33588176
   67.08698975  -57.93524659  116.1446052    53.81163718   49.01607711
   -7.62262031   55.14288543  -52.08878272  123.39291017   77.12562171
   45.49795317  184.91229771  -91.3572

After scaling, Ridge's test R² improves markedly, surpassing both unscaled Ridge and least squares. Regularization is much more effective when the features are normalized.
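One possible way to keep the scaling and the model together is a scikit-learn pipeline, which fits the scaler on the training data and reuses it at prediction time. This is a sketch on hypothetical synthetic data from `make_regression`, not the notebook's own workflow; it also checks that the pipeline gives the same predictions as scaling by hand.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical synthetic data, used only for illustration
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Scaling and regression bundled into a single estimator
scaled_ridge = make_pipeline(MinMaxScaler(), Ridge(alpha=20.0))
scaled_ridge.fit(X_tr, y_tr)
pipeline_pred = scaled_ridge.predict(X_te)

# Manual equivalent: scaler fitted on the training split only
manual_scaler = MinMaxScaler().fit(X_tr)
manual_ridge = Ridge(alpha=20.0).fit(manual_scaler.transform(X_tr), y_tr)
manual_pred = manual_ridge.predict(manual_scaler.transform(X_te))

print('R-squared score (test): {:.3f}'.format(scaled_ridge.score(X_te, y_te)))
```

With a pipeline, the test data cannot accidentally be used when fitting the scaler.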

**Effect of the regularization parameter α in Ridge**

The parameter α controls the trade-off between fitting the data and keeping the model simple.

In [10]:
print('Ridge regression: effect of alpha regularization parameter\n')
for this_alpha in [0, 1, 10, 20, 50, 100, 1000]:
    linridge = Ridge(alpha = this_alpha).fit(X_train_scaled, y_train)
    r2_train = linridge.score(X_train_scaled, y_train)
    r2_test = linridge.score(X_test_scaled, y_test)
    num_coeff_bigger = np.sum(abs(linridge.coef_) > 1.0)
    print('Alpha = {:.2f}\nnum abs(coeff) > 1.0: {}, \
r-squared training: {:.2f}, r-squared test: {:.2f}\n'
         .format(this_alpha, num_coeff_bigger, r2_train, r2_test))

Ridge regression: effect of alpha regularization parameter

Alpha = 0.00
num abs(coeff) > 1.0: 88, r-squared training: 0.67, r-squared test: 0.50

Alpha = 1.00
num abs(coeff) > 1.0: 87, r-squared training: 0.66, r-squared test: 0.56

Alpha = 10.00
num abs(coeff) > 1.0: 87, r-squared training: 0.63, r-squared test: 0.59

Alpha = 20.00
num abs(coeff) > 1.0: 88, r-squared training: 0.61, r-squared test: 0.60

Alpha = 50.00
num abs(coeff) > 1.0: 86, r-squared training: 0.58, r-squared test: 0.58

Alpha = 100.00
num abs(coeff) > 1.0: 87, r-squared training: 0.55, r-squared test: 0.55

Alpha = 1000.00
num abs(coeff) > 1.0: 84, r-squared training: 0.31, r-squared test: 0.30



The following are evaluated: training R², test R², and the number of large coefficients (absolute value greater than 1.0).

General interpretation
- α = 0 → ordinary linear regression
- small α → risk of overfitting
- very large α → underfitting

There is an optimal intermediate value. On this dataset, the best test performance occurs around α ≈ 20.
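The comparison above picks α by looking at the test score. A common alternative, sketched here as an assumption rather than as part of the notebook, is to choose α by cross-validation on the training data so that the test set is only used for the final evaluation. The example uses `RidgeCV` on hypothetical synthetic data; the candidate grid is illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical synthetic data, used only for illustration
X_demo, y_demo = make_regression(n_samples=300, n_features=20,
                                 noise=25.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Illustrative grid of strictly positive candidate values
candidate_alphas = [0.01, 0.1, 1, 10, 20, 50, 100, 1000]

# 5-fold cross-validation over the grid, using the training split only
model = make_pipeline(MinMaxScaler(), RidgeCV(alphas=candidate_alphas, cv=5))
model.fit(X_tr, y_tr)

best_alpha = model.named_steps['ridgecv'].alpha_
r2_test = model.score(X_te, y_te)

print('alpha selected by cross-validation: {}'.format(best_alpha))
print('R-squared score (test): {:.3f}'.format(r2_test))
```

The selected value depends on the data and on the grid, so it need not match the α ≈ 20 observed on the crime dataset.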