# Regression Lab

## Linear Regression
The goal of regression is to predict the value of one or more continuous target variables $t$ given the value of a $D$-dimensional vector $x$ of input variables. The simplest linear regression models are also linear functions of the input variables.<br>
Given a training data set of $N$ observations ${x}_{n}$, where $n=1,...,N$, with their corresponding target values ${t}_{n}$, the goal is to predict the value of $t$ for a new value of $x$.<br>
The linear regression model has the form:<br>
$$y=f(x)={w}^T{x}+b$$
where $x=({x}_{1},...,{x}_{D})$, and the model is a linear function of the parameters ${w}_{0},...,{w}_{D}$ (with the bias $b$ written as ${w}_{0}$).<br>
To find the parameters $w$, an error function must be defined; in this case it is the least-squares function
$$e(w)={\sum_{i=1}^{N}{(y_{i}-{x}_{i}w)}^{2}}$$
where $y_{i}$ is the observed target of the $i$-th sample and the error vector is given by ${e}_{i}=y_{i}-{x}_{i}w$.<br>
The function is then minimized, that is, it is differentiated with respect to $w$ and the derivative is set to zero.<br>
The Python function for this regression is:<br>
`sklearn.linear_model.LinearRegression(fit_intercept=True, normalize=False, copy_X=True, n_jobs=None)`
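
As a minimal sketch of the least-squares minimization, the weights can be computed directly with NumPy and compared against `LinearRegression`. The synthetic data, the coefficients and the names `X`, `t`, `X_aug` and `w_closed` are assumptions made for illustration; they are not part of the lab's housing data. The `normalize` argument shown in the signature belongs to older scikit-learn releases, so the sketch does not pass it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)

# Synthetic data (illustrative assumption): t = 3*x1 - 2*x2 + 5 + noise
X = rng.normal(size=(200, 2))
t = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + 0.1 * rng.normal(size=200)

# Append a column of ones so the bias is absorbed into the weight vector
X_aug = np.c_[X, np.ones(len(X))]

# Least-squares solution: minimizes sum_i (t_i - x_i w)^2
w_closed, *_ = np.linalg.lstsq(X_aug, t, rcond=None)

# The same fit with scikit-learn
lin = LinearRegression().fit(X, t)

print("closed form :", w_closed)
print("scikit-learn:", np.r_[lin.coef_, lin.intercept_])
```

Both lines should print essentially the same three numbers: the two slopes followed by the intercept.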

To control overfitting, it is common to regularize with the sum of squares of the weight vector $w$, so the regularization term becomes:<br>
$$\frac{1}{2}{\lambda}{||w||}_{2}^{2}$$

On the other hand, to constrain the values of $w$, a regularization term can be added to the error function: a penalty on the weights scaled by the parameter $\lambda$, the coefficient that sets the importance of the penalty relative to the data-dependent error. The regularized error then becomes:<br>
$$e(w,\lambda)=\frac{1}{2}{||e||}_{2}^{2}+\frac{1}{2}{\lambda}{||w||}_{2}^{2}$$
To implement this error in Python, use the function:<br>
`sklearn.linear_model.Ridge(alpha=1.0)`
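
The following sketch checks the L2-regularized solution numerically. Setting the gradient of the regularized error to zero gives the linear system $(X^{T}X+\lambda I)w=X^{T}y$, which is solved here with NumPy and compared with `Ridge`. The synthetic data and the names `lam` and `w_closed` are illustrative assumptions, and `fit_intercept=False` is chosen only so that both solutions describe the same model.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

lam = 1.0  # regularization strength (lambda in the text, alpha in scikit-learn)

# Zero gradient of 1/2*||y - Xw||^2 + 1/2*lam*||w||^2 gives
# (X^T X + lam*I) w = X^T y
w_closed = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# fit_intercept=False so that both solutions refer to the same model
ridge = Ridge(alpha=lam, fit_intercept=False).fit(X, y)

print(w_closed)
print(ridge.coef_)
```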


A general form of regularization is obtained by generalizing the $\lambda$ penalty term:<br>
$$e(w,\lambda)=\frac{1}{2}{||e||}_{2}^{2}+\frac{1}{2}{\lambda}{||w||}_{q}^{q}$$

where $q$ is the order of the regularization.<br>
The case $q=1$ is known as Lasso regularization, where the error function takes the form:<br>
$$e(w,\lambda)=\frac{1}{2}{||e||}_{2}^{2}+\frac{1}{2}{\lambda}{||w||}_{1}$$

Lasso regularization has the property that, if $\lambda$ is sufficiently large, some of the coefficients $w$ are driven to zero, which yields a sparse model.<br>

In Python it can be implemented with the function:<br>
`sklearn.linear_model.Lasso(alpha=1.0)`
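
The sparsity property can be observed with a small experiment. This is only a sketch on synthetic data in which two of six features matter; the data and the names `alpha_max`, `Xc` and `yc` are assumptions. Note that scikit-learn scales the data term by $1/(2n)$, so its `alpha` is not numerically identical to the $\lambda$ in the formulas of this section.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
n = 200
X = rng.normal(size=(n, 6))
# Only the first two features influence the target (illustrative assumption)
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + 0.1 * rng.normal(size=n)

# Smallest alpha for which scikit-learn's Lasso objective,
# 1/(2n)*||y - Xw||^2 + alpha*||w||_1, has the all-zero solution
Xc = X - X.mean(axis=0)
yc = y - y.mean()
alpha_max = np.max(np.abs(Xc.T @ yc)) / n

for alpha in (0.01, 0.5, 1.1 * alpha_max):
    coef = Lasso(alpha=alpha).fit(X, y).coef_
    print(f"alpha={alpha:.3f}  zero coefficients: {np.sum(coef == 0)} of {coef.size}")
```

As `alpha` grows, more coefficients become exactly zero, and above `alpha_max` none remain.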

With this general form of the $\lambda$ penalty, a combined regularized error can be built:<br>
$$e(w,\lambda)=\frac{1}{2}{||e||}_{2}^{2}+\frac{1}{2}{\lambda}{||w||}_{2}^{2}+\frac{1}{2}{\lambda}{||w||}_{1}$$
This error can be implemented with the function<br>
`sklearn.linear_model.ElasticNet`
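
A short sketch of how the combined penalty is exposed in scikit-learn: `ElasticNet` takes an overall strength `alpha` and a mixing weight `l1_ratio`, and its documented objective is $\frac{1}{2n}||y-Xw||_2^2+\alpha\,\rho\,||w||_1+\frac{1}{2}\alpha(1-\rho)||w||_2^2$ with $\rho$ = `l1_ratio`. This differs from the single-$\lambda$ form used in this section. The synthetic data and parameter values below are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=200)

# l1_ratio balances the two penalties: 1.0 is pure L1, 0.0 is pure L2
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, random_state=42).fit(X, y)
print("ElasticNet coefficients:", enet.coef_)

# With l1_ratio=1.0 the L2 term vanishes and the fit matches Lasso
enet_l1 = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print("max difference vs. Lasso:", np.max(np.abs(enet_l1.coef_ - lasso.coef_)))
```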

The linear model can be extended to a linear combination of fixed nonlinear functions of the input data. These functions are called basis functions; they are denoted by $\phi$ and collected in the matrix $\Phi$. The model then becomes:<br>
$$y={\Phi}{w}$$
$$\Phi={[{\phi}^{T}({x}_{1}),{\phi}^{T}({x}_{2}),...,{\phi}^{T}({x}_{N})]}^{T}$$
The most commonly used basis functions $\phi$ are the polynomial, the exponential and the sigmoidal.
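
As a minimal sketch of the basis-function idea, the matrix $\Phi$ can be built by hand for a polynomial basis and the weights recovered by ordinary least squares. The cubic basis, the noise-free target and the names `Phi` and `w` are assumptions made for illustration.

```python
import numpy as np

# Noise-free target built from a known polynomial (illustrative assumption)
x = np.linspace(-1.0, 1.0, 50)
y = 1.0 + 2.0 * x - 3.0 * x**2

# Design matrix Phi: row n holds phi(x_n) = (1, x_n, x_n^2, x_n^3)
Phi = np.column_stack([x**j for j in range(4)])

# The model y = Phi w is still linear in w, so least squares applies unchanged
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(np.round(w, 6))
```

The recovered weights match the coefficients used to generate `y`, with a weight close to zero on the unused cubic term.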

## Bayesian Regression

In [2]:
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42) #to initialize the pseudo-random number generator


# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14) #https://matplotlib.org/api/_as_gen/matplotlib.pyplot.rc.html
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "end_to_end_project"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "imagesAM", CHAPTER_ID)#Join one or more path components intelligently. 
#The return value is the concatenation of path and any members of *paths
HOUSING_PATH = "datasets/housing/"

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)#https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html

# Ignore useless warnings (see SciPy issue #5998)
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path) #Read a comma-separated values (csv) file into DataFrame



In [3]:
housing = load_housing_data()

housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])

from sklearn.model_selection import StratifiedShuffleSplit
#Stratified ShuffleSplit cross-validator
#Provides train/test indices to split data in train/test sets.
#random_state is the seed used by the random number generator.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"].values):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
#pandas.DataFrame.loc: access a group of rows and columns by label(s) or a boolean array.

housing = strat_train_set.drop("median_house_value", axis=1) # drop labels for training set
#Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names.
housing_labels = strat_train_set["median_house_value"].copy()


## Organizing and cleaning the data
Handle missing data -> SimpleImputer<br>
Add new attributes from prior knowledge
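
Both steps can be sketched on a toy matrix before applying them to the housing data. The values and the two-column layout (total rooms, households) are hypothetical, and the import assumes Scikit-Learn 0.20 or later.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with columns (total_rooms, households); one value is missing
X = np.array([[100.0, 20.0],
              [np.nan, 10.0],
              [300.0, 50.0],
              [200.0, 40.0]])

# Replace each NaN with the median of its column
imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X)

# Derived attribute from prior knowledge: rooms per household
rooms_per_household = X_filled[:, 0] / X_filled[:, 1]
X_extended = np.c_[X_filled, rooms_per_household]

print(imputer.statistics_)
print(X_extended)
```

The fitted medians are stored in `imputer.statistics_`, so the same values can later be reused on a test set with `transform`.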

In [4]:
try:
    from sklearn.impute import SimpleImputer # Scikit-Learn 0.20+
except ImportError:
    from sklearn.preprocessing import Imputer as SimpleImputer

imputer = SimpleImputer(strategy="median")
housing_num = housing.drop('ocean_proximity', axis=1)
# alternatively: housing_num = housing.select_dtypes(include=[np.number])
imputer.fit(housing_num)
#Fit the imputer on housing_num

from sklearn.preprocessing import FunctionTransformer
rooms_ix, bedrooms_ix, population_ix, household_ix = [
    list(housing.columns).index(col)
    for col in ("total_rooms", "total_bedrooms", "population", "households")]


def add_extra_features(X, add_bedrooms_per_room=True):
    rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
    population_per_household = X[:, population_ix] / X[:, household_ix]
    if add_bedrooms_per_room:
        bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
        return np.c_[X, rooms_per_household, population_per_household,
                     bedrooms_per_room]
    else:
        return np.c_[X, rooms_per_household, population_per_household]

attr_adder = FunctionTransformer(add_extra_features, validate=False,
                                 kw_args={"add_bedrooms_per_room": False})
housing_extra_attribs = attr_adder.fit_transform(housing.values)


from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', FunctionTransformer(add_extra_features, validate=False)),
        ('std_scaler', StandardScaler()), #MinMaxScaler StandardScaler
    ])

housing_num_tr = num_pipeline.fit_transform(housing_num)


try:
    from sklearn.compose import ColumnTransformer
except ImportError:
    from future_encoders import ColumnTransformer # Scikit-Learn < 0.20
try:
    from sklearn.preprocessing import OrdinalEncoder # just to raise an ImportError if Scikit-Learn < 0.20
    from sklearn.preprocessing import OneHotEncoder
except ImportError:
    from future_encoders import OneHotEncoder # Scikit-Learn < 0.20
#Encode categorical integer features as a one-hot numeric array.

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs), #OneHotEncoder or OrdinalEncoder()
    ])

housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared_short=housing_prepared[:10000]
housing_labels_short=housing_labels[:10000]
housing_prepared_short.shape

(10000, 17)
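
The way the column count arises can be sketched on a toy frame: a `ColumnTransformer` sends the numeric columns through one pipeline and one-hot encodes the categorical column, then concatenates the results. The column names and values below are hypothetical, and the imports assume Scikit-Learn 0.20 or later.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame (hypothetical values) with two numeric and one categorical column
df = pd.DataFrame({
    "rooms": [100.0, np.nan, 300.0, 200.0],
    "income": [1.5, 3.0, 4.5, 6.0],
    "proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"],
})

num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

prep = ColumnTransformer([
    ("num", num_pipe, ["rooms", "income"]),
    ("cat", OneHotEncoder(), ["proximity"]),
])

out = prep.fit_transform(df)
# 2 scaled numeric columns + 3 one-hot columns (one per category)
print(out.shape)
```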

## Applying the regression models to the prepared data set

In [5]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn import linear_model
from sklearn.linear_model import ElasticNet
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_squared_error

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)#Fit linear model(Training data,target values).
reg_housing_predictions = lin_reg.predict(housing_prepared) #Predict using the linear model
lin_mse = mean_squared_error(housing_labels, reg_housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse
#This model solves a regression model where the loss function is the linear 
#least squares function and regularization is given by the l2-norm
lin_reg_rid= Ridge(alpha=1) #alpha is the regularization parameter of the error
lin_reg_rid.fit(housing_prepared, housing_labels) #Fit Ridge regression model(training data, target value)
reg_rid_pred=lin_reg_rid.predict(housing_prepared)#
lin_rid_mse = mean_squared_error(housing_labels, reg_rid_pred)
lin_rid_rmse = np.sqrt(lin_rid_mse)
lin_rid_rmse
#Lasso
reg_lass=linear_model.Lasso(alpha=1)
reg_lass.fit(housing_prepared, housing_labels)
reg_lass_pred=reg_lass.predict(housing_prepared)
reg_lass_mse= mean_squared_error(housing_labels, reg_lass_pred)
reg_lass_rmse = np.sqrt(reg_lass_mse)
reg_lass_rmse
#ElasticNet
reg_El=ElasticNet(random_state=42)#Linear regression with combined L1 and L2 priors as regularizer, that is,
#it regularizes with two terms, the lasso (L1) norm and the squared (L2) norm
reg_El.fit(housing_prepared, housing_labels)
reg_El_pred=reg_El.predict(housing_prepared)
lin_El_mse = mean_squared_error(housing_labels, reg_El_pred)
lin_El_rmse = np.sqrt(lin_El_mse)
lin_El_rmse
#kernel.Ridge
#combines ridge regression (linear least squares with l2-norm regularization) with the kernel trick
#reg_kernel=KernelRidge(alpha=1.0)
#reg_kernel.fit(housing_prepared, housing_labels)
#reg_kernel_pred=reg_kernel.predict(housing_prepared)
#lin_ker_mse = mean_squared_error(housing_labels, reg_kernel_pred)
#lin_ker_rmse = np.sqrt(lin_ker_mse)
#lin_ker_rmse

## Applying Cross-Validation

In [10]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error
#A decision tree regressor learns piecewise-constant predictions
#by recursively splitting the feature space
tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing_prepared, housing_labels) #Build a decision tree regressor from the training set

housing_predictions= tree_reg.predict(housing_prepared)#Predict class or regression value for X
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)

#With the decision tree
#cross_val_score: evaluate a score by cross-validation
#scoring: metric used to evaluate the estimator on each fold
#cv: cross-validation splitting strategy (e.g. KFold or StratifiedKFold); an integer sets the number of folds

scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10) #scikit-learn works with a utility function (greater is better), not a cost function (lower is better)
tree_rmse_scores = np.sqrt(-scores)

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

#linear
lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)

#Ridge
lin_rid_scores = cross_val_score(lin_reg_rid, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
lin_rid_rmse_scores = np.sqrt(-lin_rid_scores)
display_scores(lin_rid_rmse_scores)
#lasso
lin_lass_scores = cross_val_score(reg_lass, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
lin_lass_rmse_scores = np.sqrt(-lin_lass_scores)
display_scores(lin_lass_rmse_scores)

#ElasticNet
#lin_El_scores = cross_val_score(reg_El, housing_prepared, housing_labels,
 #                            scoring="neg_mean_squared_error", cv=10)
#lin_El_rmse_scores = np.sqrt(-lin_El_scores)
#display_scores(lin_El_rmse_scores)

#kernel
#lin_ker_scores = cross_val_score(reg_kernel, housing_prepared, housing_labels,
#                             scoring="neg_mean_squared_error", cv=10)
#lin_ker_rmse_scores = np.sqrt(-lin_ker_scores)
#display_scores(lin_ker_rmse_scores)

Scores: [66877.52325028 66608.120256   70575.91118868 74179.94799352
 67683.32205678 71103.16843468 64782.65896552 67711.29940352
 71080.40484136 67687.6384546 ]
Mean: 68828.99948449331
Standard deviation: 2662.7615706103497
Scores: [66876.67072391 66607.87567524 70575.03527812 74172.93549378
 67647.78835804 71102.85236533 64784.19903468 67710.78979273
 71081.17183773 67732.97299799]
Mean: 68829.22915575479
Standard deviation: 2660.851286941816




Scores: [66877.26042194 66608.25364462 70574.78109136 74178.9124137
 67671.53397802 71103.31069075 64782.65577309 67711.23370379
 71079.8170915  67700.68654667]
Mean: 68828.84453554364
Standard deviation: 2662.4068922896417
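
The sign handling used in the cell above can be sketched in isolation. scikit-learn scorers follow a "greater is better" convention, so `neg_mean_squared_error` returns the negated MSE and the sign must be flipped before taking the square root. The synthetic data and the choice of 5 folds are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(42)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(size=300)

# scikit-learn scorers follow "greater is better", so the MSE is returned negated
neg_mse = cross_val_score(LinearRegression(), X, y,
                          scoring="neg_mean_squared_error", cv=5)

# Flip the sign before taking the square root to get one RMSE per fold
rmse = np.sqrt(-neg_mse)
print("RMSE per fold:", rmse)
print("Mean:", rmse.mean(), "Std:", rmse.std())
```

Because the noise in this toy problem has unit standard deviation, the per-fold RMSE should land near 1.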




In [7]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
#A random forest is a meta estimator that fits a number of decision tree regressors on various sub-samples
#of the dataset and uses averaging to improve the predictive accuracy and control over-fitting

forest_reg = RandomForestRegressor(n_estimators=10, random_state=42)
forest_reg.fit(housing_prepared, housing_labels)
housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)

forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                                scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)




Scores: [51481.61843757 48867.18698326 53592.93924497 54917.35753217
 50463.39403515 56776.52132037 51940.08691385 50521.84102811
 55729.5024898  53136.5656865 ]
Mean: 52742.701367175265
Standard deviation: 2412.0829538615826


## Hyperparameter Tuning

In [11]:
from sklearn.model_selection import GridSearchCV

param_grid = [
    # try 12 (3×4) combinations of hyperparameters
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    # then try 6 (2×3) combinations with bootstrap set as False
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
  ]

forest_reg = RandomForestRegressor(random_state=42)
# train across 5 folds, that's a total of (12+6)*5=90 rounds of training 
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error', return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)





#linear
#Not tuned: its parameters are boolean

#Ridge
R_param_grid = [ {'alpha': [0.1,0.2, 0.5, 1]}]
R_grid_search = GridSearchCV(lin_reg_rid, R_param_grid, cv=10,
                           scoring='neg_mean_squared_error', return_train_score=True)
R_grid_search.fit(housing_prepared, housing_labels)

#lasso
La_grid_search = GridSearchCV(reg_lass, R_param_grid, cv=10,
                           scoring='neg_mean_squared_error', return_train_score=True)
La_grid_search.fit(housing_prepared, housing_labels)

best_param=[[grid_search.best_params_],[La_grid_search.best_params_],[R_grid_search.best_params_]]


#elastic
#E_grid_search = GridSearchCV(reg_El, R_param_grid, cv=5,
#                           scoring='neg_mean_squared_error', return_train_score=True)
#E_grid_search.fit(housing_prepared, housing_labels)
# kernel
#K_param_grid = [
#  {'C': [1, 10, 100, 1000], 'kernel': ['linear']}]
#K_grid_search = GridSearchCV(reg_kernel, K_param_grid, cv=5,
#                           scoring='neg_mean_squared_error', return_train_score=True)
#K_grid_search.fit(housing_prepared, housing_labels)





In [12]:
best_param


[[{'max_features': 6, 'n_estimators': 30}], [{'alpha': 1}], [{'alpha': 0.2}]]
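
The search over `alpha` can be reproduced on a small scale as follows. This is a sketch on synthetic data, not the housing set: the data, the 5-fold setting and the name `search` are assumptions, while the candidate values mirror `R_param_grid`.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(42)
X = rng.normal(size=(200, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + 0.5 * rng.normal(size=200)

alphas = [0.1, 0.2, 0.5, 1]  # same candidate values as R_param_grid
search = GridSearchCV(Ridge(), {"alpha": alphas}, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X, y)

# best_params_ holds the candidate with the highest mean score;
# best_score_ is a negated MSE, so it is converted back to an RMSE
print("Best parameters:", search.best_params_)
print("Best CV RMSE:", np.sqrt(-search.best_score_))
```

The per-candidate scores are available in `search.cv_results_`, which helps to see how flat the error is across the grid before trusting a single best value.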