## Program 2

1. This program uses the <i>cal_housing.csv</i> dataset. The first columns are the features and the last column is the target.
2. Split <i>cal_housing.csv</i> into 80% for training and 20% for testing, using the parameters shuffle=True and random_state=0.
3. Using the scikit-learn libraries, fit the following regressions with OLS:
    * Linear
    * Polynomial of degree 2
    * Polynomial of degree 2 with standard scaling
    * Polynomial of degree 2 with robust scaling
    * Polynomial of degree 3
    * Polynomial of degree 3 with standard scaling
    * Polynomial of degree 3 with robust scaling
4. Input
    * File <i>cal_housing.csv</i>
5. Output
    * Summary of the results
6. scikit-learn libraries
    * from sklearn import preprocessing


In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, RobustScaler

In [2]:
data = pd.read_csv('cal_housing.csv')
data

Unnamed: 0,longitude,latitude,housingMedianAge,totalRooms,totalBedrooms,population,households,medianIncome,medianHouseValue
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0
...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25.0,1665.0,374.0,845.0,330.0,1.5603,78100.0
20636,-121.21,39.49,18.0,697.0,150.0,356.0,114.0,2.5568,77100.0
20637,-121.22,39.43,17.0,2254.0,485.0,1007.0,433.0,1.7000,92300.0
20638,-121.32,39.43,18.0,1860.0,409.0,741.0,349.0,1.8672,84700.0


In [3]:
x = data.drop('medianHouseValue', axis=1).values
y = data['medianHouseValue'].values

In [4]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0, shuffle=True)
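
As a quick sanity check of the split arguments, the following minimal sketch applies the same `train_test_split` call to a small synthetic array. The names `X_demo`, `y_demo`, and the 100-row size are assumptions made for illustration and are not part of the assignment data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 3 feature columns, one target per row.
X_demo = np.arange(300).reshape(100, 3)
y_demo = np.arange(100)

# Same arguments as the notebook's split: 20% test, shuffled, fixed seed.
a_train, a_test, b_train, b_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, shuffle=True
)
print(a_train.shape, a_test.shape)

# Repeating the call with the same random_state reproduces the same partition.
a_train2, a_test2, b_train2, b_test2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, shuffle=True
)
print(np.array_equal(a_test, a_test2))
```

Fixing `random_state` is what makes the metrics reported later comparable between runs.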

In [5]:
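def TransformaryEscalar(x,grado,escalador):
    """Expand x to polynomial features of degree `grado`, then scale it with `escalador` (None skips scaling)."""
    x_transformado = x.copy()
    if grado > 1:
        poly = PolynomialFeatures(degree=grado)
        x_transformado = poly.fit_transform(x_transformado)
    # Note: fit_transform refits the scaler on whatever array is passed in,
    # so the training and test splits are each scaled with their own statistics.
    # The None check also avoids a crash when grado == 1 and no scaler is given.
    if escalador is not None:
        x_transformado = escalador.fit_transform(x_transformado)
    return x_transformado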
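
The helper above refits its transformers on every array it receives. A common alternative is to chain the steps with `make_pipeline`, so that the polynomial expansion and the scaler are fitted on the training split only and then reused on the test split. The sketch below illustrates that pattern on synthetic data; the data, the variable names, and the choice of `include_bias=False` are assumptions for illustration, not the notebook's method, and its numbers are not comparable to the results reported in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical synthetic data: 200 samples, 3 features, quadratic target plus noise.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 3))
y_syn = (1.0 + 2.0 * X_syn[:, 0] - X_syn[:, 1] ** 2
         + 0.5 * X_syn[:, 0] * X_syn[:, 2]
         + rng.normal(scale=0.1, size=200))

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0, shuffle=True
)

# Polynomial expansion, then scaling, then OLS. Calling fit() learns every
# step from the training split; predict() only applies transform() to the test split.
pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LinearRegression(),
)
pipe.fit(Xs_train, ys_train)
ys_pred = pipe.predict(Xs_test)

mse_syn = mean_squared_error(ys_test, ys_pred)
r2_syn = r2_score(ys_test, ys_pred)
print(f"MSE: {mse_syn:.4f}  R2: {r2_syn:.4f}")
```

Swapping `StandardScaler()` for `RobustScaler()`, or changing `degree`, would cover the other configurations listed in the assignment.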

In [6]:
mses = []
r2s = []
regresiones = ['Linear','Polynomial of degree 2','Polynomial of degree 2 with standard scaling','Polynomial of degree 2 with robust scaling','Polynomial of degree 3','Polynomial of degree 3 with standard scaling','Polynomial of degree 3 with robust scaling']

In [7]:
def Regresiones(x_train, x_test, y_train, y_test):
    """Fit the linear model and every (degree, scaler) polynomial variant with OLS.

    Appends each MSE and R2 to the module-level lists `mses` and `r2s`,
    so running it more than once accumulates duplicate entries.
    """
    grados = [2,3]
    escaladores = [None,StandardScaler(), RobustScaler()]

    # Baseline: plain linear regression on the raw features.
    model = LinearRegression()
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    mses.append(mse)
    r2s.append(r2)
    #print(f'Linear regression: \t MSE: {mse} \t R2: {r2}')

    # Results are appended per degree, in the order: no scaler, standard, robust.
    for grado in grados:
        for escalador in escaladores:
            # Each split is transformed (and its scaler refit) independently.
            x_train_poly = TransformaryEscalar(x_train,grado,escalador)
            x_test_poly = TransformaryEscalar(x_test,grado,escalador)
            model = LinearRegression()
            model.fit(x_train_poly, y_train)
            y_pred = model.predict(x_test_poly)
            mse = mean_squared_error(y_test, y_pred)
            r2 = r2_score(y_test, y_pred)
            mses.append(mse)
            r2s.append(r2)
            #print(f'Polynomial regression of degree {grado} with scaler {escalador}: \t MSE: {mse} \t R2: {r2}')

    return mses, r2s

            
    

In [8]:
Regresiones(x_train, x_test, y_train, y_test)

([4853781771.947973,
  4076152861.394485,
  4111950340.7116613,
  5047513814.436979,
  4131193919.5275493,
  8862039818.106274,
  8684145318843.362],
 [0.6277645980446445,
  0.6874007794166417,
  0.6846554790037493,
  0.6129073324946555,
  0.6831796933932404,
  0.32037221514529335,
  -664.9850968332794])

In [9]:
final = pd.DataFrame({'Regression':regresiones, 'MSE':mses, 'R2':r2s})
final

Unnamed: 0,Regression,MSE,R2
0,Linear,4853782000.0,0.627765
1,Polynomial of degree 2,4076153000.0,0.687401
2,Polynomial of degree 2 with standard scaling,4111950000.0,0.684655
3,Polynomial of degree 2 with robust scaling,5047514000.0,0.612907
4,Polynomial of degree 3,4131194000.0,0.68318
5,Polynomial of degree 3 with standard scaling,8862040000.0,0.320372
6,Polynomial of degree 3 with robust scaling,8684145000000.0,-664.985097
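
Because MSE is expressed in squared target units, the values above are hard to read directly. The sketch below takes the square root of each reported MSE to obtain an RMSE in the same units as `medianHouseValue`, and picks the configuration with the lowest error. The MSE values are copied from the output of `Regresiones`; the dictionary keys are illustrative labels that follow the order in which the loops append results (for each degree: no scaler, standard, robust).

```python
import math

# MSE values copied from the notebook output, keyed by the order in which
# the loops produce them. The label strings are illustrative assumptions.
mse_by_model = {
    "Linear": 4853781771.947973,
    "Degree 2, no scaling": 4076152861.394485,
    "Degree 2, standard scaling": 4111950340.7116613,
    "Degree 2, robust scaling": 5047513814.436979,
    "Degree 3, no scaling": 4131193919.5275493,
    "Degree 3, standard scaling": 8862039818.106274,
    "Degree 3, robust scaling": 8684145318843.362,
}

# RMSE = sqrt(MSE), expressed in the units of the target column.
rmse_by_model = {name: math.sqrt(mse) for name, mse in mse_by_model.items()}

# Rank configurations from lowest to highest test error.
ranking = sorted(rmse_by_model, key=rmse_by_model.get)
best_model = ranking[0]

for name in ranking:
    print(f"{name:30s} RMSE = {rmse_by_model[name]:,.0f}")
```

On these numbers the ranking by RMSE matches the ranking by R2, since both are computed on the same test split.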
