# Project: house price prediction

**Team: XX;  
Members: A, B, C;  
Unique IDs: CU-A, CU-B, CU-C.**

Start: 22/02/2021. Due: 03/03/2021 at 2:30 pm, by email, labeled `[DL-2021] Equipo XX`.

In this mini project we will build a model that predicts house sale prices. The data come from a
[Kaggle](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) competition. The included code downloads the data, builds a baseline model, and writes the results in the submission format.
What you need to do is:

1. Develop a better predictive model. You will find some hints at the end of this file.
1. Submit your results to Kaggle and take a screenshot of the _score_ your predictions obtain.

<!--  Please include a screenshot of your score -->
![](score.png)

Some points to keep in mind:

1. Start as early as possible. Hyperparameter selection is slow, and Kaggle only accepts a limited number of submissions per day.
1. Work as a team.

Your grade will depend on your position on the _Leaderboard_ relative to the other teams in the class.
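
Because Kaggle submissions are limited, hyperparameters are usually compared locally on a validation score. The following is a minimal sketch of an exhaustive grid search. The `grid_search` helper and the `fake_validation_error` objective are hypothetical and are not part of the project code; in practice the objective would return the average validation error for the given `lr`, `weight_decay` and `batch_size`.

```python
from itertools import product


def grid_search(evaluate, grid):
    """Return (best_params, best_score) minimizing `evaluate` over the grid."""
    names = list(grid)
    best_params, best_score = None, float("inf")
    # Try every combination of the candidate values.
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Placeholder objective (assumption): stands in for a real validation error.
def fake_validation_error(lr, weight_decay, batch_size):
    return abs(lr - 0.1) + weight_decay + abs(batch_size - 64) / 1000


grid = {"lr": [0.01, 0.1, 1.0], "weight_decay": [0.0, 0.1], "batch_size": [32, 64]}
best_params, best_score = grid_search(fake_validation_error, grid)
print(best_params, best_score)
```

The number of combinations grows multiplicatively with each added hyperparameter, so a coarse grid is a reasonable starting point.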

## Importing libraries and modules

#### Standard library imports

In [1]:
import os
import requests
import zipfile
import tarfile
import hashlib
import sys
import json

#### Third party imports

In [2]:
%matplotlib inline
# %config Completer.use_jedi = False

from d2l import tensorflow as d2l
import tensorflow as tf
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer

#### Local application imports

In [3]:
sys.path.append("../../")

In [4]:
%load_ext autoreload
%autoreload 2

from utils.rob.dlp_ml import(
    loss,
    k,
    num_epochs,
    lr,
    weight_decay,
    batch_size,
    
    num_pipeline,
    cat_pipeline,
    
    get_net,
    log_rmse,
    train,
    get_k_fold_data,
    k_fold,
    train_and_pred
)

from utils.rob.dlp_params import (
    features_dict
)

from utils.rob.dlp_funcs import (
    clean_data,
    transform_target_var,
    features_to_pipes,
    features_dictrionary
)

In [5]:
## Name of the feature that will be predicted
for feat in features_dict:
    if features_dict[feat]["ml_label"]:
        predict_feature = feat

predict_feature

'SalePrice'

---

## Downloading and reading the data

In [6]:
DATA_HUB = dict()
DATA_URL = 'http://d2l-data.s3-accelerate.amazonaws.com/'

In [7]:
def download(name, cache_dir=os.path.join('.', 'data')):  #@save
    """Download a file inserted into DATA_HUB, return the local filename."""
    assert name in DATA_HUB, f"{name} does not exist in {DATA_HUB}."
    url, sha1_hash = DATA_HUB[name]
    os.makedirs(cache_dir, exist_ok=True)
    fname = os.path.join(cache_dir, url.split('/')[-1])
    if os.path.exists(fname):
        sha1 = hashlib.sha1()
        with open(fname, 'rb') as f:
            while True:
                data = f.read(1048576)
                if not data:
                    break
                sha1.update(data)
        if sha1.hexdigest() == sha1_hash:
            return fname  # Hit cache
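    print(f'Downloading {fname} from {url}...')
    r = requests.get(url, stream=True, verify=True)
    r.raise_for_status()  # Fail early instead of caching an error page
    with open(fname, 'wb') as f:
        # Write in chunks so the whole file is never held in memory
        for chunk in r.iter_content(chunk_size=1048576):
            f.write(chunk)
    return fname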
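
The cache check in `download` compares the SHA-1 digest of the local file with the expected one. As an illustration only, that check can be isolated into two small helpers; `file_sha1` and `is_cached` are hypothetical names that do not exist in the project code.

```python
import hashlib
import os
import tempfile


def file_sha1(path, chunk_size=1048576):
    """Return the SHA-1 hex digest of a file, reading it in chunks."""
    sha1 = hashlib.sha1()
    with open(path, "rb") as f:
        # iter() with a sentinel stops as soon as read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha1.update(chunk)
    return sha1.hexdigest()


def is_cached(path, expected_sha1):
    """True when the file exists and its digest matches the expected one."""
    return os.path.exists(path) and file_sha1(path) == expected_sha1


# Small self-check on a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
demo_path = tmp.name
demo_ok = is_cached(demo_path, hashlib.sha1(b"hello").hexdigest())
demo_bad = is_cached(demo_path, "0" * 40)
os.remove(demo_path)
print(demo_ok, demo_bad)
```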


The following registers, for each dataset, the URL to download it from and its expected SHA-1 hash.

In [8]:
DATA_HUB['kaggle_house_train'] = (  
    DATA_URL + 'kaggle_house_pred_train.csv',
    '585e9cc93e70b39160e7921475f9bcd7d31219ce')

DATA_HUB['kaggle_house_test'] = (  
    DATA_URL + 'kaggle_house_pred_test.csv',
    'fa19780a7b011d9b009e8bff8e99922a8ee2eb90')

In [9]:
train_data = pd.read_csv(download('kaggle_house_train'))
test_data = pd.read_csv(download('kaggle_house_test'))

In [None]:
train_data

---

## Data processing

### Initial cleaning

#### Training data

In [None]:
## Cleaning data based on definitions dictionary.
housingc = clean_data(train_data)

## Adding training labels to cleaned data.
housingc = housingc.join(train_data[predict_feature])

## Transforming the target variable
housingc = transform_target_var(housingc)

## Training labels
housingc_labs = housingc[predict_feature]

## Training data
housingc.drop(predict_feature, axis=1, inplace=True)

#### Test data

In [None]:
## Cleaning data based on definitions dictionary.
housingc_test = clean_data(test_data)

### Processing with pipelines

In [None]:
housingc_num, housingc_cat = features_to_pipes(features_dict)
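
`features_to_pipes` is defined in a local module that is not shown here. Judging only from the structure of `features_dict` printed in the notes, one plausible way to split the features into numeric and categorical lists is sketched below. The function name `split_features_by_pipeline`, the rule of excluding the label, and the example entries are assumptions and may differ from the actual implementation.

```python
def split_features_by_pipeline(features):
    """Return (numeric, categorical) lists of relevant, non-label features."""
    numeric, categorical = [], []
    for name, meta in features.items():
        # Skip features flagged as irrelevant and the prediction target.
        if not meta["relevant"] or meta["ml_label"]:
            continue
        if meta["pipeline"] == "num_pipe":
            numeric.append(name)
        elif meta["pipeline"] == "cat_pipe":
            categorical.append(name)
    return numeric, categorical


# Reduced example dictionary (illustrative values only).
example_features = {
    "LotArea": {"relevant": True, "ml_label": False, "pipeline": "num_pipe"},
    "MSZoning": {"relevant": True, "ml_label": False, "pipeline": "cat_pipe"},
    "Street": {"relevant": False, "ml_label": False, "pipeline": "cat_pipe"},
    "SalePrice": {"relevant": True, "ml_label": True, "pipeline": "num_pipe"},
}
num_cols, cat_cols = split_features_by_pipeline(example_features)
print(num_cols, cat_cols)
```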

In [None]:
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, housingc_num),
    ("cat", cat_pipeline, housingc_cat),
])

#### Training data

In [None]:
housingc_prp = full_pipeline.fit_transform(housingc)

train_features = tf.constant(housingc_prp, dtype=tf.float32)

train_labels = tf.constant(housingc_labs, dtype=tf.float32)

#### Test data

In [None]:
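## Reuse the transformations fitted on the training data; refitting on the
## test set would change the scaling and may change the encoded columns.
housingc_test_prp = full_pipeline.transform(housingc_test)

test_features = tf.constant(housingc_test_prp, dtype=tf.float32)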
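
Statistics learned by a preprocessing step are normally estimated on the training data and then reused unchanged on the test data. The sketch below illustrates this with a hypothetical `Standardizer` class written in plain Python; it is not the project's `num_pipeline`, and the numbers are made up.

```python
from statistics import mean, pstdev


class Standardizer:
    """Toy scaler: learns mean and standard deviation, then applies them."""

    def fit(self, values):
        self.mean_ = mean(values)
        # Guard against a zero standard deviation for constant columns.
        self.std_ = pstdev(values) or 1.0
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]


train_area = [1000.0, 1500.0, 2000.0]
test_area = [1500.0, 3000.0]

# Fit once on the training values, then transform both sets with the
# same statistics so that they share one scale.
scaler = Standardizer().fit(train_area)
train_scaled = scaler.transform(train_area)
test_scaled = scaler.transform(test_area)
print(test_scaled)
```

Had the scaler been refitted on `test_area`, the value 1500.0 would no longer map to 0.0, and the model would receive inputs on a different scale from the one it was trained on.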

## Training
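
The training step relies on K-fold cross-validation: the data are split into `k` parts, and each part is used once for validation while the rest are used for training. The `k_fold` and `get_k_fold_data` helpers live in a local module that is not shown here, so the following is only a sketch of the index split, with the hypothetical name `k_fold_indices`. It assigns any leftover samples to the last fold, which is an assumption and may not match the project's helper.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, valid_idx) pairs for k contiguous folds."""
    fold_size = n_samples // k
    for i in range(k):
        start = i * fold_size
        # The last fold absorbs any remainder so every sample is used once.
        stop = n_samples if i == k - 1 else start + fold_size
        valid_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, valid_idx


folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, valid_idx in folds:
    print(valid_idx)
```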

In [None]:
train_l, valid_l = k_fold(k, train_features, train_labels, num_epochs, lr, weight_decay, batch_size)

print(f'{k}-fold validation: avg train log rmse: {float(train_l):f}, '
      f'avg valid log rmse: {float(valid_l):f}')
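
The reported metric is a root-mean-squared error computed on logarithms of prices, so that relative errors count equally for cheap and expensive houses. The project's `log_rmse` is imported from a local module and may differ (for example, if the target has already been transformed); the version below is a plain-Python sketch that assumes raw prices and clips predictions below 1.

```python
import math


def log_rmse(predictions, labels):
    """Root-mean-squared error between log(predictions) and log(labels)."""
    # Clip predictions to at least 1 so the logarithm stays defined.
    clipped = [max(p, 1.0) for p in predictions]
    squared = [(math.log(p) - math.log(y)) ** 2 for p, y in zip(clipped, labels)]
    return math.sqrt(sum(squared) / len(squared))


# Illustrative values only: perfect predictions give an error of zero.
score = log_rmse([100000.0, 200000.0], [100000.0, 200000.0])
print(score)
```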

---

## Prediction

In [None]:
train_and_pred(train_features, test_features, train_labels, test_data, num_epochs, lr, weight_decay, batch_size)
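
The Kaggle competition expects a CSV file with an `Id` column and a `SalePrice` column. How `train_and_pred` writes that file is not shown here; the following sketch uses a hypothetical `write_submission` helper and made-up prices only to illustrate the format.

```python
import csv
import io


def write_submission(ids, prices, out):
    """Write (Id, SalePrice) rows to a file-like object in CSV format."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Id", "SalePrice"])
    for house_id, price in zip(ids, prices):
        writer.writerow([house_id, f"{price:.2f}"])


# An in-memory buffer stands in for open("submission.csv", "w", newline="").
buffer = io.StringIO()
write_submission([1461, 1462], [169000.0, 187500.5], buffer)
submission_text = buffer.getvalue()
print(submission_text)
```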

---

## Notes

#### General study of the features

In [16]:
lx = [
    "SalePrice",
    "OverallQual",
    "GrLivArea",
    "GarageCars",
    "GarageArea",
    "TotalBsmtSF",
    "1stFlrSF",
    "FullBath",
    "YearBuilt",
    "YearRemodAdd",
    "GarageYrBlt",
    "TotRmsAbvGrd",
    "Fireplaces",
    "MasVnrArea",
    "BsmtFinSF1",
    "LotFrontage",
    "WoodDeckSF",
    "OpenPorchSF",
    "2ndFlrSF",
    "HalfBath",
]

In [22]:
len(lx)

20

In [17]:
features_dict

{'Id': {'relevant': False,
  'data_default_type': 'int64',
  'data_obj_type': 'int64',
  'ml_label': False,
  'pipeline': 'num_pipe'},
 'MSSubClass': {'relevant': False,
  'data_default_type': 'int64',
  'data_obj_type': 'int64',
  'ml_label': False,
  'pipeline': 'num_pipe'},
 'MSZoning': {'relevant': True,
  'data_default_type': 'object',
  'data_obj_type': 'category',
  'ml_label': False,
  'pipeline': 'cat_pipe'},
 'LotFrontage': {'relevant': False,
  'data_default_type': 'float64',
  'data_obj_type': 'float64',
  'ml_label': False,
  'pipeline': 'num_pipe'},
 'LotArea': {'relevant': True,
  'data_default_type': 'int64',
  'data_obj_type': 'int64',
  'ml_label': False,
  'pipeline': 'num_pipe'},
 'Street': {'relevant': False,
  'data_default_type': 'object',
  'data_obj_type': 'category',
  'ml_label': False,
  'pipeline': 'cat_pipe'},
 'Alley': {'relevant': False,
  'data_default_type': 'object',
  'data_obj_type': 'category',
  'ml_label': False,
  'pipeline': 'cat_pipe'},
 'LotS

In [18]:
lx = ["SalePrice", "OverallQual",
"GrLivArea",
"GarageCars",
"GarageArea",
"TotalBsmtSF",
"1stFlrSF",
"FullBath",
"YearBuilt",
"YearRemodAdd",
"GarageYrBlt",
"TotRmsAbvGrd",
"Fireplaces",
"MasVnrArea",
"BsmtFinSF1",
"LotFrontage",
"WoodDeckSF",
"OpenPorchSF",
"2ndFlrSF",
"HalfBath"]

## Mark as relevant only the features listed in lx
for feat in features_dict:
    features_dict[feat]["relevant"] = feat in lx

In [19]:
features_dict

{'Id': {'relevant': False,
  'data_default_type': 'int64',
  'data_obj_type': 'int64',
  'ml_label': False,
  'pipeline': 'num_pipe'},
 'MSSubClass': {'relevant': False,
  'data_default_type': 'int64',
  'data_obj_type': 'int64',
  'ml_label': False,
  'pipeline': 'num_pipe'},
 'MSZoning': {'relevant': False,
  'data_default_type': 'object',
  'data_obj_type': 'category',
  'ml_label': False,
  'pipeline': 'cat_pipe'},
 'LotFrontage': {'relevant': True,
  'data_default_type': 'float64',
  'data_obj_type': 'float64',
  'ml_label': False,
  'pipeline': 'num_pipe'},
 'LotArea': {'relevant': False,
  'data_default_type': 'int64',
  'data_obj_type': 'int64',
  'ml_label': False,
  'pipeline': 'num_pipe'},
 'Street': {'relevant': False,
  'data_default_type': 'object',
  'data_obj_type': 'category',
  'ml_label': False,
  'pipeline': 'cat_pipe'},
 'Alley': {'relevant': False,
  'data_default_type': 'object',
  'data_obj_type': 'category',
  'ml_label': False,
  'pipeline': 'cat_pipe'},
 'Lot

In [21]:
len([feat for feat in features_dict if features_dict[feat]["relevant"]])

20

#### Playing with features dictionary

---