# The purpose of this notebook is to assemble a pipeline for preparing the data, transforming it and then training a model to predict house prices. 

In this project, we explore the increasingly popular [Ames dataset](https://www.notion.so/Diccionario-de-Datos-y-hints-8f8613b67b4140f1940f67463c4a0ced#bc3273399294410083987b036aef2356). Our goal is to predict the sale price of a house from its other features.

### Imports. 

In [244]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error as mse 
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
import datetime

In [245]:
pd.set_option('display.max_columns', 40)

# Data preprocessing. 

A couple of things had to be done when preparing the data. As explained in the link provided in the first cell, it is best to remove houses with a "Gr Liv Area" of 4,000 square feet or more and houses sold under abnormal sale conditions. Other than that, the rest of this process consisted of iteratively finding which columns to keep and which to drop.
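
As a small illustration of the two row filters described above, here is a sketch on a made-up frame. The rows and prices are invented for the example; only the column names and the two conditions come from this notebook.

```python
import pandas as pd

# Toy frame using the raw column names from this notebook (values are invented).
toy = pd.DataFrame({
    "Gr Liv Area": [1500, 4500, 2200, 1800],
    "Sale Condition": ["Normal", "Normal", "Abnorml", "Normal"],
    "SalePrice": [150_000, 700_000, 120_000, 175_000],
})

# Keep only normal sales with less than 4,000 sq. ft. of living area.
mask = (toy["Sale Condition"] == "Normal") & (toy["Gr Liv Area"] < 4_000)
filtered = toy[mask]
print(filtered)
```

Combining both conditions in one boolean mask gives the same rows as applying the two filters one after the other, which is what `basic_preprocessing` does.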

There are more than 70 columns, so we won't explain in full detail why we chose some and left others out. The main idea was to look for a parsimonious, coherent and logical model to predict house prices. We were perhaps a bit ruthless in dropping most categorical variables because of their potential to increase dimensionality.

In [247]:
#data = pd.read_csv('casas_entrena.csv')
#data.columns.values

In [248]:
# Normalize column names: lowercase, with '/' and spaces replaced by underscores
def clean_column(col):
    return col.lower().replace('/','_').replace(' ', '_')

In [249]:
def replacing_nans(data):
    """
    Fill the NaN values in the columns that are worth keeping.
    Parameters:
    -----------
    data: pandas dataframe
    
    Returns:
    --------
    data: pandas dataframe.
    """
    data["pool_qc"] = data["pool_qc"].fillna("None") #Pool brings useful information
    data["pool_qc"] = np.where(data["pool_qc"] == "None", 0, 1)
    data["fireplace_qu"] = data["fireplace_qu"].fillna("None")
    #Garage information might be useful
    for col in ('garage_type', 'garage_finish', 'garage_qual', 'garage_cond'):
        data[col] = data[col].fillna('None')
    
    return data

In [250]:
def basic_preprocessing(path):
    """
    Basic preprocessing of the training data. 
    Parameters:
    -----------
    path: str
          Path on your machine to the dataset.
    
    Returns:
    --------
    data: pandas dataframe
          A dataframe ready for the sklearn pipeline. 
    """
    data = pd.read_csv(path)
    data = data[data['Sale Condition'] == "Normal"]
    data = data.drop(columns = COLS_TO_DROP)
    data = data[data['Gr Liv Area'] < 4_000]
    data.rename(columns={col: clean_column(col) for col in data.columns.values}, 
                 inplace=True)
    data = replacing_nans(data)
    
    return data 

In [251]:
#this is what it looks like right now
#data = basic_preprocessing('casas_entrena.csv')
#data.head()

In [252]:
#data.shape #this is slightly more parsimonious (compared to 80+ columns!)

# Transforming the data. 

The basic preprocessing above leaves the data a little cleaner and closer to what an ML model needs. Now we will focus on the transformation pipeline to be applied later on.

### Label Encoder. 

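Some categorical features (kitchen quality, for example) have a natural ordering, so it may be useful to transform them with sklearn's LabelEncoder() so that we can manipulate them further and combine them with other features. Note that LabelEncoder assigns integer codes in sorted (alphabetical) label order, which does not necessarily match the quality ranking.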

This has the added benefit of reducing the number of variables in the final model because we are not one-hot-encoding them into n or n-1 new variables. 
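
To make the encoding concrete, here is a minimal sketch comparing `LabelEncoder` with an explicit ordinal mapping. The quality codes and the worst-to-best ranking in `quality_order` are assumptions for the example and should be checked against the data dictionary.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy quality column (values assumed for illustration).
quality = pd.Series(["Ex", "Gd", "TA", "Fa", "Po", "Gd"])

# LabelEncoder sorts the labels, so the codes follow alphabetical order.
le = LabelEncoder()
le_codes = le.fit_transform(quality)
print(dict(zip(le.classes_, range(len(le.classes_)))))

# Explicit mapping from worst to best (assumed ranking, not from the notebook).
quality_order = {"Po": 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}
ordinal_codes = quality.map(quality_order)
print(ordinal_codes.tolist())
```

With the explicit mapping, a larger code always means better quality, which matters if the encoded column is later multiplied with other features.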

In [253]:
def encode_variables(data, cols):
    """
    Function to transform categorical features into integer codes. 
    
    Parameters: 
    -----------
    data: pandas dataframe
    cols: list
          list of cols that you want to transform with the label encoder. 
    
    Returns:
    --------
    
    data: pandas dataframe
          A cleaner, happier dataframe :) 
    
    """
    for col in cols:
        le = LabelEncoder() 
        data[col] = le.fit_transform(list(data[col].values))
    return data

# Interactions. 

This is the fun part. Now we get to add new variables and interactions between some of them. For example, it makes sense to add a variable that shows the interaction between total square feet and the overall quality of the house. 

In [254]:
class CombinedAttributesAdder(BaseEstimator, TransformerMixin): 
    """Append quality and area interaction features to the numeric array."""
    def __init__(self, house_condition = True): 
        self.house_condition = house_condition  # currently unused
    def fit(self, X, y=None):
        return self  # nothing to learn
    def transform(self, X, y=None):
        # `idx` is read from module scope: it must map each column name to
        # its position in the numeric array X that reaches this step.
        qual_squared = X[:, idx["overall_qual"]] ** 2 # quality squared
        qual_m2 = X[:, idx["overall_qual"]] * X[:, idx["lot_area"]] # quality * lot area
        # garage_int = X[:, idx["garage_area"]] * X[:, idx["garage_qual"]]
        total_sf = X[:, idx['total_bsmt_sf']] + X[:, idx['1st_flr_sf']] +\
                   X[:, idx['2nd_flr_sf']]
        qual_sf_total = X[:, idx["overall_qual"]] * total_sf # quality * total sq. feet
        # kitchen_total = X[:, idx["kitchen_qual"]]
        return np.c_[X, qual_squared, qual_m2, total_sf, 
                    qual_sf_total]  
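
As a self-contained sketch of the same idea, the transformer below takes the column positions as constructor arguments instead of reading a shared dictionary. The class name `InteractionAdder`, its parameters and the toy array are hypothetical and are not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class InteractionAdder(BaseEstimator, TransformerMixin):
    """Hypothetical variant that receives column positions explicitly."""
    def __init__(self, qual_idx=0, area_idxs=(1, 2)):
        self.qual_idx = qual_idx
        self.area_idxs = area_idxs

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        qual = X[:, self.qual_idx]
        total_sf = X[:, list(self.area_idxs)].sum(axis=1)
        # Append quality squared, total area, and quality * total area.
        return np.c_[X, qual ** 2, total_sf, qual * total_sf]

# Toy rows: [overall quality, first-floor area, second-floor area]
X_toy = np.array([[5, 800, 700], [8, 1200, 900]])
out = InteractionAdder().fit_transform(X_toy)
print(out)
```

Passing the positions in explicitly keeps the transformer independent of any state defined elsewhere in the notebook.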

### Sklearn pipeline. 

Now it is time to deploy a transformation pipeline to streamline and automate most of the recurring operations.

We learned this approach from Aurélien Géron's book "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow".

In [255]:
numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])

In [256]:
full_pipeline = ColumnTransformer([
    ("numeric", numeric_pipeline, num_attribs),
    ("categorical", OneHotEncoder(handle_unknown='ignore'), cat_attribs),
])
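
To see how the two branches combine, here is a minimal sketch of the same pattern on a made-up frame. The toy values and the names `toy`, `num_cols`, `cat_cols` and `toy_pipeline` are assumptions for the example, and the custom attribute adder is left out to keep it short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data: two numeric columns (one with a gap) and one categorical column.
toy = pd.DataFrame({
    "lot_area": [8000.0, np.nan, 12000.0, 9500.0],
    "overall_qual": [5, 7, 8, 6],
    "garage_type": ["Attchd", "Detchd", "None", "Attchd"],
})
num_cols = ["lot_area", "overall_qual"]
cat_cols = ["garage_type"]

toy_pipeline = ColumnTransformer([
    # Numeric branch: fill gaps with the median, then standardize.
    ("numeric", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("std_scaler", StandardScaler()),
    ]), num_cols),
    # Categorical branch: one column per category.
    ("categorical", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

prepared = toy_pipeline.fit_transform(toy)
print(prepared.shape)
```

The output has one column per numeric feature plus one per category, so dropping categorical columns early directly limits the width of the final matrix.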

### Summing up the preprocessing. 

After the basic preprocessing (which cleans up the data), we transform the numerical and categorical features into arrays ready for the model. We can wrap all of this in one function.

In [257]:
def preparing_data(data, cols_to_encode):
    """
    Function designed to transform a slightly preprocessed dataset into numpy arrays
    fit for the model. This will call the sklearn pipeline we defined above. 
    Parameters:
    -----------
    data: pandas dataframe
          You can plug in the result of basic_preprocessing('casas_entrena.csv')
          
    cols_to_encode: list
                    List of columns you want to transform with the LabelEncoder()
    
    Returns:
    ---------
    X_train_prepared: numpy.ndarray
       Transformed features.
    
    Y_log: pandas Series
       The natural log of the sale prices. 
    
    """
    X = data.iloc[:,:-1].copy() # features: every column except the last, assumed to be saleprice
    Y = data["saleprice"].copy()
    Y_log = np.log(Y) # we train on log prices
    X = encode_variables(X, cols_to_encode)
    # NOTE: the three names below are local to this function, while full_pipeline
    # and CombinedAttributesAdder read num_attribs, cat_attribs and idx from
    # module scope, so those must already be defined there.
    num_attribs = X.select_dtypes(include = "number").columns.values
    cat_attribs = X.select_dtypes(include = 'object').columns.values
    idx = {col: X.columns.get_loc(col) for col in X.columns}
    X_train_prepared = full_pipeline.fit_transform(X)
    
    
    return X_train_prepared, Y_log

In [258]:
cols_to_encode = ['exter_qual', 'exter_cond', 'heating_qc', 'central_air',
                 'kitchen_qual', 'fireplace_qu', 'garage_qual', 'garage_cond']

In [259]:
PATH = 'casas_entrena.csv'

In [260]:
COLS_TO_DROP = ['MS Zoning', 
                'Lot Frontage',
                'Lot Shape',
                'Street',
                'Land Contour',
                'Utilities',
                'Lot Config',
                'Land Slope',
                'Neighborhood',
                'Alley', 
                'Mas Vnr Type', 
                'Mas Vnr Area',          
                'Bsmt Qual', 
                'Bsmt Cond', 
                'Bsmt Exposure', 
                'BsmtFin Type 1',
                'BsmtFin Type 2',
                'Electrical',
                # 'Fireplace Qu',
                # 'Garage Type',
                'Garage Yr Blt', 
                # 'Garage Finish',
                # 'Garage Qual',
                # 'Garage Cond',
                # 'Pool QC',
                'Fence', 
                'Misc Feature',
                'Condition 1', 
                'Condition 2',
                'Exterior 1st', 
                'Exterior 2nd', 
                'Heating', 
                #'Heating QC',
               'Roof Style',
               'Roof Matl',
               'Foundation',
               'Functional',
               'Fireplaces',
               'Paved Drive',  #maybe keep it
               'Year Remod/Add',
               '3Ssn Porch',
               'Pool Area', #we already have a categorical variable for pool
               'Mo Sold',
               'Misc Val',
               'Open Porch SF',
               'BsmtFin SF 2',
               'Wood Deck SF',
               'Enclosed Porch',
               'Screen Porch',
               'Sale Condition',
               'Sale Type'
               ]

In [261]:
X_train, Y_train = preparing_data(basic_preprocessing(PATH), cols_to_encode)

Nice. 

Things I can do later, around October 30:

- Drop more columns (fireplace quality probably has to go, and others like it; 60 columns at the end may still be too many).
- Drop building type as well.
- Try to generate the plot of the training predictions automatically.
- The goal is to get below 7% error.

# Training the model. 

We will use ridge regression (linear regression with L2 regularization) and cross validation to train, adjust and evaluate our model.

In [262]:
def get_initial_scores(X_train, Y_train, CV):
    """
    Just a simple function to get scores from cross validation. 
    
    Parameters:
    -----------
    X_train: Prepared features.
    Y_train: Prepared labels. 
    CV: Number of cross validation folds.
    
    Returns:
    ---------
    None
    """
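    ridge_reg = Ridge()
    # Y_train is already log(price), so this scorer takes a second log (log1p) on top of it.
    scores = cross_val_score(ridge_reg, X_train, Y_train, scoring = "neg_mean_squared_log_error", cv = CV)
    ridge_reg_scores = np.sqrt(-scores) # root of the (positive) mean squared log error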
    print("These are your scores:")
    print()
    print(ridge_reg_scores)

In [263]:
get_initial_scores(X_train, Y_train, 5)

These are your scores:

[0.00858856 0.01053167 0.00746376 0.00841631 0.00740729]
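
Since the labels are already log prices, one alternative worth considering is plain mean squared error on the log scale, whose square root is directly comparable to a relative price error. The sketch below shows this on synthetic data. The data, the coefficients and the names `X_demo` and `y_log_demo` are invented for the example and say nothing about the scores on the real dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
# Synthetic log prices: a linear signal plus a little noise.
coefs = np.array([0.3, -0.2, 0.1, 0.0, 0.05])
y_log_demo = 12 + X_demo @ coefs + rng.normal(scale=0.1, size=200)

# MSE on log(price); sklearn returns it negated, so we flip the sign.
scores = cross_val_score(Ridge(alpha=1.0), X_demo, y_log_demo,
                         scoring="neg_mean_squared_error", cv=5)
rmse_log = np.sqrt(-scores)
print(rmse_log)
```

Each entry is the root mean squared error of one fold, measured in log-price units.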


### Grid Search. 

In [264]:
param_grid = [
    {'alpha': [.001, .01, .1, 1, 10, 15, 20, 25, 30], 
     'solver': ['auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']}
]

In [265]:
def grid_searching(X_train, Y_train, CV):
    """
    Run a cross-validated grid search over param_grid and return the best estimator.
    Same parameters as get_initial_scores.
    """
    ridge = Ridge()
    grid_search = GridSearchCV(ridge, param_grid, cv = CV, scoring = "neg_mean_squared_log_error",
                          return_train_score = True, n_jobs=-1)
    grid_search.fit(X_train, Y_train)
    print("These are your best parameters:")
    print(grid_search.best_params_)
    #return the best model
    return grid_search.best_estimator_

In [228]:
final_model = grid_searching(X_train, Y_train, 10)

These are your best parameters:
{'alpha': 15, 'solver': 'sparse_cg'}
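
For reference, here is a minimal, self-contained sketch of the same grid search pattern on synthetic data. The data and the reduced grid `demo_grid` are assumptions for the example, so the selected alpha has no bearing on the result above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo @ np.array([1.0, 0.5, 0.0, -0.5]) + rng.normal(scale=0.2, size=120)

# A small grid over the regularization strength only.
demo_grid = {"alpha": [0.01, 0.1, 1, 10]}
search = GridSearchCV(Ridge(), demo_grid, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X_demo, y_demo)

# best_score_ is the negated MSE, so flip the sign before the square root.
print(search.best_params_, np.sqrt(-search.best_score_))
```

By default `GridSearchCV` refits the best configuration on the full training data, which is why `best_estimator_` can be used for predictions right away.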


### Final predictions and the submission. 

In [266]:
def clean_test_dataset(path):
    """
    Simple function to prepare the test dataset.
    """
    data = pd.read_csv(path)
    data = data.drop(columns = COLS_TO_DROP)
    data.rename(columns={col: clean_column(col) for col in data.columns.values}, 
                 inplace=True)
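    data = replacing_nans(data) # fill NaNs first, in the same order as for the training data
    data = encode_variables(data, cols_to_encode) # note: this refits the label encoders on the test set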
    return data


def make_final_predictions(path, titulo):
    """
    This function uses the best model as selected above and makes predictions 
    for the test data. 
    """
    X_test = clean_test_dataset(path)
    X_test_prepared = full_pipeline.transform(X_test.iloc[:,:-1]) 
    #the last column is an "ID" column we don't need
    final_predictions = final_model.predict(X_test_prepared)
    predictions_exp = np.exp(final_predictions) #we predicted for ln(y)
    submissions = pd.DataFrame({'id': range(1, len(predictions_exp) + 1), 'SalePrice': predictions_exp})
    submissions.to_csv(titulo) #save the predictions
    print("Your submissions are ready!")

In [267]:
TITULO_MODELO = "30_octubre.csv"
make_final_predictions('casas_prueba.csv', TITULO_MODELO)

Your submissions are ready!
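
The log and exp round trip used above can also be delegated to sklearn. The sketch below compares the manual version with `TransformedTargetRegressor` on synthetic prices. The data and variable names are invented for the example, and this is not what the notebook's pipeline uses.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 3))
prices = np.exp(12 + 0.3 * X_demo[:, 0] + rng.normal(scale=0.05, size=100))

# Manual version: fit on log(price), then exponentiate the predictions.
manual = Ridge().fit(X_demo, np.log(prices))
manual_pred = np.exp(manual.predict(X_demo))

# Same idea, with sklearn applying the transform in both directions.
wrapped = TransformedTargetRegressor(regressor=Ridge(),
                                     func=np.log, inverse_func=np.exp)
wrapped.fit(X_demo, prices)
wrapped_pred = wrapped.predict(X_demo)

print(np.allclose(manual_pred, wrapped_pred))
```

Either way, the predictions come back in the original price units, which is what the submission file needs.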


### Saving the model.

In [235]:
#import joblib
#joblib.dump(final_model, "29_octubre.pkl")
#and then you load it: 
#modelo_cargado = joblib.load("modelo_uno.pkl")
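
A runnable sketch of the save and load round trip, assuming `joblib` is installed (it ships as a scikit-learn dependency). The tiny model and the file name `demo_model.pkl` are hypothetical, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Ridge

# Tiny stand-in model (hypothetical; the real one would be final_model).
demo_model = Ridge(alpha=15).fit(np.array([[0.0], [1.0], [2.0]]),
                                 np.array([0.0, 1.0, 2.0]))

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name
    joblib.dump(demo_model, model_path)      # save the fitted model
    loaded_model = joblib.load(model_path)   # load it back

# The reloaded model should reproduce the original predictions.
same = np.allclose(demo_model.predict([[3.0]]), loaded_model.predict([[3.0]]))
print(same)
```

Note that the file passed to `joblib.load` has to be the same one written by `joblib.dump`.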