In [1]:
from __future__ import annotations

import os
import os.path as P
import typing

import numpy as np
import pandas as pd
import sklearn
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_log_error
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import (
    FunctionTransformer,
    MinMaxScaler,
    OneHotEncoder,
    StandardScaler,
)

# Data Preprocessing Strategy

In the last notebook, after our Exploratory Data Analysis, we applied a simple data preprocessing strategy and built a baseline model based on a [Ridge regressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html). In this notebook we'll focus on enhancing the preprocessing pipeline so that it makes full use of the available data. At the end we'll check how our new features correlate with each other, and measure the effectiveness of the new approach by training a new Ridge regressor model.

## Data Loading

Here we just load our previously cleaned data.

In [2]:
preprocessed_dataset_root_dir = P.join(P.dirname(P.abspath("")), "data", "processed")

In [3]:
df_file = P.join(preprocessed_dataset_root_dir, "sp_sales_data.parquet")

features = pd.read_parquet(df_file)
features

Unnamed: 0,bairro,tipo_imovel,area_util,banheiros,suites,quartos,vagas_garagem,anuncio_criado,preco_venda,taxa_condominio,iptu_ano
0,Jardim da Saude,Casa de dois andares,388.0,3.0,1.0,4.0,6.0,2017-02-07,700000,,
1,Vila Santa Teresa (Zona Sul),Casa,129.0,2.0,1.0,3.0,2.0,2016-03-21,336000,,
2,Vila Olimpia,Apartamento,80.0,2.0,1.0,3.0,2.0,2018-10-26,739643,686.0,1610.0
3,Pinheiros,Apartamento,94.0,1.0,0.0,3.0,2.0,2018-05-29,630700,1120.0,489.0
4,Vila Santa Clara,Condominio,110.0,1.0,1.0,3.0,2.0,2018-04-16,385000,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
88742,Vila Carmosina,Apartamento,48.0,1.0,0.0,2.0,1.0,2017-10-07,171150,244.0,0.0
88743,Bela Vista,Apartamento,60.0,1.0,,1.0,1.0,2017-12-13,251999,273.0,86.0
88744,Liberdade,Apartamento,53.0,2.0,1.0,2.0,1.0,2018-11-28,249782,210.0,0.0
88745,Vila Lageado,Apartamento,20.0,3.0,2.0,3.0,2.0,2019-02-06,623000,,


We'll also isolate our target feature (`preco_venda`).

In [4]:
prices = features.pop("preco_venda")

display(features)
display(prices)

Unnamed: 0,bairro,tipo_imovel,area_util,banheiros,suites,quartos,vagas_garagem,anuncio_criado,taxa_condominio,iptu_ano
0,Jardim da Saude,Casa de dois andares,388.0,3.0,1.0,4.0,6.0,2017-02-07,,
1,Vila Santa Teresa (Zona Sul),Casa,129.0,2.0,1.0,3.0,2.0,2016-03-21,,
2,Vila Olimpia,Apartamento,80.0,2.0,1.0,3.0,2.0,2018-10-26,686.0,1610.0
3,Pinheiros,Apartamento,94.0,1.0,0.0,3.0,2.0,2018-05-29,1120.0,489.0
4,Vila Santa Clara,Condominio,110.0,1.0,1.0,3.0,2.0,2018-04-16,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...
88742,Vila Carmosina,Apartamento,48.0,1.0,0.0,2.0,1.0,2017-10-07,244.0,0.0
88743,Bela Vista,Apartamento,60.0,1.0,,1.0,1.0,2017-12-13,273.0,86.0
88744,Liberdade,Apartamento,53.0,2.0,1.0,2.0,1.0,2018-11-28,210.0,0.0
88745,Vila Lageado,Apartamento,20.0,3.0,2.0,3.0,2.0,2019-02-06,,


0         700000
1         336000
2         739643
3         630700
4         385000
          ...   
88742     171150
88743     251999
88744     249782
88745     623000
88746    1820000
Name: preco_venda, Length: 88747, dtype: int64

## Sklearn's set_output API

The whole data preprocessing pipeline is designed with modularity in mind: each processing step is a Sklearn **transformer** that can be combined or stacked with the others. For this we'll use one of Sklearn's newest and most useful features, introduced in [version 1.2](https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_1_2_0.html): the ability to output Pandas dataframes from any transformer, through the [set_output API](https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_set_output.html#introducing-the-set-output-api).

We can activate this feature by calling the `set_output(transform="pandas")` method of any transformer, or by setting it globally, by running the following cell.

In [5]:
sklearn.set_config(transform_output="pandas")

## Basic Feature Imputing

First of all, we have to deal with the missing values. The approach is pretty basic:
- For `taxa_condominio`, `suites`, `vagas_garagem`, `quartos` and `banheiros`, it makes sense that a missing value represents the literal absence of this feature, so we'll fill it with 0.
- For `iptu_ano` and `area_util`, absence isn't a sensible reading, so we'll treat missing values as data collection errors and, lacking a better plan, fill them with the **mean of the corresponding feature**.

In [6]:
imputing_transformer = make_column_transformer(
    (
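        # strategy="constant" is needed for fill_value to be used (the default is "mean")
        SimpleImputer(strategy="constant", fill_value=0.0),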
        [
            "taxa_condominio",
            "suites",
            "vagas_garagem",
            "quartos",
            "banheiros"
        ]
    ), (
        SimpleImputer(strategy="mean"),
        ["iptu_ano", "area_util"]
    ),
    remainder="passthrough",
    verbose_feature_names_out=False
)
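
One detail worth checking on toy data is how `SimpleImputer` treats `fill_value`. To my understanding it is only used with `strategy="constant"`, while the default strategy is `"mean"`. The small array below is made up for the check.

```python
import numpy as np
from sklearn.impute import SimpleImputer

col = np.array([[1.0], [np.nan], [3.0]])

# With strategy="constant", the gap is filled with fill_value.
zero_filled = SimpleImputer(strategy="constant", fill_value=0.0).fit_transform(col)

# Without it, the default strategy="mean" applies and fill_value is not used.
mean_filled = SimpleImputer(fill_value=0.0).fit_transform(col)

print(np.asarray(zero_filled).ravel())  # [1. 0. 3.]
print(np.asarray(mean_filled).ravel())  # [1. 2. 3.]
```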

The use of [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) needs little explanation. The `remainder="passthrough"` parameter of [make_column_transformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html) tells it to output the remaining features (those not listed for any transformer) as they are. As for `verbose_feature_names_out`: if `True`, every feature name is prefixed with the name of the transformer that generated it. If `False`, names are left unprefixed (just be careful and make sure that all names are unique).

Now let's use our first pipeline to check what it outputs.

In [7]:
imputing_transformer.fit_transform(features)

Unnamed: 0,taxa_condominio,suites,vagas_garagem,quartos,banheiros,iptu_ano,area_util,bairro,tipo_imovel,anuncio_criado
0,987.804458,1.000000,6.0,4.0,3.0,734.084966,388.0,Jardim da Saude,Casa de dois andares,2017-02-07
1,987.804458,1.000000,2.0,3.0,2.0,734.084966,129.0,Vila Santa Teresa (Zona Sul),Casa,2016-03-21
2,686.000000,1.000000,2.0,3.0,2.0,1610.000000,80.0,Vila Olimpia,Apartamento,2018-10-26
3,1120.000000,0.000000,2.0,3.0,1.0,489.000000,94.0,Pinheiros,Apartamento,2018-05-29
4,0.000000,1.000000,2.0,3.0,1.0,0.000000,110.0,Vila Santa Clara,Condominio,2018-04-16
...,...,...,...,...,...,...,...,...,...,...
88742,244.000000,0.000000,1.0,2.0,1.0,0.000000,48.0,Vila Carmosina,Apartamento,2017-10-07
88743,273.000000,1.062818,1.0,1.0,1.0,86.000000,60.0,Bela Vista,Apartamento,2017-12-13
88744,210.000000,1.000000,1.0,2.0,2.0,0.000000,53.0,Liberdade,Apartamento,2018-11-28
88745,987.804458,2.000000,2.0,3.0,3.0,734.084966,20.0,Vila Lageado,Apartamento,2019-02-06


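As you can see, the resulting dataframe preserves all feature names. One caveat about the output above: the missing `taxa_condominio` and `suites` entries show up as column means (987.80 and 1.06), not 0, which is what `SimpleImputer` does when `fill_value` is given without `strategy="constant"`.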

This is all the imputing we need to execute. Now we'll create some new features.

# Feature Engineering

We'll keep following the modular approach, and every new feature will be created using [Sklearn's FunctionTransformer API](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html). For this, we just need to define some transformation functions that take a dataframe as input and deliver the transformed dataframe as output.

## Area Scores

We'll create two scores based on the property's area.
- `condominio_per_area`: the condominium tax billed per unit of property area;
- `iptu_per_area`: the annual property tax billed per unit of property area.

In [8]:
def add_area_scores(X: pd.DataFrame) -> pd.DataFrame:
    area_scores = pd.DataFrame()
    
    area_scores["condominio_per_area"] = X["taxa_condominio"] / X["area_util"]
    area_scores["iptu_per_area"] = X["iptu_ano"] / X["area_util"]
    
    return area_scores

As you can see, I decided that each step returns only the newly created features. This is what we call a **functional transformer**. I found this approach the best in terms of interpretability and debugging, as every feature creation function works independently of the others. Soon I'll show how we can use this function to compose the final dataframe.
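
To see what "return only the new features" means in practice, the sketch below runs the same function on a made-up two-row dataframe and then glues the new columns onto the original ones with plain `pd.concat`. The concatenation is only a stand-in for the composition step used later in the pipeline, not the pipeline's actual mechanism.

```python
import pandas as pd


def add_area_scores(X: pd.DataFrame) -> pd.DataFrame:
    # Same logic as the function above: return only the new columns.
    area_scores = pd.DataFrame()
    area_scores["condominio_per_area"] = X["taxa_condominio"] / X["area_util"]
    area_scores["iptu_per_area"] = X["iptu_ano"] / X["area_util"]
    return area_scores


# Toy rows (hypothetical values).
toy = pd.DataFrame(
    {
        "taxa_condominio": [500.0, 0.0],
        "iptu_ano": [1000.0, 200.0],
        "area_util": [100.0, 50.0],
    }
)

new_cols = add_area_scores(toy)
combined = pd.concat([toy, new_cols], axis="columns")

print(list(new_cols.columns))  # ['condominio_per_area', 'iptu_per_area']
print(combined.loc[0, "condominio_per_area"])  # 5.0
```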

## Neighborhood Scores

We'll also create scores based on the per-area values of the property's neighborhood.
- `neighboor_condominio_per_area`: the average condominium tax per area across the neighborhood's properties.
- `neighboor_iptu_per_area`: the average annual property tax per area across the neighborhood's properties.

This time we'll create our transformer a bit differently.

In [9]:
class NeighboorsScores(BaseEstimator, TransformerMixin):
    def __init__(self) -> None:
        super().__init__()

        self.neighboors_metrics = None
        self.mean_condominium_per_area = None
        self.mean_iptu_per_area = None

    def fit(
        self, X: pd.DataFrame, y: typing.Optional[typing.Any] = None
    ) -> NeighboorsScores:
        required_cols = X[["bairro", "condominio_per_area", "iptu_per_area"]].copy()
        required_cols["bairro"] = required_cols["bairro"].apply(self._normalize_str)
        self.neighboors_metrics = required_cols.groupby("bairro").mean()
        self.neighboors_metrics.columns = [
            "neighboor_condominio_per_area",
            "neighboor_iptu_per_area",
        ]

        means = self.neighboors_metrics.mean().to_dict()
        self.mean_condominium_per_area = means["neighboor_condominio_per_area"]
        self.mean_iptu_per_area = means["neighboor_iptu_per_area"]

        return self

    def transform(
        self, X: pd.DataFrame, y: typing.Optional[typing.Any] = None
    ) -> pd.DataFrame:
        neighs_scores = X[["bairro", "condominio_per_area", "iptu_per_area"]].copy()
        neighs_scores["bairro"] = neighs_scores["bairro"].apply(self._normalize_str)

        joined_df = neighs_scores.join(
            self.neighboors_metrics, on="bairro", how="left"
        )

        result = pd.DataFrame()
        result["neighboor_condominio_per_area"] = joined_df[
            "neighboor_condominio_per_area"
        ].fillna(self.mean_condominium_per_area)
        result["neighboor_iptu_per_area"] = joined_df[
            "neighboor_iptu_per_area"
        ].fillna(self.mean_iptu_per_area)

        return result

    def _normalize_str(self, string: str) -> str:
        return string.lower().strip()

OK. That one was a bit tricky! I'll try to answer the questions that may come to mind when first looking at this class.

### Why do we need a class instead of a plain function?
Unlike the **function transformer**, this one needs to retain some kind of **memory**: we need to keep the neighborhoods' condominium and IPTU data so that it can be applied to new data.

### Why do we subclass from Sklearn's BaseEstimator?
By deriving from [BaseEstimator](https://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html), we make our transformer compatible with Sklearn's API by providing a common interface to get the transformer's parameters (through the `get_params` method) and to set them (through the `set_params` method).

### Why do we subclass from Sklearn's TransformerMixin?
By deriving from [TransformerMixin](https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html), our transformer follows the common interface for training (the `fit` method) and for transforming new data (the `transform` method), both of which we implement ourselves. As a bonus, we also get the `fit_transform` method, which fits the transformer and transforms the training data in one go.
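
A stripped-down sketch can show what the two base classes contribute. `MultiplyBy` is a hypothetical toy transformer, not part of this project: it only implements `__init__`, `fit` and `transform`, and inherits the rest.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MultiplyBy(BaseEstimator, TransformerMixin):
    """Hypothetical toy transformer: multiplies every value by `factor`."""

    def __init__(self, factor: float = 2.0) -> None:
        # Constructor arguments stored under the same name are what
        # get_params / set_params operate on.
        self.factor = factor

    def fit(self, X, y=None):
        return self  # nothing to learn in this toy case

    def transform(self, X):
        return np.asarray(X) * self.factor


t = MultiplyBy()
print(t.get_params())  # {'factor': 2.0}   (inherited from BaseEstimator)

t.set_params(factor=3.0)
out = np.asarray(t.fit_transform(np.array([[1.0], [2.0]])))  # inherited from TransformerMixin
print(out.ravel())  # [3. 6.]
```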

### How does this transformer work?
1. In the training phase, we compute the average condominium and IPTU taxes per area for each neighborhood and store them in a table. We also compute a fallback value for each score: the average of those neighborhood averages.

2. In the transforming phase, we join that table with the new data using the neighborhood name as key, and fill missing values (neighborhoods we have no data for) with the fallback averages.
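
The two phases above can be sketched in a few lines of plain pandas. The data is made up, only one of the two scores is shown, and `Series.map` stands in for the `join` used in the class, purely for brevity.

```python
import pandas as pd

# Toy training data (hypothetical values).
train = pd.DataFrame(
    {
        "bairro": ["pinheiros", "pinheiros", "liberdade"],
        "condominio_per_area": [10.0, 14.0, 4.0],
    }
)
# New data: one known neighborhood and one never seen during training.
new = pd.DataFrame({"bairro": ["pinheiros", "mooca"]})

# "Fit": one average per neighborhood, plus a fallback value.
per_bairro = train.groupby("bairro")["condominio_per_area"].mean()
fallback = per_bairro.mean()  # mean of the neighborhood means: (12 + 4) / 2

# "Transform": look up each row's neighborhood; unseen ones get the fallback.
scores = new["bairro"].map(per_bairro).fillna(fallback)
print(scores.tolist())  # [12.0, 8.0]
```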

## Feature Transformation

With all features (old and new ones) in hand, we'll prepare them to be properly used to train our models. This step involves:
- **Transforming nominal features into numerical ones:** Most of the machine learning models that we'll design only accept numbers. The `tipo_imovel` feature, on the other hand, holds text values that name the property type. For this we'll apply the [One-hot Encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) technique: a new column is created for each `tipo_imovel` value, and each column holds a 0/1 flag that tells whether the property is of the corresponding type.
- **Standard Scaling:** Differences in value scales between features greatly affect most machine learning algorithms. Thus, each numeric feature will be transformed using the [Standard Scaling](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) method: for a given feature, we subtract its average from each value and then divide by its standard deviation.

As before, this step will be implemented using Sklearn's API. Let's proceed to do it.

## Putting Everything Together

In [10]:
feature_engineering_pipeline = make_pipeline(
    make_column_transformer(
        (
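            # strategy="constant" is needed for fill_value to be used (the default is "mean")
            SimpleImputer(strategy="constant", fill_value=0.0),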
            ["taxa_condominio", "suites", "vagas_garagem", "quartos", "banheiros"],
        ),
        (SimpleImputer(strategy="mean"), ["iptu_ano", "area_util"]),
        remainder="passthrough",
        verbose_feature_names_out=False,
    ),
    make_union("passthrough", FunctionTransformer(add_area_scores)),
    FunctionTransformer(
        lambda X: X.drop(["taxa_condominio", "iptu_ano"], axis="columns")
    ),
    make_union("passthrough", NeighboorsScores()),
    FunctionTransformer(lambda X: X.drop(["bairro", "anuncio_criado"], axis="columns")),
    make_column_transformer(
        (OneHotEncoder(sparse_output=False), ["tipo_imovel"]),
        (
            StandardScaler(),
            [
                "area_util",
                "condominio_per_area",
                "iptu_per_area",
                "neighboor_condominio_per_area",
                "neighboor_iptu_per_area",
            ],
        ),
        remainder="passthrough",
        verbose_feature_names_out=False,
    ),
)

As you can see, with all feature engineering functions defined, it's easy to arrange the steps we've mentioned as a sequential pipeline. Note the use of [FunctionTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html) to wrap our previously created functions so they can be integrated into the pipeline. **FunctionTransformer** is also used to create simple, stateless transformations, e.g. dropping columns.

Now, to use our pipeline, we must train it. After that, it can transform any new data it receives as input. We can accomplish these two steps in one go with the `fit_transform` method.
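
The difference between training and transforming is easier to see on a toy pipeline. The sketch below uses a made-up two-step pipeline (not the one defined above): it is fitted on one array and then applied to unseen values, reusing the statistics learned during `fit`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [np.nan], [3.0]])  # mean of the observed values: 2.0
unseen = np.array([[2.0], [np.nan]])

toy_pipeline = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
toy_pipeline.fit(train)  # learns the imputation mean and the scaling statistics

# Both unseen rows map to 0: 2.0 equals the training mean, and the missing
# value is imputed with that same training mean before scaling.
out = np.asarray(toy_pipeline.transform(unseen))
print(out.ravel())  # [0. 0.]
```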

In [11]:
transformed_features = feature_engineering_pipeline.fit_transform(features)

transformed_features

Unnamed: 0,tipo_imovel_Apartamento,tipo_imovel_Casa,tipo_imovel_Casa de dois andares,tipo_imovel_Cobertura,tipo_imovel_Condominio,tipo_imovel_Flat,tipo_imovel_Kitnet,tipo_imovel_Predio Residencial,area_util,condominio_per_area,iptu_per_area,neighboor_condominio_per_area,neighboor_iptu_per_area,suites,vagas_garagem,quartos,banheiros
0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,4.039322,-0.031827,-0.015712,3.180273,0.073782,1.000000,6.0,4.0,3.0
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.229577,-0.010645,-0.000007,-0.216482,0.135286,1.000000,2.0,3.0,2.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.491186,-0.006842,0.059671,1.011119,-0.096446,1.000000,2.0,3.0,2.0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.285254,0.006999,-0.002027,0.616780,-0.083665,0.000000,2.0,3.0,1.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,-0.049903,-0.042378,-0.023535,-0.325976,-0.147722,1.000000,2.0,3.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88742,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.961888,-0.021312,-0.023535,-0.338419,-0.162822,0.000000,1.0,2.0,1.0
88743,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.785375,-0.023522,-0.017609,-0.139360,-0.105350,1.062818,1.0,1.0,1.0
88744,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.888341,-0.025958,-0.023535,-0.189627,-0.062737,1.000000,1.0,2.0,2.0
88745,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.373752,0.162300,0.128216,-0.181446,-0.077305,2.000000,2.0,3.0,3.0
