## Feature Engineering

Feature engineering consists of selecting, transforming, and creating relevant features from raw data so that the data is better suited to the problem at hand.

### Feature Relationships

For a feature to be useful, it must have a relationship that the model can actually pick up. For example, if a feature follows a linear pattern while the quantity it should be related to follows a polynomial one, you should transform the linear feature into a polynomial one (for instance, by squaring it) so that the two can be related.
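As a minimal sketch of this idea, the snippet below uses hypothetical toy data (not the concrete dataset) in which the target is exactly the square of the feature. The helper `linear_r2` is an illustrative name introduced here. It measures how well a straight line explains the target, using the fact that for a single feature the R^2 of a linear fit equals the squared Pearson correlation.

```python
import numpy as np

# Hypothetical toy data: the target grows with the square of the feature.
x = np.linspace(-3.0, 3.0, 61)
y = x ** 2


def linear_r2(feature, target):
    """R^2 of the best straight-line fit of target on a single feature."""
    return float(np.corrcoef(feature, target)[0, 1] ** 2)


# Raw feature: a straight line cannot follow the curve.
r2_raw = linear_r2(x, y)

# Squared feature: the relationship with the target is now linear.
r2_squared = linear_r2(x ** 2, y)

print(f"R^2 using x:   {r2_raw:.3f}")
print(f"R^2 using x^2: {r2_squared:.3f}")
```

On this toy data the raw feature explains essentially none of the variance, while the squared feature explains all of it. The information was in the feature the whole time, and the transformation only made it visible to a linear fit.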

### Feature Engineering Example

In [10]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

data_frame = pd.read_excel("./data/concreto.xls")
data_frame.head()

| | Cement (component 1)(kg in a m^3 mixture) | Blast Furnace Slag (component 2)(kg in a m^3 mixture) | Fly Ash (component 3)(kg in a m^3 mixture) | Water (component 4)(kg in a m^3 mixture) | Superplasticizer (component 5)(kg in a m^3 mixture) | Coarse Aggregate (component 6)(kg in a m^3 mixture) | Fine Aggregate (component 7)(kg in a m^3 mixture) | Age (day) | Concrete compressive strength(MPa, megapascals) |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28 | 79.986111 |
| 1 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1055.0 | 676.0 | 28 | 61.887366 |
| 2 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 270 | 40.269535 |
| 3 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 365 | 41.05278 |
| 4 | 198.6 | 132.4 | 0.0 | 192.0 | 0.0 | 978.4 | 825.5 | 360 | 44.296075 |


In [15]:
X = data_frame.copy()
y = X.pop("Concrete compressive strength(MPa, megapascals) ")

# Train and score baseline model
baseline = RandomForestRegressor(criterion="absolute_error", random_state=0)
baseline_score = cross_val_score(
    baseline, X, y, cv=5, scoring="neg_mean_absolute_error"
)

mae = -1 * baseline_score.mean()

print(f"Mean absolute error: {mae:.4}")


Mean absolute error: 8.397


### Creating Synthetic Features

Now we will try to create three new features from existing columns.

In [17]:
X = data_frame.copy()
y = X.pop("Concrete compressive strength(MPa, megapascals) ")
"""
Columns:
'Cement (component 1)(kg in a m^3 mixture)',
       'Blast Furnace Slag (component 2)(kg in a m^3 mixture)',
       'Fly Ash (component 3)(kg in a m^3 mixture)',
       'Water  (component 4)(kg in a m^3 mixture)',
       'Superplasticizer (component 5)(kg in a m^3 mixture)',
       'Coarse Aggregate  (component 6)(kg in a m^3 mixture)',
       'Fine Aggregate (component 7)(kg in a m^3 mixture)', 'Age (day)'
"""
# Ratio of fine aggregate to coarse aggregate
X["FCRatio"] = X["Fine Aggregate (component 7)(kg in a m^3 mixture)"] / X["Coarse Aggregate  (component 6)(kg in a m^3 mixture)"]
# Ratio of total aggregate (coarse + fine) to cement
X["AggCmtRatio"] = (X["Coarse Aggregate  (component 6)(kg in a m^3 mixture)"] + X["Fine Aggregate (component 7)(kg in a m^3 mixture)"]) / X["Cement (component 1)(kg in a m^3 mixture)"]
# Ratio of water to cement
X["WtrCmtRatio"] = X["Water  (component 4)(kg in a m^3 mixture)"] / X["Cement (component 1)(kg in a m^3 mixture)"]

model = RandomForestRegressor(criterion="absolute_error", random_state=0)

mae = -1 * cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error").mean()

print(f"Mean absolute error: {mae:.4}")

Mean absolute error: 8.01
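
To put the two printed scores in perspective, the short calculation below compares them directly. It only uses the values reported above (8.397 for the baseline and 8.01 with the three ratio features). The variable names are illustrative and not part of the original notebook.

```python
# Scores copied from the two runs above (mean absolute error, in MPa).
baseline_mae = 8.397
engineered_mae = 8.01

# Absolute and relative reduction in error after adding the ratio features.
absolute_gain = baseline_mae - engineered_mae
relative_gain = absolute_gain / baseline_mae

print(f"Absolute reduction: {absolute_gain:.3f} MPa")
print(f"Relative reduction: {relative_gain:.1%}")
```

The reduction is roughly 0.39 MPa, or about 4.6% of the baseline error. That is a modest gain for three extra columns computed from data already in the table.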
