# Feature Engineering

## Topics we'll go over

- determine which features are the most important with mutual information
- invent new features in several real-world problem domains
- encode high-cardinality categoricals with a target encoding
- create segmentation features with k-means clustering
- decompose a dataset's variation into features with principal component analysis

## WHY Feature Engineering?

In the world of machine learning, we have so may algorithms that are well optimized. The difference between a good model and a bad one is no longer the algorithms since basically we all used the same algorithms. The major difference now is the ability to adapt the data set to suit your needs and draw more information out of a given data set. This process is what we call feature engineering.

Feature engineering can be preformed for the purposes of:

- Improve model performance
- Reduce computation or data needs
- Improve interpretability of the results

## Principle Of Feature Engineering

For a feature to be usedful it must have a relationship type which the algorithm can handle. Example for a linear relationship its better to use a Linear model.

The relationship of the features must relate to the target variable. Example when given data set with the morning temperature and evening temperature and you wish to use this to predict the average huminity of the day, its better to used the average of the moring and evening temperatures instead of one of the temperatures only. Since the average temperature gives a better description of the temperature of the whole day.

Example two, then given a data set of the Length of a square plot and you wish to predict the price of the plot. Creating a feature of the Square of the Length or plot of land will give your model a better relation to study and develop parameters for. Creating an Area feature will produce better relationship of the model to deal with.

The feature engineering process enables you to help the model understand and learn features that it can not other wise learn on its own. We should always think about what information could help the model best learn the data set. This is what the main aim of feature engineering is all about.

In [64]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

In [65]:
df = pd.read_excel('datasets/Concrete_Data.xls')

In [66]:
df.head()

Unnamed: 0,Cement (component 1)(kg in a m^3 mixture),Blast Furnace Slag (component 2)(kg in a m^3 mixture),Fly Ash (component 3)(kg in a m^3 mixture),Water (component 4)(kg in a m^3 mixture),Superplasticizer (component 5)(kg in a m^3 mixture),Coarse Aggregate (component 6)(kg in a m^3 mixture),Fine Aggregate (component 7)(kg in a m^3 mixture),Age (day),"Concrete compressive strength(MPa, megapascals)"
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.986111
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.887366
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.269535
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05278
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.296075


In [67]:
df.columns

Index(['Cement (component 1)(kg in a m^3 mixture)',
       'Blast Furnace Slag (component 2)(kg in a m^3 mixture)',
       'Fly Ash (component 3)(kg in a m^3 mixture)',
       'Water  (component 4)(kg in a m^3 mixture)',
       'Superplasticizer (component 5)(kg in a m^3 mixture)',
       'Coarse Aggregate  (component 6)(kg in a m^3 mixture)',
       'Fine Aggregate (component 7)(kg in a m^3 mixture)', 'Age (day)',
       'Concrete compressive strength(MPa, megapascals) '],
      dtype='object')

In [68]:
df['cement'] = df['Cement (component 1)(kg in a m^3 mixture)']
df['BlastFurnaceSlag'] = df['Blast Furnace Slag (component 2)(kg in a m^3 mixture)']
df['FlyAsh'] = df['Fly Ash (component 3)(kg in a m^3 mixture)']
df['Water'] = df['Water  (component 4)(kg in a m^3 mixture)']
df['Superplasticizer'] = df['Superplasticizer (component 5)(kg in a m^3 mixture)']
df['CoarseAggregate'] = df['Coarse Aggregate  (component 6)(kg in a m^3 mixture)']
df['FineAggregate'] = df['Fine Aggregate (component 7)(kg in a m^3 mixture)']
df['Age'] = df['Age (day)']
df['ConcreteCompressiveStrength'] = df['Concrete compressive strength(MPa, megapascals) ']

In [69]:
df.columns[:9]

Index(['Cement (component 1)(kg in a m^3 mixture)',
       'Blast Furnace Slag (component 2)(kg in a m^3 mixture)',
       'Fly Ash (component 3)(kg in a m^3 mixture)',
       'Water  (component 4)(kg in a m^3 mixture)',
       'Superplasticizer (component 5)(kg in a m^3 mixture)',
       'Coarse Aggregate  (component 6)(kg in a m^3 mixture)',
       'Fine Aggregate (component 7)(kg in a m^3 mixture)', 'Age (day)',
       'Concrete compressive strength(MPa, megapascals) '],
      dtype='object')

In [70]:
for col in df.columns[:9]:
    df.drop([col], axis = 1, inplace = True)

In [71]:
# df.to_csv('datasets/concretdata.csv')

In [72]:
df = pd.read_csv('datasets/concretdata.csv')

In [73]:
df.drop('Unnamed: 0', axis = 1, inplace = True)

In [74]:
df.head()

Unnamed: 0,cement,BlastFurnaceSlag,FlyAsh,Water,Superplasticizer,CoarseAggregate,FineAggregate,Age,ConcreteCompressiveStrength
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.986111
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.887366
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.269535
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05278
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.296075


In [85]:
X = df.copy()
y = X.drop(['ConcreteCompressiveStrength'], axis=1)

In [86]:
baseline = RandomForestRegressor(criterion='mae', random_state=0)

In [87]:
baseline_score = cross_val_score(baseline, X, y, cv=5, scoring = "neg_mean_squared_error")

In [88]:
baseline_score = -1 * baseline_score.mean()

In [89]:
print(f"MAE basemodel score: {baseline_score:.4f}")

MAE basemodel score: 1176.3310


## Create synthetic features

In [94]:
X = df.copy()
y = X.drop(['ConcreteCompressiveStrength'], axis=1)
X["FCRatio"] = X["FineAggregate"] / X["CoarseAggregate"]
X["AggCmtRatio"] = (X["CoarseAggregate"] + X["FineAggregate"]) / X["cement"]
X["WtrCmtRatio"] = X["Water"] / X["cement"]

In [95]:
X

Unnamed: 0,cement,BlastFurnaceSlag,FlyAsh,Water,Superplasticizer,CoarseAggregate,FineAggregate,Age,ConcreteCompressiveStrength,FCRatio,AggCmtRatio,WtrCmtRatio
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.986111,0.650000,3.177778,0.300000
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.887366,0.640758,3.205556,0.300000
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.269535,0.637339,4.589474,0.685714
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.052780,0.637339,4.589474,0.685714
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.296075,0.843724,9.083082,0.966767
...,...,...,...,...,...,...,...,...,...,...,...,...
1025,276.4,116.0,90.3,179.6,8.9,870.1,768.3,28,44.284354,0.883002,5.927641,0.649783
1026,322.2,0.0,115.6,196.0,10.4,817.9,813.4,28,31.178794,0.994498,5.063004,0.608318
1027,148.5,139.4,108.6,192.7,6.1,892.4,780.0,28,23.696601,0.874048,11.261953,1.297643
1028,159.1,186.7,0.0,175.6,11.3,989.6,788.9,28,32.768036,0.797191,11.178504,1.103708


In [96]:
baseline = RandomForestRegressor(criterion='mae', random_state=0)
baseline_score = cross_val_score(baseline, X, y, cv=5, scoring = "neg_mean_squared_error")
baseline_score = -1 * baseline_score.mean()
print(f"MAE basemodel score: {baseline_score:.4f}")

MAE basemodel score: 1035.9032


We can see that the models performance improved abit.