In [3]:
! kaggle datasets download -d dsrivastava2020/concretecsv
! unzip concretecsv.zip
! mkdir -p data && mv concrete.csv data/
! rm -f concretecsv.zip

Downloading concretecsv.zip to /home/fdelca/Documents/KaggleCourses/FeatureEngineering
  0%|                                               | 0.00/11.1k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 11.1k/11.1k [00:00<00:00, 8.99MB/s]
Archive:  concretecsv.zip
  inflating: concrete.csv            


## Check the Dataset

The Concrete dataset contains a variety of concrete formulations and the resulting product's compressive strength, which is a measure of how much load that kind of concrete can bear. The task for this dataset is to predict a concrete's compressive strength given its formulation.

In [4]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

In [5]:
df = pd.read_csv("data/concrete.csv")
df.head()

Unnamed: 0,cement,slag,ash,water,superplastic,coarseagg,fineagg,age,strength
0,141.3,212.0,0.0,203.5,0.0,971.8,748.5,28,29.89
1,168.9,42.2,124.3,158.3,10.8,1080.8,796.2,14,23.51
2,250.0,0.0,95.7,187.4,5.5,956.9,861.2,28,29.22
3,266.0,114.0,0.0,228.0,0.0,932.0,670.0,28,45.85
4,154.8,183.4,0.0,193.3,9.1,1047.4,696.7,28,18.29


## Build a Baseline

Establishing a baseline is good practice at the start of the feature engineering process. A baseline score helps you decide whether your new features are worth keeping, or whether you should discard them and try something else.

In [7]:
X = df.copy()
y = X.pop("strength")

# Train and score baseline model
baseline = RandomForestRegressor(criterion='absolute_error', random_state=0)

baseline_score = cross_val_score(baseline, X, y, cv=5, scoring='neg_mean_absolute_error')
baseline_score = -1 * baseline_score.mean()

print(f"MAE Baseline Score: {round(baseline_score, 4)}")


MAE Baseline Score: 3.4704


### Add New Features: Based on Ratios

If you ever cook at home, you might know that the ratio of ingredients in a recipe is usually a better predictor of how the recipe turns out than their absolute amounts. We might reason, then, that ratios of the features above would be good predictors of `strength`.

In [10]:
X = df.copy()
y = X.pop('strength')

# New synthetic features
X['FCRatio'] = X.fineagg / X.coarseagg
X['AggCmtRatio'] = (X.coarseagg + X.fineagg) / X.cement
X['WtrCmtRatio'] = X.water / X.cement

# Train and score model
model = RandomForestRegressor(criterion='absolute_error', random_state=0)

model_score = cross_val_score(model, X, y, cv=5, scoring='neg_mean_absolute_error')
model_score = -1 * model_score.mean()

print(f"MAE Score with Ratio Features: {round(model_score, 4)}")

MAE Score with Ratio Features: 3.3881


The ratio features give a modest improvement: the MAE drops from 3.4704 to 3.3881, roughly 2.4%. The next chapter introduces other ways to create features that can help the model improve further.
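
As a quick check on how large the gain is, the relative reduction in MAE can be worked out from the two scores printed above. The function name `relative_improvement` is a hypothetical helper introduced for illustration; it is not part of the original notebook.

```python
def relative_improvement(baseline_mae, new_mae):
    """Fraction by which the error shrank relative to the baseline (hypothetical helper)."""
    return (baseline_mae - new_mae) / baseline_mae


# Scores copied from the notebook outputs above
baseline_mae = 3.4704
ratio_features_mae = 3.3881

gain = relative_improvement(baseline_mae, ratio_features_mae)
print(f"Absolute MAE reduction: {baseline_mae - ratio_features_mae:.4f}")
print(f"Relative MAE reduction: {gain:.2%}")
```

A positive value means the new features lowered the error; a negative value would mean they made the model worse than the baseline.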