
# Introduction #

Run this cell to set everything up.

In [None]:
# Setup feedback system
from learntools.core import binder
binder.bind(globals())
from learntools.deep_learning_intro.ex1 import *

In this exercise, you'll create a complete machine-learning pipeline for the *Concrete* dataset. One measure of the quality of a concrete formulation is the concrete's *compressive strength*, meaning how much load it can bear. The task is to predict the compressive strength of concrete manufactured according to various formulations.

In [None]:
import pandas as pd

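df = pd.read_csv("../input/fe-course-data/concrete.csv")
# X and y are used in Steps 3 and 4; the target column name is assumed.
X = df.copy()
y = X.pop("CompressiveStrength")
df.head()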

# Step 1 - Define Transformer #

The correct ratio of ingredients in a recipe is often essential to its success. So let's create a few ratio features to add to our dataset.

Define a function transformer that will add the following features to the dataset:
1) the ratio of `"Water"` to `"Cement"`
2) the ratio of `"FineAggregate"` to `"CoarseAggregate"`
3) the ratio of `"FineAggregate"` plus `"CoarseAggregate"` to `"Cement"`


In [None]:
import numpy as np
from sklearn.preprocessing import FunctionTransformer


# YOUR CODE HERE: Define a function to create the ratio features
def make_ratios(X):
    X = X.copy()
    X["FCRatio"] = X["FineAggregate"] / X["CoarseAggregate"]
    X["AggCmtRatio"] = (X["CoarseAggregate"] + X["FineAggregate"]) / X["Cement"]
    X["WtrCmtRatio"] = X["Water"] / X["Cement"]
    return X


# YOUR CODE HERE: Create a FunctionTransformer from the function you just defined
ratio_transformer = FunctionTransformer(make_ratios)


# Check your answer
q_1.check()
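
To see what a ratio transformer of this kind produces, it can help to run it on a tiny frame whose ratios are easy to work out by hand. The following is a minimal sketch, not part of the exercise: the two-row `toy` frame, its values, and the names `add_ratios` and `toy_transformer` are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Hypothetical two-row mix; the values are invented for illustration
toy = pd.DataFrame(
    {
        "Cement": [200.0, 400.0],
        "Water": [100.0, 160.0],
        "CoarseAggregate": [1000.0, 900.0],
        "FineAggregate": [500.0, 900.0],
    }
)


def add_ratios(X):
    X = X.copy()  # leave the caller's frame untouched
    X["WtrCmtRatio"] = X["Water"] / X["Cement"]
    X["FCRatio"] = X["FineAggregate"] / X["CoarseAggregate"]
    X["AggCmtRatio"] = (X["CoarseAggregate"] + X["FineAggregate"]) / X["Cement"]
    return X


# FunctionTransformer wraps a plain function so it can act as a pipeline step
toy_transformer = FunctionTransformer(add_ratios)
toy_out = toy_transformer.fit_transform(toy)
print(toy_out[["WtrCmtRatio", "FCRatio", "AggCmtRatio"]])
```

Because the function copies its input, `toy` itself keeps its original four columns; only the returned frame carries the three ratio columns.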

# Step 2 - Create Pipeline #

Now create a random forest model and construct a pipeline that includes the feature engineering transform `ratio_transformer`.


In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline


# YOUR CODE HERE: Create a random forest model
random_forest = RandomForestRegressor()

# YOUR CODE HERE: Construct a pipeline
pipeline = Pipeline([("ft", ratio_transformer), ("rf", random_forest)])


# Check your answer
q_2.check()
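
A pipeline applies its steps in order, so the model at the end is fit on whatever the transformer returns. The sketch below illustrates this on synthetic data; `demo_X`, `demo_y`, `add_ab_ratio`, and `demo_pipeline` are hypothetical names and have nothing to do with the *Concrete* dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Synthetic data (an assumption for illustration): the target is a ratio
rng = np.random.default_rng(0)
demo_X = pd.DataFrame(
    {"a": rng.uniform(1, 2, 50), "b": rng.uniform(1, 2, 50)}
)
demo_y = demo_X["a"] / demo_X["b"]


def add_ab_ratio(X):
    X = X.copy()
    X["a_over_b"] = X["a"] / X["b"]
    return X


demo_pipeline = Pipeline(
    [
        ("ft", FunctionTransformer(add_ab_ratio)),
        ("rf", RandomForestRegressor(n_estimators=20, random_state=0)),
    ]
)
demo_pipeline.fit(demo_X, demo_y)

# predict() runs the transformer first, so raw two-column input is enough
preds = demo_pipeline.predict(demo_X.head())

# The forest was fit on the transformed frame: two raw columns plus the ratio
n_seen = demo_pipeline.named_steps["rf"].n_features_in_
print(n_seen)
```

Steps are looked up by the names given in the list of tuples, here `"ft"` and `"rf"`.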

# Step 3 - Estimate Performance #

Now estimate the generalization error of this pipeline with 5-fold cross-validation. As in the tutorial, use the scoring metric `"neg_mean_absolute_error"`.


In [None]:
from sklearn.model_selection import cross_val_score


# YOUR CODE HERE: Cross-validate with the transformed features
score = cross_val_score(
    pipeline, X, y, cv=5, scoring="neg_mean_absolute_error"
)
score = -1 * score.mean()
print("With ratios: {:.4f}".format(score))


# Check your answer
q_3.check()
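
The `-1 *` in the cell above is there because scikit-learn scorers follow a higher-is-better convention, so mean absolute error is reported negated. A small sketch on synthetic data shows the sign flip; the data from `make_regression` and the choice of `LinearRegression` are assumptions made only to keep the example fast.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem, unrelated to the Concrete dataset
demo_X, demo_y = make_regression(
    n_samples=100, n_features=4, noise=10.0, random_state=0
)

# One score per fold; each is the negated MAE, so none is positive
raw_scores = cross_val_score(
    LinearRegression(), demo_X, demo_y, cv=5, scoring="neg_mean_absolute_error"
)
print(raw_scores)

# Negate the mean to get an ordinary MAE, where lower is better
mae = -raw_scores.mean()
print("MAE: {:.4f}".format(mae))
```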

# Step 4 - Compare to Baseline #

To check whether the feature engineering led to any improvement, check it against a baseline score. Use the same parameters as before, but this time with `random_forest` instead of `pipeline`.


In [None]:
# YOUR CODE HERE: Cross-validate a baseline score
score = cross_val_score(
    random_forest, X, y, cv=5, scoring="neg_mean_absolute_error"
)
score = -1 * score.mean()
print("Baseline: {:.4f}".format(score))


# Check your answer
q_4.check()
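
One caveat when comparing the two scores: a random forest built without a fixed `random_state` gives slightly different results on every run, so a small gap between the scores may be noise. The sketch below shows one way to make such a comparison repeatable, using synthetic data, a seeded forest, and a shared `KFold` splitter. All names here (`demo_X`, `forest`, `with_ratio`, `cv_mae`) are hypothetical, and this is a suggestion rather than what the exercise checks.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Synthetic data (an assumption): the target is a noisy ratio of two columns
rng = np.random.default_rng(0)
demo_X = pd.DataFrame(
    {"a": rng.uniform(1, 2, 200), "b": rng.uniform(1, 2, 200)}
)
demo_y = demo_X["a"] / demo_X["b"] + rng.normal(0, 0.01, 200)


def add_ab_ratio(X):
    X = X.copy()
    X["a_over_b"] = X["a"] / X["b"]
    return X


# Seeded forest, cloned into the pipeline so both models share settings
forest = RandomForestRegressor(n_estimators=30, random_state=0)
with_ratio = Pipeline(
    [("ft", FunctionTransformer(add_ab_ratio)), ("rf", clone(forest))]
)

# One splitter for both runs, so both models see identical folds
folds = KFold(n_splits=5, shuffle=True, random_state=0)


def cv_mae(model):
    scores = cross_val_score(
        model, demo_X, demo_y, cv=folds, scoring="neg_mean_absolute_error"
    )
    return -scores.mean()


baseline_mae = cv_mae(forest)
ratio_mae = cv_mae(with_ratio)
print("Baseline:    {:.4f}".format(baseline_mae))
print("With ratios: {:.4f}".format(ratio_mae))
```

With the seed and the folds fixed, rerunning either call returns the same number, so any difference between the two scores comes from the added feature.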

# Step 5 - Evaluate #

Based on this baseline score, do you think it would be a good idea to keep these new ratio features? Run the following cell after you've thought about your answer.


In [None]:
# Check your answer
q_5.check()

# Keep Going #