
# Introduction #

In the tutorial we saw how to create *interaction features* using `PolynomialFeatures` from scikit-learn. Now you'll apply this technique to a dataset with both categorical and numeric features and incorporate the transform into an XGBoost pipeline.

Run this cell to set everything up.

In [None]:
SEED = 31415

# Setup feedback system
from learntools.core import binder
binder.bind(globals())
from learntools.feature_engineering_new.ex2 import *

In this exercise, you'll create interaction features for the *Ames Housing* dataset. Your task will be to predict the `'Sale_Price'` of a home from features describing its design, history, and location. Some of these features are categorical, like the `'Neighborhood'` the house is in, while others are numeric, like the number of `'Fireplaces'` the house contains.

Run the next cell to set up the dataset.

In [None]:
import pandas as pd
from IPython.display import display

df = pd.read_csv('../input/fe-course-data/ames.csv')
display(df.head())

CATEGORICALS = [
    "Neighborhood",
    "House_Style",
    "Street",
    "Utilities",
    "Heating",
    "Central_Air",
]
NUMERICS = [
    "Year_Built",
    "Lot_Area",
    "Gr_Liv_Area",
    "Full_Bath",
    "Half_Bath",
    "Bedroom_AbvGr",
    "TotRms_AbvGrd",
    "Fireplaces",
    "Garage_Area",
    "Mo_Sold",
]

X = df[CATEGORICALS + NUMERICS]
y = df["Sale_Price"]

# Step 1 - Define Transformers #

Start by defining three transformers:
1. a `OneHotEncoder` applied only to the `CATEGORICALS` (use `make_column_transformer`)
2. a `PolynomialFeatures` transformer to create the interaction features
3. a `VarianceThreshold` to filter out any empty (constant) features
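
To see why the third transformer is useful, here is a minimal sketch on made-up toy data (the array below is hypothetical, not taken from Ames). Two one-hot columns that come from the same categorical feature are never both 1, so their interaction is zero in every row, and `VarianceThreshold` with its default threshold removes that constant column.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical toy data: columns a and b are one-hot indicators of the
# same categorical feature; column n is a numeric feature.
toy = np.array([
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 3.0],
    [1.0, 0.0, 5.0],
])

# Pairwise products only: output columns are a, b, n, a*b, a*n, b*n.
expanded = PolynomialFeatures(
    degree=2, interaction_only=True, include_bias=False
).fit_transform(toy)

# a*b is zero in every row, so it has zero variance and is filtered out.
filtered = VarianceThreshold().fit_transform(expanded)

print(expanded.shape)  # 6 columns before filtering
print(filtered.shape)  # 5 columns after filtering
```

On the real dataset the same thing happens at a larger scale: every pair of one-hot columns from the same categorical produces an all-zero interaction.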

In [None]:
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import PolynomialFeatures, OneHotEncoder
from sklearn.compose import make_column_transformer


# YOUR CODE HERE: define the three transformers
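one_hot_encoder = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), CATEGORICALS),
    # Keep the NUMERICS columns; the default remainder='drop' would discard them
    remainder='passthrough',
)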
interaction_features = PolynomialFeatures(
    degree=2,
    interaction_only=True,
    include_bias=False,
)
no_variance_filter = VarianceThreshold()


# Check your answer
q_1.check()

# Step 2 - Create Pipeline #

Now create the pipeline you'll use for prediction. You'll use XGBoost to predict the home prices. Be sure to put the transformers in the correct order, with `xgboost` at the end.
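
One way to double-check the order of a pipeline is to list its steps. The sketch below is only an illustration: it uses `LinearRegression` as a stand-in for the final estimator so that it runs without XGBoost installed, and `demo_pipeline` is a hypothetical name, not part of the exercise.

```python
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Stand-in pipeline: transformers first, estimator last.
demo_pipeline = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    VarianceThreshold(),
    LinearRegression(),  # stand-in for the XGBoost model
)

# make_pipeline names each step after its lowercased class name,
# and .steps preserves the order in which the steps will run.
step_names = [name for name, _ in demo_pipeline.steps]
print(step_names)
```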

In [None]:
from sklearn.pipeline import make_pipeline
from xgboost import XGBRegressor


# YOUR CODE HERE: Create the xgboost model
xgboost = XGBRegressor()

# YOUR CODE HERE: Create the pipeline with transformers and model
pipeline = make_pipeline(
    one_hot_encoder,
    interaction_features,
    no_variance_filter,
    xgboost,
)


# Check your answer
q_2.check()

# Step 3 - Estimate Performance #

Now estimate the generalization error of the `pipeline` model with 5-fold cross-validation. As before, use `'neg_mean_absolute_error'` for the scoring metric.

In [None]:
from sklearn.model_selection import cross_val_score

# YOUR CODE HERE
score = cross_val_score(
    pipeline, X, y, cv=5, scoring='neg_mean_absolute_error'
)
score = -1 * score.mean()
print("MAE with interactions: {:.4f}".format(score))


# Check your answer
q_3.check()

# Step 4 - Evaluate #

What could you do to determine whether adding the interaction features improved the performance of XGBoost on this dataset? Can you think of any other stateless transforms that might be useful? After you've thought about it, run the next cell for some discussion.

In [None]:
# Check your answer (run this cell for credit!)
q_4.check()

```
# Solution: Check it against a baseline, that is, without the interaction features.
# We could try taking ratios of rooms (bathrooms to total rooms) or rooms to living area.
```
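
The baseline check described in the solution could be sketched roughly as follows. This is an assumption-laden illustration, not the course's own code: it uses synthetic data (`X_demo`, `y_demo`), a hypothetical helper `cv_mae`, and scikit-learn's `GradientBoostingRegressor` as a stand-in for XGBoost so that it runs without extra dependencies. With the real data you would score a pipeline without the interaction step against the pipeline from Step 2.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (hypothetical): the target depends on a product of features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo[:, 0] * X_demo[:, 1] + 0.1 * rng.normal(size=200)


def cv_mae(model, X, y):
    """Return the mean absolute error averaged over 5 cross-validation folds."""
    scores = cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_absolute_error"
    )
    return -scores.mean()  # scores are negated MAE, so flip the sign


# Baseline: the model alone, without interaction features.
baseline = GradientBoostingRegressor(random_state=0)

# Candidate: the same model with interaction features added first.
with_interactions = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    GradientBoostingRegressor(random_state=0),
)

baseline_mae = cv_mae(baseline, X_demo, y_demo)
interaction_mae = cv_mae(with_interactions, X_demo, y_demo)
print("Baseline MAE:          {:.4f}".format(baseline_mae))
print("MAE with interactions: {:.4f}".format(interaction_mae))
```

Whichever score is lower on the real data tells you whether the interaction features helped; using the same folds and the same metric for both models keeps the comparison fair.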

# Keep Going #