
# Introduction #

Run this cell to set everything up.

In [None]:
SEED = 31415

# Setup feedback system
from learntools.core import binder
binder.bind(globals())
from learntools.feature_engineering_new.ex3 import *

In this exercise, you'll use another housing dataset, the *California Housing* dataset. Rather than describing individual homes, its features describe an average or median within a census block, and you'll be predicting the median house value within a block. (A "census block" is a tract of land containing a few hundred to a few thousand people, as defined by the US Census Bureau.)

Run the next cell to set up the data.

In [None]:
import pandas as pd
from IPython.display import display

df = pd.read_csv('../input/fe-course-data/housing.csv')
display(df.head())
X = df.copy()
y = X.pop("MedHouseVal")

# Step 2 - Define Transforms #

The features in the *California Housing* dataset are all numeric, which makes the dataset a good candidate for PCA. Define three transforms:
1. a `PolynomialFeatures` instance to create interaction features
2. a `StandardScaler` to standardize the data
3. a `PCA` instance that retains 95% of the variance in the dataset

In [None]:
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.decomposition import PCA


# YOUR CODE HERE: Define the feature transforms
polynomial_features = PolynomialFeatures(
    degree=2, interaction_only=True, include_bias=False,
)
scaler = StandardScaler()
pca = PCA(n_components=0.95)


# Check your answer
q_2.check()
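
To see what these three transforms do, here is a minimal sketch on synthetic data. The array `X_demo` and the `*_demo` names are made up for illustration and are not part of the exercise. It counts the columns that `PolynomialFeatures` produces with `interaction_only=True` and shows how a fractional `n_components` lets `PCA` choose the number of components for you.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (an assumption, not the housing data):
# 200 rows, 4 numeric features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))

# interaction_only=True keeps the original columns and adds pairwise products,
# but no squared terms: 4 originals + 6 pairwise products = 10 columns.
poly_demo = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_poly = poly_demo.fit_transform(X_demo)
n_poly_features = X_poly.shape[1]

# Standardize so that no column dominates PCA purely because of its scale.
X_scaled = StandardScaler().fit_transform(X_poly)

# A float n_components between 0 and 1 asks PCA to keep the smallest number
# of components whose cumulative explained variance exceeds that fraction.
pca_demo = PCA(n_components=0.95).fit(X_scaled)
n_kept = pca_demo.n_components_
variance_kept = pca_demo.explained_variance_ratio_.sum()

print("Columns after PolynomialFeatures:", n_poly_features)
print("Components kept by PCA:", n_kept)
print("Variance retained: {:.3f}".format(variance_kept))
```

After fitting, `n_components_` holds the number of components PCA actually kept, which depends on how correlated the input columns are.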

# Step 3 - Define Pipeline #

Now create the complete pipeline you'll use for prediction, using XGBoost as before. Pay attention to the order in which the transforms occur in the pipeline.

In [None]:
from xgboost import XGBRegressor
from sklearn.pipeline import make_pipeline

# YOUR CODE HERE: Create the XGBRegressor, then define the complete pipeline 
xgb = XGBRegressor()
pipeline = make_pipeline(
    polynomial_features, scaler, pca, xgb,
)


# Check your answer
q_3.check()
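
One reason the order matters is that PCA is sensitive to feature scale, so standardizing should come before it. The sketch below illustrates this on synthetic data. `X_demo`, `y_demo`, and `demo_pipeline` are hypothetical names, and `LinearRegression` is only a lightweight stand-in for `XGBRegressor`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (an assumption): three independent features
# whose scales differ by orders of magnitude.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3)) * np.array([1.0, 10.0, 1000.0])
y_demo = X_demo[:, 0] + 0.1 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

# make_pipeline applies the steps in the order given and names each step
# after its lowercased class name.
demo_pipeline = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    StandardScaler(),
    PCA(n_components=0.95),
    LinearRegression(),  # stand-in for XGBRegressor in this sketch
)
demo_pipeline.fit(X_demo, y_demo)
step_names = [name for name, _ in demo_pipeline.steps]
print(step_names)

# Without scaling, the largest-scale feature dominates the first component.
unscaled_ratio = PCA().fit(X_demo).explained_variance_ratio_[0]

# With scaling first, the variance is spread across the components.
X_demo_scaled = StandardScaler().fit_transform(X_demo)
scaled_ratio = PCA().fit(X_demo_scaled).explained_variance_ratio_[0]

print("First-component share, unscaled: {:.4f}".format(unscaled_ratio))
print("First-component share, scaled:   {:.4f}".format(scaled_ratio))
```

On unscaled data the first component mostly reproduces the single largest-scale feature, so a 95% variance threshold would throw away nearly everything else.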

# Step 4 - Estimate Performance #

Now evaluate the complete pipeline with 5-fold cross-validation, scoring it by mean absolute error (MAE).

In [None]:
from sklearn.model_selection import cross_val_score


# YOUR CODE HERE: Cross-validate
score = cross_val_score(
    pipeline, X, y, cv=5, scoring='neg_mean_absolute_error'
)
score = -1 * score.mean()

print("Score: {:.4f}".format(score))


# Check your answer
q_4.check()
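
The sign flip above is needed because scikit-learn scorers follow a "higher is better" convention, so `'neg_mean_absolute_error'` returns the negated MAE for each fold. Here is a minimal sketch of that convention on synthetic data, where `X_demo` and `y_demo` are made up for illustration and `LinearRegression` stands in for the full pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption): a noisy linear target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=100)

# cross_val_score returns one score per fold. With this scorer, each score
# is the negated MAE on that fold's held-out data.
fold_scores = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=5, scoring="neg_mean_absolute_error"
)

# Negate to recover the MAE, then average across folds.
mae_per_fold = -fold_scores
mean_mae = mae_per_fold.mean()

print("Raw fold scores:", fold_scores)
print("Mean MAE: {:.4f}".format(mean_mae))
```

Looking at the per-fold values, not only their mean, gives a rough sense of how much the estimate varies from one split to another.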

# Keep Going #