# The Ames Housing Dataset #

Throughout the exercises you'll complete a feature engineering project with the *Ames Housing* dataset. The *Ames Housing Case Study* bonus lesson reproduces this work on the same dataset used in our *House Prices* Getting Started competition. After completing this course, you'll have a starter notebook ready to extend with ideas of your own.

In this first exercise, you'll do a complete iteration of feature development: understand the dataset, create a baseline model, create a derived feature, and compare performance.

Run this next cell to set everything up!

In [None]:
# Setup feedback system
from learntools.core import binder
binder.bind(globals())
from learntools.feature_engineering_new.ex1 import *

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

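# Set Matplotlib defaults
# "seaborn-whitegrid" was renamed to "seaborn-v0_8-whitegrid" in Matplotlib 3.6
# and the old name was removed in 3.8, so use whichever name is available.
plt.style.use(
    "seaborn-v0_8-whitegrid"
    if "seaborn-v0_8-whitegrid" in plt.style.available
    else "seaborn-whitegrid"
)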
plt.rc("figure", autolayout=True)
plt.rc(
    "axes",
    labelweight="bold",
    labelsize="large",
    titleweight="bold",
    titlesize=14,
    titlepad=10,
)


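def score_dataset(X, y):
    """Return the cross-validated RMSLE of a default XGBRegressor."""
    # Work on a copy so the caller's DataFrame is not modified
    X = X.copy()
    # Label encoding for categoricals
    for colname in X.select_dtypes(["category", "object"]):
        X[colname], _ = X[colname].factorize()
    # Train and score baseline model
    xgb = XGBRegressor()
    score = cross_val_score(
        xgb, X, y, cv=5, scoring="neg_mean_squared_log_error",
    )
    # scikit-learn returns the negated MSLE; flip the sign, then take the
    # square root to get RMSLE (lower is better)
    score = -1 * score.mean()
    score = np.sqrt(score)
    return score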


df = pd.read_csv("../input/fe-course-data/ames.csv")
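
The `score_dataset` helper above scores with `"neg_mean_squared_log_error"` and then takes a square root, so the number it reports is a root mean squared log error (RMSLE). As a rough illustration of what that metric measures, here is a minimal standard-library sketch. The `rmsle` function and the prices are invented for this example and are not part of the course code.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared log error between two equal-length sequences."""
    squared_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2
        for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Made-up sale prices and predictions, for illustration only
actual = [200_000, 150_000, 320_000]
predicted = [210_000, 140_000, 300_000]
print(f"RMSLE: {rmsle(actual, predicted):.4f}")
```

Because the error is computed on logarithms, it reflects relative rather than absolute differences in price. Note that `score_dataset` averages the squared log error over the five cross-validation folds before taking the square root.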

# Understand Data #

The kinds of transformations you can apply to a feature depend on its **statistical type**. Here are several features from the *Ames* data. Decide the statistical type of each feature.

In [None]:
features = [
    "GrLivArea",  # continuous
    "Fireplaces",  # discrete
    "OverallQual",  # ordinal
    "Neighborhood",  # nominal
    "CentralAir",  # binary
]
df[features].head(10)
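
Besides reading the first rows, it can help to look at each column's dtype and number of distinct values when deciding on a statistical type. The sketch below does this on a small toy frame with invented values (not real Ames records); the names `toy` and `summary` are assumptions for this example. With the real data, the same two calls could be applied to `df[features]`.

```python
import pandas as pd

# Invented toy rows standing in for df[features]; not real Ames records
toy = pd.DataFrame({
    "GrLivArea": [1710.0, 1262.5, 1786.0, 1717.0],
    "Fireplaces": [0, 1, 2, 1],
    "Neighborhood": ["A", "B", "A", "C"],
    "CentralAir": ["Y", "Y", "N", "Y"],
})

# One row per feature: storage dtype and count of distinct values
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_unique": toy.nunique(),
})
print(summary)
```

A column with exactly two distinct values is a candidate for binary, and a float column with many distinct values is likely continuous. The dtype alone cannot separate ordinal from nominal or discrete features, which is why the data description is still needed.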

In [None]:
# Hint: Look at this dataset's `DataDescription.txt` file. Find it by scrolling to the bottom of this notebook and opening the `input` folder.

# Visualize Relationships #


# Create a New Feature #

In [None]:
X_1 = df.copy()
y = X_1.pop("SalePrice")

# Total bathrooms counted in half-bath units: a full bath counts as two
X_1["TotalBaths"] = (
    2 * df.FullBath + df.HalfBath
    + 2 * df.BsmtFullBath + df.BsmtHalfBath
)
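
To make the `TotalBaths` arithmetic concrete, here is a small worked example on a single hypothetical house. The `total_baths` function is an illustrative stand-in for the column formula above, and the bathroom counts are made up.

```python
def total_baths(full, half, bsmt_full, bsmt_half):
    """Bathroom count in half-bath units, mirroring the TotalBaths formula."""
    return 2 * full + half + 2 * bsmt_full + bsmt_half


# Hypothetical house: 2 full baths, 1 half bath, 1 basement full bath
units = total_baths(full=2, half=1, bsmt_full=1, bsmt_half=0)
print(units)      # 7 half-bath units
print(units / 2)  # 3.5 bathrooms
```

Weighting full baths by 2 and half baths by 1 gives the same ordering of houses as weighting them by 1 and 0.5, so a tree-based model such as XGBoost should treat the two versions alike.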

# Compare Performance #


In [None]:
X = df.copy()
y = X.pop("SalePrice")

baseline = score_dataset(X, y)
fs_1_score = score_dataset(X_1, y)

Run the next cell to see the two scores.

In [None]:
print(f"Baseline Score: {baseline:.4f}")
print(f"New Feature Set Score: {fs_1_score:.4f}")

# Feature Selection #

Based on the performance of the model with and without the additional feature (a lower RMSLE is better), would you decide to keep or discard the new feature?

In [None]:
# Answer: keep
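
One way to make the keep-or-discard decision explicit is to compute how much the RMSLE dropped, in absolute and relative terms. This is only a sketch: `compare_scores` is a hypothetical helper, and the two scores are placeholder numbers, not results from this notebook. In practice you would pass in `baseline` and `fs_1_score`.

```python
def compare_scores(baseline, candidate):
    """Return (absolute, relative) RMSLE reduction; positive means improvement."""
    diff = baseline - candidate
    return diff, diff / baseline


# Placeholder values for illustration only, not results from this notebook
abs_gain, rel_gain = compare_scores(baseline=0.150, candidate=0.147)
decision = "keep" if abs_gain > 0 else "discard"
print(f"RMSLE reduced by {abs_gain:.4f} ({rel_gain:.1%}): {decision}")
```

A very small reduction may be within the noise of cross-validation, so it is worth treating marginal gains with some caution.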

# Iterating on Feature Sets #

You've just worked through a complete iteration of feature development: discovery, creation, validation, and selection. You'll likely go through many such iterations before arriving at your final, best feature set.

In the next lesson, you'll learn about *feature utility*, which you can use during the discovery step to identify features with the most potential.

# Keep Going #