# The Ames Housing Dataset #

In the exercises you'll complete a feature engineering project with the [*Ames Housing*](https://www.kaggle.com/marcopale/housing) dataset. The *House Prices Getting Started* bonus lesson reproduces this work on the same dataset used in our *House Prices* Getting Started competition. After completing this course, you'll have a starter notebook ready to extend with ideas of your own.

In this first exercise, you'll do a complete iteration of feature development: understand the dataset, create a baseline model, create a derived feature, and compare performance.

Run this next cell to set everything up!

In [None]:
# Setup feedback system
from learntools.core import binder
binder.bind(globals())
from learntools.feature_engineering_new.ex1 import *

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

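# Set Matplotlib defaults
# (Matplotlib 3.6+ names this style "seaborn-v0_8-whitegrid"; the old
# "seaborn-whitegrid" alias was removed in later releases)
plt.style.use("seaborn-v0_8-whitegrid")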
plt.rc("figure", autolayout=True)
plt.rc(
    "axes",
    labelweight="bold",
    labelsize="large",
    titleweight="bold",
    titlesize=14,
    titlepad=10,
)


def score_dataset(X, y):
    # Work on a copy so the caller's DataFrame is not modified
    X = X.copy()
    # Label encoding for categoricals
    for colname in X.select_dtypes(["category", "object"]):
        X[colname], _ = X[colname].factorize()
    # Train and score baseline model
    xgb = XGBRegressor()
    # Metric for Housing competition is RMSLE (Root Mean Squared Log Error)
    score = cross_val_score(
        xgb, X, y, cv=5, scoring="neg_mean_squared_log_error",
    )
    score = -1 * score.mean()
    score = np.sqrt(score)
    return score


df = pd.read_csv("../input/fe-course-data/ames.csv")

Here are five features from the *Ames* dataset.

In [None]:
features = [
    "OverallQual",
    "CentralAir",
    "GrLivArea",
    "Neighborhood",
    "Fireplaces",
]
df[features].head(10)

# 1) Determine Data Types

Can you name the data type of each of these features? Record your answers as a `string`, one of: `"continuous"`, `"discrete"`, `"ordinal"`, `"nominal"`, or `"binary"`.

In [None]:
# YOUR CODE HERE: Enter your answer as a string, like:
# feature = "continuous"
overall_qual = ____
central_air = ____
gr_liv_area = ____
neighborhood = ____
fireplaces = ____


# Check your answer
q_1.check()

In [None]:
# Lines below will give you a hint or solution code
#_COMMENT_IF(PROD)_
q_1.hint()
#_COMMENT_IF(PROD)_
q_1.solution()

In [None]:
#%%RM_IF(PROD)%%
overall_qual = "nominal"
central_air = "ordinal"
gr_liv_area = "continuous"
neighborhood = "discrete"
fireplaces = "continuous"

q_1.assert_check_failed()

In [None]:
#%%RM_IF(PROD)%%
overall_qual = 1
central_air = "ordinal"
gr_liv_area = False
neighborhood = None
fireplaces = "disc"

q_1.assert_check_failed()

In [None]:
#%%RM_IF(PROD)%%
overall_qual = "ordinal"
central_air = "binary"
gr_liv_area = "continuous"
neighborhood = "nominal"
fireplaces = "discrete"

q_1.assert_check_passed()

Often, the author of a dataset will tell you the intended types and values of each feature in the dataset's documentation. You can see the author's documentation of this dataset by running the cell below:

In [None]:
# Uncomment and run to see data documentation
#_UNCOMMENT_IF(PROD)_
#!cat "../input/fe-course-data/DataDocumentation.txt"
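
When documentation is missing or incomplete, the stored dtype and the number of distinct values in each column can offer a first hint about feature types. Here is a minimal sketch on a made-up frame; the column names and values are hypothetical and are not taken from *Ames*. Treat the output as a clue only: a small number of unique values suggests a categorical or discrete feature, but it doesn't prove one.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "LotArea": [8450.0, 9600.5, 11250.0, 9550.25],
    "Rooms": [3, 4, 3, 5],
    "HasGarage": ["Y", "N", "Y", "Y"],
    "Zone": ["A", "B", "C", "A"],
})

# Collect the stored dtype and the count of distinct values per column
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_unique": toy.nunique(),
})
print(summary)
```

A column with exactly two distinct values, like `HasGarage` here, is a candidate binary feature, while a float column with many distinct values is more likely continuous.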

-------------------------------------------------------------------------------

Effective feature engineering makes use of prominent relationships in the dataset. Data visualization is one of the best ways to discover these relationships. Now you'll use Seaborn to discover some important things about the *Ames* data. (Check out our [Data Visualization](https://www.kaggle.com/learn/data-visualization) course, too!)

You can see the relationship a feature has to the target with a *scatterplot*. Take a look at scatterplots for `YearBuilt` and `MoSold` relative to `SalePrice`:

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(10, 5), sharey=True)
axs[0] = sns.scatterplot(x="YearBuilt", y="SalePrice", data=df, ax=axs[0])
axs[1] = sns.scatterplot(x="MoSold", y="SalePrice", data=df, ax=axs[1])

# 2) Discover Relationships

Does there appear to be a significant relationship between either feature and the target? If so, does the relationship appear to be linear (best fit with a line)?

After you've thought about your answer, run the following cell for a solution.

In [None]:
# View the solution (Run this cell to receive credit!)
q_2.check()
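
A scatterplot gives a visual impression; a correlation coefficient can put a rough number on it. The sketch below uses synthetic data (the column names `linear_x`, `noise_x`, and `target` are made up for illustration) to contrast a feature that drives the target linearly with one that is unrelated to it. Keep in mind that Pearson correlation only measures *linear* association, so a low value doesn't rule out a nonlinear relationship.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic features: one drives the target, one is unrelated
toy = pd.DataFrame({
    "linear_x": rng.uniform(0, 10, n),
    "noise_x": rng.integers(1, 13, n),  # values 1..12, like a month number
})
toy["target"] = 3.0 * toy["linear_x"] + rng.normal(0, 1, n)

# Pearson correlation of each feature with the target
corr = toy.corr()["target"].drop("target")
print(corr.round(2))
```

On this toy data, `linear_x` should show a correlation close to 1 and `noise_x` one close to 0.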

-------------------------------------------------------------------------------

The number of bathrooms in a home is often important to prospective home-buyers. The *Ames* data contains four such features:
- `FullBath`
- `HalfBath`
- `BsmtFullBath`
- `BsmtHalfBath`

# 3) Create a New Feature

Create a new feature `TotalBaths` that describes the *total* number of bathrooms in a home. There are a couple of answers that might be reasonable. Can you think of a way that's better than just summing the features up?

In [None]:
X = df.copy()
y = X.pop('SalePrice')


# YOUR CODE HERE
X["TotalBaths"] = ____


# Check your answer
q_3.check()

In [None]:
# Lines below will give you a hint or solution code
#_COMMENT_IF(PROD)_
q_3.hint()
#_COMMENT_IF(PROD)_
q_3.solution()

In [None]:
#%%RM_IF(PROD)%%
X = df.copy()
y = X.pop('SalePrice')

X["TotalBaths"] = \
    df.FullBath + df.HalfBath + \
    df.BsmtFullBath + df.BsmtHalfBath

q_3.assert_check_failed()

In [None]:
#%%RM_IF(PROD)%%
X = df.copy()
y = X.pop('SalePrice')

X["TotalBaths"] = \
    df.FullBath + df.HalfBath

q_3.assert_check_failed()

In [None]:
#%%RM_IF(PROD)%%
X = df.copy()
y = X.pop('SalePrice')

X["TotalBaths"] = \
    df.BsmtFullBath + df.BsmtHalfBath

q_3.assert_check_failed()

In [None]:
#%%RM_IF(PROD)%%
X = df.copy()
y = X.pop('SalePrice')

# Solution 1: HalfBath with half the weight of FullBath
X["TotalBaths"] = \
    df.FullBath + 0.5 * df.HalfBath + \
    df.BsmtFullBath + 0.5 * df.BsmtHalfBath

# Solution 2: Same, but preserves int type
X["TotalBaths"] = \
    2 * df.FullBath + df.HalfBath + \
    2 * df.BsmtFullBath + df.BsmtHalfBath

q_3.assert_check_passed()

-------------------------------------------------------------------------------

Now compare the performance of XGBoost on *Ames* with and without the `TotalBaths` feature. (The `score_dataset` function performs 5-fold cross-validation with XGBoost using the RMSLE metric, the same metric used in the *House Prices* competition.)

In [None]:
X_base = df.copy()
y_base = X_base.pop("SalePrice")

baseline_score = score_dataset(X_base, y_base)
new_score = score_dataset(X, y)

print(f"Score Without New Feature: {baseline_score:.4f} RMSLE")
print(f"Score With New Feature: {new_score:.4f} RMSLE")
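
To make the RMSLE numbers above more concrete, here is a small sketch that computes the metric by hand and checks it against scikit-learn's `mean_squared_log_error`. The prices are invented for illustration. It also shows why `score_dataset` multiplies by -1: scikit-learn scorers follow a "greater is better" convention, so `"neg_mean_squared_log_error"` returns the negated error.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

# Hypothetical true and predicted sale prices
y_true = np.array([100_000.0, 150_000.0, 200_000.0])
y_pred = np.array([110_000.0, 140_000.0, 210_000.0])

# RMSLE: square root of the mean squared difference of log(1 + value)
rmsle_manual = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# The same quantity via scikit-learn's MSLE, followed by a square root
rmsle_sklearn = np.sqrt(mean_squared_log_error(y_true, y_pred))

print(f"manual:  {rmsle_manual:.4f}")
print(f"sklearn: {rmsle_sklearn:.4f}")
```

Because the errors are measured on a log scale, RMSLE penalizes relative errors rather than absolute ones: being off by \$10,000 matters more for a \$100,000 home than for a \$200,000 home.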

# 4) Feature Selection

Based on the performance of XGBoost with and without the additional feature, would you decide to keep or discard the new feature? Or is there not enough information to decide?

After you've thought about your answer, run the next cell for the solution.

In [None]:
# View the solution (Run this cell to receive credit!)
q_4.check()

# Iterating on Feature Sets #

You've just worked through a complete iteration of feature development: discovery, creation, validation, and selection. In most machine learning projects, you'll likely go through many such iterations before arriving at your final, best feature set.
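
One such iteration can be sketched on a toy problem. The example below is only an illustration under stated assumptions: the data is synthetic, the names `width`, `length`, `area`, and `cv_rmse` are hypothetical, and a plain linear regression with RMSE stands in for the XGBoost model and RMSLE metric used in this exercise.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 300

# Synthetic data: the target depends on the product of two features
X_toy = pd.DataFrame({
    "width": rng.uniform(1, 10, n),
    "length": rng.uniform(1, 10, n),
})
y_toy = X_toy["width"] * X_toy["length"] + rng.normal(0, 1, n)


def cv_rmse(X, y):
    # 5-fold cross-validated RMSE (lower is better)
    scores = cross_val_score(
        LinearRegression(), X, y, cv=5, scoring="neg_mean_squared_error",
    )
    return np.sqrt(-scores.mean())


# Validation: score the baseline, then score again with the derived feature
base_rmse = cv_rmse(X_toy, y_toy)
X_new = X_toy.assign(area=X_toy["width"] * X_toy["length"])
new_rmse = cv_rmse(X_new, y_toy)

print(f"Baseline RMSE:  {base_rmse:.3f}")
print(f"With area RMSE: {new_rmse:.3f}")
```

Here the derived feature captures a relationship that the linear model can't represent on its own, so the error drops clearly. On real data the difference is often much smaller, which is why the selection step deserves care.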

In the next lesson, you'll learn about *feature utility*, a way of scoring features for their potential usefulness -- a big help when you're just getting started with a new dataset!

# Keep Going #

[**Discover useful features**](#$NEXT_NOTEBOOK_URL$) with mutual information.