# Introduction #

In this exercise, you'll apply target encoding to features in the *Ames* dataset.

Run this cell to set everything up!

In [None]:
# Setup feedback system
from learntools.core import binder
binder.bind(globals())
from learntools.feature_engineering_new.ex5 import *

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import warnings
from category_encoders import MEstimateEncoder

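# Set Matplotlib defaults (Matplotlib 3.6 renamed this style to "seaborn-v0_8-whitegrid")
plt.style.use(
    "seaborn-v0_8-whitegrid"
    if "seaborn-v0_8-whitegrid" in plt.style.available
    else "seaborn-whitegrid"
)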
plt.rc("figure", autolayout=True)
plt.rc(
    "axes",
    labelweight="bold",
    labelsize="large",
    titleweight="bold",
    titlesize=14,
    titlepad=10,
)
warnings.filterwarnings('ignore')

df = pd.read_csv("../input/fe-course-data/ames.csv")

# 1) Choose Features for Encoding


In [None]:
df.select_dtypes(["object"]).nunique()

In [None]:
# A notebook cell only displays its last expression, so print both results
print(df["Neighborhood"].value_counts())
print(df["MSSubClass"].value_counts())

Which of these features look like good candidates for target encoding?


In [None]:
# Neighborhood could be a good one. It has a fairly large number of categories, several of which are rare.
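
The reasoning above (many categories, several of them rare) can be checked with a quick summary table. The following is a minimal sketch, not part of the exercise: the helper name `summarize_categories`, the `rare_threshold` cutoff, and the toy data are all assumptions made for illustration.

```python
import pandas as pd


def summarize_categories(frame, columns, rare_threshold=10):
    """For each column, count its categories and how many of them are rare.

    A category counts as rare here when it has fewer than `rare_threshold`
    rows; the cutoff is an illustrative assumption, not a fixed rule.
    """
    rows = []
    for col in columns:
        counts = frame[col].value_counts()
        rows.append(
            {
                "feature": col,
                "n_categories": len(counts),
                "n_rare": int((counts < rare_threshold).sum()),
            }
        )
    return pd.DataFrame(rows).set_index("feature")


# Toy data standing in for the real dataset
toy = pd.DataFrame(
    {
        "Neighborhood": ["A"] * 12 + ["B"] * 5 + ["C"] * 3,
        "Street": ["Pave"] * 19 + ["Grvl"],
    }
)
summary = summarize_categories(toy, ["Neighborhood", "Street"], rare_threshold=10)
print(summary)
```

On the real data, you could pass `df` and the column names returned by `df.select_dtypes(["object"]).columns` instead of the toy frame.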

-------------------------------------------------------------------------------

To get credit for this question, you need to create a target encoding that achieves a score less than 0.140 RMSLE. You're free to choose any set of features to encode and any value of `m`.
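
For reference, RMSLE is the root mean squared error between the logarithms of the predictions and of the true values. The sketch below uses the common `log(1 + x)` form of the definition in plain Python. It is an illustration only: the function name `rmsle` is an assumption, and the exercise's own `score_dataset` helper may compute its score differently (for example, with cross-validation).

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, using log(1 + x) for each value."""
    squared_log_errors = [
        (math.log1p(pred) - math.log1p(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Perfect predictions give an error of zero
perfect = rmsle([100_000.0, 200_000.0], [100_000.0, 200_000.0])

# Overpredicting every house price by 15% gives an error close to log(1.15)
overpredicted = rmsle([100_000.0, 200_000.0], [115_000.0, 230_000.0])
print(perfect, overpredicted)
```

Because the error is measured on a log scale, a score of 0.140 corresponds roughly to predictions that are typically off by a factor of about 1.15.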

Start by creating a split for the encoding.

# 2) Apply M-Estimate Encoding
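
To make the role of `m` concrete, here is a small worked sketch of m-estimate smoothing in plain Python. Each category's encoding blends the category's mean target with the overall mean, and `m` controls how strongly rare categories are pulled toward the overall mean. The function `m_estimate_encode` and the toy prices are assumptions for illustration; this shows the arithmetic, not the `category_encoders` implementation itself.

```python
from collections import defaultdict


def m_estimate_encode(categories, targets, m):
    """Return {category: smoothed mean of the target} using m-estimate smoothing."""
    overall_mean = sum(targets) / len(targets)

    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1

    encoding = {}
    for category, n in counts.items():
        category_mean = sums[category] / n
        # Categories with many rows (large n) keep most of their own mean;
        # rare categories are pulled toward the overall mean.
        weight = n / (n + m)
        encoding[category] = weight * category_mean + (1 - weight) * overall_mean
    return encoding


cats = ["A", "A", "A", "A", "B"]
prices = [100.0, 100.0, 100.0, 100.0, 200.0]

smoothed = m_estimate_encode(cats, prices, m=1)
unsmoothed = m_estimate_encode(cats, prices, m=0)
print(smoothed)
print(unsmoothed)
```

With `m=0` each category keeps its raw mean. With `m=1` the single-row category `"B"` moves much further toward the overall mean than the four-row category `"A"` does. Larger values of `m` smooth more aggressively.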



In [None]:
# YOUR CODE HERE
features = [
    "Neighborhood",
]

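# Encoding split: rows held out to fit the encoder (the 25% fraction is a choice)
X_encode = df.sample(frac=0.25, random_state=0)
y_encode = X_encode.pop("SalePrice")

# Training split: the remaining rows
X = df.drop(X_encode.index)
y = X.pop("SalePrice")

# Create the encoder instance. Choose m to control noise.
encoder = MEstimateEncoder(cols=features, m=20)
# Fit the encoder on the encoding split only, to avoid leaking the target.
encoder.fit(X_encode[features], y_encode)
# Encode the training split and join the encoded features to it
X = X.join(encoder.transform(X[features]), rsuffix="_encoded")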

score = score_dataset(X, y)
print(f"Your score: {score:.5f} RMSLE")

# Check your answer
q_2.check()

In [None]:
# Lines below will give you a hint or solution code
#q_2.hint()
#q_2.solution()

# Examine Encoding #


In [None]:
plt.figure(dpi=90)
# sns.distplot is deprecated; kdeplot and histplot are its replacements
ax = sns.kdeplot(y)
ax = sns.histplot(X["Neighborhood_encoded"], color="r", ax=ax, stat="density")
ax.set_xlabel("SalePrice")

Does it seem like the encoding was able to capture useful information?

In [None]:
# Yes. The match is not perfect, but the encoded values roughly follow the distribution of SalePrice.

# Alternative Encoders # 


# The End #

We hope you enjoyed the course!

Almost anything can be a feature: any description you can think of could become one.

For further reading on feature engineering, these resources are a good place to start:

- *The Art of Feature Engineering* by Pablo Duboue.
- *An Empirical Analysis of Feature Engineering for Predictive Modeling* by Jeff Heaton. Check out his dataset and accompanying notebook, too!
- *Feature Engineering for Machine Learning* by Alice Zheng and Amanda Casari. The tutorial on clustering was inspired by this excellent book.
- *Feature Engineering and Selection* by Max Kuhn and Kjell Johnson.
