# Introduction #

In this exercise you'll use mutual information scores and interaction plots to identify an initial set of features in the *Ames* dataset to develop.

Run this cell to set everything up!

In [None]:
# Setup feedback system
from learntools.core import binder
binder.bind(globals())
from learntools.feature_engineering_new.ex2 import *

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.feature_selection import mutual_info_regression

# Set Matplotlib defaults
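# Matplotlib 3.6 renamed its bundled seaborn styles to "seaborn-v0_8-*";
# fall back to the old name on earlier versions
style = "seaborn-v0_8-whitegrid"
plt.style.use(style if style in plt.style.available else "seaborn-whitegrid")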
plt.rc("figure", autolayout=True)
plt.rc(
    "axes",
    labelweight="bold",
    labelsize="large",
    titleweight="bold",
    titlesize=14,
    titlepad=10,
)


# Load data
df = pd.read_csv("../input/fe-course-data/ames.csv")
    
# Let's include the feature we created in the Lesson 1 exercise
df["TotalBaths"] = (
    2 * df.FullBath + df.HalfBath
    + 2 * df.BsmtFullBath + df.BsmtHalfBath
)


# Utility functions from Tutorial
def make_mi_scores(X, y):
    X = X.copy()
    for colname in X.select_dtypes(["object", "category"]):
        X[colname], _ = X[colname].factorize()
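    # All discrete features should now have integer dtypes
    # (is_integer_dtype matches any integer width, unlike `== int`)
    discrete_features = [pd.api.types.is_integer_dtype(t) for t in X.dtypes]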
    mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features, random_state=0)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores


def plot_mi_scores(scores):
    scores = scores.sort_values(ascending=True)
    width = np.arange(len(scores))
    ticks = list(scores.index)
    color = np.array(["C0"] * scores.shape[0])
    # Color red for probes
    idx = [i for i, col in enumerate(scores.index)
           if col.startswith("PROBE")]
    color[idx] = "C3"
    # Create plot
    plt.barh(width, scores, color=color)
    plt.yticks(width, ticks)
    plt.title("Mutual Information Scores")

-------------------------------------------------------------------------------


In [None]:
sns.relplot(x="TotRmsAbvGrd", y="SalePrice", data=df);

# 1) Understand Mutual Information

Based on the plots, which feature do you think would have the highest mutual information with `SalePrice`?

In [None]:
# View the solution (Run this cell to receive credit!)
q_1.check()
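
As a reminder of what mutual information measures, here is a minimal sketch on synthetic data (the feature names are hypothetical and are not Ames columns). It assumes a target that depends linearly on one feature, symmetrically and nonlinearly on another, and not at all on a third. The curved feature has almost no linear correlation with the target, yet mutual information can still pick it up.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 2000

# Synthetic features (hypothetical, for illustration only)
x_linear = rng.uniform(-1, 1, n)   # linear relationship with the target
x_curved = rng.uniform(-1, 1, n)   # symmetric, nonlinear relationship
x_noise = rng.uniform(-1, 1, n)    # unrelated to the target

y_toy = 1.5 * x_linear + 3 * x_curved**2 + rng.normal(0, 0.1, n)

X_toy = np.column_stack([x_linear, x_curved, x_noise])
mi_linear, mi_curved, mi_noise = mutual_info_regression(
    X_toy, y_toy, discrete_features=False, random_state=0
)

# Pearson correlation only captures the linear part of a relationship
corr_curved = np.corrcoef(x_curved, y_toy)[0, 1]

print(f"MI linear: {mi_linear:.3f}")
print(f"MI curved: {mi_curved:.3f} (correlation: {corr_curved:.3f})")
print(f"MI noise:  {mi_noise:.3f}")
```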

-------------------------------------------------------------------------------

Use the `make_mi_scores` function to compute mutual information scores for the *Ames* features.


In [None]:
X = df.copy()
y = X.pop('SalePrice')

mi_scores = make_mi_scores(X, y)

Now examine the scores using the functions in this cell. Look especially at the top and bottom ranks.

In [None]:
print(mi_scores.head(20))

plt.figure(dpi=100, figsize=(8, 5))
plot_mi_scores(mi_scores.head(20))

# 2) Examine MI Scores

Do the scores seem reasonable? Do the high-scoring features represent qualities you would look for in a home?

In [None]:
# View the solution (Run this cell to receive credit!)
q_2.check()

-------------------------------------------------------------------------------

In this step you'll investigate possible interaction effects for the `BldgType` feature. This feature describes the broad structure of the dwelling in five categories:

> Bldg Type (Nominal): Type of dwelling
>
>       1Fam    Single-family Detached
>       2FmCon  Two-family Conversion; originally built as one-family dwelling
>       Duplx   Duplex
>       TwnhsE  Townhouse End Unit
>       TwnhsI  Townhouse Inside Unit

The `BldgType` feature didn't get a very high MI score. A plot confirms this. We can see that the class distributions aren't very well separated:

In [None]:
sns.catplot(x="BldgType", y="SalePrice", data=df, kind="boxen")

Still, the type of a dwelling seems like it should be important information. Investigate whether `BldgType` produces a significant interaction with either of the following:

```
GrLivArea  # Above ground living area
MoSold     # Month sold
```

Run the following cell:

In [None]:
# YOUR CODE HERE:
feature = "MoSold"

sns.lmplot(
    x=feature, y="SalePrice", hue="BldgType", col="BldgType",
    data=df, scatter_kws={"edgecolor": 'w'}, col_wrap=3, height=4,
)

Remember that a difference in the trend lines means that there is an interaction effect.
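
To put a number on "a difference in the trend lines", one option is to fit a separate straight line per category and compare the slopes. The sketch below does this on synthetic data. The `slopes_by_group` helper and the `toy` columns are hypothetical names introduced for illustration; they are not part of the exercise.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300

# Synthetic data: price grows with area, but three times faster in group A
toy = pd.DataFrame({
    "group": rng.choice(["A", "B"], size=n),
    "area": rng.uniform(50, 200, size=n),
})
true_slope = toy["group"].map({"A": 3.0, "B": 1.0})
toy["price"] = true_slope * toy["area"] + rng.normal(0, 10, size=n)


def slopes_by_group(frame, feature, target, by):
    """Least-squares slope of `target` on `feature`, fit separately per group."""
    return {
        name: np.polyfit(g[feature], g[target], deg=1)[0]
        for name, g in frame.groupby(by)
    }


area_slopes = slopes_by_group(toy, "area", target="price", by="group")
print(area_slopes)  # clearly different slopes suggest an interaction
```

With the notebook's `df` loaded, the same helper could be called as `slopes_by_group(df, "GrLivArea", target="SalePrice", by="BldgType")` to complement the `lmplot` panels.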

# 3) Discover Interactions

From the plots, does `BldgType` seem to exhibit an interaction effect with either `GrLivArea` or `MoSold`?

In [None]:
# View the solution (Run this cell to receive credit!)
q_3.check()

In [None]:
# Solution: yes for GrLivArea, no for MoSold

Are there any other low-scoring features you suspect might be important because of interactions they create?

-------------------------------------------------------------------------------

# (Optional) Feature Selection #

Whether in the original data or after a round of feature development, you may find yourself with a large number of features. At this point, you may want to apply a **feature selection** technique. Reducing the number of uninformative features in the training set can:

- reduce computational time, and
- help prevent overfitting and improve predictive performance.

MI filters are a computationally inexpensive selection method especially useful for large or high-dimensional datasets.
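
For comparison, the simplest kind of MI filter just keeps the top-scoring features. The sketch below is not the method used in this exercise. It assumes scikit-learn's `SelectKBest` with `mutual_info_regression` as the scoring function, applied to synthetic data in which only the first two columns carry information about the target.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_regression

rng = np.random.default_rng(0)
n = 1000

# Six synthetic features; only columns 0 and 1 influence the target
X_demo = rng.normal(size=(n, 6))
y_demo = X_demo[:, 0] + X_demo[:, 1] ** 2 + rng.normal(scale=0.3, size=n)

# Fix the estimator's random_state so the scores are reproducible
score_func = partial(mutual_info_regression, random_state=0)
selector = SelectKBest(score_func=score_func, k=2).fit(X_demo, y_demo)

selected = selector.get_support(indices=True)
print("MI scores:", np.round(selector.scores_, 3))
print("Selected columns:", selected)
```

A fixed `k` has to be chosen by hand, which is one motivation for a data-driven cutoff.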

This exercise presents a filtering method similar to what statisticians call a "permutation test". The idea is to compare the original feature set to a copy with values that have been randomly permuted.
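
The effect of permuting a column can be sketched on synthetic data (all names below are hypothetical). Shuffling keeps a column's distribution but breaks its link to the target, so the MI scores of the shuffled "probe" columns give a rough noise floor to compare real features against.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 1000

X_toy = pd.DataFrame({
    "informative": rng.normal(size=n),
    "uninformative": rng.normal(size=n),
})
y_toy = 2 * X_toy["informative"] + rng.normal(scale=0.5, size=n)

# Probes: the same columns with their values randomly permuted
probes = pd.DataFrame({
    f"PROBE_{c}": rng.permutation(X_toy[c].to_numpy()) for c in X_toy.columns
})
X_all = pd.concat([X_toy, probes], axis=1)

mi = pd.Series(
    mutual_info_regression(X_all, y_toy, random_state=0),
    index=X_all.columns,
).sort_values(ascending=False)

# Keep the real features that score above the best-scoring probe
threshold = mi[mi.index.str.startswith("PROBE")].max()
kept = [c for c in X_toy.columns if mi[c] > threshold]

print(mi)
print("Kept:", kept)
```

An uninformative feature can still outscore a single probe by chance, which is one reason to compare against several probes rather than just one.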


In [None]:
def add_probes(X):
    num_probes = X.shape[1]
    P = (X.sample(n=num_probes, axis=1, random_state=0)
         .sample(frac=1, random_state=1)
         .reset_index(drop=True))
    P.columns = [f"PROBE_{colname}" for colname in P.columns]
    X_p = pd.concat([P, X], axis=1)
    return X_p


def select_features(X, y, k=3):
    X_p = add_probes(X)
    mi_scores = make_mi_scores(X_p, y)
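    # Get the rank of the (k+1)-th PROBE; if there are fewer probes than
    # that, keep every feature (a default of -1 would drop the last one)
    rank = len(mi_scores)
    i = 0
    for r, col in enumerate(mi_scores.index):
        if col.startswith("PROBE"):
            if i == k:
                rank = r
                break
            else:
                i += 1
    features = [col for col in mi_scores.iloc[:rank].index
                if not col.startswith("PROBE")]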
    return features, mi_scores


Select features using several values of `k`. The baseline score was around `0.1428`. Does using feature selection significantly improve the performance of the model for any values of `k`? Does it ever degrade the performance? (You might try: `0, 1, 2, 6, 10, 15`.)

In [None]:
k = 5

X = df.copy()
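y = X.pop('SalePrice')
features, _ = select_features(X, y, k=k)
# Count against the feature columns only (df also contains the target)
print(f"Number of Features Selected: {len(features)} / {X.shape[1]}")
# Print the score; a bare expression mid-cell would not be displayed
print(f"Score: {score_dataset(X[features], y)}")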
features

In [None]:
scores = []
num_features = []
for k in range(40):
    features, _ = select_features(X, y, k=k)
    scores.append(score_dataset(X[features], y))
    num_features.append(len(features))
scores = pd.DataFrame(dict(Score=scores, NumFeatures=num_features))
fig, ax = plt.subplots(dpi=100)
ax.set(title="Feature Selection")
scores.plot(secondary_y=["NumFeatures"], ax=ax);


What do you think is the optimal number of features?

# A First Set of Development Features #

Let's take a moment to make a list of features to focus on. In the exercise in Lesson 3, you'll start to build up a feature set by creating combinations of these initial development features.

# Keep Going #