
# Introduction #


Run this cell to set everything up.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set Matplotlib defaults
# Note: Matplotlib 3.6+ renamed this style to "seaborn-v0_8-whitegrid"
plt.style.use("seaborn-whitegrid")
plt.rc("figure", autolayout=True)
plt.rc(
    "axes",
    labelweight="bold",
    labelsize="large",
    titleweight="bold",
    titlesize=18,
    titlepad=10,
)


# Setup feedback system
from learntools.core import binder
binder.bind(globals())
from learntools.deep_learning_intro.ex6 import *

In this exercise, you'll predict the demand for rental bikes on a given day of the year in a bike sharing program. The *Bike Sharing* dataset contains features describing a day of the year and the weather conditions on that day. The target `cnt` is the number of rentals on that day.

Run the next cell to set up the data.


In [None]:
import pandas as pd

df = pd.read_csv("../input/fe-course-data/bike-sharing.csv")
X = df.drop(["instant", "dteday", "casual", "registered"], axis=1)
y = X.pop('cnt')

X.head()

Before selecting features for prediction, it's important to examine the data. As discussed in the tutorial, highly correlated features in particular can cause problems.

Start by running this cell to get a plot of the correlations between features.

In [None]:
plt.figure(dpi=100)
sns.heatmap(X.corr(), cmap="RdBu", vmin=-1.0, vmax=1.0)
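
If the heatmap is hard to read, the same check can be done numerically. The following is a minimal sketch on synthetic data, not the Bike Sharing dataset: the helper `high_corr_pairs`, the column names, and the `0.9` threshold are all hypothetical choices made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption: NOT the Bike Sharing dataset).
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "a_noisy": a + 0.05 * rng.normal(size=200),  # nearly a copy of "a"
    "b": rng.normal(size=200),                   # generated independently of "a"
})


def high_corr_pairs(frame, threshold=0.9):
    """Hypothetical helper: list feature pairs with |correlation| >= threshold."""
    corr = frame.corr()
    cols = corr.columns
    pairs = []
    # Visit each unordered pair once (upper triangle of the matrix).
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs


pairs = high_corr_pairs(demo, threshold=0.9)
print(pairs)
```

The threshold is a judgment call; a lower value flags more pairs for inspection.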

# Step 1 - Examine Data #

Are the correlations high enough to warrant removal of any features? If so, which ones?

After you've thought about it, run the next cell for some discussion.

In [None]:
# Check your answer
q_1.check()

# Step 2 - Drop Features #

Now drop one feature from each pair of highly correlated features you identified above.

In [None]:
# YOUR CODE HERE
X.drop(["temp", "season"], axis=1, inplace=True)


# Check your answer
q_2.check()

You can look at the new correlation matrix by running this cell, if you like.

In [None]:
plt.figure(dpi=100)
sns.heatmap(X.corr(), cmap="RdBu", vmin=-1.0, vmax=1.0)

# Step 3 - Define Mutual Information Filter #

Use a mutual information filter to select 5 features for prediction.

In [None]:
from sklearn.feature_selection import mutual_info_regression, SelectKBest


# YOUR CODE HERE
mi_filter = SelectKBest(score_func=mutual_info_regression, k=5)


# Check your answer
q_3.check()
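
To see what such a filter does in isolation, here is a small sketch on synthetic data (an assumption for illustration, not the exercise data). The target depends on columns 0 and 2 only, and its dependence on column 2 is through an absolute value, so that column has almost no linear correlation with the target. `functools.partial` is used here only to pass a fixed `random_state` to `mutual_info_regression`, whose estimate is otherwise slightly randomized.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Synthetic data (assumption): four features, only columns 0 and 2 matter.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 4))
y_demo = 2.0 * X_demo[:, 0] + 3.0 * np.abs(X_demo[:, 2])

# Fix the random seed of the MI estimator so the scores are reproducible.
score_func = partial(mutual_info_regression, random_state=0)
selector = SelectKBest(score_func=score_func, k=2)

# fit() scores every column against the target; transform() keeps the top k.
X_selected = selector.fit_transform(X_demo, y_demo)

print("MI scores:", selector.scores_.round(3))
print("Selected mask:", selector.get_support())
```

Because mutual information measures any kind of dependence, the filter should still rank column 2 above the two noise columns even though a correlation-based score would not.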

# Step 4 - Apply Filter #

Since the mutual information filter is a supervised technique (it uses the target to score features), you'll need to fit it on data that is independent of the data you'll use for training.

For this exercise:
1. Create a data split for fitting the MI filter. Use 25% of the total data for the fitting set.
2. Fit the MI filter.
3. Apply the mutual information filter to the data you'll use for training.


In [None]:
from sklearn.model_selection import train_test_split


# YOUR CODE HERE: Split the data
X_mi, X_train, y_mi, y_train = train_test_split(X, y, train_size=0.25)

# YOUR CODE HERE: Fit the MI filter on X_mi
mi_filter.fit(X_mi, y_mi)

# YOUR CODE HERE: Apply the filter to X_train
X_filtered = mi_filter.transform(X_train)


# Check your answer
q_4.check()

Now run this next cell to see the transformed dataset.

In [None]:
features = X.columns  # get the column index
mask = mi_filter.get_support()  # selected? True or False
mi_features = features[mask]  # select columns
# Build the frame from the filtered array, keeping the original row index
X_train = pd.DataFrame(X_filtered, columns=mi_features, index=X_train.index)
X_train.head()

# Step 5 - Evaluate Decision Tree #

Create a decision tree model with `max_depth=3` and evaluate its performance using 5-fold cross-validation. Use `'neg_mean_absolute_error'` for the scoring metric.

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score


# YOUR CODE HERE: Create a decision tree
decision_tree = DecisionTreeRegressor(max_depth=3)

# YOUR CODE HERE: Validate with 5-fold CV
score = cross_val_score(
    decision_tree, X_train, y_train, cv=5, scoring="neg_mean_absolute_error"
)
score = -1 * score.mean()
print("Score: {:.4f}".format(score))


# Check your answer
q_5.check()
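
The sign flip in the cell above can be confusing, so here is a minimal sketch on synthetic data (an assumption, not the exercise data). scikit-learn scorers follow a "higher is better" convention, so `"neg_mean_absolute_error"` returns the MAE with its sign reversed, one value per fold.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem (assumption): a noisy linear trend.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=1.0, size=200)

model = DecisionTreeRegressor(max_depth=3, random_state=0)

# One score per fold; each is the negated mean absolute error on that fold.
neg_mae = cross_val_score(
    model, X_demo, y_demo, cv=5, scoring="neg_mean_absolute_error"
)

# Negate the average to report an ordinary, non-negative MAE.
mae = -neg_mae.mean()
print("Per-fold scores:", neg_mae.round(3))
print("Mean MAE: {:.4f}".format(mae))
```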

# Step 6 - Train Final Decision Tree #

At this point in the modeling process, you could decide whether this reduced set of features gives you acceptable performance for your application. If so, it's time to fit the model on the complete training set.

In [None]:
# YOUR CODE HERE
decision_tree.fit(X_train, y_train)

# Check your answer
q_6.check()

Now run this cell to see a plot of the decision tree you created!

In [None]:
from sklearn.tree import plot_tree

plt.figure(dpi=200)
plot_tree(
    # mi_features holds the column names kept by the MI filter
    decision_tree, feature_names=list(mi_features), filled=True, impurity=False,
)

# The End #

That's all for **Feature Engineering**. We hope you enjoyed learning with us!