# Introduction #

We've learned how feature hashing is an efficient technique for dealing with high-cardinality categorical features. In this exercise you'll apply feature hashing to a large dataset that has a feature with over 3000 categories.

Run this cell to set everything up.

In [None]:
SEED = 31415

# Setup feedback system
from learntools.core import binder
binder.bind(globals())
from learntools.feature_engineering_new.ex4 import *

In this exercise, you'll create a model to predict the rating an audience will give to movies of various genres. The *MovieLens 1 Million* dataset contains one million movie ratings together with each movie's genre and some simple demographic data like the rater's occupation and zipcode.

Run the next cell to set up the dataset.

In [None]:
import pandas as pd
from IPython.display import display

df = pd.read_csv("../input/fe-course-data/movielens1m.csv")
display(df.head())
print("Number of Unique Zipcodes: {}".format(df["Zipcode"].nunique()))

X = df.copy()
X.drop(['User ID', 'Movie ID'], inplace=True, axis=1)
y = X.pop("Rating").to_frame()

# Step 1 - Define Transforms #

Notice that the `Zipcode` feature has a large number of categories, and that future data may contain zipcodes that never appeared in the training set. This makes it a good candidate for feature hashing.

Define a feature hasher that will hash the zipcodes into 400 features. You'll need to use `make_column_transformer` to restrict the hasher to the `Zipcode` column.
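
As a minimal sketch of what the hasher does on its own, the example below hashes a few made-up zipcode strings into 8 columns (a small number chosen only so the output is easy to inspect; the exercise uses 400). Each value is wrapped in a one-element list because `FeatureHasher` with `input_type='string'` treats every sample as an iterable of string tokens. No vocabulary is learned, so a zipcode that was never seen before is hashed like any other.

```python
from sklearn.feature_extraction import FeatureHasher

# Made-up zipcodes for illustration; the last one plays the role of an unseen category.
zipcodes = ["48067", "70072", "48067", "99999"]

# n_features=8 is an illustrative choice, far smaller than the 400 used in the exercise.
demo_hasher = FeatureHasher(input_type="string", n_features=8)

# FeatureHasher is stateless, so transform can be called without fitting.
hashed = demo_hasher.transform([[z] for z in zipcodes]).toarray()

print(hashed.shape)  # (4, 8): one row per zipcode, one column per hash bucket
print(hashed)        # rows 0 and 2 are identical because they hold the same zipcode
```

Each row contains a single nonzero entry (+1 or -1) in the column its zipcode hashes to.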

In [None]:
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction import FeatureHasher


# YOUR CODE HERE: Define a feature hasher for the Zipcode column
hasher = FeatureHasher(input_type='string', n_features=400)
transformer = make_column_transformer(
    (hasher, "Zipcode"),
    remainder='passthrough',
)


# Check your answer
q_1.check()

# Step 2 - Create Pipeline #

Define a pipeline with the transformer you just defined followed by an XGBoost regressor. Since this dataset is quite large, use XGBoost's histogram algorithm with `tree_method='hist'`.

In [None]:
from sklearn.pipeline import make_pipeline
from xgboost import XGBRegressor


# YOUR CODE HERE: Create a pipeline with a FeatureHasher followed by XGBRegressor
xgb = XGBRegressor(tree_method='hist')
pipeline = make_pipeline(transformer, xgb)


# Check your answer
q_2.check()

# Step 2 (continued) - Estimate Generalization Error #

Estimate the generalization error using 3-fold cross validation.
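
scikit-learn scorers follow a "higher is better" convention, so mean absolute error is exposed as `neg_mean_absolute_error` and comes back as non-positive numbers. The sketch below illustrates the sign flip on synthetic data with a `DummyRegressor` standing in for the real pipeline; the data and the stand-in model are assumptions made only to keep the example small.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data for illustration only (not the MovieLens dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 2))
y_demo = rng.normal(size=30)

# One score per fold; each is the negated MAE on that fold's held-out data.
neg_mae = cross_val_score(
    DummyRegressor(strategy="mean"),
    X_demo,
    y_demo,
    cv=3,
    scoring="neg_mean_absolute_error",
)

# Negate the fold average to recover an ordinary, non-negative MAE.
mae = -1 * neg_mae.mean()
print("Per-fold scores:", neg_mae)
print("Mean absolute error: {:.4f}".format(mae))
```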

In [None]:
from sklearn.model_selection import cross_val_score

score = cross_val_score(
    pipeline, X, y, cv=3, scoring='neg_mean_absolute_error'
)
score = -1 * score.mean()
print("XGB with Feature Hashing: {:.4f}".format(score))


# Check your answer
q_2.check()

# Step 3 - Tune #

It's sometimes worth tuning the number of features the hasher creates. Use `GridSearchCV` to compare the model with 400 features to one with 10 and one with 2000 features.
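
The parameter name used in a grid search comes from the step names that `make_pipeline` and `make_column_transformer` generate automatically (the lowercased class names, joined by double underscores). A small sketch of how to look the name up, using `Ridge` as a lightweight stand-in for the XGBoost regressor (an assumption made so the example needs only scikit-learn):

```python
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Same structure as the exercise pipeline, with Ridge standing in for XGBRegressor.
demo_pipeline = make_pipeline(
    make_column_transformer(
        (FeatureHasher(input_type="string", n_features=400), "Zipcode"),
        remainder="passthrough",
    ),
    Ridge(),
)

# List every tunable parameter whose name ends in "n_features".
hasher_params = [name for name in demo_pipeline.get_params() if name.endswith("n_features")]
print(hasher_params)

# The same name works with set_params, which is what GridSearchCV uses internally.
demo_pipeline.set_params(columntransformer__featurehasher__n_features=10)
print(demo_pipeline.named_steps["columntransformer"].transformers[0][1].n_features)
```

Calling `get_params()` on your own pipeline is a quick way to confirm the exact key before building `param_grid`.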


In [None]:
from sklearn.model_selection import GridSearchCV

num_features = [10, 2000]
param_grid = {"columntransformer__featurehasher__n_features": num_features}

# YOUR CODE HERE: Tune the number of features created
tuner = GridSearchCV(pipeline, param_grid, cv=3, scoring='neg_mean_absolute_error')
tuner.fit(X, y)


# Check your answer
q_3.check()

Run this cell without changes to see the results.

In [None]:
print("Best Score: {}".format(-1 * tuner.best_score_))
print("Best Params: {}".format(tuner.best_params_))
results = pd.DataFrame(
    -1 * tuner.cv_results_['mean_test_score'],
    index=num_features,
    columns=['CV Score'],
)
display(results)

# Step 4 - Evaluate #

Were the results surprising to you? Can you think of an explanation? How does that affect your evaluation of the usefulness of feature hashing?

After you've thought about it, run the next cell for some discussion.

In [None]:
# Check your answer
q_4.check()

```
# Solution: Feature hashing creates collisions, but many ML models (including XGBoost) can learn to work around them. That's why 400 hashed features was as good as 2000 in this case. Hashing into only 10 features though clearly reduced performance.

For high-cardinality features, the hashing trick is very useful. It can reduce computational requirements while preserving predictive performance.
```
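
To get a rough feel for how many collisions each setting produces, the sketch below hashes 3000 synthetic category strings (made-up stand-ins, not the actual zipcodes) into 10, 400, and 2000 columns and counts how many columns end up in use. The helper `occupied_buckets` is a hypothetical name introduced for this illustration.

```python
from sklearn.feature_extraction import FeatureHasher

# Synthetic stand-ins for zipcodes: 3000 distinct five-digit strings.
categories = [f"{i:05d}" for i in range(3000)]


def occupied_buckets(n_features):
    """Count how many hash columns receive at least one category."""
    hasher = FeatureHasher(input_type="string", n_features=n_features)
    hashed = hasher.transform([[c] for c in categories])
    # Each row holds a single +1 or -1, so its column index is that category's bucket.
    return len(set(hashed.nonzero()[1].tolist()))


for n in [10, 400, 2000]:
    used = occupied_buckets(n)
    per_bucket = len(categories) / used
    print(f"n_features={n}: {used} buckets used, about {per_bucket:.1f} categories per used bucket")
```

With only 10 columns, hundreds of categories must share each one, while far fewer share a column at 400 or 2000.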

# Keep Going #