# Feature Engineering

Author: [Alexandre Gramfort](http://alexandre.gramfort.net),
with some code snippets from [Olivier Grisel](http://ogrisel.com/) (leaf encoder)

It is the most creative aspect of Data Science!

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
import seaborn as sns
df = sns.load_dataset("titanic")

In [None]:
df.head()

As you can see, the data contains both quantitative and categorical variables. These categorical variables have some predictive power:

In [None]:
sns.catplot(data=df, x='pclass', y='survived', hue='sex', kind='bar')

The question is: how do we feed these non-quantitative features to a supervised learning model?

## Categorical features

 - Nearly always need some treatment
 - High cardinality can create very sparse data
 - Missing values are difficult to impute

### One-Hot encoding


 - It is the most basic method and is used with most linear algorithms
 - Drop the first column to avoid collinearity
 - It uses a sparse format, which is memory-friendly
 - Most current implementations don’t gracefully handle missing values or unseen categories

In [None]:
df1 = df[['embarked']]
df1.head(10)

In [None]:
from sklearn.preprocessing import OneHotEncoder
OneHotEncoder().fit_transform(df1.head(10)).toarray()

In [None]:
# OneHotEncoder().fit_transform(df1).toarray()  # problem of missing values

In [None]:
# Avoid collinearity
OneHotEncoder(drop='first').fit_transform(df1.head(10)).toarray()
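
The point about unseen categories can be illustrated with a minimal sketch. It assumes a scikit-learn version where `OneHotEncoder` accepts `handle_unknown='ignore'`; the toy arrays `X_train` and `X_new` and the category `'X'` are made up for the illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy training data with three known ports (illustrative values)
X_train = np.array([['S'], ['C'], ['Q']], dtype=object)
# New data containing a category ('X') never seen during fit
X_new = np.array([['S'], ['X']], dtype=object)

# With handle_unknown='ignore', an unseen category is encoded as all zeros
# instead of raising an error at transform time
enc = OneHotEncoder(handle_unknown='ignore').fit(X_train)
encoded = enc.transform(X_new).toarray()
print(enc.categories_)
print(encoded)
```

Encoding unseen categories as an all-zero row is only one possible convention; whether it is appropriate depends on the downstream model.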

## Ordinal encoding

 - Give every category a unique numerical ID
 - Useful for non-linear tree-based algorithms (forests, gradient-boosting)
 - Does not increase dimensionality

In [None]:
from sklearn.preprocessing import OrdinalEncoder
OrdinalEncoder().fit_transform(df1.head(10))
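
To see which ID each category receives, the fitted encoder can be inspected. The following is a small sketch on a made-up array (`X_train` is not part of the Titanic data); it shows that the IDs follow the sorted order of the categories, which is arbitrary with respect to the target. This is why the encoding suits tree-based models better than linear ones.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy column of embarkation ports (illustrative values)
X_train = np.array([['S'], ['C'], ['Q'], ['S']], dtype=object)

enc = OrdinalEncoder().fit(X_train)
codes = enc.transform(X_train)

# categories_ holds the sorted categories; the ID is the position in that list
print(enc.categories_)
print(codes.ravel())

# The mapping is invertible
decoded = enc.inverse_transform(codes)
```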

## Count encoding

Replace each category with its count in the training set

- Useful for both linear and non-linear algorithms
- Can be sensitive to outliers
- A log-transform can be added; it works well with counts
- Replace unseen categories with `1`
- May give collisions: two different categories with the same count get the same encoding

In [None]:
import category_encoders as ce

In [None]:
ce.CountEncoder().fit_transform(df1.head(10).values)
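
The bullets above can also be sketched by hand with pandas, without `category_encoders`. The `train` and `test` series are toy data, and the use of `np.log1p` for the log-transform is an assumption made for the illustration.

```python
import numpy as np
import pandas as pd

# Toy train / test columns (illustrative values)
train = pd.Series(['S', 'C', 'S', 'Q', 'S', 'C'])
test = pd.Series(['S', 'Q', 'X'])  # 'X' never appears in train

# Count of each category in the training set
counts = train.value_counts()

train_counts = train.map(counts)
# Unseen categories get NaN from the mapping; replace them with 1
test_counts = test.map(counts).fillna(1)

# Optional log-transform to compress large counts
test_log_counts = np.log1p(test_counts)
print(test_counts.tolist())
```

Note that `'Q'` (seen once) and the unseen `'X'` end up with the same value, which is an instance of the collisions mentioned above.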

## Label / Ordinal count encoding

Rank categories by their count in the training set

- Useful for both linear and non-linear algorithms
- Not sensitive to outliers
- Won’t give the same encoding to different categories
- Best of both worlds

In [None]:
from sklearn.preprocessing import OrdinalEncoder

class CountOrdinalEncoder(OrdinalEncoder):
    """Encode categorical features as an integer array
    using count information.

    Categories are ordered by increasing count in the training set.
    The constructor of OrdinalEncoder is inherited unchanged.
    """

    def fit(self, X, y=None):
        """Fit the CountOrdinalEncoder to X.

        Parameters
        ----------
        X : array-like, shape [n_samples, n_features]
            The data to determine the categories of each feature.

        Returns
        -------
        self
        """
        super().fit(X)
        X_arr = np.asarray(X, dtype=object)
        # reorder the categories of each feature by increasing count
        for k, cat in enumerate(self.categories_):
            counts = [np.sum(X_arr[:, k] == c) for c in cat]
            # stable sort so that ties keep their original (sorted) order
            order = np.argsort(counts, kind="stable")
            self.categories_[k] = cat[order]
        return self

coe = CountOrdinalEncoder()
coe.fit_transform(pd.DataFrame(['fr', 'fr', 'fr', 'en', 'en', 'es', 'es']))

## Hash encoding

Performs a “one-hot-like” encoding into arrays of a fixed length.

- Avoids extremely sparse data
- May introduce collisions
- Can be repeated with different hash functions, bagging the results for a small bump in accuracy
- Collisions usually degrade results, but may occasionally improve them
- Gracefully deals with new categories (e.g. new user-agents)

In [None]:
df1.head(10)

In [None]:
ce.hashing.HashingEncoder(n_components=4).fit_transform(df1.head(10).values)
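
The same idea can be sketched with scikit-learn's `FeatureHasher`, which is stateless and therefore handles categories it has never seen. The category values below, including `'brand-new-port'`, are made up for the illustration, and `n_features=4` is chosen only to mirror the cell above.

```python
from sklearn.feature_extraction import FeatureHasher

# One token per sample; the last one stands for a category unseen so far
raw = [['S'], ['C'], ['Q'], ['brand-new-port']]

# Every token is hashed into one of 4 columns; the sign (+1 / -1) also
# comes from the hash, which partly cancels out collisions on average
hasher = FeatureHasher(n_features=4, input_type='string')
hashed = hasher.fit_transform(raw).toarray()
print(hashed)
```

With only 4 columns, different categories can land in the same column, which is the collision effect listed above.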

## Target encoding

Encode each category by the mean of the target within that category (binary classification or regression)

Formula reads:

$$
    TE(X) = \alpha(n(X)) E[ y | x=X ] +  (1 - \alpha(n(X))) E[y]
$$

where $n(X)$ is the count of category $X$ and $\alpha$ is a monotonically increasing function bounded between 0 and 1 [1].

- The smoothing shrinks the encoding of rare categories towards the global mean $E[y]$, which avoids extreme encodings such as exactly 0.
```
[1] Micci-Barreca, 2001: A preprocessing scheme for
high-cardinality categorical attributes in classification
and prediction problems.
```

In [None]:
import dirty_cat as dc  # install with: pip install dirty_cat

X = np.array(['A', 'B', 'C', 'A', 'B', 'B'])[:, np.newaxis]
y = np.array([1  , 1  , 1  , 0  , 0  , 1])

dc.TargetEncoder(clf_type='binary-clf').fit_transform(X, y)
# If \alpha were 1 (no smoothing) you would get the per-category means:
# [0.5, 0.67, 1, 0.5, 0.67, 0.67]
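
The formula can be written out by hand with pandas. This is a minimal sketch on the same toy data, assuming the particular choice $\alpha(n) = n / (n + m)$ with a hypothetical smoothing strength `m`; it is not necessarily the weighting used by `dirty_cat`.

```python
import pandas as pd

# Same toy data as in the cell above
toy = pd.DataFrame({'x': ['A', 'B', 'C', 'A', 'B', 'B'],
                    'y': [1, 1, 1, 0, 0, 1]})

m = 1.0                    # hypothetical smoothing strength
prior = toy['y'].mean()    # E[y], the global mean of the target

# Per-category mean E[y | x=X] and count n(X)
stats = toy.groupby('x')['y'].agg(['mean', 'count'])

# alpha(n) = n / (n + m): increasing in n and bounded between 0 and 1
alpha = stats['count'] / (stats['count'] + m)
te = alpha * stats['mean'] + (1 - alpha) * prior

encoded = toy['x'].map(te)
print(te)
```

The category `C`, seen only once, is pulled from its raw mean of 1 towards the global mean, while `B`, seen three times, stays closer to its own mean.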

## NaN encoding

The idea is to encode the fact that a feature value was missing.

In [None]:
df.groupby('survived').count()[['deck', 'sex']]

In [None]:
from sklearn.impute import SimpleImputer

X = np.array([0, 1., np.nan, 2., 0.])[:, None]
SimpleImputer(strategy='median', add_indicator=True).fit_transform(X)

or

In [None]:
from sklearn.impute import MissingIndicator

X = np.array([0, 1., np.nan, 2., 0.])[:, None]
MissingIndicator().fit_transform(X)

## Polynomial encoding

Encode interactions between categorical variables

- Linear algorithms without interactions cannot solve the XOR problem
- A polynomial kernel *can* solve XOR

In [None]:
X = np.array([[0, 1], [1, 1], [1, 0], [0, 0]])
X

In [None]:
from sklearn.preprocessing import PolynomialFeatures
PolynomialFeatures(include_bias=False, interaction_only=True).fit_transform(X)
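
The XOR claim can be checked numerically. The sketch below uses ordinary least squares instead of a classifier, purely to keep the result deterministic; the helper `least_squares_predictions` and the names `X_xor`, `y_xor` are introduced only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# The four XOR points and their labels
X_xor = np.array([[0, 1], [1, 1], [1, 0], [0, 0]], dtype=float)
y_xor = np.array([1., 0., 1., 0.])


def least_squares_predictions(features, target):
    """Fit a linear model with intercept by least squares and predict."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return design @ coef


# Without interaction: the best linear fit predicts 0.5 everywhere
pred_linear = least_squares_predictions(X_xor, y_xor)

# With the x1 * x2 interaction column the labels are fitted exactly
X_inter = PolynomialFeatures(
    include_bias=False, interaction_only=True
).fit_transform(X_xor)
pred_inter = least_squares_predictions(X_inter, y_xor)

print(pred_linear)
print(pred_inter)
```

The exact fit corresponds to the identity XOR(x1, x2) = x1 + x2 - 2 * x1 * x2, which needs the interaction term.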

## To go beyond

You can also use some form of embedding, e.g. using a neural network to create dense embeddings from categorical variables.

- Maps categorical variables in a function approximation problem into Euclidean spaces
- Faster model training
- Less memory overhead
- Can give better accuracy than one-hot encoding
- See for example https://arxiv.org/abs/1604.06737

# Binning

See https://scikit-learn.org/stable/auto_examples/preprocessing/plot_discretization_classification.html

In [None]:
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.RandomState(42)
X = rng.randn(10, 2)
X

In [None]:
KBinsDiscretizer(n_bins=2).fit_transform(X).toarray()
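
To see where the bins are placed, the fitted `bin_edges_` attribute can be inspected. This is a small sketch on a made-up column `X_toy`, using `strategy='uniform'` (equal-width bins) and `encode='ordinal'` so that the output is the bin index rather than a one-hot matrix.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Six evenly spaced values in a single column (illustrative data)
X_toy = np.arange(6, dtype=float).reshape(-1, 1)

# Three equal-width bins, returned as bin indices
kbd = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
bins = kbd.fit_transform(X_toy)

print(kbd.bin_edges_[0])  # edges of the bins for the first feature
print(bins.ravel())
```

The default `strategy='quantile'` would instead choose edges so that each bin holds roughly the same number of samples.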

# Scaling


Scale numerical variables into a certain range

- Standard (Z) Scaling
- MinMax Scaling
- Root scaling
- Log scaling

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

rng = np.random.RandomState(42)
X = 10 + rng.randn(10, 1)
X

In [None]:
StandardScaler().fit_transform(X)

In [None]:
MinMaxScaler().fit_transform(X)

In [None]:
from sklearn.preprocessing import FunctionTransformer

X = np.arange(1, 10)[:, np.newaxis]
FunctionTransformer(func=np.log).fit_transform(X)
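
Root scaling, listed above, can be sketched in the same way. The example also passes an `inverse_func` so that the transformation can be undone, and uses `np.log1p` as a variant of the log transform that tolerates zeros; the arrays are toy data.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

X_pos = np.array([[1.], [4.], [9.]])

# Root scaling, with the square as inverse
root = FunctionTransformer(func=np.sqrt, inverse_func=np.square)
X_root = root.fit_transform(X_pos)

# log(1 + x) is defined at x = 0, unlike log(x)
X_counts = np.array([[0.], [9.], [99.]])
log1p = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)
X_log = log1p.fit_transform(X_counts)
X_back = log1p.inverse_transform(X_log)

print(X_root.ravel())
print(X_log.ravel())
```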

# Leaf coding

The following is an implementation of a trick found in:

Practical Lessons from Predicting Clicks on Ads at Facebook
Junfeng Pan, He Xinran, Ou Jin, Tianbing XU, Bo Liu, Tao Xu, Yanxin Shi, Antoine Atallah, Ralf Herbrich, Stuart Bowers, Joaquin Quiñonero Candela
International Workshop on Data Mining for Online Advertising (ADKDD)

https://www.facebook.com/publications/329190253909587/

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import LabelBinarizer
from scipy.sparse import hstack


class TreeTransform(BaseEstimator, TransformerMixin):
    """One-hot encode samples with an ensemble of trees

    This transformer first fits an ensemble of trees (e.g. gradient
    boosted trees or a random forest) on the training set.

    Then each leaf of each tree in the ensemble is assigned a fixed
    arbitrary feature index in a new feature space. If you have 100
    trees in the ensemble and 2**3 leaves per tree, the new feature
    space has 100 * 2**3 == 800 dimensions.

    Each sample of the training set goes through the decisions of each tree
    of the ensemble and ends up in one leaf per tree. The sample is encoded
    by setting the features of those leaves to 1 and leaving the other
    feature values at 0.

    The resulting transformer learns a supervised, sparse, high-dimensional
    categorical embedding of the data.

    This transformer is typically meant to be pipelined with a linear model
    such as logistic regression, linear support vector machines or
    elastic net regression.
    """
    def __init__(self, estimator):
        self.estimator = estimator
        
    def fit(self, X, y):
        self.fit_transform(X, y)
        return self
        
    def fit_transform(self, X, y):
        self.estimator_ = clone(self.estimator)
        self.estimator_.fit(X, y)
        self.binarizers_ = []
        sparse_applications = []
        estimators = np.asarray(self.estimator_.estimators_).ravel()
        for t in estimators:
            lb = LabelBinarizer(sparse_output=True)
            X_leafs = t.tree_.apply(X.astype(np.float32))
            sparse_applications.append(lb.fit_transform(X_leafs))
            self.binarizers_.append(lb)
        return hstack(sparse_applications)
        
    def transform(self, X, y=None):
        sparse_applications = []
        estimators = np.asarray(self.estimator_.estimators_).ravel()
        for t, lb in zip(estimators, self.binarizers_):
            X_leafs = t.tree_.apply(X.astype(np.float32))
            sparse_applications.append(lb.transform(X_leafs))
        return hstack(sparse_applications)


boosted_trees = GradientBoostingClassifier(
    max_leaf_nodes=5, learning_rate=0.1,
    n_estimators=10, random_state=0,
)

from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)

TreeTransform(boosted_trees).fit_transform(X, y)
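
A similar leaf encoding can be sketched with public scikit-learn API only, by combining the `apply` method of the fitted ensemble with a `OneHotEncoder`, then fitting a linear model on top. This is an alternative sketch rather than the implementation above; the variable names are chosen for the illustration, and the linear model is fitted on the same data as the trees only to keep the example short.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

X_iris, y_iris = load_iris(return_X_y=True)

gbt = GradientBoostingClassifier(
    max_leaf_nodes=5, n_estimators=10, random_state=0
).fit(X_iris, y_iris)

# apply() returns the leaf index reached in every tree, with shape
# (n_samples, n_estimators, n_classes); flatten to one column per tree
leaves = gbt.apply(X_iris).reshape(X_iris.shape[0], -1)

# One-hot encode the leaf indices: one active feature per tree
leaf_encoder = OneHotEncoder(handle_unknown='ignore')
X_leaves = leaf_encoder.fit_transform(leaves)

# Linear model on top of the sparse leaf features
linear = LogisticRegression(max_iter=1000).fit(X_leaves, y_iris)
train_acc = linear.score(X_leaves, y_iris)
print(leaves.shape, X_leaves.shape, train_acc)
```

In practice the trees and the linear model would usually be fitted on different subsets of the training data to limit overfitting.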

<div class="alert alert-success">
    <b>EXERCISE</b>:
     <ul>
      <li>
      Limiting yourself to LogisticRegression, propose features to predict survival.
      </li>
    </ul>
</div>

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

y = df.survived.values
X = df.drop(['survived', 'alive'], axis=1)

In [None]:
X.head()

In [None]:
lr = LogisticRegression()
ct = make_column_transformer(
    (make_pipeline(SimpleImputer(), StandardScaler()), ['age', 'pclass', 'fare'])
)
clf = make_pipeline(ct, lr)
np.mean(cross_val_score(clf, X, y, cv=10))

### Now do better!