# Evaluator Demo

We use feature selection after wrangling for two reasons.

1. Obtain a set of good features that represents the current dataset.
2. Obtain the set of *not good* features that should be refined in the next wrangling step.

This happens in three steps.

1. A first preselection step removes obviously bad features.
2. A second preselection step removes features that have the same predictive capabilities, so that the final feature selection step does not have to choose between equivalent features.
3. A real feature selection step makes the final decision.

The following methods are implemented in this notebook.

1. A (baseline) random sampling based approach — done.
2. CHCGA — a genetic algorithm based approach — done.
3. SFFS — a forward selection based approach — done.

Both (1) and (2) allow us to set a maximum running time.

# Preliminaries

## Imports

In [1]:
%reload_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
from typing import Optional, List, Tuple, Callable
from tqdm.notebook import tqdm
from avatar.language import WranglingLanguage
from avatar.analysis import FeatureEvaluator
from avatar.analysis import *
from avatar.selection import (
    SamplingSelector,
    CHCGASelector,
    Population,
    Individual,
    SFFSelector,
    StackedFilter,
    ConstantFilter,
    IdenticalFilter,
    BijectiveFilter,
    UniqueFilter,
    MissingFilter,
)

import warnings

In [2]:
# (Optional) Black code formatter (`pip install nb_black`) for JupyterLab. In Jupyter Notebook, use `%load_ext nb_black` instead.
%load_ext lab_black

## Data

Load dataset.

In [3]:
titanic = pd.read_csv("../../data/raw/demo/titanic.csv")
titanic.Survived = titanic.Survived.astype("category")
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


# Wrangle

## Apply Transformations

Transformations without replacement.


In [4]:
language = WranglingLanguage()
expanded = language.expand(titanic, target="Survived")
expanded.head(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,...,OneHot()(Parch)_5,OneHot()(Parch)_6,OneHot()(Embarked)_C,OneHot()(Embarked)_Q,OneHot()(Embarked)_S,"NaN(Pernot, Mr. Rene)(Name)_Name","NaN(Somerton, Mr. Francis William)(Name)_Name",WordToNumber()(Ticket)_Ticket,ModeImputation()(Cabin)_Cabin,ModeImputation()(Embarked)_Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,...,0,0,0,0,1,"Braund, Mr. Owen Harris","Braund, Mr. Owen Harris",,B96 B98,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,...,0,0,1,0,0,"Cumings, Mrs. John Bradley (Florence Briggs Th...","Cumings, Mrs. John Bradley (Florence Briggs Th...",,C85,C


## Prune

Remove some features that are not appropriate and don't need more wrangling.

In [5]:
pruner = StackedFilter([ConstantFilter(), IdenticalFilter()])

pruned = pruner.select(expanded)
pruned.head(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,...,OneHot()(Parch)_5,OneHot()(Parch)_6,OneHot()(Embarked)_C,OneHot()(Embarked)_Q,OneHot()(Embarked)_S,"NaN(Pernot, Mr. Rene)(Name)_Name","NaN(Somerton, Mr. Francis William)(Name)_Name",WordToNumber()(Ticket)_Ticket,ModeImputation()(Cabin)_Cabin,ModeImputation()(Embarked)_Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,...,0,0,0,0,1,"Braund, Mr. Owen Harris","Braund, Mr. Owen Harris",,B96 B98,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,...,0,0,1,0,0,"Cumings, Mrs. John Bradley (Florence Briggs Th...","Cumings, Mrs. John Bradley (Florence Briggs Th...",,C85,C
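As a rough illustration of what this pruning step does, the sketch below drops constant columns and exact duplicate columns with plain pandas. It is not the `ConstantFilter` / `IdenticalFilter` implementation from `avatar`; the helper name `prune_constant_and_identical` and the toy frame are made up for this example.

```python
import pandas as pd


def prune_constant_and_identical(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: drop constant columns and exact duplicate columns."""
    # A column is constant if it holds at most one distinct value (NaN counts as a value).
    keep = [c for c in df.columns if df[c].nunique(dropna=False) > 1]
    df = df[keep]
    # Transposing turns columns into rows, so duplicated() flags repeated columns;
    # the first occurrence of each column is kept.
    duplicate = df.T.duplicated()
    return df.loc[:, ~duplicate.values]


toy = pd.DataFrame(
    {
        "a": [1, 2, 3],
        "a_copy": [1, 2, 3],  # identical to "a"
        "const": [7, 7, 7],  # constant
        "b": [3, 1, 2],
    }
)
pruned_toy = prune_constant_and_identical(toy)
print(list(pruned_toy.columns))
```

Transposing is convenient for a demo but copies the whole frame, so a real filter on a wide dataset would probably compare columns pairwise or by hash instead.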


Let us compare the progress so far.

In [6]:
msg = """
PRUNING SUMMARY:

Number of columns original dataset:      {}
Number of columns after transformations: {}
Number of columns after pruning:         {}

Total gain: {} columns pruned
""".format(
    titanic.shape[1],
    expanded.shape[1],
    pruned.shape[1],
    expanded.shape[1] - pruned.shape[1],
)
print(msg)


PRUNING SUMMARY:

Number of columns original dataset:      12
Number of columns after transformations: 127
Number of columns after pruning:         118

Total gain: 9 columns pruned



## Preselection

Preselect by removing features that are never appropriate in their current form. These can still be wrangled further.

* Columns with too many missing values are removed.
* Columns consisting of unique, categorical values are removed.

In [7]:
preselector = StackedFilter([BijectiveFilter(), UniqueFilter(), MissingFilter()])
preselected = preselector.select(expanded)
preselected.head(2)

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Split( )(Name)_1,...,OneHot()(Parch)_3,OneHot()(Parch)_4,OneHot()(Parch)_5,OneHot()(Parch)_6,OneHot()(Embarked)_C,OneHot()(Embarked)_Q,OneHot()(Embarked)_S,WordToNumber()(Ticket)_Ticket,ModeImputation()(Cabin)_Cabin,ModeImputation()(Embarked)_Embarked
0,1,0,3,male,22.0,1,0,7.25,S,Mr.,...,0,0,0,0,0,0,1,,B96 B98,S
1,2,1,1,female,38.0,1,0,71.2833,C,Mrs.,...,0,0,0,0,1,0,0,,C85,C


## Sampler

We sample a subset of the data with at least one row containing no NaNs.
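A minimal sketch of such a sampler, assuming a DataFrame with a unique index: one complete row is drawn first, and the rest of the sample is filled from the remaining rows. The function name `sample_with_complete_row` and its parameters are hypothetical and not part of `avatar`.

```python
import numpy as np
import pandas as pd


def sample_with_complete_row(df: pd.DataFrame, n: int, seed: int = 0) -> pd.DataFrame:
    """Hypothetical helper: sample n rows, at least one of which has no NaN."""
    complete = df.dropna()
    if complete.empty:
        raise ValueError("no row without missing values")
    # Draw one guaranteed complete row, then fill up from the remaining rows.
    first = complete.sample(n=1, random_state=seed)
    rest = df.drop(index=first.index).sample(n=n - 1, random_state=seed)
    return pd.concat([first, rest])


toy = pd.DataFrame(
    {
        "age": [22.0, np.nan, 26.0, np.nan],
        "fare": [7.25, 71.28, np.nan, 8.05],
    }
)
subset = sample_with_complete_row(toy, n=3)
print(subset)
```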

## Correlation Filter

Next, we look for features with the same predictive power using a wrapper approach. A decision stump is learned for each feature individually and the predictions for this stump are compared. Features that make the same predictions are pruned.
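The idea can be sketched with a hand-rolled stump in NumPy. This is an illustration under assumptions, not the filter used in the notebook: the stump is a single threshold split on one numeric feature, predictions are compared on the training data itself, and the names `stump_predictions` and `drop_redundant` are made up.

```python
import numpy as np


def stump_predictions(x, y):
    """Fit a one-split threshold classifier on one numeric feature.

    Returns the training-set predictions of the most accurate split.
    """
    values = np.unique(x)
    # Candidate thresholds: midpoints between consecutive distinct values.
    thresholds = (values[:-1] + values[1:]) / 2
    labels, counts = np.unique(y, return_counts=True)
    best_pred = np.full(len(y), labels[np.argmax(counts)])  # no split at all
    best_acc = np.mean(best_pred == y)
    for t in thresholds:
        left = x <= t
        pred = np.empty(len(y), dtype=y.dtype)
        for side in (left, ~left):
            # Each side of the split predicts its majority class.
            side_labels, side_counts = np.unique(y[side], return_counts=True)
            pred[side] = side_labels[np.argmax(side_counts)]
        acc = np.mean(pred == y)
        if acc > best_acc:
            best_pred, best_acc = pred, acc
    return best_pred


def drop_redundant(X, y):
    """Keep the first feature of every group whose stumps predict identically."""
    seen, keep = set(), []
    for j in range(X.shape[1]):
        key = stump_predictions(X[:, j], y).tobytes()
        if key not in seen:
            seen.add(key)
            keep.append(j)
    return keep


y = np.array([0, 0, 0, 1, 1, 1])
f0 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
f1 = f0 * 10 + 3  # monotone transform of f0: same stump predictions
f2 = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0])  # weakly related feature
X_toy = np.column_stack([f0, f1, f2])
print(drop_redundant(X_toy, y))
```

Because a stump only depends on the ordering of the values, any monotone rescaling of a feature yields the same predictions, which is exactly the kind of redundancy this step is meant to catch.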

# Evaluator

In [8]:
preselected.shape

(891, 36)

In [9]:
# Evaluate with all features selected (an all-ones mask).
mask = np.ones(len(preselected.columns), dtype=int)

evaluator_shap = FeatureEvaluator(folds=10, method="shap", max_depth=4)
evaluator_shap.fit(preselected, target="Survived")


evaluator_fimp = FeatureEvaluator(folds=10, max_depth=None)
evaluator_fimp.fit(preselected, target="Survived")

# Extract MERCS

## Helper Methods

In [10]:
def extract_mercs_and_training_data_from_feature_evaluator(
    evaluator, mask=None, fold_idx=0
):
    mercs = _extract_mercs_from_feature_evaluator(
        evaluator, mask=mask, fold_idx=fold_idx
    )
    train, test = _extract_data_from_feature_evaluator(
        evaluator, fold_idx=fold_idx, kind=None
    )

    return mercs, train, test


def _extract_mercs_from_feature_evaluator(evaluator, mask=None, fold_idx=0):
    if mask is not None:
        mercs = evaluator.models(mask)[fold_idx]
    else:
        warnings.warn("Extracting first model from the cache.")
        mercs = next(iter(evaluator._cache))[fold_idx]
    return mercs


def _extract_data_from_feature_evaluator(evaluator, fold_idx=0, kind=None):
    train, test = evaluator._folds[fold_idx]
    if kind == "train":
        return train
    elif kind == "test":
        return test
    else:
        return train, test

## Testing `mercs.avatar`

In [11]:
import shap

mercs, X_train, X_test = extract_mercs_and_training_data_from_feature_evaluator(
    evaluator_fimp, mask=mask, fold_idx=0
)

In [22]:
model = mercs.m_list[0].model
m_code = mercs.m_codes[0]
Xb = X_train[:, m_code == 0]
Xt = X_test[:, m_code == 0]

explainer = shap.TreeExplainer(model, data=None,)
raw_shaps = explainer.shap_values(Xb, check_additivity=True)

Setting feature_perturbation = "tree_path_dependent" because no background data was given.


In [13]:
from sklearn.preprocessing import normalize

# Make tensor
tsr_shaps = np.array(raw_shaps)

# Convert to absolute values
abs_shaps = np.abs(tsr_shaps)

# In case of a nominal target, sum shaps across all targets
if len(abs_shaps.shape) == 3:
    abs_shaps = np.sum(abs_shaps, axis=0)

# Average over instances
avg_shaps = np.mean(abs_shaps, axis=0)

# Normalize so that the importances sum to 1 (L1 norm)
nrm_shaps = np.squeeze(normalize(avg_shaps.reshape(1, -1), norm="l1"))

In [14]:
nrm_shaps

array([0.05404627, 0.        , 0.20074717, 0.21775432, 0.03426271,
       0.        , 0.10399929, 0.        , 0.        , 0.04266714,
       0.05389032, 0.        , 0.        , 0.        , 0.09689895,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.00526375, 0.00320323,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.16931689, 0.01113763, 0.00681233])
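To make the aggregation above concrete, here is the same sequence of steps on a tiny, made-up SHAP output for a two-class target. The numbers are invented; only the order of operations (absolute value, sum over classes, mean over instances, L1 normalisation) mirrors the cell above. Plain division is used in place of `sklearn.preprocessing.normalize`, which is equivalent for non-negative values.

```python
import numpy as np

# Toy SHAP output for a 2-class target: one (instances x features) array per class.
raw = [
    np.array([[0.2, -0.1], [0.4, 0.0], [-0.6, 0.1]]),
    np.array([[-0.2, 0.1], [-0.4, 0.0], [0.6, -0.1]]),
]

abs_vals = np.abs(np.array(raw))  # shape (2, 3, 2)
per_instance = abs_vals.sum(axis=0)  # sum over classes -> (3, 2)
per_feature = per_instance.mean(axis=0)  # average over instances -> (2,)
importance = per_feature / per_feature.sum()  # L1 normalisation

print(importance, importance.sum())
```

The per-instance array is the intermediate result that `keep_abs_shaps=True` is meant to retain in the cells that follow.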

In [15]:
X = X_train

In [16]:
mercs.avatar(X, keep_abs_shaps=True)

mercs.nrm_shaps  # What we had already (without bugs!)
mercs.abs_shaps  # Per-instance information!

print("See, I did not crash!")

See, I did not crash!


Setting feature_perturbation = "tree_path_dependent" because no background data was given.


In [17]:
mercs.avatar(
    X, keep_abs_shaps=True, check_additivity=False,
)

mercs.nrm_shaps  # What we had already (without bugs!)
mercs.abs_shaps  # Per-instance information!

print(
    """See, I did not crash!
ALSO NOT ON MAC ANYMORE
"""
)

Setting feature_perturbation = "tree_path_dependent" because no background data was given.


See, I did not crash!
ALSO NOT ON MAC ANYMORE



In [18]:
mercs.abs_shaps.shape

(1, 713, 36)

In [19]:
mercs.abs_shaps[0][0, :]

array([0.03442074, 0.        , 0.        , 0.20406541, 0.12617931,
       0.04267356, 0.        , 0.28609163, 0.        , 0.        ,
       0.18458367, 0.03898494, 0.        , 0.        , 0.        ,
       0.48854098, 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.0098635 ,
       0.01222065, 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.08096751, 0.00583583,
       0.00215006])

In [20]:
abs_shaps[0, :]

array([0.03442074, 0.        , 0.20406541, 0.12617931, 0.04267356,
       0.        , 0.28609163, 0.        , 0.        , 0.18458367,
       0.03898494, 0.        , 0.        , 0.        , 0.48854098,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.0098635 , 0.01222065,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.08096751, 0.00583583, 0.00215006])

In [21]:
m_list = mercs.m_list

i = 0
for i, (x, y) in enumerate(zip(m_list, abs_shaps)):
    i += 1
    print(type(y))
i

In [None]:
assert 1 == 0  # halts a top-to-bottom run at this cell

In [None]:
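m_list = mercs.m_list
# `abs_shaps` from In [13] covers only the first model.
shaps_per_model = [abs_shaps]
# One full-width slab per model: (models, instances, attributes).
init = np.zeros((len(m_list), *X_train.shape))
for m_idx, (mod, abs_shap) in enumerate(zip(m_list, shaps_per_model)):
    print(abs_shap.shape)
    init[m_idx][:, list(mod.desc_ids)] = abs_shap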

In [None]:
init.shape

In [None]:
init[1][0, :]

In [None]:
init[0, [[3, 4]] * init.shape[1]].shape

In [None]:
zip(range(len(mercs.m_list)), mercs.m_list, abs_shaps)

In [None]:

for m_idx, (mod, abs_shap) in enumerate(
    zip([mercs.m_list[0], mercs.m_list[0]], abs_shaps)
):
    init[m_idx][:, list(mod.desc_ids)] = abs_shap
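The scatter step these cells experiment with has a NumPy indexing subtlety that is easy to trip over. The standalone sketch below uses made-up sizes and a hypothetical `desc_ids` list to show it: when an integer and a list index are separated by a slice, the list axis moves to the front of the result.

```python
import numpy as np

n_models, n_rows, n_cols = 2, 4, 5
desc_ids = [0, 2, 3]  # hypothetical descriptive-attribute indices of one model
values = np.arange(n_rows * len(desc_ids), dtype=float).reshape(n_rows, len(desc_ids))

full = np.zeros((n_models, n_rows, n_cols))

# Indexing in two steps keeps the (rows, columns) layout of `values`.
full[0][:, desc_ids] = values

# An integer and a list separated by a slice put the list axis first,
# so the same target has shape (columns, rows) and needs the transpose.
print(full[1, :, desc_ids].shape)
full[1, :, desc_ids] = values.T

print(np.array_equal(full[0], full[1]))
```

Indexing in two steps (`full[m][:, ids]`) avoids the transpose and is probably the less surprising form.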

In [None]:
nrm_shaps

In [None]:
l = preselected.columns.values.tolist()
l.remove("Survived")
l

df = pd.DataFrame()
df["cols"] = l
df["shap"] = nrm_shaps
df.sort_values(by="shap", ascending=False).head(5)

In [None]:
df = pd.DataFrame()
df["cols"] = preselected.columns
df["shap"] = evaluator_shap.importances(mask)
df["fimp"] = evaluator_fimp.importances(mask)

In [None]:
df.sort_values(by="fimp", ascending=False).head(5)

In [None]:
df.sort_values(by="shap", ascending=False).head(5)

# Feature Selection

Feature selection ranks the features according to their importance. The end goal is to obtain two groups of features:
- **Non-relevant features**: these need more wrangling.
- **Relevant features**: these can go into the final evaluator to see whether the model now performs better.


Three wrapper methods are implemented.

* Randomly sampling columns, training a model and getting the feature relevances.
* A genetic approach, which is similar but should combine features slightly better. We perform a small experiment on whether to fix the genome size.
* Classic sequential feature selection (forward, with backward floating steps).

The idea is similar; the feature selection algorithms return sets of feature importances and the associated scores.

## Mask Generation: Random

Randomly sample subsets of features, evaluate and get feature relevances.
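A minimal sketch of this baseline with a running-time budget is given below. It does not use `SamplingSelector`; the function `random_mask_search`, the toy scoring function and the way relevances are aggregated (mean score of all masks that contain a feature) are assumptions made for illustration.

```python
import time

import numpy as np


def random_mask_search(evaluate, n_features, max_seconds=1.0, max_iter=1000, seed=0):
    """Sample random feature masks until the time or iteration budget runs out."""
    rng = np.random.default_rng(seed)
    deadline = time.monotonic() + max_seconds
    results = []
    for _ in range(max_iter):
        if time.monotonic() > deadline:
            break
        mask = rng.integers(0, 2, size=n_features).astype(bool)
        if not mask.any():
            continue  # an empty feature set cannot be evaluated
        results.append((mask, evaluate(mask)))
    return results


useful = np.array([True, False, True, False, False])


def toy_score(mask):
    # Stand-in for training a model: reward useful features, penalise subset size.
    return (mask & useful).sum() - 0.1 * mask.sum()


results = random_mask_search(toy_score, n_features=5, max_iter=200)
masks = np.array([m for m, _ in results])
scores = np.array([s for _, s in results])

# Assumed aggregation: relevance = mean score of the masks that contain the feature.
relevance = np.array(
    [scores[masks[:, j]].mean() if masks[:, j].any() else 0.0 for j in range(5)]
)
print(relevance.round(2))
```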

## Mask Generation: Genetic

The CHC genetic algorithm for feature selection uses

* cross-generational elitist selection,
* heterogeneous recombination, and
* cataclysmic mutation

to maintain diversity and avoid stagnation.

After the final population is obtained, combine importances from this population.
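The three ingredients can be sketched on bit masks as follows. This is a compact illustration, not the `CHCGASelector` code: the fitness function is a toy, and the population size, the initial threshold (a quarter of the genome length) and the 35% divergence rate are assumed defaults. Combining importances from the final population is left out.

```python
import numpy as np


def hux(a, b, rng):
    """Half-uniform crossover: swap exactly half of the differing bits."""
    diff = np.flatnonzero(a != b)
    swap = rng.choice(diff, size=len(diff) // 2, replace=False)
    c1, c2 = a.copy(), b.copy()
    c1[swap], c2[swap] = b[swap], a[swap]
    return c1, c2


def chc(fitness, n_bits, pop_size=20, generations=50, divergence=0.35, seed=0):
    rng = np.random.default_rng(seed)
    pop = rng.integers(0, 2, size=(pop_size, n_bits))
    scores = np.array([fitness(ind) for ind in pop], dtype=float)
    threshold = n_bits // 4  # incest-prevention threshold
    for _ in range(generations):
        children = []
        order = rng.permutation(pop_size)
        for i, j in zip(order[::2], order[1::2]):
            # Heterogeneous recombination: only sufficiently different parents mate.
            if (pop[i] != pop[j]).sum() // 2 > threshold:
                children.extend(hux(pop[i], pop[j], rng))
        survived_children = False
        if children:
            # Cross-generational elitism: parents and children compete together.
            merged = np.vstack([pop, np.array(children)])
            merged_scores = np.concatenate([scores, [fitness(c) for c in children]])
            best = np.argsort(-merged_scores, kind="stable")[:pop_size]
            survived_children = bool((best >= pop_size).any())
            pop, scores = merged[best], merged_scores[best]
        if not survived_children:
            threshold -= 1
        if threshold < 0:
            # Cataclysmic mutation: restart around the best individual.
            elite = pop[np.argmax(scores)].copy()
            flips = rng.random((pop_size, n_bits)) < divergence
            pop = np.where(flips, 1 - elite, elite)
            pop[0] = elite
            scores = np.array([fitness(ind) for ind in pop], dtype=float)
            threshold = int(divergence * (1 - divergence) * n_bits)
    return pop, scores


target = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1])


def fitness(mask):
    # Toy objective: number of positions that agree with a hidden target mask.
    return (mask == target).sum()


pop, scores = chc(fitness, n_bits=len(target))
print(pop[np.argmax(scores)], scores.max())
```

Note that there is no per-bit mutation during normal generations; diversity comes only from the crossover restriction and the occasional restart.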

## Mask Generation: SFFS

Sequential Forward Floating Selection. We don't use the adaptive version: there will often be many columns, and with many columns it is too slow.
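For reference, a plain (non-adaptive) SFFS loop can be sketched in pure Python as below. It is not the `SFFSelector` implementation; `score` stands for any function that rates a feature subset (for example a cross-validated model score), and the toy scoring function is invented for the example.

```python
def sffs(score, n_features, k):
    """Sequential forward floating selection of k features (sketch)."""
    selected = []
    best_by_size = {}  # best score seen so far for each subset size
    while len(selected) < k:
        # Forward step: add the feature that gives the highest score.
        candidates = [f for f in range(n_features) if f not in selected]
        add = max(candidates, key=lambda f: score(selected + [f]))
        selected = selected + [add]
        size = len(selected)
        best_by_size[size] = max(best_by_size.get(size, float("-inf")), score(selected))
        # Floating step: drop features while that beats the best smaller subset.
        while len(selected) > 2:
            drop = max(selected, key=lambda f: score([g for g in selected if g != f]))
            reduced = [g for g in selected if g != drop]
            if score(reduced) > best_by_size[len(reduced)]:
                selected = reduced
                best_by_size[len(reduced)] = score(reduced)
            else:
                break
    return selected


weights = [0.5, 0.4, 0.3, 0.0, 0.0]


def toy_score(subset):
    # Additive relevance, but features 0 and 1 are largely redundant together.
    total = sum(weights[f] for f in subset)
    if 0 in subset and 1 in subset:
        total -= 0.35
    return total


chosen = sffs(toy_score, n_features=5, k=2)
print(sorted(chosen))
```

Each forward step evaluates every remaining feature and each floating step re-evaluates every selected one, which is why the cost grows quickly with the number of columns.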