# Feature selection

We use feature selection after wrangling for two reasons.

1. Obtain a set of good features that represents the current dataset.
2. Obtain the set of *not good* features that should be refined in the next wrangling step.

This happens in three steps.

1. A first preselection step removes obviously bad features.
2. A second preselection step removes features that have the same predictive capabilities, so that the final feature selection step does not have to choose between them.
3. A real feature selection step makes the final decision.

The following methods are implemented in this notebook.

1. A (baseline) random sampling based approach — done.
2. CHCGA — a genetic algorithm based approach — done.
3. SFFS — a forward selection based approach — done.
4. AdaBoost with decision stump approach — TODO.

Both (1) and (2) allow us to set a maximum running time.

In [10]:
%reload_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
from typing import Optional, List, Tuple, Callable
from tqdm.notebook import tqdm
from avatar.language import WranglingLanguage
from avatar.analysis import *

Load dataset.

In [11]:
titanic = pd.read_csv("../data/raw/demo/titanic.csv")
titanic.Survived = titanic.Survived.astype("category")
titanic

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


Find transformed columns. Don't use replacement.

In [12]:
language = WranglingLanguage()
expanded = language.expand(titanic, target="Survived")
expanded

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,...,OneHot()(Parch)_5,OneHot()(Parch)_6,OneHot()(Embarked)_C,OneHot()(Embarked)_Q,OneHot()(Embarked)_S,"NaN(Pernot, Mr. Rene)(Name)_Name","NaN(Somerton, Mr. Francis William)(Name)_Name",WordToNumber()(Ticket)_Ticket,ModeImputation()(Cabin)_Cabin,ModeImputation()(Embarked)_Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,...,0,0,0,0,1,"Braund, Mr. Owen Harris","Braund, Mr. Owen Harris",,B96 B98,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,...,0,0,1,0,0,"Cumings, Mrs. John Bradley (Florence Briggs Th...","Cumings, Mrs. John Bradley (Florence Briggs Th...",,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,...,0,0,0,0,1,"Heikkinen, Miss. Laina","Heikkinen, Miss. Laina",,B96 B98,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,...,0,0,0,0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)","Futrelle, Mrs. Jacques Heath (Lily May Peel)",113803.0,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,...,0,0,0,0,1,"Allen, Mr. William Henry","Allen, Mr. William Henry",373450.0,B96 B98,S
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,...,0,0,0,0,1,"Montvila, Rev. Juozas","Montvila, Rev. Juozas",211536.0,B96 B98,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,...,0,0,0,0,1,"Graham, Miss. Margaret Edith","Graham, Miss. Margaret Edith",112053.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,...,0,0,0,0,1,"Johnston, Miss. Catherine Helen ""Carrie""","Johnston, Miss. Catherine Helen ""Carrie""",,B96 B98,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,...,0,0,1,0,0,"Behr, Mr. Karl Howell","Behr, Mr. Karl Howell",111369.0,C148,C


## Preselection

Preselect features that will never be appropriate.

* Columns with too many missing values are removed.
* Constant columns are removed.
* Categorical columns in which every value is unique are removed.
* Identical columns are collapsed into one.
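
The four rules above can be sketched with plain pandas. This is only an illustration, not the implementation behind the `avatar.selection` preselectors: the function name `preselect` and the `max_missing` threshold are assumptions made for the example.

```python
import pandas as pd


def preselect(df: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    """Drop columns that can never be useful features (illustrative sketch)."""
    keep = []
    for name in df.columns:
        column = df[name]
        # Rule 1: too many missing values (threshold is an assumption).
        if column.isna().mean() > max_missing:
            continue
        # Rule 2: constant column, ignoring missing values.
        if column.nunique(dropna=True) <= 1:
            continue
        # Rule 3: non-numeric column in which every value is unique.
        if not pd.api.types.is_numeric_dtype(column) and column.nunique() == len(column):
            continue
        # Rule 4: identical to a column that was already kept.
        if any(column.equals(df[other]) for other in keep):
            continue
        keep.append(name)
    return df[keep]
```

In practice the target column would be set aside before calling such a function so that it cannot be dropped by accident.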


In [17]:
from avatar.selection import *
    
preselector = StackedPreselector([ConstantPreselector(),
                                  IdenticalPreselector(),
                                  BijectivePreselector(),
                                  UniquePreselector(),
                                  MissingPreselector()])
preselected = preselector.select(expanded)
preselected

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,...,OneHot()(Parch)_3,OneHot()(Parch)_4,OneHot()(Parch)_5,OneHot()(Parch)_6,OneHot()(Embarked)_C,OneHot()(Embarked)_Q,OneHot()(Embarked)_S,WordToNumber()(Ticket)_Ticket,ModeImputation()(Cabin)_Cabin,ModeImputation()(Embarked)_Embarked
0,1,0,3,male,22.0,1,0,A/5 21171,7.2500,S,...,0,0,0,0,0,0,1,,B96 B98,S
1,2,1,1,female,38.0,1,0,PC 17599,71.2833,C,...,0,0,0,0,1,0,0,,C85,C
2,3,1,3,female,26.0,0,0,STON/O2. 3101282,7.9250,S,...,0,0,0,0,0,0,1,,B96 B98,S
3,4,1,1,female,35.0,1,0,113803,53.1000,S,...,0,0,0,0,0,0,1,113803.0,C123,S
4,5,0,3,male,35.0,0,0,373450,8.0500,S,...,0,0,0,0,0,0,1,373450.0,B96 B98,S
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,male,27.0,0,0,211536,13.0000,S,...,0,0,0,0,0,0,1,211536.0,B96 B98,S
887,888,1,1,female,19.0,0,0,112053,30.0000,S,...,0,0,0,0,0,0,1,112053.0,B42,S
888,889,0,3,female,,1,2,W./C. 6607,23.4500,S,...,0,0,0,0,0,0,1,,B96 B98,S
889,890,1,1,male,26.0,0,0,111369,30.0000,C,...,0,0,0,0,1,0,0,111369.0,C148,C


We sample a subset of the data with at least one row containing no NaNs.

In [18]:
sampler = WeightedColumnSampler(preselected)
sampled = sampler.sample()
sampled

Unnamed: 0,PassengerId,Survived,Pclass,Sex,SibSp,Parch,Ticket,Fare,"Split(, )(Name)_0","Split(, )(Name)_1",...,OneHot()(Embarked)_C,OneHot()(Embarked)_Q,OneHot()(Embarked)_S,ModeImputation()(Cabin)_Cabin,ModeImputation()(Embarked)_Embarked,Age,Embarked,Split( )(Name)_3,"ExtractWord([Master, Dr, Rev, Miss, Mr, Mrs])(Name)_0",WordToNumber()(Ticket)_Ticket
0,1,0,3,male,1,0,A/5 21171,7.2500,Braund,Mr. Owen Harris,...,0,0,1,B96 B98,S,22.0,S,Harris,Mr,
1,2,1,1,female,1,0,PC 17599,71.2833,Cumings,Mrs. John Bradley (Florence Briggs Thayer),...,1,0,0,C85,C,38.0,C,Bradley,Mr,
2,3,1,3,female,0,0,STON/O2. 3101282,7.9250,Heikkinen,Miss. Laina,...,0,0,1,B96 B98,S,26.0,S,,Miss,
3,4,1,1,female,1,0,113803,53.1000,Futrelle,Mrs. Jacques Heath (Lily May Peel),...,0,0,1,C123,S,35.0,S,Heath,Mr,113803.0
4,5,0,3,male,0,0,373450,8.0500,Allen,Mr. William Henry,...,0,0,1,B96 B98,S,35.0,S,Henry,Mr,373450.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,male,0,0,211536,13.0000,Montvila,Rev. Juozas,...,0,0,1,B96 B98,S,27.0,S,,Rev,211536.0
887,888,1,1,female,0,0,112053,30.0000,Graham,Miss. Margaret Edith,...,0,0,1,B42,S,19.0,S,Edith,Miss,112053.0
888,889,0,3,female,1,2,W./C. 6607,23.4500,Johnston,"Miss. Catherine Helen ""Carrie""",...,0,0,1,B96 B98,S,,S,Helen,Miss,
889,890,1,1,male,0,0,111369,30.0000,Behr,Mr. Karl Howell,...,1,0,0,C148,C,26.0,C,Howell,Mr,111369.0


## Evaluation

How do we evaluate a set of features?

* Which **model** do we use?
* What **max depth** do we use?

Wrapping evaluation in a class saves the time of converting data for MERCS and allows us to reuse the same split in every iteration.

In [128]:
from avatar.analysis import FeatureEvaluator, FoldedFeatureEvaluator

mask = np.random.randint(2, size=len(sampled.columns))

evaluator = FeatureEvaluator(sampled, target="Survived")
evaluator.importances(mask)

array([0.        , 0.        , 0.10480587, 0.31877449, 0.        ,
       0.00667055, 0.04149813, 0.0977558 , 0.        , 0.10441766,
       0.07104207, 0.03637528, 0.04126673, 0.        , 0.03706598,
       0.        , 0.        , 0.        , 0.03541291, 0.        ,
       0.01686683, 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.00085735,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.08719035, 0.        , 0.        , 0.        ,
       0.        ])

In [159]:
from avatar.analysis import FoldedFeatureEvaluator

mask = np.random.randint(2, size=len(sampled.columns))

evaluator = FoldedFeatureEvaluator(sampled, target="Survived", folds=5)
evaluator.accuracy(mask)

0.7191011235955056
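
To illustrate why wrapping evaluation in a class helps, here is a minimal sketch that fixes the train/test split once and then scores arbitrary feature masks against it. It uses a scikit-learn decision tree rather than MERCS, and the class name `SplitOnceEvaluator` and its parameters are hypothetical; the actual `FeatureEvaluator` may work differently.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


class SplitOnceEvaluator:
    """Scores feature masks on one fixed train/test split (illustrative)."""

    def __init__(self, X, y, max_depth=4, random_state=0):
        X = np.asarray(X, dtype=float)
        self.n_features = X.shape[1]
        self.max_depth = max_depth
        self.random_state = random_state
        # The split is made once here and reused by every call to evaluate().
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
            X, y, test_size=0.25, random_state=random_state
        )

    def evaluate(self, mask):
        """Return (accuracy, importances), importances having one entry per column."""
        mask = np.asarray(mask, dtype=bool)
        importances = np.zeros(self.n_features)
        if not mask.any():
            return 0.0, importances  # nothing selected, nothing to train on
        model = DecisionTreeClassifier(
            max_depth=self.max_depth, random_state=self.random_state
        )
        model.fit(self.X_train[:, mask], self.y_train)
        # Unselected columns keep an importance of zero.
        importances[mask] = model.feature_importances_
        return model.score(self.X_test[:, mask], self.y_test), importances
```

Because every mask is scored on the same split, differences in accuracy reflect the chosen features rather than a change in the sampled rows.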

### Timing

Small experiment on runtime for decision trees of different depths.

In [None]:
# import time
# import itertools

# rng = np.random.RandomState(1337)

# classifiers = ["DT"]
# depths = [1, 4, 8, 16, 32]
# iterations = 50

# mask = rng.randint(2, size=len(sampled.columns)).astype(bool)

# results = list()
# for classifier, depth in tqdm(list(itertools.product(classifiers, depths))):
#     start = time.time()

#     evaluator = FeatureEvaluator(sampled,
#                                  target="Survived",
#                                  classifier_algorithm=classifier,
#                                  max_depth=depth,
#                                  random_state=depth)

#     accuracies = list()
#     for i in range(iterations):
#         acc, fis = evaluator.evaluate(mask)
#         accuracies.append(acc)
#     results.append((classifier, depth, time.time() - start, np.mean(accuracies)))

# for classifier, depth, t, score in results: 
#     print("{}\t{}\t{}\t{}".format(classifier, depth, t, score))

Next, we look for features with the same predictive power using a wrapper approach. A decision stump is learned for each feature individually and the predictions of these stumps are compared. Features whose stumps make the same predictions are pruned, keeping one representative.

In [21]:
from avatar.selection import IterativePreselector


preselected = IterativePreselector().select(sampled, "Survived")
preselected

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Parch,Ticket,"Split(, )(Name)_0","Split(, )(Name)_1",Split(.)(Name)_0,Split(.)(Name)_1,...,OneHot()(Parch)_1,OneHot()(Embarked)_C,OneHot()(Embarked)_S,ModeImputation()(Cabin)_Cabin,ModeImputation()(Embarked)_Embarked,Age,Embarked,Split( )(Name)_3,"ExtractWord([Master, Dr, Rev, Miss, Mr, Mrs])(Name)_0",WordToNumber()(Ticket)_Ticket
0,1,0,3,male,0,A/5 21171,Braund,Mr. Owen Harris,"Braund, Mr",Owen Harris,...,0,0,1,B96 B98,S,22.0,S,Harris,Mr,
1,2,1,1,female,0,PC 17599,Cumings,Mrs. John Bradley (Florence Briggs Thayer),"Cumings, Mrs",John Bradley (Florence Briggs Thayer),...,0,1,0,C85,C,38.0,C,Bradley,Mr,
2,3,1,3,female,0,STON/O2. 3101282,Heikkinen,Miss. Laina,"Heikkinen, Miss",Laina,...,0,0,1,B96 B98,S,26.0,S,,Miss,
3,4,1,1,female,0,113803,Futrelle,Mrs. Jacques Heath (Lily May Peel),"Futrelle, Mrs",Jacques Heath (Lily May Peel),...,0,0,1,C123,S,35.0,S,Heath,Mr,113803.0
4,5,0,3,male,0,373450,Allen,Mr. William Henry,"Allen, Mr",William Henry,...,0,0,1,B96 B98,S,35.0,S,Henry,Mr,373450.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,male,0,211536,Montvila,Rev. Juozas,"Montvila, Rev",Juozas,...,0,0,1,B96 B98,S,27.0,S,,Rev,211536.0
887,888,1,1,female,0,112053,Graham,Miss. Margaret Edith,"Graham, Miss",Margaret Edith,...,0,0,1,B42,S,19.0,S,Edith,Miss,112053.0
888,889,0,3,female,2,W./C. 6607,Johnston,"Miss. Catherine Helen ""Carrie""","Johnston, Miss","Catherine Helen ""Carrie""",...,0,0,1,B96 B98,S,,S,Helen,Miss,
889,890,1,1,male,0,111369,Behr,Mr. Karl Howell,"Behr, Mr",Karl Howell,...,0,1,0,C148,C,26.0,C,Howell,Mr,111369.0
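
The stump-based pruning described before this cell can be sketched as follows. The sketch assumes numeric features and compares predictions on the training rows themselves; the function name `prune_equivalent_features` is hypothetical and `IterativePreselector` may handle these details differently.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def prune_equivalent_features(X, y):
    """Return indices of columns whose stump predictions are pairwise distinct."""
    X = np.asarray(X, dtype=float)
    kept, seen = [], set()
    for j in range(X.shape[1]):
        # A decision stump is a tree of depth one trained on a single column.
        stump = DecisionTreeClassifier(max_depth=1, random_state=0)
        predictions = stump.fit(X[:, [j]], y).predict(X[:, [j]])
        key = tuple(predictions.tolist())
        # Keep only the first column that produces this prediction vector.
        if key not in seen:
            seen.add(key)
            kept.append(j)
    return kept
```

A column that is a rescaled copy of another yields the same stump predictions and is therefore dropped, even though the two columns are not identical.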


## Feature selection


Next, we can take a look at actual feature selection. Three wrapper methods are implemented.

* Randomly sampling columns, training a model and getting the feature relevances.
* A genetic approach, which is similar but should combine features slightly better. We perform a small experiment on whether to fix the genome size.
* Classic sequential and backwards sequential feature selection.

The idea is the same for all of them: each algorithm returns feature importances together with the associated scores.

### Random

Randomly sample subsets of features, evaluate and get feature relevances.

In [123]:
from avatar.selection import SamplingSelector


ss = SamplingSelector(preselected, target="Survived")
ss.run(iterations=50)

[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1.]
[ 0.          0.          1.77965814 16.14723465  0.          1.96762478
  0.          3.26163199  0.          0.          0.          0.
  0.          0.          0.          0.21122215  1.13775686  0.64085981
  0.62492974  1.29896779  0.07987337  2.59699762  4.39952733 12.34313512
  0.          0.          0.          0.          0.          0.16864547
  0.          1.31897393  0.          0.          1.73966326  0.28329798]
50.0
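
A rough sketch of the random baseline, under stated assumptions: `evaluate` is a hypothetical callable that maps a boolean mask to a `(score, importances)` pair, and importances are accumulated weighted by the score of the subset they came from. How `SamplingSelector` actually combines its results is not shown in this notebook.

```python
import numpy as np


def sampling_selection(evaluate, n_features, iterations=50, seed=0):
    """Accumulate score-weighted importances over random feature subsets."""
    rng = np.random.default_rng(seed)
    totals = np.zeros(n_features)  # summed, score-weighted importances
    counts = np.zeros(n_features)  # how often each feature was sampled
    for _ in range(iterations):
        # Each feature is included independently with probability 1/2.
        mask = rng.integers(0, 2, size=n_features).astype(bool)
        if not mask.any():
            continue  # skip empty subsets
        score, importances = evaluate(mask)
        totals += score * np.asarray(importances)
        counts += mask
    return totals, counts
```

Dividing `totals` by `counts` (where nonzero) would give an average relevance per feature that is not biased by how often a feature happened to be drawn.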


### Genetic

The CHC genetic algorithm for feature selection. It uses

* cross-generational elitist selection,
* heterogeneous recombination, and
* cataclysmic mutation

to maintain diversity and avoid stagnation.

Once the final population is obtained, the importances of its individuals are combined.
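
For readers unfamiliar with CHC, the three ingredients can be sketched generically as below. This follows the usual textbook description of CHC (half-uniform crossover with incest prevention, elitist survival, restart from the best individual) and is not the code behind `CHCGASelector`; the `fitness` callable, the `divergence` rate and the other parameter names are assumptions. The sketch returns only the best mask and does not combine importances.

```python
import numpy as np


def hux(a, b, rng):
    """Half-uniform crossover: swap exactly half of the differing bits."""
    differing = np.flatnonzero(a != b)
    swap = rng.choice(differing, size=len(differing) // 2, replace=False)
    child_a, child_b = a.copy(), b.copy()
    child_a[swap], child_b[swap] = b[swap], a[swap]
    return child_a, child_b


def chc(fitness, n_bits, pop_size=20, generations=30, divergence=0.35, seed=0):
    rng = np.random.default_rng(seed)
    population = rng.integers(0, 2, size=(pop_size, n_bits)).astype(bool)
    scores = np.array([fitness(ind) for ind in population])
    threshold = n_bits // 4  # incest-prevention threshold
    for _ in range(generations):
        order = rng.permutation(pop_size)
        children = []
        for i, j in zip(order[::2], order[1::2]):
            a, b = population[i], population[j]
            # Heterogeneous recombination: only sufficiently different parents mate.
            if np.count_nonzero(a != b) // 2 > threshold:
                children.extend(hux(a, b, rng))
        accepted = False
        if children:
            children = np.array(children)
            child_scores = np.array([fitness(c) for c in children])
            # Cross-generational elitist selection: best of parents and children.
            everyone = np.vstack([population, children])
            all_scores = np.concatenate([scores, child_scores])
            best = np.argsort(-all_scores, kind="stable")[:pop_size]
            accepted = bool((best >= pop_size).any())
            population, scores = everyone[best], all_scores[best]
        if not accepted:
            threshold -= 1
        if threshold < 0:
            # Cataclysmic mutation: restart from heavily mutated copies of the elite.
            elite = population[np.argmax(scores)].copy()
            flips = rng.random((pop_size, n_bits)) < divergence
            population = elite ^ flips
            population[0] = elite  # keep one unmodified copy of the elite
            scores = np.array([fitness(ind) for ind in population])
            threshold = int(divergence * (1 - divergence) * n_bits)
    best = np.argmax(scores)
    return population[best], scores[best]
```

For feature selection, `fitness` would typically train a model on the columns selected by the mask and return its accuracy.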

In [100]:
a = np.array([[1,2],[3,4]])
c = np.array([[2], [4]])

c.shape

(2, 1)

In [105]:
from avatar.selection import CHCGASelector, Population, Individual
    

gas = CHCGASelector(preselected, target="Survived")
gas.run(20)

[0.         0.         0.08170259 0.45102093 0.         0.00980193
 0.         0.02984449 0.         0.         0.         0.
 0.         0.         0.         0.01870419 0.01322304 0.01534058
 0.00530969 0.00802392 0.00244979 0.08998058 0.10610285 0.16849543
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.        ] 1.0


### SFFS

Sequential Forward Floating Selection. We don't use the adaptive version: there will often be many columns, which makes it too slow.

In [125]:
from avatar.selection import SFFSelector
    

sffs = SFFSelector(preselected, target="Survived")
sffs.run(iterations=10)
sffs.scores()

[0.03211201 0.         1.2791235  6.30997341 0.         0.11484567
 0.         0.2414735  0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.        ] 7.977528089887642
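
As a reminder of how the floating search works, here is a compact, generic sketch of (non-adaptive) SFFS. The `score` callable, which maps a list of feature indices to a number, and the function name `sffs` are assumptions for illustration; `SFFSelector` may differ, for example in how it derives per-feature scores. The sketch requires `k <= n_features`.

```python
def sffs(score, n_features, k):
    """Sequential Forward Floating Selection up to subsets of size k."""
    selected = []
    best_by_size = {}  # subset size -> (score, subset)
    while len(selected) < k:
        # Forward step: add the feature that improves the score the most.
        candidates = [f for f in range(n_features) if f not in selected]
        best_f = max(candidates, key=lambda f: score(selected + [f]))
        selected = selected + [best_f]
        s = score(selected)
        size = len(selected)
        if size not in best_by_size or s > best_by_size[size][0]:
            best_by_size[size] = (s, list(selected))
        # Floating step: drop features while that beats the best known
        # subset of the smaller size.
        while len(selected) > 2:
            worst = max(
                selected, key=lambda f: score([g for g in selected if g != f])
            )
            reduced = [g for g in selected if g != worst]
            s = score(reduced)
            if s > best_by_size[len(reduced)][0]:
                selected = reduced
                best_by_size[len(reduced)] = (s, list(reduced))
            else:
                break
    return best_by_size[k]
```

Each backward removal must strictly improve on the best subset recorded for that size, which is what guarantees that the search terminates.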


### AdaBoost

In [31]:
selected.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Sex', 'SibSp', 'Parch', 'Ticket',
       'Fare', 'Split(-)(Name)_0', 'Split(. )(Name)_0', 'Split(. )(Name)_1',
       'Split(')(Name)_0', 'Split(.)(Name)_1', 'Split(, )(Name)_0',
       'Split(, )(Name)_1', 'Split( )(Name)_0', 'Split( )(Name)_1',
       'Split( )(Name)_2', 'Split(,)(Name)_1', 'Split(./)(Ticket)_0',
       'Split(. )(Ticket)_0', 'Split(/)(Ticket)_0', 'Split(.)(Ticket)_0',
       'Split( )(Ticket)_0', 'Lowercase()(Ticket)_Ticket',
       'OneHot()(Pclass)_1', 'OneHot()(Pclass)_2', 'OneHot()(Pclass)_3',
       'OneHot()(Sex)_female', 'OneHot()(Sex)_male', 'OneHot()(SibSp)_0',
       'OneHot()(SibSp)_1', 'OneHot()(SibSp)_2', 'OneHot()(SibSp)_3',
       'OneHot()(SibSp)_4', 'OneHot()(SibSp)_5', 'OneHot()(SibSp)_8',
       'OneHot()(Parch)_0', 'OneHot()(Parch)_1', 'OneHot()(Parch)_2',
       'OneHot()(Parch)_3', 'OneHot()(Parch)_4', 'OneHot()(Parch)_5',
       'OneHot()(Parch)_6', 'OneHot()(Embarked)_C', 'OneHot()(Embarked)_Q',
  

In [32]:
sampled.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Split( (")(Name)_0', 'Split(" )(Name)_0',
       'Split(-)(Name)_0', 'Split( ()(Name)_0', 'Split(. )(Name)_0',
       'Split(. )(Name)_1', 'Split(/)(Name)_0', 'Split((")(Name)_0',
       'Split(")(Name)_0', 'Split(. ()(Name)_0', 'Split() )(Name)_0',
       'Split())(Name)_0', 'Split(')(Name)_0', 'Split(()(Name)_0',
       'Split() ()(Name)_0', 'Split(.)(Name)_0', 'Split(.)(Name)_1',
       'Split("))(Name)_0', 'Split() (")(Name)_0', 'Split(, )(Name)_0',
       'Split(, )(Name)_1', 'Split( ")(Name)_0', 'Split( )(Name)_0',
       'Split( )(Name)_1', 'Split( )(Name)_2', 'Split(,)(Name)_0',
       'Split(,)(Name)_1', 'Split(./)(Ticket)_0', 'Split(. )(Ticket)_0',
       'Split(/)(Ticket)_0', 'Split(.)(Ticket)_0', 'Split( )(Ticket)_0',
       'ExtractWord([male, female])(Sex)_0', 'Lowercase()(Name)_Name',
       'Lowercase()(Sex)_Sex', 'Lowercase()(Ticket)_Ticket',
       'OneHot()(Pclass)_