# Feature selection

We use feature selection after wrangling for two reasons.

1. Obtain a set of good features that represents the current dataset.
2. Obtain the set of *not good* features that should be refined in the next wrangling step.

This happens in three steps.

1. A first preselection step removes obviously bad features.
2. A second preselection step removes features that have the same predictive capabilities, so that the final feature selection step does not have to choose between them.
3. A real feature selection step makes the final decision.

The following methods are implemented in this notebook.

1. A (baseline) random sampling based approach — done.
2. CHCGA — a genetic algorithm based approach — done.
3. SFFS — a forward selection based approach — done.

Both (1) and (2) allow us to set a maximum running time.

In [3]:
%reload_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
from typing import Optional, List, Tuple, Callable
from tqdm.notebook import tqdm
from avatar.language import WranglingLanguage
from avatar.analysis import *

Load dataset.

In [4]:
titanic = pd.read_csv("../data/raw/demo/titanic.csv")
titanic.Survived = titanic.Survived.astype("category")
titanic

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


Find transformed columns. Transformations add new columns and do not replace the original ones.

In [5]:
language = WranglingLanguage()
expanded = language.expand(titanic, target="Survived")
expanded

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,...,OneHot()(Parch)_5,OneHot()(Parch)_6,OneHot()(Embarked)_C,OneHot()(Embarked)_Q,OneHot()(Embarked)_S,"NaN(Pernot, Mr. Rene)(Name)_Name","NaN(Somerton, Mr. Francis William)(Name)_Name",WordToNumber()(Ticket)_Ticket,ModeImputation()(Cabin)_Cabin,ModeImputation()(Embarked)_Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,...,0,0,0,0,1,"Braund, Mr. Owen Harris","Braund, Mr. Owen Harris",,B96 B98,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,...,0,0,1,0,0,"Cumings, Mrs. John Bradley (Florence Briggs Th...","Cumings, Mrs. John Bradley (Florence Briggs Th...",,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,...,0,0,0,0,1,"Heikkinen, Miss. Laina","Heikkinen, Miss. Laina",,B96 B98,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,...,0,0,0,0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)","Futrelle, Mrs. Jacques Heath (Lily May Peel)",113803.0,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,...,0,0,0,0,1,"Allen, Mr. William Henry","Allen, Mr. William Henry",373450.0,B96 B98,S
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,...,0,0,0,0,1,"Montvila, Rev. Juozas","Montvila, Rev. Juozas",211536.0,B96 B98,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,...,0,0,0,0,1,"Graham, Miss. Margaret Edith","Graham, Miss. Margaret Edith",112053.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,...,0,0,0,0,1,"Johnston, Miss. Catherine Helen ""Carrie""","Johnston, Miss. Catherine Helen ""Carrie""",,B96 B98,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,...,0,0,1,0,0,"Behr, Mr. Karl Howell","Behr, Mr. Karl Howell",111369.0,C148,C


## Pruning

Remove features that are not appropriate and that would not benefit from more wrangling.
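To make the pruning idea concrete, here is a minimal pandas sketch of the three kinds of checks. It is an illustration only and not the implementation behind `MissingFilter`, `ConstantFilter` or `IdenticalFilter`; the function name `prune_columns` and the exact criteria (entirely missing, at most one distinct value, exact duplicate of an earlier column) are assumptions.

```python
import pandas as pd


def prune_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Drop all-missing, constant and duplicated columns (illustrative only)."""
    # Columns in which every value is missing carry no information.
    df = df.loc[:, df.notna().any(axis=0)]
    # Constant columns have at most one distinct non-missing value.
    df = df.loc[:, df.nunique(dropna=True) > 1]
    # Of every group of identical columns, keep only the first one.
    return df.loc[:, ~df.T.duplicated()]


toy = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [1, 2, 3],           # identical to "a"
    "c": [7, 7, 7],           # constant
    "d": [None, None, None],  # entirely missing
    "e": ["x", "y", "x"],
})
print(list(prune_columns(toy).columns))
```

None of these checks needs the target column, which is why they can run before any model is trained.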

In [23]:
from avatar.selection import *


pruner = StackedFilter([MissingFilter(),
                        ConstantFilter(),
                        IdenticalFilter()])
pruned = pruner.select(expanded)
pruned

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,...,OneHot()(Parch)_5,OneHot()(Parch)_6,OneHot()(Embarked)_C,OneHot()(Embarked)_Q,OneHot()(Embarked)_S,"NaN(Pernot, Mr. Rene)(Name)_Name","NaN(Somerton, Mr. Francis William)(Name)_Name",WordToNumber()(Ticket)_Ticket,ModeImputation()(Cabin)_Cabin,ModeImputation()(Embarked)_Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,...,0,0,0,0,1,"Braund, Mr. Owen Harris","Braund, Mr. Owen Harris",,B96 B98,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,...,0,0,1,0,0,"Cumings, Mrs. John Bradley (Florence Briggs Th...","Cumings, Mrs. John Bradley (Florence Briggs Th...",,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,...,0,0,0,0,1,"Heikkinen, Miss. Laina","Heikkinen, Miss. Laina",,B96 B98,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,...,0,0,0,0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)","Futrelle, Mrs. Jacques Heath (Lily May Peel)",113803.0,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,...,0,0,0,0,1,"Allen, Mr. William Henry","Allen, Mr. William Henry",373450.0,B96 B98,S
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,...,0,0,0,0,1,"Montvila, Rev. Juozas","Montvila, Rev. Juozas",211536.0,B96 B98,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,...,0,0,0,0,1,"Graham, Miss. Margaret Edith","Graham, Miss. Margaret Edith",112053.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,...,0,0,0,0,1,"Johnston, Miss. Catherine Helen ""Carrie""","Johnston, Miss. Catherine Helen ""Carrie""",,B96 B98,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,...,0,0,1,0,0,"Behr, Mr. Karl Howell","Behr, Mr. Karl Howell",111369.0,C148,C


## Preselection

Filter out features that are not appropriate in their current form. These can still be wrangled.

* Remove columns with too many missing values.
* Remove columns consisting of unique, categorical values.
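The two rules in the list above can be sketched in plain pandas as follows. This is a hedged illustration, not the code of the `avatar` filters: the function name `preselect_columns`, the `max_missing` threshold of 0.5, and the choice to treat every non-numeric column as categorical are all assumptions made for the example.

```python
import pandas as pd


def preselect_columns(df: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    """Drop columns that are mostly missing or are unique, non-numeric values."""
    # Fraction of missing values per column.
    missing = df.isna().mean()
    # Treat every non-numeric column as categorical (an assumption).
    text_cols = df.select_dtypes(exclude="number").columns
    is_text = pd.Series(df.columns.isin(text_cols), index=df.columns)
    # A column is "unique" when all of its non-missing values are distinct.
    all_unique = df.nunique(dropna=True) == df.notna().sum()
    keep = (missing <= max_missing) & ~(is_text & all_unique)
    return df.loc[:, keep]


toy = pd.DataFrame({
    "name": ["ann", "bob", "cy", "dee"],  # unique text
    "cabin": [None, None, None, "C85"],   # 75% missing
    "sex": ["f", "m", "m", "f"],
    "fare": [7.25, 71.28, 7.92, 53.10],   # unique, but numeric
})
print(list(preselect_columns(toy).columns))
```

Columns dropped here are candidates for further wrangling: a unique text column such as a name may still yield a useful feature after splitting or extraction.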

In [25]:
from avatar.selection import *
    
preselector = StackedFilter([BijectiveFilter(),
                             UniqueFilter(),
                             CorrelationFilter()])
preselected = preselector.select(pruned, target="Survived")
preselected

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Split( )(Name)_2,...,OneHot()(Parch)_3,OneHot()(Parch)_4,OneHot()(Parch)_5,OneHot()(Parch)_6,OneHot()(Embarked)_C,OneHot()(Embarked)_Q,OneHot()(Embarked)_S,WordToNumber()(Ticket)_Ticket,ModeImputation()(Cabin)_Cabin,ModeImputation()(Embarked)_Embarked
0,1,0,3,male,22.0,1,0,7.2500,S,Owen,...,0,0,0,0,0,0,1,,B96 B98,S
1,2,1,1,female,38.0,1,0,71.2833,C,John,...,0,0,0,0,1,0,0,,C85,C
2,3,1,3,female,26.0,0,0,7.9250,S,Laina,...,0,0,0,0,0,0,1,,B96 B98,S
3,4,1,1,female,35.0,1,0,53.1000,S,Jacques,...,0,0,0,0,0,0,1,113803.0,C123,S
4,5,0,3,male,35.0,0,0,8.0500,S,William,...,0,0,0,0,0,0,1,373450.0,B96 B98,S
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,male,27.0,0,0,13.0000,S,Juozas,...,0,0,0,0,0,0,1,211536.0,B96 B98,S
887,888,1,1,female,19.0,0,0,30.0000,S,Margaret,...,0,0,0,0,0,0,1,112053.0,B42,S
888,889,0,3,female,,1,2,23.4500,S,Catherine,...,0,0,0,0,0,0,1,,B96 B98,S
889,890,1,1,male,26.0,0,0,30.0000,C,Karl,...,0,0,0,0,1,0,0,111369.0,C148,C


In [26]:
preselected.to_csv("../data/raw/demo/titanic_expanded.csv")

We sample a subset of the data with at least one row containing no NaNs.

In [7]:
sampler = WeightedColumnSampler(preselected)
sampled = sampler.sample()
sampled

Unnamed: 0,PassengerId,Survived,Pclass,Sex,SibSp,Parch,Ticket,Fare,Split(')(Name)_0,Split(.)(Name)_0,...,OneHot()(Embarked)_Q,OneHot()(Embarked)_S,ModeImputation()(Cabin)_Cabin,ModeImputation()(Embarked)_Embarked,Age,Embarked,Split( )(Name)_3,ExtractNumber()(Ticket)_0,"ExtractWord([Master, Mrs, Mr, Rev, Miss, Dr])(Name)_0",WordToNumber()(Ticket)_Ticket
0,1,0,3,male,1,0,A/5 21171,7.2500,"Braund, Mr. Owen Harris","Braund, Mr",...,0,1,B96 B98,S,22.0,S,Harris,5.0,Mr,
1,2,1,1,female,1,0,PC 17599,71.2833,"Cumings, Mrs. John Bradley (Florence Briggs Th...","Cumings, Mrs",...,0,0,C85,C,38.0,C,Bradley,17599.0,Mrs,
2,3,1,3,female,0,0,STON/O2. 3101282,7.9250,"Heikkinen, Miss. Laina","Heikkinen, Miss",...,0,1,B96 B98,S,26.0,S,,2.0,Miss,
3,4,1,1,female,1,0,113803,53.1000,"Futrelle, Mrs. Jacques Heath (Lily May Peel)","Futrelle, Mrs",...,0,1,C123,S,35.0,S,Heath,113803.0,Mrs,113803.0
4,5,0,3,male,0,0,373450,8.0500,"Allen, Mr. William Henry","Allen, Mr",...,0,1,B96 B98,S,35.0,S,Henry,373450.0,Mr,373450.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,male,0,0,211536,13.0000,"Montvila, Rev. Juozas","Montvila, Rev",...,0,1,B96 B98,S,27.0,S,,211536.0,Rev,211536.0
887,888,1,1,female,0,0,112053,30.0000,"Graham, Miss. Margaret Edith","Graham, Miss",...,0,1,B42,S,19.0,S,Edith,112053.0,Miss,112053.0
888,889,0,3,female,1,2,W./C. 6607,23.4500,"Johnston, Miss. Catherine Helen ""Carrie""","Johnston, Miss",...,0,1,B96 B98,S,,S,Helen,6607.0,Miss,
889,890,1,1,male,0,0,111369,30.0000,"Behr, Mr. Karl Howell","Behr, Mr",...,0,0,C148,C,26.0,C,Howell,111369.0,Mr,111369.0


Next, we look for features with the same predictive power using a wrapper approach. A decision stump is learned for each feature individually and the predictions of these stumps are compared. Features whose stumps make the same predictions are pruned.
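The stump comparison can be sketched with NumPy alone. The code below is a simplified stand-in and not the `CorrelationFilter` implementation: it assumes numeric features without missing values, integer class labels, and it compares predictions on the training data. The names `stump_predictions` and `drop_equivalent` are hypothetical.

```python
import numpy as np


def stump_predictions(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Fit a one-split stump on a single feature and return its predictions."""
    majority = np.bincount(y).argmax()
    best_pred, best_acc = np.full(len(y), majority), np.mean(y == majority)
    # Try every split point that leaves both sides non-empty.
    for threshold in np.unique(x)[:-1]:
        left = x <= threshold
        pred = np.where(left,
                        np.bincount(y[left]).argmax(),
                        np.bincount(y[~left]).argmax())
        acc = np.mean(pred == y)
        if acc > best_acc:
            best_pred, best_acc = pred, acc
    return best_pred


def drop_equivalent(X: np.ndarray, y: np.ndarray) -> list:
    """Keep the first column of every group whose stumps predict identically."""
    seen, kept = set(), []
    for j in range(X.shape[1]):
        key = stump_predictions(X[:, j], y).tobytes()
        if key not in seen:
            seen.add(key)
            kept.append(j)
    return kept


y = np.array([0, 0, 1, 1])
X = np.column_stack([
    [1.0, 2.0, 10.0, 12.0],  # fare
    [0.1, 0.2, 1.0, 1.2],    # rescaled fare, same split behaviour
    [5.0, 1.0, 4.0, 2.0],    # unrelated column
])
print(drop_equivalent(X, y))
```

Because a stump only depends on the ordering of the values, any monotone transformation of a column ends up in the same group as the original.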

In [6]:
from avatar.selection import CorrelationFilter


preselected = CorrelationFilter().select(sampled, target="Survived")
preselected

NameError: name 'sampled' is not defined

### Evaluation

Wrapping the evaluation in a class avoids converting the data for MERCS more than once and allows us to reuse the same split in every iteration.
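The design can be illustrated with a small stand-in class. This is a sketch under stated assumptions and not `FeatureEvaluator` itself: the class name `FixedSplitEvaluator` is hypothetical, and a toy nearest-centroid classifier replaces MERCS. What it shows is the pattern of converting and splitting once in `fit` and then scoring many masks against the same folds.

```python
import numpy as np


class FixedSplitEvaluator:
    """Hypothetical stand-in: fixes the folds once, then scores many masks."""

    def __init__(self, folds: int = 4, seed: int = 1337):
        self.folds = folds
        self.seed = seed

    def fit(self, X, y) -> "FixedSplitEvaluator":
        # Convert and split once; every later call reuses these arrays.
        self.X, self.y = np.asarray(X, dtype=float), np.asarray(y)
        order = np.random.RandomState(self.seed).permutation(len(self.y))
        self.test_folds = np.array_split(order, self.folds)
        return self

    def accuracy(self, mask) -> float:
        """Mean accuracy over the fixed folds, using only the masked columns."""
        cols = np.flatnonzero(mask)
        scores = []
        for test in self.test_folds:
            train = np.setdiff1d(np.arange(len(self.y)), test)
            scores.append(self._nearest_centroid(train, test, cols))
        return float(np.mean(scores))

    def _nearest_centroid(self, train, test, cols) -> float:
        # Toy model (not MERCS): predict the class with the closest mean.
        X_train, y_train = self.X[np.ix_(train, cols)], self.y[train]
        classes = np.unique(y_train)
        centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
        diffs = self.X[np.ix_(test, cols)][:, None, :] - centroids
        nearest = (diffs ** 2).sum(axis=2).argmin(axis=1)
        return float(np.mean(classes[nearest] == self.y[test]))


rng = np.random.RandomState(0)
y = np.repeat([0, 1], 20)
X = np.column_stack([y + 0.1 * rng.randn(40), rng.randn(40)])
evaluator = FixedSplitEvaluator().fit(X, y)
print(evaluator.accuracy([1, 0]), evaluator.accuracy([0, 1]))
```

Since the folds never change, differences between two masks come from the selected columns and not from a different train/test split.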

In [41]:
from avatar.analysis import FeatureEvaluator

mask = np.random.RandomState(1337).randint(2, size=len(preselected.columns))
display(mask)

evaluator = FeatureEvaluator(folds=4, method=None, max_depth=3)
evaluator.fit(preselected, target="Survived")
evaluator.accuracy(mask)

array([1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0,
       0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1])

0.6898876404494383

In [42]:
evaluator.importances(mask)

array([0.01027012, 0.        , 0.        , 0.        , 0.00288254,
       0.11541804, 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.35429808, 0.        , 0.        ,
       0.        , 0.06288489, 0.        , 0.        , 0.00511118,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.00590094, 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.44323421, 0.        ,
       0.        ])

## Feature selection


Next, we can take a look at actual feature selection. Three wrapper methods are implemented.

* Randomly sampling columns, training a model and getting the feature relevances.
* A genetic approach, which is similar but should combine features slightly better. We perform a small experiment on whether to fix the genome size.
* Classic sequential and backwards sequential feature selection.

The idea is the same in each case: the feature selection algorithm returns sets of feature importances and the associated scores.

### Random

Randomly sample subsets of features, evaluate and get feature relevances.
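As a rough sketch of this loop: random binary masks are drawn, each one is evaluated, and the importances are accumulated. The aggregation used here (importances weighted by the score of the subset) and the reading of `explain=0.8` as "keep the top features that together account for 80% of the total importance" are assumptions for illustration and may differ from what `SamplingSelector` does. The names `sample_importances`, `select_explaining` and `toy_evaluate` are hypothetical.

```python
import numpy as np


def sample_importances(evaluate, n_features, iterations=40, seed=0):
    """Accumulate score-weighted importances over random feature masks."""
    rng = np.random.RandomState(seed)
    total = np.zeros(n_features)
    for _ in range(iterations):
        mask = rng.randint(2, size=n_features)
        if not mask.any():
            continue  # an empty subset cannot be evaluated
        score, importances = evaluate(mask)
        total += score * importances
    return total / total.sum()


def select_explaining(importances, explain=0.8):
    """Smallest set of top features whose importances sum to at least `explain`."""
    order = np.argsort(importances)[::-1]
    cumulative = np.cumsum(importances[order])
    return order[: np.searchsorted(cumulative, explain) + 1]


relevance = np.array([0.6, 0.3, 0.1, 0.0])


def toy_evaluate(mask):
    # Hypothetical evaluator: pretends the model recovers the true relevances.
    weights = relevance * mask
    return weights.sum(), weights


importances = sample_importances(toy_evaluate, n_features=4)
print(importances, select_explaining(importances))
```

In the notebook, the role of `toy_evaluate` would be played by training a model on the masked columns and reading off its accuracy and feature importances.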

In [43]:
from avatar.selection import SamplingSelector


ss = SamplingSelector(iterations=40, explain=0.8)
ss.fit(preselected, target="Survived")

[[0.00543882 0.         0.         ... 0.01882822 0.         0.00142104]
 [0.00325218 0.         0.00979265 ... 0.         0.00048406 0.        ]
 [0.         0.         0.00183586 ... 0.         0.         0.        ]
 ...
 [0.00218836 0.         0.00803168 ... 0.         0.00158117 0.00150755]
 [0.         0.         0.0088656  ... 0.00427658 0.         0.        ]
 [0.         0.         0.00065101 ... 0.00990961 0.         0.        ]]
[[0.11812232 0.         0.         ... 0.40891853 0.         0.03086264]
 [0.0690797  0.         0.20800621 ... 0.         0.01028194 0.        ]
 [0.         0.         0.0322372  ... 0.         0.         0.        ]
 ...
 [0.04446366 0.         0.16318993 ... 0.         0.03212672 0.03063088]
 [0.         0.         0.17675052 ... 0.08526078 0.         0.        ]
 [0.         0.         0.01183488 ... 0.18014952 0.         0.        ]]


In [45]:
ss.select()

0.9999999999999999


Index(['Sex', 'Pclass', 'Fare', 'OneHot()(Pclass)_3', 'Age',
       'WordToNumber()(Ticket)_Ticket',
       'ExtractWord([Master, Dr, Mr, Miss, Rev, Mrs])(Name)_0',
       'Split( )(Name)_2'],
      dtype='object')

Warm starting.

In [52]:
from avatar.selection import WarmSamplingSelector


wss = WarmSamplingSelector(iterations=40, explain=0.8)
wss.fit(preselected, target="Survived", start=["Sex", "Fare"])

In [53]:
wss.scores()

array([2.21069514e-02, 0.00000000e+00, 1.05930685e-01, 3.10572630e-01,
       7.25367337e-02, 2.45517911e-02, 4.64752992e-03, 9.29676911e-02,
       3.50170417e-03, 3.44891219e-02, 6.36260811e-02, 4.62745110e-02,
       5.36063348e-03, 5.84724851e-02, 8.93198971e-03, 1.29226289e-02,
       0.00000000e+00, 2.83927162e-03, 4.32044436e-03, 0.00000000e+00,
       0.00000000e+00, 1.47402204e-03, 3.24813212e-03, 2.83883121e-03,
       1.86426450e-04, 1.81047718e-03, 2.05101405e-04, 0.00000000e+00,
       2.07321714e-03, 8.08568805e-04, 2.15381018e-03, 9.32348172e-02,
       1.57373644e-02, 2.17634806e-03])

In [55]:
wss.select()

Index(['Sex', 'Pclass', 'WordToNumber()(Ticket)_Ticket', 'Fare', 'Age',
       'ExtractNumber()(Ticket)_0', 'OneHot()(Pclass)_3',
       'ExtractWord([Master, Dr, Mr, Miss, Rev, Mrs])(Name)_0'],
      dtype='object')

### Genetic

The CHC genetic algorithm for feature selection uses

* cross-generational elitist selection,
* heterogeneous recombination, and
* cataclysmic mutation

to maintain diversity and avoid stagnation.

Once the final population is obtained, the importances of its individuals are combined.
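Two of these ingredients are easy to show on binary feature masks. The sketch below is not taken from `CHCGASelector`; it shows half uniform crossover (HUX), the recombination operator usually associated with CHC, and a cataclysmic restart. The function names and the flip rate of 0.35 are assumptions for illustration.

```python
import numpy as np


def hux(parent_a, parent_b, rng):
    """Half uniform crossover: swap exactly half of the differing bits."""
    differing = np.flatnonzero(parent_a != parent_b)
    swap = rng.permutation(differing)[: len(differing) // 2]
    child_a, child_b = parent_a.copy(), parent_b.copy()
    child_a[swap], child_b[swap] = parent_b[swap], parent_a[swap]
    return child_a, child_b


def cataclysm(best, size, rng, rate=0.35):
    """Restart: keep the best individual and add heavily mutated copies of it."""
    population = np.tile(best, (size, 1))
    flips = rng.random_sample((size, len(best))) < rate
    flips[0] = False  # the elite individual survives unchanged
    return np.where(flips, 1 - population, population)


rng = np.random.RandomState(0)
a = np.array([1, 1, 1, 1, 0, 0, 0, 0])
b = np.array([0, 0, 0, 0, 0, 0, 1, 1])
child_a, child_b = hux(a, b, rng)
restarted = cataclysm(a, size=5, rng=rng)
print(child_a, child_b)
print(restarted)
```

In CHC, parents are typically only recombined when they differ in enough bits, and the cataclysm is triggered once no such pairs remain. Elitist selection then keeps the best individuals from parents and children combined.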

In [10]:
from avatar.selection import CHCGASelector, Population, Individual
    

gas = CHCGASelector(iterations=40)
gas.fit(preselected, target="Survived")

KeyboardInterrupt: 

In [19]:
gas.scores()

array([2.60595576e-03, 0.00000000e+00, 7.66667563e-02, 5.00365116e-01,
       6.74611830e-02, 1.88363097e-03, 1.89906182e-02, 2.88435918e-03,
       1.82362498e-02, 3.30708285e-03, 1.58330265e-02, 5.97178475e-05,
       6.61083587e-03, 1.08146542e-02, 2.03944971e-02, 8.16878293e-03,
       8.81785311e-03, 8.78311056e-03, 6.02337108e-03, 2.47739319e-02,
       0.00000000e+00, 6.67062813e-02, 2.69091666e-05, 6.61716210e-05,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 1.19842614e-03, 0.00000000e+00, 1.31540812e-04,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 5.60196748e-03, 1.14148704e-02,
       1.63882755e-02, 2.72461257e-03, 3.42540000e-03, 1.99982710e-04,
       1.06825878e-02, 7.87522417e-02, 0.00000000e+00])

### SFFS

Sequential Forward Floating Selection. We don't use the adaptive version because it is too slow when there are many columns, which is often the case.
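For reference, a compact sketch of the (non-adaptive) floating search is given below. It is written against a generic `score` function and is not the code of `SFFSelector`; the function name `sffs`, the toy score table, and the return format (a dictionary from subset size to the best subset and its score) are assumptions for illustration.

```python
def sffs(n_features, score, max_size):
    """Sequential forward floating selection over feature indices."""
    selected, best = [], {}  # best maps size -> (subset, score)
    while len(selected) < max_size:
        # Forward step: add the feature that gives the best new subset.
        candidates = [f for f in range(n_features) if f not in selected]
        added = max(candidates, key=lambda f: score(selected + [f]))
        selected = selected + [added]
        current = score(selected)
        if len(selected) not in best or current > best[len(selected)][1]:
            best[len(selected)] = (list(selected), current)
        # Floating step: remove features while that beats the best smaller subset.
        while len(selected) > 2:
            dropped = max(selected,
                          key=lambda f: score([g for g in selected if g != f]))
            reduced = [g for g in selected if g != dropped]
            if score(reduced) > best[len(reduced)][1]:
                selected = reduced
                best[len(reduced)] = (list(reduced), score(reduced))
            else:
                break
    return best


# Toy score table: feature 0 is best alone, but 1 and 2 are best together.
table = {
    frozenset([0]): 0.6, frozenset([1]): 0.5, frozenset([2]): 0.5,
    frozenset([0, 1]): 0.65, frozenset([0, 2]): 0.65, frozenset([1, 2]): 0.9,
    frozenset([0, 1, 2]): 0.92,
}
result = sffs(3, lambda subset: table[frozenset(subset)], max_size=3)
print(result)
```

In the toy run, plain forward selection would be stuck with feature 0 in its two-feature subset. The floating step removes it again once features 1 and 2 are both present, so the best subset of size two becomes `[1, 2]`.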

In [91]:
from avatar.selection import SFFSelector
    

sffs = SFFSelector(iterations=20)
sffs.fit(preselected, target="Survived", start=["Sex", "Fare"])

Adding Pclass
Removing Fare
Removing Pclass
Adding SibSp
Adding Parch
Removing Parch
Adding Parch
Adding OneHot()(SibSp)_0
Adding ExtractWord([Master, Dr, Mr, Miss, Rev, Mrs])(Name)_0
Adding ExtractNumber()(Ticket)_0
Removing OneHot()(SibSp)_0
Adding OneHot()(Embarked)_Q
Adding OneHot()(Pclass)_2
Removing OneHot()(Pclass)_2
Adding OneHot()(Pclass)_2
Adding OneHot()(SibSp)_0
Adding OneHot()(SibSp)_4
Adding OneHot()(Parch)_4
Adding OneHot()(SibSp)_5
Adding OneHot()(Parch)_0
Adding OneHot()(SibSp)_3
Adding OneHot()(Parch)_3
Adding OneHot()(Parch)_1
Adding OneHot()(SibSp)_1
Adding OneHot()(SibSp)_2


In [92]:
sffs._best

{3: (array([0., 0., 1., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]),
  0.8132022471910113),
 2: (array([0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]),
  0.7991573033707864),
 1: (array([0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]),
  0.7851123595505618),
 4: (array([0., 0., 0., 1., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]),
  0.8103932584269663),
 5: (array([0., 0., 0., 1., 0., 1., 1., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]),
  0.8342696629213484),
 6: (array([0., 0., 0., 1., 0., 1., 1., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0.,
         0

In [94]:
sffs.select()

Index(['Sex', 'SibSp', 'Parch', 'ExtractNumber()(Ticket)_0',
       'ExtractWord([Master, Dr, Mr, Miss, Rev, Mrs])(Name)_0',
       'OneHot()(Embarked)_Q'],
      dtype='object')

In [89]:
sffs.scores()

array([0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 6.53397157e-01,
       0.00000000e+00, 9.58453551e-02, 1.38914460e-02, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 1.74607119e-01, 5.97109866e-02,
       2.16780040e-03, 0.00000000e+00, 3.02909013e-04, 7.72264926e-05,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00])