# Feature selection

The following methods are implemented in this notebook.

1. A (baseline) random sampling based approach — done.
2. CHCGA — a genetic algorithm based approach — done.
3. SFFS — a forward selection based approach — done.

Both (1) and (2) allow us to set a maximum running time.

In [1]:
%reload_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
from typing import Optional, List, Tuple, Callable

Load the expanded Titanic dataset and cast the target column to a categorical type.

In [2]:
titanic = pd.read_csv("../data/raw/demo/titanic_expanded.csv")
titanic.Survived = titanic.Survived.astype("category")
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Split( )(Name)_1,...,OneHot()(Parch)_3,OneHot()(Parch)_4,OneHot()(Parch)_5,OneHot()(Parch)_6,OneHot()(Embarked)_C,OneHot()(Embarked)_Q,OneHot()(Embarked)_S,WordToNumber()(Ticket)_Ticket,ModeImputation()(Cabin)_Cabin,ModeImputation()(Embarked)_Embarked
0,1,0,3,male,22.0,1,0,7.25,S,Mr.,...,0,0,0,0,0,0,1,,B96 B98,S
1,2,1,1,female,38.0,1,0,71.2833,C,Mrs.,...,0,0,0,0,1,0,0,,C85,C
2,3,1,3,female,26.0,0,0,7.925,S,Miss.,...,0,0,0,0,0,0,1,,B96 B98,S
3,4,1,1,female,35.0,1,0,53.1,S,Mrs.,...,0,0,0,0,0,0,1,113803.0,C123,S
4,5,0,3,male,35.0,0,0,8.05,S,Mr.,...,0,0,0,0,0,0,1,373450.0,B96 B98,S


### Random

Randomly sample subsets of features, evaluate each subset, and aggregate the results into per-feature relevances.
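The idea can be sketched without the `avatar` library. The block below is a minimal illustration, not the `SamplingSelector` implementation: `sample_relevances`, `toy_score`, and the `max_seconds` parameter are hypothetical names, and the relevance measure (mean score with the feature minus mean score without it) is an assumption chosen for simplicity. It also shows one way a maximum running time could be enforced.

```python
import random
import time
from typing import Callable, Dict, List, Optional, Sequence


def sample_relevances(
    features: Sequence[str],
    evaluate: Callable[[List[str]], float],
    iterations: int = 40,
    max_seconds: Optional[float] = None,
    seed: int = 0,
) -> Dict[str, float]:
    """Estimate feature relevances from randomly sampled subsets (hypothetical helper)."""
    rng = random.Random(seed)
    with_sum = {f: 0.0 for f in features}
    with_n = {f: 0 for f in features}
    without_sum = {f: 0.0 for f in features}
    without_n = {f: 0 for f in features}
    deadline = None if max_seconds is None else time.monotonic() + max_seconds

    for _ in range(iterations):
        if deadline is not None and time.monotonic() >= deadline:
            break  # time budget exhausted
        # Include every feature independently with probability 0.5.
        subset = [f for f in features if rng.random() < 0.5]
        if not subset:
            continue
        score = evaluate(subset)
        for f in features:
            if f in subset:
                with_sum[f] += score
                with_n[f] += 1
            else:
                without_sum[f] += score
                without_n[f] += 1

    # Relevance = average gain from including the feature, clipped at zero.
    gains = {}
    for f in features:
        if with_n[f] and without_n[f]:
            gain = with_sum[f] / with_n[f] - without_sum[f] / without_n[f]
        else:
            gain = 0.0
        gains[f] = max(gain, 0.0)
    total = sum(gains.values())
    return {f: (g / total if total else 0.0) for f, g in gains.items()}


def toy_score(subset: List[str]) -> float:
    """Stand-in for a model evaluation: only Sex and Age carry signal."""
    return 0.6 * ("Sex" in subset) + 0.2 * ("Age" in subset)


relevances = sample_relevances(["Sex", "Age", "Fare", "Parch"], toy_score, iterations=400)
```

In a real setting `evaluate` would train and validate a model on the given columns, which is why a time budget is useful.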

In [3]:
from avatar.selection import SamplingSelector


ss = SamplingSelector(iterations=40, explain=0.8)
ss.fit(titanic, target="Survived")

In [15]:
ss.select(explain=0.8)

In [5]:
ss.ranked()

Index(['Sex', 'Split( )(Name)_1',
       'ExtractWord([Mr, Rev, Mrs, Master, Miss, Dr])(Name)_0', 'Age',
       'Pclass', 'Fare', 'OneHot()(Pclass)_3', 'WordToNumber()(Ticket)_Ticket',
       'ExtractNumber()(Ticket)_0', 'Split( )(Name)_2',
       'ModeImputation()(Cabin)_Cabin', 'SibSp', 'PassengerId', 'Parch',
       'OneHot()(Pclass)_2', 'OneHot()(SibSp)_1', 'OneHot()(Parch)_0',
       'OneHot()(SibSp)_0', 'ModeImputation()(Embarked)_Embarked', 'Embarked',
       'OneHot()(Embarked)_S', 'OneHot()(Embarked)_Q', 'OneHot()(Embarked)_C',
       'OneHot()(Parch)_2', 'OneHot()(Parch)_1', 'OneHot()(SibSp)_4',
       'OneHot()(SibSp)_3', 'OneHot()(SibSp)_5', 'OneHot()(SibSp)_8',
       'OneHot()(Parch)_3', 'OneHot()(Parch)_4', 'OneHot()(Parch)_5',
       'OneHot()(Parch)_6', 'Survived', 'OneHot()(SibSp)_2'],
      dtype='object')
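The meaning of `explain=0.8` is not spelled out in this notebook. One plausible reading, assumed here, is that features are taken in ranked order until their cumulative share of the total relevance reaches the threshold. The helper `select_by_explained` below is hypothetical and only illustrates that reading.

```python
from typing import Dict, List


def select_by_explained(relevances: Dict[str, float], explain: float = 0.8) -> List[str]:
    """Take top-ranked features until their relevance share reaches `explain`."""
    total = sum(relevances.values())
    if total <= 0:
        return []
    ranked = sorted(relevances, key=relevances.get, reverse=True)
    selected: List[str] = []
    covered = 0.0
    for name in ranked:
        selected.append(name)
        covered += relevances[name] / total
        if covered >= explain:
            break
    return selected


# Made-up relevance scores, for illustration only.
toy = {"Sex": 6.0, "Age": 2.0, "Fare": 1.5, "Parch": 0.5}
chosen = select_by_explained(toy, explain=0.75)
```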

Warm starting: pass an initial set of features through the `start` argument.

In [12]:
from avatar.selection import WarmSamplingSelector


wss = WarmSamplingSelector(iterations=40, explain=0.8)
wss.fit(titanic, target="Survived", start=["Sex", "Fare"])

In [13]:
wss.scores()

array([0.01504597, 0.        , 0.0965897 , 0.22044658, 0.08333636,
       0.02815102, 0.00751281, 0.07085068, 0.00393563, 0.07180004,
       0.04578654, 0.08724388, 0.05736441, 0.00970799, 0.07083569,
       0.00317614, 0.00830437, 0.00045336, 0.00062545, 0.00209437,
       0.00061622, 0.00038976, 0.02089348, 0.00463773, 0.00085403,
       0.        , 0.        , 0.00030675, 0.        , 0.00336481,
       0.0011657 , 0.00883354, 0.05982983, 0.01291745, 0.0029297 ])

In [14]:
wss.select()

### Genetic

The CHC genetic algorithm for feature selection. It uses

* cross-generational elitist selection,
* heterogeneous recombination, and
* cataclysmic mutation

to maintain diversity and avoid stagnation.

Once the final population is obtained, the feature importances of its individuals are combined.
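The three ingredients can be sketched on plain bit masks, where a 1 means the feature is kept. This is a minimal illustration and not the `CHCGASelector` code: the function names, the toy fitness, and the defaults (initial threshold of a quarter of the mask length, divergence rate 0.35) are assumptions based on commonly cited CHC settings.

```python
import random
from typing import Callable, List

Mask = List[int]  # 1 = feature kept, 0 = feature dropped


def hux(a: Mask, b: Mask, rng: random.Random) -> List[Mask]:
    """Half-uniform crossover: swap exactly half of the differing bits."""
    diff = [i for i in range(len(a)) if a[i] != b[i]]
    rng.shuffle(diff)
    c1, c2 = a[:], b[:]
    for i in diff[: len(diff) // 2]:
        c1[i], c2[i] = c2[i], c1[i]
    return [c1, c2]


def chc(fitness: Callable[[Mask], float], n_bits: int, pop_size: int = 20,
        generations: int = 50, divergence: float = 0.35, seed: int = 0) -> Mask:
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    threshold = n_bits // 4
    for _ in range(generations):
        parents = pop[:]
        rng.shuffle(parents)
        children: List[Mask] = []
        for a, b in zip(parents[::2], parents[1::2]):
            distance = sum(x != y for x, y in zip(a, b))
            # Heterogeneous recombination: only sufficiently different parents mate.
            if distance / 2 > threshold:
                children.extend(hux(a, b, rng))
        # Cross-generational elitist selection: best of parents and children survive.
        survivors = sorted(pop + children, key=fitness, reverse=True)[:pop_size]
        if not any(any(s is c for c in children) for s in survivors):
            threshold -= 1  # no child was accepted: relax the mating restriction
        pop = survivors
        if threshold < 0:
            # Cataclysmic mutation: keep the best, rebuild the rest from it.
            best = pop[0]
            pop = [best] + [
                [bit ^ (rng.random() < divergence) for bit in best]
                for _ in range(pop_size - 1)
            ]
            threshold = int(divergence * (1 - divergence) * n_bits)
    return max(pop, key=fitness)


def toy_fitness(mask: Mask) -> float:
    """Bits 0 and 3 are useful; every selected bit costs a little."""
    return mask[0] + mask[3] - 0.1 * sum(mask)


best_mask = chc(toy_fitness, n_bits=12)
```

Because the best individual always survives, the best fitness never decreases from one generation to the next.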

In [16]:
from avatar.selection import CHCGASelector, Population, Individual
    

gas = CHCGASelector(iterations=50)
gas.fit(titanic, target="Survived")

In [17]:
gas.select(explain=0.8)

In [23]:
gas.select()

### SFFS

Sequential Forward Floating Selection (SFFS). We do not use the adaptive variant: datasets often have many columns, which makes it too slow.
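As a rough illustration of the non-adaptive procedure, the sketch below alternates a greedy forward step with conditional backward ("floating") steps. It is not the `SFFSelector` implementation: `sffs` and `toy_score` are hypothetical, and the toy score is constructed so that the individually best feature `A` is not part of the best pair.

```python
from typing import Callable, Dict, FrozenSet, List, Sequence, Tuple


def sffs(features: Sequence[str], score: Callable[[FrozenSet[str]], float],
         k: int) -> Dict[int, List[str]]:
    """Return the best subset found for each size up to `k`."""
    selected: FrozenSet[str] = frozenset()
    best: Dict[int, Tuple[float, FrozenSet[str]]] = {}
    while len(selected) < k:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        # Forward step: add the single feature that helps most.
        add = max(candidates, key=lambda f: score(selected | {f}))
        selected = selected | {add}
        current = score(selected)
        if len(selected) not in best or current > best[len(selected)][0]:
            best[len(selected)] = (current, selected)
        # Floating step: drop features while that beats the best smaller subset.
        while len(selected) > 2:
            drop = max(sorted(selected), key=lambda f: score(selected - {f}))
            reduced = selected - {drop}
            reduced_score = score(reduced)
            if reduced_score <= best[len(reduced)][0]:
                break
            best[len(reduced)] = (reduced_score, reduced)
            selected = reduced
    return {size: sorted(subset) for size, (_, subset) in best.items()}


def toy_score(subset: FrozenSet[str]) -> float:
    """Integer-valued toy score: B and C are synergistic, A is redundant with both."""
    base = {"A": 6, "B": 5, "C": 5, "D": 0}
    total = sum(base[f] for f in subset)
    if "B" in subset and "C" in subset:
        total += 5
    if "A" in subset and ("B" in subset or "C" in subset):
        total -= 4
    return total


best_subsets = sffs(["A", "B", "C", "D"], toy_score, k=3)
```

Plain forward selection would settle on `A` plus one other feature as the best pair. The floating step revisits that choice after the third feature is added and finds the pair `B`, `C`.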

In [19]:
from avatar.selection import SFFSelector
    

sffs = SFFSelector(iterations=10)
sffs.fit(titanic, target="Survived", start=[])

In [29]:
sffs.select()

In [28]:
sffs.scores()

array([0.        , 0.        , 0.        , 0.68116625, 0.        ,
       0.09456094, 0.03565845, 0.        , 0.        , 0.        ,
       0.        , 0.09386584, 0.09209759, 0.00126835, 0.        ,
       0.00138258, 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ])

In [24]:
sffs.select()

In [89]:
sffs.scores()

array([0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 6.53397157e-01,
       0.00000000e+00, 9.58453551e-02, 1.38914460e-02, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 1.74607119e-01, 5.97109866e-02,
       2.16780040e-03, 0.00000000e+00, 3.02909013e-04, 7.72264926e-05,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00])