## Select with Target Mean as Performance Proxy

The `SelectByTargetMeanPerformance` transformer implements the feature selection method described in the notebook **06.2-Method-used-in-a-KDD-competition**.

This functionality is now included in Feature-engine.

Feature-engine automatically detects categorical and numerical variables. 

- Categories in categorical variables will be replaced by the mean value of the target.

- Numerical variables are first discretised, and then each bin is replaced by the mean value of the target.
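The two replacement steps above can be sketched with plain pandas. This is only an illustration on invented toy data, not Feature-engine's internal code; the column names `sex_enc`, `fare_bin` and `fare_enc` are hypothetical. In the real selector the means are learned and evaluated within cross-validation folds.

```python
import pandas as pd

# Toy data, invented for illustration only
df = pd.DataFrame({
    "sex": ["female", "male", "female", "male", "male", "female"],
    "fare": [80.0, 7.5, 30.0, 8.0, 15.0, 60.0],
    "survived": [1, 0, 1, 0, 1, 0],
})

# Categorical variable: replace each category by the mean of the target
sex_means = df.groupby("sex")["survived"].mean()
df["sex_enc"] = df["sex"].map(sex_means)

# Numerical variable: discretise into 3 equal-frequency bins first,
# then replace each bin by the mean of the target
df["fare_bin"] = pd.qcut(df["fare"], q=3, labels=False)
fare_means = df.groupby("fare_bin")["survived"].mean()
df["fare_enc"] = df["fare_bin"].map(fare_means)

print(df[["sex", "sex_enc", "fare", "fare_bin", "fare_enc"]])
```

The encoded column can then be treated as a prediction of the target and scored with a metric such as the roc-auc, which is the idea behind using the target mean as a performance proxy.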

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

from feature_engine.selection import SelectByTargetMeanPerformance

In [2]:
# load the titanic dataset

data = pd.read_csv('../titanic.csv')
data.shape

(1309, 14)

In [3]:
data.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22,S,,,"Montreal, PQ / Chesterville, ON"


In [4]:
# Variable preprocessing:

# narrow down the different cabins by keeping only the first letter,
# which represents the deck on which the cabin was located

# capture the first letter of the string (the deck letter)
data['cabin'] = data['cabin'].str[0]

# now rename the cabin letters that appear only a few times in the
# dataset to N

# replace rare cabins with N
data['cabin'] = np.where(data['cabin'].isin(['T', 'G']), 'N', data['cabin'])

data['cabin'].unique()

array(['B', 'C', 'E', 'D', 'A', nan, 'N', 'F'], dtype=object)

In [5]:
# number of passengers per cabin

data['cabin'].value_counts()

C    94
B    65
D    46
E    41
A    22
F    21
N     6
Name: cabin, dtype: int64

In [6]:
# number of passengers per value
data['parch'].value_counts()

0    1002
1     170
2     113
3       8
4       6
5       6
6       2
9       2
Name: parch, dtype: int64

In [7]:
# cap the variable at 3; the higher values have
# too few observations

data['parch'] = np.where(data['parch'] > 3, 3, data['parch'])
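The capping above relies on `np.where`. As a side note, `Series.clip` gives the same result in one call. A minimal sketch on invented values (the series below is not the Titanic data):

```python
import numpy as np
import pandas as pd

# Invented values covering the range seen in the counts above
parch = pd.Series([0, 1, 2, 3, 4, 5, 6, 9])

# Approach used in the notebook: replace values above 3 by 3
capped_where = pd.Series(np.where(parch > 3, 3, parch))

# Equivalent alternative: clip the upper end at 3
capped_clip = parch.clip(upper=3)

print(capped_where.tolist())
print(capped_clip.tolist())
```

Either form works; `clip` simply states the intent (an upper cap) more directly.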

In [8]:
data['sibsp'].value_counts()

0    891
1    319
2     42
4     22
3     20
8      9
5      6
Name: sibsp, dtype: int64

In [9]:
# cap the variable at 3; the higher values have
# too few observations

data['sibsp'] = np.where(data['sibsp'] > 3, 3, data['sibsp'])

In [10]:
# cast discrete variables as categorical

# feature-engine treats all variables of type object as categorical.
# So, to work with numerical variables as if they were categorical,
# we need to cast them as object

data[['pclass','sibsp','parch']] = data[['pclass','sibsp','parch']].astype('O')
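To see why the cast matters, here is a small sketch of dtype-based variable detection. It uses pandas' `select_dtypes` on an invented frame to mimic the idea; it is an assumption about the general approach, not a copy of Feature-engine's own detection code.

```python
import pandas as pd

# Invented frame: pclass holds integers, fare holds floats
df = pd.DataFrame({"pclass": [1, 2, 3], "fare": [7.5, 30.0, 80.0]})

# Before the cast, both columns are numerical
print(df.select_dtypes(include="number").columns.tolist())

# After casting to object, pclass is picked up as categorical
df["pclass"] = df["pclass"].astype("O")
categorical = df.select_dtypes(include="object").columns.tolist()
numerical = df.select_dtypes(include="number").columns.tolist()

print(categorical, numerical)
```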

In [11]:
# check for missing data

data.isnull().sum()

pclass          0
survived        0
name            0
sex             0
age           263
sibsp           0
parch           0
ticket          0
fare            1
cabin        1014
embarked        2
boat          823
body         1188
home.dest     564
dtype: int64

**Important**

In all feature selection procedures, it is good practice to select the features by examining only the training set. This helps to avoid overfitting.

In [12]:
# separate train and test sets

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived'], axis=1),
    data['survived'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((916, 13), (393, 13))

In [13]:
# Feature-engine automates the selection for both
# categorical and numerical variables

sel = SelectByTargetMeanPerformance(
    variables=None,  # automatically finds categorical and numerical variables
    scoring="roc_auc",  # the metric used to evaluate performance
    threshold=0.6,  # the threshold for feature selection
    bins=3,  # the number of intervals to discretise the numerical variables
    strategy="equal_frequency",  # intervals of equal width or with an equal number of observations
    cv=2,  # the number of cross-validation folds
    regression=False,
)

sel.fit(X_train, y_train)

Traceback (most recent call last):
  File "C:\Users\Bang\anaconda3\lib\site-packages\sklearn\metrics\_scorer.py", line 355, in _score
    y_pred = method_caller(clf, "decision_function", X)
  File "C:\Users\Bang\anaconda3\lib\site-packages\sklearn\metrics\_scorer.py", line 68, in _cached_call
    return getattr(estimator, method)(*args, **kwargs)
AttributeError: 'TargetMeanClassifier' object has no attribute 'decision_function'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Bang\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 761, in _score
    scores = scorer(estimator, X_test, y_test)
  File "C:\Users\Bang\anaconda3\lib\site-packages\sklearn\metrics\_scorer.py", line 216, in __call__
    return self._score(
  File "C:\Users\Bang\anaconda3\lib\site-packages\sklearn\metrics\_scorer.py", line 369, in _score
    y_pred = method_caller(clf, "predict_proba", X)
  File "C:\Users\Bang\anacond

2 fits failed out of a total of 2.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
2 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\Bang\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\Bang\anaconda3\lib\site-packages\feature_engine\_prediction\target_mean_classifier.py", line 126, in fit
    return super().fit(X, y)
  File "C:\Users\Bang\anaconda3\lib\site-packages\feature_engine\_prediction\base_predictor.py", line 128, in fit
    _check_contains_na(X, self.variables_categorical_)
  File "C:\Users\Bang\anaconda3\lib\site-packages\feature_engine\dataframe_checks.py", line 268, in 

SelectByTargetMeanPerformance(bins=3, cv=2, strategy='equal_frequency',
                              threshold=0.6)
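The truncated traceback above passes through `_check_contains_na`, which suggests that the failed fits are related to the missing values still present in variables such as `age` and `cabin`. One way to avoid this is to impute before fitting the selector. The following is a minimal sketch on invented data, assuming a median fill for numerical variables and a `"Missing"` label for categorical ones; both choices are illustrative, and in practice the medians should be learned from the training set only.

```python
import numpy as np
import pandas as pd

# Invented data with gaps, for illustration only
X = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, 30.0],
    "cabin": ["B", np.nan, "C", np.nan],
})

X_imputed = X.copy()
num_cols = X_imputed.select_dtypes(include="number").columns
cat_cols = X_imputed.select_dtypes(exclude="number").columns

# Numerical variables: fill with the median of each column
X_imputed[num_cols] = X_imputed[num_cols].fillna(X_imputed[num_cols].median())

# Categorical variables: fill with an explicit "Missing" label
X_imputed[cat_cols] = X_imputed[cat_cols].fillna("Missing")

print(X_imputed)
```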

In [14]:
# after fitting, this attribute lists the variables
# that were evaluated

sel.variables_

['pclass',
 'name',
 'sex',
 'age',
 'sibsp',
 'parch',
 'ticket',
 'fare',
 'cabin',
 'embarked',
 'boat',
 'body',
 'home.dest']

In [15]:
# here the selector stores the roc-auc per feature
# (nan marks features whose cross-validation fits failed)

sel.feature_performance_

{'pclass': 0.6647955905218121,
 'name': nan,
 'sex': 0.7647734737037902,
 'age': nan,
 'sibsp': 0.5652547781529103,
 'parch': 0.5624106995859147,
 'ticket': nan,
 'fare': 0.6401180861719492,
 'cabin': nan,
 'embarked': nan,
 'boat': nan,
 'body': nan,
 'home.dest': nan}

In [16]:
# and these are the features that will be dropped

sel.features_to_drop_

['sibsp', 'parch']
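Note that the features scored `nan` are not in the list of features to drop. The sketch below reproduces that outcome from the scores shown above, assuming the selector simply compares each score with the threshold using `<`. That assumption matches the output here but is not taken from Feature-engine's source.

```python
import math

# Scores copied from sel.feature_performance_ above
feature_performance = {
    "pclass": 0.6647955905218121,
    "name": math.nan,
    "sex": 0.7647734737037902,
    "age": math.nan,
    "sibsp": 0.5652547781529103,
    "parch": 0.5624106995859147,
    "ticket": math.nan,
    "fare": 0.6401180861719492,
    "cabin": math.nan,
    "embarked": math.nan,
    "boat": math.nan,
    "body": math.nan,
    "home.dest": math.nan,
}
threshold = 0.6

# Any comparison with nan is False, so nan-scored features are never flagged
to_drop = [f for f, s in feature_performance.items() if s < threshold]

# Features that were never actually scored
not_scored = [f for f, s in feature_performance.items() if math.isnan(s)]

print(to_drop)
print(not_scored)
```

In other words, a `nan` score means the feature was not evaluated, not that it passed the threshold, so those features deserve a second look before being kept.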

In [17]:
X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

X_train.shape, X_test.shape

((916, 11), (393, 11))

That is all for this lecture. I hope you enjoyed it, and see you in the next one!