## Feature Extraction for Data Preparation

https://machinelearningmastery.com/feature-extraction-on-tabular-data/


##### Common Data Preparation Approach (slow, expensive, requires experience)
1. study dataset
2. review expectations of the ML algo
3. carefully choose the appropriate transformations

##### Alternative Data Preparation Approach
1. apply a suite of common and commonly useful techniques
2. aggregate all features together and create one large dataset

### Performance of the different Methods
Tested with a simple LogisticRegression with a 'liblinear' solver (small dataset)
- Baseline: Accuracy of 0.953
- Feature Extraction: Accuracy of 0.968
- Feature Extraction + Selection: Accuracy of 0.989


In [26]:
# imports
import pandas as pd
import numpy as np

from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler, RobustScaler
from sklearn.preprocessing import QuantileTransformer, KBinsDiscretizer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_selection import RFE
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

In [3]:
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv'

The dataset used here is the wine dataset, which contains the results of a chemical analysis of wines from three different cultivars.

In [4]:
df = pd.read_csv(url, header=None)

In [5]:
data = df.values
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)

(178, 13) (178,)


### Baseline Model Performance
Test the baseline performance with the raw input features:

In [7]:
# minimal preparation to meet the expected input from sklearn
X = X.astype('float')
y = LabelEncoder().fit_transform(y.astype('str'))
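
To make the label preparation concrete, here is a minimal sketch of what `LabelEncoder` does to string class labels. The label values in `raw_labels` are illustrative assumptions, not rows taken from the wine file.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# illustrative class labels (assumption: classes are stored as floats 1.0, 2.0, 3.0)
raw_labels = np.array([1.0, 3.0, 2.0, 1.0])

# cast to strings first, then map each distinct string to an integer 0..n_classes-1
encoder = LabelEncoder()
encoded = encoder.fit_transform(raw_labels.astype('str'))

print(encoder.classes_)  # ['1.0' '2.0' '3.0']
print(encoded)           # [0 2 1 0]
```

The encoder sorts the distinct labels and assigns integers in that order, so the mapping can be reversed with `encoder.inverse_transform(encoded)`.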

In [9]:
# define the model
# for small datasets liblinear is a good choice
model = LogisticRegression(solver='liblinear')

# define cross validation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# evaluate model
score = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-2)

In [11]:
# report performance
print('Accuracy: %.3f (%.3f)' % (np.mean(score), np.std(score)))

Accuracy: 0.953 (0.048)
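
The mean and standard deviation above are computed over every fold of every repeat. The following sketch shows how many train/test splits `RepeatedStratifiedKFold` produces with these settings. `X_demo` and `y_demo` are placeholder arrays with the same shape as the wine data, and the class counts are an assumption made for illustration.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# placeholder data with the same shape as the wine dataset
# (assumption: three classes with 59, 71 and 48 samples)
X_demo = np.zeros((178, 13))
y_demo = np.repeat([0, 1, 2], [59, 71, 48])

cv_demo = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# 10 folds repeated 3 times gives 30 train/test splits, hence 30 accuracy scores
n_folds = cv_demo.get_n_splits()
test_sizes = [len(test_idx) for _, test_idx in cv_demo.split(X_demo, y_demo)]

print(n_folds)          # 30
print(sum(test_sizes))  # 534, every sample is tested once per repeat (3 * 178)
```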


### Feature Extraction Approach to Data Preparation
Test whether we can improve performance with a suite of common and commonly used data preparation techniques:
- Scaling
    - MinMaxScaler
    - StandardScaler
    - RobustScaler
- Distribution transformers
    - QuantileTransformer
    - KBinsDiscretizer
- Dimensionality Reduction
    - PCA
    - TruncatedSVD
    
With the FeatureUnion class we can define a list of transformers whose outputs are concatenated column-wise into one large feature matrix.
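
As a small illustration of how the union combines transformer outputs, the sketch below applies three transformers to a random matrix and checks the resulting shape. The data and the `_demo` names are hypothetical and used only for this example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# random stand-in data: 20 rows, 4 columns
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(20, 4))

# each transformer is fit on the same input, and the outputs are stacked side by side
fu_demo = FeatureUnion([
    ('mms', MinMaxScaler()),
    ('ss', StandardScaler()),
    ('pca', PCA(n_components=2)),
])
X_union = fu_demo.fit_transform(X_demo)

print(X_union.shape)  # (20, 10): 4 + 4 + 2 columns
```

By the same arithmetic, the seven transformers listed above applied to the 13 wine columns would yield 5 * 13 + 2 * 7 = 79 columns.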

In [15]:
# get data
data = df.values
X, y = data[:, :-1], data[:, -1]
# minimal preparation to meet the expected input from sklearn
X = X.astype('float')
y = LabelEncoder().fit_transform(y.astype('str'))

In [17]:
# transformers for the feature union
transformers = list()
transformers.append(('mms', MinMaxScaler()))
transformers.append(('ss', StandardScaler()))
transformers.append(('rs', RobustScaler()))
transformers.append(('qt', QuantileTransformer(n_quantiles=100, output_distribution='normal')))
transformers.append(('kbd', KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')))
transformers.append(('pca', PCA(n_components=7)))
transformers.append(('svd', TruncatedSVD(n_components=7)))

In [18]:
transformers

[('mms', MinMaxScaler()),
 ('ss', StandardScaler()),
 ('rs', RobustScaler()),
 ('qt', QuantileTransformer(n_quantiles=100, output_distribution='normal')),
 ('kbd', KBinsDiscretizer(encode='ordinal', n_bins=10, strategy='uniform')),
 ('pca', PCA(n_components=7)),
 ('svd', TruncatedSVD(n_components=7))]

In [19]:
# create feature union
fu = FeatureUnion(transformers)

In [20]:
# define model
model = LogisticRegression(solver='liblinear')

In [23]:
# define pipeline
steps = list()
steps.append(('fu', fu))
steps.append(('m', model))
pipeline = Pipeline(steps=steps) 

In [24]:
# define the cross-validation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-2)

In [25]:
# report performance
print('Accuracy: %.3f (%.3f)' % (np.mean(scores), np.std(scores)))

Accuracy: 0.968 (0.037)


### Feature Selection
Now add feature selection to the process, using Recursive Feature Elimination (RFE):
> RFE works by searching for a subset of features by starting with all features in the training dataset and successfully removing features until the desired number remains.
- it is a wrapper-type feature selection algorithm
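
To show the elimination process in isolation, here is a minimal sketch of RFE on a synthetic binary problem. The dataset from `make_classification` and the choice of four features are assumptions made only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# synthetic binary classification data: 10 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=100, n_features=10, n_informative=3,
    n_redundant=0, random_state=1)

# refit the wrapped model repeatedly, dropping the weakest feature each round
rfe_demo = RFE(estimator=LogisticRegression(solver='liblinear'),
               n_features_to_select=4)
rfe_demo.fit(X_demo, y_demo)

print(rfe_demo.support_)   # boolean mask, True for the 4 kept features
print(rfe_demo.ranking_)   # kept features have rank 1, eliminated ones rank higher

X_selected = rfe_demo.transform(X_demo)
print(X_selected.shape)    # (100, 4)
```

Because RFE wraps an estimator and refits it at each step, its cost grows with the number of features to eliminate.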

In [28]:
# get data
data = df.values
X, y = data[:, :-1], data[:, -1]
# minimal preparation to meet the expected input from sklearn
X = X.astype('float')
y = LabelEncoder().fit_transform(y.astype('str'))

In [29]:
# transformers for the feature union
transformers = list()
transformers.append(('mms', MinMaxScaler()))
transformers.append(('ss', StandardScaler()))
transformers.append(('rs', RobustScaler()))
transformers.append(('qt', QuantileTransformer(n_quantiles=100, output_distribution='normal')))
transformers.append(('kbd', KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')))
transformers.append(('pca', PCA(n_components=7)))
transformers.append(('svd', TruncatedSVD(n_components=7)))

In [30]:
# create feature union
fu = FeatureUnion(transformers)

In [31]:
# define the feature selection
rfe = RFE(estimator=LogisticRegression(solver='liblinear'), n_features_to_select=15)

In [32]:
# define the model
model = LogisticRegression(solver='liblinear')

In [33]:
# define the pipeline
steps = list()
steps.append(('fu', fu))
steps.append(('rfe', rfe))
steps.append(('m', model))
pipeline = Pipeline(steps=steps)

In [34]:
# define the cross-validation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-2)

In [35]:
# report performance
print('Accuracy: %.3f (%.3f)' % (np.mean(scores), np.std(scores)))

Accuracy: 0.989 (0.022)
