There are in general two reasons why feature selection is used:
1. To reduce the number of features, which reduces overfitting and improves the generalization of models.
2. To gain a better understanding of the features and their relationship to the response variables.

# Table of contents
1. [Exploring the data, removing redundant features and benchmarking](#1)
2. [Correlation and feature selection](#2)
3. [Univariate Statistics](#3)
4. [Model Based Feature Selection using the Random Forest Classifier](#4)
5. [Model Based Feature Selection using LightGBM Classifier](#5)
6. [Iterative Feature Selection](#6)
7. [Recursive feature elimination with cross validation and random forest classification](#7)
8. [PCA](#8)
9. [Conclusions](#9)


In [None]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectPercentile, SelectFromModel, RFE
from sklearn.base import clone
import numpy as np
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
import seaborn as sns
import sys
sys.path.insert(0, '../data/')
sys.path.insert(0, '../')
from feature_selector import FeatureSelector
from sklearn.metrics import classification_report, confusion_matrix
from utils import plot_confusion_matrix
from sklearn.decomposition import PCA
import warnings
warnings.filterwarnings("ignore")

RANDOM_SEED=42

Some helper functions that will be used throughout the notebook.

In [159]:
def score(estimator, X_train, y_train, X_test, y_test, X_transformed=None, X_test_transformed=None):
    """
    Prints the score of the original and the transformed data.
    Returns the trained (in the transformed data) estimator.
    """
    #Create clones/copies of the estimator
    est_clone1=clone(estimator)
    est_clone2=clone(estimator)
    #Train the first clone at the original dataset
    est_clone1.fit(X_train, y_train)
    print("Score with all features: {:.3f}".format(est_clone1.score(X_test, y_test)))
    if all(v is not None for v in [X_transformed, X_test_transformed]):
        #Train the second clone at the transformed dataset
        est_clone2.fit(X_transformed, y_train)
        print("Score with only selected features: {:.3f}".format(est_clone2.score(X_test_transformed, y_test)))
        return est_clone2
def selected_columns(estimator, X_train):
    """
    Returns an array with the features selected by the method used.
    """
    mask = estimator.get_support()
    columns = np.asarray(X_train.columns.values)
    selected= np.asarray(mask)
    columns_selected= columns[selected]
    return columns_selected

Our dataset is very large, so I will use only a chunk of it.

In [None]:
chunk = 50000
data_dir = "../data/"
train_file_name ='aggregated_train.csv'
train_path = os.path.join(data_dir, train_file_name)
df_train = pd.read_csv(train_path, nrows= chunk)

In [None]:
#Create an extra column where 0 is when a visitor has zero sum and 1 else
df_train['label'] = np.where(df_train['target_sum']==0, 0, 1)
df_train.drop(['target_sum'], axis=1, inplace=True)

In [None]:
test_size =0.2
X = df_train.copy()
y = df_train.label.values

Let's check the distribution of the label in the X set

In [None]:
ax = sns.countplot(x="label", data=X)

We are good and ready to move on 🤘

We should remove the feature 'fullVisitorId' because we do not want it to play any important role.

In [None]:
X.drop(['fullVisitorId', 'label'], axis=1, inplace=True)

## 1. Exploring the data, removing redundant features and benchmarking <a name="1"></a>

We will use the `FeatureSelector` class from the `feature_selector` module.

In [None]:
fs = FeatureSelector(data = X , labels = y)

Let's check the number of features with missing values

In [None]:
fs.identify_missing(missing_threshold=0.0)

In [None]:
missing_features = fs.ops['missing']
missing_features

For the moment we will remove them

In [None]:
X.drop(missing_features, axis=1,inplace=True)

Let's check the features that have a single unique value 

In [None]:
fs.identify_single_unique()

In [None]:
single_unique = fs.ops['single_unique']
single_unique

In [None]:
fs.plot_unique()


In [None]:
fs.unique_stats.sample(5)

#### Benchmarking: Let's see the score with all our features and without any transformation. This score will be used as a baseline.

In [None]:
test_size = 0.2
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, random_state=RANDOM_SEED, test_size = test_size)

In [None]:
lr = LogisticRegression(random_state=RANDOM_SEED)
score(lr, X_train, y_train, X_test, y_test)
#score() only fits clones, so fit lr itself before predicting with it
lr.fit(X_train, y_train)

In [None]:
y_pred = lr.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
plot_confusion_matrix(cm, classes=['Did not Buy', 'Buy'])

In [None]:
print(classification_report(y_test,y_pred,target_names=["Did not buy", "Buy"]))

## 2. Correlation and feature selection <a name="2"></a>

This method finds pairs of collinear features based on the Pearson correlation coefficient. For each pair above the specified threshold (in terms of absolute value), it identifies one of the variables to be removed. We need to pass in a correlation_threshold.

This method is based on code found at https://chrisalbon.com/machine_learning/feature_selection/drop_highly_correlated_features/

For each pair, the feature that will be removed is the one that comes last in terms of the column ordering in the dataframe. (This method does not one-hot encode the data beforehand unless one_hot=True. Therefore correlations are only calculated between numeric columns)
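The idea described above can be sketched with plain pandas. This is a minimal illustration on made-up data, not the `FeatureSelector` implementation; the helper name `find_collinear` and the `toy` dataframe are assumptions introduced only for this example.

```python
import numpy as np
import pandas as pd

def find_collinear(df, threshold):
    """Return the columns whose absolute Pearson correlation with an
    earlier column (in dataframe order) exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle (k=1 excludes the diagonal),
    # so every pair of columns is considered exactly once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # A column is flagged when it is too correlated with any column before it
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Hypothetical toy data: "a_copy" is a linear function of "a"
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": 2.0 * a + 1.0,
    "noise": rng.normal(size=200),
})

to_drop = find_collinear(toy, threshold=0.98)
print(to_drop)
```

Because only the upper triangle is inspected, the column that comes later in the dataframe is the one flagged for removal, which matches the ordering rule described above.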

In [None]:
fs.identify_collinear(correlation_threshold=0.98)


Many features are highly correlated

In [None]:
correlated_features = fs.ops['collinear']
correlated_features[:5]

We can view a heatmap of the correlations above the threshold. The features which will be dropped are on the x-axis.


In [None]:
fs.plot_collinear()

Not that helpful! But we can view the details of the correlations above the threshold.

In [None]:
fs.record_collinear.head()

As expected, the operating system is highly correlated with the browser. I checked most of the pairs and found nothing important. However, I set the threshold very high; with a lower threshold there might be something worth noting.

## 3. Univariate Statistics <a name="3"></a>

**Background**: In univariate statistics, we compute whether there is a statistically significant relationship between each feature and the target (here, the label). The features that are related to the target with the highest confidence are then selected. In the case of classification, this is also known as analysis of variance (ANOVA).

A score is computed for each feature separately, so it says nothing about the information carried by a combination of features. This is the **main weakness** of the F-score. Scikit-learn uses the ANOVA F-value by default. The larger the F-score, the more likely the feature is discriminative, so we use it as a feature selection criterion. In other words, the F-score reveals the discriminative power of each feature independently of the others.

All methods for discarding features use a threshold to discard all features with too high a p-value (which means they are unlikely to be related to the target).
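To make the per-feature scoring concrete, here is a small sketch on synthetic data (the `make_classification` dataset and the names `X_toy`, `y_toy` and `selector` are assumptions for illustration only). It passes `f_classif`, the default scoring function of `SelectPercentile`, explicitly and prints the F-score and p-value of the best-scoring features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, f_classif

# Synthetic data: 20 features, only 3 of them informative
X_toy, y_toy = make_classification(
    n_samples=500, n_features=20, n_informative=3,
    n_redundant=0, random_state=0,
)

# f_classif computes one ANOVA F-value (and p-value) per feature
selector = SelectPercentile(score_func=f_classif, percentile=10)
selector.fit(X_toy, y_toy)

# Show the five features with the highest F-score
order = np.argsort(selector.scores_)[::-1]
for i in order[:5]:
    print("feature {:2d}  F = {:8.2f}  p = {:.3g}".format(
        i, selector.scores_[i], selector.pvalues_[i]))

# Indices of the features kept by the 10% percentile cut
kept = np.flatnonzero(selector.get_support())
print("kept features:", kept)
```

Each feature is scored on its own, so two features that are only useful together would both receive a low score here.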

In [None]:
select_uni = SelectPercentile(percentile=10)
select_uni.fit(X_train, y_train)

In [None]:
# transform training set
X_train_uni = select_uni.transform(X_train)
print("X_train.shape: {}".format(X_train.shape))
print("X_train_selected.shape: {}".format(X_train_uni.shape))
# transform test set
X_test_uni = select_uni.transform(X_test)

The feature space has been reduced significantly.

#### Compare the performance of logistic regression on all features against the performance using only the selected features.

In [None]:
lr = LogisticRegression(random_state=RANDOM_SEED)
lr.fit(X_train, y_train)
print("Score with all features: {:.3f}".format(lr.score(X_test, y_test)))
lr.fit(X_train_uni, y_train)
print("Score with only selected features: {:.3f}".format(lr.score(X_test_uni, y_test)))

In [None]:
#Evaluate the dataset
lr = LogisticRegression(random_state=RANDOM_SEED)
lr = score(lr, X_train, y_train, X_test, y_test, X_train_uni, X_test_uni)
#Make predictions for the test set
y_pred = lr.predict(X_test_uni)
#Plot confusion matrix
cm = confusion_matrix(y_test, y_pred)
plot_confusion_matrix(cm, classes=['Did not Buy', 'Buy'])

In [None]:
#Plot classification report
print(classification_report(y_test,y_pred,
target_names=["Did not buy", "Buy"]))

There is no significant difference between the scores on the original and the reduced data. However, we have significantly reduced the feature space.

## 4. Model Based Feature Selection using the Random Forest Classifier <a name="4"></a>

**Background**: Model-based feature selection uses a supervised machine learning model to judge the importance of each feature, and keeps only the most important ones. The feature selection model needs to provide some measure of importance for each feature, so that they can be ranked by this measure. The `SelectFromModel` class is a meta-transformer that selects all features whose importance (taken from the fitted estimator, here the random forest's feature importances) is greater than or equal to the provided threshold (here the mean importance).
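The effect of the threshold can be seen in a small sketch on synthetic data (the dataset and the names `X_toy`, `y_toy`, `sfm` and `counts` are assumptions for illustration). It fits the same forest with `threshold='mean'` and `threshold='median'` and reports how many features survive each cut.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 20 features, 4 of them informative
X_toy, y_toy = make_classification(
    n_samples=300, n_features=20, n_informative=4, random_state=0,
)

counts = {}
for threshold in ("mean", "median"):
    sfm = SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=0),
        threshold=threshold,
    )
    sfm.fit(X_toy, y_toy)
    # threshold_ holds the numeric cut-off derived from the importances
    counts[threshold] = int(sfm.get_support().sum())
    print("{:6s} cut-off = {:.4f}, features kept = {}".format(
        threshold, sfm.threshold_, counts[threshold]))
```

The median keeps at least half of the features by construction, while the mean is pulled up by a few dominant features and therefore tends to be the stricter choice when importances are skewed.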

In [None]:
select_RF = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=RANDOM_SEED), threshold='mean')
select_RF.fit(X_train, y_train)

In [None]:
# transform training set
X_train_RF = select_RF.transform(X_train)
print("X_train.shape: {}".format(X_train.shape))
print("X_train_selected.shape: {}".format(X_train_RF.shape))
# transform test set
X_test_RF = select_RF.transform(X_test)

In [None]:
#Evaluate the method
lr = LogisticRegression(random_state=RANDOM_SEED)
lr = score(lr, X_train, y_train, X_test, y_test, X_train_RF, X_test_RF)
#Make predictions for the test set
y_pred = lr.predict(X_test_RF)
#Plot confusion matrix
cm = confusion_matrix(y_test, y_pred)
plot_confusion_matrix(cm, classes=['Did not Buy', 'Buy'])

In [None]:
print(classification_report(y_test,y_pred,target_names=["Did not buy", "Buy"]))

Let's see the 10 most important features based on the estimator

In [None]:
nmr = 10
importances = select_RF.estimator_.feature_importances_
#Indices of the nmr largest importances, sorted from most to least important
ind = np.argsort(importances)[::-1][:nmr]

In [None]:
plt.figure(1, figsize=(10, 5))
plt.title("Feature importances")
plt.bar(range(nmr), importances[ind], color="b", align="center")
plt.xticks(range(nmr), X_train.columns[ind],rotation=90)
plt.xlim([-1, nmr])
plt.show()

## 5. Model Based Feature Selection using LightGBM Classifier <a name="5"></a>

In [None]:
fs.identify_zero_importance(task = 'classification', eval_metric = '', 
                            n_iterations = 10, early_stopping = True)

First we can access the list of features with zero importance.

In [None]:
zero_importance_features = fs.ops['zero_importance']
zero_importance_features

#### Plot Feature Importances

Threshold = 0.99 will tell us the number of features needed to account for 99% of the total importance.
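The cumulative-importance idea can be sketched with NumPy alone. This is only an illustration of the principle under made-up importances, not the code used inside `FeatureSelector`; the helper name `n_features_for_importance` is hypothetical.

```python
import numpy as np

def n_features_for_importance(importances, threshold):
    """Number of top features needed for the normalized, cumulative
    importance to reach the given threshold."""
    importances = np.asarray(importances, dtype=float)
    # Sort from most to least important and normalize so the total is 1
    normalized = np.sort(importances)[::-1] / importances.sum()
    cumulative = np.cumsum(normalized)
    # First position where the running total reaches the threshold
    return int(np.searchsorted(cumulative, threshold) + 1)

# Hypothetical importances of five features
toy_importances = [50, 30, 15, 4.5, 0.5]
print(n_features_for_importance(toy_importances, 0.99))
print(n_features_for_importance(toy_importances, 0.90))
```

Any feature beyond that count contributes only to the last fraction of the total importance and is a candidate for removal.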

In [None]:
fs.plot_feature_importances(threshold = 0.99, plot_n = 20)


In [None]:
fs.feature_importances.head()

#### Low Importance Features

In [None]:
fs.identify_low_importance(cumulative_importance = 0.99)


The low importance features to remove are those that do not contribute to the specified cumulative importance. These are also available in the ops dictionary.

In [None]:
low_importance_features = fs.ops['low_importance']

In [None]:
# transform training set
X_train_LightGBM = X_train.drop(low_importance_features, axis=1)
print("X_train.shape: {}".format(X_train.shape))
print("X_train_selected.shape: {}".format(X_train_LightGBM.shape))
# transform test set
X_test_LightGBM = X_test.drop(low_importance_features, axis=1)

In [None]:
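lr = LogisticRegression(random_state=RANDOM_SEED)
lr = score(lr, X_train, y_train, X_test, y_test, X_train_LightGBM, X_test_LightGBM)
#Make predictions for the test set, so the report below describes this method
y_pred = lr.predict(X_test_LightGBM)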

In [None]:
print(classification_report(y_test,y_pred, target_names=["Did not buy", "Buy"]))

## 6. Iterative Feature Selection <a name="6"></a>

**Background**: Recursive feature elimination (RFE) repeatedly builds a model, removes the least important features according to that model, and refits on the remaining ones. The process is applied until the requested number of features is left (here 200, removing 50 features per step). Features are then ranked according to when they were eliminated. As such, it is a greedy optimization for finding the best performing subset of features.
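A small sketch on synthetic data shows what RFE returns (the dataset and the names `X_toy`, `y_toy` and `rfe` are assumptions for illustration). Starting from 12 features and removing 2 per step, it stops when 4 features remain.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic data: 12 features, 3 of them informative
X_toy, y_toy = make_classification(
    n_samples=300, n_features=12, n_informative=3, random_state=0,
)

# Drop the 2 least important features per round until 4 are left
rfe = RFE(
    RandomForestClassifier(n_estimators=30, random_state=0),
    n_features_to_select=4, step=2,
)
rfe.fit(X_toy, y_toy)

# support_ marks the surviving features; ranking_ is 1 for survivors
# and grows the earlier a feature was eliminated
print("selected mask:", rfe.support_)
print("ranking:      ", rfe.ranking_)
```

A larger `step` makes the search faster but coarser, since several features are discarded on the basis of a single fitted model.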

In [None]:
select_RFE = RFE(RandomForestClassifier(n_estimators=50, random_state=RANDOM_SEED), step = 50, n_features_to_select=200, verbose=1)
select_RFE.fit(X_train, y_train)

In [None]:
# transform training set
X_train_RFE= select_RFE.transform(X_train)
print("X_train.shape: {}".format(X_train.shape))
print("X_train_selected.shape: {}".format(X_train_RFE.shape))
# transform test data
X_test_RFE = select_RFE.transform(X_test)

In [None]:
#Evaluate the method
lr = LogisticRegression(random_state=RANDOM_SEED)
lr = score(lr, X_train, y_train, X_test, y_test, X_train_RFE, X_test_RFE)
#Make predictions for the test set
y_pred = lr.predict(X_test_RFE)
#Plot confusion matrix
cm = confusion_matrix(y_test, y_pred)
plot_confusion_matrix(cm, classes=['Did not Buy', 'Buy'])

In [None]:
print(classification_report(y_test,y_pred,
target_names=["Did not buy", "Buy"]))

## 7. Recursive feature elimination with cross validation and random forest classification <a name="7"></a>



**Background**: RFE with cross validation starts with all *n* features, makes predictions with cross validation using the classifier (here RF), and computes the cross-validated performance score (here accuracy) together with the ranking of the feature importances. Then it eliminates the lowest *k* features in the ranking and repeats the predictions, the computation of the performance score and the feature ranking. It proceeds until the minimum number of features is reached. Finally it outputs the set of features which produced the predictor with the best score.
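The same procedure can be tried on a small synthetic problem (the dataset and the names `X_toy`, `y_toy` and `rfecv` are assumptions for illustration). Unlike plain RFE, the number of features to keep is not fixed in advance but chosen by the cross-validated score.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic data: 12 features, 3 of them informative
X_toy, y_toy = make_classification(
    n_samples=300, n_features=12, n_informative=3, random_state=0,
)

# Remove 2 features per round and score every subset size with 3-fold CV
rfecv = RFECV(
    estimator=RandomForestClassifier(n_estimators=30, random_state=0),
    step=2, cv=3, scoring="accuracy",
)
rfecv.fit(X_toy, y_toy)

# n_features_ is the subset size with the best cross-validated accuracy
print("Optimal number of features:", rfecv.n_features_)
print("Selected mask:", rfecv.support_)
```

Fixing `random_state` on the forest keeps the feature ranking reproducible between runs.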

In [None]:
from sklearn.feature_selection import RFECV

# The "accuracy" scoring is proportional to the number of correct classifications
estimator = RandomForestClassifier(random_state=RANDOM_SEED)  # fixed seed for reproducible rankings
select_RFECV = RFECV(estimator=estimator, step=400, verbose = 1, cv=5, scoring='accuracy')   #5-fold cross-validation
select_RFECV = select_RFECV.fit(X_train, y_train)

print('Optimal number of features :', select_RFECV.n_features_)
print('Best features :', X_train.columns[select_RFECV.support_])

In [None]:
# transform training set
X_train_RFECV= select_RFECV.transform(X_train)
print("X_train.shape: {}".format(X_train.shape))
print("X_train_selected.shape: {}".format(X_train_RFECV.shape))
# transform test set
X_test_RFECV = select_RFECV.transform(X_test)

In [None]:
#Evaluate the method
lr = LogisticRegression(random_state=RANDOM_SEED)
lr = score(lr, X_train, y_train, X_test, y_test, X_train_RFECV, X_test_RFECV)
#Make predictions for the test set
y_pred = lr.predict(X_test_RFECV)
#Plot confusion matrix
cm = confusion_matrix(y_test, y_pred)
plot_confusion_matrix(cm, classes=['Did not Buy', 'Buy'])

In [None]:
print(classification_report(y_test,y_pred,target_names=["Did not buy", "Buy"]))

## 8. PCA  <a name="8"></a>

In [None]:
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline


In [None]:
scaler = StandardScaler()
pca = PCA()
pipe = Pipeline(steps=[('scaler', scaler), ('pca', pca)])

Let's find the number of components that we will use

In [None]:
pipe.fit(X_train)
plt.figure(1, figsize=(14, 13))
plt.clf()
plt.axes([.2, .2, .7, .7])
plt.plot(pca.explained_variance_ratio_, linewidth=2)
plt.axis('tight')
plt.xlabel('n_components')
plt.ylabel('explained_variance_ratio_')

As we can see, 800 components is a reasonable number.
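Another way to pick the number of components is to look at the cumulative explained variance instead of reading it off the plot. The sketch below uses synthetic data and a 95% target, both of which are assumptions for illustration and not values taken from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 30 features, several of them linear combinations of others
X_toy, _ = make_classification(
    n_samples=300, n_features=30, n_informative=5,
    n_redundant=10, random_state=0,
)

# Scale first, then fit a full PCA to get the variance of every component
pipe_toy = make_pipeline(StandardScaler(), PCA())
pipe_toy.fit(X_toy)
ratios = pipe_toy.named_steps["pca"].explained_variance_ratio_

# Running total of explained variance, component by component
cumulative = np.cumsum(ratios)
target = 0.95  # hypothetical target share of variance
n_needed = int(np.searchsorted(cumulative, target) + 1)
print("Components needed for {:.0%} of the variance: {}".format(target, n_needed))
```

Scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components`, in which case it keeps as many components as are needed to explain that share of the variance.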

In [None]:
n_components=800
scaler = StandardScaler()
pca = PCA(n_components=n_components)
pipe = Pipeline(steps=[('scaler', scaler), ('pca', pca)])
# transform data onto number of selected principal components
X_train_pca = pipe.fit_transform(X_train)
print("Original shape: {}".format(str(X_train.shape)))
print("Reduced shape: {}".format(str(X_train_pca.shape)))

In [None]:
#Transform the test set
X_test_pca = pipe.transform(X_test)

In [None]:
#Evaluate the method
lr = LogisticRegression(random_state=RANDOM_SEED)
lr = score(lr, X_train, y_train, X_test, y_test, X_train_pca, X_test_pca)
#Make predictions for the test set
y_pred = lr.predict(X_test_pca)
#Plot confusion matrix
cm = confusion_matrix(y_test, y_pred)
plot_confusion_matrix(cm, classes=['Did not Buy', 'Buy'])

In [None]:
print(classification_report(y_test,y_pred,target_names=["Did not buy", "Buy"]))

## 9. Conclusions <a name="9"></a>
In general, we could not outperform the baseline in the classification task. However, we managed to achieve more or less the same results with far fewer features.