# Filter Methods

**Important notes**

* **In all feature selection procedures, it is good practice to select the features by examining only the training set. This helps avoid overfitting.**

* **In practice, feature selection should be done after data pre-processing, so that, ideally, all categorical variables are already encoded as numbers and you can assess whether they are correlated with other features.**
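
As a minimal sketch of the first note, the selection decision below is made on the training set only and then applied unchanged to the test set. The toy DataFrames and the names `train`, `test` and `to_drop` are assumptions made for illustration.

```python
import pandas as pd

# Toy data (hypothetical): f1 is constant in the training set only
train = pd.DataFrame({"f1": [1, 1, 1], "f2": [3, 4, 5]})
test = pd.DataFrame({"f1": [1, 2], "f2": [6, 7]})

# Decide which features to drop by looking at the training set only
to_drop = [col for col in train.columns if train[col].nunique() == 1]

# Apply the same decision to both sets, without inspecting the test set
train_sel = train.drop(columns=to_drop)
test_sel = test.drop(columns=to_drop)

print(to_drop, train_sel.shape, test_sel.shape)
```

Note that `f1` varies in the test set, but that information is deliberately not used when choosing the features.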

**Basic methods**

* Constant features
* Quasi-constant features
* Duplicated features (these may arise after one-hot encoding of categorical variables)

## Split the data

In [None]:
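# `data` is assumed to be a pandas DataFrame loaded earlier, with a 'target' column
from sklearn.model_selection import train_test_split

# separate dataset into train and test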
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1), # drop the target
    data['target'], # just the target
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

## Constant features

Constant features are those that show a single value for all the observations (rows) of the dataset. These features provide no information that allows a machine learning model to discriminate or predict a target.

**Note**

* Identifying and removing constant features is an easy first step towards feature selection and more easily interpretable machine learning models.

* The DropConstantFeatures class from Feature-engine finds and removes constant and quasi-constant features from a dataset. We can remove constant features by setting the parameter tol to 1, or quasi-constant with smaller values for tol.
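
To make the definition concrete, here is a minimal pandas-only sketch of the idea. It is not Feature-engine's implementation; the toy DataFrame and the helper name `find_constant_features` are assumptions made for illustration.

```python
import pandas as pd


def find_constant_features(df: pd.DataFrame) -> list:
    """Return the columns that hold a single distinct value (hypothetical helper)."""
    # dropna=False counts NaN as a value, so a column that mixes
    # one value with NaN is not reported as constant
    return [col for col in df.columns if df[col].nunique(dropna=False) == 1]


# Toy data: 'a' and 'c' never change, 'b' does
toy = pd.DataFrame({
    "a": [1, 1, 1, 1],
    "b": [0, 1, 0, 1],
    "c": ["x", "x", "x", "x"],
})

constant = find_constant_features(toy)
print(constant)
```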

In [None]:
# Step 1: Import the DropConstantFeatures class from feature_engine
# Import the class that helps remove features with constant values
from feature_engine.selection import DropConstantFeatures

# Step 2: Initialize the DropConstantFeatures selector with specified parameters
# Set up the selector to detect constant features with no tolerance for variability
sel = DropConstantFeatures(tol=1, variables=None, missing_values='raise')

# Step 3: Fit the selector to the training data to identify constant features
# Fit the model to find features with constant values in the training data
sel.fit(X_train)

# Step 4: Access the list of constant features identified during fitting
# Retrieve the list of features that were found to be constant
sel.features_to_drop_

# Step 5: Count the number of constant features in the dataset
# Find out how many features are constant
len(sel.features_to_drop_)

# Step 6: Examine the unique values of the first constant feature
# Display the unique value(s) of the first constant feature
X_train[sel.features_to_drop_[0]].unique()

# Step 7: Remove constant features from the training data
# Drop the constant features from the training data
X_train = sel.transform(X_train)
# Step 8: Remove constant features from the test data using the same transformation
# Apply the same transformation to the test data
X_test = sel.transform(X_test)

# Step 9: Display the shape of the training and test datasets after removing constant features
# Show the new dimensions of both training and test sets
X_train.shape, X_test.shape

## Quasi-constant features

Quasi-constant features are those that show the same value for the great majority of the observations of the dataset. In general, these features provide little, if any, information that allows a machine learning model to discriminate or predict a target. There can be exceptions, however, so you should be careful when removing this type of feature.

**Note**

* Identifying and removing quasi-constant features is an easy first step towards feature selection and more interpretable machine learning models.
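
The idea can be sketched with plain pandas: compute the share of the most frequent value in each column and compare it with a threshold. This is an illustration under stated assumptions rather than Feature-engine's own code; the helper name `find_quasi_constant_features` and the toy data are hypothetical.

```python
import pandas as pd


def find_quasi_constant_features(df: pd.DataFrame, tol: float = 0.998) -> list:
    """Return columns whose most frequent value covers at least `tol` of the rows."""
    quasi_constant = []
    for col in df.columns:
        # share of observations taken by the most frequent value
        top_share = df[col].value_counts(normalize=True).max()
        if top_share >= tol:
            quasi_constant.append(col)
    return quasi_constant


# Toy data: 'mostly_zero' is 0 in 999 of 1000 rows, 'balanced' is split 50/50
toy = pd.DataFrame({
    "mostly_zero": [0] * 999 + [1],
    "balanced": [0, 1] * 500,
})

quasi = find_quasi_constant_features(toy, tol=0.998)
print(quasi)
```

With `tol=1` the same check only flags columns with a single value, which is why constant features are a special case of quasi-constant ones.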

In [None]:
# Step 1: Initialize the DropConstantFeatures selector with a tolerance for quasi-constant features
# Set up the selector to flag features whose most frequent value appears in at least 99.8% of the observations
sel = DropConstantFeatures(tol=0.998, variables=None, missing_values='raise')

# Step 2: Fit the selector to the training data to identify quasi-constant features
# Fit the model to find quasi-constant features in the training data
sel.fit(X_train)

# Step 3: Count the number of quasi-constant features in the dataset
# Determine how many features have very low variability
len(sel.features_to_drop_)

# Step 4: Retrieve the list of quasi-constant features identified
# Get the list of features that were found to be quasi-constant
sel.features_to_drop_

# Step 5: Calculate the percentage of observations for each unique value in the first quasi-constant feature
# Show the distribution of values for the first quasi-constant feature
var = sel.features_to_drop_[0]
X_train[var].value_counts(normalize=True)

# Step 6: Explore the distribution of another quasi-constant feature
# Show the distribution of values for the third quasi-constant feature
var = sel.features_to_drop_[2]
X_train[var].value_counts(normalize=True)

# Step 7: Remove quasi-constant features from the training data
# Drop the quasi-constant features from the training data
X_train = sel.transform(X_train)
# Step 8: Remove quasi-constant features from the test data using the same transformation
# Apply the same transformation to the test data
X_test = sel.transform(X_test)

# Step 9: Display the shape of the training and test datasets after removing quasi-constant features
# Show the new dimensions of both training and test sets
X_train.shape, X_test.shape

## Duplicated features

Datasets often contain duplicated features, that is, features that are identical despite having different names.

In addition, we may introduce duplicated features when performing one-hot encoding of categorical variables, particularly if our datasets have many and/or high-cardinality categorical variables.
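
As a small illustration of how one-hot encoding can create duplicates, the sketch below encodes two categorical variables whose categories always occur together. The toy data and variable names are assumptions made for this example.

```python
import pandas as pd

# Toy data (hypothetical): each country always comes with the same currency
df = pd.DataFrame({
    "country": ["UK", "UK", "UK", "Peru"],
    "currency": ["GBP", "GBP", "GBP", "PEN"],
})

# One-hot encode both variables
dummies = pd.get_dummies(df)

# Transpose so that each encoded feature becomes a row, then mark the
# rows that repeat an earlier row
dup_mask = dummies.T.duplicated()
duplicated_cols = dummies.columns[dup_mask.to_numpy()].tolist()

print(list(dummies.columns))
print(duplicated_cols)
```

Here the encoded currency columns carry exactly the same information as the encoded country columns, so they are redundant.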

Identifying and removing duplicated, and therefore redundant features, is an easy first step towards feature selection and more interpretable machine learning models.

**Note**
* Finding duplicated features can be computationally costly in Python, so depending on the size of your dataset, it may not always be feasible.
* The method described here to find duplicated features works for both numerical and categorical variables.
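
The cost mentioned above comes from comparing features pair by pair. A minimal sketch of such a pairwise search, assuming pandas only, is shown below; the helper name `find_duplicated_features` and the toy data are hypothetical, and this is not presented as Feature-engine's implementation.

```python
import pandas as pd


def find_duplicated_features(df: pd.DataFrame) -> list:
    """Return the columns that repeat an earlier column (hypothetical helper)."""
    duplicated = []
    cols = list(df.columns)
    for i, first in enumerate(cols):
        if first in duplicated:
            continue  # already known to be a copy of an earlier column
        for second in cols[i + 1:]:
            # Series.equals compares the values, so it works for
            # numerical and categorical (object) columns alike
            if second not in duplicated and df[first].equals(df[second]):
                duplicated.append(second)
    return duplicated


toy = pd.DataFrame({
    "height_cm": [170, 180, 165],
    "height_copy": [170, 180, 165],
    "city": ["a", "b", "a"],
    "town": ["a", "b", "a"],
    "weight": [60, 80, 55],
})

dups = find_duplicated_features(toy)
print(dups)
```

The nested loop grows quadratically with the number of features, which is why the search can become slow on wide datasets.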

In [None]:
# Step 1: Import DropDuplicateFeatures and DropConstantFeatures from feature_engine
# Import the classes to remove duplicate and constant features
from feature_engine.selection import DropDuplicateFeatures, DropConstantFeatures

# Step 2: Set up the DropDuplicateFeatures selector with the specified parameters
# Initialize the selector to identify duplicate features
sel = DropDuplicateFeatures(variables=None, missing_values='raise')

# Step 3: Fit the selector to the training data to find duplicate features
# Identify duplicate features in the training data (this might take some time)
sel.fit(X_train)

# Step 4: Retrieve the sets of duplicated features
# Access the list of feature sets that are duplicates of each other
sel.duplicated_feature_sets_

# Step 5: Retrieve the features that will be dropped (one from each duplicate set)
# Get the list of features that are marked to be removed
sel.features_to_drop_

# Step 6: Check how many duplicated features will be removed
# Count the number of features that will be dropped
len(sel.features_to_drop_)

# Step 7: Remove the duplicated features from the training data
# Drop the duplicated features from the training dataset
X_train = sel.transform(X_train)
# Step 8: Remove the duplicated features from the test data using the same transformation
# Apply the same transformation to the test data
X_test = sel.transform(X_test)

# Step 9: Display the shape of the training and test datasets after removing duplicate features
# Show the new dimensions of both training and test sets
X_train.shape, X_test.shape

## Stack feature selection in a pipeline

We can perform both selection steps, removing quasi-constant features and removing duplicated features, together by setting up the transformers within a pipeline.


In [None]:
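# Import Pipeline from scikit-learn (the two selectors were imported in earlier cells)
from sklearn.pipeline import Pipeline

# Step 1: Create a pipeline with DropConstantFeatures and DropDuplicateFeatures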
# Set up a pipeline that removes quasi-constant and duplicated features
pipe = Pipeline([
    ('constant', DropConstantFeatures(tol=0.998)),
    ('duplicated', DropDuplicateFeatures()),
])

# Step 2: Fit the pipeline to the training data to identify features to remove
# Fit the pipeline to detect quasi-constant and duplicate features in the training data
pipe.fit(X_train)

# Step 3: Transform the training data by removing identified features
# Remove the quasi-constant and duplicated features from the training data
X_train = pipe.transform(X_train)
# Step 4: Transform the test data by applying the same removal process
# Remove the quasi-constant and duplicated features from the test data
X_test = pipe.transform(X_test)

# Step 5: Display the shape of the training and test datasets after feature removal
# Show the new dimensions of both training and test sets
X_train.shape, X_test.shape

# Step 6: Get the number of quasi-constant features that were removed
# Check how many quasi-constant features were dropped
len(pipe.named_steps['constant'].features_to_drop_)

# Step 7: Access the list of duplicated features that were removed
# Retrieve the list of features that were dropped due to duplication
pipe.named_steps['duplicated'].features_to_drop_