<a href="https://colab.research.google.com/github/PamelaKinga/feature_selection/blob/main/Feature_selection_constant_and_quasi_constant_feature_method.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Feature Selection**: Basic Methods - Filtering out Constant and Quasi-Constant Features

# About

Feature selection has many benefits, including improved accuracy, a lower risk of overfitting, less redundancy among variables, shorter training times, and better learning behaviour in high-dimensional spaces. Filtering is a good starting technique, and one of the most basic filters is a quick check for constant or quasi-constant features in the dataset.

To identify constant and quasi-constant features, I use the Python library [fast-ml](https://pypi.org/project/fast-ml/).

This workbook is a quick code sample for cleaning up constant and quasi-constant features with fast-ml.

# Import Fast-ML Package

In [None]:
# Install the fast_ml library (the %pip magic installs into the running notebook kernel)
%pip install fast_ml

In [4]:
import pandas as pd
import numpy as np
from fast_ml.utilities import display_all
from fast_ml.feature_selection import get_constant_features

In [5]:
# Load train.csv from Kaggle's house prices competition, saved locally as house_prices_train.csv
# https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data?select=train.csv
df = pd.read_csv('house_prices_train.csv')

In [7]:
# Show the shape of the data set (rows, columns)
df.shape

(1460, 81)

# Constant Features

*  Constant features have the same value in every row/record
*  Use the `get_constant_features` function to list all the constant features
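To make the definition concrete, here is a minimal pandas-only sketch of the same idea. It is not how fast-ml implements the check; the `toy` frame and the helper `find_constant_columns` are hypothetical names used only for illustration, and treating an all-missing column as constant is an assumption of this sketch.

```python
import pandas as pd

# Toy data (hypothetical, not the house prices data set)
toy = pd.DataFrame({
    "a": [1, 1, 1, 1],              # constant numeric column
    "b": ["x", "x", "x", "x"],      # constant string column
    "c": [1, 2, 3, 4],              # varying column
    "d": [None, None, None, None],  # entirely missing column
})


def find_constant_columns(frame: pd.DataFrame) -> list:
    """Return the columns that hold at most one distinct value.

    NaN is counted as a value here (dropna=False), so an all-missing
    column is reported as constant. That is a choice of this sketch.
    """
    n_unique = frame.nunique(dropna=False)
    return n_unique[n_unique <= 1].index.to_list()


constant_cols = find_constant_columns(toy)
print(constant_cols)
```

Counting distinct values per column is enough for the strict "constant" case; the quasi-constant case needs the share of the most frequent value instead.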



In [9]:
constant_features = get_constant_features(df)
constant_features.head(10)

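|   | Desc           | Var       | Value  | Perc      |
|---|----------------|-----------|--------|-----------|
| 0 | Quasi Constant | Utilities | AllPub | 99.931507 |
| 1 | Quasi Constant | Street    | Pave   | 99.589041 |
| 2 | Quasi Constant | PoolArea  | 0      | 99.520548 |
| 3 | Quasi Constant | PoolQC    | NaN    | 99.520548 |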


*This particular dataset has no constant features, only quasi-constant ones. I'll continue as though it had some, for the sake of writing the code.*

*   Store all the constant features in a list so they can be removed from the dataset

In [15]:
# Create and print a list of all the constant features in the data set, as identified by the fast-ml function
constant_features_list = constant_features.query("Desc=='Constant'")['Var'].to_list()
print(constant_features_list)

[]


*   Drop all the constant features from the data set

In [14]:
# Check the before and after shape of the data set to see how many features have been dropped
print('Shape of the dataset before dropping constant feature: ', df.shape)
df.drop(columns=constant_features_list, inplace=True)
print('Shape of the dataset after dropping constant feature: ', df.shape)

Shape of the dataset before dropping constant feature:  (1460, 81)
Shape of the dataset after dropping constant feature:  (1460, 81)


# Quasi-Constant Features

*   Quasi-constant features: a single value dominates, accounting for, say, 99.9% of the rows
*   The threshold sets the cut-off for how dominant that value must be before the feature is eliminated, e.g. 99.9%, 99%, 98%, etc.
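The threshold idea can be sketched in plain pandas as follows. This is an illustration under stated assumptions rather than fast-ml's implementation: `toy`, `dominant_share`, and `find_quasi_constant_columns` are hypothetical names, and the `>=` comparison against the threshold is a guess at sensible behaviour.

```python
import pandas as pd

# Toy data (hypothetical): 100 rows per column
toy = pd.DataFrame({
    "mostly_zero": [0] * 99 + [1],            # one value covers 99% of rows
    "balanced": [0, 1] * 50,                  # two values at 50% each
    "mostly_missing": [None] * 99 + ["Gd"],   # NaN covers 99% of rows
})


def dominant_share(frame: pd.DataFrame, dropna: bool = False) -> pd.Series:
    """Fraction of rows taken by each column's most frequent value.

    With dropna=False, NaN competes as a value like any other.
    """
    shares = {
        # value_counts sorts descending, so the first entry is the top share
        col: frame[col].value_counts(normalize=True, dropna=dropna).iloc[0]
        for col in frame.columns
    }
    return pd.Series(shares)


def find_quasi_constant_columns(frame: pd.DataFrame, threshold: float = 0.99) -> list:
    """Columns whose dominant value covers at least `threshold` of the rows."""
    shares = dominant_share(frame)
    return shares[shares >= threshold].index.to_list()


quasi_cols = find_quasi_constant_columns(toy, threshold=0.98)
print(quasi_cols)
```

Lowering the threshold flags more columns, so it is worth looking at the reported shares before deciding where to cut.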

In [17]:
quasi_constant_features = get_constant_features(df, threshold=0.99, dropna=False)
quasi_constant_features.head(10)

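|   | Desc           | Var       | Value  | Perc      |
|---|----------------|-----------|--------|-----------|
| 0 | Quasi Constant | Utilities | AllPub | 99.931507 |
| 1 | Quasi Constant | Street    | Pave   | 99.589041 |
| 2 | Quasi Constant | PoolArea  | 0      | 99.520548 |
| 3 | Quasi Constant | PoolQC    | NaN    | 99.520548 |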


*   Store all the quasi-constant features in a list so they can be removed from the dataset



In [12]:
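# Query the quasi-constant result from the previous cell (not the earlier constant_features frame)
quasi_constant_features_list = quasi_constant_features.query("Desc=='Quasi Constant'")['Var'].to_list()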
print(quasi_constant_features_list)

['Utilities', 'Street', 'PoolArea', 'PoolQC']


*   Drop all the quasi-constant features

In [18]:
# Check the before and after shape of the data set to see how many features have been dropped
print('Shape of the dataset before dropping quasi-constant feature: ', df.shape)
df.drop(columns=quasi_constant_features_list, inplace=True)
print('Shape of the dataset after dropping quasi-constant feature: ', df.shape)

Shape of the dataset before dropping quasi-constant feature:  (1460, 81)
Shape of the dataset after dropping quasi-constant feature:  (1460, 77)


*  In this data set, dropping the 4 quasi-constant features took us from 81 to 77 features
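One practical note on the drop step: with `inplace=True`, running the same cell a second time raises a `KeyError`, because the columns are already gone. A minimal sketch of a re-runnable alternative, using a hypothetical `toy` frame and `to_drop` list, might look like this:

```python
import pandas as pd

# Toy data (hypothetical) with one column we want to remove
toy = pd.DataFrame({"keep": [1, 2, 3], "drop_me": [0, 0, 0]})
to_drop = ["drop_me"]

# Non-mutating drop: returns a new frame and leaves `toy` untouched
reduced = toy.drop(columns=to_drop)

# errors="ignore" skips labels that are already missing,
# so repeating the drop on the reduced frame is harmless
reduced_again = reduced.drop(columns=to_drop, errors="ignore")

print(toy.shape, reduced.shape, reduced_again.shape)
```

Keeping the original frame around also makes it easy to compare model results with and without the dropped features.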


