You have a categorical target vector and want to remove uninformative features.

If the features are categorical, calculate a chi-squared (χ²) statistic between
each feature and the target vector:

In [1]:
# Load libraries
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, f_classif
# Load data
iris = load_iris()
features = iris.data
target = iris.target
# Convert to categorical data by converting data to integers
features = features.astype(int)
# Select two features with highest chi-squared statistics
chi2_selector = SelectKBest(chi2, k=2)
features_kbest = chi2_selector.fit_transform(features, target)
# Show results
print("Original number of features:", features.shape[1])
print("Reduced number of features:", features_kbest.shape[1])


Original number of features: 4
Reduced number of features: 2


If the features are quantitative, compute the ANOVA F-value between each
feature and the target vector:

In [2]:
# Select two features with highest F-values
# (use the original continuous measurements, not the integer-cast copy)
fvalue_selector = SelectKBest(f_classif, k=2)
features_kbest = fvalue_selector.fit_transform(iris.data, target)
# Show results
print("Original number of features:", features.shape[1])
print("Reduced number of features:", features_kbest.shape[1])

Original number of features: 4
Reduced number of features: 2


Instead of selecting a specific number of features, we can also use SelectPercentile to select the top n percent of features:

In [3]:
# Load library
from sklearn.feature_selection import SelectPercentile
# Select top 75% of features with highest F-values
# (computed on the original continuous measurements)
fvalue_selector = SelectPercentile(f_classif, percentile=75)
features_kbest = fvalue_selector.fit_transform(iris.data, target)
# Show results
print("Original number of features:", features.shape[1])
print("Reduced number of features:", features_kbest.shape[1])

Original number of features: 4
Reduced number of features: 3
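
The examples above report only how many features remain. As a minimal sketch of how one might see which features were kept, the fitted selector's `scores_` attribute and `get_support()` method can be lined up against the feature names. The variable names `selector` and `selected` are illustrative and do not come from the recipe.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Load the iris data with its original continuous measurements
iris = load_iris()

# Fit the selector so its scores and support mask become available
selector = SelectKBest(f_classif, k=2).fit(iris.data, iris.target)

# scores_ holds one F-value per feature; get_support() is a boolean mask
for name, score, keep in zip(iris.feature_names, selector.scores_,
                             selector.get_support()):
    print(f"{name}: F = {score:.2f}, kept = {keep}")

# Names of the features that survive selection
selected = [name for name, keep in zip(iris.feature_names,
                                       selector.get_support()) if keep]
print("Selected features:", selected)
```

The same two members are available on `SelectPercentile`, so the loop works unchanged for the percentile-based selector.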


![](./pics/chisquare.jpg)

It is important to note that chi-squared statistics can only be calculated
between two categorical vectors. For this reason, chi-squared feature selection
requires that both the target vector and the features are categorical. However,
a numerical feature can still be used with the chi-squared technique if we first
transform it into a categorical feature. Finally, the chi-squared approach
requires all values to be non-negative.
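
One way to transform quantitative features into categorical ones before applying chi-squared is to bin them. The sketch below uses scikit-learn's `KBinsDiscretizer`. The choice of three equal-width bins is an assumption made for illustration, not something the recipe prescribes, and a different bin count or strategy may suit other data better.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import KBinsDiscretizer

# Load the continuous iris measurements
iris = load_iris()
features, target = iris.data, iris.target

# Assumption: three equal-width bins per feature (illustrative choice)
discretizer = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
features_binned = discretizer.fit_transform(features)

# Ordinal bin codes start at 0, so the non-negativity requirement is met
chi2_selector = SelectKBest(chi2, k=2)
features_kbest = chi2_selector.fit_transform(features_binned, target)

print("Smallest binned value:", features_binned.min())
print("Original number of features:", features_binned.shape[1])
print("Reduced number of features:", features_kbest.shape[1])
```

Compared with casting to integers, binning makes the category boundaries explicit and keeps them under our control.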

Alternatively, if we have a numerical feature we can use f_classif to calculate
the ANOVA F-value statistic between each feature and the target vector. F-value
scores examine whether, when we group the numerical feature by the target
vector, the means for each group are significantly different. For example, if we
had a binary target vector, gender, and a quantitative feature, test scores, the
F-value score would tell us whether the mean test score for men differs from the
mean test score for women. If it does not, then test score doesn’t help us
predict gender and therefore the feature is irrelevant.
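
The test-score example can be sketched with synthetic data. Everything below is hypothetical: the group sizes, the means and standard deviations, and the names `group`, `test_score`, and `noise` are invented for illustration. One feature has a different mean in each group, while the other is drawn from the same distribution for both.

```python
import numpy as np
from sklearn.feature_selection import f_classif

# Hypothetical data: a binary target and two quantitative features
rng = np.random.default_rng(0)
n_per_group = 100
group = np.repeat([0, 1], n_per_group)

# test_score: group means differ (70 vs. 80), so it should be informative
test_score = np.concatenate([rng.normal(70, 5, n_per_group),
                             rng.normal(80, 5, n_per_group)])

# noise: same distribution in both groups, so it should be uninformative
noise = rng.normal(50, 5, 2 * n_per_group)

features = np.column_stack([test_score, noise])

# f_classif returns one F-value and one p-value per feature
f_values, p_values = f_classif(features, group)

for name, f, p in zip(["test_score", "noise"], f_values, p_values):
    print(f"{name}: F = {f:.2f}, p = {p:.4f}")
```

The feature whose group means differ should receive a much larger F-value than the noise feature, which is what makes `SelectKBest(f_classif, ...)` prefer it.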