# CHAPTER 4: FEATURE ENGINEERING AND SELECTION
## Feature Selection
In this notebook, I review the main feature selection methods, which improve models by reducing the number of features in a dataset. Working with fewer features helps mitigate the curse of dimensionality and lowers the risk of one of the most common mistakes in machine learning model building: overfitting the training dataset.

Feature selection strategies fall into three main families: filter methods, wrapper methods, and embedded methods.

#### *Jose Ruben Garcia Garcia*
#### *February 2024*
*Reference: Practical Machine Learning Python Problems Solver*

## Feature scaling

### Loading and visualizing data

In [2]:
# Importing libraries
import numpy as np 
import pandas as pd
np.set_printoptions(suppress = True)
pt = np.get_printoptions()['threshold']

### Threshold-Based methods

This is a filter-based method that uses a cut-off, or threshold, to limit the number of features kept during feature selection.

In [4]:
# Build a count vectorizer that ignores terms occurring in fewer than 10% of the documents (min_df)
# or in more than 85% of them (max_df), and keeps at most 2000 terms (max_features)
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(min_df=0.1, max_df=0.85, max_features=2000)
cv
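
The vectorizer above is only constructed, never fitted. To see what the document-frequency thresholds do in practice, here is a minimal sketch on a made-up four-sentence corpus. The corpus, the names `toy_corpus`, `toy_cv`, and `kept`, and the looser thresholds (`min_df=0.5`, `max_df=0.9`) are my own assumptions for illustration, not part of the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus (not from the notebook's data)
toy_corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
    "a bird sang",
]

# Keep terms that appear in at least 50% and at most 90% of the documents
toy_cv = CountVectorizer(min_df=0.5, max_df=0.9)
toy_cv.fit(toy_corpus)

# vocabulary_ maps each surviving term to its column index
kept = sorted(toy_cv.vocabulary_)
print(kept)  # ['cat', 'dog', 'on', 'sat', 'the']
```

Terms such as `mat` or `bird` occur in a single document (25% of the corpus), so they fall below `min_df` and are dropped before any model sees them.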

In [5]:
df = pd.read_csv('Pokemon.csv')
poke_gen = pd.get_dummies(df['Generation'])
poke_gen.head()

| | Gen 1 | Gen 2 | Gen 3 | Gen 4 | Gen 5 | Gen 6 |
|---|---|---|---|---|---|---|
| 0 | True | False | False | False | False | False |
| 1 | True | False | False | False | False | False |
| 2 | True | False | False | False | False | False |
| 3 | True | False | False | False | False | False |
| 4 | True | False | False | False | False | False |


In [6]:
from sklearn.feature_selection import VarianceThreshold

vt = VarianceThreshold(threshold=0.15)
vt.fit(poke_gen)


In [7]:
pd.DataFrame({'variance': vt.variances_,
             'select_feature': vt.get_support()},
             index = poke_gen.columns).T

| | Gen 1 | Gen 2 | Gen 3 | Gen 4 | Gen 5 | Gen 6 |
|---|---|---|---|---|---|---|
| variance | 0.164444 | 0.114944 | 0.16 | 0.128373 | 0.163711 | 0.091994 |
| select_feature | True | False | True | False | True | False |
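
For a one-hot (0/1) column, the variance equals `p * (1 - p)`, where `p` is the share of rows set to 1. A threshold of 0.15 therefore keeps only columns whose share of ones lies roughly between 0.18 and 0.82. The sketch below checks this on a small made-up matrix; `toy`, `toy_vt`, and `bernoulli_var` are hypothetical names I introduce for illustration.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical 0/1 matrix: the columns have 50%, 40% and 10% ones
toy = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 1],
    [0, 0, 0],
], dtype=float)

# Variance of a 0/1 column computed from its share of ones
p = toy.mean(axis=0)
bernoulli_var = p * (1 - p)  # approximately [0.25, 0.24, 0.09]

# VarianceThreshold computes the same (population) variance per column
toy_vt = VarianceThreshold(threshold=0.15).fit(toy)
print(toy_vt.variances_)
print(toy_vt.get_support())  # only the first two columns pass
```

This is why the rarer generations in the table above are dropped: their dummy columns are mostly `False`, so their variance stays under the cut-off.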


In [9]:
# Keep only the columns that pass the variance threshold (first five rows shown)
poke_gen_subset = poke_gen.iloc[:,vt.get_support()].head()
poke_gen_subset

| | Gen 1 | Gen 3 | Gen 5 |
|---|---|---|---|
| 0 | True | False | False |
| 1 | True | False | False |
| 2 | True | False | False |
| 3 | True | False | False |
| 4 | True | False | False |


### Statistical Methods

#### Selecting the best features based on a statistical score

In [12]:
from sklearn.datasets import load_breast_cancer

bc_data = load_breast_cancer()
bc_features = pd.DataFrame(bc_data.data, columns=bc_data.feature_names)
bc_classes = pd.DataFrame(bc_data.target, columns=['IsMalignant'])

# Build the feature set and the response class labels, then check the shape of each array

bc_X = np.array(bc_features)
bc_Y = np.array(bc_classes).T[0]
print('Feature set shape:', bc_X.shape)
print('Response class shape:', bc_Y.shape)

Feature set shape: (569, 30)
Response class shape: (569,)


In [14]:
# Zoom into the Dataset
np.set_printoptions(threshold=30)
print('Feature set data [shape: '+str(bc_X.shape)+']')
print(np.round(bc_X, 2), '\n')
print('Features names: ')
print(np.array(bc_features.columns), '\n')
print('Response class label data [shape: '+str(bc_Y.shape)+']')
print(np.round(bc_Y, 2), '\n')
print('Response variable name: ', np.array(bc_classes.columns))
np.set_printoptions(threshold=pt)

Feature set data [shape: (569, 30)]
[[ 17.99  10.38 122.8  ...   0.27   0.46   0.12]
 [ 20.57  17.77 132.9  ...   0.19   0.28   0.09]
 [ 19.69  21.25 130.   ...   0.24   0.36   0.09]
 ...
 [ 16.6   28.08 108.3  ...   0.14   0.22   0.08]
 [ 20.6   29.33 140.1  ...   0.26   0.41   0.12]
 [  7.76  24.54  47.92 ...   0.     0.29   0.07]] 

Features names: 
['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension'] 

Response class label data [shape: (569,)]
[0 0 0 ... 0 0 1] 

Response variable name:  [

In [15]:
# Select the 15 best features from the dataset using the chi-squared score with SelectKBest
from sklearn.feature_selection import chi2, SelectKBest

skb = SelectKBest(score_func=chi2, k=15)
skb.fit(bc_X,bc_Y)

In [16]:
# Inspect the score each feature obtained (top 10 shown)
features_scores = [(item, score) for item, score in zip(bc_data.feature_names, skb.scores_)]
sorted(features_scores, key=lambda x: -x[1])[:10]

[('worst area', 112598.43156405364),
 ('mean area', 53991.65592375085),
 ('area error', 8758.504705334473),
 ('worst perimeter', 3665.0354163405946),
 ('mean perimeter', 2011.1028637679046),
 ('worst radius', 491.6891574333232),
 ('mean radius', 266.104917195178),
 ('perimeter error', 250.57189635982192),
 ('worst texture', 174.4493996057108),
 ('mean texture', 93.8975080986333)]

In [17]:
# Create a subset with the 15 best features
select_features_kbest = skb.get_support()
feature_names_kbest = bc_data.feature_names[select_features_kbest]
feature_subset_df = bc_features[feature_names_kbest]
bc_SX = np.array(feature_subset_df)
print(bc_SX.shape)
print(feature_names_kbest) 

(569, 15)
['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean concavity' 'radius error' 'perimeter error' 'area error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst compactness' 'worst concavity' 'worst concave points']


In [18]:
np.round(feature_subset_df.iloc[20:25],2)

| | mean radius | mean texture | mean perimeter | mean area | mean concavity | radius error | perimeter error | area error | worst radius | worst texture | worst perimeter | worst area | worst compactness | worst concavity | worst concave points |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20 | 13.08 | 15.71 | 85.63 | 520.0 | 0.05 | 0.19 | 1.38 | 14.67 | 14.5 | 20.49 | 96.09 | 630.5 | 0.28 | 0.19 | 0.07 |
| 21 | 9.5 | 12.44 | 60.34 | 273.9 | 0.03 | 0.28 | 1.91 | 15.7 | 10.23 | 15.66 | 65.13 | 314.9 | 0.11 | 0.09 | 0.06 |
| 22 | 15.34 | 14.26 | 102.5 | 704.4 | 0.21 | 0.44 | 3.38 | 44.91 | 18.07 | 19.08 | 125.1 | 980.9 | 0.6 | 0.63 | 0.24 |
| 23 | 21.16 | 23.04 | 137.2 | 1404.0 | 0.11 | 0.69 | 4.3 | 93.99 | 29.17 | 35.59 | 188.0 | 2615.0 | 0.26 | 0.32 | 0.2 |
| 24 | 16.65 | 21.38 | 110.0 | 904.6 | 0.15 | 0.81 | 5.46 | 102.6 | 26.46 | 31.56 | 177.0 | 2215.0 | 0.36 | 0.47 | 0.21 |


#### Comparing a simple logistic regression model on the full 30-feature dataset and on the 15-feature subset

In [26]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Build logistic regression model
lr = LogisticRegression(max_iter=1000, solver='liblinear', C=1.0)  # Try 'liblinear' solver and adjust C

# Evaluate accuracy for model built on full feature set
full_feat_acc = np.average(cross_val_score(lr, bc_X, bc_Y, scoring='accuracy', cv=5))

# Evaluate accuracy for model built on selected feature set
sel_feat_acc = np.average(cross_val_score(lr, bc_SX, bc_Y, scoring='accuracy', cv=5))

print('Model accuracy statistics with 5-fold cross-validation')
print('Accuracy of the model with complete feature set', bc_X.shape, ':', full_feat_acc)
print('Accuracy of the model with selected feature set', bc_SX.shape, ':', sel_feat_acc)


Model accuracy statistics with 5-fold cross-validation
Accuracy of the model with complete feature set (569, 30) : 0.9508150908244062
Accuracy of the model with selected feature set (569, 15) : 0.9525694767893185


The dataset with the 15 selected features shows a slight improvement in accuracy (95.26%) over the full dataset (95.08%). The difference is small, so the practical gain here is a model with half the features at no cost in accuracy.
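
One caveat with the comparison above: `SelectKBest` was fitted on all 569 rows before cross-validation, so every validation fold had already influenced which features were chosen. A stricter variant, sketched below, wraps the selector and the classifier in a pipeline so that the selection is re-fitted on the training part of each fold. The names `pipe` and `pipe_scores` are mine, and I reload the data so the block runs on its own. I make no claim about the resulting score.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_breast_cancer(return_X_y=True)

# Selection and model are fitted together inside each training fold
pipe = make_pipeline(
    SelectKBest(score_func=chi2, k=15),
    LogisticRegression(max_iter=1000, solver='liblinear', C=1.0),
)

pipe_scores = cross_val_score(pipe, X, y, scoring='accuracy', cv=5)
print('Mean accuracy with selection inside each fold:', pipe_scores.mean())
```

With this setup the selected features may differ from fold to fold, which is the price of an estimate that does not peek at the validation data.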


### Recursive feature elimination (RFE)
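The basic idea of this method is to fit an ML estimator, such as the logistic regression model used in the previous section, and let RFE assign a weight to every feature in the dataset based on the model fit. The features with the smallest weights are removed, and the fit-and-remove cycle repeats on the remaining features until only the requested number is left.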
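
To make the elimination loop concrete, here is a hand-written sketch of the idea on synthetic data. It is not scikit-learn's implementation: ranking by absolute coefficient is my simplifying assumption, and `manual_rfe`, `X_toy`, `y_toy`, and `kept_idx` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem: 8 features, 3 of them informative
X_toy, y_toy = make_classification(
    n_samples=200, n_features=8, n_informative=3,
    n_redundant=0, random_state=0,
)

def manual_rfe(X, y, n_keep):
    """Drop the feature with the smallest |coefficient| until n_keep remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        model = LogisticRegression(max_iter=1000, solver='liblinear')
        model.fit(X[:, remaining], y)
        weights = np.abs(model.coef_).ravel()  # one weight per remaining feature
        weakest = int(np.argmin(weights))      # position of the weakest feature
        del remaining[weakest]                 # eliminate it and refit
    return remaining

kept_idx = manual_rfe(X_toy, y_toy, n_keep=3)
print('Column indices kept:', kept_idx)
```

The `step=1` argument used with `RFE` in the next cell corresponds to removing one feature per iteration, as this loop does.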


In [28]:
# Select the 15 best features, this time with RFE

from sklearn.feature_selection import RFE

lr = LogisticRegression(max_iter=1000, solver='liblinear', C=1.0)  # Try 'liblinear' solver and adjust C
rfe = RFE(estimator=lr, n_features_to_select=15, step=1)
rfe.fit(bc_X, bc_Y)

In [29]:
select_features_rfe = rfe.get_support()
features_names_rfe = bc_data.feature_names[select_features_rfe]
print(features_names_rfe)

['mean radius' 'mean texture' 'mean concavity' 'mean concave points'
 'mean symmetry' 'radius error' 'texture error' 'perimeter error'
 'area error' 'worst radius' 'worst texture' 'worst perimeter'
 'worst compactness' 'worst concavity' 'worst concave points']


In [31]:
# Comparison of both techniques: features selected by SelectKBest and by RFE
set(feature_names_kbest) & set(features_names_rfe)

{'area error',
 'mean concavity',
 'mean radius',
 'mean texture',
 'perimeter error',
 'radius error',
 'worst compactness',
 'worst concave points',
 'worst concavity',
 'worst perimeter',
 'worst radius',
 'worst texture'}

This shows that 12 of the 15 selected features are shared by both techniques.

### Model based selection

In [32]:
# Use a random forest model to score and rank the features by importance

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()
rfc.fit(bc_X, bc_Y)

In [36]:
# Pair each feature with its importance score from the fitted random forest (top 10 shown)
importance_scores = rfc.feature_importances_
feature_importances = [(feature, score) for feature, score in zip(bc_data.feature_names, importance_scores)]
sorted(feature_importances, key=lambda x: -x[1])[:10]

[('worst perimeter', 0.15524758228509045),
 ('worst radius', 0.1279131456505142),
 ('mean concave points', 0.1252403999560926),
 ('worst area', 0.1124408470589262),
 ('worst concave points', 0.09197753603108351),
 ('mean perimeter', 0.07449993187988935),
 ('mean radius', 0.03997534904237185),
 ('mean area', 0.03145827783652463),
 ('mean concavity', 0.028135456473625402),
 ('worst texture', 0.0234314134755979)]
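
The scores above only rank the features; they do not yet produce a reduced dataset. One way to turn importances into a selection is scikit-learn's `SelectFromModel`, sketched below. The `threshold='median'` setting, the fixed `random_state`, and the names `forest`, `selector`, `X_reduced`, and `selected_names` are my own choices for illustration. Because the forest in the cell above is not seeded, its scores can change between runs.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

data = load_breast_cancer()

# A fixed seed makes the importance scores reproducible between runs
forest = RandomForestClassifier(n_estimators=100, random_state=42)

# Keep the features whose importance is at least the median importance
selector = SelectFromModel(forest, threshold='median')
X_reduced = selector.fit_transform(data.data, data.target)

selected_names = data.feature_names[selector.get_support()]
print(X_reduced.shape)
print(selected_names)
```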

### Dimensionality reduction

#### Principal component analysis (PCA)
This is a statistical method that applies a linear, orthogonal transformation to turn a higher-dimensional set of possibly correlated features into a lower-dimensional set of linearly uncorrelated features.


In [37]:
# center the feature set
bc_XC = bc_X - bc_X.mean(axis=0)

# decompose using SVD
U, S, VT = np.linalg.svd(bc_XC)

# get principal components
PC = VT.T

# get first 3 principal components
PC3 = PC[:, 0:3]
PC3.shape


(30, 3)

In [38]:
# reduce feature set dimensionality 
np.round(bc_XC.dot(PC3), 2)

array([[-1160.14,  -293.92,   -48.58],
       [-1269.12,    15.63,    35.39],
       [ -995.79,    39.16,     1.71],
       ...,
       [ -314.5 ,    47.55,    10.44],
       [-1124.86,    34.13,    19.74],
       [  771.53,   -88.64,   -23.89]])

In [39]:
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
pca.fit(bc_X)


In [40]:
pca.explained_variance_ratio_

array([0.98204467, 0.01617649, 0.00155751])
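
A first component that carries about 98% of the variance largely reflects the scale of the raw features: the area columns take values in the hundreds or thousands, while most other columns stay below 1. PCA is sensitive to this, so a common variant is to standardize the features first. The sketch below shows that variant under my own assumptions (`StandardScaler` as the scaling step, hypothetical names `X_std` and `pca_std`). It is not something the original notebook does.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_raw = load_breast_cancer().data

# Give every feature zero mean and unit variance before the decomposition
X_std = StandardScaler().fit_transform(X_raw)

pca_std = PCA(n_components=3).fit(X_std)
print(pca_std.explained_variance_ratio_)
```

After scaling, the variance is spread across more components, so three components no longer summarize almost everything.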

In [41]:
bc_pca = pca.transform(bc_X)
np.round(bc_pca, 2)

array([[1160.14, -293.92,   48.58],
       [1269.12,   15.63,  -35.39],
       [ 995.79,   39.16,   -1.71],
       ...,
       [ 314.5 ,   47.55,  -10.44],
       [1124.86,   34.13,  -19.74],
       [-771.53,  -88.64,   23.89]])
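
Compared with the manual SVD projection from cell 38, the first and third columns here have the opposite sign. This is expected: each principal component is defined only up to a sign flip, and different routines may pick different orientations. The NumPy-only sketch below illustrates this on made-up data by comparing an SVD-based projection with one based on the eigenvectors of the covariance matrix, and it also derives the explained variance ratio from the singular values. All names in it (`X_demo`, `proj_svd`, `proj_eig`, and so on) are hypothetical.

```python
import numpy as np

# Hypothetical data: 100 rows, 5 features with clearly different spreads
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5)) * np.array([10.0, 5.0, 2.0, 1.0, 0.5])
Xc = X_demo - X_demo.mean(axis=0)  # center the features

# Route 1: principal components from the SVD of the centered data
U, S, VT = np.linalg.svd(Xc, full_matrices=False)
proj_svd = Xc @ VT.T[:, :3]

# Route 2: principal components from the covariance eigenvectors
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]  # largest eigenvalue first
proj_eig = Xc @ eigvecs[:, order[:3]]

# The two projections agree once the sign of each column is ignored
same_up_to_sign = np.allclose(np.abs(proj_svd), np.abs(proj_eig))
print(same_up_to_sign)

# Explained variance ratio from the squared singular values
explained_ratio = S**2 / np.sum(S**2)
print(np.round(explained_ratio[:3], 4))
```

The sign of a component does not affect distances between projected rows.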

In [43]:
np.average(cross_val_score(lr, bc_pca, bc_Y, scoring='accuracy', cv=5))

0.9262071106970968