<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

## Lab 4.2.2: Feature Selection

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

%matplotlib inline

### 5. Forward Feature Selection

> Forward Selection: Forward selection is an iterative method in which we start with having no feature in the model. In each iteration, we keep adding the feature which best improves our model till an addition of a new variable does not improve the performance of the model.

Create a regression model using forward feature selection: loop over the features, adding one at a time, until the prediction metric ($R^2$ and adjusted $R^2$ in this case) stops improving.

#### 5.1 Load Wine Data & Define Predictor and Target

In [2]:
## Load the wine quality dataset

# Load the wine dataset from csv
wine = pd.read_csv('/Users/stephanienduaguba/Documents/DATA/winequality_merged.csv')

# Define the target variable (dependent variable) as y
y = wine['quality']

# Take all columns except target as predictor columns
predictor_columns = [c for c in wine.columns if c != 'quality']

# Build the predictor DataFrame from the selected columns
X = pd.DataFrame(wine, columns = predictor_columns)

In [3]:
## Create training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)

#### 5.2 Overview of the code below

The outer `while` loop runs until a full pass over the remaining features produces no improvement, which is tracked by the flag `changed` (the loop stops when `changed` stays `False`).
The inner `for` loop goes over each feature not yet included in the model, fits a model with that feature added, and calculates $R^2$ and adjusted $R^2$. Whenever a candidate model beats the best adjusted $R^2$ found so far, the record in `best` is updated.

#### Code variables
- `included`: list of the features (predictors) that were included in the model; starts empty.
- `excluded`: list of features that have **not** been included in the model; starts as the full list of features.
- `best`: dictionary to keep record of the best model found at any stage; starts 'empty'.
- `model`: object of class LinearRegression, with default values for all parameters.

#### Methods of the `LinearRegression` object to investigate
- `fit()`
- `score()`
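
As a minimal sketch of these two methods, the toy example below fits a `LinearRegression` on data that is exactly linear and then asks for its score. The arrays `X_toy` and `y_toy` are made up for illustration and are not part of the lab data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (an assumption for illustration): y depends exactly linearly on x
X_toy = np.arange(10, dtype=float).reshape(-1, 1)
y_toy = 2.0 * X_toy.ravel() + 1.0

toy_model = LinearRegression()
toy_model.fit(X_toy, y_toy)             # estimates coef_ and intercept_
toy_r2 = toy_model.score(X_toy, y_toy)  # R^2 of the fitted model on the data passed in

print(toy_model.coef_, toy_model.intercept_, toy_r2)
```

Note that `score()` is called on the model itself, after `fit()`, and returns $R^2$ for whatever data it is given.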

#### Adjusted $R^2$ formula
$$\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$$

where $n$ is the number of observations and $k$ is the number of predictors in the model.
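
The formula can be wrapped in a small helper, sketched below. The function name `adjusted_r2_score` and the example numbers are assumptions for illustration; `n` is the number of observations and `k` the number of predictors.

```python
def adjusted_r2_score(r2, n, k):
    """Adjusted R^2 for a model with k predictors fitted on n observations."""
    if n - k - 1 <= 0:
        raise ValueError("adjusted R^2 needs n > k + 1")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Same R^2, different numbers of predictors: more predictors means a larger penalty
example_few = adjusted_r2_score(0.5, n=101, k=4)
example_many = adjusted_r2_score(0.5, n=101, k=10)
print(round(example_few, 4), round(example_many, 4))
```

For a fixed $R^2$, the adjusted value drops as `k` grows, which is why it is a more cautious stopping criterion than $R^2$ alone.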

#### Linear Regression [reference](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression)

In [4]:
## Flag intermediate output
show_steps = True   # for testing/debugging
# show_steps = False  # without showing steps

In [5]:
# Use Forward Feature Selection to pick a good model

# Start with no predictors
included = []

# Keep track of model and parameters
best = {'feature': '', 'r2': 0, 'a_r2': 0}  # r2 = R-squared and a_r2 = adjusted R-squared

# Create a model object to hold the modeling parameters
model = LinearRegression()  # Create a model for Linear Regression

# Get the number of cases in the training data
n = X_train.shape[0]

while True:
    changed = False
    
    if show_steps:
        print('') 

    # List the features to be evaluated
    excluded = list(set(X.columns) - set(included))
    
    if show_steps:
        print('(Step) Excluded = %s' % ', '.join(excluded))  

    # For each remaining feature to be evaluated
    for new_column in excluded:
        
        if show_steps:
            print('(Step) Trying %s...' % new_column)
            print('(Step) - Features = %s' % ', '.join(included + [new_column]))

        # Fit the model with the Training data
        X_train_temp = X_train[included + [new_column]]  # Include new feature
        model.fit(X_train_temp, y_train)  # Fit the model
        
        # Calculate the score (R^2 for Regression)
        r2 = model.score(X_train_temp, y_train)  # Calculate the R-squared
        
        # Number of predictors in this model
        k = len(included) + 1
        
        # Calculate the adjusted R^2
        adjusted_r2 = 1 - (1 - r2) * ((n - 1) / (n - k - 1))
        
        if show_steps:
            print('(Step) - Adjusted R^2: This = %.3f; Best = %.3f' % 
                  (adjusted_r2, best['a_r2']))

        # If the model improves
        if adjusted_r2 > best['a_r2']:
            # Record new parameters
            best = {'feature': new_column, 'r2': r2, 'a_r2': adjusted_r2}
            # Flag that found a better model
            changed = True
            if show_steps:
                print('(Step) - New Best!   : Feature = %s; R^2 = %.3f; Adjusted R^2 = %.3f' % 
                      (best['feature'], best['r2'], best['a_r2']))
    # END

    # If found a better model after testing all remaining features
    if changed:
        # Update control details
        included.append(best['feature'])
        excluded = list(set(excluded) - set([best['feature']]))
        print('Added feature %-4s with R^2 = %.3f and adjusted R^2 = %.3f' % 
              (best['feature'], best['r2'], best['a_r2']))
    else:
        # Terminate if no better model
        break

print('')
print('Resulting features:')
print(', '.join(included))


(Step) Excluded = density, citric acid, volatile acidity, alcohol, sulphates, red_wine, chlorides, pH, fixed acidity, total sulfur dioxide, residual sugar, free sulfur dioxide
(Step) Trying density...
(Step) - Features = density
(Step) - Adjusted R^2: This = 0.091; Best = 0.000
(Step) - New Best!   : Feature = density; R^2 = 0.091; Adjusted R^2 = 0.091
(Step) Trying citric acid...
(Step) - Features = citric acid
(Step) - Adjusted R^2: This = 0.007; Best = 0.091
(Step) Trying volatile acidity...
(Step) - Features = volatile acidity
(Step) - Adjusted R^2: This = 0.069; Best = 0.091
(Step) Trying alcohol...
(Step) - Features = alcohol
(Step) - Adjusted R^2: This = 0.201; Best = 0.091
(Step) - New Best!   : Feature = alcohol; R^2 = 0.201; Adjusted R^2 = 0.201
(Step) Trying sulphates...
(Step) - Features = sulphates
(Step) - Adjusted R^2: This = 0.002; Best = 0.201
(Step) Trying red_wine...
(Step) - Features = red_wine
(Step) - Adjusted R^2: This = 0.012; Best = 0.201
(Step) Trying chlorid

$R^2$ increases slightly (to 0.303) when all the features are considered, compared with when only four features were considered (Lab 4.2.1, with $R^2$ = 0.2559...).

### Sequential Feature Selector - Using sklearn.feature_selection

https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html

#### Forward

In [6]:
from sklearn.feature_selection import SequentialFeatureSelector

# Create a LinearRegression model
model = LinearRegression()

# Initialize SequentialFeatureSelector for forward selection
# n_features_to_select='auto' with tol stops adding features once the
# cross-validated score (here 'r2') improves by less than tol
sfs = SequentialFeatureSelector(model, direction='forward', n_features_to_select='auto', tol=1e-5, scoring='r2', cv=5)

# Fit the SequentialFeatureSelector to the training data
sfs.fit(X_train, y_train)

# Get the names of the selected features (support_ is a boolean mask over the columns)
selected_features = list(X_train.columns[sfs.support_])
selected_features

['fixed acidity',
 'volatile acidity',
 'residual sugar',
 'chlorides',
 'free sulfur dioxide',
 'total sulfur dioxide',
 'density',
 'pH',
 'sulphates',
 'alcohol',
 'red_wine']

In [7]:
# Target variable
y = wine['quality']

In [8]:
# Predictor variables
X = wine[['fixed acidity', 
          'volatile acidity', 
          #'citric acid', 
          'residual sugar',
          'chlorides', 
          'free sulfur dioxide', 
          'total sulfur dioxide', 
          'density', 
          'pH', 
          'sulphates', 
          'alcohol',
          'red_wine']]

In [9]:
# Create training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)

In [10]:
# Create a model for Linear Regression
# Fit the model with the Training data
LR = LinearRegression().fit(X_train, y_train)

# Calculate the score (R^2 for Regression) for Training Data
LR.score(X_train, y_train)

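# Refit a new model on the Testing data
# NOTE: this replaces the model trained above, so the score below is an
# in-sample R^2 on the test subset, not a held-out evaluation of the
# training-fitted model. For a held-out score, skip this refit.
LR = LinearRegression().fit(X_test, y_test)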

# Calculate the score (R^2 for Regression) for Testing Data
LR.score(X_test, y_test)

0.278164041255501
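
A held-out $R^2$ is usually obtained by fitting once on the training subset and scoring that same model on the test subset, without refitting. The sketch below shows the pattern on synthetic data: `make_regression` stands in for the wine data, and all the `*_demo` and `Xd_*`/`yd_*` names are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine data (assumption for illustration only)
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Fit once, on the training subset only
demo_model = LinearRegression().fit(Xd_train, yd_train)

# Score the same fitted model on both subsets
train_r2 = demo_model.score(Xd_train, yd_train)  # in-sample R^2
test_r2 = demo_model.score(Xd_test, yd_test)     # held-out R^2

print('Train R^2 = %.3f; Test R^2 = %.3f' % (train_r2, test_r2))
```

Comparing the two scores gives a rough indication of how well the model generalises to data it has not seen.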

#### Backward

In [11]:
from sklearn.feature_selection import SequentialFeatureSelector

# Create a LinearRegression model
model = LinearRegression()

# Initialize SequentialFeatureSelector for backward selection
# n_features_to_select='auto' with tol stops removing features once the
# cross-validated score (here 'r2') would drop by more than tol
sfs = SequentialFeatureSelector(model, direction='backward', n_features_to_select='auto', tol=1e-5, scoring='r2', cv=5)

# Fit the SequentialFeatureSelector to the training data
sfs.fit(X_train, y_train)

# Get the names of the selected features (support_ is a boolean mask over the columns)
selected_features = list(X_train.columns[sfs.support_])
selected_features

['fixed acidity',
 'volatile acidity',
 'residual sugar',
 'free sulfur dioxide',
 'total sulfur dioxide',
 'density',
 'pH',
 'sulphates',
 'alcohol',
 'red_wine']

## Feature Selection Description - chatGPT

Forward feature selection and backward feature selection are two different approaches to feature selection, and they may not necessarily yield the same set of variables. Here's why:

#### Forward Feature Selection:

Forward feature selection starts with an empty set of features and iteratively adds the most relevant feature at each step based on some criterion (e.g., improvement in model performance).
It builds up the feature set step by step, selecting one feature at a time that provides the best improvement in the chosen evaluation metric.
The final set of selected features depends on the order in which features were added and the evaluation metric used. It may not include all features that could be relevant together.


#### Backward Feature Selection:

Backward feature selection starts with all available features and iteratively removes the least relevant feature at each step based on some criterion (e.g., model performance degradation).
It begins with the full feature set and reduces it step by step, eliminating the least important features.
The final set of selected features depends on the order in which features were removed and the evaluation metric used. It may not include all features that could be relevant together.

In practice, whether forward or backward feature selection is more appropriate depends on your specific problem and dataset. It's possible that the two methods may select similar or overlapping sets of features, especially if your dataset has clear features that significantly contribute to the target variable. However, they can also yield different results based on the order of evaluation and the chosen evaluation metric.

It's a good practice to try both forward and backward feature selection, as well as other feature selection techniques, to explore different subsets of features and evaluate their impact on model performance. Ultimately, the choice of features should be based on a combination of domain knowledge, experimentation, and validation through cross-validation or other evaluation methods.
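
One way to try both directions side by side is sketched below on synthetic data. The DataFrame `demo`, its column names, the helper `select_features`, and the fixed `n_features_to_select=2` are all assumptions for illustration; column `f` is deliberately made a near-copy of `a` so that the two directions have a reason to disagree.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): the target depends on 'a' and 'b' only
rng = np.random.default_rng(0)
n_rows = 300
demo = pd.DataFrame(rng.normal(size=(n_rows, 5)), columns=list('abcde'))
demo['f'] = demo['a'] + 0.1 * rng.normal(size=n_rows)  # near-copy of 'a'
target = 3 * demo['a'] + 2 * demo['b'] + rng.normal(size=n_rows)

def select_features(direction):
    """Run sequential selection in one direction and return the chosen column names."""
    selector = SequentialFeatureSelector(
        LinearRegression(), direction=direction,
        n_features_to_select=2, scoring='r2', cv=5)
    selector.fit(demo, target)
    return list(demo.columns[selector.get_support()])

forward_set = select_features('forward')
backward_set = select_features('backward')

print('Forward :', forward_set)
print('Backward:', backward_set)
print('Same set:', set(forward_set) == set(backward_set))
```

Whether the two sets match depends on the data and the metric, so printing both is a quick check before settling on one.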

### Extra - Sequential Feature Selector - Using mlxtend.feature_selection

In [12]:
%pip install mlxtend

Note: you may need to restart the kernel to use updated packages.


In [13]:
# Import the SequentialFeatureSelector class from mlxtend
from mlxtend.feature_selection import SequentialFeatureSelector

# Create an instance of SequentialFeatureSelector
forward_feature_selection = SequentialFeatureSelector(
    LinearRegression(n_jobs=-1),       # Linear regression model to evaluate feature subsets
    k_features=(1, X_train.shape[1]),  # Range of feature counts to try; the upper bound must not
                                       # exceed X_train.shape[1] (11 columns here), otherwise
                                       # mlxtend raises an AttributeError
    forward=True,                      # Perform forward feature selection
    floating=False,                    # Disable floating feature selection (not used here)
    verbose=2,                         # Verbosity level for output
    scoring='r2',                      # Scoring metric to evaluate feature subsets (R-squared)
    cv=5                               # Number of cross-validation folds
).fit(X_train, y_train)



---

> > > > > > > > > © 2023 Institute of Data

