<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

## Lab 4.2.2: Feature Selection

Moving beyond basic feature selection methods, this lab introduces forward feature selection. Through an iterative process, we progressively include features that contribute to improving the model's adjusted R-squared score. By systematically evaluating the impact of each feature, we aim to construct a regression model that captures the underlying patterns in the data.

In [48]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.feature_selection import SequentialFeatureSelector

%matplotlib inline

### 5. Forward Feature Selection

> Forward Selection: Forward selection is an iterative method that starts with no features in the model. In each iteration, we add the feature that most improves the model, and we stop when adding a new variable no longer improves its performance.

Create a regression model using forward feature selection: loop over the features, adding one at a time, until the prediction metric ($R^2$ and adjusted $R^2$ in this case) stops improving.

#### 5.1 Load Wine Data & Define Predictor and Target

In [33]:
## Load the wine quality dataset

# Load the wine dataset from csv
wine = pd.read_csv('winequality_merged.csv')

# define the target variable (dependent variable) as y
y = wine['quality']

# Take all columns except target as predictor columns
predictor_columns = [c for c in wine.columns if c != 'quality']
# Build the predictor matrix X as a pandas DataFrame
X = pd.DataFrame(wine, columns = predictor_columns)

In [35]:
## Create training and testing subsets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
print(f'X_train shape: {X_train.shape}')
print(f'X_test shape: {X_test.shape}')
print(f'y_train shape: {y_train.shape}')
print(f'y_test shape: {y_test.shape}')

X_train shape: (5197, 12)
X_test shape: (1300, 12)
y_train shape: (5197,)
y_test shape: (1300,)


#### 5.2 Overview of the code below

The outer `while` loop keeps running until no remaining feature improves the model, or until there are no features left to try.
The inner `for` loop goes over each feature not yet included in the model, fits a regression with that feature added, and calculates the adjusted $R^2$. If the best candidate improves on the previous best score, it is added to the model and the records are updated.

#### Code variables
- `selected_features`: list of the features (predictors) that have been included in the model; starts empty.
- `remaining_features`: list of features that have **not** been included in the model; starts as the full list of features.
- `current_score`, `best_new_score`: the best adjusted $R^2$ so far and the best score found in the latest pass; both start at 0.
- `model`: object of class LinearRegression, with default values for all parameters.

#### Methods of the `LinearRegression` object to investigate
- `fit()`
- `score()`
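
As a minimal sketch of these two methods, the snippet below fits a `LinearRegression` on synthetic, noise-free data and reads back its $R^2$ with `score()`, which is called on the fitted estimator. The names `X_demo` and `y_demo` and the coefficients are made up for illustration; they are not part of the wine dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: two predictors and an exactly linear target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 1.0

# fit() estimates the coefficients and returns the fitted estimator itself
demo_model = LinearRegression().fit(X_demo, y_demo)

# score() returns R^2 of the fitted model on the data passed to it
demo_r2 = demo_model.score(X_demo, y_demo)

print("coefficients:", demo_model.coef_)
print("intercept:", demo_model.intercept_)
print("R^2:", demo_r2)
```

Because the target here has no noise, the recovered coefficients should match the ones used to build `y_demo` and $R^2$ should be essentially 1; on real data such as the wine set, expect much lower values.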

#### Adjusted $R^2$ formula
$$\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$$

where $n$ is the number of observations and $k$ is the number of predictors in the model.
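
To see what the formula does, here is a small worked example with made-up numbers (an $R^2$ of 0.30 on 100 observations). The helper name `adjusted_r2_demo` is hypothetical and chosen only to keep this illustration separate from the lab code. With the same raw $R^2$, the adjusted value falls as the number of predictors grows, which is why forward selection stops rewarding features that add little.

```python
def adjusted_r2_demo(r2, n, k):
    """Adjusted R^2 for a model with k predictors fitted on n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Same raw R^2 and sample size, different numbers of predictors
few_predictors = adjusted_r2_demo(0.30, n=100, k=5)
many_predictors = adjusted_r2_demo(0.30, n=100, k=20)

print(f"k=5:  {few_predictors:.4f}")   # k=5:  0.2628
print(f"k=20: {many_predictors:.4f}")  # k=20: 0.1228
```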

#### Linear Regression [reference](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression)

In [38]:
## Forward feature selection driven by adjusted R2
def adjusted_r2(r2, n, k):
    # n: number of observations, k: number of predictors in the model
    return 1 - (1 - r2) * ((n - 1) / (n - k - 1))

show_steps = True  # Change to False for no intermediate output

remaining_features = list(X_train.columns)
selected_features = []
current_score, best_new_score = 0, 0

while remaining_features and current_score == best_new_score:
    scores_with_candidates = []
    for candidate in remaining_features:
        model = LinearRegression().fit(X_train[selected_features + [candidate]], y_train)
        r2 = r2_score(y_train, model.predict(X_train[selected_features + [candidate]]))
        adj_r2 = adjusted_r2(r2, X_train.shape[0], len(selected_features) + 1)
        scores_with_candidates.append((adj_r2, candidate))
        
        if show_steps:
            print(f"Evaluating feature: {candidate}, Adjusted R2: {adj_r2:.4f}")

    scores_with_candidates.sort()
    best_new_score, best_candidate = scores_with_candidates.pop()
    
    if current_score < best_new_score:
        remaining_features.remove(best_candidate)
        selected_features.append(best_candidate)
        current_score = best_new_score
        
        if show_steps:
            print(f"Selected feature: {best_candidate}, New Adjusted R2: {current_score:.4f}")

print(f"Selected features: {selected_features}")

final_model = LinearRegression().fit(X_train[selected_features], y_train)
final_r2 = r2_score(y_test, final_model.predict(X_test[selected_features]))
final_adj_r2 = adjusted_r2(final_r2, X_test.shape[0], len(selected_features))

print(f"Final R2 on test set: {final_r2}")
print(f"Final adjusted R2 on test set: {final_adj_r2}")


Evaluating feature: fixed acidity, Adjusted R2: 0.0059
Evaluating feature: volatile acidity, Adjusted R2: 0.0753
Evaluating feature: citric acid, Adjusted R2: 0.0077
Evaluating feature: residual sugar, Adjusted R2: 0.0011
Evaluating feature: chlorides, Adjusted R2: 0.0420
Evaluating feature: free sulfur dioxide, Adjusted R2: 0.0027
Evaluating feature: total sulfur dioxide, Adjusted R2: 0.0012
Evaluating feature: density, Adjusted R2: 0.0987
Evaluating feature: pH, Adjusted R2: 0.0001
Evaluating feature: sulphates, Adjusted R2: 0.0007
Evaluating feature: alcohol, Adjusted R2: 0.1944
Evaluating feature: red_wine, Adjusted R2: 0.0167
Selected feature: alcohol, New Adjusted R2: 0.1944
Evaluating feature: fixed acidity, Adjusted R2: 0.1956
Evaluating feature: volatile acidity, Adjusted R2: 0.2604
Evaluating feature: citric acid, Adjusted R2: 0.2038
Evaluating feature: residual sugar, Adjusted R2: 0.2128
Evaluating feature: chlorides, Adjusted R2: 0.2035
Evaluating feature: free sulfur dioxi

In [50]:
## Use Forward Feature Selection to pick a good model

# Define the target variable (dependent variable) as y
y = wine['quality']

# Take all columns except 'quality' as predictor columns
predictor_columns = [c for c in wine.columns if c != 'quality']
X = wine[predictor_columns]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the linear regression model
model = LinearRegression()

# Forward feature selection with 5-fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# 'auto' with the default tol=None keeps half of the features (6 of 12 here);
# recent scikit-learn releases may reject n_features_to_select=None
selector = SequentialFeatureSelector(model, n_features_to_select='auto', direction='forward', cv=kf)

# Fit the selector to the training data
selector = selector.fit(X_train, y_train)

# Get the selected feature indices
selected_features = selector.get_support(indices=True)

# Subset the training and testing data with selected features
X_train_selected = X_train.iloc[:, selected_features]
X_test_selected = X_test.iloc[:, selected_features]

# Estimate the MSE of the selected-feature model with cross-validation
scores = cross_val_score(model, X_train_selected, y_train, cv=kf, scoring='neg_mean_squared_error')
mse_cv = -scores.mean()

# Train the model on the selected features (without cross-validation for final model)
model.fit(X_train_selected, y_train)

# Predict on the test set
y_pred = model.predict(X_test_selected)

# Calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error (MSE) on test set: {mse}')
print(f'Mean Squared Error (MSE) with {kf.n_splits}-fold cross-validation: {mse_cv}')

# Print selected feature names
selected_feature_names = X.columns[selected_features]
print('Selected Features:')
print(selected_feature_names)

Mean Squared Error (MSE) on test set: 0.5409956167696763
Mean Squared Error (MSE) with 5-fold cross-validation: 0.5476801133891139
Selected Features:
Index(['volatile acidity', 'residual sugar', 'density', 'sulphates', 'alcohol',
       'red_wine'],
      dtype='object')
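
The selector above returns six of the twelve features, which matches scikit-learn's documented behaviour of keeping half of the features when neither a feature count nor a tolerance is given. As a hedged sketch of how an explicit count changes this, the example below runs `SequentialFeatureSelector` on synthetic data where only two columns drive the target. The data, the names `X_demo` and `y_demo`, and the choice `n_features_to_select=2` are assumptions made for illustration, not part of the lab.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: six predictors, only columns 0 and 3 matter
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
y_demo = 4.0 * X_demo[:, 0] - 3.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=200)

# Ask for exactly two features, chosen greedily by cross-validated score
sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,
    direction="forward",
    cv=5,
)
sfs.fit(X_demo, y_demo)

# get_support() returns a boolean mask over the input columns
chosen = np.flatnonzero(sfs.get_support())
print("Selected column indices:", chosen)
```

On the wine data, the number of features to keep is a modelling choice; comparing cross-validated MSE across a few values is one way to pick it.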




---

© 2024 Institute of Data