<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

## Lab 4.2.2: Feature Selection

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

### 1. Forward Feature Selection

> Forward Selection: Forward selection is an iterative method in which we start with having no feature in the model. In each iteration, we keep adding the feature which best improves our model till an addition of a new variable does not improve the performance of the model.

Create a Regression model using Forward Feature Selection by looping over all the features adding one at a time until there are no improvements on the prediction metric ( R2  and  AdjustedR2  in this case).

#### 1.1 Load Wine Data & Define Predictor and Target

In [3]:
## Load the wine quality dataset

# Load the wine dataset from csv
wine = pd.read_csv('winequality_merged.csv')

# define the target variable (dependent variable) as y
y = wine['quality']

# Take all columns except target as predictor columns
predictor_columns = [c for c in wine.columns if c != 'quality']
# Load the dataset as a pandas data frame
X = pd.DataFrame(wine, columns = predictor_columns)

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [5]:
## Create training and testing subsets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

In [6]:
from mlxtend.feature_selection import SequentialFeatureSelector as SFS 

In [7]:
reg = LinearRegression()
sfs = SFS(reg, k_features=12, forward=True, scoring='r2', cv=5)

sfs.fit(X, y)
sfs.subsets_

{1: {'feature_idx': (10,),
  'cv_scores': array([0.19737904, 0.14259664, 0.16402737, 0.25951304, 0.11718204]),
  'avg_score': 0.17613962742677963,
  'feature_names': ('alcohol',)},
 2: {'feature_idx': (1, 10),
  'cv_scores': array([0.25002068, 0.23000381, 0.21328574, 0.2912951 , 0.1659067 ]),
  'avg_score': 0.23010240552676492,
  'feature_names': ('volatile acidity', 'alcohol')},
 3: {'feature_idx': (1, 10, 11),
  'cv_scores': array([0.30485787, 0.2238263 , 0.22070722, 0.29339553, 0.18436065]),
  'avg_score': 0.2454295140031742,
  'feature_names': ('volatile acidity', 'alcohol', 'red_wine')},
 4: {'feature_idx': (1, 3, 10, 11),
  'cv_scores': array([0.28782052, 0.20295259, 0.24483178, 0.31213809, 0.21505592]),
  'avg_score': 0.2525597798088244,
  'feature_names': ('volatile acidity',
   'residual sugar',
   'alcohol',
   'red_wine')},
 5: {'feature_idx': (1, 3, 9, 10, 11),
  'cv_scores': array([0.2975946 , 0.21535936, 0.24327025, 0.31583553, 0.2044268 ]),
  'avg_score': 0.2552973071750



#### 1.2 Overview of the code below

The external `while` loop goes forever until there are no improvements to the model, which is controlled by the flag `changed` (until is **not** changed).
The inner `for` loop goes over each of the features not yet included in the model and calculates the correlation coefficient. If any model improves on the previous best model then the records are updated.

#### Code variables
- `included`: list of the features (predictors) that were included in the model; starts empty.
- `excluded`: list of features that have **not** been included in the model; starts as the full list of features.
- `best`: dictionary to keep record of the best model found at any stage; starts 'empty'.
- `model`: object of class LinearRegression, with default values for all parameters.

#### Methods of the `LinearRegression` object to investigate
- `fit()`
- `fit.score()`

#### Adjusted $R^2$ formula
$$Adjusted \; R^2 = 1 - { (1 - R^2) (n - 1)  \over n - k - 1 }$$

#### Linear Regression [reference](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression)

In [9]:
## Flag intermediate output

show_steps = True   # for testing/debugging
# show_steps = False  # without showing steps

In [11]:
## Use Forward Feature Selection to pick a good model

show_steps = True
included = [] # start with no predictors

best = {'feature': '', 'r2': 0, 'a_r2': 0} # keep track of model and parameters

model = LinearRegression() 

n = X_test.shape[0]

while True:
    changed = False
    
    if show_steps:
        print('') 

   
    excluded = list(set(X.columns) - set(included))  # list the features to be evaluated
    
    if show_steps:
        print('(Step) Excluded = %s' % ', '.join(excluded))  

   
    for new_column in excluded:  # for each remaining feature to be evaluated
        
        if show_steps:
            print('(Step) Trying %s...' % new_column)
            print('(Step) - Features = %s' % ', '.join(included + [new_column]))

        
        fit = model.fit(X_train[included + [new_column]], y_train) # fit the model with the Training data
        
        r2 = fit.score(X_train[included + [new_column]], y_train) # calculate the score (R^2 for Regression)
        
        k = len(included + [new_column]) # number of predictors in this model
        
        adjusted_r2 = 1 - (((1 - r2) * (n - 1)) / (n - k -1)) # calculate the adjusted R^2

        if show_steps:
            print('(Step) - Adjusted R^2: This = %.3f; Best = %.3f' % 
                  (adjusted_r2, best['a_r2']))

        
        if adjusted_r2 > best['a_r2']: # if model improves
            
            best = {'feature': new_column, 'r2': r2, 'a_r2': adjusted_r2} # record new parameters
            # flag that found a better model
            changed = True
            if show_steps:
                print('(Step) - New Best!   : Feature = %s; R^2 = %.3f; Adjusted R^2 = %.3f' % 
                      (best['feature'], best['r2'], best['a_r2']))
    # END for

    # if found a better model after testing all remaining features
    if changed:
        # update control details
        included.append(best['feature'])
        excluded = list(set(excluded) - set(best['feature']))
        print('Added feature %-4s with R^2 = %.3f and adjusted R^2 = %.3f' % 
              (best['feature'], best['r2'], best['a_r2']))
    else:
        
        print('*' * 100)
        break

print('')
print('Resulting features:')
print(', '.join(included))


(Step) Excluded = red_wine, sulphates, volatile acidity, chlorides, pH, citric acid, alcohol, fixed acidity, total sulfur dioxide, free sulfur dioxide, residual sugar, density
(Step) Trying red_wine...
(Step) - Features = red_wine
(Step) - Adjusted R^2: This = 0.013; Best = 0.000
(Step) - New Best!   : Feature = red_wine; R^2 = 0.014; Adjusted R^2 = 0.013
(Step) Trying sulphates...
(Step) - Features = sulphates
(Step) - Adjusted R^2: This = 0.001; Best = 0.013
(Step) Trying volatile acidity...
(Step) - Features = volatile acidity
(Step) - Adjusted R^2: This = 0.074; Best = 0.013
(Step) - New Best!   : Feature = volatile acidity; R^2 = 0.074; Adjusted R^2 = 0.074
(Step) Trying chlorides...
(Step) - Features = chlorides
(Step) - Adjusted R^2: This = 0.040; Best = 0.074
(Step) Trying pH...
(Step) - Features = pH
(Step) - Adjusted R^2: This = -0.001; Best = 0.074
(Step) Trying citric acid...
(Step) - Features = citric acid
(Step) - Adjusted R^2: This = 0.007; Best = 0.074
(Step) Trying al

(Step) - Adjusted R^2: This = 0.280; Best = 0.280
(Step) - New Best!   : Feature = fixed acidity; R^2 = 0.286; Adjusted R^2 = 0.280
(Step) Trying citric acid...
(Step) - Features = alcohol, volatile acidity, sulphates, residual sugar, red_wine, density, free sulfur dioxide, total sulfur dioxide, chlorides, citric acid
(Step) - Adjusted R^2: This = 0.280; Best = 0.280
Added feature fixed acidity with R^2 = 0.286 and adjusted R^2 = 0.280

(Step) Excluded = pH, citric acid
(Step) Trying pH...
(Step) - Features = alcohol, volatile acidity, sulphates, residual sugar, red_wine, density, free sulfur dioxide, total sulfur dioxide, chlorides, fixed acidity, pH
(Step) - Adjusted R^2: This = 0.283; Best = 0.280
(Step) - New Best!   : Feature = pH; R^2 = 0.289; Adjusted R^2 = 0.283
(Step) Trying citric acid...
(Step) - Features = alcohol, volatile acidity, sulphates, residual sugar, red_wine, density, free sulfur dioxide, total sulfur dioxide, chlorides, fixed acidity, citric acid
(Step) - Adjuste

The ForwardFeature wrapping has shown better score result than mxltend. However, when we compared CV between the 2 function. mxltend has equal score as ForwardFeature. Hence, mxltend works better because it used less predictors compare to ForwardFeature



---



---



> > > > > > > > > © 2019 Institute of Data


---



---



