# Feature Selection Codealong

In this notebook we will try four methods of feature selection:
* Filtering by low multicollinearity
* Selecting via Permutation Importance
* Using SelectFromModel in a Pipeline
* Applying SequentialFeatureSelector to test many models and find the best combination of features.

The data is the engineered data we created in the last lecture.  However, instead of PCA, we will try some feature selection methods.  

The target, Grade, has been binned to create a classification target indicating which students will pass the exam.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_selection import SelectFromModel, SequentialFeatureSelector
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.inspection import permutation_importance

from sklearn import set_config
set_config(transform_output='pandas')

import joblib
pd.set_option('display.max_columns', None)

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay, classification_report
def classification_metrics(y_true, y_pred, label="",
                           output_dict=False, figsize=(8,4),
                           normalize='true', cmap='Blues',
                           colorbar=False):
  # Get the classification report
  report = classification_report(y_true, y_pred)
  ## Print header and report
  header = "-"*70
  print(header, f" Classification Metrics: {label}", header, sep='\n')
  print(report)
  ## CONFUSION MATRICES SUBPLOTS
  fig, axes = plt.subplots(ncols=2, figsize=figsize)
  # create a confusion matrix  of raw counts
  ConfusionMatrixDisplay.from_predictions(y_true, y_pred,
                normalize=None, cmap='gist_gray', colorbar=colorbar,
                ax = axes[0],);
  axes[0].set_title("Raw Counts")
  # create a confusion matrix with the test data
  ConfusionMatrixDisplay.from_predictions(y_true, y_pred,
                normalize=normalize, cmap=cmap, colorbar=colorbar,
                ax = axes[1]);
  axes[1].set_title("Normalized Confusion Matrix")
  # Adjust layout and show figure
  fig.tight_layout()
  plt.show()
  # Return dictionary of classification_report
  if output_dict==True:
    report_dict = classification_report(y_true, y_pred, output_dict=True)
    return report_dict
    
    
    
def evaluate_classification(model, X_train, y_train, X_test, y_test,
                         figsize=(6,4), normalize='true', output_dict = False,
                            cmap_train='Blues', cmap_test="Reds",colorbar=False):
  # Get predictions for training data
  y_train_pred = model.predict(X_train)
  # Call the helper function to obtain classification metrics for training data
  results_train = classification_metrics(y_train, y_train_pred, #verbose = verbose,
                                     output_dict=True, figsize=figsize,
                                         colorbar=colorbar, cmap=cmap_train,
                                     label='Training Data')
  print()
  # Get predictions for test data
  y_test_pred = model.predict(X_test)
  # Call the helper function to obtain classification metrics for test data
  results_test = classification_metrics(y_test, y_test_pred, #verbose = verbose,
                                  output_dict=True,figsize=figsize,
                                         colorbar=colorbar, cmap=cmap_test,
                                    label='Test Data' )
  if output_dict == True:
    # Return the train and test reports in a dictionary if output_dict is True
    results_dict = {'train':results_train,
                    'test': results_test}
    return results_dict



In [None]:
loaded = joblib.load('../Lecture 1/Data/engineered_student_data.joblib')
loaded.keys()

In [None]:
X_train = loaded['X_train']
X_test = loaded['X_test']
y_train = loaded['y_train']
y_test = loaded['y_test']
preprocessor = loaded['columntransformer']

X_train.head()

# Process the Data

In [None]:
X_train_proc = preprocessor.fit_transform(X_train)
X_test_proc = preprocessor.transform(X_test)
X_train_proc.shape

# Base Model

We will create a base model on all data to compare.

In [None]:
## Create and fit the initial model
rf_base = RandomForestClassifier(random_state=42)
rf_base.fit(X_train_proc, y_train)

In [None]:
%%time
## Evaluate the initial model
evaluate_classification(rf_base, X_train_proc, y_train, X_test_proc, y_test)

# Filter Method: Multi-collinearity

In this section we will filter out features that are highly correlated with one another, keeping the features most related to the target.

1. We will inspect the correlations between the training features.  We will use only training data to avoid peeking at the test data.
2. We will choose a correlation threshold above which two features count as collinear.
3. We will use a selector to keep the features more related to the target and drop the features that are too highly correlated with the ones it keeps.
4. We will fit a new model on only the less collinear features and evaluate it.

In [None]:
# Visualize Correlations

plt.figure(figsize=(12,12))
sns.heatmap(X_train_proc.corr(), cmap='coolwarm');

We are seeing some high correlations, especially among the one-hot encoded columns.  We will set a threshold of 0.7 to limit multicollinearity.

In [None]:
## Import Some new packages
from collinearity import SelectNonCollinear
from sklearn.feature_selection import f_classif

`SelectNonCollinear` is a class that selects the features more strongly related to the target while dropping features that are too correlated with the ones it keeps.

`f_classif` is a scoring function from sklearn that measures the relationship between each feature and a categorical target.  It uses ANOVA F-tests to determine this.
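To make the idea concrete, here is a minimal sketch of a greedy non-collinear filter built only from pandas and `f_classif`. It is a hand-rolled stand-in, not the `collinearity` package's implementation, and the synthetic data, the `greedy_non_collinear` helper, and all `*_demo` names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif

# Synthetic stand-in data (hypothetical; not the student dataset)
X_arr, y_demo = make_classification(n_samples=300, n_features=6, n_informative=3,
                                    n_redundant=0, random_state=42)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(6)])

# Add a near-duplicate column so there is obvious collinearity to remove
noise = np.random.default_rng(0).normal(0, 0.01, len(X_demo))
X_demo["feat_0_copy"] = X_demo["feat_0"] + noise

def greedy_non_collinear(X, y, threshold=0.7):
    """Keep the highest-scoring features whose absolute correlation with
    every already-kept feature is below the threshold."""
    # ANOVA F-statistic of each feature against the target
    f_stats, _ = f_classif(X, y)
    ranked = pd.Series(f_stats, index=X.columns).sort_values(ascending=False)
    corr = X.corr().abs()
    kept = []
    for col in ranked.index:
        if all(corr.loc[col, k] < threshold for k in kept):
            kept.append(col)
    return kept

kept_cols = greedy_non_collinear(X_demo, y_demo, threshold=0.7)
X_demo_non_col = X_demo[kept_cols]
print(f"Kept {len(kept_cols)} of {X_demo.shape[1]} features: {kept_cols}")
```

Whichever of `feat_0` and `feat_0_copy` has the higher F-statistic is considered first, so at most one of the pair can survive the filter.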

In [None]:
## Set a correlation threshold

## Instantiate the non-collinear selector


## Fit Selector


## Use selector to subset the columns



## Fit a model with less collinear features.


In [None]:
## Create and fit a model on the less collinear features



In [None]:
print(f'We reduced dimensionality by {X_train_proc.shape[1] - X_train_non_col.shape[1]}')

In [None]:
%%time
## Evaluate the non-collinear model




In [None]:
print(f'We reduced the number of features by {X_train_proc.shape[1] - X_train_non_col.shape[1]}')

# Embedded Method: Permutation Importance

<font color='red'> You will need to do this on Project 4 Part 1 </font>

In this section we will:
1. Create and fit an initial model
2. Determine feature importances using `permutation_importance()`
3. Create a Series using the discovered importances
4. Create a filter out of the Series using a chosen threshold
5. Use that filter to select which features to keep.
6. Fit a new model using the selected features.
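The six steps above can be sketched end to end on synthetic data. This is only an illustration under stated assumptions: the data, the `*_demo` names, and the choice of the mean importance as the threshold are all hypothetical, and a different threshold may suit the student data better.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical; not the student dataset)
X_arr, y_demo = make_classification(n_samples=400, n_features=8, n_informative=3,
                                    n_redundant=0, random_state=42)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(8)])
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# 1. Create and fit an initial model
rf_demo = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

# 2. Determine feature importances on the training data only
result = permutation_importance(rf_demo, X_tr, y_tr, n_repeats=5, random_state=42)

# 3. Create a Series of the mean importances, indexed by feature name
perm_importances = pd.Series(result.importances_mean, index=X_tr.columns)

# 4. Create a boolean filter from a chosen threshold (the mean is an arbitrary choice)
keep = perm_importances > perm_importances.mean()

# 5. Use the filter to select which features to keep
X_tr_sel = X_tr.loc[:, keep]
X_te_sel = X_te.loc[:, keep]

# 6. Fit a new model using only the selected features
rf_demo_sel = RandomForestClassifier(random_state=42).fit(X_tr_sel, y_tr)
print(f"Kept {X_tr_sel.shape[1]} of {X_tr.shape[1]} features")
print(f"Test accuracy on selected features: {rf_demo_sel.score(X_te_sel, y_te):.3f}")
```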

In [None]:
## Calculate feature importances


## Create a Series of Feature Importances



In [None]:
## Plot the importances



In [None]:
## Create a filter based on a threshold



## Use the filter to select features to keep



## Fit a new model just on the more important features.

In [None]:
## Create and fit a new model on only important features



In [None]:
%%time
##Evaluate the model using permutation importance selected data
evaluate_classification(rf_perm_sel, X_train_perm_sel, y_train, X_test_perm_sel, y_test)

In [None]:
print(f'We reduced the number of features by {X_train_proc.shape[1] - X_train_perm_sel.shape[1]}')

# Embedded Method: Importance or Coefficients using `SelectFromModel` in a Pipeline

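* This works with any model that exposes coefficients (`coef_`) or feature importances (`feature_importances_`) after fitting, such as linear and tree-based models.

* `SelectFromModel` uses the inherent coefficients or feature importances of a model.  Because it has a `.transform()` method, it can be used in a pipeline!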
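As a rough sketch of how the selector slots into a pipeline, the example below places `SelectFromModel` between a scaler and a final model. The synthetic data, the `demo_pipe` name, the `StandardScaler` standing in for the real preprocessor, and the `"median"` threshold are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical; not the student dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, n_informative=3,
                                     n_redundant=0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# The selector fits its own model and keeps features whose importance
# is at or above the median importance
selector = SelectFromModel(RandomForestClassifier(random_state=42), threshold="median")

# Preprocessing -> feature selection -> final model
demo_pipe = make_pipeline(StandardScaler(), selector,
                          RandomForestClassifier(random_state=42))
demo_pipe.fit(X_tr, y_tr)

# make_pipeline names each step after its lowercased class name
n_kept = demo_pipe.named_steps["selectfrommodel"].get_support().sum()
print(f"Kept {n_kept} of {X_tr.shape[1]} features")
print(f"Test accuracy: {demo_pipe.score(X_te, y_te):.3f}")
```

Because the selection happens inside the pipeline, the pipeline is fit and evaluated on the unprocessed features, and the same columns are dropped automatically at prediction time.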

In [None]:
## instantiate the selector with a model.


## Put it in a pipeline between a preprocessor and another model



## Fit the pipeline



In [None]:
## Evaluate the pipeline model
evaluate_classification(sel_pipe, X_train, y_train, X_test, y_test)

# Wrapper Method: `SequentialFeatureSelector` Class

In this section we will use a class that will fit many models with many combinations of features and see which combination is best.  This is simple to code, but can take a very long time!

1. Instantiate and fit the SequentialFeatureSelector class.  We will use the base RandomForestClassifier we made earlier for this.
2. Extract the features that the class suggests that we keep and use them to filter our data
3. Fit and evaluate a new model on just those features.


**In all cases, the general flow is to identify the features to keep, subset the dataframe, then fit and evaluate a model on those features.**
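A small sketch of that flow with `SequentialFeatureSelector` follows. The synthetic data, the `*_demo` names, and the settings (`n_features_to_keep = 3`, forward selection, 3-fold cross-validation, a 25-tree forest to keep it quick) are assumptions chosen for speed, not recommendations.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector

# Synthetic stand-in data (hypothetical; not the student dataset)
X_arr, y_demo = make_classification(n_samples=200, n_features=6, n_informative=3,
                                    n_redundant=0, random_state=42)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(6)])

# Decide on a number of features to keep (arbitrary for this sketch)
n_features_to_keep = 3

# Forward selection: start with no features and add the one that most
# improves the cross-validated score, until the target count is reached
sfs_demo = SequentialFeatureSelector(
    RandomForestClassifier(n_estimators=25, random_state=42),
    n_features_to_select=n_features_to_keep,
    direction="forward",
    cv=3,
)
sfs_demo.fit(X_demo, y_demo)

# Boolean mask of the features the selector suggests keeping
features_to_keep = sfs_demo.get_support()
X_demo_sel = X_demo.loc[:, features_to_keep]

# Fit a new model on just those features
rf_demo_sel = RandomForestClassifier(random_state=42).fit(X_demo_sel, y_demo)
print(f"Selected features: {list(X_demo_sel.columns)}")
```

Each added feature requires refitting the model once per remaining candidate and per cross-validation fold, which is why this approach becomes slow on wide datasets.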

In [None]:
## Decide on a number of features to keep


## Instantiate the feature selector



## Fit the feature selector



In [None]:
## Extract the features suggested by the selector


## Use the filter to subset the features.



In [None]:
## Instantiate and fit a new model on just the features suggested by the selector



In [None]:
%%time
## Evaluate the model
evaluate_classification(rf_selected, X_train_sel, y_train, X_test_sel, y_test)

In [None]:
print(f'We reduced the dimensionality of the feature set by {X_train_proc.shape[1] - X_train_sel.shape[1]}')

# Summary

In this notebook we implemented four methods for selecting features:

1. Selecting based on multicollinearity of features
2. Selecting based on the permutation importance of each feature
3. Selecting based on model coefficients or importances using `SelectFromModel` in a pipeline
4. Selecting based on the suggestions of the scikit-learn wrapper class `SequentialFeatureSelector`.

In all cases we were able to reduce the number of features without significantly hurting the model metrics.