# Wrapper methods


## Suitable for Classification/Regression


### Remove Correlated features

Step forward and step backward feature selection take a long time to run, so to speed them up we first reduce the feature space by removing correlated features.

In [None]:
# Step 1: Define a function to find correlated features
# This line defines a function 'correlation' that takes a dataset and a threshold as inputs.
def correlation(dataset, threshold):

    # Step 2: Initialize an empty set to store correlated column names
    # Create an empty set 'col_corr' to hold the names of correlated columns.
    col_corr = set()

    # Step 3: Compute the correlation matrix of the dataset
    # Calculate the correlation matrix for all numeric features in the dataset.
    corr_matrix = dataset.corr()

    # Step 4: Loop over each column in the correlation matrix
    # Iterate through the columns of the correlation matrix by their index.
    for i in range(len(corr_matrix.columns)):

        # Step 5: Loop over the previous columns
        # For each column, iterate over all columns before it (to avoid repetition).
        for j in range(i):

            # Step 6: Check if the absolute value of correlation exceeds the threshold
            # If the correlation between two columns is greater than the threshold, consider them correlated.
            if abs(corr_matrix.iloc[i, j]) > threshold:

                # Step 7: Get the name of the correlated column
                # Store the name of the correlated column in the 'colname' variable.
                colname = corr_matrix.columns[i]

                # Step 8: Add the correlated column name to the set
                # Add the column name to the set 'col_corr' to avoid duplicates.
                col_corr.add(colname)

    # Step 9: Return the set of correlated features
    # The function returns the set of correlated column names.
    return col_corr

# Step 10: Call the correlation function on the training data with a threshold of 0.8
# 'corr_features' stores the names of columns in 'X_train' that have correlations greater than 0.8.
corr_features = correlation(X_train, 0.8)

# Step 11: Print the number of correlated features
# Display the count of correlated features that will be removed.
print('correlated features: ', len(set(corr_features)))

# Step 12: Remove the correlated features from the training set
# Drop the correlated features from 'X_train'.
X_train.drop(labels=corr_features, axis=1, inplace=True)

# Step 13: Remove the correlated features from the test set
# Drop the same correlated features from 'X_test'.
X_test.drop(labels=corr_features, axis=1, inplace=True)

# Step 14: Check the shapes of the training and test sets after feature removal
# Output the new dimensions of 'X_train' and 'X_test'.
X_train.shape, X_test.shape
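
To see what the filter does without the real dataset, here is a minimal sketch on a toy DataFrame. The function name `correlated_columns`, the column names, and the synthetic data are illustrative assumptions, not part of the notebook. The sketch applies the same rule as `correlation` above (flag a column when it is strongly correlated with any earlier column) using a lower-triangle mask instead of nested loops.

```python
import numpy as np
import pandas as pd


def correlated_columns(df, threshold):
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strictly lower triangle so each pair is checked once;
    # k=-1 excludes the diagonal (every column correlates perfectly with itself).
    mask = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
    lower = corr.where(mask)
    # Row `col` now holds correlations with the columns that come before it.
    return {col for col in lower.index if (lower.loc[col] > threshold).any()}


# Hypothetical toy data: "a_copy" is a noisy duplicate of "a", "b" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=200),
    "b": rng.normal(size=200),
})

to_drop = correlated_columns(toy, 0.8)
print(to_drop)
```

Only the later column of each correlated pair is flagged, so `a` is kept and `a_copy` is dropped. Which member of a pair survives therefore depends on column order.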

### Forward Feature Selection


**Step forward feature selection** starts by training a machine learning model for each feature in the dataset and selecting, as the starting feature, the one that returns the best performing model according to the evaluation criteria we choose.

**In the second step**, it trains a model for every pair made of the feature selected in the previous step plus one additional feature. It selects the pair that produces the best performing model.

It continues by adding one feature at a time to the features selected in previous steps, until a stopping criterion is reached.

Step forward feature selection is called a greedy procedure because at each step it commits to the feature that gives the best immediate result and never revisits earlier choices. It still has to train one model per remaining candidate at every step (single features, then pairs, then triples, and so on). **Therefore, it is computationally expensive and, if the feature space is large enough, sometimes infeasible.**
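
The procedure described above can be written by hand in a few lines. The following is a minimal sketch, not the scikit-learn implementation: the helper name `forward_select`, the synthetic data, and the choice of `LogisticRegression` are assumptions made to keep the example small and fast.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def forward_select(estimator, X, y, n_features, cv=3, scoring="roc_auc"):
    """Greedy forward selection over the columns of a NumPy array X."""
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < n_features:
        # Score every candidate subset that extends the current one by one feature.
        scores = {
            f: cross_val_score(estimator, X[:, selected + [f]], y,
                               cv=cv, scoring=scoring).mean()
            for f in remaining
        }
        best = max(scores, key=scores.get)
        selected.append(best)   # keep the best addition permanently
        remaining.remove(best)  # earlier choices are never revisited
    return selected


# Hypothetical demo data: 6 features, 3 of them informative.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
chosen = forward_select(LogisticRegression(max_iter=1000), X_demo, y_demo, n_features=3)
print(chosen)  # column indices in the order they were added
```

Each pass through the `while` loop fits one cross-validated model per remaining feature, which is where the cost of the method comes from.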

In [None]:
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import roc_auc_score, r2_score
from sklearn.model_selection import train_test_split

from sklearn.feature_selection import SequentialFeatureSelector as SFS

In [None]:
# Step 1: Set up Sequential Feature Selector (SFS)
# Define the SFS with a RandomForestClassifier and its parameters.
sfs = SFS(
    estimator=RandomForestClassifier(n_estimators=10, n_jobs=4, random_state=0),  # RandomForest with 10 trees and parallel processing
    n_features_to_select=10,  # Number of features to retain
    tol=None,  # Stopping criteria (None to ignore performance change)
    direction='forward',  # Step forward selection
    scoring='roc_auc',  # Use roc_auc as the evaluation metric
    cv=2,  # 2-fold cross-validation
    n_jobs=4,  # Enable parallel processing with 4 jobs
)

# Step 2: Fit the SFS model on the training data
# Train the model using 'X_train' and 'y_train'.
sfs = sfs.fit(X_train, y_train)

# Step 3: Get the selected feature names
# Retrieve and store the names of the selected features.
selected_feat = sfs.get_feature_names_out()

# Step 4: Output the selected features
# Display the list of selected features.
selected_feat

### Backward Feature Selection

Step Backward Feature Selection starts by fitting a model using all the features in the data set and determining its performance.

* Then, it trains a model on every subset obtained by leaving out one feature, and removes the feature whose absence hurts performance the least (that is, the feature left out of the best performing of those models).
* In the third step, it repeats this on the features remaining from step 2: it trains a model for every subset with one more feature left out, and again removes the feature whose removal produced the best performing model.
* The algorithm stops when a criterion determined by the user is met. This criterion could be that the model performance does not decrease beyond a certain threshold, or alternatively, as in the mlxtend implementation, that a certain number of selected features is reached (`n_features_to_select` plays the same role in the scikit-learn selector used above).

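Step Backward Feature Selection is called greedy because at each step it permanently removes the feature whose removal looks best at that moment and never reconsiders it. It still has to train one model per remaining feature at every step (on subsets of n-1 features, then n-2, and so on). Therefore, it is computationally expensive and, if the feature space is large enough, sometimes infeasible.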

In [None]:
# To run step backward selection, reuse the SFS call above and change this parameter:
direction='backward',  # Step backward selection
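
For completeness, a self-contained sketch of the full backward call might look like the following. The synthetic data from `make_classification`, the variable names `X_demo`, `y_demo` and `sbs`, and the small feature counts are assumptions chosen so the cell runs quickly. On the real data you would pass `X_train` and `y_train` instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector as SFS

# Hypothetical demo data: 8 features, 4 of them informative.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_redundant=0, random_state=0
)

sbs = SFS(
    estimator=RandomForestClassifier(n_estimators=10, random_state=0),
    n_features_to_select=4,  # stop once 4 features remain
    direction='backward',    # start from all features and remove one at a time
    scoring='roc_auc',
    cv=2,
)
sbs.fit(X_demo, y_demo)

# Boolean mask over the original columns: True marks a retained feature.
mask = sbs.get_support()
print(mask)
```

Backward selection starts from the full feature set, so when only a few features are to be kept it fits more (and larger) models than forward selection does.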

### Exhaustive Search

Exhaustive Feature Selection finds the best subset of features out of all possible subsets, according to a determined performance metric for a certain machine learning algorithm.

Exhaustive Feature Selection is a brute-force algorithm rather than a greedy one: it evaluates every possible feature combination, so the number of models grows exponentially with the number of features. It is very computationally expensive and, if the feature space is large, often infeasible.
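
To get a feel for the cost, the number of candidate subsets can be counted directly. This sketch only does the arithmetic and does not perform any selection. The helper names `exhaustive_fits` and `forward_fits` are illustrative, and the counts ignore cross-validation folds, which multiply both numbers by the same factor.

```python
from math import comb


def exhaustive_fits(n_features):
    """Number of non-empty feature subsets, i.e. 2**n_features - 1."""
    return sum(comb(n_features, k) for k in range(1, n_features + 1))


def forward_fits(n_features, n_select):
    """Models trained by step forward selection when choosing n_select features."""
    # Step 1 tries n_features candidates, step 2 tries n_features - 1, and so on.
    return sum(n_features - step for step in range(n_select))


for n in (10, 20, 30):
    print(n, exhaustive_fits(n), forward_fits(n, n))
# 10 1023 55
# 20 1048575 210
# 30 1073741823 465
```

Even with forward selection run all the way to the full feature set, the gap is already several orders of magnitude at 20 features.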

**Because it is impractical in many cases, I will not include code for this algorithm.**

## Compare performance of feature subsets


In [None]:
# Step 1: Define a function to train and evaluate Random Forests
# This function takes training and test datasets as inputs and evaluates their performance.
def run_randomForests(X_train, X_test, y_train, y_test):

    # Step 2: Initialize the RandomForestClassifier
    # Set up a Random Forest model with 200 trees and a maximum depth of 4.
    rf = RandomForestClassifier(n_estimators=200, random_state=39, max_depth=4)

    # Step 3: Fit the Random Forest model to the training data
    # Train the model using 'X_train' and 'y_train'.
    rf.fit(X_train, y_train)

    # Step 4: Evaluate the model on the training set
    # Predict probabilities for the training data and print roc-auc score.
    print('Train set')
    pred = rf.predict_proba(X_train)
    print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))

    # Step 5: Evaluate the model on the test set
    # Predict probabilities for the test data and print roc-auc score.
    print('Test set')
    pred = rf.predict_proba(X_test)
    print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

In [None]:
# Step 1: Evaluate the Random Forest performance with selected features
# Run the 'run_randomForests' function using only the selected features from 'X_train' and 'X_test'.
run_randomForests(X_train[selected_feat],
                  X_test[selected_feat],
                  y_train, y_test)

# Step 2: Compare the performance with all features
# Run the 'run_randomForests' function using all features in 'X_train' and 'X_test' (excluding previously removed correlated features).
run_randomForests(X_train,
                  X_test,
                  y_train, y_test)