# Correlation

Correlation Feature Selection evaluates subsets of features on the basis of the following hypothesis: "Good feature subsets contain features highly correlated with the target, yet uncorrelated to each other".
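
The hypothesis can be turned into a score. As a minimal sketch, assuming the standard CFS merit heuristic, a subset of `k` features scores `k * r_cf / sqrt(k + k * (k - 1) * r_ff)`, where `r_cf` is the mean absolute feature-target correlation and `r_ff` is the mean absolute feature-feature correlation. The helper names `pearson` and `cfs_merit` and the toy data are illustrative assumptions, not part of this notebook.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def cfs_merit(features, target):
    """CFS merit of a feature subset given as {name: values}."""
    k = len(features)
    columns = list(features.values())
    # Mean absolute correlation between each feature and the target
    r_cf = sum(abs(pearson(col, target)) for col in columns) / k
    # Mean absolute correlation between feature pairs (0 for a single feature)
    pairs = list(combinations(columns, 2))
    r_ff = sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


# Toy data (hypothetical): the target is the sum of two uncorrelated features
a = [1, -1, 1, -1]
b = [1, 1, -1, -1]
a_copy = [2, -2, 2, -2]  # rescaled copy of a, fully redundant
target = [2, 0, 0, -2]

complementary = cfs_merit({"a": a, "b": b}, target)
redundant = cfs_merit({"a": a, "a_copy": a_copy}, target)
print(complementary, redundant)
```

Both subsets have the same mean feature-target correlation, but the subset with uncorrelated features gets the higher merit, which is the behaviour the hypothesis describes.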

**Methods**

* The first one is a brute force function that finds correlated features without any further insight.

* The second procedure finds groups of correlated features, which we can then explore to decide which one we keep and which ones we discard.

**Common Question**

If a group contains several highly correlated features, which one should we keep and which ones should we remove?

One criterion for choosing among the features in the group is to keep those **with less missing data**. Our dataset contains no missing values, so this is not an option here, but keep it in mind when you work with your own datasets.

Alternatively, we could train a machine learning model on each feature in the group and keep the most predictive one.

# Correlation heatmap

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Step 1: Build a correlation matrix for the features in X_train
# Compute the Pearson correlation matrix for all features in X_train, which shows the linear relationships between them
corrmat = X_train.corr(method='pearson')  # Pearson correlation is the default method in pandas' corr function

# Step 2: Set up a color map for the heatmap
# Create a diverging color palette with seaborn to visualize positive and negative correlations clearly
cmap = sns.diverging_palette(220, 20, as_cmap=True)  # Customize the color gradient for the heatmap

# Step 3: Configure the figure size
# Create a figure and axes sized 11x11 inches for better visibility of the heatmap
fig, ax = plt.subplots(figsize=(11, 11))

# Step 4: Plot the correlation matrix using seaborn's heatmap
# Draw the heatmap on the axes created above, using the custom color map
sns.heatmap(corrmat, cmap=cmap, ax=ax)

# Brute Force Method

In the brute force method, the approach is simple: you check every possible pair of features, and if two features are highly correlated (above a predefined threshold like 0.9), you remove one of them.
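
As a rough sketch of this pairwise check (the function name `find_correlated_features` and the toy DataFrame are hypothetical, chosen only for illustration), we can loop over the lower triangle of the absolute correlation matrix and flag the later column of every pair that exceeds the threshold:

```python
import pandas as pd


def find_correlated_features(X, threshold=0.9):
    """Return the set of columns flagged for removal.

    For every pair of columns whose absolute Pearson correlation exceeds
    `threshold`, the column that appears later in the DataFrame is flagged.
    """
    corrmat = X.corr(method="pearson").abs()
    columns = corrmat.columns
    to_drop = set()
    for i in range(len(columns)):
        for j in range(i):
            # Compare each column with every column that precedes it
            if corrmat.iloc[i, j] > threshold:
                to_drop.add(columns[i])
    return to_drop


# Toy data (hypothetical): 'b' is almost a copy of 'a', 'c' is unrelated
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [1.1, 2.0, 2.9, 4.2, 5.0, 6.1],
    "c": [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],
})
dropped = find_correlated_features(toy, threshold=0.9)
print(dropped)
```

This variant keeps whichever feature comes first in the DataFrame and flags a column even when its only correlated partner has itself been flagged, so the result depends on column order.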

The DropCorrelatedFeatures class from Feature-engine does a similar job to the brute force approach that we described earlier.

The `SmartCorrelatedSelection` class lets us select one feature from each correlated group based on model performance, number of missing values, cardinality, or variance.

In [None]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_validate

from feature_engine.selection import DropCorrelatedFeatures, SmartCorrelatedSelection

In [None]:
# Step 1: Set up the DropCorrelatedFeatures selector with the specified parameters
# Initialize the DropCorrelatedFeatures object to identify and remove features
# that have a correlation coefficient above the specified threshold (0.8)
sel = DropCorrelatedFeatures(
    threshold=0.8,      # Features with a Pearson correlation higher than 0.8 will be considered correlated
    method='pearson',   # Use the Pearson correlation method to measure linear relationships between features
    missing_values='ignore'  # Ignore missing values when calculating correlation
)

# Step 2: Fit the selector to the training data to find correlated features
# Analyze the training data and identify pairs or groups of features that are highly correlated
sel.fit(X_train)

# Step 3: Retrieve sets of correlated features identified by the selector
# Access the groups of correlated features; each set contains features that are strongly correlated with each other
sel.correlated_feature_sets_

# Step 4: Check the number of correlated features that will be removed
# Get the count of features that will be dropped; only one feature from each correlated group is kept, others are removed
len(sel.features_to_drop_)

# Step 5: Remove correlated features from the training data
# Drop the correlated features, keeping only one feature from each correlated group
# The result is stored under a new name so that later cells can still use the full feature set
X_train_uncorr = sel.transform(X_train)

# Step 6: Remove correlated features from the test data using the same transformation
# Apply the same transformation to the test dataset, ensuring consistency with the training data
X_test_uncorr = sel.transform(X_test)

# Step 7: Display the shape of the training and test datasets after removing correlated features
X_train_uncorr.shape, X_test_uncorr.shape

# Grouping Method

In the grouping method, instead of removing features pair by pair, you group all the correlated features together and then choose the most representative feature from each group.

This second approach identifies groups of highly correlated features. We can then investigate each group further to decide which feature to keep and which ones to remove.
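
The grouping step can be sketched as a greedy pass over the correlation matrix. This is an illustration only, not necessarily the exact procedure that Feature-engine implements; the function name `group_correlated_features` and the toy DataFrame are hypothetical.

```python
import pandas as pd


def group_correlated_features(X, threshold=0.8):
    """Greedily collect groups of features whose absolute Pearson
    correlation with a seed feature exceeds `threshold`."""
    corrmat = X.corr(method="pearson").abs()
    grouped = set()
    groups = []
    for seed in corrmat.columns:
        if seed in grouped:
            continue  # already assigned to an earlier group
        partners = [
            other for other in corrmat.columns
            if other != seed
            and other not in grouped
            and corrmat.loc[seed, other] > threshold
        ]
        if partners:
            group = [seed] + partners
            groups.append(group)
            grouped.update(group)
    return groups


# Toy data (hypothetical): 'b' tracks 'a', 'd' tracks 'c', 'e' stands alone
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [1.1, 2.0, 2.9, 4.2, 5.0, 6.1],
    "c": [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],
    "d": [6.2, 2.1, 7.9, 2.0, 10.1, 4.0],
    "e": [5.0, 3.0, 1.0, 2.0, 6.0, 4.0],
})
groups = group_correlated_features(toy, threshold=0.8)
print(groups)
```

Each returned group can then be inspected separately, for example by comparing missing data, variance, or single-feature model performance within the group.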

## SmartCorrelatedSelection

**Model Performance**

We will keep one feature from each correlation group based on the performance of a random forest.

### Using Random Forest

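With `selection_method="model_performance"`, the selector trains the estimator on each feature of a correlated group separately and keeps the feature with the best cross-validated performance. The estimator here is a Random Forest, which builds multiple decision trees. Each tree chooses the feature splits that most reduce impurity at each node.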

* **For classification, impurity is often measured using Gini impurity or entropy.**
* **For regression, impurity is measured using variance.**


In [None]:
# Step 1: Set up the RandomForestClassifier with specified parameters
# Initialize a RandomForestClassifier with 10 trees, a fixed random state for reproducibility, and parallel processing with 4 jobs
rf = RandomForestClassifier(
    n_estimators=10,    # Number of trees in the forest
    random_state=20,    # Seed for random number generation to ensure reproducibility
    n_jobs=4,           # Number of parallel jobs to run for fitting the model
)

# Step 2: Set up the SmartCorrelatedSelection selector with the specified parameters
# Initialize SmartCorrelatedSelection to identify and select features based on their correlation and model performance
sel = SmartCorrelatedSelection(
    variables=None,             # Consider all numerical variables for correlation analysis if set to None
    method="pearson",           # Use Pearson correlation to measure linear relationships between features
    threshold=0.8,              # Correlation threshold above which features are considered correlated
    missing_values="raise",     # Raise an error if missing values are encountered
    selection_method="model_performance",  # Select features based on their impact on model performance
    estimator=rf,               # RandomForestClassifier used to evaluate feature subsets
    scoring="roc_auc",          # Scoring metric for model performance evaluation (ROC AUC score)
    cv=3,                       # Number of cross-validation folds for performance evaluation
)

# Step 3: Fit the SmartCorrelatedSelection selector to the training data
# Train the SmartCorrelatedSelection by evaluating features in correlation groups with the RandomForestClassifier
# This process may take some time due to multiple RandomForestClassifier trainings
sel.fit(X_train, y_train)

In [None]:
# Step 4: Examine the performance of the RandomForestClassifier for each feature in the second group of correlated features
# Select the second group of correlated features identified by the SmartCorrelatedSelection
group = sel.correlated_feature_sets_[1]

# Step 5: Evaluate the RandomForestClassifier's performance for each feature in the selected group
# Perform cross-validation for each feature in the group and print the mean ROC AUC score
for f in group:

    model = cross_validate(
        rf,
        X_train[f].to_frame(),   # Convert feature to DataFrame for model fitting
        y_train,
        cv=3,                    # Number of cross-validation folds
        return_estimator=False,  # Do not return the trained models
        scoring='roc_auc',       # Scoring metric for model performance evaluation (ROC AUC score)
    )

    print(f, model["test_score"].mean())  # Print the feature name and its average ROC AUC score

# Step 6: Check if specific features were retained or dropped by the selector
# Check if 'var_28' was retained (not in the list of dropped features)
'var_28' in sel.features_to_drop_

# Check if 'var_5' was dropped (in the list of dropped features)
'var_5' in sel.features_to_drop_

# Check if 'var_75' was dropped (in the list of dropped features)
'var_75' in sel.features_to_drop_

### Using variance

In [None]:
# Step 1: Set up the SmartCorrelatedSelection selector with the specified parameters
# Initialize SmartCorrelatedSelection to identify and select features based on their correlation and variance
sel = SmartCorrelatedSelection(
    variables=None,             # Consider all numerical variables for correlation analysis if set to None
    method="pearson",           # Use Pearson correlation to measure linear relationships between features
    threshold=0.8,              # Correlation threshold above which features are considered correlated
    missing_values="raise",     # Raise an error if missing values are encountered
    selection_method="variance", # Select features based on their variance within correlation groups
    estimator=None,             # No estimator is needed for this selection method
    scoring="roc_auc",          # Only relevant when selection_method="model_performance"
    cv=3,                       # Only relevant when selection_method="model_performance"
)

# Step 2: Fit the SmartCorrelatedSelection selector to the training data
# Train the SmartCorrelatedSelection to identify features in correlation groups based on their variance
sel.fit(X_train, y_train)

# Step 3: Examine the variance of the features in the second group of correlated features
# Select the second group of correlated features identified by the SmartCorrelatedSelection
group = sel.correlated_feature_sets_[1]

# Display the standard deviation (the square root of the variance) of each feature in the selected group
X_train[group].std()

# Step 4: Check if specific features were retained or dropped by the selector
# Check if 'var_28' was retained (not in the list of dropped features)
'var_28' in sel.features_to_drop_

# Check if 'var_5' was dropped (in the list of dropped features)
'var_5' in sel.features_to_drop_

# Check if 'var_75' was dropped (in the list of dropped features)
'var_75' in sel.features_to_drop_


## Pipeline

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import Pipeline

from feature_engine.selection import (
    DropConstantFeatures,
    DropDuplicateFeatures,
    SmartCorrelatedSelection,
)

In [None]:
# Step 1: Set up a pipeline with multiple feature selection methods
# Stack the feature selection steps into a pipeline: remove constant features, duplicate features, and correlated features based on variance
pipe = Pipeline([
    ('constant', DropConstantFeatures(tol=0.998)),    # Remove quasi-constant features with a tolerance of 0.998
    ('duplicated', DropDuplicateFeatures()),          # Remove duplicate features
    ('correlation', SmartCorrelatedSelection(selection_method='variance')),  # Remove correlated features based on variance
])

# Step 2: Fit the pipeline to the training data
# Apply the feature selection pipeline to the training data
pipe.fit(X_train)

# Step 3: Transform the training and test data by removing selected features
# Remove the selected features from both training and test datasets using the fitted pipeline
X_train = pipe.transform(X_train)  # Apply the transformations to the training data
X_test = pipe.transform(X_test)    # Apply the same transformations to the test data

# Step 4: Check the shapes of the transformed datasets (optional)
X_train.shape, X_test.shape  # Display the shapes of the transformed datasets

In [None]:
def run_logistic(X_train, X_test, y_train, y_test):
    # Step 1: Initialize the Logistic Regression model
    # Create a logistic regression model with a fixed random state for reproducibility and a maximum of 500 iterations
    logit = LogisticRegression(random_state=44, max_iter=500)

    # Step 2: Fit the model to the training data
    # Train the logistic regression model using the training data
    logit.fit(X_train, y_train)

    # Step 3: Evaluate the model on the training set
    # Predict probabilities for the training set and compute the ROC-AUC score
    print('Train set')
    pred = logit.predict_proba(X_train)  # Get predicted probabilities for the training data
    print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_train, pred[:, 1])))  # ROC-AUC score for training set

    # Step 4: Evaluate the model on the test set
    # Predict probabilities for the test set and compute the ROC-AUC score
    print('Test set')
    pred = logit.predict_proba(X_test)  # Get predicted probabilities for the test data
    print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))  # ROC-AUC score for test set

In [None]:
# Step 1: Standardize the training data
# Fit a StandardScaler to the training data to standardize features by removing the mean and scaling to unit variance
scaler = StandardScaler().fit(X_train)

# Step 2: Apply the scaling to both the training and test data
# Use the fitted scaler to transform (standardize) both the training and test data before feeding them to the logistic regression model
run_logistic(scaler.transform(X_train),  # Standardize X_train and pass to logistic regression
             scaler.transform(X_test),   # Standardize X_test and pass to logistic regression
             y_train, y_test)            # Pass target variables y_train and y_test as they are

**First Code** (Model Performance Approach): Uses a Random Forest to compare the features within each correlated group by cross-validated model performance (here, the ROC AUC score). It retains the feature with the best score in each group.

**Second Code** (Variance Approach): Uses variance within correlated groups to select features. It retains the feature with the highest variance in each group, as high variance indicates greater differentiation between data points.

**Why Select High Variance Features?**

* A feature with higher variance generally contains more information about differences between data points. For example, if one feature has many unique or varying values, it is more likely to be useful for distinguishing between different outcomes or classes in a model.
* Correlated features often carry similar information. Selecting the feature with the highest variance helps ensure you are keeping the one that best represents the differences in the data, while discarding the others that are largely redundant.
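
The variance rule itself fits in a few lines. The sketch below uses a hypothetical correlated group (the same measurement recorded in two units) and also shows a caveat: variance depends on the unit of measurement, so the comparison is most meaningful when the features in a group are on comparable scales.

```python
import pandas as pd

# Hypothetical correlated group: the same measurement in metres and in centimetres
group = pd.DataFrame({
    "height_m": [1.60, 1.75, 1.82, 1.68, 1.90],
    "height_cm": [160.0, 175.0, 182.0, 168.0, 190.0],
})

# Keep the feature with the largest variance and drop the rest of the group
variances = group.var()
kept = variances.idxmax()
dropped = [col for col in group.columns if col != kept]
print(kept, dropped)

# After standardising, every feature has unit variance,
# so variance alone no longer separates the two columns
standardised = (group - group.mean()) / group.std()
print(standardised.var())
```

Here `height_cm` wins only because of its unit, even though both columns carry the same information.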