# Feature Selection


## Import Needed Libraries

In [14]:
############ GENERAL LIBS ####################
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

############### FILTERING METHOD / CHI-SQUARE ##################
from sklearn.feature_selection import SelectKBest, chi2

############### WRAPPER METHOD / FORWARD SELECTION & BACKWARD ELIMINATION #################
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

################ WRAPPER METHOD / RECURSIVE FEATURE ELIMINATION ###############
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression


## Filter Method

### Using Chi-Square Test

**Step 1:**

**Load Data:**

We use the Iris dataset and split the data into features (X) and target (y).

**Step 2:**

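**Ensure Non-Negative Data:**

The Chi-square test requires non-negative feature values, so we take the absolute value of each feature. The Iris measurements are already non-negative, so this step leaves them unchanged here; it acts as a safeguard rather than a true normalization.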
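
For datasets that do contain negative values, taking absolute values can distort the data, and rescaling is one common alternative. The following is a minimal sketch, assuming min-max scaling is acceptable for the data at hand; the toy array is hypothetical and only illustrates the effect.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data with negative entries (not taken from the Iris dataset)
X_toy = np.array([[-2.0, 1.0],
                  [0.0, 3.0],
                  [4.0, -1.0]])

# Rescale each column to the [0, 1] range so that chi2 can accept it
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_toy)

print("Column minimums after scaling:", X_scaled.min(axis=0))
print("Column maximums after scaling:", X_scaled.max(axis=0))
```

Unlike the absolute value, this keeps the ordering of the values within each feature.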

**Step 3:**

**Apply SelectKBest:**

SelectKBest scores every feature against the target with the Chi-square test and keeps the 2 features with the highest scores.

**Step 4:**

*   **Fit and Transform:** We fit the selector on the data and transform it to keep only the best features.
*   We print the names of the 2 features with the highest Chi-square scores.
*   We split the selected features into training and test sets.





In [4]:
# Load a sample dataset (Iris)
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# Chi-square requires non-negative inputs; Iris values already are, so np.abs is only a safeguard
X_normalized = np.abs(X)

# Select the best 2 features based on Chi-square test
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X_normalized, y)

# Display the selected features
print("Original features:\n", X.columns)
print("\nSelected top 2 features:\n", X.columns[selector.get_support()])

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.3, random_state=42)




Original features:
 Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)'],
      dtype='object')

Selected top 2 features:
 Index(['petal length (cm)', 'petal width (cm)'], dtype='object')
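
To see why these two features were chosen, it can help to inspect the score behind each one. The sketch below assumes the fitted selector's scores_ and pvalues_ attributes are what we want to look at; the name chi2_table is introduced only for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# Fit the selector; Iris features are non-negative, so no preprocessing is needed
selector = SelectKBest(score_func=chi2, k=2).fit(X, y)

# scores_ and pvalues_ hold one entry per input feature
chi2_table = pd.DataFrame(
    {"chi2_score": selector.scores_, "p_value": selector.pvalues_},
    index=X.columns,
).sort_values("chi2_score", ascending=False)

print(chi2_table)
```

The first two rows of the sorted table should correspond to the features that SelectKBest keeps when k=2.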


### Using Correlation

**Step 1:**

**Load Data:**

We load the Iris dataset and split it into **features (X)** and **target (y)**.

**Step 2:**

**Calculate Correlation:**

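*   We compute the Pearson correlation between each feature and the target using **np.corrcoef**. The target here is an integer class label (0, 1, 2), so the coefficient treats the classes as if they were ordered numbers.
*   We take the absolute values of the correlations because we are interested in the strength of the relationship, not its direction.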

**Step 3:**

**Select Features:**


*   We define a correlation threshold (0.5 in this case) and select the features whose absolute correlation exceeds it.
*   We print the selected features.


**Step 4:**

Split the data into training and test sets using only the selected features.

In [2]:
# Load a sample dataset (Iris)
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target)

# Compute correlation of each feature with the target
correlations = X.apply(lambda feature: np.abs(np.corrcoef(feature, y)[0, 1]))

# Select features whose absolute correlation exceeds a threshold (0.5 is chosen for this example)
threshold = 0.5
selected_features = correlations[correlations > threshold].index
X_selected = X[selected_features]

# Display the selected features based on correlation
print("Original features:\n", X.columns)
print("\nSelected features with correlation > 0.5:\n", selected_features)

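# Split data into train and test sets using the selected features
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.3, random_state=42)

# Report the shapes of the resulting splits
print("\nShape of X_train with selected features:", X_train.shape)
print("Shape of X_test with selected features:", X_test.shape)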


Original features:
 Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)'],
      dtype='object')

Selected features with correlation > 0.5:
 Index(['sepal length (cm)', 'petal length (cm)', 'petal width (cm)'], dtype='object')

Shape of X_train with selected features: (105, 3)
Shape of X_test with selected features: (45, 3)
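
Taking absolute values hides the direction of each relationship. As a complementary view, the sketch below keeps the sign, assuming pandas' DataFrame.corrwith is an acceptable shortcut for the per-feature np.corrcoef call; the names signed_corr and strong are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target)

# Pearson correlation of every feature column with the target, sign preserved
signed_corr = X.corrwith(y)
print(signed_corr)

# Apply the same 0.5 threshold to the absolute values
strong = signed_corr[signed_corr.abs() > 0.5].index.tolist()
print("\nFeatures above the threshold:", strong)
```

A negative value means the feature tends to decrease as the class label increases, which the absolute-value version cannot show.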


## Wrapper Method

### Forward Selection

In this example, we use scikit-learn's LinearRegression model and pick features by their R-squared score. Note that the Iris target is a class label (0, 1, 2), which the regression treats as a numeric value.

**Step 1:**

**Load Data:**

We load the Iris dataset and split it into features (X) and target (y).

**Step 2:**

**Train-Test Split:**

We split the data into training and testing sets to evaluate the model on unseen data.

**Step 3:**

**Initialize Forward Selection:**

We start with an empty list of selected_features and a list of remaining_features.

**Step 4:**

**Evaluate Model Performance:**

For each iteration, we:


*   Temporarily add each remaining feature, one at a time, to the selected set.
*   Train a LinearRegression model on each candidate set.
*   Evaluate each model using the R-squared score on the test set.

**Step 5:**

**Select the Best Feature:**


*   We select the feature that leads to the highest R-squared score.
*   We add it to selected_features.
*   We remove it from remaining_features.


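**Output:** After all iterations, we print the final set of selected features. Because the loop has no stopping rule, it runs until every feature has been added, so the final list is an ordering of all features rather than a reduced subset. In the run shown below, the test R-squared peaks after the second feature.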

In [6]:
# Load a sample dataset (Iris)
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target


In [7]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


In [8]:
# Initialize the model
model = LinearRegression()


In [9]:
# Forward selection: start with no features
selected_features = []
remaining_features = list(X.columns)
best_r2 = 0  # Initialize best R-squared score

# Perform forward selection: one feature is added per pass until none remain
for _ in range(len(remaining_features)):
    r2_scores = []
    for feature in remaining_features:
        # Try adding each remaining feature to the selected set
        features_to_try = selected_features + [feature]
        model.fit(X_train[features_to_try], y_train)
        y_pred = model.predict(X_test[features_to_try])
        r2_scores.append(r2_score(y_test, y_pred))

    # Select the feature that gives the best R-squared score
    best_feature_idx = np.argmax(r2_scores)
    best_feature = remaining_features[best_feature_idx]
    best_r2 = r2_scores[best_feature_idx]

    # Add the best feature to the selected features
    selected_features.append(best_feature)
    remaining_features.remove(best_feature)

    # Print the feature added and the updated R-squared score
    print(f"Added feature: {best_feature}, R-squared score: {best_r2}")

# Final selected features
print("\nFinal selected features:", selected_features)

Added feature: petal width (cm), R-squared score: 0.9456393705719633
Added feature: sepal width (cm), R-squared score: 0.9461771742063361
Added feature: petal length (cm), R-squared score: 0.9423839660334823
Added feature: sepal length (cm), R-squared score: 0.9442318571467434

Final selected features: ['petal width (cm)', 'sepal width (cm)', 'petal length (cm)', 'sepal length (cm)']
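
The loop above always ends with all four features. A variant that actually reduces the feature set needs a stopping rule. The following is a minimal sketch, assuming we stop when the gain in cross-validated R-squared on the training data falls below a threshold; min_gain and its value are hypothetical choices, not part of the original example, and using cross-validation instead of the test set is a swapped-in evaluation strategy.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

min_gain = 0.001  # hypothetical threshold: smallest improvement worth keeping
selected, remaining = [], list(X.columns)
best_score = -np.inf

while remaining:
    # Mean 5-fold cross-validated R-squared for each candidate feature set
    scores = {
        feature: cross_val_score(
            LinearRegression(), X_train[selected + [feature]], y_train, cv=5, scoring="r2"
        ).mean()
        for feature in remaining
    }
    candidate = max(scores, key=scores.get)

    # Stop once the best candidate no longer improves the score enough
    if scores[candidate] - best_score < min_gain:
        break

    best_score = scores[candidate]
    selected.append(candidate)
    remaining.remove(candidate)
    print(f"Added feature: {candidate}, CV R-squared: {best_score:.4f}")

print("\nFinal selected features:", selected)
```

Scoring candidates on the training data keeps the test set untouched, so it can still be used for an unbiased final evaluation.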


### Backward Elimination

**Step 1:**

**Load Data:**

We load the Iris dataset and split it into features (X) and target (y).

**Step 2:**

**Train-Test Split:**

We split the data into training and testing sets to evaluate model performance on unseen data.

**Step 3:**

**Initialize Backward Elimination:**

We start with all features (selected_features initialized to all feature names).

**Step 4:**

**Train Model with All Features:**

Initially, we train a LinearRegression model using all features and record the R-squared score.

**Step 5:**

**Iterative Feature Removal:**

*   In each iteration, we remove one feature at a time and evaluate the model's performance without it.
*   We compare the resulting R-squared score with the current best score.
*   The feature whose removal leaves the highest R-squared score is the one contributing least to the model.
*   If removing that feature keeps the R-squared score the same or improves it, we remove it permanently.
*   We repeat the process on the remaining features, re-evaluating the model after each removal.

**Step 6:**

**Stop Criterion:**

The process stops as soon as every possible removal would lower the R-squared score, or when only one feature remains.


In [10]:
# Load a sample dataset (Iris)
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target


In [11]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


In [12]:
# Initialize the model
model = LinearRegression()


In [13]:
# Backward elimination: start with all features
selected_features = list(X.columns)
best_r2 = 0  # Initialize the best R-squared score

# Train the model with all features initially
model.fit(X_train[selected_features], y_train)
y_pred = model.predict(X_test[selected_features])
best_r2 = r2_score(y_test, y_pred)
print(f"Initial R-squared score with all features: {best_r2}")

# Perform backward elimination
while len(selected_features) > 1:
    r2_scores = []

    # Test removing each feature and train the model
    for feature in selected_features:
        features_to_try = selected_features.copy()
        features_to_try.remove(feature)
        model.fit(X_train[features_to_try], y_train)
        y_pred = model.predict(X_test[features_to_try])
        r2_scores.append(r2_score(y_test, y_pred))

    # Find the most expendable feature: the one whose removal leaves the highest R-squared
    worst_feature_idx = np.argmax(r2_scores)
    worst_feature = selected_features[worst_feature_idx]

    # Remove it only if R-squared does not drop below the current best
    if r2_scores[worst_feature_idx] >= best_r2:
        best_r2 = r2_scores[worst_feature_idx]
        print(f"Removed feature: {worst_feature}, New R-squared score: {best_r2}")
        selected_features.remove(worst_feature)
    else:
        break  # Stop if removing features worsens the performance

# Final selected features
print("\nFinal selected features:", selected_features)


Initial R-squared score with all features: 0.9442318571467434
Removed feature: sepal width (cm), New R-squared score: 0.9446990399987059

Final selected features: ['sepal length (cm)', 'petal length (cm)', 'petal width (cm)']
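
scikit-learn also ships a built-in selector for this kind of search. The sketch below assumes scikit-learn 0.24 or later, where SequentialFeatureSelector is available, and asks it to keep 3 features so the result is comparable in size to the manual run. It scores candidates by shuffled 5-fold cross-validation on the training data rather than on the test set, so it is not guaranteed to keep the same features.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, train_test_split

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Backward search: start from all features and drop one per step
sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=3,
    direction="backward",
    scoring="r2",
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
)
sfs.fit(X_train, y_train)

# get_support() returns a boolean mask over the input columns
kept = list(X.columns[sfs.get_support()])
print("Features kept by SequentialFeatureSelector:", kept)
```

Setting direction="forward" in the same call would perform forward selection instead.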


### Recursive Feature Elimination (RFE)

**Step 1:**

**Load Data:**

We load the Iris dataset and split it into features (X) and target (y).

**Step 2:**

**Train-Test Split:**

We split the data into training and testing sets to evaluate model performance on unseen data.

**Step 3:**

**Initialize the Model:**

We use a LogisticRegression model, but any estimator that exposes coefficients (coef_) or feature importances (feature_importances_) after fitting can be used.

**Step 4:**

**Apply RFE:**

* We initialize RFE and specify n_features_to_select=2 to select the top 2 most important features.
* RFE recursively eliminates the least important feature at each iteration based on the model's learned weights or feature importance.
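
The elimination loop described above can be sketched by hand. This is an illustrative approximation, not scikit-learn's actual implementation: it assumes a feature's importance is the sum of its squared coefficients across the classes, and the names features, importance, and weakest are introduced only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

features = list(X.columns)
while len(features) > 2:
    # Refit the model on the features that are still in play
    clf = LogisticRegression(max_iter=1000).fit(X_train[features], y_train)

    # Assumed importance: squared coefficients summed over the classes
    importance = (clf.coef_ ** 2).sum(axis=0)

    # Drop the single least important feature, then repeat
    weakest = features[int(np.argmin(importance))]
    print(f"Dropping feature: {weakest}")
    features.remove(weakest)

print("Remaining features:", features)
```

The model is refit after every removal because the coefficients of the remaining features change once a feature is dropped.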

**Step 5:**

**Feature Selection:**

After fitting, the boolean mask rfe.support_ marks each selected feature with True.

**Step 6:**

**Train on Selected Features:**

The model is trained using only the selected features, and the accuracy is evaluated on the test set.



In [20]:
# Load a sample dataset (Iris)
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the model
model = LogisticRegression(max_iter=200)

# Initialize RFE and select the top 2 features
rfe = RFE(estimator=model, n_features_to_select=2)
rfe.fit(X_train, y_train)

# Display the results
print("Selected features:", X.columns[rfe.support_])

# Use the selected features to train the model
X_train_selected = X_train[X.columns[rfe.support_]]
X_test_selected = X_test[X.columns[rfe.support_]]

# Train the model on the selected features
model.fit(X_train_selected, y_train)

# Evaluate the model
accuracy = model.score(X_test_selected, y_test)
print(f"Model accuracy with selected features: {accuracy:.4f}")


Selected features: Index(['petal length (cm)', 'petal width (cm)'], dtype='object')
Model accuracy with selected features: 1.0000
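
Besides the boolean mask, a fitted RFE object records the order in which features were eliminated. The sketch below assumes we want that order as a labeled table; the name ranking is illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

rfe = RFE(estimator=LogisticRegression(max_iter=200), n_features_to_select=2)
rfe.fit(X_train, y_train)

# ranking_ is 1 for selected features; larger numbers were eliminated earlier
ranking = pd.Series(rfe.ranking_, index=X.columns).sort_values()
print(ranking)
```

With four features and two selected, the ranks are 1, 1, 2, and 3, where rank 3 marks the first feature that was dropped.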
