# Feature Engineering and Selection

Feature engineering and selection are key steps in a machine learning pipeline: the first derives new inputs from existing columns, and the second keeps only the inputs the model actually relies on. This example builds a synthetic dataset, engineers several features, and then selects among them using the feature importances of a Random Forest.

**Feature Engineering**

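We derive new features from existing ones: `interaction_feature` (the product of `feature_1` and `feature_2`), `binned_feature` (`feature_1` cut into four equal-width bins and then one-hot encoded), and one-hot encoded columns for the categorical `feature_4`.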
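The three transformations can be tried on a tiny frame before running the full example. The sketch below uses a hypothetical three-row DataFrame named `toy` that reuses this example's column names. The `include_lowest=True` argument is an assumption added here so that an exact `0.0` would land in `Q1` rather than becoming `NaN`.

```python
import pandas as pd

# Hypothetical three-row frame that mirrors the column names used in this example
toy = pd.DataFrame({
    "feature_1": [0.10, 0.60, 0.90],
    "feature_2": [2.0, -1.0, 0.5],
    "feature_4": ["A", "C", "B"],
})

# Interaction feature: element-wise product of two numeric columns
toy["interaction_feature"] = toy["feature_1"] * toy["feature_2"]

# Binning: map feature_1 onto four equal-width intervals over [0, 1]
toy["binned_feature"] = pd.cut(
    toy["feature_1"],
    bins=[0, 0.25, 0.5, 0.75, 1],
    labels=["Q1", "Q2", "Q3", "Q4"],
    include_lowest=True,  # assumption: keep an exact 0.0 in Q1
)
print(toy["binned_feature"].tolist())  # ['Q1', 'Q3', 'Q4']

# One-hot encoding: one indicator column per category
encoded = pd.get_dummies(toy, columns=["binned_feature", "feature_4"])
print(encoded.columns.tolist())
```

Each encoded column is an indicator for one category, so a row has exactly one active `binned_feature_*` column and one active `feature_4_*` column.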

**Train-Test Split**

We split the dataset into a training set (80% of the rows) and a test set (20%).

**Train a Model**

We train a Random Forest model on the training set, using all original and engineered features.

**Feature Selection**

We rank features by the impurity-based importances of the fitted model and keep those whose importance exceeds a threshold. In this case, the threshold is $0.05$.
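scikit-learn's `SelectFromModel` can apply the same kind of threshold to an already fitted estimator. The sketch below compares it with manual filtering on hypothetical data (`X_demo`, `y_demo`, and the column names are assumptions for illustration, not part of this example's dataset). One difference to keep in mind: `SelectFromModel` keeps features whose importance is greater than or equal to the threshold, while the manual filter uses a strict comparison, so the two can differ only on an exact tie.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Hypothetical data: only "signal" determines the label, the other columns are noise
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "signal": rng.normal(size=500),
    "noise_1": rng.normal(size=500),
    "noise_2": rng.normal(size=500),
})
y_demo = (X_demo["signal"] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_demo, y_demo)

# Manual selection: keep columns whose importance exceeds the threshold
threshold = 0.05
manual = X_demo.columns[forest.feature_importances_ > threshold].tolist()

# SelectFromModel applies the threshold to the already fitted forest (prefit=True)
selector = SelectFromModel(forest, threshold=threshold, prefit=True)
via_selector = X_demo.columns[selector.get_support()].tolist()

print("Manual:", manual)
print("SelectFromModel:", via_selector)
```

Which noise columns survive depends on the fitted forest, so the printed lists are best read as an illustration of the mechanism rather than a fixed result.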

**Evaluate the Model with Selected Features**

We train a new model using only the selected features and evaluate its performance on the test set.
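A single accuracy figure for the reduced model is easier to interpret next to the accuracy of the model that uses every feature. The helper below is a minimal sketch of that comparison. The function name `compare_full_vs_selected` and the demo data are hypothetical, and the split and model settings are assumed to match the ones used in this example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def compare_full_vs_selected(X, y, threshold=0.05, seed=42):
    """Return (accuracy with all features, accuracy with selected features, selected names)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )

    # Baseline: forest trained on every feature
    full = RandomForestClassifier(n_estimators=100, random_state=seed)
    full.fit(X_train, y_train)
    acc_full = accuracy_score(y_test, full.predict(X_test))

    # Keep only the features whose importance exceeds the threshold
    keep = X.columns[full.feature_importances_ > threshold].tolist()

    # Reduced model: same settings, selected features only
    reduced = RandomForestClassifier(n_estimators=100, random_state=seed)
    reduced.fit(X_train[keep], y_train)
    acc_selected = accuracy_score(y_test, reduced.predict(X_test[keep]))
    return acc_full, acc_selected, keep


# Hypothetical demo data: the label depends on "signal" only
rng = np.random.default_rng(1)
X_demo = pd.DataFrame({
    "signal": rng.normal(size=400),
    "noise": rng.normal(size=400),
})
y_demo = (X_demo["signal"] > 0).astype(int)

acc_full, acc_selected, keep = compare_full_vs_selected(X_demo, y_demo)
print("All features:", acc_full)
print("Selected features:", acc_selected, keep)
```

The importances are computed on the training split only, so the test set plays no part in choosing the features.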

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Create a synthetic dataset
np.random.seed(42)
data = pd.DataFrame({
    'feature_1': np.random.rand(1000),
    'feature_2': np.random.randn(1000),
    'feature_3': np.random.randint(0, 2, size=1000),
    'feature_4': np.random.choice(['A', 'B', 'C'], size=1000),
    'target': np.random.randint(0, 2, size=1000)
})

# Display the first few rows of the dataset
print("Original Dataset:")
print(data.head())

# Separate features and target variable
X = data.drop('target', axis=1)
y = data['target']

# Step 1: Feature Engineering
# Let's create some new features based on existing ones

# Example 1: Interaction feature
X['interaction_feature'] = X['feature_1'] * X['feature_2']

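# Example 2: Binning a numerical feature into four equal-width intervals
# include_lowest=True keeps an exact 0.0 in Q1 instead of turning it into NaN
X['binned_feature'] = pd.cut(
    X['feature_1'],
    bins=[0, 0.25, 0.5, 0.75, 1],
    labels=['Q1', 'Q2', 'Q3', 'Q4'],
    include_lowest=True,
)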

# One-hot encoding the binned_feature
X = pd.get_dummies(X, columns=['binned_feature'], prefix='bin')

# Example 3: One-hot encoding categorical feature
X = pd.get_dummies(X, columns=['feature_4'], prefix='one_hot')

# Display the dataset after feature engineering
print("\nDataset after Feature Engineering:")
print(X.head())

# Step 2: Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Train a model (Random Forest for example)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Step 4: Feature Selection using feature importances
# We can use the feature importances provided by the model to select important features

# Display feature importances
feature_importances = pd.DataFrame({'Feature': X.columns, 'Importance': model.feature_importances_})
feature_importances = feature_importances.sort_values(by='Importance', ascending=False)
print("\nFeature Importances:")
print(feature_importances)

# Select features with importance greater than a threshold (e.g., 0.05)
selected_features = feature_importances[feature_importances['Importance'] > 0.05]['Feature'].tolist()

# Display selected features
print("\nSelected Features:")
print(selected_features)

# Step 5: Evaluate the model with selected features
# Train the model with selected features
model_selected_features = RandomForestClassifier(n_estimators=100, random_state=42)
model_selected_features.fit(X_train[selected_features], y_train)

# Make predictions on the test set
y_pred = model_selected_features.predict(X_test[selected_features])

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("\nModel Accuracy with Selected Features:", accuracy)

Original Dataset:
   feature_1  feature_2  feature_3 feature_4  target
0   0.374540   0.177701          0         B       0
1   0.950714  -1.335344          0         A       0
2   0.731994   0.380198          0         C       0
3   0.598658   0.610586          0         C       1
4   0.156019   0.559790          0         B       1

Dataset after Feature Engineering:
   feature_1  feature_2  feature_3  interaction_feature  bin_Q1  bin_Q2  \
0   0.374540   0.177701          0             0.066556       0       1   
1   0.950714  -1.335344          0            -1.269531       0       0   
2   0.731994   0.380198          0             0.278303       0       0   
3   0.598658   0.610586          0             0.365532       0       0   
4   0.156019   0.559790          0             0.087338       1       0   

   bin_Q3  bin_Q4  one_hot_A  one_hot_B  one_hot_C  
0       0       0          0          1          0  
1       0       1          1          0          0  
2       1       0 

# Conclusion

Feature engineering and selection both shape how well a machine learning model can perform. In this Python example, we started with a synthetic dataset and applied three feature engineering techniques: creating an interaction feature, binning a numerical variable, and one-hot encoding categorical features. We then trained a Random Forest model on the engineered training set and used its feature importances to guide selection. Keeping only the features above an importance threshold gives a smaller and potentially more interpretable model.

Because the target in this dataset is generated at random, independently of every feature, accuracy should stay near chance (about 0.5) with or without selection; the example demonstrates the workflow, not a performance gain. On real data the process is iterative and exploratory: domain knowledge and experimentation refine the set of input features, which can improve accuracy and generalization to new data.