# Feature Selection Techniques

### Why use feature selection?

Not all features contribute equally to predictions. Feature selection keeps only the most relevant features, which can help by:
- Reducing overfitting
- Improving accuracy
- Reducing training time

### Filter Methods (Using Statistical Tests) 

Filter methods score each feature against the target variable with a statistical test, independently of any model, and keep the highest-scoring features.

Using SelectKBest (Chi-Square Test)

In [31]:
from sklearn.feature_selection import SelectKBest, chi2
import pandas as pd

# Load dataset
data = pd.read_csv("fraud.csv")  # Example dataset
X = data.drop(columns=["Class"])  # Features
y = data["Class"]  # Target

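# chi2 requires numeric, non-negative features (e.g. counts or min-max scaled values),
# so encode categorical columns and rescale negative values before this step
# Apply Chi-Square test to select the top 4 features
selector = SelectKBest(score_func=chi2, k=4)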
X_new = selector.fit_transform(X, y)

# Get selected feature names
selected_features = X.columns[selector.get_support()]
print("Selected Features:", selected_features)


Selected Features: Index(['transaction_amount', 'transaction_time', 'card_present',
       'merchant_category'],
      dtype='object')




**📌 When to use?**

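- When the target variable is categorical (a classification problem) and the features are non-negative, as the chi-square test requires.

- When you want a fast, model-independent way to filter out features that are only weakly related to the target.

**📌 Real-world Use Case:**

- Fraud detection: finding the features most strongly associated with fraudulent transactions.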
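
To make the scoring step concrete, here is a minimal sketch that prints the chi-square score and p-value behind each selection decision. It runs on synthetic data rather than `fraud.csv`: the column names `f0`..`f7`, the sample sizes, and the choice of `MinMaxScaler` to satisfy the non-negativity requirement of `chi2` are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for a real dataset (hypothetical column names f0..f7)
X_arr, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

# chi2 rejects negative values, so rescale every feature to the range [0, 1]
X_scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(X_demo), columns=X_demo.columns
)

# Score every feature against the target and keep the 3 highest scores
kbest = SelectKBest(score_func=chi2, k=3).fit(X_scaled, y_demo)

# A higher score means a stronger dependence between feature and target
report = pd.DataFrame(
    {
        "feature": X_scaled.columns,
        "chi2": kbest.scores_,
        "p_value": kbest.pvalues_,
        "selected": kbest.get_support(),
    }
).sort_values("chi2", ascending=False)
print(report)
```

Looking at the full score table, rather than only the selected names, helps when deciding whether the chosen `k` cuts off features that score almost as well as the ones kept.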

### Wrapper Methods (Using Recursive Feature Elimination - RFE)

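Wrapper methods train a model repeatedly on different feature subsets to find the subset that works best. RFE does this by fitting the model, dropping the least important feature, and repeating until the requested number of features remains.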

Using RFE (Recursive Feature Elimination)

In [44]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

# Initialize the model (fixed seed so the selected features are reproducible)
model = RandomForestClassifier(random_state=42)

# Apply RFE to keep the 4 most important features
rfe = RFE(model, n_features_to_select=4)
X_new = rfe.fit_transform(X, y)

# Get selected feature names
selected_features = X.columns[rfe.support_]
print("Selected Features:", selected_features)


Selected Features: Index(['transaction_amount', 'transaction_time', 'card_present',
       'merchant_category'],
      dtype='object')


**📌 When to use?**

- When the dataset is small enough that refitting the model many times is affordable.

- When you want features chosen by how useful they are to a specific model rather than by a standalone statistical score.

**📌 Real-world Use Case:**

- Loan default prediction: selecting the features that best indicate whether a borrower will default.
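
The elimination order that RFE follows can be inspected through its `ranking_` attribute. The sketch below is illustrative only: it uses synthetic data with hypothetical column names `f0`..`f7`, and swaps in `LogisticRegression` as the estimator because it is fast and deterministic (any estimator exposing `coef_` or `feature_importances_` should work).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a real dataset (hypothetical column names f0..f7)
X_arr, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

# step=1 drops one feature per round, so going from 8 to 3 features
# takes 5 elimination rounds
rfe_demo = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3, step=1)
rfe_demo.fit(X_demo, y_demo)

# ranking_ == 1 marks the kept features; larger values were eliminated earlier
ranking = pd.Series(rfe_demo.ranking_, index=X_demo.columns).sort_values()
print(ranking)
print("Kept:", list(X_demo.columns[rfe_demo.support_]))
```

Because every round refits the model, the cost grows with the number of features to eliminate; increasing `step` removes several features per round at the price of a coarser ranking.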



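### Embedded Methods (Using Lasso Regression - L1 Regularization)

Embedded methods use models that perform feature selection as part of training. Lasso's L1 penalty shrinks the coefficients of weak features to exactly zero, so the features left with non-zero coefficients are the selected ones.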
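
As a rough illustration of how the strength of the L1 penalty controls the number of surviving features, the sketch below fits Lasso at three values of `alpha` on synthetic regression data. The dataset, the three multipliers, and the use of `StandardScaler` are assumptions for the example; `alpha_max` is the smallest penalty at which scikit-learn's Lasso objective sets every coefficient to zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic regression data: 10 features, only 3 of them informative
X_arr, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=1.0, random_state=0
)

# Standardize so the L1 penalty treats every feature on the same scale
X_std = StandardScaler().fit_transform(X_arr)

# Smallest alpha at which every Lasso coefficient becomes zero
alpha_max = np.max(np.abs(X_std.T @ (y_demo - y_demo.mean()))) / len(y_demo)

alphas = [0.01 * alpha_max, 0.1 * alpha_max, 1.01 * alpha_max]
counts = []
for alpha in alphas:
    coef = Lasso(alpha=alpha, max_iter=10_000).fit(X_std, y_demo).coef_
    counts.append(int(np.sum(coef != 0)))  # features that survive the penalty
    print(f"alpha={alpha:.3f}: {counts[-1]} non-zero coefficients")
```

The weakest penalty keeps the most features and the last setting, just above `alpha_max`, removes all of them, which is why `alpha` usually needs tuning (for example by cross-validation) rather than a fixed guess.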

Using Lasso Regression (L1 Regularization)

In [55]:
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

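# Train Lasso model. Lasso is a regression model, so a 0/1 target such as "Class"
# is treated as a continuous value; features should also be on comparable scales,
# because the L1 penalty is sensitive to feature scale
lasso = Lasso(alpha=0.01)  # Small alpha to avoid removing too many features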
lasso.fit(X, y)

# Apply feature selection
selector = SelectFromModel(lasso, prefit=True)
X_new = selector.transform(X)

# Get selected feature names
selected_features = X.columns[selector.get_support()]
print("Selected Features:", selected_features)

Selected Features: Index(['transaction_amount', 'transaction_time', 'merchant_category'], dtype='object')




**📌 When to use?**

- When you have a large dataset with many irrelevant features.

- When you want an automatic way to remove weak features.

**📌 Real-world Use Case:**

- Stock price prediction: removing weak features that don’t contribute to stock price movement.
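
When the target is a class label rather than a continuous value, one possible variant of the same idea is an L1-penalized logistic regression wrapped in `SelectFromModel`. This is a sketch under stated assumptions, not the method used above: the synthetic data, the `liblinear` solver, and `C=0.1` are all illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic classification data: 8 features, only 3 of them informative
X_arr, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_std = StandardScaler().fit_transform(X_arr)

# liblinear supports the L1 penalty; a smaller C means stronger regularization
l1_logit = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)

# SelectFromModel fits the estimator and keeps features with non-negligible weights
l1_selector = SelectFromModel(l1_logit).fit(X_std, y_demo)

mask = l1_selector.get_support()
X_reduced = l1_selector.transform(X_std)
print("Kept feature indices:", np.flatnonzero(mask))
print("Reduced shape:", X_reduced.shape)
```

Here `C` plays the inverse role of Lasso's `alpha`: lowering it removes more features.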

**Which Feature Selection Method Should You Use?**

| Method | Best For | Example Use Case |
|---|---|---|
| **Filter (SelectKBest)** | Large datasets, categorical targets | Fraud detection |
| **Wrapper (RFE)** | Small datasets, best subset search | Loan default prediction |
| **Embedded (Lasso)** | High-dimensional data, automatic selection | Stock price prediction |
