# 05: Feature Engineering and Selection

## Overview
This notebook covers feature engineering and selection techniques essential for machine learning and deep learning.

## Topics Covered:
1. Feature Scaling
2. Feature Encoding
3. Feature Selection Methods
4. Feature Creation
5. Curse of Dimensionality
6. PCA for Feature Extraction

## Interview Focus:
- Understanding core concepts
- Practical implementation
- When to apply each technique
- Common pitfalls and solutions

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print('Libraries imported successfully')

## Key Concepts

This notebook covers:

- Feature scaling (StandardScaler, MinMaxScaler)
- Encoding categorical variables
- Creating polynomial features
- SelectKBest, RFE
- Variance thresholding

## Implementation Examples

The following sections provide practical implementations of the concepts above.

In [None]:
# Example 1: Basic setup and data preparation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification

# Generate sample data
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f'Training samples: {len(X_train)}')
print(f'Test samples: {len(X_test)}')
print(f'Number of features: {X.shape[1]}')

## 1. Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Create sample data
np.random.seed(42)
data = pd.DataFrame({
    'income': np.random.exponential(50000, 1000),
    'age': np.random.randint(18, 80, 1000),
    'credit_score': np.random.randint(300, 850, 1000)
})

print('Original data statistics:')
print(data.describe())

# Standard Scaler (z-score normalization)
scaler_std = StandardScaler()
data_std = pd.DataFrame(scaler_std.fit_transform(data), columns=data.columns)

# Min-Max Scaler
scaler_minmax = MinMaxScaler()
data_minmax = pd.DataFrame(scaler_minmax.fit_transform(data), columns=data.columns)

# Robust Scaler (resistant to outliers)
scaler_robust = RobustScaler()
data_robust = pd.DataFrame(scaler_robust.fit_transform(data), columns=data.columns)

# Visualize scaling effects
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
data['income'].hist(ax=axes[0,0], bins=30, edgecolor='black')
axes[0,0].set_title('Original Income')
data_std['income'].hist(ax=axes[0,1], bins=30, edgecolor='black', color='green')
axes[0,1].set_title('StandardScaler')
data_minmax['income'].hist(ax=axes[1,0], bins=30, edgecolor='black', color='orange')
axes[1,0].set_title('MinMaxScaler')
data_robust['income'].hist(ax=axes[1,1], bins=30, edgecolor='black', color='red')
axes[1,1].set_title('RobustScaler')
plt.tight_layout()
plt.show()

print('\nStandardScaler statistics:')
print(data_std.describe())
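
The three scalers above differ only in the center and spread they use. As a minimal sketch (the variable names are illustrative and not part of the notebook), the NumPy formulas below reproduce each scaler on a single skewed column. It assumes scikit-learn's defaults: the population standard deviation for `StandardScaler` and the 25th to 75th percentile range for `RobustScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

rng = np.random.default_rng(42)
x = rng.exponential(50000, size=(1000, 1))  # one skewed column, similar to 'income'

# StandardScaler: subtract the mean, divide by the population std (ddof=0)
z_manual = (x - x.mean(axis=0)) / x.std(axis=0)

# MinMaxScaler: map the observed minimum to 0 and the observed maximum to 1
mm_manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# RobustScaler: subtract the median, divide by the interquartile range
q25, q75 = np.percentile(x, [25, 75], axis=0)
rb_manual = (x - np.median(x, axis=0)) / (q75 - q25)

# Compare each manual result with the corresponding scikit-learn scaler
std_matches = np.allclose(z_manual, StandardScaler().fit_transform(x))
minmax_matches = np.allclose(mm_manual, MinMaxScaler().fit_transform(x))
robust_matches = np.allclose(rb_manual, RobustScaler().fit_transform(x))
print(std_matches, minmax_matches, robust_matches)
```

Because the median and interquartile range barely move when a few extreme values are added, `RobustScaler` is usually the safer choice for heavy-tailed columns such as income.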

## 2. Encoding Categorical Variables

In [None]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder

# Sample categorical data
df_cat = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'blue', 'green'],
    'size': ['small', 'medium', 'large', 'medium', 'small', 'large'],
    'price': [10, 20, 30, 15, 12, 35]
})

print('Original data:')
print(df_cat)

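# Label Encoding (integer codes in alphabetical order). LabelEncoder is intended
# for target labels; on a nominal feature such as color it implies an order
# that does not exist, so prefer one-hot encoding for model inputs.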
le = LabelEncoder()
df_cat['color_label'] = le.fit_transform(df_cat['color'])
print('\nLabel Encoding:')
print(df_cat[['color', 'color_label']])

# One-Hot Encoding
df_onehot = pd.get_dummies(df_cat, columns=['color', 'size'], prefix=['color', 'size'])
print('\nOne-Hot Encoding:')
print(df_onehot.head())

# Ordinal Encoding (for ordered categories)
size_order = [['small', 'medium', 'large']]
oe = OrdinalEncoder(categories=size_order)
# fit_transform returns a 2D array of shape (n_samples, 1); flatten it for a single column
df_cat['size_ordinal'] = oe.fit_transform(df_cat[['size']]).ravel()
print('\nOrdinal Encoding (size):')
print(df_cat[['size', 'size_ordinal']])

## 3. Feature Selection Methods

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.feature_selection import VarianceThreshold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Generate sample data
X, y = make_classification(n_samples=500, n_features=20, n_informative=10, 
                          n_redundant=5, n_repeated=2, random_state=42)

# Method 1: SelectKBest
selector_kbest = SelectKBest(f_classif, k=10)
X_kbest = selector_kbest.fit_transform(X, y)
print('SelectKBest:')
print(f'Original features: {X.shape[1]}')
print(f'Selected features: {X_kbest.shape[1]}')
print(f'Selected feature indices: {selector_kbest.get_support(indices=True)}')

# Method 2: Recursive Feature Elimination (RFE)
estimator = RandomForestClassifier(n_estimators=50, random_state=42)
selector_rfe = RFE(estimator, n_features_to_select=10, step=1)
X_rfe = selector_rfe.fit_transform(X, y)
print('\nRecursive Feature Elimination:')
print(f'Selected features: {X_rfe.shape[1]}')
print(f'Feature ranking: {selector_rfe.ranking_}')

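# Method 3: Variance Threshold (unsupervised: drops features whose variance is
# below the threshold, so the result depends on each feature's scale)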
selector_var = VarianceThreshold(threshold=0.1)
X_var = selector_var.fit_transform(X)
print('\nVariance Threshold:')
print(f'Features after variance threshold: {X_var.shape[1]}')

# Feature importance from Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)
feature_importance = pd.DataFrame({
    'feature': [f'Feature_{i}' for i in range(X.shape[1])],
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(12, 6))
plt.barh(range(15), feature_importance['importance'].head(15))
plt.yticks(range(15), feature_importance['feature'].head(15))
plt.xlabel('Importance')
plt.title('Top 15 Features by Importance')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
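
The selectors above are fitted on the full dataset, which is fine for a demonstration but leaks label information if the same data is later used to evaluate a model. A minimal sketch of one way around this is to place the selector inside a `Pipeline`, so it is refitted on the training folds of each cross-validation split. The choice of `LogisticRegression` as the downstream model is an assumption made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_redundant=5, n_repeated=2, random_state=42)

# Selection and classification are fitted together on each training fold
pipe = Pipeline([
    ('select', SelectKBest(f_classif, k=10)),
    ('clf', LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=5)
print(f'CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}')
```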

## 4. Creating New Features

In [None]:
from sklearn.preprocessing import PolynomialFeatures

# Polynomial features
X_sample = np.array([[1, 2], [3, 4], [5, 6]])
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_sample)

print('Original features:')
print(X_sample)
print('\nPolynomial features (degree=2):')
print(X_poly)
print('\nFeature names:')
print(poly.get_feature_names_out(['x1', 'x2']))

# Domain-specific feature engineering example
df_dates = pd.DataFrame({
    'date': pd.date_range('2023-01-01', periods=100),
    'sales': np.random.randint(100, 1000, 100)
})

# Extract time-based features
df_dates['year'] = df_dates['date'].dt.year
df_dates['month'] = df_dates['date'].dt.month
df_dates['day'] = df_dates['date'].dt.day
df_dates['dayofweek'] = df_dates['date'].dt.dayofweek
df_dates['is_weekend'] = df_dates['dayofweek'].isin([5, 6]).astype(int)

print('\nTime-based feature engineering:')
print(df_dates.head())
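
Integer time features such as `dayofweek` place Sunday (6) far from Monday (0) even though the two days are adjacent. One common option, sketched here as an illustration rather than as part of the notebook's method, is to map the value onto a circle with sine and cosine. The column names `dow_sin` and `dow_cos` are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': pd.date_range('2023-01-01', periods=14)})
df['dayofweek'] = df['date'].dt.dayofweek  # Monday=0 ... Sunday=6

# Map the 7-day cycle onto the unit circle
angle = 2 * np.pi * df['dayofweek'] / 7
df['dow_sin'] = np.sin(angle)
df['dow_cos'] = np.cos(angle)

print(df.head(7))
```

With this encoding, every pair of consecutive days is the same distance apart, including the Sunday to Monday wrap-around.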

## Best Practices

1. Always split data before preprocessing, and fit scalers, encoders, and selectors on the training set only
2. Use cross-validation for model evaluation
3. Monitor for overfitting/underfitting
4. Document hyperparameters and experiments
5. Start simple, then add complexity
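
The first practice can be sketched as follows: split first, then let a `Pipeline` fit the scaler on the training data and reuse those statistics on the test data. The data mirrors the notebook's setup cell; the `LogisticRegression` model is an assumption chosen for brevity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = Pipeline([
    ('scaler', StandardScaler()),               # fitted on X_train only
    ('clf', LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

# The test set is scaled with the training mean and std, never its own
test_accuracy = model.score(X_test, y_test)
scaler = model.named_steps['scaler']
print(f'Test accuracy: {test_accuracy:.3f}')
print(f'Scaler means match training means: {np.allclose(scaler.mean_, X_train.mean(axis=0))}')
```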

## Interview Questions

### Q1: What are the main concepts in Feature Engineering and Selection?
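**Answer:** Feature scaling, categorical encoding, feature creation (for example polynomial and date-derived features), and feature selection (SelectKBest, RFE, variance thresholding).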

### Q2: When should you apply these techniques?
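**Answer:** Scale features for models that depend on distances or gradient descent (k-NN, SVMs, regularized linear models, neural networks); tree-based models are largely insensitive to feature scale. Encode categorical variables whenever the model needs numeric input. Create features when the raw columns do not capture domain knowledge or nonlinear relationships, and apply feature selection when many features are redundant or uninformative.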

### Q3: What are common challenges?
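**Answer:** Common challenges include data leakage (fitting scalers, encoders, or selectors on data that includes the test set), overfitting from too many engineered features, the computational cost of wrapper methods such as RFE, and the loss of interpretability after transformations.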

## Practice Exercises

1. Implement the core algorithms from scratch
2. Compare performance across different methods
3. Tune hyperparameters systematically
4. Apply to real-world datasets
5. Analyze and interpret results

## Additional Resources

- Scikit-learn documentation
- TensorFlow and PyTorch tutorials
- Relevant research papers
- Online courses and books