## Phase 5: Feature Engineering & Selection

### What is Feature Engineering?

Feature engineering is the process of transforming raw data into input variables (features) that make the underlying patterns easier for a model to learn.

"Applied machine learning is basically feature engineering." - widely attributed to Andrew Ng

### Why Feature Engineering Matters

```
Good Features + Simple Model often beats Poor Features + Complex Model

Illustrative example (numbers are for illustration only):
Raw feature:        transaction_amount → Model accuracy 72%
Engineered feature: customer_avg_transaction / customer_total_spent → Model accuracy 85%

The ratio normalizes each transaction by the customer's overall spending,
so it can be more informative than the raw amount.
```

### Feature Engineering Techniques

**1. Encoding Categorical Variables**


In [None]:
import pandas as pd
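from sklearn.preprocessing import OrdinalEncoder

df = cleaned_data.copy()

# Technique 1: Ordinal Encoding (for ordinal categories)
# Use when: Categories have order (Low, Medium, High)
# LabelEncoder sorts classes alphabetically (High=0, Low=1, Medium=2) and is
# meant for target labels, so state the order explicitly with OrdinalEncoder.
oe = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
df['risk_level_encoded'] = oe.fit_transform(df[['risk_level']]).ravel()
# Low=0, Medium=1, High=2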

# Technique 2: One-Hot Encoding (for nominal categories)
# Use when: Categories have no order (USA, UK, Canada)
account_type_dummies = pd.get_dummies(df['account_type'], prefix='account_type')
df = pd.concat([df, account_type_dummies], axis=1)
# Creates: account_type_basic, account_type_premium, account_type_vip

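# Technique 3: Target Encoding (for high-cardinality)
# Use when: Many categories (100+ cities), need to reduce dimensionality
# Caution: this uses the target, so compute the means on training data only
# (or out-of-fold); encoding with the full dataset leaks the label.
target_encoding = df.groupby('city')['churned'].mean()
df['city_churn_rate'] = df['city'].map(target_encoding)
# Each city replaced by its observed churn rate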

# Technique 4: Frequency Encoding
# Use when: Frequency of category is informative
frequency_encoding = df['country'].value_counts() / len(df)
df['country_frequency'] = df['country'].map(frequency_encoding)
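
Target encoding computed on the full dataset lets each row see its own label. One common way to limit this is out-of-fold encoding, where each row is encoded with means learned from the other folds. The sketch below is a minimal illustration under stated assumptions: the toy DataFrame and the helper name `out_of_fold_target_encode` are hypothetical and not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import KFold


def out_of_fold_target_encode(df, cat_col, target_col, n_splits=5, seed=42):
    """Hypothetical helper: encode cat_col using target means from the other folds."""
    encoded = pd.Series(index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, apply_idx in kf.split(df):
        fit_part = df.iloc[fit_idx]
        # Per-category target mean, learned without the rows being encoded
        fold_means = fit_part.groupby(cat_col)[target_col].mean()
        # Categories unseen in the fitting folds fall back to the fold's overall mean
        fallback = fit_part[target_col].mean()
        encoded.iloc[apply_idx] = (
            df.iloc[apply_idx][cat_col].map(fold_means).fillna(fallback).to_numpy()
        )
    return encoded


# Toy data (made up for illustration)
toy = pd.DataFrame({
    'city': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'],
    'churned': [1, 0, 1, 1, 0, 0, 1, 0, 1, 0],
})
toy['city_churn_rate'] = out_of_fold_target_encode(toy, 'city', 'churned')
print(toy)
```

Rows from the same city can receive different encoded values, because each fold sees a slightly different subset of that city's labels.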


**2. Scaling and Normalizing Numeric Features**


In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Technique 1: Standardization (z-score normalization)
# Formula: x_scaled = (x - mean) / std
# Use when: Features are roughly bell-shaped, or the algorithm is sensitive
# to feature scale (linear models, SVMs, k-NN, PCA)
scaler = StandardScaler()
df['age_scaled'] = scaler.fit_transform(df[['age']])
# Result: mean=0, std=1

# Technique 2: Min-Max Scaling
# Formula: x_scaled = (x - min) / (max - min)
# Use when: Need features in fixed range [0, 1]
minmax = MinMaxScaler()
df['total_spent_scaled'] = minmax.fit_transform(df[['total_spent']])
# Result: range [0, 1]

# Technique 3: Robust Scaling (handles outliers better)
# Use when: Data has outliers, don't want to remove them
robust = RobustScaler()
df['age_robust'] = robust.fit_transform(df[['age']])
# Uses median and IQR instead of mean and std
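
The cell above fits each scaler on the whole DataFrame. When the data is later split for evaluation, the scaler is usually fit on the training portion only and then applied to the test portion, so that test statistics do not influence preprocessing. A minimal sketch, assuming a synthetic `age` column made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption: stands in for the real customer table)
rng = np.random.default_rng(0)
toy = pd.DataFrame({'age': rng.integers(18, 80, size=200)})

train, test = train_test_split(toy, test_size=0.25, random_state=0)
train, test = train.copy(), test.copy()

scaler = StandardScaler()
# Learn mean and std from the training rows only
train['age_scaled'] = scaler.fit_transform(train[['age']]).ravel()
# Reuse the training statistics for the test rows
test['age_scaled'] = scaler.transform(test[['age']]).ravel()

print(train['age_scaled'].mean(), train['age_scaled'].std(ddof=0))
print(test['age_scaled'].mean(), test['age_scaled'].std(ddof=0))
```

The training column has mean 0 and standard deviation 1 up to floating-point error; the test column will generally be close to, but not exactly, those values.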


**3. Creating Time-Based Features**


In [None]:
import numpy as np
import pandas as pd

# Extract date features
df['account_creation_date'] = pd.to_datetime(df['account_creation_date'])
df['last_transaction_date'] = pd.to_datetime(df['last_transaction_date'])

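# Temporal features
# Note: pd.Timestamp.now() changes on every run; use a fixed reference date
# if the features must be reproducible.
df['account_age_days'] = (pd.Timestamp.now() - df['account_creation_date']).dt.days
df['days_since_last_transaction'] = (pd.Timestamp.now() - df['last_transaction_date']).dt.days

# Calendar features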
df['account_creation_month'] = df['account_creation_date'].dt.month
df['account_creation_quarter'] = df['account_creation_date'].dt.quarter
df['account_creation_dayofweek'] = df['account_creation_date'].dt.dayofweek

# Cyclical encoding (for month: 1-12 should wrap around)
# Use sine/cosine transformation to maintain cyclical nature
df['month_sin'] = np.sin(2 * np.pi * df['account_creation_month']/12)
df['month_cos'] = np.cos(2 * np.pi * df['account_creation_month']/12)
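
To see why the sine/cosine pair preserves the wrap-around, it can help to compare distances between months in the encoded space. The sketch below is illustrative only; the helper `encoded_distance` is a hypothetical name introduced here.

```python
import numpy as np
import pandas as pd

months = pd.DataFrame({'month': np.arange(1, 13)})
angle = 2 * np.pi * months['month'] / 12
months['month_sin'] = np.sin(angle)
months['month_cos'] = np.cos(angle)


def encoded_distance(m1, m2):
    """Hypothetical helper: Euclidean distance between two months in (sin, cos) space."""
    a = months.loc[months['month'] == m1, ['month_sin', 'month_cos']].to_numpy()[0]
    b = months.loc[months['month'] == m2, ['month_sin', 'month_cos']].to_numpy()[0]
    return float(np.linalg.norm(a - b))


dec_jan = encoded_distance(12, 1)   # adjacent across the year boundary
jun_jul = encoded_distance(6, 7)    # adjacent mid-year
dec_jun = encoded_distance(12, 6)   # opposite sides of the year
print(dec_jan, jun_jul, dec_jun)
```

December to January is as close as June to July, while months six apart sit at the maximum distance of 2. With the raw month number, December (12) and January (1) would instead look 11 units apart.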


**4. Creating Aggregation Features**


In [None]:
# Customer-level aggregations (RFM: Recency, Frequency, Monetary)

# Recency: Days since last purchase
df['recency'] = (df['observation_date'] - df['last_purchase_date']).dt.days

# Frequency: Number of purchases
df['frequency'] = df.groupby('customer_id')['transaction_id'].transform('count')

# Monetary: Total amount spent
df['monetary'] = df.groupby('customer_id')['transaction_amount'].transform('sum')

# Additional aggregations
df['avg_transaction_amount'] = df.groupby('customer_id')['transaction_amount'].transform('mean')
# std is NaN for customers with a single transaction
df['std_transaction_amount'] = df.groupby('customer_id')['transaction_amount'].transform('std')
df['max_transaction_amount'] = df.groupby('customer_id')['transaction_amount'].transform('max')
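
The `transform` calls above keep one row per transaction and repeat the customer's aggregate on each row. If the model is trained at customer level, a one-row-per-customer table may be more convenient. The sketch below uses `groupby(...).agg(...)` on made-up data; the `transaction_date` column and the fixed `observation_date` are assumptions for illustration.

```python
import pandas as pd

# Toy transaction log (values are made up)
transactions = pd.DataFrame({
    'customer_id': [1, 1, 1, 2, 2, 3],
    'transaction_id': [101, 102, 103, 104, 105, 106],
    'transaction_amount': [20.0, 35.0, 50.0, 10.0, 15.0, 100.0],
    'transaction_date': pd.to_datetime([
        '2024-01-05', '2024-02-10', '2024-03-20',
        '2024-01-15', '2024-03-01', '2024-03-31',
    ]),
})
observation_date = pd.Timestamp('2024-04-01')  # fixed reference date

# One row per customer using named aggregations
rfm = transactions.groupby('customer_id').agg(
    last_purchase_date=('transaction_date', 'max'),
    frequency=('transaction_id', 'count'),
    monetary=('transaction_amount', 'sum'),
    avg_transaction_amount=('transaction_amount', 'mean'),
).reset_index()

# Recency: days between the reference date and the last purchase
rfm['recency'] = (observation_date - rfm['last_purchase_date']).dt.days
print(rfm)
```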


**5. Creating Interaction Features**


In [None]:
# Interaction between two features (useful when the effect of one depends on the other)
import numpy as np

# Ratio features (zero denominators become NaN instead of producing inf)
df['debt_to_income_ratio'] = df['total_debt'] / df['annual_income'].replace(0, np.nan)
df['avg_transaction_to_total'] = df['avg_transaction_amount'] / df['total_spent'].replace(0, np.nan)

# Product features
df['age_by_account_age'] = df['age'] * df['account_age_days']

# Polynomial features
df['age_squared'] = df['age'] ** 2

# Log transformation (compresses right-skewed values; log1p handles zeros)
df['total_spent_log'] = np.log1p(df['total_spent'])


**6. Feature Selection**

### Why Select Features?

- Reduce dimensionality (fewer features = faster training)
- Remove noise (irrelevant features can confuse the model)
- Improve interpretability (easier to explain)
- Reduce overfitting


In [None]:
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier

# Method 1: Statistical Tests (Fast)
# SelectKBest keeps the k features with the highest scores;
# f_classif scores each feature with an ANOVA F-test against the target
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)
selected_features = X.columns[selector.get_support()]
print("Selected features:", selected_features)

# Method 2: Mutual Information (captures non-linear relationships)
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_selected = selector.fit_transform(X, y)

# Method 3: Feature Importance from Tree Models
# Convenient in practice, but impurity-based importances tend to favor
# high-cardinality features, so treat the ranking as a guide
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

print(feature_importance)
# Keep top 15 features (example)
top_features = feature_importance.head(15)['feature'].tolist()

# Method 4: Recursive Feature Elimination (Wrapper Method)
# Iteratively removes least important features
from sklearn.feature_selection import RFE
rfe = RFE(estimator=RandomForestClassifier(random_state=42), n_features_to_select=15)
X_rfe = rfe.fit_transform(X, y)
selected_rfe = X.columns[rfe.support_]

# Method 5: Correlation-based (Remove multicollinearity)
corr_matrix = X.corr().abs()
upper_triangle = corr_matrix.where(
    np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
)
# Find features with correlation > 0.9
drop_features = [column for column in upper_triangle.columns 
                 if any(upper_triangle[column] > 0.9)]
X_clean = X.drop(columns=drop_features)
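
The methods above fit the selector on all of `X` and `y`. If the selected features are then evaluated with cross-validation, the selection step has already seen the validation rows. One way to avoid this is to place the selector inside a `Pipeline` so it is refit on each training fold. A minimal sketch on synthetic data from `make_classification` (the dataset and step names are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data: 30 features, of which 5 are informative
X_toy, y_toy = make_classification(
    n_samples=300, n_features=30, n_informative=5, random_state=42
)

selection_pipeline = Pipeline(steps=[
    ('select', SelectKBest(score_func=f_classif, k=10)),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# The selector is refit inside every fold, so validation rows never influence it
scores = cross_val_score(selection_pipeline, X_toy, y_toy, cv=5)
print("Fold accuracies:", scores)

# Fit once on all data to inspect which features were kept
selection_pipeline.fit(X_toy, y_toy)
n_kept = int(selection_pipeline.named_steps['select'].get_support().sum())
print("Features kept:", n_kept)
```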


### Complete Feature Engineering Pipeline


In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

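def create_feature_engineering_pipeline():
    """
    Build a ColumnTransformer that scales the numeric columns and
    one-hot encodes the categorical columns. Columns not listed
    below are dropped (ColumnTransformer's default remainder='drop').
    """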
    # Define columns
    numeric_features = ['age', 'total_spent', 'transaction_count', 
                       'account_age_days']
    categorical_features = ['account_type', 'country']
    
    # Numeric transformer: Scale
    numeric_transformer = Pipeline(steps=[
        ('scaler', StandardScaler())
    ])
    
    # Categorical transformer: One-hot encode
    categorical_transformer = Pipeline(steps=[
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])
    
    # Combine
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)
        ]
    )
    
    return preprocessor

# Use in model training
# (X_train and y_train are assumed to come from an earlier train/test split)
from sklearn.linear_model import LogisticRegression

preprocessor = create_feature_engineering_pipeline()

full_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000))
])

full_pipeline.fit(X_train, y_train)


### Tools Used in Feature Engineering

| Tool | Purpose |
|------|---------|
| Pandas | Feature creation, manipulation |
| Scikit-learn | Encoding, scaling, feature selection, pipelines |
| Featuretools | Automated feature engineering |
| Category Encoders | Advanced categorical encoding |
| Apache Spark | Large-scale feature engineering |

---
